SBCCI 2019 TOC

Full Citation in the ACM Digital Library

PHICC: an error correction code for memory devices

Philippe Magalhães
Otávio Alcântara
Jarbas Silveira

With the evolution of technology in the microelectronics field, integrated circuits (ICs) have been developed with decreasing dimensions. Despite the advances provided by the scale reduction, the occurrence of Multiple Cell Upsets (MCUs) caused by interferences such as ionizing radiation, has become increasingly common. Error Correction Codes (ECCs) are capable of augmenting fault tolerance of computer systems, however, there must be balance between error correction effectiveness and silicon implementation costs. The purpose of this article is to present the Parity Hamming Interleaved Correction Code (PHICC), which consists of a code capable of correcting multiple transient errors in memory cells, with low implementation cost. The validation of the PHICC was performed through a comparative analysis of correction effectiveness, implementation costs, reliability and Mean Time to Failure (MTTF) with others ECCs. The results show that PHICC can maintain the reliability system for longer time, which makes it a strong candidate for use in critical applications.

Lightweight security mechanisms for MPSoCs

Anderson Camargo Sant’Ana
Henrique Martins Medina
Kevin Boucinha Fiorentin
Fernando Gehm Moraes

Computational systems tend to adopt parallel architectures, by using multiprocessor systems-on-chip (MPSoCs). MPSoCs are vulnerable to software and hardware attacks, as infected applications and Hardware Trojans respectively. These attacks may have the purpose to gain access to sensitive data, interrupt a given application or even damage the system physically. The literature presents countermeasures using dedicated routing algorithms, cryptography, firewalls and secure zones. These approaches present a significant hardware cost (firewalls, cryptography) or are too restrictive regarding the use of MPSoC resources (secure zones). The goal of this paper is to present lightweight security mechanisms for MPSoCs, using four techniques: spatial isolation of applications; dedicated network to send sensitive data; traffic blocking filter; lightweight cryptography. These mechanisms protect the MPSoC against the most common software attacks, as Denial of Service (DoS) and spoofing (man-in-the-middle), and ensures confidentiality and integrity to applications. Results present low area and latency overhead, as well as the effectiveness of using the mechanisms to block malicious traffic.

Exploiting approximate computing for low-cost fault tolerant architectures

Gennaro S. Rodrigues
Juan Fonseca
Fabio Benevenuti
Fernanda Kastensmidt
Alberto Bosio

This work investigates how the approximate computing paradigm can be exploited to provide low-cost fault tolerant architectures. In particular, we focus on the implementation of Approximate Triple Modular Redundancy (ATMR) designs using the precision reduction technique. The proposed method is applied to two benchmarks and a multitude of ATMR designs with different degrees of approximation. The benchmarks are implemented on a Xilinx Zynq-7000 APSoC FPGA through high-level synthesis and evaluated concerning area usage and the inaccuracy caused by approximation. Fault injection experiments are performed by flipping bits of the FPGA configuration bitstream. Results show that the proposed approximation method can decrease the DSP usage of the hardware implementation up to 80% and the number of sensitive configuration bits up to 75% while maintaining an accuracy of more than 99.96%.

Fine-grain temperature monitoring for many-core systems

Alzemiro Lucas da Silva
André Luís del Mestre Martins
Fernando Gehm Moraes

The power density may limit the amount of energy a many-core system can consume. A many-core at its maximum performance may lead to safe temperature violations and, consequently, result in reliability issues. Dynamic Thermal Management (DTM) techniques have been proposed to guarantee that many-core systems run at good performance without compromising reliability. DTM techniques rely on accurate temperature information and estimation, which is a computationally complex problem. However, related works usually abstract the temperature monitoring complexity, assuming available temperature sensors. An issue related to temperature sensors is their granularity, frequently measuring the temperature of a large system area instead of a processing element (PE) area. Therefore, the first goal of this work is to propose a fine-grain (PE level) temperature monitoring for many-core systems. The second one is to present a dedicated hardware accelerator to estimate the system temperature. Results show that software performance can be a limiting factor when applying an accurate model to provide temperature estimation for system management. On the other side, the hardware accelerator connected to the many-core enables the fine-grain temperature estimation at runtime without sacrificing system performance.

An adaptive discrete particle swarm optimization for mapping real-time applications onto network-on-a-chip based MPSoCs

Jessé Barreto de Barros
Renato Coral Sampaio
Carlos Humberto Llanos

This paper presents a modified version of the well-known Particle Swarm Optimization (PSO) algorithm as an alternative for the single-objective Genetic Algorithm (GA) that is currently the state-of-the-art method to map real-time applications tasks onto Multiple Processors System-on-a-Chip (MPSoC) using preemptive capable wormhole-based Network-on-a-Chip (NoC) as their communication architecture. A statistical study based on an experimental setup has been performed to compare the GA-based task mapper and the proposed method by using a real-time application as a benchmark, as well as a group of randomly generated ones. Preliminary results have shown that our method is capable of achieving quicker convergence than the GA-based method, and it even produces better results when the application utilization is smaller than the available processing capacity, i.e., a fully schedulable mapping solution exists.

Exploring Tabu search based algorithms for mapping and placement in NoC-based reconfigurable systems

Guilherme A. Silva Novaes
Luiz Carlos Moreira
Wang Jiang Chau

Nowadays, the development of systems based on Networks-on-Chip (NoCs) brings big challenges to the designers due to problems of scalability, such as efficient Mapping and Placement, which are NP-hard problems. Several solutions have been proposed to solve this type of problem that is a variation of Quadratic Assignment Problems (QAP), being Tabu Search (TS) algorithms the ones showing most promising results. In NoC-based dynamically reconfigurable systems (NoC-DRSs), both mapping and placement problems present several layers of complexity due the reconfigurable scenarios. A previous work has adopted TS algorithm variations, but the best solution is not achieved with the wished high frequency. This paper introduces the original Forced Inversion (FI) Heuristic over Tabu Search algorithms for 2D-Mesh FPGA NoC-DRSs, in order to avoid local minima. Results with a series of benchmarks are presented and the performances of different approaches are quantitatively and qualitatively compared.

Performance evaluation of HEVC RCL applications mapped onto NoC-based embedded platforms

Wagner Penny
Daniel Palomino
Marcelo Porto
Bruno Zatt
Leandro Indrusiak

Today, several applications running into embedded systems have to fulfill soft or hard timing constraints. Video applications, like the modern High Efficiency Video Coding (HEVC), e.g., most often have soft real-time constraints. However, in specific scenarios, such as in robotic surgeries, the coupling of satellites and so on, harder timing constraints arise, becoming a huge challenge. Although the implementation of such applications in Networks-on-Chip (NoCs) being an alternative to reduce their algorithmic complexity and meet real-time constraints, a performance evaluation of the mapped NoC and the schedulability analysis for a given application are mandatory. In this work we make a performance evaluation of HEVC Residual Coding Loop (RCL) mapped onto a NoC-based embedded platform, considering the encoding of a single 1920×1080 pixels frame. A set of analysis exploring the combination of different NoC sizes and task mapping strategies were performed, showing for the typical and upper-bound workload cases scenarios when the application is schedulable and meets the real-time constraints.

An FPGA-based evaluation platform for energy harvesting embedded systems

Roberto Paulo Dias Alcantara Filho
Otavio Alcantara de Lima Junior
Corneli Gomes Furtado Junior

Extreme low-power embedded systems are essential in Smart Cities and the Internet of Things, once these systems are responsible for acquiring, processing, and transmitting valuable environmental data. Some of these systems should run for a very long time without any human intervention, even for batteries replacement. Energy harvesting technologies allow embedded systems to be powered up from the environment by converting surrounding energy sources into electrical energy. However, energy-harvesting embedded systems (EHES) heavily depends on the nature of the energy sources, which are mostly uncontrollable and unpredictable. To improve the evaluation of energy management techniques in EHES, we propose the emulation of I-V curves of low-power energy harvesting transducers. An FPGA-based platform controls the energy source emulation combined with an integrated logic analyzer, which allows real-time data gathering from the EHES in multiple evaluation scenarios. The experiments show that the platform replicates solar energy scenarios with only 0.56% mean error.

A comparison of two embedded systems to detect electrical disturbances using decision tree algorithm

Reneilson Santos
Edward David Moreno
Carlos Estombelo-Montesco

The Electrical Power Quality (EPQ) is a relevant subject in the academic area because of its importance on real-world problems. The anomalies on an electrical network can cause strong losses in equipment and data. In this context, much effort has been made by many types of research approaches to get solutions for this kind of problem, seeking for better accuracy on the classification of the anomalies, or building a system to detect them. This paper, therefore, aims to compare two systems built to classify electrical disturbances even in noised environments. For this purpose, it was used a microprocessor system (Raspberry Pi3) and a micro-controller system (NodeMCU Amica), analyzing their time to classify the input signal. The microprocessor achieves better results (45.50ms against 267.10ms), with an accuracy of 97.96% in an ideal environment and 76.79% in a noisy environment (20dB of SNR) for both systems.

FPGA hardware linear regression implementation using fixed-point arithmetic

Willian de Assis Pedrobon Ferreira
Ian Grout
Alexandre César Rodrigues da Silva

In this paper, a hardware design based on the field programmable gate array (FPGA) to implement a linear regression algorithm is presented. The arithmetic operations were optimized by applying a fixed-point number representation for all hardware based computations. A floating-point number training data point was initially created and stored in a personal computer (PC) which was then converted to fixed-point representation and transmitted to the FPGA via a serial communication link. With the proposed VHDL design description synthesized and implemented within the FPGA, the custom hardware architecture performs the linear regression algorithm based on matrix algebra considering a fixed size training data point set. To validate the hardware fixed-point arithmetic operations, the same algorithm was implemented in the Python language and the results of the two computation approaches were compared. The power consumption of the proposed embedded FPGA system was estimated to be 136.82 mW.

New insight for next generation SRAM: tunnel FET versus FinFET for different topologies

Adriana Arevalo
Romain Liautard
Daniel Romero
Lionel Trojman
Luis-Miguel Procel

The purpose of this work is to point out the main differences between a Static Random-Access Memory (SRAM) cells implemented by using Tunnel FET (TFET) and FinFET technologies. We have compared the behavior of SRAM cells implemented in both technologies cells for a supply voltage range from 0.4V to 1.2V. Furthermore, for our study, we have chosen different SRAM cell topologies, such as 6T, 8T, 9T and 10T. Therefore, we have simulated all of these topologies for both technologies and extracted the Static Noise Margins (SNM) for the reading and writing processes. In addition, we have determined the power consumption in order to find the best trade-off between stability and power. By analyzing these results, we have determined the best topology for each technology. Finally, we have compared these best topologies for each technology in order to perform a study of advantages and shortcomings. Our results show more advantages using TFET technology instead of FinFET one.

DNAr-logic: a constructive DNA logic circuit design library in R language for molecular computing

Renan A. Marks
Daniel K. S. Vieira
Marcos V. Guterres
Poliana A. C. Oliveira
Omar P. Vilela Neto

This paper describes the DNAr-Logic: an implementation of a software package in R language that provides ease of use and scalability of the design process of digital logic circuits in molecular computing, more specifically, DNA computing. These devices may be used in-vitro, in-vivo, or even replace the CMOS technology in some applications. Using a technique known as DNA strand displacement reaction (DSD) in conjunction with chemical reaction networks (CRN’s), DNA strands can be used as “wet” hardware to construct molecular logic circuits analogous to electronic digital projects. The circuits designed using the DNAr-Logic can be created in a constructive manner and simulated without requiring knowledge of chemistry or DSD mechanism. The package implements all the main logic gates. We describe the design of a majority gate (also available in the package) and a full-adder circuit that only uses this port. We describe the results and simulation of our design.

Finding optimal qubit permutations for IBM’s quantum computer architectures

Alexandre A. A. de Almeida
Gerhard W. Dueck
Alexandre C. R. da Silva

IBM offers quantum processors for Clifford+T circuits. The only restriction is that not all CNOT gates are implemented and must be substituted with alternate sequences of gates. Each CNOT has its own mapping with a respective cost. However, by permuting the qubits, the number of CNOT that need mappings can be reduced. The problem is to find a good permutation without an exhaustive search. In this paper we propose a solution for this problem. The permutation problem is formulated as an Integer Linear Programming (ILP) problem. Solving the ILP problem, the lowest cost permutation for the CNOT mappings is guaranteed. To test and validated the proposed formulation, quantum architectures with 5 and 16 qubits were used. The ILP formulation along with mapping techniques found circuits with up to 64% fewer gates than other approaches.

Hardware implementation of a shape recognition algorithm based on invariant moments

Clement Raffaitin
Juan-Sebastian Romero
Juan-Sebastian Romero
Luis-Miguel Procel

The present work shows the description of a simple fast shape detection algorithm and its implementation in hardware in a FPGA system. The detection algorithm is based on the concepts of Hu’s moments which are invariant to similarity transformations. The recognition algorithm is implemented by using a non-local means filter. The algorithm is implemented on a FPGA system by using a hardware description language. We present the different design stages of the algorithm implementation which is based on the finite state machine technique. This algorithm is able to recognize a target shape over a test image. Furthermore, this work, describes the advantages of the implementation in hardware, such as speed and parallelism in signal processing. Finally, we show some results of the implementation of this algorithm.

A custom processor for an FPGA-based platform for automatic license plate recognition

Guilherme A. M. Sborz
Guilherme A. Pohl
Felipe Viel
Cesar A. Zeferino

Automatic License Plate Recognition (ALPR) systems are used to identify a vehicle from an image that contains its plate. These systems have applications in a wide range of areas, such as toll payment, border control, and traffic surveillance. ALPR systems demand high computational power, especially for real-time applications. In this context, this paper describes the development of a custom processor designed to accelerate part of the processing of an FPGA-based ALPR system. This processor reduces the latency for computing the most expensive function of the ALPR system in almost 23 times, thus reducing the time necessary for detection of a vehicle plate.

Hardware design of DC/CFL intra-prediction decoder for the AV1 codec

Jones Goebel
Bruno Zatt
Luciano Agostini
Marcelo Porto

This paper presents a dedicated hardware design for the DC and Chroma from Luma (CFL) intra-prediction modes of AV1 decoder. The hardware was designed to reach real-time when processing UHD 4K videos. The AV1 codec is an open-source and royalties-free video coding, which was developed by the AOMedia group, this group is composed of multiple companies like Google, Netflix, AMD, ARM, Intel, Nvidia, Microsoft, Mozilla and others. The proposed solution can support all 19 block sizes allowed in AV1 encoder, being able to process UHD 4K videos at 60 frames per second. The DC/CFL modules were synthesized to the TSMC 40 nm cells library targeting the frequency of 132.1 MHz. Synthesis results show the proposed hardware used 89.39 Kgates and a power dissipation of 7.96mW.

Approximate interpolation filters for the fractional motion estimation in HEVC encoders and their VLSI design

Rafael da Silva
Ícaro Siqueira
Mateus Grellert

Motion Estimation (ME) is one of the most complex HEVC steps, consuming more than 60% of the average encoding time, most of which is spent on its fractional part (Fractional Motion Estimation – FME), in which sub-pixel samples are interpolated and searched over to find motion vectors with higher precision. This paper presents hardware designs for the sub-pixel interpolation unit of the FME step. The designs employ approximate computing techniques by reducing the number of taps in each filter to reduce memory access and hardware cost. The approximate filters were implemented in the HEVC reference software to assess their impact on coding performance. A complete interpolation architecture was implemented in VHDL and synthesized with different filter precision and input sizes in order to assess the effect of these parameters on hardware area and performance. The approximate designs reduce the number of adders/subtractors by up to 67.65% and memory bandwidth by up to 75% with a tolerable loss in coding performance (less than 1% using the Bjontegaard Delta bitrate metric). When synthesized to an FPGA device, 52.9% less logic elements are required with a modest increase in frequency.

An SVM-based hardware accelerator for onboard classification of hyperspectral images

Lucas A. Martins
Guilherme A. M. Sborz
Felipe Viel
Cesar A. Zeferino

Hyperspectral images (HSIs) have been used in civil and military scenarios for ground recognition, urban development management, rare minerals identification, and diverse other purposes. However, HSIs have a significant volume of information and require high computational power, especially for real-time processing in embedded applications, as in onboard computers in satellites. These issues have driven the development of hardware-based solutions able to provide the processing power necessary to meet such requirements. In this paper, we present a hardware accelerator to enhance the performance of one of the most computational expensive stages of HSI processing: the classification. We have employed the Entropy Multiple Correlation Ratio procedure to select the spectral bands to be used in the training process. For the classification step, we have applied a Support Vector Machine classifier with a Hamming Distance decision approach. The proposed custom processor was implemented in FPGA and compared with high-level implementations. The results obtained demonstrate that the processor has a silicon cost lower than similar solutions and can perform a real-time pixel classification in 0.1 ms and achieves a state-of-the-art accuracy of 99.7%.

A sub-1mA highly linear inductorless wideband LNA with low IP3 sensitivity to variability for IoT applications

Arthur Liraneto Torres Costa
Hamilton Klimach
Sergio Bampi

This paper proposes a wideband 0.4-2 GHz cascode common-gate LNA that can be used as a building block for a noise canceling topology (which entails its noise to be canceled at the output node). The design strategy is to set the operating point by analyzing the third order coefficient (α₃) of the output current and the output voltage, which is designed using a load composed by a diode-connected PMOS transistor and a resistor in parallel. This operating point allows a reasonable V_GS spread, maintaining a high IIP3 which implies a low IIP3 sensitivity to process variability. The design strategy also achieves a current consumption under 1 mA and, depending on the technology node V_DD (CMOS 130 nm in this case), it can consume under 1 mW of power. This makes the wideband LNA suitable for IoT applications. Monte Carlo simulations have been carried out to demonstrate the operating region sensitivity to variability and achieves a result of worst case IIP3_μ = +0.2 dBm with σ = 0.8 dBm (@2GHz) up to a nominal 2.75 dBm @900 MHz, S₁₁ < -23 dB, NF < 5.5 dB (canceled by virtue of its topology), a voltage gain of 11.6-14.6 dB (S₂₁ = 6.4-9.4 dB with a buffer to 50 Ω), and consuming just 1.19 mW from a 1.2 V supply.

Comparison between direct and indirect learnings for the digital pre-distortion of concurrent dual-band power amplifiers

Luis Schuartz
Artur T. Hara
André A. Mariano
Bernardo Leite
Eduardo G. Lima

Current radio-communication systems demand high linearity and high efficiency. The digital baseband pre-distorter (DPD) is a cost-effective solution to guarantee the required linearity without compromising the efficiency. In the design of a DPD for a single band power amplifier (PA), the position of the inverse system is exchanged during the identification procedure to avoid the necessity of a PA model within a cumbersome closed-loop process. However, in a practical environment where only an approximation to the inverse is achieved, the linearization capability is affected by shifting the post-inverse placed after the PA to a pre-inverse located before the PA. In DPD intended for concurrent dual-band PAs, an additional advantage of such approach is that the post-inverse identifications for each band are completely independent of each other. This work performs a comparative analysis between two learning architectures applied to the linearization of two concurrent dual-band PAs stimulated by 2.4 GHz Wi-Fi and 3.5 GHz LTE signals. For the first PA, an exact PA model is known and the replacement of a post-inverse to a pre-inverse produces only negligible degradation in linearity. For the second PA, only an approximate PA model is available and the accuracy of such PA model produces a major impact on the linearization capability than the shifting of the inverse.

Interactive evolutionary approach to reduce the optimization cycle time of a low noise amplifier

Rodrigo A. L. Moreto
Douglas Rocha
Carlos E. Thomaz
André Mariano
Salvador P. Gimenez

Nowadays, wireless communications at frequencies of gigahertz have an increasing demand due to the ever-increasing number of electronic devices that uses this type of communication. They are implemented by Radio Frequency (RF) circuits. However, the design of RF circuits is difficult, time-consuming and based on designer knowledge and experience. This work proposes an interactive evolutionary approach using the genetic algorithm, which is implemented in the in-house iMTGSPICE optimization tool, to perform the optimization process of a robust (corner and Monte Carlo analyses) Ultra Low-Power Low Noise Amplifier (LNA) dedicated to Wireless Sensor Networks (WSN), which is implemented in a 130 nm Bulk CMOS technology. We performed two experimental studies to optimize the LNA. The first one used the interactive approach of iMTGSPICE, which was monitored and assisted by a beginner designer during the optimization process. The second one used the conventional approach of iMTGSPICE (non-interactive), which was not assisted by a designer during the optimization process. The obtained results demonstrated that the interactive approach of iMTGSPICE performed the optimization process of the robust LNA around 94% faster (in approximately 20 minutes only) than the non-interactive evolutionary approach (in approximately 6 hours).

An innovative strategy to reduce die area of robust OTA by using iMTGSPICE and diamond layout style for MOSFETs

José Roberto Banin Júnior
Rodrigo Alves de Lima Moreto
Gabriel Augusto da Silva
Carlos Eduardo Thomaz
Salvador Pinillos Gimenez

This paper describes a pioneering design and optimization methodology that provides a remarkable die area reduction of robust analog Complementary Metal-Oxide-Semiconductor (CMOS) Integrated Circuits (ICs) by using a computational tool based on artificial intelligence (iMTGSPICE) and the Diamond layout style for MOSFETs. The validation of this innovative optimization strategy for analog CMOS ICs was made for an Operational Transconductance Amplifiers (OTA) by using 180 nm CMOS ICs technology. The main finding of this work reports a remarkable reduction of the total die area of a robust OTA around 30%, regarding the use of Diamond MOSFETs with alfa angles of 45° when compared to the one implemented with standard rectangular MOSFETs.

NMLSim 2.0: a robust CAD and simulation tool for in-plane nanomagnetic logic based on the LLG equation

Lucas A. Lascasas Freitas
Omar P. Vilela Neto
João G. Nizer Rahmeier
Luiz G. C. Melo

Nanomagnetic Logic (NML) is a new technology based on the magnetization of nanometric magnets. Logic operations are performed via dipolar coupling through ferromagnetic and antiferromagnetic interactions. The low energy dissipation and the possibility of higher integration density in circuits are significant advantages over CMOS technology. Even so, there is a great need for simulation and CAD tools for the proper study of large NML circuits. This paper presents a high-efficiency tool that uses the Landau-Lifshitz-Gilbert equation to evolve the magnetization of the particles over time in amonodomain approach. The new version of NMLSim comes with flexibility in its code, allowing expansion of the tool with ease and consistency. The results of simulated structures show the reliability of the simulator when compared with the current state of the art Object-Oriented Micromagnetic Framework (OOMMF). It also presents an improvement of up to 716 times in execution time and up to 41 times in memory usage.

Ropper: a placement and routing framework for field-coupled nanotechnologies

Ruan Evangelista Formigoni
Ricardo S. Ferreira
José Augusto M. Nacif

Field-Coupled Nanocomputing technologies are the subject of extensive research to overcome current CMOS limitations. These technologies include nanomagnetic and quantum structures, each with its design and synchronization challenges. In this scenario clocking schemes are used to ensure circuit synchronization and avoid signal disruptions at the cost of some area overhead. Unfortunatelly, a nanocomputing technology is limited to a small subset of clocking schemes due to its number of clocking phases and signal propagation system, thus, leading to complex design challenges when tackling the placement and routing problem resulting in technology dependant solutions. Our work consists on presenting a novel framework developed by our team that solves these design challenges when using distinct schemes, therefore, avoiding the need to design pre-defined routing algorithms for each one. The framework offers a technology independent solution and provides interfaces for the implementation of efficient and scalable placement strategies, moreover, it has full integration with reference state-of-the-art optimization and synthesis tools.

Toward nanometric scale integration: an automatic routing approach for NML circuits

Pedro Arthur R. L. Silva
Omar Paranaíba V. Neto
José Augusto M. Nacif

In recent years, many technologies have been studied to replace or complement CMOS. Some of these emerging technologies are known as Field Coupled Nanocomputing. However, these new technologies introduce the need for developing tools to perform circuit mapping, placement, and routing. Nanomagnetic Logic Circuit (NML) is one of these emergent technologies. It relies on the magnetization of nanomagnets to perform operations through majority logic. In this work, we propose an approach to map a gate-level circuit to an NML layout automatically. We use the Breadth First Search to perform the placement and the A* algorithm to transverse the circuit and build the routes for each node. To evaluate the effectiveness of our approach, we use a series of ISCAS’85 benchmarks. Our results show an area reduction varying from 20% to 60%.

Energy efficient fJ/spike LTS e-Neuron using 55-nm node

Pietro M. Ferreira
Nathan De Carvalho
Geoffroy Klisnick
Aziz Benlarbi-Delai

While CMOS technology is currently reaching its limits in power consumption and circuit density, a challenger is emerging from the analogy between biology and silicon. Hardware-based neural networks may drive a new generation of bio-inspired computers by the urge of a hardware solution for real-time applications. This paper redesigns a previous proposed electronic neuron (e-Neuron) in a higher firing rate to reduce the silicon area and highlight a better energy efficiency trade-off. Besides, an innovative schematic is proposed to state an e-Neuron library based on Izhikevichs model of neural firing patterns. Both e-Neuron circuits are designed using 55 nm technology node. Physical design of transistors in weak inversion are discussed to a minimal leakage. Neural firing pattern behaviors are validated by post-layout simulations, demonstrating the spike frequency adaptation and the rebound spikes due to post-inhibitory effect in LTS e-Neuron. Presented results suggest that the time to rebound spikes is dependent of the excitation current amplitude. Both e-Neurons have presented a fF/spike energy efficiency and a smaller silicon area in comparison to Izhikevichs library propositions in the literature.

CMOS analog four-quadrant multiplier free of voltage reference generators

Antonio José Sobrinho de Sousa
Fabian de Andrade
Hildeloi dos Santos
Gabriele Gonçalves
Maicon Deivid Pereira
Edson Santana
Ana Isabela Cunha

This work presents a CMOS four quadrant analog multiplier architecture for application as the synapse element in analog cellular neural networks. The circuit has voltage-mode inputs and a current-mode output and includes a signal application method that avoids voltage or current reference generators. Simulations have been accomplished for a CMOS 130 nm technology, featuring ±50 mV input voltage range, 60 μW static power and -25 dB maximum THD. The active area is 346 μm².

Amplifier-based MOS analog neural network implementation and weights optimization

Tiago Oliveira Weber
Diogo da Silva Labres
Fabián Leonardo Cabrera

Neural networks are achieving state-of-the-art performance in many applications, from speech recognition to computer vision. A neuron in a multi-layer network needs to multiply each input by its weight, sum the results and perform an activation function. This paper presents a variation of the implementation of an amplifier-based MOS analog neuron capable of performing these tasks and also the optimization of the synaptic weights using in-loop circuit simulations. MOS transistors operating in the triode region are used as variable resistors to convert the input and weight voltage to a proportional input current. To test the analog neuron in full networks, an automatic generator is developed to produce a netlist based on the number of neurons on each layer, inputs and weights. Simulation results using a CMOS 180 nm technology demonstrate the neuron proper transfer function and its functionality while trained in test datasets.

Reduction of neural network circuits by constant and nearly constant signal propagation

Augusto Andre Souza Berndt
Alan Mishchenko
Paulo Francisco Butzen
Andre Inacio Reis

This work focuses on optimizing circuits representing neural networks (NNs) in the form of and-inverter graphs (AIGs). The optimization is done by analyzing the training set of the neural network to find constant bit values at the primary inputs. The constant values are then propagated through the AIG, which results in removing unnecessary nodes. Furthermore, a trade-off between neural network accuracy and its reduction due to constant propagation is investigated by replacing with constants those inputs that are likely to be zero or one. The experimental results show a significant reduction in circuit size with negligible loss in accuracy.

A FPGA parameterizable multi-layer architecture for CNNs

Guilherme Korol
Fernando Gehm Moraes

Advances in hardware platforms boosted the use of Convolutional Neural Networks (CNNs) to solve problems in several fields such as Computer Vision and Natural Language Processing. With the improvements of algorithms involved in learning and inferencing for CNNs, dedicated hardware architectures have been proposed with the goal to speed up the CNNs’ performance. However, the CNNs’ requirements in bandwidth and processing power challenge designers to create architectures fitted for ASICs and FPGAs. Embedded applications targeting IoT (including sensors and actuators), health devices, smartphones, and any other battery-powered device may benefit from CNNs. For that, the CNN design must follow a different path, where the cost function is a small area footprint and reduced power consumption. This paper is a step towards this goal, by proposing an architecture for the main modules of modern CNNs. The proposal uses as case-study the Alexnet CNN, targeting Xilinx FPGA devices. Compared to the literature, results show a reduction up to 9 times in the amount of required DSP modules.

Design of a low power 10-bit 12MS/s asynchronous SAR ADC in 65nm CMOS

Arthur Lombardi Campos
João Navarro
Maximiliam Luppe

During the last decades we have witnessed the performance improvement and the aggressive growth of the complexity of integrated circuits (ICs). The progressive size reduction of transistors in recent technological nodes has allowed IC designers to perform analog tasks in the digital domain, increasing the demand for analog-to-digital converters (ADCs). This work presents the design and implementation of a low power successive approximation register analog-to-digital converter (SAR ADC) in a 65nm CMOS technology, suitable for low power frontend of wireless receivers with a flexible sampling rate up to 12 MS/s. At maximum sampling rate, the post-layout simulated circuit achieved an equivalent number of bits (ENOB) of 9.65 and a power consumption of 151.4μW, leading to a Figure of Merit of 15.8fJ/Conversion-step, inside an area of 0.074mm².

A new algorithm for an incremental sigma-delta converter reconstruction filter

Li Huang
Caroline Lelandais-Perrault
Anthony Kolar
Philippe Benabes

Image sensors dedicated for the applications of the Earth observation require medium-speed and high-resolution analog-to-digital converters (ADCs). For that purpose, an incremental sigma-delta analog-to-digital converter (IΣΔ ADC) has been designed. Post-layout simulations highlighted a degradation in resolution caused by the circuit imperfections. Therefore, a digital correction has been investigated. This paper proposes a new reconstruction filter which takes into account not only the bit values of the modulator output sequence but also the occurrence of certain patterns. This technique has been applied to an incremental sigma-delta analog-to-digital converter in order to correct the conversion errors. Performing with 400 clock periods for each conversion cycle, the theoretical resolution is 15.4 bits. Post-layout simulation shows that a 13.5-bit resolution is obtained by using the classical optimal filter whereas a 14.8-bit resolution is obtained with our reconstruction filter.

Behavioral modeling of a control module for an energy-investing piezoelectric harvester

Tales Luiz Bortolin
André Luiz Aita
João Baptista dos Santos Martins

This work analyzes a piezoelectric energy harvesting system that uses a single inductor and the concept of energy investment. The harvester behavior, with special focus on its control logic module and state machine, is fully described and modeled in Verilog-A. The needed sensors and control variables were also identified and modeled. Simulation results have shown the correct behavioral modeling of the piezoelectric energy harvester system and proposed control, highlighting the harvesting mechanism based on the concept of energy-investment and the effect of the energy invested on the characteristics of the battery charging profile. The speed of the behavioral simulations when compared to electrical ones and the obtained model accuracy, have shown a reliable and prospective higher-level design approach.

An IR-UWB pulse generator using PAM modulation with adaptive PSD in 130nm CMOS process

Luiz Carlos Moreira
José Fontebasso Neto
Walter Silva Oliveira
Thiago Ferauche
Guilherme Heck
Ney Laert Vilar Calazans
Fernando Gehm Moraes

This paper proposes an adaptive pulse generator using Pulse Amplitude Modulation (PAM). The circuit was implemented with eight Pulse Generator Units (PGUs) to produce up to eight monocycles per pulse. The number of monocycles per pulse is inversely proportional to the Power Spectrum Density (PSD) bandwidth in the Impulse Radio Ultra-Wide Band (IR-UWB). The complete circuit contains two pulse generator blocks, each one composed by eight PGUs to build a rectangular waveform at the output. The PGU has been implemented with Edge Combiners High (ECH) and Edge Combiners Low (ECL) to encode the information. Each Edge Combiner has a high impedance circuit that is selected by digital control signals. The circuit has been simulated, showing an output pulse amplitude of ≈70mV for the high logic level and an amplitude of ≈35mV for the low logic level, both at 100 MHz Pulse Repetition Frequency (PRF). This produces a mean pulse duration of ≈270ps, a mean central frequency of ≈3.7GHz and a power consumption less than 0,22μW. The pulse generator block occupies an area of 0.54mm².