ISLPED 2018 TOC

Full Citation in the ACM Digital Library

SESSION: Machine Learning – Inference

Value-driven Synthesis for Neural Network ASICs

Zhiyuan Yang
Ankur Srivastava

In order to enable low power and high performance evaluation of neural network (NN) applications, we investigate new design methodologies for synthesizing neural network ASICs (NN-ASICs). An NN-ASIC takes a trained NN and implements a chip with customized optimization. Knowing the NN topology and weights allows us to develop unique optimization schemes which are not available to regular ASICs. In this work, we investigate two types of value-driven optimized multipliers which exploit the knowledge of synaptic weights and we develop an algorithm to synthesize the multiplication of trained NNs using these special multipliers instead of general ones. The proposed method is evaluated using several Deep Neural Networks. Experimental results demonstrate that compared to traditional NNPs, our proposed NN-ASICs can achieve up to 6.5x and 55x improvement in performance and energy efficiency (i.e. inverse of Energy-Delay-Product), respectively.

CLINK: Compact LSTM Inference Kernel for Energy Efficient Neurofeedback Devices

Zhe Chen
Andrew Howe
Hugh T. Blair
Jason Cong

Neurofeedback device measures brain wave and generates feedback signal in real time and can be employed as treatments for various neurological diseases. Such devices require high energy efficiency because they need to be worn or surgically implanted into patients and support long battery life time. In this paper, we propose CLINK, a compact LSTM inference kernel, to achieve high energy efficient EEG signal processing for neurofeedback devices. The LSTM kernel can approximate conventional filtering functions while saving 84% computational operations. Based on this method, we propose energy efficient customizable circuits for realizing CLINK function. We demonstrated a 128-channel EEG processing engine on Zynq-7030 with 0.8 W, and the scaled up 2048-channel evaluation on Virtex-VU9P shows that our design can achieve 215x and 7.9x energy efficiency compared to highly optimized implementations on E5-2620 CPU and K80 GPU, respectively. We carried out the CLINK design in a 15-nm technology, and synthesis results show that it can achieve 272.8 pJ/inference energy efficiency, which further outperforms our design on the Virtex-VU9P by 99x.

Compact Convolution Mapping on Neuromorphic Hardware using Axonal Delay

Jinseok Kim
Yulhwa Kim
Sungho Kim
Jae-Joon Kim

Mapping Convolutional Neural Network (CNN) to a neuromorphic hardware has been inefficient in synapse memory usage because both kernel/input reuse are not exploited well. We propose a method to enable kernel reuse by utilizing axonal delay, which is a biological parameter for a spiking neuron. Using IBM TrueNorth as a test platform, we demonstrate that the number of cores, neurons, synapses, and synaptic operations per time step can be reduced by up to 20.9x, 27.9x, 88.4x, and 1586x, respectively, compared to the conventional scheme, which raises the possibility of implementing large-scale CNN on neuromorphic hardware.

NNest: Early-Stage Design Space Exploration Tool for Neural Network Inference Accelerators

Liu Ke
Xin He
Xuan Zhang

Deep neural network (DNN) has achieved spectacular success in recent years. In response to DNN’s enormous computation demand and memory footprint, numerous inference accelerators have been proposed. However, the diverse nature of DNNs, both at the algorithm level and the parallelization level, makes it hard to arrive at an “one-size-fits-all” hardware design. In this paper, we develop NNest, an early-stage design space exploration tool that can speedily and accurately estimate the area/performance/energy of DNN inference accelerators based on high-level network topology and architecture traits, without the need for low-level RTL codes. Equipped with a generalized spatial architecture framework, NNest is able to perform fast high-dimensional design space exploration across a wide spectrum of architectural/micro-architectural parameters. Our proposed novel date movement strategies and multi-layer fitting schemes allow NNest to more effectively exploit parallelism inherent in DNN. Results generated by NNest demonstrate: 1) previously-undiscovered accelerator design points that can outperform state-of-the-art implementation by 39.3% in energy efficiency; 2) Pareto frontier curves that comprehensively and quantitatively reveal the multi-objective tradeoffs in custom DNN accelerators; 3) holistic design exploration of different level of quantization techniques including recently-proposed binary neural network (BNN).

SESSION: Hardware Security

Blacklist Core: Machine-Learning Based Dynamic Operating-Performance-Point Blacklisting for Mitigating Power-Management Security Attacks

Sheng Zhang
Adrian Tang
Zhewei Jiang
Simha Sethumadhavan
Mingoo Seok

Most modern computing devices make available fine-grained control of operating frequency and voltage for power management. These interfaces, as demonstrated by recent attacks, open up a new class of software fault injection attacks that compromise security on commodity devices. CLKSCREW, a recently-published attack that stretches the frequency of devices beyond their operational limits to induce faults, is one such attack. Statically and permanently limiting frequency and voltage modulation space, i.e., guard-banding, could mitigate such attacks but it incurs large performance degradation and long testing time. Instead, in this paper, we propose a run-time technique which dynamically blacklists unsafe operating performance points using a neural-net model. The model is first trained offline in the design time and then subsequently adjusted at run-time by inspecting a selected set of features such as power management control registers, timing-error signals, and core temperature. We designed the algorithm and hardware, titled a BlackList (BL) core, which is capable of detecting and mitigating such power management-based security attack at high accuracy. The BL core incurs a reasonably small amount of overhead in power, delay, and area.

Threshold Defined Camouflaged Gates in 65nm Technology for Reverse Engineering Protection

Anirudh S. Iyengar
Deepak Vontela
Ithihasa Reddy
Swaroop Ghosh
Syedhamidreza Motaman
Jae-won Jang

Due to the ever-increasing threat of Reverse Engineering (RE) of Intellectual Property (IP) for malicious gains, camouflaging of logic gates is becoming very important. In this paper, we present experimental demonstration of transistor threshold voltage-defined switch [2] based camouflaged logic gates that can hide six logic functionalities i.e. NAND, AND, NOR, OR, XOR and XNOR. The proposed gates can be used to design the IP, forcing an adversary to perform brute-force guess-and-verify of the underlying functionality—increasing the RE effort. We propose two flavors of camouflaging, one employing only a pass transistor (NMOS-switch) and the other utilizing a full pass transistor (CMOS-switch). The camouflaged gates are used to design Ring-Oscillators (RO) in ST 65nm technology, one for each functionality, on which we have performed temperature, voltage, and process-variation analysis. We observe that CMOS-switch based camouflaged gate offers a higher performance (~1.5-8X better) than NMOS-switch based gate at an added area cost of only 5%. The proposed gates show functionality till 0.65V. We are also able to reclaim lost performance by dynamically changing the switch gate voltage and show that robust operation can be achieved at lower voltage and under temperature fluctuation.

Reliability and Uniformity Enhancement in 8T-SRAM based PUFs operating at NTC

Pramesh Pandey
Asmita Pal
Koushik Chakraborty
Sanghamitra Roy

SRAM-based PUFs (SPUFs) have emerged as promising security primitives for low-power devices. However, operating 8T-SPUFs at Near-Threshold Computing (NTC) realm is plagued by exacerbated process variation (PV) sensitivity which thwarts their reliable operation. In this paper, we demonstrate the massive degradation in the reliability and uniformity characteristics of 8T-SPUF. By exploiting the opportunities bestowed by schematic asymmetry of 8T-SPUF cells, we propose biasing and sizing based design strategies. Our techniques achieve an immense improvement of more than 55% in the percentage of unreliable cells and improves the proximity to ideal uniformity by 82%, over a baseline NTC 8T-SPUF with no enhancement.

Efficient and Secure Group Key Management in IoT using Multistage Interconnected PUF

Hongxiang Gu
Miodrag Potkonjak

Secure group-oriented communication is crucial to a wide range of applications in Internet of Things (IoT). Security problems related to group-oriented communications in IoT-based applications placed in a privacy-sensitive environment have become a major concern along with the development of the technology. Unfortunately, many IoT devices are designed to be portable and light-weight; thus, their functionalities, including security modules, are heavily constrained by the limited energy resources (e.g., battery capacity). To address these problems, we propose a group key management scheme based on a novel physically unclonable function (PUF) design: multistage interconnected PUF (MIPUF) to secure group communications in an energy-constrained environment. Our design is capable of performing key management tasks such as key distribution, key storage and rekeying securely and efficiently. We show that our design is secure against multiple attack methods and our experimental results show that our design saves 47.33% of energy globally comparing to state-of-the-art Elliptic-curve cryptography (ECC)-based key management scheme on average.

SESSION: Energy Efficient Wireline Circuits

An Energy-Efficient High-Swing PAM-4 Voltage-Mode Transmitter

Lejie Lu
Yong Wang
Hui Wu

As the data rate of high-speed I/Os continues to increase, four-level pulse amplitude modulation (PAM-4) is adopted to improve the bandwidth density and link margin at 50 Gb/s and beyond. Compared to non-return-to-zero (NRZ) signaling, however, the PAM-4 eye height is reduced, which calls for larger transmitter swing to maintain signal-to-noise-ratio. A new energy-efficient transmitter is proposed to generate large swing PAM-4 signals with a cascode voltage-mode driver and supporting pre-drivers and logic circuits. By reconfiguring the pull-up and pull-down branches based on the transmit data and steering the bypass currents, the proposed voltage-mode driver significantly reduces power consumption compared to conventional implementation while maintaining impedance matching. Voltage stacking technique is adopted for pre-drivers to further improve energy efficiency. To demonstrate the new transmitter design, a prototype 56 Gb/s PAM-4 transmitter is designed using a generic 28-nm CMOS technology with a 2-V power supply voltage. It achieves a overall output swing of 2 V and a minimum eye height of 490 mV with good linearity (98.7% level separation mismatch ratio). Compared to a conventional voltage-mode transmitter design with the same swing, the static power consumption of the new transmitter is reduced almost by half (from 30 mW to 16 mW), and its overall energy efficiency improves from 0.7 pJ/b to 0.5 pJ/b.

Energy-Efficient Dynamic Comparator with Active Inductor for Receiver of Memory Interfaces

Jae Whan Lee
Joo-Hyung Chae
Jihwan Park
Hyunkyu Park
Jaekwang Yun
Suhwan Kim

In this paper, we propose a dynamic comparator that improved the operation performance of receiver (RX) with the effort to reduce power consumption. It is implemented via double-tail StrongARM latch comparator with an active inductor and efforts are made to minimize power consumption for high-speed resulting in better energy efficiency at the targeted high frequency. In this regard, our comparator is suitable for memory application RX to satisfy both low-power and high-speed. It is applied to the single-ended RX designed with a continuous-time linear equalizer, a clock generator and a quarter-rate 2-tap decision-feedback equalizer which is appropriate for the high-frequency memory application. Compared to the conventional one, our design, fabricated in 55nm CMOS process, provides an improvement of 7% in unit interval (UI) margin under the same power consumption and receives up to 10Gb/s PRBS15 data at BER < 10-12 with 0.4 UI margin and energy efficiency of 0.67pJ/bit.

4-Channel Push-Pull VCSEL Drivers for HDMI Active Optical Cable in 0.18-μm CMOS

Jeongho Hwang
Hong-Seok Choi
Hyungrok Do
Gyu-Seob Jeong
Daehyun Koh
Seong Ho Park
Deog-Kyoon Jeong

The price and power consumption of standard HDMI cables exponentially rise when the data rate increases or cable runs longer. HDMI active optical cable (AOC) can potentially solve price and power issues since fibers are tolerant to loss. However, additional optical components such as vertical-cavity surface-emitting laser (VCSEL) and photodiode (PD) are required. Therefore, drivers and transimpedance amplifiers should be designed carefully for normal operations. In this paper, two types of 4-channel VCSEL drivers for HDMI AOC are presented. The first type of the driver passes data and bias separately. It uses off-chip capacitors for AC coupling. On the other hand, the second type of the driver passes data including DC value without using off-chip capacitors. Structures of the both drivers are based on push-pull current-mode logic (CML) to achieve better power efficiency. Drivers fabricated in 0.18-μm CMOS process consume 36.5 mW/channel at 6 Gb/s and 24.7 mW/channel at 12 Gb/s, respectively.

SESSION: Approximate Computing

RMAC: Runtime Configurable Floating Point Multiplier for Approximate Computing

Mohsen Imani
Ricardo Garcia
Saransh Gupta
Tajana Rosing

Approximate computing is a way to build fast and energy efficient systems, which provides responses of good enough quality tailored for different purposes. In this paper, we propose a novel approximate floating point multiplier which efficiently multiplies two floating numbers and yields a high precision product. RMAC approximates the costly mantissa multiplication to a simple addition between the mantissa of input operands. To tune the level of accuracy, RMAC looks at the first bit of the input mantissas as well as the first N bits of the result of addition to dynamically estimate the maximum multiplication error rate. Then, RMAC decides to either accept the approximate result or re-execute the exact multiplication. Depending on the value of N, the proposed RMAC can be configured to achieve different levels of accuracy. We integrate the proposed RMAC in AMD southern Island GPU, by replacing RMAC with the existing floating point units. We test the efficiency and accuracy of the enhanced GPU on a wide range of applications including multimedia and machine learning applications. Our evaluations show that a GPU enhanced by the proposed RMAC can achieve 5.2x energydelay product improvement as opposed to GPU using conventional FPUs while ensuring less than 2% quality loss. Comparing our approach with other state-of-the-art approximate multipliers shows that RMAC can achieve 3.1x faster and 1.8x more energy efficient computations while providing the same quality of service.

Designing Efficient Imprecise Adders using Multi-bit Approximate Building Blocks

Sarvenaz Tajasob
Morteza Rezaalipour
Masoud Dehyadegari
Mahdi Nazm Bojnordi

Energy-efficiency has become a major concern in designing computer systems. One of the most promising solutions to enhance power and energy-efficiency in error tolerant applications is approximate computing that balances accuracy, area, delay, and power consumption based on the computational needs. By trading accuracy of computation, approximate computing may achieve significant improvements in speed, power, and area consumption.

Adders are important arithmetic units widely used in almost every digital processing system, which contribute to significant amounts of power dissipation. With the emergence of deep learning tasks and fault tolerant big data processing in every aspect of today’s computing, the demand for low-power and energy-efficient approximate adders has increased significantly. Numerous designs have been proposed in the literature that build multi-bit adders using novel approximate full adder circuits. Regrettably, relying on single-bit building blocks only limits the design space of approximate adders and prevents the designers from achieving the most significant benefits of approximate circuits. This paper presents a novel approach to designing imprecise multi-bit adders, based on four novel approximate 2 and 3-bit adder building blocks. The proposed circuits are evaluated and compared with the existing low power adders in terms of various design characteristics, such as area, delay, power, and error tolerance. Our simulation results indicate that the proposed adders achieve more than 60% reduction in power and area consumption compared to the state-of-the-art approximate adders while introducing 12-17% less error in computation.

An Energy-Efficient, Yet Highly-Accurate, Approximate Non-Iterative Divider

Marzieh Vaeztourshizi
Mehdi Kamal
Ali Afzali-Kusha
Massoud Pedram

In1 this paper, we present a highly accurate and energy efficient non-iterative divider, which uses multiplication as its main building block. In this structure, the division operation is performed by first reforming both dividend and divisor inputs, and then multiplying the rounded value of the scaled dividend by the reciprocal of the rounded value of the scaled divisor. Precisely, the interval representing the fractional value of the scaled divisor is partitioned into non-overlapping sub-intervals, and the reciprocal of the scaled divisor is then approximated with a linear function in each of these sub-intervals. The efficacy of the proposed divider structure is assessed by comparing its design parameters and accuracy with state-of-the-art, non-iterative approximate dividers as well as exact dividers in 45nm digital CMOS technology. Circuit simulation results show that the mean absolute relative error of the proposed structure for doing 1 32-bit division is less than 0.2%, while the proposed structure has significantly lower energy consumption than the exact divider. Finally, the effectiveness of the proposed divider in one image processing application is reported and discussed.

SESSION: Architectural Techniques

Aggressive Slack Recycling via Transparent Pipelines

Gokul Subramanian Ravi
Mikko H. Lipasti

In order to operate reliably and produce expected outputs, modern architectures set timing margins conservatively at design time to support extreme variations in workload and environment. Unfortunately, the conservative guard bands set to achieve this reliability create clock cycle slack and are detrimental to performance and energy efficiency. To combat this, we propose Aggressive Slack Recycling via Transparent Pipelines. Our proposal performs timing speculation while allowing data to flow asynchronously via transparent latches, between synchronous boundaries. This allows timing speculation to cater to the average slack across asynchronous operations rather than the slack of the most critical operation – maximizing slack conservation and timing speculation efficiency.

We design a slack tracking mechanism which runs in parallel with the transparent data path to estimate the accumulated slack across operation sequences. The mechanism then appropriately clocks synchronous boundaries early to minimize wasted slack and maximize clock cycle savings. We implement our proposal on a spatial fabric and achieves absolute speedups up to 20% and relative improvements (vs. competing mechanisms) of up to 75%.

Pareto-Optimal Power- and Cache-Aware Task Mapping for Many-Cores with Distributed Shared Last-Level Cache

Martin Rapp
Anuj Pathania
Jörg Henkel

Two factors primarily affect performance of multi-threaded tasks on many-core processors with both shared and physically distributed Last-Level Cache (LLC): the power budget associated with a certain task mapping that aims to guarantee thermally safe operation and the non-uniform LLC access latency of threads running on different cores. Spatially distributing threads across the many-core increases the power budget, but unfortunately also increases the associated LLC latency. On the other side, mapping more threads to cores near the center of the many-core decreases the LLC latency, but unfortunately also decreases the power budget. Consequently, both metrics (LLC latency and power budget) cannot be simultaneously optimal, which leads to a Pareto-optimization that has formerly not been exploited. We are the first to present a run-time task mapping algorithm called PCMap that exploits this trade-off. Our approach results in up to 8.6% reduction in the average task response time accompanied by a reduction of up to 8.5% in the energy consumption compared to the state-of-the-art.

SPONGE: A Scalable Pivot-based On/Off Gating Engine for Reducing Static Power in NoC Routers

Hossein Farrokhbakht
Hadi Mardani Kamali
Natalie Enright Jerger
Shaahin Hessabi

Due to high aggregate idle time of Networks-on-Chip (NoCs) routers in practical applications, power-gating techniques have been proposed to combat the ever-increasing ratio of static power. Nevertheless, the sporadic packet arrivals compromise the effectiveness of power-gating by incurring significant latency and energy overhead. In this paper, we propose a Scalable Pivot-based On/Off Gating Engine (SPONGE) which efficiently manages power-gating decisions and routing mechanism by adaptively selecting a small set of powered-on columns of routers and keeping the others in power-gated state. To this end, a router architecture augmented with a novel routing algorithm is proposed in which a packet can traverse powered-off routers without waking them up, and can only turn in predetermined powered-on routers. Experimental results on SPLASH-2 benchmarks demonstrate that, compared to the conventional power-gating method, SPONGE on average not only improves static power consumption by 81.7%, it also improves average packet latency by 63%.

SESSION: Machine Learning – Training

Taming the beast: Programming Peta-FLOP class Deep Learning Systems

Swagath Venkataramani
Vijayalakshmi Srinivasan
Jungwook Choi
Kailash Gopalakrishnan
Leland Chang

TrainWare: A Memory Optimized Weight Update Architecture for On-Device Convolutional Neural Network Training

Seungkyu Choi
Jaehyeong Sim
Myeonggu Kang
Lee-Sup Kim

Training convolutional neural network on device has become essential where it allows applications to consider user’s individual environment. Meanwhile, the weight update operation from the training process is the primary factor of high energy consumption due to its substantial memory accesses. We propose a dedicated weight update architecture with two key features: (1) a specialized local buffer for the DRAM access deduction (2) a novel dataflow and its suitable processing element array structure for weight gradient computation to optimize the energy consumed by internal memories. Our scheme achieves 14.3%-30.2% total energy reduction by drastically eliminating the memory accesses.

AxTrain: Hardware-Oriented Neural Network Training for Approximate Inference

Xin He
Liu Ke
Wenyan Lu
Guihai Yan
Xuan Zhang

The intrinsic error tolerance of neural network (NN) makes approximate computing a promising technique to improve the energy efficiency of NN inference. Conventional approximate computing focuses on balancing the efficiency-accuracy trade-off for existing pre-trained networks, which can lead to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented training framework to facilitate approximate computing for NN inference. Specifically, AxTrain leverages the synergy between two orthogonal methods—one actively searches for a network parameters distribution with high error tolerance, and the other passively learns resilient weights by numerically incorporating the noise distributions of the approximate hardware in the forward pass during the training phase. Experimental results from various datasets with near-threshold computing and approximation multiplication strategies demonstrate AxTrain’s ability to obtain resilient neural network parameters and system energy efficiency improvement.

Spin Orbit Torque Device based Stochastic Multi-bit Synapses for On-chip STDP Learning

Gyuseong Kang
Yunho Jang
Jongsun Park

As a large number of neurons and synapses are needed in spike neural network (SNN) design, emerging devices have been employed to implement synapses and neurons. In this paper, we present a stochastic multi-bit spin orbit torque (SOT) memory based synapse, where only one SOT device is switched for potentiation and depression using modified Gray code. The modified Gray code based approach needs only N devices to represent 2N levels of synapse weights. Early read termination scheme is also adopted to reduce the power consumption of training process by turning off less associated neurons and its ADCs. For MNIST dataset, with comparable classification accuracy, the proposed SNN architecture using 3-bit synapse achieves 68.7% reduction of ADC overhead compared to the conventional 8-level synapse.

SESSION: Non-volatile Memory

Enabling Intra-Plane Parallel Block Erase in NAND Flash to Alleviate the Impact of Garbage Collection

Tyler Garrett
Jun Yang
Youtao Zhang

Garbage collection (GC) in NAND flash can significantly decrease I/O performance in SSDs by copying valid data to other locations, thus blocking incoming I/O requests. To help improve performance, NAND flash utilizes various advanced commands to increase internal parallelism. Currently, these commands only parallelize operations across channels, chips, dies, and planes, neglecting the block level due to risk of disturbances that can compromise valid data by inducing errors. However, due to the triple-well structure of the NAND flash plane architecture, it is possible to erase multiple blocks within a plane, in parallel, without diminishing the integrity of the valid data. The number of page movements due to multiple block erases can be restrained so as to bound the overhead per GC. Moreover, more capacity can be reclaimed per GC which delays future GCs and effectively reduces their frequency. Such an Intra-Plane Parallel Block Erase (IPPBE) in turn diminishes the impact of GC on incoming requests, improving their response times. Experimental results show that IPPBE can reduce the time spent performing GC by up to 50.7% and 33.6% on average, read/write response time by up to 47.0%/45.4% and 16.5%/14.8% on average respectively, page movements by up to 52.2% and 26.6% on average, and blocks erased by up to 14.2% and 3.6% on average. An energy analysis conducted indicates that by reducing the number of page copies and the number of block erases, the energy cost of garbage collection can be reduced up to 44.1% and 19.3% on average.

Enhancing the Energy Efficiency of Journaling File System via Exploiting Multi-Write Modes on MLC NVRAM

Shuo-Han Chen
Yuan-Hao Chang
Tseng-Yi Chen
Yu-Ming Chang
Pei-Wen Hsiao
Hsin-Wen Wei
Wei-Kuan Shih

Non-volatile random-access memory (NVRAM) is regarded as a great alternative storage medium owing to its attractive features, including low idle energy consumption, byte addressability, and short read/write latency. In addition, multi-level-cell (MLC) NVRAM has also been proposed to provide higher bit density. However, MLC NVRAM has lower energy efficiency and longer write latency when compared with single-level-cell (SLC) NVRAM. These drawbacks could lead to higher energy consumption of MLC NVRAM-based storage systems. The energy consumption is magnified by existing journaling file systems (JFS) on MLC NVRAM-based storage devices due to the JFS’s fail-safe policy of writing the same data twice. Such observations motivate us to propose a multi-write-mode journaling file systems (mwJFS) to alleviate the drawbacks of MLC NVRAM and lower the energy consumption of MLC NVRAM-based JFS. The proposed mwJFS differentiates the data retention requirement of journaled data and applies different write modes to enhance the energy efficiency with better access performance. A series of experiments was conducted to demonstrate the capability of mwJFS on a MLC NVRAM-based storage system.

Computing in memory with FeFETs

Dayane Reis
Michael Niemier
X. Sharon Hu

Data transfer between a processor and memory frequently represents a bottleneck with respect to improving application-level performance. Computing in memory (CiM), where logic and arithmetic operations are performed in memory, could significantly reduce both energy consumption and computational overheads associated with data transfer. Compact, low-power, and fast CiM designs could ultimately lead to improved application-level performance. This paper introduces a CiM architecture based on ferroelectric field effect transistors (FeFETs). The CiM design can serve as a general purpose, random access memory (RAM), and can also perform Boolean operations ((N)AND, (N)OR, X(N)OR, INV) as well as addition (ADD) between words in memory. Unlike existing CiM designs based on other emerging technologies, FeFET-CiM accomplishes the aforementioned operations via a single current reference in the sense amplifier, which leads to more compact designs and lower power. Furthermore, the high Ion/Ioff ratio of FeFETs enables an inexpensive voltage-based sense scheme. Simulation-based case studies suggest that our FeFET-CiM can achieve speed-ups (and energy reduction) of ~119X (~1.6X) and ~1.97X (~1.5X) over ReRAM and STT-RAM CiM designs with respect to in-memory addition of 32-bit words. Furthermore, our approach offers an average speedup of ~2.5X and energy reduction of ~1.7X when compared to a conventional (not in-memory) approach across a wide range of benchmarks.

Information Leakage Attacks on Emerging Non-Volatile Memory and Countermeasures

Mohammad Nasim Imtiaz Khan
Swaroop Ghosh

Emerging Non-Volatile Memories (NVMs) suffer from high and asymmetric read/write current and long write latency which can result in supply noise, such as supply voltage droop and ground bounce. The magnitude of supply noise depends on the old data and the new data that is being written (for a write operation) or on the stored data (for a read operation). Therefore, victim’s write operation creates a supply noise which propagates to adversary’s memory space. The adversary can detect victim’s write initiation and can leverage faster read latency (compared to write) to further sense the Hamming Weight (HW) of the victim’s write data by detecting read failures in his memory space. These attacks are specifically possible if exhaustive testing of the memory for all patterns, all possible location combinations, all possible parallel read/write conditions are not performed under bit-to-bit process variations and specified (-10°C to 90°C) and unspecified temperature ranges (i.e., less than -10°C and greater than 90°C). Simulation result indicates that adversary can sense HW of victim’s (near-by) write data = 66.77%, and further narrow the range based on read/write failure characteristics. Side Channel Attacks can utilize this information to strengthen the attacks.

SESSION: Energy-efficient Parallelism

Load-Triggered Warp Approximation on GPU

Zhenhong Liu
Daniel Wong
Nam Sung Kim

Value similarity of operands across warps have been exploited to improve energy efficiency of GPUs. Prior work, however, incurs significant overheads to check value similarity for every instruction and does not improve performance as it does not reduce the number of executed instructions. This work proposes Lock ‘n Load (LnL) which triggers approximate execution of code regions by only checking similarity of values returned from load instructions and fuses multiple approximated warps into a single warp.

GAS: A Heterogeneous Memory Architecture for Graph Processing

Minxuan Zhou
Mohsen Imani
Saransh Gupta
Tajana Rosing

Graph processing has become important for various applications in today’s big data era. However, most graph processing applications suffer from large memory overhead due to random memory accesses. Such random memory access pattern provides little temporal and spatial locality which cannot be accelerated by the conventional hierarchical memory system. In this work, we propose GAS, a heterogeneous memory architecture, to accelerate graph applications implemented in message-based vertex program model, which is widely used in various graph processing systems. GAS utilizes the specialized content-addressable memory (CAM) to store random data, and determine exact access patterns by a series of associative search. Thus, GAS not only removes the inefficiency of random accesses but also reduces the memory access latency by accurate prefetching. We test the efficiency of GAS with three important graph processing kernels on five well-known graphs. Our experimental results show that GAS can significantly reduce cache miss rate and improve the bandwidth utilization as compared to a conventional system with a state-of-the-art graph-specific prefetching mechanism. These enhancements result in 34% and 27% reduction in energy consumption and execution time, respectively.

ACE-GPU: Tackling Choke Point Induced Performance Bottlenecks in a Near-Threshold Computing GPU

Tahmoures Shabanian
Aatreyi Bal
Prabal Basu
Koushik Chakraborty
Sanghamitra Roy

The proliferation of multicore devices with a strict thermal budget has aided to the research in Near-Threshold Computing (NTC). However, the operation of a Graphics Processing Unit (GPU) at the NTC region has still remained recondite. In this work, we explore an important reliability predicament of NTC, called choke points, that severely throttles the performance of GPUs. Employing a cross-layer methodology, we demonstrate the potency of choke points in inducing timing errors in a GPU, operating at the NTC region. We propose a holistic circuit-architectural solution, that promotes an energy-efficient NTC-GPU design paradigm by gracefully tackling the choke point induced timing errors. Our proposed scheme offers 3.18x and 88.5% improvements in NTC-GPU performance and energy delay product, respectively, over a state-of-the-art timing error mitigation technique, with marginal area and power overheads.

SESSION: Self-powered Devices

HomeRun: HW/SW Co-Design for Program Atomicity on Self-Powered Intermittent Systems

Chih-Kai Kang
Chun-Han Lin
Pi-Cheng Hsiu
Ming-Syan Chen

Self-powered intermittent systems featuring nonvolatile processors (NVPs) allow for accumulative execution in unstable power environments. However, frequent power failures may cause incorrect NVP execution results due to invalid data generated intermittently. This paper presents a HW/SW co-design, called HomeRun, to guarantee atomicity by ensuring that an uninterruptible program section can be run through at one execution. We design a HW module to ensure that a power pulse is sufficient for an atomic section, and develop a SW mechanism for programmers to protect atomic sections. The proposed design is validated through the development of a prototype pattern locking system. Experimental results demonstrate that the proposed design can completely guarantee atomicity and significantly improve the energy utilization of self-powered intermittent systems.

EcoMicro: A Miniature Self-Powered Inertial Sensor Node Based on Bluetooth Low Energy

Cheng-Ting Lee
Yun-Hao Liang
Pai H. Chou
Ali Heydari Gorji
Seyede Mahya Safavi
Wen-Chan Shih
Wen-Tsuen Chen

This paper describes EcoMicro, a miniature, self-powered, wireless inertial-sensing node in the volume of 8 x 13 x 9.5 mm3, including energy storage and solar cells. It is smaller than existing systems with similar functionality while retaining rich functionality and efficiency. It is capable of measuring motion using a inertial measurement unit (IMU) and communication over Bluetooth Low Energy (BLE) protocol. It is self-powered by miniature solar cells and can perform maximum power point tracking (MPPT). Its integrated energy-storage device combines the longevity and power density of supercapacitors with the relatively flat discharge curve of batteries. Our power-ground gating circuit minimizes leakage current during sleep mode and is used in conjunction with the real-time-clock for duty cycling. Experimental results show EcoMicro to be operational and efficient for a class of wireless sensing applications.

Dual Mode Ferroelectric Transistor based Non-Volatile Flip-Flops for Intermittently-Powered Systems

S. K. Thirumala
A. Raha
H. Jayakumar
K. Ma
V. Narayanan
V. Raghunathan
S. K. Gupta

In this work, we propose dual mode ferroelectric transistors (D-FEFETs) that exhibit dynamic tuning of operation between volatile and non-volatile modes with the help of a control signal. We utilize the unique features of D-FEFET to design two variants of non-volatile flip-flops (NVFFs). In both designs, D-FEFETs are operated in the volatile mode for normal operations and in the non-volatile mode to backup the state of the flip-flop during a power outage. The first design comprises of a truly embedded non-volatile element (D-FEFET) which enables a fully automatic backup operation. In the second design, we introduce need-based backup, which lowers energy during normal operation at the cost of area with respect to the first design. Compared to a previously proposed FEFET based NVFF, the first design achieves 19% area reduction along with 96% lower backup energy and 9% lower restore energy, but at 14%-35% larger operation energy. The second design shows 11% lower area, 21% lower backup energy, 16% decrease in backup delay and similar operation energy but with a penalty of 17% and 19% in the restore energy and delay, respectively. System-level analysis of the proposed NVFFs in context of a state-of-the-art intermittently-powered system using real benchmarks yielded 5%-33% energy savings.

SESSION: Design and 3D Integration

Multiple Combined Write-Read Peripheral Assists in 6T FinFET SRAMs for Low-VMIN IoT and Cognitive Applications

Arijit Banerjee
Sumanth Kamineni
Benton H. Calhoun

Battery-operated or energy-harvested IoT and cognitive SoCs in modern FinFET processes prefer the use of low-VMIN SRAMs for ultra-low power (ULP) operations. However, the 1:1:1 high-density (HD) FinFET 6T bitcell faces challenges in achieving a lower VMIN across process variation. The 6T bitcell VMIN improves either by increasing the size of the bitcell or by using combinations of peripheral assists (PAs) since a single PA cannot achieve the best VMIN across process variation. State-of-the-art works show some combinations of write and read PAs that lower the VMIN of 6T FinFET SRAMs. However, the better combinations of PA for 14nm HD 6T FinFET SRAMs are unknown. This work compares all the possible dual combinations of PAs and reveals the better ones. We show that in a usual column mux scenario the combination of negative bitline with VDD boosting and VDD collapse with VDD boosting in a proportion of 14% and 6% (total 20%), respectively, maximize the static VMIN improvement close to 191mV for ULP IoT and cognitive applications. We also show that a combination of wordline boosting with negative bitline and wordline boosting with VSS lowering achieve a 150mV and 25mV of dynamic VMIN improvement at the 5GHz frequency for the worst-case write and read corners, respectively, beating other combinations.

Road to High-Performance 3D ICs: Performance Optimization Methodologies for Monolithic 3D ICs

Kyungwook Chang
Sai Pentapati
Da Eun Shim
Sung Kyu Lim

As we approach the limits of 2D device scaling, monolithic 3D IC (M3D) has emerged as a potential solution offering performance and power benefits. Although various studies have been done to increase power savings of M3D designs, efforts to improve their performance are rarely made. In this paper, we, for the first time, perform in-depth analysis of the factors that affect the performance of M3D, and present methodologies to improve the performance. Our methodologies outperform the state-of-the-art M3D design flow by offering 15.6% performance improvement and 16.2% energy-delay product (EDP) benefit over 2D designs.

A Monolithic-3D SRAM Design with Enhanced Robustness and In-Memory Computation Support

Srivatsa Srinivasa
Akshay Krishna Ramanathan
Xueqing Li
Wei-Hao Chen
Fu-Kuo Hsueh
Chih-Chao Yang
Chang-Hong Shen
Jia-Min Shieh
Sumeet Gupta
Meng-Fan Marvin Chang
Swaroop Ghosh
Jack Sampson
Vijaykrishnan Narayanan

We present a novel 3D-SRAM cell using a Monolithic 3D integration (M3D-IC) technology for realizing both robustness and In-memory Boolean logic compute support. The proposed two-layer design makes use of additional transistors over the SRAM layer to enable assist techniques as well as provide logic functions (such as AND/NAND, OR/NOR, XNOR/XOR) without degrading cell density. Through analysis, we provide insights into the benefits provided by three memory assist and two logic modes and evaluate the energy efficiency of our proposed design. Assist techniques improve SRAM read stability by 2.2x and increase the write margin by 17.6%, while staying within the SRAM footprint. By virtue of increased robustness, the cell enables seamless operation at lower supply voltages and thereby ensures energy efficiency. Energy Delay Product (EDP) reduces by 1.6x over standard 6T SRAM with a faster data access. Transistor placement and their biasing technique in layer-2 enables In-memory bitwise Boolean computation. When computing bulk In-memory operations, 6.5x energy savings is achieved as compared to computing outside the memory system.

SESSION: Industry ML/AI Compute

Across the Stack Opportunities for Deep Learning Acceleration

Vijayalakshmi Srinivasan
Bruce Fleischer
Sunil Shukla
Matthew Ziegler
Joel Silberman
Jinwook Oh
Jungwook Choi
Silvia Mueller
Ankur Agrawal
Tina Babinsky
Nianzheng Cao
Chia-Yu Chen
Pierce Chuang
Thomas Fox
George Gristede
Michael Guillorn
Howard Haynie
Michael Klaiber
Dongsoo Lee
Shih-Hsien Lo
Gary Maier
Michael Scheuermann
Swagath Venkataramani
Christos Vezyrtzis
Naigang Wang
Fanchieh Yee
Ching Zhou
Pong-Fei Lu
Brian Curran
Leland Chang
Kailash Gopalakrishnan

The combination of growth in compute capabilities and availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of 100s of ExaOps of computation; posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers.

One of the key enablers to improve operational efficiency of DNNs is the observation that when extracting deep insight from vast quantities of structured and unstructured data the exactness imposed by traditional computing is not required. Relaxing the “exactness” constraint enables exploiting opportunities for approximate computing across all layers of the system stack.

In this talk we present a multi-TOPS AI core [3] for acceleration of deep learning training and inference in systems from edge devices to data centers. We demonstrate that to derive high sustained utilization and energy efficiency from the AI core requires ground-up re-thinking to exploit approximate computing across the stack including algorithms, architecture, programmability, and hardware.

Model accuracy is the fundamental measure of deep learning quality. The compute engine precision in our AI core is carefully calibrated to realize significant reduction in area and power while not compromising numerical accuracy. Our research at the DL algorithms/applications-level [2] shows that it is possible to carefully tune the precision of both weights and activations to as low as 2-bits for inference and was used to guide the choices of compute precision supported in the architecture and hardware for both training and inference. Similarly, distributed DL training’s scalability is impacted by the communication overhead to exchange gradients and weights after each mini-batch. Our research on gradient compression [1] shows by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient we achieve achieve compression ratio of 40X for convolutional layers, and up to 200X for fully-connected layers of the network without losing model accuracy. These results guide the choice of interconnection network topology exploration for a system of accelerators built using the AI core.

Overall, our work shows how the benefits from exploiting approximation using algorithm/application’s robustness to tolerate reduced precision, and compressed data communication can be combined effectively with the architecture and hardware of the accelerator designed to support these reduced-precision computation and compressed data communication. Our results demonstate improved end-to-end efficiency of the DL accelerator across different metrics such as high sustained TOPs, high TOPs/watt and TOPs/mm2 catering to different operating environments for both training and inference.

SESSION: Mobile Applications

App-Oriented Thermal Management of Mobile Devices

Jihoon Park
Seokjun Lee
Hojung Cha

The thermal issue for mobile devices becomes critical as the devices’ performance increases to handle complicated applications. Conventional thermal management limits the performance of the entire device, degrading the quality of both foreground and background applications. This is not desirable because the quality of the foreground application, i.e., the frames per second (FPS), is directly affected, whereas users are generally not aware of the performance of background applications. In this paper, we propose an app-oriented thermal management scheme that specifically restricts background applications to preserve the FPS of foreground applications. For efficient thermal management, we developed a model that predicts the heat contribution of individual applications based on hardware utilization. The proposed system gradually limits system resources for each background application according to its heat contribution. The scheme was implemented on a Galaxy S8+ smartphone, and its usefulness was validated with a thorough evaluation.

DiReCt: Resource-Aware Dynamic Model Reconfiguration for Convolutional Neural Network in Mobile Systems

Zirui Xu
Zhuwei Qin
Fuxun Yu
Chenchen Liu
Xiang Chen

Although Convolutional Neural Networks (CNNs) have been widely applied in various applications, their deployment in resource-constrained mobile systems remains a significant concern. To overcome the computation resource constraints, such as limited memory and energy capacity, many works are proposed for mobile CNN optimization. However, most of them lack a comprehensive modeling analysis of the CNN computation consumption and merely focus on static optimization schemes regardless of different mobile computation scenarios. In this work, we proposed DiReCt — a resource-aware CNN reconfiguration system. Leveraging accurate CNN computation consumption modeling and mobile resource constraint analysis, DiReCt can reconfigure a CNN with different accuracy and resource consumption levels to adapt to various mobile computation scenarios. The experiment results show that: the proposed computation consumption models in DiReCt can well estimate the CNN computation consumption with 94.1% accuracy, and DiReCt achieves at most 34.9% computation acceleration, 52.7% memory reduction, and 27.1% energy saving. Eventually, DiReCt can effectively adapt CNNs to dynamic mobile usage scenarios for optimal performance.

POSTER SESSION: Posters

A Low-power [email protected] H.265/HEVC Video Encoder for Smart Video Surveillance

Ke Xu
Yu Li
Bo Huang
Xiangkai Liu
Hong Wang
Zhuoyan Wu
Zhanpeng Yan
Xueying Tu
Tongqing Wu
Daibing Zeng

This paper presents the design and VLSI implementation of a low-power HEVC main profile encoder, which is able to process up to [email protected] 4:2:0 encoding in real-time with five-stage pipeline architecture. A pyramid ME (Motion Estimation) engine is employed to reduce search complexity. To compensate for the video sequences with fast moving objects, GME (Global Motion Estimation) are introduced to alleviate the effect of limited search range. We also implement an alternative 5×5 search along with 3×3 to boost video quality. For intra mode decision, original pixels, instead of reconstructed ones are used to reduce pipeline stall. The encoder supports DVFS (Dynamic Voltage and Frequency Scaling) and features three operating modes, which helps to reduce power consumption by 25%. Scalable quality that trades encoding quality for power by reducing size of search range and intra prediction candidates, achieves 11.4% power reduction with 3.5% quality degradation. Furthermore, a lossless frame buffer compression is proposed which reduced DDR bandwidth by 49.1% and power consumption by 13.6%. The entire video surveillance SoC is fabricated with TSMC 28nm technology with 1.96 mm2 area. It consumes 2.88M logic gates and 117KB SRAM. The measured power consumption is 103mW at 350MHz for 4K encoding with high-quality mode. The 0.39nJ/pixel of energy efficiency of this work, which achieves 42% ~ 97% power reduction as compared with reference designs, make it ideal for real-time low-power smart video surveillance applications.

Breaking POps/J Barrier with Analog Multiplier Circuits Based on Nonvolatile Memories

M. Reza Mahmoodi
Dmitri Strukov

Low-to-medium resolution analog vector-by-matrix multipliers (VMMs) offer a remarkable energy/area efficiency as compared to their digital counterparts. Still, the maximum attainable performance in analog VMMs is often bounded by the overhead of the peripheral circuits. The main contribution of this paper is the design of novel sensing circuitry which improves energy-efficiency and density of analog multipliers. The proposed circuit is based on translinear Gilbert cell, which is topologically combined with a floating nonlinear resistor and a low-gain amplifier. Several compensation techniques are employed to ensure reliability with respect to process, temperature, and supply voltage variations. As a case study, we consider implementation of couple-gate current-mode VMM with embedded split-gate NOR flash memory. Our simulation results show that a 4-bit 100×100 VMM circuit designed in 55 nm CMOS technology achieves the record-breaking performance of 3.63 POps/J.

Efficient Image Sensor Subsampling for DNN-Based Image Classification

Jia Guo
Hongxiang Gu
Miodrag Potkonjak

Today’s mobile devices are equipped with cameras capable of taking very high-resolution pictures. For computer vision tasks which require relatively low resolution, such as image classification, sub-sampling is desired to reduce the unnecessary power consumption of the image sensor. In this paper, we study the relationship between subsampling and the performance degradation of image classifiers that are based on deep neural networks (DNNs). We empirically show that subsampling with the same step size leads to very similar accuracy changes for different classifiers. In particular, we could achieve over 15x energy savings just by subsampling while suffering almost no accuracy lost. For even better energy accuracy trade-offs, we propose AdaSkip, where the row sampling resolution is adaptively changed based on the image gradient. We implement AdaSkip on an FPGA and report its energy consumption.

Input-Splitting of Large Neural Networks for Power-Efficient Accelerator with Resistive Crossbar Memory Array

Yulhwa Kim
Hyungjun Kim
Daehyun Ahn
Jae-Joon Kim

Resistive Crossbar memory Arrays (RCA) have been gaining interest as a promising platform to implement Convolutional Neural Networks (CNN). One of the major challenges in RCA-based design is that the number of rows in an RCA is often smaller than the number of input neurons in a layer. Previous works used high-resolution Analog-to-Digital Converters (ADCs) to compute the partial weighted sum in each array and merged partial sums from multiple arrays outside the RCAs. However, such approach suffers from significant power consumption due to the need for high-resolution ADCs. In this paper, we propose a methodology to more efficiently construct a large CNN with multiple RCAs. By splitting the input feature map and retraining the CNN with proper initialization, we demonstrate that any CNN model can be represented with multiple arrays without using intermediate partial sums. The experimental results show that the ADC power of the proposed design is 32x smaller and the total chip power of the proposed design is 3x smaller than those of the baseline design.

Design Optimization of 3D Multi-Processor System-on-Chip with Integrated Flow Cell Arrays

Artem Andreev
Fulya Kaplan
Marina Zapater
Ayse K. Coskun
David Atienza

Integrated flow cell array (FCA) is an emerging technology, targeting the cooling and power delivery challenges of modern 2D/3D Multi-Processor Systems-on-Chip (MPSoCs). In FCA, electrolytic solutions are pumped through microchannels etched in the silicon of the chips, removing heat from the system, while, at the same time, generating power on-chip. In this work, we explore the impact of FCA system design on various 3D architectures and propose a methodology to optimize a 3D MPSoC with integrated FCA to run a given workload in the most energy-efficient way. Our results show that an optimized configuration can save up to 50% energy with respect to sub-optimal 3D MPSoC configurations.

Multi-Pattern Active Cell Balancing Architecture and Equalization Strategy for Battery Packs

Swaminathan Narayanaswamy
Sangyoung Park
Sebastian Steinhorst
Samarjit Chakraborty

Active cell balancing is the process of improving the usable capacity of a series-connected Lithium-Ion (Li-Ion) battery pack by redistributing the charge levels of individual cells. Depending upon the State-of-Charge (SoC) distribution of the individual cells in the pack, an appropriate charge transfer pattern (cell-to-cell, cell-to-module, module-to-cell or module-to-module) has to be selected for improving the usable energy of the battery pack. However, existing active cell balancing circuits are only capable of performing limited number of charge transfer patterns and, therefore, have a reduced energy efficiency for different types of SoC distribution. In this paper, we propose a modular, multi-pattern active cell balancing architecture that is capable of performing multiple types of charge transfer patterns (cell-to-cell, cell-to-module, module-to-cell and module-to-module) with a reduced number of hardware components and control signals compared to existing solutions. We derive a closed-form, analytical model of our proposed balancing architecture with which we profile the efficiency of the individual charge transfer patterns enabled by our architecture. Using the profiling analysis, we propose a hybrid charge equalization strategy that automatically selects the most energy-efficient charge transfer pattern depending upon the SoC distribution of the battery pack and the characteristics of our proposed balancing architecture. Case studies show that our proposed balancing architecture and hybrid charge equalization strategy provide up to a maximum of 46.83% improvement in energy efficiency compared to existing solutions.

Intrinsic and Database-free Watermarking in ICs by Exploiting Process and Design Dependent Variability in Metal-Oxide-Metal Capacitances

Ahish Shylendra
Swarup Bhunia
Amit Ranjan Trivedi

Authentication of integrated circuits (IC) to verify their integrity has emerged as a critical need to address increasing concerns associated with counterfeit ICs in the supply chain. In this paper, novel SAR-ADC based intrinsic and database-free authentication scheme has been proposed. Proposed technique utilizes mismatch in back end of line (BEOL) capacitors used in charge-redistribution SAR ADC to generate authentication signature. BEOL metal-oxide-metal (MOM) capacitors form a reliable source of process variation information and are less sensitive to aging & temperature induced variations. Line edge roughness is the primary source of mismatch in BEOL capacitors and thus, capacitor mismatch variation has been analyzed in terms of LER and geometric parameters. Resource overhead incurred by the proposed modifications to the ADC architecture to incorporate authentication ability is minimal and existing on-chip calibration circuitry is used to extract signature. Proposed technique does not require sophisticated test setup, thereby, simplifying the authentication procedure.

Scheduling of Hybrid Battery-Supercapacitor Control Instructions for Longevity in Systems with Power Gating

Sumanta Pyne

The in-rush current due to wake-up of power gating (PG) components causes faster discharge of battery. This work introduces an instruction controlled hybrid battery-supercapacitor (B-SC) system for longer battery life in systems with instruction controlled PG. Two instructions have been introduced along with architectural support. The first instruction disconnects the battery from the PG components if the charge in the supercapacitor greater than or equal to the charge required by wake-up of PG components. The other instruction connects the battery to the PG components for recharging the supercapacitor. Disconnecting the battery during wake-up minimizes rate capacity effect (C-rate) for longer battery life. An algorithm is designed to schedule the proposed battery control instructions within a program having PG instructions. The efficacy of the proposed method is evaluated on MiBench and MediaBench benchmark programs. The proposed method reduces C-rate by an average of 14.25% at the cost of average performance loss of 6.87%.

Better-Than-Worst-Case Design Methodology for a Compact Integrated Switched-Capacitor DC-DC Converter

Dongkwun Kim
Mingoo Seok

We suggest a new methodology in co-designing an integrated switched-capacitor converter and a digital load. Conventionally, a load has been specified to the minimum supply voltage and the maximum power dissipation, each found at her own worst-case process, workload, and environment condition. Furthermore, in designing an SC DC-DC converter toward this worst-case load specification, designers often have been adding another separate pessimistic assumption on power-switch’s resistance and flying-capacitor’s density of an SC converter. Such worst-case design methodology can lead to a significantly over-sized flying capacitor and thereby limit on-chip integration of a converter. Our proposed methodology instead adopts the better than worst-case (BTWC) perspective to avoid over-design and thus optimizes the area of an SC converter. Specifically, we propose BTWC load modeling where we specify non-pessimistic sets of supply voltage requirement and load power dissipation across variations. In addition, by considering coupled variations between the SC converter and the load integrated in the same die, our methodology can further reduce the pessimism in power-switch’s resistance and capacitor density. The proposed co-design methodology is verified with a 2:1 SC converter and a digital load in a 65 nm. The resulted converter achieves more than one order of magnitude reduction in the flying capacitor size as compared to the conventional worst-case design while maintaining the target conversion efficiency and target throughput. We also verified our methodology with a wide range of load characteristics in terms of their supply voltages and current draw and confirmed the similar benefits.

Dynamic Bit-width Reconfiguration for Energy-Efficient Deep Learning Hardware

Daniele Jahier Pagliari
Enrico Macii
Massimo Poncino

Deep learning models have reached state of the art performance in many machine learning tasks. Benefits in terms of energy, bandwidth, latency, etc., can be obtained by evaluating these models directly within Internet of Things end nodes, rather than in the cloud. This calls for implementations of deep learning tasks that can run in resource limited environments with low energy footprints. Research and industry have recently investigated these aspects, coming up with specialized hardware accelerators for low power deep learning. One effective technique adopted in these devices consists in reducing the bit-width of calculations, exploiting the error resilience of deep learning. However, bit-widths are tipically set statically for a given model, regardless of input data. Unless models are retrained, this solution invariably sacrifices accuracy for energy efficiency.

In this paper, we propose a new approach for implementing input-dependant dynamic bit-width reconfiguration in deep learning accelerators. Our method is based on a fully automatic characterization phase, and can be applied to popular models without retraining. Using the energy data from a real deep learning accelerator chip, we show that 50% energy reduction can be achieved with respect to a static bit-width selection, with less than 1% accuracy loss.

Deploying Customized Data Representation and Approximate Computing in Machine Learning Applications

Mahdi Nazemi
Massoud Pedram

Major advancements in building general-purpose and customized hardware have been one of the key enablers of versatility and pervasiveness of machine learning models such as deep neural networks. To sustain this ubiquitous deployment of machine learning models and cope with their computational and storage complexity, several solutions such as low-precision representation of model parameters using fixed-point representation and deploying approximate arithmetic operations have been employed. Studying the potency of such solutions in different applications requires integrating them into existing machine learning frameworks for high-level simulations as well as implementing them in hardware to analyze their effects on power/energy dissipation, throughput, and chip area. Lop is a library for design space exploration that bridges the gap between machine learning and efficient hardware realization. It comprises a Python module, which can be integrated with some of the existing machine learning frameworks and implements various customizable data representations including fixed-point and floating-point as well as approximate arithmetic operations. Furthermore, it includes a highly-parameterized Scala module, which allows synthesizing hardware based on the said data representations and arithmetic operations. Lop allows researchers and designers to quickly compare quality of their models using various data representations and arithmetic operations in Python and contrast the hardware cost of viable representations by synthesizing them on their target platforms (e.g., FPGA or ASIC). To the best of our knowledge, Lop is the first library that allows both software simulation and hardware realization using customized data representations and approximate computing techniques.

Battery-Aware Energy Model of Drone Delivery Tasks

Donkyu Baek
Yukai Chen
Alberto Bocca
Alberto Macii
Enrico Macii
Massimo Poncino

Drones are becoming increasingly popular in the commercial market for various package delivery services. In this scenario, the mostly adopted drones are quad-rotors (i.e., quadcopters). The energy consumed by a drone may become an issue, since it may affect (i) the delivery deadline (quality of service), (ii) the number of packages that can be delivered (throughput) and (iii) the battery lifetime (number of recharging cycles). It is thus fundamental try to find the proper compromise between the energy used to complete the delivery and the speed at which the quadcopter flies to reach the destination. In order to achieve this, we have to consider that the energy required by the drone for completing a given delivery task does not exactly correspond to the energy requested to the battery, since the latter is a non-ideal power supply that is able to deliver power with different efficiencies depending on its state of charge. In this paper, we demonstrate that the proposed battery-aware delivery scheduling algorithm carries more packages than the traditional delivery model with the same battery capacity. Moreover, the battery-aware delivery model is 17% more accurate than the traditional delivery model for the same delivery scheme, which prevents the unexpected drone landing.

A Fully Onchip Binarized Convolutional Neural Network FPGA Impelmentation with Accurate Inference

Li Yang
Zhezhi He
Deliang Fan

Deep convolutional neural network has taken an important role in machine learning algorithm which has been widely used in computer vision tasks. However, its enormous model size and massive computation cost have became the main obstacle for deployment of such powerful algorithm in low power and resource limited embedded system, such as FPGA. Recent works have shown the binarized neural networks (BNN), utilizing binarized (i.e. +1 and -1) convolution kernel and binary activation function, can significantly reduce the model size and computation complexity, which paves a new road for energy-efficient FPGA implementation. In this work, we first propose a new BNN algorithm, called Parallel-Convolution BNN (i.e. PC-BNN), which replaces the original binary convolution layer in conventional BNN with two parallel binary convolution layers. PC-BNN achieves ~86% on CIFAR-10 dataset with only 2.3Mb parameter size. We then deploy our proposed PC-BNN into the Xilinx PYNQ Z1 FPGA board with only 4.9Mb on-chip RAM. Since the ultra-small network parameter, it is feasible to store the whole network parameter into on-chip RAM, which could greatly reduce the energy and delay overhead to load network parameter from off-chip memory. Meanwhile, a new data streaming pipeline architecture is proposed in PC-BNN FPGA implementation to further improve throughput. The experiment results show that our PC-BNN based FPGA implementation achieves 930 frames per second, 387.5 FPS/Watt and 396×10-4 FPS/LUT, which are among the best throughput and energy efficiency compared to most recent works.

In-situ Stochastic Training of MTJ Crossbar based Neural Networks

Ankit Mondal
Ankur Srivastava

Owing to high device density, scalability and non-volatility, Magnetic Tunnel Junction-based crossbars have garnered significant interest for implementing the weights of an artificial neural network. The existence of only two stable states in MTJs implies a high overhead of obtaining optimal binary weights in software. We illustrate that the inherent parallelism in the crossbar structure makes it highly appropriate for in-situ training, wherein the network is taught directly on the hardware. It leads to significantly smaller training overhead as the training time is independent of the size of the network, while also circumventing the effects of alternate current paths in the crossbar and accounting for manufacturing variations in the device. We show how the stochastic switching characteristics of MTJs can be leveraged to perform probabilistic weight updates using the gradient descent algorithm. We describe how the update operations can be performed on crossbars both with and without access transistors and perform simulations on them to demonstrate the effectiveness of our techniques. The results reveal that stochastically trained MTJ-crossbar NNs achieve a classification accuracy nearly same as that of real-valued-weight networks trained in software and exhibit immunity to device variations.

Variation-Aware Pipelined Cores through Path Shaping and Dynamic Cycle Adjustment: Case Study on a Floating-Point Unit

Ioannis Tsiokanos
Lev Mukhanov
Dimitrios S. Nikolopoulos
Georgios Karakonstantis

In this paper, we propose a framework for minimizing variation-induced timing failures in pipelined designs, while limiting any overhead incurred by conventional guardband based schemes. Our approach initially limits the long latency paths (LLPs) and isolates them in as few pipeline stages as possible by shaping the path distribution. Such a strategy, facilitates the adoption of a special unit that predicts the excitation of the isolated LLPs and dynamically allows an extra cycle for the completion of only these error-prone paths. Moreover, our framework performs post-layout dynamic timing analysis based on real operands that we extract from a variety of applications. This allows us to estimate the bit error rates under potential delay variations, while considering the dynamic data dependent path excitation. When applied to the implementation of an IEEE-754 compatible double precision floating-point unit (FPU) in a 45nm process technology, the path shaping helps to reduce the bit error rates on average by 2.71 x compared to the reference design under 8% delay variations. The integrated LLPs prediction unit and the dynamic cycle adjustment avoid such failures and any quality loss at a cost of up-to 0.61% throughput and 0.3% area overheads, while saving 37.95% power on average compared to an FPU with pessimistic margins.

A 2.6 mW Single-Ended Positive Feedback LNA for 5G Applications

Sana Arshad
Azam Beg
Rashad Ramzan

This paper presents the design of a single-ended positive feedback Common Gate (CG) Low Noise Amplifier (LNA) for 5G applications. Positive feedback is utilized to achieve the trade-off between the input matching, the gain and the noise factor (NF) of the LNA. The positive feedback inherently cancels the noise produced by the input CG transistor. The proposed LNA is designed and fabricated in 150 nm CMOS by L-Foundry. At 1.41 GHz, the measured S11 and S22 are better than -20 dB and -8.4 dB, respectively. The highest voltage gain is 16.17 dB with a NF of 3.64 dB. The complete chip has an area of 1 mm2. The LNA’s power dissipation is only 2.6 mW with a 1 dB compression point of -13 dBm. The simple, low power and single-ended architecture of the proposed LNA allows it to be implemented in phase array and Multiple Input Multiple Output (MIMO) radars, which have limited input and output pads and constrained power budgets for on-board components.

SESSION: Far-out Ideas

Insights from Biology: Low Power Circuits in the Fruit Fly

Louis K. Scheffer

Fruit flies (Drosophila melanogaster) are small insects, with correspondingly small power budgets. Despite this, they perform sophisticated neural computations in real time. Careful study of these insects is revealing how some of these circuits work. Insights from these systems might be helpful in designing other low power circuits.