ISLPED ’20: Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design

Full Citation in the ACM Digital Library

SESSION: ML related software and systems

How to cultivate a green decision tree without loss of accuracy?

  • Tseng-Yi Chen
  • Yuan-Hao Chang
  • Ming-Chang Yang
  • Huang-Wei Chen

The decision tree is the core algorithm of random forest learning, which has been widely applied to classification and regression problems in machine learning. To avoid underfitting, a decision tree algorithm keeps growing its tree model until the model is a fully-grown tree. However, a fully-grown tree causes overfitting, which reduces the accuracy of the decision tree. To resolve this dilemma, post-pruning strategies have been proposed to reduce the model complexity of the fully-grown decision tree. Nevertheless, such a process is very energy-inefficient on a non-volatile-memory-based (NVM-based) system because NVM generally has high write costs (i.e., energy consumption and I/O latency): tree nodes that are written during construction and later pruned away induce high write energy consumption and long I/O latency on NVM-based architectures, especially for low-power-oriented embedded systems. To establish a green decision tree (i.e., a tree model with minimized construction energy consumption), this study rethinks pruning and proposes the duo-phase pruning framework, which significantly decreases the energy consumption of an NVM-based computing system without loss of accuracy.

Approximate inference systems (AxIS): end-to-end approximations for energy-efficient inference at the edge

  • Soumendu Kumar Ghosh
  • Arnab Raha
  • Vijay Raghunathan

The rapid proliferation of the Internet-of-Things (IoT) and the dramatic resurgence of artificial intelligence (AI) based application workloads have led to immense interest in performing inference on energy-constrained edge devices. Approximate computing (a design paradigm that yields large energy savings at the cost of a small degradation in application quality) is a promising technique to enable energy-efficient inference at the edge. This paper introduces the concept of an approximate inference system (AxIS) and proposes a systematic methodology to perform joint approximations across different subsystems in a deep neural network-based inference system, leading to significant energy benefits compared to approximating individual subsystems in isolation. We use a smart camera system that executes various convolutional neural network (CNN) based image recognition applications to illustrate how the sensor, memory, compute, and communication subsystems can all be approximated synergistically. We demonstrate our proposed methodology using two variants of a smart camera system: (a) Camedge, where the CNN executes locally on the edge device, and (b) Camcloud, where the edge device sends the captured image to a remote cloud server that executes the CNN. We have prototyped such an approximate inference system using an Altera Stratix IV GX-based Terasic TR4-230 FPGA development board. Experimental results obtained using six CNNs demonstrate significant energy savings (around 1.7× for Camedge and 3.5× for Camcloud) for minimal (< 1%) loss in application quality. Compared to approximating a single subsystem in isolation, AxIS achieves additional energy benefits of 1.6×–1.7× (Camedge) and 1.4×–3.4× (Camcloud) on average for minimal application-level quality loss.

Time-step interleaved weight reuse for LSTM neural network computing

  • Naebeom Park
  • Yulhwa Kim
  • Daehyun Ahn
  • Taesu Kim
  • Jae-Joon Kim

In Long Short-Term Memory (LSTM) neural network models, a weight matrix tends to be repeatedly loaded from DRAM if the on-chip storage of the processor is not large enough to hold the entire matrix. To alleviate the heavy overhead of DRAM accesses for weight loading in LSTM computations, we propose a weight reuse scheme that exploits the weight-sharing characteristics of two adjacent time-step computations. Experimental results show that the proposed weight reuse scheme reduces energy consumption by 28.4-57.3% and increases overall throughput by 110.8% compared to conventional schemes.
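
The saving mechanism can be sketched in a few lines: when two adjacent time-step inputs are both available (as they are for the input-to-gate weight matrix, whose inputs do not depend on the recurrent state), each weight block fetched from DRAM can serve both steps before being evicted, halving the number of weight loads. The sketch below is our illustration of that schedule, not the paper's implementation; the function name and block size are hypothetical.

```python
import numpy as np

def matvec_two_steps(W, x_t, x_t1, block=2):
    """Stream W row-block by row-block (emulating DRAM loads) and apply
    each block to two adjacent time-step inputs before moving on, so
    every block is loaded once instead of twice."""
    n = W.shape[0]
    y_t, y_t1 = np.zeros(n), np.zeros(n)
    loads = 0
    for r in range(0, n, block):
        Wb = W[r:r + block]            # one weight-block load from DRAM
        loads += 1
        y_t[r:r + block] = Wb @ x_t    # partial result for step t
        y_t1[r:r + block] = Wb @ x_t1  # reuse the same block for step t+1
    return y_t, y_t1, loads
```

A conventional schedule would perform one load per time step, i.e., twice as many loads for the same two results.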

Sound event detection with binary neural networks on tightly power-constrained IoT devices

  • Gianmarco Cerutti
  • Renzo Andri
  • Lukas Cavigelli
  • Elisabetta Farella
  • Michele Magno
  • Luca Benini

Sound event detection (SED) is a hot topic in consumer and smart city applications. Existing approaches based on deep neural networks (DNNs) are very effective, but highly demanding in terms of memory, power, and throughput when targeting ultra-low power always-on devices.

Latency, availability, cost, and privacy requirements are pushing recent IoT systems to process data on the node, close to the sensor, with a very limited energy supply and tight constraints on memory size and processing capability that preclude running state-of-the-art DNNs.

In this paper, we explore the combination of extreme quantization, in the form of a small-footprint binary neural network (BNN), with the highly energy-efficient, RISC-V-based (8+1)-core GAP8 microcontroller. Starting from an existing CNN for SED whose footprint (815 kB) exceeds the 512 kB of memory available on our platform, we retrain the network using binary filters and activations to meet this memory constraint. Fully binary neural networks typically incur an accuracy drop of 12-18% on the challenging ImageNet object-recognition task compared to their full-precision baselines. Our BNN reaches 77.9% accuracy, just 7% lower than the full-precision version, with 58 kB (7.2× less) for the weights and 262 kB (2.4× less) memory in total. With our BNN implementation, we reach a peak throughput of 4.6 GMAC/s and 1.5 GMAC/s over the full network, including preprocessing with Mel bins, which corresponds to efficiencies of 67.1 GMAC/s/W and 31.3 GMAC/s/W, respectively. Compared to an ARM Cortex-M4 implementation, our system achieves 10.3× faster execution and 51.1× higher energy efficiency.
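
The energy advantage of binarization comes from replacing each multiply-accumulate with bitwise operations: for vectors over {-1, +1} packed as bit masks, the dot product reduces to n − 2·popcount(a XOR b). A minimal sketch of this standard BNN kernel (the helper names are ours, not from the paper):

```python
def pack(vec):
    """Pack a {-1,+1} vector into an integer bit mask (bit i set iff +1)."""
    return sum(1 << i for i, x in enumerate(vec) if x == 1)

def bin_dot(a_bits, b_bits, n):
    """Binary dot product: matching bits contribute +1, mismatching -1,
    so dot(a, b) = n - 2 * popcount(a XOR b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")
```

On microcontrollers like GAP8, the XOR and popcount map to single instructions over 32 weights at a time, which is where the throughput and energy gains originate.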

SESSION: Low power circuit designs

Analysis of crosstalk in NISQ devices and security implications in multi-programming regime

  • Abdullah Ash-Saki
  • Mahabubul Alam
  • Swaroop Ghosh

Noisy intermediate-scale quantum (NISQ) computers suffer from unwanted coupling across qubits, referred to as crosstalk. The existing literature largely ignores crosstalk effects, which can introduce significant error into circuit optimization. In this work, we present a crosstalk modeling and analysis framework for near-term quantum computers after extracting the error rates experimentally. Our analysis reveals that crosstalk can be of the same order as gate error, which is considered the dominant error in NISQ devices. We also propose adversarial fault injection using crosstalk in a multi-programming environment where the victim and the adversary share the same quantum hardware. Our simulation and experimental results from IBM quantum computers demonstrate that an adversary can inject faults and launch a denial-of-service attack. Finally, we propose system- and device-level countermeasures.

An 88.6nW ozone pollutant sensing interface IC with a 159 dB dynamic range

  • Rishika Agarwala
  • Peng Wang
  • Akhilesh Tanneeru
  • Bongmook Lee
  • Veena Misra
  • Benton H. Calhoun

This paper presents a low-power resistive sensor interface IC designed at 0.6 V for ozone pollutant sensing. The large resistance range of gas sensors poses challenges in designing a low-power sensor interface. Existing architectures are insufficient for achieving a high dynamic range while enabling low-VDD operation, resulting in high power consumption regardless of the adopted architecture. We present an adaptive architecture that provides baseline resistance cancellation and dynamic current control to enable low-VDD operation while maintaining a dynamic range of 159 dB across 20 kΩ-1 MΩ. The sensor interface IC is fabricated in a 65 nm bulk CMOS process and consumes 88.6 nW of power, which is 300× lower than the state of the art. The full system power ranges between 116 nW and 1.09 μW, including the proposed sensor interface IC, an analog-to-digital converter, and peripheral circuits. The sensor interface’s performance was verified using custom resistive metal-oxide sensors for ozone concentrations from 50 ppb to 900 ppb.

A 1.2-V, 1.8-GHz low-power PLL using a class-F VCO for driving 900-MHz SRD band SC-circuits

  • Tim Schumacher
  • Markus Stadelmayer
  • Thomas Faseth
  • Harald Pretl

This work presents a 1.6 GHz to 2 GHz integer PLL with 2 MHz stepping, optimized for driving low-power 180 nm switched-capacitor (SC) circuits at a 1.2 V supply. To reduce the overall power consumption, a class-F VCO is implemented. The enriched odd harmonics of the oscillator produce a rectangular oscillator signal, which allows output buffering stages to be omitted. The rectangular signal lowers power consumption and enables the oscillator signal to directly drive SC-filters and an RF-divider. In addition, the proposed RF-divider includes differential 4-phase signal generation at the 868 MHz and 915 MHz SRD band frequencies, which can be used for complex modulation schemes. With a fully integrated loop filter, a high level of integration is achieved. A test chip was manufactured in a 1P6M 180 nm CMOS technology with triple-well option and confirms a total active PLL power consumption of 4.1 mW. It achieves a phase noise of -111 dBc/Hz at 1 MHz offset and a -42 dBc spurious response from a 1 MHz reference.

A 640pW 32kHz switched-capacitor ILO analog-to-time converter for wake-up sensor applications

  • Nicolas Goux
  • Jean-Baptiste Casanova
  • Gaël Pillonnet
  • Franck Badets

This paper presents the architecture and ultra-low power (ULP) implementation of a switched-capacitor injection-locked oscillator (SC-ILO) used as an analog-to-time converter for wake-up sensor applications. Thanks to a novel injection-locking scheme based on switched capacitors, the SC-ILO architecture avoids power-hungry constant injection current sources. The SC-ILO design parameters and transfer function are derived from an analytical study and used to optimize the design. The ULP implementation strategy regarding power consumption, gain, modulation bandwidth, and output phase dynamic range is presented and optimized to be compatible with audio wake-up sensor applications, which require ultra-low power consumption but only modest dynamic range. This paper reports experimental measurements of the SC-ILO circuit, fabricated in a 22 nm FDSOI process. The measured chip exhibits a 129° phase-shift range and a 6 kHz bandwidth, leading to a 34.6 dB dynamic range at a power consumption of 640 pW under 0.4 V.

SESSION: Low power management

Dynamic idle core management and leakage current reuse in MPSoC platforms

  • MD Shazzad Hossain
  • Ioannis Savidis

In this paper, algorithmic and circuit techniques are proposed for dynamic power management that allow the leakage current of idle circuit blocks and cores to be reused in a multiprocessor system-on-chip (MPSoC) platform. First, a novel scheduling algorithm, longest idle time – leakage reuse (LIT-LR), is proposed for energy-efficient reuse of leakage current; it generates a supply voltage of 340 mV with less than ±3% variation across the tt, ff, and ss process corners. The LIT-LR algorithm reduces the energy consumption of the leakage control blocks and the peak power consumption by, respectively, 25% and 7.4% compared to random assignment of idle cores for leakage reuse. Second, a novel usage-ranking-based algorithm, longest idle time – simultaneous leakage reuse and power gating (LIT-LRPG), is proposed for simultaneous implementation of power gating and leakage reuse. Applying power gating with leakage reuse reduces the total energy consumption of the MPSoC by 50.2%, 14.4%, and 5.7% compared to, respectively, a baseline topology that includes neither leakage reuse nor power gating, one that only includes power gating, and one that only includes leakage reuse.
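
The core of an LIT-style policy is a ranking step: cores predicted to stay idle longest are chosen as leakage-current donors, so the reuse network is reconfigured as rarely as possible. The following sketch is our reading of that idea under stated assumptions (per-core idle-time estimates are given; the voltage-regulation and corner-variation details of the paper are omitted, and the function name is hypothetical).

```python
def lit_lr_schedule(idle_estimates, k):
    """Longest-Idle-Time first: pick the k cores predicted to stay
    idle longest as leakage-current donors.

    idle_estimates: dict mapping core id -> predicted idle duration.
    Returns the chosen core ids, longest predicted idle first."""
    ranked = sorted(idle_estimates.items(), key=lambda kv: kv[1], reverse=True)
    return [core for core, _ in ranked[:k]]
```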

Towards wearable piezoelectric energy harvesting: modeling and experimental validation

  • Yigit Tuncel
  • Shiva Bandyopadhyay
  • Shambhavi V. Kulshrestha
  • Audrey Mendez
  • Umit Y. Ogras

Motion energy harvesting is an ideal alternative to battery in wearable applications since it can produce energy on demand. So far, widespread use of this technology has been hindered by bulky, inflexible and impractical designs. New flexible piezoelectric materials enable comfortable use of this technology. However, the energy harvesting potential of this approach has not been thoroughly investigated to date. This paper presents a novel mathematical model for estimating the energy that can be harvested from joint movements on the human body. The proposed model is validated using two different piezoelectric materials attached on a 3D model of the human knee. To the best of our knowledge, this is the first study that combines analytical modeling and experimental validation for joint movements. Thorough experimental evaluations show that 1) users can generate on average 13 μW power while walking, 2) we can predict the generated power with 4.8% modeling error.

RAMANN: in-SRAM differentiable memory computations for memory-augmented neural networks

  • Mustafa Ali
  • Amogh Agrawal
  • Kaushik Roy

Memory-Augmented Neural Networks (MANNs) have been shown to outperform Recurrent Neural Networks (RNNs) in capturing long-term dependencies. Since MANNs are equipped with an external memory, they can store and retrieve more data over longer periods of time. A MANN generally consists of a network controller and an external memory. Unlike conventional memory with read/write operations to specific addresses, a differentiable memory has soft read and write operations involving all the data stored in the memory. Such soft read and write operations present new computational challenges for hardware implementation of MANNs. In this work, we present a novel in-memory computing primitive to accelerate the differentiable memory operations of MANNs in SRAMs. We propose a 9T SRAM macro capable of performing both Hamming similarity and dot products (crucial for soft read/write and addressing mechanisms in MANNs). For Hamming similarity, we operate the 9T cell in analog content-addressable memory (CAM) mode by applying the key at the bitlines (RBLs/RBLBs) in each column and reading out the analog output at the sourceline (SL). To perform the dot-product operation, the input data is applied at the wordlines, and the current passing through the RBLs represents the dot product between the input data and the stored bits. The proposed SRAM array performs computations that reliably match the operations required for a differentiable memory, thereby leading to energy-efficient on-chip acceleration of MANNs. Compared to standard GPU systems, the proposed scheme achieves 43× and 85× performance and energy improvements, respectively, for computing the differentiable memory operations.
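
For readers unfamiliar with differentiable memories, the soft read/write the abstract refers to follows the standard NTM-style formulation: every memory row participates in every access, weighted by its similarity to a key. A minimal NumPy sketch of those reference equations (the sharpness parameter `beta` and the function names are our choices, not from the paper):

```python
import numpy as np

def soft_read(M, key, beta=10.0):
    """Soft read: a softmax over similarity scores weights every memory
    row, and the result is the weighted blend of all rows."""
    scores = M @ key                          # dot-product addressing
    w = np.exp(beta * scores)
    w /= w.sum()                              # softmax weights, sum to 1
    return w @ M

def soft_write(M, key, erase, add, beta=10.0):
    """Soft write: every row is partially erased and updated in
    proportion to its addressing weight."""
    scores = M @ key
    w = np.exp(beta * scores)
    w /= w.sum()
    return M * (1.0 - np.outer(w, erase)) + np.outer(w, add)
```

It is exactly these all-rows similarity and weighted-sum computations that the 9T SRAM macro maps onto analog CAM and current-summing operations.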

Swan: a two-step power management for distributed search engines

  • Liang Zhou
  • Laxmi N. Bhuyan
  • K. K. Ramakrishnan

The service quality of web search depends considerably on the request tail latency from Index Serving Nodes (ISNs), prompting data centers to operate them at low utilization, wasting server power. ISNs can be made more energy-efficient using Dynamic Voltage and Frequency Scaling (DVFS) or sleep states to take advantage of slack in the latency of search queries. However, state-of-the-art frameworks use a single distribution to predict a request’s service time and select a high-percentile tail latency to derive the CPU’s frequency or sleep states. Unfortunately, this misses plenty of energy-saving opportunities. In this paper, we develop a simple linear regression predictor that estimates each individual search request’s service time based on the length of the request’s posting list. To use this prediction for power management, the major challenge lies in reducing deadline miss rates due to prediction errors while improving energy efficiency. We present Swan, a two-Step poWer mAnagement framework for distributed search eNgines. For each request, Swan selects an initial, lower frequency to optimize power, and then boosts the CPU frequency just in time to meet the deadline. Additionally, we re-configure the boosting instant when a critical request arrives, to avoid deadline violations. Swan is implemented on the widely-used Solr search engine and evaluated with two representative, large query traces. Evaluations show Swan outperforms state-of-the-art approaches, saving at least 39% CPU power on average.
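
The two steps can be sketched concretely. Below is our illustrative reading, not Swan's actual code: the linear model's coefficients, the frequency-scaling assumption (service time scaling as f_max/f), and the boost-instant heuristic are all hypothetical stand-ins for what the paper tunes empirically.

```python
def predict_service_time(posting_list_len, a, b):
    """Per-request linear model: service time grows with the length of
    the request's posting list (a, b fitted offline)."""
    return a * posting_list_len + b

def two_step_plan(pred_time, deadline, freqs, f_max):
    """Step 1: start at the lowest frequency whose scaled service time
    (~ pred_time * f_max / f) still meets the deadline.
    Step 2: schedule a boost to f_max at the latest instant that would
    still finish by the deadline if the prediction proves optimistic."""
    for f in sorted(freqs):
        if pred_time * (f_max / f) <= deadline:
            boost_at = max(0.0, deadline - pred_time)  # latest safe instant
            return f, boost_at
    return f_max, 0.0   # no slack: run flat-out from the start
```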

SESSION: Tuning the design flow for low power: From synthesis to pin assignment

Deep-PowerX: a deep learning-based framework for low-power approximate logic synthesis

  • Ghasem Pasandi
  • Mackenzie Peterson
  • Moises Herrera
  • Shahin Nazarian
  • Massoud Pedram

This paper integrates three powerful techniques, namely deep learning, approximate computing, and low-power design, into a strategy to optimize logic at the synthesis level. We utilize advances in deep learning to guide an approximate logic synthesis engine to minimize the dynamic power consumption of a given digital CMOS circuit, subject to a predetermined error rate at the primary outputs. Our framework, Deep-PowerX, focuses on replacing or removing gates on a technology-mapped network and uses a Deep Neural Network (DNN) to predict error rates at the primary outputs of the circuit when a specific part of the netlist is approximated. The primary goal of Deep-PowerX is to reduce dynamic power, with area reduction as a secondary objective. Using this DNN, Deep-PowerX reduces the exponential time complexity of standard approximate logic synthesis to linear time. Experiments are performed on numerous open-source benchmark circuits. Results show reductions in power and area of up to 1.47× and 1.43× compared to exact solutions, and up to 22% and 27% compared to state-of-the-art approximate logic synthesis tools, with orders of magnitude lower runtime.

Steady state driven power gating for lightening always-on state retention storage

  • Taehwan Kim
  • Gyounghwan Hyun
  • Taewhan Kim

It is generally known that a considerable portion of the flip-flops in a circuit are ones with a mux-feedback loop (called a self-loop). These are the critical (inherently unavoidable) bottleneck in minimizing the total (always-on) storage size when allocating non-uniform multi-bit storage for retaining flip-flop states in power-gated circuits: every self-loop flip-flop must be replaced with a distinct retention flip-flop having at least one bit of storage, because there is no clue as to where the flip-flop's state comes from at wake-up, i.e., from the mux-feedback loop or from driving flip-flops other than itself. This work breaks the bottleneck by safely treating a large portion of the self-loop flip-flops as if they were flip-flops with no self-loop. Specifically, we design a novel mechanism of steady-state monitoring, operating for a few cycles just before sleep, on a partial set of self-loop flip-flops; the monitored flip-flops never need the expensive state retention storage, contributing a significant saving in the total size of the always-on state retention storage for power gating.

Pin-in-the-middle: an efficient block pin assignment methodology for block-level monolithic 3D ICs

  • Bon Woong Ku
  • Sung Kyu Lim

In a 2D design, the periphery of a block serves as the optimal pin location since blocks are placed side by side in a single placement layer. Monolithic 3D (M3D) integration relieves this boundary constraint by allowing vertical block communication between different tiers through a nm-scale 3D interconnect pitch. In this paper, we present a design methodology named Pin-in-the-Middle that assigns block pins in the middle of a block using commercial 2D P&R tools, enabling efficient block implementation and integration for two-tier block-level M3D ICs. Based on a 28 nm two-tier M3D hierarchical design, we show that our solution offers 13.6% and 24.7% energy-delay-product reductions compared to, respectively, an M3D design with pins assigned at the block boundaries and its 2D counterpart.

SESSION: ML related

GRLC: grid-based run-length compression for energy-efficient CNN accelerator

  • Yoonho Park
  • Yesung Kang
  • Sunghoon Kim
  • Eunji Kwon
  • Seokhyeong Kang

Convolutional neural networks (CNNs) require a huge amount of off-chip DRAM access, which accounts for most of their energy consumption. Compressing feature maps can reduce the energy consumption of DRAM access. However, previous compression methods show a poor compression ratio when feature maps are either extremely sparse or extremely dense. To improve the compression ratio efficiently, we exploit the spatial correlation and the distribution of non-zero activations in output feature maps. In this work, we propose grid-based run-length compression (GRLC) and implement hardware for it. Compared with a previous compression method [1], GRLC reduces DRAM access by 11% and energy consumption by 5% on average on VGG-16, ExtractionNet, and ResNet-18.
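
The principle behind run-length compression of feature maps is easy to see in code: sparse activation maps contain long zero runs that collapse into short (run-length, value) pairs. The sketch below shows plain zero run-length coding only; the grid partitioning and the sparsity/density adaptations that distinguish GRLC are not captured here, and the function names are ours.

```python
def rle_encode(vals):
    """Zero run-length coding: emit (zero_run, value) pairs; a trailing
    (run, 0) pair encodes zeros at the end with no following value."""
    out, run = [], 0
    for v in vals:
        if v == 0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    if run:
        out.append((run, 0))
    return out

def rle_decode(pairs):
    """Inverse of rle_encode."""
    vals = []
    for run, v in pairs:
        vals.extend([0] * run)
        if v != 0:
            vals.append(v)
    return vals
```

Dense maps are exactly where such schemes degrade (runs shrink to zero and each value pays the pair overhead), which motivates the paper's adaptive grid-based design.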

NS-KWS: joint optimization of near-sensor processing architecture and low-precision GRU for always-on keyword spotting

  • Qin Li
  • Sheng Lin
  • Changlu Liu
  • Yidong Liu
  • Fei Qiao
  • Yanzhi Wang
  • Huazhong Yang

Keyword spotting (KWS) is a crucial front-end module in a speech interaction system. The always-on KWS module detects input words and activates the energy-hungry, complex backend system when keywords are detected. The performance of the KWS module determines the standby performance of the whole system, and conventional KWS modules encounter a power-consumption bottleneck in the data conversion near the microphone sensor. In this paper, we propose an energy-efficient near-sensor processing architecture for always-on KWS, which enhances the continuous perception of the whole speech interaction system. By implementing keyword detection in the analog domain after the microphone sensor, this architecture avoids energy-consuming data converters and achieves faster speed than conventional realizations. In addition, we propose a lightweight gated recurrent unit (GRU) with negligible accuracy loss to ensure recognition performance. We also implement and fabricate the proposed KWS system in a 0.18 μm CMOS process. In the system-level evaluation, the hardware-software co-designed architecture achieves 65.6% energy savings and a 71× speedup over the state of the art.

Multi-channel precision-sparsity-adapted inter-frame differential data codec for video neural network processor

  • Yixiong Yang
  • Zhe Yuan
  • Fang Su
  • Fanyang Cheng
  • Zhuqing Yuan
  • Huazhong Yang
  • Yongpan Liu

Activation I/O traffic is a critical bottleneck of video neural network processors. Recent works adopt an inter-frame difference method to reduce activation size. However, current methods cannot fully adapt to the varying precision and sparsity of differential data. In this paper, we propose a multi-channel precision-sparsity-adapted codec, which separates the differential activations and encodes them in multiple channels. We analyze the best-adapted encoding for each channel and select the channel count with the best performance. A two-channel codec has been implemented in an ASIC accelerator that encodes/decodes activations in parallel. Experimental results show that our coding achieves a 2.2×-18.2× compression rate in three scenarios with no accuracy loss, and the hardware achieves 42×/174× improvements in speed and energy efficiency compared with a software codec.
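
The inter-frame difference step that all such codecs share is simple: adjacent video frames produce similar activations, so the frame-to-frame delta is sparse and low-magnitude, which is what makes it cheap to encode. A minimal sketch of that step (the multi-channel separation and per-channel code selection of the paper are not shown; the optional threshold is our illustrative addition):

```python
import numpy as np

def frame_delta(prev, cur, thresh=0.0):
    """Inter-frame difference of activation maps. With thresh > 0, tiny
    changes are zeroed (a lossy step that further increases sparsity)."""
    d = cur - prev
    if thresh:
        d = np.where(np.abs(d) <= thresh, 0.0, d)
    return d

def reconstruct(prev, delta):
    """Decoder side: previous frame's activations plus the delta."""
    return prev + delta
```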

SESSION: Non-ML low-power architecture

Slumber: static-power management for GPGPU register files

  • Devashree Tripathy
  • Hadi Zamani
  • Debiprasanna Sahoo
  • Laxmi N. Bhuyan
  • Manoranjan Satpathy

Leakage power dissipation has become one of the major concerns with technology scaling. The GPGPU register file has grown in size over the last decade to support the parallel execution of thousands of threads. Given that each thread has its own dedicated set of physical registers, these registers remain idle when the corresponding threads wait on long-latency operations. Existing research shows that the leakage energy consumption of the register file can be reduced by undervolting idle registers to a data-retentive low-leakage voltage (drowsy voltage), ensuring that data is not lost while not in use. In this paper, we develop a realistic model for determining the wake-up time of registers from various undervolting and power-gating modes. We then propose a hybrid energy-saving technique in which a combination of power gating and undervolting saves optimal energy depending on the idle periods of the registers, with a negligible performance penalty. Our simulations show that the hybrid technique yields 94% leakage energy savings in register files on average compared with conventional clock gating, and 9% higher leakage energy savings than the state-of-the-art technique.
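
The hybrid decision reduces to a breakeven comparison per idle period: drowsy mode leaks a little but retains data and wakes quickly, while power gating leaks almost nothing but pays a fixed entry/exit energy and loses state. The sketch below is our illustration of that trade-off; all quantities are illustrative energy units and the decision rule is a simplification of the paper's model, not its actual policy.

```python
def pick_mode(idle_cycles, drowsy_leak_per_cycle, gate_overhead, min_gate_idle):
    """Choose a per-register sleep mode for a predicted idle period.

    drowsy_leak_per_cycle: residual leakage energy per cycle at drowsy voltage.
    gate_overhead: fixed energy to enter/exit power gating (state refill
                   cost folded into the constant).
    min_gate_idle: shortest idle period for which gating wake-up latency
                   is tolerable."""
    e_drowsy = drowsy_leak_per_cycle * idle_cycles
    e_gate = gate_overhead
    if idle_cycles > min_gate_idle and e_gate < e_drowsy:
        return "power-gate"
    return "drowsy"
```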

STINT: selective transmission for low-energy physiological monitoring

  • Tao-Yi Lee
  • Khuong Vo
  • Wongi Baek
  • Michelle Khine
  • Nikil Dutt

Noninvasive and continuous physiological sensing enabled by novel wearable sensors is generating unprecedented diagnostic insights in many medical practices. However, the limited battery capacity of these wearable sensors poses a critical challenge to extending device lifetime and preventing the omission of informative events. In this work, we exploit the inherent sparsity of physiological signals to intelligently enable selective transmission of these signals and thereby improve the energy efficiency of wearable sensors. We propose STINT, a selective transmission framework that generates a sparse representation of the raw signal based on domain-specific knowledge and that can be integrated into a wide range of resource-constrained embedded sensing IoT platforms. STINT employs a neural network (NN) for selective transmission: the NN identifies and transmits only the informative parts of the raw signal, thereby achieving low-power operation. We validate STINT and establish its efficacy for energy-efficient IoT physiological monitoring by testing our framework on EcoBP, a novel miniaturized wireless continuous blood-pressure sensor. Early experimental results on the EcoBP device demonstrate that the STINT-enabled EcoBP sensor reduces sensor energy consumption by 14% compared to the native platform, with room for additional energy savings via complementary Bluetooth and wireless optimizations.
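
The selective-transmission idea can be illustrated with a simple energy-threshold selector standing in for STINT's NN: split the signal into windows, score each window, and transmit only the windows flagged informative along with their positions. Everything below (window size, the scoring rule, the names) is our hypothetical stand-in, not the paper's model.

```python
import numpy as np

def select_segments(signal, win=8, k=2.0):
    """Stand-in for an NN selector: mark a window informative when its
    energy exceeds k times the mean window energy, and return only those
    windows with their sample offsets."""
    wins = [signal[i:i + win] for i in range(0, len(signal) - win + 1, win)]
    energies = [float(np.sum(w ** 2)) for w in wins]
    mean_e = float(np.mean(energies))
    return [(i * win, w.tolist())
            for i, (w, e) in enumerate(zip(wins, energies))
            if e > k * mean_e]
```

Because physiological signals are mostly quiescent between events, the radio (the dominant energy consumer) only wakes for the few selected windows.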

Reconfigurable tiles of computing-in-memory SRAM architecture for scalable vectorization

  • R. Gauchi
  • V. Egloff
  • M. Kooli
  • J.-P. Noel
  • B. Giraud
  • P. Vivet
  • S. Mitra
  • H.-P. Charles

For big data applications, bringing computation into the memory is expected to drastically reduce data transfers, which can be achieved using recent Computing-In-Memory (CIM) concepts. To address kernels with larger memory data sets, we propose a reconfigurable tile-based architecture composed of Computational-SRAM (C-SRAM) tiles, each enabling arithmetic and logic operations within the memory. The proposed horizontal scalability and vertical data communication are combined to select the optimal vector width for maximum performance. These schemes allow vector-based kernels written for existing SIMD engines to run on the targeted CIM architecture. For architecture exploration, we propose an instruction-accurate simulation platform using SystemC/TLM to quantify the performance and energy of various kernels. For detailed performance evaluation, the platform is calibrated with data extracted from the placed-and-routed C-SRAM circuit, designed in 22 nm FDSOI technology. Compared to a 512-bit SIMD architecture, the proposed CIM architecture achieves EDP reductions of up to 60× for memory-bound kernels and 34× for compute-bound kernels.

SESSION: Memory technology and in-memory computing

FeFET-based low-power bitwise logic-in-memory with direct write-back and data-adaptive dynamic sensing interface

  • Mingyen Lee
  • Wenjun Tang
  • Bowen Xue
  • Juejian Wu
  • Mingyuan Ma
  • Yu Wang
  • Yongpan Liu
  • Deliang Fan
  • Vijaykrishnan Narayanan
  • Huazhong Yang
  • Xueqing Li

Compute-in-memory (CiM) is a promising method for mitigating the memory-wall problem in data-intensive applications. The proposed bitwise logic-in-memory (BLiM) targets data-intensive applications such as databases and data encryption. This work proposes a low-power BLiM approach using emerging nonvolatile ferroelectric FETs (FeFETs) with direct write-back and a data-adaptive dynamic sensing interface. In addition to general-purpose random-access memory, it also supports BLiM operations such as copy, NOT, NAND, XOR, and full adder (FA). The novel features of the proposed architecture include: (i) direct result write-back based on the remnant bitline charge, which avoids bitline sensing and charging operations; (ii) a fully dynamic sensing interface that needs no static reference current but adopts data-adaptive voltage references for certain multi-operand operations; and (iii) selective bitline charging from the wordline (instead of pre-charging all bitlines) to save power and enable direct write-back. Detailed BLiM operations and benchmarking against conventional approaches show the promise of low-power computing with the FeFET-based circuit techniques.

Enabling efficient ReRAM-based neural network computing via crossbar structure adaptive optimization

  • Chenchen Liu
  • Fuxun Yu
  • Zhuwei Qin
  • Xiang Chen

Resistive random-access memory (ReRAM) based accelerators have been widely studied to achieve neural network computing that is efficient in both speed and energy. Neural network optimization algorithms, such as sparsity-based ones, have been developed to achieve efficient computing on traditional architectures such as CPUs and GPUs. However, the efficiency gains of these algorithms are hindered when they are deployed on ReRAM-based accelerators because of the unique crossbar-structured computation, and a dedicated algorithm-hardware co-optimization for ReRAM-based architectures is still lacking. In this work, we propose an efficient neural network computing framework specialized for the crossbar-structured computation of ReRAM-based accelerators. The proposed framework includes crossbar-specific feature map pruning and adaptive neural network deployment. Experimental results show our design improves computing accuracy by 9.1% compared with state-of-the-art sparse neural networks. On a well-known ReRAM-based DNN accelerator, the proposed framework demonstrates up to 1.4× speedup, 4.3× power efficiency, and 4.4× area savings.
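
Why scattered sparsity does not help a crossbar can be shown in a few lines: a zero weight still occupies a crossbar cell unless an entire row or column group is zero, in which case that slice of the array can be powered off or skipped. The sketch below prunes whole column groups by L1 score; it is a generic structured-pruning illustration of the constraint, not the paper's specific feature-map pruning algorithm, and the names and group size are ours.

```python
import numpy as np

def crossbar_group_prune(W, group, keep):
    """Zero out entire groups of `group` adjacent columns, keeping the
    `keep` groups with the largest L1 norm. Whole-group zeros map onto
    crossbar slices that can be switched off, unlike scattered zeros."""
    n_groups = W.shape[1] // group
    scores = np.array([np.abs(W[:, g * group:(g + 1) * group]).sum()
                       for g in range(n_groups)])
    keep_set = set(np.argsort(scores)[::-1][:keep].tolist())
    Wp = W.copy()
    for g in range(n_groups):
        if g not in keep_set:
            Wp[:, g * group:(g + 1) * group] = 0.0
    return Wp
```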

Embedding error correction into crossbars for reliable matrix vector multiplication using emerging devices

  • Qiuwen Lou
  • Tianqi Gao
  • Patrick Faley
  • Michael Niemier
  • X. Sharon Hu
  • Siddharth Joshi

Emerging memory devices are an attractive choice for implementing very energy-efficient in-situ matrix-vector multiplication (MVM) for use in intelligent edge platforms. Despite their great potential, device-level non-idealities have a large impact on the application-level accuracy of deep neural network (DNN) inference. We introduce a low-density parity-check (LDPC) code based approach to correct non-ideality-induced errors encountered during in-situ MVM. We first encode the weights using error-correcting codes (ECC), perform MVM on the encoded weights, and then decode the result after in-situ MVM. We show that partial encoding of the weights can maintain DNN inference accuracy while minimizing the overhead of LDPC decoding. Within two iterations, our ECC method recovers 60% of the accuracy in MVM computations when 5% of the underlying computations are error-prone. Compared to an alternative ECC method that uses arithmetic codes, using LDPC improves AlexNet classification accuracy by 0.8% at iso-energy. Similarly, at iso-energy, we demonstrate a 54% improvement in CIFAR-10 classification accuracy with VGG-11 compared to a strategy that uses 2× redundancy in weights. Further design-space explorations demonstrate that we can leverage the resilience endowed by ECC to improve energy efficiency by reducing the operating voltage: a 3.3× energy-efficiency improvement in DNN inference on the CIFAR-10 dataset with VGG-11 is achieved at iso-accuracy.

SESSION: Low power system and NVM

A comprehensive methodology to determine optimal coherence interfaces for many-accelerator SoCs

  • Kshitij Bhardwaj
  • Marton Havasi
  • Yuan Yao
  • David M. Brooks
  • José Miguel Hernández-Lobato
  • Gu-Yeon Wei

Modern systems-on-chip (SoCs) include not only general-purpose CPUs but also specialized hardware accelerators. Typically, there are three coherence-model choices for integrating an accelerator with the memory hierarchy: no coherence, coherence with the last-level cache (LLC), and private-cache based full coherence. However, there has been very limited research on finding which coherence models are optimal for the accelerators of a complex many-accelerator SoC. This paper focuses on determining a cost-aware coherence interface for an SoC and its target application: finding the best coherence models for the accelerators that optimize their power and performance, considering both workload characteristics and system-level contention. A novel comprehensive methodology is proposed that uses Bayesian optimization to efficiently find cost-aware coherence interfaces for SoCs modeled using the gem5-Aladdin architectural simulator. For a complete analysis, gem5-Aladdin is extended to support LLC coherence in addition to the already-supported no coherence and full coherence. For a heterogeneous SoC targeting applications with varying amounts of accelerator-level parallelism, the proposed framework rapidly finds cost-aware coherence interfaces that show significant performance and power benefits over the other commonly used coherence interfaces.

DidaSel: dirty data based selection of VC for effective utilization of NVM buffers in on-chip interconnects

  • Khushboo Rani
  • Sukarn Agarwal
  • Hemangee K. Kapoor

In a multi-core system, communication across cores is managed by an on-chip interconnect called the Network-on-Chip (NoC). The NoC, however, introduces limitations such as high communication delay and high network power consumption; in particular, the buffers of the NoC routers consume a considerable amount of leakage power. This paper attempts to reduce leakage power consumption by using Non-Volatile Memory (NVM) technology-based buffers. NVM technology has the advantages of higher density and low leakage but suffers from costly write operations and weaker write endurance. These characteristics impact the total network power consumption, network latency, and lifetime of the router as a whole.

In this paper, we propose a write reduction technique, which is based on dirty flits present in write-back data packets. The method also suggests a dirty flit based Virtual Channel (VC) allocation technique that distributes writes in NVM technology-based VCs to improve the lifetime of NVM buffers.

The experimental evaluation on a full-system simulator shows that the proposed policy obtains a 53% reduction in write-back flits, which results in 27% fewer total network flits on average. These reductions lead to a significant decrease in total and dynamic network power consumption. The policy also shows remarkable improvement in buffer lifetime.

WELCOMF: wear leveling assisted compression using frequent words in non-volatile main memories

  • Arijit Nath
  • Hemangee K. Kapoor

Emerging non-volatile memories such as Phase Change Memory (PCM) and Resistive RAM are projected as potential replacements for traditional DRAM-based main memories. However, limited write endurance and high write energy limit their chances of adoption as a mainstream main-memory standard.

In this paper, we propose a word-level compression scheme called COMF to reduce bit flips in PCMs by removing the most repeated words from cache lines before writing them into memory. We then propose an intra-line wear-leveling technique called WELCOMF that extends COMF to improve lifetime. Experimental results show that the proposed technique improves lifetime by 75% and reduces bit flips and energy by 45% and 46%, respectively, over the baseline.
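
The word-level compression idea can be illustrated with a small sketch (not the authors' implementation; the dictionary size and encoding details are hypothetical): the most frequent words in recent write traffic go into a small dictionary, and matching words in a cache line are replaced by short indices, shrinking the data actually written to PCM:

```python
from collections import Counter

def build_dictionary(cache_lines, k=4):
    """Pick the k most frequent words across recent write-back traffic."""
    counts = Counter(w for line in cache_lines for w in line)
    return [w for w, _ in counts.most_common(k)]

def compress_line(line, dictionary):
    """Replace dictionary hits with ('idx', i); keep other words verbatim.
    Writing a short index instead of a full word cuts bit flips in PCM."""
    return [('idx', dictionary.index(w)) if w in dictionary else ('raw', w)
            for w in line]

lines = [[0x0, 0x0, 0xFF, 0x1234], [0x0, 0xFF, 0xFF, 0xBEEF]]
d = build_dictionary(lines, k=2)
assert set(d) == {0x0, 0xFF}           # the two most repeated words
compressed = compress_line(lines[0], d)
assert compressed[0][0] == 'idx' and compressed[3] == ('raw', 0x1234)
```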

SESSION: ML-based low-power architecture

Low-power object counting with hierarchical neural networks

  • Abhinav Goel
  • Caleb Tung
  • Sara Aghajanzadeh
  • Isha Ghodgaonkar
  • Shreya Ghosh
  • George K. Thiruvathukal
  • Yung-Hsiang Lu

Deep Neural Networks (DNNs) achieve state-of-the-art accuracy in many computer vision tasks, such as object counting. Object counting takes two inputs, an image and an object query, and reports the number of occurrences of the queried object. To achieve high accuracy, DNNs require billions of operations, making them difficult to deploy on resource-constrained, low-power devices. Prior work shows that a significant number of DNN operations are redundant and can be eliminated without affecting accuracy. To reduce these redundancies, we propose a hierarchical DNN architecture for object counting. This architecture uses a Region Proposal Network (RPN) to propose regions-of-interest (RoIs) that may contain the queried objects. A hierarchical classifier then efficiently finds the RoIs that actually contain them. The hierarchy contains groups of visually similar object categories, and small DNNs at each node of the hierarchy classify between these groups. The RoIs are processed incrementally by the hierarchical classifier: if the object in an RoI is in the same group as the queried object, the next DNN in the hierarchy processes the RoI further; otherwise, the RoI is discarded. By using a few small DNNs to process each image, this method reduces the memory requirement, inference time, energy consumption, and number of operations, with negligible accuracy loss compared with existing techniques.
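
The incremental discard logic described above can be sketched as a walk down a group hierarchy. Here the small per-node DNNs are replaced by trivial stubs that read a ground-truth label off the RoI; the structure, not the classifiers, is the point:

```python
class Node:
    """One hierarchy node: the object categories under it plus a classifier stub."""
    def __init__(self, labels, children=(), classify=None):
        self.labels, self.children = labels, list(children)
        self.classify = classify

# Toy hierarchy: {cat, dog} vs {car, bus}. RoIs here are just ground-truth
# labels, and the stubs read them off; a real system runs small DNNs instead.
animals = Node({'cat', 'dog'}, classify=lambda roi: roi)
vehicles = Node({'car', 'bus'}, classify=lambda roi: roi)
root = Node({'cat', 'dog', 'car', 'bus'}, children=[animals, vehicles],
            classify=lambda roi: animals if roi in animals.labels else vehicles)

def count_objects(rois, query, root):
    """Walk each RoI down the hierarchy; discard it at the first node whose
    group no longer contains the queried object."""
    count = 0
    for roi in rois:
        node = root
        while node.children:
            branch = node.classify(roi)
            if query not in branch.labels:    # RoI left the query's subtree
                break
            node = branch
        else:                                 # reached a leaf on the query's path
            count += int(node.classify(roi) == query)
    return count

assert count_objects(['cat', 'dog', 'car', 'cat'], 'cat', root) == 2
```

The early exit is what saves work: an RoI whose group diverges from the query's branch never reaches the deeper, more specific classifiers.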

Integrating event-based dynamic vision sensors with sparse hyperdimensional computing: a low-power accelerator with online learning capability

  • Michael Hersche
  • Edoardo Mello Rella
  • Alfio Di Mauro
  • Luca Benini
  • Abbas Rahimi

We propose to embed features extracted from event-driven dynamic vision sensors into binary sparse representations in hyperdimensional (HD) space for regression. This embedding compresses events generated across 346×260 differential pixels to a sparse 8160-bit vector by applying random activation functions. The sparse representation not only simplifies inference, but also enables online learning with the same memory footprint. Specifically, it allows efficient updates by retaining binary vector components over the course of online learning that cannot be otherwise achieved with dense representations demanding multibit vector components. We demonstrate online learning capability: using estimates and confidences of an initial model trained with only 25% of data, our method continuously updates the model for the remaining 75% of data, resulting in a close match with accuracy obtained with an oracle model on ground truth labels. When mapped on an 8-core accelerator, our method also achieves lower error, latency, and energy compared to other sparse/dense alternatives. Furthermore, it is 9.84× more energy-efficient and 6.25× faster than an optimized 9-layer perceptron with comparable accuracy.

FTRANS: energy-efficient acceleration of transformers using FPGA

  • Bingbing Li
  • Santosh Pandey
  • Haowen Fang
  • Yanjun Lyv
  • Ji Li
  • Jieyang Chen
  • Mimi Xie
  • Lipeng Wan
  • Hang Liu
  • Caiwen Ding

In natural language processing (NLP), the “Transformer” architecture was proposed as the first transduction model relying entirely on self-attention mechanisms, without sequence-aligned recurrent neural networks (RNNs) or convolution, and it achieved significant improvements on sequence-to-sequence tasks. However, the intensive computation and storage these pre-trained language representations introduce have impeded their deployment on computation- and memory-constrained devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms thanks to its high parallelism and low latency, but trained Transformer models are still too large to fit on an FPGA fabric. In this paper, we propose an efficient acceleration framework, Ftrans, for transformer-based large-scale language representations. Our framework includes an enhanced block-circulant matrix (BCM)-based weight representation that enables model compression of large-scale language representations at the algorithm level with little accuracy degradation, and an acceleration design at the architecture level. Experimental results show that our proposed framework reduces the model size of NLP models by up to 16×. Our FPGA design achieves 27.07× and 81× improvements in performance and energy efficiency compared to a CPU, and up to 8.80× improvement in energy efficiency compared to a GPU.
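
The compression behind a BCM-based weight representation can be illustrated in miniature: each weight block is a circulant matrix, fully determined by its first row, so a block matrix-vector product reduces to a circular convolution. A minimal sketch (pure Python; real designs compute the convolution with FFTs):

```python
def circ_matvec(c, x):
    """y = C @ x for the circulant matrix C whose first row is c:
    C[i][j] = c[(j - i) % n], so each block needs n stored values, not n*n."""
    n = len(c)
    return [sum(c[(j - i) % n] * x[j] for j in range(n)) for i in range(n)]

# Cross-check against the explicitly materialized circulant matrix.
c, x = [1, 2, 3], [4, 5, 6]
C = [c[-i:] + c[:-i] for i in range(3)]   # rows are right-rotations of c
explicit = [sum(C[i][j] * x[j] for j in range(3)) for i in range(3)]
assert circ_matvec(c, x) == explicit == [32, 29, 29]
```

Storing n values per n×n block is the source of the up-to-16× model-size reduction reported above, at the cost of constraining each block to circulant structure.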

POSTER SESSION: Poster papers

BrainWave: an energy-efficient EEG monitoring system – evaluation and trade-offs

  • Barry de Bruin
  • Kamlesh Singh
  • Jos Huisken
  • Henk Corporaal

This paper presents the design and evaluation of an energy-efficient seizure detection system for emerging EEG-based monitoring applications, such as non-convulsive epileptic seizure detection and Freezing-of-Gait (FoG) detection. As part of the BrainWave system, a BrainWave processor for flexible and energy-efficient signal processing is designed. The key system design parameters, including algorithmic optimizations, feature offloading and near-threshold computing are evaluated in this work. The BrainWave processor is evaluated while executing a complex EEG-based epileptic seizure detection algorithm. In a 28-nm FDSOI technology, 325 μJ per classification at 0.9 V and 290 μJ at 0.5 V are achieved using an optimized software-only implementation. By leveraging a Coarse-Grained Reconfigurable Array (CGRA), 160 μJ and 135 μJ are obtained, respectively, while maintaining a high level of flexibility. Near-threshold computing combined with CGRA acceleration leads to an energy reduction of up to 59%, or 55% including idle-time overhead.

QUANOS: adversarial noise sensitivity driven hybrid quantization of neural networks

  • Priyadarshini Panda

Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial attacks, wherein a model is fooled by applying slight perturbations to the input. In this paper, we investigate the use of quantization to potentially resist adversarial attacks. Several recent studies have reported remarkable results in reducing the energy requirement of a DNN through quantization. However, no prior work has considered the relationship between the adversarial sensitivity of a DNN and its effect on quantization. We propose QUANOS, a framework that performs layer-specific hybrid quantization based on Adversarial Noise Sensitivity (ANS). We identify a novel noise stability metric (ANS) for DNNs, i.e., the sensitivity of each layer’s computation to adversarial noise. ANS allows for a principled way of determining the optimal bit-width per layer that yields adversarial robustness as well as energy efficiency with minimal loss in accuracy. Essentially, QUANOS assigns layer significance based on each layer’s contribution to adversarial perturbation and scales the precision of the layers accordingly. We evaluate the benefits of QUANOS on precision-scalable Multiply and Accumulate (MAC) hardware architectures with data gating and subword parallelism capabilities. Our experiments on the CIFAR10 and CIFAR100 datasets show that QUANOS outperforms a homogeneously quantized 8-bit precision baseline in terms of adversarial robustness (3–4% higher) while yielding improved compression (>5×) and energy savings (>2×) at iso-accuracy. At iso-compression rate, QUANOS yields significantly higher adversarial robustness (>10%) than a similarly sized baseline against strong white-box attacks. We also find that combining QUANOS with state-of-the-art defense methods outperforms the state-of-the-art in robustness (~5–16% higher) against very strong attacks.
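
One plausible way to realize "scaling precision with sensitivity" is a monotone mapping from per-layer ANS scores to bit-widths; the linear rule below is an illustrative assumption, not the paper's exact assignment:

```python
def assign_bitwidths(sensitivities, b_min=2, b_max=8):
    """Scale each layer's precision with its adversarial noise sensitivity:
    the most sensitive layer gets b_max bits, the least sensitive gets b_min."""
    lo, hi = min(sensitivities), max(sensitivities)
    span = (hi - lo) or 1.0          # avoid divide-by-zero when all scores equal
    return [round(b_min + (s - lo) / span * (b_max - b_min))
            for s in sensitivities]

# Three layers with low / high / medium sensitivity to adversarial noise.
assert assign_bitwidths([0.1, 0.9, 0.5]) == [2, 8, 5]
```

The hybrid result is what the abstract calls layer-specific quantization: cheap layers run at low precision while noise-critical layers keep enough bits to preserve robustness.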

Pre-layout clock tree estimation and optimization using artificial neural network

  • Sunwha Koh
  • Yonghwi Kwon
  • Youngsoo Shin

Clock tree synthesis (CTS) takes place at a very late design stage, so most of the time power consumption is analyzed while a circuit does not yet contain a clock tree. We build an artificial neural network (ANN) to estimate the number of clock buffers and apply it to each clock gater, as well as the clock source, in the ideal clock network. A clock structure is then constructed using the estimated clock buffers. Experiments on several test circuits demonstrate very high accuracy, with an average clock power estimation error below 5%. The proposed method also allows us to find the minimum possible number of clock buffers under optimized clock parameters (e.g., target skew and clock transition time). This minimum is found by a binary search; at each step of the search, the trained ANN is used to find clock parameters for the target number of buffers. Using the proposed clock parameter optimization, we found that the number of buffers in the clock network can be reduced by 31% on average.
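
The buffer-minimization step can be sketched as a standard lower-bound binary search, with the trained ANN and its clock-parameter search abstracted into a stub feasibility check (all names here are hypothetical):

```python
def min_feasible_buffers(lo, hi, feasible):
    """Smallest buffer count in [lo, hi] for which the (ANN-guided) check
    finds clock parameters meeting the skew/transition targets.
    Assumes feasibility is monotone in the buffer count and hi is feasible."""
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid          # mid works; try fewer buffers
        else:
            lo = mid + 1      # mid is too few; need more buffers
    return lo

# Stub standing in for "trained ANN finds valid clock parameters":
# suppose any count of at least 137 buffers admits a valid solution.
assert min_feasible_buffers(1, 1000, lambda n: n >= 137) == 137
```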

GC-eDRAM design using hybrid FinFET/NC-FinFET

  • Ramin Rajaei
  • Yen-Kai Lin
  • Sayeef Salahuddin
  • Michael Niemier
  • X. Sharon Hu

Gain cell embedded DRAMs (GC-eDRAMs) are a potential alternative to conventional static random access memories thanks to attractive advantages such as high density, low leakage, and two-ported operation. As CMOS technology nodes scale down, the design of GC-eDRAM at deeply scaled nanometer nodes becomes more challenging: deeply scaled technology nodes suffer from high leakage currents, resulting in low data retention times (DRTs) for GC-eDRAMs. Negative-capacitance FinFETs (NC-FinFETs) are a promising emerging device for ultra-low-power VLSI design. Due to their lower leakage currents, NC-FinFETs can facilitate GC-eDRAM designs with higher DRTs. We show that although NC-FinFETs have lower OFF currents and higher ION/IOFF ratios, their ON current is lower than that of FinFETs by approximately 30%, which results in lower performance. To benefit from the potential power efficiencies and high DRTs of NC-FinFETs without sacrificing performance, we propose hybrid FinFET/NC-FinFET configurations for several prior 2T, 3T, and 4T GC-eDRAM cells. Simulations based on a 14nm experimentally calibrated NC-FinFET model suggest that the hybrid designs offer up to 96.8% and 86.3% improvements in DRT and static power consumption, respectively, when compared to the FinFET implementation. They also offer up to 47% read-delay improvement over the NC-FinFET design. We also study the effects of voltage scaling on the DRT and refresh energy of the proposed GC-eDRAM cells. The associated simulation results reveal that, across different supply voltages, the proposed hybrid 4T GC-eDRAM cell offers up to 370× less refresh energy when compared to the other designs.

SAOU: safe adaptive overclocking and undervolting for energy-efficient GPU computing

  • Hadi Zamani
  • Devashree Tripathy
  • Laxmi Bhuyan
  • Zizhong Chen

The current trend of ever-increasing performance in scientific applications comes with tremendous growth in energy consumption. In this paper, we present a framework for GPU applications that reduces energy consumption through Safe Adaptive Overclocking and Undervolting (SAOU) without sacrificing performance. The idea is to increase the frequency beyond the safe frequency limit f_safeMax and undervolt below the safe voltage limit V_safeMin to obtain maximum energy savings. Since such overclocking and undervolting may give rise to faults, we employ an enhanced checkpoint-recovery technique to cover the possible errors. Empirically, we explore different errors and derive a fault model that can set the undervolting and overclocking levels for maximum energy saving. We target the cuBLAS Matrix Multiplication (cuBLAS-MM) kernel for error correction using the checkpoint-and-recovery (CR) technique as an example scientific application. In the case of cuBLAS, SAOU achieves up to 22% energy reduction through undervolting and overclocking without sacrificing performance.
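
Generically, a checkpoint-recovery loop wraps each risky run at the aggressive voltage/frequency point: checkpoint, execute, verify, and roll back on a detected fault. A simplified sketch (the paper applies this to cuBLAS-MM on a GPU; the kernel, check, and names here are stand-ins):

```python
import copy
import random

def run_with_recovery(state, kernel, check, max_retries=5):
    """Execute `kernel` at an aggressive V/f point; on a failed `check`,
    restore the checkpoint and retry."""
    for _ in range(max_retries):
        ckpt = copy.deepcopy(state)   # checkpoint before the risky run
        out = kernel(state)
        if check(out):
            return out                # verified: keep the result
        state = ckpt                  # fault detected: roll back
    raise RuntimeError("persistent faults; back off the overclock")

# Toy "faulty kernel": occasionally perturbs its result; the check is a
# known-answer test standing in for the paper's error detection.
random.seed(0)
flaky_sum = lambda xs: sum(xs) + (1 if random.random() < 0.5 else 0)
assert run_with_recovery([1, 2, 3], flaky_sum, lambda y: y == 6) == 6
```

The energy argument is that retries are rare enough that the savings from running beyond the safe V/f point outweigh the occasional rollback.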

SparTANN: sparse training accelerator for neural networks with threshold-based sparsification

  • Hyeonuk Sim
  • Jooyeon Choi
  • Jongeun Lee

While sparsity has been exploited in many inference accelerators, little work has been done on training accelerators. Exploiting sparsity in training accelerators involves multiple issues, including where to find sparsity, how to exploit it, and how to create more of it. In this paper we present a novel sparse training architecture that can exploit the sparsity in gradient tensors in both back propagation and weight update computation. We also propose a single-pass sparsification algorithm, a hardware-friendly version of a recently proposed sparse training algorithm, that can aggressively create additional sparsity during training. Our experimental results using large networks such as AlexNet and GoogLeNet demonstrate that our sparse training architecture can accelerate convolution-layer training time by 4.20–8.88× over baseline dense training without accuracy loss, and further increase training speed by 7.30–11.87× over the baseline with minimal accuracy loss.
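
Threshold-based sparsification in a single pass can be sketched as one sweep that clamps small gradient magnitudes to zero (a simplified stand-in for the paper's hardware-friendly algorithm; the threshold choice is an assumption):

```python
def sparsify(grads, threshold):
    """One pass over a gradient tensor: zero out entries whose magnitude
    falls below the threshold, creating extra sparsity for the accelerator."""
    return [g if abs(g) >= threshold else 0.0 for g in grads]

g = [0.5, -0.01, 0.002, -0.3, 0.0, 0.04]
s = sparsify(g, threshold=0.05)
assert s == [0.5, 0.0, 0.0, -0.3, 0.0, 0.0]
density = sum(x != 0 for x in s) / len(s)
assert density == 1 / 3   # only 2 of 6 gradients survive and need computing
```

The surviving nonzeros are what the back-propagation and weight-update datapaths actually have to process, which is where the reported speedups come from.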

BLINK: bit-sparse LSTM inference kernel enabling efficient calcium trace extraction for neurofeedback devices

  • Zhe Chen
  • Garrett J. Blair
  • Hugh T. Blair
  • Jason Cong

Miniaturized fluorescent calcium imaging microscopes are widely used for monitoring the activity of large populations of neurons in freely behaving animals in vivo. Conventional calcium image analyses extract calcium traces by iterative, bulk image processing and struggle to meet the power and latency requirements of neurofeedback devices. In this paper, we propose a calcium image processing pipeline based on a bit-sparse long short-term memory (LSTM) inference kernel (BLINK) for efficient calcium trace extraction. It largely reduces power and latency while retaining trace extraction accuracy. We implemented the customized pipeline on the Ultra96 platform; it can extract calcium traces from up to 1024 cells with sub-ms latency on a single FPGA device. We also designed the BLINK circuits in a 28-nm technology. Evaluation shows that the proposed bit-sparse representation reduces the circuit area by 38.7% and the power consumption by 38.4% without accuracy loss. The BLINK circuits achieve 410 pJ/inference, a 6293× and 52.4× gain in energy efficiency compared to evaluation on a high-performance CPU and GPU, respectively.

BiasP: a DVFS based exploit to undermine resource allocation fairness in linux platforms

  • Harshit Kumar
  • Nikhil Chawla
  • Saibal Mukhopadhyay

Dynamic Voltage and Frequency Scaling (DVFS) plays an integral role in reducing the energy consumption of mobile devices while meeting the targeted performance requirements. We examine the security obliviousness of CPUFreq, the DVFS framework in Linux-kernel based systems. Since Linux-kernel based operating systems are present in a wide array of applications, the high-level CPUFreq policies are designed to be platform-independent. Using these policies, we present the BiasP exploit, which restricts the allocation of CPU resources to a set of targeted applications, thereby degrading their performance. The exploit involves detecting the execution of instructions on the CPU core pertinent to the targeted applications, and thereafter using CPUFreq policies to limit the CPU resources available to those instructions. We demonstrate the practicality of the exploit by operating it on a commercial smartphone running Android OS on a Linux kernel. We can successfully degrade the User Interface (UI) performance of the targeted applications, increasing the frame processing time and the number of dropped frames by up to 200% and 947%, respectively, for animations belonging to the targeted applications. We see a reduction of up to 66% in the number of retired instructions of the targeted applications. Furthermore, we propose a robust detector that is capable of detecting exploits aimed at undermining resource-allocation fairness through malicious use of the DVFS framework.

Resiliency analysis and improvement of variational quantum factoring in superconducting qubit

  • Ling Qiu
  • Mahabubul Alam
  • Abdullah Ash-Saki
  • Swaroop Ghosh

Variational algorithms using the Quantum Approximate Optimization Algorithm (QAOA) can solve the prime factorization problem on near-term noisy quantum computers. Conventional Variational Quantum Factoring (VQF) requires a large number of 2-qubit gates (especially for factoring large numbers), resulting in deep circuits. The output quality of a deep quantum circuit is degraded by errors, limiting the computational power of quantum computing. In this paper, we explore various transformations to optimize the QAOA circuit for integer factorization. We propose two criteria to select the optimal quantum circuit that can improve the noise resiliency of VQF.

HIPE-MAGIC: a technology-aware synthesis and mapping flow for highly parallel execution of memristor-aided LoGIC

  • Arash Fayyazi
  • Amirhossein Esmaili
  • Massoud Pedram

Recent efforts to find novel computing paradigms that meet today’s design requirements have given rise to a new trend of processing-in-memory relying on non-volatile memories. In this paper, we present HIPE-MAGIC, a technology-aware synthesis and mapping flow for highly parallel execution of memristor-based logic. Our framework is built upon two fundamental contributions: balancing techniques during logic synthesis, mainly targeting the benefits of the parallelism offered by memristive crossbar arrays (MCAs), and an efficient technology mapping framework that maximizes the performance and area efficiency of the memristor-based logic. Our experimental evaluations across several benchmark suites demonstrate the superior performance of HIPE-MAGIC in terms of throughput and energy efficiency compared to recently developed synthesis and mapping flows targeting MCAs, as well as to conventional CPU computing.

SHEARer: highly-efficient hyperdimensional computing by software-hardware enabled multifold approximation

  • Behnam Khaleghi
  • Sahand Salamat
  • Anthony Thomas
  • Fatemeh Asgarinejad
  • Yeseong Kim
  • Tajana Rosing

Hyperdimensional computing (HD) is an emerging paradigm for machine learning based on the evidence that the brain computes on high-dimensional, distributed representations of data. The main operation of HD is encoding, which transfers the input data to hyperspace by mapping each input feature to a hypervector, followed by a bundling procedure that adds up the hypervectors to realize the encoding hypervector. The operations of HD are simple and highly parallelizable, but the large number of operations hampers the efficiency of HD in the embedded domain. In this paper, we propose SHEARer, an algorithm-hardware co-optimization to improve the performance and energy consumption of HD computing. We gain insight from a prudent scheme of approximating the hypervectors that, thanks to the error resiliency of HD, has minimal impact on accuracy while providing high prospects for hardware optimization. Unlike previous works that generate the encoding hypervectors in full precision and then perform ex-post quantization, we compute the encoding hypervectors in an approximate manner that saves resources yet affords high accuracy. We also propose a novel FPGA architecture that achieves striking performance through massive parallelism with low power consumption. Moreover, we develop a software framework that enables training HD models by emulating the proposed approximate encodings. The FPGA implementation of SHEARer achieves an average throughput boost of 104,904× (15.7×) and energy savings of up to 56,044× (301×) compared to state-of-the-art encoding methods implemented on a Raspberry Pi 3 (GeForce GTX 1080 Ti) using practical machine learning datasets.
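
The encoding-plus-bundling pipeline that SHEARer approximates can be sketched in exact (unapproximated) form as follows, using toy-sized dense bipolar hypervectors with sign binarization; the dimensionality and value-weighting scheme are illustrative assumptions, not the paper's configuration:

```python
import random

D = 256                 # hypervector dimensionality (toy-sized; real HD uses thousands)
random.seed(1)

def rand_hv():
    """Random bipolar hypervector serving as the item memory entry for one feature."""
    return [random.choice((-1, 1)) for _ in range(D)]

def encode(features, item_memory):
    """Bundle: weight each feature's hypervector by the feature value,
    accumulate element-wise, then binarize by sign."""
    acc = [0.0] * D
    for i, v in enumerate(features):
        hv = item_memory[i]
        for d in range(D):
            acc[d] += v * hv[d]
    return [1 if a >= 0 else -1 for a in acc]

item_memory = [rand_hv() for _ in range(4)]
hv = encode([0.9, 0.1, 0.0, 0.5], item_memory)
assert len(hv) == D and set(hv) <= {-1, 1}
```

Every accumulation above is an independent per-dimension add, which is exactly the parallelism (and the approximation opportunity) that the hardware design exploits.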

Implementing binary neural networks in memory with approximate accumulation

  • Saransh Gupta
  • Mohsen Imani
  • Hengyu Zhao
  • Fan Wu
  • Jishen Zhao
  • Tajana Šimunić Rosing

Processing in-memory (PIM) has shown great potential to accelerate the inference tasks of binarized neural networks (BNNs) by reducing data movement between processing units and memory. However, existing PIM architectures require analog/mixed-signal circuits that do not scale with the CMOS technology. On the contrary, we propose BitNAP (Binarized neural network acceleration with in-memory ThreSholding), which performs optimization at operation, peripheral, and architecture levels for an efficient BNN accelerator. BitNAP supports row-parallel bitwise operations in crossbar memory by exploiting the switching of 1-bit bipolar resistive devices and a unique hybrid tunable thresholding operation. In order to reduce the area overhead of sensing-based operations, BitNAP presents a memory sense amplifier sharing scheme and also, a novel operation pipelining to reduce the latency overhead of sharing. We evaluate the efficiency of BitNAP on the MNIST and ImageNet datasets using popular neural networks. BitNAP is on average 1.24× (10.7×) faster and 185.6× (10.5×) more energy-efficient as compared to the state-of-the-art PIM accelerator for simple (complex) networks.