ISLPED’22 TOC
ISLPED ’22: ACM/IEEE International Symposium on Low Power Electronics and Design
SESSION: Session 1: Energy-efficient and Robust Neural Networks
Examining the Robustness of Spiking Neural Networks on Non-ideal Memristive Crossbars
- Abhiroop Bhattacharjee
- Youngeun Kim
- Abhishek Moitra
- Priyadarshini Panda
Spiking Neural Networks (SNNs) have recently emerged as the low-power alternative to Artificial Neural Networks (ANNs) owing to their asynchronous, sparse, and binary information processing. To improve the energy-efficiency and throughput, SNNs can be implemented on memristive crossbars where Multiply-and-Accumulate (MAC) operations are realized in the analog domain using emerging Non-Volatile-Memory (NVM) devices. Despite the compatibility of SNNs with memristive crossbars, little attention has been paid to studying the effect of intrinsic crossbar non-idealities and stochasticity on the performance of SNNs. In this paper, we conduct a comprehensive analysis of the robustness of SNNs on non-ideal crossbars. We examine SNNs trained via learning algorithms such as surrogate gradient and ANN-SNN conversion. Our results show that repetitive crossbar computations across multiple time-steps induce error accumulation, resulting in a huge performance drop during SNN inference. We further show that SNNs trained with a smaller number of time-steps achieve better accuracy when deployed on memristive crossbars.
Identifying Efficient Dataflows for Spiking Neural Networks
- Deepika Sharma
- Aayush Ankit
- Kaushik Roy
Deep feed-forward Spiking Neural Networks (SNNs) trained using appropriate learning algorithms have been shown to match the performance of state-of-the-art Artificial Neural Networks (ANNs). The inputs to an SNN layer are 1-bit spikes distributed over several timesteps. In addition, along with the standard artificial neural network (ANN) data structures, SNNs require one additional data structure – the membrane potential (Vmem) for each neuron which is updated every timestep. Hence, the dataflow requirements for energy-efficient hardware implementation of SNNs can be different from the standard ANNs. In this paper, we propose optimal dataflows for deep spiking neural network layers. To evaluate the energy and latency of different dataflows, we considered three hardware architectures with varying on-chip resources to represent a class of spatial accelerators. We developed a set of rules leading to optimum dataflow for SNNs that achieve more than 90% improvement in Energy-Delay Product (EDP) compared to the baseline for some workloads and architectures.
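The per-timestep membrane potential update described in this abstract can be sketched as follows. This is a minimal illustration assuming a leaky integrate-and-fire neuron with a hard reset; the function names, the leak factor, and the threshold are hypothetical and are not taken from the paper.

```python
def lif_layer_step(v_mem, spikes_in, weights, threshold=1.0, leak=0.9):
    """Advance one layer of leaky integrate-and-fire neurons by one timestep.

    v_mem:     membrane potentials, one per output neuron (the extra state)
    spikes_in: 1-bit input spikes (0 or 1), one per input
    weights:   weights[j][i] connects input i to output neuron j
    """
    new_v, spikes_out = [], []
    for j, v in enumerate(v_mem):
        # Inputs are binary, so the accumulate step only sums selected weights.
        current = sum(w for w, s in zip(weights[j], spikes_in) if s)
        v = leak * v + current
        if v >= threshold:
            spikes_out.append(1)
            v = 0.0  # hard reset after a spike (an assumption of this sketch)
        else:
            spikes_out.append(0)
        new_v.append(v)
    return new_v, spikes_out


def run_snn_layer(spike_train, weights, threshold=1.0, leak=0.9):
    """Feed a sequence of input spike vectors through the layer."""
    v_mem = [0.0] * len(weights)
    outputs = []
    for spikes_in in spike_train:
        # Vmem is read and written once per timestep for every neuron.
        v_mem, out = lif_layer_step(v_mem, spikes_in, weights, threshold, leak)
        outputs.append(out)
    return outputs, v_mem
```

The sketch makes the dataflow difference visible: besides weights and activations, every timestep reads and writes the `v_mem` state of each neuron, which is the additional data structure the abstract refers to.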
Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of Convolutional Neural Network Accelerators
- Jung Hwan Heo
- Arash Fayyazi
- Amirhossein Esmaili
- Massoud Pedram
This paper introduces the sparse periodic systolic (SPS) dataflow, which advances the state-of-the-art hardware accelerator for supporting lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by an emergent pruning scheme, periodic pattern-based sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between the weights and activations. Through the compiler-hardware codesign, SPS dataflow enjoys higher degrees of parallelism while being free of the high indexing overhead and without model accuracy loss. Evaluated on popular benchmarks such as VGG and ResNet, the SPS dataflow and accompanying neural network compiler outperform prior work in convolutional neural network (CNN) accelerator designs targeting FPGA devices. Against other sparsity-supporting weight storage formats, SPS results in 4.49 × energy efficiency gain while lowering storage requirements by 3.67 × for total weight storage (non-pruned weights plus indexing) and 22,044 × for indexing memory.
SESSION: Session 2: Novel Computing Models (Chair: Priyadarshini Panda, Yale)
QMLP: An Error-Tolerant Nonlinear Quantum MLP Architecture using Parameterized Two-Qubit Gates
- Cheng Chu
- Nai-Hui Chia
- Lei Jiang
- Fan Chen
Despite potential quantum supremacy, state-of-the-art quantum neural networks (QNNs) suffer from low inference accuracy. First, the current Noisy Intermediate-Scale Quantum (NISQ) devices with high error rates of 10^-3 to 10^-2 significantly degrade the accuracy of a QNN. Second, although recently proposed Re-Uploading Units (RUUs) introduce some non-linearity into the QNN circuits, the theory behind it is not fully understood. Furthermore, previous RUUs that repeatedly upload original data can only provide marginal accuracy improvements. Third, current QNN circuit ansatz uses fixed two-qubit gates to enforce maximum entanglement capability, making task-specific entanglement tuning impossible, resulting in poor overall performance. In this paper, we propose a Quantum Multilayer Perceptron (QMLP) architecture featuring error-tolerant input embedding, rich nonlinearity, and enhanced variational circuit ansatz with parameterized two-qubit entangling gates. Compared to prior art, QMLP increases the inference accuracy on the 10-class MNIST dataset by 10% with 2 × fewer quantum gates and 3 × reduced parameters. Our source code is available at https://github.com/chuchengc/QMLP/.
Design and Logic Synthesis of a Scalable, Efficient Quantum Number Theoretic Transform
- Chao Lu
- Shamik Kundu
- Abraham Kuruvila
- Supriya Margabandhu Ravichandran
- Kanad Basu
The advent of quantum computing has engendered a widespread proliferation of efforts utilizing qubits for optimizing classical computational algorithms. Number Theoretic Transform (NTT) is one such popular algorithm that accelerates polynomial multiplication significantly and is, consequently, the core arithmetic operation in most homomorphic encryption algorithms. Hence, fast and efficient execution of NTT is imperative for practical implementation of homomorphic encryption schemes in different computing paradigms. In this paper, we, for the first time, propose an efficient and scalable Quantum Number Theoretic Transform (QNTT) circuit using quantum gates. We introduce a novel exponential unit for modular exponential operation, which furnishes an algorithmic complexity of O(n). Our proposed methodology performs further optimization and logic synthesis of QNTT, which is significantly faster and facilitates efficient implementations on IBM’s quantum computers. The optimized QNTT achieves a gate-level complexity reduction from power of two to one with respect to bit length. Our methodology utilizes 44.2% fewer gates, thereby minimizing the circuit depth and yielding a corresponding reduction in overhead and error probability, for a 4-point QNTT compared to its unoptimized counterpart.
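For readers unfamiliar with the NTT, the following is a classical (non-quantum) sketch of how a 4-point transform turns polynomial multiplication into pointwise multiplication. It does not represent the paper's quantum circuit. The modulus 17 and the root of unity 4 are illustrative choices, and the sketch uses a plain cyclic convolution with a direct O(n^2) transform for readability.

```python
def ntt(a, root, mod):
    """Direct n-point number theoretic transform over the integers mod `mod`.

    `root` must be a primitive n-th root of unity modulo `mod`.
    """
    n = len(a)
    return [sum(a[j] * pow(root, i * j, mod) for j in range(n)) % mod
            for i in range(n)]


def intt(values, root, mod):
    """Inverse transform: use the inverse root and scale by 1/n (Python 3.8+)."""
    n = len(values)
    inv_n = pow(n, -1, mod)
    inv_root = pow(root, -1, mod)
    return [(inv_n * x) % mod for x in ntt(values, inv_root, mod)]


def cyclic_multiply(a, b, root, mod):
    """Multiply two polynomials modulo x^n - 1 via pointwise products."""
    fa, fb = ntt(a, root, mod), ntt(b, root, mod)
    return intt([(x * y) % mod for x, y in zip(fa, fb)], root, mod)


# (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2, computed modulo 17.
# 4 is a primitive 4th root of unity mod 17 because 4^2 = 16 = -1 (mod 17).
product = cyclic_multiply([1, 2, 0, 0], [3, 4, 0, 0], root=4, mod=17)
```

The repeated modular exponentiations `pow(root, i * j, mod)` correspond to the modular exponential operation that the abstract's exponential unit targets.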
A Charge Domain P-8T SRAM Compute-In-Memory with Low-Cost DAC/ADC Operation for 4-bit Input Processing
- Joonhyung Kim
- Kyeongho Lee
- Jongsun Park
This paper presents a low-cost PMOS-based 8T (P-8T) SRAM Compute-In-Memory (CIM) architecture that efficiently performs the multiply-accumulate (MAC) operations between 4-bit input activations and 8-bit weights. First, a bit-line (BL) charge-sharing technique is employed to design the low-cost and reliable digital-to-analog conversion of 4-bit input activations in the proposed SRAM CIM, where the charge domain analog computing provides variation-tolerant and linear MAC outputs. The 16 local arrays are also effectively exploited to implement the analog multiplication unit (AMU) that simultaneously produces 16 multiplication results between 4-bit input activations and 1-bit weights. For the hardware cost reduction of the analog-to-digital converter (ADC) without sacrificing DNN accuracy, hardware-aware system simulations are performed to decide the ADC bit-resolutions and the number of activated rows in the proposed CIM macro. In addition, for the ADC operation, the AMU-based reference columns are utilized for generating ADC reference voltages, with which a low-cost 4-bit coarse-fine flash ADC has been designed. The 256×80 P-8T SRAM CIM macro implementation using a 28nm CMOS process shows that the proposed CIM achieves accuracies of 91.46% and 66.67% on the CIFAR-10 and CIFAR-100 datasets, respectively, with an energy efficiency of 50.07-TOPS/W.
SESSION: Session 3: Efficient and Intelligent Memories (Chair: Kshitij Bhardwaj, LLNL)
FlexiDRAM: A Flexible in-DRAM Framework to Enable Parallel General-Purpose Computation
- Ranyang Zhou
- Arman Roohi
- Durga Misra
- Shaahin Angizi
In this paper, we propose a Flexible processing-in-DRAM framework named FlexiDRAM that supports the efficient implementation of complex bulk bitwise operations. This framework is developed on top of a new reconfigurable in-DRAM accelerator that leverages the analog operation of DRAM sub-arrays and elevates it to implement XOR2-MAJ3 operations between operands stored in the same bit-line. FlexiDRAM first generates an efficient XOR-MAJ representation of the desired logic and then appropriately allocates DRAM rows to the operands to execute any in-DRAM computation. We develop ISA and software support required to compute in-DRAM operation. FlexiDRAM transforms current memory architecture to a massively parallel computational unit and can be leveraged to significantly reduce the latency and energy consumption of complex workloads. Our extensive circuit-to-architecture simulation results show that averaged across two well-known deep learning workloads, FlexiDRAM achieves ∼ 15 × energy-saving and 13 × speedup over the GPU outperforming recent processing-in-DRAM platforms.
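To illustrate why XOR and three-input majority (MAJ3) are a useful basis for bulk bitwise computation, the sketch below builds a ripple-carry adder from only those two primitives, applied to many operands at once. It relies on the standard full-adder identity (sum = a XOR b XOR c, carry = MAJ(a, b, c)) and is not necessarily the representation FlexiDRAM generates. The bit-sliced layout, where each Python integer stands in for one memory row with one lane per bit, is an assumption of this sketch, as are all function names.

```python
def xor2(a, b):
    """Bulk two-input XOR: acts on every lane (bit) of the rows at once."""
    return a ^ b


def maj3(a, b, c):
    """Bulk three-input majority across all lanes."""
    return (a & b) | (a & c) | (b & c)


def bulk_add(rows_x, rows_y):
    """Add many operand pairs in parallel using only XOR2 and MAJ3.

    rows_x[i] holds bit i (LSB first) of every x operand, one operand per lane.
    Returns one more row than the inputs to hold the final carry.
    """
    carry, out = 0, []
    for rx, ry in zip(rows_x, rows_y):
        out.append(xor2(xor2(rx, ry), carry))  # sum bit
        carry = maj3(rx, ry, carry)            # carry into the next bit
    out.append(carry)
    return out


def to_rows(values, width):
    """Pack integers into bit-sliced rows (lane k of row i = bit i of values[k])."""
    return [sum(((v >> i) & 1) << lane for lane, v in enumerate(values))
            for i in range(width)]


def from_rows(rows, lanes):
    """Unpack bit-sliced rows back into one integer per lane."""
    return [sum(((r >> lane) & 1) << i for i, r in enumerate(rows))
            for lane in range(lanes)]


sums = from_rows(bulk_add(to_rows([3, 5, 7, 0], 3), to_rows([1, 6, 7, 0], 3)), 4)
```

Each loop iteration touches whole rows, so the number of row-level operations depends on the operand width rather than on how many operands are processed in parallel.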
Evolving Skyrmion Racetrack Memory as Energy-Efficient Last-Level Cache Devices
- Ya-Hui Yang
- Shuo-Han Chen
- Yuan-Hao Chang
Skyrmion racetrack memory (SK-RM) has been regarded as a promising alternative to replace static random-access memory (SRAM) as a large-size on-chip cache device with high memory density. Different from other nonvolatile random-access memories (NVRAMs), data bits of SK-RM can only be altered or detected at access ports, and shift operations are required to move data bits across access ports along the racetrack. Owing to these special characteristics, word-based mapping and bit-interleaved mapping architectures have been proposed to facilitate reading and writing on SK-RM with different data layouts. Nevertheless, when SK-RM is used as an on-chip cache device, existing mapping architectures lead to the concerns of unpredictable access performance or excessive energy consumption during both data reads and writes. To resolve such concerns, this paper proposes extracting the merits of existing mapping architectures for allowing SK-RM to seamlessly switch its data update policy by considering the write latency requirement of cache accesses. Promising results have been demonstrated through a series of benchmark-driven experiments.
Exploiting successive identical words and differences with dynamic bases for effective compression in Non-Volatile Memories
- Swati Upadhyay
- Arijit Nath
- Hemangee Kapoor
Emerging Non-volatile memories are considered as potential candidates for replacing traditional DRAM in main memory. However, downsides like long write latency, high write energy, and low write endurance make their direct adoption in the memory hierarchy challenging. Approaches that reduce the number of bits written are beneficial to overcome such drawbacks.
In this direction, we propose a compression technique that reduces overall bits written to the NVM, thus improving its lifetime. The proposed method, SIBR, compresses the incoming blocks to PCM by either eliminating the words to be written or by reducing the number of bits written for each word. For the former, words that have either zero content or are identical to consecutive words are not written. The latter is done by computing the difference of each word with a base word and storing only the difference (or delta) instead of the full word. The novelty of our contribution is to update the base word at run-time, thus achieving better compression. It is shown that computing the delta with a dynamically decided base compared to a fixed base gives smaller delta values. The dynamic base is another word in the same block. SIBR outperforms two state-of-the-art compression techniques by achieving a fairly low compression ratio and high coverage. Experimental results show a substantial reduction in bit-flips and improvement in lifetime.
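The two ideas in this abstract, skipping zero or repeated words and storing deltas against a base that changes at run-time, can be sketched as below. The tag names, the signed-delta cost model, and the policy of using the most recently written non-zero word as the new base are assumptions made for illustration; the abstract only states that the dynamic base is another word in the same block.

```python
def encode_block(words, dynamic_base=True):
    """Encode one block as (tag, payload) pairs.

    'zero' and 'same' entries carry no data; 'raw' stores a full word;
    'delta' stores only the difference from the current base word.
    """
    encoded, base, prev = [], None, None
    for w in words:
        if w == 0:
            encoded.append(("zero", 0))
        elif prev is not None and w == prev:
            encoded.append(("same", 0))
        elif base is None:
            encoded.append(("raw", w))
            base = w
        else:
            encoded.append(("delta", w - base))
            if dynamic_base:
                base = w  # hypothetical policy: follow the latest written word
        prev = w
    return encoded


def decode_block(encoded, dynamic_base=True):
    """Invert encode_block; must use the same base policy as the encoder."""
    words, base, prev = [], None, None
    for tag, payload in encoded:
        if tag == "zero":
            w = 0
        elif tag == "same":
            w = prev
        elif tag == "raw":
            w = base = payload
        else:
            w = base + payload
            if dynamic_base:
                base = w
        words.append(w)
        prev = w
    return words


def delta_bits(encoded):
    """Bits needed for all deltas, assuming magnitude plus one sign bit."""
    return sum(abs(p).bit_length() + 1 for tag, p in encoded if tag == "delta")


block = [1000, 1000, 0, 1004, 1010, 1013]
dynamic_cost = delta_bits(encode_block(block, dynamic_base=True))
fixed_cost = delta_bits(encode_block(block, dynamic_base=False))
```

On this example block the dynamic base needs 11 delta bits against 14 for a fixed base, because neighboring words differ from each other by less than they differ from the first word.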
SESSION: Session 4: Circuit Design and Methodology for IoT Applications (Chair: Hun-Seok Kim, UMich)
HOGEye: Neural Approximation of HOG Feature Extraction in RRAM-Based 3D-Stacked Image Sensors
- Tianrui Ma
- Weidong Cao
- Fei Qiao
- Ayan Chakrabarti
- Xuan Zhang
Many computer vision tasks, ranging from recognition to multi-view registration, operate on feature representation of images rather than raw pixel intensities. However, conventional pipelines for obtaining these representations incur significant energy consumption due to pixel-wise analog-to-digital (A/D) conversions and costly storage and computations. In this paper, we propose HOGEye, an efficient near-pixel implementation for a widely-used feature extraction algorithm—Histograms of Oriented Gradients (HOG). HOGEye moves the key but computation-intensive derivative extraction (DE) and histogram generation (HG) steps into the analog domain by applying a novel neural approximation method in a resistive random-access memory (RRAM)-based 3D-stacked image sensor. The co-location of perception (sensor) and computation (DE and HG) and the alleviation of A/D conversions allow HOGEye design to achieve significant energy saving. With negligible detection rate degradation, the entire HOGEye sensor system consumes less than 48μW@30fps for an image resolution of 256 × 256 (equivalent to 24.3pJ/pixel) while the processing part only consumes 14.1pJ/pixel, achieving more than 2.5 × energy efficiency improvement than the state-of-the-art designs.
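The relation between the quoted system power and the per-pixel energy is plain arithmetic: energy per pixel equals power divided by pixel throughput. A small sketch, with hypothetical helper names, shows the conversion in both directions.

```python
def energy_per_pixel_pj(power_w, fps, width, height):
    """Energy per pixel in picojoules for a sensor running at a given power."""
    pixels_per_second = fps * width * height
    return power_w / pixels_per_second * 1e12


def power_uw(energy_pj_per_pixel, fps, width, height):
    """System power in microwatts implied by a per-pixel energy."""
    return energy_pj_per_pixel * 1e-12 * fps * width * height * 1e6


upper_bound_pj = energy_per_pixel_pj(48e-6, 30, 256, 256)  # about 24.4 pJ/pixel
implied_uw = power_uw(24.3, 30, 256, 256)                  # about 47.8 uW
```

The 24.3 pJ/pixel figure therefore corresponds to roughly 47.8 μW, which is consistent with the stated bound of less than 48 μW at 30 fps.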
A Bit-level Sparsity-aware SAR ADC with Direct Hybrid Encoding for Signed Expressions for AIoT Applications
- Ruicong Chen
- H. T. Kung
- Anantha Chandrakasan
- Hae-Seung Lee
In this work, we propose the first bit-level sparsity-aware SAR ADC with direct hybrid encoding for signed expressions (HESE) for AIoT applications. ADCs are typically a bottleneck in reducing the energy consumption of analog neural networks (ANNs). For a pre-trained Convolutional Neural Network (CNN) inference, a HESE SAR for an ANN can reduce the number of non-zero signed digit terms to be output, and thus enables a reduction in energy along with the term quantization (TQ). The proposed SAR ADC directly produces the HESE signed-digit representation (SDR) using two thresholds per cycle for 2-bit look-ahead (LA). A prototype in 65nm shows that the HESE SAR provides sparsity encoding with a Walden FoM of 15.2fJ/conv.-step at 45MS/s. The core area is 0.072 mm².
Analysis of the Effect of Hot Carrier Injection in An Integrated Inductive Voltage Regulator
- Shida Zhang
- Nael Mizanur Rahman
- Venkata Chaitanya Krishna Chekuri
- Carlos Tokunaga
- Saibal Mukhopadhyay
This paper presents a simulation-based study to evaluate the effect of Hot Carrier Injection (HCI) on the characteristics of an on-chip, digitally-controlled, switched inductor voltage regulator (IVR) architecture. Our methodology integrates device-level aging models, circuit simulations in SPICE, and control loop simulations in Simulink. We characterize the effect of HCI on individual components of an IVR, and their combined effect on the efficiency and transient performance. Our analysis using an IVR designed in 65nm CMOS shows that aging of the power stages has a smaller impact on performance compared to that of the control loop. Further, we perform a comparative analysis to show that, with a 1.8V supply, HCI leads to higher aging-induced degradation of IVR than Negative Bias Temperature Instability (NBTI). Finally, our simulation shows that parasitic inductance near IVR input aggravates NBTI and parasitic capacitance near IVR output aggravates HCI effects on IVR’s performance.
SESSION: Session 5: Advances in Hardware Security (Chair: Apoorva Amarnath, IBM)
RACE: RISC-V SoC for En/decryption Acceleration on the Edge for Homomorphic Computation
- Zahra Azad
- Guowei Yang
- Rashmi Agrawal
- Daniel Petrisko
- Michael Taylor
- Ajay Joshi
As more and more edge devices connect to the cloud to use its storage and compute capabilities, they bring in security and data privacy concerns. Homomorphic Encryption (HE) is a promising solution to maintain data privacy by enabling computations on the encrypted user data in the cloud. While there has been a lot of work on accelerating HE computation in the cloud, little attention has been paid to optimizing the en/decryption on the edge. Therefore, in this paper, we present RACE, a custom-designed area- and energy-efficient SoC for en/decryption of data for HE. Owing to similar operations in en/decryption, RACE unifies the en/decryption datapath to save area. RACE efficiently exploits techniques like memory reuse and data reordering to utilize a minimal amount of on-chip memory. We evaluate RACE using a complete RTL design containing a RISC-V processor and our unified accelerator. Our analysis shows that, for the end-to-end en/decryption, using RACE leads to a solution that is, on average, 48 × to 39729 × (for a wide range of security parameters) more energy-efficient than purely using a processor.
Sealer: In-SRAM AES for High-Performance and Low-Overhead Memory Encryption
- Jingyao Zhang
- Hoda Naghibijouybari
- Elaheh Sadredini
To provide data and code confidentiality and reduce the risk of information leaks from memory or the memory bus, computing systems are enhanced with encryption and decryption engines. Despite massive efforts in designing hardware enhancements for data and code protection, existing solutions incur significant performance overhead as the encryption/decryption is on the critical path. In this paper, we present Sealer, a high-performance and low-overhead in-SRAM memory encryption engine that exploits the massive parallelism and bitline computational capability of SRAM subarrays. Sealer encrypts data before sending it off-chip and decrypts it upon receiving the memory blocks, thus providing data confidentiality. Our proposed solution requires only minimal modifications to the existing SRAM peripheral circuitry. Sealer can achieve up to two orders of magnitude throughput-per-area improvement while consuming 3 × less energy compared to prior solutions.
SESSION: Session 6: Novel Physical Design Methodologies (Chair: Marisa Lopez Vallejo, UPM)
Hier-3D: A Hierarchical Physical Design Methodology for Face-to-Face-Bonded 3D ICs
- Anthony Agnesina
- Moritz Brunion
- Alberto Garcia-Ortiz
- Francky Catthoor
- Dragomir Milojevic
- Manu Komalan
- Matheus Cavalcante
- Samuel Riedel
- Luca Benini
- Sung Kyu Lim
Hierarchical very-large-scale integration (VLSI) flows are an understudied yet critical approach to achieving design closure at giga-scale complexity and gigahertz frequency targets. This paper proposes a novel hierarchical physical design flow enabling the building of high-density and commercial-quality two-tier face-to-face-bonded hierarchical 3D ICs. We significantly reduce the associated manufacturing cost compared to existing 3D implementation flows and, for the first time, achieve cost competitiveness against the 2D reference in large modern designs. Experimental results on complex industrial and open manycore processors demonstrate in two advanced nodes that the proposed flow provides major power, performance, and area/cost (PPAC) improvements of 1.2 to 2.2 × compared with 2D, where all metrics are improved simultaneously, including up to power savings.
A Study on Optimizing Pin Accessibility of Standard Cells in the Post-3 nm Node
- Jaehoon Jeong
- Jonghyun Ko
- Taigon Song
Nanosheet FETs (NSFETs) are expected to be the post-FinFET device in the technology nodes of 5 nm and beyond. However, despite the high potential of NSFETs, few studies report the impact of NSFETs from the digital VLSI perspective. In this paper, we present a study of NSFETs for the optimal standard cell (SDC) library design and pin accessibility-aware layout for less routing congestion and low power consumption. For this objective, we present five novel methodologies to tackle the pin accessibility issues that arise in SDC designs in extremely-low routing resource environments (4 tracks) and emphasize the importance of local trench contact (LTC) in it. Using our methodology, we improve design metrics such as power consumption, total area, and wirelength by -11.0%, -13.2%, and 16.0%, respectively. Through our study, we expect the routing congestion issues that additionally occur in advanced technology nodes to be handled and better full-chip designs to be done in 3 nm and beyond.
Improving Performance and Power by Co-Optimizing Middle-of-Line Routing, Pin Pattern Generation, and Contact over Active Gates in Standard Cell Layout Synthesis
- Sehyeon Chung
- Jooyeon Jeong
- Taewhan Kim
This paper addresses the combined problem of the three core tasks, namely routing on the middle-of-line (MOL) layer, generating I/O pin patterns (PP), and allocating contacts over active gates (COAG) in cell layout synthesis with 7nm and below technology. As yet, the existing cell layout generators have paid partial or little attention to those tasks, even with no awareness of the synergistic effects. This work overcomes this limitation by proposing a systematic and tightly-linked solution to the combined problem to boost the synergistic effects on chip implementation. Precisely, we solve the problem in three steps: (1) fully utilizing the horizontal routing resource on the MOL layer by formulating the problem of in-cell routing into a weighted interval scheduling problem, (2) simultaneously performing the remaining horizontal in-cell routing and PP generation on the metal 1 layer through the COAG exploitation while ensuring the pin accessibility constraint, and (3) completing in-cell routing by allocating vertical routing resource on the MOL layer. Through experiments with benchmark designs, it is shown that our proposed layout method is able to generate standard cells with on average 34.2% shorter total length of metal 1 wire while retaining pin patterns that ensure pin accessibility, resulting in chip implementations with up to 72.5% timing slack improvement and up to 15.6% power reduction over those produced by using the conventional best available cells. In addition, by using less wire and fewer vias, our in-cell router is able to consistently reduce the worst delay of cells, notably reducing the sum of setup time and clock-to-Q delay of flip-flops by 1.2% ∼ 3.0% on average over that by the existing best cells.
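Step (1) of this abstract names weighted interval scheduling, a classic dynamic-programming problem. The sketch below shows the textbook algorithm only. Treating each interval as a candidate horizontal wire segment on a single routing track, and the weight as the benefit of routing it there, is a hypothetical mapping for illustration and not the paper's formulation.

```python
from bisect import bisect_right


def weighted_interval_scheduling(intervals):
    """Pick non-overlapping intervals with maximum total weight.

    intervals: list of (start, end, weight) with half-open spans [start, end).
    Returns (best_total_weight, chosen_intervals).
    """
    ivs = sorted(intervals, key=lambda iv: iv[1])  # order by end coordinate
    ends = [iv[1] for iv in ivs]
    n = len(ivs)
    # compat[i]: how many earlier intervals end at or before ivs[i] starts.
    compat = [bisect_right(ends, ivs[i][0], 0, i) for i in range(n)]

    # best[k]: optimum using only the first k intervals in sorted order.
    best = [0] * (n + 1)
    for i in range(1, n + 1):
        weight = ivs[i - 1][2]
        best[i] = max(best[i - 1], weight + best[compat[i - 1]])

    # Walk back through the table to recover which intervals were taken.
    chosen, i = [], n
    while i > 0:
        weight = ivs[i - 1][2]
        if weight + best[compat[i - 1]] > best[i - 1]:
            chosen.append(ivs[i - 1])
            i = compat[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


segments = [(0, 3, 5), (2, 5, 6), (4, 7, 5), (6, 9, 4)]
total, picked = weighted_interval_scheduling(segments)
```

Sorting by end coordinate and using binary search for the last compatible interval gives O(n log n) running time, which is why this formulation is attractive when many candidate segments compete for the same track.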
SESSION: Session 7: Enablers for Energy-efficient Platforms (Chair: Xue Lin, Northeastern)
Neural Contextual Bandits Based Dynamic Sensor Selection for Low-Power Body-Area Networks
- Berken Utku Demirel
- Luke Chen
- Mohammad Al Faruque
Providing health monitoring devices with machine intelligence is important for enabling automatic mobile healthcare applications. However, this brings additional challenges due to the resource scarcity of these devices. This work introduces a neural contextual bandits based dynamic sensor selection methodology for high-performance and resource-efficient body-area networks to realize next generation mobile health monitoring devices. The methodology utilizes contextual bandits to select the most informative sensor combinations during runtime and ignore redundant data for decreasing transmission and computing power in a body area network (BAN). The proposed method has been validated using one of the most common health monitoring applications: cardiac activity monitoring. Solutions from our proposed method are compared against those from related works in terms of classification performance and energy while considering the communication energy consumption. Our final solutions could reach 78.8% AU-PRC on the PTB-XL ECG dataset for cardiac abnormality detection while decreasing the overall energy consumption and computational energy by 3.7 × and 4.3 ×, respectively.
3D IC Tier Partitioning of Memory Macros: PPA vs. Thermal Tradeoffs
- Lingjun Zhu
- Nesara Eranna Bethur
- Yi-Chen Lu
- Youngsang Cho
- Yunhyeok Im
- Sung Kyu Lim
Micro-bump and hybrid bonding technologies have enabled 3D ICs and provided remarkable performance gain, but the memory macro partitioning problem also becomes more complicated due to the limited 3D connection density. In this paper, we evaluate and quantify the impacts of various macro partitioning on the performance and temperature in commercial-grade 3D ICs. In addition, we propose a set of partitioning guidelines and a quick constraint-graph-based approach to create floorplans for logic-on-memory 3D ICs. Experimental results show that the optimized macro partitioning can help improve the performance of logic-on-memory 3D ICs by up to 15%, at the cost of 8°C temperature increase. Assuming air cooling, our simulation shows the 3D ICs are thermally sustainable with 97°C maximum temperature.
A Domain-Specific System-On-Chip Design for Energy Efficient Wearable Edge AI Applications
- Yigit Tuncel
- Anish Krishnakumar
- Aishwarya Lekshmi Chithra
- Younghyun Kim
- Umit Ogras
Artificial intelligence (AI) based wearable applications collect and process a significant amount of streaming sensor data. Transmitting the raw data to cloud processors wastes scarce energy and threatens user privacy. Wearable edge AI devices should ideally balance two competing requirements: (1) maximizing the energy efficiency using targeted hardware accelerators and (2) providing versatility using general-purpose cores to support arbitrary applications. To this end, we present an open-source domain-specific programmable system-on-chip (SoC) that combines a RISC-V core with a meticulously determined set of accelerators targeting wearable applications. We apply the proposed design method to design an FPGA prototype and six real-life use cases to demonstrate the efficacy of the proposed SoC. Thorough experimental evaluations show that the proposed SoC provides up to 9.1 × faster execution and up to 8.9 × higher energy efficiency than software implementations in FPGA while maintaining programmability.
SESSION: Session 8: System Design for Energy-efficiency and Resiliency (Chair: Aatmesh Shrivastava, Northeastern)
SACS: A Self-Adaptive Checkpointing Strategy for Microkernel-Based Intermittent Systems
- Yen-Ting Chen
- Han-Xiang Liu
- Yuan-Hao Chang
- Yu-Pei Liang
- Wei-Kuan Shih
Intermittent systems are usually energy-harvesting embedded systems that harvest energy from the ambient environment and perform computation intermittently. Due to the unreliable power, these intermittent systems typically adopt different checkpointing strategies for ensuring the data consistency and execution progress after the systems are resumed from unpredictable power failures. Existing checkpointing strategies are usually suitable for bare-metal intermittent systems with short run time. Due to the improvement of energy-harvesting techniques, intermittent systems are having longer run time and better computation power, so that more and more intermittent systems tend to function with a microkernel for handling more/multiple tasks at the same time. However, existing checkpointing strategies were not designed for (or aware of) such microkernel-based intermittent systems that support the running of multiple tasks, and thus have poor performance on preserving the execution progress. To tackle this issue, we propose a design, called self-adaptive checkpointing strategy (SACS), tailored for microkernel-based intermittent systems. By leveraging the time-slicing scheduler, the proposed design dynamically adjusts the checkpointing interval at both run time and reboot time, so as to improve the system performance by achieving a good balance between the execution progress and the number of performed checkpoints. A series of experiments was conducted based on a development board of Texas Instruments (TI) with well-known benchmarks. Compared to the state-of-the-art designs, experimental results show that our design could reduce the execution time by at least 46.8% under different conditions of ambient environment while maintaining the number of performed checkpoints at an acceptable scale.
Drift-tolerant Coding to Enhance the Energy Efficiency of Multi-Level-Cell Phase-Change Memory
- Yi-Shen Chen
- Yuan-Hao Chang
- Tei-Wei Kuo
Phase-Change Memory (PCM) has emerged as a promising memory and storage technology in recent years, and Multi-Level-Cell (MLC) PCM further reduces the per-bit cost to improve its competitiveness by storing multiple bits in each PCM cell. However, MLC PCM suffers from high energy consumption in its write operations. In contrast to existing works that try to enhance the energy efficiency of the physical program&verify strategy for MLC PCM, this work proposes a drift-tolerant coding scheme to enable the fast write operation on MLC PCM without sacrificing any data accuracy. By exploiting the resistance drift and asymmetric write characteristic of PCM cells, the proposed scheme can reduce the write energy consumption of MLC PCM significantly. Meanwhile, a segmentation strategy is proposed to further improve the write performance with our coding scheme. A series of analyses and experiments was conducted to evaluate the capability of the proposed scheme. The results show that the proposed scheme can reduce energy consumption by 6.2–17.1% and write latency by 3.2–11.3% under six representative benchmarks, compared with the existing well-known schemes.
A Unified Forward Error Correction Accelerator for Multi-Mode Turbo, LDPC, and Polar Decoding
- Yufan Yue
- Tutu Ajayi
- Xueyang Liu
- Peiwen Xing
- Zihan Wang
- David Blaauw
- Ronald Dreslinski
- Hun Seok Kim
Forward error correction (FEC) is a critical component in communication systems as the errors induced by noisy channels can be corrected using the redundancy in the coded message. This paper introduces a novel multi-mode FEC decoder accelerator that can decode Turbo, LDPC, and Polar codes using a unified architecture. The proposed design explores the similarities in these codes to enable energy efficient decoding with minimal overhead in the total area of the unified architecture. Moreover, the proposed design is highly reconfigurable to support various existing and future FEC standards including 3GPP LTE/5G, and IEEE 802.11n WiFi. Implemented in GF 12nm FinFET technology, the design occupies 8.47 mm² of chip area attaining 25% logic and 49% memory area savings compared to a collection of single-mode designs. Running at 250MHz and 0.8V, the decoder achieves per-iteration throughput and energy efficiency of 690Mb/s and 44pJ/b for Turbo; 740Mb/s and 27.4pJ/b for LDPC; and 950Mb/s and 45.8pJ/b for Polar.
SESSION: Poster Session
Canopy: A CNFET-based Process Variation Aware Systolic DNN Accelerator
- Cheng Chu
- Dawen Xu
- Ying Wang
- Fan Chen
Although systolic accelerators have become the dominant method for executing Deep Neural Networks (DNNs), their performance efficiency (quantified as Energy-Delay Product or EDP) is limited by the capabilities of silicon Field-Effect Transistors (FETs). FETs constructed from Carbon Nanotubes (CNTs) have demonstrated > 10 × EDP benefits; however, the processing variations inherent in carbon nanotube FET (CNFET) fabrication compromise the EDP benefits, resulting in > 40% performance degradation. In this work, we study the impact of CNT process variations and present Canopy, a process variation aware systolic DNN accelerator that leverages the spatial correlation in CNT variations. Canopy co-optimizes the architecture and dataflow to allow computing engines in a systolic array to run at their best performance with non-uniform latency, minimizing the performance degradation incurred by CNT variations. Furthermore, we devise Canopy with dynamic reconfigurability such that the microarchitectural capability and its associated flexibility achieve an extra degree of adaptability with regard to the DNN topology and processing hyper-parameters (e.g., batch size). Experimental results show that Canopy improves the performance by 5.85 × (4.66 ×) and reduces the energy by 34% (90%) when inferencing a single (a batch of) input compared to the baseline design under an iso-area comparison across seven DNN workloads.
Layerwise Disaggregated Evaluation of Spiking Neural Networks
- Abinand Nallathambi
- Sanchari Sen
- Anand Raghunathan
- Nitin Chandrachoodan
Spiking Neural Networks (SNNs) have attracted considerable attention due to their suitability for processing temporal input streams, as well as the emergence of highly power-efficient neuromorphic hardware platforms. The computational cost of evaluating an SNN is strongly correlated with the number of timesteps for which it is evaluated. To improve the computational efficiency of SNN evaluation, we propose layerwise disaggregated SNNs (LD-SNNs), wherein the number of timesteps is independently optimized for each layer of the network. In effect, LD-SNNs allow for a better allocation of computational effort across layers in a network, resulting in an improved tradeoff between accuracy and efficiency. We propose a methodology to design optimized LD-SNNs from any given SNN. Across four benchmark networks, LD-SNNs achieve 1.67-3.84x reduction in synaptic updates and 1.2-2.56x reduction in neurons evaluated. These improvements translate to 1.25-3.45x faster inference on four different hardware platforms, including two server-class platforms, a desktop platform, and an edge SoC.
Tightly Linking 3D Via Allocation Towards Routing Optimization for Monolithic 3D ICs
- Suwan Kim
- Sehyeon Chung
- Taewhan Kim
- Heechun Park
Monolithic 3D (M3D) is a revolutionary technology for high-density and high-performance chip design in the post-Moore era. However, it suffers from considerable thermal confinement due to the transistor stacking and insulating materials between the layers. To reduce power and thereby mitigate the thermal problem, we propose a comprehensive physical design methodology that incorporates two new elements aimed at optimizing wire length: blockage-aware MIV (monolithic inter-tier via) placement and 3D net ordering for routing. Specifically, we propose a three-step approach: (1) retrieving the MIV region candidates for each 3D net, (2) fine-tuning placement to secure MIV spots in the presence of blockages, and (3) performing M3D routing with net ordering that takes the fine-tuned placement result into account. We implement the proposed M3D design flow using commercial 2D IC EDA tools while providing seamless optimization for cross-tier connections. Our experiments confirm that the proposed M3D design flow reduces wire length per cross-tier net by up to 41.42%, which corresponds to 7.68% less total net switching power and 36.79% lower energy-delay product compared with the conventional state-of-the-art M3D design flow.
Enabling Capsule Networks at the Edge through Approximate Softmax and Squash Operations
- Alberto Marchisio
- Beatrice Bussolino
- Edoardo Salvati
- Maurizio Martina
- Guido Masera
- Muhammad Shafique
Complex Deep Neural Networks such as Capsule Networks (CapsNets) exhibit high learning capabilities at the cost of compute-intensive operations. To enable their deployment on edge devices, we propose to leverage approximate computing to design approximate variants of complex operations such as softmax and squash. In our experiments, we evaluate tradeoffs between the area, power consumption, and critical path delay of the designs implemented with the ASIC design flow, and the accuracy of the quantized CapsNets, compared to the exact functions.
Multi-Complexity-Loss DNAS for Energy-Efficient and Memory-Constrained Deep Neural Networks
- Matteo Risso
- Alessio Burrello
- Luca Benini
- Enrico Macii
- Massimo Poncino
- Daniele Jahier Pagliari
Neural Architecture Search (NAS) is increasingly popular for automatically exploring the accuracy versus computational complexity trade-off of Deep Learning (DL) architectures. When targeting tiny edge devices, the main challenge for DL deployment is meeting the tight memory constraints; hence most NAS algorithms consider model size as the complexity metric. Other methods reduce the energy or latency of DL models by trading off accuracy and number of inference operations. Energy and memory are rarely considered simultaneously, in particular by low-search-cost Differentiable NAS (DNAS) solutions.
We overcome this limitation by proposing the first DNAS that directly addresses the most realistic scenario from a designer’s perspective: the co-optimization of accuracy and energy (or latency) under a memory constraint determined by the target HW. We do so by combining two complexity-dependent loss functions during training, each with independent strength. Testing on three edge-relevant tasks from the MLPerf Tiny benchmark suite, we obtain rich Pareto sets of architectures in the energy vs. accuracy space, with memory footprint constraints spanning from 75% to 6.25% of the baseline networks. When deployed on a commercial edge device, the STM NUCLEO-H743ZI2, our networks span a range of 2.18x in energy consumption and 4.04% in accuracy for the same memory constraint, and reduce energy by up to 2.2x with negligible accuracy drop with respect to the baseline.
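One plausible reading of "combining two complexity-dependent loss functions with independent strength" is a weighted sum added to the task loss. The sketch below illustrates only that general idea; the abstract does not give the exact form of either term, so the function name `combined_loss`, the parameters `lambda_size` and `lambda_energy`, and their default values are all hypothetical.

```python
def combined_loss(task_loss, size_cost, energy_cost,
                  lambda_size=1e-6, lambda_energy=1e-7):
    """Task loss plus two complexity terms, each with its own strength.

    size_cost and energy_cost stand in for differentiable estimates of
    model size and energy (or latency). How the paper computes them is
    not stated in the abstract, so they are plain numbers here.
    """
    return task_loss + lambda_size * size_cost + lambda_energy * energy_cost


if __name__ == "__main__":
    # Sweeping the two strengths independently yields different trade-offs.
    for lam_s, lam_e in [(0.0, 0.0), (1e-6, 0.0), (1e-6, 1e-7)]:
        loss = combined_loss(0.5, 2.0e5, 1.0e6, lam_s, lam_e)
        print(f"lambda_size={lam_s}, lambda_energy={lam_e}: loss={loss:.3f}")
```

In a setup like this, training with different pairs of strengths would be one way to populate a Pareto set in the energy vs. accuracy space for a given memory budget.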
Visible Light Synchronization for Time-Slotted Energy-Aware Transiently-Powered Communication
- Alessandro Torrisi
- Maria Doglioni
- Kasim Sinan Yildirim
- Davide Brunelli
Energy-harvesting IoT devices that operate without batteries have paved the way for sustainable sensing applications. These devices force applications to run intermittently since the ambient energy is sporadic, leading to frequent power failures. Unexpected power failures introduce several challenges to wireless communication since nodes are not synchronized and stop operating during data transmission. This paper presents a novel self-powered autonomous circuit design to remedy this problem. This circuit uses visible-light communication (VLC) to enable synchronization for time-slotted energy-aware transiently powered communication. It thereby aligns the activity phases of the batteryless sensors so that energy status communication occurs when these nodes are active simultaneously. Evaluations show that our circuit has ultra-low power consumption, can work with zero energy cost by relying only on the harvested energy, and supports efficient communication among intermittently powered nodes.
Directed Acyclic Graph-based Neural Networks for Tunable Low-Power Computer Vision
- Abhinav Goel
- Caleb Tung
- Nick Eliopoulos
- Xiao Hu
- George K. Thiruvathukal
- James C. Davis
- Yung-Hsiang Lu
Processing visual data on mobile devices has many applications, e.g., emergency response and tracking. State-of-the-art computer vision techniques rely on large Deep Neural Networks (DNNs) that are usually too power-hungry to be deployed on resource-constrained edge devices. Many techniques improve the efficiency of DNNs by compromising accuracy. However, the accuracy and efficiency of these techniques cannot be adapted for diverse edge applications with different hardware constraints and accuracy requirements. This paper demonstrates that a recent, efficient tree-based DNN architecture, called the hierarchical DNN, can be converted into a Directed Acyclic Graph-based (DAG) architecture to provide tunable accuracy-efficiency tradeoff options. We propose a systematic method that identifies the connections that must be added to convert the tree to a DAG to improve accuracy. We conduct experiments on popular edge devices and show that increasing the connectivity of the DAG improves the accuracy to within 1% of the existing high-accuracy techniques. Our approach requires 93% less memory, 43% less energy, and 49% fewer operations than the high-accuracy techniques, thus providing more accuracy-efficiency configurations.
Energy Efficient Cache Design with Piezoelectric FETs
- Reena Elangovan
- Ashish Ranjan
- Niharika Thakuria
- Sumeet Gupta
- Anand Raghunathan
Piezoelectric FETs (PeFETs) are a promising class of ferroelectric devices that use the piezoelectric effect to modulate strain in the channel. They present several desirable properties for on-chip memory, such as non-volatility, high-density, and low-power write capability. In this work, we present the first effort to design and evaluate cache architectures using PeFETs.
Two key goals in cache design are to maximize capacity and minimize latency. Accordingly, we consider two different variants of PeFET bit-cells: a high-density variant (HD-PeFET) that does not use a separate access transistor, and a high-performance 1T-1PeFET variant (HP-PeFET) that sacrifices density for lower access latency. We note that at the application level, there exists significant heterogeneity in the sensitivity of applications to cache capacity and latency. To enable a better tradeoff between these conflicting design goals, we propose a hybrid PeFET cache comprising both HP-PeFET and HD-PeFET regions at the granularity of cache ways. We make the key observation that frequently reused blocks residing in the HD-PeFET region are detrimental to overall cache performance due to the higher access latency. Hence, we also propose a cache management policy to identify and migrate these blocks from the HD-PeFET region to the HP-PeFET region at runtime. We develop models of HD-PeFET and HP-PeFET caches using the CACTI framework and evaluate their benefits across a suite of PARSEC and SPLASH-2X benchmarks. We demonstrate 1.11x and 4.55x average improvements in performance and energy, respectively, using the proposed hybrid PeFET last-level cache against a baseline with traditional SRAM cache at iso-area.
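The idea of migrating frequently reused blocks from the dense region to the fast region can be illustrated with a toy model of a single cache set. This is only a sketch under stated assumptions, not the policy from the paper: the class name `HybridCacheSet`, the hit-count threshold, LRU replacement, and the choice to fill new blocks into the HD ways are all hypothetical.

```python
from collections import defaultdict


class HybridCacheSet:
    """Toy model of one cache set split into HP (fast) and HD (dense) ways."""

    def __init__(self, hp_ways, hd_ways, reuse_threshold=2):
        self.hp = []  # tags held in HP ways, least recently used first
        self.hd = []  # tags held in HD ways, least recently used first
        self.hp_ways = hp_ways
        self.hd_ways = hd_ways
        self.reuse_threshold = reuse_threshold  # assumed migration trigger
        self.hd_hits = defaultdict(int)  # reuse counters for HD-resident tags

    def access(self, tag):
        """Return 'hp_hit', 'hd_hit', or 'miss' and update block placement."""
        if tag in self.hp:
            self.hp.remove(tag)
            self.hp.append(tag)
            return "hp_hit"
        if tag in self.hd:
            self.hd_hits[tag] += 1
            if self.hd_hits[tag] >= self.reuse_threshold:
                self._migrate_to_hp(tag)
            else:
                self.hd.remove(tag)
                self.hd.append(tag)
            return "hd_hit"
        self._insert_hd(tag)
        return "miss"

    def _insert_hd(self, tag):
        # Evict the least recently used HD block if the HD ways are full.
        if len(self.hd) >= self.hd_ways:
            victim = self.hd.pop(0)
            self.hd_hits.pop(victim, None)
        self.hd.append(tag)

    def _migrate_to_hp(self, tag):
        # Move a frequently reused block to HP; demote the LRU HP block to HD.
        self.hd.remove(tag)
        self.hd_hits.pop(tag, None)
        if len(self.hp) >= self.hp_ways:
            self._insert_hd(self.hp.pop(0))
        self.hp.append(tag)
```

With `HybridCacheSet(hp_ways=1, hd_ways=2)`, a block is filled into HD on its first access, and its second HD hit triggers migration so that later accesses are served from the HP way.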
Predictive Model Attack for Embedded FPGA Logic Locking
- Prattay Chowdhury
- Chaitali Sathe
- Benjamin Carrion Schaefer
With most VLSI design companies now being fabless, it is imperative to develop methods to protect their Intellectual Property (IP). One approach that has become very popular due to its relative simplicity and practicality is logic locking. One of the problems with traditional locking mechanisms is that the locking circuitry is built into the netlist that the VLSI design company delivers to the foundry, which now has access to the entire design including the locking mechanism. This implies that the foundry could potentially tamper with this circuitry or reverse engineer it to obtain the locking key. One relatively new approach, which has been coined logic locking through omission, or hardware redaction, maps a portion of the design to an embedded FPGA (eFPGA). The bitstream of the eFPGA now acts as the locking key. This new approach has been shown to be more secure as the foundry has no access to the bitstream during the manufacturing stage. The obvious drawbacks are the increase in design complexity and the area and performance overheads associated with the eFPGA. In this work we propose, to the best of our knowledge, the first attack on this type of new locking mechanism by substituting the exact logic mapped onto the eFPGA with a synthesizable predictive model that replicates the behavior of the exact logic. We show that this approach is applicable in the context of approximate computing, where hardware accelerators tolerate a certain degree of error at their outputs. Experimental results show that our proposed approach is very effective at finding suitable predictive models while simultaneously reducing the overall power consumption.