DAC’22 TOC
DAC ’22: Proceedings of the 59th ACM/IEEE Design Automation Conference
QuantumNAT: quantum noise-aware training with noise injection, quantization and normalization
- Hanrui Wang
- Jiaqi Gu
- Yongshan Ding
- Zirui Li
- Frederic T. Chong
- David Z. Pan
- Song Han
Parameterized Quantum Circuits (PQC) are promising for achieving quantum advantage on near-term quantum hardware. However, due to large quantum noise (errors), the performance of PQC models degrades severely on real quantum devices. Taking Quantum Neural Networks (QNN) as an example, the accuracy gap between noise-free simulation and noisy results on IBMQ-Yorktown for MNIST-4 classification is over 60%. Existing noise mitigation methods are general-purpose and do not leverage the unique characteristics of PQC; on the other hand, existing PQC work does not consider noise effects. To this end, we present QuantumNAT, a PQC-specific framework that performs noise-aware optimizations in both the training and inference stages to improve robustness. We experimentally observe that the effect of quantum noise on a PQC measurement outcome is a linear map of the noise-free outcome with a scaling and a shift factor. Motivated by this observation, we propose post-measurement normalization to mitigate the feature distribution differences between noise-free and noisy scenarios. Furthermore, to improve robustness against noise, we inject noise into the training process by inserting quantum error gates into the PQC according to realistic noise models of quantum hardware. Finally, post-measurement quantization is introduced to quantize the measurement outcomes to discrete values, achieving a denoising effect. Extensive experiments on 8 classification tasks using 6 quantum devices demonstrate that QuantumNAT improves accuracy by up to 43% and achieves over 94% 2-class, 80% 4-class, and 34% 10-class classification accuracy measured on real quantum computers. The code for construction and noise-aware training of PQC is available in the TorchQuantum library.
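The paper's central observation is that noisy measurement outcomes are approximately a linear map (scale and shift) of the noise-free ones, so normalizing and then quantizing the measured expectation values can remove much of the distortion. Below is a minimal NumPy sketch of that post-measurement normalization and quantization idea; the noise model, factor values, and quantization levels are illustrative assumptions, not the paper's exact implementation (which ships in TorchQuantum).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise-free expectation values for a batch of PQC outputs
# (one row per input sample, one column per measured qubit).
clean = rng.uniform(-1.0, 1.0, size=(128, 4))

# Assumed noise model mirroring the paper's observation: a linear map of the
# noise-free outcome with a scaling and a shift factor, plus small jitter.
scale, shift = 0.6, 0.15                      # illustrative factors
noisy = scale * clean + shift + 0.02 * rng.normal(size=clean.shape)

def post_measurement_normalize(x):
    """Standardize measurement outcomes per feature (zero mean, unit variance)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def post_measurement_quantize(x, n_levels=8):
    """Quantize normalized outcomes to a small set of discrete values."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (n_levels - 1)
    return np.round((x - lo) / step) * step + lo

# Normalization removes the scale/shift mismatch between clean and noisy features;
# quantization then snaps the remaining jitter to discrete levels.
clean_n = post_measurement_normalize(clean)
noisy_nq = post_measurement_quantize(post_measurement_normalize(noisy))

print("gap before:", np.abs(clean - noisy).mean())
print("gap after :", np.abs(clean_n - noisy_nq).mean())
```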
Optimizing quantum circuit synthesis for permutations using recursion
- Cynthia Chen
- Bruno Schmitt
- Helena Zhang
- Lev S. Bishop
- Ali Javadi-Abhar
We describe a family of recursive methods for the synthesis of qubit permutations on quantum computers with limited qubit connectivity. Two objectives are of importance: circuit size and depth. In each case we combine a scalable heuristic with a non-scalable, yet exact, synthesis.
A fast and scalable qubit-mapping method for noisy intermediate-scale quantum computers
- Sunghye Park
- Daeyeon Kim
- Minhyuk Kweon
- Jae-Yoon Sim
- Seokhyeong Kang
This paper presents an efficient qubit-mapping method that redesigns a quantum circuit to overcome the limitations of qubit connectivity. We propose a recursive graph-isomorphism search to generate the scalable initial mapping. In the main mapping, we use an adaptive look-ahead window search to resolve the connectivity constraint within a short runtime. Compared with the state-of-the-art method [15], our proposed method reduced the number of additional gates by 23% on average and the runtime by 68% for the three largest benchmark circuits. Furthermore, our method improved circuit stability by reducing the circuit depth and thus can be a step forward towards fault tolerance.
Optimizing quantum circuit placement via machine learning
- Hongxiang Fan
- Ce Guo
- Wayne Luk
Quantum circuit placement (QCP) is the process of mapping synthesized logical quantum programs onto physical quantum machines, which introduces additional SWAP gates and affects the performance of quantum circuits. Determining the minimal number of SWAP gates has been shown to be an NP-complete problem. Various heuristic approaches have been proposed to address QCP, but they suffer from suboptimality due to a lack of exploration. Although exact approaches can achieve higher optimality, they are not scalable to large quantum circuits due to the massive design space and expensive runtime. By formulating QCP as a bilevel optimization problem, this paper proposes a novel machine learning (ML)-based framework to tackle this challenge. To address the lower-level combinatorial optimization problem, we adopt a policy-based deep reinforcement learning (DRL) algorithm with knowledge transfer to enable the generalization ability of our framework. An evolutionary algorithm is then deployed to solve the upper-level discrete search problem, which optimizes the initial mapping for a lower SWAP cost. The proposed ML-based approach provides a new paradigm that overcomes the drawbacks of both traditional heuristic and exact approaches while enabling exploration of the optimality-runtime trade-off. Compared with the leading heuristic approaches, our ML-based method significantly reduces the SWAP cost by up to 100%. In comparison with the leading exact search, our proposed algorithm achieves the same level of optimality while reducing the runtime cost by up to 40 times.
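To make the bilevel structure concrete, the sketch below evolves an initial logical-to-physical mapping (the upper level) while a stub lower-level router scores each candidate. The router here is replaced by a simple SWAP-cost proxy based on coupling-graph distances rather than the paper's DRL policy, so everything except the overall split into upper-level search and lower-level cost evaluation is an illustrative assumption.

```python
import random
from collections import deque

def all_pairs_dist(adj):
    """BFS shortest-path distances on the physical coupling graph."""
    n = len(adj)
    dist = [[n] * n for _ in range(n)]
    for s in range(n):
        dist[s][s], q = 0, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[s][v] > dist[s][u] + 1:
                    dist[s][v] = dist[s][u] + 1
                    q.append(v)
    return dist

def swap_cost(mapping, gates, dist):
    """Proxy for the lower-level router: distance beyond adjacency per 2-qubit gate.
    (The paper uses a DRL policy here; this stub only keeps the interface.)"""
    return sum(dist[mapping[a]][mapping[b]] - 1 for a, b in gates)

def evolve_initial_mapping(gates, adj, pop=30, gens=200, seed=1):
    """Upper-level evolutionary search over initial logical->physical mappings."""
    rng = random.Random(seed)
    n = len(adj)
    dist = all_pairs_dist(adj)
    population = [rng.sample(range(n), n) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda m: swap_cost(m, gates, dist))
        survivors = population[: pop // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = rng.sample(range(n), 2)       # mutate: swap two assignments
            child[i], child[j] = child[j], child[i]
            children.append(child)
        population = survivors + children
    best = min(population, key=lambda m: swap_cost(m, gates, dist))
    return best, swap_cost(best, gates, dist)

# Toy example: a 2x3 grid coupling map and a short list of logical 2-qubit gates.
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
gates = [(0, 1), (1, 2), (2, 3), (0, 4), (3, 5)]
print(evolve_initial_mapping(gates, adj))
```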
HERO: hessian-enhanced robust optimization for unifying and improving generalization and quantization performance
- Huanrui Yang
- Xiaoxuan Yang
- Neil Zhenqiang Gong
- Yiran Chen
With the recent demand for deploying neural network models on mobile and edge devices, it is desirable to improve a model's generalizability on unseen testing data as well as its robustness under fixed-point quantization for efficient deployment. Minimizing the training loss, however, provides few guarantees on generalization or quantization performance. In this work, we address the need to improve generalization and quantization performance simultaneously by theoretically unifying them under the framework of improving the model's robustness against bounded weight perturbation and minimizing the eigenvalues of the Hessian matrix with respect to the model weights. We therefore propose HERO, a Hessian-enhanced robust optimization method, to minimize the Hessian eigenvalues through a gradient-based training process, simultaneously improving generalization and quantization performance. HERO enables up to a 3.8% gain in test accuracy, up to 30% higher accuracy under 80% training label perturbation, and the best post-training quantization accuracy across a wide range of precisions, including a > 10% accuracy improvement over SGD-trained models for common model architectures on various datasets.
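The quantity HERO targets is the curvature of the loss with respect to the weights, i.e., the large Hessian eigenvalues. The sketch below estimates the top Hessian eigenvalue of a toy quadratic loss via power iteration on finite-difference Hessian-vector products; it illustrates the quantity being minimized, not HERO's actual gradient-based training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loss: f(w) = 0.5 * w^T A w with a known symmetric positive-definite A,
# so the true top Hessian eigenvalue is available for comparison.
B = rng.normal(size=(6, 6))
A = B @ B.T + np.eye(6)
grad = lambda w: A @ w                       # analytic gradient of the toy loss

def hvp(grad_fn, w, v, eps=1e-4):
    """Finite-difference Hessian-vector product: (grad(w + eps*v) - grad(w)) / eps."""
    return (grad_fn(w + eps * v) - grad_fn(w)) / eps

def top_hessian_eigenvalue(grad_fn, w, iters=100):
    """Power iteration using only Hessian-vector products (no explicit Hessian)."""
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        lam = float(v @ hv)                  # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

w = rng.normal(size=6)
print("estimated:", top_hessian_eigenvalue(grad, w))
print("exact    :", np.linalg.eigvalsh(A)[-1])
```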
Neural computation for robust and holographic face detection
- Mohsen Imani
- Ali Zakeri
- Hanning Chen
- TaeHyun Kim
- Prathyush Poduval
- Hyunsei Lee
- Yeseong Kim
- Elaheh Sadredini
- Farhad Imani
Face detection is an essential component of many computer vision tasks with a wide range of applications. However, existing deep learning solutions are too slow and inefficient to enable face detection on embedded platforms. In this paper, we propose HDFace, a novel framework for highly efficient and robust face detection. HDFace exploits HyperDimensional Computing (HDC), a neurally-inspired computational paradigm that mimics important brain functionalities for high-efficiency and noise-tolerant computation. We first develop a novel technique that enables HDC to perform stochastic arithmetic computations over binary hypervectors. Next, we extend these arithmetic operations for efficient and robust processing of feature extraction algorithms in hyperspace. Finally, we develop an adaptive hyperdimensional classification algorithm for effective and robust face detection. We evaluate the effectiveness of HDFace on large-scale emotion detection and face detection applications. Our results indicate that HDFace provides, on average, 6.1X (4.6X) speedup and 3.0X (12.1X) higher energy efficiency compared to neural networks running on a CPU (FPGA), respectively.
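For readers unfamiliar with HDC, the sketch below shows the two basic binary-hypervector operations that feature pipelines like this one build on: XOR binding of a feature's identity with its (quantized) value, and majority-vote bundling into a single representative hypervector. The dimensionality and encoding choices are illustrative, not HDFace's.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                   # hypervector dimensionality

def random_hv():
    """Random dense binary hypervector."""
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bind(a, b):
    """Binding (XOR): associates two hypervectors; result is dissimilar to both."""
    return a ^ b

def bundle(hvs):
    """Bundling (bit-wise majority vote): result is similar to all inputs."""
    return (np.sum(hvs, axis=0) > len(hvs) / 2).astype(np.uint8)

def hamming_sim(a, b):
    """Normalized similarity in [0, 1]; ~0.5 means 'unrelated'."""
    return 1.0 - np.count_nonzero(a ^ b) / D

# Encode a tiny feature vector: bind each feature id with its quantized value,
# then bundle all bound pairs into one sample hypervector.
feature_ids = [random_hv() for _ in range(3)]
value_levels = [random_hv() for _ in range(4)]   # 4 quantization levels
sample = bundle([bind(feature_ids[i], value_levels[v])
                 for i, v in enumerate([0, 2, 3])])

print(hamming_sim(sample, bind(feature_ids[0], value_levels[0])))  # well above 0.5
print(hamming_sim(sample, random_hv()))                            # close to 0.5
```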
FHDnn: communication efficient and robust federated learning for AIoT networks
- Rishikanth Chandrasekaran
- Kazim Ergun
- Jihyun Lee
- Dhanush Nanjunda
- Jaeyoung Kang
- Tajana Rosing
The advent of IoT and advances in edge computing inspired federated learning, a distributed algorithm that enables on-device learning. Transmission costs, unreliable networks, and limited compute power, all typical characteristics of IoT networks, pose a severe bottleneck for federated learning. In this work we propose FHDnn, a synergetic federated learning framework that combines the salient aspects of CNNs and Hyperdimensional Computing. FHDnn performs hyperdimensional learning on features extracted from a self-supervised contrastive learning framework to accelerate training, lower communication costs, and increase robustness to network errors by avoiding transmission of the CNN and training only the hyperdimensional component. Compared to CNNs, we show through experiments that FHDnn reduces communication costs by 66X and local client compute and energy consumption by 1.5-6X, while being highly robust to network errors with minimal loss in accuracy.
ODHD: one-class brain-inspired hyperdimensional computing for outlier detection
- Ruixuan Wang
- Xun Jiao
- X. Sharon Hu
Outlier detection is a classical and important technique used in application domains such as medical diagnosis and the Internet of Things. Recently, machine learning-based outlier detection algorithms, such as one-class support vector machines (OCSVM), isolation forest, and autoencoders, have demonstrated promising results. In this paper, we take a radical departure from these classical learning methods and propose ODHD, an outlier detection method based on hyperdimensional computing (HDC). In ODHD, the outlier detection process follows a P-U learning structure, in which we train a one-class hypervector (HV) from inlier samples. This HV represents the abstracted information of all inlier samples; hence, any (testing) sample whose corresponding HV is dissimilar to this HV is considered an outlier. We perform an extensive evaluation using six datasets across different application domains and compare ODHD with multiple baseline methods, including OCSVM, isolation forest, and autoencoder, using three metrics: accuracy, F1 score, and ROC-AUC. Experimental results show that ODHD outperforms all the baseline methods on every dataset for every metric. Moreover, we perform a design space exploration for ODHD to illustrate the tradeoff between performance and efficiency. The promising results presented in this paper provide a viable alternative to traditional learning algorithms for outlier detection.
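A minimal sketch of the one-class idea described above: encode inlier samples into bipolar hypervectors, bundle them into a single class hypervector, and flag test samples whose encoding is insufficiently similar to it. The random-projection encoder, the toy data, and the fixed threshold are assumptions for illustration; ODHD's actual encoding and threshold selection may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D, F = 10_000, 16                            # hypervector and feature dimensions

# Assumed encoder: random projection of real-valued features followed by sign().
projection = rng.normal(size=(F, D))
encode = lambda x: np.sign(x @ projection)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy data: inliers cluster around a prototype; outliers do not.
prototype = rng.normal(size=F)
inliers = prototype + 0.3 * rng.normal(size=(500, F))

# "Train": bundle the encodings of all inlier samples into one class hypervector.
class_hv = np.sign(encode(inliers).sum(axis=0))

# "Test": samples dissimilar to the class hypervector are flagged as outliers.
test_in = prototype + 0.3 * rng.normal(size=(5, F))
test_out = rng.normal(size=(5, F))
threshold = 0.3                              # illustrative; ODHD tunes this

for name, batch in [("inlier", test_in), ("outlier", test_out)]:
    sims = [cosine(encode(x), class_hv) for x in batch]
    print(name, [("outlier" if s < threshold else "inlier") for s in sims])
```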
High-level synthesis performance prediction using GNNs: benchmarking, modeling, and advancing
- Nan Wu
- Hang Yang
- Yuan Xie
- Pan Li
- Cong Hao
Agile hardware development requires fast and accurate circuit quality evaluation from early design stages. Existing work on high-level synthesis (HLS) performance prediction usually requires extensive feature engineering after the synthesis process. To expedite circuit evaluation from as early a design stage as possible, we propose rapid and accurate performance prediction methods that exploit the representation power of graph neural networks (GNNs) by representing C/C++ programs as graphs. The contribution of this work is three-fold. (1) Benchmarking. We build a standard benchmark suite with 40k C programs, which includes synthetic programs and three sets of real-world HLS benchmarks. Each program is synthesized and implemented on FPGA to obtain post-place-and-route performance metrics as the ground truth. (2) Modeling. We formally formulate the HLS performance prediction problem on graphs and propose multiple GNN-based modeling strategies that trade off prediction timeliness (early/late prediction) against accuracy. (3) Advancing. We further propose a novel hierarchical GNN that does not sacrifice timeliness but largely improves prediction accuracy, significantly outperforming HLS tools. We perform extensive evaluations on both synthetic and unseen real-case programs; our proposed predictor outperforms HLS by up to 40X and existing predictors by 2X to 5X in terms of resource usage and timing prediction. The benchmark and explored GNN models are publicly available at https://github.com/lydiawunan/HLS-Perf-Prediction-with-GNNs.
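To make the "programs as graphs" formulation concrete, the sketch below builds a tiny dataflow-style graph for a toy C kernel and runs one round of mean-aggregation message passing in NumPy, producing a pooled graph embedding of the kind a downstream regressor would map to latency or resource estimates. The graph construction and feature choices are simplified assumptions, not the paper's benchmark pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph for: for (i...) c[i] = a[i] * b[i] + acc;
# Nodes: 0 load a, 1 load b, 2 mul, 3 add, 4 store c.
edges = [(0, 2), (1, 2), (2, 3), (3, 4)]
num_nodes, feat_dim, hidden = 5, 4, 8

# One-hot node features by operation type (load / mul / add / store).
X = np.zeros((num_nodes, feat_dim))
for node, op_type in enumerate([0, 0, 1, 2, 3]):
    X[node, op_type] = 1.0

# Adjacency with self-loops, row-normalized for mean aggregation.
A = np.eye(num_nodes)
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
A = A / A.sum(axis=1, keepdims=True)

# One message-passing layer: aggregate neighbor features, then transform.
W = rng.normal(scale=0.5, size=(feat_dim, hidden))
H = np.maximum(A @ X @ W, 0.0)               # ReLU(mean-aggregate, project)

# Mean-pool node embeddings into a graph-level embedding; a regressor trained on
# post-place-and-route results would consume this to predict latency/resources.
graph_embedding = H.mean(axis=0)
print(graph_embedding.shape, graph_embedding[:4])
```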
Automated accelerator optimization aided by graph neural networks
- Atefeh Sohrabizadeh
- Yunsheng Bai
- Yizhou Sun
- Jason Cong
With High-Level Synthesis (HLS), hardware designers need only describe a high-level behavioral flow of the design. However, it can still take weeks to develop a high-performance architecture, mainly because there are many design choices to explore at the higher level. In addition, it takes minutes to hours to evaluate a design with the HLS tool. To solve this problem, we model the HLS tool with a graph neural network that is trained to be usable for a wide range of applications. The experimental results demonstrate that our model can estimate the quality of a design in milliseconds with high accuracy, resulting in up to a 79X speedup (48X on average) for optimizing the design compared to the previous state-of-the-art work relying on the HLS tool.
Functionality matters in netlist representation learning
- Ziyi Wang
- Chen Bai
- Zhuolun He
- Guangliang Zhang
- Qiang Xu
- Tsung-Yi Ho
- Bei Yu
- Yu Huang
Learning feasible representations from raw gate-level netlists is essential for incorporating machine learning techniques into logic synthesis, physical design, and verification. Existing message-passing-based graph learning methodologies focus merely on graph topology while overlooking gate functionality, and thus often fail to capture the underlying semantics, limiting their generalizability. To address this concern, we propose a novel netlist representation learning framework that utilizes a contrastive scheme to effectively acquire generic functional knowledge from netlists. We also propose a customized graph neural network (GNN) architecture that learns a set of independent aggregators to better cooperate with the above framework. Comprehensive experiments on multiple complex real-world designs demonstrate that our proposed solution significantly outperforms state-of-the-art netlist feature learning flows.
EMS: efficient memory subsystem synthesis for spatial accelerators
- Liancheng Jia
- Yuyue Wang
- Jingwen Leng
- Yun Liang
Spatial accelerators provide massive parallelism with an array of homogeneous PEs and enable efficient data reuse through the PE-array dataflow and on-chip memory. Many previous works have studied the dataflow architecture of spatial accelerators, including performance analysis and automatic generation. However, existing accelerator generators fail to exploit all memory-level reuse opportunities and generate suboptimal designs with data duplication and inefficient interconnection.
In this paper, we propose EMS, an efficient memory subsystem synthesis and optimization framework for spatial accelerators. We first use space-time transformation (STT) to analyze both PE-level and memory-level data reuse. Based on the reuse analysis, we develop an algorithm to automatically generate the data layout of the multi-banked scratchpad memory, the data mapping, and the memory access controller. Our generated memory subsystem supports multiple PE-memory interconnection topologies, including direct, multicast, and rotated connections. The memory and interconnection generation approach efficiently exploits memory-level reuse to avoid duplicated data storage at low hardware cost. EMS can automatically synthesize tensor algebra into hardware described in Chisel. Experiments show that our proposed memory generator reduces on-chip memory size by an average of 28% compared with the state-of-the-art, while achieving comparable hardware performance.
DA PUF: dual-state analog PUF
- Jiliang Zhang
- Lin Ding
- Zhuojun Chen
- Wenshang Li
- Gang Qu
A physical unclonable function (PUF) is a promising lightweight hardware security primitive that exploits process variations during chip fabrication for applications such as key generation and device authentication. Reliability of the PUF information plays a vital role and poses a major challenge for PUF design. In this paper, we propose a novel dual-state analog PUF (DA PUF), which has been successfully fabricated in a 55nm process. The 40,960 bits generated by the fabricated DA PUF pass the NIST randomness tests with reliability over 99.99% for a working environment of -40 to 125°C (temperature) and 0.96 to 1.44V (voltage), outperforming the two state-of-the-art analog PUFs reported in JSSC 2016 and 2021.
PathFinder: side channel protection through automatic leaky paths identification and obfuscation
- Haocheng Ma
- Qizhi Zhang
- Ya Gao
- Jiaji He
- Yiqiang Zhao
- Yier Jin
Side-channel analysis (SCA) attacks pose an enormous threat to cryptographic integrated circuits (ICs). To address this threat, designers adopt various countermeasures during the IC development process. However, many existing solutions are costly in terms of area, power, and/or performance, and may require full-custom circuit design for proper implementation. In this paper, we propose a tool, PathFinder, that automatically identifies leaky paths and protects the design, and is compatible with the commercial design flow. The tool first identifies the logic cells that leak the most information through dynamic correlation analysis. PathFinder then exploits static security checking to construct complete leaky paths from these cells. Once leaky paths are identified, PathFinder applies appropriate hardware countermeasures, including Boolean masking and random precharge, to eliminate information leakage from these paths. The effectiveness of PathFinder is validated through both simulation and physical measurements on FPGA implementations. Results demonstrate more than 1000X improvement in side-channel resistance, with less than 6.53% penalty in power, area, and performance.
LOCK&ROLL: deep-learning power side-channel attack mitigation using emerging reconfigurable devices and logic locking
- Gaurav Kolhe
- Tyler Sheaves
- Kevin Immanuel Gubbi
- Soheil Salehi
- Setareh Rafatirad
- Sai Manoj PD
- Avesta Sasan
- Houman Homayoun
The security and trustworthiness of ICs are challenged by the modern globalized semiconductor business model, which involves many steps performed at multiple locations by different providers and integrates various Intellectual Properties (IPs) from several vendors for faster time-to-market and cheaper fabrication. Many existing works have focused on mitigating the well-known SAT attack and its derivatives. Power Side-Channel Attacks (PSCAs), however, can retrieve the sensitive contents of an IP and can be leveraged to find the key that unlocks an obfuscated circuit without mounting powerful SAT attacks. To mitigate PSCAs and SAT attacks together, we propose a multi-layer defense mechanism called LOCK&ROLL: Deep-Learning Power Side-Channel Attack Mitigation using Emerging Reconfigurable Devices and Logic Locking. LOCK&ROLL utilizes our proposed Magnetic Random-Access Memory (MRAM)-based Look-Up Table, called the Symmetrical MRAM-LUT (SyM-LUT). Our simulation results using 45nm technology demonstrate that the SyM-LUT incurs a small overhead compared to a traditional Static Random Access Memory LUT (SRAM-LUT), with a standby energy consumption of 20aJ and 33fJ and 4.6fJ for write and read operations, respectively. LOCK&ROLL is resilient against various attacks such as SAT attacks, removal attacks, scan-and-shift attacks, and PSCAs.
Efficient access scheme for multi-bank based NTT architecture through conflict graph
- Xiangren Chen
- Bohan Yang
- Yong Lu
- Shouyi Yin
- Shaojun Wei
- Leibo Liu
Number Theoretic Transform (NTT) hardware accelerators have become crucial building blocks in many cryptosystems, such as post-quantum cryptography. In this paper, we provide new insights into the construction of conflict-free memory mapping schemes (CFMMS) for multi-bank NTT architectures. First, we present a parallel loop structure for arbitrary-radix NTT and propose two point-fetching modes. We then transform the conflict-free mapping problem into a conflict graph and develop a novel heuristic to explore the design space of CFMMS, which yields more efficient access schemes than classic works. To further validate the methodology, we design high-performance NTT/INTT kernels for Dilithium, whose area-time efficiency significantly outperforms state-of-the-art works on a similar FPGA platform.
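The conflict-graph view lends itself to a compact illustration: create a node per memory address, connect addresses that are accessed in the same cycle by parallel butterfly units, and color the graph so that conflicting addresses land in different banks. The radix-2 access schedule and the greedy coloring below are generic textbook choices, not the paper's heuristic or its point-fetching modes.

```python
from itertools import combinations

N, P = 16, 2                                  # NTT size, butterflies per cycle

def radix2_schedule(n, parallel):
    """Yield, per cycle, the set of addresses touched by `parallel` butterflies."""
    for stage_len in (2, 4, 8, 16):
        half = stage_len // 2
        butterflies = [(k + j, k + j + half)
                       for k in range(0, n, stage_len) for j in range(half)]
        for i in range(0, len(butterflies), parallel):
            group = butterflies[i:i + parallel]
            yield {addr for bf in group for addr in bf}

# Conflict graph: two addresses conflict if some cycle accesses both.
conflicts = {a: set() for a in range(N)}
for cycle_addrs in radix2_schedule(N, P):
    for a, b in combinations(cycle_addrs, 2):
        conflicts[a].add(b)
        conflicts[b].add(a)

# Greedy coloring: a color is a memory bank; conflicting addresses get different banks.
bank = {}
for addr in range(N):
    used = {bank[nbr] for nbr in conflicts[addr] if nbr in bank}
    bank[addr] = next(c for c in range(N) if c not in used)

print("banks used:", max(bank.values()) + 1)
print("address -> bank:", bank)
```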
InfoX: an energy-efficient ReRAM accelerator design with information-lossless low-bit ADCs
- Yintao He
- Songyun Qu
- Ying Wang
- Bing Li
- Huawei Li
- Xiaowei Li
ReRAM-based accelerators have shown great potential for neural network acceleration via in-memory analog computing. However, the high-precision analog-to-digital converters (ADCs) required by ReRAM crossbars for high-accuracy model inference play an essential role in the energy efficiency of the accelerators. Based on the discovery that ADC precision requirements differ across crossbars, we propose a model-aware, crossbar-wise ADC precision assignment together with accompanying information-lossless low-bit ADCs to reduce energy overhead without sacrificing model accuracy. In experiments, the proposed information-lossless ReRAM accelerator, InfoX, consumes only 8.97% of the ADC energy of the SOTA baseline with no accuracy degradation at all.
PHANES: ReRAM-based photonic accelerator for deep neural networks
- Yinyi Liu
- Jiaqi Liu
- Yuxiang Fu
- Shixi Chen
- Jiaxu Zhang
- Jiang Xu
Resistive random access memory (ReRAM) has demonstrated great promise for in-situ matrix-vector multiplication to accelerate deep neural networks. However, subject to the intrinsic properties of analog processing, most proposed ReRAM-based accelerators require excessive, costly ADCs/DACs to avoid distortion of analog electronic signals during inter-tile transmission. Moreover, because bit-shifting precedes addition, prior works require more cycles to serially compute partial sums than to perform multiplications, which dramatically restricts throughput and is likely to stall the pipeline between layers of deep neural networks.
In this paper, we present PHANES, a novel ReRAM-based photonic accelerator architecture that performs multiplications in ReRAM and parallel weighted accumulation during optical transmission. This photonic paradigm also serves as a high-fidelity analog-to-analog link that further reduces ADC/DAC usage. To circumvent the memory wall problem, we further propose a progressive bit-depth technique. Evaluations show that PHANES improves energy efficiency by 6.09x and throughput density by 14.7x compared to state-of-the-art designs. Our photonic architecture also has great potential to scale toward very-large-scale accelerators.
CP-SRAM: charge-pulsation SRAM macro for ultra-high energy-efficiency computing-in-memory
- He Zhang
- Linjun Jiang
- Jianxin Wu
- Tingran Chen
- Junzhan Liu
- Wang Kang
- Weisheng Zhao
SRAM-based computing-in-memory (SRAM-CIM) provides fast speed and good scalability with advanced process technology. However, the energy efficiency of state-of-the-art current-domain SRAM-CIM bit-cell structures is limited, and the peripheral circuitry (e.g., DAC/ADC) required for high precision is expensive. This paper proposes a charge-pulsation SRAM (CP-SRAM) structure that achieves ultra-high energy efficiency thanks to its charge-domain mechanism. Furthermore, our proposed CP-SRAM CIM supports configurable precision (2/4/6-bit). The CP-SRAM CIM macro was designed in 180nm (with silicon verification) and 40nm (simulation) nodes. Simulation results in 40nm show that our macro achieves energy efficiency of ~2950 TOPS/W at 2-bit precision, ~576.4 TOPS/W at 4-bit precision, and ~111.7 TOPS/W at 6-bit precision.
CREAM: computing in ReRAM-assisted energy and area-efficient SRAM for neural network acceleration
- Liukai Xu
- Songyuan Liu
- Zhi Li
- Dengfeng Wang
- Yiming Chen
- Yanan Sun
- Xueqing Li
- Weifeng He
- Shi Xu
Computing-in-memory (CIM) has been widely explored to accelerate DNNs. However, most existing CIM designs cannot store all NN weights due to the limited SRAM capacity of edge AI devices, inducing a large amount of off-chip DRAM access. In this paper, a new computing-in-ReRAM-assisted energy- and area-efficient SRAM (CREAM) is proposed for implementing large-scale NNs while eliminating off-chip DRAM access. The DNN weights are all stored in high-density on-chip ReRAM devices and restored to the proposed nvSRAM-CIM cells with array-level parallelism. A data-aware weight-mapping method is also proposed to enhance CIM performance while fully exploiting hardware utilization. Experimental results show that the proposed CREAM scheme enhances storage density by up to 7.94x compared to traditional SRAM arrays. The energy efficiency of CREAM is also enhanced by 2.14x and 1.99x compared to traditional SRAM-CIM with off-chip DRAM access and ReRAM-CIM circuits, respectively.
Chiplet actuary: a quantitative cost model and multi-chiplet architecture exploration
- Yinxiao Feng
- Kaisheng Ma
Multi-chip integration is widely recognized as an extension of Moore's Law. Cost saving is a frequently mentioned advantage, but previous works rarely present quantitative demonstrations of the cost superiority of multi-chip integration over monolithic SoCs. In this paper, we build a quantitative cost model and put forward an analytical method for multi-chip systems, based on three typical multi-chip integration technologies, to analyze the cost benefits of yield improvement, chiplet and package reuse, and heterogeneity. We re-examine the actual cost of multi-chip systems from various perspectives and show how to reduce the total cost of a VLSI system through an appropriate multi-chiplet architecture.
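While the paper's model covers packaging, reuse, and heterogeneity, the core yield-driven argument can be sketched with the standard negative-binomial die-yield formula: splitting one large die into several smaller chiplets raises per-die yield and can lower total silicon cost, at the price of extra packaging and bonding cost. The numbers and the simple additive packaging term below are illustrative assumptions, not the paper's calibrated model.

```python
import math

def die_yield(area_mm2, defect_density=0.001, alpha=3.0):
    """Negative-binomial yield model: Y = (1 + A*D0/alpha)^(-alpha).
    defect_density is in defects per mm^2 (~0.1 per cm^2)."""
    return (1.0 + area_mm2 * defect_density / alpha) ** (-alpha)

def cost_per_good_die(area_mm2, wafer_cost=10_000.0, wafer_diameter_mm=300.0):
    """Raw die cost divided by yield (edge and scribe losses ignored for brevity)."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    dies_per_wafer = wafer_area / area_mm2
    return wafer_cost / dies_per_wafer / die_yield(area_mm2)

# Monolithic 800 mm^2 SoC vs. four 200 mm^2 chiplets plus an assumed
# per-chiplet packaging/bonding adder.
monolithic = cost_per_good_die(800.0)
chiplets = 4 * cost_per_good_die(200.0) + 4 * 5.0   # 5 $/chiplet packaging (assumed)

print(f"monolithic SoC  : ${monolithic:8.2f}")
print(f"4-chiplet system: ${chiplets:8.2f}")
```

With these (made-up) parameters, the chiplet system comes out cheaper because the smaller dies yield much better, which is the qualitative effect the paper quantifies rigorously.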
PANORAMA: divide-and-conquer approach for mapping complex loop kernels on CGRA
- Dhananjaya Wijerathne
- Zhaoying Li
- Thilini Kaushalya Bandara
- Tulika Mitra
CGRAs are well suited as hardware accelerators due to their power efficiency and reconfigurability. However, their potential is limited by the inability of compilers to map complex loop kernels onto the architecture effectively. We propose PANORAMA, a fast and scalable compiler based on a divide-and-conquer approach that generates quality mappings of complex dataflow graphs (DFGs) representing loop bodies onto larger CGRAs. PANORAMA improves the throughput of the mapped loops by up to 2.6x with 8.7x faster compilation time compared to state-of-the-art techniques.
A fast parameter tuning framework via transfer learning and multi-objective bayesian optimization
- Zheng Zhang
- Tinghuan Chen
- Jiaxin Huang
- Meng Zhang
Design space exploration (DSE) can automatically and effectively determine design parameters that achieve optimal performance, power, and area (PPA) in very large-scale integration (VLSI) design. A lack of prior knowledge, however, leads to inefficient exploration. In this paper, a fast parameter tuning framework via transfer learning and multi-objective Bayesian optimization is proposed to quickly find the optimal design parameters. A Gaussian copula is utilized to establish correlations with the previously implemented technology. The prior knowledge is integrated into multi-objective Bayesian optimization by transforming the PPA data into residual observations. An uncertainty-aware search acquisition function is employed to explore the design space efficiently. Experiments on a CPU design show that this framework achieves a higher-quality Pareto frontier with fewer design-flow runs than state-of-the-art methodologies.
PriMax: maximizing DSL application performance with selective primitive acceleration
- Nicholas Wendt
- Todd Austin
- Valeria Bertacco
Domain-specific languages (DSLs) improve developer productivity by abstracting away low-level details of an algorithm’s implementation within a specialized domain. These languages often provide powerful primitives to describe complex operations, potentially granting flexibility during compilation to target hardware acceleration. This work proposes PriMax, a novel methodology to effectively map DSL applications to hardware accelerators. It builds decision trees based on benchmark results, which select between distinct implementations of accelerated primitives to maximize a target performance metric. In our graph analytics case study with two accelerators, PriMax produces a geometric mean speedup of 1.57x over a multicore CPU, higher than either target accelerator alone, and approaching the maximum 1.58x speedup attainable with these target accelerators.
Accelerating and pruning CNNs for semantic segmentation on FPGA
- Pierpaolo Morì
- Manoj-Rohit Vemparala
- Nael Fasfous
- Saptarshi Mitra
- Sreetama Sarkar
- Alexander Frickenstein
- Lukas Frickenstein
- Domenik Helms
- Naveen Shankar Nagaraja
- Walter Stechele
- Claudio Passerone
Semantic segmentation is one of the most popular tasks in computer vision, providing pixel-wise annotations for scene understanding. However, segmentation-oriented convolutional neural networks require tremendous computational power. In this work, a fully-pipelined hardware accelerator with support for dilated convolution is introduced, which eliminates redundant zero multiplications. Furthermore, we propose a genetic-algorithm-based automated channel pruning technique to jointly optimize computational complexity and model accuracy. Finally, hardware heuristics and an accurate model of the custom accelerator design enable a hardware-aware pruning framework. We achieve 2.44X lower latency with minimal degradation in semantic prediction quality (-1.98 pp lower mean intersection over union) compared to the baseline DeepLabV3+ model, evaluated on an Arria-10 FPGA. The binary files of the FPGA design and the baseline and pruned models can be found at github.com/pierpaolomori/SemanticSegmentationFPGA
SoftSNN: low-cost fault tolerance for spiking neural network accelerators under soft errors
- Rachmad Vidya Wicaksana Putra
- Muhammad Abdullah Hanif
- Muhammad Shafique
Specialized hardware accelerators have been designed and employed to maximize the performance efficiency of Spiking Neural Networks (SNNs). However, such accelerators are vulnerable to transient faults (i.e., soft errors), which occur due to high-energy particle strikes and manifest as bit flips at the hardware layer. These errors can change the weight values and neuron operations in the compute engine of SNN accelerators, leading to incorrect outputs and accuracy degradation. However, the impact of soft errors in the compute engine and the corresponding mitigation techniques have not yet been thoroughly studied for SNNs. A potential solution is employing redundant executions (re-execution) to ensure correct outputs, but this leads to huge latency and energy overheads. To this end, we propose SoftSNN, a novel methodology to mitigate soft errors in the weight registers (synapses) and neurons of SNN accelerators without re-execution, thereby maintaining accuracy with low latency and energy overheads. Our SoftSNN methodology employs the following key steps: (1) analyzing the SNN characteristics under soft errors to identify faulty weights and neuron operations, which are required for recognizing faulty SNN behavior; (2) a Bound-and-Protect technique that leverages this analysis to improve SNN fault tolerance by bounding the weight values and protecting the neurons from faulty operations; and (3) devising lightweight hardware enhancements for the neural hardware accelerator to efficiently support the proposed technique. The experimental results show that, for a 900-neuron network, even with a high fault rate, our SoftSNN maintains the accuracy degradation below 3% while reducing latency and energy by up to 3x and 2.3x, respectively, compared to the re-execution technique.
A joint management middleware to improve training performance of deep recommendation systems with SSDs
- Chun-Feng Wu
- Carole-Jean Wu
- Gu-Yeon Wei
- David Brooks
As the sizes and variety of training data scale over time, data preprocessing is becoming an important performance bottleneck for training deep recommendation systems. This challenge becomes more serious when training data is stored in Solid-State Drives (SSDs). Due to the access behavior gap between recommendation systems and SSDs, unused training data may be read and filtered out during preprocessing. This work advocates a joint management middleware to avoid reading unused data by bridging the access behavior gap. The evaluation results show that our middleware can effectively improve the performance of the data preprocessing phase so as to boost training performance.
The larger the fairer?: small neural networks can achieve fairness for edge devices
- Yi Sheng
- Junhuan Yang
- Yawen Wu
- Kevin Mao
- Yiyu Shi
- Jingtong Hu
- Weiwen Jiang
- Lei Yang
Along with the progress of AI democratization, neural networks are being deployed more frequently on edge devices for a wide range of applications. Fairness concerns are gradually emerging in many of these applications, such as face recognition and mobile medical services. One fundamental question arises: what is the fairest neural architecture for edge devices? By examining existing neural networks, we observe that larger networks are typically fairer. However, edge devices call for smaller neural architectures to meet hardware specifications. To address this challenge, this work proposes a novel Fairness- and Hardware-aware Neural architecture search framework, namely FaHaNa. Coupled with a model freezing approach, FaHaNa can efficiently search for neural networks with balanced fairness and accuracy while guaranteeing that hardware specifications are met. Results show that FaHaNa identifies a series of neural networks with higher fairness and accuracy on a dermatology dataset. Targeting edge devices, FaHaNa finds a neural architecture with slightly higher accuracy, a 5.28X smaller size, and a 15.14% higher fairness score compared with MobileNetV2; meanwhile, on the Raspberry Pi and Odroid XU-4, it achieves 5.75X and 5.79X speedups, respectively.
SCAIE-V: an open-source SCAlable interface for ISA extensions for RISC-V processors
- Mihaela Damian
- Julian Oppermann
- Christoph Spang
- Andreas Koch
Custom instructions extending a base ISA are often used to increase performance. However, only a few cores provide open interfaces for integrating such ISA Extensions (ISAXes). In addition, the degree to which a core's capabilities are exposed for extension varies wildly between interfaces. Thus, even when using open-source cores, the lack of standardized ISAX interfaces typically causes high engineering effort when implementing or porting ISAXes. We present SCAIE-V, a highly portable and feature-rich ISAX interface that supports custom control flow, decoupled execution, multi-cycle instructions, and memory transactions. The cost of the interface itself scales with the complexity of the ISAXes actually used.
A scalable symbolic simulation tool for low power embedded systems
- Subhash Sethumurugan
- Shashank Hegde
- Hari Cherupalli
- John Sartori
Recent work has demonstrated the effectiveness of using symbolic simulation to perform hardware-software co-analysis on an application-processor pair and has developed a variety of hardware and software design techniques and optimizations, ranging from providing system security guarantees to the automated generation of application-specific bespoke processors. Despite their potential benefits, current state-of-the-art symbolic simulation tools for hardware-software co-analysis are restricted in their applicability, since prior work relies on a costly process of building a custom simulation tool for each processor design to be simulated. Furthermore, prior work does not describe how to extend the symbolic analysis technique to other processor designs.
In an effort to generalize the technique for any processor design, we propose a custom symbolic simulator that uses iverilog to perform symbolic behavioral simulation. With iverilog – an open source synthesis and simulation tool – we implement a design-agnostic symbolic simulation tool for hardware-software co-analysis. To demonstrate the generality of our tool, we apply symbolic analysis to three embedded processors with different ISAs: bm32 (a MIPS-based processor), darkRiscV (a RISC-V-based processor), and openMSP430 (based on MSP430). We use analysis results to generate bespoke processors for each design and observe gate count reductions of 27%, 16%, and 56% on these processors, respectively. Our results demonstrate the versatility of our simulation tool and the uniqueness of each design with respect to symbolic analysis and the bespoke methodology.
Designing critical systems with iterative automated safety analysis
- Ran Wei
- Zhe Jiang
- Xiaoran Guo
- Haitao Mei
- Athanasios Zolotas
- Tim Kelly
Safety analysis is an important aspect of Safety-Critical Systems Engineering (SCSE) for discovering design problems that can potentially lead to hazards and, eventually, accidents. Performing safety analysis requires significant manual effort; its automation has become a research focus in the critical-systems domain due to the increasing complexity of systems and the emergence of open adaptive systems. In this paper, we present a methodology in which automated safety analysis drives the design of safety-critical systems. We discuss our approach with its tool support and evaluate its applicability. We also briefly discuss how our approach fits into the current practice of SCSE.
Efficient ensembles of graph neural networks
- Amrit Nagarajan
- Jacob R. Stevens
- Anand Raghunathan
Ensembles improve the accuracy and robustness of Graph Neural Networks (GNNs), but suffer from high latency and storage requirements. To address this challenge, we propose GNN Ensembles through Error Node Isolation (GEENI). The key concept in GEENI is to identify nodes that are likely to be incorrectly classified (error nodes) and suppress their outgoing messages, leading to simultaneous accuracy and efficiency improvements. GEENI also enables aggressive approximations of the constituent models in the ensemble while maintaining accuracy. To improve the efficacy of GEENI, we propose techniques for diverse ensemble creation and accurate error node identification. Our experiments establish that GEENI models are simultaneously up to 4.6% (3.8%) more accurate and up to 2.8X (5.7X) faster compared to non-ensemble (conventional ensemble) GNN models.
Sign bit is enough: a learning synchronization framework for multi-hop all-reduce with ultimate compression
- Feijie Wu
- Shiqi He
- Song Guo
- Zhihao Qu
- Haozhao Wang
- Weihua Zhuang
- Jie Zhang
Traditional one-bit compressed stochastic gradient descent cannot be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. According to our theoretical findings, the cascading compression causes considerable deterioration in convergence performance. To overcome this limitation, we implement a sign-bit compression-based learning synchronization framework, Marsit. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation and a specific global compensation mechanism for mitigating compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces training time by up to 35% while preserving the same accuracy as training without compression.
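The core of sign-bit synchronization is that each worker contributes only the signs of its gradient, and the aggregate is an element-wise majority vote, so the message stays one bit wide. The sketch below shows that majority-vote aggregation together with a toy local error-compensation buffer; Marsit's actual bit-wise packing across all-reduce hops and its global compensation mechanism are more elaborate than this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
workers, dim, lr = 8, 1000, 0.05

# Each worker observes the true gradient direction with noise.
true_grad = rng.normal(size=dim)
local_grads = true_grad + 0.5 * rng.normal(size=(workers, dim))

# Each worker sends only sign bits; a local residual buffer compensates what the
# 1-bit message could not express (simple error feedback, assumed here).
residual = np.zeros((workers, dim))
compensated = local_grads + residual
signs = np.sign(compensated)                          # 1 bit per element per worker
residual = compensated - signs                        # kept locally for the next round

# Aggregation is an element-wise majority vote over the workers' sign bits
# (ties broken toward +1), so every hop forwards only 1-bit values.
aggregate = np.where(signs.sum(axis=0) >= 0, 1.0, -1.0)

agreement = np.mean(np.sign(true_grad) == aggregate)
print(f"sign agreement with the true gradient: {agreement:.2%}")
update = -lr * aggregate                              # signSGD-style model update
```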
GLite: a fast and efficient automatic graph-level optimizer for large-scale DNNs
- Jiaqi Li
- Min Peng
- Qingan Li
- Meizheng Peng
- Mengting Yuan
We propose a scalable graph-level optimizer named GLite to speed up search-based optimizations on large neural networks. GLite leverages a potential-based partitioning strategy to partition large computation graphs into small subgraphs without losing profitable substitution patterns. To avoid redundant subgraph matching, we propose a dynamic programming algorithm that reuses explored matching patterns. The experimental results show that GLite reduces the running time of search-based optimizations from hours to milliseconds without compromising inference performance.
Contrastive quant: quantization makes stronger contrastive learning
- Yonggan Fu
- Qixuan Yu
- Meng Li
- Xu Ouyang
- Vikas Chandra
- Yingyan Lin
Contrastive learning learns visual representations by enforcing feature consistency under different augmented views. In this work, we explore contrastive learning from a new perspective. Interestingly, we find that quantization, when properly engineered, can enhance the effectiveness of contrastive learning. To this end, we propose a novel contrastive learning framework, dubbed Contrastive Quant, to encourage feature consistency under both differently augmented inputs via various data transformations and differently augmented weights/activations via various quantization levels. Extensive experiments, built on top of two state-of-the-art contrastive learning methods SimCLR and BYOL, show that Contrastive Quant consistently improves the learned visual representation.
Serpens: a high bandwidth memory based accelerator for general-purpose sparse matrix-vector multiplication
- Linghao Song
- Yuze Chi
- Licheng Guo
- Jason Cong
Sparse matrix-vector multiplication (SpMV) multiplies a sparse matrix with a dense vector. SpMV plays a crucial role in many applications, from graph analytics to deep learning. The random memory accesses of the sparse matrix make accelerator design challenging. However, high bandwidth memory (HBM) based FPGAs are a good fit for designing SpMV accelerators. In this paper, we present Serpens, an HBM-based accelerator for general-purpose SpMV, which features memory-centric processing engines and index coalescing to support the efficient processing of arbitrary SpMVs. On twelve large-size matrices, Serpens achieves 1.91x and 1.76x higher geomean throughput than the latest accelerators GraphLily and Sextans, respectively. On 2,519 SuiteSparse matrices, Serpens achieves 2.10x higher throughput than a K80 GPU. In terms of energy/bandwidth efficiency, Serpens is 1.71x/1.99x, 1.90x/2.69x, and 6.25x/4.06x better than GraphLily, Sextans, and the K80, respectively. After scaling up to 24 HBM channels, Serpens achieves up to 60.55 GFLOP/s (30,204 MTEPS) and up to a 3.79x speedup over GraphLily. The code is available at https://github.com/UCLA-VAST/Serpens.
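For reference, the computation Serpens accelerates is plain sparse matrix-vector multiplication over a compressed sparse row (CSR) matrix, as in the short NumPy sketch below; the accelerator's contribution lies in how the random accesses to the dense vector are scheduled across HBM channels, which this software sketch does not model.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix (values, col_idx, row_ptr)."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        # Gathering x[col_idx[...]] is the random-access pattern that makes
        # SpMV hard to accelerate and that HBM banking helps with.
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

# 3x4 example matrix:
# [[5 0 0 1]
#  [0 2 0 0]
#  [0 0 3 4]]
values = np.array([5.0, 1.0, 2.0, 3.0, 4.0])
col_idx = np.array([0, 3, 1, 2, 3])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 2.0, 3.0, 4.0])

print(spmv_csr(values, col_idx, row_ptr, x))         # [ 9.  4. 25.]
```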
An energy-efficient seizure detection processor using event-driven multi-stage CNN classification and segmented data processing with adaptive channel selection
- Jiahao Liu
- Zirui Zhong
- Yong Zhou
- Hui Qiu
- Jianbiao Xiao
- Jiajing Fan
- Zhaomin Zhang
- Sixu Li
- Yiming Xu
- Siqi Yang
- Weiwei Shan
- Shuisheng Lin
- Liang Chang
- Jun Zhou
Recently, wearable EEG monitoring devices with seizure detection processors based on convolutional neural networks (CNNs) have been proposed to detect seizure onset in patients in real time for alert or stimulation purposes. High energy efficiency and accuracy are required for the seizure detection processor due to the tight energy constraints of wearable devices. However, the use of CNNs and the multi-channel processing nature of seizure detection result in significant energy consumption. In this work, an energy-efficient seizure detection processor is proposed, featuring multi-stage CNN classification, segmented data processing, and adaptive channel selection to reduce energy consumption while achieving high accuracy. The design has been fabricated and tested in a 55nm process technology. Compared with several state-of-the-art designs, the proposed design achieves the lowest energy per classification (0.32 μJ) with high sensitivity (97.78%) and a low false positive rate per hour (0.5).
PatterNet: explore and exploit filter patterns for efficient deep neural networks
- Behnam Khaleghi
- Uday Mallappa
- Duygu Yaldiz
- Haichao Yang
- Monil Shah
- Jaeyoung Kang
- Tajana Rosing
Weight clustering is an effective technique for compressing deep neural network (DNN) memory by using a limited number of unique weights and low-bit weight indexes to store clustering information. In this paper, we propose PatterNet, which enforces shared clustering topologies on filters. Cluster sharing leads to a greater extent of memory reduction by reusing the index information. PatterNet effectively factorizes input activations and post-processes the unique weights, which saves multiplications by several orders of magnitude. Furthermore, PatterNet reduces the add operations by harnessing the fact that filters sharing a clustering pattern have the same factorized terms. We introduce techniques for determining and assigning clustering patterns and training a network to fulfill the target patterns. We also propose and implement an efficient accelerator that builds upon the patterned filters. Experimental results show that PatterNet shrinks the memory and operation count by up to 80.2% and 73.1%, respectively, with accuracy similar to the baseline models. The PatterNet accelerator improves energy efficiency by 107x over an Nvidia 1080 GTX GPU and by 2.2x over the state of the art.
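The multiplication savings come from a simple algebraic factorization: if a filter's weights take only a few unique (clustered) values, the dot product can be computed by first summing the activations that share a cluster index and then doing one multiplication per unique weight. The sketch below shows that factorization for a single filter; the shared-pattern training and the accelerator design are beyond this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64                                        # filter length
k = 4                                         # unique clustered weights per filter

# Clustered filter: each weight is one of k centroids, stored as a small index.
centroids = rng.normal(size=k)
indexes = rng.integers(0, k, size=n)          # 2-bit indexes instead of full weights
weights = centroids[indexes]
activations = rng.normal(size=n)

# Naive dot product: n multiplications.
naive = float(weights @ activations)

# Factorized form: group-sum activations per cluster index (adds only),
# then multiply once per centroid -- k multiplications instead of n.
group_sums = np.zeros(k)
np.add.at(group_sums, indexes, activations)
factorized = float(centroids @ group_sums)

print(naive, factorized, "multiplications:", n, "->", k)
```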
E2SR: an end-to-end video CODEC assisted system for super resolution acceleration
- Zhuoran Song
- Zhongkai Yu
- Naifeng Jing
- Xiaoyao Liang
Nowadays, high-resolution (HR) video is a popular choice for a better viewing experience. Recent works have shown that super-resolution (SR) algorithms can provide superior-quality HR video by applying a deep neural network (DNN) to each low-resolution (LR) frame. Obviously, such per-frame DNN processing is compute-intensive and hampers the deployment of SR algorithms on mobile devices. Although many accelerators have been proposed, they focus only on the mobile device. In contrast, we notice that the HR video is originally stored on the cloud server and can be exploited to gain accuracy and performance improvements. Based on this observation, this paper proposes an end-to-end video CODEC assisted system (E2SR), which tightly couples the cloud server with the device to deliver a smooth and real-time video viewing experience. We propose a motion vector search algorithm executed on the cloud server, which searches the motion vectors and residuals for part of the HR video frames and packs them as addons. We further propose a reconstruction algorithm executed on the device to quickly reconstruct the corresponding HR frames using the addons, skipping part of the DNN computations. We design the corresponding E2SR architecture to enable the reconstruction algorithm in the device, which achieves significant speedup with minimal hardware overhead. Our experimental results show that the E2SR system achieves 3.4x performance improvement with less than 0.56 PSNR loss compared with the state-of-the-art EDVR scheme.
MATCHA: a fast and energy-efficient accelerator for fully homomorphic encryption over the torus
- Lei Jiang
- Qian Lou
- Nrushad Joshi
Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to happen directly on ciphertexts using homomorphic logic gates. However, each TFHE gate on state-of-the-art hardware platforms such as GPUs and FPGAs is extremely slow (> 0.2ms). Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating-point FFT and IFFT kernels. In this paper, we propose MATCHA, a fast and energy-efficient accelerator for TFHE gates. MATCHA supports aggressive bootstrapping key unrolling to accelerate TFHE gates without decryption errors, using approximate multiplication-less integer FFTs and IFFTs and a pipelined datapath. Compared to prior accelerators, MATCHA improves the TFHE gate processing throughput by 2.3x and the throughput per Watt by 6.3x.
VirTEE: a full backward-compatible TEE with native live migration and secure I/O
- Jianqiang Wang
- Pouya Mahmoody
- Ferdinand Brasser
- Patrick Jauernig
- Ahmad-Reza Sadeghi
- Donghui Yu
- Dahan Pan
- Yuanyuan Zhang
Modern security architectures provide Trusted Execution Environments (TEEs) to protect critical data and applications against malicious privileged software in so-called enclaves. However, the seamless integration of existing TEEs into the cloud is hindered, as they require substantial adaptation of the software executing inside an enclave as well as the cloud management software to handle enclaved workloads. We tackle these challenges by presenting VirTEE, the first TEE architecture that allows strongly isolated execution of unmodified virtual machines (VMs) in enclaves, as well as secure live migration of VM enclaves between VirTEE-enabled servers. Combined with its secure I/O capabilities, VirTEE enables the integration of enclaved computing in today’s complex cloud infrastructure. We thoroughly evaluate our RISC-V-based prototype, and show its effectiveness and efficiency.
Apple vs. EMA: electromagnetic side channel attacks on apple CoreCrypto
- Gregor Haas
- Aydin Aysu
Cryptographic instruction set extensions are commonly used for ciphers which would otherwise face unacceptable side channel risks. A prominent example of such an extension is the ARMv8 Cryptographic Extension, or ARM CE for short, which defines dedicated instructions to securely accelerate AES. However, while these extensions may be resistant to traditional “digital” side channel attacks, they may still be vulnerable to physical side channel attacks.
In this work, we demonstrate the first such attack on a standard ARM CE AES implementation. We specifically focus on the implementation used by Apple’s CoreCrypto library which we run on the Apple A10 Fusion SoC. To that end, we implement an optimized side channel acquisition infrastructure involving both custom iPhone software and accelerated analysis code. We find that an adversary which can observe 5–30 million known-ciphertext traces can reliably extract secret AES keys using electromagnetic (EM) radiation as a side channel. This corresponds to an encryption operation on less than half of a gigabyte of data, which could be acquired in less than 2 seconds on the iPhone 7 we examined. Our attack thus highlights the need for side channel defenses for real devices and production, industry-standard encryption software.
Algorithm/architecture co-design for energy-efficient acceleration of multi-task DNN
- Jaekang Shin
- Seungkyu Choi
- Jongwoo Ra
- Lee-Sup Kim
Real-world AI applications, such as augmented reality and autonomous driving, require processing multiple CV tasks simultaneously. However, the enormous data size and memory footprint have been crucial hurdles to applying deep neural networks on resource-constrained devices. To solve this problem, we propose an algorithm/architecture co-design. The proposed algorithmic scheme, named SqueeD, reduces per-task weight and activation size by 21.9x and 2.1x, respectively, by sharing these data between tasks. Moreover, we design the architecture and dataflow to minimize DRAM access by fully utilizing the benefits of SqueeD. As a result, the proposed architecture reduces the DRAM access increment and energy consumption increment per task by 2.2x and 1.3x, respectively.
EBSP: evolving bit sparsity patterns for hardware-friendly inference of quantized deep neural networks
- Fangxin Liu
- Wenbo Zhao
- Zongwu Wang
- Yongbiao Chen
- Zhezhi He
- Naifeng Jing
- Xiaoyao Liang
- Li Jiang
Model compression has been extensively investigated for supporting efficient neural network inference on edge-computing platforms because of huge model sizes and computation demands. Recent research embraces joint-way compression across multiple techniques for extreme compression. However, most joint-way methods adopt a naive solution that applies two approaches sequentially, which can be sub-optimal, as it lacks a systematic approach to integrating them.
This paper proposes the integration of aggressive joint-way compression into hardware design, namely EBSP. It is motivated by three observations: 1) quantization simplifies hardware implementations; 2) the bit distribution of quantized weights can be viewed as an independent trainable variable; and 3) exploiting bit sparsity in the quantized network has the potential to achieve better performance. To that end, this paper introduces bit sparsity patterns to construct a bit distribution in the quantized network that is both highly expressive and inherently regular. We further incorporate the sparsity constraint into training to evolve the bit distributions toward the bit sparsity pattern. Moreover, the structure of the introduced bit sparsity pattern enables a minimal hardware implementation at competitive classification accuracy. Specifically, a quantized network constrained by the bit sparsity pattern can be processed using LUTs with the fewest entries instead of multipliers in minimally modified computational hardware. Our experiments show that, compared to Eyeriss, BitFusion, WAX, and OLAccel, EBSP achieves 87.3%, 79.7%, 75.2%, and 58.9% energy reduction and 93.8%, 83.7%, 72.7%, and 49.5% performance gain on average, respectively, with less than 0.8% accuracy loss.
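"Bit sparsity" here means that each quantized weight contains only a few nonzero bits, so a multiplication degenerates into a handful of shifts and adds (or a small LUT lookup). The greedy rounding below, which approximates a weight by at most k signed powers of two, is one generic way to obtain such bit-sparse values and is only meant to illustrate the constraint EBSP trains the network to satisfy, not the paper's pattern construction.

```python
import numpy as np

def to_bit_sparse(w, max_terms=2):
    """Greedy approximation of w by at most `max_terms` signed powers of two."""
    terms, residual = [], float(w)
    for _ in range(max_terms):
        if residual == 0.0:
            break
        exp = int(np.round(np.log2(abs(residual))))
        term = np.sign(residual) * 2.0 ** exp
        terms.append(term)
        residual -= term
    return terms

def bit_sparse_multiply(x, terms):
    """Multiplication becomes a few shift-and-add style operations."""
    return sum(t * x for t in terms)           # each t*x is a power-of-two scaling

w, x = 0.8125, 3.0
terms = to_bit_sparse(w, max_terms=2)
print("terms:", terms)                         # e.g. [1.0, -0.25] approximates 0.8125
print("exact:", w * x, "bit-sparse:", bit_sparse_multiply(x, terms))
```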
A time-to-first-spike coding and conversion aware training for energy-efficient deep spiking neural network processor design
- Dongwoo Lew
- Kyungchul Lee
- Jongsun Park
In this paper, we present an energy-efficient SNN architecture that can seamlessly run deep spiking neural networks (SNNs) with improved accuracy. First, we propose conversion-aware training (CAT) to reduce ANN-to-SNN conversion loss without hardware implementation overhead. In the proposed CAT, the activation function developed for simulating the SNN during ANN training is efficiently exploited to reduce the data representation error after conversion. Based on the CAT technique, we also present a time-to-first-spike coding that allows lightweight logarithmic computation by utilizing spike time information. An SNN processor supporting the proposed techniques has been implemented in a 28nm CMOS process. The processor achieves top-1 accuracies of 91.7%, 67.9%, and 57.4% with inference energy of 486.7uJ, 503.6uJ, and 1426uJ for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, when running VGG-16 with 5-bit logarithmic weights.
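Time-to-first-spike coding carries a value in when a neuron fires rather than in how many spikes it emits, which pairs naturally with logarithmic processing: larger activations fire earlier. The sketch below uses a simple logarithmic latency mapping and its inverse; the exact coding function, time window, and weight format of the fabricated processor are not reproduced here.

```python
import numpy as np

T = 16                                        # discrete time steps in the window
EPS = 1.0 / (1 << 8)                          # smallest representable activation

def encode_ttfs(a):
    """Map activation a in (0, 1] to a first-spike time: larger value -> earlier."""
    a = np.clip(a, EPS, 1.0)
    # Logarithmic latency code: t = round(-log2(a) * T / log2(1/EPS)), in [0, T].
    return np.round(-np.log2(a) * T / np.log2(1.0 / EPS)).astype(int)

def decode_ttfs(t):
    """Inverse mapping from spike time back to an (approximate) activation."""
    return 2.0 ** (-t * np.log2(1.0 / EPS) / T)

acts = np.array([1.0, 0.5, 0.25, 0.1, 0.01])
times = encode_ttfs(acts)
print("spike times  :", times)                # earlier (smaller t) for larger values
print("reconstructed:", np.round(decode_ttfs(times), 3))
```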
XMA: a crossbar-aware multi-task adaption framework via shift-based mask learning method
- Fan Zhang
- Li Yang
- Jian Meng
- Jae-sun Seo
- Yu (Kevin) Cao
- Deliang Fan
The ReRAM crossbar array, as a highly parallel, fast, and energy-efficient structure, has attracted much attention, especially for accelerating Deep Neural Network (DNN) inference on a single specific task. However, due to the high energy consumption of weight re-programming and the low endurance of ReRAM cells, adapting the crossbar array to multiple tasks has not been well explored. In this paper, we propose XMA, a novel crossbar-aware shift-based mask learning method for multi-task adaptation in ReRAM crossbar DNN accelerators, for the first time. XMA leverages the benefits of the popular mask-based learning algorithm to mitigate catastrophic forgetting and learns a task-specific, crossbar column-wise, shift-based multi-level mask, rather than the most commonly used element-wise binary mask, for each new task based on a frozen backbone model. With our crossbar-aware design innovation, the masking operation required to adapt to a new task can be implemented in an existing crossbar-based convolution engine with minimal hardware/memory overhead and, more importantly, without power-hungry cell re-programming, unlike prior works. Extensive experimental results show that, compared with the state-of-the-art multi-task adaptation method Piggyback [1], XMA achieves 3.19% higher accuracy on average while saving 96.6% memory overhead. Moreover, by eliminating cell re-programming, XMA achieves ~4.3x higher energy efficiency than Piggyback.
SWIM: selective write-verify for computing-in-memory neural accelerators
- Zheyu Yan
- Xiaobo Sharon Hu
- Yiyu Shi
Computing-in-Memory architectures based on non-volatile emerging memories have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, these emerging devices can suffer from significant variations during the mapping process (i.e., programming the weights to the devices), and if left unaddressed, these variations can cause significant accuracy degradation. The non-ideality of weight mapping can be compensated by iterative programming with a write-verify scheme, i.e., reading the conductance and rewriting if necessary. In all existing works, this practice is applied to every single weight of a DNN as it is being mapped, which requires extensive programming time. In this work, we show that it is only necessary to select a small portion of the weights for write-verify to maintain DNN accuracy, thus achieving significant speedup. We further introduce SWIM, a second-derivative-based technique that requires only a single pass of forward and backpropagation to efficiently select the weights that need write-verify. Experimental results on various DNN architectures and datasets show that SWIM can achieve up to 10x programming speedup compared with conventional full-blown write-verify while attaining comparable accuracy.
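The selective write-verify idea can be illustrated with a small experiment: score each weight with a second-derivative-style sensitivity, write-verify only the top fraction, and leave the rest with their noisy programmed values. The toy quadratic loss, the exact diagonal curvature used as the score, and the noise model below are assumptions for illustration; SWIM's single-pass second-derivative computation on real DNNs is what the paper describes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Toy model: quadratic loss around trained weights w_star with known curvature,
# so the per-weight second derivative h is exact (SWIM estimates it in one pass).
w_star = rng.normal(size=n)
h = 10 ** rng.uniform(-2, 1, size=n)          # per-weight second derivatives
loss = lambda w: 0.5 * np.sum(h * (w - w_star) ** 2)

# Programming the weights onto devices adds variation (write noise).
sigma = 0.05
programmed = w_star + sigma * rng.normal(size=n)

def selective_write_verify(scores, budget=0.1):
    """Write-verify (i.e., correct) only the weights with the largest scores."""
    k = int(budget * n)
    verified = programmed.copy()
    top = np.argsort(scores)[-k:]
    verified[top] = w_star[top]               # result of iterative program-and-verify
    return verified

full_noise = loss(programmed)
curvature_sel = loss(selective_write_verify(h))                 # curvature-guided
random_sel = loss(selective_write_verify(rng.permutation(n)))   # same budget, random

print(f"no verify: {full_noise:.4f}  curvature top-10%: {curvature_sel:.4f}  "
      f"random 10%: {random_sel:.4f}")
```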
Enabling efficient deep convolutional neural network-based sensor fusion for autonomous driving
- Xiaoming Zeng
- Zhendong Wang
- Yang Hu
Autonomous driving demands accurate perception and safe decision-making. To achieve this, automated vehicles are typically equipped with multiple sensors (e.g., cameras, Lidar), enabling them to exploit complementary environmental contexts by fusing data from different sensing modalities. With the success of Deep Convolutional Neural Networks (DCNNs), fusion between multiple DCNNs has proved to be a promising strategy for achieving satisfactory perception accuracy. However, existing mainstream DCNN fusion strategies simply add feature maps extracted from different modalities element-wise at various stages, failing to consider whether the features being fused are matched or not. Therefore, we first propose a feature disparity metric to quantitatively measure the degree of disparity between the feature maps being fused. We then propose a Fusion-filter as a feature-matching technique to tackle the feature-mismatching issue. We also propose a Layer-sharing technique in the deep layers of the DCNN to achieve high accuracy. With the assistance of feature disparity serving as an additional loss, our proposed techniques enable the DCNN to learn corresponding feature maps with similar characteristics and complementary visual context from different modalities. Evaluations demonstrate that our proposed fusion techniques achieve higher accuracy on the KITTI dataset with lower computational resource consumption.
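As a hypothetical example of what a disparity measure between two modalities' feature maps could look like and how it might enter the loss (the normalized-difference form and the loss weight are our assumptions, not the paper's definition):

```python
import numpy as np

def feature_disparity(f_a, f_b, eps=1e-8):
    """Toy disparity metric: mean normalized absolute difference between two
    feature maps of identical shape. Lower values suggest the features being
    fused are better matched."""
    return float(np.mean(np.abs(f_a - f_b) / (np.abs(f_a) + np.abs(f_b) + eps)))

rng = np.random.default_rng(0)
f_camera = rng.random((64, 32, 32))   # camera-branch feature map (C, H, W)
f_lidar = rng.random((64, 32, 32))    # lidar-branch feature map
d = feature_disparity(f_camera, f_lidar)

task_loss = 1.23                      # placeholder detection loss
total_loss = task_loss + 0.1 * d      # disparity used as an auxiliary loss term
print(d, total_loss)
```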
Zhuyi: perception processing rate estimation for safety in autonomous vehicles
- Yu-Shun Hsiao
- Siva Kumar Sastry Hari
- Michał Filipiuk
- Timothy Tsai
- Michael B. Sullivan
- Vijay Janapa Reddi
- Vasu Singh
- Stephen W. Keckler
The processing requirement of autonomous vehicles (AVs) for high-accuracy perception in complex scenarios can exceed the resources offered by the in-vehicle computer, degrading safety and comfort. This paper proposes a sensor frame processing rate (FPR) estimation model, Zhuyi, that quantifies the minimum safe FPR continuously in a driving scenario. Zhuyi can be employed post-deployment as an online safety check and to prioritize work. Experiments conducted using a multi-camera state-of-the-art industry AV system show that Zhuyi’s estimated FPRs are conservative, yet the system can maintain safety by processing only 36% or fewer frames compared to a default 30-FPR system in the tested scenarios.
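A purely back-of-the-envelope way to reason about a minimum safe frame processing rate is a kinematic budget like the one below; this simplified model is our own illustration and is not Zhuyi's estimator, which is scenario-driven and validated on an industry AV system:

```python
def min_safe_fpr(ego_speed_mps, obstacle_distance_m,
                 decel_mps2=6.0, actuation_delay_s=0.1):
    """Toy estimate: the perception pipeline must deliver a fresh frame before
    the distance margin left after braking distance and actuation delay is
    consumed at the current speed."""
    braking_distance = ego_speed_mps ** 2 / (2.0 * decel_mps2)
    margin = obstacle_distance_m - braking_distance - ego_speed_mps * actuation_delay_s
    if margin <= 0.0:
        return float("inf")            # already unsafe: process as fast as possible
    reaction_budget_s = margin / ego_speed_mps
    return 1.0 / reaction_budget_s     # required frames per second

print(min_safe_fpr(ego_speed_mps=20.0, obstacle_distance_m=60.0))  # ~0.81 FPS
```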
Processing-in-SRAM acceleration for ultra-low power visual 3D perception
- Yuquan He
- Songyun Qu
- Gangliang Lin
- Cheng Liu
- Lei Zhang
- Ying Wang
Real-time ego-motion tracking and 3D structure estimation are fundamental tasks for ubiquitous cyber-physical systems, and they can be performed via the state-of-the-art Edge-Based Visual Odometry (EBVO) algorithm. However, the intrinsically data-intensive nature of EBVO creates a memory-wall hurdle for practical deployment on conventional von Neumann computing systems. In this work, we leverage an SRAM-based processing-in-memory (PIM) technique to alleviate this memory-wall bottleneck and optimize EBVO systematically from the perspectives of the algorithm layer and the physical layer. In the algorithm layer, we first investigate the data reuse patterns of the essential computing kernels required for the feature detection and pose estimation steps in EBVO, and propose a PIM-friendly data layout and computing scheme for each kernel accordingly. We distill the basic logical and arithmetic operations required by the algorithm layer, and in the physical layer we propose a novel bit-parallel and reconfigurable SRAM-PIM architecture to realize these operations with high computing precision and throughput. Our experimental results show that the proposed multi-layer optimization maintains high tracking accuracy for EBVO while improving processing speed by 11x and reducing energy consumption by 20x compared to a CPU implementation.
Response time analysis for dynamic priority scheduling in ROS2
- Abdullah Al Arafat
- Sudharsan Vaidhun
- Kurt M. Wilson
- Jinghao Sun
- Zhishan Guo
Robot Operating System (ROS) is the most popular framework for developing robotics software. Typically, robotics software is safety-critical and employed in real-time systems requiring timing guarantees. Since the first generation of ROS provides no timing guarantees, the recent release of its second generation, ROS2, is necessary and timely, and has since received immense attention from practitioners and researchers. Unfortunately, existing analyses of ROS2 revealed that the peculiar scheduling strategy of the ROS2 executor severely affects the response time of ROS2 applications. This paper proposes a deadline-based scheduling strategy for the ROS2 executor. It further presents an analysis of the end-to-end response time of ROS2 workloads (processing chains) and an evaluation of the proposed scheduling strategy on real workloads.
Voltage prediction of drone battery reflecting internal temperature
- Jiwon Kim
- Seunghyeok Jeon
- Jaehyun Kim
- Hojung Cha
Drones are commonly used in mission-critical applications, and the accurate estimation of available battery capacity before flight is critical for reliable and efficient mission planning. To this end, the battery voltage should be predicted accurately prior to launching a drone. However, in drone applications, a rise in the battery’s internal temperature changes the voltage significantly and leads to challenges in voltage prediction. In this paper, we propose a battery voltage prediction method that takes into account the battery’s internal temperature to accurately estimate the available capacity of the drone battery. To this end, we devise a temporal temperature factor (TTF) metric that is calculated by accumulating time series data about the battery’s discharge history. We employ a machine learning-based prediction model, reflecting the TTF metric, to achieve high prediction accuracy and low complexity. We validated the accuracy and complexity of our model with extensive evaluation. The results show that the proposed model is accurate with less than 1.5% error and readily operates on resource-constrained embedded devices.
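A rough sketch of how a discharge-history feature could be accumulated and fed to a lightweight predictor (the decay-weighted I^2 accumulation and the feature choice are illustrative assumptions, not the paper's exact TTF definition):

```python
import numpy as np

def temporal_temperature_factor(current_a, dt_s=1.0, decay=0.99):
    """Accumulate a TTF-like quantity over a discharge-current time series:
    recent heavy discharge (which heats the cell internally, roughly like
    I^2 * R) contributes most, while older history decays away."""
    ttf = 0.0
    for i in current_a:
        ttf = decay * ttf + (i ** 2) * dt_s
    return ttf

rng = np.random.default_rng(0)
history = 15.0 + rng.standard_normal(600)      # 10 min of 1 Hz current samples (A)
features = np.array([temporal_temperature_factor(history), history[-1]])
# `features` would then be fed, together with other inputs, to a lightweight
# regression model that predicts the near-future battery voltage.
print(features)
```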
A near-storage framework for boosted data preprocessing of mass spectrum clustering
- Weihong Xu
- Jaeyoung Kang
- Tajana Rosing
Mass spectrometry (MS) has been key to proteomics and metabolomics due to its unique ability to identify and analyze protein structures. Modern MS equipment generates massive amounts of tandem mass spectra with high redundancy, making spectral analysis the major bottleneck in the design of new medicines. Mass spectrum clustering is one promising solution, as it greatly reduces data redundancy and boosts protein identification. However, state-of-the-art MS tools take many hours to run spectrum clustering, and spectra loading and preprocessing consume on average 82% of the execution time and energy during clustering. We propose a near-storage framework, MSAS, to speed up spectrum preprocessing. Instead of loading data into host memory and the CPU, MSAS processes spectra near storage, thus reducing the expensive cost of data movement. We present two types of accelerators that leverage the internal bandwidth at two storage levels: SSD and channel. The accelerators are optimized to match the data rate at each storage level with negligible overhead. Our results demonstrate that the channel-level design yields the best preprocessing performance improvement: it is up to 187X and 1.8X faster than the CPU and the state-of-the-art in-storage computing solution, INSIDER, respectively. After integrating channel-level MSAS into existing MS clustering tools, we measure system-level speedups of 3.5X to 9.8X with 2.8X to 11.9X better energy efficiency.
MetaZip: a high-throughput and efficient accelerator for DEFLATE
- Ruihao Gao
- Xueqi Li
- Yewen Li
- Xun Wang
- Guangming Tan
Booming data volume has become an important challenge for data center storage and bandwidth resources. Consequently, fast and efficient compression architectures are becoming a fundamental design element in data centers. However, a high compression ratio (CR) and high compression throughput are often difficult to achieve at the same time on existing computing platforms. DEFLATE is a widely used compression format in data centers and an ideal case for hardware acceleration. Unfortunately, DEFLATE has inherent dependencies within its special memory access pattern, which limit higher throughput.
In this paper, we propose MetaZip, a high-throughput and scalable data-compression architecture targeted at FPGA-enabled data centers. To improve compression throughput within the constraints of FPGA resources, we propose an adaptive parallel-width pipeline that can be fed 64 bytes per cycle. To balance compression quality, we propose a series of sub-modules (e.g., 8-byte MetaHistory, Seed Bypass, Serialization Predictor). Experimental results show that MetaZip achieves a throughput of 15.6GB/s with a single engine, which is 234X/2.78X faster than a CPU gzip baseline and an FPGA-based architecture, respectively.
Enabling fast uncertainty estimation: accelerating bayesian transformers via algorithmic and hardware optimizations
- Hongxiang Fan
- Martin Ferianc
- Wayne Luk
Quantifying the uncertainty of neural networks (NNs) is required by many safety-critical applications such as autonomous driving or medical diagnosis. Recently, Bayesian transformers have demonstrated their capabilities in providing high-quality uncertainty estimates paired with excellent accuracy. However, their real-time deployment is limited by the compute-intensive attention mechanism that is core to the transformer architecture, and by the repeated Monte Carlo sampling required to quantify the predictive uncertainty. To address these limitations, this paper accelerates Bayesian transformers via both algorithmic and hardware optimizations. On the algorithmic level, an evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload. On the hardware level, we demonstrate that the sparsity brings hardware performance improvements on our optimized CPU and GPU implementations. An adaptable hardware architecture is also proposed to accelerate Bayesian transformers on an FPGA. Extensive experiments demonstrate that the EA-based framework, together with the hardware optimizations, reduces the latency of Bayesian transformers by up to 13, 12 and 20 times on CPU, GPU and FPGA platforms respectively, while achieving higher algorithmic performance.
Eventor: an efficient event-based monocular multi-view stereo accelerator on FPGA platform
- Mingjun Li
- Jianlei Yang
- Yingjie Qi
- Meng Dong
- Yuhao Yang
- Runze Liu
- Weitao Pan
- Bei Yu
- Weisheng Zhao
Event cameras are bio-inspired vision sensors that asynchronously represent pixel-level brightness changes as event streams. Event-based monocular multi-view stereo (EMVS) is a technique that exploits event streams to estimate semi-dense 3D structure with a known trajectory, and it is a critical task for event-based monocular SLAM. However, the required intensive computation workloads make real-time deployment on embedded platforms challenging. In this paper, Eventor is proposed as a fast and efficient EMVS accelerator, realizing the most critical and time-consuming stages, namely event back-projection and volumetric ray-counting, on FPGA. Highly parallel and fully pipelined processing elements are specially designed on the FPGA and integrated with an embedded ARM core as a heterogeneous system to improve throughput and reduce the memory footprint. Meanwhile, the EMVS algorithm is reformulated in a more hardware-friendly manner through rescheduling, approximate computing, and hybrid data quantization. Evaluation results on the DAVIS dataset show that Eventor achieves up to a 24X improvement in energy efficiency compared with an Intel i5 CPU platform.
GEML: GNN-based efficient mapping method for large loop applications on CGRA
- Mingyang Kou
- Jun Zeng
- Boxiao Han
- Fei Xu
- Jiangyuan Gu
- Hailong Yao
Coarse-grained reconfigurable architecture (CGRA) is an emerging hardware architecture with reconfigurable Processing Elements (PEs) for executing operations efficiently and flexibly. One major challenge for current CGRA compilers is scalability for large loop applications, where valid loop mapping results cannot be obtained in an acceptable time. This paper proposes an enhanced loop mapping method based on Graph Neural Networks (GNNs), which effectively addresses the scalability issue and generates valid loop mapping results for large applications. Experimental results show that the proposed method improves compilation time by 10.8x on average over existing methods, while producing even better loop mapping solutions.
Mixed-granularity parallel coarse-grained reconfigurable architecture
- Jinyi Deng
- Linyun Zhang
- Lei Wang
- Jiawei Liu
- Kexiang Deng
- Shibin Tang
- Jiangyuan Gu
- Boxiao Han
- Fei Xu
- Leibo Liu
- Shaojun Wei
- Shouyi Yin
Coarse-Grained Reconfigurable Architecture (CGRA) is a high-performance computing architecture. However, the silicon utilization of existing CGRAs is low due to the lack of fine-grained parallelism inside each Processing Element (PE) and the lack of a general coarse-grained parallel approach on the PE array. The absence of fine-grained parallelism in PEs not only leads to low PE silicon utilization but also makes the mapping loose and irregular, while the absence of a generalized parallel mapping method causes low PE utilization across the CGRA. Our goal is to design an execution model and a Mixed-granularity Parallel CGRA (MP-CGRA) that can parallelize operator execution in PEs at fine granularity and parallelize data transmission in channels, leading to a compact mapping. A coarse-grained general parallel method is proposed to vectorize the compact mapping. Evaluated with MachSuite, MP-CGRA achieves a 104.65% improvement in silicon utilization on the PE array and a 91.40% performance-per-area improvement compared with a baseline CGRA.
GuardNN: secure accelerator architecture for privacy-preserving deep learning
- Weizhe Hua
- Muhammad Umar
- Zhiru Zhang
- G. Edward Suh
This paper proposes GuardNN, a secure DNN accelerator that provides hardware-based protection for user data and model parameters even in an untrusted environment. GuardNN shows that the architecture and protection can be customized for a specific application to provide strong confidentiality and integrity guarantees with negligible overhead. The design of the GuardNN instruction set reduces the TCB to just the accelerator and allows confidentiality protection even when the instructions from a host cannot be trusted. GuardNN minimizes the overhead of memory encryption and integrity verification by customizing the off-chip memory protection for the known memory access patterns of a DNN accelerator. GuardNN is prototyped on an FPGA, demonstrating effective confidentiality protection with ~3% performance overhead for inference.
SRA: a secure ReRAM-based DNN accelerator
- Lei Zhao
- Youtao Zhang
- Jun Yang
Deep Neural Network (DNN) accelerators are increasingly developed to pursue high efficiency in DNN computing. However, the IP protection of the DNNs deployed on such accelerators is an important topic that has been less addressed. Although there are previous works that targeted this problem for CMOS-based designs, there is still no solution for ReRAM-based accelerators which pose new security challenges due to their crossbar structure and non-volatility. ReRAM’s non-volatility retains data even after the system is powered off, making the stored DNN model vulnerable to attacks by simply reading out the ReRAM content. Because the crossbar structure can only compute on plaintext data, encrypting the ReRAM content is no longer a feasible solution in this scenario.
In this paper, we propose SRA – a secure ReRAM-based DNN accelerator that stores DNN weights on crossbars in an encrypted format while still maintaining ReRAM’s in-memory computing capability. The proposed encryption scheme also supports sharing bits among multiple weights, significantly reducing the storage overhead. In addition, SRA uses a novel high-bandwidth SC conversion scheme to protect each layer’s intermediate results, which also contain sensitive information of the model. Our experimental results show that SRA can effectively prevent pirating the deployed DNN weights as well as the intermediate results with negligible accuracy loss, and achieves 1.14X performance speedup and 9% energy reduction compared to ISAAC – a non-secure ReRAM-based baseline.
ABNN2: secure two-party arbitrary-bitwidth quantized neural network predictions
- Liyan Shen
- Ye Dong
- Binxing Fang
- Jinqiao Shi
- Xuebin Wang
- Shengli Pan
- Ruisheng Shi
Data privacy and security issues are preventing many potential on-cloud machine-learning-as-a-service applications from happening. Recently, secure multi-party computation (MPC) has been used to achieve secure neural network predictions, guaranteeing the privacy of data. However, the existing two-party solutions are expensive and impractical in real-world settings.
In this work, we utilize the advantages of quantized neural networks (QNNs) and MPC to present ABNN2, a practical secure two-party framework that can realize arbitrary-bitwidth quantized neural network predictions. Concretely, we propose an efficient and novel matrix multiplication protocol based on 1-out-of-N OT extension and optimize the protocol through a parallel scheme. In addition, we design an optimized protocol for the ReLU function. The experiments demonstrate that our protocols are about 2X-36X and 1.4X-7X faster than SecureML (S&P'17) and MiniONN (CCS'17), respectively. ABNN2 obtains comparable efficiency to the state-of-the-art QNN prediction protocol QUOTIENT (CCS'19), but the latter only supports ternary neural networks.
Adaptive neural recovery for highly robust brain-like representation
- Prathyush Poduval
- Yang Ni
- Yeseong Kim
- Kai Ni
- Raghavan Kumar
- Rosario Cammarota
- Mohsen Imani
Today’s machine learning platforms have major robustness issues when dealing with insecure and unreliable memory systems. In conventional data representation, bit flips due to noise or attack can cause value explosion, which leads to incorrect learning predictions. In this paper, we propose RobustHD, a robust and noise-tolerant learning system based on HyperDimensional Computing (HDC), mimicking important brain functionalities. Unlike traditional binary representation, RobustHD exploits a redundant and holographic representation, ensuring that all bits have the same impact on the computation. RobustHD also proposes a runtime framework that adaptively identifies and regenerates faulty dimensions in an unsupervised way. Our solution not only provides security against possible bit-flip attacks but also provides a learning solution with high robustness to noise in the memory. We performed a cross-stack evaluation from a conventional platform to an emerging processing-in-memory architecture. Our evaluation shows that under a 10% random bit-flip attack, RobustHD suffers a maximum quality loss of 0.53%, while deep learning solutions lose over 26.2% accuracy.
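To give a feel for why a holographic, high-dimensional representation tolerates bit flips, here is a generic hyperdimensional-computing sketch (the binary hypervectors, 10,000-dimensional encoding, and Hamming similarity are textbook HDC choices assumed for illustration, not RobustHD's specific design):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                # hypervector dimensionality

class_hv = rng.integers(0, 2, size=D)     # stored class prototype (binary)
query_hv = class_hv.copy()                # a query that matches the prototype

def similarity(a, b):
    return 1.0 - np.mean(a != b)          # 1 - normalized Hamming distance

# Flip 10% of the stored bits, mimicking memory faults or a bit-flip attack.
flip_idx = rng.choice(D, size=D // 10, replace=False)
corrupted = class_hv.copy()
corrupted[flip_idx] ^= 1

print(similarity(query_hv, class_hv))     # 1.0
print(similarity(query_hv, corrupted))    # ~0.9: still clearly the same class
```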
Efficiency attacks on spiking neural networks
- Sarada Krithivasan
- Sanchari Sen
- Nitin Rathi
- Kaushik Roy
- Anand Raghunathan
Spiking Neural Networks (SNNs) are a class of artificial neural networks that process information as discrete spikes. The time and energy consumed by SNN implementations are strongly dependent on the number of spikes processed. We explore this sensitivity from an adversarial perspective and propose SpikeAttack, a completely new class of attacks on SNNs. SpikeAttack impacts the efficiency of SNNs via imperceptible perturbations that increase the overall spiking activity of the network, leading to increased time and energy consumption. Across four SNN benchmarks, SpikeAttack results in a 1.7x-2.5x increase in spike activity, leading to increases of 1.6x-2.3x and 1.4x-2.2x in latency and energy consumption, respectively.
L-QoCo: learning to optimize cache capacity overloading in storage systems
- Ji Zhang
- Xijun Li
- Xiyao Zhou
- Mingxuan Yuan
- Zhuo Cheng
- Keji Huang
- Yifan Li
Cache plays an important role to maintain high and stable performance (i.e. high throughput, low tail latency and throughput jitter) in storage systems. Existing rule-based cache management methods, coupled with engineers’ manual configurations, cannot meet ever-growing requirements of both time-varying workloads and complex storage systems, leading to frequent cache overloading.
In this paper, we propose the first light-weight learning-based cache bandwidth control technique, called L-QoCo which can adaptively control the cache bandwidth so as to effectively prevent cache overloading in storage systems. Extensive experiments with various workloads on real systems show that L-QoCo, with its strong adaptability and fast learning ability, can adapt to various workloads to effectively control cache bandwidth, thereby significantly improving the storage performance (e.g. increasing the throughput by 10%-20% and reducing the throughput jitter and tail latency by 2X-6X and 1.5X-4X, respectively, compared with two representative rule-based methods).
Pipette: efficient fine-grained reads for SSDs
- Shuhan Bai
- Hu Wan
- Yun Huang
- Xuan Sun
- Fei Wu
- Changsheng Xie
- Hung-Chih Hsieh
- Tei-Wei Kuo
- Chun Jason Xue
Big data applications, such as recommendation systems and social networks, often generate a huge number of fine-grained reads to storage. Block-oriented storage devices tend to suffer from these fine-grained read operations in terms of both I/O traffic and performance. Motivated by this challenge, a fine-grained read framework, Pipette, is proposed in this paper as an extension to the traditional I/O framework. With an adaptive caching design, the Pipette framework offers a tremendous reduction in I/O traffic as well as a significant performance gain. A Pipette prototype was implemented with the Ext4 file system on an SSD for two real-world applications, where the I/O throughput is improved by 31.6% and 33.5%, and the I/O traffic is reduced by 95.6% and 93.6%, respectively.
CDB: critical data backup design for consumer devices with high-density flash based hybrid storage
- Longfei Luo
- Dingcui Yu
- Liang Shi
- Chuanmin Ding
- Changlong Li
- Edwin H.-M. Sha
Hybrid flash-based storage constructed with high-density, low-cost flash memory has become increasingly popular in consumer devices over the last decade. However, existing methods for protecting critical data are designed to improve the reliability of consumer devices with non-hybrid flash storage. Based on our evaluations and analysis, these methods result in significant performance and lifetime degradation on consumer devices with hybrid storage. The reason is that the different kinds of memory in hybrid storage have different characteristics, such as performance and access granularity. To address these problems, a critical data backup (CDB) method is proposed to back up designated critical data while making full use of the different kinds of memory in hybrid storage. Experimental results show that, compared with the state of the art, CDB achieves encouraging performance and lifetime improvements.
SS-LRU: a smart segmented LRU caching
- Chunhua Li
- Man Wu
- Yuhan Liu
- Ke Zhou
- Ji Zhang
- Yunqing Sun
Many caching policies use machine learning to predict data reuse, but they ignore the impact of incorrect predictions on cache performance, especially for large objects. In this paper, we propose a smart segmented LRU (SS-LRU) replacement policy, which adopts a size-aware classifier designed for cache scenarios and accounts for the cache cost caused by misprediction. Besides, SS-LRU enhances the migration rules of segmented LRU (SLRU) and implements smart caching with unequal priorities and segment sizes based on prediction and multiple access patterns. We conducted extensive experiments under real-world workloads to demonstrate the superiority of our approach over state-of-the-art caching policies.
NobLSM: an LSM-tree with non-blocking writes for SSDs
- Haoran Dang
- Chongnan Ye
- Yanpeng Hu
- Chundong Wang
Solid-state drives (SSDs) are gaining popularity. Meanwhile, key-value stores built on log-structured merge-tree (LSM-tree) are widely deployed for data management. LSM-tree frequently calls syncs to persist newly-generated files for crash consistency. The blocking syncs are costly for performance. We revisit the necessity of syncs for LSM-tree. We find that Ext4 journaling embraces asynchronous commits to implicitly persist files. Hence, we design NobLSM that makes LSM-tree and Ext4 cooperate to substitute most syncs with non-blocking asynchronous commits, without losing consistency. Experiments show that NobLSM significantly outperforms state-of-the-art LSM-trees with higher throughput on an ordinary SSD.
TailCut: improving performance and lifetime of SSDs using pattern-aware state encoding
- Jaeyong Lee
- Myungsunk Kim
- Wonil Choi
- Sanggu Lee
- Jihong Kim
Although lateral charge spreading is considered a dominant error source in 3D NAND flash memory, little is known about its detailed characteristics at the storage system level. From a device characterization study, we observed that lateral charge spreading strongly depends on vertically adjacent state patterns and that a few specific patterns are responsible for a large portion of the bit errors caused by lateral charge spreading. We propose a new state encoding scheme, called TailCut, which removes vulnerable state patterns by modifying encoded states. By removing vulnerable patterns, TailCut improves SSD lifetime and read latency by 80% and 25%, respectively.
HIMap: a heuristic and iterative logic synthesis approach
- Xing Li
- Lei Chen
- Fan Yang
- Mingxuan Yuan
- Hongli Yan
- Yupeng Wan
Recently, many models have shown their superiority in sequence and parameter tuning. However, they usually generate non-deterministic flows and require a lot of training data. We thus propose a heuristic and iterative flow, namely HIMap, for deterministic logic synthesis. In HIMap, domain knowledge of the functionality and parameters of synthesis operators and their correlations with netlist PPA is fully utilized to design synthesis templates for various objectives. We also introduce deterministic and effective heuristics to tune the templates with relatively fixed operator combinations and iteratively improve netlist PPA. Two nested iterations with local searching and early stopping can thus generate dynamic sequences for various circuits and reduce runtime. HIMap improves 13 of the best results of the EPFL combinational benchmarks for delay (5 for area). In particular, for several arithmetic benchmarks, HIMap significantly reduces LUT-6 levels by 11.6~21.2% and post-P&R delay by 5.0~12.9%.
Improving LUT-based optimization for ASICs
- Walter Lau Neto
- Luca Amarú
- Vinicius Possani
- Patrick Vuillod
- Jiong Luo
- Alan Mishchenko
- Pierre-Emmanuel Gaillardon
LUT-based optimization techniques are finding new applications in the synthesis of ASIC designs. Intuitively, packing logic into LUTs provides a better balance between functionality and structure in logic optimization. On this basis, the LUT-engine framework [1] was introduced to enhance ASIC synthesis. In this paper, we present key improvements, at both the algorithmic and flow levels, that make for a much stronger LUT-engine. We restructure the flow of LUT-engine to benefit from a heterogeneous mixture of LUT sizes and revisit its requirements for maximum scalability. We propose a dedicated LUT mapper for the new flow, based on FlowMap, that natively balances LUT count and NAND2 count for a wide range of LUT sizes. We describe a specialized Boolean factoring technique exploiting the fanin bounds in LUT networks, resulting in a very fast LUT-based AIG minimization. Using the proposed methodology, we improve 9 of the best area results in the ongoing EPFL synthesis competition. Integrated into a complete EDA flow for ASICs, the new LUT-engine performs well on a set of 87 benchmarks: -4.60% area and -3.41% switching power at +5% runtime compared to the baseline flow without LUT-based optimizations, and -3.02% area and -2.54% switching power at -1% runtime compared to the original LUT-engine.
NovelRewrite: node-level parallel AIG rewriting
- Shiju Lin
- Jinwei Liu
- Tianji Liu
- Martin D. F. Wong
- Evangeline F. Y. Young
Logic rewriting is an important part in logic optimization. It rewrites a circuit by replacing local subgraphs with logically equivalent ones, so that the area and the delay of the circuit can be optimized. This paper introduces a parallel AIG rewriting algorithm with a new concept of logical cuts. Experiments show that this algorithm implemented with one GPU can be on average 32X faster than the logic rewriting in the logic synthesis tool ABC on large benchmarks. Compared with other logic rewriting acceleration works, ours has the best quality and the shortest running time.
Search space characterization for approximate logic synthesis
- Linus Witschen
- Tobias Wiersema
- Lucas Reuter
- Marco Platzner
Approximate logic synthesis aims at trading off a circuit’s quality to improve a target metric. Corresponding methods explore a search space by approximating circuit components and verifying the resulting quality of the overall circuit, which is costly.
We propose a methodology that determines reasonable values for the component’s local error bounds prior to search space exploration. Utilizing formal verification on a novel approximation miter guarantees the circuit’s quality for such local error bounds, independent of employed approximation methods, resulting in reduced runtimes due to omitted verifications. Experiments show speed-ups of up to 3.7x for approximate logic synthesis using our method.
SEALS: sensitivity-driven efficient approximate logic synthesis
- Chang Meng
- Xuan Wang
- Jiajun Sun
- Sijun Tao
- Wei Wu
- Zhihang Wu
- Leibin Ni
- Xiaolong Shen
- Junfeng Zhao
- Weikang Qian
Approximate computing is an emerging computing paradigm to design energy-efficient systems. Many greedy approximate logic synthesis (ALS) methods have been proposed to automatically synthesize approximate circuits. They typically need to consider all local approximate changes (LACs) in each iteration of the ALS flow to select the best one, which is time-consuming. In this paper, we propose SEALS, a Sensitivity-driven Efficient ALS method to speed up a greedy ALS flow. SEALS centers around a newly proposed concept called sensitivity, which enables a fast and accurate error estimation method and an efficient method to filter out unpromising LACs. SEALS can handle any statistical error metric. The experimental results show that it outperforms a state-of-the-art ALS method in runtime by 12X to 15X without reducing circuit quality.
Beyond local optimality of buffer and splitter insertion for AQFP circuits
- Siang-Yun Lee
- Heinz Riener
- Giovanni De Micheli
Adiabatic quantum-flux parametron (AQFP) is an energy-efficient superconducting technology. Buffer and splitter (B/S) cells must be inserted into an AQFP circuit to meet the technology-imposed constraints on path balancing and fanout branching. These cells account for a significant portion of the circuit’s area and delay. In this paper, we identify that B/S insertion is a scheduling problem and propose (a) a linear-time algorithm for locally optimal B/S insertion subject to a given schedule; (b) an SMT formulation to find the global optimum; and (c) an efficient heuristic for global B/S optimization. Experimental results show a 4% reduction in B/S cost and a 124X speed-up compared to the state-of-the-art algorithm, as well as the capability to scale to benchmarks an order of magnitude larger.
NAX: neural architecture and memristive xbar based accelerator co-design
- Shubham Negi
- Indranil Chakraborty
- Aayush Ankit
- Kaushik Roy
Neural Architecture Search (NAS) has provided the ability to design efficient deep neural networks (DNNs) catered towards different hardware such as GPUs and CPUs. However, integrating NAS with Memristive Crossbar Array (MCA)-based In-Memory Computing (IMC) accelerators remains an open problem. The hardware efficiency (energy, latency and area) as well as the application accuracy (considering device and circuit non-idealities) of DNNs mapped to such hardware are co-dependent on network parameters such as kernel size and depth, and on hardware architecture parameters such as crossbar size and the precision of the analog-to-digital converters. Co-optimization of both network and hardware parameters presents a challenging search space comprising different kernel sizes mapped to varying crossbar sizes. To that effect, we propose NAX – an efficient neural architecture search engine that co-designs the neural network and the IMC-based hardware architecture. NAX explores the aforementioned search space to determine the kernel size and corresponding crossbar size for each DNN layer to achieve optimal tradeoffs between hardware efficiency and application accuracy. For CIFAR-10 and Tiny ImageNet, our models achieve 0.9% and 18.57% higher accuracy at 30% and -10.47% lower EDAP (energy-delay-area product), compared to baseline ResNet-20 and ResNet-18 models, respectively.
MC-CIM: a reconfigurable computation-in-memory for efficient stereo matching cost computation
- Zhiheng Yue
- Yabing Wang
- Leibo Liu
- Shaojun Wei
- Shouyi Yin
This paper proposes a computation-in-memory design for stereo matching cost computation. The matching cost computation incurs large energy and latency overheads because of frequent memory accesses. To overcome previous design limitations, this work, named MC-CIM, performs matching cost computation without incurring memory accesses and introduces several key features. (1) A lightweight balanced computing unit is integrated within the cell array to reduce memory accesses and improve system throughput. (2) A self-optimizing circuit design can alter the arithmetic operation of the matching algorithm for various scenarios. (3) A flexible data mapping method and reconfigurable digital peripherals exploit maximum parallelism across different algorithms and bit precisions. The proposed design is implemented in a 28nm technology and achieves an average energy efficiency of 277 TOPS/W.
iMARS: an in-memory-computing architecture for recommendation systems
- Mengyuan Li
- Ann Franchesca Laguna
- Dayane Reis
- Xunzhao Yin
- Michael Niemier
- X. Sharon Hu
Recommendation systems (RecSys) suggest items to users by predicting their preferences based on historical data. Typical RecSys handle large embedding tables and many embedding table related operations. The memory size and bandwidth of the conventional computer architecture restrict the performance of RecSys. This work proposes an in-memory-computing (IMC) architecture (iMARS) for accelerating the filtering and ranking stages of deep neural network-based RecSys. iMARS leverages IMC-friendly embedding tables implemented inside a ferroelectric FET based IMC fabric. Circuit-level and system-level evaluation show that iMARS achieves 16.8x (713x) end-to-end latency (energy) improvement compared to the GPU counterpart for the MovieLens dataset.
ReGNN: a ReRAM-based heterogeneous architecture for general graph neural networks
- Cong Liu
- Haikun Liu
- Hai Jin
- Xiaofei Liao
- Yu Zhang
- Zhuohui Duan
- Jiahong Xu
- Huize Li
Graph Neural Networks (GNNs) have both graph processing and neural network computational features. Traditional graph accelerators and NN accelerators cannot meet these dual characteristics of GNN applications simultaneously. In this work, we propose a ReRAM-based processing-in-memory (PIM) architecture called ReGNN for GNN acceleration. ReGNN is composed of analog PIM (APIM) modules for accelerating matrix vector multiplication (MVM) operations, and digital PIM (DPIM) modules for accelerating non-MVM aggregation operations. To improve data parallelism, ReGNN maps data to aggregation sub-engines based on the degree of vertices and the dimension of feature vectors. Experimental results show that ReGNN speeds up GNN inference by 228x and 8.4x, and reduces energy consumption by 305.2x and 10.5x, compared with GPU and the ReRAM-based GNN accelerator ReGraphX, respectively.
You only search once: on lightweight differentiable architecture search for resource-constrained embedded platforms
- Xiangzhong Luo
- Di Liu
- Hao Kong
- Shuo Huai
- Hui Chen
- Weichen Liu
Benefiting from its search efficiency, differentiable neural architecture search (NAS) has evolved into the dominant approach for automatically designing competitive deep neural networks (DNNs). We note that DNNs must be executed under strict hard performance constraints in real-world scenarios, for example, the runtime latency on autonomous vehicles. However, to obtain an architecture that meets a given performance constraint, previous hardware-aware differentiable NAS methods have to repeat a plethora of search runs to manually tune the hyper-parameters by trial and error, and thus the total design cost increases proportionally. To resolve this, we introduce a lightweight hardware-aware differentiable NAS framework dubbed LightNAS, striving to find the required architecture that satisfies various performance constraints through a one-time search (i.e., you only search once). Extensive experiments are conducted to show the superiority of LightNAS over previous state-of-the-art methods. The related code will be released at https://github.com/stepbuystep/LightNAS.
EcoFusion: energy-aware adaptive sensor fusion for efficient autonomous vehicle perception
- Arnav Vaibhav Malawade
- Trier Mortlock
- Mohammad Abdullah Al Faruque
Autonomous vehicles use multiple sensors, large deep-learning models, and powerful hardware platforms to perceive the environment and navigate safely. In many contexts, some sensing modalities negatively impact perception while increasing energy consumption. We propose EcoFusion: an energy-aware sensor fusion approach that uses context to adapt the fusion method and reduce energy consumption without affecting perception performance. EcoFusion performs up to 9.5% better at object detection than existing fusion methods with approximately 60% less energy and 58% lower latency on the industry-standard Nvidia Drive PX2 hardware platform. We also propose several context-identification strategies, implement a joint optimization between energy and performance, and present scenario-specific results.
Human emotion based real-time memory and computation management on resource-limited edge devices
- Yijie Wei
- Zhiwei Zhong
- Jie Gu
Emotional AI, or Affective Computing, is projected to grow rapidly in the upcoming years. Despite many existing developments in the application space, there has been a lack of hardware-level exploitation of the user’s emotions. In this paper, we propose a deep collaboration between the user’s affects and hardware system management on resource-limited edge devices. Based on classification results from efficient affect classifiers on smartphone devices, novel real-time management schemes for memory and video processing are proposed to improve the energy efficiency of mobile devices. Case studies on H.264/AVC video playback and Android smartphone usage show significant power savings of up to 23% and a reduction in memory loading of up to 17% using the proposed affect-adaptive architecture and system management schemes.
Hierarchical memory-constrained operator scheduling of neural architecture search networks
- Zihan Wang
- Chengcheng Wan
- Yuting Chen
- Ziyi Lin
- He Jiang
- Lei Qiao
Neural Architecture Search (NAS) is widely used in industry to search for neural networks that meet task requirements. Meanwhile, it faces a challenge in scheduling networks that satisfy memory constraints. This paper proposes HMCOS, which performs hierarchical memory-constrained operator scheduling of NAS networks: given a network, HMCOS constructs a hierarchical computation graph and employs an iterative scheduling algorithm to progressively reduce peak memory footprints. We evaluate HMCOS against RPO and Serenity (two popular scheduling techniques). The results show that HMCOS outperforms existing techniques by supporting more NAS networks, reducing peak memory footprints by 8.7~42.4%, and achieving 137-283x speedups in scheduling.
MIME: adapting a single neural network for multi-task inference with memory-efficient dynamic pruning
- Abhiroop Bhattacharjee
- Yeshwanth Venkatesha
- Abhishek Moitra
- Priyadarshini Panda
Recent years have seen a paradigm shift towards multi-task learning. This calls for memory- and energy-efficient solutions for inference in a multi-task scenario. We propose an algorithm-hardware co-design approach called MIME. MIME reuses the weight parameters of a trained parent task and learns task-specific threshold parameters for inference on multiple child tasks. We find that MIME results in highly memory-efficient DRAM storage of neural-network parameters for multiple tasks compared to conventional multi-task inference. In addition, MIME results in input-dependent dynamic neuronal pruning, thereby enabling energy-efficient inference with higher throughput on systolic-array hardware. Our experiments with benchmark datasets (child tasks) CIFAR10, CIFAR100, and Fashion-MNIST show that MIME achieves ~3.48x higher memory efficiency and ~2.4-3.1x energy savings compared to conventional multi-task inference in Pipelined task mode.
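A toy sketch of the reuse-weights-plus-task-specific-thresholds idea (the layer shapes, thresholding rule, and threshold values below are our assumptions for illustration, not MIME's exact mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)

# Backbone weights trained once on the parent task and then frozen.
W1 = rng.standard_normal((128, 256))
W2 = rng.standard_normal((256, 10))

def mime_like_forward(x, thresholds):
    """Child-task inference reuses the frozen parent weights; only the
    per-neuron threshold vector differs per task. Neurons below their
    threshold output zero, giving input-dependent dynamic pruning."""
    h = x @ W1
    h = np.where(h > thresholds, h, 0.0)
    return h @ W2

x = rng.standard_normal((1, 128))
thr_task_a = np.full(256, 0.5)    # learned threshold vector for child task A
thr_task_b = np.full(256, 1.5)    # a sparser (cheaper) child task B
print(mime_like_forward(x, thr_task_a).shape)   # (1, 10)
print(float((x @ W1 > thr_task_b).mean()))      # fraction of neurons left active
```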
Sniper: cloud-edge collaborative inference scheduling with neural network similarity modeling
- Weihong Liu
- Jiawei Geng
- Zongwei Zhu
- Jing Cao
- Zirui Lian
Cloud-edge collaborative inference demands efficient scheduling of artificial intelligence (AI) tasks onto the appropriate edge smart devices. However, continuously iterated deep neural networks (DNNs) and heterogeneous devices pose great challenges for inference task scheduling. In this paper, we propose Sniper, a self-updating cloud-edge collaborative inference scheduling system with time awareness. First, considering that similar networks exhibit similar behaviors, we develop a non-invasive performance characterization network (PCN) based on neural network similarity (NNS) to accurately predict the inference time of DNNs. Moreover, the PCN and time-based scheduling algorithms can be flexibly combined into the scheduling module of Sniper. Experimental results show that the average relative error of network inference time prediction is about 8.06%. Compared with a traditional method without time awareness, Sniper reduces waiting time by 52% on average while achieving a stable increase in throughput.
LPCA: learned MRC profiling based cache allocation for file storage systems
- Yibin Gu
- Yifan Li
- Hua Wang
- Li Liu
- Ke Zhou
- Wei Fang
- Gang Hu
- Jinhu Liu
- Zhuo Cheng
File storage systems (FSS) use multiple caches to accelerate data accesses. Unfortunately, efficient FSS cache allocation remains extremely difficult. First, existing miss ratio curve (MRC) constructions, the key to cache allocation, are limited to LRU. Second, existing techniques are suitable for same-layer caches but not for hierarchical ones.
We present a Learned MRC Profiling based Cache Allocation (LPCA) scheme for FSS. To the best of our knowledge, LPCA is the first to apply machine learning to model MRCs under non-LRU policies. LPCA also explores the optimization target for hierarchical caches, so that it can provide universal and efficient cache allocation for FSSs.
Equivalence checking paradigms in quantum circuit design: a case study
- Tom Peham
- Lukas Burgholzer
- Robert Wille
As state-of-the-art quantum computers are capable of running increasingly complex algorithms, the need for automated methods to design and test potential applications rises. Equivalence checking of quantum circuits is an important, yet hardly automated, task in the development of the quantum software stack. Recently, new methods have been proposed that tackle this problem from widely different perspectives. However, there is no established baseline on which to judge current and future progress in equivalence checking of quantum circuits. In order to close this gap, we conduct a detailed case study of two of the most promising equivalence checking methodologies—one based on decision diagrams and one based on the ZX-calculus—and compare their strengths and weaknesses.
Accurate BDD-based unitary operator manipulation for scalable and robust quantum circuit verification
- Chun-Yu Wei
- Yuan-Hung Tsai
- Chiao-Shan Jhang
- Jie-Hong R. Jiang
Quantum circuit verification is essential, ensuring that quantum program compilation yields a sequence of primitive unitary operators executable correctly and reliably on a quantum processor. Most prior quantum circuit equivalence checking methods rely on edge-weighted decision diagrams and suffer from scalability and verification accuracy issues. This work overcomes these issues by extending a recent BDD-based algebraic representation of state vectors to support unitary operator manipulation. Experimental results demonstrate the superiority of the new method in scalability and exactness in contrast to the inexactness of prior approaches. Also, our method is much more robust in verifying dissimilar circuits than previous work.
Handling non-unitaries in quantum circuit equivalence checking
- Lukas Burgholzer
- Robert Wille
Quantum computers are reaching a level where interactions between classical and quantum computations can happen in real-time. This marks the advent of a new, broader class of quantum circuits: dynamic quantum circuits. They offer a broader range of available computing primitives that lead to new challenges for design tasks such as simulation, compilation, and verification. Due to the non-unitary nature of dynamic circuit primitives, most existing techniques and tools for these tasks are no longer applicable in an out-of-the-box fashion. In this work, we discuss the resulting consequences for quantum circuit verification, specifically equivalence checking, and propose two different schemes that eventually allow the involved circuits to be treated as if they did not contain non-unitaries at all. As a result, we demonstrate methodically, as well as experimentally, that existing techniques for verifying the equivalence of quantum circuits remain applicable for this broader class of circuits.
A bridge-based algorithm for simultaneous primal and dual defects compression on topologically quantum-error-corrected circuits
- Wei-Hsiang Tseng
- Yao-Wen Chang
Topological quantum error correction (TQEC) using the surface code is among the most promising techniques for fault-tolerant quantum circuits. The required resources of a TQEC circuit can be modeled as the space-time volume of a three-dimensional diagram describing the defect movement along the time axis. For large-scale complex problems, it is crucial to minimize the space-time volume so that a quantum algorithm runs with a reasonable number of physical qubits and computation time. Previous work proposed an automated tool to perform bridge compression on large-scale TQEC circuits. However, the existing automated bridge compression handles only dual defects and not primal defects. This paper presents an algorithm that performs bridge compression on primal and dual defects simultaneously. In addition, the automatic compression algorithm performs initialization/measurement simplification and flipping to improve the compression. Compared with the state-of-the-art work, experimental results show that our proposed algorithm reduces space-time volumes by 47% on average.
FaSe: fast selective flushing to mitigate contention-based cache timing attacks
- Tuo Li
- Sri Parameswaran
Caches are widely used to improve performance in modern processors. By carefully evicting cache lines and identifying cache hit/miss times, contention-based cache timing channel attacks can be orchestrated to leak information from a victim process. Existing hardware countermeasures, which explore cache partitioning and randomization, are either costly, not applicable to the L1 data cache, or vulnerable to sophisticated attacks. Countermeasures using cache flushes exist but are slow, since all cache lines have to be evacuated during a cache flush. In this paper, we propose for the first time a hardware/software flush-based countermeasure, called fast selective flushing (FaSe). Utilizing an ISA extension and a cache modification, FaSe selectively flushes cache lines and provides a mitigation method with an effect similar to methods using a naive flush. FaSe is implemented on the RISC-V Rocket Chip and evaluated on a Xilinx FPGA running user programs and the Linux OS. Our experiments show that FaSe reduces time overhead by 36% for user programs and 42% for the OS compared to methods with naive flushing, with less than 1% hardware overhead. Our security tests show FaSe can mitigate the targeted cache timing attacks.
Conditional address propagation: an efficient defense mechanism against transient execution attacks
- Peinan Li
- Rui Hou
- Lutan Zhao
- Yifan Zhu
- Dan Meng
Speculative execution is a critical technique in modern high-performance processors. However, continuously exposed transient execution attacks, including Spectre and Meltdown, have disclosed a large attack surface in mispredicted execution. The current state-of-the-art defense strategy blocks all memory accesses that use addresses loaded speculatively. However, propagation of base addresses is common in general applications, and we find that more than 60% of blocked memory accesses use propagated base addresses rather than offset addresses. Therefore, we propose a novel hardware defense mechanism, named Conditional Address Propagation, to identify safe base addresses through taint tracking and address checking with a History Table. The safe base addresses are then allowed to be propagated to recover performance, while the remaining unsafe addresses cannot be propagated, for security. We conducted experiments on the cycle-accurate gem5 simulator. Compared to the representative study, STT, our mechanism effectively decreases the performance overhead from 13.27% to 1.92% for Spectre-type attacks and from 19.66% to 5.23% for all types of cache-based transient execution attacks.
Timed speculative attacks exploiting store-to-load forwarding bypassing cache-based countermeasures
- Anirban Chakraborty
- Nikhilesh Singh
- Sarani Bhattacharya
- Chester Rebeiro
- Debdeep Mukhopadhyay
In this paper, we propose a novel class of speculative attacks, called Timed Speculative Attacks (TSA), that does not depend on the state changes in the cache memory. Instead, it makes use of the timing differences that occur due to store-to-load forwarding. We propose two attack strategies – Fill-and-Forward utilizing correctly speculated loads, and Fill-and-Misdirect using mis-speculated load instructions. While Fill-and-Forward exploits the shared store buffers in a multi-threaded CPU core, the Fill-and-Misdirect approach exploits the influence of rolled back mis-speculated loads on subsequent instructions. As case studies, we demonstrate a covert channel using Fill-and-Forward and key recovery attacks on OpenSSL AES and Romulus-N Authenticated Encryption with Associated Data scheme using Fill-and-Misdirect approach. Finally, we show that TSA is able to subvert popular cache-based countermeasures for transient attacks.
DARPT: defense against remote physical attack based on TDC in multi-tenant scenario
- Fan Zhang
- Zhiyong Wang
- Haoting Shen
- Bolin Yang
- Qianmei Wu
- Kui Ren
With rapidly increasing demands for cloud computing, Field Programmable Gate Arrays (FPGAs) have become popular in cloud datacenters. Although they improve computing performance through flexible hardware acceleration, new security concerns come along as well. For example, unavoidable physical leakage from the Power Distribution Network (PDN) can be utilized by attackers to mount remote Side-Channel Attacks (SCA), such as Correlation Power Attacks (CPA). Remote Fault Attacks (FA) can also be mounted successfully by malicious tenants in a cloud multi-tenant scenario, posing a significant threat to legitimate tenants. Few hardware-based countermeasures can defeat both of the aforementioned remote attacks. In this work, we exploit the Time-to-Digital Converter (TDC) and propose a novel defense technique called DARPT (Defense Against Remote Physical attack based on TDC) to protect sensitive information from CPA and FA. Specifically, DARPT produces random clock jitter to reduce possible information leakage through the power side channel and provides an early warning of FA by constantly monitoring the variation of the voltage drop across the PDN. While 8k traces are enough for a successful CPA on an FPGA without DARPT, our experimental results show that up to 800k traces (100 times more) are not enough for the same FPGA protected by DARPT. Meanwhile, the TDC-based voltage monitor exhibits significant readout changes (by 51.82% or more) under FA with ring oscillators, demonstrating sufficient sensitivity to voltage-drop-based FA.
GNNIE: GNN inference engine with load-balancing and graph-specific caching
- Sudipta Mondal
- Susmita Dey Manasi
- Kishor Kunal
- Ramprasath S
- Sachin S. Sapatnekar
Graph neural network (GNN) inference involves weighting vertex feature vectors, followed by aggregating the weighted vectors over a vertex neighborhood. High and variable sparsity in the input vertex feature vectors, and high sparsity and power-law degree distributions in the adjacency matrix, can lead to (a) unbalanced loads and (b) inefficient random memory accesses. GNNIE ensures load balancing by splitting features into blocks, proposing a flexible MAC architecture, and employing load (re)distribution. GNNIE’s novel caching scheme bypasses the high cost of random DRAM accesses. GNNIE shows high speedups over CPUs/GPUs; it is faster and runs a broader range of GNNs than existing accelerators.
SALO: an efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences
- Guan Shen
- Jieru Zhao
- Quan Chen
- Jingwen Leng
- Chao Li
- Minyi Guo
The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention w.r.t the sequence length incurs heavy computational and memory burdens, especially for tasks with long sequences. Existing accelerators face performance degradation in these tasks. To this end, we propose SALO to enable hybrid sparse attention mechanisms for long sequences. SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the efficient attention computation. We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.
NN-LUT: neural approximation of non-linear operations for efficient transformer inference
- Joonsang Yu
- Junki Park
- Seongmin Park
- Minsoo Kim
- Sihwa Lee
- Dong Hyun Lee
- Jungwook Choi
Non-linear operations such as GELU, Layer Normalization, and Softmax are essential yet costly building blocks of Transformer models. Several prior works simplified these operations with look-up tables or integer computations, but such approximations suffer from inferior accuracy or considerable hardware cost with long latency. This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference. Our framework employs a simple neural network as a universal approximator, with its structure equivalently transformed into a look-up table (LUT). The proposed framework, called neural network generated LUT (NN-LUT), can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and latency.
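As a rough illustration of generating a look-up table from a small learned approximator of a non-linearity (the GELU target, least-squares fit, input range, and 64-entry table are assumptions made for this sketch, not the paper's configuration):

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

# 1) Fit a tiny piecewise-linear (ReLU-feature) approximator to GELU.
xs = np.linspace(-6.0, 6.0, 2001)
knots = np.linspace(-6.0, 6.0, 16)                       # fixed hidden breakpoints
def features(x):
    h = np.maximum(0.0, x[:, None] - knots[None, :])     # ReLU features
    return np.hstack([h, x[:, None], np.ones((x.size, 1))])
w, *_ = np.linalg.lstsq(features(xs), gelu(xs), rcond=None)

# 2) Tabulate the fitted approximator into a 64-entry LUT for cheap lookup.
lut_x = np.linspace(-6.0, 6.0, 64)
lut_y = features(lut_x) @ w

def nn_lut_gelu(x):
    idx = np.clip(np.round((x + 6.0) / 12.0 * 63.0).astype(int), 0, 63)
    return lut_y[idx]

test = np.linspace(-5.0, 5.0, 7)
print(np.max(np.abs(nn_lut_gelu(test) - gelu(test))))    # LUT-vs-exact error
```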
Self adaptive reconfigurable arrays (SARA): learning flexible GEMM accelerator configuration and mapping-space using ML
- Ananda Samajdar
- Eric Qin
- Michael Pellauer
- Tushar Krishna
This work demonstrates a scalable reconfigurable accelerator (RA) architecture designed to extract maximum performance and energy efficiency for GEMM workloads. We also present a self-adaptive (SA) unit, which runs a learnt model for one-shot configuration optimization in hardware, offloading the software stack and thus easing the deployment of the proposed design. We evaluate an instance of the proposed methodology with a 32.768 TOPS reference implementation called SAGAR, which can provide the same mapping flexibility as a compute-equivalent distributed system while achieving 3.5X higher power efficiency and 3.2X higher compute density, demonstrated via architectural and post-layout simulation.
Enabling hard constraints in differentiable neural network and accelerator co-exploration
- Deokki Hong
- Kanghyun Choi
- Hye Yoon Lee
- Joonsang Yu
- Noseong Park
- Youngsok Kim
- Jinho Lee
Co-exploration of an optimal neural architecture and its hardware accelerator is an approach of rising interest which addresses the computational cost problem, especially in low-profile systems. The large co-exploration space is often handled by adopting the idea of differentiable neural architecture search. However, despite the superior search efficiency of the differentiable co-exploration, it faces a critical challenge of not being able to systematically satisfy hard constraints such as frame rate. To handle the hard constraint problem of differentiable co-exploration, we propose HDX, which searches for hard-constrained solutions without compromising the global design objectives. By manipulating the gradients in the interest of the given hard constraint, high-quality solutions satisfying the constraint can be obtained.
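The core idea of steering gradients so that a hard constraint is eventually satisfied can be illustrated on a toy differentiable problem. The objective, the constraint, and the specific gradient manipulation below are hypothetical stand-ins for illustration, not HDX's actual formulation.

```python
import numpy as np

# Toy stand-in for differentiable co-exploration: minimize an "accuracy loss" f(x)
# subject to a hard constraint g(x) <= 0 (think: latency under a frame-rate budget).
def f(x):      return (x[0] - 3.0) ** 2 + (x[1] - 2.0) ** 2
def grad_f(x): return np.array([2.0 * (x[0] - 3.0), 2.0 * (x[1] - 2.0)])
def g(x):      return x[0] + x[1] - 4.0            # hypothetical budget
def grad_g(x): return np.array([1.0, 1.0])

x, lr, eps = np.array([0.0, 0.0]), 0.05, 0.1
for _ in range(2000):
    if g(x) > 0:                                   # violated: reduce the violation first
        x = x - lr * grad_g(x)
        continue
    gf = grad_f(x)
    if g(x) > -eps and gf @ grad_g(x) < 0:         # step would re-cross the boundary:
        gg = grad_g(x)                             # drop the gradient component that
        gf = gf - (gf @ gg) / (gg @ gg) * gg       # pushes into the infeasible region
    x = x - lr * gf

print("solution:", x, "objective:", round(f(x), 3), "constraint g(x):", round(g(x), 3))
```

In this toy setup the unconstrained optimum violates the budget, and the manipulated gradient settles near the best feasible point on the constraint boundary.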
Heuristic adaptability to input dynamics for SpMM on GPUs
- Guohao Dai
- Guyue Huang
- Shang Yang
- Zhongming Yu
- Hengrui Zhang
- Yufei Ding
- Yuan Xie
- Huazhong Yang
- Yu Wang
Sparse Matrix-Matrix Multiplication (SpMM) serves as a fundamental component in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In this paper, we consider the challenge of input dynamics from a novel auto-tuning perspective, while the following issues remain to be solved: (1) Orthogonal design principles considering sparsity. Orthogonal design principles for such a sparse problem should be extracted to form different algorithms and further used for performance tuning. (2) Nontrivial implementations in the algorithm space. Combining orthogonal design principles to create new algorithms requires tackling new challenges such as thread-race handling. (3) Heuristic adaptability to input dynamics. Heuristic adaptability is required to dynamically optimize code for input dynamics.
To tackle these challenges, we first propose a novel three-loop model to extract orthogonal design principles for SpMM on GPUs. The model not only covers previous SpMM designs, but also comes up with new designs absent from previous studies. We propose techniques like conditional reduction to implement algorithms missing in previous studies. We further propose DA-SpMM, a Data-Aware heuristic GPU kernel for SpMM. DA-SpMM adaptively optimizes code considering input dynamics. Extensive experimental results show that, DA-SpMM achieves 1.26X~1.37X speedup compared with the best NVIDIA cuSPARSE algorithm on average, and brings up to 5.59X end-to-end speedup to Graph Neural Networks.
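The "pick an algorithm from the input's sparsity statistics" idea can be sketched with SciPy. The statistics, thresholds, and schedule labels below are hypothetical illustrations, not DA-SpMM's actual heuristic or its GPU kernels.

```python
import numpy as np
import scipy.sparse as sp

def spmm_row_parallel(A_csr, B):
    """Reference SpMM: one loop iteration per sparse row (a stand-in for a
    row-parallel kernel)."""
    C = np.zeros((A_csr.shape[0], B.shape[1]))
    for i in range(A_csr.shape[0]):
        for jj in range(A_csr.indptr[i], A_csr.indptr[i + 1]):
            C[i] += A_csr.data[jj] * B[A_csr.indices[jj]]
    return C

def choose_schedule(A_csr):
    """Hypothetical data-aware selector: derive simple input statistics and map
    them to a parallelization scheme (labels and thresholds are illustrative)."""
    nnz_per_row = np.diff(A_csr.indptr)
    density = A_csr.nnz / (A_csr.shape[0] * A_csr.shape[1])
    imbalance = nnz_per_row.max() / max(1.0, nnz_per_row.mean())
    if imbalance > 8.0:
        return "nonzero-balanced"      # power-law rows: split work by nonzeros
    return "row-balanced" if density < 1e-2 else "row-vectorized"

A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
B = np.random.rand(1000, 64)
print("chosen schedule:", choose_schedule(A))
print("matches A @ B:", np.allclose(spmm_row_parallel(A, B), A @ B))
```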
H2H: heterogeneous model to heterogeneous system mapping with computation and communication awareness
- Xinyi Zhang
- Cong Hao
- Peipei Zhou
- Alex Jones
- Jingtong Hu
The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. The heterogeneity in ML models comes from multi-sensor perceiving and multi-task learning, i.e., multi-modality multi-task (MMMT), resulting in diverse deep neural network (DNN) layers and computation patterns. The heterogeneity in systems comes from diverse processing components, as it becomes the prevailing method to integrate multiple dedicated accelerators into one system. Therefore, a new problem emerges: heterogeneous model to heterogeneous system mapping (H2H). While previous mapping algorithms mostly focus on efficient computations, in this work, we argue that it is indispensable to consider computation and communication simultaneously for better system efficiency. We propose a novel H2H mapping algorithm with both computation and communication awareness; by slightly trading computation for communication, the system overall latency and energy consumption can be largely reduced. The superior performance of our work is evaluated based on MAESTRO modeling, demonstrating 15%-74% latency reduction and 23%-64% energy reduction compared with existing computation-prioritized mapping algorithms. Code is publicly available at https://github.com/xyzxinyizhang/H2H.
PARIS and ELSA: an elastic scheduling algorithm for reconfigurable multi-GPU inference servers
- Yunseong Kim
- Yujeong Choi
- Minsoo Rhu
Providing low latency to end-users while maximizing server utilization and system throughput is crucial for cloud ML servers. NVIDIA’s recently announced Ampere GPU architecture provides features to “reconfigure” one large, monolithic GPU into multiple smaller “GPU partitions”. Such a feature provides cloud ML service providers the ability to utilize the reconfigurable GPU not only for large-batch training but also for small-batch inference, with the potential to achieve high resource utilization. We study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server, presenting a sophisticated partitioning algorithm for reconfigurable GPUs combined with an elastic scheduling algorithm tailored for our heterogeneously partitioned GPU server.
Pursuing more effective graph spectral sparsifiers via approximate trace reduction
- Zhiqiang Liu
- Wenjian Yu
Spectral graph sparsification aims to find ultra-sparse subgraphs that can preserve the spectral properties of the original graphs. In this paper, a new spectral criticality metric based on trace reduction is first introduced for identifying spectrally important off-subgraph edges. Then, a physics-inspired truncation strategy and an approach using an approximate inverse of the Cholesky factor are proposed to compute the approximate trace reduction efficiently. Combining them with the iterative densification scheme in [8] and the strategy of excluding spectrally similar off-subgraph edges in [13], we develop a highly effective graph sparsification algorithm. The proposed method has been validated on various kinds of graphs. Experimental results show that it always produces sparsifiers with remarkably better quality than the state-of-the-art GRASS [8] at the same computational cost, enabling more than 40% time reduction for a preconditioned iterative equation solver on average. In the applications of power grid transient analysis and spectral graph partitioning, the derived iterative solver shows 3.3X or greater advantages in runtime and memory cost over the approach based on a direct sparse solver.
Accelerating nonlinear DC circuit simulation with reinforcement learning
- Zhou Jin
- Haojie Pei
- Yichao Dong
- Xiang Jin
- Xiao Wu
- Wei W. Xing
- Dan Niu
DC analysis is the foundation of nonlinear electronic circuit simulation. Pseudo transient analysis (PTA) methods have achieved great success among various continuation algorithms. However, PTA tends to be computationally intensive without careful tuning of parameters and proper stepping strategies. In this paper, we harness the latest advances in machine learning to resolve these challenges simultaneously. In particular, active learning is leveraged to provide a good initial solver environment, in which a TD3-based Reinforcement Learning (RL) agent is implemented to accelerate the simulation on the fly. The RL agent is strengthened with dual agents, priority sampling, and cooperative learning to enhance its robustness and convergence. The proposed algorithms are implemented in an out-of-the-box SPICE-like simulator, which demonstrates a significant speedup: up to 3.1X for the initial stage and 234X for the RL stage.
An efficient yield optimization method for analog circuits via gaussian process classification and varying-sigma sampling
- Xiaodong Wang
- Changhao Yan
- Fan Yang
- Dian Zhou
- Xuan Zeng
This paper presents an efficient yield optimization method for analog circuits via Gaussian process classification and varying-sigma sampling. To quickly determine the better design, yield estimations are executed at varying sigmas of process variations. Instead of regression methods requiring accurate yield values, a Gaussian process classification method is applied to model the preference information of designs with binary comparison results, and a preferential Bayesian optimization framework is implemented to guide the search. Additionally, a multi-fidelity surrogate model is adopted to learn the yield correlation at different sigmas. Compared with the state-of-the-art methods, the proposed method achieves up to 12× speed-up without loss of accuracy.
Partition and place finite element model on wafer-scale engine
- Jinwei Liu
- Xiaopeng Zhang
- Shiju Lin
- Xinshi Zang
- Jingsong Chen
- Bentian Jiang
- Martin D. F. Wong
- Evangeline F. Y. Young
The finite element method (FEM) is a well-known technique for approximately solving partial differential equations and it finds application in various engineering disciplines. The recently introduced wafer-scale engine (WSE) has shown the potential to accelerate FEM by up to 10,000×. However, accelerating FEM to the full potential of a WSE is non-trivial. Thus, in this work, we propose a partitioning algorithm to partition a 3D finite element model into tiles. The tiles can be thought of as a special netlist and are placed onto the 2D array of a WSE by our placement algorithm. Compared to the best-known approach, our partitioning has around 5% higher accuracy, and our placement algorithm can produce around 11% shorter wirelength (L1.5-normalized) on average.
CNN-inspired analytical global placement for large-scale heterogeneous FPGAs
- Huimin Wang
- Xingyu Tong
- Chenyue Ma
- Runming Shi
- Jianli Chen
- Kun Wang
- Jun Yu
- Yao-Wen Chang
The fast-growing capacity and complexity of FPGAs are challenging for global placement. Besides, while many recent studies have focused on eDensity-based placement because of its great efficiency and quality, they suffer from redundant frequency translation. This paper presents a CNN-inspired analytical placement algorithm to effectively handle the redundant-frequency-translation problem for large-scale FPGAs. Specifically, we compute the density penalty with a fully-connected forward propagation and its gradient with a discrete differential convolution in the backward pass. With FPGA heterogeneity, vectorization plays a vital role in self-adjusting the density penalty factor and the learning rate. In addition, a pseudo-net model is used to further optimize the site constraints by establishing connections between blocks and their nearest available regions. Finally, we formulate a refined objective function and a degree-specific gradient preconditioning to achieve a robust, high-quality solution. Experimental results show that our algorithm achieves an 8% reduction in HPWL and 15% less global placement runtime on average over leading commercial tools.
High-performance placement for large-scale heterogeneous FPGAs with clock constraints
- Ziran Zhu
- Yangjie Mei
- Zijun Li
- Jingwen Lin
- Jianli Chen
- Jun Yang
- Yao-Wen Chang
With the increasing complexity of the field-programmable gate array (FPGA) architecture, heterogeneity and clock constraints have greatly challenged FPGA placement. In this paper, we present a high-performance placement algorithm for large-scale heterogeneous FPGAs with clock constraints. We first propose a connectivity-aware and type-balanced clustering method to construct the hierarchy and improve the scalability. In each hierarchy level, we develop a novel hybrid penalty and augmented Lagrangian method to formulate the heterogeneous and clock-aware placement as a sequence of unconstrained optimization subproblems and adopt the Adam method to solve each unconstrained optimization subproblem. Then, we present a matching-based IP blocks legalization to legalize the RAMs and DSPs, and a multi-stage packing technique is proposed to cluster FFs and LUTs into HCLBs. Finally, history-based legalization is developed to legalize CLBs in an FPGA. Based on the ISPD 2017 clock-aware FPGA placement contest benchmarks, experimental results show that our algorithm achieves the smallest routed wirelength for all the benchmarks among all published works in a reasonable runtime.
Multi-electrostatic FPGA placement considering SLICEL-SLICEM heterogeneity and clock feasibility
- Jing Mai
- Yibai Meng
- Zhixiong Di
- Yibo Lin
Modern field-programmable gate arrays (FPGAs) contain heterogeneous resources, including CLB, DSP, BRAM, IO, etc. A Configurable Logic Block (CLB) slice is further categorized to SLICEL and SLICEM, which can be configured as specific combinations of instances in {LUT, FF, distributed RAM, SHIFT, CARRY}. Such kind of heterogeneity challenges the existing FPGA placement algorithms. Meanwhile, limited clock routing resources also lead to complicated clock constraints, causing difficulties in achieving clock feasible placement solutions. In this work, we propose a heterogeneous FPGA placement framework considering SLICEL-SLICEM heterogeneity and clock feasibility based on a multi-electrostatic formulation. We support a comprehensive set of the aforementioned instance types with a uniform algorithm for wirelength, routability, and clock optimization. Experimental results on both academic and industrial benchmarks demonstrate that we outperform the state-of-the-art placers in both quality and efficiency.
QOC: quantum on-chip training with parameter shift and gradient pruning
- Hanrui Wang
- Zirui Li
- Jiaqi Gu
- Yongshan Ding
- David Z. Pan
- Song Han
Parameterized Quantum Circuits (PQC) are drawing increasing research interest thanks to their potential to achieve quantum advantages on near-term Noisy Intermediate-Scale Quantum (NISQ) hardware. In order to achieve scalable PQC learning, the training process needs to be offloaded to real quantum machines instead of using exponential-cost classical simulators. One common approach to obtaining PQC gradients is parameter shift, whose cost scales linearly with the number of qubits. We present QOC, the first experimental demonstration of practical on-chip PQC training with parameter shift. Nevertheless, we find that due to the significant quantum errors (noises) on real machines, gradients obtained from naïve parameter shift have low fidelity and thus degrade the training accuracy. To this end, we further propose probabilistic gradient pruning to first identify gradients with potentially large errors and then remove them. Specifically, small gradients have larger relative errors than large ones and thus have a higher probability of being pruned. We perform extensive experiments with Quantum Neural Network (QNN) benchmarks on 5 classification tasks using 5 real quantum machines. The results demonstrate that our on-chip training achieves over 90% and 60% accuracy for 2-class and 4-class image classification tasks. Probabilistic gradient pruning brings up to 7% PQC accuracy improvements over no pruning. Overall, we successfully obtain on-chip training accuracy similar to noise-free simulation but with much better training scalability. The QOC code is available in the TorchQuantum library.
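For reference, the parameter-shift rule evaluates the circuit at θ_i ± π/2 and halves the difference. The sketch below applies it to a toy classical cost standing in for a measured PQC expectation and adds an illustrative probabilistic pruning step in which small (high-relative-error) gradients are more likely to be dropped; the noise model and pruning probabilities are assumptions, not QOC's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def expectation(theta, noise=0.05):
    """Toy stand-in for a measured PQC expectation value; hardware adds noise."""
    clean = np.cos(theta).prod()
    return clean + (rng.normal(0.0, noise) if noise > 0 else 0.0)

def parameter_shift_grad(theta):
    """Parameter-shift rule: d<f>/d(theta_i) = [f(theta_i + pi/2) - f(theta_i - pi/2)] / 2."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += np.pi / 2
        minus[i] -= np.pi / 2
        grad[i] = 0.5 * (expectation(plus) - expectation(minus))
    return grad

def probabilistic_prune(grad, keep_ratio=0.7):
    """Illustrative pruning: smaller-magnitude gradients (largest relative error
    under shot/hardware noise) are dropped with higher probability."""
    p_keep = np.abs(grad) / (np.abs(grad).max() + 1e-12)
    p_keep = np.clip(keep_ratio * p_keep / max(p_keep.mean(), 1e-12), 0.0, 1.0)
    return grad * (rng.random(grad.shape) < p_keep)

theta = rng.uniform(-np.pi, np.pi, size=6)
for _ in range(100):
    theta -= 0.1 * probabilistic_prune(parameter_shift_grad(theta))
print("final (noise-free) cost:", expectation(theta, noise=0.0))
```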
Memory-efficient training of binarized neural networks on the edge
- Mikail Yayla
- Jian-Jia Chen
A visionary computing paradigm is to train resource-efficient neural networks on the edge using dedicated low-power accelerators instead of cloud infrastructures, eliminating communication overheads and privacy concerns. One promising resource-efficient approach for inference is binarized neural networks (BNNs), which binarize parameters and activations. However, training BNNs remains resource-demanding. State-of-the-art BNN training methods, such as the binary optimizer (Bop), require storing and updating a large number of momentum values in floating-point (FP) format.
In this work, we focus on memory-efficient FP encodings for the momentum values in Bop. To achieve this, we first investigate the impact of arbitrary FP encodings. When the FP format is not properly chosen, we prove that updates of the momentum values can be lost and the quality of training therefore degrades. With these insights, we formulate a metric to determine the number of momentum values left unchanged in a training iteration due to the FP encoding. Based on the metric, we develop an algorithm to find FP encodings that are more memory-efficient than the standard FP encodings. In our experiments, the memory usage in BNN training is decreased by factors of 2.47x, 2.43x, and 2.04x, depending on the BNN model, with minimal accuracy cost (smaller than 1%) compared to using a 32-bit FP encoding.
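The effect being quantified, momentum updates vanishing when the accumulator's FP encoding is too coarse, can be reproduced with a small NumPy experiment. Stock float32/float16 formats are used here as stand-ins; the paper searches over custom exponent/mantissa splits rather than these fixed dtypes.

```python
import numpy as np

def lost_update_fraction(momentum, updates, dtype):
    """Fraction of momentum values left unchanged after one update because the
    increment falls below the resolution of the chosen FP encoding."""
    m = momentum.astype(dtype)
    return np.mean((m + updates.astype(dtype)).astype(dtype) == m)

rng = np.random.default_rng(0)
momentum = rng.normal(0.0, 1.0, size=100_000).astype(np.float32)
updates = rng.normal(0.0, 1e-4, size=100_000).astype(np.float32)  # Bop-style small EMA steps

for dtype in (np.float32, np.float16):
    frac = lost_update_fraction(momentum, updates, dtype)
    print(f"{dtype.__name__}: {frac:.1%} of momentum updates are lost")
```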
DeepGate: learning neural representations of logic gates
- Min Li
- Sadaf Khan
- Zhengyuan Shi
- Naixing Wang
- Huang Yu
- Qiang Xu
Applying deep learning (DL) techniques in the electronic design automation (EDA) field has become a trending topic. Most solutions apply well-developed DL models to solve specific EDA problems. While demonstrating promising results, they require careful model tuning for every problem. The fundamental question on “How to obtain a general and effective neural representation of circuits?” has not been answered yet. In this work, we take the first step towards solving this problem. We propose DeepGate, a novel representation learning solution that effectively embeds both logic function and structural information of a circuit as vectors on each gate. Specifically, we propose transforming circuits into unified and-inverter graph format for learning and using signal probabilities as the supervision task in DeepGate. We then introduce a novel graph neural network that uses strong inductive biases in practical circuits as learning priors for signal probability prediction. Our experimental results show the efficacy and generalization capability of DeepGate.
Bipolar vector classifier for fault-tolerant deep neural networks
- Suyong Lee
- Insu Choi
- Joon-Sung Yang
Deep Neural Networks (DNNs) surpass human-level performance on specific tasks. This capability accelerates the adoption of DNNs in safety-critical applications such as autonomous vehicles and medical diagnosis. The millions of parameters in a DNN require high memory capacity. Process technology scaling allows increasing memory density; however, memory reliability confronts significant issues, causing errors in the memory. This can make the weights stored in memory erroneous. Studies show that erroneous weights can cause a significant accuracy loss. This motivates research on fault-tolerant DNN architectures. Despite these efforts, DNNs are still vulnerable to errors, especially errors in the DNN classifier. In the worst case, because the classifier in a convolutional neural network (CNN) is the last stage determining an input's class, a single error in the classifier can cause a significant accuracy drop. To enhance the fault tolerance of CNNs, this paper proposes a novel bipolar vector classifier which can be easily integrated with any CNN structure and can be combined with other fault-tolerance approaches. Experimental results show that the proposed method stably maintains accuracy at classifier bit error rates of up to 10^-3.
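As a rough illustration of classifying by agreement with stored bipolar class vectors, the sketch below matches a (binarized) feature vector against bipolar templates and injects sign flips into the stored templates. The templates, feature model, and error rates are hypothetical, not the paper's trained classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, FEAT = 10, 256

# Hypothetical bipolar class templates (random here; a real design would derive
# them from the CNN's penultimate features).
templates = rng.choice((-1, 1), size=(NUM_CLASSES, FEAT))

def classify(feature_vec, T):
    """Assign the class whose bipolar template best agrees in sign with the
    binarized feature vector; ties are broken by argmax order."""
    return int(np.argmax(T @ np.sign(feature_vec)))

def flip_bits(T, ber):
    """Inject sign flips into stored templates at a given bit-error rate."""
    flips = rng.random(T.shape) < ber
    return np.where(flips, -T, T)

correct = 0
for c in range(NUM_CLASSES):
    x = templates[c] + rng.normal(0.0, 0.5, FEAT)          # noisy feature vector
    correct += classify(x, flip_bits(templates, ber=1e-3)) == c
print(f"correct under BER=1e-3: {correct}/{NUM_CLASSES}")
```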
HDLock: exploiting privileged encoding to protect hyperdimensional computing models against IP stealing
- Shijin Duan
- Shaolei Ren
- Xiaolin Xu
Hyperdimensional Computing (HDC) is facing infringement issues due to its straightforward computations. This work, for the first time, raises a critical vulnerability of HDC: an attacker can reverse engineer the entire model, requiring only the unindexed hypervector memory. To mitigate this attack, we propose a defense strategy, namely HDLock, which significantly increases the reasoning cost of encoding. Specifically, HDLock adds extra feature hypervector combination and permutation in the encoding module. Compared to the standard HDC model, a two-layer-key HDLock can increase the adversarial reasoning complexity by 10 orders of magnitude without inference accuracy loss, with only 21% latency overhead.
Terminator on SkyNet: a practical DVFS attack on DNN hardware IP for UAV object detection
- Junge Xu
- Bohan Xuan
- Anlin Liu
- Mo Sun
- Fan Zhang
- Zeke Wang
- Kui Ren
With the increasing computation demands of various applications, dynamic voltage and frequency scaling (DVFS) is gradually being deployed on FPGAs. However, its reliability and security haven’t been sufficiently evaluated. In this paper, we present a practical DVFS fault attack targeting the SkyNet accelerator IP and successfully destroy its detection accuracy. With no knowledge of the internal accelerator structure, our attack can achieve more than 98% detection accuracy loss under ten vulnerable operating point pairs (OPPs). Meanwhile, we explore local injection with a 1 ms duration and then double the intensity, achieving more than 50% and 74% average accuracy loss, respectively.
AL-PA: cross-device profiled side-channel attack using adversarial learning
- Pei Cao
- Hongyi Zhang
- Dawu Gu
- Yan Lu
- Yidong Yuan
We focus on the portability issue in profiled side-channel attacks (SCAs) that arises due to significant device-to-device variations. Device discrepancy is inevitable in realistic attacks, but it is often neglected in research works. In this paper, we identify such device variations and take a further step towards leveraging the transferability of neural networks. We propose a novel adversarial learning-based profiled attack (AL-PA), which enables our neural network to learn device-invariant features. We evaluated our strategy on eight XMEGA microcontrollers. Without the need for target-specific preprocessing and multiple profiling devices, our approach outperforms the state-of-the-art methods.
DETERRENT: detecting trojans using reinforcement learning
- Vasudev Gohil
- Satwik Patnaik
- Hao Guo
- Dileep Kalathil
- Jeyavijayan (JV) Rajendran
Insertion of hardware Trojans (HTs) in integrated circuits is a pernicious threat. Since HTs are activated under rare trigger conditions, detecting them using random logic simulations is infeasible. In this work, we design a reinforcement learning (RL) agent that circumvents the exponential search space and returns a minimal set of patterns that is most likely to detect HTs. Experimental results on a variety of benchmarks demonstrate the efficacy and scalability of our RL agent, which obtains a significant reduction (169×) in the number of test patterns required while maintaining or improving coverage (95.75%) compared to the state-of-the-art techniques.
Exploiting data locality in memory for ORAM to reduce memory access overheads
- Jinxi Kuang
- Minghua Shen
- Yutong Lu
- Nong Xiao
This paper proposes a locality-aware Oblivious RAM (ORAM) primitive, named Green ORAM, which exploits the spatial locality of data in physical memory to reduce ORAM overheads. Green ORAM consists of three novel policies. The first is row-guided label allocation, which maps spatial locality onto the ORAM tree to reduce the number of memory commands. The second is segment-based path replacement, which improves the data locality within a path of the ORAM tree in order to remove redundant memory accesses. The third is multi-path write-back, which improves the data locality between different paths in order to approach the theoretically best stash hit rate. Notably, Green ORAM still maintains security, as our analysis shows. Experimental results show that Green ORAM achieves a 28.72% access latency reduction and a 19.06% memory energy consumption reduction on average, compared with the state-of-the-art String ORAM.
HWST128: complete memory safety accelerator on RISC-V with metadata compression
- Hsu-Kang Dow
- Tuo Li
- Sri Parameswaran
Memory safety is paramount for secure systems. Pointer-based memory safety relies on additional information (metadata) to check validity when a pointer is dereferenced. Such operations on the metadata introduce significant performance overhead to the system. This paper presents HWST128, a system to reduce performance overhead by using hardware/software co-design. As a result, the system described achieves spatial and temporal safety by utilizing microarchitecture support, pointer analysis from the compiler, and metadata compression. HWST128 is the first complete solution for memory safety (spatial and temporal) on RISC-V. The system is implemented and tested on a Xilinx ZCU102 FPGA board with 1536 LUTs (+4.11%) and 112 FFs (+0.66%) on top of a Rocket Chip processor. HWST128 is 3.74× faster than the equivalent software-based safety system in the SPEC2006 benchmark suite while providing similar or better security coverage for the Juliet test suite.
RegVault: hardware assisted selective data randomization for operating system kernels
- Jinyan Xu
- Haoran Lin
- Ziqi Yuan
- Wenbo Shen
- Yajin Zhou
- Rui Chang
- Lei Wu
- Kui Ren
This paper presents RegVault, a hardware-assisted lightweight data randomization scheme for OS kernels. RegVault introduces novel cryptographically strong hardware primitives to protect both the confidentiality and integrity of register-grained data. RegVault leverages annotations to mark sensitive data and instruments their loads and stores automatically. Moreover, RegVault also introduces new techniques to protect the interrupt context and safeguard the sensitive data spilling. We implement a prototype of RegVault by extending RISC-V architecture to protect six types of sensitive data in Linux kernel. Our evaluations show that RegVault can defend against the kernel data attacks effectively with a minimal performance overhead.
ASAP: reconciling asynchronous real-time operations and proofs of execution in simple embedded systems
- Adam Caulfield
- Norrathep Rattanavipanon
- Ivan De Oliveira Nunes
Embedded devices are increasingly ubiquitous and their importance is hard to overestimate. While they often support safety-critical functions (e.g., in medical devices and sensor-alarm combinations), they are usually implemented under strict cost/energy budgets, using low-end microcontroller units (MCUs) that lack sophisticated security mechanisms. Motivated by this issue, recent work developed architectures capable of generating Proofs of Execution (PoX) for the correct/expected software in potentially compromised low-end MCUs. In practice, this capability can be leveraged to provide “integrity from birth” to sensor data, by binding the sensed results/outputs to an unforgeable cryptographic proof of execution of the expected sensing process. Despite this significant progress, current PoX schemes for low-end MCUs ignore the real-time needs of many applications. In particular, security of current PoX schemes precludes any interrupts during the execution being proved. We argue that lack of asynchronous capabilities (i.e., interrupts within PoX) can obscure PoX usefulness, as several applications require processing real-time and asynchronous events. To bridge this gap, we propose, implement, and evaluate an Architecture for Secure Asynchronous Processing in PoX (ASAP). ASAP is secure under full software compromise, enables asynchronous PoX, and incurs less hardware overhead than prior work.
Towards a formally verified hardware root-of-trust for data-oblivious computing
- Lucas Deutschmann
- Johannes Müller
- Mohammad R. Fadiheh
- Dominik Stoffel
- Wolfgang Kunz
The importance of preventing microarchitectural timing side channels in security-critical applications has surged immensely over the last several years. Constant-time programming has emerged as a best-practice technique to prevent leaking out secret information through timing. It builds on the assumption that certain basic machine instructions execute timing-independently w.r.t. their input data. However, whether an instruction fulfills this data-independent timing criterion varies strongly from architecture to architecture.
In this paper, we propose a novel methodology to formally verify data-oblivious behavior in hardware using standard property checking techniques. Each successfully verified instruction represents a trusted hardware primitive for developing data-oblivious algorithms. A counterexample, on the other hand, represents a restriction that must be communicated to the software developer. We evaluate the proposed methodology in multiple case studies, ranging from small arithmetic units to medium-sized processors. One case study uncovered a data-dependent timing violation in the extensively verified and highly secure Ibex RISC-V core.
A scalable SIMD RISC-V based processor with customized vector extensions for CRYSTALS-kyber
- Huimin Li
- Nele Mentens
- Stjepan Picek
This paper uses RISC-V vector extensions to speed up lattice-based operations in architectures based on HW/SW co-design. We analyze the structure of the number-theoretic transform (NTT), inverse NTT (INTT), and coefficient-wise multiplication (CWM) in CRYSTALS-Kyber, a lattice-based key encapsulation mechanism. We propose 12 vector extensions for CRYSTALS-Kyber multiplication and four for finite field operations in combination with two optimizations of the HW/SW interface. This results in a speed-up of 141.7, 168.7, and 245.5 times for NTT, INTT, and CWM, respectively, compared with the baseline implementation, and a speed-up of over four times compared with the state-of-the-art HW/SW co-design using RV32IMC.
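To make the three kernels concrete, the sketch below gives a plain-Python radix-2 NTT, inverse NTT, and coefficient-wise multiplication over a generic NTT-friendly prime. CRYSTALS-Kyber itself uses q = 3329 with an incomplete negacyclic NTT, so the modulus, root, and transform length here are illustrative rather than Kyber's.

```python
# Minimal radix-2 NTT / INTT / coefficient-wise multiply, the three kernels the
# paper accelerates. Parameters are generic NTT-friendly choices, not Kyber's.
P = 998244353        # prime with a large power of two dividing P - 1
G = 3                # primitive root of P

def ntt(a, invert=False):
    a = list(a); n = len(a)
    # Bit-reversal permutation, then iterative Cooley-Tukey butterflies.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit; bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w = pow(G, (P - 1) // length, P)
        if invert:
            w = pow(w, P - 2, P)
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * wn % P
                a[k], a[k + length // 2] = (u + v) % P, (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a

def cyclic_mul(f, g):
    """Polynomial multiplication mod (x^n - 1): NTT, coefficient-wise multiply, INTT."""
    F, H = ntt(f), ntt(g)
    return ntt([x * y % P for x, y in zip(F, H)], invert=True)

f, g = [1, 2, 3, 4, 0, 0, 0, 0], [5, 6, 7, 8, 0, 0, 0, 0]
print(cyclic_mul(f, g))   # matches the schoolbook cyclic convolution of f and g
```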
Hexagons are the bestagons: design automation for silicon dangling bond logic
- Marcel Walter
- Samuel Sze Hang Ng
- Konrad Walus
- Robert Wille
Field-coupled Nanocomputing (FCN) defines a class of post-CMOS nanotechnologies that promises compact layouts, low power operation, and high clock rates. Recent breakthroughs in the fabrication of Silicon Dangling Bonds (SiDBs) acting as quantum dots enabled the demonstration of a sub-30 nm2 OR gate and wire segments. This motivated the research community to invest manual labor in the design of additional gates and whole circuits which, however, is currently severely limited by scalability issues. In this work, these limitations are overcome by the introduction of a design automation framework that establishes a flexible topology based on hexagons as well as a corresponding Bestagon gate library for this technology and, additionally, provides automatic methods for physical design. By this, the first design automation solution for the promising SiDB platform is proposed. In an effort to support open research and open data, the resulting framework and all design files will be made available.
Improving compute in-memory ECC reliability with successive correction
- Brian Crafton
- Zishen Wan
- Samuel Spetalnick
- Jong-Hyeok Yoon
- Wei Wu
- Carlos Tokunaga
- Vivek De
- Arijit Raychowdhury
Compute in-memory (CIM) is an exciting technique that minimizes data transport, maximizes memory throughput, and performs computation on the bitline of memory sub-arrays. This is especially interesting for machine learning applications, where increased memory bandwidth and analog domain computation offer improved area and energy efficiency. Unfortunately, CIM faces new challenges traditional CMOS architectures have avoided. In this work, we explore the impact of device variation (calibrated with measured data on foundry RRAM arrays) and propose a new class of error correcting codes (ECC) for hard and soft errors in CIM. We demonstrate single, double, and triple error correction offering over 16,000× reduction in bit error rate over a design without ECC and over 427× over prior work, while consuming only 29.1% area and 26.3% power overhead.
Energy efficient data search design and optimization based on a compact ferroelectric FET content addressable memory
- Jiahao Cai
- Mohsen Imani
- Kai Ni
- Grace Li Zhang
- Bing Li
- Ulf Schlichtmann
- Cheng Zhuo
- Xunzhao Yin
Content Addressable Memory (CAM) is widely used for associative search tasks in advanced machine learning models and data-intensive applications due to its highly parallel pattern matching capability. Most state-of-the-art CAM designs focus on reducing the CAM cell area by exploiting nonvolatile memories (NVMs). There is little research on optimizing the design and energy efficiency of NVM-based CAMs for practical deployment in edge devices and AI hardware. In this paper, we propose a general, compact, and energy-efficient CAM design scheme that alleviates the design overhead by employing just one NVM device per cell. We also propose an adaptive matchline (ML) precharge and discharge scheme that further optimizes the search energy by fully reducing the ML voltage swing. We consider ferroelectric field-effect transistors (FeFETs) as the representative NVM and present a 2T-1FeFET CAM array including a sense amplifier implementing the proposed ML scheme. Evaluation results suggest that our proposed 2T-1FeFET CAM design achieves 6.64×/4.74×/9.14×/3.02× better energy efficiency compared with CMOS/ReRAM/STT-MRAM/2FeFET CAM arrays. Benchmarking results show that our approach provides 3.3×/2.1× energy-delay product improvement over the 2T-2R/2FeFET CAM in accelerating query processing applications.
CamSkyGate: camouflaged skyrmion gates for protecting ICs
- Yuqiao Zhang
- Chunli Tang
- Peng Li
- Ujjwal Guin
Magnetic skyrmions have the potential to become candidates for emerging technologies due to their ultra-high integration density and ultra-low energy. A skyrmion is a magnetic pattern created by transverse current injection in the ferromagnetic (FM) layer; it can be generated by a localized spin-polarized current and behaves like a stable pseudoparticle. Different logic gates have been proposed, where the presence or absence of a single skyrmion is represented as binary logic 1 or logic 0, respectively. In this paper, we propose novel camouflaged logic gate designs to prevent an adversary from extracting the original netlist. The proposal uses differential doping to block the propagation of skyrmions to realize the camouflaged gates. To the best of our knowledge, we are the first to propose camouflaged skyrmion gates to prevent an adversary from performing reverse engineering. We demonstrate the functionality of different camouflaged gates using the mumax3 micromagnetic simulator. We have also evaluated the security of the proposed camouflaged designs using SAT attacks. We show that the same security as traditional CMOS-based camouflaged circuits can be retained.
GNN-based concentration prediction for random microfluidic mixers
- Weiqing Ji
- Xingzhuo Guo
- Shouan Pan
- Tsung-Yi Ho
- Ulf Schlichtmann
- Hailong Yao
Recent years have witnessed significant advances brought by microfluidic biochips in automating biochemical processing. Accurate preparation of fluid samples with microfluidic mixers is a fundamental step in various biomedical applications, where concentration prediction and generation are critical. Finite element analysis (FEA), e.g., using COMSOL, is the most commonly used simulation method for accurate concentration prediction of a given biochip design. However, the FEA simulation process is time-consuming and scales poorly to large biochip sizes. This paper proposes a new concentration prediction method based on graph neural networks (GNNs), which efficiently and accurately predicts the concentrations generated by random microfluidic mixers of different sizes. Experimental results show that, compared with the state-of-the-art method, the proposed GNN-based simulation method reduces the error of the predicted concentration by 88%, which validates the effectiveness of the proposed GNN model.
Designing ML-resilient locking at register-transfer level
- Dominik Sisejkovic
- Luca Collini
- Benjamin Tan
- Christian Pilato
- Ramesh Karri
- Rainer Leupers
Various logic-locking schemes have been proposed to protect hardware from intellectual property piracy and malicious design modifications. Since traditional locking techniques are applied on the gate-level netlist after logic synthesis, they have no semantic knowledge of the design function. Data-driven, machine-learning (ML) attacks can uncover the design flaws within gate-level locking. Recent proposals on register-transfer level (RTL) locking have access to semantic hardware information. We investigate the resilience of ASSURE, a state-of-the-art RTL locking method, against ML attacks. We used the lessons learned to derive two ML-resilient RTL locking schemes built to reinforce ASSURE locking. We developed ML-driven security metrics to evaluate the schemes against an RTL adaptation of the state-of-the-art, ML-based SnapShot attack.
O’clock: lock the clock via clock-gating for SoC IP protection
- M Sazadur Rahman
- Rui Guo
- Hadi M Kamali
- Fahim Rahman
- Farimah Farahmandi
- Mohamed Abdel-Moneum
- Mark Tehranipoor
Existing logic locking techniques can prevent IP piracy or tampering. However, they often come at the expense of high overhead and are gradually becoming vulnerable to emerging deobfuscation attacks. To protect SoC IPs, we propose O’Clock, a fully-automated clock-gating-based approach that ‘locks the clock’ to protect IPs in complex SoCs. O’Clock obstructs data/control flows and makes the underlying logic dysfunctional for incorrect keys by manipulating the activity factor of the clock tree. O’Clock has minimal changes to the original design and no change to the IC design flow. Our experimental results show its high resiliency against state-of-the-art de-obfuscation attacks (e.g., oracle-guided SAT, unrolling-/BMC-based SAT, removal, and oracle-less machine learning-based attacks) at negligible power, performance, and area (PPA) overhead.
ALICE: an automatic design flow for eFPGA redaction
- Chiara Muscari Tomajoli
- Luca Collini
- Jitendra Bhandari
- Abdul Khader Thalakkattu Moosa
- Benjamin Tan
- Xifan Tang
- Pierre-Emmanuel Gaillardon
- Ramesh Karri
- Christian Pilato
Fabricating an integrated circuit is becoming unaffordable for many semiconductor design houses. Outsourcing the fabrication to a third-party foundry requires methods to protect the intellectual property of the hardware designs. Designers can rely on embedded reconfigurable devices to completely hide the real functionality of selected design portions unless the configuration string (bitstream) is provided. However, selecting such portions and creating the corresponding reconfigurable fabrics are still open problems. We propose ALICE, a design flow that addresses the EDA challenges of this problem. ALICE partitions the RTL modules between one or more reconfigurable fabrics and the rest of the circuit, automating the generation of the corresponding redacted design.
DELTA: DEsigning a stealthy trigger mechanism for analog hardware trojans and its detection analysis
- Nishant Gupta
- Mohil Sandip Desai
- Mark Wijtvliet
- Shubham Rai
- Akash Kumar
This paper presents a stealthy triggering mechanism that reduces the dependencies of analog hardware Trojans on the frequent toggling of the software-controlled rare nets. The trigger to activate the Trojan is generated by using a glitch generation circuit and a clock signal, which increases the selectivity and feasibility of the trigger signal. The proposed trigger is able to evade the state-of-the-art run-time detection (R2D2) and Built-In Acceleration Structure (BIAS) schemes. Furthermore, the simulation results show that the proposed trigger circuit incurs a minimal overhead in side-channel footprints in terms of area (29 transistors), delay (less than 1ps in the clock cycle), and power (1μW).
VIPR-PCB: a machine learning based golden-free PCB assurance framework
- Aritra Bhattacharyay
- Prabuddha Chakraborty
- Jonathan Cruz
- Swarup Bhunia
Printed circuit boards (PCBs) form an integral part of the electronics life cycle by providing mechanical support and electrical connections to microchips and discrete electronic components. PCBs follow a similar life cycle as microchips and are vulnerable to similar assurance issues. Malicious design alterations, i.e., hardware Trojan attacks, have emerged as a major threat to PCB assurance. Board-level Trojans are extremely challenging to detect due to (1) the lack of golden or reference models in most use cases, (2) a potentially unbounded attack space, and (3) the growing complexity of commercial PCB designs. Existing PCB inspection techniques (e.g., optical and electrical) do not scale to large volumes and are expensive, time-consuming, and often not reliable in covering the diverse Trojan space. To address these issues, in this paper, we present VIPR-PCB, a board-level Trojan detection framework that employs a machine learning (ML) model to learn Trojan signatures in functional and structural space and uses the trained model to discover Trojans in suspect PCB designs with high fidelity. Using extensive evaluation with 10 open-source PCB designs and a wide variety of Trojan instances, we demonstrate that VIPR-PCB can achieve over 98% accuracy and is even capable of detecting Trojans in partially-recovered PCB designs.
CLIMBER: defending phase change memory against inconsistent write attacks
- Zhuohui Duan
- Haobo Wang
- Haikun Liu
- Xiaofei Liao
- Hai Jin
- Yu Zhang
- Fubing Mao
Non-volatile Memories (NVMs) usually demonstrate vast endurance variation due to Process Variation (PV). They are vulnerable to an Inconsistent Write Attack (IWA) which reverses the write intensity distribution in two adjacent wear leveling windows. In this paper, we propose CLIMBER, a defense mechanism to neutralize IWA for NVMs. CLIMBER dynamically changes harmful address mappings so that intensive writes to weak cells are still redirected to strong cells. CLIMBER also conceals weak NVM cells from attackers by randomly mapping cold addresses to weak NVM regions. Experimental results show that CLIMBER can reduce maximum page wear rate by 43.2% compared with the state-of-the-art Toss-up Wear Leveling and prolong NVM lifetime from 4.19 years to 7.37 years with trivial performance/hardware overhead.
Rethinking key-value store for byte-addressable optane persistent memory
- Sung-Ming Wu
- Li-Pin Chang
Optane Persistent Memory (PM) is a pioneering solution to byte-addressable PM for commodity systems. However, the performance of Optane PM is highly workload-sensitive, rendering many prior designs of Key-Value (KV) store inefficient. To cope with this reality, we advocate rethinking KV store design for Optane PM. Our design follows a principle of Single-stream Writing with managed Multi-stream Reading (SWMR): Incoming KV pairs are written to PM through a single write stream and managed by an ordered index in DRAM. Through asynchronously sorting and rewriting large sets of KV pairs, range queries are handled with a managed number of concurrent streams. YCSB results show that our design improved upon existing ones by 116% and 21% for write-only throughput and read-write throughput, respectively.
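A toy in-memory model of the SWMR principle follows, assuming an append-only Python list as the single persistent write stream and a sorted list plus dict as the DRAM index; real PM placement, persistence ordering, and the asynchronous sort/rewrite path for range queries are beyond this sketch.

```python
import bisect

class SWMRStoreSketch:
    """Toy sketch of Single-stream Writing with Managed Multi-stream Reading:
    all writes append to one log (stand-in for the PM write stream); a sorted
    in-DRAM index maps keys to log offsets; range queries walk the index."""
    def __init__(self):
        self.log = []          # append-only "persistent" stream: (key, value)
        self.keys = []         # sorted keys (DRAM index)
        self.offsets = {}      # key -> latest log offset

    def put(self, key, value):
        self.log.append((key, value))                 # single write stream
        if key not in self.offsets:
            bisect.insort(self.keys, key)
        self.offsets[key] = len(self.log) - 1

    def get(self, key):
        off = self.offsets.get(key)
        return None if off is None else self.log[off][1]

    def range(self, lo, hi):
        i, j = bisect.bisect_left(self.keys, lo), bisect.bisect_right(self.keys, hi)
        return [(k, self.get(k)) for k in self.keys[i:j]]

kv = SWMRStoreSketch()
for k in [5, 1, 9, 3, 1]:
    kv.put(k, f"v{k}")
print(kv.get(1), kv.range(2, 9))
```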
libcrpm: improving the checkpoint performance of NVM
- Feng Ren
- Kang Chen
- Yongwei Wu
libcrpm is a new programming library that improves checkpoint performance for applications running in NVM. It proposes a failure-atomic differential checkpointing protocol, which simultaneously addresses two problems that exist in current NVM-based checkpoint-recovery libraries: (1) high write amplification when page-granularity incremental checkpointing is used, and (2) high persistence costs from excessive memory-fence instructions when fine-grained undo-logging or copy-on-write is used. Evaluation results show that libcrpm reduces the checkpoint overhead in realistic workloads. For MPI-based parallel applications such as LULESH, the checkpoint overhead of libcrpm is only 44.78% of that of FTI, an application-level checkpoint-recovery library.
Scalable crash consistency for secure persistent memory
- Ming Zhang
- Yu Hua
- Xuan Li
- Hao Xu
Persistent memory (PM) suffers from data security and crash consistency issues due to non-volatility. Counter-mode encryption (CME) and bonsai merkle tree (BMT) have been adopted to ensure data security by using security metadata. The data and its security metadata need to be atomically persisted for correct recovery. To ensure crash consistency, durable transactions have been widely employed. However, the long-time BMT update increases the transaction latency, and the security metadata incur heavy write traffic. This paper presents Secon to ensure SEcurity and crash CONsistency for PM with high performance. Secon leverages a scalable write-through metadata cache to ensure the atomicity of data and its security metadata. To reduce the transaction latency, Secon proposes a transaction-specific epoch persistency model to minimize the ordering constraints. To reduce the amount of PM writes, Secon co-locates counters with log entries and coalesces BMT blocks. Experimental results demonstrate that Secon significantly improves the transaction performance and decreases the write traffic.
Don’t open row: rethinking row buffer policy for improving performance of non-volatile memories
- Yongho Lee
- Osang Kwon
- Seokin Hong
Among the various NVM technologies, phase-change memory (PCM) has attracted substantial attention as a candidate to replace DRAM for next-generation memory. However, the characteristics of PCM cause it to have much longer read and write latencies than DRAM. This paper proposes a Write-Around PCM System that addresses this limitation using two novel schemes: Pseudo-Row Activation and Direct Write. Pseudo-Row Activation provides fast row activation for PCM writes by connecting a target row to the bitlines without fetching the data into the row buffer. With the Direct Write scheme, our system allows write operations to update the data even if the target row is in the logically closed state.
SMART: on simultaneously marching racetracks to improve the performance of racetrack-based main memory
- Xiangjun Peng
- Ming-Chang Yang
- Ho Ming Tsui
- Chi Ngai Leung
- Wang Kang
RaceTrack Memory (RTM) is a promising medium for modern main-memory subsystems. However, the “shift-before-access” principle inherent to RTM introduces considerable overheads to the access latency. To obtain more insights for mitigating shift overheads, this work characterizes the access patterns exhibited by state-of-the-art RTM-based main memory and observes that they mismatch the granularity of shift commands (i.e., a group of RaceTracks called a Domain Block Cluster (DBC)). Based on this characterization, we propose a novel mechanism called SMART, which simultaneously and proactively marches all DBCs within a subarray, so that subsequent accesses to other DBCs can be served without additional shift commands. Evaluation results show that, averaged across 15 real-world workloads, SMART outperforms other state-of-the-art RTM-based main-memory proposals by at least 1.53X in terms of total execution time, on two different generations of RTM technologies.
SAPredictor: a simple and accurate self-adaptive predictor for hierarchical hybrid memory system
- Yujuan Tan
- Wei Chen
- Zhulin Ma
- Dan Xiao
- Zhichao Yan
- Duo Liu
- Xianzhang Chen
In a hybrid memory system using DRAM as the NVM cache, DRAM and NVM can be accessed in serial or parallel mode. However, we found that using either mode alone will bring access latency and bandwidth problems. In this paper, we integrate these two access modes and design a simple but accurate predictor (called SAPredictor) to help choose the appropriate access mode, thereby avoiding long access latency and bandwidth problems to improve memory performance. Our experiments show that SAPredictor achieves an accuracy rate of up to 97.1% and helps reduce access latency by up to 35.6% at fairly low costs.
AVATAR: an aging- and variation-aware dynamic timing analyzer for application-based DVAFS
- Zuodong Zhang
- Zizheng Guo
- Yibo Lin
- Runsheng Wang
- Ru Huang
As the timing guardband continues to increase with continued technology scaling, better-than-worst-case (BTWC) design has gained more and more attention. BTWC design can improve energy efficiency and/or performance by relaxing conservative static timing constraints and exploiting the dynamic timing margin. However, to avoid potential reliability hazards, existing dynamic timing analysis (DTA) tools have to add extra aging and variation guardbands, which are estimated under the worst-case corners of aging and variation. Such a guardbanding method introduces unnecessary margin in timing analysis, thus reducing the performance and efficiency gains of BTWC designs. Therefore, in this paper, we propose AVATAR, an aging- and variation-aware dynamic timing analyzer that can perform DTA under the impact of transistor aging and random process variation. We also propose an application-based dynamic-voltage-accuracy-frequency-scaling (DVAFS) design flow based on AVATAR, which can improve energy efficiency by exploiting both dynamic timing slack (DTS) and the intrinsic error tolerance of the application. The results show that a 45.8% performance improvement and 68% power savings can be achieved by exploiting the intrinsic error tolerance. Compared with the conventional flow based on corner-based DTA, the additional performance improvement of the proposed flow can be up to 14%, or the additional power saving can be up to 20%.
A defect tolerance framework for improving yield
- Shiva Shankar Thiagarajan
- Suriyaprakash Natarajan
- Yiorgos Makris
In the latest technology nodes, there is growing concern about yield loss due to timing failures and delay degradation resulting from manufacturing complexities. Largely, these process imperfections are fixed using empirical methods such as layout guidelines and process fixes, which come late in the design cycle. In this work, we propose a framework for improving design yield by synthesizing netlists with an improved ability to withstand delay variations. We advocate a defect-tolerant approach during the early design stages that introduces defect awareness into EDA synthesis, thereby generating robust netlists that can withstand delays induced by process imperfections. Toward this objective, we present (a) a methodology to characterize standard library cells for delay defects to model the robustness of cell delays, and (b) a solution to drive design synthesis using the intelligence from the cell characterization to achieve design robustness to timing errors. We also introduce defect-tolerance metrics to quantify the robustness of standard cells to timing variations, which we use to generate defect-aware libraries to guide defect-aware synthesis. The effectiveness of the proposed defect-aware methodology is evaluated on a set of benchmarks implemented in GF 12nm technology using static timing analysis (STA), revealing a 70–80% reduction in yield loss due to timing errors arising from manufacturing defects, with minimal impact on area and power and no impact on performance.
Winograd convolution: a perspective from fault tolerance
- Xinghua Xue
- Haitong Huang
- Cheng Liu
- Tao Luo
- Lei Zhang
- Ying Wang
Winograd convolution was originally proposed to reduce computing overhead by replacing multiplications in neural networks (NNs) with additions via linear transformations. Beyond computing efficiency, we observe its great potential in improving NN fault tolerance and evaluate its fault tolerance comprehensively for the first time. We then explore the use of the fault tolerance of Winograd convolution for either fault-tolerant or energy-efficient NN processing. According to our experiments, Winograd convolution can be utilized to reduce fault-tolerant design overhead by 27.49% or energy consumption by 7.19% without any accuracy loss, compared to designs that are unaware of this fault tolerance.
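For readers unfamiliar with the transform, the classic F(2,3) Winograd form computes two outputs of a 3-tap convolution with four multiplications instead of six. The sketch below verifies it against direct convolution; the paper's fault-tolerance analysis concerns how hardware errors propagate through these transformed intermediates, which is not reproduced here.

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd F(2,3): two 1-D convolution outputs from a 3-tap filter using
    4 multiplications (m1..m4) instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

def direct_conv(d, g):
    """Direct computation of the same two sliding-window outputs."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    return np.array([d0 * g0 + d1 * g1 + d2 * g2,
                     d1 * g0 + d2 * g1 + d3 * g2])

d, g = np.random.rand(4), np.random.rand(3)
print("Winograd matches direct convolution:",
      np.allclose(winograd_f23(d, g), direct_conv(d, g)))
```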
Towards resilient analog in-memory deep learning via data layout re-organization
- Muhammad Rashedul Haq Rashed
- Amro Awad
- Sumit Kumar Jha
- Rickard Ewetz
Processing in-memory paves the way for neural network inference engines. An arising challenge is to develop the software/hardware interface to automatically compile deep learning models onto in-memory computing platforms. In this paper, we observe that the data layout organization of a deep neural network (DNN) model directly impacts the model’s classification accuracy. This stems from the fact that resistive parasitics within a crossbar introduce a dependency between the matrix data and the precision of the analog computation. To minimize the impact of the parasitics, we first perform a case study to understand the underlying matrix properties that result in computation with low and high precision, respectively. Next, we propose the XORG framework, which performs data layout organization for DNNs deployed on in-memory computing platforms. The data layout organization improves precision by optimizing the weight-matrix-to-crossbar assignments at compile time. The experimental results show that the XORG framework improves precision by up to 3.2X and by 31% on average. When accelerating DNNs using XORG, the write bit-accuracy requirements are relaxed by 1 bit and the robustness to random telegraph noise (RTN) is improved.
SEM-latch: a low-cost and high-performance latch design for mitigating soft errors in nanoscale CMOS process
- Zhong-Li Tang
- Chia-Wei Liang
- Ming-Hsien Hsiao
- Charles H.-P. Wen
Soft errors (primarily single-event transients (SETs) and single-event upsets (SEUs)) are receiving increased attention due to the increasing prevalence of automotive and biomedical electronics. In recent years, several latch designs have been developed for SEU/SET protection, but each has its own issues regarding timing, area, and power. Therefore, we propose a novel soft-error-mitigating latch design, called SEM-Latch, which extends QUATRO and incorporates a speed path while embedding a reference voltage generator (RVG), simultaneously improving timing, area, and power in a 45nm CMOS process. SEM-Latch effectively reduces the power, area, and PDAP (product of delay, area, and power) by an average of 1.4%, 12.5%, and 8.7%, respectively, in comparison to a previous latch (HPST) with equivalent SEU protection. Furthermore, in comparison to AMSER-Latch, SEM-Latch reduces area, timing overhead, and PDAP by 27.2%, 48.2%, and 60.2%, respectively, while providing a 99.9999% particle rejection rate for SET protection.
BlueSeer: AI-driven environment detection via BLE scans
- Valentin Poirot
- Oliver Harms
- Hendric Martens
- Olaf Landsiedel
IoT devices rely on environment detection to trigger specific actions, e.g., for headphones to adapt noise cancellation to the surroundings. While phones feature many sensors, from GNSS to cameras, small wearables must rely on the few energy-efficient components they already incorporate. In this paper, we demonstrate that a Bluetooth radio is the only component required to accurately classify environments and present BlueSeer, an environment-detection system that solely relies on received BLE packets and an embedded neural network. BlueSeer achieves an accuracy of up to 84% differentiating between 7 environments on resource-constrained devices, and requires only ~ 12 ms for inference on a 64 MHz microcontroller-unit.
Compressive sensing based asymmetric semantic image compression for resource-constrained IoT system
- Yujun Huang
- Bin Chen
- Jianghui Zhang
- Qiu Han
- Shu-Tao Xia
The widespread application of the Internet of Things (IoT) and deep learning has made machine-to-machine semantic communication possible. However, it remains challenging to deploy DNN models on IoT devices due to their limited computing and storage capacity. In this paper, we propose Compressed Sensing based Asymmetric Semantic Image Compression (CS-ASIC) for resource-constrained IoT systems, which consists of a lightweight front encoder and a deep iterative decoder offloaded to the server. We further consider a task-oriented scenario and optimize CS-ASIC for semantic recognition tasks. The experimental results demonstrate that CS-ASIC achieves a considerable data-semantic rate-distortion trade-off and low encoding complexity compared with prevailing codecs.
R2B: high-efficiency and fair I/O scheduling for multi-tenant with differentiated demands
- Diansen Sun
- Yunpeng Chai
- Chaoyang Liu
- Weihao Sun
- Qingpeng Zhang
Big data applications have differentiated requirements for I/O resources in cloud environments. For instance, data analytics and AI/ML applications usually have periodic burst I/O traffic, while data stream processing and database applications often introduce fluctuating I/O loads based on a guaranteed I/O bandwidth. However, the existing resource isolation model (i.e., RLW) and methods (e.g., Token-bucket, mClock, and cgroup) cannot support fluctuating I/O loads and differentiated I/O demands well, and thus cannot simultaneously achieve fairness, high resource utilization, and high performance for applications. In this paper, we propose a novel, efficient, and fair I/O resource isolation model and method called R2B, which can adapt to the differentiated I/O characteristics and requirements of different applications in a shared resource environment. R2B simultaneously satisfies fairness while achieving both high application efficiency and high bandwidth utilization.
This work also aims to help cloud providers achieve higher utilization by shifting to cloud customers the burden of specifying their workload type.
Fast and scalable human pose estimation using mmWave point cloud
- Sizhe An
- Umit Y. Ogras
Millimeter-Wave (mmWave) radar can enable high-resolution human pose estimation with low cost and computational requirements. However, mmWave data point cloud, the primary input to processing algorithms, is highly sparse and carries significantly less information than other alternatives such as video frames. Furthermore, the scarce labeled mmWave data impedes the development of machine learning (ML) models that can generalize to unseen scenarios. We propose a fast and scalable human pose estimation (FUSE) framework that combines multi-frame representation and meta-learning to address these challenges. Experimental evaluations show that FUSE adapts to the unseen scenarios 4× faster than current supervised learning approaches and estimates human joint coordinates with about 7 cm mean absolute error.
VWR2A: a very-wide-register reconfigurable-array architecture for low-power embedded devices
- Benoît W. Denkinger
- Miguel Peón-Quirós
- Mario Konijnenburg
- David Atienza
- Francky Catthoor
Edge computing requires high-performance, energy-efficient embedded systems. Fixed-function or custom accelerators, such as FFT or FIR filter engines, are very efficient at implementing a particular functionality for a given set of constraints. However, they are inflexible when facing application-wide optimizations or functionality upgrades. Conversely, programmable cores offer higher flexibility, but often with a penalty in area, performance, and, above all, energy consumption. In this paper, we propose VWR2A, an architecture that integrates high computational density and low-power memory structures (i.e., very-wide registers and scratchpad memories). VWR2A narrows the energy gap with a dedicated FFT accelerator while achieving similar or better performance on FFT kernels. Moreover, the flexibility of VWR2A allows it to accelerate multiple kernels, resulting in significant energy savings at the application level.
Alleviating datapath conflicts and design centralization in graph analytics acceleration
- Haiyang Lin
- Mingyu Yan
- Duo Wang
- Mo Zou
- Fengbin Tu
- Xiaochun Ye
- Dongrui Fan
- Yuan Xie
Previous graph analytics accelerators have achieved great improvements in throughput by alleviating irregular off-chip memory accesses. However, on-chip datapath conflicts and design centralization have become the critical issues hindering further throughput improvement. In this paper, a general solution, the Multiple-stage Decentralized Propagation network (MDP-network), is proposed to address these issues, inspired by the key idea of trading latency for throughput. Besides, a novel High throughput Graph analytics accelerator, HiGraph, is proposed by deploying MDP-network to address each issue in practice. Experiments show that, compared with a state-of-the-art accelerator, HiGraph achieves up to 2.2× speedup (1.5× on average) as well as better scalability.
Hyperdimensional hashing: a robust and efficient dynamic hash table
- Mike Heddes
- Igor Nunes
- Tony Givargis
- Alexandru Nicolau
- Alex Veidenbaum
Most cloud services and distributed applications rely on hashing algorithms that allow dynamic scaling of a robust and efficient hash table; examples include AWS, Google Cloud, and BitTorrent. Consistent and rendezvous hashing are algorithms that minimize key remapping as the hash table resizes. While memory errors in large-scale cloud deployments are common, neither algorithm offers both efficiency and robustness. Hyperdimensional Computing is an emerging computational model that is inherently efficient and robust and is well suited for vector or hardware acceleration. We propose Hyperdimensional (HD) hashing and show that it is efficient enough to be deployed in large systems. Moreover, a realistic level of memory errors causes more than 20% mismatches for consistent hashing, while HD hashing remains unaffected.
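The following minimal sketch (Python/NumPy; the bipolar hypervectors, the SHA-256-seeded key encoding, and the 10,000-dimensional vectors are assumptions for illustration, not the paper's exact scheme) conveys the general idea behind hashing with hyperdimensional computing: a key is routed to the node whose hypervector is most similar, and a handful of flipped bits barely changes the dot products, which is where the robustness to memory errors comes from.

```python
import hashlib
import numpy as np

D = 10000  # hypervector dimensionality (assumed)
rng = np.random.default_rng(0)

def random_hv():
    # random bipolar hypervector representing one server/node
    return rng.choice([-1, 1], size=D)

servers = {name: random_hv() for name in ["node-a", "node-b", "node-c"]}

def encode_key(key: str):
    # deterministic pseudo-random hypervector derived from the key
    seed = int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).choice([-1, 1], size=D)

def lookup(key: str) -> str:
    hv = encode_key(key)
    # route the key to the most similar node; a few corrupted bits in a
    # server hypervector shift the dot product only marginally
    return max(servers, key=lambda name: int(servers[name] @ hv))

print(lookup("object-42"))
```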
In-situ self-powered intelligent vision system with inference-adaptive energy scheduling for BNN-based always-on perception
- Maimaiti Nazhamaiti
- Haijin Su
- Han Xu
- Zheyu Liu
- Fei Qiao
- Qi Wei
- Zidong Du
- Xinghua Yang
- Li Luo
This paper proposes an in-situ self-powered BNN-based intelligent visual perception system that harvests light energy using the indispensable image sensor itself. The harvested energy is allocated to the low-power BNN computation modules layer by layer, using a lightweight duty-cycling-based energy scheduler. A software-hardware co-design method, which exploits the layer-wise error tolerance of the BNN as well as the computing-error and energy consumption characteristics of the computation circuit, is proposed to determine the parameters of the energy scheduler, achieving high energy efficiency for self-powered BNN inference. Simulation results show that with the proposed inference-adaptive energy scheduling method, a self-powered MNIST classification task can be performed at a frame rate of 4 fps with a harvesting power of 1 μW, while guaranteeing at least 90% inference accuracy using a binary LeNet-5 network.
Adaptive window-based sensor attack detection for cyber-physical systems
- Lin Zhang
- Zifan Wang
- Mengyu Liu
- Fanxin Kong
Sensor attacks alter sensor readings and spoof Cyber-Physical Systems (CPS) into performing dangerous actions. Existing detection works tend to minimize the detection delay and false alarms at the same time, although there is a clear trade-off between the two metrics. Instead, we argue that attack detection should dynamically balance the two metrics as the physical system moves through different states. Following this argument, we propose an adaptive sensor attack detection system that consists of three components: an adaptive detector, a detection deadline estimator, and a data logger. It adapts the detection delay, and thus the false alarm rate, at run time to meet a varying detection deadline and improve usability. Finally, we implement our detection system and validate it using multiple CPS simulators and a reduced-scale autonomous vehicle testbed.
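As a loose illustration of window-based detection with an adaptive trade-off (a sketch only; the `adaptive_window_detect` helper, the residual definition, and the deadline-to-window rule are assumptions rather than the paper's detector), the snippet below shrinks the averaging window when the estimated detection deadline is tight, trading more false alarms for faster detection, and grows it when the deadline is loose.

```python
import numpy as np

def adaptive_window_detect(residuals, deadlines, base_window=20, threshold=1.0):
    """Flag an attack when the windowed mean residual exceeds a threshold.

    residuals: per-step |measurement - model prediction|
    deadlines: per-step estimated detection deadline (in steps)
    The window shrinks when the deadline is tight (faster detection, more
    false alarms) and grows back toward base_window when it is loose."""
    alarms = []
    for t, ddl in enumerate(deadlines):
        w = int(max(2, min(base_window, ddl)))   # adapt window size to the deadline
        lo = max(0, t - w + 1)
        score = float(np.mean(residuals[lo:t + 1]))
        alarms.append(score > threshold)
    return alarms

# toy usage: a spoofed bias appears halfway through, while the deadline tightens
res = np.concatenate([np.abs(np.random.default_rng(0).normal(0, 0.2, 50)),
                      np.abs(np.random.default_rng(1).normal(1.5, 0.2, 50))])
print(sum(adaptive_window_detect(res, deadlines=[30] * 50 + [5] * 50)))
```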
Design-while-verify: correct-by-construction control learning with verification in the loop
- Yixuan Wang
- Chao Huang
- Zhaoran Wang
- Zhilu Wang
- Qi Zhu
In the current control design of safety-critical cyber-physical systems, formal verification techniques are typically applied after the controller is designed to evaluate whether the required properties (e.g., safety) are satisfied. However, due to the increasing system complexity and the fundamental hardness of designing a controller with formal guarantees, such an open-loop process of design-then-verify often results in many iterations and fails to provide the necessary guarantees. In this paper, we propose a correct-by-construction control learning framework that integrates the verification into the control design process in a closed-loop manner, i.e., design-while-verify. Specifically, we leverage the verification results (computed reachable set of the system state) to construct feedback metrics for control learning, which measure how likely the current design of control parameters can meet the required reach-avoid property for safety and goal-reaching. We formulate an optimization problem based on such metrics for tuning the controller parameters, and develop an approximated gradient descent algorithm with a difference method to solve the optimization problem and learn the controller. The learned controller is formally guaranteed to meet the required reach-avoid property. By treating verifiability as a first-class objective and effectively leveraging the verification results during the control learning process, our approach can significantly improve the chance of finding a control design with formal property guarantees, demonstrated in a set of experiments that use model-based or neural network based controllers.
GaBAN: a generic and flexibly programmable vector neuro-processor on FPGA
- Jiajie Chen
- Le Yang
- Youhui Zhang
Spiking neural networks (SNNs) are the main computational model of brain-inspired computing and neuroscience, and they also act as the bridge between the two. With the rapid development of neuroscience, accurate and flexible SNN simulation with high performance is becoming important. This paper proposes GaBAN, a generic and flexibly programmable neuro-processor on FPGA. Different from the majority of current designs, which realize neural components directly in custom hardware, it is centered on a compact, versatile vector instruction set that supports multiple-precision vector calculation, indexed/strided memory access, and conditional execution to accommodate computational characteristics. Through software-hardware co-design, the compiler extracts memory accesses from SNN programs to generate micro-ops executed by an independent hardware unit; the latter interacts with the computing pipeline through an asynchronous buffering mechanism, so memory access latency can be fully overlapped with computation. Tests show that GaBAN not only remarkably outperforms the SOTA ISA-based FPGA solution but is also comparable with hardware-fixed-model counterparts on some tasks. Moreover, in end-to-end testing, its simulation performance exceeds that of a high-performance x86 processor (1.44-3.0×).
ADEPT: automatic differentiable DEsign of photonic tensor cores
- Jiaqi Gu
- Hanqing Zhu
- Chenghao Feng
- Zixuan Jiang
- Mingjie Liu
- Shuhan Zhang
- Ray T. Chen
- David Z. Pan
Photonic tensor cores (PTCs) are essential building blocks for optical artificial intelligence (AI) accelerators based on programmable photonic integrated circuits. PTCs can achieve ultra-fast and efficient tensor operations for neural network (NN) acceleration. Current PTC designs are either manually constructed or based on matrix decomposition theory, which lacks the adaptability to meet various hardware constraints and device specifications. To the best of our knowledge, automatic PTC design methodology is still unexplored. It would be promising to move beyond the manual design paradigm and “nurture” photonic neurocomputing with AI and design automation. Therefore, in this work, for the first time, we propose a fully differentiable framework, dubbed ADEPT, that can efficiently search PTC designs adaptive to various circuit footprint constraints and foundry PDKs. Extensive experiments show the superior flexibility and effectiveness of the proposed ADEPT framework in exploring a large PTC design space. On various NN models and benchmarks, our searched PTC topology outperforms prior manually designed structures with competitive matrix representability, 2×-30× higher footprint compactness, and better noise robustness, demonstrating a new paradigm in photonic neural chip design. The code of ADEPT is available at link using the TorchONN library.
Unicorn: a multicore neuromorphic processor with flexible fan-in and unconstrained fan-out for neurons
- Zhijie Yang
- Lei Wang
- Yao Wang
- Linghui Peng
- Xiaofan Chen
- Xun Xiao
- Yaohua Wang
- Weixia Xu
Neuromorphic processors are popular due to their high energy efficiency for spatio-temporal applications. However, when running spiking neural network (SNN) topologies of ever-growing scale, existing neuromorphic architectures face challenges due to their restrictions on neuron fan-in and fan-out. This paper proposes Unicorn, a multicore neuromorphic processor with a spike train sliding multicasting mechanism (STSM) and a neuron merging mechanism (NMM) to support unconstrained fan-out and flexible fan-in of neurons. Unicorn supports 36K neurons and 45M synapses and thus supports a variety of neuromorphic applications. The peak performance and energy efficiency of Unicorn reach 36 TSOPS and 424 GSOPS/W, respectively. Experimental results show that Unicorn achieves 2×-5.5× energy reduction over the state-of-the-art neuromorphic processor when running an SNN with relatively large fan-out and fan-in.
Effective zero compression on ReRAM-based sparse DNN accelerators
- Hoon Shin
- Rihae Park
- Seung Yul Lee
- Yeonhong Park
- Hyunseung Lee
- Jae W. Lee
For efficient DNN inference, Resistive RAM (ReRAM) crossbars have emerged as a promising building block for computing matrix multiplication in an area- and power-efficient manner. To improve inference throughput, sparse models can be deployed on ReRAM-based DNN accelerators. While unstructured pruning maintains both high accuracy and high sparsity, it performs poorly on the crossbar architecture due to the irregular locations of pruned weights. Meanwhile, due to the non-ideality of ReRAM cells and the high cost of ADCs, matrix multiplication is usually performed at a fine granularity, called an Operation Unit (OU), along both the wordline and bitline dimensions. Although fine-grained, OU-based row compression (ORC) has recently been proposed to increase the weight compression ratio, significant performance potential is still left on the table due to sub-optimal weight mappings. Thus, we propose a novel weight mapping scheme that effectively clusters zero weights via OU-level filter reordering, hence improving the effective weight compression ratio. We also introduce a weight recovery scheme to further improve accuracy, compression ratio, or both. Our evaluation with three popular DNNs demonstrates that the proposed scheme effectively eliminates redundant weights in the crossbar array, and hence ineffectual computation, achieving a 3.27-4.26× array compression ratio with negligible accuracy loss over the baseline ReRAM-based DNN accelerator.
Y-architecture-based flip-chip routing with dynamic programming-based bend minimization
- Szu-Ru Nie
- Yen-Ting Chen
- Yao-Wen Chang
In modern VLSI designs, I/O counts keep growing as systems become more complex. To achieve higher routability, the hexagonal array is introduced with higher pad density and a larger pitch. However, routing for hexagonal arrays differs significantly from that for traditional grid and staggered arrays. In this paper, we consider Y-architecture-based flip-chip routing for the hexagonal array. Unlike the conventional Manhattan and X-architectures, the Y-architecture allows wires to be routed in three directions, namely 0, 60, and 120 degrees. We first analyze the routing properties of the hexagonal array. Then, we propose a triangular tile model and a chord-based internal node division method that can handle both pre-assignment and free-assignment nets without wire crossing. Finally, we develop a novel dynamic programming-based bend minimization method to reduce the number of routing bends in the final solution. Experimental results show that our algorithm achieves 100% routability while effectively minimizing total wirelength and the number of routing bends.
Towards collaborative intelligence: routability estimation based on decentralized private data
- Jingyu Pan
- Chen-Chia Chang
- Zhiyao Xie
- Ang Li
- Minxue Tang
- Tunhou Zhang
- Jiang Hu
- Yiran Chen
Applying machine learning (ML) in the design flow is a popular trend in Electronic Design Automation (EDA), with applications ranging from design quality prediction to optimization. Despite its promise, demonstrated in both academic research and industrial tools, its effectiveness largely hinges on the availability of a large amount of high-quality training data. In reality, EDA developers have very limited access to the latest design data, which is owned by design companies and mostly confidential. Although one can commission ML model training to a design company, the data of a single company might still be inadequate or biased, especially for small companies. This data availability problem is becoming the limiting constraint on the future growth of ML for chip design. In this work, we propose a Federated-Learning-based approach for well-studied ML applications in EDA. Our approach allows an ML model to be trained collaboratively with data from multiple clients, but without explicit access to that data, thereby respecting data privacy. To further strengthen the results, we co-design a customized ML model, FLNet, and its personalization under the decentralized training scenario. Experiments on a comprehensive dataset show that collaborative training improves accuracy by 11% compared with individual local models, and our customized model FLNet significantly outperforms the best of previous routability estimators in this collaborative training flow.
A2-ILT: GPU accelerated ILT with spatial attention mechanism
- Qijing Wang
- Bentian Jiang
- Martin D. F. Wong
- Evangeline F. Y. Young
Inverse lithography technology (ILT) is one of the promising resolution enhancement techniques (RETs) for modern design-for-manufacturing closure; however, it suffers from huge computational overhead and unaffordable mask writing time. In this paper, we propose A2-ILT, a GPU-accelerated ILT framework with a spatial attention mechanism. Building on the previous GPU-accelerated ILT flow, we significantly improve ILT quality by introducing a spatial attention map and on-the-fly mask rectilinearization, and strengthen robustness through Reinforcement-Learning deployment. Experimental results show that, compared to the state-of-the-art solutions, A2-ILT achieves 5.06% and 11.60% reductions in printing error and process variation band, with lower mask complexity and superior runtime performance.
Generic lithography modeling with dual-band optics-inspired neural networks
- Haoyu Yang
- Zongyi Li
- Kumara Sastry
- Saumyadip Mukhopadhyay
- Mark Kilgard
- Anima Anandkumar
- Brucek Khailany
- Vivek Singh
- Haoxing Ren
Lithography simulation is a critical step in VLSI design and optimization for manufacturability. Existing solutions for highly accurate lithography simulation with rigorous models are computationally expensive and slow, even when equipped with various approximation techniques. Recently, machine learning has provided alternative solutions for lithography simulation tasks such as coarse-grained edge placement error regression and complete contour prediction. However, the impact of these learning-based methods has been limited due to restrictive usage scenarios or low simulation accuracy. To tackle these concerns, we introduce a dual-band optics-inspired neural network design that considers the optical physics underlying lithography. To the best of our knowledge, our approach yields the first published via/metal layer contour simulation at 1nm2/pixel resolution with any tile size. Compared to previous machine learning based solutions, we demonstrate that our framework can be trained much faster and offers a significant improvement in efficiency and image quality with a 20× smaller model size. We also achieve an 85× simulation speedup over a traditional lithography simulator with ~1% accuracy loss.
Statistical computing framework and demonstration for in-memory computing systems
- Bonan Zhang
- Peter Deaville
- Naveen Verma
With the increasing importance of data-intensive workloads, such as AI, in-memory computing (IMC) has demonstrated substantial energy/throughput benefits by addressing both compute and data-movement/accessing costs, and holds significant further promise by its ability to leverage emerging forms of highly-scaled memory technologies. However, IMC fundamentally derives its advantages through parallelism, which poses a trade-off with SNR, whereby variations and noise in nanoscaled devices directly limit possible gains. In this work, we propose novel training approaches to improve model tolerance to noise via a contrastive loss function and a progressive training procedure. We further propose a methodology for modeling and calibrating hardware noise, efficiently at the level of a macro operation and through a limited number of hardware measurements. The approaches are demonstrated on a fabricated MRAM-based IMC prototype in 22nm FD-SOI, together with a neural network training framework implemented in PyTorch. For CIFAR-10/100 classifications, model performance is restored to the level of ideal noise-free execution, and generalized performance of the trained model deployed across different chips is demonstrated.
Write or not: programming scheme optimization for RRAM-based neuromorphic computing
- Ziqi Meng
- Yanan Sun
- Weikang Qian
One main fault-tolerant method for neural network accelerators based on resistive random access memory crossbars is the programming-based method, also known as write-and-verify (W-V). In the basic W-V scheme, all devices in the crossbars are programmed repeatedly until they are close enough to their targets, which incurs huge overhead. To reduce this cost, we optimize the W-V scheme by proposing a probabilistic termination criterion for a single device and a systematic optimization method for multiple devices. Furthermore, we propose a joint algorithm that assists the novel W-V scheme with incremental retraining, which further reduces the W-V cost. Compared to the basic W-V scheme, our proposed method improves accuracy by 0.23% for ResNet18 on CIFAR10 with only 9.7% of the W-V cost under variation with σ = 1.2.
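For context, the basic W-V scheme that the abstract above takes as its baseline can be sketched as a simple program-and-read-back loop (Python; the Gaussian programming-noise model, tolerance, and iteration budget are illustrative assumptions, and the paper replaces the fixed per-device budget with a probabilistic termination criterion):

```python
import numpy as np

rng = np.random.default_rng(1)

def write_and_verify(target, sigma=0.05, tol=0.02, max_iters=20):
    """Basic write-and-verify: reprogram the cell until the read-back value
    is within tolerance of the target or the iteration budget is spent."""
    for attempt in range(1, max_iters + 1):
        value = target + rng.normal(0, sigma)   # program + read back with noise
        if abs(value - target) <= tol:
            return value, attempt               # verified close enough
    return value, max_iters                     # gave up after the budget

print(write_and_verify(target=0.8))
```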
ReSMA: accelerating approximate string matching using ReRAM-based content addressable memory
- Huize Li
- Hai Jin
- Long Zheng
- Yu Huang
- Xiaofei Liao
- Zhuohui Duan
- Dan Chen
- Chuangyi Gui
Approximate string matching (ASM) is the basic operation kernel for a large number of string processing applications. Existing Von-Neumann-based ASM accelerators suffer from huge volumes of intermediate data as string datasets keep growing, leading to massive off-chip data transmissions. This paper presents a novel ASM processing-in-memory (PIM) accelerator, namely ReSMA, based on ReCAM and ReRAM arrays to eliminate off-chip data transmissions in ASM. We develop a novel ReCAM-friendly filter-and-filtering algorithm to perform q-gram filtering in ReCAM memory. We also design a new data mapping strategy and a new verification algorithm, which enable computing the edit distances entirely in ReRAM crossbars for energy saving. Experimental results show that ReSMA outperforms the CPU-, GPU-, FPGA-, ASIC-, and PIM-based solutions by 268.7×, 38.6×, 20.9×, 707.8×, and 14.7× in terms of performance, and 153.8×, 42.2×, 31.6×, 18.3×, and 5.3× in terms of energy saving, respectively.
VStore: in-storage graph based vector search accelerator
- Shengwen Liang
- Ying Wang
- Ziming Yuan
- Cheng Liu
- Huawei Li
- Xiaowei Li
Graph-based vector search, which finds the best matches to user queries based on their semantic similarities using a graph data structure, has become instrumental in data science and AI applications. However, deploying graph-based vector search in production systems requires high accuracy and cost-efficiency with low latency and memory footprint, which existing work fails to offer. We present VStore, a graph-based vector search solution that collaboratively optimizes accuracy, latency, memory, and data movement on large-scale vector data based on in-storage computing. The evaluation shows that VStore delivers significant search efficiency improvement and energy reduction while maintaining accuracy, compared with CPU, GPU, and ZipNN platforms.
Scaled-CBSC: scaled counting-based stochastic computing multiplication for improved accuracy
- Shuyuan Yu
- Sheldon X.-D. Tan
Stochastic computing (SC) can lead to area-efficient implementations of logic designs. Existing SC multiplication, however, suffers from a long-standing problem: large multiplication error with small inputs, due to the intrinsic nature of bit-stream-based computing. In this article, we propose a new scaled counting-based SC multiplication approach, called Scaled-CBSC, to mitigate this issue by introducing scaling bits that ensure the bit-'1' density of the stochastic number is sufficiently large. The idea is to convert “small” inputs to “large” inputs, thus improving the accuracy of SC multiplication. Unlike an existing stream-bit-based approach, the new method uses the binary format and does not require stochastic addition, as the SC multiplication always starts with binary numbers. Furthermore, Scaled-CBSC only requires all the numbers to be larger than 0.5 instead of an arbitrarily defined threshold, which leads to integer-only scaling terms. The experimental results show that 8-bit Scaled-CBSC multiplication with 3 scaling bits can achieve up to 46.6% and 30.4% improvements in mean error and standard deviation, respectively; reduce the peak relative error from 100% to 1.8%; and improve delay, area, area-delay product, and energy consumption by 12.6%, 51.5%, 57.6%, and 58.4%, respectively, over the state-of-the-art work. Furthermore, we evaluate the proposed multiplication approach in a discrete cosine transform (DCT) application. The results show that with 3 scaling bits, 8-bit scaled counting-based SC multiplication improves image quality by 5.9 dB on average over the state-of-the-art work.
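A minimal sketch of the scaling idea follows (Python; the bitstream length, the unipolar AND-based multiplication, and the `scaled_sc_multiply` helper are illustrative assumptions, not the paper's counting-based hardware): small operands are left-shifted in binary until they exceed 0.5, multiplied in the stochastic domain, and the integer scaling is undone on the result.

```python
import numpy as np

rng = np.random.default_rng(2)

def sc_multiply(a, b, n_bits=256):
    """Plain stochastic multiplication: AND of two unipolar bitstreams."""
    sa = rng.random(n_bits) < a
    sb = rng.random(n_bits) < b
    return float(np.mean(sa & sb))

def scaled_sc_multiply(a, b, n_bits=256):
    """Scaled multiplication (sketch): shift small inputs up by powers of two
    so both operands are >= 0.5 before streaming, then undo the scaling."""
    ka = kb = 0
    while 0 < a < 0.5:
        a, ka = a * 2, ka + 1
    while 0 < b < 0.5:
        b, kb = b * 2, kb + 1
    return sc_multiply(a, b, n_bits) / (2 ** (ka + kb))

# the scaled version lands far closer to the exact product for small inputs
print(sc_multiply(0.06, 0.04), scaled_sc_multiply(0.06, 0.04), 0.06 * 0.04)
```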
Tailor: removing redundant operations in memristive analog neural network accelerators
- Xingchen Li
- Zhihang Yuan
- Guangyu Sun
- Liang Zhao
- Zhichao Lu
Analog in-situ computation based on memristive circuits has been regarded as a promising approach for designing high-performance and low-power neural network accelerators. However, despite the low-cost and highly parallel memristive crossbars, the peripheral circuits, especially analog-to-digital converters (ADCs), induce significant overhead. Quantitative analysis shows that ADCs can contribute up to 91% of the energy consumption and 72% of the chip area, which significantly offsets the advantages of memristive NN accelerators.
To address this problem, we first mathematically analyze the computation flow in a memristive accelerator and find that it contains many useless operations. These operations significantly increase the demand for peripheral circuits. Then, based on this discovery, we propose a novel architecture, Tailor, which removes these unnecessary operations without accuracy loss. We design two types of Tailor. General Tailor is compatible with most existing memristive accelerators and can be easily applied to them. Customized Tailor is specialized for a particular NN application and can obtain further improvement. Experimental results show that General Tailor reduces inference time by 14%~20% and energy consumption by 33%~41%, while Customized Tailor further achieves 56%~87% higher computation density.
Domain knowledge-infused deep learning for automated analog/radio-frequency circuit parameter optimization
- Weidong Cao
- Mouhacine Benosman
- Xuan Zhang
- Rui Ma
The design automation of analog circuits is a longstanding challenge. This paper presents a reinforcement learning method enhanced by graph learning to automate analog circuit parameter optimization at the pre-layout stage, i.e., finding device parameters that fulfill desired circuit specifications. Unlike all prior methods, our approach is inspired by human experts, who rely on domain knowledge of analog circuit design (e.g., circuit topology and couplings between circuit specifications) to tackle the problem. By incorporating such key domain knowledge into policy training with a multimodal network for the first time, the method best learns the complex relations between circuit parameters and design targets, enabling optimal decisions in the optimization process. Experimental results on exemplary circuits show that it achieves human-level design accuracy (~99%) with 1.5× the efficiency of the best-performing existing methods. Our method also shows better generalization to unseen specifications and better optimality in circuit performance optimization. Moreover, it can be applied to the design of radio-frequency circuits on emerging semiconductor technologies, breaking the limitation of prior learning methods to conventional analog circuits.
A cost-efficient fully synthesizable stochastic time-to-digital converter design based on integral nonlinearity scrambling
- Qiaochu Zhang
- Shiyu Su
- Mike Shuo-Wei Chen
Stochastic time-to-digital converters (STDCs) are gaining increasing interest in submicron CMOS analog/mixed-signal design for their superior tolerance to nonlinear quantization levels. However, the large number of delay units and time comparators required for conventional STDC operation incurs excessive implementation costs. This paper presents a fully synthesizable STDC architecture based on an integral non-linearity (INL) scrambling technique, allowing an order-of-magnitude cost reduction. The proposed technique randomizes and averages the STDC INL using a digital-to-time converter. Moreover, we propose an associated design automation flow and demonstrate an STDC design in a 12nm FinFET process. Post-layout simulations show significant linearity and area/power efficiency improvements compared to prior art.
Using machine learning to optimize graph execution on NUMA machines
- Hiago Mayk G. de A. Rocha
- Janaina Schwarzrock
- Arthur F. Lorenzon
- Antonio Carlos S. Beck
This paper proposes PredG, a Machine Learning framework to enhance graph processing performance by finding the ideal thread and data mapping on NUMA systems. PredG is agnostic to the input graph: it uses the available graphs’ features to train an ANN that performs predictions as new graphs arrive, without requiring any application execution after training. When evaluating PredG over representative graphs and algorithms on three NUMA systems, its solutions are up to 41% faster than the Linux OS default and the best static mapping, on average within 2% of the Oracle, and it presents lower energy consumption.
HCG: optimizing embedded code generation of simulink with SIMD instruction synthesis
- Zhuo Su
- Zehong Yu
- Dongyan Wang
- Yixiao Yang
- Yu Jiang
- Rui Wang
- Wanli Chang
- Jiaguang Sun
Simulink is widely used for the model-driven design of embedded systems. It is able to generate optimized embedded control software code through expression folding, variable reuse, etc. However, for some commonly used computing-sensitive models, such as the models for signal processing applications, the efficiency of the generated code is still limited.
In this paper, we propose HCG, an optimized code generator for Simulink models with SIMD instruction synthesis. It selects the optimal implementations for computation-intensive actors based on adaptive pre-calculation of the input scales, and synthesizes appropriate SIMD instructions for batch computing actors based on iterative dataflow graph mapping. We implemented HCG and evaluated its performance on benchmark Simulink models. Compared to the built-in Simulink Coder and the most recent DFSynth, the code generated by HCG achieves an improvement of 38.9%-92.9% and 41.2%-76.8% in execution time, respectively, across different architectures and compilers.
Raven: a novel kernel debugging tool on RISC-V
- Hongyi Lu
- Fengwei Zhang
Debugging is an essential part of kernel development. However, debugging features are not available on RISC-V without external hardware. In this paper, we leverage a security feature called Physical Memory Protection (PMP) as a debugging primitive to address this issue. Based on this primitive, we design Raven, a novel kernel debugging tool with the standard functionalities (breakpoints, watchpoints, stepping, and introspection). A prototype of Raven is implemented on a SiFive Unmatched development board. Our experiments show that Raven imposes a moderate but acceptable overhead on the kernel. Moreover, a real-world debugging scenario is set up to test its effectiveness.
GTuner: tuning DNN computations on GPU via graph attention network
- Qi Sun
- Xinyun Zhang
- Hao Geng
- Yuxuan Zhao
- Yang Bai
- Haisheng Zheng
- Bei Yu
Compiling DNN models for GPUs and improving their performance remains an open problem. A novel framework, GTuner, is proposed to jointly learn from the structures of computational graphs and the statistical features of codes to find the optimal code implementations. A Graph ATtention network (GAT) is designed as the performance estimator in GTuner. In GAT, graph neural layers propagate information across the graph, and a multi-head self-attention module is designed to learn the complicated relationships between the features. Under the guidance of GAT, the GPU codes are generated through auto-tuning. Experimental results demonstrate that our method outperforms the previous arts remarkably.
Pref-X: a framework to reveal data prefetching in commercial in-order cores
- Quentin Huppert
- Francky Catthoor
- Lionel Torres
- David Novo
Computer system simulators are major tools used by architecture researchers to develop and evaluate new ideas. Clearly, such evaluations are more conclusive when compared to commercial state-of-the-art architectures. However, the behavior of key components in existing processors is often not disclosed, complicating the construction of faithful reference models. The data prefetching engine is one such obscured component that can have a significant impact on key metrics such as performance and energy.
In this paper, we propose Pref-X, a framework to analyze functional characteristics of data prefetching in commercial in-order cores. Our framework reveals data prefetches by X-raying into the cache memory at the request granularity, which allows linking memory access patterns with changes in the cache content. To demonstrate the power and accuracy of our methodology, we use Pref-X to replicate the data prefetching mechanisms of two representative processors, namely the Arm Cortex-A7 and the Arm Cortex-A53, with a 99.8% and 96.9% average accuracy, respectively.
Architecting DDR5 DRAM caches for non-volatile memory systems
- Xin Xin
- Wanyi Zhu
- Li Zhao
With the release of Intel’s Optane DIMM, Non-Volatile Memories (NVMs) are emerging as viable alternatives to DRAM memories because of the advantage of higher capacity. However, the higher latency and lower bandwidth of Optane prevent it from outright replacing DRAM. A prevailing strategy is to employ existing DRAM as a data cache for Optane, thereby achieving overall benefit in capacity, bandwidth, and latency.
In this paper, we inspect new features in DDR5 to better support the DRAM cache design for Optane. Specifically, we leverage the two-level ECC scheme, i.e., DIMM ECC and on-die ECC, in DDR5 to construct a narrower channel for tag probing and propose a new operation for fast cache replacement. Experimental results show that our proposed strategy can achieve, on average, 26% performance improvement.
GraphRing: an HMC-ring based graph processing framework with optimized data movement
- Zerun Li
- Xiaoming Chen
- Yinhe Han
Due to irregular memory accesses and high bandwidth demands, graph processing is usually inefficient on conventional computer architectures. The recent development of processing-in-memory (PIM) techniques such as the hybrid memory cube (HMC) has provided a feasible design direction for graph processing accelerators. Although PIM provides high internal bandwidth, inter-node memory access is inevitable in large-scale graph processing, which greatly affects performance. In this paper, we propose an HMC-based graph processing framework, GraphRing. GraphRing is a software-hardware co-design framework that optimizes inter-HMC communication. It contains a regularity- and locality-aware graph execution model and a ring-based multi-HMC architecture. Evaluation results based on 5 graph datasets and 4 graph algorithms show that GraphRing achieves on average 2.14× speedup and 3.07× inter-HMC communication energy saving compared with GraphQ, a state-of-the-art graph processing architecture.
AxoNN: energy-aware execution of neural network inference on multi-accelerator heterogeneous SoCs
- Ismet Dagli
- Alexander Cieslewicz
- Jedidiah McClurg
- Mehmet E. Belviranli
The energy and latency demands of critical workload execution, such as object detection, in embedded systems vary based on the physical system state and other external factors. Many recent mobile and autonomous System-on-Chips (SoC) embed a diverse range of accelerators with unique power and performance characteristics. The execution flow of critical workloads can be adjusted to span multiple accelerators so that the trade-off between performance and energy fits the dynamically changing physical factors.
In this study, we propose running neural network (NN) inference on multiple accelerators of an SoC. Our goal is to enable an energy-performance trade-off by distributing the layers of a NN between a performance-efficient and a power-efficient accelerator. We first provide an empirical modeling methodology to characterize execution and inter-layer transition times. We then find an optimal layers-to-accelerator mapping by representing the trade-off as a linear programming optimization constraint. We evaluate our approach on the NVIDIA Xavier AGX SoC with commonly used NN models. We use the Z3 SMT solver to find schedules for different energy consumption targets, achieving up to 98% prediction accuracy.
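A toy version of such a layer-to-accelerator assignment can be written directly as an optimization query for the Z3 solver (Python z3py; the per-layer time/energy numbers, the two-accelerator setup, the transition penalty, and the latency budget below are made-up placeholders standing in for the empirically characterized values):

```python
from z3 import Bool, If, Sum, Optimize, sat, is_true

# Hypothetical per-layer costs on two accelerators (e.g., GPU vs. DLA);
# in practice these come from the empirical characterization step.
time_gpu   = [1.0, 2.5, 1.2, 3.0]    # ms per layer on accelerator A
time_dla   = [1.8, 3.1, 1.5, 4.2]    # ms per layer on accelerator B
energy_gpu = [5.0, 9.0, 4.5, 11.0]   # mJ per layer on accelerator A
energy_dla = [2.0, 3.5, 1.8, 4.0]    # mJ per layer on accelerator B
trans_cost = 0.7                     # ms penalty for switching accelerators
latency_budget = 10.0                # ms

n = len(time_gpu)
on_dla = [Bool(f"layer_{i}_on_dla") for i in range(n)]

opt = Optimize()
lat = Sum([If(on_dla[i], time_dla[i], time_gpu[i]) for i in range(n)]) + \
      Sum([If(on_dla[i] != on_dla[i + 1], trans_cost, 0.0) for i in range(n - 1)])
eng = Sum([If(on_dla[i], energy_dla[i], energy_gpu[i]) for i in range(n)])

opt.add(lat <= latency_budget)   # meet the latency target
opt.minimize(eng)                # spend as little energy as possible

if opt.check() == sat:
    m = opt.model()
    print([is_true(m[b]) for b in on_dla])   # per-layer accelerator choice
```

Minimizing energy under a latency budget (or sweeping the budget) is one simple way to expose the energy-performance trade-off the abstract describes.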
PIPF-DRAM: processing in precharge-free DRAM
- Nezam Rohbani
- Mohammad Arman Soleimani
- Hamid Sarbazi-Azad
To alleviate costly data communication between processing cores and memory modules, parallel processing-in-memory (PIM) is a promising approach that exploits the huge internal memory bandwidth. The high capacity, wide rows, and maturity of DRAM technology make DRAM an alluring substrate for PIM. However, the dense layout, high process variation, and noise vulnerability of DRAMs make it very challenging to apply PIM to DRAMs in practice. This work proposes a PIM structure that eliminates these DRAM limitations by exploiting a precharge-free DRAM (PF-DRAM) structure. The proposed PIM structure, called PIPF-DRAM, performs parallel bitwise operations only by modifying control signal sequences in PF-DRAM, with almost zero structural and circuit modifications. Compared with state-of-the-art PIM techniques, PIPF-DRAM is 4.2× more robust to process variation, 4.1% faster in average operation cycle time, and consumes 66.1% less energy.
TAIM: ternary activation in-memory computing hardware with 6T SRAM array
- Nameun Kang
- Hyungjun Kim
- Hyunmyung Oh
- Jae-Joon Kim
Recently, various in-memory computing accelerators for low-precision neural networks have been proposed. While in-memory Binary Neural Network (BNN) accelerators have achieved significant energy efficiency, BNNs show severe accuracy degradation compared to their full-precision counterpart models. To mitigate the problem, we propose TAIM, an in-memory computing hardware that supports ternary activation with negligible hardware overhead. In TAIM, a 6T SRAM cell can compute the multiplication between a ternary activation and a binary weight. Since the 6T SRAM cell consumes no energy when the input activation is 0, the proposed TAIM hardware can achieve even higher energy efficiency than the BNN case by exploiting input zeros. We fabricated the proposed TAIM hardware in a 28nm CMOS process and evaluated its energy efficiency on various image classification benchmarks. The experimental results show that the proposed TAIM hardware achieves ~3.61× higher energy efficiency on average compared to previous designs that support ternary activation.
PIM-DH: ReRAM-based processing-in-memory architecture for deep hashing acceleration
- Fangxin Liu
- Wenbo Zhao
- Yongbiao Chen
- Zongwu Wang
- Zhezhi He
- Rui Yang
- Qidong Tang
- Tao Yang
- Cheng Zhuo
- Li Jiang
Deep hashing has gained growing momentum in large-scale image retrieval. However, deep hashing is computation- and memory-intensive, which demands hardware acceleration. The unique process of hash sequence computation in deep hashing is non-trivial to accelerate due to the lack of an efficient compute primitive for Hamming distance calculation and ranking.
This paper proposes the first PIM-based scheme for deep hashing acceleration, namely PIM-DH. PIM-DH is supported by an algorithm and architecture co-design. The proposed algorithm compresses the hash sequence to increase retrieval efficiency by exploiting hash code sparsity without accuracy loss. Further, we design a lightweight circuit that assists the CAM to optimize hash computation efficiency. This design leads to an elegant extension of current PIM-based architectures that adapts to various hashing algorithms and to the arbitrary hash-sequence sizes induced by pruning. Compared to the state-of-the-art software framework running on an Intel Xeon CPU and an NVIDIA RTX2080 GPU, PIM-DH achieves an average 4.75E+03 speedup with 4.64E+05 energy reduction over the CPU, and 2.30E+02 speedup with 3.38E+04 energy reduction over the GPU. Compared with the PIM architecture CASCADE, PIM-DH improves computing efficiency by 17.49× and energy efficiency by 41.38×.
YOLoC: deploy large-scale neural network by ROM-based computing-in-memory using residual branch on a chip
- Yiming Chen
- Guodong Yin
- Zhanhong Tan
- Mingyen Lee
- Zekun Yang
- Yongpan Liu
- Huazhong Yang
- Kaisheng Ma
- Xueqing Li
Computing-in-memory (CiM) is a promising technique for achieving high energy efficiency in data-intensive matrix-vector multiplication (MVM) by relieving the memory bottleneck. Unfortunately, due to the limited SRAM capacity, existing SRAM-based CiM needs to reload the weights from DRAM in large-scale networks, which significantly weakens energy efficiency. This work, for the first time, proposes the concept, design, and optimization of computing-in-ROM to achieve much higher on-chip memory capacity, and thus less DRAM access and lower energy consumption. Furthermore, to support different computing scenarios with varying weights, a weight fine-tuning technique, namely Residual Branch (ReBranch), is also proposed. ReBranch combines ROM-CiM and assisting SRAM-CiM to achieve high versatility. YOLoC, a ReBranch-assisted ROM-CiM framework for object detection, is presented and evaluated. With the same area in 28nm CMOS, YOLoC shows significant energy efficiency improvement across several datasets: 14.8× for YOLO (DarkNet-19) and 4.8× for ResNet-18, with <8% latency overhead and almost no mean average precision (mAP) loss (−0.5% ~ +0.2%), compared with fully SRAM-based CiM.
ASTERS: adaptable threshold spike-timing neuromorphic design with twin-column ReRAM synapses
- Ziru Li
- Qilin Zheng
- Bonan Yan
- Ru Huang
- Bing Li
- Yiran Chen
Complex event-driven neuron dynamics have been an obstacle to implementing efficient brain-inspired computing architectures with VLSI circuits. To solve this problem and harness the event-driven advantage, we propose ASTERS, a resistive random-access memory (ReRAM) based neuromorphic design for time-to-first-spike SNN inference. In addition to the fundamental novel axon and neuron circuits, we also propose two hardware-software co-design techniques: “Multi-Level Firing Threshold Adjustment” to mitigate the impact of ReRAM device process variations, and “Timing Threshold Adjustment” to further speed up the computation. Experimental results show that our cross-layer solution ASTERS achieves more than 34.7% energy savings compared to existing spiking neuromorphic designs, while maintaining 90.1% accuracy under process variations with a 20% standard deviation.
SATO: spiking neural network acceleration via temporal-oriented dataflow and architecture
- Fangxin Liu
- Wenbo Zhao
- Zongwu Wang
- Yongbiao Chen
- Tao Yang
- Zhezhi He
- Xiaokang Yang
- Li Jiang
Event-driven spiking neural networks (SNNs) have shown great promise for being strikingly energy-efficient. SNN neurons integrate incoming spikes, accumulate the membrane potential, and fire an output spike when the potential exceeds a threshold. Existing SNN accelerators, however, have to carry out this accumulation-comparison operation serially. Repetitive spike generation at each time step not only increases latency and the overall energy budget, but also incurs memory access overhead for fetching membrane potentials, both of which lessen the efficiency of SNN accelerators. Meanwhile, the inherently sparse spikes of SNNs lead to imbalanced workloads among neurons that hinder the utilization of processing elements (PEs).
This paper proposes SATO, a temporal-parallel SNN accelerator that accumulates the membrane potential for all time steps in parallel. The SATO architecture contains a novel binary adder-search tree to generate the output spike train, which decouples the chronological dependence in the accumulation-comparison operation. Moreover, SATO can evenly dispatch the compressed workloads to all PEs with maximized data locality of input spike trains using a bucket-sort-based method. Our evaluation shows that SATO outperforms the 8-bit version of the ANN accelerator “Eyeriss” by 30.9× in terms of speedup and 12.3× in terms of energy saving. Compared with the state-of-the-art SNN accelerator “SpinalFlow”, SATO also achieves a 6.4× performance gain and 4.8× energy reduction, which is quite impressive for inference.
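The temporal-parallel accumulate-compare idea can be pictured with a short NumPy sketch (illustrative only; it ignores leak and reset and models a single output neuron, whereas SATO implements this with a binary adder-search tree and compressed spike trains in hardware): the membrane potential at every time step is obtained with one prefix sum, and the output spike time is simply the first index that crosses the threshold.

```python
import numpy as np

def first_spike_time(weights, in_spikes, threshold):
    """Temporal-parallel accumulate-compare for one neuron (sketch).

    weights:   (N_in,) synaptic weights
    in_spikes: (T, N_in) binary input spike trains
    Returns the first time step at which the running membrane potential
    crosses the threshold, or -1 if the neuron never fires. The serial
    per-step accumulator is replaced by a prefix sum over the time axis."""
    currents = in_spikes @ weights      # (T,) input current per time step
    potential = np.cumsum(currents)     # membrane potential at every step, in parallel
    fired = potential >= threshold
    return int(np.argmax(fired)) if fired.any() else -1

rng = np.random.default_rng(5)
w = rng.normal(0, 1, 64)
spikes = (rng.random((16, 64)) < 0.1).astype(float)   # sparse input spike trains
print(first_spike_time(w, spikes, threshold=2.0))
```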
LeHDC: learning-based hyperdimensional computing classifier
- Shijin Duan
- Yejia Liu
- Shaolei Ren
- Xiaolin Xu
Thanks to its tiny storage and efficient execution, Hyperdimensional Computing (HDC) is emerging as a lightweight learning framework for resource-constrained hardware. Nonetheless, existing HDC training relies on various heuristic methods, significantly limiting inference accuracy. In this paper, we propose a new HDC framework, called LeHDC, which leverages a principled learning approach to improve model accuracy. Concretely, LeHDC maps the existing HDC framework into an equivalent Binary Neural Network architecture and employs a corresponding training strategy to minimize the training loss. Experimental validation shows that LeHDC outperforms previous HDC training strategies and improves inference accuracy by over 15% on average compared to the baseline HDC.
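The mapping from HDC inference to a binary network can be seen in a few lines (a sketch under the assumption of bipolar class hypervectors; the `hdc_as_bnn_inference` helper is illustrative and does not reproduce LeHDC's training procedure): the class hypervectors act as the weight rows of a single fully connected layer, so the usual similarity search is exactly a linear layer whose binary weights can then be trained with BNN-style machinery.

```python
import numpy as np

def hdc_as_bnn_inference(encoded, class_hvs):
    """HDC inference viewed as a binary linear layer (sketch).

    encoded:   (D,) bipolar query hypervector
    class_hvs: (C, D) bipolar class hypervectors, i.e., the layer's weights
    The similarity scores are the pre-activations of a fully connected layer,
    and the predicted class is the argmax."""
    logits = class_hvs @ encoded
    return int(np.argmax(logits))

D, C = 10000, 10
rng = np.random.default_rng(6)
class_hvs = rng.choice([-1, 1], size=(C, D))
query = rng.choice([-1, 1], size=D)
print(hdc_as_bnn_inference(query, class_hvs))
```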
GENERIC: highly efficient learning engine on edge using hyperdimensional computing
- Behnam Khaleghi
- Jaeyoung Kang
- Hanyang Xu
- Justin Morris
- Tajana Rosing
Hyperdimensional Computing (HDC) mimics the brain’s basic principles in performing cognitive tasks by encoding data into high-dimensional vectors and employing non-complex learning techniques. Conventional processing platforms such as CPUs and GPUs are incapable of taking full advantage of the highly parallel bit-level operations of HDC. On the other hand, existing HDC encoding techniques do not cover a broad enough range of applications to make a custom design plausible. In this paper, we first propose a novel encoding that achieves high accuracy for diverse applications. Thereafter, we leverage the proposed encoding and design a highly efficient and flexible ASIC accelerator, dubbed GENERIC, suited for the edge domain. GENERIC supports both classification (training and inference) and clustering for unsupervised learning on the edge. Our design is flexible in the input size (hence it can run various applications) and hypervector dimensionality, allowing it to trade off accuracy and energy/performance on demand. We augment GENERIC with application-opportunistic power gating and voltage over-scaling (thanks to the notable error resiliency of HDC) for further energy reduction. GENERIC encoding improves prediction accuracy over previous HDC and ML techniques by 3.5% and 6.5%, respectively. At the 14 nm technology node, GENERIC occupies an area of 0.30 mm2 and consumes 0.09 mW static and 1.97 mW active power. Compared to the previous inference-only accelerator, GENERIC reduces energy consumption by 4.1×.
Solving traveling salesman problems via a parallel fully connected Ising machine
- Qichao Tao
- Jie Han
Annealing-based Ising machines have shown promising results in solving combinatorial optimization problems. As a typical class of these problems, however, traveling salesman problems (TSPs) are very challenging to solve due to the constraints imposed on the solution. This article proposes a parallel annealing algorithm for a fully connected Ising machine that significantly improves accuracy and performance in solving constrained combinatorial optimization problems such as the TSP. Unlike previous parallel annealing algorithms, the improved parallel annealing (IPA) algorithm efficiently solves TSPs using an exponential temperature function with a dynamic offset. Compared with digital annealing (DA) and momentum annealing (MA), IPA reduces the run time by 44.4× and 19.9×, respectively, for a 14-city TSP. Large-scale TSPs can be solved more efficiently by taking a k-medoids clustering approach, which decreases the average travel distance of a 22-city TSP by 51.8% compared with DA and by 42.0% compared with MA. This approach groups neighboring cities into clusters to form a reduced TSP, which is then solved hierarchically using the IPA algorithm.
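For orientation, a plain sequential annealer for the TSP with an exponential temperature schedule plus a constant offset is sketched below (Python; the 2-opt neighborhood, the fixed offset, and the schedule constants are simplifications, whereas the IPA updates all spins of a fully connected Ising formulation in parallel and adapts the offset dynamically):

```python
import numpy as np

rng = np.random.default_rng(3)

def tour_length(tour, dist):
    return sum(dist[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def anneal_tsp(dist, steps=20000, t0=10.0, decay=0.9995, offset=0.05):
    """Simulated annealing over 2-opt moves with temperature t0*decay^step + offset."""
    n = len(dist)
    tour = list(rng.permutation(n))
    best, best_len = tour[:], tour_length(tour, dist)
    for step in range(steps):
        temp = t0 * decay ** step + offset          # exponential schedule with offset
        i, j = sorted(rng.integers(0, n, size=2))
        if i == j:
            continue
        cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]   # 2-opt move
        delta = tour_length(cand, dist) - tour_length(tour, dist)
        if delta < 0 or rng.random() < np.exp(-delta / temp):  # Metropolis acceptance
            tour = cand
            if (cur := tour_length(tour, dist)) < best_len:
                best, best_len = tour[:], cur
    return best, best_len

n = 14
pts = rng.random((n, 2))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(anneal_tsp(dist)[1])
```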
PATH: evaluation of boolean logic using path-based in-memory computing
- Sven Thijssen
- Sumit Kumar Jha
- Rickard Ewetz
Processing in-memory breaks von Neumann-based constructs to accelerate data-intensive applications. Noteworthy efforts have been devoted to executing Boolean logic using digital in-memory computing. The limitation of state-of-the-art paradigms is that they rely heavily on repeatedly switching the state of non-volatile resistive devices using expensive WRITE operations. In this paper, we propose a new in-memory computing paradigm called path-based computing for evaluating Boolean logic. Computation within the paradigm is performed using a one-time, expensive compile phase and a fast and efficient evaluation phase. The key property of the paradigm is that the execution phase involves only cheap READ operations. Moreover, a synthesis tool called PATH is proposed to automatically map computation to a single crossbar design. The PATH tool also supports the synthesis of path-based computing systems in which the total number of crossbars and the number of inter-crossbar connections are minimized. We evaluate the proposed paradigm using 10 circuits from the RevLib benchmark suite. Compared with state-of-the-art digital in-memory computing paradigms, path-based computing improves energy and latency by up to 4.7× and 8.5×, respectively.
A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining
- Hongwu Peng
- Shaoyi Huang
- Shiyang Chen
- Bingbing Li
- Tong Geng
- Ang Li
- Weiwen Jiang
- Wujie Wen
- Jinbo Bi
- Hang Liu
- Caiwen Ding
Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite these remarkable triumphs, the prolonged turnaround time of Transformer models is a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead because inputs need to be zero-padded to the maximum sentence length in the batch to accommodate parallel computing platforms. This paper targets the field-programmable gate array (FPGA) and proposes a coherent sequence-length-adaptive algorithm-hardware co-design for Transformer acceleration. In particular, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator brings the complexity of attention-based models down to linear complexity and alleviates off-chip memory traffic. The proposed length-aware hardware resource scheduling algorithm dynamically allocates hardware resources to fill pipeline slots and eliminates bubbles for NLP tasks. Experiments show that our design has very small accuracy loss and achieves 80.2× and 2.6× speedup compared to CPU and GPU implementations, and 4× higher energy efficiency than a state-of-the-art GPU accelerator optimized via CUBLAS GEMM.
HDPG: hyperdimensional policy-based reinforcement learning for continuous control
- Yang Ni
- Mariam Issa
- Danny Abraham
- Mahdi Imani
- Xunzhao Yin
- Mohsen Imani
Traditional robot control or more general continuous control tasks often rely on carefully hand-crafted classic control methods. These models often lack the self-learning adaptability and intelligence to achieve human-level control. On the other hand, recent advancements in Reinforcement Learning (RL) present algorithms that have the capability of human-like learning. The integration of Deep Neural Networks (DNN) and RL thereby enables autonomous learning in robot control tasks. However, DNN-based RL brings both high-quality learning and high computation cost, which is no longer ideal for currently fast-growing edge computing scenarios.
In this paper, we introduce HDPG, a highly efficient policy-based RL algorithm using Hyperdimensional Computing. Hyperdimensional computing is a lightweight brain-inspired learning methodology; its holistic representation of information leads to a well-defined set of hardware-friendly high-dimensional operations. Our HDPG fully exploits efficient HDC for high-quality state-value approximation and policy gradient updates. In our experiments, we use HDPG for robotics tasks with continuous action spaces and achieve significantly higher rewards than DNN-based RL. Our evaluation also shows that HDPG is 4.7× faster and 5.3× more energy-efficient than DNN-based RL running on an embedded FPGA.
CarM: hierarchical episodic memory for continual learning
- Soobee Lee
- Minindu Weerakoon
- Jonghyun Choi
- Minjia Zhang
- Di Wang
- Myeongjae Jeon
Continual Learning (CL) is an emerging machine learning paradigm for mobile and IoT devices that learn from a continuous stream of tasks. To avoid forgetting knowledge of previous tasks, episodic memory (EM) methods exploit a subset of past samples while learning from new data. Despite promising results, prior studies are mostly simulation-based and unfortunately do not meet the insatiable demand for both EM capacity and system efficiency in practical system setups. We propose CarM, the first CL framework that meets this demand through a novel hierarchical EM management strategy. CarM keeps EM in high-speed RAM for system efficiency and exploits abundant storage to preserve past experiences, alleviating forgetting by allowing CL to efficiently migrate samples between memory and storage. Extensive evaluations show that our method significantly outperforms popular CL methods while providing high training efficiency.
Shfl-BW: accelerating deep neural network inference with tensor-core aware weight pruning
- Guyue Huang
- Haoran Li
- Minghai Qin
- Fei Sun
- Yufei Ding
- Yuan Xie
Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but struggles to bring practical speedup to model inference time. Tensor cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor cores for sparse DNNs is very challenging. Compared to existing CUDA cores, tensor cores require higher data reuse and matrix-shaped instruction granularity, both of which are difficult to obtain from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves the model quality well but prohibits tensor-core acceleration, while highly structured block-wise sparsity can exploit tensor cores but suffers from severe accuracy loss.
In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to efficiently utilize tensor cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure while introducing negligible overhead with our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques achieve state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer [1] by 1.81×, 4.18×, and 1.90× on NVIDIA V100, T4, and A100 GPUs, respectively, at 75% sparsity.
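A software-level sketch of the sparsity pattern (NumPy; the random column permutation, 16×16 blocks, and magnitude-based block scoring are illustrative assumptions, whereas the paper searches for good permutations and provides tuned GPU kernels) looks like this:

```python
import numpy as np

def shuffled_blockwise_prune(W, block=(16, 16), keep_ratio=0.25, seed=0):
    """Shuffled block-wise sparsity (sketch): permute columns, then keep only
    the highest-magnitude blocks so the surviving work is tensor-core shaped."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(W.shape[1])                # column shuffle
    Wp = W[:, perm]
    bh, bw = block
    H, Wd = Wp.shape[0] // bh, Wp.shape[1] // bw
    blocks = Wp[:H * bh, :Wd * bw].reshape(H, bh, Wd, bw)
    scores = np.abs(blocks).sum(axis=(1, 3))          # one score per block
    k = max(1, int(keep_ratio * scores.size))
    thresh = np.partition(scores.ravel(), -k)[-k]
    mask = (scores >= thresh)[:, None, :, None]       # broadcast block mask
    pruned = (blocks * mask).reshape(H * bh, Wd * bw)
    return pruned, perm

W = np.random.default_rng(7).normal(size=(64, 64))
pruned, perm = shuffled_blockwise_prune(W)
print((pruned != 0).mean())   # roughly 25% of the blocks survive
```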
QuiltNet: efficient deep learning inference on multi-chip accelerators using model partitioning
- Jongho Park
- HyukJun Kwon
- Seowoo Kim
- Junyoung Lee
- Minho Ha
- Euicheol Lim
- Mohsen Imani
- Yeseong Kim
We have seen many successful deployments of deep learning accelerator designs on different platforms and technologies, e.g., FPGA, ASIC, and Processing In-Memory platforms. However, the size of deep learning models keeps increasing, making computation a burden on the accelerators. A naive approach to this issue is to design larger accelerators; however, this is not scalable due to high resource requirements, e.g., power consumption and off-chip memory sizes. A promising solution is to utilize multiple accelerators and use them as needed, similar to conventional multiprocessing. For example, for smaller networks we may use a single accelerator, while for larger networks we may use multiple accelerators with proper network partitioning. However, partitioning DNN models into multiple parts leads to large communication overheads due to inter-layer communication. In this paper, we propose a scalable solution that accelerates DNN models on multiple devices by devising a new model partitioning technique. Our technique transforms a DNN model into layer-wise partitioned models using an autoencoder. Since the autoencoder encodes a tensor output into a smaller dimension, we can split the neural network model into multiple pieces while significantly reducing the communication overhead needed to pipeline them. Our evaluation on state-of-the-art deep learning models shows that the proposed technique significantly improves performance and energy efficiency, by up to 30.5% and 28.4%, respectively, with minimal accuracy loss, compared to running the same model on pipelined multi-block accelerators without the autoencoder.
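The partition-boundary idea can be sketched with a tiny PyTorch module (illustrative only; the 1×1-convolution encoder/decoder, the 256-to-32 channel reduction, and the `Bottleneck` name are assumptions, not the paper's trained autoencoder): the encoder output is what crosses the chip-to-chip link, so the transmitted tensor is much smaller than the original activation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Tiny autoencoder placed at a partition boundary (sketch).

    The encoder runs on the first accelerator and shrinks the activation
    before it crosses the inter-chip link; the decoder on the second
    accelerator restores the original width. Sizes are illustrative."""
    def __init__(self, channels=256, compressed=32):
        super().__init__()
        self.encoder = nn.Conv2d(channels, compressed, kernel_size=1)
        self.decoder = nn.Conv2d(compressed, channels, kernel_size=1)

    def forward(self, x):
        z = self.encoder(x)        # transmitted between accelerators (8x fewer channels)
        return self.decoder(z)     # reconstructed on the receiving accelerator

x = torch.randn(1, 256, 14, 14)
print(Bottleneck()(x).shape)       # torch.Size([1, 256, 14, 14])
```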
Glimpse: mathematical embedding of hardware specification for neural compilation
- Byung Hoon Ahn
- Sean Kinzer
- Hadi Esmaeilzadeh
The success of Deep Neural Networks (DNNs) and their computational intensity has heralded a Cambrian explosion of DNN hardware. While hardware design has advanced significantly, optimizing code for these accelerators is still an open challenge. Recent research has moved past traditional compilation techniques toward stochastic search algorithms that blindly generate samples of binaries for real hardware measurements to guide the search. This paper opens a new dimension by incorporating a mathematical embedding of the hardware specification of GPU accelerators, dubbed Blueprint, to better guide the search algorithm and focus on sub-spaces with higher potential for yielding higher-performance binaries. While various sample-efficient yet blind hardware-agnostic techniques have been proposed, none of the state-of-the-art compilers considers the hardware specification as a hint to improve sample efficiency and the search. To mathematically embed the hardware specifications into the search, we devise a Bayesian optimization framework called Glimpse with several unique components. We first use the Blueprint as an input to generate prior distributions over different dimensions of the search space. Then, we devise a lightweight neural acquisition function that takes the Blueprint into account to conform to the hardware specification while balancing the exploration-exploitation trade-off. Finally, we generate an ensemble of predictors from the Blueprint that collectively vote to reject invalid binary samples. We compare Glimpse with hardware-agnostic compilers. Comparison to AutoTVM [3], Chameleon [2], and DGP [16] across multiple generations of GPUs shows that Glimpse provides 6.73×, 1.51×, and 1.92× faster compilation time, respectively, while also achieving the best inference latency.
Bringing source-level debugging frameworks to hardware generators
- Keyi Zhang
- Zain Asgar
- Mark Horowitz
High-level hardware generators have significantly increased the productivity of design engineers. They use software engineering constructs to reduce the repetition required to express complex designs and enable more composability. However, these benefits are undermined by a lack of debugging infrastructure, requiring hardware designers to debug generated, usually incomprehensible, RTL code. This paper describes a framework that connects modern software source-level debugging frameworks to RTL created from hardware generators. Our working prototype offers an Integrated Development Environment (IDE) experience for generators such as RocketChip (Chisel), allowing designers to set breakpoints in complex source code, relate RTL simulation state back to source-level variables, and do forward and backward debugging, with almost no simulation overhead (less than 5%).
Verifying SystemC TLM peripherals using modern C++ symbolic execution tools
- Pascal Pieper
- Vladimir Herdt
- Daniel Große
- Rolf Drechsler
In this paper we propose an effective approach for verification of real-world SystemC TLM peripherals using modern C++ symbolic execution tools. We designed a lightweight SystemC peripheral kernel that enables an efficient integration with the modern symbolic execution engine KLEE and acts as a drop-in replacement for the normal SystemC kernel on pre-processed TLM peripherals. The pre-processing step essentially replaces context switches in SystemC threads with normal function calls which can be handled by KLEE. Our experiments, using a publicly available RISC-V specific interrupt controller, demonstrate the scalability and bug hunting effectiveness of our approach.
Formal verification of modular multipliers using symbolic computer algebra and boolean satisfiability
- Alireza Mahzoon
- Daniel Große
- Christoph Scholl
- Alexander Konrad
- Rolf Drechsler
Modular multipliers are the essential components in cryptography and Residue Number System (RNS) designs. Especially, 2^n – 1 and 2^n + 1 modular multipliers have gained more attention due to their regular structures and a wide variety of applications. However, there is no automated formal verification method to prove the correctness of these multipliers. As a result, bugs might remain undetected after the design phase.
In this paper, we present our modular verifier that combines Symbolic Computer Algebra (SCA) and Boolean Satisfiability (SAT) to prove the correctness of 2^n – 1 and 2^n + 1 modular multipliers. Our verifier takes advantage of three techniques, i.e., coefficient correction, SAT-based local vanishing removal, and SAT-based output condition check, to overcome the challenges of SCA-based verification. The efficiency of our verifier is demonstrated using an extensive set of modular multipliers with up to several million gates.
Silicon validation of LUT-based logic-locked IP cores
- Gaurav Kolhe
- Tyler Sheaves
- Kevin Immanuel Gubbi
- Tejas Kadale
- Setareh Rafatirad
- Sai Manoj PD
- Avesta Sasan
- Hamid Mahmoodi
- Houman Homayoun
Modern semiconductor manufacturing often leverages a fabless model in which design and fabrication are partitioned. This has led to a large body of work attempting to secure designs sent to an untrusted third party through obfuscation methods. On the other hand, efficient de-obfuscation attacks have been proposed, such as Boolean Satisfiability attacks (SAT attacks). However, there is a lack of frameworks to validate the security and functionality of obfuscated designs. Additionally, unconventional obfuscated design flows, which vary from one obfuscation to another, have been key impeding factors in realizing logic locking as a mainstream approach for securing designs. In this work, we address these two issues for Lookup Table-based obfuscation. We study both Volatile and Non-volatile versions of LUT-based obfuscation and develop a framework to validate SAT runtime using machine learning. We can achieve unparalleled SAT resiliency using LUT-based obfuscation while incurring 7% area and less than 1% power overheads. Following this, we discuss and implement a validation flow for obfuscated designs. We then fabricate a chip consisting of several benchmark designs and a RISC-V CPU in TSMC 65nm for post-fabrication functionality validation. We show that the design flow and SAT-runtime validation can easily integrate LUT-based obfuscation into existing CAD tools while adding minimal verification overhead. Finally, we justify SAT-resilient LUT-based obfuscation as a promising candidate for securing designs.
Efficient bayesian yield analysis and optimization with active learning
- Shuo Yin
- Xiang Jin
- Linxu Shi
- Kang Wang
- Wei W. Xing
Yield optimization for circuit design is computationally intensive due to the expensive yield estimation based on Monte Carlo methods and the difficult optimization process. In this work, a unified framework to solve these problems simultaneously is proposed. Firstly, a novel efficient Bayesian yield analysis framework, BYA, is proposed by deriving a Bayesian estimation for the yield and introducing active learning based on reductions of integral entropy. A tractable convolutional entropy infill technique is then proposed to efficiently solve the entropy reduction problem. Lastly, we extend BYA for yield optimization by transferring knowledge across the design space and variational space. Experimental results based on SRAM and adder circuits show that BYA is 410x faster (in terms of the number of simulations) than standard MC and on average 10x (up to 10000x) more accurate than the state-of-the-art method for yield estimation, and is about 5x faster than the SOTA yield optimization methods.
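For context, a brute-force Monte Carlo yield estimate looks like the toy sketch below (the delay model and spec are hypothetical); BYA's contribution is replacing this sampling loop with a Bayesian estimate and entropy-driven active learning so far fewer simulations are needed.

```python
# Toy sketch of Monte Carlo yield estimation: yield is the probability that a
# circuit metric stays within spec under process variation.
import numpy as np

rng = np.random.default_rng(1)

def simulate_delay(sample):
    """Stand-in for an expensive SPICE run (hypothetical performance model)."""
    return 1.0 + 0.1 * sample[0] - 0.05 * sample[1] + 0.02 * sample[0] * sample[1]

def mc_yield(n_samples, spec=1.25):
    samples = rng.standard_normal((n_samples, 2))   # process-variation samples
    delays = np.array([simulate_delay(s) for s in samples])
    return np.mean(delays <= spec)                  # fraction of passing samples

print("MC yield estimate:", mc_yield(100000))
```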
Accelerated synthesis of neural network-based barrier certificates using collaborative learning
- Jun Xia
- Ming Hu
- Xin Chen
- Mingsong Chen
Most existing Neural Network (NN)-based barrier certificate synthesis methods cannot deal with high-dimensional continuous systems, since a large quantity of sampled data may easily result in inaccurate initial models coupled with slow convergence rates. To accelerate the synthesis of NN-based barrier certificates, this paper presents an effective two-stage approach named CL-BC, which fully exploits the parallel processing capability of the underlying hardware to enable a quick search for a barrier certificate. Unlike existing NN-based methods that adopt a random initial model for barrier certificate synthesis, in the first stage CL-BC pre-trains an initial model based on a small subset of the sampled data. In this way, an approximate barrier certificate in NN form can be quickly achieved with little overhead. Based on our proposed collaborative learning scheme, in the second stage CL-BC conducts parallel learning on partitioned domains, where the learned knowledge from different partitions can be aggregated to accelerate the convergence of a global NN model for barrier certificate synthesis. In this way, the overall synthesis time of an NN-based barrier certificate can be drastically reduced. Experimental results show that our approach can not only drastically reduce barrier synthesis time but also synthesize barrier certificates for complex systems that cannot be handled by state-of-the-art methods.
A timing engine inspired graph neural network model for pre-routing slack prediction
- Zizheng Guo
- Mingjie Liu
- Jiaqi Gu
- Shuhan Zhang
- David Z. Pan
- Yibo Lin
Fast and accurate pre-routing timing prediction is essential for timing-driven placement since repetitive routing and static timing analysis (STA) iterations are expensive and unacceptable. Prior work on timing prediction aims at estimating net delay and slew, lacking the ability to model global timing metrics. In this work, we present a timing engine inspired graph neural network (GNN) to predict arrival time and slack at timing endpoints. We further leverage edge delays as local auxiliary tasks to facilitate model training with increased model performance. Experimental results on real-world open-source designs demonstrate improved model accuracy and explainability when compared with vanilla deep GNN models.
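A minimal message-passing sketch of the general idea, written in plain PyTorch with an assumed architecture (this is not the paper's timing-engine-inspired model), is shown below: node embeddings are propagated along timing edges and a head regresses per-pin slack.

```python
# Assumed-architecture sketch: propagate pin embeddings along timing edges
# (signal direction) and regress a slack value per pin.
import torch
import torch.nn as nn

class TimingMP(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.upd = nn.GRUCell(dim, dim)
        self.slack_head = nn.Linear(dim, 1)

    def forward(self, h, edge_index):
        src, dst = edge_index                              # edges follow signal direction
        m = torch.relu(self.msg(torch.cat([h[src], h[dst]], dim=-1)))
        agg = torch.zeros_like(h).index_add_(0, dst, m)    # sum messages per sink pin
        h = self.upd(agg, h)
        return self.slack_head(h).squeeze(-1)              # predicted slack per pin

h = torch.randn(5, 16)                                     # 5 pins with 16-d features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])    # a small timing chain
model = TimingMP()
print(model(h, edge_index).shape)                          # torch.Size([5])
```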
Accurate timing prediction at placement stage with look-ahead RC network
- Xu He
- Zhiyong Fu
- Yao Wang
- Chang Liu
- Yang Guo
Timing closure is a critical but effort-intensive task in VLSI designs. At the placement stage, a fast and accurate net delay estimator is highly desirable to guide timing optimization prior to routing, and thus reduce timing pessimism and shorten the design turn-around time. To handle the timing uncertainty at the placement stage, we propose a fast machine learning-based net delay predictor, which extracts full timing features using a look-ahead RC network. Experimental results show that the proposed timing predictor achieves an average correlation of over 0.99 with the post-routing sign-off timing results obtained from Synopsys PrimeTime.
Timing macro modeling with graph neural networks
- Kevin Kai-Chun Chang
- Chun-Yao Chiang
- Pei-Yu Lee
- Iris Hui-Ru Jiang
Due to rapidly growing design complexity, timing macro modeling has been widely adopted to enable hierarchical and parallel timing analysis. The main challenge of timing macro modeling is to identify timing variant pins for achieving high timing accuracy while keeping a compact model size. To tackle this challenge, prior work applied ad-hoc techniques and threshold setting. In this work, we present a novel timing macro modeling approach based on graph neural networks (GNNs). A timing sensitivity metric is proposed to precisely evaluate the influence of each pin on the timing accuracy. Based on the timing sensitivity data and the circuit topology, the GNN model can effectively learn and capture timing variant pins. Experimental results show that our GNN-based framework reduces model sizes by 10% while preserving the same timing accuracy as the state-of-the-art. Furthermore, taking common path pessimism removal (CPPR) as an example, the generality and applicability of our framework on various timing analysis models and modes are also validated empirically.
Worst-case dynamic power distribution network noise prediction using convolutional neural network
- Xiao Dong
- Yufei Chen
- Xunzhao Yin
- Cheng Zhuo
Worst-case dynamic PDN noise analysis is an essential step in PDN sign-off to ensure the performance and reliability of chips. However, with the growing PDN size and increasing scenarios to be validated, it becomes very time- and resource-consuming to conduct full-stack PDN simulation to check the worst-case noise for different test vectors. Recently, various works have proposed machine learning based methods for supply noise prediction, many of which still suffer from large training overhead, inefficiency, or non-scalability. Thus, this paper proposes an efficient and scalable framework for worst-case dynamic PDN noise prediction. The framework first reduces the spatial and temporal redundancy in the PDN and input current vector, and then employs efficient feature extraction as well as a novel convolutional neural network architecture to predict the worst-case dynamic PDN noise. Experimental results show that the proposed framework consistently outperforms the commercial tool and the state-of-the-art machine learning method with only 0.63–1.02% mean relative error and 25–69× speedup.
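A toy PyTorch sketch of the general idea, with assumed input shapes and a much smaller network than the paper's, maps a normalized per-tile current map to a predicted per-tile worst-case noise map.

```python
# Hedged sketch (not the paper's architecture): a small CNN mapping a spatial
# current-density map to a worst-case PDN-noise map of the same resolution.
import torch
import torch.nn as nn

class NoiseCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),             # one predicted noise value per PDN tile
        )

    def forward(self, current_map):
        return self.net(current_map)

current_map = torch.rand(1, 1, 64, 64)       # assumed: normalized current per tile
print(NoiseCNN()(current_map).shape)         # torch.Size([1, 1, 64, 64])
```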
GATSPI: GPU accelerated gate-level simulation for power improvement
- Yanqing Zhang
- Haoxing Ren
- Akshay Sridharan
- Brucek Khailany
In this paper, we present GATSPI, a novel GPU accelerated logic gate simulator that enables ultra-fast power estimation for industry-sized ASIC designs with millions of gates. GATSPI is written in PyTorch with custom CUDA kernels for ease of coding and maintainability. It achieves simulation kernel speedup of up to 1668X on a single-GPU system and up to 7412X on a multiple-GPU system when compared to a commercial gate-level simulator running on a single CPU core. GATSPI supports a range of simple to complex cell types from an industry standard cell library and SDF conditional delay statements without requiring prior calibration runs and produces industry-standard SAIF files from delay-aware gate-level simulation. Finally, we deploy GATSPI in a glitch-optimization flow, achieving a 1.4% power saving with a 449X speedup in turnaround time compared to a similar flow using a commercial simulator.
PPATuner: pareto-driven tool parameter auto-tuning in physical design via gaussian process transfer learning
- Hao Geng
- Qi Xu
- Tsung-Yi Ho
- Bei Yu
With continued semiconductor scaling, growing design complexity makes the synthesis-centric very large-scale integration (VLSI) design flow increasingly reliant on electronic design automation (EDA) tools. However, invoking EDA tools, especially the physical synthesis tool, may require several hours or even days for a single parameter combination. Even worse, for a new design, reaching high quality-of-results (QoR) after physical synthesis requires many tool runs with numerous combinations of tunable tool parameters. Additionally, designers often need to consider multiple QoR metrics of interest (e.g., delay, power, and area) simultaneously. To tackle this dilemma within a finite resource budget, a multi-objective parameter auto-tuning framework for the physical design tool that can learn from historical tool configurations and transfer the associated knowledge to new tasks is in demand. In this paper, we propose PPATuner, a Pareto-driven physical design tool parameter tuning methodology, to achieve a good trade-off among multiple QoR metrics of interest (e.g., power, area, delay) at the physical design stage. By incorporating the transfer Gaussian process (GP) model, it can autonomously learn the transfer knowledge from the existing tool parameter combinations. The experimental results on industrial benchmarks under the 7nm technology node demonstrate the merits of our framework.
Efficient maximum data age analysis for cause-effect chains in automotive systems
- Ran Bi
- Xinbin Liu
- Jiankang Ren
- Pengfei Wang
- Huawei Lv
- Guozhen Tan
Automotive systems are often subjected to stringent requirements on the maximum data age of certain cause-effect chains. In this paper, we present an efficient method for formally analyzing maximum data age of cause-effect chains. In particular, we decouple the problem of bounding the maximum data age of a chain into a problem of bounding the releasing interval of successive Last-to-Last data propagation instances in the chain. Owing to the problem decoupling, a relatively tighter data age upper bound can be effectively obtained in polynomial time. Experiments demonstrate that our approach can achieve high precision analysis with lower computational cost.
Optimizing parallel PREM compilation over nested loop structures
- Zhao Gu
- Rodolfo Pellizzoni
We consider automatic parallelization of a computational kernel executed according to the PRedictable Execution Model (PREM), where each thread is divided into execution and memory phases. We target a scratchpad-based architecture, where memory phases are executed by a dedicated DMA component. We employ data analysis and loop tiling to split the kernel execution into segments, and schedule them based on a DAG representation of data and execution dependencies. Our main observation is that properly selecting tile sizes is key to optimize the makespan of the kernel. We thus propose a heuristic that efficiently searches for optimized tile size and core assignments over deeply nested loops, and demonstrate its applicability and performance compared to the state-of-the-art in PREM compilation using the PolyBench-NN benchmark suite.
Scheduling and analysis of real-time tasks with parallel critical sections
- Yang Wang
- Xu Jiang
- Nan Guan
- Mingsong Lv
- Dong Ji
- Wang Yi
Locks are the most widely used mechanisms to coordinate simultaneous accesses to exclusive shared resources. While locking protocols and associated schedulability analysis techniques have been extensively studied for sequential real-time tasks, work for parallel tasks largely lags behind. In the limited existing work on this topic, a common assumption is that a critical section must execute sequentially. However, this is not necessarily the case with parallel programming languages. In this paper, we study the analysis of parallel heavy real-time tasks (those with density greater than 1) with critical sections in parallel structures. We show that applying existing analysis techniques directly can be unsafe or overly pessimistic for the considered model, and we develop new techniques to address these problems. Comprehensive experiments are conducted to evaluate the performance of our method.
BlueScale: a scalable memory architecture for predictable real-time computing on highly integrated SoCs
- Zhe Jiang
- Kecheng Yang
- Neil Audsley
- Nathan Fisher
- Weisong Shi
- Zheng Dong
In real-time embedded computing, memory transactions must be both time-predictable and high-performance. However, with increasingly more elements being integrated into hardware, memory interconnects become a critical stumbling block to satisfying timing correctness, due to a lack of hardware and scheduling scalability. In this paper, we propose a new hierarchically distributed memory interconnect, BlueScale, which manages memory transactions using identical Scale Elements to ensure hardware scalability. The Scale Element introduces two nested priority queues, achieving iterative compositional scheduling for memory transactions and guaranteeing the schedulability of transaction tasks. Together with the new architecture, a theoretical model is established to improve BlueScale’s real-time performance.
Precise and scalable shared cache contention analysis for WCET estimation
- Wei Zhang
- Mingsong Lv
- Wanli Chang
- Lei Ju
Worst-Case Execution Time (WCET) analysis for real-time tasks must precisely predict cache hit/miss of memory accesses. While bringing great performance benefits, multi-core processors significantly complicate the cache analysis problem due to the shared cache contentions among different cores. Existing methods pessimistically consider that memory references of parallel executing tasks will contend with each other as long as they are mapped to the same cache line. However, in reality, numerous shared cache contentions are mutually exclusive, due to the partial orders among the programs executed in parallel. The presence of shared cache contentions greatly exacerbates the computational complexity of the WCET computation, as finding the longest path requires exploring an exponentially large partial-ordering space. In this paper, we propose a quantitative method with O(n^2) time complexity to precisely estimate the worst-case extra execution time (WCEET) caused by shared cache contentions. The proposed method can be easily integrated into the abstract-interpretation based WCET estimation framework. Experiments with MRTC benchmarks show that our method tightens the WCET estimation by 13% on average without sacrificing analysis efficiency.
Predictable sharing of last-level cache partitions for multi-core safety-critical systems
- Zhuanhao Wu
- Hiren Patel
Last-level cache (LLC) partitioning is a technique to provide temporal isolation and low worst-case latency (WCL) bounds when cores access the shared LLC in multicore safety-critical systems. A typical approach to cache partitioning involves allocating a separate partition to a distinct core. A central criticism of this approach is its poor utilization of cache storage. Today’s trend of integrating a larger number of cores exacerbates this issue such that we are forced to consider shared LLC partitions for effective deployments. This work presents an approach to share LLC partitions among multiple cores while being able to provide low WCL bounds.
Thermal-aware optical-electrical routing codesign for on-chip signal communications
- Yu-Sheng Lu
- Kuan-Cheng Chen
- Yu-Ling Hsu
- Yao-Wen Chang
The optical interconnection is a promising solution for on-chip signal communication in modern system-on-chip (SoC) and heterogeneous integration designs, providing large bandwidth and high-speed transmission with low power consumption. Previous works do not handle two main issues for on-chip optical-electrical (O-E) co-design: the thermal impact during O-E routing and the trade-offs among power consumption, wirelength, and congestion. As a result, thermal-induced band shift might cause transmission malfunction, the power consumption estimation is inaccurate, and only suboptimal results are obtained. To remedy these disadvantages, we present a thermal-aware optical-electrical routing co-design flow to minimize power consumption, thermal impact, and wirelength. Experimental results based on the ISPD 2019 contest benchmarks show that our co-design flow significantly outperforms state-of-the-art works in power consumption, thermal impact, and wirelength.
Power-aware pruning for ultrafast, energy-efficient, and accurate optical neural network design
- Naoki Hattori
- Yutaka Masuda
- Tohru Ishihara
- Akihiko Shinya
- Masaya Notomi
With the rapid progress of integrated nanophotonics technology, the optical neural network (ONN) architecture has been widely investigated. Although ONN inference is fast, conventional densely connected network structures consume large amounts of power in laser sources. We propose a novel ONN design method that finds an ultrafast, energy-efficient, and accurate ONN structure. The key idea is power-aware edge pruning that derives near-optimal numbers of edges in the entire network. Optoelectronic circuit simulation demonstrates the correct functional behavior of the ONN. Furthermore, experimental evaluations using TensorFlow show that the proposed method achieves a 98.28% power reduction without significant loss of accuracy.
REACT: a heterogeneous reconfigurable neural network accelerator with software-configurable NoCs for training and inference on wearables
- Mohit Upadhyay
- Rohan Juneja
- Bo Wang
- Jun Zhou
- Weng-Fai Wong
- Li-Shiuan Peh
On-chip training improves model accuracy on personalised user data and preserves privacy. This work proposes REACT, an AI accelerator for wearables that has heterogeneous cores supporting both training and inference. REACT’s architecture is NoC-centric, with weights, features and gradients distributed across cores, accessed and computed efficiently through software-configurable NoCs. Unlike conventional dynamic NoCs, REACT’s NoCs have no buffer queues, flow control or routing, as they are entirely configured by software for each neural network. REACT’s online learning realizes up to 75% accuracy improvement, and is up to 25× faster and 520× more energy-efficient than state-of-the-art accelerators with a similar memory and computation footprint.
LHNN: lattice hypergraph neural network for VLSI congestion prediction
- Bowen Wang
- Guibao Shen
- Dong Li
- Jianye Hao
- Wulong Liu
- Yu Huang
- Hongzhong Wu
- Yibo Lin
- Guangyong Chen
- Pheng Ann Heng
Precise congestion prediction from a placement solution plays a crucial role in circuit placement. This work proposes the lattice hypergraph (LH-graph), a novel graph formulation for circuits, which preserves netlist data throughout the learning process and enables congestion information to be propagated geometrically and topologically. Based on this formulation, we further develop a heterogeneous graph neural network architecture, LHNN, which jointly performs routing demand regression to support congestion spot classification. LHNN consistently achieves more than 35% improvement over U-nets and Pix2Pix on the F1 score. We expect our work to highlight essential procedures for using machine learning in congestion prediction.
Floorplanning with graph attention
- Yiting Liu
- Ziyi Ju
- Zhengming Li
- Mingzhi Dong
- Hai Zhou
- Jia Wang
- Fan Yang
- Xuan Zeng
- Li Shang
Floorplanning has long been a critical physical design task with high computation complexity. Its key objective is to determine the initial locations of macros and standard cells with optimized wirelength for a given area constraint. This paper presents Flora, a graph attention-based floorplanner to learn an optimized mapping between circuit connectivity and physical wirelength, and produce a chip floorplan using efficient model inference. Flora has been integrated with two state-of-the-art mixed-size placers. Experimental studies using both academic benchmarks and industrial designs demonstrate that compared to state-of-the-art mixed-size placers alone, Flora improves placement runtime by 18%, with 2% wirelength reduction on average.
Xplace: an extremely fast and extensible global placement framework
- Lixin Liu
- Bangqi Fu
- Martin D. F. Wong
- Evangeline F. Y. Young
Placement serves as a fundamental step in VLSI physical design. Recently, GPU-based global placer DREAMPlace[1] demonstrated its superiority over CPU-based global placers. In this work, we develop an extremely fast GPU accelerated global placer Xplace which achieves around 2x speedup with better solution quality compared to DREAMPlace. We also plug a novel Fourier neural network into Xplace as an extension to further improve the solution quality. We believe this work not only proposes a new, fast, extensible placement framework but also illustrates a possibility to incorporate a neural network component into a GPU accelerated analytical placer.
Differentiable-timing-driven global placement
- Zizheng Guo
- Yibo Lin
Placement is critical to the timing closure of the very-large-scale integrated (VLSI) circuit design flow. This paper proposes a differentiable-timing-driven global placement framework inspired by deep neural networks. By establishing the analogy between static timing analysis and neural network propagation, we propose a differentiable timing objective for placement to explicitly optimize timing metrics such as total negative slack (TNS) and worst negative slack (WNS). The framework can achieve at most 32.7% and 59.1% improvements on WNS and TNS respectively compared with the state-of-the-art timing-driven placer, and achieve 1.80× speed-up when both running on GPU.
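The analogy can be sketched as follows: arrival times are a max over fan-in arrivals plus edge delays, and replacing the max with a smooth relaxation makes TNS/WNS differentiable with respect to the delays (and hence, through a delay model, to cell locations). The toy example below uses an assumed three-node timing graph and a log-sum-exp smooth max; it is not the paper's framework.

```python
# Sketch of the STA-as-propagation analogy with assumed delays: arrival times
# are computed with a smooth max so the timing objective becomes differentiable.
import torch

def smooth_max(x, tau=0.05):
    return tau * torch.logsumexp(x / tau, dim=0)    # differentiable relaxation of max

# A tiny levelized timing graph: node -> list of (fanin_node, delay_tensor)
delays = {1: [(0, torch.tensor(1.0, requires_grad=True))],
          2: [(0, torch.tensor(2.0, requires_grad=True))],
          3: [(1, torch.tensor(0.5, requires_grad=True)),
              (2, torch.tensor(0.4, requires_grad=True))]}

arrival = {0: torch.tensor(0.0)}
for node in (1, 2, 3):                               # topological order
    cands = torch.stack([arrival[u] + d for u, d in delays[node]])
    arrival[node] = smooth_max(cands)

required = 2.2
slack = required - arrival[3]
tns = torch.clamp(-slack, min=0.0)                   # single-endpoint negative slack
tns.backward()                                       # gradients flow back to the delays
print(float(arrival[3]), [float(d.grad) for _, d in delays[3]])
```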
TAAS: a timing-aware analytical strategy for AQFP-capable placement automation
- Peiyan Dong
- Yanyue Xie
- Hongjia Li
- Mengshu Sun
- Olivia Chen
- Nobuyuki Yoshikawa
- Yanzhi Wang
Adiabatic Quantum-Flux-Parametron (AQFP) is a superconducting logic with extremely high energy efficiency. AQFP circuits adopt a deep pipeline structure, where the four-phase AC power serves as both the energy supply and the clock signal and transfers data from one clock phase to the next. However, the deep pipeline structure causes the stage delay of the data propagation to be comparable to the delay of the zigzag clocking, which easily triggers timing violations. In this paper, we propose a timing-aware analytical strategy for AQFP placement, TAAS, which substantially reduces timing violations under the specific spacing and wirelength constraints of AQFP. TAAS has two main characteristics: 1) a timing-aware objective function that incorporates a four-phase timing model for the analytical global placement; 2) a unique detailed placement that includes a timing-aware dynamic programming technique and time-space cell regularization. To validate the effectiveness of TAAS, various representative circuits are adopted as benchmarks. As shown in the experimental results, our strategy can increase the maximum operating frequency by up to 30% ~ 40% with a negligible wirelength change of -3.41% ~ 1%.
A cross-layer approach to cognitive computing: invited
- Gobinda Saha
- Cheng Wang
- Anand Raghunathan
- Kaushik Roy
Remarkable advances in machine learning and artificial intelligence have been made in various domains, achieving near-human performance in a plethora of cognitive tasks including vision, speech and natural language processing. However, implementations of such cognitive algorithms in conventional “von-Neumann” architectures are orders of magnitude more area and power expensive than the biological brain. Therefore, it is imperative to search for fundamentally new approaches so that the improvement in computing performance and efficiency can keep up with the exponential growth of AI computational demand. In this article, we present a cross-layer approach to the exploration of new paradigms in cognitive computing. This effort spans new learning algorithms inspired by biological information processing principles, network architectures best suited for such algorithms, and neuromorphic hardware substrates such as computing-in-memory fabrics, in order to build intelligent machines that achieve orders-of-magnitude improvement in energy efficiency for cognitive processing. We argue that such cross-layer innovations in cognitive computing are well-poised to enable a new wave of autonomous intelligence across the computing spectrum, from resource-constrained IoT devices to the cloud.
Generative self-supervised learning for gate sizing: invited
- Siddhartha Nath
- Geraldo Pradipta
- Corey Hu
- Tian Yang
- Brucek Khailany
- Haoxing Ren
Self-supervised learning has shown great promise in leveraging large amounts of unlabeled data to achieve higher accuracy than supervised learning methods in many domains. Generative self-supervised learning can generate new data based on the trained data distribution. In this paper, we evaluate the effectiveness of generative self-supervised learning on combinational gate sizing in VLSI designs. We propose a novel use of Transformers for gate sizing when trained on a large dataset generated from a commercial EDA tool. We demonstrate that our trained model can achieve 93% accuracy, 1440X speedup, and fast design convergence when compared to a leading commercial EDA tool.
Hammer: a modular and reusable physical design flow tool: invited
- Harrison Liew
- Daniel Grubb
- John Wright
- Colin Schmidt
- Nayiri Krzysztofowicz
- Adam Izraelevitz
- Edward Wang
- Krste Asanović
- Jonathan Bachrach
- Borivoje Nikolić
Process technology scaling and hardware architecture specialization have vastly increased the need for chip design space exploration, while optimizing for power, performance, and area. Hammer is an open-source, reusable physical design (PD) flow generator that reduces design effort and increases portability by enforcing a separation among design-, tool-, and process technology-specific concerns with a modular software architecture. In this work, we outline Hammer’s structure and highlight recent extensions that support both physical chip designers and hardware architects evaluating the merit and feasibility of their proposed designs. This is accomplished through the integration of more tools and process technologies—some open-source—and the designer-driven development of flow step generators. An evaluation of chip designs in process technologies ranging from 130nm down to 12nm across a series of RISC-V-based chips shows how Hammer-generated flows are reusable and enable efficient optimization for diverse applications.
mflowgen: a modular flow generator and ecosystem for community-driven physical design: invited
- Alex Carsello
- James Thomas
- Ankita Nayak
- Po-Han Chen
- Mark Horowitz
- Priyanka Raina
- Christopher Torng
Achieving high code reuse in physical design flows is challenging but increasingly necessary to build complex systems. Unfortunately, existing approaches based on parameterized Tcl generators support very limited reuse as designers customize flows for specific designs and technologies, preventing their reuse in future flows. We present a vision and framework based on modular flow generators that encapsulates coarse-grained and fine-grained reusable code in modular nodes and assembles them into complete flows. The key feature is a flow consistency and instrumentation layer embedded in Python, which supports mechanisms for rapid and early feedback on inconsistent composition. We evaluate the design flows of successive generations of silicon prototypes built in TSMC16, TSMC28, TSMC40, SKY130, and IBM180 technologies, showing how our approach can enable significant code reuse in future flows.
A distributed approach to silicon compilation: invited
- Andreas Olofsson
- William Ransohoff
- Noah Moroze
Hardware specialization for the long tail of future energy-constrained edge applications will require reducing design costs by orders of magnitude. In this work, we take a distributed approach to hardware compilation, with the goal of creating infrastructure that scales to thousands of developers and millions of servers. Technical contributions in this work include (i) a standardized hardware build system manifest, (ii) a lightweight flowgraph-based programming model, (iii) a client/server execution model, and (iv) a provenance tracking system for distributed development. These ideas have been reduced to practice in SiliconCompiler, an open source build system that demonstrates an order of magnitude compilation speed up on multiple designs and PDKs compared to single-threaded build systems.
Improving GNN-based accelerator design automation with meta learning
- Yunsheng Bai
- Atefeh Sohrabizadeh
- Yizhou Sun
- Jason Cong
Recently, there is a growing interest in developing learning-based models as a surrogate of the High-Level Synthesis (HLS) tools, where the key objective is rapid prediction of the quality of a candidate HLS design for automated design space exploration (DSE). Training is usually conducted on a given set of computation kernels (or kernels in short) needed for hardware acceleration. However, the model must also perform well on new kernels. The discrepancy between the training set and new kernels, called domain shift, frequently leads to a drop in model accuracy, which in turn negatively impacts DSE performance. In this paper, we investigate the possibility of adapting an existing meta-learning approach, named MAML, to the task of design quality prediction. Experiments show the MAML-enhanced model outperforms a simple baseline based on fine-tuning in terms of both offline evaluation on hold-out test sets and online evaluation for DSE speedup results.
Accelerator design with decoupled hardware customizations: benefits and challenges: invited
- Debjit Pal
- Yi-Hsiang Lai
- Shaojie Xiang
- Niansong Zhang
- Hongzheng Chen
- Jeremy Casas
- Pasquale Cocchini
- Zhenkun Yang
- Jin Yang
- Louis-Noël Pouchet
- Zhiru Zhang
The past decade has witnessed increasing adoption of high-level synthesis (HLS) to implement specialized hardware accelerators targeting either FPGAs or ASICs. However, current HLS programming models entangle algorithm specifications with hardware customization techniques, which lowers both the productivity and portability of the accelerator design. To tackle this problem, recent efforts such as HeteroCL propose to decouple algorithm definition from essential hardware customization techniques in compute, data type, and memory, increasing productivity, portability, and performance.
While the decoupling of the algorithm and customizations benefits the compilation/synthesis process, it also creates new hurdles for programmers to productively debug and validate the correctness of the optimized design. In this work, using HeteroCL and realistic machine learning applications as case studies, we first explain the key advantages that the decoupled programming model brings to a programmer for rapidly developing high-performance accelerators. Using the same case studies, we further show how seemingly benign usage of the customization primitives can lead to new verification challenges. We then outline the research opportunities and discuss some of our recent efforts as a first step toward a robust and viable verification solution.
ScaleHLS: a scalable high-level synthesis framework with multi-level transformations and optimizations: invited
- Hanchen Ye
- HyeGang Jun
- Hyunmin Jeong
- Stephen Neuendorffer
- Deming Chen
This paper presents an enhanced version of a scalable HLS (High-Level Synthesis) framework named ScaleHLS, which can compile HLS C/C++ programs and PyTorch models to highly-efficient and synthesizable C++ designs. The original version of ScaleHLS achieved significant speedup on both C/C++ kernels and PyTorch models [14]. In this paper, we first highlight the key features of ScaleHLS on tackling the challenges present in the representation, optimization, and exploration of large-scale HLS designs. To further improve the scalability of ScaleHLS, we then propose an enhanced HLS transform and analysis library supported in both C++ and Python, and a new design space exploration algorithm to handle HLS designs with hierarchical structures more effectively. Compared to the original ScaleHLS, our enhanced version improves the speedup by up to 60.9× on FPGAs. ScaleHLS is fully open-sourced at https://github.com/hanchenye/scalehls.
The SODA approach: leveraging high-level synthesis for hardware/software co-design and hardware specialization: invited
- Nicolas Bohm Agostini
- Serena Curzel
- Ankur Limaye
- Vinay Amatya
- Marco Minutoli
- Vito Giovanni Castellana
- Joseph Manzano
- Antonino Tumeo
- Fabrizio Ferrandi
Novel “converged” applications combine phases of scientific simulation with data analysis and machine learning. Each computational phase can benefit from specialized accelerators. However, algorithms evolve so quickly that mapping them on existing accelerators is suboptimal or even impossible. This paper presents the SODA (Software Defined Accelerators) framework, a modular, multi-level, open-source, no-human-in-the-loop, hardware synthesizer that enables end-to-end generation of specialized accelerators. SODA is composed of SODA-Opt, a high-level frontend developed in MLIR that interfaces with domain-specific programming frameworks and allows performing system level design, and Bambu, a state-of-the-art high-level synthesis engine that can target different device technologies. The framework implements design space exploration as compiler optimization passes. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of accelerators optimized for the computational patterns of converged applications. We then discuss some of the research opportunities that such a framework allows, including system-level design, profile driven optimization, and supporting new optimization metrics.
Automatic oracle generation in microsoft’s quantum development kit using QIR and LLVM passes
- Mathias Soeken
- Mariia Mykhailova
Automatic oracle generation techniques can find optimized quantum circuits for classical components in quantum algorithms. However, most implementations of oracle generation techniques require that the classical component be expressed in a conventional logic representation such as logic networks, truth tables, or decision diagrams. We implemented LLVM passes that automatically translate QIR functions representing classical Q# functions into QIR code implementing such functions quantumly. We use state-of-the-art logic optimization and oracle generation techniques based on XOR-AND graphs for this purpose. This not only enables a more natural description of the quantum algorithm at a higher level of abstraction, but also enables technology-dependent or application-specific generation of the oracles.
The basis of design tools for quantum computing: arrays, decision diagrams, tensor networks, and ZX-calculus
- Robert Wille
- Lukas Burgholzer
- Stefan Hillmich
- Thomas Grurl
- Alexander Ploier
- Tom Peham
Quantum computers promise to efficiently solve important problems classical computers never will. However, in order to capitalize on these prospects, a fully automated quantum software stack needs to be developed. This involves a multitude of complex tasks, from the classical simulation of quantum circuits, over their compilation to specific devices, to the verification of the circuits to be executed as well as the obtained results. All of these tasks are highly non-trivial and necessitate efficient data structures to tackle the inherent complexity. Starting from rather straightforward arrays, over decision diagrams (inspired by the design automation community), to tensor networks and the ZX-calculus, various complementary approaches have been proposed. This work provides a look “under the hood” of today’s tools and showcases how these means are utilized in them, e.g., for simulation, compilation, and verification of quantum circuits.
Secure by construction: addressing security vulnerabilities introduced during high-level synthesis: invited
- Md Rafid Muttaki
- Zahin Ibnat
- Farimah Farahmandi
Working at a higher level of abstraction (C/C++) enables designers to execute and validate complex designs faster in response to demanding time-to-market requirements. High-Level Synthesis (HLS) is an automatic process that translates the high-level description of the design behaviors into the corresponding hardware description language (HDL) modules. However, HLS translation steps/optimizations can cause security vulnerabilities since they have not been designed with security in mind. It is very important that HLS generates functionally correct RTL in a secure manner in the first place, since it is not easy to read the automatically generated code and trace it back to the source of vulnerabilities. Even if one manages to identify and fix the security vulnerabilities in one design, the core of the HLS engine remains vulnerable. Therefore, the same vulnerabilities will appear in all other HLS-generated RTL code. This paper shows a systematic approach for identifying the source of security vulnerabilities introduced during HLS and mitigating them.
High-level design methods for hardware security: is it the right choice? invited
- Christian Pilato
- Donatella Sciuto
- Benjamin Tan
- Siddharth Garg
- Ramesh Karri
Due to the globalization of the electronics supply chain, hardware engineers are increasingly interested in modifying their chip designs to protect their intellectual property (IP) or the privacy of the final users. However, the integration of state-of-the-art solutions for hardware and hardware-assisted security is not fully automated, requiring the amendment of stable tools and industrial toolchains. This significantly limits the application in industrial designs, potentially affecting the security of the resulting chips. We discuss how existing solutions can be adapted to implement security features at higher levels of abstractions (during high-level synthesis or directly at the register-transfer level) and complement current industrial design and verification flows. Our modular framework allows designers to compose these solutions and create additional protection layers.
Trusting the trust anchor: towards detecting cross-layer vulnerabilities with hardware fuzzing
- Chen Chen
- Rahul Kande
- Pouya Mahmoody
- Ahmad-Reza Sadeghi
- JV Rajendran
The rise in the development of complex and application-specific commercial and open-source hardware and the shrinking verification time are causing numerous hardware-security vulnerabilities. Traditional verification techniques are limited in both scalability and completeness. Research in this direction is hindered due to the lack of robust testing benchmarks. In this paper, in collaboration with our industry partners, we built an ecosystem mimicking the hardware-development cycle where we inject bugs inspired by real-world vulnerabilities into RISC-V SoC design and organized an open-to-all bug-hunting competition. We equipped the participating researchers with industry-standard static and dynamic verification tools in a ready-to-use environment. The findings from our competition shed light on the strengths and weaknesses of the existing verification tools and highlight the potential for future research in developing new vulnerability detection techniques.
Automating hardware security property generation: invited
- Ryan Kastner
- Francesco Restuccia
- Andres Meza
- Sayak Ray
- Jason Fung
- Cynthia Sturton
Security verification is an important part of the hardware design process. Security verification teams can uncover weaknesses, vulnerabilities, and flaws. Unfortunately, the verification process involves substantial manual analysis to create the threat model, identify important security assets, articulate weaknesses, define security requirements, and specify security properties that formally describe security requirements upon the hardware. This work describes current hardware security verification practices. Many of these rely on manual analysis. We argue that the property generation process is a first step towards scalable and reproducible hardware security verification.
Efficient timing propagation with simultaneous structural and pipeline parallelisms: late breaking results
- Cheng-Hsiang Chiu
- Tsung-Wei Huang
Graph-based timing propagation (GBP) is an essential component for all static timing analysis (STA) algorithms. To speed up GBP, the state-of-the-art timer leverages the task graph model to explore structural parallelism in an STA graph. However, many designs exhibit linear segments that cause the parallelism to serialize, degrading the performance significantly. To overcome this problem, we introduce an efficient GBP framework by exploring both structural and pipeline parallelisms in an STA task graph. Our framework identifies linear segments and parallelizes their propagation tasks using pipelining in an STA task graph. We have shown up to 25% performance improvement over the state-of-the-art task-graph-based timer.
A fast and low-cost comparison-free sorting engine with unary computing: late breaking results
- Amir Hossein Jalilvand
- Seyedeh Newsha Estiri
- Samaneh Naderi
- M. Hassan Najafi
- Mohsen Imani
Hardware-efficient implementation of the sorting operation is crucial for numerous applications, particularly when fast and energy-efficient sorting of data is desired. Unary computing has been used for low-cost hardware sorting. This work proposes a comparison-free unary sorting engine by iteratively finding maximum values. Synthesis results show up to 81% reduction in hardware area compared to the state-of-the-art unary sorting design. By processing right-aligned unary bit-streams, our unary sorter is able to sort many inputs in fewer clock cycles.
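A software sketch of the comparison-free principle with right-aligned unary streams, using hypothetical helper names: the bitwise OR of right-aligned unary bit-streams equals the unary encoding of their maximum, so sorting reduces to repeatedly OR-ing the streams and removing the current maximum.

```python
# Sketch only: values become right-aligned unary bit-streams, the maximum of a
# set of streams is their bitwise OR, so sorting = repeated OR-and-remove.
def to_unary(value, width=8):
    return (1 << value) - 1                      # right-aligned: 'value' ones

def from_unary(stream):
    return bin(stream).count("1")

def unary_sort_desc(values, width=8):
    streams = [to_unary(v, width) for v in values]
    result = []
    while streams:
        max_stream = 0
        for s in streams:
            max_stream |= s                      # bitwise OR == element-wise max
        result.append(from_unary(max_stream))
        streams.remove(max_stream)               # drop one copy of the maximum
    return result

print(unary_sort_desc([3, 7, 1, 5]))             # [7, 5, 3, 1]
```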
Flexible chip placement via reinforcement learning: late breaking results
- Fu-Chieh Chang
- Yu-Wei Tseng
- Ya-Wen Yu
- Ssu-Rui Lee
- Alexandru Cioba
- I-Lun Tseng
- Da-shan Shiu
- Jhih-Wei Hsu
- Cheng-Yuan Wang
- Chien-Yi Yang
- Ren-Chu Wang
- Yao-Wen Chang
- Tai-Chen Chen
- Tung-Chieh Chen
Recently, successful applications of reinforcement learning to chip placement have emerged. Pretrained models are necessary to improve efficiency and effectiveness. Currently, the weights of objective metrics (e.g., wirelength, congestion, and timing) are fixed during pretraining. However, fixed-weight models cannot generate the diversity of placements required for engineers to accommodate changing requirements as they arise. This paper proposes flexible multiple-objective reinforcement learning (MORL) to support objective functions with inference-time variable weights using just a single pretrained model. Our macro placement results show that MORL can generate the Pareto frontier of multiple objectives effectively.
FPGA-aware automatic acceleration framework for vision transformer with mixed-scheme quantization: late breaking results
- Mengshu Sun
- Zhengang Li
- Alec Lu
- Haoyu Ma
- Geng Yuan
- Yanyue Xie
- Hao Tang
- Yanyu Li
- Miriam Leeser
- Zhangyang Wang
- Xue Lin
- Zhenman Fang
Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework exploring model quantization. Compared with state-of-the-art ViT quantization work (algorithmic approach only without hardware acceleration), our quantization achieves 0.31% to 1.25% higher Top-1 accuracy under the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around 5.6× improvement on the frame rate (i.e., 56.4 FPS vs. 10.0 FPS) with 0.83% accuracy drop for DeiT-base.
Hardware-efficient stochastic rounding unit design for DNN training: late breaking results
- Sung-En Chang
- Geng Yuan
- Alec Lu
- Mengshu Sun
- Yanyu Li
- Xiaolong Ma
- Zhengang Li
- Yanyue Xie
- Minghai Qin
- Xue Lin
- Zhenman Fang
- Yanzhi Wang
Stochastic rounding is crucial in the training of low-bit deep neural networks (DNNs) to achieve high accuracy. Unfortunately, prior studies require a large number of high-precision stochastic rounding units (SRUs) to guarantee the low-bit DNN accuracy, which involves considerable hardware overhead. In this paper, we propose an automated framework to explore hardware-efficient low-bit SRUs (ESRUs) that can still generate high-quality random numbers to guarantee the accuracy of low-bit DNN training. Experimental results using state-of-the-art DNN models demonstrate that, compared to the prior 24-bit SRU with a 24-bit pseudo-random number generator (PRNG), our 8-bit ESRU with a 3-bit PRNG reduces SRU resource usage by 9.75× while achieving higher accuracy.
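The underlying stochastic rounding operation with a low-bit uniform random threshold can be sketched as below; the quantization parameters are illustrative and this is not the paper's ESRU circuit.

```python
# Sketch of stochastic rounding: round up with probability equal to the
# fractional part, so the quantization error is unbiased in expectation.
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, frac_bits=3):
    """Round to integers using a 'frac_bits'-bit uniform random threshold."""
    scale = 1 << frac_bits
    floor = np.floor(x)
    frac = np.floor((x - floor) * scale)              # quantized fractional part
    r = rng.integers(0, scale, size=np.shape(x))      # low-bit PRNG output
    return floor + (r < frac)                         # round up when threshold is below frac

x = np.full(100000, 0.375)
print(stochastic_round(x).mean())                     # close to 0.375 on average
```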
Placement initialization via a projected eigenvector algorithm: late breaking results
- Pengwen Chen
- Chung-Kuan Cheng
- Albert Chern
- Chester Holtz
- Aoxi Li
- Yucheng Wang
Canonical methods for analytical placement of VLSI designs rely on solving nonlinear programs to minimize wirelength and cell overlap. We focus on producing initial layouts such that a global analytical placer performs better compared to existing heuristics for initialization. We reduce the problem of initialization to a quadratically constrained quadratic program. Our formulation is aware of fixed macros. We propose an efficient algorithm which can quickly generate initializations for testcases with millions of cells. We show that our method for parameter initialization results in superior performance with respect to post-detailed-placement wirelength.
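A plain spectral sketch of eigenvector-based initialization on a toy netlist graph is shown below; the paper's projected eigenvector algorithm additionally handles fixed macros through its quadratically constrained formulation, which this sketch omits.

```python
# Toy spectral placement: the Laplacian eigenvectors with the smallest nonzero
# eigenvalues give wirelength-friendly starting x/y coordinates for the cells.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3), (2, 4)]   # toy connectivity
n = 5
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
L = np.diag(A.sum(axis=1)) - A                              # graph Laplacian

vals, vecs = np.linalg.eigh(L)                              # ascending eigenvalues
x, y = vecs[:, 1], vecs[:, 2]                               # skip the constant eigenvector
init_xy = np.stack([x, y], axis=1)                          # one (x, y) per cell
print(init_xy)
```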
Subgraph matching based reference placement for PCB designs: late breaking results
- Miaodi Su
- Yifeng Xiao
- Shu Zhang
- Haiyuan Su
- Jiacen Xu
- Huan He
- Ziran Zhu
- Jianli Chen
- Yao-Wen Chang
Reference placement is promising to handle the increasing complexity in PCB design. We model the netlist as a graph and use a subgraph matching algorithm to find isomorphisms of the placed template in component combinations to reuse the placement. The state-of-the-art VF3 algorithm can achieve high matching accuracy while suffering from high computation time in large-scale instances. Thus, we propose the D2BS algorithm to guarantee matching quality and efficiency. We build and filter the candidate set (CS) according to designed features to construct the CS structure. In the CS optimization, a graph diversity tolerance strategy is adopted to achieve inexact matching. Then, hierarchical matching is developed to search for the template embeddings in the CS structure, guided by branch backtracking and matched-node snatching. Experimental results show that D2BS outperforms VF3 in accuracy and runtime, achieving 100% accuracy on PCB instances.
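For illustration, the template-matching step can be prototyped with the VF2 matcher that ships with networkx (a stand-in for VF3 and D2BS, which are not available off the shelf); the toy netlist and node names are hypothetical.

```python
# Find where a placed template's connectivity pattern recurs in the netlist graph.
import networkx as nx
from networkx.algorithms import isomorphism

netlist = nx.Graph([("U1", "U2"), ("U2", "U3"), ("U3", "U1"),
                    ("U3", "U4"), ("U4", "U5"), ("U5", "U3")])
template = nx.Graph([("A", "B"), ("B", "C"), ("C", "A")])   # a placed 3-cycle template

matcher = isomorphism.GraphMatcher(netlist, template)
for mapping in matcher.subgraph_isomorphisms_iter():
    print(mapping)   # netlist-node -> template-node assignments to reuse the placement
```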
Thermal-aware drone battery management: late breaking results
- Hojun Choi
- Youngmoon Lee
Users have reported that their drones unexpectedly shut off even when they show more than 10% remaining battery capacity. We discovered the cause of these unexpected shutoffs to be significant thermal degradation of a cell caused by thermal coupling between the drone and its battery cells. This causes a large voltage drop for the cell affected by the drone heat dissipation, which leads to low supply voltage and unexpected shutoffs. This paper describes the design and implementation of a thermal and battery-aware power management framework designed specifically for drones. Our framework provides accurate state-of-charge and state-of-power estimation for individual battery cells by accounting for their different thermal degradation. We have implemented our framework on commodity drones without additional hardware or system modification. We have evaluated its effectiveness using three different batteries, demonstrating that our framework generates accurate state-of-charge estimates and prevents unexpected shutoffs.
Waveform-based performance analysis of RISC-V processors: late breaking results
- Lucas Klemmer
- Daniel Große
In this paper, we demonstrate the use of the open-source domain-specific language WAL to analyze performance metrics of RISC-V processors. The WAL programs calculate these metrics by evaluating the processor's signals while “walking” over the simulation waveform (VCD). The presented WAL programs are flexible and generic, and can be easily adapted to different RISC-V cores.
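The flavor of such waveform-walking metrics can be sketched in plain Python over pre-extracted signal values (this is not WAL syntax; the signal names are hypothetical): an IPC-style metric is computed by counting retired instructions per clock cycle.

```python
# Sketch: compute an IPC-like metric by walking cycle-by-cycle over signal
# values assumed to have already been extracted from a VCD waveform.
trace = {
    "clk":          [0, 1, 0, 1, 0, 1, 0, 1],
    "retire_valid": [0, 1, 0, 0, 0, 1, 0, 1],
}

def ipc(trace):
    cycles = retired = 0
    for clk, valid in zip(trace["clk"], trace["retire_valid"]):
        if clk == 1:                      # only count samples taken on clock-high
            cycles += 1
            retired += valid
    return retired / cycles if cycles else 0.0

print("IPC:", ipc(trace))                 # 3 retired instructions over 4 cycles
```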