ASPDAC ’23: Proceedings of the 28th Asia and South Pacific Design Automation Conference

 Full Citation in the ACM Digital Library

SESSION: Technical Program: Reliability Considerations for Emerging Computing and Memory Architectures

A Fast Semi-Analytical Approach for Transient Electromigration Analysis of Interconnect Trees Using Matrix Exponential

  • Pavlos Stoikos
  • George Floros
  • Dimitrios Garyfallou
  • Nestor Evmorfopoulos
  • George Stamoulis

As integrated circuit technologies are moving to smaller technology nodes, Electromigration (EM) has become one of the most challenging problems facing the EDA industry. While numerical approaches have been widely deployed since they can handle complicated interconnect structures, they tend to be much slower than analytical approaches. In this paper, we present a fast semi-analytical approach, based on the matrix exponential, for the solution of Korhonen’s stress equation at discrete spatial points of interconnect trees, which enables the analytical calculation of EM stress at any time and point independently. The proposed approach is combined with the extended Krylov subspace method to accurately simulate large EM models and accelerate the calculation of the final solution. Experimental evaluation on OpenROAD benchmarks demonstrates that our method achieves 0.5% average relative error over the COMSOL industrial tool while being up to three orders of magnitude faster.

Chiplet Placement for 2.5D IC with Sequence Pair Based Tree and Thermal Consideration

  • Hong-Wen Chiou
  • Jia-Hao Jiang
  • Yu-Teng Chang
  • Yu-Min Lee
  • Chi-Wen Pan

This work develops an efficient chiplet placer with thermal consideration for 2.5D ICs. Combining the sequence-pair based tree, branch-and-bound method, and advanced placement/pruning techniques, the developed placer can find the solution fast with the optimized total wirelength (TWL) on half-perimeter wirelength (HPWL). Additionally, with the post placement procedure, the placer reduces maximum temperatures with slight increase of wirelength. Experimental results show that the placer can not only find better optimized TWL (reducing 1.035% HPWL) but also speed up at most two orders of magnitude than the prior art. With thermal consideration, the placer can reduce the maximum temperature up to 8.214 °C with an average 5.376% increase of TWL.

An On-Line Aging Detection and Tolerance Framework for Improving Reliability of STT-MRAMs

  • Yu-Guang Chen
  • Po-Yeh Huang
  • Jin-Fu Li

Spin-transfer-torque magnetic random-access memory (STT-MRAM) is one of the most promising emerging memories for on-chip memory. However, the magnetic tunnel junction (MTJ) in the STT-MRAM suffers from several reliability threats which degrade the endurance, create defects, and cause memory failure. One of the primary reliability issues comes from time-dependent dielectric breakdown (TDDB) on MTJ, which deviates resistance value of MTJ over time and may lead to reading error. To overcome this challenge, in this paper we present an on-line aging detection and tolerance framework to dynamically monitor the electrical parameter deviations and provide appropriate compensation to avoid reading error. The on-line aging detection mechanism can identify aged words by monitoring read current and then the aging tolerance mechanism can adjust the reference resistance of the sensing amplifier to compensate the aging-induced resistance drop of MTJ. In comparison with existing testing-based aging detection techniques, our mechanism can operate on-line with read operations for both aging detection and tolerance simultaneously with negligible performance overhead. Simulation and analysis results show that the proposed techniques can successfully detect 99% aging words under process variation and achieve at most 25% reliability improvement of STT-MRAMs.

SESSION: Technical Program: Accelerators and Equivalence Checking

Automated Equivalence Checking Method for Majority Based In-Memory Computing on ReRAM Crossbars

  • Arighna Deb
  • Kamalika Datta
  • Muhammad Hassan
  • Saeideh Shirinzadeh
  • Rolf Drechsler

Recent progress in the fabrication of Resistive Random Access Memory (ReRAM) devices has paved the way for large scale crossbar structures. In particular, in-memory computing on ReRAM crossbars helps in bridging the processor-memory speed gap for current CMOS technology. To this end, synthesis and mapping of Boolean functions to such crossbars have been investigated by researchers. However the verification of simple designs on crossbar is still done through manual inspection or sometimes complemented by simulation based techniques. Clearly this is an important problem as real world designs are complex and have higher number of inputs. As a result manual inspection and simulation based methods for these designs are not practical.

In this paper for the first time as per our knowledge we propose an automated equivalence checking methodology for majority based in-memory designs on ReRAM crossbars. Our contributions are twofold: first, we introduce an intermediate data structure called ReRAM Sequence Graph (ReSG) to represent the logic-in-memory design. This in turn is translated into Boolean Satifiability (SAT) formulas. These SAT formulas are verified against the golden functional specification using Z3 Satifiability Modulo Theory (SMT) solver. We validate the proposed method by running widely available benchmarks.

An Equivalence Checking Framework for Agile Hardware Design

  • Yanzhao Wang
  • Fei Xie
  • Zhenkun Yang
  • Pasquale Cocchini
  • Jin Yang

Agile hardware design enables designers to produce new design iterations efficiently. Equivalence checking is critical in ensuring that a new design iteration conforms to its specification. In this paper, we introduce an equivalence checking framework for hardware designs represented in HalideIR. HalideIR is a popular intermediate representation in software domains such as deep learning and image processing, and it is increasingly utilized in agile hardware design. We have developed a fully automatic equivalence checking workflow seamlessly integrated with HalideIR and several optimizations that leverage the incremental nature of agile hardware design to scale equivalence checking. Evaluations of two deep learning accelerator designs show our automatic equivalence checking framework scales to hardware designs of practical sizes and detects inconsistencies that manually crafted tests have missed.

Towards High-Bandwidth-Utilization SpMV on FPGAs via Partial Vector Duplication

  • Bowen Liu
  • Dajiang Liu

Sparse matrix-vector multiplication (SpMV) is widely used in many fields and usually dominates the execution time of a task. With large off-chip memory bandwidth, customizable on-chip resources and high-performance float-point operation, FPGA is a potential platform to accelerate SpMV tasks. However, as compressed data formats for SpMV usually introduce irregular memory access while it is also memory-intensive, implementing an SpMV accelerator on FPGA to achieve a high bandwidth utilization (BU) is a challenging work. Existing works either eliminate irregular memory access at the sacrifice of increasing data redundancy or try to locally reduce the port conflicts introduced by irregular memory access, leading to a limited BU improvement. To this end, this paper proposes a high-bandwidth-utilization SpMV accelerator on FPGAs using partial vector duplication, where read-conflict-free vector buffer, writing-conflict-free adder tree, and ping-pong-like accumulator registers are well elaborated. The FPGA implementation results show that the proposed design can achieve an average of 1.10x performance speedup compared to the state-of-the-art work.

SESSION: Technical Program: New Frontiers in Cyber-Physical and Autonomous Systems

Safety-Driven Interactive Planning for Neural Network-Based Lane Changing

  • Xiangguo Liu
  • Ruochen Jiao
  • Bowen Zheng
  • Dave Liang
  • Qi Zhu

Neural network-based driving planners have shown great promises in improving task performance of autonomous driving. However, it is critical and yet very challenging to ensure the safety of systems with neural network-based components, especially in dense and highly interactive traffic environments. In this work, we propose a safety-driven interactive planning framework for neural network-based lane changing. To prevent over-conservative planning, we identify the driving behavior of surrounding vehicles and assess their aggressiveness, and then adapt the planned trajectory for the ego vehicle accordingly in an interactive manner. The ego vehicle can proceed to change lanes if a safe evasion trajectory exists even in the predicted worst case; otherwise, it can stay around the current lateral position or return back to the original lane. We quantitatively demonstrate the effectiveness of our planner design and its advantage over baseline methods through extensive simulations with diverse and comprehensive experimental settings, as well as in real-world scenarios collected by an autonomous vehicle company.

Safety-Aware Flexible Schedule Synthesis for Cyber-Physical Systems Using Weakly-Hard Constraints

  • Shengjie Xu
  • Bineet Ghosh
  • Clara Hobbs
  • P. S. Thiagarajan
  • Samarjit Chakraborty

With the emergence of complex autonomous systems, multiple control tasks are increasingly being implemented on shared computational platforms. Due to the resource-constrained nature of such platforms in domains such as automotive, scheduling all the control tasks in a timely manner is often difficult. The usual requirement—that all task invocations must meet their deadlines—stems from the isolated design of a control strategy and its implementation (including scheduling) in software. This separation of concerns, where the control designer sets the deadlines, and the embedded software engineer aims to meet them, eases the design and verification process. However, it is not flexible and is overly conservative. In this paper, we show how to capture the deadline miss patterns under which the safety properties of the controllers will still be satisfied. The allowed patterns of such deadline misses may be captured using what are referred to as “weakly-hard constraints.” But scheduling tasks under these weakly-hard constraints is non-trivial since common scheduling policies like fixed-priority or earliest deadline first do not satisfy them in general. The main contribution of this paper is to automatically synthesize schedules from the safety properties of controllers. Using real examples, we demonstrate the effectiveness of this strategy and illustrate that traditional notions of schedulability, e.g., utility ratios, are not applicable when scheduling controllers to satisfy safety properties.

Mixed-Traffic Intersection Management Utilizing Connected and Autonomous Vehicles as Traffic Regulators

  • Pin-Chun Chen
  • Xiangguo Liu
  • Chung-Wei Lin
  • Chao Huang
  • Qi Zhu

Connected and autonomous vehicles (CAVs) can realize many revolutionary applications, but it is expected to have mixed-traffic including CAVs and human-driving vehicles (HVs) together for decades. In this paper, we target the problem of mixed-traffic intersection management and schedule CAVs to control the subsequent HVs. We develop a dynamic programming approach and a mixed integer linear programming (MILP) formulation to optimally solve the problems with the corresponding intersection models. We then propose an MILP-based approach which is more efficient and real-time-applicable than solving the optimal MILP formulation, while keeping good solution quality as well as outperforming the first-come-first-served (FCFS) approach. Experimental results and SUMO simulation indicate that controlling CAVs by our approaches is effective to regulate mixed-traffic even if the CAV penetration rate is low, which brings incentive to early adoption of CAVs.

SESSION: Technical Program: Machine Learning Assisted Optimization Techniques for Analog Circuits

Fully Automated Machine Learning Model Development for Analog Placement Quality Prediction

  • Chen-Chia Chang
  • Jingyu Pan
  • Zhiyao Xie
  • Yaguang Li
  • Yishuang Lin
  • Jiang Hu
  • Yiran Chen

Analog integrated circuit (IC) placement is a heavily manual and time-consuming task that has a significant impact on chip quality. Several recent studies apply machine learning (ML) techniques to directly predict the impact of placement on circuit performance or even guide the placement process. However, the significant diversity in analog design topologies can lead to different impacts on performance metrics (e.g., common-mode rejection ratio (CMRR) or offset voltage). Thus, it is unlikely that the same ML model structure will achieve the best performance for all designs and metrics. In addition, customizing ML models for different designs require more tremendous engineering efforts and longer development cycles. In this work, we leverage Neural Architecture Search (NAS) to automatically develop customized neural architectures for different analog circuit designs and metrics. Our proposed NAS methodology supports an unconstrained DAG-based search space containing a wide range of ML operations and topological connections. Our search strategy can efficiently explore this flexible search space and provide every design with the best-customized model to boost the model performance. We make unprejudiced comparisons with the claimed performance of the previous representative work on exactly the same dataset. After fully automated development within only 0.5 days, generated models give 3.61% superior accuracy than the prior art.

Efficient Hierarchical mm-Wave System Synthesis with Embedded Accurate Transformer and Balun Machine Learning Models

  • F. Passos
  • N. Lourenço
  • L. Mendes
  • R. Martins
  • J. Vaz
  • N. Horta

Integrated circuit design in millimeter-wave (mm-Wave) bands is exceptionally complex and dependent on costly electromagnetic (EM) simulations. Therefore, in the past few years, a growing interest has emerged in developing novel optimization-based methodologies for the automatic design of mm-Wave circuits. However, current approaches lack scalability when the circuit/system complexity increases. Besides, many also depend on EM simulators, which degrade their efficiency. This work resorts to hierarchical system partitioning and bottom-up design approaches, where a precise machine learning model – composed of hundreds of seamlessly integrated sub-models that guarantee high accuracy (validated against EM simulations and measurements) up to 200GHz – is embedded to design passive components, e.g., transformers and baluns. The model generates optimal design surfaces to be fed to the hierarchical levels above or acts as a performance estimator. With the proposed scheme, it is possible to remove the dependency of EM simulations during optimization. The proposed mixed-optimal-surface, performance estimator, and simulation-based bottom-up multiobjective optimization (MOO) are used to fully design a Ka-band mm-Wave transmitter from the device up to the system level in 65-nm CMOS for state-of-the-art specifications.

APOSTLE: Asynchronously Parallel Optimization for Sizing Analog Transistors Using DNN Learning

  • Ahmet F. Budak
  • David Smart
  • Brian Swahn
  • David Z. Pan

Analog circuit sizing is a high-cost process in terms of the manual effort invested and the computation time spent. With rapidly developing technology and high market demand, bringing automated solutions for sizing has attracted great attention. This paper presents APOSTLE, an asynchronously parallel optimization method for sizing analog transistors using Deep Neural Network (DNN) learning. This work introduces several methods to minimize real-time of optimization when the sizing task consists of several different simulations with varying time costs. The key contributions of this paper are: (1) a batch optimization framework, (2) a novel deep neural network architecture for exploring design points when the existed solutions are not always fully evaluated, (3) a ranking approximation method based on cheap evaluations and (4) a theoretical approach to balance between the cheap and the expensive simulations to maximize the optimization efficiency. Our method shows high real-time efficiency compared to other black-box optimization methods both on small building blocks and on large industrial circuits while reaching similar or better performance.

SESSION: Technical Program: Machine Learning for Reliable, Secure, and Cool Chips: A Journey from Transistors to Systems

ML to the Rescue: Reliability Estimation from Self-Heating and Aging in Transistors All the Way up Processors

  • Hussam Amrouch
  • Florian Klemme

With increasingly confined 3D structures and newly-adopted materials of higher thermal resistance, transistor self-heating has risen to a critical reliability threat in state-of-the-art and emerging process nodes. One of the challenges of transistor self-heating is accelerated transistor aging, which leads to earlier failure of the chip if not considered appropriately. Nevertheless, adequate consideration of accelerated aging effects, induced by self-heating, throughout a large circuit design is profoundly challenging due to the large gap between where self-heating does originate (i.e., at the transistor level) and where its ultimate effect occurs (i.e., at the circuit and system levels). In this work, we demonstrate an end-to-end workflow starting from self-heating and aging effects in individual transistors all the way up to large circuits and processor designs. We demonstrate that with our accurately estimated degradations, the required timing guardband to ensure reliable operation of circuits is considerably reduced by up to 96% compared to otherwise worst-case estimations that are conventionally employed.

Graph Neural Networks: A Powerful and Versatile Tool for Advancing Design, Reliability, and Security of ICs

  • Lilas Alrahis
  • Johann Knechtel
  • Ozgur Sinanoglu

Graph neural networks (GNNs) have pushed the state-of-the-art (SOTA) for performance in learning and predicting on large-scale data present in social networks, biology, etc. Since integrated circuits (ICs) can naturally be represented as graphs, there has been a tremendous surge in employing GNNs for machine learning (ML)-based methods for various aspects of IC design. Given this trajectory, there is a timely need to review and discuss some powerful and versatile GNN approaches for advancing IC design.

In this paper, we propose a generic pipeline for tailoring GNN models toward solving challenging problems for IC design. We outline promising options for each pipeline element, and we discuss selected and promising works, like leveraging GNNs to break SOTA logic obfuscation. Our comprehensive overview of GNNs frameworks covers (i) electronic design automation (EDA) and IC design in general, (ii) design of reliable ICs, and (iii) design as well as analysis of secure ICs. We provide our overview and related resources also in the GNN4IC hub at Finally, we discuss interesting open problems for future research.

Detection and Classification of Malicious Bitstreams for FPGAs in Cloud Computing

  • Jayeeta Chaudhuri
  • Krishnendu Chakrabarty

As FPGAs are increasingly shared and remotely accessed by multiple users and third parties, they introduce significant security concerns. Modules running on an FPGA may include circuits that induce voltage-based fault attacks and denial-of-service (DoS). An attacker might configure some regions of the FPGA with bitstreams that implement malicious circuits. Attackers can also perform side-channel analysis and fault attacks to extract secret information (e.g., secret key of an AES encryption). In this paper, we present a convolutional neural network (CNN)-based defense to detect bitstreams of RO-based malicious circuits by analyzing the static features extracted from FPGA bitstreams. We further explore the criticality of RO-based circuits in order to detect malicious Trojans that are configured on the FPGA. Evaluation on Xilinx FPGAs demonstrates the effectiveness of the security solutions.

Learning Based Spatial Power Characterization and Full-Chip Power Estimation for Commercial TPUs

  • Jincong Lu
  • Jinwei Zhang
  • Wentian Jin
  • Sachin Sachdeva
  • Sheldon X.-D. Tan

In this paper, we propose a novel approach for the real-time estimation of chip-level spatial power maps for commercial Google Coral M.2 TPU chips based on a machine-learning technique for the first time. The new method can enable the development of more robust runtime power and thermal control schemes to take advantage of spatial power information such as hot spots that are otherwise not available. Different from the existing commercial multi-core processors in which real-time performance-related utilization information is available, the TPU from Google does not have such information. To mitigate this problem, we propose to use features that are related to the workloads of running different deep neural networks (DNN) such as the hyperparameters of DNN and TPU resource information generated by the TPU compiler. The new approach involves the offline acquisition of accurate spatial and temporal temperature maps captured from an external infrared thermal imaging camera under nominal working conditions of a chip. To build the dynamic power density map model, we apply generative adversarial networks (GAN) based on the workload-related features. Our study shows that the estimated total powers match the manufacturer’s total power measurements extremely well. Experimental results further show that the predictions of power maps are quite accurate, with the RMSE of only 4.98mW/mm2, or 2.6% of the full-scale error. The speed of deploying the proposed approach on an Intel Core i7-10710U is as fast as 6.9ms, which is suitable for real-time estimation.

SESSION: Technical Program: High Performance Memory for Storage and Computing

DECC: Differential ECC for Read Performance Optimization on High-Density NAND Flash Memory

  • Yunpeng Song
  • Yina Lv
  • Liang Shi

3D NAND flash memory with advanced multi-level-cell technology has been widely adopted due to its high density, but with significantly degraded reliability. To solve the reliability issue, flash memory often adopts the low-density parity-check code (LDPC) as error correction code (ECC) to encode data and provide fault tolerance. For LDPC with a low code rate, it can provide a strong correction capability, but with a high energy cost. To avoid the cost, LDPC with a higher code rate is always adopted. When the accessed data is not successfully decoded, LDPC will rely on read retry operations to improve the error correction capability. However, the read retry operation will induce degraded read performance. In this work, a differential ECC (DECC) method is proposed to improve the read performance. The basic idea of DECC is to adopt LDPC with different code rates for data with different access characteristics. Specifically, when data is hot read and retried due to reliability, LDPC with a low code rate will be adopted to optimize performance. With this approach, the cost from LDPC with a low code rate is minimized and the performance is optimized. Through careful design and real-world workloads evaluation on a 3D triple-level-cell (TLC) NAND flash memory, DECC achieves encouraging read performance optimization.

Optimizing Data Layout for Racetrack Memory in Embedded Systems

  • Peng Hui
  • Edwin H.-M. Sha
  • Qingfeng Zhuge
  • Rui Xu
  • Han Wang

Racetrack memory (RTM), which consists of multiple domain block clusters (DBC) and access ports, is a novel non-volatile memory and has potential as scratchpad memory (SPM) in embedded devices due to its high density and low access latency. However, too many shift operations decrease the performance of RTM and cause unpredictable performance. In this paper, we propose three schemes to optimize the performance of RTM from different aspects, including intra-DBC, inter-DBC, and hybrid SPM with SRAM and RTM. Firstly, a balanced group-based data placement method for the data layout inside one DBC is proposed to reduce shifts. Second, a grouping method for the data allocation among DBCs is proposed. It helps with the shift reduction while using fewer DBCs by using one DBC as multiple DBCs. Finally, we use SRAM to further help the cost reduction, and a cost evaluation metric is proposed to assist the shrinking method which determines the data allocation for hybrid SPM with SRAM and RTM. Experiments show that the proposed schemes can significantly improve the performance of pure RTM and hybrid SPM while using fewer DBCs.

Exploring Architectural Implications to Boost Performance for in-NVM B+-Tree

  • Yanpeng Hu
  • Qisheng Jiang
  • Chundong Wang

Computer architecture keeps evolving to support the byte-addressable non-volatile memory (NVM). Researchers have tailored the prevalent B+-tree with NVM, crafting a history of utilizing architectural supports to gain both high performance and crash consistency. The latest architecture-level changes for NVM, e.g., the eADR, motivate us to further explore architectural implications in the design and implementation of in-NVM B+-tree. Our quantitative study finds that eADR makes the cache misses impact increasingly on an in-NVM B+-tree’s performance. We hence propose Conan for the conflict-aware node allocation based on theoretical justifications. Conan decomposes the virtual addresses of B+-tree nodes regarding a VIPT cache and intentionally places them into different cache sets. Experiments show that Conan evidently reduces cache conflicts and boosts the performance of state-of-the-art in-NVM B+-tree.

An Efficient near-Bank Processing Architecture for Personalized Recommendation System

  • Yuqing Yang
  • Weidong Yang
  • Qin Wang
  • Naifeng Jing
  • Jianfei Jiang
  • Zhigang Mao
  • Weiguang Sheng

Personalized recommendation systems consume the major resources in modern AI data centers. The memory-bound embedding layers with irregular memory access patterns have been identified as the bottleneck of recommendation systems. To overcome the memory challenges, near-memory processing (NMP) would be an effective solution which provides high bandwidth. Recent work proposes an NMP approach to accelerate the recommendation models by utilizing the through-silicon via (TSV) bandwidth in 3D-stacked DRAMs. However, the total bandwidth provided by TSVs is insufficient for a batch of embedding layers processed in parallel. In this paper, we propose a near-bank processing architecture to accelerate recommendation models. By integrating the compute-logic near memory banks on DRAM dies of the 3D-stacked DRAM, our architecture can exploit the enormous bank-level bandwidth which is much higher than TSV bandwidth. We also present a hardware/software interface for embedding layers offloading. Moreover, we propose an efficient mapping scheme to enhance the utilization of bank-level bandwidth. As a result, our architecture achieves up to 2.10X speedup and 31% energy saving for data movement over the state-of-the-art NMP solution for recommendation acceleration based on 3D-stacked memory.

SESSION: Technical Program: Cool and Efficient Approximation

PAALM: Power Density Aware Approximate Logarithmic Multiplier Design

  • Shuyuan Yu
  • Sheldon X.-D. Tan

Approximate hardware designs can lead to significant power or energy reduction. However, a recent study showed that approximated designs might lead to unwanted higher temperature and related reliability issues due to the increased power density. In this work, we try to mitigate this important problem by proposing a novel power density aware approximate logarithmic multiplier (called PAALM) design for the first time. The new multiplier design is based on the approximate logarithmic multiplier (ALM) framework due to its rigorous mathematics based foundation. The idea is to re-design the high computing switch activities of existing ALM designs based on equivalent mathematical formula so that the power density can be reduced at no accuracy loss while at costs of some area overheads. Our results show that the proposed PAALM design can improve 11.5%/5.7% of power density and 31.6%/70.8% of area with 8/16-bit precision when compared with the fixed-point multiplier baseline, respectively. And also achieves extremely low error bias: -0.17/0.08 for 8/16-bit precision, respectively. On top of this, we further implement the PAALM design in a Convolutional Neural Network (CNN) and test it on CIFAR10 dataset. The results show that with error compensation, PAALM can achieve the same inference accuracy as the fixed-point multiplier baseline. We also evaluate the PAALM in a discrete cosine transformation (DCT) application. The results show that with error compensation, PAALM can improve the image quality of 8.6dB in average when compared to the ALM design.

Approximate Floating-Point FFT Design with Wide Precision-Range and High Energy Efficiency

  • Chenyi Wen
  • Ying Wu
  • Xunzhao Yin
  • Cheng Zhuo

Fast Fourier Transform (FFT) is a key digital signal processing algorithm that is widely deployed in mobile and portable devices. Recently, with the popularity of human perception related tasks, it is noted that the requirements of full precision and exactness are not always necessary for FFT computation. We propose a top-down approximate Floating-Point FFT design methodology to fully exploit the error-tolerance nature of the FFT algorithm. An efficient error modeling of the configurable approximate multiplier is proposed to link the multiplier approximation to the FFT algorithm precision. Then an approximation optimization flow is formulated to maximize the energy efficiency. Experimental results show that the proposed approximate FFT can achieve up to 52% Area-Delay-Product improvement and 23% energy saving when compared to the exact FFT. The proposed approximate FFT is also found to cover almost 2X wider precision range with higher energy efficiency in comparison with the prior state-of-the-art approximate FFT.

RUCA: RUntime Configurable Approximate Circuits with Self-Correcting Capability

  • Jingxiao Ma
  • Sherief Reda

Approximate computing is an emerging computing paradigm that offers improved power consumption by relaxing the requirement for full accuracy. Since the requirements for accuracy may vary according to specific real-world applications, one trend of approximate computing is to design quality-configurable circuits, which are able to switch at runtime among different accuracy modes with different power and delay. In this paper, we present a novel framework RUCA which aims to synthesize runtime configurable approximate circuits based on arbitrary input circuits. By decomposing the truth table, our approach aims to approximate and separate the input circuit into multiple configuration blocks which support different accuracy levels, including a corrector circuit to restore full accuracy. Power gating is used to activate different blocks, such that the approximate circuit is able to operate at different accuracy-power configurations. To improve the scalability of our algorithm, we also provide a design space exploration scheme with circuit partitioning. We evaluate our methodology on a comprehensive set of benchmarks. For 3-level designs, RUCA saves power consumption by 43.71% within 2% error and by 30.15% within 1% error on average.

Approximate Logic Synthesis by Genetic Algorithm with an Error Rate Guarantee

  • Chun-Ting Lee
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

Approximate computing is an emerging design technique for error-tolerant applications, which may improve circuit area, delay, or power consumption by trading off a circuit’s correctness. In this paper, we propose a novel approximate logic synthesis approach based on genetic algorithm targeting at depth minimization with an error rate guarantee. We conduct experiments on a set of IWLS 2005 and MCNC benchmarks. The experimental results demonstrate that the depth can be reduced by up to 50%, and 22% on average under a 5% error rate constraint. As compared with the state-of-the-art method, our approach can achieve an average of 159% more depth reduction under the same 5% error rate constraint.

SESSION: Technical Program: Logic Synthesis for AQFP, Quantum Logic, AI Driven and Efficient Data Layout for HBM

Depth-Optimal Buffer and Splitter Insertion and Optimization in AQFP Circuits

  • Alessandro Tempia Calvino
  • Giovanni De Micheli

The Adiabatic Quantum-Flux Parametron (AQFP) is an energy-efficient superconducting logic family. AQFP technology requires buffer and splitting elements (B/S) to be inserted to satisfy path-balancing and fanout-branching constraints. B/S insertion policies and optimization strategies have been recently proposed to minimize the number of buffers and splitters needed in an AQFP circuit. In this work, we study the B/S insertion and optimization methods. In particular, the paper proposes: i) an algorithm for B/S insertion that guarantees global depth optimality; ii) a new approach for B/S optimization based on minimum register retiming; iii) a B/S optimization flow based on (i), (ii), and existing work. We show that our approach reduces the number of B/S up to 20% while guaranteeing optimal depth and providing a 55X speed-up in run time compared to the state-of-the-art.

Area-Driven FPGA Logic Synthesis Using Reinforcement Learning

  • Guanglei Zhou
  • Jason H. Anderson

Logic synthesis involves a rich set of optimization algorithms applied in a specific sequence to a circuit netlist prior to technology mapping. A conventional approach is to apply a fixed “recipe” of such algorithms deemed to work well for a wide range of different circuits. We apply reinforcement learning (RL) to determine a unique recipe of algorithms for each circuit. Feature-importance analysis is conducted using a random-forest classifier to prune the set of features visible to the RL agent. We demonstrate conclusive learning by the RL agent and show significant FPGA area reductions vs. the conventional approach (resyn2). In addition to circuit-by-circuit training and inference, we also train an RL agent on multiple circuits, and then apply the agent to optimize: 1) the same set of circuits on which it was trained, and 2) an alternative set of “unseen” circuits. In both scenarios, we observe that the RL agent produces higher-quality implementations than the conventional approach. This shows that the RL agent is able to generalize, and perform beneficial logic synthesis optimizations across a variety of circuits.

Optimization of Reversible Logic Networks with Gate Sharing

  • Yung-Chih Chen
  • Feng-Jie Chao

Logic synthesis for quantum computing aims to transform a Boolean logic network into a quantum circuit. A conventional two-stage flow first synthesizes the given Boolean logic network into a reversible logic network composed of reversible logic gates. Then, it maps each reversible logic gate into quantum gates to generate a quantum circuit. The state-of-the-art method for the first stage takes advantage of the lookup-table (LUT) mapping technology for FPGAs to decompose the given Boolean logic network into sub-networks, and then maps the sub-networks into reversible logic networks. Although every sub-network is well synthesized, we observe that the reversible logic networks could be further optimized by sharing the reversible logic gates belonging to different sub-networks. Thus, in this paper, we propose a new optimization method for the reversible logic networks by sharing gates. We translate the problem of extracting shareable gates to the exclusive-sums-of-product term optimization problem. The experimental results show that the proposed method successfully optimizes the reversible logic networks generated by the LUT-based method. It is able to reduce an average of approximately 4% of quantum gate cost without increasing the number of ancilla lines for a set of IWLS 2005 benchmarks.

Iris: Automatic Generation of Efficient Data Layouts for High Bandwidth Utilization

  • Stephanie Soldavini
  • Donatella Sciuto
  • Christian Pilato

Optimizing data movements is becoming one of the biggest challenges in heterogeneous computing to cope with data deluge and, consequently, big data applications. When creating specialized accelerators, modern high-level synthesis (HLS) tools are increasingly efficient in optimizing the computational aspects, but data transfers have not been adequately improved. To combat this, novel architectures such as High-Bandwidth Memory with wider data busses have been developed so that more data can be transferred in parallel. Designers must tailor their hardware/software interfaces to fully exploit the available bandwidth. HLS tools can automate this process, but the designer must follow strict coding-style rules. If the bus width is not evenly divisible by the data width (e.g., when using custom-precision data types) or if the arrays are not power-of-two length, the HLS-generated accelerator will likely not fully utilize the available bandwidth, demanding even more manual effort from the designer. We propose a methodology to automatically find and implement a data layout that, when streamed between memory and an accelerator, uses a higher percentage of the available bandwidth than a naive or HLS-optimized design. We borrow concepts from multiprocessor scheduling to achieve such high efficiency.

SESSION: Technical Program: University Design Contest

ViraEye: An Energy-Efficient Stereo Vision Accelerator with Binary Neural Network in 55 nm CMOS

  • Yu Zhang
  • Gang Chen
  • Tao He
  • Qian Huang
  • Kai Huang

This paper presents the ViraEye chip, an energy-efficient stereo vision accelerator based on the binary neural network (BNN) to achieve high-quality and real-time stereo estimation. This stereo vision accelerator is designed as an end-to-end full pipeline architecture where all processing procedures, including stereo rectification, BNNs, cost aggregation and post-processing, are implemented on the ViraEye chip. ViraEye allows for top level pipelining between accelerator and image sensors, and no external CPUs or GPUs are required. The accelerator is implemented using SMIC 55nm CMOS technology and achieves top-performing processing speed in terms of million disparity estimations per second (MDE/s) metric among the existing ASIC in the open literature.

A 1.2nJ/Classification Fully Synthesized All-Digital Asynchronous Wired-Logic Processor Using Quantized Non-Linear Function Blocks in 0.18μm CMOS

  • Rei Sumikawa
  • Kota Shiba
  • Atsutake Kosuge
  • Mototsugu Hamada
  • Tadahiro Kuroda

A 5.3 times smaller and 2.6 times more energy-efficient all-digital wired-logic processor which infers MNIST with 90.6% accuracy and 1.2nJ of energy consumption has been developed. To improve area efficiency of wired-logic architecture, nonlinear neural network (NNN), which is a neuron and synapse efficient network, and logical compression technology to implement it with area-saving and low-power digital circuits by logic synthesis are proposed, and asynchronous digital combinational circuit DNN hardware has been developed.

A Fully Synthesized 13.7μJ/Prediction 88% Accuracy CIFAR-10 Single-Chip Data-Reusing Wired-Logic Processor Using Non-Linear Neural Network

  • Yao-Chung Hsu
  • Atsutake Kosuge
  • Rei Sumikawa
  • Kota Shiba
  • Mototsugu Hamada
  • Tadahiro Kuroda

An FPGA-based wired-logic CNN processor is presented that can process CIFAR-10 at 13.7μJ/prediction with an 88% accuracy, which is 2,036 times more energy-efficient than the prior state-of-the-art FPGA-based processor. Energy efficiency is greatly improved by implementing all processing elements and wirings in parallel on a single FPGA chip to eliminate the memory access. By utilizing both (1) a non-linear neural network which saves on neurons and synapses and (2) a shift register-based wired-logic architecture, hardware resource usage is reduced by three orders of magnitude.

A Multimode Hybrid Memristor-CMOS Prototyping Platform Supporting Digital and Analog Projects

  • K.-E. Harabi
  • C. Turck
  • M. Drouhin
  • A. Renaudineau
  • T. Bersani-Veroni
  • D. Querlioz
  • T. Hirtzlin
  • E. Vianello
  • M Bocquet
  • J.-M. Portal

We present an integrated circuit fabricated in a process co-integrating CMOS and hafnium-oxide memristor technology, which provides a prototyping platform for projects involving memristors. Our circuit includes the periphery circuitry for using memristors within digital circuits, as well as an analog mode with direct access to memristors. The platform allows optimizing the conditions for reading and writing memristors, as well as developing and testing innovative memristor-based neuromorphic concepts.

A Fully Synchronous Digital LDO with Built-in Adaptive Frequency Modulation and Implicit Dead-Zone Control

  • Shun Yamaguchi
  • Mahfuzul Islam
  • Takashi Hisakado
  • Osami Wada

This paper proposes a synchronous digital LDO with adaptive clocking and dead-zone control without additional reference voltages. A test chip fabricated in a commercial 65 nm CMOS general-purpose (GP) process achieves 580x frequency modulation with 99.9% maximum efficiency at 0.6V supply.

Demonstration of Order Statistics Based Flash ADC in a 65nm Process

  • Mahfuzul Islam
  • Takehiro Kitamura
  • Takashi Hisakado
  • Osami Wada

This paper presents measurement results of a flash ADC that utilizes offset voltages as references. To operate the minimum number of comparators, we select the target comparators based on the rankings of the offset voltage. We present performance improvement by tuning offset voltage distribution using multiple comparator groups under the same power. A test chip in a commercial 65 nm GP process demonstrates the ADCs at 1 GS/s operation.

SESSION: Technical Program: Synthesis of Quantum Circuits and Systems

A SAT Encoding for Optimal Clifford Circuit Synthesis

  • Sarah Schneider
  • Lukas Burgholzer
  • Robert Wille

Executing quantum algorithms on a quantum computer requires compilation to representations that conform to all restrictions imposed by the device. Due to devices’ limited coherence times and gate fidelities, the compilation process has to be optimized as much as possible. To this end, an algorithm’s description first has to be synthesized using the device’s gate library. In this paper, we consider the optimal synthesis of Clifford circuits—an important subclass of quantum circuits, with various applications. Such techniques are essential to establish lower bounds for (heuristic) synthesis methods and gauging their performance. Due to the huge search space, existing optimal techniques are limited to a maximum of six qubits. The contribution of this work is twofold: First, we propose an optimal synthesis method for Clifford circuits based on encoding the task as a satisfiability (SAT) problem and solving it using a SAT solver in conjunction with a binary search scheme. The resulting tool is demonstrated to synthesize optimal circuits for up to 26 qubits—more than four times as many as the current state of the art. Second, we experimentally show that the overhead introduced by state-of-the-art heuristics exceeds the lower bound by 27 % on average. The resulting tool is publicly available at

An SMT-Solver-Based Synthesis of NNA-Compliant Quantum Circuits Consisting of CNOT, H and T Gates

  • Kyohei Seino
  • Shigeru Yamashita

It is natural to assume that we can perform quantum operations between only two adjacent physical qubits (quantum bits) to realize a quantum computer for both the current and possible future technologies. This restriction is called the Nearest Neighbor Architecture (NNA) restriction. This paper proposes an SMT-solver-based synthesis of quantum circuits consisting of CNOT, H, and T gates to satisfy the NNA restriction. Although the existing SMT-solver-based synthesis cannot treat H and T gates directly, our method treats the functionality of quantum-specific T and H gates carefully so that we can utilize an SMT-solver to minimize the number of CNOT gates; unlike the existing SMT-solver-based methods, our method considers “Don’t Care” conditions in intermediate points of a quantum circuit by exploiting the property of T gates to reduce CNOT gates. Experimental results show that our approach can reduce the number of CNOT gates by 58.11% on average compared to the naive application of the existing method which does not consider the “Don’t Care” condition.

Compilation of Entangling Gates for High-Dimensional Quantum Systems

  • Kevin Mato
  • Martin Ringbauer
  • Stefan Hillmich
  • Robert Wille

Most quantum computing architectures to date natively support multi-valued logic, albeit being typically operated in a binary fashion. Multi-valued, or qudit, quantum processors have access to much richer forms of quantum entanglement, which promise to significantly boost the performance and usefulness of quantum devices. However, much of the theory as well as corresponding design methods required for exploiting such hardware remain insufficient and generalizations from qubits are not straightforward. A particular challenge is the compilation of quantum circuits into sets of native qudit gates supported by state-of-the-art quantum hardware. In this work, we address this challenge by introducing a complete workflow for compiling any two-qudit unitary into an arbitrary native gate set. Case studies demonstrate the feasibility of both, the proposed approach as well as the corresponding implementation (which is freely available at

WIT-Greedy: Hardware System Design of Weighted ITerative Greedy Decoder for Surface Code

  • Wang Liao
  • Yasunari Suzuki
  • Teruo Tanimoto
  • Yosuke Ueno
  • Yuuki Tokunaga

Large error rates of quantum bits (qubits) are one of the main difficulties in the development of quantum computing. Performing quantum error correction (QEC) with surface codes is considered the most promising approach to reduce the error rates of qubits effectively. To perform error correction, we need an error-decoding unit, which estimates errors in the noisy physical qubits repetitively, to create a robust logical qubit. While complicated graph-matching problems must be solved within a strict time restriction for the error decoding, several hardware implementations that satisfy the restriction at a large code distance have been proposed.

However, the existing decoder designs are still challenging in reducing the logical error rate. This is because they assume that the error rates of physical qubits are uniform while they have large variations in practice. According to our numerical simulation based on the quantum chip with the largest qubit number, neglecting the non-uniform error properties of a real quantum chip in the decoding process induces significant degradation of the logical error rate and spoils the benefit of QEC. To take the non-uniformity into account, decoders need to solve matching problems on a weighted graph, but they are difficult to solve using the existing designs without exceeding the time limit of decoding. Therefore, a decoder that can treat both the non-uniform physical error rates and the large surface code is strongly demanded.

In this paper, we propose a hardware design of decoding units for the surface code that can treat the non-identical error properties with small latency at a large code distance. The key idea of our design is 1) constructing a look-up table for calculating the shortest paths between nodes in a weighted graph and 2) enabling parallel processing during decoding. The implementation results in field programmable gate array (FPGA) indicate that our design scales up to code distance 11 within a microsecond-level delay, which is comparable to the existing state-of-the-art designs, while our design can treat non-identical errors.

Quantum Data Compression for Efficient Generation of Control Pulses

  • Daniel Volya
  • Prabhat Mishra

In order to physically realize a robust quantum gate, a specifically tailored laser pulse needs to be derived via strategies such as quantum optimal control. Unfortunately, such strategies face exponential complexity with quantum system size and become infeasible even for moderate-sized quantum circuits. In this paper, we propose an automated framework for effective utilization of these quantum resources. Specifically, this paper makes three important contributions. First, we utilize an effective combination of register compression and dimensionality reduction to reduce the area of a quantum circuit. Next, due to the properties of an autoencoder, the compressed gates produced are robust even in the presence of noise. Finally, our proposed compression reduces the computation time of quantum control. Experimental evaluation using popular quantum algorithms demonstrates that our proposed approach can enable efficient generation of noise-resilient control pulses while state-of-the-art fails to handle large-scale quantum systems.

SESSION: Technical Program: In-Memory/Near-Memory Computing for Neural Networks

Toward Energy-Efficient Sparse Matrix-Vector Multiplication with near STT-MRAM Computing Architecture

  • Yueting Li
  • He Zhang
  • Xueyan Wang
  • Hao Cai
  • Yundong Zhang
  • Shuqin Lv
  • Renguang Liu
  • Weisheng Zhao

Sparse Matrix-Vector Multiplication (SpMV) is one of the vital computational primitives used in modern workloads. SpMV performs memory access, leading to unnecessary data transmission, massive data access, and redundant multiplicative accumulators. Therefore, we propose the near spin-transfer torque magnetic random access memory (STT-MRAM) processing architecture from three optimization perspectives. These optimizations include (1) the NMP controller receives the instruction through the AXI4 bus to implement the SpMV operation in the following steps, identifies valid data, and encodes the index depending on the kernel size, (2) the NMP controller uses high-level synthesis dataflow in the shared buffer for achieving better performance throughput while do not consume bus bandwidth, and (3) the configurable MACs are implemented in the NMP core without matching step entirely during the multiplication. Using these optimizations, the NMP architecture can access the pipelined STT-MRAM (read bandwidth is 26.7GB/s). The experimental simulation results show that this design achieves up to 66x and 28x speedup compared with state-of-the-art ones and 69x speedup without sparse optimization.

RIMAC: An Array-Level ADC/DAC-Free ReRAM-Based in-Memory DNN Processor with Analog Cache and Computation

  • Peiyu Chen
  • Meng Wu
  • Yufei Ma
  • Le Ye
  • Ru Huang

By directly computing in analog domain, processing-in-memory (PIM) is emerging as a promising alternative to overcome the memory bottleneck of traditional von-Neuman architecture, especially for deep neural networks (DNNs). However, the data outside PIM macros in most existing PIM accelerators are stored and operated as digital signals that require massive expensive digital-to-analog (D/A) and analog-to-digital (A/D) converters. In this work, an array-level ADC/DAC-free ReRAM-based in-memory DNN processor named RIMAC is proposed, which accelerates various DNNs in pure analog-domain with analog cache and analog computation modules to eliminate the expensive D/A and A/D conversions. Our experiment result shows the peak energy efficiency is improved by about 34.8×, 97.6×, 10.7×, and 14.0× compared to PRIME, ISAAC, Lattice, and 21’DAC for various DNNs on ImageNet, respectively.

Crossbar-Aligned & Integer-Only Neural Network Compression for Efficient in-Memory Acceleration

  • Shuo Huai
  • Di Liu
  • Xiangzhong Luo
  • Hui Chen
  • Weichen Liu
  • Ravi Subramaniam

Crossbar-based In-Memory Computing (IMC) accelerators preload the entire Deep Neural Network (DNN) into crossbars before inference. However, devices with limited crossbars cannot infer increasingly complex models. IMC-pruning can reduce the usage of crossbars, but current methods need expensive extra hardware for data alignment. Meanwhile, quantization can represent weights of DNNs by integers, but they employ non-integer scaling factors to ensure accuracy, requiring costly multipliers. In this paper, we first propose crossbar-aligned pruning to reduce the usage of crossbars without hardware overhead. Then, we introduce a quantization scheme to avoid multipliers in IMC devices. Finally, we design a learning method to complete above two schemes and cultivate an optimal compact DNN with high accuracy and large sparsity during training. Experiments demonstrate that our framework, compared to state-of-the-art methods, achieves larger sparsity and lower power consumption with higher accuracy. We even improve the accuracy by 0.43% for VGG-16 with an 88.25% sparsity rate on the Cifar-10 dataset. Compared to the original model, we reduce computing power and area by 19.8x and 18.8x, respectively.

Discovering the in-Memory Kernels of 3D Dot-Product Engines

  • Muhammad Rashedul Haq Rashed
  • Sumit Kumar Jha
  • Rickard Ewetz

The capability of resistive random access memory (ReRAM) to implement multiply-and-accumulate operations promises unprecedented efficiency in the design of scientific computing applications. While the use of two-dimensional (2D) ReRAM crossbar has been well investigated in the last few years, the design of in-memory dot-product engines using three-dimensional (3D) ReRAM crossbars remains a topic of active investigations. In this paper, we holistically explore how to leverage 3D ReRAM crossbars with several (2 to 7) stacked crossbar layers. In contrast, previous studies have focused on 3D ReRAM with at most 2 stacked crossbar layers. We first discover the in-memory compute kernels that can be realized using 3D ReRAM with multiple stacked crossbar layers. We discover that matrices with different sparsity patterns can be realized by appropriately assigning the inputs and outputs to the perpendicular metal wires within the 3D stack. We present a design automation tool to map sparse matrices within scientific computing applications to the discovered 3D kernels. The proposed framework is evaluated using 20 applications from the SuitSparse Matrix Collection. Compared with 2D crossbars, the proposed approach using 3D crossbars improves area, energy, and latency with 2.02X, 2.37X, 2.45X, respectively.

RVComp: Analog Variation Compensation for RRAM-Based in-Memory Computing

  • Jingyu He
  • Yucong Huang
  • Miguel Lastras
  • Terry Tao Ye
  • Chi-Ying Tsui
  • Kwang-Ting Cheng

Resistive Random Access Memory (RRAM) has shown great potential in accelerating memory-intensive computation in neural network applications. However, RRAM-based computing suffers from significant accuracy degradation due to the inevitable device variations. In this paper, we propose RVComp, a fine-grained analog Compensation approach to mitigate the accuracy loss of in-memory computing incurred by the Variations of the RRAM devices. Specifically, weights in the RRAM crossbar are accompanied by dedicated compensation RRAM cells to offset their programming errors with a scaling factor. A programming target shifting mechanism is further designed with the objectives of reducing the hardware overhead and minimizing the compensation errors under large device variations. Based on these two key concepts, we propose double and dynamic compensation schemes and the corresponding support architecture. Since the RRAM cells only account for a small fraction of the overall area of the computing macro due to the dominance of the peripheral circuitry, the overall area overhead of RVComp is low and manageable. Simulation results show RVComp achieves a negligible 1.80% inference accuracy drop for ResNet18 on the CIFAR-10 dataset under 30% device variation with only 7.12% area and 5.02% power overhead and no extra latency.

SESSION: Technical Program: Machine Learning-Based Design Automation

Rethink before Releasing Your Model: ML Model Extraction Attack in EDA

  • Chen-Chia Chang
  • Jingyu Pan
  • Zhiyao Xie
  • Jiang Hu
  • Yiran Chen

Machine learning (ML)-based techniques for electronic design automation (EDA) have boosted the performance of modern integrated circuits (ICs). Such achievement makes ML model to be of importance for the EDA industry. In addition, ML models for EDA are widely considered having high development cost because of the time-consuming and complicated training data generation process. Thus, confidentiality protection for EDA models is a critical issue. However, an adversary could apply model extraction attacks to steal the model in the sense of achieving the comparable performance to the victim’s model. As model extraction attacks have posed great threats to other application domains, e.g., computer vision and natural language process, in this paper, we study model extraction attacks for EDA models under two real-world scenarios. It is the first work that (1) introduces model extraction attacks on EDA models and (2) proposes two attack methods against the unlimited and limited query budget scenarios. Our results show that our approach can achieve competitive performance with the well-trained victim model without any performance degradation. Based on the results, we demonstrate that model extraction attacks truly threaten the EDA model privacy and hope to raise concerns about ML security issues in EDA.

MacroRank: Ranking Macro Placement Solutions Leveraging Translation Equivariancy

  • Yifan Chen
  • Jing Mai
  • Xiaohan Gao
  • Muhan Zhang
  • Yibo Lin

Modern large-scale designs make extensive use of heterogeneous macros, which can significantly affect routability. Predicting the final routing quality in the early macro placement stage can filter out poor solutions and speed up design closure. By observing that routing is correlated with the relative positions between instances, we propose MacroRank, a macro placement ranking framework leveraging translation equivariance and a Learning to Rank technique. The framework is able to learn the relative order of macro placement solutions and rank them based on routing quality metrics like wirelength, number of vias, and number of shorts. The experimental results show that compared with the most recent baseline, our framework can improve the Kendall rank correlation coefficient by 49.5% and the average performance of top-30 prediction by 8.1%, 2.3%, and 10.6% on wirelength, vias, and shorts, respectively.

BufFormer: A Generative ML Framework for Scalable Buffering

  • Rongjian Liang
  • Siddhartha Nath
  • Anand Rajaram
  • Jiang Hu
  • Haoxing Ren

Buffering is a prevalent interconnect optimization technique to help timing closure and is often performed after placement. A common buffering approach is to construct a Steiner tree and then buffers are inserted on the tree based on Ginneken-Lillis style algorithm. Such an approach is difficult to scale with large nets. Our work attempts to solve this problem with a generative machine-learning (ML) approach without Steiner tree construction. Our approach can extract and reuse knowledge from high quality samples and therefore has significantly improved scalability. A generative ML framework, BufFormer, is proposed to construct abstract tree topology while simultaneously determining buffer sizes & locations. A baseline method, FLUTE-based Steiner tree construction followed by Ginneken-Lillis style buffer insertion, is implemented to generate training samples. After training, BufFormer can produce solutions for unseen nets highly comparable to baseline results with a correlation coefficient 0.977 in terms of buffer area and 0.934 for driver-sink delays. On average, BufFormer-generated tree achieves similar delays with slightly larger buffer area. And up to 160X speedup can be achieved for large nets when running on a GPU over the baseline on a single CPU thread.

Decoupling Capacitor Insertion Minimizing IR-Drop Violations and Routing DRVs

  • Daijoon Hyun
  • Younggwang Jung
  • Insu Cho
  • Youngsoo Shin

Decoupling capacitor (decap) cells are inserted near function cells of high switching activities so that their IR-drop can be suppressed. Their design becomes more complex and uses higher metal layers, thereby starting to manifest themselves as routing blockage. Post-placement decap insertion, with a goal of minimizing both IR-drop violations and routing design rule violations (DRVs), is addressed for the first time. U-Net with graph convolutional network is introduced to predict routing DRV penalty. The decap insertion problem is formulated and a heuristic algorithm is presented. Experiments with a few test circuits demonstrate that DRVs are reduced by 16% on average with no IR-drop violations, compared to a conventional method which does not explicitly consider DRVs. This results in 48% reduction in routing runtime and 23% improvement in total negative slack.

DPRoute: Deep Learning Framework for Package Routing

  • Yeu-Haw Yeh
  • Simon Yi-Hung Chen
  • Hung-Ming Chen
  • Deng-Yao Tu
  • Guan-Qi Fang
  • Yun-Chih Kuo
  • Po-Yang Chen

For routing closures in package designs, net order is critical due to complex design rules and severe wire congestion. However, existing solutions are deliberatively designed using heuristics and are difficult to adapt to different design requirements unless updating the algorithm. This work presents a novel deep learning-based routing framework that can keep improving by accumulating data to accommodate increasingly complex design requirements. Based on the initial routing results, we apply deep learning to concurrent detailed routing to deal with the problem of net ordering decisions. We use multi-agent deep reinforcement learning to learn routing schedules between nets. We regard each net as an agent, which needs to consider the actions of other agents while making pathing decisions to avoid routing conflict. Experimental results on industrial package design show that the proposed framework can improve the number of design rule violations by 99.5% and the wirelength by 2.9% for initial routing.

SESSION: Technical Program: Advanced Techniques for Yields, Low Power and Reliability

High-Dimensional Yield Estimation Using Shrinkage Deep Features and Maximization of Integral Entropy Reduction

  • Shuo Yin
  • Guohao Dai
  • Wei W. Xing

Despite the fast advances in high-sigma yield analysis with the help of machine learning techniques in the past decade, one of the main challenges, the curse of “dimensionality”, which is inevitable when dealing with modern large-scale circuits, remains unsolved. To resolve this challenge, we propose an absolute shrinkage deep kernel learning, ASDK, which automatically identifies the dominant process variation parameters in a nonlinear-correlated deep kernel and acts as a surrogate model to emulate the expensive SPICE simulation. To further improve the yield estimation efficiency, we propose a novel maximization of approximated entropy reduction for an efficient model update, which is also enhanced with parallel batch sampling for parallel computing, making it ready for practical deployment. Experiments on SRAM column circuits demonstrate the superiority of ASDK over the state-of-the-art (SOTA) approaches in terms of accuracy and efficiency with up to 11.1x speedup over SOTA methods.

MIA-Aware Detailed Placement and VT Reassignment for Leakage Power Optimization

  • Hung-Chun Lin
  • Shao-Yun Fang

As the feature size decreases, leakage power consumption becomes an important target in the design. Using multiple threshold voltages (VTs) in cell-based designs is a popular technique to simultaneously optimize circuit timing and minimize leakage power. However, an arbitrary cell placement result of a multi-VT design may suffer from many design rule violations induced by the Minimum-Implant-Area (MIA) rule, and thus it is necessary to take the MIA rules into consideration during the detailed placement stage. The state-of-the-art works on detailed placement comprehensively tackling MIA rules either disallow VT change or only allow reducing cell VTs to avoid timing degradation. However, these limitations may either result in larger cell displacement or cause overhead in leakage power. In this paper, we propose an optimization framework of VT reassignment and detailed placement to simultaneously consider MIA rules and leakage power minimization under timing constraints. Experimental results show that compared with the state-of-the-art works, the proposed framework can efficiently achieve better trade-off between leakage power and cell displacement.

SLOGAN: SDC Probability Estimation Using Structured Graph Attention Network

  • Junchi Ma
  • Sulei Huang
  • Zongtao Duan
  • Lei Tang
  • Luyang Wang

The trend of progressive technology scaling makes the computing system more susceptible to soft errors. The most critical issue that soft error incurs is silent data corruption (SDC) since SDC occurs silently without any warnings to users. Estimating SDC probability of a program is the first and essential step towards designing protection mechanism. Prior work suffers from prediction inaccuracy since the proposed heuristic-based models fail to describe the semantic of fault propagation. We propose a novel approach SLOGAN which transfers the prediction of SDC probability into a graph regression task. A program is represented in the form of dynamic dependence graph. To capture the rich semantic of fault propagation, we apply structured graph attention network, which includes node-level, graph-level and layer-level self-attention. With the learned attention coefficients from node-level, graph-level, and layer-level self-attention, the importance of edges, nodes, and layers to the fault propagation can be fully considered. We generate the graph embedding by weighted aggregation of the embeddings of nodes and compute the SDC probability by the regression model. The experiment shows that SLOGAN achieves higher SDC accuracy than state-of-the-art methods with a low time cost.

SESSION: Technical Program: Microarchitectural Design and Neural Networks

Microarchitecture Power Modeling via Artificial Neural Network and Transfer Learning

  • Jianwang Zhai
  • Yici Cai
  • Bei Yu

Accurate and robust power models are highly demanded to explore better CPU designs. However, previous learning-based power models ignore the discrepancies in data distribution among different CPU designs, making it difficult to use data from the historical configuration to aid modeling for new target configuration. In this paper, we investigate the transferability of power models and propose a microarchitecture power modeling method based on transfer learning (TL). A novel TL method for artificial neural network (ANN)-based power models is proposed, where cross-domain mixup generates more auxiliary samples close to the target configuration to fill in the distribution discrepancy and domain-adversarial training extracts domain-invariant features to complete the target model construction. Experiments show that our method greatly improves the model transferability and can effectively utilize the knowledge of the existing CPU configuration to facilitate target power model construction.

MUGNoC: A Software-Configured Multicast-Unicast-Gather NoC for Accelerating CNN Dataflows

  • Hui Chen
  • Di Liu
  • Shiqing Li
  • Shuo Huai
  • Xiangzhong Luo
  • Weichen Liu

Current communication infrastructures for convolutional neural networks (CNNs) only focus on specific transmission patterns, not applicable to benefit the whole system if the dataflow changes or different dataflows run in one system. To reduce data movement, various CNN dataflows are presented. For these dataflows, parameters and results are delivered using different traffic patterns, i.e., multicast, unicast, and gather, preventing dataflow-specific communication backbones from benefiting the entire system if the dataflow changes or different dataflows run in the same system. Thus, in this paper, we propose MUG-NoC to support typical traffic patterns and accelerate them, therefore boosting multiple dataflows. Specifically, (i) we for the first time support multicast in 2D-mesh software configurable NoC by revising router configuration and proposing the efficient multicast routing; (ii) we decrease unicast latency by transmitting data through the different routes in parallel; (iii) we reduce output gather overheads by pipelining basic dataflow units. Experiments show that at least our proposed design can reduce 39.2% total data transmission time compared with the state-of-the-art CNN communication backbone.

COLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU

  • Bo-Wun Cheng
  • En-Ming Huang
  • Chen-Hao Chao
  • Wei-Fang Sun
  • Tsung-Tai Yeh
  • Chun-Yi Lee

In this work, we aim to capture replicated cache requests between Stream Multiprocessors (SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of modern GPUs. To achieve this objective, we incorporate a per-cluster Cache line Ownership Lookup tABle (COLAB) that keeps track of which SM within a cluster holds a copy of a specific cache line. With the assistance of COLAB, SMs can collaboratively and efficiently process replicated cache requests within SM clusters by redirecting them according to the ownership information stored in COLAB. By servicing replicated cache requests within SM clusters that would otherwise consume precious NoC bandwidth, the heavy pressure on the NoC interconnection can be eased. Our experimental results demonstrate that the adoption of COLAB can indeed alleviate the excessive NoC pressure caused by replicated cache requests, and improve the overall system throughput of the baseline GPU while incurring minimal overhead. On average, COLAB can reduce 38% of the NoC traffic and improve instructions per cycle (IPC) by 43%.

SESSION: Technical Program: Novel Techniques for Scheduling and Memory Optimizations in Embedded Software

Mixed-Criticality with Integer Multiple WCETs and Dropping Relations: New Scheduling Challenges

  • Federico Reghenzani
  • William Fornaciari

Scheduling Mixed-Criticality (MC) workload is a challenging problem in real-time computing. Earliest Deadline First Virtual Deadline (EDF-VD) is one of the most famous scheduling algorithm with optimal speedup bound properties. However, when EDF-VD is used to schedule task sets using a model with additional or relaxed constraints, its scheduling properties change. Inspired by an application of MC to the scheduling of fault tolerant tasks, in this article, we propose two models for multiple criticality levels: the first is a specialization of the MC model, and the second is a generalization of it. We then show, via formal proofs and numerical simulations, that the former considerably improves the speedup bound of EDF-VD. Finally, we provide the proofs related to the optimality of the two models, identifying the need of new scheduling algorithms.

An Exact Schedulability Analysis for Global Fixed-Priority Scheduling of the AER Task Model

  • Thilanka Thilakasiri
  • Matthias Becker

Commercial off-the-shelf (COTS) multi-core platforms offer high performance and large availability of processing resources. Increased contention when accessing shared resources is a result of the high parallelism and one of the main challenges when realtime applications are deployed to these platforms. As a result, several execution models have been proposed to avoid contention by separating access to shared resources from execution.

In this work, we consider the Acquisition-Execution-Restitution (AER) model where contention to shared resources is avoided by design. We propose an exact schedulability test for the AER model under global fixed-priority scheduling using timed automata where we describe the schedulability problem as a reachability problem. To the best of our knowledge, this is the first exact schedulability test for the AER model under global fixed-priority scheduling on multiprocessor platforms. The performance of the proposed approach is evaluated using synthetic experiments and provides up to 65% more schedulable task sets than the state-of-the-art.

Skyrmion Vault: Maximizing Skyrmion Lifespan for Enabling Low-Power Skyrmion Racetrack Memory

  • Syue-Wei Lu
  • Shuo-Han Chen
  • Yu-Pei Liang
  • Yuan-Hao Chang
  • Kang Wang
  • Tseng-Yi Chen
  • Wei-Kuan Shih

Skyrmion racetrack memory (SK-RM) has demonstrated great potential as a high-density and low-cost nonvolatile memory. Nevertheless, even though random data accesses are supported on SK-RM, data accesses can not be carried out on individual data bit directly. Instead, special skyrmion manipulations, such as injecting and shifting, are required to support random information update and deletion. With such special manipulations, the latency and energy consumption of skyrmion manipulations could quickly accumulate and induce additional overhead on the data read/write path of SK-RM. Meanwhile, injection operation consumes more energy and has higher latency than any other manipulations. Although prior arts have tried to alleviate the overhead of skyrmion manipulations, the possibility of minimizing injections through buffering skyrmions for future reuse and energy conservation receives much less attention. Such observation motivates us to propose the concept of skyrmion vault to effectively utilize the skyrmion buffer track structure for energy conservation through maximizing the lifespan of injected skyrmions and minimizing the number of skyrmion injections. Experimental results have shown promising improvements in both energy consumption and skyrmions’ lifespan.

SESSION: Technical Program: Efficient Circuit Simulation and Synthesis for Analog Designs

Parallel Incomplete LU Factorization Based Iterative Solver for Fixed-Structure Linear Equations in Circuit Simulation

  • Lingjie Li
  • Zhiqiang Liu
  • Kan Liu
  • Shan Shen
  • Wenjian Yu

A series of fixed-structure sparse linear equations are solved in a circuit simulation process. We propose a parallel incomplete LU (ILU) preconditioned GMRES solver for those equations. A new subtree-based scheduling algorithm for ILU factorization and forward/backward substitution is adopted to overcome the load-balancing and data locality problem of the conventional levelization-based scheduling. Experimental results show that the proposed scheduling algorithm can achieve up to 2.6X speedup for ILU factorization and 3.1X speedup for forward/backward substitution compared to the levelization-based scheduling. The proposed ILU-GMRES solver achieves around 4X parallel speedup with 8 threads, which is up to 2.1X faster than that based on the levelization-based scheme. The proposed parallel solver also shows remarkable advantage over existing methods (including HSPICE) on transient simulation of linear and nonlinear circuits.

Accelerated Capacitance Simulation of 3-D Structures with Considerable Amounts of General Floating Metals

  • Jiechen Huang
  • Wenjian Yu
  • Mingye Song
  • Ming Yang

Floating metals are special conductors introduced into conductor structures by design for manufacturing (DFM). They bring difficulty to accurate capacitance simulation. In this work, we aim to accelerate the floating random walk (FRW) based capacitance simulation for structures with considerable amounts of general floating metals. We first discuss how the existing modified FRW is affected by the integral surfaces of floating metals and propose an improved placement of integral surface. Then, we propose a hybrid approach called incomplete network reduction to avoid random transitions trapped by floating metals. Experiments on structures from IC and FPD design, which involves multiple floating metals and single or multiple master conductors, have shown the effectiveness of the proposed techniques. The proposed techniques reduce the computational time of capacitance calculation, while preserving the accuracy.

On Automating Finger-Cap Array Synthesis with Optimal Parasitic Matching for Custom SAR ADC

  • Cheng-Yu Chiang
  • Chia-Lin Hu
  • Mark Po-Hung Lin
  • Yu-Szu Chung
  • Shyh-Jye Jou
  • Jieh-Tsorng Wu
  • Shiuh-hua Wood Chiang
  • Chien-Nan Jimmy Liu
  • Hung-Ming Chen

Due to its excellent power efficiency, the successive-approximation-register (SAR) analog-to-digital converter (ADC) is an attractive design choice for low-power ADC implements. In analog layout design, the parasitics induced by interconnecting wires and elements affect the accuracy and performance of the device. Due to the requirement of low-power and high-speed, series of very small lateral metal-metal capacitor units are usually adopted as the architecture of capacitor array. Besides power consumption and area reduction, the parasitic capacitance would significantly affect the matching properties and settling time of capacitors. This work presents a framework to synthesize good-quality binary-weighted capacitors for custom SAR ADC. Also, this work proposes a parasitic-aware ILP-based weight-dynamic network routing algorithm to generate a layout considering parasitic capacitance and capacitance ratio mismatch simultaneously. The experimental result shows that the effective number of bits (ENOB) of the layout generated by our approach is comparable to or better than that of manual design and other automated works, closing the gap between pre-sim and post-sim results.

SESSION: Technical Program: Security of Heterogeneous Systems Containing FPGAs

FPGANeedle: Precise Remote Fault Attacks from FPGA to CPU

  • Mathieu Gross
  • Jonas Krautter
  • Dennis Gnad
  • Michael Gruber
  • Georg Sigl
  • Mehdi Tahoori

FPGA as general-purpose accelerators can greatly improve system efficiency and performance in cloud and edge devices alike. However, they have recently become the focus of remote attacks, such as fault and side-channel attacks from one to another user of a part of the FPGA fabric. In this work, we consider system-on-chip platforms, where an FPGA and an embedded processor core are located on the same die. We show that the embedded processor core is vulnerable to voltage drops generated by the FPGA logic. Our experiments demonstrate the possibility of compromising the data transfer from external DDR memory to the processor cache hierarchy. Furthermore, we were also able to fault and skip instructions executed on an ARM Cortex-A9 core. The FPGA based fault injection is shown precise enough to recover the secret key of an AES T-tables implementation found in the mbedTLS library.

FPGA Based Countermeasures against Side Channel Attacks on Block Ciphers

  • Darshana Jayasinghe
  • Brian Udugama
  • Sri Parameswaran

Field Programmable Gate Arrays (FPGAs) are increasingly ubiquitous. FPGAs enable hardware acceleration and reconfigurability. Any security breach or attack on critical computations occurring on an FPGA can lead to devastating consequences. Side-channel attacks have the ability to reveal secret information, such as secret keys from cryptographic circuits running on FPGAs. Power dissipation (PA), Electromagnetic (EM) radiation, fault injection (FI) and remote power dissipation (RPA) attacks are the most compelling and noninvasive side-channel attacks demonstrated on FPGAs. This paper discusses two PA attack countermeasures (QuadSeal and RFTC) and one RPA attack countermeasure (UCloD) in detail to protect FPGAs.

SESSION: Technical Program: Novel Application & Architecture-Specific Quantization Techniques

Block-Wise Dynamic-Precision Neural Network Training Acceleration via Online Quantization Sensitivity Analytics

  • Ruoyang Liu
  • Chenhan Wei
  • Yixiong Yang
  • Wenxun Wang
  • Huazhong Yang
  • Yongpan Liu

Data quantization is an effective method to accelerate neural network training and reduce power consumption. However, it is challenging to perform low-bit quantized training: the conventional equal-precision quantization will lead to either high accuracy loss or limited bit-width reduction, while existing mixed-precision methods offer high compression potential but failed to perform accurate and efficient bit-width assignment. In this work, we propose DYNASTY, a block-wise dynamic-precision neural network training framework. DYNASTY provides accurate data sensitivity information through fast online analytics, and maintains stable training convergence with an adaptive bit-width map generator. Network training experiments on CIFAR-100 and ImageNet dataset are carried out, and compared to 8-bit quantization baseline, DYNASTY brings up to 5.1× speedup and 4.7× energy consumption reduction with no accuracy drop and negligible hardware overhead.

Quantization through Search: A Novel Scheme to Quantize Convolutional Neural Networks in Finite Weight Space

  • Qing Lu
  • Weiwen Jiang
  • Xiaowei Xu
  • Jingtong Hu
  • Yiyu Shi

Quantization has become an essential technique in compressing deep neural networks for deployment onto resource-constrained hardware. It is noticed that, the hardware efficiency of implementing quantized networks is highly coupled with the actual values to be quantized into, and therefore, with given bit widths, we can smartly choose a value space to further boost the hardware efficiency. For example, using weights of only integer powers of two, multiplication can be fulfilled by bit operations. Under such circumstances, however, existing quantization-aware training methods are either not suitable to apply or unable to unleash the expressiveness of very low bit-widths. For the best hardware efficiency, we revisit the quantization of convolutional neural networks and propose to address the training process from a weight-searching angle, as opposed to optimizing the quantizer functions as in existing works. Extensive experiments on CIFAR10 and ImageNet classification tasks are examined with implementations onto well-established CNN architectures, such as ResNet, VGG, and MobileNet, etc. It is shown the proposed method can achieve a lower accuracy loss than the state of arts, and/or improving implementation efficiency by using hardware-friendly weight values at the same time.

Multi-Wavelength Parallel Training and Quantization-Aware Tuning for WDM-Based Optical Convolutional Neural Networks Considering Wavelength-Relative Deviations

  • Ying Zhu
  • Min Liu
  • Lu Xu
  • Lei Wang
  • Xi Xiao
  • Shaohua Yu

Wavelength Division Multiplexing (WDM)-based Mach-Zehnder Interferometer Optical Convolutional Neural Networks (MZI-OCNNs) have emerged as a promising platform to accelerate convolutions that cost most computing sources in neural networks. However, the wavelength-relative imperfect split ratios and actual phase shifts in MZIs and quantization errors from the electronic configuration module will degrade the inference accuracy of WDM-based MZI-OCNNs and thus render them unusable in practice. In this paper, we propose a framework that models the split ratios and phase shifts under different wavelengths, incorporates them into OCNN training, and introduces quantization-aware tuning to maintain inference accuracy and reduce electronic module complexity. Consequently, the framework can improve the inference accuracy by 49%, 76%, and 76%, respectively, for LeNet5, VGG7, and VGG8 implemented with multi-wavelength parallel computing. And instead of using Float 32/64 quantization resolutions, only 5,6, and 4 bits are needed and fewer quantization levels are utilized for configuration signals.

Semantic Guided Fine-Grained Point Cloud Quantization Framework for 3D Object Detection

  • Xiaoyu Feng
  • Chen Tang
  • Zongkai Zhang
  • Wenyu Sun
  • Yongpan Liu

Unlike the grid-paced RGB images, network compression, i.e.pruning and quantization, for the irregular and sparse 3D point cloud face more challenges. Traditional quantization ignores the unbalanced semantic distribution in 3D point cloud. In this work, we propose a semantic-guided adaptive quantization framework for 3D point cloud. Different from traditional quantization methods that adopt a static and uniform quantization scheme, our proposed framework can adaptively locate the semantic-rich foreground points in the feature maps to allocate a higher bitwidth for these “important” points. Since the foreground points are in a low proportion in the sparse 3D point cloud, such adaptive quantization can achieve higher accuracy than uniform compression under a similar compression rate. Furthermore, we adopt a block-wise fine-grained compression scheme in the proposed framework to fit the larger dynamic range in the point cloud. Moreover, a 3D point cloud based software and hardware co-evaluation process is proposed to evaluate the effectiveness of the proposed adaptive quantization in actual hardware devices. Based on the nuScenes dataset, we achieve 12.52% precision improvement under average 2-bit quantization. Compared with 8-bit quantization, we can achieve 3.11× energy efficiency based on co-evaluation results.

SESSION: Technical Program: Approximate Brain-Inspired Architectures for Efficient Learning

ReMeCo: Reliable Memristor-Based in-Memory Neuromorphic Computation

  • Ali BanaGozar
  • Seyed Hossein Hashemi Shadmehri
  • Sander Stuijk
  • Mehdi Kamal
  • Ali Afzali-Kusha
  • Henk Corporaal

Memristor-based in-memory neuromorphic computing systems promise a highly efficient implementation of vector-matrix multiplications, commonly used in artificial neural networks (ANNs). However, the immature fabrication process of memristors and circuit level limitations, i.e., stuck-at-fault (SAF), IR-drop, and device-to-device (D2D) variation, degrade the reliability of these platforms and thus impede their wide deployment. In this paper, we present ReMeCo, a redundancy-based reliability improvement framework. It addresses the non-idealities while constraining the induced overhead. It achieves this by performing a sensitivity analysis on ANN. With the acquired insight, ReMeCo avoids the redundant calculation of least sensitive neurons and layers. ReMeCo uses a heuristic approach to find the balance between recovered accuracy and imposed overhead. ReMeCo further decreases hardware redundancy by exploiting the bit-slicing technique. In addition, the framework employs the ensemble averaging method at the output of every ANN layer to incorporate the redundant neurons. The efficacy of the ReMeCo is assessed using two well-known ANN models, i.e., LeNet, and AlexNet, running the MNIST and CIFAR10 datasets. Our results show 98.5% accuracy recovery with roughly 4% redundancy which is more than 20× lower than the state-of-the-art.

SyFAxO-GeN: Synthesizing FPGA-Based Approximate Operators with Generative Networks

  • Rohit Ranjan
  • Salim Ullah
  • Siva Satyendra Sahoo
  • Akash Kumar

With rising trends of moving AI inference to the edge, due to communication and privacy challenges, there has been a growing focus on designing low-cost Edge-AI. Given the diversity of application areas at the edge, FPGA-based systems are increasingly used for high-performance inference. Similarly, approximate computing has emerged as a viable approach to achieve disproportionate resource gains by utilizing the applications’ inherent robustness. However, most related research has focused on selecting the appropriate approximate operators for an application from a set of ASIC-based designs. This approach fails to leverage the FPGA’s architectural benefits and limits the scope of approximation to already existing generic designs. To this end, we propose an AI-based approach to synthesizing novel approximate operators for FPGA’s Look-up-table-based structure. Specifically, we use state-of-the-art generative networks to search for constraint-aware arithmetic operator designs optimized for FPGA-based implementation. With the proposed GANs, we report up to 49% faster training, with negligible accuracy degradation, than related generative networks. Similarly, we report improved hypervolume and increased pareto-front design points compared to state-of-the-art approaches to synthesizing approximate multipliers.

Approximating HW Accelerators through Partial Extractions onto Shared Artificial Neural Networks

  • Prattay Chowdhury
  • Jorge Castro Godínez
  • Benjamin Carrion Schafer

One approach that has been suggested to further reduce the energy consumption of heterogenous Systems-on-Chip (SoCs) is approximate computing. In approximate computing the error at the output is relaxed in order to simplify the hardware and thus, achieve lower power. Fortunately, most of the hardware accelerators in these SoCs are also amenable to approximate computing.

In this work we propose a fully automatic method that substitutes portions of a hardware accelerator specified in C/C++/SystemC for High-Level Synthesis (HLS) to an Artificial Neural Network (ANN). ANNs have many advantages that make them well suited for this. First, they are very scalable which allows to approximate multiple separate portions of the behavioral description simultaneously on them. Second, multiple ANNs can be fused together and re-optimized to further reduce the power consumption. We use this to share the ANN to approximate multiple different HW accelerators in the same SoC. Experimental results with different error thresholds show that our proposed approach leads to better results than the state of the art.

DependableHD: A Hyperdimensional Learning Framework for Edge-Oriented Voltage-Scaled Circuits

  • Dehua Liang
  • Hiromitsu Awano
  • Noriyuki Miura
  • Jun Shiomi

Voltage scaling is one of the most promising approaches for energy efficiency improvement but also brings challenges to fully guaranteeing the stable operation in modern VLSI. To tackle such issues, we propose DependableHD, a learning framework based on HyperDimensional Computing (HDC), which supports the systems to tolerate bit-level memory failure in the low voltage region with high robustness. For the first time, DependableHD introduces the concept of margin enhancement for model retraining and utilizes noise injection to improve the robustness, which is capable of application in most state-of-the-art HDC algorithms. Our experiment shows that under 10% memory error, DependableHD exhibits a 1.22% accuracy loss on average, which achieves an 11.2× improvement compared to the baseline HDC solution. The hardware evaluation shows that DependableHD supports the systems to reduce the supply voltage from 400mV to 300mV, which provides a 50.41% energy consumption reduction while maintaining competitive accuracy performance.

SESSION: Technical Program: Retrospect and Prospect of Verifiation and Test Technologies

EDDY: A Multi-Core BDD Package with Dynamic Memory Management and Reduced Fragmentation

  • Rune Krauss
  • Mehran Goli
  • Rolf Drechsler

In recent years, hardware systems have significantly grown in complexity. Due to the increasing complexity, there is a need to continuously improve the quality of the hardware design process. This leads designers to strive for more efficient data structures and algorithms operating on them to guarantee the correct behavior of such systems through verification techniques like model checking and meet time-to-market constraints. A Binary Decision Diagram (BDD) is a suitable data structure as it provides a canonical compact representation of Boolean functions, given variable ordering, and efficient algorithms for manipulating them. However, reduced ordered BDDs also have challenges: There is a large memory consumption for the BDD construction of some complex practical functions and the use of realizations in the form of BDD packages strongly depends on the application.

To address these issues, this paper presents a novel multi-core package called Engineer Decision Diagrams Yourself (EDDY) with dynamic memory management and reduced fragmentation. Experiments on BDD benchmarks of both combinational circuits and model checking show that using EDDY leads to a significantly performance boost compared to state-of-the-art packages.

Exploiting Reversible Computing for Verification: Potential, Possible Paths, and Consequences

  • Lukas Burgholzer
  • Robert Wille

Today, the verification of classical circuits poses a severe challenge for the design of circuits and systems. While the underlying (exponential) complexity is tackled in various fashions (simulation-based approaches, emulation, formal equivalence checking, fuzzing, model checking, etc.), no “silver bullet” has been found yet which allows to escape the growing verification gap. In this work, we entertain and investigate the idea of a complementary approach which aims at exploiting reversible computing. More precisely, we show the potential of the reversible computing paradigm for verification, debunk misleading paths that do not allow to exploit this potential, and discuss the resulting consequences for the development of future, complementary design and verification flows. An extensive empirical study (involving more than 30 million simulations) confirms these findings. Although this work cannot provide a fully-fledged realization yet, it may provide the basis for an alternative path towards overcoming the verification gap.

Automatic Test Pattern Generation and Compaction for Deep Neural Networks

  • Dina Moussa
  • Michael Hefenbrock
  • Christopher Münch
  • Mehdi Tahoori

Deep Neural Networks (DNNs) have gained considerable attention lately due to their excellent performance on a wide range of recognition and classification tasks. Accordingly, fault detection in DNNs and their implementations plays a crucial role in the quality of DNN implementations to ensure that their post-mapping and infield accuracy matches with model accuracy. This paper proposes a functional-level automatic test pattern generation approach for DNNs. This is done by generating inputs which causes misclassification of the output class label in the presence of single or multiple faults. Furthermore, to obtain a smaller set of test patterns with full coverage, a heuristic algorithm as well as a test pattern clustering method using K-means were implemented. The experimental results showed that the proposed test patterns achieved the highest label misclassification and a high output deviation compared to state-of-the-art approaches.

Wafer-Level Characteristic Variation Modeling Considering Systematic Discontinuous Effects

  • Takuma Nagao
  • Tomoki Nakamura
  • Masuo Kajiyama
  • Makoto Eiki
  • Michiko Inoue
  • Michihiro Shintani

Statistical wafer-level variation modeling is an attractive method for reducing the measurement cost in large-scale integrated circuit (LSI) testing while maintaining the test quality. In this method, the performance of unmeasured LSI circuits manufactured on a wafer is statistically predicted from a few measured LSI circuits. Conventional statistical methods model spatially smooth variations in wafer. However, actual wafers may have discontinuous variations that are systematically caused by the manufacturing environments, such as shot dependence. In this study, we propose a modeling method that considers discontinuous variations in wafer characteristics by applying the knowledge of manufacturing engineers to a model estimated using Gaussian process regression. In the proposed method, the process variation is decomposed into the systematic discontinuous and global components to improve the estimation accuracy. An evaluation performed using an industrial production test dataset shows that the proposed method reduces the estimation error for an entire wafer by over 33% compared to conventional methods.

SESSION: Technical Program: Computing, Erasing, and Protecting: The Security Challenges for the Next Generation of Memories

Hardware Security Primitives Using Passive RRAM Crossbar Array: Novel TRNG and PUF Designs

  • Simranjeet Singh
  • Furqan Zahoor
  • Gokul Rajendran
  • Sachin Patkar
  • Anupam Chattopadhyay
  • Farhad Merchant

With rapid advancements in electronic gadgets, the security and privacy aspects of these devices are significant. For the design of secure systems, physical unclonable function (PUF) and true random number generator (TRNG) are critical hardware security primitives for security applications. This paper proposes novel implementations of PUF and TRNGs on the RRAM crossbar structure. Firstly, two techniques to implement the TRNG in the RRAM crossbar are presented based on write-back and 50% switching probability pulse. The randomness of the proposed TRNGs is evaluated using the NIST test suite. Next, an architecture to implement the PUF in the RRAM crossbar is presented. The initial entropy source for the PUF is used from TRNGs, and challenge-response pairs (CRPs) are collected. The proposed PUF exploits the device variations and sneak-path current to produce unique CRPs. We demonstrate, through extensive experiments, reliability of 100%, uniqueness of 47.78%, uniformity of 49.79%, and bit-aliasing of 48.57% without any post-processing techniques. Finally, the design is compared with the literature to evaluate its implementation efficiency, which is clearly found to be superior to the state-of-the-art.

Data Sanitization on eMMCs

  • Aya Fukami
  • Francesco Regazzoni
  • Zeno Geradts

Data sanitization of modern digital devices is an important issue given that electronic wastes are being recycled and repurposed. The embedded Multi Media Card (eMMC), one of the NAND flash memory-based commodity devices, is one of the popularly recycled products in the current recycling ecosystem. We analyze a repurposed devices and evaluate its sanitization practice. Data from the formerly used device can still be recovered, which may lead to an unintentional leakage of sensitive data such as personally identifiable information (PII). Since the internal storage of an eMMC is the NAND flash memory, sanitization practice of the NAND flash memory-based systems should apply to the eMMC. However, proper sanitize operation is obviously not always performed in the current recycling ecosystem. We discuss how data stored in eMMC and other flash memory-based devices need to be deleted in order to avoid the potential data leakage. We also review the NAND flash memory data sanitization schemes and discuss how they should be applied in eMMCs.

Fundamentally Understanding and Solving RowHammer

  • Onur Mutlu
  • Ataberk Olgun
  • A. Giray Yağlıkcı

We provide an overview of recent developments and future directions in the RowHammer vulnerability that plagues modern DRAM (Dynamic Random Memory Access) chips, which are used in almost all computing systems as main memory.

RowHammer is the phenomenon in which repeatedly accessing a row in a real DRAM chip causes bitflips (i.e., data corruption) in physically nearby rows. This phenomenon leads to a serious and widespread system security vulnerability, as many works since the original RowHammer paper in 2014 have shown. Recent analysis of the RowHammer phenomenon reveals that the problem is getting much worse as DRAM technology scaling continues: newer DRAM chips are fundamentally more vulnerable to RowHammer at the device and circuit levels. Deeper analysis of RowHammer shows that there are many dimensions to the problem as the vulnerability is sensitive to many variables, including environmental conditions (temperature & voltage), process variation, stored data patterns, as well as memory access patterns and memory control policies. As such, it has proven difficult to devise fully-secure and very efficient (i.e., low-overhead in performance, energy, area) protection mechanisms against RowHammer and attempts made by DRAM manufacturers have been shown to lack security guarantees.

After reviewing various recent developments in exploiting, understanding, and mitigating RowHammer, we discuss future directions that we believe are critical for solving the RowHammer problem. We argue for two major directions to amplify research and development efforts in: 1) building a much deeper understanding of the problem and its many dimensions, in both cutting-edge DRAM chips and computing systems deployed in the field, and 2) the design and development of extremely efficient and fully-secure solutions via system-memory cooperation.

SESSION: Technical Program: System-Level Codesign in DNN Accelerators

Hardware-Software Codesign of DNN Accelerators Using Approximate Posit Multipliers

  • Tom Glint
  • Kailash Prasad
  • Jinay Dagli
  • Krishil Gandhi
  • Aryan Gupta
  • Vrajesh Patel
  • Neel Shah
  • Joycee Mekie

Emerging data intensive AI/ML workloads encounter memory and power wall when run on general-purpose compute cores. This has led to the development of a myriad of techniques to deal with such workloads, among which DNN accelerator architectures have found a prominent place. In this work, we propose a hardware-software co-design approach to achieve system-level benefits. We propose a quantized data-aware POSIT number representation that leads to a highly optimized DNN accelerator. We demonstrate this work on SOTA SIMBA architecture, extendable to any other accelerator. Our proposal reduces the buffer/storage requirements within the architecture and reduces the data transfer cost between the main memory and the DNN accelerator. We have investigated the impact of using integer, IEEE floating point, and posit multipliers for LeNet, ResNet and VGG NNs trained and tested on MNIST, CIFAR10 and ImageNet datasets, respectively. Our system-level analysis shows that the proposed approximate-fixed-posit multiplier when implemented on SIMBA architecture, achieves on average ~2.2× speed up, consumes ~3.1× less energy and requires ~3.2× less area, respectively, against the baseline SOTA architecture, without loss of accuracy (~±1%)

Reusing GEMM Hardware for Efficient Execution of Depthwise Separable Convolution on ASIC-Based DNN Accelerators

  • Susmita Dey Manasi
  • Suvadeep Banerjee
  • Abhijit Davare
  • Anton A. Sorokin
  • Steven M. Burns
  • Desmond A. Kirkpatrick
  • Sachin S. Sapatnekar

Deep learning (DL) accelerators are optimized for standard convolution. However, lightweight convolutional neural networks (CNNs) use depthwise convolution (DwC) in key layers, and the structural difference between DwC and standard convolution leads to significant performance bottleneck in executing lightweight CNNs on such platforms. This work reuses the fast general matrix-vector multiplication (GEMM) core of DL accelerators by mapping DwC to channel-wise parallel matrix-vector multiplications. An analytical framework is developed to guide pre-RTL hardware choices, and new hardware modules and software support are developed for end-to-end evaluation of the solution. This GEMM-based DwC execution strategy offers substantial performance gains for lightweight CNNs: 7× speedup and 1.8× lower off-chip communication for MobileNet-v1 over a conventional DL accelerator, and 74× speedup over a CPU, and even 1.4× speedup over a power-hungry GPU.

BARVINN: Arbitrary Precision DNN Accelerator Controlled by a RISC-V CPU

  • Mohammadhossein Askarihemmat
  • Sean Wagner
  • Olexa Bilaniuk
  • Yassine Hariri
  • Yvon Savaria
  • Jean-Pierre David

We present a DNN accelerator that allows inference at arbitrary precision with dedicated processing elements that are configurable at the bit level. Our DNN accelerator has 8 Processing Elements controlled by a RISC-V controller with a combined 8.2 TMACs of computational power when implemented with the recent Alveo U250 FPGA platform. We develop a code generator tool that ingests CNN models in ONNX format and generates an executable command stream for the RISC-V controller. We demonstrate the scalable throughput of our accelerator by running different DNN kernels and models when different quantization levels are selected. Compared to other low precision accelerators, our accelerator provides run time programmability without hardware reconfiguration and can accelerate DNNs with multiple quantization levels, regardless of the target FPGA size. BARVINN is an open source project and it is available at

Agile Hardware and Software Co-Design for RISC-V-Based Multi-Precision Deep Learning Microprocessor

  • Zicheng He
  • Ao Shen
  • Qiufeng Li
  • Quan Cheng
  • Hao Yu

Recent network architecture search (NAS) has been widely applied to simplify deep learning neural networks, which typically result in a multi-precision network. Many multi-precision accelerators have been developed as well to support computing multi-precision networks manually. A software-hardware interface is thereby needed to automatically map multi-precision networks onto multi-precision accelerators. In this paper, we have developed an agile hardware and software co-design for RISC-V-based multi-precision deep learning microprocessor. We have designed custom RISC-V instructions with a framework to automatically compile multi-precision CNN networks onto multi-precision CNN accelerators, demonstrated on FPGA. Experiments show that with NAS optimized multi-precision CNN models (LeNet, VGG16, ResNet, MobileNet), the RISC-V core with multi-precision accelerators can reach the highest throughput in 2,4,8-bit precisions respectively on a Xilinx ZCU102 FPGA.

SESSION: Technical Program: New Advances in Hardware Trojan Detection

Hardware Trojan Detection Using Shapley Ensemble Boosting

  • Zhixin Pan
  • Prabhat Mishra

Due to globalized semiconductor supply chain, there is an increasing risk of exposing system-on-chip designs to hardware Trojans (HT). While there are promising machine Learning based HT detection techniques, they have three major limitations: ad-hoc feature selection, lack of explainability, and vulnerability towards adversarial attacks. In this paper, we propose a novel HT detection approach using an effective combination of Shapley value analysis and boosting framework. Specifically, this paper makes two important contributions. We use Shapley value (SHAP) to analyze the importance ranking of input features. It not only provides explainable interpretation for HT detection, but also serves as a guideline for feature selection. We utilize boosting (ensemble learning) to generate a sequence of lightweight models that significantly reduces the training time while provides robustness against adversarial attacks. Experimental results demonstrate that our approach can drastically improve both detection accuracy (up to 24.6%) and time efficiency (up to 5.1x) compared to state-of-the-art HT detection techniques.

ASSURER: A PPA-friendly Security Closure Framework for Physical Design

  • Guangxin Guo
  • Hailong You
  • Zhengguang Tang
  • Benzheng Li
  • Cong Li
  • Xiaojue Zhang

Hardware security is emerging in the very large scale integration (VLSI). The seminal threats, like hardware Trojan insertion, probing attacks, and fault injection, are hard to detect and almost impossible to fix at post-design stage. The optimal solution is to prevent them at the physical design stage. Usually, defending against them may cause a lot of power, performance, and area (PPA) loss. In this paper, we propose a PPA-friendly physical layout security closure framework ASSURER. Reward-directed placement refinement and multi-threshold partition algorithm are proposed to assure Trojan threats are empty. Cleaning up probing attacks is established on a patch-based ECO routing flow. Evaluated on the ISPD’22 benchmarks, ASSURER can clean out the Trojan threat with no leakage power increase when shrinking the physical layout area. When not shrinking, ASSURER only increases 14% total power. Compared with the work of first place in the ISPD2022 Contest, ASSURE reduced 53% additional total power consumption, and probing vulnerability can be reduced by 97.6% under the premise of timing closure. We believe this work shall open up a new perspective for preventing Trojan insertion and probing attacks.

Static Probability Analysis Guided RTL Hardware Trojan Test Generation

  • Haoyi Wang
  • Qiang Zhou
  • Yici Cai

Directed test generation is an effective method to detect potential hardware Trojan (HT) in RTL. While the existing works are able to activate hard-to-cover Trojans by covering security targets, the effectiveness and efficiency of identifying the targets to cover are ignored. We propose a static probability analysis method for identifying the hard-to-active data channel targets and generating the corresponding assertions for the HT test generation. Our method could generate test vectors to trigger Trojans from Trusthub, DeTrust, and OpenCores in 1 minute and get 104.33X time improvement on average compared with the existing method.

Hardware Trojan Detection and High-Precision Localization in NoC-Based MPSoC Using Machine Learning

  • Haoyu Wang
  • Basel Halak

Networks-on-Chips (NoC) based Multi-Processor System-on-Chip (MPSoC) are increasingly employed in industrial and consumer electronics. Outsourcing third-party IPs (3PIPs) and tools in NoC-based MPSoC is a prevalent development way in most fabless companies. However, Hardware Trojan (HT) injected during its design stage can maliciously tamper with the functionality of this communication scheme, which undermines the security of the system and may cause a failure. Detecting and localizing HT with high precision is a challenge for current techniques. This work proposes for the first time a novel approach that allows detection and high-precision localization of HT, which is based on the use of packet information and machine learning algorithms. It is equipped with a novel Dynamic Confidence Interval (DCI) algorithm to detect malicious packets, and a novel Dynamic Security Credit Table (DSCT) algorithm to localize HT. We evaluated the proposed framework on the mesh NoC running real workloads. The average detection precision of 96.3% and the average localization precision of 100% were obtained from the experiment results, and the minimum HT localization time is around 5.8 ~ 12.9us at 2GHz depending on the different HT-infected nodes and workloads.

SESSION: Technical Program: Advances in Physical Design and Timing Analysis

An Integrated Circuit Partitioning and TDM Assignment Optimization Framework for Multi-FPGA Systems

  • Dan Zheng
  • Evangeline F. Y. Young

In multi-FPGA systems, Time-Division Multiplexing (TDM) is a widely used method for transferring multiple signals over a common wire. The circuit performance will be significantly influenced by this inter-FPGA delay. Some inter-FPGA nets are driven by different clocks, in which case they cannot share the same wire. In this paper, to minimize the maximum delay of inter-FPGA nets, we propose a two-step framework. First, a TDM-aware partitioning algorithm is adopted to minimize the maximum cut size between an FPGA-pair. A TDM ratio assignment method is then applied to assign TDM ratio for each inter-FPGA net optimally. Experimental results show that our algorithm can reduce the maximum TDM ratio significantly within reasonable runtime.

A Robust FPGA Router with Concurrent Intra-CLB Rerouting

  • Jiarui Wang
  • Jing Mai
  • Zhixiong Di
  • Yibo Lin

Routing is the most time-consuming step in the FPGA design flow with increasingly complicated FPGA architectures and design scales. The growing complexity of connections between logic pins inside CLBs of FPGAs challenges the efficiency and quality of FPGA routers. Existing negotiation-based rip-up and reroute schemes will result in a large number of iterations when generating paths inside CLBs. In this work, we propose a robust routing framework for FPGAs with complex connections between logic elements and switch boxes. We propose a concurrent intra-CLB rerouting algorithm that can effectively resolve routing congestion inside a CLB tile. Experimental results on modified ISPD 2016 benchmarks demonstrate that our framework can achieve 100% routability in less wirelength and runtime, while the state-of-the-art VTR 8.0 routing algorithm fails at 4 of 12 benchmarks.

Efficient Global Optimization for Large Scaled Ordered Escape Routing

  • Chuandong Chen
  • Dishi Lin
  • Rongshan Wei
  • Qinghai Liu
  • Ziran Zhu
  • Jianli Chen

Ordered Escape Routing (OER) problem, which is an NP-hard problem, is critical in PCB design. Primary methods based on integer linear programming (ILP) or heuristic algorithms work well on small-scale PCBs with fewer pins. However, when dealing with large-scale instances, the performance of ILP strategies suffers dramatically as the number of variables increases due to time-consuming preprocessing. As for heuristic algorithms, ripping-up and rerouting is adopted to increase resource utilization, which frequently causes time violation. In this paper, we propose an efficient ILP-based routing engine for dense PCB to simultaneously minimize wiring length and runtime, considering the specific routing constraints. By weighting the length, we first model the OER problem as a special network flow problem. Then we separate the non-crossing constraint from typical ILP modeling to reduce the number of integral variables greatly. In addition, considering the congestion of routing resources, the ILP method is proposed to detect congestion. Finally, unlike the traditional schemes that deal with negotiated congestion, our approach works by reducing the local area capacity and then allowing the global automatic optimization of congestion. Compared with the state-of-the-art work, experimental results show that our algorithm can solve cases in larger scale in high routing quality of less length and reduce routing time by 76%.

An Adaptive Partition Strategy of Galerkin Boundary Element Method for Capacitance Extraction

  • Shengkun Wu
  • Biwei Xie
  • Xingquan Li

In advanced process, electromagnetic coupling among interconnect wires plays an increasingly important role in signoff analysis. For VLSI chip design, the requirement of fast and accurate capacitance extraction is becoming more and more urgent. And the critical step of extracting capacitance among interconnect wires is solving electric field. However, due to the high computational complexity, solving electric field is extreme timing-consuming. The Galerkin boundary element method (GBEM) was used for capacitance extraction in [2]. In this paper, we are going to use some mathematical theorems to analysis its error. Furthermore, with the error estimation of the Galerkin method, we design a boundary partition strategy to fit the electric field attenuation. It is worth to mention that this boundary partition strategy can greatly reduce the number of boundary elements on the promise of ensuring that the error is small enough. As a consequence, the matrix order of the discretization equation will also decrease. We also provide our suggestion of the calculation of the matrix elements. Experimental analysis demonstrates that, our partition strategy obtains a good enough result with a small number of boundary elements.

Graph-Learning-Driven Path-Based Timing Analysis Results Predictor from Graph-Based Timing Analysis

  • Yuyang Ye
  • Tinghuan Chen
  • Yifei Gao
  • Hao Yan
  • Bei Yu
  • Longxing Shi

With diminishing margins in advanced technology nodes, the performance of static timing analysis (STA) is a serious concern, including accuracy and runtime. The STA can generally be divided into graph-based analysis (GBA) and path-based analysis (PBA). For GBA, the timing results are always pessimistic, leading to overdesign during design optimization. For PBA, the timing pessimism is reduced via propagating real path-specific slews with the cost of severe runtime overheads relative to GBA. In this work, we present a fast and accurate predictor of post-layout PBA timing results from inexpensive GBA based on deep edge-featured graph attention network, namely deep EdgeGAT. Compared with the conventional machine and graph learning methods, deep EdgeGAT can learn global timing path information. Experimental results demonstrate that our predictor has the potential to substantially predict PBA timing results accurately and reduce timing pessimism of GBA with maximum error reaching 6.81 ps, and our work achieves an average 24.80× speedup faster than PBA using the commercial STA tool.

SESSION: Technical Program: Brain-Inspired Hyperdimensional Computing to the Rescue for Beyond von Neumann Era

Beyond von Neumann Era: Brain-Inspired Hyperdimensional Computing to the Rescue

  • Hussam Amrouch
  • Paul R. Genssler
  • Mohsen Imani
  • Mariam Issa
  • Xun Jiao
  • Wegdan Mohammad
  • Gloria Sepanta
  • Ruixuan Wang

Breakthroughs in deep learning (DL) continuously fuel innovations that profoundly improve our daily life. However, DNNs overwhelm conventional computing architectures by their massive data movements between processing and memory units. As a result, novel computer architectures are indispensable to improve or even replace the decades-old von Neumann architecture. Nevertheless, going far beyond the existing von Neumann principles comes with profound reliability challenges for the performed computations. This is due to analog computing together with emerging beyond-CMOS technologies being inherently noisy and inevitably leading to unreliable computing. Hence, novel robust algorithms become a key to go beyond the boundaries of the von Neumann era. Hyper-dimensional Computing (HDC) is rapidly emerging as an attractive alternative to traditional DL and ML algorithms. Unlike conventional DL and ML algorithms, HDC is inherently robust against errors along a much more efficient hardware implementation. In addition to these advantages at hardware level, HDC’s promise to learn from little data and the underlying algebra enable new possibilities at the application level. In this work, the robustness of HDC algorithms against errors and beyond von Neumann architectures are discussed. Further, the benefits of HDC as a machine learning algorithm are demonstrated with the example of outlier detection and reinforcement learning.

SESSION: Technical Program: System Level Design Space Exploration

System-Level Exploration of In-Package Wireless Communication for Multi-Chiplet Platforms

  • Rafael Medina
  • Joshua Kein
  • Giovanni Ansaloni
  • Marina Zapater
  • Sergi Abadal
  • Eduard Alarcón
  • David Atienza

Multi-Chiplet architectures are being increasingly adopted to support the design of very large systems in a single package, facilitating the integration of heterogeneous components and improving manufacturing yield. However, chiplet-based solutions have to cope with limited inter-chiplet routing resources, which complicate the design of the data interconnect and the power delivery network. Emerging in-package wireless technology is a promising strategy to address these challenges, as it allows to implement flexible chiplet interconnects while freeing package resources for power supply connections. To assess the capabilities of such an approach and its impact from a full-system perspective, herein we present an exploration of the performance of in-package wireless communication, based on dedicated extensions to the gem5-X simulator. We consider different Medium Access Control (MAC) protocols, as well as applications with different runtime profiles, showcasing that current in-package wireless solutions are competitive with wired chiplet interconnects. Our results show how in-package wireless solutions can outperform wired alternatives when running artificial intelligence workloads, achieving up to a 2.64× speed-up when running deep neural networks (DNNs) on a chiplet-based system with 16 cores distributed in four clusters.

Efficient System-Level Design Space Exploration for High-Level Synthesis Using Pareto-Optimal Subspace Pruning

  • Yuchao Liao
  • Tosiron Adegbija
  • Roman Lysecky

High-level synthesis (HLS) is a rapidly evolving and popular approach to designing, synthesizing, and optimizing embedded systems. Many HLS methodologies utilize design space exploration (DSE) at the post-synthesis stage to find Pareto-optimal hardware implementations for individual components. However, the design space for the system-level Pareto-optimal configurations is orders of magnitude larger than component-level design space, making existing approaches insufficient for system-level DSE. This paper presents Pruned Genetic Design Space Exploration (PG-DSE)—an approach to post-synthesis DSE that involves a pruning method to effectively reduce the system-level design space and an elitist genetic algorithm to accurately find the system-level Pareto-optimal configurations. We evaluate PG-DSE using an autonomous driving application subsystem (ADAS) and three synthetic systems with extremely large design spaces. Experimental results show that PG-DSE can reduce the design space by several orders of magnitude compared to prior work while achieving higher quality results (an average improvement of 58.1x).

Automatic Generation of Complete Polynomial Interpolation Design Space for Hardware Architectures

  • Bryce Orloski
  • Samuel Coward
  • Theo Drane

Hardware implementations of elementary functions regularly deploy piecewise polynomial approximations. This work determines the complete design space of piecewise polynomial approximations meeting a given accuracy specification. Knowledge of this design space determines the minimum number of regions required to approximate the function accurately enough and facilitates the generation of optimized hardware which is competitive against the state of the art. Designers can explore the space of feasible architectures without needing to validate their choices. A heuristic based decision procedure is proposed to generate optimal ASIC hardware designs. Targeting alternative hardware technologies simply requires a modified decision procedure to explore the space. We highlight the difficulty in choosing an optimal number of regions to approximate the function with, as this is input width dependent.

SESSION: Technical Program: Security Assurance and Acceleration

SHarPen: SoC Security Verification by Hardware Penetration Test

  • Hasan Al-Shaikh
  • Arash Vafaei
  • Mridha Md Mashahedur Rahman
  • Kimia Zamiri Azar
  • Fahim Rahman
  • Farimah Farahmandi
  • Mark Tehranipoor

As modern SoC architectures incorporate many complex/heterogeneous intellectual properties (IPs), the protection of security assets has become imperative, and the number of vulnerabilities revealed is rising due to the increased number of attacks. Over the last few years, penetration testing (PT) has become an increasingly effective means of detecting software (SW) vulnerabilities. As of yet, no such technique has been applied to the detection of hardware vulnerabilities. This paper proposes a PT framework, SHarPen, for detecting hardware vulnerabilities, which facilitates the development of a SoC-level security verification framework. SHarPen proposes a formalism for performing gray-box hardware (HW) penetration testing instead of relying on coverage-based testing and provides an automation for mapping hardware vulnerabilities to logical/mathematical cost functions. SHarPen supports both simulation and FPGA-based prototyping, allowing us to automate security testing at different stages of the design process with high capabilities for identifying vulnerabilities in the targeted SoC.

SecHLS: Enabling Security Awareness in High-Level Synthesis

  • Shang Shi
  • Nitin Pundir
  • Hadi M Kamali
  • Mark Tehranipoor
  • Farimah Farahmandi

In their quest for further optimization, High-level synthesis (HLS) utilizes advanced automatic optimization algorithms to achieve lower implementation time/effort for even more complex designs. These optimization algorithms are for the HLS tools’ backend stages, e.g., allocation, scheduling, and binding, and they are highly optimized for resources/latency constraints. However, current HLS tools’ backend is unaware of designs’ security assets, and their algorithms are incapable of handling security constraints. In this paper, we propose Secure-HLS (SecHLS), which aims to define underlying security constraints for HLS tools’ backend stages and intermediate representations. In SecHLS, we improve a set of widely-used scheduling and binding algorithms by integrating the proposed security-related constraints into them. We evaluate the effectiveness of SecHLS in terms of power, performance, area (PPA), security, and complexity (execution time) on small and real-size benchmarks, showing how the proposed security constraints can be integrated into HLS while maintaining low PPA/complexity burdens.

A Flexible ASIC-Oriented Design for a Full NTRU Accelerator

  • Francesco Antognazza
  • Alessandro Barenghi
  • Gerardo Pelosi
  • Ruggero Susella

Post-quantum cryptosystems are the subject of a significant research effort, witnessed by various international standardization competitions. Among them, the NTRU Key Encapsulation Mechanism has been recognized as a secure, patent-free, and efficient public key encryption scheme. In this work, we perform a design space exploration on an FPGA target, with the final goal of an efficient ASIC realization. Specifically, we focus on the possible choices for the design of polynomial multipliers with different memory bus widths to trade-off lower clock cycle counts with larger interconnections. Our design outperforms the best FPGA synthesis results at the state of the art, and we report the results of ASIC syntheses minimizing latency and area with a 40nm industrial grade technology library. Our speed-oriented design computes an encapsulation in 4.1 to 10.2μs and a decapsulation in 7.1 to 11.7μs, depending on the NTRU security level, while our most compact design only takes 20% more area than the underlying SHA-3 hash module.

SESSION: Technical Program: Hardware and Software Co-Design of Emerging Machine Learning Algorithms

Robust Hyperdimensional Computing against Cyber Attacks and Hardware Errors: A Survey

  • Dongning Ma
  • Sizhe Zhang
  • Xun Jiao

Hyperdimensional Computing (HDC), also known as Vector Symbolic Architecture (VSA), is an emerging AI algorithm inspired by the way the human brain functions. Compared with deep neural networks (DNNs), HDC possesses several advantages such as smaller model size, less computation cost, and one/few-shot learning, making it a promising alternative computing paradigm. With the increasing deployment of AI in safety-critical systems such as healthcare and robotics, it is not only important to strive for high accuracy, but also to ensure its robustness under even highly uncertain and adversarial environments. However, recent studies show that HDC, just like DNNs, is vulnerable to both cyber attacks (e.g., adversarial attacks) and hardware errors (e.g., memory failures). While a growing body of research has been studying the robustness of HDC, there is a lack of systematic review of research efforts on this increasingly-important topic. To the best of our knowledge, this paper presents the first survey dedicated to review the research efforts made to the robustness of HDC against cyber attacks and hardware errors. While the performance and accuracy of HDC as an AI method still expects future theoretical advancement, this survey paper aims to shed light and call for community efforts on robustness research of HDC.

In-Memory Computing Accelerators for Emerging Learning Paradigms

  • Dayane Reis
  • Ann Franchesca Laguna
  • Michael Niemier
  • Xiaobo Sharon Hu

Over the past decades, emerging, data-driven machine learning (ML) paradigms have increased in popularity, and revolutionized many application domains. To date, a substantial effort has been devoted to devising mechanisms for facilitating the deployment and near ubiquitous use of these memory intensive ML models. This review paper presents the use of in-memory computing (IMC) accelerators for emerging ML paradigms from a bottom-up perspective through the choice of devices, the design of circuits/architectures, to the application-level results.

Toward Fair and Efficient Hyperdimensional Computing

  • Yi Sheng
  • Junhuan Yang
  • Weiwen Jiang
  • Lei Yang

We are witnessing the evolution that Machine Learning (ML) is applied to varied applications, such as intelligent security systems, medical diagnoses, etc. With this trend, it has high demand to run ML on end devices with limited resources. What’s more, the fairness in these ML algorithms is mounting important, since these applications are not designed for specific users (e.g., people with fair skin in skin disease diagnosis) but need to be applied to all possible users (i.e., people with different skin tones). Brain-inspired hyperdimensional computing (HDC) has demonstrated its ability to run ML tasks on edge devices with a small memory footprint; yet, it is unknown whether HDC can satisfy the fairness requirements from applications (e.g., medical diagnosis for people with different skin tones). In this paper, for the first time, we reveal that the vanilla HDC has severe bias due to its sensitivity to color information. Toward a fair and efficient HDC, we propose a holistic framework, namely FE-HDC, which integrates the image processing and input compression techniques in HDC’s encoder. Compared with the vanilla HDC, results show that the proposed FE-HDC can reduce the unfairness score by 90%, achieving fairer architectures with competitively high accuracy.

SESSION: Technical Program: Full-Stack Co-Design for on-Chip Learning in AI Systems

Improving the Robustness and Efficiency of PIM-Based Architecture by SW/HW Co-Design

  • Xiaoxuan Yang
  • Shiyu Li
  • Qilin Zheng
  • Yiran Chen

Processing-in-memory (PIM) based architecture shows great potential to process several emerging artificial intelligence workloads, including vision and language models. Cross-layer optimizations could bridge the gap between computing density and the available resources by reducing the computation and memory cost of the model and improving the model’s robustness against non-ideal hardware effects. We first introduce several hardware-aware training methods to improve the model robustness to the PIM device’s non-ideal effects, including stuck-at-fault, process variation, and thermal noise. Then, we further demonstrate a software/hardware (SW/HW) co-design methodology to efficiently process the state-of-the-art attention-based model on PIM-based architecture by performing sparsity exploration for the attention-based model and circuit-architecture co-design to support the sparse processing.

Hardware-Software Co-Design for On-Chip Learning in AI Systems

  • M. L. Varshika
  • Abhishek Kumar Mishra
  • Nagarajan Kandasamy
  • Anup Das

Spike-based convolutional neural networks (CNNs) are empowered with on-chip learning in their convolution layers, enabling the layer to learn to detect features by combining those extracted in the previous layer. We propose ECHELON, a generalized design template for a tile-based neuromorphic hardware with on-chip learning capabilities. Each tile in ECHELON consists of a neural processing units (NPU) to implement convolution and dense layers of a CNN model, an on-chip learning unit (OLU) to facilitate spike-timing dependent plasticity (STDP) in the convolution layer, and a special function unit (SFU) to implement other CNN functions such as pooling, concatenation, and residual computation. These tile resources are interconnected using a shared bus, which is segmented and configured via the software to facilitate parallel communication inside the tile. Tiles are themselves interconnected using a classical Network-on-Chip (NoC) interconnect. We propose a system software to map CNN models to ECHELON, maximizing the performance. We integrate the hardware design and software optimization within a co-design loop to obtain the hardware and software architectures for a target CNN, satisfying both performance and resource constraints. In this preliminary work, we show the implementation of a tile on a FPGA and some early evaluations. Using 8 STDP-enabled CNN models, we show the potential of our co-design methodology to optimize hardware resources.

Towards On-Chip Learning for Low Latency Reasoning with End-to-End Synthesis

  • Vito Giovanni Castellana
  • Nicolas Bohm Agostini
  • Ankur Limaye
  • Vinay Amatya
  • Marco Minutoli
  • Joseph Manzano
  • Antonino Tumeo
  • Serena Curzel
  • Michele Fiorito
  • Fabrizio Ferrandi

The Software Defined Architectures (SODA) Synthesizer is an open-source compiler-based tool able to automatically generate domain-specialized systems targeting Application-Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) starting from high-level programming. SODA is composed of a frontend, SODA-OPT, which leverages the multilevel intermediate representation (MLIR) framework to interface with productive programming tools (e.g., machine learning frameworks), identify kernels suitable for acceleration, and perform high-level optimizations, and of a state-of-the-art high-level synthesis backend, Bambu from the PandA framework, to generate custom accelerators. One specific application of the SODA Synthesizer is the generation of accelerators to enable ultra-low latency inference and control on autonomous systems for scientific discovery (e.g., electron microscopes, sensors in particle accelerators, etc.). This paper provides an overview of the flow in the context of the generation of accelerators for edge processing to be integrated in transmission electron microscopy (TEM) devices, focusing on use cases from precision material synthesis. We show the tool in action with an example of design space exploration for inference on reconfigurable devices with a conventional deep neural network model (LeNet). Finally, we discuss the research directions and opportunities enabled by SODA in the area of autonomous control for scientific experimental workflows.

SESSION: Technical Program: Energy-Efficient Computing for Emerging Applications

Knowledge Distillation in Quantum Neural Network Using Approximate Synthesis

  • Mahabubul Alam
  • Satwik Kundu
  • Swaroop Ghosh

Recent assertions of a potential advantage of Quantum Neural Network (QNN) for specific Machine Learning (ML) tasks have sparked the curiosity of a sizable number of application researchers. The parameterized quantum circuit (PQC), a major building block of a QNN, consists of several layers of single-qubit rotations and multi-qubit entanglement operations. The optimum number of PQC layers for a particular ML task is generally unknown. A larger network often provides better performance in noiseless simulations. However, it may perform poorly on hardware compared to a shallower network. Because the amount of noise varies amongst quantum devices, the optimal depth of PQC can vary significantly. Additionally, the gates chosen for the PQC may be suitable for one type of hardware but not for another due to compilation overhead. This makes it difficult to generalize a QNN design to wide range of hardware and noise levels. An alternate approach is to build and train multiple QNN models targeted for each hardware which can be expensive. To circumvent these issues, we introduce the concept of knowledge distillation in QNN using approximate synthesis. The proposed approach will create a new QNN network with (i) a reduced number of layers or (ii) a different gate set without having to train it from scratch. Training the new network for a few epochs can compensate for the loss caused by approximation error. Through empirical analysis, we demonstrate ≈71.4% reduction in circuit layers, and still achieve ≈16.2% better accuracy under noise.

NTGAT: A Graph Attention Network Accelerator with Runtime Node Tailoring

  • Wentao Hou
  • Kai Zhong
  • Shulin Zeng
  • Guohao Dai
  • Huazhong Yang
  • Yu Wang

Graph Attention Network (GAT) has demonstrated better performance in many graph tasks than previous Graph Neural Networks (GNN). However, it involves graph attention operations with extra computing complexity. While a large amount of existing literature has researched GNN acceleration, few have focused on the attention mechanism in GAT. The graph attention mechanism makes the computation flow different. Therefore, previous GNN accelerators can not support GAT well. Besides, GAT distinguishes the importance of neighbors and makes it possible to reduce the workload through runtime tailoring. We present NTGAT, a software-hardware co-design approach to accelerate GAT with runtime node tailoring. Our work comprises both a runtime node tailoring algorithm and an accelerator design. We propose a pipeline sorting method and a hardware unit to support node tailoring during inference. The experiments show that our algorithm can reduce up to 86% of aggregation workload while incurring slight accuracy loss (<0.4%). And the FPGA based accelerator can achieve up to 3.8× speedup and 4.98× energy efficiency comparing to the GPU baseline.

A Low-Bitwidth Integer-STBP Algorithm for Efficient Training and Inference of Spiking Neural Networks

  • Pai-Yu Tan
  • Cheng-Wen Wu

Spiking neural networks (SNNs) that enable energy-efficient neuromorphic hardware are receiving growing attention. Training SNNs directly with back-propagation has demonstrated accuracy comparable to deep neural networks (DNNs). However, previous direct-training algorithms require high-precision floating-point operations, which are not suitable for low-power end-point devices. The high-precision operations also require the learning algorithm to run on high-performance accelerator hardware. In this paper, we propose an improved approach that converts the high-precision floating-point operations to low-bitwidth integer operations for an existing direct-training algorithm, i.e., the Spatio-Temporal Back-Propagation (STBP) algorithm. The proposed low-bitwidth Integer-STBP algorithm requires only integer arithmetic for SNN training and inference, which greatly reduces the computational complexity. Experimental results show that the proposed STBP algorithm achieves comparable accuracy and higher energy efficiency than the original floating-point STBP algorithm. Moreover, it can be implemented on low-power end-point devices to provide learning capability during inference, which are mostly supported by fixed-point hardware.

TiC-SAT: Tightly-Coupled Systolic Accelerator for Transformers

  • Alireza Amirshahi
  • Joshua Alexander Harrison Klein
  • Giovanni Ansaloni
  • David Atienza

Transformer models have achieved impressive results in various AI scenarios, ranging from vision to natural language processing. However, their computational complexity and their vast number of parameters hinder their implementations on resource-constrained platforms. Furthermore, while loosely-coupled hardware accelerators have been proposed in the literature, data transfer costs limit their speed-up potential. We address this challenge along two axes. First, we introduce tightly-coupled, small-scale systolic arrays (TiC-SATs), governed by dedicated ISA extensions, as dedicated functional units to speed up execution. Then, thanks to the tightly-coupled architecture, we employ software optimizations to maximize data reuse, thus lowering miss rates across cache hierarchies. Full system simulations across various BERT and Vision-Transformer models are employed to validate our strategy, resulting in substantial application-wide speed-ups (e.g., up to 89.5X for BERT-large). TiC-SAT is available as an open-source framework1.

SESSION: Technical Program: Side-Channel Attacks and RISC-V Security

PMU-Leaker: Performance Monitor Unit-Based Realization of Cache Side-Channel Attacks

  • Pengfei Qiu
  • Qiang Gao
  • Dongsheng Wang
  • Yongqiang Lyu
  • Chunlu Wang
  • Chang Liu
  • Rihui Sun
  • Gang Qu

Performance Monitor Unit (PMU) is a special hardware module in processors that contains a set of counters to record various architectural and micro-architectural events. In this paper, we propose PMU-Leaker, a novel realization of all existing cache side-channel attacks where accurate execution time measurements are replaced by information leaked through PMU. The efficacy of PMU-Leaker is demonstrated by (1) leaking the secret data stored in Intel Software Guard Extensions (SGX) with the transient execution vulnerabilities including Spectre and ZombieLoad and (2) extracting the encryption key of a victim AES performed in SGX. We perform thorough experiments on a DELL Inspiron 15-7560 laptop that has an Intel® Core i5-7200U processor with the Kaby Lake architecture and the results show that, among the 176 PMU counters, 24 of them are vulnerable and can be used to launch the PMU-Leaker attack.

EO-Shield: A Multi-Function Protection Scheme against Side Channel and Focused Ion Beam Attacks

  • Ya Gao
  • Qizhi Zhang
  • Haocheng Ma
  • Jiaji He
  • Yiqiang Zhao

Smart devices, especially Internet-connected devices, typically incorporate security protocols and cryptographic algorithms to ensure the control flow integrity and information security. However, there are various invasive and non-invasive attacks trying to tamper with these devices. Chip-level active shield has been proved to be an effective countermeasure against invasive attacks, but existing active shields cannot be utilized to counter side-channel attacks (SCAs). In this paper, we propose a multi-function protection scheme and an active shield prototype to against invasive and non-invasive attacks simultaneously. The protection scheme has a complex active shield implemented using the top metal layer of the chip and an information leakage obfuscation module underneath. The leakage obfuscation module generates its protection patterns based on the operating conditions of the circuit that needs to be protected, thus reducing the correlation between electromagnetic (EM) emanations and cryptographic data. We implement the protection scheme on one Advanced Encryption Standard (AES) circuit to demonstrate the effectiveness of the method. Experiment results demonstrate that the information leakage obfuscation module decreases SNR below 0.6 and reduces the success rate of SCAs. Compared to existing single-function protection methods against physical attacks, the proposed scheme provides good performance against both invasive and non-invasive attacks.

CompaSeC: A Compiler-Assisted Security Countermeasure to Address Instruction Skip Fault Attacks on RISC-V

  • Johannes Geier
  • Lukas Auer
  • Daniel Mueller-Gritschneder
  • Uzair Sharif
  • Ulf Schlichtmann

Fault-injection attacks are a risk for any computing system executing security-relevant tasks, such as a secure boot process. While hardware-based countermeasures to these invasive attacks have been found to be a suitable option, they have to be implemented via hardware extensions and are thus not available in most Commonly used Off-The-Shelf (COTS) components. Software Implemented Hardware Fault Tolerance (SIHFT) is therefore the only valid option to enhance a COTS system’s resilience against fault attacks. Established SIHFT techniques usually target the detection of random hardware errors for functional safety and not targeted attacks. Using the example of a secure boot system running on a RISC-V processor, in this work we first show that when the software is hardened by these existing techniques from the safety domain, the number of vulnerabilities in the boot process to single, double, triple, and quadruple instruction skips cannot be fully closed. We extend these techniques to the security domain and propose Compiler-assisted Security Countermeasure (CompaSeC). We demonstrate that CompaSeC can close all vulnerabilities for the studied secure boot system. To further reduce performance and memory overheads we additionally propose a method for CompaSeC to selectively harden individual vulnerable functions without compromising the security against the considered instruction skip faults.

Trojan-D2: Post-Layout Design and Detection of Stealthy Hardware Trojans – A RISC-V Case Study

  • Sajjad Parvin
  • Mehran Goli
  • Frank Sill Torres
  • Rolf Drechsler

With the exponential increase in the popularity of the RISC-V ecosystem, the security of this platform must be re-evaluated especially for mission-critical and IoT devices. Besides, the insertion of a Hardware Trojan (HT) into a chip after the in-house mask design is outsourced to a chip manufacturer abroad for fabrication is a significant source of concern. Though abundant HT detection methods have been investigated based on side-channel analysis, physical measurements, and functional testing to overcome this problem, there exists stealthy HTs that can hide from detection. This is due to the small overhead of such HTs compared to the whole circuit.

In this work, we propose several novel HTs that can be placed into a RISC-V core’s post-layout in an untrusted manufacturing environment. Next, we propose a non-invasive analytical method based on contactless optical probing to detect any stealthy HTs. Finally, we propose an open-source library of HTs that can be used to be placed into a processor unit in the post-layout phase. All the designs in this work are done using a commercial 28nm technology.

SESSION: Technical Program: Simulation and Verification of Quantum Circuits

Graph Partitioning Approach for Fast Quantum Circuit Simulation

  • Jaekyung Im
  • Seokhyeong Kang

Owing to the exponential increase in computational complexity, the fast simulation of the large quantum circuit has become very difficult. This is an important challenge for the utilization of quantum computers because it is closely related to the verification of quantum computation by classical machines. The Hybrid Schrödinger-Feynman simulation seems to be a promising solution, but its application is very limited. To solve this drawback, we propose an improved simulation method based on graph partitioning. Experimental results show that our approach significantly reduces the simulation time of the Hybrid Schrödinger-Feynman simulation.

A Robust Approach to Detecting Non-Equivalent Quantum Circuits Using Specially Designed Stimuli

  • Hsiao-Lun Liu
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

As several compilation and optimization techniques have been proposed, equivalence checking for quantum circuits has become essential in design flows. The state-of-the-art to this problem observed that even small errors substantially affect the entire quantum system. As a result, it exploited random simulations to prove the non-equivalence of two quantum circuits. However, when errors occurred close to outputs, it was hard for the work to prove the non-equivalence of some non-equivalent quantum circuits under a limited number of simulations. In this work, we propose a novel simulation-based approach using a set of specially designed stimuli. The simulation runs of the proposed approach is linear rather than exponential to the number of quantum bits of a circuit. According to the experimental results, the success rate of our approach is 100% (100%) under a simulation run (execution time) constraint for a set of benchmarks, while that of the state-of-the-art is only 69% (74%) on average. Our approach also achieves a speedup of 26 on average.

Equivalence Checking of Parameterized Quantum Circuits: Verifying the Compilation of Variational Quantum Algorithms

  • Tom Peham
  • Lukas Burgholzer
  • Robert Wille

Variational quantum algorithms have been introduced as a promising class of quantum-classical hybrid algorithms that can already be used with the noisy quantum computing hardware available today by employing parameterized quantum circuits. Considering the non-trivial nature of quantum circuit compilation and the subtleties of quantum computing, it is essential to verify that these parameterized circuits have been compiled correctly. Established equivalence checking procedures that handle parameter-free circuits already exist. However, no methodology capable of handling circuits with parameters has been proposed yet. This work fills this gap by showing that verifying the equivalence of parameterized circuits can be achieved in a purely symbolic fashion using an equivalence checking approach based on the ZX-calculus. At the same time, proofs of inequality can be efficiently obtained with conventional methods by taking advantage of the degrees of freedom inherent to parameterized circuits. We implemented the corresponding methods and proved that the resulting methodology is complete. Experimental evaluations (using the entire parametric ansatz circuit library provided by Qiskit as benchmarks) demonstrate the efficacy of the proposed approach.

Software Tools for Decoding Quantum Low-Density Parity-Check Codes

  • Lucas Berent
  • Lukas Burgholzer
  • Robert Wille

Quantum Error Correction (QEC) is an essential field of research towards the realization of large-scale quantum computers. On the theoretical side, a lot of effort is put into designing error-correcting codes that protect quantum data from errors, which inevitably happen due to the noisy nature of quantum hardware and quantum bits (qubits). Protecting data with an error-correcting code necessitates means to recover the original data, given a potentially corrupted data set—a task referred to as decoding. It is vital that decoding algorithms can recover error-free states in an efficient manner. While theoretical properties of certain QEC methods have been extensively studied, good techniques to analyze their performance in practically more relevant settings is still a widely unexplored area. In this work, we propose a set of software tools that facilitate numerical experiments with so-called Quantum Low-Density Parity-Check codes (QLDPC codes)—a broad class of codes, some of which have recently been shown to be asymptotically good. Based on that, we provide an implementation of a general decoder for QLDPC codes. On top of that, we propose a highly efficient heuristic decoder that eliminates the runtime bottlenecks of the general QLDPC decoder while still maintaining comparable decoding performance. These tools eventually make it possible to confirm theoretical results around QLDPC codes in a more practical setting and showcase the value of software tools (in addition to theoretical considerations) for investigating codes for practical applications. The resulting tool, which is publicly available at as part of the Munich Quantum Toolkit (MQT), is meant to provide a playground for the search for “practically good” quantum codes.

SESSION: Technical Program: Learning x Security in DFM

Enabling Scalable AI Computational Lithography with Physics-Inspired Models

  • Haoyu Yang
  • Haoxing Ren

Computational lithography is a critical research area for the continued scaling of semiconductor manufacturing process technology by enhancing silicon printability via numerical computing methods. Today’s solutions for these problems are primarily CPU-based and require many thousands of CPUs running for days to tape out a modern chip. We seek AI/GPU-assisted solutions for the two problems, aiming at improving both runtime and quality. Prior academic research has proposed using machine learning for lithography modeling and mask optimization, typically represented as image-to-image mapping problems, where convolution layer backboned UNets and ResNets are applied. However, due to the lack of domain knowledge integrated into the framework designs, these solutions have been limited by their application scenarios or performance. Our method aims to tackle the limitations of such previous CNN-based solutions by introducing lithography bias into the neural network design, yielding a much more efficient model design and significant performance improvements.

Data-Driven Approaches for Process Simulation and Optical Proximity Correction

  • Hao-Chiang Shao
  • Chia-Wen Lin
  • Shao-Yun Fang

With continuous shrinking of process nodes, semiconductor manufacturing encounters more and more serious inconsistency between designed layout patterns and resulted wafer images. Conventionally, examining how a layout pattern can deviate from its original after complicated process steps, such as optical lithography and subsequent etching, relies on computationally expensive process simulation, which suffers from incredibly long runtime for large-scale circuit layouts, especially in advanced nodes. In addition, being one of the most important and commonly adopted resolution enhancement techniques, optical proximity correction (OPC) corrects image errors due to process effects by moving segment edges or adding extra polygons to mask patterns, while it is generally driven by simulation or time-consuming inverse lithography techniques (ILTs) to achieve acceptable accuracy. As a result, more and more state-of-the-art works on process simulation or/and OPC resort to the fast inference characteristic of machine/deep learning. This paper reviews these data-driven approaches to highlight the challenges in various aspects, explore preliminary solutions, and reveal possible future directions to push forward the frontiers of the research in design for manufacturability.

Mixed-Type Wafer Failure Pattern Recognition

  • Hao Geng
  • Qi Sun
  • Tinghuan Chen
  • Qi Xu
  • Tsung-Yi Ho
  • Bei Yu

The ongoing evolution in process fabrication enables us to step below the 5nm technology node. Although foundries can pattern and etch smaller but more complex circuits on silicon wafers, a multitude of challenges persist. For example, defects on the surface of wafers are inevitable during manufacturing. To increase the yield rate and reduce time-to-market, it is vital to recognize these failures and identify the failure mechanisms of these defects. Recently, applying machine learning-powered methods to combat single defect pattern classification has made significant progress. However, as the processes become increasingly complicated, various single-type defect patterns may emerge and be coupled on a wafer and thus shape a mixed-type pattern. In this paper, we will survey the recent pace of progress on advanced methodologies for wafer failure pattern recognition, especially for mixed-type one. We sincerely hope this literature review can highlight the future directions and promote the advancement of the wafer failure pattern recognition.

SESSION: Technical Program: Lightweight Models for Edge AI

Accelerating Convolutional Neural Networks in Frequency Domain via Kernel-Sharing Approach

  • Bosheng Liu
  • Hongyi Liang
  • Jigang Wu
  • Xiaoming Chen
  • Peng Liu
  • Yinhe Han

Convolutional neural networks (CNNs) are typically computationally heavy. Fast algorithms such as fast Fourier transforms (FFTs), are promising in significantly reducing computation complexity by replacing convolutions with frequency-domain element-wise multiplication. However, the increased high memory access overhead of complex weights counteracts the computing benefit, because frequency-domain convolutions not only pad weights to the same size as input maps, but also have no sharable complex kernel weights. In this work, we propose an FFT-based kernel-sharing technique called FS-Conv to reduce memory access. Based on FS-Conv, we derive the sharable complex weights in frequency-domain convolutions, which has never been solved. FS-Conv includes a hybrid padding approach, which utilizes the inherent periodic characteristic of FFT transformation to provide sharable complex weights for different blocks of complex input maps. We in addition build a frequency-domain inference accelerator (called Yixin) that can utilize the sharable complex weights for CNN accelerations. Evaluation results demonstrate the significant performance and energy efficiency benefits compared with the state-of-the-art baseline.

Mortar: Morphing the Bit Level Sparsity for General Purpose Deep Learning Acceleration

  • Yunhung Gao
  • Hongyan Li
  • Kevin Zhang
  • Xueru Yu
  • Hang Lu

Vanilla Deep Neural Networks (DNN) after training are represented with native floating-point 32 (fp32) weights. We observe that the bit-level sparsity of these weights is very abundant in the mantissa and can be directly exploited to speed up model inference. In this paper, we propose Mortar, an off-line/on-line collaborated approach for fp32 DNN acceleration, which includes two parts: first, an off-line bit sparsification algorithm to construct the target formulation by “mantissa morphing”, which maintains higher model accuracy while increasing bit-level sparsity; second, the associating hardware accelerator architecture to speed up the on-line fp32 inference through manipulating the enlarged bit sparsity. We highlight the following results by evaluating various deep learning tasks, including image classification, object detection, video understanding, video & image super-resolution, etc.: We (1) increase bit-level sparsity up to 1.28~2.51x with only a negligible -0.09~0.23% accuracy loss, (2) maintain on average 3.55% higher model accuracy while increasing more bit-level sparsity than the baseline, (3)and our hardware accelerator outperforms up to 4.8x over the baseline, with an area of 0.031 mm2 and power of 68.58 mW.

Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices

  • Yimeng Zhang
  • Akshay Karkal Kamath
  • Qiucheng Wu
  • Zhiwen Fan
  • Wuyang Chen
  • Zhangyang Wang
  • Shiyu Chang
  • Sijia Liu
  • Cong Hao

In this paper, we propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on High-Definition (HD) video stream. First, to enable ultra-light video intelligence, we propose temporal frame-filtering and spatial saliency-focusing approaches to reduce the complexity of massive video data. Second, we exploit structure-aware weight sparsity to design a hardware-friendly model compression method. Third, assisted with data and model complexity reduction, we propose a sparsity-aware, scalable, and low-power accelerator design, aiming to deliver real-time performance with high energy efficiency. Different from existing works, we make a solid step towards the synergized software/hardware co-optimization for realistic MOT model implementation. Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5× latency reduction, 20.9× effective frame rate improvement, 5.83× lower power, and 9.78× better energy efficiency, without much accuracy drop.

Latent Weight-Based Pruning for Small Binary Neural Networks

  • Tianen Chen
  • Noah Anderson
  • Younghyun Kim

Binary neural networks (BNNs) substitute complex arithmetic operations with simple bit-wise operations. The binarized weights and activations in BNNs can drastically reduce memory requirement and energy consumption, making it attractive for edge ML applications with limited resources. However, the severe memory capacity and energy constraints of low-power edge devices call for further reduction of BNN models beyond binarization. Weight pruning is a proven solution for reducing the size of many neural network (NN) models, but the binary nature of BNN weights make it difficult to identify insignificant weights to remove.

In this paper, we present a pruning method based on latent weight with layer-level pruning sensitivity analysis which reduces the over-parameterization of BNNs, allowing for accuracy gains while drastically reducing the model size. Our method advocates for a heuristics that distinguishes weights by their latent weights, a real-valued vector used to compute the pseduogradient during backpropagation. It is tested using three different convolutional NNs on the MNIST, CIFAR-10, and Imagenette datasets with results indicating a 33%–46% reduction in operation count, with no accuracy loss, improving upon previous works in accuracy, model size, and total operation count.

SESSION: Technical Program: Design Automation for Emerging Devices

AutoFlex: Unified Evaluation and Design Framework for Flexible Hybrid Electronics

  • Tianliang Ma
  • Zhihui Deng
  • Leilai Shao

Flexible hybrid electronics (FHE), integrating high performance silicon chips with multi-functional sensors and actuators on flexible substrates, can be intimately attached onto irregular surfaces without compromising their functionalities, thus enabling more innovations in healthcare, internet of things (IoTs) and various human-machine interfaces (HMIs). Recent developments on compact models and process design kits (PDKs) of flexible electronics have made designs of small to medium flexible circuits feasible. However, the absence of a unified model and comprehensive evaluation benchmarks for flexible electronics makes it infeasible for a designer to fairly compare different flexible technologies and to explore potential design options for a heterogeneous FHE design. In this paper, we present AutoFlex, a unified evaluation and design framework for flexible hybrid electronics, where device parameters can be extracted automatically and performance can be evaluated comprehensively from device levels, digital blocks to large-scale digital circuits. Moreover, a ubiquitous FHE sensor acquisition system, including a flexible multi-functional sensor array, scan drivers, amplifiers and a silicon based analog-to-digital converter (ADC), is developed to reveal the design challenges of a representative FHE system.

CNFET7: An Open Source Cell Library for 7-nm CNFET Technology

  • Chenlin Shi
  • Shinobu Miwa
  • Tongxin Yang
  • Ryota Shioya
  • Hayato Yamaki
  • Hiroki Honda

In this paper, we propose CNFET7, the first open-source cell library for 7-nm carbon nanotube field-effect transistor (CNFET) technology. CNFET7 is based on an open-source CNFET SPICE model called VS-CNFET, and various model parameters such as the channel width and carbon nanotube diameter are carefully tuned to mimic the predictive 7-nm CNFET technology presented in a published paper. Some nondisclosure parameters, such as the cell size and pin layout, are derived from those of the NanGate 15-nm open-source cell library in the same way as for an open-source framework for CNFET circuit design. CNFET7 includes two types of delay model (i.e., the composite current source and nonlinear delay model), each having 56 cells, such as INV_X1 and BUF_X1. CNFET7 supports both logic synthesis and timing-driven place and route in the Cadence design flow. Our experimental results for several synthesized circuits show that CNFET7 has reductions of up to 96%, 62% and 82% in dynamic and static power consumption and critical-path delay, respectively, when compared with ASAP7.

A Global Optimization Algorithm for Buffer and Splitter Insertion in Adiabatic Quantum-Flux-Parametron Circuits

  • Rongliang Fu
  • Mengmeng Wang
  • Yirong Kan
  • Nobuyuki Yoshikawa
  • Tsung-Yi Ho
  • Olivia Chen

As a highly energy-efficient application of low-temperature superconductivity, the adiabatic quantum-flux-parametron (AQFP) logic circuit has characteristics of extremely low-power consumption, making it an attractive candidate for extremely energy-efficient computing systems. Since logic gates are driven by the alternating current (AC) serving as the clock signal in AQFP circuits, plenty of AQFP buffers are required to ensure that the dataflow is synchronized at all logic levels of the circuit. Meanwhile, since the currently developed AQFP logic gates can only drive a single output, splitters are required by logic gates to drive multiple fan-outs. These gates take up a significant amount of the circuit’s area and delay. This paper proposes a global optimization algorithm for buffer and splitter (B/S) insertion to address the issues above. The B/S insertion is first identified as a combinational optimization problem, and a dynamic programming formulation is presented to find the global optimal solution. Due to the limitation of its impractical search space, an integer linear programming formulation is proposed to explore the global optimization of B/S insertion approximately. Experimental results on the ISCAS’85 and simple arithmetic benchmark circuits show the effectiveness of the proposed method, with an average reduction of 8.22% and 7.37% in the number of buffers and splitters inserted compared to the state-of-the-art methods from ICCAD’21 and DAC’22, respectively.

FLOW-3D: Flow-Based Computing on 3D Nanoscale Crossbars with Minimal Semiperimeter

  • Sven Thijssen
  • Sumit Kumar Jha
  • Rickard Ewetz

The emergence of data-intensive applications has spurred the interest for in-memory computing using nanoscale crossbars. Flow-based in-memory computing is a promising approach for evaluating Boolean logic using the natural flow of electrical currents. While automated synthesis approaches have been developed for 2D crossbars, 3D crossbars have advantageous properties in terms of density, area, and performance. In this paper, we propose the first framework for performing flow-based computing using 3D crossbars. The framework, FLOW-3D, automatically synthesizes a Boolean function into a crossbar design. FLOW-3D is based on an analogy between BDDs and crossbars, resulting in the synthesis of 3D crossbar designs with minimal semiperimeter. A BDD with n nodes is mapped to a 3D crossbar with (n + k) metal wires. The k extra metal wires are needed to handle hardware-imposed constraints. Compared with the state-of-the-art synthesis tool for 2D crossbars, FLOW-3D improves semiperimeter, area, energy consumption, and latency up to 61%, 84%, 37%, and 41% on 15 Revlib benchmarks.