ISPD ’23: Proceedings of the 2023 International Symposium on Physical Design


SESSION: Session 1: Opening Session and Keynote I

Automated Design of Chiplets

  • Alberto Sangiovanni-Vincentelli
  • Zheng Liang
  • Zhe Zhou
  • Jiaxi Zhang

Chiplet-based designs have gained recognition as a promising alternative to monolithic SoCs due to their lower manufacturing costs, improved re-usability, and optimized technology specialization. Despite progress made in various related domains, the design of chiplets remains largely reliant on manual processes. In this paper, we provide an examination of the historical evolution of chiplets, encompassing a review of crucial design considerations and a synopsis of recent advancements in relevant fields. Further, we identify and examine the opportunities and challenges in the automated design of chiplets. To further demonstrate the potential of this nascent area, we present a novel task that

SESSION: Session 2: Routing

FastPass: Fast Pin Access Analysis with Incremental SAT Solving

  • Fangzhou Wang
  • Jinwei Liu
  • Evangeline F.Y. Young

Pin access analysis is a critical step in detailed routing. With complicated design rules and pin shapes, efficient and accurate pin accessibility evaluation is desirable in many physical design scenarios. To this end, we present FastPass, a fast and robust pin access analysis framework, which first generates design rule checking (DRC)-clean pin access route candidates for each pin, pre-computes incompatible pairs of routes, and then uses incremental SAT solving to find an optimized pin access scheme. Experimental results on the ISPD 2018 benchmarks show that FastPass produces DRC-clean pin access schemes for all cases while being 14.7× faster than the known best pin access analysis framework on average.
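
The selection step described above (one access route per pin, with no pair of chosen routes marked incompatible) can be illustrated with a small sketch. This is not FastPass itself: plain backtracking stands in for the incremental SAT solver, and the names `select_routes`, `candidates`, and `conflicts` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Route = str


def select_routes(
    candidates: Dict[str, List[Route]],
    conflicts: Set[Tuple[Route, Route]],
) -> Optional[Dict[str, Route]]:
    """Pick one candidate route per pin so that no two chosen routes conflict.

    Backtracking search is used here as a stand-in for SAT solving
    (an assumption for illustration, not the method of the paper).
    """
    # Store conflicts as unordered pairs so lookups are symmetric.
    bad = {frozenset(pair) for pair in conflicts}
    # Try the most constrained pins (fewest candidates) first.
    pins = sorted(candidates, key=lambda p: len(candidates[p]))
    chosen: Dict[str, Route] = {}

    def backtrack(i: int) -> bool:
        if i == len(pins):
            return True
        pin = pins[i]
        for route in candidates[pin]:
            if all(frozenset((route, other)) not in bad for other in chosen.values()):
                chosen[pin] = route
                if backtrack(i + 1):
                    return True
                del chosen[pin]
        return False

    return dict(chosen) if backtrack(0) else None


# Hypothetical example: pin B has a single route that conflicts with a1 and c1.
result = select_routes(
    {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2"]},
    {("a1", "b1"), ("b1", "c1")},
)
infeasible = select_routes({"A": ["a1"], "B": ["b1"]}, {("a1", "b1")})
```

In a SAT encoding the same constraints would become one "at least one route" clause per pin and one binary clause per incompatible pair; an incremental solver can then reuse learned clauses across related queries.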

Pin Access-Oriented Concurrent Detailed Routing

  • Yun-Jhe Jiang
  • Shao-Yun Fang

Due to continuously shrunk feature sizes and increased design complexity, the difficulty in pin access becomes one of the most critical challenges in large-scale full-chip routing. State-of-the-art pin access-aware detailed routing techniques suffer from either the ordering problem of the sequential routing scheme or the inflexibility of pre-determining an access point for each pin. Some other routing-related studies create pin extensions with Metal-2 metal segments to optimize pin accessibility; however, this strategy may not be practical without considering the contemporary routing flow. This paper presents a pin access-oriented concurrent detailed routing approach conducted after the track assignment stage. The core detailed routing engine is based on an integer linear programming (ILP) formulation, which has lower complexity and can flexibly tackle multi-pin nets compared to an existing formulation. Besides, to maximize the free routing resource and to keep the problem size tractable, a pre-processing flow trimming redundant metals and inserting assistant metals is developed. The experimental results show that compared to a state-of-the-art academic router, the proposed concurrent scheme can effectively derive good results with fewer design rule violations and less runtime.

Reinforcement Learning Guided Detailed Routing for Custom Circuits

  • Hao Chen
  • Kai-Chieh Hsu
  • Walker J. Turner
  • Po-Hsuan Wei
  • Keren Zhu
  • David Z. Pan
  • Haoxing Ren

Detailed routing is the most tedious and complex procedure in design automation and has become a determining factor in layout automation in advanced manufacturing nodes. Despite continuing advances in custom integrated circuit (IC) routing research, industrial custom layout flows remain heavily manual due to the high complexity of the custom IC design problem. Besides conventional design objectives such as wirelength minimization, custom detailed routing must also accommodate additional constraints (e.g., path-matching) across the analog/mixed-signal (AMS) and digital domains, making an already challenging procedure even more so. This paper presents a novel detailed routing framework for custom circuits that leverages deep reinforcement learning to optimize routing patterns while considering custom routing constraints and industrial design rules. Comprehensive post-layout analyses based on industrial designs demonstrate the effectiveness of our framework in dealing with the specified constraints and producing sign-off-quality routing solutions.

Voltage-Drop Optimization Through Insertion of Extra Stripes to a Power Delivery Network

  • Jai-Ming Lin
  • Yu-Tien Chen
  • Yang-Tai Kung
  • Hao-Jia Lin

As design complexity increases, power delivery network (PDN) optimization becomes an increasingly important step in modern designs. To construct a robust PDN, most classic PDN optimization methods focus on adjusting the dimensions of power stripes. However, this approach becomes infeasible when voltage violation regions also have severe routing congestion. Hence, this paper proposes a procedure that inserts additional power stripes to reduce voltage violations while maintaining routability. First, regions with high IR drop are identified to reveal the locations that need more current. Then, we solve a minimum-cost flow problem to find the topologies of power delivery paths (PDPs) from power sources to these regions and determine the widths of edges in each PDP so that enough current can be provided to these regions. Moreover, vertical power stripes (VPSs) are inserted by dynamic programming at locations that have less routing congestion and severe voltage violations, to reduce the likelihood of degrading routability. Finally, more wires are inserted into the high-IR-drop regions if voltage violations remain. Experimental results show that our method uses much less routing resource and induces less routing congestion while meeting the IR-drop constraint in industrial designs.
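
The minimum-cost flow step can be illustrated with a generic solver. The sketch below uses successive shortest paths with Bellman-Ford; the graph, capacities, and costs are invented for illustration, and reading the per-edge flow as a proxy for stripe width is an assumption, not the formulation of the paper.

```python
from typing import List, Tuple


def min_cost_flow(
    n: int, edges: List[Tuple[int, int, int, int]], s: int, t: int, demand: int
) -> Tuple[int, int, List[int]]:
    """Send up to `demand` units from s to t at minimum total cost.

    edges: (u, v, capacity, cost_per_unit) with non-negative costs.
    Returns (flow_sent, total_cost, flow_on_each_input_edge).
    """
    # Residual edges are stored in pairs: index 2i is forward, 2i + 1 is reverse.
    graph: List[List[int]] = [[] for _ in range(n)]
    res: List[List[int]] = []  # each entry: [to, residual_capacity, cost]
    for u, v, cap, cost in edges:
        graph[u].append(len(res)); res.append([v, cap, cost])
        graph[v].append(len(res)); res.append([u, 0, -cost])

    inf = float("inf")
    sent = total = 0
    while sent < demand:
        # Bellman-Ford: cheapest s-t path in the residual graph.
        dist = [inf] * n
        prev_edge = [-1] * n
        dist[s] = 0
        for _ in range(n - 1):
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for ei in graph[u]:
                    v, cap, cost = res[ei]
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev_edge[v] = ei
                        changed = True
            if not changed:
                break
        if dist[t] == inf:
            break  # no augmenting path left; demand cannot be fully met

        # Find the bottleneck along the path, then push flow along it.
        push = demand - sent
        v = t
        while v != s:
            ei = prev_edge[v]
            push = min(push, res[ei][1])
            v = res[ei ^ 1][0]  # the paired edge points back to the tail of ei
        v = t
        while v != s:
            ei = prev_edge[v]
            res[ei][1] -= push
            res[ei ^ 1][1] += push
            v = res[ei ^ 1][0]
        sent += push
        total += push * dist[t]

    # Flow on input edge i equals the residual capacity of its reverse edge.
    flows = [res[2 * i + 1][1] for i in range(len(edges))]
    return sent, total, flows


# Hypothetical network: node 0 is a power source, node 3 a high-IR-drop region.
edges = [(0, 1, 2, 1), (0, 2, 2, 3), (1, 3, 1, 1), (2, 3, 2, 1), (1, 2, 1, 1)]
sent, total, flows = min_cost_flow(4, edges, s=0, t=3, demand=3)
```

Under the stated assumption, edges carrying more flow would be candidates for wider stripes, while costs could encode routing congestion along each edge.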

NVCell 2: Routability-Driven Standard Cell Layout in Advanced Nodes with Lattice Graph Routability Model

  • Chia-Tung Ho
  • Alvin Ho
  • Matthew Fojtik
  • Minsoo Kim
  • Shang Wei
  • Yaguang Li
  • Brucek Khailany
  • Haoxing Ren

Standard cells are essential components of modern digital circuit designs. With process technologies advancing beyond the 5nm node, more routability issues have arisen due to the decreasing number of routing tracks, increasing number and complexity of design rules, and strict patterning rules. Automatic standard cell synthesis tools are struggling to design cells with severe routability issues. In this paper, we propose a routability-driven standard cell synthesis framework using a novel pin density aware congestion metric, lattice graph routability modelling approach, and dynamic external pin allocation methodology to generate routability optimized layouts. On a benchmark of 94 complex and hard-to-route standard cells, NVCell 2 improves the number of routable and LVS/DRC clean cell layouts by 84.0% and 87.2%, respectively. NVCell 2 can generate 98.9% of cells LVS/DRC clean, with 13.9% of the cells having smaller area, compared to an industrial standard cell library with over 1000 standard cells.

SESSION: Session 3: 3D ICs, Heterogeneous Integration, and Packaging I

FXT-Route: Efficient High-Performance PCB Routing with Crosstalk Reduction Using Spiral Delay Lines

  • Meng Lian
  • Yushen Zhang
  • Mengchu Li
  • Tsun-Ming Tseng
  • Ulf Schlichtmann

In high-performance printed circuit boards (PCBs), adding serpentine delay lines is the most prevalent delay-matching technique to balance the delays of time-critical signals. Serpentine topology, however, can induce simultaneous accumulation of the crosstalk noise, resulting in erroneous logic gate triggering and speed-up effects. The state-of-the-art approach for crosstalk alleviation achieves waveform integrity by enlarging wire separation, resulting in an increased routing area. We introduce a method that adopts spiral delay lines for delay matching to mitigate the speed-up effect by spreading the crosstalk noise uniformly in time. Our method avoids possible routing congestion while achieving a high density of transmission lines. We implement our method by constructing a mixed-integer-linear programming (MILP) model for routing and a quadratic programming (QP) model for spiral synthesis. Experimental results demonstrate that our method requires, on average, 31% less routing area than the original design. In particular, compared to the state-of-the-art approach, our method can reduce the magnitude of the crosstalk noise by at least 69%.

On Legalization of Die Bonding Bumps and Pads for 3D ICs

  • Sai Pentapati
  • Anthony Agnesina
  • Moritz Brunion
  • Yen-Hsiang Huang
  • Sung Kyu Lim

State-of-the-art 3D IC Place-and-Route flows were designed with older technology nodes and aggressive bonding pitch assumptions. As a result, these flows fail to honor the width and spacing rules for the 3D vias with realistic pitch values. We propose a critical new 3D via legalization stage during routing to reduce such violations. A force-based solver and bipartite-matching algorithm with Bayesian optimization are presented as viable legalizers and are compatible with various process nodes, bonding technologies, and partitioning types. With the modified 3D routing, we reduce the 3D via violations by more than 10× with zero impact on performance, power, or area.
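
The matching view of legalization can be sketched as an assignment problem: each 3D via is moved to a distinct site on a pitch-legal grid so that total displacement is minimized. The sketch below is a brute-force illustration for tiny inputs, not the solver of the paper; `legalize`, the coordinates, and the 100-unit pitch are hypothetical, and the Bayesian optimization component is not modeled.

```python
from itertools import permutations
from typing import List, Tuple

Point = Tuple[int, int]


def legalize(vias: List[Point], sites: List[Point]) -> Tuple[int, List[Point]]:
    """Assign each via to a distinct legal site, minimizing total Manhattan displacement.

    Exhaustive search over assignments; only practical for a handful of vias.
    A real flow would use a polynomial-time bipartite matching algorithm.
    """
    if len(vias) > len(sites):
        raise ValueError("not enough legal sites for all vias")
    best_cost = None
    best_assignment: List[Point] = []
    for perm in permutations(range(len(sites)), len(vias)):
        cost = sum(
            abs(vias[i][0] - sites[j][0]) + abs(vias[i][1] - sites[j][1])
            for i, j in enumerate(perm)
        )
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_assignment = [sites[j] for j in perm]
    return (best_cost or 0), best_assignment


# Hypothetical example: three off-grid vias, four legal sites on a 100-unit pitch.
cost, placed = legalize(
    vias=[(20, 10), (30, 0), (190, 20)],
    sites=[(0, 0), (100, 0), (200, 0), (0, 100)],
)
```

The first two vias compete for the same nearest site, so one of them is pushed to the next site; this is the kind of conflict a matching formulation resolves globally rather than greedily.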

Reshaping System Design in 3D Integration: Perspectives and Challenges

  • Hung-Ming Chen
  • Chu-Wen Ho
  • Shih-Hsien Wu
  • Wei Lu
  • Po-Tsang Huang
  • Hao-Ju Chang
  • Chien-Nan Jimmy Liu

In this paper, we depict modern system design methodologies via 3D integration along with the advance of packaging, considering system prototyping, interconnecting, and physical implementation. The corresponding challenges are presented as well.

SESSION: Session 4: 3D ICs, Heterogeneous Integration, and Packaging II

Co-design for Heterogeneous Integration: A Failure Analysis Perspective

  • Erica Douglas
  • Julia Deitz
  • Timothy Ruggles
  • Daniel Perry
  • Damion Cummings
  • Mark Rodriguez
  • Nichole Valdez
  • Brad Boyce

As scaling for CMOS transistors asymptotically approaches the end of Moore’s Law, the need to push into 3D integration schemes to innovate capabilities is gaining significant traction. Further, rapid development of new semiconductor solutions, such as heterogeneous integration, has turned the semiconductor industry’s consistent march towards next generation products into new arenas. In 2018, the Department of Energy Office of Science (DOE SC) released their “Basic Research Needs for Microelectronics,” communicating a strong push towards “parallel but intimately networked efforts to create radically new capabilities,” which they have coined as “co-design.”

Advanced packaging and heterogeneous integration, particularly with mixed semiconductor materials (e.g., CMOS FPGAs & GaN RF amplifiers), is a realm ripe for applicability towards DOE SC’s co-design call to action. In theory, development occurring at all scales across the semiconductor ecosystem, particularly across disciplines that are not traditionally adjacent, should significantly accelerate innovation. In reality, co-design requires a paradigm shift in approach that goes beyond interconnected parallel development. In particular, accurate ground truth data during learning cycles is critical in order to effectively and efficiently communicate across disparate disciplines and advise design iterations across the microelectronics ecosystem.

This talk will outline three orthogonal facets towards co-design for HI: (1) on-going efforts towards development of materials characterization and failure analysis techniques to enable accurate evaluation of materials and heterogeneously integrated components, (2) development of artificial intelligence & machine learning algorithms for large scale, high throughput process development and characterization, and (3) development of capabilities for rapid communication and visualization of data across disparate disciplines.

Goal Driven PCB Synthesis Using Machine Learning and CloudScale Compute

  • Taylor Hogan

X AI is a cloud-based system that leverages machine learning and search to place and route printed circuit boards using physics-based analysis and high-level design. We propose a feedback-based Monte Carlo Tree Search (MCTS) algorithm to explore the space of possible designs. One or more metrics are given to evaluate the quality of designs as MCTS learns about possible solutions. A policy network and a value network are trained during exploration to learn to accurately weight quality actions and identify useful design states. This is performed as a feedback loop in conjunction with other feedforward tools for placement and routing.
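
For readers unfamiliar with MCTS, the selection step is commonly driven by an upper-confidence score that trades off exploitation against exploration. The sketch below shows the textbook UCT rule; it is not taken from the system described above, which additionally uses learned policy and value networks, and the names `uct_score` and `select_child` are hypothetical.

```python
import math
from typing import List, Tuple


def uct_score(
    total_value: float, visits: int, parent_visits: int, c: float = math.sqrt(2)
) -> float:
    """Textbook UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: List[Tuple[str, float, int]]) -> str:
    """children: (name, accumulated_value, visit_count). Returns the name to expand next."""
    parent_visits = sum(visits for _, _, visits in children)
    best = max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits))
    return best[0]


# Hypothetical design moves with accumulated quality scores.
first = select_child([("move_a", 5.0, 10), ("move_b", 0.0, 0)])
second = select_child([("move_a", 9.0, 10), ("move_b", 1.0, 10)])
```

An unvisited move wins regardless of the others' scores, and between equally visited moves the one with the higher mean value wins.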

Gate-All-Around Technology is Coming: What’s Next After GAA?

  • Victor Moroz

Currently, the industry is transitioning from FinFETs to gate-all-around (GAA) technology and will likely have several GAA technology generations in the next few years. What’s next after that? This is the question that we are trying to answer in this project by benchmarking GAA technology with transistors on 2D materials and stacked transistors (CFETs).

The main objective for logic is to get a meaningful gain in power, performance, area, and cost (PPAC). The main objective for SRAM is to get a noticeable density scaling for the SRAM array and its periphery without losing performance and yield. Another objective is to move in the direction that has a promise of longer-term progress, such as to start stacking two layers of transistors before moving to a larger number of transistor layers. With that in mind, we explore and discuss the next steps beyond GAA technology.

SESSION: Session 5: Analog Design

VLSIR – A Modular Framework for Programming Analog & Custom Circuits & Layouts

  • Dan Fritchman

We present VLSIR, a modular and fully open-source framework for programming analog and custom circuits and layouts. VLSIR is centered around a protobuf-defined design database. It features high-productivity front-ends for hardware description (“circuit programming”), simulation, and custom layout programming, designed to be amenable to both human designers and automation.

Joint Optimization of Sizing and Layout for AMS Designs: Challenges and Opportunities

  • Ahmet F. Budak
  • Keren Zhu
  • Hao Chen
  • Souradip Poddar
  • Linran Zhao
  • Yaoyao Jia
  • David Z. Pan

Recent advances in analog device sizing algorithms show promising results on the automatic schematic design. However, the majority of the sizing algorithms are based on schematic-level simulations and layout-agnostic. The physical layout implementation brings extra parasitics to the analog circuits, leading to discrepancies between schematic and post-layout performance. This performance gap raises questions about the effectiveness of automatic analog device sizing tools. Prior work has leveraged procedural layout generation to account for layout-induced parasitics in the sizing process. However, the need for layout templates makes such methodology limited in application. In this paper, we propose to bridge automatic analog sizing with post-layout performance using state-of-the-art optimization-based analog layout generators. A quantitative study is conducted to measure the impact of layout awareness in state-of-the-art device sizing algorithms. Furthermore, we present our perspectives on the future directions in layout-aware analog circuit schematic design.

Learning from the Implicit Functional Hierarchy in an Analog Netlist

  • Helmut Graeb
  • Markus Leibl

Analog circuit design is characterized by a plethora of implicit design and technology aspects available to the experienced designer. In order to create useful computer-aided design methods, this implicit knowledge has to be captured in a systematic and hierarchical way. A key approach to this goal is to “learn” the knowledge from the netlist of an analog circuit. This requires a library of structural and functional blocks for analog circuits together with their individual constraints and performance equations, graph homomorphism techniques to recognize blocks that can have different structural implementations and I/O pins, as well as synthesis methods that exploit the learned knowledge. In this contribution, we will present how to make use of the functional and structural hierarchy of operational amplifiers. As an application, we explore the capabilities of machine learning in the context of structural and functional properties and show that the results can be substantially improved by pre-processing data with traditional methods for functional block analysis. This claim is validated on a data set of roughly 100,000 readily sized and simulated operational amplifiers.

The ALIGN Automated Analog Layout Engine: Progress, Learnings, and Open Issues

  • Sachin S. Sapatnekar

The ALIGN (Analog Layout, Intelligently Generated from Netlists) project [1, 2] is a joint university-industry effort to push the envelope of automated analog layout through a systematic new approach, novel algorithms, and open-source software [3]. Analog automation research has been active for several decades, but has not found widespread acceptance due to its general inability to meet the needs of the design community. Therefore, unlike digital design, which has a rich history of automation and extensive deployment of design tools, analog design is largely unautomated.

ALIGN attempts to overcome several of the major issues associated with this lack of success. First, to mimic the human designer’s ability to recognize sub-blocks and specify constraints, ALIGN has used machine learning (ML) based methods to assist in these tasks. Second, to overcome the limitation of past automation approaches, which are largely specific to a class of designs, ALIGN attempts to create a truly general layout engine by decomposing the layout automation process into a set of steps, with constraints that are specific to the family of circuits. These families are divided into four classes: low-frequency components (e.g., analog-to-digital converters (ADCs), amplifiers, and filters); wireline components for high-speed links (e.g., equalizers, clock/data recovery circuits, and phase interpolators); RF/Wireless components (e.g., components of RF transmitters and receivers); and power delivery components (e.g., capacitor- and inductor-based DC-DC converters and low dropout (LDO) regulators). For each class of circuits, different sets of constraints are important, depending on their frequency, parasitic sensitivity, need for matching, etc., and ALIGN creates a unified methodological framework that can address each class. Third, in each step, ALIGN has generated new algorithms and approaches to help improve the performance of analog layout. Fourth, given that experienced analog designers desire greater visibility into the process and input into the way that design is carried out, ALIGN is built modularly, providing multiple entry points at which a designer may intervene in the process.

Analog Layout Automation On Advanced Process Technologies

  • Soner Yaldiz

Despite the digitization of analog and the disaggregated silicon trends, high-volume or high-performance system-on-chip (SoC) designs integrate numerous analog and mixed-signal (AMS) intellectual property (IP) blocks including voltage regulators, clock generators, sensors, memory and other interfaces. For example, fine-grain dynamic voltage and frequency scaling requires a dedicated clock generator and voltage regulator per compute unit. The design of these blocks in advanced FinFET or GAAFET technologies is challenging due to i) the increasing gap between schematic and post-layout simulation, ii) design rule complexity, and iii) strict reliability rules [1]. The convergence of a high-performance or a high-power block may require multiple iterations of circuit sizing and layout changes. As a result, physical design, which is primarily a manual effort, has become a key bottleneck in the design process. Migrating these blocks across process technologies or process variants only exacerbates the problem. Layout synthesis for AMS IP blocks is an on-going research problem with a long history [2] and is gaining more attention recently to leverage the latest advances in machine learning [3]. Yet neither template-based nor optimization-based approaches have reduced the burden significantly for high-performance products on leading process technologies.

This talk will first overview physical design of AMS IP blocks on an advanced process technology highlighting the opportunities and the expectations from layout automation during this process. On a new process technology, this process starts with conducting early layout studies on a selection of critical high performance or high power subcircuits. In parallel, the IP blocks are placed in a bottom-up fashion to optimize the IP floorplan but also to provide information to SoC floorplanning. Routing follows the placement to verify the post-layout performance. A quick turnaround during these explorations is vital to decide on any architectural changes or circuit re-sizing. The rest of the talk will share experiences with piloting an open-source analog layout synthesis tool flow [4] on a 22nm FinFET technology for voltage regulators [5].

The talk will then summarize the learnings from this exercise and the extensions to the tool flow, which include a Boolean satisfiability-based routing algorithm, a formally verifiable constraint language, and the use of parameterized and standard cells. The talk will conclude with opportunities for research.

SESSION: Session 6: Keynote II

Immersion and EUV Lithography: Two Pillars to Sustain Single-Digit Nanometer Nodes

  • Burn J. Lin

Semiconductor technology has advanced to single-digit nanometer dimensions for the circuit elements. The minimum feature size has reached subwavelength dimension. Many resolution enhancement techniques have been developed to extend the resolution limit of optical lithography systems, namely illumination optimization, phase-shifting masks, and proximity corrections. Needless to say, the actinic wavelength and the numerical aperture of the imaging lens have been reduced in stages. The most recent innovations are immersion lithography and extreme UV (EUV) lithography.

In this presentation, the working principles, advantages, and challenges of immersion lithography are given. The defectivity issue is addressed by showing possible causes and solutions. The circuit design issues for pushing immersion lithography to single-digit nanometer delineation are presented.

Similarly, the working principles, advantages, and challenges of EUV lithography are given. There are special focusses on EUV power requirement, generation, and distribution; EUV mask components, absorber thickness, defects, flatness requirement, and pellicles; EUV resist challenges on sensitivity, line edge roughness, thickness, and etch resistance.

SESSION: Session 7: DFM, Reliability, and Electromigration

Advanced Design Methodologies for Directed Self-Assembly

  • Shao-Yun Fang

Directed self-assembly (DSA), which uses the segregation of a block co-polymer (BCP) after an annealing process to generate tiny feature shapes, has become one of the most promising next-generation lithography technologies. Depending on the proportions of the two monomers in the adopted BCP, either cylinders or lamellae can be generated by removing one of the two monomers; the two cases are referred to as cylindrical DSA and lamellar DSA, respectively. In addition, guiding templates are required to produce trenches before filling in the BCP, such that the additional forces from the trench walls regulate the generated cylinders/lamellae. Both DSA technologies can be used to generate contact/via patterns in circuit layouts, while the practices of designing guiding templates are quite different due to different manufacturing principles. This paper reviews the existing studies on the guiding template design problem for contact/via hole fabrication with the DSA technology. The design constraints are differentiated and the design methodologies are introduced for cylindrical DSA and lamellar DSA, respectively. Finally, possible future research directions are suggested to further enhance contact/via manufacturability and the feasibility of adopting DSA in semiconductor manufacturing.

Challenges for Interconnect Reliability: From Element to System Level

  • Olalla Varela Pedreira
  • Houman Zahedmanesh
  • Youqi Ding
  • Ivan Ciofi
  • Kristof Croes

The high current densities carried by the interconnects have a direct impact on the back-end-of-line (BEOL) reliability degradation as they locally increase the temperature by Joule heating, and they lead to drift in the metal atoms. Local increase in temperature due to Joule heating will lead to thermal gradients along the interconnects inducing degradation through thermomigration. As the power density of the chip increases, thermal gradients may become a major reliability concern for scaled Cu interconnects. Therefore, it is of utmost relevance to fundamentally understand the impact of thermal gradients in metal migration. Our studies show that by using a combined modelling approach and a dedicated test structure we can assess the local temperatures and temperature gradient profiles. Moreover, with long-term experiments, we are able to successfully generate voids at the location of highest temperature gradients. Additionally, the main consequence of scaling the Cu interconnects is the dramatic drop of EM lifetime (Jmax). Currently the experimentally obtained EM parameters are used at system design level to set the current limits through the interconnect networks. However, this approach is very simplistic and neglects the benefits provided by the redundancy and interconnectivity from the network. Our studies, which use a system-level physics-based EM simulation framework that can determine the EM-induced IR drop at the standard cell level, show that the circuit reliability margins of the power delivery network (PDN) can be further relaxed.

Combined Modeling of Electromigration, Thermal and Stress Migration in AC Interconnect Lines

  • Susann Rothe
  • Jens Lienig

The migration of atoms in metal interconnects in integrated circuits (ICs) increasingly endangers chip reliability. The susceptibility of DC interconnects to electromigration has been extensively studied. A few works on thermal migration and AC electromigration are also available. Yet, the combined effect of both on chip reliability has been neglected thus far. This paper provides both FEM and analytical models for atomic migration and steady-state stress profiles in AC interconnects considering electromigration, thermal and stress migration combined. For this we expand existing models by the impact of self-healing, temperature-dependent resistivity, and short wire length. We conclude by analyzing the impact of thermal migration on interconnect robustness and show that it cannot be neglected any longer in migration robustness verification.

Recent Progress in the Analysis of Electromigration and Stress Migration in Large Multisegment Interconnects

  • Nestor Evmorfopoulos
  • Mohammad Abdullah Al Shohel
  • Olympia Axelou
  • Pavlos Stoikos
  • Vidya A. Chhabria
  • Sachin S. Sapatnekar

Traditional approaches to analyzing electromigration (EM) in on-chip interconnects are largely driven by semi-empirical models. However, such methods are inexact for the typical multisegment lines that are found in modern integrated circuits. This paper overviews recent advances in analyzing EM in on-chip interconnect structures based on physics-based models that use partial differential equations, with appropriate boundary conditions, to capture the impact of electron-wind and back-stress forces within an interconnect, across multiple wire segments. Methods for both steady-state and transient analysis are presented, highlighting approaches that can solve these problems with a computation time that is linear in the number of wire segments in the interconnect.
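
For orientation, the two modeling styles contrasted above are commonly written as follows. These are the standard textbook forms (Black's equation and Korhonen's stress-evolution equation), given here as background under the usual one-dimensional assumptions; they are not quoted from the paper, and sign conventions vary between references.

```latex
% Semi-empirical model (Black's equation): mean time to failure of a wire
% carrying current density j at temperature T.
\mathrm{MTTF} = A \, j^{-n} \exp\!\left(\frac{E_a}{k_B T}\right)

% Physics-based model (Korhonen's equation): evolution of the hydrostatic
% stress \sigma(x,t) along a wire segment, combining the back-stress term
% (\partial\sigma/\partial x) and the electron-wind term (\beta j).
\frac{\partial \sigma}{\partial t}
  = \frac{\partial}{\partial x}
    \left[ \kappa \left( \frac{\partial \sigma}{\partial x} + \beta j \right) \right],
\qquad
\kappa = \frac{D_a B \Omega}{k_B T},
\quad
\beta = \frac{e Z^{*} \rho}{\Omega}

% Blocking boundary (zero atomic flux) at a segment end:
\frac{\partial \sigma}{\partial x} + \beta j = 0
```

Here $A$ and $n$ are fitted constants, $E_a$ is the activation energy, $D_a$ the atomic diffusivity, $B$ the effective bulk modulus, $\Omega$ the atomic volume, $Z^{*}$ the effective charge number, and $\rho$ the resistivity. In a multisegment line, each segment obeys the second equation with its own $j$, coupled through stress continuity and flux balance at the junctions.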

Electromigration Assessment in Power Grids with Account of Redundancy and Non-Uniform Temperature Distribution

  • Armen Kteyan
  • Valeriy Sukharev
  • Alexander Volkov
  • Jun Ho Choy
  • Farid N. Najm
  • Yong Hyeon Yi
  • Chris H. Kim
  • Stephane Moreau

A recently proposed methodology for electromigration (EM) assessment in on-chip power/ground grids of integrated circuits has been validated by means of measurements performed on dedicated test grids. IR drop degradation in the grid is used for defining the EM failure criteria. Physics-based models are used to simulate EM-induced stress evolution in interconnect structures, void formation and evolution, resistance increase of the voided segments, and the consequent redistribution of electric current in the redundant grid paths. A grid-like test structure, fabricated with a 65 nm technology and consisting of two metal layers, allowed the voiding models to be calibrated by tracking voltage evolution in all grid nodes in experiment and in simulation. A good fit between the measured and simulated time-to-failure (TTF) probability distributions was obtained for both uniform and non-uniform temperature distributions across the grid. The second test grid was fabricated with a 28 nm technology, consisted of 4 metal layers, and contained power and ground nets connected to “quasi-cells” with poly-resistors, which were specially designed for operating at elevated temperatures of ~350°C. The existing current distributions resulted in different behavior of EM-induced failures in these nets: a gradual voltage evolution in the power net and sharp changes in the ground net were observed in experiment and successfully reproduced in simulations.

SESSION: Session 8: Placement

Placement Initialization via Sequential Subspace Optimization with Sphere Constraints

  • Pengwen Chen
  • Chung-Kuan Cheng
  • Albert Chern
  • Chester Holtz
  • Aoxi Li
  • Yucheng Wang

State-of-the-art analytical placement algorithms for VLSI designs rely on solving nonlinear programs to minimize wirelength and cell congestion. As a consequence, the quality of solutions produced using these algorithms crucially depends on the initial cell coordinates. In this work, we reduce the problem of finding wirelength-minimal initial layouts subject to density and fixed-macro constraints to a Quadratically Constrained Quadratic Program (QCQP). We additionally propose an efficient sequential quadratic programming algorithm to recover a block-globally optimal solution and a subspace method to reduce the complexity of the problem. We extend our formulation to facilitate direct minimization of the Half-Perimeter Wirelength (HPWL) by showing that a corresponding solution can be derived by solving a sequence of reweighted quadratic programs. Critically, our method is parameter-free, i.e., it involves no hyperparameters to tune. We demonstrate that incorporating initial layouts produced by our algorithm with a global analytical placer results in improvements of up to 4.76% in post-detailed-placement wirelength on the ISPD’05 benchmark suite. Our code is available on GitHub.
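
As a reminder of the metric being minimized, HPWL sums, over all nets, the half-perimeter of the bounding box enclosing the net's pins. The following is a minimal sketch of that definition only; the function name and data layout are hypothetical, and pins are assumed to sit at cell centers.

```python
from typing import Dict, List, Tuple


def hpwl(nets: List[List[str]], positions: Dict[str, Tuple[float, float]]) -> float:
    """Half-perimeter wirelength: sum of bounding-box width + height over all nets."""
    total = 0.0
    for pins in nets:
        if not pins:
            continue  # an empty net contributes nothing
        xs = [positions[p][0] for p in pins]
        ys = [positions[p][1] for p in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Hypothetical placement with three cells and three nets.
positions = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0)}
nets = [["a", "b"], ["a", "b", "c"], ["c"]]
wirelength = hpwl(nets, positions)
```

Cell `c` lies inside the bounding box of `a` and `b`, so the three-pin net costs the same as the two-pin net, and the single-pin net contributes zero. The max and min terms make HPWL non-smooth, which is why analytical placers typically work with quadratic or otherwise smoothed surrogates.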

DREAM-GAN: Advancing DREAMPlace towards Commercial-Quality using Generative Adversarial Learning

  • Yi-Chen Lu
  • Haoxing Ren
  • Hao-Hsiang Hsiao
  • Sung Kyu Lim

DREAMPlace is a renowned open-source placer that provides GPU-acceleratable infrastructure for placements of Very-Large-Scale-Integration (VLSI) circuits. However, due to its limited focus on wirelength and density, existing placement solutions of DREAMPlace are not applicable to industrial design flows. To improve DREAMPlace towards commercial-quality without knowing the black-boxed algorithms of the tools, in this paper, we present DREAM-GAN, a placement optimization framework that advances DREAMPlace using generative adversarial learning. At each placement iteration, aside from optimizing the wirelength and density objectives of the vanilla DREAMPlace, DREAM-GAN computes and optimizes a differentiable loss that denotes the similarity score between the underlying placement and the tool-generated placements in commercial databases. Experimental results on 5 commercial and OpenCore designs using an industrial design flow implemented by Synopsys ICC2 not only demonstrate that DREAM-GAN significantly improves the vanilla DREAMPlace at the placement stage across each benchmark, but also show that the improvements last firmly to the post-route stage, where we observe improvements by up to 8.3% in wirelength and 7.4% in total power.

AutoDMP: Automated DREAMPlace-based Macro Placement

  • Anthony Agnesina
  • Puranjay Rajvanshi
  • Tian Yang
  • Geraldo Pradipta
  • Austin Jiao
  • Ben Keller
  • Brucek Khailany
  • Haoxing Ren

Macro placement is a critical very large-scale integration (VLSI) physical design problem that significantly impacts the design power-performance-area (PPA) metrics. This paper proposes AutoDMP, a methodology that leverages DREAMPlace, a GPU-accelerated placer, to place macros and standard cells concurrently in conjunction with automated parameter tuning using a multi-objective hyperparameter optimization technique. As a result, we can generate high-quality predictable solutions, improving the macro placement quality of academic benchmarks compared to baseline results generated from academic and commercial tools. AutoDMP is also computationally efficient, optimizing a design with 2.7 million cells and 320 macros in 3 hours on a single NVIDIA DGX Station A100. This work demonstrates the promise and potential of combining GPU-accelerated algorithms and ML techniques for VLSI design automation.

Assessment of Reinforcement Learning for Macro Placement

  • Chung-Kuan Cheng
  • Andrew B. Kahng
  • Sayak Kundu
  • Yucheng Wang
  • Zhiang Wang

We provide open, transparent implementation and assessment of Google Brain’s deep reinforcement learning approach to macro placement (Nature) and its Circuit Training (CT) implementation in GitHub. We implement in open-source key “blackbox” elements of CT, and clarify discrepancies between CT and Nature. New testcases on open enablements are developed and released. We assess CT alongside multiple alternative macro placers, with all evaluation flows and related scripts public in GitHub. Our experiments also encompass academic mixed-size placement benchmarks, as well as ablation and stability studies. We comment on the impact of Nature and CT, as well as directions for future research.

SESSION: Session 9: New Computing Techniques and Accelerators

GPU Acceleration in Physical Synthesis

  • Evangeline F.Y. Young

Placement and routing are essential steps in physical synthesis of VLSI designs. Modern circuits contain billions of cells and nets, which significantly increases the computational complexity of physical synthesis and brings big challenges to leading-edge physical design tools. With the fast development of GPU architecture and computational power, it becomes an important direction to explore speeding up physical synthesis with massive parallelism on GPUs. In this talk, we will look into opportunities to improve EDA algorithms with GPU acceleration. Traditional EDA tools run on CPUs with a limited degree of parallelism. We will investigate a few examples of accelerating some classical algorithms in placement and routing using GPUs. We will see how one can leverage the power of GPUs to improve both quality and computational time in solving these EDA problems.

Efficient Runtime Power Modeling with On-Chip Power Meters

  • Zhiyao Xie

Accurate and efficient power modeling techniques are crucial for both design-time power optimization and runtime on-chip IC management. In prior research, different types of power modeling solutions have been proposed, optimizing multiple objectives including accuracy, efficiency, temporal resolution, and automation level, targeting various power/voltage-related applications. Despite extensive prior explorations in this topic, new solutions still keep emerging and achieve state-of-the-art performance. This paper aims at providing a review of the recent progress in power modeling, with more focus on runtime on-chip power meter (OPM) development techniques. It also serves as a vehicle for discussing some general development techniques for the runtime on-chip power modeling task.

DREAMPlaceFPGA-PL: An Open-Source GPU-Accelerated Packer-Legalizer for Heterogeneous FPGAs

  • Rachel Selina Rajarathnam
  • Zixuan Jiang
  • Mahesh A. Iyer
  • David Z. Pan

Placement plays a pivotal and strategic role in the FPGA implementation flow to allocate the physical locations of the heterogeneous instances in the design. Among the placement stages, the packing or clustering stage groups logic instances like look-up tables (LUTs) and flip-flops (FFs) that could be placed on the same site. The legalization stage determines all instances’ physical site locations. With advances in FPGA architecture and technology nodes, designs contain millions of logic instances, and placement algorithms must scale accordingly. While other placement stages, namely global placement and detailed placement, have been accelerated using GPUs, the acceleration of packing and legalization stages on a GPU remains largely unexplored. This work presents DREAMPlaceFPGA-PL, an open-source packer-legalizer for heterogeneous FPGAs that employs a GPU for acceleration. We revise the existing consensus-based parallel algorithms employed for packing and legalizing a flat placement to obtain further speedup on a GPU. Our experiments on the ISPD’2016 benchmarks demonstrate more than 2× acceleration.

SESSION: Session 10: Lifetime Achievement Commemoration for Professor Malgorzata Marek-Sadowska

Building Oscillatory Neural Networks: AI Applications and Physical Design Challenges

  • Aida Todri-Sanial

This talk is about a novel computing paradigm based on coupled oscillatory neural networks. Oscillatory neural networks (ONNs) are recurrent neural networks where each neuron is an oscillator and oscillator couplings are the synaptic weights. Inspired by Hopfield Neural Networks, ONNs make use of nonlinear dynamics to compute and solve computational problems such as associative memory tasks and combinatorial optimization problems that are difficult to address with conventional digital computers. An exciting direction in recent years has been to implement Ising machines based on the Ising model of coupled binary spins on magnets. In this talk, I cover the design aspects of building ONNs, from devices to architecture, so as to benefit from parallel computation with oscillators while implementing them in an energy-efficient way.

Optimization of AI SoC with Compiler-assisted Virtual Design Platform

  • Chih-Tsun Huang
  • Juin-Ming Lu
  • Yao-Hua Chen
  • Ming-Chih Tung
  • Shih-Chieh Chang

As deep learning keeps evolving dramatically with rapidly increasing complexity, the demand for efficient hardware accelerators has become vital. However, the lack of software/hardware co-development toolchains makes designing AI SoCs (artificial intelligence systems-on-chip) considerably challenging. This paper presents a compiler-assisted virtual platform to facilitate the development of AI SoCs from the early design stage. The electronic system-level design platform provides rapid functional verification and performance/energy analysis. Cooperating with the neural network compiler, AI software and hardware can be co-optimized on the proposed virtual design platform. Our Deep Inference Processor is also utilized on the virtual design platform to demonstrate the effectiveness of the architectural evaluation and exploration methodology.

Challenges and Opportunities for Computing-in-Memory Chips

  • Xiang Qiu

In recent years, artificial neural networks have been applied to many scenarios, from daily life applications like face detection, to industry problems like placement and routing in physical design. Neural network inference mainly consists of multiply-accumulate operations, which require a huge amount of data movement. Traditional von Neumann architecture computers are inefficient for neural networks as they have separate CPU and memory, and data transfer between them costs excessive energy and performance. To address this problem, in-memory or near-memory computing has been proposed and has attracted much attention in both academia and industry. In this talk, we will give a brief review of non-volatile memory crossbar-based computing-in-memory architecture. Next, we will demonstrate the challenges for chips with such architecture to replace current CPUs/GPUs for neural network processing, from an industry perspective. Lastly, we will discuss possible solutions for those challenges.

ISPD 2023 Lifetime Achievement Award Bio

  • Malgorzata Marek-Sadowska

The 2023 International Symposium on Physical Design lifetime achievement award goes to Professor Malgorzata Marek-Sadowska for her outstanding contributions to the field.

SESSION: Session 11: Keynote III

Neural Operators for Solving PDEs and Inverse Design

  • Anima Anandkumar

Deep learning surrogate models have shown promise in modeling complex physical phenomena such as photonics, fluid flows, molecular dynamics and material properties. However, standard neural networks assume finite-dimensional inputs and outputs, and hence, cannot withstand a change in resolution or discretization between training and testing. We introduce Fourier neural operators that can learn operators, which are mappings between infinite dimensional spaces. They are discretization-invariant and can generalize beyond the discretization or resolution of training data. They can efficiently solve partial differential equations (PDEs) on general geometries. We consider a variety of PDEs for both forward modeling and inverse design problems, as well as show practical gains in the lithography domain.

SESSION: Session 12: Quantum Computing

Quantum Challenges for EDA

  • Leon Stok

Though early in its development, quantum computing is now available on real hardware and via the cloud through IBM Quantum. This radically new kind of computing holds open the possibility of solving some problems that are now and perhaps always will be intractable for “classical” computers.

As with any new technology, things are developing rapidly but there are still a lot of open questions. What is the status of Quantum computers today? What are the key metrics we need to look at to improve a Quantum System? What are some of the technical opportunities being looked at from an EDA perspective?

We will look at the Quantum Roadmap for the next couple of years and outline challenges that need to be solved and how the EDA community can potentially contribute to solving these challenges.

Developing Quantum Workloads for Workload-Driven Co-design

  • Anne Matsuura

Quantum computing offers the future promise of solving problems that are intractable for classical computers today. However, as an entirely new kind of computational device, we must learn how to best develop useful workloads. Today’s small workloads serve the dual purpose that they can also be used to learn how to design a better quantum computing system architecture. At Intel Labs, we develop small application-oriented workloads and use them to drive research into the design of a scalable quantum computing system architecture. We run these small workloads on the small systems of qubits that we have today to understand what is required from the system architecture to run them efficiently and accurately on real qubits. In this presentation, I will give examples of quantum workload-driven co-design and what we have learned from this type of research.

MQT QMAP: Efficient Quantum Circuit Mapping

  • Robert Wille
  • Lukas Burgholzer

Quantum computing is an emerging technology that has the potential to revolutionize fields such as cryptography, machine learning, optimization, and quantum simulation. However, a major challenge in the realization of quantum algorithms on actual machines is ensuring that the gates in a quantum circuit (i.e., corresponding operations) match the topology of a targeted architecture so that the circuit can be executed while, at the same time, the resulting costs (e.g., in terms of the number of additionally introduced gates, fidelity, etc.) are kept low. This is known as the quantum circuit mapping problem. This summary paper provides an overview of QMAP, an open-source tool that is part of the Munich Quantum Toolkit (MQT) and offers efficient, automated, and accessible methods for tackling this problem. To this end, the paper first briefly reviews the problem. Afterwards, it shows how QMAP can be used to efficiently map quantum circuits to quantum computing architectures from both a user’s and a developer’s perspective. QMAP is publicly available as open-source at
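
To make the quantum circuit mapping problem concrete, the following minimal sketch routes two-qubit gates onto a device whose coupling graph allows interactions only between neighboring physical qubits. It does not use QMAP's API or reproduce its methods: the function names `shortest_path` and `naive_map`, the tuple-based gate format, and the greedy "swap along a shortest path" strategy are all assumptions chosen for illustration, and the coupling graph is assumed to be connected.

```python
from collections import deque


def shortest_path(adj, src, dst):
    """Breadth-first search for a shortest path in the coupling graph."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def naive_map(gates, edges, num_qubits):
    """Insert SWAPs so every two-qubit gate acts on adjacent physical qubits.

    gates: list of (logical_a, logical_b) pairs, each treated as a CNOT.
    edges: undirected coupling-graph edges between physical qubits.
    Returns a list of ("swap" | "cx", physical_a, physical_b) operations.
    """
    adj = {q: set() for q in range(num_qubits)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    log2phys = list(range(num_qubits))  # trivial initial layout
    phys2log = list(range(num_qubits))
    ops = []
    for a, b in gates:
        path = shortest_path(adj, log2phys[a], log2phys[b])
        # Move logical qubit a along the path until it neighbors b.
        for p, q in zip(path[:-2], path[1:-1]):
            ops.append(("swap", p, q))
            la, lb = phys2log[p], phys2log[q]
            phys2log[p], phys2log[q] = lb, la
            log2phys[la], log2phys[lb] = q, p
        ops.append(("cx", log2phys[a], log2phys[b]))
    return ops


if __name__ == "__main__":
    line = [(0, 1), (1, 2), (2, 3)]  # four qubits in a row
    for op in naive_map([(0, 1), (0, 3), (1, 2)], line, 4):
        print(op)
```

The number of inserted `swap` operations is the kind of overhead a mapping tool tries to keep low; a greedy scheme like this one gives no optimality guarantee.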

SESSION: Session 13: Panel on EDA for Domain Specific Computing

EDA for Domain Specific Computing: An Introduction for the Panel

  • Iris Hui-Ru Jiang
  • David Chinnery

This panel explores domain-specific computing from hardware, software, and electronic design automation (EDA) perspectives.

Hennessy and Patterson signaled a new “golden age of computer architecture” in 2018 [1]. Process technology advances and general-purpose processor improvements provided much faster and more efficient computation, but scaling with Moore’s law has slowed significantly. Domain-specific customization can improve power-performance efficiency by orders-of-magnitude for important application domains, such as graphics, deep neural networks (DNN) for machine learning [2], simulation, bioinformatics [3], image processing, and many other tasks.

The common features of domain-specific architectures are: 1) dedicated memories to minimize data movement across chip; 2) more arithmetic units or bigger memories; 3) use of parallelism matching the domain; 4) smaller data types appropriate for the target applications; and 5) domain-specific software languages. Expediting software development with optimized compilation for efficient fast computation on heterogeneous architectures is a difficult task, and must be considered with the hardware design. For example, GPU programming has used CUDA and OpenCL.

The hardware comprises application-specific integrated circuits (ASICs) [4] and systems-on-chip (SoCs). General-purpose processor cores are often combined with graphics processing units (GPUs) for stream processing, digital signal processors, field programmable gate arrays (FPGAs) for configurability [5], artificial intelligence (AI) acceleration hardware, and so forth.

Domain-specific computers have been deployed recently. For example: the Google Tensor Processing Unit (DNN ASIC) [6]; Microsoft Catapult (FPGA-based cloud domain-service solution) [7]; Intel Crest (DNN ASIC) [8]; Google Pixel Visual Core (image processing and computer vision for cell phones and tablets) [9]; and the RISC-V architecture and open instruction set for heterogeneous computing [10].

Software-driven Design for Domain-specific Compute

  • Desmond A. Kirkpatrick

The end of Dennard scaling has created a focus on advancing domain-specific computing; we are seeing a renaissance of accelerating compute problems through specialization, with orders-of-magnitude improvement in performance and energy efficiency [1]. Domain-specific compute, with its wide proliferation of domains and narrow specialization of hardware and software, provides unique challenges in design automation not met by the methodologies matured under the model of high-volume manufacturing of competitive CPUs, GPUs, and SoCs [2]. Importantly, domain-specific compute targets smaller markets that move more rapidly, so design NRE plays a much larger role. Secondly, the role of software is so much more significant that we believe a software-first approach, where software drives hardware design and the product is developed at the speed of software, is required to keep pace with domain-specific compute market requirements. This creates significant new challenges and opportunities for EDA to address the domain-specific compute design space. The forces that are driving the renaissance in domain-specific compute architectures also require a renaissance in the tools, flows, and methods to maintain this pace of innovation.

This talk will present a general framework for approaching automation of domain-specific compute co-design of SW/HW and draw upon recent innovations in EDA that can help us address this challenge. The focus will be on driving software-oriented techniques, such as agile design, into hardware design [3], as well as vertically oriented domain-specific codesign automation stacks [4], and some of the gaps in EDA that currently limit these approaches.

Google Investment in Open Source Custom Hardware Development Including No-Cost Shuttle Program

  • Tim Ansell

The end of Moore’s Law combined with unabated growth in usage has forced Google to turn to hardware acceleration to deliver efficiency gains to meet demand. Traditional hardware design methodology for accelerators is practical when there’s a common core, such as with Machine Learning (ML) or video transcoding, but what about the hundreds of smaller tasks performed in Google data centers? Our vision is “software-speed” development for hardware acceleration so that it becomes commonplace and, frankly, boring. Toward this goal Google is investing in open tooling to foster innovation in multiplying accelerator developer productivity.

Tim Ansell will provide an outline of these coordinated open source projects in EDA (including high level synthesis), IP, PDKs, and related areas. This will be followed by presenting the CFU (Custom Function Unit) Playground, which utilizes many of these projects.

The CFU Playground lets you build your own specialized & optimized ML processor based on the open RISC-V ISA, implemented on an FPGA using a fully open source stack. The goal isn’t general ML extensions; it’s about a methodology for building your own extension specialized just for your specific tiny ML model. The extension can range from a few simple new instructions, up to a complex accelerator that interfaces to the CPU via a set of custom instructions; we will show examples of both.

A Case for Open EDA Verticals

  • Zhiru Zhang
  • Matthew Hofmann
  • Andrew Butt

With the end of Dennard scaling and Moore’s Law reaching its limits, domain-specific hardware specialization has become a crucial method for improving compute performance and efficiency for various important applications. Leading companies in competitive fields, such as machine learning and video processing, are building their own in-house technology stacks to better suit their accelerator design needs. However, currently this approach is only a viable option for a few large enterprises that can afford to invest in teams of experts in hardware, systems, and compiler development for high-value applications. In particular, the high license cost of commercial electronic design automation (EDA) tools presents a significant barrier for small and mid-size engineering teams to create new hardware accelerators. These tools are essential for designing, simulating, and testing new hardware, but can be too expensive for smaller teams with limited budgets, reducing their ability to innovate and compete with larger organizations.

More recently, open-source EDA toolflows [1] [12] [11] [5] have emerged which offer a promising alternative to commercial tools, with the potential to provide more cost-effective solutions for hardware development. For example, OpenROAD [1] allows the design of custom ASICs with minimal human intervention and no licensing fees. During initial development, it was also able to take advantage of existing tools such as Yosys [14] and KLayout [6] to reduce the amount of new code required to get a working flow. However, early adoption of open-source alternatives carries risk, as open-source EDA projects often lack important features and are less reliable than commercial options. Additionally, current open-source EDA tools may produce less competitive quality of results (QoR) and may not be able to catch up to commercial solutions anytime soon. Even when EDA tool access is not an issue, designing and implementing special-purpose accelerators using conventional RTL methodology can be unproductive and incurs high non-recurring engineering (NRE) costs. High-level synthesis (HLS) has become increasingly popular in both academia and industry to automatically generate RTL designs from software programs. However, existing HLS tools do not help maintain domain-specific context throughout the design flow (e.g., placement, routing), which makes achieving good QoR difficult without significant manual fine-tuning. This hinders wider adoption of HLS.

We advocate for open EDA verticals as a solution to enabling more widespread use of domain-specific hardware acceleration. The objective is to empower small teams of domain experts to productively develop high-performance accelerators using programming interfaces they are already familiar with. For example, this means supporting domain-specific frameworks like PyTorch or TensorFlow for ML applications. In order for EDA verticals to proliferate, there must first be extensible infrastructure similar to LLVM [8] and MLIR [9] from which to build new tool flows. The proper EDA infrastructure would include novel intermediate representations specifically tailored to the unique challenges in gradually lowering high-level code down to gates.

Addressing the EDA Roadblocks for Domain-specific Compilers: An Industry Perspective

  • Alireza Kaviani

Computer architects are now widely subscribed to domain-specific architectures as being the only path left for major improvements in performance-cost-energy. As a result, future compilers need to go beyond their traditional role of mapping a design input to a generic hardware platform. Emerging domain-specific compilers must subscribe to a broader view in which compilers provide more control to the end users, enabling customization of hardware components to implement their corresponding tasks. Transitioning into this new design paradigm, where control and customization are key enablers, poses new challenges for domain-specific compilers.

Today, generic vendor backend EDA compilers are the only available mechanism to realize a broad range of applications in many domains. The necessity of breadth coverage by commercial tools often leads to implementations that do not take full advantage of the underlying hardware. Domain-specific compilers, on the other hand, can potentially deliver near-spec performance by taking advantage of both application attributes and architecture details. This issue is less pronounced for more generic computing platforms such as CPUs due to leveraging open source as an essential component of software development. However, quality EDA software has remained mostly proprietary. Existing open-source attempts do not produce quality results to be useful commercially at scale. Addressing the EDA roadblocks towards quality domain-specific compilers will require stepping milestones from both industry and community.

This suggests the need for a framework capable of interfacing between closed source vendor backend tools and open-source domain compilers. RapidWright [1] is an example of such a framework that enables a new level of optimization and customization for the application architect to further exploit FPGA silicon capabilities focusing on a specific domain.

There are a few factors that will expedite the progress for this approach. For example, RapidStream [2] demonstrates 30% higher performance and more than 5X faster compile time for data flow applications. The key enabler for the RapidStream domain compiler is the split-compilation that was made possible for data flow applications with a latency-tolerant front-end and design entry. EDA vendors could enable such bottom-up flows by implementing a foundational infrastructure that allows multiple application modules to be implemented independently. Another useful step would be to decouple certain portions of monolithic EDA tools with separate, more permissive licensing to be combined with open-source domain compilers.

Another key step that is required for domain-specific compilers to be successful is a process to offer a guarantee to the end customer. Today’s vendor tool flow offers full guarantee and support to the end customer at the expense of limiting the customization and control. The new paradigm of domain-specific compilers implies many variations of the tool flow, and it might not be feasible to provide the same level of support and guarantee as existing standard flows. The community needs to explore alternative ways of offering an equivalent level of support and guarantee to the end users in order to make domain-specific compilers widely adopted.

High-level Synthesis for Domain Specific Computing

  • Hanchen Ye
  • Hyegang Jun
  • Jin Yang
  • Deming Chen

This paper proposes a High-Level Synthesis (HLS) framework for domain-specific computing. The framework contains three key components: 1) ScaleHLS is a multi-level HLS compilation flow. Aiming to address the lack of expressiveness and hardware-dedicated representation of traditional software-oriented compilers, ScaleHLS introduces a hierarchical intermediate representation (IR) for the progressive optimization of HLS designs defined in various high-level languages. ScaleHLS consists of three levels of optimizations, including graph, loop, and directive levels, to realize an efficient compilation pipeline and generate highly-optimized domain-specific accelerators. 2) AutoScaleDSE is an automated design space exploration (DSE) engine. Real-world HLS designs often come with large design spaces that are difficult for designers to explore. Meanwhile, the connections between different components of an HLS design further complicate the design spaces. In order to address the DSE problem, AutoScaleDSE proposes a random forest classifier and a graph-driven approach to improve the accuracy of estimating the intermediate DSE results while reducing the time and computational cost. With this new approach, AutoScaleDSE can evaluate thousands of HLS design points and find the Pareto-dominating design points within a couple of hours. 3) PyTransform is a flexible pattern-driven design customization flow. Existing HLS flows demand manual code rewriting or intrusive compiler customization to conduct domain-specific optimizations, leading to unscalable or inflexible compiler solutions. PyTransform proposes a Python-based flow that enables users to define custom matching and rewriting patterns at a high level of abstraction, which can be incorporated into the DSL compilation flow in an automatic and scalable manner. In summary, ScaleHLS, AutoScaleDSE, and PyTransform aim to address the challenges present in the compilation, DSE, and customization of existing HLS flows, respectively. With the three key components, our newly proposed HLS framework can deliver a scalable and extensible solution for designing domain-specific languages to automate and speed up the process of designing domain-specific accelerators.
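
The notion of Pareto-dominating design points in design space exploration can be shown with a small filter over candidate designs. This is only a sketch of the definition, not AutoScaleDSE's search procedure (it contains no random forest or graph-driven estimation); the function name `pareto_front` and the `(latency, area)` tuples are hypothetical, and both metrics are assumed to be minimized.

```python
def pareto_front(points):
    """Return the non-dominated (latency, area) points, both minimized.

    A point p is dominated when some other point q is no worse than p in
    both metrics (and differs from p in at least one of them).
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


if __name__ == "__main__":
    # Hypothetical design points: (latency in cycles, area in units).
    designs = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
    print(pareto_front(designs))
```

The quadratic scan is adequate for a few thousand points; larger design spaces would call for a sort-based sweep instead.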

SESSION: Session 14: Hardware Security and Bug Fixing

Security-aware Physical Design against Trojan Insertion, Frontside Probing, and Fault Injection Attacks

  • Jhih-Wei Hsu
  • Kuan-Cheng Chen
  • Yan-Syuan Chen
  • Yu-Hsiang Lo
  • Yao-Wen Chang

The dramatic growth of hardware attacks and the lack of security-aware solutions in design tools lead to severe security problems in modern IC designs. Although many existing countermeasures provide decent protection against security issues, they still lack a global design view with sufficient security consideration at design time. This paper proposes a security-aware framework against Trojan insertion, frontside probing, and fault injection attacks at the design stage. The framework consists of two major techniques: (1) a large-scale shielding method that effectively covers the exposed areas of assets and (2) a cell-movement-based method to eliminate the empty spaces vulnerable to Trojan insertion. Experimental results show that our framework effectively reduces the vulnerability to these attacks and achieves the best overall score compared with the top-3 teams in the 2022 ACM ISPD Security Closure of Physical Layouts Contest.

Security Closure of IC Layouts Against Hardware Trojans

  • Fangzhou Wang
  • Qijing Wang
  • Bangqi Fu
  • Shui Jiang
  • Xiaopeng Zhang
  • Lilas Alrahis
  • Ozgur Sinanoglu
  • Johann Knechtel
  • Tsung-Yi Ho
  • Evangeline F.Y. Young

Due to cost benefits, supply chains of integrated circuits (ICs) are largely outsourced nowadays. However, passing ICs through various third-party providers gives rise to many threats, like piracy of IC intellectual property or insertion of hardware Trojans, i.e., malicious circuit modifications.

In this work, we proactively and systematically harden the physical layouts of ICs against post-design insertion of Trojans. Toward that end, we propose a multiplexer-based logic-locking scheme that is (i) devised for layout-level Trojan prevention, (ii) resilient against state-of-the-art, oracle-less machine learning attacks, and (iii) fully integrated into a tailored, yet generic, commercial-grade design flow. Our work provides in-depth security and layout analysis on a challenging benchmark suite. We show that ours can render layouts resilient, with reasonable overheads, against Trojan insertion in general and also against second-order attacks (i.e., adversaries seeking to bypass the locking defense in an oracle-less setting).

We release our layout artifacts for independent verification [29].

X-Volt: Joint Tuning of Driver Strengths and Supply Voltages Against Power Side-Channel Attacks

  • Saideep Sreekumar
  • Mohammed Ashraf
  • Mohammed Nabeel
  • Ozgur Sinanoglu
  • Johann Knechtel

Power side-channel (PSC) attacks are well-known threats to sensitive hardware like advanced encryption standard (AES) crypto cores. Given the significant impact of supply voltages (VCCs) on power profiles, various countermeasures based on VCC tuning have been proposed, among other defense strategies. Driver strengths of cells, however, have been largely overlooked, despite having direct and significant impact on power profiles as well.

For the first time, we thoroughly explore the prospects of jointly tuning driver strengths and VCCs as novel working principle for PSC-attack countermeasures. Toward this end, we take the following steps: 1) we develop a simple circuit-level scheme for tuning; 2) we implement a CAD flow for design-time evaluation of ASICs, enabling security assessment of ICs before tape-out; 3) we implement a correlation power analysis (CPA) framework for thorough and comparative security analysis; 4) we conduct an extensive experimental study of a regular AES design, implemented in ASIC as well as FPGA fabrics, under various tuning scenarios; 5) we summarize design guidelines for secure and efficient joint tuning.

In our experiments, we observe that runtime tuning is more effective than static tuning, for both ASIC and FPGA implementations. For the latter, the AES core is rendered > 11.8x (i.e., at least 11.8 times) as resilient as the untuned baseline design. Layout overheads can be considered acceptable, with, e.g., around +10% critical-path delay for the most resilient tuning scenario in FPGA.

We release source codes for our methodology, as well as artifacts from the experimental study in [13].
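
The correlation power analysis (CPA) step mentioned in the abstract can be sketched in a few lines. The example below is a deliberately simplified assumption, not the authors' framework: the "traces" are simulated and noise-free, the leakage model is the Hamming weight of `plaintext XOR key` (a real attack on AES would target the S-box output and measured traces), and the names `pearson` and `cpa_recover_key_byte` are hypothetical.

```python
def hamming_weight(value):
    """Number of set bits, a common first-order power leakage model."""
    return bin(value).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def cpa_recover_key_byte(plaintexts, traces):
    """Return the key-byte guess whose predicted leakage correlates best."""
    best_guess, best_corr = None, -2.0
    for guess in range(256):
        model = [hamming_weight(pt ^ guess) for pt in plaintexts]
        corr = pearson(model, traces)
        if corr > best_corr:
            best_guess, best_corr = guess, corr
    return best_guess, best_corr


if __name__ == "__main__":
    secret = 0x3C                   # assumed key byte for the simulation
    plaintexts = list(range(256))
    # Simulated, noise-free power samples: one value per encryption.
    traces = [float(hamming_weight(pt ^ secret)) for pt in plaintexts]
    print(cpa_recover_key_byte(plaintexts, traces))
```

A countermeasure such as tuning supply voltages or driver strengths aims to weaken exactly this correlation, so that the attacker needs more traces before the correct guess stands out.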

Validating the Redundancy Assumption for HDL from Code Clone’s Perspective

  • Jianjun Xu
  • Jiayu He
  • Jingyan Zhang
  • Deheng Yang
  • Jiang Wu
  • Xiaoguang Mao

Automated program repair (APR) is being leveraged in hardware description languages (HDLs) to fix hardware bugs without human involvement. Most existing APR techniques search for donor code (i.e., code fragments for bug fixing) in the original program to generate repairs, which is based on the assumption that donor code can be found in existing source code. The redundancy assumption is the fundamental basis of most APR techniques, which has been widely studied in software by searching code clones of donor code. However, despite a large body of work on code clone detection, researchers have focused almost exclusively on repositories in traditional programming languages, such as C/C++ and Java, while few studies have been done on detecting code clones in HDLs. Furthermore, little attention has been paid to the repetitiveness of bug fixes in hardware designs, which limits automatic repair targeting HDLs. To validate the redundancy assumption for HDL, we perform an empirical study on code clones of real-world bug fixes in Verilog. Based on the empirical results, we find that 17.71% of newly introduced code in bug fixes can be found from the clone pairs of buggy code in the original program, and 11.77% can be found in the file itself. The findings not only validate the assumption but also provide helpful insights for the design of APR targeting HDLs.
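
The redundancy assumption can be made concrete by measuring what fraction of the lines added by a fix already occur in the original source. The sketch below is an assumed, much cruder stand-in for the clone detection used in the study: it matches whole lines exactly after whitespace normalization, and the names `normalize` and `redundancy_ratio` as well as the Verilog snippet are hypothetical.

```python
def normalize(line):
    """Collapse whitespace so formatting differences do not hide a match."""
    return " ".join(line.split())


def redundancy_ratio(original_source, added_lines):
    """Fraction of added (fix) lines that already exist in the original."""
    existing = {
        normalize(line)
        for line in original_source.splitlines()
        if line.strip()
    }
    added = [normalize(line) for line in added_lines if line.strip()]
    if not added:
        return 0.0
    found = sum(1 for line in added if line in existing)
    return found / len(added)


if __name__ == "__main__":
    original = """
    assign a = b & c;
    always @(posedge clk)
        q <= d;
    assign y = a | b;
    """
    fix = ["assign  y = a | b;", "assign z = a ^ b;"]
    print(redundancy_ratio(original, fix))  # one of the two lines is reused
```

Real clone detectors also tolerate renamed identifiers and small edits, so a line-level ratio like this one would tend to underestimate the reuse that a clone-based analysis reports.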

SESSION: Session 15: ISPD 2023 Contest Results and Closing Remarks

Benchmarking Advanced Security Closure of Physical Layouts: ISPD 2023 Contest

  • Mohammad Eslami
  • Johann Knechtel
  • Ozgur Sinanoglu
  • Ramesh Karri
  • Samuel Pagliarini

Computer-aided design (CAD) tools traditionally optimize “only” for power, performance, and area (PPA). However, given the wide range of hardware-security threats that have emerged, future CAD flows must also incorporate techniques for designing secure and trustworthy integrated circuits (ICs). This is because threats that are not addressed during design time will inevitably be exploited in the field, where system vulnerabilities induced by ICs are almost impossible to fix. However, there is currently little experience in designing secure ICs within the CAD community.

This contest seeks to actively engage with the community to close this gap. The theme is security closure of physical layouts, that is, hardening the physical layouts at design time against threats that are executed post-design time. Acting as security engineers, contest participants will proactively analyse and fix the vulnerabilities of benchmark layouts in a blue-team approach. Benchmarks and submissions are based on the generic DEF format and related files.

This contest is focused on the threat of Trojans, with challenging aspects for physical design in general and for hindering Trojan insertion in particular. For one, layouts are based on the ASAP7 library and rules are strict, e.g., no DRC issues and no timing violations are allowed at all. In the alpha/qualifying round, submissions are evaluated using first-order metrics focused on exploitable placement and routing resources, whereas in the final round, submissions are thoroughly evaluated (red-teamed) through actual insertion of different Trojans.


FPGA ’23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

 Full Citation in the ACM Digital Library

SESSION: Keynote I

Compiler Support for Structured Data

  • Saman Amarasinghe

In 1957, the FORTRAN language and compiler introduced multidimensional dense arrays or dense tensors. Subsequent programming languages added a myriad of data structures from lists, sets, hash tables, trees, to graphs. Still, when dealing with extremely large data sets, dense tensors are the only simple and practical solution. However, modern data is anything but dense. Real world data, generated by sensors, produced by computation, or created by humans, often contain underlying structure, such as sparsity, runs of repeated values, or symmetry.

In this talk I will describe how programming languages and compilers can support large data sets with structure. I will introduce TACO, a compiler for sparse data computing. TACO is the first system to automatically generate kernels for any tensor algebra operation on tensors in any of the commonly used formats. It pioneered a new technique for compiling compound tensor expressions into efficient loops in a systematic way. TACO-generated code has performance competitive with best-in-class hand-written codes for tensor and matrix operations. With TACO, I will show how to put sparse array programming on the same compiler transformation and code generation footing as dense array codes. Structured data has immense potential for hardware acceleration. However, instead of one-off single-operation compute engines, with compiler frameworks such as TACO, I believe that it is possible to create hardware for an entire class of sparse computations. With the help of the FPGA community, I am looking forward to such a future.
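
As a hand-written illustration (not TACO output) of the kind of loop nest a sparse tensor compiler targets, the sketch below computes y = A x with A stored in compressed sparse row (CSR) form. The array names `pos`, `crd`, and `vals` are assumptions chosen for this example.

```python
def spmv_csr(pos, crd, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    pos[i]:pos[i+1] is the slice of crd/vals that holds row i,
    crd holds column indices, vals holds the nonzero values.
    """
    num_rows = len(pos) - 1
    y = [0] * num_rows
    for i in range(num_rows):
        # Only the stored nonzeros of row i are visited.
        for p in range(pos[i], pos[i + 1]):
            y[i] += vals[p] * x[crd[p]]
    return y


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
pos = [0, 2, 3, 5]
crd = [0, 2, 2, 0, 1]
vals = [1, 2, 3, 4, 5]
result = spmv_csr(pos, crd, vals, [1, 2, 3])
```

The inner loop runs once per stored nonzero, so the work scales with the number of nonzeros rather than with the dense matrix size.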

SESSION: Session: High-Level Abstraction and Tools

DONGLE: Direct FPGA-Orchestrated NVMe Storage for HLS

  • Linus Y. Wong
  • Jialiang Zhang
  • Jing (Jane) Li

Rapid growth in data size poses increasing computational and memory challenges to data processing. FPGA accelerators and near-storage processing are promising candidates for tackling computational and memory requirements, and many near-storage FPGA accelerators have been shown to be effective in processing large data. However, the current HLS development environment does not allow direct NVMe storage access from the HLS code. As such, users must frequently hand off between HLS and host code to access data in storage, and such a process requires tedious programming to ensure functional correctness. Moreover, since the HLS code uses radically different methods to access storage compared to DRAM, the HLS codebase targeting DRAM-based platforms cannot be easily ported to NVMe-based platforms, resulting in limited code portability and reusability. Furthermore, frequent suspension of HLS kernel and synchronization between CPU and FPGA introduce significant latency overhead and require sophisticated scheduling mechanisms to hide latency.

To address these challenges, we propose a new HLS storage interface named DONGLE that enables direct FPGA-orchestrated NVMe storage access. By providing a unified interface for storage and memory access, DONGLE allows a single-source HLS program to target multiple memory/storage devices, thus making the codebase cleaner, portable, and more efficient. We prototyped DONGLE with an AMD/Xilinx Alveo U200 FPGA and Solidigm DC-P4610 SSD and demonstrate a geomean speed-up of 2.3× and a reduction of lines-of-code by 2.4× on evaluated workloads over the state-of-the-art commercial platform.

FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs

  • Linfeng Du
  • Tingyuan Liang
  • Sharad Sinha
  • Zhiyao Xie
  • Wei Zhang

Multi-die FPGAs are widely adopted to deploy large-scale hardware accelerators. Two factors impede the performance optimization of high-level synthesis (HLS) designs implemented on multi-die FPGAs. On the one hand, the long net delay due to nets crossing die-boundaries results in an NP-hard problem to properly floorplan and pipeline an application. On the other hand, traditional automated searching flow for HLS directive optimizations targets single-die FPGAs, and hence, it cannot consider the resource constraints on each die and the timing issue incurred by the die-crossings. Further, it leads to an excessively long runtime to legalize the floorplanning of HLS designs generated under each group of configurations during directive optimization due to the large design scale.

To co-optimize the directives and floorplan of HLS designs on multi-die FPGAs, we propose the FADO framework, which formulates the directive-floorplan co-search problem based on the multi-choice multi-dimensional bin-packing and solves it using an iterative optimization flow. For each step of directive optimization, a latency-bottleneck-guided greedy algorithm searches for more efficient directive configurations. For floorplanning, instead of repeatedly invoking global floorplanning algorithms, we implement a more efficient incremental floorplan legalization algorithm. It mainly applies the worst-fit strategy from the online bin-packing algorithm to balance the floorplan, together with an offline best-fit-decreasing re-packing step to compact the floorplan, followed by pipelining of the long wires crossing die-boundaries.

Through experiments on a set of HLS designs mixing dataflow and non-dataflow kernels, FADO not only well-automates the co-optimization and finishes within 693X~4925X shorter runtime, compared with DSE assisted by global floorplanning, but also yields an improvement of 1.16X~8.78X in overall workflow execution time after implementation on the Xilinx Alveo U250 FPGA.
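
The two bin-packing heuristics named in this abstract can be sketched as follows. This is a minimal one-dimensional illustration, assuming each module needs a single scalar resource and each die is a bin of equal capacity; the function names are hypothetical, and FADO's actual formulation is multi-choice and multi-dimensional.

```python
def worst_fit(sizes, num_bins, capacity):
    """Online worst-fit: each item goes to the bin with the most free space,
    which tends to balance the load across bins."""
    free = [capacity] * num_bins
    placement = []
    for size in sizes:
        b = max(range(num_bins), key=lambda i: free[i])
        if free[b] < size:
            raise ValueError("item does not fit in any bin")
        free[b] -= size
        placement.append(b)
    return placement, free


def best_fit_decreasing(sizes, num_bins, capacity):
    """Offline best-fit-decreasing: place items from largest to smallest,
    each into the fullest bin that still fits it, which compacts the packing."""
    free = [capacity] * num_bins
    placement = [None] * len(sizes)
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        fits = [b for b in range(num_bins) if free[b] >= sizes[idx]]
        if not fits:
            raise ValueError("item does not fit in any bin")
        b = min(fits, key=lambda i: free[i])
        free[b] -= sizes[idx]
        placement[idx] = b
    return placement, free


balanced = worst_fit([4, 3, 3, 2], num_bins=2, capacity=10)
compact = best_fit_decreasing([4, 3, 3, 2], num_bins=2, capacity=10)
```

On this input, worst-fit leaves both bins equally loaded, while best-fit-decreasing fills one bin completely before using the other.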

Eliminating Excessive Dynamism of Dataflow Circuits Using Model Checking

  • Jiahui Xu
  • Emmet Murphy
  • Jordi Cortadella
  • Lana Josipovic

Recent HLS efforts explore the generation of dynamically scheduled, dataflow circuits from high-level code; their ability to adapt the schedule at runtime to particular data and control outcomes promises superior performance to standard, statically scheduled HLS solutions. However, dataflow circuits are notoriously resource-expensive: their distributed handshake mechanism brings performance benefits in some cases, but causes an unneeded resource overhead when general dynamism is not required. In this work, we present a verification framework based on model checking to systematically reduce the hardware complexity of dataflow circuits. We devise a series of formal proofs that identify the absence of particular behavioral scenarios and use this information to replace the generic dataflow logic with simpler and cheaper control structures. On a set of benchmarks obtained from high-level code, we demonstrate that our technique significantly reduces the resource requirements of dataflow circuits (i.e., it results in LUT and FF reductions of up to 51% and 53%, respectively), while still reaping all performance benefits of dynamic scheduling.

Straight to the Queue: Fast Load-Store Queue Allocation in Dataflow Circuits

  • Ayatallah Elakhras
  • Riya Sawhney
  • Andrea Guerrieri
  • Lana Josipovic
  • Paolo Ienne

Dynamically scheduled high-level synthesis can exploit high levels of parallelism in poorly-predictable control-dominated applications. Yet, dataflow circuits are often generated by literal conversion of basic blocks into circuits interconnected in such a way as to mimic the program’s sequential execution. Although correct and quite effective in many cases, this adherence to control flow still significantly limits exploitable parallelism. Recent research introduced techniques to deliver data tokens directly from producers to consumers and achieved tangible benefits both in circuit complexity and execution time. Unfortunately, while this successfully addressed ordinary data dependencies, the problem of potential dependencies through memory remains open: When no technique can statically disambiguate accesses, circuits must be built with load-store queues (LSQs) which, to reorder accesses safely, need memory accesses to be allocated in the queues in program order. Such in-order allocation still demands control circuitry emulating sequential execution, with its negative impact on parallelization. In this paper, we transform potential memory dependencies into virtual data dependencies and use the new direct token delivery strategy to allocate accesses sequentially into the LSQ. In other words, we exploit more parallelism by constructing control circuitry to emulate exclusively those parts of the control flow strictly necessary for in-order allocation. Our results show that we can achieve up to a 74% reduction in execution time compared to prior work, in some cases, at no area cost.

SESSION: Poster Session I

OMT: A Demand-Adaptive, Hardware-Targeted Bonsai Merkle Tree Framework for Embedded Heterogeneous Memory Platform

  • Rakin Muhammad Shadab
  • Yu Zou
  • Sanjay Gandham
  • Mingjie Lin

Novel flash-based, crash-tolerant, non-volatile memory (NVM) such as Intel’s Optane DC memory brings about new and exciting use-case scenarios for both traditional and embedded computing systems involving Field-Programmable Gate Arrays (FPGAs). However, NVMs cannot be a proper replacement for existing DDR memory modules due to low write endurance and are better suited for a hybrid NVM + volatile memory system. They are also well known to be vulnerable to different memory-based adversaries that demand the use of a robust authentication method such as a Bonsai Merkle Tree (BMT). However, the typical update process of a BMT (eager update) requires updating the entire update chain frequently, affecting run-time performance even for data that is not persistence-critical. The latest intermittent BMT update techniques can help provide better real-time throughput, but they lack crash-consistency.

A heterogeneous memory-based system would, therefore, greatly benefit from an authentication mechanism that can change its update method on-the-fly. Hence we propose a modular, unified and adaptable hardware-based BMT framework called Opportunistic Merkle tree (OMT). OMT combines two BMTs with different update methods and streamlines the BMT read with a common datapath to provide support for both recovery-critical and general data, eliminating the need for individual authentication subsystems for heterogeneous memory platforms. It also allows for a switch between the update methods based on the request type (persistent/intermittent) while considerably reducing the resource overhead compared to standalone BMT implementations. We test OMT on a heterogeneous embedded secure memory system and the setup provides 44% lower memory overhead and up to 22% faster execution in synthetic benchmarks compared to a baseline.
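
To make the cost of an eager update concrete, here is a minimal sketch of a plain binary Merkle tree in which every leaf write recomputes the whole chain of hashes up to the root. It is a simplification under stated assumptions: the class and method names are hypothetical, SHA-256 stands in for whatever hash the hardware uses, and the counter-based structure of a Bonsai Merkle Tree is not modeled.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class MerkleTree:
    """Binary Merkle tree stored as a heap-style array (root at index 1)."""

    def __init__(self, blocks):
        n = len(blocks)
        assert n > 0 and n & (n - 1) == 0, "number of blocks must be a power of two"
        self.n = n
        self.nodes = [b""] * (2 * n)
        for i, blk in enumerate(blocks):
            self.nodes[n + i] = _h(blk)
        for i in range(n - 1, 0, -1):
            self.nodes[i] = _h(self.nodes[2 * i] + self.nodes[2 * i + 1])

    @property
    def root(self) -> bytes:
        return self.nodes[1]

    def eager_update(self, index: int, blk: bytes) -> int:
        """Write one block and immediately rehash every ancestor.
        Returns the number of internal nodes that were recomputed."""
        i = self.n + index
        self.nodes[i] = _h(blk)
        touched = 0
        i //= 2
        while i >= 1:
            self.nodes[i] = _h(self.nodes[2 * i] + self.nodes[2 * i + 1])
            touched += 1
            i //= 2
        return touched


tree = MerkleTree([bytes([i]) for i in range(8)])
old_root = tree.root
recomputed = tree.eager_update(5, b"new data")
```

With 8 leaves, each eager update rehashes 3 internal nodes (log2 of the leaf count), which is the per-write cost that intermittent update schemes try to defer.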

Cyclone-NTT: An NTT/FFT Architecture Using Quasi-Streaming of Large Datasets on DDR- and HBM-based FPGA Platforms

  • Kaveh Aasaraai
  • Emanuele Cesena
  • Rahul Maganti
  • Nicolas Stalder
  • Javier Varela
  • Kevin Bowers

Number-Theoretic-Transform (NTT) is a variation of Fast-Fourier-Transform (FFT) on finite fields. NTT is being increasingly used in blockchain and zero-knowledge proof applications. Although FFT and NTT are widely studied for FPGA implementation, we believe CycloneNTT is the first to solve this problem for large data sets (2^24, 64-bit numbers) that would not fit in the on-chip RAM. CycloneNTT uses a state-of-the-art butterfly network and maps the dataflow to hybrid FIFOs composed of on-chip SRAM and external memory. This results in a quasi-streaming data access pattern that minimizes external memory access latency and maximizes throughput. We implement two variants of CycloneNTT optimized for DDR and HBM external memories. Although historically this problem has been shown to be memory-bound, CycloneNTT’s quasi-streaming access pattern is optimized to the point that when using HBM (Xilinx C1100), the architecture becomes compute-bound. On the DDR-based platform (AWS F1), the latency of the application is equal to the streaming of the entire dataset log(N) times to/from external memory. Moreover, exploiting HBM’s larger number of channels, and following a series of additional optimizations, CycloneNTT only requires log(N)/6 passes.
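
For readers unfamiliar with the transform, the following is a small software sketch of a radix-2 NTT and its inverse. The modulus 998244353 with primitive root 3 is an assumption chosen for illustration (the abstract mentions 64-bit numbers but does not name a field), and the recursive structure is a textbook formulation, not CycloneNTT's architecture.

```python
P = 998244353  # assumed NTT-friendly prime: 119 * 2**23 + 1
G = 3          # a primitive root modulo P


def _transform(a, root):
    """Recursive radix-2 butterfly; root must be a primitive len(a)-th root of unity."""
    n = len(a)
    if n == 1:
        return list(a)
    even = _transform(a[0::2], root * root % P)
    odd = _transform(a[1::2], root * root % P)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
        w = w * root % P
    return out


def ntt(a):
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    return _transform(a, pow(G, (P - 1) // n, P))


def intt(a):
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    inv_root = pow(pow(G, (P - 1) // n, P), P - 2, P)
    inv_n = pow(n, P - 2, P)
    return [x * inv_n % P for x in _transform(a, inv_root)]


# Multiply (1 + 2x) by (3 + 4x) via pointwise products in the transform domain.
fa, fb = ntt([1, 2, 0, 0]), ntt([3, 4, 0, 0])
product = intt([x * y % P for x, y in zip(fa, fb)])
```

The recursion has log2(N) levels and each level touches every element once, which is consistent with the abstract's baseline of streaming the dataset log(N) times when it does not fit on chip.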

AoCStream: All-on-Chip CNN Accelerator With Stream-Based Line-Buffer Architecture

  • Hyeong-Ju Kang

Convolutional neural network (CNN) accelerators are being widely used for their efficiency, but they require a large amount of memory, leading to the use of slow and power-consuming external memories. This paper exploits two schemes to reduce the required memory amount and ultimately to implement a CNN of reasonable performance only with on-chip memory of a practical device like a low-end FPGA. To reduce the memory amount of the intermediate data, a stream-based line-buffer architecture and a dataflow for the architecture are proposed instead of the conventional frame-based architecture, where the amount of the intermediate data memory is proportional to the square of the input image size. The architecture consists of layer-dedicated blocks operating in a pipelined way with the input and output streams. Each convolutional layer block has a line buffer storing just a few rows of input data. The sizes of the line buffers are proportional to the width of the input image, so the architecture requires less intermediate data storage, especially given the trend toward larger input sizes in modern object detection CNNs. In addition, the weight memory is reduced by accelerator-aware pruning. The experimental results show that a whole object detection CNN can be implemented even on a low-end FPGA without an external memory. Compared to previous accelerators with similar object detection accuracy, the proposed accelerator reaches higher throughput even with fewer FPGA resources (LUTs, registers, and DSPs), showing higher efficiency.

Fault Detection on Multi COTS FPGA Systems for Physics Experiments on the International Space Station

  • Tim Oberschulte
  • Jakob Marten
  • Holger Blume

Field-programmable gate arrays (FPGAs) in space applications come with the drawback of radiation effects, which inevitably will occur in devices of small process size. This also applies to the electronics of the Bose Einstein Condensate and Cold Atom Laboratory (BECCAL) apparatus, which is planned to operate on the International Space Station for several years. A total of more than 100 FPGAs distributed in the setup will be used for high-precision control of specialized sensors and actuators at nanosecond scale. Due to the large number of devices in BECCAL, commercial off-the-shelf (COTS) FPGAs are used which are not radiation hardened. In this work, we detect and mitigate radiation effects in an application-specific COTS-FPGA-based communication network. To that end, redundancy is integrated into the design while the firmware is optimized to stay within the FPGA’s resource constraints. A redundant integrity checker module is developed which can notify preceding network devices about data and configuration bit errors. The firmware is evaluated by injecting faults into data and configuration registers in simulation and real hardware. The FPGA resource usage of the firmware is cut down by more than half, enabling the use of double modular redundancy for the switching fabric. Together with the triple modular redundancy protected integrity checker, this combination fully prevents silent data corruptions in the design as shown in simulations and by injecting faults in hardware using the Intel Fault Injection FPGA IP Core while staying within the resource limitations of a COTS FPGA.
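
The difference between the two redundancy schemes mentioned above can be shown with a short software model. This is a generic sketch of double and triple modular redundancy on integer words; the function names are hypothetical and it does not reproduce the firmware described in the abstract.

```python
def dmr_check(a: int, b: int) -> bool:
    """Double modular redundancy: two copies can reveal a fault
    (the copies disagree) but cannot tell which copy is correct."""
    return a == b


def tmr_vote(a: int, b: int, c: int) -> int:
    """Triple modular redundancy: bitwise majority vote over three copies,
    which masks any bit flips confined to a single copy."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1010
upset = golden ^ 0b1000  # model a single-event upset flipping one bit

fault_detected = not dmr_check(golden, upset)
corrected = tmr_vote(golden, golden, upset)
```

Detection alone is enough to avoid silent data corruption, since a flagged word can be discarded or re-requested, whereas voting additionally recovers the correct value.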

Nimblock: Scheduling for Fine-grained FPGA Sharing through Virtualization

  • Meghna Mandava
  • Deming Chen

As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time- and space-multiplex the virtualized FPGA by introducing Nimblock. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores both preemption and pipelining as a scheduling parameter to dynamically change resource allocations, and automatically allocates resources to enable suitable parallelism for an application without additional user input. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA. We evaluate our algorithm and validate its efficacy by measuring results from real workloads running on the board with different real-time constraints and priority levels. In our exploration, we compare our novel Nimblock algorithm against a no-sharing and no-virtualization baseline algorithm and three other algorithms which support sharing and virtualization. We achieve up to 5x lower average response time when compared to the baseline algorithm and up to 2.1x average response time improvement over competitive scheduling algorithms that support sharing within our virtualization environment. We additionally demonstrate up to 49% fewer deadline violations and up to 2.6x lower tail response times when compared to other high-performance algorithms.

Graph-OPU: An FPGA-Based Overlay Processor for Graph Neural Networks

  • Ruiqi Chen
  • Haoyang Zhang
  • Yuhanxiao Ma
  • Enhao Tang
  • Shun Li
  • Yanxiang Zhu
  • Jun Yu
  • Kun Wang

Graph Neural Networks (GNNs) have outstanding performance on graph-structured data and have been extensively accelerated on field-programmable gate arrays (FPGAs) in various ways. However, existing accelerators significantly lack flexibility, especially in the following two aspects: 1) Many FPGA-based accelerators only support one GNN model. 2) The processes of re-synthesizing and bitstream re-generating are very time-consuming for new GNN models. To this end, we propose a highly integrated FPGA-based overlay processor for general GNN accelerations named Graph-OPU. Regarding the data structure and operation irregularity, we customize the instruction sets to support irregular operation patterns in the inference process of GNN models. Then, we customize our datapath and optimize the data format in the microarchitecture to take full advantage of high bandwidth memory (HBM). Moreover, we design the computation module to ensure a unified and fully-pipelined process of sparse matrix multiplication (SpMM) and general matrix multiplication (GEMM). Users can avoid the process of FPGA reconfiguration or RTL regeneration for newly invented GNN models. We implement the hardware prototype on Xilinx Alveo U50 and test the mainstream GNN models with 9 datasets. Graph-OPU achieves an average of 435× and 18× speedup, and 2013× and 109× better energy efficiency, compared with the Intel I7-12700KF processor and NVIDIA RTX3090 GPU, respectively. To the best of our knowledge, Graph-OPU is the first in-depth study on FPGA-based general processors for GNN acceleration with high speedup and energy efficiency.

HMLib: Efficient Data Transfer for HLS Using Host Memory

  • Michael Lo
  • Young-kyu Choi
  • Weikang Qiao
  • Mau-Chung Frank Chang
  • Jason Cong

Streaming applications compose an important portion of the workloads that FPGAs may accelerate but suffer from inefficient data movement. The inefficiency stems from copying data indirectly into the FPGA DRAM rather than directly into its on-chip memory, substantially diminishing the end-to-end speedup, especially for small workloads (hundreds of kilobytes). AMD Xilinx’s Host Memory IP (HMI) aims to address the data movement problem by exposing to the developer a High-Level Synthesis (HLS) interface that moves the data from the host directly to the FPGA’s on-chip memory. However, using HMI purely for its interface without additional code changes incurred a 3.3x slowdown in comparison with the current programming model. The slowdown mainly originates from OpenCL call overhead and the kernel control logic unnecessarily switching states. To overcome these issues, we propose Host Memory Library (HMLib), an efficient HLS-based library that facilitates data transfer on behalf of the user. HMLib not only optimizes the runtime stack for efficient data transfer, but also provides HLS-compatible and user-friendly interfaces. We demonstrate HMLib’s effectiveness for streaming applications (Deflate compression and CRC32) with improvements of up to 36.2X over OpenCL-DDR and up to 79.5X over raw HMI for small-scale data while maintaining little-to-no performance loss for large-scale inputs. We plan to open source our work in the future.

An Efficient High-Speed FFT Implementation

  • Ross Martin

This poster introduces the “BxBFFT” parallel-pipelined Fast Fourier Transform (FFT), which gives higher clock speeds (Fmax) than competitors with substantial savings in power and logic resources. In comparisons with the Xilinx SSR FFT, Spiral FFT, Astron FFT, and ZipCPU FFT, the BxBFFT had clock speeds above 650MHz in cases where all others were below 300MHz. The BxBFFT’s LUTs and power were lower by a factor of ~1.5. The BxBFFT had faster Vivado implementation and faster RTL simulation, for improved productivity in design, testing, and verification. BxBFFT simulations were over 10 times faster than the Xilinx SSR FFT. The BxBFFT supports more features than other FFTs, including real-to-complex FFTs, non-power-of-2 FFTs, and features for high reliability in adverse environments. The BxBFFT’s improved performance has been verified in real applications. One customer design had to operate with a reduced workload due to excessive current draw of the Xilinx SSR FFT. A quick replacement of the Xilinx SSR FFT with the BxBFFT lowered die temperature by 34.8 degrees Celsius and allowed the design to operate under full load. The source of the BxBFFT’s performance is intensive optimization of well-known FFT algorithms, not new algorithms. The BxBFFT’s coding style gives better control over synthesis to avoid and resolve performance bottlenecks. Automated generation of top-level code supports 13 different choices for radix and 2 different choices for data flow at each stage, to make optimal choices for each BxBFFT size. This results in a highly efficient FFT.

Weave: Abstraction for Accelerator Integration of Generated Modules

  • Tuo Dai
  • Bizhao Shi
  • Guojie Luo

As domain-specific accelerators demand multiple functional components for complex applications in a domain, the conventional wisdom for effective development involves module decomposition, module implementation, and module integration. In the recent decade, the generator-based design methodology improves the productivity of module implementation. However, with the guidance of current abstractions, it is difficult to integrate modules implemented by generators because of implicit interface definition, non-unified performance modeling, and fragmented memory management. These disadvantages cause low productivity of the integration flow and low performance of the integrated accelerators.

To address these drawbacks, we propose Weave, an abstraction for the integration of generated modules to facilitate an agile design flow for domain-specific accelerators. The Weave abstraction enables the formulation and automation of optimizing the unified performance model under resource constraints of all modules. We also design a hierarchical memory management method with a corresponding interface to integrate modules under the guidance of modular abstraction. In the experiments, the accelerator developed by Weave achieves 2.17× higher performance in the deep learning domain compared with an open-source accelerator, and the integrated accelerator attains 88.9% peak performance of generated accelerators.

A Novel FPGA Simulator Accelerating Reinforcement Learning-Based Design of Power Converters

  • Zhenyu Xu
  • Miaoxiang Yu
  • Qing Yang
  • Yeonho Jeong
  • Tao Wei

High-efficiency energy conversion systems have become increasingly important due to their wide use in all electronic systems such as data centers, smart mobile devices, E-vehicles, medical instruments, and so forth. Complex and interdependent parameters make optimal designs of power converters challenging to get. Recent research has shown that reinforcement learning (RL) shows great promise in the design of such converter circuits. A trained RL agent can search for optimal design parameters for power conversion circuit topologies under targeted application requirements. Training an RL agent requires numerous circuit simulations. As a result, they may take days to complete, primarily because of the slow time-domain circuit simulation.

This abstract proposes a new FPGA architecture that accelerates the circuit simulation and hence substantially speeds up the RL-based design method for power converters. Our new architecture supports all power electronic circuit converters and their variations. It substantially improves the training speed of RL-based design methods. High-level synthesis (HLS) was used to build the accelerator on an Amazon Web Services (AWS) F1 instance. An AWS virtual PC hosts the training algorithm. The host interacts with the FPGA accelerator by updating the circuit parameters, initiating simulation, and collecting the simulation results during training iterations. A script was created on the host side to facilitate this design method by converting a netlist containing circuit topology and parameters into core matrices in the FPGA accelerator. Experimental results showed a 60x overall speedup of our RL-based design method in comparison with using a popular commercial simulator, PowerSim.

A Fractal Astronomical Correlator Based on FPGA Cluster with Scalability

  • Lin Shu
  • Long Xiao
  • Yafang Song
  • Qiuxiang Fan
  • Guitian Fang
  • Jie Hao

Correlation is a highly computationally intensive and data-intensive signal processing application that is used heavily in radio astronomy for imaging and other measurements. For example, the next generation radio telescope, Square Kilometer Array Low (SKA-L), needs a correlator that calculates up to 22 million cross products, which is a real-time system with continuous input data rates of 6 terabits per second and equivalent computation of 2 Peta-operations per second. Therefore, a flexible and scalable solution with high performance per watt is urgently needed. In this work, a flexible FX correlation architecture based on an FPGA cluster is proposed, which can be fractal at the subsystem level, engine level, and calculation module level, simplifying the data distribution network to increase the system’s scalability. The interconnect network between processing engines is a new two-stage solution, using self-developed data redistribution hardware to decouple full bandwidth correlation into several independent sub-bands’ computation. The most intensive calculations, cross-multiplications among all the antennas, are modularly designed under MATLAB Simulink and AMD Xilinx System Generator, and are parametrized to scale to arbitrary antenna numbers with optional parallel granularity to minimize development effort on different FPGAs or for different applications. Furthermore, a fully FPGA-based FX correlator for a large array with 202 antennas, consisting of 26 F Engines based on AMD Xilinx Kintex-7 325T FPGAs and 13 X Engines based on AMD Xilinx Kintex UltraScale KU115 FPGAs, was deployed in 2022, which is the largest fully FPGA-based astronomical correlator to our knowledge.

Power Side-channel Countermeasures for ARX Ciphers using High-level Synthesis

  • Saya Inagaki
  • Mingyu Yang
  • Yang Li
  • Kazuo Sakiyama
  • Yuko Hara-Azumi

In the era of Internet of Things (IoT), edge devices are considerably diversified and are often designed using high-level synthesis (HLS) to improve design-productivity. A problem here is that HLS tools were originally developed in a security-unaware fashion, inducing vulnerabilities to power side-channel attacks (PSCA), which is a serious threat in IoT. Although PSCA vulnerabilities induced by HLS tools recently started to be discussed, the effects and applicability of existing methods for PSCA-resistant designs using HLS are limited so far. In this paper, we propose a novel HLS-based design method for PSCA-resistant ciphers in hardware. Particularly focusing on lightweight block ciphers composed of Addition-Rotation-XOR (ARX)-based permutations, we studied the effects of applying “threshold implementation”, one of the provably secure countermeasures against PSCA, to behavioral descriptions of the ciphers. In addition, we tuned the scheduling optimization of HLS tools that might cause power side-channel leakage. In our experiment, using ARX-based ciphers (Chaskey, Simon, and Speck) as benchmarks, we implemented the unprotected and protected circuit on FPGA and evaluated the PSCA vulnerability using Welch’s t-test. The results demonstrated that our proposed method can successfully mitigate vulnerabilities to PSCA for all benchmarks. From these results, we provide further discussion on the direction of PSCA countermeasures based on HLS.

Single-Batch CNN Training using Block Minifloats on FPGAs

  • Chuliang Guo
  • Binglei Lou
  • Xueyuan Liu
  • David Boland
  • Philip H.W. Leong

Training convolutional neural networks remains a challenge on resource-limited edge devices due to its intensive computations, large storage requirements, and high bandwidth. Error back-propagation, gradient generation, and weight update usually require high precision to guarantee model accuracy, which places a further burden on computation and bandwidth. This paper presents the first parallel FPGA CNN training accelerator with block minifloat datatypes. We first propose a heuristic bit-width allocation technique to derive a unified 8-bit block minifloat format with a sign bit, 2 exponent bits, and 5 mantissa bits. In contrast to previous techniques, the same data format is used for weights, activations, errors, and gradients. This format achieves accuracy similar to 32-bit single-precision floating point while simplifying the FPGA-based design of computational units such as multiply-and-add. In addition, we propose a unified Conv block to deal with Conv and transposed Conv in the forward and backward paths respectively; and a dilated Conv block with a weight kernel partition scheme for gradient generation. Both Conv blocks support non-unit stride, this being crucial for the residual connections that appear in modern CNNs. For training of ResNet20 on the CIFAR-10 dataset with a batch size of 1, our accelerator on a Xilinx Ultrascale+ ZCU102 FPGA achieves state-of-the-art single-batch throughput of 144.64 and 192.68 GOPs with and without batch normalisation layers respectively.

SESSION: Session: Applications and Design Studies I

A Study of Early Aggregation in Database Query Processing on FPGAs

  • Mehdi Moghaddamfar
  • Norman May
  • Christian Färber
  • Wolfgang Lehner
  • Akash Kumar

In database query processing, aggregation is an operator by which data with a common property is grouped and expressed in a summary form. Early aggregation is a popular method for improving the performance of the aggregation operator. In this paper, we study early aggregation algorithms in the context of query processing acceleration in database systems on FPGAs. The comparative study leads us to set-associative caches with a low inter-reference recency set (LIRS) replacement policy. They show both great performance and modest implementation complexity compared to some of the most prominent early aggregation algorithms. We also present a novel application-specific architecture for implementing set-associative caches. Benchmarks of our implementation show speedups of up to 3x for end-to-end aggregation compared to a state-of-the-art FPGA-based query engine.
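To make the idea of early aggregation with a set-associative cache concrete, here is a minimal software sketch. It assumes a SUM aggregate and, for brevity, substitutes a plain LRU replacement policy for the LIRS policy named in the abstract; the function names and the `num_sets`/`ways` parameters are hypothetical and do not describe the authors' hardware architecture.

```python
from collections import OrderedDict

def early_aggregate(rows, num_sets=4, ways=2):
    """Pre-aggregate (key, value) rows in a small set-associative cache.

    Each set holds at most `ways` groups. On a miss in a full set, the
    least recently used group is evicted and its partial sum is emitted.
    NOTE: LRU is used here as a stand-in for LIRS.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    partials = []
    for key, value in rows:
        cache_set = sets[hash(key) % num_sets]
        if key in cache_set:
            cache_set[key] += value        # hit: fold into the partial sum
            cache_set.move_to_end(key)     # mark as most recently used
        else:
            if len(cache_set) >= ways:     # miss in a full set: evict LRU
                partials.append(cache_set.popitem(last=False))
            cache_set[key] = value
    for cache_set in sets:                 # flush what is still cached
        partials.extend(cache_set.items())
    return partials

def final_aggregate(partials):
    """Merge partial sums; this stage sees fewer rows than the raw input."""
    totals = {}
    for key, value in partials:
        totals[key] = totals.get(key, 0) + value
    return totals

rows = [(k % 5, 1) for k in range(40)] + [(7, 3), (7, 4)]
partials = early_aggregate(rows)
print(len(rows), len(partials), final_aggregate(partials))
```

The final aggregation step remains necessary because an evicted group can reappear later in the stream; the cache only reduces the volume of data that reaches that step.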

FNNG: A High-Performance FPGA-based Accelerator for K-Nearest Neighbor Graph Construction

  • Chaoqiang Liu
  • Haifeng Liu
  • Long Zheng
  • Yu Huang
  • Xiangyu Ye
  • Xiaofei Liao
  • Hai Jin

The k-nearest neighbor graph has emerged as the key data structure for many critical applications. However, it can be notoriously challenging to construct k-nearest neighbor graphs over large graph datasets, especially with a high-dimensional vector feature. Many solutions have been recently proposed to support the construction of k-nearest neighbor graphs. However, these solutions involve substantial memory access and computational overheads, and an architecture-level solution is still absent. To address these issues, we architect FNNG, the first FPGA-based accelerator to support k-nearest neighbor graph construction. Specifically, FNNG is equipped with a block-based scheduling technique to exploit the inherent data locality between vertices. It divides the vertices that are close in space into blocks and processes the vertices at the granularity of the blocks during the construction process. FNNG also adopts a useless-computation aborting technique to identify superfluous computations. It keeps the existing maximum similarity values of all vertices inside the computing unit. In addition, we propose an improved architecture in order to fully utilize both techniques. We implement FNNG on the Xilinx Alveo U280 FPGA card. The results show that FNNG achieves 190x and 2.1x speedups over the state-of-the-art CPU and GPU solutions, running on an Intel Xeon Gold 5117 CPU and an NVIDIA GeForce RTX 3090 GPU, respectively.

ACTS: A Near-Memory FPGA Graph Processing Framework

  • Wole Jaiyeoba
  • Nima Elyasi
  • Changho Choi
  • Kevin Skadron

Despite the high off-chip bandwidth and on-chip parallelism offered by today’s near-memory accelerators, software-based (CPU and GPU) graph processing frameworks still suffer performance degradation from under-utilization of available memory bandwidth because graph traversal often exhibits poor locality. Emerging FPGA-based graph accelerators tackle this challenge by designing specialized graph processing pipelines and application-specific memory subsystems to maximize bandwidth utilization and efficiently utilize high-speed on-chip memory. To use the limited on-chip (BRAM) memory effectively while handling larger graph sizes, several FPGA-based solutions resort to some form of graph slicing or partitioning during preprocessing to stage vertex property data into the BRAM. While this has demonstrated performance superiority for small graphs, this approach breaks down with larger graph sizes. For example, GraphLily [19], a recent high-performance FPGA-based graph accelerator, experiences up to 11X performance degradation between graphs having 3M vertices and 28M vertices. This makes prior FPGA approaches impractical for large graphs.

We propose ACTS, an HBM-enabled FPGA graph accelerator, to address this problem. Rather than partitioning the graph offline to improve spatial locality, we partition vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This optimizes read bandwidth even as the graph size scales. We compare ACTS against Gunrock, a state-of-the-art graph processing accelerator for the GPU, and GraphLily, a recent FPGA-based graph accelerator also utilizing HBM. Our results show a geometric mean speedup of 1.5X, with a maximum speedup of 4.6X, over Gunrock, and a geometric mean speedup of 3.6X, with a maximum speedup of 16.5X, over GraphLily. Our results also show a geometric mean power reduction of 50% and a mean reduction in energy-delay product of 88% over Gunrock.
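The idea of partitioning vertex-update messages by destination vertex ID, rather than partitioning the graph itself, can be sketched in software as follows. This is a simplified illustration under stated assumptions: contiguous ID ranges per partition, a single partitioning pass, and a min-reduction as the update rule are all choices made for this sketch, and the function names are hypothetical rather than part of ACTS.

```python
def partition_updates(updates, num_vertices, num_partitions):
    """Bucket (dst_vertex, value) messages by destination-ID range.

    Each bucket touches one contiguous slice of vertex properties, so
    that slice could be staged in a small fast memory while it is applied.
    """
    span = -(-num_vertices // num_partitions)  # ceiling division
    buckets = [[] for _ in range(num_partitions)]
    for dst, value in updates:
        buckets[dst // span].append((dst, value))
    return buckets

def apply_updates(props, buckets):
    """Apply buckets one at a time; min-reduction is an assumed update rule."""
    for bucket in buckets:
        for dst, value in bucket:
            props[dst] = min(props[dst], value)
    return props

INF = float("inf")
updates = [(9, 4.0), (0, 1.0), (5, 2.0), (9, 3.0), (1, 7.0)]
buckets = partition_updates(updates, num_vertices=10, num_partitions=4)
print(buckets)
print(apply_updates([INF] * 10, buckets))
```

Because messages are bucketed after edge processing, no offline preprocessing of the graph is needed, and the bucket count can be chosen to match the available on-chip memory.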

Exploring the Versal AI Engines for Accelerating Stencil-based Atmospheric Advection Simulation

  • Nick Brown

AMD Xilinx’s new Versal Adaptive Compute Acceleration Platform (ACAP) is an FPGA architecture combining reconfigurable fabric with other on-chip hardened compute resources. AI engines are one of these and, by operating in a highly vectorized manner, they provide significant raw compute that is potentially beneficial for a range of workloads including HPC simulation. However, this technology is still early-on, and as yet unproven for accelerating HPC codes, with a lack of benchmarking and best practice.

This paper presents an experience report, exploring porting of the Piacsek and Williams (PW) advection scheme onto the Versal ACAP, using the chip’s AI engines to accelerate the compute. A stencil-based algorithm, advection is commonplace in atmospheric modelling, including in several codes from the Met Office, which initially developed this scheme. Using this algorithm as a vehicle, we explore optimal approaches for structuring AI engine compute kernels and how best to interface the AI engines with programmable logic. Evaluating performance using a VCK5000 against non-AI engine FPGA configurations on the VCK5000 and Alveo U280, as well as a 24-core Xeon Platinum Cascade Lake CPU and Nvidia V100 GPU, we found that whilst the number of channels between the fabric and AI engines is a limitation, by leveraging the ACAP we can double performance compared to an Alveo U280.

SESSION: Session: Architecture, CAD, and Circuit Design

Regularity Matters: Designing Practical FPGA Switch-Blocks

  • Stefan Nikolic
  • Paolo Ienne

Several techniques have been proposed for automatically searching for FPGA switch-blocks which typically show some tangible advantage over the well-known academic architectures. However, the resulting switch-blocks usually exhibit high levels of irregularity, in contrast with what can be observed in a typical commercial architecture. One wonders whether the architectures produced by such search methods can actually be laid out in an efficient manner while retaining the perceived gains. In this work, we propose a new switch-block exploration method that enhances a recently published search algorithm by combining it with ILP, in order to guarantee that the obtained solution satisfies arbitrary regularity constraints. We measure the impact of regularity constraints commonly seen in industrial architectures (such as limiting the number of different multiplexer sizes or forced sharing of inputs between pairs of multiplexers) and observe benefits to the routability of complex circuits with only a limited reduction in performance.

Turn on, Tune in, Listen up: Maximizing Side-Channel Recovery in Time-to-Digital Converters

  • Colin Drewes
  • Olivia Weng
  • Keegan Ryan
  • Bill Hunter
  • Christopher McCarty
  • Ryan Kastner
  • Dustin Richmond

Voltage fluctuation sensors measure minute changes in an FPGA power distribution network, allowing attackers to extract information from concurrently executing computations. Previous voltage fluctuation sensors make assumptions about the co-tenant computation and require the attacker have a priori access or system knowledge to tune the sensor parameters statically. We present the open-source design of the Tunable Dual-Polarity Time-to-Digital Converter, which introduces three dynamically tunable parameters that optimize signal measurement, including the transition polarity, sample window, frequency, and phase. We show that a properly tuned sensor improves co-tenant classification accuracy by 2.5× over prior work and increases the ability to identify the co-tenant computation and its microarchitectural implementation. Across 13 varying applications, our techniques yield an 80% classification accuracy that generalizes beyond a single board. Finally, our sensor improves the ability of a correlation power analysis attack to rank correct subkey values by 2×.

Post-Radiation Fault Analysis of a High Reliability FPGA Linux SoC

  • Andrew Elbert Wilson
  • Nathan Baker
  • Ethan Campbell
  • Jackson Sahleen
  • Michael Wirthlin

FPGAs are increasingly being used in space and other harsh radiation environments. However, SRAM-based FPGAs are susceptible to radiation in these environments and experience upsets within the configuration memory (CRAM), causing design failure. The effects of CRAM upsets can be mitigated using triple-modular redundancy and configuration scrubbing. This work investigates the reliability of a soft RISC-V SoC system executing the Linux operating system mitigated by TMR and configuration scrubbing. In particular, this paper analyzes the failures of this triplicated system observed at a high-energy neutron radiation experiment. Using a bitstream fault analysis tool, the failures of this system caused by CRAM upsets are traced back to the affected FPGA resource and design logic. This fault analysis identifies the interconnect and I/O as the most vulnerable FPGA resources and the DDR controller logic as the design logic most likely to cause a failure. By identifying the FPGA resources and design logic causing failures in this TMR system, additional design enhancements are proposed to create a more reliable design for harsh radiation environments.

FPGA Technology Mapping with Adaptive Gate Decomposition

  • Longfei Fan
  • Chang Wu

Most existing technology mapping algorithms use graph covering approaches and suffer from the netlist structural bias problem. Chen and Cong proposed a simultaneous simple gate decomposition with technology mapping algorithm that encodes many gate decomposition choices into the netlist. However, their algorithm suffers from long runtime due to the large set of choices. Later, A. Mishchenko et al. proposed a mapping algorithm based on choice generation with so-called lossless synthesis. Nevertheless, their algorithm cannot guarantee finding and keeping all good choices a priori before mapping. In this paper, we propose a simultaneous mapping with gate decomposition algorithm named AGDMap. Our algorithm uses a cut-enumeration-based engine. A bin packing algorithm is used for simple gate decomposition during cut enumeration. Input-sharing-based cut cost computation is used during iterative cut selection to reduce logic duplication. On a set of designs from the EPFL benchmark suite and HLS-generated designs, our algorithm produces results with significant area improvements. Compared with the state-of-the-art ABC lossless synthesis algorithm, for area optimization, our average improvement is 12.4%. For delay optimization, we get results with similar delay and 9.2% area reduction.

FPGA Mux Usage and Routability Estimates without Explicit Routing

  • Jonathan W. Greene

A new algorithm is proposed to rapidly evaluate an FPGA routing architecture without need of explicitly routing benchmark applications. The algorithm takes as input a probability distribution of nets to be accommodated and a description of an architecture. It produces an estimate for the usage of each type of mux in the FPGA (including intra-cluster muxes), valuable feedback to the architect. The estimates are shown to correlate with actual routed applications in both academic and commercial architectures. This is due in part to the algorithm’s novel ability to account for long and multi-fanout nets. Run time is reduced by applying periodic graphs to model FPGAs’ regular structure.

We then show how Percolation Theory (a branch of statistical physics) can be applied to elucidate the relationship between mux usage and routability. We show that any blockages when routing a net are most likely to occur in the neighborhood of its terminals, and demonstrate a quantitative relationship among the post-placement wirelength of an application, the percolation threshold of an architecture, and the channel width required to map the application to the architecture. Supporting experimental data is provided.

SESSION: Banquet and Panel

Open-source and FPGAs: Hardware, Software, Both or None?

  • Dana How
  • Tim Ansell
  • Vaughn Betz
  • Chris Lavin
  • Ted Speers
  • Pierre-Emmanuel Gaillardon

Following in the footsteps of the open-source software movement that is at the foundation of many fundamental infrastructures today, e.g., Linux, the internet, etc., a growing number of open-source hardware initiatives have been impacting our field, e.g., the RISC-V ISA, open chiplet standards, etc.


FPGAs and Their Evolving Role in Domain Specific Architectures: A Case Study of the AMD 400G Adaptive SmartNIC/DPU SoC

  • Jaideep Dastidar

Domain Specific Architectures (DSA) typically apply heterogeneous compute elements such as FPGAs, GPUs, AI Engines, TPUs, etc. towards solving domain-specific problems, and have their accompanying Domain Specific Software. FPGAs have played a prominent role in DSAs for AI, Video Transcoding, Network Acceleration etc. This talk will start by going over a brief historical survey of FPGAs in DSAs and an emerging trend in Domain Specific Accelerators, where the programmable logic element is paired with other heterogeneous compute or acceleration elements. The talk will then perform a case study of AMD’s 400G Adaptive SmartNIC/DPU SoC and the considerations that went into that DSA. The case study includes where, why, and how the programmable logic element was paired with other hardened offload accelerators and embedded processors with the goal of striking the right balance between Software Processing on the embedded cores, Fastpath ASIC-like processing on the Hardened Accelerators, and Adaptive and Composable processing on the integrated FPGA. The talk will describe the data movement between various network, storage and interface acceleration elements and their shared and private memory resources. Throughout the talk, we will focus on the tradeoffs between the FPGA element and the rest of the heterogeneous compute or acceleration elements as they apply to SmartNIC/DPU offload acceleration.

SESSION: Session: Deep Learning

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

  • Jinming Zhuang
  • Jason Lau
  • Hanchen Ye
  • Zhuoping Yang
  • Yubo Du
  • Jack Lo
  • Kristof Denolf
  • Stephen Neuendorffer
  • Alex Jones
  • Jingtong Hu
  • Deming Chen
  • Jason Cong
  • Peipei Zhou

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes?

We identify the biggest system throughput bottleneck resulting from the mismatch of massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently towards different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, MLP, respectively, which obtain 5.40x, 32.51x, 1.00x and 1.00x throughput gains compared to one monolithic accelerator.

Approximate Hybrid Binary-Unary Computing with Applications in BERT Language Model and Image Processing

  • Alireza Khataei
  • Gaurav Singh
  • Kia Bazargan

We propose a novel method for approximate hardware implementation of univariate math functions with significantly fewer hardware resources compared to previous approaches. Examples of such functions include exp(x) and the activation function GELU(x), both used in transformer networks, gamma(x), which is used in image processing, and other functions such as tanh(x), cosh(x), sq(x), and sqrt(x). The method builds on previous works on hybrid binary-unary computing. The novelty in our approach is that we break a function into a number of sub-functions such that implementing each sub-function becomes cheap, and converting the output of the sub-functions to binary becomes almost trivial. Our method also uses self-similarity in functions to further reduce the cost. We compare our method to the conventional binary, previous stochastic computing, and hybrid binary-unary methods on several functions at 8-, 12-, and 16-bit resolutions. While preserving high accuracy, our method outperforms previous works in terms of hardware cost, e.g., tolerating less than 0.01 mean absolute error, our method reduces the (area x latency) cost on average by 5, 7, and 2 orders of magnitude, compared to the conventional binary, stochastic computing, and hybrid binary-unary methods, respectively. Ultimately, we demonstrate the potential benefits of our method for natural language processing and image processing applications. We deploy our method to implement major blocks in an encoding layer of BERT language model, and also the Roberts Cross edge detection algorithm. Both include non-linear functions.

Accelerating Neural-ODE Inference on FPGAs with Two-Stage Structured Pruning and History-based Stepsize Search

  • Lei Cai
  • Jing Wang
  • Lianfeng Yu
  • Bonan Yan
  • Yaoyu Tao
  • Yuchao Yang

Neural ordinary differential equation (Neural-ODE) outperforms conventional deep neural networks (DNNs) in modeling continuous-time or dynamical systems by adopting numerical ODE integration onto a shallow embedded NN. However, Neural-ODE suffers from slow inference due to the costly iterative stepsize search in numerical integration, especially when using higher-order Runge-Kutta (RK) methods and smaller error tolerance for improved integration accuracy. In this work, we first present algorithmic techniques to speed up RK-based Neural-ODE inference: a two-stage coarse-grained/fine-grained structured pruning method based on top-K sparsification that reduces the overall computations by more than 60% in the embedded NN, and a history-based stepsize search method based on past integration steps that reduces the latency for reaching an accepted stepsize by up to 77% in RK methods. A reconfigurable hardware architecture is co-designed based on the proposed speedup techniques, featuring three processing loops to support a programmable embedded NN and a variety of higher-order RK methods. A sparse activation processor with multi-dimensional sorters is designed to exploit structured sparsity in activations. Implemented on a Xilinx Virtex-7 XC7VX690T FPGA and evaluated on a variety of datasets, the prototype accelerator using a more complex 3rd-order RK method achieves more than 2.6x speedup compared to the latest Neural-ODE FPGA accelerator using the simplest Euler method. Compared to software execution on an Nvidia A100 GPU, the inference speedup can be up to 18x.

SESSION: Session: FPGA-Based Computing Engines

hAP: A Spatial-von Neumann Heterogeneous Automata Processor with Optimized Resource and IO Overhead on FPGA

  • Xuan Wang
  • Lei Gong
  • Jing Cao
  • Wenqi Lou
  • Weiya Wang
  • Chao Wang
  • Xuehai Zhou

Regular expression (REGEX) matching tasks drive much research on automata processors (AP). Among them, the von Neumann AP can efficiently utilize on-chip memory to process the Deterministic Finite Automata (DFA), but it is limited to small REGEX sets due to the DFA’s state explosion problem. For large REGEX sets, the spatial AP based on Nondeterministic Finite Automaton (NFA) is the mainstream choice. However, there are two problems with previous FPGA-based spatial AP. First, it cannot obtain a balanced FPGA resource usage (LUT and BRAM), which easily leads to resource shortage. Second, to compress the report output data of large REGEX sets, it uses dynamic report compression, which not only consumes a lot of FPGA resources but also limits performance.

This paper optimizes the resource and IO overhead of spatial AP. First, noticing the resource optimization ability of the von Neumann AP, we propose the flex-hybrid-FA algorithm to generate small hybrid-FAs (an NFA/DFA hybrid model) and further propose the Spatial-von Neumann Heterogeneous AP to deploy hybrid-FA. Under the constraints of the flex-hybrid-FA algorithm, we can obtain balanced and efficient FPGA resource usage. Second, we propose High-Efficient Automata Report Compression (HEARC) with compression ratios of 5.5-47.6x, which relieves the performance bottleneck caused by IO congestion and consumes fewer FPGA resources than previous dynamic report compression approaches. As far as we know, this is the first work to deploy large REGEX sets on low-cost small-scale FPGAs (e.g. Xilinx XCZU3CG). The experimental results show that compared to the previous FPGA-based APs, we reduce power consumption by 4.0-6.6x and improve energy efficiency by 2.7-5.9x.

CSAIL2019 Crypto-Puzzle Solver Architecture

  • Sergey Gribok
  • Bogdan Pasca
  • Martin Langhammer

The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these types of puzzles requires large amounts of intrinsically sequential computations (i.e. computations which cannot be parallelized), with each iteration performing a very large (3072-bit in the case of CSAIL2019) modular multiplication operation. The complexity of each iteration is several times greater than known FPGA implementations, and the number of iterations has been increased by about 1000x compared to LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone versions of the puzzle have been specified.

In this paper, we present an FPGA architecture for the CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method, which is flexible and can fit on a wide variety of sizes of current FPGAs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 21 out of the 28 milestone solutions of the puzzle, which are the first reported results for this problem.
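The intrinsically sequential computation described above can be illustrated with the repeated-squaring loop used by Rivest-style time-lock puzzles, which compute 2^(2^t) mod n one modular squaring at a time. The sketch below uses a toy modulus and iteration count chosen only for illustration; it says nothing about the authors' multi-cycle multiplier, and the real puzzle uses a 3072-bit modulus with a vastly larger t.

```python
def solve_time_lock(n: int, t: int, base: int = 2) -> int:
    """Compute base^(2^t) mod n by t sequential modular squarings.

    Each squaring depends on the previous result, so the loop cannot be
    parallelized across iterations without knowing the factors of n.
    """
    w = base % n
    for _ in range(t):
        w = (w * w) % n
    return w

def solve_with_trapdoor(p: int, q: int, t: int, base: int = 2) -> int:
    """Puzzle creator's shortcut: reduce the exponent modulo phi(n)."""
    n, phi = p * q, (p - 1) * (q - 1)
    return pow(base, pow(2, t, phi), n)

# Toy parameters (assumptions for illustration only).
p, q, t = 1009, 1013, 1000
print(solve_time_lock(p * q, t) == solve_with_trapdoor(p, q, t))
```

The shortcut relies on Euler's theorem and therefore on knowing the factorization of n, which is why a solver without the factors is left with the sequential loop and why per-iteration multiplication latency dominates solving time.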

ENCORE: Efficient Architecture Verification Framework with FPGA Acceleration

  • Kan Shi
  • Shuoxiang Xu
  • Yuhan Diao
  • David Boland
  • Yungang Bao

Verification typically consumes the majority of the time in the hardware development cycle. This is primarily because the multiple iterations needed to debug hardware using software simulation are extremely time-consuming. While FPGAs can be utilised to accelerate the simulation, existing methods either provide limited visibility of design details, or are expensive to check against a reference model dynamically at the system level.

In this paper, we present ENCORE, an FPGA-accelerated framework for processor architecture verification. The design-under-test (DUT) hardware and the corresponding software emulator run simultaneously on the same FPGA with hardened processors. ENCORE embodies hardware modules that dynamically monitor and compare key registers from both the DUT and reference model, pausing the execution if any mismatches are detected. In this case, ENCORE automatically creates snapshots of the current design status, and offloads this to software simulators for further debugging. We demonstrate the performance of ENCORE by running RISC-V processor designs and benchmarks. We show that ENCORE can achieve over 44000x speedup over a traditional software simulation-based approach, while maintaining full visibility and debugging capabilities.

BOBBER: A Prototyping Platform for Batteryless Intermittent Accelerators

  • Vishak Narayanan
  • Rohit Sahu
  • Jidong Sun
  • Henry Duwe

Batteryless systems offer promising platforms to support pervasive, near-sensor intelligence in a sustainable manner. These systems solely rely on ambient energy sources that often provide limited power. One common approach to designing batteryless systems is using intermittent execution—a node banks energy into a capacitive store until a threshold voltage is met and the digital components turn on and consume the banked energy until the energy is depleted and they die. The limited amount of available energy demands the development of application- and domain-specific accelerators to achieve energy efficiency and timeliness. Given the extremely close relationship between volatile state and intermittent behavior, performing actual system prototyping has been critical for demonstrating feasibility of intermittent systems. However, no prototyping platform exists for intermittent accelerators. This paper introduces BOBBER, the first implementation of an intermittent FPGA-based accelerator prototyping platform. We demonstrate BOBBER in the optimization and evaluation of a neural network accelerator powered solely by RF energy harvesting.

SESSION: Poster Session II

Adapting Skip Connections for Resource-Efficient FPGA Inference

  • Olivia Weng
  • Gabriel Marcano
  • Vladimir Loncar
  • Alireza Khodamoradi
  • Nojan Sheybani
  • Farinaz Koushanfar
  • Kristof Denolf
  • Javier Mauricio Duarte
  • Ryan Kastner

Deep neural networks employ skip connections – identity functions that combine the outputs of different layers – to improve training convergence; however, these skip connections are costly to implement in hardware. In particular, for inference accelerators on resource-limited platforms, they require extra buffers, increasing not only on- and off-chip memory utilization but also memory bandwidth requirements. Thus, a network that has skip connections costs more to deploy in hardware than one that has none. We argue that, for certain classification tasks, a network’s skip connections are needed for the network to learn but not necessary for inference after convergence. We thus explore removing skip connections from a fully-trained network to mitigate their hardware cost. From this investigation, we introduce a fine-tuning/retraining method that adapts a network’s skip connections – by either removing or shortening them – to make them fit better in hardware with minimal to no loss in accuracy. With these changes, we decrease resource utilization by up to 34% for BRAMs, 7% for FFs, and 12% for LUTs when implemented on an FPGA.

Multi-bit-width CNN Accelerator with Systolic-in-Systolic Dataflow and Single DSP Multiple Multiplication Scheme

  • Mingqiang Huang
  • Yucen Liu
  • Sixiao Huang
  • Kai Li
  • Qiuping Wu
  • Hao Yu

Multi-bit-width neural networks offer a promising approach to high-performance yet energy-efficient edge computing due to their balance between software algorithm accuracy and hardware efficiency. To date, the FPGA has been one of the core hardware platforms for deploying various neural networks. However, it is still difficult to make full use of the dedicated digital signal processing (DSP) blocks in an FPGA for accelerating multi-bit-width networks. In this work, we develop a state-of-the-art multi-bit-width convolutional neural network accelerator with a novel systolic-in-systolic dataflow and a single DSP multiple multiplication (SDMM) INT2/4/8 execution scheme. Multi-level optimizations have also been adopted to further improve the performance, including a group-vector systolic array for maximizing the circuit efficiency as well as minimizing the systolic delay, and a differential neural architecture search (NAS) method for generating high-accuracy multi-bit-width networks. The proposed accelerator has been deployed on a Xilinx ZCU102, with NAS-optimized VGG16 and Resnet18 networks as case studies. Average performance on accelerating the convolutional layers in VGG16 and Resnet18 is 1289 GOPs and 1155 GOPs, respectively. Throughput for running the full multi-bit-width VGG16 network is 870.73 GOPs at 250 MHz, which exceeds all previous CNN accelerators on the same platform.

Janus: An Experimental Reconfigurable SmartNIC with P4 Programmability and SDN Isolation

  • Bharat Sukhwani
  • Mohit Kapur
  • Alda Ohmacht
  • Liran Schour
  • Martin Ohmacht
  • Chris Ward
  • Chuck Haymes
  • Sameh Asaad

Disparate deployment models of cloud computing pose varying requirements on cloud infrastructure components such as networking, storage, provisioning, and security. Infrastructure providers need to study these and often create custom infrastructure components to satisfy these requirements. A major challenge in the research and development of these cloud infrastructure solutions, however, is the availability of customizable platforms for experimentation and trade-off analysis of the various hardware and software components. Most platforms are either general purpose or bespoke solutions created to assist a particular task, too rigid to allow meaningful customization. In this work, we present a 100G reconfigurable smartNIC prototyping platform called Janus that enables cloud infrastructure research and hardware-software co-design of infrastructure components such as hypervisor, secure boot, software defined networking and distributed storage. The platform provides a path to optimize the stack by offloading the functionalities from the host x86 to the embedded processor on the smartNIC and optimize performance by moving pieces to hardware using P4. Further, our platform provides hardware-enforced isolation of cloud network control plane, thereby securing the control plane from the tenants even for bare-metal deployments.

LAWS: Large-Scale Accelerated Wave Simulations on FPGAs

  • Dimitrios Gourounas
  • Bagus Hanindhito
  • Arash Fathi
  • Dimitar Trenev
  • Lizy John
  • Andreas Gerstlauer

Computing numerical solutions to large-scale scientific computing problems described by partial differential equations is a common task in high-performance computing. Improving the performance and efficiency of such computations is critical to exa-scale computing. Application-specific hardware design is a well-known solution, but the wide range of kernels makes it infeasible to provision supercomputers with accelerators for all applications. This makes reconfigurable platforms a promising direction. In this work, we focus on wave simulations using discontinuous Galerkin solvers, as an important class of applications. Existing work using FPGAs is limited to accelerating specific kernels or small problems that fit into FPGA BRAM. We present LAWS, a generic and configurable architecture for large-scale accelerated wave simulation problems running on FPGAs out of DRAM. LAWS exploits fine- and coarse-grain parallelism using a scalable array of application-specific cores, and incorporates novel dataflow optimizations, including prefetching, kernel fusion, and memory layout optimizations to minimize data transfers and maximize DRAM bandwidth utilization. We further accompany LAWS with an analytical performance model that allows for scaling across technology trends and architecture configurations. We demonstrate LAWS on the simulation of elastic wave equations. Results show that a single FPGA core achieves 69% higher performance than 24 Xeon cores with 13.27x better energy efficiency, when given 1.94x less peak DRAM bandwidth. Scaling to the same peak DRAM bandwidth shows that an FPGA is 3.27x and 1.5x faster than 24 CPU cores and an Nvidia P100 GPU, with 22.3x and 4.53x better efficiency, respectively.

Mitigating the Last-Mile Bottleneck: A Two-Step Approach For Faster Commercial FPGA Routing

  • Shashwat Shrivastava
  • Stefan Nikolic
  • Chirag Ravishankar
  • Dinesh Gaitonde
  • Mirjana Stojilovic

We identified that in modern commercial FPGAs, routing signals from the general interconnect to the inputs of the CLB primitives through a very sparse input interconnect block (IIB) represents a significant runtime bottleneck. This is despite academic research often neglecting the runtime of last-mile routing through the IIB. We propose a two-step routing approach that allows resolving this bottleneck by leveraging massive parallelism of today’s compute infrastructure. The main premise that enables massive parallelization is that once the signals are legally routed in the general interconnect (reaching only the inputs of the IIB, but not the final targets), the remaining last-mile routing through the IIB can be completed independently for each FPGA tile.

We ran experiments using ISPD16 and industrial designs to demonstrate the dominant contribution of last-mile routing to the router’s runtime. We used an architectural model closely resembling Xilinx UltraScale FPGAs, which makes it highly representative of the current state of the art. For ISPD16 benchmarks, we observed that when the router is instructed to complete the entire routing, including its last-mile portion, the average number of heap pushes (a machine-agnostic measure of runtime) increases 4.1× compared to a simplified reference in which last-mile routing is neglected. On industrial designs, the number of heap pushes increased 4.4×. Last-mile routing was successfully completed using a SAT-based router in up to 83% of FPGA tiles. With a slight increase in density of IIB connectivity, we were able to bring the completion success rate up to 100%.

Towards a Machine Learning Approach to Predicting the Difficulty of FPGA Routing Problems

  • Andrew David Gunter
  • Steven Wilton

In this poster, we present a Machine Learning (ML) technique to predict the number of iterations needed for a Pathfinder-based FPGA router to complete a routing problem. Given a placed circuit, our technique uses features gathered on each routing iteration to predict if the circuit is routable and how many more iterations will be required to successfully route the circuit. This enables early exit for routing problems which are unlikely to be completed in a target number of iterations. Such early exit may help to achieve a successful route within tractable time by allowing the user to quickly retry the circuit compilation with a different random seed, a modified circuit design, or a different FPGA. We demonstrate our predictor in the VTR 8 framework; compared to VTR’s predictor, our ML predictor incurs lower prediction errors on the Koios Deep Learning benchmark suite. This corresponds with an approximate time saving of 48% from early rejection of unroutable FPGA designs while also successfully completing 5% more routable designs and having a 93% shorter early exit latency.

An FPGA-Based Weightless Neural Network for Edge Network Intrusion Detection

  • Zachary Susskind
  • Aman Arora
  • Alan T. L. Bacellar
  • Diego L. C. Dutra
  • Igor D. S. Miranda
  • Mauricio Breternitz
  • Priscila M. V. Lima
  • Felipe M. G. França
  • Lizy K. John

Algorithms for mobile networking are increasingly being moved from centralized servers towards the edge in order to decrease latency and improve the user experience. While much of this work is traditionally done using ASICs, 6G emphasizes the adaptability of algorithms for specific user scenarios, which motivates broader adoption of FPGAs. In this paper, we propose the FPGA-based Weightless Intrusion Warden (FWIW), a novel solution for detecting anomalous network traffic on edge devices. While prior work in this domain is based on conventional deep neural networks (DNNs), FWIW incorporates a weightless neural network (WNN), a table lookup-based model which learns sophisticated nonlinear behaviors. This allows FWIW to achieve accuracy far superior to prior FPGA-based work at a very small fraction of the model footprint, enabling deployment on small, low-cost devices. FWIW achieves a prediction accuracy of 98.5% on the UNSW-NB15 dataset with a total model parameter size of just 192 bytes, reducing error by 7.9x and model size by 262x vs. LogicNets, the best prior edge-optimized implementation. Implemented on a Xilinx Virtex UltraScale+ FPGA, FWIW demonstrates a 59x reduction in LUT usage with a 1.6x increase in throughput. The accuracy of FWIW comes within 0.6% of the best-reported result in literature (Edge-Detect), a model several orders of magnitude larger. Our results make it clear that WNNs are worth exploring in the emerging domain of edge networking, and suggest that FPGAs are capable of providing the extreme throughput needed.

A Flexible Toolflow for Mapping CNN Models to High Performance FPGA-based Accelerators

  • Yongzheng Chen
  • Gang Wu

There have been many studies on developing automatic tools for mapping CNN models onto FPGAs. However, challenges remain in designing an easy-to-use toolflow. First, the toolflow should be able to handle models exported from various deep learning frameworks and models with different topologies. Second, the hardware architecture should make better use of on-chip resources to achieve high performance. In this work, we build a toolflow upon Open Neural Network Exchange (ONNX) IR to support different DL frameworks. We also try to maximize the overall throughput via multiple hardware-level efforts. We propose to accelerate the convolution operation by applying parallelism not only at the input and output channel level, but also at the output feature map level. Several on-chip buffers and corresponding management algorithms are also designed to leverage abundant memory resources. Moreover, we employ a fully pipelined systolic array running at 400 MHz as the convolution engine, and develop a dedicated bus to implement the im2col algorithm and provide feature inputs to the systolic array. We generated 4 accelerators with different systolic array shapes and compiled 12 CNN models for each accelerator. Deployed on a Xilinx VCU118 evaluation board, the performance of convolutional layers can reach 3267.61 GOPS, which is 99.72% of the ideal throughput (3276.8 GOPS). We also achieve an overall throughput of up to 2424.73 GOPS. Compared with previous studies, our toolflow is more user-friendly. The end-to-end performance of the generated accelerators is also better than that of related work at the same DSP utilization.

Senju: A Framework for the Design of Highly Parallel FPGA-based Iterative Stencil Loop Accelerators

  • Emanuele Del Sozzo
  • Davide Conficconi
  • Marco D. Santambrogio
  • Kentaro Sano

Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving, seismic simulations, and financial option pricing, to name a few. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional system of points until it reaches convergence. However, due to their iterative and computationally intensive nature, these workloads are highly performance-hungry, demanding specialized solutions to boost performance and reduce power consumption. Here, FPGAs represent a valid architectural choice as their peculiar features enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces Senju, an automation framework for FPGA-based ISL accelerators. Starting from an input description, Senju builds highly parallel hardware modules and automatizes all their design phases. The experimental evaluation shows remarkable and scalable results, reaching significant performance and energy efficiency improvements compared to the other single-FPGA literature approaches.

FPGA Acceleration for Successive Interference Cancellation in Severe Multipath Acoustic Communication Channels

  • Jinfeng Li
  • Yahong Rosa Zheng

This paper proposes a hardware implementation of a Successive Interference Cancellation (SIC) scheme in a Turbo Equalizer for very long multipath fading channels where the intersymbol interference (ISI) channel length L is on the order of 100 taps. To reduce the computational complexity caused by large matrix arithmetic in the SIC, we explore the data dependencies and convolutional nature of the SIC algorithm and propose an FPGA acceleration architecture that takes advantage of the high degree of parallelism and the flexible data movements offered by FPGAs. Instead of reconstructing interference symbol by symbol through direct matrix-vector multiplication for each symbol in a block, we propose a two-stage processing algorithm. The first stage is block-wise processing, where the convolution of the channel impulse response (CIR) vector and the vector consisting of the whole symbol block is computed. The second stage is symbol-wise processing, but it reduces to the multiplication of one symbol and the CIR vector. The results show that for a block of Nblk symbols and a channel of length L, the proposed architecture completes the SIC within 2Nblk+L clock cycles, while direct matrix multiplication requires L×Nblk clock cycles. Implemented on a Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit, the SIC and equalization in one turbo iteration based on this architecture is completed in around 40 µs for a 1024-symbol block and a channel length of L=100. This architecture achieves around 40× speed-up compared with the implementation on a powerful CPU platform.
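The two-stage idea in the abstract above can be sketched in a few lines of Python. This is an illustrative reading of the abstract, not the authors' architecture: the function names, the use of real integer symbols instead of complex ones, and the per-symbol window of L received samples are all assumptions made for the example. It checks that one block-wise convolution followed by a per-symbol correction reproduces the interference obtained by summing over all other symbols directly, and it evaluates the two cycle-count formulas quoted in the abstract.

```python
def convolve(h, x):
    """Full linear convolution of the CIR h with the symbol block x."""
    out = [0] * (len(h) + len(x) - 1)
    for n, xn in enumerate(x):
        for l, hl in enumerate(h):
            out[n + l] += hl * xn
    return out


def interference_two_stage(h, x):
    """Stage 1: one convolution per block. Stage 2: per symbol, subtract
    that symbol's own contribution (one symbol times the CIR vector)."""
    c = convolve(h, x)
    L = len(h)
    return [[c[n + l] - h[l] * x[n] for l in range(L)] for n in range(len(x))]


def interference_direct(h, x):
    """Reference: for each symbol, sum the contributions of all other symbols."""
    L, N = len(h), len(x)
    result = []
    for n in range(N):
        window = []
        for l in range(L):
            m = n + l  # index of the received sample
            total = 0
            for other in range(N):
                tap = m - other
                if other != n and 0 <= tap < L:
                    total += h[tap] * x[other]
            window.append(total)
        result.append(window)
    return result


def cycles_two_stage(n_blk, L):
    """Cycle count quoted in the abstract for the proposed architecture."""
    return 2 * n_blk + L


def cycles_direct(n_blk, L):
    """Cycle count quoted in the abstract for direct matrix multiplication."""
    return L * n_blk


h = [1, 2, 3]        # toy CIR (hypothetical values)
x = [1, -1, 1, 1]    # toy symbol block (hypothetical values)
same = interference_two_stage(h, x) == interference_direct(h, x)
print(same, cycles_two_stage(1024, 100), cycles_direct(1024, 100))
```

For the block size and channel length reported in the abstract (Nblk = 1024, L = 100), the formulas give 2148 versus 102400 cycles, a ratio of roughly 47.7 in cycle count. This ratio is separate from the measured 40× speed-up over the CPU implementation.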

FreezeTime: Towards System Emulation through Architectural Virtualization

  • Sergiu Mosanu
  • Joshua Fixelle
  • Kevin Skadron
  • Mircea Stan

High-end FPGAs enable architecture modeling through emulation with high speed and fidelity. However, the available reconfigurable logic and memory resources limit the size, complexity, and speed of the emulated target designs. The challenge is to map and model large and fast memory hierarchies, such as large caches and mixed main memory, various heterogeneous computation instances, such as CPUs, GPUs, AI/ML processing units and accelerator cores, and communication infrastructure, such as buses and networks. In addition to the spatial dimension, this work uses the temporal dimension, implemented with architectural multiplexing coupled with block-level synchronization, to model a complete system-on-chip architecture. Our approach presents mechanisms to abstract instance plurality while preserving timing in sync. With only a subset of the architecture on the FPGA, we freeze a whole emulated module’s activity and state during the additional time intervals necessary for the action on the virtualized modules to elapse. We demonstrate this technique by emulating a hypothetical system consisting of a processor and an SRAM memory too large to map on the FPGA. For this, we modify a LiteX-generated SoC consisting of a VexRISC-V processor and DDR memory, with the memory controller issuing stall signals that freeze the processor, effectively "hiding" the memory latency. For Linux boot, we measure significant emulation vs. simulation speedup while matching RTL simulation accuracy. The work is open-sourced.

SESSION: Session: Applications and Design Studies II

A Framework for Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform via on-chip Dynamic Tree Management

  • Yuan Meng
  • Rajgopal Kannan
  • Viktor Prasanna

Monte Carlo Tree Search (MCTS) is a widely used search technique in Artificial Intelligence (AI) applications. MCTS manages a dynamically evolving decision tree (i.e., one whose depth and height evolve at run-time) to guide an AI agent toward an optimal policy. In-tree operations are memory-bound, leading to a critical performance bottleneck for large-scale parallel MCTS on general-purpose processors. CPU-FPGA accelerators can alleviate the memory bottleneck of in-tree operations. However, a major challenge for existing FPGA accelerators is the lack of dynamic memory management, which prevents them from efficiently supporting dynamically evolving MCTS trees. In this work, we address this challenge by proposing an MCTS acceleration framework that (1) incorporates an algorithm-hardware co-optimized accelerator design that supports in-tree operations on dynamically evolving trees without expensive hardware reconfiguration; (2) adopts a hybrid parallel execution model to fully exploit the compute power in a CPU-FPGA heterogeneous system; (3) supports a Python-based programming API for easy integration of the proposed accelerator with RL domain-specific benchmarking libraries at run-time. We show that by using our framework, we achieve up to 6.8× speedup and better scalability with the number of parallel workers than state-of-the-art parallel MCTS on multi-core systems.

Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

  • Linghao Song
  • Licheng Guo
  • Suhail Basalama
  • Yuze Chi
  • Robert F. Lucas
  • Jason Cong

The continued growth in the processing power of FPGAs coupled with high bandwidth memories (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient linear solver (CG). FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth and maintain double (FP64) precision accuracy. To tackle the three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed precision scheme to save bandwidth yet still achieve effective double precision quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reusing between on-chip modules to reduce unnecessary off-chip accesses and enable modules working in parallel for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to the Xilinx HPC product, the XcgSolver, Callipepla achieves a speedup of 3.94x, 3.36x higher throughput, and 2.94x better energy efficiency. Compared to an NVIDIA A100 GPU which has 4x the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34x higher energy efficiency. The code is available at

Accelerating Sparse MTTKRP for Tensor Decomposition on FPGA

  • Sasindu Wijeratne
  • Ta-Yang Wang
  • Rajgopal Kannan
  • Viktor Prasanna

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the most computationally intensive kernel in sparse tensor decomposition. In this paper, we propose a hardware-algorithm co-design on FPGA to minimize the execution time of spMTTKRP along all modes of an input tensor. We introduce FLYCOO, a novel tensor format that eliminates the communication of intermediate values to the FPGA external memory during the computation of spMTTKRP along all the modes. Our remapping of the tensor using FLYCOO also balances the workload among multiple Processing Engines (PEs). We propose a parallel algorithm that can concurrently process multiple partitions of the input tensor independent of each other. The proposed algorithm also orders the tensor dynamically during runtime to increase the data locality of the external memory accesses. We develop a custom FPGA accelerator design with (1) PEs consisting of a collection of pipelines that can concurrently process multiple elements of the input tensor and (2) memory controllers to exploit the spatial and temporal locality of the external memory accesses of the computation. Our work achieves a geometric mean of 8.8X and 3.8X speedup in execution time compared with the state-of-the-art CPU and GPU implementations on widely-used real-world sparse tensor datasets.
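For readers unfamiliar with the kernel, the following minimal sketch shows what spMTTKRP computes along one mode of a 3-mode tensor. It uses a plain coordinate (COO) list of nonzeros, not the FLYCOO format or the parallel algorithm proposed in the paper; the function name `sp_mttkrp_mode0` and the toy data are assumptions made for illustration.

```python
def sp_mttkrp_mode0(nonzeros, B, C, num_rows):
    """Mode-0 MTTKRP for a sparse 3-mode tensor stored as (i, j, k, value)
    tuples: out[i][r] += value * B[j][r] * C[k][r] for every nonzero."""
    rank = len(B[0])
    out = [[0.0] * rank for _ in range(num_rows)]
    for i, j, k, val in nonzeros:
        for r in range(rank):
            # Rows of B and C are gathered by the nonzero's coordinates,
            # which is the source of the irregular memory accesses.
            out[i][r] += val * B[j][r] * C[k][r]
    return out


# Toy 2x2x2 tensor with three nonzeros and rank-2 factor matrices.
nonzeros = [(0, 0, 0, 1.0), (0, 1, 1, 2.0), (1, 1, 0, 3.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
result = sp_mttkrp_mode0(nonzeros, B, C, num_rows=2)
print(result)
```

Computing the same product along the other modes changes which index selects the output row, which is why a single fixed ordering of the nonzeros tends to favor one mode over the others.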


ASPDAC ’23: Proceedings of the 28th Asia and South Pacific Design Automation Conference

 Full Citation in the ACM Digital Library

SESSION: Technical Program: Reliability Considerations for Emerging Computing and Memory Architectures

A Fast Semi-Analytical Approach for Transient Electromigration Analysis of Interconnect Trees Using Matrix Exponential

  • Pavlos Stoikos
  • George Floros
  • Dimitrios Garyfallou
  • Nestor Evmorfopoulos
  • George Stamoulis

As integrated circuit technologies are moving to smaller technology nodes, Electromigration (EM) has become one of the most challenging problems facing the EDA industry. While numerical approaches have been widely deployed since they can handle complicated interconnect structures, they tend to be much slower than analytical approaches. In this paper, we present a fast semi-analytical approach, based on the matrix exponential, for the solution of Korhonen’s stress equation at discrete spatial points of interconnect trees, which enables the analytical calculation of EM stress at any time and point independently. The proposed approach is combined with the extended Krylov subspace method to accurately simulate large EM models and accelerate the calculation of the final solution. Experimental evaluation on OpenROAD benchmarks demonstrates that our method achieves 0.5% average relative error over the COMSOL industrial tool while being up to three orders of magnitude faster.

Chiplet Placement for 2.5D IC with Sequence Pair Based Tree and Thermal Consideration

  • Hong-Wen Chiou
  • Jia-Hao Jiang
  • Yu-Teng Chang
  • Yu-Min Lee
  • Chi-Wen Pan

This work develops an efficient chiplet placer with thermal consideration for 2.5D ICs. Combining the sequence-pair based tree, a branch-and-bound method, and advanced placement/pruning techniques, the developed placer can quickly find a solution with optimized total wirelength (TWL), measured as half-perimeter wirelength (HPWL). Additionally, with the post-placement procedure, the placer reduces maximum temperatures with a slight increase in wirelength. Experimental results show that the placer can not only find better optimized TWL (reducing HPWL by 1.035%) but also run up to two orders of magnitude faster than the prior art. With thermal consideration, the placer can reduce the maximum temperature by up to 8.214 °C with an average 5.376% increase in TWL.
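The HPWL metric mentioned in the abstract is the half-perimeter of the bounding box that encloses all pins of a net, summed over nets. A minimal sketch follows; the net names and pin coordinates are hypothetical, and the placer itself is not reproduced here.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def hpwl(pins: List[Point]) -> float:
    """Half-perimeter of the bounding box enclosing all pins of one net."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wirelength(nets: Dict[str, List[Point]]) -> float:
    """Total wirelength estimated as the sum of per-net HPWL values."""
    return sum(hpwl(pins) for pins in nets.values())


# Hypothetical nets with pin locations in arbitrary units.
nets = {
    "n1": [(0.0, 0.0), (3.0, 4.0)],
    "n2": [(1.0, 1.0), (2.0, 5.0), (4.0, 2.0)],
}
print(hpwl(nets["n1"]), total_wirelength(nets))
```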

An On-Line Aging Detection and Tolerance Framework for Improving Reliability of STT-MRAMs

  • Yu-Guang Chen
  • Po-Yeh Huang
  • Jin-Fu Li

Spin-transfer-torque magnetic random-access memory (STT-MRAM) is one of the most promising emerging memories for on-chip memory. However, the magnetic tunnel junction (MTJ) in the STT-MRAM suffers from several reliability threats which degrade the endurance, create defects, and cause memory failure. One of the primary reliability issues comes from time-dependent dielectric breakdown (TDDB) on the MTJ, which shifts the resistance value of the MTJ over time and may lead to read errors. To overcome this challenge, in this paper we present an on-line aging detection and tolerance framework to dynamically monitor the electrical parameter deviations and provide appropriate compensation to avoid read errors. The on-line aging detection mechanism can identify aged words by monitoring read current, and then the aging tolerance mechanism can adjust the reference resistance of the sensing amplifier to compensate for the aging-induced resistance drop of the MTJ. In comparison with existing testing-based aging detection techniques, our mechanism can operate on-line with read operations for both aging detection and tolerance simultaneously with negligible performance overhead. Simulation and analysis results show that the proposed techniques can successfully detect 99% of aged words under process variation and achieve at most 25% reliability improvement of STT-MRAMs.

SESSION: Technical Program: Accelerators and Equivalence Checking

Automated Equivalence Checking Method for Majority Based In-Memory Computing on ReRAM Crossbars

  • Arighna Deb
  • Kamalika Datta
  • Muhammad Hassan
  • Saeideh Shirinzadeh
  • Rolf Drechsler

Recent progress in the fabrication of Resistive Random Access Memory (ReRAM) devices has paved the way for large-scale crossbar structures. In particular, in-memory computing on ReRAM crossbars helps in bridging the processor-memory speed gap for current CMOS technology. To this end, synthesis and mapping of Boolean functions to such crossbars have been investigated by researchers. However, the verification of simple designs on crossbars is still done through manual inspection, sometimes complemented by simulation-based techniques. Clearly, this is an important problem, as real-world designs are complex and have a higher number of inputs. As a result, manual inspection and simulation-based methods are not practical for these designs.

In this paper, we propose what is, to the best of our knowledge, the first automated equivalence checking methodology for majority-based in-memory designs on ReRAM crossbars. Our contributions are twofold: first, we introduce an intermediate data structure called ReRAM Sequence Graph (ReSG) to represent the logic-in-memory design. This in turn is translated into Boolean Satisfiability (SAT) formulas. These SAT formulas are verified against the golden functional specification using the Z3 Satisfiability Modulo Theories (SMT) solver. We validate the proposed method by running widely available benchmarks.
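The notion of checking a majority-based design against a golden specification can be illustrated with a small sketch. It deliberately swaps the paper's SAT/SMT flow for exhaustive truth-table enumeration, which is only feasible for a handful of inputs, and it does not model the ReSG or the crossbar. The full-adder example and all function names are assumptions chosen for illustration.

```python
from itertools import product


def maj(a, b, c):
    """Three-input majority gate."""
    return (a and b) or (a and c) or (b and c)


def golden_full_adder(a, b, c):
    """Golden specification: (sum, carry) of a one-bit full adder."""
    total = a + b + c  # booleans add as integers
    return (total % 2 == 1, total >= 2)


def majority_full_adder(a, b, c):
    """Full adder built only from majority gates and inverters."""
    carry = maj(a, b, c)
    s = maj(not carry, maj(a, b, not c), c)
    return (s, carry)


def buggy_full_adder(a, b, c):
    """Same structure with one inverter missing, to show a detected mismatch."""
    carry = maj(a, b, c)
    s = maj(not carry, maj(a, b, c), c)
    return (s, carry)


def find_counterexample(spec, impl, num_inputs):
    """Return an input assignment where spec and impl differ, else None."""
    for inputs in product([False, True], repeat=num_inputs):
        if spec(*inputs) != impl(*inputs):
            return inputs
    return None


ok = find_counterexample(golden_full_adder, majority_full_adder, 3)
bad = find_counterexample(golden_full_adder, buggy_full_adder, 3)
print(ok, bad)
```

A SAT or SMT solver replaces the explicit loop with a satisfiability query on the disjunction of output mismatches, which is what allows the approach to scale beyond designs with very few inputs.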

An Equivalence Checking Framework for Agile Hardware Design

  • Yanzhao Wang
  • Fei Xie
  • Zhenkun Yang
  • Pasquale Cocchini
  • Jin Yang

Agile hardware design enables designers to produce new design iterations efficiently. Equivalence checking is critical in ensuring that a new design iteration conforms to its specification. In this paper, we introduce an equivalence checking framework for hardware designs represented in HalideIR. HalideIR is a popular intermediate representation in software domains such as deep learning and image processing, and it is increasingly utilized in agile hardware design. We have developed a fully automatic equivalence checking workflow seamlessly integrated with HalideIR and several optimizations that leverage the incremental nature of agile hardware design to scale equivalence checking. Evaluations of two deep learning accelerator designs show our automatic equivalence checking framework scales to hardware designs of practical sizes and detects inconsistencies that manually crafted tests have missed.

Towards High-Bandwidth-Utilization SpMV on FPGAs via Partial Vector Duplication

  • Bowen Liu
  • Dajiang Liu

Sparse matrix-vector multiplication (SpMV) is widely used in many fields and usually dominates the execution time of a task. With large off-chip memory bandwidth, customizable on-chip resources and high-performance floating-point operations, the FPGA is a potential platform to accelerate SpMV tasks. However, because SpMV is memory-intensive and its compressed data formats usually introduce irregular memory access, implementing an SpMV accelerator on an FPGA that achieves high bandwidth utilization (BU) is challenging. Existing works either eliminate irregular memory access at the cost of increased data redundancy or try to locally reduce the port conflicts introduced by irregular memory access, leading to a limited BU improvement. To this end, this paper proposes a high-bandwidth-utilization SpMV accelerator on FPGAs using partial vector duplication, in which a read-conflict-free vector buffer, a write-conflict-free adder tree, and ping-pong-like accumulator registers are carefully designed. The FPGA implementation results show that the proposed design can achieve an average of 1.10x performance speedup compared to the state-of-the-art work.
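The irregular memory access mentioned in the abstract is easiest to see in a reference SpMV over a compressed format. The sketch below uses the common compressed sparse row (CSR) layout as an example; the abstract does not say which format the accelerator uses, so CSR and the toy matrix are assumptions.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is indexed through col_idx, so consecutive reads of x
            # jump around in memory: this is the irregular access.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# 3x4 example matrix:
# [[1, 0, 0, 2],
#  [0, 0, 3, 0],
#  [4, 0, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 3, 2, 0, 2, 3]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 2.0, 3.0, 4.0]
y = spmv_csr(row_ptr, col_idx, values, x)
print(y)
```

When several such reads are issued in parallel, they can target the same on-chip buffer port, which is the kind of conflict that duplicating part of the vector is meant to avoid.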

SESSION: Technical Program: New Frontiers in Cyber-Physical and Autonomous Systems

Safety-Driven Interactive Planning for Neural Network-Based Lane Changing

  • Xiangguo Liu
  • Ruochen Jiao
  • Bowen Zheng
  • Dave Liang
  • Qi Zhu

Neural network-based driving planners have shown great promise in improving task performance of autonomous driving. However, it is critical and yet very challenging to ensure the safety of systems with neural network-based components, especially in dense and highly interactive traffic environments. In this work, we propose a safety-driven interactive planning framework for neural network-based lane changing. To prevent over-conservative planning, we identify the driving behavior of surrounding vehicles and assess their aggressiveness, and then adapt the planned trajectory for the ego vehicle accordingly in an interactive manner. The ego vehicle can proceed to change lanes if a safe evasion trajectory exists even in the predicted worst case; otherwise, it can stay around the current lateral position or return to the original lane. We quantitatively demonstrate the effectiveness of our planner design and its advantage over baseline methods through extensive simulations with diverse and comprehensive experimental settings, as well as in real-world scenarios collected by an autonomous vehicle company.

Safety-Aware Flexible Schedule Synthesis for Cyber-Physical Systems Using Weakly-Hard Constraints

  • Shengjie Xu
  • Bineet Ghosh
  • Clara Hobbs
  • P. S. Thiagarajan
  • Samarjit Chakraborty

With the emergence of complex autonomous systems, multiple control tasks are increasingly being implemented on shared computational platforms. Due to the resource-constrained nature of such platforms in domains such as automotive, scheduling all the control tasks in a timely manner is often difficult. The usual requirement—that all task invocations must meet their deadlines—stems from the isolated design of a control strategy and its implementation (including scheduling) in software. This separation of concerns, where the control designer sets the deadlines, and the embedded software engineer aims to meet them, eases the design and verification process. However, it is not flexible and is overly conservative. In this paper, we show how to capture the deadline miss patterns under which the safety properties of the controllers will still be satisfied. The allowed patterns of such deadline misses may be captured using what are referred to as “weakly-hard constraints.” But scheduling tasks under these weakly-hard constraints is non-trivial since common scheduling policies like fixed-priority or earliest deadline first do not satisfy them in general. The main contribution of this paper is to automatically synthesize schedules from the safety properties of controllers. Using real examples, we demonstrate the effectiveness of this strategy and illustrate that traditional notions of schedulability, e.g., utility ratios, are not applicable when scheduling controllers to satisfy safety properties.
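To make the idea of a weakly-hard constraint concrete, the sketch below checks one common form from the real-time literature: at most m deadline misses in any window of k consecutive invocations. The abstract does not state which form the paper uses, so this (m, k) form, the function name, and the sample patterns are assumptions for illustration only.

```python
from typing import Sequence


def satisfies_weakly_hard(misses: Sequence[bool], m: int, k: int) -> bool:
    """Return True if every window of k consecutive invocations contains
    at most m deadline misses (True in `misses` marks a miss)."""
    if k <= 0 or m < 0:
        raise ValueError("k must be positive and m non-negative")
    in_window = 0
    for i, miss in enumerate(misses):
        in_window += miss
        if i >= k:
            # Slide the window: drop the invocation that just left it.
            in_window -= misses[i - k]
        if in_window > m:
            return False
    return True


# Hypothetical deadline patterns (True = miss).
spread_out = [False, True, False, False, True, False]
back_to_back = [False, True, True, False]
print(satisfies_weakly_hard(spread_out, m=1, k=3),
      satisfies_weakly_hard(back_to_back, m=1, k=3))
```

Both patterns miss a third or more of their deadlines, yet only the first satisfies the constraint. This shows why the placement of misses matters, not just their number.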

Mixed-Traffic Intersection Management Utilizing Connected and Autonomous Vehicles as Traffic Regulators

  • Pin-Chun Chen
  • Xiangguo Liu
  • Chung-Wei Lin
  • Chao Huang
  • Qi Zhu

Connected and autonomous vehicles (CAVs) can realize many revolutionary applications, but mixed traffic that includes both CAVs and human-driven vehicles (HVs) is expected to persist for decades. In this paper, we target the problem of mixed-traffic intersection management and schedule CAVs to control the subsequent HVs. We develop a dynamic programming approach and a mixed integer linear programming (MILP) formulation to optimally solve the problems with the corresponding intersection models. We then propose an MILP-based approach which is more efficient and real-time-applicable than solving the optimal MILP formulation, while keeping good solution quality as well as outperforming the first-come-first-served (FCFS) approach. Experimental results and SUMO simulation indicate that controlling CAVs with our approaches is effective in regulating mixed traffic even if the CAV penetration rate is low, which provides an incentive for early adoption of CAVs.

SESSION: Technical Program: Machine Learning Assisted Optimization Techniques for Analog Circuits

Fully Automated Machine Learning Model Development for Analog Placement Quality Prediction

  • Chen-Chia Chang
  • Jingyu Pan
  • Zhiyao Xie
  • Yaguang Li
  • Yishuang Lin
  • Jiang Hu
  • Yiran Chen

Analog integrated circuit (IC) placement is a heavily manual and time-consuming task that has a significant impact on chip quality. Several recent studies apply machine learning (ML) techniques to directly predict the impact of placement on circuit performance or even guide the placement process. However, the significant diversity in analog design topologies can lead to different impacts on performance metrics (e.g., common-mode rejection ratio (CMRR) or offset voltage). Thus, it is unlikely that the same ML model structure will achieve the best performance for all designs and metrics. In addition, customizing ML models for different designs requires tremendous engineering effort and longer development cycles. In this work, we leverage Neural Architecture Search (NAS) to automatically develop customized neural architectures for different analog circuit designs and metrics. Our proposed NAS methodology supports an unconstrained DAG-based search space containing a wide range of ML operations and topological connections. Our search strategy can efficiently explore this flexible search space and provide every design with the best-customized model to boost the model performance. We make unprejudiced comparisons with the claimed performance of the previous representative work on exactly the same dataset. After fully automated development within only 0.5 days, the generated models achieve 3.61% higher accuracy than the prior art.

Efficient Hierarchical mm-Wave System Synthesis with Embedded Accurate Transformer and Balun Machine Learning Models

  • F. Passos
  • N. Lourenço
  • L. Mendes
  • R. Martins
  • J. Vaz
  • N. Horta

Integrated circuit design in millimeter-wave (mm-Wave) bands is exceptionally complex and dependent on costly electromagnetic (EM) simulations. Therefore, in the past few years, a growing interest has emerged in developing novel optimization-based methodologies for the automatic design of mm-Wave circuits. However, current approaches lack scalability when the circuit/system complexity increases. Besides, many also depend on EM simulators, which degrade their efficiency. This work resorts to hierarchical system partitioning and bottom-up design approaches, where a precise machine learning model – composed of hundreds of seamlessly integrated sub-models that guarantee high accuracy (validated against EM simulations and measurements) up to 200 GHz – is embedded to design passive components, e.g., transformers and baluns. The model generates optimal design surfaces to be fed to the hierarchical levels above or acts as a performance estimator. With the proposed scheme, it is possible to remove the dependency on EM simulations during optimization. The proposed mixed-optimal-surface, performance estimator, and simulation-based bottom-up multiobjective optimization (MOO) are used to fully design a Ka-band mm-Wave transmitter from the device up to the system level in 65-nm CMOS for state-of-the-art specifications.

APOSTLE: Asynchronously Parallel Optimization for Sizing Analog Transistors Using DNN Learning

  • Ahmet F. Budak
  • David Smart
  • Brian Swahn
  • David Z. Pan

Analog circuit sizing is a high-cost process in terms of the manual effort invested and the computation time spent. With rapidly developing technology and high market demand, bringing automated solutions for sizing has attracted great attention. This paper presents APOSTLE, an asynchronously parallel optimization method for sizing analog transistors using Deep Neural Network (DNN) learning. This work introduces several methods to minimize the real time spent on optimization when the sizing task consists of several different simulations with varying time costs. The key contributions of this paper are: (1) a batch optimization framework, (2) a novel deep neural network architecture for exploring design points when the existing solutions are not always fully evaluated, (3) a ranking approximation method based on cheap evaluations, and (4) a theoretical approach to balance between the cheap and the expensive simulations to maximize the optimization efficiency. Our method shows high real-time efficiency compared to other black-box optimization methods both on small building blocks and on large industrial circuits while reaching similar or better performance.

SESSION: Technical Program: Machine Learning for Reliable, Secure, and Cool Chips: A Journey from Transistors to Systems

ML to the Rescue: Reliability Estimation from Self-Heating and Aging in Transistors All the Way up Processors

  • Hussam Amrouch
  • Florian Klemme

With increasingly confined 3D structures and newly-adopted materials of higher thermal resistance, transistor self-heating has risen to a critical reliability threat in state-of-the-art and emerging process nodes. One of the challenges of transistor self-heating is accelerated transistor aging, which leads to earlier failure of the chip if not considered appropriately. Nevertheless, adequate consideration of accelerated aging effects, induced by self-heating, throughout a large circuit design is profoundly challenging due to the large gap between where self-heating does originate (i.e., at the transistor level) and where its ultimate effect occurs (i.e., at the circuit and system levels). In this work, we demonstrate an end-to-end workflow starting from self-heating and aging effects in individual transistors all the way up to large circuits and processor designs. We demonstrate that with our accurately estimated degradations, the required timing guardband to ensure reliable operation of circuits is considerably reduced by up to 96% compared to otherwise worst-case estimations that are conventionally employed.

Graph Neural Networks: A Powerful and Versatile Tool for Advancing Design, Reliability, and Security of ICs

  • Lilas Alrahis
  • Johann Knechtel
  • Ozgur Sinanoglu

Graph neural networks (GNNs) have pushed the state-of-the-art (SOTA) for performance in learning and predicting on large-scale data present in social networks, biology, etc. Since integrated circuits (ICs) can naturally be represented as graphs, there has been a tremendous surge in employing GNNs for machine learning (ML)-based methods for various aspects of IC design. Given this trajectory, there is a timely need to review and discuss some powerful and versatile GNN approaches for advancing IC design.

In this paper, we propose a generic pipeline for tailoring GNN models toward solving challenging problems for IC design. We outline promising options for each pipeline element, and we discuss selected and promising works, like leveraging GNNs to break SOTA logic obfuscation. Our comprehensive overview of GNN frameworks covers (i) electronic design automation (EDA) and IC design in general, (ii) design of reliable ICs, and (iii) design as well as analysis of secure ICs. We also provide our overview and related resources in the GNN4IC hub. Finally, we discuss interesting open problems for future research.

Detection and Classification of Malicious Bitstreams for FPGAs in Cloud Computing

  • Jayeeta Chaudhuri
  • Krishnendu Chakrabarty

As FPGAs are increasingly shared and remotely accessed by multiple users and third parties, they introduce significant security concerns. Modules running on an FPGA may include circuits that induce voltage-based fault attacks and denial-of-service (DoS). An attacker might configure some regions of the FPGA with bitstreams that implement malicious circuits. Attackers can also perform side-channel analysis and fault attacks to extract secret information (e.g., secret key of an AES encryption). In this paper, we present a convolutional neural network (CNN)-based defense to detect bitstreams of RO-based malicious circuits by analyzing the static features extracted from FPGA bitstreams. We further explore the criticality of RO-based circuits in order to detect malicious Trojans that are configured on the FPGA. Evaluation on Xilinx FPGAs demonstrates the effectiveness of the security solutions.

Learning Based Spatial Power Characterization and Full-Chip Power Estimation for Commercial TPUs

  • Jincong Lu
  • Jinwei Zhang
  • Wentian Jin
  • Sachin Sachdeva
  • Sheldon X.-D. Tan

In this paper, we propose a novel approach for the real-time estimation of chip-level spatial power maps for commercial Google Coral M.2 TPU chips based on a machine-learning technique for the first time. The new method can enable the development of more robust runtime power and thermal control schemes to take advantage of spatial power information such as hot spots that are otherwise not available. Different from the existing commercial multi-core processors in which real-time performance-related utilization information is available, the TPU from Google does not have such information. To mitigate this problem, we propose to use features that are related to the workloads of running different deep neural networks (DNN) such as the hyperparameters of DNN and TPU resource information generated by the TPU compiler. The new approach involves the offline acquisition of accurate spatial and temporal temperature maps captured from an external infrared thermal imaging camera under nominal working conditions of a chip. To build the dynamic power density map model, we apply generative adversarial networks (GAN) based on the workload-related features. Our study shows that the estimated total powers match the manufacturer’s total power measurements extremely well. Experimental results further show that the predictions of power maps are quite accurate, with the RMSE of only 4.98 mW/mm², or 2.6% of the full-scale error. The speed of deploying the proposed approach on an Intel Core i7-10710U is as fast as 6.9 ms, which is suitable for real-time estimation.

SESSION: Technical Program: High Performance Memory for Storage and Computing

DECC: Differential ECC for Read Performance Optimization on High-Density NAND Flash Memory

  • Yunpeng Song
  • Yina Lv
  • Liang Shi

3D NAND flash memory with advanced multi-level-cell technology has been widely adopted due to its high density, but with significantly degraded reliability. To solve the reliability issue, flash memory often adopts the low-density parity-check code (LDPC) as error correction code (ECC) to encode data and provide fault tolerance. For LDPC with a low code rate, it can provide a strong correction capability, but with a high energy cost. To avoid the cost, LDPC with a higher code rate is always adopted. When the accessed data is not successfully decoded, LDPC will rely on read retry operations to improve the error correction capability. However, the read retry operation will induce degraded read performance. In this work, a differential ECC (DECC) method is proposed to improve the read performance. The basic idea of DECC is to adopt LDPC with different code rates for data with different access characteristics. Specifically, when data is hot read and retried due to reliability, LDPC with a low code rate will be adopted to optimize performance. With this approach, the cost from LDPC with a low code rate is minimized and the performance is optimized. Through careful design and real-world workloads evaluation on a 3D triple-level-cell (TLC) NAND flash memory, DECC achieves encouraging read performance optimization.

Optimizing Data Layout for Racetrack Memory in Embedded Systems

  • Peng Hui
  • Edwin H.-M. Sha
  • Qingfeng Zhuge
  • Rui Xu
  • Han Wang

Racetrack memory (RTM), which consists of multiple domain block clusters (DBC) and access ports, is a novel non-volatile memory and has potential as scratchpad memory (SPM) in embedded devices due to its high density and low access latency. However, too many shift operations decrease the performance of RTM and make it unpredictable. In this paper, we propose three schemes to optimize the performance of RTM from different aspects, including intra-DBC, inter-DBC, and hybrid SPM with SRAM and RTM. First, a balanced group-based data placement method for the data layout inside one DBC is proposed to reduce shifts. Second, a grouping method for the data allocation among DBCs is proposed. It helps with the shift reduction while using fewer DBCs by using one DBC as multiple DBCs. Finally, we use SRAM to further help the cost reduction, and a cost evaluation metric is proposed to assist the shrinking method which determines the data allocation for hybrid SPM with SRAM and RTM. Experiments show that the proposed schemes can significantly improve the performance of pure RTM and hybrid SPM while using fewer DBCs.

Exploring Architectural Implications to Boost Performance for in-NVM B+-Tree

  • Yanpeng Hu
  • Qisheng Jiang
  • Chundong Wang

Computer architecture keeps evolving to support the byte-addressable non-volatile memory (NVM). Researchers have tailored the prevalent B+-tree to NVM, crafting a history of utilizing architectural supports to gain both high performance and crash consistency. The latest architecture-level changes for NVM, e.g., the eADR, motivate us to further explore architectural implications in the design and implementation of in-NVM B+-tree. Our quantitative study finds that eADR makes cache misses increasingly impact the performance of an in-NVM B+-tree. We hence propose Conan for conflict-aware node allocation based on theoretical justifications. Conan decomposes the virtual addresses of B+-tree nodes with regard to a VIPT cache and intentionally places them into different cache sets. Experiments show that Conan evidently reduces cache conflicts and boosts the performance of state-of-the-art in-NVM B+-tree.
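
As an illustration of the cache-set reasoning behind conflict-aware allocation, the following minimal sketch computes which cache set an address maps to and compares two placements of tree nodes. The cache geometry (64-byte lines, 64 sets) and all function names are assumptions chosen for illustration; they are not parameters or code from Conan.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_SETS = 64    # number of cache sets (assumed, e.g. a 32 KiB 8-way cache)

def cache_set_index(addr, line_size=LINE_SIZE, num_sets=NUM_SETS):
    """Return the set selected by the address bits just above the line offset."""
    return (addr // line_size) % num_sets

def sets_touched(base_addrs):
    """Return the distinct cache sets hit by the first line of each node."""
    return {cache_set_index(a) for a in base_addrs}

# Hypothetical placement 1: eight nodes at a 4 KiB stride.
# Every node starts in the same set, so their first lines compete.
page_aligned = [0x10000 + i * 4096 for i in range(8)]

# Hypothetical placement 2: each node is shifted by one extra cache line,
# so consecutive nodes start in consecutive sets.
staggered = [0x10000 + i * (4096 + LINE_SIZE) for i in range(8)]

print(sorted(sets_touched(page_aligned)))
print(sorted(sets_touched(staggered)))
```

With this assumed geometry the set index is taken entirely from the low 12 address bits, which is why a 4 KiB stride concentrates all nodes in one set.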

An Efficient near-Bank Processing Architecture for Personalized Recommendation System

  • Yuqing Yang
  • Weidong Yang
  • Qin Wang
  • Naifeng Jing
  • Jianfei Jiang
  • Zhigang Mao
  • Weiguang Sheng

Personalized recommendation systems consume the major resources in modern AI data centers. The memory-bound embedding layers with irregular memory access patterns have been identified as the bottleneck of recommendation systems. To overcome the memory challenges, near-memory processing (NMP) would be an effective solution which provides high bandwidth. Recent work proposes an NMP approach to accelerate the recommendation models by utilizing the through-silicon via (TSV) bandwidth in 3D-stacked DRAMs. However, the total bandwidth provided by TSVs is insufficient for a batch of embedding layers processed in parallel. In this paper, we propose a near-bank processing architecture to accelerate recommendation models. By integrating the compute-logic near memory banks on DRAM dies of the 3D-stacked DRAM, our architecture can exploit the enormous bank-level bandwidth which is much higher than TSV bandwidth. We also present a hardware/software interface for embedding layers offloading. Moreover, we propose an efficient mapping scheme to enhance the utilization of bank-level bandwidth. As a result, our architecture achieves up to 2.10X speedup and 31% energy saving for data movement over the state-of-the-art NMP solution for recommendation acceleration based on 3D-stacked memory.

SESSION: Technical Program: Cool and Efficient Approximation

PAALM: Power Density Aware Approximate Logarithmic Multiplier Design

  • Shuyuan Yu
  • Sheldon X.-D. Tan

Approximate hardware designs can lead to significant power or energy reduction. However, a recent study showed that approximated designs might lead to unwanted higher temperature and related reliability issues due to the increased power density. In this work, we try to mitigate this important problem by proposing a novel power density aware approximate logarithmic multiplier (called PAALM) design for the first time. The new multiplier design is based on the approximate logarithmic multiplier (ALM) framework due to its rigorous mathematics based foundation. The idea is to re-design the high computing switch activities of existing ALM designs based on equivalent mathematical formulas so that the power density can be reduced with no accuracy loss, at the cost of some area overhead. Our results show that the proposed PAALM design can improve power density by 11.5%/5.7% and area by 31.6%/70.8% with 8/16-bit precision when compared with the fixed-point multiplier baseline, respectively. It also achieves extremely low error bias: -0.17/0.08 for 8/16-bit precision, respectively. On top of this, we further implement the PAALM design in a Convolutional Neural Network (CNN) and test it on the CIFAR10 dataset. The results show that with error compensation, PAALM can achieve the same inference accuracy as the fixed-point multiplier baseline. We also evaluate the PAALM in a discrete cosine transformation (DCT) application. The results show that with error compensation, PAALM can improve the image quality by 8.6 dB on average when compared to the ALM design.
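
For readers unfamiliar with logarithmic multiplication, the sketch below shows the classic Mitchell approximation, a well-known basis for approximate logarithmic multipliers. It is offered only as background: it is not the PAALM design, and the function name is hypothetical. Writing a = 2^k1 (1 + x1) and b = 2^k2 (1 + x2), the product is approximated by adding the characteristics k1 + k2 and the mantissas x1 + x2 instead of multiplying.

```python
def mitchell_multiply(a, b):
    """Approximate a * b for non-negative integers using Mitchell's method."""
    if a == 0 or b == 0:
        return 0
    k1 = a.bit_length() - 1          # position of the leading one in a
    k2 = b.bit_length() - 1          # position of the leading one in b
    scale = 1 << (k1 + k2)           # 2^(k1 + k2)
    # (x1 + x2) expressed as an integer numerator over 2^(k1 + k2)
    mantissa_sum = ((a - (1 << k1)) << k2) + ((b - (1 << k2)) << k1)
    if mantissa_sum < scale:
        # x1 + x2 < 1: result is 2^(k1 + k2) * (1 + x1 + x2)
        return scale + mantissa_sum
    # x1 + x2 >= 1: result is 2^(k1 + k2 + 1) * (x1 + x2)
    return 2 * mantissa_sum

for a, b in [(4, 4), (5, 6), (3, 3)]:
    print(a, b, a * b, mitchell_multiply(a, b))
```

The approximation is exact when both operands are powers of two and otherwise underestimates the true product, which is why designs built on it usually add some form of error compensation.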

Approximate Floating-Point FFT Design with Wide Precision-Range and High Energy Efficiency

  • Chenyi Wen
  • Ying Wu
  • Xunzhao Yin
  • Cheng Zhuo

Fast Fourier Transform (FFT) is a key digital signal processing algorithm that is widely deployed in mobile and portable devices. Recently, with the popularity of human perception related tasks, it is noted that the requirements of full precision and exactness are not always necessary for FFT computation. We propose a top-down approximate Floating-Point FFT design methodology to fully exploit the error-tolerance nature of the FFT algorithm. An efficient error modeling of the configurable approximate multiplier is proposed to link the multiplier approximation to the FFT algorithm precision. Then an approximation optimization flow is formulated to maximize the energy efficiency. Experimental results show that the proposed approximate FFT can achieve up to 52% Area-Delay-Product improvement and 23% energy saving when compared to the exact FFT. The proposed approximate FFT is also found to cover almost 2X wider precision range with higher energy efficiency in comparison with the prior state-of-the-art approximate FFT.

RUCA: RUntime Configurable Approximate Circuits with Self-Correcting Capability

  • Jingxiao Ma
  • Sherief Reda

Approximate computing is an emerging computing paradigm that offers improved power consumption by relaxing the requirement for full accuracy. Since the requirements for accuracy may vary according to specific real-world applications, one trend of approximate computing is to design quality-configurable circuits, which are able to switch at runtime among different accuracy modes with different power and delay. In this paper, we present a novel framework RUCA which aims to synthesize runtime configurable approximate circuits based on arbitrary input circuits. By decomposing the truth table, our approach aims to approximate and separate the input circuit into multiple configuration blocks which support different accuracy levels, including a corrector circuit to restore full accuracy. Power gating is used to activate different blocks, such that the approximate circuit is able to operate at different accuracy-power configurations. To improve the scalability of our algorithm, we also provide a design space exploration scheme with circuit partitioning. We evaluate our methodology on a comprehensive set of benchmarks. For 3-level designs, RUCA saves power consumption by 43.71% within 2% error and by 30.15% within 1% error on average.

Approximate Logic Synthesis by Genetic Algorithm with an Error Rate Guarantee

  • Chun-Ting Lee
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

Approximate computing is an emerging design technique for error-tolerant applications, which may improve circuit area, delay, or power consumption by trading off a circuit’s correctness. In this paper, we propose a novel approximate logic synthesis approach based on a genetic algorithm targeting depth minimization with an error rate guarantee. We conduct experiments on a set of IWLS 2005 and MCNC benchmarks. The experimental results demonstrate that the depth can be reduced by up to 50%, and by 22% on average, under a 5% error rate constraint. Compared with the state-of-the-art method, our approach achieves an average of 159% more depth reduction under the same 5% error rate constraint.

SESSION: Technical Program: Logic Synthesis for AQFP, Quantum Logic, AI Driven and Efficient Data Layout for HBM

Depth-Optimal Buffer and Splitter Insertion and Optimization in AQFP Circuits

  • Alessandro Tempia Calvino
  • Giovanni De Micheli

The Adiabatic Quantum-Flux Parametron (AQFP) is an energy-efficient superconducting logic family. AQFP technology requires buffer and splitting elements (B/S) to be inserted to satisfy path-balancing and fanout-branching constraints. B/S insertion policies and optimization strategies have been recently proposed to minimize the number of buffers and splitters needed in an AQFP circuit. In this work, we study the B/S insertion and optimization methods. In particular, the paper proposes: i) an algorithm for B/S insertion that guarantees global depth optimality; ii) a new approach for B/S optimization based on minimum register retiming; iii) a B/S optimization flow based on (i), (ii), and existing work. We show that our approach reduces the number of B/S by up to 20% while guaranteeing optimal depth and providing a 55X speed-up in run time compared to the state-of-the-art.

Area-Driven FPGA Logic Synthesis Using Reinforcement Learning

  • Guanglei Zhou
  • Jason H. Anderson

Logic synthesis involves a rich set of optimization algorithms applied in a specific sequence to a circuit netlist prior to technology mapping. A conventional approach is to apply a fixed “recipe” of such algorithms deemed to work well for a wide range of different circuits. We apply reinforcement learning (RL) to determine a unique recipe of algorithms for each circuit. Feature-importance analysis is conducted using a random-forest classifier to prune the set of features visible to the RL agent. We demonstrate conclusive learning by the RL agent and show significant FPGA area reductions vs. the conventional approach (resyn2). In addition to circuit-by-circuit training and inference, we also train an RL agent on multiple circuits, and then apply the agent to optimize: 1) the same set of circuits on which it was trained, and 2) an alternative set of “unseen” circuits. In both scenarios, we observe that the RL agent produces higher-quality implementations than the conventional approach. This shows that the RL agent is able to generalize, and perform beneficial logic synthesis optimizations across a variety of circuits.

Optimization of Reversible Logic Networks with Gate Sharing

  • Yung-Chih Chen
  • Feng-Jie Chao

Logic synthesis for quantum computing aims to transform a Boolean logic network into a quantum circuit. A conventional two-stage flow first synthesizes the given Boolean logic network into a reversible logic network composed of reversible logic gates. Then, it maps each reversible logic gate into quantum gates to generate a quantum circuit. The state-of-the-art method for the first stage takes advantage of the lookup-table (LUT) mapping technology for FPGAs to decompose the given Boolean logic network into sub-networks, and then maps the sub-networks into reversible logic networks. Although every sub-network is well synthesized, we observe that the reversible logic networks could be further optimized by sharing the reversible logic gates belonging to different sub-networks. Thus, in this paper, we propose a new optimization method for the reversible logic networks by sharing gates. We translate the problem of extracting shareable gates to the exclusive-sums-of-product term optimization problem. The experimental results show that the proposed method successfully optimizes the reversible logic networks generated by the LUT-based method. It is able to reduce an average of approximately 4% of quantum gate cost without increasing the number of ancilla lines for a set of IWLS 2005 benchmarks.

Iris: Automatic Generation of Efficient Data Layouts for High Bandwidth Utilization

  • Stephanie Soldavini
  • Donatella Sciuto
  • Christian Pilato

Optimizing data movements is becoming one of the biggest challenges in heterogeneous computing to cope with data deluge and, consequently, big data applications. When creating specialized accelerators, modern high-level synthesis (HLS) tools are increasingly efficient in optimizing the computational aspects, but data transfers have not been adequately improved. To combat this, novel architectures such as High-Bandwidth Memory with wider data busses have been developed so that more data can be transferred in parallel. Designers must tailor their hardware/software interfaces to fully exploit the available bandwidth. HLS tools can automate this process, but the designer must follow strict coding-style rules. If the bus width is not evenly divisible by the data width (e.g., when using custom-precision data types) or if the arrays are not power-of-two length, the HLS-generated accelerator will likely not fully utilize the available bandwidth, demanding even more manual effort from the designer. We propose a methodology to automatically find and implement a data layout that, when streamed between memory and an accelerator, uses a higher percentage of the available bandwidth than a naive or HLS-optimized design. We borrow concepts from multiprocessor scheduling to achieve such high efficiency.
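
The bandwidth loss described above can be made concrete with a small calculation. The sketch below compares a naive layout, where each bus word holds only whole elements, against a dense layout where elements may straddle word boundaries. The bus and element widths are illustrative assumptions and the helper names are hypothetical; this is simple arithmetic, not the scheduling-based method of Iris.

```python
def naive_utilization(bus_bits, elem_bits):
    """Fraction of each bus word used when elements never straddle words.

    Assumes elem_bits <= bus_bits.
    """
    per_word = bus_bits // elem_bits
    return per_word * elem_bits / bus_bits

def packed_utilization(bus_bits, elem_bits, n_elems):
    """Fraction of transferred bits that are payload under dense packing."""
    total_bits = n_elems * elem_bits
    words = -(-total_bits // bus_bits)   # ceiling division
    return total_bits / (words * bus_bits)

# Hypothetical example: 24-bit custom-precision values on a 64-bit bus.
print(naive_utilization(64, 24))       # two elements per word, 16 bits wasted
print(packed_utilization(64, 24, 8))   # eight elements fill three words exactly
```

Dense packing raises utilization but requires the accelerator to reassemble elements that cross word boundaries, which is part of why finding and implementing such layouts by hand is laborious.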

SESSION: Technical Program: University Design Contest

ViraEye: An Energy-Efficient Stereo Vision Accelerator with Binary Neural Network in 55 nm CMOS

  • Yu Zhang
  • Gang Chen
  • Tao He
  • Qian Huang
  • Kai Huang

This paper presents the ViraEye chip, an energy-efficient stereo vision accelerator based on the binary neural network (BNN) to achieve high-quality and real-time stereo estimation. This stereo vision accelerator is designed as an end-to-end full pipeline architecture where all processing procedures, including stereo rectification, BNNs, cost aggregation and post-processing, are implemented on the ViraEye chip. ViraEye allows for top-level pipelining between the accelerator and image sensors, and no external CPUs or GPUs are required. The accelerator is implemented using SMIC 55nm CMOS technology and achieves top-performing processing speed in terms of the million disparity estimations per second (MDE/s) metric among the existing ASICs in the open literature.

A 1.2nJ/Classification Fully Synthesized All-Digital Asynchronous Wired-Logic Processor Using Quantized Non-Linear Function Blocks in 0.18μm CMOS

  • Rei Sumikawa
  • Kota Shiba
  • Atsutake Kosuge
  • Mototsugu Hamada
  • Tadahiro Kuroda

A 5.3 times smaller and 2.6 times more energy-efficient all-digital wired-logic processor which infers MNIST with 90.6% accuracy and 1.2nJ of energy consumption has been developed. To improve area efficiency of wired-logic architecture, nonlinear neural network (NNN), which is a neuron and synapse efficient network, and logical compression technology to implement it with area-saving and low-power digital circuits by logic synthesis are proposed, and asynchronous digital combinational circuit DNN hardware has been developed.

A Fully Synthesized 13.7μJ/Prediction 88% Accuracy CIFAR-10 Single-Chip Data-Reusing Wired-Logic Processor Using Non-Linear Neural Network

  • Yao-Chung Hsu
  • Atsutake Kosuge
  • Rei Sumikawa
  • Kota Shiba
  • Mototsugu Hamada
  • Tadahiro Kuroda

An FPGA-based wired-logic CNN processor is presented that can process CIFAR-10 at 13.7μJ/prediction with an 88% accuracy, which is 2,036 times more energy-efficient than the prior state-of-the-art FPGA-based processor. Energy efficiency is greatly improved by implementing all processing elements and wirings in parallel on a single FPGA chip to eliminate the memory access. By utilizing both (1) a non-linear neural network which saves on neurons and synapses and (2) a shift register-based wired-logic architecture, hardware resource usage is reduced by three orders of magnitude.

A Multimode Hybrid Memristor-CMOS Prototyping Platform Supporting Digital and Analog Projects

  • K.-E. Harabi
  • C. Turck
  • M. Drouhin
  • A. Renaudineau
  • T. Bersani-Veroni
  • D. Querlioz
  • T. Hirtzlin
  • E. Vianello
  • M. Bocquet
  • J.-M. Portal

We present an integrated circuit fabricated in a process co-integrating CMOS and hafnium-oxide memristor technology, which provides a prototyping platform for projects involving memristors. Our circuit includes the periphery circuitry for using memristors within digital circuits, as well as an analog mode with direct access to memristors. The platform allows optimizing the conditions for reading and writing memristors, as well as developing and testing innovative memristor-based neuromorphic concepts.

A Fully Synchronous Digital LDO with Built-in Adaptive Frequency Modulation and Implicit Dead-Zone Control

  • Shun Yamaguchi
  • Mahfuzul Islam
  • Takashi Hisakado
  • Osami Wada

This paper proposes a synchronous digital LDO with adaptive clocking and dead-zone control without additional reference voltages. A test chip fabricated in a commercial 65 nm CMOS general-purpose (GP) process achieves 580x frequency modulation with 99.9% maximum efficiency at 0.6V supply.

Demonstration of Order Statistics Based Flash ADC in a 65nm Process

  • Mahfuzul Islam
  • Takehiro Kitamura
  • Takashi Hisakado
  • Osami Wada

This paper presents measurement results of a flash ADC that utilizes offset voltages as references. To operate the minimum number of comparators, we select the target comparators based on the rankings of the offset voltage. We present performance improvement by tuning offset voltage distribution using multiple comparator groups under the same power. A test chip in a commercial 65 nm GP process demonstrates the ADCs at 1 GS/s operation.

SESSION: Technical Program: Synthesis of Quantum Circuits and Systems

A SAT Encoding for Optimal Clifford Circuit Synthesis

  • Sarah Schneider
  • Lukas Burgholzer
  • Robert Wille

Executing quantum algorithms on a quantum computer requires compilation to representations that conform to all restrictions imposed by the device. Due to devices’ limited coherence times and gate fidelities, the compilation process has to be optimized as much as possible. To this end, an algorithm’s description first has to be synthesized using the device’s gate library. In this paper, we consider the optimal synthesis of Clifford circuits, an important subclass of quantum circuits with various applications. Such techniques are essential for establishing lower bounds for (heuristic) synthesis methods and gauging their performance. Due to the huge search space, existing optimal techniques are limited to a maximum of six qubits. The contribution of this work is twofold: First, we propose an optimal synthesis method for Clifford circuits based on encoding the task as a satisfiability (SAT) problem and solving it using a SAT solver in conjunction with a binary search scheme. The resulting tool is demonstrated to synthesize optimal circuits for up to 26 qubits, more than four times as many as the current state of the art. Second, we experimentally show that the overhead introduced by state-of-the-art heuristics exceeds the lower bound by 27% on average. The resulting tool is publicly available at
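
The combination of a decision procedure with binary search mentioned above follows a generic pattern, sketched here under stated assumptions. The callback `is_feasible` is a hypothetical stand-in for a SAT query of the form "does a circuit of size at most d exist?", and the search assumes that feasibility is monotone in d. Nothing here reproduces the paper's encoding or tool.

```python
def minimal_feasible(lo, hi, is_feasible):
    """Return the smallest d in [lo, hi] with is_feasible(d) true, or None.

    Assumes monotonicity: if d is feasible, every larger d is feasible too.
    """
    if not is_feasible(hi):
        return None                  # no solution within the given bound
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(mid):
            hi = mid                 # a solution exists at mid; try smaller
        else:
            lo = mid + 1             # mid is too small; search above it
    return lo

# Toy oracle standing in for a SAT solver call: feasible from size 7 upward.
calls = []
def toy_oracle(d):
    calls.append(d)
    return d >= 7

print(minimal_feasible(0, 20, toy_oracle), len(calls))
```

Each oracle call corresponds to one solver invocation, so the number of invocations grows logarithmically with the width of the initial bound rather than linearly.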

An SMT-Solver-Based Synthesis of NNA-Compliant Quantum Circuits Consisting of CNOT, H and T Gates

  • Kyohei Seino
  • Shigeru Yamashita

It is natural to assume that we can perform quantum operations between only two adjacent physical qubits (quantum bits) to realize a quantum computer for both the current and possible future technologies. This restriction is called the Nearest Neighbor Architecture (NNA) restriction. This paper proposes an SMT-solver-based synthesis of quantum circuits consisting of CNOT, H, and T gates to satisfy the NNA restriction. Although the existing SMT-solver-based synthesis cannot treat H and T gates directly, our method treats the functionality of quantum-specific T and H gates carefully so that we can utilize an SMT-solver to minimize the number of CNOT gates; unlike the existing SMT-solver-based methods, our method considers “Don’t Care” conditions in intermediate points of a quantum circuit by exploiting the property of T gates to reduce CNOT gates. Experimental results show that our approach can reduce the number of CNOT gates by 58.11% on average compared to the naive application of the existing method which does not consider the “Don’t Care” condition.

Compilation of Entangling Gates for High-Dimensional Quantum Systems

  • Kevin Mato
  • Martin Ringbauer
  • Stefan Hillmich
  • Robert Wille

Most quantum computing architectures to date natively support multi-valued logic, albeit being typically operated in a binary fashion. Multi-valued, or qudit, quantum processors have access to much richer forms of quantum entanglement, which promise to significantly boost the performance and usefulness of quantum devices. However, much of the theory as well as corresponding design methods required for exploiting such hardware remain insufficient and generalizations from qubits are not straightforward. A particular challenge is the compilation of quantum circuits into sets of native qudit gates supported by state-of-the-art quantum hardware. In this work, we address this challenge by introducing a complete workflow for compiling any two-qudit unitary into an arbitrary native gate set. Case studies demonstrate the feasibility of both, the proposed approach as well as the corresponding implementation (which is freely available at

WIT-Greedy: Hardware System Design of Weighted ITerative Greedy Decoder for Surface Code

  • Wang Liao
  • Yasunari Suzuki
  • Teruo Tanimoto
  • Yosuke Ueno
  • Yuuki Tokunaga

Large error rates of quantum bits (qubits) are one of the main difficulties in the development of quantum computing. Performing quantum error correction (QEC) with surface codes is considered the most promising approach to reduce the error rates of qubits effectively. To perform error correction, we need an error-decoding unit, which estimates errors in the noisy physical qubits repetitively, to create a robust logical qubit. While complicated graph-matching problems must be solved within a strict time restriction for the error decoding, several hardware implementations that satisfy the restriction at a large code distance have been proposed.

However, existing decoder designs still struggle to reduce the logical error rate. This is because they assume that the error rates of physical qubits are uniform, whereas in practice they vary widely. According to our numerical simulation based on the quantum chip with the largest qubit number, neglecting the non-uniform error properties of a real quantum chip in the decoding process induces significant degradation of the logical error rate and spoils the benefit of QEC. To take the non-uniformity into account, decoders need to solve matching problems on a weighted graph, but these are difficult to solve using the existing designs without exceeding the time limit of decoding. Therefore, a decoder that can handle both non-uniform physical error rates and large surface codes is strongly demanded.

In this paper, we propose a hardware design of decoding units for the surface code that can treat the non-identical error properties with small latency at a large code distance. The key idea of our design is 1) constructing a look-up table for calculating the shortest paths between nodes in a weighted graph and 2) enabling parallel processing during decoding. The implementation results in field programmable gate array (FPGA) indicate that our design scales up to code distance 11 within a microsecond-level delay, which is comparable to the existing state-of-the-art designs, while our design can treat non-identical errors.
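
As a rough software illustration of the first idea, the sketch below precomputes an all-pairs shortest-distance table for a small weighted graph, so that distances can be looked up rather than searched for at decode time. The Floyd-Warshall algorithm, the toy graph, and the weight formula w = -log(p / (1 - p)) are assumptions chosen for illustration; the abstract does not say how the table is built or how weights are derived.

```python
import math

def edge_weight(p):
    """Assumed log-likelihood weight for an edge with error probability p."""
    return -math.log(p / (1.0 - p))

def build_distance_lut(num_nodes, edges):
    """Return an all-pairs shortest-distance table (Floyd-Warshall).

    edges: iterable of (u, v, p), an undirected edge with error probability p.
    """
    inf = math.inf
    dist = [[0.0 if i == j else inf for j in range(num_nodes)]
            for i in range(num_nodes)]
    for u, v, p in edges:
        w = edge_weight(p)
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    # Relax every pair through every intermediate node k.
    for k in range(num_nodes):
        for i in range(num_nodes):
            for j in range(num_nodes):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical 4-node chain with one noisy (high error rate) shortcut 0-3.
edges = [(0, 1, 0.01), (1, 2, 0.01), (2, 3, 0.01), (0, 3, 0.2)]
lut = build_distance_lut(4, edges)
```

In this toy graph the noisy edge gets a small weight, so the table routes the pair (0, 2) through node 3 instead of node 1, which is the kind of choice a uniform-weight decoder would miss.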

Quantum Data Compression for Efficient Generation of Control Pulses

  • Daniel Volya
  • Prabhat Mishra

In order to physically realize a robust quantum gate, a specifically tailored laser pulse needs to be derived via strategies such as quantum optimal control. Unfortunately, such strategies face exponential complexity with quantum system size and become infeasible even for moderate-sized quantum circuits. In this paper, we propose an automated framework for effective utilization of these quantum resources. Specifically, this paper makes three important contributions. First, we utilize an effective combination of register compression and dimensionality reduction to reduce the area of a quantum circuit. Next, due to the properties of an autoencoder, the compressed gates produced are robust even in the presence of noise. Finally, our proposed compression reduces the computation time of quantum control. Experimental evaluation using popular quantum algorithms demonstrates that our proposed approach can enable efficient generation of noise-resilient control pulses while state-of-the-art fails to handle large-scale quantum systems.

SESSION: Technical Program: In-Memory/Near-Memory Computing for Neural Networks

Toward Energy-Efficient Sparse Matrix-Vector Multiplication with near STT-MRAM Computing Architecture

  • Yueting Li
  • He Zhang
  • Xueyan Wang
  • Hao Cai
  • Yundong Zhang
  • Shuqin Lv
  • Renguang Liu
  • Weisheng Zhao

Sparse Matrix-Vector Multiplication (SpMV) is one of the vital computational primitives used in modern workloads. SpMV is dominated by memory accesses, which leads to unnecessary data transmission, massive data access, and redundant multiply-accumulate units. Therefore, we propose a near spin-transfer torque magnetic random access memory (STT-MRAM) processing architecture with three optimizations: (1) the near-memory processing (NMP) controller receives instructions through the AXI4 bus to implement the SpMV operation, identifies valid data, and encodes the index depending on the kernel size; (2) the NMP controller uses a high-level synthesis dataflow in the shared buffer to achieve better throughput without consuming bus bandwidth; and (3) configurable MACs are implemented in the NMP core, with no matching step at all during multiplication. Using these optimizations, the NMP architecture can access the pipelined STT-MRAM (read bandwidth of 26.7 GB/s). The experimental simulation results show that this design achieves up to 66x and 28x speedup compared with state-of-the-art designs and a 69x speedup compared with no sparse optimization.
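
For readers unfamiliar with the kernel, a minimal software sketch of SpMV is given below. It assumes the compressed sparse row (CSR) storage format, chosen here only for illustration; it is not the index encoding used by the NMP controller. The point is that only stored nonzeros are fetched, multiplied, and accumulated.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values:  nonzero entries of A, row by row
    col_idx: column index of each entry in values
    row_ptr: row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # One multiply-accumulate per stored nonzero.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# Example matrix [[2, 0, 0], [0, 0, 3], [1, 0, 4]] in CSR form.
values = [2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 0, 2]
row_ptr = [0, 1, 2, 4]
x = [1.0, 2.0, 3.0]
y = spmv_csr(values, col_idx, row_ptr, x)  # [2.0, 9.0, 13.0]
```

The indirect read `x[col_idx[k]]` is the irregular memory access that makes SpMV memory-bound on conventional hardware.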

RIMAC: An Array-Level ADC/DAC-Free ReRAM-Based in-Memory DNN Processor with Analog Cache and Computation

  • Peiyu Chen
  • Meng Wu
  • Yufei Ma
  • Le Ye
  • Ru Huang

By directly computing in the analog domain, processing-in-memory (PIM) is emerging as a promising alternative to overcome the memory bottleneck of the traditional von Neumann architecture, especially for deep neural networks (DNNs). However, the data outside PIM macros in most existing PIM accelerators are stored and operated on as digital signals, which requires a large number of expensive digital-to-analog (D/A) and analog-to-digital (A/D) converters. In this work, an array-level ADC/DAC-free ReRAM-based in-memory DNN processor named RIMAC is proposed, which accelerates various DNNs purely in the analog domain with analog cache and analog computation modules to eliminate the expensive D/A and A/D conversions. Our experimental results show that the peak energy efficiency is improved by about 34.8×, 97.6×, 10.7×, and 14.0× compared to PRIME, ISAAC, Lattice, and 21’DAC for various DNNs on ImageNet, respectively.

Crossbar-Aligned & Integer-Only Neural Network Compression for Efficient in-Memory Acceleration

  • Shuo Huai
  • Di Liu
  • Xiangzhong Luo
  • Hui Chen
  • Weichen Liu
  • Ravi Subramaniam

Crossbar-based In-Memory Computing (IMC) accelerators preload the entire Deep Neural Network (DNN) into crossbars before inference. However, devices with limited crossbars cannot infer increasingly complex models. IMC-pruning can reduce the usage of crossbars, but current methods need expensive extra hardware for data alignment. Meanwhile, quantization methods can represent DNN weights as integers, but they employ non-integer scaling factors to ensure accuracy, requiring costly multipliers. In this paper, we first propose crossbar-aligned pruning to reduce the usage of crossbars without hardware overhead. Then, we introduce a quantization scheme to avoid multipliers in IMC devices. Finally, we design a learning method to complete the above two schemes and cultivate an optimal compact DNN with high accuracy and large sparsity during training. Experiments demonstrate that our framework, compared to state-of-the-art methods, achieves larger sparsity and lower power consumption with higher accuracy. We even improve the accuracy by 0.43% for VGG-16 with an 88.25% sparsity rate on the CIFAR-10 dataset. Compared to the original model, we reduce computing power and area by 19.8x and 18.8x, respectively.

Discovering the in-Memory Kernels of 3D Dot-Product Engines

  • Muhammad Rashedul Haq Rashed
  • Sumit Kumar Jha
  • Rickard Ewetz

The capability of resistive random access memory (ReRAM) to implement multiply-and-accumulate operations promises unprecedented efficiency in the design of scientific computing applications. While the use of two-dimensional (2D) ReRAM crossbars has been well investigated in the last few years, the design of in-memory dot-product engines using three-dimensional (3D) ReRAM crossbars remains a topic of active investigation. In this paper, we holistically explore how to leverage 3D ReRAM crossbars with several (2 to 7) stacked crossbar layers. In contrast, previous studies have focused on 3D ReRAM with at most 2 stacked crossbar layers. We first discover the in-memory compute kernels that can be realized using 3D ReRAM with multiple stacked crossbar layers. We discover that matrices with different sparsity patterns can be realized by appropriately assigning the inputs and outputs to the perpendicular metal wires within the 3D stack. We present a design automation tool to map sparse matrices within scientific computing applications to the discovered 3D kernels. The proposed framework is evaluated using 20 applications from the SuiteSparse Matrix Collection. Compared with 2D crossbars, the proposed approach using 3D crossbars improves area, energy, and latency by 2.02X, 2.37X, and 2.45X, respectively.

RVComp: Analog Variation Compensation for RRAM-Based in-Memory Computing

  • Jingyu He
  • Yucong Huang
  • Miguel Lastras
  • Terry Tao Ye
  • Chi-Ying Tsui
  • Kwang-Ting Cheng

Resistive Random Access Memory (RRAM) has shown great potential in accelerating memory-intensive computation in neural network applications. However, RRAM-based computing suffers from significant accuracy degradation due to the inevitable device variations. In this paper, we propose RVComp, a fine-grained analog Compensation approach to mitigate the accuracy loss of in-memory computing incurred by the Variations of the RRAM devices. Specifically, weights in the RRAM crossbar are accompanied by dedicated compensation RRAM cells to offset their programming errors with a scaling factor. A programming target shifting mechanism is further designed with the objectives of reducing the hardware overhead and minimizing the compensation errors under large device variations. Based on these two key concepts, we propose double and dynamic compensation schemes and the corresponding support architecture. Since the RRAM cells only account for a small fraction of the overall area of the computing macro due to the dominance of the peripheral circuitry, the overall area overhead of RVComp is low and manageable. Simulation results show RVComp achieves a negligible 1.80% inference accuracy drop for ResNet18 on the CIFAR-10 dataset under 30% device variation with only 7.12% area and 5.02% power overhead and no extra latency.

SESSION: Technical Program: Machine Learning-Based Design Automation

Rethink before Releasing Your Model: ML Model Extraction Attack in EDA

  • Chen-Chia Chang
  • Jingyu Pan
  • Zhiyao Xie
  • Jiang Hu
  • Yiran Chen

Machine learning (ML)-based techniques for electronic design automation (EDA) have boosted the performance of modern integrated circuits (ICs). This achievement makes ML models important to the EDA industry. In addition, ML models for EDA are widely considered to have a high development cost because of the time-consuming and complicated training data generation process. Thus, confidentiality protection for EDA models is a critical issue. However, an adversary could apply model extraction attacks to steal the model, in the sense of achieving performance comparable to the victim’s model. As model extraction attacks have posed great threats to other application domains, e.g., computer vision and natural language processing, in this paper, we study model extraction attacks for EDA models under two real-world scenarios. It is the first work that (1) introduces model extraction attacks on EDA models and (2) proposes two attack methods against the unlimited and limited query budget scenarios. Our results show that our approach can achieve performance competitive with the well-trained victim model without any performance degradation. Based on the results, we demonstrate that model extraction attacks truly threaten EDA model privacy and hope to raise concerns about ML security issues in EDA.

MacroRank: Ranking Macro Placement Solutions Leveraging Translation Equivariancy

  • Yifan Chen
  • Jing Mai
  • Xiaohan Gao
  • Muhan Zhang
  • Yibo Lin

Modern large-scale designs make extensive use of heterogeneous macros, which can significantly affect routability. Predicting the final routing quality in the early macro placement stage can filter out poor solutions and speed up design closure. By observing that routing is correlated with the relative positions between instances, we propose MacroRank, a macro placement ranking framework leveraging translation equivariance and a Learning to Rank technique. The framework is able to learn the relative order of macro placement solutions and rank them based on routing quality metrics like wirelength, number of vias, and number of shorts. The experimental results show that compared with the most recent baseline, our framework can improve the Kendall rank correlation coefficient by 49.5% and the average performance of top-30 prediction by 8.1%, 2.3%, and 10.6% on wirelength, vias, and shorts, respectively.
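
For reference, the Kendall rank correlation coefficient used as the ranking metric above can be computed as in the sketch below. This is the tie-free (tau-a) variant, and the predicted scores and measured wirelengths are made-up values for illustration, not data from the paper.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall tau-a for two equal-length score lists without ties."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # both lists order the pair the same way
        elif s < 0:
            discordant += 1   # the lists disagree on this pair
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical predicted scores vs. measured wirelength for four placements.
predicted = [0.9, 0.4, 0.7, 0.1]
measured = [120, 80, 70, 50]
tau = kendall_tau(predicted, measured)  # 5 concordant, 1 discordant pair
```

A value of 1 means the predicted ranking matches the measured ranking exactly, and -1 means it is fully reversed.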

BufFormer: A Generative ML Framework for Scalable Buffering

  • Rongjian Liang
  • Siddhartha Nath
  • Anand Rajaram
  • Jiang Hu
  • Haoxing Ren

Buffering is a prevalent interconnect optimization technique to help timing closure and is often performed after placement. A common buffering approach is to construct a Steiner tree and then insert buffers on the tree using a Ginneken-Lillis style algorithm. Such an approach is difficult to scale to large nets. Our work attempts to solve this problem with a generative machine-learning (ML) approach without Steiner tree construction. Our approach can extract and reuse knowledge from high-quality samples and therefore has significantly improved scalability. A generative ML framework, BufFormer, is proposed to construct an abstract tree topology while simultaneously determining buffer sizes and locations. A baseline method, FLUTE-based Steiner tree construction followed by Ginneken-Lillis style buffer insertion, is implemented to generate training samples. After training, BufFormer can produce solutions for unseen nets that are highly comparable to baseline results, with a correlation coefficient of 0.977 for buffer area and 0.934 for driver-sink delays. On average, BufFormer-generated trees achieve similar delays with slightly larger buffer area. Up to 160X speedup can be achieved for large nets when running on a GPU compared with the baseline on a single CPU thread.

Decoupling Capacitor Insertion Minimizing IR-Drop Violations and Routing DRVs

  • Daijoon Hyun
  • Younggwang Jung
  • Insu Cho
  • Youngsoo Shin

Decoupling capacitor (decap) cells are inserted near function cells of high switching activities so that their IR-drop can be suppressed. Their design becomes more complex and uses higher metal layers, thereby starting to manifest themselves as routing blockage. Post-placement decap insertion, with a goal of minimizing both IR-drop violations and routing design rule violations (DRVs), is addressed for the first time. U-Net with graph convolutional network is introduced to predict routing DRV penalty. The decap insertion problem is formulated and a heuristic algorithm is presented. Experiments with a few test circuits demonstrate that DRVs are reduced by 16% on average with no IR-drop violations, compared to a conventional method which does not explicitly consider DRVs. This results in 48% reduction in routing runtime and 23% improvement in total negative slack.

DPRoute: Deep Learning Framework for Package Routing

  • Yeu-Haw Yeh
  • Simon Yi-Hung Chen
  • Hung-Ming Chen
  • Deng-Yao Tu
  • Guan-Qi Fang
  • Yun-Chih Kuo
  • Po-Yang Chen

For routing closure in package designs, net order is critical due to complex design rules and severe wire congestion. However, existing solutions are deliberately designed using heuristics and are difficult to adapt to different design requirements without updating the algorithm. This work presents a novel deep learning-based routing framework that can keep improving by accumulating data to accommodate increasingly complex design requirements. Based on the initial routing results, we apply deep learning to concurrent detailed routing to deal with the problem of net ordering decisions. We use multi-agent deep reinforcement learning to learn routing schedules between nets. We regard each net as an agent, which needs to consider the actions of other agents while making pathing decisions to avoid routing conflicts. Experimental results on an industrial package design show that the proposed framework can improve the number of design rule violations by 99.5% and the wirelength by 2.9% for initial routing.

SESSION: Technical Program: Advanced Techniques for Yields, Low Power and Reliability

High-Dimensional Yield Estimation Using Shrinkage Deep Features and Maximization of Integral Entropy Reduction

  • Shuo Yin
  • Guohao Dai
  • Wei W. Xing

Despite the fast advances in high-sigma yield analysis with the help of machine learning techniques in the past decade, one of the main challenges, the curse of “dimensionality”, which is inevitable when dealing with modern large-scale circuits, remains unsolved. To resolve this challenge, we propose an absolute shrinkage deep kernel learning, ASDK, which automatically identifies the dominant process variation parameters in a nonlinear-correlated deep kernel and acts as a surrogate model to emulate the expensive SPICE simulation. To further improve the yield estimation efficiency, we propose a novel maximization of approximated entropy reduction for an efficient model update, which is also enhanced with parallel batch sampling for parallel computing, making it ready for practical deployment. Experiments on SRAM column circuits demonstrate the superiority of ASDK over the state-of-the-art (SOTA) approaches in terms of accuracy and efficiency with up to 11.1x speedup over SOTA methods.

MIA-Aware Detailed Placement and VT Reassignment for Leakage Power Optimization

  • Hung-Chun Lin
  • Shao-Yun Fang

As the feature size decreases, leakage power consumption becomes an important target in the design. Using multiple threshold voltages (VTs) in cell-based designs is a popular technique to simultaneously optimize circuit timing and minimize leakage power. However, an arbitrary cell placement result of a multi-VT design may suffer from many design rule violations induced by the Minimum-Implant-Area (MIA) rule, and thus it is necessary to take the MIA rules into consideration during the detailed placement stage. The state-of-the-art works on detailed placement comprehensively tackling MIA rules either disallow VT change or only allow reducing cell VTs to avoid timing degradation. However, these limitations may either result in larger cell displacement or cause overhead in leakage power. In this paper, we propose an optimization framework of VT reassignment and detailed placement to simultaneously consider MIA rules and leakage power minimization under timing constraints. Experimental results show that compared with the state-of-the-art works, the proposed framework can efficiently achieve better trade-off between leakage power and cell displacement.

SLOGAN: SDC Probability Estimation Using Structured Graph Attention Network

  • Junchi Ma
  • Sulei Huang
  • Zongtao Duan
  • Lei Tang
  • Luyang Wang

The trend of progressive technology scaling makes computing systems more susceptible to soft errors. The most critical issue that soft errors cause is silent data corruption (SDC), since SDC occurs silently without any warning to users. Estimating the SDC probability of a program is the first and essential step towards designing protection mechanisms. Prior work suffers from prediction inaccuracy since the proposed heuristic-based models fail to describe the semantics of fault propagation. We propose a novel approach, SLOGAN, which transforms the prediction of SDC probability into a graph regression task. A program is represented in the form of a dynamic dependence graph. To capture the rich semantics of fault propagation, we apply a structured graph attention network, which includes node-level, graph-level, and layer-level self-attention. With the learned attention coefficients from node-level, graph-level, and layer-level self-attention, the importance of edges, nodes, and layers to the fault propagation can be fully considered. We generate the graph embedding by weighted aggregation of the embeddings of nodes and compute the SDC probability with the regression model. The experiment shows that SLOGAN achieves higher SDC prediction accuracy than state-of-the-art methods with a low time cost.

SESSION: Technical Program: Microarchitectural Design and Neural Networks

Microarchitecture Power Modeling via Artificial Neural Network and Transfer Learning

  • Jianwang Zhai
  • Yici Cai
  • Bei Yu

Accurate and robust power models are in high demand for exploring better CPU designs. However, previous learning-based power models ignore the discrepancies in data distribution among different CPU designs, making it difficult to use data from historical configurations to aid modeling for a new target configuration. In this paper, we investigate the transferability of power models and propose a microarchitecture power modeling method based on transfer learning (TL). A novel TL method for artificial neural network (ANN)-based power models is proposed, where cross-domain mixup generates more auxiliary samples close to the target configuration to fill in the distribution discrepancy and domain-adversarial training extracts domain-invariant features to complete the target model construction. Experiments show that our method greatly improves the model transferability and can effectively utilize the knowledge of existing CPU configurations to facilitate target power model construction.

MUGNoC: A Software-Configured Multicast-Unicast-Gather NoC for Accelerating CNN Dataflows

  • Hui Chen
  • Di Liu
  • Shiqing Li
  • Shuo Huai
  • Xiangzhong Luo
  • Weichen Liu

Current communication infrastructures for convolutional neural networks (CNNs) focus only on specific transmission patterns. To reduce data movement, various CNN dataflows have been presented. For these dataflows, parameters and results are delivered using different traffic patterns, i.e., multicast, unicast, and gather, which prevents dataflow-specific communication backbones from benefiting the entire system if the dataflow changes or different dataflows run in the same system. Thus, in this paper, we propose MUGNoC to support typical traffic patterns and accelerate them, thereby boosting multiple dataflows. Specifically, (i) we support multicast for the first time in a 2D-mesh software-configurable NoC by revising the router configuration and proposing an efficient multicast routing scheme; (ii) we decrease unicast latency by transmitting data through different routes in parallel; (iii) we reduce output gather overheads by pipelining basic dataflow units. Experiments show that our proposed design can reduce total data transmission time by at least 39.2% compared with the state-of-the-art CNN communication backbone.

COLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU

  • Bo-Wun Cheng
  • En-Ming Huang
  • Chen-Hao Chao
  • Wei-Fang Sun
  • Tsung-Tai Yeh
  • Chun-Yi Lee

In this work, we aim to capture replicated cache requests between Streaming Multiprocessors (SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of modern GPUs. To achieve this objective, we incorporate a per-cluster Cache line Ownership Lookup tABle (COLAB) that keeps track of which SM within a cluster holds a copy of a specific cache line. With the assistance of COLAB, SMs can collaboratively and efficiently process replicated cache requests within SM clusters by redirecting them according to the ownership information stored in COLAB. By servicing replicated cache requests within SM clusters that would otherwise consume precious NoC bandwidth, the heavy pressure on the NoC interconnection can be eased. Our experimental results demonstrate that the adoption of COLAB can indeed alleviate the excessive NoC pressure caused by replicated cache requests, and improve the overall system throughput of the baseline GPU while incurring minimal overhead. On average, COLAB can reduce NoC traffic by 38% and improve instructions per cycle (IPC) by 43%.
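
The redirection idea can be pictured with the small behavioral model below. It is only a sketch: the class name `OwnershipTable`, its methods, the single-owner-per-line policy, and the eviction handling are all hypothetical simplifications, whereas the actual COLAB is a hardware structure whose organization the abstract does not detail.

```python
class OwnershipTable:
    """Toy per-cluster table mapping a cache-line address to an owning SM."""

    def __init__(self):
        self._owner = {}  # cache-line address -> id of an SM holding a copy

    def record_fill(self, line, sm):
        """Note that SM `sm` now holds a copy of `line`."""
        self._owner[line] = sm

    def record_evict(self, line, sm):
        """Drop the entry if the recorded owner evicts the line."""
        if self._owner.get(line) == sm:
            del self._owner[line]

    def route(self, line, requester):
        """Decide where a miss from `requester` should be serviced."""
        owner = self._owner.get(line)
        if owner is not None and owner != requester:
            return ("cluster", owner)  # redirect to a peer SM in the cluster
        return ("noc", None)           # fall back to a request over the NoC

table = OwnershipTable()
table.record_fill(0x40, sm=2)
hit = table.route(0x40, requester=0)   # ("cluster", 2)
table.record_evict(0x40, sm=2)
miss = table.route(0x40, requester=0)  # ("noc", None)
```

Every request answered with `"cluster"` is one that never reaches the NoC in this model, which is the source of the traffic reduction the abstract reports.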

SESSION: Technical Program: Novel Techniques for Scheduling and Memory Optimizations in Embedded Software

Mixed-Criticality with Integer Multiple WCETs and Dropping Relations: New Scheduling Challenges

  • Federico Reghenzani
  • William Fornaciari

Scheduling Mixed-Criticality (MC) workloads is a challenging problem in real-time computing. Earliest Deadline First Virtual Deadline (EDF-VD) is one of the best-known scheduling algorithms with optimal speedup bound properties. However, when EDF-VD is used to schedule task sets using a model with additional or relaxed constraints, its scheduling properties change. Inspired by an application of MC to the scheduling of fault-tolerant tasks, in this article, we propose two models for multiple criticality levels: the first is a specialization of the MC model, and the second is a generalization of it. We then show, via formal proofs and numerical simulations, that the former considerably improves the speedup bound of EDF-VD. Finally, we provide the proofs related to the optimality of the two models, identifying the need for new scheduling algorithms.

An Exact Schedulability Analysis for Global Fixed-Priority Scheduling of the AER Task Model

  • Thilanka Thilakasiri
  • Matthias Becker

Commercial off-the-shelf (COTS) multi-core platforms offer high performance and large availability of processing resources. Increased contention when accessing shared resources is a result of the high parallelism and one of the main challenges when real-time applications are deployed to these platforms. As a result, several execution models have been proposed to avoid contention by separating access to shared resources from execution.

In this work, we consider the Acquisition-Execution-Restitution (AER) model where contention to shared resources is avoided by design. We propose an exact schedulability test for the AER model under global fixed-priority scheduling using timed automata where we describe the schedulability problem as a reachability problem. To the best of our knowledge, this is the first exact schedulability test for the AER model under global fixed-priority scheduling on multiprocessor platforms. The performance of the proposed approach is evaluated using synthetic experiments and provides up to 65% more schedulable task sets than the state-of-the-art.

Skyrmion Vault: Maximizing Skyrmion Lifespan for Enabling Low-Power Skyrmion Racetrack Memory

  • Syue-Wei Lu
  • Shuo-Han Chen
  • Yu-Pei Liang
  • Yuan-Hao Chang
  • Kang Wang
  • Tseng-Yi Chen
  • Wei-Kuan Shih

Skyrmion racetrack memory (SK-RM) has demonstrated great potential as a high-density and low-cost nonvolatile memory. Nevertheless, even though random data accesses are supported on SK-RM, data accesses cannot be carried out on individual data bits directly. Instead, special skyrmion manipulations, such as injecting and shifting, are required to support random information update and deletion. With such special manipulations, the latency and energy consumption of skyrmion manipulations could quickly accumulate and induce additional overhead on the data read/write path of SK-RM. Meanwhile, the injection operation consumes more energy and has higher latency than any other manipulation. Although prior work has tried to alleviate the overhead of skyrmion manipulations, the possibility of minimizing injections by buffering skyrmions for future reuse and energy conservation has received much less attention. This observation motivates us to propose the concept of a skyrmion vault, which effectively utilizes the skyrmion buffer track structure for energy conservation by maximizing the lifespan of injected skyrmions and minimizing the number of skyrmion injections. Experimental results have shown promising improvements in both energy consumption and skyrmions’ lifespan.

SESSION: Technical Program: Efficient Circuit Simulation and Synthesis for Analog Designs

Parallel Incomplete LU Factorization Based Iterative Solver for Fixed-Structure Linear Equations in Circuit Simulation

  • Lingjie Li
  • Zhiqiang Liu
  • Kan Liu
  • Shan Shen
  • Wenjian Yu

A series of fixed-structure sparse linear equations are solved in a circuit simulation process. We propose a parallel incomplete LU (ILU) preconditioned GMRES solver for those equations. A new subtree-based scheduling algorithm for ILU factorization and forward/backward substitution is adopted to overcome the load-balancing and data locality problems of the conventional levelization-based scheduling. Experimental results show that the proposed scheduling algorithm can achieve up to 2.6X speedup for ILU factorization and 3.1X speedup for forward/backward substitution compared to the levelization-based scheduling. The proposed ILU-GMRES solver achieves around 4X parallel speedup with 8 threads, which is up to 2.1X faster than that based on the levelization-based scheme. The proposed parallel solver also shows a remarkable advantage over existing methods (including HSPICE) on transient simulation of linear and nonlinear circuits.

Accelerated Capacitance Simulation of 3-D Structures with Considerable Amounts of General Floating Metals

  • Jiechen Huang
  • Wenjian Yu
  • Mingye Song
  • Ming Yang

Floating metals are special conductors introduced into conductor structures by design for manufacturing (DFM). They make accurate capacitance simulation difficult. In this work, we aim to accelerate the floating random walk (FRW) based capacitance simulation for structures with considerable amounts of general floating metals. We first discuss how the existing modified FRW is affected by the integral surfaces of floating metals and propose an improved placement of the integral surface. Then, we propose a hybrid approach called incomplete network reduction to avoid random transitions trapped by floating metals. Experiments on structures from IC and FPD design, which involve multiple floating metals and single or multiple master conductors, have shown the effectiveness of the proposed techniques. The proposed techniques reduce the computational time of capacitance calculation while preserving the accuracy.

On Automating Finger-Cap Array Synthesis with Optimal Parasitic Matching for Custom SAR ADC

  • Cheng-Yu Chiang
  • Chia-Lin Hu
  • Mark Po-Hung Lin
  • Yu-Szu Chung
  • Shyh-Jye Jou
  • Jieh-Tsorng Wu
  • Shiuh-hua Wood Chiang
  • Chien-Nan Jimmy Liu
  • Hung-Ming Chen

Due to its excellent power efficiency, the successive-approximation-register (SAR) analog-to-digital converter (ADC) is an attractive design choice for low-power ADC implementations. In analog layout design, the parasitics induced by interconnecting wires and elements affect the accuracy and performance of the device. Due to the requirements of low power and high speed, a series of very small lateral metal-metal capacitor units is usually adopted as the architecture of the capacitor array. Besides power consumption and area reduction, the parasitic capacitance would significantly affect the matching properties and settling time of capacitors. This work presents a framework to synthesize good-quality binary-weighted capacitors for custom SAR ADCs. Also, this work proposes a parasitic-aware ILP-based weight-dynamic network routing algorithm to generate a layout considering parasitic capacitance and capacitance ratio mismatch simultaneously. The experimental results show that the effective number of bits (ENOB) of the layout generated by our approach is comparable to or better than that of manual design and other automated works, closing the gap between pre-sim and post-sim results.

SESSION: Technical Program: Security of Heterogeneous Systems Containing FPGAs

FPGANeedle: Precise Remote Fault Attacks from FPGA to CPU

  • Mathieu Gross
  • Jonas Krautter
  • Dennis Gnad
  • Michael Gruber
  • Georg Sigl
  • Mehdi Tahoori

FPGAs as general-purpose accelerators can greatly improve system efficiency and performance in cloud and edge devices alike. However, they have recently become the focus of remote attacks, such as fault and side-channel attacks launched by one user of a part of the FPGA fabric against another. In this work, we consider system-on-chip platforms, where an FPGA and an embedded processor core are located on the same die. We show that the embedded processor core is vulnerable to voltage drops generated by the FPGA logic. Our experiments demonstrate the possibility of compromising the data transfer from external DDR memory to the processor cache hierarchy. Furthermore, we were also able to fault and skip instructions executed on an ARM Cortex-A9 core. The FPGA-based fault injection is shown to be precise enough to recover the secret key of an AES T-tables implementation found in the mbedTLS library.

FPGA Based Countermeasures against Side Channel Attacks on Block Ciphers

  • Darshana Jayasinghe
  • Brian Udugama
  • Sri Parameswaran

Field Programmable Gate Arrays (FPGAs) are increasingly ubiquitous. FPGAs enable hardware acceleration and reconfigurability. Any security breach or attack on critical computations occurring on an FPGA can lead to devastating consequences. Side-channel attacks have the ability to reveal secret information, such as secret keys from cryptographic circuits running on FPGAs. Power dissipation (PA), Electromagnetic (EM) radiation, fault injection (FI) and remote power dissipation (RPA) attacks are the most compelling and noninvasive side-channel attacks demonstrated on FPGAs. This paper discusses two PA attack countermeasures (QuadSeal and RFTC) and one RPA attack countermeasure (UCloD) in detail to protect FPGAs.

SESSION: Technical Program: Novel Application & Architecture-Specific Quantization Techniques

Block-Wise Dynamic-Precision Neural Network Training Acceleration via Online Quantization Sensitivity Analytics

  • Ruoyang Liu
  • Chenhan Wei
  • Yixiong Yang
  • Wenxun Wang
  • Huazhong Yang
  • Yongpan Liu

Data quantization is an effective method to accelerate neural network training and reduce power consumption. However, it is challenging to perform low-bit quantized training: conventional equal-precision quantization leads to either high accuracy loss or limited bit-width reduction, while existing mixed-precision methods offer high compression potential but fail to perform accurate and efficient bit-width assignment. In this work, we propose DYNASTY, a block-wise dynamic-precision neural network training framework. DYNASTY provides accurate data sensitivity information through fast online analytics, and maintains stable training convergence with an adaptive bit-width map generator. Network training experiments are carried out on the CIFAR-100 and ImageNet datasets. Compared to an 8-bit quantization baseline, DYNASTY brings up to 5.1× speedup and 4.7× energy consumption reduction with no accuracy drop and negligible hardware overhead.

Quantization through Search: A Novel Scheme to Quantize Convolutional Neural Networks in Finite Weight Space

  • Qing Lu
  • Weiwen Jiang
  • Xiaowei Xu
  • Jingtong Hu
  • Yiyu Shi

Quantization has become an essential technique in compressing deep neural networks for deployment onto resource-constrained hardware. The hardware efficiency of implementing quantized networks is highly coupled with the actual values that weights are quantized into; therefore, for given bit widths, we can choose a value space that further boosts hardware efficiency. For example, when weights are restricted to integer powers of two, multiplication can be carried out by bit operations. Under such circumstances, however, existing quantization-aware training methods are either not suitable to apply or unable to unleash the expressiveness of very low bit-widths. For the best hardware efficiency, we revisit the quantization of convolutional neural networks and propose to address the training process from a weight-searching angle, as opposed to optimizing the quantizer functions as in existing works. Extensive experiments on CIFAR10 and ImageNet classification tasks are conducted with implementations on well-established CNN architectures such as ResNet, VGG, and MobileNet. It is shown that the proposed method can achieve a lower accuracy loss than the state of the art and/or improve implementation efficiency by using hardware-friendly weight values at the same time.
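The remark about power-of-two weights can be illustrated with a minimal Python sketch. It is not the paper's weight-search method; the helper names `nearest_power_of_two` and `shift_multiply` are hypothetical, and rounding in the log domain is an assumption chosen only to show why such weights turn multiplication into shifts.

```python
import math

def nearest_power_of_two(w):
    """Map a nonzero weight to (sign, exponent) so that sign * 2**exponent
    is the power of two closest to w in the log domain (assumed rounding rule).
    A zero weight is not representable here and raises ValueError."""
    sign = 1 if w > 0 else -1
    exponent = round(math.log2(abs(w)))
    return sign, exponent

def shift_multiply(x, sign, exponent):
    """Multiply integer activation x by sign * 2**exponent using only
    shifts and negation. Right shifts floor the result, so fractional
    weights lose low-order bits."""
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return shifted if sign > 0 else -shifted

# Hypothetical example values: 0.23 is approximated by 2**-2, -3.7 by -(2**2).
s, e = nearest_power_of_two(0.23)
product = shift_multiply(48, s, e)      # 48 * 0.25 computed as 48 >> 2
```

In hardware, the shift replaces a full multiplier, which is the efficiency gain the abstract alludes to.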

Multi-Wavelength Parallel Training and Quantization-Aware Tuning for WDM-Based Optical Convolutional Neural Networks Considering Wavelength-Relative Deviations

  • Ying Zhu
  • Min Liu
  • Lu Xu
  • Lei Wang
  • Xi Xiao
  • Shaohua Yu

Wavelength Division Multiplexing (WDM)-based Mach-Zehnder Interferometer Optical Convolutional Neural Networks (MZI-OCNNs) have emerged as a promising platform to accelerate convolutions, which consume most of the computing resources in neural networks. However, the wavelength-relative imperfect split ratios and actual phase shifts in MZIs, together with quantization errors from the electronic configuration module, degrade the inference accuracy of WDM-based MZI-OCNNs and thus render them unusable in practice. In this paper, we propose a framework that models the split ratios and phase shifts under different wavelengths, incorporates them into OCNN training, and introduces quantization-aware tuning to maintain inference accuracy and reduce electronic module complexity. Consequently, the framework can improve the inference accuracy by 49%, 76%, and 76%, respectively, for LeNet5, VGG7, and VGG8 implemented with multi-wavelength parallel computing. Instead of Float 32/64 quantization resolutions, only 5, 6, and 4 bits are needed, and fewer quantization levels are utilized for configuration signals.

Semantic Guided Fine-Grained Point Cloud Quantization Framework for 3D Object Detection

  • Xiaoyu Feng
  • Chen Tang
  • Zongkai Zhang
  • Wenyu Sun
  • Yongpan Liu

Unlike the grid-paced RGB images, network compression, i.e., pruning and quantization, for the irregular and sparse 3D point cloud faces more challenges. Traditional quantization ignores the unbalanced semantic distribution in 3D point clouds. In this work, we propose a semantic-guided adaptive quantization framework for 3D point clouds. Different from traditional quantization methods that adopt a static and uniform quantization scheme, our proposed framework can adaptively locate the semantic-rich foreground points in the feature maps and allocate a higher bitwidth to these “important” points. Since the foreground points make up a low proportion of the sparse 3D point cloud, such adaptive quantization can achieve higher accuracy than uniform compression under a similar compression rate. Furthermore, we adopt a block-wise fine-grained compression scheme in the proposed framework to fit the larger dynamic range in the point cloud. Moreover, a 3D point cloud based software and hardware co-evaluation process is proposed to evaluate the effectiveness of the proposed adaptive quantization on actual hardware devices. Based on the nuScenes dataset, we achieve 12.52% precision improvement under average 2-bit quantization. Compared with 8-bit quantization, we can achieve 3.11× energy efficiency based on co-evaluation results.

SESSION: Technical Program: Approximate Brain-Inspired Architectures for Efficient Learning

ReMeCo: Reliable Memristor-Based in-Memory Neuromorphic Computation

  • Ali BanaGozar
  • Seyed Hossein Hashemi Shadmehri
  • Sander Stuijk
  • Mehdi Kamal
  • Ali Afzali-Kusha
  • Henk Corporaal

Memristor-based in-memory neuromorphic computing systems promise a highly efficient implementation of vector-matrix multiplications, commonly used in artificial neural networks (ANNs). However, the immature fabrication process of memristors and circuit-level limitations, i.e., stuck-at-fault (SAF), IR-drop, and device-to-device (D2D) variation, degrade the reliability of these platforms and thus impede their wide deployment. In this paper, we present ReMeCo, a redundancy-based reliability improvement framework. It addresses the non-idealities while constraining the induced overhead. It achieves this by performing a sensitivity analysis on the ANN. With the acquired insight, ReMeCo avoids the redundant calculation of the least sensitive neurons and layers. ReMeCo uses a heuristic approach to find the balance between recovered accuracy and imposed overhead. ReMeCo further decreases hardware redundancy by exploiting the bit-slicing technique. In addition, the framework employs the ensemble averaging method at the output of every ANN layer to incorporate the redundant neurons. The efficacy of ReMeCo is assessed using two well-known ANN models, i.e., LeNet and AlexNet, running the MNIST and CIFAR10 datasets. Our results show 98.5% accuracy recovery with roughly 4% redundancy, which is more than 20× lower than the state-of-the-art.

SyFAxO-GeN: Synthesizing FPGA-Based Approximate Operators with Generative Networks

  • Rohit Ranjan
  • Salim Ullah
  • Siva Satyendra Sahoo
  • Akash Kumar

With rising trends of moving AI inference to the edge, due to communication and privacy challenges, there has been a growing focus on designing low-cost Edge-AI. Given the diversity of application areas at the edge, FPGA-based systems are increasingly used for high-performance inference. Similarly, approximate computing has emerged as a viable approach to achieve disproportionate resource gains by utilizing the applications’ inherent robustness. However, most related research has focused on selecting the appropriate approximate operators for an application from a set of ASIC-based designs. This approach fails to leverage the FPGA’s architectural benefits and limits the scope of approximation to already existing generic designs. To this end, we propose an AI-based approach to synthesizing novel approximate operators for FPGA’s Look-up-table-based structure. Specifically, we use state-of-the-art generative networks to search for constraint-aware arithmetic operator designs optimized for FPGA-based implementation. With the proposed GANs, we report up to 49% faster training, with negligible accuracy degradation, than related generative networks. Similarly, we report improved hypervolume and increased pareto-front design points compared to state-of-the-art approaches to synthesizing approximate multipliers.

Approximating HW Accelerators through Partial Extractions onto Shared Artificial Neural Networks

  • Prattay Chowdhury
  • Jorge Castro Godínez
  • Benjamin Carrion Schafer

One approach that has been suggested to further reduce the energy consumption of heterogeneous Systems-on-Chip (SoCs) is approximate computing. In approximate computing, the error at the output is relaxed in order to simplify the hardware and thus achieve lower power. Fortunately, most of the hardware accelerators in these SoCs are also amenable to approximate computing.

In this work we propose a fully automatic method that substitutes portions of a hardware accelerator specified in C/C++/SystemC for High-Level Synthesis (HLS) with an Artificial Neural Network (ANN). ANNs have many advantages that make them well suited for this. First, they are very scalable, which allows multiple separate portions of the behavioral description to be approximated on them simultaneously. Second, multiple ANNs can be fused together and re-optimized to further reduce the power consumption. We use this to share the ANN to approximate multiple different HW accelerators in the same SoC. Experimental results with different error thresholds show that our proposed approach leads to better results than the state of the art.

DependableHD: A Hyperdimensional Learning Framework for Edge-Oriented Voltage-Scaled Circuits

  • Dehua Liang
  • Hiromitsu Awano
  • Noriyuki Miura
  • Jun Shiomi

Voltage scaling is one of the most promising approaches for energy efficiency improvement but also brings challenges to fully guaranteeing the stable operation in modern VLSI. To tackle such issues, we propose DependableHD, a learning framework based on HyperDimensional Computing (HDC), which supports the systems to tolerate bit-level memory failure in the low voltage region with high robustness. For the first time, DependableHD introduces the concept of margin enhancement for model retraining and utilizes noise injection to improve the robustness, which is capable of application in most state-of-the-art HDC algorithms. Our experiment shows that under 10% memory error, DependableHD exhibits a 1.22% accuracy loss on average, which achieves an 11.2× improvement compared to the baseline HDC solution. The hardware evaluation shows that DependableHD supports the systems to reduce the supply voltage from 400mV to 300mV, which provides a 50.41% energy consumption reduction while maintaining competitive accuracy performance.

SESSION: Technical Program: Retrospect and Prospect of Verification and Test Technologies

EDDY: A Multi-Core BDD Package with Dynamic Memory Management and Reduced Fragmentation

  • Rune Krauss
  • Mehran Goli
  • Rolf Drechsler

In recent years, hardware systems have significantly grown in complexity. Due to the increasing complexity, there is a need to continuously improve the quality of the hardware design process. This leads designers to strive for more efficient data structures and algorithms operating on them to guarantee the correct behavior of such systems through verification techniques like model checking and meet time-to-market constraints. A Binary Decision Diagram (BDD) is a suitable data structure as it provides a canonical compact representation of Boolean functions, given variable ordering, and efficient algorithms for manipulating them. However, reduced ordered BDDs also have challenges: There is a large memory consumption for the BDD construction of some complex practical functions and the use of realizations in the form of BDD packages strongly depends on the application.

To address these issues, this paper presents a novel multi-core package called Engineer Decision Diagrams Yourself (EDDY) with dynamic memory management and reduced fragmentation. Experiments on BDD benchmarks of both combinational circuits and model checking show that using EDDY leads to a significant performance boost compared to state-of-the-art packages.

Exploiting Reversible Computing for Verification: Potential, Possible Paths, and Consequences

  • Lukas Burgholzer
  • Robert Wille

Today, the verification of classical circuits poses a severe challenge for the design of circuits and systems. While the underlying (exponential) complexity is tackled in various fashions (simulation-based approaches, emulation, formal equivalence checking, fuzzing, model checking, etc.), no “silver bullet” has yet been found that allows escaping the growing verification gap. In this work, we entertain and investigate the idea of a complementary approach which aims at exploiting reversible computing. More precisely, we show the potential of the reversible computing paradigm for verification, debunk misleading paths that do not allow this potential to be exploited, and discuss the resulting consequences for the development of future, complementary design and verification flows. An extensive empirical study (involving more than 30 million simulations) confirms these findings. Although this work cannot provide a fully-fledged realization yet, it may provide the basis for an alternative path towards overcoming the verification gap.

Automatic Test Pattern Generation and Compaction for Deep Neural Networks

  • Dina Moussa
  • Michael Hefenbrock
  • Christopher Münch
  • Mehdi Tahoori

Deep Neural Networks (DNNs) have gained considerable attention lately due to their excellent performance on a wide range of recognition and classification tasks. Accordingly, fault detection in DNNs and their implementations plays a crucial role in the quality of DNN implementations, ensuring that their post-mapping and in-field accuracy matches the model accuracy. This paper proposes a functional-level automatic test pattern generation approach for DNNs. This is done by generating inputs which cause misclassification of the output class label in the presence of single or multiple faults. Furthermore, to obtain a smaller set of test patterns with full coverage, a heuristic algorithm as well as a test pattern clustering method using K-means were implemented. The experimental results showed that the proposed test patterns achieved the highest label misclassification and a high output deviation compared to state-of-the-art approaches.

Wafer-Level Characteristic Variation Modeling Considering Systematic Discontinuous Effects

  • Takuma Nagao
  • Tomoki Nakamura
  • Masuo Kajiyama
  • Makoto Eiki
  • Michiko Inoue
  • Michihiro Shintani

Statistical wafer-level variation modeling is an attractive method for reducing the measurement cost in large-scale integrated circuit (LSI) testing while maintaining the test quality. In this method, the performance of unmeasured LSI circuits manufactured on a wafer is statistically predicted from a few measured LSI circuits. Conventional statistical methods model spatially smooth variations in wafer. However, actual wafers may have discontinuous variations that are systematically caused by the manufacturing environments, such as shot dependence. In this study, we propose a modeling method that considers discontinuous variations in wafer characteristics by applying the knowledge of manufacturing engineers to a model estimated using Gaussian process regression. In the proposed method, the process variation is decomposed into the systematic discontinuous and global components to improve the estimation accuracy. An evaluation performed using an industrial production test dataset shows that the proposed method reduces the estimation error for an entire wafer by over 33% compared to conventional methods.

SESSION: Technical Program: Computing, Erasing, and Protecting: The Security Challenges for the Next Generation of Memories

Hardware Security Primitives Using Passive RRAM Crossbar Array: Novel TRNG and PUF Designs

  • Simranjeet Singh
  • Furqan Zahoor
  • Gokul Rajendran
  • Sachin Patkar
  • Anupam Chattopadhyay
  • Farhad Merchant

With rapid advancements in electronic gadgets, the security and privacy aspects of these devices are significant. For the design of secure systems, physical unclonable function (PUF) and true random number generator (TRNG) are critical hardware security primitives for security applications. This paper proposes novel implementations of PUF and TRNGs on the RRAM crossbar structure. Firstly, two techniques to implement the TRNG in the RRAM crossbar are presented based on write-back and 50% switching probability pulse. The randomness of the proposed TRNGs is evaluated using the NIST test suite. Next, an architecture to implement the PUF in the RRAM crossbar is presented. The initial entropy source for the PUF is used from TRNGs, and challenge-response pairs (CRPs) are collected. The proposed PUF exploits the device variations and sneak-path current to produce unique CRPs. We demonstrate, through extensive experiments, reliability of 100%, uniqueness of 47.78%, uniformity of 49.79%, and bit-aliasing of 48.57% without any post-processing techniques. Finally, the design is compared with the literature to evaluate its implementation efficiency, which is clearly found to be superior to the state-of-the-art.
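The PUF quality figures quoted above (uniformity, uniqueness, bit-aliasing) can be made concrete with a short Python sketch. It uses the definitions commonly applied in PUF evaluation, each with an ideal value of 50%; the abstract does not state the exact formulas used by the authors, so these are an assumption, and the response bit-vectors below are made-up illustration data rather than measurements.

```python
from itertools import combinations

def uniformity(response):
    """Percentage of 1 bits in a single device response (ideal: 50%)."""
    return 100.0 * sum(response) / len(response)

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length responses differ."""
    return sum(x != y for x, y in zip(a, b))

def uniqueness(responses):
    """Average pairwise inter-device Hamming distance, in percent (ideal: 50%)."""
    pairs = list(combinations(responses, 2))
    n_bits = len(responses[0])
    total = sum(hamming_distance(a, b) for a, b in pairs)
    return 100.0 * total / (len(pairs) * n_bits)

def bit_aliasing(responses):
    """Per bit position, the percentage of devices producing a 1 (ideal: 50%)."""
    n_devices = len(responses)
    return [100.0 * sum(column) / n_devices for column in zip(*responses)]

# Hypothetical responses of three devices to the same challenge.
responses = [[1, 0, 1, 0],
             [0, 0, 1, 1],
             [1, 1, 0, 0]]
first_uniformity = uniformity(responses[0])
inter_device = uniqueness(responses)
per_bit = bit_aliasing(responses)
```

Reliability, the fourth figure in the abstract, is normally measured from repeated readings of the same device and is left out of this sketch.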

Data Sanitization on eMMCs

  • Aya Fukami
  • Francesco Regazzoni
  • Zeno Geradts

Data sanitization of modern digital devices is an important issue given that electronic waste is being recycled and repurposed. The embedded Multi Media Card (eMMC), one of the NAND flash memory-based commodity devices, is among the most commonly recycled products in the current recycling ecosystem. We analyze a repurposed device and evaluate its sanitization practice. Data from the formerly used device can still be recovered, which may lead to an unintentional leakage of sensitive data such as personally identifiable information (PII). Since the internal storage of an eMMC is NAND flash memory, sanitization practices for NAND flash memory-based systems should apply to the eMMC. However, a proper sanitize operation is not always performed in the current recycling ecosystem. We discuss how data stored in eMMC and other flash memory-based devices need to be deleted in order to avoid potential data leakage. We also review NAND flash memory data sanitization schemes and discuss how they should be applied in eMMCs.

Fundamentally Understanding and Solving RowHammer

  • Onur Mutlu
  • Ataberk Olgun
  • A. Giray Yağlıkcı

We provide an overview of recent developments and future directions in the RowHammer vulnerability that plagues modern DRAM (Dynamic Random Access Memory) chips, which are used in almost all computing systems as main memory.

RowHammer is the phenomenon in which repeatedly accessing a row in a real DRAM chip causes bitflips (i.e., data corruption) in physically nearby rows. This phenomenon leads to a serious and widespread system security vulnerability, as many works since the original RowHammer paper in 2014 have shown. Recent analysis of the RowHammer phenomenon reveals that the problem is getting much worse as DRAM technology scaling continues: newer DRAM chips are fundamentally more vulnerable to RowHammer at the device and circuit levels. Deeper analysis of RowHammer shows that there are many dimensions to the problem as the vulnerability is sensitive to many variables, including environmental conditions (temperature & voltage), process variation, stored data patterns, as well as memory access patterns and memory control policies. As such, it has proven difficult to devise fully-secure and very efficient (i.e., low-overhead in performance, energy, area) protection mechanisms against RowHammer and attempts made by DRAM manufacturers have been shown to lack security guarantees.

After reviewing various recent developments in exploiting, understanding, and mitigating RowHammer, we discuss future directions that we believe are critical for solving the RowHammer problem. We argue for two major directions to amplify research and development efforts in: 1) building a much deeper understanding of the problem and its many dimensions, in both cutting-edge DRAM chips and computing systems deployed in the field, and 2) the design and development of extremely efficient and fully-secure solutions via system-memory cooperation.

SESSION: Technical Program: System-Level Codesign in DNN Accelerators

Hardware-Software Codesign of DNN Accelerators Using Approximate Posit Multipliers

  • Tom Glint
  • Kailash Prasad
  • Jinay Dagli
  • Krishil Gandhi
  • Aryan Gupta
  • Vrajesh Patel
  • Neel Shah
  • Joycee Mekie

Emerging data-intensive AI/ML workloads encounter the memory and power walls when run on general-purpose compute cores. This has led to the development of a myriad of techniques to deal with such workloads, among which DNN accelerator architectures have found a prominent place. In this work, we propose a hardware-software co-design approach to achieve system-level benefits. We propose a quantized data-aware POSIT number representation that leads to a highly optimized DNN accelerator. We demonstrate this work on the SOTA SIMBA architecture; it is extendable to any other accelerator. Our proposal reduces the buffer/storage requirements within the architecture and reduces the data transfer cost between the main memory and the DNN accelerator. We have investigated the impact of using integer, IEEE floating point, and posit multipliers for LeNet, ResNet and VGG NNs trained and tested on the MNIST, CIFAR10 and ImageNet datasets, respectively. Our system-level analysis shows that the proposed approximate-fixed-posit multiplier, when implemented on the SIMBA architecture, achieves on average ~2.2× speed up, consumes ~3.1× less energy and requires ~3.2× less area, respectively, against the baseline SOTA architecture, without loss of accuracy (~±1%).

Reusing GEMM Hardware for Efficient Execution of Depthwise Separable Convolution on ASIC-Based DNN Accelerators

  • Susmita Dey Manasi
  • Suvadeep Banerjee
  • Abhijit Davare
  • Anton A. Sorokin
  • Steven M. Burns
  • Desmond A. Kirkpatrick
  • Sachin S. Sapatnekar

Deep learning (DL) accelerators are optimized for standard convolution. However, lightweight convolutional neural networks (CNNs) use depthwise convolution (DwC) in key layers, and the structural difference between DwC and standard convolution leads to a significant performance bottleneck in executing lightweight CNNs on such platforms. This work reuses the fast general matrix multiplication (GEMM) core of DL accelerators by mapping DwC to channel-wise parallel matrix-vector multiplications. An analytical framework is developed to guide pre-RTL hardware choices, and new hardware modules and software support are developed for end-to-end evaluation of the solution. This GEMM-based DwC execution strategy offers substantial performance gains for lightweight CNNs: 7× speedup and 1.8× lower off-chip communication for MobileNet-v1 over a conventional DL accelerator, 74× speedup over a CPU, and even 1.4× speedup over a power-hungry GPU.
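The idea of expressing depthwise convolution as one matrix-vector product per channel can be sketched in plain Python. The abstract does not describe how the authors lay out data for the GEMM core, so the im2col-style patch matrix below is an illustrative assumption, and all function names are hypothetical. The sketch assumes stride 1 and no padding.

```python
def depthwise_direct(x, k):
    """Reference depthwise convolution. x is [C][H][W], k is [C][K][K];
    each channel is filtered only by its own kernel."""
    size = len(k[0])
    out = []
    for c in range(len(x)):
        rows, cols = len(x[c]) - size + 1, len(x[c][0]) - size + 1
        out.append([[sum(x[c][i + di][j + dj] * k[c][di][dj]
                         for di in range(size) for dj in range(size))
                     for j in range(cols)] for i in range(rows)])
    return out

def im2col(plane, size):
    """Unroll every size-by-size patch of one channel into a matrix row."""
    rows, cols = len(plane) - size + 1, len(plane[0]) - size + 1
    return [[plane[i + di][j + dj] for di in range(size) for dj in range(size)]
            for i in range(rows) for j in range(cols)]

def matvec(matrix, vector):
    """Plain matrix-vector product, standing in for the accelerator core."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def depthwise_as_matvec(x, k):
    """Same result as depthwise_direct, computed as one independent
    matrix-vector product per channel."""
    size = len(k[0])
    out = []
    for c in range(len(x)):
        kernel_vec = [k[c][di][dj] for di in range(size) for dj in range(size)]
        flat = matvec(im2col(x[c], size), kernel_vec)
        cols = len(x[c][0]) - size + 1
        out.append([flat[r * cols:(r + 1) * cols] for r in range(len(flat) // cols)])
    return out

# Hypothetical 2-channel 3x3 input with 2x2 kernels.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
k = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
result = depthwise_as_matvec(x, k)
```

Because the per-channel products share no data, they can run in parallel, which matches the channel-wise parallelism the abstract mentions.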

BARVINN: Arbitrary Precision DNN Accelerator Controlled by a RISC-V CPU

  • Mohammadhossein Askarihemmat
  • Sean Wagner
  • Olexa Bilaniuk
  • Yassine Hariri
  • Yvon Savaria
  • Jean-Pierre David

We present a DNN accelerator that allows inference at arbitrary precision with dedicated processing elements that are configurable at the bit level. Our DNN accelerator has 8 Processing Elements controlled by a RISC-V controller with a combined 8.2 TMACs of computational power when implemented with the recent Alveo U250 FPGA platform. We develop a code generator tool that ingests CNN models in ONNX format and generates an executable command stream for the RISC-V controller. We demonstrate the scalable throughput of our accelerator by running different DNN kernels and models when different quantization levels are selected. Compared to other low precision accelerators, our accelerator provides run time programmability without hardware reconfiguration and can accelerate DNNs with multiple quantization levels, regardless of the target FPGA size. BARVINN is an open source project and it is available at

Agile Hardware and Software Co-Design for RISC-V-Based Multi-Precision Deep Learning Microprocessor

  • Zicheng He
  • Ao Shen
  • Qiufeng Li
  • Quan Cheng
  • Hao Yu

Network architecture search (NAS) has recently been widely applied to simplify deep learning neural networks, typically resulting in a multi-precision network. Many multi-precision accelerators have been developed as well to support computing multi-precision networks manually. A software-hardware interface is thereby needed to automatically map multi-precision networks onto multi-precision accelerators. In this paper, we have developed an agile hardware and software co-design for a RISC-V-based multi-precision deep learning microprocessor. We have designed custom RISC-V instructions with a framework to automatically compile multi-precision CNN networks onto multi-precision CNN accelerators, demonstrated on FPGA. Experiments show that with NAS-optimized multi-precision CNN models (LeNet, VGG16, ResNet, MobileNet), the RISC-V core with multi-precision accelerators can reach the highest throughput in 2-, 4-, and 8-bit precisions, respectively, on a Xilinx ZCU102 FPGA.

SESSION: Technical Program: New Advances in Hardware Trojan Detection

Hardware Trojan Detection Using Shapley Ensemble Boosting

  • Zhixin Pan
  • Prabhat Mishra

Due to the globalized semiconductor supply chain, there is an increasing risk of exposing system-on-chip designs to hardware Trojans (HT). While there are promising machine learning based HT detection techniques, they have three major limitations: ad-hoc feature selection, lack of explainability, and vulnerability towards adversarial attacks. In this paper, we propose a novel HT detection approach using an effective combination of Shapley value analysis and a boosting framework. Specifically, this paper makes two important contributions. We use Shapley value (SHAP) to analyze the importance ranking of input features. It not only provides explainable interpretation for HT detection, but also serves as a guideline for feature selection. We utilize boosting (ensemble learning) to generate a sequence of lightweight models that significantly reduces the training time while providing robustness against adversarial attacks. Experimental results demonstrate that our approach can drastically improve both detection accuracy (up to 24.6%) and time efficiency (up to 5.1x) compared to state-of-the-art HT detection techniques.

ASSURER: A PPA-friendly Security Closure Framework for Physical Design

  • Guangxin Guo
  • Hailong You
  • Zhengguang Tang
  • Benzheng Li
  • Cong Li
  • Xiaojue Zhang

Hardware security is an emerging concern in very large scale integration (VLSI). The seminal threats, like hardware Trojan insertion, probing attacks, and fault injection, are hard to detect and almost impossible to fix at the post-design stage. The optimal solution is to prevent them at the physical design stage. Usually, defending against them may cause a large loss in power, performance, and area (PPA). In this paper, we propose ASSURER, a PPA-friendly physical layout security closure framework. Reward-directed placement refinement and a multi-threshold partition algorithm are proposed to ensure that no Trojan threats remain. Cleaning up probing attacks is built on a patch-based ECO routing flow. Evaluated on the ISPD’22 benchmarks, ASSURER can clean out the Trojan threat with no leakage power increase when shrinking the physical layout area. When not shrinking, ASSURER increases total power by only 14%. Compared with the first-place work in the ISPD2022 Contest, ASSURER reduces additional total power consumption by 53%, and probing vulnerability can be reduced by 97.6% under the premise of timing closure. We believe this work opens up a new perspective for preventing Trojan insertion and probing attacks.

Static Probability Analysis Guided RTL Hardware Trojan Test Generation

  • Haoyi Wang
  • Qiang Zhou
  • Yici Cai

Directed test generation is an effective method to detect potential hardware Trojans (HT) in RTL. While existing works are able to activate hard-to-cover Trojans by covering security targets, the effectiveness and efficiency of identifying the targets to cover are ignored. We propose a static probability analysis method for identifying the hard-to-activate data channel targets and generating the corresponding assertions for HT test generation. Our method can generate test vectors to trigger Trojans from Trusthub, DeTrust, and OpenCores in 1 minute and achieves a 104.33X time improvement on average compared with the existing method.

Hardware Trojan Detection and High-Precision Localization in NoC-Based MPSoC Using Machine Learning

  • Haoyu Wang
  • Basel Halak

Networks-on-Chips (NoC) based Multi-Processor System-on-Chip (MPSoC) are increasingly employed in industrial and consumer electronics. Outsourcing third-party IPs (3PIPs) and tools in NoC-based MPSoC is a prevalent development way in most fabless companies. However, Hardware Trojan (HT) injected during its design stage can maliciously tamper with the functionality of this communication scheme, which undermines the security of the system and may cause a failure. Detecting and localizing HT with high precision is a challenge for current techniques. This work proposes for the first time a novel approach that allows detection and high-precision localization of HT, which is based on the use of packet information and machine learning algorithms. It is equipped with a novel Dynamic Confidence Interval (DCI) algorithm to detect malicious packets, and a novel Dynamic Security Credit Table (DSCT) algorithm to localize HT. We evaluated the proposed framework on the mesh NoC running real workloads. The average detection precision of 96.3% and the average localization precision of 100% were obtained from the experiment results, and the minimum HT localization time is around 5.8 ~ 12.9us at 2GHz depending on the different HT-infected nodes and workloads.

SESSION: Technical Program: Advances in Physical Design and Timing Analysis

An Integrated Circuit Partitioning and TDM Assignment Optimization Framework for Multi-FPGA Systems

  • Dan Zheng
  • Evangeline F. Y. Young

In multi-FPGA systems, Time-Division Multiplexing (TDM) is a widely used method for transferring multiple signals over a common wire. Circuit performance is significantly influenced by the resulting inter-FPGA delay. Some inter-FPGA nets are driven by different clocks, in which case they cannot share the same wire. In this paper, to minimize the maximum delay of inter-FPGA nets, we propose a two-step framework. First, a TDM-aware partitioning algorithm is adopted to minimize the maximum cut size between any FPGA pair. A TDM ratio assignment method is then applied to optimally assign a TDM ratio to each inter-FPGA net. Experimental results show that our algorithm can reduce the maximum TDM ratio significantly within reasonable runtime.
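
To make the link between cut size and TDM ratio concrete, the sketch below computes the smallest uniform ratio for a single FPGA pair. It assumes that a wire with ratio r carries at most r nets, that nets driven by different clocks are kept on separate wires, and that every wire uses the same ratio. The function name, the input layout, and the uniform-ratio simplification are assumptions for illustration, not the paper's assignment method.

```python
from collections import Counter
from math import ceil


def min_uniform_tdm_ratio(net_clocks, num_wires):
    """Smallest ratio r such that all nets fit on num_wires wires.

    net_clocks holds one clock name per inter-FPGA net crossing this pair.
    Returns None when there are fewer wires than clock domains.
    """
    groups = Counter(net_clocks)
    if not groups:
        return 1
    if num_wires < len(groups):
        return None
    total = sum(groups.values())
    # Try ratios in increasing order; the first feasible one is minimal.
    for ratio in range(1, total + 1):
        wires_needed = sum(ceil(n / ratio) for n in groups.values())
        if wires_needed <= num_wires:
            return ratio
    return None


# Hypothetical cut: 10 nets on clkA and 3 nets on clkB.
cut = ["clkA"] * 10 + ["clkB"] * 3
ratio_with_4_wires = min_uniform_tdm_ratio(cut, 4)
ratio_with_2_wires = min_uniform_tdm_ratio(cut, 2)
```

With four wires the ratio is 4 (three wires for clkA, one for clkB); with only two wires it rises to 10, which shows why a smaller cut per FPGA pair helps keep the maximum ratio low.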

A Robust FPGA Router with Concurrent Intra-CLB Rerouting

  • Jiarui Wang
  • Jing Mai
  • Zhixiong Di
  • Yibo Lin

Routing is the most time-consuming step in the FPGA design flow with increasingly complicated FPGA architectures and design scales. The growing complexity of connections between logic pins inside CLBs of FPGAs challenges the efficiency and quality of FPGA routers. Existing negotiation-based rip-up and reroute schemes will result in a large number of iterations when generating paths inside CLBs. In this work, we propose a robust routing framework for FPGAs with complex connections between logic elements and switch boxes. We propose a concurrent intra-CLB rerouting algorithm that can effectively resolve routing congestion inside a CLB tile. Experimental results on modified ISPD 2016 benchmarks demonstrate that our framework can achieve 100% routability in less wirelength and runtime, while the state-of-the-art VTR 8.0 routing algorithm fails at 4 of 12 benchmarks.

Efficient Global Optimization for Large Scaled Ordered Escape Routing

  • Chuandong Chen
  • Dishi Lin
  • Rongshan Wei
  • Qinghai Liu
  • Ziran Zhu
  • Jianli Chen

The Ordered Escape Routing (OER) problem, which is NP-hard, is critical in PCB design. Primary methods based on integer linear programming (ILP) or heuristic algorithms work well on small-scale PCBs with fewer pins. However, when dealing with large-scale instances, the performance of ILP strategies suffers dramatically as the number of variables increases due to time-consuming preprocessing. As for heuristic algorithms, rip-up and reroute is adopted to increase resource utilization, which frequently causes time violations. In this paper, we propose an efficient ILP-based routing engine for dense PCBs that simultaneously minimizes wiring length and runtime while considering the specific routing constraints. By weighting the length, we first model the OER problem as a special network flow problem. Then we separate the non-crossing constraint from typical ILP modeling to greatly reduce the number of integer variables. In addition, considering the congestion of routing resources, an ILP method is proposed to detect congestion. Finally, unlike traditional schemes that deal with negotiated congestion, our approach works by reducing the local area capacity and then allowing the global automatic optimization of congestion. Experimental results show that, compared with the state-of-the-art work, our algorithm can solve larger-scale cases with high routing quality and shorter wirelength, and reduces routing time by 76%.

An Adaptive Partition Strategy of Galerkin Boundary Element Method for Capacitance Extraction

  • Shengkun Wu
  • Biwei Xie
  • Xingquan Li

In advanced process nodes, electromagnetic coupling among interconnect wires plays an increasingly important role in signoff analysis. For VLSI chip design, the need for fast and accurate capacitance extraction is becoming increasingly urgent. The critical step in extracting capacitance among interconnect wires is solving for the electric field. However, due to its high computational complexity, solving for the electric field is extremely time-consuming. The Galerkin boundary element method (GBEM) was used for capacitance extraction in [2]. In this paper, we use mathematical theorems to analyze its error. Furthermore, using the error estimate of the Galerkin method, we design a boundary partition strategy that fits the electric field attenuation. Notably, this boundary partition strategy can greatly reduce the number of boundary elements on the premise that the error remains small enough. As a consequence, the matrix order of the discretized equation also decreases. We also provide suggestions for calculating the matrix elements. Experimental analysis demonstrates that our partition strategy obtains a sufficiently accurate result with a small number of boundary elements.

Graph-Learning-Driven Path-Based Timing Analysis Results Predictor from Graph-Based Timing Analysis

  • Yuyang Ye
  • Tinghuan Chen
  • Yifei Gao
  • Hao Yan
  • Bei Yu
  • Longxing Shi

With diminishing margins in advanced technology nodes, the performance of static timing analysis (STA), in terms of both accuracy and runtime, is a serious concern. STA can generally be divided into graph-based analysis (GBA) and path-based analysis (PBA). For GBA, the timing results are always pessimistic, leading to overdesign during design optimization. For PBA, the timing pessimism is reduced by propagating real path-specific slews, at the cost of severe runtime overheads relative to GBA. In this work, we present a fast and accurate predictor of post-layout PBA timing results from inexpensive GBA, based on a deep edge-featured graph attention network, namely deep EdgeGAT. Compared with conventional machine and graph learning methods, deep EdgeGAT can learn global timing path information. Experimental results demonstrate that our predictor can predict PBA timing results accurately and reduce the timing pessimism of GBA, with a maximum error of 6.81 ps, and our work achieves an average 24.80× speedup over PBA using the commercial STA tool.
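
The source of GBA pessimism can be seen in a toy example. The sketch below assumes a linear gate delay model (delay = base + factor × input slew) and a single gate with two fan-in paths; the model, its coefficients, and the numbers are hypothetical and chosen only to show the effect, not drawn from the paper.

```python
def gate_delay(input_slew_ps, base_ps=30.0, slew_factor=0.5):
    """Toy delay model: a slower input transition gives a longer gate delay."""
    return base_ps + slew_factor * input_slew_ps


def gba_arrival(fanin):
    """GBA: merge the worst arrival and the worst slew, even if they come
    from different paths, then evaluate the gate once."""
    worst_arrival = max(arrival for arrival, _ in fanin)
    worst_slew = max(slew for _, slew in fanin)
    return worst_arrival + gate_delay(worst_slew)


def pba_arrival(fanin):
    """PBA: evaluate each path with its own slew, then take the worst path."""
    return max(arrival + gate_delay(slew) for arrival, slew in fanin)


# (arrival_ps, slew_ps) at the two inputs of one gate.
fanin = [(100.0, 20.0), (90.0, 60.0)]
gba = gba_arrival(fanin)   # 100 + (30 + 0.5 * 60) = 160.0
pba = pba_arrival(fanin)   # max(100 + 40, 90 + 60) = 150.0
pessimism = gba - pba      # 10.0
```

GBA pairs the latest arrival with the slowest slew although no real path has both, so its result bounds PBA from above; PBA removes that gap but must enumerate paths, which is where the runtime overhead comes from.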

SESSION: Technical Program: Brain-Inspired Hyperdimensional Computing to the Rescue for Beyond von Neumann Era

Beyond von Neumann Era: Brain-Inspired Hyperdimensional Computing to the Rescue

  • Hussam Amrouch
  • Paul R. Genssler
  • Mohsen Imani
  • Mariam Issa
  • Xun Jiao
  • Wegdan Mohammad
  • Gloria Sepanta
  • Ruixuan Wang

Breakthroughs in deep learning (DL) continuously fuel innovations that profoundly improve our daily life. However, deep neural networks (DNNs) overwhelm conventional computing architectures with their massive data movements between processing and memory units. As a result, novel computer architectures are indispensable to improve or even replace the decades-old von Neumann architecture. Nevertheless, going far beyond the existing von Neumann principles comes with profound reliability challenges for the performed computations. This is because analog computing and emerging beyond-CMOS technologies are inherently noisy and inevitably lead to unreliable computing. Hence, novel robust algorithms become key to going beyond the boundaries of the von Neumann era. Hyper-dimensional Computing (HDC) is rapidly emerging as an attractive alternative to traditional DL and ML algorithms. Unlike conventional DL and ML algorithms, HDC is inherently robust against errors, along with a much more efficient hardware implementation. In addition to these advantages at the hardware level, HDC’s promise to learn from little data and the underlying algebra enable new possibilities at the application level. In this work, the robustness of HDC algorithms against errors and beyond von Neumann architectures are discussed. Further, the benefits of HDC as a machine learning algorithm are demonstrated with the example of outlier detection and reinforcement learning.
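
The robustness claim can be illustrated with a minimal sketch: information in a hypervector is spread over thousands of components, so a nearest-prototype lookup survives a large fraction of flipped components. The code assumes random bipolar hypervectors and dot-product similarity; the dimension, the 30% error rate, the labels, and all function names are hypothetical choices for illustration.

```python
import random


def random_hypervector(dim, rng):
    """Random bipolar hypervector with components in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def flip_components(vec, fraction, rng):
    """Copy of vec with the given fraction of components sign-flipped."""
    noisy = list(vec)
    for i in rng.sample(range(len(vec)), round(fraction * len(vec))):
        noisy[i] = -noisy[i]
    return noisy


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classify(query, prototypes):
    """Label of the prototype most similar to the query."""
    return max(prototypes, key=lambda label: dot(query, prototypes[label]))


rng = random.Random(0)
DIM = 2000
prototypes = {label: random_hypervector(DIM, rng) for label in ("A", "B", "C")}

# Corrupt 30% of the components of prototype A, as a stand-in for hardware errors.
noisy_query = flip_components(prototypes["A"], 0.30, rng)
predicted = classify(noisy_query, prototypes)
```

After flipping 600 of 2000 components, the similarity to the original prototype is still 2000 − 2 × 600 = 800, while the similarity to an unrelated random prototype stays close to zero, so the lookup still returns the correct label.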

SESSION: Technical Program: System Level Design Space Exploration

System-Level Exploration of In-Package Wireless Communication for Multi-Chiplet Platforms

  • Rafael Medina
  • Joshua Kein
  • Giovanni Ansaloni
  • Marina Zapater
  • Sergi Abadal
  • Eduard Alarcón
  • David Atienza

Multi-Chiplet architectures are being increasingly adopted to support the design of very large systems in a single package, facilitating the integration of heterogeneous components and improving manufacturing yield. However, chiplet-based solutions have to cope with limited inter-chiplet routing resources, which complicate the design of the data interconnect and the power delivery network. Emerging in-package wireless technology is a promising strategy to address these challenges, as it allows flexible chiplet interconnects to be implemented while freeing package resources for power supply connections. To assess the capabilities of such an approach and its impact from a full-system perspective, we present an exploration of the performance of in-package wireless communication, based on dedicated extensions to the gem5-X simulator. We consider different Medium Access Control (MAC) protocols, as well as applications with different runtime profiles, showcasing that current in-package wireless solutions are competitive with wired chiplet interconnects. Our results show how in-package wireless solutions can outperform wired alternatives when running artificial intelligence workloads, achieving up to a 2.64× speed-up when running deep neural networks (DNNs) on a chiplet-based system with 16 cores distributed in four clusters.

Efficient System-Level Design Space Exploration for High-Level Synthesis Using Pareto-Optimal Subspace Pruning

  • Yuchao Liao
  • Tosiron Adegbija
  • Roman Lysecky

High-level synthesis (HLS) is a rapidly evolving and popular approach to designing, synthesizing, and optimizing embedded systems. Many HLS methodologies utilize design space exploration (DSE) at the post-synthesis stage to find Pareto-optimal hardware implementations for individual components. However, the design space for the system-level Pareto-optimal configurations is orders of magnitude larger than component-level design space, making existing approaches insufficient for system-level DSE. This paper presents Pruned Genetic Design Space Exploration (PG-DSE)—an approach to post-synthesis DSE that involves a pruning method to effectively reduce the system-level design space and an elitist genetic algorithm to accurately find the system-level Pareto-optimal configurations. We evaluate PG-DSE using an autonomous driving application subsystem (ADAS) and three synthetic systems with extremely large design spaces. Experimental results show that PG-DSE can reduce the design space by several orders of magnitude compared to prior work while achieving higher quality results (an average improvement of 58.1x).
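
The gap between component-level and system-level design spaces can be sketched as follows. The code assumes two objectives that are both minimized and add up across components (for example latency and area), in which case every system-level Pareto-optimal configuration is built from component-level Pareto-optimal points, so dominated component points can be pruned before combining. The function names, the additive cost model, and the data are assumptions for illustration; this is not the PG-DSE pruning method or its genetic algorithm.

```python
from itertools import product


def dominates(p, q):
    """True if p is no worse than q in every objective and better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(points):
    """Points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def system_front(component_spaces):
    """Prune each component to its own front, then combine additively."""
    pruned = [pareto_front(space) for space in component_spaces]
    combos = [tuple(map(sum, zip(*choice))) for choice in product(*pruned)]
    return sorted(set(pareto_front(combos)))


# Hypothetical (latency, area) points for two components.
component_a = [(10, 5), (8, 7), (12, 6), (6, 10)]
component_b = [(4, 3), (5, 3), (3, 6)]
front = system_front([component_a, component_b])
```

Here pruning shrinks the combined space from 12 to 6 configurations before the system-level front is computed; with many components the saving grows multiplicatively, which is why exhaustive system-level exploration quickly becomes impractical.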

Automatic Generation of Complete Polynomial Interpolation Design Space for Hardware Architectures

  • Bryce Orloski
  • Samuel Coward
  • Theo Drane

Hardware implementations of elementary functions regularly deploy piecewise polynomial approximations. This work determines the complete design space of piecewise polynomial approximations meeting a given accuracy specification. Knowledge of this design space determines the minimum number of regions required to approximate the function accurately enough and facilitates the generation of optimized hardware which is competitive against the state of the art. Designers can explore the space of feasible architectures without needing to validate their choices. A heuristic based decision procedure is proposed to generate optimal ASIC hardware designs. Targeting alternative hardware technologies simply requires a modified decision procedure to explore the space. We highlight the difficulty in choosing an optimal number of regions to approximate the function with, as this is input width dependent.

SESSION: Technical Program: Security Assurance and Acceleration

SHarPen: SoC Security Verification by Hardware Penetration Test

  • Hasan Al-Shaikh
  • Arash Vafaei
  • Mridha Md Mashahedur Rahman
  • Kimia Zamiri Azar
  • Fahim Rahman
  • Farimah Farahmandi
  • Mark Tehranipoor

As modern SoC architectures incorporate many complex/heterogeneous intellectual properties (IPs), the protection of security assets has become imperative, and the number of vulnerabilities revealed is rising due to the increased number of attacks. Over the last few years, penetration testing (PT) has become an increasingly effective means of detecting software (SW) vulnerabilities. As of yet, no such technique has been applied to the detection of hardware vulnerabilities. This paper proposes a PT framework, SHarPen, for detecting hardware vulnerabilities, which facilitates the development of a SoC-level security verification framework. SHarPen proposes a formalism for performing gray-box hardware (HW) penetration testing instead of relying on coverage-based testing and provides an automation for mapping hardware vulnerabilities to logical/mathematical cost functions. SHarPen supports both simulation and FPGA-based prototyping, allowing us to automate security testing at different stages of the design process with high capabilities for identifying vulnerabilities in the targeted SoC.

SecHLS: Enabling Security Awareness in High-Level Synthesis

  • Shang Shi
  • Nitin Pundir
  • Hadi M Kamali
  • Mark Tehranipoor
  • Farimah Farahmandi

In the quest for further optimization, high-level synthesis (HLS) tools utilize advanced automatic optimization algorithms to achieve lower implementation time/effort for even more complex designs. These optimization algorithms serve the HLS tools’ backend stages, e.g., allocation, scheduling, and binding, and they are highly optimized for resource/latency constraints. However, the backends of current HLS tools are unaware of designs’ security assets, and their algorithms are incapable of handling security constraints. In this paper, we propose Secure-HLS (SecHLS), which aims to define underlying security constraints for HLS tools’ backend stages and intermediate representations. In SecHLS, we improve a set of widely-used scheduling and binding algorithms by integrating the proposed security-related constraints into them. We evaluate the effectiveness of SecHLS in terms of power, performance, area (PPA), security, and complexity (execution time) on small and real-size benchmarks, showing how the proposed security constraints can be integrated into HLS while maintaining low PPA/complexity burdens.

A Flexible ASIC-Oriented Design for a Full NTRU Accelerator

  • Francesco Antognazza
  • Alessandro Barenghi
  • Gerardo Pelosi
  • Ruggero Susella

Post-quantum cryptosystems are the subject of a significant research effort, witnessed by various international standardization competitions. Among them, the NTRU Key Encapsulation Mechanism has been recognized as a secure, patent-free, and efficient public key encryption scheme. In this work, we perform a design space exploration on an FPGA target, with the final goal of an efficient ASIC realization. Specifically, we focus on the possible choices for the design of polynomial multipliers with different memory bus widths to trade off lower clock cycle counts against larger interconnections. Our design outperforms the best state-of-the-art FPGA synthesis results, and we report the results of ASIC syntheses minimizing latency and area with a 40nm industrial grade technology library. Our speed-oriented design computes an encapsulation in 4.1 to 10.2μs and a decapsulation in 7.1 to 11.7μs, depending on the NTRU security level, while our most compact design only takes 20% more area than the underlying SHA-3 hash module.

SESSION: Technical Program: Hardware and Software Co-Design of Emerging Machine Learning Algorithms

Robust Hyperdimensional Computing against Cyber Attacks and Hardware Errors: A Survey

  • Dongning Ma
  • Sizhe Zhang
  • Xun Jiao

Hyperdimensional Computing (HDC), also known as Vector Symbolic Architecture (VSA), is an emerging AI algorithm inspired by the way the human brain functions. Compared with deep neural networks (DNNs), HDC possesses several advantages such as smaller model size, less computation cost, and one/few-shot learning, making it a promising alternative computing paradigm. With the increasing deployment of AI in safety-critical systems such as healthcare and robotics, it is important not only to strive for high accuracy, but also to ensure robustness even in highly uncertain and adversarial environments. However, recent studies show that HDC, just like DNNs, is vulnerable to both cyber attacks (e.g., adversarial attacks) and hardware errors (e.g., memory failures). While a growing body of research has been studying the robustness of HDC, there is a lack of systematic review of research efforts on this increasingly important topic. To the best of our knowledge, this paper presents the first survey dedicated to reviewing the research efforts on the robustness of HDC against cyber attacks and hardware errors. While the performance and accuracy of HDC as an AI method still await future theoretical advancement, this survey paper aims to shed light on, and call for community efforts in, robustness research of HDC.

In-Memory Computing Accelerators for Emerging Learning Paradigms

  • Dayane Reis
  • Ann Franchesca Laguna
  • Michael Niemier
  • Xiaobo Sharon Hu

Over the past decades, emerging, data-driven machine learning (ML) paradigms have increased in popularity, and revolutionized many application domains. To date, a substantial effort has been devoted to devising mechanisms for facilitating the deployment and near ubiquitous use of these memory intensive ML models. This review paper presents the use of in-memory computing (IMC) accelerators for emerging ML paradigms from a bottom-up perspective through the choice of devices, the design of circuits/architectures, to the application-level results.

Toward Fair and Efficient Hyperdimensional Computing

  • Yi Sheng
  • Junhuan Yang
  • Weiwen Jiang
  • Lei Yang

We are witnessing Machine Learning (ML) being applied to varied applications, such as intelligent security systems and medical diagnosis. With this trend, there is high demand to run ML on end devices with limited resources. Moreover, fairness in these ML algorithms is increasingly important, since these applications are not designed for specific users (e.g., people with fair skin in skin disease diagnosis) but need to be applied to all possible users (i.e., people with different skin tones). Brain-inspired hyperdimensional computing (HDC) has demonstrated its ability to run ML tasks on edge devices with a small memory footprint; yet, it is unknown whether HDC can satisfy the fairness requirements of applications (e.g., medical diagnosis for people with different skin tones). In this paper, for the first time, we reveal that vanilla HDC has severe bias due to its sensitivity to color information. Toward a fair and efficient HDC, we propose a holistic framework, namely FE-HDC, which integrates image processing and input compression techniques in HDC’s encoder. Compared with vanilla HDC, results show that the proposed FE-HDC can reduce the unfairness score by 90%, achieving fairer architectures with competitively high accuracy.

SESSION: Technical Program: Full-Stack Co-Design for on-Chip Learning in AI Systems

Improving the Robustness and Efficiency of PIM-Based Architecture by SW/HW Co-Design

  • Xiaoxuan Yang
  • Shiyu Li
  • Qilin Zheng
  • Yiran Chen

Processing-in-memory (PIM) based architecture shows great potential to process several emerging artificial intelligence workloads, including vision and language models. Cross-layer optimizations could bridge the gap between computing density and the available resources by reducing the computation and memory cost of the model and improving the model’s robustness against non-ideal hardware effects. We first introduce several hardware-aware training methods to improve the model robustness to the PIM device’s non-ideal effects, including stuck-at-fault, process variation, and thermal noise. Then, we further demonstrate a software/hardware (SW/HW) co-design methodology to efficiently process the state-of-the-art attention-based model on PIM-based architecture by performing sparsity exploration for the attention-based model and circuit-architecture co-design to support the sparse processing.

Hardware-Software Co-Design for On-Chip Learning in AI Systems

  • M. L. Varshika
  • Abhishek Kumar Mishra
  • Nagarajan Kandasamy
  • Anup Das

Spike-based convolutional neural networks (CNNs) are empowered with on-chip learning in their convolution layers, enabling the layer to learn to detect features by combining those extracted in the previous layer. We propose ECHELON, a generalized design template for tile-based neuromorphic hardware with on-chip learning capabilities. Each tile in ECHELON consists of a neural processing unit (NPU) to implement convolution and dense layers of a CNN model, an on-chip learning unit (OLU) to facilitate spike-timing dependent plasticity (STDP) in the convolution layer, and a special function unit (SFU) to implement other CNN functions such as pooling, concatenation, and residual computation. These tile resources are interconnected using a shared bus, which is segmented and configured via software to facilitate parallel communication inside the tile. Tiles are themselves interconnected using a classical Network-on-Chip (NoC) interconnect. We propose system software to map CNN models to ECHELON, maximizing the performance. We integrate the hardware design and software optimization within a co-design loop to obtain the hardware and software architectures for a target CNN, satisfying both performance and resource constraints. In this preliminary work, we show the implementation of a tile on an FPGA and some early evaluations. Using 8 STDP-enabled CNN models, we show the potential of our co-design methodology to optimize hardware resources.

Towards On-Chip Learning for Low Latency Reasoning with End-to-End Synthesis

  • Vito Giovanni Castellana
  • Nicolas Bohm Agostini
  • Ankur Limaye
  • Vinay Amatya
  • Marco Minutoli
  • Joseph Manzano
  • Antonino Tumeo
  • Serena Curzel
  • Michele Fiorito
  • Fabrizio Ferrandi

The Software Defined Architectures (SODA) Synthesizer is an open-source compiler-based tool able to automatically generate domain-specialized systems targeting Application-Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) starting from high-level programming. SODA is composed of a frontend, SODA-OPT, which leverages the multilevel intermediate representation (MLIR) framework to interface with productive programming tools (e.g., machine learning frameworks), identify kernels suitable for acceleration, and perform high-level optimizations, and of a state-of-the-art high-level synthesis backend, Bambu from the PandA framework, to generate custom accelerators. One specific application of the SODA Synthesizer is the generation of accelerators to enable ultra-low latency inference and control on autonomous systems for scientific discovery (e.g., electron microscopes, sensors in particle accelerators, etc.). This paper provides an overview of the flow in the context of the generation of accelerators for edge processing to be integrated in transmission electron microscopy (TEM) devices, focusing on use cases from precision material synthesis. We show the tool in action with an example of design space exploration for inference on reconfigurable devices with a conventional deep neural network model (LeNet). Finally, we discuss the research directions and opportunities enabled by SODA in the area of autonomous control for scientific experimental workflows.

SESSION: Technical Program: Energy-Efficient Computing for Emerging Applications

Knowledge Distillation in Quantum Neural Network Using Approximate Synthesis

  • Mahabubul Alam
  • Satwik Kundu
  • Swaroop Ghosh

Recent assertions of a potential advantage of Quantum Neural Networks (QNNs) for specific Machine Learning (ML) tasks have sparked the curiosity of a sizable number of application researchers. The parameterized quantum circuit (PQC), a major building block of a QNN, consists of several layers of single-qubit rotations and multi-qubit entanglement operations. The optimum number of PQC layers for a particular ML task is generally unknown. A larger network often provides better performance in noiseless simulations. However, it may perform poorly on hardware compared to a shallower network. Because the amount of noise varies amongst quantum devices, the optimal depth of a PQC can vary significantly. Additionally, the gates chosen for the PQC may be suitable for one type of hardware but not for another due to compilation overhead. This makes it difficult to generalize a QNN design to a wide range of hardware and noise levels. An alternative approach is to build and train multiple QNN models, one targeted at each hardware platform, which can be expensive. To circumvent these issues, we introduce the concept of knowledge distillation in QNNs using approximate synthesis. The proposed approach creates a new QNN with (i) a reduced number of layers or (ii) a different gate set without having to train it from scratch. Training the new network for a few epochs can compensate for the loss caused by approximation error. Through empirical analysis, we demonstrate a ≈71.4% reduction in circuit layers while still achieving ≈16.2% better accuracy under noise.

NTGAT: A Graph Attention Network Accelerator with Runtime Node Tailoring

  • Wentao Hou
  • Kai Zhong
  • Shulin Zeng
  • Guohao Dai
  • Huazhong Yang
  • Yu Wang

The Graph Attention Network (GAT) has demonstrated better performance in many graph tasks than previous Graph Neural Networks (GNNs). However, it involves graph attention operations with extra computing complexity. While a large body of existing literature has studied GNN acceleration, few works have focused on the attention mechanism in GAT. The graph attention mechanism makes the computation flow different. Therefore, previous GNN accelerators cannot support GAT well. Besides, GAT distinguishes the importance of neighbors, which makes it possible to reduce the workload through runtime tailoring. We present NTGAT, a software-hardware co-design approach to accelerate GAT with runtime node tailoring. Our work comprises both a runtime node tailoring algorithm and an accelerator design. We propose a pipeline sorting method and a hardware unit to support node tailoring during inference. The experiments show that our algorithm can reduce the aggregation workload by up to 86% while incurring only a slight accuracy loss (<0.4%). The FPGA-based accelerator achieves up to 3.8× speedup and 4.98× energy efficiency compared to the GPU baseline.

A Low-Bitwidth Integer-STBP Algorithm for Efficient Training and Inference of Spiking Neural Networks

  • Pai-Yu Tan
  • Cheng-Wen Wu

Spiking neural networks (SNNs) that enable energy-efficient neuromorphic hardware are receiving growing attention. Training SNNs directly with back-propagation has demonstrated accuracy comparable to deep neural networks (DNNs). However, previous direct-training algorithms require high-precision floating-point operations, which are not suitable for low-power end-point devices. The high-precision operations also require the learning algorithm to run on high-performance accelerator hardware. In this paper, we propose an improved approach that converts the high-precision floating-point operations to low-bitwidth integer operations for an existing direct-training algorithm, i.e., the Spatio-Temporal Back-Propagation (STBP) algorithm. The proposed low-bitwidth Integer-STBP algorithm requires only integer arithmetic for SNN training and inference, which greatly reduces the computational complexity. Experimental results show that the proposed Integer-STBP algorithm achieves comparable accuracy and higher energy efficiency than the original floating-point STBP algorithm. Moreover, it can be implemented on low-power end-point devices, which are mostly supported by fixed-point hardware, to provide learning capability during inference.

TiC-SAT: Tightly-Coupled Systolic Accelerator for Transformers

  • Alireza Amirshahi
  • Joshua Alexander Harrison Klein
  • Giovanni Ansaloni
  • David Atienza

Transformer models have achieved impressive results in various AI scenarios, ranging from vision to natural language processing. However, their computational complexity and their vast number of parameters hinder their implementations on resource-constrained platforms. Furthermore, while loosely-coupled hardware accelerators have been proposed in the literature, data transfer costs limit their speed-up potential. We address this challenge along two axes. First, we introduce tightly-coupled, small-scale systolic arrays (TiC-SATs), governed by dedicated ISA extensions, as dedicated functional units to speed up execution. Then, thanks to the tightly-coupled architecture, we employ software optimizations to maximize data reuse, thus lowering miss rates across cache hierarchies. Full system simulations across various BERT and Vision-Transformer models are employed to validate our strategy, resulting in substantial application-wide speed-ups (e.g., up to 89.5X for BERT-large). TiC-SAT is available as an open-source framework.

SESSION: Technical Program: Side-Channel Attacks and RISC-V Security

PMU-Leaker: Performance Monitor Unit-Based Realization of Cache Side-Channel Attacks

  • Pengfei Qiu
  • Qiang Gao
  • Dongsheng Wang
  • Yongqiang Lyu
  • Chunlu Wang
  • Chang Liu
  • Rihui Sun
  • Gang Qu

The Performance Monitor Unit (PMU) is a special hardware module in processors that contains a set of counters to record various architectural and micro-architectural events. In this paper, we propose PMU-Leaker, a novel realization of all existing cache side-channel attacks where accurate execution time measurements are replaced by information leaked through the PMU. The efficacy of PMU-Leaker is demonstrated by (1) leaking the secret data stored in Intel Software Guard Extensions (SGX) with transient execution vulnerabilities including Spectre and ZombieLoad and (2) extracting the encryption key of a victim AES performed in SGX. We perform thorough experiments on a DELL Inspiron 15-7560 laptop that has an Intel® Core i5-7200U processor with the Kaby Lake architecture, and the results show that, among the 176 PMU counters, 24 are vulnerable and can be used to launch the PMU-Leaker attack.

EO-Shield: A Multi-Function Protection Scheme against Side Channel and Focused Ion Beam Attacks

  • Ya Gao
  • Qizhi Zhang
  • Haocheng Ma
  • Jiaji He
  • Yiqiang Zhao

Smart devices, especially Internet-connected devices, typically incorporate security protocols and cryptographic algorithms to ensure control flow integrity and information security. However, there are various invasive and non-invasive attacks that try to tamper with these devices. Chip-level active shields have been proven to be an effective countermeasure against invasive attacks, but existing active shields cannot be utilized to counter side-channel attacks (SCAs). In this paper, we propose a multi-function protection scheme and an active shield prototype to defend against invasive and non-invasive attacks simultaneously. The protection scheme has a complex active shield implemented using the top metal layer of the chip and an information leakage obfuscation module underneath. The leakage obfuscation module generates its protection patterns based on the operating conditions of the circuit that needs to be protected, thus reducing the correlation between electromagnetic (EM) emanations and cryptographic data. We implement the protection scheme on one Advanced Encryption Standard (AES) circuit to demonstrate the effectiveness of the method. Experimental results demonstrate that the information leakage obfuscation module decreases SNR below 0.6 and reduces the success rate of SCAs. Compared to existing single-function protection methods against physical attacks, the proposed scheme provides good performance against both invasive and non-invasive attacks.

CompaSeC: A Compiler-Assisted Security Countermeasure to Address Instruction Skip Fault Attacks on RISC-V

  • Johannes Geier
  • Lukas Auer
  • Daniel Mueller-Gritschneder
  • Uzair Sharif
  • Ulf Schlichtmann

Fault-injection attacks are a risk for any computing system executing security-relevant tasks, such as a secure boot process. While hardware-based countermeasures to these invasive attacks have been found to be a suitable option, they have to be implemented via hardware extensions and are thus not available in most Commonly used Off-The-Shelf (COTS) components. Software Implemented Hardware Fault Tolerance (SIHFT) is therefore the only valid option to enhance a COTS system’s resilience against fault attacks. Established SIHFT techniques usually target the detection of random hardware errors for functional safety and not targeted attacks. Using the example of a secure boot system running on a RISC-V processor, in this work we first show that when the software is hardened by these existing techniques from the safety domain, the vulnerabilities of the boot process to single, double, triple, and quadruple instruction skips cannot be fully closed. We extend these techniques to the security domain and propose Compiler-assisted Security Countermeasure (CompaSeC). We demonstrate that CompaSeC can close all vulnerabilities for the studied secure boot system. To further reduce performance and memory overheads, we additionally propose a method for CompaSeC to selectively harden individual vulnerable functions without compromising the security against the considered instruction skip faults.

Trojan-D2: Post-Layout Design and Detection of Stealthy Hardware Trojans – A RISC-V Case Study

  • Sajjad Parvin
  • Mehran Goli
  • Frank Sill Torres
  • Rolf Drechsler

With the exponential increase in the popularity of the RISC-V ecosystem, the security of this platform must be re-evaluated, especially for mission-critical and IoT devices. Besides, the insertion of a Hardware Trojan (HT) into a chip after the in-house mask design is outsourced to a chip manufacturer abroad for fabrication is a significant source of concern. Though abundant HT detection methods based on side-channel analysis, physical measurements, and functional testing have been investigated to overcome this problem, there exist stealthy HTs that can hide from detection. This is due to the small overhead of such HTs compared to the whole circuit.

In this work, we propose several novel HTs that can be placed into a RISC-V core’s post-layout in an untrusted manufacturing environment. Next, we propose a non-invasive analytical method based on contactless optical probing to detect any stealthy HTs. Finally, we propose an open-source library of HTs that can be placed into a processor unit in the post-layout phase. All the designs in this work are done using a commercial 28nm technology.

SESSION: Technical Program: Simulation and Verification of Quantum Circuits

Graph Partitioning Approach for Fast Quantum Circuit Simulation

  • Jaekyung Im
  • Seokhyeong Kang

Owing to the exponential increase in computational complexity, the fast simulation of large quantum circuits has become very difficult. This is an important challenge for the utilization of quantum computers because it is closely related to the verification of quantum computation by classical machines. The Hybrid Schrödinger-Feynman simulation seems to be a promising solution, but its application is very limited. To address this drawback, we propose an improved simulation method based on graph partitioning. Experimental results show that our approach significantly reduces the simulation time of the Hybrid Schrödinger-Feynman simulation.

A Robust Approach to Detecting Non-Equivalent Quantum Circuits Using Specially Designed Stimuli

  • Hsiao-Lun Liu
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

As several compilation and optimization techniques have been proposed, equivalence checking for quantum circuits has become essential in design flows. The state-of-the-art approach to this problem observed that even small errors substantially affect the entire quantum system. As a result, it exploited random simulations to prove the non-equivalence of two quantum circuits. However, when errors occurred close to the outputs, it was hard for that work to prove the non-equivalence of some non-equivalent quantum circuits under a limited number of simulations. In this work, we propose a novel simulation-based approach using a set of specially designed stimuli. The number of simulation runs of the proposed approach is linear rather than exponential in the number of quantum bits of a circuit. According to the experimental results, the success rate of our approach is 100% (100%) under a simulation run (execution time) constraint for a set of benchmarks, while that of the state-of-the-art is only 69% (74%) on average. Our approach also achieves a speedup of 26 on average.

Equivalence Checking of Parameterized Quantum Circuits: Verifying the Compilation of Variational Quantum Algorithms

  • Tom Peham
  • Lukas Burgholzer
  • Robert Wille

Variational quantum algorithms have been introduced as a promising class of quantum-classical hybrid algorithms that can already be used with the noisy quantum computing hardware available today by employing parameterized quantum circuits. Considering the non-trivial nature of quantum circuit compilation and the subtleties of quantum computing, it is essential to verify that these parameterized circuits have been compiled correctly. Established equivalence checking procedures that handle parameter-free circuits already exist. However, no methodology capable of handling circuits with parameters has been proposed yet. This work fills this gap by showing that verifying the equivalence of parameterized circuits can be achieved in a purely symbolic fashion using an equivalence checking approach based on the ZX-calculus. At the same time, proofs of inequality can be efficiently obtained with conventional methods by taking advantage of the degrees of freedom inherent to parameterized circuits. We implemented the corresponding methods and proved that the resulting methodology is complete. Experimental evaluations (using the entire parametric ansatz circuit library provided by Qiskit as benchmarks) demonstrate the efficacy of the proposed approach.

Software Tools for Decoding Quantum Low-Density Parity-Check Codes

  • Lucas Berent
  • Lukas Burgholzer
  • Robert Wille

Quantum Error Correction (QEC) is an essential field of research towards the realization of large-scale quantum computers. On the theoretical side, a lot of effort is put into designing error-correcting codes that protect quantum data from errors, which inevitably happen due to the noisy nature of quantum hardware and quantum bits (qubits). Protecting data with an error-correcting code necessitates means to recover the original data, given a potentially corrupted data set, a task referred to as decoding. It is vital that decoding algorithms can recover error-free states in an efficient manner. While theoretical properties of certain QEC methods have been extensively studied, good techniques to analyze their performance in practically more relevant settings are still widely unexplored. In this work, we propose a set of software tools that facilitate numerical experiments with so-called Quantum Low-Density Parity-Check codes (QLDPC codes), a broad class of codes, some of which have recently been shown to be asymptotically good. Based on that, we provide an implementation of a general decoder for QLDPC codes. On top of that, we propose a highly efficient heuristic decoder that eliminates the runtime bottlenecks of the general QLDPC decoder while still maintaining comparable decoding performance. These tools eventually make it possible to confirm theoretical results around QLDPC codes in a more practical setting and showcase the value of software tools (in addition to theoretical considerations) for investigating codes for practical applications. The resulting tool, which is publicly available as part of the Munich Quantum Toolkit (MQT), is meant to provide a playground for the search for “practically good” quantum codes.

SESSION: Technical Program: Learning x Security in DFM

Enabling Scalable AI Computational Lithography with Physics-Inspired Models

  • Haoyu Yang
  • Haoxing Ren

Computational lithography is a critical research area for the continued scaling of semiconductor manufacturing process technology by enhancing silicon printability via numerical computing methods. Today’s solutions for these problems are primarily CPU-based and require many thousands of CPUs running for days to tape out a modern chip. We seek AI/GPU-assisted solutions for the two problems, aiming at improving both runtime and quality. Prior academic research has proposed using machine learning for lithography modeling and mask optimization, typically represented as image-to-image mapping problems, where convolution layer backboned UNets and ResNets are applied. However, due to the lack of domain knowledge integrated into the framework designs, these solutions have been limited by their application scenarios or performance. Our method aims to tackle the limitations of such previous CNN-based solutions by introducing lithography bias into the neural network design, yielding a much more efficient model design and significant performance improvements.

Data-Driven Approaches for Process Simulation and Optical Proximity Correction

  • Hao-Chiang Shao
  • Chia-Wen Lin
  • Shao-Yun Fang

With the continuous shrinking of process nodes, semiconductor manufacturing encounters increasingly serious inconsistency between designed layout patterns and the resulting wafer images. Conventionally, examining how a layout pattern can deviate from its original after complicated process steps, such as optical lithography and subsequent etching, relies on computationally expensive process simulation, which suffers from very long runtimes for large-scale circuit layouts, especially in advanced nodes. In addition, as one of the most important and commonly adopted resolution enhancement techniques, optical proximity correction (OPC) corrects image errors due to process effects by moving segment edges or adding extra polygons to mask patterns, but it is generally driven by simulation or time-consuming inverse lithography techniques (ILTs) to achieve acceptable accuracy. As a result, more and more state-of-the-art works on process simulation and/or OPC resort to the fast inference characteristic of machine/deep learning. This paper reviews these data-driven approaches to highlight the challenges in various aspects, explore preliminary solutions, and reveal possible future directions to push forward the frontiers of research in design for manufacturability.

Mixed-Type Wafer Failure Pattern Recognition

  • Hao Geng
  • Qi Sun
  • Tinghuan Chen
  • Qi Xu
  • Tsung-Yi Ho
  • Bei Yu

The ongoing evolution in process fabrication enables us to step below the 5nm technology node. Although foundries can pattern and etch smaller but more complex circuits on silicon wafers, a multitude of challenges persist. For example, defects on the surface of wafers are inevitable during manufacturing. To increase the yield rate and reduce time-to-market, it is vital to recognize these failures and identify the failure mechanisms of these defects. Recently, the application of machine learning-powered methods to single defect pattern classification has made significant progress. However, as the processes become increasingly complicated, various single-type defect patterns may emerge and be coupled on a wafer and thus shape a mixed-type pattern. In this paper, we survey the recent progress on advanced methodologies for wafer failure pattern recognition, especially for mixed-type patterns. We sincerely hope this literature review can highlight future directions and promote the advancement of wafer failure pattern recognition.

SESSION: Technical Program: Lightweight Models for Edge AI

Accelerating Convolutional Neural Networks in Frequency Domain via Kernel-Sharing Approach

  • Bosheng Liu
  • Hongyi Liang
  • Jigang Wu
  • Xiaoming Chen
  • Peng Liu
  • Yinhe Han

Convolutional neural networks (CNNs) are typically computationally heavy. Fast algorithms, such as fast Fourier transforms (FFTs), are promising for significantly reducing computation complexity by replacing convolutions with frequency-domain element-wise multiplication. However, the high memory access overhead of complex weights counteracts the computing benefit, because frequency-domain convolutions not only pad weights to the same size as input maps, but also have no sharable complex kernel weights. In this work, we propose an FFT-based kernel-sharing technique called FS-Conv to reduce memory access. Based on FS-Conv, we derive the sharable complex weights in frequency-domain convolutions, a problem that has not been solved before. FS-Conv includes a hybrid padding approach, which utilizes the inherent periodic characteristic of the FFT to provide sharable complex weights for different blocks of complex input maps. In addition, we build a frequency-domain inference accelerator (called Yixin) that can utilize the sharable complex weights for CNN acceleration. Evaluation results demonstrate significant performance and energy efficiency benefits compared with the state-of-the-art baseline.
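The equivalence that frequency-domain convolution relies on can be illustrated with a minimal sketch. It is not FS-Conv and does not reproduce the kernel-sharing or hybrid padding described above; it only shows, under the assumption of a 1D circular convolution, that padding the kernel to the input size and multiplying element-wise in the frequency domain gives the same result as direct convolution. The function names are hypothetical, and a naive O(n²) DFT stands in for a real FFT to keep the example dependency-free.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def pad_kernel(kernel, n):
    """Zero-pad the kernel to the input length, as frequency-domain convolution requires."""
    return list(kernel) + [0.0] * (n - len(kernel))

def circular_conv_direct(signal, kernel):
    """Reference: circular convolution computed directly in the spatial domain."""
    n = len(signal)
    padded = pad_kernel(kernel, n)
    return [sum(signal[(i - j) % n] * padded[j] for j in range(n)) for i in range(n)]

def circular_conv_freq(signal, kernel):
    """Same result via element-wise multiplication of the two spectra."""
    n = len(signal)
    spectrum_s = dft(signal)
    spectrum_k = dft(pad_kernel(kernel, n))  # complex weights, one per input element
    product = [a * b for a, b in zip(spectrum_s, spectrum_k)]
    return [v.real for v in dft(product, inverse=True)]

if __name__ == "__main__":
    print(circular_conv_direct([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]))
    print(circular_conv_freq([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]))
```

Note that the padded kernel spectrum has as many complex entries as the input, which is the memory overhead the abstract refers to.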

Mortar: Morphing the Bit Level Sparsity for General Purpose Deep Learning Acceleration

  • Yunhung Gao
  • Hongyan Li
  • Kevin Zhang
  • Xueru Yu
  • Hang Lu

Vanilla Deep Neural Networks (DNNs) after training are represented with native floating-point 32 (fp32) weights. We observe that the bit-level sparsity of these weights is very abundant in the mantissa and can be directly exploited to speed up model inference. In this paper, we propose Mortar, an off-line/on-line collaborated approach for fp32 DNN acceleration, which includes two parts: first, an off-line bit sparsification algorithm to construct the target formulation by “mantissa morphing”, which maintains higher model accuracy while increasing bit-level sparsity; second, the associated hardware accelerator architecture to speed up the on-line fp32 inference by exploiting the enlarged bit sparsity. We highlight the following results from evaluating various deep learning tasks, including image classification, object detection, video understanding, video & image super-resolution, etc. We (1) increase bit-level sparsity by up to 1.28~2.51x with only a negligible -0.09~0.23% accuracy loss, (2) maintain on average 3.55% higher model accuracy while increasing bit-level sparsity more than the baseline, and (3) outperform the baseline by up to 4.8x with our hardware accelerator, which has an area of 0.031 mm² and power of 68.58 mW.
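As a rough illustration of what bit-level sparsity in the fp32 mantissa means, the sketch below extracts the 23 mantissa bits of each weight and reports the fraction that are zero. This is only a measurement helper under assumed definitions (sparsity taken as the share of zero mantissa bits); it is not the Mortar morphing algorithm, and the function names are hypothetical.

```python
import struct

MANTISSA_BITS = 23  # IEEE 754 single precision: 1 sign, 8 exponent, 23 mantissa bits

def fp32_mantissa(value):
    """Return the 23-bit mantissa field of value when stored as fp32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits & ((1 << MANTISSA_BITS) - 1)

def mantissa_zero_fraction(weights):
    """Fraction of mantissa bits equal to zero across all weights (assumed sparsity metric)."""
    zero_bits = 0
    for w in weights:
        ones = bin(fp32_mantissa(w)).count("1")
        zero_bits += MANTISSA_BITS - ones
    return zero_bits / (MANTISSA_BITS * len(weights))

if __name__ == "__main__":
    # 1.0 has an all-zero mantissa; 1.5 sets only the top mantissa bit.
    print(mantissa_zero_fraction([1.0, 1.5, 0.1]))
```

A weight whose mantissa has more zero bits needs fewer bit-serial operations in hardware that skips zeros, which is the kind of opportunity the abstract describes.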

Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices

  • Yimeng Zhang
  • Akshay Karkal Kamath
  • Qiucheng Wu
  • Zhiwen Fan
  • Wuyang Chen
  • Zhangyang Wang
  • Shiyu Chang
  • Sijia Liu
  • Cong Hao

In this paper, we propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on High-Definition (HD) video stream. First, to enable ultra-light video intelligence, we propose temporal frame-filtering and spatial saliency-focusing approaches to reduce the complexity of massive video data. Second, we exploit structure-aware weight sparsity to design a hardware-friendly model compression method. Third, assisted with data and model complexity reduction, we propose a sparsity-aware, scalable, and low-power accelerator design, aiming to deliver real-time performance with high energy efficiency. Different from existing works, we make a solid step towards the synergized software/hardware co-optimization for realistic MOT model implementation. Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5× latency reduction, 20.9× effective frame rate improvement, 5.83× lower power, and 9.78× better energy efficiency, without much accuracy drop.

Latent Weight-Based Pruning for Small Binary Neural Networks

  • Tianen Chen
  • Noah Anderson
  • Younghyun Kim

Binary neural networks (BNNs) substitute complex arithmetic operations with simple bit-wise operations. The binarized weights and activations in BNNs can drastically reduce memory requirements and energy consumption, making them attractive for edge ML applications with limited resources. However, the severe memory capacity and energy constraints of low-power edge devices call for further reduction of BNN models beyond binarization. Weight pruning is a proven solution for reducing the size of many neural network (NN) models, but the binary nature of BNN weights makes it difficult to identify insignificant weights to remove.

In this paper, we present a pruning method based on latent weights with layer-level pruning sensitivity analysis, which reduces the over-parameterization of BNNs, allowing for accuracy gains while drastically reducing the model size. Our method advocates for a heuristic that distinguishes weights by their latent weights, a real-valued vector used to compute the pseudogradient during backpropagation. It is tested using three different convolutional NNs on the MNIST, CIFAR-10, and Imagenette datasets, with results indicating a 33%–46% reduction in operation count with no accuracy loss, improving upon previous works in accuracy, model size, and total operation count.

SESSION: Technical Program: Design Automation for Emerging Devices

AutoFlex: Unified Evaluation and Design Framework for Flexible Hybrid Electronics

  • Tianliang Ma
  • Zhihui Deng
  • Leilai Shao

Flexible hybrid electronics (FHE), integrating high performance silicon chips with multi-functional sensors and actuators on flexible substrates, can be intimately attached onto irregular surfaces without compromising their functionalities, thus enabling more innovations in healthcare, internet of things (IoTs) and various human-machine interfaces (HMIs). Recent developments on compact models and process design kits (PDKs) of flexible electronics have made designs of small to medium flexible circuits feasible. However, the absence of a unified model and comprehensive evaluation benchmarks for flexible electronics makes it infeasible for a designer to fairly compare different flexible technologies and to explore potential design options for a heterogeneous FHE design. In this paper, we present AutoFlex, a unified evaluation and design framework for flexible hybrid electronics, where device parameters can be extracted automatically and performance can be evaluated comprehensively from device levels, digital blocks to large-scale digital circuits. Moreover, a ubiquitous FHE sensor acquisition system, including a flexible multi-functional sensor array, scan drivers, amplifiers and a silicon based analog-to-digital converter (ADC), is developed to reveal the design challenges of a representative FHE system.

CNFET7: An Open Source Cell Library for 7-nm CNFET Technology

  • Chenlin Shi
  • Shinobu Miwa
  • Tongxin Yang
  • Ryota Shioya
  • Hayato Yamaki
  • Hiroki Honda

In this paper, we propose CNFET7, the first open-source cell library for 7-nm carbon nanotube field-effect transistor (CNFET) technology. CNFET7 is based on an open-source CNFET SPICE model called VS-CNFET, and various model parameters such as the channel width and carbon nanotube diameter are carefully tuned to mimic the predictive 7-nm CNFET technology presented in a published paper. Some undisclosed parameters, such as the cell size and pin layout, are derived from those of the NanGate 15-nm open-source cell library in the same way as for an open-source framework for CNFET circuit design. CNFET7 includes two types of delay model (i.e., the composite current source and nonlinear delay models), each having 56 cells, such as INV_X1 and BUF_X1. CNFET7 supports both logic synthesis and timing-driven place and route in the Cadence design flow. Our experimental results for several synthesized circuits show that CNFET7 achieves reductions of up to 96%, 62% and 82% in dynamic power consumption, static power consumption and critical-path delay, respectively, when compared with ASAP7.

A Global Optimization Algorithm for Buffer and Splitter Insertion in Adiabatic Quantum-Flux-Parametron Circuits

  • Rongliang Fu
  • Mengmeng Wang
  • Yirong Kan
  • Nobuyuki Yoshikawa
  • Tsung-Yi Ho
  • Olivia Chen

As a highly energy-efficient application of low-temperature superconductivity, the adiabatic quantum-flux-parametron (AQFP) logic circuit has extremely low power consumption, making it an attractive candidate for extremely energy-efficient computing systems. Since logic gates are driven by the alternating current (AC) serving as the clock signal in AQFP circuits, plenty of AQFP buffers are required to ensure that the dataflow is synchronized at all logic levels of the circuit. Meanwhile, since the currently developed AQFP logic gates can only drive a single output, splitters are required by logic gates to drive multiple fan-outs. These gates take up a significant amount of the circuit’s area and delay. This paper proposes a global optimization algorithm for buffer and splitter (B/S) insertion to address the issues above. The B/S insertion is first identified as a combinatorial optimization problem, and a dynamic programming formulation is presented to find the global optimal solution. Because its search space is impractically large, an integer linear programming formulation is proposed to explore the global optimization of B/S insertion approximately. Experimental results on the ISCAS’85 and simple arithmetic benchmark circuits show the effectiveness of the proposed method, with an average reduction of 8.22% and 7.37% in the number of buffers and splitters inserted compared to the state-of-the-art methods from ICCAD’21 and DAC’22, respectively.

FLOW-3D: Flow-Based Computing on 3D Nanoscale Crossbars with Minimal Semiperimeter

  • Sven Thijssen
  • Sumit Kumar Jha
  • Rickard Ewetz

The emergence of data-intensive applications has spurred interest in in-memory computing using nanoscale crossbars. Flow-based in-memory computing is a promising approach for evaluating Boolean logic using the natural flow of electrical currents. While automated synthesis approaches have been developed for 2D crossbars, 3D crossbars have advantageous properties in terms of density, area, and performance. In this paper, we propose the first framework for performing flow-based computing using 3D crossbars. The framework, FLOW-3D, automatically synthesizes a Boolean function into a crossbar design. FLOW-3D is based on an analogy between BDDs and crossbars, resulting in the synthesis of 3D crossbar designs with minimal semiperimeter. A BDD with n nodes is mapped to a 3D crossbar with (n + k) metal wires. The k extra metal wires are needed to handle hardware-imposed constraints. Compared with the state-of-the-art synthesis tool for 2D crossbars, FLOW-3D improves semiperimeter, area, energy consumption, and latency by up to 61%, 84%, 37%, and 41%, respectively, on 15 RevLib benchmarks.


SLIP ’22: Proceedings of the 24th ACM/IEEE Workshop on System Level Interconnect Pathfinding

 Full Citation in the ACM Digital Library

SESSION: Breaking the Interconnect Limits

Session details: Breaking the Interconnect Limits

  • Ismail Bustany

Multi-Die Heterogeneous FPGAs: How Balanced Should Netlist Partitioning be?

  • Raveena Raikar
  • Dirk Stroobandt

High-capacity multi-die FPGA systems generally consist of multiple dies connected by external interposer lines. These external connections are limited in number. Further, these connections also contribute to a higher delay as compared to the internal network on a monolithic FPGA and should therefore be used sparingly. These architectural changes compel the placement & routing tools to minimize the number of signals at the die boundary. Incorporating a netlist partitioning step in the CAD flow can help to minimize the overall number of signals using the cross-die connections.

Conventional partitioning techniques focus on minimizing the cut edges at the cost of generating unequal-sized partitions. Such highly unbalanced partitions can affect the overall placement & routing quality by causing congestion on the denser die. Moreover, this can also negatively impact the overall runtime of the placement & routing tools as well as the FPGA resource utilization.

In previous studies, a low value of the unbalance was proposed to generate equal-sized partitions. In this work, we investigate the factors that influence the netlist partitioning quality for a multi-die FPGA system. A die-level partitioning step, performed using hMETIS, is incorporated into the flow before the packing step. Large heterogeneous circuits from the Koios benchmark suite are used to analyze the partitioning-packing results. We then examine how the output unbalance and the number of cut edges vary with the input value of unbalance. We propose an empirical optimal parametric value of the unbalance factor for achieving the desired partitioning quality for the Koios benchmark suite.

Limiting Interconnect Heating in Power-Driven Physical Synthesis

  • Xiuyan Zhang
  • Shantanu Dutt

Current technology trends in VLSI chips include sub-10 nm nodes and 3D ICs. Unfortunately, due to significantly increased Joule heating in these technologies, interconnect reliability has become a significant casualty. In this paper, we explore how interconnect power dissipation (of CV²/2 per logic transition) and thus heating can be effectively constrained during a power-optimizing physical synthesis (PS) flow that applies three different PS transformations: cell sizing, Vth assignment and cell replication; the latter is particularly useful for limiting interconnect heating. Other constraints considered are timing, slew and cell fanout load. To address this multi-constraint power-optimization problem effectively, we consider the application of the aforementioned three transforms simultaneously (as opposed to sequentially in some order) as well as simultaneously across all cells of the circuit using a novel discrete optimization technique called discretized network flow (DNF). We applied our algorithm to ISPD-13 benchmark circuits: the ISPD-13 competition was for power optimization for cell-sizing and Vth assignment transforms under timing, slew and cell fanout load constraints; to these we added the interconnect heating constraint and the cell replication transform, a much harder transform to engineer in a simultaneous-consideration framework than the other two. Results show the significant efficacy of our techniques.
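The per-transition dissipation of C·V²/2 mentioned above can be turned into a small worked calculation. The sketch below is only illustrative: the average-power expression (activity × frequency × energy per transition) is a common textbook model assumed here rather than taken from the paper, and the function names and example values are hypothetical.

```python
def transition_energy(capacitance_f, vdd_v):
    """Energy dissipated per logic transition on a net: C * V^2 / 2, in joules."""
    return 0.5 * capacitance_f * vdd_v ** 2

def average_switching_power(capacitance_f, vdd_v, freq_hz, activity):
    """Assumed model: activity = average number of transitions per clock cycle."""
    return activity * freq_hz * transition_energy(capacitance_f, vdd_v)

if __name__ == "__main__":
    # Hypothetical net: 2 fF of wire capacitance, 0.8 V supply, 1 GHz clock, 10% activity.
    energy = transition_energy(2e-15, 0.8)
    power = average_switching_power(2e-15, 0.8, 1e9, 0.1)
    print(f"energy per transition: {energy:.3e} J, average power: {power:.3e} W")
```

Under this model, halving the capacitance a driver sees halves the dissipation on that net, which gives some intuition for why a transform that splits a net's load across copies of a cell could help limit interconnect heating.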

SESSION: 2.5D/3D Extension for High-Performance Computing

Session details: 2.5D/3D Extension for High-Performance Computing

  • Pascal Vivet

Opportunities of Chip Power Integrity and Performance Improvement through Wafer Backside (BS) Connection: Invited Paper

  • Rongmei Chen
  • Giuliano Sisto
  • Odysseas Zografos
  • Dragomir Milojevic
  • Pieter Weckx
  • Geert Van der Plas
  • Eric Beyne

Technology node scaling is driven by the need to increase system performance, but it also leads to a significant power integrity bottleneck, due to the associated back-end-of-line (BEOL) scaling. Power integrity degradation induced by on-chip Power Delivery Network (PDN) IR drop is a result of increased power density and number of metal layers in the BEOL and their resistivity. Meanwhile, signal routing limits the SoC performance improvements due to increased routing congestion and delays. To overcome these issues, we introduce a disruptive technology: wafer backside (BS) connection to realize chip BS PDN (BSPDN) and BS signal routing. We first describe some key wafer process features that were developed at imec to enable this technology. Further, we show the benefits of this technology by demonstrating a large improvement in chip power integrity and performance after applying it to BSPDN and BS routing with a sub-2nm technology node design rule. Challenges and the outlook of the BS technology are also discussed before the paper concludes.

SESSION: Compute-in-Memory and Design of Structured Compute Arrays

Session details: Compute-in-Memory and Design of Structured Compute Arrays

  • Shantanu Dutt

An Automated Design Methodology for Computational SRAM Dedicated to Highly Data-Centric Applications: Invited Paper

  • A. Philippe
  • L. Ciampolini
  • A. Philippe
  • M. Gerbaud
  • M. Ramirez-Corrales
  • V. Egloff
  • B. Giraud
  • J.-P. Noel

To meet the performance requirements of highly data-centric applications (e.g. edge-AI or lattice-based cryptography), Computational SRAM (C-SRAM), a new type of computational memory, was designed as a key element of an emerging computing paradigm called near-memory computing. For this type of application, C-SRAM has been specialized to perform low-latency vector operations in order to limit energy-intensive data transfers with the processor or dedicated processing units. This paper presents a design methodology that aims at making the C-SRAM design flow as simple as possible by automating the configuration of the memory part (e.g. number of SRAM cuts and access ports) according to system constraints (e.g. instruction frequency or memory capacity) and off-the-shelf SRAM compilers. In order to fairly quantify the benefits of the proposed memory selector, it has been evaluated with three different CMOS process technologies from two different foundries. The results show that this memory selection methodology makes it possible to determine the best memory configuration whatever the CMOS process technology and the trade-off between area and power consumption. Furthermore, we also show how this methodology could be used to efficiently assess the level of design optimization of available SRAM compilers in a targeted CMOS process technology.

A Machine Learning Approach for Accelerating SimPL-Based Global Placement for FPGA’s

  • Tianyi Yu
  • Nima Karimpour Darav
  • Ismail Bustany
  • Mehrdad Eslami Dehkordi

Many commercial FPGA placement tools are based on the SimPL framework, where the Lower Bound (LB) phase optimizes wire length and timing without considering cell overlaps and the Upper Bound (UB) phase spreads out cells while considering the target FPGA architectures. In the SimPL framework, the number of iterations depends on design complexity and the quality of UB placement, which highly impacts runtime. In this work, we propose a machine learning (ML) scheme where the anchor weights of cells are dynamically adjusted to make the process converge within a pre-determined budget for the number of iterations. In our approach, for a given FPGA architecture, an ML model constructs a trajectory guide function that is used for adjusting anchor weights during SimPL’s iterations. Our experimental results on industrial benchmarks show that we can achieve on average 28.01% and 4.7% reductions in the runtime of Global Placement and of the whole placer, respectively, while maintaining the quality of solutions within an acceptable range.

SESSION: Interconnect Performance Estimation Techniques

Session details: Interconnect Performance Estimation Techniques

  • Rasit Topaloglu

Neural Network Model for Detour Net Prediction

  • Jaehoon Ahn
  • Taewhan Kim

Identifying nets in a placement that are very likely to be detoured during routing is useful in that (1) in conjunction with routing congestion, path timing, or design rule violation (DRV) prediction, detour net prediction can be used as a complementary means of characterizing the outcome of those predictions in more depth, and (2) we can place more importance on the predicted detour nets when optimizing timing and routing resources in the early stage of placement, since those nets consume more timing budget as well as metal/via resources. In this context, this work proposes a neural network based detour net prediction model. Our proposed model consists of two parts: CNN based and ANN based. The CNN based model processes the features describing various physical proximity maps or states, while the ANN based model processes the features of individual nets in the form of vector descriptions, concatenated to the CNN outputs. Through experiments, we analyze and assess the accuracy of our prediction model in terms of F1 score and its complementary role in timing prediction and optimization. More specifically, it is shown that our proposed model improves the prediction accuracy by 9.9% on average in comparison with that produced by the conventional (vanilla ANN based) detour net prediction model. Furthermore, linking our prediction model to a state-of-the-art timing optimization of a commercial tool reduces the worst negative slack by 18.4%, the total negative slack by 40.8%, and the number of timing violation paths by 30.9% on average.

Machine-Learning Based Delay Prediction for FPGA Technology Mapping

  • Hailiang Hu
  • Jiang Hu
  • Fan Zhang
  • Bing Tian
  • Ismail Bustany

Accurate delay prediction is important in the early stages of logic and high-level synthesis. In technology mapping for field programmable gate arrays (FPGAs), a gate-level circuit is transcribed into a lookup table (LUT)-level circuit. Quick timing analysis is necessary on a pre-mapped circuit to guide optimizations downstream. However, a static timing analyzer is too slow due to its complexity and, like other faster empirical heuristics, highly inaccurate before technology mapping. In this work, we present a machine learning based framework for accurately and efficiently estimating the delay of a gate-level circuit by predicting the depth of the corresponding LUT logic after technology mapping. Our experimental results show that the proposed method achieves a 56x accuracy improvement compared to the existing delay estimation heuristic. Compared with running the mapper to obtain the ground truth, our delay estimator saves 87.5% on runtime with negligible error.


ICCAD ’22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

 Full Citation in the ACM Digital Library

SESSION: The Role of Graph Neural Networks in Electronic Design Automation

Session details: The Role of Graph Neural Networks in Electronic Design Automation

  • Jeyavijayan Rajendran

Why are Graph Neural Networks Effective for EDA Problems?: (Invited Paper)

  • Haoxing Ren
  • Siddhartha Nath
  • Yanqing Zhang
  • Hao Chen
  • Mingjie Liu

In this paper, we discuss the source of effectiveness of Graph Neural Networks (GNNs) in EDA, particularly in the VLSI design automation domain. We argue that the effectiveness comes from the fact that GNNs implicitly embed the prior knowledge and inductive biases associated with given VLSI tasks, which is one of the three approaches to make a learning algorithm physics-informed. These inductive biases are different from those commonly used in GNNs designed for other structured data, such as social networks and citation networks. We will illustrate this principle with several recent GNN examples in the VLSI domain, including predictive tasks such as switching activity prediction, timing prediction, parasitics prediction, layout symmetry prediction, as well as optimization tasks such as gate sizing and macro and cell transistor placement. We will also discuss the challenges of applying GNNs and the opportunity of applying self-supervised learning techniques with GNNs for VLSI optimization.

On Advancing Physical Design Using Graph Neural Networks

  • Yi-Chen Lu
  • Sung Kyu Lim

As modern Physical Design (PD) algorithms and methodologies evolve into the post-Moore era with the aid of machine learning, Graph Neural Networks (GNNs) are becoming increasingly ubiquitous given that netlists are essentially graphs. Recently, their ability to perform effective graph learning has provided significant insights into the underlying dynamics of netlist-to-layout transformations. GNNs follow a message-passing scheme, where the goal is to construct meaningful representations at either the entire-graph or the node level by recursively aggregating and transforming the initial features. In the realm of PD, the GNN-learned representations have been leveraged to solve tasks such as cell clustering, quality-of-result prediction, and activity simulation, often overcoming the limitations of traditional PD algorithms. In this work, we first revisit recent advancements that GNNs have made in PD. Second, we discuss how GNNs serve as the backbone of novel PD flows. Finally, we present our thoughts on ongoing and future PD challenges that GNNs can successfully tackle.
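The message-passing scheme mentioned above can be illustrated with a minimal sketch of one aggregation round on a tiny graph. This is a generic, simplified illustration and not the authors' model: mean aggregation is an assumed choice, the `transform` argument stands in for the learned transformation of a real GNN layer, and the graph and feature values are hypothetical.

```python
def message_passing_round(features, edges, transform=lambda vec: vec):
    """One round: each node averages its own and its neighbors' features.

    `transform` is a placeholder for a learned layer (identity by default).
    """
    # Every node starts with a self-loop so its own features are kept
    neighbors = {node: [node] for node in features}
    for u, v in edges:  # undirected edges
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = {}
    for node, group in neighbors.items():
        dim = len(features[node])
        mean = [sum(features[n][i] for n in group) / len(group) for i in range(dim)]
        updated[node] = transform(mean)
    return updated


# Hypothetical three-cell netlist fragment with 2-dimensional node features
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
edges = [("a", "b"), ("b", "c")]
h1 = message_passing_round(features, edges)
h2 = message_passing_round(h1, edges)  # a second round reaches 2-hop neighbors
```

Applying the round repeatedly is what "recursively aggregating" refers to: after k rounds, a node's representation depends on its k-hop neighborhood.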

Applying GNNs to Timing Estimation at RTL

  • Daniela Sánchez Lopera
  • Wolfgang Ecker

In the Electronic Design Automation (EDA) flow, signoff checks, such as timing analysis, are performed only after physical synthesis. Encountered timing violations cause re-iterations of the design flow. Hence, timing estimations at initial design stages, such as Register Transfer Level (RTL), would increase the quality of the results and lower the flow iterations. Machine learning has been used to estimate the timing behavior of chip components. However, existing solutions map EDA objects to Euclidean data without considering that EDA objects are represented naturally as graphs. Recent advances in Graph Neural Networks (GNNs) motivate the mapping from EDA objects to graphs for design metric prediction tasks at different stages. This paper maps RTL designs to directed, featured graphs with multidimensional node and edge features. These are the input to GNNs for estimating component delays and slews. An in-house hardware generation framework and open-source EDA tools for ASIC synthesis are employed for collecting training data. Experiments over unseen circuits show that GNN-based models are promising for timing estimation, even when the features come from early RTL implementations. Based on estimated delays, critical areas of the design can be detected, and proper RTL micro-architectures can be chosen without running long design iterations.

Embracing Graph Neural Networks for Hardware Security

  • Lilas Alrahis
  • Satwik Patnaik
  • Muhammad Shafique
  • Ozgur Sinanoglu

Graph neural networks (GNNs) have attracted increasing attention due to their superior performance in deep learning on graph-structured data. GNNs have succeeded across various domains such as social networks, chemistry, and electronic design automation (EDA). Electronic circuits have a long history of being represented as graphs, and to no surprise, GNNs have demonstrated state-of-the-art performance in solving various EDA tasks. More importantly, GNNs are now employed to address several hardware security problems, such as detecting intellectual property (IP) piracy and hardware Trojans (HTs), to name a few.

In this survey, we first provide a comprehensive overview of the usage of GNNs in hardware security and propose the first taxonomy to divide the state-of-the-art GNN-based hardware security systems into four categories: (i) HT detection systems, (ii) IP piracy detection systems, (iii) reverse engineering platforms, and (iv) attacks on logic locking. We summarize the different architectures, graph types, node features, benchmark data sets, and model evaluation of the employed GNNs. Finally, we elaborate on the lessons learned and discuss future directions.

SESSION: Compiler and System-Level Techniques for Efficient Machine Learning

Session details: Compiler and System-Level Techniques for Efficient Machine Learning

  • Sri Parameswaran
  • Martin Rapp

Fine-Granular Computation and Data Layout Reorganization for Improving Locality

  • Mahmut Kandemir
  • Xulong Tang
  • Jagadish Kotra
  • Mustafa Karakoy

While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), one of the important characteristics of prior approaches is that they transform loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarse-grain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third and more importantly, such coarse-grain approaches are local by their nature and difficult to achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves the data locality and outperforms existing schemes – 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.

An MLIR-based Compiler Flow for System-Level Design and Hardware Acceleration

  • Nicolas Bohm Agostini
  • Serena Curzel
  • Vinay Amatya
  • Cheng Tan
  • Marco Minutoli
  • Vito Giovanni Castellana
  • Joseph Manzano
  • David Kaeli
  • Antonino Tumeo

The generation of custom hardware accelerators for applications implemented within high-level productive programming frameworks requires considerable manual effort. To automate this process, we introduce SODA-OPT, a compiler tool that extends the MLIR infrastructure. SODA-OPT automatically searches, outlines, tiles, and pre-optimizes relevant code regions to generate high-quality accelerators through high-level synthesis. SODA-OPT can support any high-level programming framework and domain-specific language that interfaces with the MLIR infrastructure. By leveraging MLIR, SODA-OPT solves compiler optimization problems with specialized abstractions. Backend synthesis tools connect to SODA-OPT through progressive intermediate representation lowerings. SODA-OPT interfaces to a design space exploration engine to identify the combination of compiler optimization passes and options that provides high-performance generated designs for different backends and targets. We demonstrate the practical applicability of the compilation flow by exploring the automatic generation of accelerators for deep neural network operators outlined at arbitrary granularity and by combining outlining with tiling on large convolution layers. Experimental results with kernels from the PolyBench benchmark show that our high-level optimizations improve execution delays of synthesized accelerators by up to 60x. We also show that for the selected kernels, our solution outperforms the current state of the art in more than 70% of the benchmarks and provides better average speedup in 55% of them. SODA-OPT is an open source project available at

Physics-Aware Differentiable Discrete Codesign for Diffractive Optical Neural Networks

  • Yingjie Li
  • Ruiyang Chen
  • Weilu Gao
  • Cunxi Yu

Diffractive optical neural networks (DONNs) have attracted considerable attention as they bring significant advantages in terms of power efficiency, parallelism, and computational speed compared with conventional deep neural networks (DNNs), which have intrinsic limitations when implemented on digital platforms. However, inversely mapping algorithm-trained physical model parameters onto real-world optical devices with discrete values is a non-trivial task as existing optical devices have non-unified discrete levels and non-monotonic properties. This work proposes a novel device-to-system hardware-software codesign framework, which enables efficient physics-aware training of DONNs with respect to arbitrary experimentally measured optical devices across layers. Specifically, Gumbel-Softmax is employed to enable differentiable discrete mapping from real-world device parameters into the forward function of DONNs, where the physical parameters in DONNs can be trained by simply minimizing the loss function of the ML task. The results demonstrate that our proposed framework offers significant advantages over conventional quantization-based methods, especially with low-precision optical devices. Finally, the proposed algorithm is fully verified with physical experimental optical systems in low-precision settings.
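To make the Gumbel-Softmax step more concrete, here is a minimal standard-library sketch of the generic relaxation: Gumbel noise is added to per-level scores, and a temperature-scaled softmax yields a "soft" selection over discrete device levels. It is not the authors' implementation. The level values, logits, and temperature are hypothetical, and in practice this would be written in an automatic-differentiation framework so that gradients flow back to the logits.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    scaled = []
    for logit in logits:
        u = max(rng.random(), 1e-12)       # keep u > 0 so log(u) is defined
        gumbel = -math.log(-math.log(u))   # standard Gumbel(0, 1) sample
        scaled.append((logit + gumbel) / temperature)
    peak = max(scaled)                     # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical non-uniform discrete phase levels of a measured optical device
levels = [0.0, 0.7, 1.9, 3.1]
# Hypothetical trainable scores, one per discrete level
logits = [0.1, 2.0, -1.0, 0.5]

weights = gumbel_softmax(logits, temperature=0.5, rng=random.Random(0))
# Soft device value used in the forward pass: a convex combination of the levels
soft_phase = sum(w * level for w, level in zip(weights, levels))
```

Lower temperatures push `weights` toward a one-hot vector, so the soft value approaches one of the actual device levels while the mapping stays differentiable with respect to the logits.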

Big-Little Chiplets for In-Memory Acceleration of DNNs: A Scalable Heterogeneous Architecture

  • Gokul Krishnan
  • A. Alper Goksoy
  • Sumit K. Mandal
  • Zhenyu Wang
  • Chaitali Chakrabarti
  • Jae-sun Seo
  • Umit Y. Ogras
  • Yu Cao

Monolithic in-memory computing (IMC) architectures face significant yield and fabrication cost challenges as the complexity of DNNs increases. Chiplet-based IMCs that integrate multiple dies with advanced 2.5D/3D packaging offer a low-cost and scalable solution. They enable heterogeneous architectures where the chiplets and their associated interconnection can be tailored to the non-uniform algorithmic structures to maximize IMC utilization and reduce energy consumption. This paper proposes a heterogeneous IMC architecture with big-little chiplets and a hybrid network-on-package (NoP) to optimize the utilization, interconnect bandwidth, and energy efficiency. For a given DNN, we develop a custom methodology to map the model onto the big-little architecture such that the early layers in the DNN are mapped to the little chiplets with higher NoP bandwidth and the subsequent layers are mapped to the big chiplets with lower NoP bandwidth. Furthermore, we achieve a scalable solution by incorporating a DRAM into each chiplet to support a wide range of DNNs beyond the area limit. Compared to a homogeneous chiplet-based IMC architecture, the proposed big-little architecture achieves up to 329× improvement in the energy-delay-area product (EDAP) and up to 2× higher IMC utilization. Experimental evaluation of the proposed big-little chiplet-based RRAM IMC architecture for ResNet-50 on ImageNet shows 259×, 139×, and 48× improvement in energy-efficiency at lower area compared to Nvidia V100 GPU, Nvidia T4 GPU, and SIMBA architecture, respectively.

SESSION: Addressing Sensor Security through Hardware/Software Co-Design

Session details: Addressing Sensor Security through Hardware/Software Co-Design

  • Marilyn Wolf

Attacks on Image Sensors

  • Marilyn Wolf
  • Kruttidipta Samal

This paper provides a taxonomy of security vulnerabilities of smart image sensor systems. Image sensors form an important class of sensors. Many image sensors include computation units that can provide traditional algorithms such as image or video compression along with machine learning tasks such as classification. Some attacks rely on the physics and optics of imaging. Other attacks take advantage of the complex logic and software required to implement imaging systems.

False Data Injection Attacks on Sensor Systems

  • Dimitrios Serpanos

False data injection attacks on sensor systems are an emerging threat to cyberphysical systems, creating significant risks to all application domains and, importantly, to critical infrastructures. Cyberphysical systems are process-dependent leading to differing false data injection attacks that target disruption of the specific processes (plants). We present a taxonomy of false data injection attacks, using a general model for cyberphysical systems, showing that global and continuous attacks are extremely powerful. In order to detect false data injection attacks, we describe three methods that can be employed to enable effective monitoring and detection of false data injection attacks during plant operation. Considering that sensor failures have equivalent effects to relative false data injection attacks, the methods are effective for sensor fault detection as well.

Stochastic Mixed-Signal Circuit Design for In-Sensor Privacy

  • Ningyuan Cao
  • Jianbo Liu
  • Boyang Cheng
  • Muya Chang

The ubiquitous data acquisition and extensive data exchange of sensors pose severe security and privacy concerns for the end-users and the public. To enable real-time protection of raw data, it is necessary to support privacy-preserving algorithms at data generation, i.e., in-sensor privacy. However, due to the severe sensor resource constraints and intensive computation/security cost, it remains an open question how to enable data protection algorithms with efficient circuit techniques. To answer this question, this paper discusses the potential of a stochastic mixed-signal (SMS) circuit for ultra-low-power, small-footprint data security. In particular, this paper discusses digitally controlled oscillators (DCOs) and their advantages in (1) seamless analog interface, (2) stochastic computation efficiency, and (3) unified entropy generation over conventional digital circuit baselines. With the DCO as an illustrative case, we (1) target SMS privacy-preserving architecture definition and systematic SMS analysis of its performance gains across various hardware/software configurations, and (2) revisit analog/mixed-signal voltage/transistor scaling in the context of entropy-based data protection.

Sensor Security: Current Progress, Research Challenges, and Future Roadmap (Invited Paper)

  • Anomadarshi Barua
  • Mohammad Abdullah Al Faruque

Sensors are one of the most pervasive and integral components of today’s safety-critical systems. Sensors serve as a bridge between physical quantities and connected systems. The connected systems blindly trust their sensors, as there is no way to authenticate the signal coming from a sensor. This could be an entry point for an attacker. An attacker can inject a fake input signal along with the legitimate signal by using a suitable spoofing technique. As the sensor’s transducer is not smart enough to differentiate between a fake and a legitimate signal, the injected fake signal can eventually collapse the connected system. This type of attack is known as the transduction attack. Over the last decade, several works have been published to provide a defense against the transduction attack. However, the defenses are proposed on an ad-hoc basis; hence, they are not well-structured. Our work begins to fill this gap by providing a checklist that a defense technique should always follow to be considered an ideal defense against the transduction attack. We name this checklist the Golden reference of sensor defense. We provide insights on how this Golden reference can be achieved and argue that sensors should be redesigned from the transducer level to the sensor electronics level. We point out that hardware or software modification alone is not enough; instead, a hardware/software (HW/SW) co-design approach is required to follow this future roadmap to robust and resilient sensors.

SESSION: Advances in Partitioning and Physical Optimization

Session details: Advances in Partitioning and Physical Optimization

  • Markus Olbrich
  • Yu-Guang Chen

SpecPart: A Supervised Spectral Framework for Hypergraph Partitioning Solution Improvement

  • Ismail Bustany
  • Andrew B. Kahng
  • Ioannis Koutis
  • Bodhisatta Pramanik
  • Zhiang Wang

State-of-the-art hypergraph partitioners follow the multilevel paradigm that constructs multiple levels of progressively coarser hypergraphs that are used to drive cut refinements on each level of the hierarchy. Multilevel partitioners are subject to two limitations: (i) Hypergraph coarsening processes rely on local neighborhood structure without fully considering the global structure of the hypergraph. (ii) Refinement heuristics can stagnate on local minima. In this paper, we describe SpecPart, the first supervised spectral framework that directly tackles these two limitations. SpecPart solves a generalized eigenvalue problem that captures the balanced partitioning objective and global hypergraph structure in a low-dimensional vertex embedding while leveraging initial high-quality solutions from multilevel partitioners as hints. SpecPart further constructs a family of trees from the vertex embedding and partitions them with a tree-sweeping algorithm. Then, a novel overlay of multiple tree-based partitioning solutions is followed by lifting to a coarsened hypergraph, where an ILP partitioning instance is solved to alleviate local stagnation. We have validated SpecPart on multiple sets of benchmarks. Experimental results show that for some benchmarks, SpecPart can improve the cutsize by more than 50% with respect to the best published solutions obtained with leading partitioners hMETIS and KaHyPar.
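For readers unfamiliar with the cutsize metric reported above, the following minimal sketch counts the hyperedges that span more than one block of a partition. It assumes unweighted hyperedges (real benchmarks may carry hyperedge weights), and the hypergraph and block assignment are hypothetical examples, not data from the paper.

```python
def cutsize(hyperedges, block_of):
    """Number of hyperedges whose vertices lie in more than one block.

    Assumes unit hyperedge weights; a weighted variant would sum weights instead.
    """
    return sum(
        1
        for edge in hyperedges
        if len({block_of[vertex] for vertex in edge}) > 1
    )


# Hypothetical 5-vertex hypergraph and a 2-way partition
hyperedges = [("a", "b", "c"), ("c", "d"), ("d", "e"), ("a", "e")]
block_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}

cut = cutsize(hyperedges, block_of)  # ("c", "d") and ("a", "e") are cut
```

A partitioner minimizes this count subject to a balance constraint on block sizes, which this sketch does not check.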

HyperEF: Spectral Hypergraph Coarsening by Effective-Resistance Clustering

  • Ali Aghdaei
  • Zhuo Feng

This paper introduces a scalable algorithmic framework (HyperEF) for spectral coarsening (decomposition) of large-scale hypergraphs by exploiting hyperedge effective resistances. Motivated by the latest theoretical framework for low-resistance-diameter decomposition of simple graphs, HyperEF aims at decomposing large hypergraphs into multiple node clusters with only a few inter-cluster hyperedges. The key component in HyperEF is a nearly-linear time algorithm for estimating hyperedge effective resistances, which allows incorporating the latest diffusion-based non-linear quadratic operators defined on hypergraphs. To achieve good runtime scalability, HyperEF searches within the Krylov subspace (or approximate eigensubspace) for identifying the nearly-optimal vectors for approximating the hyperedge effective resistances. In addition, a node weight propagation scheme for multilevel spectral hypergraph decomposition has been introduced for achieving even greater node coarsening ratios. When compared with state-of-the-art hypergraph partitioning (clustering) methods, extensive experimental results on real-world VLSI designs show that HyperEF can more effectively coarsen (decompose) hypergraphs without losing key structural (spectral) properties of the original hypergraphs, while achieving over 70× runtime speedups over hMetis and 20× speedups over HyperSF.

Design and Technology Co-Optimization Utilizing Multi-Bit Flip-Flop Cells

  • Soomin Kim
  • Taewhan Kim

The benefit of a multi-bit flip-flop (MBFF) as opposed to a single-bit flip-flop is the sharing of in-cell clock inverters among the master and slave latches in the internal flip-flops of the MBFF. Theoretically, the more flip-flops an MBFF has, the more power saving it can achieve. However, in practice, physically increasing the size of an MBFF to accommodate many flip-flops imposes two new challenging problems in physical design: (1) non-flexible MBFF cell flipping for multiple D-to-Q signals and (2) unbalanced or wasted use of MBFF footprint space. In this work, we solve the two problems in a way that enhances routability and timing at the placement and routing stages. Precisely, for problem 1, we make the non-flexible MBFF cell flipping fully flexible by generating MBFF layouts supporting diverse D-to-Q flow directions in the detailed placement to improve routability, and for problem 2, we enhance the setup and clock-to-Q delay on timing-critical flip-flops in the MBFF through gate upsizing (i.e., transistor folding), using the unused space in the MBFF to improve timing slack at the post-routing stage. Through experiments with benchmark circuits, it is shown that our proposed design and technology co-optimization (DTCO) flow using MBFFs that solves problems 1 and 2 is very promising.

Transitive Closure Graph-Based Warpage-Aware Floorplanning for Package Designs

  • Yang Hsu
  • Min-Hsuan Chung
  • Yao-Wen Chang
  • Ci-Hong Lin

In modern heterogeneous integration technologies, chips with different processes and functionality are integrated into a package with high interconnection density and large I/O counts. Integrating multiple chips into a package may suffer from severe warpage problems caused by the mismatch in coefficients of thermal expansion between different manufacturing materials, leading to deformation and malfunction in the manufactured package. The industry is eager to find a solution for warpage optimization. This paper proposes the first warpage-aware floorplanning algorithm for heterogeneous integration. We first present an efficient qualitative warpage model for a multi-chip package structure based on Suhir’s solution, which is more suitable for optimization than the time-consuming finite element analysis. Based on the transitive closure graph floorplan representation, we then propose three perturbations for simulated annealing that optimize the warpage more directly and can thus speed up the process. Finally, we develop a force-directed detailed floorplanning algorithm to further refine the solutions by utilizing the dead spaces. Experimental results demonstrate the effectiveness of our warpage model and algorithm.

SESSION: Democratizing Design Automation with Open-Source Tools: Perspectives, Opportunities, and Challenges

Session details: Democratizing Design Automation with Open-Source Tools: Perspectives, Opportunities, and Challenges

  • Antonino Tumeo

A Mixed Open-Source and Proprietary EDA Commons for Education and Prototyping

  • Andrew B. Kahng

In recent years, several open-source projects have shown potential to serve a future technology commons for EDA and design prototyping. This paper examines how open-source and proprietary EDA technologies will inevitably take on complementary roles within a future technology commons. Proprietary EDA technologies offer numerous benefits that will endure, including (i) exceptional technology and engineering; (ii) ever-increasing importance in design-based equivalent scaling and the overall semiconductor value chain; and (iii) well-established commercial and partner relationships. On the other hand, proprietary EDA technologies face challenges that will also endure, including (i) inability to pursue directions such as massive leverage of cloud compute, extreme reduction of turnaround times, or “free tools”; and (ii) difficulty in evolving and addressing new applications and markets. By contrast, open-source EDA technologies offer benefits that include (i) the capability to serve as a friction-free, democratized platform for education and future workforce development (i.e., as a platform for EDA research, and as a means of teaching / training both designers and EDA developers with public code); and (ii) addressing the needs of underserved, non-enterprise account markets (e.g., older nodes, research flows, cost-sensitive IoT, new devices and integrations, system-design-technology pathfinding). This said, open-source will always face challenges such as sustainability, governance, and how to achieve critical mass and critical quality. The paper will conclude with key directions and synergies for open-source and proprietary EDA within an EDA Commons for education and prototyping.

SODA Synthesizer: An Open-Source, Multi-Level, Modular, Extensible Compiler from High-Level Frameworks to Silicon

  • Nicolas Bohm Agostini
  • Ankur Limaye
  • Marco Minutoli
  • Vito Giovanni Castellana
  • Joseph Manzano
  • Antonino Tumeo
  • Serena Curzel
  • Fabrizio Ferrandi

The SODA Synthesizer is an open-source, modular, end-to-end hardware compiler framework. The SODA frontend, developed in MLIR, performs system-level design, code partitioning, and high-level optimizations to prepare the specifications for the hardware synthesis. The backend is based on a state-of-the-art high-level synthesis tool and generates the final hardware design. The backend can interface with logic synthesis tools for field programmable gate arrays or with commercial and open-source logic synthesis tools for application-specific integrated circuits. We discuss the opportunities and challenges in integrating with commercial and open-source tools both at the frontend and backend, and highlight the role that an end-to-end compiler framework like SODA can play in an open-source hardware design ecosystem.

A Scalable Methodology for Agile Chip Development with Open-Source Hardware Components

  • Maico Cassel dos Santos
  • Tianyu Jia
  • Martin Cochet
  • Karthik Swaminathan
  • Joseph Zuckerman
  • Paolo Mantovani
  • Davide Giri
  • Jeff Jun Zhang
  • Erik Jens Loscalzo
  • Gabriele Tombesi
  • Kevin Tien
  • Nandhini Chandramoorthy
  • John-David Wellman
  • David Brooks
  • Gu-Yeon Wei
  • Kenneth Shepard
  • Luca P. Carloni
  • Pradip Bose

We present a scalable methodology for the agile physical design of tile-based heterogeneous system-on-chip (SoC) architectures that simplifies the reuse and integration of open-source hardware components. The methodology leverages the regularity of the on-chip communication infrastructure, which is based on a multi-plane network-on-chip (NoC), and the modularity of socket interfaces, which connect the tiles to the NoC. Each socket also provides its tile with a set of platform services, including independent clocking and voltage control. As a result, the physical design of each tile can be decoupled from its location in the top-level floorplan of the SoC and the overall SoC design can benefit from a hierarchical timing-closure flow, design reuse and, if necessary, fast respin. With the proposed methodology we completed two SoC tapeouts of increasing complexity, which illustrate its capabilities and the resulting gains in terms of design productivity.

SESSION: Accelerators on A New Horizon

Session details: Accelerators on A New Horizon

  • Vaibhav Verma
  • Georgios Zervakis

GraphRC: Accelerating Graph Processing on Dual-Addressing Memory with Vertex Merging

  • Wei Cheng
  • Chun-Feng Wu
  • Yuan-Hao Chang
  • Ing-Chao Lin

Architectural innovation in graph accelerators attracts research attention due to foreseeable inflation in data sizes and the irregular memory access pattern of graph algorithms. Conventional graph accelerators ignore the potential of Non-Volatile Memory (NVM) crossbar as a dual-addressing memory and treat it as a traditional single-addressing memory with higher density and better energy efficiency. In this work, we present GraphRC, a graph accelerator that leverages the power of dual-addressing memory by mapping in-edge/out-edge requests to column/row-oriented memory accesses. Although the capability of dual-addressing memory greatly improves the performance of graph processing, some memory accesses still suffer from low-utilization issues. Therefore, we propose a vertex merging (VM) method that improves cache block utilization rate by merging memory requests from consecutive vertices. VM reduces the execution time of all 6 graph algorithms on all 4 datasets by 24.24% on average. We then identify that the data dependency inherent in a graph limits the usage of VM, and that its effectiveness is bounded by the percentage of mergeable vertices. To overcome this limitation, we propose an aggressive vertex merging (AVM) method that outperforms VM by ignoring the data dependency inherent in a graph. AVM significantly reduces the execution time of ranking-based algorithms on all 4 datasets while preserving the correct ranking of the top 20 vertices.

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

  • Matheus Cavalcante
  • Domenic Wüthrich
  • Matteo Perotti
  • Samuel Riedel
  • Luca Benini

While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PE should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve the Instruction Level Parallelism (ILP), which increases their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz’ performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. Those results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.

Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems with FPGA Accelerators

  • Edward Richter
  • Deming Chen

While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. However, studying SVM implementations is difficult, as there is no open and flexible system to explore trade-offs between different SVM implementations and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system which enables thorough study of SVM in heterogeneous computing environments for discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis on the overheads of three SVM architectures, and generate several insights which highlight the cost and benefits of each architecture. From these insights, we propose a flowchart of how to choose the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.

ReSiPI: A Reconfigurable Silicon-Photonic 2.5D Chiplet Network with PCMs for Energy-Efficient Interposer Communication

  • Ebadollah Taheri
  • Sudeep Pasricha
  • Mahdi Nikdast

2.5D chiplet systems have been proposed to improve the low manufacturing yield of large-scale chips. However, connecting the chiplets through an electronic interposer imposes a high traffic load on the interposer network. Silicon photonics technology has shown great promise towards handling a high volume of traffic with low latency in intra-chip network-on-chip (NoC) fabrics. Although recent advances in silicon photonic devices have extended photonic NoCs to enable high bandwidth communication in 2.5D chiplet systems, such interposer-based photonic networks still suffer from high power consumption. In this work, we design and analyze a novel Reconfigurable power-efficient and congestion-aware Silicon-Photonic 2.5D Interposer network, called ReSiPI. Considering runtime traffic, ReSiPI is able to dynamically deploy inter-chiplet photonic gateways to alleviate overall network congestion. ReSiPI also employs switching elements based on phase change materials (PCMs) to dynamically reconfigure and power-gate the photonic interposer network, thereby improving the network power efficiency. Compared to the best prior state-of-the-art 2.5D photonic network, ReSiPI demonstrates, on average, 37% lower latency, 25% lower power, and 53% lower energy in the network.

SESSION: CAD for Confidentiality of Hardware IPs

Session details: CAD for Confidentiality of Hardware IPs

  • Swarup Bhunia

Hardware IP Protection against Confidentiality Attacks and Evolving Role of CAD Tool

  • Swarup Bhunia
  • Amitabh Das
  • Saverio Fazzari
  • Vivian Kammler
  • David Kehlet
  • Jeyavijayan Rajendran
  • Ankur Srivastava

With growing use of hardware intellectual property (IP) based integrated circuits (IC) design and increasing reliance on a globalized supply chain, the threats to confidentiality of hardware IPs have emerged as major security concerns to the IP producers and owners. These threats are diverse, including reverse engineering (RE), piracy, cloning, and extraction of design secrets, and span different phases of electronics life cycle. The academic research community and the semiconductor industry have made significant efforts over the past decade on developing effective methodologies and CAD tools targeted to protect hardware IPs against these threats. These solutions include watermarking, logic locking, obfuscation, camouflaging, split manufacturing, and hardware redaction. This paper focuses on key topics on confidentiality of hardware IPs encompassing the major threats, protection approaches, security analysis, and metrics. It discusses the strengths and limitations of the major solutions in protecting hardware IPs against the confidentiality attacks, and future directions to address the limitations in the modern supply chain ecosystem.

SESSION: Analyzing Reliability, Defects and Patterning

Session details: Analyzing Reliability, Defects and Patterning

  • Gaurav Rajavendra Reddy
  • Kostas Adam

Pin Accessibility and Routing Congestion Aware DRC Hotspot Prediction Using Graph Neural Network and U-Net

  • Kyeonghyeon Baek
  • Hyunbum Park
  • Suwan Kim
  • Kyumyung Choi
  • Taewhan Kim

An accurate DRC (design rule check) hotspot prediction at the placement stage is essential in order to reduce a substantial amount of design time required for the iterations of placement and routing. It is known that for implementing chips with advanced technology nodes, (1) pin accessibility and (2) routing congestion are two major causes of DRVs (design rule violations). Though many ML (machine learning) techniques have been proposed to address this prediction problem, it has not been easy to assemble the aggregate data on items 1 and 2 in a unified fashion for training ML models, resulting in a considerable accuracy loss in DRC hotspot prediction. This work overcomes this limitation by proposing a novel ML-based DRC hotspot prediction technique, which is able to accurately capture the combined impact of items 1 and 2 on DRC hotspots. Precisely, we devise a graph, called pin proximity graph, that effectively models the spatial information on cell I/O pins and the information on pin-to-pin disturbance relation. Then, we propose a new ML model, called PGNN, which tightly combines GNN (graph neural network) and U-net in a way that GNN is used to embed pin accessibility information abstracted from our pin proximity graph while U-net is used to extract routing congestion information from grid-based features. Through experiments with a set of benchmark designs using the Nangate 15nm library, our PGNN outperforms the existing ML models on all benchmark designs, achieving on average 7.8–12.5% improvements in F1-score with 5.5× faster inference than the state-of-the-art techniques.

A Novel Semi-Analytical Approach for Fast Electromigration Stress Analysis in Multi-Segment Interconnects

  • Olympia Axelou
  • Nestor Evmorfopoulos
  • George Floros
  • George Stamoulis
  • Sachin S. Sapatnekar

As integrated circuit technologies move below 10 nm, Electromigration (EM) has become an issue of great concern for long-term reliability due to the stricter performance, thermal and power requirements. The problem of EM becomes even more pronounced in power grids due to the large unidirectional currents flowing in these structures. The attention for EM analysis during the past years has been drawn to accurate physics-based models describing the interplay between the electron wind force and the back stress force, in a single Partial Differential Equation (PDE) involving wire stress. In this paper, we present a fast semi-analytical approach for the solution of the stress PDE at discrete spatial points in multi-segment lines of power grids, which allows the analytical calculation of EM stress independently at any time in these lines. Our method exploits the specific form of the discrete stress coefficient matrix whose eigenvalues and eigenvectors are known beforehand. Thus, a closed-form equation can be constructed with almost linear time complexity without the need of time discretization. This closed-form equation can be subsequently used at any given time in transient stress analysis. Our experimental results, using the industrial IBM power grid benchmarks, demonstrate that our method has excellent accuracy compared to the industrial tool COMSOL while being orders of magnitude faster.
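For readers unfamiliar with the stress PDE, a commonly used form is Korhonen's equation. The sketch below states it for a single segment and shows how a spatial discretization with a known eigendecomposition yields a closed form in time. The notation (κ, G, A, b, V, Λ) and the assumption of time-constant current densities are choices made here for illustration, and sign conventions for G vary with the chosen current direction; the paper's exact formulation may differ.

```latex
% Korhonen's equation for the hydrostatic stress \sigma(x,t) in one segment
\frac{\partial \sigma}{\partial t}
  = \frac{\partial}{\partial x}
    \left[ \kappa \left( \frac{\partial \sigma}{\partial x} + G \right) \right],
\qquad
\kappa = \frac{D_a B \Omega}{k_B T},
\qquad
G = \frac{e Z^{*} \rho\, j}{\Omega}

% Spatial discretization at n points gives a linear ODE system
\frac{d\boldsymbol{\sigma}}{dt} = A\,\boldsymbol{\sigma} + \mathbf{b},
\qquad
A = V \Lambda V^{-1},
\quad
\Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_n)

% Closed form at any time t (no time stepping)
\boldsymbol{\sigma}(t)
  = V \left[ e^{\Lambda t}\, V^{-1} \boldsymbol{\sigma}(0)
      + \Phi(t)\, V^{-1} \mathbf{b} \right],
\qquad
\Phi(t) = \operatorname{diag}\bigl(\phi_1(t),\dots,\phi_n(t)\bigr),
\quad
\phi_i(t) =
\begin{cases}
  \dfrac{e^{\lambda_i t} - 1}{\lambda_i}, & \lambda_i \neq 0 \\[6pt]
  t, & \lambda_i = 0
\end{cases}
```

Because V and Λ are known in advance, evaluating σ(t) at an arbitrary time only requires matrix-vector products, which is what removes the need for time discretization.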

HierPINN-EM: Fast Learning-Based Electromigration Analysis for Multi-Segment Interconnects Using Hierarchical Physics-Informed Neural Network

  • Wentian Jin
  • Liang Chen
  • Subed Lamichhane
  • Mohammadamir Kavousi
  • Sheldon X.-D. Tan

Electromigration (EM) becomes a major concern for VLSI circuits as the technology advances in the nanometer regime. The crux of the problem is to solve the partial differential Korhonen equations, which remains challenging due to the increasing integration density. Recently, scientific machine learning has been explored to solve partial differential equations (PDEs) due to breakthrough success in deep neural networks, and existing approaches such as physics-informed neural networks (PINN) show promising results for some small PDE problems. However, for large engineering problems like EM analysis for large interconnect trees, it was shown that the plain PINN does not work well due to the large number of variables. In this work, we propose a novel hierarchical PINN approach, HierPINN-EM, for fast EM-induced stress analysis for multi-segment interconnects. Instead of solving the interconnect tree as a whole, we first solve the EM problem for one wire segment under different boundary and geometrical parameters using supervised learning. Then we apply the unsupervised PINN concept to solve the whole interconnect by enforcing the physics laws at the boundaries of all wire segments. In this way, HierPINN-EM can significantly reduce the number of variables in the plain PINN solver. Numerical results on a number of synthetic interconnect trees show that HierPINN-EM can lead to orders of magnitude speedup in training and more than 79× better accuracy over the plain PINN method. Furthermore, HierPINN-EM yields 19% better accuracy with 99% reduction in training cost over the recently proposed Graph Neural Network-based EM solver, EMGraph.

Sub-Resolution Assist Feature Generation with Reinforcement Learning and Transfer Learning

  • Guan-Ting Liu
  • Wei-Chen Tai
  • Yi-Ting Lin
  • Iris Hui-Ru Jiang
  • James P. Shiely
  • Pu-Jen Cheng

As modern photolithography feature sizes continue to shrink, sub-resolution assist feature (SRAF) generation has become a key resolution enhancement technique to improve the manufacturing process window. State-of-the-art works resort to machine learning to overcome the deficiencies of model-based and rule-based approaches. Nevertheless, these machine learning-based methods do not consider or implicitly consider the optical interference between SRAFs, and highly rely on post-processing to satisfy SRAF mask manufacturing rules. In this paper, we are the first to generate SRAFs using reinforcement learning to address SRAF interference and produce mask-rule-compliant results directly. In this way, our two-phase learning enables us to emulate the style of model-based SRAFs while further improving the process variation (PV) band. A state alignment and action transformation mechanism is proposed to achieve orientation equivariance while expediting the training process. We also propose a transfer learning framework, allowing SRAF generation under different light sources without retraining the model. Compared with state-of-the-art works, our method improves the solution quality in terms of PV band and edge placement error (EPE) while reducing the overall runtime.

SESSION: New Frontier in Verification Technology

Session details: New Frontier in Verification Technology

  • Jyotirmoy Vinay
  • Zahra Ghodsi

Automatic Test Configuration and Pattern Generation (ATCPG) for Neuromorphic Chips

  • I-Wei Chiu
  • Xin-Ping Chen
  • Jennifer Shueh-Inn Hu
  • James Chien-Mo Li

The demand for low-power, high-performance neuromorphic chips is increasing. However, conventional testing is not applicable to neuromorphic chips for three reasons: (1) lack of scan DfT, (2) stochastic characteristic, and (3) configurable functionality. In this paper, we present an automatic test configuration and pattern generation (ATCPG) method for testing a configurable stochastic neuromorphic chip without using scan DfT. We use machine learning to generate test configurations. Then, we apply a modified fast gradient sign method to generate test patterns. Finally, we determine test repetitions with the statistical power of the test. We conduct experiments on one of the neuromorphic architectures, the spiking neural network, to evaluate the effectiveness of our ATCPG. The experimental results show that our ATCPG can achieve 100% fault coverage for the five fault models we use. For testing a 3-layer model at the 0.05 significance level, we produce 5 test configurations and 67 test patterns. The average test repetitions of neuron faults and synapse faults are 2,124 and 4,557, respectively. Besides, our simulation results show that the overkill matches our significance level perfectly.
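For context, the standard (unmodified) fast gradient sign method perturbs an input by a fixed step in the direction of the sign of the loss gradient. The sketch below applies one such step to a toy logistic model in plain Python. The model, the function name `fgsm_step`, and all numeric values are hypothetical; the abstract does not describe the paper's modification, so none is reproduced here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)


def fgsm_step(x, y, w, b, eps):
    """One standard FGSM step on a logistic model p = sigmoid(w . x + b).

    For binary cross-entropy loss, the gradient with respect to the
    input is (p - y) * w, so no autodiff library is needed here.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    # Move each input component by eps in the direction that increases the loss.
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


# Hypothetical toy input and weights.
x_adv = fgsm_step(x=[0.5, -0.2], y=1, w=[2.0, -1.0], b=0.0, eps=0.1)
print(x_adv)
```

In this toy case the step pushes the input away from the labeled class, which is the general idea behind using gradient-sign perturbations to find inputs that expose a difference in behavior.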

ScaleHD: Robust Brain-Inspired Hyperdimensional Computing via Adaptive Scaling

  • Sizhe Zhang
  • Mohsen Imani
  • Xun Jiao

Brain-inspired hyperdimensional computing (HDC) has demonstrated promising capability in various cognition tasks such as robotics, bio-medical signal analysis, and natural language processing. Compared to deep neural networks, HDC models show advantages such as lightweight models and one/few-shot learning capabilities, making them a promising alternative paradigm to traditional resource-demanding deep learning models, particularly in edge devices with limited resources. Despite the growing popularity of HDC, the robustness of HDC models and the approaches to enhance HDC robustness have not been systematically analyzed and sufficiently examined. HDC relies on high-dimensional numerical vectors referred to as hypervectors (HVs) to perform cognition tasks, and the values inside the HVs are critical to the robustness of an HDC model. We propose ScaleHD, an adaptive scaling method that scales the value of HVs in the associative memory of an HDC model to enhance the robustness of HDC models. We propose three different modes of ScaleHD including Global-ScaleHD, Class-ScaleHD, and (Class + Clip)-ScaleHD, which are based on different adaptive scaling strategies. Results show that ScaleHD is able to enhance HDC robustness against memory errors by up to 10,000×. Moreover, we leverage the enhanced HDC robustness in exchange for energy saving via voltage scaling. Experimental results show that ScaleHD can reduce energy consumption of the HDC memory system by up to 72.2% with less than 1% accuracy loss.

Quantitative Verification and Design Space Exploration under Uncertainty with Parametric Stochastic Contracts

  • Chanwook Oh
  • Michele Lora
  • Pierluigi Nuzzo

This paper proposes an automated framework for quantitative verification and design space exploration of cyber-physical systems in the presence of uncertainty, leveraging assume-guarantee contracts expressed in Stochastic Signal Temporal Logic (StSTL). We introduce quantitative semantics for StSTL and formulations of the quantitative verification and design space exploration problems as bi-level optimization problems. We show that these optimization problems can be effectively solved for a class of stochastic systems and a fragment of bounded-time StSTL formulas. Our algorithm searches for partitions of the upper-level design space such that the solutions of the lower-level problems satisfy the upper-level constraints. A set of optimal parameter values is then selected within these partitions. We illustrate the effectiveness of our framework on the design of a multi-sensor perception system and an automatic cruise control system.

SESSION: Low Power Edge Intelligence

Session details: Low Power Edge Intelligence

  • Sabya Das
  • Jiang Hu

Reliable Machine Learning for Wearable Activity Monitoring: Novel Algorithms and Theoretical Guarantees

  • Dina Hussein
  • Taha Belkhouja
  • Ganapati Bhat
  • Janardhan Rao Doppa

Wearable devices are becoming popular for health and activity monitoring. The machine learning (ML) models for these applications are trained by collecting data in a laboratory with precise control of experimental settings. However, during real-world deployment/usage, the experimental settings (e.g., sensor position or sampling rate) may deviate from those used during training. This discrepancy can degrade the accuracy and effectiveness of the health monitoring applications. Therefore, there is a great need to develop reliable ML approaches that provide high accuracy for real-world deployment. In this paper, we propose a novel statistical optimization approach, referred to as StatOpt, that automatically accounts for the real-world disturbances in sensing data to improve the reliability of ML models for wearable devices. We theoretically derive the upper bounds on sensor data disturbance for StatOpt to produce an ML model with reliability certificates. We validate StatOpt on two publicly available datasets for human activity recognition. Our results show that compared to standard ML algorithms, the reliable ML classifiers enabled by the StatOpt approach improve the accuracy by up to 50% in real-world settings with zero overhead, while baseline approaches incur significant overhead and fail to achieve comparable accuracy.

Neurally-Inspired Hyperdimensional Classification for Efficient and Robust Biosignal Processing

  • Yang Ni
  • Nicholas Lesica
  • Fan-Gang Zeng
  • Mohsen Imani

Biosignals are recorded by multiple sensors that collect time series information. Since time series contain temporal dependencies, they are difficult to process by existing machine learning algorithms. Hyper-Dimensional Computing (HDC) is introduced as a brain-inspired paradigm for lightweight time series classification. However, existing HDC algorithms have the following drawbacks: (1) low classification accuracy that comes from linear hyperdimensional representation, (2) lack of real-time learning support due to costly and hardware-unfriendly operations, and (3) inability to build up a strong model from partially labeled data.

In this paper, we propose TempHD, a novel hyperdimensional computing method for efficient and accurate biosignal classification. We first develop a novel non-linear hyperdimensional encoding that maps data points into high-dimensional space. Unlike existing HDC solutions that use costly mathematics for encoding, TempHD preserves spatial-temporal information of data in original space before mapping data into high-dimensional space. To obtain the most informative representation, our encoding method considers the non-linear interactions between both spatial sensors and temporally sampled data. Our evaluation shows that TempHD provides higher classification accuracy, significantly higher computation efficiency, and, more importantly, the capability to learn from partially labeled data. We evaluate TempHD effectiveness on noisy EEG data used for a brain-machine interface. Our results show that TempHD achieves, on average, 2.3% higher classification accuracy as well as 7.7× and 21.8× speedup for training and testing time compared to state-of-the-art HDC algorithms, respectively.

EVE: Environmental Adaptive Neural Network Models for Low-Power Energy Harvesting System

  • Sahidul Islam
  • Shanglin Zhou
  • Ran Ran
  • Yu-Fang Jin
  • Wujie Wen
  • Caiwen Ding
  • Mimi Xie

IoT devices are increasingly being implemented with neural network models to enable smart applications. Energy harvesting (EH) technology that harvests energy from the ambient environment is a promising alternative to batteries for powering those devices due to the low maintenance cost and wide availability of the energy sources. However, the power provided by the energy harvester is low and has an intrinsic drawback of instability since it varies with the ambient environment. This paper proposes EVE, an automated machine learning (autoML) co-exploration framework to search for desired multi-models with shared weights for energy harvesting IoT devices. Those shared models incur a significantly reduced memory footprint with different levels of model sparsity, latency, and accuracy to adapt to the environmental changes. An efficient on-device implementation architecture is further developed to efficiently execute each model on device. A run-time model extraction algorithm is proposed that retrieves an individual model with negligible overhead when a specific model mode is triggered. Experimental results show that the neural network models generated by EVE are on average 2.5× faster than the baseline models without pruning and shared weights.

SESSION: Crossbars, Analog Accelerators for Neural Networks, and Neuromorphic Computing Based on Printed Electronics

Session details: Crossbars, Analog Accelerators for Neural Networks, and Neuromorphic Computing Based on Printed Electronics

  • Hussam Amrouch
  • Sheldon Tan

Designing Energy-Efficient Decision Tree Memristor Crossbar Circuits Using Binary Classification Graphs

  • Pranav Sinha
  • Sunny Raj

We propose a method to design in-memory, energy-efficient, and compact memristor crossbar circuits for implementing decision trees using flow-based computing. We develop a new tool called binary classification graph, which is equivalent to decision trees in accuracy but uses bit values of input features to make decisions instead of thresholds. Our proposed design is resilient to manufacturing errors and can scale to large crossbar sizes due to the utilization of sneak paths in computations. Our design uses zero transistor and one memristor (0T1R) crossbars with only two resistance states of high and low, which makes it resilient to resistance drift and radiation degradation. We test the performance of our designs on multiple standard machine learning datasets and show that our method utilizes circuits of size 5.23 × 10⁻³ mm² and uses 20.5 pJ per decision, and outperforms state-of-the-art decision tree acceleration algorithms on these metrics.

Fuse and Mix: MACAM-Enabled Analog Activation for Energy-Efficient Neural Acceleration

  • Hanqing Zhu
  • Keren Zhu
  • Jiaqi Gu
  • Harrison Jin
  • Ray T. Chen
  • Jean Anne Incorvia
  • David Z. Pan

Analog computing has been recognized as a promising low-power alternative to digital counterparts for neural network acceleration. However, conventional analog computing is mainly in a mixed-signal manner. Tedious analog/digital (A/D) conversion cost significantly limits the overall system’s energy efficiency. In this work, we devise an efficient analog activation unit with magnetic tunnel junction (MTJ)-based analog content-addressable memory (MACAM), simultaneously realizing nonlinear activation and A/D conversion in a fused fashion. To compensate for the nascent and therefore currently limited representation capability of MACAM, we propose to mix our analog activation unit with digital activation dataflow. A fully differential framework, SuperMixer, is developed to search for an optimized activation workload assignment, adaptive to various activation energy constraints. The effectiveness of our proposed methods is evaluated on a silicon photonic accelerator. Compared to standard activation implementation, our mixed activation system with the searched assignment can achieve competitive accuracy with >60% energy saving on A/D conversion and activation.

Aging-Aware Training for Printed Neuromorphic Circuits

  • Haibin Zhao
  • Michael Hefenbrock
  • Michael Beigl
  • Mehdi B. Tahoori

Printed electronics allow for ultra-low-cost circuit fabrication with unique properties such as flexibility, non-toxicity, and stretchability. Because of these advanced properties, there is a growing interest in adapting printed electronics for emerging areas such as fast-moving consumer goods and wearable technologies. In such domains, analog signal processing in or near the sensor is favorable. Printed neuromorphic circuits have been recently proposed as a solution to perform such analog processing natively. Additionally, their learning-based design process allows high efficiency of their optimization and enables them to mitigate the high process variations associated with low-cost printed processes. In this work, we address the aging of the printed components. This effect can significantly degrade the accuracy of printed neuromorphic circuits over time. For this, we develop a stochastic aging model to describe the behavior of aged printed resistors and modify the training objective by considering the expected loss over the lifetime of the device. This approach ensures acceptable accuracy over the device lifetime. Our experiments show that an overall 35.8% improvement in terms of expected accuracy over the device lifetime can be achieved using the proposed learning approach.
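One way to write down an "expected loss over the lifetime of the device" is sketched below. The notation is assumed here for illustration (θ for the trainable circuit parameters, t for device age up to a lifetime T, ω for the random draw of the stochastic aging model, f for the circuit's input-output function) and is not taken from the paper; the uniform distribution over t and the Monte Carlo estimate are likewise assumptions.

```latex
% Nominal objective: loss of the fresh (un-aged) circuit
\mathcal{L}_{\text{nominal}}(\theta)
  = \mathbb{E}_{(x,y)}\bigl[\, \ell\bigl(f(x;\theta),\, y\bigr) \bigr]

% Aging-aware objective: expectation over device age and aging randomness
\mathcal{L}_{\text{aging}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,T]}\;
    \mathbb{E}_{\omega \sim p(\omega)}\;
    \mathbb{E}_{(x,y)}
    \bigl[\, \ell\bigl(f(x;\theta,t,\omega),\, y\bigr) \bigr]

% Monte Carlo estimate with K sampled (t_k, \omega_k) pairs per training step
\mathcal{L}_{\text{aging}}(\theta)
  \approx \frac{1}{K} \sum_{k=1}^{K}
    \mathbb{E}_{(x,y)}
    \bigl[\, \ell\bigl(f(x;\theta,t_k,\omega_k),\, y\bigr) \bigr]
```

Minimizing the second objective rather than the first trades some accuracy at t = 0 for accuracy that holds up as the resistors drift.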

SESSION: Designing DNN Accelerators

Session details: Designing DNN Accelerators

  • Elliott Delaye
  • Yiyu Shi

Workload-Balanced Graph Attention Network Accelerator with Top-K Aggregation Candidates

  • Naebeom Park
  • Daehyun Ahn
  • Jae-Joon Kim

Graph attention networks (GATs) are gaining attention for various transductive and inductive graph processing tasks due to their higher accuracy than conventional graph convolutional networks (GCNs). The power-law distribution of real-world graph-structured data, on the other hand, causes a severe workload imbalance problem for GAT accelerators. To reduce the degradation of PE utilization due to the workload imbalance, we present algorithm/hardware co-design results for a GAT accelerator that balances the workload assigned to processing elements by allowing only K neighbor nodes to participate in the aggregation phase. The proposed model selects the K neighbor nodes with high attention scores, which represent relevance between two nodes, to minimize accuracy drop. Experimental results show that our algorithm/hardware co-design of the GAT accelerator achieves higher processing speed and energy efficiency than the GAT accelerators using conventional workload balancing techniques. Furthermore, we demonstrate that the proposed GAT accelerators can be made faster than the GCN accelerators that typically process a smaller number of computations.
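The idea of letting only the K highest-scoring neighbors take part in aggregation can be sketched in a few lines of plain Python. The function name `topk_aggregate`, the toy scores, and the choice to renormalize the softmax over the kept neighbors are assumptions made for this illustration; the abstract does not specify these details.

```python
import math


def topk_aggregate(scores, features, k):
    """Aggregate neighbor features using only the k highest attention scores.

    scores:   one raw attention score per neighbor
    features: one feature vector (list of floats) per neighbor
    Returns the kept neighbor indices and the aggregated feature vector.
    """
    # Keep the k neighbors with the largest scores (ties keep original order).
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax over the kept neighbors only (assumed renormalization).
    top = max(scores[i] for i in keep)
    exps = [math.exp(scores[i] - top) for i in keep]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the kept neighbors' feature vectors.
    dim = len(features[0])
    out = [0.0] * dim
    for w, i in zip(weights, keep):
        for d in range(dim):
            out[d] += w * features[i][d]
    return keep, out


# Hypothetical node with four neighbors; only two participate.
kept, aggregated = topk_aggregate(
    scores=[0.1, 2.0, 2.0, -1.0],
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]],
    k=2,
)
print(kept, aggregated)
```

Because every node aggregates at most K neighbors, the per-node work is bounded regardless of degree, which is what evens out the load across processing elements on power-law graphs.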

Re2fresh: A Framework for Mitigating Read Disturbance in ReRAM-Based DNN Accelerators

  • Hyein Shin
  • Myeonggu Kang
  • Lee-Sup Kim

A severe read disturbance problem degrades the inference accuracy of a resistive RAM (ReRAM) based deep neural network (DNN) accelerator. Refresh, which reprograms the ReRAM cells, is the most obvious solution for the problem, but programming ReRAM consumes huge energy. To address the issue, we first analyze the resistance drift pattern of each conductance state and the actual read stress applied to the ReRAM array by considering the characteristics of ReRAM-based DNN accelerators. Based on the analysis, we cluster ReRAM cells into a few groups for each layer of DNN and generate a proper refresh cycle for each group in the offline phase. The individual refresh cycles reduce energy consumption by reducing the number of unnecessary refresh operations. In the online phase, the refresh controller selectively launches refresh operations according to the generated refresh cycles. ReRAM cells are selectively refreshed by minimally modifying the conventional structure of the ReRAM-based DNN accelerator. The proposed work successfully resolves the read disturbance problem by reducing 97% of the energy consumption for the refresh operation while preserving inference accuracy.
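To see why per-group refresh cycles cut the number of refresh operations, the following sketch compares them against a single worst-case cycle applied to every cell. The group sizes, refresh intervals, and horizon are hypothetical values chosen for illustration and are unrelated to the paper's measurements.

```python
def per_group_refresh_ops(groups, horizon):
    """Cell refreshes when each group follows its own refresh interval.

    groups:  list of (cell_count, refresh_interval_in_reads)
    horizon: number of read operations considered
    """
    return sum(cells * (horizon // interval) for cells, interval in groups)


def uniform_refresh_ops(groups, horizon):
    """Cell refreshes when all cells use the shortest (worst-case) interval."""
    total_cells = sum(cells for cells, _ in groups)
    shortest = min(interval for _, interval in groups)
    return total_cells * (horizon // shortest)


# Hypothetical clustering: a few fast-drifting cells, many slow-drifting ones.
groups = [(100, 1_000), (300, 4_000), (600, 16_000)]
horizon = 16_000

grouped = per_group_refresh_ops(groups, horizon)
uniform = uniform_refresh_ops(groups, horizon)
saving = 1 - grouped / uniform

print(grouped, uniform, f"{saving:.1%} fewer cell refreshes")
```

The saving grows with the share of cells that tolerate long intervals, since under a uniform policy those cells are reprogrammed far more often than their drift requires.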

FastStamp: Accelerating Neural Steganography and Digital Watermarking of Images on FPGAs

  • Shehzeen Hussain
  • Nojan Sheybani
  • Paarth Neekhara
  • Xinqiao Zhang
  • Javier Duarte
  • Farinaz Koushanfar

Steganography and digital watermarking are the tasks of hiding recoverable data in image pixels. Deep neural network (DNN) based image steganography and watermarking techniques are quickly replacing traditional hand-engineered pipelines. DNN based watermarking techniques have drastically improved the message capacity, imperceptibility and robustness of the embedded watermarks. However, this improvement comes at the cost of increased computational overhead of the watermark encoder neural network. In this work, we design the first accelerator platform FastStamp to perform DNN based steganography and digital watermarking of images on hardware. We first propose a parameter efficient DNN model for embedding recoverable bit-strings in image pixels. Our proposed model can match the success metrics of prior state-of-the-art DNN based watermarking methods while being significantly faster and lighter in terms of memory footprint. We then design an FPGA based accelerator framework to further improve the model throughput and power consumption by leveraging data parallelism and customized computation paths. FastStamp allows embedding hardware signatures into images to establish media authenticity and ownership of digital media. Our best design achieves 68× faster inference as compared to GPU implementations of prior DNN based watermark encoder while consuming less power.

SESSION: Novel Chiplet Approaches from Interconnect to System (Virtual)

Session details: Novel Chiplet Approaches from Interconnect to System (Virtual)

  • Xinfei Guo

GIA: A Reusable General Interposer Architecture for Agile Chiplet Integration

  • Fuping Li
  • Ying Wang
  • Yuanqing Cheng
  • Yujie Wang
  • Yinhe Han
  • Huawei Li
  • Xiaowei Li

2.5D chiplet technology is gaining popularity for the efficiency of integrating multiple heterogeneous dies or chiplets on interposers, and it is also considered an ideal option for agile silicon system design by mitigating the huge design, verification, and manufacturing overhead of monolithic SoCs. Although it significantly reduces development costs by chiplet reuse, the design and fabrication of interposers also introduce additional high non-recurring engineering (NRE) costs and development cycles which might be prohibitive for application-specific designs having low volume.

To address this challenge, in this paper, we propose a reusable general interposer architecture (GIA) to amortize NRE costs and accelerate integration flows of interposers across different chiplet-based systems effectively. The proposed assembly-time configurable interposer architecture covers both active interposers and passive interposers considering diverse applications of 2.5D systems. The agile interposer integration is also facilitated by a novel end-to-end design automation framework to generate optimal system assembly configurations including the selection of chiplets, inter-chiplet network configuration, placement of chiplets, and mapping on GIA, which are specialized for the given target workload. The experimental results show that our proposed active GIA and passive GIA achieve 3.15x and 60.92x performance boost with 2.57x and 2.99x power saving over baselines respectively.

Accelerating Cache Coherence in Manycore Processor through Silicon Photonic Chiplet

  • Chengeng Li
  • Fan Jiang
  • Shixi Chen
  • Jiaxu Zhang
  • Yinyi Liu
  • Yuxiang Fu
  • Jiang Xu

Cache coherence overhead in manycore systems is becoming prominent with the increase of system scale. However, traditional electrical networks restrict the efficiency of cache coherence transactions in the system due to the limited bandwidth and long latency. Optical networks promise high bandwidth and low latency, and support both efficient unicast and multicast transmission, which can potentially accelerate cache coherence in manycore systems. This work proposes a novel photonic cache coherence network with a physically centralized logically distributed directory called PCCN for chiplet-based manycore systems. PCCN adopts a channel sharing method with a contention solving mechanism for efficient long-distance coherence-related packet transmission. Experimental results show that compared to state-of-the-art proposals, PCCN can speed up application execution time by 1.32x, reduce memory access latency by 26%, and improve energy efficiency by 1.26x, on average, in a 128-core system.

Re-LSM: A ReRAM-Based Processing-in-Memory Framework for LSM-Based Key-Value Store

  • Qian Wei
  • Zhaoyan Shen
  • Yiheng Tong
  • Zhiping Jia
  • Lei Ju
  • Jiezhi Chen
  • Bingzhe Li

Log-structured merge (LSM) tree based key-value (KV) stores organize writes into hierarchical batches for high-speed writing. However, the notorious compaction process of LSM-tree severely hurts system performance. It not only involves huge I/O operations but also consumes tremendous computation and memory resources. In this paper, first we find that when compaction happens in the high levels (i.e., L0-L1) of the LSM-tree, it may saturate all system computation and memory resources, and eventually stall the whole system. Based on this observation, we present Re-LSM, a ReRAM-based Processing-in-Memory (PIM) framework for LSM-based Key-Value Store. Specifically, in Re-LSM, we propose to offload certain computation and memory-intensive tasks in the high levels of the LSM-tree to the ReRAM-based PIM space. A highly parallel ReRAM compaction accelerator is designed by decomposing the three-phased compaction into basic logic operating units. Evaluation results based on db_bench and YCSB show that Re-LSM achieves 2.2× improvement on the throughput of random writes compared to RocksDB, and the ReRAM-based compaction accelerator speeds up the CPU-based implementation by 64.3× and saves 25.5× energy.

SESSION: Architecture for DNN Acceleration (Virtual)

Session details: Architecture for DNN Acceleration (Virtual)

  • Zhezhi He

Hidden-ROM: A Compute-in-ROM Architecture to Deploy Large-Scale Neural Networks on Chip with Flexible and Scalable Post-Fabrication Task Transfer Capability

  • Yiming Chen
  • Guodong Yin
  • Mingyen Lee
  • Wenjun Tang
  • Zekun Yang
  • Yongpan Liu
  • Huazhong Yang
  • Xueqing Li

Motivated by reducing the data transfer activities in data-intensive neural network computing, SRAM-based compute-in-memory (CiM) has made significant progress. Unfortunately, SRAM has low density and limited on-chip capacity. This makes the deployment of large models inefficient due to the frequent DRAM access to update the weight in SRAM. Recently, a ROM-based CiM design, YOLoC, reveals the unique opportunity of deploying a large-scale neural network in CMOS by exploring the intriguing high density of ROM. However, even though assisting SRAM has been adopted in YOLoC for task transfer within the same domain, it is still a big challenge to overcome the read-only limitation in ROM and enable more flexibility. Therefore, it is of paramount significance to develop new ROM-based CiM architectures and provide broader task space and model expansion capability for more complex tasks.

This paper presents Hidden-ROM for high flexibility of ROM-based CiM. Hidden-ROM provides several novel ideas beyond YOLoC. First, it adopts a one-SRAM-many-ROM method that “hides” ROM cells to support various datasets of different domains, including CIFAR10/100, FER2013, and ImageNet. Second, Hidden-ROM provides the model expansion capability after chip fabrication to update the model for more complex tasks when needed. Experiments show that Hidden-ROM designed for ResNet-18 pretrained on CIFAR100 (item classification) can achieve <0.5% accuracy loss in FER2013 (facial expression recognition), while YOLoC degrades by >40%. After expanding to ResNet-50/101, Hidden-ROM even achieves 68.6%/72.3% accuracy in ImageNet, close to 74.9%/76.4% by software. Such expansion costs only 7.6%/12.7% energy efficiency overhead while providing 12%/16% accuracy improvement after expansion.

DCIM-GCN: Digital Computing-in-Memory to Efficiently Accelerate Graph Convolutional Networks

  • Yikan Qiu
  • Yufei Ma
  • Wentao Zhao
  • Meng Wu
  • Le Ye
  • Ru Huang

Computing-in-memory (CIM) is emerging as a promising architecture to accelerate graph convolutional networks (GCNs) normally bounded by redundant and irregular memory transactions. Current analog based CIM requires frequent analog and digital conversions (AD/DA) that dominate the overall area and power consumption. Furthermore, the analog non-ideality degrades the accuracy and reliability of CIM. In this work, an SRAM based digital CIM system is proposed to accelerate memory intensive GCNs, namely DCIM-GCN, which covers innovations from CIM circuit level eliminating costly AD/DA converters to architecture level addressing irregularity and sparsity of graph data. DCIM-GCN achieves 2.07×, 1.76×, and 1.89× speedup and 29.98×, 1.29×, and 3.73× energy efficiency improvement on average over CIM based PIMGCN, TARe, and PIM-GCN, respectively.

Hardware Computation Graph for DNN Accelerator Design Automation without Inter-PU Templates

  • Jun Li
  • Wei Wang
  • Wu-Jun Li

Existing deep neural network (DNN) accelerator design automation (ADA) methods adopt architecture templates to predetermine part of the design choices and then explore the remaining design choices beyond the templates. These templates can be classified into intra-PU templates and inter-PU templates according to the architecture hierarchy. Since templates limit the flexibility of ADA, designing effective ADA methods without templates has become an important research topic. Although some works have enhanced the flexibility of ADA by removing intra-PU templates, to the best of our knowledge no existing works have studied ADA methods without inter-PU templates. ADA with predetermined inter-PU templates is typically inefficient in terms of resource utilization, especially for DNNs with complex topology. In this paper, we propose a novel method, called hardware computation graph (HCG), for ADA without inter-PU templates. Experiments show that the HCG method can achieve competitive latency while using 1.4× to 5× less on-chip memory, compared with existing state-of-the-art ADA methods.

SESSION: Multi-Purpose Fundamental Digital Design Improvements (Virtual)

Session details: Multi-Purpose Fundamental Digital Design Improvements (Virtual)

  • Sabya Das
  • Mondira Pant

Dynamic Frequency Boosting Beyond Critical Path Delay

  • Nikolaos Zompakis
  • Sotirios Xydis

This paper introduces an innovative post-implementation Dynamic Frequency Boosting (DFB) technique to release “hidden” performance margins of digital circuit designs currently suppressed by typical critical path constraint design flows, thus defining higher limits of operation speed. The proposed technique goes beyond the state of the art and exploits the data-driven path delay variability, incorporating an innovative hardware clocking mechanism that detects the paths’ activation in real time. In contrast to timing speculation, the operating speed is adjusted on the nominal path delay activation, achieving error-free acceleration. The proposed technique has been evaluated on three FPGA-based use cases carefully selected to exhibit differing domain characteristics, i.e., i) a third-party DNN inference accelerator IP for CIFAR-10 images, achieving an average speedup of 18%, ii) a highly designer-optimized Optical Digital Equalizer design, in which DFB delivered a speedup of 50%, and iii) a set of 5 synthetic designs examining high-frequency (beyond 400 MHz) applications in FPGAs, achieving accelerations of 20–60% depending on the underlying path variability.

ASPPLN: Accelerated Symbolic Probability Propagation in Logic Network

  • Weihua Xiao
  • Weikang Qian

Probability propagation is an important task in logic network analysis, which propagates signal probabilities from the network’s primary inputs to its primary outputs. It has many applications such as power estimation, reliability analysis, and error analysis for approximate circuits. Existing methods for the task can be divided into two categories: simulation-based and probability-based methods. However, most of them suffer from low accuracy or poor scalability. In this work, we propose ASPPLN, a method for accelerated symbolic probability propagation in logic networks, which has linear complexity in the network size. We first introduce a new definition in a graph called redundant input and take advantage of it to simplify the propagation process without losing accuracy. Then, a technique called symbol limitation is proposed to limit the complexity of each node’s propagation according to the partial probability significances of the symbols. The experimental results show that, compared to the existing methods, ASPPLN improves the estimation accuracy of switching activity by up to 24.70%, while also achieving a speedup of up to 29×.
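To make the task concrete, here is a minimal Python sketch of the textbook independence-based propagation, which is not the ASPPLN algorithm. The gate encoding and the function names `propagate` and `exact_probability` are assumptions made for illustration. The sketch also shows why a purely numeric approach loses accuracy on reconvergent fanout, which is the kind of error symbolic methods aim to avoid.

```python
from itertools import product
from typing import Dict, List, Tuple

# A gate is (output name, type, input names); gates are listed in topological order.
Gate = Tuple[str, str, List[str]]


def evaluate(values: Dict[str, bool], gates: List[Gate]) -> Dict[str, bool]:
    """Evaluate the network for one Boolean input assignment."""
    v = dict(values)
    for out, kind, ins in gates:
        if kind == "NOT":
            v[out] = not v[ins[0]]
        elif kind == "AND":
            v[out] = all(v[i] for i in ins)
        elif kind == "OR":
            v[out] = any(v[i] for i in ins)
        else:
            raise ValueError(f"unsupported gate type: {kind}")
    return v


def propagate(inputs: Dict[str, float], gates: List[Gate]) -> Dict[str, float]:
    """Propagate P(signal = 1), assuming all gate inputs are independent."""
    prob = dict(inputs)
    for out, kind, ins in gates:
        p = [prob[i] for i in ins]
        if kind == "NOT":
            prob[out] = 1.0 - p[0]
        elif kind == "AND":
            acc = 1.0
            for x in p:
                acc *= x
            prob[out] = acc
        elif kind == "OR":
            acc = 1.0
            for x in p:
                acc *= 1.0 - x
            prob[out] = 1.0 - acc
        else:
            raise ValueError(f"unsupported gate type: {kind}")
    return prob


def exact_probability(inputs: Dict[str, float], gates: List[Gate], target: str) -> float:
    """Exact P(target = 1) by enumerating every input assignment (exponential cost)."""
    names = list(inputs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        weight = 1.0
        for name, bit in zip(names, bits):
            weight *= inputs[name] if bit else 1.0 - inputs[name]
        if evaluate(dict(zip(names, bits)), gates)[target]:
            total += weight
    return total


# Reconvergent fanout: y = a AND (NOT a) is always 0.
reconvergent = [("na", "NOT", ["a"]), ("y", "AND", ["a", "na"])]
print(propagate({"a": 0.5}, reconvergent)["y"])          # independence assumption
print(exact_probability({"a": 0.5}, reconvergent, "y"))  # exact value
```

On the reconvergent example the independence assumption reports a nonzero probability for a signal that is constant 0, while exhaustive enumeration is exact but scales exponentially with the number of inputs.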

A High-Precision Stochastic Solver for Steady-State Thermal Analysis with Fourier Heat Transfer Robin Boundary Conditions

  • Longlong Yang
  • Cuiyang Ding
  • Changhao Yan
  • Dian Zhou
  • Xuan Zeng

In this work, we propose a path integral random walk (PIRW) solver, the first accurate stochastic method for steady-state thermal analysis with mixed boundary conditions, especially involving Fourier heat transfer Robin boundary conditions. We innovatively adopt the strictly correct calculation of the local time and the Feynman-Kac functional ê_c(t) to handle Neumann and Robin boundary conditions with high precision. Compared with ANSYS, experimental results show that PIRW achieves over 121× speedup and over 83× storage space reduction with a negligible error within 0.8 °C at a single point. An application combining PIRW with low-accuracy ANSYS for the temperature calculation at hot-spots is provided as a more accurate and faster solution than using ANSYS alone.
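For readers unfamiliar with stochastic thermal solvers, the following Python sketch shows the basic random-walk idea on the simplest possible case: a one-dimensional grid with fixed-temperature (Dirichlet) ends and no heat source. It is not the PIRW method and does not handle Neumann or Robin boundaries; the function name `walk_temperature` and all parameters are assumptions made for illustration.

```python
import random


def walk_temperature(start: int, n: int, t_left: float, t_right: float,
                     walks: int, seed: int = 0) -> float:
    """Estimate the steady-state temperature at grid node `start` on a rod
    with nodes 0..n, where node 0 is held at t_left and node n at t_right.

    Each walk steps left or right with equal probability until it hits a
    boundary node; the estimate is the average boundary temperature reached.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        pos = start
        while 0 < pos < n:
            pos += 1 if rng.random() < 0.5 else -1
        total += t_left if pos == 0 else t_right
    return total / walks


# The analytic solution for this setup is linear: T(k) = t_left + (t_right - t_left) * k / n.
estimate = walk_temperature(start=3, n=10, t_left=0.0, t_right=100.0, walks=20000)
print(estimate)
```

The appeal of this family of methods is that the temperature at a single point can be estimated without solving for the whole domain, which is consistent with the single-point and hot-spot use cases the abstract mentions.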

SESSION: GPU Acceleration for Routing Algorithms (Virtual)

Session details: GPU Acceleration for Routing Algorithms (Virtual)

  • Umamaheswara Rao Tida

Superfast Full-Scale GPU-Accelerated Global Routing

  • Shiju Lin
  • Martin D. F. Wong

Global routing is an essential step in physical design. Recently, there have been works on accelerating global routers using GPUs. However, they focus only on certain stages of global routing and achieve limited overall speedup. In this paper, we present a superfast full-scale GPU-accelerated global router and introduce useful parallelization techniques for routing. Experiments show that our 3D router achieves both good quality and short runtime compared to other state-of-the-art academic global routers.

X-Check: GPU-Accelerated Design Rule Checking via Parallel Sweepline Algorithms

  • Zhuolun He
  • Yuzhe Ma
  • Bei Yu

Design rule checking (DRC) is essential in physical verification to ensure high yield and reliability for VLSI circuit designs. To achieve reasonable design cycle time, acceleration for computationally intensive DRC tasks has been demanded to accommodate the ever-growing complexity of modern VLSI circuits. In this paper, we propose X-Check, a GPU-accelerated design rule checker. X-Check integrates novel parallel sweepline algorithms, which are both efficient in practice and with nontrivial theoretical guarantees. Experimental results have demonstrated significant speedup achieved by X-Check compared with a multi-threaded CPU checker.

GPU-Accelerated Rectilinear Steiner Tree Generation

  • Zizheng Guo
  • Feng Gu
  • Yibo Lin

Rectilinear Steiner minimum tree (RSMT) generation is a fundamental component in the VLSI design automation flow. Due to its extensive usage in circuit design iterations at early design stages like synthesis, placement, and routing, the performance of RSMT generation is critical for a reasonable design turnaround time. State-of-the-art RSMT generation algorithms, like fast look-up table estimation (FLUTE), are constrained by CPU-based parallelism with limited runtime improvements. The acceleration of RSMT on GPUs is an important yet difficult task, due to the complex and non-trivial divide-and-conquer computation patterns with recursions. In this paper, we present the first GPU-accelerated RSMT generation algorithm based on FLUTE. By designing GPU-efficient data structures and levelized decomposition, table look-up, and merging operations, we incorporate large-scale data parallelism into the generation of Steiner trees. An up to 10.47× runtime speed-up has been achieved compared with FLUTE running on 40 CPU cores, filling in a critical missing component in today’s GPU-accelerated design automation framework.
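As background for why RSMT length matters as a wirelength estimate, the Python sketch below computes two standard bounds on it: the half-perimeter wirelength (HPWL), a lower bound that is exact for nets with two or three pins, and the rectilinear minimum spanning tree (RMST) length, an upper bound. This is not FLUTE and not the GPU algorithm of the paper; the function names are assumptions made for illustration.

```python
from typing import List, Tuple

Pin = Tuple[int, int]


def manhattan(a: Pin, b: Pin) -> int:
    """Rectilinear (Manhattan) distance between two pins."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hpwl(pins: List[Pin]) -> int:
    """Half-perimeter of the bounding box: a lower bound on RSMT length."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def rmst_length(pins: List[Pin]) -> int:
    """Rectilinear minimum spanning tree length via Prim's algorithm:
    an upper bound on RSMT length, since it uses no Steiner points."""
    if len(pins) < 2:
        return 0
    in_tree = {0}
    dist = [manhattan(pins[0], p) for p in pins]
    total = 0
    while len(in_tree) < len(pins):
        # Pick the closest pin that is not yet connected to the tree.
        j = min((i for i in range(len(pins)) if i not in in_tree),
                key=lambda i: dist[i])
        total += dist[j]
        in_tree.add(j)
        for i in range(len(pins)):
            if i not in in_tree:
                dist[i] = min(dist[i], manhattan(pins[j], pins[i]))
    return total


net = [(0, 0), (4, 0), (2, 3)]
print(hpwl(net), rmst_length(net))
```

For this three-pin net the optimal Steiner tree uses a horizontal trunk with one vertical branch, so its length equals the HPWL and is shorter than the spanning tree.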

SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist I (Virtual)

Session details: Breakthroughs in Synthesis – Infrastructure and ML Assist I (Virtual)

  • Christian Pilato
  • Miroslav Velev

HECTOR: A Multi-Level Intermediate Representation for Hardware Synthesis Methodologies

  • Ruifan Xu
  • Youwei Xiao
  • Jin Luo
  • Yun Liang

Hardware synthesis requires a complicated process to generate synthesizable register transfer level (RTL) code. High-level synthesis tools can automatically transform a high-level description into hardware design, while hardware generators adopt domain specific languages and synthesis flows for specific applications. The implementation of these tools generally requires substantial engineering efforts due to RTL’s weak expressivity and low level of abstraction. Furthermore, different synthesis tools adopt different levels of intermediate representations (IR) and transformations. A unified IR obviously is a good way to lower the engineering cost and get competitive hardware design rapidly by exploring different synthesis methodologies.

In this paper, we propose Hector, a two-level IR providing a unified intermediate representation for hardware synthesis methodologies. The high-level IR binds computation with a control graph annotated with timing information, while the low-level IR provides a concise way to describe hardware modules and elastic interconnections among them. Implemented based on the multi-level compiler infrastructure (MLIR), Hector’s IRs can be converted to synthesizable RTL designs. To demonstrate the expressivity and versatility, we implement three synthesis approaches based on Hector: a high-level synthesis (HLS) tool, a systolic array generator, and a hardware accelerator. The hardware generated by Hector’s HLS approach is comparable to that generated by the state-of-the-art HLS tools, and the other two cases outperform HLS implementations in performance and productivity.

QCIR: Pattern Matching Based Universal Quantum Circuit Rewriting Framework

  • Mingyu Chen
  • Yu Zhang
  • Yongshang Li
  • Zhen Wang
  • Jun Li
  • Xiangyang Li

Due to multiple limitations of quantum computers in the NISQ era, quantum compilation efforts are required to efficiently execute quantum algorithms on NISQ devices. Program rewriting based on pattern matching can improve the generalization ability of compiler optimization. However, it has rarely been explored for quantum circuit optimization, let alone with consideration of the physical features of target devices.

In this paper, we propose QCIR, a pattern-matching based quantum circuit optimization framework with a novel pattern description format, enabling a user-configured cost model and two categories of patterns, i.e., generic patterns and folding patterns. To reduce compilation latency, we propose a DAG representation of quantum circuits called QCir-DAG and the QVF algorithm for subcircuit matching. We implement a continuous single-qubit optimization pass constructed with QCIR, achieving 10% and 20% optimization rates for benchmarks from Qiskit and ScaffCC, respectively. The practicality of QCIR is demonstrated by execution time and experimental results on the quantum simulator and quantum devices.

Batch Sequential Black-Box Optimization with Embedding Alignment Cells for Logic Synthesis

  • Chang Feng
  • Wenlong Lyu
  • Zhitang Chen
  • Junjie Ye
  • Mingxuan Yuan
  • Jianye Hao

During the logic synthesis flow of EDA, a sequence of graph transformation operators is applied to the circuits, so the Quality of Results (QoR) of the circuits highly depends on the chosen operators and their specific parameters in the sequence, making the search space operator-dependent and exponentially large. In this paper, we formulate the logic synthesis design space exploration as a conditional sequence optimization problem, where at each transformation step, an optimization operator is selected and its corresponding parameters are decided. To solve this problem, we propose a novel sequential black-box optimization approach without human intervention: 1) Due to the conditional and sequential structure of operator sequences with variable length, we build an embedding alignment cells based recurrent neural network as a surrogate model to estimate the QoR of the logic synthesis flow with historical data. 2) With the surrogate model, we construct an acquisition function to balance exploration and exploitation with respect to each metric of the QoR. 3) We use a multi-objective optimization algorithm to find the Pareto front of the acquisition functions, along which a batch of sequences, consisting of parameterized operators, is (randomly) selected for users to evaluate under the computing resource budget. We repeat the above three steps until convergence or a time limit is reached. Experimental results on public EPFL benchmarks demonstrate the superiority of our approach over the expert-crafted optimization flows and other machine learning based methods. Compared to resyn2, we achieve an 11.8% improvement in LUT-6 count reduction without sacrificing level values.
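Step 3 above relies on the notion of a Pareto front. The Python sketch below is a generic non-dominated filter for minimization objectives, not the multi-objective optimizer used in the paper; the function name `pareto_front` and the sample (LUT count, level) pairs are hypothetical values chosen for illustration.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Return the points that no other point dominates, in input order."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Hypothetical (LUT count, logic level) results for five candidate sequences.
candidates = [(10, 5), (8, 7), (9, 6), (12, 4), (11, 6)]
print(pareto_front(candidates))
```

Here the candidate (11, 6) is dropped because (10, 5) is better in both objectives, while the other four represent different trade-offs between area and depth. The filter is quadratic in the number of points, which is acceptable for small batches.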

Heterogeneous Graph Neural Network-Based Imitation Learning for Gate Sizing Acceleration

  • Xinyi Zhou
  • Junjie Ye
  • Chak-Wa Pui
  • Kun Shao
  • Guangliang Zhang
  • Bin Wang
  • Jianye Hao
  • Guangyong Chen
  • Pheng Ann Heng

Gate Sizing is an important step in logic synthesis, where the cells are resized to optimize metrics such as area, timing, power, leakage, etc. In this work, we consider the gate sizing problem for leakage power optimization with timing constraints. Lagrangian Relaxation is a widely employed optimization method for gate sizing problems. We accelerate Lagrangian Relaxation-based algorithms by narrowing down the range of cells to resize. In particular, we formulate a heterogeneous directed graph to represent the timing graph, propose a heterogeneous graph neural network as the encoder, and train in the way of imitation learning to mimic the selection behavior of each iteration in Lagrangian Relaxation. This network is used to predict the set of cells that need to be changed during the optimization process of Lagrangian Relaxation. Experiments show that our accelerated gate sizer could achieve comparable performance to the baseline with an average of 22.5% runtime reduction.

SESSION: Smart Search (Virtual)

Session details: Smart Search (Virtual)

  • Jianlei Yang

NASA: Neural Architecture Search and Acceleration for Hardware Inspired Hybrid Networks

  • Huihong Shi
  • Haoran You
  • Yang Zhao
  • Zhongfeng Wang
  • Yingyan Lin

Multiplication is arguably the most cost-dominant operation in modern deep neural networks (DNNs), limiting their achievable efficiency and thus more extensive deployment in resource-constrained applications. To tackle this limitation, pioneering works have developed handcrafted multiplication-free DNNs, which require expert knowledge and time-consuming manual iteration, calling for fast development tools. To this end, we propose a Neural Architecture Search and Acceleration framework dubbed NASA, which enables automated multiplication-reduced DNN development and integrates a dedicated multiplication-reduced accelerator for boosting DNNs’ achievable efficiency. Specifically, NASA adopts neural architecture search (NAS) spaces that augment the state-of-the-art one with hardware inspired multiplication-free operators, such as shift and adder, armed with a novel progressive pretrain strategy (PGP) together with customized training recipes to automatically search for optimal multiplication-reduced DNNs; On top of that, NASA further develops a dedicated accelerator, which advocates a chunk-based template and auto-mapper dedicated for NASA-NAS resulting DNNs to better leverage their algorithmic properties for boosting hardware efficiency. Experimental results and ablation studies consistently validate the advantages of NASA’s algorithm-hardware co-design framework in terms of achievable accuracy and efficiency tradeoffs. Codes are available at

Personalized Heterogeneity-Aware Federated Search Towards Better Accuracy and Energy Efficiency

  • Zhao Yang
  • Qingshuang Sun

Federated learning (FL), a new distributed technology, allows us to train the global model on the edge and embedded devices without local data sharing. However, due to the wide distribution of different types of devices, FL faces severe heterogeneity issues. The accuracy and efficiency of FL deployment at the edge are severely impacted by heterogeneous data and heterogeneous systems. In this paper, we perform joint FL model personalization for heterogeneous systems and heterogeneous data to address the challenges posed by heterogeneities. We begin by using model inference efficiency as a starting point to personalize network scale on each node. Furthermore, it can be used to guide the efficient FL training process, which can help to ease the problem of straggler devices and improve FL’s energy efficiency. During FL training, federated search is then used to acquire highly accurate personalized network structures. By taking into account the unique characteristics of FL deployment at edge devices, the personalized network structures obtained by our federated search framework with a lightweight search controller can achieve competitive accuracy with state-of-the-art (SOTA) methods, while reducing inference and training energy consumption by up to 3.57× and 1.82×, respectively.

SESSION: Reconfigurable Computing: Accelerators and Methodologies I (Virtual)

Session details: Reconfigurable Computing: Accelerators and Methodologies I (Virtual)

  • Cheng Tan

Towards High Performance and Accurate BNN Inference on FPGA with Structured Fine-Grained Pruning

  • Keqi Fu
  • Zhi Qi
  • Jiaxuan Cai
  • Xulong Shi

As the extreme case of quantization networks, Binary Neural Networks (BNNs) have received tremendous attention due to many hardware-friendly properties in terms of storage and computation. To reach the limit of compact models, we attempt to combine binarization with pruning techniques, further exploring the redundancy of BNNs. However, coarse-grained pruning methods may cause severe accuracy drops, while traditional fine-grained ones induce irregular sparsity that is hard for hardware to exploit. In this paper, we propose two advanced fine-grained BNN pruning modules, i.e., structured channel-wise kernel pruning and dynamic spatial pruning, from a joint perspective of algorithm and hardware. The pruned BNN models are trained from scratch and present not only a higher precision but also a high degree of parallelism. Then, we develop an accelerator architecture that can effectively exploit the sparsity caused by our algorithm. Finally, we implement the pruned BNN models on an embedded FPGA (Ultra96v2). The results show that our software and hardware codesign achieves a 5.4× inference speedup over the baseline BNN, with higher resource and energy efficiency compared with prior FPGA-implemented BNN works.

Towards High-Quality CGRA Mapping with Graph Neural Networks and Reinforcement Learning

  • Yan Zhuang
  • Zhihao Zhang
  • Dajiang Liu

Coarse-Grained Reconfigurable Architecture (CGRA) is a promising solution to accelerate domain applications due to its good combination of energy efficiency and flexibility. Loops, as computation-intensive parts of applications, are often mapped onto CGRA, and modulo scheduling is commonly used to improve the execution performance. However, the actual performance using modulo scheduling is highly dependent on the mapping ability of the Data Dependency Graph (DDG) extracted from a loop. As existing approaches usually separate routing exploration of multi-cycle dependence from mapping for fast compilation, they may easily suffer from poor mapping quality. In this paper, we integrate the routing explorations into the mapping process, giving it more opportunities to find a globally optimized solution. Meanwhile, with a reduced resource graph defined, the search space of the new mapping problem is not greatly increased. To efficiently solve the problem, we introduce graph neural network based reinforcement learning to predict a placement distribution over different resource nodes for all operations in a DDG. Using the routing connectivity as the reward signal, we optimize the parameters of the neural network to find a valid mapping solution with a policy gradient method. Without much engineering and heuristic designing, our approach achieves 1.57× mapping quality, as compared to the state-of-the-art heuristic.

SESSION: Hardware Security: Attacks and Countermeasures (Virtual)

Session details: Hardware Security: Attacks and Countermeasures (Virtual)

  • Johann Knechtel
  • Lejla Batina

Attack Directories on ARM big.LITTLE Processors

  • Zili Kou
  • Sharad Sinha
  • Wenjian He
  • Wei Zhang

Eviction-based cache side-channel attacks take advantage of inclusive cache hierarchies and shared cache hardware. Processors with the template ARM big.LITTLE architecture do not guarantee such preconditions and therefore will not usually allow cross-core attacks let alone cross-cluster attacks. This work reveals a new side-channel based on the snoop filter (SF), an unexplored directory structure embedded in template ARM big.LITTLE processors. Our systematic reverse engineering unveils the undocumented structure and property of the SF, and we successfully utilize it to bootstrap cross-core and cross-cluster cache eviction. We demonstrate a comprehensive methodology to exploit the SF side-channel, including the construction of eviction sets, the covert channel, and attacks against RSA and AES. When attacking TrustZone, we conduct an interrupt-based side-channel attack to extract the key of RSA by a single profiling trace, despite the strict cache clean defense. Supported by detailed experiments, the SF side-channel not only achieves competitive performance but also overcomes the main challenge of cache side-channel attacks on ARM big.LITTLE processors.

AntiSIFA-CAD: A Framework to Thwart SIFA at the Layout Level

  • Rajat Sadhukhan
  • Sayandeep Saha
  • Debdeep Mukhopadhyay

Fault Attacks (FA) have gained a lot of attention from both industry and academia due to their practicality and wide applicability to different domains of computing. In the context of symmetric-key cryptography, designing countermeasures against FA is still an open problem. Recently proposed attacks such as Statistical Ineffective Fault Analysis (SIFA) have shown that merely adding redundancy or infection-based countermeasures to detect the fault doesn’t work, and a proper combination of masking and error correction/detection is required. In this work, we show that masking, which is mathematically established as a good countermeasure against a certain class of SIFA faults, may in practice fall short if low-level details during physical design layout development are not taken care of. We initiate this study by demonstrating a successful SIFA attack on a post placed-and-routed masked crypto design for an ASIC platform. We then propose a fully automated approach, along with a proper choice of placement constraints, which can be realized easily in any commercial CAD tool to eliminate this vulnerability during the physical layout development process. Experimental validation of our tool flow on a masked implementation of the PRESENT cipher establishes our claim.

SESSION: Advanced VLSI Routing and Layout Learning

Session details: Advanced VLSI Routing and Layout Learning

  • Wing-Kai Chow
  • David Chinnery

A Stochastic Approach to Handle Non-Determinism in Deep Learning-Based Design Rule Violation Predictions

  • Rongjian Liang
  • Hua Xiang
  • Jinwook Jung
  • Jiang Hu
  • Gi-Joon Nam

Deep learning is a promising approach to early DRV (Design Rule Violation) prediction. However, non-deterministic parallel routing hampers model training and degrades prediction accuracy. In this work, we propose a stochastic approach, called LGC-Net, to solve this problem. In this approach, we develop new techniques of Gaussian random field layer and focal likelihood loss function to seamlessly integrate Log Gaussian Cox process with deep learning. This approach provides not only statistical regression results but also classification ones with different thresholds without retraining. Experimental results with noisy training data on industrial designs demonstrate that LGC-Net achieves significantly better accuracy of DRV density prediction than prior arts.

Obstacle-Avoiding Multiple Redistribution Layer Routing with Irregular Structures

  • Yen-Ting Chen
  • Yao-Wen Chang

In advanced packages, redistribution layers (RDLs) are extra metal layers for high interconnections among the chips and printed circuit board (PCB). To better utilize the routing resources of RDLs, published works adopted flexible vias such that they can place the vias everywhere. Furthermore, some regions may be blocked for signal integrity protection or manually prerouted nets (such as power/ground nets or feeding lines of antennas) to achieve higher performance. These blocked regions will be treated as obstacles in the routing process. Since the positions of pads, obstacles, and vias can be arbitrary, the structures of RDLs become irregular. The obstacles and irregular structures substantially increase the difficulty of the routing process. This paper proposes a three-stage algorithm: First, the layout is partitioned by a method based on constrained Delaunay triangulation (CDT). Then we present a global routing graph model and generate routing guides for unified-assignment netlists. Finally, a novel tile routing method is developed to obtain detailed routes. Experiment results demonstrate the robustness and effectiveness of our proposed algorithm.

TAG: Learning Circuit Spatial Embedding from Layouts

  • Keren Zhu
  • Hao Chen
  • Walker J. Turner
  • George F. Kokai
  • Po-Hsuan Wei
  • David Z. Pan
  • Haoxing Ren

Analog and mixed-signal (AMS) circuit designs still rely on human design expertise. Machine learning has been assisting circuit design automation by replacing human experience with artificial intelligence. This paper presents TAG, a new paradigm of learning the circuit representation from layouts leveraging Text, self Attention and Graph. The embedding network model learns spatial information without manual labeling. We introduce text embedding and a self-attention mechanism to AMS circuit learning. Experimental results demonstrate the ability to predict layout distances between instances with industrial FinFET technology benchmarks. The effectiveness of the circuit representation is verified by showing the transferability to three other learning tasks with limited data in the case studies: layout matching prediction, wirelength estimation, and net parasitic capacitance prediction.

SESSION: Physical Attacks and Countermeasures

Session details: Physical Attacks and Countermeasures

  • Satwik Patnaik
  • Gang Qu

PowerTouch: A Security Objective-Guided Automation Framework for Generating Wired Ghost Touch Attacks on Touchscreens

  • Huifeng Zhu
  • Zhiyuan Yu
  • Weidong Cao
  • Ning Zhang
  • Xuan Zhang

The wired ghost touch attacks are the emerging and severe threats against modern touchscreens. The attackers can make touchscreens falsely report nonexistent touches (i.e., ghost touches) by injecting common-mode noise (CMN) into the target devices via power cables. Existing attacks rely on reverse-engineering the touchscreens, then manually crafting the CMN waveforms to control the types and locations of ghost touches. Although successful, they are limited in practicality and attack capability due to the touchscreens’ black-box nature and the immense search space of attack parameters. To overcome the above limitations, this paper presents PowerTouch, a framework that can automatically generate wired ghost touch attacks. We adopt a software-hardware co-design approach and propose a domain-specific genetic algorithm-based method that is tailored to account for the characteristics of the CMN waveform. Based on the security objectives, our framework automatically optimizes the CMN waveform towards injecting the desired type of ghost touches into regions specified by attackers. The effectiveness of PowerTouch is demonstrated by successfully launching attacks on touchscreen devices from two different brands given nine different objectives. Compared with the state-of-the-art attack, we seminally achieve controlling taps on an extra dimension and injecting swipes on both dimensions. We can place an average of 84.2% taps on the targeted side of the screen, with the location error in the other dimension no more than 1.53mm. An average of 94.5% of injected swipes with correct directions is also achieved. The quantitative comparison with the state-of-the-art method shows that a better attack performance can be achieved by PowerTouch.

A Combined Logical and Physical Attack on Logic Obfuscation

  • Michael Zuzak
  • Yuntao Liu
  • Isaac McDaniel
  • Ankur Srivastava

Logic obfuscation protects integrated circuits from an untrusted foundry attacker during manufacturing. To counter obfuscation, a number of logical (e.g. Boolean satisfiability) and physical (e.g. electro-optical probing) attacks have been proposed. By definition, these attacks use only a subset of the information leaked by a circuit to unlock it. Countermeasures often exploit the resulting blind-spots to thwart these attacks, limiting their scalability and generalizability. To overcome this, we propose a combined logical and physical attack against obfuscation called the CLAP attack. The CLAP attack leverages both the logical and physical properties of a locked circuit to prune the keyspace in a unified and theoretically-rigorous fashion, resulting in a more versatile and potent attack. To formulate the physical portion of the CLAP attack, we derive a logical formulation that provably identifies input sequences capable of sensitizing logically expressive regions in a circuit. We prove that electro-optically probing these regions infers portions of the key. For the logical portion of the attack, we integrate the physical attack results into a Boolean satisfiability attack to find the correct key. We evaluate the CLAP attack by launching it against four obfuscation schemes in benchmark circuits. The physical portion of the attack fully specified 60.6% of key bits and partially specified another 10.3%. The logical portion of the attack found the correct key in the physical-attack-limited keyspace in under 30 minutes. Thus, the CLAP attack unlocked each circuit despite obfuscation.

A Pragmatic Methodology for Blind Hardware Trojan Insertion in Finalized Layouts

  • Alexander Hepp
  • Tiago Perez
  • Samuel Pagliarini
  • Georg Sigl

A potential vulnerability for integrated circuits (ICs) is the insertion of hardware trojans (HTs) during manufacturing. Understanding the practicability of such an attack can lead to appropriate measures for mitigating it. In this paper, we demonstrate a pragmatic framework for analyzing HT susceptibility of finalized layouts. Our framework is representative of a fabrication-time attack, where the adversary is assumed to have access only to a layout representation of the circuit. The framework inserts trojans into tapeout-ready layouts utilizing an Engineering Change Order (ECO) flow. The attacked security nodes are blindly searched utilizing reverse-engineering techniques. For our experimental investigation, we utilized three crypto-cores (AES-128, SHA-256, and RSA) and a microcontroller (RISC-V) as targets. We explored 96 combinations of triggers, payloads and targets for our framework. Our findings demonstrate that even in high-density designs, the covert insertion of sophisticated trojans is possible while maintaining the original target logic, with minimal impact on power and performance. Furthermore, from our exploration, we conclude that it is too naive to only utilize placement resources as a metric for HT vulnerability. This work highlights that the HT insertion success is a complex function of the placement, routing resources, the position of the attacked nodes, and further design-specific characteristics. As a result, our framework goes beyond just an attack: we present the most advanced analysis tool to assess the vulnerability of HT insertion into finalized layouts.

SESSION: Tutorial: Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

Session details: Tutorial: Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

  • Rolf Drechsler

Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

  • Rolf Drechsler
  • Alireza Mahzoon

Recently, a lot of effort has been put into developing formal verification approaches by both academic and industrial research. In practice, these techniques often give satisfying results for some types of circuits, while they fail for others. A major challenge in this domain is that the verification techniques suffer from unpredictability in their performance. The only way to overcome this challenge is the calculation of bounds for the space and time complexities. If a verification method has polynomial space and time complexities, scalability can be guaranteed.

In this tutorial paper, we review recent developments in formal verification techniques and give a comprehensive overview of Polynomial Formal Verification (PFV). In PFV, polynomial upper bounds for the run-time and memory needed during the entire verification task hold. Thus, correctness under resource constraints can be ensured. We discuss the importance and advantages of PFV in the design flow. Formal methods on the bit-level and the word-level, and their complexities when used to verify different types of circuits, like adders, multipliers, or ALUs are presented. The current status of this new research field and directions for future work are discussed.

SESSION: Scalable Verification Technologies

Session details: Scalable Verification Technologies

  • Viraphol Chaiyakul
  • Alex Orailoglu

Arjun: An Efficient Independent Support Computation Technique and its Applications to Counting and Sampling

  • Mate Soos
  • Kuldeep S. Meel

Given a Boolean formula ϕ over the set of variables X and a projection set P ⊆ X, a subset I ⊆ P is an independent support of P if any two solutions of ϕ that agree on I also agree on P. The notion of independent support is related to the classical notion of definability dating back to 1901, and has been studied over the decades. Recently, the computational problem of determining an independent support for a given formula has gained importance owing to the crucial role of independent support in hashing-based counting and sampling techniques.

In this paper, we design an efficient and scalable independent support computation technique that can handle formulas arising from real-world benchmarks. Our algorithmic framework, called Arjun, employs implicit and explicit definability notions, and is based on a tight integration of gate-identification techniques and an assumption-based framework. We demonstrate that augmenting the state-of-the-art model counter ApproxMC4 and sampler UniGen3 with Arjun leads to significant performance improvements. In particular, ApproxMC4 augmented with Arjun counts 576 more benchmarks out of 1896 while UniGen3 augmented with Arjun samples 335 more benchmarks within the same time limit.
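The definition of independent support can be made concrete with a brute-force check on a toy formula. The sketch below is purely illustrative and is not Arjun's algorithm: it enumerates all assignments, which is only feasible for a handful of variables. The helper names (`solutions`, `is_independent_support`) and the example formula `phi` are assumptions introduced here.

```python
from itertools import product


def solutions(formula, variables):
    """Enumerate all satisfying assignments of `formula` by brute force."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            yield assignment


def is_independent_support(formula, variables, projection, candidate):
    """Return True if any two solutions agreeing on `candidate` agree on `projection`."""
    seen = {}  # values on candidate -> values on projection
    for sol in solutions(formula, variables):
        key = tuple(sol[v] for v in candidate)
        proj = tuple(sol[v] for v in projection)
        if seen.setdefault(key, proj) != proj:
            return False  # same values on candidate, different values on projection
    return True


# Hypothetical toy formula: c is defined as (a XOR b).
def phi(s):
    return s["c"] == (s["a"] != s["b"])


VARIABLES = ["a", "b", "c"]
PROJECTION = ["a", "b", "c"]

ab_ok = is_independent_support(phi, VARIABLES, PROJECTION, ["a", "b"])
a_ok = is_independent_support(phi, VARIABLES, PROJECTION, ["a"])
```

Here `{a, b}` is an independent support because `c` is defined by `a` and `b`, whereas `{a}` alone is not, since `b` remains free.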

Compositional Verification Using a Formal Component and Interface Specification

  • Yue Xing
  • Huaixi Lu
  • Aarti Gupta
  • Sharad Malik

Property-based specification such as SystemVerilog Assertions (SVA) uses mathematical logic to specify the temporal behavior of RTL designs which can then be formally verified using model checking algorithms. These properties are specified for a single component (which may contain other components in the design hierarchy). Composing design components that have already been verified requires additional verification since incorrect communication at their interface may invalidate the properties that have been checked for the individual components. This paper focuses on a specification for their interface which can be checked individually for each component, and which guarantees that refinement-based properties checked for each component continue to hold after their composition. We do this in the setting of the Instruction-level Abstraction (ILA) specification and verification methodology. The ILA methodology provides a uniform specification for processors, accelerators and general modules at the instruction-level, and the automatic generation of a complete set of correctness properties for checking that the RTL model is a refinement of the ILA specification. We add an interface specification to model the inter-ILA communication. Further, we use our interface specification to generate a set of interface checking properties that check that the communication between the RTL components is correct. This provides the following guarantee: if each RTL component is a refinement of its ILA specification and the interface checks pass, then the RTL composition is a refinement of the ILA composition. We have applied the proposed methodology to six case studies including parts of large-scale designs such as parts of the FlexASR and NVDLA machine learning accelerators, demonstrating the practical applicability of our method.

Usage-Based RTL Subsetting for Hardware Accelerators

  • Qinhan Tan
  • Aarti Gupta
  • Sharad Malik

Recent years have witnessed increasing use of domain-specific accelerators in computing platforms to provide power-performance efficiency for emerging applications. To increase their applicability within the domain, these accelerators tend to support a large set of functions, e.g. Nvidia’s open-source Deep Learning Accelerator, NVDLA, supports five distinct groups of functions [17]. However, an individual use case of an accelerator may utilize only a subset of these functions. The unused functions lead to unnecessary overhead of silicon area, power, and hardware verification/hardware-software co-verification complexity. This motivates our research question: Given an RTL design for an accelerator and a subset of functions of interest, can we automatically extract a subset of the RTL that is sufficient for these functions and sequentially equivalent to the original RTL? We call this the Usage-based RTL Subsetting problem, referred to as the RTL subsetting problem in short. We first formally define this problem and show that it can be formulated as a program synthesis problem, which can be solved by performing expensive hyperproperty checks. To overcome the high cost, we propose multiple levels of sound over-approximations to construct an effective algorithm based on relatively less expensive temporal property checking and taint analysis for information flow checking. We demonstrate the acceptable computation cost and the quality of the results of our algorithm through several case studies of accelerators from different domains. The applicability of our proposed algorithm can be seen in its ability to subset the large NVDLA accelerator (with over 50,000 registers and 1,600,000 gates) for the group of convolution functions, where the subset reduces the total number of registers by 18.6% and the total number of gates by 37.1%.

SESSION: Optimizing Digital Design Aspects: From Gate Sizing to Multi-Bit Flip-Flops

Session details: Optimizing Digital Design Aspects: From Gate Sizing to Multi-Bit Flip-Flops

  • Amit Gupta
  • Kerim Kalafala

TransSizer: A Novel Transformer-Based Fast Gate Sizer

  • Siddhartha Nath
  • Geraldo Pradipta
  • Corey Hu
  • Tian Yang
  • Brucek Khailany
  • Haoxing Ren

Gate sizing is a fundamental netlist optimization move and researchers have used supervised learning-based models in gate sizers. Recently, Reinforcement Learning (RL) has been tried for sizing gates (and other EDA optimization problems), but RL-based approaches are very runtime-intensive. In this work, we explore a novel Transformer-based gate sizer, TransSizer, to directly generate optimized gate sizes given a placed and unoptimized netlist. TransSizer is trained on datasets obtained from real tapeout-quality industrial designs in a foundry 5nm technology node. Our results indicate that TransSizer achieves 97% accuracy in predicting optimized gate sizes at the post-route optimization stage. Furthermore, TransSizer has a speedup of ~1400× while delivering similar timing, power and area metrics when compared to a leading-edge commercial tool for sizing-only optimization.

Generation of Mixed-Driving Multi-Bit Flip-Flops for Power Optimization

  • Meng-Yun Liu
  • Yu-Cheng Lai
  • Wai-Kei Mak
  • Ting-Chi Wang

Multi-bit flip-flops (MBFFs) are often used to reduce the number of clock sinks, resulting in a low-power design. A traditional MBFF is composed of individual FFs of uniform driving strength. However, if some but not all of the bits of an MBFF violate timing constraints, the MBFF has to be sized up or decomposed into smaller bit-width combinations to satisfy timing, which reduces the power saving. In this paper, we present a new MBFF generation approach considering mixed-driving MBFFs whose certain bits have a higher driving strength than the other bits. To maximize the FF merging rate (and hence to minimize the final amount of clock sinks), our approach will first perform aggressive FF merging subject to timing constraints. Our merging is aggressive in the sense that we are willing to possibly oversize some FFs and allow the presence of empty bits in an MBFF to merge FFs into MBFFs of uniform driving strengths as much as possible. The oversized individual FFs of an MBFF will be later downsized subject to timing constraints by our approach, which results in a mixed-driving MBFF. Our MBFF generation approach has been combined with a commercial place and route tool, and our experimental results show the superiority of our approach over a prior work that considers uniform-driving MBFFs only in terms of the clock sink count, the FF power, the clock buffer count, and the routed clock wirelength.

DEEP: Developing Extremely Efficient Runtime On-Chip Power Meters

  • Zhiyao Xie
  • Shiyu Li
  • Mingyuan Ma
  • Chen-Chia Chang
  • Jingyu Pan
  • Yiran Chen
  • Jiang Hu

Accurate and efficient on-chip power modeling is crucial to runtime power, energy, and voltage management. Such power monitoring can be achieved by designing and integrating on-chip power meters (OPMs) into the target design. In this work, we propose a new method named DEEP to automatically develop extremely efficient OPM solutions for a given design. DEEP selects OPM inputs from all individual bits in RTL signals. Such bit-level selection provides an unprecedentedly large number of input candidates and supports lower hardware cost, compared with signal-level selection in prior works. In addition, DEEP proposes a powerful two-step OPM input selection method, and it supports reporting both total power and the power of major design components. Experiments on a commercial microprocessor demonstrate that DEEP’s OPM solution achieves correlation R > 0.97 in per-cycle power prediction with an unprecedentedly low area overhead on hardware, i.e., < 0.1% of the microprocessor layout. This reduces the OPM hardware cost by 4× to 6× compared with the state-of-the-art solution.

SESSION: Energy Efficient Hardware Acceleration and Stochastic Computing

Session details: Energy Efficient Hardware Acceleration and Stochastic Computing

  • Sunil Khatri
  • Anish Krishnakumar

ReD-LUT: Reconfigurable In-DRAM LUTs Enabling Massive Parallel Computation

  • Ranyang Zhou
  • Arman Roohi
  • Durga Misra
  • Shaahin Angizi

In this paper, we propose a reconfigurable processing-in-DRAM architecture named ReD-LUT that leverages the high density of commodity main memory to enable flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication and division) using only memory read operations. In addition, ReD-LUT enables bulk bit-wise in-memory logic by elevating the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally-intensive applications, i.e., low-precision deep learning acceleration, and the Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with the GPU and achieves ~37.8× speedup and 2.1× energy-efficiency over the best in-DRAM bit-wise accelerators. As for AES data-encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.

Sparse-T: Hardware Accelerator Thread for Unstructured Sparse Data Processing

  • Pranathi Vasireddy
  • Krishna Kavi
  • Gayatri Mehta

Sparse matrix-dense vector (SpMV) multiplication is inherent in most scientific, neural network, and machine learning algorithms. To efficiently exploit sparsity of data in SpMV computations, several compressed data representations have been used. However, compressed data representations of sparse data can result in overheads of locating nonzero values, requiring indirect memory accesses, which increase instruction count and memory access delays. We refer to these translations of compressed representations as metadata processing. We propose a memory-side accelerator that performs metadata (or indexing) computations and supplies only the required nonzero values to the processor, additionally permitting an overlap of indexing with core computations on nonzero elements. In this contribution, we target our accelerator at low-end micro-controllers with very limited memory and processing capabilities. We explore two dedicated ASIC designs of the proposed accelerator that handle the indexed memory accesses for the compressed sparse row (CSR) format, working alongside a simple RISC-like programmable core. One version of the accelerator supplies only vector values corresponding to nonzero matrix values and the second version supplies both nonzero matrix and matching vector values for SpMV computations. Our experiments show speedups ranging between 1.3 and 2.1 times for SpMV for different levels of sparsity. Our accelerator also results in energy savings ranging between 15.8% and 52.7% over different matrix sizes, when compared to the baseline system with the primary RISC-V core performing all computations. We use smaller synthetic matrices with different sparsity levels and larger real-world matrices with higher sparsity (below 1% non-zeros) in our experimental evaluations.
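To illustrate the indirect memory accesses that the abstract calls metadata processing, the following is a minimal software sketch of SpMV over the CSR format. It shows the standard CSR kernel only, not the proposed accelerator; the function name `spmv_csr` and the example matrix are assumptions introduced here.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in compressed sparse row (CSR) form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # row_ptr delimits the slice of nonzeros belonging to this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access: the column index is loaded first, then used
            # to fetch the matching vector element.
            y[row] += values[k] * x[col_idx[k]]
    return y


# Example matrix (hypothetical):
# [[5, 0, 0],
#  [0, 0, 2],
#  [1, 0, 3]]
row_ptr = [0, 1, 2, 4]
col_idx = [0, 2, 0, 2]
values = [5.0, 2.0, 1.0, 3.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
```

The lookup `x[col_idx[k]]` is the dependent load that the abstract's first accelerator version would supply to the core directly.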

Sound Source Localization Using Stochastic Computing

  • Peter Schober
  • Seyedeh Newsha Estiri
  • Sercan Aygun
  • Nima TaheriNejad
  • M. Hassan Najafi

Stochastic computing (SC) is an alternative computing paradigm that processes data in the form of long uniform bit-streams rather than conventional compact weighted binary numbers. SC is fault-tolerant and can compute on small, efficient circuits, promising advantages over conventional arithmetic for smaller computer chips. SC has been primarily used in scientific research, not in practical applications. Digital sound source localization (SSL) is a useful signal processing technique that locates speakers using multiple microphones in cell phones, laptops, and other voice-controlled devices. SC has not been integrated into SSL in practice or theory. In this work, for the first time to the best of our knowledge, we implement an SSL algorithm in the stochastic domain and develop a functional SC-based sound source localizer. The developed design can replace the conventional design of the algorithm. The practical part of this work shows that the proposed stochastic circuit does not rely on conventional analog-to-digital conversion and can process data in the form of pulse-width-modulated (PWM) signals. The proposed SC design consumes up to 39% less area than the conventional baseline design. The SC-based design can consume less power depending on the computational accuracy, for example, 6% less power consumption for 3-bit inputs. The presented stochastic circuit is not limited to SSL and is readily applicable to other practical applications such as radar ranging, wireless location, sonar direction finding, beamforming, and sensor calibration.
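As background on how SC computes on bit-streams, the sketch below shows the textbook unipolar multiplication, where a value in [0, 1] is encoded as the probability of a 1 in the stream and a bitwise AND of two independent streams approximates the product. This is a generic illustration and not the SSL circuit of the paper; the function names and the stream length are assumptions chosen here.

```python
import random


def to_bitstream(value, length, rng):
    """Encode a value in [0, 1] as a random bit-stream with P(bit = 1) = value."""
    return [1 if rng.random() < value else 0 for _ in range(length)]


def from_bitstream(bits):
    """Decode a unipolar bit-stream as the fraction of 1s."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [a & b for a, b in zip(a_bits, b_bits)]


rng = random.Random(0)  # fixed seed so the run is repeatable
a_bits = to_bitstream(0.5, 20000, rng)
b_bits = to_bitstream(0.3, 20000, rng)
estimate = from_bitstream(sc_multiply(a_bits, b_bits))  # approximates 0.5 * 0.3
```

The estimate is statistical: longer streams reduce the error, which is the accuracy-versus-cost tradeoff the abstract alludes to.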

SESSION: Special Session: Approximate Computing and the Efficient Machine Learning Expedition

Session details: Special Session: Approximate Computing and the Efficient Machine Learning Expedition

  • Mehdi Tahoori

Approximate Computing and the Efficient Machine Learning Expedition

  • Jörg Henkel
  • Hai Li
  • Anand Raghunathan
  • Mehdi B. Tahoori
  • Swagath Venkataramani
  • Xiaoxuan Yang
  • Georgios Zervakis

Approximate computing (AxC) has long been accepted as a design alternative for efficient system implementation at the cost of relaxed accuracy requirements. Despite AxC research activities in various application domains, AxC has thrived over the past decade since it was applied to Machine Learning (ML). The inherently approximate nature of ML models, together with the increased computational overheads associated with ML applications (which were effectively mitigated by corresponding approximations), led to a perfect match and a fruitful synergy. AxC for AI/ML has moved beyond academic prototypes. In this work, we shed light on the synergistic nature of AxC and ML and elucidate the impact of AxC in designing efficient ML systems. To that end, we present an overview and taxonomy of AxC for ML and use two descriptive application scenarios to demonstrate how AxC boosts the efficiency of ML systems.

SESSION: Co-Search Methods and Tools

Session details: Co-Search Methods and Tools

  • Cunxi Yu
  • Yingyan “Celine” Lin

ObfuNAS: A Neural Architecture Search-Based DNN Obfuscation Approach

  • Tong Zhou
  • Shaolei Ren
  • Xiaolin Xu

Malicious architecture extraction has been emerging as a crucial concern for deep neural network (DNN) security. As a defense, architecture obfuscation is proposed to remap the victim DNN to a different architecture. Nonetheless, we observe that, by only extracting an obfuscated DNN architecture, the adversary can still retrain a substitute model with high performance (e.g., accuracy), rendering the obfuscation techniques ineffective. To mitigate this under-explored vulnerability, we propose ObfuNAS, which converts the DNN architecture obfuscation into a neural architecture search (NAS) problem. Using a combination of function-preserving obfuscation strategies, ObfuNAS ensures that the obfuscated DNN architecture can only achieve lower accuracy than the victim. We validate the performance of ObfuNAS with open-source architecture datasets like NAS-Bench-101 and NAS-Bench-301. The experimental results demonstrate that ObfuNAS can successfully find the optimal mask for a victim model within a given FLOPs constraint, leading to up to 2.6% inference accuracy degradation for attackers with only 0.14× FLOPs overhead. The code is available at:

Deep Learning Toolkit-Accelerated Analytical Co-Optimization of CNN Hardware and Dataflow

  • Rongjian Liang
  • Jianfeng Song
  • Yuan Bo
  • Jiang Hu

The continuous growth of CNN complexity not only intensifies the need for hardware acceleration but also presents a huge challenge. That is, the solution space for CNN hardware design and dataflow mapping becomes enormously large, besides the fact that it is discrete and lacks a well-behaved structure. Most previous works either use stochastic metaheuristics, such as genetic algorithms, which are typically very slow for solving large problems, or rely on expensive sampling, e.g., Gumbel Softmax-based differentiable optimization and Bayesian optimization. We propose an analytical model for evaluating power and performance of CNN hardware design and dataflow solutions. Based on this model, we introduce a co-optimization method consisting of nonlinear programming and parallel local search. A key innovation in this model is its matrix form, which enables the use of a deep learning toolkit for highly efficient computations of power/performance values and gradients in the optimization. In handling power-performance tradeoff, our method can lead to better solutions than minimizing a weighted sum of power and latency. The average relative error of our model compared with Timeloop is as small as 1%. Compared to state-of-the-art methods, our approach achieves solutions with up to 1.7× shorter inference latency, 37.5% less power consumption, and 3× less area on ResNet 18. Moreover, it provides a 6.2× speedup of optimization runtime.

HDTorch: Accelerating Hyperdimensional Computing with GP-GPUs for Design Space Exploration

  • William Andrew Simon
  • Una Pale
  • Tomas Teijeiro
  • David Atienza

The HyperDimensional Computing (HDC) Machine Learning (ML) paradigm is highly interesting for applications involving continuous, semi-supervised learning for long-term monitoring. However, its accuracy is not yet on par with other ML approaches, necessitating frameworks enabling fast HDC algorithm design space exploration. To this end, we introduce HDTorch, an open-source, PyTorch-based HDC library with CUDA extensions for hypervector operations. We demonstrate HDTorch’s utility by analyzing four HDC benchmark datasets in terms of accuracy, runtime, and memory consumption, utilizing both classical and online HD training methodologies. We demonstrate average (training)/inference speedups of (111x/68x)/87x for classical/online HD, respectively. We also demonstrate how HDTorch enables exploration of HDC strategies applied to large, real-world datasets. We perform the first-ever HD training and inference analysis of the entirety of the CHB-MIT EEG epilepsy database. Results show that the typical approach of training on a subset of the data may not generalize to the entire dataset, an important factor when developing future HD models for medical wearable devices.

SESSION: Reconfigurable Computing: Accelerators and Methodologies II

Session details: Reconfigurable Computing: Accelerators and Methodologies II

  • Peipei Zhou

DARL: Distributed Reconfigurable Accelerator for Hyperdimensional Reinforcement Learning

  • Hanning Chen
  • Mariam Issa
  • Yang Ni
  • Mohsen Imani

Reinforcement Learning (RL) is a powerful technology to solve decision-making problems such as robotics control. Modern RL algorithms, e.g., Deep Q-Learning, are based on costly and resource-hungry deep neural networks. This motivates us to deploy alternative models for powering RL agents on edge devices. Recently, brain-inspired Hyper-Dimensional Computing (HDC) has been introduced as a promising solution for lightweight and efficient machine learning, particularly for classification.

In this work, we develop a novel platform capable of real-time hyperdimensional reinforcement learning. Our heterogeneous CPU-FPGA platform, called DARL, maximizes the FPGA’s computing capabilities by applying hardware optimizations to hyperdimensional computing’s critical operations, including a hardware-friendly encoder IP, hypervector chunk fragmentation, and delayed model update. Aside from hardware innovation, we also extend the platform from basic single-agent RL to support multi-agent distributed learning. We evaluate the effectiveness of our approach on OpenAI Gym tasks. Our results show that the FPGA platform provides on average 20× speedup compared to current state-of-the-art hyperdimensional RL methods running on an Intel Xeon 6226 CPU. In addition, DARL is around 4.8× faster and 4.2× more energy-efficient than the state-of-the-art RL accelerator while ensuring a better or comparable quality of learning.

Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping

  • Carl-Johannes Johnsen
  • Tiziano De Matteis
  • Tal Ben-Nun
  • Johannes de Fine Licht
  • Torsten Hoefler

The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed in high levels of abstraction, such as HLS, requiring the use of hand-optimized RTL. In this paper we show how to leverage multiple clock domains for computational subdomains on reconfigurable devices through data movement analysis on high-level programs. We offer a novel view on multi-pumping as a compiler optimization: a superclass of traditional vectorization. As multiple data elements are fed and consumed, the computations are packed temporally rather than spatially. The optimization is applied automatically using an intermediate representation that maps high-level code to HLS. Internally, the optimization injects modules into the generated designs, incorporating RTL for fine-grained control over the clock domains. We obtain a reduction of resource consumption by up to 50% on critical components and 23% on average. For scalable designs, this can enable further parallelism, increasing overall performance.
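The contrast between packing computations spatially and temporally can be sketched with a small software model. This is a conceptual illustration only, not the compiler transformation or the generated hardware of the paper; the function names, the `pump` parameter, and the cycle accounting are assumptions introduced here.

```python
def spatially_vectorized(data, width, op):
    """Model `width` compute units that all fire once per slow clock cycle."""
    out, slow_cycles = [], 0
    for i in range(0, len(data), width):
        out.extend(op(x) for x in data[i:i + width])  # all lanes in parallel
        slow_cycles += 1
    return out, slow_cycles, width  # width == number of compute units


def temporally_vectorized(data, width, pump, op):
    """Model `width // pump` units clocked `pump` times faster than the system."""
    if width % pump != 0:
        raise ValueError("width must be a multiple of the pump factor")
    units = width // pump
    out, slow_cycles = [], 0
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        for fast in range(pump):  # `pump` fast cycles fit in one slow cycle
            out.extend(op(x) for x in chunk[fast * units:(fast + 1) * units])
        slow_cycles += 1
    return out, slow_cycles, units


data = list(range(8))
square = lambda v: v * v

spatial = spatially_vectorized(data, 4, square)
temporal = temporally_vectorized(data, 4, 2, square)
```

In this model both variants consume the same number of elements per slow cycle and produce identical results, but the temporal variant uses half as many compute units when the pump factor is 2.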

SESSION: Compute-in-Memory for Neural Networks

Session details: Compute-in-Memory for Neural Networks

  • Bo Yuan

ISSA: Input-Skippable, Set-Associative Computing-in-Memory (SA-CIM) Architecture for Neural Network Accelerators

  • Yun-Chen Lo
  • Chih-Chen Yeh
  • Jun-Shen Wu
  • Chia-Chun Wang
  • Yu-Chih Tsai
  • Wen-Chien Ting
  • Ren-Shuo Liu

Among several emerging architectures, computing in memory (CIM), which features in-situ analog computation, is a potential solution to the data movement bottleneck of the Von Neumann architecture for artificial intelligence (AI). Interestingly, other strengths of CIM, distinct from in-situ analog computation, are not yet widely known. In this work, we point out that mutually stationary vectors (MSVs), which can be maximized by introducing associativity to CIM, are another inherent strength unique to CIM. With MSVs, CIM exhibits significant freedom to dynamically vectorize the stored data (e.g., weights) to perform agile computation using the dynamically formed vectors.

We have designed and realized an SA-CIM silicon prototype and corresponding architecture and acceleration schemes in the TSMC 28 nm process. More specifically, the contributions of this paper are fourfold: 1) We identify MSVs as new features that can be exploited to improve the current performance and energy challenges of the CIM-based hardware. 2) We propose SA-CIM to enhance MSVs for skipping the zeros, small values, and sparse vectors. 3) We propose a transposed systolic dataflow to efficiently conduct conv3×3 while being capable of exploiting input-skipping schemes. 4) We propose a design flow to search for optimal aggressive skipping scheme setups while satisfying the accuracy loss constraint.

The proposed ISSA architecture improves the throughput by 1.91× to 2.97× and the energy efficiency by 2.5× to 4.2×.

Computing-In-Memory Neural Network Accelerators for Safety-Critical Systems: Can Small Device Variations Be Disastrous?

  • Zheyu Yan
  • Xiaobo Sharon Hu
  • Yiyu Shi

Computing-in-Memory (CiM) architectures based on emerging nonvolatile memory (NVM) devices have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, NVM devices suffer from various non-idealities, especially device-to-device variations due to fabrication defects and cycle-to-cycle variations due to the stochastic behavior of devices. As such, the DNN weights actually mapped to NVM devices could deviate significantly from the expected values, leading to large performance degradation. To address this issue, most existing works focus on maximizing average performance under device variations. This objective would work well for general-purpose scenarios. But for safety-critical applications, the worst-case performance must also be considered. Unfortunately, this has been rarely explored in the literature. In this work, we formulate the problem of determining the worst-case performance of CiM DNN accelerators under the impact of device variations. We further propose a method to effectively find the specific combination of device variation in the high-dimensional space that leads to the worst-case performance. We find that even with very small device variations, the accuracy of a DNN can drop drastically, causing concerns when deploying CiM accelerators in safety-critical applications. Finally, we show that surprisingly none of the existing methods used to enhance average DNN performance in CiM accelerators are very effective when extended to enhance the worst-case performance, and further research down the road is needed to address this problem.
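The gap between average-case and worst-case behavior under bounded weight deviations can be seen even in a toy linear scorer. The sketch below is not the method proposed in the paper and does not model a DNN or real NVM devices: it assumes each weight may deviate by at most `th`, samples random deviations to estimate the average decision margin, and uses the closed-form worst case for a linear model. All names and numbers are hypothetical.

```python
import random


def margin(weights, x):
    """Decision margin of a linear scorer; a positive value means correct."""
    return sum(w * xi for w, xi in zip(weights, x))


def sign(v):
    return (v > 0) - (v < 0)


def worst_case_margin(weights, x, th):
    """Shift every weight by th against the sign of its input (linear worst case)."""
    perturbed = [w - th * sign(xi) for w, xi in zip(weights, x)]
    return margin(perturbed, x)


def average_margin(weights, x, th, trials, rng):
    """Estimate the mean margin under uniform random deviations in [-th, th]."""
    total = 0.0
    for _ in range(trials):
        perturbed = [w + rng.uniform(-th, th) for w in weights]
        total += margin(perturbed, x)
    return total / trials


weights = [0.4, -0.3, 0.2, 0.1]  # hypothetical trained weights
x = [1.0, 1.0, 1.0, 1.0]         # hypothetical input, nominal margin 0.4
th = 0.15                        # assumed bound on per-weight deviation

avg = average_margin(weights, x, th, 2000, random.Random(0))
worst = worst_case_margin(weights, x, th)
```

In this example the average margin stays close to the nominal 0.4, while the worst-case combination of deviations drives the margin negative and flips the decision.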

SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist II

Session details: Breakthroughs in Synthesis – Infrastructure and ML Assist II

  • Sunil Khatri
  • Cunxi Yu

Language Equation Solving via Boolean Automata Manipulation

  • Wan-Hsuan Lin
  • Chia-Hsuan Su
  • Jie-Hong R. Jiang

Language equations are a powerful tool for compositional synthesis, modeled as the unknown component problem. Given a (sequential) system specification S and a fixed component F, we are asked to synthesize an unknown component X whose composition with F fulfills S. The synthesis of X can be formulated as language equation solving. Although prior work exploits partitioned representation for effective finite automata manipulation, it remains challenging to solve language equations involving a large number of states. In this work, we propose variants of Boolean automata as the underlying succinct representation for regular languages. They admit logic circuit manipulation and extend the scalability of language equation solving. Experimental results demonstrate the superiority of our method over the state-of-the-art, solving nine more cases out of the 36 studied benchmarks and achieving an average speedup of 740×.

How Good Is Your Verilog RTL Code?: A Quick Answer from Machine Learning

  • Prianka Sengupta
  • Aakash Tyagi
  • Yiran Chen
  • Jiang Hu

Hardware Description Language (HDL) is a common entry point for designing digital circuits. Differences in HDL coding styles and design choices may lead to considerably different design quality and performance-power tradeoffs. In general, the impact of HDL coding is not clear until logic synthesis or even layout is completed. However, running synthesis merely as feedback on HDL code is computationally uneconomical, especially in early design phases when the code needs to be frequently modified. Furthermore, in late stages of design convergence burdened with high-impact engineering change orders (ECOs), design iterations become prohibitively expensive. To this end, we propose a machine learning approach to Verilog-based Register-Transfer Level (RTL) design assessment without going through the synthesis process. It allows designers to quickly evaluate the performance-power tradeoff among different RTL design options. Experimental results show that our proposed technique achieves an average of 95% prediction accuracy in terms of post-placement analysis, and is 6 orders of magnitude faster than evaluation by running logic synthesis and placement.

SESSION: In-Memory Computing Revisited

Session details: In-Memory Computing Revisited

  • Biresh Kumar Joardar
  • Ulf Schlichtmann

Logic Synthesis for Digital In-Memory Computing

  • Muhammad Rashedul Haq Rashed
  • Sumit Kumar Jha
  • Rickard Ewetz

Processing in-memory is a promising strategy for accelerating data-intensive applications. While analog in-memory computing is extremely efficient, its limited precision is only acceptable for approximate computing applications. Digital in-memory computing provides the deterministic precision required to accelerate high-assurance applications. State-of-the-art digital in-memory computing schemes rely on manually decomposing arithmetic operations into in-memory compute kernels. In contrast, traditional digital circuits are synthesized using complex and automated design flows. In this paper, we propose a logic synthesis framework called LOGIC for mapping high-level applications into digital in-memory compute kernels that can be executed using non-volatile memory. We first propose techniques to decompose element-wise arithmetic operations into in-memory kernels while minimizing the number of in-memory operations. Next, the sequence of in-memory operations is optimized to minimize non-volatile memory utilization. Lastly, data layout re-organization is used to efficiently accelerate applications dominated by sparse matrix-vector multiplication operations. The experimental evaluations show that the proposed synthesis approach improves the area and latency of fixed-point multiplication by 77% and 20% over the state-of-the-art, respectively. On scientific computing applications from the SuiteSparse Matrix Collection, the proposed design improves the area, latency, and energy by 3.6×, 2.6×, and 8.3×, respectively.

Design Space and Memory Technology Co-Exploration for In-Memory Computing Based Machine Learning Accelerators

  • Kang He
  • Indranil Chakraborty
  • Cheng Wang
  • Kaushik Roy

In-Memory Computing (IMC) has become a promising paradigm for accelerating machine learning (ML) inference. While IMC architectures built on various memory technologies have demonstrated higher throughput and energy efficiency compared to conventional digital architectures, little research has been done from a system-level perspective to provide comprehensive and fair comparisons of different memory technologies under the same hardware budget (area). Since large-scale analog IMC hardware relies on costly analog-digital converters (ADCs) for robust digital communication, optimizing IMC architecture performance requires synergistic co-design of memory arrays and peripheral ADCs, wherein the trade-offs could depend on the underlying memory technologies. To that effect, we co-explore IMC macro design space and memory technology to identify the best design point for each memory type under iso-area budgets, aiming to make fair comparisons among different technologies, including SRAM, phase change memory, resistive RAM, ferroelectrics and spintronics. First, an extended simulation framework employing spatial architecture with off-chip DRAM is developed, capable of integrating both CMOS and nonvolatile memory technologies. Subsequently, we propose different modes of ADC operations with distinctive weight mapping schemes to cope with different on-chip area budgets. Our results show that under an iso-area budget, the various memory technologies being evaluated will need to adopt different IMC macro-level designs to deliver the optimal energy-delay-product (EDP) at system level. We demonstrate that under small area budgets, the choice of best memory technology is determined by its cell area and writing energy. When area budgets are larger, cell area becomes the dominant factor for technology selection.

SESSION: Special Session: 2022 CAD Contest at ICCAD

Session details: Special Session: 2022 CAD Contest at ICCAD

  • Yu-Guang Chen

Overview of 2022 CAD Contest at ICCAD

  • Yu-Guang Chen
  • Chun-Yao Wang
  • Tsung-Wei Huang
  • Takashi Sato

The “CAD Contest at ICCAD” is a challenging, multi-month research and development competition focusing on advanced, real-world problems in the field of electronic design automation (EDA). Since 2012, the contest has published many sophisticated circuit design problems, from system-level design to physical design, together with industrial benchmarks and solution evaluators. Contestants can participate in one or more problems provided by the EDA/IC industry. The winners are awarded at an ICCAD special session dedicated to this contest. Every year, the contest attracts more than a hundred teams, fosters productive industry-academia collaborations, and leads to hundreds of publications in top-tier conferences and journals. The 2022 CAD Contest has 166 teams from all over the world. Moreover, this year's problems, provided by well-known EDA/IC companies, cover state-of-the-art EDA research trends such as circuit security, 3D-IC, and design space exploration. We believe the contest continues to increase its impact and boost EDA research.

2022 CAD Contest Problem A: Learning Arithmetic Operations from Gate-Level Circuit

  • Chung-Han Chou
  • Chih-Jen (Jacky) Hsu
  • Chi-An (Rocky) Wu
  • Kuan-Hua Tu

Extracting circuit functionality from a gate-level netlist is critical in CAD tools. For security, it helps designers to detect hardware Trojans or malicious design changes in the netlist with third-party resources such as fabrication services and soft/hard IP cores. For verification, it can reduce the complexity and effort of keeping design information in aggressive optimization strategies adopted by synthesis tools. For Engineering Change Order (ECO), it can keep the designer from locating the ECO gate in a sea of bit-level gates.

In this contest, we formulated a datapath learning and extraction problem. With a set of benchmarks and an evaluation metric, we expect contestants to develop a tool to learn the arithmetic equations from a synthesized gate-level netlist.

2022 ICCAD CAD Contest Problem B: 3D Placement with D2D Vertical Connections

  • Kai-Shun Hu
  • I-Jye Lin
  • Yu-Hui Huang
  • Hao-Yu Chi
  • Yi-Hsuan Wu
  • Chin-Fang Cindy Shen

In the chiplet era, multiple benefits can be obtained by splitting a large single die into multiple small dies. With multiple small dies joined by die-to-die (D2D) vertical connections, the benefits include: 1) better yield, 2) better timing/performance, and 3) better cost. How to partition the netlist, place the cells in each of the small dies, and determine the locations of the D2D interconnection terminals becomes a new topic.

To address this chiplet-era physical implementation problem, the ICCAD-2022 contest encourages research into techniques for multi-die netlist partitioning and placement with D2D vertical connections. We provided (i) a set of benchmarks and (ii) an evaluation metric that help contestants develop, test, and evaluate their new algorithms.

2022 ICCAD CAD Contest Problem C: Microarchitecture Design Space Exploration

  • Sicheng Li
  • Chen Bai
  • Xuechao Wei
  • Bizhao Shi
  • Yen-Kuang Chen
  • Yuan Xie

It is vital to select microarchitectures that achieve good trade-offs between performance, power, and area in the chip development cycle. Combining high-level hardware description languages with the optimization capabilities of electronic design automation tools empowers microarchitecture exploration at the circuit level. Due to the extremely large design space and the high runtime cost of evaluating a microarchitecture, ICCAD 2022 CAD Contest Problem C calls for an effective design space exploration algorithm to solve the problem. We formulate the research topic as a contest problem and provide benchmark suites, contest benchmark platforms, etc., for all contestants to innovate and evaluate their algorithms.

IEEE CEDA DATC: Expanding Research Foundations for IC Physical Design and ML-Enabled EDA

  • Jinwook Jung
  • Andrew B. Kahng
  • Ravi Varadarajan
  • Zhiang Wang

This paper describes new elements in the RDF-2022 release of the DATC Robust Design Flow, along with other activities of the IEEE CEDA DATC. The RosettaStone initiated with RDF-2021 has been augmented to include 35 benchmarks and four open-source technologies (ASAP7, NanGate45 and SkyWater130HS/HD), plus timing-sensible versions created using path-cutting. The Hier-RTLMP macro placer is now part of DATC RDF, enabling macro placement for large modern designs with hundreds of macros. To establish a clear baseline for macro placers, new open-source benchmark suites on open PDKs, with corresponding flows for fully reproducible results, are provided. METRICS2.1 infrastructure in OpenROAD and OpenROAD-flow-scripts now uses native JSON metrics reporting, which is more robust and general than the previous Python script-based method. Calibrations on open enablements have also seen notable updates in the RDF. Finally, we also describe an approach to establishing a generic, cloud-native large-scale design of experiments for ML-enabled EDA. Our paper closes with future research directions related to DATC’s efforts.

SESSION: Architectures and Methodologies for Advanced Hardware Security

Session details: Architectures and Methodologies for Advanced Hardware Security

  • Amin Rezaei
  • Gang Qu

Inhale: Enabling High-Performance and Energy-Efficient In-SRAM Cryptographic Hash for IoT

  • Jingyao Zhang
  • Elaheh Sadredini

In the age of big data, information security has become a major issue of debate, especially with the rise of the Internet of Things (IoT), where attackers can effortlessly obtain physical access to edge devices. The hash algorithm is the current foundation for data integrity and authentication. However, it is challenging to provide a high-performance, high-throughput, and energy-efficient solution on resource-constrained edge devices. In this paper, we propose Inhale, an in-SRAM architecture to effectively compute hash algorithms with innovative data alignment and efficient read/write strategies to implicitly execute data shift operations through the in-situ controller. We present two variations of Inhale: Inhale-Opt, which is optimized for latency, throughput, and area-overhead; and Inhale-Flex, which offers flexibility in repurposing a part of last-level caches for hash computation. We thoroughly evaluate our proposed architectures on both SRAM and ReRAM memories and compare them with the state-of-the-art in-memory and ASIC accelerators. Our performance evaluation confirms that Inhale can achieve 1.4× – 14.5× higher throughput-per-area and about two-orders-of-magnitude higher throughput-per-area-per-energy compared to the state-of-the-art solutions.

Accelerating N-Bit Operations over TFHE on Commodity CPU-FPGA

  • Kevin Nam
  • Hyunyoung Oh
  • Hyungon Moon
  • Yunheung Paek

TFHE is a fully homomorphic encryption (FHE) scheme that evaluates Boolean gates, which we will hereafter call Tgates, over encrypted data. TFHE is considered to have higher expressive power than many existing schemes in that it is able to compute not only N-bit Arithmetic operations but also Logical/Relational ones, as arbitrary ALR operations can be represented by Tgate circuits. Despite this strength, TFHE, like all other schemes, suffers from colossal computational overhead. Incessant efforts to reduce the overhead have been made by exploiting the inherent parallelism of FHE operations on ciphertexts. Unlike other FHE schemes, the parallelism of TFHE can be decomposed into two layers: one inside each FHE operation (equivalent to a single Tgate) and the other between Tgates. Unfortunately, previous works focused only on exploiting the parallelism inside a Tgate. However, as each N-bit operation over TFHE corresponds to a Tgate circuit constructed from multiple Tgates, it is also necessary to utilize the parallelism between Tgates to optimize an entire operation. This paper proposes an acceleration technique to maximize the performance of a TFHE N-bit operation by simultaneously utilizing both layers of parallelism within the operation. To fully profit from both layers of parallelism, we have implemented our technique on a commodity CPU-FPGA hybrid machine with parallel execution capabilities in hardware. Our implementation outperforms prior ones by 2.43× in throughput and 12.19× in throughput per watt when performing N-bit operations under the 128-bit quantum security parameters.
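To illustrate what parallelism between Tgates means, the following minimal sketch groups the gates of a small plaintext Boolean circuit into dependency levels; gates in the same level do not depend on one another and could, in principle, be evaluated concurrently. This is not the paper's technique: the function name `levelize`, the dictionary-based netlist format, and the full-adder example are assumptions made purely for illustration, and no encryption is involved.

```python
from collections import defaultdict


def levelize(gates):
    """Group gates of an acyclic circuit into dependency levels.

    `gates` maps a gate name to the list of its input names. Any name that
    is not a key of `gates` is treated as a primary input (level 0).
    Hypothetical netlist format, chosen for this sketch only.
    """
    level = {}

    def depth(name):
        if name not in gates:          # primary input
            return 0
        if name not in level:          # memoize the level of each gate
            level[name] = 1 + max((depth(src) for src in gates[name]), default=0)
        return level[name]

    layers = defaultdict(list)
    for gate in gates:
        layers[depth(gate)].append(gate)
    return [layers[k] for k in sorted(layers)]


# A 1-bit full adder: five gates, but only three dependency levels.
full_adder = {
    "x1": ["a", "b"],       # XOR
    "s": ["x1", "cin"],     # XOR (sum)
    "a1": ["a", "b"],       # AND
    "a2": ["x1", "cin"],    # AND
    "cout": ["a1", "a2"],   # OR (carry out)
}
layers = levelize(full_adder)
```

In this example the five gates collapse into three levels, so a scheduler that dispatches each level as a batch would need three sequential steps instead of five.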

Fast and Compact Interleaved Modular Multiplication Based on Carry Save Addition

  • Oleg Mazonka
  • Eduardo Chielle
  • Deepraj Soni
  • Michail Maniatakos

Improving fully homomorphic encryption computation by designing specialized hardware is an active topic of research. The most prominent encryption schemes operate on long polynomials requiring many concurrent modular multiplications of very big numbers. Thus, it is crucial to use many small and efficient multipliers. Interleaved and Montgomery iterative multipliers are the best candidates for the task. Interleaved designs, however, suffer from longer latency as they require a number comparison within each iteration; Montgomery designs, on the other hand, need extra conversion of the operands or the result. In this work, we propose a novel hardware design that combines the best of both worlds: exhibiting the carry save addition of Montgomery designs without the need for any domain conversions. Experimental results demonstrate improved latency-area product efficiency by up to 47% when compared to the standard Interleaved multiplier for large arithmetic word sizes.
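For reference, the textbook interleaved (shift-and-add) modular multiplication that the abstract uses as its baseline can be sketched in software as follows. This is the standard algorithm, not the proposed carry-save design; the function name `interleaved_modmul` is chosen for this sketch. The two comparisons against the modulus in every iteration are the source of the latency the abstract mentions.

```python
def interleaved_modmul(a: int, b: int, m: int) -> int:
    """Return (a * b) mod m, scanning the bits of `a` from MSB to LSB."""
    if m <= 0:
        raise ValueError("modulus must be positive")
    a %= m
    b %= m
    r = 0  # invariant: 0 <= r < m at the start of every iteration
    for i in reversed(range(a.bit_length())):
        r <<= 1              # doubling: r < 2m
        if (a >> i) & 1:
            r += b           # conditional add: r < 3m
        # Reduction interleaved with accumulation: up to two
        # compare-and-subtract steps bring r back below m.
        if r >= m:
            r -= m
        if r >= m:
            r -= m
    return r


result = interleaved_modmul(7, 9, 11)
```

Because the intermediate value never exceeds three times the modulus, the datapath only needs a couple of bits more than the operand width, which is why this form suits small iterative multipliers.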

Accelerating Fully Homomorphic Encryption by Bridging Modular and Bit-Level Arithmetic

  • Eduardo Chielle
  • Oleg Mazonka
  • Homer Gamil
  • Michail Maniatakos

The dramatic increase of data breaches in modern computing platforms has emphasized that access control is not sufficient to protect sensitive user data. Recent advances in cryptography allow end-to-end processing of encrypted data without the need for decryption using Fully Homomorphic Encryption (FHE). Such computation, however, is still orders of magnitude slower than direct (unencrypted) computation. Depending on the underlying cryptographic scheme, FHE schemes can work natively either at bit-level using Boolean circuits, or over integers using modular arithmetic. Operations on integers are limited to addition/subtraction and multiplication. On the other hand, bit-level arithmetic is much more comprehensive, allowing more operations, such as comparison and division. While modular arithmetic can emulate bit-level computation, there is a significant cost in performance. In this work, we propose a novel method, dubbed bridging, that blends faster and restricted modular computation with slower and comprehensive bit-level computation, making them both usable within the same application and with the same cryptographic scheme instantiation. We introduce and open-source C++ types representing the two distinct arithmetic modes, offering the possibility to convert from one to the other. Experimental results show that bridging modular and bit-level arithmetic computation can lead to 1–2 orders of magnitude performance improvement for tested synthetic benchmarks, as well as one real-world FHE application: a genotype imputation case study.

SESSION: Special Session: The Dawn of Domain-Specific Hardware Accelerators for Robotic Computing

Session details: Special Session: The Dawn of Domain-Specific Hardware Accelerators for Robotic Computing

  • Jiang Hu

A Reconfigurable Hardware Library for Robot Scene Perception

  • Yanqi Liu
  • Anthony Opipari
  • Odest Chadwicke Jenkins
  • R. Iris Bahar

Perceiving the position and orientation of objects (i.e., pose estimation) is a crucial prerequisite for robots acting within their natural environment. We present a hardware acceleration approach to enable real-time and energy-efficient articulated pose estimation for robots operating in unstructured environments. Our hardware accelerator implements Nonparametric Belief Propagation (NBP) to infer the belief distribution of articulated object poses. Our approach is, on average, 26× more energy-efficient than a high-end GPU and 11× faster than an embedded low-power GPU implementation. Moreover, we present a Monte-Carlo Perception Library generated from high-level synthesis to enable reconfigurable hardware designs on FPGA fabrics that are better tuned to user-specified scene, resource, and performance constraints.

Analyzing and Improving Resilience and Robustness of Autonomous Systems

  • Zishen Wan
  • Karthik Swaminathan
  • Pin-Yu Chen
  • Nandhini Chandramoorthy
  • Arijit Raychowdhury

Autonomous systems have reached a tipping point, with a myriad of self-driving cars, unmanned aerial vehicles (UAVs), and robots being widely applied and enabling new applications. The continuous deployment of autonomous systems reveals the need for designs that facilitate increased resiliency and safety. The ability of an autonomous system to tolerate or mitigate errors arising from environmental conditions, sensor, hardware and software faults, and adversarial attacks is essential to ensure its functional safety. Application-aware resilience metrics, holistic fault analysis frameworks, and lightweight fault mitigation techniques are being proposed for accurate and effective resilience and robustness assessment and improvement. This paper explores the origins of faults across the computing stack of autonomous systems, discusses the various fault impacts and fault mitigation techniques for autonomous systems of different scales, and concludes with challenges and opportunities for assessing and building next-generation resilient and robust autonomous systems.

Factor Graph Accelerator for LiDAR-Inertial Odometry (Invited Paper)

  • Yuhui Hao
  • Bo Yu
  • Qiang Liu
  • Shaoshan Liu
  • Yuhao Zhu

A factor graph is a graph representing the factorization of a probability distribution function, and has been utilized in many autonomous machine computing tasks, such as localization, tracking, planning, and control. We are developing an architecture with the goal of using the factor graph as a common abstraction for most, if not all, autonomous machine computing tasks. If successful, the architecture would provide a very simple interface for mapping autonomous machine functions to the underlying compute hardware. As a first step in this direction, this paper presents our most recent work on developing a factor graph accelerator for LiDAR-Inertial Odometry (LIO), an essential task in many autonomous machines, such as autonomous vehicles and mobile robots. By modeling LIO as a factor graph, the proposed accelerator not only supports multi-sensor fusion of LiDAR, inertial measurement unit (IMU), GPS, etc., but also solves the global optimization problem of robot navigation in batch or incremental modes. Our evaluation demonstrates that the proposed design significantly improves the real-time performance and energy efficiency of autonomous machine navigation systems. The initial success suggests the potential of generalizing the factor graph architecture as a common abstraction for autonomous machine computing, including tracking, planning, and control.

Hardware Architecture of Graph Neural Network-Enabled Motion Planner (Invited Paper)

  • Lingyi Huang
  • Xiao Zang
  • Yu Gong
  • Bo Yuan

Motion planning aims to find a collision-free trajectory from the start to goal configurations of a robot. As a key cognition task for all the autonomous machines, motion planning is fundamentally required in various real-world robotic applications, such as 2-D/3-D autonomous navigation of unmanned mobile and aerial vehicles and high degree-of-freedom (DoF) autonomous manipulation of industry/medical robot arms and graspers.

Motion planning can be performed using either non-learning-based classical algorithms or learning-based neural approaches. Most recently, the powerful capabilities of deep neural networks (DNNs) have made neural planners very attractive because of their superior planning performance over the classical methods. In particular, the graph neural network (GNN)-enabled motion planner has demonstrated state-of-the-art performance across a set of challenging high-dimensional planning tasks, motivating efficient hardware acceleration to fully unleash its potential and promote its widespread deployment in practical applications.

To that end, in this paper we perform a preliminary study of efficient accelerator design for the GNN-based neural planner, especially for the neural explorer as the key component of the entire planning pipeline. By performing in-depth analysis of the different design choices, we identify that the hybrid architecture, instead of the uniform sparse matrix multiplication (SpMM)-based solution that is popularly adopted in existing GNN hardware, is more suitable for our target neural explorer. With a set of optimizations on microarchitecture and dataflow, several design challenges incurred by using the hybrid architecture, such as extensive memory access and imbalanced workload, can be efficiently mitigated. Evaluation results show that our proposed customized hardware architecture achieves order-of-magnitude performance improvement over the CPU/GPU-based implementation with respect to area and energy efficiency in various working environments.

SESSION: From Logical to Physical Qubits: New Models and Techniques for Mapping

Session details: From Logical to Physical Qubits: New Models and Techniques for Mapping

  • Weiwen Jiang

A Robust Quantum Layout Synthesis Algorithm with a Qubit Mapping Checker

  • Tsou-An Wu
  • Yun-Jhe Jiang
  • Shao-Yun Fang

Layout synthesis in quantum circuits maps the logical qubits of a synthesized circuit onto the physical qubits of a hardware device (coupling graph) and complies with the hardware limitations. Existing studies on the problem usually suffer from intractable formulation complexity and thus prohibitively long runtimes. In this paper, we propose an efficient layout synthesizer by developing a satisfiability modulo theories (SMT)-based qubit mapping checker. The proposed qubit mapping checker can efficiently derive a SWAP-free solution if one exists. If no SWAP-free solution exists for a circuit, we propose a divide-and-conquer scheme that utilizes the checker to find SWAP-free sub-solutions for sub-circuits, and the overall solution is found by merging sub-solutions with SWAP insertion. Experimental results show that the proposed optimization flow can achieve more than 3000× runtime speedup over a state-of-the-art work to derive optimal solutions for a set of SWAP-free circuits. Moreover, for the other set of benchmark circuits requiring SWAP gates, our flow achieves more than 800× speedup and obtains near-optimal solutions with only 3% SWAP overhead.
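The question the qubit mapping checker answers, whether a SWAP-free mapping exists, can be illustrated with a brute-force sketch. The code below enumerates injective assignments of logical to physical qubits and accepts the first one under which every two-qubit gate acts on coupled physical qubits. It deliberately swaps the paper's SMT formulation for exhaustive search, so it only scales to tiny instances; the function name and the input format are assumptions made for this sketch.

```python
from itertools import permutations


def find_swap_free_mapping(num_logical, gates, coupling_edges, num_physical):
    """Return a dict logical -> physical qubit if a SWAP-free mapping exists.

    `gates` is a list of (logical_a, logical_b) two-qubit gates and
    `coupling_edges` a list of undirected (physical_u, physical_v) pairs.
    Returns None when no SWAP-free mapping exists.
    """
    allowed = {frozenset(edge) for edge in coupling_edges}
    # Every injective assignment of logical qubits to physical qubits.
    for perm in permutations(range(num_physical), num_logical):
        if all(frozenset((perm[a], perm[b])) in allowed for a, b in gates):
            return dict(enumerate(perm))
    return None


# A 4-qubit line 0-1-2-3 as the coupling graph.
line = [(0, 1), (1, 2), (2, 3)]
# Interactions forming a path (q0-q2, q2-q1) fit on the line without SWAPs.
path_mapping = find_swap_free_mapping(3, [(0, 2), (2, 1)], line, 4)
# Interactions forming a triangle cannot fit on a line.
triangle_mapping = find_swap_free_mapping(3, [(0, 1), (1, 2), (0, 2)], line, 4)
```

When the search returns `None`, as for the triangle example, SWAP gates are unavoidable, which is the case the abstract handles by splitting the circuit into sub-circuits that each admit a SWAP-free sub-solution.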

Reinforcement Learning and DEAR Framework for Solving the Qubit Mapping Problem

  • Ching-Yao Huang
  • Chi-Hsiang Lien
  • Wai-Kei Mak

Quantum computing is gaining more and more attention due to its huge potential and the constant progress in quantum computer development. IBM and Google have released quantum architectures with more than 50 qubits. However, in these machines, the physical qubits are not fully connected, so two-qubit interactions can only be performed between specific pairs of physical qubits. To execute a quantum circuit, it is necessary to transform it into a functionally equivalent one that respects the constraints imposed by the target architecture. Quantum circuit transformation inevitably introduces additional gates, which reduce the fidelity of the circuit. Therefore, it is important that the transformation method completes the transformation with minimal overhead. The transformation consists of two steps, initial mapping and qubit routing. Here we propose a reinforcement learning-based model to solve the initial mapping problem. Initial mapping is formulated as sequence-to-sequence learning, and a self-attention network is used to extract features from a circuit. For qubit routing, a DEAR (Dynamically-Extract-and-Route) framework is proposed. The framework iteratively extracts a subcircuit and uses A* search to determine when and where to insert additional gates. It helps to preserve the lookahead ability dynamically and to provide more accurate cost estimation efficiently during A* search. The experimental results show that our RL-model generates better initial mappings than the best known algorithms, with 12% fewer additional gates in the qubit routing stage. Furthermore, our DEAR-framework outperforms the state-of-the-art qubit routing approach with 8.4% and 36.3% average reductions in the number of additional gates and execution time, respectively, starting from the same initial mapping.

Qubit Mapping for Reconfigurable Atom Arrays

  • Bochen Tan
  • Dolev Bluvstein
  • Mikhail D. Lukin
  • Jason Cong

Because they offer the largest number of qubits available and massively parallel execution of entangling two-qubit gates, atom arrays are a promising platform for quantum computing. The qubits are selectively loaded into arrays of optical traps, some of which can be moved during the computation itself. By adjusting the locations of the traps and shining a specific global laser, different pairs of qubits, even those initially far away, can be entangled at different stages of the quantum program execution. In comparison, previous QC architectures only generate entanglement on a fixed set of quantum register pairs. Thus, reconfigurable atom arrays (RAA) present a new challenge for QC compilation, especially the qubit mapping/layout synthesis stage, which decides the qubit placement and gate scheduling. In this paper, we consider an RAA QC architecture that contains multiple arrays, supports 2D array movements, represents cutting-edge experimental platforms, and is much more general than previous works. We start by systematically examining the fundamental constraints on RAA imposed by physics. Building upon this understanding, we discretize the state space of the architecture, and we formulate layout synthesis for such an architecture as a satisfiability modulo theories (SMT) problem. Finally, we demonstrate our work by compiling the quantum approximate optimization algorithm (QAOA), one of the promising near-term quantum computing applications. Our layout synthesizer reduces the number of required native two-qubit gates in 22-qubit QAOA by 5.72× (geomean) compared to leading experiments on a superconducting architecture. Combined with a better coherence time, there is an order-of-magnitude increase in circuit fidelity.

MCQA: Multi-Constraint Qubit Allocation for Near-FTQC Device

  • Sunghye Park
  • Dohun Kim
  • Jae-Yoon Sim
  • Seokhyeong Kang

In response to the rapid development of quantum processors, quantum software must be advanced by considering the actual hardware limitations. Among the various design automation problems in quantum computing, qubit allocation modifies the input circuit to match the hardware topology constraints. In this work, we present an effective heuristic approach for qubit allocation that considers not only the hardware topology but also other constraints for near-fault-tolerant quantum computing (near-FTQC). We propose a practical methodology to find an effective initial mapping to reduce both the number of gates and circuit latency. We then perform dynamic scheduling to maximize the number of gates executed in parallel in the main mapping phase. Our experimental results with a Surface-17 processor confirmed a substantial reduction in the number of gates, latency, and runtime by 58%, 28%, and 99%, respectively, compared with the previous method [18]. Moreover, our mapping method is scalable and has a linear time complexity with respect to the number of gates.

SESSION: Smart Embedded Systems (Virtual)

Session details: Smart Embedded Systems (Virtual)

  • Leonidas Kosmidis
  • Pietro Mercati

Smart Scissor: Coupling Spatial Redundancy Reduction and CNN Compression for Embedded Hardware

  • Hao Kong
  • Di Liu
  • Shuo Huai
  • Xiangzhong Luo
  • Weichen Liu
  • Ravi Subramaniam
  • Christian Makaya
  • Qian Lin

Scaling down the resolution of input images can greatly reduce the computational overhead of convolutional neural networks (CNNs), which is promising for edge AI. However, as an image usually contains much spatial redundancy, e.g., background pixels, directly shrinking the whole image will lose important features of the foreground object and lead to severe accuracy degradation. In this paper, we propose a dynamic image cropping framework to reduce the spatial redundancy by accurately cropping the foreground object from images. To achieve the instance-aware fine cropping, we introduce a lightweight foreground predictor to efficiently localize and crop the foreground of an image. The finely cropped images can be correctly recognized even at a small resolution. Meanwhile, computational redundancy also exists in CNN architectures. To pursue higher execution efficiency on resource-constrained embedded devices, we also propose a compound shrinking strategy to coordinately compress the three dimensions (depth, width, resolution) of CNNs. Eventually, we seamlessly combine the proposed dynamic image cropping and compound shrinking into a unified compression framework, Smart Scissor, which is expected to significantly reduce the computational overhead of CNNs while still maintaining high accuracy. Experiments on ImageNet-1K demonstrate that our method reduces the computational cost of ResNet50 by 41.5% while improving the top-1 accuracy by 0.3%. Moreover, compared to HRank, the state-of-the-art CNN compression framework, our method achieves 4.1% higher top-1 accuracy at the same computational cost. The codes and data are available at

SHAPE: Scheduling of Fixed-Priority Tasks on Heterogeneous Architectures with Multiple CPUs and Many PEs

  • Yuankai Xu
  • Tiancheng He
  • Ruiqi Sun
  • Yehan Ma
  • Yier Jin
  • An Zou

Despite being employed in burgeoning efforts to accelerate artificial intelligence, heterogeneous architectures have yet to be well managed under strict timing constraints. As a classic task model, multi-segment self-suspension (MSSS) has been proposed for general I/O-intensive systems and computation offloading. However, directly applying this model to heterogeneous architectures with multiple CPUs and many processing units (PEs) suffers from tremendous pessimism. In this paper, we present a real-time scheduling approach, SHAPE, for general heterogeneous architectures with high schedulability and an improved utilization rate. We start by building the general task execution pattern on a heterogeneous architecture integrating multiple CPU cores and many PEs such as GPU streaming multiprocessors and FPGA IP cores. A real-time scheduling strategy and the corresponding schedulability analysis are presented following the task execution pattern. Compared with state-of-the-art scheduling algorithms through comprehensive experiments on unified and versatile tasks, SHAPE improves the schedulability by 11.1% – 100%. Moreover, experiments performed on NVIDIA GPU systems further indicate that up to 70.9% pessimism reduction can be achieved by the proposed scheduling. Since we target general heterogeneous architectures, SHAPE can be directly applied to off-the-shelf heterogeneous computing systems with guaranteed deadlines and improved schedulability.

On Minimizing the Read Latency of Flash Memory to Preserve Inter-Tree Locality in Random Forest

  • Yu-Cheng Lin
  • Yu-Pei Liang
  • Tseng-Yi Chen
  • Yuan-Hao Chang
  • Shuo-Han Chen
  • Wei-Kuan Shih

Many prior research works have discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor. That is, an inference model is constructed on a personal computer or a server platform, and then integrated into embedded systems for just-in-time inference. Given the limited main memory space in embedded systems, an important problem for embedded machine learning systems is how to efficiently move the inference model between the main memory and secondary storage (e.g., flash memory). To tackle this problem, we need to consider how to preserve the locality inside the inference model during model construction. Therefore, we propose a solution, namely locality-aware random forest (LaRF), to preserve the inter-locality of all decision trees within a random forest model during the model construction process. Owing to the locality preservation, LaRF can improve the read latency by at least 81.5% compared to the original random forest library.

SESSION: Analog/Mixed-Signal Simulation, Layout, and Packaging (Virtual)

Session details: Analog/Mixed-Signal Simulation, Layout, and Packaging (Virtual)

  • Biying Xu
  • Ilya Yusim

Numerically-Stable and Highly-Scalable Parallel LU Factorization for Circuit Simulation

  • Xiaoming Chen

A number of sparse linear systems are solved by sparse LU factorization in a circuit simulation process. The coefficient matrices of these linear systems have the identical structure but different values. Pivoting is usually needed in sparse LU factorization to ensure the numerical stability, which leads to the difficulty of predicting the exact dependencies for scheduling parallel LU factorization. However, the matrix values usually change smoothly in circuit simulation iterations, which provides the potential to “guess” the dependencies. This work proposes a novel parallel LU factorization algorithm with pivoting reduction, but the numerical stability is equivalent to LU factorization with pivoting. The basic idea is to reuse the previous structural and pivoting information as much as possible to perform highly-scalable parallel factorization without pivoting, which is scheduled by the “guessed” dependencies. Once a pivot is found to be too small, the remaining matrix is factorized with pivoting in a pipelined way. Comprehensive experiments including comparisons with state-of-the-art CPU- and GPU-based parallel sparse direct solvers on 66 circuit matrices and real SPICE DC simulations on 4 circuit netlists reveal the superior performance and scalability of the proposed algorithm. The proposed solver is available at
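The fallback idea in this abstract (factorize without pivoting, and switch to pivoting for the remaining matrix once a pivot is too small) can be illustrated with a minimal sketch. This is a dense, sequential toy and not the authors' sparse parallel solver; the function name `lu_with_fallback` and the threshold `tol` are hypothetical.

```python
def lu_with_fallback(a, tol=1e-8):
    """Dense, sequential LU sketch (illustrative assumption, not the paper's code).

    Eliminates without pivoting until a pivot smaller than `tol` appears,
    then uses partial pivoting for the remaining columns.
    Returns (lu, perm, switched_at): `lu` packs L (unit diagonal, below the
    diagonal) and U (on and above it), `perm` is the row order, and
    `switched_at` is the first column factorized with pivoting (or None).
    """
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    switched_at = None
    for k in range(n):
        if switched_at is None and abs(lu[k][k]) < tol:
            switched_at = k  # pivot too small: pivot from this column onward
        if switched_at is not None:
            # Partial pivoting: bring the largest remaining entry to row k.
            p = max(range(k, n), key=lambda r: abs(lu[r][k]))
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        if lu[k][k] == 0.0:
            raise ValueError("matrix is singular")
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier stored in the L part
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm, switched_at
```

A well-conditioned matrix goes through without any row exchange, while a matrix with a zero leading entry triggers the pivoting path at column 0.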

EI-MOR: A Hybrid Exponential Integrator and Model Order Reduction Approach for Transient Power/Ground Network Analysis

  • Cong Wang
  • Dongen Yang
  • Quan Chen

The exponential integrator (EI) method has proven to be an effective technique to accelerate large-scale transient power/ground network analysis. However, EI requires the inputs to be piece-wise linear (PWL) within one step, which greatly limits the step size when the inputs are poorly aligned. To address this issue, in this work we first show with mathematical proof that EI, when used together with the rational Krylov subspace, is equivalent to performing a moment-matching model order reduction (MOR) with a single input in each time step, then advancing the reduced system using EI in the same step. Based on this equivalence, we next devise a hybrid method, EI-MOR, to combine the usage of EI and MOR in the same transient simulation. The majority of inputs, which are well aligned, are still treated by EI as usual, while a few misaligned inputs are selected to be handled by a MOR process producing a reduced model that works for arbitrary inputs. Therefore, the step size limitation imposed by the misaligned inputs can be largely alleviated. Numerical experiments are conducted to demonstrate the efficacy of the proposed method.

Multi-Package Co-Design for Chiplet Integration

  • Zhen Zhuang
  • Bei Yu
  • Kai-Yuan Chao
  • Tsung-Yi Ho

Due to the cost and design complexity associated with advanced technology nodes, it is difficult for the traditional monolithic System-on-Chip to follow Moore’s Law, which means its economic benefits have been weakened. The semiconductor industry is looking to advanced packages to improve the economic advantages. Since the multi-chiplet architecture supporting heterogeneous integration offers robust re-usability and effective cost reduction, chiplet integration has become the mainstream of advanced packages. Nowadays, the number of mounted chiplets in a package is continuously increasing with the requirement of high system performance. However, the large area caused by the increasing number of chiplets leads to serious reliability issues, including warpage and bump stress, which worsen the yield and cost. The multi-package architecture, which can distribute chiplets to multiple packages and use less area in each package, is a popular alternative to enhance the reliability and reduce the cost in advanced packages. However, the primary challenge of the multi-package architecture lies in the tradeoff between the inter-package costs, i.e., the interconnection among packages, and the intra-package costs, i.e., the reliability issues caused by warpage and bump stress. Therefore, a co-design methodology is indispensable to optimize multiple packages simultaneously and improve the quality of the whole system. To tackle this challenge, we adopt mathematical programming methods for the multi-package co-design problem, exploiting the synergistic nature of optimizing multiple packages together. To the best of our knowledge, this is the first work to solve the multi-package co-design problem.

SESSION: Advanced PIM and Biochip Technology and Stochastic Computing (Virtual)

Session details: Advanced PIM and Biochip Technology and Stochastic Computing (Virtual)

  • Grace Li Zhang

Gzippo: Highly-Compact Processing-in-Memory Graph Accelerator Alleviating Sparsity and Redundancy

  • Xing Li
  • Rachata Ausavarungnirun
  • Xiao Liu
  • Xueyuan Liu
  • Xuan Zhang
  • Heng Lu
  • Zhuoran Song
  • Naifeng Jing
  • Xiaoyao Liang

Graph applications play a significant role in real-world data computation. However, their memory access patterns, which include a low compute-to-communication ratio, poor temporal locality, and poor spatial locality, become the performance bottleneck. Existing RRAM-based processing-in-memory accelerators reduce data movement but fail to address both the sparsity and the redundancy of graph data. In this work, we present Gzippo, a highly-compact design that supports graph computation in the compressed sparse format. Gzippo employs a tandem-isomorphic-crossbar architecture both to eliminate redundant searches and sequential indexing during iterations, and to remove the sparsity that leads to non-effective computation on zero values. Gzippo achieves a 3.0× (up to 17.4×) performance speedup and 23.9× (up to 163.2×) higher energy efficiency over a state-of-the-art RRAM-based PIM accelerator.
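To make "graph computation in the compressed sparse format" concrete, here is a minimal software sketch of compressed sparse row (CSR) storage and one propagation step over it. It only illustrates why a compressed format avoids work on zero entries; it says nothing about Gzippo's crossbar architecture, and the helper names `to_csr` and `spmv` are hypothetical.

```python
def to_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx, weights) from (src, dst, w) triples."""
    counts = [0] * num_vertices
    for src, _, _ in edges:
        counts[src] += 1
    row_ptr = [0]
    for c in counts:
        row_ptr.append(row_ptr[-1] + c)
    col_idx = [0] * len(edges)
    weights = [0.0] * len(edges)
    cursor = row_ptr[:-1]  # next free slot for each source vertex (a copy)
    for src, dst, w in edges:
        pos = cursor[src]
        col_idx[pos] = dst
        weights[pos] = w
        cursor[src] += 1
    return row_ptr, col_idx, weights


def spmv(row_ptr, col_idx, weights, x):
    """One propagation step y = A @ x that touches only the stored edges."""
    y = []
    for v in range(len(row_ptr) - 1):
        acc = 0.0
        for pos in range(row_ptr[v], row_ptr[v + 1]):
            acc += weights[pos] * x[col_idx[pos]]
        y.append(acc)
    return y
```

For a graph with three vertices and three edges, the loop performs three multiply-accumulates instead of the nine a dense adjacency matrix would need.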

CoMUX: Combinatorial-Coding-Based High-Performance Microfluidic Control Multiplexer Design

  • Siyuan Liang
  • Mengchu Li
  • Tsun-Ming Tseng
  • Ulf Schlichtmann
  • Tsung-Yi Ho

Flow-based microfluidic chips are one of the most promising platforms for biochemical experiments. Transportation channels and operation devices inside these chips are controlled by microvalves, which are driven by external pressure sources. As the complexity of experiments on these chips keeps increasing, control multiplexers (MUXes) become necessary for the actuation of the enormous number of valves. However, current binary-coding-based MUXes do not take full advantage of the coding capacity and suffer from the reliability problem caused by the high control channel density. In this work, we propose a novel MUX coding strategy, named Combinatorial Coding, along with an algorithm to synthesize combinatorial-coding-based MUXes (CoMUXes) of arbitrary sizes with the proven maximum coding capacity. Moreover, we develop a simplification method to reduce the number of valves and control channels in CoMUXes and thus improve their reliability. We compare CoMUX with the state-of-the-art MUXes under different control demands with up to 10 × 2^13 independent control channels. Experiments show that CoMUXes can reliably control more independent control channels with fewer resources. For example, when the number of the to-be-controlled control channels is up to 10 × 2^13, compared to a state-of-the-art MUX, the optimized CoMUX reduces the number of required flow channels by 44% and the number of valves by 90%.
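The gap in coding capacity can be illustrated numerically. The sketch below rests on an assumption that is not stated in the abstract: a binary-coded MUX uses its n input channels as n/2 complementary pairs and therefore addresses 2^(n/2) targets, whereas a combinatorial code assigns each target a distinct subset of n/2 channels and addresses C(n, n/2) targets. Both function names are hypothetical.

```python
from math import comb


def binary_capacity(n_channels):
    """Targets addressable when channels are used as complementary pairs (assumed model)."""
    return 2 ** (n_channels // 2)


def combinatorial_capacity(n_channels):
    """Targets addressable when each target gets a distinct half-size subset (assumed model)."""
    return comb(n_channels, n_channels // 2)


for n in (4, 8, 16):
    print(n, binary_capacity(n), combinatorial_capacity(n))
```

Under this model, 8 channels address 16 targets with binary coding and 70 with combinatorial coding, and the ratio grows with n.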

Exploiting Uniform Spatial Distribution to Design Efficient Random Number Source for Stochastic Computing

  • Kuncai Zhong
  • Zexi Li
  • Haoran Jin
  • Weikang Qian

Stochastic computing (SC) generally suffers from long latency. One solution is to apply proper random number sources (RNSs). Nevertheless, current RNS designs have either high hardware cost or low accuracy. To address this issue, motivated by the observation that a uniform spatial distribution generally leads to high accuracy for an SC circuit, we propose a basic architecture to generate the uniform spatial distribution and a detailed implementation of it. For the implementation, we further propose a method to optimize its hardware cost and a method to optimize its accuracy. The hardware cost optimization does not affect the accuracy. The experimental results show that our proposed implementation can achieve both low hardware cost and high accuracy. Compared to the state-of-the-art stochastic number generator design, the proposed design reduces area by 88% with comparable accuracy.
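The link between evenly spread sample points and SC accuracy can be seen in a small simulation of an SC multiplier (an AND gate fed by two comparator-based bitstreams). The sketch uses a plain counter and a bit-reversed counter as the two number sources, a standard low-discrepancy pairing chosen here purely for illustration; it is not the RNS architecture proposed in the paper, and `sc_multiply` is a hypothetical name.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def sc_multiply(x, y, bits=8):
    """Estimate x * y (both in [0, 1]) with a 2**bits-cycle AND-gate multiplier."""
    n = 1 << bits
    ones = 0
    for i in range(n):
        bit_x = i < x * n                     # source 1: plain counter
        bit_y = bit_reverse(i, bits) < y * n  # source 2: bit-reversed counter
        ones += bit_x and bit_y               # AND gate; True counts as 1
    return ones / n


print(sc_multiply(0.5, 0.5))  # 0.25
```

Because the pairs (counter, bit-reversed counter) cover the unit square evenly, 256 cycles already give an estimate within a few hundredths of the exact product.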

SESSION: On Automating Heterogeneous Designs (Virtual)

Session details: On Automating Heterogeneous Designs (Virtual)

  • Haocheng Li

A Novel Blockage-Avoiding Macro Placement Approach for 3D ICs Based on POCS

  • Jai-Ming Lin
  • Po-Chen Lu
  • Heng-Yu Lin
  • Jia-Ting Tsai

Although the 3D integrated circuit (IC) placement problem has been studied for many years, few publications are devoted to macro legalization. Due to the large sizes of macros, the macro placement problem is harder than cell placement, especially when preplaced macros exist in a multi-tier structure. In order to have a more global view, this paper proposes the partitioning-last macro-first flow to handle 3D placement for mixed-size designs, which performs tier partitioning after placement prototyping and then legalizes macros before cell placement. A novel two-step approach is proposed to handle 3D macro placement. The first step determines the locations of macros in a projection plane based on a new representation, named K-tier Partially Occupied Corner Stitching. It not only keeps the prototyping result but also guarantees a legal placement after tier assignment of macros. Next, macros are assigned to their respective tiers by an integer linear programming (ILP) algorithm. Experimental results show that our design flow can obtain better solutions than other flows, especially in the cases with more preplaced macros.

Routability-Driven Analytical Placement with Precise Penalty Models for Large-Scale 3D ICs

  • Jai-Ming Lin
  • Hao-Yuan Hsieh
  • Hsuan Kung
  • Hao-Jia Lin

Quality of a true 3D placement approach greatly relies on the correctness of the models used in its formulation. However, the models used by previous approaches are not precise enough. Moreover, they do not actually place TSVs which makes their approach unable to get accurate wirelength and construct a correct congestion map. Besides, they rarely discuss routability which is the most important issue considered in 2D placement. To resolve this insufficiency, this paper proposes more accurate models to estimate placement utilization and TSV number by the softmax function which can align cells to exact tiers. Moreover, we propose a fast parallel algorithm to update the locations of TSVs when cells are moved during optimization. Finally, we present a novel penalty model to estimate routing overflow of regions covered by cells and inflate cells in congested regions according to this model. Experimental results show that our methodology can obtain better results than previous works.
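One way a softmax can tie a continuous vertical coordinate to discrete tiers is sketched below. The abstract only states that softmax is used to estimate placement utilization and TSV number; the distance-based scoring, the `sharpness` parameter, and the name `tier_weights` are assumptions made for illustration.

```python
import math


def tier_weights(z, tier_heights, sharpness=10.0):
    """Soft assignment of a cell at height z to each tier (assumed model).

    Scores are negative squared distances to the tier heights, so the
    weights approach a one-hot vector as `sharpness` grows.
    """
    scores = [-sharpness * (z - h) ** 2 for h in tier_heights]
    top = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(tier_weights(0.0, [0.0, 1.0]))  # almost all weight on tier 0
print(tier_weights(0.5, [0.0, 1.0]))  # equal weight on both tiers
```

The weights are differentiable in z, which is what lets a gradient-based analytical placer use them in per-tier density terms.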

SESSION: Special Session: Quantum Computing to Solve Chemistry, Physics and Security Problems (Virtual)

Session details: Special Session: Quantum Computing to Solve Chemistry, Physics and Security Problems (Virtual)

  • Swaroop Ghosh

Quantum Machine Learning for Material Synthesis and Hardware Security (Invited Paper)

  • Collin Beaudoin
  • Satwik Kundu
  • Rasit Onur Topaloglu
  • Swaroop Ghosh

Using quantum computing, this paper addresses two scientifically-pressing and day to day-relevant problems, namely, chemical retrosynthesis which is an important step in drug/material discovery and security of semiconductor supply chain. We show that Quantum Long Short-Term Memory (QLSTM) is a viable tool for retrosynthesis. We achieve 65% training accuracy with QLSTM whereas classical LSTM can achieve 100%. However, in testing we achieve 80% accuracy with the QLSTM while classical LSTM peaks at only 70% accuracy! We also demonstrate an application of Quantum Neural Network (QNN) in the hardware security domain, specifically in Hardware Trojan (HT) detection using a set of power and area Trojan features. The QNN model achieves detection accuracy as high as 97.27%.

Quantum Machine Learning Applications in High-Energy Physics

  • Andrea Delgado
  • Kathleen E. Hamilton

Some of the most significant achievements of the modern era of particle physics, such as the discovery of the Higgs boson, have been made possible by the tremendous effort in building and operating large-scale experiments like the Large Hadron Collider or the Tevatron. In these facilities, the ultimate theory to describe matter at the most fundamental level is constantly probed and verified. These experiments often produce large amounts of data that require storing, processing, and analysis techniques that continually push the limits of traditional information processing schemes. Thus, the High-Energy Physics (HEP) field has benefited from advancements in information processing and the development of algorithms and tools for large datasets. More recently, quantum computing applications have been investigated to understand how the community can benefit from the advantages of quantum information science. Nonetheless, to unleash the full potential of quantum computing, there is a need to understand the quantum behavior and, thus, scale up current algorithms beyond what can be simulated in classical processors. In this work, we explore potential applications of quantum machine learning to data analysis tasks in HEP and how to overcome the limitations of algorithms targeted for Noisy Intermediate-Scale Quantum (NISQ) devices.

SESSION: Making Patterning Work (Virtual)

Session details: Making Patterning Work (Virtual)

  • Yuzhe Ma

DeePEB: A Neural Partial Differential Equation Solver for Post Exposure Baking Simulation in Lithography

  • Qipan Wang
  • Xiaohan Gao
  • Yibo Lin
  • Runsheng Wang
  • Ru Huang

Post Exposure Baking (PEB) has been widely utilized in advanced lithography. PEB simulation is critical in the lithography simulation flow, as it bridges the optical simulation result and the final developed profile in the photoresist. The process of PEB can be described by coupled partial differential equations (PDEs) and corresponding boundary and initial conditions. Recent years have witnessed a growing presence of machine learning algorithms in lithography simulation, while PEB simulation is often ignored or treated with compact models, considering the huge cost of solving PDEs exactly. In this work, based on the observation of the physical essence of PEB, we propose DeePEB: a neural PDE solver for PEB simulation. This model is capable of predicting the PEB latent image with high accuracy and >100× acceleration (compared to the commercial rigorous simulation tool), paving the way for efficient and accurate photoresist modeling in lithography simulation and layout optimization.

AdaOPC: A Self-Adaptive Mask Optimization Framework for Real Design Patterns

  • Wenqian Zhao
  • Xufeng Yao
  • Ziyang Yu
  • Guojin Chen
  • Yuzhe Ma
  • Bei Yu
  • Martin D. F. Wong

Optical proximity correction (OPC) is a widely-used resolution enhancement technique (RET) for printability optimization. Recently, rigorous numerical optimization and fast machine learning are the research focus of OPC in both academia and industry, each of which complements the other in terms of robustness or efficiency. We inspect the pattern distribution on a design layer and find that different sub-regions have different pattern complexity. Besides, we also find that many patterns repetitively appear in the design layout, and these patterns may possibly share optimized masks. We exploit these properties and propose a self-adaptive OPC framework to improve efficiency. Firstly we choose different OPC solvers adaptively for patterns of different complexity from an extensible solver pool to reach a speed/accuracy co-optimization. Apart from that, we prove the feasibility of reusing optimized masks for repeated patterns and hence, build a graph-based dynamic pattern library reusing stored masks to further speed up the OPC flow. Experimental results show that our framework achieves substantial improvement in both performance and efficiency.

LayouTransformer: Generating Layout Patterns with Transformer via Sequential Pattern Modeling

  • Liangjian Wen
  • Yi Zhu
  • Lei Ye
  • Guojin Chen
  • Bei Yu
  • Jianzhuang Liu
  • Chunjing Xu

Generating legal and diverse layout patterns to establish large pattern libraries is fundamental for many lithography design applications. Existing pattern generation models typically regard the pattern generation problem as image generation of layout maps and learn to model the patterns via capturing pixel-level coherence, which is insufficient to achieve polygon-level modeling, e.g., shape and layout of patterns, thus leading to poor generation quality. In this paper, we regard the pattern generation problem as an unsupervised sequence generation problem, in order to learn the pattern design rules by explicitly modeling the shapes of polygons and the layouts among polygons. Specifically, we first propose a sequential pattern representation scheme that fully describes the geometric information of polygons by encoding the 2D layout patterns as sequences of tokens, i.e., vertexes and edges. Then we train a sequential generative model to capture the long-term dependency among tokens and thus learn the design rules from training examples. To generate a new pattern in sequence, each token is generated conditioned on the previously generated tokens that are from the same polygon or different polygons in the same layout map. Our framework, termed LayouTransformer, is based on the Transformer architecture due to its remarkable ability in sequence modeling. Comprehensive experiments show that our LayouTransformer not only generates a large amount of legal patterns but also maintains high generation diversity, demonstrating its superiority over existing pattern generative models.

WaferHSL: Wafer Failure Pattern Classification with Efficient Human-Like Staged Learning

  • Qijing Wang
  • Martin D. F. Wong

As the demand for semiconductor products increases and integrated circuit (IC) processes become more and more complex, wafer failure pattern classification is gaining more attention from manufacturers and researchers as a means to improve yield. To cope with the real-world scenario in which only very limited labeled data and no unlabeled data are available in the early manufacturing stage of new products, this work proposes an efficient human-like staged learning framework for wafer failure pattern classification named WaferHSL. Inspired by the human knowledge acquisition process, a mutually reinforcing task fusion scheme is designed to guide the deep learning model to simultaneously establish knowledge of spatial relationships, geometric properties, and semantics. Furthermore, a progressive stage controller is deployed to partition and control the learning process, so as to enable human-like progressive advancement in the model. Experimental results show that with only 10% labeled samples and no unlabeled samples, WaferHSL can achieve better results than previous SOTA methods trained with 60% labeled samples and a large number of unlabeled samples, while the improvement is even more significant when using the same size of labeled training set.

SESSION: Advanced Verification Technologies (Virtual)

Session details: Advanced Verification Technologies (Virtual)

  • Takahide Yoshikawa

Combining BMC and Complementary Approximate Reachability to Accelerate Bug-Finding

  • Xiaoyu Zhang
  • Shengping Xiao
  • Jianwen Li
  • Geguang Pu
  • Ofer Strichman

Bounded Model Checking (BMC) is so far considered the best engine for bug-finding in hardware model checking. Given a bound K, BMC can detect whether there is a counterexample to a given temporal property within K steps from the initial state, thus performing a global-style search. Recently, a SAT-based model-checking technique called Complementary Approximate Reachability (CAR) was shown to be complementary to BMC, in the sense that each technique can frequently solve instances that the other cannot within the same time limit. CAR detects a counterexample gradually with the guidance of an over-approximating state sequence, and performs a local-style search. In this paper, we consider three different ways to combine BMC and CAR. Our experiments show that they all outperform BMC and CAR on their own, and solve instances that cannot be solved by these two techniques. Our findings are based on a comprehensive experimental evaluation using the benchmarks of two hardware model checking competitions.
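The meaning of "a counterexample within K steps from the initial state" can be shown with an explicit-state stand-in. Real BMC unrolls the transition relation into a SAT formula; the sketch below swaps that for a plain breadth-first enumeration so that it stays self-contained, and `bounded_search` is a hypothetical name.

```python
def bounded_search(initial, successors, is_bad, k):
    """Return a shortest path to a bad state using at most k transitions, else None.

    Explicit-state illustration only: SAT-based BMC answers the same
    question symbolically instead of enumerating states.
    """
    frontier = [(s, [s]) for s in initial]
    seen = set(initial)
    for depth in range(k + 1):
        for state, path in frontier:
            if is_bad(state):
                return path  # counterexample trace
        if depth == k:
            break  # bound reached without finding a bad state
        next_frontier = []
        for state, path in frontier:
            for nxt in successors(state):
                if nxt not in seen:
                    seen.add(nxt)
                    next_frontier.append((nxt, path + [nxt]))
        frontier = next_frontier
    return None


# A 3-bit counter whose "bug" is reaching the value 5.
step = lambda s: [(s + 1) % 8]
print(bounded_search([0], step, lambda s: s == 5, 4))  # None
print(bounded_search([0], step, lambda s: s == 5, 5))  # [0, 1, 2, 3, 4, 5]
```

The first call shows the limitation of any bounded method: no counterexample is reported when the bound is one step too small.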

Equivalence Checking of Dynamic Quantum Circuits

  • Xin Hong
  • Yuan Feng
  • Sanjiang Li
  • Mingsheng Ying

Despite the rapid development of quantum computing in recent years, state-of-the-art quantum devices still contain only a limited number of qubits. One possible way to execute more realistic algorithms on near-term quantum devices is to employ dynamic quantum circuits (DQCs). In DQCs, measurements can happen during the circuit, and their outcomes can be processed with classical computers and used to control other parts of the circuit. This technique can help significantly reduce the qubit resources required to implement a quantum algorithm. In this paper, we give a formal definition of DQCs and then characterise their functionality in terms of ensembles of linear operators, following the Kraus representation of superoperators. We further interpret DQCs as tensor networks, implement their functionality as tensor decision diagrams (TDDs), and reduce the equivalence of two DQCs to checking if they have the same TDD representation. Experiments show that embedding classical logic into conventional quantum circuits does not incur a significant time and space burden.

SESSION: Routing with Cell Movement (Virtual)

Session details: Routing with Cell Movement (Virtual)

  • Guojie Luo

ATLAS: A Two-Level Layer-Aware Scheme for Routing with Cell Movement

  • Xinshi Zang
  • Fangzhou Wang
  • Jinwei Liu
  • Martin D. F. Wong

Placement and routing are two crucial steps in the physical design of integrated circuits (ICs). To close the gap between placement and routing, the routing with cell movement problem has attracted great attention recently. In this problem, a certain number of cells can be moved to new positions and the nets can be rerouted to improve the total wire length. In this work, we advance the study on this problem by proposing a two-level layer-aware scheme, named ATLAS. A coarse-level cluster-based cell movement is first performed to optimize via usage and provides a better starting point for the next fine-level single cell movement. To further encourage routing on the upper metal layers, we utilize a set of adjusted layer weights to increase the routing cost on lower layers. Experimental results on the ICCAD 2020 contest benchmarks show that ATLAS achieves much more wire length reduction compared with the state-of-the-art routing with cell movement engine. Furthermore, applied on the ICCAD 2021 contest benchmarks, ATLAS outperforms the first place team of the contest with much better solution quality while being 3× faster.

A Robust Global Routing Engine with High-Accuracy Cell Movement under Advanced Constraints

  • Ziran Zhu
  • Fuheng Shen
  • Yangjie Mei
  • Zhipeng Huang
  • Jianli Chen
  • Jun Yang

Placement and routing are typically defined as two separate problems to reduce the design complexity. However, such a divide-and-conquer approach inevitably degrades solution quality because the objectives of placement and routing are not entirely consistent. Besides, with various constraints (e.g., timing, R/C characteristic, voltage area, etc.) imposed by advanced circuit designs, bridging the gap between placement and routing while satisfying the advanced constraints has become more challenging. In this paper, we develop a robust global routing engine with high-accuracy cell movement under advanced constraints to narrow the gap and improve the routing solution. We first present a routing refinement technique to obtain a convergent routing result based on fixed placement, which provides more accurate information for subsequent cell movement. To achieve fast and high-accuracy position prediction for cell movement, we construct a lookup table (LUT) considering complex constraints/objectives (e.g., routing direction and layer-based power consumption), and generate a timing-driven gain map for each cell based on the LUT. Finally, based on the prediction, we propose an alternating cell movement and cluster movement scheme followed by partial rip-up and reroute to optimize the routing solution. Experimental results on the ICCAD 2020 contest benchmarks show that our algorithm achieves the best total scores among all published works. Experimental results on the ICCAD 2021 contest benchmarks show that our algorithm achieves better solution quality in shorter runtime than the champion of the ICCAD 2021 contest.

SESSION: Special Session: Hardware Security through Reconfigurability: Attacks, Defenses, and Challenges

Session details: Special Session: Hardware Security through Reconfigurability: Attacks, Defenses, and Challenges

  • Michael Raitza

Securing Hardware through Reconfigurable Nano-Structures

  • Nima Kavand
  • Armin Darjani
  • Shubham Rai
  • Akash Kumar

Hardware security has been an ever-growing concern of integrated circuit (IC) designers. Through different stages in the IC design and life cycle, an adversary can extract sensitive design information and private data stored in the circuit using logical, physical, and structural weaknesses. Besides, in recent times, ML-based attacks have become the new de facto standard in the hardware security community. Contemporary defense strategies often face unforeseen challenges in coping with these attack schemes. Additionally, the high overhead of CMOS-based secure add-on circuitry and the intrinsic limitations of these devices indicate the need for new nano-electronics. Emerging reconfigurable devices like Reconfigurable Field Effect Transistors (RFETs) provide unique features to fortify the design against various threats at different stages in the IC design and life cycle. In this manuscript, we investigate the applications of RFETs for securing the design against traditional and machine learning (ML)-based intellectual property (IP) piracy techniques and side-channel attacks (SCAs).

Reconfigurable Logic for Hardware IP Protection: Opportunities and Challenges

  • Luca Collini
  • Benjamin Tan
  • Christian Pilato
  • Ramesh Karri

Protecting the intellectual property (IP) of integrated circuit (IC) designs is becoming a significant concern for fab-less semiconductor design houses. Malicious actors can access the chip design at any stage, reverse engineer the functionality, and create illegal copies. On the one hand, defenders are crafting more and more solutions to hide the critical portions of the circuit. On the other hand, attackers are designing more and more powerful tools to extract useful information from the design and reverse engineer the functionality, especially when they can get access to working chips. In this context, the use of custom reconfigurable fabrics has recently been investigated for hardware IP protection. This paper discusses recent trends in hardware obfuscation with embedded FPGAs, focusing also on the open challenges that must be addressed to make this solution viable.

SESSION: Performance, Power and Temperature Aspects in Deep Learning

Session details: Performance, Power and Temperature Aspects in Deep Learning

  • Callie Hao
  • Jeff Zhang

RT-NeRF: Real-Time On-Device Neural Radiance Fields Towards Immersive AR/VR Rendering

  • Chaojian Li
  • Sixu Li
  • Yang Zhao
  • Wenbo Zhu
  • Yingyan Lin

Neural Radiance Field (NeRF) based rendering has attracted growing attention thanks to its state-of-the-art (SOTA) rendering quality and wide applications in Augmented and Virtual Reality (AR/VR). However, immersive real-time (> 30 FPS) NeRF based rendering enabled interactions are still limited due to the low achievable throughput on AR/VR devices. To this end, we first profile SOTA efficient NeRF algorithms on commercial devices and identify two primary causes of the aforementioned inefficiency: (1) the uniform point sampling and (2) the dense accesses and computations of the required embeddings in NeRF. Furthermore, we propose RT-NeRF, which to the best of our knowledge is the first algorithm-hardware co-design acceleration of NeRF. Specifically, on the algorithm level, RT-NeRF integrates an efficient rendering pipeline for largely alleviating the inefficiency due to the commonly adopted uniform point sampling method in NeRF by directly computing the geometry of pre-existing points. Additionally, RT-NeRF leverages a coarse-grained view-dependent computing ordering scheme for eliminating the (unnecessary) processing of invisible points. On the hardware level, our proposed RT-NeRF accelerator (1) adopts a hybrid encoding scheme to adaptively switch between a bitmap- or coordinate-based sparsity encoding format for NeRF’s sparse embeddings, aiming to maximize the storage savings and thus reduce the required DRAM accesses while supporting efficient NeRF decoding; and (2) integrates both a high-density sparse search unit and a dual-purpose bi-direction adder & search tree to coordinate the two aforementioned encoding formats. Extensive experiments on eight datasets consistently validate the effectiveness of RT-NeRF, achieving a large throughput improvement (e.g., 9.7×~3,201×) while maintaining the rendering quality as compared with SOTA efficient NeRF solutions.
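The benefit of switching between a bitmap and a coordinate sparsity format can be seen from a simple storage count. The cost model below (one presence bit per position for the bitmap, one index per nonzero for the coordinate format, fixed-width values) is an assumption made for illustration, not the RT-NeRF accelerator's actual format; all names are hypothetical.

```python
def encoding_cost_bits(length, nnz, value_bits=16):
    """Storage in bits for a vector of `length` entries with `nnz` nonzeros (assumed model)."""
    index_bits = max(1, (length - 1).bit_length())  # bits to address one position
    bitmap = length + nnz * value_bits              # presence mask + values
    coordinate = nnz * (index_bits + value_bits)    # (index, value) pairs
    return bitmap, coordinate


def choose_encoding(length, nnz, value_bits=16):
    """Pick whichever format needs fewer bits under the model above."""
    bitmap, coordinate = encoding_cost_bits(length, nnz, value_bits)
    return "bitmap" if bitmap <= coordinate else "coordinate"


print(choose_encoding(1024, 10))   # coordinate (260 bits vs 1184 bits)
print(choose_encoding(1024, 512))  # bitmap (9216 bits vs 13312 bits)
```

Very sparse blocks favor coordinates and denser blocks favor the bitmap, which is why a per-block choice can save storage over either format alone.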

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

  • Yifan Gong
  • Zheng Zhan
  • Pu Zhao
  • Yushu Wu
  • Chao Wu
  • Caiwen Ding
  • Weiwen Jiang
  • Minghai Qin
  • Yanzhi Wang

During the deployment of deep neural networks (DNNs) on edge devices, many research efforts are devoted to the limited hardware resource. However, little attention is paid to the influence of dynamic power management. As edge devices typically only have a budget of energy with batteries (rather than almost unlimited energy support on servers or workstations), their dynamic power management often changes the execution frequency as in the widely-used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed performance, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework to work with dynamic power management using DVFS. The framework can use only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the corresponding pruning ratio for a specific execution frequency (and voltage), we are able to achieve stable inference speed, i.e., keeping the difference in speed performance under various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces their variance of inference latency for various frequencies, with minimal memory consumption of only one model and one soft mask.

Robustify ML-Based Lithography Hotspot Detectors

  • Jingyu Pan
  • Chen-Chia Chang
  • Zhiyao Xie
  • Jiang Hu
  • Yiran Chen

Deep learning has been widely applied in various VLSI design automation tasks, from layout quality estimation to design optimization. Though deep learning has shown state-of-the-art performance in several applications, recent studies reveal that deep neural networks exhibit intrinsic vulnerability to adversarial perturbations, which pose risks in the ML-aided VLSI design flow. One of the most effective strategies to improve robustness is regularization approaches, which adjust the optimization objective to make the deep neural network generalize better. In this paper, we examine several adversarial defense methods to improve the robustness of ML-based lithography hotspot detectors. We present an innovative design rule checking (DRC)-guided curvature regularization (CURE) approach, which is customized to robustify ML-based lithography hotspot detectors against white-box attacks. Our approach allows for improvements in both the robustness and the accuracy of the model. Experiments show that the model optimized by DRC-guided CURE achieves the highest robustness and accuracy compared with those trained using the baseline defense methods. Compared with the vanilla model, DRC-guided CURE decreases the average attack success rate by 53.9% and increases the average ROC-AUC by 12.1%. Compared with the best of the defense baselines, DRC-guided CURE reduces the average attack success rate by 18.6% and improves the average ROC-AUC by 4.3%.

Associative Memory Based Experience Replay for Deep Reinforcement Learning

  • Mengyuan Li
  • Arman Kazemi
  • Ann Franchesca Laguna
  • X. Sharon Hu

Experience replay is an essential component in deep reinforcement learning (DRL), which stores the experiences and generates experiences for the agent to learn in real time. Recently, prioritized experience replay (PER) has been proven to be powerful and widely deployed in DRL agents. However, implementing PER on traditional CPU or GPU architectures incurs significant latency overhead due to its frequent and irregular memory accesses. This paper proposes a hardware-software co-design approach to design an associative memory (AM) based PER, AMPER, with an AM-friendly priority sampling operation. AMPER replaces the widely-used time-costly tree-traversal-based priority sampling in PER while preserving the learning performance. Further, we design an in-memory computing hardware architecture based on AM to support AMPER by leveraging parallel in-memory search operations. AMPER shows comparable learning performance while achieving 55× to 270× latency improvement when running on the proposed hardware compared to the state-of-the-art PER running on GPU.
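For context, the tree-traversal-based priority sampling that the abstract says AMPER replaces is commonly implemented with a sum tree. The following is a minimal sketch of that baseline only, not of AMPER or its hardware; the class name `SumTree` and its methods are hypothetical names chosen for illustration.

```python
import random


class SumTree:
    """Binary tree whose leaves hold priorities and whose internal nodes hold sums.

    Illustrative baseline only; names and layout are assumptions.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        # Node 1 is the root; leaves live at capacity .. 2*capacity - 1.
        self.tree = [0.0] * (2 * capacity)

    def update(self, index, priority):
        """Set the priority of leaf `index` and refresh the sums above it."""
        node = index + self.capacity
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def total(self):
        """Sum of all priorities (stored at the root)."""
        return self.tree[1]

    def find(self, value):
        """Walk from the root to the leaf whose cumulative range contains `value`."""
        node = 1
        while node < self.capacity:
            left = 2 * node
            if value < self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = left + 1
        return node - self.capacity

    def sample(self, rng=random):
        """Draw a leaf index with probability proportional to its priority."""
        return self.find(rng.random() * self.total())
```

Each `find` call walks one root-to-leaf path, so every sample costs a chain of dependent memory reads, which is the kind of irregular access pattern the abstract identifies as a latency bottleneck.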

SESSION: Tutorial: TorchQuantum Case Study for Robust Quantum Circuits

Session details: Tutorial: TorchQuantum Case Study for Robust Quantum Circuits

  • Hanrui Wang

TorchQuantum Case Study for Robust Quantum Circuits

  • Hanrui Wang
  • Zhiding Liang
  • Jiaqi Gu
  • Zirui Li
  • Yongshan Ding
  • Weiwen Jiang
  • Yiyu Shi
  • David Z. Pan
  • Frederic T. Chong
  • Song Han

Quantum Computing has attracted much research attention because of its potential to achieve fundamental speed and efficiency improvements in various domains. Among different quantum algorithms, Parameterized Quantum Circuits (PQC) for Quantum Machine Learning (QML) show promise for realizing quantum advantages on current Noisy Intermediate-Scale Quantum (NISQ) machines. Therefore, to facilitate QML and PQC research, a Python library called TorchQuantum has recently been released. It can construct, simulate, and train PQC for machine learning tasks with high speed and convenient debugging support. Besides quantum for ML, we want to draw the community’s attention to the reverse direction: ML for quantum. Specifically, the TorchQuantum library also supports using data-driven ML models to solve problems in quantum system research, such as predicting the impact of quantum noise on circuit fidelity and improving the quantum circuit compilation efficiency.

This paper presents a case study of the ML for quantum part in TorchQuantum. Since estimating the noise impact on circuit reliability is an essential step toward understanding and mitigating noise, we propose to leverage classical ML to predict noise impact on circuit fidelity. Inspired by the natural graph representation of quantum circuits, we propose to leverage a graph transformer model to predict the noisy circuit fidelity. We first collect a large dataset with a variety of quantum circuits and obtain their fidelity on noisy simulators and real machines. Then we embed each circuit into a graph with gate and noise properties as node features, and adopt a graph transformer to predict the fidelity. We can avoid exponential classical simulation cost and efficiently estimate fidelity with polynomial complexity.

Evaluated on 5 thousand random and algorithm circuits, the graph transformer predictor can provide accurate fidelity estimation with an RMSE of 0.04 and outperform a simple neural network-based model by 0.02 on average. It can achieve R² scores of 0.99 and 0.95 for random and algorithm circuits, respectively. Compared with circuit simulators, the predictor has over 200× speedup for estimating the fidelity. The datasets and predictors can be accessed in the TorchQuantum library.

SESSION: Emerging Machine Learning Primitives: From Technology to Application

Session details: Emerging Machine Learning Primitives: From Technology to Application

  • Dharanidhar Dang
  • Hai Helen Lee

COSIME: FeFET Based Associative Memory for In-Memory Cosine Similarity Search

  • Che-Kai Liu
  • Haobang Chen
  • Mohsen Imani
  • Kai Ni
  • Arman Kazemi
  • Ann Franchesca Laguna
  • Michael Niemier
  • Xiaobo Sharon Hu
  • Liang Zhao
  • Cheng Zhuo
  • Xunzhao Yin

In a number of machine learning models, an input query is searched across the trained class vectors to find the closest feature class vector under the cosine similarity metric. However, computing cosine similarities between vectors on von Neumann machines involves a large number of multiplications, Euclidean normalizations and division operations, thus incurring heavy hardware energy and latency overheads. Moreover, due to the memory wall problem present in conventional architectures, frequent cosine similarity-based searches (CSSs) over the class vectors require a lot of data movement, limiting the throughput and efficiency of the system. To overcome the aforementioned challenges, this paper introduces COSIME, a general in-memory associative memory (AM) engine based on the ferroelectric FET (FeFET) device for efficient CSS. By leveraging the one-transistor AND gate function of FeFET devices, a current-based translinear analog circuit and winner-take-all (WTA) circuitry, COSIME can realize parallel in-memory CSS across all the entries in a memory block, and output the closest word to the input query under the cosine similarity metric. Evaluation results at the array level suggest that the proposed COSIME design achieves 333× and 90.5× latency and energy improvements, respectively, and realizes better classification accuracy when compared with an AM design implementing approximated CSS. The proposed in-memory computing fabric is evaluated for an HDC problem, showcasing that COSIME can achieve on average 47.1× and 98.5× speedup and energy efficiency improvements compared with a GPU implementation.
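The operation counts mentioned above can be seen in a plain software version of the search. This is a minimal sketch of cosine similarity search over class vectors as it might run on a conventional processor; it says nothing about the COSIME circuit, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))          # multiplications
    norm_a = math.sqrt(sum(x * x for x in a))       # Euclidean normalization
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)                  # division


def nearest_class(query, class_vectors):
    """Return the index of the class vector most similar to `query`."""
    scores = [cosine_similarity(query, v) for v in class_vectors]
    return max(range(len(scores)), key=scores.__getitem__)
```

Every query touches every stored class vector, which is the data movement the abstract attributes to the memory wall.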

DynaPAT: A Dynamic Pattern-Aware Encoding Technique for Robust MLC PCM-Based Deep Neural Networks

  • Thai-Hoang Nguyen
  • Muhammad Imran
  • Joon-Sung Yang

As the effectiveness of Deep Neural Networks (DNNs) is rising over time, so is the need for highly scalable and efficient hardware architectures to capitalize on this effectiveness in many practical applications. Emerging non-volatile Phase Change Memory (PCM) technology has been found to be a promising candidate for future memory systems due to its better scalability, non-volatility and low leakage/dynamic power consumption, compared to conventional charge-based memories. Additionally, with its cell’s wide resistance span, PCM also has the Flash-like Multi-Level Cell (MLC) capability, which has enhanced storage density, providing an opportunity for the deployment of data-intensive applications such as DNNs on resource-constrained edge devices. However, the practical deployment of MLC PCM is hampered by certain reliability challenges, among which the resistance drift is considered to be a critical concern. In a DNN application, the presence of resistance drift in MLC PCM can severely impact the DNN’s accuracy if no drift-error-tolerance technique is utilized. This paper proposes DynaPAT, a low-cost and effective pattern-aware encoding technique to enhance the drift-error-tolerance of MLC PCM-based Deep Neural Networks. DynaPAT has been constructed on the insight into DNN’s vulnerability against different data pattern switching. Based on this insight, DynaPAT efficiently maps the most-frequent data pattern in DNN’s parameters to the least-drift-prone level of the MLC PCM, thus significantly enhancing the robustness of the system against drift errors. Various experiments on different DNN models and configurations demonstrate the effectiveness of DynaPAT. The experimental results indicate that DynaPAT can achieve up to 500× enhancement in the drift-error-tolerance capability over the baseline MLC PCM based DNN while requiring only a negligible hardware overhead (below 1% storage overhead). Being orthogonal, DynaPAT can be integrated with existing drift-tolerance schemes for even higher gains in reliability.

Graph Neural Networks for Idling Error Mitigation

  • Vedika Servanan
  • Samah Mohamed Saeed

Dynamical Decoupling (DD)-based protocols have been shown to reduce the idling errors encountered in quantum circuits. However, the current research in suppressing idling qubit errors suffers from scalability issues due to the large number of tuning quantum circuits that should be executed first to find the locations of the DD sequences in the target quantum circuit, which boost the output state fidelity. This process becomes tedious as the size of the quantum circuit increases. To address this challenge, we propose a Graph Neural Network (GNN) framework, which mitigates idling errors through an efficient insertion of DD sequences into quantum circuits by modeling their impact at different idle qubit windows. Our paper targets maximizing the benefit of DD sequences using a limited number of tuning circuits. We propose to classify the idle qubit windows into critical and non-critical (benign) windows using a data-driven reliability model. Our results obtained from IBM Lagos quantum computer show that our proposed GNN models, which determine the locations of DD sequences in the quantum circuits, significantly improve the output state fidelity by a factor of 1.4x on average and up to 2.6x compared to the adaptive DD approach, which searches for the best locations of DD sequences at run-time.

Quantum Neural Network Compression

  • Zhirui Hu
  • Peiyan Dong
  • Zhepeng Wang
  • Youzuo Lin
  • Yanzhi Wang
  • Weiwen Jiang

Model compression, such as pruning and quantization, has been widely applied to optimize neural networks on resource-limited classical devices. Recently, there is growing interest in variational quantum circuits (VQC), that is, a type of neural network on quantum computers (a.k.a., quantum neural networks). It is well known that the near-term quantum devices have high noise and limited resources (i.e., quantum bits, qubits); yet, how to compress quantum neural networks has not been thoroughly studied. One might think it is straightforward to apply the classical compression techniques to quantum scenarios. However, this paper reveals that there exist differences between the compression of quantum and classical neural networks. Based on our observations, we claim that the compilation/transpilation has to be involved in the compression process. On top of this, we propose the very first systematic framework, namely CompVQC, to compress quantum neural networks (QNNs). In CompVQC, the key component is a novel compression algorithm, which is based on the alternating direction method of multipliers (ADMM) approach. Experiments demonstrate the advantage of CompVQC, reducing the circuit depth (almost over 2.5×) with a negligible accuracy drop (<1%), which outperforms other competitors. Another promising finding is that CompVQC can indeed promote the robustness of the QNN on the near-term noisy quantum devices.

SESSION: Design for Low Energy, Low Resource, but High Quality

Session details: Design for Low Energy, Low Resource, but High Quality

  • Ravikumar Chakaravarthy
  • Cong “Callie” Hao

Squeezing Accumulators in Binary Neural Networks for Extremely Resource-Constrained Applications

  • Azat Azamat
  • Jaewoo Park
  • Jongeun Lee

The cost and power consumption of BNN (Binarized Neural Network) hardware is dominated by additions. In particular, accumulators account for a large fraction of hardware overhead, which could be effectively reduced by using reduced-width accumulators. However, it is not straightforward to find the optimal accumulator width due to the complex interplay between width, scale, and the effect of training. In this paper we present algorithmic and hardware-level methods to find the optimal accumulator size for BNN hardware with minimal impact on the quality of result. First, we present partial sum scaling, a top-down approach to minimize the BNN accumulator size based on advanced quantization techniques. We also present an efficient, zero-overhead hardware design for partial sum scaling. Second, we evaluate a bottom-up approach that uses a saturating accumulator, which is more robust against overflows. Our experimental results using the CIFAR-10 dataset demonstrate that our partial sum scaling along with our optimized accumulator architecture can reduce the area and power consumption of the datapath by 15.50% and 27.03%, respectively, with little impact on inference performance (less than 2%), compared to using a 16-bit accumulator.
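To illustrate why a saturating accumulator is more robust against overflow than a wrapping one, the sketch below compares the two on a narrow signed accumulator. It is a generic software model under assumed two's-complement semantics, not the paper's hardware design; all function names are hypothetical.

```python
def wrap_add(acc, value, bits):
    """Two's-complement addition that wraps around on overflow."""
    mask = (1 << bits) - 1
    result = (acc + value) & mask
    # Reinterpret the top bit as the sign bit.
    if result >= 1 << (bits - 1):
        result -= 1 << bits
    return result


def saturating_add(acc, value, bits):
    """Addition that clamps to the representable signed range."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, acc + value))


def accumulate(products, bits, add):
    """Sum +1/-1 products (as in a binarized dot product) with the given adder."""
    acc = 0
    for p in products:
        acc = add(acc, p, bits)
    return acc
```

With a 4-bit accumulator (range -8 to 7) and ten +1 products, the true sum is 10: the wrapping adder ends at -6, flipping the sign, while the saturating adder ends at 7, keeping the sign and staying close in magnitude.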

WSQ-AdderNet: Efficient Weight Standardization Based Quantized AdderNet FPGA Accelerator Design with High-Density INT8 DSP-LUT Co-Packing Optimization

  • Yunxiang Zhang
  • Biao Sun
  • Weixiong Jiang
  • Yajun Ha
  • Miao Hu
  • Wenfeng Zhao

Convolutional neural networks (CNNs) have been widely adopted for various machine intelligence tasks. Nevertheless, CNNs are still known to be computationally demanding due to the convolutional kernels involving expensive Multiply-ACcumulate (MAC) operations. Recent proposals on hardware-optimal neural network architectures suggest that AdderNet with a lightweight 1-norm based feature extraction kernel can be an efficient alternative to the CNN counterpart, where the expensive MAC operations are substituted with efficient Sum-of-Absolute-Difference (SAD) operations. Nevertheless, AdderNet lacks an efficient hardware implementation methodology comparable to the existing methodologies for CNNs, including efficient quantization, full-integer accelerator implementation, and judicious resource utilization of DSP slices of FPGA devices. In this paper, we present WSQ-AdderNet, a generic framework to quantize and optimize AdderNet-based accelerator designs on embedded FPGA devices. First, we propose a weight standardization technique to facilitate weight quantization in AdderNet. Second, we demonstrate a full-integer quantization hardware implementation strategy, including weight and activation quantization methodologies. Third, we apply DSP packing optimization to maximize the DSP utilization efficiency, where Octo-INT8 can be achieved via DSP-LUT co-packing. Finally, we implement the design using Xilinx Vitis HLS (high-level synthesis) and Vivado on a Xilinx Kria KV-260 FPGA. Our experimental results of ResNet-20 using WSQ-AdderNet demonstrate that the implementations achieve 89.9% inference accuracy with INT8 implementation, which shows little performance loss as compared to the FP32 and INT8 CNN designs. At the hardware level, WSQ-AdderNet achieves up to 3.39× DSP density improvement with nearly the same throughput as compared to the INT8 CNN design. The reduction in DSP utilization makes it possible to deploy large network models on resource-constrained devices. When further scaling up the PE sizes by 39.8%, WSQ-AdderNet can achieve 1.48× throughput improvement while still achieving 2.42× DSP density improvement.

Low-Cost 7T-SRAM Compute-in-Memory Design Based on Bit-Line Charge-Sharing Based Analog-to-Digital Conversion

  • Kyeongho Lee
  • Joonhyung Kim
  • Jongsun Park

Although compute-in-memory (CIM) is considered one of the promising solutions to the memory wall problem, the variations in analog voltage computation and analog-to-digital-converter (ADC) cost still remain as design challenges. In this paper, we present a 7T SRAM CIM that seamlessly supports multiply-accumulation (MAC) operation between 4-bit inputs and 8-bit weights. In the proposed CIM, highly parallel and robust MAC operations are enabled by exploiting the bit-line charge-sharing scheme to simultaneously process multiple inputs. For the readout of analog MAC values, instead of adopting the conventional ADC structure, the bit-line charge-sharing is efficiently used to reduce the implementation cost of the reference voltage generations. Based on the in-SRAM reference voltage generation and the parallel analog readout in all columns, the proposed CIM efficiently reduces ADC power and area cost. In addition, the variation models from Monte-Carlo simulations are also used during training to reduce the accuracy drop due to process variations. An implementation of the 256×64 7T SRAM CIM in a 28nm CMOS process shows that it operates over a wide voltage range from 0.6V to 1.2V with an energy efficiency of 45.8 TOPS/W at 0.6V.

SESSION: Microarchitectural Attacks and Countermeasures

Session details: Microarchitectural Attacks and Countermeasures

  • Rajesh JS
  • Amin Rezaei

Speculative Load Forwarding Attack on Modern Processors

  • Hasini Witharana
  • Prabhat Mishra

Modern processors deliver high performance by utilizing advanced features such as out-of-order execution, branch prediction, speculative execution, and sophisticated buffer management. Unfortunately, these techniques have introduced diverse vulnerabilities including Spectre, Meltdown, and microarchitectural data sampling (MDS). Although Spectre and Meltdown can leak data via memory side channels, MDS has been shown to leak data from the CPU internal buffers in Intel architectures. AMD has reported that its processors are not vulnerable to MDS/Meltdown type attacks. In this paper, we present a Meltdown/MDS type of attack to leak data from the load queue in AMD Zen family architectures. To the best of our knowledge, our approach is the first attempt at developing an attack on AMD architectures using speculative load forwarding to leak data through the load queue. Experimental evaluation demonstrates that our proposed attack is successful on multiple machines with AMD processors. We also explore a lightweight mitigation to defend against the speculative load forwarding attack on modern processors.

Fast, Robust and Accurate Detection of Cache-Based Spectre Attack Phases

  • Arash Pashrashid
  • Ali Hajiabadi
  • Trevor E. Carlson

Modern processors achieve high performance and efficiency by employing techniques such as speculative execution and sharing resources such as caches. However, recent attacks like Spectre and Meltdown exploit the speculative execution of modern processors to leak sensitive information from the system. Many mitigation strategies have been proposed to restrict the speculative execution of processors and protect potential side-channels. Currently, these techniques have shown a significant performance overhead. A solution that can detect memory leaks before the attacker has a chance to exploit them would allow the processor to reduce the performance overhead by enabling protections only when the system is at risk.

In this paper, we propose a mechanism to detect speculative execution attacks that use caches as a side-channel. In this detector we track the phases of a successful attack and raise an alert before the attacker gets a chance to recover sensitive information. We accomplish this through monitoring the microarchitectural changes in the core and caches, and detect the memory locations that can be potential memory data leaks. We achieve 100% accuracy and negligible false positive rate in detecting Spectre attacks and evasive versions of Spectre that the state-of-the-art detectors are unable to detect. Our detector has no performance overhead with negligible power and area overheads.

CASU: Compromise Avoidance via Secure Update for Low-End Embedded Systems

  • Ivan De Oliveira Nunes
  • Sashidhar Jakkamsetti
  • Youngil Kim
  • Gene Tsudik

Guaranteeing runtime integrity of embedded system software is an open problem. Trade-offs between security and other priorities (e.g., cost or performance) are inherent, and resolving them is both challenging and important. The proliferation of runtime attacks that introduce malicious code (e.g., by injection) into embedded devices has prompted a range of mitigation techniques. One popular approach is Remote Attestation (RA), whereby a trusted entity (verifier) checks the current software state of an untrusted remote device (prover). RA yields a timely authenticated snapshot of prover state that verifier uses to decide whether an attack occurred.

Current RA schemes require verifier to explicitly initiate RA, based on some unclear criteria. Thus, in case of prover’s compromise, verifier only learns about it late, upon the next RA instance. While sufficient for compromise detection, some applications would benefit from a more proactive, prevention-based approach. To this end, we construct CASU: Compromise Avoidance via Secure Updates. CASU is an inexpensive hardware/software co-design enforcing: (i) runtime software immutability, thus precluding any illegal software modification, and (ii) authenticated updates as the sole means of modifying software. In CASU, a successful RA instance serves as a proof of successful update, and continuous subsequent software integrity is implicit, due to the runtime immutability guarantee. This obviates the need for RA in between software updates and leads to unobtrusive integrity assurance with guarantees akin to those of prior RA techniques, with better overall performance.

SESSION: Genetic Circuits Meet Ising Machines

Session details: Genetic Circuits Meet Ising Machines

  • Marc Riedel
  • Lei Yang

Technology Mapping of Genetic Circuits: From Optimal to Fast Solutions

  • Tobias Schwarz
  • Christian Hochberger

Synthetic Biology aims to create biological systems from scratch that do not exist in nature. An important method in this context is the engineering of DNA sequences such that cells realize Boolean functions that serve as control mechanisms in biological systems, e.g. in medical or agricultural applications. Libraries of logic gates exist as predefined gene sequences, based on the genetic mechanism of transcriptional regulation. Each individual gate is composed of different biological parts to allow for the differentiation of their output signals. Even gates of the same logic type therefore exhibit different transfer characteristics, i.e. relation from input to output signals. Thus, simulation of the whole network of genetic gates is needed to determine the performance of a genetic circuit. This makes mapping Boolean functions to these libraries much more complicated compared to EDA. Yet, optimal results are desired in the design phase due to high lab implementation costs. In this work, we identify fundamental features of the transfer characteristic of gates based on transcriptional regulation which is widely used in genetic gate technologies. Based on this, we present novel exact (Branch-and-Bound) and heuristic (Branch-and-Bound, Simulated Annealing) algorithms for the problem of technology mapping of genetic circuits and evaluate them using a prominent gate library. In contrast to state-of-the-art tools, all obtained solutions feature a (near) optimal output performance. Our exact method only explores 6.5 % and the heuristics even 0.2 % of the design space.

DaS: Implementing Dense Ising Machines Using Sparse Resistive Networks

  • Naomi Sagan
  • Jaijeet Roychowdhury

Ising machines have generated much excitement in recent years due to their promise for solving hard combinatorial optimization problems. However, achieving physical all-to-all connectivity in IC implementations of large, densely-connected Ising machines remains a key challenge. We present a novel approach, DaS, that uses low-rank decomposition to achieve effectively-dense Ising connectivity using only sparsely interconnected hardware. The innovation consists of two components. First, we use the SVD to find a low-rank approximation of the Ising coupling matrix while maintaining very high accuracy. This decomposition requires substantially fewer nonzeros to represent the dense Ising coupling matrix. Second, we develop a method to translate the low-rank decomposition to a hardware implementation that uses only sparse resistive interconnections. We validate DaS on the MU-MIMO detection problem, important in modern telecommunications. Our results indicate that as problem sizes scale, DaS can achieve dense Ising coupling using only 5%-20% of the resistors needed for brute-force dense connections (which would be physically infeasible in ICs). We also outline a crossbar-style physical layout scheme for realizing sparse resistive networks generated by DaS.

QuBRIM: A CMOS Compatible Resistively-Coupled Ising Machine with Quantized Nodal Interactions

  • Yiqiao Zhang
  • Uday Kumar Reddy Vengalam
  • Anshujit Sharma
  • Michael Huang
  • Zeljko Ignjatovic

Physical Ising machines have been shown to solve combinatoric optimization problems with orders-of-magnitude improvements in speed and energy efficiency over von Neumann systems. However, building such a system is still in its infancy and a scalable, robust implementation remains challenging. CMOS-compatible electronic Ising machines (e.g., [1]) are promising as the mature technology helps bring scale, speed, and energy efficiency to the dynamical system. However, subtle issues can arise when using voltage-controlled transistors to act as programmable resistive coupling. In this paper, we propose a version of resistively-coupled Ising machine using quantized nodal interactions (QuBRIM), which significantly improved the predictability of the coupling resistor. The functionality of QuBRIM is demonstrated by solving the well-known Max-Cut problem using both behavioral and circuit level simulations in 45 nm CMOS technology node. We show that the dynamical system naturally seeks local minima in the objective function’s energy landscape and that by applying spin-fix annealing, the system reaches a global minimum with a high probability.
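As background on why Max-Cut maps naturally onto an Ising machine, the sketch below uses one common formulation: with spins s_i in {-1, +1} and edge weights w_ij, the energy H(s) = Σ w_ij·s_i·s_j satisfies cut(s) = (W − H(s)) / 2, where W is the total edge weight, so a minimum-energy state is a maximum cut. Sign conventions vary between papers, and this brute-force search is purely illustrative; it does not model QuBRIM's circuit or its spin-fix annealing. All names are hypothetical.

```python
from itertools import product


def ising_energy(edges, spins):
    """H(s) = sum over edges (i, j, w) of w * s_i * s_j, with s_i in {-1, +1}."""
    return sum(w * spins[i] * spins[j] for i, j, w in edges)


def cut_value(edges, spins):
    """Total weight of edges whose endpoints have opposite spins."""
    return sum(w for i, j, w in edges if spins[i] != spins[j])


def brute_force_max_cut(n, edges):
    """Exhaustively search all 2**n spin assignments (tiny graphs only)."""
    best = min(product((-1, 1), repeat=n), key=lambda s: ising_energy(edges, s))
    return best, cut_value(edges, best)
```

For a 4-node cycle with unit weights, the minimum-energy assignment alternates spins around the cycle and cuts all four edges.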

SESSION: Energy Efficient Neural Networks via Approximate Computations

Session details: Energy Efficient Neural Networks via Approximate Computations

  • M. Hasan Najafi
  • Vidya Chabria

Combining Gradients and Probabilities for Heterogeneous Approximation of Neural Networks

  • Elias Trommer
  • Bernd Waschneck
  • Akash Kumar

This work explores the search for heterogeneous approximate multiplier configurations for neural networks that produce high accuracy and low energy consumption. We discuss the validity of additive Gaussian noise added to accurate neural network computations as a surrogate model for behavioral simulation of approximate multipliers. The continuous and differentiable properties of the solution space spanned by the additive Gaussian noise model are used as a heuristic that generates meaningful estimates of layer robustness without the need for combinatorial optimization techniques. Instead, the amount of noise injected into the accurate computations is learned during network training using backpropagation. A probabilistic model of the multiplier error is presented to bridge the gap between the domains; the model estimates the standard deviation of the approximate multiplier error, connecting solutions in the additive Gaussian noise space to actual hardware instances. Our experiments show that the combination of heterogeneous approximation and neural network retraining reduces the energy consumption for multiplications by 70% to 79% for different ResNet variants on the CIFAR-10 dataset with a Top-1 accuracy loss below one percentage point. For the more complex Tiny ImageNet task, our VGG16 model achieves a 53 % reduction in energy consumption with a drop in Top-5 accuracy of 0.5 percentage points. We further demonstrate that our error model can predict the parameters of an approximate multiplier in the context of the commonly used additive Gaussian noise (AGN) model with high accuracy. Our software implementation is available under

Tunable Precision Control for Approximate Image Filtering in an In-Memory Architecture with Embedded Neurons

  • Ayushi Dube
  • Ankit Wagle
  • Gian Singh
  • Sarma Vrudhula

This paper presents a novel hardware-software co-design consisting of a Processing in-Memory (PiM) architecture with embedded neural processing elements (NPE) that are highly reconfigurable. The PiM platform and proposed approximation strategies are employed for various image filtering applications while providing the user with fine-grain dynamic control over energy efficiency, precision, and throughput (EPT). The proposed co-design can change the Peak Signal to Noise Ratio (PSNR, output quality metric for image filtering applications) from 25dB to 50dB (acceptable PSNR range for image filtering applications) without incurring any extra cost in terms of energy or latency. While switching from accurate to approximate mode of computation in the proposed co-design, the maximum improvement in energy efficiency and throughput is 2X. However, the gains in energy efficiency against a MAC-based PE array with the proposed memory platform are 3X-6X. The corresponding improvements in throughput are 2.26X-4.52X, respectively.
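For readers unfamiliar with the quality metric quoted above, PSNR is conventionally defined as 10·log10(MAX² / MSE), where MSE is the mean squared error between the reference and the filtered image and MAX is the largest possible pixel value. The sketch below assumes 8-bit pixels (MAX = 255) and images given as flat lists of equal length; the function name is hypothetical and the code is independent of the paper's architecture.

```python
import math


def psnr(reference, filtered, max_value=255):
    """Peak Signal to Noise Ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(filtered) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - f) ** 2 for r, f in zip(reference, filtered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, an average squared error of 1 on 8-bit pixels gives roughly 48 dB, near the top of the 25dB to 50dB range mentioned in the abstract.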

AppGNN: Approximation-Aware Functional Reverse Engineering Using Graph Neural Networks

  • Tim Bücher
  • Lilas Alrahis
  • Guilherme Paim
  • Sergio Bampi
  • Ozgur Sinanoglu
  • Hussam Amrouch

The globalization of the Integrated Circuit (IC) market is attracting an ever-growing number of partners, while remarkably lengthening the supply chain. Thereby, security concerns, such as those imposed by functional Reverse Engineering (RE), have become quintessential. RE leads to disclosure of confidential information to competitors, potentially enabling the theft of intellectual property. Traditional functional RE methods analyze a given gate-level netlist through employing pattern matching towards reconstructing the underlying basic blocks, and hence, reverse engineer the circuit’s function.

In this work, we are the first to demonstrate that applying Approximate Computing (AxC) principles to circuits significantly improves the resiliency against RE. This is attributed to the increased complexity in the underlying pattern-matching process. The resiliency remains effective even for Graph Neural Networks (GNNs) that are presently one of the most powerful state-of-the-art techniques in functional RE. Using AxC, we demonstrate a substantial reduction in GNN average classification accuracy, from 98% to a mere 53%. To surmount the challenges introduced by AxC in RE, we propose the highly promising AppGNN platform, which enables GNNs (still being trained on exact circuits) to: (i) perform accurate classifications, and (ii) reverse engineer the circuit functionality, notwithstanding the applied approximation technique. AppGNN accomplishes this by implementing a novel graph-based node sampling approach that mimics generic approximation methodologies, requiring zero knowledge of the targeted approximation type.

We perform an extensive evaluation targeting wide-ranging adder and multiplier circuits that are approximated using various AxC techniques, including state-of-the-art evolutionary-based approaches. We show that, using our method, we can improve the classification accuracy from 53% to 81% when classifying approximate adder circuits that have been generated using evolutionary algorithms, which our method is oblivious of. Our AppGNN framework is publicly available under

Seprox: Sequence-Based Approximations for Compressing Ultra-Low Precision Deep Neural Networks

  • Aradhana Mohan Parvathy
  • Sarada Krithivasan
  • Sanchari Sen
  • Anand Raghunathan

Compression techniques such as quantization and pruning are indispensable for deploying state-of-the-art Deep Neural Networks (DNNs) on resource-constrained edge devices. Quantization is widely used in practice – many commercial platforms already support 8-bits, with recent trends towards ultra-low precision (4-bits and below). Pruning, which increases network sparsity (incidence of zero-valued weights), enables compression by storing only the nonzero weights and their indices. Unfortunately, the compression benefits of pruning deteriorate or even vanish in ultra-low precision DNNs. This is due to (i) the unfavorable tradeoff between the number of bits needed to store a weight (which reduces with lower precision) and the number of bits needed to encode an index (which remains unchanged), and (ii) the lower sparsity levels that are achievable at lower precisions.
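The tradeoff in point (i) can be made concrete with a small back-of-the-envelope calculation. The sketch below assumes a simple sparse format that stores one fixed-width index next to every nonzero weight; the 4-bit index width and the function names are hypothetical choices for illustration, not figures from the paper.

```python
def dense_bits(num_weights, weight_bits):
    """Storage for an unpruned layer: every weight is stored."""
    return num_weights * weight_bits


def sparse_bits(num_weights, sparsity, weight_bits, index_bits):
    """Storage when only nonzero weights and their indices are kept."""
    nonzeros = round(num_weights * (1.0 - sparsity))
    return nonzeros * (weight_bits + index_bits)


def break_even_sparsity(weight_bits, index_bits):
    """Sparsity above which the sparse format is smaller than the dense one."""
    return index_bits / (weight_bits + index_bits)


# With a (hypothetical) 4-bit index, 8-bit weights need about 33% sparsity
# to pay off, whereas 2-bit weights need about 67%.
be_8bit = break_even_sparsity(8, 4)
be_2bit = break_even_sparsity(2, 4)
```

At 50% sparsity, this model compresses an 8-bit layer but inflates a 2-bit layer, which is the effect the abstract describes.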

We propose Seprox, a new compression scheme that overcomes the aforementioned challenges by exploiting two key observations about ultra-low precision DNNs. First, with lower precision, fewer weight values are possible, leading to increased incidence of frequently-occurring weights and weight sequences. Second, some weight values occur rarely and can be eliminated by replacing them with similar values. Leveraging these insights, Seprox encodes frequently-occurring weight sequences (as opposed to individual weights) while using the eliminated weight values to encode them, thereby avoiding indexing overheads and achieving higher compression. Additionally, Seprox uses approximation techniques to increase the frequencies of the encoded sequences. Across six ultra-low precision DNNs trained on the CIFAR-10 and ImageNet datasets, Seprox achieves model compressions, energy improvements and speed-ups of up to 35.2%, 14.8% and 18.2%, respectively.

SESSION: Algorithms and Tools for Security Analysis and Secure Hardware Design

Session details: Algorithms and Tools for Security Analysis and Secure Hardware Design

  • Rosario Cammarota
  • Satwik Patnaik

Evaluating the Security of eFPGA-Based Redaction Algorithms

  • Amin Rezaei
  • Raheel Afsharmazayejani
  • Jordan Maynard

Hardware IP owners must envision procedures to avoid piracy and overproduction of their designs under a fabless paradigm. A newly proposed technique to obfuscate critical components in a logic design is called eFPGA-based redaction, which replaces a sensitive sub-circuit with an embedded FPGA, and the eFPGA is configured to perform the same functionality as the missing sub-circuit. In this case, the configuration bitstream acts as a hidden key only known to the hardware IP owner. In this paper, we first evaluate the security promise of the existing eFPGA-based redaction algorithms as a preliminary study. Then, we break eFPGA-based redaction schemes by an initial but not necessarily efficient attack named DIP Exclusion that excludes problematic input patterns from checking in a brute-force manner. Finally, by combining cycle breaking and unrolling, we propose a novel and powerful attack called Break & Unroll that is able to recover the bitstream of state-of-the-art eFPGA-based redaction schemes in a relatively short time even with the existence of hard cycles and large keys. This study reveals that the common perception that eFPGA-based redaction is secure by default against oracle-guided attacks is a misconception. It also shows that additional research on how to systematically create an exponential number of non-combinational hard cycles is required to secure eFPGA-based redaction schemes.

An Approach to Unlocking Cyclic Logic Locking: LOOPLock 2.0

  • Pei-Pei Chen
  • Xiang-Min Yang
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

Cyclic logic locking is a new type of SAT-resistant technique in hardware security. Recently, LOOPLock 2.0 was proposed, which is a cyclic logic locking method that deliberately creates cycles in the locked circuit to resist SAT Attack, CycSAT, BeSAT, and Removal Attack simultaneously. The key idea of LOOPLock 2.0 is that the resultant circuit is still cyclic regardless of whether the key vector is correct. This property thwarts attackers and has proven successful in defending against them. In this paper, we propose an unlocking approach to LOOPLock 2.0 based on structure analysis and SAT solvers. Specifically, we identify and remove non-combinational cycles in the locked circuit before running SAT solvers. The experimental results show that the proposed unlocking approach is promising.

Garbled EDA: Privacy Preserving Electronic Design Automation

  • Mohammad Hashemi
  • Steffi Roy
  • Fatemeh Ganji
  • Domenic Forte

The complexity of modern integrated circuits (ICs) necessitates collaboration between multiple distrusting parties, including third-party intellectual property (3PIP) vendors, design houses, CAD/EDA tool vendors, and foundries, which jeopardizes confidentiality and integrity of each party’s IP. IP protection standards and the existing techniques proposed by researchers are ad hoc and vulnerable to numerous structural, functional, and/or side-channel attacks. Our framework, Garbled EDA, proposes an alternative direction through formulating the problem in a secure multi-party computation setting, where the privacy of IPs, CAD tools, and process design kits (PDKs) is maintained. As a proof-of-concept, Garbled EDA is evaluated in the context of simulation, where multiple IP description formats (Verilog, C, S) are supported. Our results demonstrate a reasonable logical-resource cost and negligible memory overhead. To further reduce the overhead, we present another efficient implementation methodology, feasible when the resource utilization is a bottleneck, but the communication between two parties is not restricted. Interestingly, this implementation is private and secure even in the presence of malicious adversaries attempting to, e.g., gain access to PDKs or in-house IPs of the CAD tool providers.

Don’t CWEAT It: Toward CWE Analysis Techniques in Early Stages of Hardware Design

  • Baleegh Ahmad
  • Wei-Kai Liu
  • Luca Collini
  • Hammond Pearce
  • Jason M. Fung
  • Jonathan Valamehr
  • Mohammad Bidmeshki
  • Piotr Sapiecha
  • Steve Brown
  • Krishnendu Chakrabarty
  • Ramesh Karri
  • Benjamin Tan

To help prevent hardware security vulnerabilities from propagating to later design stages where fixes are costly, it is crucial to identify security concerns as early as possible, such as in RTL designs. In this work, we investigate the practical implications and feasibility of producing a set of security-specific scanners that operate on Verilog source files. The scanners indicate parts of code that might contain one of a set of MITRE’s common weakness enumerations (CWEs). We explore the CWE database to characterize the scope and attributes of the CWEs and identify those that are amenable to static analysis. We prototype scanners and evaluate them on 11 open source designs – 4 system-on-chips (SoC) and 7 processor cores – and explore the nature of identified weaknesses. Our analysis reported 53 potential weaknesses in the OpenPiton SoC used in Hack@DAC-21, 11 of which we confirmed as security concerns.

SESSION: Special Session: Making ML Reliable: From Devices to Systems to Software

Session details: Special Session: Making ML Reliable: From Devices to Systems to Software

  • Krishnendu Chakrabarty
  • Partha Pande

Reliable Computing of ReRAM Based Compute-in-Memory Circuits for AI Edge Devices

  • Meng-Fan Chang
  • Je-Ming Hung
  • Ping-Cheng Chen
  • Tai-Hao Wen

Compute-in-memory macros based on non-volatile memory (nvCIM) are a promising approach to break through the memory bottleneck for artificial intelligence (AI) edge devices; however, the development of these devices involves unavoidable tradeoffs between reliability, energy efficiency, computing latency, and readout accuracy. This paper outlines the background of ReRAM-based nvCIM as well as the major challenges in its further development, including process variation in ReRAM devices and transistors and the small signal margins associated with variation in input-weight patterns. This paper also investigates the error model of an nvCIM macro, and the corresponding degradation of inference accuracy as a function of the error model when using nvCIM macros. Finally, we summarize recent trends and advances in the development of reliable ReRAM-based nvCIM macros.

Fault-Tolerant Deep Learning Using Regularization

  • Biresh Kumar Joardar
  • Aqeeb Iqbal Arka
  • Janardhan Rao Doppa
  • Partha Pratim Pande

Resistive random-access memory (ReRAM) has become one of the most popular choices of hardware implementation for machine learning application workloads. However, these devices exhibit non-ideal behavior, which presents a challenge towards widespread adoption. Training/inferencing on these faulty devices can lead to poor prediction accuracy. Moreover, existing fault-tolerant methods are associated with high implementation overheads. In this paper, we present some new directions for solving reliability issues using software solutions. These software-based methods are inherent in deep learning training/inferencing, and they can also be used to address hardware reliability issues. These methods prevent accuracy drop during training/inferencing due to unreliable ReRAMs and are associated with lower area and power overheads.

Machine Learning for Testing Machine-Learning Hardware: A Virtuous Cycle

  • Arjun Chaudhuri
  • Jonti Talukdar
  • Krishnendu Chakrabarty

The ubiquitous application of deep neural networks (DNN) has led to a rise in demand for AI accelerators. DNN-specific functional criticality analysis identifies faults that cause measurable and significant deviations from acceptable requirements such as the inferencing accuracy. This paper examines the problem of classifying structural faults in the processing elements (PEs) of systolic-array accelerators. We first present a two-tier machine-learning (ML) based method to assess the functional criticality of faults. While supervised learning techniques can be used to accurately estimate fault criticality, it requires a considerable amount of ground truth for model training. We therefore describe a neural-twin framework for analyzing fault criticality with a negligible amount of ground-truth data. We further describe a topological and probabilistic framework to estimate the expected number of PE’s primary outputs (POs) flipping in the presence of defects and use the PO-flip count as a surrogate for determining fault criticality. We demonstrate that the combination of PO-flip count and neural twin-enabled sensitivity analysis of internal nets can be used as additional features in existing ML-based criticality classifiers.

Observation Point Insertion Using Deep Learning

  • Bonita Bhaskaran
  • Sanmitra Banerjee
  • Kaushik Narayanun
  • Shao-Chun Hung
  • Seyed Nima Mozaffari Mojaveri
  • Mengyun Liu
  • Gang Chen
  • Tung-Che Liang

Silent Data Corruption (SDC) is one of the critical problems in the field of testing, where errors or corruption do not manifest externally. As a result, there is increased focus on improving the outgoing quality of dies by striving for better correlation between structural and functional patterns to achieve a low DPPM. This is very important for NVIDIA’s chips due to the various markets we target; for example, automotive and data center markets have stringent in-field testing requirements. One aspect of these efforts is to also target better testability while incurring lower test cost. Since structural testing is faster than functional tests, it is important to make these structural test patterns as effective as possible and free of test escapes. However, with the rising cell count in today’s digital circuits, it is becoming increasingly difficult to sensitize faults and propagate the fault effects to scan-flops or primary outputs. Hence, methods to insert observation points to facilitate the detection of hard-to-detect (HtD) faults are being increasingly explored. In this work, we propose an Observation Point Insertion (OPI) scheme using deep learning with two goals: 1) better-quality test points than commercial EDA tools, leading to a potentially lower pattern count, and 2) faster turnaround time to generate the test points. In order to achieve better pattern compaction than commercial EDA tools, we employ Graph Convolutional Networks (GCNs) to learn the topology of logic circuits along with the features that influence their testability. The graph structures are subsequently used to train two GCN-type deep learning models: the first model predicts signal probabilities at different nets and the second model uses these signal probabilities along with other features to predict the reduction in test-pattern count when OPs are inserted at different locations in the design. The features we consider include structural features like gate type, gate logic, reconvergent-fanouts and testability features like SCOAP. Our simulation results indicate that the proposed machine learning models can predict the probabilistic testability metrics with reasonable accuracy and can identify observation points that reduce pattern count.

SESSION: Autonomous Systems and Machine Learning on Embedded Systems

Session details: Autonomous Systems and Machine Learning on Embedded Systems

  • Ibrahim (Abe) Elfadel
  • Mimi Xie

Romanus: Robust Task Offloading in Modular Multi-Sensor Autonomous Driving Systems

  • Luke Chen
  • Mohanad Odema
  • Mohammad Abdullah Al Faruque

Due to the high performance and safety requirements of self-driving applications, the complexity of modern autonomous driving systems (ADS) has been growing, instigating the need for more sophisticated hardware which could add to the energy footprint of the ADS platform. Addressing this, edge computing is poised to encompass self-driving applications, enabling the compute-intensive autonomy-related tasks to be offloaded for processing at compute-capable edge servers. Nonetheless, the intricate hardware architecture of ADS platforms, in addition to the stringent robustness demands, sets forth complications for task offloading which are unique to autonomous driving. Hence, we present ROMANUS, a methodology for robust and efficient task offloading for modular ADS platforms with multi-sensor processing pipelines. Our methodology entails two phases: (i) the introduction of efficient offloading points along the execution path of the involved deep learning models, and (ii) the implementation of a runtime solution based on Deep Reinforcement Learning to adapt the operating mode according to variations in the perceived road scene complexity, network connectivity, and server load. Experiments on the object detection use case demonstrated that our approach is 14.99% more energy-efficient than pure local execution while achieving a 77.06% reduction in risky behavior from a robustness-agnostic offloading baseline.

ModelMap: A Model-Based Multi-Domain Application Framework for Centralized Automotive Systems

  • Soham Sinha
  • Anam Farrukh
  • Richard West

This paper presents ModelMap, a model-based multi-domain application development framework for DriveOS, our in-house centralized vehicle management software system. DriveOS runs on multicore x86 machines and uses hardware virtualization to host isolated RTOS and Linux guest OS sandboxes. In this work, we design Simulink interfaces for model-based vehicle control function development across multiple sandboxed domains in DriveOS. ModelMap provides abstractions to: (1) automatically generate periodic tasks bound to threads in different OS domains, (2) establish cross-domain synchronous and asynchronous communication interfaces, and (3) handle USB-based CAN I/O in Simulink. We introduce the concept of a nested binary, for the deployment of ELF binary executable code in different sandboxed domains. We demonstrate ModelMap using a combination of synthetic benchmarks, and experiments with Simulink models of a CAN Gateway and HVAC service running on an electric car. ModelMap eases the development of applications, which are shown to achieve industry-target performance using a multicore hardware platform in DriveOS.

INDENT: Incremental Online Decision Tree Training for Domain-Specific Systems-on-Chip

  • Anish Krishnakumar
  • Radu Marculescu
  • Umit Ogras

The performance and energy efficiency potential of heterogeneous architectures has fueled domain-specific systems-on-chip (DSSoCs) that integrate general-purpose and domain-specialized hardware accelerators. Decision trees (DTs) perform high-quality, low-latency task scheduling to utilize the massive parallelism and heterogeneity in DSSoCs effectively. However, offline trained DT scheduling policies can quickly become ineffective when applications or hardware configurations change. There is a critical need for runtime techniques to train DTs incrementally without sacrificing accuracy since current training approaches have large memory and computational power requirements. To address this need, we propose INDENT, an incremental online DT framework to update the scheduling policy and adapt it to unseen scenarios. INDENT updates DT schedulers at runtime using only 1–8% of the original training data embedded during training. Thorough evaluations with hardware platforms and DSSoC simulators demonstrate that INDENT performs within 5% of a DT trained from scratch using the entire dataset and outperforms current state-of-the-art approaches.

SGIRR: Sparse Graph Index Remapping for ReRAM Crossbar Operation Unit and Power Optimization

  • Cheng-Yuan Wang
  • Yao-Wen Chang
  • Yuan-Hao Chang

Resistive Random Access Memory (ReRAM) crossbars are a promising processing-in-memory technology to reduce the enormous data movement overheads of large-scale graph processing between computation and memory units. ReRAM cells can combine with crossbar arrays to effectively accelerate graph processing, and partitioning ReRAM crossbar arrays into Operation Units (OUs) can further improve computation accuracy of ReRAM crossbars. The operation unit utilization was not optimized in previous work, incurring extra cost. This paper proposes a two-stage algorithm with a crossbar OU-aware scheme for sparse graph index remapping for ReRAM (SGIRR) crossbars, mitigating the influence of graph sparsity. In particular, this paper is the first to consider the given operation unit size with the remapping index algorithm, optimizing the operation unit and power dissipation. Experimental results show that our proposed algorithm reduces the utilization of crossbar OUs by 31.4%, improves the total OU block usage by 10.6%, and reduces energy consumption by 17.2%, on average.


Proceedings of the 19th ACM-IEEE International Conference on Formal Methods and Models for System Design

Polynomial word-level verification of arithmetic circuits

  • Mohammed Barhoush
  • Alireza Mahzoon
  • Rolf Drechsler

Verifying the functional correctness of a circuit is often the most time-consuming part of the design process. Recently, word-level formal verification methods, e.g., Binary Moment Diagram (BMD) and Symbolic Computer Algebra (SCA), have reported very good results for proving the correctness of arithmetic circuits. However, these techniques still frequently fail due to memory or time requirements. The unknown complexity bounds of these techniques make it impossible to predict before invoking the verification tool whether it will successfully terminate or run for an indefinite amount of time.

In this paper, we formally prove that for integer arithmetic circuits, the entire verification process requires at most linear space and quadratic time with respect to the size of the circuit function. This is shown for the two main word-level verification methods: backward construction using BMD and backward substitution using SCA. We support the architectures which are used in the implementation of integer polynomial operations, e.g., X^3 – XY^2 + XY. Finally, we show in practice that the required space and run times of the word-level methods match the predicted results in theory when it comes to the verification of different arithmetic circuits, including exponentiation circuits with different power values (X^P : 2 ≤ P ≤ 7) and more complicated circuits (e.g., X^2 + XY + X).

Simplification of numeric variables for PLC model checking

  • Ignacio D. Lopez-Miguel
  • Borja Fernández Adiego
  • Jean-Charles Tournier
  • Enrique Blanco Viñuela
  • Juan A. Rodriguez-Aguilar

Software model checking has recently started to be applied in the verification of programmable logic controller (PLC) programs. It works efficiently when the number of input variables is limited, their interaction is small and, thus, the number of states the program can reach is not large. As observed in the large code base of the CERN industrial PLC applications, this is usually not the case: it thus leads to the well-known state-space explosion problem, making it impossible to perform model checking. One of the main reasons that causes state-space explosion is the inclusion of numeric variables due to the wide range of values they can take. In this paper, we propose an approach to discretize PLC input numeric variables (modelled as non-deterministic). This discretization is complemented with a set of transformations on the control-flow automaton that models the PLC program so that no extra behaviours are added. This approach is then quantitatively evaluated with a set of empirical tests using the PLC model checking framework PLCverif and three different state-of-the-art model checkers (CBMC, nuXmv, and Theta), showing beneficial results for BDD-based model checkers.
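To give a flavour of why discretizing a numeric input can shrink the state space without changing behaviour, the sketch below handles a deliberately simplified case: an integer input that the program only ever compares against constants. This restriction, and the helper names `representatives` and `signature`, are assumptions made for illustration; the paper's transformations on the control-flow automaton are more general and are not reproduced here.

```python
def representatives(lo, hi, constants):
    """Pick a small set of values in [lo, hi] covering every comparison outcome.

    Assumes the variable is an integer that is only compared
    (<, <=, ==, >=, >) with the given constants.
    """
    reps = {lo, hi}
    for c in constants:
        for v in (c - 1, c, c + 1):
            if lo <= v <= hi:
                reps.add(v)
    return sorted(reps)


def signature(x, constants):
    """Outcome of comparing x with each constant: -1 (less), 0 (equal), 1 (greater)."""
    return tuple((x > c) - (x < c) for c in constants)


# A 0..100 input compared against 10 and 50 needs 8 values instead of 101.
reps = representatives(0, 100, [10, 50])
```

Every value in the original range has a representative with the same comparison outcomes, so a program that only branches on those comparisons cannot distinguish the reduced domain from the full one.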

Enforcement FSMs: specification and verification of non-functional properties of program executions on MPSoCs

  • Khalil Esper
  • Stefan Wildermann
  • Jürgen Teich

Many embedded system applications impose hard real-time, energy or safety requirements on corresponding programs typically concurrently executed on a given MPSoC target platform. Even when mutually isolating applications in space or time, the enforcement of such properties, e.g., by adjusting the number of processors allocated to a program or by scaling the voltage/frequency mode of involved processors, is a difficult problem to solve, particularly in view of typically largely varying environmental input (workload) per execution. In this paper, we formalize the related control problem using finite state machine models for the uncertain environment determining the workload, the system response (feedback), as well as the enforcer strategy. The contributions of this paper are as follows: a) Rather than trace-based simulation, the uncertain environment is modeled by a discrete-time Markov chain (DTMC) as a random process to characterize possible input sequences an application may experience. b) A number of important verification goals to analyze different enforcer FSMs are formulated in PCTL for the resulting stochastic verification problem, i.e., the likelihood of violating a timing or energy constraint, or the expected number of steps for a system to return to a given execution time corridor. c) Applying stochastic model checking, i.e., PRISM to analyze and compare enforcer FSMs in these properties, and finally d) proposing an approach for reducing the environment DTMC by partitioning equivalent environmental states (i.e., input states leading to an equal system response in each MPSoC mode) such that verification times can be reduced by orders of magnitude to just a few ms for real-world examples.
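The kind of quantity targeted in contribution b), such as the likelihood of reaching a violating state within a bounded number of steps, can be illustrated on a toy DTMC. The sketch below is a hand-rolled value iteration for bounded reachability, not PRISM and not the authors' models; the two-state chain and the function name `bounded_reach` are hypothetical.

```python
def bounded_reach(P, targets, steps):
    """Probability, per start state, of reaching `targets` within `steps` steps.

    P is a row-stochastic transition matrix given as a list of lists.
    Target states are treated as absorbing for the purpose of the query.
    """
    n = len(P)
    x = [1.0 if s in targets else 0.0 for s in range(n)]
    for _ in range(steps):
        x = [
            1.0 if s in targets else sum(P[s][t] * x[t] for t in range(n))
            for s in range(n)
        ]
    return x


# State 0: constraint met; state 1: constraint violated (hypothetical numbers).
P = [
    [0.9, 0.1],
    [0.0, 1.0],
]
prob_violation_2_steps = bounded_reach(P, {1}, 2)[0]  # 1 - 0.9**2
```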

LION: real-time I/O transfer control for massively parallel processor arrays

  • Dominik Walter
  • Jürgen Teich

The performance of many accelerator architectures depends on the communication with external memory. During execution, new I/O data is continuously transferred back and forth between the accelerator and memory. This data exchange is very often performance-critical, and careful orchestration is thus vital. To satisfy the I/O demand for accelerators of loop nests, it was shown that the individual reads and writes can be merged into larger blocks, which are subsequently transferred by a single DMA transfer. Furthermore, the order in which such DMA transfers must be issued was shown to be reducible to a real-time task scheduling problem to be solved at run time. Rather than just concepts, we investigate in this paper efficient algorithms, data structures and their implementation in hardware of such a programmable Loop I/O Controller architecture called LION that only needs to be synthesized once for each processor array size and I/O buffer configuration, thus supporting a large class of processor arrays. Based on a proposed heap-based priority queue, LION is able to issue a new DMA request to a memory bus every 6 cycles. Even on a simple FPGA prototype running at just 200 MHz, this allows for more than 33 million DMA requests to be issued per second. Since the execution time of a typical DMA request is in general at least one order of magnitude longer, we can conclude that this rate is sufficient to fully utilize a given memory interface. Finally, we present implementations on FPGA and also 22nm FDX ASIC showing that the overall overhead of a LION typically amounts to less than 5% of an overall processor array design.
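The throughput figure follows directly from the clock rate and the issue interval: 200 MHz divided by 6 cycles per request is about 33.3 million requests per second. The sketch below reproduces that arithmetic and shows a software analogue of a heap-based priority queue that issues pending transfers in order of urgency. Ordering by earliest deadline is an assumption made here for illustration; the abstract does not state which priority LION uses, and the names below are hypothetical.

```python
import heapq

CLOCK_HZ = 200_000_000      # FPGA prototype clock from the abstract
CYCLES_PER_REQUEST = 6      # one DMA request issued every 6 cycles

requests_per_second = CLOCK_HZ / CYCLES_PER_REQUEST  # about 33.3 million


def issue_order(transfers):
    """Return transfer names in earliest-deadline-first order.

    `transfers` is an iterable of (deadline, name) pairs; the binary heap
    keeps the most urgent pending transfer at the top.
    """
    heap = list(transfers)
    heapq.heapify(heap)
    order = []
    while heap:
        _deadline, name = heapq.heappop(heap)
        order.append(name)
    return order


order = issue_order([(30, "write_C"), (10, "read_A"), (20, "read_B")])
```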

Learning optimal decisions for stochastic hybrid systems

  • Mathis Niehage
  • Arnd Hartmanns
  • Anne Remke

We apply reinforcement learning to approximate the optimal probability that a stochastic hybrid system satisfies a temporal logic formula. We consider systems with (non)linear continuous dynamics, random events following general continuous probability distributions, and discrete nondeterministic choices. We present a discretized view of states to the learner, but simulate the continuous system. Once we have learned a near-optimal scheduler resolving the choices, we use statistical model checking to estimate its probability of satisfying the formula. We implemented the approach using Q-learning in the tools HYPEG and modes, which support Petri net- and hybrid automata-based models, respectively. Via two case studies, we show the feasibility of the approach, and compare its performance and effectiveness to existing analytical techniques for a linear model. We find that our new approach quickly finds near-optimal prophetic as well as non-prophetic schedulers, which maximize or minimize the probability that a specific signal temporal logic property is satisfied.
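For readers unfamiliar with Q-learning, the core of the method is a single update rule applied after each simulated step. The sketch below shows the standard tabular update over discretized states; the dictionary-based table, the state and action labels, and the default learning rate and discount factor are illustrative assumptions, not details of the HYPEG or modes implementations.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value.

    q maps (state, action) pairs to estimated values; missing entries count as 0.
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


actions = ["wait", "act"]
q = {}
first = q_update(q, "s0", "act", 1.0, "s1", actions)   # empty table: 0.1 * 1.0
q[("s1", "wait")] = 2.0                                # pretend s1 was explored
second = q_update(q, "s0", "act", 1.0, "s1", actions)  # now bootstraps from s1
```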

A secure insulin infusion system using verification monitors

  • Abhinandan Panda
  • Srinivas Pinisetty
  • Partha Roop

Wearable and implantable medical devices are being increasingly deployed for diagnosis, monitoring, and to provide therapy for critical medical conditions. Such medical devices are examples of safety-critical, cyber-physical systems. In this paper we focus on insulin infusion systems (IISs), which are used by diabetics to maintain safe blood glucose levels. These systems support wireless features, introducing potential vulnerabilities. Although these devices go through rigorous safety certification processes, these processes are not able to mitigate security threats. Based on published literature, attackers can remotely command a device to inject an incorrect amount of insulin, thereby posing a threat to a patient’s life. While prior work based on formal methods has been proposed to detect potential attack vectors using different forms of static analysis, it has limitations in preventing attacks at run-time. Also, as these devices are safety critical, it is not possible to apply security patches when new types of attacks are detected, due to the need for recertification.

This paper addresses these limitations by developing a formal framework for the detection of cyber-physical attacks on an IIS. First, we propose a wearable device that senses the familiar ECG to detect attacks. Thus, this device is separate from the insulin infusion system, ensuring no need for recertification of IISs. To facilitate the design of this device, we establish a correlation of ECG intervals and blood glucose levels using statistical analysis. This helps us in proposing a framework for security policy mining using the developed statistical analysis. This paves the way for the design of formal verification monitors for IISs for the first time. We evaluate the performance of the verification monitor, which demonstrates the technical feasibility of designing wearable devices for attack detection of IISs. Our approach is amenable to the application of security patches when new attack vectors are detected, making the approach ideal for run-time monitoring of medical CPSs.

Translating structured sequential programs to dataflow graphs

  • Klaus Schneider

In this paper, a translation from structured sequential programs to equivalent dataflow process networks (DPNs) is presented that is based on a carefully chosen set of nodes including load/store operations to access a shared global memory. For every data structure stored in the main memory, we use corresponding tokens to enforce the sequential ordering of load/store operations accessing that data structure as far as needed. Except for the load/store nodes, all nodes obey the Kahn principle so that they are deterministic in the sense that the same inputs are always mapped to the same outputs regardless of the execution schedule of the nodes. Due to the sequential ordering of load/store nodes, determinacy is also maintained by them. Moreover, the generated DPNs are quasi-static, i.e., they have schedules that are bounded in a very strict sense: For every statement of the sequential program, the corresponding DPN behaves like a homogeneous synchronous actor, i.e., it consumes one value of each input port and will finally provide one value on each output port. Hence, no more than one value needs to be stored in each buffer.

Online monitoring of spatio-temporal properties for imprecise signals

  • Ennio Visconti
  • Ezio Bartocci
  • Michele Loreti
  • Laura Nenzi

From biological systems to cyber-physical systems, monitoring the behavior of such dynamical systems often requires reasoning about complex spatio-temporal properties of physical and computational entities that are dynamically interconnected and arranged in a particular spatial configuration. Spatio-Temporal Reach and Escape Logic (STREL) is a recent logic-based formal language designed to specify and reason about spatio-temporal properties. STREL considers each system’s entity as a node of a dynamic weighted graph representing its spatial arrangement. Each node generates a set of mixed-analog signals describing the evolution over time of computational and physical quantities characterizing the node’s behavior. While there are offline algorithms available for monitoring STREL specifications over logged simulation traces, here we investigate for the first time an online algorithm enabling the runtime verification during the system’s execution or simulation. Our approach extends the original framework by considering imprecise signals and by enhancing the logics’ semantics with the possibility to express partial guarantees about the conformance of the system’s behavior with its specification. Finally, we demonstrate our approach in a real-world environmental monitoring case study.

Verified functional programming of an IoT operating system’s bootloader

  • Shenghao Yuan
  • Jean-Pierre Talpin

The fault of one device on a grid may incur severe economic or physical damages. Among the many critical components in such IoT devices, the operating system’s bootloader comes first to initiate the trusted function of the device on the network. However, a bootloader uses hardware-dependent features that make its functional correctness proof difficult. This paper uses verified programming to automate the verification of both the C libraries and the assembly boot sequence of such a real-world bootloader in an operating system for ARM-based IoT devices: RIoT. We first define the ARM ISA specification, semantics and properties in F* to model its critical assembly code boot sequence. We then use Low*, a DSL rendering a C-like memory model in F*, to implement the complete bootloader library and verify its functional correctness and memory safety. Other than fixing potential faults and vulnerabilities in the source C and ASM bootloader, our evaluation provides an optimized and formally documented code structure, a reasonable specification/implementation ratio, a high degree of proof automation and an equally efficient generated code.

Controller verification meets controller code: a case study

  • Felix Freiberger
  • Stefan Schupp
  • Holger Hermanns
  • Erika Ábrahám

Cyber-physical systems are notoriously hard to verify due to the complex interaction between continuous physical behavior and discrete control. A widespread and important class is formed by digital controllers that operate on fixed control cycles to interact with the physical environment they are embedded in. This paper presents a case study for integrating such controllers into a rigorous verification method for cyber-physical systems, using flowpipe-based verification methods to verify legally binding requirements for electrified vehicles to a custom bike design. The controller is integrated in the underlying model in a way that correctly represents the input discretization performed by any digital controller.

Translation of continuous function charts to imperative synchronous quartz programs

  • Marcel Christian Werner
  • Klaus Schneider

Programmable logic controllers operating in a sequential execution scheme are widely used for various applications in industrial environments with real-time requirements. The graphical programming languages described in the third part of IEC 61131 are often intended to perform open and closed loop control tasks. Continuous Function Charts (CFCs) represent an additional language accepted in practice which can be interpreted as an extension of IEC 61131-3 Function Block Diagrams. Those charts allow more flexible positioning and interconnection of function blocks, but can quickly become difficult to manage. Furthermore, the sequential execution order forces a sequential processing of possible independent and thus possibly parallel program paths. The question arises whether a translation of existing CFCs to synchronous programs considering independent actions can lead to a more manageable software model. While current formalization approaches for CFCs primarily focus on verification, the focus of this approach is on restructuring and possible reuse in engineering. This paper introduces a possible automated translation of CFCs to imperative synchronous Quartz programs and outlines the potential for reducing the states of equivalent extended finite state machines through restructuring.

Design and formal verification of a copland-based attestation protocol

  • Adam Petz
  • Grant Jurgensen
  • Perry Alexander

We present the design and formal analysis of a remote attestation protocol and accompanying security architecture that generate evidence of trustworthy execution for legacy software. For formal guarantees of measurement ordering and cryptographic evidence strength, we leverage the Copland language and Copland Virtual Machine execution semantics. For isolation of attestation mechanisms we design a layered attestation architecture that leverages the seL4 microkernel. The formal properties of the protocol and architecture together serve to discharge assumptions made by an existing higher-level model-finding tool to characterize all ways an active adversary can corrupt a target and go undetected. As a proof of concept, we instantiate this analysis framework with a specific Copland protocol and security architecture to measure a legacy flight planning application. By leveraging components that are amenable to formal analysis, we demonstrate a principled way to design an attestation protocol and argue for its end-to-end correctness.

Sampling of shape expressions with ShapEx

  • Nicolas Basset
  • Thao Dang
  • Felix Gigler
  • Cristinel Mateis
  • Dejan Ničković

In this paper we present ShapEx, a tool that generates random behaviors from shape expressions, a formal specification language for describing sophisticated temporal behaviors of CPS. The tool samples a random behavior in two steps: (1) it first explores the space of qualitative parameterized shapes and then (2) instantiates parameters by sampling a possibly non-linear constraint. We implement several sampling strategies in the tool that we present in the paper and demonstrate its applicability on two use scenarios.

SEESAW: a tool for detecting memory vulnerabilities in protocol stack implementations

  • Farhaan Fowze
  • Tuba Yavuz

As the number of Internet of Things (IoT) devices proliferates, an in-depth understanding of the IoT attack surface has become quintessential for dealing with the security and reliability risks. IoT devices and components execute implementations of various communication protocols. Vulnerabilities in the protocol stack implementations form an important part of the IoT attack surface. Therefore, finding memory errors in such implementations is essential for improving IoT security and reliability. This paper presents a tool, SEESAW, that is built on top of a static analysis tool and a symbolic execution engine to achieve scalable analysis of protocol stack implementations. SEESAW leverages the API model of the analyzed code base to perform component-level analysis. SEESAW has been applied to the USB and Bluetooth modules within the Linux kernel. SEESAW can reproduce known memory vulnerabilities in a more scalable way compared to baseline symbolic execution.

Formal modelling of attack scenarios and mitigation strategies in IEEE 1588

  • Kelvin Anto
  • Partha S. Roop
  • Akshya K. Swain

IEEE 1588 is a time synchronization protocol that is extensively used by many Cyber-Physical Systems (CPSs). However, this protocol is prone to various types of attacks. We focus on a specific type of Man-in-the-Middle (MITM) attack, where the attacker introduces random delays to the messages being exchanged between a master and a slave. Such attacks have been modelled previously and some mitigation strategies have also been developed. However, the proposed methods work only under constant delay attacks and the developed mitigation strategies are ad-hoc. We propose the first formal framework for modelling and mitigating time delay attacks in IEEE 1588. Initially, the master, the slave and the communication medium are modelled as Timed Automata (TA) assuming the absence of any attacks. Subsequently, a generic attacker is modelled as a TA, which can formally represent various attacks including constant delay, linear delay and exponential delay. Finally, system identification methods of control theory are used to design proportional controllers for mitigating the effects of time delay attacks. We use model checking to ensure the resilience of the protocol to time delay attacks using the proposed mitigation strategy.
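To illustrate why an injected delay matters, recall that the IEEE 1588 delay request-response mechanism estimates the slave's clock offset as ((t2 - t1) - (t4 - t3)) / 2, which is only exact when the path delay is the same in both directions. The following is a minimal sketch, not the paper's Timed Automata model: the function names, the `turnaround` parameter and all numeric values are hypothetical. It shows that delaying only the master-to-slave message by d skews the computed offset by d/2.

```python
def ptp_offset(t1, t2, t3, t4):
    """Slave clock offset relative to the master, assuming symmetric path delay."""
    return ((t2 - t1) - (t4 - t3)) / 2


def exchange(true_offset, path_delay, attack_delay=0.0, t1=100.0, turnaround=1.0):
    """Simulate one Sync / Delay_Req exchange (hypothetical helper).

    attack_delay is added only in the master-to-slave direction,
    which is what breaks the symmetry assumption.
    """
    t2 = t1 + path_delay + attack_delay + true_offset  # Sync arrival, slave clock
    t3 = t2 + turnaround                               # Delay_Req departure, slave clock
    t4 = (t3 - true_offset) + path_delay               # Delay_Req arrival, master clock
    return t1, t2, t3, t4


# Illustrative values only: true offset 5, one-way delay 2, injected delay 4.
honest = ptp_offset(*exchange(true_offset=5.0, path_delay=2.0))
attacked = ptp_offset(*exchange(true_offset=5.0, path_delay=2.0, attack_delay=4.0))
print("offset without attack:", honest)
print("offset with attack:   ", attacked)
```

Without the attack the estimate equals the true offset; with it, the slave would correct its clock by the wrong amount, and the error scales with the injected delay.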


Proceedings of the 2022 ACM/IEEE Workshop on Machine Learning for CAD

Full Citation in the ACM Digital Library

SESSION: Session 1: Physical Design and Optimization with ML

Placement Optimization via PPA-Directed Graph Clustering

  • Yi-Chen Lu
  • Tian Yang
  • Sung Kyu Lim
  • Haoxing Ren

In this paper, we present the first Power, Performance, and Area (PPA)-directed, end-to-end placement optimization framework that provides cell clustering constraints as placement guidance to advance commercial placers. Specifically, we formulate PPA metrics as Machine Learning (ML) loss functions, and use graph clustering techniques to optimize them by improving clustering assignments. Experimental results on 5 GPU/CPU designs in a 5nm technology not only show that our framework immediately improves the PPA metrics at the placement stage, but also demonstrate that the improvements last firmly to the post-route stage, where we observe improvements of 89% in total negative slack (TNS), 26% in effective frequency, 2.4% in wirelength, and 1.4% in clock power.

From Global Route to Detailed Route: ML for Fast and Accurate Wire Parasitics and Timing Prediction

  • Vidya A. Chhabria
  • Wenjing Jiang
  • Andrew B. Kahng
  • Sachin S. Sapatnekar

Timing prediction and optimization are challenging in design stages prior to detailed routing (DR) due to the unavailability of routing information. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a “complete” netlist. The paper first documents that having “oracle knowledge” of the final post-DR parasitics enables post-global routing (GR) optimization to produce improved final timing outcomes. Machine learning (ML)-based models are proposed to bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization. These models show higher accuracy than GR-based timing estimation and, when used during post-GR optimization, show demonstrable improvements in post-DR circuit performance. Results on open 45nm and 130nm enablements using OpenROAD show efficient improvements in post-DR WNS and TNS metrics without increasing congestion.

Faster FPGA Routing by Forecasting and Pre-Loading Congestion Information

  • Umair Siddiqi
  • Timothy Martin
  • Sam Van Den Eijnden
  • Ahmed Shamli
  • Gary Grewal
  • Sadiq Sait
  • Shawki Areibi

Field Programmable Gate Array (FPGA) routing is one of the most time-consuming tasks within the FPGA design flow, requiring hours and even days to complete for some large industrial designs. This is becoming a major concern for FPGA users and tool developers. This paper proposes a simple, yet effective, framework that reduces the runtime of PathFinder-based routers. A supervised Machine Learning (ML) algorithm is developed to forecast costs (from the placement phase) associated with possible congestion and hot spot creation in the routing phase. These predicted costs are used to guide the router to avoid highly congested regions while routing nets, thus reducing the total number of iterations and rip-up and reroute operations involved. Results obtained indicate that the proposed ML approach achieves on average a 43% reduction in the number of routing iterations and a 28.6% reduction in runtime when implemented in the state-of-the-art enhanced PathFinder algorithm.
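PathFinder-style routers negotiate congestion through a per-node cost of the form (b_n + h_n) · p_n, where b_n is the base cost, h_n the accumulated historical congestion cost and p_n the present congestion penalty. The sketch below assumes that forecast congestion is pre-loaded into h_n before the first routing iteration; whether the paper injects its predictions in exactly this way is an assumption, and the function names, the `scale` factor and the node labels are hypothetical.

```python
def node_cost(base, history, occupancy, capacity, pres_fac=1.0):
    """PathFinder-style cost of using one more unit of a routing node."""
    # Overuse that would result if this net also took the node.
    overuse = max(0, occupancy + 1 - capacity)
    present = 1.0 + pres_fac * overuse
    return (base + history) * present


def preload_history(nodes, forecast, scale=1.0):
    """Seed each node's historical cost from a predicted congestion value."""
    return {n: scale * forecast.get(n, 0.0) for n in nodes}


# Two otherwise identical, currently empty nodes; "b" is forecast to be congested.
nodes = ["a", "b"]
forecast = {"b": 3.0}  # illustrative prediction, not real data
history = preload_history(nodes, forecast)

cost_a = node_cost(1.0, history["a"], occupancy=0, capacity=1)
cost_b = node_cost(1.0, history["b"], occupancy=0, capacity=1)
print(cost_a, cost_b)
```

Because node "b" starts with a higher cost, a shortest-path search would steer around it in the very first iteration instead of discovering the congestion through repeated rip-up and reroute.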

SESSION: Session 2: Machine Learning for Analog Design

Deep Reinforcement Learning for Analog Circuit Sizing with an Electrical Design Space and Sparse Rewards

  • Yannick Uhlmann
  • Michael Essich
  • Lennart Bramlage
  • Jürgen Scheible
  • Cristóbal Curio

There is still a great reliance on human expert knowledge during the analog integrated circuit sizing design phase due to its complexity and scale, with the result that there is a very low level of automation associated with it. Current research shows that reinforcement learning is a promising approach for addressing this issue. Similarly, it has been shown that the convergence of conventional optimization approaches can be improved by transforming the design space from the geometrical domain into the electrical domain. Here, this design space transformation is employed as an alternative action space for deep reinforcement learning agents. The presented approach is based entirely on reinforcement learning, whereby agents are trained in the craft of analog circuit sizing without explicit expert guidance. After training and evaluating agents on circuits of varying complexity, their behavior when confronted with a different technology is examined, showing the applicability, feasibility and transferability of this approach.

LinEasyBO: Scalable Bayesian Optimization Approach for Analog Circuit Synthesis via One-Dimensional Subspaces

  • Shuhan Zhang
  • Fan Yang
  • Changhao Yan
  • Dian Zhou
  • Xuan Zeng

A large body of literature has proved that the Bayesian optimization framework is especially efficient and effective in analog circuit synthesis. However, most previous research works focus only on designing informative surrogate models or efficient acquisition functions. Even though searching for the global optimum over the acquisition function surface is itself a difficult task, it has been largely ignored. In this paper, we propose a fast and robust Bayesian optimization approach via one-dimensional subspaces for analog circuit synthesis. By solely focusing on optimizing one-dimensional subspaces at each iteration, we greatly reduce the computational overhead of the Bayesian optimization framework while safely maximizing the acquisition function. By combining the benefits of different dimension selection strategies, we adaptively balance between searching globally and locally. By leveraging the batch Bayesian optimization framework, we further accelerate the optimization procedure by making full use of the hardware resources. Experimental results quantitatively show that our proposed algorithm can accelerate the optimization procedure by up to 9× and 38× compared to LP-EI and REMBOpBO respectively when the batch size is 15.
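The idea of maximizing the acquisition function along a single coordinate at a time can be sketched as follows. This is an illustration under stated assumptions rather than the LinEasyBO algorithm: the acquisition function is a toy quadratic stand-in, the one-dimensional search is a plain grid search, and dimensions are chosen round-robin, none of which is taken from the paper.

```python
def maximize_along_dim(acq, x, dim, lo, hi, steps=101):
    """Maximize acq over the 1-D line through x along coordinate `dim`."""
    best_x, best_val = list(x), acq(x)
    for i in range(steps):
        cand = list(x)
        cand[dim] = lo + (hi - lo) * i / (steps - 1)
        val = acq(cand)
        if val > best_val:
            best_x, best_val = cand, val
    return best_x, best_val


def toy_acq(x):
    """Hypothetical acquisition surface with its maximum at (0.3, -0.5)."""
    return -((x[0] - 0.3) ** 2 + (x[1] + 0.5) ** 2)


x = [0.0, 0.0]
for it in range(4):
    # Round-robin dimension selection; an assumption made for this sketch.
    x, val = maximize_along_dim(toy_acq, x, dim=it % 2, lo=-1.0, hi=1.0)
print(x, val)
```

Each step evaluates the acquisition function only on a line, so its cost grows with the number of grid points rather than with the volume of the full design space.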

RobustAnalog: Fast Variation-Aware Analog Circuit Design Via Multi-task RL

  • Wei Shi
  • Hanrui Wang
  • Jiaqi Gu
  • Mingjie Liu
  • David Z. Pan
  • Song Han
  • Nan Sun

Analog/mixed-signal circuit design is one of the most complex and time-consuming stages in the whole chip design process. Due to various process, voltage, and temperature (PVT) variations from chip manufacturing, analog circuits inevitably suffer from performance degradation. Although there has been plenty of work on automating analog circuit design under the nominal condition, limited research has been done on exploring robust designs under the real and unpredictable silicon variations. Automatic analog design against variations requires prohibitive computation and time costs. To address the challenge, we present RobustAnalog, a robust circuit design framework that involves the variation information in the optimization process. Specifically, circuit optimizations under different variations are considered as a set of tasks. Similarities among tasks are leveraged and competitions are alleviated to realize a sample-efficient multi-task training. Moreover, RobustAnalog prunes the task space according to the current performance in each iteration, leading to a further simulation cost reduction. In this way, RobustAnalog can rapidly produce a set of circuit parameters that satisfies diverse constraints (e.g. gain, bandwidth, noise…) across variations. We compare RobustAnalog with Bayesian optimization, Evolutionary algorithm, and Deep Deterministic Policy Gradient (DDPG) and demonstrate that RobustAnalog can significantly reduce the required optimization time by 14x-30x. Therefore, our study provides a feasible method to handle various real silicon conditions.

Automatic Analog Schematic Diagram Generation based on Building Block Classification and Reinforcement Learning

  • Hung-Yun Hsu
  • Mark Po-Hung Lin

Schematic visualization is important for analog circuit designers to quickly recognize the structures and functions of transistor-level circuit netlists. However, most original analog designs and other automatically extracted analog circuits are stored in the form of transistor-level netlists in the SPICE format. It can be error-prone and time-consuming to manually create an elegant and readable schematic from a netlist. Different from the conventional graph-based methods, this paper introduces a novel analog schematic diagram generation flow based on comprehensive building block classification and reinforcement learning. The experimental results show that the proposed method can effectively generate aesthetic analog circuit schematics with a higher building block compliance rate and fewer wire bends and net crossings, resulting in better readability, compared with existing methods and modern tools.

SESSION: Plenary I

The Changing Landscape of AI-driven System Optimization for Complex Combinatorial Optimization

  • Somdeb Majumdar

With the unprecedented success of modern machine learning in areas like computer vision and natural language processing, a natural question is where can it have maximum impact in real life. At Intel Labs, we are actively investing in research that leverages the robustness and generalizability of deep learning to solve system optimization problems. Examples of such systems include individual hardware modules like memory schedulers and power management units on a chip, automated compiler and software design tools as well as broader problems like chip design. In this talk, I will address some of the open challenges in systems optimization and how Intel and others in the research community are harnessing the power of modern reinforcement learning to address those challenges. A particular aspect of problems in the domain of chip design is the very large combinatorial complexity of the solution space. For example, the number of possible ways to place standard cells and macros on a canvas for even small to medium sized netlists can approach 10^100 to 10^1000. Importantly, only a very small subset of these possible outcomes are actually valid and performant.

Standard approaches like reinforcement learning struggle to learn effective policies under such conditions. For example, a sequential placement policy can get a reinforcing reward signal only after having taken several thousand individual placement actions. This reward is inherently noisy, especially when we need to assign credit to the earliest steps of the multi-step placement episode. This is an example of the classic credit assignment problem in reinforcement learning.

A different way to tackle such problems is to simply search over the solution space. Many approaches exist ranging from Genetic Algorithms to Monte Carlo Tree Search. However, they suffer from very slow convergence times due to the size of the search space.

In order to tackle such problems, we investigate an approach that combines the fast learning capabilities of reinforcement learning and the ability of search-based methods to find performant solutions. We use deep reinforcement learning to learn strategies that are sub-optimal but quick to find. We use these partial solutions as anchors around which we constrain a genetic-algorithm-based search. This allows us to still exploit the power of genetic algorithms to find performant solutions while significantly reducing the overall search time.

I will describe this solution in the context of combinatorial optimization problems like device placement where we show the ability to learn effective strategies on combinatorial complexities of up to 10^300. We also show that by representing these policies as neural networks, we are able to achieve reasonably good zero shot transfer learning performance on unseen problem configurations. Finally, I will touch upon how we are adapting this framework to handle similar combinatoric optimization problems for placement in EDA pipelines.

SESSION: Invited Session I

AI Chips Built by AI – Promise or Reality?: An Industry Perspective

  • Thomas Andersen

Artificial Intelligence is an avenue to innovation that is touching every industry worldwide. AI has made rapid advances in areas like speech and image recognition, gaming, and even self-driving cars, essentially automating less complex human tasks. In turn, this demand drives rapid growth across the semiconductor industry with new chip architectures emerging to deliver the specialized processing needed for the huge breadth of AI applications. Given the advances made to automate simple human tasks, can AI solve more complex tasks such as designing a computer chip? In this talk, we will discuss the challenges and opportunities of building advanced chip designs with the help of artificial intelligence, enabling higher performance, faster time to market, and utilizing reuse of machine-generated learning for successive products.

ML for Analog Design: Good Progress, but More to Do

  • Borivoje Nikolić

Analog and mixed-signal (AMS) blocks are often a critical and time-consuming part of System-on-Chip (SoC) design, due to the largely manual process of circuit design, simulation and SoC integration iterations. There have been numerous efforts to realize AMS blocks from specification by using a process analogous to digital synthesis, with automated place and route techniques [1], [2], but although very effective within their application domains, they have been limited in scope. The AMS block design process, outlined in Figure 1, starts with the derivation of its target performance specifications (gain, bandwidth, phase margin, settling time, etc.) from system requirements, and establishes a simulation testbench. Then, a designer relies on their expertise to choose the topology that is most likely to achieve the desired performance with minimum power consumption. Circuit sizing is a process of determining schematic-level transistor widths and lengths to attain the specifications, with minimum power consumption. Many of the commonly used analog circuits can be sized by using well-established heuristics to achieve near-optimal performance [3]-[5]. The performance is verified by running simulations, and there has been notable progress in enriching the commercial simulators to automate the testbench design. Machine learning (ML) based techniques have recently been deployed in circuit sizing to achieve optimality without relying on design heuristics [6]-[8]. Many of the commonly employed ML techniques require a rich training dataset; reinforcement learning (RL) sidesteps this issue by using an agent that interacts with its simulation environment through a trial-and-error process that mimics learning in humans. In each step, the RL agent, which contains a neural network, observes the state of the environment and takes a sizing action. The most time-consuming step in a traditional design procedure is layout, which is typically a manual iterative process.
Layout parasitics degrade the schematic-level performance, requiring circuit resizing. However, the use of circuit generators, such as the Berkeley Analog Generator (BAG) [9], automates the layout iterations. RL agents have been coupled with BAG to automate the complete design process for a fixed circuit topology [7]. Simulations with post-layout parasitics are much slower than schematic-level simulations, which calls for deployment of RL techniques that limit the sampled space. Finally, the process of integrating an AMS block into an SoC and verifying its system-level performance can be very time consuming.

SESSION: Session 3: Circuit Evaluation and Simulation with ML

SpeedER: A Supervised Encoder-Decoder Driven Engine for Effective Resistance Estimation of Power Delivery Networks

  • Bing-Yue Wu
  • Shao-Yun Fang
  • Hsiang-Wen Chang
  • Peter Wei

Voltage (IR) analysis tools need to be launched multiple times during the Engineering Change Order (ECO) phase in the modern design cycle for Power Delivery Network (PDN) refinement, while analyzing the IR characteristics of advanced chip designs by using traditional IR analysis tools suffers from massive run-time. Multiple Machine Learning (ML)-driven IR analysis approaches have been frequently proposed to benefit from the fast inference time and flexible prediction ability. Among these ML-driven approaches, the Effective Resistance (effR) of a given PDN has been shown to be one of the most critical features that can greatly enhance model performance and thus prediction accuracy; however, calculating effR alone is still computationally expensive. In addition, in the ECO phase, even if only local adjustments of the PDN are required, the run-time of obtaining the regional effR changes by using traditional Laplacian Systems grows exponentially as the size of the chip grows. It is because the whole PDN needs to be considered in a Laplacian solver for computing the effR of any single network node. To address the problem, this paper proposes an ML-driven engine, SpeedER, that combines a U-Net model and a Fully Connected Neural Network (FCNN) with five selected features to speed up the process of estimating regional effRs. Experimental results show that SpeedER can be approximately four times faster than a commercial tool using a Laplacian System with errors of only around 1%.
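For reference, the effective resistance between two nodes of a resistive network can be obtained from the conductance Laplacian: ground one node, inject 1 A at the other, and read off the resulting voltage. The sketch below is a textbook dense solve, not SpeedER and not a commercial solver; the function name and the example network are hypothetical, and it assumes a connected network with `src != dst`. It also makes visible why the whole network enters the computation even when only one node is of interest.

```python
def effective_resistance(n, edges, src, dst):
    """Effective resistance between src and dst.

    edges is a list of (u, v, resistance_ohms) over nodes 0..n-1.
    """
    # Build the graph Laplacian of conductances.
    lap = [[0.0] * n for _ in range(n)]
    for u, v, res in edges:
        g = 1.0 / res
        lap[u][u] += g
        lap[v][v] += g
        lap[u][v] -= g
        lap[v][u] -= g

    # Ground dst by dropping its row and column; inject 1 A at src.
    keep = [k for k in range(n) if k != dst]
    m = len(keep)
    aug = [[lap[i][j] for j in keep] + [1.0 if i == src else 0.0] for i in keep]

    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(m):
        piv = max(range(col, m), key=lambda row_idx: abs(aug[row_idx][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for row in range(m):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                for k in range(col, m + 1):
                    aug[row][k] -= factor * aug[col][k]

    # Node voltages for a 1 A injection; the voltage at src equals the resistance.
    volt = {node: aug[i][m] / aug[i][i] for i, node in enumerate(keep)}
    return volt[src]


# Hypothetical 3-node examples with 1-ohm resistors.
triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
chain = [(0, 1, 1.0), (1, 2, 1.0)]
r_triangle = effective_resistance(3, triangle, src=0, dst=1)
r_chain = effective_resistance(3, chain, src=0, dst=2)
print(r_triangle, r_chain)
```

In the triangle, the direct 1-ohm edge sits in parallel with a 2-ohm path, so the result should be close to 2/3 ohm; in the chain, the two resistors simply add.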

XT-PRAGGMA: Crosstalk Pessimism Reduction Achieved with GPU Gate-level Simulations and Machine Learning

  • Vidya A. Chhabria
  • Ben Keller
  • Yanqing Zhang
  • Sandeep Vollala
  • Sreedhar Pratty
  • Haoxing Ren
  • Brucek Khailany

Accurate crosstalk-aware timing analysis is critical in nanometer-scale process nodes. While today’s VLSI flows rely on static timing analysis (STA) techniques to perform crosstalk-aware timing signoff, these techniques are limited due to their static nature as they use imprecise heuristics such as arbitrary aggressor filtering and simplified delay calculations. This paper proposes XT-PRAGGMA, a tool that uses GPU-accelerated dynamic gate-level simulations and machine learning to eliminate false aggressors and accurately predict crosstalk-induced delta delays. XT-PRAGGMA reduces STA pessimism and provides crucial information to identify crosstalk-critical nets, which can be considered for accurate SPICE simulation before signoff. The proposed technique is fast (less than two hours to simulate 30,000 vectors on million-gate designs) and reduces falsely-reported total negative slack in timing signoff by 70%.

Fast Prediction of Dynamic IR-Drop Using Recurrent U-Net Architecture

  • Yonghwi Kwon
  • Youngsoo Shin

Recurrent U-Net (RU-Net) is employed for fast prediction of dynamic IR-drop when the power distribution network (PDN) contains capacitor components. Each capacitor can be modeled by a resistor and a current source, which is a function of v(t - Δt); node voltages at time t - Δt allow the PDN to be solved at time t, which then allows the analysis at t + Δt, and so on. Provided that a quick prediction of IR-drop at one time instance can be done by U-Net, an image segmentation model, the analysis of a PDN containing capacitors can be done by a number of U-Net instances connected in series, which becomes the RU-Net architecture. Four input maps (effective PDN resistance map, PDN capacitance map, current map, and power pad distance map) are extracted from each layout clip, and are provided to RU-Net for IR-drop prediction. Experiments demonstrate that the proposed IR-drop prediction using the RU-Net is faster than a commercial tool by 16 times with about 12% error, while a simple U-Net-based prediction yields 19% error due to its inability to consider capacitors.
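The capacitor model mentioned above can be illustrated with a single-node example. The sketch assumes backward-Euler discretization, under which a capacitor C becomes a conductance C/Δt in parallel with a current source (C/Δt)·v(t - Δt); the abstract does not state which integration rule the authors use, and the circuit, function name and component values here are hypothetical. Each time step is then a purely resistive solve that depends on the previous step's voltage, which is the recurrence that chained U-Net instances would mirror.

```python
def simulate_node(vdd, r, c, dt, load_currents, v0=None):
    """Time-step one PDN node fed from vdd through r, with decap c to ground.

    load_currents holds the current drawn from the node at each time step.
    """
    g = 1.0 / r    # conductance of the supply path
    g_c = c / dt   # companion conductance of the capacitor (backward Euler)
    v_prev = vdd if v0 is None else v0
    voltages = []
    for i_load in load_currents:
        i_eq = g_c * v_prev  # companion current source, a function of v(t - dt)
        # KCL at the node: g*(vdd - v) + i_eq - g_c*v - i_load = 0
        v_now = (g * vdd + i_eq - i_load) / (g + g_c)
        voltages.append(v_now)
        v_prev = v_now
    return voltages


# Illustrative values: 1 V supply, 0.1 ohm, 1 nF, 0.1 ns step, 1 A load step.
volts = simulate_node(vdd=1.0, r=0.1, c=1e-9, dt=1e-10, load_currents=[1.0] * 60)
print("first step:", volts[0], "settled:", volts[-1])
```

With the capacitor present, the first-step drop is smaller than the static drop I·R and the node settles toward vdd - I·R over time; a model that ignores capacitors would report the static drop immediately.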

SESSION: Session 4: DRC, Test and Hotspot Detection using ML Methods

Efficient Design Rule Checking Script Generation via Key Information Extraction

  • Binwu Zhu
  • Xinyun Zhang
  • Yibo Lin
  • Bei Yu
  • Martin Wong

Design rule checking (DRC) is a critical step in integrated circuit design. DRC requires formatted scripts as the input to the design rule checker. However, these scripts are always generated manually in the foundry, and such a generation process is extremely inefficient, especially when encountering a large number of design rules. To mitigate this issue, we first propose a deep learning-based key information extractor to automatically identify the essential arguments of the scripts from rules. Then, a script translator is designed to organize the extracted arguments into executable DRC scripts. In addition, we incorporate three specific design rule generation techniques to improve the performance of our extractor. Experimental results demonstrate that our proposed method can significantly reduce the cost of script generation and show remarkable superiority over other baselines.

Scan Chain Clustering and Optimization with Constrained Clustering and Reinforcement Learning

  • Naiju Karim Abdul
  • George Antony
  • Rahul M. Rao
  • Suriya T. Skariah

Scan chains are used in design for test by providing controllability and observability at each register. Scan optimization is run during physical design after placement where scannable elements are re-ordered along the chain to reduce total wirelength (and power). In this paper, we present a machine learning based technique that leverages constrained clustering and reinforcement learning to obtain a wirelength efficient scan chain solution. Novel techniques like next-min sorted assignment, clustered assignment, node collapsing, partitioned Q-Learning and in-context start-end node determination are introduced to enable improved wire length while honoring design-for-test constraints. The proposed method is shown to provide up to 24% scan wirelength reduction over a traditional algorithmic optimization technique across 188 moderately sized blocks from an industrial 7nm design.

Autoencoder-Based Data Sampling for Machine Learning-Based Lithography Hotspot Detection

  • Mohamed Tarek Ismail
  • Hossam Sharara
  • Kareem Madkour
  • Karim Seddik

Technology scaling has increased the complexity of integrated circuit design. It has also led to more challenges in the field of Design for Manufacturing (DFM). One of these challenges is lithography hotspot detection. Hotspots (HS) are design patterns that negatively affect the output yield. Identifying these patterns early in the design phase is crucial for high-yield fabrication. Machine Learning-based (ML) hotspot detection techniques are promising since they have shown superior results to other methods such as pattern matching. Training ML models is a challenging task due to three main reasons. First, industrial training designs contain millions of unique patterns. It is impractical to train models using this large number of patterns due to limited computational and memory resources. Second, the HS detection problem has an imbalanced nature; datasets typically have a limited number of HS and a large number of non-hotspots. Lastly, hotspot and non-hotspot patterns can have very similar geometries, causing models to be susceptible to high false positive rates. Due to these reasons, the use of data sampling techniques is needed to choose the best representative dataset for training. In this paper, a dataset sampling technique based on autoencoders is introduced. The autoencoders are used to identify latent data features that can reconstruct the input patterns. These features are used to group the patterns using Density-based spatial clustering of applications with noise (DBSCAN). Then, the clustered patterns are sampled to reduce the training set size. Experiments on the ICCAD-2019 dataset show that the proposed data sampling approach can reduce the dataset size while maintaining the levels of recall and precision that were obtained using the full dataset.
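The final sampling step can be sketched as follows, assuming cluster labels have already been produced by DBSCAN on the autoencoder's latent features. The per-cluster fraction, the choice to keep every noise point (label -1, a common convention for DBSCAN outputs) and the function name are assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import defaultdict


def sample_clusters(labels, fraction=0.1, seed=0):
    """Return sorted indices of patterns kept after per-cluster sampling."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    kept = []
    for label, members in groups.items():
        if label == -1:
            # Noise points have no similar neighbours, so keep all of them.
            kept.extend(members)
        else:
            # Keep a fixed fraction of each cluster, but at least one pattern.
            k = max(1, round(fraction * len(members)))
            kept.extend(rng.sample(members, k))
    return sorted(kept)


# Hypothetical labels: two dense clusters and three noise patterns.
labels = [0] * 50 + [1] * 20 + [-1] * 3
kept = sample_clusters(labels, fraction=0.1)
print(len(labels), "patterns reduced to", len(kept))
```

Sampling within clusters thins out near-duplicate patterns while unusual patterns, which may include rare hotspots, stay in the training set.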

SESSION: Session 5: Power and Thermal Evaluation with ML

Driving Early Physical Synthesis Exploration through End-of-Flow Total Power Prediction

  • Yi-Chen Lu
  • Wei-Ting Chan
  • Vishal Khandelwal
  • Sung Kyu Lim

Leading-edge designs on advanced nodes are pushing physical design (PD) flow runtime into several weeks. Stringent time-to-market constraints necessitate efficient power, performance, and area (PPA) exploration by developing accurate models to evaluate netlist quality in early design stages. In this work, we propose PD-LSTM, a framework that leverages graph neural networks (GNNs) and long short-term memory (LSTM) networks to perform end-of-flow power predictions in early PD stages. Experimental results on two commercial CPU designs and five OpenCore netlists demonstrate that PD-LSTM achieves high fidelity total power prediction results within 4% normalized root-mean-squared error (NRMSE) on unseen netlists and a correlation coefficient score as high as 0.98.

Towards Neural Hardware Search: Power Estimation of CNNs for GPGPUs with Dynamic Frequency Scaling

  • Christopher A. Metz
  • Mehran Goli
  • Rolf Drechsler

Machine Learning (ML) algorithms are essential for emerging technologies such as autonomous driving and application-specific Internet of Things (IoT) devices. The Convolutional Neural Network (CNN) is one of the major techniques used in such systems. This leads to leveraging ML accelerators like GPGPUs to meet the design constraints. However, GPGPUs have high power consumption, and selecting the most appropriate accelerator requires Design Space Exploration (DSE), which is usually time-consuming and needs high manual effort. Neural Hardware Search (NHS) is an upcoming approach to automate the DSE for Neural Networks. Therefore, automatic approaches for power, performance, and memory estimations are needed.

In this paper, we present a novel approach that enables designers to quickly and accurately estimate the power consumption of CNN inference on GPGPUs with Dynamic Frequency Scaling (DFS) in the early stages of the design process. The proposed approach uses static analysis for feature extraction and Random Forest Tree regression analysis for predictive model generation. Experimental results demonstrate that our approach can predict the CNNs' power consumption with a Mean Absolute Percentage Error (MAPE) of 5.03% compared to the actual hardware.

A Thermal Machine Learning Solver For Chip Simulation

  • Rishikesh Ranade
  • Haiyang He
  • Jay Pathak
  • Norman Chang
  • Akhilesh Kumar
  • Jimin Wen

Thermal analysis provides deeper insights into electronic chips’ behavior under different temperature scenarios and enables faster design exploration. However, obtaining a detailed and accurate on-chip thermal profile is very time-consuming using FEM or CFD. Therefore, there is an urgent need for speeding up the on-chip thermal solution to address various system scenarios. In this paper, we propose a thermal machine-learning (ML) solver to speed up thermal simulations of chips. The thermal ML-Solver is an extension of the recent novel approach, CoAEMLSim (Composable Autoencoder Machine Learning Simulator), with modifications to the solution algorithm to handle constant and distributed HTC. The proposed method is validated against commercial solvers, such as Ansys MAPDL, as well as a recent ML baseline, UNet, under different scenarios to demonstrate its enhanced accuracy, scalability, and generalizability.

SESSION: Session 6: Performance Prediction with ML Models and Algorithms

Physically Accurate Learning-based Performance Prediction of Hardware-accelerated ML Algorithms

  • Hadi Esmaeilzadeh
  • Soroush Ghodrati
  • Andrew B. Kahng
  • Joon Kyung Kim
  • Sean Kinzer
  • Sayak Kundu
  • Rohan Mahapatra
  • Susmita Dey Manasi
  • Sachin S. Sapatnekar
  • Zhiang Wang
  • Ziqing Zeng

Parameterizable ML accelerators are the product of recent breakthroughs in machine learning (ML). To fully enable the design space exploration, we propose a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. It employs a unified methodology, coupling backend power, performance and area (PPA) analysis with frontend performance simulation, thus achieving realistic estimation of both backend PPA and system metrics (runtime and energy). Experimental studies show that the approach provides excellent predictions for both ASIC (in a 12nm commercial process) and FPGA implementations on the VTA and VeriGOOD-ML platforms.

Graph Representation Learning for Gate Arrival Time Prediction

  • Pratik Shrestha
  • Saran Phatharodom
  • Ioannis Savidis

An accurate estimate of the timing profile at different stages of the physical design flow allows for pre-emptive changes to the circuit, significantly reducing the design time and effort. In this work, a graph based deep regression model is utilized to predict the gate level arrival time of the timing paths of a circuit. Three scenarios for post routing prediction are considered: prediction after completing floorplanning, prediction after completing placement, and prediction after completing clock tree synthesis (CTS). A commercial static timing analysis (STA) tool is utilized to determine the mean absolute percentage error (MAPE) and the mean absolute error (MAE) for each scenario. Results obtained across all models trained on the complete dataset indicate that the proposed methodology outperforms the baseline errors produced by the commercial physical design tools with an average improvement of 61.58 in the MAPE score when predicting the post-routing arrival time after completing floorplanning and 13.53 improvement when predicting the post-routing arrival time after completing placement. Additional prediction scenarios are analyzed, where the complete dataset is further sub-divided based on the size of the circuits, which leads to an average improvement of 34.83 in the MAPE score as compared to the commercial tool for post-floorplanning prediction of the post-routing arrival time and 22.71 improvement for post-placement prediction of the post-routing arrival time.
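
For reference, the two error metrics named above can be computed as in the following minimal sketch. The arrival-time values are hypothetical and are not taken from the paper; only the standard definitions of MAE and MAPE are assumed.

```python
def mae(predicted, actual):
    """Mean absolute error, in the same unit as the inputs."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mape(predicted, actual):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical post-routing arrival times (ns) and early-stage predictions.
actual = [1.0, 2.0, 4.0]
predicted = [1.1, 1.8, 4.0]
print(f"MAE  = {mae(predicted, actual):.3f} ns")
print(f"MAPE = {mape(predicted, actual):.2f} %")
```

MAE weights every path by its absolute error, while MAPE normalizes by the true arrival time, so short paths with small absolute errors can still dominate the MAPE.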

A Tale of EDA’s Long Tail: Long-Tailed Distribution Learning for Electronic Design Automation

  • Zixuan Jiang
  • Mingjie Liu
  • Zizheng Guo
  • Shuhan Zhang
  • Yibo Lin
  • David Pan

Long-tailed distribution is a common and critical issue in the field of machine learning. While prior work addressed data imbalance in several tasks in electronic design automation (EDA), insufficient attention has been paid to the long-tailed distribution in real-world EDA problems. In this paper, we argue that conventional performance metrics can be misleading, especially in EDA contexts. Through two public EDA problems using convolutional neural networks and graph neural networks, we demonstrate that simple yet effective model-agnostic methods can alleviate the issue induced by long-tailed distribution when applying machine learning algorithms in EDA.


Industrial Experience with Open-Source EDA Tools

  • Christian Lück
  • Daniela Sánchez Lopera
  • Sven Wenzek
  • Wolfgang Ecker

Commonly, the design flow of integrated circuits from initial specifications to fabrication employs commercial, proprietary EDA tools. While these tools deliver high-quality, production-ready results, they can be seen as expensive black boxes and thus are not suited for research and academic purposes. Innovations in the field are mostly focused on optimizing the quality of the results of the designs by modifying core elements of the tool chain or using techniques of the Machine Learning domain. In both cases, researchers require many or long runs of EDA tools for comparing results or generating training data for Machine Learning models. Using proprietary, commercial tools in those cases may be either not affordable or not possible at all.

With OpenROAD and OpenLane, mature open-source alternatives have emerged in the past couple of years. The development is driven by a growing community that is improving and extending the tools daily. In contrast to commercial tools, OpenROAD and OpenLane are transparent and allow inspection, modification and replacement of every tool aspect. They are also free and therefore are well suited for use cases such as Machine Learning data generation. Specifically, the fact that no licenses are needed for either the tools or the default PDK enables even fresh students and newcomers to the field to quickly deploy their ideas and create initial proofs of concept.

Therefore, we at Infineon are using OpenROAD and OpenLane for more experimental and innovative projects. Our vision is to build initial prototypes using free software, and then improve upon them by cross-checking and polishing with commercial tools before delivering them for production. This talk will show Infineon’s experience with these open-source tools so far.

The first steps involved getting OpenLane installed in our company IT infrastructure. While their developers offer convenient build methods using Docker containers, these cannot be used in Infineon’s compute farm. This, and also the fact that most of the open-source tools are currently evolving quickly with little to no versioning, led to the setup of an in-house continuous integration and continuous delivery system for nightly and weekly builds of the tools. Once the necessary tools were installed and running, effort was put into integrating Infineon’s in-house technology data.

At Infineon, we envision two use cases for OpenROAD/OpenLane: physical synthesis hyperparameter exploration (and tuning) and optimization of the complete flow starting from RTL. First, our goal is to use OpenROAD’s AutoTuner in the path-finding phase to automatically and cost-effectively find optimal parameters for the flow and then build upon these results within a commercial tool for the later steps near the tapeout. Second, we want to include not only the synthesis flow inside the optimization loop of the AutoTuner, but also our in-house RTL generation framework (MetaRTL). For instance, having RTL generators for a RISC-V CPU and also the results of simulated runtime benchmarks for each iteration, the AutoTuner should be able to change fundamental aspects (for example number of pipeline stages) of the RTL to reach certain power, performance, and area requirements when running the benchmark code on the CPU.

Overall, we see OpenROAD/OpenLane as a viable alternative to commercial tools, especially for research and academic use, where modifications to the tools are needed and where very long and otherwise costly tool runtimes are expected.

SESSION: Session 7: ML Models for Analog Design and Optimization

Invertible Neural Networks for Design of Broadband Active Mixers

  • Oluwaseyi Akinwande
  • Osama Waqar Bhatti
  • Xingchen Li
  • Madhavan Swaminathan

In this work, we present the invertible neural network for predicting the posterior distributions of the design space of broadband active mixers with RF from 100 MHz to 10 GHz. This invertible method gives a fast and accurate model when investigating crucial properties of active mixers such as conversion gain and noise figure. Our results show that the response generated by the invertible neural network model has close correlation with the output response from the circuit simulator.

High Dimensional Optimization for Electronic Design

  • Yuejiang Wen
  • Jacob Dean
  • Brian A. Floyd
  • Paul D. Franzon

Bayesian optimization (BO) samples points of interest to update a surrogate model for a blackbox function. This makes it a powerful technique to optimize electronic designs which have unknown objective functions and demand high computational cost of simulation. Unfortunately, Bayesian optimization suffers from scalability issues, e.g., it performs well only in problems of up to 20 dimensions. This paper addresses the curse of dimensionality and proposes an algorithm entitled Inspection-based Combo Random Embedding Bayesian Optimization (IC-REMBO). IC-REMBO improves the effectiveness and efficiency of the Random EMbedding Bayesian Optimization (REMBO) approach, which is a state-of-the-art high dimensional optimization method. Generally, it inspects the space near local optima to explore more points there, so that it mitigates the over-exploration on boundaries and embedding distortion in REMBO. Consequently, it helps escape from local optima and provides a family of feasible solutions when inspecting near the global optimum within a limited number of iterations.

The effectiveness and efficiency of the proposed algorithm are compared with the state-of-the-art REMBO when optimizing a mmWave receiver with 38 calibration parameters to meet 4 objectives. The optimization results are close to those of a human expert. To the best of our knowledge, this is the first application of REMBO or an inspection method to electronic design.
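
The random-embedding idea underlying REMBO can be sketched as follows. This shows only the baseline embedding step, not the inspection strategy of IC-REMBO. The low dimension of 4, the box bounds of [-1, 1], and the function names are assumptions made for illustration; the high dimension of 38 mirrors the number of calibration parameters mentioned above.

```python
import random

def make_embedding(high_dim, low_dim, seed=0):
    """Draw a fixed random Gaussian matrix A of shape (high_dim, low_dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(low_dim)] for _ in range(high_dim)]

def embed(A, y, lo=-1.0, hi=1.0):
    """Map a low-dimensional point y to the design space as clip(A @ y)."""
    x = []
    for row in A:
        v = sum(a * yi for a, yi in zip(row, y))
        # Clipping projects out-of-range coordinates onto the box boundary.
        x.append(min(hi, max(lo, v)))
    return x

A = make_embedding(high_dim=38, low_dim=4)
x = embed(A, [0.3, -0.7, 0.1, 0.9])
print(len(x), "design parameters, all within [-1, 1]")
```

The optimizer searches only over the low-dimensional point, and the objective is evaluated at the embedded point. Because many different low-dimensional points are clipped onto the same boundary location, a plain random embedding tends to over-sample the boundaries, which is the effect the abstract says IC-REMBO mitigates.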

Transfer of Performance Models Across Analog Circuit Topologies with Graph Neural Networks

  • Zhengfeng Wu
  • Ioannis Savidis

In this work, graph neural networks (GNNs) and transfer learning are leveraged to transfer device sizing knowledge learned from data of related analog circuit topologies to predict the performance of a new topology. A graph is generated from the netlist of a circuit, with nodes representing the devices and edges the connections between devices. To allow for the simultaneous training of GNNs on data of multiple topologies, graph isomorphism networks are adopted to address the limitation of graph convolutional networks in distinguishing between different graph structures. The techniques are applied to transfer predictions of performance across four op-amp topologies in a 65 nm technology, with 10000 sets of sizing and performance evaluations sampled for each circuit. Two scenarios, zero-shot learning and few-shot learning, are considered based on the availability of data in the target domain. Results from the analysis indicate that zero-shot learning with GNNs trained on all the data of the three related topologies is effective for coarse estimates of the performance of the fourth unseen circuit without requiring any data from the fourth circuit. Few-shot learning by fine-tuning the GNNs with a small dataset of 100 points from the target topology after pre-training on data from the other three topologies further boosts the model performance. The fine-tuned GNNs outperform the baseline artificial neural networks (ANNs) trained on the same dataset of 100 points from the target topology with an average reduction in the root-mean-square error of 70.6%. Applying the proposed techniques, specifically GNNs and transfer learning, improves the sample efficiency of the performance models of the analog ICs through the transfer of predictions across related circuit topologies.
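
The graph construction described above (devices as nodes, connections as edges) might look like the following minimal sketch. The netlist format, the device and net names, and the rule that two devices are connected when they share a net are illustrative assumptions, not details confirmed by the paper.

```python
from collections import defaultdict
from itertools import combinations

def netlist_to_graph(netlist):
    """Build (nodes, edges) from a mapping of device name -> connected nets.

    Two devices share an undirected edge if they touch at least one common net.
    """
    by_net = defaultdict(list)
    for device, nets in netlist.items():
        for net in set(nets):
            by_net[net].append(device)

    edges = set()
    for devices in by_net.values():
        for a, b in combinations(sorted(devices), 2):
            edges.add((a, b))
    return sorted(netlist), sorted(edges)

# Hypothetical resistively loaded differential pair.
netlist = {
    "M1": ["out1", "inp", "tail"],
    "M2": ["out2", "inn", "tail"],
    "M3": ["tail", "bias", "gnd"],
    "R1": ["vdd", "out1"],
    "R2": ["vdd", "out2"],
}
nodes, edges = netlist_to_graph(netlist)
print(nodes)
print(edges)
```

In a real flow, node features such as device type and sizing would be attached to each node before the graph is passed to a GNN; those are omitted here.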

RxGAN: Modeling High-Speed Receiver through Generative Adversarial Networks

  • Priyank Kashyap
  • Archit Gajjar
  • Yongjin Choi
  • Chau-Wai Wong
  • Dror Baron
  • Tianfu Wu
  • Chris Cheng
  • Paul Franzon

Creating models for modern high-speed receivers using circuit-level simulations is costly, as it requires computationally expensive simulations and can take months to finalize a model. Added to this is that many models do not necessarily agree with the final hardware they are supposed to emulate. Further, these models are complex due to the presence of various filters, such as a decision feedback equalizer (DFE) and continuous-time linear equalizer (CTLE), which enable the correct operation of the receiver. Other data-driven approaches tackle receiver modeling through multiple models to account for as many configurations as possible. This work proposes a data-driven approach using generative adversarial training to model a real-world receiver with varying DFE and CTLE configurations while handling different channel conditions and bitstreams. The approach is highly accurate, as the eye height and width are within 1.59% and 1.12% of the ground truth. The horizontal and vertical bathtub curves match and correlate with their ground-truth counterparts.


DAC ’22: Proceedings of the 59th ACM/IEEE Design Automation Conference

Full Citation in the ACM Digital Library

QuantumNAT: quantum noise-aware training with noise injection, quantization and normalization

  • Hanrui Wang
  • Jiaqi Gu
  • Yongshan Ding
  • Zirui Li
  • Frederic T. Chong
  • David Z. Pan
  • Song Han

Parameterized Quantum Circuits (PQC) are promising towards quantum advantage on near-term quantum hardware. However, due to the large quantum noises (errors), the performance of PQC models degrades severely on real quantum devices. Taking Quantum Neural Network (QNN) as an example, the accuracy gap between noise-free simulation and noisy results on IBMQ-Yorktown for MNIST-4 classification is over 60%. Existing noise mitigation methods are general ones that do not leverage the unique characteristics of PQC; on the other hand, existing PQC work does not consider the noise effect. To this end, we present QuantumNAT, a PQC-specific framework to perform noise-aware optimizations in both training and inference stages to improve robustness. We experimentally observe that the effect of quantum noise on the PQC measurement outcome is a linear map from the noise-free outcome with a scaling and a shift factor. Motivated by that, we propose post-measurement normalization to mitigate the feature distribution differences between noise-free and noisy scenarios. Furthermore, to improve the robustness against noise, we propose noise injection to the training process by inserting quantum error gates to PQC according to realistic noise models of quantum hardware. Finally, post-measurement quantization is introduced to quantize the measurement outcomes to discrete values, achieving the denoising effect. Extensive experiments on 8 classification tasks using 6 quantum devices demonstrate that QuantumNAT improves accuracy by up to 43%, and achieves over 94% 2-class, 80% 4-class, and 34% 10-class classification accuracy measured on real quantum computers. The code for construction and noise-aware training of PQC is available in the TorchQuantum library.
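
The reasoning behind post-measurement normalization and quantization can be illustrated with a small numerical sketch. It assumes, as the abstract states, that noise acts as a scaling plus a shift on the noise-free outcomes. The concrete scaling (0.5), shift (0.1), quantization range, and number of levels are hypothetical, and the code does not use the TorchQuantum API.

```python
from statistics import mean, pstdev

def normalize(outcomes):
    """Standardize a batch of measurement outcomes to zero mean, unit variance."""
    mu, sigma = mean(outcomes), pstdev(outcomes)
    return [(v - mu) / sigma for v in outcomes]

def quantize(values, levels=5, lo=-2.0, hi=2.0):
    """Clip to [lo, hi] and snap each value to the nearest of `levels` points."""
    step = (hi - lo) / (levels - 1)
    out = []
    for v in values:
        clipped = min(hi, max(lo, v))
        out.append(lo + step * round((clipped - lo) / step))
    return out

clean = [-0.8, -0.2, 0.1, 0.6]            # hypothetical noise-free outcomes
noisy = [0.5 * v + 0.1 for v in clean]    # assumed linear noise: scale 0.5, shift 0.1

print(quantize(normalize(clean)))
print(quantize(normalize(noisy)))
```

Standardizing removes any positive scaling and any shift, so the two batches agree after normalization up to floating-point error. Quantization then absorbs small residual deviations by snapping nearby values to the same level.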

Optimizing quantum circuit synthesis for permutations using recursion

  • Cynthia Chen
  • Bruno Schmitt
  • Helena Zhang
  • Lev S. Bishop
  • Ali Javadi-Abhari

We describe a family of recursive methods for the synthesis of qubit permutations on quantum computers with limited qubit connectivity. Two objectives are of importance: circuit size and depth. In each case we combine a scalable heuristic with a non-scalable, yet exact, synthesis.

A fast and scalable qubit-mapping method for noisy intermediate-scale quantum computers

  • Sunghye Park
  • Daeyeon Kim
  • Minhyuk Kweon
  • Jae-Yoon Sim
  • Seokhyeong Kang

This paper presents an efficient qubit-mapping method that redesigns a quantum circuit to overcome the limitations of qubit connectivity. We propose a recursive graph-isomorphism search to generate the scalable initial mapping. In the main mapping, we use an adaptive look-ahead window search to resolve the connectivity constraint within a short runtime. Compared with the state-of-the-art method [15], our proposed method reduced the number of additional gates by 23% on average and the runtime by 68% for the three largest benchmark circuits. Furthermore, our method improved circuit stability by reducing the circuit depth and thus can be a step forward towards fault tolerance.

Optimizing quantum circuit placement via machine learning

  • Hongxiang Fan
  • Ce Guo
  • Wayne Luk

Quantum circuit placement (QCP) is the process of mapping the synthesized logical quantum programs on physical quantum machines, which introduces additional SWAP gates and affects the performance of quantum circuits. Nevertheless, determining the minimal number of SWAP gates has been demonstrated to be an NP-complete problem. Various heuristic approaches have been proposed to address QCP, but they suffer from suboptimality due to the lack of exploration. Although exact approaches can achieve higher optimality, they are not scalable for large quantum circuits due to the massive design space and expensive runtime. By formulating QCP as a bilevel optimization problem, this paper proposes a novel machine learning (ML)-based framework to tackle this challenge. To address the lower-level combinatorial optimization problem, we adopt a policy-based deep reinforcement learning (DRL) algorithm with knowledge transfer to enable the generalization ability of our framework. An evolutionary algorithm is then deployed to solve the upper-level discrete search problem, which optimizes the initial mapping with a lower SWAP cost. The proposed ML-based approach provides a new paradigm to overcome the drawbacks in both traditional heuristic and exact approaches while enabling the exploration of optimality-runtime trade-off. Compared with the leading heuristic approaches, our ML-based method significantly reduces the SWAP cost by up to 100%. In comparison with the leading exact search, our proposed algorithm achieves the same level of optimality while reducing the runtime cost by up to 40 times.

HERO: hessian-enhanced robust optimization for unifying and improving generalization and quantization performance

  • Huanrui Yang
  • Xiaoxuan Yang
  • Neil Zhenqiang Gong
  • Yiran Chen

With the recent demand of deploying neural network models on mobile and edge devices, it is desired to improve the model’s generalizability on unseen testing data, as well as enhance the model’s robustness under fixed-point quantization for efficient deployment. Minimizing the training loss, however, provides few guarantees on the generalization and quantization performance. In this work, we fulfill the need of improving generalization and quantization performance simultaneously by theoretically unifying them under the framework of improving the model’s robustness against bounded weight perturbation and minimizing the eigenvalues of the Hessian matrix with respect to model weights. We therefore propose HERO, a Hessian-enhanced robust optimization method, to minimize the Hessian eigenvalues through a gradient-based training process, simultaneously improving the generalization and quantization performance. HERO enables up to a 3.8% gain on test accuracy, up to 30% higher accuracy under 80% training label perturbation, and the best post-training quantization accuracy across a wide range of precision, including a > 10% accuracy improvement over SGD-trained models for common model architectures on various datasets.

Neural computation for robust and holographic face detection

  • Mohsen Imani
  • Ali Zakeri
  • Hanning Chen
  • TaeHyun Kim
  • Prathyush Poduval
  • Hyunsei Lee
  • Yeseong Kim
  • Elaheh Sadredini
  • Farhad Imani

Face detection is an essential component of many tasks in computer vision with several applications. However, existing deep learning solutions are too slow and inefficient to enable face detection on embedded platforms. In this paper, we propose HDFace, a novel framework for highly efficient and robust face detection. HDFace exploits HyperDimensional Computing (HDC) as a neurally-inspired computational paradigm that mimics important brain functionalities towards high-efficiency and noise-tolerant computation. We first develop a novel technique that enables HDC to perform stochastic arithmetic computations over binary hypervectors. Next, we extend this arithmetic for efficient and robust processing of feature extraction algorithms in hyperspace. Finally, we develop an adaptive hyperdimensional classification algorithm for effective and robust face detection. We evaluate the effectiveness of HDFace on large-scale emotion detection and face detection applications. Our results indicate that HDFace provides, on average, 6.1X (4.6X) speedup and 3.0X (12.1X) energy efficiency as compared to neural networks running on CPU (FPGA), respectively.

FHDnn: communication efficient and robust federated learning for AIoT networks

  • Rishikanth Chandrasekaran
  • Kazim Ergun
  • Jihyun Lee
  • Dhanush Nanjunda
  • Jaeyoung Kang
  • Tajana Rosing

The advent of IoT and advances in edge computing inspired federated learning, a distributed algorithm to enable on-device learning. Transmission costs, unreliable networks, and limited compute power, all of which are typical characteristics of IoT networks, pose a severe bottleneck for federated learning. In this work, we propose FHDnn, a synergetic federated learning framework that combines the salient aspects of CNNs and Hyperdimensional Computing. FHDnn performs hyperdimensional learning on features extracted from a self-supervised contrastive learning framework to accelerate training, lower communication costs, and increase robustness to network errors by avoiding the transmission of the CNN and training only the hyperdimensional component. Compared to CNNs, we show through experiments that FHDnn reduces communication costs by 66X and local client compute and energy consumption by 1.5 – 6X, while being highly robust to network errors with minimal loss in accuracy.

ODHD: one-class brain-inspired hyperdimensional computing for outlier detection

  • Ruixuan Wang
  • Xun Jiao
  • X. Sharon Hu

Outlier detection is a classical and important technique that has been used in different application domains such as medical diagnosis and Internet-of-Things. Recently, machine learning-based outlier detection algorithms, such as one-class support vector machine (OCSVM), isolation forest and autoencoder, have demonstrated promising results in outlier detection. In this paper, we take a radical departure from these classical learning methods and propose ODHD, an outlier detection method based on hyperdimensional computing (HDC). In ODHD, the outlier detection process is based on a P-U learning structure, in which we train a one-class HV based on inlier samples. This HV represents the abstraction information of all inlier samples; hence, any (testing) sample whose corresponding HV is dissimilar from this HV will be considered as an outlier. We perform an extensive evaluation using six datasets across different application domains and compare ODHD with multiple baseline methods including OCSVM, isolation forest, and autoencoder using three metrics including accuracy, F1 score and ROC-AUC. Experimental results show that ODHD outperforms all the baseline methods on every dataset for every metric. Moreover, we perform a design space exploration for ODHD to illustrate the tradeoff between performance and efficiency. The promising results presented in this paper provide a viable option and alternative to traditional learning algorithms for outlier detection.

High-level synthesis performance prediction using GNNs: benchmarking, modeling, and advancing

  • Nan Wu
  • Hang Yang
  • Yuan Xie
  • Pan Li
  • Cong Hao

Agile hardware development requires fast and accurate circuit quality evaluation from early design stages. Existing work on high-level synthesis (HLS) performance prediction usually requires extensive feature engineering after the synthesis process. To expedite circuit evaluation from as early a design stage as possible, we propose rapid and accurate performance prediction methods, which exploit the representation power of graph neural networks (GNNs) by representing C/C++ programs as graphs. The contribution of this work is three-fold. (1) Benchmarking. We build a standard benchmark suite with 40k C programs, which includes synthetic programs and three sets of real-world HLS benchmarks. Each program is synthesized and implemented on FPGA to obtain post place-and-route performance metrics as the ground truth. (2) Modeling. We formally formulate the HLS performance prediction problem on graphs and propose multiple modeling strategies with GNNs that leverage different trade-offs between prediction timeliness (early/late prediction) and accuracy. (3) Advancing. We further propose a novel hierarchical GNN that does not sacrifice timeliness but largely improves prediction accuracy, significantly outperforming HLS tools. We perform extensive evaluations on both synthetic and unseen real-case programs; our proposed predictor outperforms HLS by up to 40X and exceeds existing predictors by 2X to 5X in terms of resource usage and timing prediction. The benchmark and explored GNN models are publicly available at

Automated accelerator optimization aided by graph neural networks

  • Atefeh Sohrabizadeh
  • Yunsheng Bai
  • Yizhou Sun
  • Jason Cong

Using High-Level Synthesis (HLS), hardware designers need only describe a high-level behavioral flow of the design. However, it can still take weeks to develop a high-performance architecture, mainly because there are many design choices at a higher level to explore. Besides, it takes several minutes to hours to evaluate the design with the HLS tool. To solve this problem, we model the HLS tool with a graph neural network that is trained to be used for a wide range of applications. The experimental results demonstrate that our model can estimate the quality of design in milliseconds with high accuracy, resulting in up to 79X speedup (with an average of 48X) for optimizing the design compared to the previous state-of-the-art work relying on the HLS tool.

Functionality matters in netlist representation learning

  • Ziyi Wang
  • Chen Bai
  • Zhuolun He
  • Guangliang Zhang
  • Qiang Xu
  • Tsung-Yi Ho
  • Bei Yu
  • Yu Huang

Learning feasible representations from raw gate-level netlists is essential for incorporating machine learning techniques in logic synthesis, physical design, or verification. Existing message-passing-based graph learning methodologies focus merely on graph topology while overlooking gate functionality, and thus often fail to capture the underlying semantics, limiting their generalizability. To address this concern, we propose a novel netlist representation learning framework that utilizes a contrastive scheme to acquire generic functional knowledge from netlists effectively. We also propose a customized graph neural network (GNN) architecture that learns a set of independent aggregators to better cooperate with the above framework. Comprehensive experiments on multiple complex real-world designs demonstrate that our proposed solution significantly outperforms state-of-the-art netlist feature learning flows.

EMS: efficient memory subsystem synthesis for spatial accelerators

  • Liancheng Jia
  • Yuyue Wang
  • Jingwen Leng
  • Yun Liang

Spatial accelerators provide massive parallelism with an array of homogeneous PEs, and enable efficient data reuse with PE array dataflow and on-chip memory. Many previous works have studied the dataflow architecture of spatial accelerators, including performance analysis and automatic generation. However, existing accelerator generators fail to exploit the entire memory-level reuse opportunities, and generate suboptimal designs with data duplication and inefficient interconnection.

In this paper, we propose EMS, an efficient memory subsystem synthesis and optimization framework for spatial accelerators. We first use space-time transformation (STT) to analyze both PE-level and memory-level data reuse. Based on the reuse analysis, we develop an algorithm to automatically generate the data layout of the multi-banked scratchpad memory, the data mapping, and the access controller for the memory. Our generated memory subsystem supports multiple PE-memory interconnection topologies including direct, multicast, and rotated connection. The memory and interconnection generation approach can efficiently utilize the memory-level reuse to avoid duplicated data storage with low hardware cost. EMS can automatically synthesize tensor algebra to hardware designed in Chisel. Experiments show that our proposed memory generator reduces the on-chip memory size by an average of 28% compared with the state of the art, and achieves comparable hardware performance.

DA PUF: dual-state analog PUF

  • Jiliang Zhang
  • Lin Ding
  • Zhuojun Chen
  • Wenshang Li
  • Gang Qu

Physical unclonable function (PUF) is a promising lightweight hardware security primitive that exploits process variations during chip fabrication for applications such as key generation and device authentication. Reliability of the PUF information plays a vital role and poses a major challenge for PUF design. In this paper, we propose a novel dual-state analog PUF (DA PUF) which has been successfully fabricated in a 55nm process. The 40,960 bits generated by the fabricated DA PUF pass the NIST randomness test with reliability over 99.99% across a working environment of -40 ~ 125 °C (temperature) and 0.96 ~ 1.44 V (voltage), outperforming the two state-of-the-art analog PUFs reported in JSSC 2016 and 2021.

PathFinder: side channel protection through automatic leaky paths identification and obfuscation

  • Haocheng Ma
  • Qizhi Zhang
  • Ya Gao
  • Jiaji He
  • Yiqiang Zhao
  • Yier Jin

Side-channel analysis (SCA) attacks pose an enormous threat to cryptographic integrated circuits (ICs). To address this threat, designers try to adopt various countermeasures during the IC development process. However, many existing solutions are costly in terms of area, power and/or performance, and may require full-custom circuit design for proper implementation. In this paper, we propose a tool, namely PathFinder, that automatically identifies leaky paths and protects the design while remaining compatible with the commercial design flow. The tool first finds the subset of logic cells that leak the most information through dynamic correlation analysis. PathFinder then exploits static security checking to construct complete leaky paths based on these cells. After leaky paths are identified, PathFinder leverages proper hardware countermeasures, including Boolean masking and random precharge, to eliminate information leakage from these paths. The effectiveness of PathFinder is validated both through simulation and physical measurements on FPGA implementations. Results demonstrate more than 1000X improvement in side-channel resistance, with less than 6.53% penalty in power, area and performance.

LOCK&ROLL: deep-learning power side-channel attack mitigation using emerging reconfigurable devices and logic locking

  • Gaurav Kolhe
  • Tyler Sheaves
  • Kevin Immanuel Gubbi
  • Soheil Salehi
  • Setareh Rafatirad
  • Sai Manoj PD
  • Avesta Sasan
  • Houman Homayoun

The security and trustworthiness concerns of ICs are exacerbated by the modern globalized semiconductor business model. This model involves many steps performed at multiple locations by different providers and integrates various Intellectual Properties (IPs) from several vendors for faster time-to-market and cheaper fabrication costs. Many existing works have focused on mitigating the well-known SAT attack and its derivatives. Power Side-Channel Attacks (P-SCAs) can retrieve the sensitive contents of the IP and can be leveraged to find the key to unlock the obfuscated circuit without simulating powerful SAT attacks. To mitigate P-SCAs and SAT attacks together, we propose a multi-layer defense mechanism called LOCK&ROLL: Deep-Learning Power Side-Channel Attack Mitigation using Emerging Reconfigurable Devices and Logic Locking. LOCK&ROLL utilizes our proposed Magnetic Random-Access Memory (MRAM)-based Look Up Table called Symmetrical MRAM-LUT (SyM-LUT). Our simulation results using 45nm technology demonstrate that the SyM-LUT incurs a small overhead compared to the traditional Static Random Access Memory LUT (SRAM-LUT). Additionally, SyM-LUT has a standby energy consumption of 20aJ while consuming 33fJ and 4.6fJ for write and read operations, respectively. LOCK&ROLL is resilient against various attacks such as SAT attacks, removal attacks, scan and shift attacks, and P-SCAs.

Efficient access scheme for multi-bank based NTT architecture through conflict graph

  • Xiangren Chen
  • Bohan Yang
  • Yong Lu
  • Shouyi Yin
  • Shaojun Wei
  • Leibo Liu

Number Theoretic Transform (NTT) hardware accelerators have become a crucial building block in many cryptosystems, such as post-quantum cryptography. In this paper, we provide new insights into the construction of a conflict-free memory mapping scheme (CFMMS) for multi-bank NTT architectures. Firstly, we offer a parallel loop structure for arbitrary-radix NTT and propose two point-fetching modes. Afterwards, we transform the conflict-free mapping problem into a conflict graph and develop a novel heuristic to explore the design space of CFMMS, which yields a more efficient access scheme than classic works. To further verify the methodology, we design high-performance NTT/INTT kernels for Dilithium, whose area-time efficiency significantly outperforms state-of-the-art works on similar FPGA platforms.
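To make the notion of a conflict-free mapping concrete, the following minimal sketch checks the classic two-bank parity mapping for an in-place radix-2 transform. It is not the conflict-graph heuristic proposed in the paper, and the names `bank_of` and `is_conflict_free` are hypothetical, chosen only for illustration.

```python
def bank_of(addr: int) -> int:
    """Classic two-bank mapping: the bank index is the parity of the address bits."""
    return bin(addr).count("1") & 1


def is_conflict_free(log_n: int) -> bool:
    """Check every radix-2 butterfly of an n = 2**log_n point in-place transform.

    In stage s, a butterfly reads two addresses that differ only in bit s,
    so a mapping is conflict-free if those two operands sit in different banks.
    """
    n = 1 << log_n
    for stage in range(log_n):
        stride = 1 << stage
        for addr in range(n):
            if addr & stride:
                continue  # visit each operand pair once, from its lower address
            if bank_of(addr) == bank_of(addr | stride):
                return False
    return True


if __name__ == "__main__":
    print(is_conflict_free(8))
```

Flipping a single address bit always flips the parity, which is why this simple mapping works for two banks and radix 2. Higher radices and more banks are where a systematic search, such as the one the abstract describes, becomes necessary.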

InfoX: an energy-efficient ReRAM accelerator design with information-lossless low-bit ADCs

  • Yintao He
  • Songyun Qu
  • Ying Wang
  • Bing Li
  • Huawei Li
  • Xiaowei Li

ReRAM-based accelerators have shown great potential in neural network acceleration via in-memory analog computing. However, high-precision analog-to-digital converters (ADCs), which are required by the ReRAM crossbars to achieve high-accuracy network model inference, play an essential role in the energy efficiency of the accelerators. Based on the discovery that the ADC precision requirements of crossbars are different, we propose a model-aware crossbar-wise ADC precision assignment and the accompanying information-lossless low-bit ADCs to reduce energy overhead without sacrificing model accuracy. In experiments, the proposed information-lossless ReRAM accelerator, InfoX, consumes only 8.97% of the ADC energy of the SOTA baseline with no accuracy degradation at all.

PHANES: ReRAM-based photonic accelerator for deep neural networks

  • Yinyi Liu
  • Jiaqi Liu
  • Yuxiang Fu
  • Shixi Chen
  • Jiaxu Zhang
  • Jiang Xu

Resistive random access memory (ReRAM) has demonstrated great promise for in-situ matrix-vector multiplications to accelerate deep neural networks. However, subject to the intrinsic properties of analog processing, most of the proposed ReRAM-based accelerators require an excessive number of costly ADCs/DACs to avoid distortion of electronic analog signals during inter-tile transmission. Moreover, due to bit-shifting before addition, prior works require longer cycles to serially calculate partial sums compared to multiplications, which dramatically restricts the throughput and is more likely to stall the pipeline between layers of deep neural networks.

In this paper, we present a novel ReRAM-based photonic accelerator (PHANES) architecture, which calculates multiplications in ReRAM and parallel weighted accumulations during optical transmission. Such a photonic paradigm also serves as high-fidelity analog-analog links to further reduce ADC/DAC usage. To circumvent the memory wall problem, we further propose a progressive bit-depth technique. Evaluations show that PHANES improves the energy efficiency by 6.09x and throughput density by 14.7x compared to state-of-the-art designs. Our photonic architecture also has great potential for scalability towards very-large-scale accelerators.

CP-SRAM: charge-pulsation SRAM macro for ultra-high energy-efficiency computing-in-memory

  • He Zhang
  • Linjun Jiang
  • Jianxin Wu
  • Tingran Chen
  • Junzhan Liu
  • Wang Kang
  • Weisheng Zhao

SRAM-based computing-in-memory (SRAM-CIM) provides fast speed and good scalability with advanced process technology. However, the energy efficiency of the state-of-the-art current-domain SRAM-CIM bit-cell structure is limited and the peripheral circuitry (e.g., DAC/ADC) for high precision is expensive. This paper proposes a charge-pulsation SRAM (CP-SRAM) structure to achieve ultra-high energy efficiency thanks to its charge-domain mechanism. Furthermore, our proposed CP-SRAM CIM supports configurable precision (2/4/6-bit). The CP-SRAM CIM macro was designed in 180nm (with silicon verification) and 40nm (simulation) nodes. The simulation results in 40nm show that our macro can achieve energy efficiency of ~2950 Tops/W at 2-bit precision, ~576.4 Tops/W at 4-bit precision and ~111.7 Tops/W at 6-bit precision.

CREAM: computing in ReRAM-assisted energy and area-efficient SRAM for neural network acceleration

  • Liukai Xu
  • Songyuan Liu
  • Zhi Li
  • Dengfeng Wang
  • Yiming Chen
  • Yanan Sun
  • Xueqing Li
  • Weifeng He
  • Shi Xu

Computing-in-memory (CIM) has been widely explored to accelerate DNNs. However, most existing CIM designs cannot store all NN weights due to the limited SRAM capacity of edge AI devices, inducing a large amount of off-chip DRAM access. In this paper, a new computing in ReRAM-assisted energy and area-efficient SRAM (CREAM) scheme is proposed for implementing large-scale NNs while eliminating off-chip DRAM access. The weights of the DNN are all stored in the high-density on-chip ReRAM devices and restored to the proposed nvSRAM-CIM cells with array-level parallelism. A data-aware weight-mapping method is also proposed to enhance the CIM performance while fully exploiting the hardware utilization. Experiment results show that the proposed CREAM scheme enhances the storage density by up to 7.94x compared to traditional SRAM arrays. The energy efficiency of the proposed CREAM is also enhanced by 2.14x and 1.99x, compared to the traditional SRAM-CIM with off-chip DRAM access and ReRAM-CIM circuits, respectively.

Chiplet actuary: a quantitative cost model and multi-chiplet architecture exploration

  • Yinxiao Feng
  • Kaisheng Ma

Multi-chip integration is widely recognized as the extension of Moore’s Law. Cost-saving is a frequently mentioned advantage, but previous works rarely present quantitative demonstrations on the cost superiority of multi-chip integration over monolithic SoC. In this paper, we build a quantitative cost model and put forward an analytical method for multi-chip systems based on three typical multi-chip integration technologies to analyze the cost benefits from yield improvement, chiplet and package reuse, and heterogeneity. We re-examine the actual cost of multi-chip systems from various perspectives and show how to reduce the total cost of the VLSI system through appropriate multi-chiplet architecture.
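The yield-improvement effect mentioned above can be illustrated with the widely used negative-binomial die yield model. This is only a sketch under stated assumptions, not the cost model from the paper: the defect density, the clustering parameter `alpha`, and the flat cost per unit area are illustrative values, and packaging, bonding yield, and die-to-die interface overhead are ignored entirely.

```python
def die_yield(area_mm2: float, defects_per_mm2: float, alpha: float = 3.0) -> float:
    """Negative-binomial yield model: Y = (1 + D * A / alpha) ** (-alpha)."""
    return (1.0 + defects_per_mm2 * area_mm2 / alpha) ** (-alpha)


def silicon_cost_per_good_system(total_area_mm2: float, n_dies: int,
                                 defects_per_mm2: float,
                                 cost_per_mm2: float = 1.0) -> float:
    """Raw silicon cost of one working system split into n equal dies.

    Assumes known-good-die testing, so on average 1/yield dies are
    fabricated for every good die that ends up in a system.
    """
    die_area = total_area_mm2 / n_dies
    y = die_yield(die_area, defects_per_mm2)
    return n_dies * die_area * cost_per_mm2 / y


if __name__ == "__main__":
    # Illustrative numbers only: 800 mm^2 of logic, 0.001 defects per mm^2.
    mono = silicon_cost_per_good_system(800.0, 1, 0.001)
    quad = silicon_cost_per_good_system(800.0, 4, 0.001)
    print(f"monolithic: {mono:.1f}, four chiplets: {quad:.1f}")
```

Under these assumptions the four-chiplet split has a lower raw silicon cost because each smaller die yields better. Whether that advantage survives packaging and interface costs is the kind of question a full quantitative model has to answer.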

PANORAMA: divide-and-conquer approach for mapping complex loop kernels on CGRA

  • Dhananjaya Wijerathne
  • Zhaoying Li
  • Thilini Kaushalya Bandara
  • Tulika Mitra

CGRAs are well-suited as hardware accelerators due to power efficiency and reconfigurability. However, their potential is limited by the inability of the compiler to map complex loop kernels onto the architectures effectively. We propose PANORAMA, a fast and scalable compiler based on a divide-and-conquer approach to generate quality mapping for complex dataflow graphs (DFG) representing loop bodies onto larger CGRAs. PANORAMA improves the throughput of the mapped loops by up to 2.6x with 8.7x faster compilation time compared to the state-of-the-art techniques.

A fast parameter tuning framework via transfer learning and multi-objective Bayesian optimization

  • Zheng Zhang
  • Tinghuan Chen
  • Jiaxin Huang
  • Meng Zhang

Design space exploration (DSE) can automatically and effectively determine design parameters to achieve the optimal performance, power and area (PPA) in very large-scale integration (VLSI) design. However, the lack of prior knowledge leads to inefficient exploration. In this paper, a fast parameter tuning framework via transfer learning and multi-objective Bayesian optimization is proposed to quickly find the optimal design parameters. Gaussian Copula is utilized to establish the correlation of the implemented technology. The prior knowledge is integrated into multi-objective Bayesian optimization by transforming the PPA data into residual observations. An uncertainty-aware search acquisition function is employed to explore the design space efficiently. Experiments on a CPU design show that this framework can achieve a higher-quality Pareto frontier with fewer design flow runs than state-of-the-art methodologies.

PriMax: maximizing DSL application performance with selective primitive acceleration

  • Nicholas Wendt
  • Todd Austin
  • Valeria Bertacco

Domain-specific languages (DSLs) improve developer productivity by abstracting away low-level details of an algorithm’s implementation within a specialized domain. These languages often provide powerful primitives to describe complex operations, potentially granting flexibility during compilation to target hardware acceleration. This work proposes PriMax, a novel methodology to effectively map DSL applications to hardware accelerators. It builds decision trees based on benchmark results, which select between distinct implementations of accelerated primitives to maximize a target performance metric. In our graph analytics case study with two accelerators, PriMax produces a geometric mean speedup of 1.57x over a multicore CPU, higher than either target accelerator alone, and approaching the maximum 1.58x speedup attainable with these target accelerators.

Accelerating and pruning CNNs for semantic segmentation on FPGA

  • Pierpaolo Morì
  • Manoj-Rohit Vemparala
  • Nael Fasfous
  • Saptarshi Mitra
  • Sreetama Sarkar
  • Alexander Frickenstein
  • Lukas Frickenstein
  • Domenik Helms
  • Naveen Shankar Nagaraja
  • Walter Stechele
  • Claudio Passerone

Semantic segmentation is one of the popular tasks in computer vision, providing pixel-wise annotations for scene understanding. However, segmentation-based convolutional neural networks require tremendous computational power. In this work, a fully-pipelined hardware accelerator with support for dilated convolution is introduced, which cuts down the redundant zero multiplications. Furthermore, we propose a genetic algorithm based automated channel pruning technique to jointly optimize computational complexity and model accuracy. Finally, hardware heuristics and an accurate model of the custom accelerator design enable a hardware-aware pruning framework. We achieve 2.44X lower latency with minimal degradation in semantic prediction quality (−1.98 pp lower mean intersection over union) compared to the baseline DeepLabV3+ model, evaluated on an Arria-10 FPGA. The binary files of the FPGA design, baseline and pruned models can be found in

SoftSNN: low-cost fault tolerance for spiking neural network accelerators under soft errors

  • Rachmad Vidya Wicaksana Putra
  • Muhammad Abdullah Hanif
  • Muhammad Shafique

Specialized hardware accelerators have been designed and employed to maximize the performance efficiency of Spiking Neural Networks (SNNs). However, such accelerators are vulnerable to transient faults (i.e., soft errors), which occur due to high-energy particle strikes, and manifest as bit flips at the hardware layer. These errors can change the weight values and neuron operations in the compute engine of SNN accelerators, thereby leading to incorrect outputs and accuracy degradation. However, the impact of soft errors in the compute engine and the respective mitigation techniques have not been thoroughly studied yet for SNNs. A potential solution is employing redundant executions (re-execution) for ensuring correct outputs, but it leads to huge latency and energy overheads. Toward this, we propose SoftSNN, a novel methodology to mitigate soft errors in the weight registers (synapses) and neurons of SNN accelerators without re-execution, thereby maintaining the accuracy with low latency and energy overheads. Our SoftSNN methodology employs the following key steps: (1) analyzing the SNN characteristics under soft errors to identify faulty weights and neuron operations, which are required for recognizing faulty SNN behavior; (2) a Bound-and-Protect technique that leverages this analysis to improve the SNN fault tolerance by bounding the weight values and protecting the neurons from faulty operations; and (3) devising lightweight hardware enhancements for the neural hardware accelerator to efficiently support the proposed technique. The experimental results show that, for a 900-neuron network with even a high fault rate, our SoftSNN maintains the accuracy degradation below 3%, while reducing latency and energy by up to 3x and 2.3x respectively, as compared to the re-execution technique.
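The idea of bounding weight values can be sketched in a few lines. The clamping policy below, the 8-bit unsigned weight format, and the helper names `flip_bit` and `bound_weights` are assumptions made for illustration; the abstract does not specify how the bound is chosen or applied in hardware.

```python
def flip_bit(value: int, bit: int) -> int:
    """Model a soft error as a single bit flip in a stored weight."""
    return value ^ (1 << bit)


def bound_weights(weights, max_safe):
    """Clamp every weight to max_safe (hypothetical bounding policy).

    max_safe is assumed to be the largest weight seen in the fault-free
    model, so any larger value can only be the result of a fault.
    """
    return [min(w, max_safe) for w in weights]


if __name__ == "__main__":
    clean = [12, 40, 7]                  # fault-free 8-bit weights
    max_safe = max(clean)
    faulty = list(clean)
    faulty[0] = flip_bit(faulty[0], 7)   # flip of the MSB: 12 becomes 140
    print(faulty, bound_weights(faulty, max_safe))
```

Bounding does not restore the original value, but it limits how far a single flipped bit can push a synapse, which is cheaper than re-executing the inference.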

A joint management middleware to improve training performance of deep recommendation systems with SSDs

  • Chun-Feng Wu
  • Carole-Jean Wu
  • Gu-Yeon Wei
  • David Brooks

As the sizes and variety of training data scale over time, data preprocessing is becoming an important performance bottleneck for training deep recommendation systems. This challenge becomes more serious when training data is stored in Solid-State Drives (SSDs). Due to the access behavior gap between recommendation systems and SSDs, unused training data may be read and filtered out during preprocessing. This work advocates a joint management middleware to avoid reading unused data by bridging the access behavior gap. The evaluation results show that our middleware can effectively improve the performance of the data preprocessing phase so as to boost training performance.

The larger the fairer?: small neural networks can achieve fairness for edge devices

  • Yi Sheng
  • Junhuan Yang
  • Yawen Wu
  • Kevin Mao
  • Yiyu Shi
  • Jingtong Hu
  • Weiwen Jiang
  • Lei Yang

Along with the progress of AI democratization, neural networks are being deployed more frequently in edge devices for a wide range of applications. Fairness concerns gradually emerge in many applications, such as face recognition and mobile medical care. One fundamental question arises: what will be the fairest neural architecture for edge devices? By examining the existing neural networks, we observe that larger networks typically are fairer. However, edge devices call for smaller neural architectures to meet hardware specifications. To address this challenge, this work proposes a novel Fairness- and Hardware-aware Neural architecture search framework, namely FaHaNa. Coupled with a model freezing approach, FaHaNa can efficiently search for neural networks with balanced fairness and accuracy, while being guaranteed to meet hardware specifications. Results show that FaHaNa can identify a series of neural networks with higher fairness and accuracy on a dermatology dataset. Targeting edge devices, FaHaNa finds a neural architecture with slightly higher accuracy, 5.28X smaller size, and a 15.14% higher fairness score compared with MobileNetV2; meanwhile, on Raspberry Pi and Odroid XU-4, it achieves 5.75X and 5.79X speedup.

SCAIE-V: an open-source SCAlable interface for ISA extensions for RISC-V processors

  • Mihaela Damian
  • Julian Oppermann
  • Christoph Spang
  • Andreas Koch

Custom instructions extending a base ISA are often used to increase performance. However, only a few cores provide open interfaces for integrating such ISA Extensions (ISAX). In addition, the degree to which a core’s capabilities are exposed for extension varies wildly between interfaces. Thus, even when using open-source cores, the lack of standardized ISAX interfaces typically causes high engineering effort when implementing or porting ISAXes. We present SCAIE-V, a highly portable and feature-rich ISAX interface that supports custom control flow, decoupled execution, multi-cycle instructions, and memory transactions. The cost of the interface itself scales with the complexity of the ISAXes actually used.

A scalable symbolic simulation tool for low power embedded systems

  • Subhash Sethumurugan
  • Shashank Hegde
  • Hari Cherupalli
  • John Sartori

Recent work has demonstrated the effectiveness of using symbolic simulation to perform hardware software co-analysis on an application-processor pair and developed a variety of hardware and software design techniques and optimizations, ranging from providing system security guarantees to automated generation of application-specific bespoke processors. Despite their potential benefits, current state-of-the-art symbolic simulation tools for hardware-software co-analysis are restricted in their applicability, since prior work relies on a costly process of building a custom simulation tool for each processor design to be simulated. Furthermore, prior work does not describe how to extend the symbolic analysis technique to other processor designs.

In an effort to generalize the technique for any processor design, we propose a custom symbolic simulator that uses iverilog to perform symbolic behavioral simulation. With iverilog – an open source synthesis and simulation tool – we implement a design-agnostic symbolic simulation tool for hardware-software co-analysis. To demonstrate the generality of our tool, we apply symbolic analysis to three embedded processors with different ISAs: bm32 (a MIPS-based processor), darkRiscV (a RISC-V-based processor), and openMSP430 (based on MSP430). We use analysis results to generate bespoke processors for each design and observe gate count reductions of 27%, 16%, and 56% on these processors, respectively. Our results demonstrate the versatility of our simulation tool and the uniqueness of each design with respect to symbolic analysis and the bespoke methodology.

Designing critical systems with iterative automated safety analysis

  • Ran Wei
  • Zhe Jiang
  • Xiaoran Guo
  • Haitao Mei
  • Athanasios Zolotas
  • Tim Kelly

Safety analysis is an important aspect in Safety-Critical Systems Engineering (SCSE) to discover design problems that can potentially lead to hazards and eventually, accidents. Performing safety analysis requires significant manual effort — its automation has become the research focus in the critical system domain due to the increasing complexity of systems and emergence of open adaptive systems. In this paper, we present a methodology, in which automated safety analysis drives the design of safety-critical systems. We discuss our approach with its tool support and evaluate its applicability. We briefly discuss how our approach fits into current practice of SCSE.

Efficient ensembles of graph neural networks

  • Amrit Nagarajan
  • Jacob R. Stevens
  • Anand Raghunathan

Ensembles improve the accuracy and robustness of Graph Neural Networks (GNNs), but suffer from high latency and storage requirements. To address this challenge, we propose GNN Ensembles through Error Node Isolation (GEENI). The key concept in GEENI is to identify nodes that are likely to be incorrectly classified (error nodes) and suppress their outgoing messages, leading to simultaneous accuracy and efficiency improvements. GEENI also enables aggressive approximations of the constituent models in the ensemble while maintaining accuracy. To improve the efficacy of GEENI, we propose techniques for diverse ensemble creation and accurate error node identification. Our experiments establish that GEENI models are simultaneously up to 4.6% (3.8%) more accurate and up to 2.8X (5.7X) faster compared to non-ensemble (conventional ensemble) GNN models.

Sign bit is enough: a learning synchronization framework for multi-hop all-reduce with ultimate compression

  • Feijie Wu
  • Shiqi He
  • Song Guo
  • Zhihao Qu
  • Haozhao Wang
  • Weihua Zhuang
  • Jie Zhang

Traditional one-bit compressed stochastic gradient descent cannot be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. According to our theoretical findings, due to the cascading compression, the convergence performance of the training process deteriorates considerably. To overcome this limitation, we implement a sign-bit compression-based learning synchronization framework, Marsit. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation and a specific global compensation mechanism for mitigating compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces training time by up to 35% while preserving the same accuracy as training without compression.
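For readers unfamiliar with one-bit gradient compression, the sketch below shows the generic single-worker version with error feedback: each coordinate is sent as a sign plus one shared scale, and the quantization error is carried into the next step. This is a textbook-style illustration, not Marsit's multi-hop aggregation or its global compensation mechanism, and the function name `sign_compress` is hypothetical.

```python
def sign_compress(grad, residual):
    """One-bit compression with error feedback for a single worker.

    Returns the signs to transmit, the shared scale, and the residual
    (compression error) to add to the next gradient.
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    scale = sum(abs(c) for c in corrected) / len(corrected)  # mean magnitude
    signs = [1 if c >= 0 else -1 for c in corrected]
    decoded = [scale * s for s in signs]                      # what the receiver sees
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return signs, scale, new_residual


if __name__ == "__main__":
    signs, scale, residual = sign_compress([0.5, -1.5, 1.0], [0.0, 0.0, 0.0])
    print(signs, scale, residual)
```

When such compressed updates are re-compressed at every hop of an all-reduce, the errors compound, which is the cascading compression problem the abstract refers to.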

GLite: a fast and efficient automatic graph-level optimizer for large-scale DNNs

  • Jiaqi Li
  • Min Peng
  • Qingan Li
  • Meizheng Peng
  • Mengting Yuan

We propose a scalable graph-level optimizer named GLite to speed up search-based optimizations on large neural networks. GLite leverages a potential-based partitioning strategy to partition large computation graphs into small subgraphs without losing profitable substitution patterns. To avoid redundant subgraph matching, we propose a dynamic programming algorithm to reuse explored matching patterns. The experimental results show that GLite reduces the running time of search-based optimizations from hours to milliseconds, without compromising inference performance.

Contrastive quant: quantization makes stronger contrastive learning

  • Yonggan Fu
  • Qixuan Yu
  • Meng Li
  • Xu Ouyang
  • Vikas Chandra
  • Yingyan Lin

Contrastive learning learns visual representations by enforcing feature consistency under different augmented views. In this work, we explore contrastive learning from a new perspective. Interestingly, we find that quantization, when properly engineered, can enhance the effectiveness of contrastive learning. To this end, we propose a novel contrastive learning framework, dubbed Contrastive Quant, to encourage feature consistency under both differently augmented inputs via various data transformations and differently augmented weights/activations via various quantization levels. Extensive experiments, built on top of two state-of-the-art contrastive learning methods SimCLR and BYOL, show that Contrastive Quant consistently improves the learned visual representation.

Serpens: a high bandwidth memory based accelerator for general-purpose sparse matrix-vector multiplication

  • Linghao Song
  • Yuze Chi
  • Licheng Guo
  • Jason Cong

Sparse matrix-vector multiplication (SpMV) multiplies a sparse matrix with a dense vector. SpMV plays a crucial role in many applications, from graph analytics to deep learning. The random memory accesses of the sparse matrix make accelerator design challenging. However, high bandwidth memory (HBM) based FPGAs are a good fit for designing accelerators for SpMV. In this paper, we present Serpens, an HBM based accelerator for general-purpose SpMV, which features memory-centric processing engines and index coalescing to support the efficient processing of arbitrary SpMVs. From the evaluation of twelve large-size matrices, Serpens is 1.91x and 1.76x better in terms of geomean throughput than the latest accelerators GraphLily and Sextans, respectively. We also evaluate 2,519 SuiteSparse matrices, and Serpens achieves 2.10x higher throughput than a K80 GPU. For the energy/bandwidth efficiency, Serpens is 1.71x/1.99x, 1.90x/2.69x, and 6.25x/4.06x better compared with GraphLily, Sextans, and K80, respectively. After scaling up to 24 HBM channels, Serpens achieves up to 60.55 GFLOP/s (30,204 MTEPS) and up to 3.79x over GraphLily. The code is available at
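As a reference point for what an SpMV accelerator computes, here is a minimal software version over the common compressed sparse row (CSR) format. It is a plain baseline for illustration and says nothing about how Serpens organizes its processing engines; the function name `spmv_csr` is hypothetical.

```python
def spmv_csr(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    indptr[r]:indptr[r+1] delimits the nonzeros of row r; indices holds
    their column numbers and data their values.
    """
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            # The indirect read x[indices[k]] is the data-dependent access.
            y[row] += data[k] * x[indices[k]]
    return y


if __name__ == "__main__":
    # A = [[2, 0, 0],
    #      [0, 0, 3],
    #      [1, 4, 0]]
    indptr, indices, data = [0, 1, 2, 4], [0, 2, 0, 1], [2.0, 3.0, 1.0, 4.0]
    print(spmv_csr(indptr, indices, data, [1.0, 2.0, 3.0]))
```

The inner loop does one multiply-add per nonzero, so performance is dominated by memory traffic rather than arithmetic, which is why memory bandwidth matters so much for this kernel.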

An energy-efficient seizure detection processor using event-driven multi-stage CNN classification and segmented data processing with adaptive channel selection

  • Jiahao Liu
  • Zirui Zhong
  • Yong Zhou
  • Hui Qiu
  • Jianbiao Xiao
  • Jiajing Fan
  • Zhaomin Zhang
  • Sixu Li
  • Yiming Xu
  • Siqi Yang
  • Weiwei Shan
  • Shuisheng Lin
  • Liang Chang
  • Jun Zhou

Recently, wearable EEG monitoring devices with seizure detection processors using convolutional neural networks (CNNs) have been proposed to detect the seizure onset of patients in real time for alert or stimulation purposes. High energy efficiency and accuracy are required for the seizure detection processor due to the tight energy constraints of wearable devices. However, the use of CNNs and the multi-channel processing nature of seizure detection result in significant energy consumption. In this work, an energy-efficient seizure detection processor is proposed, featuring multi-stage CNN classification, segmented data processing and adaptive channel selection to reduce the energy consumption while achieving high accuracy. The design has been fabricated and tested using a 55nm process technology. Compared with several state-of-the-art designs, the proposed design achieves the lowest energy per classification (0.32 μJ) with high sensitivity (97.78%) and low false positive rate per hour (0.5).

PatterNet: explore and exploit filter patterns for efficient deep neural networks

  • Behnam Khaleghi
  • Uday Mallappa
  • Duygu Yaldiz
  • Haichao Yang
  • Monil Shah
  • Jaeyoung Kang
  • Tajana Rosing

Weight clustering is an effective technique for compressing the memory of deep neural networks (DNNs) by using a limited number of unique weights and low-bit weight indexes to store clustering information. In this paper, we propose PatterNet, which enforces shared clustering topologies on filters. Cluster sharing leads to a greater extent of memory reduction by reusing the index information. PatterNet effectively factorizes input activations and post-processes the unique weights, which saves multiplications by several orders of magnitude. Furthermore, PatterNet reduces the add operations by harnessing the fact that filters sharing a clustering pattern have the same factorized terms. We introduce techniques for determining and assigning clustering patterns and training a network to fulfill the target patterns. We also propose and implement an efficient accelerator that builds upon the patterned filters. Experimental results show that PatterNet shrinks the memory and operation count by up to 80.2% and 73.1%, respectively, with similar accuracy to the baseline models. The PatterNet accelerator improves the energy efficiency by 107x over Nvidia 1080 GTX and 2.2x over the state of the art.
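The multiplication savings from factorizing activations follow from simple algebra: when weights take only a few unique values, activations that share a weight can be summed first and multiplied once. The sketch below shows this for a single dot product; the function names are hypothetical and it does not model PatterNet's pattern sharing across filters.

```python
def dot_direct(weights, acts):
    """Baseline dot product: one multiplication per weight."""
    return sum(w * a for w, a in zip(weights, acts))


def dot_clustered(centroids, index, acts):
    """Dot product with clustered weights.

    index[i] names the centroid used by weight i. Activations are first
    accumulated per centroid (additions only), then each partial sum is
    multiplied by its centroid: one multiplication per unique weight.
    """
    partial = [0.0] * len(centroids)
    for idx, a in zip(index, acts):
        partial[idx] += a
    return sum(c * p for c, p in zip(centroids, partial))


if __name__ == "__main__":
    centroids = [0.5, -1.0]
    index = [0, 1, 0, 0, 1]
    acts = [1.0, 2.0, 3.0, 4.0, 5.0]
    weights = [centroids[i] for i in index]
    print(dot_direct(weights, acts), dot_clustered(centroids, index, acts))
```

Here five multiplications shrink to two. Two filters that share the same `index` pattern would also share the `partial` sums, which is the source of the additional savings in add operations that the abstract mentions.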

E2SR: an end-to-end video CODEC assisted system for super resolution acceleration

  • Zhuoran Song
  • Zhongkai Yu
  • Naifeng Jing
  • Xiaoyao Liang

Nowadays, high-resolution (HR) videos have become a popular choice for a better viewing experience. Recent works have shown that super-resolution (SR) algorithms can provide superior quality HR video by applying a deep neural network (DNN) to each low-resolution (LR) frame. Obviously, such per-frame DNN processing is compute-intensive and hampers the deployment of SR algorithms on mobile devices. Although many accelerators have proposed solutions, they only focus on mobile devices. In contrast, we notice that the HR video is originally stored in the cloud server and should be well exploited to gain high accuracy and performance improvement. Based on this observation, this paper proposes an end-to-end video CODEC assisted system (E2SR), which tightly couples the cloud server with the device to deliver a smooth and real-time video viewing experience. We propose a motion vector search algorithm executed in the cloud server, which can search the motion vectors and residuals for part of the HR video frames and then pack them as addons. We further propose a reconstruction algorithm executed in the device to quickly reconstruct the corresponding HR frames using the addons to skip part of the DNN computations. We design the corresponding E2SR architecture to enable the reconstruction algorithm in the device, which achieves significant speedup with minimal hardware overhead. Our experimental results show that the E2SR system achieves 3.4x performance improvement with less than 0.56 PSNR loss compared with the state-of-the-art “EDVR” scheme.

MATCHA: a fast and energy-efficient accelerator for fully homomorphic encryption over the torus

  • Lei Jiang
  • Qian Lou
  • Nrushad Joshi

Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to happen directly on ciphertexts using homomorphic logic gates. However, each TFHE gate on state-of-the-art hardware platforms such as GPUs and FPGAs is extremely slow (> 0.2ms). Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating point FFT and IFFT kernels. In this paper, we propose a fast and energy-efficient accelerator, MATCHA, to process TFHE gates. MATCHA supports aggressive bootstrapping key unrolling to accelerate TFHE gates without decryption errors by approximate multiplication-less integer FFTs and IFFTs, and a pipelined datapath. Compared to prior accelerators, MATCHA improves the TFHE gate processing throughput by 2.3x, and the throughput per Watt by 6.3x.

VirTEE: a full backward-compatible TEE with native live migration and secure I/O

  • Jianqiang Wang
  • Pouya Mahmoody
  • Ferdinand Brasser
  • Patrick Jauernig
  • Ahmad-Reza Sadeghi
  • Donghui Yu
  • Dahan Pan
  • Yuanyuan Zhang

Modern security architectures provide Trusted Execution Environments (TEEs) to protect critical data and applications against malicious privileged software in so-called enclaves. However, the seamless integration of existing TEEs into the cloud is hindered, as they require substantial adaptation of the software executing inside an enclave as well as the cloud management software to handle enclaved workloads. We tackle these challenges by presenting VirTEE, the first TEE architecture that allows strongly isolated execution of unmodified virtual machines (VMs) in enclaves, as well as secure live migration of VM enclaves between VirTEE-enabled servers. Combined with its secure I/O capabilities, VirTEE enables the integration of enclaved computing in today’s complex cloud infrastructure. We thoroughly evaluate our RISC-V-based prototype, and show its effectiveness and efficiency.

Apple vs. EMA: electromagnetic side channel attacks on Apple CoreCrypto

  • Gregor Haas
  • Aydin Aysu

Cryptographic instruction set extensions are commonly used for ciphers which would otherwise face unacceptable side channel risks. A prominent example of such an extension is the ARMv8 Cryptographic Extension, or ARM CE for short, which defines dedicated instructions to securely accelerate AES. However, while these extensions may be resistant to traditional “digital” side channel attacks, they may still be vulnerable to physical side channel attacks.

In this work, we demonstrate the first such attack on a standard ARM CE AES implementation. We specifically focus on the implementation used by Apple’s CoreCrypto library which we run on the Apple A10 Fusion SoC. To that end, we implement an optimized side channel acquisition infrastructure involving both custom iPhone software and accelerated analysis code. We find that an adversary which can observe 5–30 million known-ciphertext traces can reliably extract secret AES keys using electromagnetic (EM) radiation as a side channel. This corresponds to an encryption operation on less than half of a gigabyte of data, which could be acquired in less than 2 seconds on the iPhone 7 we examined. Our attack thus highlights the need for side channel defenses for real devices and production, industry-standard encryption software.
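The data-volume figure in the abstract can be checked with simple arithmetic. The sketch below assumes that each trace corresponds to one 16-byte AES block encryption; that mapping is an assumption made here for illustration, not something the abstract states. The function name `data_volume_bytes` is hypothetical.

```python
# Assumption: one captured trace corresponds to one AES block encryption.
AES_BLOCK_BYTES = 16  # the AES block size is 128 bits


def data_volume_bytes(num_traces, bytes_per_trace=AES_BLOCK_BYTES):
    """Total plaintext/ciphertext volume processed while capturing the traces."""
    return num_traces * bytes_per_trace


# Lower and upper trace counts quoted in the abstract.
for traces in (5_000_000, 30_000_000):
    gigabytes = data_volume_bytes(traces) / 1e9
    print(f"{traces} traces -> {gigabytes} GB")
```

Under this assumption the upper bound of 30 million traces amounts to 480 MB, which is consistent with the abstract's "less than half of a gigabyte".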

Algorithm/architecture co-design for energy-efficient acceleration of multi-task DNN

  • Jaekang Shin
  • Seungkyu Choi
  • Jongwoo Ra
  • Lee-Sup Kim

Real-world AI applications, such as augmented reality or autonomous driving, require processing multiple CV tasks simultaneously. However, the enormous data size and the memory footprint have been a crucial hurdle for deep neural networks to be applied in resource-constrained devices. To solve the problem, we propose an algorithm/architecture co-design. The proposed algorithmic scheme, named SqueeD, reduces per-task weight and activation size by 21.9x and 2.1x, respectively, by sharing those data between tasks. Moreover, we design architecture and dataflow to minimize DRAM access by fully utilizing benefits from SqueeD. As a result, the proposed architecture reduces the DRAM access increment and energy consumption increment per task by 2.2x and 1.3x, respectively.

EBSP: evolving bit sparsity patterns for hardware-friendly inference of quantized deep neural networks

  • Fangxin Liu
  • Wenbo Zhao
  • Zongwu Wang
  • Yongbiao Chen
  • Zhezhi He
  • Naifeng Jing
  • Xiaoyao Liang
  • Li Jiang

Model compression has been extensively investigated for supporting efficient neural network inference on edge-computing platforms due to the huge model size and computation amount. Recent research embraces joint-way compression across multiple techniques for extreme compression. However, most joint-way methods adopt a naive solution that applies two approaches sequentially, which can be sub-optimal, as it lacks a systematic approach to incorporate them.

This paper proposes the integration of aggressive joint-way compression into hardware design, namely EBSP. It is motivated by three observations: 1) quantization simplifies hardware implementation; 2) the bit distribution of quantized weights can be viewed as an independent trainable variable; 3) exploiting bit sparsity in the quantized network has the potential to achieve better performance. To achieve that, this paper introduces bit sparsity patterns to construct a bit distribution in the quantized network that is both highly expressive and inherently regular. We further incorporate our sparsity constraint in training to evolve the inherent bit distributions toward the bit sparsity pattern. Moreover, the structure of the introduced bit sparsity pattern engenders a minimal hardware implementation at competitive classification accuracy. Specifically, the quantized network constrained by the bit sparsity pattern can be processed using LUTs with the fewest entries instead of multipliers in minimally modified computational hardware. Our experiments show that compared to Eyeriss, BitFusion, WAX, and OLAccel, EBSP, with less than 0.8% accuracy loss, can achieve 87.3%, 79.7%, 75.2% and 58.9% energy reduction and 93.8%, 83.7%, 72.7% and 49.5% performance gain on average, respectively.

A time-to-first-spike coding and conversion aware training for energy-efficient deep spiking neural network processor design

  • Dongwoo Lew
  • Kyungchul Lee
  • Jongsun Park

In this paper, we present an energy-efficient SNN architecture, which can seamlessly run deep spiking neural networks (SNNs) with improved accuracy. First, we propose conversion aware training (CAT) to reduce ANN-to-SNN conversion loss without hardware implementation overhead. In the proposed CAT, the activation function developed for simulating the SNN during ANN training is efficiently exploited to reduce the data representation error after conversion. Based on the CAT technique, we also present a time-to-first-spike coding that allows lightweight logarithmic computation by utilizing spike time information. The SNN processor design that supports the proposed techniques has been implemented using a 28nm CMOS process. The processor achieves top-1 accuracies of 91.7%, 67.9% and 57.4% with inference energy of 486.7uJ, 503.6uJ, and 1426uJ to process CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, when running VGG-16 with 5-bit logarithmic weights.

XMA: a crossbar-aware multi-task adaption framework via shift-based mask learning method

  • Fan Zhang
  • Li Yang
  • Jian Meng
  • Jae-sun Seo
  • Yu (Kevin) Cao
  • Deliang Fan

The ReRAM crossbar array, as a highly parallel, fast, and energy-efficient structure, attracts much attention, especially for accelerating Deep Neural Network (DNN) inference on one specific task. However, due to the high energy consumption of weight re-programming and the ReRAM cells’ low endurance problem, adapting the crossbar array for multiple tasks has not been well explored. In this paper, we propose XMA, a novel crossbar-aware shift-based mask learning method for multiple task adaption in the ReRAM crossbar DNN accelerator for the first time. XMA leverages the popular mask-based learning algorithm’s benefit to mitigate catastrophic forgetting and learn a task-specific, crossbar column-wise, and shift-based multi-level mask, rather than the most commonly used element-wise binary mask, for each new task based on a frozen backbone model. With our crossbar-aware design innovation, the required masking operation to adapt for a new task could be implemented in an existing crossbar-based convolution engine with minimal hardware/memory overhead and, more importantly, no need for power-hungry cell re-programming, unlike prior works. The extensive experimental results show that, compared with the state-of-the-art multiple task adaption method Piggyback [1], XMA achieves 3.19% higher accuracy on average, while saving 96.6% memory overhead. Moreover, by eliminating cell re-programming, XMA achieves ~4.3x higher energy efficiency than Piggyback.

SWIM: selective write-verify for computing-in-memory neural accelerators

  • Zheyu Yan
  • Xiaobo Sharon Hu
  • Yiyu Shi

Computing-in-Memory architectures based on non-volatile emerging memories have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, these emerging devices can suffer from significant variations during the mapping process (i.e., programming weights to the devices), and if left unaddressed, these variations can cause significant accuracy degradation. The non-ideality of weight mapping can be compensated by iterative programming with a write-verify scheme, i.e., reading the conductance and rewriting if necessary. In all existing works, such a practice is applied to every single weight of a DNN as it is being mapped, which requires extensive programming time. In this work, we show that it is only necessary to select a small portion of the weights for write-verify to maintain the DNN accuracy, thus achieving significant speedup. We further introduce a second-derivative-based technique, SWIM, which only requires a single pass of forward and backpropagation, to efficiently select the weights that need write-verify. Experimental results on various DNN architectures for different datasets show that SWIM can achieve up to 10x programming speedup compared with conventional full-blown write-verify while attaining a comparable accuracy.
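The selection step described above can be sketched as a top-k ranking. The code below assumes a per-weight sensitivity score is already available (for example, some second-derivative estimate); how SWIM actually computes or uses its scores is not specified in the abstract, so the function name `select_for_write_verify` and the scores are purely hypothetical.

```python
def select_for_write_verify(sensitivities, k):
    """Return the indices of the k weights with the largest sensitivity score.

    `sensitivities` is a hypothetical list of per-weight scores; only the
    selected weights would go through the slow write-verify loop, while the
    rest would be programmed once without verification.
    """
    if not 0 <= k <= len(sensitivities):
        raise ValueError("k must be between 0 and the number of weights")
    ranked = sorted(range(len(sensitivities)),
                    key=lambda i: sensitivities[i],
                    reverse=True)
    return sorted(ranked[:k])


# Illustrative scores for five weights (made-up values).
scores = [0.1, 2.0, 0.05, 1.5, 0.3]
print(select_for_write_verify(scores, 2))
```

With k much smaller than the number of weights, the programming time spent in write-verify shrinks roughly in proportion, which is the intuition behind the reported speedup.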

Enabling efficient deep convolutional neural network-based sensor fusion for autonomous driving

  • Xiaoming Zeng
  • Zhendong Wang
  • Yang Hu

Autonomous driving demands accurate perception and safe decision-making. To achieve this, automated vehicles are typically equipped with multiple sensors (e.g., cameras and Lidar), enabling them to exploit complementary environmental contexts by fusing data from different sensing modalities. With the success of Deep Convolutional Neural Networks (DCNNs), fusion between multiple DCNNs has proven to be a promising strategy for achieving satisfactory perception accuracy. However, existing mainstream DCNN fusion strategies conduct fusion by simply adding feature maps extracted from different modalities together element-wise at various stages, failing to consider whether the features being fused are matched or not. Therefore, we first propose a feature disparity metric to quantitatively measure the degree of feature disparity between the feature maps being fused. Then, we propose a Fusion-filter as a Feature-matching technique to tackle the feature-mismatching issue. We also propose a Layer-sharing technique in the deep layer of the DCNN to achieve high accuracy. With the assistance of feature disparity working as an additional loss, our proposed techniques enable the DCNN to learn corresponding feature maps with similar characteristics and complementary visual context from different modalities. Evaluations demonstrate that our proposed fusion techniques can achieve higher accuracy on the KITTI dataset with lower computational resource consumption.

Zhuyi: perception processing rate estimation for safety in autonomous vehicles

  • Yu-Shun Hsiao
  • Siva Kumar Sastry Hari
  • Michał Filipiuk
  • Timothy Tsai
  • Michael B. Sullivan
  • Vijay Janapa Reddi
  • Vasu Singh
  • Stephen W. Keckler

The processing requirement of autonomous vehicles (AVs) for high-accuracy perception in complex scenarios can exceed the resources offered by the in-vehicle computer, degrading safety and comfort. This paper proposes a sensor frame processing rate (FPR) estimation model, Zhuyi, that quantifies the minimum safe FPR continuously in a driving scenario. Zhuyi can be employed post-deployment as an online safety check and to prioritize work. Experiments conducted using a multi-camera state-of-the-art industry AV system show that Zhuyi’s estimated FPRs are conservative, yet the system can maintain safety by processing only 36% or fewer frames compared to a default 30-FPR system in the tested scenarios.

Processing-in-SRAM acceleration for ultra-low power visual 3D perception

  • Yuquan He
  • Songyun Qu
  • Gangliang Lin
  • Cheng Liu
  • Lei Zhang
  • Ying Wang

Real-time ego-motion tracking and 3D structural estimation are fundamental tasks for ubiquitous cyber-physical systems, and they can be conducted via the state-of-the-art Edge-Based Visual Odometry (EBVO) algorithm. However, the intrinsically data-intensive process of EBVO imposes a memory-wall hurdle on practical deployment on conventional von-Neumann-style computing systems. In this work, we attempt to leverage the SRAM-based processing-in-memory (PIM) technique to alleviate this memory-wall bottleneck, so as to optimize EBVO systematically from the perspectives of the algorithm layer and the physical layer. In the algorithm layer, we first investigate the data reuse patterns of the essential computing kernels required for the feature detection and pose estimation steps in EBVO, and propose a PIM-friendly data layout and computing scheme for each kernel accordingly. We distill the basic logical and arithmetical operations required in the algorithm layer, and in the physical layer, we propose a novel bit-parallel and reconfigurable SRAM-PIM architecture to realize the operations with high computing precision and throughput. Our experimental results show that the proposed multi-layer optimization allows for high tracking accuracy of EBVO, and that it achieves 11x higher processing speed and 20x lower energy consumption compared to the CPU implementation.

Response time analysis for dynamic priority scheduling in ROS2

  • Abdullah Al Arafat
  • Sudharsan Vaidhun
  • Kurt M. Wilson
  • Jinghao Sun
  • Zhishan Guo

Robot Operating System (ROS) is the most popular framework for developing robotics software. Typically, robotics software is safety-critical and employed in real-time systems requiring timing guarantees. Since the first generation of ROS provides no timing guarantee, the recent release of its second generation, ROS2, is necessary and timely, and has since received immense attention from practitioners and researchers. Unfortunately, existing analysis of ROS2 has revealed a peculiar scheduling strategy in the ROS2 executor, which severely affects the response time of ROS2 applications. This paper proposes a deadline-based scheduling strategy for the ROS2 executor. It further presents an analysis of the end-to-end response time of a ROS2 workload (processing chain) and an evaluation of the proposed scheduling strategy for real workloads.
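As a rough illustration of deadline-based scheduling in general, the sketch below simulates a non-preemptive earliest-deadline-first (EDF) policy on a single worker. It is not the paper's executor design and does not use any ROS2 API; the job tuples, the run-to-completion assumption, and the name `simulate_edf` are assumptions made here for illustration.

```python
import heapq


def simulate_edf(jobs):
    """Non-preemptive EDF on one worker.

    jobs: list of (name, release_time, exec_time, absolute_deadline).
    Returns a list of (name, finish_time, met_deadline) in execution order.
    """
    pending = sorted(jobs, key=lambda job: job[1])  # order by release time
    ready = []      # min-heap keyed by absolute deadline
    schedule = []
    now = 0
    i = 0
    while i < len(pending) or ready:
        # Move every job released by `now` into the ready queue.
        while i < len(pending) and pending[i][1] <= now:
            name, _release, exec_time, deadline = pending[i]
            heapq.heappush(ready, (deadline, i, name, exec_time))
            i += 1
        if not ready:
            now = pending[i][1]  # idle until the next release
            continue
        # Run the ready job with the earliest deadline to completion.
        deadline, _, name, exec_time = heapq.heappop(ready)
        now += exec_time
        schedule.append((name, now, now <= deadline))
    return schedule


# Hypothetical callbacks: (name, release, execution time, deadline).
callbacks = [("camera", 0, 2, 10), ("lidar", 0, 3, 5), ("plan", 1, 1, 4)]
for entry in simulate_edf(callbacks):
    print(entry)
```

The finish time of the last job in a processing chain, minus the release time of the first, gives the end-to-end response time that such an analysis would bound.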

Voltage prediction of drone battery reflecting internal temperature

  • Jiwon Kim
  • Seunghyeok Jeon
  • Jaehyun Kim
  • Hojung Cha

Drones are commonly used in mission-critical applications, and the accurate estimation of available battery capacity before flight is critical for reliable and efficient mission planning. To this end, the battery voltage should be predicted accurately prior to launching a drone. However, in drone applications, a rise in the battery’s internal temperature changes the voltage significantly and leads to challenges in voltage prediction. In this paper, we propose a battery voltage prediction method that takes into account the battery’s internal temperature to accurately estimate the available capacity of the drone battery. To this end, we devise a temporal temperature factor (TTF) metric that is calculated by accumulating time series data about the battery’s discharge history. We employ a machine learning-based prediction model, reflecting the TTF metric, to achieve high prediction accuracy and low complexity. We validated the accuracy and complexity of our model with extensive evaluation. The results show that the proposed model is accurate with less than 1.5% error and readily operates on resource-constrained embedded devices.

A near-storage framework for boosted data preprocessing of mass spectrum clustering

  • Weihong Xu
  • Jaeyoung Kang
  • Tajana Rosing

Mass spectrometry (MS) has been a key to proteomics and metabolomics due to its unique ability to identify and analyze protein structures. Modern MS equipment generates massive amounts of tandem mass spectra with high redundancy, making spectral analysis the major bottleneck in the design of new medicines. Mass spectrum clustering is one promising solution as it greatly reduces data redundancy and boosts protein identification. However, state-of-the-art MS tools take many hours to run spectrum clustering. Spectra loading and preprocessing consume on average 82% of the execution time and energy during clustering. We propose a near-storage framework, MSAS, to speed up spectrum preprocessing. Instead of loading data into host memory and CPU, MSAS processes spectra near storage, thus reducing the expensive cost of data movement. We present two types of accelerators that leverage internal bandwidth at two storage levels: SSD and channel. The accelerators are optimized to match the data rate at each storage level with negligible overhead. Our results demonstrate that the channel-level design yields the best performance improvement for preprocessing – it is up to 187X and 1.8X faster than the CPU and the state-of-the-art in-storage computing solution, INSIDER, respectively. After integrating channel-level MSAS into existing MS clustering tools, we measure system level improvements in speed of 3.5X to 9.8X with 2.8X to 11.9X better energy efficiency.

MetaZip: a high-throughput and efficient accelerator for DEFLATE

  • Ruihao Gao
  • Xueqi Li
  • Yewen Li
  • Xun Wang
  • Guangming Tan

Booming data volume has become an important challenge for data center storage and bandwidth resources. Consequently, fast and efficient compression architecture is becoming the most fundamental design in data centers. However, a high compression ratio (CR) and high compression throughput are often difficult to achieve at the same time on existing computing platforms. DEFLATE is a widely used compression format in data centers, which is an ideal case for hardware acceleration. Unfortunately, DEFLATE has inherent dependencies in its special memory access pattern, which limit higher throughput.

In this paper, we propose MetaZip, a high-throughput and scalable data-compression architecture, which is targeted for FPGA-enabled data centers. To improve the compression throughput within the constraints of FPGA resources, we propose an adaptive parallel-width pipeline, which can be fed 64 bytes per cycle. To balance the compression quality, we propose a series of sub-modules (e.g., 8-bytes MetaHistory, Seed Bypass, Serialization Predictor). Experimental results show that MetaZip achieves a throughput of 15.6GB/s with a single engine, which is 234X/2.78X that of a CPU gzip baseline and an FPGA-based architecture, respectively.
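For readers unfamiliar with the compression ratio (CR) metric, the sketch below measures it in software using Python's standard `zlib` module, which emits a DEFLATE stream wrapped in a zlib header. This is only a CPU-side illustration of the metric and has nothing to do with MetaZip's hardware pipeline; the helper name `compression_ratio` and the sample data are assumptions.

```python
import zlib


def compression_ratio(data, level=6):
    """Compress `data` with DEFLATE (zlib container) and return the CR."""
    compressed = zlib.compress(data, level)
    # Round-trip check: decompression must restore the original bytes.
    if zlib.decompress(compressed) != data:
        raise RuntimeError("round-trip mismatch")
    return len(data) / len(compressed)


# Highly repetitive sample input (made-up), which compresses very well.
sample = b"abcabcabc" * 1000
for level in (1, 6, 9):
    print(level, compression_ratio(sample, level))
```

Higher compression levels spend more time searching for matches, which is one software-level view of the CR versus throughput tension the abstract describes.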

Enabling fast uncertainty estimation: accelerating bayesian transformers via algorithmic and hardware optimizations

  • Hongxiang Fan
  • Martin Ferianc
  • Wayne Luk

Quantifying the uncertainty of neural networks (NNs) has been required by many safety-critical applications such as autonomous driving or medical diagnosis. Recently, Bayesian transformers have demonstrated their capabilities in providing high-quality uncertainty estimates paired with excellent accuracy. However, their real-time deployment is limited by the compute-intensive attention mechanism that is core to the transformer architecture, and the repeated Monte Carlo sampling to quantify the predictive uncertainty. To address these limitations, this paper accelerates Bayesian transformers via both algorithmic and hardware optimizations. On the algorithmic level, an evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload. On the hardware level, we demonstrate that the sparsity brings hardware performance improvement on our optimized CPU and GPU implementations. An adaptable hardware architecture is also proposed to accelerate Bayesian transformers on an FPGA. Extensive experiments demonstrate that the EA-based framework, together with hardware optimizations, reduces the latency of Bayesian transformers by up to 13, 12 and 20 times on CPU, GPU and FPGA platforms respectively, while achieving higher algorithmic performance.

Eventor: an efficient event-based monocular multi-view stereo accelerator on FPGA platform

  • Mingjun Li
  • Jianlei Yang
  • Yingjie Qi
  • Meng Dong
  • Yuhao Yang
  • Runze Liu
  • Weitao Pan
  • Bei Yu
  • Weisheng Zhao

Event cameras are bio-inspired vision sensors that asynchronously represent pixel-level brightness changes as event streams. Event-based monocular multi-view stereo (EMVS) is a technique that exploits the event streams to estimate semi-dense 3D structure with known trajectory. It is a critical task for event-based monocular SLAM. However, the required intensive computation workloads make it challenging for real-time deployment on embedded platforms. In this paper, Eventor is proposed as a fast and efficient EMVS accelerator by realizing the most critical and time-consuming stages, including event back-projection and volumetric ray-counting, on FPGA. Highly parallel and fully pipelined processing elements are specially designed on the FPGA and integrated with the embedded ARM as a heterogeneous system to improve the throughput and reduce the memory footprint. Meanwhile, the EMVS algorithm is reformulated in a more hardware-friendly manner by rescheduling, approximate computing and hybrid data quantization. Evaluation results on the DAVIS dataset show that Eventor achieves up to 24X improvement in energy efficiency compared with an Intel i5 CPU platform.

GEML: GNN-based efficient mapping method for large loop applications on CGRA

  • Mingyang Kou
  • Jun Zeng
  • Boxiao Han
  • Fei Xu
  • Jiangyuan Gu
  • Hailong Yao

Coarse-grained reconfigurable architecture (CGRA) is an emerging hardware architecture, with reconfigurable Processing Elements (PEs) for executing operations efficiently and flexibly. One major challenge for current CGRA compilers is the scalability issue for large loop applications, where valid loop mapping results cannot be obtained in an acceptable time. This paper proposes an enhanced loop mapping method based on Graph Neural Network (GNN), which effectively addresses the scalability issue and generates valid loop mapping results for large applications. Experimental results show that the proposed method improves the compilation time by 10.8x on average over existing methods, with even better loop mapping solutions.

Mixed-granularity parallel coarse-grained reconfigurable architecture

  • Jinyi Deng
  • Linyun Zhang
  • Lei Wang
  • Jiawei Liu
  • Kexiang Deng
  • Shibin Tang
  • Jiangyuan Gu
  • Boxiao Han
  • Fei Xu
  • Leibo Liu
  • Shaojun Wei
  • Shouyi Yin

Coarse-Grained Reconfigurable Architecture (CGRA) is a high-performance computing architecture. However, the silicon utilization of existing CGRAs is low due to the lack of fine-grained parallelism inside the Processing Element (PE) and of a general coarse-grained parallel approach on the PE array. The absence of fine-grained parallelism in the PE not only leads to low silicon utilization of the PE, but also makes the mapping loose and irregular. The absence of a generalized parallel method for the mapping causes low PE utilization on the CGRA. Our goal is to design an execution model and a Mixed-granularity Parallel CGRA (MP-CGRA), which is capable of fine-grained parallel operator execution in PEs and parallel data transmission in channels, leading to a compact mapping. A coarse-grained general parallel method is proposed to vectorize the compact mapping. Evaluated with Machsuite, MP-CGRA achieves a 104.65% improvement in silicon utilization on the PE array and a 91.40% improvement in performance per area compared with the baseline CGRA.

GuardNN: secure accelerator architecture for privacy-preserving deep learning

  • Weizhe Hua
  • Muhammad Umar
  • Zhiru Zhang
  • G. Edward Suh

This paper proposes GuardNN, a secure DNN accelerator that provides hardware-based protection for user data and model parameters even in an untrusted environment. GuardNN shows that the architecture and protection can be customized for a specific application to provide strong confidentiality and integrity guarantees with negligible overhead. The design of the GuardNN instruction set reduces the TCB to just the accelerator and allows confidentiality protection even when the instructions from a host cannot be trusted. GuardNN minimizes the overhead of memory encryption and integrity verification by customizing the off-chip memory protection for the known memory access patterns of a DNN accelerator. GuardNN is prototyped on an FPGA, demonstrating effective confidentiality protection with ~3% performance overhead for inference.

SRA: a secure ReRAM-based DNN accelerator

  • Lei Zhao
  • Youtao Zhang
  • Jun Yang

Deep Neural Network (DNN) accelerators are increasingly developed to pursue high efficiency in DNN computing. However, the IP protection of the DNNs deployed on such accelerators is an important topic that has been less addressed. Although there are previous works that targeted this problem for CMOS-based designs, there is still no solution for ReRAM-based accelerators which pose new security challenges due to their crossbar structure and non-volatility. ReRAM’s non-volatility retains data even after the system is powered off, making the stored DNN model vulnerable to attacks by simply reading out the ReRAM content. Because the crossbar structure can only compute on plaintext data, encrypting the ReRAM content is no longer a feasible solution in this scenario.

In this paper, we propose SRA – a secure ReRAM-based DNN accelerator that stores DNN weights on crossbars in an encrypted format while still maintaining ReRAM’s in-memory computing capability. The proposed encryption scheme also supports sharing bits among multiple weights, significantly reducing the storage overhead. In addition, SRA uses a novel high-bandwidth SC conversion scheme to protect each layer’s intermediate results, which also contain sensitive information of the model. Our experimental results show that SRA can effectively prevent pirating the deployed DNN weights as well as the intermediate results with negligible accuracy loss, and achieves 1.14X performance speedup and 9% energy reduction compared to ISAAC – a non-secure ReRAM-based baseline.

ABNN2: secure two-party arbitrary-bitwidth quantized neural network predictions

  • Liyan Shen
  • Ye Dong
  • Binxing Fang
  • Jinqiao Shi
  • Xuebin Wang
  • Shengli Pan
  • Ruisheng Shi

Data privacy and security issues are preventing many potential on-cloud machine-learning-as-a-service applications from being deployed. In the recent past, secure multi-party computation (MPC) has been used to achieve secure neural network predictions, guaranteeing the privacy of data. However, the existing two-party solutions are expensive and impractical in real-world settings.

In this work, we utilize the advantages of quantized neural networks (QNNs) and MPC to present ABNN2, a practical secure two-party framework that can realize arbitrary-bitwidth quantized neural network predictions. Concretely, we propose an efficient and novel matrix multiplication protocol based on 1-out-of-N OT extension and optimize the protocol through a parallel scheme. In addition, we design an optimized protocol for the ReLU function. The experiments demonstrate that our protocols are about 2X-36X and 1.4X–7X faster than SecureML (S&P’17) and MiniONN (CCS’17), respectively. ABNN2 also obtains efficiency comparable to the state-of-the-art QNN prediction protocol QUOTIENT (CCS’19), but the latter only supports ternary neural networks.

Adaptive neural recovery for highly robust brain-like representation

  • Prathyush Poduval
  • Yang Ni
  • Yeseong Kim
  • Kai Ni
  • Raghavan Kumar
  • Rossario Cammarota
  • Mohsen Imani

Today’s machine learning platforms have major robustness issues dealing with insecure and unreliable memory systems. In conventional data representation, bit flips due to noise or attack can cause value explosion, which leads to incorrect learning prediction. In this paper, we propose RobustHD, a robust and noise-tolerant learning system based on HyperDimensional Computing (HDC), mimicking important brain functionalities. Unlike traditional binary representation, RobustHD exploits a redundant and holographic representation, ensuring all bits have the same impact on the computation. RobustHD also proposes a runtime framework that adaptively identifies and regenerates the faulty dimensions in an unsupervised way. Our solution not only provides security against possible bit-flip attacks but also provides a learning solution with high robustness to noise in the memory. We performed a cross-stacked evaluation from a conventional platform to an emerging processing in-memory architecture. Our evaluation shows that under a 10% random bit flip attack, RobustHD provides a maximum of 0.53% quality loss, while deep learning solutions lose over 26.2% accuracy.
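The claim that all bits of a holographic representation carry equal weight can be illustrated with a toy experiment. The sketch below uses random binary hypervectors and Hamming similarity, which is a common HDC setup but not necessarily the encoding RobustHD uses; the dimensionality, seed, and function names are assumptions chosen for illustration, and the recovery framework from the abstract is not modeled.

```python
import random


def random_hypervector(dim, rng):
    """Random binary hypervector of length `dim`."""
    return [rng.randint(0, 1) for _ in range(dim)]


def flip_bits(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]


def hamming_similarity(a, b):
    """Fraction of positions where the two vectors agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)   # fixed seed for repeatability
DIM = 10_000             # assumed dimensionality
class_a = random_hypervector(DIM, rng)
class_b = random_hypervector(DIM, rng)

# Corrupt class A's vector with a 10% random bit-flip rate.
noisy_a = flip_bits(class_a, 0.10, rng)
sim_a = hamming_similarity(noisy_a, class_a)
sim_b = hamming_similarity(noisy_a, class_b)
print(sim_a, sim_b)

# Contrast: one flipped high-order bit in a 32-bit integer weight.
weight = 3
print(weight ^ (1 << 31))
```

The corrupted vector stays far closer to its own class than to an unrelated one, so a nearest-class decision is unchanged, whereas a single flip of a high-order bit changes a conventional integer by about two billion.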

Efficiency attacks on spiking neural networks

  • Sarada Krithivasan
  • Sanchari Sen
  • Nitin Rathi
  • Kaushik Roy
  • Anand Raghunathan

Spiking Neural Networks (SNNs) are a class of artificial neural networks that process information as discrete spikes. The time and energy consumed in SNN implementations are strongly dependent on the number of spikes processed. We explore this sensitivity from an adversarial perspective and propose SpikeAttack, a completely new class of attacks on SNNs. SpikeAttack impacts the efficiency of SNNs via imperceptible perturbations that increase the overall spiking activity of the network, leading to increased time and energy consumption. Across four SNN benchmarks, SpikeAttack results in a 1.7x-2.5x increase in spike activity, leading to increases of 1.6x-2.3x and 1.4x-2.2x in latency and energy consumption, respectively.

L-QoCo: learning to optimize cache capacity overloading in storage systems

  • Ji Zhang
  • Xijun Li
  • Xiyao Zhou
  • Mingxuan Yuan
  • Zhuo Cheng
  • Keji Huang
  • Yifan Li

Cache plays an important role in maintaining high and stable performance (i.e., high throughput, low tail latency, and low throughput jitter) in storage systems. Existing rule-based cache management methods, coupled with engineers’ manual configurations, cannot meet ever-growing requirements of both time-varying workloads and complex storage systems, leading to frequent cache overloading.

In this paper, we propose the first light-weight learning-based cache bandwidth control technique, called L-QoCo, which can adaptively control the cache bandwidth so as to effectively prevent cache overloading in storage systems. Extensive experiments with various workloads on real systems show that L-QoCo, with its strong adaptability and fast learning ability, can adapt to various workloads to effectively control cache bandwidth, thereby significantly improving the storage performance (e.g., increasing the throughput by 10%-20% and reducing the throughput jitter and tail latency by 2X-6X and 1.5X-4X, respectively, compared with two representative rule-based methods).

Pipette: efficient fine-grained reads for SSDs

  • Shuhan Bai
  • Hu Wan
  • Yun Huang
  • Xuan Sun
  • Fei Wu
  • Changsheng Xie
  • Hung-Chih Hsieh
  • Tei-Wei Kuo
  • Chun Jason Xue

Big data applications, such as recommendation systems and social networks, often generate a huge number of fine-grained reads to the storage. Block-oriented storage devices tend to suffer from these fine-grained read operations in terms of I/O traffic as well as performance. Motivated by this challenge, a fine-grained read framework, Pipette, is proposed in this paper as an extension to the traditional I/O framework. With an adaptive caching design, the Pipette framework offers a tremendous reduction in I/O traffic and achieves a significant performance gain. A Pipette prototype was implemented with the Ext4 file system on an SSD for two real-world applications, where the I/O throughput is improved by 31.6% and 33.5%, and the I/O traffic is reduced by 95.6% and 93.6%, respectively.

CDB: critical data backup design for consumer devices with high-density flash based hybrid storage

  • Longfei Luo
  • Dingcui Yu
  • Liang Shi
  • Chuanmin Ding
  • Changlong Li
  • Edwin H.-M. Sha

Hybrid flash-based storage constructed with high-density, low-cost flash memory has become increasingly popular in consumer devices during the last decade. However, existing methods for protecting critical data are designed to improve the reliability of consumer devices with non-hybrid flash storage. Based on evaluations and analysis, these methods result in significant performance and lifetime degradation in consumer devices with hybrid storage. The reason is that different kinds of memory in hybrid storage have different characteristics, such as performance and access granularity. To address these problems, a critical data backup (CDB) method is proposed to back up designated critical data while making full use of the different kinds of memory in hybrid storage. Experimental results show that, compared with state-of-the-art methods, CDB achieves encouraging performance and lifetime improvements.

SS-LRU: a smart segmented LRU caching

  • Chunhua Li
  • Man Wu
  • Yuhan Liu
  • Ke Zhou
  • Ji Zhang
  • Yunqing Sun

Many caching policies use machine learning to predict data reuse, but they ignore the impact of incorrect prediction on cache performance, especially for large-size objects. In this paper, we propose a smart segmented LRU (SS-LRU) replacement policy, which adopts a size-aware classifier designed for cache scenarios and considers the cache cost caused by misprediction. Besides, SS-LRU enhances the migration rules of segmented LRU (SLRU) and implements smart caching with unequal priorities and segment sizes based on prediction and multiple access patterns. We conducted extensive experiments under real-world workloads to demonstrate the superiority of our approach over state-of-the-art caching policies.
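
For readers unfamiliar with the baseline that SS-LRU builds on, the following is a minimal sketch of a plain two-segment SLRU cache. It is not the SS-LRU policy: the size-aware classifier, misprediction cost, and enhanced migration rules from the abstract are not modeled, and the class name, method names, and segment sizes are illustrative assumptions.

```python
from collections import OrderedDict


class SegmentedLRU:
    """Plain two-segment SLRU (illustrative; not the SS-LRU policy)."""

    def __init__(self, probation_size, protected_size):
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation = OrderedDict()  # least recently used entry first
        self.protected = OrderedDict()

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)  # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)  # a hit in probation promotes the entry
            return value
        return None  # miss

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        elif key in self.probation:
            self.probation[key] = value
            self.probation.move_to_end(key)
        else:
            self._insert_probation(key, value)  # new entries start in probation

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_size:
            # Demote the protected LRU entry back to probation.
            old_key, old_value = self.protected.popitem(last=False)
            self._insert_probation(old_key, old_value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)  # evict from the cache


cache = SegmentedLRU(probation_size=2, protected_size=1)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is hit once and moves to the protected segment
cache.put("c", 3)
cache.put("d", 4)  # probation overflows, so "b" is evicted; "a" survives
```

A learned policy in the spirit of the abstract would replace the fixed "start in probation, promote on hit" rules with decisions driven by a reuse predictor.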

NobLSM: an LSM-tree with non-blocking writes for SSDs

  • Haoran Dang
  • Chongnan Ye
  • Yanpeng Hu
  • Chundong Wang

Solid-state drives (SSDs) are gaining popularity. Meanwhile, key-value stores built on log-structured merge-tree (LSM-tree) are widely deployed for data management. LSM-tree frequently calls syncs to persist newly-generated files for crash consistency. The blocking syncs are costly for performance. We revisit the necessity of syncs for LSM-tree. We find that Ext4 journaling embraces asynchronous commits to implicitly persist files. Hence, we design NobLSM that makes LSM-tree and Ext4 cooperate to substitute most syncs with non-blocking asynchronous commits, without losing consistency. Experiments show that NobLSM significantly outperforms state-of-the-art LSM-trees with higher throughput on an ordinary SSD.

TailCut: improving performance and lifetime of SSDs using pattern-aware state encoding

  • Jaeyong Lee
  • Myungsunk Kim
  • Wonil Choi
  • Sanggu Lee
  • Jihong Kim

Although lateral charge spreading is considered as a dominant error source in 3D NAND flash memory, little is known about its detailed characteristics at the storage system level. From a device characterization study, we observed that lateral charge spreading strongly depends on vertically adjacent state patterns and a few specific patterns are responsible for a large portion of bit errors from lateral charge spreading. We propose a new state encoding scheme, called TailCut, which removes vulnerable state patterns by modifying encoded states. By removing vulnerable patterns, TailCut can improve the SSD lifetime and read latency by 80% and 25%, respectively.

HIMap: a heuristic and iterative logic synthesis approach

  • Xing Li
  • Lei Chen
  • Fan Yang
  • Mingxuan Yuan
  • Hongli Yan
  • Yupeng Wan

Recently, many models have shown their superiority in sequence and parameter tuning. However, they usually generate non-deterministic flows and require lots of training data. We thus propose a heuristic and iterative flow, namely HIMap, for deterministic logic synthesis. In HIMap, domain knowledge of the functionality and parameters of synthesis operators and their correlations to netlist PPA is fully utilized to design synthesis templates for various objectives. We also introduce deterministic and effective heuristics to tune the templates with relatively fixed operator combinations and iteratively improve netlist PPA. Two nested iterations with local searching and early stopping can thus generate dynamic sequences for various circuits and reduce runtime. HIMap improves 13 best results of the EPFL combinational benchmarks for delay (5 for area). In particular, for several arithmetic benchmarks, HIMap significantly reduces LUT-6 levels by 11.6 ~ 21.2% and delay after P&R by 5.0 ~ 12.9%.

Improving LUT-based optimization for ASICs

  • Walter Lau Neto
  • Luca Amarú
  • Vinicius Possani
  • Patrick Vuillod
  • Jiong Luo
  • Alan Mishchenko
  • Pierre-Emmanuel Gaillardon

LUT-based optimization techniques are finding new applications in synthesis of ASIC designs. Intuitively, packing logic into LUTs provides a better balance between functionality and structure in logic optimization. On this basis, the LUT-engine framework [1] was introduced to enhance ASIC synthesis. In this paper, we present key improvements, at both algorithmic and flow levels, making a much stronger LUT-engine. We restructure the flow of LUT-engine to benefit from a heterogeneous mixture of LUT sizes, and revisit its requirements for maximum scalability. We propose a dedicated LUT mapper for the new flow, based on FlowMap, natively balancing LUT-count and NAND2-count for a wide range of LUT sizes. We describe a specialized Boolean factoring technique, exploiting the fanin bounds in LUT networks, resulting in a very fast LUT-based AIG minimization. By using the proposed methodology, we improve 9 of the best area results in the ongoing EPFL synthesis competition. Integrated in a complete EDA flow for ASICs, the new LUT-engine performs well on a set of 87 benchmarks: -4.60% area and -3.41% switching power at +5% runtime, compared to the baseline flow without LUT-based optimizations, and -3.02% area and -2.54% switching power with -1% runtime, compared to the original LUT-engine.

NovelRewrite: node-level parallel AIG rewriting

  • Shiju Lin
  • Jinwei Liu
  • Tianji Liu
  • Martin D. F. Wong
  • Evangeline F. Y. Young

Logic rewriting is an important part in logic optimization. It rewrites a circuit by replacing local subgraphs with logically equivalent ones, so that the area and the delay of the circuit can be optimized. This paper introduces a parallel AIG rewriting algorithm with a new concept of logical cuts. Experiments show that this algorithm implemented with one GPU can be on average 32X faster than the logic rewriting in the logic synthesis tool ABC on large benchmarks. Compared with other logic rewriting acceleration works, ours has the best quality and the shortest running time.

Search space characterization for approximate logic synthesis

  • Linus Witschen
  • Tobias Wiersema
  • Lucas Reuter
  • Marco Platzner

Approximate logic synthesis aims at trading off a circuit’s quality to improve a target metric. Corresponding methods explore a search space by approximating circuit components and verifying the resulting quality of the overall circuit, which is costly.

We propose a methodology that determines reasonable values for the component’s local error bounds prior to search space exploration. Utilizing formal verification on a novel approximation miter guarantees the circuit’s quality for such local error bounds, independent of employed approximation methods, resulting in reduced runtimes due to omitted verifications. Experiments show speed-ups of up to 3.7x for approximate logic synthesis using our method.

SEALS: sensitivity-driven efficient approximate logic synthesis

  • Chang Meng
  • Xuan Wang
  • Jiajun Sun
  • Sijun Tao
  • Wei Wu
  • Zhihang Wu
  • Leibin Ni
  • Xiaolong Shen
  • Junfeng Zhao
  • Weikang Qian

Approximate computing is an emerging computing paradigm to design energy-efficient systems. Many greedy approximate logic synthesis (ALS) methods have been proposed to automatically synthesize approximate circuits. They typically need to consider all local approximate changes (LACs) in each iteration of the ALS flow to select the best one, which is time-consuming. In this paper, we propose SEALS, a Sensitivity-driven Efficient ALS method to speed up a greedy ALS flow. SEALS centers around a newly proposed concept called sensitivity, which enables a fast and accurate error estimation method and an efficient method to filter out unpromising LACs. SEALS can handle any statistical error metric. The experimental results show that it outperforms a state-of-the-art ALS method in runtime by 12X to 15X without reducing circuit quality.

Beyond local optimality of buffer and splitter insertion for AQFP circuits

  • Siang-Yun Lee
  • Heinz Riener
  • Giovanni De Micheli

Adiabatic quantum-flux parametron (AQFP) is an energy-efficient superconducting technology. Buffer and splitter (B/S) cells must be inserted into an AQFP circuit to meet the technology-imposed constraints on path balancing and fanout branching. These cells account for a significant amount of the circuit’s area and delay. In this paper, we identify that B/S insertion is a scheduling problem, and propose (a) a linear-time algorithm for locally optimal B/S insertion subject to a given schedule; (b) an SMT formulation to find the global optimum; and (c) an efficient heuristic for global B/S optimization. Experimental results show a reduction of 4% in the B/S cost and a 124X speed-up compared to the state-of-the-art algorithm, and the capability to scale to benchmarks an order of magnitude larger.

NAX: neural architecture and memristive xbar based accelerator co-design

  • Shubham Negi
  • Indranil Chakraborty
  • Aayush Ankit
  • Kaushik Roy

Neural Architecture Search (NAS) has provided the ability to design efficient deep neural networks (DNNs) tailored to different hardware such as GPUs and CPUs. However, integrating NAS with Memristive Crossbar Array (MCA) based In-Memory Computing (IMC) accelerators remains an open problem. The hardware efficiency (energy, latency and area) as well as application accuracy (considering device and circuit non-idealities) of DNNs mapped to such hardware are co-dependent on network parameters such as kernel size and depth, and hardware architecture parameters such as crossbar size and the precision of analog-to-digital converters. Co-optimization of both network and hardware parameters presents a challenging search space comprising different kernel sizes mapped to varying crossbar sizes. To that effect, we propose NAX – an efficient neural architecture search engine that co-designs neural network and IMC based hardware architecture. NAX explores the aforementioned search space to determine kernel and corresponding crossbar sizes for each DNN layer to achieve optimal tradeoffs between hardware efficiency and application accuracy. For CIFAR-10 and Tiny ImageNet, our models achieve 0.9% and 18.57% higher accuracy at 30% and -10.47% lower EDAP (energy-delay-area product), compared to baseline ResNet-20 and ResNet-18 models, respectively.

MC-CIM: a reconfigurable computation-in-memory for efficient stereo matching cost computation

  • Zhiheng Yue
  • Yabing Wang
  • Leibo Liu
  • Shaojun Wei
  • Shuoyi Yin

This paper proposes the design of a computation-in-memory for stereo matching cost computation. The matching cost computation incurs large energy and latency overhead because of frequent memory access. To overcome previous design limitations, this work, named MC-CIM, performs matching cost computation without incurring memory access and introduces several key features. (1) A lightweight balanced computing unit is integrated within the cell array to reduce memory access and improve system throughput. (2) A self-optimized circuit design allows the arithmetic operation to be altered for the matching algorithm in various scenarios. (3) A flexible data mapping method and a reconfigurable digital peripheral explore maximum parallelism on different algorithms and bit precisions. The proposed design is implemented in 28nm technology and achieves an average performance of 277 TOPS/W.

iMARS: an in-memory-computing architecture for recommendation systems

  • Mengyuan Li
  • Ann Franchesca Laguna
  • Dayane Reis
  • Xunzhao Yin
  • Michael Niemier
  • X. Sharon Hu

Recommendation systems (RecSys) suggest items to users by predicting their preferences based on historical data. Typical RecSys handle large embedding tables and many embedding table related operations. The memory size and bandwidth of the conventional computer architecture restrict the performance of RecSys. This work proposes an in-memory-computing (IMC) architecture (iMARS) for accelerating the filtering and ranking stages of deep neural network-based RecSys. iMARS leverages IMC-friendly embedding tables implemented inside a ferroelectric FET based IMC fabric. Circuit-level and system-level evaluations show that iMARS achieves 16.8x (713x) end-to-end latency (energy) improvement compared to the GPU counterpart for the MovieLens dataset.

ReGNN: a ReRAM-based heterogeneous architecture for general graph neural networks

  • Cong Liu
  • Haikun Liu
  • Hai Jin
  • Xiaofei Liao
  • Yu Zhang
  • Zhuohui Duan
  • Jiahong Xu
  • Huize Li

Graph Neural Networks (GNNs) have both graph processing and neural network computational features. Traditional graph accelerators and NN accelerators cannot meet these dual characteristics of GNN applications simultaneously. In this work, we propose a ReRAM-based processing-in-memory (PIM) architecture called ReGNN for GNN acceleration. ReGNN is composed of analog PIM (APIM) modules for accelerating matrix vector multiplication (MVM) operations, and digital PIM (DPIM) modules for accelerating non-MVM aggregation operations. To improve data parallelism, ReGNN maps data to aggregation sub-engines based on the degree of vertices and the dimension of feature vectors. Experimental results show that ReGNN speeds up GNN inference by 228x and 8.4x, and reduces energy consumption by 305.2x and 10.5x, compared with GPU and the ReRAM-based GNN accelerator ReGraphX, respectively.

You only search once: on lightweight differentiable architecture search for resource-constrained embedded platforms

  • Xiangzhong Luo
  • Di Liu
  • Hao Kong
  • Shuo Huai
  • Hui Chen
  • Weichen Liu

Benefiting from the search efficiency, differentiable neural architecture search (NAS) has evolved as the most dominant alternative to automatically design competitive deep neural networks (DNNs). We note that DNNs must be executed under strictly hard performance constraints in real-world scenarios, for example, the runtime latency on autonomous vehicles. However, to obtain the architecture that meets the given performance constraint, previous hardware-aware differentiable NAS methods have to repeat a plethora of search runs to manually tune the hyper-parameters by trial and error, and thus the total design cost increases proportionally. To resolve this, we introduce a lightweight hardware-aware differentiable NAS framework dubbed LightNAS, striving to find the required architecture that satisfies various performance constraints through a one-time search (i.e., you only search once). Extensive experiments are conducted to show the superiority of LightNAS over previous state-of-the-art methods. Related codes will be released at

EcoFusion: energy-aware adaptive sensor fusion for efficient autonomous vehicle perception

  • Arnav Vaibhav Malawade
  • Trier Mortlock
  • Mohammad Abdullah Al Faruque

Autonomous vehicles use multiple sensors, large deep-learning models, and powerful hardware platforms to perceive the environment and navigate safely. In many contexts, some sensing modalities negatively impact perception while increasing energy consumption. We propose EcoFusion: an energy-aware sensor fusion approach that uses context to adapt the fusion method and reduce energy consumption without affecting perception performance. EcoFusion performs up to 9.5% better at object detection than existing fusion methods with approximately 60% less energy and 58% lower latency on the industry-standard Nvidia Drive PX2 hardware platform. We also propose several context-identification strategies, implement a joint optimization between energy and performance, and present scenario-specific results.

Human emotion based real-time memory and computation management on resource-limited edge devices

  • Yijie Wei
  • Zhiwei Zhong
  • Jie Gu

Emotional AI or Affective Computing has been projected to grow rapidly in the upcoming years. Despite many existing developments in the application space, there has been a lack of hardware-level exploitation of the user’s emotions. In this paper, we propose a deep collaboration between the user’s affects and the hardware system management on resource-limited edge devices. Based on classification results from efficient affect classifiers on smartphone devices, novel real-time management schemes for memory and video processing are proposed to improve the energy efficiency of mobile devices. Case studies on H.264 / AVC video playback and Android smartphone usage are provided, showing significant power savings of up to 23% and a reduction of memory loading of up to 17% using the proposed affect adaptive architecture and system management schemes.

Hierarchical memory-constrained operator scheduling of neural architecture search networks

  • Zihan Wang
  • Chengcheng Wan
  • Yuting Chen
  • Ziyi Lin
  • He Jiang
  • Lei Qiao

Neural Architecture Search (NAS) is widely used in industry, searching for neural networks meeting task requirements. Meanwhile, it faces a challenge in scheduling networks satisfying memory constraints. This paper proposes HMCOS that performs hierarchical memory-constrained operator scheduling of NAS networks: given a network, HMCOS constructs a hierarchical computation graph and employs an iterative scheduling algorithm to progressively reduce peak memory footprints. We evaluate HMCOS against RPO and Serenity (two popular scheduling techniques). The results show that HMCOS outperforms existing techniques in supporting more NAS networks, reducing peak memory footprints by 8.7% to 42.4%, and achieving 137–283x speedups in scheduling.
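
To make the objective of memory-constrained operator scheduling concrete, the sketch below computes the peak memory footprint of a given operator order. It is not the HMCOS algorithm; it only evaluates a schedule under a simple assumed memory model in which an output tensor is allocated when its operator runs and freed after its last consumer. The function name, graph, and tensor sizes are hypothetical.

```python
def peak_memory(schedule, inputs, size):
    """Peak size of live tensors when operators run in `schedule` order.

    schedule: operator names in a valid topological order
    inputs:   dict op -> list of ops whose outputs it consumes
    size:     dict op -> size of the op's output tensor
    """
    position = {op: i for i, op in enumerate(schedule)}
    # Outputs without consumers (network results) stay alive until the end.
    last_use = {op: len(schedule) - 1 for op in schedule}
    consumers = {op: [] for op in schedule}
    for op in schedule:
        for src in inputs.get(op, []):
            consumers[src].append(op)
    for op, users in consumers.items():
        if users:
            last_use[op] = max(position[u] for u in users)

    live = 0
    peak = 0
    for step, op in enumerate(schedule):
        live += size[op]        # allocate this operator's output
        peak = max(peak, live)  # inputs and output coexist at this point
        for src in set(inputs.get(op, [])):
            if last_use[src] == step:
                live -= size[src]  # free inputs that are no longer needed
    return peak


# Hypothetical two-branch network: x feeds two branches that merge in `out`.
inputs = {"a1": ["x"], "a2": ["a1"], "b1": ["x"], "b2": ["b1"], "out": ["a2", "b2"]}
size = {"x": 1, "a1": 10, "a2": 1, "b1": 10, "b2": 1, "out": 1}

branch_by_branch = ["x", "a1", "a2", "b1", "b2", "out"]
interleaved = ["x", "a1", "b1", "a2", "b2", "out"]
print(peak_memory(branch_by_branch, inputs, size))  # 12
print(peak_memory(interleaved, inputs, size))       # 21
```

Finishing one branch before starting the other keeps only one large intermediate tensor alive at a time, which is why the two valid orders differ in peak memory.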

MIME: adapting a single neural network for multi-task inference with memory-efficient dynamic pruning

  • Abhiroop Bhattacharjee
  • Yeshwanth Venkatesha
  • Abhishek Moitra
  • Priyadarshini Panda

Recent years have seen a paradigm shift towards multi-task learning. This calls for memory and energy-efficient solutions for inference in a multi-task scenario. We propose an algorithm-hardware co-design approach called MIME. MIME reuses the weight parameters of a trained parent task and learns task-specific threshold parameters for inference on multiple child tasks. We find that MIME results in highly memory-efficient DRAM storage of neural-network parameters for multiple tasks compared to conventional multi-task inference. In addition, MIME results in input-dependent dynamic neuronal pruning, thereby enabling energy-efficient inference with higher throughput on systolic-array hardware. Our experiments with benchmark datasets (child tasks), namely CIFAR10, CIFAR100, and Fashion-MNIST, show that MIME achieves ~ 3.48x memory-efficiency and ~ 2.4 – 3.1x energy-savings compared to conventional multi-task inference in Pipelined task mode.

Sniper: cloud-edge collaborative inference scheduling with neural network similarity modeling

  • Weihong Liu
  • Jiawei Geng
  • Zongwei Zhu
  • Jing Cao
  • Zirui Lian

Cloud-edge collaborative inference demands scheduling artificial intelligence (AI) tasks efficiently to the appropriate edge smart device. However, continuously iterated deep neural networks (DNNs) and heterogeneous devices pose great challenges for inference task scheduling. In this paper, we propose a self-update cloud-edge collaborative inference scheduling system (Sniper) with time awareness. First, considering that similar networks exhibit similar behaviors, we develop a non-invasive performance characterization network (PCN) based on neural network similarity (NNS) to accurately predict the inference time of DNNs. Moreover, PCN and time-based scheduling algorithms can be flexibly combined into the scheduling module of Sniper. Experimental results show that the average relative error of network inference time prediction is about 8.06%. Compared with the traditional method without time awareness, Sniper can reduce the waiting time by 52% on average while achieving a stable increase in throughput.

LPCA: learned MRC profiling based cache allocation for file storage systems

  • Yibin Gu
  • Yifan Li
  • Hua Wang
  • Li Liu
  • Ke Zhou
  • Wei Fang
  • Gang Hu
  • Jinhu Liu
  • Zhuo Cheng

A file storage system (FSS) uses multiple caches to accelerate data accesses. Unfortunately, efficient FSS cache allocation remains extremely difficult. First, existing constructions of the miss ratio curve (MRC), the key to cache allocation, are limited to LRU. Second, existing techniques are suitable for same-layer caches but not for hierarchical ones.

We present a Learned MRC Profiling based Cache Allocation (LPCA) scheme for FSS. To the best of our knowledge, LPCA is the first to apply machine learning to model MRC under non-LRU policies. LPCA also explores an optimization target for hierarchical caches, so that it can provide universal and efficient cache allocation for FSSs.
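
As background on what a miss ratio curve is, the sketch below builds an exact MRC for LRU using the classic stack-distance idea: under LRU, an access hits in a cache of size c exactly when at most c distinct items (including itself) have been touched since its previous access. This is the LRU-only construction the abstract refers to as a limitation, not LPCA's learned model; the function name and trace are illustrative assumptions.

```python
def lru_miss_ratio_curve(trace, max_size):
    """Miss ratio of an LRU cache for every size 1..max_size in one pass."""
    stack = []                      # most recently used item first
    hits_at = [0] * (max_size + 1)  # hits_at[d]: accesses with stack distance d
    for item in trace:
        if item in stack:
            distance = stack.index(item) + 1  # 1 = most recently used
            if distance <= max_size:
                hits_at[distance] += 1
            stack.remove(item)
        stack.insert(0, item)       # item becomes most recently used
    curve = []
    hits = 0
    for cache_size in range(1, max_size + 1):
        hits += hits_at[cache_size]  # a cache of this size hits distances <= size
        curve.append(1 - hits / len(trace))
    return curve


trace = ["a", "b", "a", "c", "b", "a"]
curve = lru_miss_ratio_curve(trace, max_size=3)
# curve[k] is the miss ratio of an LRU cache holding k + 1 items
```

The list-based stack keeps the sketch short but costs O(n) per access; practical profilers use tree structures or sampling. Given such a curve per cache, an allocator can compare how much each cache would gain from extra space.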

Equivalence checking paradigms in quantum circuit design: a case study

  • Tom Peham
  • Lukas Burgholzer
  • Robert Wille

As state-of-the-art quantum computers are capable of running increasingly complex algorithms, the need for automated methods to design and test potential applications rises. Equivalence checking of quantum circuits is an important, yet hardly automated, task in the development of the quantum software stack. Recently, new methods have been proposed that tackle this problem from widely different perspectives. However, there is no established baseline on which to judge current and future progress in equivalence checking of quantum circuits. In order to close this gap, we conduct a detailed case study of two of the most promising equivalence checking methodologies—one based on decision diagrams and one based on the ZX-calculus—and compare their strengths and weaknesses.

Accurate BDD-based unitary operator manipulation for scalable and robust quantum circuit verification

  • Chun-Yu Wei
  • Yuan-Hung Tsai
  • Chiao-Shan Jhang
  • Jie-Hong R. Jiang

Quantum circuit verification is essential, ensuring that quantum program compilation yields a sequence of primitive unitary operators executable correctly and reliably on a quantum processor. Most prior quantum circuit equivalence checking methods rely on edge-weighted decision diagrams and suffer from scalability and verification accuracy issues. This work overcomes these issues by extending a recent BDD-based algebraic representation of state vectors to support unitary operator manipulation. Experimental results demonstrate the superiority of the new method in scalability and exactness in contrast to the inexactness of prior approaches. Also, our method is much more robust in verifying dissimilar circuits than previous work.

Handling non-unitaries in quantum circuit equivalence checking

  • Lukas Burgholzer
  • Robert Wille

Quantum computers are reaching a level where interactions between classical and quantum computations can happen in real-time. This marks the advent of a new, broader class of quantum circuits: dynamic quantum circuits. They offer a broader range of available computing primitives that lead to new challenges for design tasks such as simulation, compilation, and verification. Due to the non-unitary nature of dynamic circuit primitives, most existing techniques and tools for these tasks are no longer applicable in an out-of-the-box fashion. In this work, we discuss the resulting consequences for quantum circuit verification, specifically equivalence checking, and propose two different schemes that eventually allow the involved circuits to be treated as if they did not contain non-unitaries at all. As a result, we demonstrate methodically as well as experimentally that existing techniques for verifying the equivalence of quantum circuits can be kept applicable for this broader class of circuits.

A bridge-based algorithm for simultaneous primal and dual defects compression on topologically quantum-error-corrected circuits

  • Wei-Hsiang Tseng
  • Yao-Wen Chang

Topological quantum error correction (TQEC) using the surface code is among the most promising techniques for fault-tolerant quantum circuits. The required resource of a TQEC circuit can be modeled as a space-time volume of a three-dimensional diagram by describing the defect movement along the time axis. For large-scale complex problems, it is crucial to minimize the space-time volume for a quantum algorithm with a reasonable physical qubit number and computation time. Previous work proposed an automated tool to perform bridge compression on a large-scale TQEC circuit. However, the existing automated bridge compression handles only dual defects and not primal defects. This paper presents an algorithm to perform bridge compression on primal and dual defects simultaneously. In addition, the automatic compression algorithm performs initialization/measurement simplification and flipping to improve the compression. Compared with the state-of-the-art work, experimental results show that our proposed algorithm reduces space-time volumes by 47% on average.

FaSe: fast selective flushing to mitigate contention-based cache timing attacks

  • Tuo Li
  • Sri Parameswaran

Caches are widely used to improve performance in modern processors. By carefully evicting cache lines and identifying cache hit/miss time, contention-based cache timing channel attacks can be orchestrated to leak information from the victim process. Existing hardware countermeasures, which explore cache partitioning and randomization, are either costly, not applicable to the L1 data cache, or vulnerable to sophisticated attacks. Countermeasures using cache flush exist but are slow since all cache lines have to be evacuated during a cache flush. In this paper, we propose for the first time a hardware/software flush-based countermeasure, called fast selective flushing (FaSe). By utilizing an ISA extension and cache modification, FaSe selectively flushes cache lines and provides a mitigation method with a similar effect to methods using naive flush. FaSe is implemented on RISC-V Rocket Chip and evaluated on a Xilinx FPGA running user programs and the Linux OS. Our experiments show that FaSe reduces time overhead by 36% for user programs and 42% for the OS compared to the methods with naive flushing, with less than 1% hardware overhead. Our security test shows FaSe can mitigate target cache timing attacks.

Conditional address propagation: an efficient defense mechanism against transient execution attacks

  • Peinan Li
  • Rui Hou
  • Lutan Zhao
  • Yifan Zhu
  • Dan Meng

Speculative execution is a critical technique in modern high performance processors. However, continuously exposed transient execution attacks, including Spectre and Meltdown, have disclosed a large attack surface in mispredicted execution. The current state-of-the-art defense strategy blocks all memory accesses that use addresses loaded speculatively. However, propagation of base addresses is common in general applications, and we find that more than 60% of blocked memory accesses use propagated base rather than offset addresses. Therefore, we propose a novel hardware defense mechanism, named Conditional Address Propagation, to identify safe base addresses through taint tracking and address checking by a History Table. The safe base addresses are then allowed to be propagated to recover performance, while the remaining unsafe addresses cannot be propagated, for security. We conducted experiments on the cycle-accurate Gem5 simulator. Compared to the representative study, STT, our mechanism effectively decreases the performance overhead from 13.27% to 1.92% targeting Spectre-type and 19.66% to 5.23% targeting all-type cache-based transient execution attacks.

Timed speculative attacks exploiting store-to-load forwarding bypassing cache-based countermeasures

  • Anirban Chakraborty
  • Nikhilesh Singh
  • Sarani Bhattacharya
  • Chester Rebeiro
  • Debdeep Mukhopadhyay

In this paper, we propose a novel class of speculative attacks, called Timed Speculative Attacks (TSA), that does not depend on the state changes in the cache memory. Instead, it makes use of the timing differences that occur due to store-to-load forwarding. We propose two attack strategies – Fill-and-Forward utilizing correctly speculated loads, and Fill-and-Misdirect using mis-speculated load instructions. While Fill-and-Forward exploits the shared store buffers in a multi-threaded CPU core, the Fill-and-Misdirect approach exploits the influence of rolled back mis-speculated loads on subsequent instructions. As case studies, we demonstrate a covert channel using Fill-and-Forward and key recovery attacks on OpenSSL AES and Romulus-N Authenticated Encryption with Associated Data scheme using Fill-and-Misdirect approach. Finally, we show that TSA is able to subvert popular cache-based countermeasures for transient attacks.

DARPT: defense against remote physical attack based on TDC in multi-tenant scenario

  • Fan Zhang
  • Zhiyong Wang
  • Haoting Shen
  • Bolin Yang
  • Qianmei Wu
  • Kui Ren

With rapidly increasing demands for cloud computing, Field Programmable Gate Arrays (FPGAs) have become popular in cloud datacenters. Although they improve computing performance through flexible hardware acceleration, new security concerns also come along. For example, unavoidable physical leakage from the Power Distribution Network (PDN) can be utilized by attackers to mount remote Side-Channel Attacks (SCA), such as Correlation Power Attacks (CPA). Remote Fault Attacks (FA) can also be successfully mounted by malicious tenants in a cloud multi-tenant scenario, posing a significant threat to legal tenants. There are few hardware-based countermeasures that defeat both of the aforementioned remote attacks. In this work, we exploit a Time-to-Digital Converter (TDC) and propose a novel defense technique called DARPT (Defense Against Remote Physical attack based on TDC) to protect sensitive information from CPA and FA. Specifically, DARPT produces random clock jitters to reduce possible information leakage through the power side-channel and provides an early warning of FA by constantly monitoring the variation of the voltage drop across the PDN. Whereas 8k traces are enough for a successful CPA on an FPGA without DARPT, our experimental results show that up to 800k traces (100 times as many) are not enough for the same FPGA protected by DARPT. Meanwhile, the TDC-based voltage monitor presents significant readout changes (by 51.82% or larger) under FA with ring oscillators, demonstrating sufficient sensitivity to voltage-drop-based FA.

GNNIE: GNN inference engine with load-balancing and graph-specific caching

  • Sudipta Mondal
  • Susmita Dey Manasi
  • Kishor Kunal
  • Ramprasath S
  • Sachin S. Sapatnekar

Graph neural network (GNN) inference involves weighting vertex feature vectors, followed by aggregating weighted vectors over a vertex neighborhood. High and variable sparsity in the input vertex feature vectors, and high sparsity and power-law degree distributions in the adjacency matrix, can lead to (a) unbalanced loads and (b) inefficient random memory accesses. GNNIE ensures load-balancing by splitting features into blocks, proposing a flexible MAC architecture, and employing load (re)distribution. GNNIE’s novel caching scheme bypasses the high costs of random DRAM accesses. GNNIE shows high speedups over CPUs/GPUs; it is faster and runs a broader range of GNNs than existing accelerators.

SALO: an efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences

  • Guan Shen
  • Jieru Zhao
  • Quan Chen
  • Jingwen Leng
  • Chao Li
  • Minyi Guo

The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention w.r.t. the sequence length incurs heavy computational and memory burdens, especially for tasks with long sequences. Existing accelerators face performance degradation in these tasks. To this end, we propose SALO to enable hybrid sparse attention mechanisms for long sequences. SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the attention computation efficiently. We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.

NN-LUT: neural approximation of non-linear operations for efficient transformer inference

  • Joonsang Yu
  • Junki Park
  • Seongmin Park
  • Minsoo Kim
  • Sihwa Lee
  • Dong Hyun Lee
  • Jungwook Choi

Non-linear operations such as GELU, Layer normalization, and Softmax are essential yet costly building blocks of Transformer models. Several prior works simplified these operations with look-up tables or integer computations, but such approximations suffer from inferior accuracy or considerable hardware cost with long latency. This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference. Our framework employs a simple neural network as a universal approximator, with its structure equivalently transformed into a look-up table (LUT). The proposed framework, called Neural network generated LUT (NN-LUT), can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and latency.
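To illustrate why a small network can be "equivalently transformed" into a LUT, the sketch below assumes a one-hidden-layer ReLU network with a scalar input. Such a network is piecewise linear, so it can be stored as breakpoints plus a (slope, intercept) pair per segment. The network shape, the example weights, and the names `relu_net`, `net_to_lut`, and `lut_eval` are assumptions for illustration, not the paper's exact construction.

```python
import bisect


def relu_net(x, hidden, bias_out):
    """y = bias_out + sum(w_out * relu(w_in * x + b_in)) over hidden neurons."""
    return bias_out + sum(
        w_out * max(0.0, w_in * x + b_in) for w_in, b_in, w_out in hidden
    )


def net_to_lut(hidden, bias_out):
    """Convert the network into sorted breakpoints and per-segment lines."""
    # Each neuron switches on or off where w_in * x + b_in crosses zero.
    breakpoints = sorted({-b_in / w_in for w_in, b_in, _ in hidden if w_in != 0})
    # One probe point strictly inside every segment (assumes >= 1 breakpoint).
    probes = (
        [breakpoints[0] - 1.0]
        + [(lo + hi) / 2 for lo, hi in zip(breakpoints, breakpoints[1:])]
        + [breakpoints[-1] + 1.0]
    )
    table = []
    for p in probes:
        active = [(w_in, b_in, w_out) for w_in, b_in, w_out in hidden
                  if w_in * p + b_in > 0]
        slope = sum(w_out * w_in for w_in, _, w_out in active)
        intercept = bias_out + sum(w_out * b_in for _, b_in, w_out in active)
        table.append((slope, intercept))
    return breakpoints, table


def lut_eval(x, breakpoints, table):
    """Evaluate with one comparison-based lookup, one multiply, and one add."""
    slope, intercept = table[bisect.bisect_right(breakpoints, x)]
    return slope * x + intercept


# Hypothetical weights: (w_in, b_in, w_out) per hidden neuron.
hidden = [(1.0, 0.0, 1.0), (1.0, -1.0, -0.5), (-2.0, 1.0, 0.25)]
breakpoints, table = net_to_lut(hidden, 0.1)
for x in (-2.0, 0.25, 0.75, 3.0):
    print(x, relu_net(x, hidden, 0.1), lut_eval(x, breakpoints, table))
```

With N hidden neurons the table has at most N + 1 entries, so the run-time cost is independent of how the network was trained.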

Self adaptive reconfigurable arrays (SARA): learning flexible GEMM accelerator configuration and mapping-space using ML

  • Ananda Samajdar
  • Eric Qin
  • Michael Pellauer
  • Tushar Krishna

This work demonstrates a scalable reconfigurable accelerator (RA) architecture designed to extract maximum performance and energy efficiency for GEMM workloads. We also present a self-adaptive (SA) unit, which runs a learnt model for one-shot configuration optimization in hardware, offloading the software stack and thus easing the deployment of the proposed design. We evaluate an instance of the proposed methodology with a 32.768 TOPS reference implementation called SAGAR, which can provide the same mapping flexibility as a compute-equivalent distributed system while achieving 3.5X more power efficiency and 3.2X higher compute density, as demonstrated via architectural and post-layout simulation.

Enabling hard constraints in differentiable neural network and accelerator co-exploration

  • Deokki Hong
  • Kanghyun Choi
  • Hye Yoon Lee
  • Joonsang Yu
  • Noseong Park
  • Youngsok Kim
  • Jinho Lee

Co-exploration of an optimal neural architecture and its hardware accelerator is an approach of rising interest which addresses the computational cost problem, especially in low-profile systems. The large co-exploration space is often handled by adopting the idea of differentiable neural architecture search. However, despite the superior search efficiency of the differentiable co-exploration, it faces a critical challenge of not being able to systematically satisfy hard constraints such as frame rate. To handle the hard constraint problem of differentiable co-exploration, we propose HDX, which searches for hard-constrained solutions without compromising the global design objectives. By manipulating the gradients in the interest of the given hard constraint, high-quality solutions satisfying the constraint can be obtained.

Heuristic adaptability to input dynamics for SpMM on GPUs

  • Guohao Dai
  • Guyue Huang
  • Shang Yang
  • Zhongming Yu
  • Hengrui Zhang
  • Yufei Ding
  • Yuan Xie
  • Huazhong Yang
  • Yu Wang

Sparse Matrix-Matrix Multiplication (SpMM) has served as a fundamental component in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In this paper, we consider the challenge of input dynamics from a novel auto-tuning perspective, while the following issues remain to be solved: (1) Orthogonal design principles considering sparsity. Orthogonal design principles for such a sparse problem should be extracted to form different algorithms, and further used for performance tuning. (2) Nontrivial implementations in the algorithm space. Combining orthogonal design principles to create new algorithms requires tackling new challenges like thread race handling. (3) Heuristic adaptability to input dynamics. Heuristic adaptability is required to dynamically optimize code for input dynamics.

To tackle these challenges, we first propose a novel three-loop model to extract orthogonal design principles for SpMM on GPUs. The model not only covers previous SpMM designs, but also yields new designs absent from previous studies. We propose techniques like conditional reduction to implement algorithms missing in previous studies. We further propose DA-SpMM, a Data-Aware heuristic GPU kernel for SpMM. DA-SpMM adaptively optimizes code considering input dynamics. Extensive experimental results show that DA-SpMM achieves 1.26X~1.37X speedup compared with the best NVIDIA cuSPARSE algorithm on average, and brings up to 5.59X end-to-end speedup to Graph Neural Networks.
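For readers unfamiliar with SpMM, the following is a minimal sequential reference, assuming the sparse operand is stored in CSR form. It is not DA-SpMM and not a GPU kernel; the function name `spmm_csr` and the example matrices are illustrative assumptions. It does expose three nested loops (over rows, over a row's non-zeros, and over dense columns), which is one plausible reading of the kind of loop structure a three-loop model organizes.

```python
def spmm_csr(indptr, indices, data, dense):
    """Compute C = A @ B where A is sparse (CSR) and B is a dense row-major matrix.

    indptr  : row pointer array, length = number of rows of A + 1
    indices : column index of each stored non-zero of A
    data    : value of each stored non-zero of A
    dense   : B as a list of rows
    """
    n_cols = len(dense[0])
    out = []
    for row in range(len(indptr) - 1):                   # loop 1: rows of A
        acc = [0.0] * n_cols
        for p in range(indptr[row], indptr[row + 1]):    # loop 2: non-zeros in row
            k, a = indices[p], data[p]
            for j in range(n_cols):                      # loop 3: columns of B
                acc[j] += a * dense[k][j]
        out.append(acc)
    return out


# A = [[1, 0, 2], [0, 0, 0], [0, 3, 0]] in CSR form
indptr, indices, data = [0, 2, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(spmm_csr(indptr, indices, data, B))
```

The trip count of the middle loop depends on the number of non-zeros per row, so the best way to parallelize the loops varies with the input matrix, which is the input-dynamics problem the abstract describes.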

H2H: heterogeneous model to heterogeneous system mapping with computation and communication awareness

  • Xinyi Zhang
  • Cong Hao
  • Peipei Zhou
  • Alex Jones
  • Jingtong Hu

The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. The heterogeneity in ML models comes from multi-sensor perceiving and multi-task learning, i.e., multi-modality multi-task (MMMT), resulting in diverse deep neural network (DNN) layers and computation patterns. The heterogeneity in systems comes from diverse processing components, as it becomes the prevailing method to integrate multiple dedicated accelerators into one system. Therefore, a new problem emerges: heterogeneous model to heterogeneous system mapping (H2H). While previous mapping algorithms mostly focus on efficient computations, in this work, we argue that it is indispensable to consider computation and communication simultaneously for better system efficiency. We propose a novel H2H mapping algorithm with both computation and communication awareness; by slightly trading computation for communication, the overall system latency and energy consumption can be largely reduced. The superior performance of our work is evaluated based on MAESTRO modeling, demonstrating 15%-74% latency reduction and 23%-64% energy reduction compared with existing computation-prioritized mapping algorithms. Code is publicly available at

PARIS and ELSA: an elastic scheduling algorithm for reconfigurable multi-GPU inference servers

  • Yunseong Kim
  • Yujeong Choi
  • Minsoo Rhu

Providing low latency to end-users while maximizing server utilization and system throughput is crucial for cloud ML servers. NVIDIA’s recently announced Ampere GPU architecture provides features to “reconfigure” one large, monolithic GPU into multiple smaller “GPU partitions”. Such a feature gives cloud ML service providers the ability to utilize the reconfigurable GPU not only for large-batch training but also for small-batch inference, with the potential to achieve high resource utilization. We study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server, presenting a sophisticated partitioning algorithm for reconfigurable GPUs combined with an elastic scheduling algorithm tailored for our heterogeneously partitioned GPU server.

Pursuing more effective graph spectral sparsifiers via approximate trace reduction

  • Zhiqiang Liu
  • Wenjian Yu

Spectral graph sparsification aims to find ultra-sparse subgraphs which can preserve the spectral properties of the original graphs. In this paper, a new spectral criticality metric based on trace reduction is first introduced for identifying spectrally important off-subgraph edges. Then, a physics-inspired truncation strategy and an approach using the approximate inverse of the Cholesky factor are proposed to compute the approximate trace reduction efficiently. Combining them with the iterative densification scheme in [8] and the strategy of excluding spectrally similar off-subgraph edges in [13], we develop a highly effective graph sparsification algorithm. The proposed method has been validated with various kinds of graphs. Experimental results show that it always produces sparsifiers with remarkably better quality than the state-of-the-art GRASS [8] at the same computational cost, enabling more than 40% time reduction for the preconditioned iterative equation solver on average. In the applications of power grid transient analysis and spectral graph partitioning, the derived iterative solver shows 3.3X or greater advantages in runtime and memory cost over the approach based on a direct sparse solver.

Accelerating nonlinear DC circuit simulation with reinforcement learning

  • Zhou Jin
  • Haojie Pei
  • Yichao Dong
  • Xiang Jin
  • Xiao Wu
  • Wei W. Xing
  • Dan Niu

DC analysis is the foundation for nonlinear electronic circuit simulation. Pseudo transient analysis (PTA) methods have achieved great success among various continuation algorithms. However, PTA tends to be computationally intensive without careful tuning of parameters and proper stepping strategies. In this paper, we harness the latest advances in machine learning to resolve these challenges simultaneously. In particular, active learning is leveraged to provide a fine initial solver environment, in which TD3-based Reinforcement Learning (RL) is implemented to accelerate the simulation on the fly. The RL agent is strengthened with dual agents, priority sampling, and cooperative learning to enhance its robustness and convergence. The proposed algorithms are implemented in an out-of-the-box SPICE-like simulator, which demonstrated a significant speedup: up to 3.1X for the initial stage and 234X for the RL stage.

An efficient yield optimization method for analog circuits via gaussian process classification and varying-sigma sampling

  • Xiaodong Wang
  • Changhao Yan
  • Fan Yang
  • Dian Zhou
  • Xuan Zeng

This paper presents an efficient yield optimization method for analog circuits via Gaussian process classification and varying-sigma sampling. To quickly determine the better design, yield estimations are executed at varying sigma of process variations. Instead of regression methods requiring accurate yield values, a Gaussian process classification method is applied to model the preference information of designs from binary comparison results, and the preferential Bayesian optimization framework is implemented to guide the search. Additionally, a multi-fidelity surrogate model is adopted to learn the yield correlation at different sigmas. Compared with the state-of-the-art methods, the proposed method achieves up to 12× speed-up without loss of accuracy.

Partition and place finite element model on wafer-scale engine

  • Jinwei Liu
  • Xiaopeng Zhang
  • Shiju Lin
  • Xinshi Zang
  • Jingsong Chen
  • Bentian Jiang
  • Martin D. F. Wong
  • Evangeline F. Y. Young

The finite element method (FEM) is a well-known technique for approximately solving partial differential equations and it finds application in various engineering disciplines. The recently introduced wafer-scale engine (WSE) has shown the potential to accelerate FEM by up to 10,000×. However, accelerating FEM to the full potential of a WSE is non-trivial. Thus, in this work, we propose a partitioning algorithm to partition a 3D finite element model into tiles. The tiles can be thought of as a special netlist and are placed onto the 2D array of a WSE by our placement algorithm. Compared to the best-known approach, our partitioning has around 5% higher accuracy, and our placement algorithm can produce around 11% shorter wirelength (L1.5-normalized) on average.

CNN-inspired analytical global placement for large-scale heterogeneous FPGAs

  • Huimin Wang
  • Xingyu Tong
  • Chenyue Ma
  • Runming Shi
  • Jianli Chen
  • Kun Wang
  • Jun Yu
  • Yao-Wen Chang

The fast-growing capacity and complexity of FPGAs are challenging for global placement. Besides, while many recent studies have focused on eDensity-based placement for its great efficiency and quality, they suffer from redundant frequency translation. This paper presents a CNN-inspired analytical placement algorithm to effectively handle the redundant frequency translation problem for large-scale FPGAs. Specifically, we compute the density penalty by a fully-connected propagation and its gradient by a discrete differential convolution backward. With the FPGA heterogeneity, vectorization plays a vital role in self-adjusting the density penalty factor and the learning rate. In addition, a pseudo net model is used to further optimize the site constraints by establishing connections between blocks and their nearest available regions. Finally, we formulate a refined objective function and a degree-specific gradient preconditioning to achieve a robust, high-quality solution. Experimental results show that our algorithm achieves an 8% reduction in HPWL and 15% less global placement runtime on average over leading commercial tools.

High-performance placement for large-scale heterogeneous FPGAs with clock constraints

  • Ziran Zhu
  • Yangjie Mei
  • Zijun Li
  • Jingwen Lin
  • Jianli Chen
  • Jun Yang
  • Yao-Wen Chang

With the increasing complexity of the field-programmable gate array (FPGA) architecture, heterogeneity and clock constraints have greatly challenged FPGA placement. In this paper, we present a high-performance placement algorithm for large-scale heterogeneous FPGAs with clock constraints. We first propose a connectivity-aware and type-balanced clustering method to construct the hierarchy and improve the scalability. In each hierarchy level, we develop a novel hybrid penalty and augmented Lagrangian method to formulate the heterogeneous and clock-aware placement as a sequence of unconstrained optimization subproblems and adopt the Adam method to solve each unconstrained optimization subproblem. Then, we present a matching-based IP blocks legalization to legalize the RAMs and DSPs, and a multi-stage packing technique is proposed to cluster FFs and LUTs into HCLBs. Finally, history-based legalization is developed to legalize CLBs in an FPGA. Based on the ISPD 2017 clock-aware FPGA placement contest benchmarks, experimental results show that our algorithm achieves the smallest routed wirelength for all the benchmarks among all published works in a reasonable runtime.

Multi-electrostatic FPGA placement considering SLICEL-SLICEM heterogeneity and clock feasibility

  • Jing Mai
  • Yibai Meng
  • Zhixiong Di
  • Yibo Lin

Modern field-programmable gate arrays (FPGAs) contain heterogeneous resources, including CLB, DSP, BRAM, IO, etc. A Configurable Logic Block (CLB) slice is further categorized as SLICEL or SLICEM, which can be configured as specific combinations of instances in {LUT, FF, distributed RAM, SHIFT, CARRY}. This kind of heterogeneity challenges the existing FPGA placement algorithms. Meanwhile, limited clock routing resources also lead to complicated clock constraints, causing difficulties in achieving clock-feasible placement solutions. In this work, we propose a heterogeneous FPGA placement framework considering SLICEL-SLICEM heterogeneity and clock feasibility based on a multi-electrostatic formulation. We support a comprehensive set of the aforementioned instance types with a uniform algorithm for wirelength, routability, and clock optimization. Experimental results on both academic and industrial benchmarks demonstrate that we outperform the state-of-the-art placers in both quality and efficiency.

QOC: quantum on-chip training with parameter shift and gradient pruning

  • Hanrui Wang
  • Zirui Li
  • Jiaqi Gu
  • Yongshan Ding
  • David Z. Pan
  • Song Han

Parameterized Quantum Circuits (PQC) are drawing increasing research interest thanks to their potential to achieve quantum advantages on near-term Noisy Intermediate Scale Quantum (NISQ) hardware. In order to achieve scalable PQC learning, the training process needs to be offloaded to real quantum machines instead of using exponential-cost classical simulators. One common approach to obtain PQC gradients is parameter shift, whose cost scales linearly with the number of qubits. We present QOC, the first experimental demonstration of practical on-chip PQC training with parameter shift. Nevertheless, we find that due to the significant quantum errors (noises) on real machines, gradients obtained from naïve parameter shift have low fidelity and thus degrade the training accuracy. To this end, we further propose probabilistic gradient pruning to first identify gradients with potentially large errors and then remove them. Specifically, small gradients have larger relative errors than large ones, and thus have a higher probability of being pruned. We perform extensive experiments with the Quantum Neural Network (QNN) benchmarks on 5 classification tasks using 5 real quantum machines. The results demonstrate that our on-chip training achieves over 90% and 60% accuracy for 2-class and 4-class image classification tasks, respectively. The probabilistic gradient pruning brings up to 7% PQC accuracy improvements over no pruning. Overall, we successfully obtain on-chip training accuracy similar to that of noise-free simulation, but with much better training scalability. The QOC code is available in the TorchQuantum library.
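As a small worked instance of the parameter-shift idea, the sketch below simulates a one-qubit circuit classically, assuming a single RY(θ) rotation applied to |0⟩ followed by a Z measurement. For this gate the expectation is cos θ, and two shifted evaluations at θ ± π/2 give the exact gradient. The circuit choice and the function names are assumptions for illustration; on real hardware each evaluation would be a noisy estimate from repeated shots, which is the source of the gradient errors the abstract discusses.

```python
import math


def expectation_z(theta):
    """<Z> after RY(theta) on |0>, computed from the state amplitudes."""
    amp0 = math.cos(theta / 2)  # amplitude of |0>
    amp1 = math.sin(theta / 2)  # amplitude of |1>
    return amp0 ** 2 - amp1 ** 2  # equals cos(theta)


def parameter_shift_grad(f, theta, shift=math.pi / 2):
    """Gradient of f at theta from two circuit evaluations at shifted angles."""
    return (f(theta + shift) - f(theta - shift)) / 2


theta = 0.7
print(parameter_shift_grad(expectation_z, theta))  # parameter-shift estimate
print(-math.sin(theta))                            # analytic derivative of cos
```

Unlike a finite-difference estimate, the shift is large (π/2), so the two evaluations are well separated rather than nearly equal.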

Memory-efficient training of binarized neural networks on the edge

  • Mikail Yayla
  • Jian-Jia Chen

A visionary computing paradigm is to train resource-efficient neural networks on the edge using dedicated low-power accelerators instead of cloud infrastructures, eliminating communication overheads and privacy concerns. One promising resource-efficient approach for inference is binarized neural networks (BNNs), which binarize parameters and activations. However, training BNNs remains resource demanding. State-of-the-art BNN training methods, such as the binary optimizer (Bop), require storing and updating a large number of momentum values in the floating point (FP) format.

In this work, we focus on memory-efficient FP encodings for the momentum values in Bop. To achieve this, we first investigate the impact of arbitrary FP encodings. When the FP format is not properly chosen, we prove that the updates of the momentum values can be lost and the quality of training is therefore degraded. With these insights, we formulate a metric to determine the number of unchanged momentum values in a training iteration due to the FP encoding. Based on the metric, we develop an algorithm to find FP encodings that are more memory-efficient than the standard FP encodings. In our experiments, the memory usage in BNN training is decreased by factors of 2.47x, 2.43x, and 2.04x, depending on the BNN model, with minimal accuracy cost (smaller than 1%) compared to using 32-bit FP encoding.
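The lost-update effect can be reproduced with standard formats. The sketch below assumes the exponential-moving-average momentum update commonly given for Bop, m ← (1 − γ)·m + γ·g, and rounds the result to IEEE half or single precision through Python's `struct` module. The values of m, g, and γ are hypothetical, and the standard formats stand in for the custom encodings the paper searches for.

```python
import struct


def round_to(value, fmt):
    """Round a Python float to the given struct format ('<e' half, '<f' single)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def momentum_update(m, g, gamma, fmt):
    """One moving-average momentum step, stored back in a reduced FP format."""
    return round_to((1 - gamma) * m + gamma * g, fmt)


m, g, gamma = 1.0, -1.0, 1e-4  # hypothetical momentum, gradient, adaptivity rate

half = momentum_update(m, g, gamma, "<e")
single = momentum_update(m, g, gamma, "<f")

print("half precision  :", half, "(update lost)" if half == m else "")
print("single precision:", single)
```

In half precision the change of about 2e-4 is smaller than half the spacing between representable values just below 1.0, so the stored momentum does not move, whereas single precision retains the update.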

DeepGate: learning neural representations of logic gates

  • Min Li
  • Sadaf Khan
  • Zhengyuan Shi
  • Naixing Wang
  • Huang Yu
  • Qiang Xu

Applying deep learning (DL) techniques in the electronic design automation (EDA) field has become a trending topic. Most solutions apply well-developed DL models to solve specific EDA problems. While demonstrating promising results, they require careful model tuning for every problem. The fundamental question of “How to obtain a general and effective neural representation of circuits?” has not been answered yet. In this work, we take the first step towards solving this problem. We propose DeepGate, a novel representation learning solution that effectively embeds both the logic function and the structural information of a circuit as vectors on each gate. Specifically, we propose transforming circuits into a unified and-inverter graph format for learning and using signal probabilities as the supervision task in DeepGate. We then introduce a novel graph neural network that uses strong inductive biases in practical circuits as learning priors for signal probability prediction. Our experimental results show the efficacy and generalization capability of DeepGate.
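The supervision signal mentioned above, the signal probability of each gate, can be computed exactly for tiny circuits by exhaustive simulation. The sketch below assumes a simple and-inverter graph encoding (two-input AND gates with optional inversion on each fanin, listed in topological order) and uniformly random primary inputs. The encoding and the function name `signal_probabilities` are illustrative assumptions, not DeepGate's data format.

```python
from itertools import product


def signal_probabilities(num_inputs, and_gates):
    """Probability that each node is logic 1 under uniformly random inputs.

    Nodes 0..num_inputs-1 are primary inputs. Each gate is
    ((src_a, inv_a), (src_b, inv_b)) and receives the next node id,
    so gates must be listed in topological order.
    """
    total = num_inputs + len(and_gates)
    ones = [0] * total
    for bits in product([0, 1], repeat=num_inputs):
        values = list(bits)
        for (a, inv_a), (b, inv_b) in and_gates:
            # XOR with the inversion flag models an inverter on the edge.
            values.append((values[a] ^ inv_a) & (values[b] ^ inv_b))
        for node, value in enumerate(values):
            ones[node] += value
    return [count / 2 ** num_inputs for count in ones]


gates = [
    ((0, False), (1, False)),  # node 2 = a AND b
    ((0, True), (1, True)),    # node 3 = (NOT a) AND (NOT b), i.e. NOR
]
print(signal_probabilities(2, gates))
```

Exhaustive simulation costs 2^n evaluations for n inputs, so it only serves to define the target quantity; it does not scale to practical circuits.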

Bipolar vector classifier for fault-tolerant deep neural networks

  • Suyong Lee
  • Insu Choi
  • Joon-Sung Yang

Deep Neural Networks (DNNs) surpass human-level performance on specific tasks. This capability accelerates the adoption of DNNs in safety-critical applications such as autonomous vehicles and medical diagnosis. The millions of parameters in a DNN require a high memory capacity. Process technology scaling allows increasing memory density; however, the memory then confronts significant reliability issues that cause errors in the memory. This can make the weights stored in memory erroneous. Studies show that erroneous weights can cause a significant accuracy loss. This motivates research on fault-tolerant DNN architectures. Despite these efforts, DNNs are still vulnerable to errors, especially errors in the DNN classifier. In the worst case, because the classifier in a convolutional neural network (CNN) is the last stage determining an input class, a single error in the classifier can cause a significant accuracy drop. To enhance fault tolerance in CNNs, this paper proposes a novel bipolar vector classifier which can be easily integrated with any CNN structure and can be combined with other fault tolerance approaches. Experimental results show that the proposed method stably maintains accuracy at a high bit error rate of up to 10^-3 in the classifier.

HDLock: exploiting privileged encoding to protect hyperdimensional computing models against IP stealing

  • Shijin Duan
  • Shaolei Ren
  • Xiaolin Xu

Hyperdimensional Computing (HDC) is facing infringement issues due to its straightforward computations. This work, for the first time, raises a critical vulnerability of HDC: an attacker can reverse engineer the entire model, requiring only the unindexed hypervector memory. To mitigate this attack, we propose a defense strategy, namely HDLock, which significantly increases the reasoning cost of encoding. Specifically, HDLock adds extra feature hypervector combination and permutation in the encoding module. Compared to the standard HDC model, a two-layer-key HDLock can increase the adversarial reasoning complexity by 10 orders of magnitude without inference accuracy loss, with only 21% latency overhead.

Terminator on SkyNet: a practical DVFS attack on DNN hardware IP for UAV object detection

  • Junge Xu
  • Bohan Xuan
  • Anlin Liu
  • Mo Sun
  • Fan Zhang
  • Zeke Wang
  • Kui Ren

With the increasing computational demands of various applications, dynamic voltage and frequency scaling (DVFS) is gradually being deployed on FPGAs. However, its reliability and security have not been sufficiently evaluated. In this paper, we present a practical DVFS fault attack targeting the SkyNet accelerator IP and successfully destroy the detection accuracy. With no knowledge of the internal accelerator structure, our attack can achieve more than 98% detection accuracy loss under ten vulnerable operating point pairs (OPPs). Meanwhile, we explore local injection with a 1 ms duration and then double the intensity, achieving more than 50% and 74% average accuracy loss, respectively.

AL-PA: cross-device profiled side-channel attack using adversarial learning

  • Pei Cao
  • Hongyi Zhang
  • Dawu Gu
  • Yan Lu
  • Yidong Yuan

In this paper, we focus on the portability issue in profiled side-channel attacks (SCAs) that arises due to significant device-to-device variations. Device discrepancy is inevitable in realistic attacks, but it is often neglected in research works. In this paper, we identify such device variations and take a further step towards leveraging the transferability of neural networks. We propose a novel adversarial learning-based profiled attack (AL-PA), which enables our neural network to learn device-invariant features. We evaluated our strategy on eight XMEGA microcontrollers. Without the need for target-specific preprocessing and multiple profiling devices, our approach has outperformed the state-of-the-art methods.

DETERRENT: detecting trojans using reinforcement learning

  • Vasudev Gohil
  • Satwik Patnaik
  • Hao Guo
  • Dileep Kalathil
  • Jeyavijayan (JV) Rajendran

Insertion of hardware Trojans (HTs) in integrated circuits is a pernicious threat. Since HTs are activated under rare trigger conditions, detecting them using random logic simulations is infeasible. In this work, we design a reinforcement learning (RL) agent that circumvents the exponential search space and returns a minimal set of patterns that is most likely to detect HTs. Experimental results on a variety of benchmarks demonstrate the efficacy and scalability of our RL agent, which obtains a significant reduction (169×) in the number of test patterns required while maintaining or improving coverage (95.75%) compared to the state-of-the-art techniques.

Exploiting data locality in memory for ORAM to reduce memory access overheads

  • Jinxi Kuang
  • Minghua Shen
  • Yutong Lu
  • Nong Xiao

This paper proposes a locality-aware Oblivious RAM (ORAM) primitive, named Green ORAM, which exploits the spatial locality of data in physical memory to reduce ORAM overheads. Green ORAM consists of three novel policies. The first is row-guided label allocation, which maps spatial locality onto the ORAM tree to reduce the number of memory commands. The second is segment-based path replacement, which improves data locality within a path in the ORAM tree to remove redundant memory accesses. The third is multi-path write-back, which improves data locality between different paths to obtain the theoretically best stash hit rate. Notably, our analysis shows that Green ORAM still maintains security. Experimental results show that Green ORAM achieves a 28.72% access latency reduction and a 19.06% memory energy consumption reduction on average, compared with the state-of-the-art String ORAM.

HWST128: complete memory safety accelerator on RISC-V with metadata compression

  • Hsu-Kang Dow
  • Tuo Li
  • Sri Parameswaran

Memory safety is paramount for secure systems. Pointer-based memory safety relies on additional information (metadata) to check validity when a pointer is dereferenced. Such operations on the metadata introduce significant performance overhead to the system. This paper presents HWST128, a system to reduce performance overhead by using hardware/software co-design. As a result, the system described achieves spatial and temporal safety by utilizing microarchitecture support, pointer analysis from the compiler, and metadata compression. HWST128 is the first complete solution for memory safety (spatial and temporal) on RISC-V. The system is implemented and tested on a Xilinx ZCU102 FPGA board with 1536 LUTs (+4.11%) and 112 FFs (+0.66%) on top of a Rocket Chip processor. HWST128 is 3.74× faster than the equivalent software-based safety system in the SPEC2006 benchmark suite while providing similar or better security coverage for the Juliet test suite.

RegVault: hardware assisted selective data randomization for operating system kernels

  • Jinyan Xu
  • Haoran Lin
  • Ziqi Yuan
  • Wenbo Shen
  • Yajin Zhou
  • Rui Chang
  • Lei Wu
  • Kui Ren

This paper presents RegVault, a hardware-assisted lightweight data randomization scheme for OS kernels. RegVault introduces novel cryptographically strong hardware primitives to protect both the confidentiality and integrity of register-grained data. RegVault leverages annotations to mark sensitive data and instruments their loads and stores automatically. Moreover, RegVault introduces new techniques to protect the interrupt context and safeguard the spilling of sensitive data. We implement a prototype of RegVault by extending the RISC-V architecture to protect six types of sensitive data in the Linux kernel. Our evaluations show that RegVault can defend against kernel data attacks effectively with minimal performance overhead.

ASAP: reconciling asynchronous real-time operations and proofs of execution in simple embedded systems

  • Adam Caulfield
  • Norrathep Rattanavipanon
  • Ivan De Oliveira Nunes

Embedded devices are increasingly ubiquitous and their importance is hard to overestimate. While they often support safety-critical functions (e.g., in medical devices and sensor-alarm combinations), they are usually implemented under strict cost/energy budgets, using low-end microcontroller units (MCUs) that lack sophisticated security mechanisms. Motivated by this issue, recent work developed architectures capable of generating Proofs of Execution (PoX) for the correct/expected software in potentially compromised low-end MCUs. In practice, this capability can be leveraged to provide “integrity from birth” to sensor data, by binding the sensed results/outputs to an unforgeable cryptographic proof of execution of the expected sensing process. Despite this significant progress, current PoX schemes for low-end MCUs ignore the real-time needs of many applications. In particular, security of current PoX schemes precludes any interrupts during the execution being proved. We argue that lack of asynchronous capabilities (i.e., interrupts within PoX) can obscure PoX usefulness, as several applications require processing real-time and asynchronous events. To bridge this gap, we propose, implement, and evaluate an Architecture for Secure Asynchronous Processing in PoX (ASAP). ASAP is secure under full software compromise, enables asynchronous PoX, and incurs less hardware overhead than prior work.

Towards a formally verified hardware root-of-trust for data-oblivious computing

  • Lucas Deutschmann
  • Johannes Müller
  • Mohammad R. Fadiheh
  • Dominik Stoffel
  • Wolfgang Kunz

The importance of preventing microarchitectural timing side channels in security-critical applications has surged immensely over the last several years. Constant-time programming has emerged as a best-practice technique to prevent leaking out secret information through timing. It builds on the assumption that certain basic machine instructions execute timing-independently w.r.t. their input data. However, whether an instruction fulfills this data-independent timing criterion varies strongly from architecture to architecture.

In this paper, we propose a novel methodology to formally verify data-oblivious behavior in hardware using standard property checking techniques. Each successfully verified instruction represents a trusted hardware primitive for developing data-oblivious algorithms. A counterexample, on the other hand, represents a restriction that must be communicated to the software developer. We evaluate the proposed methodology in multiple case studies, ranging from small arithmetic units to medium-sized processors. One case study uncovered a data-dependent timing violation in the extensively verified and highly secure Ibex RISC-V core.

A scalable SIMD RISC-V based processor with customized vector extensions for CRYSTALS-kyber

  • Huimin Li
  • Nele Mentens
  • Stjepan Picek

This paper uses RISC-V vector extensions to speed up lattice-based operations in architectures based on HW/SW co-design. We analyze the structure of the number-theoretic transform (NTT), inverse NTT (INTT), and coefficient-wise multiplication (CWM) in CRYSTALS-Kyber, a lattice-based key encapsulation mechanism. We propose 12 vector extensions for CRYSTALS-Kyber multiplication and four for finite field operations in combination with two optimizations of the HW/SW interface. This results in a speed-up of 141.7, 168.7, and 245.5 times for NTT, INTT, and CWM, respectively, compared with the baseline implementation, and a speed-up of over four times compared with the state-of-the-art HW/SW co-design using RV32IMC.
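To make the NTT, INTT, and coefficient-wise multiplication steps concrete, the sketch below implements a naive O(n²) cyclic number-theoretic transform over Kyber's modulus q = 3329, using a power of 17 (a primitive 256th root of unity modulo q) as the root. It is a deliberate simplification: the length is shortened to 8, the transform is a plain cyclic one rather than the exact transform layout Kyber specifies, and no vectorization is shown. All function names are illustrative assumptions.

```python
Q = 3329                      # Kyber modulus
N = 8                         # shortened length for illustration (Kyber uses 256)
OMEGA = pow(17, 256 // N, Q)  # primitive N-th root of unity derived from 17


def ntt(a, omega=OMEGA):
    """Naive forward transform: A[k] = sum_j a[j] * omega^(j*k) mod Q."""
    return [sum(a[j] * pow(omega, j * k, Q) for j in range(N)) % Q
            for k in range(N)]


def intt(a_hat):
    """Inverse transform: same sum with omega^-1, scaled by N^-1 mod Q."""
    n_inv = pow(N, -1, Q)  # modular inverse, requires Python 3.8+
    return [(n_inv * x) % Q for x in ntt(a_hat, pow(OMEGA, -1, Q))]


def cyclic_mul(a, b):
    """Polynomial product modulo x^N - 1 via coefficient-wise multiplication."""
    return intt([(x * y) % Q for x, y in zip(ntt(a), ntt(b))])


a = [1, 1, 0, 0, 0, 0, 0, 0]  # 1 + x
b = [0, 1, 0, 0, 0, 0, 0, 0]  # x
print(cyclic_mul(a, b))       # coefficients of x + x^2
```

The coefficient-wise step in the transformed domain consists of independent modular multiplications, which is the kind of data-parallel work that vector instructions target.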

Hexagons are the bestagons: design automation for silicon dangling bond logic

  • Marcel Walter
  • Samuel Sze Hang Ng
  • Konrad Walus
  • Robert Wille

Field-coupled Nanocomputing (FCN) defines a class of post-CMOS nanotechnologies that promises compact layouts, low power operation, and high clock rates. Recent breakthroughs in the fabrication of Silicon Dangling Bonds (SiDBs) acting as quantum dots enabled the demonstration of a sub-30 nm² OR gate and wire segments. This motivated the research community to invest manual labor in the design of additional gates and whole circuits which, however, is currently severely limited by scalability issues. In this work, these limitations are overcome by the introduction of a design automation framework that establishes a flexible topology based on hexagons as well as a corresponding Bestagon gate library for this technology and, additionally, provides automatic methods for physical design. By this, the first design automation solution for the promising SiDB platform is proposed. In an effort to support open research and open data, the resulting framework and all design files will be made available.

Improving compute in-memory ECC reliability with successive correction

  • Brian Crafton
  • Zishen Wan
  • Samuel Spetalnick
  • Jong-Hyeok Yoon
  • Wei Wu
  • Carlos Tokunaga
  • Vivek De
  • Arijit Raychowdhury

Compute in-memory (CIM) is an exciting technique that minimizes data transport, maximizes memory throughput, and performs computation on the bitline of memory sub-arrays. This is especially interesting for machine learning applications, where increased memory bandwidth and analog domain computation offer improved area and energy efficiency. Unfortunately, CIM faces new challenges traditional CMOS architectures have avoided. In this work, we explore the impact of device variation (calibrated with measured data on foundry RRAM arrays) and propose a new class of error correcting codes (ECC) for hard and soft errors in CIM. We demonstrate single, double, and triple error correction offering over 16,000× reduction in bit error rate over a design without ECC and over 427× over prior work, while consuming only 29.1% area and 26.3% power overhead.
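The abstract does not describe the proposed codes themselves, so the sketch below uses the classical Hamming(7,4) code purely to illustrate what single error correction means. It is not the paper's CIM-specific ECC, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4   # 1-based position of the flipped bit, 0 if none
    if error_pos:
        c[error_pos - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]
```

The syndrome directly names the position of a single flipped bit. Correcting two or three errors, as the abstract reports, requires stronger codes than this one.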

Energy efficient data search design and optimization based on a compact ferroelectric FET content addressable memory

  • Jiahao Cai
  • Mohsen Imani
  • Kai Ni
  • Grace Li Zhang
  • Bing Li
  • Ulf Schlichtmann
  • Cheng Zhuo
  • Xunzhao Yin

Content Addressable Memory (CAM) is widely used for associative search tasks in advanced machine learning models and data-intensive applications due to its highly parallel pattern matching capability. Most state-of-the-art CAM designs focus on reducing the CAM cell area by exploiting nonvolatile memories (NVMs). Little research exists on optimizing the design and energy efficiency of NVM-based CAMs for practical deployment in edge devices and AI hardware. In this paper, we propose a general compact and energy efficient CAM design scheme that alleviates the design overhead by employing just one NVM device in the cell. We also propose an adaptive matchline (ML) precharge and discharge scheme that further optimizes the search energy by fully reducing the ML voltage swing. We consider Ferroelectric field effect transistors (FeFETs) as the representative NVM, and present a 2T-1FeFET CAM array including a sense amplifier implementing the proposed ML scheme. Evaluation results suggest that our proposed 2T-1FeFET CAM design achieves 6.64×/4.74×/9.14×/3.02× better energy efficiency compared with CMOS/ReRAM/STT-MRAM/2FeFET CAM arrays. Benchmarking results show that our approach provides 3.3×/2.1× energy-delay product improvement over the 2T-2R/2FeFET CAM in accelerating query processing applications.

CamSkyGate: camouflaged skyrmion gates for protecting ICs

  • Yuqiao Zhang
  • Chunli Tang
  • Peng Li
  • Ujjwal Guin

Magnetic skyrmions have the potential to become a candidate for emerging technologies due to their ultra-high integration density and ultra-low energy. A skyrmion is a magnetic pattern created by transverse current injection in the ferromagnetic (FM) layer. A skyrmion can be generated by localized spin-polarized current and behaves like a stable pseudoparticle. Different logic gates have been proposed, where the presence or absence of a single skyrmion is represented as binary logic 1 or logic 0, respectively. In this paper, we propose novel camouflaged logic gate designs to prevent an adversary from extracting the original netlist. The proposal uses differential doping to block the propagation of the skyrmions to realize the camouflaged gates. To the best of our knowledge, we are the first to propose camouflaged skyrmion gates to prevent an adversary from performing reverse engineering. We demonstrate the functionality of different camouflaged gates using the mumax3 micromagnetic simulator. We have also evaluated the security of the proposed camouflaged designs using SAT attacks. We show that the same security as that of traditional CMOS-based camouflaged circuits can be retained.

GNN-based concentration prediction for random microfluidic mixers

  • Weiqing Ji
  • Xingzhuo Guo
  • Shouan Pan
  • Tsung-Yi Ho
  • Ulf Schlichtmann
  • Hailong Yao

Recent years have witnessed significant advances brought by microfluidic biochips in automating biochemical processing. Accurate preparation of fluid samples with microfluidic mixers is a fundamental step in various biomedical applications, where concentration prediction and generation are critical. Finite element analysis (FEA), as provided by tools such as COMSOL, is the most commonly used simulation method for accurate concentration prediction of a given biochip design. However, the FEA simulation process is time-consuming and scales poorly to large biochip sizes. This paper proposes a new concentration prediction method based on graph neural networks (GNNs), which efficiently and accurately predicts the concentration generated by random microfluidic mixers of different sizes. Experimental results show that, compared with the state-of-the-art method, the proposed GNN-based simulation method reduces the error of the predicted concentration by 88%, which validates the effectiveness of the proposed GNN model.

Designing ML-resilient locking at register-transfer level

  • Dominik Sisejkovic
  • Luca Collini
  • Benjamin Tan
  • Christian Pilato
  • Ramesh Karri
  • Rainer Leupers

Various logic-locking schemes have been proposed to protect hardware from intellectual property piracy and malicious design modifications. Since traditional locking techniques are applied on the gate-level netlist after logic synthesis, they have no semantic knowledge of the design function. Data-driven, machine-learning (ML) attacks can uncover the design flaws within gate-level locking. Recent proposals on register-transfer level (RTL) locking have access to semantic hardware information. We investigate the resilience of ASSURE, a state-of-the-art RTL locking method, against ML attacks. We used the lessons learned to derive two ML-resilient RTL locking schemes built to reinforce ASSURE locking. We developed ML-driven security metrics to evaluate the schemes against an RTL adaptation of the state-of-the-art, ML-based SnapShot attack.

O’clock: lock the clock via clock-gating for SoC IP protection

  • M Sazadur Rahman
  • Rui Guo
  • Hadi M Kamali
  • Fahim Rahman
  • Farimah Farahmandi
  • Mohamed Abdel-Moneum
  • Mark Tehranipoor

Existing logic locking techniques can prevent IP piracy or tampering. However, they often come at the expense of high overhead and are gradually becoming vulnerable to emerging deobfuscation attacks. To protect SoC IPs, we propose O’Clock, a fully-automated clock-gating-based approach that ‘locks the clock’ to protect IPs in complex SoCs. O’Clock obstructs data/control flows and makes the underlying logic dysfunctional for incorrect keys by manipulating the activity factor of the clock tree. O’Clock has minimal changes to the original design and no change to the IC design flow. Our experimental results show its high resiliency against state-of-the-art de-obfuscation attacks (e.g., oracle-guided SAT, unrolling-/BMC-based SAT, removal, and oracle-less machine learning-based attacks) at negligible power, performance, and area (PPA) overhead.

ALICE: an automatic design flow for eFPGA redaction

  • Chiara Muscari Tomajoli
  • Luca Collini
  • Jitendra Bhandari
  • Abdul Khader Thalakkattu Moosa
  • Benjamin Tan
  • Xifan Tang
  • Pierre-Emmanuel Gaillardon
  • Ramesh Karri
  • Christian Pilato

Fabricating an integrated circuit is becoming unaffordable for many semiconductor design houses. Outsourcing the fabrication to a third-party foundry requires methods to protect the intellectual property of the hardware designs. Designers can rely on embedded reconfigurable devices to completely hide the real functionality of selected design portions unless the configuration string (bitstream) is provided. However, selecting such portions and creating the corresponding reconfigurable fabrics are still open problems. We propose ALICE, a design flow that addresses the EDA challenges of this problem. ALICE partitions the RTL modules between one or more reconfigurable fabrics and the rest of the circuit, automating the generation of the corresponding redacted design.

DELTA: DEsigning a stealthy trigger mechanism for analog hardware trojans and its detection analysis

  • Nishant Gupta
  • Mohil Sandip Desai
  • Mark Wijtvliet
  • Shubham Rai
  • Akash Kumar

This paper presents a stealthy triggering mechanism that reduces the dependencies of analog hardware Trojans on the frequent toggling of the software-controlled rare nets. The trigger to activate the Trojan is generated by using a glitch generation circuit and a clock signal, which increases the selectivity and feasibility of the trigger signal. The proposed trigger is able to evade the state-of-the-art run-time detection (R2D2) and Built-In Acceleration Structure (BIAS) schemes. Furthermore, the simulation results show that the proposed trigger circuit incurs a minimal overhead in side-channel footprints in terms of area (29 transistors), delay (less than 1ps in the clock cycle), and power (1μW).

VIPR-PCB: a machine learning based golden-free PCB assurance framework

  • Aritra Bhattacharyay
  • Prabuddha Chakraborty
  • Jonathan Cruz
  • Swarup Bhunia

Printed circuit boards (PCBs) form an integral part of the electronics life cycle by providing mechanical support and electrical connections to microchips and discrete electronic components. PCBs follow a similar life cycle as microchips and are vulnerable to similar assurance issues. Malicious design alterations, i.e., hardware Trojan attacks, have emerged as a major threat to PCB assurance. Board-level Trojans are extremely challenging to detect due to (1) the lack of golden or reference models in most use cases, (2) potentially unbounded attack space, and (3) the growing complexity of commercial PCB designs. Existing PCB inspection techniques (e.g., optical and electrical) do not scale to large volume and are expensive, time-consuming, and often not reliable in covering diverse Trojan space. To address these issues, in this paper, we present VIPR-PCB, a board-level Trojan detection framework that employs a machine learning (ML) model to learn Trojan signatures in functional and structural space and uses a trained model to discover Trojans in suspect PCB designs with high fidelity. Using extensive evaluation with 10 open-source PCB designs and a wide variety of Trojan instances, we demonstrate that VIPR-PCB can achieve over 98% accuracy and is even capable of detecting Trojans in partially-recovered PCB designs.

CLIMBER: defending phase change memory against inconsistent write attacks

  • Zhuohui Duan
  • Haobo Wang
  • Haikun Liu
  • Xiaofei Liao
  • Hai Jin
  • Yu Zhang
  • Fubing Mao

Non-volatile Memories (NVMs) usually demonstrate vast endurance variation due to Process Variation (PV). They are vulnerable to an Inconsistent Write Attack (IWA) which reverses the write intensity distribution in two adjacent wear leveling windows. In this paper, we propose CLIMBER, a defense mechanism to neutralize IWA for NVMs. CLIMBER dynamically changes harmful address mappings so that intensive writes to weak cells are still redirected to strong cells. CLIMBER also conceals weak NVM cells from attackers by randomly mapping cold addresses to weak NVM regions. Experimental results show that CLIMBER can reduce maximum page wear rate by 43.2% compared with the state-of-the-art Toss-up Wear Leveling and prolong NVM lifetime from 4.19 years to 7.37 years with trivial performance/hardware overhead.

Rethinking key-value store for byte-addressable optane persistent memory

  • Sung-Ming Wu
  • Li-Pin Chang

Optane Persistent Memory (PM) is a pioneering solution to byte-addressable PM for commodity systems. However, the performance of Optane PM is highly workload-sensitive, rendering many prior designs of Key-Value (KV) store inefficient. To cope with this reality, we advocate rethinking KV store design for Optane PM. Our design follows a principle of Single-stream Writing with managed Multi-stream Reading (SWMR): Incoming KV pairs are written to PM through a single write stream and managed by an ordered index in DRAM. Through asynchronously sorting and rewriting large sets of KV pairs, range queries are handled with a managed number of concurrent streams. YCSB results show that our design improved upon existing ones by 116% and 21% for write-only throughput and read-write throughput, respectively.
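The SWMR principle described above can be pictured with a minimal in-memory sketch: a single append-only log stands in for the PM write stream, and a sorted key list plus a dictionary stand in for the ordered DRAM index. The class and method names are hypothetical, and compaction runs synchronously here, whereas the abstract describes asynchronous sorting and rewriting.

```python
import bisect


class SingleStreamKV:
    """Toy model: one append-only log (stand-in for PM) plus an ordered index (stand-in for DRAM)."""

    def __init__(self):
        self.log = []      # single write stream of (key, value) records
        self.keys = []     # sorted list of distinct keys
        self.offset = {}   # key -> position of its latest record in the log

    def put(self, key, value):
        """Append the record to the log and point the index at it."""
        if key not in self.offset:
            bisect.insort(self.keys, key)
        self.offset[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        """Return the latest value for key, or None if the key is absent."""
        pos = self.offset.get(key)
        return None if pos is None else self.log[pos][1]

    def scan(self, lo, hi):
        """Return (key, value) pairs with lo <= key <= hi, in key order."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [(k, self.log[self.offset[k]][1]) for k in self.keys[i:j]]

    def compact(self):
        """Rewrite live records in key order, dropping stale versions."""
        new_log = [(k, self.log[self.offset[k]][1]) for k in self.keys]
        self.log = new_log
        self.offset = {k: i for i, (k, _) in enumerate(new_log)}
```

After `compact()`, a range scan touches a contiguous region of the log, which is the intuition behind handling range queries with a bounded number of read streams.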

libcrpm: improving the checkpoint performance of NVM

  • Feng Ren
  • Kang Chen
  • Yongwei Wu

libcrpm is a new programming library to improve the checkpoint performance for applications running in NVM. It proposes the failure-atomic differential checkpointing protocol, which addresses two problems simultaneously that exist in the current NVM-based checkpoint-recovery libraries: (1) high write amplification when page-granularity incremental checkpointing is used, and (2) high persistence costs from excessive memory fence instructions when fine-grained undo-log or copy-on-write is used. Evaluation results show that libcrpm reduces the checkpoint overhead in realistic workloads. For MPI-based parallel applications such as LULESH, the checkpoint overhead of libcrpm is only 44.78% of FTI, an application-level checkpoint-recovery library.

Scalable crash consistency for secure persistent memory

  • Ming Zhang
  • Yu Hua
  • Xuan Li
  • Hao Xu

Persistent memory (PM) suffers from data security and crash consistency issues due to non-volatility. Counter-mode encryption (CME) and bonsai merkle tree (BMT) have been adopted to ensure data security by using security metadata. The data and its security metadata need to be atomically persisted for correct recovery. To ensure crash consistency, durable transactions have been widely employed. However, the long-time BMT update increases the transaction latency, and the security metadata incur heavy write traffic. This paper presents Secon to ensure SEcurity and crash CONsistency for PM with high performance. Secon leverages a scalable write-through metadata cache to ensure the atomicity of data and its security metadata. To reduce the transaction latency, Secon proposes a transaction-specific epoch persistency model to minimize the ordering constraints. To reduce the amount of PM writes, Secon co-locates counters with log entries and coalesces BMT blocks. Experimental results demonstrate that Secon significantly improves the transaction performance and decreases the write traffic.

Don’t open row: rethinking row buffer policy for improving performance of non-volatile memories

  • Yongho Lee
  • Osang Kwon
  • Seokin Hong

Among the various NVM technologies, phase-change memory (PCM) has attracted substantial attention as a candidate to replace DRAM for next-generation memory. However, the characteristics of PCM cause it to have much longer read and write latencies than DRAM. This paper proposes a Write-Around PCM System that addresses this limitation using two novel schemes: Pseudo-Row Activation and Direct Write. Pseudo-Row Activation provides fast row activation for PCM writes by connecting a target row to bitlines, but it does not fetch the data into the row buffer. With the Direct Write scheme, our system allows write operations to update the data even if the target row is in the logically closed state.

SMART: on simultaneously marching racetracks to improve the performance of racetrack-based main memory

  • Xiangjun Peng
  • Ming-Chang Yang
  • Ho Ming Tsui
  • Chi Ngai Leung
  • Wang Kang

RaceTrack Memory (RTM) is a promising medium for modern Main Memory subsystems. However, the “shift-before-access” principle, which is inherent to RTM, introduces considerable overheads to the access latency. To obtain more insights for the mitigation of shift overheads, this work characterizes the access patterns exhibited by state-of-the-art RTM-based Main Memory and observes that they mismatch the granularity of shift commands (i.e., a group of RaceTracks called a Domain Block Cluster (DBC)). Based on the characterization, we propose a novel mechanism called SMART, which simultaneously and proactively marches all DBCs within a subarray, so that subsequent accesses to other DBCs can be served without additional shift commands. Evaluation results show that, averaged across 15 real-world workloads, SMART significantly outperforms other state-of-the-art proposals of RTM-based Main Memory by at least 1.53X in terms of the total execution time, on two different generations of RTM technologies.

SAPredictor: a simple and accurate self-adaptive predictor for hierarchical hybrid memory system

  • Yujuan Tan
  • Wei Chen
  • Zhulin Ma
  • Dan Xiao
  • Zhichao Yan
  • Duo Liu
  • Xianzhang Chen

In a hybrid memory system using DRAM as the NVM cache, DRAM and NVM can be accessed in serial or parallel mode. However, we found that using either mode alone will bring access latency and bandwidth problems. In this paper, we integrate these two access modes and design a simple but accurate predictor (called SAPredictor) to help choose the appropriate access mode, thereby avoiding long access latency and bandwidth problems to improve memory performance. Our experiments show that SAPredictor achieves an accuracy rate of up to 97.1% and helps reduce access latency by up to 35.6% at fairly low costs.

AVATAR: an aging- and variation-aware dynamic timing analyzer for application-based DVAFS

  • Zuodong Zhang
  • Zizheng Guo
  • Yibo Lin
  • Runsheng Wang
  • Ru Huang

As the timing guardband continues to increase with the continuous technology scaling, better-than-worst-case (BTWC) design has gained more and more attention. BTWC design can improve energy efficiency and/or performance by relaxing the conservative static timing constraints and exploiting the dynamic timing margin. However, to avoid potential reliability hazards, the existing dynamic timing analysis (DTA) tools have to add extra aging and variation guardbands, which are estimated under the worst-case corners of aging and variation. Such a guardbanding method introduces unnecessary margin in timing analysis, thus reducing the performance and efficiency gains of BTWC designs. Therefore, in this paper, we propose AVATAR, an aging- and variation-aware dynamic timing analyzer that can perform DTA with the impact of transistor aging and random process variation. We also propose an application-based dynamic-voltage-accuracy-frequency-scaling (DVAFS) design flow based on AVATAR, which can improve energy efficiency by exploiting both dynamic timing slack (DTS) and the intrinsic error tolerance of the application. The results show that a 45.8% performance improvement and 68% power savings can be achieved by exploiting the intrinsic error tolerance. Compared with the conventional flow based on corner-based DTA, the additional performance improvement of the proposed flow can be up to 14%, or the additional power saving can be up to 20%.

A defect tolerance framework for improving yield

  • Shiva Shankar Thiagarajan
  • Suriyaprakash Natarajan
  • Yiorgos Makris

In the latest technology nodes, there is a growing concern about yield loss due to timing failures and delay degradation resulting from manufacturing complexities. Largely, these process imperfections are fixed using empirical methods such as layout guidelines and process fixes, which come late in the design cycle. In this work, we propose a framework for improving the design yield by synthesizing netlists with improved ability to withstand delay variations to reduce yield loss. We advocate a defect tolerant approach during early design stages to synthesize netlists by introducing defect-awareness to EDA synthesis, thereby generating robust netlists that can withstand delays induced by process imperfections. Toward this objective, we present a) a methodology to characterize standard library cells for delay defects to model the robustness of the cell delays, and b) a solution to drive design synthesis using the intelligence from the cell characterization to achieve design robustness to timing errors. We also introduce defect tolerance metrics to quantify the robustness of standard cells to timing variations, which we use to generate defect-aware libraries to guide defect-aware synthesis. The effectiveness of the proposed defect-aware methodology is evaluated on a set of benchmarks implemented in GF 12nm technology using static timing analysis (STA), revealing a 70–80% reduction of yield loss due to timing errors arising from manufacturing defects, with minimal impact on area and power and no impact on performance.

Winograd convolution: a perspective from fault tolerance

  • Xinghua Xue
  • Haitong Huang
  • Cheng Liu
  • Tao Luo
  • Lei Zhang
  • Ying Wang

Winograd convolution was originally proposed to reduce the computing overhead by replacing multiplications in neural networks (NNs) with additions via linear transformations. Beyond computing efficiency, we observe its great potential in improving NN fault tolerance and evaluate its fault tolerance comprehensively for the first time. Then, we explore the use of the fault tolerance of Winograd convolution for either fault-tolerant or energy-efficient NN processing. According to our experiments, Winograd convolution can be utilized to reduce fault-tolerant design overhead by 27.49% or energy consumption by 7.19% without any accuracy loss, compared with a design that is unaware of this fault tolerance.
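To make the multiplication-for-addition trade concrete, the following sketch shows the standard one-dimensional Winograd minimal filtering case F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. It illustrates the general technique only; the function names are hypothetical and nothing here reproduces the paper's fault-tolerance analysis.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter computed directly: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3): 4 multiplications plus extra additions."""
    # Filter transform (can be precomputed once per filter).
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform: additions and subtractions only.
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = d[2] - d[1]
    v3 = d[1] - d[3]
    # The only four multiplications that depend on the input data.
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3
    # Output transform: additions and subtractions only.
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Both functions take a 4-element input tile `d` and a 3-element filter `g`. Two-dimensional variants used in NN layers nest the same transforms along rows and columns.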

Towards resilient analog in-memory deep learning via data layout re-organization

  • Muhammad Rashedul Haq Rashed
  • Amro Awad
  • Sumit Kumar Jha
  • Rickard Ewetz

Processing in-memory paves the way for neural network inference engines. An emerging challenge is to develop the software/hardware interface to automatically compile deep learning models onto in-memory computing platforms. In this paper, we observe that the data layout organization of a deep neural network (DNN) model directly impacts the model’s classification accuracy. This stems from the fact that the resistive parasitics within a crossbar introduce a dependency between the matrix data and the precision of the analog computation. To minimize the impact of the parasitics, we first perform a case study to understand the underlying matrix properties that result in computation with low and high precision, respectively. Next, we propose the XORG framework that performs data layout organization for DNNs deployed on in-memory computing platforms. The data layout organization improves precision by optimizing the weight matrix to crossbar assignments at compile time. The experimental results show that the XORG framework improves precision by up to 3.2X and by 31% on average. When accelerating DNNs using XORG, the write bit-accuracy requirements are relaxed by 1 bit and the robustness to random telegraph noise (RTN) is improved.

SEM-latch: a lost-cost and high-performance latch design for mitigating soft errors in nanoscale CMOS process

  • Zhong-Li Tang
  • Chia-Wei Liang
  • Ming-Hsien Hsiao
  • Charles H.-P. Wen

Soft errors (primarily single-event transients (SET) and single-event upsets (SEU)) are receiving increased attention due to the increasing prevalence of automotive and biomedical electronics. In recent years, several latch designs have been developed for SEU/SET protection, but each has its own issues regarding timing, area, and power. Therefore, we propose a novel soft-error mitigating latch design, called SEM-Latch, which extends QUATRO and incorporates a speed path while embedding a reference voltage generator (RVG) to simultaneously improve timing, area, and power in a 45nm CMOS process. SEM-Latch effectively reduces the power, area, and PDAP (product of delay, area, and power) by an average of 1.4%, 12.5%, and 8.7%, respectively, in comparison to a previous latch (HPST) with equivalent SEU protection. Furthermore, in comparison to AMSER-Latch, SEM-Latch reduces area, timing overhead, and PDAP by 27.2%, 48.2%, and 60.2%, respectively, while providing a 99.9999% particle rejection rate for SET protection.

BlueSeer: AI-driven environment detection via BLE scans

  • Valentin Poirot
  • Oliver Harms
  • Hendric Martens
  • Olaf Landsiedel

IoT devices rely on environment detection to trigger specific actions, e.g., for headphones to adapt noise cancellation to the surroundings. While phones feature many sensors, from GNSS to cameras, small wearables must rely on the few energy-efficient components they already incorporate. In this paper, we demonstrate that a Bluetooth radio is the only component required to accurately classify environments and present BlueSeer, an environment-detection system that solely relies on received BLE packets and an embedded neural network. BlueSeer achieves an accuracy of up to 84% differentiating between 7 environments on resource-constrained devices, and requires only ~12 ms for inference on a 64 MHz microcontroller unit.

Compressive sensing based asymmetric semantic image compression for resource-constrained IoT system

  • Yujun Huang
  • Bin Chen
  • Jianghui Zhang
  • Qiu Han
  • Shu-Tao Xia

The widespread application of Internet-of-Things (IoT) and deep learning has made machine-to-machine semantic communication possible. However, it remains challenging to deploy DNN models on IoT devices, due to their limited computing and storage capacity. In this paper, we propose Compressed Sensing based Asymmetric Semantic Image Compression (CS-ASIC) for resource-constrained IoT systems, which consists of a lightweight front encoder and a deep iterative decoder offloaded to the server. We further consider a task-oriented scenario and optimize CS-ASIC for semantic recognition tasks. The experimental results demonstrate that CS-ASIC achieves a considerable data-semantic rate-distortion trade-off and low encoding complexity compared with prevailing codecs.

R2B: high-efficiency and fair I/O scheduling for multi-tenant with differentiated demands

  • Diansen Sun
  • Yunpeng Chai
  • Chaoyang Liu
  • Weihao Sun
  • Qingpeng Zhang

Big data applications have differentiated requirements for I/O resources in cloud environments. For instance, data analytics and AI/ML applications usually have periodic bursts of I/O traffic, and data stream processing and database applications often introduce fluctuating I/O loads on top of a guaranteed I/O bandwidth. However, the existing resource isolation model (i.e., RLW) and methods (e.g., Token-bucket, mClock, and cgroup) cannot support the fluctuating I/O load and differentiated I/O demands well, and thus cannot achieve fairness, high resource utilization, and high performance for applications at the same time. In this paper, we propose a novel, efficient, and fair I/O resource isolation model and method called R2B, which can adapt to the differentiated I/O characteristics and requirements of different applications in a shared resource environment. R2B can simultaneously satisfy fairness and achieve both high application efficiency and high bandwidth utilization.

This work aims to help the cloud provider achieve higher utilization by shifting the burden to the cloud customers to specify their type of workload.

Fast and scalable human pose estimation using mmWave point cloud

  • Sizhe An
  • Umit Y. Ogras

Millimeter-Wave (mmWave) radar can enable high-resolution human pose estimation with low cost and computational requirements. However, mmWave data point cloud, the primary input to processing algorithms, is highly sparse and carries significantly less information than other alternatives such as video frames. Furthermore, the scarce labeled mmWave data impedes the development of machine learning (ML) models that can generalize to unseen scenarios. We propose a fast and scalable human pose estimation (FUSE) framework that combines multi-frame representation and meta-learning to address these challenges. Experimental evaluations show that FUSE adapts to the unseen scenarios 4× faster than current supervised learning approaches and estimates human joint coordinates with about 7 cm mean absolute error.

VWR2A: a very-wide-register reconfigurable-array architecture for low-power embedded devices

  • Benoît W. Denkinger
  • Miguel Peón-Quirós
  • Mario Konijnenburg
  • David Atienza
  • Francky Catthoor

Edge computing requires high-performance, energy-efficient embedded systems. Fixed-function or custom accelerators, such as FFT or FIR filter engines, are very efficient at implementing a particular functionality for a given set of constraints. However, they are inflexible when facing application-wide optimizations or functionality upgrades. Conversely, programmable cores offer higher flexibility, but often with a penalty in area, performance, and, above all, energy consumption. In this paper, we propose VWR2A, an architecture that integrates high computational density and low power memory structures (i.e., very-wide registers and scratchpad memories). VWR2A narrows the energy gap with similar or better performance on FFT kernels with respect to an FFT accelerator. Moreover, VWR2A's flexibility allows multiple kernels to be accelerated, resulting in significant energy savings at the application level.

Alleviating datapath conflicts and design centralization in graph analytics acceleration

  • Haiyang Lin
  • Mingyu Yan
  • Duo Wang
  • Mo Zou
  • Fengbin Tu
  • Xiaochun Ye
  • Dongrui Fan
  • Yuan Xie

Previous graph analytics accelerators have achieved great improvements in throughput by alleviating irregular off-chip memory accesses. However, on-chip datapath conflicts and design centralization have become the critical issues hindering further throughput improvement. In this paper, a general solution, the Multiple-stage Decentralized Propagation network (MDP-network), is proposed to address these issues, inspired by the key idea of trading latency for throughput. In addition, a novel High throughput Graph analytics accelerator, HiGraph, is proposed by deploying the MDP-network to address each issue in practice. The experiment shows that, compared with a state-of-the-art accelerator, HiGraph achieves up to 2.2× speedup (1.5× on average) as well as better scalability.

Hyperdimensional hashing: a robust and efficient dynamic hash table

  • Mike Heddes
  • Igor Nunes
  • Tony Givargis
  • Alexandru Nicolau
  • Alex Veidenbaum

Most cloud services and distributed applications rely on hashing algorithms that allow dynamic scaling of a robust and efficient hash table. Examples include AWS, Google Cloud and BitTorrent. Consistent and rendezvous hashing are algorithms that minimize key remapping as the hash table resizes. While memory errors in large-scale cloud deployments are common, neither algorithm offers both efficiency and robustness. Hyperdimensional Computing is an emerging computational model that is inherently efficient and robust and is well suited for vector or hardware acceleration. We propose Hyperdimensional (HD) hashing and show that it has the efficiency to be deployed in large systems. Moreover, a realistic level of memory errors causes more than 20% mismatches for consistent hashing while HD hashing remains unaffected.
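For readers unfamiliar with the baselines, the sketch below shows rendezvous (highest random weight) hashing, one of the two existing algorithms named above, and why it keeps remapping minimal when a node leaves. It does not implement HD hashing; the helper names and the choice of SHA-256 as the scoring hash are assumptions made for illustration.

```python
import hashlib


def _score(key, node):
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def rendezvous_owner(key, nodes):
    """Assign the key to the node with the highest score for that key."""
    return max(nodes, key=lambda node: _score(key, node))


def remapped_keys(keys, nodes_before, nodes_after):
    """Return the keys whose owner changes between two node sets."""
    return [k for k in keys
            if rendezvous_owner(k, nodes_before) != rendezvous_owner(k, nodes_after)]
```

When a node is removed, only the keys it owned change owner, because every other key's highest-scoring node is still present. A call such as `remapped_keys(range(1000), ["a", "b", "c", "d"], ["a", "b", "c"])` therefore returns exactly the keys that were previously assigned to `"d"`.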

In-situ self-powered intelligent vision system with inference-adaptive energy scheduling for BNN-based always-on perception

  • Maimaiti Nazhamaiti
  • Haijin Su
  • Han Xu
  • Zheyu Liu
  • Fei Qiao
  • Qi Wei
  • Zidong Du
  • Xinghua Yang
  • Li Luo

This paper proposes an in-situ self-powered BNN-based intelligent visual perception system that harvests light energy utilizing the indispensable image sensor itself. The harvested energy is allocated to the low-power BNN computation modules layer by layer, adopting a light-weighted duty-cycling-based energy scheduler. A software-hardware co-design method, which exploits the layer-wise error tolerance of BNN as well as the computing-error and energy consumption characteristics of the computation circuit, is proposed to determine the parameters of the energy scheduler, achieving high energy efficiency for self-powered BNN inference. Simulation results show that with the proposed inference-adaptive energy scheduling method, a self-powered MNIST classification task can be performed at a frame rate of 4 fps if the harvesting power is 1 μW, while guaranteeing at least 90% inference accuracy using a binary LeNet-5 network.

Adaptive window-based sensor attack detection for cyber-physical systems

  • Lin Zhang
  • Zifan Wang
  • Mengyu Liu
  • Fanxin Kong

Sensor attacks alter sensor readings and spoof Cyber-Physical Systems (CPS) to perform dangerous actions. Existing detection works tend to minimize the detection delay and false alarms at the same time, while there is a clear trade-off between the two metrics. Instead, we argue that attack detection should dynamically balance the two metrics when a physical system is at different states. In line with this argument, we propose an adaptive sensor attack detection system that consists of three components – an adaptive detector, a detection deadline estimator, and a data logger. It can adapt the detection delay, and thus the false alarms, at run time to meet a varying detection deadline and improve usability (i.e., reduce false alarms). Finally, we implement our detection system and validate it using multiple CPS simulators and a reduced-scale autonomous vehicle testbed.

Design-while-verify: correct-by-construction control learning with verification in the loop

  • Yixuan Wang
  • Chao Huang
  • Zhaoran Wang
  • Zhilu Wang
  • Qi Zhu

In the current control design of safety-critical cyber-physical systems, formal verification techniques are typically applied after the controller is designed to evaluate whether the required properties (e.g., safety) are satisfied. However, due to the increasing system complexity and the fundamental hardness of designing a controller with formal guarantees, such an open-loop process of design-then-verify often results in many iterations and fails to provide the necessary guarantees. In this paper, we propose a correct-by-construction control learning framework that integrates the verification into the control design process in a closed-loop manner, i.e., design-while-verify. Specifically, we leverage the verification results (computed reachable set of the system state) to construct feedback metrics for control learning, which measure how likely the current design of control parameters can meet the required reach-avoid property for safety and goal-reaching. We formulate an optimization problem based on such metrics for tuning the controller parameters, and develop an approximated gradient descent algorithm with a difference method to solve the optimization problem and learn the controller. The learned controller is formally guaranteed to meet the required reach-avoid property. By treating verifiability as a first-class objective and effectively leveraging the verification results during the control learning process, our approach can significantly improve the chance of finding a control design with formal property guarantees, demonstrated in a set of experiments that use model-based or neural network based controllers.
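
The "approximated gradient descent algorithm with a difference method" can be illustrated with a generic finite-difference sketch. This assumes the verification-derived feedback metric is available only as a black-box function of the controller parameters. The quadratic `toy_metric` is a hypothetical stand-in for that metric, and the step sizes are arbitrary. It is not the authors' implementation.

```python
def finite_difference_gradient(metric, params, step=1e-4):
    """Approximate the gradient of a black-box metric with central differences."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += step
        down[i] -= step
        grad.append((metric(up) - metric(down)) / (2 * step))
    return grad

def approximate_gradient_descent(metric, params, learning_rate=0.1, iterations=200):
    """Repeatedly step against the approximated gradient."""
    params = list(params)
    for _ in range(iterations):
        grad = finite_difference_gradient(metric, params)
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params

def toy_metric(params):
    # Hypothetical stand-in for a metric computed from a reachable set;
    # its minimum is at (1.0, -2.0).
    return (params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2

learned = approximate_gradient_descent(toy_metric, [0.0, 0.0])
```

In the framework described above, each metric evaluation would involve a reachability computation, so the number of evaluations per step (two per parameter here) is the main cost of such a scheme.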

GaBAN: a generic and flexibly programmable vector neuro-processor on FPGA

  • Jiajie Chen
  • Le Yang
  • Youhui Zhang

The spiking neural network (SNN) is the main computational model of brain-inspired computing and neuroscience, which also acts as the bridge between them. With the rapid development of neuroscience, accurate and flexible SNN simulation with high performance is becoming important. This paper proposes GaBAN, a generic and flexibly programmable neuro-processor on FPGA. Different from the majority of current designs that realize neural components by custom hardware directly, it is centered on a compact, versatile vector instruction set, which supports multiple-precision vector calculation, indexed-/strided-memory access, and conditional execution to accommodate computational characteristics. By software and hardware co-design, the compiler extracts memory accesses from SNN programs to generate micro-ops executed by an independent hardware unit; the latter interacts with the computing pipeline through an asynchronous buffering mechanism. Thus, memory access delay can fully cover the calculation. Tests show that GaBAN can not only outperform the SOTA ISA-based FPGA solution remarkably but also be comparable with counterparts of the hardware-fixed model on some tasks. Moreover, in end-to-end testing, its simulation performance exceeds that of a high-performance x86 processor (1.44–3.0×).

ADEPT: automatic differentiable DEsign of photonic tensor cores

  • Jiaqi Gu
  • Hanqing Zhu
  • Chenghao Feng
  • Zixuan Jiang
  • Mingjie Liu
  • Shuhan Zhang
  • Ray T. Chen
  • David Z. Pan

Photonic tensor cores (PTCs) are essential building blocks for optical artificial intelligence (AI) accelerators based on programmable photonic integrated circuits. PTCs can achieve ultra-fast and efficient tensor operations for neural network (NN) acceleration. Current PTC designs are either manually constructed or based on matrix decomposition theory, which lacks the adaptability to meet various hardware constraints and device specifications. To the best of our knowledge, automatic PTC design methodology is still unexplored. It will be promising to move beyond the manual design paradigm and “nurture” photonic neurocomputing with AI and design automation. Therefore, in this work, for the first time, we propose a fully differentiable framework, dubbed ADEPT, that can efficiently search PTC designs adaptive to various circuit footprint constraints and foundry PDKs. Extensive experiments show superior flexibility and effectiveness of the proposed ADEPT framework to explore a large PTC design space. On various NN models and benchmarks, our searched PTC topology outperforms prior manually-designed structures with competitive matrix representability, 2×–30× higher footprint compactness, and better noise robustness, demonstrating a new paradigm in photonic neural chip design. The code of ADEPT is available at link using the TorchONN library.

Unicorn: a multicore neuromorphic processor with flexible fan-in and unconstrained fan-out for neurons

  • Zhijie Yang
  • Lei Wang
  • Yao Wang
  • Linghui Peng
  • Xiaofan Chen
  • Xun Xiao
  • Yaohua Wang
  • Weixia Xu

Neuromorphic processors are popular due to their high energy efficiency for spatio-temporal applications. However, when running spiking neural network (SNN) topologies of ever-growing scale, existing neuromorphic architectures face challenges due to their restrictions on neuron fan-in and fan-out. This paper proposes Unicorn, a multicore neuromorphic processor with a spike train sliding multicasting mechanism (STSM) and neuron merging mechanism (NMM) to support unconstrained fan-out and flexible fan-in of neurons. Unicorn supports 36K neurons and 45M synapses and thus supports a variety of neuromorphic applications. The peak performance and energy efficiency of Unicorn reach 36 TSOPS and 424 GSOPS/W, respectively. Experimental results show that Unicorn can achieve 2×–5.5× energy reduction over the state-of-the-art neuromorphic processor when running an SNN with a relatively large fan-out and fan-in.

Effective zero compression on ReRAM-based sparse DNN accelerators

  • Hoon Shin
  • Rihae Park
  • Seung Yul Lee
  • Yeonhong Park
  • Hyunseung Lee
  • Jae W. Lee

For efficient DNN inference, Resistive RAM (ReRAM) crossbars have emerged as a promising building block to compute matrix multiplication in an area- and power-efficient manner. To improve inference throughput, sparse models can be deployed on the ReRAM-based DNN accelerator. While unstructured pruning maintains both high accuracy and high sparsity, it performs poorly on the crossbar architecture due to the irregular locations of pruned weights. Meanwhile, due to the non-ideality of ReRAM cells and the high cost of ADCs, matrix multiplication is usually performed at a fine granularity, called Operation Unit (OU), along both wordline and bitline dimensions. While fine-grained, OU-based row compression (ORC) has recently been proposed to increase the weight compression ratio, significant performance potential is still left on the table due to sub-optimal weight mappings. Thus, we propose a novel weight mapping scheme that effectively clusters zero weights via OU-level filter reordering, hence improving the effective weight compression ratio. We also introduce a weight recovery scheme to further improve accuracy or compression ratio, or both. Our evaluation with three popular DNNs demonstrates that the proposed scheme effectively eliminates redundant weights in the crossbar array and hence ineffectual computation to achieve an array compression ratio of 3.27–4.26× with negligible accuracy loss over the baseline ReRAM-based DNN accelerator.

Y-architecture-based flip-chip routing with dynamic programming-based bend minimization

  • Szu-Ru Nie
  • Yen-Ting Chen
  • Yao-Wen Chang

In modern VLSI designs, I/O counts have been growing continuously as the system becomes more complicated. To achieve higher routability, the hexagonal array is introduced with higher pad density and a larger pitch. However, the routing for hexagonal arrays is significantly different from that for traditional grid and staggered arrays. In this paper, we consider the Y-architecture-based flip-chip routing used for the hexagonal array. Unlike the conventional Manhattan and the X-architectures, the Y-architecture allows wires to be routed in three directions, namely, 0, 60, and 120 degrees. We first analyze the routing properties of the hexagonal array. Then, we propose a triangular tile model and a chord-based internal node division method that can handle both pre-assignment and free-assignment nets without wire crossing. Finally, we develop a novel dynamic programming-based bend minimization method to reduce the number of routing bends in the final solution. Experimental results show that our algorithm can achieve 100% routability while effectively minimizing the total wirelength and the number of routing bends.

Towards collaborative intelligence: routability estimation based on decentralized private data

  • Jingyu Pan
  • Chen-Chia Chang
  • Zhiyao Xie
  • Ang Li
  • Minxue Tang
  • Tunhou Zhang
  • Jiang Hu
  • Yiran Chen

Applying machine learning (ML) in the design flow is a popular trend in Electronic Design Automation (EDA) with various applications from design quality predictions to optimizations. Despite its promise, which has been demonstrated in both academic research and industrial tools, its effectiveness largely hinges on the availability of a large amount of high-quality training data. In reality, EDA developers have very limited access to the latest design data, which is owned by design companies and mostly confidential. Although one can commission ML model training to a design company, the data of a single company might still be inadequate or biased, especially for small companies. This data availability problem is becoming the limiting constraint on future growth of ML for chip design. In this work, we propose a federated-learning-based approach for well-studied ML applications in EDA. Our approach allows an ML model to be collaboratively trained with data from multiple clients but without explicit access to the data, thereby respecting their data privacy. To further strengthen the results, we co-design a customized ML model FLNet and its personalization under the decentralized training scenario. Experiments on a comprehensive dataset show that collaborative training improves accuracy by 11% compared with individual local models, and our customized model FLNet significantly outperforms the best of previous routability estimators in this collaborative training flow.
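
The collaborative training idea can be sketched with the generic federated averaging scheme, in which clients train locally and share only model weights with a server. This is an illustrative assumption about the training loop, not a description of FLNet. The one-parameter linear model, the client datasets, and the hyperparameters below are all hypothetical.

```python
def local_update(weight, data, learning_rate=0.01, epochs=5):
    """A client refines the shared weight on its private (x, y) samples."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * grad
    return weight

def federated_round(global_weight, client_datasets):
    """The server averages client weights, weighted by local sample count.

    Only weights cross the client boundary; raw samples stay local.
    """
    total = sum(len(data) for data in client_datasets)
    updates = [(local_update(global_weight, data), len(data))
               for data in client_datasets]
    return sum(w * n for w, n in updates) / total

# Hypothetical private datasets, all drawn from y = 3x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)],
    [(0.5, 1.5)],
]

weight = 0.0
for _ in range(50):
    weight = federated_round(weight, clients)
```

The smallest client here holds a single sample, yet it still ends up with a model fitted on all six samples. That mirrors the abstract's point that small companies benefit most from pooled training.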

A2-ILT: GPU accelerated ILT with spatial attention mechanism

  • Qijing Wang
  • Bentian Jiang
  • Martin D. F. Wong
  • Evangeline F. Y. Young

Inverse lithography technology (ILT) is one of the promising resolution enhancement techniques (RETs) in modern design-for-manufacturing closure; however, it suffers from huge computational overhead and unaffordable mask writing time. In this paper, we propose A2-ILT, a GPU-accelerated ILT framework with a spatial attention mechanism. Based on the previous GPU-accelerated ILT flow, we significantly improve the ILT quality by introducing a spatial attention map and on-the-fly mask rectilinearization, and strengthen the robustness by Reinforcement-Learning deployment. Experimental results show that, compared to the state-of-the-art solutions, A2-ILT achieves 5.06% and 11.60% reduction in printing error and process variation band, respectively, with a lower mask complexity and superior runtime performance.

Generic lithography modeling with dual-band optics-inspired neural networks

  • Haoyu Yang
  • Zongyi Li
  • Kumara Sastry
  • Saumyadip Mukhopadhyay
  • Mark Kilgard
  • Anima Anandkumar
  • Brucek Khailany
  • Vivek Singh
  • Haoxing Ren

Lithography simulation is a critical step in VLSI design and optimization for manufacturability. Existing solutions for highly accurate lithography simulation with rigorous models are computationally expensive and slow, even when equipped with various approximation techniques. Recently, machine learning has provided alternative solutions for lithography simulation tasks such as coarse-grained edge placement error regression and complete contour prediction. However, the impact of these learning-based methods has been limited due to restrictive usage scenarios or low simulation accuracy. To tackle these concerns, we introduce a dual-band optics-inspired neural network design that considers the optical physics underlying lithography. To the best of our knowledge, our approach yields the first published via/metal layer contour simulation at 1 nm²/pixel resolution with any tile size. Compared to previous machine learning based solutions, we demonstrate that our framework can be trained much faster and offers a significant improvement in efficiency and image quality with a 20× smaller model size. We also achieve 85× simulation speedup over a traditional lithography simulator with ~1% accuracy loss.

Statistical computing framework and demonstration for in-memory computing systems

  • Bonan Zhang
  • Peter Deaville
  • Naveen Verma

With the increasing importance of data-intensive workloads, such as AI, in-memory computing (IMC) has demonstrated substantial energy/throughput benefits by addressing both compute and data-movement/accessing costs, and holds significant further promise by its ability to leverage emerging forms of highly-scaled memory technologies. However, IMC fundamentally derives its advantages through parallelism, which poses a trade-off with SNR, whereby variations and noise in nanoscaled devices directly limit possible gains. In this work, we propose novel training approaches to improve model tolerance to noise via a contrastive loss function and a progressive training procedure. We further propose a methodology for modeling and calibrating hardware noise, efficiently at the level of a macro operation and through a limited number of hardware measurements. The approaches are demonstrated on a fabricated MRAM-based IMC prototype in 22nm FD-SOI, together with a neural network training framework implemented in PyTorch. For CIFAR-10/100 classifications, model performance is restored to the level of ideal noise-free execution, and generalized performance of the trained model deployed across different chips is demonstrated.

Write or not: programming scheme optimization for RRAM-based neuromorphic computing

  • Ziqi Meng
  • Yanan Sun
  • Weikang Qian

One main fault-tolerant method for a neural network accelerator based on resistive random access memory crossbars is the programming-based method, which is also known as write-and-verify (W-V). In the basic W-V scheme, all devices in crossbars are programmed repeatedly until they are close enough to their targets, which incurs a huge overhead. To reduce the cost, we optimize the W-V scheme by proposing a probabilistic termination criterion on a single device and a systematic optimization method on multiple devices. Furthermore, we propose a joint algorithm that assists the novel W-V scheme by incremental retraining, which further reduces the W-V cost. Compared to the basic W-V scheme, our proposed method improves the accuracy by 0.23% for ResNet18 on CIFAR10 with only 9.7% of the W-V cost under variation with σ = 1.2.
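
The basic W-V loop described above can be sketched as follows, under a deliberately simple device model in which every write lands at the target plus Gaussian noise. The noise model, the tolerance, and the target values are assumptions for illustration. The paper's probabilistic termination criterion is not reproduced here.

```python
import random

def write_and_verify(target, tolerance, write_noise, rng, max_pulses=1000):
    """Reprogram one device until the read-back value is close enough.

    Returns the final value and the number of write pulses used.
    """
    pulses = 0
    while pulses < max_pulses:
        value = target + rng.gauss(0.0, write_noise)  # one noisy write
        pulses += 1
        if abs(value - target) <= tolerance:          # verify step
            break
    return value, pulses

rng = random.Random(0)
targets = [0.1, 0.4, 0.7, 0.9]  # hypothetical normalized conductances
results = [write_and_verify(t, tolerance=0.05, write_noise=0.1, rng=rng)
           for t in targets]
total_pulses = sum(pulses for _, pulses in results)
```

The pulse count per device is the cost the abstract seeks to reduce: a tighter tolerance or a noisier device raises the expected number of pulses.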

ReSMA: accelerating approximate string matching using ReRAM-based content addressable memory

  • Huize Li
  • Hai Jin
  • Long Zheng
  • Yu Huang
  • Xiaofei Liao
  • Zhuohui Duan
  • Dan Chen
  • Chuangyi Gui

Approximate string matching (ASM) functions as the basic operation kernel for a large number of string processing applications. Existing Von-Neumann-based ASM accelerators suffer from huge intermediate data with the ever-increasing string data, leading to massive off-chip data transmissions. This paper presents a novel ASM processing-in-memory (PIM) accelerator, namely ReSMA, based on ReCAM- and ReRAM-arrays to eliminate the off-chip data transmissions in ASM. We develop a novel ReCAM-friendly filter-and-filtering algorithm to process the q-grams filtering in ReCAM memory. We also design a new data mapping strategy and a new verification algorithm, which enable computing the edit distances entirely in ReRAM crossbars for energy saving. Experimental results show that ReSMA outperforms the CPU-, GPU-, FPGA-, ASIC-, and PIM-based solutions by 268.7×, 38.6×, 20.9×, 707.8×, and 14.7× in terms of performance, and 153.8×, 42.2×, 31.6×, 18.3×, and 5.3× in terms of energy-saving, respectively.
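
The two stages mentioned above, q-gram filtering followed by edit-distance verification, can be sketched in software as follows. The sketch uses the textbook q-gram count bound and the standard dynamic-programming edit distance. It does not reproduce ReSMA's filter-and-filtering algorithm or its in-memory mapping, and the function names and example strings are illustrative.

```python
from collections import Counter

def qgrams(text, q):
    """Multiset of all length-q substrings of text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))

def passes_qgram_filter(a, b, q, k):
    """Necessary condition for edit distance <= k (q-gram count bound)."""
    shared = sum((qgrams(a, q) & qgrams(b, q)).values())
    required = max(len(a), len(b)) - q + 1 - k * q
    return shared >= required

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]

def approximate_matches(query, candidates, q=2, k=1):
    """Cheap filter first, exact verification only on the survivors."""
    survivors = [c for c in candidates if passes_qgram_filter(query, c, q, k)]
    return [c for c in survivors if edit_distance(query, c) <= k]

matches = approximate_matches(
    "kitten", ["kitten", "sitten", "sitting", "mitten", "banana"])
```

Here the filter rejects "sitting" and "banana" before any edit distance is computed. This is the purpose of the filtering stage: the expensive verification runs only on candidates that could still match.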

VStore: in-storage graph based vector search accelerator

  • Shengwen Liang
  • Ying Wang
  • Ziming Yuan
  • Cheng Liu
  • Huawei Li
  • Xiaowei Li

Graph-based vector search, which finds best matches to user queries based on their semantic similarities using a graph data structure, has become instrumental in data science and AI applications. However, deploying graph-based vector search in production systems requires high accuracy and cost-efficiency with low latency and memory footprint, which existing work fails to offer. We present VStore, a graph-based vector search solution that collaboratively optimizes accuracy, latency, memory, and data movement on large-scale vector data based on in-storage computing. The evaluation shows that VStore exhibits significant search efficiency improvement and energy reduction while attaining accuracy over CPU, GPU, and ZipNN platforms.

Scaled-CBSC: scaled counting-based stochastic computing multiplication for improved accuracy

  • Shuyuan Yu
  • Sheldon X.-D. Tan

Stochastic computing (SC) can lead to area-efficient implementations of logic designs. Existing SC multiplication, however, suffers from a long-standing problem: large multiplication error with small inputs due to its intrinsic nature of bit-stream based computing. In this article, we propose a new scaled counting-based SC multiplication approach, called Scaled-CBSC, to mitigate this issue by introducing scaling bits to ensure the bit ‘1’ density of the stochastic number is sufficiently large. The idea is to convert the “small” inputs to “large” inputs, thus improving the accuracy of SC multiplication. But different from an existing stream-bit based approach, the new method uses the binary format and does not require stochastic addition as the SC multiplication always starts with binary numbers. Furthermore, Scaled-CBSC only requires all the numbers to be larger than 0.5 instead of an arbitrarily defined threshold, which leads to integer numbers only for the scaling term. The experimental results show that the 8-bit Scaled-CBSC multiplication with 3 scaling bits can achieve up to 46.6% and 30.4% improvements in mean error and standard deviation, respectively; reduce the peak relative error from 100% to 1.8%; and improve delay, area, area-delay product, and energy consumption by 12.6%, 51.5%, 57.6%, and 58.4%, respectively, over the state-of-the-art work. Furthermore, we evaluate the proposed multiplication approach in a discrete cosine transformation (DCT) application. The results show that with 3 scaling bits, 8-bit scaled counting-based SC multiplication can improve the image quality by 5.9 dB over the state-of-the-art work on average.
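
The small-input problem can be seen in a sketch of conventional bit-stream SC multiplication, where two unipolar streams are combined with a bitwise AND. This shows only that conventional scheme, not the counting-based or scaled method proposed above. The stream lengths, seeds, and input values are arbitrary assumptions.

```python
import random

def to_bitstream(value, length, rng):
    """Encode a value in [0, 1] as a stream whose density of 1s is the value."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def sc_multiply(a, b, length, rng):
    """Multiply two values by AND-ing independent streams and counting 1s."""
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length

# Large inputs with a long stream: the estimate is close to 0.25.
large_estimate = sc_multiply(0.5, 0.5, 50_000, random.Random(0))

# Small inputs with a 256-bit stream: the result is a multiple of 1/256,
# so it cannot get close to the true product 0.0006.
small_estimate = sc_multiply(0.02, 0.03, 256, random.Random(1))
small_relative_error = abs(small_estimate - 0.02 * 0.03) / (0.02 * 0.03)
```

With a 256-bit stream, the nearest representable outputs to 0.0006 are 0 and 1/256, so the relative error in the small-input case is at least about 100%. This is consistent with the peak relative error cited for prior work.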

Tailor: removing redundant operations in memristive analog neural network accelerators

  • Xingchen Li
  • Zhihang Yuan
  • Guangyu Sun
  • Liang Zhao
  • Zhichao Lu

Analog in-situ computation based on memristive circuits has been regarded as a promising approach for designing high-performance and low-power neural network accelerators. However, despite the low-cost and highly parallel memristive crossbars, the peripheral circuits, especially analog-to-digital converters (ADCs), induce significant overhead. Quantitative analysis shows that ADCs can contribute up to 91% of energy consumption and 72% of chip area, which significantly offsets the advantages of memristive NN accelerators.

To address this problem, we first mathematically analyze the computation flow in a memristive accelerator, and find that there are many useless operations. These operations significantly increase the demand for peripheral circuits. Then, based on our discovery, we propose a novel architecture, Tailor, which removes these unnecessary operations without accuracy loss. We design two types of Tailor. General Tailor is compatible with most existing memristive accelerators and can be easily applied to them. Customized Tailor is specialized for a certain NN application and can obtain more improvement. Experimental results show that General Tailor can reduce inference time by 14%–20% and energy consumption by 33%–41%. Customized Tailor can further achieve 56%–87% higher computation density.

Domain knowledge-infused deep learning for automated analog/radio-frequency circuit parameter optimization

  • Weidong Cao
  • Mouhacine Benosman
  • Xuan Zhang
  • Rui Ma

The design automation of analog circuits is a longstanding challenge. This paper presents a reinforcement learning method enhanced by graph learning to automate the analog circuit parameter optimization at the pre-layout stage, i.e., finding device parameters to fulfill desired circuit specifications. Unlike all prior methods, our approach is inspired by human experts who rely on domain knowledge of analog circuit design (e.g., circuit topology and couplings between circuit specifications) to tackle the problem. By originally incorporating such key domain knowledge into policy training with a multimodal network, the method best learns the complex relations between circuit parameters and design targets, enabling optimal decisions in the optimization process. Experimental results on exemplary circuits show it achieves human-level design accuracy (~99%) with 1.5× the efficiency of the best-performing existing methods. Our method also shows better generalization ability to unseen specifications and optimality in circuit performance optimization. Moreover, it applies to designing radio-frequency circuits on emerging semiconductor technologies, breaking the limitations of prior learning methods in designing conventional analog circuits.

A cost-efficient fully synthesizable stochastic time-to-digital converter design based on integral nonlinearity scrambling

  • Qiaochu Zhang
  • Shiyu Su
  • Mike Shuo-Wei Chen

Stochastic time-to-digital converters (STDCs) are gaining increasing interest in submicron CMOS analog/mixed-signal design for their superior tolerance to nonlinear quantization levels. However, the large number of required delay units and time comparators for conventional STDC operation incurs excessive implementation costs. This paper presents a fully synthesizable STDC architecture based on an integral nonlinearity (INL) scrambling technique, allowing order-of-magnitude cost reduction. The proposed technique randomizes and averages the STDC INL using a digital-to-time converter. Moreover, we propose an associated design automation flow and demonstrate an STDC design in a 12 nm FinFET process. Post-layout simulations show significant linearity and area/power efficiency improvements compared to prior art.

Using machine learning to optimize graph execution on NUMA machines

  • Hiago Mayk G. de A. Rocha
  • Janaina Schwarzrock
  • Arthur F. Lorenzon
  • Antonio Carlos S. Beck

This paper proposes PredG, a Machine Learning framework to enhance the graph processing performance by finding the ideal thread and data mapping on NUMA systems. PredG is agnostic to the input graph: it uses the available graphs’ features to train an ANN to perform predictions as new graphs arrive, without any application execution after being trained. When evaluating PredG over representative graphs and algorithms on three NUMA systems, its solutions are up to 41% faster than the Linux OS Default and the Best Static, on average within 2% of the Oracle, and present lower energy consumption.

HCG: optimizing embedded code generation of simulink with SIMD instruction synthesis

  • Zhuo Su
  • Zehong Yu
  • Dongyan Wang
  • Yixiao Yang
  • Yu Jiang
  • Rui Wang
  • Wanli Chang
  • Jiaguang Sun

Simulink is widely used for the model-driven design of embedded systems. It is able to generate optimized embedded control software code through expression folding, variable reuse, etc. However, for some commonly used computing-sensitive models, such as the models for signal processing applications, the efficiency of the generated code is still limited.

In this paper, we propose HCG, an optimized code generator for Simulink models with SIMD instruction synthesis. It selects the optimal implementations for intensive computing actors based on adaptive pre-calculation of the input scales, and synthesizes the appropriate SIMD instructions for batch computing actors based on iterative dataflow graph mapping. We implemented HCG and evaluated its performance on benchmark Simulink models. Compared to the built-in Simulink Coder and the most recent DFSynth, the code generated by HCG achieves an improvement of 38.9%–92.9% and 41.2%–76.8% in terms of execution time across different architectures and compilers, respectively.

Raven: a novel kernel debugging tool on RISC-V

  • Hongyi Lu
  • Fengwei Zhang

Debugging is an essential part of kernel development. However, debugging features are not available on RISC-V without the use of external hardware. In this paper, we leverage a security feature called Physical Memory Protection (PMP) as a debugging primitive to address this issue. Based on this debugging primitive, we design Raven, a novel kernel debugging tool with the standard functionalities (breakpoints, watchpoints, stepping, introspection). A prototype of Raven is implemented on a SiFive Unmatched development board. Our experiments show that Raven imposes a moderate but acceptable overhead to the kernel. Moreover, a real-world debugging scenario is set up to test its effectiveness.

GTuner: tuning DNN computations on GPU via graph attention network

  • Qi Sun
  • Xinyun Zhang
  • Hao Geng
  • Yuxuan Zhao
  • Yang Bai
  • Haisheng Zheng
  • Bei Yu

Compiling DNN models on GPUs and improving their performance is an open problem. A novel framework, GTuner, is proposed to jointly learn from the structures of computational graphs and the statistical features of codes to find the optimal code implementations. A Graph ATtention network (GAT) is designed as the performance estimator in GTuner. In GAT, graph neural layers are used to propagate the information in the graph and a multi-head self-attention module is designed to learn the complicated relationships between the features. Under the guidance of GAT, the GPU codes are generated through auto-tuning. Experimental results demonstrate that our method outperforms prior art remarkably.

Pref-X: a framework to reveal data prefetching in commercial in-order cores

  • Quentin Huppert
  • Francky Catthoor
  • Lionel Torres
  • David Novo

Computer system simulators are major tools used by architecture researchers to develop and evaluate new ideas. Clearly, such evaluations are more conclusive when compared to commercial state-of-the-art architectures. However, the behavior of key components in existing processors is often not disclosed, complicating the construction of faithful reference models. The data prefetching engine is one such obscured component that can have a significant impact on key metrics such as performance and energy.

In this paper, we propose Pref-X, a framework to analyze functional characteristics of data prefetching in commercial in-order cores. Our framework reveals data prefetches by X-raying into the cache memory at the request granularity, which allows linking memory access patterns with changes in the cache content. To demonstrate the power and accuracy of our methodology, we use Pref-X to replicate the data prefetching mechanisms of two representative processors, namely the Arm Cortex-A7 and the Arm Cortex-A53, with a 99.8% and 96.9% average accuracy, respectively.

Architecting DDR5 DRAM caches for non-volatile memory systems

  • Xin Xin
  • Wanyi Zhu
  • Li Zhao

With the release of Intel’s Optane DIMM, Non-Volatile Memories (NVMs) are emerging as viable alternatives to DRAM memories because of the advantage of higher capacity. However, the higher latency and lower bandwidth of Optane prevent it from outright replacing DRAM. A prevailing strategy is to employ existing DRAM as a data cache for Optane, thereby achieving overall benefit in capacity, bandwidth, and latency.

In this paper, we inspect new features in DDR5 to better support the DRAM cache design for Optane. Specifically, we leverage the two-level ECC scheme, i.e., DIMM ECC and on-die ECC, in DDR5 to construct a narrower channel for tag probing and propose a new operation for fast cache replacement. Experimental results show that our proposed strategy can achieve, on average, 26% performance improvement.

GraphRing: an HMC-ring based graph processing framework with optimized data movement

  • Zerun Li
  • Xiaoming Chen
  • Yinhe Han

Due to irregular memory access and high bandwidth demand, graph processing is usually inefficient on conventional computer architectures. The recent development of processing-in-memory (PIM) techniques such as the hybrid memory cube (HMC) has provided a feasible design direction for graph processing accelerators. Although PIM provides high internal bandwidth, inter-node memory access is inevitable in large-scale graph processing, which greatly affects the performance. In this paper, we propose an HMC-based graph processing framework, GraphRing. GraphRing is a software-hardware codesign framework that optimizes inter-HMC communication. It contains a regularity- and locality-aware graph execution model and a ring-based multi-HMC architecture. The evaluation results based on 5 graph datasets and 4 graph algorithms show that GraphRing achieves on average 2.14× speedup and 3.07× inter-HMC communication energy saving, compared with GraphQ, a state-of-the-art graph processing architecture.

AxoNN: energy-aware execution of neural network inference on multi-accelerator heterogeneous SoCs

  • Ismet Dagli
  • Alexander Cieslewicz
  • Jedidiah McClurg
  • Mehmet E. Belviranli

The energy and latency demands of critical workload execution, such as object detection, in embedded systems vary based on the physical system state and other external factors. Many recent mobile and autonomous Systems-on-Chip (SoCs) embed a diverse range of accelerators with unique power and performance characteristics. The execution flow of critical workloads can be adjusted to span multiple accelerators so that the trade-off between performance and energy fits the dynamically changing physical factors.

In this study, we propose running neural network (NN) inference on multiple accelerators of an SoC. Our goal is to enable an energy-performance trade-off by distributing the layers of an NN between a performance-efficient and a power-efficient accelerator. We first provide an empirical modeling methodology to characterize execution and inter-layer transition times. We then find an optimal layers-to-accelerator mapping by representing the trade-off as a linear programming optimization constraint. We evaluate our approach on the NVIDIA Xavier AGX SoC with commonly used NN models. We use the Z3 SMT solver to find schedules for different energy consumption targets, with up to 98% prediction accuracy.
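The layers-to-accelerator mapping problem described above can be illustrated with a minimal sketch. It is not the authors' formulation: the abstract reports a linear programming constraint solved with Z3, whereas this sketch simply enumerates every assignment, which is only practical for a handful of layers. The function name `best_mapping`, the cost tables, the single flat `transition_time`, and the `energy_budget` parameter are all hypothetical.

```python
from itertools import product


def best_mapping(time, energy, transition_time, energy_budget):
    """Return (total_time, assignment) with the lowest time under an energy budget.

    time[l][a] and energy[l][a] are hypothetical per-layer costs of running
    layer l on accelerator a. transition_time is a hypothetical flat penalty
    paid whenever two consecutive layers run on different accelerators.
    Returns None if no assignment fits the budget.
    """
    n_layers = len(time)
    n_accelerators = len(time[0])
    best = None
    # Exhaustive search: every layer may go to any accelerator.
    for assign in product(range(n_accelerators), repeat=n_layers):
        total_energy = sum(energy[l][a] for l, a in enumerate(assign))
        if total_energy > energy_budget:
            continue  # violates the energy target
        switches = sum(1 for p, q in zip(assign, assign[1:]) if p != q)
        total_time = (
            sum(time[l][a] for l, a in enumerate(assign))
            + switches * transition_time
        )
        if best is None or total_time < best[0]:
            best = (total_time, assign)
    return best


# Made-up numbers: accelerator 0 is faster, accelerator 1 uses less energy.
layer_time = [[1, 3], [2, 5], [1, 2]]
layer_energy = [[5, 2], [6, 3], [4, 2]]
print(best_mapping(layer_time, layer_energy, 0.5, 11))
```

Tightening `energy_budget` pushes more layers onto the lower-energy accelerator at the cost of total time, which is the trade-off the abstract describes.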

PIPF-DRAM: processing in precharge-free DRAM

  • Nezam Rohbani
  • Mohammad Arman Soleimani
  • Hamid Sarbazi-Azad

To alleviate costly data communication among processing cores and memory modules, parallel processing-in-memory (PIM) is a promising approach that exploits the huge available internal memory bandwidth. The high capacity, wide row size, and maturity of DRAM technology make DRAM an alluring structure for PIM. However, the dense layout, high process variation, and noise vulnerability of DRAMs make it very challenging to apply PIM to DRAMs in practice. This work proposes a PIM structure that eliminates these DRAM limitations by exploiting a precharge-free DRAM (PF-DRAM) structure. The proposed PIM structure, called PIPF-DRAM, performs parallel bitwise operations only by modifying control signal sequences in PF-DRAM, with almost zero structural and circuit modifications. Compared with state-of-the-art PIM techniques, PIPF-DRAM is 4.2× more robust to process variation, 4.1% faster in average cycle time of operations, and consumes 66.1% less energy.

TAIM: ternary activation in-memory computing hardware with 6T SRAM array

  • Nameun Kang
  • Hyungjun Kim
  • Hyunmyung Oh
  • Jae-Joon Kim

Recently, various in-memory computing accelerators for low-precision neural networks have been proposed. While in-memory Binary Neural Network (BNN) accelerators have achieved significant energy efficiency, BNNs show severe accuracy degradation compared to their full-precision counterparts. To mitigate the problem, we propose TAIM, an in-memory computing hardware that supports ternary activation with negligible hardware overhead. In TAIM, a 6T SRAM cell can compute the multiplication between a ternary activation and a binary weight. Since the 6T SRAM cell consumes no energy when the input activation is 0, the proposed TAIM hardware can achieve even higher energy efficiency than the BNN case by exploiting input 0’s. We fabricated the proposed TAIM hardware in a 28nm CMOS process and evaluated its energy efficiency on various image classification benchmarks. The experimental results show that the proposed TAIM hardware achieves approximately 3.61× higher energy efficiency on average compared to previous designs that support ternary activation.
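As a purely arithmetic illustration of the ternary-activation, binary-weight product described above, the following sketch computes a dot product and counts how many positions actually contribute. It assumes activations in {-1, 0, +1} and weights in {-1, +1}, an encoding the abstract does not spell out. The function name `ternary_binary_dot` is hypothetical, and the count of non-zero activations is only a rough software stand-in for the cells that would consume energy, not a model of the fabricated hardware.

```python
def ternary_binary_dot(activations, weights):
    """Dot product of ternary activations and binary weights.

    Assumed encodings (not taken from the paper): activations are in
    {-1, 0, +1} and weights are in {-1, +1}. Returns (sum, active), where
    active counts the positions with a non-zero activation.
    """
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    acc = 0
    active = 0
    for a, w in zip(activations, weights):
        if a not in (-1, 0, 1):
            raise ValueError("activation must be -1, 0, or +1")
        if w not in (-1, 1):
            raise ValueError("weight must be -1 or +1")
        if a == 0:
            continue  # a zero activation contributes nothing to the sum
        acc += a * w
        active += 1
    return acc, active


print(ternary_binary_dot([1, 0, -1, 0, 1], [1, -1, -1, 1, -1]))
```

The sparser the activations, the fewer positions take part in the sum, which mirrors the abstract's point about exploiting input zeros.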

PIM-DH: ReRAM-based processing-in-memory architecture for deep hashing acceleration

  • Fangxin Liu
  • Wenbo Zhao
  • Yongbiao Chen
  • Zongwu Wang
  • Zhezhi He
  • Rui Yang
  • Qidong Tang
  • Tao Yang
  • Cheng Zhuo
  • Li Jiang

Deep hashing has gained growing momentum in large-scale image retrieval. However, deep hashing is computation- and memory-intensive, which demands hardware acceleration. The unique process of hash sequence computation in deep hashing is non-trivial to accelerate due to the lack of an efficient compute primitive for Hamming distance calculation and ranking.

This paper proposes the first PIM-based scheme for deep hashing acceleration, namely PIM-DH. PIM-DH is supported by an algorithm and architecture co-design. The proposed algorithm compresses the hash sequence to increase retrieval efficiency by exploiting hash code sparsity without accuracy loss. Further, we design a lightweight circuit that assists the CAM to optimize hash computation efficiency. This design leads to an elegant extension of current PIM-based architectures that adapts to various hashing algorithms and to the arbitrary hash sequence sizes induced by pruning. Compared to the state-of-the-art software framework running on an Intel Xeon CPU and an NVIDIA RTX2080 GPU, PIM-DH achieves, on average, a speedup of 4.75E+03 with an energy reduction of 4.64E+05 over the CPU, and a speedup of 2.30E+02 with an energy reduction of 3.38E+04 over the GPU. Compared with the PIM architecture CASCADE, PIM-DH improves computing efficiency by 17.49× and energy efficiency by 41.38×.
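For readers unfamiliar with the retrieval step mentioned above, Hamming distance calculation and ranking can be sketched in software as follows. This is a plain CPU baseline under the assumption that hash codes are stored as Python integers. It does not reflect PIM-DH's compression algorithm or its CAM-based circuit, and the names `hamming_distance` and `rank_by_hamming` are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer hash codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query, database, top_k):
    """Return indices of the top_k database codes closest to the query.

    Ties are broken by the lower database index so the result is
    deterministic.
    """
    scored = sorted(
        enumerate(database),
        key=lambda item: (hamming_distance(query, item[1]), item[0]),
    )
    return [index for index, _ in scored[:top_k]]


codes = [0b1011, 0b0000, 0b1010, 0b0100]
print(rank_by_hamming(0b1011, codes, top_k=2))
```

The XOR followed by a bit count is the per-pair primitive, and the sort is the ranking step. Both run once per query over the whole database, which is the workload the abstract argues needs hardware acceleration.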

YOLoC: deploy large-scale neural network by ROM-based computing-in-memory using residual branch on a chip

  • Yiming Chen
  • Guodong Yin
  • Zhanhong Tan
  • Mingyen Lee
  • Zekun Yang
  • Yongpan Liu
  • Huazhong Yang
  • Kaisheng Ma
  • Xueqing Li

Computing-in-memory (CiM