ISPD’23 TOC

ISPD ’23: Proceedings of the 2023 International Symposium on Physical Design


SESSION: Session 1: Opening Session and Keynote I

Automated Design of Chiplets

  • Alberto Sangiovanni-Vincentelli
  • Zheng Liang
  • Zhe Zhou
  • Jiaxi Zhang

Chiplet-based designs have gained recognition as a promising alternative to monolithic SoCs due to their lower manufacturing costs, improved re-usability, and optimized technology specialization. Despite progress made in various related domains, the design of chiplets remains largely reliant on manual processes. In this paper, we provide an examination of the historical evolution of chiplets, encompassing a review of crucial design considerations and a synopsis of recent advancements in relevant fields. Further, we identify and examine the opportunities and challenges in the automated design of chiplets. To further demonstrate the potential of this nascent area, we present a novel task that

SESSION: Session 2: Routing

FastPass: Fast Pin Access Analysis with Incremental SAT Solving

  • Fangzhou Wang
  • Jinwei Liu
  • Evangeline F.Y. Young

Pin access analysis is a critical step in detailed routing. With complicated design rules and pin shapes, efficient and accurate pin accessibility evaluation is desirable in many physical design scenarios. To this end, we present FastPass, a fast and robust pin access analysis framework, which first generates design rule checking (DRC)-clean pin access route candidates for each pin, pre-computes incompatible pairs of routes, and then uses incremental SAT solving to find an optimized pin access scheme. Experimental results on the ISPD 2018 benchmarks show that FastPass produces DRC-clean pin access schemes for all cases while being 14.7× faster on average than the best-known pin access analysis framework.
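
As a rough illustration of the incremental-SAT idea described above (a minimal sketch, not FastPass itself; the candidate routes, conflict pairs, and DRC callback are hypothetical), one route per pin can be selected with a single reusable solver to which newly discovered incompatibilities are added as blocking clauses, for example with the python-sat package:

    from itertools import count
    from pysat.solvers import Glucose3   # pip install python-sat

    # Hedged sketch: variables are (pin, candidate-route) pairs; each pin must
    # take at least one route, and incompatible pairs are excluded. New
    # conflicts found by an external check are added lazily while the same
    # solver instance (and its learned clauses) is reused.
    def pick_routes(candidates, incompatible, check_new_conflicts):
        var, ids, solver = {}, count(1), Glucose3()
        for pin, routes in candidates.items():
            lits = [var.setdefault((pin, r), next(ids)) for r in routes]
            solver.add_clause(lits)                   # each pin gets >= 1 route
        for a, b in incompatible:                     # pre-computed conflicts
            solver.add_clause([-var[a], -var[b]])
        while solver.solve():
            assignment = set(solver.get_model())
            chosen = {k for k, v in var.items() if v in assignment}
            new = check_new_conflicts(chosen)         # e.g., DRC between pins
            if not new:
                return chosen
            for a, b in new:                          # refine incrementally
                solver.add_clause([-var[a], -var[b]])
        return None                                   # no feasible scheme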

Pin Access-Oriented Concurrent Detailed Routing

  • Yun-Jhe Jiang
  • Shao-Yun Fang

Due to continuously shrinking feature sizes and increasing design complexity, pin access has become one of the most critical challenges in large-scale full-chip routing. State-of-the-art pin access-aware detailed routing techniques suffer from either the ordering problem of the sequential routing scheme or the inflexibility of pre-determining an access point for each pin. Some other routing-related studies create pin extensions with Metal-2 segments to optimize pin accessibility; however, this strategy may not be practical without considering the contemporary routing flow. This paper presents a pin access-oriented concurrent detailed routing approach conducted after the track assignment stage. The core detailed routing engine is based on an integer linear programming (ILP) formulation, which has lower complexity than an existing formulation and can flexibly handle multi-pin nets. In addition, to maximize the free routing resource and keep the problem size tractable, a pre-processing flow that trims redundant metal and inserts assistant metal is developed. The experimental results show that, compared to a state-of-the-art academic router, the proposed concurrent scheme can effectively derive good results with fewer design rule violations and less runtime.
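
For readers unfamiliar with ILP-based route selection, the following toy model (an illustrative sketch on invented data, not the paper's formulation) picks exactly one candidate route per net after track assignment while forbidding overlapping candidates, using PuLP:

    import pulp   # pip install pulp

    # Hedged toy ILP: candidate routes and conflicts below are made up.
    nets = {"n1": {"a": 5, "b": 7},          # candidate route -> wirelength
            "n2": {"c": 4, "d": 6}}
    conflicts = [("a", "c")]                 # candidates that share a resource

    prob = pulp.LpProblem("detailed_routing", pulp.LpMinimize)
    x = {r: pulp.LpVariable(f"x_{r}", cat="Binary")
         for routes in nets.values() for r in routes}
    # Objective: total wirelength of the chosen routes.
    prob += pulp.lpSum(cost * x[r] for routes in nets.values()
                       for r, cost in routes.items())
    for routes in nets.values():             # each net picks exactly one route
        prob += pulp.lpSum(x[r] for r in routes) == 1
    for a, b in conflicts:                   # overlapping candidates exclude each other
        prob += x[a] + x[b] <= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=0))
    print({r: int(v.value()) for r, v in x.items()})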

Reinforcement Learning Guided Detailed Routing for Custom Circuits

  • Hao Chen
  • Kai-Chieh Hsu
  • Walker J. Turner
  • Po-Hsuan Wei
  • Keren Zhu
  • David Z. Pan
  • Haoxing Ren

Detailed routing is the most tedious and complex procedure in design automation and has become a determining factor in layout automation in advanced manufacturing nodes. Despite continuing advances in custom integrated circuit (IC) routing research, industrial custom layout flows remain heavily manual due to the high complexity of the custom IC design problem. Besides conventional design objectives such as wirelength minimization, custom detailed routing must also accommodate additional constraints (e.g., path-matching) across the analog/mixed-signal (AMS) and digital domains, making an already challenging procedure even more so. This paper presents a novel detailed routing framework for custom circuits that leverages deep reinforcement learning to optimize routing patterns while considering custom routing constraints and industrial design rules. Comprehensive post-layout analyses based on industrial designs demonstrate the effectiveness of our framework in dealing with the specified constraints and producing sign-off-quality routing solutions.

Voltage-Drop Optimization Through Insertion of Extra Stripes to a Power Delivery Network

  • Jai-Ming Lin
  • Yu-Tien Chen
  • Yang-Tai Kung
  • Hao-Jia Lin

As design complexity increases, power delivery network (PDN) optimization becomes an ever more important step in a modern design flow. To construct a robust PDN, most classic PDN optimization methods focus on adjusting the dimensions of power stripes. However, this approach becomes infeasible when voltage violation regions also suffer from severe routing congestion. Hence, this paper proposes a careful procedure that inserts additional power stripes to reduce voltage violations while maintaining routability. First, regions with high IR drop are identified to reveal the locations that need more current. Then, we solve a minimum-cost flow problem to find the topologies of power delivery paths (PDPs) from power sources to these regions and to determine the widths of edges in each PDP so that enough current can be delivered. Moreover, vertical power stripes (VPSs) are inserted by dynamic programming at locations with less routing congestion and severe voltage violations, reducing the likelihood of degrading routability. Finally, more wires are inserted into the high-IR-drop regions if voltage violations remain. Experimental results show that our method uses much less routing resource and induces less routing congestion while meeting the IR-drop constraint on industrial designs.
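
The minimum-cost flow step can be pictured with a small hypothetical example (not the paper's exact model): power sources act as supplies, high-IR-drop regions as demands, and candidate stripe segments as capacitated, weighted edges; the resulting edge flows suggest both the PDP topology and how much current each edge must carry.

    import networkx as nx

    # Illustrative sketch with made-up numbers: negative demand = supply.
    G = nx.DiGraph()
    G.add_node("pad", demand=-5)        # power source supplies 5 current units
    G.add_node("regionA", demand=3)     # high-IR-drop region needing 3 units
    G.add_node("regionB", demand=2)     # high-IR-drop region needing 2 units
    # capacity ~ available routing resource, weight ~ congestion cost (assumed)
    G.add_edge("pad", "m1", capacity=4, weight=1)
    G.add_edge("pad", "m2", capacity=3, weight=2)
    G.add_edge("m1", "regionA", capacity=3, weight=1)
    G.add_edge("m1", "regionB", capacity=2, weight=3)
    G.add_edge("m2", "regionB", capacity=2, weight=1)

    flow = nx.min_cost_flow(G)          # edge -> current assignment
    print(flow["pad"])                  # how much each branch must carry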

NVCell 2: Routability-Driven Standard Cell Layout in Advanced Nodes with Lattice Graph Routability Model

  • Chia-Tung Ho
  • Alvin Ho
  • Matthew Fojtik
  • Minsoo Kim
  • Shang Wei
  • Yaguang Li
  • Brucek Khailany
  • Haoxing Ren

Standard cells are essential components of modern digital circuit designs. With process technologies advancing beyond the 5nm node, more routability issues have arisen due to the decreasing number of routing tracks, increasing number and complexity of design rules, and strict patterning rules. Automatic standard cell synthesis tools struggle to design cells with severe routability issues. In this paper, we propose a routability-driven standard cell synthesis framework using a novel pin-density-aware congestion metric, a lattice graph routability modelling approach, and a dynamic external pin allocation methodology to generate routability-optimized layouts. On a benchmark of 94 complex and hard-to-route standard cells, NVCell 2 improves the number of routable and LVS/DRC-clean cell layouts by 84.0% and 87.2%, respectively. NVCell 2 can generate 98.9% of cells LVS/DRC clean, with 13.9% of the cells having smaller area, compared to an industrial standard cell library with over 1000 standard cells.

SESSION: Session 3: 3D ICs, Heterogeneous Integration, and Packaging I

FXT-Route: Efficient High-Performance PCB Routing with Crosstalk Reduction Using Spiral Delay Lines

  • Meng Lian
  • Yushen Zhang
  • Mengchu Li
  • Tsun-Ming Tseng
  • Ulf Schlichtmann

In high-performance printed circuit boards (PCBs), adding serpentine delay lines is the most prevalent delay-matching technique to balance the delays of time-critical signals. Serpentine topology, however, can induce simultaneous accumulation of the crosstalk noise, resulting in erroneous logic gate triggering and speed-up effects. The state-of-the-art approach for crosstalk alleviation achieves waveform integrity by enlarging wire separation, resulting in an increased routing area. We introduce a method that adopts spiral delay lines for delay matching to mitigate the speed-up effect by spreading the crosstalk noise uniformly in time. Our method avoids possible routing congestion while achieving a high density of transmission lines. We implement our method by constructing a mixed-integer-linear programming (MILP) model for routing and a quadratic programming (QP) model for spiral synthesis. Experimental results demonstrate that our method requires, on average, 31% less routing area than the original design. In particular, compared to the state-of-the-art approach, our method can reduce the magnitude of the crosstalk noise by at least 69%.

On Legalization of Die Bonding Bumps and Pads for 3D ICs

  • Sai Pentapati
  • Anthony Agnesina
  • Moritz Brunion
  • Yen-Hsiang Huang
  • Sung Kyu Lim

State-of-the-art 3D IC Place-and-Route flows were designed with older technology nodes and aggressive bonding pitch assumptions. As a result, these flows fail to honor the width and spacing rules for the 3D vias with realistic pitch values. We propose a critical new 3D via legalization stage during routing to reduce such violations. A force-based solver and bipartite-matching algorithm with Bayesian optimization are presented as viable legalizers and are compatible with various process nodes, bonding technologies, and partitioning types. With the modified 3D routing, we reduce the 3D via violations by more than 10× with zero impact on performance, power, or area.
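
The bipartite-matching ingredient can be sketched as a minimum-displacement assignment of bumps to on-pitch legal sites (an illustrative example with an assumed pitch and site grid, not the paper's legalizer, which also combines a force-based solver and Bayesian optimization):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Hedged sketch: assign each 3D via/bump to one legal on-pitch site so
    # that total displacement is minimized; width/spacing violations are
    # removed by construction because sites are on the legal pitch.
    def legalize(bumps, pitch=2.0):
        bumps = np.asarray(bumps, dtype=float)
        xs = np.arange(bumps[:, 0].min(), bumps[:, 0].max() + pitch, pitch)
        ys = np.arange(bumps[:, 1].min(), bumps[:, 1].max() + pitch, pitch)
        sites = np.array([(x, y) for x in xs for y in ys])
        cost = np.linalg.norm(bumps[:, None, :] - sites[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)   # min total displacement
        return sites[cols]                         # legal position per bump

    print(legalize([(0.3, 0.1), (0.4, 0.2), (3.7, 1.9)]))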

Reshaping System Design in 3D Integration: Perspectives and Challenges

  • Hung-Ming Chen
  • Chu-Wen Ho
  • Shih-Hsien Wu
  • Wei Lu
  • Po-Tsang Huang
  • Hao-Ju Chang
  • Chien-Nan Jimmy Liu

In this paper, we depict modern system design methodologies via 3D integration along with the advance of packaging, considering system prototyping, interconnecting, and physical implementation. The corresponding challenges are presented as well.

SESSION: Session 4: 3D ICs, Heterogeneous Integration, and Packaging II

Co-design for Heterogeneous Integration: A Failure Analysis Perspective

  • Erica Douglas
  • Julia Deitz
  • Timothy Ruggles
  • Daniel Perry
  • Damion Cummings
  • Mark Rodriguez
  • Nichole Valdez
  • Brad Boyce

As scaling for CMOS transistors asymptotically approaches the end of Moore’s Law, the need to push into 3D integration schemes to innovate capabilities is gaining significant traction. Further, rapid development of new semiconductor solutions, such as heterogeneous integration, has steered the semiconductor industry’s consistent march toward next-generation products into new arenas. In 2018, the Department of Energy Office of Science (DOE SC) released its “Basic Research Needs for Microelectronics,” communicating a strong push towards “parallel but intimately networked efforts to create radically new capabilities” [1], which it has coined “co-design.”

Advanced packaging and heterogeneous integration, particularly with mixed semiconductor materials (e.g., CMOS FPGAs and GaN RF amplifiers), is a realm ripe for DOE SC’s co-design call to action. In theory, development occurring at all scales across the semiconductor ecosystem, particularly across disciplines that are not traditionally adjacent, should significantly accelerate innovation. In reality, co-design requires a paradigm shift in approach, demanding more than interconnected parallel development. In particular, accurate ground-truth data during learning cycles is critical in order to effectively and efficiently communicate across disparate disciplines and inform design iterations across the microelectronics ecosystem.

This talk will outline three orthogonal facets of co-design for heterogeneous integration (HI): (1) on-going efforts towards development of materials characterization and failure analysis techniques to enable accurate evaluation of materials and heterogeneously integrated components, (2) development of artificial intelligence and machine learning algorithms for large-scale, high-throughput process development and characterization, and (3) development of capabilities for rapid communication and visualization of data across disparate disciplines.

Goal Driven PCB Synthesis Using Machine Learning and CloudScale Compute

  • Taylor Hogan

X AI is a cloud-based system that leverages machine learning and search to place and route printed circuit boards using physics-based analysis and high-level design. We propose a feedback-based Monte Carlo Tree Search (MCTS) algorithm to explore the space of possible designs. One or more metrics are given to evaluate the quality of designs as MCTS learns about possible solutions. A policy network and a value network are trained during exploration to accurately weight quality actions and identify useful design states. This is performed as a feedback loop in conjunction with other feedforward tools for placement and routing.

Gate-All-Around Technology is Coming: What’s Next After GAA?

  • Victor Moroz

Currently, the industry is transitioning from FinFETs to gate-all-around (GAA) technology and will likely have several GAA technology generations in the next few years. What’s next after that? This is the question that we are trying to answer in this project by benchmarking GAA technology with transistors on 2D materials and stacked transistors (CFETs).

The main objective for logic is to get a meaningful gain in power, performance, area, and cost (PPAC). The main objective for SRAM is to get a noticeable density scaling for the SRAM array and its periphery without losing performance and yield. Another objective is to move in the direction that has a promise of longer-term progress, such as to start stacking two layers of transistors before moving to a larger number of transistor layers. With that in mind, we explore and discuss the next steps beyond GAA technology.

SESSION: Session 5: Analog Design

VLSIR – A Modular Framework for Programming Analog & Custom Circuits & Layouts

  • Dan Fritchman

We present VLSIR, a modular and fully open-source framework for programming analog and custom circuits and layouts. VLSIR is centered around a protobuf-defined design database. It features high-productivity front-ends for hardware description (“circuit programming”), simulation, and custom layout programming, designed to be amenable to both human designers and automation.

Joint Optimization of Sizing and Layout for AMS Designs: Challenges and Opportunities

  • Ahmet F. Budak
  • Keren Zhu
  • Hao Chen
  • Souradip Poddar
  • Linran Zhao
  • Yaoyao Jia
  • David Z. Pan

Recent advances in analog device sizing algorithms show promising results for automatic schematic design. However, the majority of sizing algorithms are based on schematic-level simulations and are layout-agnostic. The physical layout implementation brings extra parasitics to the analog circuits, leading to discrepancies between schematic and post-layout performance. This performance gap raises questions about the effectiveness of automatic analog device sizing tools. Prior work has leveraged procedural layout generation to account for layout-induced parasitics in the sizing process. However, the need for layout templates limits the applicability of such a methodology. In this paper, we propose to bridge automatic analog sizing with post-layout performance using state-of-the-art optimization-based analog layout generators. A quantitative study is conducted to measure the impact of layout awareness in state-of-the-art device sizing algorithms. Furthermore, we present our perspectives on future directions in layout-aware analog circuit schematic design.

Learning from the Implicit Functional Hierarchy in an Analog Netlist

  • Helmut Graeb
  • Markus Leibl

Analog circuit design is characterized by a plethora of implicit design and technology aspects available to the experienced designer. In order to create useful computer-aided design methods, this implicit knowledge has to be captured in a systematic and hierarchical way. A key approach to this goal is to “learn” the knowledge from the netlist of an analog circuit. This requires a library of structural and functional blocks for analog circuits together with their individual constraints and performance equations, graph homomorphism techniques to recognize blocks that can have different structural implementations and I/O pins, as well as synthesis methods that exploit the learned knowledge. In this contribution, we will present how to make use of the functional and structural hierarchy of operational amplifiers. As an application, we explore the capabilities of machine learning in the context of structural and functional properties and show that the results can be substantially improved by pre-processing data with traditional methods for functional block analysis. This claim is validated on a data set of roughly 100,000 readily sized and simulated operational amplifiers.

The ALIGN Automated Analog Layout Engine: Progress, Learnings, and Open Issues

  • Sachin S. Sapatnekar

The ALIGN (Analog Layout, Intelligently Generated from Netlists) project [1, 2] is a joint university-industry effort to push the envelope of automated analog layout through a systematic new approach, novel algorithms, and open-source software [3]. Analog automation research has been active for several decades, but has not found widespread acceptance due to its general inability to meet the needs of the design community. Therefore, unlike digital design, which has a rich history of automation and extensive deployment of design tools, analog design is largely unautomated.

ALIGN attempts to overcome several of the major issues associated with this lack of success. First, to mimic the human designer’s ability to recognize sub-blocks and specify constraints, ALIGN has used machine learning (ML)-based methods to assist in these tasks. Second, to overcome the limitation of past automation approaches, which are largely specific to a class of designs, ALIGN attempts to create a truly general layout engine by decomposing the layout automation process into a set of steps, with constraints specific to each family of circuits; the circuits are divided into four classes: low-frequency components (e.g., analog-to-digital converters (ADCs), amplifiers, and filters); wireline components for high-speed links (e.g., equalizers, clock/data recovery circuits, and phase interpolators); RF/wireless components (e.g., components of RF transmitters and receivers); and power delivery components (e.g., capacitor- and inductor-based DC-DC converters and low dropout (LDO) regulators). For each class of circuits, different sets of constraints are important, depending on their frequency, parasitic sensitivity, need for matching, etc., and ALIGN creates a unified methodological framework that can address each class. Third, in each step, ALIGN has generated new algorithms and approaches to help improve the performance of analog layout. Fourth, given that experienced analog designers desire greater visibility into the process and input into the way that design is carried out, ALIGN is built modularly, providing multiple entry points at which a designer may intervene in the process.

Analog Layout Automation On Advanced Process Technologies

  • Soner Yaldiz

Despite the digitization of analog and the disaggregated silicon trends, high-volume or high-performance system-on-chip (SoC) designs integrate numerous analog and mixed-signal (AMS) intellectual property (IP) blocks including voltage regulators, clock generators, sensors, memory and other interfaces. For example, fine-grain dynamic voltage and frequency scaling requires a dedicated clock generator and voltage regulator per compute unit. The design of these blocks in advanced FinFET or GAAFET technologies is challenging due to (i) the increasing gap between schematic and post-layout simulation, (ii) design rule complexity, and (iii) strict reliability rules [1]. The convergence of a high-performance or high-power block may require multiple iterations of circuit sizing and layout changes. As a result, physical design, which is primarily a manual effort, has become a key bottleneck in the design process. Migrating these blocks across process technologies or process variants only exacerbates the problem. Layout synthesis for AMS IP blocks is an on-going research problem with a long history [2] and has recently been gaining more attention as it leverages the latest advances in machine learning [3]. Yet neither template- nor optimization-based approaches have significantly reduced the burden for high-performance products on leading process technologies.

This talk will first overview physical design of AMS IP blocks on an advanced process technology, highlighting the opportunities and the expectations from layout automation during this process. On a new process technology, this process starts with conducting early layout studies on a selection of critical high-performance or high-power subcircuits. In parallel, the IP blocks are placed in a bottom-up fashion to optimize the IP floorplan but also to provide information to SoC floorplanning. Routing follows the placement to verify the post-layout performance. A quick turnaround during these explorations is vital to decide on any architectural changes or circuit re-sizing. The rest of the talk will share experiences with piloting an open-source analog layout synthesis tool flow [4] on a 22nm FinFET technology for voltage regulators [5].

The learnings from this exercise and the resulting extensions to the tool flow will be summarized, including a Boolean satisfiability-based routing algorithm, a formally verifiable constraint language, and the use of parameterized and standard cells. The talk will conclude with opportunities for research.

SESSION: Session 6: Keynote II

Immersion and EUV Lithography: Two Pillars to Sustain Single-Digit Nanometer Nodes

  • Burn J. Lin

Semiconductor technology has advanced to single-digit nanometer dimensions for the circuit elements. The minimum feature size has reached subwavelength dimensions. Many resolution enhancement techniques have been developed to extend the resolution limit of optical lithography systems, namely illumination optimization, phase-shifting masks, and proximity corrections. Needless to say, the actinic wavelength has been reduced and the numerical aperture of the imaging lens increased in stages. The most recent innovations are immersion lithography and extreme UV (EUV) lithography.

In this presentation, the working principles, advantages, and challenges of immersion lithography are given. The defectivity issue is addressed by showing possible causes and solutions. The circuit design issues for pushing immersion lithography to single-digit nanometer delineation are presented.

Similarly, the working principles, advantages, and challenges of EUV lithography are given. Special focus is placed on EUV power requirements, generation, and distribution; EUV mask components, absorber thickness, defects, flatness requirements, and pellicles; and EUV resist challenges in sensitivity, line edge roughness, thickness, and etch resistance.

SESSION: Session 7: DFM, Reliability, and Electromigration

Advanced Design Methodologies for Directed Self-Assembly

  • Shao-Yun Fang

Directed self-assembly (DSA), which uses the segregation of a block co-polymer (BCP) after an annealing process to generate tiny feature shapes, has become one of the most promising next-generation lithography technologies. Depending on the proportions of the two monomers in the adopted BCP, either cylinders or lamellae can be generated by removing one of the two monomers; these variants are respectively referred to as cylindrical DSA and lamellar DSA. In addition, guiding templates are required to produce trenches before filling the BCP, such that the additional forces from the trench walls regulate the generated cylinders/lamellae. Both DSA technologies can be used to generate contact/via patterns in circuit layouts, but the practices of designing guiding templates are quite different due to different manufacturing principles. This paper reviews the existing studies on the guiding template design problem for contact/via hole fabrication with DSA technology. The design constraints are differentiated, and the design methodologies are introduced for cylindrical DSA and lamellar DSA, respectively. Possible future research directions are finally suggested to further enhance contact/via manufacturability and the feasibility of adopting DSA in semiconductor manufacturing.

Challenges for Interconnect Reliability: From Element to System Level

  • Olalla Varela Pedreira
  • Houman Zahedmanesh
  • Youqi Ding
  • Ivan Ciofi
  • Kristof Croes

The high current densities carried by interconnects have a direct impact on back-end-of-line (BEOL) reliability degradation: they locally increase the temperature through Joule heating, and they lead to drift of the metal atoms. The local increase in temperature due to Joule heating produces thermal gradients along the interconnects, inducing degradation through thermomigration. As the power density of the chip increases, thermal gradients may become a major reliability concern for scaled Cu interconnects. Therefore, it is of utmost relevance to fundamentally understand the impact of thermal gradients on metal migration. Our studies show that, by using a combined modelling approach and a dedicated test structure, we can assess the local temperatures and temperature gradient profiles. Moreover, with long-term experiments we are able to successfully generate voids at the locations of highest temperature gradient. Additionally, the main consequence of scaling Cu interconnects is the dramatic drop of EM lifetime (Jmax). Currently, the experimentally obtained EM parameters are used at the system design level to set the current limits through the interconnect networks. However, this approach is very simplistic and neglects the benefits provided by the redundancy and interconnectivity of the network. Our studies, using a system-level physics-based EM simulation framework that can determine the EM-induced IR drop at the standard-cell level, show that the circuit reliability margins of the power delivery network (PDN) can be further relaxed.

Combined Modeling of Electromigration, Thermal and Stress Migration in AC Interconnect Lines

  • Susann Rothe
  • Jens Lienig

The migration of atoms in metal interconnects of integrated circuits (ICs) increasingly endangers chip reliability. The susceptibility of DC interconnects to electromigration has been extensively studied, and a few works on thermal migration and AC electromigration are also available. Yet the combined effect of both on chip reliability has been neglected thus far. This paper provides both FEM and analytical models for atomic migration and steady-state stress profiles in AC interconnects, considering electromigration, thermal migration, and stress migration in combination. For this, we extend existing models with the impact of self-healing, temperature-dependent resistivity, and short wire lengths. We conclude by analyzing the impact of thermal migration on interconnect robustness and show that it can no longer be neglected in migration robustness verification.

Recent Progress in the Analysis of Electromigration and Stress Migration in Large Multisegment Interconnects

  • Nestor Evmorfopoulos
  • Mohammad Abdullah Al Shohel
  • Olympia Axelou
  • Pavlos Stoikos
  • Vidya A. Chhabria
  • Sachin S. Sapatnekar

Traditional approaches to analyzing electromigration (EM) in on-chip interconnects are largely driven by semi-empirical models. However, such methods are inexact for the typical multisegment lines that are found in modern integrated circuits. This paper overviews recent advances in analyzing EM in on-chip interconnect structures based on physics-based models that use partial differential equations, with appropriate boundary conditions, to capture the impact of electron-wind and back-stress forces within an interconnect, across multiple wire segments. Methods for both steady-state and transient analysis are presented, highlighting approaches that can solve these problems with a computation time that is linear in the number of wire segments in the interconnect.
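
For background, the physics-based models referenced here are typically of the Korhonen type; a standard one-dimensional form (a textbook-style summary, not an equation reproduced from the paper) balances the back-stress and electron-wind terms:

    \frac{\partial \sigma}{\partial t}
      = \frac{\partial}{\partial x}\!\left[\frac{D_a B\,\Omega}{k_B T}
        \left(\frac{\partial \sigma}{\partial x} + \frac{e Z \rho j}{\Omega}\right)\right],
    \qquad
    \left(\frac{\partial \sigma}{\partial x} + \frac{e Z \rho j}{\Omega}\right)\Bigg|_{\text{blocked line ends}} = 0,

where \sigma is the hydrostatic stress, D_a the atomic diffusivity, B the effective bulk modulus, \Omega the atomic volume, and j the current density. In steady state the stress profile is linear in x, which recovers the familiar Blech-type critical product $jL_{\mathrm{crit}} = \Delta\sigma_{\mathrm{crit}}\,\Omega/(eZ\rho)$; the methods surveyed in the paper generalize such boundary-value problems to multisegment lines.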

Electromigration Assessment in Power Grids with Account of Redundancy and Non-Uniform Temperature Distribution

  • Armen Kteyan
  • Valeriy Sukharev
  • Alexander Volkov
  • Jun Ho Choy
  • Farid N. Najm
  • Yong Hyeon Yi
  • Chris H. Kim
  • Stephane Moreau

A recently proposed methodology for electromigration (EM) assessment in the on-chip power/ground grids of integrated circuits has been validated by means of measurements performed on dedicated test grids. IR-drop degradation in the grid is used to define the EM failure criteria. Physics-based models are employed to simulate EM-induced stress evolution in interconnect structures, void formation and evolution, the resistance increase of voided segments, and the consequent redistribution of electric current in the redundant grid paths. A grid-like test structure, fabricated in a 65 nm technology and consisting of two metal layers, made it possible to calibrate the voiding models by tracking the voltage evolution at all grid nodes in experiment and in simulation. A good fit between the measured and simulated time-to-failure (TTF) probability distributions was obtained for both uniform and non-uniform temperature distributions across the grid. The second test grid was fabricated in a 28 nm technology, consisted of four metal layers, and contained power and ground nets connected to “quasi-cells” with poly-resistors, which were specially designed to operate at elevated temperatures of ~350°C. The existing current distributions resulted in different EM-induced failure behavior in these nets: a gradual voltage evolution in the power net and sharp changes in the ground net were observed in experiment and successfully reproduced in simulation.

SESSION: Session 8: Placement

Placement Initialization via Sequential Subspace Optimization with Sphere Constraints

  • Pengwen Chen
  • Chung-Kuan Cheng
  • Albert Chern
  • Chester Holtz
  • Aoxi Li
  • Yucheng Wang

State-of-the-art analytical placement algorithms for VLSI designs rely on solving nonlinear programs to minimize wirelength and cell congestion. As a consequence, the quality of solutions produced using these algorithms crucially depends on the initial cell coordinates. In this work, we reduce the problem of finding wirelength-minimal initial layouts subject to density and fixed-macro constraints to a Quadratically Constrained Quadratic Program (QCQP). We additionally propose an efficient sequential quadratic programming algorithm to recover a block-globally optimal solution and a subspace method to reduce the complexity of the problem. We extend our formulation to facilitate direct minimization of the Half-Perimeter Wirelength (HPWL) by showing that a corresponding solution can be derived by solving a sequence of reweighted quadratic programs. Critically, our method is parameter-free, i.e., it involves no hyperparameters to tune. We demonstrate that incorporating initial layouts produced by our algorithm with a global analytical placer results in improvements of up to 4.76% in post-detailed-placement wirelength on the ISPD’05 benchmark suite. Our code is available on GitHub: https://github.com/choltz95/laplacian-eigenmaps-revisited.
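
A minimal numerical sketch of the sphere-constrained quadratic idea (assuming a simple clique netlist model and ignoring the density and fixed-macro constraints, so this is not the authors' algorithm): the quadratic wirelength objective under a norm constraint is minimized in the span of the smallest nontrivial Laplacian eigenvectors, which can then serve as an initial layout.

    import numpy as np

    # Sketch: min x^T L x subject to ||x|| = r, solved via the spectrum of the
    # netlist graph Laplacian L; the two smallest nontrivial eigenvectors give
    # initial (x, y) coordinates on a sphere of radius r.
    def quadratic_init(adjacency, radius=1.0):
        degree = np.diag(adjacency.sum(axis=1))
        laplacian = degree - adjacency
        eigvals, eigvecs = np.linalg.eigh(laplacian)
        coords = eigvecs[:, 1:3]            # skip the trivial constant mode
        return radius * coords / np.linalg.norm(coords, axis=0, keepdims=True)

    # Toy example: a 4-cell clique netlist.
    A = np.ones((4, 4)) - np.eye(4)
    print(quadratic_init(A))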

DREAM-GAN: Advancing DREAMPlace towards Commercial-Quality using Generative Adversarial Learning

  • Yi-Chen Lu
  • Haoxing Ren
  • Hao-Hsiang Hsiao
  • Sung Kyu Lim

DREAMPlace is a renowned open-source placer that provides GPU-accelerated infrastructure for placement of very-large-scale integration (VLSI) circuits. However, due to its limited focus on wirelength and density, existing placement solutions of DREAMPlace are not applicable to industrial design flows. To push DREAMPlace towards commercial quality without knowledge of the tools’ black-box algorithms, in this paper we present DREAM-GAN, a placement optimization framework that advances DREAMPlace using generative adversarial learning. At each placement iteration, aside from optimizing the wirelength and density objectives of the vanilla DREAMPlace, DREAM-GAN computes and optimizes a differentiable loss that denotes the similarity score between the underlying placement and the tool-generated placements in commercial databases. Experimental results on 5 commercial and OpenCore designs using an industrial design flow implemented with Synopsys ICC2 not only demonstrate that DREAM-GAN significantly improves the vanilla DREAMPlace at the placement stage across each benchmark, but also show that the improvements persist through the post-route stage, where we observe improvements of up to 8.3% in wirelength and 7.4% in total power.
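
A hedged sketch of how a GAN-style similarity term can be attached to a differentiable placement objective (the network shape, loss weight, and tensor layout below are assumptions for illustration, not DREAM-GAN's actual architecture):

    import torch
    import torch.nn as nn

    # Sketch: a discriminator D is assumed to be trained to score
    # tool-generated ("real") placements highly; the cell coordinates are then
    # nudged to raise D's score alongside the usual placement objectives.
    class PlacementDiscriminator(nn.Module):
        def __init__(self, num_cells):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * num_cells, 256), nn.ReLU(),
                nn.Linear(256, 1))

        def forward(self, xy):              # xy: (batch, num_cells, 2)
            return self.net(xy.flatten(1))  # higher = more "tool-like"

    def combined_loss(wirelength, density, xy, disc, w_gan=0.1):
        # wirelength/density: differentiable DREAMPlace-style objective terms.
        similarity = disc(xy.unsqueeze(0)).mean()
        return wirelength + density - w_gan * similarity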

AutoDMP: Automated DREAMPlace-based Macro Placement

  • Anthony Agnesina
  • Puranjay Rajvanshi
  • Tian Yang
  • Geraldo Pradipta
  • Austin Jiao
  • Ben Keller
  • Brucek Khailany
  • Haoxing Ren

Macro placement is a critical very large-scale integration (VLSI) physical design problem that significantly impacts the design power-performance-area (PPA) metrics. This paper proposes AutoDMP, a methodology that leverages DREAMPlace, a GPU-accelerated placer, to place macros and standard cells concurrently in conjunction with automated parameter tuning using a multi-objective hyperparameter optimization technique. As a result, we can generate high-quality predictable solutions, improving the macro placement quality of academic benchmarks compared to baseline results generated from academic and commercial tools. AutoDMP is also computationally efficient, optimizing a design with 2.7 million cells and 320 macros in 3 hours on a single NVIDIA DGX Station A100. This work demonstrates the promise and potential of combining GPU-accelerated algorithms and ML techniques for VLSI design automation.
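
A toy version of the multi-objective parameter search (with made-up knob names and metrics, not AutoDMP's actual search space or tuner) looks like the following, where evaluate() would run the placer and report quality metrics:

    import random

    # Sketch: keep the Pareto front of (wirelength, congestion) over random
    # samples of placer hyperparameters; a real tuner would use a smarter
    # multi-objective optimizer, but the selection logic is the same.
    def pareto_front(results):
        return [r for r in results
                if not any(o is not r and o["wl"] <= r["wl"]
                           and o["cong"] <= r["cong"] for o in results)]

    def tune(evaluate, trials=50):
        results = []
        for _ in range(trials):
            params = {
                "target_density": random.uniform(0.60, 0.95),   # assumed knob
                "macro_halo": random.uniform(0.0, 5.0),         # assumed knob
                "learning_rate": 10 ** random.uniform(-3, -1),  # assumed knob
            }
            wl, cong = evaluate(params)   # user-supplied: run placer, measure
            results.append({"params": params, "wl": wl, "cong": cong})
        return pareto_front(results)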

Assessment of Reinforcement Learning for Macro Placement

  • Chung-Kuan Cheng
  • Andrew B. Kahng
  • Sayak Kundu
  • Yucheng Wang
  • Zhiang Wang

We provide an open, transparent implementation and assessment of Google Brain’s deep reinforcement learning approach to macro placement (Nature) and its Circuit Training (CT) implementation in GitHub. We implement key “black-box” elements of CT in open source and clarify discrepancies between CT and Nature. New testcases on open enablements are developed and released. We assess CT alongside multiple alternative macro placers, with all evaluation flows and related scripts publicly available on GitHub. Our experiments also encompass academic mixed-size placement benchmarks, as well as ablation and stability studies. We comment on the impact of Nature and CT, as well as directions for future research.

SESSION: Session 9: New Computing Techniques and Accelerators

GPU Acceleration in Physical Synthesis

  • Evangeline F.Y. Young

Placement and routing are essential steps in the physical synthesis of VLSI designs. Modern circuits contain billions of cells and nets, which significantly increases the computational complexity of physical synthesis and poses major challenges to leading-edge physical design tools. With the fast development of GPU architectures and computational power, exploring ways to speed up physical synthesis with massive parallelism on GPUs has become an important direction. In this talk, we will look into opportunities to improve EDA algorithms with GPU acceleration. Traditional EDA tools run on CPUs with a limited degree of parallelism. We will investigate a few examples of accelerating classical placement and routing algorithms using GPUs, and we will see how one can leverage the power of the GPU to improve both quality and runtime in solving these EDA problems.

Efficient Runtime Power Modeling with On-Chip Power Meters

  • Zhiyao Xie

Accurate and efficient power modeling techniques are crucial for both design-time power optimization and runtime on-chip IC management. In prior research, different types of power modeling solutions have been proposed, optimizing multiple objectives including accuracy, efficiency, temporal resolution, and automation level, targeting various power/voltage-related applications. Despite extensive prior explorations in this topic, new solutions still keep emerging and achieve state-of-the-art performance. This paper aims at providing a review of the recent progress in power modeling, with more focus on runtime on-chip power meter (OPM) development techniques. It also serves as a vehicle for discussing some general development techniques for the runtime on-chip power modeling task.

DREAMPlaceFPGA-PL: An Open-Source GPU-Accelerated Packer-Legalizer for Heterogeneous FPGAs

  • Rachel Selina Rajarathnam
  • Zixuan Jiang
  • Mahesh A. Iyer
  • David Z. Pan

Placement plays a pivotal and strategic role in the FPGA implementation flow to allocate the physical locations of the heterogeneous instances in the design. Among the placement stages, the packing or clustering stage groups logic instances like look-up tables (LUTs) and flip-flops (FFs) that could be placed on the same site. The legalization stage determines all instances’ physical site locations. With advances in FPGA architecture and technology nodes, designs contain millions of logic instances, and placement algorithms must scale accordingly. While other placement stages, global placement and detailed placement, have been accelerated using GPUs, the acceleration of the packing and legalization stages on a GPU remains largely unexplored. This work presents DREAMPlaceFPGA-PL, an open-source packer-legalizer for heterogeneous FPGAs that employs a GPU for acceleration. We revise the existing consensus-based parallel algorithms employed for packing and legalizing a flat placement to obtain further speedup on a GPU. Our experiments on the ISPD 2016 benchmarks demonstrate more than 2× acceleration.

SESSION: Session 10: Lifetime Achievement Commemoration for Professor Malgorzata Marek-Sadowska

Building Oscillatory Neural Networks: AI Applications and Physical Design Challenges

  • Aida Todri-Sanial

This talk is about a novel computing paradigm based on coupled oscillatory neural networks. Oscillatory neural networks (ONNs) are recurrent neural networks in which each neuron is an oscillator and the oscillator couplings are the synaptic weights. Inspired by Hopfield neural networks, ONNs make use of nonlinear dynamics to compute and to solve problems, such as associative memory tasks and combinatorial optimization problems, that are difficult to address with conventional digital computers. An exciting direction in recent years has been to implement Ising machines based on the Ising model of coupled binary spins in magnets. In this talk, I cover the design aspects of building ONNs, from devices to architecture, so as to benefit from parallel computation with oscillators while implementing them in an energy-efficient way.

Optimization of AI SoC with Compiler-assisted Virtual Design Platform

  • Chih-Tsun Huang
  • Juin-Ming Lu
  • Yao-Hua Chen
  • Ming-Chih Tung
  • Shih-Chieh Chang

As deep learning keeps evolving dramatically with rapidly increasing complexity, the demand for efficient hardware accelerators has become vital. However, the lack of software/hardware co-development toolchains makes designing AI SoCs (artificial intelligence systems-on-chip) considerably challenging. This paper presents a compiler-assisted virtual platform to facilitate the development of AI SoCs from the early design stage. The electronic system-level design platform provides rapid functional verification and performance/energy analysis. Cooperating with the neural network compiler, AI software and hardware can be co-optimized on the proposed virtual design platform. Our Deep Inference Processor is also utilized on the virtual design platform to demonstrate the effectiveness of the architectural evaluation and exploration methodology.

Challenges and Opportunities for Computing-in-Memory Chips

  • Xiang Qiu

In recent years, artificial neural networks have been applied to many scenarios, from daily-life applications like face detection to industry problems like placement and routing in physical design. Neural network inference mainly consists of multiply-accumulate operations, which require a huge amount of data movement. Traditional von Neumann computers are inefficient for neural networks because they have separate CPU and memory, and data transfer between them costs excessive energy and performance. To address this problem, in-memory and near-memory computing have been proposed and have attracted much attention in both academia and industry. In this talk, we will give a brief review of non-volatile memory crossbar-based computing-in-memory architectures. Next, we will discuss the challenges, from an industry perspective, for chips with such architectures to replace current CPUs/GPUs for neural network processing. Lastly, we will discuss possible solutions to those challenges.

ISPD 2023 Lifetime Achievement Award Bio

  • Malgorzata Marek-Sadowska

The 2023 International Symposium on Physical Design lifetime achievement award goes to Professor Malgorzata Marek-Sadowska for her outstanding contributions to the field.

SESSION: Session 11: Keynote III

Neural Operators for Solving PDEs and Inverse Design

  • Anima Anandkumar

Deep learning surrogate models have shown promise in modeling complex physical phenomena such as photonics, fluid flows, molecular dynamics and material properties. However, standard neural networks assume finite-dimensional inputs and outputs, and hence, cannot withstand a change in resolution or discretization between training and testing. We introduce Fourier neural operators that can learn operators, which are mappings between infinite dimensional spaces. They are discretization-invariant and can generalize beyond the discretization or resolution of training data. They can efficiently solve partial differential equations (PDEs) on general geometries. We consider a variety of PDEs for both forward modeling and inverse design problems, as well as show practical gains in the lithography domain.
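
To make the discretization-invariance property concrete, here is a minimal single spectral-convolution layer in the spirit of a Fourier neural operator (a simplified sketch, not the authors' implementation): the same trained layer can be applied to inputs sampled at different resolutions because it acts on a fixed number of Fourier modes.

    import torch
    import torch.nn as nn

    class SpectralConv1d(nn.Module):
        def __init__(self, channels, modes):
            super().__init__()
            self.modes = modes
            scale = 1.0 / (channels * channels)
            self.weight = nn.Parameter(
                scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

        def forward(self, x):                          # x: (batch, channels, n)
            x_ft = torch.fft.rfft(x)                   # to Fourier space
            out_ft = torch.zeros_like(x_ft)
            out_ft[:, :, :self.modes] = torch.einsum(  # mix the low modes only
                "bim,iom->bom", x_ft[:, :, :self.modes], self.weight)
            return torch.fft.irfft(out_ft, n=x.size(-1))  # back to input grid

    layer = SpectralConv1d(channels=1, modes=8)
    coarse, fine = torch.rand(1, 1, 64), torch.rand(1, 1, 256)
    print(layer(coarse).shape, layer(fine).shape)      # same layer, two resolutions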

SESSION: Session 12: Quantum Computing

Quantum Challenges for EDA

  • Leon Stok

Though early in its development, quantum computing is now available on real hardware and via the cloud through IBM Quantum. This radically new kind of computing holds open the possibility of solving some problems that are now and perhaps always will be intractable for “classical” computers.

As with any new technology, things are developing rapidly, but there are still many open questions. What is the status of quantum computers today? What are the key metrics we need to look at to improve a quantum system? What are some of the technical opportunities being looked at from an EDA perspective?

We will look at the quantum roadmap for the next couple of years, outline challenges that need to be solved, and discuss how the EDA community can potentially contribute to solving these challenges.

Developing Quantum Workloads for Workload-Driven Co-design

  • Anne Matsuura

Quantum computing offers the future promise of solving problems that are intractable for classical computers today. However, as an entirely new kind of computational device, we must learn how to best develop useful workloads. Today’s small workloads serve the dual purpose that they can also be used to learn how to design a better quantum computing system architecture. At Intel Labs, we develop small application-oriented workloads and use them to drive research into the design of a scalable quantum computing system architecture. We run these small workloads on the small systems of qubits that we have today to understand what is required from the system architecture to run them efficiently and accurately on real qubits. In this presentation, I will give examples of quantum workload-driven co-design and what we have learned from this type of research.

MQT QMAP: Efficient Quantum Circuit Mapping

  • Robert Wille
  • Lukas Burgholzer

Quantum computing is an emerging technology that has the potential to revolutionize fields such as cryptography, machine learning, optimization, and quantum simulation. However, a major challenge in the realization of quantum algorithms on actual machines is ensuring that the gates in a quantum circuit (i.e., the corresponding operations) match the topology of a targeted architecture so that the circuit can be executed while, at the same time, the resulting costs (e.g., in terms of the number of additionally introduced gates, fidelity, etc.) are kept low. This is known as the quantum circuit mapping problem. This summary paper provides an overview of QMAP, an open-source tool that is part of the Munich Quantum Toolkit (MQT) and offers efficient, automated, and accessible methods for tackling this problem. To this end, the paper first briefly reviews the problem. Afterwards, it shows how QMAP can be used to efficiently map quantum circuits to quantum computing architectures from both a user’s and a developer’s perspective. QMAP is publicly available as open source at https://github.com/cda-tum/qmap.
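
To illustrate the mapping problem itself (this toy routine is not QMAP's algorithm), consider a linear coupling map where a two-qubit gate can only act on adjacent physical qubits, so SWAPs are inserted until the two operands become neighbors:

    # Sketch: layout[logical] = physical qubit; route each two-qubit gate on a
    # linear architecture 0-1-2-...-(n-1) by swapping toward adjacency.
    def map_circuit(gates, num_qubits):
        layout = list(range(num_qubits))          # identity initial placement
        mapped = []
        for a, b in gates:                        # gate on logical qubits (a, b)
            pa, pb = layout[a], layout[b]
            while abs(pa - pb) > 1:               # not adjacent on the line
                step = 1 if pb > pa else -1
                neighbor = layout.index(pa + step)
                layout[a], layout[neighbor] = layout[neighbor], layout[a]
                mapped.append(("SWAP", pa, pa + step))
                pa, pb = layout[a], layout[b]
            mapped.append(("CX", pa, pb))
        return mapped

    # Example: three CNOTs on 4 qubits of a linear coupling map.
    print(map_circuit([(0, 3), (1, 2), (0, 2)], 4))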

SESSION: Session 13: Panel on EDA for Domain Specific Computing

EDA for Domain Specific Computing: An Introduction for the Panel

  • Iris Hui-Ru Jiang
  • David Chinnery

This panel explores domain-specific computing from hardware, software, and electronic design automation (EDA) perspectives.

Hennessy and Patterson signaled a new “golden age of computer architecture” in 2018 [1]. Process technology advances and general-purpose processor improvements provided much faster and more efficient computation, but scaling with Moore’s law has slowed significantly. Domain-specific customization can improve power-performance efficiency by orders of magnitude for important application domains, such as graphics, deep neural networks (DNNs) for machine learning [2], simulation, bioinformatics [3], image processing, and many other tasks.

The common features of domain-specific architectures are: 1) dedicated memories to minimize data movement across chip; 2) more arithmetic units or bigger memories; 3) use of parallelism matching the domain; 4) smaller data types appropriate for the target applications; and 5) domain-specific software languages. Expediting software development with optimized compilation for efficient fast computation on heterogeneous architectures is a difficult task, and must be considered with the hardware design. For example, GPU programming has used CUDA and OpenCL.

The hardware comprises application-specific integrated circuits (ASICs) [4] and systems-on-chip (SoCs). General-purpose processor cores are often combined with graphics processing units (GPUs) for stream processing, digital signal processors, field-programmable gate arrays (FPGAs) for configurability [5], artificial intelligence (AI) acceleration hardware, and so forth.

Domain-specific computers have been deployed recently. For example: the Google Tensor Processing Unit (DNN ASIC) [6]; Microsoft Catapult (FPGA-based cloud domain-service solution) [7]; Intel Crest (DNN ASIC) [8]; Google Pixel Visual Core (image processing and computer vision for cell phones and tablets) [9]; and the RISC-V architecture and open instruction set for heterogeneous computing [10].

Software-driven Design for Domain-specific Compute

  • Desmond A. Kirkpatrick

The end of Dennard scaling has created a focus on advancing domain-specific computing; we are seeing a renaissance of accelerating compute problems through specialization, with orders-of-magnitude improvement in performance and energy efficiency [1]. Domain-specific compute, with its wide proliferation of domains and narrow specialization of hardware and software, provides unique challenges in design automation not met by the methodologies matured under the model of high-volume manufacturing of competitive CPUs, GPUs, and SoCs [2]. Importantly, domain-specific compute targets smaller markets that move more rapidly, so design non-recurring engineering (NRE) costs play a much larger role. Second, the role of software is so much more significant that we believe a software-first approach, where software drives hardware design and the product is developed at the speed of software, is required to keep pace with domain-specific compute market requirements. This creates significant new challenges and opportunities for EDA to address the domain-specific compute design space. The forces that are driving the renaissance in domain-specific compute architectures also require a renaissance in the tools, flows, and methods to maintain this pace of innovation.

This talk will present a general framework for approaching automation of domain-specific compute co-design of SW/HW and draw upon recent innovations in EDA that can help us address this challenge. The focus will be on driving software-oriented techniques, such as agile design, into hardware design [3], as well as vertically oriented domain-specific codesign automation stacks [4], and some of the gaps in EDA that currently limit these approaches.

Google Investment in Open Source Custom Hardware Development Including No-Cost Shuttle Program

  • Tim Ansell

The end of Moore’s Law, combined with unabated growth in usage, has forced Google to turn to hardware acceleration to deliver the efficiency gains needed to meet demand. Traditional hardware design methodology for accelerators is practical when there is a common core, such as with machine learning (ML) or video transcoding, but what about the hundreds of smaller tasks performed in Google data centers? Our vision is “software-speed” development for hardware acceleration so that it becomes commonplace and, frankly, boring. Toward this goal, Google is investing in open tooling to foster innovation in multiplying accelerator developer productivity.

Tim Ansell will provide an outline of these coordinated open source projects in EDA (including high level synthesis), IP, PDKs, and related areas. This will be followed by presenting the CFU (Custom Function Unit) Playground, which utilizes many of these projects.

The CFU Playground lets you build your own specialized & optimized ML processor based on the open RISC-V ISA, implemented on an FPGA using a fully open source stack. The goal isn’t general ML extensions; it’s about a methodology for building your own extension specialized just for your specific tiny ML model. The extension can range from a few simple new instructions, up to a complex accelerator that interfaces to the CPU via a set of custom instructions; we will show examples of both.

A Case for Open EDA Verticals

  • Zhiru Zhang
  • Matthew Hofmann
  • Andrew Butt

With the end of Dennard scaling and Moore’s Law reaching its limits, domain-specific hardware specialization has become a crucial method for improving compute performance and efficiency for various important applications. Leading companies in competitive fields, such as machine learning and video processing, are building their own in-house technology stacks to better suit their accelerator design needs. However, currently this approach is only a viable option for a few large enterprises that can afford to invest in teams of experts in hardware, systems, and compiler development for high-value applications. In particular, the high license cost of commercial electronic design automation (EDA) tools presents a significant barrier for small and mid-size engineering teams to create new hardware accelerators. These tools are essential for designing, simulating, and testing new hardware, but can be too expensive for smaller teams with limited budgets, reducing their ability to innovate and compete with larger organizations.

More recently, open-source EDA toolflows [1, 5, 11, 12] have emerged that offer a promising alternative to commercial tools, with the potential to provide more cost-effective solutions for hardware development. For example, OpenROAD [1] allows the design of custom ASICs with minimal human intervention and no licensing fees. During initial development, it was also able to take advantage of existing tools such as Yosys [14] and KLayout [6] to reduce the amount of new code required to get a working flow. However, early adoption of open-source alternatives carries risk, as open-source EDA projects often lack important features and are less reliable than commercial options. Additionally, current open-source EDA tools may produce less competitive quality of results (QoR) and may not be able to catch up to commercial solutions anytime soon. Even when EDA tool access is not an issue, designing and implementing special-purpose accelerators using conventional RTL methodology can be unproductive and incur high non-recurring engineering (NRE) costs. High-level synthesis (HLS) has become increasingly popular in both academia and industry to automatically generate RTL designs from software programs. However, existing HLS tools do not help maintain domain-specific context throughout the design flow (e.g., placement, routing), which makes achieving good QoR difficult without significant manual fine-tuning. This hinders wider adoption of HLS.

We advocate for open EDA verticals as a solution to enabling more widespread use of domain-specific hardware acceleration. The objective is to empower small teams of domain experts to productively develop high-performance accelerators using programming interfaces they are already familiar with. For example, this means supporting domain-specific frameworks like PyTorch or TensorFlow for ML applications. In order for EDA verticals to proliferate, there must first be extensible infrastructure similar to LLVM [8] and MLIR [9] from which to build new tool flows. The proper EDA infrastructure would include novel intermediate representations specifically tailored to the unique challenges in gradually lowering high-level code down to gates.

Addressing the EDA Roadblocks for Domain-specific Compilers: An Industry Perspective

  • Alireza Kaviani

Computer architects now widely subscribe to domain-specific architectures as the only path left for major improvements in performance, cost, and energy. As a result, future compilers need to go beyond their traditional role of mapping a design input to a generic hardware platform. Emerging domain-specific compilers must take a broader view in which compilers provide more control to end users, enabling customization of hardware components to implement their corresponding tasks. Transitioning into this new design paradigm, where control and customization are key enablers, poses new challenges for domain-specific compilers.

Today, generic vendor backend EDA compilers are the only available mechanism to realize a broad range of applications in many domains. The necessity of breadth coverage by commercial tools often leads to implementations that do not take full advantage of the underlying hardware. Domain-specific compilers, on the other hand, can potentially deliver near-spec performance by taking advantage of both application attributes and architecture details. This issue is less pronounced for more generic computing platforms such as CPUs, thanks to leveraging open source as an essential component of software development. However, quality EDA software has remained mostly proprietary, and existing open-source attempts do not produce results of sufficient quality to be useful commercially at scale. Addressing the EDA roadblocks towards quality domain-specific compilers will require stepping-stone milestones from both industry and the community.

This suggests the need for a framework capable of interfacing between closed-source vendor backend tools and open-source domain compilers. RapidWright [1] is an example of such a framework: it enables a new level of optimization and customization for the application architect to further exploit FPGA silicon capabilities while focusing on a specific domain.

There are a few factors that will expedite progress for this approach. For example, RapidStream [2] demonstrates 30% higher performance and more than 5× faster compile time for data flow applications. The key enabler for the RapidStream domain compiler is split compilation, which was made possible for data flow applications by a latency-tolerant front-end and design entry. EDA vendors could enable such bottom-up flows by implementing a foundational infrastructure that allows multiple application modules to be implemented independently. Another useful step would be to decouple certain portions of monolithic EDA tools with separate, more permissive licensing so that they can be combined with open-source domain compilers.

Another key step required for domain-specific compilers to be successful is a process for offering a guarantee to the end customer. Today’s vendor tool flows offer a full guarantee and support to the end customer at the expense of limiting customization and control. The new paradigm of domain-specific compilers implies many variations of the tool flow, and it might not be feasible to provide the same level of support and guarantee as existing standard flows. The community needs to explore alternative ways of offering an equivalent level of support and guarantee to end users in order to make domain-specific compilers widely adopted.

High-level Synthesis for Domain Specific Computing

  • Hanchen Ye
  • Hyegang Jun
  • Jin Yang
  • Deming Chen

This paper proposes a High-Level Synthesis (HLS) framework for domain-specific computing. The framework contains three key components. 1) ScaleHLS, a multi-level HLS compilation flow aimed at addressing the lack of expressiveness and hardware-dedicated representation in traditional software-oriented compilers. ScaleHLS introduces a hierarchical intermediate representation (IR) for the progressive optimization of HLS designs defined in various high-level languages, and it performs optimizations at three levels, covering the graph, loop, and directive levels, to realize an efficient compilation pipeline and generate highly optimized domain-specific accelerators. 2) AutoScaleDSE, an automated design space exploration (DSE) engine. Real-world HLS designs often come with large design spaces that are difficult for designers to explore, and the connections between different components of an HLS design further complicate these spaces. To address the DSE problem, AutoScaleDSE proposes a random forest classifier and a graph-driven approach to improve the accuracy of estimating intermediate DSE results while reducing time and computational cost. With this new approach, AutoScaleDSE can evaluate thousands of HLS design points and find the Pareto-dominating design points within a couple of hours. 3) PyTransform, a flexible pattern-driven design customization flow. Existing HLS flows demand manual code rewriting or intrusive compiler customization to conduct domain-specific optimizations, leading to unscalable or inflexible compiler solutions. PyTransform proposes a Python-based flow that enables users to define custom matching and rewriting patterns at a high level of abstraction, which can be incorporated into the DSL compilation flow in an automatic and scalable manner. In summary, ScaleHLS, AutoScaleDSE, and PyTransform aim to address the challenges present in the compilation, DSE, and customization of existing HLS flows, respectively. With these three key components, our newly proposed HLS framework can deliver a scalable and extensible solution for designing domain-specific languages to automate and speed up the process of designing domain-specific accelerators.
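
To make the pattern-driven customization idea concrete, the following is a minimal, self-contained Python sketch of a matching-and-rewriting pass in the spirit of PyTransform. The Op class and the match_mul_add and rewrite functions are illustrative placeholders and are not the actual PyTransform API; the example simply rewrites a multiply feeding an add into a single fused operation, the kind of domain-specific rewrite that would otherwise require intrusive compiler changes.

    from dataclasses import dataclass

    @dataclass
    class Op:
        name: str
        args: tuple

    def match_mul_add(op):
        """Match add(mul(a, b), c) so it can be rewritten as a fused multiply-add."""
        if op.name == "add" and isinstance(op.args[0], Op) and op.args[0].name == "mul":
            (a, b), c = op.args[0].args, op.args[1]
            return a, b, c
        return None

    def rewrite(op):
        """Rewrite a matched pattern into a single fused op (e.g., to map onto a DSP block)."""
        m = match_mul_add(op)
        return Op("fma", m) if m else op

    expr = Op("add", (Op("mul", ("x", "y")), "z"))
    print(rewrite(expr))   # Op(name='fma', args=('x', 'y', 'z'))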

SESSION: Session 14: Hardware Security and Bug Fixing

Security-aware Physical Design against Trojan Insertion, Frontside Probing, and Fault Injection Attacks

  • Jhih-Wei Hsu
  • Kuan-Cheng Chen
  • Yan-Syuan Chen
  • Yu-Hsiang Lo
  • Yao-Wen Chang

The dramatic growth of hardware attacks and the lack of security-aware solutions in design tools lead to severe security problems in modern IC designs. Although many existing countermeasures provide decent protection against security issues, they still lack a global design view with sufficient security consideration at design time. This paper proposes a security-aware framework against Trojan insertion, frontside probing, and fault injection attacks at the design stage. The framework consists of two major techniques: (1) a large-scale shielding method that effectively covers the exposed areas of assets and (2) a cell-movement-based method to eliminate the empty spaces vulnerable to Trojan insertion. Experimental results show that our framework effectively reduces vulnerability to these attacks and achieves the best overall score compared with the top-3 teams in the 2022 ACM ISPD Security Closure of Physical Layouts Contest.

Security Closure of IC Layouts Against Hardware Trojans

  • Fangzhou Wang
  • Qijing Wang
  • Bangqi Fu
  • Shui Jiang
  • Xiaopeng Zhang
  • Lilas Alrahis
  • Ozgur Sinanoglu
  • Johann Knechtel
  • Tsung-Yi Ho
  • Evangeline F.Y. Young

Due to cost benefits, supply chains of integrated circuits (ICs) are largely outsourced nowadays. However, passing ICs through various third-party providers gives rise to many threats, like piracy of IC intellectual property or insertion of hardware Trojans, i.e., malicious circuit modifications.

In this work, we proactively and systematically harden the physical layouts of ICs against post-design insertion of Trojans. Toward that end, we propose a multiplexer-based logic-locking scheme that is (i) devised for layout-level Trojan prevention, (ii) resilient against state-of-the-art, oracle-less machine learning attacks, and (iii) fully integrated into a tailored, yet generic, commercial-grade design flow. Our work provides in-depth security and layout analysis on a challenging benchmark suite. We show that ours can render layouts resilient, with reasonable overheads, against Trojan insertion in general and also against second-order attacks (i.e., adversaries seeking to bypass the locking defense in an oracle-less setting).

We release our layout artifacts for independent verification [29].

X-Volt: Joint Tuning of Driver Strengths and Supply Voltages Against Power Side-Channel Attacks

  • Saideep Sreekumar
  • Mohammed Ashraf
  • Mohammed Nabeel
  • Ozgur Sinanoglu
  • Johann Knechtel

Power side-channel (PSC) attacks are well-known threats to sensitive hardware like advanced encryption standard (AES) crypto cores. Given the significant impact of supply voltages (VCCs) on power profiles, various countermeasures based on VCC tuning have been proposed, among other defense strategies. Driver strengths of cells, however, have been largely overlooked, despite having direct and significant impact on power profiles as well.

For the first time, we thoroughly explore the prospects of jointly tuning driver strengths and VCCs as novel working principle for PSC-attack countermeasures. Toward this end, we take the following steps: 1) we develop a simple circuit-level scheme for tuning; 2) we implement a CAD flow for design-time evaluation of ASICs, enabling security assessment of ICs before tape-out; 3) we implement a correlation power analysis (CPA) framework for thorough and comparative security analysis; 4) we conduct an extensive experimental study of a regular AES design, implemented in ASIC as well as FPGA fabrics, under various tuning scenarios; 5) we summarize design guidelines for secure and efficient joint tuning.

In our experiments, we observe that runtime tuning is more effective than static tuning, for both ASIC and FPGA implementations. For the latter, the AES core is rendered > 11.8x (i.e., at least 11.8 times) as resilient as the untuned baseline design. Layout overheads can be considered acceptable, with, e.g., around +10% critical-path delay for the most resilient tuning scenario in FPGA.

We release source codes for our methodology, as well as artifacts from the experimental study, in [13].

Validating the Redundancy Assumption for HDL from Code Clone’s Perspective

  • Jianjun Xu
  • Jiayu He
  • Jingyan Zhang
  • Deheng Yang
  • Jiang Wu
  • Xiaoguang Mao

Automated program repair (APR) is being leveraged in hardware description languages (HDLs) to fix hardware bugs without human involvement. Most existing APR techniques search for donor code (i.e., code fragments for bug fixing) in the original program to generate repairs, which is based on the assumption that donor code can be found in existing source code. This redundancy assumption is the fundamental basis of most APR techniques and has been widely studied in software by searching for code clones of donor code. However, despite a large body of work on code clone detection, researchers have focused almost exclusively on repositories in traditional programming languages, such as C/C++ and Java, while few studies have been done on detecting code clones in HDLs. Furthermore, little attention has been paid to the repetitiveness of bug fixes in hardware designs, which limits automatic repair targeting HDLs. To validate the redundancy assumption for HDL, we perform an empirical study on code clones of real-world bug fixes in Verilog. Based on the empirical results, we find that 17.71% of newly introduced code in bug fixes can be found in the clone pairs of the buggy code in the original program, and 11.77% can be found in the file itself. The findings not only validate the assumption but also provide helpful insights for the design of APR targeting HDLs.

SESSION: Session 15: ISPD 2023 Contest Results and Closing Remarks

Benchmarking Advanced Security Closure of Physical Layouts: ISPD 2023 Contest

  • Mohammad Eslami
  • Johann Knechtel
  • Ozgur Sinanoglu
  • Ramesh Karri
  • Samuel Pagliarini

Computer-aided design (CAD) tools traditionally optimize “only” for power, performance, and area (PPA). However, given the wide range of hardware-security threats that have emerged, future CAD flows must also incorporate techniques for designing secure and trustworthy integrated circuits (ICs). This is because threats that are not addressed during design time will inevitably be exploited in the field, where system vulnerabilities induced by ICs are almost impossible to fix. However, there is currently little experience for designing secure ICs within the CAD community.

This contest seeks to actively engage with the community to close this gap. The theme is security closure of physical layouts, that is, hardening the physical layouts at design time against threats that are executed post-design time. Acting as security engineers, contest participants will proactively analyse and fix the vulnerabilities of benchmark layouts in a blue-team approach. Benchmarks and submissions are based on the generic DEF format and related files.

This contest is focused on the threat of Trojans, with challenging aspects for physical design in general and for hindering Trojan insertion in particular. For one, layouts are based on the ASAP7 library and rules are strict, e.g., no DRC issues and no timing violations are allowed at all. In the alpha/qualifying round, submissions are evaluated using first-order metrics focused on exploitable placement and routing resources, whereas in the final round, submissions are thoroughly evaluated (red-teamed) through actual insertion of different Trojans.

Who’s Scott Beamer

April 2023

Scott Beamer

Assistant Professor

Department of Computer Science & Engineering, University of California, Santa Cruz

Email:

sbeamer@ucsc.edu

Personal webpage:

https://scottbeamer.net

Research interests

Agile and open-source hardware design, computer architecture, graph processing, and data movement optimization

Short bio

Scott Beamer is an assistant professor of computer science and engineering at the University of California, Santa Cruz. His research interests include agile hardware design, high-performance graph processing, and computer architecture. He has received an NSF CAREER award, the Kaivalya Dixit Distinguished Dissertation Award from SPEC, and best paper awards from the International Parallel & Distributed Processing Symposium (IPDPS) and the International Symposium on Workload Characterization (IISWC). He has a PhD in Computer Science from the University of California, Berkeley, and was formerly a postdoctoral scholar at Lawrence Berkeley National Laboratory.

Research highlights

(1) Accelerating RTL Simulation
Simulation is a crucial tool for hardware design, but the current slow speed of RTL simulation often bottlenecks the whole design process. Dr. Beamer’s work explores techniques to drastically accelerate simulation speed while providing the same cycle-accurate result. These results are released as the open-source ESSENT simulator, which demonstrates both leading single-threaded [DAC20] and parallel [ASPLOS23] performance.
(2) High-Performance Graph Processing
The versatility of the graph abstraction allows it to represent many things, from hardware circuits to social networks. Unfortunately, graph applications typically underutilize existing general-purpose compute resources. Dr. Beamer’s work has accelerated graph processing through a variety of means. Algorithmically, he created the direction-optimizing breadth-first search (BFS) algorithm, which is the fastest BFS for low-diameter graphs [SC12] and is widely used in the Graph500 competition (a simplified sketch of the idea appears after this list). He carefully analyzed graph workloads with performance counters [IISWC15] and created the GAP benchmark suite, which has been used by over 250 publications. His propagation blocking work transforms graph algorithms to increase their spatial locality [IPDPS17].
(3) Monolithically-Integrated Silicon Photonics
Due to the limits of electrical signalling, off-chip bandwidth can be greatly hindered by area or power constraints. Monolithically-integrated silicon photonics provide an amazing opportunity to overcome such limitations for inter-chip communication. Collaborating with device experts, Dr. Beamer designed architectures to best utilize photonics, for CPU to DRAM [ISCA10] and within a single chip [NOCS09].
(4) Agile & Open-Source Hardware Design
With the slowing of Dennard scaling and the rise in the need for hardware specialization, there is a corresponding need to reduce the cost and complexity of hardware design. Agile techniques provide a promising way to increase productivity, and Dr. Beamer has created a course on Agile Hardware Design and released all of its content as open source (https://github.com/agile-hw). Releasing tools and hardware designs as open source can greatly help the community, and Dr. Beamer was an early contributor to the RISC-V project and one of the first users of the Chisel hardware construction language.
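
As a reference point for highlight (2), below is a minimal Python sketch of the direction-optimizing idea: run conventional top-down BFS steps while the frontier is small and switch to bottom-up steps once the frontier's edge count grows large. The single switching heuristic and the alpha threshold here are simplifications for illustration; the published algorithm uses tuned top-down/bottom-up switching thresholds and incremental bookkeeping.

    def direction_optimizing_bfs(adj, source, alpha=14):
        """Simplified hybrid BFS: top-down for small frontiers, bottom-up for large ones."""
        n = len(adj)
        parent = [-1] * n
        parent[source] = source
        frontier = [source]
        while frontier:
            frontier_edges = sum(len(adj[v]) for v in frontier)
            unexplored_edges = sum(len(adj[v]) for v in range(n) if parent[v] == -1)
            if frontier_edges * alpha > unexplored_edges:
                # bottom-up step: each unvisited vertex scans its neighbors for a parent
                # (a real implementation tracks the unvisited set incrementally)
                frontier_set = set(frontier)
                next_frontier = []
                for v in range(n):
                    if parent[v] == -1:
                        for u in adj[v]:
                            if u in frontier_set:
                                parent[v] = u
                                next_frontier.append(v)
                                break
            else:
                # top-down step: frontier vertices claim their unvisited neighbors
                next_frontier = []
                for u in frontier:
                    for v in adj[u]:
                        if parent[v] == -1:
                            parent[v] = u
                            next_frontier.append(v)
            frontier = next_frontier
        return parent

    # tiny undirected example: edges 0-1, 0-2, 1-3, 2-3
    adj = [[1, 2], [0, 3], [0, 3], [1, 2]]
    print(direction_optimizing_bfs(adj, 0))   # [0, 0, 0, 1]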

Who’s Ming-Chang Yang

March 2023

Ming-Chang Yang

Associate Professor

Department of Computer Science and Engineering, The Chinese University of Hong Kong

Email:

mcyang@cse.cuhk.edu.hk

Personal webpage:

http://www.cse.cuhk.edu.hk/~mcyang/

Research interests

Emerging non-volatile memory and storage technologies, memory and storage systems, and the next-generation memory/storage architecture designs.

Short bio

Ming-Chang Yang is currently an Associate Professor in the Department of Computer Science and Engineering at The Chinese University of Hong Kong. He received his B.S. degree from the Department of Computer Science at National Chiao-Tung University, Hsinchu, Taiwan, in 2010, and his Master's and Ph.D. degrees (supervised by Professor Tei-Wei Kuo) from the Department of Computer Science and Information Engineering at National Taiwan University, Taipei, Taiwan, in 2012 and 2016, respectively. His primary research interests include emerging non-volatile memory and storage technologies, memory and storage systems, and next-generation memory/storage architecture designs.

Dr. Yang has published more than 70 research papers, mainly in top journals (e.g., IEEE TC, IEEE TCAD, IEEE TVLSI, and ACM TECS) and top conferences (e.g., USENIX OSDI, USENIX FAST, USENIX ATC, ACM/IEEE DAC, ACM/IEEE ICCAD, ACM/IEEE CODES+ISSS, and ACM/IEEE EMSOFT). He received two best paper awards (from IEEE NVMSA 2019 and ACM/IEEE ISLPED 2020) for his research contributions on emerging non-volatile memory, and he was awarded the TSIA Ph.D. Student Semiconductor Award from the Taiwan Semiconductor Industry Association (TSIA) in 2016 for his research achievements on flash memory.

Research highlights

The main research interest of Dr. Yang’s group is embracing emerging memory/storage technologies in computer systems, including various types of non-volatile memory (NVM) as well as the shingled magnetic recording (SMR) and interlaced magnetic recording (IMR) technologies for next-generation hard disk drives (HDDs).

In particular, in view of the common read-write asymmetry (in both latency and energy) of NVM, one series of Dr. Yang’s work attempts to alleviate the side effects caused by such asymmetry by innovating application and/or algorithm designs. For example, one of their most recent studies devises a novel dynamic hashing scheme for NVM called SEPH, which exhibits excellent performance scalability, efficiency, and predictability on a real NVM product (i.e., Intel® Optane™ DCPMM). Dr. Yang’s group has also revamped the algorithmic design of random forest, a core machine learning (ML) algorithm, for NVM. This line of work has received particular attention and recognition from the community, including the two best paper awards from NVMSA 2019 and ISLPED 2020. Moreover, Dr. Yang’s group is a pioneer in exploring memory subsystem designs based on an emerging type of NVM called racetrack memory (RTM).

On the other hand, even though the cutting-edge SMR and IMR technologies bring a lower cost-per-GB to HDDs, they also introduce a write amplification problem, resulting in severe write performance degradation. In light of this, Dr. Yang’s group has introduced several novel data management designs at different system layers for SMR-based and IMR-based HDDs. For example, they architect KVIMR, a data management middleware for constructing a cost-effective yet high-throughput LSM-tree-based KV store on IMR-based HDDs. KVIMR exhibits significant throughput improvement and excellent compatibility with mainstream LSM-tree-based KV stores (such as RocksDB and LevelDB). In addition, at the block layer, they put forward a novel design called Virtual Persistent Cache (VPC) that adaptively exploits the computing and management resources of the host system to improve the write responsiveness of SMR-based HDDs. Moreover, they realize a firmware design called MAGIC, which shows great potential to close the performance gap between traditional and IMR-based HDDs.

Apart from the system work on adopting emerging memory/storage technologies, Dr. Yang’s group also has a special interest in data-intensive and data-driven applications. For instance, they aim to optimize the efficiency and practicality of out-of-core graph processing systems, which offload the enormous graph data from memory to storage for better scalability at low cost. They also develop new frameworks for graph representation learning and graph neural networks with significant performance improvements.

FPGA’23 TOC

FPGA ’23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

 Full Citation in the ACM Digital Library

SESSION: Keynote I

Compiler Support for Structured Data

  • Saman Amarasinghe

In 1957, the FORTRAN language and compiler introduced multidimensional dense arrays, or dense tensors. Subsequent programming languages added a myriad of data structures, from lists, sets, hash tables, and trees to graphs. Still, when dealing with extremely large data sets, dense tensors are the only simple and practical solution. However, modern data is anything but dense. Real-world data, generated by sensors, produced by computation, or created by humans, often contains underlying structure, such as sparsity, runs of repeated values, or symmetry.

In this talk I will describe how programming languages and compilers can support large data sets with structure. I will introduce TACO, a compiler for sparse data computing. TACO is the first system to automatically generate kernels for any tensor algebra operation on tensors in any of the commonly used formats. It pioneered a new technique for compiling compound tensor expressions into efficient loops in a systematic way. TACO-generated code has performance competitive with best-in-class hand-written codes for tensor and matrix operations. With TACO, I will show how to put sparse array programming on the same compiler-transformation and code-generation footing as dense array codes. Structured data has immense potential for hardware acceleration. However, instead of one-off single-operation compute engines, I believe that with compiler frameworks such as TACO it is possible to create hardware for an entire class of sparse computations. With the help of the FPGA community, I am looking forward to such a future.
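
For readers unfamiliar with sparse tensor compilation, the sketch below shows the kind of loop nest a compiler such as TACO derives for the expression y(i) = A(i,j) * x(j) when A is stored in CSR: the loops co-iterate only over stored nonzeros rather than the full dense index space. TACO itself emits C code; this Python version is only meant to illustrate the shape of the generated kernel.

    def spmv_csr(row_ptr, col_idx, vals, x):
        """y(i) = A(i,j) * x(j) with A in CSR: iterate rows densely, nonzeros sparsely."""
        y = [0.0] * (len(row_ptr) - 1)
        for i in range(len(row_ptr) - 1):                 # dense loop over rows
            for p in range(row_ptr[i], row_ptr[i + 1]):   # sparse loop over row i's nonzeros
                y[i] += vals[p] * x[col_idx[p]]
        return y

    # 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form
    row_ptr, col_idx, vals = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
    print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))   # [3.0, 3.0]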

SESSION: Session: High-Level Abstraction and Tools

DONGLE: Direct FPGA-Orchestrated NVMe Storage for HLS

  • Linus Y. Wong
  • Jialiang Zhang
  • Jing (Jane) Li

Rapid growth in data size poses increasing computational and memory challenges to data processing. FPGA accelerators and near-storage processing are promising candidates for tackling computational and memory requirements, and many near-storage FPGA accelerators have been shown to be effective in processing large data. However, the current HLS development environment does not allow direct NVMe storage access from the HLS code. As such, users must frequently hand off between HLS and host code to access data in storage, and such a process requires tedious programming to ensure functional correctness. Moreover, since the HLS code uses radically different methods to access storage compared to DRAM, the HLS codebase targeting DRAM-based platforms cannot be easily ported to NVMe-based platforms, resulting in limited code portability and reusability. Furthermore, frequent suspension of HLS kernel and synchronization between CPU and FPGA introduce significant latency overhead and require sophisticated scheduling mechanisms to hide latency.

To address these challenges, we propose a new HLS storage interface named DONGLE that enables direct FPGA-orchestrated NVMe storage access. By providing a unified interface for storage and memory access, DONGLE allows a single-source HLS program to target multiple memory/storage devices, thus making the codebase cleaner, portable, and more efficient. We prototyped DONGLE with an AMD/Xilinx Alveo U200 FPGA and Solidigm DC-P4610 SSD and demonstrate a geomean speed-up of 2.3× and a reduction of lines-of-code by 2.4× on evaluated workloads over the state-of-the-art commercial platform.

FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs

  • Linfeng Du
  • Tingyuan Liang
  • Sharad Sinha
  • Zhiyao Xie
  • Wei Zhang

Multi-die FPGAs are widely adopted to deploy large-scale hardware accelerators. Two factors impede the performance optimization of high-level synthesis (HLS) designs implemented on multi-die FPGAs. On the one hand, the long net delay due to nets crossing die boundaries results in an NP-hard problem of properly floorplanning and pipelining an application. On the other hand, the traditional automated search flow for HLS directive optimization targets single-die FPGAs and hence cannot consider the resource constraints on each die or the timing issues incurred by die crossings. Further, because of the large design scale, legalizing the floorplan of the HLS design generated under each configuration during directive optimization incurs excessively long runtime.

To co-optimize the directives and floorplan of HLS designs on multi-die FPGAs, we propose the FADO framework, which formulates the directive-floorplan co-search problem as multi-choice multi-dimensional bin-packing and solves it with an iterative optimization flow. For each step of directive optimization, a latency-bottleneck-guided greedy algorithm searches for more efficient directive configurations. For floorplanning, instead of repeatedly invoking global floorplanning algorithms, we implement a more efficient incremental floorplan legalization algorithm. It mainly applies the worst-fit strategy from online bin-packing to balance the floorplan, together with an offline best-fit-decreasing re-packing step to compact the floorplan, followed by pipelining of the long wires crossing die boundaries.

Through experiments on a set of HLS designs mixing dataflow and non-dataflow kernels, FADO not only fully automates the co-optimization and finishes with 693X~4925X shorter runtime compared with DSE assisted by global floorplanning, but also yields a 1.16X~8.78X improvement in overall workflow execution time after implementation on the Xilinx Alveo U250 FPGA.

Eliminating Excessive Dynamism of Dataflow Circuits Using Model Checking

  • Jiahui Xu
  • Emmet Murphy
  • Jordi Cortadella
  • Lana Josipovic

Recent HLS efforts explore the generation of dynamically scheduled, dataflow circuits from high-level code; their ability to adapt the schedule at runtime to particular data and control outcomes promises superior performance to standard, statically scheduled HLS solutions. However, dataflow circuits are notoriously resource-expensive: their distributed handshake mechanism brings performance benefits in some cases, but causes an unneeded resource overhead when general dynamism is not required. In this work, we present a verification framework based on model checking to systematically reduce the hardware complexity of dataflow circuits. We devise a series of formal proofs that identify the absence of particular behavioral scenarios and use this information to replace the generic dataflow logic with simpler and cheaper control structures. On a set of benchmarks obtained from high-level code, we demonstrate that our technique significantly reduces the resource requirements of dataflow circuits (i.e., it results in LUT and FF reductions of up to 51% and 53%, respectively), while still reaping all performance benefits of dynamic scheduling.

Straight to the Queue: Fast Load-Store Queue Allocation in Dataflow Circuits

  • Ayatallah Elakhras
  • Riya Sawhney
  • Andrea Guerrieri
  • Lana Josipovic
  • Paolo Ienne

Dynamically scheduled high-level synthesis can exploit high levels of parallelism in poorly-predictable control-dominated applications. Yet, dataflow circuits are often generated by literal conversion of basic blocks into circuits interconnected in such a way as to mimic the program’s sequential execution. Although correct and quite effective in many cases, this adherence to control flow still significantly limits exploitable parallelism. Recent research introduced techniques to deliver data tokens directly from producers to consumers and achieved tangible benefits both in circuit complexity and execution time. Unfortunately, while this successfully addressed ordinary data dependencies, the problem of potential dependencies through memory remains open: When no technique can statically disambiguate accesses, circuits must be built with load-store queues (LSQs) which, to reorder accesses safely, need memory accesses to be allocated in the queues in program order. Such in-order allocation still demands control circuitry emulating sequential execution, with its negative impact on parallelization. In this paper, we transform potential memory dependencies into virtual data dependencies and use the new direct token delivery strategy to allocate accesses sequentially into the LSQ. In other words, we exploit more parallelism by constructing control circuitry to emulate exclusively those parts of the control flow strictly necessary for in-order allocation. Our results show that we can achieve up to a 74% reduction in execution time compared to prior work, in some cases, at no area cost.

SESSION: Poster Session I

OMT: A Demand-Adaptive, Hardware-Targeted Bonsai Merkle Tree Framework for Embedded Heterogeneous Memory Platform

  • Rakin Muhammad Shadab
  • Yu Zou
  • Sanjay Gandham
  • Mingjie Lin

Novel flash-based, crash-tolerant, non-volatile memory (NVM) such as Intel’s Optane DC memory brings about new and exciting use-case scenarios for both traditional and embedded computing systems involving Field-Programmable Gate Arrays (FPGAs). However, NVM cannot be a proper replacement for existing DDR memory modules due to its low write endurance and is better suited to a hybrid NVM + volatile memory system. NVM is also well known to be vulnerable to different memory-based adversaries, which demands the use of a robust authentication method such as a Bonsai Merkle Tree (BMT). However, the typical update process of a BMT (eager update) requires updating the entire update chain frequently, affecting run-time performance even for data that is not persistence-critical. The latest intermittent BMT update techniques can provide better real-time throughput, but they lack crash-consistency.

A heterogeneous memory-based system would, therefore, greatly benefit from an authentication mechanism that can change its update method on the fly. Hence we propose a modular, unified, and adaptable hardware-based BMT framework called Opportunistic Merkle Tree (OMT). OMT combines two BMTs with different update methods and streamlines the BMT read with a common datapath to provide support for both recovery-critical and general data, eliminating the need for individual authentication subsystems for heterogeneous memory platforms. It also allows a switch between the update methods based on the request type (persistent/intermittent) while considerably reducing the resource overhead compared to standalone BMT implementations. We test OMT on a heterogeneous embedded secure memory system, and the setup provides 44% lower memory overhead and up to 22% faster execution in synthetic benchmarks compared to a baseline.

Cyclone-NTT: An NTT/FFT Architecture Using Quasi-Streaming of Large Datasets on DDR- and HBM-based FPGA Platforms

  • Kaveh Aasaraai
  • Emanuele Cesena
  • Rahul Maganti
  • Nicolas Stalder
  • Javier Varela
  • Kevin Bowers

Number-Theoretic-Transform (NTT) is a variation of the Fast-Fourier-Transform (FFT) over finite fields. NTT is being increasingly used in blockchain and zero-knowledge proof applications. Although FFT and NTT are widely studied for FPGA implementation, we believe CycloneNTT is the first to solve this problem for large data sets (2^24 64-bit numbers) that do not fit in on-chip RAM. CycloneNTT uses a state-of-the-art butterfly network and maps the dataflow to hybrid FIFOs composed of on-chip SRAM and external memory. This manifests as a quasi-streaming data access pattern that minimizes external memory access latency and maximizes throughput. We implement two variants of CycloneNTT optimized for DDR and HBM external memories. Although historically this problem has been shown to be memory-bound, CycloneNTT’s quasi-streaming access pattern is optimized to the point that, when using HBM (Xilinx C1100), the architecture becomes compute-bound. On the DDR-based platform (AWS F1), the latency of the application equals streaming the entire dataset log(N) times to/from external memory. Moreover, exploiting HBM’s larger number of channels, and following a series of additional optimizations, CycloneNTT only requires log(N)/6 passes.
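
For context, the sketch below is a textbook radix-2 iterative NTT over an NTT-friendly prime, written in Python. It shows the butterfly dataflow that an accelerator like CycloneNTT must stream through its hybrid on-chip/external-memory FIFOs; the prime, sizes, and code organization here are illustrative only and are unrelated to the paper's hardware implementation.

    MOD = 998244353        # NTT-friendly prime: 119 * 2^23 + 1
    PRIMITIVE_ROOT = 3

    def ntt(a, invert=False):
        """In-place-style radix-2 Cooley-Tukey NTT; len(a) must be a power of two <= 2^23."""
        n = len(a)
        a = list(a)
        # bit-reversal permutation
        j = 0
        for i in range(1, n):
            bit = n >> 1
            while j & bit:
                j ^= bit
                bit >>= 1
            j |= bit
            if i < j:
                a[i], a[j] = a[j], a[i]
        length = 2
        while length <= n:
            w_len = pow(PRIMITIVE_ROOT, (MOD - 1) // length, MOD)
            if invert:
                w_len = pow(w_len, MOD - 2, MOD)
            for start in range(0, n, length):
                w = 1
                for k in range(start, start + length // 2):
                    u = a[k]
                    v = a[k + length // 2] * w % MOD
                    a[k] = (u + v) % MOD                  # butterfly upper output
                    a[k + length // 2] = (u - v) % MOD    # butterfly lower output
                    w = w * w_len % MOD
            length <<= 1
        if invert:
            n_inv = pow(n, MOD - 2, MOD)
            a = [x * n_inv % MOD for x in a]
        return a

    data = [5, 7, 1, 2]
    assert ntt(ntt(data), invert=True) == data   # forward/inverse round trip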

AoCStream: All-on-Chip CNN Accelerator With Stream-Based Line-Buffer Architecture

  • Hyeong-Ju Kang

Convolutional neural network (CNN) accelerators are widely used for their efficiency, but they require a large amount of memory, leading to the use of slow and power-consuming external memories. This paper exploits two schemes to reduce the required memory and ultimately to implement a CNN of reasonable performance using only the on-chip memory of a practical device like a low-end FPGA. To reduce the amount of intermediate data, a stream-based line-buffer architecture and a matching dataflow are proposed instead of the conventional frame-based architecture, where the intermediate data memory is proportional to the square of the input image size. The architecture consists of layer-dedicated blocks operating in a pipelined way on the input and output streams. Each convolutional layer block has a line buffer storing just a few rows of input data. The sizes of the line buffers are proportional to the width of the input image, so the architecture requires less intermediate data storage, especially given the trend toward larger input sizes in modern object detection CNNs. In addition, the weight memory is reduced by accelerator-aware pruning. The experimental results show that a whole object detection CNN can be implemented even on a low-end FPGA without external memory. Compared to previous accelerators with similar object detection accuracy, the proposed accelerator reaches higher throughput with fewer FPGA resources (LUTs, registers, and DSPs), showing higher efficiency.

Fault Detection on Multi COTS FPGA Systems for Physics Experiments on the International Space Station

  • Tim Oberschulte
  • Jakob Marten
  • Holger Blume

Field-programmable gate arrays (FPGAs) in space applications come with the drawback of radiation effects, which will inevitably occur in devices of small process size. This also applies to the electronics of the Bose Einstein Condensate and Cold Atom Laboratory (BECCAL) apparatus, which is planned to operate on the International Space Station for several years. In total, more than 100 FPGAs distributed in the setup will be used for high-precision control of specialized sensors and actuators at nanosecond scale. Due to the large number of devices in BECCAL, commercial off-the-shelf (COTS) FPGAs are used, which are not radiation hardened. In this work, we detect and mitigate radiation effects in an application-specific COTS-FPGA-based communication network. For that, redundancy is integrated into the design while the firmware is optimized to stay within the FPGA’s resource constraints. A redundant integrity checker module is developed which can notify preceding network devices about data and configuration bit errors. The firmware is evaluated by injecting faults into data and configuration registers in simulation and on real hardware. The FPGA resource usage of the firmware is cut down by more than half, enabling the use of double modular redundancy for the switching fabric. Together with the triple-modular-redundancy-protected integrity checker, this combination fully prevents silent data corruptions in the design, as shown in simulations and by injecting faults in hardware using the Intel Fault Injection FPGA IP Core, while staying within the resource limits of a COTS FPGA.

Nimblock: Scheduling for Fine-grained FPGA Sharing through Virtualization

  • Meghna Mandava
  • Deming Chen

As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time- and space-multiplex the virtualized FPGA by introducing Nimblock. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores both preemption and pipelining as a scheduling parameter to dynamically change resource allocations, and automatically allocates resources to enable suitable parallelism for an application without additional user input. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA. We evaluate our algorithm and validate its efficacy by measuring results from real workloads running on the board with different real-time constraints and priority levels. In our exploration, we compare our novel Nimblock algorithm against a no-sharing and no-virtualization baseline algorithm and three other algorithms which support sharing and virtualization. We achieve up to 5x lower average response time when compared to the baseline algorithm and up to 2.1x average response time improvement over competitive scheduling algorithms that support sharing within our virtualization environment. We additionally demonstrate up to 49% fewer deadline violations and up to 2.6x lower tail response times when compared to other high-performance algorithms.

Graph-OPU: An FPGA-Based Overlay Processor for Graph Neural Networks

  • Ruiqi Chen
  • Haoyang Zhang
  • Yuhanxiao Ma
  • Enhao Tang
  • Shun Li
  • Yanxiang Zhu
  • Jun Yu
  • Kun Wang

Graph Neural Networks (GNNs) have outstanding performance on graph-structured data and have been extensively accelerated by field-programmable gate arrays (FPGAs) in various ways. However, existing accelerators significantly lack flexibility, especially in the following two aspects: 1) many FPGA-based accelerators support only one GNN model; 2) the processes of re-synthesizing and re-generating bitstreams are very time-consuming for new GNN models. To this end, we propose a highly integrated FPGA-based overlay processor for general GNN acceleration named Graph-OPU. To handle the irregularity of data structures and operations, we customize the instruction set to support irregular operation patterns in the inference process of GNN models. Then, we customize the datapath and optimize the data format in the microarchitecture to take full advantage of high bandwidth memory (HBM). Moreover, we design the computation module to ensure a unified and fully pipelined process for sparse matrix multiplication (SpMM) and general matrix multiplication (GEMM). Users can avoid FPGA reconfiguration or RTL regeneration for newly invented GNN models. We implement the hardware prototype on a Xilinx Alveo U50 and test mainstream GNN models with 9 datasets. Graph-OPU achieves an average of 435× and 18× speedup, and 2013× and 109× better energy efficiency, compared with the Intel I7-12700KF processor and the NVIDIA RTX3090 GPU, respectively. To the best of our knowledge, Graph-OPU is the first in-depth study of an FPGA-based general processor for GNN acceleration with high speedup and energy efficiency.

HMLib: Efficient Data Transfer for HLS Using Host Memory

  • Michael Lo
  • Young-kyu Choi
  • Weikang Qiao
  • Mau-Chung Frank Chang
  • Jason Cong

Streaming applications make up an important portion of the workloads that FPGAs may accelerate but suffer from inefficient data movement. The inefficiency stems from copying data indirectly into the FPGA DRAM rather than directly into its on-chip memory, substantially diminishing the end-to-end speedup, especially for small workloads (hundreds of kilobytes). AMD Xilinx’s Host Memory IP (HMI) aims to address the data movement problem by exposing to the developer a High-Level Synthesis (HLS) interface that moves data from the host directly to the FPGA’s on-chip memory. However, using HMI purely for its interface without additional code changes incurs a 3.3x slowdown compared with the current programming model. The slowdown mainly originates from OpenCL call overhead and the kernel control logic unnecessarily switching states. To overcome these issues, we propose the Host Memory Library (HMLib), an efficient HLS-based library that facilitates data transfer on behalf of the user. HMLib not only optimizes the runtime stack for efficient data transfer but also provides HLS-compatible and user-friendly interfaces. We demonstrate HMLib’s effectiveness for streaming applications (Deflate compression and CRC32) with improvements of up to 36.2X over OpenCL-DDR and up to 79.5X over raw HMI for small-scale data, while maintaining little-to-no performance loss for large-scale inputs. We plan to open source our work in the future.

An Efficient High-Speed FFT Implementation

  • Ross Martin

This poster introduces the “BxBFFT” parallel-pipelined Fast Fourier Transform (FFT), which achieves higher clock speeds (Fmax) than competitors with substantial savings in power and logic resources. In comparisons with the Xilinx SSR FFT, Spiral FFT, Astron FFT, and ZipCPU FFT, the BxBFFT had clock speeds above 650MHz in cases where all others were below 300MHz, and its LUT usage and power were lower by a factor of ~1.5. The BxBFFT also had faster Vivado implementation and faster RTL simulation, for improved productivity in design, testing, and verification; BxBFFT simulations were over 10 times faster than the Xilinx SSR FFT. The BxBFFT supports more features than other FFTs, including real-to-complex FFTs, non-power-of-2 FFTs, and features for high reliability in adverse environments. The BxBFFT’s improved performance has been verified in real applications: one customer design had to operate with a reduced workload due to excessive current draw of the Xilinx SSR FFT, and a quick replacement of the Xilinx SSR FFT with the BxBFFT lowered die temperature by 34.8 degrees Celsius and allowed the design to operate under full load. The source of the BxBFFT’s performance is intensive optimization of well-known FFT algorithms, not new algorithms. The BxBFFT’s coding style gives better control over synthesis to avoid and resolve performance bottlenecks, and automated generation of top-level code supports 13 choices of radix and 2 choices of data flow at each stage, to make optimal choices for each BxBFFT size. This results in a highly efficient FFT.

Weave: Abstraction for Accelerator Integration of Generated Modules

  • Tuo Dai
  • Bizhao Shi
  • Guojie Luo

As domain-specific accelerators demand multiple functional components for complex applications in a domain, the conventional wisdom for effective development involves module decomposition, module implementation, and module integration. In the recent decade, the generator-based design methodology improves the productivity of module implementation. However, with the guidance of current abstractions, it is difficult to integrate modules implemented by generators because of implicit interface definition, non-unified performance modeling, and fragmented memory management. These disadvantages cause low productivity of the integration flow and low performance of the integrated accelerators.

To address these drawbacks, we propose Weave, an abstraction for the integration of generated modules that facilitates an agile design flow for domain-specific accelerators. The Weave abstraction enables the formulation and automation of optimizing a unified performance model under the resource constraints of all modules. We also design a hierarchical memory management method with corresponding interfaces to integrate modules under the guidance of the modular abstraction. In the experiments, an accelerator developed with Weave achieves 2.17× higher performance in the deep learning domain compared with an open-source accelerator, and the integrated accelerator attains 88.9% of the peak performance of the generated accelerators.

A Novel FPGA Simulator Accelerating Reinforcement Learning-Based Design of Power Converters

  • Zhenyu Xu
  • Miaoxiang Yu
  • Qing Yang
  • Yeonho Jeong
  • Tao Wei

High-efficiency energy conversion systems have become increasingly important due to their wide use in all electronic systems such as data centers, smart mobile devices, E-vehicles, medical instruments, and so forth. Complex and interdependent parameters make optimal designs of power converters challenging to get. Recent research has shown that reinforcement learning (RL) shows great promise in the design of such converter circuits. A trained RL agent can search for optimal design parameters for power conversion circuit topologies under targeted application requirements. Training an RL agent requires numerous circuit simulations. As a result, they may take days to complete, primarily because of the slow time-domain circuit simulation.

This abstract proposes a new FPGA architecture that accelerates the circuit simulation and hence substantially speeds up the RL-based design method for power converters. Our new architecture supports all power electronic circuit converters and their variations. It substantially improves the training speed of RL-based design methods. High-level synthesis (HLS) was used to build the accelerator on Amazon Web Service (AWS) F1 instance. An AWS virtual PC hosts the training algorithm. The host interacts with the FPGA accelerator by updating the circuit parameters, initiating simulation, and collecting the simulation results during training iterations. A script was created on the host side to facilitate this design method to convert a netlist containing circuit topology and parameters into core matrices in the FPGA accelerator. Experimental results showed 60x overall speedup of our RL-based design method in comparison with using a popular commercial simulator, PowerSim.

A Fractal Astronomical Correlator Based on FPGA Cluster with Scalability

  • Lin Shu
  • Long Xiao
  • Yafang Song
  • Qiuxiang Fan
  • Guitian Fang
  • Jie Hao

Correlation is a highly computationally intensive and data-intensive signal processing application that is used heavily in radio astronomy for imaging and other measurements. For example, the next-generation radio telescope, Square Kilometer Array Low (SKA-L), needs a correlator that calculates up to 22 million cross products, a real-time system with continuous input data rates of 6 terabits per second and equivalent computation of 2 peta-operations per second. A flexible and scalable solution with high performance per watt is therefore urgently needed. In this work, a flexible FX correlation architecture based on an FPGA cluster is proposed, which can be made fractal at the subsystem, engine, and calculation-module levels, simplifying the data distribution network and increasing the system’s scalability. The interconnect network between processing engines is a new two-stage solution, using self-developed data redistribution hardware to decouple full-bandwidth correlation into several independent sub-band computations. The most intensive calculations, the cross-multiplications among all antennas, are modularly designed in MATLAB Simulink and AMD Xilinx System Generator and are parametrized to scale to arbitrary antenna numbers with optional parallel granularity, minimizing development effort on different FPGAs or for different applications. Moreover, a fully FPGA-based FX correlator for a large array with 202 antennas, consisting of 26 F Engines based on AMD Xilinx Kintex-7 325T FPGAs and 13 X Engines based on AMD Xilinx Kintex UltraScale KU115 FPGAs, was deployed in 2022; to the best of our knowledge, it is the largest fully FPGA-based astronomical correlator.

Power Side-channel Countermeasures for ARX Ciphers using High-level Synthesis

  • Saya Inagaki
  • Mingyu Yang
  • Yang Li
  • Kazuo Sakiyama
  • Yuko Hara-Azumi

In the era of the Internet of Things (IoT), edge devices are considerably diversified and are often designed using high-level synthesis (HLS) to improve design productivity. A problem here is that HLS tools were originally developed in a security-unaware fashion, inducing vulnerabilities to power side-channel attacks (PSCA), a serious threat in IoT. Although PSCA vulnerabilities induced by HLS tools have recently started to be discussed, the effects and applicability of existing methods for PSCA-resistant design using HLS are so far limited. In this paper, we propose a novel HLS-based design method for PSCA-resistant ciphers in hardware. Focusing particularly on lightweight block ciphers composed of Addition-Rotation-XOR (ARX)-based permutations, we study the effects of applying “threshold implementation”, one of the provably secure countermeasures against PSCA, to behavioral descriptions of the ciphers. In addition, we tune the scheduling optimization of HLS tools that might cause power side-channel leakage. In our experiments, using ARX-based ciphers (Chaskey, Simon, and Speck) as benchmarks, we implemented the unprotected and protected circuits on an FPGA and evaluated the PSCA vulnerability using Welch’s t-test. The results demonstrate that our proposed method successfully mitigates vulnerabilities to PSCA for all benchmarks. From these results, we provide further discussion on the direction of PSCA countermeasures based on HLS.

Single-Batch CNN Training using Block Minifloats on FPGAs

  • Chuliang Guo
  • Binglei Lou
  • Xueyuan Liu
  • David Boland
  • Philip H.W. Leong

Training convolutional neural networks remains a challenge on resource-limited edge devices due to intensive computation, large storage requirements, and high bandwidth. Error back-propagation, gradient generation, and weight update usually require high precision to guarantee model accuracy, which places a further burden on computation and bandwidth. This paper presents the first parallel FPGA CNN training accelerator with block minifloat datatypes. We first propose a heuristic bit-width allocation technique to derive a unified 8-bit block minifloat format with a sign bit, 2 exponent bits, and 5 mantissa bits. In contrast to previous techniques, the same data format is used for weights, activations, errors, and gradients. Using this format, accuracy similar to 32-bit single-precision floating point is achieved, which simplifies the FPGA-based design of computational units such as multiply-and-add. In addition, we propose a unified Conv block to handle Conv and transposed Conv in the forward and backward paths, respectively, and a dilated Conv block with a weight-kernel partition scheme for gradient generation. Both Conv blocks support non-unit stride, which is crucial for the residual connections that appear in modern CNNs. For training ResNet20 on the CIFAR-10 dataset with a batch size of 1, our accelerator on a Xilinx UltraScale+ ZCU102 FPGA achieves state-of-the-art single-batch throughput of 144.64 and 192.68 GOPs with and without batch normalisation layers, respectively.
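
To illustrate the 8-bit block minifloat idea (a sign bit, 2 exponent bits, and 5 mantissa bits interpreted relative to a shared per-block exponent bias), here is a small software sketch. The bias selection, subnormal handling, and rounding below are simplifications chosen for clarity and do not reproduce the paper's exact hardware format or its heuristic bit-width allocation.

    import math

    EXP_BITS, MAN_BITS = 2, 5
    E_MAX = (1 << EXP_BITS) - 1      # exponent codes 1..3 are normal, 0 is subnormal/zero
    M_MAX = (1 << MAN_BITS) - 1

    def encode_block(xs):
        """Quantize a block of floats to a shared-exponent 1-2-5 minifloat layout (illustrative)."""
        max_mag = max(abs(x) for x in xs)
        # shared block bias: place the largest magnitude at the top exponent code
        bias = (math.floor(math.log2(max_mag)) - (E_MAX - 1)) if max_mag > 0 else 0
        codes = []
        for x in xs:
            sign, mag = (1 if x < 0 else 0), abs(x)
            if mag == 0:
                codes.append((sign, 0, 0))
                continue
            t = math.floor(math.log2(mag))
            if t < bias:                       # too small for a normal code: subnormal
                e = 0
                m = min(round(mag / 2.0 ** bias * (M_MAX + 1)), M_MAX)
            else:
                e = min(t - bias + 1, E_MAX)   # saturate exponents above the block range
                m = min(round((mag / 2.0 ** (e - 1 + bias) - 1.0) * (M_MAX + 1)), M_MAX)
            codes.append((sign, e, m))
        return bias, codes

    def decode_block(bias, codes):
        out = []
        for sign, e, m in codes:
            if e == 0:                         # subnormal (covers exact zero)
                mag = m / (M_MAX + 1) * 2.0 ** bias
            else:
                mag = (1.0 + m / (M_MAX + 1)) * 2.0 ** (e - 1 + bias)
            out.append(-mag if sign else mag)
        return out

    bias, codes = encode_block([0.30, -1.75, 0.0, 0.02])
    print(decode_block(bias, codes))   # approximate reconstruction of the inputs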

SESSION: Session: Applications and Design Studies I

A Study of Early Aggregation in Database Query Processing on FPGAs

  • Mehdi Moghaddamfar
  • Norman May
  • Christian Färber
  • Wolfgang Lehner
  • Akash Kumar

In database query processing, aggregation is an operator by which data with a common property is grouped and expressed in a summary form. Early aggregation is a popular method for improving the performance of the aggregation operator. In this paper, we study early aggregation algorithms in the context of accelerating query processing for database systems on FPGAs. Our comparative study leads us to set-associative caches with a low inter-reference recency set (LIRS) replacement policy, which show both great performance and modest implementation complexity compared to some of the most prominent early aggregation algorithms. We also present a novel application-specific architecture for implementing set-associative caches. Benchmarks of our implementation show speedups of up to 3x for end-to-end aggregation compared to a state-of-the-art FPGA-based query engine.
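
As a software analogue of the early-aggregation cache studied here, the sketch below maintains per-set partial aggregates and, on eviction, emits the victim's partial sum downstream to the final aggregation stage. For brevity it uses LRU replacement within each set instead of the LIRS policy the paper adopts, and the class and method names are illustrative, not the paper's architecture.

    from collections import OrderedDict

    class AggregationCache:
        """Set-associative early-aggregation cache (software sketch, LRU per set)."""

        def __init__(self, num_sets=1024, ways=4):
            self.num_sets, self.ways = num_sets, ways
            self.sets = [OrderedDict() for _ in range(num_sets)]
            self.evicted = []                       # partial aggregates sent downstream

        def add(self, key, value):
            s = self.sets[hash(key) % self.num_sets]
            if key in s:                            # hit: aggregate in place
                s[key] += value
                s.move_to_end(key)
                return
            if len(s) >= self.ways:                 # miss on a full set: evict the LRU entry
                self.evicted.append(s.popitem(last=False))
            s[key] = value

        def flush(self):
            for s in self.sets:
                while s:
                    self.evicted.append(s.popitem(last=False))
            return self.evicted

    cache = AggregationCache(num_sets=4, ways=2)
    for key, val in [(10, 1), (20, 2), (10, 3), (30, 4)]:
        cache.add(key, val)
    print(cache.flush())   # partial sums per key: [(20, 2), (10, 4), (30, 4)]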

FNNG: A High-Performance FPGA-based Accelerator for K-Nearest Neighbor Graph Construction

  • Chaoqiang Liu
  • Haifeng Liu
  • Long Zheng
  • Yu Huang
  • Xiangyu Ye
  • Xiaofei Liao
  • Hai Jin

The k-nearest neighbor graph has emerged as a key data structure for many critical applications. However, it can be notoriously challenging to construct k-nearest neighbor graphs over large graph datasets, especially with high-dimensional vector features. Many solutions have recently been proposed to support the construction of k-nearest neighbor graphs, but they involve substantial memory access and computational overheads, and an architecture-level solution is still absent. To address these issues, we architect FNNG, the first FPGA-based accelerator to support k-nearest neighbor graph construction. Specifically, FNNG is equipped with a block-based scheduling technique to exploit the inherent data locality between vertices: it divides vertices that are close in space into blocks and processes the vertices at the granularity of blocks during construction. FNNG also adopts a useless-computation-aborting technique to identify superfluous computations, keeping the existing maximum similarity values of all vertices inside the computing unit. In addition, we propose an improved architecture in order to fully utilize both techniques. We implement FNNG on the Xilinx Alveo U280 FPGA card. The results show that FNNG achieves 190x and 2.1x speedups over the state-of-the-art CPU and GPU solutions, running on an Intel Xeon Gold 5117 CPU and an NVIDIA GeForce RTX 3090 GPU, respectively.

ACTS: A Near-Memory FPGA Graph Processing Framework

  • Wole Jaiyeoba
  • Nima Elyasi
  • Changho Choi
  • Kevin Skadron

Despite the high off-chip bandwidth and on-chip parallelism offered by today’s near-memory accelerators, software-based (CPU and GPU) graph processing frameworks still suffer performance degradation from under-utilization of available memory bandwidth, because graph traversal often exhibits poor locality. Emerging FPGA-based graph accelerators tackle this challenge by designing specialized graph processing pipelines and application-specific memory subsystems to maximize bandwidth utilization and efficiently use high-speed on-chip memory. To use the limited on-chip (BRAM) memory effectively while handling larger graph sizes, several FPGA-based solutions resort to some form of graph slicing or partitioning during preprocessing to stage vertex property data into the BRAM. While this has demonstrated performance superiority for small graphs, the approach breaks down with larger graph sizes. For example, GraphLily [19], a recent high-performance FPGA-based graph accelerator, experiences up to 11X performance degradation between graphs having 3M vertices and 28M vertices. This makes prior FPGA approaches impractical for large graphs.

We propose ACTS, an HBM-enabled FPGA graph accelerator, to address this problem. Rather than partitioning the graph offline to improve spatial locality, we partition vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This optimizes read bandwidth even as the graph size scales. We compare ACTS against Gunrock, a state-of-the-art graph processing accelerator for the GPU, and GraphLily, a recent FPGA-based graph accelerator also utilizing HBM memory. Our results show a geometric mean speedup of 1.5X (maximum 4.6X) over Gunrock and a geometric mean speedup of 3.6X (maximum 16.5X) over GraphLily. Our results also show a geometric mean power reduction of 50% and a mean reduction in energy-delay product of 88% over Gunrock.

Exploring the Versal AI Engines for Accelerating Stencil-based Atmospheric Advection Simulation

  • Nick Brown

AMD Xilinx’s new Versal Adaptive Compute Acceleration Platform (ACAP) is an FPGA architecture combining reconfigurable fabric with other on-chip hardened compute resources. AI engines are one such resource; by operating in a highly vectorized manner, they provide significant raw compute that is potentially beneficial for a range of workloads, including HPC simulation. However, this technology is still in its early stages and as yet unproven for accelerating HPC codes, with a lack of benchmarking and best practice.

This paper presents an experience report exploring the porting of the Piacsek and Williams (PW) advection scheme onto the Versal ACAP, using the chip’s AI engines to accelerate the compute. A stencil-based algorithm, advection is commonplace in atmospheric modelling, including in several Met Office codes, where this scheme was initially developed. Using this algorithm as a vehicle, we explore optimal approaches for structuring AI engine compute kernels and how best to interface the AI engines with programmable logic. Evaluating performance using a VCK5000 against non-AI-engine FPGA configurations on the VCK5000 and Alveo U280, as well as a 24-core Xeon Platinum Cascade Lake CPU and an Nvidia V100 GPU, we found that, whilst the number of channels between the fabric and the AI engines is a limitation, by leveraging the ACAP we can double performance compared to an Alveo U280.

SESSION: Session: Architecture, CAD, and Circuit Design

Regularity Matters: Designing Practical FPGA Switch-Blocks

  • Stefan Nikolic
  • Paolo Ienne

Several techniques have been proposed for automatically searching for FPGA switch-blocks which typically show some tangible advantage over the well-known academic architectures. However, the resulting switch-blocks usually exhibit high levels of irregularity, in contrast with what can be observed in a typical commercial architecture. One wonders whether the architectures produced by such search methods can actually be laid out in an efficient manner while retaining the perceived gains. In this work, we propose a new switch-block exploration method that enhances a recently published search algorithm by combining it with ILP, in order to guarantee that the obtained solution satisfies arbitrary regularity constraints. We measure the impact of regularity constraints commonly seen in industrial architectures (such as limiting the number of different multiplexer sizes or forced sharing of inputs between pairs of multiplexers) and observe benefits to the routability of complex circuits with only a limited reduction in performance.

Turn on, Tune in, Listen up: Maximizing Side-Channel Recovery in Time-to-Digital Converters

  • Colin Drewes
  • Olivia Weng
  • Keegan Ryan
  • Bill Hunter
  • Christopher McCarty
  • Ryan Kastner
  • Dustin Richmond

Voltage fluctuation sensors measure minute changes in an FPGA power distribution network, allowing attackers to extract information from concurrently executing computations. Previous voltage fluctuation sensors make assumptions about the co-tenant computation and require the attacker to have a priori access or system knowledge to tune the sensor parameters statically. We present the open-source design of the Tunable Dual-Polarity Time-to-Digital Converter, which introduces three dynamically tunable parameters that optimize signal measurement, including the transition polarity, sample window, frequency, and phase. We show that a properly tuned sensor improves co-tenant classification accuracy by 2.5× over prior work and increases the ability to identify the co-tenant computation and its microarchitectural implementation. Across 13 varying applications, our techniques yield an 80% classification accuracy that generalizes beyond a single board. Finally, our sensor improves the ability of a correlation power analysis attack to rank correct subkey values by 2×.

Post-Radiation Fault Analysis of a High Reliability FPGA Linux SoC

  • Andrew Elbert Wilson
  • Nathan Baker
  • Ethan Campbell
  • Jackson Sahleen
  • Michael Wirthlin

FPGAs are increasingly being used in space and other harsh radiation environments. However, SRAM-based FPGAs are susceptible to radiation in these environments and experience upsets within the configuration memory (CRAM), causing design failure. The effects of CRAM upsets can be mitigated using triple-modular redundancy and configuration scrubbing. This work investigates the reliability of a soft RISC-V SoC system executing the Linux operating system mitigated by TMR and configuration scrubbing. In particular, this paper analyzes the failures of this triplicated system observed at a high-energy neutron radiation experiment. Using a bitstream fault analysis tool, the failures of this system caused by CRAM upsets are traced back to the affected FPGA resource and design logic. This fault analysis identifies the interconnect and I/O as the most vulnerable FPGA resources and the DDR controller logic as the design logic most likely to cause a failure. By identifying the FPGA resources and design logic causing failures in this TMR system, additional design enhancements are proposed to create a more reliable design for harsh radiation environments.

FPGA Technology Mapping with Adaptive Gate Decomposition

  • Longfei Fan
  • Chang Wu

Most existing technology mapping algorithms use graph covering approaches and suffer from the netlist structural bias problem. Chen and Cong proposed a simultaneous simple gate decomposition and technology mapping algorithm that encodes many gate decomposition choices into the netlist. However, their algorithm suffers from long runtimes due to the large set of choices. Later on, A. Mishchenko et al. proposed a mapping algorithm based on choice generation with so-called lossless synthesis. Nevertheless, their algorithm cannot guarantee to find and keep all good choices a priori before mapping. In this paper, we propose a simultaneous mapping with gate decomposition algorithm named AGDMap. Our algorithm uses a cut-enumeration-based engine. A bin packing algorithm is used for simple gate decomposition during cut enumeration, and input-sharing-based cut cost computation is used during iterative cut selection to reduce logic duplication. On a set of EPFL benchmark suite designs and HLS-generated designs, our algorithm produces results with significant area improvements. Compared with the state-of-the-art ABC lossless synthesis algorithm, for area optimization our average improvement is 12.4%; for delay optimization, we obtain results with similar delay and 9.2% area reduction.
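
The abstract does not spell out how simple gates are decomposed during cut enumeration; as a hedged illustration of the bin-packing idea it mentions, the toy sketch below (our own formulation, not AGDMap's) packs the fan-ins of a wide simple gate into K-input bins using first-fit decreasing, so that each bin can become a LUT-sized sub-gate.

```python
def decompose_wide_gate(fanin_arrival, k):
    """First-fit-decreasing packing of a wide simple gate's fan-ins into
    K-input bins; each bin becomes one level of a decomposition tree.
    fanin_arrival: list of (signal, arrival_time) pairs."""
    items = sorted(fanin_arrival, key=lambda s: s[1], reverse=True)
    bins = []
    for sig, t in items:
        for b in bins:
            if len(b) < k:
                b.append((sig, t))
                break
        else:
            bins.append([(sig, t)])
    return bins   # each bin feeds one K-input sub-gate of the same function

# Example: a 10-input AND with K=4 packs into 3 sub-ANDs (4 + 4 + 2 inputs),
# whose outputs are packed again until a single output remains.
```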

FPGA Mux Usage and Routability Estimates without Explicit Routing

  • Jonathan W. Greene

A new algorithm is proposed to rapidly evaluate an FPGA routing architecture without the need to explicitly route benchmark applications. The algorithm takes as input a probability distribution of nets to be accommodated and a description of an architecture. It produces an estimate of the usage of each type of mux in the FPGA (including intra-cluster muxes), valuable feedback to the architect. The estimates are shown to correlate with actual routed applications in both academic and commercial architectures. This is due in part to the algorithm’s novel ability to account for long and multi-fanout nets. Run time is reduced by applying periodic graphs to model the FPGA’s regular structure.

We then show how Percolation Theory (a branch of statistical physics) can be applied to elucidate the relationship between mux usage and routability. We show that any blockages when routing a net are most likely to occur in the neighborhood of its terminals, and demonstrate a quantitative relationship among the post-placement wirelength of an application, the percolation threshold of an architecture, and the channel width required to map the application to the architecture. Supporting experimental data is provided.

SESSION: Banquet and Panel

Open-source and FPGAs: Hardware, Software, Both or None?

  • Dana How
  • Tim Ansell
  • Vaughn Betz
  • Chris Lavin
  • Ted Speers
  • Pierre-Emmanuel Gaillardon

Following in the footsteps of the open-source software movement that is at the foundation of many fundamental infrastructures today (e.g., Linux and the internet), a growing number of open-source hardware initiatives have been impacting our field, e.g., the RISC-V ISA and open chiplet standards.

SESSION: Keynote II

FPGAs and Their Evolving Role in Domain Specific Architectures: A Case Study of the AMD 400G Adaptive SmartNIC/DPU SoC

  • Jaideep Dastidar

Domain Specific Architectures (DSA) typically apply heterogeneous compute elements such as FPGAs, GPUs, AI Engines, TPUs, etc. towards solving domain-specific problems, and have their accompanying Domain Specific Software. FPGAs have played a prominent role in DSAs for AI, Video Transcoding, Network Acceleration etc. This talk will start with a brief historical survey of FPGAs in DSAs and an emerging trend in Domain Specific Accelerators, where the programmable logic element is paired with other heterogeneous compute or acceleration elements. The talk will then present a case study of AMD’s 400G Adaptive SmartNIC/DPU SoC and the considerations that went into that DSA. The case study includes where, why, and how the programmable logic element was paired with other hardened offload accelerators and embedded processors with the goal of striking the right balance between Software Processing on the embedded cores, Fastpath ASIC-like processing on the Hardened Accelerators, and Adaptive and Composable processing on the integrated FPGA. The talk will describe the data movement between various network, storage and interface acceleration elements and their shared and private memory resources. Throughout the talk, we will focus on the tradeoffs between the FPGA element and the rest of the heterogeneous compute or acceleration elements as they apply to SmartNIC/DPU offload acceleration.

SESSION: Session: Deep Learning

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

  • Jinming Zhuang
  • Jason Lau
  • Hanchen Ye
  • Zhuoping Yang
  • Yubo Du
  • Jack Lo
  • Kristof Denolf
  • Stephen Neuendorffer
  • Alex Jones
  • Jingtong Hu
  • Deming Chen
  • Jason Cong
  • Peipei Zhou

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes?

We identify the biggest system throughput bottleneck resulting from the mismatch of massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently towards different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, MLP, respectively, which obtain 5.40x, 32.51x, 1.00x and 1.00x throughput gains compared to one monolithic accelerator.
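
CHARM's analytical models and design space exploration are not reproduced here; the toy sketch below only illustrates the general idea of assigning MM layers of very different sizes to a few differently shaped accelerators instead of one monolithic array, using a crude roofline-style latency model. The model, the layer shapes, and the accelerator configurations are all illustrative assumptions of ours.

```python
def mm_latency(M, K, N, peak_macs_per_cycle, bw_words_per_cycle):
    """Crude roofline-style model: a layer is either compute-bound or
    bandwidth-bound on a given accelerator instance."""
    compute = M * K * N / peak_macs_per_cycle
    traffic = M * K + K * N + M * N            # words moved, ignoring reuse
    return max(compute, traffic / bw_words_per_cycle)

def assign_layers(layers, accels):
    """Greedy: run each MM layer on the accelerator that models fastest;
    accelerators work concurrently, so makespan ~ max per-accelerator load."""
    load = [0.0] * len(accels)
    for (M, K, N) in layers:
        lat = [mm_latency(M, K, N, *a) for a in accels]
        best = min(range(len(accels)), key=lambda i: lat[i])
        load[best] += lat[best]
    return max(load)

bert_like = [(512, 768, 768), (512, 768, 3072), (64, 64, 512)]
mono = [(4096, 16)]                    # one monolithic array (macs, bandwidth)
duo = [(3072, 12), (1024, 4)]          # one large + one small accelerator
print(assign_layers(bert_like, mono), assign_layers(bert_like, duo))
```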

Approximate Hybrid Binary-Unary Computing with Applications in BERT Language Model and Image Processing

  • Alireza Khataei
  • Gaurav Singh
  • Kia Bazargan

We propose a novel method for approximate hardware implementation of univariate math functions with significantly fewer hardware resources compared to previous approaches. Examples of such functions include exp(x) and the activation function GELU(x), both used in transformer networks, gamma(x), which is used in image processing, and other functions such as tanh(x), cosh(x), sq(x), and sqrt(x). The method builds on previous works on hybrid binary-unary computing. The novelty in our approach is that we break a function into a number of sub-functions such that implementing each sub-function becomes cheap, and converting the output of the sub-functions to binary becomes almost trivial. Our method also uses self-similarity in functions to further reduce the cost. We compare our method to the conventional binary, previous stochastic computing, and hybrid binary-unary methods on several functions at 8-, 12-, and 16-bit resolutions. While preserving high accuracy, our method outperforms previous works in terms of hardware cost; e.g., at a mean absolute error below 0.01, our method reduces the (area x latency) cost on average by 5, 7, and 2 orders of magnitude compared to the conventional binary, stochastic computing, and hybrid binary-unary methods, respectively. Ultimately, we demonstrate the potential benefits of our method for natural language processing and image processing applications. We deploy our method to implement major blocks in an encoding layer of the BERT language model, and also the Roberts Cross edge detection algorithm. Both include non-linear functions.
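
The key idea of breaking a univariate function into sub-functions whose outputs are cheap to encode can be illustrated, very loosely, by a piecewise table in which each segment stores one base value plus small per-point offsets. This is only a software analogue under our own assumptions (segment count, fixed-point scale, no unary encoding), not the paper's hardware scheme.

```python
import math

def build_segmented_table(f, x_min, x_max, n_seg, pts_per_seg, scale=256):
    """Split f into n_seg sub-functions; each segment stores one base value
    plus small per-point deltas, which are much cheaper to store than
    full-width outputs."""
    segs = []
    step = (x_max - x_min) / (n_seg * pts_per_seg)
    for s in range(n_seg):
        xs = [x_min + (s * pts_per_seg + i) * step for i in range(pts_per_seg)]
        ys = [round(f(x) * scale) for x in xs]
        base = ys[0]
        segs.append((base, [y - base for y in ys]))   # small deltas only
    return segs, step

def lookup(segs, step, x_min, pts_per_seg, x, scale=256):
    idx = int((x - x_min) / step)
    s, i = divmod(idx, pts_per_seg)
    base, deltas = segs[s]
    return (base + deltas[i]) / scale

gelu = lambda x: 0.5 * x * (1 + math.erf(x / math.sqrt(2)))
segs, step = build_segmented_table(gelu, -4.0, 4.0, n_seg=8, pts_per_seg=32)
print(lookup(segs, step, -4.0, 32, 1.0), gelu(1.0))   # table value vs. exact
```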

Accelerating Neural-ODE Inference on FPGAs with Two-Stage Structured Pruning and History-based Stepsize Search

  • Lei Cai
  • Jing Wang
  • Lianfeng Yu
  • Bonan Yan
  • Yaoyu Tao
  • Yuchao Yang

Neural ordinary differential equation (Neural-ODE) outperforms conventional deep neural networks (DNNs) in modeling continuous-time or dynamical systems by adopting numerical ODE integration onto a shallow embedded NN. However, Neural-ODE suffers from slow inference due to the costly iterative stepsize search in numerical integration, especially when using higher-order Runge-Kutta (RK) methods and smaller error tolerance for improved integration accuracy. In this work, we first present algorithmic techniques to speedup RK-based Neural-ODE inference: a two-stage coarse-grained/fine-grained structured pruning method based on top-K sparsification that reduces the overall computations by more than 60% in the embedded NN and a history-based stepsize search method based on past integration steps that reduces the latency for reaching accepted stepsize by up to 77% in RK methods. A reconfigurable hardware architecture is co-designed based on proposed speedup techniques, featuring three processing loops to support programmable embedded NN and a variety of higher-order RK methods. Sparse activation processor with multi-dimensional sorters is designed to exploit structured sparsity in activations. Implemented on a Xilinx Virtex-7 XC7VX690T FPGA and experimented on a variety of datasets, the prototype accelerator using a more complex 3rd-order RK method achieves more than 2.6x speedup compared to the latest Neural-ODE FPGA accelerator using the simplest Euler method. Compared to a software execution on Nvidia A100 GPU, the inference speedup can be up to 18x.
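
As a rough sketch of the two-stage pruning idea (coarse-grained structured pruning followed by fine-grained top-K sparsification), the numpy snippet below prunes a weight matrix in two passes. The granularity chosen here (rows, then elements per row) is our own illustrative assumption, not the paper's exact scheme.

```python
import numpy as np

def topk_row_prune(W, k):
    """Coarse-grained stage: keep only the k rows (output neurons) of W
    with the largest L2 norm, zeroing the rest."""
    norms = np.linalg.norm(W, axis=1)
    keep = np.argsort(norms)[-k:]
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[keep] = True
    return W * mask[:, None], mask

def topk_elem_prune(W, k_per_row):
    """Fine-grained stage: within each surviving row, keep only the
    k_per_row largest-magnitude weights."""
    Wp = np.zeros_like(W)
    for r in range(W.shape[0]):
        idx = np.argsort(np.abs(W[r]))[-k_per_row:]
        Wp[r, idx] = W[r, idx]
    return Wp

W = np.random.randn(64, 64)
W1, _ = topk_row_prune(W, 24)
W2 = topk_elem_prune(W1, 16)   # well over 60% of the entries are now zero
```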

SESSION: Session: FPGA-Based Computing Engines

hAP: A Spatial-von Neumann Heterogeneous Automata Processor with Optimized Resource and IO Overhead on FPGA

  • Xuan Wang
  • Lei Gong
  • Jing Cao
  • Wenqi Lou
  • Weiya Wang
  • Chao Wang
  • Xuehai Zhou

Regular expression (REGEX) matching tasks drive much research on automata processors (APs). Among them, the von Neumann AP can efficiently utilize on-chip memory to process Deterministic Finite Automata (DFA), but it is limited to small REGEX sets due to the DFA’s state explosion problem. For large REGEX sets, the spatial AP based on Nondeterministic Finite Automata (NFA) is the mainstream choice. However, there are two problems with previous FPGA-based spatial APs. First, they cannot achieve balanced FPGA resource usage (LUTs and BRAMs), which easily leads to resource shortages. Second, to compress the report output data of large REGEX sets, they use dynamic report compression, which not only consumes many FPGA resources but also limits performance.

This paper optimizes the resource and IO overhead of spatial APs. First, noticing the resource optimization ability of the von Neumann AP, we propose the flex-hybrid-FA algorithm to generate small hybrid-FAs (an NFA/DFA hybrid model) and further propose the Spatial-von Neumann Heterogeneous AP to deploy hybrid-FAs. Under the constraints of the flex-hybrid-FA algorithm, we can obtain balanced and efficient FPGA resource usage. Second, we propose High-Efficient Automata Report Compression (HEARC) with a compression ratio of 5.5-47.6x, which removes the performance limit imposed by IO congestion and consumes fewer FPGA resources than previous dynamic report compression approaches. As far as we know, this is the first work to deploy large REGEX sets on low-cost small-scale FPGAs (e.g. Xilinx XCZU3CG). The experimental results show that, compared to previous FPGA-based APs, we reduce power consumption by 4.0-6.6x and improve energy efficiency by 2.7-5.9x.

CSAIL2019 Crypto-Puzzle Solver Architecture

  • Sergey Gribok
  • Bogdan Pasca
  • Martin Langhammer

The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these types of puzzles requires large amounts of intrinsically sequential computation (i.e. computation which cannot be parallelized), with each iteration performing a very large (3072-bit in the case of CSAIL2019) modular multiplication operation. The complexity of each iteration is several times greater than that of known FPGA implementations, and the number of iterations has been increased by about 1000x compared to LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone, versions of the puzzle have been specified.

In this paper, we present an FPGA architecture for the CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method, which is flexible and can fit on a wide variety of sizes of current FPGAs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 21 out of the 28 milestone solutions of the puzzle, which are the first reported results for this problem.
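
For readers unfamiliar with the workload, LCS35/CSAIL2019-style time-lock puzzles boil down to repeated modular squaring, which is what makes the computation intrinsically sequential; the accelerator's job is to make each wide (3072-bit) modular multiply as fast as possible. A minimal software reference, with a toy modulus rather than the real 3072-bit puzzle instance, looks like this:

```python
def timelock(n, t, base=2):
    """Compute base^(2^t) mod n by t sequential modular squarings.
    Each iteration depends on the previous one, so the only way to go
    faster is to make each modular multiplication itself faster."""
    x = base % n
    for _ in range(t):
        x = (x * x) % n
    return x

# Toy modulus and iteration count; the real puzzle uses a 3072-bit modulus
# and an enormous t, which is why milestone versions exist.
print(timelock(n=1_000_003 * 1_000_033, t=10_000))
```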

ENCORE: Efficient Architecture Verification Framework with FPGA Acceleration

  • Kan Shi
  • Shuoxiang Xu
  • Yuhan Diao
  • David Boland
  • Yungang Bao

Verification typically consumes the majority of the time in the hardware development cycle. This is primarily because the multiple iterations needed to debug hardware using software simulation are extremely time-consuming. While FPGAs can be utilised to accelerate the simulation, existing methods either provide limited visibility of design details, or are expensive to check against a reference model dynamically at the system level.

In this paper, we present ENCORE, an FPGA-accelerated framework for processor architecture verification. The design-under-test (DUT) hardware and the corresponding software emulator run simultaneously on the same FPGA with hardened processors. ENCORE embodies hardware modules that dynamically monitor and compare key registers from both the DUT and reference model, pausing the execution if any mismatches are detected. In this case, ENCORE automatically creates snapshots of the current design status, and offloads this to software simulators for further debugging. We demonstrate the performance of ENCORE by running RISC-V processor designs and benchmarks. We show that ENCORE can achieve over 44000x speedup over a traditional software simulation-based approach, while maintaining full visibility and debugging capabilities.

BOBBER: A Prototyping Platform for Batteryless Intermittent Accelerators

  • Vishak Narayanan
  • Rohit Sahu
  • Jidong Sun
  • Henry Duwe

Batteryless systems offer promising platforms to support pervasive, near-sensor intelligence in a sustainable manner. These systems solely rely on ambient energy sources that often provide limited power. One common approach to designing batteryless systems is intermittent execution: a node banks energy into a capacitive store until a threshold voltage is met; the digital components then turn on and consume the banked energy until it is depleted and they die. The limited amount of available energy demands the development of application- and domain-specific accelerators to achieve energy efficiency and timeliness. Given the extremely close relationship between volatile state and intermittent behavior, performing actual system prototyping has been critical for demonstrating the feasibility of intermittent systems. However, no prototyping platform exists for intermittent accelerators. This paper introduces BOBBER, the first implementation of an intermittent FPGA-based accelerator prototyping platform. We demonstrate BOBBER in the optimization and evaluation of a neural network accelerator powered solely by RF energy harvesting.

SESSION: Poster Session II

Adapting Skip Connections for Resource-Efficient FPGA Inference

  • Olivia Weng
  • Gabriel Marcano
  • Vladimir Loncar
  • Alireza Khodamoradi
  • Nojan Sheybani
  • Farinaz Koushanfar
  • Kristof Denolf
  • Javier Mauricio Duarte
  • Ryan Kastner

Deep neural networks employ skip connections, identity functions that combine the outputs of different layers, to improve training convergence; however, these skip connections are costly to implement in hardware. In particular, for inference accelerators on resource-limited platforms, they require extra buffers, increasing not only on- and off-chip memory utilization but also memory bandwidth requirements. Thus, a network that has skip connections costs more to deploy in hardware than one that has none. We argue that, for certain classification tasks, a network’s skip connections are needed for the network to learn but not necessary for inference after convergence. We thus explore removing skip connections from a fully-trained network to mitigate their hardware cost. From this investigation, we introduce a fine-tuning/retraining method that adapts a network’s skip connections, by either removing or shortening them, to make them fit better in hardware with minimal to no loss in accuracy. With these changes, we decrease resource utilization by up to 34% for BRAMs, 7% for FFs, and 12% for LUTs when implemented on an FPGA.
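
A hedged sketch of what adapting a skip connection might look like in training code (PyTorch purely for illustration; the block structure and fine-tuning recipe are our assumptions, not the paper's exact method): train with the skip enabled, then disable it and fine-tune briefly before deploying the skip-free network to hardware.

```python
import torch
import torch.nn as nn

class AdaptableBlock(nn.Module):
    """Residual block whose skip connection can be switched off after
    convergence, so the deployed accelerator needs no extra skip buffer."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.use_skip = True

    def forward(self, x):
        y = self.body(x)
        return torch.relu(y + x) if self.use_skip else torch.relu(y)

# Train normally with use_skip=True, then:
#   for m in model.modules():
#       if isinstance(m, AdaptableBlock):
#           m.use_skip = False
#   fine_tune(model)   # hypothetical helper that retrains for a few epochs
```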

Multi-bit-width CNN Accelerator with Systolic-in-Systolic Dataflow and Single DSP Multiple Multiplication Scheme

  • Mingqiang Huang
  • Yucen Liu
  • Sixiao Huang
  • Kai Li
  • Qiuping Wu
  • Hao Yu

Multi-bit-width neural networks offer a promising route to high-performance yet energy-efficient edge computing, thanks to their balance between software algorithm accuracy and hardware efficiency. To date, FPGAs have been one of the core hardware platforms for deploying various neural networks. However, it is still difficult to fully exploit the dedicated digital signal processing (DSP) blocks in FPGAs for accelerating multi-bit-width networks. In this work, we develop a state-of-the-art multi-bit-width convolutional neural network accelerator with a novel systolic-in-systolic dataflow and a single DSP multiple multiplication (SDMM) INT2/4/8 execution scheme. Multi-level optimizations have also been adopted to further improve the performance, including a group-vector systolic array for maximizing circuit efficiency while minimizing systolic delay, and a differential neural architecture search (NAS) method for generating high-accuracy multi-bit-width networks. The proposed accelerator has been deployed on a Xilinx ZCU102, with NAS-optimized VGG16 and ResNet18 networks as case studies. Average performance on the convolutional layers of VGG16 and ResNet18 is 1289 GOPS and 1155 GOPS, respectively. Throughput for running the full multi-bit-width VGG16 network is 870.73 GOPS at 250 MHz, which exceeds all previous CNN accelerators on the same platform.
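
Single-DSP-multiple-multiplication schemes rely on packing several narrow operands into one wide multiplier so that one DSP slice serves multiple MACs per cycle. The unsigned toy below shows only the basic packing arithmetic; the real INT2/4/8 scheme must also handle sign extension and cross-term correction, which we omit.

```python
def sdmm_two_unsigned(a, b, w, guard_bits=18):
    """Pack two unsigned multiplications a*w and b*w into one wide multiply.
    Requires b*w < 2**guard_bits so the partial products do not overlap."""
    assert b * w < (1 << guard_bits)
    packed = (a << guard_bits) | b           # one wide operand
    prod = packed * w                        # one wide multiplication
    low = prod & ((1 << guard_bits) - 1)     # = b*w
    high = prod >> guard_bits                # = a*w
    return high, low

# Two 4-bit activations sharing one 8-bit weight in a single multiply.
assert sdmm_two_unsigned(a=13, b=9, w=200) == (13 * 200, 9 * 200)
```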

Janus: An Experimental Reconfigurable SmartNIC with P4 Programmability and SDN Isolation

  • Bharat Sukhwani
  • Mohit Kapur
  • Alda Ohmacht
  • Liran Schour
  • Martin Ohmacht
  • Chris Ward
  • Chuck Haymes
  • Sameh Asaad

Disparate deployment models of cloud computing pose varying requirements on cloud infrastructure components such as networking, storage, provisioning, and security. Infrastructure providers need to study these and often create custom infrastructure components to satisfy these requirements. A major challenge in the research and development of these cloud infrastructure solutions, however, is the availability of customizable platforms for experimentation and trade-off analysis of the various hardware and software components. Most platforms are either general purpose or bespoke solutions created to assist a particular task, and are too rigid to allow meaningful customization. In this work, we present a 100G reconfigurable smartNIC prototyping platform called Janus that enables cloud infrastructure research and hardware-software co-design of infrastructure components such as the hypervisor, secure boot, software-defined networking, and distributed storage. The platform provides a path to optimize the stack by offloading functionality from the host x86 to the embedded processor on the smartNIC and to optimize performance by moving pieces to hardware using P4. Further, our platform provides hardware-enforced isolation of the cloud network control plane, thereby securing the control plane from the tenants even for bare-metal deployments.

LAWS: Large-Scale Accelerated Wave Simulations on FPGAs

  • Dimitrios Gourounas
  • Bagus Hanindhito
  • Arash Fathi
  • Dimitar Trenev
  • Lizy John
  • Andreas Gerstlauer

Computing numerical solutions to large-scale scientific computing problems described by partial differential equations is a common task in high-performance computing. Improving their performance and efficiency is critical to exascale computing. Application-specific hardware design is a well-known solution, but the wide range of kernels makes it infeasible to provision supercomputers with accelerators for all applications. This makes reconfigurable platforms a promising direction. In this work, we focus on wave simulations using discontinuous Galerkin solvers as an important class of applications. Existing work using FPGAs is limited to accelerating specific kernels or small problems that fit into FPGA BRAM. We present LAWS, a generic and configurable architecture for large-scale accelerated wave simulation problems running on FPGAs out of DRAM. LAWS exploits fine- and coarse-grain parallelism using a scalable array of application-specific cores, and incorporates novel dataflow optimizations, including prefetching, kernel fusion, and memory layout optimizations, to minimize data transfers and maximize DRAM bandwidth utilization. We further accompany LAWS with an analytical performance model that allows for scaling across technology trends and architecture configurations. We demonstrate LAWS on the simulation of elastic wave equations. Results show that a single FPGA core achieves 69% higher performance than 24 Xeon cores with 13.27x better energy efficiency, when given 1.94x less peak DRAM bandwidth. Scaling to the same peak DRAM bandwidth shows that an FPGA is 3.27x and 1.5x faster than 24 CPU cores and an Nvidia P100 GPU, with 22.3x and 4.53x better efficiency, respectively.

Mitigating the Last-Mile Bottleneck: A Two-Step Approach For Faster Commercial FPGA Routing

  • Shashwat Shrivastava
  • Stefan Nikolic
  • Chirag Ravishankar
  • Dinesh Gaitonde
  • Mirjana Stojilovic

We identified that in modern commercial FPGAs, routing signals from the general interconnect to the inputs of the CLB primitives through a very sparse input interconnect block (IIB) represents a significant runtime bottleneck. This is despite academic research often neglecting the runtime of last-mile routing through the IIB. We propose a two-step routing approach that resolves this bottleneck by leveraging the massive parallelism of today’s compute infrastructure. The main premise that enables massive parallelization is that once the signals are legally routed in the general interconnect (reaching only the inputs of the IIB, but not the final targets), the remaining last-mile routing through the IIB can be completed independently for each FPGA tile.

We ran experiments using ISPD16 and industrial designs to demonstrate the dominant contribution of last-mile routing to the router’s runtime. We used an architectural model closely resembling Xilinx UltraScale FPGAs, which makes it highly representative of the current state of the art. For ISPD16 benchmarks, we observed that when the router is instructed to complete the entire routing, including its last-mile portion, the average number of heap pushes (a machine-agnostic measure of runtime) increases 4.1× compared to a simplified reference in which last-mile routing is neglected. On industrial designs, the number of heap pushes increased 4.4×. Last-mile routing was successfully completed using a SAT-based router in up to 83% of FPGA tiles. With a slight increase in density of IIB connectivity, we were able to bring the completion success rate up to 100%.

Towards a Machine Learning Approach to Predicting the Difficulty of FPGA Routing Problems

  • Andrew David Gunter
  • Steven Wilton

In this poster, we present a Machine Learning (ML) technique to predict the number of iterations needed for a Pathfinder-based FPGA router to complete a routing problem. Given a placed circuit, our technique uses features gathered on each routing iteration to predict if the circuit is routable and how many more iterations will be required to successfully route the circuit. This enables early exit for routing problems which are unlikely to be completed in a target number of iterations. Such early exit may help to achieve a successful route within tractable time by allowing the user to quickly retry the circuit compilation with a different random seed, a modified circuit design, or a different FPGA. We demonstrate our predictor in the VTR 8 framework; compared to VTR’s predictor, our ML predictor incurs lower prediction errors on the Koios Deep Learning benchmark suite. This corresponds with an approximate time saving of 48% from early rejection of unroutable FPGA designs while also successfully completing 5% more routable designs and having a 93% shorter early exit latency.

An FPGA-Based Weightless Neural Network for Edge Network Intrusion Detection

  • Zachary Susskind
  • Aman Arora
  • Alan T. L. Bacellar
  • Diego L. C. Dutra
  • Igor D. S. Miranda
  • Mauricio Breternitz
  • Priscila M. V. Lima
  • Felipe M. G. França
  • Lizy K. John

Algorithms for mobile networking are increasingly being moved from centralized servers towards the edge in order to decrease latency and improve the user experience. While much of this work is traditionally done using ASICs, 6G emphasizes the adaptability of algorithms for specific user scenarios, which motivates broader adoption of FPGAs. In this paper, we propose the FPGA-based Weightless Intrusion Warden (FWIW), a novel solution for detecting anomalous network traffic on edge devices. While prior work in this domain is based on conventional deep neural networks (DNNs), FWIW incorporates a weightless neural network (WNN), a table lookup-based model which learns sophisticated nonlinear behaviors. This allows FWIW to achieve accuracy far superior to prior FPGA-based work at a very small fraction of the model footprint, enabling deployment on small, low-cost devices. FWIW achieves a prediction accuracy of 98.5% on the UNSW-NB15 dataset with a total model parameter size of just 192 bytes, reducing error by 7.9x and model size by 262x vs. LogicNets, the best prior edge-optimized implementation. Implemented on a Xilinx Virtex UltraScale+ FPGA, FWIW demonstrates a 59x reduction in LUT usage with a 1.6x increase in throughput. The accuracy of FWIW comes within 0.6% of the best-reported result in literature (Edge-Detect), a model several orders of magnitude larger. Our results make it clear that WNNs are worth exploring in the emerging domain of edge networking, and suggest that FPGAs are capable of providing the extreme throughput needed.

A Flexible Toolflow for Mapping CNN Models to High Performance FPGA-based Accelerators

  • Yongzheng Chen
  • Gang Wu

There have been many studies on developing automatic tools for mapping CNN models onto FPGAs. However, challenges remain in designing an easy-to-use toolflow. First, the toolflow should be able to handle models exported from various deep learning frameworks and models with different topologies. Second, the hardware architecture should make better use of on-chip resources to achieve high performance. In this work, we build a toolflow upon Open Neural Network Exchange (ONNX) IR to support different DL frameworks. We also try to maximize the overall throughput via multiple hardware-level efforts. We propose to accelerate the convolution operation by applying parallelism not only at the input and output channel level, but also at the output feature map level. Several on-chip buffers and corresponding management algorithms are also designed to leverage abundant memory resources. Moreover, we employ a fully pipelined systolic array running at 400 MHz as the convolution engine, and develop a dedicated bus to implement the im2col algorithm and provide feature inputs to the systolic array. We generated 4 accelerators with different systolic array shapes and compiled 12 CNN models for each accelerator. Deployed on a Xilinx VCU118 evaluation board, the performance of convolutional layers can reach 3267.61 GOPS, which is 99.72% of the ideal throughput (3276.8 GOPS). We also achieve an overall throughput of up to 2424.73 GOPS. Compared with previous studies, our toolflow is more user-friendly. The end-to-end performance of the generated accelerators is also better than that of related work at the same DSP utilization.
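
For background on the im2col bus mentioned above, this is the standard transformation it implements in hardware, shown here in plain numpy as an illustration of the algorithm rather than the accelerator's actual datapath.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Rearrange a (C, H, W) feature map into columns so that convolution
    becomes a single matrix multiply (the form a systolic array consumes)."""
    c, h, w = x.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    idx = 0
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, idx] = patch.reshape(-1)
            idx += 1
    return cols

# A conv layer with weights W of shape (K, C, kh, kw) then reduces to
#   Y = W.reshape(K, -1) @ im2col(x, kh, kw)    -> shape (K, oh*ow)
```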

Senju: A Framework for the Design of Highly Parallel FPGA-based Iterative Stencil Loop Accelerators

  • Emanuele Del Sozzo
  • Davide Conficconi
  • Marco D. Santambrogio
  • Kentaro Sano

Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving, seismic simulations, and financial option pricing, to name a few. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional system of points until it reaches convergence. However, due to their iterative and computationally intensive nature, these workloads are highly performance-hungry, demanding specialized solutions to boost performance and reduce power consumption. Here, FPGAs represent a valid architectural choice as their peculiar features enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces Senju, an automation framework for FPGA-based ISL accelerators. Starting from an input description, Senju builds highly parallel hardware modules and automatizes all their design phases. The experimental evaluation shows remarkable and scalable results, reaching significant performance and energy efficiency improvements compared to the other single-FPGA literature approaches.

FPGA Acceleration for Successive Interference Cancellation in Severe Multipath Acoustic Communication Channels

  • Jinfeng Li
  • Yahong Rosa Zheng

This paper proposes a hardware implementation of a Successive Interference Cancellation (SIC) scheme in a Turbo Equalizer for very long multipath fading channels where the intersymbol interference (ISI) channel length L is on the order of 100 taps. To reduce the computational complexity caused by large matrix arithmetic in the SIC, we exploit the data dependencies and convolutional nature of the SIC algorithm and propose an FPGA acceleration architecture that takes advantage of the high degree of parallelism and flexible data movement offered by FPGAs. Instead of reconstructing the interference symbol by symbol via direct matrix-vector multiplication for each symbol in a block, we propose a two-stage processing algorithm. The first stage is block-wise processing, where the convolution of the channel impulse response (CIR) vector and the vector consisting of the whole symbol block is computed. The second stage is symbol-wise processing, but it reduces to the multiplication of a single symbol and the CIR vector. The result shows that for a block of Nblk symbols and a channel of length L, the proposed architecture completes the SIC within 2Nblk+L clock cycles, while direct matrix multiplication requires L×Nblk clock cycles. Implemented on a Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit, the SIC and equalization for one turbo iteration based on this architecture complete in around 40 us for a 1024-symbol block and a channel length of L=100. This architecture achieves around 40× speed-up compared with an implementation on a powerful CPU platform.
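
One way to read the two-stage idea in software terms (our interpretation of the abstract, with illustrative names): stage one is a single block-wise convolution of the channel impulse response with the whole symbol block; stage two is a cheap per-symbol correction that removes each symbol's own contribution so only the interference estimate remains.

```python
import numpy as np

def sic_two_stage(h, s):
    """h: channel impulse response (length L); s: soft symbol block (length Nblk).
    Stage 1: block-wise convolution h * s, computed once for the whole block.
    Stage 2: symbol-wise correction subtracting each symbol's own tap h[0],
    leaving the interference estimate to cancel."""
    conv = np.convolve(h, s)[: len(s)]       # stage 1: block-wise
    interference = conv - h[0] * s           # stage 2: symbol-wise
    return interference

h = np.random.randn(100) * np.exp(-0.05 * np.arange(100))   # L = 100 taps
s = np.sign(np.random.randn(1024))                          # Nblk = 1024
isi = sic_two_stage(h, s)
```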

FreezeTime: Towards System Emulation through Architectural Virtualization

  • Sergiu Mosanu
  • Joshua Fixelle
  • Kevin Skadron
  • Mircea Stan

High-end FPGAs enable architecture modeling through emulation with high speed and fidelity. However, the available reconfigurable logic and memory resources limit the size, complexity, and speed of the emulated target designs. The challenge is to map and model large and fast memory hierarchies, such as large caches and mixed main memory, various heterogeneous computation instances, such as CPUs, GPUs, AI/ML processing units and accelerator cores, and communication infrastructure, such as buses and networks. In addition to the spatial dimension, this work uses the temporal dimension, implemented with architectural multiplexing coupled with block-level synchronization, to model a complete system-on-chip architecture. Our approach presents mechanisms to abstract instance plurality while preserving timing in sync. With only a subset of the architecture on the FPGA, we freeze a whole emulated module’s activity and state during the additional time intervals necessary for the action on the virtualized modules to elapse. We demonstrate this technique by emulating a hypothetical system consisting of a processor and an SRAM memory too large to map on the FPGA. For this, we modify a LiteX-generated SoC consisting of a VexRISC-V processor and DDR memory, with the memory controller issuing stall signals that freeze the processor, effectively "hiding" the memory latency. For Linux boot, we measure significant emulation vs. simulation speedup while matching RTL simulation accuracy. The work is open-sourced.

SESSION: Session: Applications and Design Studies II

A Framework for Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform via on-chip Dynamic Tree Management

  • Yuan Meng
  • Rajgopal Kannan
  • Viktor Prasanna

Monte Carlo Tree Search (MCTS) is a widely used search technique in Artificial Intelligence (AI) applications. MCTS manages a dynamically evolving decision tree (i.e., one whose depth and height evolve at run-time) to guide an AI agent toward an optimal policy. In-tree operations are memory-bound, leading to a critical performance bottleneck for large-scale parallel MCTS on general-purpose processors. CPU-FPGA accelerators can alleviate the memory bottleneck of in-tree operations. However, a major challenge for existing FPGA accelerators is the lack of dynamic memory management, due to which they cannot efficiently support dynamically evolving MCTS trees. In this work, we address this challenge by proposing an MCTS acceleration framework that (1) incorporates an algorithm-hardware co-optimized accelerator design that supports in-tree operations on dynamically evolving trees without expensive hardware reconfiguration; (2) adopts a hybrid parallel execution model to fully exploit the compute power in a CPU-FPGA heterogeneous system; (3) supports a Python-based programming API for easy integration of the proposed accelerator with RL domain-specific benchmarking libraries at run-time. We show that by using our framework, we achieve up to 6.8× speedup and superior scalability of parallel workers than state-of-the-art parallel MCTS on multi-core systems.

Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

  • Linghao Song
  • Licheng Guo
  • Suhail Basalama
  • Yuze Chi
  • Robert F. Lucas
  • Jason Cong

The continued growth in the processing power of FPGAs, coupled with high-bandwidth memory (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers, which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient linear solver (CG). FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth and maintain double (FP64) precision accuracy. To tackle the three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed precision scheme to save bandwidth yet still achieve effective double precision quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reuse between on-chip modules to reduce unnecessary off-chip accesses and enable modules to work in parallel for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to the Xilinx HPC product XcgSolver, Callipepla achieves a speedup of 3.94x, 3.36x higher throughput, and 2.94x better energy efficiency. Compared to an NVIDIA A100 GPU, which has 4x the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34x higher energy efficiency. The code is available at https://github.com/UCLA-VAST/Callipepla.
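
For reference, this is the textbook preconditioned conjugate gradient iteration that such an accelerator streams, shown with a Jacobi (diagonal) preconditioner in numpy purely as a software illustration; Callipepla's instruction set, vector streaming reuse, and mixed-precision scheme are not reproduced here.

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Preconditioned CG: each iteration is one SpMV plus a handful of
    long-vector AXPY/dot operations -- exactly the streams an FPGA
    solver has to coordinate."""
    Minv = 1.0 / np.diag(A)
    x = np.zeros_like(b)
    r = b - A @ x
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
print(jacobi_pcg(A, b))   # ~ [0.0909, 0.6364]
```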

Accelerating Sparse MTTKRP for Tensor Decomposition on FPGA

  • Sasindu Wijeratne
  • Ta-Yang Wang
  • Rajgopal Kannan
  • Viktor Prasanna

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the most computationally intensive kernel in sparse tensor decomposition. In this paper, we propose a hardware-algorithm co-design on FPGA to minimize the execution time of spMTTKRP along all modes of an input tensor. We introduce FLYCOO, a novel tensor format that eliminates the communication of intermediate values to the FPGA external memory during the computation of spMTTKRP along all the modes. Our remapping of the tensor using FLYCOO also balances the workload among multiple Processing Engines (PEs). We propose a parallel algorithm that can concurrently process multiple partitions of the input tensor independent of each other. The proposed algorithm also orders the tensor dynamically during runtime to increase the data locality of the external memory accesses. We develop a custom FPGA accelerator design with (1) PEs consisting of a collection of pipelines that can concurrently process multiple elements of the input tensor and (2) memory controllers to exploit the spatial and temporal locality of the external memory accesses of the computation. Our work achieves a geometric mean of 8.8X and 3.8X speedup in execution time compared with the state-of-the-art CPU and GPU implementations on widely-used real-world sparse tensor datasets.
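
For readers unfamiliar with the kernel, this is mode-0 MTTKRP on a 3-way sparse tensor in COO format, written in plain numpy as a reference definition only; FLYCOO, the dynamic reordering, and the PE pipelines are not modeled.

```python
import numpy as np

def mttkrp_mode0(coords, vals, B, C, I):
    """Mode-0 MTTKRP: M[i, :] += val * (B[j, :] * C[k, :]) for every nonzero
    X[i, j, k] = val. B: J x R and C: K x R factor matrices; result M: I x R."""
    R = B.shape[1]
    M = np.zeros((I, R))
    for (i, j, k), v in zip(coords, vals):
        M[i, :] += v * (B[j, :] * C[k, :])   # elementwise Khatri-Rao row
    return M

coords = [(0, 1, 2), (3, 0, 1), (0, 2, 0)]
vals = [1.0, 2.0, -0.5]
B, C = np.random.randn(3, 4), np.random.randn(3, 4)
print(mttkrp_mode0(coords, vals, B, C, I=4).shape)   # (4, 4)
```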

ASPDAC’23 TOC

ASPDAC ’23: Proceedings of the 28th Asia and South Pacific Design Automation Conference

 Full Citation in the ACM Digital Library

SESSION: Technical Program: Reliability Considerations for Emerging Computing and Memory Architectures

A Fast Semi-Analytical Approach for Transient Electromigration Analysis of Interconnect Trees Using Matrix Exponential

  • Pavlos Stoikos
  • George Floros
  • Dimitrios Garyfallou
  • Nestor Evmorfopoulos
  • George Stamoulis

As integrated circuit technologies move to smaller technology nodes, electromigration (EM) has become one of the most challenging problems facing the EDA industry. While numerical approaches have been widely deployed since they can handle complicated interconnect structures, they tend to be much slower than analytical approaches. In this paper, we present a fast semi-analytical approach, based on the matrix exponential, for the solution of Korhonen’s stress equation at discrete spatial points of interconnect trees, which enables the analytical calculation of EM stress at any time and point independently. The proposed approach is combined with the extended Krylov subspace method to accurately simulate large EM models and accelerate the calculation of the final solution. Experimental evaluation on OpenROAD benchmarks demonstrates that our method achieves 0.5% average relative error compared with the industrial tool COMSOL while being up to three orders of magnitude faster.
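
The semi-analytical step the abstract refers to, evaluating a spatially discretized linear system directly at an arbitrary time via the matrix exponential, can be illustrated in a few lines. The system below is a generic diffusion-like stand-in of ours, not Korhonen's actual discretization, and the extended Krylov acceleration is beyond this sketch.

```python
import numpy as np
from scipy.linalg import expm

# Spatially discretized 1-D diffusion-like system dx/dt = A x, a stand-in
# for stress evolution along one interconnect segment.
n = 50
A = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) * 1e3
x0 = np.zeros(n); x0[0] = 1.0                 # initial stress profile

def stress_at(t):
    """Evaluate the solution at any time t directly, with no time stepping."""
    return expm(A * t) @ x0

print(stress_at(1e-3)[:5])
```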

Chiplet Placement for 2.5D IC with Sequence Pair Based Tree and Thermal Consideration

  • Hong-Wen Chiou
  • Jia-Hao Jiang
  • Yu-Teng Chang
  • Yu-Min Lee
  • Chi-Wen Pan

This work develops an efficient chiplet placer with thermal consideration for 2.5D ICs. Combining the sequence-pair based tree, a branch-and-bound method, and advanced placement/pruning techniques, the developed placer can quickly find solutions with optimized total wirelength (TWL), measured as half-perimeter wirelength (HPWL). Additionally, with a post-placement procedure, the placer reduces maximum temperatures with only a slight increase in wirelength. Experimental results show that the placer not only finds better-optimized TWL (reducing HPWL by 1.035%) but also runs up to two orders of magnitude faster than the prior art. With thermal consideration, the placer can reduce the maximum temperature by up to 8.214 °C with an average 5.376% increase in TWL.

An On-Line Aging Detection and Tolerance Framework for Improving Reliability of STT-MRAMs

  • Yu-Guang Chen
  • Po-Yeh Huang
  • Jin-Fu Li

Spin-transfer-torque magnetic random-access memory (STT-MRAM) is one of the most promising emerging memories for on-chip memory. However, the magnetic tunnel junction (MTJ) in STT-MRAM suffers from several reliability threats which degrade endurance, create defects, and cause memory failure. One of the primary reliability issues comes from time-dependent dielectric breakdown (TDDB) of the MTJ, which causes the MTJ resistance to drift over time and may lead to read errors. To overcome this challenge, in this paper we present an on-line aging detection and tolerance framework to dynamically monitor electrical parameter deviations and provide appropriate compensation to avoid read errors. The on-line aging detection mechanism identifies aged words by monitoring the read current, and the aging tolerance mechanism then adjusts the reference resistance of the sensing amplifier to compensate for the aging-induced resistance drop of the MTJ. In comparison with existing testing-based aging detection techniques, our mechanism can operate on-line with read operations, performing aging detection and tolerance simultaneously with negligible performance overhead. Simulation and analysis results show that the proposed techniques can successfully detect 99% of aged words under process variation and achieve up to 25% reliability improvement for STT-MRAMs.

SESSION: Technical Program: Accelerators and Equivalence Checking

Automated Equivalence Checking Method for Majority Based In-Memory Computing on ReRAM Crossbars

  • Arighna Deb
  • Kamalika Datta
  • Muhammad Hassan
  • Saeideh Shirinzadeh
  • Rolf Drechsler

Recent progress in the fabrication of Resistive Random Access Memory (ReRAM) devices has paved the way for large scale crossbar structures. In particular, in-memory computing on ReRAM crossbars helps in bridging the processor-memory speed gap for current CMOS technology. To this end, synthesis and mapping of Boolean functions to such crossbars have been investigated by researchers. However, the verification of even simple crossbar designs is still done through manual inspection, sometimes complemented by simulation-based techniques. This is clearly an important problem, as real-world designs are complex and have a large number of inputs; for such designs, manual inspection and simulation-based methods are not practical.

In this paper, to the best of our knowledge for the first time, we propose an automated equivalence checking methodology for majority-based in-memory designs on ReRAM crossbars. Our contributions are twofold: first, we introduce an intermediate data structure called the ReRAM Sequence Graph (ReSG) to represent the logic-in-memory design. This in turn is translated into Boolean satisfiability (SAT) formulas. These SAT formulas are verified against the golden functional specification using the Z3 Satisfiability Modulo Theories (SMT) solver. We validate the proposed method on widely available benchmarks.
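
The final verification step, checking formulas extracted from the design against a golden specification with Z3, looks roughly like this in the Z3 Python API. The majority-logic example below is our own generic illustration, not the ReSG translation itself.

```python
from z3 import Bools, And, Or, Xor, Solver, unsat

a, b, c = Bools("a b c")

def maj(x, y, z):
    """3-input majority, the primitive realized by ReRAM crossbar logic."""
    return Or(And(x, y), And(y, z), And(x, z))

spec = Or(And(a, b), And(c, Xor(a, b)))   # golden spec: full-adder carry-out
impl = maj(a, b, c)                       # candidate majority-based mapping

s = Solver()
s.add(spec != impl)                       # any satisfying assignment is a mismatch
if s.check() == unsat:
    print("equivalent")
else:
    print("counterexample:", s.model())
```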

An Equivalence Checking Framework for Agile Hardware Design

  • Yanzhao Wang
  • Fei Xie
  • Zhenkun Yang
  • Pasquale Cocchini
  • Jin Yang

Agile hardware design enables designers to produce new design iterations efficiently. Equivalence checking is critical in ensuring that a new design iteration conforms to its specification. In this paper, we introduce an equivalence checking framework for hardware designs represented in HalideIR. HalideIR is a popular intermediate representation in software domains such as deep learning and image processing, and it is increasingly utilized in agile hardware design. We have developed a fully automatic equivalence checking workflow seamlessly integrated with HalideIR and several optimizations that leverage the incremental nature of agile hardware design to scale equivalence checking. Evaluations of two deep learning accelerator designs show our automatic equivalence checking framework scales to hardware designs of practical sizes and detects inconsistencies that manually crafted tests have missed.

Towards High-Bandwidth-Utilization SpMV on FPGAs via Partial Vector Duplication

  • Bowen Liu
  • Dajiang Liu

Sparse matrix-vector multiplication (SpMV) is widely used in many fields and usually dominates the execution time of a task. With large off-chip memory bandwidth, customizable on-chip resources, and high-performance floating-point operations, the FPGA is a potential platform to accelerate SpMV tasks. However, as compressed data formats for SpMV usually introduce irregular memory accesses while the kernel is also memory-intensive, implementing an SpMV accelerator on an FPGA that achieves high bandwidth utilization (BU) is challenging. Existing works either eliminate irregular memory accesses at the cost of increased data redundancy or try to locally reduce the port conflicts introduced by irregular memory accesses, leading to limited BU improvement. To this end, this paper proposes a high-bandwidth-utilization SpMV accelerator on FPGAs using partial vector duplication, in which a read-conflict-free vector buffer, a write-conflict-free adder tree, and ping-pong-like accumulator registers are carefully elaborated. The FPGA implementation results show that the proposed design achieves an average 1.10x performance speedup compared to the state-of-the-art work.

SESSION: Technical Program: New Frontiers in Cyber-Physical and Autonomous Systems

Safety-Driven Interactive Planning for Neural Network-Based Lane Changing

  • Xiangguo Liu
  • Ruochen Jiao
  • Bowen Zheng
  • Dave Liang
  • Qi Zhu

Neural network-based driving planners have shown great promise in improving task performance of autonomous driving. However, it is critical and yet very challenging to ensure the safety of systems with neural network-based components, especially in dense and highly interactive traffic environments. In this work, we propose a safety-driven interactive planning framework for neural network-based lane changing. To prevent over-conservative planning, we identify the driving behavior of surrounding vehicles and assess their aggressiveness, and then adapt the planned trajectory for the ego vehicle accordingly in an interactive manner. The ego vehicle can proceed to change lanes if a safe evasion trajectory exists even in the predicted worst case; otherwise, it can stay around the current lateral position or return back to the original lane. We quantitatively demonstrate the effectiveness of our planner design and its advantage over baseline methods through extensive simulations with diverse and comprehensive experimental settings, as well as in real-world scenarios collected by an autonomous vehicle company.

Safety-Aware Flexible Schedule Synthesis for Cyber-Physical Systems Using Weakly-Hard Constraints

  • Shengjie Xu
  • Bineet Ghosh
  • Clara Hobbs
  • P. S. Thiagarajan
  • Samarjit Chakraborty

With the emergence of complex autonomous systems, multiple control tasks are increasingly being implemented on shared computational platforms. Due to the resource-constrained nature of such platforms in domains such as automotive, scheduling all the control tasks in a timely manner is often difficult. The usual requirement—that all task invocations must meet their deadlines—stems from the isolated design of a control strategy and its implementation (including scheduling) in software. This separation of concerns, where the control designer sets the deadlines, and the embedded software engineer aims to meet them, eases the design and verification process. However, it is not flexible and is overly conservative. In this paper, we show how to capture the deadline miss patterns under which the safety properties of the controllers will still be satisfied. The allowed patterns of such deadline misses may be captured using what are referred to as “weakly-hard constraints.” But scheduling tasks under these weakly-hard constraints is non-trivial since common scheduling policies like fixed-priority or earliest deadline first do not satisfy them in general. The main contribution of this paper is to automatically synthesize schedules from the safety properties of controllers. Using real examples, we demonstrate the effectiveness of this strategy and illustrate that traditional notions of schedulability, e.g., utility ratios, are not applicable when scheduling controllers to satisfy safety properties.
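
A weakly-hard constraint of the common (m, K) form, "at most m deadline misses in any window of K consecutive invocations", can be checked against a candidate schedule's hit/miss pattern with a simple sliding window. This snippet is illustrative only; the paper synthesizes schedules from the controllers' safety properties rather than checking a fixed pattern.

```python
def satisfies_weakly_hard(misses, m, K):
    """misses: boolean list, True where a job invocation missed its deadline.
    Returns True iff every window of K consecutive invocations contains
    at most m misses (the (m, K) weakly-hard constraint)."""
    if len(misses) < K:
        return sum(misses) <= m
    window = sum(misses[:K])
    if window > m:
        return False
    for i in range(K, len(misses)):
        window += misses[i] - misses[i - K]   # slide the window by one job
        if window > m:
            return False
    return True

pattern = [False, True, False, False, True, False, False, False]
print(satisfies_weakly_hard(pattern, m=1, K=3))   # True: no 3-window has >1 miss
```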

Mixed-Traffic Intersection Management Utilizing Connected and Autonomous Vehicles as Traffic Regulators

  • Pin-Chun Chen
  • Xiangguo Liu
  • Chung-Wei Lin
  • Chao Huang
  • Qi Zhu

Connected and autonomous vehicles (CAVs) can realize many revolutionary applications, but mixed traffic including both CAVs and human-driven vehicles (HVs) is expected for decades to come. In this paper, we target the problem of mixed-traffic intersection management and schedule CAVs to control the subsequent HVs. We develop a dynamic programming approach and a mixed integer linear programming (MILP) formulation to optimally solve the problems with the corresponding intersection models. We then propose an MILP-based approach which is more efficient and real-time-applicable than solving the optimal MILP formulation, while keeping good solution quality and outperforming the first-come-first-served (FCFS) approach. Experimental results and SUMO simulation indicate that controlling CAVs with our approaches effectively regulates mixed traffic even when the CAV penetration rate is low, which provides an incentive for early adoption of CAVs.

SESSION: Technical Program: Machine Learning Assisted Optimization Techniques for Analog Circuits

Fully Automated Machine Learning Model Development for Analog Placement Quality Prediction

  • Chen-Chia Chang
  • Jingyu Pan
  • Zhiyao Xie
  • Yaguang Li
  • Yishuang Lin
  • Jiang Hu
  • Yiran Chen

Analog integrated circuit (IC) placement is a heavily manual and time-consuming task that has a significant impact on chip quality. Several recent studies apply machine learning (ML) techniques to directly predict the impact of placement on circuit performance or even guide the placement process. However, the significant diversity in analog design topologies can lead to different impacts on performance metrics (e.g., common-mode rejection ratio (CMRR) or offset voltage). Thus, it is unlikely that the same ML model structure will achieve the best performance for all designs and metrics. In addition, customizing ML models for different designs requires tremendous engineering effort and longer development cycles. In this work, we leverage Neural Architecture Search (NAS) to automatically develop customized neural architectures for different analog circuit designs and metrics. Our proposed NAS methodology supports an unconstrained DAG-based search space containing a wide range of ML operations and topological connections. Our search strategy can efficiently explore this flexible search space and provide every design with the best customized model to boost model performance. We make unprejudiced comparisons with the claimed performance of the previous representative work on exactly the same dataset. After fully automated development within only 0.5 days, the generated models achieve 3.61% higher accuracy than the prior art.

Efficient Hierarchical mm-Wave System Synthesis with Embedded Accurate Transformer and Balun Machine Learning Models

  • F. Passos
  • N. Lourenço
  • L. Mendes
  • R. Martins
  • J. Vaz
  • N. Horta

Integrated circuit design in millimeter-wave (mm-Wave) bands is exceptionally complex and dependent on costly electromagnetic (EM) simulations. Therefore, in the past few years, a growing interest has emerged in developing novel optimization-based methodologies for the automatic design of mm-Wave circuits. However, current approaches lack scalability when the circuit/system complexity increases. Besides, many also depend on EM simulators, which degrades their efficiency. This work resorts to hierarchical system partitioning and bottom-up design approaches, where a precise machine learning model – composed of hundreds of seamlessly integrated sub-models that guarantee high accuracy (validated against EM simulations and measurements) up to 200GHz – is embedded to design passive components, e.g., transformers and baluns. The model generates optimal design surfaces to be fed to the hierarchical levels above or acts as a performance estimator. With the proposed scheme, it is possible to remove the dependency on EM simulations during optimization. The proposed mixed optimal-surface, performance-estimator, and simulation-based bottom-up multiobjective optimization (MOO) is used to fully design a Ka-band mm-Wave transmitter from the device up to the system level in 65-nm CMOS for state-of-the-art specifications.

APOSTLE: Asynchronously Parallel Optimization for Sizing Analog Transistors Using DNN Learning

  • Ahmet F. Budak
  • David Smart
  • Brian Swahn
  • David Z. Pan

Analog circuit sizing is a high-cost process in terms of the manual effort invested and the computation time spent. With rapidly developing technology and high market demand, bringing automated solutions for sizing has attracted great attention. This paper presents APOSTLE, an asynchronously parallel optimization method for sizing analog transistors using Deep Neural Network (DNN) learning. This work introduces several methods to minimize the wall-clock time of optimization when the sizing task consists of several different simulations with varying time costs. The key contributions of this paper are: (1) a batch optimization framework, (2) a novel deep neural network architecture for exploring design points when existing solutions are not always fully evaluated, (3) a ranking approximation method based on cheap evaluations, and (4) a theoretical approach to balancing the cheap and the expensive simulations to maximize the optimization efficiency. Our method shows high real-time efficiency compared to other black-box optimization methods, both on small building blocks and on large industrial circuits, while reaching similar or better performance.

SESSION: Technical Program: Machine Learning for Reliable, Secure, and Cool Chips: A Journey from Transistors to Systems

ML to the Rescue: Reliability Estimation from Self-Heating and Aging in Transistors All the Way up to Processors

  • Hussam Amrouch
  • Florian Klemme

With increasingly confined 3D structures and newly adopted materials of higher thermal resistance, transistor self-heating has become a critical reliability threat in state-of-the-art and emerging process nodes. One of the challenges of transistor self-heating is accelerated transistor aging, which leads to earlier failure of the chip if not considered appropriately. Nevertheless, adequate consideration of accelerated aging effects induced by self-heating throughout a large circuit design is profoundly challenging due to the large gap between where self-heating originates (i.e., at the transistor level) and where its ultimate effect occurs (i.e., at the circuit and system levels). In this work, we demonstrate an end-to-end workflow starting from self-heating and aging effects in individual transistors all the way up to large circuits and processor designs. We demonstrate that with our accurately estimated degradations, the required timing guardband to ensure reliable operation of circuits is reduced considerably, by up to 96%, compared to the otherwise worst-case estimations that are conventionally employed.

Graph Neural Networks: A Powerful and Versatile Tool for Advancing Design, Reliability, and Security of ICs

  • Lilas Alrahis
  • Johann Knechtel
  • Ozgur Sinanoglu

Graph neural networks (GNNs) have pushed the state-of-the-art (SOTA) for performance in learning and predicting on large-scale data present in social networks, biology, etc. Since integrated circuits (ICs) can naturally be represented as graphs, there has been a tremendous surge in employing GNNs for machine learning (ML)-based methods for various aspects of IC design. Given this trajectory, there is a timely need to review and discuss some powerful and versatile GNN approaches for advancing IC design.

In this paper, we propose a generic pipeline for tailoring GNN models toward solving challenging problems in IC design. We outline promising options for each pipeline element, and we discuss selected and promising works, like leveraging GNNs to break SOTA logic obfuscation. Our comprehensive overview of GNN frameworks covers (i) electronic design automation (EDA) and IC design in general, (ii) design of reliable ICs, and (iii) design as well as analysis of secure ICs. We also provide our overview and related resources in the GNN4IC hub at https://github.com/DfX-NYUAD/GNN4IC. Finally, we discuss interesting open problems for future research.

Detection and Classification of Malicious Bitstreams for FPGAs in Cloud Computing

  • Jayeeta Chaudhuri
  • Krishnendu Chakrabarty

As FPGAs are increasingly shared and remotely accessed by multiple users and third parties, they introduce significant security concerns. Modules running on an FPGA may include circuits that induce voltage-based fault attacks and denial-of-service (DoS). An attacker might configure some regions of the FPGA with bitstreams that implement malicious circuits. Attackers can also perform side-channel analysis and fault attacks to extract secret information (e.g., the secret key of an AES encryption). In this paper, we present a convolutional neural network (CNN)-based defense to detect bitstreams of ring oscillator (RO)-based malicious circuits by analyzing the static features extracted from FPGA bitstreams. We further explore the criticality of RO-based circuits in order to detect malicious Trojans that are configured on the FPGA. Evaluation on Xilinx FPGAs demonstrates the effectiveness of the security solutions.

Learning Based Spatial Power Characterization and Full-Chip Power Estimation for Commercial TPUs

  • Jincong Lu
  • Jinwei Zhang
  • Wentian Jin
  • Sachin Sachdeva
  • Sheldon X.-D. Tan

In this paper, we propose a novel approach for the real-time estimation of chip-level spatial power maps for commercial Google Coral M.2 TPU chips based on a machine-learning technique for the first time. The new method can enable the development of more robust runtime power and thermal control schemes that take advantage of spatial power information, such as hot spots, that is otherwise not available. Unlike existing commercial multi-core processors, in which real-time performance-related utilization information is available, the Google TPU does not expose such information. To mitigate this problem, we propose to use features that are related to the workloads of running different deep neural networks (DNNs), such as the hyperparameters of the DNN and TPU resource information generated by the TPU compiler. The new approach involves the offline acquisition of accurate spatial and temporal temperature maps captured from an external infrared thermal imaging camera under nominal working conditions of a chip. To build the dynamic power density map model, we apply generative adversarial networks (GANs) based on the workload-related features. Our study shows that the estimated total powers match the manufacturer's total power measurements extremely well. Experimental results further show that the predicted power maps are quite accurate, with an RMSE of only 4.98mW/mm2, or 2.6% of the full-scale error. Deployed on an Intel Core i7-10710U, the proposed approach runs in as little as 6.9ms, which is suitable for real-time estimation.

SESSION: Technical Program: High Performance Memory for Storage and Computing

DECC: Differential ECC for Read Performance Optimization on High-Density NAND Flash Memory

  • Yunpeng Song
  • Yina Lv
  • Liang Shi

3D NAND flash memory with advanced multi-level-cell technology has been widely adopted due to its high density, but its reliability is significantly degraded. To address the reliability issue, flash memory often adopts the low-density parity-check code (LDPC) as the error correction code (ECC) to encode data and provide fault tolerance. LDPC with a low code rate provides strong correction capability but comes with a high energy cost. To avoid this cost, LDPC with a higher code rate is typically adopted. When the accessed data cannot be successfully decoded, LDPC relies on read retry operations to improve the error correction capability. However, read retry operations degrade read performance. In this work, a differential ECC (DECC) method is proposed to improve the read performance. The basic idea of DECC is to adopt LDPC with different code rates for data with different access characteristics. Specifically, when data is read-hot and has been retried due to reliability issues, LDPC with a low code rate is adopted to optimize performance. With this approach, the cost of low-code-rate LDPC is minimized and the performance is optimized. Through careful design and evaluation with real-world workloads on a 3D triple-level-cell (TLC) NAND flash memory, DECC achieves encouraging read performance optimization.
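
The core idea of choosing a code rate per data class can be sketched as a simple policy, shown below. The thresholds and code-rate values are hypothetical placeholders for illustration; the paper's actual selection criteria and rates are not reproduced here.

```python
# A minimal sketch of the DECC idea described above (thresholds and rates are assumed,
# not taken from the paper): data that is read-hot and has triggered read retries is
# re-encoded with a stronger (lower-rate) LDPC code; other data keeps the default
# high-rate code.

DEFAULT_RATE = 0.93   # weaker correction, lower storage/energy cost (assumed value)
STRONG_RATE  = 0.84   # stronger correction, higher cost (assumed value)

def select_code_rate(read_count, retry_count,
                     hot_read_threshold=1000, retry_threshold=1):
    if read_count >= hot_read_threshold and retry_count >= retry_threshold:
        return STRONG_RATE   # hot, retry-prone data: trade cost for read latency
    return DEFAULT_RATE

print(select_code_rate(read_count=5000, retry_count=3))  # 0.84
print(select_code_rate(read_count=10,   retry_count=0))  # 0.93
```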

Optimizing Data Layout for Racetrack Memory in Embedded Systems

  • Peng Hui
  • Edwin H.-M. Sha
  • Qingfeng Zhuge
  • Rui Xu
  • Han Wang

Racetrack memory (RTM), which consists of multiple domain block clusters (DBCs) and access ports, is a novel non-volatile memory and has potential as scratchpad memory (SPM) in embedded devices due to its high density and low access latency. However, excessive shift operations degrade the performance of RTM and make it unpredictable. In this paper, we propose three schemes to optimize the performance of RTM from different aspects, including intra-DBC, inter-DBC, and hybrid SPM with SRAM and RTM. First, a balanced group-based data placement method for the data layout inside one DBC is proposed to reduce shifts. Second, a grouping method for the data allocation among DBCs is proposed; it reduces shifts while using fewer DBCs by treating one DBC as multiple DBCs. Finally, we use SRAM to further reduce the cost, and a cost evaluation metric is proposed to assist the shrinking method that determines the data allocation for hybrid SPM with SRAM and RTM. Experiments show that the proposed schemes can significantly improve the performance of pure RTM and hybrid SPM while using fewer DBCs.

Exploring Architectural Implications to Boost Performance for in-NVM B+-Tree

  • Yanpeng Hu
  • Qisheng Jiang
  • Chundong Wang

Computer architecture keeps evolving to support byte-addressable non-volatile memory (NVM). Researchers have tailored the prevalent B+-tree to NVM, building a history of utilizing architectural support to gain both high performance and crash consistency. The latest architecture-level changes for NVM, e.g., the eADR, motivate us to further explore architectural implications in the design and implementation of in-NVM B+-trees. Our quantitative study finds that, with eADR, cache misses have an increasingly large impact on an in-NVM B+-tree's performance. We hence propose Conan, a conflict-aware node allocation scheme grounded in theoretical justifications. Conan decomposes the virtual addresses of B+-tree nodes with respect to a VIPT (virtually indexed, physically tagged) cache and intentionally places them into different cache sets. Experiments show that Conan evidently reduces cache conflicts and boosts the performance of a state-of-the-art in-NVM B+-tree.
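
The set-index arithmetic that such conflict-aware placement relies on can be illustrated as below. The cache parameters and addresses are assumed values for illustration, not taken from the paper or any specific processor.

```python
# A minimal sketch of the set-index arithmetic behind conflict-aware node allocation
# (cache parameters are assumed): for a VIPT cache, the set index comes from the
# address bits just above the line offset, so nodes whose addresses differ by a
# multiple of NUM_SETS * LINE_SIZE land in the same set and conflict.

LINE_SIZE = 64          # bytes per cache line (assumed)
NUM_SETS  = 1024        # number of cache sets (assumed)

def cache_set_index(virtual_address):
    return (virtual_address // LINE_SIZE) % NUM_SETS

a = 0x10000
b = a + NUM_SETS * LINE_SIZE
print(cache_set_index(a) == cache_set_index(b))            # True  -> conflicting placement
print(cache_set_index(a) == cache_set_index(a + LINE_SIZE))  # False -> different sets
```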

An Efficient near-Bank Processing Architecture for Personalized Recommendation System

  • Yuqing Yang
  • Weidong Yang
  • Qin Wang
  • Naifeng Jing
  • Jianfei Jiang
  • Zhigang Mao
  • Weiguang Sheng

Personalized recommendation systems consume the major resources in modern AI data centers. The memory-bound embedding layers with irregular memory access patterns have been identified as the bottleneck of recommendation systems. To overcome the memory challenges, near-memory processing (NMP) is an effective solution that provides high bandwidth. Recent work proposes an NMP approach to accelerate recommendation models by utilizing the through-silicon via (TSV) bandwidth in 3D-stacked DRAMs. However, the total bandwidth provided by TSVs is insufficient for a batch of embedding layers processed in parallel. In this paper, we propose a near-bank processing architecture to accelerate recommendation models. By integrating compute logic near the memory banks on the DRAM dies of a 3D-stacked DRAM, our architecture can exploit the enormous bank-level bandwidth, which is much higher than the TSV bandwidth. We also present a hardware/software interface for offloading embedding layers. Moreover, we propose an efficient mapping scheme to enhance the utilization of bank-level bandwidth. As a result, our architecture achieves up to 2.10X speedup and 31% energy saving for data movement over the state-of-the-art NMP solution for recommendation acceleration based on 3D-stacked memory.

SESSION: Technical Program: Cool and Efficient Approximation

PAALM: Power Density Aware Approximate Logarithmic Multiplier Design

  • Shuyuan Yu
  • Sheldon X.-D. Tan

Approximate hardware designs can lead to significant power or energy reduction. However, a recent study showed that approximate designs might suffer from unwanted higher temperature and related reliability issues due to the increased power density. In this work, we mitigate this important problem by proposing, for the first time, a novel power-density-aware approximate logarithmic multiplier (PAALM) design. The new multiplier design is based on the approximate logarithmic multiplier (ALM) framework due to its rigorous mathematical foundation. The idea is to redesign the computations with high switching activity in existing ALM designs using equivalent mathematical formulations, so that the power density can be reduced with no accuracy loss at the cost of some area overhead. Our results show that the proposed PAALM design can improve power density by 11.5%/5.7% and area by 31.6%/70.8% for 8/16-bit precision when compared with the fixed-point multiplier baseline, respectively. It also achieves an extremely low error bias of -0.17/0.08 for 8/16-bit precision, respectively. On top of this, we further implement the PAALM design in a Convolutional Neural Network (CNN) and test it on the CIFAR10 dataset. The results show that with error compensation, PAALM can achieve the same inference accuracy as the fixed-point multiplier baseline. We also evaluate PAALM in a discrete cosine transformation (DCT) application. The results show that with error compensation, PAALM improves image quality by 8.6dB on average when compared to the ALM design.
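
For readers unfamiliar with the ALM family, the sketch below shows the classic Mitchell-style logarithmic approximation that such multipliers build on: a product is approximated by adding approximate logarithms. This is the textbook approximation only, not the PAALM circuit or its power-density-aware redesign.

```python
# Classic Mitchell-style approximate logarithmic multiplication (software model of the
# approximation underlying ALM-family multipliers; not the PAALM design itself).

def approx_log2(x):
    """log2(x) ~= k + f for positive integer x, where x = 2**k * (1 + f), 0 <= f < 1."""
    k = x.bit_length() - 1
    f = (x - (1 << k)) / (1 << k)
    return k + f

def mitchell_multiply(a, b):
    """Approximate a*b by adding the approximate logarithms and 'anti-logging'."""
    s = approx_log2(a) + approx_log2(b)
    k = int(s)            # integer part selects the power of two
    f = s - k             # fractional part approximates the mantissa
    return (1 << k) * (1 + f)

exact = 200 * 123
approx = mitchell_multiply(200, 123)
print(exact, round(approx), f"error = {100 * (approx - exact) / exact:.2f}%")
```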

Approximate Floating-Point FFT Design with Wide Precision-Range and High Energy Efficiency

  • Chenyi Wen
  • Ying Wu
  • Xunzhao Yin
  • Cheng Zhuo

Fast Fourier Transform (FFT) is a key digital signal processing algorithm that is widely deployed in mobile and portable devices. Recently, with the popularity of human-perception-related tasks, it has been noted that full precision and exactness are not always necessary for FFT computation. We propose a top-down approximate floating-point FFT design methodology to fully exploit the error-tolerant nature of the FFT algorithm. An efficient error model of the configurable approximate multiplier is proposed to link the multiplier approximation to the FFT algorithm precision. Then an approximation optimization flow is formulated to maximize energy efficiency. Experimental results show that the proposed approximate FFT can achieve up to 52% Area-Delay-Product improvement and 23% energy saving when compared to the exact FFT. The proposed approximate FFT is also found to cover an almost 2X wider precision range with higher energy efficiency in comparison with the prior state-of-the-art approximate FFT.

RUCA: RUntime Configurable Approximate Circuits with Self-Correcting Capability

  • Jingxiao Ma
  • Sherief Reda

Approximate computing is an emerging computing paradigm that reduces power consumption by relaxing the requirement for full accuracy. Since accuracy requirements may vary across real-world applications, one trend in approximate computing is to design quality-configurable circuits, which are able to switch at runtime among accuracy modes with different power and delay. In this paper, we present RUCA, a novel framework that synthesizes runtime-configurable approximate circuits from arbitrary input circuits. By decomposing the truth table, our approach approximates and separates the input circuit into multiple configuration blocks that support different accuracy levels, including a corrector circuit that restores full accuracy. Power gating is used to activate different blocks, such that the approximate circuit is able to operate at different accuracy-power configurations. To improve the scalability of our algorithm, we also provide a design space exploration scheme with circuit partitioning. We evaluate our methodology on a comprehensive set of benchmarks. For 3-level designs, RUCA saves power consumption by 43.71% within 2% error and by 30.15% within 1% error on average.

Approximate Logic Synthesis by Genetic Algorithm with an Error Rate Guarantee

  • Chun-Ting Lee
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

Approximate computing is an emerging design technique for error-tolerant applications, which may improve circuit area, delay, or power consumption by trading off a circuit's correctness. In this paper, we propose a novel approximate logic synthesis approach based on a genetic algorithm, targeting depth minimization with an error rate guarantee. We conduct experiments on a set of IWLS 2005 and MCNC benchmarks. The experimental results demonstrate that the depth can be reduced by up to 50%, and by 22% on average, under a 5% error rate constraint. Compared with the state-of-the-art method, our approach achieves an average of 159% more depth reduction under the same 5% error rate constraint.

SESSION: Technical Program: Logic Synthesis for AQFP, Quantum Logic, AI Driven and Efficient Data Layout for HBM

Depth-Optimal Buffer and Splitter Insertion and Optimization in AQFP Circuits

  • Alessandro Tempia Calvino
  • Giovanni De Micheli

The Adiabatic Quantum-Flux Parametron (AQFP) is an energy-efficient superconducting logic family. AQFP technology requires buffer and splitting elements (B/S) to be inserted to satisfy path-balancing and fanout-branching constraints. B/S insertion policies and optimization strategies have recently been proposed to minimize the number of buffers and splitters needed in an AQFP circuit. In this work, we study B/S insertion and optimization methods. In particular, the paper proposes: i) an algorithm for B/S insertion that guarantees global depth optimality; ii) a new approach for B/S optimization based on minimum register retiming; iii) a B/S optimization flow based on (i), (ii), and existing work. We show that our approach reduces the number of B/S by up to 20% while guaranteeing optimal depth and providing a 55X speed-up in run time compared to the state-of-the-art.

Area-Driven FPGA Logic Synthesis Using Reinforcement Learning

  • Guanglei Zhou
  • Jason H. Anderson

Logic synthesis involves a rich set of optimization algorithms applied in a specific sequence to a circuit netlist prior to technology mapping. A conventional approach is to apply a fixed “recipe” of such algorithms deemed to work well for a wide range of different circuits. We apply reinforcement learning (RL) to determine a unique recipe of algorithms for each circuit. Feature-importance analysis is conducted using a random-forest classifier to prune the set of features visible to the RL agent. We demonstrate conclusive learning by the RL agent and show significant FPGA area reductions vs. the conventional approach (resyn2). In addition to circuit-by-circuit training and inference, we also train an RL agent on multiple circuits, and then apply the agent to optimize: 1) the same set of circuits on which it was trained, and 2) an alternative set of “unseen” circuits. In both scenarios, we observe that the RL agent produces higher-quality implementations than the conventional approach. This shows that the RL agent is able to generalize, and perform beneficial logic synthesis optimizations across a variety of circuits.
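
The idea of learning a per-circuit recipe can be illustrated with a toy bandit-style loop that greedily extends a pass sequence based on the area reduction observed so far. This is not the paper's RL agent or feature set; the pass names and the `evaluate_area` callback are placeholders the reader would supply.

```python
# A toy sketch of learning a per-circuit synthesis recipe (illustrative only, not the
# paper's agent): an epsilon-greedy bandit appends the pass whose estimated area
# reduction is highest. PASSES and evaluate_area() are placeholders.

import random
from collections import defaultdict

PASSES = ["rewrite", "refactor", "resub", "balance"]   # placeholder pass names

def learn_recipe(evaluate_area, recipe_len=10, epsilon=0.2):
    """evaluate_area(recipe) -> mapped area after applying the recipe (user-supplied)."""
    value = defaultdict(float)     # running estimate of each pass's area benefit
    counts = defaultdict(int)
    recipe = []
    area = evaluate_area(recipe)
    for _ in range(recipe_len):
        if random.random() < epsilon:
            step = random.choice(PASSES)                       # explore
        else:
            step = max(PASSES, key=lambda p: value[p])         # exploit
        new_area = evaluate_area(recipe + [step])
        reward = area - new_area                               # area reduction
        counts[step] += 1
        value[step] += (reward - value[step]) / counts[step]   # incremental mean
        recipe.append(step)
        area = new_area
    return recipe, area
```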

Optimization of Reversible Logic Networks with Gate Sharing

  • Yung-Chih Chen
  • Feng-Jie Chao

Logic synthesis for quantum computing aims to transform a Boolean logic network into a quantum circuit. A conventional two-stage flow first synthesizes the given Boolean logic network into a reversible logic network composed of reversible logic gates. Then, it maps each reversible logic gate into quantum gates to generate a quantum circuit. The state-of-the-art method for the first stage takes advantage of the lookup-table (LUT) mapping technology for FPGAs to decompose the given Boolean logic network into sub-networks, and then maps the sub-networks into reversible logic networks. Although every sub-network is well synthesized, we observe that the reversible logic networks could be further optimized by sharing the reversible logic gates belonging to different sub-networks. Thus, in this paper, we propose a new optimization method for reversible logic networks based on gate sharing. We translate the problem of extracting shareable gates into an exclusive-sum-of-products (ESOP) term optimization problem. The experimental results show that the proposed method successfully optimizes the reversible logic networks generated by the LUT-based method. It is able to reduce an average of approximately 4% of quantum gate cost without increasing the number of ancilla lines for a set of IWLS 2005 benchmarks.

Iris: Automatic Generation of Efficient Data Layouts for High Bandwidth Utilization

  • Stephanie Soldavini
  • Donatella Sciuto
  • Christian Pilato

Optimizing data movements is becoming one of the biggest challenges in heterogeneous computing to cope with data deluge and, consequently, big data applications. When creating specialized accelerators, modern high-level synthesis (HLS) tools are increasingly efficient in optimizing the computational aspects, but data transfers have not been adequately improved. To combat this, novel architectures such as High-Bandwidth Memory with wider data busses have been developed so that more data can be transferred in parallel. Designers must tailor their hardware/software interfaces to fully exploit the available bandwidth. HLS tools can automate this process, but the designer must follow strict coding-style rules. If the bus width is not evenly divisible by the data width (e.g., when using custom-precision data types) or if the arrays are not power-of-two length, the HLS-generated accelerator will likely not fully utilize the available bandwidth, demanding even more manual effort from the designer. We propose a methodology to automatically find and implement a data layout that, when streamed between memory and an accelerator, uses a higher percentage of the available bandwidth than a naive or HLS-optimized design. We borrow concepts from multiprocessor scheduling to achieve such high efficiency.
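
The bandwidth-utilization problem the abstract describes can be made concrete with a small calculation: if the element width does not divide the bus width, a layout that never splits an element across bus words wastes bandwidth, while a packed layout that lets elements straddle word boundaries does not. The bus and element widths below are illustrative numbers, not taken from the paper.

```python
# Illustrative arithmetic for the bandwidth-utilization issue described above
# (widths and counts are examples): a 512-bit bus carrying 96-bit custom elements.

BUS_BITS = 512
ELEM_BITS = 96
N_ELEMS = 1000

# Naive layout: only floor(512 / 96) = 5 elements per bus word, never split.
elems_per_word = BUS_BITS // ELEM_BITS
naive_words = -(-N_ELEMS // elems_per_word)                 # ceiling division
naive_util = N_ELEMS * ELEM_BITS / (naive_words * BUS_BITS)

# Packed layout: elements streamed back-to-back across word boundaries.
packed_words = -(-(N_ELEMS * ELEM_BITS) // BUS_BITS)
packed_util = N_ELEMS * ELEM_BITS / (packed_words * BUS_BITS)

print(f"naive:  {naive_words} words, {naive_util:.1%} bus utilization")
print(f"packed: {packed_words} words, {packed_util:.1%} bus utilization")
```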

SESSION: Technical Program: University Design Contest

ViraEye: An Energy-Efficient Stereo Vision Accelerator with Binary Neural Network in 55 nm CMOS

  • Yu Zhang
  • Gang Chen
  • Tao He
  • Qian Huang
  • Kai Huang

This paper presents the ViraEye chip, an energy-efficient stereo vision accelerator based on a binary neural network (BNN) to achieve high-quality and real-time stereo estimation. The stereo vision accelerator is designed as an end-to-end, fully pipelined architecture in which all processing procedures, including stereo rectification, BNNs, cost aggregation, and post-processing, are implemented on the ViraEye chip. ViraEye allows for top-level pipelining between the accelerator and image sensors, and no external CPUs or GPUs are required. The accelerator is implemented in SMIC 55nm CMOS technology and achieves the highest processing speed in terms of the million disparity estimations per second (MDE/s) metric among existing ASICs in the open literature.

A 1.2nJ/Classification Fully Synthesized All-Digital Asynchronous Wired-Logic Processor Using Quantized Non-Linear Function Blocks in 0.18μm CMOS

  • Rei Sumikawa
  • Kota Shiba
  • Atsutake Kosuge
  • Mototsugu Hamada
  • Tadahiro Kuroda

A 5.3 times smaller and 2.6 times more energy-efficient all-digital wired-logic processor has been developed, which infers MNIST with 90.6% accuracy at 1.2nJ per classification. To improve the area efficiency of the wired-logic architecture, we propose a nonlinear neural network (NNN), a neuron- and synapse-efficient network, together with a logical compression technique that implements it with area-saving, low-power digital circuits through logic synthesis; based on these, asynchronous combinational digital DNN hardware has been developed.

A Fully Synthesized 13.7μJ/Prediction 88% Accuracy CIFAR-10 Single-Chip Data-Reusing Wired-Logic Processor Using Non-Linear Neural Network

  • Yao-Chung Hsu
  • Atsutake Kosuge
  • Rei Sumikawa
  • Kota Shiba
  • Mototsugu Hamada
  • Tadahiro Kuroda

An FPGA-based wired-logic CNN processor is presented that can process CIFAR-10 at 13.7μJ/prediction with an 88% accuracy, which is 2,036 times more energy-efficient than the prior state-of-the-art FPGA-based processor. Energy efficiency is greatly improved by implementing all processing elements and wirings in parallel on a single FPGA chip to eliminate the memory access. By utilizing both (1) a non-linear neural network which saves on neurons and synapses and (2) a shift register-based wired-logic architecture, hardware resource usage is reduced by three orders of magnitude.

A Multimode Hybrid Memristor-CMOS Prototyping Platform Supporting Digital and Analog Projects

  • K.-E. Harabi
  • C. Turck
  • M. Drouhin
  • A. Renaudineau
  • T. Bersani-Veroni
  • D. Querlioz
  • T. Hirtzlin
  • E. Vianello
  • M Bocquet
  • J.-M. Portal

We present an integrated circuit fabricated in a process co-integrating CMOS and hafnium-oxide memristor technology, which provides a prototyping platform for projects involving memristors. Our circuit includes the periphery circuitry for using memristors within digital circuits, as well as an analog mode with direct access to memristors. The platform allows optimizing the conditions for reading and writing memristors, as well as developing and testing innovative memristor-based neuromorphic concepts.

A Fully Synchronous Digital LDO with Built-in Adaptive Frequency Modulation and Implicit Dead-Zone Control

  • Shun Yamaguchi
  • Mahfuzul Islam
  • Takashi Hisakado
  • Osami Wada

This paper proposes a synchronous digital LDO with adaptive clocking and dead-zone control without additional reference voltages. A test chip fabricated in a commercial 65 nm CMOS general-purpose (GP) process achieves 580x frequency modulation with 99.9% maximum efficiency at 0.6V supply.

Demonstration of Order Statistics Based Flash ADC in a 65nm Process

  • Mahfuzul Islam
  • Takehiro Kitamura
  • Takashi Hisakado
  • Osami Wada

This paper presents measurement results of a flash ADC that utilizes offset voltages as references. To operate the minimum number of comparators, we select the target comparators based on the rankings of the offset voltage. We present performance improvement by tuning offset voltage distribution using multiple comparator groups under the same power. A test chip in a commercial 65 nm GP process demonstrates the ADCs at 1 GS/s operation.

SESSION: Technical Program: Synthesis of Quantum Circuits and Systems

A SAT Encoding for Optimal Clifford Circuit Synthesis

  • Sarah Schneider
  • Lukas Burgholzer
  • Robert Wille

Executing quantum algorithms on a quantum computer requires compilation to representations that conform to all restrictions imposed by the device. Due to devices’ limited coherence times and gate fidelities, the compilation process has to be optimized as much as possible. To this end, an algorithm’s description first has to be synthesized using the device’s gate library. In this paper, we consider the optimal synthesis of Clifford circuits—an important subclass of quantum circuits with various applications. Such techniques are essential for establishing lower bounds for (heuristic) synthesis methods and gauging their performance. Due to the huge search space, existing optimal techniques are limited to a maximum of six qubits. The contribution of this work is twofold: First, we propose an optimal synthesis method for Clifford circuits based on encoding the task as a satisfiability (SAT) problem and solving it using a SAT solver in conjunction with a binary search scheme. The resulting tool is demonstrated to synthesize optimal circuits for up to 26 qubits—more than four times as many as the current state of the art. Second, we experimentally show that the overhead introduced by state-of-the-art heuristics exceeds the lower bound by 27% on average. The resulting tool is publicly available at https://github.com/cda-tum/qmap.
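
The coupling of a SAT feasibility check with binary search can be sketched as below. Here `is_satisfiable(depth)` stands in for "encode the question of whether a circuit of this depth realizes the target and call a SAT solver"; the encoding itself is the paper's contribution and is not shown.

```python
# Sketch of a binary-search wrapper around a SAT feasibility oracle, as used in
# optimal-depth synthesis flows. is_satisfiable(depth) is a placeholder callback.

def minimum_depth(is_satisfiable, upper_bound):
    """Return the smallest depth d in [0, upper_bound] with is_satisfiable(d) True,
    assuming feasibility is monotone in the depth."""
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi) // 2
        if is_satisfiable(mid):
            hi = mid          # a circuit of depth mid exists; try shallower
        else:
            lo = mid + 1      # infeasible; more depth is needed
    return lo

# Toy usage with a fake oracle that says depth >= 7 suffices.
print(minimum_depth(lambda d: d >= 7, upper_bound=32))  # 7
```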

An SMT-Solver-Based Synthesis of NNA-Compliant Quantum Circuits Consisting of CNOT, H and T Gates

  • Kyohei Seino
  • Shigeru Yamashita

It is natural to assume that we can perform quantum operations between only two adjacent physical qubits (quantum bits) to realize a quantum computer for both the current and possible future technologies. This restriction is called the Nearest Neighbor Architecture (NNA) restriction. This paper proposes an SMT-solver-based synthesis of quantum circuits consisting of CNOT, H, and T gates to satisfy the NNA restriction. Although the existing SMT-solver-based synthesis cannot treat H and T gates directly, our method treats the functionality of quantum-specific T and H gates carefully so that we can utilize an SMT-solver to minimize the number of CNOT gates; unlike the existing SMT-solver-based methods, our method considers “Don’t Care” conditions in intermediate points of a quantum circuit by exploiting the property of T gates to reduce CNOT gates. Experimental results show that our approach can reduce the number of CNOT gates by 58.11% on average compared to the naive application of the existing method which does not consider the “Don’t Care” condition.

Compilation of Entangling Gates for High-Dimensional Quantum Systems

  • Kevin Mato
  • Martin Ringbauer
  • Stefan Hillmich
  • Robert Wille

Most quantum computing architectures to date natively support multi-valued logic, albeit being typically operated in a binary fashion. Multi-valued, or qudit, quantum processors have access to much richer forms of quantum entanglement, which promise to significantly boost the performance and usefulness of quantum devices. However, much of the theory as well as the corresponding design methods required for exploiting such hardware remain insufficient, and generalizations from qubits are not straightforward. A particular challenge is the compilation of quantum circuits into sets of native qudit gates supported by state-of-the-art quantum hardware. In this work, we address this challenge by introducing a complete workflow for compiling any two-qudit unitary into an arbitrary native gate set. Case studies demonstrate the feasibility of both the proposed approach and the corresponding implementation (which is freely available at github.com/cda-tum/qudit-entanglement-compilation).

WIT-Greedy: Hardware System Design of Weighted ITerative Greedy Decoder for Surface Code

  • Wang Liao
  • Yasunari Suzuki
  • Teruo Tanimoto
  • Yosuke Ueno
  • Yuuki Tokunaga

Large error rates of quantum bits (qubits) are one of the main difficulties in the development of quantum computing. Performing quantum error correction (QEC) with surface codes is considered the most promising approach to reduce the error rates of qubits effectively. To perform error correction, we need an error-decoding unit, which estimates errors in the noisy physical qubits repetitively, to create a robust logical qubit. While complicated graph-matching problems must be solved within a strict time restriction for the error decoding, several hardware implementations that satisfy the restriction at a large code distance have been proposed.

However, the existing decoder designs still fall short in reducing the logical error rate. This is because they assume that the error rates of physical qubits are uniform, while in practice they have large variations. According to our numerical simulation based on the quantum chip with the largest qubit count, neglecting the non-uniform error properties of a real quantum chip in the decoding process induces significant degradation of the logical error rate and spoils the benefit of QEC. To take the non-uniformity into account, decoders need to solve matching problems on a weighted graph, but such problems are difficult to solve with the existing designs without exceeding the decoding time limit. Therefore, a decoder that can handle both non-uniform physical error rates and large surface codes is strongly demanded.

In this paper, we propose a hardware design of decoding units for the surface code that can handle non-identical error properties with small latency at a large code distance. The key ideas of our design are 1) constructing a look-up table for the shortest paths between nodes in a weighted graph and 2) enabling parallel processing during decoding. Implementation results on a field-programmable gate array (FPGA) indicate that our design scales up to code distance 11 within microsecond-level delay, which is comparable to existing state-of-the-art designs, while additionally handling non-identical errors.
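
Key idea (1) can be illustrated in software: precompute a table of shortest-path weights between decoding-graph nodes so that decoding only needs table lookups. The edge weights below are arbitrary examples; the actual design derives them from the measured, non-uniform error rates of the physical qubits and builds the table in hardware.

```python
# Software sketch of precomputing an all-pairs shortest-path lookup table for a
# weighted decoding graph (Floyd-Warshall); weights are illustrative only.

import math

def shortest_path_table(num_nodes, weighted_edges):
    """weighted_edges = [(u, v, w), ...] for an undirected graph with w >= 0."""
    dist = [[math.inf] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        dist[i][i] = 0.0
    for u, v, w in weighted_edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in range(num_nodes):
        for i in range(num_nodes):
            for j in range(num_nodes):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

table = shortest_path_table(4, [(0, 1, 1.0), (1, 2, 0.2), (0, 3, 2.5), (2, 3, 0.4)])
print(table[0][3])  # 1.6: the weighted path 0-1-2-3 beats the direct edge of weight 2.5
```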

Quantum Data Compression for Efficient Generation of Control Pulses

  • Daniel Volya
  • Prabhat Mishra

In order to physically realize a robust quantum gate, a specifically tailored laser pulse needs to be derived via strategies such as quantum optimal control. Unfortunately, such strategies face exponential complexity with quantum system size and become infeasible even for moderate-sized quantum circuits. In this paper, we propose an automated framework for effective utilization of these quantum resources. Specifically, this paper makes three important contributions. First, we utilize an effective combination of register compression and dimensionality reduction to reduce the area of a quantum circuit. Next, due to the properties of an autoencoder, the compressed gates produced are robust even in the presence of noise. Finally, our proposed compression reduces the computation time of quantum control. Experimental evaluation using popular quantum algorithms demonstrates that our proposed approach can enable efficient generation of noise-resilient control pulses while state-of-the-art fails to handle large-scale quantum systems.

SESSION: Technical Program: In-Memory/Near-Memory Computing for Neural Networks

Toward Energy-Efficient Sparse Matrix-Vector Multiplication with near STT-MRAM Computing Architecture

  • Yueting Li
  • He Zhang
  • Xueyan Wang
  • Hao Cai
  • Yundong Zhang
  • Shuqin Lv
  • Renguang Liu
  • Weisheng Zhao

Sparse Matrix-Vector Multiplication (SpMV) is one of the vital computational primitives used in modern workloads. SpMV is dominated by memory accesses, leading to unnecessary data transmission, massive data movement, and redundant multiply-accumulate operations. Therefore, we propose a near spin-transfer torque magnetic random access memory (STT-MRAM) processing architecture with three optimizations: (1) the NMP controller receives instructions over the AXI4 bus to implement the SpMV operation, identifies valid data, and encodes indices depending on the kernel size; (2) the NMP controller uses high-level-synthesis dataflow in the shared buffer to achieve higher throughput without consuming bus bandwidth; and (3) configurable MACs are implemented in the NMP core, entirely avoiding the matching step during multiplication. With these optimizations, the NMP architecture can access the pipelined STT-MRAM (with a read bandwidth of 26.7GB/s). The experimental simulation results show that this design achieves up to 66x and 28x speedup compared with state-of-the-art designs, and 69x speedup without sparse optimization.
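
For context, the reference computation being accelerated is standard sparse matrix-vector multiplication over a compressed representation; the sketch below uses the common CSR format only to make concrete what "identifying valid data and encoding indices" refers to, and is not the proposed hardware architecture.

```python
# Reference CSR sparse matrix-vector multiply (software baseline, not the NMP design).

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in compressed sparse row (CSR) form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]   # only nonzeros are touched
    return y

# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
values  = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 2, 0, 1]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```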

RIMAC: An Array-Level ADC/DAC-Free ReRAM-Based in-Memory DNN Processor with Analog Cache and Computation

  • Peiyu Chen
  • Meng Wu
  • Yufei Ma
  • Le Ye
  • Ru Huang

By directly computing in the analog domain, processing-in-memory (PIM) is emerging as a promising alternative to overcome the memory bottleneck of the traditional von Neumann architecture, especially for deep neural networks (DNNs). However, in most existing PIM accelerators, the data outside the PIM macros is stored and operated on as digital signals, which requires massive, expensive digital-to-analog (D/A) and analog-to-digital (A/D) converters. In this work, an array-level ADC/DAC-free ReRAM-based in-memory DNN processor named RIMAC is proposed, which accelerates various DNNs purely in the analog domain with analog cache and analog computation modules to eliminate the expensive D/A and A/D conversions. Our experimental results show that peak energy efficiency is improved by about 34.8×, 97.6×, 10.7×, and 14.0× compared to PRIME, ISAAC, Lattice, and 21’DAC for various DNNs on ImageNet, respectively.

Crossbar-Aligned & Integer-Only Neural Network Compression for Efficient in-Memory Acceleration

  • Shuo Huai
  • Di Liu
  • Xiangzhong Luo
  • Hui Chen
  • Weichen Liu
  • Ravi Subramaniam

Crossbar-based In-Memory Computing (IMC) accelerators preload the entire Deep Neural Network (DNN) into crossbars before inference. However, devices with limited crossbars cannot infer increasingly complex models. IMC pruning can reduce the usage of crossbars, but current methods need expensive extra hardware for data alignment. Meanwhile, quantization can represent the weights of DNNs by integers, but existing schemes employ non-integer scaling factors to ensure accuracy, requiring costly multipliers. In this paper, we first propose crossbar-aligned pruning to reduce the usage of crossbars without hardware overhead. Then, we introduce a quantization scheme that avoids multipliers in IMC devices. Finally, we design a learning method that integrates the above two schemes and cultivates an optimal compact DNN with high accuracy and large sparsity during training. Experiments demonstrate that our framework, compared to state-of-the-art methods, achieves larger sparsity and lower power consumption with higher accuracy. We even improve the accuracy by 0.43% for VGG-16 with an 88.25% sparsity rate on the Cifar-10 dataset. Compared to the original model, we reduce computing power and area by 19.8x and 18.8x, respectively.

Discovering the in-Memory Kernels of 3D Dot-Product Engines

  • Muhammad Rashedul Haq Rashed
  • Sumit Kumar Jha
  • Rickard Ewetz

The capability of resistive random access memory (ReRAM) to implement multiply-and-accumulate operations promises unprecedented efficiency in the design of scientific computing applications. While the use of two-dimensional (2D) ReRAM crossbars has been well investigated in the last few years, the design of in-memory dot-product engines using three-dimensional (3D) ReRAM crossbars remains a topic of active investigation. In this paper, we holistically explore how to leverage 3D ReRAM crossbars with several (2 to 7) stacked crossbar layers. In contrast, previous studies have focused on 3D ReRAM with at most 2 stacked crossbar layers. We first discover the in-memory compute kernels that can be realized using 3D ReRAM with multiple stacked crossbar layers. We discover that matrices with different sparsity patterns can be realized by appropriately assigning the inputs and outputs to the perpendicular metal wires within the 3D stack. We present a design automation tool to map sparse matrices within scientific computing applications to the discovered 3D kernels. The proposed framework is evaluated using 20 applications from the SuiteSparse Matrix Collection. Compared with 2D crossbars, the proposed approach using 3D crossbars improves area, energy, and latency by 2.02X, 2.37X, and 2.45X, respectively.

RVComp: Analog Variation Compensation for RRAM-Based in-Memory Computing

  • Jingyu He
  • Yucong Huang
  • Miguel Lastras
  • Terry Tao Ye
  • Chi-Ying Tsui
  • Kwang-Ting Cheng

Resistive Random Access Memory (RRAM) has shown great potential in accelerating memory-intensive computation in neural network applications. However, RRAM-based computing suffers from significant accuracy degradation due to the inevitable device variations. In this paper, we propose RVComp, a fine-grained analog Compensation approach to mitigate the accuracy loss of in-memory computing incurred by the Variations of the RRAM devices. Specifically, weights in the RRAM crossbar are accompanied by dedicated compensation RRAM cells to offset their programming errors with a scaling factor. A programming target shifting mechanism is further designed with the objectives of reducing the hardware overhead and minimizing the compensation errors under large device variations. Based on these two key concepts, we propose double and dynamic compensation schemes and the corresponding support architecture. Since the RRAM cells only account for a small fraction of the overall area of the computing macro due to the dominance of the peripheral circuitry, the overall area overhead of RVComp is low and manageable. Simulation results show RVComp achieves a negligible 1.80% inference accuracy drop for ResNet18 on the CIFAR-10 dataset under 30% device variation with only 7.12% area and 5.02% power overhead and no extra latency.

SESSION: Technical Program: Machine Learning-Based Design Automation

Rethink before Releasing Your Model: ML Model Extraction Attack in EDA

  • Chen-Chia Chang
  • Jingyu Pan
  • Zhiyao Xie
  • Jiang Hu
  • Yiran Chen

Machine learning (ML)-based techniques for electronic design automation (EDA) have boosted the performance of modern integrated circuits (ICs). Such achievements make ML models valuable assets for the EDA industry. In addition, ML models for EDA are widely considered to have high development costs because of the time-consuming and complicated training data generation process. Thus, confidentiality protection for EDA models is a critical issue. However, an adversary could apply model extraction attacks to steal a model, in the sense of achieving performance comparable to the victim’s model. As model extraction attacks have posed great threats to other application domains, e.g., computer vision and natural language processing, in this paper we study model extraction attacks on EDA models under two real-world scenarios. This is the first work that (1) introduces model extraction attacks on EDA models and (2) proposes two attack methods against unlimited and limited query budget scenarios. Our results show that our approach can achieve performance competitive with the well-trained victim model without any performance degradation. Based on these results, we demonstrate that model extraction attacks truly threaten EDA model privacy and hope to raise concerns about ML security issues in EDA.

MacroRank: Ranking Macro Placement Solutions Leveraging Translation Equivariancy

  • Yifan Chen
  • Jing Mai
  • Xiaohan Gao
  • Muhan Zhang
  • Yibo Lin

Modern large-scale designs make extensive use of heterogeneous macros, which can significantly affect routability. Predicting the final routing quality in the early macro placement stage can filter out poor solutions and speed up design closure. By observing that routing is correlated with the relative positions between instances, we propose MacroRank, a macro placement ranking framework leveraging translation equivariance and a Learning to Rank technique. The framework is able to learn the relative order of macro placement solutions and rank them based on routing quality metrics like wirelength, number of vias, and number of shorts. The experimental results show that compared with the most recent baseline, our framework can improve the Kendall rank correlation coefficient by 49.5% and the average performance of top-30 prediction by 8.1%, 2.3%, and 10.6% on wirelength, vias, and shorts, respectively.
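
The Kendall rank correlation coefficient used above as the evaluation metric compares how often the predicted ranking of placement solutions agrees with the ranking by actual routing quality, pair by pair. The sketch below computes a simple tie-free version of it on made-up toy scores; it is provided only to make the metric concrete.

```python
# Simple Kendall rank correlation (ties ignored for brevity): counts concordant vs.
# discordant pairs between predicted scores and ground-truth quality.

from itertools import combinations

def kendall_tau(pred_scores, true_scores):
    concordant = discordant = 0
    for i, j in combinations(range(len(pred_scores)), 2):
        s = (pred_scores[i] - pred_scores[j]) * (true_scores[i] - true_scores[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(pred_scores) * (len(pred_scores) - 1) // 2
    return (concordant - discordant) / n_pairs

predicted_quality = [0.9, 0.4, 0.7, 0.1]   # toy model scores for four macro placements
actual_wirelength = [1.0, 3.0, 2.0, 4.0]   # lower is better, so negate before comparing
print(kendall_tau(predicted_quality, [-w for w in actual_wirelength]))  # 1.0 (perfect agreement)
```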

BufFormer: A Generative ML Framework for Scalable Buffering

  • Rongjian Liang
  • Siddhartha Nath
  • Anand Rajaram
  • Jiang Hu
  • Haoxing Ren

Buffering is a prevalent interconnect optimization technique to help timing closure and is often performed after placement. A common buffering approach is to construct a Steiner tree and then insert buffers on the tree based on a Ginneken-Lillis-style algorithm. Such an approach is difficult to scale to large nets. Our work attempts to solve this problem with a generative machine-learning (ML) approach that requires no Steiner tree construction. Our approach can extract and reuse knowledge from high-quality samples and therefore has significantly improved scalability. A generative ML framework, BufFormer, is proposed to construct the abstract tree topology while simultaneously determining buffer sizes and locations. A baseline method, FLUTE-based Steiner tree construction followed by Ginneken-Lillis-style buffer insertion, is implemented to generate training samples. After training, BufFormer can produce solutions for unseen nets that are highly comparable to the baseline results, with a correlation coefficient of 0.977 in terms of buffer area and 0.934 for driver-sink delays. On average, BufFormer-generated trees achieve similar delays with slightly larger buffer area. Up to 160X speedup can be achieved for large nets when running on a GPU over the baseline on a single CPU thread.

Decoupling Capacitor Insertion Minimizing IR-Drop Violations and Routing DRVs

  • Daijoon Hyun
  • Younggwang Jung
  • Insu Cho
  • Youngsoo Shin

Decoupling capacitor (decap) cells are inserted near functional cells with high switching activity so that their IR-drop can be suppressed. Decap cell design has become more complex and uses higher metal layers, so the cells start to manifest themselves as routing blockages. Post-placement decap insertion, with the goal of minimizing both IR-drop violations and routing design rule violations (DRVs), is addressed for the first time. A U-Net with a graph convolutional network is introduced to predict the routing DRV penalty. The decap insertion problem is formulated and a heuristic algorithm is presented. Experiments with a few test circuits demonstrate that DRVs are reduced by 16% on average with no IR-drop violations, compared to a conventional method that does not explicitly consider DRVs. This results in a 48% reduction in routing runtime and a 23% improvement in total negative slack.

DPRoute: Deep Learning Framework for Package Routing

  • Yeu-Haw Yeh
  • Simon Yi-Hung Chen
  • Hung-Ming Chen
  • Deng-Yao Tu
  • Guan-Qi Fang
  • Yun-Chih Kuo
  • Po-Yang Chen

For routing closure in package designs, net order is critical due to complex design rules and severe wire congestion. However, existing solutions are deliberately designed using heuristics and are difficult to adapt to different design requirements unless the algorithm is updated. This work presents a novel deep learning-based routing framework that can keep improving by accumulating data to accommodate increasingly complex design requirements. Based on the initial routing results, we apply deep learning to concurrent detailed routing to deal with the problem of net ordering decisions. We use multi-agent deep reinforcement learning to learn routing schedules between nets. We regard each net as an agent, which needs to consider the actions of other agents while making pathing decisions to avoid routing conflicts. Experimental results on an industrial package design show that the proposed framework can reduce the number of design rule violations by 99.5% and the wirelength by 2.9% relative to the initial routing.

SESSION: Technical Program: Advanced Techniques for Yields, Low Power and Reliability

High-Dimensional Yield Estimation Using Shrinkage Deep Features and Maximization of Integral Entropy Reduction

  • Shuo Yin
  • Guohao Dai
  • Wei W. Xing

Despite the fast advances in high-sigma yield analysis with the help of machine learning techniques in the past decade, one of the main challenges, the curse of dimensionality, which is inevitable when dealing with modern large-scale circuits, remains unsolved. To resolve this challenge, we propose absolute shrinkage deep kernel learning (ASDK), which automatically identifies the dominant process variation parameters in a nonlinear-correlated deep kernel and acts as a surrogate model to emulate the expensive SPICE simulation. To further improve yield estimation efficiency, we propose a novel maximization of approximated entropy reduction for efficient model updates, which is also enhanced with parallel batch sampling for parallel computing, making it ready for practical deployment. Experiments on SRAM column circuits demonstrate the superiority of ASDK over state-of-the-art (SOTA) approaches in terms of accuracy and efficiency, with up to 11.1x speedup over SOTA methods.

MIA-Aware Detailed Placement and VT Reassignment for Leakage Power Optimization

  • Hung-Chun Lin
  • Shao-Yun Fang

As the feature size decreases, leakage power consumption becomes an important optimization target in design. Using multiple threshold voltages (VTs) in cell-based designs is a popular technique to simultaneously optimize circuit timing and minimize leakage power. However, an arbitrary cell placement result of a multi-VT design may suffer from many design rule violations induced by the Minimum-Implant-Area (MIA) rule, and thus it is necessary to take MIA rules into consideration during the detailed placement stage. The state-of-the-art works on detailed placement that comprehensively tackle MIA rules either disallow VT changes or only allow reducing cell VTs to avoid timing degradation. However, these limitations may either result in larger cell displacement or cause leakage power overhead. In this paper, we propose an optimization framework for VT reassignment and detailed placement to simultaneously consider MIA rules and leakage power minimization under timing constraints. Experimental results show that compared with the state-of-the-art works, the proposed framework can efficiently achieve a better trade-off between leakage power and cell displacement.

SLOGAN: SDC Probability Estimation Using Structured Graph Attention Network

  • Junchi Ma
  • Sulei Huang
  • Zongtao Duan
  • Lei Tang
  • Luyang Wang

The trend of progressive technology scaling makes computing systems more susceptible to soft errors. The most critical issue that soft errors incur is silent data corruption (SDC), since SDC occurs silently without any warning to users. Estimating the SDC probability of a program is the first and essential step toward designing protection mechanisms. Prior work suffers from prediction inaccuracy since the proposed heuristic-based models fail to describe the semantics of fault propagation. We propose a novel approach, SLOGAN, which casts the prediction of SDC probability as a graph regression task. A program is represented in the form of a dynamic dependence graph. To capture the rich semantics of fault propagation, we apply a structured graph attention network, which includes node-level, graph-level, and layer-level self-attention. With the learned attention coefficients from node-level, graph-level, and layer-level self-attention, the importance of edges, nodes, and layers to fault propagation can be fully considered. We generate the graph embedding by weighted aggregation of the node embeddings and compute the SDC probability with the regression model. Experiments show that SLOGAN achieves higher SDC prediction accuracy than state-of-the-art methods with a low time cost.

SESSION: Technical Program: Microarchitectural Design and Neural Networks

Microarchitecture Power Modeling via Artificial Neural Network and Transfer Learning

  • Jianwang Zhai
  • Yici Cai
  • Bei Yu

Accurate and robust power models are highly demanded to explore better CPU designs. However, previous learning-based power models ignore the discrepancies in data distribution among different CPU designs, making it difficult to use data from historical configurations to aid modeling for a new target configuration. In this paper, we investigate the transferability of power models and propose a microarchitecture power modeling method based on transfer learning (TL). A novel TL method for artificial neural network (ANN)-based power models is proposed, where cross-domain mixup generates auxiliary samples close to the target configuration to fill in the distribution discrepancy, and domain-adversarial training extracts domain-invariant features to complete the target model construction. Experiments show that our method greatly improves model transferability and can effectively utilize knowledge of existing CPU configurations to facilitate target power model construction.
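
One simple way such cross-domain interpolation can be realized is sketched below: an auxiliary training sample is formed by mixing a source-configuration sample with a target-configuration sample. The feature vectors, labels, and Beta parameter are placeholders for illustration, not the paper's actual training setup.

```python
# Illustrative sketch of cross-domain mixup between a source-CPU-configuration sample
# and a target-configuration sample (placeholder data, not the paper's scheme).

import random

def cross_domain_mixup(x_src, y_src, x_tgt, y_tgt, alpha=0.4):
    """Return one auxiliary (features, label) pair lying between the two domains."""
    lam = random.betavariate(alpha, alpha)
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_src, x_tgt)]
    y_mix = lam * y_src + (1 - lam) * y_tgt
    return x_mix, y_mix

# Toy usage: microarchitecture event counts from the source config mixed with the
# corresponding counts from the target config, together with their measured power.
x_aux, y_aux = cross_domain_mixup([120.0, 3.5, 0.8], 1.9, [90.0, 2.1, 1.4], 1.2)
print(x_aux, y_aux)
```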

MUGNoC: A Software-Configured Multicast-Unicast-Gather NoC for Accelerating CNN Dataflows

  • Hui Chen
  • Di Liu
  • Shiqing Li
  • Shuo Huai
  • Xiangzhong Luo
  • Weichen Liu

Current communication infrastructures for convolutional neural networks (CNNs) focus only on specific transmission patterns. To reduce data movement, various CNN dataflows have been proposed. Across these dataflows, parameters and results are delivered using different traffic patterns, i.e., multicast, unicast, and gather, which prevents dataflow-specific communication backbones from benefiting the entire system if the dataflow changes or different dataflows run in the same system. Thus, in this paper, we propose MUG-NoC to support and accelerate these typical traffic patterns, thereby boosting multiple dataflows. Specifically, (i) we support, for the first time, multicast in a 2D-mesh software-configurable NoC by revising the router configuration and proposing efficient multicast routing; (ii) we decrease unicast latency by transmitting data through different routes in parallel; and (iii) we reduce output-gather overheads by pipelining basic dataflow units. Experiments show that our proposed design can reduce total data transmission time by at least 39.2% compared with the state-of-the-art CNN communication backbone.

COLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU

  • Bo-Wun Cheng
  • En-Ming Huang
  • Chen-Hao Chao
  • Wei-Fang Sun
  • Tsung-Tai Yeh
  • Chun-Yi Lee

In this work, we aim to capture replicated cache requests between Stream Multiprocessors (SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of modern GPUs. To achieve this objective, we incorporate a per-cluster Cache line Ownership Lookup tABle (COLAB) that keeps track of which SM within a cluster holds a copy of a specific cache line. With the assistance of COLAB, SMs can collaboratively and efficiently process replicated cache requests within SM clusters by redirecting them according to the ownership information stored in COLAB. By servicing replicated cache requests within SM clusters that would otherwise consume precious NoC bandwidth, the heavy pressure on the NoC interconnection can be eased. Our experimental results demonstrate that the adoption of COLAB can indeed alleviate the excessive NoC pressure caused by replicated cache requests, and improve the overall system throughput of the baseline GPU while incurring minimal overhead. On average, COLAB can reduce 38% of the NoC traffic and improve instructions per cycle (IPC) by 43%.

SESSION: Technical Program: Novel Techniques for Scheduling and Memory Optimizations in Embedded Software

Mixed-Criticality with Integer Multiple WCETs and Dropping Relations: New Scheduling Challenges

  • Federico Reghenzani
  • William Fornaciari

Scheduling Mixed-Criticality (MC) workloads is a challenging problem in real-time computing. Earliest Deadline First Virtual Deadline (EDF-VD) is one of the most famous scheduling algorithms, with optimal speedup-bound properties. However, when EDF-VD is used to schedule task sets using a model with additional or relaxed constraints, its scheduling properties change. Inspired by an application of MC to the scheduling of fault-tolerant tasks, in this article we propose two models for multiple criticality levels: the first is a specialization of the MC model, and the second is a generalization of it. We then show, via formal proofs and numerical simulations, that the former considerably improves the speedup bound of EDF-VD. Finally, we provide the proofs related to the optimality of the two models, identifying the need for new scheduling algorithms.

An Exact Schedulability Analysis for Global Fixed-Priority Scheduling of the AER Task Model

  • Thilanka Thilakasiri
  • Matthias Becker

Commercial off-the-shelf (COTS) multi-core platforms offer high performance and a large amount of processing resources. Increased contention when accessing shared resources is a consequence of this high parallelism and one of the main challenges when real-time applications are deployed to these platforms. As a result, several execution models have been proposed to avoid contention by separating access to shared resources from execution.

In this work, we consider the Acquisition-Execution-Restitution (AER) model where contention to shared resources is avoided by design. We propose an exact schedulability test for the AER model under global fixed-priority scheduling using timed automata where we describe the schedulability problem as a reachability problem. To the best of our knowledge, this is the first exact schedulability test for the AER model under global fixed-priority scheduling on multiprocessor platforms. The performance of the proposed approach is evaluated using synthetic experiments and provides up to 65% more schedulable task sets than the state-of-the-art.

Skyrmion Vault: Maximizing Skyrmion Lifespan for Enabling Low-Power Skyrmion Racetrack Memory

  • Syue-Wei Lu
  • Shuo-Han Chen
  • Yu-Pei Liang
  • Yuan-Hao Chang
  • Kang Wang
  • Tseng-Yi Chen
  • Wei-Kuan Shih

Skyrmion racetrack memory (SK-RM) has demonstrated great potential as a high-density and low-cost nonvolatile memory. Nevertheless, even though random data accesses are supported on SK-RM, data accesses cannot be carried out on individual data bits directly. Instead, special skyrmion manipulations, such as injecting and shifting, are required to support random information update and deletion. With such special manipulations, the latency and energy consumption of skyrmion manipulations can quickly accumulate and induce additional overhead on the data read/write path of SK-RM. Meanwhile, the injection operation consumes more energy and has higher latency than any other manipulation. Although prior arts have tried to alleviate the overhead of skyrmion manipulations, the possibility of minimizing injections by buffering skyrmions for future reuse and energy conservation has received much less attention. This observation motivates us to propose the concept of a skyrmion vault, which effectively utilizes the skyrmion buffer track structure for energy conservation by maximizing the lifespan of injected skyrmions and minimizing the number of skyrmion injections. Experimental results show promising improvements in both energy consumption and skyrmion lifespan.

SESSION: Technical Program: Efficient Circuit Simulation and Synthesis for Analog Designs

Parallel Incomplete LU Factorization Based Iterative Solver for Fixed-Structure Linear Equations in Circuit Simulation

  • Lingjie Li
  • Zhiqiang Liu
  • Kan Liu
  • Shan Shen
  • Wenjian Yu

A series of fixed-structure sparse linear equations are solved in a circuit simulation process. We propose a parallel incomplete LU (ILU) preconditioned GMRES solver for those equations. A new subtree-based scheduling algorithm for ILU factorization and forward/backward substitution is adopted to overcome the load-balancing and data-locality problems of conventional levelization-based scheduling. Experimental results show that the proposed scheduling algorithm achieves up to 2.6X speedup for ILU factorization and 3.1X speedup for forward/backward substitution compared to levelization-based scheduling. The proposed ILU-GMRES solver achieves around 4X parallel speedup with 8 threads, which is up to 2.1X faster than the solver based on the levelization scheme. The proposed parallel solver also shows a remarkable advantage over existing methods (including HSPICE) on transient simulation of linear and nonlinear circuits.
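The ILU-preconditioned GMRES combination at the core of such a solver can be sketched with SciPy's sequential building blocks; the paper's contribution, the parallel subtree-based scheduling, is not reproduced here, and the random test matrix below is only a stand-in for a circuit matrix.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000                                               # stand-in for one fixed-structure system
A = (sp.random(n, n, density=0.01, random_state=0) + 10.0 * sp.identity(n)).tocsc()
b = np.random.default_rng(0).standard_normal(n)

ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)     # incomplete LU factorization
M = spla.LinearOperator(A.shape, ilu.solve)            # use the ILU factors as a preconditioner

x, info = spla.gmres(A, b, M=M, maxiter=500)           # preconditioned GMRES solve
print("converged" if info == 0 else f"gmres info = {info}")
```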

Accelerated Capacitance Simulation of 3-D Structures with Considerable Amounts of General Floating Metals

  • Jiechen Huang
  • Wenjian Yu
  • Mingye Song
  • Ming Yang

Floating metals are special conductors introduced into conductor structures by design for manufacturing (DFM). They make accurate capacitance simulation difficult. In this work, we aim to accelerate the floating random walk (FRW)-based capacitance simulation for structures with considerable amounts of general floating metals. We first discuss how the existing modified FRW is affected by the integral surfaces of floating metals and propose an improved placement of the integral surface. Then, we propose a hybrid approach called incomplete network reduction to avoid random transitions being trapped by floating metals. Experiments on structures from IC and FPD design, which involve multiple floating metals and single or multiple master conductors, have shown the effectiveness of the proposed techniques. The proposed techniques reduce the computational time of capacitance calculation while preserving the accuracy.

On Automating Finger-Cap Array Synthesis with Optimal Parasitic Matching for Custom SAR ADC

  • Cheng-Yu Chiang
  • Chia-Lin Hu
  • Mark Po-Hung Lin
  • Yu-Szu Chung
  • Shyh-Jye Jou
  • Jieh-Tsorng Wu
  • Shiuh-hua Wood Chiang
  • Chien-Nan Jimmy Liu
  • Hung-Ming Chen

Due to its excellent power efficiency, the successive-approximation-register (SAR) analog-to-digital converter (ADC) is an attractive choice for low-power ADC implementations. In analog layout design, the parasitics induced by interconnecting wires and elements affect the accuracy and performance of the device. Due to low-power and high-speed requirements, arrays of very small lateral metal-metal capacitor units are usually adopted as the capacitor-array architecture. Beyond power consumption and area, the parasitic capacitance significantly affects the matching properties and settling time of the capacitors. This work presents a framework to synthesize good-quality binary-weighted capacitors for custom SAR ADCs. It also proposes a parasitic-aware, ILP-based, weight-dynamic network routing algorithm to generate a layout that considers parasitic capacitance and capacitance-ratio mismatch simultaneously. The experimental results show that the effective number of bits (ENOB) of the layout generated by our approach is comparable to or better than that of manual designs and other automated works, closing the gap between pre-simulation and post-simulation results.

SESSION: Technical Program: Security of Heterogeneous Systems Containing FPGAs

FPGANeedle: Precise Remote Fault Attacks from FPGA to CPU

  • Mathieu Gross
  • Jonas Krautter
  • Dennis Gnad
  • Michael Gruber
  • Georg Sigl
  • Mehdi Tahoori

FPGAs used as general-purpose accelerators can greatly improve system efficiency and performance in cloud and edge devices alike. However, they have recently become the focus of remote attacks, such as fault and side-channel attacks mounted from one user of a part of the FPGA fabric against another. In this work, we consider system-on-chip platforms where an FPGA and an embedded processor core are located on the same die. We show that the embedded processor core is vulnerable to voltage drops generated by the FPGA logic. Our experiments demonstrate the possibility of compromising the data transfer from external DDR memory to the processor cache hierarchy. Furthermore, we were also able to fault and skip instructions executed on an ARM Cortex-A9 core. The FPGA-based fault injection is shown to be precise enough to recover the secret key of an AES T-tables implementation found in the mbedTLS library.

FPGA Based Countermeasures against Side Channel Attacks on Block Ciphers

  • Darshana Jayasinghe
  • Brian Udugama
  • Sri Parameswaran

Field Programmable Gate Arrays (FPGAs) are increasingly ubiquitous. FPGAs enable hardware acceleration and reconfigurability. Any security breach or attack on critical computations occurring on an FPGA can lead to devastating consequences. Side-channel attacks have the ability to reveal secret information, such as secret keys, from cryptographic circuits running on FPGAs. Power analysis (PA), electromagnetic (EM) radiation, fault injection (FI), and remote power analysis (RPA) attacks are the most compelling and noninvasive side-channel attacks demonstrated on FPGAs. This paper discusses two PA attack countermeasures (QuadSeal and RFTC) and one RPA attack countermeasure (UCloD) in detail to protect FPGAs.

SESSION: Technical Program: Novel Application & Architecture-Specific Quantization Techniques

Block-Wise Dynamic-Precision Neural Network Training Acceleration via Online Quantization Sensitivity Analytics

  • Ruoyang Liu
  • Chenhan Wei
  • Yixiong Yang
  • Wenxun Wang
  • Huazhong Yang
  • Yongpan Liu

Data quantization is an effective method to accelerate neural network training and reduce power consumption. However, it is challenging to perform low-bit quantized training: conventional equal-precision quantization leads to either high accuracy loss or limited bit-width reduction, while existing mixed-precision methods offer high compression potential but fail to perform accurate and efficient bit-width assignment. In this work, we propose DYNASTY, a block-wise dynamic-precision neural network training framework. DYNASTY provides accurate data sensitivity information through fast online analytics and maintains stable training convergence with an adaptive bit-width map generator. Network training experiments on the CIFAR-100 and ImageNet datasets are carried out, and compared to an 8-bit quantization baseline, DYNASTY brings up to 5.1× speedup and 4.7× energy consumption reduction with no accuracy drop and negligible hardware overhead.
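The flavor of block-wise bit-width assignment can be illustrated with a greedy toy routine that gives more bits to blocks with higher (externally supplied) quantization sensitivity under an average-bit-width budget; DYNASTY's online sensitivity analytics and adaptive bit-width map generator are not reproduced here, and the numbers are placeholders.

```python
import numpy as np

def assign_bitwidths(sensitivity, budget_avg_bits, choices=(2, 4, 8)):
    """Greedy toy assignment: the most sensitive blocks get the widest bit-width
    that still keeps the average bit-width within budget."""
    bits = np.full(len(sensitivity), min(choices))
    for idx in np.argsort(-np.asarray(sensitivity)):    # most sensitive blocks first
        for b in sorted(choices):
            trial = bits.copy()
            trial[idx] = b
            if trial.mean() <= budget_avg_bits:
                bits[idx] = b                           # keep the widest width that fits
    return bits

print(assign_bitwidths([0.9, 0.1, 0.5, 0.05], budget_avg_bits=4.0))   # e.g. [8 2 4 2]
```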

Quantization through Search: A Novel Scheme to Quantize Convolutional Neural Networks in Finite Weight Space

  • Qing Lu
  • Weiwen Jiang
  • Xiaowei Xu
  • Jingtong Hu
  • Yiyu Shi

Quantization has become an essential technique in compressing deep neural networks for deployment onto resource-constrained hardware. It is noticed that the hardware efficiency of implementing quantized networks is highly coupled with the actual values quantized into; therefore, with given bit widths, we can smartly choose a value space to further boost hardware efficiency. For example, using weights that are only integer powers of two, multiplication can be fulfilled by bit operations. Under such circumstances, however, existing quantization-aware training methods are either not suitable to apply or unable to unleash the expressiveness of very low bit-widths. For the best hardware efficiency, we revisit the quantization of convolutional neural networks and propose to address the training process from a weight-searching angle, as opposed to optimizing the quantizer functions as in existing works. Extensive experiments on CIFAR10 and ImageNet classification tasks are conducted with implementations on well-established CNN architectures such as ResNet, VGG, and MobileNet. The proposed method is shown to achieve lower accuracy loss than the state of the art and/or improve implementation efficiency by using hardware-friendly weight values.
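As a small illustration of quantizing into a finite, hardware-friendly value space (here signed powers of two plus zero, so multiplications reduce to shifts), the sketch below simply projects weights onto the nearest value in that space; it is not the paper's search-based training, and the exponent range is an arbitrary assumption.

```python
import numpy as np

def power_of_two_space(min_exp=-3, max_exp=0):
    """Finite weight space {0} U {+/-2^e}: multiplications become shifts (illustrative)."""
    exps = np.arange(min_exp, max_exp + 1)
    return np.sort(np.concatenate(([0.0], 2.0 ** exps, -(2.0 ** exps))))

def project_to_space(w, space):
    """Map every weight to its nearest value in the finite space."""
    idx = np.abs(w[..., None] - space).argmin(axis=-1)
    return space[idx]

w = np.random.default_rng(0).normal(scale=0.3, size=(4, 4))
print(project_to_space(w, power_of_two_space()))   # every entry is 0 or a signed power of two
```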

Multi-Wavelength Parallel Training and Quantization-Aware Tuning for WDM-Based Optical Convolutional Neural Networks Considering Wavelength-Relative Deviations

  • Ying Zhu
  • Min Liu
  • Lu Xu
  • Lei Wang
  • Xi Xiao
  • Shaohua Yu

Wavelength Division Multiplexing (WDM)-based Mach-Zehnder Interferometer Optical Convolutional Neural Networks (MZI-OCNNs) have emerged as a promising platform to accelerate convolutions, which consume most of the computing resources in neural networks. However, the wavelength-relative imperfect split ratios and actual phase shifts in MZIs, together with quantization errors from the electronic configuration module, degrade the inference accuracy of WDM-based MZI-OCNNs and thus render them unusable in practice. In this paper, we propose a framework that models the split ratios and phase shifts under different wavelengths, incorporates them into OCNN training, and introduces quantization-aware tuning to maintain inference accuracy and reduce electronic module complexity. Consequently, the framework improves the inference accuracy by 49%, 76%, and 76% for LeNet5, VGG7, and VGG8, respectively, implemented with multi-wavelength parallel computing. Instead of 32/64-bit floating-point resolutions, only 5, 6, and 4 bits are needed, and fewer quantization levels are utilized for the configuration signals.

Semantic Guided Fine-Grained Point Cloud Quantization Framework for 3D Object Detection

  • Xiaoyu Feng
  • Chen Tang
  • Zongkai Zhang
  • Wenyu Sun
  • Yongpan Liu

Unlike grid-structured RGB images, network compression, i.e., pruning and quantization, for the irregular and sparse 3D point cloud faces more challenges. Traditional quantization ignores the unbalanced semantic distribution in 3D point clouds. In this work, we propose a semantic-guided adaptive quantization framework for 3D point clouds. Different from traditional quantization methods that adopt a static and uniform quantization scheme, our proposed framework can adaptively locate the semantic-rich foreground points in the feature maps to allocate a higher bitwidth to these “important” points. Since the foreground points make up a small proportion of the sparse 3D point cloud, such adaptive quantization can achieve higher accuracy than uniform compression under a similar compression rate. Furthermore, we adopt a block-wise fine-grained compression scheme in the proposed framework to fit the larger dynamic range in the point cloud. Moreover, a 3D-point-cloud-based software and hardware co-evaluation process is proposed to evaluate the effectiveness of the proposed adaptive quantization on actual hardware devices. Based on the nuScenes dataset, we achieve 12.52% precision improvement under average 2-bit quantization. Compared with 8-bit quantization, we achieve 3.11× energy efficiency based on co-evaluation results.
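A toy sketch of the adaptive idea, spending more bits on the semantic-rich foreground points than on the background, is given below with a plain uniform quantizer; the paper's semantic localization, block-wise scheme, and hardware co-evaluation are not reproduced, and the chosen bit-widths and data are arbitrary.

```python
import numpy as np

def quantize(x, bits):
    """Plain uniform symmetric quantization to `bits` bits (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax if np.abs(x).max() > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale

def adaptive_point_quant(features, foreground_mask, fg_bits=8, bg_bits=2):
    """Spend more bits on semantic-rich foreground points than on background points."""
    out = np.empty_like(features)
    out[foreground_mask] = quantize(features[foreground_mask], fg_bits)
    out[~foreground_mask] = quantize(features[~foreground_mask], bg_bits)
    return out

feats = np.random.default_rng(0).standard_normal((1000, 64)).astype(np.float32)
fg = np.zeros(1000, dtype=bool)
fg[:50] = True                                   # foreground points are a small minority
print(np.abs(feats - adaptive_point_quant(feats, fg)).mean())
```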

SESSION: Technical Program: Approximate Brain-Inspired Architectures for Efficient Learning

ReMeCo: Reliable Memristor-Based in-Memory Neuromorphic Computation

  • Ali BanaGozar
  • Seyed Hossein Hashemi Shadmehri
  • Sander Stuijk
  • Mehdi Kamal
  • Ali Afzali-Kusha
  • Henk Corporaal

Memristor-based in-memory neuromorphic computing systems promise a highly efficient implementation of vector-matrix multiplications, commonly used in artificial neural networks (ANNs). However, the immature fabrication process of memristors and circuit-level limitations, i.e., stuck-at faults (SAF), IR-drop, and device-to-device (D2D) variation, degrade the reliability of these platforms and thus impede their wide deployment. In this paper, we present ReMeCo, a redundancy-based reliability improvement framework. It addresses the non-idealities while constraining the induced overhead. It achieves this by performing a sensitivity analysis on the ANN. With the acquired insight, ReMeCo avoids the redundant calculation of the least sensitive neurons and layers. ReMeCo uses a heuristic approach to find the balance between recovered accuracy and imposed overhead. ReMeCo further decreases hardware redundancy by exploiting the bit-slicing technique. In addition, the framework employs ensemble averaging at the output of every ANN layer to incorporate the redundant neurons. The efficacy of ReMeCo is assessed using two well-known ANN models, i.e., LeNet and AlexNet, on the MNIST and CIFAR10 datasets. Our results show 98.5% accuracy recovery with roughly 4% redundancy, which is more than 20× lower than the state of the art.

SyFAxO-GeN: Synthesizing FPGA-Based Approximate Operators with Generative Networks

  • Rohit Ranjan
  • Salim Ullah
  • Siva Satyendra Sahoo
  • Akash Kumar

With the rising trend of moving AI inference to the edge due to communication and privacy challenges, there has been a growing focus on designing low-cost Edge-AI. Given the diversity of application areas at the edge, FPGA-based systems are increasingly used for high-performance inference. Similarly, approximate computing has emerged as a viable approach to achieving disproportionate resource gains by utilizing applications' inherent robustness. However, most related research has focused on selecting the appropriate approximate operators for an application from a set of ASIC-based designs. This approach fails to leverage the FPGA's architectural benefits and limits the scope of approximation to already existing generic designs. To this end, we propose an AI-based approach to synthesizing novel approximate operators for the FPGA's look-up-table-based structure. Specifically, we use state-of-the-art generative networks to search for constraint-aware arithmetic operator designs optimized for FPGA-based implementation. With the proposed GANs, we report up to 49% faster training, with negligible accuracy degradation, than related generative networks. Similarly, we report improved hypervolume and more Pareto-front design points compared to state-of-the-art approaches to synthesizing approximate multipliers.

Approximating HW Accelerators through Partial Extractions onto Shared Artificial Neural Networks

  • Prattay Chowdhury
  • Jorge Castro Godínez
  • Benjamin Carrion Schafer

One approach that has been suggested to further reduce the energy consumption of heterogeneous Systems-on-Chip (SoCs) is approximate computing. In approximate computing, the error at the output is relaxed in order to simplify the hardware and thus achieve lower power. Fortunately, most of the hardware accelerators in these SoCs are also amenable to approximate computing.

In this work, we propose a fully automatic method that substitutes portions of a hardware accelerator specified in C/C++/SystemC for High-Level Synthesis (HLS) with an Artificial Neural Network (ANN). ANNs have many advantages that make them well suited for this. First, they are very scalable, which allows multiple separate portions of the behavioral description to be approximated on them simultaneously. Second, multiple ANNs can be fused together and re-optimized to further reduce the power consumption. We use this to share the ANN across multiple different HW accelerators in the same SoC. Experimental results with different error thresholds show that our proposed approach leads to better results than the state of the art.

DependableHD: A Hyperdimensional Learning Framework for Edge-Oriented Voltage-Scaled Circuits

  • Dehua Liang
  • Hiromitsu Awano
  • Noriyuki Miura
  • Jun Shiomi

Voltage scaling is one of the most promising approaches for energy efficiency improvement, but it also brings challenges in fully guaranteeing stable operation in modern VLSI. To tackle such issues, we propose DependableHD, a learning framework based on HyperDimensional Computing (HDC), which enables systems to tolerate bit-level memory failures in the low-voltage region with high robustness. For the first time, DependableHD introduces the concept of margin enhancement for model retraining and utilizes noise injection to improve robustness; it can be applied to most state-of-the-art HDC algorithms. Our experiments show that under 10% memory error, DependableHD exhibits a 1.22% accuracy loss on average, an 11.2× improvement compared to the baseline HDC solution. The hardware evaluation shows that DependableHD allows the supply voltage to be reduced from 400 mV to 300 mV, providing a 50.41% energy consumption reduction while maintaining competitive accuracy.

SESSION: Technical Program: Retrospect and Prospect of Verification and Test Technologies

EDDY: A Multi-Core BDD Package with Dynamic Memory Management and Reduced Fragmentation

  • Rune Krauss
  • Mehran Goli
  • Rolf Drechsler

In recent years, hardware systems have grown significantly in complexity. Due to this increasing complexity, there is a need to continuously improve the quality of the hardware design process. This leads designers to strive for more efficient data structures, and algorithms operating on them, to guarantee the correct behavior of such systems through verification techniques like model checking and to meet time-to-market constraints. A Binary Decision Diagram (BDD) is a suitable data structure as it provides a canonical, compact representation of Boolean functions for a given variable ordering, along with efficient algorithms for manipulating them. However, reduced ordered BDDs also have challenges: the BDD construction of some complex practical functions requires a large amount of memory, and the usability of realizations in the form of BDD packages strongly depends on the application.

To address these issues, this paper presents a novel multi-core package called Engineer Decision Diagrams Yourself (EDDY) with dynamic memory management and reduced fragmentation. Experiments on BDD benchmarks of both combinational circuits and model checking show that using EDDY leads to a significant performance boost compared to state-of-the-art packages.

Exploiting Reversible Computing for Verification: Potential, Possible Paths, and Consequences

  • Lukas Burgholzer
  • Robert Wille

Today, the verification of classical circuits poses a severe challenge for the design of circuits and systems. While the underlying (exponential) complexity is tackled in various fashions (simulation-based approaches, emulation, formal equivalence checking, fuzzing, model checking, etc.), no “silver bullet” has been found yet that allows escaping the growing verification gap. In this work, we entertain and investigate the idea of a complementary approach that aims at exploiting reversible computing. More precisely, we show the potential of the reversible computing paradigm for verification, debunk misleading paths that do not allow this potential to be exploited, and discuss the resulting consequences for the development of future, complementary design and verification flows. An extensive empirical study (involving more than 30 million simulations) confirms these findings. Although this work cannot provide a fully-fledged realization yet, it may provide the basis for an alternative path toward overcoming the verification gap.

Automatic Test Pattern Generation and Compaction for Deep Neural Networks

  • Dina Moussa
  • Michael Hefenbrock
  • Christopher Münch
  • Mehdi Tahoori

Deep Neural Networks (DNNs) have gained considerable attention lately due to their excellent performance on a wide range of recognition and classification tasks. Accordingly, fault detection in DNNs and their implementations plays a crucial role in the quality of DNN implementations, ensuring that their post-mapping and in-field accuracy matches the model accuracy. This paper proposes a functional-level automatic test pattern generation approach for DNNs. This is done by generating inputs that cause misclassification of the output class label in the presence of single or multiple faults. Furthermore, to obtain a smaller set of test patterns with full coverage, a heuristic algorithm as well as a test-pattern clustering method using K-means were implemented. The experimental results show that the proposed test patterns achieve the highest label misclassification and a high output deviation compared to state-of-the-art approaches.
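The compaction step can be illustrated with scikit-learn's K-means: cluster the generated patterns and keep the pattern nearest each centroid as the representative. The data below is synthetic, and the specific selection rule is an assumption for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def compact_test_patterns(patterns, n_keep, seed=0):
    """Cluster generated test patterns and keep the one closest to each centroid."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(patterns)
    kept = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(patterns[members] - km.cluster_centers_[c], axis=1)
        kept.append(members[np.argmin(dist)])
    return np.array(sorted(kept))

patterns = np.random.default_rng(0).random((500, 32))   # stand-in for generated DNN test inputs
print(compact_test_patterns(patterns, n_keep=20))       # indices of the retained patterns
```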

Wafer-Level Characteristic Variation Modeling Considering Systematic Discontinuous Effects

  • Takuma Nagao
  • Tomoki Nakamura
  • Masuo Kajiyama
  • Makoto Eiki
  • Michiko Inoue
  • Michihiro Shintani

Statistical wafer-level variation modeling is an attractive method for reducing the measurement cost in large-scale integrated circuit (LSI) testing while maintaining test quality. In this method, the performance of unmeasured LSI circuits manufactured on a wafer is statistically predicted from a few measured LSI circuits. Conventional statistical methods model spatially smooth variations across a wafer. However, actual wafers may have discontinuous variations that are systematically caused by the manufacturing environment, such as shot dependence. In this study, we propose a modeling method that considers discontinuous variations in wafer characteristics by applying the knowledge of manufacturing engineers to a model estimated using Gaussian process regression. In the proposed method, the process variation is decomposed into systematic discontinuous and global components to improve the estimation accuracy. An evaluation performed using an industrial production test dataset shows that the proposed method reduces the estimation error for an entire wafer by over 33% compared to conventional methods.
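The core statistical step, fitting a Gaussian process to a few measured dies and predicting the rest, can be sketched with scikit-learn on synthetic data; the paper's decomposition into systematic discontinuous and global components is only crudely approximated here by adding a shot index as an extra feature (an assumption, not the authors' formulation).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(2000, 2))                     # normalized die positions on a wafer
shot = (xy * 4).astype(int)                                  # crude shot (reticle) index per die
truth = np.sin(2 * xy[:, 0]) + 0.3 * xy[:, 1] + 0.05 * shot[:, 0]   # synthetic characteristic

measured = rng.choice(len(xy), size=100, replace=False)      # only a few dies are measured
X = np.hstack([xy, shot])                                    # position plus shot index as features
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(1e-3),
                               normalize_y=True).fit(X[measured], truth[measured])

unmeasured = np.setdiff1d(np.arange(len(xy)), measured)
print("mean abs. prediction error:", np.abs(gpr.predict(X) - truth)[unmeasured].mean())
```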

SESSION: Technical Program: Computing, Erasing, and Protecting: The Security Challenges for the Next Generation of Memories

Hardware Security Primitives Using Passive RRAM Crossbar Array: Novel TRNG and PUF Designs

  • Simranjeet Singh
  • Furqan Zahoor
  • Gokul Rajendran
  • Sachin Patkar
  • Anupam Chattopadhyay
  • Farhad Merchant

With rapid advancements in electronic gadgets, the security and privacy aspects of these devices have become significant. Physical unclonable functions (PUFs) and true random number generators (TRNGs) are critical hardware security primitives for the design of secure systems. This paper proposes novel implementations of PUFs and TRNGs on the RRAM crossbar structure. Firstly, two techniques to implement a TRNG in the RRAM crossbar are presented, based on write-back and on a 50% switching-probability pulse. The randomness of the proposed TRNGs is evaluated using the NIST test suite. Next, an architecture to implement a PUF in the RRAM crossbar is presented. The initial entropy source for the PUF is taken from the TRNGs, and challenge-response pairs (CRPs) are collected. The proposed PUF exploits device variations and sneak-path current to produce unique CRPs. We demonstrate, through extensive experiments, reliability of 100%, uniqueness of 47.78%, uniformity of 49.79%, and bit-aliasing of 48.57% without any post-processing techniques. Finally, the design is compared with the literature to evaluate its implementation efficiency, and it is found to be clearly superior to the state of the art.

Data Sanitization on eMMCs

  • Aya Fukami
  • Francesco Regazzoni
  • Zeno Geradts

Data sanitization of modern digital devices is an important issue given that electronic waste is being recycled and repurposed. The embedded Multi Media Card (eMMC), one of the NAND flash memory-based commodity devices, is one of the popularly recycled products in the current recycling ecosystem. We analyze a repurposed device and evaluate its sanitization practice. Data from the formerly used device can still be recovered, which may lead to unintentional leakage of sensitive data such as personally identifiable information (PII). Since the internal storage of an eMMC is NAND flash memory, the sanitization practices of NAND flash memory-based systems should apply to the eMMC. However, a proper sanitize operation is evidently not always performed in the current recycling ecosystem. We discuss how data stored in eMMCs and other flash memory-based devices need to be deleted in order to avoid potential data leakage. We also review NAND flash memory data sanitization schemes and discuss how they should be applied to eMMCs.

Fundamentally Understanding and Solving RowHammer

  • Onur Mutlu
  • Ataberk Olgun
  • A. Giray Yağlıkcı

We provide an overview of recent developments and future directions in the RowHammer vulnerability that plagues modern DRAM (Dynamic Random Access Memory) chips, which are used in almost all computing systems as main memory.

RowHammer is the phenomenon in which repeatedly accessing a row in a real DRAM chip causes bitflips (i.e., data corruption) in physically nearby rows. This phenomenon leads to a serious and widespread system security vulnerability, as many works since the original RowHammer paper in 2014 have shown. Recent analysis of the RowHammer phenomenon reveals that the problem is getting much worse as DRAM technology scaling continues: newer DRAM chips are fundamentally more vulnerable to RowHammer at the device and circuit levels. Deeper analysis of RowHammer shows that there are many dimensions to the problem as the vulnerability is sensitive to many variables, including environmental conditions (temperature & voltage), process variation, stored data patterns, as well as memory access patterns and memory control policies. As such, it has proven difficult to devise fully-secure and very efficient (i.e., low-overhead in performance, energy, area) protection mechanisms against RowHammer and attempts made by DRAM manufacturers have been shown to lack security guarantees.

After reviewing various recent developments in exploiting, understanding, and mitigating RowHammer, we discuss future directions that we believe are critical for solving the RowHammer problem. We argue for two major directions to amplify research and development efforts in: 1) building a much deeper understanding of the problem and its many dimensions, in both cutting-edge DRAM chips and computing systems deployed in the field, and 2) the design and development of extremely efficient and fully-secure solutions via system-memory cooperation.

SESSION: Technical Program: System-Level Codesign in DNN Accelerators

Hardware-Software Codesign of DNN Accelerators Using Approximate Posit Multipliers

  • Tom Glint
  • Kailash Prasad
  • Jinay Dagli
  • Krishil Gandhi
  • Aryan Gupta
  • Vrajesh Patel
  • Neel Shah
  • Joycee Mekie

Emerging data-intensive AI/ML workloads encounter the memory and power walls when run on general-purpose compute cores. This has led to the development of a myriad of techniques to deal with such workloads, among which DNN accelerator architectures have found a prominent place. In this work, we propose a hardware-software co-design approach to achieve system-level benefits. We propose a quantized, data-aware posit number representation that leads to a highly optimized DNN accelerator. We demonstrate this work on the SOTA SIMBA architecture, and it is extendable to any other accelerator. Our proposal reduces the buffer/storage requirements within the architecture and reduces the data transfer cost between the main memory and the DNN accelerator. We have investigated the impact of using integer, IEEE floating-point, and posit multipliers for LeNet, ResNet, and VGG NNs trained and tested on the MNIST, CIFAR10, and ImageNet datasets, respectively. Our system-level analysis shows that the proposed approximate fixed-posit multiplier, when implemented on the SIMBA architecture, achieves on average ~2.2× speedup, consumes ~3.1× less energy, and requires ~3.2× less area compared with the baseline SOTA architecture, without loss of accuracy (~±1%).

Reusing GEMM Hardware for Efficient Execution of Depthwise Separable Convolution on ASIC-Based DNN Accelerators

  • Susmita Dey Manasi
  • Suvadeep Banerjee
  • Abhijit Davare
  • Anton A. Sorokin
  • Steven M. Burns
  • Desmond A. Kirkpatrick
  • Sachin S. Sapatnekar

Deep learning (DL) accelerators are optimized for standard convolution. However, lightweight convolutional neural networks (CNNs) use depthwise convolution (DwC) in key layers, and the structural difference between DwC and standard convolution leads to a significant performance bottleneck in executing lightweight CNNs on such platforms. This work reuses the fast general matrix multiplication (GEMM) core of DL accelerators by mapping DwC to channel-wise parallel matrix-vector multiplications. An analytical framework is developed to guide pre-RTL hardware choices, and new hardware modules and software support are developed for end-to-end evaluation of the solution. This GEMM-based DwC execution strategy offers substantial performance gains for lightweight CNNs: a 7× speedup and 1.8× lower off-chip communication for MobileNet-v1 over a conventional DL accelerator, a 74× speedup over a CPU, and even a 1.4× speedup over a power-hungry GPU.
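The mapping idea itself can be shown in a few lines: each channel of a depthwise convolution becomes an independent im2col matrix-vector product, which is the shape of work a GEMM-style core already handles. The NumPy sketch below is only a functional illustration, not the paper's hardware mapping.

```python
import numpy as np

def depthwise_conv_as_matvec(x, w):
    """Depthwise convolution (stride 1, no padding) as one independent
    matrix-vector product per channel.  x: (C, H, W), w: (C, K, K)."""
    C, H, W = x.shape
    K = w.shape[1]
    Ho, Wo = H - K + 1, W - K + 1
    out = np.empty((C, Ho, Wo))
    for c in range(C):                                                   # channels are independent
        patches = np.lib.stride_tricks.sliding_window_view(x[c], (K, K)) # (Ho, Wo, K, K)
        A = patches.reshape(Ho * Wo, K * K)                              # im2col for this channel
        out[c] = (A @ w[c].reshape(K * K)).reshape(Ho, Wo)               # matrix-vector product
    return out

x = np.random.default_rng(0).standard_normal((8, 16, 16))
w = np.random.default_rng(1).standard_normal((8, 3, 3))
print(depthwise_conv_as_matvec(x, w).shape)                              # (8, 14, 14)
```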

BARVINN: Arbitrary Precision DNN Accelerator Controlled by a RISC-V CPU

  • Mohammadhossein Askarihemmat
  • Sean Wagner
  • Olexa Bilaniuk
  • Yassine Hariri
  • Yvon Savaria
  • Jean-Pierre David

We present a DNN accelerator that allows inference at arbitrary precision with dedicated processing elements that are configurable at the bit level. Our DNN accelerator has 8 Processing Elements controlled by a RISC-V controller, with a combined 8.2 TMACs of computational power when implemented on the recent Alveo U250 FPGA platform. We develop a code generator tool that ingests CNN models in ONNX format and generates an executable command stream for the RISC-V controller. We demonstrate the scalable throughput of our accelerator by running different DNN kernels and models when different quantization levels are selected. Compared to other low-precision accelerators, our accelerator provides run-time programmability without hardware reconfiguration and can accelerate DNNs with multiple quantization levels, regardless of the target FPGA size. BARVINN is an open-source project and is available at https://github.com/hossein1387/BARVINN.

Agile Hardware and Software Co-Design for RISC-V-Based Multi-Precision Deep Learning Microprocessor

  • Zicheng He
  • Ao Shen
  • Qiufeng Li
  • Quan Cheng
  • Hao Yu

Recent network architecture search (NAS) has been widely applied to simplify deep learning neural networks, which typically results in a multi-precision network. Many multi-precision accelerators have also been developed to compute multi-precision networks, but mapping networks onto them is still largely manual. A software-hardware interface is thereby needed to automatically map multi-precision networks onto multi-precision accelerators. In this paper, we have developed an agile hardware and software co-design for a RISC-V-based multi-precision deep learning microprocessor. We have designed custom RISC-V instructions with a framework to automatically compile multi-precision CNN networks onto multi-precision CNN accelerators, demonstrated on an FPGA. Experiments show that with NAS-optimized multi-precision CNN models (LeNet, VGG16, ResNet, MobileNet), the RISC-V core with multi-precision accelerators can reach the highest throughput in 2-, 4-, and 8-bit precisions, respectively, on a Xilinx ZCU102 FPGA.

SESSION: Technical Program: New Advances in Hardware Trojan Detection

Hardware Trojan Detection Using Shapley Ensemble Boosting

  • Zhixin Pan
  • Prabhat Mishra

Due to the globalized semiconductor supply chain, there is an increasing risk of exposing system-on-chip designs to hardware Trojans (HTs). While there are promising machine learning-based HT detection techniques, they have three major limitations: ad-hoc feature selection, lack of explainability, and vulnerability to adversarial attacks. In this paper, we propose a novel HT detection approach using an effective combination of Shapley value analysis and a boosting framework. Specifically, this paper makes two important contributions. We use Shapley values (SHAP) to analyze the importance ranking of input features; this not only provides an explainable interpretation for HT detection, but also serves as a guideline for feature selection. We utilize boosting (ensemble learning) to generate a sequence of lightweight models, which significantly reduces training time while providing robustness against adversarial attacks. Experimental results demonstrate that our approach can drastically improve both detection accuracy (up to 24.6%) and time efficiency (up to 5.1×) compared to state-of-the-art HT detection techniques.
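The Shapley-value-guided feature ranking can be approximated with a small Monte Carlo permutation estimator wrapped around a boosted classifier, as sketched below on synthetic features. This is a generic stand-in for the idea, not the paper's SHAP-based pipeline or its netlist features; all data and names are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 10))                    # stand-in design/netlist features
y = (X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(2000) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)   # boosting (ensemble learning)

def shapley_importance(model, Xb, yb, n_perm=20, seed=0):
    """Monte Carlo permutation estimate of each feature's Shapley value for test
    accuracy; an 'absent' feature is neutralized by shuffling its column."""
    r = np.random.default_rng(seed)
    phi = np.zeros(Xb.shape[1])
    for _ in range(n_perm):
        order = r.permutation(Xb.shape[1])
        Xm = Xb.copy()
        for col in order:
            Xm[:, col] = r.permutation(Xm[:, col])     # remove every feature
        prev = model.score(Xm, yb)
        for col in order[::-1]:                         # add them back one at a time
            Xm[:, col] = Xb[:, col]
            cur = model.score(Xm, yb)
            phi[col] += cur - prev                      # marginal contribution of `col`
            prev = cur
    return phi / n_perm

print("feature ranking:", np.argsort(-shapley_importance(clf, Xte, yte)))
```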

ASSURER: A PPA-friendly Security Closure Framework for Physical Design

  • Guangxin Guo
  • Hailong You
  • Zhengguang Tang
  • Benzheng Li
  • Cong Li
  • Xiaojue Zhang

Hardware security is an emerging concern in very large scale integration (VLSI). Serious threats, such as hardware Trojan insertion, probing attacks, and fault injection, are hard to detect and almost impossible to fix at the post-design stage. The optimal solution is to prevent them at the physical design stage. Usually, defending against them causes substantial power, performance, and area (PPA) loss. In this paper, we propose ASSURER, a PPA-friendly physical-layout security closure framework. Reward-directed placement refinement and a multi-threshold partition algorithm are proposed to eliminate Trojan insertion threats. Cleaning up probing vulnerabilities is built on a patch-based ECO routing flow. Evaluated on the ISPD'22 benchmarks, ASSURER can clean out the Trojan threat with no leakage-power increase when shrinking the physical layout area; when not shrinking, ASSURER increases total power by only 14%. Compared with the first-place work of the ISPD 2022 Contest, ASSURER reduces the additional total power consumption by 53%, and probing vulnerability can be reduced by 97.6% under the premise of timing closure. We believe this work opens up a new perspective for preventing Trojan insertion and probing attacks.

Static Probability Analysis Guided RTL Hardware Trojan Test Generation

  • Haoyi Wang
  • Qiang Zhou
  • Yici Cai

Directed test generation is an effective method to detect potential hardware Trojans (HTs) at the RTL. While existing works are able to activate hard-to-cover Trojans by covering security targets, the effectiveness and efficiency of identifying the targets to cover are ignored. We propose a static probability analysis method for identifying hard-to-activate data-channel targets and generating the corresponding assertions for HT test generation. Our method can generate test vectors that trigger Trojans from Trusthub, DeTrust, and OpenCores within 1 minute and achieves a 104.33× time improvement on average compared with the existing method.

Hardware Trojan Detection and High-Precision Localization in NoC-Based MPSoC Using Machine Learning

  • Haoyu Wang
  • Basel Halak

Network-on-Chip (NoC)-based Multi-Processor Systems-on-Chip (MPSoCs) are increasingly employed in industrial and consumer electronics. Outsourcing third-party IPs (3PIPs) and tools for NoC-based MPSoCs is a prevalent development practice in most fabless companies. However, a Hardware Trojan (HT) injected during the design stage can maliciously tamper with the functionality of this communication scheme, which undermines the security of the system and may cause a failure. Detecting and localizing HTs with high precision is a challenge for current techniques. This work proposes, for the first time, a novel approach that allows detection and high-precision localization of HTs based on the use of packet information and machine learning algorithms. It is equipped with a novel Dynamic Confidence Interval (DCI) algorithm to detect malicious packets, and a novel Dynamic Security Credit Table (DSCT) algorithm to localize HTs. We evaluated the proposed framework on a mesh NoC running real workloads. An average detection precision of 96.3% and an average localization precision of 100% were obtained from the experimental results, and the minimum HT localization time is around 5.8–12.9 μs at 2 GHz, depending on the HT-infected nodes and workloads.
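As a much-simplified stand-in for the confidence-interval idea (the paper's exact DCI algorithm and the DSCT localization are not reproduced), the sketch below maintains a sliding-window interval over a per-node packet feature and flags values falling outside it; window size, threshold, and the feature itself are arbitrary assumptions.

```python
import collections
import numpy as np

class DynamicConfidenceInterval:
    """Sliding-window interval over a packet feature (e.g., inter-arrival time);
    values outside mean +/- k*std are flagged as suspicious. Simplified stand-in."""

    def __init__(self, window=256, k=3.0, warmup=32):
        self.buf = collections.deque(maxlen=window)
        self.k, self.warmup = k, warmup

    def observe(self, value):
        suspicious = False
        if len(self.buf) >= self.warmup:
            mean, std = np.mean(self.buf), np.std(self.buf) + 1e-9
            suspicious = abs(value - mean) > self.k * std
        self.buf.append(value)
        return suspicious

dci = DynamicConfidenceInterval()
normal = np.random.default_rng(0).normal(10.0, 1.0, 500)   # benign traffic feature
flags = [dci.observe(v) for v in normal] + [dci.observe(50.0)]
print("flags raised:", sum(flags))                          # the injected outlier is flagged
```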

SESSION: Technical Program: Advances in Physical Design and Timing Analysis

An Integrated Circuit Partitioning and TDM Assignment Optimization Framework for Multi-FPGA Systems

  • Dan Zheng
  • Evangeline F. Y. Young

In multi-FPGA systems, Time-Division Multiplexing (TDM) is a widely used method for transferring multiple signals over a common wire, and circuit performance is significantly influenced by the resulting inter-FPGA delay. Some inter-FPGA nets are driven by different clocks, in which case they cannot share the same wire. In this paper, to minimize the maximum delay of inter-FPGA nets, we propose a two-step framework. First, a TDM-aware partitioning algorithm is adopted to minimize the maximum cut size between FPGA pairs. A TDM ratio assignment method is then applied to optimally assign a TDM ratio to each inter-FPGA net. Experimental results show that our algorithm can reduce the maximum TDM ratio significantly within reasonable runtime.
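A toy version of the second step, assigning TDM ratios once nets are grouped by driving clock for one FPGA pair, is sketched below: wires are handed out greedily to the clock group with the worst current ratio. Constraints of the real problem (e.g., legal ratio values) are omitted, and all names and numbers are illustrative, not the paper's algorithm.

```python
import math
from collections import Counter

def assign_tdm_ratios(net_clocks, n_wires):
    """For one FPGA pair: nets driven by different clocks cannot share a wire, so
    wires are allotted per clock group to minimize the maximum TDM ratio (greedy)."""
    groups = Counter(net_clocks)                      # clock -> number of inter-FPGA nets
    assert n_wires >= len(groups), "need at least one wire per clock group"
    wires = {clk: 1 for clk in groups}                # start with one wire per group
    for _ in range(n_wires - len(groups)):            # hand out the remaining wires greedily
        worst = max(groups, key=lambda c: math.ceil(groups[c] / wires[c]))
        wires[worst] += 1
    return {clk: math.ceil(groups[clk] / wires[clk]) for clk in groups}

print(assign_tdm_ratios(["clkA"] * 90 + ["clkB"] * 10, n_wires=8))   # {'clkA': 13, 'clkB': 10}
```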

A Robust FPGA Router with Concurrent Intra-CLB Rerouting

  • Jiarui Wang
  • Jing Mai
  • Zhixiong Di
  • Yibo Lin

Routing is the most time-consuming step in the FPGA design flow, with increasingly complicated FPGA architectures and design scales. The growing complexity of connections between logic pins inside the CLBs of FPGAs challenges the efficiency and quality of FPGA routers. Existing negotiation-based rip-up-and-reroute schemes require a large number of iterations when generating paths inside CLBs. In this work, we propose a robust routing framework for FPGAs with complex connections between logic elements and switch boxes. We propose a concurrent intra-CLB rerouting algorithm that can effectively resolve routing congestion inside a CLB tile. Experimental results on modified ISPD 2016 benchmarks demonstrate that our framework achieves 100% routability with less wirelength and runtime, while the state-of-the-art VTR 8.0 routing algorithm fails on 4 of the 12 benchmarks.

Efficient Global Optimization for Large Scaled Ordered Escape Routing

  • Chuandong Chen
  • Dishi Lin
  • Rongshan Wei
  • Qinghai Liu
  • Ziran Zhu
  • Jianli Chen

The Ordered Escape Routing (OER) problem, which is NP-hard, is critical in PCB design. Existing methods based on integer linear programming (ILP) or heuristic algorithms work well on small-scale PCBs with fewer pins. However, when dealing with large-scale instances, the performance of ILP strategies suffers dramatically as the number of variables increases, due to time-consuming preprocessing. As for heuristic algorithms, rip-up and reroute is adopted to increase resource utilization, which frequently causes time violations. In this paper, we propose an efficient ILP-based routing engine for dense PCBs that simultaneously minimizes wiring length and runtime while considering the specific routing constraints. By weighting the length, we first model the OER problem as a special network-flow problem. Then we separate the non-crossing constraint from the typical ILP modeling to greatly reduce the number of integer variables. In addition, considering the congestion of routing resources, an ILP method is proposed to detect congestion. Finally, unlike traditional schemes that deal with negotiated congestion, our approach works by reducing the local area capacity and then allowing global automatic optimization of congestion. Compared with the state-of-the-art work, experimental results show that our algorithm can solve larger-scale cases with higher routing quality (shorter length) and reduce routing time by 76%.

An Adaptive Partition Strategy of Galerkin Boundary Element Method for Capacitance Extraction

  • Shengkun Wu
  • Biwei Xie
  • Xingquan Li

In advanced process nodes, electromagnetic coupling among interconnect wires plays an increasingly important role in signoff analysis. For VLSI chip design, the need for fast and accurate capacitance extraction is becoming more and more urgent, and the critical step in extracting capacitance among interconnect wires is solving the electric field. However, due to the high computational complexity, solving the electric field is extremely time-consuming. The Galerkin boundary element method (GBEM) was used for capacitance extraction in [2]. In this paper, we use mathematical theorems to analyze its error. Furthermore, with the error estimation of the Galerkin method, we design a boundary partition strategy that fits the electric field attenuation. It is worth mentioning that this boundary partition strategy can greatly reduce the number of boundary elements on the premise of ensuring that the error is small enough. As a consequence, the matrix order of the discretization equation also decreases. We also provide suggestions for the calculation of the matrix elements. Experimental analysis demonstrates that our partition strategy obtains sufficiently accurate results with a small number of boundary elements.

Graph-Learning-Driven Path-Based Timing Analysis Results Predictor from Graph-Based Timing Analysis

  • Yuyang Ye
  • Tinghuan Chen
  • Yifei Gao
  • Hao Yan
  • Bei Yu
  • Longxing Shi

With diminishing margins in advanced technology nodes, the performance of static timing analysis (STA), including its accuracy and runtime, is a serious concern. STA can generally be divided into graph-based analysis (GBA) and path-based analysis (PBA). For GBA, the timing results are always pessimistic, leading to overdesign during design optimization. For PBA, the timing pessimism is reduced by propagating real path-specific slews, at the cost of severe runtime overheads relative to GBA. In this work, we present a fast and accurate predictor of post-layout PBA timing results from inexpensive GBA, based on a deep edge-featured graph attention network, namely deep EdgeGAT. Compared with conventional machine- and graph-learning methods, deep EdgeGAT can learn global timing-path information. Experimental results demonstrate that our predictor can accurately predict PBA timing results and reduce the timing pessimism of GBA, with a maximum error of 6.81 ps, and it achieves an average 24.80× speedup over PBA using a commercial STA tool.

SESSION: Technical Program: Brain-Inspired Hyperdimensional Computing to the Rescue for Beyond von Neumann Era

Beyond von Neumann Era: Brain-Inspired Hyperdimensional Computing to the Rescue

  • Hussam Amrouch
  • Paul R. Genssler
  • Mohsen Imani
  • Mariam Issa
  • Xun Jiao
  • Wegdan Mohammad
  • Gloria Sepanta
  • Ruixuan Wang

Breakthroughs in deep learning (DL) continuously fuel innovations that profoundly improve our daily life. However, DNNs overwhelm conventional computing architectures with their massive data movements between processing and memory units. As a result, novel computer architectures are indispensable to improve or even replace the decades-old von Neumann architecture. Nevertheless, going far beyond the existing von Neumann principles comes with profound reliability challenges for the performed computations. This is because analog computing together with emerging beyond-CMOS technologies is inherently noisy and inevitably leads to unreliable computing. Hence, novel robust algorithms become key to going beyond the boundaries of the von Neumann era. Hyperdimensional Computing (HDC) is rapidly emerging as an attractive alternative to traditional DL and ML algorithms. Unlike conventional DL and ML algorithms, HDC is inherently robust against errors and admits a much more efficient hardware implementation. In addition to these advantages at the hardware level, HDC's promise to learn from little data and its underlying algebra enable new possibilities at the application level. In this work, the robustness of HDC algorithms against errors, as well as beyond-von Neumann architectures, is discussed. Further, the benefits of HDC as a machine learning algorithm are demonstrated with the examples of outlier detection and reinforcement learning.

SESSION: Technical Program: System Level Design Space Exploration

System-Level Exploration of In-Package Wireless Communication for Multi-Chiplet Platforms

  • Rafael Medina
  • Joshua Kein
  • Giovanni Ansaloni
  • Marina Zapater
  • Sergi Abadal
  • Eduard Alarcón
  • David Atienza

Multi-Chiplet architectures are being increasingly adopted to support the design of very large systems in a single package, facilitating the integration of heterogeneous components and improving manufacturing yield. However, chiplet-based solutions have to cope with limited inter-chiplet routing resources, which complicates the design of the data interconnect and the power delivery network. Emerging in-package wireless technology is a promising strategy to address these challenges, as it allows flexible chiplet interconnects to be implemented while freeing package resources for power supply connections. To assess the capabilities of such an approach and its impact from a full-system perspective, herein we present an exploration of the performance of in-package wireless communication based on dedicated extensions to the gem5-X simulator. We consider different Medium Access Control (MAC) protocols, as well as applications with different runtime profiles, showcasing that current in-package wireless solutions are competitive with wired chiplet interconnects. Our results show how in-package wireless solutions can outperform wired alternatives when running artificial intelligence workloads, achieving up to a 2.64× speed-up when running deep neural networks (DNNs) on a chiplet-based system with 16 cores distributed in four clusters.

Efficient System-Level Design Space Exploration for High-Level Synthesis Using Pareto-Optimal Subspace Pruning

  • Yuchao Liao
  • Tosiron Adegbija
  • Roman Lysecky

High-level synthesis (HLS) is a rapidly evolving and popular approach to designing, synthesizing, and optimizing embedded systems. Many HLS methodologies utilize design space exploration (DSE) at the post-synthesis stage to find Pareto-optimal hardware implementations for individual components. However, the design space for the system-level Pareto-optimal configurations is orders of magnitude larger than the component-level design space, making existing approaches insufficient for system-level DSE. This paper presents Pruned Genetic Design Space Exploration (PG-DSE), an approach to post-synthesis DSE that involves a pruning method to effectively reduce the system-level design space and an elitist genetic algorithm to accurately find the system-level Pareto-optimal configurations. We evaluate PG-DSE using an autonomous driving application subsystem (ADAS) and three synthetic systems with extremely large design spaces. Experimental results show that PG-DSE can reduce the design space by several orders of magnitude compared to prior work while achieving higher-quality results (an average improvement of 58.1×).
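The component-level pruning step can be illustrated with a simple non-dominated filter: configurations dominated in every objective never contribute to the system-level Pareto front, so they can be discarded before the genetic search. Below is a NumPy sketch with synthetic (latency, area, power) triples; all values are placeholders and the filter is a generic stand-in, not PG-DSE's exact pruning method.

```python
import numpy as np

def pareto_prune(points):
    """Keep only non-dominated configurations (minimization in every objective).
    points: (n_configs, n_objectives), e.g. columns = (latency, area, power)."""
    keep = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        if not keep[i]:
            continue
        dominates_i = np.all(points <= p, axis=1) & np.any(points < p, axis=1)
        if dominates_i.any():                 # another config is at least as good everywhere
            keep[i] = False
    return points[keep]

configs = np.random.default_rng(0).random((200, 3))   # synthetic per-component HLS results
pruned = pareto_prune(configs)
print(len(configs), "->", len(pruned), "configurations enter the system-level search")
```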

Automatic Generation of Complete Polynomial Interpolation Design Space for Hardware Architectures

  • Bryce Orloski
  • Samuel Coward
  • Theo Drane

Hardware implementations of elementary functions regularly deploy piecewise polynomial approximations. This work determines the complete design space of piecewise polynomial approximations meeting a given accuracy specification. Knowledge of this design space determines the minimum number of regions required to approximate the function accurately enough and facilitates the generation of optimized hardware that is competitive against the state of the art. Designers can explore the space of feasible architectures without needing to validate their choices. A heuristic-based decision procedure is proposed to generate optimal ASIC hardware designs. Targeting alternative hardware technologies simply requires a modified decision procedure to explore the space. We highlight the difficulty of choosing an optimal number of regions to approximate the function with, as this is dependent on the input width.
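A brute-force sketch of the core question, how many regions a piecewise polynomial of a given degree needs to meet an accuracy spec, is shown below using equal-width regions and least-squares fits. The paper's complete design-space characterization and hardware-oriented handling go well beyond this; the function, degree, and error spec here are arbitrary examples.

```python
import numpy as np

def max_error(f, lo, hi, n_regions, degree, samples=512):
    """Worst absolute error of per-region least-squares polynomial fits."""
    edges = np.linspace(lo, hi, n_regions + 1)
    worst = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        x = np.linspace(a, b, samples)
        y = f(x)
        coeffs = np.polyfit(x, y, degree)
        worst = max(worst, np.max(np.abs(np.polyval(coeffs, x) - y)))
    return worst

def min_regions(f, lo, hi, degree, spec, max_regions=1024):
    """Smallest number of equal-width regions meeting the accuracy spec."""
    for n in range(1, max_regions + 1):
        if max_error(f, lo, hi, n, degree) <= spec:
            return n
    return None

print(min_regions(np.exp, 0.0, 1.0, degree=2, spec=2 ** -12))   # degree-2 fit of exp on [0, 1]
```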

SESSION: Technical Program: Security Assurance and Acceleration

SHarPen: SoC Security Verification by Hardware Penetration Test

  • Hasan Al-Shaikh
  • Arash Vafaei
  • Mridha Md Mashahedur Rahman
  • Kimia Zamiri Azar
  • Fahim Rahman
  • Farimah Farahmandi
  • Mark Tehranipoor

As modern SoC architectures incorporate many complex/heterogeneous intellectual properties (IPs), the protection of security assets has become imperative, and the number of vulnerabilities revealed is rising due to the increased number of attacks. Over the last few years, penetration testing (PT) has become an increasingly effective means of detecting software (SW) vulnerabilities. As of yet, no such technique has been applied to the detection of hardware vulnerabilities. This paper proposes a PT framework, SHarPen, for detecting hardware vulnerabilities, which facilitates the development of a SoC-level security verification framework. SHarPen proposes a formalism for performing gray-box hardware (HW) penetration testing instead of relying on coverage-based testing and provides automation for mapping hardware vulnerabilities to logical/mathematical cost functions. SHarPen supports both simulation and FPGA-based prototyping, allowing us to automate security testing at different stages of the design process with high capabilities for identifying vulnerabilities in the targeted SoC.

SecHLS: Enabling Security Awareness in High-Level Synthesis

  • Shang Shi
  • Nitin Pundir
  • Hadi M Kamali
  • Mark Tehranipoor
  • Farimah Farahmandi

In its quest for further optimization, high-level synthesis (HLS) utilizes advanced automatic optimization algorithms to achieve lower implementation time/effort for ever more complex designs. These optimization algorithms target the HLS tools' backend stages, e.g., allocation, scheduling, and binding, and are highly optimized for resource/latency constraints. However, current HLS tools' backends are unaware of designs' security assets, and their algorithms are incapable of handling security constraints. In this paper, we propose Secure-HLS (SecHLS), which aims to define underlying security constraints for HLS tools' backend stages and intermediate representations. In SecHLS, we improve a set of widely used scheduling and binding algorithms by integrating the proposed security-related constraints into them. We evaluate the effectiveness of SecHLS in terms of power, performance, area (PPA), security, and complexity (execution time) on small and real-size benchmarks, showing how the proposed security constraints can be integrated into HLS while maintaining low PPA/complexity burdens.

A Flexible ASIC-Oriented Design for a Full NTRU Accelerator

  • Francesco Antognazza
  • Alessandro Barenghi
  • Gerardo Pelosi
  • Ruggero Susella

Post-quantum cryptosystems are the subject of a significant research effort, witnessed by various international standardization competitions. Among them, the NTRU Key Encapsulation Mechanism has been recognized as a secure, patent-free, and efficient public-key encryption scheme. In this work, we perform a design space exploration on an FPGA target, with the final goal of an efficient ASIC realization. Specifically, we focus on the possible choices for the design of polynomial multipliers with different memory bus widths, trading off lower clock-cycle counts against larger interconnections. Our design outperforms the best FPGA synthesis results in the state of the art, and we report the results of ASIC syntheses minimizing latency and area with a 40 nm industrial-grade technology library. Our speed-oriented design computes an encapsulation in 4.1 to 10.2 μs and a decapsulation in 7.1 to 11.7 μs, depending on the NTRU security level, while our most compact design takes only 20% more area than the underlying SHA-3 hash module.

SESSION: Technical Program: Hardware and Software Co-Design of Emerging Machine Learning Algorithms

Robust Hyperdimensional Computing against Cyber Attacks and Hardware Errors: A Survey

  • Dongning Ma
  • Sizhe Zhang
  • Xun Jiao

Hyperdimensional Computing (HDC), also known as Vector Symbolic Architecture (VSA), is an emerging AI algorithm inspired by the way the human brain functions. Compared with deep neural networks (DNNs), HDC possesses several advantages, such as smaller model size, lower computation cost, and one/few-shot learning, making it a promising alternative computing paradigm. With the increasing deployment of AI in safety-critical systems such as healthcare and robotics, it is not only important to strive for high accuracy, but also to ensure robustness under even highly uncertain and adversarial environments. However, recent studies show that HDC, just like DNNs, is vulnerable to both cyber attacks (e.g., adversarial attacks) and hardware errors (e.g., memory failures). While a growing body of research has been studying the robustness of HDC, there is a lack of a systematic review of research efforts on this increasingly important topic. To the best of our knowledge, this paper presents the first survey dedicated to reviewing the research efforts on the robustness of HDC against cyber attacks and hardware errors. While the performance and accuracy of HDC as an AI method still await further theoretical advancement, this survey aims to shed light on and call for community efforts toward robustness research on HDC.

In-Memory Computing Accelerators for Emerging Learning Paradigms

  • Dayane Reis
  • Ann Franchesca Laguna
  • Michael Niemier
  • Xiaobo Sharon Hu

Over the past decades, emerging, data-driven machine learning (ML) paradigms have grown in popularity and revolutionized many application domains. To date, a substantial effort has been devoted to devising mechanisms for facilitating the deployment and near-ubiquitous use of these memory-intensive ML models. This review paper presents the use of in-memory computing (IMC) accelerators for emerging ML paradigms from a bottom-up perspective, spanning the choice of devices and the design of circuits/architectures to application-level results.

Toward Fair and Efficient Hyperdimensional Computing

  • Yi Sheng
  • Junhuan Yang
  • Weiwen Jiang
  • Lei Yang

Machine Learning (ML) is being applied to an ever wider range of applications, such as intelligent security systems and medical diagnosis. With this trend, there is high demand to run ML on end devices with limited resources. Moreover, fairness in these ML algorithms is of mounting importance, since such applications are not designed for specific users (e.g., people with fair skin in skin disease diagnosis) but need to serve all possible users (i.e., people with different skin tones). Brain-inspired hyperdimensional computing (HDC) has demonstrated its ability to run ML tasks on edge devices with a small memory footprint; yet, it is unknown whether HDC can satisfy the fairness requirements of such applications (e.g., medical diagnosis for people with different skin tones). In this paper, for the first time, we reveal that vanilla HDC exhibits severe bias due to its sensitivity to color information. Toward a fair and efficient HDC, we propose a holistic framework, namely FE-HDC, which integrates image processing and input compression techniques in HDC’s encoder. Compared with vanilla HDC, results show that the proposed FE-HDC reduces the unfairness score by 90%, achieving fairer architectures with competitively high accuracy.

SESSION: Technical Program: Full-Stack Co-Design for on-Chip Learning in AI Systems

Improving the Robustness and Efficiency of PIM-Based Architecture by SW/HW Co-Design

  • Xiaoxuan Yang
  • Shiyu Li
  • Qilin Zheng
  • Yiran Chen

Processing-in-memory (PIM) based architecture shows great potential to process several emerging artificial intelligence workloads, including vision and language models. Cross-layer optimizations could bridge the gap between computing density and the available resources by reducing the computation and memory cost of the model and improving the model’s robustness against non-ideal hardware effects. We first introduce several hardware-aware training methods to improve the model robustness to the PIM device’s non-ideal effects, including stuck-at-fault, process variation, and thermal noise. Then, we further demonstrate a software/hardware (SW/HW) co-design methodology to efficiently process the state-of-the-art attention-based model on PIM-based architecture by performing sparsity exploration for the attention-based model and circuit-architecture co-design to support the sparse processing.

Hardware-Software Co-Design for On-Chip Learning in AI Systems

  • M. L. Varshika
  • Abhishek Kumar Mishra
  • Nagarajan Kandasamy
  • Anup Das

Spike-based convolutional neural networks (CNNs) are empowered with on-chip learning in their convolution layers, enabling a layer to learn to detect features by combining those extracted in the previous layer. We propose ECHELON, a generalized design template for a tile-based neuromorphic hardware with on-chip learning capabilities. Each tile in ECHELON consists of a neural processing unit (NPU) to implement the convolution and dense layers of a CNN model, an on-chip learning unit (OLU) to facilitate spike-timing dependent plasticity (STDP) in the convolution layer, and a special function unit (SFU) to implement other CNN functions such as pooling, concatenation, and residual computation. These tile resources are interconnected using a shared bus, which is segmented and configured via software to facilitate parallel communication inside the tile. Tiles are themselves interconnected using a classical Network-on-Chip (NoC) interconnect. We propose system software to map CNN models to ECHELON, maximizing performance. We integrate the hardware design and software optimization within a co-design loop to obtain the hardware and software architectures for a target CNN, satisfying both performance and resource constraints. In this preliminary work, we show the implementation of a tile on an FPGA and some early evaluations. Using 8 STDP-enabled CNN models, we show the potential of our co-design methodology to optimize hardware resources.

Towards On-Chip Learning for Low Latency Reasoning with End-to-End Synthesis

  • Vito Giovanni Castellana
  • Nicolas Bohm Agostini
  • Ankur Limaye
  • Vinay Amatya
  • Marco Minutoli
  • Joseph Manzano
  • Antonino Tumeo
  • Serena Curzel
  • Michele Fiorito
  • Fabrizio Ferrandi

The Software Defined Architectures (SODA) Synthesizer is an open-source compiler-based tool able to automatically generate domain-specialized systems targeting Application-Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) starting from high-level programming. SODA is composed of a frontend, SODA-OPT, which leverages the multilevel intermediate representation (MLIR) framework to interface with productive programming tools (e.g., machine learning frameworks), identify kernels suitable for acceleration, and perform high-level optimizations, and of a state-of-the-art high-level synthesis backend, Bambu from the PandA framework, to generate custom accelerators. One specific application of the SODA Synthesizer is the generation of accelerators to enable ultra-low latency inference and control on autonomous systems for scientific discovery (e.g., electron microscopes, sensors in particle accelerators, etc.). This paper provides an overview of the flow in the context of the generation of accelerators for edge processing to be integrated in transmission electron microscopy (TEM) devices, focusing on use cases from precision material synthesis. We show the tool in action with an example of design space exploration for inference on reconfigurable devices with a conventional deep neural network model (LeNet). Finally, we discuss the research directions and opportunities enabled by SODA in the area of autonomous control for scientific experimental workflows.

SESSION: Technical Program: Energy-Efficient Computing for Emerging Applications

Knowledge Distillation in Quantum Neural Network Using Approximate Synthesis

  • Mahabubul Alam
  • Satwik Kundu
  • Swaroop Ghosh

Recent assertions of a potential advantage of Quantum Neural Networks (QNNs) for specific Machine Learning (ML) tasks have sparked the curiosity of a sizable number of application researchers. The parameterized quantum circuit (PQC), a major building block of a QNN, consists of several layers of single-qubit rotations and multi-qubit entanglement operations. The optimum number of PQC layers for a particular ML task is generally unknown. A larger network often provides better performance in noiseless simulations. However, it may perform poorly on hardware compared to a shallower network. Because the amount of noise varies among quantum devices, the optimal depth of the PQC can vary significantly. Additionally, the gates chosen for the PQC may be suitable for one type of hardware but not for another due to compilation overhead. This makes it difficult to generalize a QNN design to a wide range of hardware and noise levels. An alternative approach is to build and train a separate QNN model for each target hardware, which can be expensive. To circumvent these issues, we introduce the concept of knowledge distillation in QNNs using approximate synthesis. The proposed approach creates a new QNN network with (i) a reduced number of layers or (ii) a different gate set without having to train it from scratch. Training the new network for a few epochs can compensate for the loss caused by the approximation error. Through empirical analysis, we demonstrate an ≈71.4% reduction in circuit layers while still achieving ≈16.2% better accuracy under noise.
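
To make the layer-reduction idea concrete, here is a minimal numpy sketch of a parameterized circuit built from single-qubit RY rotations followed by a CNOT entangler, and of how closely a shallower circuit approximates a deeper one (measured by state fidelity). The layer structure, parameter values, and fidelity comparison are illustrative assumptions, not the paper's actual distillation or approximate-synthesis procedure.

```python
import numpy as np

def ry(theta):
    """Single-qubit RY(theta) rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

def pqc_layer(theta0, theta1):
    # one PQC layer: rotations on both qubits, then an entangling CNOT
    return CNOT @ np.kron(ry(theta0), ry(theta1))

def run_pqc(params):
    # params: list of (theta0, theta1) pairs, one per layer, applied to |00>
    state = np.zeros(4)
    state[0] = 1.0
    for theta0, theta1 in params:
        state = pqc_layer(theta0, theta1) @ state
    return state

deep = run_pqc([(0.3, 0.7), (0.5, 0.1), (0.2, 0.9)])   # 3-layer PQC
shallow = run_pqc([(0.8, 1.2)])                         # candidate 1-layer replacement
fidelity = abs(np.vdot(deep, shallow)) ** 2             # how well the shallow circuit matches the deep one
print(fidelity)
```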

NTGAT: A Graph Attention Network Accelerator with Runtime Node Tailoring

  • Wentao Hou
  • Kai Zhong
  • Shulin Zeng
  • Guohao Dai
  • Huazhong Yang
  • Yu Wang

Graph Attention Networks (GAT) have demonstrated better performance on many graph tasks than previous Graph Neural Networks (GNN). However, GAT involves graph attention operations with extra computational complexity. While a large body of existing literature has studied GNN acceleration, few works have focused on the attention mechanism in GAT. The graph attention mechanism changes the computation flow, so previous GNN accelerators cannot support GAT well. Moreover, GAT distinguishes the importance of neighbors, which makes it possible to reduce the workload through runtime tailoring. We present NTGAT, a software-hardware co-design approach to accelerate GAT with runtime node tailoring. Our work comprises both a runtime node tailoring algorithm and an accelerator design. We propose a pipeline sorting method and a hardware unit to support node tailoring during inference. The experiments show that our algorithm can reduce up to 86% of the aggregation workload while incurring only a slight accuracy loss (<0.4%), and the FPGA-based accelerator can achieve up to 3.8× speedup and 4.98× energy efficiency compared to the GPU baseline.
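
The intuition behind runtime tailoring is that attention scores already rank neighbors by importance, so low-scoring neighbors can be dropped before aggregation. The sketch below keeps only the top-scoring fraction of each node's neighbors; the keep ratio, re-normalization, and data layout are illustrative assumptions, not NTGAT's actual algorithm or sorting hardware.

```python
import numpy as np

def gat_aggregate_with_tailoring(h, neighbors, att_scores, keep_ratio=0.5):
    """Aggregate neighbor features, keeping only the most-attended neighbors.

    h          : (N, F) node feature matrix
    neighbors  : dict {node: list of neighbor ids}
    att_scores : dict {node: np.array of attention scores, same order as neighbors}
    keep_ratio : fraction of neighbors kept per node (runtime tailoring)
    """
    out = np.zeros_like(h)
    for v, nbrs in neighbors.items():
        scores = att_scores[v]
        k = max(1, int(len(nbrs) * keep_ratio))
        top = np.argsort(scores)[-k:]                  # indices of the k most important neighbors
        kept = [nbrs[i] for i in top]
        w = np.exp(scores[top])
        w /= w.sum()                                   # re-normalize attention over the kept neighbors
        out[v] = sum(wi * h[u] for wi, u in zip(w, kept))
    return out

# toy graph: node 0 has three neighbors; only the two most attended are aggregated
h = np.random.rand(4, 8)
neighbors = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
att = {0: np.array([0.9, 0.1, 0.5]), 1: np.array([0.2]), 2: np.array([0.3]), 3: np.array([0.4])}
print(gat_aggregate_with_tailoring(h, neighbors, att, keep_ratio=0.7).shape)
```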

A Low-Bitwidth Integer-STBP Algorithm for Efficient Training and Inference of Spiking Neural Networks

  • Pai-Yu Tan
  • Cheng-Wen Wu

Spiking neural networks (SNNs) that enable energy-efficient neuromorphic hardware are receiving growing attention. Training SNNs directly with back-propagation has demonstrated accuracy comparable to deep neural networks (DNNs). However, previous direct-training algorithms require high-precision floating-point operations, which are not suitable for low-power end-point devices. The high-precision operations also require the learning algorithm to run on high-performance accelerator hardware. In this paper, we propose an improved approach that converts the high-precision floating-point operations of an existing direct-training algorithm, the Spatio-Temporal Back-Propagation (STBP) algorithm, into low-bitwidth integer operations. The proposed low-bitwidth Integer-STBP algorithm requires only integer arithmetic for SNN training and inference, which greatly reduces the computational complexity. Experimental results show that the proposed STBP algorithm achieves comparable accuracy and higher energy efficiency than the original floating-point STBP algorithm. Moreover, it can be implemented on low-power end-point devices, which mostly provide only fixed-point hardware, to offer learning capability during inference.
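
The general flavor of replacing floating-point arithmetic with low-bitwidth integer arithmetic can be sketched as follows. This is a generic uniform-quantization example: the bit width, scaling rule, and binary spike representation are assumptions for illustration, not the specific Integer-STBP scheme.

```python
import numpy as np

def quantize(x, bits=8):
    """Map a float tensor to low-bitwidth integers plus a single float scale."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(x))
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32), scale

# fp32 weights and binary spikes; the dot product becomes pure integer arithmetic
w = np.random.randn(64, 128).astype(np.float32)
spikes = (np.random.rand(128) > 0.8).astype(np.int32)   # 0/1 spike vector

w_q, w_scale = quantize(w, bits=8)
int_psp = w_q @ spikes                   # integer-only membrane-potential update
approx = int_psp * w_scale               # rescale only when a float result is needed
exact = w @ spikes
print(np.max(np.abs(approx - exact)))    # small quantization error
```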

TiC-SAT: Tightly-Coupled Systolic Accelerator for Transformers

  • Alireza Amirshahi
  • Joshua Alexander Harrison Klein
  • Giovanni Ansaloni
  • David Atienza

Transformer models have achieved impressive results in various AI scenarios, ranging from vision to natural language processing. However, their computational complexity and their vast number of parameters hinder their implementation on resource-constrained platforms. Furthermore, while loosely-coupled hardware accelerators have been proposed in the literature, data transfer costs limit their speed-up potential. We address this challenge along two axes. First, we introduce tightly-coupled, small-scale systolic arrays (TiC-SATs), governed by dedicated ISA extensions, as functional units to speed up execution. Then, thanks to the tightly-coupled architecture, we employ software optimizations to maximize data reuse, thus lowering miss rates across cache hierarchies. Full-system simulations across various BERT and Vision-Transformer models are employed to validate our strategy, resulting in substantial application-wide speed-ups (e.g., up to 89.5X for BERT-large). TiC-SAT is available as an open-source framework.
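
For intuition on what a small systolic array computes, the sketch below simulates an output-stationary systolic matrix multiplication in plain Python, with operand skewing expressed through the cycle index. The array size, dataflow, and scheduling here are illustrative and are not taken from the TiC-SAT microarchitecture.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    Each PE (i, j) accumulates A[i, k] * B[k, j]; operands are skewed so that
    row i of A and column j of B reach PE (i, j) with the proper delays.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))
    # total cycles: enough for the most-delayed operand pair to meet
    for t in range(k + n + m - 2):
        for i in range(n):
            for j in range(m):
                step = t - i - j          # which reduction element reaches PE (i, j) at cycle t
                if 0 <= step < k:
                    acc[i, j] += A[i, step] * B[step, j]
    return acc

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
print(np.allclose(systolic_matmul(A, B), A @ B))   # True
```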

SESSION: Technical Program: Side-Channel Attacks and RISC-V Security

PMU-Leaker: Performance Monitor Unit-Based Realization of Cache Side-Channel Attacks

  • Pengfei Qiu
  • Qiang Gao
  • Dongsheng Wang
  • Yongqiang Lyu
  • Chunlu Wang
  • Chang Liu
  • Rihui Sun
  • Gang Qu

The Performance Monitor Unit (PMU) is a special hardware module in processors that contains a set of counters to record various architectural and micro-architectural events. In this paper, we propose PMU-Leaker, a novel realization of all existing cache side-channel attacks in which accurate execution time measurements are replaced by information leaked through the PMU. The efficacy of PMU-Leaker is demonstrated by (1) leaking the secret data stored in Intel Software Guard Extensions (SGX) with transient execution vulnerabilities including Spectre and ZombieLoad and (2) extracting the encryption key of a victim AES implementation running in SGX. We perform thorough experiments on a DELL Inspiron 15-7560 laptop with an Intel® Core i5-7200U processor based on the Kaby Lake architecture; the results show that, among the 176 PMU counters, 24 are vulnerable and can be used to launch the PMU-Leaker attack.

EO-Shield: A Multi-Function Protection Scheme against Side Channel and Focused Ion Beam Attacks

  • Ya Gao
  • Qizhi Zhang
  • Haocheng Ma
  • Jiaji He
  • Yiqiang Zhao

Smart devices, especially Internet-connected devices, typically incorporate security protocols and cryptographic algorithms to ensure control flow integrity and information security. However, various invasive and non-invasive attacks try to tamper with these devices. Chip-level active shields have been proven to be an effective countermeasure against invasive attacks, but existing active shields cannot be used to counter side-channel attacks (SCAs). In this paper, we propose a multi-function protection scheme and an active shield prototype to protect against invasive and non-invasive attacks simultaneously. The protection scheme has a complex active shield implemented in the top metal layer of the chip and an information leakage obfuscation module underneath. The leakage obfuscation module generates its protection patterns based on the operating conditions of the circuit that needs to be protected, thus reducing the correlation between electromagnetic (EM) emanations and cryptographic data. We implement the protection scheme on an Advanced Encryption Standard (AES) circuit to demonstrate the effectiveness of the method. Experimental results demonstrate that the information leakage obfuscation module decreases the SNR to below 0.6 and reduces the success rate of SCAs. Compared to existing single-function protection methods against physical attacks, the proposed scheme provides good performance against both invasive and non-invasive attacks.

CompaSeC: A Compiler-Assisted Security Countermeasure to Address Instruction Skip Fault Attacks on RISC-V

  • Johannes Geier
  • Lukas Auer
  • Daniel Mueller-Gritschneder
  • Uzair Sharif
  • Ulf Schlichtmann

Fault-injection attacks are a risk for any computing system executing security-relevant tasks, such as a secure boot process. While hardware-based countermeasures to these invasive attacks have been found to be a suitable option, they have to be implemented via hardware extensions and are thus not available in most commonly used off-the-shelf (COTS) components. Software Implemented Hardware Fault Tolerance (SIHFT) is therefore the only valid option to enhance a COTS system’s resilience against fault attacks. Established SIHFT techniques usually target the detection of random hardware errors for functional safety rather than targeted attacks. Using the example of a secure boot system running on a RISC-V processor, we first show that when the software is hardened with these existing techniques from the safety domain, the vulnerabilities of the boot process to single, double, triple, and quadruple instruction skips cannot be fully closed. We extend these techniques to the security domain and propose the Compiler-assisted Security Countermeasure (CompaSeC). We demonstrate that CompaSeC can close all vulnerabilities for the studied secure boot system. To further reduce performance and memory overheads, we additionally propose a method for CompaSeC to selectively harden individual vulnerable functions without compromising security against the considered instruction skip faults.

Trojan-D2: Post-Layout Design and Detection of Stealthy Hardware Trojans – A RISC-V Case Study

  • Sajjad Parvin
  • Mehran Goli
  • Frank Sill Torres
  • Rolf Drechsler

With the exponential increase in the popularity of the RISC-V ecosystem, the security of this platform must be re-evaluated, especially for mission-critical and IoT devices. Moreover, the insertion of a Hardware Trojan (HT) into a chip after the in-house mask design has been outsourced to an overseas manufacturer for fabrication is a significant source of concern. Although abundant HT detection methods based on side-channel analysis, physical measurements, and functional testing have been investigated to overcome this problem, there exist stealthy HTs that can evade detection, owing to their small overhead compared to the whole circuit.

In this work, we propose several novel HTs that can be placed into a RISC-V core’s post-layout in an untrusted manufacturing environment. Next, we propose a non-invasive analytical method based on contactless optical probing to detect any stealthy HTs. Finally, we propose an open-source library of HTs that can be placed into a processor in the post-layout phase. All designs in this work use a commercial 28nm technology.

SESSION: Technical Program: Simulation and Verification of Quantum Circuits

Graph Partitioning Approach for Fast Quantum Circuit Simulation

  • Jaekyung Im
  • Seokhyeong Kang

Owing to the exponential increase in computational complexity, fast simulation of large quantum circuits has become very difficult. This is an important challenge for the utilization of quantum computers because it is closely related to the verification of quantum computations by classical machines. Hybrid Schrödinger-Feynman simulation seems to be a promising solution, but its applicability is very limited. To overcome this drawback, we propose an improved simulation method based on graph partitioning. Experimental results show that our approach significantly reduces the simulation time of Hybrid Schrödinger-Feynman simulation.

A Robust Approach to Detecting Non-Equivalent Quantum Circuits Using Specially Designed Stimuli

  • Hsiao-Lun Liu
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

As several compilation and optimization techniques have been proposed, equivalence checking for quantum circuits has become essential in design flows. The state-of-the-art approach to this problem observed that even small errors substantially affect the entire quantum system and therefore exploited random simulations to prove the non-equivalence of two quantum circuits. However, when errors occur close to the outputs, that approach struggles to prove the non-equivalence of some non-equivalent quantum circuits within a limited number of simulations. In this work, we propose a novel simulation-based approach using a set of specially designed stimuli. The number of simulation runs of the proposed approach is linear rather than exponential in the number of quantum bits of a circuit. According to the experimental results, the success rate of our approach is 100% (100%) under a simulation-run (execution-time) constraint for a set of benchmarks, while that of the state-of-the-art is only 69% (74%) on average. Our approach also achieves an average speedup of 26×.

Equivalence Checking of Parameterized Quantum Circuits: Verifying the Compilation of Variational Quantum Algorithms

  • Tom Peham
  • Lukas Burgholzer
  • Robert Wille

Variational quantum algorithms have been introduced as a promising class of quantum-classical hybrid algorithms that can already be used with the noisy quantum computing hardware available today by employing parameterized quantum circuits. Considering the non-trivial nature of quantum circuit compilation and the subtleties of quantum computing, it is essential to verify that these parameterized circuits have been compiled correctly. Established equivalence checking procedures that handle parameter-free circuits already exist. However, no methodology capable of handling circuits with parameters has been proposed yet. This work fills this gap by showing that verifying the equivalence of parameterized circuits can be achieved in a purely symbolic fashion using an equivalence checking approach based on the ZX-calculus. At the same time, proofs of inequality can be efficiently obtained with conventional methods by taking advantage of the degrees of freedom inherent to parameterized circuits. We implemented the corresponding methods and proved that the resulting methodology is complete. Experimental evaluations (using the entire parametric ansatz circuit library provided by Qiskit as benchmarks) demonstrate the efficacy of the proposed approach.

Software Tools for Decoding Quantum Low-Density Parity-Check Codes

  • Lucas Berent
  • Lukas Burgholzer
  • Robert Wille

Quantum Error Correction (QEC) is an essential field of research toward the realization of large-scale quantum computers. On the theoretical side, a lot of effort is put into designing error-correcting codes that protect quantum data from errors, which inevitably happen due to the noisy nature of quantum hardware and quantum bits (qubits). Protecting data with an error-correcting code necessitates means to recover the original data, given a potentially corrupted data set, a task referred to as decoding. It is vital that decoding algorithms can recover error-free states in an efficient manner. While theoretical properties of certain QEC methods have been extensively studied, good techniques for analyzing their performance in practically more relevant settings remain a widely unexplored area. In this work, we propose a set of software tools that facilitate numerical experiments with so-called Quantum Low-Density Parity-Check codes (QLDPC codes), a broad class of codes, some of which have recently been shown to be asymptotically good. Based on that, we provide an implementation of a general decoder for QLDPC codes. On top of that, we propose a highly efficient heuristic decoder that eliminates the runtime bottlenecks of the general QLDPC decoder while maintaining comparable decoding performance. These tools make it possible to confirm theoretical results on QLDPC codes in a more practical setting and showcase the value of software tools (in addition to theoretical considerations) for investigating codes for practical applications. The resulting tool, which is publicly available at https://github.com/cda-tum/qecc as part of the Munich Quantum Toolkit (MQT), is meant to provide a playground for the search for “practically good” quantum codes.
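
As a toy illustration of the decoding task itself, the sketch below performs brute-force minimum-weight syndrome decoding on a tiny classical parity-check matrix. Real (Q)LDPC decoders, including the ones in the proposed tools, rely on belief propagation and heuristic post-processing rather than enumeration; the check matrix and weight bound here are arbitrary illustrative choices.

```python
import itertools
import numpy as np

def min_weight_decode(H, syndrome, max_weight=3):
    """Return a lowest-weight error e with H @ e = syndrome (mod 2), or None.

    This exhaustive search is only feasible for tiny codes and serves to define
    the decoding problem, not to solve it efficiently.
    """
    m, n = H.shape
    for w in range(max_weight + 1):
        for support in itertools.combinations(range(n), w):
            e = np.zeros(n, dtype=int)
            e[list(support)] = 1
            if np.array_equal(H @ e % 2, syndrome % 2):
                return e
    return None

# [7,4] Hamming code parity-check matrix as a toy stand-in for a QLDPC check matrix
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
error = np.zeros(7, dtype=int)
error[4] = 1
print(min_weight_decode(H, H @ error % 2))   # recovers the single-bit error
```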

SESSION: Technical Program: Learning x Security in DFM

Enabling Scalable AI Computational Lithography with Physics-Inspired Models

  • Haoyu Yang
  • Haoxing Ren

Computational lithography is a critical research area for the continued scaling of semiconductor manufacturing process technology, enhancing silicon printability via numerical computing methods. Today’s solutions to its two core problems, lithography modeling and mask optimization, are primarily CPU-based and require many thousands of CPUs running for days to tape out a modern chip. We seek AI/GPU-assisted solutions for both problems, aiming to improve both runtime and quality. Prior academic research has proposed using machine learning for lithography modeling and mask optimization, typically cast as image-to-image mapping problems where convolution-layer-backboned UNets and ResNets are applied. However, because little domain knowledge is integrated into these framework designs, such solutions have been limited in their application scenarios or performance. Our method tackles the limitations of previous CNN-based solutions by introducing lithography bias into the neural network design, yielding a much more efficient model and significant performance improvements.

Data-Driven Approaches for Process Simulation and Optical Proximity Correction

  • Hao-Chiang Shao
  • Chia-Wen Lin
  • Shao-Yun Fang

With the continuous shrinking of process nodes, semiconductor manufacturing encounters more and more serious inconsistency between designed layout patterns and the resulting wafer images. Conventionally, examining how a layout pattern deviates from its original shape after complicated process steps, such as optical lithography and subsequent etching, relies on computationally expensive process simulation, which suffers from extremely long runtimes for large-scale circuit layouts, especially in advanced nodes. In addition, as one of the most important and commonly adopted resolution enhancement techniques, optical proximity correction (OPC) corrects image errors due to process effects by moving segment edges or adding extra polygons to mask patterns, while it is generally driven by simulation or time-consuming inverse lithography techniques (ILTs) to achieve acceptable accuracy. As a result, more and more state-of-the-art works on process simulation and/or OPC resort to the fast inference of machine/deep learning. This paper reviews these data-driven approaches to highlight the challenges in various aspects, explore preliminary solutions, and reveal possible future directions to push forward the frontiers of research in design for manufacturability.

Mixed-Type Wafer Failure Pattern Recognition

  • Hao Geng
  • Qi Sun
  • Tinghuan Chen
  • Qi Xu
  • Tsung-Yi Ho
  • Bei Yu

The ongoing evolution in process fabrication enables us to step below the 5nm technology node. Although foundries can pattern and etch smaller but more complex circuits on silicon wafers, a multitude of challenges persists. For example, defects on the surface of wafers are inevitable during manufacturing. To increase the yield rate and reduce time-to-market, it is vital to recognize these failures and identify the failure mechanisms of these defects. Recently, applying machine learning-powered methods to single-type defect pattern classification has made significant progress. However, as processes become increasingly complicated, various single-type defect patterns may emerge and be coupled on a wafer, forming a mixed-type pattern. In this paper, we survey recent progress on advanced methodologies for wafer failure pattern recognition, especially for mixed-type patterns. We sincerely hope this literature review can highlight future directions and promote the advancement of wafer failure pattern recognition.

SESSION: Technical Program: Lightweight Models for Edge AI

Accelerating Convolutional Neural Networks in Frequency Domain via Kernel-Sharing Approach

  • Bosheng Liu
  • Hongyi Liang
  • Jigang Wu
  • Xiaoming Chen
  • Peng Liu
  • Yinhe Han

Convolutional neural networks (CNNs) are typically computationally heavy. Fast algorithms such as fast Fourier transforms (FFTs) are promising in significantly reducing computation complexity by replacing convolutions with frequency-domain element-wise multiplication. However, the increased memory access overhead of the complex weights counteracts the computing benefit, because frequency-domain convolutions not only pad weights to the same size as the input maps but also have no sharable complex kernel weights. In this work, we propose an FFT-based kernel-sharing technique called FS-Conv to reduce memory access. Based on FS-Conv, we derive sharable complex weights in frequency-domain convolutions, a problem that had not been solved before. FS-Conv includes a hybrid padding approach, which utilizes the inherent periodic characteristic of the FFT to provide sharable complex weights for different blocks of complex input maps. In addition, we build a frequency-domain inference accelerator (called Yixin) that can utilize the sharable complex weights for CNN acceleration. Evaluation results demonstrate significant performance and energy efficiency benefits compared with the state-of-the-art baseline.
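
The following numpy sketch shows the basic identity that frequency-domain convolution exploits: circular convolution becomes element-wise multiplication after an FFT. It also makes the memory issue visible, since the small spatial kernel has to be zero-padded to the full map size before the transform. The map and kernel sizes are arbitrary, and this is not FS-Conv's hybrid padding or kernel-sharing scheme.

```python
import numpy as np

x = np.random.rand(32, 32)            # input feature map
k = np.random.rand(3, 3)              # spatial kernel

k_pad = np.zeros_like(x)              # kernel padded to the full input size
k_pad[:3, :3] = k

X = np.fft.fft2(x)
K = np.fft.fft2(k_pad)                # complex weights, now as large as the input map
y_freq = np.real(np.fft.ifft2(X * K)) # element-wise multiply instead of sliding-window MACs

# reference: direct circular convolution of x with the 3x3 kernel
y_ref = np.zeros_like(x)
for di in range(3):
    for dj in range(3):
        y_ref += k[di, dj] * np.roll(np.roll(x, di, axis=0), dj, axis=1)

print(np.allclose(y_freq, y_ref))     # True
```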

Mortar: Morphing the Bit Level Sparsity for General Purpose Deep Learning Acceleration

  • Yunhung Gao
  • Hongyan Li
  • Kevin Zhang
  • Xueru Yu
  • Hang Lu

Vanilla Deep Neural Networks (DNN) after training are represented with native floating-point 32 (fp32) weights. We observe that the bit-level sparsity of these weights is very abundant in the mantissa and can be directly exploited to speed up model inference. In this paper, we propose Mortar, an off-line/on-line collaborative approach for fp32 DNN acceleration, which includes two parts: first, an off-line bit sparsification algorithm that constructs the target formulation by “mantissa morphing”, which maintains higher model accuracy while increasing bit-level sparsity; second, the associated hardware accelerator architecture that speeds up on-line fp32 inference by exploiting the enlarged bit sparsity. We highlight the following results across various deep learning tasks, including image classification, object detection, video understanding, and video & image super-resolution: we (1) increase bit-level sparsity by 1.28~2.51x with only a negligible -0.09~0.23% accuracy loss, (2) maintain on average 3.55% higher model accuracy while increasing bit-level sparsity more than the baseline, and (3) our hardware accelerator outperforms the baseline by up to 4.8x, with an area of 0.031 mm² and power of 68.58 mW.
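
To see what "bit-level sparsity in the mantissa" means, the sketch below inspects the 23 mantissa bits of fp32 values and zeroes the low-order bits, which is the general flavor of trading a tiny amount of precision for more zero bits. The truncation rule and bit budget are illustrative assumptions, not Mortar's actual mantissa-morphing algorithm.

```python
import struct
import numpy as np

def mantissa_bits(value):
    """Return the 23 mantissa bits of an fp32 value as a string of 0/1."""
    raw = struct.unpack('>I', struct.pack('>f', float(value)))[0]
    return format(raw & 0x7FFFFF, '023b')

weights = np.random.randn(1000).astype(np.float32)
zero_fraction = np.mean([mantissa_bits(w).count('0') / 23 for w in weights])
print(f"average fraction of zero mantissa bits: {zero_fraction:.2f}")

def truncate_mantissa(value, keep_bits=8):
    """Keep only the top keep_bits mantissa bits so hardware can skip zeroed positions."""
    raw = struct.unpack('>I', struct.pack('>f', float(value)))[0]
    mask = 0xFFFFFFFF ^ ((1 << (23 - keep_bits)) - 1)   # clear the low mantissa bits
    return struct.unpack('>f', struct.pack('>I', raw & mask))[0]

print(truncate_mantissa(0.123456789, keep_bits=8))      # close to the original value
```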

Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices

  • Yimeng Zhang
  • Akshay Karkal Kamath
  • Qiucheng Wu
  • Zhiwen Fan
  • Wuyang Chen
  • Zhangyang Wang
  • Shiyu Chang
  • Sijia Liu
  • Cong Hao

In this paper, we propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on High-Definition (HD) video stream. First, to enable ultra-light video intelligence, we propose temporal frame-filtering and spatial saliency-focusing approaches to reduce the complexity of massive video data. Second, we exploit structure-aware weight sparsity to design a hardware-friendly model compression method. Third, assisted with data and model complexity reduction, we propose a sparsity-aware, scalable, and low-power accelerator design, aiming to deliver real-time performance with high energy efficiency. Different from existing works, we make a solid step towards the synergized software/hardware co-optimization for realistic MOT model implementation. Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5× latency reduction, 20.9× effective frame rate improvement, 5.83× lower power, and 9.78× better energy efficiency, without much accuracy drop.

Latent Weight-Based Pruning for Small Binary Neural Networks

  • Tianen Chen
  • Noah Anderson
  • Younghyun Kim

Binary neural networks (BNNs) substitute complex arithmetic operations with simple bit-wise operations. The binarized weights and activations in BNNs can drastically reduce memory requirements and energy consumption, making them attractive for edge ML applications with limited resources. However, the severe memory capacity and energy constraints of low-power edge devices call for further reduction of BNN models beyond binarization. Weight pruning is a proven solution for reducing the size of many neural network (NN) models, but the binary nature of BNN weights makes it difficult to identify insignificant weights to remove.

In this paper, we present a pruning method based on latent weights with layer-level pruning sensitivity analysis, which reduces the over-parameterization of BNNs, allowing for accuracy gains while drastically reducing the model size. Our method advocates a heuristic that distinguishes weights by their latent weights, the real-valued vectors used to compute the pseudo-gradient during backpropagation. It is tested using three different convolutional NNs on the MNIST, CIFAR-10, and Imagenette datasets, with results indicating a 33%–46% reduction in operation count with no accuracy loss, improving upon previous works in accuracy, model size, and total operation count.
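
A minimal sketch of the core heuristic, assuming that weights whose latent (real-valued) counterparts have small magnitude are the ones to remove before binarization, is shown below. The per-layer quantile threshold is an illustrative choice; the paper additionally uses layer-level sensitivity analysis to set pruning ratios, which is not modeled here.

```python
import numpy as np

def prune_and_binarize(latent_w, prune_ratio=0.4):
    """Prune BNN weights by latent-weight magnitude, then binarize the survivors.

    latent_w    : real-valued latent weights kept by the BNN optimizer
    prune_ratio : fraction of weights to remove (set to exactly zero)
    Returns the {-1, 0, +1} weight tensor used at inference.
    """
    threshold = np.quantile(np.abs(latent_w).ravel(), prune_ratio)   # layer-wise magnitude threshold
    mask = (np.abs(latent_w) > threshold).astype(np.int8)
    return np.sign(latent_w).astype(np.int8) * mask

latent = np.random.randn(128, 128).astype(np.float32)
w = prune_and_binarize(latent, prune_ratio=0.4)
print((w == 0).mean())   # roughly 40% of the weights removed
```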

SESSION: Technical Program: Design Automation for Emerging Devices

AutoFlex: Unified Evaluation and Design Framework for Flexible Hybrid Electronics

  • Tianliang Ma
  • Zhihui Deng
  • Leilai Shao

Flexible hybrid electronics (FHE), integrating high-performance silicon chips with multi-functional sensors and actuators on flexible substrates, can be intimately attached onto irregular surfaces without compromising their functionalities, thus enabling more innovations in healthcare, the Internet of Things (IoT), and various human-machine interfaces (HMIs). Recent developments in compact models and process design kits (PDKs) for flexible electronics have made designs of small to medium flexible circuits feasible. However, the absence of a unified model and comprehensive evaluation benchmarks for flexible electronics makes it infeasible for a designer to fairly compare different flexible technologies and to explore potential design options for a heterogeneous FHE design. In this paper, we present AutoFlex, a unified evaluation and design framework for flexible hybrid electronics, in which device parameters can be extracted automatically and performance can be evaluated comprehensively from the device level and digital blocks up to large-scale digital circuits. Moreover, a ubiquitous FHE sensor acquisition system, including a flexible multi-functional sensor array, scan drivers, amplifiers, and a silicon-based analog-to-digital converter (ADC), is developed to reveal the design challenges of a representative FHE system.

CNFET7: An Open Source Cell Library for 7-nm CNFET Technology

  • Chenlin Shi
  • Shinobu Miwa
  • Tongxin Yang
  • Ryota Shioya
  • Hayato Yamaki
  • Hiroki Honda

In this paper, we propose CNFET7, the first open-source cell library for 7-nm carbon nanotube field-effect transistor (CNFET) technology. CNFET7 is based on an open-source CNFET SPICE model called VS-CNFET, and various model parameters such as the channel width and carbon nanotube diameter are carefully tuned to mimic the predictive 7-nm CNFET technology presented in a published paper. Some undisclosed parameters, such as the cell size and pin layout, are derived from those of the NanGate 15-nm open-source cell library, in the same way as an existing open-source framework for CNFET circuit design. CNFET7 includes two types of delay models (i.e., the composite current source and the nonlinear delay model), each having 56 cells, such as INV_X1 and BUF_X1. CNFET7 supports both logic synthesis and timing-driven place and route in the Cadence design flow. Our experimental results for several synthesized circuits show that CNFET7 achieves reductions of up to 96%, 62%, and 82% in dynamic power consumption, static power consumption, and critical-path delay, respectively, when compared with ASAP7.

A Global Optimization Algorithm for Buffer and Splitter Insertion in Adiabatic Quantum-Flux-Parametron Circuits

  • Rongliang Fu
  • Mengmeng Wang
  • Yirong Kan
  • Nobuyuki Yoshikawa
  • Tsung-Yi Ho
  • Olivia Chen

As a highly energy-efficient application of low-temperature superconductivity, the adiabatic quantum-flux-parametron (AQFP) logic circuit exhibits extremely low power consumption, making it an attractive candidate for extremely energy-efficient computing systems. Since logic gates are driven by an alternating current (AC) that serves as the clock signal in AQFP circuits, a large number of AQFP buffers is required to ensure that the dataflow is synchronized at all logic levels of the circuit. Meanwhile, since currently developed AQFP logic gates can only drive a single output, splitters are required for logic gates to drive multiple fanouts. These cells take up a significant amount of the circuit’s area and delay. This paper proposes a global optimization algorithm for buffer and splitter (B/S) insertion to address these issues. B/S insertion is first identified as a combinatorial optimization problem, and a dynamic programming formulation is presented to find the globally optimal solution. Because its search space is impractically large, an integer linear programming formulation is then proposed to approximate the global optimum of B/S insertion. Experimental results on the ISCAS’85 and simple arithmetic benchmark circuits show the effectiveness of the proposed method, with average reductions of 8.22% and 7.37% in the number of buffers and splitters inserted compared to the state-of-the-art methods from ICCAD’21 and DAC’22, respectively.
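
To illustrate why B/S insertion dominates AQFP cost, the toy sketch below counts the buffers and splitters a naive, non-optimized insertion would need: one buffer per level a signal must wait through, and a binary splitter tree with f leaves needing f - 1 splitters. It deliberately ignores the logic levels occupied by the splitters themselves, which the paper's dynamic programming and ILP formulations do account for; the level assignment and netlist are made up for illustration.

```python
def naive_bs_count(edges, level):
    """edges: list of (driver, sink) pairs; level: dict gate -> logic level."""
    buffers, splitters = 0, 0
    fanout = {}
    for u, v in edges:
        fanout[u] = fanout.get(u, 0) + 1
        # one buffer for every level the signal waits between driver and sink
        buffers += max(0, level[v] - level[u] - 1)
    for f in fanout.values():
        # a binary splitter tree with f leaves needs f - 1 splitters
        splitters += max(0, f - 1)
    return buffers, splitters

level = {'a': 0, 'b': 0, 'g1': 1, 'g2': 3}
edges = [('a', 'g1'), ('a', 'g2'), ('b', 'g1'), ('g1', 'g2')]
print(naive_bs_count(edges, level))   # (3, 1): path balancing needs 3 buffers, 'a' needs a splitter
```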

FLOW-3D: Flow-Based Computing on 3D Nanoscale Crossbars with Minimal Semiperimeter

  • Sven Thijssen
  • Sumit Kumar Jha
  • Rickard Ewetz

The emergence of data-intensive applications has spurred interest in in-memory computing using nanoscale crossbars. Flow-based in-memory computing is a promising approach for evaluating Boolean logic using the natural flow of electrical currents. While automated synthesis approaches have been developed for 2D crossbars, 3D crossbars have advantageous properties in terms of density, area, and performance. In this paper, we propose the first framework for performing flow-based computing using 3D crossbars. The framework, FLOW-3D, automatically synthesizes a Boolean function into a crossbar design. FLOW-3D is based on an analogy between BDDs and crossbars, resulting in the synthesis of 3D crossbar designs with minimal semiperimeter. A BDD with n nodes is mapped to a 3D crossbar with (n + k) metal wires, where the k extra metal wires are needed to handle hardware-imposed constraints. Compared with the state-of-the-art synthesis tool for 2D crossbars, FLOW-3D improves semiperimeter, area, energy consumption, and latency by up to 61%, 84%, 37%, and 41%, respectively, on 15 RevLib benchmarks.

SLIP’22 TOC

SLIP ’22: Proceedings of the 24th ACM/IEEE Workshop on System Level Interconnect Pathfinding

 Full Citation in the ACM Digital Library

SESSION: Breaking the Interconnect Limits

Session details: Breaking the Interconnect Limits

  • Ismail Bustany

Multi-Die Heterogeneous FPGAs: How Balanced Should Netlist Partitioning be?

  • Raveena Raikar
  • Dirk Stroobandt

High-capacity multi-die FPGA systems generally consist of multiple dies connected by external interposer lines. These external connections are limited in number. Further, these connections also contribute to a higher delay as compared to the internal network on a monolithic FPGA and should therefore be sparsely used. These architectural changes compel the placement & routing tools to minimize the number of signals at the die boundary. Incorporating a netlist partitioning step in the CAD flow can help to minimize the overall number of signals using the cross-die connections.

Conventional partitioning techniques focus on minimizing the cut edges at the cost of generating unequal-sized partitions. Such highly unbalanced partitions can affect the overall placement & routing quality by causing congestion on the denser die. Moreover, this can also negatively impact the overall runtime of the placement & routing tools as well as the FPGA resource utilization.

In previous studies, a low unbalance value was proposed to generate equal-sized partitions. In this work, we investigate the factors that influence netlist partitioning quality for a multi-die FPGA system. A die-level partitioning step, performed using hMETIS, is incorporated into the flow before the packing step. Large heterogeneous circuits from the Koios benchmark suite are used to analyze the partitioning-packing results. Consequently, we examine how the output unbalance and the number of cut edges vary with the input unbalance value. We propose an empirically optimal value of the unbalance factor for achieving the desired partitioning quality on the Koios benchmark suite.

Limiting Interconnect Heating in Power-Driven Physical Synthesis

  • Xiuyan Zhang
  • Shantanu Dutt

Current VLSI technology trends include sub-10 nm nodes and 3D ICs. Unfortunately, due to significantly increased Joule heating in these technologies, interconnect reliability has become a significant casualty. In this paper, we explore how interconnect power dissipation (CV²/2 per logic transition) and thus heating can be effectively constrained during a power-optimizing physical synthesis (PS) flow that applies three different PS transformations: cell sizing, Vth assignment, and cell replication; the latter is particularly useful for limiting interconnect heating. Other constraints considered are timing, slew, and cell fanout load. To address this multi-constraint power-optimization problem effectively, we apply the aforementioned three transforms simultaneously (as opposed to sequentially in some order) as well as simultaneously across all cells of the circuit, using a novel discrete optimization technique called discretized network flow (DNF). We applied our algorithm to the ISPD-13 benchmark circuits: the ISPD-13 competition targeted power optimization with cell-sizing and Vth-assignment transforms under timing, slew, and cell fanout load constraints; to these we added the interconnect heating constraint and the cell replication transform, a much harder transform to engineer in a simultaneous-consideration framework than the other two. Results show the significant efficacy of our techniques.
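
For orientation on the magnitude of the quantities being constrained, the switching energy per transition and the average switching power of a net follow the usual relations below; the numeric values are illustrative assumptions, not figures from the paper.

```latex
E_{\mathrm{transition}} = \frac{C V^{2}}{2}, \qquad
P_{\mathrm{avg}} = \alpha \, f \cdot \frac{C V^{2}}{2}
```

For example, with a wire capacitance C = 2 fF, supply V = 0.7 V, activity factor α = 0.2, and clock frequency f = 1 GHz, each transition dissipates about 0.49 fJ and the net's average switching power is roughly 98 nW.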

SESSION: 2.5D/3D Extension for High-Performance Computing

Session details: 2.5D/3D Extension for High-Performance Computing

  • Pascal Vivet

Opportunities of Chip Power Integrity and Performance Improvement through Wafer Backside (BS) Connection: Invited Paper

  • Rongmei Chen
  • Giuliano Sisto
  • Odysseas Zografos
  • Dragomir Milojevic
  • Pieter Weckx
  • Geert Van der Plas
  • Eric Beyne

Technology node scaling is driven by the need to increase system performance, but it also leads to a significant power integrity bottleneck due to the associated back-end-of-line (BEOL) scaling. Power integrity degradation induced by on-chip Power Delivery Network (PDN) IR drop results from the increased power density, the increased number of metal layers in the BEOL, and their resistivity. Meanwhile, signal routing limits SoC performance improvements due to increased routing congestion and delays. To overcome these issues, we introduce a disruptive technology: wafer backside (BS) connection, to realize chip BS PDN (BSPDN) and BS signal routing. We first provide some key wafer process features developed at imec to enable this technology. Further, we show its benefits by demonstrating large improvements in chip power integrity and performance after applying BSPDN and BS routing to a design with sub-2nm technology node design rules. Challenges and the outlook for BS technology are also discussed before concluding the paper.

SESSION: Compute-in-Memory and Design of Structured Compute Arrays

Session details: Compute-in-Memory and Design of Structured Compute Arrays

  • Shantanu Dutt

An Automated Design Methodology for Computational SRAM Dedicated to Highly Data-Centric Applications: Invited Paper

  • A. Philippe
  • L. Ciampolini
  • M. Gerbaud
  • M. Ramirez-Corrales
  • V. Egloff
  • B. Giraud
  • J.-P. Noel

To meet the performance requirements of highly data-centric applications (e.g. edge-AI or lattice-based cryptography), Computational SRAM (C-SRAM), a new type of computational memory, was designed as a key element of an emerging computing paradigm called near-memory computing. For this particular type of applications, C-SRAM has been specialized to perform low-latency vector operations in order to limit energy-intensive data transfers with the processor or dedicated processing units. This paper presents a design methodology that aims at making the C-SRAM design flow as simple as possible by automating the configuration of the memory part (e.g. number of SRAM cuts and access ports) according to system constraints (e.g. instruction frequency or memory capacity) and off-the-shelf SRAM compilers. In order to fairly quantify the benefits of the proposed memory selector, it has been evaluated with three different CMOS process technologies from two different foundries. The results show that this memory selection methodology makes it possible to determine the best memory configuration whatever the CMOS process technology and the trade-off between area and power consumption. Furthermore, we also show how this methodology could be used to efficiently assess the level of design optimization of available SRAM compilers in a targeted CMOS process technology.

A Machine Learning Approach for Accelerating SimPL-Based Global Placement for FPGA’s

  • Tianyi Yu
  • Nima Karimpour Darav
  • Ismail Bustany
  • Mehrdad Eslami Dehkordi

Many commercial FPGA placement tools are based on the SimPL framework, where the Lower Bound (LB) phase optimizes wire length and timing without considering cell overlaps and the Upper Bound (UB) phase spreads out cells while considering the target FPGA architecture. In the SimPL framework, the number of iterations depends on design complexity and the quality of the UB placement, which highly impacts runtime. In this work, we propose a machine learning (ML) scheme in which the anchor weights of cells are dynamically adjusted to make the process converge within a pre-determined budget for the number of iterations. In our approach, for a given FPGA architecture, an ML model constructs a trajectory guide function that is used to adjust anchor weights during SimPL’s iterations. Our experimental results on industrial benchmarks show that we achieve, on average, 28.01% and 4.7% reductions in the runtime of global placement and of the whole placer, respectively, while maintaining the quality of solutions within an acceptable range.

SESSION: Interconnect Performance Estimation Techniques

Session details: Interconnect Performance Estimation Techniques

  • Rasit Topaloglu

Neural Network Model for Detour Net Prediction

  • Jaehoon Ahn
  • Taewhan Kim

Identifying nets in a placement that are likely to be detoured during routing is very useful because (1) in conjunction with routing congestion, path timing, or design rule violation (DRV) prediction, predicting detour nets can serve as a complementary means of characterizing the outcome of those predictions in more depth, and (2) we can place more importance on the predicted detour nets when optimizing timing and routing resources in the early placement stage, since those nets consume more timing budget as well as metal/via resources. In this context, this work proposes a neural network based detour net prediction model. Our proposed model consists of two parts: a CNN-based part and an ANN-based part. The CNN-based part processes features describing various physical proximity maps or states, while the ANN-based part processes the features of individual nets in the form of vector descriptions concatenated to the CNN outputs. Through experiments, we analyze and assess the accuracy of our prediction model in terms of F1 score and its complementary role in timing prediction and optimization. More specifically, our proposed model improves the prediction accuracy by 9.9% on average compared with the conventional (vanilla ANN based) detour net prediction model. Furthermore, linking our prediction model to a state-of-the-art timing optimization flow of a commercial tool reduces the worst negative slack by 18.4%, the total negative slack by 40.8%, and the number of timing violation paths by 30.9% on average.
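
A minimal PyTorch sketch of such a two-part model, assuming per-net crops of four physical-proximity maps and a 16-dimensional per-net feature vector, is shown below; the layer counts, channel widths, and input sizes are guesses for illustration only, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class DetourNetPredictor(nn.Module):
    """Illustrative CNN + ANN hybrid in the spirit described above."""
    def __init__(self, n_maps=4, n_net_features=16):
        super().__init__()
        self.cnn = nn.Sequential(                       # processes per-net crops of proximity maps
            nn.Conv2d(n_maps, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.ann = nn.Sequential(                       # consumes CNN output concatenated with the net's feature vector
            nn.Linear(32 + n_net_features, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, maps, net_vec):
        x = torch.cat([self.cnn(maps), net_vec], dim=1)
        return torch.sigmoid(self.ann(x))               # probability that the net will be detoured

model = DetourNetPredictor()
maps = torch.rand(8, 4, 32, 32)     # 8 nets, 4 proximity maps of 32x32 each
net_vec = torch.rand(8, 16)         # 8 per-net feature vectors
print(model(maps, net_vec).shape)   # torch.Size([8, 1])
```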

Machine-Learning Based Delay Prediction for FPGA Technology Mapping

  • Hailiang Hu
  • Jiang Hu
  • Fan Zhang
  • Bing Tian
  • Ismail Bustany

Accurate delay prediction is important in the early stages of logic and high-level synthesis. In technology mapping for field programmable gate arrays (FPGAs), a gate-level circuit is transcribed into a lookup table (LUT)-level circuit. Quick timing analysis of the pre-mapped circuit is necessary to guide downstream optimizations. However, before technology mapping, a static timing analyzer is too slow due to its complexity and, like other faster empirical heuristics, highly inaccurate. In this work, we present a machine learning based framework for accurately and efficiently estimating the delay of a gate-level circuit by predicting the depth of the corresponding LUT logic after technology mapping. Our experimental results show that the proposed method achieves a 56x accuracy improvement compared to the existing delay estimation heuristic. Instead of running the mapper to obtain the ground truth, our delay estimator saves 87.5% of the runtime with negligible error.

Who’s Wang Ying

February 2023

Wang Ying

Associate Professor

Institute of Computing Technology, Chinese Academy of Sciences.

Email:

wangying2009@ict.ac.cn

Personal webpage

https://wangying-ict.github.io/

Research interests

Domain-Specific chips, processor architecture and design automation

Short bio

Ying Wang is an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences. Wang’s research expertise is focused on VLSI testing, reliability, and the design automation of domain-specific processors such as accelerators for computer vision, deep learning, graph computing, and robotics. His group has conducted pioneering work on open-source frameworks for automated neural network accelerator generation and customization. He has published more than 30 papers at DAC and over 120 papers at other IEEE/ACM conferences and in journals. He holds over 30 patents related to chip design. Wang is also a co-founder of Jeejio Tech in Beijing, which was granted the Special Start-up Award of the year 2018 by the Chinese Academy of Sciences. Wang’s honors also include the Young Talent Development Program award from the China Association for Science and Technology (as one of the two awardees in computer science in 2016), the 2017 CCF-Intel outstanding researcher award, and the 2019 Early Career Award from the China Computer Federation. He is the recipient of the DAC Under-40 Innovators Award in 2021. Dr. Wang has also received several awards from international conferences, including winning the System Design Contest at DAC 2018 and IEEE Rebooting LPIRC 2016, the Best Paper Award at ITC-Asia 2018, GLSVLSI 2021 (2nd place), and ICCD 2019, the best paper of IEEE Transactions on Computers in 2011, as well as a best paper nomination at ASPDAC.

Research highlights

Dr. Wang’s innovative research in the DeepBurning project has significantly contributed to one of the viable approaches toward automatic specialized accelerator generation and is considered one of the representative works in this area: starting from the software framework to directly generate a specialized circuit design implemented on FPGAs or ASICs. After the initial DeepBurning 1.0 project, he has continued to pioneer several ongoing works, including ELNA (DAC 2017), Dadu (DAC 2018), 3D-Cube (DAC 2019), DeepBurning-GL (ICCAD 2020), and DeepBurning-Seg (MICRO 2022), which follow the same technical route of automatic hardware accelerator generation but extend it to different applications and architectures. The DeepBurning series not only expands horizontally into different areas but also vertically into the higher levels of the processor design stack, including early-stage design parameter exploration, ISA extension, and compiler-hardware co-design. In general, his holistic work in this field has attracted considerable attention from Chinese EDA companies. Based on the agile chip customization technology initiated by Dr. Wang, his company, Jeejio, is able to develop highly customized chip solutions at a relatively low cost and help its customers stay competitive in niche IoT markets. Dr. Wang’s team has proposed the RISC-V compatible Sage architecture, which can be used to customize AIoT SoC solutions with user-redeemable computing power for audio/video/image processing and automotive scenarios.

Who’s Heba Abunahla

January 2023

Heba Abunahla

Assistant Professor

Quantum and Computer Engineering department, TU Delft, Netherlands.

Email:

Heba.nadhmi@gmail.com

Research interests

• Emerging RRAM devices
• Smart sensors
• Hardware security
• Graphene-based electronics
• CNTs-based electronics
• Neuromorphic computing

Short bio

Heba Abunahla is currently an Assistant Professor at the Quantum and Computer Engineering department, Delft University of Technology. Abunahla received the BSc, MSc, and PhD degrees (with honors) from United Arab Emirates University, the University of Sharjah, and Khalifa University, respectively, via competitive scholarships. Prior to joining TU Delft as an Assistant Professor, Abunahla spent over five years as a Postdoctoral Fellow and Research Scientist, working extensively on the design, fabrication, and characterization of emerging memory devices with great emphasis on computing, sensing, and security applications.

Abunahla owns two patents, has published one book, and has co-authored over 30 conference and journal papers. In 2017, Abunahla had a collaborative project with the University of Adelaide, Australia, on developing a novel non-enzymatic glucose sensor. In recognition of her achievements, she received Australian Global Talent Permanent Residency in 2021. Moreover, Abunahla’s innovation of deploying emerging RRAM devices in neuromorphic computing was selected by Nature Scientific Reports as among the Top 100 in Materials Science. Also, her recent achievement in fabricating RRAM-based tunable filters was selected for publication in the first issue of Innovation@UAE Magazine, launched by the Ministry of Education.

Abunahla has received several awards and competitive scholarships. For example, she is the recipient of the Unique Fellowship for Top Female Academic Scientists in Electrical Engineering, Mathematics & Computer Science (2022) from Delft University of Technology. Abunahla serves as a Lead Editor for Frontiers in Neuroscience. She is an active reviewer for several high-impact journals and conferences.

Research highlights

Secure intelligent memory and sensors are crucial components of our daily electronic devices and systems. CMOS has been the core technology providing these capabilities for decades. However, its limitations in power and area have led to the need for an alternative technology, Resistive RAM (RRAM). RRAM devices can perform memory and computation in the same cell, which enables in-memory computation. Moreover, RRAM can be deployed as a smart sensor thanks to its ability to change its I-V characteristic in response to the surrounding environment. The inherent stochasticity of RRAM junctions is also a great asset for security applications.

Abunahla has built strong expertise in the field of microelectronics design, modeling, fabrication, and characterization of high-performance and high-density memory devices. Abunahla has developed novel RRAM devices that have been uniquely deployed in sensing, computing, security, and communication applications. For instance, Abunahla demonstrated a novel approach to measuring glucose levels for an adult human and demonstrated the ability to fabricate such a biosensor using a simple, low-cost standard photolithography process. In contrast to other sensors, the developed sensor can accurately measure glucose levels at neutral pH conditions (i.e., pH=7). Abunahla filed a US patent for this device, and all details of the innovation are published in the prestigious Nature Scientific Reports. Being unique and cutting-edge in nature, this work has great commercialization potential, and Abunahla is currently working with her team toward providing a lab-on-chip sensing approach based on this technology.

Furthermore, Abunahla has recently invented flexible memory devices, namely NeuroMem, that can mimic the memorization behavior of the brain. This unique feature makes NeuroMem a potential candidate for emerging in-memory-computing applications. This work is the first to report on the great potential of this technology for Artificial Intelligence (AI) inference on edge devices. Abunahla filed a US patent for this innovation and published the work in the prestigious Nature Scientific Reports. Further, her innovative research on using nanoscale devices for Gamma-ray sensing with sol-gel/drop-coated micro-thick nanomaterials is very unique and has been filed as a US patent and published by the prestigious Journal of Materials Chemistry and Physics. Moreover, Abunahla has fabricated novel RRAM-based tunable filters, which prove the possibility of tuning RF devices without any localized surface-mount device (SMD) element or complex realization technique. In the field of hardware security, Abunahla developed an efficient flexible RRAM-based true random number generator, named SecureMem. The data generated by the SecureMem prototype passed all NIST tests without any post-processing or hardware overhead.

Who’s Aman Arora

December, 2022

Aman Arora

Graduate Fellow

The University of Texas at Austin

Email:

aman.kbm@utexas.edu

Personal webpage

https://amanarora.site

Research interests

Reconfigurable computing, Domain-specific acceleration, Hardware for Machine Learning

Short bio

Aman Arora is a Graduate Fellow and Ph.D. Candidate in the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research vision is to minimize the gap between ASICs and FPGAs in terms of performance and efficiency, and to minimize the gap between CPUs/GPUs and FPGAs in terms of programmability. He imagines a future where FPGAs are first-class citizens in the world of computing and the first choice for accelerating new workloads. His Ph.D. dissertation research focuses on the search for efficient reconfigurable fabrics for Deep Learning (DL) by proposing new DL-optimized blocks for FPGAs. His research has resulted in 11 papers published in top conferences and journals in the fields of reconfigurable computing and computer architecture and design. His work received a Best Paper Award at the IEEE FCCM conference in 2022, and he currently holds a fellowship from the UT Austin Graduate School. His research has been funded by the NSF. Aman has served as a secondary reviewer for top conferences such as ACM FPGA (in 2021 and 2022). He is also the leader of the AI+FPGA committee at the Open-Source FPGA (OSFPGA) Foundation, where he leads research efforts and organizes training webinars. He has 12 years of experience in the semiconductor industry in design, verification, testing, and architecture roles. Most recently, he worked in the GPU Deep Learning architecture team at NVIDIA.

Research highlights

Aman’s past and current research focuses on architecting efficient reconfigurable acceleration substrates (or fabrics) for Deep Learning (DL). With Moore’s law slowing down, the requirements of resource-hungry applications like DL growing and changing rapidly, and climate change already knocking at our doors, this research theme has never been more relevant and important.

Aman has proposed changing the architecture of FPGAs to make them better DL accelerators. He proposed replacing a portion of the FPGA’s programmable logic area with new blocks called Tensor Slices, which are specialized for matrix operations like matrix-matrix multiplication and matrix-vector multiplication that are common in DL workloads. The FPGA industry has, in parallel, developed similar blocks such as the Intel AI Tensor Block and the Achronix Machine Learning Processor.

In addition, Aman proposed adding compute capabilities to the on-chip memory blocks on FPGAs, so they can operate on data without having to move it to compute units on the FPGA. He was the first to exploit the dual-port nature of FPGA BRAMs to design these blocks, instead of using technologies that significantly impact the circuitry of the RAM array and degrade its performance. He calls these new blocks CoMeFa RAMs. This work won the Best Paper Award at IEEE FCCM 2022.

Aman also led a team effort spanning three universities – UT Austin, University of Toronto, and University of New Brunswick – to develop an open-source DL benchmark suite called Koios. These benchmarks can be used to perform FPGA architecture and CAD research, and are integrated into VTR, which is the most popular open-source FPGA CAD flow.

Other research projects Aman has worked on or is working on include: (1) developing a parallel reconfigurable spatial acceleration fabric consisting of PIM (Processing-In-Memory) blocks connected using an FPGA-like interconnect, (2) implementing efficient accelerators for Weightless Neural Networks (WNNs) on FPGAs, (3) enabling support for open-source tools in FPGA research tools like COFFE, and (4) using Machine Learning (ML) to perform cross-prediction of power consumption on FPGAs and developing an open-source dataset that can be widely used for such prediction problems.

Aman hopes to start and direct a research lab at a university soon. His future research will straddle the entire stack of computer engineering: programmability at the top, architecture exploration in the middle, and hardware design at the bottom. The research thrusts he plans to focus on are next-gen reconfigurable fabrics, ML and FPGA related tooling, enabling the creation of an FPGA app store, and sustainable acceleration.

ICCAD’22 TOC

ICCAD ’22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

 Full Citation in the ACM Digital Library

SESSION: The Role of Graph Neural Networks in Electronic Design Automation

Session details: The Role of Graph Neural Networks in Electronic Design Automation

  • Jeyavijayan Rajendran

Why are Graph Neural Networks Effective for EDA Problems?: (Invited Paper)

  • Haoxing Ren
  • Siddhartha Nath
  • Yanqing Zhang
  • Hao Chen
  • Mingjie Liu

In this paper, we discuss the source of effectiveness of Graph Neural Networks (GNNs) in EDA, particularly in the VLSI design automation domain. We argue that the effectiveness comes from the fact that GNNs implicitly embed the prior knowledge and inductive biases associated with given VLSI tasks, which is one of the three approaches to making a learning algorithm physics-informed. These inductive biases are different from those commonly used in GNNs designed for other structured data, such as social networks and citation networks. We illustrate this principle with several recent GNN examples in the VLSI domain, including predictive tasks such as switching activity prediction, timing prediction, parasitics prediction, and layout symmetry prediction, as well as optimization tasks such as gate sizing and macro and cell transistor placement. We also discuss the challenges of applying GNNs and the opportunity of applying self-supervised learning techniques with GNNs for VLSI optimization.
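
The common thread across these applications is the message-passing step that lets each cell or net aggregate information from its neighbors. The following is a minimal, illustrative sketch of one such layer on a toy netlist-style graph; the graph, features, and weights are made up and not taken from the paper.

```python
# A single mean-aggregation message-passing layer over a tiny netlist-style
# graph (nodes = cells, edges = connections). Purely illustrative.
import numpy as np

def message_passing_layer(adj, features, weight):
    """Aggregate neighbor features by mean, then apply a linear transform + ReLU."""
    degree = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    aggregated = adj @ features / degree
    return np.maximum(0.0, (features + aggregated) @ weight)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0],          # 4-cell toy graph
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
x = rng.normal(size=(4, 8))            # hypothetical per-cell features
w = rng.normal(size=(8, 8))            # trainable layer weights
h = message_passing_layer(adj, x, w)   # task-specific inductive bias comes from
print(h.shape)                         # how the graph and features are constructed
```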

On Advancing Physical Design Using Graph Neural Networks

  • Yi-Chen Lu
  • Sung Kyu Lim

As modern Physical Design (PD) algorithms and methodologies evolve into the post-Moore era with the aid of machine learning, Graph Neural Networks (GNNs) are becoming increasingly ubiquitous given that netlists are essentially graphs. Recently, their ability to perform effective graph learning has provided significant insights into the underlying dynamics during netlist-to-layout transformations. GNNs follow a message-passing scheme, where the goal is to construct meaningful representations at either the graph or node level by recursively aggregating and transforming the initial features. In the realm of PD, the GNN-learned representations have been leveraged to solve tasks such as cell clustering, quality-of-result prediction, activity simulation, etc., which often overcome the limitations of traditional PD algorithms. In this work, we first revisit recent advancements that GNNs have made in PD. Second, we discuss how GNNs serve as the backbone of novel PD flows. Finally, we present our thoughts on ongoing and future PD challenges that GNNs can tackle and succeed at.

Applying GNNs to Timing Estimation at RTL

  • Daniela Sánchez Lopera
  • Wolfgang Ecker

In the Electronic Design Automation (EDA) flow, signoff checks, such as timing analysis, are performed only after physical synthesis. Encountered timing violations cause re-iterations of the design flow. Hence, timing estimations at initial design stages, such as Register Transfer Level (RTL), would increase the quality of results and reduce the number of flow iterations. Machine learning has been used to estimate the timing behavior of chip components. However, existing solutions map EDA objects to Euclidean data without considering that EDA objects are naturally represented as graphs. Recent advances in Graph Neural Networks (GNNs) motivate the mapping from EDA objects to graphs for design metric prediction tasks at different stages. This paper maps RTL designs to directed, featured graphs with multidimensional node and edge features. These are the input to GNNs for estimating component delays and slews. An in-house hardware generation framework and open-source EDA tools for ASIC synthesis are employed for collecting training data. Experiments over unseen circuits show that GNN-based models are promising for timing estimation, even when the features come from early RTL implementations. Based on estimated delays, critical areas of the design can be detected, and proper RTL micro-architectures can be chosen without running long design iterations.

Embracing Graph Neural Networks for Hardware Security

  • Lilas Alrahis
  • Satwik Patnaik
  • Muhammad Shafique
  • Ozgur Sinanoglu

Graph neural networks (GNNs) have attracted increasing attention due to their superior performance in deep learning on graph-structured data. GNNs have succeeded across various domains such as social networks, chemistry, and electronic design automation (EDA). Electronic circuits have a long history of being represented as graphs, and to no surprise, GNNs have demonstrated state-of-the-art performance in solving various EDA tasks. More importantly, GNNs are now employed to address several hardware security problems, such as detecting intellectual property (IP) piracy and hardware Trojans (HTs), to name a few.

In this survey, we first provide a comprehensive overview of the usage of GNNs in hardware security and propose the first taxonomy to divide the state-of-the-art GNN-based hardware security systems into four categories: (i) HT detection systems, (ii) IP piracy detection systems, (iii) reverse engineering platforms, and (iv) attacks on logic locking. We summarize the different architectures, graph types, node features, benchmark data sets, and model evaluation of the employed GNNs. Finally, we elaborate on the lessons learned and discuss future directions.

SESSION: Compiler and System-Level Techniques for Efficient Machine Learning

Session details: Compiler and System-Level Techniques for Efficient Machine Learning

  • Sri Parameswaran
  • Martin Rapp

Fine-Granular Computation and Data Layout Reorganization for Improving Locality

  • Mahmut Kandemir
  • Xulong Tang
  • Jagadish Kotra
  • Mustafa Karakoy

While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), one of the important characteristics of prior approaches is that they transform the loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarse-grain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third, and more importantly, such coarse-grain approaches are local by nature and struggle to achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves data locality and outperforms existing schemes: 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.
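
To make the notion of a loop transformation matrix concrete, the sketch below applies a unimodular loop-interchange matrix to a small iteration space; the example is illustrative and not taken from the paper.

```python
# Applying a unimodular loop-transformation matrix (here, loop interchange)
# to a 2-deep iteration space. Such a matrix reorders iterations so that the
# innermost loop walks arrays in a layout-friendly direction. Illustrative only.
import numpy as np

T = np.array([[0, 1],
              [1, 0]])     # interchange: (i, j) -> (j, i)

def transformed_order(n, m):
    points = [np.array([i, j]) for i in range(n) for j in range(m)]
    return [tuple(p) for p in sorted(points, key=lambda p: tuple(T @ p))]

print(transformed_order(2, 3))
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]  -> j has become the outer loop
```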

An MLIR-based Compiler Flow for System-Level Design and Hardware Acceleration

  • Nicolas Bohm Agostini
  • Serena Curzel
  • Vinay Amatya
  • Cheng Tan
  • Marco Minutoli
  • Vito Giovanni Castellana
  • Joseph Manzano
  • David Kaeli
  • Antonino Tumeo

The generation of custom hardware accelerators for applications implemented within high-level productive programming frameworks requires considerable manual effort. To automate this process, we introduce SODA-OPT, a compiler tool that extends the MLIR infrastructure. SODA-OPT automatically searches, outlines, tiles, and pre-optimizes relevant code regions to generate high-quality accelerators through high-level synthesis. SODA-OPT can support any high-level programming framework and domain-specific language that interfaces with the MLIR infrastructure. By leveraging MLIR, SODA-OPT solves compiler optimization problems with specialized abstractions. Backend synthesis tools connect to SODA-OPT through progressive intermediate representation lowerings. SODA-OPT interfaces to a design space exploration engine to identify the combination of compiler optimization passes and options that provides high-performance generated designs for different backends and targets. We demonstrate the practical applicability of the compilation flow by exploring the automatic generation of accelerators for deep neural network operators outlined at arbitrary granularity and by combining outlining with tiling on large convolution layers. Experimental results with kernels from the PolyBench benchmark show that our high-level optimizations improve execution delays of synthesized accelerators by up to 60x. We also show that for the selected kernels, our solution outperforms the current state of the art in more than 70% of the benchmarks and provides better average speedup in 55% of them. SODA-OPT is an open source project available at https://gitlab.pnnl.gov/sodalite/soda-opt.

Physics-Aware Differentiable Discrete Codesign for Diffractive Optical Neural Networks

  • Yingjie Li
  • Ruiyang Chen
  • Weilu Gao
  • Cunxi Yu

Diffractive optical neural networks (DONNs) have attracted considerable attention as they bring significant advantages in terms of power efficiency, parallelism, and computational speed compared with conventional deep neural networks (DNNs), which have intrinsic limitations when implemented on digital platforms. However, inversely mapping algorithm-trained physical model parameters onto real-world optical devices with discrete values is a non-trivial task, as existing optical devices have non-unified discrete levels and non-monotonic properties. This work proposes a novel device-to-system hardware-software codesign framework, which enables efficient physics-aware training of DONNs with respect to arbitrary experimentally measured optical devices across layers. Specifically, Gumbel-Softmax is employed to enable differentiable discrete mapping from real-world device parameters into the forward function of DONNs, where the physical parameters in DONNs can be trained by simply minimizing the loss function of the ML task. The results demonstrate that our proposed framework offers significant advantages over conventional quantization-based methods, especially with low-precision optical devices. Finally, the proposed algorithm is fully verified with physical experimental optical systems in low-precision settings.
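
The Gumbel-Softmax trick mentioned above can be sketched in a few lines: a soft, differentiable selection over a set of discrete device levels is used during training and snapped to the nearest level at deployment. The device levels and logits below are hypothetical, and the snippet is not the authors' code.

```python
# Minimal Gumbel-Softmax sketch: relax a choice among discrete device levels
# so the selection stays differentiable during training. Illustrative only.
import numpy as np

def gumbel_softmax(logits, temperature=0.5, rng=np.random.default_rng(0)):
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max())
    return y / y.sum()

device_levels = np.array([0.0, 0.3, 0.7, 1.0])        # hypothetical non-uniform levels
logits = np.array([0.1, 1.5, 0.2, 0.4])               # trainable selection scores
soft_choice = gumbel_softmax(logits)
train_value = soft_choice @ device_levels             # soft mixture in the forward pass
deploy_value = device_levels[np.argmax(soft_choice)]  # hard level at deployment
print(train_value, deploy_value)
```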

Big-Little Chiplets for In-Memory Acceleration of DNNs: A Scalable Heterogeneous Architecture

  • Gokul Krishnan
  • A. Alper Goksoy
  • Sumit K. Mandal
  • Zhenyu Wang
  • Chaitali Chakrabarti
  • Jae-sun Seo
  • Umit Y. Ogras
  • Yu Cao

Monolithic in-memory computing (IMC) architectures face significant yield and fabrication cost challenges as the complexity of DNNs increases. Chiplet-based IMCs that integrate multiple dies with advanced 2.5D/3D packaging offer a low-cost and scalable solution. They enable heterogeneous architectures where the chiplets and their associated interconnection can be tailored to the non-uniform algorithmic structures to maximize IMC utilization and reduce energy consumption. This paper proposes a heterogeneous IMC architecture with big-little chiplets and a hybrid network-on-package (NoP) to optimize the utilization, interconnect bandwidth, and energy efficiency. For a given DNN, we develop a custom methodology to map the model onto the big-little architecture such that the early layers in the DNN are mapped to the little chiplets with higher NoP bandwidth and the subsequent layers are mapped to the big chiplets with lower NoP bandwidth. Furthermore, we achieve a scalable solution by incorporating a DRAM into each chiplet to support a wide range of DNNs beyond the area limit. Compared to a homogeneous chiplet-based IMC architecture, the proposed big-little architecture achieves up to 329× improvement in the energy-delay-area product (EDAP) and up to 2× higher IMC utilization. Experimental evaluation of the proposed big-little chiplet-based RRAM IMC architecture for ResNet-50 on ImageNet shows 259×, 139×, and 48× improvement in energy efficiency at lower area compared to the Nvidia V100 GPU, the Nvidia T4 GPU, and the SIMBA architecture, respectively.
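
The layer-to-chiplet mapping idea can be conveyed with a toy assignment rule: early layers go to little chiplets on high-bandwidth NoP links, later layers to big chiplets. All counts and the split ratio below are invented for illustration; the paper's mapping methodology is more involved.

```python
# Toy sketch of mapping early DNN layers to little chiplets and later layers
# to big chiplets (illustrative only; not the paper's actual methodology).
def map_layers(num_layers, little_chiplets, big_chiplets, split_ratio=0.4):
    split = int(num_layers * split_ratio)
    mapping = {}
    for layer in range(num_layers):
        if layer < split:   # early layers: little chiplets, higher NoP bandwidth
            mapping[layer] = ("little", layer % little_chiplets)
        else:               # later layers: big chiplets, lower NoP bandwidth
            mapping[layer] = ("big", (layer - split) % big_chiplets)
    return mapping

print(map_layers(num_layers=10, little_chiplets=4, big_chiplets=2))
```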

SESSION: Addressing Sensor Security through Hardware/Software Co-Design

Session details: Addressing Sensor Security through Hardware/Software Co-Design

  • Marilyn Wolf

Attacks on Image Sensors

  • Marilyn Wolf
  • Kruttidipta Samal

This paper provides a taxonomy of security vulnerabilities of smart image sensor systems. Image sensors form an important class of sensors. Many image sensors include computation units that can run traditional algorithms, such as image or video compression, along with machine learning tasks such as classification. Some attacks rely on the physics and optics of imaging. Other attacks take advantage of the complex logic and software required to implement imaging systems.

False Data Injection Attacks on Sensor Systems

  • Dimitrios Serpanos

False data injection attacks on sensor systems are an emerging threat to cyberphysical systems, creating significant risks to all application domains and, importantly, to critical infrastructures. Cyberphysical systems are process-dependent, leading to differing false data injection attacks that target disruption of the specific processes (plants). We present a taxonomy of false data injection attacks, using a general model for cyberphysical systems, showing that global and continuous attacks are extremely powerful. To detect false data injection attacks, we describe three methods that enable effective monitoring and detection during plant operation. Considering that sensor failures have effects equivalent to the corresponding false data injection attacks, the methods are effective for sensor fault detection as well.
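
A common building block for such monitoring is residual-based detection: compare each sensor reading with a process-model prediction and flag readings whose residual exceeds a multiple of the expected noise level. The sketch below is a hedged illustration of that generic idea only; the model, noise level, and threshold are placeholders, not the paper's methods.

```python
# Residual-based false-data-injection monitoring (generic sketch).
import numpy as np

def detect_injection(readings, predictions, noise_sigma, k=5.0):
    residuals = np.abs(readings - predictions)
    return residuals > k * noise_sigma          # True where a reading looks injected

t = np.arange(100)
predicted = np.sin(0.1 * t)                                     # model-based prediction
readings = predicted + 0.01 * np.random.default_rng(1).normal(size=t.size)
readings[60:70] += 0.5                                          # injected false data
print(np.where(detect_injection(readings, predicted, noise_sigma=0.01))[0])
```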

Stochastic Mixed-Signal Circuit Design for In-Sensor Privacy

  • Ningyuan Cao
  • Jianbo Liu
  • Boyang Cheng
  • Muya Chang

The ubiquitous data acquisition and extensive data exchange of sensors pose severe security and privacy concerns for end-users and the public. To enable real-time protection of raw data, it is desirable to facilitate privacy-preserving algorithms at the point of data generation, i.e., in-sensor privacy. However, due to severe sensor resource constraints and intensive computation/security costs, it remains an open question how to enable data protection algorithms with efficient circuit techniques. To answer this question, this paper discusses the potential of stochastic mixed-signal (SMS) circuits for ultra-low-power, small-footprint data security. In particular, this paper discusses digitally controlled oscillators (DCOs) and their advantages in (1) seamless analog interfacing, (2) stochastic computation efficiency, and (3) unified entropy generation over conventional digital circuit baselines. With the DCO as an illustrative case, we target (1) defining an SMS privacy-preserving architecture and systematically analyzing its performance gains across various hardware/software configurations, and (2) revisiting analog/mixed-signal voltage/transistor scaling in the context of entropy-based data protection.

Sensor Security: Current Progress, Research Challenges, and Future Roadmap (Invited Paper)

  • Anomadarshi Barua
  • Mohammad Abdullah Al Faruque

Sensors are one of the most pervasive and integral components of today’s safety-critical systems. Sensors serve as a bridge between physical quantities and connected systems. The connected systems blindly trust the sensor, as there is no way to authenticate the signal coming from it. This can be an entry point for an attacker. An attacker can inject a fake input signal along with the legitimate signal by using a suitable spoofing technique. As the sensor’s transducer is not smart enough to differentiate between fake and legitimate signals, the injected fake signal can eventually collapse the connected system. This type of attack is known as a transduction attack. Over the last decade, several works have been published to provide a defense against transduction attacks. However, these defenses have been proposed on an ad-hoc basis and hence are not well structured. Our work begins to fill this gap by providing a checklist that a defense technique should always follow to be considered an ideal defense against the transduction attack. We name this checklist the Golden reference of sensor defense. We provide insights on how this Golden reference can be achieved and argue that sensors should be redesigned from the transducer level to the sensor electronics level. We point out that hardware or software modification alone is not enough; instead, a hardware/software (HW/SW) co-design approach is required to follow this roadmap toward robust and resilient sensors.

SESSION: Advances in Partitioning and Physical Optimization

Session details: Advances in Partitioning and Physical Optimization

  • Markus Olbrich
  • Yu-Guang Chen

SpecPart: A Supervised Spectral Framework for Hypergraph Partitioning Solution Improvement

  • Ismail Bustany
  • Andrew B. Kahng
  • Ioannis Koutis
  • Bodhisatta Pramanik
  • Zhiang Wang

State-of-the-art hypergraph partitioners follow the multilevel paradigm, constructing multiple levels of progressively coarser hypergraphs that are used to drive cut refinements on each level of the hierarchy. Multilevel partitioners are subject to two limitations: (i) hypergraph coarsening processes rely on local neighborhood structure without fully considering the global structure of the hypergraph, and (ii) refinement heuristics can stagnate on local minima. In this paper, we describe SpecPart, the first supervised spectral framework that directly tackles these two limitations. SpecPart solves a generalized eigenvalue problem that captures the balanced partitioning objective and global hypergraph structure in a low-dimensional vertex embedding while leveraging initial high-quality solutions from multilevel partitioners as hints. SpecPart further constructs a family of trees from the vertex embedding and partitions them with a tree-sweeping algorithm. A novel overlay of multiple tree-based partitioning solutions is then lifted to a coarsened hypergraph, where an ILP partitioning instance is solved to alleviate local stagnation. We have validated SpecPart on multiple sets of benchmarks. Experimental results show that for some benchmarks, SpecPart can substantially improve the cutsize, by more than 50%, with respect to the best published solutions obtained with the leading partitioners hMETIS and KaHyPar.
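
The spectral idea underlying such vertex embeddings can be conveyed with the classic Fiedler-vector bipartition of a plain graph, sketched below; SpecPart's supervised, hint-guided generalized-eigenvalue formulation on hypergraphs is considerably richer, so this example is only illustrative.

```python
# Spectral bipartitioning of a plain graph via the Fiedler vector (illustrative).
import numpy as np

def spectral_bipartition(adj):
    laplacian = np.diag(adj.sum(axis=1)) - adj
    _, eigvecs = np.linalg.eigh(laplacian)       # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                      # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= np.median(fiedler)         # balanced split by median threshold

adj = np.array([[0, 1, 1, 0, 0, 0],              # two triangles joined by one edge
                [1, 0, 1, 0, 0, 0],
                [1, 1, 0, 1, 0, 0],
                [0, 0, 1, 0, 1, 1],
                [0, 0, 0, 1, 0, 1],
                [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_bipartition(adj))                 # recovers the two 3-vertex clusters
```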

HyperEF: Spectral Hypergraph Coarsening by Effective-Resistance Clustering

  • Ali Aghdaei
  • Zhuo Feng

This paper introduces a scalable algorithmic framework (HyperEF) for spectral coarsening (decomposition) of large-scale hypergraphs by exploiting hyperedge effective resistances. Motivated by the latest theoretical framework for low-resistance-diameter decomposition of simple graphs, HyperEF aims at decomposing large hypergraphs into multiple node clusters with only a few inter-cluster hyperedges. The key component in HyperEF is a nearly-linear time algorithm for estimating hyperedge effective resistances, which allows incorporating the latest diffusion-based non-linear quadratic operators defined on hypergraphs. To achieve good runtime scalability, HyperEF searches within the Krylov subspace (or approximate eigensubspace) for identifying the nearly-optimal vectors for approximating the hyperedge effective resistances. In addition, a node weight propagation scheme for multilevel spectral hypergraph decomposition has been introduced for achieving even greater node coarsening ratios. When compared with state-of-the-art hypergraph partitioning (clustering) methods, extensive experimental results on real-world VLSI designs show that HyperEF can more effectively coarsen (decompose) hypergraphs without losing key structural (spectral) properties of the original hypergraphs, while achieving over 70× runtime speedups over hMetis and 20× speedups over HyperSF.
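
For readers unfamiliar with the quantity, the effective resistance between two nodes of an ordinary graph can be computed from the Laplacian pseudoinverse as sketched below; HyperEF instead estimates hyperedge effective resistances with scalable Krylov-subspace approximations, so this dense O(n^3) computation is only meant to convey the concept.

```python
# Effective resistance via the Laplacian pseudoinverse:
# R_eff(u, v) = (e_u - e_v)^T L^+ (e_u - e_v). Illustrative only.
import numpy as np

def effective_resistance(adj, u, v):
    laplacian = np.diag(adj.sum(axis=1)) - adj
    l_pinv = np.linalg.pinv(laplacian)
    e = np.zeros(adj.shape[0])
    e[u], e[v] = 1.0, -1.0
    return float(e @ l_pinv @ e)

adj = np.array([[0, 1, 1, 0],                    # triangle 0-1-2 plus pendant edge 2-3
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(effective_resistance(adj, 0, 1))           # well-connected pair: low resistance
print(effective_resistance(adj, 0, 3))           # pair joined through a bridge: higher
```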

Design and Technology Co-Optimization Utilizing Multi-Bit Flip-Flop Cells

  • Soomin Kim
  • Taewhan Kim

The benefit of a multi-bit flip-flop (MBFF) over a single-bit flip-flop is the sharing of in-cell clock inverters among the master and slave latches of the MBFF’s internal flip-flops. Theoretically, the more flip-flops an MBFF has, the more power saving it can achieve. However, in practice, physically increasing the size of an MBFF to accommodate many flip-flops imposes two new challenging problems in physical design: (1) non-flexible MBFF cell flipping for multiple D-to-Q signals and (2) unbalanced or wasted use of the MBFF footprint space. In this work, we solve the two problems in a way that enhances routability and timing at the placement and routing stages. Precisely, for problem 1, we make the non-flexible MBFF cell flipping fully flexible by generating MBFF layouts supporting diverse D-to-Q flow directions in the detailed placement to improve routability, and for problem 2, we enhance the setup and clock-to-Q delay of timing-critical flip-flops in the MBFF through gate upsizing (i.e., transistor folding) using the unused space in the MBFF to improve timing slack at the post-routing stage. Through experiments with benchmark circuits, it is shown that our proposed design and technology co-optimization (DTCO) flow using MBFFs, which solves problems 1 and 2, is very promising.

Transitive Closure Graph-Based Warpage-Aware Floorplanning for Package Designs

  • Yang Hsu
  • Min-Hsuan Chung
  • Yao-Wen Chang
  • Ci-Hong Lin

In modern heterogeneous integration technologies, chips with different processes and functionality are integrated into a package with high interconnection density and large I/O counts. Integrating multiple chips into a package may suffer from severe warpage problems caused by the mismatch in coefficients of thermal expansion between different manufacturing materials, leading to deformation and malfunction in the manufactured package. The industry is eager to find a solution for warpage optimization. This paper proposes the first warpage-aware floorplanning algorithm for heterogeneous integration. We first present an efficient qualitative warpage model for a multi-chip package structure based on Suhir’s solution, which is more suitable for optimization than time-consuming finite element analysis. Based on the transitive closure graph floorplan representation, we then propose three perturbations for simulated annealing to optimize the warpage more directly and thus speed up the process. Finally, we develop a force-directed detailed floorplanning algorithm to further refine the solutions by utilizing the dead spaces. Experimental results demonstrate the effectiveness of our warpage model and algorithm.
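
At the core of such a flow is a standard simulated-annealing acceptance loop. The sketch below shows that generic loop only, with the floorplan representation, TCG perturbations, and warpage-plus-area cost abstracted into placeholder callables that a real implementation would supply.

```python
# Bare-bones simulated-annealing loop of the kind such floorplanners build on.
# The perturb/cost callables are placeholders, not the paper's TCG moves or
# Suhir-based warpage model.
import math, random

def simulated_annealing(initial, perturb, cost, t0=100.0, t_min=0.1, alpha=0.95):
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t0
    while t > t_min:
        candidate = perturb(current)
        delta = cost(candidate) - current_cost
        if delta < 0 or random.random() < math.exp(-delta / t):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= alpha          # geometric cooling schedule
    return best, best_cost

# Toy usage: minimize a 1-D "cost" standing in for weighted area + warpage.
best, best_cost = simulated_annealing(
    initial=5.0,
    perturb=lambda x: x + random.uniform(-1, 1),
    cost=lambda x: (x - 2.0) ** 2)
print(round(best, 2), round(best_cost, 4))
```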

SESSION: Democratizing Design Automation with Open-Source Tools: Perspectives, Opportunities, and Challenges

Session details: Democratizing Design Automation with Open-Source Tools: Perspectives, Opportunities, and Challenges

  • Antonino Tumeo

A Mixed Open-Source and Proprietary EDA Commons for Education and Prototyping

  • Andrew B. Kahng

In recent years, several open-source projects have shown potential to serve a future technology commons for EDA and design prototyping. This paper examines how open-source and proprietary EDA technologies will inevitably take on complementary roles within a future technology commons. Proprietary EDA technologies offer numerous benefits that will endure, including (i) exceptional technology and engineering; (ii) ever-increasing importance in design-based equivalent scaling and the overall semiconductor value chain; and (iii) well-established commercial and partner relationships. On the other hand, proprietary EDA technologies face challenges that will also endure, including (i) inability to pursue directions such as massive leverage of cloud compute, extreme reduction of turnaround times, or “free tools”; and (ii) difficulty in evolving and addressing new applications and markets. By contrast, open-source EDA technologies offer benefits that include (i) the capability to serve as a friction-free, democratized platform for education and future workforce development (i.e., as a platform for EDA research, and as a means of teaching / training both designers and EDA developers with public code); and (ii) addressing the needs of underserved, non-enterprise account markets (e.g., older nodes, research flows, cost-sensitive IoT, new devices and integrations, system-design-technology pathfinding). This said, open-source will always face challenges such as sustainability, governance, and how to achieve critical mass and critical quality. The paper will conclude with key directions and synergies for open-source and proprietary EDA within an EDA Commons for education and prototyping.

SODA Synthesizer: An Open-Source, Multi-Level, Modular, Extensible Compiler from High-Level Frameworks to Silicon

  • Nicolas Bohm Agostini
  • Ankur Limaye
  • Marco Minutoli
  • Vito Giovanni Castellana
  • Joseph Manzano
  • Antonino Tumeo
  • Serena Curzel
  • Fabrizio Ferrandi

The SODA Synthesizer is an open-source, modular, end-to-end hardware compiler framework. The SODA frontend, developed in MLIR, performs system-level design, code partitioning, and high-level optimizations to prepare the specifications for the hardware synthesis. The backend is based on a state-of-the-art high-level synthesis tool and generates the final hardware design. The backend can interface with logic synthesis tools for field programmable gate arrays or with commercial and open-source logic synthesis tools for application-specific integrated circuits. We discuss the opportunities and challenges in integrating with commercial and open-source tools both at the frontend and backend, and highlight the role that an end-to-end compiler framework like SODA can play in an open-source hardware design ecosystem.

A Scalable Methodology for Agile Chip Development with Open-Source Hardware Components

  • Maico Cassel dos Santos
  • Tianyu Jia
  • Martin Cochet
  • Karthik Swaminathan
  • Joseph Zuckerman
  • Paolo Mantovani
  • Davide Giri
  • Jeff Jun Zhang
  • Erik Jens Loscalzo
  • Gabriele Tombesi
  • Kevin Tien
  • Nandhini Chandramoorthy
  • John-David Wellman
  • David Brooks
  • Gu-Yeon Wei
  • Kenneth Shepard
  • Luca P. Carloni
  • Pradip Bose

We present a scalable methodology for the agile physical design of tile-based heterogeneous system-on-chip (SoC) architectures that simplifies the reuse and integration of open-source hardware components. The methodology leverages the regularity of the on-chip communication infrastructure, which is based on a multi-plane network-on-chip (NoC), and the modularity of socket interfaces, which connect the tiles to the NoC. Each socket also provides its tile with a set of platform services, including independent clocking and voltage control. As a result, the physical design of each tile can be decoupled from its location in the top-level floorplan of the SoC and the overall SoC design can benefit from a hierarchical timing-closure flow, design reuse and, if necessary, fast respin. With the proposed methodology we completed two SoC tapeouts of increasing complexity, which illustrate its capabilities and the resulting gains in terms of design productivity.

SESSION: Accelerators on A New Horizon

Session details: Accelerators on A New Horizon

  • Vaibhav Verma
  • Georgios Zervakis

GraphRC: Accelerating Graph Processing on Dual-Addressing Memory with Vertex Merging

  • Wei Cheng
  • Chun-Feng Wu
  • Yuan-Hao Chang
  • Ing-Chao Lin

Architectural innovation in graph accelerators attracts research attention due to foreseeable inflation in data sizes and the irregular memory access patterns of graph algorithms. Conventional graph accelerators ignore the potential of the Non-Volatile Memory (NVM) crossbar as a dual-addressing memory and treat it as a traditional single-addressing memory with higher density and better energy efficiency. In this work, we present GraphRC, a graph accelerator that leverages the power of dual-addressing memory by mapping in-edge/out-edge requests to column/row-oriented memory accesses. Although the capability of dual-addressing memory greatly improves the performance of graph processing, some memory accesses still suffer from low-utilization issues. Therefore, we propose a vertex merging (VM) method that improves the cache block utilization rate by merging memory requests from consecutive vertices. VM reduces the execution time of all 6 graph algorithms on all 4 datasets by 24.24% on average. We then identify that the data dependency inherent in a graph limits the usage of VM, and that its effectiveness is bounded by the percentage of mergeable vertices. To overcome this limitation, we propose an aggressive vertex merging (AVM) method that outperforms VM by ignoring the data dependency inherent in a graph. AVM significantly reduces the execution time of ranking-based algorithms on all 4 datasets while preserving the correct ranking of the top 20 vertices.
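
The vertex-merging idea can be illustrated by coalescing the memory requests of consecutive vertices that fall in the same cache block, so one block fetch serves several vertices. The addresses and block size in this sketch are invented for illustration and do not reflect GraphRC's actual implementation.

```python
# Coalesce requests from consecutive vertices that map to the same cache block.
def merge_requests(vertex_addresses, block_bytes=64):
    merged, last_block = [], None
    for vertex, addr in sorted(vertex_addresses, key=lambda p: p[0]):
        block = addr // block_bytes
        if block == last_block:
            merged[-1][1].append(vertex)         # reuse the pending block fetch
        else:
            merged.append((block, [vertex]))
            last_block = block
    return merged

requests = [(0, 0x1000), (1, 0x1008), (2, 0x1010), (3, 0x1100), (4, 0x1108)]
print(merge_requests(requests))                  # 5 vertex requests -> 2 block fetches
```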

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

  • Matheus Cavalcante
  • Domenic Wüthrich
  • Matteo Perotti
  • Samuel Riedel
  • Luca Benini

While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors holds the promise of greatly reducing their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve Instruction Level Parallelism (ILP), which increase their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz’s performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.

Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems with FPGA Accelerators

  • Edward Richter
  • Deming Chen

While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. However, studying SVM implementations is difficult, as there is no open and flexible system to explore trade-offs between different SVM implementations and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system which enables thorough study of SVM in heterogeneous computing environments for discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis on the overheads of three SVM architectures, and generate several insights which highlight the cost and benefits of each architecture. From these insights, we propose a flowchart of how to choose the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.

ReSiPI: A Reconfigurable Silicon-Photonic 2.5D Chiplet Network with PCMs for Energy-Efficient Interposer Communication

  • Ebadollah Taheri
  • Sudeep Pasricha
  • Mahdi Nikdast

2.5D chiplet systems have been proposed to improve the low manufacturing yield of large-scale chips. However, connecting the chiplets through an electronic interposer imposes a high traffic load on the interposer network. Silicon photonics technology has shown great promise towards handling a high volume of traffic with low latency in intra-chip network-on-chip (NoC) fabrics. Although recent advances in silicon photonic devices have extended photonic NoCs to enable high bandwidth communication in 2.5D chiplet systems, such interposer-based photonic networks still suffer from high power consumption. In this work, we design and analyze a novel Reconfigurable power-efficient and congestion-aware Silicon-Photonic 2.5D Interposer network, called ReSiPI. Considering runtime traffic, ReSiPI is able to dynamically deploy inter-chiplet photonic gateways to improve the overall network congestion. ReSiPI also employs switching elements based on phase change materials (PCMs) to dynamically reconfigure and power-gate the photonic interposer network, thereby improving the network power efficiency. Compared to the best prior state-of-the-art 2.5D photonic network, ReSiPI demonstrates, on average, 37% lower latency, 25% power reduction, and 53% energy minimization in the network.

SESSION: CAD for Confidentiality of Hardware IPs

Session details: CAD for Confidentiality of Hardware IPs

  • Swarup Bhunia

Hardware IP Protection against Confidentiality Attacks and Evolving Role of CAD Tool

  • Swarup Bhunia
  • Amitabh Das
  • Saverio Fazzari
  • Vivian Kammler
  • David Kehlet
  • Jeyavijayan Rajendran
  • Ankur Srivastava

With growing use of hardware intellectual property (IP) based integrated circuits (IC) design and increasing reliance on a globalized supply chain, the threats to confidentiality of hardware IPs have emerged as major security concerns to the IP producers and owners. These threats are diverse, including reverse engineering (RE), piracy, cloning, and extraction of design secrets, and span different phases of electronics life cycle. The academic research community and the semiconductor industry have made significant efforts over the past decade on developing effective methodologies and CAD tools targeted to protect hardware IPs against these threats. These solutions include watermarking, logic locking, obfuscation, camouflaging, split manufacturing, and hardware redaction. This paper focuses on key topics on confidentiality of hardware IPs encompassing the major threats, protection approaches, security analysis, and metrics. It discusses the strengths and limitations of the major solutions in protecting hardware IPs against the confidentiality attacks, and future directions to address the limitations in the modern supply chain ecosystem.

SESSION: Analyzing Reliability, Defects and Patterning

Session details: Analyzing Reliability, Defects and Patterning

  • Gaurav Rajavendra Reddy
  • Kostas Adam

Pin Accessibility and Routing Congestion Aware DRC Hotspot Prediction Using Graph Neural Network and U-Net

  • Kyeonghyeon Baek
  • Hyunbum Park
  • Suwan Kim
  • Kyumyung Choi
  • Taewhan Kim

An accurate DRC (design rule check) hotspot prediction at the placement stage is essential in order to reduce the substantial amount of design time required for the iterations of placement and routing. It is known that for implementing chips with advanced technology nodes, (1) pin accessibility and (2) routing congestion are two major causes of DRVs (design rule violations). Though many ML (machine learning) techniques have been proposed to address this prediction problem, it was not easy to assemble the aggregate data on items 1 and 2 in a unified fashion for training ML models, resulting in a considerable accuracy loss in DRC hotspot prediction. This work overcomes this limitation by proposing a novel ML-based DRC hotspot prediction technique, which is able to accurately capture the combined impact of items 1 and 2 on DRC hotspots. Precisely, we devise a graph, called a pin proximity graph, that effectively models the spatial information on cell I/O pins and the information on pin-to-pin disturbance relations. Then, we propose a new ML model, called PGNN, which tightly combines a GNN (graph neural network) and U-Net in a way that the GNN is used to embed pin accessibility information abstracted from our pin proximity graph while the U-Net is used to extract routing congestion information from grid-based features. Through experiments with a set of benchmark designs using the Nangate 15nm library, our PGNN outperforms the existing ML models on all benchmark designs, achieving on average 7.8~12.5% improvement in F1-score while running 5.5× faster in inference than the state-of-the-art techniques.

A Novel Semi-Analytical Approach for Fast Electromigration Stress Analysis in Multi-Segment Interconnects

  • Olympia Axelou
  • Nestor Evmorfopoulos
  • George Floros
  • George Stamoulis
  • Sachin S. Sapatnekar

As integrated circuit technologies move below 10 nm, electromigration (EM) has become an issue of great concern for long-term reliability due to stricter performance, thermal, and power requirements. The problem of EM becomes even more pronounced in power grids due to the large unidirectional currents flowing in these structures. In recent years, attention in EM analysis has been drawn to accurate physics-based models describing the interplay between the electron wind force and the back stress force in a single Partial Differential Equation (PDE) involving wire stress. In this paper, we present a fast semi-analytical approach for the solution of the stress PDE at discrete spatial points in multi-segment lines of power grids, which allows the analytical calculation of EM stress independently at any time in these lines. Our method exploits the specific form of the discrete stress coefficient matrix, whose eigenvalues and eigenvectors are known beforehand. Thus, a closed-form equation can be constructed with almost linear time complexity, without the need for time discretization. This closed-form equation can subsequently be used at any given time in transient stress analysis. Our experimental results, using the industrial IBM power grid benchmarks, demonstrate that our method has excellent accuracy compared to the industrial tool COMSOL while being orders of magnitude faster.
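
To make the role of the known eigendecomposition concrete, a generic closed form of this kind, for a spatially discretized linear stress system with an invertible, constant coefficient matrix (the paper's specific matrix and boundary handling are not reproduced here), is:

$$
\frac{d\boldsymbol{\sigma}}{dt} = \mathbf{A}\,\boldsymbol{\sigma}(t) + \mathbf{b}, \qquad \mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}
\;\Longrightarrow\;
\boldsymbol{\sigma}(t) = \mathbf{V}\, e^{\boldsymbol{\Lambda}t}\, \mathbf{V}^{-1}\bigl(\boldsymbol{\sigma}(0) + \mathbf{A}^{-1}\mathbf{b}\bigr) - \mathbf{A}^{-1}\mathbf{b},
$$

where $e^{\boldsymbol{\Lambda}t}$ is diagonal, so evaluating the stress at any time $t$ costs only one scalar exponential per eigenvalue and requires no time stepping.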

HierPINN-EM: Fast Learning-Based Electromigration Analysis for Multi-Segment Interconnects Using Hierarchical Physics-Informed Neural Network

  • Wentian Jin
  • Liang Chen
  • Subed Lamichhane
  • Mohammadamir Kavousi
  • Sheldon X.-D. Tan

Electromigration (EM) becomes a major concern for VLSI circuits as technology advances into the nanometer regime. The crux of the problem is to solve the partial differential Korhonen equations, which remains challenging due to the increasing integration density. Recently, scientific machine learning has been explored to solve partial differential equations (PDEs) following breakthrough successes of deep neural networks, and existing approaches such as physics-informed neural networks (PINNs) show promising results for some small PDE problems. However, for large engineering problems like EM analysis of large interconnect trees, it was shown that the plain PINN does not work well due to the large number of variables. In this work, we propose a novel hierarchical PINN approach, HierPINN-EM, for fast EM-induced stress analysis of multi-segment interconnects. Instead of solving the interconnect tree as a whole, we first solve the EM problem for one wire segment under different boundary and geometrical parameters using supervised learning. Then we apply the unsupervised PINN concept to solve the whole interconnect by enforcing the physics laws at the boundaries of all wire segments. In this way, HierPINN-EM can significantly reduce the number of variables compared with the plain PINN solver. Numerical results on a number of synthetic interconnect trees show that HierPINN-EM can lead to orders of magnitude speedup in training and more than 79× better accuracy over the plain PINN method. Furthermore, HierPINN-EM yields 19% better accuracy with a 99% reduction in training cost over the recently proposed Graph Neural Network-based EM solver, EMGraph.

Sub-Resolution Assist Feature Generation with Reinforcement Learning and Transfer Learning

  • Guan-Ting Liu
  • Wei-Chen Tai
  • Yi-Ting Lin
  • Iris Hui-Ru Jiang
  • James P. Shiely
  • Pu-Jen Cheng

As modern photolithography feature sizes continue to shrink, sub-resolution assist feature (SRAF) generation has become a key resolution enhancement technique to improve the manufacturing process window. State-of-the-art works resort to machine learning to overcome the deficiencies of model-based and rule-based approaches. Nevertheless, these machine learning-based methods do not consider, or only implicitly consider, the optical interference between SRAFs, and rely heavily on post-processing to satisfy SRAF mask manufacturing rules. In this paper, we are the first to generate SRAFs using reinforcement learning to address SRAF interference and produce mask-rule-compliant results directly. In this way, our two-phase learning enables us to emulate the style of model-based SRAFs while further improving the process variation (PV) band. A state alignment and action transformation mechanism is proposed to achieve orientation equivariance while expediting the training process. We also propose a transfer learning framework, allowing SRAF generation under different light sources without retraining the model. Compared with state-of-the-art works, our method improves the solution quality in terms of PV band and edge placement error (EPE) while reducing the overall runtime.

SESSION: New Frontier in Verification Technology

Session details: New Frontier in Verification Technology

  • Jyotirmoy Vinay
  • Zahra Ghodsi

Automatic Test Configuration and Pattern Generation (ATCPG) for Neuromorphic Chips

  • I-Wei Chiu
  • Xin-Ping Chen
  • Jennifer Shueh-Inn Hu
  • James Chien-Mo Li

The demand for low-power, high-performance neuromorphic chips is increasing. However, conventional testing is not applicable to neuromorphic chips for three reasons: (1) lack of scan DfT, (2) stochastic characteristics, and (3) configurable functionality. In this paper, we present an automatic test configuration and pattern generation (ATCPG) method for testing a configurable stochastic neuromorphic chip without using scan DfT. We use machine learning to generate test configurations. Then, we apply a modified fast gradient sign method to generate test patterns. Finally, we determine the number of test repetitions using the statistical power of the test. We conduct experiments on one of the neuromorphic architectures, the spiking neural network, to evaluate the effectiveness of our ATCPG. The experimental results show that our ATCPG can achieve 100% fault coverage for the five fault models we use. For testing a 3-layer model at a 0.05 significance level, we produce 5 test configurations and 67 test patterns. The average numbers of test repetitions for neuron faults and synapse faults are 2,124 and 4,557, respectively. Besides, our simulation results show that the overkill matched our significance level perfectly.
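
For reference, the textbook fast gradient sign method (FGSM) that the flow modifies perturbs an input in the direction that most increases the loss. The sketch below shows only that standard step, with made-up numbers, and does not reproduce the paper's modified variant.

```python
# Standard FGSM step: nudge each input element by +/- epsilon in the sign of
# the loss gradient. Values are invented for illustration.
import numpy as np

def fgsm_step(x, grad_loss_wrt_x, epsilon=0.1):
    """Perturb input x in the direction that most increases the loss."""
    return x + epsilon * np.sign(grad_loss_wrt_x)

x = np.array([0.2, 0.7, 0.5])        # hypothetical input stimulus values
grad = np.array([0.4, -1.2, 0.05])   # gradient of the loss w.r.t. x
print(fgsm_step(x, grad))            # each element nudged by +/- epsilon
```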

ScaleHD: Robust Brain-Inspired Hyperdimensional Computing via Adaptive Scaling

  • Sizhe Zhang
  • Mohsen Imani
  • Xun Jiao

Brain-inspired hyperdimensional computing (HDC) has demonstrated promising capability in various cognition tasks such as robotics, bio-medical signal analysis, and natural language processing. Compared to deep neural networks, HDC models offer advantages such as lightweight models and one/few-shot learning capabilities, making them a promising alternative paradigm to traditional resource-demanding deep learning models, particularly on edge devices with limited resources. Despite the growing popularity of HDC, the robustness of HDC models and the approaches to enhance it have not been systematically analyzed and sufficiently examined. HDC relies on high-dimensional numerical vectors, referred to as hypervectors (HVs), to perform cognition tasks, and the values inside the HVs are critical to the robustness of an HDC model. We propose ScaleHD, an adaptive scaling method that scales the values of HVs in the associative memory of an HDC model to enhance its robustness. We propose three different modes of ScaleHD, namely Global-ScaleHD, Class-ScaleHD, and (Class + Clip)-ScaleHD, which are based on different adaptive scaling strategies. Results show that ScaleHD is able to enhance HDC robustness against memory errors by up to 10,000X. Moreover, we leverage the enhanced HDC robustness in exchange for energy savings via voltage scaling. Experimental results show that ScaleHD can reduce the energy consumption of the HDC memory system by up to 72.2% with less than 1% accuracy loss.
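
As a rough illustration of the scaling idea (not the paper's calibrated method), the snippet below rescales class hypervectors stored in associative memory so that their dynamic range shrinks, which limits how much an individual memory error can perturb a similarity score; all values are synthetic.

```python
# Shrink the value range of stored class hypervectors (illustrative only).
import numpy as np

def scale_class_hvs(class_hvs, scale=0.5):
    return np.clip(class_hvs * scale, -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
class_hvs = rng.integers(-200, 200, size=(4, 1024))   # 4 classes, dimension 1024
print(np.abs(scale_class_hvs(class_hvs)).max())       # reduced dynamic range
```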

Quantitative Verification and Design Space Exploration under Uncertainty with Parametric Stochastic Contracts

  • Chanwook Oh
  • Michele Lora
  • Pierluigi Nuzzo

This paper proposes an automated framework for quantitative verification and design space exploration of cyber-physical systems in the presence of uncertainty, leveraging assume-guarantee contracts expressed in Stochastic Signal Temporal Logic (StSTL). We introduce quantitative semantics for StSTL and formulations of the quantitative verification and design space exploration problems as bi-level optimization problems. We show that these optimization problems can be effectively solved for a class of stochastic systems and a fragment of bounded-time StSTL formulas. Our algorithm searches for partitions of the upper-level design space such that the solutions of the lower-level problems satisfy the upper-level constraints. A set of optimal parameter values are then selected within these partitions. We illustrate the effectiveness of our framework on the design of a multi-sensor perception system and an automatic cruise control system.

SESSION: Low Power Edge Intelligence

Session details: Low Power Edge Intelligence

  • Sabya Das
  • Jiang Hu

Reliable Machine Learning for Wearable Activity Monitoring: Novel Algorithms and Theoretical Guarantees

  • Dina Hussein
  • Taha Belkhouja
  • Ganapati Bhat
  • Janardhan Rao Doppa

Wearable devices are becoming popular for health and activity monitoring. The machine learning (ML) models for these applications are trained by collecting data in a laboratory with precise control of experimental settings. However, during real-world deployment/usage, the experimental settings (e.g., sensor position or sampling rate) may deviate from those used during training. This discrepancy can degrade the accuracy and effectiveness of the health monitoring applications. Therefore, there is a great need to develop reliable ML approaches that provide high accuracy for real-world deployment. In this paper, we propose a novel statistical optimization approach, referred to as StatOpt, that automatically accounts for real-world disturbances in sensing data to improve the reliability of ML models for wearable devices. We theoretically derive upper bounds on sensor data disturbance for StatOpt to produce ML models with reliability certificates. We validate StatOpt on two publicly available datasets for human activity recognition. Our results show that, compared to standard ML algorithms, the reliable ML classifiers enabled by the StatOpt approach improve accuracy by up to 50% in real-world settings with zero overhead, while baseline approaches incur significant overhead and fail to achieve comparable accuracy.

Neurally-Inspired Hyperdimensional Classification for Efficient and Robust Biosignal Processing

  • Yang Ni
  • Nicholas Lesica
  • Fan-Gang Zeng
  • Mohsen Imani

Biosignal processing relies on several sensors that collect time-series information. Since time series contain temporal dependencies, they are difficult to process with existing machine learning algorithms. Hyper-Dimensional Computing (HDC) has been introduced as a brain-inspired paradigm for lightweight time series classification. However, existing HDC algorithms have the following drawbacks: (1) low classification accuracy that comes from linear hyperdimensional representation, (2) lack of real-time learning support due to costly and non-hardware-friendly operations, and (3) inability to build a strong model from partially labeled data.

In this paper, we propose TempHD, a novel hyperdimensional computing method for efficient and accurate biosignal classification. We first develop a novel non-linear hyperdimensional encoding that maps data points into high-dimensional space. Unlike existing HDC solutions that use costly mathematics for encoding, TempHD preserves spatial-temporal information of data in original space before mapping data into high-dimensional space. To obtain the most informative representation, our encoding method considers the non-linear interactions between both spatial sensors and temporally sampled data. Our evaluation shows that TempHD provides higher classification accuracy, significantly higher computation efficiency, and, more importantly, the capability to learn from partially labeled data. We evaluate TempHD effectiveness on noisy EEG data used for a brain-machine interface. Our results show that TempHD achieves, on average, 2.3% higher classification accuracy as well as 7.7× and 21.8× speedup for training and testing time compared to state-of-the-art HDC algorithms, respectively.

EVE: Environmental Adaptive Neural Network Models for Low-Power Energy Harvesting System

  • Sahidul Islam
  • Shanglin Zhou
  • Ran Ran
  • Yu-Fang Jin
  • Wujie Wen
  • Caiwen Ding
  • Mimi Xie

IoT devices are increasingly being implemented with neural network models to enable smart applications. Energy harvesting (EH) technology that harvests energy from the ambient environment is a promising alternative to batteries for powering those devices, due to its low maintenance cost and the wide availability of the energy sources. However, the power provided by the energy harvester is low and has an intrinsic drawback of instability, since it varies with the ambient environment. This paper proposes EVE, an automated machine learning (autoML) co-exploration framework to search for desired multi-models with shared weights for energy harvesting IoT devices. These shared models incur a significantly reduced memory footprint with different levels of model sparsity, latency, and accuracy to adapt to environmental changes. An efficient on-device implementation architecture is further developed to execute each model efficiently on the device. A run-time model extraction algorithm is proposed that retrieves an individual model with negligible overhead when a specific model mode is triggered. Experimental results show that the neural network models generated by EVE are on average 2.5× faster than baseline models without pruning and shared weights.

SESSION: Crossbars, Analog Accelerators for Neural Networks, and Neuromorphic Computing Based on Printed Electronics

Session details: Crossbars, Analog Accelerators for Neural Networks, and Neuromorphic Computing Based on Printed Electronics

  • Hussam Amrouch
  • Sheldon Tan

Designing Energy-Efficient Decision Tree Memristor Crossbar Circuits Using Binary Classification Graphs

  • Pranav Sinha
  • Sunny Raj

We propose a method to design in-memory, energy-efficient, and compact memristor crossbar circuits for implementing decision trees using flow-based computing. We develop a new tool called the binary classification graph, which is equivalent to decision trees in accuracy but uses bit values of input features to make decisions instead of thresholds. Our proposed design is resilient to manufacturing errors and can scale to large crossbar sizes due to the utilization of sneak paths in computations. Our design uses zero-transistor, one-memristor (0T1R) crossbars with only two resistance states, high and low, which makes it resilient to resistance drift and radiation degradation. We test the performance of our designs on multiple standard machine learning datasets and show that our method uses circuits of size 5.23 × 10⁻³ mm², consumes 20.5 pJ per decision, and outperforms state-of-the-art decision tree acceleration algorithms on these metrics.

Fuse and Mix: MACAM-Enabled Analog Activation for Energy-Efficient Neural Acceleration

  • Hanqing Zhu
  • Keren Zhu
  • Jiaqi Gu
  • Harrison Jin
  • Ray T. Chen
  • Jean Anne Incorvia
  • David Z. Pan

Analog computing has been recognized as a promising low-power alternative to digital counterparts for neural network acceleration. However, conventional analog computing operates mainly in a mixed-signal manner, and the tedious analog/digital (A/D) conversion cost significantly limits the overall system's energy efficiency. In this work, we devise an efficient analog activation unit with magnetic tunnel junction (MTJ)-based analog content-addressable memory (MACAM), simultaneously realizing nonlinear activation and A/D conversion in a fused fashion. To compensate for the nascent and therefore currently limited representation capability of MACAM, we propose to mix our analog activation unit with a digital activation dataflow. A fully differentiable framework, SuperMixer, is developed to search for an optimized activation workload assignment, adaptive to various activation energy constraints. The effectiveness of our proposed methods is evaluated on a silicon photonic accelerator. Compared to standard activation implementations, our mixed activation system with the searched assignment achieves competitive accuracy with >60% energy savings on A/D conversion and activation.

Aging-Aware Training for Printed Neuromorphic Circuits

  • Haibin Zhao
  • Michael Hefenbrock
  • Michael Beigl
  • Mehdi B. Tahoori

Printed electronics allow for ultra-low-cost circuit fabrication with unique properties such as flexibility, non-toxicity, and stretchability. Because of these advanced properties, there is a growing interest in adapting printed electronics for emerging areas such as fast-moving consumer goods and wearable technologies. In such domains, analog signal processing in or near the sensor is favorable. Printed neuromorphic circuits have recently been proposed as a solution to perform such analog processing natively. Additionally, their learning-based design process allows highly efficient optimization and enables them to mitigate the high process variations associated with low-cost printed processes. In this work, we address the aging of printed components, an effect that can significantly degrade the accuracy of printed neuromorphic circuits over time. To this end, we develop a stochastic aging model to describe the behavior of aged printed resistors and modify the training objective by considering the expected loss over the lifetime of the device. This approach ensures acceptable accuracy over the device lifetime. Our experiments show that an overall 35.8% improvement in expected accuracy over the device lifetime can be achieved using the proposed learning approach.
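
To make the expected-loss objective concrete, the following minimal sketch averages gradients over sampled aging states at every training step: the deployed weights are perturbed by a simple multiplicative drift model whose variance grows with device age. The drift model and the linear-regression task are illustrative assumptions, not the printed-resistor aging model of the paper.

```python
# Training against an *expected* loss over device lifetime (illustrative sketch):
# sample aging states, apply them to the weights, and average the gradients.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=256)

theta = np.zeros(8)
LIFETIME_SAMPLES, LR, STEPS = 16, 0.05, 200

def sample_drift(shape):
    """Multiplicative weight drift at a random point of the device lifetime."""
    t = rng.uniform(0.0, 1.0)                            # normalized device age
    return 1.0 + rng.normal(0.0, 0.05 * t, size=shape)   # variance grows with age

for _ in range(STEPS):
    grad = np.zeros_like(theta)
    for _ in range(LIFETIME_SAMPLES):
        d = sample_drift(theta.shape)
        pred = X @ (theta * d)                           # aged weights at inference time
        grad += d * (X.T @ (pred - y)) * (2 / len(y))    # gradient of squared loss w.r.t. theta
    theta -= LR * grad / LIFETIME_SAMPLES                # descend the expected loss
```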

SESSION: Designing DNN Accelerators

Session details: Designing DNN Accelerators

  • Elliott Delaye
  • Yiyu Shi

Workload-Balanced Graph Attention Network Accelerator with Top-K Aggregation Candidates

  • Naebeom Park
  • Daehyun Ahn
  • Jae-Joon Kim

Graph attention networks (GATs) are gaining attention for various transductive and inductive graph processing tasks due to their higher accuracy than conventional graph convolutional networks (GCNs). The power-law distribution of real-world graph-structured data, on the other hand, causes a severe workload imbalance problem for GAT accelerators. To reduce the degradation of PE utilization due to this workload imbalance, we present algorithm/hardware co-design results for a GAT accelerator that balances the workload assigned to processing elements by allowing only K neighbor nodes to participate in the aggregation phase. The proposed model selects the K neighbor nodes with the highest attention scores, which represent the relevance between two nodes, to minimize the accuracy drop. Experimental results show that our algorithm/hardware co-design of the GAT accelerator achieves higher processing speed and energy efficiency than GAT accelerators using conventional workload balancing techniques. Furthermore, we demonstrate that the proposed GAT accelerators can be made faster than GCN accelerators, which typically process a smaller number of computations.
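
A minimal sketch of the top-K selection described above: only the K neighbors with the highest attention scores participate in aggregation, so the per-node workload is bounded regardless of node degree. Scores and features are random placeholders rather than outputs of a real GAT layer.

```python
# Top-K neighbor aggregation sketch: keep only the K highest-scoring neighbors,
# then softmax over the kept scores and aggregate their features.
import numpy as np

rng = np.random.default_rng(2)
K, F = 4, 8
neighbor_feats = rng.normal(size=(11, F))        # features of 11 neighbors (placeholder)
scores = rng.normal(size=11)                     # unnormalized attention scores (placeholder)

top = np.argsort(scores)[-K:]                    # indices of the K largest scores
alpha = np.exp(scores[top] - scores[top].max())
alpha /= alpha.sum()                             # softmax over the kept neighbors only
aggregated = alpha @ neighbor_feats[top]         # fixed-size weighted aggregation
```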

Re2fresh: A Framework for Mitigating Read Disturbance in ReRAM-Based DNN Accelerators

  • Hyein Shin
  • Myeonggu Kang
  • Lee-Sup Kim

A severe read disturbance problem degrades the inference accuracy of resistive RAM (ReRAM) based deep neural network (DNN) accelerators. Refresh, which reprograms the ReRAM cells, is the most obvious solution to the problem, but programming ReRAM consumes substantial energy. To address the issue, we first analyze the resistance drift pattern of each conductance state and the actual read stress applied to the ReRAM array by considering the characteristics of ReRAM-based DNN accelerators. Based on this analysis, we cluster ReRAM cells into a few groups for each layer of the DNN and generate a proper refresh cycle for each group in the offline phase. The individual refresh cycles reduce energy consumption by eliminating unnecessary refresh operations. In the online phase, the refresh controller selectively launches refresh operations according to the generated refresh cycles. ReRAM cells are selectively refreshed by minimally modifying the conventional structure of the ReRAM-based DNN accelerator. The proposed work successfully resolves the read disturbance problem, reducing the energy consumption of refresh operations by 97% while preserving inference accuracy.

FastStamp: Accelerating Neural Steganography and Digital Watermarking of Images on FPGAs

  • Shehzeen Hussain
  • Nojan Sheybani
  • Paarth Neekhara
  • Xinqiao Zhang
  • Javier Duarte
  • Farinaz Koushanfar

Steganography and digital watermarking are the tasks of hiding recoverable data in image pixels. Deep neural network (DNN) based image steganography and watermarking techniques are quickly replacing traditional hand-engineered pipelines. DNN based watermarking techniques have drastically improved the message capacity, imperceptibility and robustness of the embedded watermarks. However, this improvement comes at the cost of increased computational overhead of the watermark encoder neural network. In this work, we design FastStamp, the first accelerator platform to perform DNN based steganography and digital watermarking of images in hardware. We first propose a parameter-efficient DNN model for embedding recoverable bit-strings in image pixels. Our proposed model can match the success metrics of prior state-of-the-art DNN based watermarking methods while being significantly faster and lighter in terms of memory footprint. We then design an FPGA based accelerator framework to further improve the model throughput and power consumption by leveraging data parallelism and customized computation paths. FastStamp allows embedding hardware signatures into images to establish media authenticity and ownership of digital media. Our best design achieves 68× faster inference compared to GPU implementations of prior DNN based watermark encoders while consuming less power.

SESSION: Novel Chiplet Approaches from Interconnect to System (Virtual)

Session details: Novel Chiplet Approaches from Interconnect to System (Virtual)

  • Xinfei Guo

GIA: A Reusable General Interposer Architecture for Agile Chiplet Integration

  • Fuping Li
  • Ying Wang
  • Yuanqing Cheng
  • Yujie Wang
  • Yinhe Han
  • Huawei Li
  • Xiaowei Li

2.5D chiplet technology is gaining popularity for its efficiency in integrating multiple heterogeneous dies or chiplets on interposers, and it is also considered an ideal option for agile silicon system design because it mitigates the huge design, verification, and manufacturing overhead of monolithic SoCs. Although chiplet reuse significantly reduces development costs, the design and fabrication of interposers introduce additional high non-recurring engineering (NRE) costs and development cycles, which might be prohibitive for application-specific designs with low volume.

To address this challenge, in this paper we propose a reusable general interposer architecture (GIA) to amortize the NRE costs and accelerate the integration flow of interposers across different chiplet-based systems. The proposed assembly-time configurable interposer architecture covers both active and passive interposers, considering the diverse applications of 2.5D systems. Agile interposer integration is further facilitated by a novel end-to-end design automation framework that generates optimal system assembly configurations, including the selection of chiplets, the inter-chiplet network configuration, the placement of chiplets, and the mapping onto GIA, specialized for a given target workload. The experimental results show that our proposed active GIA and passive GIA achieve 3.15× and 60.92× performance boosts with 2.57× and 2.99× power savings over the baselines, respectively.

Accelerating Cache Coherence in Manycore Processor through Silicon Photonic Chiplet

  • Chengeng Li
  • Fan Jiang
  • Shixi Chen
  • Jiaxu Zhang
  • Yinyi Liu
  • Yuxiang Fu
  • Jiang Xu

Cache coherence overhead in manycore systems is becoming prominent with the increase in system scale. However, traditional electrical networks restrict the efficiency of cache coherence transactions due to their limited bandwidth and long latency. Optical networks promise high bandwidth and low latency and support both efficient unicast and multicast transmission, which can potentially accelerate cache coherence in manycore systems. This work proposes PCCN, a novel photonic cache coherence network with a physically centralized, logically distributed directory for chiplet-based manycore systems. PCCN adopts a channel sharing method with a contention-resolution mechanism for efficient long-distance transmission of coherence-related packets. Experimental results show that, compared to state-of-the-art proposals, PCCN speeds up application execution time by 1.32×, reduces memory access latency by 26%, and improves energy efficiency by 1.26×, on average, in a 128-core system.

Re-LSM: A ReRAM-Based Processing-in-Memory Framework for LSM-Based Key-Value Store

  • Qian Wei
  • Zhaoyan Shen
  • Yiheng Tong
  • Zhiping Jia
  • Lei Ju
  • Jiezhi Chen
  • Bingzhe Li

Log-structured merge (LSM) tree based key-value (KV) stores organize writes into hierarchical batches for high-speed writing. However, the notorious compaction process of the LSM-tree severely hurts system performance. It not only involves huge I/O operations but also consumes tremendous computation and memory resources. In this paper, we first find that when compaction happens in the high levels of the LSM-tree (i.e., L0 and L1), it may saturate all system computation and memory resources and eventually stall the whole system. Based on this observation, we present Re-LSM, a ReRAM-based Processing-in-Memory (PIM) framework for LSM-based key-value stores. Specifically, in Re-LSM, we propose to offload certain computation- and memory-intensive tasks in the high levels of the LSM-tree to the ReRAM-based PIM space. A highly parallel ReRAM compaction accelerator is designed by decomposing the three-phase compaction into basic logic operating units. Evaluation results based on db_bench and YCSB show that Re-LSM achieves a 2.2× improvement in random-write throughput compared to RocksDB, and that the ReRAM-based compaction accelerator speeds up the CPU-based implementation by 64.3× and reduces energy by 25.5×.
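
For readers unfamiliar with compaction, the sketch below shows the core primitive being offloaded: a k-way merge of sorted runs from adjacent LSM-tree levels that keeps only the newest value per key. It is a simplified, single-threaded illustration of the operation that Re-LSM decomposes into ReRAM logic units; the run contents are placeholders.

```python
# k-way merge of sorted key-value runs, keeping the newest value per key.
import heapq

def compact(runs):
    """Merge sorted (key, value) runs into one run; runs[0] is the newest run,
    runs[-1] the oldest. Each run must be sorted by key with unique keys."""
    tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(runs)]
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:                      # first hit per key = newest version
            merged.append((key, value))
            last_key = key
    return merged

# Example: L0 (newest) overrides L1 for key "b".
l0 = [("a", 1), ("b", 9)]
l1 = [("b", 2), ("c", 3)]
print(compact([l0, l1]))   # [('a', 1), ('b', 9), ('c', 3)]
```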

SESSION: Architecture for DNN Acceleration (Virtual)

Session details: Architecture for DNN Acceleration (Virtual)

  • Zhezhi He

Hidden-ROM: A Compute-in-ROM Architecture to Deploy Large-Scale Neural Networks on Chip with Flexible and Scalable Post-Fabrication Task Transfer Capability

  • Yiming Chen
  • Guodong Yin
  • Mingyen Lee
  • Wenjun Tang
  • Zekun Yang
  • Yongpan Liu
  • Huazhong Yang
  • Xueqing Li

Motivated by reducing the data transfer activities in data-intensive neural network computing, SRAM-based compute-in-memory (CiM) has made significant progress. Unfortunately, SRAM has low density and limited on-chip capacity, which makes the deployment of large models inefficient due to the frequent DRAM accesses needed to update the weights in SRAM. Recently, a ROM-based CiM design, YOLoC, revealed the unique opportunity of deploying a large-scale neural network in CMOS by exploiting the intriguing high density of ROM. However, even though an assisting SRAM has been adopted in YOLoC for task transfer within the same domain, it is still a big challenge to overcome the read-only limitation of ROM and enable more flexibility. Therefore, it is of paramount significance to develop new ROM-based CiM architectures that provide broader task space and model expansion capability for more complex tasks.

This paper presents Hidden-ROM for high flexibility of ROM-based CiM. Hidden-ROM provides several novel ideas beyond YOLoC. First, it adopts a one-SRAM-many-ROM method that “hides” ROM cells to support various datasets of different domains, including CIFAR10/100, FER2013, and ImageNet. Second, Hidden-ROM provides the model expansion capability after chip fabrication to update the model for more complex tasks when needed. Experiments show that Hidden-ROM designed for ResNet-18 pretrained on CIFAR100 (item classification) can achieve <0.5% accuracy loss in FER2013 (facial expression recognition), while YOLoC degrades by >40%. After expanding to ResNet-50/101, Hidden-ROM even achieves 68.6%/72.3% accuracy in ImageNet, close to 74.9%/76.4% by software. Such expansion costs only 7.6%/12.7% energy efficiency overhead while providing 12%/16% accuracy improvement after expansion.

DCIM-GCN: Digital Computing-in-Memory to Efficiently Accelerate Graph Convolutional Networks

  • Yikan Qiu
  • Yufei Ma
  • Wentao Zhao
  • Meng Wu
  • Le Ye
  • Ru Huang

Computing-in-memory (CIM) is emerging as a promising architecture to accelerate graph convolutional networks (GCNs), which are normally bounded by redundant and irregular memory transactions. Current analog based CIM requires frequent analog and digital conversions (AD/DA) that dominate the overall area and power consumption. Furthermore, the analog non-ideality degrades the accuracy and reliability of CIM. In this work, an SRAM based digital CIM system, namely DCIM-GCN, is proposed to accelerate memory intensive GCNs, covering innovations from the CIM circuit level, which eliminates costly AD/DA converters, to the architecture level, which addresses the irregularity and sparsity of graph data. DCIM-GCN achieves 2.07×, 1.76×, and 1.89× speedup and 29.98×, 1.29×, and 3.73× energy efficiency improvement on average over CIM based PIMGCN, TARe, and PIM-GCN, respectively.

Hardware Computation Graph for DNN Accelerator Design Automation without Inter-PU Templates

  • Jun Li
  • Wei Wang
  • Wu-Jun Li

Existing deep neural network (DNN) accelerator design automation (ADA) methods adopt architecture templates to predetermine parts of the design choices and then explore the remaining design choices beyond the templates. These templates can be classified into intra-PU templates and inter-PU templates according to the architecture hierarchy. Since templates limit the flexibility of ADA, designing effective ADA methods without templates has become an important research topic. Although some works have enhanced the flexibility of ADA by removing intra-PU templates, to the best of our knowledge no existing work has studied ADA methods without inter-PU templates. ADA with predetermined inter-PU templates is typically inefficient in terms of resource utilization, especially for DNNs with complex topologies. In this paper, we propose a novel method, called hardware computation graph (HCG), for ADA without inter-PU templates. Experiments show that the HCG method achieves competitive latency while using 1.4×–5× less on-chip memory compared with existing state-of-the-art ADA methods.

SESSION: Multi-Purpose Fundamental Digital Design Improvements (Virtual)

Session details: Multi-Purpose Fundamental Digital Design Improvements (Virtual)

  • Sabya Das
  • Mondira Pant

Dynamic Frequency Boosting Beyond Critical Path Delay

  • Nikolaos Zompakis
  • Sotirios Xydis

This paper introduces an innovative post-implementation Dynamic Frequency Boosting (DFB) technique that releases the “hidden” performance margins of digital circuit designs currently suppressed by typical critical-path-constrained design flows, thus defining higher limits of operation speed. The proposed technique goes beyond the state of the art by exploiting data-driven path delay variability through an innovative hardware clocking mechanism that detects path activation in real time. In contrast to timing speculation, the operating speed is adjusted based on the activated paths' nominal delays, achieving error-free acceleration. The proposed technique has been evaluated on three FPGA-based use cases carefully selected to exhibit differing domain characteristics: i) a third-party DNN inference accelerator IP for CIFAR-10 images, achieving an average speedup of 18%; ii) a highly designer-optimized Optical Digital Equalizer design, in which DFB delivered a speedup of 50%; and iii) a set of 5 synthetic designs examining high-frequency (beyond 400 MHz) applications in FPGAs, achieving accelerations of 20–60% depending on the underlying path variability.

ASPPLN: Accelerated Symbolic Probability Propagation in Logic Network

  • Weihua Xiao
  • Weikang Qian

Probability propagation is an important task in logic network analysis, which propagates signal probabilities from the primary inputs of a network to its primary outputs. It has many applications, such as power estimation, reliability analysis, and error analysis for approximate circuits. Existing methods for the task can be divided into two categories: simulation-based and probability-based methods. However, most of them suffer from low accuracy or poor scalability. In this work, we propose ASPPLN, a method for accelerated symbolic probability propagation in logic networks, which has linear complexity in the network size. We first introduce a new graph notion called the redundant input and take advantage of it to simplify the propagation process without losing accuracy. Then, a technique called symbol limitation is proposed to limit the complexity of each node's propagation according to the partial probability significances of the symbols. Experimental results show that, compared to existing methods, ASPPLN improves the estimation accuracy of switching activity by up to 24.70%, while also achieving a speedup of up to 29×.
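
As a point of reference, the sketch below performs plain signal-probability propagation through a small gate-level network in topological order, assuming independent gate inputs; ASPPLN's symbolic propagation and symbol limitation refine exactly this kind of computation. The tiny netlist is an illustrative placeholder.

```python
# Signal-probability propagation under the independence assumption.
def propagate(gates, input_probs):
    """gates: list of (name, op, fanins) in topological order;
    input_probs: {primary_input: P(signal = 1)}."""
    p = dict(input_probs)
    for name, op, fanins in gates:
        probs = [p[f] for f in fanins]
        if op == "AND":
            v = 1.0
            for q in probs:
                v *= q
        elif op == "OR":
            v = 1.0
            for q in probs:
                v *= (1.0 - q)
            v = 1.0 - v
        elif op == "NOT":
            v = 1.0 - probs[0]
        else:
            raise ValueError(op)
        p[name] = v
    return p

netlist = [("n1", "AND", ["a", "b"]), ("n2", "NOT", ["c"]), ("out", "OR", ["n1", "n2"])]
print(propagate(netlist, {"a": 0.5, "b": 0.5, "c": 0.5}))
# out: 1 - (1 - 0.25) * (1 - 0.5) = 0.625
```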

A High-Precision Stochastic Solver for Steady-State Thermal Analysis with Fourier Heat Transfer Robin Boundary Conditions

  • Longlong Yang
  • Cuiyang Ding
  • Changhao Yan
  • Dian Zhou
  • Xuan Zeng

In this work, we propose a path integral random walk (PIRW) solver, the first accurate stochastic method for steady-state thermal analysis with mixed boundary conditions, especially Fourier heat transfer Robin boundary conditions. We innovatively adopt the strictly correct calculation of the local time and the Feynman-Kac functional to handle Neumann and Robin boundary conditions with high precision. Compared with ANSYS, experimental results show that PIRW achieves over 121× speedup and over 83× storage space reduction with a negligible error within 0.8 °C at a single point. An application combining PIRW with low-accuracy ANSYS for temperature calculation at hot-spots is provided as a more accurate and faster solution than using ANSYS alone.
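
The stochastic idea behind such solvers can be illustrated in a much simpler setting: on a uniform grid with fixed-temperature (Dirichlet) boundaries, the steady-state temperature at one node equals the expected boundary temperature reached by an unbiased nearest-neighbor random walk started there. The sketch below shows this single-point, mesh-free flavor; handling Neumann and Robin (Fourier heat transfer) boundaries, as PIRW does, additionally requires the local-time machinery described in the paper. Grid size and boundary values are assumptions for illustration.

```python
# Monte Carlo random-walk estimate of steady-state temperature at one grid node
# with Dirichlet boundaries (hot top edge, cool elsewhere).
import random

N = 32                                   # interior nodes per side (1..N)
def boundary_temp(x, y):
    return 100.0 if y > N else 25.0      # walk exited via the top edge, or elsewhere

def temperature_at(x0, y0, walks=5000):
    total = 0.0
    for _ in range(walks):
        x, y = x0, y0
        while 1 <= x <= N and 1 <= y <= N:          # still in the interior
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_temp(x, y)                # record the boundary value hit
    return total / walks

print(temperature_at(N // 2, N // 2))    # roughly 25 + 75/4 = 43.75 near the center
```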

SESSION: GPU Acceleration for Routing Algorithms (Virtual)

Session details: GPU Acceleration for Routing Algorithms (Virtual)

  • Umamaheswara Rao Tida

Superfast Full-Scale GPU-Accelerated Global Routing

  • Shiju Lin
  • Martin D. F. Wong

Global routing is an essential step in physical design. Recently, there have been works on accelerating global routers using GPUs. However, they focus only on certain stages of global routing and achieve limited overall speedup. In this paper, we present a superfast full-scale GPU-accelerated global router and introduce useful parallelization techniques for routing. Experiments show that our 3D router achieves both good quality and short runtime compared to other state-of-the-art academic global routers.

X-Check: GPU-Accelerated Design Rule Checking via Parallel Sweepline Algorithms

  • Zhuolun He
  • Yuzhe Ma
  • Bei Yu

Design rule checking (DRC) is essential in physical verification to ensure high yield and reliability for VLSI circuit designs. To achieve reasonable design cycle times, acceleration of computationally intensive DRC tasks is needed to accommodate the ever-growing complexity of modern VLSI circuits. In this paper, we propose X-Check, a GPU-accelerated design rule checker. X-Check integrates novel parallel sweepline algorithms, which are efficient in practice and carry nontrivial theoretical guarantees. Experimental results demonstrate the significant speedup achieved by X-Check compared with a multi-threaded CPU checker.
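
To show the kind of primitive being parallelized, the sketch below is a sequential sweepline that reports pairs of intersecting axis-aligned rectangles; many spacing checks reduce to such intersection tests after bloating shapes by the minimum spacing. X-Check's contribution lies in running such sweeps in parallel on the GPU with theoretical guarantees, which this serial illustration does not capture.

```python
# Sequential sweepline over x: maintain an active set of rectangles and test
# y-overlap whenever a new rectangle enters the sweep.
def overlapping_pairs(rects):
    """rects: list of (x_lo, y_lo, x_hi, y_hi); returns index pairs that intersect."""
    events = []
    for i, (xl, yl, xh, yh) in enumerate(rects):
        events.append((xl, 1, i))        # 1 = rectangle enters the sweep
        events.append((xh, 0, i))        # 0 = rectangle leaves (processed first on ties)
    events.sort()
    active, out = set(), []
    for _, kind, i in events:
        if kind == 0:
            active.discard(i)
            continue
        yl, yh = rects[i][1], rects[i][3]
        for j in active:                 # check y-overlap against the active set
            yl2, yh2 = rects[j][1], rects[j][3]
            if yl < yh2 and yl2 < yh:
                out.append((j, i))
        active.add(i)
    return out

print(overlapping_pairs([(0, 0, 4, 4), (3, 3, 6, 6), (10, 0, 12, 2)]))  # [(0, 1)]
```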

GPU-Accelerated Rectilinear Steiner Tree Generation

  • Zizheng Guo
  • Feng Gu
  • Yibo Lin

Rectilinear Steiner minimum tree (RSMT) generation is a fundamental component in the VLSI design automation flow. Due to its extensive usage in circuit design iterations at early design stages like synthesis, placement, and routing, the performance of RSMT generation is critical for a reasonable design turnaround time. State-of-the-art RSMT generation algorithms, like fast look-up table estimation (FLUTE), are constrained by CPU-based parallelism with limited runtime improvements. Accelerating RSMT generation on GPUs is an important yet difficult task due to its complex, recursive divide-and-conquer computation patterns. In this paper, we present the first GPU-accelerated RSMT generation algorithm based on FLUTE. By designing GPU-efficient data structures and levelized decomposition, table look-up, and merging operations, we incorporate large-scale data parallelism into the generation of Steiner trees. A runtime speedup of up to 10.47× is achieved compared with FLUTE running on 40 CPU cores, filling in a critical missing component in today's GPU-accelerated design automation frameworks.

SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist I (Virtual)

Session details: Breakthroughs in Synthesis – Infrastructure and ML Assist I (Virtual)

  • Christian Pilato
  • Miroslav Velev

HECTOR: A Multi-Level Intermediate Representation for Hardware Synthesis Methodologies

  • Ruifan Xu
  • Youwei Xiao
  • Jin Luo
  • Yun Liang

Hardware synthesis requires a complicated process to generate synthesizable register transfer level (RTL) code. High-level synthesis tools can automatically transform a high-level description into a hardware design, while hardware generators adopt domain-specific languages and synthesis flows for specific applications. The implementation of these tools generally requires substantial engineering effort due to RTL's weak expressivity and low level of abstraction. Furthermore, different synthesis tools adopt different levels of intermediate representation (IR) and different transformations. A unified IR is clearly a good way to lower the engineering cost and obtain competitive hardware designs rapidly by exploring different synthesis methodologies.

In this paper, we propose Hector, a two-level IR providing a unified intermediate representation for hardware synthesis methodologies. The high-level IR binds computation with a control graph annotated with timing information, while the low-level IR provides a concise way to describe hardware modules and elastic interconnections among them. Implemented based on the multi-level compiler infrastructure (MLIR), Hector’s IRs can be converted to synthesizable RTL designs. To demonstrate the expressivity and versatility, we implement three synthesis approaches based on Hector: a high-level synthesis (HLS) tool, a systolic array generator, and a hardware accelerator. The hardware generated by Hector’s HLS approach is comparable to that generated by the state-of-the-art HLS tools, and the other two cases outperform HLS implementations in performance and productivity.

QCIR: Pattern Matching Based Universal Quantum Circuit Rewriting Framework

  • Mingyu Chen
  • Yu Zhang
  • Yongshang Li
  • Zhen Wang
  • Jun Li
  • Xiangyang Li

Due to multiple limitations of quantum computers in the NISQ era, quantum compilation efforts are required to efficiently execute quantum algorithms on NISQ devices. Program rewriting based on pattern matching can improve the generalization ability of compiler optimization. However, it has rarely been explored for quantum circuit optimization, let alone with further consideration of the physical features of target devices.

In this paper, we propose QCIR, a pattern-matching based quantum circuit optimization framework with a novel pattern description format, enabling user-configured cost models and two categories of patterns, i.e., generic patterns and folding patterns. To reduce compilation latency, we propose a DAG representation of quantum circuits called QCir-DAG, together with the QVF algorithm for subcircuit matching. We implement a continuous single-qubit optimization pass constructed with QCIR, achieving 10% and 20% optimization rates for benchmarks from Qiskit and ScaffCC, respectively. The practicality of QCIR is demonstrated by execution times and experimental results on quantum simulators and quantum devices.

Batch Sequential Black-Box Optimization with Embedding Alignment Cells for Logic Synthesis

  • Chang Feng
  • Wenlong Lyu
  • Zhitang Chen
  • Junjie Ye
  • Mingxuan Yuan
  • Jianye Hao

During the logic synthesis flow of EDA, a sequence of graph transformation operators is applied to the circuit, so the Quality of Results (QoR) highly depends on the chosen operators and their specific parameters in the sequence, making the search space operator-dependent and exponentially large. In this paper, we formulate logic synthesis design space exploration as a conditional sequence optimization problem in which, at each transformation step, an optimization operator is selected and its corresponding parameters are decided. To solve this problem, we propose a novel sequential black-box optimization approach without human intervention: 1) Due to the conditional and sequential structure of operator sequences with variable length, we build a recurrent neural network with embedding alignment cells as a surrogate model to estimate the QoR of the logic synthesis flow from historical data. 2) With the surrogate model, we construct acquisition functions to balance exploration and exploitation with respect to each metric of the QoR. 3) We use a multi-objective optimization algorithm to find the Pareto front of the acquisition functions, along which a batch of sequences, consisting of parameterized operators, is (randomly) selected and given to users for evaluation under the budget of computing resources. We repeat the above three steps until convergence or the time limit. Experimental results on public EPFL benchmarks demonstrate the superiority of our approach over expert-crafted optimization flows and other machine learning based methods. Compared to resyn2, we achieve an 11.8% reduction in LUT-6 count without sacrificing level values.

Heterogeneous Graph Neural Network-Based Imitation Learning for Gate Sizing Acceleration

  • Xinyi Zhou
  • Junjie Ye
  • Chak-Wa Pui
  • Kun Shao
  • Guangliang Zhang
  • Bin Wang
  • Jianye Hao
  • Guangyong Chen
  • Pheng Ann Heng

Gate sizing is an important step in logic synthesis, where cells are resized to optimize metrics such as area, timing, power, and leakage. In this work, we consider the gate sizing problem for leakage power optimization under timing constraints. Lagrangian Relaxation is a widely employed optimization method for gate sizing problems. We accelerate Lagrangian Relaxation-based algorithms by narrowing down the range of cells to resize. In particular, we formulate a heterogeneous directed graph to represent the timing graph, propose a heterogeneous graph neural network as the encoder, and train it via imitation learning to mimic the selection behavior of each Lagrangian Relaxation iteration. This network is used to predict the set of cells that need to be changed during the optimization process of Lagrangian Relaxation. Experiments show that our accelerated gate sizer achieves performance comparable to the baseline with an average of 22.5% runtime reduction.

SESSION: Smart Search (Virtual)

Session details: Smart Search (Virtual)

  • Jianlei Yang

NASA: Neural Architecture Search and Acceleration for Hardware Inspired Hybrid Networks

  • Huihong Shi
  • Haoran You
  • Yang Zhao
  • Zhongfeng Wang
  • Yingyan Lin

Multiplication is arguably the most cost-dominant operation in modern deep neural networks (DNNs), limiting their achievable efficiency and thus more extensive deployment in resource-constrained applications. To tackle this limitation, pioneering works have developed handcrafted multiplication-free DNNs, which require expert knowledge and time-consuming manual iteration, calling for fast development tools. To this end, we propose a Neural Architecture Search and Acceleration framework dubbed NASA, which enables automated multiplication-reduced DNN development and integrates a dedicated multiplication-reduced accelerator to boost the achievable efficiency of DNNs. Specifically, NASA adopts neural architecture search (NAS) spaces that augment the state-of-the-art one with hardware-inspired multiplication-free operators, such as shift and adder, armed with a novel progressive pretrain strategy (PGP) and customized training recipes to automatically search for optimal multiplication-reduced DNNs. On top of that, NASA further develops a dedicated accelerator, which advocates a chunk-based template and an auto-mapper dedicated to DNNs resulting from NASA's NAS, to better leverage their algorithmic properties for boosting hardware efficiency. Experimental results and ablation studies consistently validate the advantages of NASA's algorithm-hardware co-design framework in terms of achievable accuracy and efficiency tradeoffs. Codes are available at https://github.com/shihuihong214/NASA.

Personalized Heterogeneity-Aware Federated Search Towards Better Accuracy and Energy Efficiency

  • Zhao Yang
  • Qingshuang Sun

Federated learning (FL), a new distributed learning technology, allows us to train a global model on edge and embedded devices without sharing local data. However, due to the wide distribution of different types of devices, FL faces severe heterogeneity issues. The accuracy and efficiency of FL deployment at the edge are severely impacted by heterogeneous data and heterogeneous systems. In this paper, we perform joint FL model personalization for heterogeneous systems and heterogeneous data to address the challenges posed by these heterogeneities. We begin by using model inference efficiency as a starting point to personalize the network scale on each node. This can further guide the FL training process, helping to ease the straggler problem and improve FL's energy efficiency. During FL training, federated search is then used to acquire highly accurate personalized network structures. By taking into account the unique characteristics of FL deployment on edge devices, the personalized network structures obtained by our federated search framework with a lightweight search controller achieve competitive accuracy with state-of-the-art (SOTA) methods, while reducing inference and training energy consumption by up to 3.57× and 1.82×, respectively.

SESSION: Reconfigurable Computing: Accelerators and Methodologies I (Virtual)

Session details: Reconfigurable Computing: Accelerators and Methodologies I (Virtual)

  • Cheng Tan

Towards High Performance and Accurate BNN Inference on FPGA with Structured Fine-Grained Pruning

  • Keqi Fu
  • Zhi Qi
  • Jiaxuan Cai
  • Xulong Shi

As the extreme case of quantization networks, Binary Neural Networks (BNNs) have received tremendous attention due to many hardware-friendly properties in terms of storage and computation. To reach the limit of compact models, we attempt to combine binarization with pruning techniques, further exploring the redundancy of BNNs. However, coarse-grained pruning methods may cause severe accuracy drops, while traditional fine-grained ones induce irregular sparsity that is hard to utilize in hardware. In this paper, we propose two advanced fine-grained BNN pruning modules, i.e., structured channel-wise kernel pruning and dynamic spatial pruning, from a joint algorithm/hardware perspective. The pruned BNN models are trained from scratch and present not only higher precision but also a high degree of parallelism. We then develop an accelerator architecture that can effectively exploit the sparsity produced by our algorithm. Finally, we implement the pruned BNN models on an embedded FPGA (Ultra96v2). The results show that our software/hardware codesign achieves a 5.4× inference speedup over the baseline BNN, with higher resource and energy efficiency compared with prior FPGA-implemented BNN works.

Towards High-Quality CGRA Mapping with Graph Neural Networks and Reinforcement Learning

  • Yan Zhuang
  • Zhihao Zhang
  • Dajiang Liu

Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising solution to accelerate domain applications due to their good combination of energy efficiency and flexibility. Loops, as the computation-intensive parts of applications, are often mapped onto CGRAs, and modulo scheduling is commonly used to improve execution performance. However, the actual performance of modulo scheduling is highly dependent on the mapping ability of the Data Dependency Graph (DDG) extracted from a loop. As existing approaches usually separate the routing exploration of multi-cycle dependences from mapping for fast compilation, they easily suffer from poor mapping quality. In this paper, we integrate routing exploration into the mapping process, giving it more opportunities to find a globally optimized solution. Meanwhile, with a reduced resource graph defined, the search space of the new mapping problem does not grow greatly. To efficiently solve the problem, we introduce graph neural network based reinforcement learning to predict a placement distribution over the different resource nodes for all operations in a DDG. Using routing connectivity as the reward signal, we optimize the parameters of the neural network to find a valid mapping solution with a policy gradient method. Without much engineering and heuristic design, our approach achieves 1.57× better mapping quality compared to the state-of-the-art heuristic.

SESSION: Hardware Security: Attacks and Countermeasures (Virtual)

Session details: Hardware Security: Attacks and Countermeasures (Virtual)

  • Johann Knechtel
  • Lejla Batina

Attack Directories on ARM big.LITTLE Processors

  • Zili Kou
  • Sharad Sinha
  • Wenjian He
  • Wei Zhang

Eviction-based cache side-channel attacks take advantage of inclusive cache hierarchies and shared cache hardware. Processors with the ARM big.LITTLE architecture do not guarantee such preconditions and therefore do not usually allow cross-core attacks, let alone cross-cluster attacks. This work reveals a new side-channel based on the snoop filter (SF), an unexplored directory structure embedded in ARM big.LITTLE processors. Our systematic reverse engineering unveils the undocumented structure and properties of the SF, and we successfully utilize it to bootstrap cross-core and cross-cluster cache eviction. We demonstrate a comprehensive methodology to exploit the SF side-channel, including the construction of eviction sets, a covert channel, and attacks against RSA and AES. When attacking TrustZone, we conduct an interrupt-based side-channel attack to extract the RSA key with a single profiling trace, despite the strict cache-clean defense. Supported by detailed experiments, the SF side-channel not only achieves competitive performance but also overcomes the main challenge of cache side-channel attacks on ARM big.LITTLE processors.

AntiSIFA-CAD: A Framework to Thwart SIFA at the Layout Level

  • Rajat Sadhukhan
  • Sayandeep Saha
  • Debdeep Mukhopadhyay

Fault Attacks (FA) have gained a lot of attention from both industry and academia due to their practicality and wide applicability to different domains of computing. In the context of symmetric-key cryptography, designing countermeasures against FA is still an open problem. Recently proposed attacks such as Statistical Ineffective Fault Analysis (SIFA) have shown that merely adding redundancy or infection-based countermeasures to detect the fault does not work, and that a proper combination of masking and error correction/detection is required. In this work, we show that masking, which is mathematically established as a good countermeasure against a certain class of SIFA faults, may in practice fall short if low-level details during physical design layout development are not taken care of. We initiate this study by demonstrating a successful SIFA attack on a placed-and-routed masked crypto design for an ASIC platform. We then propose a fully automated approach, along with a proper choice of placement constraints, which can be realized easily with any commercial CAD tool to eliminate this vulnerability during the physical layout development process. Experimental validation of our tool flow on a masked implementation of the PRESENT cipher establishes our claim.

SESSION: Advanced VLSI Routing and Layout Learning

Session details: Advanced VLSI Routing and Layout Learning

  • Wing-Kai Chow
  • David Chinnery

A Stochastic Approach to Handle Non-Determinism in Deep Learning-Based Design Rule Violation Predictions

  • Rongjian Liang
  • Hua Xiang
  • Jinwook Jung
  • Jiang Hu
  • Gi-Joon Nam

Deep learning is a promising approach to early DRV (Design Rule Violation) prediction. However, non-deterministic parallel routing hampers model training and degrades prediction accuracy. In this work, we propose a stochastic approach, called LGC-Net, to solve this problem. In this approach, we develop new techniques, a Gaussian random field layer and a focal likelihood loss function, to seamlessly integrate the Log Gaussian Cox process with deep learning. The approach provides not only statistical regression results but also classification results with different thresholds, without retraining. Experimental results with noisy training data on industrial designs demonstrate that LGC-Net achieves significantly better DRV density prediction accuracy than prior art.

Obstacle-Avoiding Multiple Redistribution Layer Routing with Irregular Structures

  • Yen-Ting Chen
  • Yao-Wen Chang

In advanced packages, redistribution layers (RDLs) are extra metal layers that provide dense interconnections between the chips and the printed circuit board (PCB). To better utilize the routing resources of RDLs, published works adopt flexible vias that can be placed anywhere. Furthermore, some regions may be blocked for signal integrity protection or by manually prerouted nets (such as power/ground nets or antenna feed lines) to achieve higher performance. These blocked regions are treated as obstacles in the routing process. Since the positions of pads, obstacles, and vias can be arbitrary, the structures of RDLs become irregular. The obstacles and irregular structures substantially increase the difficulty of routing. This paper proposes a three-stage algorithm: First, the layout is partitioned by a method based on constrained Delaunay triangulation (CDT). Then, we present a global routing graph model and generate routing guides for unified-assignment netlists. Finally, a novel tile routing method is developed to obtain detailed routes. Experimental results demonstrate the robustness and effectiveness of our proposed algorithm.

TAG: Learning Circuit Spatial Embedding from Layouts

  • Keren Zhu
  • Hao Chen
  • Walker J. Turner
  • George F. Kokai
  • Po-Hsuan Wei
  • David Z. Pan
  • Haoxing Ren

Analog and mixed-signal (AMS) circuit designs still rely on human design expertise. Machine learning has been assisting circuit design automation by replacing human experience with artificial intelligence. This paper presents TAG, a new paradigm of learning the circuit representation from layouts leveraging Text, self Attention and Graph. The embedding network model learns spatial information without manual labeling. We introduce text embedding and a self-attention mechanism to AMS circuit learning. Experimental results demonstrate the ability to predict layout distances between instances with industrial FinFET technology benchmarks. The effectiveness of the circuit representation is verified by showing the transferability to three other learning tasks with limited data in the case studies: layout matching prediction, wirelength estimation, and net parasitic capacitance prediction.

SESSION: Physical Attacks and Countermeasures

Session details: Physical Attacks and Countermeasures

  • Satwik Patnaik
  • Gang Qu

PowerTouch: A Security Objective-Guided Automation Framework for Generating Wired Ghost Touch Attacks on Touchscreens

  • Huifeng Zhu
  • Zhiyuan Yu
  • Weidong Cao
  • Ning Zhang
  • Xuan Zhang

Wired ghost touch attacks are an emerging and severe threat against modern touchscreens. Attackers can make touchscreens falsely report nonexistent touches (i.e., ghost touches) by injecting common-mode noise (CMN) into the target devices via power cables. Existing attacks rely on reverse-engineering the touchscreens and then manually crafting CMN waveforms to control the types and locations of ghost touches. Although successful, they are limited in practicality and attack capability due to the touchscreens' black-box nature and the immense search space of attack parameters. To overcome these limitations, this paper presents PowerTouch, a framework that can automatically generate wired ghost touch attacks. We adopt a software-hardware co-design approach and propose a domain-specific genetic algorithm-based method tailored to the characteristics of the CMN waveform. Based on the security objectives, our framework automatically optimizes the CMN waveform towards injecting the desired type of ghost touches into regions specified by attackers. The effectiveness of PowerTouch is demonstrated by successfully launching attacks on touchscreen devices from two different brands under nine different objectives. Compared with the state-of-the-art attack, we are the first to control taps along an extra dimension and to inject swipes in both dimensions. We can place an average of 84.2% of taps on the targeted side of the screen, with the location error in the other dimension no more than 1.53 mm. An average of 94.5% of injected swipes have the correct direction. A quantitative comparison with the state-of-the-art method shows that PowerTouch achieves better attack performance.

A Combined Logical and Physical Attack on Logic Obfuscation

  • Michael Zuzak
  • Yuntao Liu
  • Isaac McDaniel
  • Ankur Srivastava

Logic obfuscation protects integrated circuits from an untrusted foundry attacker during manufacturing. To counter obfuscation, a number of logical (e.g. Boolean satisfiability) and physical (e.g. electro-optical probing) attacks have been proposed. By definition, these attacks use only a subset of the information leaked by a circuit to unlock it. Countermeasures often exploit the resulting blind-spots to thwart these attacks, limiting their scalability and generalizability. To overcome this, we propose a combined logical and physical attack against obfuscation called the CLAP attack. The CLAP attack leverages both the logical and physical properties of a locked circuit to prune the keyspace in a unified and theoretically-rigorous fashion, resulting in a more versatile and potent attack. To formulate the physical portion of the CLAP attack, we derive a logical formulation that provably identifies input sequences capable of sensitizing logically expressive regions in a circuit. We prove that electro-optically probing these regions infers portions of the key. For the logical portion of the attack, we integrate the physical attack results into a Boolean satisfiability attack to find the correct key. We evaluate the CLAP attack by launching it against four obfuscation schemes in benchmark circuits. The physical portion of the attack fully specified 60.6% of key bits and partially specified another 10.3%. The logical portion of the attack found the correct key in the physical-attack-limited keyspace in under 30 minutes. Thus, the CLAP attack unlocked each circuit despite obfuscation.

A Pragmatic Methodology for Blind Hardware Trojan Insertion in Finalized Layouts

  • Alexander Hepp
  • Tiago Perez
  • Samuel Pagliarini
  • Georg Sigl

A potential vulnerability for integrated circuits (ICs) is the insertion of hardware trojans (HTs) during manufacturing. Understanding the practicability of such an attack can lead to appropriate measures for mitigating it. In this paper, we demonstrate a pragmatic framework for analyzing the HT susceptibility of finalized layouts. Our framework is representative of a fabrication-time attack, where the adversary is assumed to have access only to a layout representation of the circuit. The framework inserts trojans into tapeout-ready layouts utilizing an Engineering Change Order (ECO) flow. The attacked security nodes are blindly searched utilizing reverse-engineering techniques. For our experimental investigation, we utilized three crypto cores (AES-128, SHA-256, and RSA) and a microcontroller (RISC-V) as targets. We explored 96 combinations of triggers, payloads, and targets for our framework. Our findings demonstrate that even in high-density designs, the covert insertion of sophisticated trojans is possible, all while maintaining the original target logic with minimal impact on power and performance. Furthermore, from our exploration, we conclude that it is too naive to utilize only placement resources as a metric for HT vulnerability. This work highlights that HT insertion success is a complex function of the placement, routing resources, the position of the attacked nodes, and further design-specific characteristics. As a result, our framework goes beyond just an attack: we present the most advanced analysis tool to assess the vulnerability of finalized layouts to HT insertion.

SESSION: Tutorial: Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

Session details: Tutorial: Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

  • Rolf Drechsler

Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

  • Rolf Drechsler
  • Alireza Mahzoon

Recently, a lot of effort has been put into developing formal verification approaches by both academic and industrial research. In practice, these techniques often give satisfactory results for some types of circuits, while they fail for others. A major challenge in this domain is that verification techniques suffer from unpredictable performance. The only way to overcome this challenge is to calculate bounds on the space and time complexities. If a verification method has polynomial space and time complexities, scalability can be guaranteed.

In this tutorial paper, we review recent developments in formal verification techniques and give a comprehensive overview of Polynomial Formal Verification (PFV). In PFV, polynomial upper bounds for the run-time and memory needed during the entire verification task hold. Thus, correctness under resource constraints can be ensured. We discuss the importance and advantages of PFV in the design flow. Formal methods on the bit-level and the word-level, and their complexities when used to verify different types of circuits, like adders, multipliers, or ALUs are presented. The current status of this new research field and directions for future work are discussed.

SESSION: Scalable Verification Technologies

Session details: Scalable Verification Technologies

  • Viraphol Chaiyakul
  • Alex Orailoglu

Arjun: An Efficient Independent Support Computation Technique and its Applications to Counting and Sampling

  • Mate Soos
  • Kuldeep S. Meel

Given a Boolean formula ϕ over a set of variables X and a projection set P ⊆ X, a set I ⊆ P is an independent support of P if any two solutions of ϕ that agree on I also agree on P. The notion of independent support is related to the classical notion of definability dating back to 1901 and has been studied over the decades. Recently, the computational problem of determining an independent support for a given formula has attained importance owing to the crucial role of independent support in hashing-based counting and sampling techniques.
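
A tiny worked example (ours, not from the paper) may help make the definition concrete:

```latex
% For \varphi = (c \leftrightarrow (a \oplus b)) over X = \{a,b,c\} and
% projection set P = \{a,b,c\}, the set I = \{a,b\} is an independent support
% of P: once a and b are fixed, c is forced to equal a \oplus b, so any two
% solutions of \varphi that agree on I also agree on all of P.
\varphi = \bigl(c \leftrightarrow (a \oplus b)\bigr), \qquad
P = \{a,b,c\}, \qquad I = \{a,b\} \subseteq P .
```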

In this paper, we design an efficient and scalable independent support computation technique that can handle formulas arising from real-world benchmarks. Our algorithmic framework, called Arjun, employs implicit and explicit definability notions and is based on a tight integration of gate-identification techniques and an assumption-based framework. We demonstrate that augmenting the state-of-the-art model counter ApproxMC4 and sampler UniGen3 with Arjun leads to significant performance improvements. In particular, ApproxMC4 augmented with Arjun counts 576 more benchmarks out of 1896, while UniGen3 augmented with Arjun samples 335 more benchmarks within the same time limit.

Compositional Verification Using a Formal Component and Interface Specification

  • Yue Xing
  • Huaixi Lu
  • Aarti Gupta
  • Sharad Malik

Property-based specification such as SystemVerilog Assertions (SVA) uses mathematical logic to specify the temporal behavior of RTL designs which can then be formally verified using model checking algorithms. These properties are specified for a single component (which may contain other components in the design hierarchy). Composing design components that have already been verified requires additional verification since incorrect communication at their interface may invalidate the properties that have been checked for the individual components. This paper focuses on a specification for their interface which can be checked individually for each component, and which guarantees that refinement-based properties checked for each component continue to hold after their composition. We do this in the setting of the Instruction-level Abstraction (ILA) specification and verification methodology. The ILA methodology provides a uniform specification for processors, accelerators and general modules at the instruction-level, and the automatic generation of a complete set of correctness properties for checking that the RTL model is a refinement of the ILA specification. We add an interface specification to model the inter-ILA communication. Further, we use our interface specification to generate a set of interface checking properties that check that the communication between the RTL components is correct. This provides the following guarantee: if each RTL component is a refinement of its ILA specification and the interface checks pass, then the RTL composition is a refinement of the ILA composition. We have applied the proposed methodology to six case studies including parts of large-scale designs such as parts of the FlexASR and NVDLA machine learning accelerators, demonstrating the practical applicability of our method.

Usage-Based RTL Subsetting for Hardware Accelerators

  • Qinhan Tan
  • Aarti Gupta
  • Sharad Malik

Recent years have witnessed increasing use of domain-specific accelerators in computing platforms to provide power-performance efficiency for emerging applications. To increase their applicability within the domain, these accelerators tend to support a large set of functions, e.g. Nvidia’s open-source Deep Learning Accelerator, NVDLA, supports five distinct groups of functions [17]. However, an individual use case of an accelerator may utilize only a subset of these functions. The unused functions lead to unnecessary overhead of silicon area, power, and hardware verification/hardware-software co-verification complexity. This motivates our research question: Given an RTL design for an accelerator and a subset of functions of interest, can we automatically extract a subset of the RTL that is sufficient for these functions and sequentially equivalent to the original RTL? We call this the Usage-based RTL Subsetting problem, referred to as the RTL subsetting problem in short. We first formally define this problem and show that it can be formulated as a program synthesis problem, which can be solved by performing expensive hyperproperty checks. To overcome the high cost, we propose multiple levels of sound over-approximations to construct an effective algorithm based on relatively less expensive temporal property checking and taint analysis for information flow checking. We demonstrate the acceptable computation cost and the quality of the results of our algorithm through several case studies of accelerators from different domains. The applicability of our proposed algorithm can be seen in its ability to subset the large NVDLA accelerator (with over 50,000 registers and 1,600,000 gates) for the group of convolution functions, where the subset reduces the total number of registers by 18.6% and the total number of gates by 37.1%.

SESSION: Optimizing Digital Design Aspects: From Gate Sizing to Multi-Bit Flip-Flops

Session details: Optimizing Digital Design Aspects: From Gate Sizing to Multi-Bit Flip-Flops

  • Amit Gupta
  • Kerim Kalafala

TransSizer: A Novel Transformer-Based Fast Gate Sizer

  • Siddhartha Nath
  • Geraldo Pradipta
  • Corey Hu
  • Tian Yang
  • Brucek Khailany
  • Haoxing Ren

Gate sizing is a fundamental netlist optimization move, and researchers have used supervised learning-based models in gate sizers. Recently, Reinforcement Learning (RL) has been tried for sizing gates (and other EDA optimization problems) but is very runtime-intensive. In this work, we explore a novel Transformer-based gate sizer, TransSizer, to directly generate optimized gate sizes given a placed and unoptimized netlist. TransSizer is trained on datasets obtained from real tapeout-quality industrial designs in a foundry 5nm technology node. Our results indicate that TransSizer achieves 97% accuracy in predicting optimized gate sizes at the post-route optimization stage. Furthermore, TransSizer offers a speedup of ~1400× while delivering similar timing, power and area metrics when compared to a leading-edge commercial tool for sizing-only optimization.

Generation of Mixed-Driving Multi-Bit Flip-Flops for Power Optimization

  • Meng-Yun Liu
  • Yu-Cheng Lai
  • Wai-Kei Mak
  • Ting-Chi Wang

Multi-bit flip-flops (MBFFs) are often used to reduce the number of clock sinks, resulting in a low-power design. A traditional MBFF is composed of individual FFs of uniform driving strength. However, if some but not all of the bits of an MBFF violate timing constraints, the MBFF has to be sized up or decomposed into smaller bit-width combinations to satisfy timing, which reduces the power saving. In this paper, we present a new MBFF generation approach considering mixed-driving MBFFs, in which certain bits have a higher driving strength than the others. To maximize the FF merging rate (and hence minimize the final number of clock sinks), our approach first performs aggressive FF merging subject to timing constraints. Our merging is aggressive in the sense that we are willing to oversize some FFs and allow empty bits in an MBFF in order to merge FFs into MBFFs of uniform driving strength as much as possible. The oversized individual FFs of an MBFF are later downsized subject to timing constraints, which results in a mixed-driving MBFF. Our MBFF generation approach has been combined with a commercial place-and-route tool, and our experimental results show the superiority of our approach over a prior work that considers only uniform-driving MBFFs, in terms of clock sink count, FF power, clock buffer count, and routed clock wirelength.

DEEP: Developing Extremely Efficient Runtime On-Chip Power Meters

  • Zhiyao Xie
  • Shiyu Li
  • Mingyuan Ma
  • Chen-Chia Chang
  • Jingyu Pan
  • Yiran Chen
  • Jiang Hu

Accurate and efficient on-chip power modeling is crucial to runtime power, energy, and voltage management. Such power monitoring can be achieved by designing and integrating on-chip power meters (OPMs) into the target design. In this work, we propose a new method named DEEP to automatically develop extremely efficient OPM solutions for a given design. DEEP selects OPM inputs from all individual bits in RTL signals. Such bit-level selection provides an unprecedentedly large number of input candidates and supports lower hardware cost, compared with the signal-level selection in prior works. In addition, DEEP proposes a powerful two-step OPM input selection method, and it supports reporting both total power and the power of major design components. Experiments on a commercial microprocessor demonstrate that DEEP’s OPM solution achieves correlation R > 0.97 in per-cycle power prediction with an unprecedentedly low area overhead in hardware, i.e., < 0.1% of the microprocessor layout. This reduces the OPM hardware cost by 4-6× compared with the state-of-the-art solution.
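
As a rough illustration of the bit-level input-selection idea (a sketch, not the DEEP algorithm itself), the snippet below ranks individual RTL signal bits by their correlation with per-cycle power and fits a small linear proxy model; the function names and the correlation-ranking heuristic are assumptions made for this example.

```python
# Illustrative sketch only: selecting individual RTL signal bits as inputs to a
# tiny linear per-cycle power model. The selection heuristic (correlation
# ranking) and all names here are assumptions, not the paper's method.
import numpy as np

def select_bits_and_fit(bit_traces, power, k=16):
    """bit_traces: (cycles, num_bits) 0/1 matrix of per-cycle RTL bit values.
    power: (cycles,) per-cycle power from a reference power analysis.
    Returns the indices of the selected bits and the linear model weights."""
    # Rank candidate bits by absolute Pearson correlation with per-cycle power.
    centered = bit_traces - bit_traces.mean(axis=0)
    p_centered = power - power.mean()
    denom = centered.std(axis=0) * p_centered.std() + 1e-12
    corr = np.abs((centered * p_centered[:, None]).mean(axis=0) / denom)
    selected = np.argsort(corr)[-k:]
    # Fit a least-squares linear model on the selected bits plus a bias term.
    X = np.hstack([bit_traces[:, selected], np.ones((len(power), 1))])
    weights, *_ = np.linalg.lstsq(X, power, rcond=None)
    return selected, weights
```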

SESSION: Energy Efficient Hardware Acceleration and Stochastic Computing

Session details: Energy Efficient Hardware Acceleration and Stochastic Computing

  • Sunil Khatri
  • Anish Krishnakumar

ReD-LUT: Reconfigurable In-DRAM LUTs Enabling Massive Parallel Computation

  • Ranyang Zhou
  • Arman Roohi
  • Durga Misra
  • Shaahin Angizi

In this paper, we propose a reconfigurable processing-in-DRAM architecture named ReD-LUT that leverages the high density of commodity main memory to enable flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication, division, etc.) via memory read operations only. In addition, ReD-LUT enables bulk bit-wise in-memory logic by elevating the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line, beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally intensive applications, i.e., low-precision deep learning acceleration and Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with the GPU and achieves ~37.8× speedup and 2.1× energy efficiency over the best in-DRAM bit-wise accelerators. As for AES data encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.
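
To make the LUT-query idea concrete, the software analogue below answers a low-precision multiplication purely from a precomputed table, mirroring how arithmetic is served by memory reads; the operand width and table layout are illustrative assumptions, not ReD-LUT's in-DRAM organization.

```python
# Software analogue of serving arithmetic via lookup-table reads (illustrative
# assumptions: 4-bit operands, a flat Python table standing in for DRAM rows).
BITS = 4  # 4-bit operands -> a 256-entry product table

# Precompute the table once; in ReD-LUT this content lives inside DRAM sub-arrays.
MUL_LUT = [[a * b for b in range(1 << BITS)] for a in range(1 << BITS)]

def lut_mul(a, b):
    """Multiply two 4-bit operands with a table read instead of arithmetic."""
    return MUL_LUT[a & 0xF][b & 0xF]

assert lut_mul(7, 9) == 63
```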

Sparse-T: Hardware Accelerator Thread for Unstructured Sparse Data Processing

  • Pranathi Vasireddy
  • Krishna Kavi
  • Gayatri Mehta

Sparse matrix-dense vector (SpMV) multiplication is inherent in most scientific computing, neural network, and machine learning algorithms. To efficiently exploit the sparsity of data in SpMV computations, several compressed data representations have been used. However, compressed representations of sparse data incur overheads for locating nonzero values, requiring indirect memory accesses that increase instruction count and memory access delays. We refer to these translations of compressed representations as metadata processing. We propose a memory-side accelerator that performs the metadata (or indexing) computations and supplies only the required nonzero values to the processor, additionally permitting an overlap of indexing with core computations on nonzero elements. We target our accelerator at low-end micro-controllers with very limited memory and processing capabilities. In this paper we explore two dedicated ASIC designs of the proposed accelerator that handle the indexed memory accesses for the compressed sparse row (CSR) format, working alongside a simple RISC-like programmable core. One version of the accelerator supplies only the vector values corresponding to nonzero matrix values, and the second version supplies both the nonzero matrix values and the matching vector values for SpMV computations. Our experiments show speedups ranging between 1.3 and 2.1 times for SpMV at different levels of sparsity. Our accelerator also yields energy savings ranging between 15.8% and 52.7% over different matrix sizes, compared to a baseline system in which the primary RISC-V core performs all computations. We use smaller synthetic matrices with different sparsity levels and larger real-world matrices with higher sparsity (below 1% non-zeros) in our experimental evaluations.
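
For readers unfamiliar with CSR, the reference routine below shows the row-pointer/column-index indirection (the "metadata processing") that Sparse-T offloads to its memory-side accelerator; it is a plain software illustration of the format, not the accelerator itself.

```python
# Reference CSR sparse-matrix / dense-vector product, highlighting the indirect
# accesses (row_ptr and col_idx) that constitute the metadata processing.
def csr_spmv(row_ptr, col_idx, values, x):
    """y = A @ x with A stored in compressed sparse row (CSR) form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(row_ptr) - 1):
        # Indirect accesses: col_idx tells us which entries of x to fetch.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# 3x3 example: [[1,0,2],[0,0,3],[4,5,0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```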

Sound Source Localization Using Stochastic Computing

  • Peter Schober
  • Seyedeh Newsha Estiri
  • Sercan Aygun
  • Nima TaheriNejad
  • M. Hassan Najafi

Stochastic computing (SC) is an alternative computing paradigm that processes data in the form of long uniform bit-streams rather than conventional compact weighted binary numbers. SC is fault-tolerant and can compute on small, efficient circuits, promising advantages over conventional arithmetic for smaller computer chips. SC has been primarily used in scientific research, not in practical applications. Digital sound source localization (SSL) is a useful signal processing technique that locates speakers using multiple microphones in cell phones, laptops, and other voice-controlled devices. SC has not been integrated into SSL in practice or theory. In this work, for the first time to the best of our knowledge, we implement an SSL algorithm in the stochastic domain and develop a functional SC-based sound source localizer. The developed design can replace the conventional design of the algorithm. The practical part of this work shows that the proposed stochastic circuit does not rely on conventional analog-to-digital conversion and can process data in the form of pulse-width-modulated (PWM) signals. The proposed SC design consumes up to 39% less area than the conventional baseline design. The SC-based design can consume less power depending on the computational accuracy, for example, 6% less power consumption for 3-bit inputs. The presented stochastic circuit is not limited to SSL and is readily applicable to other practical applications such as radar ranging, wireless location, sonar direction finding, beamforming, and sensor calibration.

SESSION: Special Session: Approximate Computing and the Efficient Machine Learning Expedition

Session details: Special Session: Approximate Computing and the Efficient Machine Learning Expedition

  • Mehdi Tahoori

Approximate Computing and the Efficient Machine Learning Expedition

  • Jörg Henkel
  • Hai Li
  • Anand Raghunathan
  • Mehdi B. Tahoori
  • Swagath Venkataramani
  • Xiaoxuan Yang
  • Georgios Zervakis

Approximate computing (AxC) has long been accepted as a design alternative for efficient system implementation at the cost of relaxed accuracy requirements. Despite AxC research activities in various application domains, AxC has thrived over the past decade when applied to Machine Learning (ML). The inherently approximate nature of ML models, together with the increased computational overheads associated with ML applications (which are effectively mitigated by corresponding approximations), led to a perfect match and a fruitful synergy. AxC for AI/ML has transcended academic prototypes. In this work, we highlight the synergistic nature of AxC and ML and elucidate the impact of AxC in designing efficient ML systems. To that end, we present an overview and taxonomy of AxC for ML and use two descriptive application scenarios to demonstrate how AxC boosts the efficiency of ML systems.

SESSION: Co-Search Methods and Tools

Session details: Co-Search Methods and Tools

  • Cunxi Yu
  • Yingyan “Celine” Lin

ObfuNAS: A Neural Architecture Search-Based DNN Obfuscation Approach

  • Tong Zhou
  • Shaolei Ren
  • Xiaolin Xu

Malicious architecture extraction has been emerging as a crucial concern for deep neural network (DNN) security. As a defense, architecture obfuscation is proposed to remap the victim DNN to a different architecture. Nonetheless, we observe that, with only extracting an obfuscated DNN architecture, the adversary can still retrain a substitute model with high performance (e.g., accuracy), rendering the obfuscation techniques ineffective. To mitigate this under-explored vulnerability, we propose ObfuNAS, which converts the DNN architecture obfuscation into a neural architecture search (NAS) problem. Using a combination of function-preserving obfuscation strategies, ObfuNAS ensures that the obfuscated DNN architecture can only achieve lower accuracy than the victim. We validate the performance of ObfuNAS with open-source architecture datasets like NAS-Bench-101 and NAS-Bench-301. The experimental results demonstrate that ObfuNAS can successfully find the optimal mask for a victim model within a given FLOPs constraint, leading up to 2.6% inference accuracy degradation for attackers with only 0.14× FLOPs overhead. The code is available at: https://github.com/Tongzhou0101/ObfuNAS.

Deep Learning Toolkit-Accelerated Analytical Co-Optimization of CNN Hardware and Dataflow

  • Rongjian Liang
  • Jianfeng Song
  • Bo Yuan
  • Jiang Hu

The continuous growth of CNN complexity not only intensifies the need for hardware acceleration but also presents a huge challenge. That is, the solution space for CNN hardware design and dataflow mapping becomes enormously large, in addition to being discrete and lacking a well-behaved structure. Most previous works are either stochastic metaheuristics, such as genetic algorithms, which are typically very slow for solving large problems, or rely on expensive sampling, e.g., Gumbel-Softmax-based differentiable optimization and Bayesian optimization. We propose an analytical model for evaluating the power and performance of CNN hardware design and dataflow solutions. Based on this model, we introduce a co-optimization method consisting of nonlinear programming and parallel local search. A key innovation in this model is its matrix form, which enables the use of deep learning toolkits for highly efficient computation of power/performance values and gradients in the optimization. In handling the power-performance tradeoff, our method can lead to better solutions than minimizing a weighted sum of power and latency. The average relative error of our model compared with Timeloop is as small as 1%. Compared to state-of-the-art methods, our approach achieves solutions with up to 1.7× shorter inference latency, 37.5% less power consumption, and 3× less area on ResNet-18. Moreover, it provides a 6.2× speedup in optimization runtime.
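
A minimal sketch of the matrix-form idea, assuming a toy latency/power model rather than the paper's actual one: once the analytical model is expressed in tensor operations, a deep learning toolkit supplies the gradients for free, and a gradient-based optimizer can co-tune a continuous relaxation of the design knobs.

```python
# Toy illustration (assumed model, not the paper's): an analytical cost model
# written with tensor ops so autograd provides gradients for optimization.
import torch

# Continuous relaxation of design knobs, e.g., PE-array dims and buffer size.
knobs = torch.tensor([32.0, 32.0, 1024.0], requires_grad=True)

def cost(knobs, alpha=0.5):
    pe_x, pe_y, buf = knobs
    macs = 1.0e9                                    # workload MACs (toy constant)
    latency = macs / (pe_x * pe_y) + 1.0e3 / buf    # compute + buffer-refill terms
    power = 0.01 * pe_x * pe_y + 1.0e-4 * buf       # array + buffer power terms
    return alpha * latency + (1 - alpha) * power

opt = torch.optim.Adam([knobs], lr=1.0)
for _ in range(200):                                # gradient-based co-optimization
    opt.zero_grad()
    cost(knobs).backward()
    opt.step()
    knobs.data.clamp_(min=1.0)                      # keep knobs physically valid
print(knobs.detach())
```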

HDTorch: Accelerating Hyperdimensional Computing with GP-GPUs for Design Space Exploration

  • William Andrew Simon
  • Una Pale
  • Tomas Teijeiro
  • David Atienza

The HyperDimensional Computing (HDC) Machine Learning (ML) paradigm is highly interesting for applications involving continuous, semi-supervised learning for long-term monitoring. However, its accuracy is not yet on par with other ML approaches, necessitating frameworks enabling fast HDC algorithm design space exploration. To this end, we introduce HDTorch, an open-source, PyTorch-based HDC library with CUDA extensions for hypervector operations. We demonstrate HDTorch’s utility by analyzing four HDC benchmark datasets in terms of accuracy, runtime, and memory consumption, utilizing both classical and online HD training methodologies. We demonstrate average (training)/inference speedups of (111x/68x)/87x for classical/online HD, respectively. We also demonstrate how HDTorch enables exploration of HDC strategies applied to large, real-world datasets. We perform the first-ever HD training and inference analysis of the entirety of the CHB-MIT EEG epilepsy database. Results show that the typical approach of training on a subset of the data may not generalize to the entire dataset, an important factor when developing future HD models for medical wearable devices.
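
For context, the snippet below implements basic bipolar hypervector primitives (binding, bundling, similarity) of the kind that HDTorch accelerates with its CUDA extensions; the function names are illustrative and are not HDTorch's actual API.

```python
# Minimal bipolar hyperdimensional-computing primitives in PyTorch
# (illustrative names; not HDTorch's API).
import torch

D = 10_000  # hypervector dimensionality

def random_hv(n=1):
    return torch.randint(0, 2, (n, D), dtype=torch.int8) * 2 - 1  # entries in {-1, +1}

def bind(a, b):
    return a * b                        # element-wise multiply = binding

def bundle(hvs):
    return torch.sign(hvs.sum(dim=0))   # element-wise majority = bundling

def similarity(a, b):
    return (a.float() * b.float()).sum() / D  # normalized dot product

# Classify a noisy copy of one of two random "class" hypervectors.
class_a, class_b = random_hv(), random_hv()
query = class_a.clone()
flip = torch.rand(1, D) < 0.2
query[flip] = -query[flip]              # 20% random sign flips
print(similarity(query, class_a), similarity(query, class_b))
```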

SESSION: Reconfigurable Computing: Accelerators and Methodologies II

Session details: Reconfigurable Computing: Accelerators and Methodologies II

  • Peipei Zhou

DARL: Distributed Reconfigurable Accelerator for Hyperdimensional Reinforcement Learning

  • Hanning Chen
  • Mariam Issa
  • Yang Ni
  • Mohsen Imani

Reinforcement Learning (RL) is a powerful technology for solving decision-making problems such as robotics control. Modern RL algorithms, e.g., Deep Q-Learning, are based on costly and resource-hungry deep neural networks. This motivates us to deploy alternative models for powering RL agents on edge devices. Recently, brain-inspired Hyper-Dimensional Computing (HDC) has been introduced as a promising solution for lightweight and efficient machine learning, particularly for classification.

In this work, we develop a novel platform capable of real-time hyperdimensional reinforcement learning. Our heterogeneous CPU-FPGA platform, called DARL, maximizes the FPGA’s computing capabilities by applying hardware optimizations to hyperdimensional computing’s critical operations, including a hardware-friendly encoder IP, hypervector chunk fragmentation, and delayed model update. Aside from the hardware innovation, we also extend the platform beyond basic single-agent RL to support multi-agent distributed learning. We evaluate the effectiveness of our approach on OpenAI Gym tasks. Our results show that the FPGA platform provides on average 20× speedup compared to current state-of-the-art hyperdimensional RL methods running on an Intel Xeon 6226 CPU. In addition, DARL provides around 4.8× faster learning and 4.2× higher energy efficiency compared to the state-of-the-art RL accelerator while ensuring a better or comparable quality of learning.

Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping

  • Carl-Johannes Johnsen
  • Tiziano De Matteis
  • Tal Ben-Nun
  • Johannes de Fine Licht
  • Torsten Hoefler

The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed at high levels of abstraction, such as HLS, requiring the use of hand-optimized RTL. In this paper we show how to leverage multiple clock domains for computational subdomains on reconfigurable devices through data movement analysis on high-level programs. We offer a novel view on multi-pumping as a compiler optimization: a superclass of traditional vectorization. As multiple data elements are fed and consumed, the computations are packed temporally rather than spatially. The optimization is applied automatically using an intermediate representation that maps high-level code to HLS. Internally, the optimization injects modules into the generated designs, incorporating RTL for fine-grained control over the clock domains. We obtain a reduction of resource consumption by up to 50% on critical components and 23% on average. For scalable designs, this can enable further parallelism, increasing overall performance.

SESSION: Compute-in-Memory for Neural Networks

Session details: Compute-in-Memory for Neural Networks

  • Bo Yuan

ISSA: Input-Skippable, Set-Associative Computing-in-Memory (SA-CIM) Architecture for Neural Network Accelerators

  • Yun-Chen Lo
  • Chih-Chen Yeh
  • Jun-Shen Wu
  • Chia-Chun Wang
  • Yu-Chih Tsai
  • Wen-Chien Ting
  • Ren-Shuo Liu

Among several emerging architectures, computing in memory (CIM), which features in-situ analog computation, is a potential solution to the data movement bottleneck of the Von Neumann architecture for artificial intelligence (AI). Interestingly, other strengths of CIM, quite different from in-situ analog computation, are not yet widely known. In this work, we point out that mutually stationary vectors (MSVs), which can be maximized by introducing associativity to CIM, are another inherent advantage unique to CIM. With MSVs, CIM gains significant freedom to dynamically vectorize the stored data (e.g., weights) and to perform agile computation using the dynamically formed vectors.

We have designed and realized an SA-CIM silicon prototype and corresponding architecture and acceleration schemes in the TSMC 28 nm process. More specifically, the contributions of this paper are fourfold: 1) We identify MSVs as new features that can be exploited to improve the current performance and energy challenges of the CIM-based hardware. 2) We propose SA-CIM to enhance MSVs for skipping the zeros, small values, and sparse vectors. 3) We propose a transposed systolic dataflow to efficiently conduct conv3×3 while being capable of exploiting input-skipping schemes. 4) We propose a design flow to search for optimal aggressive skipping scheme setups while satisfying the accuracy loss constraint.

The proposed ISSA architecture improves throughput by 1.91× to 2.97× and energy efficiency by 2.5× to 4.2×.

Computing-In-Memory Neural Network Accelerators for Safety-Critical Systems: Can Small Device Variations Be Disastrous?

  • Zheyu Yan
  • Xiaobo Sharon Hu
  • Yiyu Shi

Computing-in-Memory (CiM) architectures based on emerging nonvolatile memory (NVM) devices have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, NVM devices suffer from various non-idealities, especially device-to-device variations due to fabrication defects and cycle-to-cycle variations due to the stochastic behavior of devices. As such, the DNN weights actually mapped to NVM devices could deviate significantly from the expected values, leading to large performance degradation. To address this issue, most existing works focus on maximizing average performance under device variations. This objective works well for general-purpose scenarios, but for safety-critical applications the worst-case performance must also be considered, which has rarely been explored in the literature. In this work, we formulate the problem of determining the worst-case performance of CiM DNN accelerators under the impact of device variations. We further propose a method to effectively find the specific combination of device variations in the high-dimensional space that leads to the worst-case performance. We find that even with very small device variations, the accuracy of a DNN can drop drastically, causing concerns when deploying CiM accelerators in safety-critical applications. Finally, we show that, surprisingly, none of the existing methods used to enhance average DNN performance in CiM accelerators is very effective when extended to enhance the worst-case performance, and further research is needed to address this problem.
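
One way to picture the worst-case search (a simplified stand-in, not necessarily the paper's method) is to treat the bounded device-variation vector as an adversarial perturbation on the mapped weights and to climb the loss gradient while projecting back into the variation bound, as in the toy sketch below.

```python
# Toy gradient-ascent search for a bounded multiplicative weight perturbation
# that maximizes the loss of a small linear classifier. Model, bounds, and
# hyperparameters are illustrative assumptions.
import torch

def worst_case_variation(weight, x, target, sigma=0.01, steps=50, lr=0.02):
    delta = torch.zeros_like(weight, requires_grad=True)
    for _ in range(steps):
        logits = x @ (weight * (1 + delta))          # apply candidate variation
        loss = torch.nn.functional.cross_entropy(logits, target)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad.sign()                # ascend the loss
            delta.clamp_(-3 * sigma, 3 * sigma)      # stay within the variation bound
    return delta.detach()

# Random toy data; in practice the weights come from the DNN mapped onto CiM arrays.
W = torch.randn(8, 4)
x, target = torch.randn(32, 8), torch.randint(0, 4, (32,))
print(worst_case_variation(W, x, target).abs().max())
```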

SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist II

Session details: Breakthroughs in Synthesis – Infrastructure and ML Assist II

  • Sunil Khatri
  • Cunxi Yu

Language Equation Solving via Boolean Automata Manipulation

  • Wan-Hsuan Lin
  • Chia-Hsuan Su
  • Jie-Hong R. Jiang

Language equations are a powerful tool for compositional synthesis, modeled as the unknown component problem. Given a (sequential) system specification S and a fixed component F, we are asked to synthesize an unknown component X whose composition with F fulfills S. The synthesis of X can be formulated as language equation solving. Although prior work exploits partitioned representations for effective finite automata manipulation, it remains challenging to solve language equations involving a large number of states. In this work, we propose variants of Boolean automata as the underlying succinct representation for regular languages. They admit logic circuit manipulation and extend the scalability of language equation solving. Experimental results demonstrate the superiority of our method over the state-of-the-art, solving nine more cases out of the 36 studied benchmarks and achieving an average 740× speedup.

How Good Is Your Verilog RTL Code?: A Quick Answer from Machine Learning

  • Prianka Sengupta
  • Aakash Tyagi
  • Yiran Chen
  • Jiang Hu

Hardware Description Language (HDL) is a common entry point for designing digital circuits. Differences in HDL coding styles and design choices may lead to considerably different design quality and performance-power tradeoffs. In general, the impact of HDL coding is not clear until logic synthesis or even layout is completed. However, running synthesis merely as feedback for HDL code is not computationally economical, especially in early design phases when the code needs to be frequently modified. Furthermore, in late stages of design convergence, burdened with high-impact engineering change orders (ECOs), design iterations become prohibitively expensive. To this end, we propose a machine learning approach to Verilog-based Register-Transfer Level (RTL) design assessment without going through the synthesis process. It allows designers to quickly evaluate the performance-power tradeoff among different RTL design options. Experimental results show that our proposed technique achieves an average of 95% prediction accuracy in terms of post-placement analysis and is six orders of magnitude faster than evaluation by running logic synthesis and placement.

SESSION: In-Memory Computing Revisited

Session details: In-Memory Computing Revisited

  • Biresh Kumar Joardar
  • Ulf Schlichtmann

Logic Synthesis for Digital In-Memory Computing

  • Muhammad Rashedul Haq Rashed
  • Sumit Kumar Jha
  • Rickard Ewetz

Processing in-memory is a promising solution strategy for accelerating data-intensive applications. While analog in-memory computing is extremely efficient, its limited precision is only acceptable for approximate computing applications. Digital in-memory computing provides the deterministic precision required to accelerate high-assurance applications. State-of-the-art digital in-memory computing schemes rely on manually decomposing arithmetic operations into in-memory compute kernels. In contrast, traditional digital circuits are synthesized using complex and automated design flows. In this paper, we propose a logic synthesis framework called LOGIC for mapping high-level applications into digital in-memory compute kernels that can be executed using non-volatile memory. We first propose techniques to decompose element-wise arithmetic operations into in-memory kernels while minimizing the number of in-memory operations. Next, the sequence of in-memory operations is optimized to minimize non-volatile memory utilization. Lastly, data layout re-organization is used to efficiently accelerate applications dominated by sparse matrix-vector multiplication operations. The experimental evaluations show that the proposed synthesis approach improves the area and latency of fixed-point multiplication by 77% and 20%, respectively, over the state-of-the-art. On scientific computing applications from the SuiteSparse Matrix Collection, the proposed design improves the area, latency, and energy by 3.6×, 2.6×, and 8.3×, respectively.

Design Space and Memory Technology Co-Exploration for In-Memory Computing Based Machine Learning Accelerators

  • Kang He
  • Indranil Chakraborty
  • Cheng Wang
  • Kaushik Roy

In-Memory Computing (IMC) has become a promising paradigm for accelerating machine learning (ML) inference. While IMC architectures built on various memory technologies have demonstrated higher throughput and energy efficiency than conventional digital architectures, little research has been done from a system-level perspective to provide comprehensive and fair comparisons of different memory technologies under the same hardware budget (area). Since large-scale analog IMC hardware relies on costly analog-to-digital converters (ADCs) for robust digital communication, optimizing IMC architecture performance requires synergistic co-design of memory arrays and peripheral ADCs, wherein the trade-offs depend on the underlying memory technologies. To that effect, we co-explore the IMC macro design space and memory technology to identify the best design point for each memory type under iso-area budgets, aiming to make fair comparisons among different technologies, including SRAM, phase change memory, resistive RAM, ferroelectrics, and spintronics. First, an extended simulation framework employing a spatial architecture with off-chip DRAM is developed, capable of integrating both CMOS and nonvolatile memory technologies. Subsequently, we propose different modes of ADC operation with distinctive weight mapping schemes to cope with different on-chip area budgets. Our results show that under an iso-area budget, the various memory technologies being evaluated need to adopt different IMC macro-level designs to deliver the optimal energy-delay product (EDP) at the system level. We demonstrate that under small area budgets, the choice of the best memory technology is determined by its cell area and writing energy, whereas for larger area budgets, cell area becomes the dominant factor for technology selection.

SESSION: Special Session: 2022 CAD Contest at ICCAD

Session details: Special Session: 2022 CAD Contest at ICCAD

  • Yu-Guang Chen

Overview of 2022 CAD Contest at ICCAD

  • Yu-Guang Chen
  • Chun-Yao Wang
  • Tsung-Wei Huang
  • Takashi Sato

The “CAD Contest at ICCAD” is a challenging, multi-month research and development competition focusing on advanced, real-world problems in the field of electronic design automation (EDA). Since 2012, the contest has published many sophisticated circuit design problems, from system-level design to physical design, together with industrial benchmarks and solution evaluators. Contestants can participate in one or more problems provided by the EDA/IC industry. The winners are awarded at an ICCAD special session dedicated to the contest. Every year, the contest attracts more than a hundred teams, fosters productive industry-academia collaborations, and leads to hundreds of publications in top-tier conferences and journals. The 2022 CAD Contest has 166 teams from all over the world. Moreover, this year’s problems cover state-of-the-art EDA research trends, such as circuit security, 3D-IC, and design space exploration, and come from well-known EDA/IC companies. We believe the contest keeps enhancing its impact and boosting EDA research.

2022 CAD Contest Problem A: Learning Arithmetic Operations from Gate-Level Circuit

  • Chung-Han Chou
  • Chih-Jen (Jacky) Hsu
  • Chi-An (Rocky) Wu
  • Kuan-Hua Tu

Extracting circuit functionality from a gate-level netlist is critical in CAD tools. For security, it helps designers detect hardware Trojans or malicious design changes in netlists built with third-party resources such as fabrication services and soft/hard IP cores. For verification, it can reduce the complexity and effort of preserving design information under the aggressive optimization strategies adopted by synthesis tools. For Engineering Change Order (ECO), it can save the designer from having to locate the ECO gates in a sea of bit-level gates.

In this contest, we formulated a datapath learning and extraction problem. With a set of benchmarks and an evaluation metric, we expect contestants to develop a tool to learn the arithmetic equations from a synthesized gate-level netlist.

2022 ICCAD CAD Contest Problem B: 3D Placement with D2D Vertical Connections

  • Kai-Shun Hu
  • I-Jye Lin
  • Yu-Hui Huang
  • Hao-Yu Chi
  • Yi-Hsuan Wu
  • Chin-Fang Cindy Shen

In the chiplet era, benefits from multiple factors can be obtained by splitting a large single die into multiple small dies. Using multiple small dies with die-to-die (D2D) vertical connections brings benefits including: 1) better yield, 2) better timing/performance, and 3) better cost. How to perform the netlist partitioning and the cell placement in each of the small dies, and how to determine the locations of the D2D interconnection terminals, becomes a new topic.

To address this chiplet-era physical implementation problem, the ICCAD-2022 contest encourages research into techniques for multi-die netlist partitioning and placement with D2D vertical connections. We provided (i) a set of benchmarks and (ii) an evaluation metric to help contestants develop, test, and evaluate their new algorithms.

2022 ICCAD CAD Contest Problem C: Microarchitecture Design Space Exploration

  • Sicheng Li
  • Chen Bai
  • Xuechao Wei
  • Bizhao Shi
  • Yen-Kuang Chen
  • Yuan Xie

It is vital to select microarchitectures that achieve good trade-offs between performance, power, and area in the chip development cycle. Combining high-level hardware description languages with the optimization of electronic design automation tools empowers microarchitecture exploration at the circuit level. Due to the extremely large design space and the high runtime cost of evaluating a microarchitecture, ICCAD 2022 CAD Contest Problem C calls for an effective design space exploration algorithm to solve the problem. We formulate the research topic as a contest problem and provide benchmark suites, contest benchmark platforms, etc., for all contestants to innovate and evaluate their algorithms.

IEEE CEDA DATC: Expanding Research Foundations for IC Physical Design and ML-Enabled EDA

  • Jinwook Jung
  • Andrew B. Kahng
  • Ravi Varadarajan
  • Zhiang Wang

This paper describes new elements in the RDF-2022 release of the DATC Robust Design Flow, along with other activities of the IEEE CEDA DATC. The RosettaStone initiated with RDF-2021 has been augmented to include 35 benchmarks and four open-source technologies (ASAP7, NanGate45 and SkyWater130HS/HD), plus timing-sensible versions created using path-cutting. The Hier-RTLMP macro placer is now part of DATC RDF, enabling macro placement for large modern designs with hundreds of macros. To establish a clear baseline for macro placers, new open-source benchmark suites on open PDKs, with corresponding flows for fully reproducible results, are provided. METRICS2.1 infrastructure in OpenROAD and OpenROAD-flow-scripts now uses native JSON metrics reporting, which is more robust and general than the previous Python script-based method. Calibrations on open enablements have also seen notable updates in the RDF. Finally, we also describe an approach to establishing a generic, cloud-native large-scale design of experiments for ML-enabled EDA. Our paper closes with future research directions related to DATC’s efforts.

SESSION: Architectures and Methodologies for Advanced Hardware Security

Session details: Architectures and Methodologies for Advanced Hardware Security

  • Amin Rezaei
  • Gang Qu

Inhale: Enabling High-Performance and Energy-Efficient In-SRAM Cryptographic Hash for IoT

  • Jingyao Zhang
  • Elaheh Sadredini

In the age of big data, information security has become a major issue of debate, especially with the rise of the Internet of Things (IoT), where attackers can effortlessly obtain physical access to edge devices. The hash algorithm is the current foundation for data integrity and authentication. However, it is challenging to provide a high-performance, high-throughput, and energy-efficient solution on resource-constrained edge devices. In this paper, we propose Inhale, an in-SRAM architecture to effectively compute hash algorithms with innovative data alignment and efficient read/write strategies to implicitly execute data shift operations through the in-situ controller. We present two variations of Inhale: Inhale-Opt, which is optimized for latency, throughput, and area-overhead; and Inhale-Flex, which offers flexibility in repurposing a part of last-level caches for hash computation. We thoroughly evaluate our proposed architectures on both SRAM and ReRAM memories and compare them with the state-of-the-art in-memory and ASIC accelerators. Our performance evaluation confirms that Inhale can achieve 1.4× – 14.5× higher throughput-per-area and about two-orders-of-magnitude higher throughput-per-area-per-energy compared to the state-of-the-art solutions.

Accelerating N-Bit Operations over TFHE on Commodity CPU-FPGA

  • Kevin Nam
  • Hyunyoung Oh
  • Hyungon Moon
  • Yunheung Paek

TFHE is a fully homomorphic encryption (FHE) scheme that evaluates Boolean gates, which we will hereafter call Tgates, over encrypted data. TFHE is considered to have higher expressive power than many existing schemes in that it can compute not only N-bit arithmetic operations but also logical/relational ones, as arbitrary arithmetic/logical/relational (ALR) operations can be represented by Tgate circuits. Despite this strength, TFHE, like all other schemes, suffers from colossal computational overhead. Incessant efforts to reduce this overhead have been made by exploiting the inherent parallelism of FHE operations on ciphertexts. Unlike other FHE schemes, the parallelism of TFHE can be decomposed into multiple layers: one inside each FHE operation (equivalent to a single Tgate) and the other between Tgates. Unfortunately, previous works focused only on exploiting the parallelism inside a Tgate. However, as each N-bit operation over TFHE corresponds to a Tgate circuit constructed from multiple Tgates, it is also necessary to utilize the parallelism between Tgates to optimize an entire operation. This paper proposes an acceleration technique that maximizes the performance of a TFHE N-bit operation by simultaneously utilizing both layers of parallelism comprising the operation. To fully profit from both layers of parallelism, we have implemented our technique on a commodity CPU-FPGA hybrid machine with parallel execution capabilities in hardware. Our implementation outperforms prior ones by 2.43× in throughput and 12.19× in throughput per watt when performing N-bit operations under 128-bit quantum security parameters.
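
The "parallelism between Tgates" layer can be pictured by levelizing a Tgate circuit: gates in the same topological level have no data dependence and can be launched concurrently, while each gate still exposes its own internal FHE parallelism. The circuit representation below is an illustrative assumption, not the paper's scheduler.

```python
# Group a Boolean (Tgate) circuit into topological levels; gates within one
# level can run in parallel. The dict-based circuit encoding is illustrative.
from collections import defaultdict

def schedule_levels(gates):
    """gates: dict gate_id -> list of input ids (other gates or primary inputs).
    Returns a list of levels; each level is a set of gates runnable in parallel."""
    level = {}
    def depth(g):
        if g not in gates:                 # primary input
            return -1
        if g not in level:
            level[g] = 1 + max((depth(i) for i in gates[g]), default=-1)
        return level[g]
    for g in gates:
        depth(g)
    buckets = defaultdict(set)
    for g, d in level.items():
        buckets[d].add(g)
    return [buckets[d] for d in sorted(buckets)]

# 2-bit ripple-carry fragment: s0/c0 depend only on inputs; s1/c1 also need c0.
gates = {"s0": ["a0", "b0"], "c0": ["a0", "b0"],
         "s1": ["a1", "b1", "c0"], "c1": ["a1", "b1", "c0"]}
print(schedule_levels(gates))  # e.g. [{'s0', 'c0'}, {'s1', 'c1'}]
```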

Fast and Compact Interleaved Modular Multiplication Based on Carry Save Addition

  • Oleg Mazonka
  • Eduardo Chielle
  • Deepraj Soni
  • Michail Maniatakos

Improving fully homomorphic encryption computation by designing specialized hardware is an active topic of research. The most prominent encryption schemes operate on long polynomials requiring many concurrent modular multiplications of very big numbers. Thus, it is crucial to use many small and efficient multipliers. Interleaved and Montgomery iterative multipliers are the best candidates for the task. Interleaved designs, however, suffer from longer latency as they require a number comparison within each iteration; Montgomery designs, on the other hand, need extra conversion of the operands or the result. In this work, we propose a novel hardware design that combines the best of both worlds: Exhibiting the carry save addition of Montgomery designs without the need for any domain conversions. Experimental results demonstrate improved latency-area product efficiency by up to 47% when compared to the standard Interleaved multiplier for large arithmetic word sizes.
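
For reference, the textbook interleaved (shift-add) modular multiplication below makes the per-iteration trial subtractions explicit; it is exactly these comparisons that the proposed carry-save formulation avoids, and the carry-save datapath itself is not reproduced here.

```python
# Textbook MSB-first interleaved modular multiplication with trial subtractions.
def interleaved_modmul(x, y, m, nbits):
    """Return (x * y) mod m, assuming 0 <= x, y < m < 2**nbits."""
    p = 0
    for i in reversed(range(nbits)):
        p <<= 1                      # shift the partial product
        if (y >> i) & 1:
            p += x                   # conditionally add the multiplicand
        # Up to two reductions keep p < m; these magnitude comparisons are the
        # per-iteration latency bottleneck that carry-save designs remove.
        if p >= m:
            p -= m
        if p >= m:
            p -= m
    return p

m = 2**31 - 1
assert interleaved_modmul(123456789, 987654321, m, 31) == (123456789 * 987654321) % m
```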

Accelerating Fully Homomorphic Encryption by Bridging Modular and Bit-Level Arithmetic

  • Eduardo Chielle
  • Oleg Mazonka
  • Homer Gamil
  • Michail Maniatakos

The dramatic increase of data breaches in modern computing platforms has emphasized that access control is not sufficient to protect sensitive user data. Recent advances in cryptography allow end-to-end processing of encrypted data without the need for decryption using Fully Homomorphic Encryption (FHE). Such computation, however, is still orders of magnitude slower than direct (unencrypted) computation. Depending on the underlying cryptographic scheme, FHE schemes can work natively either at bit-level using Boolean circuits, or over integers using modular arithmetic. Operations on integers are limited to addition/subtraction and multiplication. On the other hand, bit-level arithmetic is much more comprehensive, allowing more operations such as comparison and division. While modular arithmetic can emulate bit-level computation, there is a significant cost in performance. In this work, we propose a novel method, dubbed bridging, that blends faster but restricted modular computation with slower but comprehensive bit-level computation, making them both usable within the same application and with the same cryptographic scheme instantiation. We introduce and open-source C++ types representing the two distinct arithmetic modes, offering the possibility to convert from one to the other. Experimental results show that bridging modular and bit-level arithmetic computation can lead to 1-2 orders of magnitude performance improvement for tested synthetic benchmarks, as well as one real-world FHE application: a genotype imputation case study.

SESSION: Special Session: The Dawn of Domain-Specific Hardware Accelerators for Robotic Computing

Session details: Special Session: The Dawn of Domain-Specific Hardware Accelerators for Robotic Computing

  • Jiang Hu

A Reconfigurable Hardware Library for Robot Scene Perception

  • Yanqi Liu
  • Anthony Opipari
  • Odest Chadwicke Jenkins
  • R. Iris Bahar

Perceiving the position and orientation of objects (i.e., pose estimation) is a crucial prerequisite for robots acting within their natural environment. We present a hardware acceleration approach to enable real-time and energy-efficient articulated pose estimation for robots operating in unstructured environments. Our hardware accelerator implements Nonparametric Belief Propagation (NBP) to infer the belief distribution of articulated object poses. Our approach is, on average, 26× more energy-efficient than a high-end GPU and 11× faster than an embedded low-power GPU implementation. Moreover, we present a Monte-Carlo Perception Library generated from high-level synthesis to enable reconfigurable hardware designs on FPGA fabrics that are better tuned to user-specified scene, resource, and performance constraints.

Analyzing and Improving Resilience and Robustness of Autonomous Systems

  • Zishen Wan
  • Karthik Swaminathan
  • Pin-Yu Chen
  • Nandhini Chandramoorthy
  • Arijit Raychowdhury

Autonomous systems have reached a tipping point, with a myriad of self-driving cars, unmanned aerial vehicles (UAVs), and robots being widely applied and revolutionizing new applications. The continuous deployment of autonomous systems reveals the need for designs that facilitate increased resiliency and safety. The ability of an autonomous system to tolerate or mitigate errors, such as environmental conditions, sensor, hardware, and software faults, and adversarial attacks, is essential to ensure its functional safety. Application-aware resilience metrics, holistic fault analysis frameworks, and lightweight fault mitigation techniques are being proposed for accurate and effective resilience and robustness assessment and improvement. This paper explores the origins of fault sources across the computing stack of autonomous systems, discusses the fault impacts and fault mitigation techniques at different scales of autonomous systems, and concludes with challenges and opportunities for assessing and building next-generation resilient and robust autonomous systems.

Factor Graph Accelerator for LiDAR-Inertial Odometry (Invited Paper)

  • Yuhui Hao
  • Bo Yu
  • Qiang Liu
  • Shaoshan Liu
  • Yuhao Zhu

A factor graph is a graph representing the factorization of a probability distribution function and has been utilized in many autonomous machine computing tasks, such as localization, tracking, planning, and control. We are developing an architecture with the goal of using the factor graph as a common abstraction for most, if not all, autonomous machine computing tasks. If successful, the architecture would provide a very simple interface for mapping autonomous machine functions to the underlying compute hardware. As a first step in this direction, this paper presents our most recent work on developing a factor graph accelerator for LiDAR-Inertial Odometry (LIO), an essential task in many autonomous machines, such as autonomous vehicles and mobile robots. By modeling LIO as a factor graph, the proposed accelerator not only supports multi-sensor fusion of LiDAR, inertial measurement unit (IMU), GPS, etc., but also solves the global optimization problem of robot navigation in batch or incremental modes. Our evaluation demonstrates that the proposed design significantly improves the real-time performance and energy efficiency of autonomous machine navigation systems. This initial success suggests the potential of generalizing the factor graph architecture as a common abstraction for autonomous machine computing, including tracking, planning, and control.

Hardware Architecture of Graph Neural Network-Enabled Motion Planner (Invited Paper)

  • Lingyi Huang
  • Xiao Zang
  • Yu Gong
  • Bo Yuan

Motion planning aims to find a collision-free trajectory from the start to goal configurations of a robot. As a key cognition task for all the autonomous machines, motion planning is fundamentally required in various real-world robotic applications, such as 2-D/3-D autonomous navigation of unmanned mobile and aerial vehicles and high degree-of-freedom (DoF) autonomous manipulation of industry/medical robot arms and graspers.

Motion planning can be performed using either non-learning-based classical algorithms or learning-based neural approaches. Most recently, the powerful capabilities of deep neural networks (DNNs) have made neural planners very attractive because of their superior planning performance over classical methods. In particular, the graph neural network (GNN)-enabled motion planner has demonstrated state-of-the-art performance across a set of challenging high-dimensional planning tasks, motivating efficient hardware acceleration to fully unleash its potential and promote its widespread deployment in practical applications.

To that end, in this paper we perform a preliminary study of efficient accelerator design for the GNN-based neural planner, especially for the neural explorer as the key component of the entire planning pipeline. By performing an in-depth analysis of the different design choices, we identify that a hybrid architecture, instead of the uniform sparse matrix multiplication (SpMM)-based solution popularly adopted in existing GNN hardware, is more suitable for our target neural explorer. With a set of optimizations on microarchitecture and dataflow, several design challenges incurred by the hybrid architecture, such as extensive memory access and imbalanced workload, can be efficiently mitigated. Evaluation results show that our proposed customized hardware architecture achieves order-of-magnitude performance improvement over CPU/GPU-based implementations with respect to area and energy efficiency in various working environments.

SESSION: From Logical to Physical Qubits: New Models and Techniques for Mapping

Session details: From Logical to Physical Qubits: New Models and Techniques for Mapping

  • Weiwen Jiang

A Robust Quantum Layout Synthesis Algorithm with a Qubit Mapping Checker

  • Tsou-An Wu
  • Yun-Jhe Jiang
  • Shao-Yun Fang

Layout synthesis in quantum circuits maps the logical qubits of a synthesized circuit onto the physical qubits of a hardware device (coupling graph) and complies with the hardware limitations. Existing studies on the problem usually suffer from intractable formulation complexity and thus prohibitively long runtimes. In this paper, we propose an efficient layout synthesizer by developing a satisfiability modulo theories (SMT)-based qubit mapping checker. The proposed qubit mapping checker can efficiently derive a SWAP-free solution if one exists. If no SWAP-free solution exists for a circuit, we propose a divide-and-conquer scheme that utilizes the checker to find SWAP-free sub-solutions for sub-circuits, and the overall solution is found by merging sub-solutions with SWAP insertion. Experimental results show that the proposed optimization flow can achieve more than 3000× runtime speedup over a state-of-the-art work to derive optimal solutions for a set of SWAP-free circuits. Moreover, for the other set of benchmark circuits requiring SWAP gates, our flow achieves more than 800× speedup and obtains near-optimal solutions with only 3% SWAP overhead.
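
In the spirit of the SMT-based checker (a simplified sketch, not the paper's exact encoding), one can ask an SMT solver whether a single static logical-to-physical mapping exists under which every two-qubit gate already acts on coupled physical qubits, i.e., whether a SWAP-free solution exists; the sketch below uses the z3 Python bindings.

```python
# Simplified SWAP-free feasibility check with z3 (illustrative encoding).
from z3 import Int, Solver, Distinct, Or, And, sat

def swap_free_mapping(num_logical, coupling_edges, two_qubit_gates, num_physical):
    s = Solver()
    phys = [Int(f"p_{q}") for q in range(num_logical)]
    for p in phys:                                    # valid physical qubit ids
        s.add(And(p >= 0, p < num_physical))
    s.add(Distinct(phys))                             # injective mapping
    edges = set(coupling_edges) | {(b, a) for (a, b) in coupling_edges}
    for (q0, q1) in two_qubit_gates:                  # every gate needs adjacency
        s.add(Or([And(phys[q0] == a, phys[q1] == b) for (a, b) in edges]))
    if s.check() == sat:
        model = s.model()
        return {q: model[phys[q]].as_long() for q in range(num_logical)}
    return None                                       # no SWAP-free mapping exists

# Line-shaped coupling 0-1-2-3 and gates on (0,1) and (1,2): a mapping exists.
print(swap_free_mapping(3, [(0, 1), (1, 2), (2, 3)], [(0, 1), (1, 2)], 4))
```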

Reinforcement Learning and DEAR Framework for Solving the Qubit Mapping Problem

  • Ching-Yao Huang
  • Chi-Hsiang Lien
  • Wai-Kei Mak

Quantum computing is gaining more and more attention due to its huge potential and the constant progress in quantum computer development. IBM and Google have released quantum architectures with more than 50 qubits. However, in these machines, the physical qubits are not fully connected, so two-qubit interactions can only be performed between specific pairs of physical qubits. To execute a quantum circuit, it is necessary to transform it into a functionally equivalent one that respects the constraints imposed by the target architecture. Quantum circuit transformation inevitably introduces additional gates, which reduces the fidelity of the circuit. Therefore, it is important that the transformation method completes the transformation with minimal overhead. It consists of two steps, initial mapping and qubit routing. Here we propose a reinforcement learning-based model to solve the initial mapping problem. Initial mapping is formulated as sequence-to-sequence learning, and a self-attention network is used to extract features from a circuit. For qubit routing, a DEAR (Dynamically-Extract-and-Route) framework is proposed. The framework iteratively extracts a subcircuit and uses A* search to determine when and where to insert additional gates. It helps to preserve the lookahead ability dynamically and to provide more accurate cost estimation efficiently during A* search. The experimental results show that our RL model generates better initial mappings than the best-known algorithms, with 12% fewer additional gates in the qubit routing stage. Furthermore, our DEAR framework outperforms the state-of-the-art qubit routing approach with 8.4% and 36.3% average reductions in the number of additional gates and execution time, respectively, starting from the same initial mapping.

Qubit Mapping for Reconfigurable Atom Arrays

  • Bochen Tan
  • Dolev Bluvstein
  • Mikhail D. Lukin
  • Jason Cong

Because they offer the largest number of qubits available and massively parallel execution of entangling two-qubit gates, atom arrays are a promising platform for quantum computing. The qubits are selectively loaded into arrays of optical traps, some of which can be moved during the computation itself. By adjusting the locations of the traps and shining a specific global laser, different pairs of qubits, even those initially far away, can be entangled at different stages of the quantum program execution. In comparison, previous QC architectures only generate entanglement on a fixed set of quantum register pairs. Thus, reconfigurable atom arrays (RAA) present a new challenge for QC compilation, especially the qubit mapping/layout synthesis stage, which decides the qubit placement and gate scheduling. In this paper, we consider an RAA QC architecture that contains multiple arrays, supports 2D array movements, represents cutting-edge experimental platforms, and is much more general than those of previous works. We start by systematically examining the fundamental constraints imposed on RAA by physics. Built upon this understanding, we discretize the state space of the architecture and formulate layout synthesis for such an architecture as a satisfiability modulo theories problem. Finally, we demonstrate our work by compiling the quantum approximate optimization algorithm (QAOA), one of the promising near-term quantum computing applications. Our layout synthesizer reduces the number of required native two-qubit gates in 22-qubit QAOA by 5.72x (geomean) compared to leading experiments on a superconducting architecture. Combined with a better coherence time, there is an order-of-magnitude increase in circuit fidelity.

MCQA: Multi-Constraint Qubit Allocation for Near-FTQC Device

  • Sunghye Park
  • Dohun Kim
  • Jae-Yoon Sim
  • Seokhyeong Kang

In response to the rapid development of quantum processors, quantum software must be advanced by considering the actual hardware limitations. Among the various design automation problems in quantum computing, qubit allocation modifies the input circuit to match the hardware topology constraints. In this work, we present an effective heuristic approach for qubit allocation that considers not only the hardware topology but also other constraints for near-fault-tolerant quantum computing (near-FTQC). We propose a practical methodology to find an effective initial mapping to reduce both the number of gates and circuit latency. We then perform dynamic scheduling to maximize the number of gates executed in parallel in the main mapping phase. Our experimental results with a Surface-17 processor confirmed a substantial reduction in the number of gates, latency, and runtime by 58%, 28%, and 99%, respectively, compared with the previous method [18]. Moreover, our mapping method is scalable and has a linear time complexity with respect to the number of gates.

SESSION: Smart Embedded Systems (Virtual)

Session details: Smart Embedded Systems (Virtual)

  • Leonidas Kosmidis
  • Pietro Mercati

Smart Scissor: Coupling Spatial Redundancy Reduction and CNN Compression for Embedded Hardware

  • Hao Kong
  • Di Liu
  • Shuo Huai
  • Xiangzhong Luo
  • Weichen Liu
  • Ravi Subramaniam
  • Christian Makaya
  • Qian Lin

Scaling down the resolution of input images can greatly reduce the computational overhead of convolutional neural networks (CNNs), which is promising for edge AI. However, as an image usually contains much spatial redundancy, e.g., background pixels, directly shrinking the whole image will lose important features of the foreground object and lead to severe accuracy degradation. In this paper, we propose a dynamic image cropping framework to reduce the spatial redundancy by accurately cropping the foreground object from images. To achieve the instance-aware fine cropping, we introduce a lightweight foreground predictor to efficiently localize and crop the foreground of an image. The finely cropped images can be correctly recognized even at a small resolution. Meanwhile, computational redundancy also exists in CNN architectures. To pursue higher execution efficiency on resource-constrained embedded devices, we also propose a compound shrinking strategy to coordinately compress the three dimensions (depth, width, resolution) of CNNs. Eventually, we seamlessly combine the proposed dynamic image cropping and compound shrinking into a unified compression framework, Smart Scissor, which is expected to significantly reduce the computational overhead of CNNs while still maintaining high accuracy. Experiments on ImageNet-1K demonstrate that our method reduces the computational cost of ResNet50 by 41.5% while improving the top-1 accuracy by 0.3%. Moreover, compared to HRank, the state-of-the-art CNN compression framework, our method achieves 4.1% higher top-1 accuracy at the same computational cost. The codes and data are available at https://github.com/ntuliuteam/smart-scissor

SHAPE: Scheduling of Fixed-Priority Tasks on Heterogeneous Architectures with Multiple CPUs and Many PEs

  • Yuankai Xu
  • Tiancheng He
  • Ruiqi Sun
  • Yehan Ma
  • Yier Jin
  • An Zou

Despite being employed in burgeoning efforts to accelerate artificial intelligence, heterogeneous architectures have yet to be well managed under strict timing constraints. As a classic task model, multi-segment self-suspension (MSSS) has been proposed for general I/O-intensive systems and computation offloading. However, directly applying this model to heterogeneous architectures with multiple CPUs and many processing units (PEs) suffers from tremendous pessimism. In this paper, we present a real-time scheduling approach, SHAPE, for general heterogeneous architectures with significantly improved schedulability and utilization. We start by building the general task execution pattern on a heterogeneous architecture integrating multiple CPU cores and many PEs, such as GPU streaming multiprocessors and FPGA IP cores. A real-time scheduling strategy and the corresponding schedulability analysis are presented following the task execution pattern. Compared with state-of-the-art scheduling algorithms through comprehensive experiments on unified and versatile tasks, SHAPE improves the schedulability by 11.1%-100%. Moreover, experiments performed on NVIDIA GPU systems further indicate that up to 70.9% pessimism reduction can be achieved by the proposed scheduling. Since we target general heterogeneous architectures, SHAPE can be directly applied to off-the-shelf heterogeneous computing systems with guaranteed deadlines and improved schedulability.

On Minimizing the Read Latency of Flash Memory to Preserve Inter-Tree Locality in Random Forest

  • Yu-Cheng Lin
  • Yu-Pei Liang
  • Tseng-Yi Chen
  • Yuan-Hao Chang
  • Shuo-Han Chen
  • Wei-Kuan Shih

Many prior research works have discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor. That is, an inference model is constructed on a personal computer or a server platform and then integrated into embedded systems for just-in-time inference. Given the limited main memory space in embedded systems, an important problem for embedded machine learning systems is how to efficiently move the inference model between the main memory and secondary storage (e.g., flash memory). To tackle this problem, we need to consider how to preserve the locality inside the inference model during model construction. Therefore, we propose a solution, namely locality-aware random forest (LaRF), to preserve the inter-tree locality of all decision trees within a random forest model during the model construction process. Owing to this locality preservation, LaRF improves the read latency by at least 81.5% compared to the original random forest library.

SESSION: Analog/Mixed-Signal Simulation, Layout, and Packaging (Virtual)

Session details: Analog/Mixed-Signal Simulation, Layout, and Packaging (Virtual)

  • Biying Xu
  • Ilya Yusim

Numerically-Stable and Highly-Scalable Parallel LU Factorization for Circuit Simulation

  • Xiaoming Chen

A number of sparse linear systems are solved by sparse LU factorization in a circuit simulation process. The coefficient matrices of these linear systems have identical structure but different values. Pivoting is usually needed in sparse LU factorization to ensure numerical stability, which makes it difficult to predict the exact dependencies for scheduling parallel LU factorization. However, the matrix values usually change smoothly across circuit simulation iterations, which provides the potential to “guess” the dependencies. This work proposes a novel parallel LU factorization algorithm with pivoting reduction, whose numerical stability is equivalent to that of LU factorization with pivoting. The basic idea is to reuse the previous structural and pivoting information as much as possible to perform highly-scalable parallel factorization without pivoting, scheduled by the “guessed” dependencies. Once a pivot is found to be too small, the remaining matrix is factorized with pivoting in a pipelined way. Comprehensive experiments, including comparisons with state-of-the-art CPU- and GPU-based parallel sparse direct solvers on 66 circuit matrices and real SPICE DC simulations on 4 circuit netlists, reveal the superior performance and scalability of the proposed algorithm. The proposed solver is available at https://github.com/chenxm1986/cktso.
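
As a toy dense illustration of the "reuse previous pivoting information" idea (assuming a remembered row order from an earlier iteration and a small-pivot bail-out), the sketch below factorizes without pivoting under that order; the sparse data structures, dependency scheduling, and pipelined re-pivoting of the actual solver are not modeled.

```python
# Dense toy of pivot-order reuse: factor with a remembered row permutation and
# fall back only when a reused pivot becomes too small (illustrative only).
import numpy as np

def lu_with_reused_pivots(A, prev_perm, tol=1e-10):
    A = A[prev_perm, :].astype(float).copy()   # apply the remembered row order
    n = A.shape[0]
    for k in range(n):
        if abs(A[k, k]) < tol:                 # reused pivot too small
            return None                        # (real solver: pipelined re-pivoting)
        A[k+1:, k] /= A[k, k]                  # column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # Schur complement update
    return A                                   # packed unit-lower L and U

rng = np.random.default_rng(0)
A0 = rng.standard_normal((5, 5))
prev_perm = np.argsort(-np.abs(A0[:, 0]))      # crude stand-in for a prior pivot order
A1 = A0 + 1e-3 * rng.standard_normal((5, 5))   # next-iteration matrix: values drift slightly
print(lu_with_reused_pivots(A1, prev_perm) is not None)
```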

EI-MOR: A Hybrid Exponential Integrator and Model Order Reduction Approach for Transient Power/Ground Network Analysis

  • Cong Wang
  • Dongen Yang
  • Quan Chen

The exponential integrator (EI) method has been proven to be an effective technique for accelerating large-scale transient power/ground network analysis. However, EI requires the inputs to be piece-wise linear (PWL) within one step, which greatly limits the step size when the inputs are poorly aligned. To address this issue, in this work we first prove that EI, when used together with the rational Krylov subspace, is equivalent to performing a moment-matching model order reduction (MOR) with a single input in each time step and then advancing the reduced system using EI in the same step. Based on this equivalence, we devise a hybrid method, EI-MOR, that combines EI and MOR in the same transient simulation. The majority of well-aligned inputs are still treated by EI as usual, while a few misaligned inputs are selected to be handled by a MOR process that produces a reduced model valid for arbitrary inputs. Therefore, the step size limitation imposed by the misaligned inputs can be largely alleviated. Numerical experiments are conducted to demonstrate the efficacy of the proposed method.
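
For context, the standard EI step for a linear network (written here with the assumed notation C x'(t) = -G x(t) + B u(t)) shows why the input must be piece-wise linear within one step: the input integral is evaluated exactly only under that assumption, with the matrix functions applied to vectors through a (rational) Krylov subspace. This is textbook exponential-integrator material, not the paper's derivation.

```latex
% Standard exponential-integrator step over [t, t+h] with A = -C^{-1}G and
% b(t) = C^{-1}B u(t), assuming u(t) (hence b(t)) is piece-wise linear in the step:
x(t+h) = e^{hA} x(t) + h\,\varphi_1(hA)\,b(t) + h\,\varphi_2(hA)\,\bigl(b(t+h) - b(t)\bigr),
\qquad
\varphi_1(z) = \frac{e^{z}-1}{z},
\quad
\varphi_2(z) = \frac{e^{z}-z-1}{z^{2}}.
```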

Multi-Package Co-Design for Chiplet Integration

  • Zhen Zhuang
  • Bei Yu
  • Kai-Yuan Chao
  • Tsung-Yi Ho

Due to the cost and design complexity associated with advanced technology nodes, it is difficult for traditional monolithic System-on-Chip designs to follow Moore’s Law, which weakens their economic benefits. Semiconductor industries are looking to advanced packages to improve the economic advantages. Since the multi-chiplet architecture supporting heterogeneous integration offers robust re-usability and effective cost reduction, chiplet integration has become the mainstream of advanced packages. Nowadays, the number of mounted chiplets in a package is continuously increasing with the requirement of high system performance. However, the large package area caused by the increasing number of chiplets leads to serious reliability issues, including warpage and bump stress, which worsen the yield and cost. The multi-package architecture, which distributes chiplets to multiple packages and uses less area in each package, is a popular alternative to enhance reliability and reduce cost in advanced packages. However, the primary challenge of the multi-package architecture lies in the tradeoff between the inter-package costs, i.e., the interconnection among packages, and the intra-package costs, i.e., the reliability issues caused by warpage and bump stress. Therefore, a co-design methodology that optimizes multiple packages simultaneously is indispensable to improve the quality of the whole system. To tackle this challenge, we adopt mathematical programming methods for the multi-package co-design problem, reflecting the nature of the synergistic optimization of multiple packages. To the best of our knowledge, this is the first work to solve the multi-package co-design problem.

SESSION: Advanced PIM and Biochip Technology and Stochastic Computing (Virtual)

Session details: Advanced PIM and Biochip Technology and Stochastic Computing (Virtual)

  • Grace Li Zhang

Gzippo: Highly-Compact Processing-in-Memory Graph Accelerator Alleviating Sparsity and Redundancy

  • Xing Li
  • Rachata Ausavarungnirun
  • Xiao Liu
  • Xueyuan Liu
  • Xuan Zhang
  • Heng Lu
  • Zhuoran Song
  • Naifeng Jing
  • Xiaoyao Liang

Graph applications play a significant role in real-world data computation. However, their memory access behavior, characterized by a low compute-to-communication ratio, poor temporal locality, and poor spatial locality, becomes the performance bottleneck. Existing RRAM-based processing-in-memory accelerators reduce the data movement but fail to address both the sparsity and the redundancy of graph data. In this work, we present Gzippo, a highly-compact design that supports graph computation in a compressed sparse format. Gzippo employs a tandem-isomorphic-crossbar architecture both to eliminate redundant searches and sequential indexing during iterations, and to remove the sparsity that leads to non-effective computation on zero values. Gzippo achieves a 3.0× (up to 17.4×) performance speedup and 23.9× (up to 163.2×) higher energy efficiency over a state-of-the-art RRAM-based PIM accelerator.

CoMUX: Combinatorial-Coding-Based High-Performance Microfluidic Control Multiplexer Design

  • Siyuan Liang
  • Mengchu Li
  • Tsun-Ming Tseng
  • Ulf Schlichtmann
  • Tsung-Yi Ho

Flow-based microfluidic chips are one of the most promising platforms for biochemical experiments. Transportation channels and operation devices inside these chips are controlled by microvalves, which are driven by external pressure sources. As the complexity of experiments on these chips keeps increasing, control multiplexers (MUXes) become necessary for the actuation of the enormous number of valves. However, current binary-coding-based MUXes do not take full advantage of the coding capacity and suffer from reliability problems caused by the high control channel density. In this work, we propose a novel MUX coding strategy, named Combinatorial Coding, along with an algorithm to synthesize combinatorial-coding-based MUXes (CoMUXes) of arbitrary sizes with the proven maximum coding capacity. Moreover, we develop a simplification method to reduce the number of valves and control channels in CoMUXes and thus improve their reliability. We compare CoMUXes with state-of-the-art MUXes under different control demands with up to 10 × 2^13 independent control channels. Experiments show that CoMUXes can reliably control more independent control channels with fewer resources. For example, when the number of to-be-controlled channels is up to 10 × 2^13, compared to a state-of-the-art MUX, the optimized CoMUX reduces the number of required flow channels by 44% and the number of valves by 90%.
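
The capacity argument behind combinatorial coding can be checked with a few lines of Python; this is a counting sketch only, not the synthesis algorithm or the valve-level design, and the choice of weight n/2 is the standard one for maximizing the count.

    from itertools import combinations
    from math import comb

    def combinatorial_codes(n_control_lines):
        # Constant-weight codes: each flow channel is addressed by pressurizing
        # exactly n//2 of the n control lines, so up to C(n, n/2) channels can
        # be distinguished, versus 2**(n/2) for a binary-coded MUX that spends
        # two control lines (true/complement) per address bit.
        k = n_control_lines // 2
        return [frozenset(c) for c in combinations(range(n_control_lines), k)]

    n = 10
    codes = combinatorial_codes(n)
    print(len(codes), comb(n, n // 2))   # 252 addressable flow channels
    print(2 ** (n // 2))                 # binary coding addresses only 32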

Exploiting Uniform Spatial Distribution to Design Efficient Random Number Source for Stochastic Computing

  • Kuncai Zhong
  • Zexi Li
  • Haoran Jin
  • Weikang Qian

Stochastic computing (SC) generally suffers from long latency. One solution is to apply proper random number sources (RNSs). Nevertheless, current RNS designs either have high hardware cost or low accuracy. To address this issue, motivated by the observation that a uniform spatial distribution generally leads to high accuracy for an SC circuit, we propose a basic architecture to generate the uniform spatial distribution and a detailed implementation of it. For the implementation, we further propose one method to optimize its hardware cost and another to optimize its accuracy; the hardware cost optimization does not affect the accuracy. The experimental results show that the proposed implementation achieves both low hardware cost and high accuracy. Compared to the state-of-the-art stochastic number generator design, the proposed design reduces area by 88% with comparable accuracy.
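
The intuition that a spatially uniform source improves SC accuracy can be checked with a small numpy experiment (a behavioral model only, not the proposed RNS hardware): both sources encode the same probabilities, but the uniform one fixes the number of ones in each stream exactly, so the product estimated by an AND gate shows a smaller average error at the same stream length.

    import numpy as np

    rng = np.random.default_rng(0)

    def bitstream(p, n, spatially_uniform):
        # Encode probability p as an n-bit stream: with a spatially uniform
        # source exactly round(p*n) bits are 1 (placed in random order);
        # with a plain pseudo-random source each bit is an independent draw.
        if spatially_uniform:
            samples = rng.permutation((np.arange(n) + 0.5) / n)
        else:
            samples = rng.random(n)
        return (samples < p).astype(np.uint8)

    def mean_abs_error(px, py, n, spatially_uniform, trials=1000):
        # Stochastic multiplication: the AND of two streams estimates px*py.
        errs = [abs(np.mean(bitstream(px, n, spatially_uniform) &
                            bitstream(py, n, spatially_uniform)) - px * py)
                for _ in range(trials)]
        return float(np.mean(errs))

    n = 128
    print("uniform source:", mean_abs_error(0.75, 0.40, n, True))
    print("i.i.d. source :", mean_abs_error(0.75, 0.40, n, False))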

SESSION: On Automating Heterogeneous Designs (Virtual)

Session details: On Automating Heterogeneous Designs (Virtual)

  • Haocheng Li

A Novel Blockage-Avoiding Macro Placement Approach for 3D ICs Based on POCS

  • Jai-Ming Lin
  • Po-Chen Lu
  • Heng-Yu Lin
  • Jia-Ting Tsai

Although the 3D integrated circuit (IC) placement problem has been studied for many years, few publications have been devoted to macro legalization. Due to the large sizes of macros, the macro placement problem is harder than cell placement, especially when preplaced macros exist in a multi-tier structure. In order to have a more global view, this paper proposes a partitioning-last macro-first flow to handle 3D placement for mixed-size designs, which performs tier partitioning after placement prototyping and then legalizes macros before cell placement. A novel two-step approach is proposed to handle 3D macro placement. The first step determines the locations of macros in a projection plane based on a new representation, named K-tier Partially Occupied Corner Stitching. It not only keeps the prototyping result but also guarantees a legal placement after tier assignment of macros. Next, macros are assigned to their respective tiers by an integer linear programming (ILP) algorithm. Experimental results show that our design flow obtains better solutions than other flows, especially in cases with more preplaced macros.

Routability-Driven Analytical Placement with Precise Penalty Models for Large-Scale 3D ICs

  • Jai-Ming Lin
  • Hao-Yuan Hsieh
  • Hsuan Kung
  • Hao-Jia Lin

The quality of a true 3D placement approach greatly relies on the correctness of the models used in its formulation. However, the models used by previous approaches are not precise enough. Moreover, they do not actually place TSVs, which prevents them from obtaining accurate wirelength estimates and constructing a correct congestion map. Besides, they rarely discuss routability, which is the most important issue considered in 2D placement. To resolve this insufficiency, this paper proposes more accurate models to estimate placement utilization and TSV count using the softmax function, which can align cells to exact tiers. Moreover, we propose a fast parallel algorithm to update the locations of TSVs when cells are moved during optimization. Finally, we present a novel penalty model to estimate the routing overflow of regions covered by cells and inflate cells in congested regions according to this model. Experimental results show that our methodology obtains better results than previous works.

SESSION: Special Session: Quantum Computing to Solve Chemistry, Physics and Security Problems (Virtual)

Session details: Special Session: Quantum Computing to Solve Chemistry, Physics and Security Problems (Virtual)

  • Swaroop Ghosh

Quantum Machine Learning for Material Synthesis and Hardware Security (Invited Paper)

  • Collin Beaudoin
  • Satwik Kundu
  • Rasit Onur Topaloglu
  • Swaroop Ghosh

Using quantum computing, this paper addresses two scientifically pressing and practically relevant problems, namely, chemical retrosynthesis, which is an important step in drug/material discovery, and the security of the semiconductor supply chain. We show that Quantum Long Short-Term Memory (QLSTM) is a viable tool for retrosynthesis. We achieve 65% training accuracy with QLSTM, whereas classical LSTM can achieve 100%. However, in testing we achieve 80% accuracy with the QLSTM, while classical LSTM peaks at only 70% accuracy. We also demonstrate an application of Quantum Neural Networks (QNNs) in the hardware security domain, specifically in Hardware Trojan (HT) detection using a set of power and area Trojan features. The QNN model achieves detection accuracy as high as 97.27%.

Quantum Machine Learning Applications in High-Energy Physics

  • Andrea Delgado
  • Kathleen E. Hamilton

Some of the most significant achievements of the modern era of particle physics, such as the discovery of the Higgs boson, have been made possible by the tremendous effort in building and operating large-scale experiments like the Large Hadron Collider or the Tevatron. In these facilities, the ultimate theory to describe matter at the most fundamental level is constantly probed and verified. These experiments often produce large amounts of data that require storing, processing, and analysis techniques that continually push the limits of traditional information processing schemes. Thus, the High-Energy Physics (HEP) field has benefited from advancements in information processing and the development of algorithms and tools for large datasets. More recently, quantum computing applications have been investigated to understand how the community can benefit from the advantages of quantum information science. Nonetheless, to unleash the full potential of quantum computing, there is a need to understand the quantum behavior and, thus, scale up current algorithms beyond what can be simulated in classical processors. In this work, we explore potential applications of quantum machine learning to data analysis tasks in HEP and how to overcome the limitations of algorithms targeted for Noisy Intermediate-Scale Quantum (NISQ) devices.

SESSION: Making Patterning Work (Virtual)

Session details: Making Patterning Work (Virtual)

  • Yuzhe Ma

DeePEB: A Neural Partial Differential Equation Solver for Post Exposure Baking Simulation in Lithography

  • Qipan Wang
  • Xiaohan Gao
  • Yibo Lin
  • Runsheng Wang
  • Ru Huang

Post Exposure Baking (PEB) has been widely utilized in advanced lithography. PEB simulation is critical in the lithography simulation flow, as it bridges the optical simulation result and the final developed profile in the photoresist. The process of PEB can be described by coupled partial differential equations (PDEs) with corresponding boundary and initial conditions. Recent years have witnessed a growing presence of machine learning algorithms in lithography simulation, while PEB simulation is often ignored or treated with compact models, given the huge cost of solving PDEs exactly. In this work, based on the observation of the physical essence of PEB, we propose DeePEB: a neural PDE solver for PEB simulation. This model is capable of predicting the PEB latent image with high accuracy and >100× acceleration (compared to a commercial rigorous simulation tool), paving the way for efficient and accurate photoresist modeling in lithography simulation and layout optimization.

AdaOPC: A Self-Adaptive Mask Optimization Framework for Real Design Patterns

  • Wenqian Zhao
  • Xufeng Yao
  • Ziyang Yu
  • Guojin Chen
  • Yuzhe Ma
  • Bei Yu
  • Martin D. F. Wong

Optical proximity correction (OPC) is a widely-used resolution enhancement technique (RET) for printability optimization. Recently, rigorous numerical optimization and fast machine learning have been the research focus of OPC in both academia and industry, each of which complements the other in terms of robustness or efficiency. We inspect the pattern distribution on a design layer and find that different sub-regions have different pattern complexity. Besides, we also find that many patterns appear repeatedly in the design layout, and these patterns may possibly share optimized masks. We exploit these properties and propose a self-adaptive OPC framework to improve efficiency. First, we choose different OPC solvers adaptively for patterns of different complexity from an extensible solver pool to reach a speed/accuracy co-optimization. In addition, we prove the feasibility of reusing optimized masks for repeated patterns and hence build a graph-based dynamic pattern library that reuses stored masks to further speed up the OPC flow. Experimental results show that our framework achieves substantial improvement in both performance and efficiency.
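
The mask-reuse idea can be summarized by a toy pattern library; the hash key, the edge-density complexity measure, and the placeholder solver pool below are all assumptions for illustration, not the paper's actual components.

    import numpy as np

    class PatternLibrary:
        # Toy mask-reuse library: patterns are binary layout clips; identical
        # clips map to the same stored mask, new clips are dispatched to a
        # solver chosen by a simple complexity measure and then cached.
        def __init__(self):
            self._masks = {}

        @staticmethod
        def _key(clip):
            return clip.astype(np.uint8).tobytes() + bytes(clip.shape)

        def get_or_optimize(self, clip, solvers):
            key = self._key(clip)
            if key in self._masks:                      # repeated pattern: reuse
                return self._masks[key], "reused"
            # hypothetical complexity measure: edge density of the clip
            complexity = np.abs(np.diff(clip.astype(int), axis=0)).mean()
            solver = solvers["ilt"] if complexity > 0.1 else solvers["ml"]
            mask = solver(clip)
            self._masks[key] = mask
            return mask, "optimized"

    solvers = {"ml": lambda c: c, "ilt": lambda c: c}   # placeholder solvers
    lib = PatternLibrary()
    clip = np.zeros((8, 8), dtype=np.uint8); clip[2:6, 3:5] = 1
    print(lib.get_or_optimize(clip, solvers)[1])        # "optimized"
    print(lib.get_or_optimize(clip.copy(), solvers)[1]) # "reused"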

LayouTransformer: Generating Layout Patterns with Transformer via Sequential Pattern Modeling

  • Liangjian Wen
  • Yi Zhu
  • Lei Ye
  • Guojin Chen
  • Bei Yu
  • Jianzhuang Liu
  • Chunjing Xu

Generating legal and diverse layout patterns to establish large pattern libraries is fundamental for many lithography design applications. Existing pattern generation models typically regard the pattern generation problem as image generation of layout maps and learn to model the patterns by capturing pixel-level coherence, which is insufficient for polygon-level modeling, e.g., the shape and layout of patterns, thus leading to poor generation quality. In this paper, we regard the pattern generation problem as an unsupervised sequence generation problem, in order to learn the pattern design rules by explicitly modeling the shapes of polygons and the layouts among polygons. Specifically, we first propose a sequential pattern representation scheme that fully describes the geometric information of polygons by encoding the 2D layout patterns as sequences of tokens, i.e., vertices and edges. Then we train a sequential generative model to capture the long-term dependency among tokens and thus learn the design rules from training examples. To generate a new pattern in sequence, each token is generated conditioned on the previously generated tokens, which may come from the same polygon or from different polygons in the same layout map. Our framework, termed LayouTransformer, is based on the Transformer architecture due to its remarkable ability in sequence modeling. Comprehensive experiments show that our LayouTransformer not only generates a large amount of legal patterns but also maintains high generation diversity, demonstrating its superiority over existing pattern generative models.
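
A minimal sketch of such a sequential representation is shown below, with a made-up token vocabulary (the paper's exact tokenization may differ): each polygon becomes a delimited run of vertex tokens, and a layout clip is the concatenation of its polygons' runs.

    def polygon_to_tokens(polygon):
        # Encode one rectilinear polygon as a flat token sequence:
        # <BOP>, then "x:<v>" / "y:<v>" tokens per vertex, then <EOP>.
        tokens = ["<BOP>"]
        for x, y in polygon:
            tokens += [f"x:{x}", f"y:{y}"]
        return tokens + ["<EOP>"]

    def layout_to_sequence(polygons):
        # A layout clip is the concatenation of its polygon sequences,
        # terminated by <EOS>; a sequence model is trained on such sequences.
        seq = []
        for poly in polygons:
            seq += polygon_to_tokens(poly)
        return seq + ["<EOS>"]

    clip = [[(0, 0), (4, 0), (4, 2), (0, 2)],      # two rectangles
            [(6, 0), (8, 0), (8, 6), (6, 6)]]
    print(layout_to_sequence(clip))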

WaferHSL: Wafer Failure Pattern Classification with Efficient Human-Like Staged Learning

  • Qijing Wang
  • Martin D. F. Wong

As the demand for semiconductor products increases and integrated circuit (IC) processes become more and more complex, wafer failure pattern classification is gaining more attention from manufacturers and researchers seeking to improve yield. To cope with the real-world scenario in which only very limited labeled data and no unlabeled data are available in the early manufacturing stage of new products, this work proposes an efficient human-like staged learning framework for wafer failure pattern classification named WaferHSL. Inspired by humans’ knowledge acquisition process, a mutually reinforcing task fusion scheme is designed to guide the deep learning model in simultaneously establishing knowledge of spatial relationships, geometric properties, and semantics. Furthermore, a progressive stage controller is deployed to partition and control the learning process, so as to enable human-like progressive advancement in the model. Experimental results show that with only 10% labeled samples and no unlabeled samples, WaferHSL can achieve better results than previous SOTA methods trained with 60% labeled samples and a large number of unlabeled samples, while the improvement is even more significant when using the same size of labeled training set.

SESSION: Advanced Verification Technologies (Virtual)

Session details: Advanced Verification Technologies (Virtual)

  • Takahide Yoshikawa

Combining BMC and Complementary Approximate Reachability to Accelerate Bug-Finding

  • Xiaoyu Zhang
  • Shengping Xiao
  • Jianwen Li
  • Geguang Pu
  • Ofer Strichman

Bounded Model Checking (BMC) is so far considered the best engine for bug-finding in hardware model checking. Given a bound K, BMC can detect whether there is a counterexample to a given temporal property within K steps from the initial state, thus performing a global-style search. Recently, a SAT-based model-checking technique called Complementary Approximate Reachability (CAR) was shown to be complementary to BMC, in the sense that each can frequently solve instances that the other cannot within the same time limit. CAR detects a counterexample gradually with the guidance of an over-approximating state sequence, and performs a local-style search. In this paper, we consider three different ways to combine BMC and CAR. Our experiments show that they all outperform BMC and CAR on their own, and solve instances that cannot be solved by either of these two techniques. Our findings are based on a comprehensive experimental evaluation using the benchmarks of two hardware model checking competitions.

Equivalence Checking of Dynamic Quantum Circuits

  • Xin Hong
  • Yuan Feng
  • Sanjiang Li
  • Mingsheng Ying

Despite the rapid development of quantum computing in recent years, state-of-the-art quantum devices still contain only a limited number of qubits. One possible way to execute more realistic algorithms on near-term quantum devices is to employ dynamic quantum circuits (DQCs). In DQCs, measurements can happen during the circuit, and their outcomes can be processed with classical computers and used to control other parts of the circuit. This technique can help significantly reduce the qubit resources required to implement a quantum algorithm. In this paper, we give a formal definition of DQCs and then characterise their functionality in terms of ensembles of linear operators, following the Kraus representation of superoperators. We further interpret DQCs as tensor networks, implement their functionality as tensor decision diagrams (TDDs), and reduce the equivalence of two DQCs to checking whether they have the same TDD representation. Experiments show that embedding classical logic into conventional quantum circuits does not incur a significant time and space burden.

SESSION: Routing with Cell Movement (Virtual)

Session details: Routing with Cell Movement (Virtual)

  • Guojie Luo

ATLAS: A Two-Level Layer-Aware Scheme for Routing with Cell Movement

  • Xinshi Zang
  • Fangzhou Wang
  • Jinwei Liu
  • Martin D. F. Wong

Placement and routing are two crucial steps in the physical design of integrated circuits (ICs). To close the gap between placement and routing, the problem of routing with cell movement has attracted great attention recently. In this problem, a certain number of cells can be moved to new positions and the nets can be rerouted to improve the total wire length. In this work, we advance the study of this problem by proposing a two-level layer-aware scheme, named ATLAS. A coarse-level cluster-based cell movement is first performed to optimize via usage and provide a better starting point for the subsequent fine-level single-cell movement. To further encourage routing on the upper metal layers, we utilize a set of adjusted layer weights to increase the routing cost on lower layers. Experimental results on the ICCAD 2020 contest benchmarks show that ATLAS achieves much more wire length reduction than the state-of-the-art routing with cell movement engine. Furthermore, applied to the ICCAD 2021 contest benchmarks, ATLAS outperforms the first-place team of the contest with much better solution quality while being 3× faster.
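
The adjusted layer weights can be modeled as a simple weighted wirelength cost; the linear weight ramp and layer count below are illustrative assumptions, not ATLAS's actual weights. Segments on lower metal layers are charged more, so routes that climb to upper layers look cheaper to the optimizer.

    def layer_weighted_cost(segments, num_layers, low_layer_penalty=2.0):
        # segments: (length, layer) pairs of a routed net; lower layers get a
        # larger weight so the cost function steers routing to upper metal.
        cost = 0.0
        for length, layer in segments:
            weight = 1.0 + low_layer_penalty * (num_layers - 1 - layer) / (num_layers - 1)
            cost += weight * length
        return cost

    route_on_lower = [(10, 1), (5, 2)]    # lengths routed on lower metal layers
    route_on_upper = [(10, 6), (5, 7)]    # same lengths on upper metal layers
    print(layer_weighted_cost(route_on_lower, num_layers=8))   # about 39.3
    print(layer_weighted_cost(route_on_upper, num_layers=8))   # about 17.9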

A Robust Global Routing Engine with High-Accuracy Cell Movement under Advanced Constraints

  • Ziran Zhu
  • Fuheng Shen
  • Yangjie Mei
  • Zhipeng Huang
  • Jianli Chen
  • Jun Yang

Placement and routing are typically defined as two separate problems to reduce the design complexity. However, such a divide-and-conquer approach inevitably incurs a degradation of solution quality because the objectives of placement and routing are not entirely consistent. Besides, with various constraints (e.g., timing, R/C characteristics, voltage areas, etc.) imposed by advanced circuit designs, bridging the gap between placement and routing while satisfying the advanced constraints has become more challenging. In this paper, we develop a robust global routing engine with high-accuracy cell movement under advanced constraints to narrow the gap and improve the routing solution. We first present a routing refinement technique to obtain a convergent routing result based on the fixed placement, which provides more accurate information for subsequent cell movement. To achieve fast and high-accuracy position prediction for cell movement, we construct a lookup table (LUT) considering complex constraints/objectives (e.g., routing direction and layer-based power consumption), and generate a timing-driven gain map for each cell based on the LUT. Finally, based on the prediction, we propose an alternating cell movement and cluster movement scheme followed by partial rip-up and reroute to optimize the routing solution. Experimental results on the ICCAD 2020 contest benchmarks show that our algorithm achieves the best total scores among all published works. Compared with the champion of the ICCAD 2021 contest, experimental results on the ICCAD 2021 contest benchmarks show that our algorithm achieves better solution quality in shorter runtime.

SESSION: Special Session: Hardware Security through Reconfigurability: Attacks, Defenses, and Challenges

Session details: Special Session: Hardware Security through Reconfigurability: Attacks, Defenses, and Challenges

  • Michael Raitza

Securing Hardware through Reconfigurable Nano-Structures

  • Nima Kavand
  • Armin Darjani
  • Shubham Rai
  • Akash Kumar

Hardware security has been an ever-growing concern of integrated circuit (IC) designers. Through different stages of the IC design and life cycle, an adversary can extract sensitive design information and private data stored in the circuit using logical, physical, and structural weaknesses. Besides, in recent times, ML-based attacks have become the new de facto standard in the hardware security community. Contemporary defense strategies often face unforeseen challenges in coping with these attack schemes. Additionally, the high overhead of CMOS-based secure add-on circuitry and the intrinsic limitations of these devices indicate the need for new nano-electronics. Emerging reconfigurable devices like Reconfigurable Field Effect Transistors (RFETs) provide unique features to fortify the design against various threats at different stages of the IC design and life cycle. In this manuscript, we investigate the applications of RFETs for securing designs against traditional and machine learning (ML)-based intellectual property (IP) piracy techniques and side-channel attacks (SCAs).

Reconfigurable Logic for Hardware IP Protection: Opportunities and Challenges

  • Luca Collini
  • Benjamin Tan
  • Christian Pilato
  • Ramesh Karri

Protecting the intellectual property (IP) of integrated circuit (IC) designs is becoming a significant concern of fab-less semiconductor design houses. Malicious actors can access the chip design at any stage, reverse engineer the functionality, and create illegal copies. On the one hand, defenders are crafting more and more solutions to hide the critical portions of the circuit. On the other hand, attackers are designing more and more powerful tools to extract useful information from the design and reverse engineer the functionality, especially when they can get access to working chips. In this context, the use of custom reconfigurable fabrics has recently been investigated for hardware IP protection. This paper discusses recent trends in hardware obfuscation with embedded FPGAs, focusing also on the open challenges that must be addressed to make this solution viable.

SESSION: Performance, Power and Temperature Aspects in Deep Learning

Session details: Performance, Power and Temperature Aspects in Deep Learning

  • Callie Hao
  • Jeff Zhang

RT-NeRF: Real-Time On-Device Neural Radiance Fields Towards Immersive AR/VR Rendering

  • Chaojian Li
  • Sixu Li
  • Yang Zhao
  • Wenbo Zhu
  • Yingyan Lin

Neural Radiance Field (NeRF) based rendering has attracted growing attention thanks to its state-of-the-art (SOTA) rendering quality and wide applications in Augmented and Virtual Reality (AR/VR). However, the interactions enabled by immersive, real-time (>30 FPS) NeRF-based rendering are still limited due to the low achievable throughput on AR/VR devices. To this end, we first profile SOTA efficient NeRF algorithms on commercial devices and identify two primary causes of the aforementioned inefficiency: (1) the uniform point sampling and (2) the dense accesses and computations of the required embeddings in NeRF. Furthermore, we propose RT-NeRF, which to the best of our knowledge is the first algorithm-hardware co-design acceleration of NeRF. Specifically, on the algorithm level, RT-NeRF integrates an efficient rendering pipeline that largely alleviates the inefficiency of the commonly adopted uniform point sampling method in NeRF by directly computing the geometry of pre-existing points. Additionally, RT-NeRF leverages a coarse-grained, view-dependent computing-ordering scheme to eliminate the (unnecessary) processing of invisible points. On the hardware level, our proposed RT-NeRF accelerator (1) adopts a hybrid encoding scheme to adaptively switch between a bitmap-based and a coordinate-based sparsity encoding format for NeRF’s sparse embeddings, aiming to maximize the storage savings and thus reduce the required DRAM accesses while supporting efficient NeRF decoding; and (2) integrates both a high-density sparse search unit and a dual-purpose bi-direction adder & search tree to coordinate the two aforementioned encoding formats. Extensive experiments on eight datasets consistently validate the effectiveness of RT-NeRF, achieving a large throughput improvement (e.g., 9.7×~3,201×) while maintaining the rendering quality as compared with SOTA efficient NeRF solutions.
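
The hybrid encoding decision reduces to comparing the storage the two formats need per embedding block; the rough numpy model below uses assumed bit costs rather than the accelerator's exact layout, and simply picks whichever format needs fewer bits for the occupancy information.

    import numpy as np

    def encoding_bits(block, index_bits=16):
        # Storage (in bits) of the occupancy information under two formats:
        # a bitmap (1 bit per entry) versus a coordinate list (index_bits per
        # nonzero).  The payload values are stored either way and ignored here.
        n_total = block.size
        n_nonzero = int(np.count_nonzero(block))
        return {"bitmap": n_total, "coordinate": n_nonzero * index_bits}

    def choose_format(block):
        bits = encoding_bits(block)
        return min(bits, key=bits.get), bits

    rng = np.random.default_rng(1)
    dense_block  = (rng.random(4096) < 0.30).astype(np.int8)   # 30% nonzero
    sparse_block = (rng.random(4096) < 0.01).astype(np.int8)   # 1% nonzero
    print(choose_format(dense_block)[0])    # bitmap wins at this density
    print(choose_format(sparse_block)[0])   # coordinate list wins here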

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

  • Yifan Gong
  • Zheng Zhan
  • Pu Zhao
  • Yushu Wu
  • Chao Wu
  • Caiwen Ding
  • Weiwen Jiang
  • Minghai Qin
  • Yanzhi Wang

During the deployment of deep neural networks (DNNs) on edge devices, many research efforts are devoted to the limited hardware resources. However, little attention is paid to the influence of dynamic power management. As edge devices typically have only a limited energy budget from batteries (rather than the nearly unlimited energy supply of servers or workstations), their dynamic power management often changes the execution frequency, as in the widely-used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed performance, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework that works with dynamic power management using DVFS. The framework can use only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the corresponding pruning ratio for a specific execution frequency (and voltage), we are able to achieve stable inference speed, i.e., keeping the difference in speed performance under various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces their variance of inference latency across frequencies, with the minimal memory consumption of only one model and one soft mask.
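
The reconfiguration step can be pictured as thresholding a single soft mask at a different pruning ratio per DVFS level; the sketch below is a behavioral model with made-up frequency-to-ratio pairs, not the published training procedure.

    import numpy as np

    def configure_model(weights, soft_mask, pruning_ratio):
        # Derive one pruned model from a single weight set and a single soft
        # mask: keep the (1 - pruning_ratio) fraction of weights whose mask
        # scores are largest.
        k = int(round(pruning_ratio * weights.size))
        threshold = np.partition(soft_mask.ravel(), k)[k] if k > 0 else -np.inf
        return weights * (soft_mask > threshold)

    rng = np.random.default_rng(0)
    W, M = rng.standard_normal((64, 64)), rng.random((64, 64))
    ratio_for_frequency = {"1.5GHz": 0.3, "1.0GHz": 0.6, "0.6GHz": 0.8}  # assumed

    for freq, ratio in ratio_for_frequency.items():
        pruned = configure_model(W, M, ratio)
        print(freq, f"{np.mean(pruned == 0):.2f} of weights pruned")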

Robustify ML-Based Lithography Hotspot Detectors

  • Jingyu Pan
  • Chen-Chia Chang
  • Zhiyao Xie
  • Jiang Hu
  • Yiran Chen

Deep learning has been widely applied in various VLSI design automation tasks, from layout quality estimation to design optimization. Though deep learning has shown state-of-the-art performance in several applications, recent studies reveal that deep neural networks exhibit intrinsic vulnerability to adversarial perturbations, which pose risks in the ML-aided VLSI design flow. Among the most effective strategies to improve robustness are regularization approaches, which adjust the optimization objective to make the deep neural network generalize better. In this paper, we examine several adversarial defense methods to improve the robustness of ML-based lithography hotspot detectors. We present an innovative design rule checking (DRC)-guided curvature regularization (CURE) approach, which is customized to robustify ML-based lithography hotspot detectors against white-box attacks. Our approach allows for improvements in both the robustness and the accuracy of the model. Experiments show that the model optimized by DRC-guided CURE achieves the highest robustness and accuracy compared with those trained using the baseline defense methods. Compared with the vanilla model, DRC-guided CURE decreases the average attack success rate by 53.9% and increases the average ROC-AUC by 12.1%. Compared with the best of the defense baselines, DRC-guided CURE reduces the average attack success rate by 18.6% and improves the average ROC-AUC by 4.3%.

Associative Memory Based Experience Replay for Deep Reinforcement Learning

  • Mengyuan Li
  • Arman Kazemi
  • Ann Franchesca Laguna
  • X. Sharon Hu

Experience replay is an essential component of deep reinforcement learning (DRL): it stores past experiences and supplies them to the agent for learning in real time. Recently, prioritized experience replay (PER) has been proven to be powerful and is widely deployed in DRL agents. However, implementing PER on traditional CPU or GPU architectures incurs significant latency overhead due to its frequent and irregular memory accesses. This paper proposes a hardware-software co-design approach to build an associative memory (AM) based PER, AMPER, with an AM-friendly priority sampling operation. AMPER replaces the widely-used, time-costly tree-traversal-based priority sampling in PER while preserving the learning performance. Further, we design an in-memory computing hardware architecture based on AM to support AMPER by leveraging parallel in-memory search operations. AMPER shows comparable learning performance while achieving a 55× to 270× latency improvement when running on the proposed hardware compared to the state-of-the-art PER running on a GPU.
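
Functionally, the priority sampling that PER usually implements with a sum tree, and that AMPER maps to parallel in-memory search, is proportional sampling over the stored priorities. The minimal software model below illustrates that operation only; it is not AMPER and uses an ordinary random-choice routine in place of any hardware search.

    import numpy as np

    class ProportionalReplay:
        # Minimal proportional-priority replay buffer: transitions are sampled
        # with probability proportional to their stored priorities.
        def __init__(self, capacity):
            self.priorities = np.zeros(capacity)
            self.data = [None] * capacity
            self.pos, self.size = 0, 0

        def add(self, transition, priority):
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
            self.pos = (self.pos + 1) % len(self.data)
            self.size = min(self.size + 1, len(self.data))

        def sample(self, batch_size, rng):
            p = self.priorities[:self.size]
            idx = rng.choice(self.size, size=batch_size, p=p / p.sum())
            return idx, [self.data[i] for i in idx]

    rng = np.random.default_rng(0)
    buf = ProportionalReplay(1024)
    for t in range(200):
        buf.add(("state", t), priority=rng.random() + 0.01)
    print(buf.sample(4, rng)[0])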

SESSION: Tutorial: TorchQuantum Case Study for Robust Quantum Circuits

Session details: Tutorial: TorchQuantum Case Study for Robust Quantum Circuits

  • Hanrui Wang

TorchQuantum Case Study for Robust Quantum Circuits

  • Hanrui Wang
  • Zhiding Liang
  • Jiaqi Gu
  • Zirui Li
  • Yongshan Ding
  • Weiwen Jiang
  • Yiyu Shi
  • David Z. Pan
  • Frederic T. Chong
  • Song Han

Quantum Computing has attracted much research attention because of its potential to achieve fundamental speed and efficiency improvements in various domains. Among different quantum algorithms, Parameterized Quantum Circuits (PQC) for Quantum Machine Learning (QML) show promise in realizing quantum advantages on the current Noisy Intermediate-Scale Quantum (NISQ) machines. Therefore, to facilitate QML and PQC research, a recent Python library called TorchQuantum has been released. It can construct, simulate, and train PQC for machine learning tasks with high speed and convenient debugging support. Besides quantum for ML, we want to raise the community’s attention to the reverse direction: ML for quantum. Specifically, the TorchQuantum library also supports using data-driven ML models to solve problems in quantum system research, such as predicting the impact of quantum noise on circuit fidelity and improving the quantum circuit compilation efficiency.

This paper presents a case study of the ML-for-quantum part of TorchQuantum. Since estimating the noise impact on circuit reliability is an essential step toward understanding and mitigating noise, we propose to leverage classical ML to predict the noise impact on circuit fidelity. Inspired by the natural graph representation of quantum circuits, we leverage a graph transformer model to predict the noisy circuit fidelity. We first collect a large dataset with a variety of quantum circuits and obtain their fidelity on noisy simulators and real machines. Then we embed each circuit into a graph with gate and noise properties as node features, and adopt a graph transformer to predict the fidelity. We can thus avoid the exponential cost of classical simulation and efficiently estimate fidelity with polynomial complexity.

Evaluated on 5 thousand random and algorithm circuits, the graph transformer predictor can provide accurate fidelity estimation with an RMSE of 0.04 and outperforms a simple neural network-based model by 0.02 on average. It achieves R2 scores of 0.99 and 0.95 for random and algorithm circuits, respectively. Compared with circuit simulators, the predictor has over 200× speedup for estimating the fidelity. The datasets and predictors can be accessed in the TorchQuantum library.

SESSION: Emerging Machine Learning Primitives: From Technology to Application

Session details: Emerging Machine Learning Primitives: From Technology to Application

  • Dharanidhar Dang
  • Hai Helen Lee

COSIME: FeFET Based Associative Memory for In-Memory Cosine Similarity Search

  • Che-Kai Liu
  • Haobang Chen
  • Mohsen Imani
  • Kai Ni
  • Arman Kazemi
  • Ann Franchesca Laguna
  • Michael Niemier
  • Xiaobo Sharon Hu
  • Liang Zhao
  • Cheng Zhuo
  • Xunzhao Yin

In a number of machine learning models, an input query is searched across the trained class vectors to find the closest feature class vector under the cosine similarity metric. However, performing the cosine similarities between the vectors in von Neumann machines involves a large number of multiplications, Euclidean normalizations, and division operations, thus incurring heavy hardware energy and latency overheads. Moreover, due to the memory wall problem present in conventional architectures, frequent cosine similarity-based searches (CSSs) over the class vectors require substantial data movement, limiting the throughput and efficiency of the system. To overcome the aforementioned challenges, this paper introduces COSIME, a general in-memory associative memory (AM) engine based on the ferroelectric FET (FeFET) device for efficient CSS. By leveraging the one-transistor AND gate function of FeFET devices, a current-based translinear analog circuit, and winner-take-all (WTA) circuitry, COSIME can realize parallel in-memory CSS across all the entries in a memory block and output the closest word to the input query under the cosine similarity metric. Evaluation results at the array level suggest that the proposed COSIME design achieves 333× and 90.5× latency and energy improvements, respectively, and realizes better classification accuracy when compared with an AM design implementing approximated CSS. The proposed in-memory computing fabric is evaluated for an HDC problem, showcasing that COSIME can achieve on average 47.1× and 98.5× speedup and energy efficiency improvements compared with a GPU implementation.
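
Functionally, the operation being accelerated is the following reference search; the numpy model below is only the mathematical specification of CSS, not the FeFET array behavior, and the vector sizes are arbitrary.

    import numpy as np

    def cosine_similarity_search(query, class_vectors):
        # Return the index of the stored class vector closest to the query
        # under cosine similarity, plus all similarity scores.
        q = query / np.linalg.norm(query)
        v = class_vectors / np.linalg.norm(class_vectors, axis=1, keepdims=True)
        scores = v @ q                     # one normalized dot product per word
        return int(np.argmax(scores)), scores

    rng = np.random.default_rng(0)
    classes = rng.standard_normal((16, 512))               # 16 stored class vectors
    query = classes[7] + 0.1 * rng.standard_normal(512)    # noisy copy of vector 7
    best, _ = cosine_similarity_search(query, classes)
    print(best)                                            # expected: 7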

DynaPAT: A Dynamic Pattern-Aware Encoding Technique for Robust MLC PCM-Based Deep Neural Networks

  • Thai-Hoang Nguyen
  • Muhammad Imran
  • Joon-Sung Yang

As the effectiveness of Deep Neural Networks (DNNs) rises over time, so does the need for highly scalable and efficient hardware architectures to capitalize on this effectiveness in many practical applications. The emerging non-volatile Phase Change Memory (PCM) technology has been found to be a promising candidate for future memory systems due to its better scalability, non-volatility, and low leakage/dynamic power consumption compared to conventional charge-based memories. Additionally, with its cell’s wide resistance span, PCM also has Flash-like Multi-Level Cell (MLC) capability, which enhances storage density and provides an opportunity for deploying data-intensive applications such as DNNs on resource-constrained edge devices. However, the practical deployment of MLC PCM is hampered by certain reliability challenges, among which resistance drift is considered a critical concern. In a DNN application, the presence of resistance drift in MLC PCM can severely impact the DNN’s accuracy if no drift-error-tolerance technique is utilized. This paper proposes DynaPAT, a low-cost and effective pattern-aware encoding technique to enhance the drift-error tolerance of MLC PCM-based Deep Neural Networks. DynaPAT is built on an insight into the DNN’s vulnerability to different data-pattern switching. Based on this insight, DynaPAT efficiently maps the most-frequent data pattern in the DNN’s parameters onto the least-drift-prone level of the MLC PCM, thus significantly enhancing the robustness of the system against drift errors. Various experiments on different DNN models and configurations demonstrate the effectiveness of DynaPAT. The experimental results indicate that DynaPAT can achieve up to a 500× enhancement in drift-error-tolerance capability over the baseline MLC PCM-based DNN while requiring only a negligible hardware overhead (below 1% storage overhead). Being orthogonal, DynaPAT can be integrated with existing drift-tolerance schemes for even higher gains in reliability.
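
The core mapping step can be sketched as counting 2-bit pattern frequencies in the quantized weights and pairing them with MLC levels ordered by drift proneness; the level ordering used below is a made-up assumption, not measured PCM data, and the weight distribution is synthetic.

    import numpy as np

    def build_pattern_mapping(weights_uint8, drift_prone_order):
        # Count how often each 2-bit pattern occurs in the weight bytes and map
        # the most frequent pattern to the least drift-prone MLC level.
        pairs = np.stack([(weights_uint8 >> s) & 0b11 for s in (0, 2, 4, 6)])
        counts = np.bincount(pairs.ravel(), minlength=4)
        by_frequency = np.argsort(counts)[::-1]            # most frequent first
        return {int(p): lvl for p, lvl in zip(by_frequency, drift_prone_order)}

    rng = np.random.default_rng(0)
    weights = np.clip(rng.normal(0, 12, 4096), -127, 127).astype(np.int8).view(np.uint8)
    # hypothetical MLC levels listed from least to most drift-prone
    mapping = build_pattern_mapping(weights, drift_prone_order=[0, 3, 1, 2])
    print(mapping)   # pattern -> level assignment, depends on the weight statistics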

Graph Neural Networks for Idling Error Mitigation

  • Vedika Servanan
  • Samah Mohamed Saeed

Dynamical Decoupling (DD)-based protocols have been shown to reduce the idling errors encountered in quantum circuits. However, current research on suppressing idling qubit errors suffers from scalability issues due to the large number of tuning quantum circuits that must be executed first to find the DD-sequence locations in the target quantum circuit that boost the output state fidelity. This process becomes tedious as the size of the quantum circuit increases. To address this challenge, we propose a Graph Neural Network (GNN) framework, which mitigates idling errors through an efficient insertion of DD sequences into quantum circuits by modeling their impact at different idle qubit windows. Our paper targets maximizing the benefit of DD sequences using a limited number of tuning circuits. We propose to classify the idle qubit windows into critical and non-critical (benign) windows using a data-driven reliability model. Our results obtained from the IBM Lagos quantum computer show that our proposed GNN models, which determine the locations of DD sequences in the quantum circuits, significantly improve the output state fidelity by a factor of 1.4× on average and up to 2.6× compared to the adaptive DD approach, which searches for the best locations of DD sequences at run-time.

Quantum Neural Network Compression

  • Zhirui Hu
  • Peiyan Dong
  • Zhepeng Wang
  • Youzuo Lin
  • Yanzhi Wang
  • Weiwen Jiang

Model compression, such as pruning and quantization, has been widely applied to optimize neural networks on resource-limited classical devices. Recently, there has been growing interest in variational quantum circuits (VQC), that is, a type of neural network on quantum computers (a.k.a., quantum neural networks). It is well known that near-term quantum devices have high noise and limited resources (i.e., quantum bits, qubits); yet, how to compress quantum neural networks has not been thoroughly studied. One might think it is straightforward to apply classical compression techniques to quantum scenarios. However, this paper reveals that there exist differences between the compression of quantum and classical neural networks. Based on our observations, we claim that compilation/transpilation has to be involved in the compression process. On top of this, we propose the first systematic framework, namely CompVQC, to compress quantum neural networks (QNNs). In CompVQC, the key component is a novel compression algorithm based on the alternating direction method of multipliers (ADMM) approach. Experiments demonstrate the advantage of CompVQC, reducing the circuit depth (by almost over 2.5×) with a negligible accuracy drop (<1%), which outperforms other competitors. Moreover, CompVQC can indeed improve the robustness of the QNN on near-term noisy quantum devices.

SESSION: Design for Low Energy, Low Resource, but High Quality

Session details: Design for Low Energy, Low Resource, but High Quality

  • Ravikumar Chakaravarthy
  • Cong “Callie” Hao

Squeezing Accumulators in Binary Neural Networks for Extremely Resource-Constrained Applications

  • Azat Azamat
  • Jaewoo Park
  • Jongeun Lee

The cost and power consumption of BNN (Binarized Neural Network) hardware is dominated by additions. In particular, accumulators account for a large fraction of the hardware overhead, which could be effectively reduced by using reduced-width accumulators. However, it is not straightforward to find the optimal accumulator width due to the complex interplay between width, scale, and the effect of training. In this paper, we present algorithmic and hardware-level methods to find the optimal accumulator size for BNN hardware with minimal impact on the quality of results. First, we present partial sum scaling, a top-down approach to minimize the BNN accumulator size based on advanced quantization techniques. We also present an efficient, zero-overhead hardware design for partial sum scaling. Second, we evaluate a bottom-up approach that uses a saturating accumulator, which is more robust against overflows. Our experimental results using the CIFAR-10 dataset demonstrate that our partial sum scaling, along with our optimized accumulator architecture, can reduce the area and power consumption of the datapath by 15.50% and 27.03%, respectively, with little impact on inference performance (less than 2%), compared to using a 16-bit accumulator.
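
The bottom-up option can be modeled behaviorally: accumulate the +/-1 partial products into a narrow signed accumulator that clamps instead of wrapping. The sketch below is such a model with an arbitrary product distribution, not the paper's hardware design or its partial sum scaling.

    import numpy as np

    def saturating_accumulate(products, width):
        # Accumulate +/-1 partial products into a signed accumulator of the
        # given bit width, clamping on overflow instead of wrapping around.
        hi, lo = 2 ** (width - 1) - 1, -2 ** (width - 1)
        acc = 0
        for p in products:
            acc = min(hi, max(lo, acc + int(p)))
        return acc

    rng = np.random.default_rng(0)
    xnor_products = rng.choice([-1, 1], size=1024, p=[0.45, 0.55])
    print(int(np.sum(xnor_products)))                        # full-precision reference
    print(saturating_accumulate(xnor_products, width=8))     # usually matches
    print(saturating_accumulate(xnor_products, width=6))     # clamps near the 6-bit limit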

WSQ-AdderNet: Efficient Weight Standardization Based Quantized AdderNet FPGA Accelerator Design with High-Density INT8 DSP-LUT Co-Packing Optimization

  • Yunxiang Zhang
  • Biao Sun
  • Weixiong Jiang
  • Yajun Ha
  • Miao Hu
  • Wenfeng Zhao

Convolutional neural networks (CNNs) have been widely adopted for various machine intelligence tasks. Nevertheless, CNNs are still known to be computationally demanding due to the convolutional kernels involving expensive Multiply-ACcumulate (MAC) operations. Recent proposals on hardware-optimal neural network architectures suggest that AdderNet, with a lightweight 1-norm based feature extraction kernel, can be an efficient alternative to the CNN counterpart, where the expensive MAC operations are substituted with efficient Sum-of-Absolute-Difference (SAD) operations. Nevertheless, AdderNet lacks an efficient hardware implementation methodology compared to the existing methodologies for CNNs, including efficient quantization, full-integer accelerator implementation, and judicious resource utilization of the DSP slices of FPGA devices. In this paper, we present WSQ-AdderNet, a generic framework to quantize and optimize AdderNet-based accelerator designs on embedded FPGA devices. First, we propose a weight standardization technique to facilitate weight quantization in AdderNet. Second, we demonstrate a full-integer quantization hardware implementation strategy, including weight and activation quantization methodologies. Third, we apply DSP packing optimization to maximize the DSP utilization efficiency, where Octo-INT8 can be achieved via DSP-LUT co-packing. Finally, we implement the design using Xilinx Vitis HLS (high-level synthesis) and Vivado, targeting the Xilinx Kria KV-260 FPGA. Our experimental results on ResNet-20 using WSQ-AdderNet demonstrate that the implementations achieve 89.9% inference accuracy with INT8 implementation, which shows little performance loss compared to the FP32 and INT8 CNN designs. At the hardware level, WSQ-AdderNet achieves up to 3.39× DSP density improvement with nearly the same throughput as the INT8 CNN design. The reduction in DSP utilization makes it possible to deploy large network models on resource-constrained devices. When further scaling up the PE sizes by 39.8%, WSQ-AdderNet can achieve 1.48× throughput improvement while still achieving 2.42× DSP density improvement.
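
The SAD kernel that replaces MAC in AdderNet can be written directly; the single-channel, stride-1 numpy sketch below is a functional reference only, not the quantized FPGA datapath described in the paper.

    import numpy as np

    def sad_feature_map(image, kernel):
        # AdderNet-style feature extraction: replace multiply-accumulate with
        # the negative sum of absolute differences between each input window
        # and the kernel.  Single channel, stride 1, no padding.
        kh, kw = kernel.shape
        H, W = image.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                window = image[i:i + kh, j:j + kw]
                out[i, j] = -np.abs(window - kernel).sum()
        return out

    rng = np.random.default_rng(0)
    img, k = rng.random((8, 8)), rng.random((3, 3))
    print(sad_feature_map(img, k).shape)        # (6, 6)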

Low-Cost 7T-SRAM Compute-in-Memory Design Based on Bit-Line Charge-Sharing Based Analog-to-Digital Conversion

  • Kyeongho Lee
  • Joonhyung Kim
  • Jongsun Park

Although compute-in-memory (CIM) is considered one of the promising solutions to overcome the memory wall problem, the variations in analog voltage computation and the analog-to-digital converter (ADC) cost remain design challenges. In this paper, we present a 7T SRAM CIM that seamlessly supports multiply-accumulate (MAC) operations between 4-bit inputs and 8-bit weights. In the proposed CIM, highly parallel and robust MAC operations are enabled by exploiting the bit-line charge-sharing scheme to simultaneously process multiple inputs. For the readout of analog MAC values, instead of adopting the conventional ADC structure, the bit-line charge-sharing is efficiently used to reduce the implementation cost of the reference voltage generation. Based on the in-SRAM reference voltage generation and the parallel analog readout in all columns, the proposed CIM efficiently reduces ADC power and area cost. In addition, variation models from Monte-Carlo simulations are used during training to reduce the accuracy drop due to process variations. The implementation of a 256×64 7T SRAM CIM using a 28nm CMOS process shows that it operates in the wide voltage range from 0.6V to 1.2V with an energy efficiency of 45.8 TOPS/W at 0.6V.

SESSION: Microarchitectural Attacks and Countermeasures

Session details: Microarchitectural Attacks and Countermeasures

  • Rajesh JS
  • Amin Rezaei

Speculative Load Forwarding Attack on Modern Processors

  • Hasini Witharana
  • Prabhat Mishra

Modern processors deliver high performance by utilizing advanced features such as out-of-order execution, branch prediction, speculative execution, and sophisticated buffer management. Unfortunately, these techniques have introduced diverse vulnerabilities including Spectre, Meltdown, and microarchitectural data sampling (MDS). Although Spectre and Meltdown can leak data via memory side channels, MDS has been shown to leak data from the CPU internal buffers in Intel architectures. AMD has reported that its processors are not vulnerable to MDS/Meltdown-type attacks. In this paper, we present a Meltdown/MDS-type attack to leak data from the load queue in AMD Zen family architectures. To the best of our knowledge, our approach is the first attempt at developing an attack on AMD architectures that uses speculative load forwarding to leak data through the load queue. Experimental evaluation demonstrates that our proposed attack is successful on multiple machines with AMD processors. We also explore a lightweight mitigation to defend against speculative load forwarding attacks on modern processors.

Fast, Robust and Accurate Detection of Cache-Based Spectre Attack Phases

  • Arash Pashrashid
  • Ali Hajiabadi
  • Trevor E. Carlson

Modern processors achieve high performance and efficiency by employing techniques such as speculative execution and sharing resources such as caches. However, recent attacks like Spectre and Meltdown exploit the speculative execution of modern processors to leak sensitive information from the system. Many mitigation strategies have been proposed to restrict the speculative execution of processors and protect potential side-channels. Currently, these techniques have shown a significant performance overhead. A solution that can detect memory leaks before the attacker has a chance to exploit them would allow the processor to reduce the performance overhead by enabling protections only when the system is at risk.

In this paper, we propose a mechanism to detect speculative execution attacks that use caches as a side-channel. In this detector, we track the phases of a successful attack and raise an alert before the attacker gets a chance to recover sensitive information. We accomplish this by monitoring the microarchitectural changes in the core and caches and detecting the memory locations that are potential memory data leaks. We achieve 100% accuracy and a negligible false positive rate in detecting Spectre attacks and evasive versions of Spectre that state-of-the-art detectors are unable to detect. Our detector has no performance overhead, with negligible power and area overheads.

CASU: Compromise Avoidance via Secure Update for Low-End Embedded Systems

  • Ivan De Oliveira Nunes
  • Sashidhar Jakkamsetti
  • Youngil Kim
  • Gene Tsudik

Guaranteeing runtime integrity of embedded system software is an open problem. Trade-offs between security and other priorities (e.g., cost or performance) are inherent, and resolving them is both challenging and important. The proliferation of runtime attacks that introduce malicious code (e.g., by injection) into embedded devices has prompted a range of mitigation techniques. One popular approach is Remote Attestation (RA), whereby a trusted entity (verifier) checks the current software state of an untrusted remote device (prover). RA yields a timely authenticated snapshot of prover state that verifier uses to decide whether an attack occurred.

Current RA schemes require verifier to explicitly initiate RA, based on some unclear criteria. Thus, in case of prover’s compromise, verifier only learns about it late, upon the next RA instance. While sufficient for compromise detection, some applications would benefit from a more proactive, prevention-based approach. To this end, we construct CASU: Compromise Avoidance via Secure Updates. CASU is an inexpensive hardware/software co-design enforcing: (i) runtime software immutability, thus precluding any illegal software modification, and (ii) authenticated updates as the sole means of modifying software. In CASU, a successful RA instance serves as a proof of successful update, and continuous subsequent software integrity is implicit, due to the runtime immutability guarantee. This obviates the need for RA in between software updates and leads to unobtrusive integrity assurance with guarantees akin to those of prior RA techniques, with better overall performance.

SESSION: Genetic Circuits Meet Ising Machines

Session details: Genetic Circuits Meet Ising Machines

  • Marc Riedel
  • Lei Yang

Technology Mapping of Genetic Circuits: From Optimal to Fast Solutions

  • Tobias Schwarz
  • Christian Hochberger

Synthetic Biology aims to create biological systems from scratch that do not exist in nature. An important method in this context is the engineering of DNA sequences such that cells realize Boolean functions that serve as control mechanisms in biological systems, e.g., in medical or agricultural applications. Libraries of logic gates exist as predefined gene sequences, based on the genetic mechanism of transcriptional regulation. Each individual gate is composed of different biological parts to allow for the differentiation of their output signals. Even gates of the same logic type therefore exhibit different transfer characteristics, i.e., the relation between input and output signals. Thus, simulation of the whole network of genetic gates is needed to determine the performance of a genetic circuit. This makes mapping Boolean functions to these libraries much more complicated than in EDA. Yet, optimal results are desired in the design phase due to high lab implementation costs. In this work, we identify fundamental features of the transfer characteristics of gates based on transcriptional regulation, which is widely used in genetic gate technologies. Based on this, we present novel exact (Branch-and-Bound) and heuristic (Branch-and-Bound, Simulated Annealing) algorithms for the problem of technology mapping of genetic circuits and evaluate them using a prominent gate library. In contrast to state-of-the-art tools, all obtained solutions feature a (near) optimal output performance. Our exact method explores only 6.5% of the design space, and the heuristics as little as 0.2%.

DaS: Implementing Dense Ising Machines Using Sparse Resistive Networks

  • Naomi Sagan
  • Jaijeet Roychowdhury

Ising machines have generated much excitement in recent years due to their promise for solving hard combinatorial optimization problems. However, achieving physical all-to-all connectivity in IC implementations of large, densely-connected Ising machines remains a key challenge. We present a novel approach, DaS, that uses low-rank decomposition to achieve effectively-dense Ising connectivity using only sparsely interconnected hardware. The innovation consists of two components. First, we use the SVD to find a low-rank approximation of the Ising coupling matrix while maintaining very high accuracy. This decomposition requires substantially fewer nonzeros to represent the dense Ising coupling matrix. Second, we develop a method to translate the low-rank decomposition to a hardware implementation that uses only sparse resistive interconnections. We validate DaS on the MU-MIMO detection problem, important in modern telecommunications. Our results indicate that as problem sizes scale, DaS can achieve dense Ising coupling using only 5%-20% of the resistors needed for brute-force dense connections (which would be physically infeasible in ICs). We also outline a crossbar-style physical layout scheme for realizing sparse resistive networks generated by DaS.
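
The first DaS component, the low-rank approximation, can be reproduced with a plain truncated SVD; the sketch below uses a synthetic low-rank coupling matrix and simply counts the nonzero coefficients needed by the dense versus factored forms. It says nothing about the resistive-network translation or the MU-MIMO evaluation.

    import numpy as np

    def low_rank_coupling(J, rank):
        # Truncated SVD: J ~ U_r diag(s_r) V_r^T.  The factored form needs far
        # fewer stored coefficients than the dense all-to-all coupling matrix.
        U, s, Vt = np.linalg.svd(J)
        J_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        dense_nnz = int(np.count_nonzero(J))
        factored_nnz = U[:, :rank].size + rank + Vt[:rank].size
        return J_hat, dense_nnz, factored_nnz

    rng = np.random.default_rng(0)
    B = rng.standard_normal((256, 12))
    J = B @ B.T                      # synthetic dense coupling with true rank 12
    J_hat, dense_nnz, factored_nnz = low_rank_coupling(J, rank=12)
    print(np.allclose(J, J_hat), dense_nnz, factored_nnz)   # True, 65536 vs 6156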

QuBRIM: A CMOS Compatible Resistively-Coupled Ising Machine with Quantized Nodal Interactions

  • Yiqiao Zhang
  • Uday Kumar Reddy Vengalam
  • Anshujit Sharma
  • Michael Huang
  • Zeljko Ignjatovic

Physical Ising machines have been shown to solve combinatorial optimization problems with orders-of-magnitude improvements in speed and energy efficiency over von Neumann systems. However, building such a system is still in its infancy, and a scalable, robust implementation remains challenging. CMOS-compatible electronic Ising machines (e.g., [1]) are promising, as the mature technology helps bring scale, speed, and energy efficiency to the dynamical system. However, subtle issues can arise when using voltage-controlled transistors to act as programmable resistive coupling. In this paper, we propose a version of a resistively-coupled Ising machine using quantized nodal interactions (QuBRIM), which significantly improves the predictability of the coupling resistors. The functionality of QuBRIM is demonstrated by solving the well-known Max-Cut problem using both behavioral and circuit-level simulations in a 45 nm CMOS technology node. We show that the dynamical system naturally seeks local minima in the objective function’s energy landscape and that, by applying spin-fix annealing, the system reaches a global minimum with high probability.

SESSION: Energy Efficient Neural Networks via Approximate Computations

Session details: Energy Efficient Neural Networks via Approximate Computations

  • M. Hasan Najafi
  • Vidya Chabria

Combining Gradients and Probabilities for Heterogeneous Approximation of Neural Networks

  • Elias Trommer
  • Bernd Waschneck
  • Akash Kumar

This work explores the search for heterogeneous approximate multiplier configurations for neural networks that produce high accuracy and low energy consumption. We discuss the validity of additive Gaussian noise added to accurate neural network computations as a surrogate model for behavioral simulation of approximate multipliers. The continuous and differentiable properties of the solution space spanned by the additive Gaussian noise model are used as a heuristic that generates meaningful estimates of layer robustness without the need for combinatorial optimization techniques. Instead, the amount of noise injected into the accurate computations is learned during network training using backpropagation. A probabilistic model of the multiplier error is presented to bridge the gap between the two domains; the model estimates the standard deviation of the approximate multiplier error, connecting solutions in the additive Gaussian noise space to actual hardware instances. Our experiments show that the combination of heterogeneous approximation and neural network retraining reduces the energy consumption for multiplications by 70% to 79% for different ResNet variants on the CIFAR-10 dataset with a Top-1 accuracy loss below one percentage point. For the more complex Tiny ImageNet task, our VGG16 model achieves a 53% reduction in energy consumption with a drop in Top-5 accuracy of 0.5 percentage points. We further demonstrate that our error model can predict the parameters of an approximate multiplier in the context of the commonly used additive Gaussian noise (AGN) model with high accuracy. Our software implementation is available at https://github.com/etrommer/agn-approx.
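
A minimal sketch of the surrogate idea, assuming a stand-in behavioral multiplier model (`approx_mul` below is hypothetical): estimate the standard deviation of the multiplier error empirically, then inject zero-mean Gaussian noise of that spread into the exact products.

```python
import numpy as np

def approx_mul(a, b):
    """Stand-in behavioral model of an approximate multiplier:
    here, simply truncate the low bits of the exact product (illustrative only)."""
    return (a * b) & ~0x7          # drop the 3 LSBs

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=100_000)
b = rng.integers(0, 256, size=100_000)

err = approx_mul(a, b) - a * b     # multiplier error samples
sigma = err.std()                  # standard deviation used by the AGN surrogate

# AGN surrogate: exact product plus zero-mean Gaussian noise of matching spread
exact = a * b
agn_products = exact + rng.normal(0.0, sigma, size=exact.shape)
```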

Tunable Precision Control for Approximate Image Filtering in an In-Memory Architecture with Embedded Neurons

  • Ayushi Dube
  • Ankit Wagle
  • Gian Singh
  • Sarma Vrudhula

This paper presents a novel hardware-software co-design consisting of a Processing-in-Memory (PiM) architecture with embedded neural processing elements (NPEs) that are highly reconfigurable. The PiM platform and the proposed approximation strategies are employed for various image filtering applications while providing the user with fine-grain dynamic control over energy efficiency, precision, and throughput (EPT). The proposed co-design can vary the Peak Signal-to-Noise Ratio (PSNR, the output quality metric for image filtering) from 25 dB to 50 dB (an acceptable PSNR range for such applications) without incurring any extra cost in energy or latency. When switching from the accurate to the approximate mode of computation, the maximum improvement in energy efficiency and throughput is 2X. Against a MAC-based PE array on the proposed memory platform, the gains in energy efficiency are 3X-6X, with corresponding throughput improvements of 2.26X-4.52X.
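
For reference, PSNR, the quality metric quoted above, can be computed as follows (8-bit images assumed):

```python
import numpy as np

def psnr(reference, approx, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference image and its
    approximate counterpart (both as arrays of 8-bit pixel values)."""
    mse = np.mean((reference.astype(np.float64) - approx.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")        # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```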

AppGNN: Approximation-Aware Functional Reverse Engineering Using Graph Neural Networks

  • Tim Bücher
  • Lilas Alrahis
  • Guilherme Paim
  • Sergio Bampi
  • Ozgur Sinanoglu
  • Hussam Amrouch

The globalization of the Integrated Circuit (IC) market is attracting an ever-growing number of partners while remarkably lengthening the supply chain. As a result, security concerns, such as those posed by functional Reverse Engineering (RE), have become paramount. RE leads to the disclosure of confidential information to competitors, potentially enabling the theft of intellectual property. Traditional functional RE methods analyze a given gate-level netlist by employing pattern matching to reconstruct the underlying basic blocks and, hence, reverse engineer the circuit's function.

In this work, we are the first to demonstrate that applying Approximate Computing (AxC) principles to circuits significantly improves their resiliency against RE. This is attributed to the increased complexity of the underlying pattern-matching process. The resiliency holds even against Graph Neural Networks (GNNs), presently among the most powerful state-of-the-art techniques in functional RE. Using AxC, we demonstrate a substantial reduction in GNN average classification accuracy, from 98% to a mere 53%. To surmount the challenges introduced by AxC in RE, we propose the AppGNN platform, which enables GNNs (trained only on exact circuits) to: (i) perform accurate classifications, and (ii) reverse engineer the circuit functionality, notwithstanding the applied approximation technique. AppGNN accomplishes this by implementing a novel graph-based node sampling approach that mimics generic approximation methodologies, requiring zero knowledge of the targeted approximation type.

We perform an extensive evaluation targeting wide-ranging adder and multiplier circuits that are approximated using various AxC techniques, including state-of-the-art evolutionary-based approaches. We show that our method improves the classification accuracy from 53% to 81% when classifying approximate adder circuits generated using evolutionary algorithms, even though our method is oblivious to them. Our AppGNN framework is publicly available at https://github.com/ML-CAD/AppGNN.

Seprox: Sequence-Based Approximations for Compressing Ultra-Low Precision Deep Neural Networks

  • Aradhana Mohan Parvathy
  • Sarada Krithivasan
  • Sanchari Sen
  • Anand Raghunathan

Compression techniques such as quantization and pruning are indispensable for deploying state-of-the-art Deep Neural Networks (DNNs) on resource-constrained edge devices. Quantization is widely used in practice: many commercial platforms already support 8-bit precision, with recent trends towards ultra-low precision (4 bits and below). Pruning, which increases network sparsity (the incidence of zero-valued weights), enables compression by storing only the nonzero weights and their indices. Unfortunately, the compression benefits of pruning deteriorate or even vanish in ultra-low precision DNNs. This is due to (i) the unfavorable tradeoff between the number of bits needed to store a weight (which reduces with lower precision) and the number of bits needed to encode an index (which remains unchanged), and (ii) the lower sparsity levels that are achievable at lower precisions.

We propose Seprox, a new compression scheme that overcomes the aforementioned challenges by exploiting two key observations about ultra-low precision DNNs. First, with lower precision, fewer distinct weight values are possible, leading to an increased incidence of frequently-occurring weights and weight sequences. Second, some weight values occur rarely and can be eliminated by replacing them with similar values. Leveraging these insights, Seprox encodes frequently-occurring weight sequences (as opposed to individual weights) while using the eliminated weight values to encode them, thereby avoiding indexing overheads and achieving higher compression. Additionally, Seprox uses approximation techniques to increase the frequencies of the encoded sequences. Across six ultra-low precision DNNs trained on the CIFAR-10 and ImageNet datasets, Seprox achieves model compression, energy improvements, and speed-ups of up to 35.2%, 14.8%, and 18.2%, respectively.
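
The first observation, that ultra-low precision concentrates weights on a few frequent values and sequences, can be checked with a short script; the sequence length and weight distribution below are illustrative, not Seprox's actual encoding.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
# toy 4-bit weights with a bell-shaped distribution, as in a trained DNN layer
weights = np.clip(np.round(rng.normal(0, 2, size=10_000)), -8, 7).astype(int)

# frequency of individual values and of length-2 sequences
value_freq = Counter(weights.tolist())
pair_freq = Counter(zip(weights[:-1].tolist(), weights[1:].tolist()))

rare = [v for v, c in value_freq.items() if c < 0.01 * len(weights)]
print("rare values that could be remapped:", rare)
print("most common pairs:", pair_freq.most_common(5))
```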

SESSION: Algorithms and Tools for Security Analysis and Secure Hardware Design

Session details: Algorithms and Tools for Security Analysis and Secure Hardware Design

  • Rosario Cammarota
  • Satwik Patnaik

Evaluating the Security of eFPGA-Based Redaction Algorithms

  • Amin Rezaei
  • Raheel Afsharmazayejani
  • Jordan Maynard

Hardware IP owners must envision procedures to avoid piracy and overproduction of their designs under a fabless paradigm. A recently proposed technique to obfuscate critical components in a logic design is eFPGA-based redaction, which replaces a sensitive sub-circuit with an embedded FPGA that is configured to perform the same functionality as the missing sub-circuit. In this case, the configuration bitstream acts as a hidden key known only to the hardware IP owner. In this paper, we first evaluate the security promise of existing eFPGA-based redaction algorithms as a preliminary study. Then, we break eFPGA-based redaction schemes with an initial but not necessarily efficient attack named DIP Exclusion, which excludes problematic input patterns from checking in a brute-force manner. Finally, by combining cycle breaking and unrolling, we propose a novel and powerful attack called Break & Unroll that is able to recover the bitstream of state-of-the-art eFPGA-based redaction schemes in a relatively short time, even in the presence of hard cycles and large key sizes. This study reveals that the common perception that eFPGA-based redaction is secure by default against oracle-guided attacks is unfounded. It also shows that additional research on how to systematically create an exponential number of non-combinational hard cycles is required to secure eFPGA-based redaction schemes.

An Approach to Unlocking Cyclic Logic Locking: LOOPLock 2.0

  • Pei-Pei Chen
  • Xiang-Min Yang
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

Cyclic logic locking is a new class of SAT-resistant techniques in hardware security. Recently, LOOPLock 2.0 was proposed, a cyclic logic locking method that deliberately creates cycles in the locked circuit to resist SAT Attack, CycSAT, BeSAT, and Removal Attack simultaneously. The key idea of LOOPLock 2.0 is that the resultant circuit remains cyclic regardless of whether the key vector is correct. This property thwarts attackers and has demonstrated its success in defending against them. In this paper, we propose an unlocking approach to LOOPLock 2.0 based on structural analysis and SAT solvers. Specifically, we identify and remove non-combinational cycles in the locked circuit before running SAT solvers. The experimental results show that the proposed unlocking approach is promising.
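
A minimal sketch of the structural step, assuming the locked netlist is modeled as a plain directed graph (an assumption for illustration, not the authors' data structure):

```python
import networkx as nx

def find_cycles(netlist_edges):
    """Return the simple cycles of a gate-level netlist modeled as a
    directed graph (gate -> gate edges). Cycles that do not correspond
    to valid combinational feedback are candidates for removal before
    invoking the SAT solver."""
    g = nx.DiGraph(netlist_edges)
    return list(nx.simple_cycles(g))

# toy locked netlist with one deliberate cycle g1 -> g2 -> g3 -> g1
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g1"), ("g3", "out")]
print(find_cycles(edges))   # -> [['g1', 'g2', 'g3']] (order may vary)
```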

Garbled EDA: Privacy Preserving Electronic Design Automation

  • Mohammad Hashemi
  • Steffi Roy
  • Fatemeh Ganji
  • Domenic Forte

The complexity of modern integrated circuits (ICs) necessitates collaboration between multiple distrusting parties, including third-party intellectual property (3PIP) vendors, design houses, CAD/EDA tool vendors, and foundries, which jeopardizes the confidentiality and integrity of each party's IP. IP protection standards and the existing techniques proposed by researchers are ad hoc and vulnerable to numerous structural, functional, and/or side-channel attacks. Our framework, Garbled EDA, proposes an alternative direction by formulating the problem in a secure multi-party computation setting, where the privacy of IPs, CAD tools, and process design kits (PDKs) is maintained. As a proof of concept, Garbled EDA is evaluated in the context of simulation, where multiple IP description formats (Verilog, C, S) are supported. Our results demonstrate a reasonable logical-resource cost and negligible memory overhead. To further reduce the overhead, we present another efficient implementation methodology, feasible when resource utilization is a bottleneck but communication between the two parties is not restricted. Interestingly, this implementation is private and secure even in the presence of malicious adversaries attempting to, e.g., gain access to the PDKs or in-house IPs of the CAD tool providers.

Don’t CWEAT It: Toward CWE Analysis Techniques in Early Stages of Hardware Design

  • Baleegh Ahmad
  • Wei-Kai Liu
  • Luca Collini
  • Hammond Pearce
  • Jason M. Fung
  • Jonathan Valamehr
  • Mohammad Bidmeshki
  • Piotr Sapiecha
  • Steve Brown
  • Krishnendu Chakrabarty
  • Ramesh Karri
  • Benjamin Tan

To help prevent hardware security vulnerabilities from propagating to later design stages, where fixes are costly, it is crucial to identify security concerns as early as possible, such as in RTL designs. In this work, we investigate the practical implications and feasibility of producing a set of security-specific scanners that operate on Verilog source files. The scanners indicate parts of code that might contain one of a set of MITRE's Common Weakness Enumerations (CWEs). We explore the CWE database to characterize the scope and attributes of the CWEs and identify those that are amenable to static analysis. We prototype scanners and evaluate them on 11 open-source designs – 4 system-on-chips (SoCs) and 7 processor cores – and explore the nature of the identified weaknesses. Our analysis reported 53 potential weaknesses in the OpenPiton SoC used in Hack@DAC-21, 11 of which we confirmed as security concerns.
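
A scanner of this kind can be as simple as a regular-expression pass over the Verilog source; the heuristic below, which flags `case` blocks lacking a `default` branch, is illustrative and not one of the paper's scanners.

```python
import re

def scan_missing_default(verilog_src):
    """Flag `case` blocks that have no `default:` branch -- a simple,
    illustrative heuristic for one class of RTL weaknesses."""
    findings = []
    for m in re.finditer(r"\bcase[xz]?\s*\(.*?\).*?\bendcase\b",
                         verilog_src, flags=re.S):
        block = m.group(0)
        if not re.search(r"\bdefault\s*:", block):
            line = verilog_src[: m.start()].count("\n") + 1
            findings.append((line, "case without default branch"))
    return findings

src = """
always @(*) begin
  case (state)
    2'b00: grant = 1'b0;
    2'b01: grant = 1'b1;
  endcase
end
"""
print(scan_missing_default(src))   # -> [(3, 'case without default branch')]
```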

SESSION: Special Session: Making ML Reliable: From Devices to Systems to Software

Session details: Special Session: Making ML Reliable: From Devices to Systems to Software

  • Krishnendu Chakrabarty
  • Partha Pande

Reliable Computing of ReRAM Based Compute-in-Memory Circuits for AI Edge Devices

  • Meng-Fan Chang
  • Je-Ming Hung
  • Ping-Cheng Chen
  • Tai-Hao Wen

Compute-in-memory macros based on non-volatile memory (nvCIM) are a promising approach to breaking through the memory bottleneck for artificial intelligence (AI) edge devices; however, their development involves unavoidable tradeoffs between reliability, energy efficiency, computing latency, and readout accuracy. This paper outlines the background of ReRAM-based nvCIM as well as the major challenges in its further development, including process variation in ReRAM devices and transistors and the small signal margins associated with variation in input-weight patterns. This paper also investigates the error model of an nvCIM macro and the corresponding degradation of inference accuracy as a function of that error model. Finally, we summarize recent trends and advances in the development of reliable ReRAM-based nvCIM macros.
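
Such accuracy degradation is typically studied by injecting a behavioral error model into the macro's multiply-accumulate outputs; a minimal sketch with a purely illustrative Gaussian readout-error model:

```python
import numpy as np

def nvcim_mac(inputs, weights, readout_sigma=0.0, rng=None):
    """Behavioral model of an nvCIM multiply-accumulate: the exact dot
    product plus additive readout error whose spread stands in for device
    variation and small signal margins (illustrative, not a fitted model)."""
    rng = rng or np.random.default_rng()
    exact = inputs @ weights
    return exact + rng.normal(0.0, readout_sigma, size=np.shape(exact))

# sweep the error level and watch a toy accuracy proxy degrade
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(1000, 64))          # binary inputs
w = rng.integers(-1, 2, size=(64, 10))           # ternary weights
clean = np.argmax(x @ w, axis=1)
for sigma in (0.0, 1.0, 4.0, 16.0):
    noisy = np.argmax(nvcim_mac(x, w, sigma, rng), axis=1)
    print(sigma, (noisy == clean).mean())
```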

Fault-Tolerant Deep Learning Using Regularization

  • Biresh Kumar Joardar
  • Aqeeb Iqbal Arka
  • Janardhan Rao Doppa
  • Partha Pratim Pande

Resistive random-access memory (ReRAM) has become one of the most popular choices for hardware implementation of machine learning workloads. However, these devices exhibit non-ideal behavior, which presents a challenge to widespread adoption. Training/inferencing on these faulty devices can lead to poor prediction accuracy, and existing fault-tolerance methods are associated with high implementation overheads. In this paper, we present new directions for solving reliability issues using software solutions. These software-based methods are inherent in deep learning training/inferencing and can also be used to address hardware reliability issues. They prevent accuracy drops during training/inferencing due to unreliable ReRAMs while incurring lower area and power overheads.
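
One common software-only pattern in this spirit is to expose the network to random weight faults during training so that the learned weights tolerate unreliable cells; the sketch below illustrates that general idea and is not the authors' specific method.

```python
import numpy as np

def faulty_forward(x, W, fault_rate, rng):
    """One linear layer evaluated with randomly injected stuck-at-zero
    weight faults. Training against such random fault masks acts as a
    dropout-like regularizer, so the learned weights tolerate unreliable
    ReRAM cells at inference time (illustrative sketch only)."""
    mask = rng.random(W.shape) >= fault_rate    # True where the cell is healthy
    return x @ (W * mask)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 128))
W = rng.standard_normal((128, 10))
y_faulty = faulty_forward(x, W, fault_rate=0.05, rng=rng)
```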

Machine Learning for Testing Machine-Learning Hardware: A Virtuous Cycle

  • Arjun Chaudhuri
  • Jonti Talukdar
  • Krishnendu Chakrabarty

The ubiquitous application of deep neural networks (DNNs) has led to a rise in demand for AI accelerators. DNN-specific functional criticality analysis identifies faults that cause measurable and significant deviations from acceptable requirements such as inferencing accuracy. This paper examines the problem of classifying structural faults in the processing elements (PEs) of systolic-array accelerators. We first present a two-tier machine-learning (ML) based method to assess the functional criticality of faults. While supervised learning techniques can accurately estimate fault criticality, they require a considerable amount of ground truth for model training. We therefore describe a neural-twin framework for analyzing fault criticality with a negligible amount of ground-truth data. We further describe a topological and probabilistic framework to estimate the expected number of a PE's primary outputs (POs) that flip in the presence of defects and use the PO-flip count as a surrogate for determining fault criticality. We demonstrate that the combination of PO-flip count and neural-twin-enabled sensitivity analysis of internal nets can be used as additional features in existing ML-based criticality classifiers.

Observation Point Insertion Using Deep Learning

  • Bonita Bhaskaran
  • Sanmitra Banerjee
  • Kaushik Narayanun
  • Shao-Chun Hung
  • Seyed Nima Mozaffari Mojaveri
  • Mengyun Liu
  • Gang Chen
  • Tung-Che Liang

Silent Data Corruption (SDC) is one of the critical problems in the field of testing, where errors or corruption do not manifest externally. As a result, there is increased focus on improving the outgoing quality of dies by striving for better correlation between structural and functional patterns to achieve a low DPPM. This is very important for NVIDIA's chips due to the various markets we target; for example, the automotive and data center markets have stringent in-field testing requirements. One aspect of these efforts is to target better testability while incurring lower test cost. Since structural testing is faster than functional testing, it is important to make structural test patterns as effective as possible and free of test escapes. However, with the rising cell count in today's digital circuits, it is becoming increasingly difficult to sensitize faults and propagate the fault effects to scan flops or primary outputs. Hence, methods to insert observation points that facilitate the detection of hard-to-detect (HtD) faults are being increasingly explored. In this work, we propose an Observation Point Insertion (OPI) scheme using deep learning, with the motivation of achieving (1) better-quality test points than commercial EDA tools, leading to a potentially lower pattern count, and (2) faster turnaround time to generate the test points. In order to achieve better pattern compaction than commercial EDA tools, we employ Graph Convolutional Networks (GCNs) to learn the topology of logic circuits along with the features that influence their testability. The graph structures are subsequently used to train two GCN-type deep learning models: the first model predicts signal probabilities at different nets, and the second model uses these signal probabilities along with other features to predict the reduction in test-pattern count when OPs are inserted at different locations in the design. The features we consider include structural features such as gate type, gate logic, and reconvergent fanouts, and testability features such as SCOAP. Our simulation results indicate that the proposed machine learning models can predict the probabilistic testability metrics with reasonable accuracy and can identify observation points that reduce pattern count.
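
A single GCN layer of the kind used in such models reduces to normalized neighborhood aggregation followed by a learned linear map; a minimal NumPy sketch with illustrative feature and graph sizes:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer: symmetrically normalized adjacency
    (with self-loops) times node features, times a weight matrix, then ReLU."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# toy netlist graph: 5 gates, 4 structural/testability features per gate
rng = np.random.default_rng(0)
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
X = rng.standard_normal((5, 4))                       # e.g. gate type, SCOAP, ...
W = rng.standard_normal((4, 8))
H = gcn_layer(A, X, W)                                # 5 x 8 node embeddings
```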

SESSION: Autonomous Systems and Machine Learning on Embedded Systems

Session details: Autonomous Systems and Machine Learning on Embedded Systems

  • Ibrahim (Abe) Elfadel
  • Mimi Xie

Romanus: Robust Task Offloading in Modular Multi-Sensor Autonomous Driving Systems

  • Luke Chen
  • Mohanad Odema
  • Mohammad Abdullah Al Faruque

Due to the high performance and safety requirements of self-driving applications, the complexity of modern autonomous driving systems (ADS) has been growing, driving the need for more sophisticated hardware that can add to the energy footprint of the ADS platform. Addressing this, edge computing is poised to encompass self-driving applications, enabling compute-intensive autonomy-related tasks to be offloaded for processing at compute-capable edge servers. Nonetheless, the intricate hardware architecture of ADS platforms, in addition to the stringent robustness demands, introduces complications for task offloading that are unique to autonomous driving. Hence, we present ROMANUS, a methodology for robust and efficient task offloading for modular ADS platforms with multi-sensor processing pipelines. Our methodology entails two phases: (i) the introduction of efficient offloading points along the execution path of the involved deep learning models, and (ii) the implementation of a runtime solution based on deep reinforcement learning to adapt the operating mode according to variations in the perceived road-scene complexity, network connectivity, and server load. Experiments on an object detection use case demonstrate that our approach is 14.99% more energy-efficient than pure local execution while achieving a 77.06% reduction in risky behavior compared to a robustness-agnostic offloading baseline.

ModelMap: A Model-Based Multi-Domain Application Framework for Centralized Automotive Systems

  • Soham Sinha
  • Anam Farrukh
  • Richard West

This paper presents ModelMap, a model-based multi-domain application development framework for DriveOS, our in-house centralized vehicle management software system. DriveOS runs on multicore x86 machines and uses hardware virtualization to host isolated RTOS and Linux guest OS sandboxes. In this work, we design Simulink interfaces for model-based vehicle control function development across multiple sandboxed domains in DriveOS. ModelMap provides abstractions to: (1) automatically generate periodic tasks bound to threads in different OS domains, (2) establish cross-domain synchronous and asynchronous communication interfaces, and (3) handle USB-based CAN I/O in Simulink. We introduce the concept of a nested binary for the deployment of ELF binary executable code in different sandboxed domains. We demonstrate ModelMap using a combination of synthetic benchmarks and experiments with Simulink models of a CAN Gateway and an HVAC service running on an electric car. ModelMap eases the development of applications, which are shown to achieve industry-target performance using a multicore hardware platform in DriveOS.

INDENT: Incremental Online Decision Tree Training for Domain-Specific Systems-on-Chip

  • Anish Krishnakumar
  • Radu Marculescu
  • Umit Ogras

The performance and energy efficiency potential of heterogeneous architectures has fueled domain-specific systems-on-chip (DSSoCs) that integrate general-purpose and domain-specialized hardware accelerators. Decision trees (DTs) perform high-quality, low-latency task scheduling to effectively utilize the massive parallelism and heterogeneity in DSSoCs. However, offline-trained DT scheduling policies can quickly become ineffective when applications or hardware configurations change. There is a critical need for runtime techniques to train DTs incrementally without sacrificing accuracy, since current training approaches have large memory and computational power requirements. To address this need, we propose INDENT, an incremental online DT framework to update the scheduling policy and adapt it to unseen scenarios. INDENT updates DT schedulers at runtime using only 1–8% of the original training data, embedded at training time. Thorough evaluations with hardware platforms and DSSoC simulators demonstrate that INDENT performs within 5% of a DT trained from scratch using the entire dataset and outperforms current state-of-the-art approaches.
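
A minimal sketch of the general idea, not the INDENT algorithm itself: refresh a small decision tree from a retained slice of the original data plus newly observed samples (scikit-learn used for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def update_scheduler(retained_X, retained_y, new_X, new_y, max_depth=6):
    """Refresh the decision-tree scheduling policy at runtime by fitting it
    on a small retained slice of the original training data plus newly
    observed (task-features, chosen-accelerator) samples. Illustrative only;
    this is not the INDENT update procedure."""
    X = np.vstack([retained_X, new_X])
    y = np.concatenate([retained_y, new_y])
    return DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
```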

SGIRR: Sparse Graph Index Remapping for ReRAM Crossbar Operation Unit and Power Optimization

  • Cheng-Yuan Wang
  • Yao-Wen Chang
  • Yuan-Hao Chang

Resistive Random Access Memory (ReRAM) crossbars are a promising processing-in-memory technology for reducing the enormous data movement overheads of large-scale graph processing between computation and memory units. ReRAM cells can be combined with crossbar arrays to effectively accelerate graph processing, and partitioning ReRAM crossbar arrays into Operation Units (OUs) can further improve the computation accuracy of ReRAM crossbars. However, previous work did not optimize operation unit utilization, incurring extra cost. This paper proposes a two-stage algorithm with a crossbar OU-aware scheme for sparse graph index remapping for ReRAM (SGIRR) crossbars, mitigating the influence of graph sparsity. In particular, this paper is the first to consider the given operation unit size in the index remapping algorithm, optimizing operation unit utilization and power dissipation. Experimental results show that our proposed algorithm reduces the utilization of crossbar OUs by 31.4%, improves the total OU block usage by 10.6%, and saves energy consumption by 17.2%, on average.