ISPD’23 TOC

ISPD ’23: Proceedings of the 2023 International Symposium on Physical Design


SESSION: Session 1: Opening Session and Keynote I

Automated Design of Chiplets

  • Alberto Sangiovanni-Vincentelli
  • Zheng Liang
  • Zhe Zhou
  • Jiaxi Zhang

Chiplet-based designs have gained recognition as a promising alternative to monolithic SoCs due to their lower manufacturing costs, improved re-usability, and optimized technology specialization. Despite progress made in various related domains, the design of chiplets remains largely reliant on manual processes. In this paper, we provide an examination of the historical evolution of chiplets, encompassing a review of crucial design considerations and a synopsis of recent advancements in relevant fields. Further, we identify and examine the opportunities and challenges in the automated design of chiplets. To further demonstrate the potential of this nascent area, we present a novel task that

SESSION: Session 2: Routing

FastPass: Fast Pin Access Analysis with Incremental SAT Solving

  • Fangzhou Wang
  • Jinwei Liu
  • Evangeline F.Y. Young

Pin access analysis is a critical step in detailed routing. With complicated design rules and pin shapes, efficient and accurate pin accessibility evaluation is desirable in many physical design scenarios. To this end, we present FastPass, a fast and robust pin access analysis framework, which first generates design rule checking (DRC)-clean pin access route candidates for each pin, pre-computes incompatible pairs of routes, and then uses incremental SAT solving to find an optimized pin access scheme. Experimental results on the ISPD 2018 benchmarks show that FastPass produces DRC-clean pin access schemes for all cases while being 14.7× faster on average than the best-known pin access analysis framework.
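
As a rough illustration of the incremental-SAT idea described above (a minimal sketch, not FastPass itself; the candidate routes, conflict pairs, and DRC callback are hypothetical), one route per pin can be selected with a single reusable solver to which newly discovered incompatibilities are added as blocking clauses, for example with the python-sat package:

    from itertools import count
    from pysat.solvers import Glucose3   # pip install python-sat

    # Hedged sketch: variables are (pin, candidate-route) pairs; each pin must
    # take at least one route, and incompatible pairs are excluded. New
    # conflicts found by an external check are added lazily while the same
    # solver instance (and its learned clauses) is reused.
    def pick_routes(candidates, incompatible, check_new_conflicts):
        var, ids, solver = {}, count(1), Glucose3()
        for pin, routes in candidates.items():
            lits = [var.setdefault((pin, r), next(ids)) for r in routes]
            solver.add_clause(lits)                   # each pin gets >= 1 route
        for a, b in incompatible:                     # pre-computed conflicts
            solver.add_clause([-var[a], -var[b]])
        while solver.solve():
            assignment = set(solver.get_model())
            chosen = {k for k, v in var.items() if v in assignment}
            new = check_new_conflicts(chosen)         # e.g., DRC between pins
            if not new:
                return chosen
            for a, b in new:                          # refine incrementally
                solver.add_clause([-var[a], -var[b]])
        return None                                   # no feasible scheme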

Pin Access-Oriented Concurrent Detailed Routing

  • Yun-Jhe Jiang
  • Shao-Yun Fang

Due to continuously shrinking feature sizes and increasing design complexity, pin access has become one of the most critical challenges in large-scale full-chip routing. State-of-the-art pin access-aware detailed routing techniques suffer from either the ordering problem of the sequential routing scheme or the inflexibility of pre-determining an access point for each pin. Some other routing-related studies create pin extensions with Metal-2 segments to optimize pin accessibility; however, this strategy may not be practical without considering the contemporary routing flow. This paper presents a pin access-oriented concurrent detailed routing approach conducted after the track assignment stage. The core detailed routing engine is based on an integer linear programming (ILP) formulation, which has lower complexity than an existing formulation and can flexibly handle multi-pin nets. In addition, to maximize the free routing resource and keep the problem size tractable, a pre-processing flow that trims redundant metal and inserts assistant metal is developed. The experimental results show that, compared to a state-of-the-art academic router, the proposed concurrent scheme can effectively derive good results with fewer design rule violations and less runtime.
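
For readers unfamiliar with ILP-based route selection, the following toy model (an illustrative sketch on invented data, not the paper's formulation) picks exactly one candidate route per net after track assignment while forbidding overlapping candidates, using PuLP:

    import pulp   # pip install pulp

    # Hedged toy ILP: candidate routes and conflicts below are made up.
    nets = {"n1": {"a": 5, "b": 7},          # candidate route -> wirelength
            "n2": {"c": 4, "d": 6}}
    conflicts = [("a", "c")]                 # candidates that share a resource

    prob = pulp.LpProblem("detailed_routing", pulp.LpMinimize)
    x = {r: pulp.LpVariable(f"x_{r}", cat="Binary")
         for routes in nets.values() for r in routes}
    # Objective: total wirelength of the chosen routes.
    prob += pulp.lpSum(cost * x[r] for routes in nets.values()
                       for r, cost in routes.items())
    for routes in nets.values():             # each net picks exactly one route
        prob += pulp.lpSum(x[r] for r in routes) == 1
    for a, b in conflicts:                   # overlapping candidates exclude each other
        prob += x[a] + x[b] <= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=0))
    print({r: int(v.value()) for r, v in x.items()})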

Reinforcement Learning Guided Detailed Routing for Custom Circuits

  • Hao Chen
  • Kai-Chieh Hsu
  • Walker J. Turner
  • Po-Hsuan Wei
  • Keren Zhu
  • David Z. Pan
  • Haoxing Ren

Detailed routing is the most tedious and complex procedure in design automation and has become a determining factor in layout automation in advanced manufacturing nodes. Despite continuing advances in custom integrated circuit (IC) routing research, industrial custom layout flows remain heavily manual due to the high complexity of the custom IC design problem. Besides conventional design objectives such as wirelength minimization, custom detailed routing must also accommodate additional constraints (e.g., path-matching) across the analog/mixed-signal (AMS) and digital domains, making an already challenging procedure even more so. This paper presents a novel detailed routing framework for custom circuits that leverages deep reinforcement learning to optimize routing patterns while considering custom routing constraints and industrial design rules. Comprehensive post-layout analyses based on industrial designs demonstrate the effectiveness of our framework in dealing with the specified constraints and producing sign-off-quality routing solutions.

Voltage-Drop Optimization Through Insertion of Extra Stripes to a Power Delivery Network

  • Jai-Ming Lin
  • Yu-Tien Chen
  • Yang-Tai Kung
  • Hao-Jia Lin

As design complexity increases, power delivery network (PDN) optimization becomes an ever more important step in a modern design flow. To construct a robust PDN, most classic PDN optimization methods focus on adjusting the dimensions of power stripes. However, this approach becomes infeasible when voltage violation regions also suffer from severe routing congestion. Hence, this paper proposes a careful procedure that inserts additional power stripes to reduce voltage violations while maintaining routability. First, regions with high IR drop are identified to reveal the locations that need more current. Then, we solve a minimum-cost flow problem to find the topologies of power delivery paths (PDPs) from power sources to these regions and to determine the widths of edges in each PDP so that enough current can be delivered. Moreover, vertical power stripes (VPSs) are inserted by dynamic programming at locations with less routing congestion and severe voltage violations, reducing the likelihood of degrading routability. Finally, more wires are inserted into the high-IR-drop regions if voltage violations remain. Experimental results show that our method uses much less routing resource and induces less routing congestion while meeting the IR-drop constraint on industrial designs.
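
The minimum-cost flow step can be pictured with a small hypothetical example (not the paper's exact model): power sources act as supplies, high-IR-drop regions as demands, and candidate stripe segments as capacitated, weighted edges; the resulting edge flows suggest both the PDP topology and how much current each edge must carry.

    import networkx as nx

    # Illustrative sketch with made-up numbers: negative demand = supply.
    G = nx.DiGraph()
    G.add_node("pad", demand=-5)        # power source supplies 5 current units
    G.add_node("regionA", demand=3)     # high-IR-drop region needing 3 units
    G.add_node("regionB", demand=2)     # high-IR-drop region needing 2 units
    # capacity ~ available routing resource, weight ~ congestion cost (assumed)
    G.add_edge("pad", "m1", capacity=4, weight=1)
    G.add_edge("pad", "m2", capacity=3, weight=2)
    G.add_edge("m1", "regionA", capacity=3, weight=1)
    G.add_edge("m1", "regionB", capacity=2, weight=3)
    G.add_edge("m2", "regionB", capacity=2, weight=1)

    flow = nx.min_cost_flow(G)          # edge -> current assignment
    print(flow["pad"])                  # how much each branch must carry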

NVCell 2: Routability-Driven Standard Cell Layout in Advanced Nodes with Lattice Graph Routability Model

  • Chia-Tung Ho
  • Alvin Ho
  • Matthew Fojtik
  • Minsoo Kim
  • Shang Wei
  • Yaguang Li
  • Brucek Khailany
  • Haoxing Ren

Standard cells are essential components of modern digital circuit designs. With process technologies advancing beyond the 5nm node, more routability issues have arisen due to the decreasing number of routing tracks, increasing number and complexity of design rules, and strict patterning rules. Automatic standard cell synthesis tools struggle to design cells with severe routability issues. In this paper, we propose a routability-driven standard cell synthesis framework using a novel pin-density-aware congestion metric, a lattice graph routability modelling approach, and a dynamic external pin allocation methodology to generate routability-optimized layouts. On a benchmark of 94 complex and hard-to-route standard cells, NVCell 2 improves the number of routable and LVS/DRC-clean cell layouts by 84.0% and 87.2%, respectively. NVCell 2 can generate 98.9% of cells LVS/DRC clean, with 13.9% of the cells having smaller area, compared to an industrial standard cell library with over 1000 standard cells.

SESSION: Session 3: 3D ICs, Heterogeneous Integration, and Packaging I

FXT-Route: Efficient High-Performance PCB Routing with Crosstalk Reduction Using Spiral Delay Lines

  • Meng Lian
  • Yushen Zhang
  • Mengchu Li
  • Tsun-Ming Tseng
  • Ulf Schlichtmann

In high-performance printed circuit boards (PCBs), adding serpentine delay lines is the most prevalent delay-matching technique to balance the delays of time-critical signals. Serpentine topology, however, can induce simultaneous accumulation of the crosstalk noise, resulting in erroneous logic gate triggering and speed-up effects. The state-of-the-art approach for crosstalk alleviation achieves waveform integrity by enlarging wire separation, resulting in an increased routing area. We introduce a method that adopts spiral delay lines for delay matching to mitigate the speed-up effect by spreading the crosstalk noise uniformly in time. Our method avoids possible routing congestion while achieving a high density of transmission lines. We implement our method by constructing a mixed-integer-linear programming (MILP) model for routing and a quadratic programming (QP) model for spiral synthesis. Experimental results demonstrate that our method requires, on average, 31% less routing area than the original design. In particular, compared to the state-of-the-art approach, our method can reduce the magnitude of the crosstalk noise by at least 69%.

On Legalization of Die Bonding Bumps and Pads for 3D ICs

  • Sai Pentapati
  • Anthony Agnesina
  • Moritz Brunion
  • Yen-Hsiang Huang
  • Sung Kyu Lim

State-of-the-art 3D IC Place-and-Route flows were designed with older technology nodes and aggressive bonding pitch assumptions. As a result, these flows fail to honor the width and spacing rules for the 3D vias with realistic pitch values. We propose a critical new 3D via legalization stage during routing to reduce such violations. A force-based solver and bipartite-matching algorithm with Bayesian optimization are presented as viable legalizers and are compatible with various process nodes, bonding technologies, and partitioning types. With the modified 3D routing, we reduce the 3D via violations by more than 10× with zero impact on performance, power, or area.
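
The bipartite-matching ingredient can be sketched as a minimum-displacement assignment of bumps to on-pitch legal sites (an illustrative example with an assumed pitch and site grid, not the paper's legalizer, which also combines a force-based solver and Bayesian optimization):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Hedged sketch: assign each 3D via/bump to one legal on-pitch site so
    # that total displacement is minimized; width/spacing violations are
    # removed by construction because sites are on the legal pitch.
    def legalize(bumps, pitch=2.0):
        bumps = np.asarray(bumps, dtype=float)
        xs = np.arange(bumps[:, 0].min(), bumps[:, 0].max() + pitch, pitch)
        ys = np.arange(bumps[:, 1].min(), bumps[:, 1].max() + pitch, pitch)
        sites = np.array([(x, y) for x in xs for y in ys])
        cost = np.linalg.norm(bumps[:, None, :] - sites[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)   # min total displacement
        return sites[cols]                         # legal position per bump

    print(legalize([(0.3, 0.1), (0.4, 0.2), (3.7, 1.9)]))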

Reshaping System Design in 3D Integration: Perspectives and Challenges

  • Hung-Ming Chen
  • Chu-Wen Ho
  • Shih-Hsien Wu
  • Wei Lu
  • Po-Tsang Huang
  • Hao-Ju Chang
  • Chien-Nan Jimmy Liu

In this paper, we depict modern system design methodologies via 3D integration along with the advance of packaging, considering system prototyping, interconnecting, and physical implementation. The corresponding challenges are presented as well.

SESSION: Session 4: 3D ICs, Heterogeneous Integration, and Packaging II

Co-design for Heterogeneous Integration: A Failure Analysis Perspective

  • Erica Douglas
  • Julia Deitz
  • Timothy Ruggles
  • Daniel Perry
  • Damion Cummings
  • Mark Rodriguez
  • Nichole Valdez
  • Brad Boyce

As scaling for CMOS transistors asymptotically approaches the end of Moore’s Law, the need to push into 3D integration schemes to innovate capabilities is gaining significant traction. Further, rapid development of new semiconductor solutions, such as heterogeneous integration, has steered the semiconductor industry’s consistent march toward next-generation products into new arenas. In 2018, the Department of Energy Office of Science (DOE SC) released its “Basic Research Needs for Microelectronics,” communicating a strong push towards “parallel but intimately networked efforts to create radically new capabilities” [1], which it has coined “co-design.”

Advanced packaging and heterogeneous integration, particularly with mixed semiconductor materials (e.g., CMOS FPGAs and GaN RF amplifiers), is a realm ripe for DOE SC’s co-design call to action. In theory, development occurring at all scales across the semiconductor ecosystem, particularly across disciplines that are not traditionally adjacent, should significantly accelerate innovation. In reality, co-design requires a paradigm shift in approach, demanding more than interconnected parallel development. In particular, accurate ground-truth data during learning cycles is critical in order to effectively and efficiently communicate across disparate disciplines and inform design iterations across the microelectronics ecosystem.

This talk will outline three orthogonal facets of co-design for heterogeneous integration (HI): (1) on-going efforts towards development of materials characterization and failure analysis techniques to enable accurate evaluation of materials and heterogeneously integrated components, (2) development of artificial intelligence and machine learning algorithms for large-scale, high-throughput process development and characterization, and (3) development of capabilities for rapid communication and visualization of data across disparate disciplines.

Goal Driven PCB Synthesis Using Machine Learning and CloudScale Compute

  • Taylor Hogan

X AI is a cloud-based system that leverages machine learning and search to place and route printed circuit boards using physics-based analysis and high-level design. We propose a feedback-based Monte Carlo Tree Search (MCTS) algorithm to explore the space of possible designs. One or more metrics are given to evaluate the quality of designs as MCTS learns about possible solutions. A policy network and a value network are trained during exploration to accurately weight quality actions and identify useful design states. This is performed as a feedback loop in conjunction with other feedforward tools for placement and routing.

Gate-All-Around Technology is Coming: What’s Next After GAA?

  • Victor Moroz

Currently, the industry is transitioning from FinFETs to gate-all-around (GAA) technology and will likely have several GAA technology generations in the next few years. What’s next after that? This is the question that we are trying to answer in this project by benchmarking GAA technology with transistors on 2D materials and stacked transistors (CFETs).

The main objective for logic is to get a meaningful gain in power, performance, area, and cost (PPAC). The main objective for SRAM is to get a noticeable density scaling for the SRAM array and its periphery without losing performance and yield. Another objective is to move in the direction that has a promise of longer-term progress, such as to start stacking two layers of transistors before moving to a larger number of transistor layers. With that in mind, we explore and discuss the next steps beyond GAA technology.

SESSION: Session 5: Analog Design

VLSIR – A Modular Framework for Programming Analog & Custom Circuits & Layouts

  • Dan Fritchman

We present VLSIR, a modular and fully open-source framework for programming analog and custom circuits and layouts. VLSIR is centered around a protobuf-defined design database. It features high-productivity front-ends for hardware description (“circuit programming”), simulation, and custom layout programming, designed to be amenable to both human designers and automation.

Joint Optimization of Sizing and Layout for AMS Designs: Challenges and Opportunities

  • Ahmet F. Budak
  • Keren Zhu
  • Hao Chen
  • Souradip Poddar
  • Linran Zhao
  • Yaoyao Jia
  • David Z. Pan

Recent advances in analog device sizing algorithms show promising results for automatic schematic design. However, the majority of sizing algorithms are based on schematic-level simulations and are layout-agnostic. The physical layout implementation brings extra parasitics to the analog circuits, leading to discrepancies between schematic and post-layout performance. This performance gap raises questions about the effectiveness of automatic analog device sizing tools. Prior work has leveraged procedural layout generation to account for layout-induced parasitics in the sizing process. However, the need for layout templates limits the applicability of such a methodology. In this paper, we propose to bridge automatic analog sizing with post-layout performance using state-of-the-art optimization-based analog layout generators. A quantitative study is conducted to measure the impact of layout awareness in state-of-the-art device sizing algorithms. Furthermore, we present our perspectives on future directions in layout-aware analog circuit schematic design.

Learning from the Implicit Functional Hierarchy in an Analog Netlist

  • Helmut Graeb
  • Markus Leibl

Analog circuit design is characterized by a plethora of implicit design and technology aspects available to the experienced designer. In order to create useful computer-aided design methods, this implicit knowledge has to be captured in a systematic and hierarchical way. A key approach to this goal is to “learn” the knowledge from the netlist of an analog circuit. This requires a library of structural and functional blocks for analog circuits together with their individual constraints and performance equations, graph homomorphism techniques to recognize blocks that can have different structural implementations and I/O pins, as well as synthesis methods that exploit the learned knowledge. In this contribution, we will present how to make use of the functional and structural hierarchy of operational amplifiers. As an application, we explore the capabilities of machine learning in the context of structural and functional properties and show that the results can be substantially improved by pre-processing data with traditional methods for functional block analysis. This claim is validated on a data set of roughly 100,000 readily sized and simulated operational amplifiers.

The ALIGN Automated Analog Layout Engine: Progress, Learnings, and Open Issues

  • Sachin S. Sapatnekar

The ALIGN (Analog Layout, Intelligently Generated from Netlists) project [1, 2] is a joint university-industry effort to push the envelope of automated analog layout through a systematic new approach, novel algorithms, and open-source software [3]. Analog automation research has been active for several decades, but has not found widespread acceptance due to its general inability to meet the needs of the design community. Therefore, unlike digital design, which has a rich history of automation and extensive deployment of design tools, analog design is largely unautomated.

ALIGN attempts to overcome several of the major issues associated with this lack of success. First, to mimic the human designer’s ability to recognize sub-blocks and specify constraints, ALIGN has used machine learning (ML)-based methods to assist in these tasks. Second, to overcome the limitation of past automation approaches, which are largely specific to a class of designs, ALIGN attempts to create a truly general layout engine by decomposing the layout automation process into a set of steps, with constraints specific to each family of circuits; the circuits are divided into four classes: low-frequency components (e.g., analog-to-digital converters (ADCs), amplifiers, and filters); wireline components for high-speed links (e.g., equalizers, clock/data recovery circuits, and phase interpolators); RF/wireless components (e.g., components of RF transmitters and receivers); and power delivery components (e.g., capacitor- and inductor-based DC-DC converters and low dropout (LDO) regulators). For each class of circuits, different sets of constraints are important, depending on their frequency, parasitic sensitivity, need for matching, etc., and ALIGN creates a unified methodological framework that can address each class. Third, in each step, ALIGN has generated new algorithms and approaches to help improve the performance of analog layout. Fourth, given that experienced analog designers desire greater visibility into the process and input into the way that design is carried out, ALIGN is built modularly, providing multiple entry points at which a designer may intervene in the process.

Analog Layout Automation On Advanced Process Technologies

  • Soner Yaldiz

Despite the digitization of analog and the disaggregated silicon trends, high-volume or high-performance system-on-chip (SoC) designs integrate numerous analog and mixed-signal (AMS) intellectual property (IP) blocks including voltage regulators, clock generators, sensors, memory and other interfaces. For example, fine-grain dynamic voltage and frequency scaling requires a dedicated clock generator and voltage regulator per compute unit. The design of these blocks in advanced FinFET or GAAFET technologies is challenging due to (i) the increasing gap between schematic and post-layout simulation, (ii) design rule complexity, and (iii) strict reliability rules [1]. The convergence of a high-performance or high-power block may require multiple iterations of circuit sizing and layout changes. As a result, physical design, which is primarily a manual effort, has become a key bottleneck in the design process. Migrating these blocks across process technologies or process variants only exacerbates the problem. Layout synthesis for AMS IP blocks is an on-going research problem with a long history [2] and has recently been gaining more attention as it leverages the latest advances in machine learning [3]. Yet neither template- nor optimization-based approaches have significantly reduced the burden for high-performance products on leading process technologies.

This talk will first overview physical design of AMS IP blocks on an advanced process technology, highlighting the opportunities and the expectations from layout automation during this process. On a new process technology, this process starts with conducting early layout studies on a selection of critical high-performance or high-power subcircuits. In parallel, the IP blocks are placed in a bottom-up fashion to optimize the IP floorplan but also to provide information to SoC floorplanning. Routing follows the placement to verify the post-layout performance. A quick turnaround during these explorations is vital to decide on any architectural changes or circuit re-sizing. The rest of the talk will share experiences with piloting an open-source analog layout synthesis tool flow [4] on a 22nm FinFET technology for voltage regulators [5].

The learnings from this exercise and the resulting extensions to the tool flow will be summarized, including a Boolean satisfiability-based routing algorithm, a formally verifiable constraint language, and the use of parameterized and standard cells. The talk will conclude with opportunities for research.

SESSION: Session 6: Keynote II

Immersion and EUV Lithography: Two Pillars to Sustain Single-Digit Nanometer Nodes

  • Burn J. Lin

Semiconductor technology has advanced to single-digit nanometer dimensions for the circuit elements. The minimum feature size has reached subwavelength dimensions. Many resolution enhancement techniques have been developed to extend the resolution limit of optical lithography systems, namely illumination optimization, phase-shifting masks, and proximity corrections. Needless to say, the actinic wavelength has been reduced and the numerical aperture of the imaging lens increased in stages. The most recent innovations are immersion lithography and extreme UV (EUV) lithography.

In this presentation, the working principles, advantages, and challenges of immersion lithography are given. The defectivity issue is addressed by showing possible causes and solutions. The circuit design issues for pushing immersion lithography to single-digit nanometer delineation are presented.

Similarly, the working principles, advantages, and challenges of EUV lithography are given. Special focus is placed on EUV power requirements, generation, and distribution; EUV mask components, absorber thickness, defects, flatness requirements, and pellicles; and EUV resist challenges in sensitivity, line edge roughness, thickness, and etch resistance.

SESSION: Session 7: DFM, Reliability, and Electromigration

Advanced Design Methodologies for Directed Self-Assembly

  • Shao-Yun Fang

Directed self-assembly (DSA), which uses the segregation of a block co-polymer (BCP) after an annealing process to generate tiny feature shapes, has become one of the most promising next-generation lithography technologies. Depending on the proportions of the two monomers in the adopted BCP, either cylinders or lamellae can be generated by removing one of the two monomers; these variants are respectively referred to as cylindrical DSA and lamellar DSA. In addition, guiding templates are required to produce trenches before filling the BCP, such that the additional forces from the trench walls regulate the generated cylinders/lamellae. Both DSA technologies can be used to generate contact/via patterns in circuit layouts, but the practices of designing guiding templates are quite different due to different manufacturing principles. This paper reviews the existing studies on the guiding template design problem for contact/via hole fabrication with DSA technology. The design constraints are differentiated, and the design methodologies are introduced for cylindrical DSA and lamellar DSA, respectively. Possible future research directions are finally suggested to further enhance contact/via manufacturability and the feasibility of adopting DSA in semiconductor manufacturing.

Challenges for Interconnect Reliability: From Element to System Level

  • Olalla Varela Pedreira
  • Houman Zahedmanesh
  • Youqi Ding
  • Ivan Ciofi
  • Kristof Croes

The high current densities carried by interconnects have a direct impact on back-end-of-line (BEOL) reliability degradation: they locally increase the temperature through Joule heating, and they lead to drift of the metal atoms. The local increase in temperature due to Joule heating produces thermal gradients along the interconnects, inducing degradation through thermomigration. As the power density of the chip increases, thermal gradients may become a major reliability concern for scaled Cu interconnects. Therefore, it is of utmost relevance to fundamentally understand the impact of thermal gradients on metal migration. Our studies show that, by using a combined modelling approach and a dedicated test structure, we can assess the local temperatures and temperature gradient profiles. Moreover, with long-term experiments we are able to successfully generate voids at the locations of highest temperature gradient. Additionally, the main consequence of scaling Cu interconnects is the dramatic drop of EM lifetime (Jmax). Currently, the experimentally obtained EM parameters are used at the system design level to set the current limits through the interconnect networks. However, this approach is very simplistic and neglects the benefits provided by the redundancy and interconnectivity of the network. Our studies, using a system-level physics-based EM simulation framework that can determine the EM-induced IR drop at the standard-cell level, show that the circuit reliability margins of the power delivery network (PDN) can be further relaxed.

Combined Modeling of Electromigration, Thermal and Stress Migration in AC Interconnect Lines

  • Susann Rothe
  • Jens Lienig

The migration of atoms in metal interconnects of integrated circuits (ICs) increasingly endangers chip reliability. The susceptibility of DC interconnects to electromigration has been extensively studied, and a few works on thermal migration and AC electromigration are also available. Yet the combined effect of both on chip reliability has been neglected thus far. This paper provides both FEM and analytical models for atomic migration and steady-state stress profiles in AC interconnects, considering electromigration, thermal migration, and stress migration in combination. For this, we extend existing models with the impact of self-healing, temperature-dependent resistivity, and short wire lengths. We conclude by analyzing the impact of thermal migration on interconnect robustness and show that it can no longer be neglected in migration robustness verification.

Recent Progress in the Analysis of Electromigration and Stress Migration in Large Multisegment Interconnects

  • Nestor Evmorfopoulos
  • Mohammad Abdullah Al Shohel
  • Olympia Axelou
  • Pavlos Stoikos
  • Vidya A. Chhabria
  • Sachin S. Sapatnekar

Traditional approaches to analyzing electromigration (EM) in on-chip interconnects are largely driven by semi-empirical models. However, such methods are inexact for the typical multisegment lines that are found in modern integrated circuits. This paper overviews recent advances in analyzing EM in on-chip interconnect structures based on physics-based models that use partial differential equations, with appropriate boundary conditions, to capture the impact of electron-wind and back-stress forces within an interconnect, across multiple wire segments. Methods for both steady-state and transient analysis are presented, highlighting approaches that can solve these problems with a computation time that is linear in the number of wire segments in the interconnect.
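
For background, the physics-based models referenced here are typically of the Korhonen type; a standard one-dimensional form (a textbook-style summary, not an equation reproduced from the paper) balances the back-stress and electron-wind terms:

    \frac{\partial \sigma}{\partial t}
      = \frac{\partial}{\partial x}\!\left[\frac{D_a B\,\Omega}{k_B T}
        \left(\frac{\partial \sigma}{\partial x} + \frac{e Z \rho j}{\Omega}\right)\right],
    \qquad
    \left(\frac{\partial \sigma}{\partial x} + \frac{e Z \rho j}{\Omega}\right)\Bigg|_{\text{blocked line ends}} = 0,

where \sigma is the hydrostatic stress, D_a the atomic diffusivity, B the effective bulk modulus, \Omega the atomic volume, and j the current density. In steady state the stress profile is linear in x, which recovers the familiar Blech-type critical product $jL_{\mathrm{crit}} = \Delta\sigma_{\mathrm{crit}}\,\Omega/(eZ\rho)$; the methods surveyed in the paper generalize such boundary-value problems to multisegment lines.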

Electromigration Assessment in Power Grids with Account of Redundancy and Non-Uniform Temperature Distribution

  • Armen Kteyan
  • Valeriy Sukharev
  • Alexander Volkov
  • Jun Ho Choy
  • Farid N. Najm
  • Yong Hyeon Yi
  • Chris H. Kim
  • Stephane Moreau

A recently proposed methodology for electromigration (EM) assessment in the on-chip power/ground grids of integrated circuits has been validated by means of measurements performed on dedicated test grids. IR-drop degradation in the grid is used to define the EM failure criteria. Physics-based models are employed to simulate EM-induced stress evolution in interconnect structures, void formation and evolution, the resistance increase of voided segments, and the consequent redistribution of electric current in the redundant grid paths. A grid-like test structure, fabricated in a 65 nm technology and consisting of two metal layers, made it possible to calibrate the voiding models by tracking the voltage evolution at all grid nodes in experiment and in simulation. A good fit between the measured and simulated time-to-failure (TTF) probability distributions was obtained for both uniform and non-uniform temperature distributions across the grid. The second test grid was fabricated in a 28 nm technology, consisted of four metal layers, and contained power and ground nets connected to “quasi-cells” with poly-resistors, which were specially designed to operate at elevated temperatures of ~350°C. The existing current distributions resulted in different EM-induced failure behavior in these nets: a gradual voltage evolution in the power net and sharp changes in the ground net were observed in experiment and successfully reproduced in simulation.

SESSION: Session 8: Placement

Placement Initialization via Sequential Subspace Optimization with Sphere Constraints

  • Pengwen Chen
  • Chung-Kuan Cheng
  • Albert Chern
  • Chester Holtz
  • Aoxi Li
  • Yucheng Wang

State-of-the-art analytical placement algorithms for VLSI designs rely on solving nonlinear programs to minimize wirelength and cell congestion. As a consequence, the quality of solutions produced using these algorithms crucially depends on the initial cell coordinates. In this work, we reduce the problem of finding wirelength-minimal initial layouts subject to density and fixed-macro constraints to a Quadratically Constrained Quadratic Program (QCQP). We additionally propose an efficient sequential quadratic programming algorithm to recover a block-globally optimal solution and a subspace method to reduce the complexity of the problem. We extend our formulation to facilitate direct minimization of the Half-Perimeter Wirelength (HPWL) by showing that a corresponding solution can be derived by solving a sequence of reweighted quadratic programs. Critically, our method is parameter-free, i.e., it involves no hyperparameters to tune. We demonstrate that incorporating initial layouts produced by our algorithm with a global analytical placer results in improvements of up to 4.76% in post-detailed-placement wirelength on the ISPD’05 benchmark suite. Our code is available on GitHub: https://github.com/choltz95/laplacian-eigenmaps-revisited.
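
A minimal numerical sketch of the sphere-constrained quadratic idea (assuming a simple clique netlist model and ignoring the density and fixed-macro constraints, so this is not the authors' algorithm): the quadratic wirelength objective under a norm constraint is minimized in the span of the smallest nontrivial Laplacian eigenvectors, which can then serve as an initial layout.

    import numpy as np

    # Sketch: min x^T L x subject to ||x|| = r, solved via the spectrum of the
    # netlist graph Laplacian L; the two smallest nontrivial eigenvectors give
    # initial (x, y) coordinates on a sphere of radius r.
    def quadratic_init(adjacency, radius=1.0):
        degree = np.diag(adjacency.sum(axis=1))
        laplacian = degree - adjacency
        eigvals, eigvecs = np.linalg.eigh(laplacian)
        coords = eigvecs[:, 1:3]            # skip the trivial constant mode
        return radius * coords / np.linalg.norm(coords, axis=0, keepdims=True)

    # Toy example: a 4-cell clique netlist.
    A = np.ones((4, 4)) - np.eye(4)
    print(quadratic_init(A))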

DREAM-GAN: Advancing DREAMPlace towards Commercial-Quality using Generative Adversarial Learning

  • Yi-Chen Lu
  • Haoxing Ren
  • Hao-Hsiang Hsiao
  • Sung Kyu Lim

DREAMPlace is a renowned open-source placer that provides GPU-accelerated infrastructure for placement of very-large-scale integration (VLSI) circuits. However, due to its limited focus on wirelength and density, existing placement solutions of DREAMPlace are not applicable to industrial design flows. To push DREAMPlace towards commercial quality without knowledge of the tools’ black-box algorithms, in this paper we present DREAM-GAN, a placement optimization framework that advances DREAMPlace using generative adversarial learning. At each placement iteration, aside from optimizing the wirelength and density objectives of the vanilla DREAMPlace, DREAM-GAN computes and optimizes a differentiable loss that denotes the similarity score between the underlying placement and the tool-generated placements in commercial databases. Experimental results on 5 commercial and OpenCore designs using an industrial design flow implemented with Synopsys ICC2 not only demonstrate that DREAM-GAN significantly improves the vanilla DREAMPlace at the placement stage across each benchmark, but also show that the improvements persist through the post-route stage, where we observe improvements of up to 8.3% in wirelength and 7.4% in total power.
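
A hedged sketch of how a GAN-style similarity term can be attached to a differentiable placement objective (the network shape, loss weight, and tensor layout below are assumptions for illustration, not DREAM-GAN's actual architecture):

    import torch
    import torch.nn as nn

    # Sketch: a discriminator D is assumed to be trained to score
    # tool-generated ("real") placements highly; the cell coordinates are then
    # nudged to raise D's score alongside the usual placement objectives.
    class PlacementDiscriminator(nn.Module):
        def __init__(self, num_cells):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * num_cells, 256), nn.ReLU(),
                nn.Linear(256, 1))

        def forward(self, xy):              # xy: (batch, num_cells, 2)
            return self.net(xy.flatten(1))  # higher = more "tool-like"

    def combined_loss(wirelength, density, xy, disc, w_gan=0.1):
        # wirelength/density: differentiable DREAMPlace-style objective terms.
        similarity = disc(xy.unsqueeze(0)).mean()
        return wirelength + density - w_gan * similarity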

AutoDMP: Automated DREAMPlace-based Macro Placement

  • Anthony Agnesina
  • Puranjay Rajvanshi
  • Tian Yang
  • Geraldo Pradipta
  • Austin Jiao
  • Ben Keller
  • Brucek Khailany
  • Haoxing Ren

Macro placement is a critical very large-scale integration (VLSI) physical design problem that significantly impacts the design power-performance-area (PPA) metrics. This paper proposes AutoDMP, a methodology that leverages DREAMPlace, a GPU-accelerated placer, to place macros and standard cells concurrently in conjunction with automated parameter tuning using a multi-objective hyperparameter optimization technique. As a result, we can generate high-quality predictable solutions, improving the macro placement quality of academic benchmarks compared to baseline results generated from academic and commercial tools. AutoDMP is also computationally efficient, optimizing a design with 2.7 million cells and 320 macros in 3 hours on a single NVIDIA DGX Station A100. This work demonstrates the promise and potential of combining GPU-accelerated algorithms and ML techniques for VLSI design automation.
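
A toy version of the multi-objective parameter search (with made-up knob names and metrics, not AutoDMP's actual search space or tuner) looks like the following, where evaluate() would run the placer and report quality metrics:

    import random

    # Sketch: keep the Pareto front of (wirelength, congestion) over random
    # samples of placer hyperparameters; a real tuner would use a smarter
    # multi-objective optimizer, but the selection logic is the same.
    def pareto_front(results):
        return [r for r in results
                if not any(o is not r and o["wl"] <= r["wl"]
                           and o["cong"] <= r["cong"] for o in results)]

    def tune(evaluate, trials=50):
        results = []
        for _ in range(trials):
            params = {
                "target_density": random.uniform(0.60, 0.95),   # assumed knob
                "macro_halo": random.uniform(0.0, 5.0),         # assumed knob
                "learning_rate": 10 ** random.uniform(-3, -1),  # assumed knob
            }
            wl, cong = evaluate(params)   # user-supplied: run placer, measure
            results.append({"params": params, "wl": wl, "cong": cong})
        return pareto_front(results)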

Assessment of Reinforcement Learning for Macro Placement

  • Chung-Kuan Cheng
  • Andrew B. Kahng
  • Sayak Kundu
  • Yucheng Wang
  • Zhiang Wang

We provide an open, transparent implementation and assessment of Google Brain’s deep reinforcement learning approach to macro placement (Nature) and its Circuit Training (CT) implementation in GitHub. We implement key “black-box” elements of CT in open source and clarify discrepancies between CT and Nature. New testcases on open enablements are developed and released. We assess CT alongside multiple alternative macro placers, with all evaluation flows and related scripts publicly available on GitHub. Our experiments also encompass academic mixed-size placement benchmarks, as well as ablation and stability studies. We comment on the impact of Nature and CT, as well as directions for future research.

SESSION: Session 9: New Computing Techniques and Accelerators

GPU Acceleration in Physical Synthesis

  • Evangeline F.Y. Young

Placement and routing are essential steps in the physical synthesis of VLSI designs. Modern circuits contain billions of cells and nets, which significantly increases the computational complexity of physical synthesis and poses major challenges to leading-edge physical design tools. With the fast development of GPU architectures and computational power, exploring ways to speed up physical synthesis with massive parallelism on GPUs has become an important direction. In this talk, we will look into opportunities to improve EDA algorithms with GPU acceleration. Traditional EDA tools run on CPUs with a limited degree of parallelism. We will investigate a few examples of accelerating classical placement and routing algorithms using GPUs, and we will see how one can leverage the power of the GPU to improve both quality and runtime in solving these EDA problems.

Efficient Runtime Power Modeling with On-Chip Power Meters

  • Zhiyao Xie

Accurate and efficient power modeling techniques are crucial for both design-time power optimization and runtime on-chip IC management. In prior research, different types of power modeling solutions have been proposed, optimizing multiple objectives including accuracy, efficiency, temporal resolution, and automation level, targeting various power/voltage-related applications. Despite extensive prior explorations in this topic, new solutions still keep emerging and achieve state-of-the-art performance. This paper aims at providing a review of the recent progress in power modeling, with more focus on runtime on-chip power meter (OPM) development techniques. It also serves as a vehicle for discussing some general development techniques for the runtime on-chip power modeling task.

DREAMPlaceFPGA-PL: An Open-Source GPU-Accelerated Packer-Legalizer for Heterogeneous FPGAs

  • Rachel Selina Rajarathnam
  • Zixuan Jiang
  • Mahesh A. Iyer
  • David Z. Pan

Placement plays a pivotal and strategic role in the FPGA implementation flow to allocate the physical locations of the heterogeneous instances in the design. Among the placement stages, the packing or clustering stage groups logic instances like look-up tables (LUTs) and flip-flops (FFs) that could be placed on the same site. The legalization stage determines all instances’ physical site locations. With advances in FPGA architecture and technology nodes, designs contain millions of logic instances, and placement algorithms must scale accordingly. While other placement stages, global placement and detailed placement, have been accelerated using GPUs, the acceleration of the packing and legalization stages on a GPU remains largely unexplored. This work presents DREAMPlaceFPGA-PL, an open-source packer-legalizer for heterogeneous FPGAs that employs a GPU for acceleration. We revise the existing consensus-based parallel algorithms employed for packing and legalizing a flat placement to obtain further speedup on a GPU. Our experiments on the ISPD 2016 benchmarks demonstrate more than 2× acceleration.

SESSION: Session 10: Lifetime Achievement Commemoration for Professor Malgorzata Marek-Sadowska

Building Oscillatory Neural Networks: AI Applications and Physical Design Challenges

  • Aida Todri-Sanial

This talk is about a novel computing paradigm based on coupled oscillatory neural networks. Oscillatory neural networks (ONNs) are recurrent neural networks in which each neuron is an oscillator and the oscillator couplings are the synaptic weights. Inspired by Hopfield neural networks, ONNs make use of nonlinear dynamics to compute and to solve problems, such as associative memory tasks and combinatorial optimization problems, that are difficult to address with conventional digital computers. An exciting direction in recent years has been to implement Ising machines based on the Ising model of coupled binary spins in magnets. In this talk, I cover the design aspects of building ONNs, from devices to architecture, so as to benefit from parallel computation with oscillators while implementing them in an energy-efficient way.

Optimization of AI SoC with Compiler-assisted Virtual Design Platform

  • Chih-Tsun Huang
  • Juin-Ming Lu
  • Yao-Hua Chen
  • Ming-Chih Tung
  • Shih-Chieh Chang

As deep learning keeps evolving dramatically with rapidly increasing complexity, the demand for efficient hardware accelerators has become vital. However, the lack of software/hardware co-development toolchains makes designing AI SoCs (artificial intelligence systems-on-chip) considerably challenging. This paper presents a compiler-assisted virtual platform to facilitate the development of AI SoCs from the early design stage. The electronic system-level design platform provides rapid functional verification and performance/energy analysis. Cooperating with the neural network compiler, AI software and hardware can be co-optimized on the proposed virtual design platform. Our Deep Inference Processor is also utilized on the virtual design platform to demonstrate the effectiveness of the architectural evaluation and exploration methodology.

Challenges and Opportunities for Computing-in-Memory Chips

  • Xiang Qiu

In recent years, artificial neural networks have been applied to many scenarios, from daily-life applications like face detection to industry problems like placement and routing in physical design. Neural network inference mainly consists of multiply-accumulate operations, which require a huge amount of data movement. Traditional von Neumann computers are inefficient for neural networks because they have separate CPU and memory, and data transfer between them costs excessive energy and performance. To address this problem, in-memory and near-memory computing have been proposed and have attracted much attention in both academia and industry. In this talk, we will give a brief review of non-volatile memory crossbar-based computing-in-memory architectures. Next, we will discuss the challenges, from an industry perspective, for chips with such architectures to replace current CPUs/GPUs for neural network processing. Lastly, we will discuss possible solutions to those challenges.

ISPD 2023 Lifetime Achievement Award Bio

  • Malgorzata Marek-Sadowska

The 2023 International Symposium on Physical Design lifetime achievement award goes to Professor Malgorzata Marek-Sadowska for her outstanding contributions to the field.

SESSION: Session 11: Keynote III

Neural Operators for Solving PDEs and Inverse Design

  • Anima Anandkumar

Deep learning surrogate models have shown promise in modeling complex physical phenomena such as photonics, fluid flows, molecular dynamics and material properties. However, standard neural networks assume finite-dimensional inputs and outputs, and hence, cannot withstand a change in resolution or discretization between training and testing. We introduce Fourier neural operators that can learn operators, which are mappings between infinite dimensional spaces. They are discretization-invariant and can generalize beyond the discretization or resolution of training data. They can efficiently solve partial differential equations (PDEs) on general geometries. We consider a variety of PDEs for both forward modeling and inverse design problems, as well as show practical gains in the lithography domain.
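
To make the discretization-invariance property concrete, here is a minimal single spectral-convolution layer in the spirit of a Fourier neural operator (a simplified sketch, not the authors' implementation): the same trained layer can be applied to inputs sampled at different resolutions because it acts on a fixed number of Fourier modes.

    import torch
    import torch.nn as nn

    class SpectralConv1d(nn.Module):
        def __init__(self, channels, modes):
            super().__init__()
            self.modes = modes
            scale = 1.0 / (channels * channels)
            self.weight = nn.Parameter(
                scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

        def forward(self, x):                          # x: (batch, channels, n)
            x_ft = torch.fft.rfft(x)                   # to Fourier space
            out_ft = torch.zeros_like(x_ft)
            out_ft[:, :, :self.modes] = torch.einsum(  # mix the low modes only
                "bim,iom->bom", x_ft[:, :, :self.modes], self.weight)
            return torch.fft.irfft(out_ft, n=x.size(-1))  # back to input grid

    layer = SpectralConv1d(channels=1, modes=8)
    coarse, fine = torch.rand(1, 1, 64), torch.rand(1, 1, 256)
    print(layer(coarse).shape, layer(fine).shape)      # same layer, two resolutions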

SESSION: Session 12: Quantum Computing

Quantum Challenges for EDA

  • Leon Stok

Though early in its development, quantum computing is now available on real hardware and via the cloud through IBM Quantum. This radically new kind of computing holds open the possibility of solving some problems that are now and perhaps always will be intractable for “classical” computers.

As with any new technology, things are developing rapidly, but there are still many open questions. What is the status of quantum computers today? What are the key metrics we need to look at to improve a quantum system? What are some of the technical opportunities being looked at from an EDA perspective?

We will look at the quantum roadmap for the next couple of years, outline challenges that need to be solved, and discuss how the EDA community can potentially contribute to solving these challenges.

Developing Quantum Workloads for Workload-Driven Co-design

  • Anne Matsuura

Quantum computing offers the future promise of solving problems that are intractable for classical computers today. However, as an entirely new kind of computational device, we must learn how to best develop useful workloads. Today’s small workloads serve the dual purpose that they can also be used to learn how to design a better quantum computing system architecture. At Intel Labs, we develop small application-oriented workloads and use them to drive research into the design of a scalable quantum computing system architecture. We run these small workloads on the small systems of qubits that we have today to understand what is required from the system architecture to run them efficiently and accurately on real qubits. In this presentation, I will give examples of quantum workload-driven co-design and what we have learned from this type of research.

MQT QMAP: Efficient Quantum Circuit Mapping

  • Robert Wille
  • Lukas Burgholzer

Quantum computing is an emerging technology that has the potential to revolutionize fields such as cryptography, machine learning, optimization, and quantum simulation. However, a major challenge in the realization of quantum algorithms on actual machines is ensuring that the gates in a quantum circuit (i.e., the corresponding operations) match the topology of a targeted architecture so that the circuit can be executed while, at the same time, the resulting costs (e.g., in terms of the number of additionally introduced gates, fidelity, etc.) are kept low. This is known as the quantum circuit mapping problem. This summary paper provides an overview of QMAP, an open-source tool that is part of the Munich Quantum Toolkit (MQT) and offers efficient, automated, and accessible methods for tackling this problem. To this end, the paper first briefly reviews the problem. Afterwards, it shows how QMAP can be used to efficiently map quantum circuits to quantum computing architectures from both a user’s and a developer’s perspective. QMAP is publicly available as open source at https://github.com/cda-tum/qmap.
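
To illustrate the mapping problem itself (this toy routine is not QMAP's algorithm), consider a linear coupling map where a two-qubit gate can only act on adjacent physical qubits, so SWAPs are inserted until the two operands become neighbors:

    # Sketch: layout[logical] = physical qubit; route each two-qubit gate on a
    # linear architecture 0-1-2-...-(n-1) by swapping toward adjacency.
    def map_circuit(gates, num_qubits):
        layout = list(range(num_qubits))          # identity initial placement
        mapped = []
        for a, b in gates:                        # gate on logical qubits (a, b)
            pa, pb = layout[a], layout[b]
            while abs(pa - pb) > 1:               # not adjacent on the line
                step = 1 if pb > pa else -1
                neighbor = layout.index(pa + step)
                layout[a], layout[neighbor] = layout[neighbor], layout[a]
                mapped.append(("SWAP", pa, pa + step))
                pa, pb = layout[a], layout[b]
            mapped.append(("CX", pa, pb))
        return mapped

    # Example: three CNOTs on 4 qubits of a linear coupling map.
    print(map_circuit([(0, 3), (1, 2), (0, 2)], 4))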

SESSION: Session 13: Panel on EDA for Domain Specific Computing

EDA for Domain Specific Computing: An Introduction for the Panel

  • Iris Hui-Ru Jiang
  • David Chinnery

This panel explores domain-specific computing from hardware, software, and electronic design automation (EDA) perspectives.

Hennessy and Patterson signaled a new “golden age of computer architecture” in 2018 [1]. Process technology advances and general-purpose processor improvements provided much faster and more efficient computation, but scaling with Moore’s law has slowed significantly. Domain-specific customization can improve power-performance efficiency by orders of magnitude for important application domains, such as graphics, deep neural networks (DNNs) for machine learning [2], simulation, bioinformatics [3], image processing, and many other tasks.

The common features of domain-specific architectures are: 1) dedicated memories to minimize data movement across chip; 2) more arithmetic units or bigger memories; 3) use of parallelism matching the domain; 4) smaller data types appropriate for the target applications; and 5) domain-specific software languages. Expediting software development with optimized compilation for efficient fast computation on heterogeneous architectures is a difficult task, and must be considered with the hardware design. For example, GPU programming has used CUDA and OpenCL.

The hardware comprises application-specific integrated circuits (ASICs) [4] and systems-on-chip (SoCs). General-purpose processor cores are often combined with graphics processing units (GPUs) for stream processing, digital signal processors, field-programmable gate arrays (FPGAs) for configurability [5], artificial intelligence (AI) acceleration hardware, and so forth.

Domain-specific computers have been deployed recently. For example: the Google Tensor Processing Unit (DNN ASIC) [6]; Microsoft Catapult (FPGA-based cloud domain-service solution) [7]; Intel Crest (DNN ASIC) [8]; Google Pixel Visual Core (image processing and computer vision for cell phones and tablets) [9]; and the RISC-V architecture and open instruction set for heterogeneous computing [10].

Software-driven Design for Domain-specific Compute

  • Desmond A. Kirkpatrick

The end of Dennard scaling has created a focus on advancing domain-specific computing; we are seeing a renaissance of accelerating compute problems through specialization, with orders-of-magnitude improvement in performance and energy efficiency [1]. Domain-specific compute, with its wide proliferation of domains and narrow specialization of hardware and software, provides unique challenges in design automation not met by the methodologies matured under the model of high-volume manufacturing of competitive CPUs, GPUs, and SoCs [2]. Importantly, domain-specific compute targets smaller markets that move more rapidly, so design non-recurring engineering (NRE) costs play a much larger role. Second, the role of software is so much more significant that we believe a software-first approach, where software drives hardware design and the product is developed at the speed of software, is required to keep pace with domain-specific compute market requirements. This creates significant new challenges and opportunities for EDA to address the domain-specific compute design space. The forces that are driving the renaissance in domain-specific compute architectures also require a renaissance in the tools, flows, and methods to maintain this pace of innovation.

This talk will present a general framework for approaching automation of domain-specific compute co-design of SW/HW and draw upon recent innovations in EDA that can help us address this challenge. The focus will be on driving software-oriented techniques, such as agile design, into hardware design [3], as well as vertically oriented domain-specific codesign automation stacks [4], and some of the gaps in EDA that currently limit these approaches.

Google Investment in Open Source Custom Hardware Development Including No-Cost Shuttle Program

  • Tim Ansell

The end of Moore’s Law, combined with unabated growth in usage, has forced Google to turn to hardware acceleration to deliver the efficiency gains needed to meet demand. Traditional hardware design methodology for accelerators is practical when there is a common core, such as with machine learning (ML) or video transcoding, but what about the hundreds of smaller tasks performed in Google data centers? Our vision is “software-speed” development for hardware acceleration so that it becomes commonplace and, frankly, boring. Toward this goal, Google is investing in open tooling to foster innovation in multiplying accelerator developer productivity.

Tim Ansell will provide an outline of these coordinated open source projects in EDA (including high level synthesis), IP, PDKs, and related areas. This will be followed by presenting the CFU (Custom Function Unit) Playground, which utilizes many of these projects.

The CFU Playground lets you build your own specialized & optimized ML processor based on the open RISC-V ISA, implemented on an FPGA using a fully open source stack. The goal isn’t general ML extensions; it’s about a methodology for building your own extension specialized just for your specific tiny ML model. The extension can range from a few simple new instructions, up to a complex accelerator that interfaces to the CPU via a set of custom instructions; we will show examples of both.

A Case for Open EDA Verticals

  • Zhiru Zhang
  • Matthew Hofmann
  • Andrew Butt

With the end of Dennard scaling and Moore’s Law reaching its limits, domain-specific hardware specialization has become a crucial method for improving compute performance and efficiency for various important applications. Leading companies in competitive fields, such as machine learning and video processing, are building their own in-house technology stacks to better suit their accelerator design needs. However, currently this approach is only a viable option for a few large enterprises that can afford to invest in teams of experts in hardware, systems, and compiler development for high-value applications. In particular, the high license cost of commercial electronic design automation (EDA) tools presents a significant barrier for small and mid-size engineering teams to create new hardware accelerators. These tools are essential for designing, simulating, and testing new hardware, but can be too expensive for smaller teams with limited budgets, reducing their ability to innovate and compete with larger organizations.

More recently, open-source EDA toolflows [1, 5, 11, 12] have emerged that offer a promising alternative to commercial tools, with the potential to provide more cost-effective solutions for hardware development. For example, OpenROAD [1] allows the design of custom ASICs with minimal human intervention and no licensing fees. During initial development, it was also able to take advantage of existing tools such as Yosys [14] and KLayout [6] to reduce the amount of new code required to get a working flow. However, early adoption of open-source alternatives carries risk, as open-source EDA projects often lack important features and are less reliable than commercial options. Additionally, current open-source EDA tools may produce less competitive quality of results (QoR) and may not be able to catch up to commercial solutions anytime soon. Even when EDA tool access is not an issue, designing and implementing special-purpose accelerators using conventional RTL methodology can be unproductive and incur high non-recurring engineering (NRE) costs. High-level synthesis (HLS) has become increasingly popular in both academia and industry to automatically generate RTL designs from software programs. However, existing HLS tools do not help maintain domain-specific context throughout the design flow (e.g., placement, routing), which makes achieving good QoR difficult without significant manual fine-tuning. This hinders wider adoption of HLS.

We advocate for open EDA verticals as a solution to enabling more widespread use of domain-specific hardware acceleration. The objective is to empower small teams of domain experts to productively develop high-performance accelerators using programming interfaces they are already familiar with. For example, this means supporting domain-specific frameworks like PyTorch or TensorFlow for ML applications. In order for EDA verticals to proliferate, there must first be extensible infrastructure similar to LLVM [8] and MLIR [9] from which to build new tool flows. The proper EDA infrastructure would include novel intermediate representations specifically tailored to the unique challenges in gradually lowering high-level code down to gates.

Addressing the EDA Roadblocks for Domain-specific Compilers: An Industry Perspective

  • Alireza Kaviani

Computer architects now widely subscribe to domain-specific architectures as the only path left for major improvements in performance, cost, and energy. As a result, future compilers need to go beyond their traditional role of mapping a design input to a generic hardware platform. Emerging domain-specific compilers must take a broader view in which compilers provide more control to end users, enabling customization of hardware components to implement their corresponding tasks. Transitioning into this new design paradigm, where control and customization are key enablers, poses new challenges for domain-specific compilers.

Today, generic vendor backend EDA compilers are the only available mechanism to realize a broad range of applications in many domains. The necessity of breadth coverage by commercial tools often leads to implementations that do not take full advantage of the underlying hardware. Domain-specific compilers, on the other hand, can potentially deliver near-spec performance by taking advantage of both application attributes and architecture details. This issue is less pronounced for more generic computing platforms such as CPUs, thanks to leveraging open source as an essential component of software development. However, quality EDA software has remained mostly proprietary, and existing open-source attempts do not produce results of sufficient quality to be useful commercially at scale. Addressing the EDA roadblocks towards quality domain-specific compilers will require stepping-stone milestones from both industry and the community.

This suggests the need for a framework capable of interfacing between closed-source vendor backend tools and open-source domain compilers. RapidWright [1] is an example of such a framework: it enables a new level of optimization and customization for the application architect to further exploit FPGA silicon capabilities while focusing on a specific domain.

There are a few factors that will expedite progress for this approach. For example, RapidStream [2] demonstrates 30% higher performance and more than 5× faster compile time for data flow applications. The key enabler for the RapidStream domain compiler is split compilation, which was made possible for data flow applications by a latency-tolerant front-end and design entry. EDA vendors could enable such bottom-up flows by implementing a foundational infrastructure that allows multiple application modules to be implemented independently. Another useful step would be to decouple certain portions of monolithic EDA tools with separate, more permissive licensing so that they can be combined with open-source domain compilers.

Another key step required for domain-specific compilers to be successful is a process for offering a guarantee to the end customer. Today’s vendor tool flows offer a full guarantee and support to the end customer at the expense of limiting customization and control. The new paradigm of domain-specific compilers implies many variations of the tool flow, and it might not be feasible to provide the same level of support and guarantee as existing standard flows. The community needs to explore alternative ways of offering an equivalent level of support and guarantee to end users in order to make domain-specific compilers widely adopted.

High-level Synthesis for Domain Specific Computing

  • Hanchen Ye
  • Hyegang Jun
  • Jin Yang
  • Deming Chen

This paper proposes a High-Level Synthesis (HLS) framework for domain-specific computing. The framework contains three key components. 1) ScaleHLS, a multi-level HLS compilation flow aimed at addressing the lack of expressiveness and hardware-dedicated representation in traditional software-oriented compilers. ScaleHLS introduces a hierarchical intermediate representation (IR) for the progressive optimization of HLS designs defined in various high-level languages, and it performs optimizations at three levels, covering the graph, loop, and directive levels, to realize an efficient compilation pipeline and generate highly optimized domain-specific accelerators. 2) AutoScaleDSE, an automated design space exploration (DSE) engine. Real-world HLS designs often come with large design spaces that are difficult for designers to explore, and the connections between different components of an HLS design further complicate these spaces. To address the DSE problem, AutoScaleDSE proposes a random forest classifier and a graph-driven approach to improve the accuracy of estimating intermediate DSE results while reducing time and computational cost. With this new approach, AutoScaleDSE can evaluate thousands of HLS design points and find the Pareto-dominating design points within a couple of hours. 3) PyTransform, a flexible pattern-driven design customization flow. Existing HLS flows demand manual code rewriting or intrusive compiler customization to conduct domain-specific optimizations, leading to unscalable or inflexible compiler solutions. PyTransform proposes a Python-based flow that enables users to define custom matching and rewriting patterns at a high level of abstraction, which can be incorporated into the DSL compilation flow in an automatic and scalable manner. In summary, ScaleHLS, AutoScaleDSE, and PyTransform aim to address the challenges present in the compilation, DSE, and customization of existing HLS flows, respectively. With these three key components, our newly proposed HLS framework can deliver a scalable and extensible solution for designing domain-specific languages to automate and speed up the process of designing domain-specific accelerators.
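
To make the pattern-driven customization idea concrete, the following is a minimal, self-contained Python sketch of a matching-and-rewriting pass in the spirit of PyTransform. The Op class and the match_mul_add and rewrite functions are illustrative placeholders and are not the actual PyTransform API; the example simply rewrites a multiply feeding an add into a single fused operation, the kind of domain-specific rewrite that would otherwise require intrusive compiler changes.

    from dataclasses import dataclass

    @dataclass
    class Op:
        name: str
        args: tuple

    def match_mul_add(op):
        """Match add(mul(a, b), c) so it can be rewritten as a fused multiply-add."""
        if op.name == "add" and isinstance(op.args[0], Op) and op.args[0].name == "mul":
            (a, b), c = op.args[0].args, op.args[1]
            return a, b, c
        return None

    def rewrite(op):
        """Rewrite a matched pattern into a single fused op (e.g., to map onto a DSP block)."""
        m = match_mul_add(op)
        return Op("fma", m) if m else op

    expr = Op("add", (Op("mul", ("x", "y")), "z"))
    print(rewrite(expr))   # Op(name='fma', args=('x', 'y', 'z'))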

SESSION: Session 14: Hardware Security and Bug Fixing

Security-aware Physical Design against Trojan Insertion, Frontside Probing, and Fault Injection Attacks

  • Jhih-Wei Hsu
  • Kuan-Cheng Chen
  • Yan-Syuan Chen
  • Yu-Hsiang Lo
  • Yao-Wen Chang

The dramatic growth of hardware attacks and the lack of security-aware solutions in design tools lead to severe security problems in modern IC designs. Although many existing countermeasures provide decent protection against security issues, they still lack a global design view with sufficient security consideration at design time. This paper proposes a security-aware framework against Trojan insertion, frontside probing, and fault injection attacks at the design stage. The framework consists of two major techniques: (1) a large-scale shielding method that effectively covers the exposed areas of assets and (2) a cell-movement-based method to eliminate the empty spaces vulnerable to Trojan insertion. Experimental results show that our framework effectively reduces vulnerability to these attacks and achieves the best overall score compared with the top-3 teams in the 2022 ACM ISPD Security Closure of Physical Layouts Contest.

Security Closure of IC Layouts Against Hardware Trojans

  • Fangzhou Wang
  • Qijing Wang
  • Bangqi Fu
  • Shui Jiang
  • Xiaopeng Zhang
  • Lilas Alrahis
  • Ozgur Sinanoglu
  • Johann Knechtel
  • Tsung-Yi Ho
  • Evangeline F.Y. Young

Due to cost benefits, supply chains of integrated circuits (ICs) are largely outsourced nowadays. However, passing ICs through various third-party providers gives rise to many threats, like piracy of IC intellectual property or insertion of hardware Trojans, i.e., malicious circuit modifications.

In this work, we proactively and systematically harden the physical layouts of ICs against post-design insertion of Trojans. Toward that end, we propose a multiplexer-based logic-locking scheme that is (i) devised for layout-level Trojan prevention, (ii) resilient against state-of-the-art, oracle-less machine learning attacks, and (iii) fully integrated into a tailored, yet generic, commercial-grade design flow. Our work provides in-depth security and layout analysis on a challenging benchmark suite. We show that ours can render layouts resilient, with reasonable overheads, against Trojan insertion in general and also against second-order attacks (i.e., adversaries seeking to bypass the locking defense in an oracle-less setting).

We release our layout artifacts for independent verification [29].

X-Volt: Joint Tuning of Driver Strengths and Supply Voltages Against Power Side-Channel Attacks

  • Saideep Sreekumar
  • Mohammed Ashraf
  • Mohammed Nabeel
  • Ozgur Sinanoglu
  • Johann Knechtel

Power side-channel (PSC) attacks are well-known threats to sensitive hardware like advanced encryption standard (AES) crypto cores. Given the significant impact of supply voltages (VCCs) on power profiles, various countermeasures based on VCC tuning have been proposed, among other defense strategies. Driver strengths of cells, however, have been largely overlooked, despite having direct and significant impact on power profiles as well.

For the first time, we thoroughly explore the prospects of jointly tuning driver strengths and VCCs as novel working principle for PSC-attack countermeasures. Toward this end, we take the following steps: 1) we develop a simple circuit-level scheme for tuning; 2) we implement a CAD flow for design-time evaluation of ASICs, enabling security assessment of ICs before tape-out; 3) we implement a correlation power analysis (CPA) framework for thorough and comparative security analysis; 4) we conduct an extensive experimental study of a regular AES design, implemented in ASIC as well as FPGA fabrics, under various tuning scenarios; 5) we summarize design guidelines for secure and efficient joint tuning.

In our experiments, we observe that runtime tuning is more effective than static tuning, for both ASIC and FPGA implementations. For the latter, the AES core is rendered > 11.8x (i.e., at least 11.8 times) as resilient as the untuned baseline design. Layout overheads can be considered acceptable, with, e.g., around +10% critical-path delay for the most resilient tuning scenario in FPGA.

We release source codes for our methodology, as well as artifacts from the experimental study, in [13].

Validating the Redundancy Assumption for HDL from Code Clone’s Perspective

  • Jianjun Xu
  • Jiayu He
  • Jingyan Zhang
  • Deheng Yang
  • Jiang Wu
  • Xiaoguang Mao

Automated program repair (APR) is being leveraged in hardware description languages (HDLs) to fix hardware bugs without human involvement. Most existing APR techniques search for donor code (i.e., code fragments for bug fixing) in the original program to generate repairs, which is based on the assumption that donor code can be found in existing source code. This redundancy assumption is the fundamental basis of most APR techniques and has been widely studied in software by searching for code clones of donor code. However, despite a large body of work on code clone detection, researchers have focused almost exclusively on repositories in traditional programming languages, such as C/C++ and Java, while few studies have been done on detecting code clones in HDLs. Furthermore, little attention has been paid to the repetitiveness of bug fixes in hardware designs, which limits automatic repair targeting HDLs. To validate the redundancy assumption for HDL, we perform an empirical study on code clones of real-world bug fixes in Verilog. Based on the empirical results, we find that 17.71% of newly introduced code in bug fixes can be found in the clone pairs of the buggy code in the original program, and 11.77% can be found in the file itself. The findings not only validate the assumption but also provide helpful insights for the design of APR targeting HDLs.

SESSION: Session 15: ISPD 2023 Contest Results and Closing Remarks

Benchmarking Advanced Security Closure of Physical Layouts: ISPD 2023 Contest

  • Mohammad Eslami
  • Johann Knechtel
  • Ozgur Sinanoglu
  • Ramesh Karri
  • Samuel Pagliarini

Computer-aided design (CAD) tools traditionally optimize “only” for power, performance, and area (PPA). However, given the wide range of hardware-security threats that have emerged, future CAD flows must also incorporate techniques for designing secure and trustworthy integrated circuits (ICs). This is because threats that are not addressed during design time will inevitably be exploited in the field, where system vulnerabilities induced by ICs are almost impossible to fix. However, there is currently little experience for designing secure ICs within the CAD community.

This contest seeks to actively engage with the community to close this gap. The theme is security closure of physical layouts, that is, hardening the physical layouts at design time against threats that are executed post-design time. Acting as security engineers, contest participants will proactively analyse and fix the vulnerabilities of benchmark layouts in a blue-team approach. Benchmarks and submissions are based on the generic DEF format and related files.

This contest is focused on the threat of Trojans, with challenging aspects for physical design in general and for hindering Trojan insertion in particular. For one, layouts are based on the ASAP7 library and rules are strict, e.g., no DRC issues and no timing violations are allowed at all. In the alpha/qualifying round, submissions are evaluated using first-order metrics focused on exploitable placement and routing resources, whereas in the final round, submissions are thoroughly evaluated (red-teamed) through actual insertion of different Trojans.

Who’s Scott Beamer

April 2023

Scott Beamer

Assistant Professor

Department of Computer Science & Engineering, University of California, Santa Cruz

Email:

sbeamer@ucsc.edu

Personal webpage:

https://scottbeamer.net

Research interests

Agile and open-source hardware design, computer architecture, graph processing, and data movement optimization

Short bio

Scott Beamer is an assistant professor of computer science and engineering at the University of California, Santa Cruz. His research interests include agile hardware design, high-performance graph processing, and computer architecture. He has received an NSF CAREER award, the Kaivalya Dixit Distinguished Dissertation Award from SPEC, and best paper awards from the International Parallel & Distributed Processing Symposium (IPDPS) and the International Symposium on Workload Characterization (IISWC). He has a PhD in Computer Science from the University of California, Berkeley, and was formerly a postdoctoral scholar at Lawrence Berkeley National Laboratory.

Research highlights

(1) Accelerating RTL Simulation
Simulation is a crucial tool for hardware design, but the current slow speed of RTL simulation often bottlenecks the whole design process. Dr. Beamer’s work explores techniques to drastically accelerate simulation speed while providing the same cycle-accurate result. These results are released as the open-source ESSENT simulator, which demonstrates both leading single-threaded [DAC20] and parallel [ASPLOS23] performance.
(2) High-Performance Graph Processing
The versatility of the graph abstraction allows it to represent many things, from hardware circuits to social networks. Unfortunately, graph applications typically underutilize existing general-purpose compute resources. Dr. Beamer’s work has accelerated graph processing through a variety of means. Algorithmically, he created the direction-optimizing breadth-first search (BFS) algorithm, which is the fastest BFS for low-diameter graphs [SC12] and is widely used in the Graph500 competition (a simplified sketch of the idea appears after this list). He carefully analyzed graph workloads with performance counters [IISWC15] and created the GAP benchmark suite, which has been used by over 250 publications. His propagation blocking work transforms graph algorithms to increase their spatial locality [IPDPS17].
(3) Monolithically-Integrated Silicon Photonics
Due to the limits of electrical signalling, off-chip bandwidth can be greatly hindered by area or power constraints. Monolithically-integrated silicon photonics provide an amazing opportunity to overcome such limitations for inter-chip communication. Collaborating with device experts, Dr. Beamer designed architectures to best utilize photonics, for CPU to DRAM [ISCA10] and within a single chip [NOCS09].
(4) Agile & Open-Source Hardware Design
With the slowing of Dennard scaling and the rise in the need for hardware specialization, there is a corresponding need to reduce the cost and complexity of hardware design. Agile techniques provide a promising way to increase productivity, and Dr. Beamer has created a course on Agile Hardware Design and released all of its content as open source (https://github.com/agile-hw). Releasing tools and hardware designs as open source can greatly help the community, and Dr. Beamer was an early contributor to the RISC-V project and one of the first users of the Chisel hardware construction language.
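
As a reference point for highlight (2), below is a minimal Python sketch of the direction-optimizing idea: run conventional top-down BFS steps while the frontier is small and switch to bottom-up steps once the frontier's edge count grows large. The single switching heuristic and the alpha threshold here are simplifications for illustration; the published algorithm uses tuned top-down/bottom-up switching thresholds and incremental bookkeeping.

    def direction_optimizing_bfs(adj, source, alpha=14):
        """Simplified hybrid BFS: top-down for small frontiers, bottom-up for large ones."""
        n = len(adj)
        parent = [-1] * n
        parent[source] = source
        frontier = [source]
        while frontier:
            frontier_edges = sum(len(adj[v]) for v in frontier)
            unexplored_edges = sum(len(adj[v]) for v in range(n) if parent[v] == -1)
            if frontier_edges * alpha > unexplored_edges:
                # bottom-up step: each unvisited vertex scans its neighbors for a parent
                # (a real implementation tracks the unvisited set incrementally)
                frontier_set = set(frontier)
                next_frontier = []
                for v in range(n):
                    if parent[v] == -1:
                        for u in adj[v]:
                            if u in frontier_set:
                                parent[v] = u
                                next_frontier.append(v)
                                break
            else:
                # top-down step: frontier vertices claim their unvisited neighbors
                next_frontier = []
                for u in frontier:
                    for v in adj[u]:
                        if parent[v] == -1:
                            parent[v] = u
                            next_frontier.append(v)
            frontier = next_frontier
        return parent

    # tiny undirected example: edges 0-1, 0-2, 1-3, 2-3
    adj = [[1, 2], [0, 3], [0, 3], [1, 2]]
    print(direction_optimizing_bfs(adj, 0))   # [0, 0, 0, 1]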

Who’s Ming-Chang Yang

March 2023

Ming-Chang Yang

Associate Professor

Department of Computer Science and Engineering, The Chinese University of Hong Kong

Email:

mcyang@cse.cuhk.edu.hk

Personal webpage:

http://www.cse.cuhk.edu.hk/~mcyang/

Research interests

Emerging non-volatile memory and storage technologies, memory and storage systems, and the next-generation memory/storage architecture designs.

Short bio

Ming-Chang Yang is currently an Associate Professor in the Department of Computer Science and Engineering at The Chinese University of Hong Kong. He received his B.S. degree from the Department of Computer Science at National Chiao-Tung University, Hsinchu, Taiwan, in 2010, and his Master's and Ph.D. degrees (supervised by Professor Tei-Wei Kuo) from the Department of Computer Science and Information Engineering at National Taiwan University, Taipei, Taiwan, in 2012 and 2016, respectively. His primary research interests include emerging non-volatile memory and storage technologies, memory and storage systems, and next-generation memory/storage architecture designs.

Dr. Yang has published more than 70 research papers, mainly in top journals (e.g., IEEE TC, IEEE TCAD, IEEE TVLSI, and ACM TECS) and top conferences (e.g., USENIX OSDI, USENIX FAST, USENIX ATC, ACM/IEEE DAC, ACM/IEEE ICCAD, ACM/IEEE CODES+ISSS, and ACM/IEEE EMSOFT). He received two best paper awards (from IEEE NVMSA 2019 and ACM/IEEE ISLPED 2020) for his research contributions on emerging non-volatile memory, and he was awarded the TSIA Ph.D. Student Semiconductor Award from the Taiwan Semiconductor Industry Association (TSIA) in 2016 for his research achievements on flash memory.

Research highlights

The main research interest of Dr. Yang’s group is embracing emerging memory/storage technologies in computer systems, including various types of non-volatile memory (NVM) as well as the shingled magnetic recording (SMR) and interlaced magnetic recording (IMR) technologies for next-generation hard disk drives (HDDs).

In particular, in view of the common read-write asymmetry (in both latency and energy) of NVM, one series of Dr. Yang’s work attempts to alleviate the side effects caused by such asymmetry by innovating application and/or algorithm designs. For example, one of their most recent studies devises a novel dynamic hashing scheme for NVM called SEPH, which exhibits excellent performance scalability, efficiency, and predictability on a real NVM product (i.e., Intel® Optane™ DCPMM). Dr. Yang’s group has also revamped the algorithmic design of random forest, a core machine learning (ML) algorithm, for NVM. This line of work has received particular attention and recognition from the community, including the two best paper awards from NVMSA 2019 and ISLPED 2020. Moreover, Dr. Yang’s group is a pioneer in exploring memory subsystem designs based on an emerging type of NVM called racetrack memory (RTM).

On the other hand, even though the cutting-edge SMR and IMR technologies bring a lower cost-per-GB to HDDs, they also introduce a write amplification problem, resulting in severe write performance degradation. In light of this, Dr. Yang’s group has introduced several novel data management designs at different system layers for SMR-based and IMR-based HDDs. For example, they architect KVIMR, a data management middleware for constructing a cost-effective yet high-throughput LSM-tree-based KV store on IMR-based HDDs. KVIMR exhibits significant throughput improvement and excellent compatibility with mainstream LSM-tree-based KV stores (such as RocksDB and LevelDB). In addition, at the block layer, they put forward a novel design called Virtual Persistent Cache (VPC) that adaptively exploits the computing and management resources of the host system to improve the write responsiveness of SMR-based HDDs. Moreover, they realize a firmware design called MAGIC, which shows great potential to close the performance gap between traditional and IMR-based HDDs.

Apart from the system work on adopting emerging memory/storage technologies, Dr. Yang’s group also has a special interest in data-intensive and data-driven applications. For instance, they aim to optimize the efficiency and practicality of out-of-core graph processing systems, which offload the enormous graph data from memory to storage for better scalability at low cost. They also develop new frameworks for graph representation learning and graph neural networks with significant performance improvements.

FPGA’23 TOC

FPGA ’23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

 Full Citation in the ACM Digital Library

SESSION: Keynote I

Compiler Support for Structured Data

  • Saman Amarasinghe

In 1957, the FORTRAN language and compiler introduced multidimensional dense arrays, or dense tensors. Subsequent programming languages added a myriad of data structures, from lists, sets, hash tables, and trees to graphs. Still, when dealing with extremely large data sets, dense tensors are the only simple and practical solution. However, modern data is anything but dense. Real-world data, generated by sensors, produced by computation, or created by humans, often contains underlying structure, such as sparsity, runs of repeated values, or symmetry.

In this talk I will describe how programming languages and compilers can support large data sets with structure. I will introduce TACO, a compiler for sparse data computing. TACO is the first system to automatically generate kernels for any tensor algebra operation on tensors in any of the commonly used formats. It pioneered a new technique for compiling compound tensor expressions into efficient loops in a systematic way. TACO-generated code has performance competitive with best-in-class hand-written codes for tensor and matrix operations. With TACO, I will show how to put sparse array programming on the same compiler-transformation and code-generation footing as dense array codes. Structured data has immense potential for hardware acceleration. However, instead of one-off single-operation compute engines, I believe that with compiler frameworks such as TACO it is possible to create hardware for an entire class of sparse computations. With the help of the FPGA community, I am looking forward to such a future.
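
For readers unfamiliar with sparse tensor compilation, the sketch below shows the kind of loop nest a compiler such as TACO derives for the expression y(i) = A(i,j) * x(j) when A is stored in CSR: the loops co-iterate only over stored nonzeros rather than the full dense index space. TACO itself emits C code; this Python version is only meant to illustrate the shape of the generated kernel.

    def spmv_csr(row_ptr, col_idx, vals, x):
        """y(i) = A(i,j) * x(j) with A in CSR: iterate rows densely, nonzeros sparsely."""
        y = [0.0] * (len(row_ptr) - 1)
        for i in range(len(row_ptr) - 1):                 # dense loop over rows
            for p in range(row_ptr[i], row_ptr[i + 1]):   # sparse loop over row i's nonzeros
                y[i] += vals[p] * x[col_idx[p]]
        return y

    # 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form
    row_ptr, col_idx, vals = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
    print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))   # [3.0, 3.0]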

SESSION: Session: High-Level Abstraction and Tools

DONGLE: Direct FPGA-Orchestrated NVMe Storage for HLS

  • Linus Y. Wong
  • Jialiang Zhang
  • Jing (Jane) Li

Rapid growth in data size poses increasing computational and memory challenges to data processing. FPGA accelerators and near-storage processing are promising candidates for tackling computational and memory requirements, and many near-storage FPGA accelerators have been shown to be effective in processing large data. However, the current HLS development environment does not allow direct NVMe storage access from the HLS code. As such, users must frequently hand off between HLS and host code to access data in storage, and such a process requires tedious programming to ensure functional correctness. Moreover, since the HLS code uses radically different methods to access storage compared to DRAM, the HLS codebase targeting DRAM-based platforms cannot be easily ported to NVMe-based platforms, resulting in limited code portability and reusability. Furthermore, frequent suspension of HLS kernel and synchronization between CPU and FPGA introduce significant latency overhead and require sophisticated scheduling mechanisms to hide latency.

To address these challenges, we propose a new HLS storage interface named DONGLE that enables direct FPGA-orchestrated NVMe storage access. By providing a unified interface for storage and memory access, DONGLE allows a single-source HLS program to target multiple memory/storage devices, thus making the codebase cleaner, portable, and more efficient. We prototyped DONGLE with an AMD/Xilinx Alveo U200 FPGA and Solidigm DC-P4610 SSD and demonstrate a geomean speed-up of 2.3× and a reduction of lines-of-code by 2.4× on evaluated workloads over the state-of-the-art commercial platform.

FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs

  • Linfeng Du
  • Tingyuan Liang
  • Sharad Sinha
  • Zhiyao Xie
  • Wei Zhang

Multi-die FPGAs are widely adopted to deploy large-scale hardware accelerators. Two factors impede the performance optimization of high-level synthesis (HLS) designs implemented on multi-die FPGAs. On the one hand, the long net delay due to nets crossing die boundaries results in an NP-hard problem of properly floorplanning and pipelining an application. On the other hand, the traditional automated search flow for HLS directive optimization targets single-die FPGAs and hence cannot consider the resource constraints on each die or the timing issues incurred by die crossings. Further, because of the large design scale, legalizing the floorplan of the HLS design generated under each configuration during directive optimization incurs excessively long runtime.

To co-optimize the directives and floorplan of HLS designs on multi-die FPGAs, we propose the FADO framework, which formulates the directive-floorplan co-search problem as multi-choice multi-dimensional bin-packing and solves it with an iterative optimization flow. For each step of directive optimization, a latency-bottleneck-guided greedy algorithm searches for more efficient directive configurations. For floorplanning, instead of repeatedly invoking global floorplanning algorithms, we implement a more efficient incremental floorplan legalization algorithm. It mainly applies the worst-fit strategy from online bin-packing to balance the floorplan, together with an offline best-fit-decreasing re-packing step to compact the floorplan, followed by pipelining of the long wires crossing die boundaries.

Through experiments on a set of HLS designs mixing dataflow and non-dataflow kernels, FADO not only fully automates the co-optimization and finishes with 693X~4925X shorter runtime compared with DSE assisted by global floorplanning, but also yields a 1.16X~8.78X improvement in overall workflow execution time after implementation on the Xilinx Alveo U250 FPGA.

Eliminating Excessive Dynamism of Dataflow Circuits Using Model Checking

  • Jiahui Xu
  • Emmet Murphy
  • Jordi Cortadella
  • Lana Josipovic

Recent HLS efforts explore the generation of dynamically scheduled, dataflow circuits from high-level code; their ability to adapt the schedule at runtime to particular data and control outcomes promises superior performance to standard, statically scheduled HLS solutions. However, dataflow circuits are notoriously resource-expensive: their distributed handshake mechanism brings performance benefits in some cases, but causes an unneeded resource overhead when general dynamism is not required. In this work, we present a verification framework based on model checking to systematically reduce the hardware complexity of dataflow circuits. We devise a series of formal proofs that identify the absence of particular behavioral scenarios and use this information to replace the generic dataflow logic with simpler and cheaper control structures. On a set of benchmarks obtained from high-level code, we demonstrate that our technique significantly reduces the resource requirements of dataflow circuits (i.e., it results in LUT and FF reductions of up to 51% and 53%, respectively), while still reaping all performance benefits of dynamic scheduling.

Straight to the Queue: Fast Load-Store Queue Allocation in Dataflow Circuits

  • Ayatallah Elakhras
  • Riya Sawhney
  • Andrea Guerrieri
  • Lana Josipovic
  • Paolo Ienne

Dynamically scheduled high-level synthesis can exploit high levels of parallelism in poorly-predictable control-dominated applications. Yet, dataflow circuits are often generated by literal conversion of basic blocks into circuits interconnected in such a way as to mimic the program’s sequential execution. Although correct and quite effective in many cases, this adherence to control flow still significantly limits exploitable parallelism. Recent research introduced techniques to deliver data tokens directly from producers to consumers and achieved tangible benefits both in circuit complexity and execution time. Unfortunately, while this successfully addressed ordinary data dependencies, the problem of potential dependencies through memory remains open: When no technique can statically disambiguate accesses, circuits must be built with load-store queues (LSQs) which, to reorder accesses safely, need memory accesses to be allocated in the queues in program order. Such in-order allocation still demands control circuitry emulating sequential execution, with its negative impact on parallelization. In this paper, we transform potential memory dependencies into virtual data dependencies and use the new direct token delivery strategy to allocate accesses sequentially into the LSQ. In other words, we exploit more parallelism by constructing control circuitry to emulate exclusively those parts of the control flow strictly necessary for in-order allocation. Our results show that we can achieve up to a 74% reduction in execution time compared to prior work, in some cases, at no area cost.

SESSION: Poster Session I

OMT: A Demand-Adaptive, Hardware-Targeted Bonsai Merkle Tree Framework for Embedded Heterogeneous Memory Platform

  • Rakin Muhammad Shadab
  • Yu Zou
  • Sanjay Gandham
  • Mingjie Lin

Novel flash-based, crash-tolerant, non-volatile memory (NVM) such as Intel’s Optane DC memory brings about new and exciting use-case scenarios for both traditional and embedded computing systems involving Field-Programmable Gate Arrays (FPGAs). However, NVM cannot be a proper replacement for existing DDR memory modules due to its low write endurance and is better suited to a hybrid NVM + volatile memory system. NVM is also well known to be vulnerable to different memory-based adversaries, which demands the use of a robust authentication method such as a Bonsai Merkle Tree (BMT). However, the typical update process of a BMT (eager update) requires updating the entire update chain frequently, affecting run-time performance even for data that is not persistence-critical. The latest intermittent BMT update techniques can provide better real-time throughput, but they lack crash-consistency.

A heterogeneous memory-based system would, therefore, greatly benefit from an authentication mechanism that can change its update method on the fly. Hence we propose a modular, unified, and adaptable hardware-based BMT framework called Opportunistic Merkle Tree (OMT). OMT combines two BMTs with different update methods and streamlines the BMT read with a common datapath to provide support for both recovery-critical and general data, eliminating the need for individual authentication subsystems for heterogeneous memory platforms. It also allows a switch between the update methods based on the request type (persistent/intermittent) while considerably reducing the resource overhead compared to standalone BMT implementations. We test OMT on a heterogeneous embedded secure memory system, and the setup provides 44% lower memory overhead and up to 22% faster execution in synthetic benchmarks compared to a baseline.

Cyclone-NTT: An NTT/FFT Architecture Using Quasi-Streaming of Large Datasets on DDR- and HBM-based FPGA Platforms

  • Kaveh Aasaraai
  • Emanuele Cesena
  • Rahul Maganti
  • Nicolas Stalder
  • Javier Varela
  • Kevin Bowers

Number-Theoretic-Transform (NTT) is a variation of the Fast-Fourier-Transform (FFT) over finite fields. NTT is being increasingly used in blockchain and zero-knowledge proof applications. Although FFT and NTT are widely studied for FPGA implementation, we believe CycloneNTT is the first to solve this problem for large data sets (2^24 64-bit numbers) that do not fit in on-chip RAM. CycloneNTT uses a state-of-the-art butterfly network and maps the dataflow to hybrid FIFOs composed of on-chip SRAM and external memory. This manifests as a quasi-streaming data access pattern that minimizes external memory access latency and maximizes throughput. We implement two variants of CycloneNTT optimized for DDR and HBM external memories. Although historically this problem has been shown to be memory-bound, CycloneNTT’s quasi-streaming access pattern is optimized to the point that, when using HBM (Xilinx C1100), the architecture becomes compute-bound. On the DDR-based platform (AWS F1), the latency of the application equals streaming the entire dataset log(N) times to/from external memory. Moreover, exploiting HBM’s larger number of channels, and following a series of additional optimizations, CycloneNTT only requires log(N)/6 passes.
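
For context, the sketch below is a textbook radix-2 iterative NTT over an NTT-friendly prime, written in Python. It shows the butterfly dataflow that an accelerator like CycloneNTT must stream through its hybrid on-chip/external-memory FIFOs; the prime, sizes, and code organization here are illustrative only and are unrelated to the paper's hardware implementation.

    MOD = 998244353        # NTT-friendly prime: 119 * 2^23 + 1
    PRIMITIVE_ROOT = 3

    def ntt(a, invert=False):
        """In-place-style radix-2 Cooley-Tukey NTT; len(a) must be a power of two <= 2^23."""
        n = len(a)
        a = list(a)
        # bit-reversal permutation
        j = 0
        for i in range(1, n):
            bit = n >> 1
            while j & bit:
                j ^= bit
                bit >>= 1
            j |= bit
            if i < j:
                a[i], a[j] = a[j], a[i]
        length = 2
        while length <= n:
            w_len = pow(PRIMITIVE_ROOT, (MOD - 1) // length, MOD)
            if invert:
                w_len = pow(w_len, MOD - 2, MOD)
            for start in range(0, n, length):
                w = 1
                for k in range(start, start + length // 2):
                    u = a[k]
                    v = a[k + length // 2] * w % MOD
                    a[k] = (u + v) % MOD                  # butterfly upper output
                    a[k + length // 2] = (u - v) % MOD    # butterfly lower output
                    w = w * w_len % MOD
            length <<= 1
        if invert:
            n_inv = pow(n, MOD - 2, MOD)
            a = [x * n_inv % MOD for x in a]
        return a

    data = [5, 7, 1, 2]
    assert ntt(ntt(data), invert=True) == data   # forward/inverse round trip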

AoCStream: All-on-Chip CNN Accelerator With Stream-Based Line-Buffer Architecture

  • Hyeong-Ju Kang

Convolutional neural network (CNN) accelerators are widely used for their efficiency, but they require a large amount of memory, leading to the use of slow and power-consuming external memories. This paper exploits two schemes to reduce the required memory and ultimately to implement a CNN of reasonable performance using only the on-chip memory of a practical device like a low-end FPGA. To reduce the amount of intermediate data, a stream-based line-buffer architecture and a matching dataflow are proposed instead of the conventional frame-based architecture, where the intermediate data memory is proportional to the square of the input image size. The architecture consists of layer-dedicated blocks operating in a pipelined way on the input and output streams. Each convolutional layer block has a line buffer storing just a few rows of input data. The sizes of the line buffers are proportional to the width of the input image, so the architecture requires less intermediate data storage, especially given the trend toward larger input sizes in modern object detection CNNs. In addition, the weight memory is reduced by accelerator-aware pruning. The experimental results show that a whole object detection CNN can be implemented even on a low-end FPGA without external memory. Compared to previous accelerators with similar object detection accuracy, the proposed accelerator reaches higher throughput with fewer FPGA resources (LUTs, registers, and DSPs), showing higher efficiency.

Fault Detection on Multi COTS FPGA Systems for Physics Experiments on the International Space Station

  • Tim Oberschulte
  • Jakob Marten
  • Holger Blume

Field-programmable gate arrays (FPGAs) in space applications come with the drawback of radiation effects, which will inevitably occur in devices of small process size. This also applies to the electronics of the Bose Einstein Condensate and Cold Atom Laboratory (BECCAL) apparatus, which is planned to operate on the International Space Station for several years. In total, more than 100 FPGAs distributed in the setup will be used for high-precision control of specialized sensors and actuators at nanosecond scale. Due to the large number of devices in BECCAL, commercial off-the-shelf (COTS) FPGAs are used, which are not radiation hardened. In this work, we detect and mitigate radiation effects in an application-specific COTS-FPGA-based communication network. For that, redundancy is integrated into the design while the firmware is optimized to stay within the FPGA’s resource constraints. A redundant integrity checker module is developed which can notify preceding network devices about data and configuration bit errors. The firmware is evaluated by injecting faults into data and configuration registers in simulation and on real hardware. The FPGA resource usage of the firmware is cut down by more than half, enabling the use of double modular redundancy for the switching fabric. Together with the triple-modular-redundancy-protected integrity checker, this combination fully prevents silent data corruptions in the design, as shown in simulations and by injecting faults in hardware using the Intel Fault Injection FPGA IP Core, while staying within the resource limits of a COTS FPGA.

Nimblock: Scheduling for Fine-grained FPGA Sharing through Virtualization

  • Meghna Mandava
  • Deming Chen

As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time- and space-multiplex the virtualized FPGA by introducing Nimblock. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores both preemption and pipelining as a scheduling parameter to dynamically change resource allocations, and automatically allocates resources to enable suitable parallelism for an application without additional user input. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA. We evaluate our algorithm and validate its efficacy by measuring results from real workloads running on the board with different real-time constraints and priority levels. In our exploration, we compare our novel Nimblock algorithm against a no-sharing and no-virtualization baseline algorithm and three other algorithms which support sharing and virtualization. We achieve up to 5x lower average response time when compared to the baseline algorithm and up to 2.1x average response time improvement over competitive scheduling algorithms that support sharing within our virtualization environment. We additionally demonstrate up to 49% fewer deadline violations and up to 2.6x lower tail response times when compared to other high-performance algorithms.

Graph-OPU: An FPGA-Based Overlay Processor for Graph Neural Networks

  • Ruiqi Chen
  • Haoyang Zhang
  • Yuhanxiao Ma
  • Enhao Tang
  • Shun Li
  • Yanxiang Zhu
  • Jun Yu
  • Kun Wang

Graph Neural Networks (GNNs) have outstanding performance on graph-structured data and have been extensively accelerated by field-programmable gate arrays (FPGAs) in various ways. However, existing accelerators significantly lack flexibility, especially in the following two aspects: 1) many FPGA-based accelerators support only one GNN model; 2) the processes of re-synthesizing and re-generating bitstreams are very time-consuming for new GNN models. To this end, we propose a highly integrated FPGA-based overlay processor for general GNN acceleration named Graph-OPU. To handle the irregularity of data structures and operations, we customize the instruction set to support irregular operation patterns in the inference process of GNN models. Then, we customize the datapath and optimize the data format in the microarchitecture to take full advantage of high bandwidth memory (HBM). Moreover, we design the computation module to ensure a unified and fully pipelined process for sparse matrix multiplication (SpMM) and general matrix multiplication (GEMM). Users can avoid FPGA reconfiguration or RTL regeneration for newly invented GNN models. We implement the hardware prototype on a Xilinx Alveo U50 and test mainstream GNN models with 9 datasets. Graph-OPU achieves an average of 435× and 18× speedup, and 2013× and 109× better energy efficiency, compared with the Intel I7-12700KF processor and the NVIDIA RTX3090 GPU, respectively. To the best of our knowledge, Graph-OPU is the first in-depth study of an FPGA-based general processor for GNN acceleration with high speedup and energy efficiency.

HMLib: Efficient Data Transfer for HLS Using Host Memory

  • Michael Lo
  • Young-kyu Choi
  • Weikang Qiao
  • Mau-Chung Frank Chang
  • Jason Cong

Streaming applications make up an important portion of the workloads that FPGAs may accelerate but suffer from inefficient data movement. The inefficiency stems from copying data indirectly into the FPGA DRAM rather than directly into its on-chip memory, substantially diminishing the end-to-end speedup, especially for small workloads (hundreds of kilobytes). AMD Xilinx’s Host Memory IP (HMI) aims to address the data movement problem by exposing to the developer a High-Level Synthesis (HLS) interface that moves data from the host directly to the FPGA’s on-chip memory. However, using HMI purely for its interface without additional code changes incurs a 3.3x slowdown compared with the current programming model. The slowdown mainly originates from OpenCL call overhead and the kernel control logic unnecessarily switching states. To overcome these issues, we propose the Host Memory Library (HMLib), an efficient HLS-based library that facilitates data transfer on behalf of the user. HMLib not only optimizes the runtime stack for efficient data transfer but also provides HLS-compatible and user-friendly interfaces. We demonstrate HMLib’s effectiveness for streaming applications (Deflate compression and CRC32) with improvements of up to 36.2X over OpenCL-DDR and up to 79.5X over raw HMI for small-scale data, while maintaining little-to-no performance loss for large-scale inputs. We plan to open source our work in the future.

An Efficient High-Speed FFT Implementation

  • Ross Martin

This poster introduces the “BxBFFT” parallel-pipelined Fast Fourier Transform (FFT), which achieves higher clock speeds (Fmax) than competitors with substantial savings in power and logic resources. In comparisons with the Xilinx SSR FFT, Spiral FFT, Astron FFT, and ZipCPU FFT, the BxBFFT had clock speeds above 650MHz in cases where all others were below 300MHz, and its LUT usage and power were lower by a factor of ~1.5. The BxBFFT also had faster Vivado implementation and faster RTL simulation, for improved productivity in design, testing, and verification; BxBFFT simulations were over 10 times faster than the Xilinx SSR FFT. The BxBFFT supports more features than other FFTs, including real-to-complex FFTs, non-power-of-2 FFTs, and features for high reliability in adverse environments. The BxBFFT’s improved performance has been verified in real applications: one customer design had to operate with a reduced workload due to excessive current draw of the Xilinx SSR FFT, and a quick replacement of the Xilinx SSR FFT with the BxBFFT lowered die temperature by 34.8 degrees Celsius and allowed the design to operate under full load. The source of the BxBFFT’s performance is intensive optimization of well-known FFT algorithms, not new algorithms. The BxBFFT’s coding style gives better control over synthesis to avoid and resolve performance bottlenecks, and automated generation of top-level code supports 13 choices of radix and 2 choices of data flow at each stage, to make optimal choices for each BxBFFT size. This results in a highly efficient FFT.

Weave: Abstraction for Accelerator Integration of Generated Modules

  • Tuo Dai
  • Bizhao Shi
  • Guojie Luo

As domain-specific accelerators demand multiple functional components for complex applications in a domain, the conventional wisdom for effective development involves module decomposition, module implementation, and module integration. In the recent decade, the generator-based design methodology improves the productivity of module implementation. However, with the guidance of current abstractions, it is difficult to integrate modules implemented by generators because of implicit interface definition, non-unified performance modeling, and fragmented memory management. These disadvantages cause low productivity of the integration flow and low performance of the integrated accelerators.

To address these drawbacks, we propose Weave, an abstraction for the integration of generated modules that facilitates an agile design flow for domain-specific accelerators. The Weave abstraction enables the formulation and automation of optimizing a unified performance model under the resource constraints of all modules. We also design a hierarchical memory management method with corresponding interfaces to integrate modules under the guidance of the modular abstraction. In the experiments, an accelerator developed with Weave achieves 2.17× higher performance in the deep learning domain compared with an open-source accelerator, and the integrated accelerator attains 88.9% of the peak performance of the generated accelerators.

A Novel FPGA Simulator Accelerating Reinforcement Learning-Based Design of Power Converters

  • Zhenyu Xu
  • Miaoxiang Yu
  • Qing Yang
  • Yeonho Jeong
  • Tao Wei

High-efficiency energy conversion systems have become increasingly important due to their wide use in all electronic systems such as data centers, smart mobile devices, E-vehicles, medical instruments, and so forth. Complex and interdependent parameters make optimal designs of power converters challenging to get. Recent research has shown that reinforcement learning (RL) shows great promise in the design of such converter circuits. A trained RL agent can search for optimal design parameters for power conversion circuit topologies under targeted application requirements. Training an RL agent requires numerous circuit simulations. As a result, they may take days to complete, primarily because of the slow time-domain circuit simulation.

This abstract proposes a new FPGA architecture that accelerates the circuit simulation and hence substantially speeds up the RL-based design method for power converters. Our new architecture supports all power electronic circuit converters and their variations. It substantially improves the training speed of RL-based design methods. High-level synthesis (HLS) was used to build the accelerator on Amazon Web Service (AWS) F1 instance. An AWS virtual PC hosts the training algorithm. The host interacts with the FPGA accelerator by updating the circuit parameters, initiating simulation, and collecting the simulation results during training iterations. A script was created on the host side to facilitate this design method to convert a netlist containing circuit topology and parameters into core matrices in the FPGA accelerator. Experimental results showed 60x overall speedup of our RL-based design method in comparison with using a popular commercial simulator, PowerSim.

A Fractal Astronomical Correlator Based on FPGA Cluster with Scalability

  • Lin Shu
  • Long Xiao
  • Yafang Song
  • Qiuxiang Fan
  • Guitian Fang
  • Jie Hao

Correlation is a highly computationally intensive and data-intensive signal processing application that is used heavily in radio astronomy for imaging and other measurements. For example, the next-generation radio telescope, Square Kilometer Array Low (SKA-L), needs a correlator that calculates up to 22 million cross products, a real-time system with continuous input data rates of 6 terabits per second and equivalent computation of 2 peta-operations per second. A flexible and scalable solution with high performance per watt is therefore urgently needed. In this work, a flexible FX correlation architecture based on an FPGA cluster is proposed, which can be made fractal at the subsystem, engine, and calculation-module levels, simplifying the data distribution network and increasing the system’s scalability. The interconnect network between processing engines is a new two-stage solution, using self-developed data redistribution hardware to decouple full-bandwidth correlation into several independent sub-band computations. The most intensive calculations, the cross-multiplications among all antennas, are modularly designed in MATLAB Simulink and AMD Xilinx System Generator and are parametrized to scale to arbitrary antenna numbers with optional parallel granularity, minimizing development effort on different FPGAs or for different applications. Moreover, a fully FPGA-based FX correlator for a large array with 202 antennas, consisting of 26 F Engines based on AMD Xilinx Kintex-7 325T FPGAs and 13 X Engines based on AMD Xilinx Kintex UltraScale KU115 FPGAs, was deployed in 2022; to the best of our knowledge, it is the largest fully FPGA-based astronomical correlator.

Power Side-channel Countermeasures for ARX Ciphers using High-level Synthesis

  • Saya Inagaki
  • Mingyu Yang
  • Yang Li
  • Kazuo Sakiyama
  • Yuko Hara-Azumi

In the era of the Internet of Things (IoT), edge devices are considerably diversified and are often designed using high-level synthesis (HLS) to improve design productivity. A problem here is that HLS tools were originally developed in a security-unaware fashion, inducing vulnerabilities to power side-channel attacks (PSCA), a serious threat in IoT. Although PSCA vulnerabilities induced by HLS tools have recently started to be discussed, the effects and applicability of existing methods for PSCA-resistant design using HLS are so far limited. In this paper, we propose a novel HLS-based design method for PSCA-resistant ciphers in hardware. Focusing particularly on lightweight block ciphers composed of Addition-Rotation-XOR (ARX)-based permutations, we study the effects of applying “threshold implementation”, one of the provably secure countermeasures against PSCA, to behavioral descriptions of the ciphers. In addition, we tune the scheduling optimization of HLS tools that might cause power side-channel leakage. In our experiments, using ARX-based ciphers (Chaskey, Simon, and Speck) as benchmarks, we implemented the unprotected and protected circuits on an FPGA and evaluated the PSCA vulnerability using Welch’s t-test. The results demonstrate that our proposed method successfully mitigates vulnerabilities to PSCA for all benchmarks. From these results, we provide further discussion on the direction of PSCA countermeasures based on HLS.

Single-Batch CNN Training using Block Minifloats on FPGAs

  • Chuliang Guo
  • Binglei Lou
  • Xueyuan Liu
  • David Boland
  • Philip H.W. Leong

Training convolutional neural networks remains a challenge on resource-limited edge devices due to intensive computation, large storage requirements, and high bandwidth. Error back-propagation, gradient generation, and weight update usually require high precision to guarantee model accuracy, which places a further burden on computation and bandwidth. This paper presents the first parallel FPGA CNN training accelerator with block minifloat datatypes. We first propose a heuristic bit-width allocation technique to derive a unified 8-bit block minifloat format with a sign bit, 2 exponent bits, and 5 mantissa bits. In contrast to previous techniques, the same data format is used for weights, activations, errors, and gradients. Using this format, accuracy similar to 32-bit single-precision floating point is achieved, which simplifies the FPGA-based design of computational units such as multiply-and-add. In addition, we propose a unified Conv block to handle Conv and transposed Conv in the forward and backward paths, respectively, and a dilated Conv block with a weight-kernel partition scheme for gradient generation. Both Conv blocks support non-unit stride, which is crucial for the residual connections that appear in modern CNNs. For training ResNet20 on the CIFAR-10 dataset with a batch size of 1, our accelerator on a Xilinx UltraScale+ ZCU102 FPGA achieves state-of-the-art single-batch throughput of 144.64 and 192.68 GOPs with and without batch normalisation layers, respectively.
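
To illustrate the 8-bit block minifloat idea (a sign bit, 2 exponent bits, and 5 mantissa bits interpreted relative to a shared per-block exponent bias), here is a small software sketch. The bias selection, subnormal handling, and rounding below are simplifications chosen for clarity and do not reproduce the paper's exact hardware format or its heuristic bit-width allocation.

    import math

    EXP_BITS, MAN_BITS = 2, 5
    E_MAX = (1 << EXP_BITS) - 1      # exponent codes 1..3 are normal, 0 is subnormal/zero
    M_MAX = (1 << MAN_BITS) - 1

    def encode_block(xs):
        """Quantize a block of floats to a shared-exponent 1-2-5 minifloat layout (illustrative)."""
        max_mag = max(abs(x) for x in xs)
        # shared block bias: place the largest magnitude at the top exponent code
        bias = (math.floor(math.log2(max_mag)) - (E_MAX - 1)) if max_mag > 0 else 0
        codes = []
        for x in xs:
            sign, mag = (1 if x < 0 else 0), abs(x)
            if mag == 0:
                codes.append((sign, 0, 0))
                continue
            t = math.floor(math.log2(mag))
            if t < bias:                       # too small for a normal code: subnormal
                e = 0
                m = min(round(mag / 2.0 ** bias * (M_MAX + 1)), M_MAX)
            else:
                e = min(t - bias + 1, E_MAX)   # saturate exponents above the block range
                m = min(round((mag / 2.0 ** (e - 1 + bias) - 1.0) * (M_MAX + 1)), M_MAX)
            codes.append((sign, e, m))
        return bias, codes

    def decode_block(bias, codes):
        out = []
        for sign, e, m in codes:
            if e == 0:                         # subnormal (covers exact zero)
                mag = m / (M_MAX + 1) * 2.0 ** bias
            else:
                mag = (1.0 + m / (M_MAX + 1)) * 2.0 ** (e - 1 + bias)
            out.append(-mag if sign else mag)
        return out

    bias, codes = encode_block([0.30, -1.75, 0.0, 0.02])
    print(decode_block(bias, codes))   # approximate reconstruction of the inputs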

SESSION: Session: Applications and Design Studies I

A Study of Early Aggregation in Database Query Processing on FPGAs

  • Mehdi Moghaddamfar
  • Norman May
  • Christian Färber
  • Wolfgang Lehner
  • Akash Kumar

In database query processing, aggregation is an operator by which data with a common property is grouped and expressed in a summary form. Early aggregation is a popular method for improving the performance of the aggregation operator. In this paper, we study early aggregation algorithms in the context of accelerating query processing for database systems on FPGAs. Our comparative study leads us to set-associative caches with a low inter-reference recency set (LIRS) replacement policy, which show both great performance and modest implementation complexity compared to some of the most prominent early aggregation algorithms. We also present a novel application-specific architecture for implementing set-associative caches. Benchmarks of our implementation show speedups of up to 3x for end-to-end aggregation compared to a state-of-the-art FPGA-based query engine.
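
As a software analogue of the early-aggregation cache studied here, the sketch below maintains per-set partial aggregates and, on eviction, emits the victim's partial sum downstream to the final aggregation stage. For brevity it uses LRU replacement within each set instead of the LIRS policy the paper adopts, and the class and method names are illustrative, not the paper's architecture.

    from collections import OrderedDict

    class AggregationCache:
        """Set-associative early-aggregation cache (software sketch, LRU per set)."""

        def __init__(self, num_sets=1024, ways=4):
            self.num_sets, self.ways = num_sets, ways
            self.sets = [OrderedDict() for _ in range(num_sets)]
            self.evicted = []                       # partial aggregates sent downstream

        def add(self, key, value):
            s = self.sets[hash(key) % self.num_sets]
            if key in s:                            # hit: aggregate in place
                s[key] += value
                s.move_to_end(key)
                return
            if len(s) >= self.ways:                 # miss on a full set: evict the LRU entry
                self.evicted.append(s.popitem(last=False))
            s[key] = value

        def flush(self):
            for s in self.sets:
                while s:
                    self.evicted.append(s.popitem(last=False))
            return self.evicted

    cache = AggregationCache(num_sets=4, ways=2)
    for key, val in [(10, 1), (20, 2), (10, 3), (30, 4)]:
        cache.add(key, val)
    print(cache.flush())   # partial sums per key: [(20, 2), (10, 4), (30, 4)]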

FNNG: A High-Performance FPGA-based Accelerator for K-Nearest Neighbor Graph Construction

  • Chaoqiang Liu
  • Haifeng Liu
  • Long Zheng
  • Yu Huang
  • Xiangyu Ye
  • Xiaofei Liao
  • Hai Jin

The k-nearest neighbor graph has emerged as a key data structure for many critical applications. However, it can be notoriously challenging to construct k-nearest neighbor graphs over large graph datasets, especially with high-dimensional vector features. Many solutions have recently been proposed to support the construction of k-nearest neighbor graphs, but they involve substantial memory access and computational overheads, and an architecture-level solution is still absent. To address these issues, we architect FNNG, the first FPGA-based accelerator to support k-nearest neighbor graph construction. Specifically, FNNG is equipped with a block-based scheduling technique to exploit the inherent data locality between vertices: it divides vertices that are close in space into blocks and processes the vertices at the granularity of blocks during construction. FNNG also adopts a useless-computation-aborting technique to identify superfluous computations, keeping the existing maximum similarity values of all vertices inside the computing unit. In addition, we propose an improved architecture in order to fully utilize both techniques. We implement FNNG on the Xilinx Alveo U280 FPGA card. The results show that FNNG achieves 190x and 2.1x speedups over the state-of-the-art CPU and GPU solutions, running on an Intel Xeon Gold 5117 CPU and an NVIDIA GeForce RTX 3090 GPU, respectively.

ACTS: A Near-Memory FPGA Graph Processing Framework

  • Wole Jaiyeoba
  • Nima Elyasi
  • Changho Choi
  • Kevin Skadron

Despite the high off-chip bandwidth and on-chip parallelism offered by today’s near-memory accelerators, software-based (CPU and GPU) graph processing frameworks still suffer performance degradation from under-utilization of available memory bandwidth, because graph traversal often exhibits poor locality. Emerging FPGA-based graph accelerators tackle this challenge by designing specialized graph processing pipelines and application-specific memory subsystems to maximize bandwidth utilization and efficiently use high-speed on-chip memory. To use the limited on-chip (BRAM) memory effectively while handling larger graph sizes, several FPGA-based solutions resort to some form of graph slicing or partitioning during preprocessing to stage vertex property data into the BRAM. While this has demonstrated performance superiority for small graphs, the approach breaks down with larger graph sizes. For example, GraphLily [19], a recent high-performance FPGA-based graph accelerator, experiences up to 11X performance degradation between graphs having 3M vertices and 28M vertices. This makes prior FPGA approaches impractical for large graphs.

We propose ACTS, an HBM-enabled FPGA graph accelerator, to address this problem. Rather than partitioning the graph offline to improve spatial locality, we partition vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This optimizes read bandwidth even as the graph size scales. We compare ACTS against Gunrock, a state-of-the-art graph processing accelerator for the GPU, and GraphLily, a recent FPGA-based graph accelerator also utilizing HBM memory. Our results show a geometric mean speedup of 1.5X (maximum 4.6X) over Gunrock and a geometric mean speedup of 3.6X (maximum 16.5X) over GraphLily. Our results also show a geometric mean power reduction of 50% and a mean reduction in energy-delay product of 88% over Gunrock.

Exploring the Versal AI Engines for Accelerating Stencil-based Atmospheric Advection Simulation

  • Nick Brown

AMD Xilinx’s new Versal Adaptive Compute Acceleration Platform (ACAP) is an FPGA architecture combining reconfigurable fabric with other on-chip hardened compute resources. AI engines are one such resource; by operating in a highly vectorized manner, they provide significant raw compute that is potentially beneficial for a range of workloads, including HPC simulation. However, this technology is still in its early stages and as yet unproven for accelerating HPC codes, with a lack of benchmarking and best practice.

This paper presents an experience report exploring the porting of the Piacsek and Williams (PW) advection scheme onto the Versal ACAP, using the chip’s AI engines to accelerate the compute. A stencil-based algorithm, advection is commonplace in atmospheric modelling, including in several Met Office codes, where this scheme was initially developed. Using this algorithm as a vehicle, we explore optimal approaches for structuring AI engine compute kernels and how best to interface the AI engines with programmable logic. Evaluating performance using a VCK5000 against non-AI-engine FPGA configurations on the VCK5000 and Alveo U280, as well as a 24-core Xeon Platinum Cascade Lake CPU and an Nvidia V100 GPU, we found that, whilst the number of channels between the fabric and the AI engines is a limitation, by leveraging the ACAP we can double performance compared to an Alveo U280.

SESSION: Session: Architecture, CAD, and Circuit Design

Regularity Matters: Designing Practical FPGA Switch-Blocks

  • Stefan Nikolic
  • Paolo Ienne

Several techniques have been proposed for automatically searching for FPGA switch-blocks which typically show some tangible advantage over the well-known academic architectures. However, the resulting switch-blocks usually exhibit high levels of irregularity, in contrast with what can be observed in a typical commercial architecture. One wonders whether the architectures produced by such search methods can actually be laid out in an efficient manner while retaining the perceived gains. In this work, we propose a new switch-block exploration method that enhances a recently published search algorithm by combining it with ILP, in order to guarantee that the obtained solution satisfies arbitrary regularity constraints. We measure the impact of regularity constraints commonly seen in industrial architectures (such as limiting the number of different multiplexer sizes or forced sharing of inputs between pairs of multiplexers) and observe benefits to the routability of complex circuits with only a limited reduction in performance.

Turn on, Tune in, Listen up: Maximizing Side-Channel Recovery in Time-to-Digital Converters

  • Colin Drewes
  • Olivia Weng
  • Keegan Ryan
  • Bill Hunter
  • Christopher McCarty
  • Ryan Kastner
  • Dustin Richmond

Voltage fluctuation sensors measure minute changes in an FPGA power distribution network, allowing attackers to extract information from concurrently executing computations. Previous voltage fluctuation sensors make assumptions about the co-tenant computation and require the attacker to have a priori access or system knowledge to tune the sensor parameters statically. We present the open-source design of the Tunable Dual-Polarity Time-to-Digital Converter, which introduces three dynamically tunable parameters that optimize signal measurement, including the transition polarity, sample window, frequency, and phase. We show that a properly tuned sensor improves co-tenant classification accuracy by 2.5× over prior work and increases the ability to identify the co-tenant computation and its microarchitectural implementation. Across 13 varying applications, our techniques yield an 80% classification accuracy that generalizes beyond a single board. Finally, our sensor improves the ability of a correlation power analysis attack to rank correct subkey values by 2×.

Post-Radiation Fault Analysis of a High Reliability FPGA Linux SoC

  • Andrew Elbert Wilson
  • Nathan Baker
  • Ethan Campbell
  • Jackson Sahleen
  • Michael Wirthlin

FPGAs are increasingly being used in space and other harsh radiation environments. However, SRAM-based FPGAs are susceptible to radiation in these environments and experience upsets within the configuration memory (CRAM), causing design failure. The effects of CRAM upsets can be mitigated using triple-modular redundancy and configuration scrubbing. This work investigates the reliability of a soft RISC-V SoC system executing the Linux operating system mitigated by TMR and configuration scrubbing. In particular, this paper analyzes the failures of this triplicated system observed at a high-energy neutron radiation experiment. Using a bitstream fault analysis tool, the failures of this system caused by CRAM upsets are traced back to the affected FPGA resource and design logic. This fault analysis identifies the interconnect and I/O as the most vulnerable FPGA resources and the DDR controller logic as the design logic most likely to cause a failure. By identifying the FPGA resources and design logic causing failures in this TMR system, additional design enhancements are proposed to create a more reliable design for harsh radiation environments.

FPGA Technology Mapping with Adaptive Gate Decomposition

  • Longfei Fan
  • Chang Wu

Most existing technology mapping algorithms use graph covering approaches and suffer from the netlist structural bias problem. Chen and Cong proposed a simultaneous simple gate decomposition and technology mapping algorithm that encodes many gate decomposition choices into the netlist. However, their algorithm suffers from long runtimes due to the large set of choices. Later on, A. Mishchenko et al. proposed a mapping algorithm based on choice generation with so-called lossless synthesis. Nevertheless, their algorithm cannot guarantee to find and keep all good choices a priori before mapping. In this paper, we propose a simultaneous mapping with gate decomposition algorithm named AGDMap. Our algorithm uses a cut-enumeration-based engine. A bin packing algorithm is used for simple gate decomposition during cut enumeration, and input-sharing-based cut cost computation is used during iterative cut selection to reduce logic duplication. On a set of EPFL benchmark suite designs and HLS-generated designs, our algorithm produces results with significant area improvements. Compared with the state-of-the-art ABC lossless synthesis algorithm, for area optimization our average improvement is 12.4%; for delay optimization, we obtain results with similar delay and 9.2% area reduction.
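
The abstract does not spell out how simple gates are decomposed during cut enumeration; as a hedged illustration of the bin-packing idea it mentions, the toy sketch below (our own formulation, not AGDMap's) packs the fan-ins of a wide simple gate into K-input bins using first-fit decreasing, so that each bin can become a LUT-sized sub-gate.

```python
def decompose_wide_gate(fanin_arrival, k):
    """First-fit-decreasing packing of a wide simple gate's fan-ins into
    K-input bins; each bin becomes one level of a decomposition tree.
    fanin_arrival: list of (signal, arrival_time) pairs."""
    items = sorted(fanin_arrival, key=lambda s: s[1], reverse=True)
    bins = []
    for sig, t in items:
        for b in bins:
            if len(b) < k:
                b.append((sig, t))
                break
        else:
            bins.append([(sig, t)])
    return bins   # each bin feeds one K-input sub-gate of the same function

# Example: a 10-input AND with K=4 packs into 3 sub-ANDs (4 + 4 + 2 inputs),
# whose outputs are packed again until a single output remains.
```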

FPGA Mux Usage and Routability Estimates without Explicit Routing

  • Jonathan W. Greene

A new algorithm is proposed to rapidly evaluate an FPGA routing architecture without the need to explicitly route benchmark applications. The algorithm takes as input a probability distribution of nets to be accommodated and a description of an architecture. It produces an estimate of the usage of each type of mux in the FPGA (including intra-cluster muxes), valuable feedback to the architect. The estimates are shown to correlate with actual routed applications in both academic and commercial architectures. This is due in part to the algorithm’s novel ability to account for long and multi-fanout nets. Run time is reduced by applying periodic graphs to model the FPGA’s regular structure.

We then show how Percolation Theory (a branch of statistical physics) can be applied to elucidate the relationship between mux usage and routability. We show that any blockages when routing a net are most likely to occur in the neighborhood of its terminals, and demonstrate a quantitative relationship among the post-placement wirelength of an application, the percolation threshold of an architecture, and the channel width required to map the application to the architecture. Supporting experimental data is provided.

SESSION: Banquet and Panel

Open-source and FPGAs: Hardware, Software, Both or None?

  • Dana How
  • Tim Ansell
  • Vaughn Betz
  • Chris Lavin
  • Ted Speers
  • Pierre-Emmanuel Gaillardon

Following in the footsteps of the open-source software movement that is at the foundation of many fundamental infrastructures today (e.g., Linux and the internet), a growing number of open-source hardware initiatives have been impacting our field, e.g., the RISC-V ISA and open chiplet standards.

SESSION: Keynote II

FPGAs and Their Evolving Role in Domain Specific Architectures: A Case Study of the AMD 400G Adaptive SmartNIC/DPU SoC

  • Jaideep Dastidar

Domain Specific Architectures (DSA) typically apply heterogeneous compute elements such as FPGAs, GPUs, AI Engines, TPUs, etc. towards solving domain-specific problems, and have their accompanying Domain Specific Software. FPGAs have played a prominent role in DSAs for AI, Video Transcoding, Network Acceleration etc. This talk will start with a brief historical survey of FPGAs in DSAs and an emerging trend in Domain Specific Accelerators, where the programmable logic element is paired with other heterogeneous compute or acceleration elements. The talk will then present a case study of AMD’s 400G Adaptive SmartNIC/DPU SoC and the considerations that went into that DSA. The case study includes where, why, and how the programmable logic element was paired with other hardened offload accelerators and embedded processors with the goal of striking the right balance between Software Processing on the embedded cores, Fastpath ASIC-like processing on the Hardened Accelerators, and Adaptive and Composable processing on the integrated FPGA. The talk will describe the data movement between various network, storage and interface acceleration elements and their shared and private memory resources. Throughout the talk, we will focus on the tradeoffs between the FPGA element and the rest of the heterogeneous compute or acceleration elements as they apply to SmartNIC/DPU offload acceleration.

SESSION: Session: Deep Learning

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

  • Jinming Zhuang
  • Jason Lau
  • Hanchen Ye
  • Zhuoping Yang
  • Yubo Du
  • Jack Lo
  • Kristof Denolf
  • Stephen Neuendorffer
  • Alex Jones
  • Jingtong Hu
  • Deming Chen
  • Jason Cong
  • Peipei Zhou

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes?

We identify the biggest system throughput bottleneck resulting from the mismatch of massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently towards different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, MLP, respectively, which obtain 5.40x, 32.51x, 1.00x and 1.00x throughput gains compared to one monolithic accelerator.
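
CHARM's analytical models and design space exploration are not reproduced here; the toy sketch below only illustrates the general idea of assigning MM layers of very different sizes to a few differently shaped accelerators instead of one monolithic array, using a crude roofline-style latency model. The model, the layer shapes, and the accelerator configurations are all illustrative assumptions of ours.

```python
def mm_latency(M, K, N, peak_macs_per_cycle, bw_words_per_cycle):
    """Crude roofline-style model: a layer is either compute-bound or
    bandwidth-bound on a given accelerator instance."""
    compute = M * K * N / peak_macs_per_cycle
    traffic = M * K + K * N + M * N            # words moved, ignoring reuse
    return max(compute, traffic / bw_words_per_cycle)

def assign_layers(layers, accels):
    """Greedy: run each MM layer on the accelerator that models fastest;
    accelerators work concurrently, so makespan ~ max per-accelerator load."""
    load = [0.0] * len(accels)
    for (M, K, N) in layers:
        lat = [mm_latency(M, K, N, *a) for a in accels]
        best = min(range(len(accels)), key=lambda i: lat[i])
        load[best] += lat[best]
    return max(load)

bert_like = [(512, 768, 768), (512, 768, 3072), (64, 64, 512)]
mono = [(4096, 16)]                    # one monolithic array (macs, bandwidth)
duo = [(3072, 12), (1024, 4)]          # one large + one small accelerator
print(assign_layers(bert_like, mono), assign_layers(bert_like, duo))
```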

Approximate Hybrid Binary-Unary Computing with Applications in BERT Language Model and Image Processing

  • Alireza Khataei
  • Gaurav Singh
  • Kia Bazargan

We propose a novel method for approximate hardware implementation of univariate math functions with significantly fewer hardware resources compared to previous approaches. Examples of such functions include exp(x) and the activation function GELU(x), both used in transformer networks, gamma(x), which is used in image processing, and other functions such as tanh(x), cosh(x), sq(x), and sqrt(x). The method builds on previous works on hybrid binary-unary computing. The novelty in our approach is that we break a function into a number of sub-functions such that implementing each sub-function becomes cheap, and converting the output of the sub-functions to binary becomes almost trivial. Our method also uses self-similarity in functions to further reduce the cost. We compare our method to the conventional binary, previous stochastic computing, and hybrid binary-unary methods on several functions at 8-, 12-, and 16-bit resolutions. While preserving high accuracy, our method outperforms previous works in terms of hardware cost; e.g., at a mean absolute error below 0.01, our method reduces the (area x latency) cost on average by 5, 7, and 2 orders of magnitude compared to the conventional binary, stochastic computing, and hybrid binary-unary methods, respectively. Ultimately, we demonstrate the potential benefits of our method for natural language processing and image processing applications. We deploy our method to implement major blocks in an encoding layer of the BERT language model, and also the Roberts Cross edge detection algorithm. Both include non-linear functions.
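
The key idea of breaking a univariate function into sub-functions whose outputs are cheap to encode can be illustrated, very loosely, by a piecewise table in which each segment stores one base value plus small per-point offsets. This is only a software analogue under our own assumptions (segment count, fixed-point scale, no unary encoding), not the paper's hardware scheme.

```python
import math

def build_segmented_table(f, x_min, x_max, n_seg, pts_per_seg, scale=256):
    """Split f into n_seg sub-functions; each segment stores one base value
    plus small per-point deltas, which are much cheaper to store than
    full-width outputs."""
    segs = []
    step = (x_max - x_min) / (n_seg * pts_per_seg)
    for s in range(n_seg):
        xs = [x_min + (s * pts_per_seg + i) * step for i in range(pts_per_seg)]
        ys = [round(f(x) * scale) for x in xs]
        base = ys[0]
        segs.append((base, [y - base for y in ys]))   # small deltas only
    return segs, step

def lookup(segs, step, x_min, pts_per_seg, x, scale=256):
    idx = int((x - x_min) / step)
    s, i = divmod(idx, pts_per_seg)
    base, deltas = segs[s]
    return (base + deltas[i]) / scale

gelu = lambda x: 0.5 * x * (1 + math.erf(x / math.sqrt(2)))
segs, step = build_segmented_table(gelu, -4.0, 4.0, n_seg=8, pts_per_seg=32)
print(lookup(segs, step, -4.0, 32, 1.0), gelu(1.0))   # table value vs. exact
```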

Accelerating Neural-ODE Inference on FPGAs with Two-Stage Structured Pruning and History-based Stepsize Search

  • Lei Cai
  • Jing Wang
  • Lianfeng Yu
  • Bonan Yan
  • Yaoyu Tao
  • Yuchao Yang

Neural ordinary differential equation (Neural-ODE) outperforms conventional deep neural networks (DNNs) in modeling continuous-time or dynamical systems by adopting numerical ODE integration onto a shallow embedded NN. However, Neural-ODE suffers from slow inference due to the costly iterative stepsize search in numerical integration, especially when using higher-order Runge-Kutta (RK) methods and smaller error tolerance for improved integration accuracy. In this work, we first present algorithmic techniques to speedup RK-based Neural-ODE inference: a two-stage coarse-grained/fine-grained structured pruning method based on top-K sparsification that reduces the overall computations by more than 60% in the embedded NN and a history-based stepsize search method based on past integration steps that reduces the latency for reaching accepted stepsize by up to 77% in RK methods. A reconfigurable hardware architecture is co-designed based on proposed speedup techniques, featuring three processing loops to support programmable embedded NN and a variety of higher-order RK methods. Sparse activation processor with multi-dimensional sorters is designed to exploit structured sparsity in activations. Implemented on a Xilinx Virtex-7 XC7VX690T FPGA and experimented on a variety of datasets, the prototype accelerator using a more complex 3rd-order RK method achieves more than 2.6x speedup compared to the latest Neural-ODE FPGA accelerator using the simplest Euler method. Compared to a software execution on Nvidia A100 GPU, the inference speedup can be up to 18x.
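
As a rough sketch of the two-stage pruning idea (coarse-grained structured pruning followed by fine-grained top-K sparsification), the numpy snippet below prunes a weight matrix in two passes. The granularity chosen here (rows, then elements per row) is our own illustrative assumption, not the paper's exact scheme.

```python
import numpy as np

def topk_row_prune(W, k):
    """Coarse-grained stage: keep only the k rows (output neurons) of W
    with the largest L2 norm, zeroing the rest."""
    norms = np.linalg.norm(W, axis=1)
    keep = np.argsort(norms)[-k:]
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[keep] = True
    return W * mask[:, None], mask

def topk_elem_prune(W, k_per_row):
    """Fine-grained stage: within each surviving row, keep only the
    k_per_row largest-magnitude weights."""
    Wp = np.zeros_like(W)
    for r in range(W.shape[0]):
        idx = np.argsort(np.abs(W[r]))[-k_per_row:]
        Wp[r, idx] = W[r, idx]
    return Wp

W = np.random.randn(64, 64)
W1, _ = topk_row_prune(W, 24)
W2 = topk_elem_prune(W1, 16)   # well over 60% of the entries are now zero
```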

SESSION: Session: FPGA-Based Computing Engines

hAP: A Spatial-von Neumann Heterogeneous Automata Processor with Optimized Resource and IO Overhead on FPGA

  • Xuan Wang
  • Lei Gong
  • Jing Cao
  • Wenqi Lou
  • Weiya Wang
  • Chao Wang
  • Xuehai Zhou

Regular expression (REGEX) matching tasks drive much research on automata processors (APs). Among them, the von Neumann AP can efficiently utilize on-chip memory to process Deterministic Finite Automata (DFA), but it is limited to small REGEX sets due to the DFA’s state explosion problem. For large REGEX sets, the spatial AP based on Nondeterministic Finite Automata (NFA) is the mainstream choice. However, there are two problems with previous FPGA-based spatial APs. First, they cannot achieve balanced FPGA resource usage (LUTs and BRAMs), which easily leads to resource shortages. Second, to compress the report output data of large REGEX sets, they use dynamic report compression, which not only consumes many FPGA resources but also limits performance.

This paper optimizes the resource and IO overhead of spatial APs. First, noticing the resource optimization ability of the von Neumann AP, we propose the flex-hybrid-FA algorithm to generate small hybrid-FAs (an NFA/DFA hybrid model) and further propose the Spatial-von Neumann Heterogeneous AP to deploy hybrid-FAs. Under the constraints of the flex-hybrid-FA algorithm, we can obtain balanced and efficient FPGA resource usage. Second, we propose High-Efficient Automata Report Compression (HEARC) with a compression ratio of 5.5-47.6x, which removes the performance limit imposed by IO congestion and consumes fewer FPGA resources than previous dynamic report compression approaches. As far as we know, this is the first work to deploy large REGEX sets on low-cost small-scale FPGAs (e.g. Xilinx XCZU3CG). The experimental results show that, compared to previous FPGA-based APs, we reduce power consumption by 4.0-6.6x and improve energy efficiency by 2.7-5.9x.

CSAIL2019 Crypto-Puzzle Solver Architecture

  • Sergey Gribok
  • Bogdan Pasca
  • Martin Langhammer

The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these types of puzzles requires large amounts of intrinsically sequential computation (i.e. computation which cannot be parallelized), with each iteration performing a very large (3072-bit in the case of CSAIL2019) modular multiplication operation. The complexity of each iteration is several times greater than that of known FPGA implementations, and the number of iterations has been increased by about 1000x compared to LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone, versions of the puzzle have been specified.

In this paper, we present an FPGA architecture for the CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method, which is flexible and can fit on a wide variety of sizes of current FPGAs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 21 out of the 28 milestone solutions of the puzzle, which are the first reported results for this problem.
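
For readers unfamiliar with the workload, LCS35/CSAIL2019-style time-lock puzzles boil down to repeated modular squaring, which is what makes the computation intrinsically sequential; the accelerator's job is to make each wide (3072-bit) modular multiply as fast as possible. A minimal software reference, with a toy modulus rather than the real 3072-bit puzzle instance, looks like this:

```python
def timelock(n, t, base=2):
    """Compute base^(2^t) mod n by t sequential modular squarings.
    Each iteration depends on the previous one, so the only way to go
    faster is to make each modular multiplication itself faster."""
    x = base % n
    for _ in range(t):
        x = (x * x) % n
    return x

# Toy modulus and iteration count; the real puzzle uses a 3072-bit modulus
# and an enormous t, which is why milestone versions exist.
print(timelock(n=1_000_003 * 1_000_033, t=10_000))
```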

ENCORE: Efficient Architecture Verification Framework with FPGA Acceleration

  • Kan Shi
  • Shuoxiang Xu
  • Yuhan Diao
  • David Boland
  • Yungang Bao

Verification typically consumes the majority of the time in the hardware development cycle. This is primarily because the multiple iterations needed to debug hardware using software simulation are extremely time-consuming. While FPGAs can be utilised to accelerate the simulation, existing methods either provide limited visibility of design details, or are expensive to check against a reference model dynamically at the system level.

In this paper, we present ENCORE, an FPGA-accelerated framework for processor architecture verification. The design-under-test (DUT) hardware and the corresponding software emulator run simultaneously on the same FPGA with hardened processors. ENCORE embodies hardware modules that dynamically monitor and compare key registers from both the DUT and reference model, pausing the execution if any mismatches are detected. In this case, ENCORE automatically creates snapshots of the current design status, and offloads this to software simulators for further debugging. We demonstrate the performance of ENCORE by running RISC-V processor designs and benchmarks. We show that ENCORE can achieve over 44000x speedup over a traditional software simulation-based approach, while maintaining full visibility and debugging capabilities.

BOBBER: A Prototyping Platform for Batteryless Intermittent Accelerators

  • Vishak Narayanan
  • Rohit Sahu
  • Jidong Sun
  • Henry Duwe

Batteryless systems offer promising platforms to support pervasive, near-sensor intelligence in a sustainable manner. These systems solely rely on ambient energy sources that often provide limited power. One common approach to designing batteryless systems is intermittent execution: a node banks energy into a capacitive store until a threshold voltage is met; the digital components then turn on and consume the banked energy until it is depleted and they die. The limited amount of available energy demands the development of application- and domain-specific accelerators to achieve energy efficiency and timeliness. Given the extremely close relationship between volatile state and intermittent behavior, performing actual system prototyping has been critical for demonstrating the feasibility of intermittent systems. However, no prototyping platform exists for intermittent accelerators. This paper introduces BOBBER, the first implementation of an intermittent FPGA-based accelerator prototyping platform. We demonstrate BOBBER in the optimization and evaluation of a neural network accelerator powered solely by RF energy harvesting.

SESSION: Poster Session II

Adapting Skip Connections for Resource-Efficient FPGA Inference

  • Olivia Weng
  • Gabriel Marcano
  • Vladimir Loncar
  • Alireza Khodamoradi
  • Nojan Sheybani
  • Farinaz Koushanfar
  • Kristof Denolf
  • Javier Mauricio Duarte
  • Ryan Kastner

Deep neural networks employ skip connections, identity functions that combine the outputs of different layers, to improve training convergence; however, these skip connections are costly to implement in hardware. In particular, for inference accelerators on resource-limited platforms, they require extra buffers, increasing not only on- and off-chip memory utilization but also memory bandwidth requirements. Thus, a network that has skip connections costs more to deploy in hardware than one that has none. We argue that, for certain classification tasks, a network’s skip connections are needed for the network to learn but not necessary for inference after convergence. We thus explore removing skip connections from a fully-trained network to mitigate their hardware cost. From this investigation, we introduce a fine-tuning/retraining method that adapts a network’s skip connections, by either removing or shortening them, to make them fit better in hardware with minimal to no loss in accuracy. With these changes, we decrease resource utilization by up to 34% for BRAMs, 7% for FFs, and 12% for LUTs when implemented on an FPGA.
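
A hedged sketch of what adapting a skip connection might look like in training code (PyTorch purely for illustration; the block structure and fine-tuning recipe are our assumptions, not the paper's exact method): train with the skip enabled, then disable it and fine-tune briefly before deploying the skip-free network to hardware.

```python
import torch
import torch.nn as nn

class AdaptableBlock(nn.Module):
    """Residual block whose skip connection can be switched off after
    convergence, so the deployed accelerator needs no extra skip buffer."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.use_skip = True

    def forward(self, x):
        y = self.body(x)
        return torch.relu(y + x) if self.use_skip else torch.relu(y)

# Train normally with use_skip=True, then:
#   for m in model.modules():
#       if isinstance(m, AdaptableBlock):
#           m.use_skip = False
#   fine_tune(model)   # hypothetical helper that retrains for a few epochs
```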

Multi-bit-width CNN Accelerator with Systolic-in-Systolic Dataflow and Single DSP Multiple Multiplication Scheme

  • Mingqiang Huang
  • Yucen Liu
  • Sixiao Huang
  • Kai Li
  • Qiuping Wu
  • Hao Yu

Multi-bit-width neural networks offer a promising route to high-performance yet energy-efficient edge computing, thanks to their balance between software algorithm accuracy and hardware efficiency. To date, FPGAs have been one of the core hardware platforms for deploying various neural networks. However, it is still difficult to fully exploit the dedicated digital signal processing (DSP) blocks in FPGAs for accelerating multi-bit-width networks. In this work, we develop a state-of-the-art multi-bit-width convolutional neural network accelerator with a novel systolic-in-systolic dataflow and a single DSP multiple multiplication (SDMM) INT2/4/8 execution scheme. Multi-level optimizations have also been adopted to further improve the performance, including a group-vector systolic array for maximizing circuit efficiency while minimizing systolic delay, and a differential neural architecture search (NAS) method for generating high-accuracy multi-bit-width networks. The proposed accelerator has been deployed on a Xilinx ZCU102, with NAS-optimized VGG16 and ResNet18 networks as case studies. Average performance on the convolutional layers of VGG16 and ResNet18 is 1289 GOPS and 1155 GOPS, respectively. Throughput for running the full multi-bit-width VGG16 network is 870.73 GOPS at 250 MHz, which exceeds all previous CNN accelerators on the same platform.
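
Single-DSP-multiple-multiplication schemes rely on packing several narrow operands into one wide multiplier so that one DSP slice serves multiple MACs per cycle. The unsigned toy below shows only the basic packing arithmetic; the real INT2/4/8 scheme must also handle sign extension and cross-term correction, which we omit.

```python
def sdmm_two_unsigned(a, b, w, guard_bits=18):
    """Pack two unsigned multiplications a*w and b*w into one wide multiply.
    Requires b*w < 2**guard_bits so the partial products do not overlap."""
    assert b * w < (1 << guard_bits)
    packed = (a << guard_bits) | b           # one wide operand
    prod = packed * w                        # one wide multiplication
    low = prod & ((1 << guard_bits) - 1)     # = b*w
    high = prod >> guard_bits                # = a*w
    return high, low

# Two 4-bit activations sharing one 8-bit weight in a single multiply.
assert sdmm_two_unsigned(a=13, b=9, w=200) == (13 * 200, 9 * 200)
```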

Janus: An Experimental Reconfigurable SmartNIC with P4 Programmability and SDN Isolation

  • Bharat Sukhwani
  • Mohit Kapur
  • Alda Ohmacht
  • Liran Schour
  • Martin Ohmacht
  • Chris Ward
  • Chuck Haymes
  • Sameh Asaad

Disparate deployment models of cloud computing pose varying requirements on cloud infrastructure components such as networking, storage, provisioning, and security. Infrastructure providers need to study these and often create custom infrastructure components to satisfy these requirements. A major challenge in the research and development of these cloud infrastructure solutions, however, is the availability of customizable platforms for experimentation and trade-off analysis of the various hardware and software components. Most platforms are either general purpose or bespoke solutions created to assist a particular task, and are too rigid to allow meaningful customization. In this work, we present a 100G reconfigurable smartNIC prototyping platform called Janus that enables cloud infrastructure research and hardware-software co-design of infrastructure components such as the hypervisor, secure boot, software-defined networking, and distributed storage. The platform provides a path to optimize the stack by offloading functionality from the host x86 to the embedded processor on the smartNIC and to optimize performance by moving pieces to hardware using P4. Further, our platform provides hardware-enforced isolation of the cloud network control plane, thereby securing the control plane from the tenants even for bare-metal deployments.

LAWS: Large-Scale Accelerated Wave Simulations on FPGAs

  • Dimitrios Gourounas
  • Bagus Hanindhito
  • Arash Fathi
  • Dimitar Trenev
  • Lizy John
  • Andreas Gerstlauer

Computing numerical solutions to large-scale scientific computing problems described by partial differential equations is a common task in high-performance computing. Improving their performance and efficiency is critical to exascale computing. Application-specific hardware design is a well-known solution, but the wide range of kernels makes it infeasible to provision supercomputers with accelerators for all applications. This makes reconfigurable platforms a promising direction. In this work, we focus on wave simulations using discontinuous Galerkin solvers as an important class of applications. Existing work using FPGAs is limited to accelerating specific kernels or small problems that fit into FPGA BRAM. We present LAWS, a generic and configurable architecture for large-scale accelerated wave simulation problems running on FPGAs out of DRAM. LAWS exploits fine- and coarse-grain parallelism using a scalable array of application-specific cores, and incorporates novel dataflow optimizations, including prefetching, kernel fusion, and memory layout optimizations, to minimize data transfers and maximize DRAM bandwidth utilization. We further accompany LAWS with an analytical performance model that allows for scaling across technology trends and architecture configurations. We demonstrate LAWS on the simulation of elastic wave equations. Results show that a single FPGA core achieves 69% higher performance than 24 Xeon cores with 13.27x better energy efficiency, when given 1.94x less peak DRAM bandwidth. Scaling to the same peak DRAM bandwidth shows that an FPGA is 3.27x and 1.5x faster than 24 CPU cores and an Nvidia P100 GPU, with 22.3x and 4.53x better efficiency, respectively.

Mitigating the Last-Mile Bottleneck: A Two-Step Approach For Faster Commercial FPGA Routing

  • Shashwat Shrivastava
  • Stefan Nikolic
  • Chirag Ravishankar
  • Dinesh Gaitonde
  • Mirjana Stojilovic

We identified that in modern commercial FPGAs, routing signals from the general interconnect to the inputs of the CLB primitives through a very sparse input interconnect block (IIB) represents a significant runtime bottleneck. This is despite academic research often neglecting the runtime of last-mile routing through the IIB. We propose a two-step routing approach that resolves this bottleneck by leveraging the massive parallelism of today’s compute infrastructure. The main premise that enables massive parallelization is that once the signals are legally routed in the general interconnect (reaching only the inputs of the IIB, but not the final targets), the remaining last-mile routing through the IIB can be completed independently for each FPGA tile.

We ran experiments using ISPD16 and industrial designs to demonstrate the dominant contribution of last-mile routing to the router’s runtime. We used an architectural model closely resembling Xilinx UltraScale FPGAs, which makes it highly representative of the current state of the art. For ISPD16 benchmarks, we observed that when the router is instructed to complete the entire routing, including its last-mile portion, the average number of heap pushes (a machine-agnostic measure of runtime) increases 4.1× compared to a simplified reference in which last-mile routing is neglected. On industrial designs, the number of heap pushes increased 4.4×. Last-mile routing was successfully completed using a SAT-based router in up to 83% of FPGA tiles. With a slight increase in density of IIB connectivity, we were able to bring the completion success rate up to 100%.

Towards a Machine Learning Approach to Predicting the Difficulty of FPGA Routing Problems

  • Andrew David Gunter
  • Steven Wilton

In this poster, we present a Machine Learning (ML) technique to predict the number of iterations needed for a Pathfinder-based FPGA router to complete a routing problem. Given a placed circuit, our technique uses features gathered on each routing iteration to predict if the circuit is routable and how many more iterations will be required to successfully route the circuit. This enables early exit for routing problems which are unlikely to be completed in a target number of iterations. Such early exit may help to achieve a successful route within tractable time by allowing the user to quickly retry the circuit compilation with a different random seed, a modified circuit design, or a different FPGA. We demonstrate our predictor in the VTR 8 framework; compared to VTR’s predictor, our ML predictor incurs lower prediction errors on the Koios Deep Learning benchmark suite. This corresponds with an approximate time saving of 48% from early rejection of unroutable FPGA designs while also successfully completing 5% more routable designs and having a 93% shorter early exit latency.

An FPGA-Based Weightless Neural Network for Edge Network Intrusion Detection

  • Zachary Susskind
  • Aman Arora
  • Alan T. L. Bacellar
  • Diego L. C. Dutra
  • Igor D. S. Miranda
  • Mauricio Breternitz
  • Priscila M. V. Lima
  • Felipe M. G. França
  • Lizy K. John

Algorithms for mobile networking are increasingly being moved from centralized servers towards the edge in order to decrease latency and improve the user experience. While much of this work is traditionally done using ASICs, 6G emphasizes the adaptability of algorithms for specific user scenarios, which motivates broader adoption of FPGAs. In this paper, we propose the FPGA-based Weightless Intrusion Warden (FWIW), a novel solution for detecting anomalous network traffic on edge devices. While prior work in this domain is based on conventional deep neural networks (DNNs), FWIW incorporates a weightless neural network (WNN), a table lookup-based model which learns sophisticated nonlinear behaviors. This allows FWIW to achieve accuracy far superior to prior FPGA-based work at a very small fraction of the model footprint, enabling deployment on small, low-cost devices. FWIW achieves a prediction accuracy of 98.5% on the UNSW-NB15 dataset with a total model parameter size of just 192 bytes, reducing error by 7.9x and model size by 262x vs. LogicNets, the best prior edge-optimized implementation. Implemented on a Xilinx Virtex UltraScale+ FPGA, FWIW demonstrates a 59x reduction in LUT usage with a 1.6x increase in throughput. The accuracy of FWIW comes within 0.6% of the best-reported result in literature (Edge-Detect), a model several orders of magnitude larger. Our results make it clear that WNNs are worth exploring in the emerging domain of edge networking, and suggest that FPGAs are capable of providing the extreme throughput needed.

A Flexible Toolflow for Mapping CNN Models to High Performance FPGA-based Accelerators

  • Yongzheng Chen
  • Gang Wu

There have been many studies on developing automatic tools for mapping CNN models onto FPGAs. However, challenges remain in designing an easy-to-use toolflow. First, the toolflow should be able to handle models exported from various deep learning frameworks and models with different topologies. Second, the hardware architecture should make better use of on-chip resources to achieve high performance. In this work, we build a toolflow upon Open Neural Network Exchange (ONNX) IR to support different DL frameworks. We also try to maximize the overall throughput via multiple hardware-level efforts. We propose to accelerate the convolution operation by applying parallelism not only at the input and output channel level, but also at the output feature map level. Several on-chip buffers and corresponding management algorithms are also designed to leverage abundant memory resources. Moreover, we employ a fully pipelined systolic array running at 400 MHz as the convolution engine, and develop a dedicated bus to implement the im2col algorithm and provide feature inputs to the systolic array. We generated 4 accelerators with different systolic array shapes and compiled 12 CNN models for each accelerator. Deployed on a Xilinx VCU118 evaluation board, the performance of convolutional layers can reach 3267.61 GOPS, which is 99.72% of the ideal throughput (3276.8 GOPS). We also achieve an overall throughput of up to 2424.73 GOPS. Compared with previous studies, our toolflow is more user-friendly. The end-to-end performance of the generated accelerators is also better than that of related work at the same DSP utilization.
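
For background on the im2col bus mentioned above, this is the standard transformation it implements in hardware, shown here in plain numpy as an illustration of the algorithm rather than the accelerator's actual datapath.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Rearrange a (C, H, W) feature map into columns so that convolution
    becomes a single matrix multiply (the form a systolic array consumes)."""
    c, h, w = x.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    idx = 0
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, idx] = patch.reshape(-1)
            idx += 1
    return cols

# A conv layer with weights W of shape (K, C, kh, kw) then reduces to
#   Y = W.reshape(K, -1) @ im2col(x, kh, kw)    -> shape (K, oh*ow)
```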

Senju: A Framework for the Design of Highly Parallel FPGA-based Iterative Stencil Loop Accelerators

  • Emanuele Del Sozzo
  • Davide Conficconi
  • Marco D. Santambrogio
  • Kentaro Sano

Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving, seismic simulations, and financial option pricing, to name a few. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional system of points until it reaches convergence. However, due to their iterative and computationally intensive nature, these workloads are highly performance-hungry, demanding specialized solutions to boost performance and reduce power consumption. Here, FPGAs represent a valid architectural choice as their peculiar features enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces Senju, an automation framework for FPGA-based ISL accelerators. Starting from an input description, Senju builds highly parallel hardware modules and automatizes all their design phases. The experimental evaluation shows remarkable and scalable results, reaching significant performance and energy efficiency improvements compared to the other single-FPGA literature approaches.

FPGA Acceleration for Successive Interference Cancellation in Severe Multipath Acoustic Communication Channels

  • Jinfeng Li
  • Yahong Rosa Zheng

This paper proposes a hardware implementation of a Successive Interference Cancellation (SIC) scheme in a Turbo Equalizer for very long multipath fading channels where the intersymbol interference (ISI) channel length L is on the order of 100 taps. To reduce the computational complexity caused by large matrix arithmetic in the SIC, we exploit the data dependencies and convolutional nature of the SIC algorithm and propose an FPGA acceleration architecture that takes advantage of the high degree of parallelism and flexible data movement offered by FPGAs. Instead of reconstructing the interference symbol by symbol via direct matrix-vector multiplication for each symbol in a block, we propose a two-stage processing algorithm. The first stage is block-wise processing, where the convolution of the channel impulse response (CIR) vector and the vector consisting of the whole symbol block is computed. The second stage is symbol-wise processing, but it reduces to the multiplication of a single symbol and the CIR vector. The result shows that for a block of Nblk symbols and a channel of length L, the proposed architecture completes the SIC within 2Nblk+L clock cycles, while direct matrix multiplication requires L×Nblk clock cycles. Implemented on a Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit, the SIC and equalization for one turbo iteration based on this architecture complete in around 40 us for a 1024-symbol block and a channel length of L=100. This architecture achieves around 40× speed-up compared with an implementation on a powerful CPU platform.
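
One way to read the two-stage idea in software terms (our interpretation of the abstract, with illustrative names): stage one is a single block-wise convolution of the channel impulse response with the whole symbol block; stage two is a cheap per-symbol correction that removes each symbol's own contribution so only the interference estimate remains.

```python
import numpy as np

def sic_two_stage(h, s):
    """h: channel impulse response (length L); s: soft symbol block (length Nblk).
    Stage 1: block-wise convolution h * s, computed once for the whole block.
    Stage 2: symbol-wise correction subtracting each symbol's own tap h[0],
    leaving the interference estimate to cancel."""
    conv = np.convolve(h, s)[: len(s)]       # stage 1: block-wise
    interference = conv - h[0] * s           # stage 2: symbol-wise
    return interference

h = np.random.randn(100) * np.exp(-0.05 * np.arange(100))   # L = 100 taps
s = np.sign(np.random.randn(1024))                          # Nblk = 1024
isi = sic_two_stage(h, s)
```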

FreezeTime: Towards System Emulation through Architectural Virtualization

  • Sergiu Mosanu
  • Joshua Fixelle
  • Kevin Skadron
  • Mircea Stan

High-end FPGAs enable architecture modeling through emulation with high speed and fidelity. However, the available reconfigurable logic and memory resources limit the size, complexity, and speed of the emulated target designs. The challenge is to map and model large and fast memory hierarchies, such as large caches and mixed main memory, various heterogeneous computation instances, such as CPUs, GPUs, AI/ML processing units and accelerator cores, and communication infrastructure, such as buses and networks. In addition to the spatial dimension, this work uses the temporal dimension, implemented with architectural multiplexing coupled with block-level synchronization, to model a complete system-on-chip architecture. Our approach presents mechanisms to abstract instance plurality while preserving timing in sync. With only a subset of the architecture on the FPGA, we freeze a whole emulated module’s activity and state during the additional time intervals necessary for the action on the virtualized modules to elapse. We demonstrate this technique by emulating a hypothetical system consisting of a processor and an SRAM memory too large to map on the FPGA. For this, we modify a LiteX-generated SoC consisting of a VexRISC-V processor and DDR memory, with the memory controller issuing stall signals that freeze the processor, effectively "hiding" the memory latency. For Linux boot, we measure significant emulation vs. simulation speedup while matching RTL simulation accuracy. The work is open-sourced.

SESSION: Session: Applications and Design Studies II

A Framework for Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform via on-chip Dynamic Tree Management

  • Yuan Meng
  • Rajgopal Kannan
  • Viktor Prasanna

Monte Carlo Tree Search (MCTS) is a widely used search technique in Artificial Intelligence (AI) applications. MCTS manages a dynamically evolving decision tree (i.e., one whose depth and height evolve at run-time) to guide an AI agent toward an optimal policy. In-tree operations are memory-bound, leading to a critical performance bottleneck for large-scale parallel MCTS on general-purpose processors. CPU-FPGA accelerators can alleviate the memory bottleneck of in-tree operations. However, a major challenge for existing FPGA accelerators is the lack of dynamic memory management, due to which they cannot efficiently support dynamically evolving MCTS trees. In this work, we address this challenge by proposing an MCTS acceleration framework that (1) incorporates an algorithm-hardware co-optimized accelerator design that supports in-tree operations on dynamically evolving trees without expensive hardware reconfiguration; (2) adopts a hybrid parallel execution model to fully exploit the compute power in a CPU-FPGA heterogeneous system; (3) supports a Python-based programming API for easy integration of the proposed accelerator with RL domain-specific benchmarking libraries at run-time. We show that by using our framework, we achieve up to 6.8× speedup and superior scalability of parallel workers than state-of-the-art parallel MCTS on multi-core systems.

Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

  • Linghao Song
  • Licheng Guo
  • Suhail Basalama
  • Yuze Chi
  • Robert F. Lucas
  • Jason Cong

The continued growth in the processing power of FPGAs, coupled with high-bandwidth memory (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers, which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient linear solver (CG). FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth and maintain double (FP64) precision accuracy. To tackle the three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed precision scheme to save bandwidth yet still achieve effective double precision quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reuse between on-chip modules to reduce unnecessary off-chip accesses and enable modules to work in parallel for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to the Xilinx HPC product XcgSolver, Callipepla achieves a speedup of 3.94x, 3.36x higher throughput, and 2.94x better energy efficiency. Compared to an NVIDIA A100 GPU, which has 4x the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34x higher energy efficiency. The code is available at https://github.com/UCLA-VAST/Callipepla.
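
For reference, this is the textbook preconditioned conjugate gradient iteration that such an accelerator streams, shown with a Jacobi (diagonal) preconditioner in numpy purely as a software illustration; Callipepla's instruction set, vector streaming reuse, and mixed-precision scheme are not reproduced here.

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Preconditioned CG: each iteration is one SpMV plus a handful of
    long-vector AXPY/dot operations -- exactly the streams an FPGA
    solver has to coordinate."""
    Minv = 1.0 / np.diag(A)
    x = np.zeros_like(b)
    r = b - A @ x
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
print(jacobi_pcg(A, b))   # ~ [0.0909, 0.6364]
```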

Accelerating Sparse MTTKRP for Tensor Decomposition on FPGA

  • Sasindu Wijeratne
  • Ta-Yang Wang
  • Rajgopal Kannan
  • Viktor Prasanna

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the most computationally intensive kernel in sparse tensor decomposition. In this paper, we propose a hardware-algorithm co-design on FPGA to minimize the execution time of spMTTKRP along all modes of an input tensor. We introduce FLYCOO, a novel tensor format that eliminates the communication of intermediate values to the FPGA external memory during the computation of spMTTKRP along all the modes. Our remapping of the tensor using FLYCOO also balances the workload among multiple Processing Engines (PEs). We propose a parallel algorithm that can concurrently process multiple partitions of the input tensor independent of each other. The proposed algorithm also orders the tensor dynamically during runtime to increase the data locality of the external memory accesses. We develop a custom FPGA accelerator design with (1) PEs consisting of a collection of pipelines that can concurrently process multiple elements of the input tensor and (2) memory controllers to exploit the spatial and temporal locality of the external memory accesses of the computation. Our work achieves a geometric mean of 8.8X and 3.8X speedup in execution time compared with the state-of-the-art CPU and GPU implementations on widely-used real-world sparse tensor datasets.
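
For readers unfamiliar with the kernel, this is mode-0 MTTKRP on a 3-way sparse tensor in COO format, written in plain numpy as a reference definition only; FLYCOO, the dynamic reordering, and the PE pipelines are not modeled.

```python
import numpy as np

def mttkrp_mode0(coords, vals, B, C, I):
    """Mode-0 MTTKRP: M[i, :] += val * (B[j, :] * C[k, :]) for every nonzero
    X[i, j, k] = val. B: J x R and C: K x R factor matrices; result M: I x R."""
    R = B.shape[1]
    M = np.zeros((I, R))
    for (i, j, k), v in zip(coords, vals):
        M[i, :] += v * (B[j, :] * C[k, :])   # elementwise Khatri-Rao row
    return M

coords = [(0, 1, 2), (3, 0, 1), (0, 2, 0)]
vals = [1.0, 2.0, -0.5]
B, C = np.random.randn(3, 4), np.random.randn(3, 4)
print(mttkrp_mode0(coords, vals, B, C, I=4).shape)   # (4, 4)
```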

ASPDAC’23 TOC

ASPDAC ’23: Proceedings of the 28th Asia and South Pacific Design Automation Conference

 Full Citation in the ACM Digital Library

SESSION: Technical Program: Reliability Considerations for Emerging Computing and Memory Architectures

A Fast Semi-Analytical Approach for Transient Electromigration Analysis of Interconnect Trees Using Matrix Exponential

  • Pavlos Stoikos
  • George Floros
  • Dimitrios Garyfallou
  • Nestor Evmorfopoulos
  • George Stamoulis

As integrated circuit technologies move to smaller technology nodes, electromigration (EM) has become one of the most challenging problems facing the EDA industry. While numerical approaches have been widely deployed since they can handle complicated interconnect structures, they tend to be much slower than analytical approaches. In this paper, we present a fast semi-analytical approach, based on the matrix exponential, for the solution of Korhonen’s stress equation at discrete spatial points of interconnect trees, which enables the analytical calculation of EM stress at any time and point independently. The proposed approach is combined with the extended Krylov subspace method to accurately simulate large EM models and accelerate the calculation of the final solution. Experimental evaluation on OpenROAD benchmarks demonstrates that our method achieves 0.5% average relative error compared with the industrial tool COMSOL while being up to three orders of magnitude faster.
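
The semi-analytical step the abstract refers to, evaluating a spatially discretized linear system directly at an arbitrary time via the matrix exponential, can be illustrated in a few lines. The system below is a generic diffusion-like stand-in of ours, not Korhonen's actual discretization, and the extended Krylov acceleration is beyond this sketch.

```python
import numpy as np
from scipy.linalg import expm

# Spatially discretized 1-D diffusion-like system dx/dt = A x, a stand-in
# for stress evolution along one interconnect segment.
n = 50
A = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) * 1e3
x0 = np.zeros(n); x0[0] = 1.0                 # initial stress profile

def stress_at(t):
    """Evaluate the solution at any time t directly, with no time stepping."""
    return expm(A * t) @ x0

print(stress_at(1e-3)[:5])
```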

Chiplet Placement for 2.5D IC with Sequence Pair Based Tree and Thermal Consideration

  • Hong-Wen Chiou
  • Jia-Hao Jiang
  • Yu-Teng Chang
  • Yu-Min Lee
  • Chi-Wen Pan

This work develops an efficient chiplet placer with thermal consideration for 2.5D ICs. Combining the sequence-pair based tree, a branch-and-bound method, and advanced placement/pruning techniques, the developed placer can quickly find solutions with optimized total wirelength (TWL), measured as half-perimeter wirelength (HPWL). Additionally, with a post-placement procedure, the placer reduces maximum temperatures with only a slight increase in wirelength. Experimental results show that the placer not only finds better-optimized TWL (reducing HPWL by 1.035%) but also runs up to two orders of magnitude faster than the prior art. With thermal consideration, the placer can reduce the maximum temperature by up to 8.214 °C with an average 5.376% increase in TWL.

An On-Line Aging Detection and Tolerance Framework for Improving Reliability of STT-MRAMs

  • Yu-Guang Chen
  • Po-Yeh Huang
  • Jin-Fu Li

Spin-transfer-torque magnetic random-access memory (STT-MRAM) is one of the most promising emerging memories for on-chip memory. However, the magnetic tunnel junction (MTJ) in STT-MRAM suffers from several reliability threats which degrade endurance, create defects, and cause memory failure. One of the primary reliability issues comes from time-dependent dielectric breakdown (TDDB) of the MTJ, which causes the MTJ resistance to drift over time and may lead to read errors. To overcome this challenge, in this paper we present an on-line aging detection and tolerance framework to dynamically monitor electrical parameter deviations and provide appropriate compensation to avoid read errors. The on-line aging detection mechanism identifies aged words by monitoring the read current, and the aging tolerance mechanism then adjusts the reference resistance of the sensing amplifier to compensate for the aging-induced resistance drop of the MTJ. In comparison with existing testing-based aging detection techniques, our mechanism can operate on-line with read operations, performing aging detection and tolerance simultaneously with negligible performance overhead. Simulation and analysis results show that the proposed techniques can successfully detect 99% of aged words under process variation and achieve up to 25% reliability improvement for STT-MRAMs.

SESSION: Technical Program: Accelerators and Equivalence Checking

Automated Equivalence Checking Method for Majority Based In-Memory Computing on ReRAM Crossbars

  • Arighna Deb
  • Kamalika Datta
  • Muhammad Hassan
  • Saeideh Shirinzadeh
  • Rolf Drechsler

Recent progress in the fabrication of Resistive Random Access Memory (ReRAM) devices has paved the way for large scale crossbar structures. In particular, in-memory computing on ReRAM crossbars helps in bridging the processor-memory speed gap for current CMOS technology. To this end, synthesis and mapping of Boolean functions to such crossbars have been investigated by researchers. However, the verification of even simple crossbar designs is still done through manual inspection, sometimes complemented by simulation-based techniques. This is clearly an important problem, as real-world designs are complex and have a large number of inputs; for such designs, manual inspection and simulation-based methods are not practical.

In this paper, to the best of our knowledge for the first time, we propose an automated equivalence checking methodology for majority-based in-memory designs on ReRAM crossbars. Our contributions are twofold: first, we introduce an intermediate data structure called the ReRAM Sequence Graph (ReSG) to represent the logic-in-memory design. This in turn is translated into Boolean satisfiability (SAT) formulas. These SAT formulas are verified against the golden functional specification using the Z3 Satisfiability Modulo Theories (SMT) solver. We validate the proposed method on widely available benchmarks.
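
The final verification step, checking formulas extracted from the design against a golden specification with Z3, looks roughly like this in the Z3 Python API. The majority-logic example below is our own generic illustration, not the ReSG translation itself.

```python
from z3 import Bools, And, Or, Xor, Solver, unsat

a, b, c = Bools("a b c")

def maj(x, y, z):
    """3-input majority, the primitive realized by ReRAM crossbar logic."""
    return Or(And(x, y), And(y, z), And(x, z))

spec = Or(And(a, b), And(c, Xor(a, b)))   # golden spec: full-adder carry-out
impl = maj(a, b, c)                       # candidate majority-based mapping

s = Solver()
s.add(spec != impl)                       # any satisfying assignment is a mismatch
if s.check() == unsat:
    print("equivalent")
else:
    print("counterexample:", s.model())
```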

An Equivalence Checking Framework for Agile Hardware Design

  • Yanzhao Wang
  • Fei Xie
  • Zhenkun Yang
  • Pasquale Cocchini
  • Jin Yang

Agile hardware design enables designers to produce new design iterations efficiently. Equivalence checking is critical in ensuring that a new design iteration conforms to its specification. In this paper, we introduce an equivalence checking framework for hardware designs represented in HalideIR. HalideIR is a popular intermediate representation in software domains such as deep learning and image processing, and it is increasingly utilized in agile hardware design. We have developed a fully automatic equivalence checking workflow seamlessly integrated with HalideIR and several optimizations that leverage the incremental nature of agile hardware design to scale equivalence checking. Evaluations of two deep learning accelerator designs show our automatic equivalence checking framework scales to hardware designs of practical sizes and detects inconsistencies that manually crafted tests have missed.

Towards High-Bandwidth-Utilization SpMV on FPGAs via Partial Vector Duplication

  • Bowen Liu
  • Dajiang Liu

Sparse matrix-vector multiplication (SpMV) is widely used in many fields and usually dominates the execution time of a task. With large off-chip memory bandwidth, customizable on-chip resources, and high-performance floating-point operations, the FPGA is a potential platform to accelerate SpMV tasks. However, as compressed data formats for SpMV usually introduce irregular memory accesses while the kernel is also memory-intensive, implementing an SpMV accelerator on an FPGA that achieves high bandwidth utilization (BU) is challenging. Existing works either eliminate irregular memory accesses at the cost of increased data redundancy or try to locally reduce the port conflicts introduced by irregular memory accesses, leading to limited BU improvement. To this end, this paper proposes a high-bandwidth-utilization SpMV accelerator on FPGAs using partial vector duplication, in which a read-conflict-free vector buffer, a write-conflict-free adder tree, and ping-pong-like accumulator registers are carefully elaborated. The FPGA implementation results show that the proposed design achieves an average 1.10x performance speedup compared to the state-of-the-art work.

SESSION: Technical Program: New Frontiers in Cyber-Physical and Autonomous Systems

Safety-Driven Interactive Planning for Neural Network-Based Lane Changing

  • Xiangguo Liu
  • Ruochen Jiao
  • Bowen Zheng
  • Dave Liang
  • Qi Zhu

Neural network-based driving planners have shown great promise in improving task performance of autonomous driving. However, it is critical and yet very challenging to ensure the safety of systems with neural network-based components, especially in dense and highly interactive traffic environments. In this work, we propose a safety-driven interactive planning framework for neural network-based lane changing. To prevent over-conservative planning, we identify the driving behavior of surrounding vehicles and assess their aggressiveness, and then adapt the planned trajectory for the ego vehicle accordingly in an interactive manner. The ego vehicle can proceed to change lanes if a safe evasion trajectory exists even in the predicted worst case; otherwise, it can stay around the current lateral position or return back to the original lane. We quantitatively demonstrate the effectiveness of our planner design and its advantage over baseline methods through extensive simulations with diverse and comprehensive experimental settings, as well as in real-world scenarios collected by an autonomous vehicle company.

Safety-Aware Flexible Schedule Synthesis for Cyber-Physical Systems Using Weakly-Hard Constraints

  • Shengjie Xu
  • Bineet Ghosh
  • Clara Hobbs
  • P. S. Thiagarajan
  • Samarjit Chakraborty

With the emergence of complex autonomous systems, multiple control tasks are increasingly being implemented on shared computational platforms. Due to the resource-constrained nature of such platforms in domains such as automotive, scheduling all the control tasks in a timely manner is often difficult. The usual requirement—that all task invocations must meet their deadlines—stems from the isolated design of a control strategy and its implementation (including scheduling) in software. This separation of concerns, where the control designer sets the deadlines, and the embedded software engineer aims to meet them, eases the design and verification process. However, it is not flexible and is overly conservative. In this paper, we show how to capture the deadline miss patterns under which the safety properties of the controllers will still be satisfied. The allowed patterns of such deadline misses may be captured using what are referred to as “weakly-hard constraints.” But scheduling tasks under these weakly-hard constraints is non-trivial since common scheduling policies like fixed-priority or earliest deadline first do not satisfy them in general. The main contribution of this paper is to automatically synthesize schedules from the safety properties of controllers. Using real examples, we demonstrate the effectiveness of this strategy and illustrate that traditional notions of schedulability, e.g., utility ratios, are not applicable when scheduling controllers to satisfy safety properties.
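
A weakly-hard constraint of the common (m, K) form, "at most m deadline misses in any window of K consecutive invocations", can be checked against a candidate schedule's hit/miss pattern with a simple sliding window. This snippet is illustrative only; the paper synthesizes schedules from the controllers' safety properties rather than checking a fixed pattern.

```python
def satisfies_weakly_hard(misses, m, K):
    """misses: boolean list, True where a job invocation missed its deadline.
    Returns True iff every window of K consecutive invocations contains
    at most m misses (the (m, K) weakly-hard constraint)."""
    if len(misses) < K:
        return sum(misses) <= m
    window = sum(misses[:K])
    if window > m:
        return False
    for i in range(K, len(misses)):
        window += misses[i] - misses[i - K]   # slide the window by one job
        if window > m:
            return False
    return True

pattern = [False, True, False, False, True, False, False, False]
print(satisfies_weakly_hard(pattern, m=1, K=3))   # True: no 3-window has >1 miss
```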

Mixed-Traffic Intersection Management Utilizing Connected and Autonomous Vehicles as Traffic Regulators

  • Pin-Chun Chen
  • Xiangguo Liu
  • Chung-Wei Lin
  • Chao Huang
  • Qi Zhu

Connected and autonomous vehicles (CAVs) can realize many revolutionary applications, but mixed traffic including both CAVs and human-driven vehicles (HVs) is expected for decades to come. In this paper, we target the problem of mixed-traffic intersection management and schedule CAVs to control the subsequent HVs. We develop a dynamic programming approach and a mixed integer linear programming (MILP) formulation to optimally solve the problems with the corresponding intersection models. We then propose an MILP-based approach which is more efficient and real-time-applicable than solving the optimal MILP formulation, while keeping good solution quality and outperforming the first-come-first-served (FCFS) approach. Experimental results and SUMO simulation indicate that controlling CAVs with our approaches effectively regulates mixed traffic even when the CAV penetration rate is low, which provides an incentive for early adoption of CAVs.

SESSION: Technical Program: Machine Learning Assisted Optimization Techniques for Analog Circuits

Fully Automated Machine Learning Model Development for Analog Placement Quality Prediction

  • Chen-Chia Chang
  • Jingyu Pan
  • Zhiyao Xie
  • Yaguang Li
  • Yishuang Lin
  • Jiang Hu
  • Yiran Chen

Analog integrated circuit (IC) placement is a heavily manual and time-consuming task that has a significant impact on chip quality. Several recent studies apply machine learning (ML) techniques to directly predict the impact of placement on circuit performance or even guide the placement process. However, the significant diversity in analog design topologies can lead to different impacts on performance metrics (e.g., common-mode rejection ratio (CMRR) or offset voltage). Thus, it is unlikely that the same ML model structure will achieve the best performance for all designs and metrics. In addition, customizing ML models for different designs requires tremendous engineering effort and longer development cycles. In this work, we leverage Neural Architecture Search (NAS) to automatically develop customized neural architectures for different analog circuit designs and metrics. Our proposed NAS methodology supports an unconstrained DAG-based search space containing a wide range of ML operations and topological connections. Our search strategy can efficiently explore this flexible search space and provide every design with the best customized model to boost model performance. We make unprejudiced comparisons with the claimed performance of the previous representative work on exactly the same dataset. After fully automated development within only 0.5 days, the generated models achieve 3.61% higher accuracy than the prior art.

Efficient Hierarchical mm-Wave System Synthesis with Embedded Accurate Transformer and Balun Machine Learning Models

  • F. Passos
  • N. Lourenço
  • L. Mendes
  • R. Martins
  • J. Vaz
  • N. Horta

Integrated circuit design in millimeter-wave (mm-Wave) bands is exceptionally complex and dependent on costly electromagnetic (EM) simulations. Therefore, in the past few years, a growing interest has emerged in developing novel optimization-based methodologies for the automatic design of mm-Wave circuits. However, current approaches lack scalability when the circuit/system complexity increases. Besides, many also depend on EM simulators, which degrades their efficiency. This work resorts to hierarchical system partitioning and bottom-up design approaches, where a precise machine learning model – composed of hundreds of seamlessly integrated sub-models that guarantee high accuracy (validated against EM simulations and measurements) up to 200GHz – is embedded to design passive components, e.g., transformers and baluns. The model generates optimal design surfaces to be fed to the hierarchical levels above or acts as a performance estimator. With the proposed scheme, it is possible to remove the dependency on EM simulations during optimization. The proposed mixed optimal-surface, performance-estimator, and simulation-based bottom-up multiobjective optimization (MOO) is used to fully design a Ka-band mm-Wave transmitter from the device up to the system level in 65-nm CMOS for state-of-the-art specifications.

APOSTLE: Asynchronously Parallel Optimization for Sizing Analog Transistors Using DNN Learning

  • Ahmet F. Budak
  • David Smart
  • Brian Swahn
  • David Z. Pan

Analog circuit sizing is a high-cost process in terms of the manual effort invested and the computation time spent. With rapidly developing technology and high market demand, bringing automated solutions for sizing has attracted great attention. This paper presents APOSTLE, an asynchronously parallel optimization method for sizing analog transistors using Deep Neural Network (DNN) learning. This work introduces several methods to minimize the wall-clock time of optimization when the sizing task consists of several different simulations with varying time costs. The key contributions of this paper are: (1) a batch optimization framework, (2) a novel deep neural network architecture for exploring design points when existing solutions are not always fully evaluated, (3) a ranking approximation method based on cheap evaluations, and (4) a theoretical approach to balancing the cheap and the expensive simulations to maximize the optimization efficiency. Our method shows high real-time efficiency compared to other black-box optimization methods, both on small building blocks and on large industrial circuits, while reaching similar or better performance.

SESSION: Technical Program: Machine Learning for Reliable, Secure, and Cool Chips: A Journey from Transistors to Systems

ML to the Rescue: Reliability Estimation from Self-Heating and Aging in Transistors All the Way up to Processors

  • Hussam Amrouch
  • Florian Klemme

With increasingly confined 3D structures and newly adopted materials of higher thermal resistance, transistor self-heating has become a critical reliability threat in state-of-the-art and emerging process nodes. One of the challenges of transistor self-heating is accelerated transistor aging, which leads to earlier failure of the chip if not considered appropriately. Nevertheless, adequate consideration of accelerated aging effects induced by self-heating throughout a large circuit design is profoundly challenging due to the large gap between where self-heating originates (i.e., at the transistor level) and where its ultimate effect occurs (i.e., at the circuit and system levels). In this work, we demonstrate an end-to-end workflow starting from self-heating and aging effects in individual transistors all the way up to large circuits and processor designs. We demonstrate that with our accurately estimated degradations, the required timing guardband to ensure reliable operation of circuits is reduced considerably, by up to 96%, compared to the otherwise worst-case estimations that are conventionally employed.

Graph Neural Networks: A Powerful and Versatile Tool for Advancing Design, Reliability, and Security of ICs

  • Lilas Alrahis
  • Johann Knechtel
  • Ozgur Sinanoglu

Graph neural networks (GNNs) have pushed the state-of-the-art (SOTA) for performance in learning and predicting on large-scale data present in social networks, biology, etc. Since integrated circuits (ICs) can naturally be represented as graphs, there has been a tremendous surge in employing GNNs for machine learning (ML)-based methods for various aspects of IC design. Given this trajectory, there is a timely need to review and discuss some powerful and versatile GNN approaches for advancing IC design.

In this paper, we propose a generic pipeline for tailoring GNN models toward solving challenging problems in IC design. We outline promising options for each pipeline element, and we discuss selected and promising works, like leveraging GNNs to break SOTA logic obfuscation. Our comprehensive overview of GNN frameworks covers (i) electronic design automation (EDA) and IC design in general, (ii) design of reliable ICs, and (iii) design as well as analysis of secure ICs. We also provide our overview and related resources in the GNN4IC hub at https://github.com/DfX-NYUAD/GNN4IC. Finally, we discuss interesting open problems for future research.

Detection and Classification of Malicious Bitstreams for FPGAs in Cloud Computing

  • Jayeeta Chaudhuri
  • Krishnendu Chakrabarty

As FPGAs are increasingly shared and remotely accessed by multiple users and third parties, they introduce significant security concerns. Modules running on an FPGA may include circuits that induce voltage-based fault attacks and denial-of-service (DoS). An attacker might configure some regions of the FPGA with bitstreams that implement malicious circuits. Attackers can also perform side-channel analysis and fault attacks to extract secret information (e.g., the secret key of an AES encryption). In this paper, we present a convolutional neural network (CNN)-based defense to detect bitstreams of ring oscillator (RO)-based malicious circuits by analyzing the static features extracted from FPGA bitstreams. We further explore the criticality of RO-based circuits in order to detect malicious Trojans that are configured on the FPGA. Evaluation on Xilinx FPGAs demonstrates the effectiveness of the security solutions.

Learning Based Spatial Power Characterization and Full-Chip Power Estimation for Commercial TPUs

  • Jincong Lu
  • Jinwei Zhang
  • Wentian Jin
  • Sachin Sachdeva
  • Sheldon X.-D. Tan

In this paper, we propose a novel approach for the real-time estimation of chip-level spatial power maps for commercial Google Coral M.2 TPU chips based on a machine-learning technique for the first time. The new method can enable the development of more robust runtime power and thermal control schemes that take advantage of spatial power information, such as hot spots, that is otherwise not available. Unlike existing commercial multi-core processors, in which real-time performance-related utilization information is available, the Google TPU does not expose such information. To mitigate this problem, we propose to use features that are related to the workloads of running different deep neural networks (DNNs), such as the hyperparameters of the DNN and TPU resource information generated by the TPU compiler. The new approach involves the offline acquisition of accurate spatial and temporal temperature maps captured from an external infrared thermal imaging camera under nominal working conditions of a chip. To build the dynamic power density map model, we apply generative adversarial networks (GANs) based on the workload-related features. Our study shows that the estimated total powers match the manufacturer's total power measurements extremely well. Experimental results further show that the predicted power maps are quite accurate, with an RMSE of only 4.98mW/mm2, or 2.6% of the full-scale error. Deployed on an Intel Core i7-10710U, the proposed approach runs in as little as 6.9ms, which is suitable for real-time estimation.

SESSION: Technical Program: High Performance Memory for Storage and Computing

DECC: Differential ECC for Read Performance Optimization on High-Density NAND Flash Memory

  • Yunpeng Song
  • Yina Lv
  • Liang Shi

3D NAND flash memory with advanced multi-level-cell technology has been widely adopted due to its high density, but its reliability is significantly degraded. To address the reliability issue, flash memory often adopts the low-density parity-check code (LDPC) as the error correction code (ECC) to encode data and provide fault tolerance. LDPC with a low code rate provides strong correction capability but comes with a high energy cost. To avoid this cost, LDPC with a higher code rate is typically adopted. When the accessed data cannot be successfully decoded, LDPC relies on read retry operations to improve the error correction capability. However, read retry operations degrade read performance. In this work, a differential ECC (DECC) method is proposed to improve the read performance. The basic idea of DECC is to adopt LDPC with different code rates for data with different access characteristics. Specifically, when data is read-hot and has been retried due to reliability issues, LDPC with a low code rate is adopted to optimize performance. With this approach, the cost of low-code-rate LDPC is minimized and the performance is optimized. Through careful design and evaluation with real-world workloads on a 3D triple-level-cell (TLC) NAND flash memory, DECC achieves encouraging read performance optimization.
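
The core idea of choosing a code rate per data class can be sketched as a simple policy, shown below. The thresholds and code-rate values are hypothetical placeholders for illustration; the paper's actual selection criteria and rates are not reproduced here.

```python
# A minimal sketch of the DECC idea described above (thresholds and rates are assumed,
# not taken from the paper): data that is read-hot and has triggered read retries is
# re-encoded with a stronger (lower-rate) LDPC code; other data keeps the default
# high-rate code.

DEFAULT_RATE = 0.93   # weaker correction, lower storage/energy cost (assumed value)
STRONG_RATE  = 0.84   # stronger correction, higher cost (assumed value)

def select_code_rate(read_count, retry_count,
                     hot_read_threshold=1000, retry_threshold=1):
    if read_count >= hot_read_threshold and retry_count >= retry_threshold:
        return STRONG_RATE   # hot, retry-prone data: trade cost for read latency
    return DEFAULT_RATE

print(select_code_rate(read_count=5000, retry_count=3))  # 0.84
print(select_code_rate(read_count=10,   retry_count=0))  # 0.93
```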

Optimizing Data Layout for Racetrack Memory in Embedded Systems

  • Peng Hui
  • Edwin H.-M. Sha
  • Qingfeng Zhuge
  • Rui Xu
  • Han Wang

Racetrack memory (RTM), which consists of multiple domain block clusters (DBCs) and access ports, is a novel non-volatile memory and has potential as scratchpad memory (SPM) in embedded devices due to its high density and low access latency. However, excessive shift operations degrade the performance of RTM and make it unpredictable. In this paper, we propose three schemes to optimize the performance of RTM from different aspects, including intra-DBC, inter-DBC, and hybrid SPM with SRAM and RTM. First, a balanced group-based data placement method for the data layout inside one DBC is proposed to reduce shifts. Second, a grouping method for the data allocation among DBCs is proposed; it reduces shifts while using fewer DBCs by treating one DBC as multiple DBCs. Finally, we use SRAM to further reduce the cost, and a cost evaluation metric is proposed to assist the shrinking method that determines the data allocation for hybrid SPM with SRAM and RTM. Experiments show that the proposed schemes can significantly improve the performance of pure RTM and hybrid SPM while using fewer DBCs.

Exploring Architectural Implications to Boost Performance for in-NVM B+-Tree

  • Yanpeng Hu
  • Qisheng Jiang
  • Chundong Wang

Computer architecture keeps evolving to support byte-addressable non-volatile memory (NVM). Researchers have tailored the prevalent B+-tree to NVM, building a history of utilizing architectural support to gain both high performance and crash consistency. The latest architecture-level changes for NVM, e.g., the eADR, motivate us to further explore architectural implications in the design and implementation of in-NVM B+-trees. Our quantitative study finds that, with eADR, cache misses have an increasingly large impact on an in-NVM B+-tree's performance. We hence propose Conan, a conflict-aware node allocation scheme grounded in theoretical justifications. Conan decomposes the virtual addresses of B+-tree nodes with respect to a VIPT (virtually indexed, physically tagged) cache and intentionally places them into different cache sets. Experiments show that Conan evidently reduces cache conflicts and boosts the performance of a state-of-the-art in-NVM B+-tree.
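
The set-index arithmetic that such conflict-aware placement relies on can be illustrated as below. The cache parameters and addresses are assumed values for illustration, not taken from the paper or any specific processor.

```python
# A minimal sketch of the set-index arithmetic behind conflict-aware node allocation
# (cache parameters are assumed): for a VIPT cache, the set index comes from the
# address bits just above the line offset, so nodes whose addresses differ by a
# multiple of NUM_SETS * LINE_SIZE land in the same set and conflict.

LINE_SIZE = 64          # bytes per cache line (assumed)
NUM_SETS  = 1024        # number of cache sets (assumed)

def cache_set_index(virtual_address):
    return (virtual_address // LINE_SIZE) % NUM_SETS

a = 0x10000
b = a + NUM_SETS * LINE_SIZE
print(cache_set_index(a) == cache_set_index(b))            # True  -> conflicting placement
print(cache_set_index(a) == cache_set_index(a + LINE_SIZE))  # False -> different sets
```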

An Efficient near-Bank Processing Architecture for Personalized Recommendation System

  • Yuqing Yang
  • Weidong Yang
  • Qin Wang
  • Naifeng Jing
  • Jianfei Jiang
  • Zhigang Mao
  • Weiguang Sheng

Personalized recommendation systems consume the major resources in modern AI data centers. The memory-bound embedding layers with irregular memory access patterns have been identified as the bottleneck of recommendation systems. To overcome the memory challenges, near-memory processing (NMP) is an effective solution that provides high bandwidth. Recent work proposes an NMP approach to accelerate recommendation models by utilizing the through-silicon via (TSV) bandwidth in 3D-stacked DRAMs. However, the total bandwidth provided by TSVs is insufficient for a batch of embedding layers processed in parallel. In this paper, we propose a near-bank processing architecture to accelerate recommendation models. By integrating compute logic near the memory banks on the DRAM dies of a 3D-stacked DRAM, our architecture can exploit the enormous bank-level bandwidth, which is much higher than the TSV bandwidth. We also present a hardware/software interface for offloading embedding layers. Moreover, we propose an efficient mapping scheme to enhance the utilization of bank-level bandwidth. As a result, our architecture achieves up to 2.10X speedup and 31% energy saving for data movement over the state-of-the-art NMP solution for recommendation acceleration based on 3D-stacked memory.

SESSION: Technical Program: Cool and Efficient Approximation

PAALM: Power Density Aware Approximate Logarithmic Multiplier Design

  • Shuyuan Yu
  • Sheldon X.-D. Tan

Approximate hardware designs can lead to significant power or energy reduction. However, a recent study showed that approximate designs might suffer from unwanted higher temperature and related reliability issues due to the increased power density. In this work, we mitigate this important problem by proposing, for the first time, a novel power-density-aware approximate logarithmic multiplier (PAALM) design. The new multiplier design is based on the approximate logarithmic multiplier (ALM) framework due to its rigorous mathematical foundation. The idea is to redesign the computations with high switching activity in existing ALM designs using equivalent mathematical formulations, so that the power density can be reduced with no accuracy loss at the cost of some area overhead. Our results show that the proposed PAALM design can improve power density by 11.5%/5.7% and area by 31.6%/70.8% for 8/16-bit precision when compared with the fixed-point multiplier baseline, respectively. It also achieves an extremely low error bias of -0.17/0.08 for 8/16-bit precision, respectively. On top of this, we further implement the PAALM design in a Convolutional Neural Network (CNN) and test it on the CIFAR10 dataset. The results show that with error compensation, PAALM can achieve the same inference accuracy as the fixed-point multiplier baseline. We also evaluate PAALM in a discrete cosine transformation (DCT) application. The results show that with error compensation, PAALM improves image quality by 8.6dB on average when compared to the ALM design.
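
For readers unfamiliar with the ALM family, the sketch below shows the classic Mitchell-style logarithmic approximation that such multipliers build on: a product is approximated by adding approximate logarithms. This is the textbook approximation only, not the PAALM circuit or its power-density-aware redesign.

```python
# Classic Mitchell-style approximate logarithmic multiplication (software model of the
# approximation underlying ALM-family multipliers; not the PAALM design itself).

def approx_log2(x):
    """log2(x) ~= k + f for positive integer x, where x = 2**k * (1 + f), 0 <= f < 1."""
    k = x.bit_length() - 1
    f = (x - (1 << k)) / (1 << k)
    return k + f

def mitchell_multiply(a, b):
    """Approximate a*b by adding the approximate logarithms and 'anti-logging'."""
    s = approx_log2(a) + approx_log2(b)
    k = int(s)            # integer part selects the power of two
    f = s - k             # fractional part approximates the mantissa
    return (1 << k) * (1 + f)

exact = 200 * 123
approx = mitchell_multiply(200, 123)
print(exact, round(approx), f"error = {100 * (approx - exact) / exact:.2f}%")
```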

Approximate Floating-Point FFT Design with Wide Precision-Range and High Energy Efficiency

  • Chenyi Wen
  • Ying Wu
  • Xunzhao Yin
  • Cheng Zhuo

Fast Fourier Transform (FFT) is a key digital signal processing algorithm that is widely deployed in mobile and portable devices. Recently, with the popularity of human-perception-related tasks, it has been noted that full precision and exactness are not always necessary for FFT computation. We propose a top-down approximate floating-point FFT design methodology to fully exploit the error-tolerant nature of the FFT algorithm. An efficient error model of the configurable approximate multiplier is proposed to link the multiplier approximation to the FFT algorithm precision. Then an approximation optimization flow is formulated to maximize energy efficiency. Experimental results show that the proposed approximate FFT can achieve up to 52% Area-Delay-Product improvement and 23% energy saving when compared to the exact FFT. The proposed approximate FFT is also found to cover an almost 2X wider precision range with higher energy efficiency in comparison with the prior state-of-the-art approximate FFT.

RUCA: RUntime Configurable Approximate Circuits with Self-Correcting Capability

  • Jingxiao Ma
  • Sherief Reda

Approximate computing is an emerging computing paradigm that reduces power consumption by relaxing the requirement for full accuracy. Since accuracy requirements may vary across real-world applications, one trend in approximate computing is to design quality-configurable circuits, which are able to switch at runtime among accuracy modes with different power and delay. In this paper, we present RUCA, a novel framework that synthesizes runtime-configurable approximate circuits from arbitrary input circuits. By decomposing the truth table, our approach approximates and separates the input circuit into multiple configuration blocks that support different accuracy levels, including a corrector circuit that restores full accuracy. Power gating is used to activate different blocks, such that the approximate circuit is able to operate at different accuracy-power configurations. To improve the scalability of our algorithm, we also provide a design space exploration scheme with circuit partitioning. We evaluate our methodology on a comprehensive set of benchmarks. For 3-level designs, RUCA saves power consumption by 43.71% within 2% error and by 30.15% within 1% error on average.

Approximate Logic Synthesis by Genetic Algorithm with an Error Rate Guarantee

  • Chun-Ting Lee
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

Approximate computing is an emerging design technique for error-tolerant applications, which may improve circuit area, delay, or power consumption by trading off a circuit's correctness. In this paper, we propose a novel approximate logic synthesis approach based on a genetic algorithm, targeting depth minimization with an error rate guarantee. We conduct experiments on a set of IWLS 2005 and MCNC benchmarks. The experimental results demonstrate that the depth can be reduced by up to 50%, and by 22% on average, under a 5% error rate constraint. Compared with the state-of-the-art method, our approach achieves an average of 159% more depth reduction under the same 5% error rate constraint.

SESSION: Technical Program: Logic Synthesis for AQFP, Quantum Logic, AI Driven and Efficient Data Layout for HBM

Depth-Optimal Buffer and Splitter Insertion and Optimization in AQFP Circuits

  • Alessandro Tempia Calvino
  • Giovanni De Micheli

The Adiabatic Quantum-Flux Parametron (AQFP) is an energy-efficient superconducting logic family. AQFP technology requires buffer and splitting elements (B/S) to be inserted to satisfy path-balancing and fanout-branching constraints. B/S insertion policies and optimization strategies have recently been proposed to minimize the number of buffers and splitters needed in an AQFP circuit. In this work, we study B/S insertion and optimization methods. In particular, the paper proposes: i) an algorithm for B/S insertion that guarantees global depth optimality; ii) a new approach for B/S optimization based on minimum register retiming; iii) a B/S optimization flow based on (i), (ii), and existing work. We show that our approach reduces the number of B/S by up to 20% while guaranteeing optimal depth and providing a 55X speed-up in run time compared to the state-of-the-art.

Area-Driven FPGA Logic Synthesis Using Reinforcement Learning

  • Guanglei Zhou
  • Jason H. Anderson

Logic synthesis involves a rich set of optimization algorithms applied in a specific sequence to a circuit netlist prior to technology mapping. A conventional approach is to apply a fixed “recipe” of such algorithms deemed to work well for a wide range of different circuits. We apply reinforcement learning (RL) to determine a unique recipe of algorithms for each circuit. Feature-importance analysis is conducted using a random-forest classifier to prune the set of features visible to the RL agent. We demonstrate conclusive learning by the RL agent and show significant FPGA area reductions vs. the conventional approach (resyn2). In addition to circuit-by-circuit training and inference, we also train an RL agent on multiple circuits, and then apply the agent to optimize: 1) the same set of circuits on which it was trained, and 2) an alternative set of “unseen” circuits. In both scenarios, we observe that the RL agent produces higher-quality implementations than the conventional approach. This shows that the RL agent is able to generalize, and perform beneficial logic synthesis optimizations across a variety of circuits.
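
The idea of learning a per-circuit recipe can be illustrated with a toy bandit-style loop that greedily extends a pass sequence based on the area reduction observed so far. This is not the paper's RL agent or feature set; the pass names and the `evaluate_area` callback are placeholders the reader would supply.

```python
# A toy sketch of learning a per-circuit synthesis recipe (illustrative only, not the
# paper's agent): an epsilon-greedy bandit appends the pass whose estimated area
# reduction is highest. PASSES and evaluate_area() are placeholders.

import random
from collections import defaultdict

PASSES = ["rewrite", "refactor", "resub", "balance"]   # placeholder pass names

def learn_recipe(evaluate_area, recipe_len=10, epsilon=0.2):
    """evaluate_area(recipe) -> mapped area after applying the recipe (user-supplied)."""
    value = defaultdict(float)     # running estimate of each pass's area benefit
    counts = defaultdict(int)
    recipe = []
    area = evaluate_area(recipe)
    for _ in range(recipe_len):
        if random.random() < epsilon:
            step = random.choice(PASSES)                       # explore
        else:
            step = max(PASSES, key=lambda p: value[p])         # exploit
        new_area = evaluate_area(recipe + [step])
        reward = area - new_area                               # area reduction
        counts[step] += 1
        value[step] += (reward - value[step]) / counts[step]   # incremental mean
        recipe.append(step)
        area = new_area
    return recipe, area
```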

Optimization of Reversible Logic Networks with Gate Sharing

  • Yung-Chih Chen
  • Feng-Jie Chao

Logic synthesis for quantum computing aims to transform a Boolean logic network into a quantum circuit. A conventional two-stage flow first synthesizes the given Boolean logic network into a reversible logic network composed of reversible logic gates. Then, it maps each reversible logic gate into quantum gates to generate a quantum circuit. The state-of-the-art method for the first stage takes advantage of the lookup-table (LUT) mapping technology for FPGAs to decompose the given Boolean logic network into sub-networks, and then maps the sub-networks into reversible logic networks. Although every sub-network is well synthesized, we observe that the reversible logic networks could be further optimized by sharing the reversible logic gates belonging to different sub-networks. Thus, in this paper, we propose a new optimization method for reversible logic networks based on gate sharing. We translate the problem of extracting shareable gates into an exclusive-sum-of-products (ESOP) term optimization problem. The experimental results show that the proposed method successfully optimizes the reversible logic networks generated by the LUT-based method. It is able to reduce an average of approximately 4% of quantum gate cost without increasing the number of ancilla lines for a set of IWLS 2005 benchmarks.

Iris: Automatic Generation of Efficient Data Layouts for High Bandwidth Utilization

  • Stephanie Soldavini
  • Donatella Sciuto
  • Christian Pilato

Optimizing data movements is becoming one of the biggest challenges in heterogeneous computing to cope with data deluge and, consequently, big data applications. When creating specialized accelerators, modern high-level synthesis (HLS) tools are increasingly efficient in optimizing the computational aspects, but data transfers have not been adequately improved. To combat this, novel architectures such as High-Bandwidth Memory with wider data busses have been developed so that more data can be transferred in parallel. Designers must tailor their hardware/software interfaces to fully exploit the available bandwidth. HLS tools can automate this process, but the designer must follow strict coding-style rules. If the bus width is not evenly divisible by the data width (e.g., when using custom-precision data types) or if the arrays are not power-of-two length, the HLS-generated accelerator will likely not fully utilize the available bandwidth, demanding even more manual effort from the designer. We propose a methodology to automatically find and implement a data layout that, when streamed between memory and an accelerator, uses a higher percentage of the available bandwidth than a naive or HLS-optimized design. We borrow concepts from multiprocessor scheduling to achieve such high efficiency.
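
The bandwidth-utilization problem the abstract describes can be made concrete with a small calculation: if the element width does not divide the bus width, a layout that never splits an element across bus words wastes bandwidth, while a packed layout that lets elements straddle word boundaries does not. The bus and element widths below are illustrative numbers, not taken from the paper.

```python
# Illustrative arithmetic for the bandwidth-utilization issue described above
# (widths and counts are examples): a 512-bit bus carrying 96-bit custom elements.

BUS_BITS = 512
ELEM_BITS = 96
N_ELEMS = 1000

# Naive layout: only floor(512 / 96) = 5 elements per bus word, never split.
elems_per_word = BUS_BITS // ELEM_BITS
naive_words = -(-N_ELEMS // elems_per_word)                 # ceiling division
naive_util = N_ELEMS * ELEM_BITS / (naive_words * BUS_BITS)

# Packed layout: elements streamed back-to-back across word boundaries.
packed_words = -(-(N_ELEMS * ELEM_BITS) // BUS_BITS)
packed_util = N_ELEMS * ELEM_BITS / (packed_words * BUS_BITS)

print(f"naive:  {naive_words} words, {naive_util:.1%} bus utilization")
print(f"packed: {packed_words} words, {packed_util:.1%} bus utilization")
```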

SESSION: Technical Program: University Design Contest

ViraEye: An Energy-Efficient Stereo Vision Accelerator with Binary Neural Network in 55 nm CMOS

  • Yu Zhang
  • Gang Chen
  • Tao He
  • Qian Huang
  • Kai Huang

This paper presents the ViraEye chip, an energy-efficient stereo vision accelerator based on a binary neural network (BNN) to achieve high-quality and real-time stereo estimation. The stereo vision accelerator is designed as an end-to-end, fully pipelined architecture in which all processing procedures, including stereo rectification, BNNs, cost aggregation, and post-processing, are implemented on the ViraEye chip. ViraEye allows for top-level pipelining between the accelerator and image sensors, and no external CPUs or GPUs are required. The accelerator is implemented in SMIC 55nm CMOS technology and achieves the highest processing speed in terms of the million disparity estimations per second (MDE/s) metric among existing ASICs in the open literature.

A 1.2nJ/Classification Fully Synthesized All-Digital Asynchronous Wired-Logic Processor Using Quantized Non-Linear Function Blocks in 0.18μm CMOS

  • Rei Sumikawa
  • Kota Shiba
  • Atsutake Kosuge
  • Mototsugu Hamada
  • Tadahiro Kuroda

A 5.3 times smaller and 2.6 times more energy-efficient all-digital wired-logic processor has been developed, which infers MNIST with 90.6% accuracy at 1.2nJ per classification. To improve the area efficiency of the wired-logic architecture, we propose a nonlinear neural network (NNN), a neuron- and synapse-efficient network, together with a logical compression technique that implements it with area-saving, low-power digital circuits through logic synthesis; based on these, asynchronous combinational digital DNN hardware has been developed.

A Fully Synthesized 13.7μJ/Prediction 88% Accuracy CIFAR-10 Single-Chip Data-Reusing Wired-Logic Processor Using Non-Linear Neural Network

  • Yao-Chung Hsu
  • Atsutake Kosuge
  • Rei Sumikawa
  • Kota Shiba
  • Mototsugu Hamada
  • Tadahiro Kuroda

An FPGA-based wired-logic CNN processor is presented that can process CIFAR-10 at 13.7μJ/prediction with an 88% accuracy, which is 2,036 times more energy-efficient than the prior state-of-the-art FPGA-based processor. Energy efficiency is greatly improved by implementing all processing elements and wirings in parallel on a single FPGA chip to eliminate the memory access. By utilizing both (1) a non-linear neural network which saves on neurons and synapses and (2) a shift register-based wired-logic architecture, hardware resource usage is reduced by three orders of magnitude.

A Multimode Hybrid Memristor-CMOS Prototyping Platform Supporting Digital and Analog Projects

  • K.-E. Harabi
  • C. Turck
  • M. Drouhin
  • A. Renaudineau
  • T. Bersani-Veroni
  • D. Querlioz
  • T. Hirtzlin
  • E. Vianello
  • M Bocquet
  • J.-M. Portal

We present an integrated circuit fabricated in a process co-integrating CMOS and hafnium-oxide memristor technology, which provides a prototyping platform for projects involving memristors. Our circuit includes the periphery circuitry for using memristors within digital circuits, as well as an analog mode with direct access to memristors. The platform allows optimizing the conditions for reading and writing memristors, as well as developing and testing innovative memristor-based neuromorphic concepts.

A Fully Synchronous Digital LDO with Built-in Adaptive Frequency Modulation and Implicit Dead-Zone Control

  • Shun Yamaguchi
  • Mahfuzul Islam
  • Takashi Hisakado
  • Osami Wada

This paper proposes a synchronous digital LDO with adaptive clocking and dead-zone control without additional reference voltages. A test chip fabricated in a commercial 65 nm CMOS general-purpose (GP) process achieves 580x frequency modulation with 99.9% maximum efficiency at 0.6V supply.

Demonstration of Order Statistics Based Flash ADC in a 65nm Process

  • Mahfuzul Islam
  • Takehiro Kitamura
  • Takashi Hisakado
  • Osami Wada

This paper presents measurement results of a flash ADC that utilizes offset voltages as references. To operate the minimum number of comparators, we select the target comparators based on the rankings of the offset voltage. We present performance improvement by tuning offset voltage distribution using multiple comparator groups under the same power. A test chip in a commercial 65 nm GP process demonstrates the ADCs at 1 GS/s operation.

SESSION: Technical Program: Synthesis of Quantum Circuits and Systems

A SAT Encoding for Optimal Clifford Circuit Synthesis

  • Sarah Schneider
  • Lukas Burgholzer
  • Robert Wille

Executing quantum algorithms on a quantum computer requires compilation to representations that conform to all restrictions imposed by the device. Due to devices’ limited coherence times and gate fidelities, the compilation process has to be optimized as much as possible. To this end, an algorithm’s description first has to be synthesized using the device’s gate library. In this paper, we consider the optimal synthesis of Clifford circuits—an important subclass of quantum circuits with various applications. Such techniques are essential for establishing lower bounds for (heuristic) synthesis methods and gauging their performance. Due to the huge search space, existing optimal techniques are limited to a maximum of six qubits. The contribution of this work is twofold: First, we propose an optimal synthesis method for Clifford circuits based on encoding the task as a satisfiability (SAT) problem and solving it using a SAT solver in conjunction with a binary search scheme. The resulting tool is demonstrated to synthesize optimal circuits for up to 26 qubits—more than four times as many as the current state of the art. Second, we experimentally show that the overhead introduced by state-of-the-art heuristics exceeds the lower bound by 27% on average. The resulting tool is publicly available at https://github.com/cda-tum/qmap.
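
The coupling of a SAT feasibility check with binary search can be sketched as below. Here `is_satisfiable(depth)` stands in for "encode the question of whether a circuit of this depth realizes the target and call a SAT solver"; the encoding itself is the paper's contribution and is not shown.

```python
# Sketch of a binary-search wrapper around a SAT feasibility oracle, as used in
# optimal-depth synthesis flows. is_satisfiable(depth) is a placeholder callback.

def minimum_depth(is_satisfiable, upper_bound):
    """Return the smallest depth d in [0, upper_bound] with is_satisfiable(d) True,
    assuming feasibility is monotone in the depth."""
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi) // 2
        if is_satisfiable(mid):
            hi = mid          # a circuit of depth mid exists; try shallower
        else:
            lo = mid + 1      # infeasible; more depth is needed
    return lo

# Toy usage with a fake oracle that says depth >= 7 suffices.
print(minimum_depth(lambda d: d >= 7, upper_bound=32))  # 7
```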

An SMT-Solver-Based Synthesis of NNA-Compliant Quantum Circuits Consisting of CNOT, H and T Gates

  • Kyohei Seino
  • Shigeru Yamashita

It is natural to assume that we can perform quantum operations between only two adjacent physical qubits (quantum bits) to realize a quantum computer for both the current and possible future technologies. This restriction is called the Nearest Neighbor Architecture (NNA) restriction. This paper proposes an SMT-solver-based synthesis of quantum circuits consisting of CNOT, H, and T gates to satisfy the NNA restriction. Although the existing SMT-solver-based synthesis cannot treat H and T gates directly, our method treats the functionality of quantum-specific T and H gates carefully so that we can utilize an SMT-solver to minimize the number of CNOT gates; unlike the existing SMT-solver-based methods, our method considers “Don’t Care” conditions in intermediate points of a quantum circuit by exploiting the property of T gates to reduce CNOT gates. Experimental results show that our approach can reduce the number of CNOT gates by 58.11% on average compared to the naive application of the existing method which does not consider the “Don’t Care” condition.

Compilation of Entangling Gates for High-Dimensional Quantum Systems

  • Kevin Mato
  • Martin Ringbauer
  • Stefan Hillmich
  • Robert Wille

Most quantum computing architectures to date natively support multi-valued logic, albeit being typically operated in a binary fashion. Multi-valued, or qudit, quantum processors have access to much richer forms of quantum entanglement, which promise to significantly boost the performance and usefulness of quantum devices. However, much of the theory as well as the corresponding design methods required for exploiting such hardware remain insufficient, and generalizations from qubits are not straightforward. A particular challenge is the compilation of quantum circuits into sets of native qudit gates supported by state-of-the-art quantum hardware. In this work, we address this challenge by introducing a complete workflow for compiling any two-qudit unitary into an arbitrary native gate set. Case studies demonstrate the feasibility of both the proposed approach and the corresponding implementation (which is freely available at github.com/cda-tum/qudit-entanglement-compilation).

WIT-Greedy: Hardware System Design of Weighted ITerative Greedy Decoder for Surface Code

  • Wang Liao
  • Yasunari Suzuki
  • Teruo Tanimoto
  • Yosuke Ueno
  • Yuuki Tokunaga

Large error rates of quantum bits (qubits) are one of the main difficulties in the development of quantum computing. Performing quantum error correction (QEC) with surface codes is considered the most promising approach to reduce the error rates of qubits effectively. To perform error correction, we need an error-decoding unit, which estimates errors in the noisy physical qubits repetitively, to create a robust logical qubit. While complicated graph-matching problems must be solved within a strict time restriction for the error decoding, several hardware implementations that satisfy the restriction at a large code distance have been proposed.

However, the existing decoder designs still fall short in reducing the logical error rate. This is because they assume that the error rates of physical qubits are uniform, while in practice they have large variations. According to our numerical simulation based on the quantum chip with the largest qubit count, neglecting the non-uniform error properties of a real quantum chip in the decoding process induces significant degradation of the logical error rate and spoils the benefit of QEC. To take the non-uniformity into account, decoders need to solve matching problems on a weighted graph, but such problems are difficult to solve with the existing designs without exceeding the decoding time limit. Therefore, a decoder that can handle both non-uniform physical error rates and large surface codes is strongly demanded.

In this paper, we propose a hardware design of decoding units for the surface code that can handle non-identical error properties with small latency at a large code distance. The key ideas of our design are 1) constructing a look-up table for the shortest paths between nodes in a weighted graph and 2) enabling parallel processing during decoding. Implementation results on a field-programmable gate array (FPGA) indicate that our design scales up to code distance 11 within microsecond-level delay, which is comparable to existing state-of-the-art designs, while additionally handling non-identical errors.
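
Key idea (1) can be illustrated in software: precompute a table of shortest-path weights between decoding-graph nodes so that decoding only needs table lookups. The edge weights below are arbitrary examples; the actual design derives them from the measured, non-uniform error rates of the physical qubits and builds the table in hardware.

```python
# Software sketch of precomputing an all-pairs shortest-path lookup table for a
# weighted decoding graph (Floyd-Warshall); weights are illustrative only.

import math

def shortest_path_table(num_nodes, weighted_edges):
    """weighted_edges = [(u, v, w), ...] for an undirected graph with w >= 0."""
    dist = [[math.inf] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        dist[i][i] = 0.0
    for u, v, w in weighted_edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in range(num_nodes):
        for i in range(num_nodes):
            for j in range(num_nodes):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

table = shortest_path_table(4, [(0, 1, 1.0), (1, 2, 0.2), (0, 3, 2.5), (2, 3, 0.4)])
print(table[0][3])  # 1.6: the weighted path 0-1-2-3 beats the direct edge of weight 2.5
```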

Quantum Data Compression for Efficient Generation of Control Pulses

  • Daniel Volya
  • Prabhat Mishra

In order to physically realize a robust quantum gate, a specifically tailored laser pulse needs to be derived via strategies such as quantum optimal control. Unfortunately, such strategies face exponential complexity with quantum system size and become infeasible even for moderate-sized quantum circuits. In this paper, we propose an automated framework for effective utilization of these quantum resources. Specifically, this paper makes three important contributions. First, we utilize an effective combination of register compression and dimensionality reduction to reduce the area of a quantum circuit. Next, due to the properties of an autoencoder, the compressed gates produced are robust even in the presence of noise. Finally, our proposed compression reduces the computation time of quantum control. Experimental evaluation using popular quantum algorithms demonstrates that our proposed approach can enable efficient generation of noise-resilient control pulses while state-of-the-art fails to handle large-scale quantum systems.

SESSION: Technical Program: In-Memory/Near-Memory Computing for Neural Networks

Toward Energy-Efficient Sparse Matrix-Vector Multiplication with near STT-MRAM Computing Architecture

  • Yueting Li
  • He Zhang
  • Xueyan Wang
  • Hao Cai
  • Yundong Zhang
  • Shuqin Lv
  • Renguang Liu
  • Weisheng Zhao

Sparse Matrix-Vector Multiplication (SpMV) is one of the vital computational primitives used in modern workloads. SpMV is dominated by memory accesses, leading to unnecessary data transmission, massive data movement, and redundant multiply-accumulate operations. Therefore, we propose a near spin-transfer torque magnetic random access memory (STT-MRAM) processing architecture with three optimizations: (1) the NMP controller receives instructions over the AXI4 bus to implement the SpMV operation, identifies valid data, and encodes indices depending on the kernel size; (2) the NMP controller uses high-level-synthesis dataflow in the shared buffer to achieve higher throughput without consuming bus bandwidth; and (3) configurable MACs are implemented in the NMP core, entirely avoiding the matching step during multiplication. With these optimizations, the NMP architecture can access the pipelined STT-MRAM (with a read bandwidth of 26.7GB/s). The experimental simulation results show that this design achieves up to 66x and 28x speedup compared with state-of-the-art designs, and 69x speedup without sparse optimization.
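
For context, the reference computation being accelerated is standard sparse matrix-vector multiplication over a compressed representation; the sketch below uses the common CSR format only to make concrete what "identifying valid data and encoding indices" refers to, and is not the proposed hardware architecture.

```python
# Reference CSR sparse matrix-vector multiply (software baseline, not the NMP design).

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in compressed sparse row (CSR) form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]   # only nonzeros are touched
    return y

# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
values  = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 2, 0, 1]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```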

RIMAC: An Array-Level ADC/DAC-Free ReRAM-Based in-Memory DNN Processor with Analog Cache and Computation

  • Peiyu Chen
  • Meng Wu
  • Yufei Ma
  • Le Ye
  • Ru Huang

By directly computing in the analog domain, processing-in-memory (PIM) is emerging as a promising alternative to overcome the memory bottleneck of the traditional von Neumann architecture, especially for deep neural networks (DNNs). However, in most existing PIM accelerators, the data outside the PIM macros is stored and operated on as digital signals, which requires massive, expensive digital-to-analog (D/A) and analog-to-digital (A/D) converters. In this work, an array-level ADC/DAC-free ReRAM-based in-memory DNN processor named RIMAC is proposed, which accelerates various DNNs purely in the analog domain with analog cache and analog computation modules to eliminate the expensive D/A and A/D conversions. Our experimental results show that peak energy efficiency is improved by about 34.8×, 97.6×, 10.7×, and 14.0× compared to PRIME, ISAAC, Lattice, and 21’DAC for various DNNs on ImageNet, respectively.

Crossbar-Aligned & Integer-Only Neural Network Compression for Efficient in-Memory Acceleration

  • Shuo Huai
  • Di Liu
  • Xiangzhong Luo
  • Hui Chen
  • Weichen Liu
  • Ravi Subramaniam

Crossbar-based In-Memory Computing (IMC) accelerators preload the entire Deep Neural Network (DNN) into crossbars before inference. However, devices with limited crossbars cannot infer increasingly complex models. IMC pruning can reduce the usage of crossbars, but current methods need expensive extra hardware for data alignment. Meanwhile, quantization can represent the weights of DNNs by integers, but existing schemes employ non-integer scaling factors to ensure accuracy, requiring costly multipliers. In this paper, we first propose crossbar-aligned pruning to reduce the usage of crossbars without hardware overhead. Then, we introduce a quantization scheme that avoids multipliers in IMC devices. Finally, we design a learning method that integrates the above two schemes and cultivates an optimal compact DNN with high accuracy and large sparsity during training. Experiments demonstrate that our framework, compared to state-of-the-art methods, achieves larger sparsity and lower power consumption with higher accuracy. We even improve the accuracy by 0.43% for VGG-16 with an 88.25% sparsity rate on the Cifar-10 dataset. Compared to the original model, we reduce computing power and area by 19.8x and 18.8x, respectively.

Discovering the in-Memory Kernels of 3D Dot-Product Engines

  • Muhammad Rashedul Haq Rashed
  • Sumit Kumar Jha
  • Rickard Ewetz

The capability of resistive random access memory (ReRAM) to implement multiply-and-accumulate operations promises unprecedented efficiency in the design of scientific computing applications. While the use of two-dimensional (2D) ReRAM crossbars has been well investigated in the last few years, the design of in-memory dot-product engines using three-dimensional (3D) ReRAM crossbars remains a topic of active investigation. In this paper, we holistically explore how to leverage 3D ReRAM crossbars with several (2 to 7) stacked crossbar layers. In contrast, previous studies have focused on 3D ReRAM with at most 2 stacked crossbar layers. We first discover the in-memory compute kernels that can be realized using 3D ReRAM with multiple stacked crossbar layers. We discover that matrices with different sparsity patterns can be realized by appropriately assigning the inputs and outputs to the perpendicular metal wires within the 3D stack. We present a design automation tool to map sparse matrices within scientific computing applications to the discovered 3D kernels. The proposed framework is evaluated using 20 applications from the SuiteSparse Matrix Collection. Compared with 2D crossbars, the proposed approach using 3D crossbars improves area, energy, and latency by 2.02X, 2.37X, and 2.45X, respectively.

RVComp: Analog Variation Compensation for RRAM-Based in-Memory Computing

  • Jingyu He
  • Yucong Huang
  • Miguel Lastras
  • Terry Tao Ye
  • Chi-Ying Tsui
  • Kwang-Ting Cheng

Resistive Random Access Memory (RRAM) has shown great potential in accelerating memory-intensive computation in neural network applications. However, RRAM-based computing suffers from significant accuracy degradation due to the inevitable device variations. In this paper, we propose RVComp, a fine-grained analog Compensation approach to mitigate the accuracy loss of in-memory computing incurred by the Variations of the RRAM devices. Specifically, weights in the RRAM crossbar are accompanied by dedicated compensation RRAM cells to offset their programming errors with a scaling factor. A programming target shifting mechanism is further designed with the objectives of reducing the hardware overhead and minimizing the compensation errors under large device variations. Based on these two key concepts, we propose double and dynamic compensation schemes and the corresponding support architecture. Since the RRAM cells only account for a small fraction of the overall area of the computing macro due to the dominance of the peripheral circuitry, the overall area overhead of RVComp is low and manageable. Simulation results show RVComp achieves a negligible 1.80% inference accuracy drop for ResNet18 on the CIFAR-10 dataset under 30% device variation with only 7.12% area and 5.02% power overhead and no extra latency.

SESSION: Technical Program: Machine Learning-Based Design Automation

Rethink before Releasing Your Model: ML Model Extraction Attack in EDA

  • Chen-Chia Chang
  • Jingyu Pan
  • Zhiyao Xie
  • Jiang Hu
  • Yiran Chen

Machine learning (ML)-based techniques for electronic design automation (EDA) have boosted the performance of modern integrated circuits (ICs). Such achievements make ML models valuable assets for the EDA industry. In addition, ML models for EDA are widely considered to have high development costs because of the time-consuming and complicated training data generation process. Thus, confidentiality protection for EDA models is a critical issue. However, an adversary could apply model extraction attacks to steal a model, in the sense of achieving performance comparable to the victim’s model. As model extraction attacks have posed great threats to other application domains, e.g., computer vision and natural language processing, in this paper we study model extraction attacks on EDA models under two real-world scenarios. This is the first work that (1) introduces model extraction attacks on EDA models and (2) proposes two attack methods against unlimited and limited query budget scenarios. Our results show that our approach can achieve performance competitive with the well-trained victim model without any performance degradation. Based on these results, we demonstrate that model extraction attacks truly threaten EDA model privacy and hope to raise concerns about ML security issues in EDA.

MacroRank: Ranking Macro Placement Solutions Leveraging Translation Equivariancy

  • Yifan Chen
  • Jing Mai
  • Xiaohan Gao
  • Muhan Zhang
  • Yibo Lin

Modern large-scale designs make extensive use of heterogeneous macros, which can significantly affect routability. Predicting the final routing quality in the early macro placement stage can filter out poor solutions and speed up design closure. By observing that routing is correlated with the relative positions between instances, we propose MacroRank, a macro placement ranking framework leveraging translation equivariance and a Learning to Rank technique. The framework is able to learn the relative order of macro placement solutions and rank them based on routing quality metrics like wirelength, number of vias, and number of shorts. The experimental results show that compared with the most recent baseline, our framework can improve the Kendall rank correlation coefficient by 49.5% and the average performance of top-30 prediction by 8.1%, 2.3%, and 10.6% on wirelength, vias, and shorts, respectively.
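
The Kendall rank correlation coefficient used above as the evaluation metric compares how often the predicted ranking of placement solutions agrees with the ranking by actual routing quality, pair by pair. The sketch below computes a simple tie-free version of it on made-up toy scores; it is provided only to make the metric concrete.

```python
# Simple Kendall rank correlation (ties ignored for brevity): counts concordant vs.
# discordant pairs between predicted scores and ground-truth quality.

from itertools import combinations

def kendall_tau(pred_scores, true_scores):
    concordant = discordant = 0
    for i, j in combinations(range(len(pred_scores)), 2):
        s = (pred_scores[i] - pred_scores[j]) * (true_scores[i] - true_scores[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(pred_scores) * (len(pred_scores) - 1) // 2
    return (concordant - discordant) / n_pairs

predicted_quality = [0.9, 0.4, 0.7, 0.1]   # toy model scores for four macro placements
actual_wirelength = [1.0, 3.0, 2.0, 4.0]   # lower is better, so negate before comparing
print(kendall_tau(predicted_quality, [-w for w in actual_wirelength]))  # 1.0 (perfect agreement)
```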

BufFormer: A Generative ML Framework for Scalable Buffering

  • Rongjian Liang
  • Siddhartha Nath
  • Anand Rajaram
  • Jiang Hu
  • Haoxing Ren

Buffering is a prevalent interconnect optimization technique to help timing closure and is often performed after placement. A common buffering approach is to construct a Steiner tree and then insert buffers on the tree based on a Ginneken-Lillis-style algorithm. Such an approach is difficult to scale to large nets. Our work attempts to solve this problem with a generative machine-learning (ML) approach that requires no Steiner tree construction. Our approach can extract and reuse knowledge from high-quality samples and therefore has significantly improved scalability. A generative ML framework, BufFormer, is proposed to construct the abstract tree topology while simultaneously determining buffer sizes and locations. A baseline method, FLUTE-based Steiner tree construction followed by Ginneken-Lillis-style buffer insertion, is implemented to generate training samples. After training, BufFormer can produce solutions for unseen nets that are highly comparable to the baseline results, with a correlation coefficient of 0.977 in terms of buffer area and 0.934 for driver-sink delays. On average, BufFormer-generated trees achieve similar delays with slightly larger buffer area. Up to 160X speedup can be achieved for large nets when running on a GPU over the baseline on a single CPU thread.

Decoupling Capacitor Insertion Minimizing IR-Drop Violations and Routing DRVs

  • Daijoon Hyun
  • Younggwang Jung
  • Insu Cho
  • Youngsoo Shin

Decoupling capacitor (decap) cells are inserted near functional cells with high switching activity so that their IR-drop can be suppressed. Decap cell design has become more complex and uses higher metal layers, so the cells start to manifest themselves as routing blockages. Post-placement decap insertion, with the goal of minimizing both IR-drop violations and routing design rule violations (DRVs), is addressed for the first time. A U-Net with a graph convolutional network is introduced to predict the routing DRV penalty. The decap insertion problem is formulated and a heuristic algorithm is presented. Experiments with a few test circuits demonstrate that DRVs are reduced by 16% on average with no IR-drop violations, compared to a conventional method that does not explicitly consider DRVs. This results in a 48% reduction in routing runtime and a 23% improvement in total negative slack.

DPRoute: Deep Learning Framework for Package Routing

  • Yeu-Haw Yeh
  • Simon Yi-Hung Chen
  • Hung-Ming Chen
  • Deng-Yao Tu
  • Guan-Qi Fang
  • Yun-Chih Kuo
  • Po-Yang Chen

For routing closure in package designs, net order is critical due to complex design rules and severe wire congestion. However, existing solutions are deliberately designed using heuristics and are difficult to adapt to different design requirements unless the algorithm is updated. This work presents a novel deep learning-based routing framework that can keep improving by accumulating data to accommodate increasingly complex design requirements. Based on the initial routing results, we apply deep learning to concurrent detailed routing to deal with the problem of net ordering decisions. We use multi-agent deep reinforcement learning to learn routing schedules between nets. We regard each net as an agent, which needs to consider the actions of other agents while making pathing decisions to avoid routing conflicts. Experimental results on an industrial package design show that the proposed framework can reduce the number of design rule violations by 99.5% and the wirelength by 2.9% relative to the initial routing.

SESSION: Technical Program: Advanced Techniques for Yields, Low Power and Reliability

High-Dimensional Yield Estimation Using Shrinkage Deep Features and Maximization of Integral Entropy Reduction

  • Shuo Yin
  • Guohao Dai
  • Wei W. Xing

Despite the fast advances in high-sigma yield analysis with the help of machine learning techniques in the past decade, one of the main challenges, the curse of dimensionality, which is inevitable when dealing with modern large-scale circuits, remains unsolved. To resolve this challenge, we propose absolute shrinkage deep kernel learning (ASDK), which automatically identifies the dominant process variation parameters in a nonlinear-correlated deep kernel and acts as a surrogate model to emulate the expensive SPICE simulation. To further improve yield estimation efficiency, we propose a novel maximization of approximated entropy reduction for efficient model updates, which is also enhanced with parallel batch sampling for parallel computing, making it ready for practical deployment. Experiments on SRAM column circuits demonstrate the superiority of ASDK over state-of-the-art (SOTA) approaches in terms of accuracy and efficiency, with up to 11.1x speedup over SOTA methods.

MIA-Aware Detailed Placement and VT Reassignment for Leakage Power Optimization

  • Hung-Chun Lin
  • Shao-Yun Fang

As the feature size decreases, leakage power consumption becomes an important optimization target in design. Using multiple threshold voltages (VTs) in cell-based designs is a popular technique to simultaneously optimize circuit timing and minimize leakage power. However, an arbitrary cell placement result of a multi-VT design may suffer from many design rule violations induced by the Minimum-Implant-Area (MIA) rule, and thus it is necessary to take MIA rules into consideration during the detailed placement stage. The state-of-the-art works on detailed placement that comprehensively tackle MIA rules either disallow VT changes or only allow reducing cell VTs to avoid timing degradation. However, these limitations may either result in larger cell displacement or cause leakage power overhead. In this paper, we propose an optimization framework for VT reassignment and detailed placement to simultaneously consider MIA rules and leakage power minimization under timing constraints. Experimental results show that compared with the state-of-the-art works, the proposed framework can efficiently achieve a better trade-off between leakage power and cell displacement.

SLOGAN: SDC Probability Estimation Using Structured Graph Attention Network

  • Junchi Ma
  • Sulei Huang
  • Zongtao Duan
  • Lei Tang
  • Luyang Wang

The trend of progressive technology scaling makes computing systems more susceptible to soft errors. The most critical issue that soft errors incur is silent data corruption (SDC), since SDC occurs silently without any warning to users. Estimating the SDC probability of a program is the first and essential step toward designing protection mechanisms. Prior work suffers from prediction inaccuracy since the proposed heuristic-based models fail to describe the semantics of fault propagation. We propose a novel approach, SLOGAN, which casts the prediction of SDC probability as a graph regression task. A program is represented in the form of a dynamic dependence graph. To capture the rich semantics of fault propagation, we apply a structured graph attention network, which includes node-level, graph-level, and layer-level self-attention. With the learned attention coefficients from node-level, graph-level, and layer-level self-attention, the importance of edges, nodes, and layers to fault propagation can be fully considered. We generate the graph embedding by weighted aggregation of the node embeddings and compute the SDC probability with the regression model. Experiments show that SLOGAN achieves higher SDC prediction accuracy than state-of-the-art methods with a low time cost.

SESSION: Technical Program: Microarchitectural Design and Neural Networks

Microarchitecture Power Modeling via Artificial Neural Network and Transfer Learning

  • Jianwang Zhai
  • Yici Cai
  • Bei Yu

Accurate and robust power models are highly demanded to explore better CPU designs. However, previous learning-based power models ignore the discrepancies in data distribution among different CPU designs, making it difficult to use data from historical configurations to aid modeling for a new target configuration. In this paper, we investigate the transferability of power models and propose a microarchitecture power modeling method based on transfer learning (TL). A novel TL method for artificial neural network (ANN)-based power models is proposed, where cross-domain mixup generates auxiliary samples close to the target configuration to fill in the distribution discrepancy, and domain-adversarial training extracts domain-invariant features to complete the target model construction. Experiments show that our method greatly improves model transferability and can effectively utilize knowledge of existing CPU configurations to facilitate target power model construction.
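
One simple way such cross-domain interpolation can be realized is sketched below: an auxiliary training sample is formed by mixing a source-configuration sample with a target-configuration sample. The feature vectors, labels, and Beta parameter are placeholders for illustration, not the paper's actual training setup.

```python
# Illustrative sketch of cross-domain mixup between a source-CPU-configuration sample
# and a target-configuration sample (placeholder data, not the paper's scheme).

import random

def cross_domain_mixup(x_src, y_src, x_tgt, y_tgt, alpha=0.4):
    """Return one auxiliary (features, label) pair lying between the two domains."""
    lam = random.betavariate(alpha, alpha)
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_src, x_tgt)]
    y_mix = lam * y_src + (1 - lam) * y_tgt
    return x_mix, y_mix

# Toy usage: microarchitecture event counts from the source config mixed with the
# corresponding counts from the target config, together with their measured power.
x_aux, y_aux = cross_domain_mixup([120.0, 3.5, 0.8], 1.9, [90.0, 2.1, 1.4], 1.2)
print(x_aux, y_aux)
```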

MUGNoC: A Software-Configured Multicast-Unicast-Gather NoC for Accelerating CNN Dataflows

  • Hui Chen
  • Di Liu
  • Shiqing Li
  • Shuo Huai
  • Xiangzhong Luo
  • Weichen Liu

Current communication infrastructures for convolutional neural networks (CNNs) focus only on specific transmission patterns. To reduce data movement, various CNN dataflows have been proposed. Across these dataflows, parameters and results are delivered using different traffic patterns, i.e., multicast, unicast, and gather, which prevents dataflow-specific communication backbones from benefiting the entire system if the dataflow changes or different dataflows run in the same system. Thus, in this paper, we propose MUG-NoC to support and accelerate these typical traffic patterns, thereby boosting multiple dataflows. Specifically, (i) we support, for the first time, multicast in a 2D-mesh software-configurable NoC by revising the router configuration and proposing efficient multicast routing; (ii) we decrease unicast latency by transmitting data through different routes in parallel; and (iii) we reduce output-gather overheads by pipelining basic dataflow units. Experiments show that our proposed design can reduce total data transmission time by at least 39.2% compared with the state-of-the-art CNN communication backbone.

COLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU

  • Bo-Wun Cheng
  • En-Ming Huang
  • Chen-Hao Chao
  • Wei-Fang Sun
  • Tsung-Tai Yeh
  • Chun-Yi Lee

In this work, we aim to capture replicated cache requests between Stream Multiprocessors (SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of modern GPUs. To achieve this objective, we incorporate a per-cluster Cache line Ownership Lookup tABle (COLAB) that keeps track of which SM within a cluster holds a copy of a specific cache line. With the assistance of COLAB, SMs can collaboratively and efficiently process replicated cache requests within SM clusters by redirecting them according to the ownership information stored in COLAB. By servicing replicated cache requests within SM clusters that would otherwise consume precious NoC bandwidth, the heavy pressure on the NoC interconnection can be eased. Our experimental results demonstrate that the adoption of COLAB can indeed alleviate the excessive NoC pressure caused by replicated cache requests, and improve the overall system throughput of the baseline GPU while incurring minimal overhead. On average, COLAB can reduce 38% of the NoC traffic and improve instructions per cycle (IPC) by 43%.

SESSION: Technical Program: Novel Techniques for Scheduling and Memory Optimizations in Embedded Software

Mixed-Criticality with Integer Multiple WCETs and Dropping Relations: New Scheduling Challenges

  • Federico Reghenzani
  • William Fornaciari

Scheduling Mixed-Criticality (MC) workloads is a challenging problem in real-time computing. Earliest Deadline First Virtual Deadline (EDF-VD) is one of the most famous scheduling algorithms, with optimal speedup-bound properties. However, when EDF-VD is used to schedule task sets using a model with additional or relaxed constraints, its scheduling properties change. Inspired by an application of MC to the scheduling of fault-tolerant tasks, in this article we propose two models for multiple criticality levels: the first is a specialization of the MC model, and the second is a generalization of it. We then show, via formal proofs and numerical simulations, that the former considerably improves the speedup bound of EDF-VD. Finally, we provide the proofs related to the optimality of the two models, identifying the need for new scheduling algorithms.

An Exact Schedulability Analysis for Global Fixed-Priority Scheduling of the AER Task Model

  • Thilanka Thilakasiri
  • Matthias Becker

Commercial off-the-shelf (COTS) multi-core platforms offer high performance and a large amount of processing resources. Increased contention when accessing shared resources is a consequence of this high parallelism and one of the main challenges when real-time applications are deployed to these platforms. As a result, several execution models have been proposed to avoid contention by separating access to shared resources from execution.

In this work, we consider the Acquisition-Execution-Restitution (AER) model where contention to shared resources is avoided by design. We propose an exact schedulability test for the AER model under global fixed-priority scheduling using timed automata where we describe the schedulability problem as a reachability problem. To the best of our knowledge, this is the first exact schedulability test for the AER model under global fixed-priority scheduling on multiprocessor platforms. The performance of the proposed approach is evaluated using synthetic experiments and provides up to 65% more schedulable task sets than the state-of-the-art.

Skyrmion Vault: Maximizing Skyrmion Lifespan for Enabling Low-Power Skyrmion Racetrack Memory

  • Syue-Wei Lu
  • Shuo-Han Chen
  • Yu-Pei Liang
  • Yuan-Hao Chang
  • Kang Wang
  • Tseng-Yi Chen
  • Wei-Kuan Shih

Skyrmion racetrack memory (SK-RM) has demonstrated great potential as a high-density and low-cost nonvolatile memory. Nevertheless, even though random data accesses are supported on SK-RM, data accesses cannot be carried out on individual data bits directly. Instead, special skyrmion manipulations, such as injecting and shifting, are required to support random information update and deletion. With such special manipulations, the latency and energy consumption of skyrmion manipulations can quickly accumulate and induce additional overhead on the data read/write path of SK-RM. Meanwhile, the injection operation consumes more energy and has higher latency than any other manipulation. Although prior arts have tried to alleviate the overhead of skyrmion manipulations, the possibility of minimizing injections by buffering skyrmions for future reuse and energy conservation has received much less attention. This observation motivates us to propose the concept of a skyrmion vault, which effectively utilizes the skyrmion buffer track structure for energy conservation by maximizing the lifespan of injected skyrmions and minimizing the number of skyrmion injections. Experimental results show promising improvements in both energy consumption and skyrmion lifespan.

SESSION: Technical Program: Efficient Circuit Simulation and Synthesis for Analog Designs

Parallel Incomplete LU Factorization Based Iterative Solver for Fixed-Structure Linear Equations in Circuit Simulation

  • Lingjie Li
  • Zhiqiang Liu
  • Kan Liu
  • Shan Shen
  • Wenjian Yu

A series of fixed-structure sparse linear equations are solved in a circuit simulation process. We propose a parallel incomplete LU (ILU) preconditioned GMRES solver for those equations. A new subtree-based scheduling algorithm for ILU factorization and forward/backward substitution is adopted to overcome the load-balancing and data-locality problems of conventional levelization-based scheduling. Experimental results show that the proposed scheduling algorithm achieves up to 2.6X speedup for ILU factorization and 3.1X speedup for forward/backward substitution compared to levelization-based scheduling. The proposed ILU-GMRES solver achieves around 4X parallel speedup with 8 threads, which is up to 2.1X faster than the solver based on the levelization scheme. The proposed parallel solver also shows a remarkable advantage over existing methods (including HSPICE) on transient simulation of linear and nonlinear circuits.
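The ILU-preconditioned GMRES combination at the core of such a solver can be sketched with SciPy's sequential building blocks; the paper's contribution, the parallel subtree-based scheduling, is not reproduced here, and the random test matrix below is only a stand-in for a circuit matrix.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000                                               # stand-in for one fixed-structure system
A = (sp.random(n, n, density=0.01, random_state=0) + 10.0 * sp.identity(n)).tocsc()
b = np.random.default_rng(0).standard_normal(n)

ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)     # incomplete LU factorization
M = spla.LinearOperator(A.shape, ilu.solve)            # use the ILU factors as a preconditioner

x, info = spla.gmres(A, b, M=M, maxiter=500)           # preconditioned GMRES solve
print("converged" if info == 0 else f"gmres info = {info}")
```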

Accelerated Capacitance Simulation of 3-D Structures with Considerable Amounts of General Floating Metals

  • Jiechen Huang
  • Wenjian Yu
  • Mingye Song
  • Ming Yang

Floating metals are special conductors introduced into conductor structures by design for manufacturing (DFM). They make accurate capacitance simulation difficult. In this work, we aim to accelerate the floating random walk (FRW)-based capacitance simulation for structures with considerable amounts of general floating metals. We first discuss how the existing modified FRW is affected by the integral surfaces of floating metals and propose an improved placement of the integral surface. Then, we propose a hybrid approach called incomplete network reduction to avoid random transitions being trapped by floating metals. Experiments on structures from IC and FPD design, which involve multiple floating metals and single or multiple master conductors, have shown the effectiveness of the proposed techniques. The proposed techniques reduce the computational time of capacitance calculation while preserving the accuracy.

On Automating Finger-Cap Array Synthesis with Optimal Parasitic Matching for Custom SAR ADC

  • Cheng-Yu Chiang
  • Chia-Lin Hu
  • Mark Po-Hung Lin
  • Yu-Szu Chung
  • Shyh-Jye Jou
  • Jieh-Tsorng Wu
  • Shiuh-hua Wood Chiang
  • Chien-Nan Jimmy Liu
  • Hung-Ming Chen

Due to its excellent power efficiency, the successive-approximation-register (SAR) analog-to-digital converter (ADC) is an attractive choice for low-power ADC implementations. In analog layout design, the parasitics induced by interconnecting wires and elements affect the accuracy and performance of the device. Due to low-power and high-speed requirements, arrays of very small lateral metal-metal capacitor units are usually adopted as the capacitor-array architecture. Beyond power consumption and area, the parasitic capacitance significantly affects the matching properties and settling time of the capacitors. This work presents a framework to synthesize good-quality binary-weighted capacitors for custom SAR ADCs. It also proposes a parasitic-aware, ILP-based, weight-dynamic network routing algorithm to generate a layout that considers parasitic capacitance and capacitance-ratio mismatch simultaneously. The experimental results show that the effective number of bits (ENOB) of the layout generated by our approach is comparable to or better than that of manual designs and other automated works, closing the gap between pre-simulation and post-simulation results.

SESSION: Technical Program: Security of Heterogeneous Systems Containing FPGAs

FPGANeedle: Precise Remote Fault Attacks from FPGA to CPU

  • Mathieu Gross
  • Jonas Krautter
  • Dennis Gnad
  • Michael Gruber
  • Georg Sigl
  • Mehdi Tahoori

FPGAs used as general-purpose accelerators can greatly improve system efficiency and performance in cloud and edge devices alike. However, they have recently become the focus of remote attacks, such as fault and side-channel attacks mounted from one user of a part of the FPGA fabric against another. In this work, we consider system-on-chip platforms where an FPGA and an embedded processor core are located on the same die. We show that the embedded processor core is vulnerable to voltage drops generated by the FPGA logic. Our experiments demonstrate the possibility of compromising the data transfer from external DDR memory to the processor cache hierarchy. Furthermore, we were also able to fault and skip instructions executed on an ARM Cortex-A9 core. The FPGA-based fault injection is shown to be precise enough to recover the secret key of an AES T-tables implementation found in the mbedTLS library.

FPGA Based Countermeasures against Side Channel Attacks on Block Ciphers

  • Darshana Jayasinghe
  • Brian Udugama
  • Sri Parameswaran

Field Programmable Gate Arrays (FPGAs) are increasingly ubiquitous. FPGAs enable hardware acceleration and reconfigurability. Any security breach or attack on critical computations occurring on an FPGA can lead to devastating consequences. Side-channel attacks have the ability to reveal secret information, such as secret keys, from cryptographic circuits running on FPGAs. Power analysis (PA), electromagnetic (EM) radiation, fault injection (FI), and remote power analysis (RPA) attacks are the most compelling and noninvasive side-channel attacks demonstrated on FPGAs. This paper discusses two PA attack countermeasures (QuadSeal and RFTC) and one RPA attack countermeasure (UCloD) in detail to protect FPGAs.

SESSION: Technical Program: Novel Application & Architecture-Specific Quantization Techniques

Block-Wise Dynamic-Precision Neural Network Training Acceleration via Online Quantization Sensitivity Analytics

  • Ruoyang Liu
  • Chenhan Wei
  • Yixiong Yang
  • Wenxun Wang
  • Huazhong Yang
  • Yongpan Liu

Data quantization is an effective method to accelerate neural network training and reduce power consumption. However, it is challenging to perform low-bit quantized training: conventional equal-precision quantization leads to either high accuracy loss or limited bit-width reduction, while existing mixed-precision methods offer high compression potential but fail to perform accurate and efficient bit-width assignment. In this work, we propose DYNASTY, a block-wise dynamic-precision neural network training framework. DYNASTY provides accurate data sensitivity information through fast online analytics and maintains stable training convergence with an adaptive bit-width map generator. Network training experiments on the CIFAR-100 and ImageNet datasets are carried out, and compared to an 8-bit quantization baseline, DYNASTY brings up to 5.1× speedup and 4.7× energy consumption reduction with no accuracy drop and negligible hardware overhead.
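The flavor of block-wise bit-width assignment can be illustrated with a greedy toy routine that gives more bits to blocks with higher (externally supplied) quantization sensitivity under an average-bit-width budget; DYNASTY's online sensitivity analytics and adaptive bit-width map generator are not reproduced here, and the numbers are placeholders.

```python
import numpy as np

def assign_bitwidths(sensitivity, budget_avg_bits, choices=(2, 4, 8)):
    """Greedy toy assignment: the most sensitive blocks get the widest bit-width
    that still keeps the average bit-width within budget."""
    bits = np.full(len(sensitivity), min(choices))
    for idx in np.argsort(-np.asarray(sensitivity)):    # most sensitive blocks first
        for b in sorted(choices):
            trial = bits.copy()
            trial[idx] = b
            if trial.mean() <= budget_avg_bits:
                bits[idx] = b                           # keep the widest width that fits
    return bits

print(assign_bitwidths([0.9, 0.1, 0.5, 0.05], budget_avg_bits=4.0))   # e.g. [8 2 4 2]
```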

Quantization through Search: A Novel Scheme to Quantize Convolutional Neural Networks in Finite Weight Space

  • Qing Lu
  • Weiwen Jiang
  • Xiaowei Xu
  • Jingtong Hu
  • Yiyu Shi

Quantization has become an essential technique in compressing deep neural networks for deployment onto resource-constrained hardware. It is noticed that the hardware efficiency of implementing quantized networks is highly coupled with the actual values quantized into; therefore, with given bit widths, we can smartly choose a value space to further boost hardware efficiency. For example, using weights that are only integer powers of two, multiplication can be fulfilled by bit operations. Under such circumstances, however, existing quantization-aware training methods are either not suitable to apply or unable to unleash the expressiveness of very low bit-widths. For the best hardware efficiency, we revisit the quantization of convolutional neural networks and propose to address the training process from a weight-searching angle, as opposed to optimizing the quantizer functions as in existing works. Extensive experiments on CIFAR10 and ImageNet classification tasks are conducted with implementations on well-established CNN architectures such as ResNet, VGG, and MobileNet. The proposed method is shown to achieve lower accuracy loss than the state of the art and/or improve implementation efficiency by using hardware-friendly weight values.
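As a small illustration of quantizing into a finite, hardware-friendly value space (here signed powers of two plus zero, so multiplications reduce to shifts), the sketch below simply projects weights onto the nearest value in that space; it is not the paper's search-based training, and the exponent range is an arbitrary assumption.

```python
import numpy as np

def power_of_two_space(min_exp=-3, max_exp=0):
    """Finite weight space {0} U {+/-2^e}: multiplications become shifts (illustrative)."""
    exps = np.arange(min_exp, max_exp + 1)
    return np.sort(np.concatenate(([0.0], 2.0 ** exps, -(2.0 ** exps))))

def project_to_space(w, space):
    """Map every weight to its nearest value in the finite space."""
    idx = np.abs(w[..., None] - space).argmin(axis=-1)
    return space[idx]

w = np.random.default_rng(0).normal(scale=0.3, size=(4, 4))
print(project_to_space(w, power_of_two_space()))   # every entry is 0 or a signed power of two
```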

Multi-Wavelength Parallel Training and Quantization-Aware Tuning for WDM-Based Optical Convolutional Neural Networks Considering Wavelength-Relative Deviations

  • Ying Zhu
  • Min Liu
  • Lu Xu
  • Lei Wang
  • Xi Xiao
  • Shaohua Yu

Wavelength Division Multiplexing (WDM)-based Mach-Zehnder Interferometer Optical Convolutional Neural Networks (MZI-OCNNs) have emerged as a promising platform to accelerate convolutions, which consume most of the computing resources in neural networks. However, the wavelength-relative imperfect split ratios and actual phase shifts in MZIs, together with quantization errors from the electronic configuration module, degrade the inference accuracy of WDM-based MZI-OCNNs and thus render them unusable in practice. In this paper, we propose a framework that models the split ratios and phase shifts under different wavelengths, incorporates them into OCNN training, and introduces quantization-aware tuning to maintain inference accuracy and reduce electronic module complexity. Consequently, the framework improves the inference accuracy by 49%, 76%, and 76% for LeNet5, VGG7, and VGG8, respectively, implemented with multi-wavelength parallel computing. Instead of 32/64-bit floating-point resolutions, only 5, 6, and 4 bits are needed, and fewer quantization levels are utilized for the configuration signals.

Semantic Guided Fine-Grained Point Cloud Quantization Framework for 3D Object Detection

  • Xiaoyu Feng
  • Chen Tang
  • Zongkai Zhang
  • Wenyu Sun
  • Yongpan Liu

Unlike grid-structured RGB images, network compression, i.e., pruning and quantization, for the irregular and sparse 3D point cloud faces more challenges. Traditional quantization ignores the unbalanced semantic distribution in 3D point clouds. In this work, we propose a semantic-guided adaptive quantization framework for 3D point clouds. Different from traditional quantization methods that adopt a static and uniform quantization scheme, our proposed framework can adaptively locate the semantic-rich foreground points in the feature maps to allocate a higher bitwidth to these “important” points. Since the foreground points make up a small proportion of the sparse 3D point cloud, such adaptive quantization can achieve higher accuracy than uniform compression under a similar compression rate. Furthermore, we adopt a block-wise fine-grained compression scheme in the proposed framework to fit the larger dynamic range in the point cloud. Moreover, a 3D-point-cloud-based software and hardware co-evaluation process is proposed to evaluate the effectiveness of the proposed adaptive quantization on actual hardware devices. Based on the nuScenes dataset, we achieve 12.52% precision improvement under average 2-bit quantization. Compared with 8-bit quantization, we achieve 3.11× energy efficiency based on co-evaluation results.
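A toy sketch of the adaptive idea, spending more bits on the semantic-rich foreground points than on the background, is given below with a plain uniform quantizer; the paper's semantic localization, block-wise scheme, and hardware co-evaluation are not reproduced, and the chosen bit-widths and data are arbitrary.

```python
import numpy as np

def quantize(x, bits):
    """Plain uniform symmetric quantization to `bits` bits (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax if np.abs(x).max() > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale

def adaptive_point_quant(features, foreground_mask, fg_bits=8, bg_bits=2):
    """Spend more bits on semantic-rich foreground points than on background points."""
    out = np.empty_like(features)
    out[foreground_mask] = quantize(features[foreground_mask], fg_bits)
    out[~foreground_mask] = quantize(features[~foreground_mask], bg_bits)
    return out

feats = np.random.default_rng(0).standard_normal((1000, 64)).astype(np.float32)
fg = np.zeros(1000, dtype=bool)
fg[:50] = True                                   # foreground points are a small minority
print(np.abs(feats - adaptive_point_quant(feats, fg)).mean())
```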

SESSION: Technical Program: Approximate Brain-Inspired Architectures for Efficient Learning

ReMeCo: Reliable Memristor-Based in-Memory Neuromorphic Computation

  • Ali BanaGozar
  • Seyed Hossein Hashemi Shadmehri
  • Sander Stuijk
  • Mehdi Kamal
  • Ali Afzali-Kusha
  • Henk Corporaal

Memristor-based in-memory neuromorphic computing systems promise a highly efficient implementation of vector-matrix multiplications, commonly used in artificial neural networks (ANNs). However, the immature fabrication process of memristors and circuit-level limitations, i.e., stuck-at faults (SAF), IR-drop, and device-to-device (D2D) variation, degrade the reliability of these platforms and thus impede their wide deployment. In this paper, we present ReMeCo, a redundancy-based reliability improvement framework. It addresses the non-idealities while constraining the induced overhead. It achieves this by performing a sensitivity analysis on the ANN. With the acquired insight, ReMeCo avoids the redundant calculation of the least sensitive neurons and layers. ReMeCo uses a heuristic approach to find the balance between recovered accuracy and imposed overhead. ReMeCo further decreases hardware redundancy by exploiting the bit-slicing technique. In addition, the framework employs ensemble averaging at the output of every ANN layer to incorporate the redundant neurons. The efficacy of ReMeCo is assessed using two well-known ANN models, i.e., LeNet and AlexNet, on the MNIST and CIFAR10 datasets. Our results show 98.5% accuracy recovery with roughly 4% redundancy, which is more than 20× lower than the state of the art.

SyFAxO-GeN: Synthesizing FPGA-Based Approximate Operators with Generative Networks

  • Rohit Ranjan
  • Salim Ullah
  • Siva Satyendra Sahoo
  • Akash Kumar

With the rising trend of moving AI inference to the edge due to communication and privacy challenges, there has been a growing focus on designing low-cost Edge-AI. Given the diversity of application areas at the edge, FPGA-based systems are increasingly used for high-performance inference. Similarly, approximate computing has emerged as a viable approach to achieving disproportionate resource gains by utilizing applications' inherent robustness. However, most related research has focused on selecting the appropriate approximate operators for an application from a set of ASIC-based designs. This approach fails to leverage the FPGA's architectural benefits and limits the scope of approximation to already existing generic designs. To this end, we propose an AI-based approach to synthesizing novel approximate operators for the FPGA's look-up-table-based structure. Specifically, we use state-of-the-art generative networks to search for constraint-aware arithmetic operator designs optimized for FPGA-based implementation. With the proposed GANs, we report up to 49% faster training, with negligible accuracy degradation, than related generative networks. Similarly, we report improved hypervolume and more Pareto-front design points compared to state-of-the-art approaches to synthesizing approximate multipliers.

Approximating HW Accelerators through Partial Extractions onto Shared Artificial Neural Networks

  • Prattay Chowdhury
  • Jorge Castro Godínez
  • Benjamin Carrion Schafer

One approach that has been suggested to further reduce the energy consumption of heterogeneous Systems-on-Chip (SoCs) is approximate computing. In approximate computing, the error at the output is relaxed in order to simplify the hardware and thus achieve lower power. Fortunately, most of the hardware accelerators in these SoCs are also amenable to approximate computing.

In this work, we propose a fully automatic method that substitutes portions of a hardware accelerator specified in C/C++/SystemC for High-Level Synthesis (HLS) with an Artificial Neural Network (ANN). ANNs have many advantages that make them well suited for this. First, they are very scalable, which allows multiple separate portions of the behavioral description to be approximated on them simultaneously. Second, multiple ANNs can be fused together and re-optimized to further reduce the power consumption. We use this to share the ANN across multiple different HW accelerators in the same SoC. Experimental results with different error thresholds show that our proposed approach leads to better results than the state of the art.

DependableHD: A Hyperdimensional Learning Framework for Edge-Oriented Voltage-Scaled Circuits

  • Dehua Liang
  • Hiromitsu Awano
  • Noriyuki Miura
  • Jun Shiomi

Voltage scaling is one of the most promising approaches for energy efficiency improvement, but it also brings challenges in fully guaranteeing stable operation in modern VLSI. To tackle such issues, we propose DependableHD, a learning framework based on HyperDimensional Computing (HDC), which enables systems to tolerate bit-level memory failures in the low-voltage region with high robustness. For the first time, DependableHD introduces the concept of margin enhancement for model retraining and utilizes noise injection to improve robustness; it can be applied to most state-of-the-art HDC algorithms. Our experiments show that under 10% memory error, DependableHD exhibits a 1.22% accuracy loss on average, an 11.2× improvement compared to the baseline HDC solution. The hardware evaluation shows that DependableHD allows the supply voltage to be reduced from 400 mV to 300 mV, providing a 50.41% energy consumption reduction while maintaining competitive accuracy.

SESSION: Technical Program: Retrospect and Prospect of Verification and Test Technologies

EDDY: A Multi-Core BDD Package with Dynamic Memory Management and Reduced Fragmentation

  • Rune Krauss
  • Mehran Goli
  • Rolf Drechsler

In recent years, hardware systems have grown significantly in complexity. Due to this increasing complexity, there is a need to continuously improve the quality of the hardware design process. This leads designers to strive for more efficient data structures, and algorithms operating on them, to guarantee the correct behavior of such systems through verification techniques like model checking and to meet time-to-market constraints. A Binary Decision Diagram (BDD) is a suitable data structure as it provides a canonical, compact representation of Boolean functions for a given variable ordering, along with efficient algorithms for manipulating them. However, reduced ordered BDDs also have challenges: the BDD construction of some complex practical functions requires a large amount of memory, and the usability of realizations in the form of BDD packages strongly depends on the application.

To address these issues, this paper presents a novel multi-core package called Engineer Decision Diagrams Yourself (EDDY) with dynamic memory management and reduced fragmentation. Experiments on BDD benchmarks of both combinational circuits and model checking show that using EDDY leads to a significant performance boost compared to state-of-the-art packages.

Exploiting Reversible Computing for Verification: Potential, Possible Paths, and Consequences

  • Lukas Burgholzer
  • Robert Wille

Today, the verification of classical circuits poses a severe challenge for the design of circuits and systems. While the underlying (exponential) complexity is tackled in various fashions (simulation-based approaches, emulation, formal equivalence checking, fuzzing, model checking, etc.), no “silver bullet” has been found yet that allows escaping the growing verification gap. In this work, we entertain and investigate the idea of a complementary approach that aims at exploiting reversible computing. More precisely, we show the potential of the reversible computing paradigm for verification, debunk misleading paths that do not allow this potential to be exploited, and discuss the resulting consequences for the development of future, complementary design and verification flows. An extensive empirical study (involving more than 30 million simulations) confirms these findings. Although this work cannot provide a fully-fledged realization yet, it may provide the basis for an alternative path toward overcoming the verification gap.

Automatic Test Pattern Generation and Compaction for Deep Neural Networks

  • Dina Moussa
  • Michael Hefenbrock
  • Christopher Münch
  • Mehdi Tahoori

Deep Neural Networks (DNNs) have gained considerable attention lately due to their excellent performance on a wide range of recognition and classification tasks. Accordingly, fault detection in DNNs and their implementations plays a crucial role in the quality of DNN implementations, ensuring that their post-mapping and in-field accuracy matches the model accuracy. This paper proposes a functional-level automatic test pattern generation approach for DNNs. This is done by generating inputs that cause misclassification of the output class label in the presence of single or multiple faults. Furthermore, to obtain a smaller set of test patterns with full coverage, a heuristic algorithm as well as a test-pattern clustering method using K-means were implemented. The experimental results show that the proposed test patterns achieve the highest label misclassification and a high output deviation compared to state-of-the-art approaches.
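The compaction step can be illustrated with scikit-learn's K-means: cluster the generated patterns and keep the pattern nearest each centroid as the representative. The data below is synthetic, and the specific selection rule is an assumption for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def compact_test_patterns(patterns, n_keep, seed=0):
    """Cluster generated test patterns and keep the one closest to each centroid."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(patterns)
    kept = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(patterns[members] - km.cluster_centers_[c], axis=1)
        kept.append(members[np.argmin(dist)])
    return np.array(sorted(kept))

patterns = np.random.default_rng(0).random((500, 32))   # stand-in for generated DNN test inputs
print(compact_test_patterns(patterns, n_keep=20))       # indices of the retained patterns
```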

Wafer-Level Characteristic Variation Modeling Considering Systematic Discontinuous Effects

  • Takuma Nagao
  • Tomoki Nakamura
  • Masuo Kajiyama
  • Makoto Eiki
  • Michiko Inoue
  • Michihiro Shintani

Statistical wafer-level variation modeling is an attractive method for reducing the measurement cost in large-scale integrated circuit (LSI) testing while maintaining test quality. In this method, the performance of unmeasured LSI circuits manufactured on a wafer is statistically predicted from a few measured LSI circuits. Conventional statistical methods model spatially smooth variations across a wafer. However, actual wafers may have discontinuous variations that are systematically caused by the manufacturing environment, such as shot dependence. In this study, we propose a modeling method that considers discontinuous variations in wafer characteristics by applying the knowledge of manufacturing engineers to a model estimated using Gaussian process regression. In the proposed method, the process variation is decomposed into systematic discontinuous and global components to improve the estimation accuracy. An evaluation performed using an industrial production test dataset shows that the proposed method reduces the estimation error for an entire wafer by over 33% compared to conventional methods.
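The core statistical step, fitting a Gaussian process to a few measured dies and predicting the rest, can be sketched with scikit-learn on synthetic data; the paper's decomposition into systematic discontinuous and global components is only crudely approximated here by adding a shot index as an extra feature (an assumption, not the authors' formulation).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(2000, 2))                     # normalized die positions on a wafer
shot = (xy * 4).astype(int)                                  # crude shot (reticle) index per die
truth = np.sin(2 * xy[:, 0]) + 0.3 * xy[:, 1] + 0.05 * shot[:, 0]   # synthetic characteristic

measured = rng.choice(len(xy), size=100, replace=False)      # only a few dies are measured
X = np.hstack([xy, shot])                                    # position plus shot index as features
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(1e-3),
                               normalize_y=True).fit(X[measured], truth[measured])

unmeasured = np.setdiff1d(np.arange(len(xy)), measured)
print("mean abs. prediction error:", np.abs(gpr.predict(X) - truth)[unmeasured].mean())
```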

SESSION: Technical Program: Computing, Erasing, and Protecting: The Security Challenges for the Next Generation of Memories

Hardware Security Primitives Using Passive RRAM Crossbar Array: Novel TRNG and PUF Designs

  • Simranjeet Singh
  • Furqan Zahoor
  • Gokul Rajendran
  • Sachin Patkar
  • Anupam Chattopadhyay
  • Farhad Merchant

With rapid advancements in electronic gadgets, the security and privacy aspects of these devices have become significant. Physical unclonable functions (PUFs) and true random number generators (TRNGs) are critical hardware security primitives for the design of secure systems. This paper proposes novel implementations of PUFs and TRNGs on the RRAM crossbar structure. Firstly, two techniques to implement a TRNG in the RRAM crossbar are presented, based on write-back and on a 50% switching-probability pulse. The randomness of the proposed TRNGs is evaluated using the NIST test suite. Next, an architecture to implement a PUF in the RRAM crossbar is presented. The initial entropy source for the PUF is taken from the TRNGs, and challenge-response pairs (CRPs) are collected. The proposed PUF exploits device variations and sneak-path current to produce unique CRPs. We demonstrate, through extensive experiments, reliability of 100%, uniqueness of 47.78%, uniformity of 49.79%, and bit-aliasing of 48.57% without any post-processing techniques. Finally, the design is compared with the literature to evaluate its implementation efficiency, and it is found to be clearly superior to the state of the art.

Data Sanitization on eMMCs

  • Aya Fukami
  • Francesco Regazzoni
  • Zeno Geradts

Data sanitization of modern digital devices is an important issue given that electronic waste is being recycled and repurposed. The embedded Multi Media Card (eMMC), one of the NAND flash memory-based commodity devices, is one of the popularly recycled products in the current recycling ecosystem. We analyze a repurposed device and evaluate its sanitization practice. Data from the formerly used device can still be recovered, which may lead to unintentional leakage of sensitive data such as personally identifiable information (PII). Since the internal storage of an eMMC is NAND flash memory, the sanitization practices of NAND flash memory-based systems should apply to the eMMC. However, a proper sanitize operation is evidently not always performed in the current recycling ecosystem. We discuss how data stored in eMMCs and other flash memory-based devices need to be deleted in order to avoid potential data leakage. We also review NAND flash memory data sanitization schemes and discuss how they should be applied to eMMCs.

Fundamentally Understanding and Solving RowHammer

  • Onur Mutlu
  • Ataberk Olgun
  • A. Giray Yağlıkcı

We provide an overview of recent developments and future directions in the RowHammer vulnerability that plagues modern DRAM (Dynamic Random Access Memory) chips, which are used in almost all computing systems as main memory.

RowHammer is the phenomenon in which repeatedly accessing a row in a real DRAM chip causes bitflips (i.e., data corruption) in physically nearby rows. This phenomenon leads to a serious and widespread system security vulnerability, as many works since the original RowHammer paper in 2014 have shown. Recent analysis of the RowHammer phenomenon reveals that the problem is getting much worse as DRAM technology scaling continues: newer DRAM chips are fundamentally more vulnerable to RowHammer at the device and circuit levels. Deeper analysis of RowHammer shows that there are many dimensions to the problem as the vulnerability is sensitive to many variables, including environmental conditions (temperature & voltage), process variation, stored data patterns, as well as memory access patterns and memory control policies. As such, it has proven difficult to devise fully-secure and very efficient (i.e., low-overhead in performance, energy, area) protection mechanisms against RowHammer and attempts made by DRAM manufacturers have been shown to lack security guarantees.

After reviewing various recent developments in exploiting, understanding, and mitigating RowHammer, we discuss future directions that we believe are critical for solving the RowHammer problem. We argue for two major directions to amplify research and development efforts in: 1) building a much deeper understanding of the problem and its many dimensions, in both cutting-edge DRAM chips and computing systems deployed in the field, and 2) the design and development of extremely efficient and fully-secure solutions via system-memory cooperation.

SESSION: Technical Program: System-Level Codesign in DNN Accelerators

Hardware-Software Codesign of DNN Accelerators Using Approximate Posit Multipliers

  • Tom Glint
  • Kailash Prasad
  • Jinay Dagli
  • Krishil Gandhi
  • Aryan Gupta
  • Vrajesh Patel
  • Neel Shah
  • Joycee Mekie

Emerging data-intensive AI/ML workloads encounter the memory and power walls when run on general-purpose compute cores. This has led to the development of a myriad of techniques to deal with such workloads, among which DNN accelerator architectures have found a prominent place. In this work, we propose a hardware-software co-design approach to achieve system-level benefits. We propose a quantized, data-aware posit number representation that leads to a highly optimized DNN accelerator. We demonstrate this work on the SOTA SIMBA architecture, and it is extendable to any other accelerator. Our proposal reduces the buffer/storage requirements within the architecture and reduces the data transfer cost between the main memory and the DNN accelerator. We have investigated the impact of using integer, IEEE floating-point, and posit multipliers for LeNet, ResNet, and VGG NNs trained and tested on the MNIST, CIFAR10, and ImageNet datasets, respectively. Our system-level analysis shows that the proposed approximate fixed-posit multiplier, when implemented on the SIMBA architecture, achieves on average ~2.2× speedup, consumes ~3.1× less energy, and requires ~3.2× less area compared with the baseline SOTA architecture, without loss of accuracy (~±1%).

Reusing GEMM Hardware for Efficient Execution of Depthwise Separable Convolution on ASIC-Based DNN Accelerators

  • Susmita Dey Manasi
  • Suvadeep Banerjee
  • Abhijit Davare
  • Anton A. Sorokin
  • Steven M. Burns
  • Desmond A. Kirkpatrick
  • Sachin S. Sapatnekar

Deep learning (DL) accelerators are optimized for standard convolution. However, lightweight convolutional neural networks (CNNs) use depthwise convolution (DwC) in key layers, and the structural difference between DwC and standard convolution leads to a significant performance bottleneck in executing lightweight CNNs on such platforms. This work reuses the fast general matrix multiplication (GEMM) core of DL accelerators by mapping DwC to channel-wise parallel matrix-vector multiplications. An analytical framework is developed to guide pre-RTL hardware choices, and new hardware modules and software support are developed for end-to-end evaluation of the solution. This GEMM-based DwC execution strategy offers substantial performance gains for lightweight CNNs: a 7× speedup and 1.8× lower off-chip communication for MobileNet-v1 over a conventional DL accelerator, a 74× speedup over a CPU, and even a 1.4× speedup over a power-hungry GPU.
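The mapping idea itself can be shown in a few lines: each channel of a depthwise convolution becomes an independent im2col matrix-vector product, which is the shape of work a GEMM-style core already handles. The NumPy sketch below is only a functional illustration, not the paper's hardware mapping.

```python
import numpy as np

def depthwise_conv_as_matvec(x, w):
    """Depthwise convolution (stride 1, no padding) as one independent
    matrix-vector product per channel.  x: (C, H, W), w: (C, K, K)."""
    C, H, W = x.shape
    K = w.shape[1]
    Ho, Wo = H - K + 1, W - K + 1
    out = np.empty((C, Ho, Wo))
    for c in range(C):                                                   # channels are independent
        patches = np.lib.stride_tricks.sliding_window_view(x[c], (K, K)) # (Ho, Wo, K, K)
        A = patches.reshape(Ho * Wo, K * K)                              # im2col for this channel
        out[c] = (A @ w[c].reshape(K * K)).reshape(Ho, Wo)               # matrix-vector product
    return out

x = np.random.default_rng(0).standard_normal((8, 16, 16))
w = np.random.default_rng(1).standard_normal((8, 3, 3))
print(depthwise_conv_as_matvec(x, w).shape)                              # (8, 14, 14)
```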

BARVINN: Arbitrary Precision DNN Accelerator Controlled by a RISC-V CPU

  • Mohammadhossein Askarihemmat
  • Sean Wagner
  • Olexa Bilaniuk
  • Yassine Hariri
  • Yvon Savaria
  • Jean-Pierre David

We present a DNN accelerator that allows inference at arbitrary precision with dedicated processing elements that are configurable at the bit level. Our DNN accelerator has 8 Processing Elements controlled by a RISC-V controller, with a combined 8.2 TMACs of computational power when implemented on the recent Alveo U250 FPGA platform. We develop a code generator tool that ingests CNN models in ONNX format and generates an executable command stream for the RISC-V controller. We demonstrate the scalable throughput of our accelerator by running different DNN kernels and models when different quantization levels are selected. Compared to other low-precision accelerators, our accelerator provides run-time programmability without hardware reconfiguration and can accelerate DNNs with multiple quantization levels, regardless of the target FPGA size. BARVINN is an open-source project and is available at https://github.com/hossein1387/BARVINN.

Agile Hardware and Software Co-Design for RISC-V-Based Multi-Precision Deep Learning Microprocessor

  • Zicheng He
  • Ao Shen
  • Qiufeng Li
  • Quan Cheng
  • Hao Yu

Recent network architecture search (NAS) has been widely applied to simplify deep learning neural networks, which typically results in a multi-precision network. Many multi-precision accelerators have also been developed to compute multi-precision networks, but mapping networks onto them is still largely manual. A software-hardware interface is thereby needed to automatically map multi-precision networks onto multi-precision accelerators. In this paper, we have developed an agile hardware and software co-design for a RISC-V-based multi-precision deep learning microprocessor. We have designed custom RISC-V instructions with a framework to automatically compile multi-precision CNN networks onto multi-precision CNN accelerators, demonstrated on an FPGA. Experiments show that with NAS-optimized multi-precision CNN models (LeNet, VGG16, ResNet, MobileNet), the RISC-V core with multi-precision accelerators can reach the highest throughput in 2-, 4-, and 8-bit precisions, respectively, on a Xilinx ZCU102 FPGA.

SESSION: Technical Program: New Advances in Hardware Trojan Detection

Hardware Trojan Detection Using Shapley Ensemble Boosting

  • Zhixin Pan
  • Prabhat Mishra

Due to the globalized semiconductor supply chain, there is an increasing risk of exposing system-on-chip designs to hardware Trojans (HTs). While there are promising machine learning-based HT detection techniques, they have three major limitations: ad-hoc feature selection, lack of explainability, and vulnerability to adversarial attacks. In this paper, we propose a novel HT detection approach using an effective combination of Shapley value analysis and a boosting framework. Specifically, this paper makes two important contributions. We use Shapley values (SHAP) to analyze the importance ranking of input features; this not only provides an explainable interpretation for HT detection, but also serves as a guideline for feature selection. We utilize boosting (ensemble learning) to generate a sequence of lightweight models, which significantly reduces training time while providing robustness against adversarial attacks. Experimental results demonstrate that our approach can drastically improve both detection accuracy (up to 24.6%) and time efficiency (up to 5.1×) compared to state-of-the-art HT detection techniques.
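The Shapley-value-guided feature ranking can be approximated with a small Monte Carlo permutation estimator wrapped around a boosted classifier, as sketched below on synthetic features. This is a generic stand-in for the idea, not the paper's SHAP-based pipeline or its netlist features; all data and names are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 10))                    # stand-in design/netlist features
y = (X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(2000) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)   # boosting (ensemble learning)

def shapley_importance(model, Xb, yb, n_perm=20, seed=0):
    """Monte Carlo permutation estimate of each feature's Shapley value for test
    accuracy; an 'absent' feature is neutralized by shuffling its column."""
    r = np.random.default_rng(seed)
    phi = np.zeros(Xb.shape[1])
    for _ in range(n_perm):
        order = r.permutation(Xb.shape[1])
        Xm = Xb.copy()
        for col in order:
            Xm[:, col] = r.permutation(Xm[:, col])     # remove every feature
        prev = model.score(Xm, yb)
        for col in order[::-1]:                         # add them back one at a time
            Xm[:, col] = Xb[:, col]
            cur = model.score(Xm, yb)
            phi[col] += cur - prev                      # marginal contribution of `col`
            prev = cur
    return phi / n_perm

print("feature ranking:", np.argsort(-shapley_importance(clf, Xte, yte)))
```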

ASSURER: A PPA-friendly Security Closure Framework for Physical Design

  • Guangxin Guo
  • Hailong You
  • Zhengguang Tang
  • Benzheng Li
  • Cong Li
  • Xiaojue Zhang

Hardware security is an emerging concern in very large scale integration (VLSI). Serious threats, such as hardware Trojan insertion, probing attacks, and fault injection, are hard to detect and almost impossible to fix at the post-design stage. The optimal solution is to prevent them at the physical design stage. Usually, defending against them causes substantial power, performance, and area (PPA) loss. In this paper, we propose ASSURER, a PPA-friendly physical-layout security closure framework. Reward-directed placement refinement and a multi-threshold partition algorithm are proposed to eliminate Trojan insertion threats. Cleaning up probing vulnerabilities is built on a patch-based ECO routing flow. Evaluated on the ISPD'22 benchmarks, ASSURER can clean out the Trojan threat with no leakage-power increase when shrinking the physical layout area; when not shrinking, ASSURER increases total power by only 14%. Compared with the first-place work of the ISPD 2022 Contest, ASSURER reduces the additional total power consumption by 53%, and probing vulnerability can be reduced by 97.6% under the premise of timing closure. We believe this work opens up a new perspective for preventing Trojan insertion and probing attacks.

Static Probability Analysis Guided RTL Hardware Trojan Test Generation

  • Haoyi Wang
  • Qiang Zhou
  • Yici Cai

Directed test generation is an effective method to detect potential hardware Trojans (HTs) at the RTL. While existing works are able to activate hard-to-cover Trojans by covering security targets, the effectiveness and efficiency of identifying the targets to cover are ignored. We propose a static probability analysis method for identifying hard-to-activate data-channel targets and generating the corresponding assertions for HT test generation. Our method can generate test vectors that trigger Trojans from Trusthub, DeTrust, and OpenCores within 1 minute and achieves a 104.33× time improvement on average compared with the existing method.

Hardware Trojan Detection and High-Precision Localization in NoC-Based MPSoC Using Machine Learning

  • Haoyu Wang
  • Basel Halak

Network-on-Chip (NoC)-based Multi-Processor Systems-on-Chip (MPSoCs) are increasingly employed in industrial and consumer electronics. Outsourcing third-party IPs (3PIPs) and tools for NoC-based MPSoCs is a prevalent development practice in most fabless companies. However, a Hardware Trojan (HT) injected during the design stage can maliciously tamper with the functionality of this communication scheme, which undermines the security of the system and may cause a failure. Detecting and localizing HTs with high precision is a challenge for current techniques. This work proposes, for the first time, a novel approach that allows detection and high-precision localization of HTs based on the use of packet information and machine learning algorithms. It is equipped with a novel Dynamic Confidence Interval (DCI) algorithm to detect malicious packets, and a novel Dynamic Security Credit Table (DSCT) algorithm to localize HTs. We evaluated the proposed framework on a mesh NoC running real workloads. An average detection precision of 96.3% and an average localization precision of 100% were obtained from the experimental results, and the minimum HT localization time is around 5.8–12.9 μs at 2 GHz, depending on the HT-infected nodes and workloads.
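As a much-simplified stand-in for the confidence-interval idea (the paper's exact DCI algorithm and the DSCT localization are not reproduced), the sketch below maintains a sliding-window interval over a per-node packet feature and flags values falling outside it; window size, threshold, and the feature itself are arbitrary assumptions.

```python
import collections
import numpy as np

class DynamicConfidenceInterval:
    """Sliding-window interval over a packet feature (e.g., inter-arrival time);
    values outside mean +/- k*std are flagged as suspicious. Simplified stand-in."""

    def __init__(self, window=256, k=3.0, warmup=32):
        self.buf = collections.deque(maxlen=window)
        self.k, self.warmup = k, warmup

    def observe(self, value):
        suspicious = False
        if len(self.buf) >= self.warmup:
            mean, std = np.mean(self.buf), np.std(self.buf) + 1e-9
            suspicious = abs(value - mean) > self.k * std
        self.buf.append(value)
        return suspicious

dci = DynamicConfidenceInterval()
normal = np.random.default_rng(0).normal(10.0, 1.0, 500)   # benign traffic feature
flags = [dci.observe(v) for v in normal] + [dci.observe(50.0)]
print("flags raised:", sum(flags))                          # the injected outlier is flagged
```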

SESSION: Technical Program: Advances in Physical Design and Timing Analysis

An Integrated Circuit Partitioning and TDM Assignment Optimization Framework for Multi-FPGA Systems

  • Dan Zheng
  • Evangeline F. Y. Young

In multi-FPGA systems, Time-Division Multiplexing (TDM) is a widely used method for transferring multiple signals over a common wire, and circuit performance is significantly influenced by the resulting inter-FPGA delay. Some inter-FPGA nets are driven by different clocks, in which case they cannot share the same wire. In this paper, to minimize the maximum delay of inter-FPGA nets, we propose a two-step framework. First, a TDM-aware partitioning algorithm is adopted to minimize the maximum cut size between FPGA pairs. A TDM ratio assignment method is then applied to optimally assign a TDM ratio to each inter-FPGA net. Experimental results show that our algorithm can reduce the maximum TDM ratio significantly within reasonable runtime.
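A toy version of the second step, assigning TDM ratios once nets are grouped by driving clock for one FPGA pair, is sketched below: wires are handed out greedily to the clock group with the worst current ratio. Constraints of the real problem (e.g., legal ratio values) are omitted, and all names and numbers are illustrative, not the paper's algorithm.

```python
import math
from collections import Counter

def assign_tdm_ratios(net_clocks, n_wires):
    """For one FPGA pair: nets driven by different clocks cannot share a wire, so
    wires are allotted per clock group to minimize the maximum TDM ratio (greedy)."""
    groups = Counter(net_clocks)                      # clock -> number of inter-FPGA nets
    assert n_wires >= len(groups), "need at least one wire per clock group"
    wires = {clk: 1 for clk in groups}                # start with one wire per group
    for _ in range(n_wires - len(groups)):            # hand out the remaining wires greedily
        worst = max(groups, key=lambda c: math.ceil(groups[c] / wires[c]))
        wires[worst] += 1
    return {clk: math.ceil(groups[clk] / wires[clk]) for clk in groups}

print(assign_tdm_ratios(["clkA"] * 90 + ["clkB"] * 10, n_wires=8))   # {'clkA': 13, 'clkB': 10}
```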

A Robust FPGA Router with Concurrent Intra-CLB Rerouting

  • Jiarui Wang
  • Jing Mai
  • Zhixiong Di
  • Yibo Lin

Routing is the most time-consuming step in the FPGA design flow, with increasingly complicated FPGA architectures and design scales. The growing complexity of connections between logic pins inside the CLBs of FPGAs challenges the efficiency and quality of FPGA routers. Existing negotiation-based rip-up-and-reroute schemes require a large number of iterations when generating paths inside CLBs. In this work, we propose a robust routing framework for FPGAs with complex connections between logic elements and switch boxes. We propose a concurrent intra-CLB rerouting algorithm that can effectively resolve routing congestion inside a CLB tile. Experimental results on modified ISPD 2016 benchmarks demonstrate that our framework achieves 100% routability with less wirelength and runtime, while the state-of-the-art VTR 8.0 routing algorithm fails on 4 of the 12 benchmarks.

Efficient Global Optimization for Large Scaled Ordered Escape Routing

  • Chuandong Chen
  • Dishi Lin
  • Rongshan Wei
  • Qinghai Liu
  • Ziran Zhu
  • Jianli Chen

The Ordered Escape Routing (OER) problem, which is NP-hard, is critical in PCB design. Existing methods based on integer linear programming (ILP) or heuristic algorithms work well on small-scale PCBs with fewer pins. However, when dealing with large-scale instances, the performance of ILP strategies suffers dramatically as the number of variables increases, due to time-consuming preprocessing. As for heuristic algorithms, rip-up and reroute is adopted to increase resource utilization, which frequently causes time violations. In this paper, we propose an efficient ILP-based routing engine for dense PCBs that simultaneously minimizes wiring length and runtime while considering the specific routing constraints. By weighting the length, we first model the OER problem as a special network-flow problem. Then we separate the non-crossing constraint from the typical ILP modeling to greatly reduce the number of integer variables. In addition, considering the congestion of routing resources, an ILP method is proposed to detect congestion. Finally, unlike traditional schemes that deal with negotiated congestion, our approach works by reducing the local area capacity and then allowing global automatic optimization of congestion. Compared with the state-of-the-art work, experimental results show that our algorithm can solve larger-scale cases with higher routing quality (shorter length) and reduce routing time by 76%.

An Adaptive Partition Strategy of Galerkin Boundary Element Method for Capacitance Extraction

  • Shengkun Wu
  • Biwei Xie
  • Xingquan Li

In advanced process nodes, electromagnetic coupling among interconnect wires plays an increasingly important role in signoff analysis. For VLSI chip design, the need for fast and accurate capacitance extraction is becoming more and more urgent, and the critical step in extracting capacitance among interconnect wires is solving the electric field. However, due to the high computational complexity, solving the electric field is extremely time-consuming. The Galerkin boundary element method (GBEM) was used for capacitance extraction in [2]. In this paper, we use mathematical theorems to analyze its error. Furthermore, with the error estimation of the Galerkin method, we design a boundary partition strategy that fits the electric field attenuation. It is worth mentioning that this boundary partition strategy can greatly reduce the number of boundary elements on the premise of ensuring that the error is small enough. As a consequence, the matrix order of the discretization equation also decreases. We also provide suggestions for the calculation of the matrix elements. Experimental analysis demonstrates that our partition strategy obtains sufficiently accurate results with a small number of boundary elements.

Graph-Learning-Driven Path-Based Timing Analysis Results Predictor from Graph-Based Timing Analysis

  • Yuyang Ye
  • Tinghuan Chen
  • Yifei Gao
  • Hao Yan
  • Bei Yu
  • Longxing Shi

With diminishing margins in advanced technology nodes, the performance of static timing analysis (STA), including its accuracy and runtime, is a serious concern. STA can generally be divided into graph-based analysis (GBA) and path-based analysis (PBA). For GBA, the timing results are always pessimistic, leading to overdesign during design optimization. For PBA, the timing pessimism is reduced by propagating real path-specific slews, at the cost of severe runtime overheads relative to GBA. In this work, we present a fast and accurate predictor of post-layout PBA timing results from inexpensive GBA, based on a deep edge-featured graph attention network, namely deep EdgeGAT. Compared with conventional machine- and graph-learning methods, deep EdgeGAT can learn global timing-path information. Experimental results demonstrate that our predictor can accurately predict PBA timing results and reduce the timing pessimism of GBA, with a maximum error of 6.81 ps, and it achieves an average 24.80× speedup over PBA using a commercial STA tool.

SESSION: Technical Program: Brain-Inspired Hyperdimensional Computing to the Rescue for Beyond von Neumann Era

Beyond von Neumann Era: Brain-Inspired Hyperdimensional Computing to the Rescue

  • Hussam Amrouch
  • Paul R. Genssler
  • Mohsen Imani
  • Mariam Issa
  • Xun Jiao
  • Wegdan Mohammad
  • Gloria Sepanta
  • Ruixuan Wang

Breakthroughs in deep learning (DL) continuously fuel innovations that profoundly improve our daily life. However, DNNs overwhelm conventional computing architectures with their massive data movements between processing and memory units. As a result, novel computer architectures are indispensable to improve or even replace the decades-old von Neumann architecture. Nevertheless, going far beyond the existing von Neumann principles comes with profound reliability challenges for the performed computations. This is because analog computing together with emerging beyond-CMOS technologies is inherently noisy and inevitably leads to unreliable computing. Hence, novel robust algorithms become key to going beyond the boundaries of the von Neumann era. Hyperdimensional Computing (HDC) is rapidly emerging as an attractive alternative to traditional DL and ML algorithms. Unlike conventional DL and ML algorithms, HDC is inherently robust against errors and admits a much more efficient hardware implementation. In addition to these advantages at the hardware level, HDC's promise to learn from little data and its underlying algebra enable new possibilities at the application level. In this work, the robustness of HDC algorithms against errors, as well as beyond-von Neumann architectures, is discussed. Further, the benefits of HDC as a machine learning algorithm are demonstrated with the examples of outlier detection and reinforcement learning.

SESSION: Technical Program: System Level Design Space Exploration

System-Level Exploration of In-Package Wireless Communication for Multi-Chiplet Platforms

  • Rafael Medina
  • Joshua Kein
  • Giovanni Ansaloni
  • Marina Zapater
  • Sergi Abadal
  • Eduard Alarcón
  • David Atienza

Multi-Chiplet architectures are being increasingly adopted to support the design of very large systems in a single package, facilitating the integration of heterogeneous components and improving manufacturing yield. However, chiplet-based solutions have to cope with limited inter-chiplet routing resources, which complicates the design of the data interconnect and the power delivery network. Emerging in-package wireless technology is a promising strategy to address these challenges, as it allows flexible chiplet interconnects to be implemented while freeing package resources for power supply connections. To assess the capabilities of such an approach and its impact from a full-system perspective, herein we present an exploration of the performance of in-package wireless communication based on dedicated extensions to the gem5-X simulator. We consider different Medium Access Control (MAC) protocols, as well as applications with different runtime profiles, showcasing that current in-package wireless solutions are competitive with wired chiplet interconnects. Our results show how in-package wireless solutions can outperform wired alternatives when running artificial intelligence workloads, achieving up to a 2.64× speed-up when running deep neural networks (DNNs) on a chiplet-based system with 16 cores distributed in four clusters.

Efficient System-Level Design Space Exploration for High-Level Synthesis Using Pareto-Optimal Subspace Pruning

  • Yuchao Liao
  • Tosiron Adegbija
  • Roman Lysecky

High-level synthesis (HLS) is a rapidly evolving and popular approach to designing, synthesizing, and optimizing embedded systems. Many HLS methodologies utilize design space exploration (DSE) at the post-synthesis stage to find Pareto-optimal hardware implementations for individual components. However, the design space for the system-level Pareto-optimal configurations is orders of magnitude larger than the component-level design space, making existing approaches insufficient for system-level DSE. This paper presents Pruned Genetic Design Space Exploration (PG-DSE), an approach to post-synthesis DSE that involves a pruning method to effectively reduce the system-level design space and an elitist genetic algorithm to accurately find the system-level Pareto-optimal configurations. We evaluate PG-DSE using an autonomous driving application subsystem (ADAS) and three synthetic systems with extremely large design spaces. Experimental results show that PG-DSE can reduce the design space by several orders of magnitude compared to prior work while achieving higher-quality results (an average improvement of 58.1×).
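The component-level pruning step can be illustrated with a simple non-dominated filter: configurations dominated in every objective never contribute to the system-level Pareto front, so they can be discarded before the genetic search. Below is a NumPy sketch with synthetic (latency, area, power) triples; all values are placeholders and the filter is a generic stand-in, not PG-DSE's exact pruning method.

```python
import numpy as np

def pareto_prune(points):
    """Keep only non-dominated configurations (minimization in every objective).
    points: (n_configs, n_objectives), e.g. columns = (latency, area, power)."""
    keep = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        if not keep[i]:
            continue
        dominates_i = np.all(points <= p, axis=1) & np.any(points < p, axis=1)
        if dominates_i.any():                 # another config is at least as good everywhere
            keep[i] = False
    return points[keep]

configs = np.random.default_rng(0).random((200, 3))   # synthetic per-component HLS results
pruned = pareto_prune(configs)
print(len(configs), "->", len(pruned), "configurations enter the system-level search")
```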

Automatic Generation of Complete Polynomial Interpolation Design Space for Hardware Architectures

  • Bryce Orloski
  • Samuel Coward
  • Theo Drane

Hardware implementations of elementary functions regularly deploy piecewise polynomial approximations. This work determines the complete design space of piecewise polynomial approximations meeting a given accuracy specification. Knowledge of this design space determines the minimum number of regions required to approximate the function accurately enough and facilitates the generation of optimized hardware that is competitive against the state of the art. Designers can explore the space of feasible architectures without needing to validate their choices. A heuristic-based decision procedure is proposed to generate optimal ASIC hardware designs. Targeting alternative hardware technologies simply requires a modified decision procedure to explore the space. We highlight the difficulty of choosing an optimal number of regions to approximate the function with, as this is dependent on the input width.
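A brute-force sketch of the core question, how many regions a piecewise polynomial of a given degree needs to meet an accuracy spec, is shown below using equal-width regions and least-squares fits. The paper's complete design-space characterization and hardware-oriented handling go well beyond this; the function, degree, and error spec here are arbitrary examples.

```python
import numpy as np

def max_error(f, lo, hi, n_regions, degree, samples=512):
    """Worst absolute error of per-region least-squares polynomial fits."""
    edges = np.linspace(lo, hi, n_regions + 1)
    worst = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        x = np.linspace(a, b, samples)
        y = f(x)
        coeffs = np.polyfit(x, y, degree)
        worst = max(worst, np.max(np.abs(np.polyval(coeffs, x) - y)))
    return worst

def min_regions(f, lo, hi, degree, spec, max_regions=1024):
    """Smallest number of equal-width regions meeting the accuracy spec."""
    for n in range(1, max_regions + 1):
        if max_error(f, lo, hi, n, degree) <= spec:
            return n
    return None

print(min_regions(np.exp, 0.0, 1.0, degree=2, spec=2 ** -12))   # degree-2 fit of exp on [0, 1]
```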

SESSION: Technical Program: Security Assurance and Acceleration

SHarPen: SoC Security Verification by Hardware Penetration Test

  • Hasan Al-Shaikh
  • Arash Vafaei
  • Mridha Md Mashahedur Rahman
  • Kimia Zamiri Azar
  • Fahim Rahman
  • Farimah Farahmandi
  • Mark Tehranipoor

As modern SoC architectures incorporate many complex/heterogeneous intellectual properties (IPs), the protection of security assets has become imperative, and the number of vulnerabilities revealed is rising due to the increased number of attacks. Over the last few years, penetration testing (PT) has become an increasingly effective means of detecting software (SW) vulnerabilities. As of yet, no such technique has been applied to the detection of hardware vulnerabilities. This paper proposes a PT framework, SHarPen, for detecting hardware vulnerabilities, which facilitates the development of a SoC-level security verification framework. SHarPen proposes a formalism for performing gray-box hardware (HW) penetration testing instead of relying on coverage-based testing and provides automation for mapping hardware vulnerabilities to logical/mathematical cost functions. SHarPen supports both simulation and FPGA-based prototyping, allowing us to automate security testing at different stages of the design process with high capabilities for identifying vulnerabilities in the targeted SoC.

SecHLS: Enabling Security Awareness in High-Level Synthesis

  • Shang Shi
  • Nitin Pundir
  • Hadi M Kamali
  • Mark Tehranipoor
  • Farimah Farahmandi

In its quest for further optimization, high-level synthesis (HLS) utilizes advanced automatic optimization algorithms to achieve lower implementation time/effort for ever more complex designs. These optimization algorithms target the HLS tools' backend stages, e.g., allocation, scheduling, and binding, and are highly optimized for resource/latency constraints. However, current HLS tools' backends are unaware of designs' security assets, and their algorithms are incapable of handling security constraints. In this paper, we propose Secure-HLS (SecHLS), which aims to define underlying security constraints for HLS tools' backend stages and intermediate representations. In SecHLS, we improve a set of widely used scheduling and binding algorithms by integrating the proposed security-related constraints into them. We evaluate the effectiveness of SecHLS in terms of power, performance, area (PPA), security, and complexity (execution time) on small and real-size benchmarks, showing how the proposed security constraints can be integrated into HLS while maintaining low PPA/complexity burdens.

A Flexible ASIC-Oriented Design for a Full NTRU Accelerator

  • Francesco Antognazza
  • Alessandro Barenghi
  • Gerardo Pelosi
  • Ruggero Susella

Post-quantum cryptosystems are the subject of a significant research effort, witnessed by various international standardization competitions. Among them, the NTRU Key Encapsulation Mechanism has been recognized as a secure, patent-free, and efficient public-key encryption scheme. In this work, we perform a design space exploration on an FPGA target, with the final goal of an efficient ASIC realization. Specifically, we focus on the possible choices for the design of polynomial multipliers with different memory bus widths, trading off lower clock-cycle counts against larger interconnections. Our design outperforms the best FPGA synthesis results in the state of the art, and we report the results of ASIC syntheses minimizing latency and area with a 40 nm industrial-grade technology library. Our speed-oriented design computes an encapsulation in 4.1 to 10.2 μs and a decapsulation in 7.1 to 11.7 μs, depending on the NTRU security level, while our most compact design takes only 20% more area than the underlying SHA-3 hash module.

SESSION: Technical Program: Hardware and Software Co-Design of Emerging Machine Learning Algorithms

Robust Hyperdimensional Computing against Cyber Attacks and Hardware Errors: A Survey

  • Dongning Ma
  • Sizhe Zhang
  • Xun Jiao

Hyperdimensional Computing (HDC), also known as Vector Symbolic Architecture (VSA), is an emerging AI algorithm inspired by the way the human brain functions. Compared with deep neural networks (DNNs), HDC possesses several advantages, such as smaller model size, lower computation cost, and one/few-shot learning, making it a promising alternative computing paradigm. With the increasing deployment of AI in safety-critical systems such as healthcare and robotics, it is not only important to strive for high accuracy, but also to ensure robustness under even highly uncertain and adversarial environments. However, recent studies show that HDC, just like DNNs, is vulnerable to both cyber attacks (e.g., adversarial attacks) and hardware errors (e.g., memory failures). While a growing body of research has been studying the robustness of HDC, there is a lack of a systematic review of research efforts on this increasingly important topic. To the best of our knowledge, this paper presents the first survey dedicated to reviewing the research efforts on the robustness of HDC against cyber attacks and hardware errors. While the performance and accuracy of HDC as an AI method still await further theoretical advancement, this survey aims to shed light on and call for community efforts toward robustness research on HDC.

In-Memory Computing Accelerators for Emerging Learning Paradigms

  • Dayane Reis
  • Ann Franchesca Laguna
  • Michael Niemier
  • Xiaobo Sharon Hu

Over the past decades, emerging, data-driven machine learning (ML) paradigms have grown in popularity and revolutionized many application domains. To date, a substantial effort has been devoted to devising mechanisms for facilitating the deployment and near-ubiquitous use of these memory-intensive ML models. This review paper presents the use of in-memory computing (IMC) accelerators for emerging ML paradigms from a bottom-up perspective, spanning the choice of devices and the design of circuits/architectures to application-level results.

Toward Fair and Efficient Hyperdimensional Computing

  • Yi Sheng
  • Junhuan Yang
  • Weiwen Jiang
  • Lei Yang

Machine Learning (ML) is being applied to an ever wider range of applications, such as intelligent security systems and medical diagnosis. With this trend, there is high demand to run ML on end devices with limited resources. Moreover, fairness in these ML algorithms is of mounting importance, since such applications are not designed for specific users (e.g., people with fair skin in skin disease diagnosis) but need to serve all possible users (i.e., people with different skin tones). Brain-inspired hyperdimensional computing (HDC) has demonstrated its ability to run ML tasks on edge devices with a small memory footprint; yet, it is unknown whether HDC can satisfy the fairness requirements of such applications (e.g., medical diagnosis for people with different skin tones). In this paper, for the first time, we reveal that vanilla HDC exhibits severe bias due to its sensitivity to color information. Toward a fair and efficient HDC, we propose a holistic framework, namely FE-HDC, which integrates image processing and input compression techniques in HDC’s encoder. Compared with vanilla HDC, results show that the proposed FE-HDC reduces the unfairness score by 90%, achieving fairer architectures with competitively high accuracy.

SESSION: Technical Program: Full-Stack Co-Design for on-Chip Learning in AI Systems

Improving the Robustness and Efficiency of PIM-Based Architecture by SW/HW Co-Design

  • Xiaoxuan Yang
  • Shiyu Li
  • Qilin Zheng
  • Yiran Chen

Processing-in-memory (PIM) based architecture shows great potential to process several emerging artificial intelligence workloads, including vision and language models. Cross-layer optimizations could bridge the gap between computing density and the available resources by reducing the computation and memory cost of the model and improving the model’s robustness against non-ideal hardware effects. We first introduce several hardware-aware training methods to improve the model robustness to the PIM device’s non-ideal effects, including stuck-at-fault, process variation, and thermal noise. Then, we further demonstrate a software/hardware (SW/HW) co-design methodology to efficiently process the state-of-the-art attention-based model on PIM-based architecture by performing sparsity exploration for the attention-based model and circuit-architecture co-design to support the sparse processing.

Hardware-Software Co-Design for On-Chip Learning in AI Systems

  • M. L. Varshika
  • Abhishek Kumar Mishra
  • Nagarajan Kandasamy
  • Anup Das

Spike-based convolutional neural networks (CNNs) are empowered with on-chip learning in their convolution layers, enabling a layer to learn to detect features by combining those extracted in the previous layer. We propose ECHELON, a generalized design template for a tile-based neuromorphic hardware with on-chip learning capabilities. Each tile in ECHELON consists of a neural processing unit (NPU) to implement the convolution and dense layers of a CNN model, an on-chip learning unit (OLU) to facilitate spike-timing dependent plasticity (STDP) in the convolution layer, and a special function unit (SFU) to implement other CNN functions such as pooling, concatenation, and residual computation. These tile resources are interconnected using a shared bus, which is segmented and configured via software to facilitate parallel communication inside the tile. Tiles are themselves interconnected using a classical Network-on-Chip (NoC) interconnect. We propose system software to map CNN models to ECHELON, maximizing performance. We integrate the hardware design and software optimization within a co-design loop to obtain the hardware and software architectures for a target CNN, satisfying both performance and resource constraints. In this preliminary work, we show the implementation of a tile on an FPGA and some early evaluations. Using 8 STDP-enabled CNN models, we show the potential of our co-design methodology to optimize hardware resources.

Towards On-Chip Learning for Low Latency Reasoning with End-to-End Synthesis

  • Vito Giovanni Castellana
  • Nicolas Bohm Agostini
  • Ankur Limaye
  • Vinay Amatya
  • Marco Minutoli
  • Joseph Manzano
  • Antonino Tumeo
  • Serena Curzel
  • Michele Fiorito
  • Fabrizio Ferrandi

The Software Defined Architectures (SODA) Synthesizer is an open-source compiler-based tool able to automatically generate domain-specialized systems targeting Application-Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) starting from high-level programming. SODA is composed of a frontend, SODA-OPT, which leverages the multilevel intermediate representation (MLIR) framework to interface with productive programming tools (e.g., machine learning frameworks), identify kernels suitable for acceleration, and perform high-level optimizations, and of a state-of-the-art high-level synthesis backend, Bambu from the PandA framework, to generate custom accelerators. One specific application of the SODA Synthesizer is the generation of accelerators to enable ultra-low latency inference and control on autonomous systems for scientific discovery (e.g., electron microscopes, sensors in particle accelerators, etc.). This paper provides an overview of the flow in the context of the generation of accelerators for edge processing to be integrated in transmission electron microscopy (TEM) devices, focusing on use cases from precision material synthesis. We show the tool in action with an example of design space exploration for inference on reconfigurable devices with a conventional deep neural network model (LeNet). Finally, we discuss the research directions and opportunities enabled by SODA in the area of autonomous control for scientific experimental workflows.

SESSION: Technical Program: Energy-Efficient Computing for Emerging Applications

Knowledge Distillation in Quantum Neural Network Using Approximate Synthesis

  • Mahabubul Alam
  • Satwik Kundu
  • Swaroop Ghosh

Recent assertions of a potential advantage of Quantum Neural Networks (QNNs) for specific Machine Learning (ML) tasks have sparked the curiosity of a sizable number of application researchers. The parameterized quantum circuit (PQC), a major building block of a QNN, consists of several layers of single-qubit rotations and multi-qubit entanglement operations. The optimum number of PQC layers for a particular ML task is generally unknown. A larger network often provides better performance in noiseless simulations. However, it may perform poorly on hardware compared to a shallower network. Because the amount of noise varies among quantum devices, the optimal depth of the PQC can vary significantly. Additionally, the gates chosen for the PQC may be suitable for one type of hardware but not for another due to compilation overhead. This makes it difficult to generalize a QNN design to a wide range of hardware and noise levels. An alternative approach is to build and train a separate QNN model for each target hardware, which can be expensive. To circumvent these issues, we introduce the concept of knowledge distillation in QNNs using approximate synthesis. The proposed approach creates a new QNN network with (i) a reduced number of layers or (ii) a different gate set without having to train it from scratch. Training the new network for a few epochs can compensate for the loss caused by the approximation error. Through empirical analysis, we demonstrate an ≈71.4% reduction in circuit layers while still achieving ≈16.2% better accuracy under noise.
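
To make the layer-reduction idea concrete, here is a minimal numpy sketch of a parameterized circuit built from single-qubit RY rotations followed by a CNOT entangler, and of how closely a shallower circuit approximates a deeper one (measured by state fidelity). The layer structure, parameter values, and fidelity comparison are illustrative assumptions, not the paper's actual distillation or approximate-synthesis procedure.

```python
import numpy as np

def ry(theta):
    """Single-qubit RY(theta) rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

def pqc_layer(theta0, theta1):
    # one PQC layer: rotations on both qubits, then an entangling CNOT
    return CNOT @ np.kron(ry(theta0), ry(theta1))

def run_pqc(params):
    # params: list of (theta0, theta1) pairs, one per layer, applied to |00>
    state = np.zeros(4)
    state[0] = 1.0
    for theta0, theta1 in params:
        state = pqc_layer(theta0, theta1) @ state
    return state

deep = run_pqc([(0.3, 0.7), (0.5, 0.1), (0.2, 0.9)])   # 3-layer PQC
shallow = run_pqc([(0.8, 1.2)])                         # candidate 1-layer replacement
fidelity = abs(np.vdot(deep, shallow)) ** 2             # how well the shallow circuit matches the deep one
print(fidelity)
```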

NTGAT: A Graph Attention Network Accelerator with Runtime Node Tailoring

  • Wentao Hou
  • Kai Zhong
  • Shulin Zeng
  • Guohao Dai
  • Huazhong Yang
  • Yu Wang

Graph Attention Networks (GAT) have demonstrated better performance on many graph tasks than previous Graph Neural Networks (GNN). However, GAT involves graph attention operations with extra computational complexity. While a large body of existing literature has studied GNN acceleration, few works have focused on the attention mechanism in GAT. The graph attention mechanism changes the computation flow, so previous GNN accelerators cannot support GAT well. Moreover, GAT distinguishes the importance of neighbors, which makes it possible to reduce the workload through runtime tailoring. We present NTGAT, a software-hardware co-design approach to accelerate GAT with runtime node tailoring. Our work comprises both a runtime node tailoring algorithm and an accelerator design. We propose a pipeline sorting method and a hardware unit to support node tailoring during inference. The experiments show that our algorithm can reduce up to 86% of the aggregation workload while incurring only a slight accuracy loss (<0.4%), and the FPGA-based accelerator can achieve up to 3.8× speedup and 4.98× energy efficiency compared to the GPU baseline.
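
The intuition behind runtime tailoring is that attention scores already rank neighbors by importance, so low-scoring neighbors can be dropped before aggregation. The sketch below keeps only the top-scoring fraction of each node's neighbors; the keep ratio, re-normalization, and data layout are illustrative assumptions, not NTGAT's actual algorithm or sorting hardware.

```python
import numpy as np

def gat_aggregate_with_tailoring(h, neighbors, att_scores, keep_ratio=0.5):
    """Aggregate neighbor features, keeping only the most-attended neighbors.

    h          : (N, F) node feature matrix
    neighbors  : dict {node: list of neighbor ids}
    att_scores : dict {node: np.array of attention scores, same order as neighbors}
    keep_ratio : fraction of neighbors kept per node (runtime tailoring)
    """
    out = np.zeros_like(h)
    for v, nbrs in neighbors.items():
        scores = att_scores[v]
        k = max(1, int(len(nbrs) * keep_ratio))
        top = np.argsort(scores)[-k:]                  # indices of the k most important neighbors
        kept = [nbrs[i] for i in top]
        w = np.exp(scores[top])
        w /= w.sum()                                   # re-normalize attention over the kept neighbors
        out[v] = sum(wi * h[u] for wi, u in zip(w, kept))
    return out

# toy graph: node 0 has three neighbors; only the two most attended are aggregated
h = np.random.rand(4, 8)
neighbors = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
att = {0: np.array([0.9, 0.1, 0.5]), 1: np.array([0.2]), 2: np.array([0.3]), 3: np.array([0.4])}
print(gat_aggregate_with_tailoring(h, neighbors, att, keep_ratio=0.7).shape)
```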

A Low-Bitwidth Integer-STBP Algorithm for Efficient Training and Inference of Spiking Neural Networks

  • Pai-Yu Tan
  • Cheng-Wen Wu

Spiking neural networks (SNNs) that enable energy-efficient neuromorphic hardware are receiving growing attention. Training SNNs directly with back-propagation has demonstrated accuracy comparable to deep neural networks (DNNs). However, previous direct-training algorithms require high-precision floating-point operations, which are not suitable for low-power end-point devices. The high-precision operations also require the learning algorithm to run on high-performance accelerator hardware. In this paper, we propose an improved approach that converts the high-precision floating-point operations of an existing direct-training algorithm, the Spatio-Temporal Back-Propagation (STBP) algorithm, into low-bitwidth integer operations. The proposed low-bitwidth Integer-STBP algorithm requires only integer arithmetic for SNN training and inference, which greatly reduces the computational complexity. Experimental results show that the proposed STBP algorithm achieves comparable accuracy and higher energy efficiency than the original floating-point STBP algorithm. Moreover, it can be implemented on low-power end-point devices, which mostly provide only fixed-point hardware, to offer learning capability during inference.
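
The general flavor of replacing floating-point arithmetic with low-bitwidth integer arithmetic can be sketched as follows. This is a generic uniform-quantization example: the bit width, scaling rule, and binary spike representation are assumptions for illustration, not the specific Integer-STBP scheme.

```python
import numpy as np

def quantize(x, bits=8):
    """Map a float tensor to low-bitwidth integers plus a single float scale."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(x))
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32), scale

# fp32 weights and binary spikes; the dot product becomes pure integer arithmetic
w = np.random.randn(64, 128).astype(np.float32)
spikes = (np.random.rand(128) > 0.8).astype(np.int32)   # 0/1 spike vector

w_q, w_scale = quantize(w, bits=8)
int_psp = w_q @ spikes                   # integer-only membrane-potential update
approx = int_psp * w_scale               # rescale only when a float result is needed
exact = w @ spikes
print(np.max(np.abs(approx - exact)))    # small quantization error
```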

TiC-SAT: Tightly-Coupled Systolic Accelerator for Transformers

  • Alireza Amirshahi
  • Joshua Alexander Harrison Klein
  • Giovanni Ansaloni
  • David Atienza

Transformer models have achieved impressive results in various AI scenarios, ranging from vision to natural language processing. However, their computational complexity and their vast number of parameters hinder their implementation on resource-constrained platforms. Furthermore, while loosely-coupled hardware accelerators have been proposed in the literature, data transfer costs limit their speed-up potential. We address this challenge along two axes. First, we introduce tightly-coupled, small-scale systolic arrays (TiC-SATs), governed by dedicated ISA extensions, as functional units to speed up execution. Then, thanks to the tightly-coupled architecture, we employ software optimizations to maximize data reuse, thus lowering miss rates across cache hierarchies. Full-system simulations across various BERT and Vision-Transformer models are employed to validate our strategy, resulting in substantial application-wide speed-ups (e.g., up to 89.5X for BERT-large). TiC-SAT is available as an open-source framework.
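
For intuition on what a small systolic array computes, the sketch below simulates an output-stationary systolic matrix multiplication in plain Python, with operand skewing expressed through the cycle index. The array size, dataflow, and scheduling here are illustrative and are not taken from the TiC-SAT microarchitecture.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    Each PE (i, j) accumulates A[i, k] * B[k, j]; operands are skewed so that
    row i of A and column j of B reach PE (i, j) with the proper delays.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))
    # total cycles: enough for the most-delayed operand pair to meet
    for t in range(k + n + m - 2):
        for i in range(n):
            for j in range(m):
                step = t - i - j          # which reduction element reaches PE (i, j) at cycle t
                if 0 <= step < k:
                    acc[i, j] += A[i, step] * B[step, j]
    return acc

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
print(np.allclose(systolic_matmul(A, B), A @ B))   # True
```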

SESSION: Technical Program: Side-Channel Attacks and RISC-V Security

PMU-Leaker: Performance Monitor Unit-Based Realization of Cache Side-Channel Attacks

  • Pengfei Qiu
  • Qiang Gao
  • Dongsheng Wang
  • Yongqiang Lyu
  • Chunlu Wang
  • Chang Liu
  • Rihui Sun
  • Gang Qu

The Performance Monitor Unit (PMU) is a special hardware module in processors that contains a set of counters to record various architectural and micro-architectural events. In this paper, we propose PMU-Leaker, a novel realization of all existing cache side-channel attacks in which accurate execution time measurements are replaced by information leaked through the PMU. The efficacy of PMU-Leaker is demonstrated by (1) leaking the secret data stored in Intel Software Guard Extensions (SGX) with transient execution vulnerabilities including Spectre and ZombieLoad and (2) extracting the encryption key of a victim AES implementation running in SGX. We perform thorough experiments on a DELL Inspiron 15-7560 laptop with an Intel® Core i5-7200U processor based on the Kaby Lake architecture; the results show that, among the 176 PMU counters, 24 are vulnerable and can be used to launch the PMU-Leaker attack.

EO-Shield: A Multi-Function Protection Scheme against Side Channel and Focused Ion Beam Attacks

  • Ya Gao
  • Qizhi Zhang
  • Haocheng Ma
  • Jiaji He
  • Yiqiang Zhao

Smart devices, especially Internet-connected devices, typically incorporate security protocols and cryptographic algorithms to ensure control flow integrity and information security. However, various invasive and non-invasive attacks try to tamper with these devices. Chip-level active shields have been proven to be an effective countermeasure against invasive attacks, but existing active shields cannot be used to counter side-channel attacks (SCAs). In this paper, we propose a multi-function protection scheme and an active shield prototype to protect against invasive and non-invasive attacks simultaneously. The protection scheme has a complex active shield implemented in the top metal layer of the chip and an information leakage obfuscation module underneath. The leakage obfuscation module generates its protection patterns based on the operating conditions of the circuit that needs to be protected, thus reducing the correlation between electromagnetic (EM) emanations and cryptographic data. We implement the protection scheme on an Advanced Encryption Standard (AES) circuit to demonstrate the effectiveness of the method. Experimental results demonstrate that the information leakage obfuscation module decreases the SNR to below 0.6 and reduces the success rate of SCAs. Compared to existing single-function protection methods against physical attacks, the proposed scheme provides good performance against both invasive and non-invasive attacks.

CompaSeC: A Compiler-Assisted Security Countermeasure to Address Instruction Skip Fault Attacks on RISC-V

  • Johannes Geier
  • Lukas Auer
  • Daniel Mueller-Gritschneder
  • Uzair Sharif
  • Ulf Schlichtmann

Fault-injection attacks are a risk for any computing system executing security-relevant tasks, such as a secure boot process. While hardware-based countermeasures to these invasive attacks have been found to be a suitable option, they have to be implemented via hardware extensions and are thus not available in most commonly used off-the-shelf (COTS) components. Software Implemented Hardware Fault Tolerance (SIHFT) is therefore the only valid option to enhance a COTS system’s resilience against fault attacks. Established SIHFT techniques usually target the detection of random hardware errors for functional safety rather than targeted attacks. Using the example of a secure boot system running on a RISC-V processor, we first show that when the software is hardened with these existing techniques from the safety domain, the vulnerabilities of the boot process to single, double, triple, and quadruple instruction skips cannot be fully closed. We extend these techniques to the security domain and propose the Compiler-assisted Security Countermeasure (CompaSeC). We demonstrate that CompaSeC can close all vulnerabilities for the studied secure boot system. To further reduce performance and memory overheads, we additionally propose a method for CompaSeC to selectively harden individual vulnerable functions without compromising security against the considered instruction skip faults.

Trojan-D2: Post-Layout Design and Detection of Stealthy Hardware Trojans – A RISC-V Case Study

  • Sajjad Parvin
  • Mehran Goli
  • Frank Sill Torres
  • Rolf Drechsler

With the exponential increase in the popularity of the RISC-V ecosystem, the security of this platform must be re-evaluated, especially for mission-critical and IoT devices. Moreover, the insertion of a Hardware Trojan (HT) into a chip after the in-house mask design has been outsourced to an overseas manufacturer for fabrication is a significant source of concern. Although abundant HT detection methods based on side-channel analysis, physical measurements, and functional testing have been investigated to overcome this problem, there exist stealthy HTs that can evade detection, owing to their small overhead compared to the whole circuit.

In this work, we propose several novel HTs that can be placed into a RISC-V core’s post-layout in an untrusted manufacturing environment. Next, we propose a non-invasive analytical method based on contactless optical probing to detect any stealthy HTs. Finally, we propose an open-source library of HTs that can be placed into a processor in the post-layout phase. All designs in this work use a commercial 28nm technology.

SESSION: Technical Program: Simulation and Verification of Quantum Circuits

Graph Partitioning Approach for Fast Quantum Circuit Simulation

  • Jaekyung Im
  • Seokhyeong Kang

Owing to the exponential increase in computational complexity, fast simulation of large quantum circuits has become very difficult. This is an important challenge for the utilization of quantum computers because it is closely related to the verification of quantum computations by classical machines. Hybrid Schrödinger-Feynman simulation seems to be a promising solution, but its applicability is very limited. To overcome this drawback, we propose an improved simulation method based on graph partitioning. Experimental results show that our approach significantly reduces the simulation time of Hybrid Schrödinger-Feynman simulation.

A Robust Approach to Detecting Non-Equivalent Quantum Circuits Using Specially Designed Stimuli

  • Hsiao-Lun Liu
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

As several compilation and optimization techniques have been proposed, equivalence checking for quantum circuits has become essential in design flows. The state-of-the-art approach to this problem observed that even small errors substantially affect the entire quantum system and therefore exploited random simulations to prove the non-equivalence of two quantum circuits. However, when errors occur close to the outputs, that approach struggles to prove the non-equivalence of some non-equivalent quantum circuits within a limited number of simulations. In this work, we propose a novel simulation-based approach using a set of specially designed stimuli. The number of simulation runs of the proposed approach is linear rather than exponential in the number of quantum bits of a circuit. According to the experimental results, the success rate of our approach is 100% (100%) under a simulation-run (execution-time) constraint for a set of benchmarks, while that of the state-of-the-art is only 69% (74%) on average. Our approach also achieves an average speedup of 26×.

Equivalence Checking of Parameterized Quantum Circuits: Verifying the Compilation of Variational Quantum Algorithms

  • Tom Peham
  • Lukas Burgholzer
  • Robert Wille

Variational quantum algorithms have been introduced as a promising class of quantum-classical hybrid algorithms that can already be used with the noisy quantum computing hardware available today by employing parameterized quantum circuits. Considering the non-trivial nature of quantum circuit compilation and the subtleties of quantum computing, it is essential to verify that these parameterized circuits have been compiled correctly. Established equivalence checking procedures that handle parameter-free circuits already exist. However, no methodology capable of handling circuits with parameters has been proposed yet. This work fills this gap by showing that verifying the equivalence of parameterized circuits can be achieved in a purely symbolic fashion using an equivalence checking approach based on the ZX-calculus. At the same time, proofs of inequality can be efficiently obtained with conventional methods by taking advantage of the degrees of freedom inherent to parameterized circuits. We implemented the corresponding methods and proved that the resulting methodology is complete. Experimental evaluations (using the entire parametric ansatz circuit library provided by Qiskit as benchmarks) demonstrate the efficacy of the proposed approach.

Software Tools for Decoding Quantum Low-Density Parity-Check Codes

  • Lucas Berent
  • Lukas Burgholzer
  • Robert Wille

Quantum Error Correction (QEC) is an essential field of research toward the realization of large-scale quantum computers. On the theoretical side, a lot of effort is put into designing error-correcting codes that protect quantum data from errors, which inevitably happen due to the noisy nature of quantum hardware and quantum bits (qubits). Protecting data with an error-correcting code necessitates means to recover the original data, given a potentially corrupted data set, a task referred to as decoding. It is vital that decoding algorithms can recover error-free states in an efficient manner. While theoretical properties of certain QEC methods have been extensively studied, good techniques for analyzing their performance in practically more relevant settings remain a widely unexplored area. In this work, we propose a set of software tools that facilitate numerical experiments with so-called Quantum Low-Density Parity-Check codes (QLDPC codes), a broad class of codes, some of which have recently been shown to be asymptotically good. Based on that, we provide an implementation of a general decoder for QLDPC codes. On top of that, we propose a highly efficient heuristic decoder that eliminates the runtime bottlenecks of the general QLDPC decoder while maintaining comparable decoding performance. These tools make it possible to confirm theoretical results on QLDPC codes in a more practical setting and showcase the value of software tools (in addition to theoretical considerations) for investigating codes for practical applications. The resulting tool, which is publicly available at https://github.com/cda-tum/qecc as part of the Munich Quantum Toolkit (MQT), is meant to provide a playground for the search for “practically good” quantum codes.
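
As a toy illustration of the decoding task itself, the sketch below performs brute-force minimum-weight syndrome decoding on a tiny classical parity-check matrix. Real (Q)LDPC decoders, including the ones in the proposed tools, rely on belief propagation and heuristic post-processing rather than enumeration; the check matrix and weight bound here are arbitrary illustrative choices.

```python
import itertools
import numpy as np

def min_weight_decode(H, syndrome, max_weight=3):
    """Return a lowest-weight error e with H @ e = syndrome (mod 2), or None.

    This exhaustive search is only feasible for tiny codes and serves to define
    the decoding problem, not to solve it efficiently.
    """
    m, n = H.shape
    for w in range(max_weight + 1):
        for support in itertools.combinations(range(n), w):
            e = np.zeros(n, dtype=int)
            e[list(support)] = 1
            if np.array_equal(H @ e % 2, syndrome % 2):
                return e
    return None

# [7,4] Hamming code parity-check matrix as a toy stand-in for a QLDPC check matrix
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
error = np.zeros(7, dtype=int)
error[4] = 1
print(min_weight_decode(H, H @ error % 2))   # recovers the single-bit error
```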

SESSION: Technical Program: Learning x Security in DFM

Enabling Scalable AI Computational Lithography with Physics-Inspired Models

  • Haoyu Yang
  • Haoxing Ren

Computational lithography is a critical research area for the continued scaling of semiconductor manufacturing process technology, enhancing silicon printability via numerical computing methods. Today’s solutions to its two core problems, lithography modeling and mask optimization, are primarily CPU-based and require many thousands of CPUs running for days to tape out a modern chip. We seek AI/GPU-assisted solutions for both problems, aiming to improve both runtime and quality. Prior academic research has proposed using machine learning for lithography modeling and mask optimization, typically cast as image-to-image mapping problems where convolution-layer-backboned UNets and ResNets are applied. However, because little domain knowledge is integrated into these framework designs, such solutions have been limited in their application scenarios or performance. Our method tackles the limitations of previous CNN-based solutions by introducing lithography bias into the neural network design, yielding a much more efficient model and significant performance improvements.

Data-Driven Approaches for Process Simulation and Optical Proximity Correction

  • Hao-Chiang Shao
  • Chia-Wen Lin
  • Shao-Yun Fang

With the continuous shrinking of process nodes, semiconductor manufacturing encounters more and more serious inconsistency between designed layout patterns and the resulting wafer images. Conventionally, examining how a layout pattern deviates from its original shape after complicated process steps, such as optical lithography and subsequent etching, relies on computationally expensive process simulation, which suffers from extremely long runtimes for large-scale circuit layouts, especially in advanced nodes. In addition, as one of the most important and commonly adopted resolution enhancement techniques, optical proximity correction (OPC) corrects image errors due to process effects by moving segment edges or adding extra polygons to mask patterns, while it is generally driven by simulation or time-consuming inverse lithography techniques (ILTs) to achieve acceptable accuracy. As a result, more and more state-of-the-art works on process simulation and/or OPC resort to the fast inference of machine/deep learning. This paper reviews these data-driven approaches to highlight the challenges in various aspects, explore preliminary solutions, and reveal possible future directions to push forward the frontiers of research in design for manufacturability.

Mixed-Type Wafer Failure Pattern Recognition

  • Hao Geng
  • Qi Sun
  • Tinghuan Chen
  • Qi Xu
  • Tsung-Yi Ho
  • Bei Yu

The ongoing evolution in process fabrication enables us to step below the 5nm technology node. Although foundries can pattern and etch smaller but more complex circuits on silicon wafers, a multitude of challenges persists. For example, defects on the surface of wafers are inevitable during manufacturing. To increase the yield rate and reduce time-to-market, it is vital to recognize these failures and identify the failure mechanisms of these defects. Recently, applying machine learning-powered methods to single-type defect pattern classification has made significant progress. However, as processes become increasingly complicated, various single-type defect patterns may emerge and be coupled on a wafer, forming a mixed-type pattern. In this paper, we survey recent progress on advanced methodologies for wafer failure pattern recognition, especially for mixed-type patterns. We sincerely hope this literature review can highlight future directions and promote the advancement of wafer failure pattern recognition.

SESSION: Technical Program: Lightweight Models for Edge AI

Accelerating Convolutional Neural Networks in Frequency Domain via Kernel-Sharing Approach

  • Bosheng Liu
  • Hongyi Liang
  • Jigang Wu
  • Xiaoming Chen
  • Peng Liu
  • Yinhe Han

Convolutional neural networks (CNNs) are typically computationally heavy. Fast algorithms such as fast Fourier transforms (FFTs) are promising in significantly reducing computation complexity by replacing convolutions with frequency-domain element-wise multiplication. However, the increased memory access overhead of the complex weights counteracts the computing benefit, because frequency-domain convolutions not only pad weights to the same size as the input maps but also have no sharable complex kernel weights. In this work, we propose an FFT-based kernel-sharing technique called FS-Conv to reduce memory access. Based on FS-Conv, we derive sharable complex weights in frequency-domain convolutions, a problem that had not been solved before. FS-Conv includes a hybrid padding approach, which utilizes the inherent periodic characteristic of the FFT to provide sharable complex weights for different blocks of complex input maps. In addition, we build a frequency-domain inference accelerator (called Yixin) that can utilize the sharable complex weights for CNN acceleration. Evaluation results demonstrate significant performance and energy efficiency benefits compared with the state-of-the-art baseline.
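
The following numpy sketch shows the basic identity that frequency-domain convolution exploits: circular convolution becomes element-wise multiplication after an FFT. It also makes the memory issue visible, since the small spatial kernel has to be zero-padded to the full map size before the transform. The map and kernel sizes are arbitrary, and this is not FS-Conv's hybrid padding or kernel-sharing scheme.

```python
import numpy as np

x = np.random.rand(32, 32)            # input feature map
k = np.random.rand(3, 3)              # spatial kernel

k_pad = np.zeros_like(x)              # kernel padded to the full input size
k_pad[:3, :3] = k

X = np.fft.fft2(x)
K = np.fft.fft2(k_pad)                # complex weights, now as large as the input map
y_freq = np.real(np.fft.ifft2(X * K)) # element-wise multiply instead of sliding-window MACs

# reference: direct circular convolution of x with the 3x3 kernel
y_ref = np.zeros_like(x)
for di in range(3):
    for dj in range(3):
        y_ref += k[di, dj] * np.roll(np.roll(x, di, axis=0), dj, axis=1)

print(np.allclose(y_freq, y_ref))     # True
```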

Mortar: Morphing the Bit Level Sparsity for General Purpose Deep Learning Acceleration

  • Yunhung Gao
  • Hongyan Li
  • Kevin Zhang
  • Xueru Yu
  • Hang Lu

Vanilla Deep Neural Networks (DNN) after training are represented with native floating-point 32 (fp32) weights. We observe that the bit-level sparsity of these weights is very abundant in the mantissa and can be directly exploited to speed up model inference. In this paper, we propose Mortar, an off-line/on-line collaborative approach for fp32 DNN acceleration, which includes two parts: first, an off-line bit sparsification algorithm that constructs the target formulation by “mantissa morphing”, which maintains higher model accuracy while increasing bit-level sparsity; second, the associated hardware accelerator architecture that speeds up on-line fp32 inference by exploiting the enlarged bit sparsity. We highlight the following results across various deep learning tasks, including image classification, object detection, video understanding, and video & image super-resolution: we (1) increase bit-level sparsity by 1.28~2.51x with only a negligible -0.09~0.23% accuracy loss, (2) maintain on average 3.55% higher model accuracy while increasing bit-level sparsity more than the baseline, and (3) our hardware accelerator outperforms the baseline by up to 4.8x, with an area of 0.031 mm² and power of 68.58 mW.
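
To see what "bit-level sparsity in the mantissa" means, the sketch below inspects the 23 mantissa bits of fp32 values and zeroes the low-order bits, which is the general flavor of trading a tiny amount of precision for more zero bits. The truncation rule and bit budget are illustrative assumptions, not Mortar's actual mantissa-morphing algorithm.

```python
import struct
import numpy as np

def mantissa_bits(value):
    """Return the 23 mantissa bits of an fp32 value as a string of 0/1."""
    raw = struct.unpack('>I', struct.pack('>f', float(value)))[0]
    return format(raw & 0x7FFFFF, '023b')

weights = np.random.randn(1000).astype(np.float32)
zero_fraction = np.mean([mantissa_bits(w).count('0') / 23 for w in weights])
print(f"average fraction of zero mantissa bits: {zero_fraction:.2f}")

def truncate_mantissa(value, keep_bits=8):
    """Keep only the top keep_bits mantissa bits so hardware can skip zeroed positions."""
    raw = struct.unpack('>I', struct.pack('>f', float(value)))[0]
    mask = 0xFFFFFFFF ^ ((1 << (23 - keep_bits)) - 1)   # clear the low mantissa bits
    return struct.unpack('>f', struct.pack('>I', raw & mask))[0]

print(truncate_mantissa(0.123456789, keep_bits=8))      # close to the original value
```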

Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices

  • Yimeng Zhang
  • Akshay Karkal Kamath
  • Qiucheng Wu
  • Zhiwen Fan
  • Wuyang Chen
  • Zhangyang Wang
  • Shiyu Chang
  • Sijia Liu
  • Cong Hao

In this paper, we propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on High-Definition (HD) video stream. First, to enable ultra-light video intelligence, we propose temporal frame-filtering and spatial saliency-focusing approaches to reduce the complexity of massive video data. Second, we exploit structure-aware weight sparsity to design a hardware-friendly model compression method. Third, assisted with data and model complexity reduction, we propose a sparsity-aware, scalable, and low-power accelerator design, aiming to deliver real-time performance with high energy efficiency. Different from existing works, we make a solid step towards the synergized software/hardware co-optimization for realistic MOT model implementation. Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5× latency reduction, 20.9× effective frame rate improvement, 5.83× lower power, and 9.78× better energy efficiency, without much accuracy drop.

Latent Weight-Based Pruning for Small Binary Neural Networks

  • Tianen Chen
  • Noah Anderson
  • Younghyun Kim

Binary neural networks (BNNs) substitute complex arithmetic operations with simple bit-wise operations. The binarized weights and activations in BNNs can drastically reduce memory requirements and energy consumption, making them attractive for edge ML applications with limited resources. However, the severe memory capacity and energy constraints of low-power edge devices call for further reduction of BNN models beyond binarization. Weight pruning is a proven solution for reducing the size of many neural network (NN) models, but the binary nature of BNN weights makes it difficult to identify insignificant weights to remove.

In this paper, we present a pruning method based on latent weights with layer-level pruning sensitivity analysis, which reduces the over-parameterization of BNNs, allowing for accuracy gains while drastically reducing the model size. Our method advocates a heuristic that distinguishes weights by their latent weights, the real-valued vectors used to compute the pseudo-gradient during backpropagation. It is tested using three different convolutional NNs on the MNIST, CIFAR-10, and Imagenette datasets, with results indicating a 33%–46% reduction in operation count with no accuracy loss, improving upon previous works in accuracy, model size, and total operation count.
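
A minimal sketch of the core heuristic, assuming that weights whose latent (real-valued) counterparts have small magnitude are the ones to remove before binarization, is shown below. The per-layer quantile threshold is an illustrative choice; the paper additionally uses layer-level sensitivity analysis to set pruning ratios, which is not modeled here.

```python
import numpy as np

def prune_and_binarize(latent_w, prune_ratio=0.4):
    """Prune BNN weights by latent-weight magnitude, then binarize the survivors.

    latent_w    : real-valued latent weights kept by the BNN optimizer
    prune_ratio : fraction of weights to remove (set to exactly zero)
    Returns the {-1, 0, +1} weight tensor used at inference.
    """
    threshold = np.quantile(np.abs(latent_w).ravel(), prune_ratio)   # layer-wise magnitude threshold
    mask = (np.abs(latent_w) > threshold).astype(np.int8)
    return np.sign(latent_w).astype(np.int8) * mask

latent = np.random.randn(128, 128).astype(np.float32)
w = prune_and_binarize(latent, prune_ratio=0.4)
print((w == 0).mean())   # roughly 40% of the weights removed
```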

SESSION: Technical Program: Design Automation for Emerging Devices

AutoFlex: Unified Evaluation and Design Framework for Flexible Hybrid Electronics

  • Tianliang Ma
  • Zhihui Deng
  • Leilai Shao

Flexible hybrid electronics (FHE), integrating high-performance silicon chips with multi-functional sensors and actuators on flexible substrates, can be intimately attached onto irregular surfaces without compromising their functionalities, thus enabling more innovations in healthcare, the Internet of Things (IoT), and various human-machine interfaces (HMIs). Recent developments in compact models and process design kits (PDKs) for flexible electronics have made designs of small to medium flexible circuits feasible. However, the absence of a unified model and comprehensive evaluation benchmarks for flexible electronics makes it infeasible for a designer to fairly compare different flexible technologies and to explore potential design options for a heterogeneous FHE design. In this paper, we present AutoFlex, a unified evaluation and design framework for flexible hybrid electronics, in which device parameters can be extracted automatically and performance can be evaluated comprehensively from the device level and digital blocks up to large-scale digital circuits. Moreover, a ubiquitous FHE sensor acquisition system, including a flexible multi-functional sensor array, scan drivers, amplifiers, and a silicon-based analog-to-digital converter (ADC), is developed to reveal the design challenges of a representative FHE system.

CNFET7: An Open Source Cell Library for 7-nm CNFET Technology

  • Chenlin Shi
  • Shinobu Miwa
  • Tongxin Yang
  • Ryota Shioya
  • Hayato Yamaki
  • Hiroki Honda

In this paper, we propose CNFET7, the first open-source cell library for 7-nm carbon nanotube field-effect transistor (CNFET) technology. CNFET7 is based on an open-source CNFET SPICE model called VS-CNFET, and various model parameters such as the channel width and carbon nanotube diameter are carefully tuned to mimic the predictive 7-nm CNFET technology presented in a published paper. Some undisclosed parameters, such as the cell size and pin layout, are derived from those of the NanGate 15-nm open-source cell library, in the same way as an existing open-source framework for CNFET circuit design. CNFET7 includes two types of delay models (i.e., the composite current source and the nonlinear delay model), each having 56 cells, such as INV_X1 and BUF_X1. CNFET7 supports both logic synthesis and timing-driven place and route in the Cadence design flow. Our experimental results for several synthesized circuits show that CNFET7 achieves reductions of up to 96%, 62%, and 82% in dynamic power consumption, static power consumption, and critical-path delay, respectively, when compared with ASAP7.

A Global Optimization Algorithm for Buffer and Splitter Insertion in Adiabatic Quantum-Flux-Parametron Circuits

  • Rongliang Fu
  • Mengmeng Wang
  • Yirong Kan
  • Nobuyuki Yoshikawa
  • Tsung-Yi Ho
  • Olivia Chen

As a highly energy-efficient application of low-temperature superconductivity, the adiabatic quantum-flux-parametron (AQFP) logic circuit exhibits extremely low power consumption, making it an attractive candidate for extremely energy-efficient computing systems. Since logic gates are driven by an alternating current (AC) that serves as the clock signal in AQFP circuits, a large number of AQFP buffers is required to ensure that the dataflow is synchronized at all logic levels of the circuit. Meanwhile, since currently developed AQFP logic gates can only drive a single output, splitters are required for logic gates to drive multiple fanouts. These cells take up a significant amount of the circuit’s area and delay. This paper proposes a global optimization algorithm for buffer and splitter (B/S) insertion to address these issues. B/S insertion is first identified as a combinatorial optimization problem, and a dynamic programming formulation is presented to find the globally optimal solution. Because its search space is impractically large, an integer linear programming formulation is then proposed to approximate the global optimum of B/S insertion. Experimental results on the ISCAS’85 and simple arithmetic benchmark circuits show the effectiveness of the proposed method, with average reductions of 8.22% and 7.37% in the number of buffers and splitters inserted compared to the state-of-the-art methods from ICCAD’21 and DAC’22, respectively.
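
To illustrate why B/S insertion dominates AQFP cost, the toy sketch below counts the buffers and splitters a naive, non-optimized insertion would need: one buffer per level a signal must wait through, and a binary splitter tree with f leaves needing f - 1 splitters. It deliberately ignores the logic levels occupied by the splitters themselves, which the paper's dynamic programming and ILP formulations do account for; the level assignment and netlist are made up for illustration.

```python
def naive_bs_count(edges, level):
    """edges: list of (driver, sink) pairs; level: dict gate -> logic level."""
    buffers, splitters = 0, 0
    fanout = {}
    for u, v in edges:
        fanout[u] = fanout.get(u, 0) + 1
        # one buffer for every level the signal waits between driver and sink
        buffers += max(0, level[v] - level[u] - 1)
    for f in fanout.values():
        # a binary splitter tree with f leaves needs f - 1 splitters
        splitters += max(0, f - 1)
    return buffers, splitters

level = {'a': 0, 'b': 0, 'g1': 1, 'g2': 3}
edges = [('a', 'g1'), ('a', 'g2'), ('b', 'g1'), ('g1', 'g2')]
print(naive_bs_count(edges, level))   # (3, 1): path balancing needs 3 buffers, 'a' needs a splitter
```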

FLOW-3D: Flow-Based Computing on 3D Nanoscale Crossbars with Minimal Semiperimeter

  • Sven Thijssen
  • Sumit Kumar Jha
  • Rickard Ewetz

The emergence of data-intensive applications has spurred interest in in-memory computing using nanoscale crossbars. Flow-based in-memory computing is a promising approach for evaluating Boolean logic using the natural flow of electrical currents. While automated synthesis approaches have been developed for 2D crossbars, 3D crossbars have advantageous properties in terms of density, area, and performance. In this paper, we propose the first framework for performing flow-based computing using 3D crossbars. The framework, FLOW-3D, automatically synthesizes a Boolean function into a crossbar design. FLOW-3D is based on an analogy between BDDs and crossbars, resulting in the synthesis of 3D crossbar designs with minimal semiperimeter. A BDD with n nodes is mapped to a 3D crossbar with (n + k) metal wires, where the k extra metal wires are needed to handle hardware-imposed constraints. Compared with the state-of-the-art synthesis tool for 2D crossbars, FLOW-3D improves semiperimeter, area, energy consumption, and latency by up to 61%, 84%, 37%, and 41%, respectively, on 15 RevLib benchmarks.

SLIP’22 TOC

SLIP ’22: Proceedings of the 24th ACM/IEEE Workshop on System Level Interconnect Pathfinding

 Full Citation in the ACM Digital Library

SESSION: Breaking the Interconnect Limits

Session details: Breaking the Interconnect Limits

  • Ismail Bustany

Multi-Die Heterogeneous FPGAs: How Balanced Should Netlist Partitioning be?

  • Raveena Raikar
  • Dirk Stroobandt

High-capacity multi-die FPGA systems generally consist of multiple dies connected by external interposer lines. These external connections are limited in number. Further, these connections also contribute to a higher delay as compared to the internal network on a monolithic FPGA and should therefore be sparsely used. These architectural changes compel the placement & routing tools to minimize the number of signals at the die boundary. Incorporating a netlist partitioning step in the CAD flow can help to minimize the overall number of signals using the cross-die connections.

Conventional partitioning techniques focus on minimizing the cut edges at the cost of generating unequal-sized partitions. Such highly unbalanced partitions can affect the overall placement & routing quality by causing congestion on the denser die. Moreover, this can also negatively impact the overall runtime of the placement & routing tools as well as the FPGA resource utilization.

In previous studies, a low unbalance value was proposed to generate equal-sized partitions. In this work, we investigate the factors that influence netlist partitioning quality for a multi-die FPGA system. A die-level partitioning step, performed using hMETIS, is incorporated into the flow before the packing step. Large heterogeneous circuits from the Koios benchmark suite are used to analyze the partitioning-packing results. Consequently, we examine how the output unbalance and the number of cut edges vary with the input unbalance value. We propose an empirically optimal value of the unbalance factor for achieving the desired partitioning quality on the Koios benchmark suite.

Limiting Interconnect Heating in Power-Driven Physical Synthesis

  • Xiuyan Zhang
  • Shantanu Dutt

Current VLSI technology trends include sub-10 nm nodes and 3D ICs. Unfortunately, due to significantly increased Joule heating in these technologies, interconnect reliability has become a significant casualty. In this paper, we explore how interconnect power dissipation (CV²/2 per logic transition) and thus heating can be effectively constrained during a power-optimizing physical synthesis (PS) flow that applies three different PS transformations: cell sizing, Vth assignment, and cell replication; the latter is particularly useful for limiting interconnect heating. Other constraints considered are timing, slew, and cell fanout load. To address this multi-constraint power-optimization problem effectively, we apply the aforementioned three transforms simultaneously (as opposed to sequentially in some order) as well as simultaneously across all cells of the circuit, using a novel discrete optimization technique called discretized network flow (DNF). We applied our algorithm to the ISPD-13 benchmark circuits: the ISPD-13 competition targeted power optimization with cell-sizing and Vth-assignment transforms under timing, slew, and cell fanout load constraints; to these we added the interconnect heating constraint and the cell replication transform, a much harder transform to engineer in a simultaneous-consideration framework than the other two. Results show the significant efficacy of our techniques.
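
For orientation on the magnitude of the quantities being constrained, the switching energy per transition and the average switching power of a net follow the usual relations below; the numeric values are illustrative assumptions, not figures from the paper.

```latex
E_{\mathrm{transition}} = \frac{C V^{2}}{2}, \qquad
P_{\mathrm{avg}} = \alpha \, f \cdot \frac{C V^{2}}{2}
```

For example, with a wire capacitance C = 2 fF, supply V = 0.7 V, activity factor α = 0.2, and clock frequency f = 1 GHz, each transition dissipates about 0.49 fJ and the net's average switching power is roughly 98 nW.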

SESSION: 2.5D/3D Extension for High-Performance Computing

Session details: 2.5D/3D Extension for High-Performance Computing

  • Pascal Vivet

Opportunities of Chip Power Integrity and Performance Improvement through Wafer Backside (BS) Connection: Invited Paper

  • Rongmei Chen
  • Giuliano Sisto
  • Odysseas Zografos
  • Dragomir Milojevic
  • Pieter Weckx
  • Geert Van der Plas
  • Eric Beyne

Technology node scaling is driven by the need to increase system performance, but it also leads to a significant power integrity bottleneck due to the associated back-end-of-line (BEOL) scaling. Power integrity degradation induced by on-chip Power Delivery Network (PDN) IR drop results from the increased power density, the increased number of metal layers in the BEOL, and their resistivity. Meanwhile, signal routing limits SoC performance improvements due to increased routing congestion and delays. To overcome these issues, we introduce a disruptive technology: wafer backside (BS) connection, to realize chip BS PDN (BSPDN) and BS signal routing. We first provide some key wafer process features developed at imec to enable this technology. Further, we show its benefits by demonstrating large improvements in chip power integrity and performance after applying BSPDN and BS routing to a design with sub-2nm technology node design rules. Challenges and the outlook for BS technology are also discussed before concluding the paper.

SESSION: Compute-in-Memory and Design of Structured Compute Arrays

Session details: Compute-in-Memory and Design of Structured Compute Arrays

  • Shantanu Dutt

An Automated Design Methodology for Computational SRAM Dedicated to Highly Data-Centric Applications: Invited Paper

  • A. Philippe
  • L. Ciampolini
  • M. Gerbaud
  • M. Ramirez-Corrales
  • V. Egloff
  • B. Giraud
  • J.-P. Noel

To meet the performance requirements of highly data-centric applications (e.g. edge-AI or lattice-based cryptography), Computational SRAM (C-SRAM), a new type of computational memory, was designed as a key element of an emerging computing paradigm called near-memory computing. For this particular type of applications, C-SRAM has been specialized to perform low-latency vector operations in order to limit energy-intensive data transfers with the processor or dedicated processing units. This paper presents a design methodology that aims at making the C-SRAM design flow as simple as possible by automating the configuration of the memory part (e.g. number of SRAM cuts and access ports) according to system constraints (e.g. instruction frequency or memory capacity) and off-the-shelf SRAM compilers. In order to fairly quantify the benefits of the proposed memory selector, it has been evaluated with three different CMOS process technologies from two different foundries. The results show that this memory selection methodology makes it possible to determine the best memory configuration whatever the CMOS process technology and the trade-off between area and power consumption. Furthermore, we also show how this methodology could be used to efficiently assess the level of design optimization of available SRAM compilers in a targeted CMOS process technology.

A Machine Learning Approach for Accelerating SimPL-Based Global Placement for FPGA’s

  • Tianyi Yu
  • Nima Karimpour Darav
  • Ismail Bustany
  • Mehrdad Eslami Dehkordi

Many commercial FPGA placement tools are based on the SimPL framework, where the Lower Bound (LB) phase optimizes wire length and timing without considering cell overlaps and the Upper Bound (UB) phase spreads out cells while considering the target FPGA architecture. In the SimPL framework, the number of iterations depends on design complexity and the quality of the UB placement, which highly impacts runtime. In this work, we propose a machine learning (ML) scheme in which the anchor weights of cells are dynamically adjusted to make the process converge within a pre-determined budget for the number of iterations. In our approach, for a given FPGA architecture, an ML model constructs a trajectory guide function that is used to adjust anchor weights during SimPL’s iterations. Our experimental results on industrial benchmarks show that we achieve, on average, 28.01% and 4.7% reductions in the runtime of global placement and of the whole placer, respectively, while maintaining the quality of solutions within an acceptable range.

SESSION: Interconnect Performance Estimation Techniques

Session details: Interconnect Performance Estimation Techniques

  • Rasit Topaloglu

Neural Network Model for Detour Net Prediction

  • Jaehoon Ahn
  • Taewhan Kim

Identifying nets in a placement that are likely to be detoured during routing is very useful because (1) in conjunction with routing congestion, path timing, or design rule violation (DRV) prediction, predicting detour nets can serve as a complementary means of characterizing the outcome of those predictions in more depth, and (2) we can place more importance on the predicted detour nets when optimizing timing and routing resources in the early placement stage, since those nets consume more timing budget as well as metal/via resources. In this context, this work proposes a neural network based detour net prediction model. Our proposed model consists of two parts: a CNN-based part and an ANN-based part. The CNN-based part processes features describing various physical proximity maps or states, while the ANN-based part processes the features of individual nets in the form of vector descriptions concatenated to the CNN outputs. Through experiments, we analyze and assess the accuracy of our prediction model in terms of F1 score and its complementary role in timing prediction and optimization. More specifically, our proposed model improves the prediction accuracy by 9.9% on average compared with the conventional (vanilla ANN based) detour net prediction model. Furthermore, linking our prediction model to a state-of-the-art timing optimization flow of a commercial tool reduces the worst negative slack by 18.4%, the total negative slack by 40.8%, and the number of timing violation paths by 30.9% on average.
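
A minimal PyTorch sketch of such a two-part model, assuming per-net crops of four physical-proximity maps and a 16-dimensional per-net feature vector, is shown below; the layer counts, channel widths, and input sizes are guesses for illustration only, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class DetourNetPredictor(nn.Module):
    """Illustrative CNN + ANN hybrid in the spirit described above."""
    def __init__(self, n_maps=4, n_net_features=16):
        super().__init__()
        self.cnn = nn.Sequential(                       # processes per-net crops of proximity maps
            nn.Conv2d(n_maps, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.ann = nn.Sequential(                       # consumes CNN output concatenated with the net's feature vector
            nn.Linear(32 + n_net_features, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, maps, net_vec):
        x = torch.cat([self.cnn(maps), net_vec], dim=1)
        return torch.sigmoid(self.ann(x))               # probability that the net will be detoured

model = DetourNetPredictor()
maps = torch.rand(8, 4, 32, 32)     # 8 nets, 4 proximity maps of 32x32 each
net_vec = torch.rand(8, 16)         # 8 per-net feature vectors
print(model(maps, net_vec).shape)   # torch.Size([8, 1])
```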

Machine-Learning Based Delay Prediction for FPGA Technology Mapping

  • Hailiang Hu
  • Jiang Hu
  • Fan Zhang
  • Bing Tian
  • Ismail Bustany

Accurate delay prediction is important in the early stages of logic and high-level synthesis. In technology mapping for field programmable gate arrays (FPGAs), a gate-level circuit is transcribed into a lookup table (LUT)-level circuit. Quick timing analysis of the pre-mapped circuit is necessary to guide downstream optimizations. However, before technology mapping, a static timing analyzer is too slow due to its complexity and, like other faster empirical heuristics, highly inaccurate. In this work, we present a machine learning based framework for accurately and efficiently estimating the delay of a gate-level circuit by predicting the depth of the corresponding LUT logic after technology mapping. Our experimental results show that the proposed method achieves a 56x accuracy improvement compared to the existing delay estimation heuristic. Instead of running the mapper to obtain the ground truth, our delay estimator saves 87.5% of the runtime with negligible error.

Who’s Wang Ying

February 2023

Wang Ying

Associate Professor

Institute of Computing Technology, Chinese Academy of Sciences.

Email:

wangying2009@ict.ac.cn

Personal webpage

https://wangying-ict.github.io/

Research interests

Domain-Specific chips, processor architecture and design automation

Short bio

Ying Wang is an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences. Wang’s research expertise is focused on VLSI testing, reliability, and the design automation of domain-specific processors such as accelerators for computer vision, deep learning, graph computing, and robotics. His group has conducted pioneering work on open-source frameworks for automated neural network accelerator generation and customization. He has published more than 30 papers at DAC and over 120 papers at other IEEE/ACM conferences and in journals. He holds over 30 patents related to chip design. Wang is also a co-founder of Jeejio Tech in Beijing, which was granted the Special Start-up Award of the year 2018 by the Chinese Academy of Sciences. Wang’s honors also include the Young Talent Development Program award from the China Association for Science and Technology (as one of the two awardees in computer science in 2016), the 2017 CCF-Intel outstanding researcher award, and the 2019 Early Career Award from the China Computer Federation. He is the recipient of the DAC Under-40 Innovators Award in 2021. Dr. Wang has also received several awards from international conferences, including winning the System Design Contest at DAC 2018 and IEEE Rebooting LPIRC 2016, the Best Paper Award at ITC-Asia 2018, GLSVLSI 2021 (2nd place), and ICCD 2019, the best paper of IEEE Transactions on Computers in 2011, as well as a best paper nomination at ASPDAC.

Research highlights

Dr. Wang’s innovative research in the DeepBurning project has significantly contributed to one of the viable approaches toward automatic specialized accelerator generation and is considered one of the representative works in this area: starting from the software framework to directly generate a specialized circuit design implemented on FPGAs or ASICs. After the initial DeepBurning 1.0 project, he has continued to pioneer several ongoing works, including ELNA (DAC 2017), Dadu (DAC 2018), 3D-Cube (DAC 2019), DeepBurning-GL (ICCAD 2020), and DeepBurning-Seg (MICRO 2022), which follow the same technical route of automatic hardware accelerator generation but extend it to different applications and architectures. The DeepBurning series not only expands horizontally into different areas but also vertically into the higher levels of the processor design stack, including early-stage design parameter exploration, ISA extension, and compiler-hardware co-design. In general, his holistic work in this field has attracted considerable attention from Chinese EDA companies. Based on the agile chip customization technology initiated by Dr. Wang, his company, Jeejio, is able to develop highly customized chip solutions at a relatively low cost and help its customers stay competitive in niche IoT markets. Dr. Wang’s team has proposed the RISC-V compatible Sage architecture, which can be used to customize AIoT SoC solutions with user-redeemable computing power for audio/video/image processing and automotive scenarios.

Who’s Heba Abunahla

January 2023

Heba Abunahla

Assistant Professor

Quantum and Computer Engineering department, TU Delft, Netherlands.

Email:

Heba.nadhmi@gmail.com

Research interests

• Emerging RRAM devices
• Smart sensors
• Hardware security
• Graphene-based electronics
• CNTs-based electronics
• Neuromorphic computing

Short bio

Heba Abunahla is currently an Assistant Professor at the Quantum and Computer Engineering department, Delft University of Technology. Abunahla received the BSc, MSc, and PhD degrees (with honors) from United Arab Emirates University, the University of Sharjah, and Khalifa University, respectively, via competitive scholarships. Prior to joining TU Delft as an Assistant Professor, Abunahla spent over five years as a Postdoctoral Fellow and Research Scientist, working extensively on the design, fabrication, and characterization of emerging memory devices with great emphasis on computing, sensing, and security applications.

Abunahla owns two patents, has published one book, and has co-authored over 30 conference and journal papers. In 2017, Abunahla had a collaborative project with the University of Adelaide, Australia, on developing a novel non-enzymatic glucose sensor. In recognition of her achievements, she received Australian Global Talent Permanent Residency in 2021. Moreover, Abunahla’s innovation of deploying emerging RRAM devices in neuromorphic computing was selected by Nature Scientific Reports as among the Top 100 in Materials Science. Also, her recent achievement in fabricating RRAM-based tunable filters was selected for publication in the first issue of Innovation@UAE Magazine, launched by the Ministry of Education.

Abunahla has received several awards and competitive scholarships. For example, she is the recipient of the Unique Fellowship for Top Female Academic Scientists in Electrical Engineering, Mathematics & Computer Science (2022) from Delft University of Technology. Abunahla serves as a Lead Editor for Frontiers in Neuroscience. She is an active reviewer for several high-impact journals and conferences.

Research highlights

Secure intelligent memory and sensors are crucial components of our daily electronic devices and systems. CMOS has been the core technology providing these capabilities for decades. However, its limitations in power and area have led to the need for an alternative technology, Resistive RAM (RRAM). RRAM devices can perform memory and computation in the same cell, which enables in-memory computation. Moreover, RRAM can be deployed as a smart sensor thanks to its ability to change its I-V characteristic in response to the surrounding environment. The inherent stochasticity of RRAM junctions is also a great asset for security applications.

Abunahla has built strong expertise in the field of microelectronics design, modeling, fabrication, and characterization of high-performance and high-density memory devices. Abunahla has developed novel RRAM devices that have been uniquely deployed in sensing, computing, security, and communication applications. For instance, Abunahla demonstrated a novel approach to measuring glucose levels for an adult human and demonstrated the ability to fabricate such a biosensor using a simple, low-cost standard photolithography process. In contrast to other sensors, the developed sensor can accurately measure glucose levels at neutral pH conditions (i.e., pH=7). Abunahla filed a US patent for this device, and all details of the innovation are published in the prestigious Nature Scientific Reports. Being unique and cutting-edge in nature, this work has great commercialization potential, and Abunahla is currently working with her team toward providing a lab-on-chip sensing approach based on this technology.

Furthermore, Abunahla has recently invented flexible memory devices, namely NeuroMem, that can mimic the memorization behavior of the brain. This unique feature makes NeuroMem a potential candidate for emerging in-memory-computing applications. This work is the first to report on the great potential of this technology for Artificial Intelligence (AI) inference on edge devices. Abunahla filed a US patent for this innovation and published the work in the prestigious Nature Scientific Reports. Further, her innovative research on using nanoscale devices for Gamma-ray sensing with sol-gel/drop-coated micro-thick nanomaterials is very unique and has been filed as a US patent and published by the prestigious Journal of Materials Chemistry and Physics. Moreover, Abunahla has fabricated novel RRAM-based tunable filters, which prove the possibility of tuning RF devices without any localized surface-mount device (SMD) element or complex realization technique. In the field of hardware security, Abunahla developed an efficient flexible RRAM-based true random number generator, named SecureMem. The data generated by the SecureMem prototype passed all NIST tests without any post-processing or hardware overhead.

Who’s Aman Arora

December, 2022

Aman Arora

Graduate Fellow

The University of Texas at Austin

Email:

aman.kbm@utexas.edu

Personal webpage

https://amanarora.site

Research interests

Reconfigurable computing, Domain-specific acceleration, Hardware for Machine Learning

Short bio

Aman Arora is a Graduate Fellow and Ph.D. Candidate in the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research vision is to minimize the gap between ASICs and FPGAs in terms of performance and efficiency, and to minimize the gap between CPUs/GPUs and FPGAs in terms of programmability. He imagines a future where FPGAs are first-class citizens in the world of computing and the first choice for accelerating new workloads. His Ph.D. dissertation research focuses on the search for efficient reconfigurable fabrics for Deep Learning (DL) by proposing new DL-optimized blocks for FPGAs. His research has resulted in 11 papers published in top conferences and journals in the fields of reconfigurable computing and computer architecture and design. His work received a Best Paper Award at the IEEE FCCM conference in 2022, and he currently holds a fellowship from the UT Austin Graduate School. His research has been funded by the NSF. Aman has served as a secondary reviewer for top conferences such as ACM FPGA (in 2021 and 2022). He is also the leader of the AI+FPGA committee at the Open-Source FPGA (OSFPGA) Foundation, where he leads research efforts and organizes training webinars. He has 12 years of experience in the semiconductor industry in design, verification, testing, and architecture roles. Most recently, he worked in the GPU Deep Learning architecture team at NVIDIA.

Research highlights

Aman’s past and current research focuses on architecting efficient reconfigurable acceleration substrates (or fabrics) for Deep Learning (DL). With Moore’s law slowing down, the requirements of resource-hungry applications like DL growing and changing rapidly, and climate change already knocking at our doors, this research theme has never been more relevant and important.

Aman has proposed changing the architecture of FPGAs to make them better DL accelerators. He proposed replacing a portion of the FPGA’s programmable logic area with new blocks called Tensor Slices, which are specialized for matrix operations like matrix-matrix multiplication and matrix-vector multiplication that are common in DL workloads. The FPGA industry has, in parallel, developed similar blocks such as the Intel AI Tensor Block and the Achronix Machine Learning Processor.

In addition, Aman proposed adding compute capabilities to the on-chip memory blocks on FPGAs, so they can operate on data without having to move it to compute units on the FPGA. He was the first to exploit the dual-port nature of FPGA BRAMs to design these blocks, instead of using technologies that significantly impact the circuitry of the RAM array and degrade its performance. He calls these new blocks CoMeFa RAMs. This work won the Best Paper Award at IEEE FCCM 2022.

Aman also led a team effort spanning three universities – UT Austin, University of Toronto, and University of New Brunswick – to develop an open-source DL benchmark suite called Koios. These benchmarks can be used to perform FPGA architecture and CAD research, and are integrated into VTR, which is the most popular open-source FPGA CAD flow.

Other research projects Aman has worked on or is working on include: (1) developing a parallel reconfigurable spatial acceleration fabric consisting of PIM (Processing-In-Memory) blocks connected using an FPGA-like interconnect, (2) implementing efficient accelerators for Weightless Neural Networks (WNNs) on FPGAs, (3) enabling support for open-source tools in FPGA research tools like COFFE, and (4) using Machine Learning (ML) to perform cross-prediction of power consumption on FPGAs and developing an open-source dataset that can be widely used for such prediction problems.

Aman hopes to start and direct a research lab at a university soon. His future research will straddle the entire stack of computer engineering: programmability at the top, architecture exploration in the middle, and hardware design at the bottom. The research thrusts he plans to focus on are next-gen reconfigurable fabrics, ML and FPGA related tooling, enabling the creation of an FPGA app store, and sustainable acceleration.

ICCAD’22 TOC

ICCAD ’22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

 Full Citation in the ACM Digital Library

SESSION: The Role of Graph Neural Networks in Electronic Design Automation

Session details: The Role of Graph Neural Networks in Electronic Design Automation

  • Jeyavijayan Rajendran

Why are Graph Neural Networks Effective for EDA Problems?: (Invited Paper)

  • Haoxing Ren
  • Siddhartha Nath
  • Yanqing Zhang
  • Hao Chen
  • Mingjie Liu

In this paper, we discuss the source of effectiveness of Graph Neural Networks (GNNs) in EDA, particularly in the VLSI design automation domain. We argue that the effectiveness comes from the fact that GNNs implicitly embed the prior knowledge and inductive biases associated with given VLSI tasks, which is one of the three approaches to making a learning algorithm physics-informed. These inductive biases are different from those commonly used in GNNs designed for other structured data, such as social networks and citation networks. We illustrate this principle with several recent GNN examples in the VLSI domain, including predictive tasks such as switching activity prediction, timing prediction, parasitics prediction, and layout symmetry prediction, as well as optimization tasks such as gate sizing and macro and cell transistor placement. We also discuss the challenges of applying GNNs and the opportunity of applying self-supervised learning techniques with GNNs for VLSI optimization.
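
The common thread across these applications is the message-passing step that lets each cell or net aggregate information from its neighbors. The following is a minimal, illustrative sketch of one such layer on a toy netlist-style graph; the graph, features, and weights are made up and not taken from the paper.

```python
# A single mean-aggregation message-passing layer over a tiny netlist-style
# graph (nodes = cells, edges = connections). Purely illustrative.
import numpy as np

def message_passing_layer(adj, features, weight):
    """Aggregate neighbor features by mean, then apply a linear transform + ReLU."""
    degree = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    aggregated = adj @ features / degree
    return np.maximum(0.0, (features + aggregated) @ weight)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0],          # 4-cell toy graph
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
x = rng.normal(size=(4, 8))            # hypothetical per-cell features
w = rng.normal(size=(8, 8))            # trainable layer weights
h = message_passing_layer(adj, x, w)   # task-specific inductive bias comes from
print(h.shape)                         # how the graph and features are constructed
```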

On Advancing Physical Design Using Graph Neural Networks

  • Yi-Chen Lu
  • Sung Kyu Lim

As modern Physical Design (PD) algorithms and methodologies evolve into the post-Moore era with the aid of machine learning, Graph Neural Networks (GNNs) are becoming increasingly ubiquitous given that netlists are essentially graphs. Recently, their ability to perform effective graph learning has provided significant insights into the underlying dynamics during netlist-to-layout transformations. GNNs follow a message-passing scheme, where the goal is to construct meaningful representations at either the graph or node level by recursively aggregating and transforming the initial features. In the realm of PD, the GNN-learned representations have been leveraged to solve tasks such as cell clustering, quality-of-result prediction, activity simulation, etc., which often overcome the limitations of traditional PD algorithms. In this work, we first revisit recent advancements that GNNs have made in PD. Second, we discuss how GNNs serve as the backbone of novel PD flows. Finally, we present our thoughts on ongoing and future PD challenges that GNNs can tackle and succeed at.

Applying GNNs to Timing Estimation at RTL

  • Daniela Sánchez Lopera
  • Wolfgang Ecker

In the Electronic Design Automation (EDA) flow, signoff checks, such as timing analysis, are performed only after physical synthesis. Encountered timing violations cause re-iterations of the design flow. Hence, timing estimations at initial design stages, such as Register Transfer Level (RTL), would increase the quality of results and reduce the number of flow iterations. Machine learning has been used to estimate the timing behavior of chip components. However, existing solutions map EDA objects to Euclidean data without considering that EDA objects are naturally represented as graphs. Recent advances in Graph Neural Networks (GNNs) motivate the mapping from EDA objects to graphs for design metric prediction tasks at different stages. This paper maps RTL designs to directed, featured graphs with multidimensional node and edge features. These are the input to GNNs for estimating component delays and slews. An in-house hardware generation framework and open-source EDA tools for ASIC synthesis are employed for collecting training data. Experiments over unseen circuits show that GNN-based models are promising for timing estimation, even when the features come from early RTL implementations. Based on estimated delays, critical areas of the design can be detected, and proper RTL micro-architectures can be chosen without running long design iterations.

Embracing Graph Neural Networks for Hardware Security

  • Lilas Alrahis
  • Satwik Patnaik
  • Muhammad Shafique
  • Ozgur Sinanoglu

Graph neural networks (GNNs) have attracted increasing attention due to their superior performance in deep learning on graph-structured data. GNNs have succeeded across various domains such as social networks, chemistry, and electronic design automation (EDA). Electronic circuits have a long history of being represented as graphs, and to no surprise, GNNs have demonstrated state-of-the-art performance in solving various EDA tasks. More importantly, GNNs are now employed to address several hardware security problems, such as detecting intellectual property (IP) piracy and hardware Trojans (HTs), to name a few.

In this survey, we first provide a comprehensive overview of the usage of GNNs in hardware security and propose the first taxonomy to divide the state-of-the-art GNN-based hardware security systems into four categories: (i) HT detection systems, (ii) IP piracy detection systems, (iii) reverse engineering platforms, and (iv) attacks on logic locking. We summarize the different architectures, graph types, node features, benchmark data sets, and model evaluation of the employed GNNs. Finally, we elaborate on the lessons learned and discuss future directions.

SESSION: Compiler and System-Level Techniques for Efficient Machine Learning

Session details: Compiler and System-Level Techniques for Efficient Machine Learning

  • Sri Parameswaran
  • Martin Rapp

Fine-Granular Computation and Data Layout Reorganization for Improving Locality

  • Mahmut Kandemir
  • Xulong Tang
  • Jagadish Kotra
  • Mustafa Karakoy

While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), one of the important characteristics of prior approaches is that they transform the loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarse-grain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third, and more importantly, such coarse-grain approaches are local by nature and struggle to achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves data locality and outperforms existing schemes: 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.
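
To make the notion of a loop transformation matrix concrete, the sketch below applies a unimodular loop-interchange matrix to a small iteration space; the example is illustrative and not taken from the paper.

```python
# Applying a unimodular loop-transformation matrix (here, loop interchange)
# to a 2-deep iteration space. Such a matrix reorders iterations so that the
# innermost loop walks arrays in a layout-friendly direction. Illustrative only.
import numpy as np

T = np.array([[0, 1],
              [1, 0]])     # interchange: (i, j) -> (j, i)

def transformed_order(n, m):
    points = [np.array([i, j]) for i in range(n) for j in range(m)]
    return [tuple(p) for p in sorted(points, key=lambda p: tuple(T @ p))]

print(transformed_order(2, 3))
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]  -> j has become the outer loop
```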

An MLIR-based Compiler Flow for System-Level Design and Hardware Acceleration

  • Nicolas Bohm Agostini
  • Serena Curzel
  • Vinay Amatya
  • Cheng Tan
  • Marco Minutoli
  • Vito Giovanni Castellana
  • Joseph Manzano
  • David Kaeli
  • Antonino Tumeo

The generation of custom hardware accelerators for applications implemented within high-level productive programming frameworks requires considerable manual effort. To automate this process, we introduce SODA-OPT, a compiler tool that extends the MLIR infrastructure. SODA-OPT automatically searches, outlines, tiles, and pre-optimizes relevant code regions to generate high-quality accelerators through high-level synthesis. SODA-OPT can support any high-level programming framework and domain-specific language that interfaces with the MLIR infrastructure. By leveraging MLIR, SODA-OPT solves compiler optimization problems with specialized abstractions. Backend synthesis tools connect to SODA-OPT through progressive intermediate representation lowerings. SODA-OPT interfaces to a design space exploration engine to identify the combination of compiler optimization passes and options that provides high-performance generated designs for different backends and targets. We demonstrate the practical applicability of the compilation flow by exploring the automatic generation of accelerators for deep neural network operators outlined at arbitrary granularity and by combining outlining with tiling on large convolution layers. Experimental results with kernels from the PolyBench benchmark show that our high-level optimizations improve execution delays of synthesized accelerators by up to 60x. We also show that for the selected kernels, our solution outperforms the current state of the art in more than 70% of the benchmarks and provides better average speedup in 55% of them. SODA-OPT is an open source project available at https://gitlab.pnnl.gov/sodalite/soda-opt.

Physics-Aware Differentiable Discrete Codesign for Diffractive Optical Neural Networks

  • Yingjie Li
  • Ruiyang Chen
  • Weilu Gao
  • Cunxi Yu

Diffractive optical neural networks (DONNs) have attracted considerable attention as they bring significant advantages in terms of power efficiency, parallelism, and computational speed compared with conventional deep neural networks (DNNs), which have intrinsic limitations when implemented on digital platforms. However, inversely mapping algorithm-trained physical model parameters onto real-world optical devices with discrete values is a non-trivial task, as existing optical devices have non-unified discrete levels and non-monotonic properties. This work proposes a novel device-to-system hardware-software codesign framework, which enables efficient physics-aware training of DONNs with respect to arbitrary experimentally measured optical devices across layers. Specifically, Gumbel-Softmax is employed to enable differentiable discrete mapping from real-world device parameters into the forward function of DONNs, where the physical parameters in DONNs can be trained by simply minimizing the loss function of the ML task. The results demonstrate that our proposed framework offers significant advantages over conventional quantization-based methods, especially with low-precision optical devices. Finally, the proposed algorithm is fully verified with physical experimental optical systems in low-precision settings.
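
The Gumbel-Softmax trick mentioned above can be sketched in a few lines: a soft, differentiable selection over a set of discrete device levels is used during training and snapped to the nearest level at deployment. The device levels and logits below are hypothetical, and the snippet is not the authors' code.

```python
# Minimal Gumbel-Softmax sketch: relax a choice among discrete device levels
# so the selection stays differentiable during training. Illustrative only.
import numpy as np

def gumbel_softmax(logits, temperature=0.5, rng=np.random.default_rng(0)):
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max())
    return y / y.sum()

device_levels = np.array([0.0, 0.3, 0.7, 1.0])        # hypothetical non-uniform levels
logits = np.array([0.1, 1.5, 0.2, 0.4])               # trainable selection scores
soft_choice = gumbel_softmax(logits)
train_value = soft_choice @ device_levels             # soft mixture in the forward pass
deploy_value = device_levels[np.argmax(soft_choice)]  # hard level at deployment
print(train_value, deploy_value)
```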

Big-Little Chiplets for In-Memory Acceleration of DNNs: A Scalable Heterogeneous Architecture

  • Gokul Krishnan
  • A. Alper Goksoy
  • Sumit K. Mandal
  • Zhenyu Wang
  • Chaitali Chakrabarti
  • Jae-sun Seo
  • Umit Y. Ogras
  • Yu Cao

Monolithic in-memory computing (IMC) architectures face significant yield and fabrication cost challenges as the complexity of DNNs increases. Chiplet-based IMCs that integrate multiple dies with advanced 2.5D/3D packaging offer a low-cost and scalable solution. They enable heterogeneous architectures where the chiplets and their associated interconnection can be tailored to the non-uniform algorithmic structures to maximize IMC utilization and reduce energy consumption. This paper proposes a heterogeneous IMC architecture with big-little chiplets and a hybrid network-on-package (NoP) to optimize the utilization, interconnect bandwidth, and energy efficiency. For a given DNN, we develop a custom methodology to map the model onto the big-little architecture such that the early layers in the DNN are mapped to the little chiplets with higher NoP bandwidth and the subsequent layers are mapped to the big chiplets with lower NoP bandwidth. Furthermore, we achieve a scalable solution by incorporating a DRAM into each chiplet to support a wide range of DNNs beyond the area limit. Compared to a homogeneous chiplet-based IMC architecture, the proposed big-little architecture achieves up to 329× improvement in the energy-delay-area product (EDAP) and up to 2× higher IMC utilization. Experimental evaluation of the proposed big-little chiplet-based RRAM IMC architecture for ResNet-50 on ImageNet shows 259×, 139×, and 48× improvement in energy efficiency at lower area compared to the Nvidia V100 GPU, the Nvidia T4 GPU, and the SIMBA architecture, respectively.
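
The layer-to-chiplet mapping idea can be conveyed with a toy assignment rule: early layers go to little chiplets on high-bandwidth NoP links, later layers to big chiplets. All counts and the split ratio below are invented for illustration; the paper's mapping methodology is more involved.

```python
# Toy sketch of mapping early DNN layers to little chiplets and later layers
# to big chiplets (illustrative only; not the paper's actual methodology).
def map_layers(num_layers, little_chiplets, big_chiplets, split_ratio=0.4):
    split = int(num_layers * split_ratio)
    mapping = {}
    for layer in range(num_layers):
        if layer < split:   # early layers: little chiplets, higher NoP bandwidth
            mapping[layer] = ("little", layer % little_chiplets)
        else:               # later layers: big chiplets, lower NoP bandwidth
            mapping[layer] = ("big", (layer - split) % big_chiplets)
    return mapping

print(map_layers(num_layers=10, little_chiplets=4, big_chiplets=2))
```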

SESSION: Addressing Sensor Security through Hardware/Software Co-Design

Session details: Addressing Sensor Security through Hardware/Software Co-Design

  • Marilyn Wolf

Attacks on Image Sensors

  • Marilyn Wolf
  • Kruttidipta Samal

This paper provides a taxonomy of security vulnerabilities of smart image sensor systems. Image sensors form an important class of sensors. Many image sensors include computation units that can run traditional algorithms, such as image or video compression, along with machine learning tasks such as classification. Some attacks rely on the physics and optics of imaging. Other attacks take advantage of the complex logic and software required to implement imaging systems.

False Data Injection Attacks on Sensor Systems

  • Dimitrios Serpanos

False data injection attacks on sensor systems are an emerging threat to cyberphysical systems, creating significant risks to all application domains and, importantly, to critical infrastructures. Cyberphysical systems are process-dependent, leading to differing false data injection attacks that target disruption of the specific processes (plants). We present a taxonomy of false data injection attacks, using a general model for cyberphysical systems, showing that global and continuous attacks are extremely powerful. To detect false data injection attacks, we describe three methods that enable effective monitoring and detection during plant operation. Considering that sensor failures have effects equivalent to the corresponding false data injection attacks, the methods are effective for sensor fault detection as well.
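
A common building block for such monitoring is residual-based detection: compare each sensor reading with a process-model prediction and flag readings whose residual exceeds a multiple of the expected noise level. The sketch below is a hedged illustration of that generic idea only; the model, noise level, and threshold are placeholders, not the paper's methods.

```python
# Residual-based false-data-injection monitoring (generic sketch).
import numpy as np

def detect_injection(readings, predictions, noise_sigma, k=5.0):
    residuals = np.abs(readings - predictions)
    return residuals > k * noise_sigma          # True where a reading looks injected

t = np.arange(100)
predicted = np.sin(0.1 * t)                                     # model-based prediction
readings = predicted + 0.01 * np.random.default_rng(1).normal(size=t.size)
readings[60:70] += 0.5                                          # injected false data
print(np.where(detect_injection(readings, predicted, noise_sigma=0.01))[0])
```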

Stochastic Mixed-Signal Circuit Design for In-Sensor Privacy

  • Ningyuan Cao
  • Jianbo Liu
  • Boyang Cheng
  • Muya Chang

The ubiquitous data acquisition and extensive data exchange of sensors pose severe security and privacy concerns for end-users and the public. To enable real-time protection of raw data, it is desirable to facilitate privacy-preserving algorithms at the point of data generation, i.e., in-sensor privacy. However, due to severe sensor resource constraints and intensive computation/security costs, it remains an open question how to enable data protection algorithms with efficient circuit techniques. To answer this question, this paper discusses the potential of stochastic mixed-signal (SMS) circuits for ultra-low-power, small-footprint data security. In particular, this paper discusses digitally controlled oscillators (DCOs) and their advantages in (1) seamless analog interfacing, (2) stochastic computation efficiency, and (3) unified entropy generation over conventional digital circuit baselines. With the DCO as an illustrative case, we target (1) defining an SMS privacy-preserving architecture and systematically analyzing its performance gains across various hardware/software configurations, and (2) revisiting analog/mixed-signal voltage/transistor scaling in the context of entropy-based data protection.

Sensor Security: Current Progress, Research Challenges, and Future Roadmap (Invited Paper)

  • Anomadarshi Barua
  • Mohammad Abdullah Al Faruque

Sensors are one of the most pervasive and integral components of today’s safety-critical systems. Sensors serve as a bridge between physical quantities and connected systems. The connected systems blindly trust the sensor, as there is no way to authenticate the signal coming from it. This can be an entry point for an attacker. An attacker can inject a fake input signal along with the legitimate signal by using a suitable spoofing technique. As the sensor’s transducer is not smart enough to differentiate between fake and legitimate signals, the injected fake signal can eventually collapse the connected system. This type of attack is known as a transduction attack. Over the last decade, several works have been published to provide a defense against transduction attacks. However, these defenses have been proposed on an ad-hoc basis and hence are not well structured. Our work begins to fill this gap by providing a checklist that a defense technique should always follow to be considered an ideal defense against the transduction attack. We name this checklist the Golden reference of sensor defense. We provide insights on how this Golden reference can be achieved and argue that sensors should be redesigned from the transducer level to the sensor electronics level. We point out that hardware or software modification alone is not enough; instead, a hardware/software (HW/SW) co-design approach is required to follow this roadmap toward robust and resilient sensors.

SESSION: Advances in Partitioning and Physical Optimization

Session details: Advances in Partitioning and Physical Optimization

  • Markus Olbrich
  • Yu-Guang Chen

SpecPart: A Supervised Spectral Framework for Hypergraph Partitioning Solution Improvement

  • Ismail Bustany
  • Andrew B. Kahng
  • Ioannis Koutis
  • Bodhisatta Pramanik
  • Zhiang Wang

State-of-the-art hypergraph partitioners follow the multilevel paradigm, constructing multiple levels of progressively coarser hypergraphs that are used to drive cut refinements on each level of the hierarchy. Multilevel partitioners are subject to two limitations: (i) hypergraph coarsening processes rely on local neighborhood structure without fully considering the global structure of the hypergraph, and (ii) refinement heuristics can stagnate on local minima. In this paper, we describe SpecPart, the first supervised spectral framework that directly tackles these two limitations. SpecPart solves a generalized eigenvalue problem that captures the balanced partitioning objective and global hypergraph structure in a low-dimensional vertex embedding while leveraging initial high-quality solutions from multilevel partitioners as hints. SpecPart further constructs a family of trees from the vertex embedding and partitions them with a tree-sweeping algorithm. A novel overlay of multiple tree-based partitioning solutions is then lifted to a coarsened hypergraph, where an ILP partitioning instance is solved to alleviate local stagnation. We have validated SpecPart on multiple sets of benchmarks. Experimental results show that for some benchmarks, SpecPart can substantially improve the cutsize, by more than 50%, with respect to the best published solutions obtained with the leading partitioners hMETIS and KaHyPar.
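
The spectral idea underlying such vertex embeddings can be conveyed with the classic Fiedler-vector bipartition of a plain graph, sketched below; SpecPart's supervised, hint-guided generalized-eigenvalue formulation on hypergraphs is considerably richer, so this example is only illustrative.

```python
# Spectral bipartitioning of a plain graph via the Fiedler vector (illustrative).
import numpy as np

def spectral_bipartition(adj):
    laplacian = np.diag(adj.sum(axis=1)) - adj
    _, eigvecs = np.linalg.eigh(laplacian)       # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                      # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= np.median(fiedler)         # balanced split by median threshold

adj = np.array([[0, 1, 1, 0, 0, 0],              # two triangles joined by one edge
                [1, 0, 1, 0, 0, 0],
                [1, 1, 0, 1, 0, 0],
                [0, 0, 1, 0, 1, 1],
                [0, 0, 0, 1, 0, 1],
                [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_bipartition(adj))                 # recovers the two 3-vertex clusters
```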

HyperEF: Spectral Hypergraph Coarsening by Effective-Resistance Clustering

  • Ali Aghdaei
  • Zhuo Feng

This paper introduces a scalable algorithmic framework (HyperEF) for spectral coarsening (decomposition) of large-scale hypergraphs by exploiting hyperedge effective resistances. Motivated by the latest theoretical framework for low-resistance-diameter decomposition of simple graphs, HyperEF aims at decomposing large hypergraphs into multiple node clusters with only a few inter-cluster hyperedges. The key component in HyperEF is a nearly-linear time algorithm for estimating hyperedge effective resistances, which allows incorporating the latest diffusion-based non-linear quadratic operators defined on hypergraphs. To achieve good runtime scalability, HyperEF searches within the Krylov subspace (or approximate eigensubspace) for identifying the nearly-optimal vectors for approximating the hyperedge effective resistances. In addition, a node weight propagation scheme for multilevel spectral hypergraph decomposition has been introduced for achieving even greater node coarsening ratios. When compared with state-of-the-art hypergraph partitioning (clustering) methods, extensive experimental results on real-world VLSI designs show that HyperEF can more effectively coarsen (decompose) hypergraphs without losing key structural (spectral) properties of the original hypergraphs, while achieving over 70× runtime speedups over hMetis and 20× speedups over HyperSF.
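
For readers unfamiliar with the quantity, the effective resistance between two nodes of an ordinary graph can be computed from the Laplacian pseudoinverse as sketched below; HyperEF instead estimates hyperedge effective resistances with scalable Krylov-subspace approximations, so this dense O(n^3) computation is only meant to convey the concept.

```python
# Effective resistance via the Laplacian pseudoinverse:
# R_eff(u, v) = (e_u - e_v)^T L^+ (e_u - e_v). Illustrative only.
import numpy as np

def effective_resistance(adj, u, v):
    laplacian = np.diag(adj.sum(axis=1)) - adj
    l_pinv = np.linalg.pinv(laplacian)
    e = np.zeros(adj.shape[0])
    e[u], e[v] = 1.0, -1.0
    return float(e @ l_pinv @ e)

adj = np.array([[0, 1, 1, 0],                    # triangle 0-1-2 plus pendant edge 2-3
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(effective_resistance(adj, 0, 1))           # well-connected pair: low resistance
print(effective_resistance(adj, 0, 3))           # pair joined through a bridge: higher
```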

Design and Technology Co-Optimization Utilizing Multi-Bit Flip-Flop Cells

  • Soomin Kim
  • Taewhan Kim

The benefit of a multi-bit flip-flop (MBFF) over a single-bit flip-flop is the sharing of in-cell clock inverters among the master and slave latches of the MBFF’s internal flip-flops. Theoretically, the more flip-flops an MBFF has, the more power saving it can achieve. However, in practice, physically increasing the size of an MBFF to accommodate many flip-flops imposes two new challenging problems in physical design: (1) non-flexible MBFF cell flipping for multiple D-to-Q signals and (2) unbalanced or wasted use of the MBFF footprint space. In this work, we solve the two problems in a way that enhances routability and timing at the placement and routing stages. Precisely, for problem 1, we make the non-flexible MBFF cell flipping fully flexible by generating MBFF layouts supporting diverse D-to-Q flow directions in the detailed placement to improve routability, and for problem 2, we enhance the setup and clock-to-Q delay of timing-critical flip-flops in the MBFF through gate upsizing (i.e., transistor folding) using the unused space in the MBFF to improve timing slack at the post-routing stage. Through experiments with benchmark circuits, it is shown that our proposed design and technology co-optimization (DTCO) flow using MBFFs, which solves problems 1 and 2, is very promising.

Transitive Closure Graph-Based Warpage-Aware Floorplanning for Package Designs

  • Yang Hsu
  • Min-Hsuan Chung
  • Yao-Wen Chang
  • Ci-Hong Lin

In modern heterogeneous integration technologies, chips with different processes and functionality are integrated into a package with high interconnection density and large I/O counts. Integrating multiple chips into a package may suffer from severe warpage problems caused by the mismatch in coefficients of thermal expansion between different manufacturing materials, leading to deformation and malfunction in the manufactured package. The industry is eager to find a solution for warpage optimization. This paper proposes the first warpage-aware floorplanning algorithm for heterogeneous integration. We first present an efficient qualitative warpage model for a multi-chip package structure based on Suhir’s solution, which is more suitable for optimization than time-consuming finite element analysis. Based on the transitive closure graph floorplan representation, we then propose three perturbations for simulated annealing to optimize the warpage more directly and thus speed up the process. Finally, we develop a force-directed detailed floorplanning algorithm to further refine the solutions by utilizing the dead spaces. Experimental results demonstrate the effectiveness of our warpage model and algorithm.
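
At the core of such a flow is a standard simulated-annealing acceptance loop. The sketch below shows that generic loop only, with the floorplan representation, TCG perturbations, and warpage-plus-area cost abstracted into placeholder callables that a real implementation would supply.

```python
# Bare-bones simulated-annealing loop of the kind such floorplanners build on.
# The perturb/cost callables are placeholders, not the paper's TCG moves or
# Suhir-based warpage model.
import math, random

def simulated_annealing(initial, perturb, cost, t0=100.0, t_min=0.1, alpha=0.95):
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t0
    while t > t_min:
        candidate = perturb(current)
        delta = cost(candidate) - current_cost
        if delta < 0 or random.random() < math.exp(-delta / t):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= alpha          # geometric cooling schedule
    return best, best_cost

# Toy usage: minimize a 1-D "cost" standing in for weighted area + warpage.
best, best_cost = simulated_annealing(
    initial=5.0,
    perturb=lambda x: x + random.uniform(-1, 1),
    cost=lambda x: (x - 2.0) ** 2)
print(round(best, 2), round(best_cost, 4))
```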

SESSION: Democratizing Design Automation with Open-Source Tools: Perspectives, Opportunities, and Challenges

Session details: Democratizing Design Automation with Open-Source Tools: Perspectives, Opportunities, and Challenges

  • Antonino Tumeo

A Mixed Open-Source and Proprietary EDA Commons for Education and Prototyping

  • Andrew B. Kahng

In recent years, several open-source projects have shown potential to serve a future technology commons for EDA and design prototyping. This paper examines how open-source and proprietary EDA technologies will inevitably take on complementary roles within a future technology commons. Proprietary EDA technologies offer numerous benefits that will endure, including (i) exceptional technology and engineering; (ii) ever-increasing importance in design-based equivalent scaling and the overall semiconductor value chain; and (iii) well-established commercial and partner relationships. On the other hand, proprietary EDA technologies face challenges that will also endure, including (i) inability to pursue directions such as massive leverage of cloud compute, extreme reduction of turnaround times, or “free tools”; and (ii) difficulty in evolving and addressing new applications and markets. By contrast, open-source EDA technologies offer benefits that include (i) the capability to serve as a friction-free, democratized platform for education and future workforce development (i.e., as a platform for EDA research, and as a means of teaching / training both designers and EDA developers with public code); and (ii) addressing the needs of underserved, non-enterprise account markets (e.g., older nodes, research flows, cost-sensitive IoT, new devices and integrations, system-design-technology pathfinding). This said, open-source will always face challenges such as sustainability, governance, and how to achieve critical mass and critical quality. The paper will conclude with key directions and synergies for open-source and proprietary EDA within an EDA Commons for education and prototyping.

SODA Synthesizer: An Open-Source, Multi-Level, Modular, Extensible Compiler from High-Level Frameworks to Silicon

  • Nicolas Bohm Agostini
  • Ankur Limaye
  • Marco Minutoli
  • Vito Giovanni Castellana
  • Joseph Manzano
  • Antonino Tumeo
  • Serena Curzel
  • Fabrizio Ferrandi

The SODA Synthesizer is an open-source, modular, end-to-end hardware compiler framework. The SODA frontend, developed in MLIR, performs system-level design, code partitioning, and high-level optimizations to prepare the specifications for the hardware synthesis. The backend is based on a state-of-the-art high-level synthesis tool and generates the final hardware design. The backend can interface with logic synthesis tools for field programmable gate arrays or with commercial and open-source logic synthesis tools for application-specific integrated circuits. We discuss the opportunities and challenges in integrating with commercial and open-source tools both at the frontend and backend, and highlight the role that an end-to-end compiler framework like SODA can play in an open-source hardware design ecosystem.

A Scalable Methodology for Agile Chip Development with Open-Source Hardware Components

  • Maico Cassel dos Santos
  • Tianyu Jia
  • Martin Cochet
  • Karthik Swaminathan
  • Joseph Zuckerman
  • Paolo Mantovani
  • Davide Giri
  • Jeff Jun Zhang
  • Erik Jens Loscalzo
  • Gabriele Tombesi
  • Kevin Tien
  • Nandhini Chandramoorthy
  • John-David Wellman
  • David Brooks
  • Gu-Yeon Wei
  • Kenneth Shepard
  • Luca P. Carloni
  • Pradip Bose

We present a scalable methodology for the agile physical design of tile-based heterogeneous system-on-chip (SoC) architectures that simplifies the reuse and integration of open-source hardware components. The methodology leverages the regularity of the on-chip communication infrastructure, which is based on a multi-plane network-on-chip (NoC), and the modularity of socket interfaces, which connect the tiles to the NoC. Each socket also provides its tile with a set of platform services, including independent clocking and voltage control. As a result, the physical design of each tile can be decoupled from its location in the top-level floorplan of the SoC and the overall SoC design can benefit from a hierarchical timing-closure flow, design reuse and, if necessary, fast respin. With the proposed methodology we completed two SoC tapeouts of increasing complexity, which illustrate its capabilities and the resulting gains in terms of design productivity.

SESSION: Accelerators on A New Horizon

Session details: Accelerators on A New Horizon

  • Vaibhav Verma
  • Georgios Zervakis

GraphRC: Accelerating Graph Processing on Dual-Addressing Memory with Vertex Merging

  • Wei Cheng
  • Chun-Feng Wu
  • Yuan-Hao Chang
  • Ing-Chao Lin

Architectural innovation in graph accelerators attracts research attention due to foreseeable inflation in data sizes and the irregular memory access patterns of graph algorithms. Conventional graph accelerators ignore the potential of the Non-Volatile Memory (NVM) crossbar as a dual-addressing memory and treat it as a traditional single-addressing memory with higher density and better energy efficiency. In this work, we present GraphRC, a graph accelerator that leverages the power of dual-addressing memory by mapping in-edge/out-edge requests to column/row-oriented memory accesses. Although the capability of dual-addressing memory greatly improves the performance of graph processing, some memory accesses still suffer from low-utilization issues. Therefore, we propose a vertex merging (VM) method that improves the cache block utilization rate by merging memory requests from consecutive vertices. VM reduces the execution time of all 6 graph algorithms on all 4 datasets by 24.24% on average. We then identify that the data dependency inherent in a graph limits the usage of VM, and that its effectiveness is bounded by the percentage of mergeable vertices. To overcome this limitation, we propose an aggressive vertex merging (AVM) method that outperforms VM by ignoring the data dependency inherent in a graph. AVM significantly reduces the execution time of ranking-based algorithms on all 4 datasets while preserving the correct ranking of the top 20 vertices.
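
The vertex-merging idea can be illustrated by coalescing the memory requests of consecutive vertices that fall in the same cache block, so one block fetch serves several vertices. The addresses and block size in this sketch are invented for illustration and do not reflect GraphRC's actual implementation.

```python
# Coalesce requests from consecutive vertices that map to the same cache block.
def merge_requests(vertex_addresses, block_bytes=64):
    merged, last_block = [], None
    for vertex, addr in sorted(vertex_addresses, key=lambda p: p[0]):
        block = addr // block_bytes
        if block == last_block:
            merged[-1][1].append(vertex)         # reuse the pending block fetch
        else:
            merged.append((block, [vertex]))
            last_block = block
    return merged

requests = [(0, 0x1000), (1, 0x1008), (2, 0x1010), (3, 0x1100), (4, 0x1108)]
print(merge_requests(requests))                  # 5 vertex requests -> 2 block fetches
```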

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

  • Matheus Cavalcante
  • Domenic Wüthrich
  • Matteo Perotti
  • Samuel Riedel
  • Luca Benini

While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors holds the promise of greatly reducing their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve Instruction Level Parallelism (ILP), which increase their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz’s performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.

Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems with FPGA Accelerators

  • Edward Richter
  • Deming Chen

While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. However, studying SVM implementations is difficult, as there is no open and flexible system to explore trade-offs between different SVM implementations and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system which enables thorough study of SVM in heterogeneous computing environments for discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis on the overheads of three SVM architectures, and generate several insights which highlight the cost and benefits of each architecture. From these insights, we propose a flowchart of how to choose the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.

ReSiPI: A Reconfigurable Silicon-Photonic 2.5D Chiplet Network with PCMs for Energy-Efficient Interposer Communication

  • Ebadollah Taheri
  • Sudeep Pasricha
  • Mahdi Nikdast

2.5D chiplet systems have been proposed to improve the low manufacturing yield of large-scale chips. However, connecting the chiplets through an electronic interposer imposes a high traffic load on the interposer network. Silicon photonics technology has shown great promise towards handling a high volume of traffic with low latency in intra-chip network-on-chip (NoC) fabrics. Although recent advances in silicon photonic devices have extended photonic NoCs to enable high bandwidth communication in 2.5D chiplet systems, such interposer-based photonic networks still suffer from high power consumption. In this work, we design and analyze a novel Reconfigurable power-efficient and congestion-aware Silicon-Photonic 2.5D Interposer network, called ReSiPI. Considering runtime traffic, ReSiPI is able to dynamically deploy inter-chiplet photonic gateways to improve the overall network congestion. ReSiPI also employs switching elements based on phase change materials (PCMs) to dynamically reconfigure and power-gate the photonic interposer network, thereby improving the network power efficiency. Compared to the best prior state-of-the-art 2.5D photonic network, ReSiPI demonstrates, on average, 37% lower latency, 25% power reduction, and 53% energy minimization in the network.

SESSION: CAD for Confidentiality of Hardware IPs

Session details: CAD for Confidentiality of Hardware IPs

  • Swarup Bhunia

Hardware IP Protection against Confidentiality Attacks and Evolving Role of CAD Tool

  • Swarup Bhunia
  • Amitabh Das
  • Saverio Fazzari
  • Vivian Kammler
  • David Kehlet
  • Jeyavijayan Rajendran
  • Ankur Srivastava

With growing use of hardware intellectual property (IP) based integrated circuits (IC) design and increasing reliance on a globalized supply chain, the threats to confidentiality of hardware IPs have emerged as major security concerns to the IP producers and owners. These threats are diverse, including reverse engineering (RE), piracy, cloning, and extraction of design secrets, and span different phases of electronics life cycle. The academic research community and the semiconductor industry have made significant efforts over the past decade on developing effective methodologies and CAD tools targeted to protect hardware IPs against these threats. These solutions include watermarking, logic locking, obfuscation, camouflaging, split manufacturing, and hardware redaction. This paper focuses on key topics on confidentiality of hardware IPs encompassing the major threats, protection approaches, security analysis, and metrics. It discusses the strengths and limitations of the major solutions in protecting hardware IPs against the confidentiality attacks, and future directions to address the limitations in the modern supply chain ecosystem.

SESSION: Analyzing Reliability, Defects and Patterning

Session details: Analyzing Reliability, Defects and Patterning

  • Gaurav Rajavendra Reddy
  • Kostas Adam

Pin Accessibility and Routing Congestion Aware DRC Hotspot Prediction Using Graph Neural Network and U-Net

  • Kyeonghyeon Baek
  • Hyunbum Park
  • Suwan Kim
  • Kyumyung Choi
  • Taewhan Kim

An accurate DRC (design rule check) hotspot prediction at the placement stage is essential in order to reduce the substantial amount of design time required for the iterations of placement and routing. It is known that for implementing chips with advanced technology nodes, (1) pin accessibility and (2) routing congestion are two major causes of DRVs (design rule violations). Though many ML (machine learning) techniques have been proposed to address this prediction problem, it was not easy to assemble the aggregate data on items 1 and 2 in a unified fashion for training ML models, resulting in a considerable accuracy loss in DRC hotspot prediction. This work overcomes this limitation by proposing a novel ML-based DRC hotspot prediction technique, which is able to accurately capture the combined impact of items 1 and 2 on DRC hotspots. Precisely, we devise a graph, called a pin proximity graph, that effectively models the spatial information on cell I/O pins and the information on pin-to-pin disturbance relations. Then, we propose a new ML model, called PGNN, which tightly combines a GNN (graph neural network) and U-Net in a way that the GNN is used to embed pin accessibility information abstracted from our pin proximity graph while the U-Net is used to extract routing congestion information from grid-based features. Through experiments with a set of benchmark designs using the Nangate 15nm library, our PGNN outperforms the existing ML models on all benchmark designs, achieving on average 7.8~12.5% improvement in F1-score while running 5.5× faster in inference than the state-of-the-art techniques.

A Novel Semi-Analytical Approach for Fast Electromigration Stress Analysis in Multi-Segment Interconnects

  • Olympia Axelou
  • Nestor Evmorfopoulos
  • George Floros
  • George Stamoulis
  • Sachin S. Sapatnekar

As integrated circuit technologies move below 10 nm, electromigration (EM) has become an issue of great concern for long-term reliability due to stricter performance, thermal, and power requirements. The problem of EM becomes even more pronounced in power grids due to the large unidirectional currents flowing in these structures. In recent years, attention in EM analysis has been drawn to accurate physics-based models describing the interplay between the electron wind force and the back stress force in a single Partial Differential Equation (PDE) involving wire stress. In this paper, we present a fast semi-analytical approach for the solution of the stress PDE at discrete spatial points in multi-segment lines of power grids, which allows the analytical calculation of EM stress independently at any time in these lines. Our method exploits the specific form of the discrete stress coefficient matrix, whose eigenvalues and eigenvectors are known beforehand. Thus, a closed-form equation can be constructed with almost linear time complexity, without the need for time discretization. This closed-form equation can subsequently be used at any given time in transient stress analysis. Our experimental results, using the industrial IBM power grid benchmarks, demonstrate that our method has excellent accuracy compared to the industrial tool COMSOL while being orders of magnitude faster.
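
To make the role of the known eigendecomposition concrete, a generic closed form of this kind, for a spatially discretized linear stress system with an invertible, constant coefficient matrix (the paper's specific matrix and boundary handling are not reproduced here), is:

$$
\frac{d\boldsymbol{\sigma}}{dt} = \mathbf{A}\,\boldsymbol{\sigma}(t) + \mathbf{b}, \qquad \mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}
\;\Longrightarrow\;
\boldsymbol{\sigma}(t) = \mathbf{V}\, e^{\boldsymbol{\Lambda}t}\, \mathbf{V}^{-1}\bigl(\boldsymbol{\sigma}(0) + \mathbf{A}^{-1}\mathbf{b}\bigr) - \mathbf{A}^{-1}\mathbf{b},
$$

where $e^{\boldsymbol{\Lambda}t}$ is diagonal, so evaluating the stress at any time $t$ costs only one scalar exponential per eigenvalue and requires no time stepping.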

HierPINN-EM: Fast Learning-Based Electromigration Analysis for Multi-Segment Interconnects Using Hierarchical Physics-Informed Neural Network

  • Wentian Jin
  • Liang Chen
  • Subed Lamichhane
  • Mohammadamir Kavousi
  • Sheldon X.-D. Tan

Electromigration (EM) becomes a major concern for VLSI circuits as technology advances into the nanometer regime. The crux of the problem is to solve the partial differential Korhonen equations, which remains challenging due to the increasing integration density. Recently, scientific machine learning has been explored to solve partial differential equations (PDEs) following breakthrough successes of deep neural networks, and existing approaches such as physics-informed neural networks (PINNs) show promising results for some small PDE problems. However, for large engineering problems like EM analysis of large interconnect trees, it was shown that the plain PINN does not work well due to the large number of variables. In this work, we propose a novel hierarchical PINN approach, HierPINN-EM, for fast EM-induced stress analysis of multi-segment interconnects. Instead of solving the interconnect tree as a whole, we first solve the EM problem for one wire segment under different boundary and geometrical parameters using supervised learning. Then we apply the unsupervised PINN concept to solve the whole interconnect by enforcing the physics laws at the boundaries of all wire segments. In this way, HierPINN-EM can significantly reduce the number of variables compared with the plain PINN solver. Numerical results on a number of synthetic interconnect trees show that HierPINN-EM can lead to orders of magnitude speedup in training and more than 79× better accuracy over the plain PINN method. Furthermore, HierPINN-EM yields 19% better accuracy with a 99% reduction in training cost over the recently proposed Graph Neural Network-based EM solver, EMGraph.

Sub-Resolution Assist Feature Generation with Reinforcement Learning and Transfer Learning

  • Guan-Ting Liu
  • Wei-Chen Tai
  • Yi-Ting Lin
  • Iris Hui-Ru Jiang
  • James P. Shiely
  • Pu-Jen Cheng

As modern photolithography feature sizes continue to shrink, sub-resolution assist feature (SRAF) generation has become a key resolution enhancement technique to improve the manufacturing process window. State-of-the-art works resort to machine learning to overcome the deficiencies of model-based and rule-based approaches. Nevertheless, these machine learning-based methods do not consider, or only implicitly consider, the optical interference between SRAFs, and rely heavily on post-processing to satisfy SRAF mask manufacturing rules. In this paper, we are the first to generate SRAFs using reinforcement learning to address SRAF interference and produce mask-rule-compliant results directly. In this way, our two-phase learning enables us to emulate the style of model-based SRAFs while further improving the process variation (PV) band. A state alignment and action transformation mechanism is proposed to achieve orientation equivariance while expediting the training process. We also propose a transfer learning framework, allowing SRAF generation under different light sources without retraining the model. Compared with state-of-the-art works, our method improves the solution quality in terms of PV band and edge placement error (EPE) while reducing the overall runtime.

SESSION: New Frontier in Verification Technology

Session details: New Frontier in Verification Technology

  • Jyotirmoy Vinay
  • Zahra Ghodsi

Automatic Test Configuration and Pattern Generation (ATCPG) for Neuromorphic Chips

  • I-Wei Chiu
  • Xin-Ping Chen
  • Jennifer Shueh-Inn Hu
  • James Chien-Mo Li

The demand for low-power, high-performance neuromorphic chips is increasing. However, conventional testing is not applicable to neuromorphic chips for three reasons: (1) lack of scan DfT, (2) stochastic characteristics, and (3) configurable functionality. In this paper, we present an automatic test configuration and pattern generation (ATCPG) method for testing a configurable stochastic neuromorphic chip without using scan DfT. We use machine learning to generate test configurations. Then, we apply a modified fast gradient sign method to generate test patterns. Finally, we determine the number of test repetitions using the statistical power of the test. We conduct experiments on one of the neuromorphic architectures, the spiking neural network, to evaluate the effectiveness of our ATCPG. The experimental results show that our ATCPG can achieve 100% fault coverage for the five fault models we use. For testing a 3-layer model at a 0.05 significance level, we produce 5 test configurations and 67 test patterns. The average numbers of test repetitions for neuron faults and synapse faults are 2,124 and 4,557, respectively. Besides, our simulation results show that the overkill matched our significance level perfectly.
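
For reference, the textbook fast gradient sign method (FGSM) that the flow modifies perturbs an input in the direction that most increases the loss. The sketch below shows only that standard step, with made-up numbers, and does not reproduce the paper's modified variant.

```python
# Standard FGSM step: nudge each input element by +/- epsilon in the sign of
# the loss gradient. Values are invented for illustration.
import numpy as np

def fgsm_step(x, grad_loss_wrt_x, epsilon=0.1):
    """Perturb input x in the direction that most increases the loss."""
    return x + epsilon * np.sign(grad_loss_wrt_x)

x = np.array([0.2, 0.7, 0.5])        # hypothetical input stimulus values
grad = np.array([0.4, -1.2, 0.05])   # gradient of the loss w.r.t. x
print(fgsm_step(x, grad))            # each element nudged by +/- epsilon
```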

ScaleHD: Robust Brain-Inspired Hyperdimensional Computing via Adaptive Scaling

  • Sizhe Zhang
  • Mohsen Imani
  • Xun Jiao

Brain-inspired hyperdimensional computing (HDC) has demonstrated promising capability in various cognition tasks such as robotics, bio-medical signal analysis, and natural language processing. Compared to deep neural networks, HDC models offer advantages such as lightweight models and one/few-shot learning capabilities, making them a promising alternative paradigm to traditional resource-demanding deep learning models, particularly on edge devices with limited resources. Despite the growing popularity of HDC, the robustness of HDC models and the approaches to enhance it have not been systematically analyzed and sufficiently examined. HDC relies on high-dimensional numerical vectors, referred to as hypervectors (HVs), to perform cognition tasks, and the values inside the HVs are critical to the robustness of an HDC model. We propose ScaleHD, an adaptive scaling method that scales the values of HVs in the associative memory of an HDC model to enhance its robustness. We propose three different modes of ScaleHD, namely Global-ScaleHD, Class-ScaleHD, and (Class + Clip)-ScaleHD, which are based on different adaptive scaling strategies. Results show that ScaleHD is able to enhance HDC robustness against memory errors by up to 10,000X. Moreover, we leverage the enhanced HDC robustness in exchange for energy savings via voltage scaling. Experimental results show that ScaleHD can reduce the energy consumption of the HDC memory system by up to 72.2% with less than 1% accuracy loss.
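
As a rough illustration of the scaling idea (not the paper's calibrated method), the snippet below rescales class hypervectors stored in associative memory so that their dynamic range shrinks, which limits how much an individual memory error can perturb a similarity score; all values are synthetic.

```python
# Shrink the value range of stored class hypervectors (illustrative only).
import numpy as np

def scale_class_hvs(class_hvs, scale=0.5):
    return np.clip(class_hvs * scale, -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
class_hvs = rng.integers(-200, 200, size=(4, 1024))   # 4 classes, dimension 1024
print(np.abs(scale_class_hvs(class_hvs)).max())       # reduced dynamic range
```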

Quantitative Verification and Design Space Exploration under Uncertainty with Parametric Stochastic Contracts

  • Chanwook Oh
  • Michele Lora
  • Pierluigi Nuzzo

This paper proposes an automated framework for quantitative verification and design space exploration of cyber-physical systems in the presence of uncertainty, leveraging assume-guarantee contracts expressed in Stochastic Signal Temporal Logic (StSTL). We introduce quantitative semantics for StSTL and formulations of the quantitative verification and design space exploration problems as bi-level optimization problems. We show that these optimization problems can be effectively solved for a class of stochastic systems and a fragment of bounded-time StSTL formulas. Our algorithm searches for partitions of the upper-level design space such that the solutions of the lower-level problems satisfy the upper-level constraints. A set of optimal parameter values are then selected within these partitions. We illustrate the effectiveness of our framework on the design of a multi-sensor perception system and an automatic cruise control system.

SESSION: Low Power Edge Intelligence

Session details: Low Power Edge Intelligence

  • Sabya Das
  • Jiang Hu

Reliable Machine Learning for Wearable Activity Monitoring: Novel Algorithms and Theoretical Guarantees

  • Dina Hussein
  • Taha Belkhouja
  • Ganapati Bhat
  • Janardhan Rao Doppa

Wearable devices are becoming popular for health and activity monitoring. The machine learning (ML) models for these applications are trained by collecting data in a laboratory with precise control of experimental settings. However, during real-world deployment/usage, the experimental settings (e.g., sensor position or sampling rate) may deviate from those used during training. This discrepancy can degrade the accuracy and effectiveness of the health monitoring applications. Therefore, there is a great need to develop reliable ML approaches that provide high accuracy for real-world deployment. In this paper, we propose a novel statistical optimization approach, referred to as StatOpt, that automatically accounts for real-world disturbances in sensing data to improve the reliability of ML models for wearable devices. We theoretically derive upper bounds on sensor data disturbance for StatOpt to produce ML models with reliability certificates. We validate StatOpt on two publicly available datasets for human activity recognition. Our results show that, compared to standard ML algorithms, the reliable ML classifiers enabled by the StatOpt approach improve accuracy by up to 50% in real-world settings with zero overhead, while baseline approaches incur significant overhead and fail to achieve comparable accuracy.

Neurally-Inspired Hyperdimensional Classification for Efficient and Robust Biosignal Processing

  • Yang Ni
  • Nicholas Lesica
  • Fan-Gang Zeng
  • Mohsen Imani

Biosignal processing relies on several sensors that collect time-series information. Since time series contain temporal dependencies, they are difficult to process with existing machine learning algorithms. Hyper-Dimensional Computing (HDC) has been introduced as a brain-inspired paradigm for lightweight time series classification. However, existing HDC algorithms have the following drawbacks: (1) low classification accuracy that comes from linear hyperdimensional representation, (2) lack of real-time learning support due to costly and non-hardware-friendly operations, and (3) inability to build a strong model from partially labeled data.

In this paper, we propose TempHD, a novel hyperdimensional computing method for efficient and accurate biosignal classification. We first develop a novel non-linear hyperdimensional encoding that maps data points into high-dimensional space. Unlike existing HDC solutions that use costly mathematics for encoding, TempHD preserves spatial-temporal information of data in original space before mapping data into high-dimensional space. To obtain the most informative representation, our encoding method considers the non-linear interactions between both spatial sensors and temporally sampled data. Our evaluation shows that TempHD provides higher classification accuracy, significantly higher computation efficiency, and, more importantly, the capability to learn from partially labeled data. We evaluate TempHD effectiveness on noisy EEG data used for a brain-machine interface. Our results show that TempHD achieves, on average, 2.3% higher classification accuracy as well as 7.7× and 21.8× speedup for training and testing time compared to state-of-the-art HDC algorithms, respectively.

EVE: Environmental Adaptive Neural Network Models for Low-Power Energy Harvesting System

  • Sahidul Islam
  • Shanglin Zhou
  • Ran Ran
  • Yu-Fang Jin
  • Wujie Wen
  • Caiwen Ding
  • Mimi Xie

IoT devices are increasingly being implemented with neural network models to enable smart applications. Energy harvesting (EH) technology that harvests energy from the ambient environment is a promising alternative to batteries for powering those devices, due to its low maintenance cost and the wide availability of the energy sources. However, the power provided by the energy harvester is low and has an intrinsic drawback of instability, since it varies with the ambient environment. This paper proposes EVE, an automated machine learning (autoML) co-exploration framework to search for desired multi-models with shared weights for energy harvesting IoT devices. These shared models incur a significantly reduced memory footprint with different levels of model sparsity, latency, and accuracy to adapt to environmental changes. An efficient on-device implementation architecture is further developed to execute each model efficiently on the device. A run-time model extraction algorithm is proposed that retrieves an individual model with negligible overhead when a specific model mode is triggered. Experimental results show that the neural network models generated by EVE are on average 2.5× faster than baseline models without pruning and shared weights.

SESSION: Crossbars, Analog Accelerators for Neural Networks, and Neuromorphic Computing Based on Printed Electronics

Session details: Crossbars, Analog Accelerators for Neural Networks, and Neuromorphic Computing Based on Printed Electronics

  • Hussam Amrouch
  • Sheldon Tan

Designing Energy-Efficient Decision Tree Memristor Crossbar Circuits Using Binary Classification Graphs

  • Pranav Sinha
  • Sunny Raj

We propose a method to design in-memory, energy-efficient, and compact memristor crossbar circuits for implementing decision trees using flow-based computing. We develop a new tool called the binary classification graph, which is equivalent to decision trees in accuracy but uses bit values of input features to make decisions instead of thresholds. Our proposed design is resilient to manufacturing errors and can scale to large crossbar sizes due to the utilization of sneak paths in computations. Our design uses zero-transistor, one-memristor (0T1R) crossbars with only two resistance states, high and low, which makes it resilient to resistance drift and radiation degradation. We test the performance of our designs on multiple standard machine learning datasets and show that our method uses circuits of size 5.23 × 10⁻³ mm², consumes 20.5 pJ per decision, and outperforms state-of-the-art decision tree acceleration algorithms on these metrics.

Fuse and Mix: MACAM-Enabled Analog Activation for Energy-Efficient Neural Acceleration

  • Hanqing Zhu
  • Keren Zhu
  • Jiaqi Gu
  • Harrison Jin
  • Ray T. Chen
  • Jean Anne Incorvia
  • David Z. Pan

Analog computing has been recognized as a promising low-power alternative to digital counterparts for neural network acceleration. However, conventional analog computing operates mainly in a mixed-signal manner, and the tedious analog/digital (A/D) conversion cost significantly limits the overall system's energy efficiency. In this work, we devise an efficient analog activation unit with magnetic tunnel junction (MTJ)-based analog content-addressable memory (MACAM), simultaneously realizing nonlinear activation and A/D conversion in a fused fashion. To compensate for the nascent and therefore currently limited representation capability of MACAM, we propose to mix our analog activation unit with a digital activation dataflow. A fully differentiable framework, SuperMixer, is developed to search for an optimized activation workload assignment, adaptive to various activation energy constraints. The effectiveness of our proposed methods is evaluated on a silicon photonic accelerator. Compared to standard activation implementations, our mixed activation system with the searched assignment achieves competitive accuracy with >60% energy savings on A/D conversion and activation.

Aging-Aware Training for Printed Neuromorphic Circuits

  • Haibin Zhao
  • Michael Hefenbrock
  • Michael Beigl
  • Mehdi B. Tahoori

Printed electronics allow for ultra-low-cost circuit fabrication with unique properties such as flexibility, non-toxicity, and stretchability. Because of these advanced properties, there is a growing interest in adapting printed electronics for emerging areas such as fast-moving consumer goods and wearable technologies. In such domains, analog signal processing in or near the sensor is favorable. Printed neuromorphic circuits have recently been proposed as a solution to perform such analog processing natively. Additionally, their learning-based design process allows highly efficient optimization and enables them to mitigate the high process variations associated with low-cost printed processes. In this work, we address the aging of printed components, an effect that can significantly degrade the accuracy of printed neuromorphic circuits over time. To this end, we develop a stochastic aging model to describe the behavior of aged printed resistors and modify the training objective by considering the expected loss over the lifetime of the device. This approach ensures acceptable accuracy over the device lifetime. Our experiments show that an overall 35.8% improvement in expected accuracy over the device lifetime can be achieved using the proposed learning approach.
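
To make the expected-loss objective concrete, the following minimal sketch averages gradients over sampled aging states at every training step: the deployed weights are perturbed by a simple multiplicative drift model whose variance grows with device age. The drift model and the linear-regression task are illustrative assumptions, not the printed-resistor aging model of the paper.

```python
# Training against an *expected* loss over device lifetime (illustrative sketch):
# sample aging states, apply them to the weights, and average the gradients.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=256)

theta = np.zeros(8)
LIFETIME_SAMPLES, LR, STEPS = 16, 0.05, 200

def sample_drift(shape):
    """Multiplicative weight drift at a random point of the device lifetime."""
    t = rng.uniform(0.0, 1.0)                            # normalized device age
    return 1.0 + rng.normal(0.0, 0.05 * t, size=shape)   # variance grows with age

for _ in range(STEPS):
    grad = np.zeros_like(theta)
    for _ in range(LIFETIME_SAMPLES):
        d = sample_drift(theta.shape)
        pred = X @ (theta * d)                           # aged weights at inference time
        grad += d * (X.T @ (pred - y)) * (2 / len(y))    # gradient of squared loss w.r.t. theta
    theta -= LR * grad / LIFETIME_SAMPLES                # descend the expected loss
```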

SESSION: Designing DNN Accelerators

Session details: Designing DNN Accelerators

  • Elliott Delaye
  • Yiyu Shi

Workload-Balanced Graph Attention Network Accelerator with Top-K Aggregation Candidates

  • Naebeom Park
  • Daehyun Ahn
  • Jae-Joon Kim

Graph attention networks (GATs) are gaining attention for various transductive and inductive graph processing tasks due to their higher accuracy than conventional graph convolutional networks (GCNs). The power-law distribution of real-world graph-structured data, on the other hand, causes a severe workload imbalance problem for GAT accelerators. To reduce the degradation of PE utilization due to this workload imbalance, we present algorithm/hardware co-design results for a GAT accelerator that balances the workload assigned to processing elements by allowing only K neighbor nodes to participate in the aggregation phase. The proposed model selects the K neighbor nodes with the highest attention scores, which represent the relevance between two nodes, to minimize the accuracy drop. Experimental results show that our algorithm/hardware co-design of the GAT accelerator achieves higher processing speed and energy efficiency than GAT accelerators using conventional workload balancing techniques. Furthermore, we demonstrate that the proposed GAT accelerators can be made faster than GCN accelerators, which typically process a smaller number of computations.
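
A minimal sketch of the top-K selection described above: only the K neighbors with the highest attention scores participate in aggregation, so the per-node workload is bounded regardless of node degree. Scores and features are random placeholders rather than outputs of a real GAT layer.

```python
# Top-K neighbor aggregation sketch: keep only the K highest-scoring neighbors,
# then softmax over the kept scores and aggregate their features.
import numpy as np

rng = np.random.default_rng(2)
K, F = 4, 8
neighbor_feats = rng.normal(size=(11, F))        # features of 11 neighbors (placeholder)
scores = rng.normal(size=11)                     # unnormalized attention scores (placeholder)

top = np.argsort(scores)[-K:]                    # indices of the K largest scores
alpha = np.exp(scores[top] - scores[top].max())
alpha /= alpha.sum()                             # softmax over the kept neighbors only
aggregated = alpha @ neighbor_feats[top]         # fixed-size weighted aggregation
```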

Re2fresh: A Framework for Mitigating Read Disturbance in ReRAM-Based DNN Accelerators

  • Hyein Shin
  • Myeonggu Kang
  • Lee-Sup Kim

A severe read disturbance problem degrades the inference accuracy of resistive RAM (ReRAM) based deep neural network (DNN) accelerators. Refresh, which reprograms the ReRAM cells, is the most obvious solution to the problem, but programming ReRAM consumes substantial energy. To address the issue, we first analyze the resistance drift pattern of each conductance state and the actual read stress applied to the ReRAM array by considering the characteristics of ReRAM-based DNN accelerators. Based on this analysis, we cluster ReRAM cells into a few groups for each layer of the DNN and generate a proper refresh cycle for each group in the offline phase. The individual refresh cycles reduce energy consumption by eliminating unnecessary refresh operations. In the online phase, the refresh controller selectively launches refresh operations according to the generated refresh cycles. ReRAM cells are selectively refreshed by minimally modifying the conventional structure of the ReRAM-based DNN accelerator. The proposed work successfully resolves the read disturbance problem, reducing the energy consumption of refresh operations by 97% while preserving inference accuracy.

FastStamp: Accelerating Neural Steganography and Digital Watermarking of Images on FPGAs

  • Shehzeen Hussain
  • Nojan Sheybani
  • Paarth Neekhara
  • Xinqiao Zhang
  • Javier Duarte
  • Farinaz Koushanfar

Steganography and digital watermarking are the tasks of hiding recoverable data in image pixels. Deep neural network (DNN) based image steganography and watermarking techniques are quickly replacing traditional hand-engineered pipelines. DNN based watermarking techniques have drastically improved the message capacity, imperceptibility and robustness of the embedded watermarks. However, this improvement comes at the cost of increased computational overhead of the watermark encoder neural network. In this work, we design FastStamp, the first accelerator platform to perform DNN based steganography and digital watermarking of images in hardware. We first propose a parameter-efficient DNN model for embedding recoverable bit-strings in image pixels. Our proposed model can match the success metrics of prior state-of-the-art DNN based watermarking methods while being significantly faster and lighter in terms of memory footprint. We then design an FPGA based accelerator framework to further improve the model throughput and power consumption by leveraging data parallelism and customized computation paths. FastStamp allows embedding hardware signatures into images to establish media authenticity and ownership of digital media. Our best design achieves 68× faster inference compared to GPU implementations of prior DNN based watermark encoders while consuming less power.

SESSION: Novel Chiplet Approaches from Interconnect to System (Virtual)

Session details: Novel Chiplet Approaches from Interconnect to System (Virtual)

  • Xinfei Guo

GIA: A Reusable General Interposer Architecture for Agile Chiplet Integration

  • Fuping Li
  • Ying Wang
  • Yuanqing Cheng
  • Yujie Wang
  • Yinhe Han
  • Huawei Li
  • Xiaowei Li

2.5D chiplet technology is gaining popularity for its efficiency in integrating multiple heterogeneous dies or chiplets on interposers, and it is also considered an ideal option for agile silicon system design because it mitigates the huge design, verification, and manufacturing overhead of monolithic SoCs. Although chiplet reuse significantly reduces development costs, the design and fabrication of interposers introduce additional high non-recurring engineering (NRE) costs and development cycles, which might be prohibitive for application-specific designs with low volume.

To address this challenge, in this paper we propose a reusable general interposer architecture (GIA) to amortize the NRE costs and accelerate the integration flow of interposers across different chiplet-based systems. The proposed assembly-time configurable interposer architecture covers both active and passive interposers, considering the diverse applications of 2.5D systems. Agile interposer integration is further facilitated by a novel end-to-end design automation framework that generates optimal system assembly configurations, including the selection of chiplets, the inter-chiplet network configuration, the placement of chiplets, and the mapping onto GIA, specialized for a given target workload. The experimental results show that our proposed active GIA and passive GIA achieve 3.15× and 60.92× performance boosts with 2.57× and 2.99× power savings over the baselines, respectively.

Accelerating Cache Coherence in Manycore Processor through Silicon Photonic Chiplet

  • Chengeng Li
  • Fan Jiang
  • Shixi Chen
  • Jiaxu Zhang
  • Yinyi Liu
  • Yuxiang Fu
  • Jiang Xu

Cache coherence overhead in manycore systems is becoming prominent with the increase in system scale. However, traditional electrical networks restrict the efficiency of cache coherence transactions due to their limited bandwidth and long latency. Optical networks promise high bandwidth and low latency and support both efficient unicast and multicast transmission, which can potentially accelerate cache coherence in manycore systems. This work proposes PCCN, a novel photonic cache coherence network with a physically centralized, logically distributed directory for chiplet-based manycore systems. PCCN adopts a channel sharing method with a contention-resolution mechanism for efficient long-distance transmission of coherence-related packets. Experimental results show that, compared to state-of-the-art proposals, PCCN speeds up application execution time by 1.32×, reduces memory access latency by 26%, and improves energy efficiency by 1.26×, on average, in a 128-core system.

Re-LSM: A ReRAM-Based Processing-in-Memory Framework for LSM-Based Key-Value Store

  • Qian Wei
  • Zhaoyan Shen
  • Yiheng Tong
  • Zhiping Jia
  • Lei Ju
  • Jiezhi Chen
  • Bingzhe Li

Log-structured merge (LSM) tree based key-value (KV) stores organize writes into hierarchical batches for high-speed writing. However, the notorious compaction process of the LSM-tree severely hurts system performance. It not only involves huge I/O operations but also consumes tremendous computation and memory resources. In this paper, we first find that when compaction happens in the high levels of the LSM-tree (i.e., L0 and L1), it may saturate all system computation and memory resources and eventually stall the whole system. Based on this observation, we present Re-LSM, a ReRAM-based Processing-in-Memory (PIM) framework for LSM-based key-value stores. Specifically, in Re-LSM, we propose to offload certain computation- and memory-intensive tasks in the high levels of the LSM-tree to the ReRAM-based PIM space. A highly parallel ReRAM compaction accelerator is designed by decomposing the three-phase compaction into basic logic operating units. Evaluation results based on db_bench and YCSB show that Re-LSM achieves a 2.2× improvement in random-write throughput compared to RocksDB, and that the ReRAM-based compaction accelerator speeds up the CPU-based implementation by 64.3× and reduces energy by 25.5×.
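
For readers unfamiliar with compaction, the sketch below shows the core primitive being offloaded: a k-way merge of sorted runs from adjacent LSM-tree levels that keeps only the newest value per key. It is a simplified, single-threaded illustration of the operation that Re-LSM decomposes into ReRAM logic units; the run contents are placeholders.

```python
# k-way merge of sorted key-value runs, keeping the newest value per key.
import heapq

def compact(runs):
    """Merge sorted (key, value) runs into one run; runs[0] is the newest run,
    runs[-1] the oldest. Each run must be sorted by key with unique keys."""
    tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(runs)]
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:                      # first hit per key = newest version
            merged.append((key, value))
            last_key = key
    return merged

# Example: L0 (newest) overrides L1 for key "b".
l0 = [("a", 1), ("b", 9)]
l1 = [("b", 2), ("c", 3)]
print(compact([l0, l1]))   # [('a', 1), ('b', 9), ('c', 3)]
```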

SESSION: Architecture for DNN Acceleration (Virtual)

Session details: Architecture for DNN Acceleration (Virtual)

  • Zhezhi He

Hidden-ROM: A Compute-in-ROM Architecture to Deploy Large-Scale Neural Networks on Chip with Flexible and Scalable Post-Fabrication Task Transfer Capability

  • Yiming Chen
  • Guodong Yin
  • Mingyen Lee
  • Wenjun Tang
  • Zekun Yang
  • Yongpan Liu
  • Huazhong Yang
  • Xueqing Li

Motivated by reducing the data transfer activities in data-intensive neural network computing, SRAM-based compute-in-memory (CiM) has made significant progress. Unfortunately, SRAM has low density and limited on-chip capacity, which makes the deployment of large models inefficient due to the frequent DRAM accesses needed to update the weights in SRAM. Recently, a ROM-based CiM design, YOLoC, revealed the unique opportunity of deploying a large-scale neural network in CMOS by exploiting the intriguing high density of ROM. However, even though an assisting SRAM has been adopted in YOLoC for task transfer within the same domain, it is still a big challenge to overcome the read-only limitation of ROM and enable more flexibility. Therefore, it is of paramount significance to develop new ROM-based CiM architectures that provide broader task space and model expansion capability for more complex tasks.

This paper presents Hidden-ROM for high flexibility of ROM-based CiM. Hidden-ROM provides several novel ideas beyond YOLoC. First, it adopts a one-SRAM-many-ROM method that “hides” ROM cells to support various datasets of different domains, including CIFAR10/100, FER2013, and ImageNet. Second, Hidden-ROM provides the model expansion capability after chip fabrication to update the model for more complex tasks when needed. Experiments show that Hidden-ROM designed for ResNet-18 pretrained on CIFAR100 (item classification) can achieve <0.5% accuracy loss in FER2013 (facial expression recognition), while YOLoC degrades by >40%. After expanding to ResNet-50/101, Hidden-ROM even achieves 68.6%/72.3% accuracy in ImageNet, close to 74.9%/76.4% by software. Such expansion costs only 7.6%/12.7% energy efficiency overhead while providing 12%/16% accuracy improvement after expansion.

DCIM-GCN: Digital Computing-in-Memory to Efficiently Accelerate Graph Convolutional Networks

  • Yikan Qiu
  • Yufei Ma
  • Wentao Zhao
  • Meng Wu
  • Le Ye
  • Ru Huang

Computing-in-memory (CIM) is emerging as a promising architecture to accelerate graph convolutional networks (GCNs), which are normally bounded by redundant and irregular memory transactions. Current analog based CIM requires frequent analog and digital conversions (AD/DA) that dominate the overall area and power consumption. Furthermore, the analog non-ideality degrades the accuracy and reliability of CIM. In this work, an SRAM based digital CIM system, namely DCIM-GCN, is proposed to accelerate memory intensive GCNs, covering innovations from the CIM circuit level, which eliminates costly AD/DA converters, to the architecture level, which addresses the irregularity and sparsity of graph data. DCIM-GCN achieves 2.07×, 1.76×, and 1.89× speedup and 29.98×, 1.29×, and 3.73× energy efficiency improvement on average over CIM based PIMGCN, TARe, and PIM-GCN, respectively.

Hardware Computation Graph for DNN Accelerator Design Automation without Inter-PU Templates

  • Jun Li
  • Wei Wang
  • Wu-Jun Li

Existing deep neural network (DNN) accelerator design automation (ADA) methods adopt architecture templates to predetermine parts of the design choices and then explore the remaining design choices beyond the templates. These templates can be classified into intra-PU templates and inter-PU templates according to the architecture hierarchy. Since templates limit the flexibility of ADA, designing effective ADA methods without templates has become an important research topic. Although some works have enhanced the flexibility of ADA by removing intra-PU templates, to the best of our knowledge no existing work has studied ADA methods without inter-PU templates. ADA with predetermined inter-PU templates is typically inefficient in terms of resource utilization, especially for DNNs with complex topologies. In this paper, we propose a novel method, called hardware computation graph (HCG), for ADA without inter-PU templates. Experiments show that the HCG method achieves competitive latency while using 1.4×–5× less on-chip memory compared with existing state-of-the-art ADA methods.

SESSION: Multi-Purpose Fundamental Digital Design Improvements (Virtual)

Session details: Multi-Purpose Fundamental Digital Design Improvements (Virtual)

  • Sabya Das
  • Mondira Pant

Dynamic Frequency Boosting Beyond Critical Path Delay

  • Nikolaos Zompakis
  • Sotirios Xydis

This paper introduces an innovative post-implementation Dynamic Frequency Boosting (DFB) technique that releases the “hidden” performance margins of digital circuit designs currently suppressed by typical critical-path-constrained design flows, thus defining higher limits of operation speed. The proposed technique goes beyond the state of the art by exploiting data-driven path delay variability through an innovative hardware clocking mechanism that detects path activation in real time. In contrast to timing speculation, the operating speed is adjusted based on the activated paths' nominal delays, achieving error-free acceleration. The proposed technique has been evaluated on three FPGA-based use cases carefully selected to exhibit differing domain characteristics: i) a third-party DNN inference accelerator IP for CIFAR-10 images, achieving an average speedup of 18%; ii) a highly designer-optimized Optical Digital Equalizer design, in which DFB delivered a speedup of 50%; and iii) a set of 5 synthetic designs examining high-frequency (beyond 400 MHz) applications in FPGAs, achieving accelerations of 20–60% depending on the underlying path variability.

ASPPLN: Accelerated Symbolic Probability Propagation in Logic Network

  • Weihua Xiao
  • Weikang Qian

Probability propagation is an important task in logic network analysis, which propagates signal probabilities from the primary inputs of a network to its primary outputs. It has many applications, such as power estimation, reliability analysis, and error analysis for approximate circuits. Existing methods for the task can be divided into two categories: simulation-based and probability-based methods. However, most of them suffer from low accuracy or poor scalability. In this work, we propose ASPPLN, a method for accelerated symbolic probability propagation in logic networks, which has linear complexity in the network size. We first introduce a new graph notion called the redundant input and take advantage of it to simplify the propagation process without losing accuracy. Then, a technique called symbol limitation is proposed to limit the complexity of each node's propagation according to the partial probability significances of the symbols. Experimental results show that, compared to existing methods, ASPPLN improves the estimation accuracy of switching activity by up to 24.70%, while also achieving a speedup of up to 29×.
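
As a point of reference, the sketch below performs plain signal-probability propagation through a small gate-level network in topological order, assuming independent gate inputs; ASPPLN's symbolic propagation and symbol limitation refine exactly this kind of computation. The tiny netlist is an illustrative placeholder.

```python
# Signal-probability propagation under the independence assumption.
def propagate(gates, input_probs):
    """gates: list of (name, op, fanins) in topological order;
    input_probs: {primary_input: P(signal = 1)}."""
    p = dict(input_probs)
    for name, op, fanins in gates:
        probs = [p[f] for f in fanins]
        if op == "AND":
            v = 1.0
            for q in probs:
                v *= q
        elif op == "OR":
            v = 1.0
            for q in probs:
                v *= (1.0 - q)
            v = 1.0 - v
        elif op == "NOT":
            v = 1.0 - probs[0]
        else:
            raise ValueError(op)
        p[name] = v
    return p

netlist = [("n1", "AND", ["a", "b"]), ("n2", "NOT", ["c"]), ("out", "OR", ["n1", "n2"])]
print(propagate(netlist, {"a": 0.5, "b": 0.5, "c": 0.5}))
# out: 1 - (1 - 0.25) * (1 - 0.5) = 0.625
```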

A High-Precision Stochastic Solver for Steady-State Thermal Analysis with Fourier Heat Transfer Robin Boundary Conditions

  • Longlong Yang
  • Cuiyang Ding
  • Changhao Yan
  • Dian Zhou
  • Xuan Zeng

In this work, we propose a path integral random walk (PIRW) solver, the first accurate stochastic method for steady-state thermal analysis with mixed boundary conditions, especially Fourier heat transfer Robin boundary conditions. We innovatively adopt the strictly correct calculation of the local time and the Feynman-Kac functional to handle Neumann and Robin boundary conditions with high precision. Compared with ANSYS, experimental results show that PIRW achieves over 121× speedup and over 83× storage space reduction with a negligible error within 0.8 °C at a single point. An application combining PIRW with low-accuracy ANSYS for temperature calculation at hot-spots is provided as a more accurate and faster solution than using ANSYS alone.
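
The stochastic idea behind such solvers can be illustrated in a much simpler setting: on a uniform grid with fixed-temperature (Dirichlet) boundaries, the steady-state temperature at one node equals the expected boundary temperature reached by an unbiased nearest-neighbor random walk started there. The sketch below shows this single-point, mesh-free flavor; handling Neumann and Robin (Fourier heat transfer) boundaries, as PIRW does, additionally requires the local-time machinery described in the paper. Grid size and boundary values are assumptions for illustration.

```python
# Monte Carlo random-walk estimate of steady-state temperature at one grid node
# with Dirichlet boundaries (hot top edge, cool elsewhere).
import random

N = 32                                   # interior nodes per side (1..N)
def boundary_temp(x, y):
    return 100.0 if y > N else 25.0      # walk exited via the top edge, or elsewhere

def temperature_at(x0, y0, walks=5000):
    total = 0.0
    for _ in range(walks):
        x, y = x0, y0
        while 1 <= x <= N and 1 <= y <= N:          # still in the interior
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_temp(x, y)                # record the boundary value hit
    return total / walks

print(temperature_at(N // 2, N // 2))    # roughly 25 + 75/4 = 43.75 near the center
```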

SESSION: GPU Acceleration for Routing Algorithms (Virtual)

Session details: GPU Acceleration for Routing Algorithms (Virtual)

  • Umamaheswara Rao Tida

Superfast Full-Scale GPU-Accelerated Global Routing

  • Shiju Lin
  • Martin D. F. Wong

Global routing is an essential step in physical design. Recently, there have been works on accelerating global routers using GPUs. However, they focus only on certain stages of global routing and achieve limited overall speedup. In this paper, we present a superfast full-scale GPU-accelerated global router and introduce useful parallelization techniques for routing. Experiments show that our 3D router achieves both good quality and short runtime compared to other state-of-the-art academic global routers.

X-Check: GPU-Accelerated Design Rule Checking via Parallel Sweepline Algorithms

  • Zhuolun He
  • Yuzhe Ma
  • Bei Yu

Design rule checking (DRC) is essential in physical verification to ensure high yield and reliability for VLSI circuit designs. To achieve reasonable design cycle times, acceleration of computationally intensive DRC tasks is needed to accommodate the ever-growing complexity of modern VLSI circuits. In this paper, we propose X-Check, a GPU-accelerated design rule checker. X-Check integrates novel parallel sweepline algorithms, which are efficient in practice and carry nontrivial theoretical guarantees. Experimental results demonstrate the significant speedup achieved by X-Check compared with a multi-threaded CPU checker.
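
To show the kind of primitive being parallelized, the sketch below is a sequential sweepline that reports pairs of intersecting axis-aligned rectangles; many spacing checks reduce to such intersection tests after bloating shapes by the minimum spacing. X-Check's contribution lies in running such sweeps in parallel on the GPU with theoretical guarantees, which this serial illustration does not capture.

```python
# Sequential sweepline over x: maintain an active set of rectangles and test
# y-overlap whenever a new rectangle enters the sweep.
def overlapping_pairs(rects):
    """rects: list of (x_lo, y_lo, x_hi, y_hi); returns index pairs that intersect."""
    events = []
    for i, (xl, yl, xh, yh) in enumerate(rects):
        events.append((xl, 1, i))        # 1 = rectangle enters the sweep
        events.append((xh, 0, i))        # 0 = rectangle leaves (processed first on ties)
    events.sort()
    active, out = set(), []
    for _, kind, i in events:
        if kind == 0:
            active.discard(i)
            continue
        yl, yh = rects[i][1], rects[i][3]
        for j in active:                 # check y-overlap against the active set
            yl2, yh2 = rects[j][1], rects[j][3]
            if yl < yh2 and yl2 < yh:
                out.append((j, i))
        active.add(i)
    return out

print(overlapping_pairs([(0, 0, 4, 4), (3, 3, 6, 6), (10, 0, 12, 2)]))  # [(0, 1)]
```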

GPU-Accelerated Rectilinear Steiner Tree Generation

  • Zizheng Guo
  • Feng Gu
  • Yibo Lin

Rectilinear Steiner minimum tree (RSMT) generation is a fundamental component in the VLSI design automation flow. Due to its extensive usage in circuit design iterations at early design stages like synthesis, placement, and routing, the performance of RSMT generation is critical for a reasonable design turnaround time. State-of-the-art RSMT generation algorithms, like fast look-up table estimation (FLUTE), are constrained by CPU-based parallelism with limited runtime improvements. Accelerating RSMT generation on GPUs is an important yet difficult task due to its complex, recursive divide-and-conquer computation patterns. In this paper, we present the first GPU-accelerated RSMT generation algorithm based on FLUTE. By designing GPU-efficient data structures and levelized decomposition, table look-up, and merging operations, we incorporate large-scale data parallelism into the generation of Steiner trees. A runtime speedup of up to 10.47× is achieved compared with FLUTE running on 40 CPU cores, filling in a critical missing component in today's GPU-accelerated design automation frameworks.

SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist I (Virtual)

Session details: Breakthroughs in Synthesis – Infrastructure and ML Assist I (Virtual)

  • Christian Pilato
  • Miroslav Velev

HECTOR: A Multi-Level Intermediate Representation for Hardware Synthesis Methodologies

  • Ruifan Xu
  • Youwei Xiao
  • Jin Luo
  • Yun Liang

Hardware synthesis requires a complicated process to generate synthesizable register transfer level (RTL) code. High-level synthesis tools can automatically transform a high-level description into a hardware design, while hardware generators adopt domain-specific languages and synthesis flows for specific applications. The implementation of these tools generally requires substantial engineering effort due to RTL's weak expressivity and low level of abstraction. Furthermore, different synthesis tools adopt different levels of intermediate representation (IR) and different transformations. A unified IR is clearly a good way to lower the engineering cost and obtain competitive hardware designs rapidly by exploring different synthesis methodologies.

In this paper, we propose Hector, a two-level IR providing a unified intermediate representation for hardware synthesis methodologies. The high-level IR binds computation with a control graph annotated with timing information, while the low-level IR provides a concise way to describe hardware modules and elastic interconnections among them. Implemented based on the multi-level compiler infrastructure (MLIR), Hector’s IRs can be converted to synthesizable RTL designs. To demonstrate the expressivity and versatility, we implement three synthesis approaches based on Hector: a high-level synthesis (HLS) tool, a systolic array generator, and a hardware accelerator. The hardware generated by Hector’s HLS approach is comparable to that generated by the state-of-the-art HLS tools, and the other two cases outperform HLS implementations in performance and productivity.

QCIR: Pattern Matching Based Universal Quantum Circuit Rewriting Framework

  • Mingyu Chen
  • Yu Zhang
  • Yongshang Li
  • Zhen Wang
  • Jun Li
  • Xiangyang Li

Due to multiple limitations of quantum computers in the NISQ era, quantum compilation efforts are required to efficiently execute quantum algorithms on NISQ devices. Program rewriting based on pattern matching can improve the generalization ability of compiler optimization. However, it has rarely been explored for quantum circuit optimization, let alone with further consideration of the physical features of target devices.

In this paper, we propose QCIR, a pattern-matching based quantum circuit optimization framework with a novel pattern description format, enabling user-configured cost models and two categories of patterns, i.e., generic patterns and folding patterns. To reduce compilation latency, we propose a DAG representation of quantum circuits called QCir-DAG, together with the QVF algorithm for subcircuit matching. We implement a continuous single-qubit optimization pass constructed with QCIR, achieving 10% and 20% optimization rates for benchmarks from Qiskit and ScaffCC, respectively. The practicality of QCIR is demonstrated by execution times and experimental results on quantum simulators and quantum devices.

Batch Sequential Black-Box Optimization with Embedding Alignment Cells for Logic Synthesis

  • Chang Feng
  • Wenlong Lyu
  • Zhitang Chen
  • Junjie Ye
  • Mingxuan Yuan
  • Jianye Hao

During the logic synthesis flow of EDA, a sequence of graph transformation operators is applied to the circuit, so the Quality of Results (QoR) highly depends on the chosen operators and their specific parameters in the sequence, making the search space operator-dependent and exponentially large. In this paper, we formulate logic synthesis design space exploration as a conditional sequence optimization problem in which, at each transformation step, an optimization operator is selected and its corresponding parameters are decided. To solve this problem, we propose a novel sequential black-box optimization approach without human intervention: 1) Due to the conditional and sequential structure of operator sequences with variable length, we build a recurrent neural network with embedding alignment cells as a surrogate model to estimate the QoR of the logic synthesis flow from historical data. 2) With the surrogate model, we construct acquisition functions to balance exploration and exploitation with respect to each metric of the QoR. 3) We use a multi-objective optimization algorithm to find the Pareto front of the acquisition functions, along which a batch of sequences, consisting of parameterized operators, is (randomly) selected and given to users for evaluation under the budget of computing resources. We repeat the above three steps until convergence or the time limit. Experimental results on public EPFL benchmarks demonstrate the superiority of our approach over expert-crafted optimization flows and other machine learning based methods. Compared to resyn2, we achieve an 11.8% reduction in LUT-6 count without sacrificing level values.

Heterogeneous Graph Neural Network-Based Imitation Learning for Gate Sizing Acceleration

  • Xinyi Zhou
  • Junjie Ye
  • Chak-Wa Pui
  • Kun Shao
  • Guangliang Zhang
  • Bin Wang
  • Jianye Hao
  • Guangyong Chen
  • Pheng Ann Heng

Gate sizing is an important step in logic synthesis, where cells are resized to optimize metrics such as area, timing, power, and leakage. In this work, we consider the gate sizing problem for leakage power optimization under timing constraints. Lagrangian Relaxation is a widely employed optimization method for gate sizing problems. We accelerate Lagrangian Relaxation-based algorithms by narrowing down the range of cells to resize. In particular, we formulate a heterogeneous directed graph to represent the timing graph, propose a heterogeneous graph neural network as the encoder, and train it via imitation learning to mimic the selection behavior of each Lagrangian Relaxation iteration. This network is used to predict the set of cells that need to be changed during the optimization process of Lagrangian Relaxation. Experiments show that our accelerated gate sizer achieves performance comparable to the baseline with an average of 22.5% runtime reduction.

SESSION: Smart Search (Virtual)

Session details: Smart Search (Virtual)

  • Jianlei Yang

NASA: Neural Architecture Search and Acceleration for Hardware Inspired Hybrid Networks

  • Huihong Shi
  • Haoran You
  • Yang Zhao
  • Zhongfeng Wang
  • Yingyan Lin

Multiplication is arguably the most cost-dominant operation in modern deep neural networks (DNNs), limiting their achievable efficiency and thus more extensive deployment in resource-constrained applications. To tackle this limitation, pioneering works have developed handcrafted multiplication-free DNNs, which require expert knowledge and time-consuming manual iteration, calling for fast development tools. To this end, we propose a Neural Architecture Search and Acceleration framework dubbed NASA, which enables automated multiplication-reduced DNN development and integrates a dedicated multiplication-reduced accelerator to boost the achievable efficiency of DNNs. Specifically, NASA adopts neural architecture search (NAS) spaces that augment the state-of-the-art one with hardware-inspired multiplication-free operators, such as shift and adder, armed with a novel progressive pretrain strategy (PGP) and customized training recipes to automatically search for optimal multiplication-reduced DNNs. On top of that, NASA further develops a dedicated accelerator, which advocates a chunk-based template and an auto-mapper dedicated to DNNs resulting from NASA's NAS, to better leverage their algorithmic properties for boosting hardware efficiency. Experimental results and ablation studies consistently validate the advantages of NASA's algorithm-hardware co-design framework in terms of achievable accuracy and efficiency tradeoffs. Codes are available at https://github.com/shihuihong214/NASA.

Personalized Heterogeneity-Aware Federated Search Towards Better Accuracy and Energy Efficiency

  • Zhao Yang
  • Qingshuang Sun

Federated learning (FL), a new distributed learning technology, allows us to train a global model on edge and embedded devices without sharing local data. However, due to the wide distribution of different types of devices, FL faces severe heterogeneity issues. The accuracy and efficiency of FL deployment at the edge are severely impacted by heterogeneous data and heterogeneous systems. In this paper, we perform joint FL model personalization for heterogeneous systems and heterogeneous data to address the challenges posed by these heterogeneities. We begin by using model inference efficiency as a starting point to personalize the network scale on each node. This can further guide the FL training process, helping to ease the straggler problem and improve FL's energy efficiency. During FL training, federated search is then used to acquire highly accurate personalized network structures. By taking into account the unique characteristics of FL deployment on edge devices, the personalized network structures obtained by our federated search framework with a lightweight search controller achieve competitive accuracy with state-of-the-art (SOTA) methods, while reducing inference and training energy consumption by up to 3.57× and 1.82×, respectively.

SESSION: Reconfigurable Computing: Accelerators and Methodologies I (Virtual)

Session details: Reconfigurable Computing: Accelerators and Methodologies I (Virtual)

  • Cheng Tan

Towards High Performance and Accurate BNN Inference on FPGA with Structured Fine-Grained Pruning

  • Keqi Fu
  • Zhi Qi
  • Jiaxuan Cai
  • Xulong Shi

As the extreme case of quantization networks, Binary Neural Networks (BNNs) have received tremendous attention due to many hardware-friendly properties in terms of storage and computation. To reach the limit of compact models, we attempt to combine binarization with pruning techniques, further exploring the redundancy of BNNs. However, coarse-grained pruning methods may cause severe accuracy drops, while traditional fine-grained ones induce irregular sparsity that is hard to utilize in hardware. In this paper, we propose two advanced fine-grained BNN pruning modules, i.e., structured channel-wise kernel pruning and dynamic spatial pruning, from a joint algorithm/hardware perspective. The pruned BNN models are trained from scratch and present not only higher precision but also a high degree of parallelism. We then develop an accelerator architecture that can effectively exploit the sparsity produced by our algorithm. Finally, we implement the pruned BNN models on an embedded FPGA (Ultra96v2). The results show that our software/hardware codesign achieves a 5.4× inference speedup over the baseline BNN, with higher resource and energy efficiency compared with prior FPGA-implemented BNN works.

Towards High-Quality CGRA Mapping with Graph Neural Networks and Reinforcement Learning

  • Yan Zhuang
  • Zhihao Zhang
  • Dajiang Liu

Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising solution to accelerate domain applications due to their good combination of energy efficiency and flexibility. Loops, as the computation-intensive parts of applications, are often mapped onto CGRAs, and modulo scheduling is commonly used to improve execution performance. However, the actual performance of modulo scheduling is highly dependent on the mapping ability of the Data Dependency Graph (DDG) extracted from a loop. As existing approaches usually separate the routing exploration of multi-cycle dependences from mapping for fast compilation, they easily suffer from poor mapping quality. In this paper, we integrate routing exploration into the mapping process, giving it more opportunities to find a globally optimized solution. Meanwhile, with a reduced resource graph defined, the search space of the new mapping problem does not grow greatly. To efficiently solve the problem, we introduce graph neural network based reinforcement learning to predict a placement distribution over the different resource nodes for all operations in a DDG. Using routing connectivity as the reward signal, we optimize the parameters of the neural network to find a valid mapping solution with a policy gradient method. Without much engineering and heuristic design, our approach achieves 1.57× better mapping quality compared to the state-of-the-art heuristic.

SESSION: Hardware Security: Attacks and Countermeasures (Virtual)

Session details: Hardware Security: Attacks and Countermeasures (Virtual)

  • Johann Knechtel
  • Lejla Batina

Attack Directories on ARM big.LITTLE Processors

  • Zili Kou
  • Sharad Sinha
  • Wenjian He
  • Wei Zhang

Eviction-based cache side-channel attacks take advantage of inclusive cache hierarchies and shared cache hardware. Processors with the ARM big.LITTLE architecture do not guarantee such preconditions and therefore do not usually allow cross-core attacks, let alone cross-cluster attacks. This work reveals a new side-channel based on the snoop filter (SF), an unexplored directory structure embedded in ARM big.LITTLE processors. Our systematic reverse engineering unveils the undocumented structure and properties of the SF, and we successfully utilize it to bootstrap cross-core and cross-cluster cache eviction. We demonstrate a comprehensive methodology to exploit the SF side-channel, including the construction of eviction sets, a covert channel, and attacks against RSA and AES. When attacking TrustZone, we conduct an interrupt-based side-channel attack to extract the RSA key with a single profiling trace, despite the strict cache-clean defense. Supported by detailed experiments, the SF side-channel not only achieves competitive performance but also overcomes the main challenge of cache side-channel attacks on ARM big.LITTLE processors.

AntiSIFA-CAD: A Framework to Thwart SIFA at the Layout Level

  • Rajat Sadhukhan
  • Sayandeep Saha
  • Debdeep Mukhopadhyay

Fault Attacks (FA) have gained a lot of attention from both industry and academia due to their practicality and wide applicability to different domains of computing. In the context of symmetric-key cryptography, designing countermeasures against FA is still an open problem. Recently proposed attacks such as Statistical Ineffective Fault Analysis (SIFA) have shown that merely adding redundancy or infection-based countermeasures to detect the fault does not work, and that a proper combination of masking and error correction/detection is required. In this work, we show that masking, which is mathematically established as a good countermeasure against a certain class of SIFA faults, may in practice fall short if low-level details during physical design layout development are not taken care of. We initiate this study by demonstrating a successful SIFA attack on a placed-and-routed masked crypto design for an ASIC platform. We then propose a fully automated approach, along with a proper choice of placement constraints, which can be realized easily with any commercial CAD tool to eliminate this vulnerability during the physical layout development process. Experimental validation of our tool flow on a masked implementation of the PRESENT cipher establishes our claim.

SESSION: Advanced VLSI Routing and Layout Learning

Session details: Advanced VLSI Routing and Layout Learning

  • Wing-Kai Chow
  • David Chinnery

A Stochastic Approach to Handle Non-Determinism in Deep Learning-Based Design Rule Violation Predictions

  • Rongjian Liang
  • Hua Xiang
  • Jinwook Jung
  • Jiang Hu
  • Gi-Joon Nam

Deep learning is a promising approach to early DRV (Design Rule Violation) prediction. However, non-deterministic parallel routing hampers model training and degrades prediction accuracy. In this work, we propose a stochastic approach, called LGC-Net, to solve this problem. In this approach, we develop new techniques, a Gaussian random field layer and a focal likelihood loss function, to seamlessly integrate the Log Gaussian Cox process with deep learning. The approach provides not only statistical regression results but also classification results with different thresholds, without retraining. Experimental results with noisy training data on industrial designs demonstrate that LGC-Net achieves significantly better DRV density prediction accuracy than prior art.

Obstacle-Avoiding Multiple Redistribution Layer Routing with Irregular Structures

  • Yen-Ting Chen
  • Yao-Wen Chang

In advanced packages, redistribution layers (RDLs) are extra metal layers that provide dense interconnections between the chips and the printed circuit board (PCB). To better utilize the routing resources of RDLs, published works adopt flexible vias that can be placed anywhere. Furthermore, some regions may be blocked for signal integrity protection or by manually prerouted nets (such as power/ground nets or antenna feed lines) to achieve higher performance. These blocked regions are treated as obstacles in the routing process. Since the positions of pads, obstacles, and vias can be arbitrary, the structures of RDLs become irregular. The obstacles and irregular structures substantially increase the difficulty of routing. This paper proposes a three-stage algorithm: First, the layout is partitioned by a method based on constrained Delaunay triangulation (CDT). Then, we present a global routing graph model and generate routing guides for unified-assignment netlists. Finally, a novel tile routing method is developed to obtain detailed routes. Experimental results demonstrate the robustness and effectiveness of our proposed algorithm.

TAG: Learning Circuit Spatial Embedding from Layouts

  • Keren Zhu
  • Hao Chen
  • Walker J. Turner
  • George F. Kokai
  • Po-Hsuan Wei
  • David Z. Pan
  • Haoxing Ren

Analog and mixed-signal (AMS) circuit designs still rely on human design expertise. Machine learning has been assisting circuit design automation by replacing human experience with artificial intelligence. This paper presents TAG, a new paradigm of learning the circuit representation from layouts leveraging Text, self Attention and Graph. The embedding network model learns spatial information without manual labeling. We introduce text embedding and a self-attention mechanism to AMS circuit learning. Experimental results demonstrate the ability to predict layout distances between instances with industrial FinFET technology benchmarks. The effectiveness of the circuit representation is verified by showing the transferability to three other learning tasks with limited data in the case studies: layout matching prediction, wirelength estimation, and net parasitic capacitance prediction.

SESSION: Physical Attacks and Countermeasures

Session details: Physical Attacks and Countermeasures

  • Satwik Patnaik
  • Gang Qu

PowerTouch: A Security Objective-Guided Automation Framework for Generating Wired Ghost Touch Attacks on Touchscreens

  • Huifeng Zhu
  • Zhiyuan Yu
  • Weidong Cao
  • Ning Zhang
  • Xuan Zhang

Wired ghost touch attacks are an emerging and severe threat against modern touchscreens. Attackers can make touchscreens falsely report nonexistent touches (i.e., ghost touches) by injecting common-mode noise (CMN) into the target devices via power cables. Existing attacks rely on reverse-engineering the touchscreens and then manually crafting CMN waveforms to control the types and locations of ghost touches. Although successful, they are limited in practicality and attack capability due to the touchscreens' black-box nature and the immense search space of attack parameters. To overcome these limitations, this paper presents PowerTouch, a framework that can automatically generate wired ghost touch attacks. We adopt a software-hardware co-design approach and propose a domain-specific genetic algorithm-based method tailored to the characteristics of the CMN waveform. Based on the security objectives, our framework automatically optimizes the CMN waveform towards injecting the desired type of ghost touches into regions specified by attackers. The effectiveness of PowerTouch is demonstrated by successfully launching attacks on touchscreen devices from two different brands under nine different objectives. Compared with the state-of-the-art attack, we are the first to control taps along an extra dimension and to inject swipes in both dimensions. We can place an average of 84.2% of taps on the targeted side of the screen, with the location error in the other dimension no more than 1.53 mm. An average of 94.5% of injected swipes have the correct direction. A quantitative comparison with the state-of-the-art method shows that PowerTouch achieves better attack performance.

A Combined Logical and Physical Attack on Logic Obfuscation

  • Michael Zuzak
  • Yuntao Liu
  • Isaac McDaniel
  • Ankur Srivastava

Logic obfuscation protects integrated circuits from an untrusted foundry attacker during manufacturing. To counter obfuscation, a number of logical (e.g. Boolean satisfiability) and physical (e.g. electro-optical probing) attacks have been proposed. By definition, these attacks use only a subset of the information leaked by a circuit to unlock it. Countermeasures often exploit the resulting blind-spots to thwart these attacks, limiting their scalability and generalizability. To overcome this, we propose a combined logical and physical attack against obfuscation called the CLAP attack. The CLAP attack leverages both the logical and physical properties of a locked circuit to prune the keyspace in a unified and theoretically-rigorous fashion, resulting in a more versatile and potent attack. To formulate the physical portion of the CLAP attack, we derive a logical formulation that provably identifies input sequences capable of sensitizing logically expressive regions in a circuit. We prove that electro-optically probing these regions infers portions of the key. For the logical portion of the attack, we integrate the physical attack results into a Boolean satisfiability attack to find the correct key. We evaluate the CLAP attack by launching it against four obfuscation schemes in benchmark circuits. The physical portion of the attack fully specified 60.6% of key bits and partially specified another 10.3%. The logical portion of the attack found the correct key in the physical-attack-limited keyspace in under 30 minutes. Thus, the CLAP attack unlocked each circuit despite obfuscation.

A Pragmatic Methodology for Blind Hardware Trojan Insertion in Finalized Layouts

  • Alexander Hepp
  • Tiago Perez
  • Samuel Pagliarini
  • Georg Sigl

A potential vulnerability for integrated circuits (ICs) is the insertion of hardware trojans (HTs) during manufacturing. Understanding the practicability of such an attack can lead to appropriate measures for mitigating it. In this paper, we demonstrate a pragmatic framework for analyzing the HT susceptibility of finalized layouts. Our framework is representative of a fabrication-time attack, where the adversary is assumed to have access only to a layout representation of the circuit. The framework inserts trojans into tapeout-ready layouts utilizing an Engineering Change Order (ECO) flow. The attacked security nodes are blindly searched utilizing reverse-engineering techniques. For our experimental investigation, we utilized three crypto cores (AES-128, SHA-256, and RSA) and a microcontroller (RISC-V) as targets. We explored 96 combinations of triggers, payloads, and targets for our framework. Our findings demonstrate that even in high-density designs, the covert insertion of sophisticated trojans is possible, all while maintaining the original target logic with minimal impact on power and performance. Furthermore, from our exploration, we conclude that it is too naive to utilize only placement resources as a metric for HT vulnerability. This work highlights that HT insertion success is a complex function of the placement, routing resources, the position of the attacked nodes, and further design-specific characteristics. As a result, our framework goes beyond just an attack: we present the most advanced analysis tool to assess the vulnerability of finalized layouts to HT insertion.

SESSION: Tutorial: Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

Session details: Tutorial: Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

  • Rolf Drechsler

Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

  • Rolf Drechsler
  • Alireza Mahzoon

Recently, a lot of effort has been put into developing formal verification approaches by both academic and industrial research. In practice, these techniques often give satisfactory results for some types of circuits, while they fail for others. A major challenge in this domain is that verification techniques suffer from unpredictable performance. The only way to overcome this challenge is to calculate bounds on the space and time complexities. If a verification method has polynomial space and time complexities, scalability can be guaranteed.

In this tutorial paper, we review recent developments in formal verification techniques and give a comprehensive overview of Polynomial Formal Verification (PFV). In PFV, polynomial upper bounds for the run-time and memory needed during the entire verification task hold. Thus, correctness under resource constraints can be ensured. We discuss the importance and advantages of PFV in the design flow. Formal methods on the bit-level and the word-level, and their complexities when used to verify different types of circuits, like adders, multipliers, or ALUs are presented. The current status of this new research field and directions for future work are discussed.

SESSION: Scalable Verification Technologies

Session details: Scalable Verification Technologies

  • Viraphol Chaiyakul
  • Alex Orailoglu

Arjun: An Efficient Independent Support Computation Technique and its Applications to Counting and Sampling

  • Mate Soos
  • Kuldeep S. Meel

Given a Boolean formula ϕ over a set of variables X and a projection set P ⊆ X, a set I ⊆ P is an independent support of P if any two solutions of ϕ that agree on I also agree on P. The notion of independent support is related to the classical notion of definability dating back to 1901 and has been studied over the decades. Recently, the computational problem of determining an independent support for a given formula has attained importance owing to the crucial role of independent support in hashing-based counting and sampling techniques.
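
A tiny worked example (ours, not from the paper) may help make the definition concrete:

```latex
% For \varphi = (c \leftrightarrow (a \oplus b)) over X = \{a,b,c\} and
% projection set P = \{a,b,c\}, the set I = \{a,b\} is an independent support
% of P: once a and b are fixed, c is forced to equal a \oplus b, so any two
% solutions of \varphi that agree on I also agree on all of P.
\varphi = \bigl(c \leftrightarrow (a \oplus b)\bigr), \qquad
P = \{a,b,c\}, \qquad I = \{a,b\} \subseteq P .
```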

In this paper, we design an efficient and scalable independent support computation technique that can handle formulas arising from real-world benchmarks. Our algorithmic framework, called Arjun, employs implicit and explicit definability notions and is based on a tight integration of gate-identification techniques and an assumption-based framework. We demonstrate that augmenting the state-of-the-art model counter ApproxMC4 and sampler UniGen3 with Arjun leads to significant performance improvements. In particular, ApproxMC4 augmented with Arjun counts 576 more benchmarks out of 1896, while UniGen3 augmented with Arjun samples 335 more benchmarks within the same time limit.

Compositional Verification Using a Formal Component and Interface Specification

  • Yue Xing
  • Huaixi Lu
  • Aarti Gupta
  • Sharad Malik

Property-based specification such as SystemVerilog Assertions (SVA) uses mathematical logic to specify the temporal behavior of RTL designs which can then be formally verified using model checking algorithms. These properties are specified for a single component (which may contain other components in the design hierarchy). Composing design components that have already been verified requires additional verification since incorrect communication at their interface may invalidate the properties that have been checked for the individual components. This paper focuses on a specification for their interface which can be checked individually for each component, and which guarantees that refinement-based properties checked for each component continue to hold after their composition. We do this in the setting of the Instruction-level Abstraction (ILA) specification and verification methodology. The ILA methodology provides a uniform specification for processors, accelerators and general modules at the instruction-level, and the automatic generation of a complete set of correctness properties for checking that the RTL model is a refinement of the ILA specification. We add an interface specification to model the inter-ILA communication. Further, we use our interface specification to generate a set of interface checking properties that check that the communication between the RTL components is correct. This provides the following guarantee: if each RTL component is a refinement of its ILA specification and the interface checks pass, then the RTL composition is a refinement of the ILA composition. We have applied the proposed methodology to six case studies including parts of large-scale designs such as parts of the FlexASR and NVDLA machine learning accelerators, demonstrating the practical applicability of our method.

Usage-Based RTL Subsetting for Hardware Accelerators

  • Qinhan Tan
  • Aarti Gupta
  • Sharad Malik

Recent years have witnessed increasing use of domain-specific accelerators in computing platforms to provide power-performance efficiency for emerging applications. To increase their applicability within the domain, these accelerators tend to support a large set of functions, e.g. Nvidia’s open-source Deep Learning Accelerator, NVDLA, supports five distinct groups of functions [17]. However, an individual use case of an accelerator may utilize only a subset of these functions. The unused functions lead to unnecessary overhead of silicon area, power, and hardware verification/hardware-software co-verification complexity. This motivates our research question: Given an RTL design for an accelerator and a subset of functions of interest, can we automatically extract a subset of the RTL that is sufficient for these functions and sequentially equivalent to the original RTL? We call this the Usage-based RTL Subsetting problem, referred to as the RTL subsetting problem in short. We first formally define this problem and show that it can be formulated as a program synthesis problem, which can be solved by performing expensive hyperproperty checks. To overcome the high cost, we propose multiple levels of sound over-approximations to construct an effective algorithm based on relatively less expensive temporal property checking and taint analysis for information flow checking. We demonstrate the acceptable computation cost and the quality of the results of our algorithm through several case studies of accelerators from different domains. The applicability of our proposed algorithm can be seen in its ability to subset the large NVDLA accelerator (with over 50,000 registers and 1,600,000 gates) for the group of convolution functions, where the subset reduces the total number of registers by 18.6% and the total number of gates by 37.1%.

SESSION: Optimizing Digital Design Aspects: From Gate Sizing to Multi-Bit Flip-Flops

Session details: Optimizing Digital Design Aspects: From Gate Sizing to Multi-Bit Flip-Flops

  • Amit Gupta
  • Kerim Kalafala

TransSizer: A Novel Transformer-Based Fast Gate Sizer

  • Siddhartha Nath
  • Geraldo Pradipta
  • Corey Hu
  • Tian Yang
  • Brucek Khailany
  • Haoxing Ren

Gate sizing is a fundamental netlist optimization move, and researchers have used supervised learning-based models in gate sizers. Recently, Reinforcement Learning (RL) has been tried for sizing gates (and other EDA optimization problems) but is very runtime-intensive. In this work, we explore a novel Transformer-based gate sizer, TransSizer, to directly generate optimized gate sizes given a placed and unoptimized netlist. TransSizer is trained on datasets obtained from real tapeout-quality industrial designs in a foundry 5nm technology node. Our results indicate that TransSizer achieves 97% accuracy in predicting optimized gate sizes at the post-route optimization stage. Furthermore, TransSizer offers a speedup of ~1400× while delivering similar timing, power and area metrics when compared to a leading-edge commercial tool for sizing-only optimization.

Generation of Mixed-Driving Multi-Bit Flip-Flops for Power Optimization

  • Meng-Yun Liu
  • Yu-Cheng Lai
  • Wai-Kei Mak
  • Ting-Chi Wang

Multi-bit flip-flops (MBFFs) are often used to reduce the number of clock sinks, resulting in a low-power design. A traditional MBFF is composed of individual FFs of uniform driving strength. However, if some but not all of the bits of an MBFF violate timing constraints, the MBFF has to be sized up or decomposed into smaller bit-width combinations to satisfy timing, which reduces the power saving. In this paper, we present a new MBFF generation approach considering mixed-driving MBFFs, in which certain bits have a higher driving strength than the others. To maximize the FF merging rate (and hence minimize the final number of clock sinks), our approach first performs aggressive FF merging subject to timing constraints. Our merging is aggressive in the sense that we are willing to oversize some FFs and allow empty bits in an MBFF in order to merge FFs into MBFFs of uniform driving strength as much as possible. The oversized individual FFs of an MBFF are later downsized subject to timing constraints, which results in a mixed-driving MBFF. Our MBFF generation approach has been combined with a commercial place-and-route tool, and our experimental results show the superiority of our approach over a prior work that considers only uniform-driving MBFFs, in terms of clock sink count, FF power, clock buffer count, and routed clock wirelength.

DEEP: Developing Extremely Efficient Runtime On-Chip Power Meters

  • Zhiyao Xie
  • Shiyu Li
  • Mingyuan Ma
  • Chen-Chia Chang
  • Jingyu Pan
  • Yiran Chen
  • Jiang Hu

Accurate and efficient on-chip power modeling is crucial to runtime power, energy, and voltage management. Such power monitoring can be achieved by designing and integrating on-chip power meters (OPMs) into the target design. In this work, we propose a new method named DEEP to automatically develop extremely efficient OPM solutions for a given design. DEEP selects OPM inputs from all individual bits in RTL signals. Such bit-level selection provides an unprecedentedly large number of input candidates and supports lower hardware cost, compared with the signal-level selection in prior works. In addition, DEEP proposes a powerful two-step OPM input selection method, and it supports reporting both total power and the power of major design components. Experiments on a commercial microprocessor demonstrate that DEEP’s OPM solution achieves correlation R > 0.97 in per-cycle power prediction with an unprecedentedly low area overhead in hardware, i.e., < 0.1% of the microprocessor layout. This reduces the OPM hardware cost by 4-6× compared with the state-of-the-art solution.
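
As a rough illustration of the bit-level input-selection idea (a sketch, not the DEEP algorithm itself), the snippet below ranks individual RTL signal bits by their correlation with per-cycle power and fits a small linear proxy model; the function names and the correlation-ranking heuristic are assumptions made for this example.

```python
# Illustrative sketch only: selecting individual RTL signal bits as inputs to a
# tiny linear per-cycle power model. The selection heuristic (correlation
# ranking) and all names here are assumptions, not the paper's method.
import numpy as np

def select_bits_and_fit(bit_traces, power, k=16):
    """bit_traces: (cycles, num_bits) 0/1 matrix of per-cycle RTL bit values.
    power: (cycles,) per-cycle power from a reference power analysis.
    Returns the indices of the selected bits and the linear model weights."""
    # Rank candidate bits by absolute Pearson correlation with per-cycle power.
    centered = bit_traces - bit_traces.mean(axis=0)
    p_centered = power - power.mean()
    denom = centered.std(axis=0) * p_centered.std() + 1e-12
    corr = np.abs((centered * p_centered[:, None]).mean(axis=0) / denom)
    selected = np.argsort(corr)[-k:]
    # Fit a least-squares linear model on the selected bits plus a bias term.
    X = np.hstack([bit_traces[:, selected], np.ones((len(power), 1))])
    weights, *_ = np.linalg.lstsq(X, power, rcond=None)
    return selected, weights
```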

SESSION: Energy Efficient Hardware Acceleration and Stochastic Computing

Session details: Energy Efficient Hardware Acceleration and Stochastic Computing

  • Sunil Khatri
  • Anish Krishnakumar

ReD-LUT: Reconfigurable In-DRAM LUTs Enabling Massive Parallel Computation

  • Ranyang Zhou
  • Arman Roohi
  • Durga Misra
  • Shaahin Angizi

In this paper, we propose a reconfigurable processing-in-DRAM architecture named ReD-LUT that leverages the high density of commodity main memory to enable flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication, division, etc.) via memory read operations only. In addition, ReD-LUT enables bulk bit-wise in-memory logic by elevating the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line, beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally intensive applications, i.e., low-precision deep learning acceleration and Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with the GPU and achieves ~37.8× speedup and 2.1× energy efficiency over the best in-DRAM bit-wise accelerators. As for AES data encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.
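
To make the LUT-query idea concrete, the software analogue below answers a low-precision multiplication purely from a precomputed table, mirroring how arithmetic is served by memory reads; the operand width and table layout are illustrative assumptions, not ReD-LUT's in-DRAM organization.

```python
# Software analogue of serving arithmetic via lookup-table reads (illustrative
# assumptions: 4-bit operands, a flat Python table standing in for DRAM rows).
BITS = 4  # 4-bit operands -> a 256-entry product table

# Precompute the table once; in ReD-LUT this content lives inside DRAM sub-arrays.
MUL_LUT = [[a * b for b in range(1 << BITS)] for a in range(1 << BITS)]

def lut_mul(a, b):
    """Multiply two 4-bit operands with a table read instead of arithmetic."""
    return MUL_LUT[a & 0xF][b & 0xF]

assert lut_mul(7, 9) == 63
```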

Sparse-T: Hardware Accelerator Thread for Unstructured Sparse Data Processing

  • Pranathi Vasireddy
  • Krishna Kavi
  • Gayatri Mehta

Sparse matrix-dense vector (SpMV) multiplication is inherent in most scientific computing, neural network, and machine learning algorithms. To efficiently exploit the sparsity of data in SpMV computations, several compressed data representations have been used. However, compressed representations of sparse data incur overheads for locating nonzero values, requiring indirect memory accesses that increase instruction count and memory access delays. We refer to these translations of compressed representations as metadata processing. We propose a memory-side accelerator that performs the metadata (or indexing) computations and supplies only the required nonzero values to the processor, additionally permitting an overlap of indexing with core computations on nonzero elements. We target our accelerator at low-end micro-controllers with very limited memory and processing capabilities. In this paper we explore two dedicated ASIC designs of the proposed accelerator that handle the indexed memory accesses for the compressed sparse row (CSR) format, working alongside a simple RISC-like programmable core. One version of the accelerator supplies only the vector values corresponding to nonzero matrix values, and the second version supplies both the nonzero matrix values and the matching vector values for SpMV computations. Our experiments show speedups ranging between 1.3 and 2.1 times for SpMV at different levels of sparsity. Our accelerator also yields energy savings ranging between 15.8% and 52.7% over different matrix sizes, compared to a baseline system in which the primary RISC-V core performs all computations. We use smaller synthetic matrices with different sparsity levels and larger real-world matrices with higher sparsity (below 1% non-zeros) in our experimental evaluations.
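
For readers unfamiliar with CSR, the reference routine below shows the row-pointer/column-index indirection (the "metadata processing") that Sparse-T offloads to its memory-side accelerator; it is a plain software illustration of the format, not the accelerator itself.

```python
# Reference CSR sparse-matrix / dense-vector product, highlighting the indirect
# accesses (row_ptr and col_idx) that constitute the metadata processing.
def csr_spmv(row_ptr, col_idx, values, x):
    """y = A @ x with A stored in compressed sparse row (CSR) form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(row_ptr) - 1):
        # Indirect accesses: col_idx tells us which entries of x to fetch.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# 3x3 example: [[1,0,2],[0,0,3],[4,5,0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```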

Sound Source Localization Using Stochastic Computing

  • Peter Schober
  • Seyedeh Newsha Estiri
  • Sercan Aygun
  • Nima TaheriNejad
  • M. Hassan Najafi

Stochastic computing (SC) is an alternative computing paradigm that processes data in the form of long uniform bit-streams rather than conventional compact weighted binary numbers. SC is fault-tolerant and can compute on small, efficient circuits, promising advantages over conventional arithmetic for smaller computer chips. SC has been primarily used in scientific research, not in practical applications. Digital sound source localization (SSL) is a useful signal processing technique that locates speakers using multiple microphones in cell phones, laptops, and other voice-controlled devices. SC has not been integrated into SSL in practice or theory. In this work, for the first time to the best of our knowledge, we implement an SSL algorithm in the stochastic domain and develop a functional SC-based sound source localizer. The developed design can replace the conventional design of the algorithm. The practical part of this work shows that the proposed stochastic circuit does not rely on conventional analog-to-digital conversion and can process data in the form of pulse-width-modulated (PWM) signals. The proposed SC design consumes up to 39% less area than the conventional baseline design. The SC-based design can consume less power depending on the computational accuracy, for example, 6% less power consumption for 3-bit inputs. The presented stochastic circuit is not limited to SSL and is readily applicable to other practical applications such as radar ranging, wireless location, sonar direction finding, beamforming, and sensor calibration.

SESSION: Special Session: Approximate Computing and the Efficient Machine Learning Expedition

Session details: Special Session: Approximate Computing and the Efficient Machine Learning Expedition

  • Mehdi Tahoori

Approximate Computing and the Efficient Machine Learning Expedition

  • Jörg Henkel
  • Hai Li
  • Anand Raghunathan
  • Mehdi B. Tahoori
  • Swagath Venkataramani
  • Xiaoxuan Yang
  • Georgios Zervakis

Approximate computing (AxC) has long been accepted as a design alternative for efficient system implementation at the cost of relaxed accuracy requirements. Despite AxC research activities in various application domains, AxC has thrived over the past decade when applied to Machine Learning (ML). The inherently approximate nature of ML models, together with the increased computational overheads associated with ML applications (which are effectively mitigated by corresponding approximations), led to a perfect match and a fruitful synergy. AxC for AI/ML has transcended academic prototypes. In this work, we highlight the synergistic nature of AxC and ML and elucidate the impact of AxC in designing efficient ML systems. To that end, we present an overview and taxonomy of AxC for ML and use two descriptive application scenarios to demonstrate how AxC boosts the efficiency of ML systems.

SESSION: Co-Search Methods and Tools

Session details: Co-Search Methods and Tools

  • Cunxi Yu
  • Yingyan “Celine” Lin

ObfuNAS: A Neural Architecture Search-Based DNN Obfuscation Approach

  • Tong Zhou
  • Shaolei Ren
  • Xiaolin Xu

Malicious architecture extraction has been emerging as a crucial concern for deep neural network (DNN) security. As a defense, architecture obfuscation is proposed to remap the victim DNN to a different architecture. Nonetheless, we observe that, with only extracting an obfuscated DNN architecture, the adversary can still retrain a substitute model with high performance (e.g., accuracy), rendering the obfuscation techniques ineffective. To mitigate this under-explored vulnerability, we propose ObfuNAS, which converts the DNN architecture obfuscation into a neural architecture search (NAS) problem. Using a combination of function-preserving obfuscation strategies, ObfuNAS ensures that the obfuscated DNN architecture can only achieve lower accuracy than the victim. We validate the performance of ObfuNAS with open-source architecture datasets like NAS-Bench-101 and NAS-Bench-301. The experimental results demonstrate that ObfuNAS can successfully find the optimal mask for a victim model within a given FLOPs constraint, leading up to 2.6% inference accuracy degradation for attackers with only 0.14× FLOPs overhead. The code is available at: https://github.com/Tongzhou0101/ObfuNAS.

Deep Learning Toolkit-Accelerated Analytical Co-Optimization of CNN Hardware and Dataflow

  • Rongjian Liang
  • Jianfeng Song
  • Bo Yuan
  • Jiang Hu

The continuous growth of CNN complexity not only intensifies the need for hardware acceleration but also presents a huge challenge. That is, the solution space for CNN hardware design and dataflow mapping becomes enormously large, in addition to being discrete and lacking a well-behaved structure. Most previous works are either stochastic metaheuristics, such as genetic algorithms, which are typically very slow for solving large problems, or rely on expensive sampling, e.g., Gumbel-Softmax-based differentiable optimization and Bayesian optimization. We propose an analytical model for evaluating the power and performance of CNN hardware design and dataflow solutions. Based on this model, we introduce a co-optimization method consisting of nonlinear programming and parallel local search. A key innovation in this model is its matrix form, which enables the use of deep learning toolkits for highly efficient computation of power/performance values and gradients in the optimization. In handling the power-performance tradeoff, our method can lead to better solutions than minimizing a weighted sum of power and latency. The average relative error of our model compared with Timeloop is as small as 1%. Compared to state-of-the-art methods, our approach achieves solutions with up to 1.7× shorter inference latency, 37.5% less power consumption, and 3× less area on ResNet-18. Moreover, it provides a 6.2× speedup in optimization runtime.
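
A minimal sketch of the matrix-form idea, assuming a toy latency/power model rather than the paper's actual one: once the analytical model is expressed in tensor operations, a deep learning toolkit supplies the gradients for free, and a gradient-based optimizer can co-tune a continuous relaxation of the design knobs.

```python
# Toy illustration (assumed model, not the paper's): an analytical cost model
# written with tensor ops so autograd provides gradients for optimization.
import torch

# Continuous relaxation of design knobs, e.g., PE-array dims and buffer size.
knobs = torch.tensor([32.0, 32.0, 1024.0], requires_grad=True)

def cost(knobs, alpha=0.5):
    pe_x, pe_y, buf = knobs
    macs = 1.0e9                                    # workload MACs (toy constant)
    latency = macs / (pe_x * pe_y) + 1.0e3 / buf    # compute + buffer-refill terms
    power = 0.01 * pe_x * pe_y + 1.0e-4 * buf       # array + buffer power terms
    return alpha * latency + (1 - alpha) * power

opt = torch.optim.Adam([knobs], lr=1.0)
for _ in range(200):                                # gradient-based co-optimization
    opt.zero_grad()
    cost(knobs).backward()
    opt.step()
    knobs.data.clamp_(min=1.0)                      # keep knobs physically valid
print(knobs.detach())
```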

HDTorch: Accelerating Hyperdimensional Computing with GP-GPUs for Design Space Exploration

  • William Andrew Simon
  • Una Pale
  • Tomas Teijeiro
  • David Atienza

The HyperDimensional Computing (HDC) Machine Learning (ML) paradigm is highly interesting for applications involving continuous, semi-supervised learning for long-term monitoring. However, its accuracy is not yet on par with other ML approaches, necessitating frameworks enabling fast HDC algorithm design space exploration. To this end, we introduce HDTorch, an open-source, PyTorch-based HDC library with CUDA extensions for hypervector operations. We demonstrate HDTorch’s utility by analyzing four HDC benchmark datasets in terms of accuracy, runtime, and memory consumption, utilizing both classical and online HD training methodologies. We demonstrate average (training)/inference speedups of (111x/68x)/87x for classical/online HD, respectively. We also demonstrate how HDTorch enables exploration of HDC strategies applied to large, real-world datasets. We perform the first-ever HD training and inference analysis of the entirety of the CHB-MIT EEG epilepsy database. Results show that the typical approach of training on a subset of the data may not generalize to the entire dataset, an important factor when developing future HD models for medical wearable devices.
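
For context, the snippet below implements basic bipolar hypervector primitives (binding, bundling, similarity) of the kind that HDTorch accelerates with its CUDA extensions; the function names are illustrative and are not HDTorch's actual API.

```python
# Minimal bipolar hyperdimensional-computing primitives in PyTorch
# (illustrative names; not HDTorch's API).
import torch

D = 10_000  # hypervector dimensionality

def random_hv(n=1):
    return torch.randint(0, 2, (n, D), dtype=torch.int8) * 2 - 1  # entries in {-1, +1}

def bind(a, b):
    return a * b                        # element-wise multiply = binding

def bundle(hvs):
    return torch.sign(hvs.sum(dim=0))   # element-wise majority = bundling

def similarity(a, b):
    return (a.float() * b.float()).sum() / D  # normalized dot product

# Classify a noisy copy of one of two random "class" hypervectors.
class_a, class_b = random_hv(), random_hv()
query = class_a.clone()
flip = torch.rand(1, D) < 0.2
query[flip] = -query[flip]              # 20% random sign flips
print(similarity(query, class_a), similarity(query, class_b))
```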

SESSION: Reconfigurable Computing: Accelerators and Methodologies II

Session details: Reconfigurable Computing: Accelerators and Methodologies II

  • Peipei Zhou

DARL: Distributed Reconfigurable Accelerator for Hyperdimensional Reinforcement Learning

  • Hanning Chen
  • Mariam Issa
  • Yang Ni
  • Mohsen Imani

Reinforcement Learning (RL) is a powerful technology for solving decision-making problems such as robotics control. Modern RL algorithms, e.g., Deep Q-Learning, are based on costly and resource-hungry deep neural networks. This motivates us to deploy alternative models for powering RL agents on edge devices. Recently, brain-inspired Hyper-Dimensional Computing (HDC) has been introduced as a promising solution for lightweight and efficient machine learning, particularly for classification.

In this work, we develop a novel platform capable of real-time hyperdimensional reinforcement learning. Our heterogeneous CPU-FPGA platform, called DARL, maximizes the FPGA’s computing capabilities by applying hardware optimizations to hyperdimensional computing’s critical operations, including a hardware-friendly encoder IP, hypervector chunk fragmentation, and delayed model update. Aside from the hardware innovation, we also extend the platform beyond basic single-agent RL to support multi-agent distributed learning. We evaluate the effectiveness of our approach on OpenAI Gym tasks. Our results show that the FPGA platform provides on average 20× speedup compared to current state-of-the-art hyperdimensional RL methods running on an Intel Xeon 6226 CPU. In addition, DARL provides around 4.8× faster learning and 4.2× higher energy efficiency compared to the state-of-the-art RL accelerator while ensuring a better or comparable quality of learning.

Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping

  • Carl-Johannes Johnsen
  • Tiziano De Matteis
  • Tal Ben-Nun
  • Johannes de Fine Licht
  • Torsten Hoefler

The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed at high levels of abstraction, such as HLS, requiring the use of hand-optimized RTL. In this paper we show how to leverage multiple clock domains for computational subdomains on reconfigurable devices through data movement analysis on high-level programs. We offer a novel view on multi-pumping as a compiler optimization: a superclass of traditional vectorization. As multiple data elements are fed and consumed, the computations are packed temporally rather than spatially. The optimization is applied automatically using an intermediate representation that maps high-level code to HLS. Internally, the optimization injects modules into the generated designs, incorporating RTL for fine-grained control over the clock domains. We obtain a reduction of resource consumption by up to 50% on critical components and 23% on average. For scalable designs, this can enable further parallelism, increasing overall performance.

SESSION: Compute-in-Memory for Neural Networks

Session details: Compute-in-Memory for Neural Networks

  • Bo Yuan

ISSA: Input-Skippable, Set-Associative Computing-in-Memory (SA-CIM) Architecture for Neural Network Accelerators

  • Yun-Chen Lo
  • Chih-Chen Yeh
  • Jun-Shen Wu
  • Chia-Chun Wang
  • Yu-Chih Tsai
  • Wen-Chien Ting
  • Ren-Shuo Liu

Among several emerging architectures, computing in memory (CIM), which features in-situ analog computation, is a potential solution to the data movement bottleneck of the Von Neumann architecture for artificial intelligence (AI). Interestingly, other strengths of CIM, quite different from in-situ analog computation, are not yet widely known. In this work, we point out that mutually stationary vectors (MSVs), which can be maximized by introducing associativity to CIM, are another inherent advantage unique to CIM. With MSVs, CIM gains significant freedom to dynamically vectorize the stored data (e.g., weights) and to perform agile computation using the dynamically formed vectors.

We have designed and realized an SA-CIM silicon prototype and corresponding architecture and acceleration schemes in the TSMC 28 nm process. More specifically, the contributions of this paper are fourfold: 1) We identify MSVs as new features that can be exploited to improve the current performance and energy challenges of the CIM-based hardware. 2) We propose SA-CIM to enhance MSVs for skipping the zeros, small values, and sparse vectors. 3) We propose a transposed systolic dataflow to efficiently conduct conv3×3 while being capable of exploiting input-skipping schemes. 4) We propose a design flow to search for optimal aggressive skipping scheme setups while satisfying the accuracy loss constraint.

The proposed ISSA architecture improves throughput by 1.91× to 2.97× and energy efficiency by 2.5× to 4.2×.

Computing-In-Memory Neural Network Accelerators for Safety-Critical Systems: Can Small Device Variations Be Disastrous?

  • Zheyu Yan
  • Xiaobo Sharon Hu
  • Yiyu Shi

Computing-in-Memory (CiM) architectures based on emerging nonvolatile memory (NVM) devices have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, NVM devices suffer from various non-idealities, especially device-to-device variations due to fabrication defects and cycle-to-cycle variations due to the stochastic behavior of devices. As such, the DNN weights actually mapped to NVM devices could deviate significantly from the expected values, leading to large performance degradation. To address this issue, most existing works focus on maximizing average performance under device variations. This objective works well for general-purpose scenarios, but for safety-critical applications the worst-case performance must also be considered, which has rarely been explored in the literature. In this work, we formulate the problem of determining the worst-case performance of CiM DNN accelerators under the impact of device variations. We further propose a method to effectively find the specific combination of device variations in the high-dimensional space that leads to the worst-case performance. We find that even with very small device variations, the accuracy of a DNN can drop drastically, causing concerns when deploying CiM accelerators in safety-critical applications. Finally, we show that, surprisingly, none of the existing methods used to enhance average DNN performance in CiM accelerators is very effective when extended to enhance the worst-case performance, and further research is needed to address this problem.
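
One way to picture the worst-case search (a simplified stand-in, not necessarily the paper's method) is to treat the bounded device-variation vector as an adversarial perturbation on the mapped weights and to climb the loss gradient while projecting back into the variation bound, as in the toy sketch below.

```python
# Toy gradient-ascent search for a bounded multiplicative weight perturbation
# that maximizes the loss of a small linear classifier. Model, bounds, and
# hyperparameters are illustrative assumptions.
import torch

def worst_case_variation(weight, x, target, sigma=0.01, steps=50, lr=0.02):
    delta = torch.zeros_like(weight, requires_grad=True)
    for _ in range(steps):
        logits = x @ (weight * (1 + delta))          # apply candidate variation
        loss = torch.nn.functional.cross_entropy(logits, target)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad.sign()                # ascend the loss
            delta.clamp_(-3 * sigma, 3 * sigma)      # stay within the variation bound
    return delta.detach()

# Random toy data; in practice the weights come from the DNN mapped onto CiM arrays.
W = torch.randn(8, 4)
x, target = torch.randn(32, 8), torch.randint(0, 4, (32,))
print(worst_case_variation(W, x, target).abs().max())
```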

SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist II

Session details: Breakthroughs in Synthesis – Infrastructure and ML Assist II

  • Sunil Khatri
  • Cunxi Yu

Language Equation Solving via Boolean Automata Manipulation

  • Wan-Hsuan Lin
  • Chia-Hsuan Su
  • Jie-Hong R. Jiang

Language equations are a powerful tool for compositional synthesis, modeled as the unknown component problem. Given a (sequential) system specification S and a fixed component F, we are asked to synthesize an unknown component X whose composition with F fulfills S. The synthesis of X can be formulated as language equation solving. Although prior work exploits partitioned representations for effective finite automata manipulation, it remains challenging to solve language equations involving a large number of states. In this work, we propose variants of Boolean automata as the underlying succinct representation for regular languages. They admit logic circuit manipulation and extend the scalability of language equation solving. Experimental results demonstrate the superiority of our method over the state-of-the-art, solving nine more cases out of the 36 studied benchmarks and achieving an average 740× speedup.

How Good Is Your Verilog RTL Code?: A Quick Answer from Machine Learning

  • Prianka Sengupta
  • Aakash Tyagi
  • Yiran Chen
  • Jiang Hu

Hardware Description Language (HDL) is a common entry point for designing digital circuits. Differences in HDL coding styles and design choices may lead to considerably different design quality and performance-power tradeoffs. In general, the impact of HDL coding is not clear until logic synthesis or even layout is completed. However, running synthesis merely as feedback for HDL code is not computationally economical, especially in early design phases when the code needs to be frequently modified. Furthermore, in late stages of design convergence, burdened with high-impact engineering change orders (ECOs), design iterations become prohibitively expensive. To this end, we propose a machine learning approach to Verilog-based Register-Transfer Level (RTL) design assessment without going through the synthesis process. It allows designers to quickly evaluate the performance-power tradeoff among different RTL design options. Experimental results show that our proposed technique achieves an average of 95% prediction accuracy in terms of post-placement analysis and is six orders of magnitude faster than evaluation by running logic synthesis and placement.

SESSION: In-Memory Computing Revisited

Session details: In-Memory Computing Revisited

  • Biresh Kumar Joardar
  • Ulf Schlichtmann

Logic Synthesis for Digital In-Memory Computing

  • Muhammad Rashedul Haq Rashed
  • Sumit Kumar Jha
  • Rickard Ewetz

Processing in-memory is a promising solution strategy for accelerating data-intensive applications. While analog in-memory computing is extremely efficient, its limited precision is only acceptable for approximate computing applications. Digital in-memory computing provides the deterministic precision required to accelerate high-assurance applications. State-of-the-art digital in-memory computing schemes rely on manually decomposing arithmetic operations into in-memory compute kernels. In contrast, traditional digital circuits are synthesized using complex and automated design flows. In this paper, we propose a logic synthesis framework called LOGIC for mapping high-level applications into digital in-memory compute kernels that can be executed using non-volatile memory. We first propose techniques to decompose element-wise arithmetic operations into in-memory kernels while minimizing the number of in-memory operations. Next, the sequence of in-memory operations is optimized to minimize non-volatile memory utilization. Lastly, data layout re-organization is used to efficiently accelerate applications dominated by sparse matrix-vector multiplication operations. The experimental evaluations show that the proposed synthesis approach improves the area and latency of fixed-point multiplication by 77% and 20%, respectively, over the state-of-the-art. On scientific computing applications from the SuiteSparse Matrix Collection, the proposed design improves the area, latency, and energy by 3.6×, 2.6×, and 8.3×, respectively.

Design Space and Memory Technology Co-Exploration for In-Memory Computing Based Machine Learning Accelerators

  • Kang He
  • Indranil Chakraborty
  • Cheng Wang
  • Kaushik Roy

In-Memory Computing (IMC) has become a promising paradigm for accelerating machine learning (ML) inference. While IMC architectures built on various memory technologies have demonstrated higher throughput and energy efficiency than conventional digital architectures, little research has been done from a system-level perspective to provide comprehensive and fair comparisons of different memory technologies under the same hardware budget (area). Since large-scale analog IMC hardware relies on costly analog-to-digital converters (ADCs) for robust digital communication, optimizing IMC architecture performance requires synergistic co-design of memory arrays and peripheral ADCs, wherein the trade-offs depend on the underlying memory technologies. To that effect, we co-explore the IMC macro design space and memory technology to identify the best design point for each memory type under iso-area budgets, aiming to make fair comparisons among different technologies, including SRAM, phase change memory, resistive RAM, ferroelectrics, and spintronics. First, an extended simulation framework employing a spatial architecture with off-chip DRAM is developed, capable of integrating both CMOS and nonvolatile memory technologies. Subsequently, we propose different modes of ADC operation with distinctive weight mapping schemes to cope with different on-chip area budgets. Our results show that under an iso-area budget, the various memory technologies being evaluated need to adopt different IMC macro-level designs to deliver the optimal energy-delay product (EDP) at the system level. We demonstrate that under small area budgets, the choice of the best memory technology is determined by its cell area and writing energy, whereas for larger area budgets, cell area becomes the dominant factor for technology selection.

SESSION: Special Session: 2022 CAD Contest at ICCAD

Session details: Special Session: 2022 CAD Contest at ICCAD

  • Yu-Guang Chen

Overview of 2022 CAD Contest at ICCAD

  • Yu-Guang Chen
  • Chun-Yao Wang
  • Tsung-Wei Huang
  • Takashi Sato

The “CAD Contest at ICCAD” is a challenging, multi-month research and development competition focusing on advanced, real-world problems in the field of electronic design automation (EDA). Since 2012, the contest has published many sophisticated circuit design problems, from system-level design to physical design, together with industrial benchmarks and solution evaluators. Contestants can participate in one or more problems provided by the EDA/IC industry. The winners are awarded at an ICCAD special session dedicated to the contest. Every year, the contest attracts more than a hundred teams, fosters productive industry-academia collaborations, and leads to hundreds of publications in top-tier conferences and journals. The 2022 CAD Contest has 166 teams from all over the world. Moreover, this year’s problems cover state-of-the-art EDA research trends, such as circuit security, 3D-IC, and design space exploration, and come from well-known EDA/IC companies. We believe the contest keeps enhancing its impact and boosting EDA research.

2022 CAD Contest Problem A: Learning Arithmetic Operations from Gate-Level Circuit

  • Chung-Han Chou
  • Chih-Jen (Jacky) Hsu
  • Chi-An (Rocky) Wu
  • Kuan-Hua Tu

Extracting circuit functionality from a gate-level netlist is critical in CAD tools. For security, it helps designers detect hardware Trojans or malicious design changes in netlists built with third-party resources such as fabrication services and soft/hard IP cores. For verification, it can reduce the complexity and effort of preserving design information under the aggressive optimization strategies adopted by synthesis tools. For Engineering Change Order (ECO), it can save the designer from having to locate the ECO gates in a sea of bit-level gates.

In this contest, we formulated a datapath learning and extraction problem. With a set of benchmarks and an evaluation metric, we expect contestants to develop a tool to learn the arithmetic equations from a synthesized gate-level netlist.

2022 ICCAD CAD Contest Problem B: 3D Placement with D2D Vertical Connections

  • Kai-Shun Hu
  • I-Jye Lin
  • Yu-Hui Huang
  • Hao-Yu Chi
  • Yi-Hsuan Wu
  • Chin-Fang Cindy Shen

In the chiplet era, benefits from multiple factors can be obtained by splitting a large single die into multiple small dies. Using multiple small dies with die-to-die (D2D) vertical connections brings benefits including: 1) better yield, 2) better timing/performance, and 3) better cost. How to perform the netlist partitioning and the cell placement in each of the small dies, and how to determine the locations of the D2D interconnection terminals, becomes a new topic.

To address this chiplet-era physical implementation problem, the ICCAD-2022 contest encourages research into techniques for multi-die netlist partitioning and placement with D2D vertical connections. We provided (i) a set of benchmarks and (ii) an evaluation metric to help contestants develop, test, and evaluate their new algorithms.

2022 ICCAD CAD Contest Problem C: Microarchitecture Design Space Exploration

  • Sicheng Li
  • Chen Bai
  • Xuechao Wei
  • Bizhao Shi
  • Yen-Kuang Chen
  • Yuan Xie

It is vital to select microarchitectures that achieve good trade-offs between performance, power, and area in the chip development cycle. Combining high-level hardware description languages with the optimization of electronic design automation tools empowers microarchitecture exploration at the circuit level. Due to the extremely large design space and the high runtime cost of evaluating a microarchitecture, ICCAD 2022 CAD Contest Problem C calls for an effective design space exploration algorithm to solve the problem. We formulate the research topic as a contest problem and provide benchmark suites, contest benchmark platforms, etc., for all contestants to innovate and evaluate their algorithms.

IEEE CEDA DATC: Expanding Research Foundations for IC Physical Design and ML-Enabled EDA

  • Jinwook Jung
  • Andrew B. Kahng
  • Ravi Varadarajan
  • Zhiang Wang

This paper describes new elements in the RDF-2022 release of the DATC Robust Design Flow, along with other activities of the IEEE CEDA DATC. The RosettaStone initiated with RDF-2021 has been augmented to include 35 benchmarks and four open-source technologies (ASAP7, NanGate45 and SkyWater130HS/HD), plus timing-sensible versions created using path-cutting. The Hier-RTLMP macro placer is now part of DATC RDF, enabling macro placement for large modern designs with hundreds of macros. To establish a clear baseline for macro placers, new open-source benchmark suites on open PDKs, with corresponding flows for fully reproducible results, are provided. METRICS2.1 infrastructure in OpenROAD and OpenROAD-flow-scripts now uses native JSON metrics reporting, which is more robust and general than the previous Python script-based method. Calibrations on open enablements have also seen notable updates in the RDF. Finally, we also describe an approach to establishing a generic, cloud-native large-scale design of experiments for ML-enabled EDA. Our paper closes with future research directions related to DATC’s efforts.

SESSION: Architectures and Methodologies for Advanced Hardware Security

Session details: Architectures and Methodologies for Advanced Hardware Security

  • Amin Rezaei
  • Gang Qu

Inhale: Enabling High-Performance and Energy-Efficient In-SRAM Cryptographic Hash for IoT

  • Jingyao Zhang
  • Elaheh Sadredini

In the age of big data, information security has become a major issue of debate, especially with the rise of the Internet of Things (IoT), where attackers can effortlessly obtain physical access to edge devices. The hash algorithm is the current foundation for data integrity and authentication. However, it is challenging to provide a high-performance, high-throughput, and energy-efficient solution on resource-constrained edge devices. In this paper, we propose Inhale, an in-SRAM architecture to effectively compute hash algorithms with innovative data alignment and efficient read/write strategies to implicitly execute data shift operations through the in-situ controller. We present two variations of Inhale: Inhale-Opt, which is optimized for latency, throughput, and area-overhead; and Inhale-Flex, which offers flexibility in repurposing a part of last-level caches for hash computation. We thoroughly evaluate our proposed architectures on both SRAM and ReRAM memories and compare them with the state-of-the-art in-memory and ASIC accelerators. Our performance evaluation confirms that Inhale can achieve 1.4× – 14.5× higher throughput-per-area and about two-orders-of-magnitude higher throughput-per-area-per-energy compared to the state-of-the-art solutions.

Accelerating N-Bit Operations over TFHE on Commodity CPU-FPGA

  • Kevin Nam
  • Hyunyoung Oh
  • Hyungon Moon
  • Yunheung Paek

TFHE is a fully homomorphic encryption (FHE) scheme that evaluates Boolean gates, which we will hereafter call Tgates, over encrypted data. TFHE is considered to have higher expressive power than many existing schemes in that it can compute not only N-bit arithmetic operations but also logical/relational ones, as arbitrary arithmetic/logical/relational (ALR) operations can be represented by Tgate circuits. Despite this strength, TFHE, like all other schemes, suffers from colossal computational overhead. Incessant efforts to reduce this overhead have been made by exploiting the inherent parallelism of FHE operations on ciphertexts. Unlike other FHE schemes, the parallelism of TFHE can be decomposed into multiple layers: one inside each FHE operation (equivalent to a single Tgate) and the other between Tgates. Unfortunately, previous works focused only on exploiting the parallelism inside a Tgate. However, as each N-bit operation over TFHE corresponds to a Tgate circuit constructed from multiple Tgates, it is also necessary to utilize the parallelism between Tgates to optimize an entire operation. This paper proposes an acceleration technique that maximizes the performance of a TFHE N-bit operation by simultaneously utilizing both layers of parallelism comprising the operation. To fully profit from both layers of parallelism, we have implemented our technique on a commodity CPU-FPGA hybrid machine with parallel execution capabilities in hardware. Our implementation outperforms prior ones by 2.43× in throughput and 12.19× in throughput per watt when performing N-bit operations under 128-bit quantum security parameters.
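
The "parallelism between Tgates" layer can be pictured by levelizing a Tgate circuit: gates in the same topological level have no data dependence and can be launched concurrently, while each gate still exposes its own internal FHE parallelism. The circuit representation below is an illustrative assumption, not the paper's scheduler.

```python
# Group a Boolean (Tgate) circuit into topological levels; gates within one
# level can run in parallel. The dict-based circuit encoding is illustrative.
from collections import defaultdict

def schedule_levels(gates):
    """gates: dict gate_id -> list of input ids (other gates or primary inputs).
    Returns a list of levels; each level is a set of gates runnable in parallel."""
    level = {}
    def depth(g):
        if g not in gates:                 # primary input
            return -1
        if g not in level:
            level[g] = 1 + max((depth(i) for i in gates[g]), default=-1)
        return level[g]
    for g in gates:
        depth(g)
    buckets = defaultdict(set)
    for g, d in level.items():
        buckets[d].add(g)
    return [buckets[d] for d in sorted(buckets)]

# 2-bit ripple-carry fragment: s0/c0 depend only on inputs; s1/c1 also need c0.
gates = {"s0": ["a0", "b0"], "c0": ["a0", "b0"],
         "s1": ["a1", "b1", "c0"], "c1": ["a1", "b1", "c0"]}
print(schedule_levels(gates))  # e.g. [{'s0', 'c0'}, {'s1', 'c1'}]
```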

Fast and Compact Interleaved Modular Multiplication Based on Carry Save Addition

  • Oleg Mazonka
  • Eduardo Chielle
  • Deepraj Soni
  • Michail Maniatakos

Improving fully homomorphic encryption computation by designing specialized hardware is an active topic of research. The most prominent encryption schemes operate on long polynomials requiring many concurrent modular multiplications of very big numbers. Thus, it is crucial to use many small and efficient multipliers. Interleaved and Montgomery iterative multipliers are the best candidates for the task. Interleaved designs, however, suffer from longer latency as they require a number comparison within each iteration; Montgomery designs, on the other hand, need extra conversion of the operands or the result. In this work, we propose a novel hardware design that combines the best of both worlds: Exhibiting the carry save addition of Montgomery designs without the need for any domain conversions. Experimental results demonstrate improved latency-area product efficiency by up to 47% when compared to the standard Interleaved multiplier for large arithmetic word sizes.
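
For reference, the textbook interleaved (shift-add) modular multiplication below makes the per-iteration trial subtractions explicit; it is exactly these comparisons that the proposed carry-save formulation avoids, and the carry-save datapath itself is not reproduced here.

```python
# Textbook MSB-first interleaved modular multiplication with trial subtractions.
def interleaved_modmul(x, y, m, nbits):
    """Return (x * y) mod m, assuming 0 <= x, y < m < 2**nbits."""
    p = 0
    for i in reversed(range(nbits)):
        p <<= 1                      # shift the partial product
        if (y >> i) & 1:
            p += x                   # conditionally add the multiplicand
        # Up to two reductions keep p < m; these magnitude comparisons are the
        # per-iteration latency bottleneck that carry-save designs remove.
        if p >= m:
            p -= m
        if p >= m:
            p -= m
    return p

m = 2**31 - 1
assert interleaved_modmul(123456789, 987654321, m, 31) == (123456789 * 987654321) % m
```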

Accelerating Fully Homomorphic Encryption by Bridging Modular and Bit-Level Arithmetic

  • Eduardo Chielle
  • Oleg Mazonka
  • Homer Gamil
  • Michail Maniatakos

The dramatic increase of data breaches in modern computing platforms has emphasized that access control is not sufficient to protect sensitive user data. Recent advances in cryptography allow end-to-end processing of encrypted data without the need for decryption using Fully Homomorphic Encryption (FHE). Such computation, however, is still orders of magnitude slower than direct (unencrypted) computation. Depending on the underlying cryptographic scheme, FHE schemes can work natively either at bit-level using Boolean circuits, or over integers using modular arithmetic. Operations on integers are limited to addition/subtraction and multiplication. On the other hand, bit-level arithmetic is much more comprehensive, allowing more operations such as comparison and division. While modular arithmetic can emulate bit-level computation, there is a significant cost in performance. In this work, we propose a novel method, dubbed bridging, that blends faster but restricted modular computation with slower but comprehensive bit-level computation, making them both usable within the same application and with the same cryptographic scheme instantiation. We introduce and open-source C++ types representing the two distinct arithmetic modes, offering the possibility to convert from one to the other. Experimental results show that bridging modular and bit-level arithmetic computation can lead to 1-2 orders of magnitude performance improvement for tested synthetic benchmarks, as well as one real-world FHE application: a genotype imputation case study.

SESSION: Special Session: The Dawn of Domain-Specific Hardware Accelerators for Robotic Computing

Session details: Special Session: The Dawn of Domain-Specific Hardware Accelerators for Robotic Computing

  • Jiang Hu

A Reconfigurable Hardware Library for Robot Scene Perception

  • Yanqi Liu
  • Anthony Opipari
  • Odest Chadwicke Jenkins
  • R. Iris Bahar

Perceiving the position and orientation of objects (i.e., pose estimation) is a crucial prerequisite for robots acting within their natural environment. We present a hardware acceleration approach to enable real-time and energy-efficient articulated pose estimation for robots operating in unstructured environments. Our hardware accelerator implements Nonparametric Belief Propagation (NBP) to infer the belief distribution of articulated object poses. Our approach is, on average, 26× more energy-efficient than a high-end GPU and 11× faster than an embedded low-power GPU implementation. Moreover, we present a Monte-Carlo Perception Library generated from high-level synthesis to enable reconfigurable hardware designs on FPGA fabrics that are better tuned to user-specified scene, resource, and performance constraints.

Analyzing and Improving Resilience and Robustness of Autonomous Systems

  • Zishen Wan
  • Karthik Swaminathan
  • Pin-Yu Chen
  • Nandhini Chandramoorthy
  • Arijit Raychowdhury

Autonomous systems have reached a tipping point, with a myriad of self-driving cars, unmanned aerial vehicles (UAVs), and robots being widely applied and revolutionizing new applications. The continuous deployment of autonomous systems reveals the need for designs that facilitate increased resiliency and safety. The ability of an autonomous system to tolerate or mitigate errors, such as environmental conditions, sensor, hardware, and software faults, and adversarial attacks, is essential to ensure its functional safety. Application-aware resilience metrics, holistic fault analysis frameworks, and lightweight fault mitigation techniques are being proposed for accurate and effective resilience and robustness assessment and improvement. This paper explores the origins of fault sources across the computing stack of autonomous systems, discusses the fault impacts and fault mitigation techniques at different scales of autonomous systems, and concludes with challenges and opportunities for assessing and building next-generation resilient and robust autonomous systems.

Factor Graph Accelerator for LiDAR-Inertial Odometry (Invited Paper)

  • Yuhui Hao
  • Bo Yu
  • Qiang Liu
  • Shaoshan Liu
  • Yuhao Zhu

A factor graph is a graph representing the factorization of a probability distribution function and has been utilized in many autonomous machine computing tasks, such as localization, tracking, planning, and control. We are developing an architecture with the goal of using the factor graph as a common abstraction for most, if not all, autonomous machine computing tasks. If successful, the architecture would provide a very simple interface for mapping autonomous machine functions to the underlying compute hardware. As a first step in this direction, this paper presents our most recent work on developing a factor graph accelerator for LiDAR-Inertial Odometry (LIO), an essential task in many autonomous machines, such as autonomous vehicles and mobile robots. By modeling LIO as a factor graph, the proposed accelerator not only supports multi-sensor fusion of LiDAR, inertial measurement unit (IMU), GPS, etc., but also solves the global optimization problem of robot navigation in batch or incremental modes. Our evaluation demonstrates that the proposed design significantly improves the real-time performance and energy efficiency of autonomous machine navigation systems. This initial success suggests the potential of generalizing the factor graph architecture as a common abstraction for autonomous machine computing, including tracking, planning, and control.

Hardware Architecture of Graph Neural Network-Enabled Motion Planner (Invited Paper)

  • Lingyi Huang
  • Xiao Zang
  • Yu Gong
  • Bo Yuan

Motion planning aims to find a collision-free trajectory from the start to goal configurations of a robot. As a key cognition task for all the autonomous machines, motion planning is fundamentally required in various real-world robotic applications, such as 2-D/3-D autonomous navigation of unmanned mobile and aerial vehicles and high degree-of-freedom (DoF) autonomous manipulation of industry/medical robot arms and graspers.

Motion planning can be performed using either non-learning-based classical algorithms or learning-based neural approaches. Most recently, the powerful capabilities of deep neural networks (DNNs) have made neural planners very attractive because of their superior planning performance over classical methods. In particular, the graph neural network (GNN)-enabled motion planner has demonstrated state-of-the-art performance across a set of challenging high-dimensional planning tasks, motivating efficient hardware acceleration to fully unleash its potential and promote its widespread deployment in practical applications.

To that end, in this paper we perform a preliminary study of efficient accelerator design for the GNN-based neural planner, especially for the neural explorer as the key component of the entire planning pipeline. By performing an in-depth analysis of the different design choices, we identify that a hybrid architecture, instead of the uniform sparse matrix multiplication (SpMM)-based solution popularly adopted in existing GNN hardware, is more suitable for our target neural explorer. With a set of optimizations on microarchitecture and dataflow, several design challenges incurred by the hybrid architecture, such as extensive memory access and imbalanced workload, can be efficiently mitigated. Evaluation results show that our proposed customized hardware architecture achieves order-of-magnitude performance improvement over CPU/GPU-based implementations with respect to area and energy efficiency in various working environments.

SESSION: From Logical to Physical Qubits: New Models and Techniques for Mapping

Session details: From Logical to Physical Qubits: New Models and Techniques for Mapping

  • Weiwen Jiang

A Robust Quantum Layout Synthesis Algorithm with a Qubit Mapping Checker

  • Tsou-An Wu
  • Yun-Jhe Jiang
  • Shao-Yun Fang

Layout synthesis in quantum circuits maps the logical qubits of a synthesized circuit onto the physical qubits of a hardware device (coupling graph) and complies with the hardware limitations. Existing studies on the problem usually suffer from intractable formulation complexity and thus prohibitively long runtimes. In this paper, we propose an efficient layout synthesizer by developing a satisfiability modulo theories (SMT)-based qubit mapping checker. The proposed qubit mapping checker can efficiently derive a SWAP-free solution if one exists. If no SWAP-free solution exists for a circuit, we propose a divide-and-conquer scheme that utilizes the checker to find SWAP-free sub-solutions for sub-circuits, and the overall solution is found by merging sub-solutions with SWAP insertion. Experimental results show that the proposed optimization flow can achieve more than 3000× runtime speedup over a state-of-the-art work to derive optimal solutions for a set of SWAP-free circuits. Moreover, for the other set of benchmark circuits requiring SWAP gates, our flow achieves more than 800× speedup and obtains near-optimal solutions with only 3% SWAP overhead.
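
In the spirit of the SMT-based checker (a simplified sketch, not the paper's exact encoding), one can ask an SMT solver whether a single static logical-to-physical mapping exists under which every two-qubit gate already acts on coupled physical qubits, i.e., whether a SWAP-free solution exists; the sketch below uses the z3 Python bindings.

```python
# Simplified SWAP-free feasibility check with z3 (illustrative encoding).
from z3 import Int, Solver, Distinct, Or, And, sat

def swap_free_mapping(num_logical, coupling_edges, two_qubit_gates, num_physical):
    s = Solver()
    phys = [Int(f"p_{q}") for q in range(num_logical)]
    for p in phys:                                    # valid physical qubit ids
        s.add(And(p >= 0, p < num_physical))
    s.add(Distinct(phys))                             # injective mapping
    edges = set(coupling_edges) | {(b, a) for (a, b) in coupling_edges}
    for (q0, q1) in two_qubit_gates:                  # every gate needs adjacency
        s.add(Or([And(phys[q0] == a, phys[q1] == b) for (a, b) in edges]))
    if s.check() == sat:
        model = s.model()
        return {q: model[phys[q]].as_long() for q in range(num_logical)}
    return None                                       # no SWAP-free mapping exists

# Line-shaped coupling 0-1-2-3 and gates on (0,1) and (1,2): a mapping exists.
print(swap_free_mapping(3, [(0, 1), (1, 2), (2, 3)], [(0, 1), (1, 2)], 4))
```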

Reinforcement Learning and DEAR Framework for Solving the Qubit Mapping Problem

  • Ching-Yao Huang
  • Chi-Hsiang Lien
  • Wai-Kei Mak

Quantum computing is gaining more and more attention due to its huge potential and the constant progress in quantum computer development. IBM and Google have released quantum architectures with more than 50 qubits. However, in these machines, the physical qubits are not fully connected, so two-qubit interactions can only be performed between specific pairs of physical qubits. To execute a quantum circuit, it is necessary to transform it into a functionally equivalent one that respects the constraints imposed by the target architecture. Quantum circuit transformation inevitably introduces additional gates, which reduces the fidelity of the circuit. Therefore, it is important that the transformation method completes the transformation with minimal overhead. It consists of two steps, initial mapping and qubit routing. Here we propose a reinforcement learning-based model to solve the initial mapping problem. Initial mapping is formulated as sequence-to-sequence learning, and a self-attention network is used to extract features from a circuit. For qubit routing, a DEAR (Dynamically-Extract-and-Route) framework is proposed. The framework iteratively extracts a subcircuit and uses A* search to determine when and where to insert additional gates. It helps to preserve the lookahead ability dynamically and to provide more accurate cost estimation efficiently during A* search. The experimental results show that our RL model generates better initial mappings than the best-known algorithms, with 12% fewer additional gates in the qubit routing stage. Furthermore, our DEAR framework outperforms the state-of-the-art qubit routing approach with 8.4% and 36.3% average reductions in the number of additional gates and execution time, respectively, starting from the same initial mapping.

Qubit Mapping for Reconfigurable Atom Arrays

  • Bochen Tan
  • Dolev Bluvstein
  • Mikhail D. Lukin
  • Jason Cong

Because they offer the largest number of qubits available and massively parallel execution of entangling two-qubit gates, atom arrays are a promising platform for quantum computing. The qubits are selectively loaded into arrays of optical traps, some of which can be moved during the computation itself. By adjusting the locations of the traps and shining a specific global laser, different pairs of qubits, even those initially far away, can be entangled at different stages of the quantum program execution. In comparison, previous QC architectures only generate entanglement on a fixed set of quantum register pairs. Thus, reconfigurable atom arrays (RAA) present a new challenge for QC compilation, especially the qubit mapping/layout synthesis stage, which decides the qubit placement and gate scheduling. In this paper, we consider an RAA QC architecture that contains multiple arrays, supports 2D array movements, represents cutting-edge experimental platforms, and is much more general than those of previous works. We start by systematically examining the fundamental constraints imposed on RAA by physics. Built upon this understanding, we discretize the state space of the architecture and formulate layout synthesis for such an architecture as a satisfiability modulo theories problem. Finally, we demonstrate our work by compiling the quantum approximate optimization algorithm (QAOA), one of the promising near-term quantum computing applications. Our layout synthesizer reduces the number of required native two-qubit gates in 22-qubit QAOA by 5.72x (geomean) compared to leading experiments on a superconducting architecture. Combined with a better coherence time, there is an order-of-magnitude increase in circuit fidelity.

MCQA: Multi-Constraint Qubit Allocation for Near-FTQC Device

  • Sunghye Park
  • Dohun Kim
  • Jae-Yoon Sim
  • Seokhyeong Kang

In response to the rapid development of quantum processors, quantum software must be advanced by considering the actual hardware limitations. Among the various design automation problems in quantum computing, qubit allocation modifies the input circuit to match the hardware topology constraints. In this work, we present an effective heuristic approach for qubit allocation that considers not only the hardware topology but also other constraints for near-fault-tolerant quantum computing (near-FTQC). We propose a practical methodology to find an effective initial mapping to reduce both the number of gates and circuit latency. We then perform dynamic scheduling to maximize the number of gates executed in parallel in the main mapping phase. Our experimental results with a Surface-17 processor confirmed a substantial reduction in the number of gates, latency, and runtime by 58%, 28%, and 99%, respectively, compared with the previous method [18]. Moreover, our mapping method is scalable and has a linear time complexity with respect to the number of gates.

SESSION: Smart Embedded Systems (Virtual)

Session details: Smart Embedded Systems (Virtual)

  • Leonidas Kosmidis
  • Pietro Mercati

Smart Scissor: Coupling Spatial Redundancy Reduction and CNN Compression for Embedded Hardware

  • Hao Kong
  • Di Liu
  • Shuo Huai
  • Xiangzhong Luo
  • Weichen Liu
  • Ravi Subramaniam
  • Christian Makaya
  • Qian Lin

Scaling down the resolution of input images can greatly reduce the computational overhead of convolutional neural networks (CNNs), which is promising for edge AI. However, as an image usually contains much spatial redundancy, e.g., background pixels, directly shrinking the whole image will lose important features of the foreground object and lead to severe accuracy degradation. In this paper, we propose a dynamic image cropping framework to reduce the spatial redundancy by accurately cropping the foreground object from images. To achieve the instance-aware fine cropping, we introduce a lightweight foreground predictor to efficiently localize and crop the foreground of an image. The finely cropped images can be correctly recognized even at a small resolution. Meanwhile, computational redundancy also exists in CNN architectures. To pursue higher execution efficiency on resource-constrained embedded devices, we also propose a compound shrinking strategy to coordinately compress the three dimensions (depth, width, resolution) of CNNs. Eventually, we seamlessly combine the proposed dynamic image cropping and compound shrinking into a unified compression framework, Smart Scissor, which is expected to significantly reduce the computational overhead of CNNs while still maintaining high accuracy. Experiments on ImageNet-1K demonstrate that our method reduces the computational cost of ResNet50 by 41.5% while improving the top-1 accuracy by 0.3%. Moreover, compared to HRank, the state-of-the-art CNN compression framework, our method achieves 4.1% higher top-1 accuracy at the same computational cost. The codes and data are available at https://github.com/ntuliuteam/smart-scissor

SHAPE: Scheduling of Fixed-Priority Tasks on Heterogeneous Architectures with Multiple CPUs and Many PEs

  • Yuankai Xu
  • Tiancheng He
  • Ruiqi Sun
  • Yehan Ma
  • Yier Jin
  • An Zou

Despite being employed in burgeoning efforts to accelerate artificial intelligence, heterogeneous architectures have yet to be well managed under strict timing constraints. As a classic task model, multi-segment self-suspension (MSSS) has been proposed for general I/O-intensive systems and computation offloading. However, directly applying this model to heterogeneous architectures with multiple CPUs and many processing units (PEs) suffers from tremendous pessimism. In this paper, we present a real-time scheduling approach, SHAPE, for general heterogeneous architectures with significantly improved schedulability and utilization. We start by building the general task execution pattern on a heterogeneous architecture integrating multiple CPU cores and many PEs, such as GPU streaming multiprocessors and FPGA IP cores. A real-time scheduling strategy and the corresponding schedulability analysis are presented following the task execution pattern. Compared with state-of-the-art scheduling algorithms through comprehensive experiments on unified and versatile tasks, SHAPE improves the schedulability by 11.1%-100%. Moreover, experiments performed on NVIDIA GPU systems further indicate that up to 70.9% pessimism reduction can be achieved by the proposed scheduling. Since we target general heterogeneous architectures, SHAPE can be directly applied to off-the-shelf heterogeneous computing systems with guaranteed deadlines and improved schedulability.

On Minimizing the Read Latency of Flash Memory to Preserve Inter-Tree Locality in Random Forest

  • Yu-Cheng Lin
  • Yu-Pei Liang
  • Tseng-Yi Chen
  • Yuan-Hao Chang
  • Shuo-Han Chen
  • Wei-Kuan Shih

Many prior research works have discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor. That is, an inference model is constructed on a personal computer or a server platform and then integrated into embedded systems for just-in-time inference. Given the limited main memory space in embedded systems, an important problem for embedded machine learning systems is how to efficiently move the inference model between the main memory and secondary storage (e.g., flash memory). To tackle this problem, we need to consider how to preserve the locality inside the inference model during model construction. Therefore, we propose a solution, namely locality-aware random forest (LaRF), to preserve the inter-tree locality of all decision trees within a random forest model during the model construction process. Owing to this locality preservation, LaRF improves the read latency by at least 81.5% compared to the original random forest library.

SESSION: Analog/Mixed-Signal Simulation, Layout, and Packaging (Virtual)

Session details: Analog/Mixed-Signal Simulation, Layout, and Packaging (Virtual)

  • Biying Xu
  • Ilya Yusim

Numerically-Stable and Highly-Scalable Parallel LU Factorization for Circuit Simulation

  • Xiaoming Chen

A number of sparse linear systems are solved by sparse LU factorization in a circuit simulation process. The coefficient matrices of these linear systems have identical structure but different values. Pivoting is usually needed in sparse LU factorization to ensure numerical stability, which makes it difficult to predict the exact dependencies for scheduling parallel LU factorization. However, the matrix values usually change smoothly across circuit simulation iterations, which provides the potential to “guess” the dependencies. This work proposes a novel parallel LU factorization algorithm with pivoting reduction, whose numerical stability is equivalent to that of LU factorization with pivoting. The basic idea is to reuse the previous structural and pivoting information as much as possible to perform highly-scalable parallel factorization without pivoting, scheduled by the “guessed” dependencies. Once a pivot is found to be too small, the remaining matrix is factorized with pivoting in a pipelined way. Comprehensive experiments, including comparisons with state-of-the-art CPU- and GPU-based parallel sparse direct solvers on 66 circuit matrices and real SPICE DC simulations on 4 circuit netlists, reveal the superior performance and scalability of the proposed algorithm. The proposed solver is available at https://github.com/chenxm1986/cktso.
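
As a toy dense illustration of the "reuse previous pivoting information" idea (assuming a remembered row order from an earlier iteration and a small-pivot bail-out), the sketch below factorizes without pivoting under that order; the sparse data structures, dependency scheduling, and pipelined re-pivoting of the actual solver are not modeled.

```python
# Dense toy of pivot-order reuse: factor with a remembered row permutation and
# fall back only when a reused pivot becomes too small (illustrative only).
import numpy as np

def lu_with_reused_pivots(A, prev_perm, tol=1e-10):
    A = A[prev_perm, :].astype(float).copy()   # apply the remembered row order
    n = A.shape[0]
    for k in range(n):
        if abs(A[k, k]) < tol:                 # reused pivot too small
            return None                        # (real solver: pipelined re-pivoting)
        A[k+1:, k] /= A[k, k]                  # column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # Schur complement update
    return A                                   # packed unit-lower L and U

rng = np.random.default_rng(0)
A0 = rng.standard_normal((5, 5))
prev_perm = np.argsort(-np.abs(A0[:, 0]))      # crude stand-in for a prior pivot order
A1 = A0 + 1e-3 * rng.standard_normal((5, 5))   # next-iteration matrix: values drift slightly
print(lu_with_reused_pivots(A1, prev_perm) is not None)
```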

EI-MOR: A Hybrid Exponential Integrator and Model Order Reduction Approach for Transient Power/Ground Network Analysis

  • Cong Wang
  • Dongen Yang
  • Quan Chen

The exponential integrator (EI) method has been proven to be an effective technique for accelerating large-scale transient power/ground network analysis. However, EI requires the inputs to be piece-wise linear (PWL) within one step, which greatly limits the step size when the inputs are poorly aligned. To address this issue, in this work we first prove that EI, when used together with the rational Krylov subspace, is equivalent to performing a moment-matching model order reduction (MOR) with a single input in each time step and then advancing the reduced system using EI in the same step. Based on this equivalence, we devise a hybrid method, EI-MOR, that combines EI and MOR in the same transient simulation. The majority of well-aligned inputs are still treated by EI as usual, while a few misaligned inputs are selected to be handled by a MOR process that produces a reduced model valid for arbitrary inputs. Therefore, the step size limitation imposed by the misaligned inputs can be largely alleviated. Numerical experiments are conducted to demonstrate the efficacy of the proposed method.
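
For context, the standard EI step for a linear network (written here with the assumed notation C x'(t) = -G x(t) + B u(t)) shows why the input must be piece-wise linear within one step: the input integral is evaluated exactly only under that assumption, with the matrix functions applied to vectors through a (rational) Krylov subspace. This is textbook exponential-integrator material, not the paper's derivation.

```latex
% Standard exponential-integrator step over [t, t+h] with A = -C^{-1}G and
% b(t) = C^{-1}B u(t), assuming u(t) (hence b(t)) is piece-wise linear in the step:
x(t+h) = e^{hA} x(t) + h\,\varphi_1(hA)\,b(t) + h\,\varphi_2(hA)\,\bigl(b(t+h) - b(t)\bigr),
\qquad
\varphi_1(z) = \frac{e^{z}-1}{z},
\quad
\varphi_2(z) = \frac{e^{z}-z-1}{z^{2}}.
```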

Multi-Package Co-Design for Chiplet Integration

  • Zhen Zhuang
  • Bei Yu
  • Kai-Yuan Chao
  • Tsung-Yi Ho

Due to the cost and design complexity associated with advanced technology nodes, it is difficult for traditional monolithic System-on-Chip designs to follow Moore’s Law, which weakens their economic benefits. Semiconductor industries are looking to advanced packages to improve the economic advantages. Since the multi-chiplet architecture supporting heterogeneous integration offers robust re-usability and effective cost reduction, chiplet integration has become the mainstream of advanced packages. Nowadays, the number of mounted chiplets in a package is continuously increasing with the requirement of high system performance. However, the large package area caused by the increasing number of chiplets leads to serious reliability issues, including warpage and bump stress, which worsen the yield and cost. The multi-package architecture, which distributes chiplets to multiple packages and uses less area in each package, is a popular alternative to enhance reliability and reduce cost in advanced packages. However, the primary challenge of the multi-package architecture lies in the tradeoff between the inter-package costs, i.e., the interconnection among packages, and the intra-package costs, i.e., the reliability issues caused by warpage and bump stress. Therefore, a co-design methodology that optimizes multiple packages simultaneously is indispensable to improve the quality of the whole system. To tackle this challenge, we adopt mathematical programming methods for the multi-package co-design problem, reflecting the nature of the synergistic optimization of multiple packages. To the best of our knowledge, this is the first work to solve the multi-package co-design problem.

SESSION: Advanced PIM and Biochip Technology and Stochastic Computing (Virtual)

Session details: Advanced PIM and Biochip Technology and Stochastic Computing (Virtual)

  • Grace Li Zhang

Gzippo: Highly-Compact Processing-in-Memory Graph Accelerator Alleviating Sparsity and Redundancy

  • Xing Li
  • Rachata Ausavarungnirun
  • Xiao Liu
  • Xueyuan Liu
  • Xuan Zhang
  • Heng Lu
  • Zhuoran Song
  • Naifeng Jing
  • Xiaoyao Liang

Graph applications play a significant role in real-world data computation. However, their memory access behavior, characterized by a low compute-to-communication ratio, poor temporal locality, and poor spatial locality, becomes the performance bottleneck. Existing RRAM-based processing-in-memory accelerators reduce the data movement but fail to address both the sparsity and the redundancy of graph data. In this work, we present Gzippo, a highly-compact design that supports graph computation in a compressed sparse format. Gzippo employs a tandem-isomorphic-crossbar architecture both to eliminate redundant searches and sequential indexing during iterations, and to remove the sparsity that leads to non-effective computation on zero values. Gzippo achieves a 3.0× (up to 17.4×) performance speedup and 23.9× (up to 163.2×) higher energy efficiency over a state-of-the-art RRAM-based PIM accelerator.

CoMUX: Combinatorial-Coding-Based High-Performance Microfluidic Control Multiplexer Design

  • Siyuan Liang
  • Mengchu Li
  • Tsun-Ming Tseng
  • Ulf Schlichtmann
  • Tsung-Yi Ho

Flow-based microfluidic chips are one of the most promising platforms for biochemical experiments. Transportation channels and operation devices inside these chips are controlled by microvalves, which are driven by external pressure sources. As the complexity of experiments on these chips keeps increasing, control multiplexers (MUXes) become necessary for the actuation of the enormous number of valves. However, current binary-coding-based MUXes do not take full advantage of the coding capacity and suffer from reliability problems caused by the high control channel density. In this work, we propose a novel MUX coding strategy, named Combinatorial Coding, along with an algorithm to synthesize combinatorial-coding-based MUXes (CoMUXes) of arbitrary sizes with the proven maximum coding capacity. Moreover, we develop a simplification method to reduce the number of valves and control channels in CoMUXes and thus improve their reliability. We compare CoMUXes with state-of-the-art MUXes under different control demands with up to 10 × 2^13 independent control channels. Experiments show that CoMUXes can reliably control more independent control channels with fewer resources. For example, when the number of to-be-controlled channels is up to 10 × 2^13, compared to a state-of-the-art MUX, the optimized CoMUX reduces the number of required flow channels by 44% and the number of valves by 90%.
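
The capacity argument behind combinatorial coding can be checked with a few lines of Python; this is a counting sketch only, not the synthesis algorithm or the valve-level design, and the choice of weight n/2 is the standard one for maximizing the count.

    from itertools import combinations
    from math import comb

    def combinatorial_codes(n_control_lines):
        # Constant-weight codes: each flow channel is addressed by pressurizing
        # exactly n//2 of the n control lines, so up to C(n, n/2) channels can
        # be distinguished, versus 2**(n/2) for a binary-coded MUX that spends
        # two control lines (true/complement) per address bit.
        k = n_control_lines // 2
        return [frozenset(c) for c in combinations(range(n_control_lines), k)]

    n = 10
    codes = combinatorial_codes(n)
    print(len(codes), comb(n, n // 2))   # 252 addressable flow channels
    print(2 ** (n // 2))                 # binary coding addresses only 32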

Exploiting Uniform Spatial Distribution to Design Efficient Random Number Source for Stochastic Computing

  • Kuncai Zhong
  • Zexi Li
  • Haoran Jin
  • Weikang Qian

Stochastic computing (SC) generally suffers from long latency. One solution is to apply proper random number sources (RNSs). Nevertheless, current RNS designs either have high hardware cost or low accuracy. To address this issue, motivated by the observation that a uniform spatial distribution generally leads to high accuracy for an SC circuit, we propose a basic architecture to generate the uniform spatial distribution and a detailed implementation of it. For the implementation, we further propose one method to optimize its hardware cost and another to optimize its accuracy; the hardware cost optimization does not affect the accuracy. The experimental results show that the proposed implementation achieves both low hardware cost and high accuracy. Compared to the state-of-the-art stochastic number generator design, the proposed design reduces area by 88% with comparable accuracy.
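
The intuition that a spatially uniform source improves SC accuracy can be checked with a small numpy experiment (a behavioral model only, not the proposed RNS hardware): both sources encode the same probabilities, but the uniform one fixes the number of ones in each stream exactly, so the product estimated by an AND gate shows a smaller average error at the same stream length.

    import numpy as np

    rng = np.random.default_rng(0)

    def bitstream(p, n, spatially_uniform):
        # Encode probability p as an n-bit stream: with a spatially uniform
        # source exactly round(p*n) bits are 1 (placed in random order);
        # with a plain pseudo-random source each bit is an independent draw.
        if spatially_uniform:
            samples = rng.permutation((np.arange(n) + 0.5) / n)
        else:
            samples = rng.random(n)
        return (samples < p).astype(np.uint8)

    def mean_abs_error(px, py, n, spatially_uniform, trials=1000):
        # Stochastic multiplication: the AND of two streams estimates px*py.
        errs = [abs(np.mean(bitstream(px, n, spatially_uniform) &
                            bitstream(py, n, spatially_uniform)) - px * py)
                for _ in range(trials)]
        return float(np.mean(errs))

    n = 128
    print("uniform source:", mean_abs_error(0.75, 0.40, n, True))
    print("i.i.d. source :", mean_abs_error(0.75, 0.40, n, False))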

SESSION: On Automating Heterogeneous Designs (Virtual)

Session details: On Automating Heterogeneous Designs (Virtual)

  • Haocheng Li

A Novel Blockage-Avoiding Macro Placement Approach for 3D ICs Based on POCS

  • Jai-Ming Lin
  • Po-Chen Lu
  • Heng-Yu Lin
  • Jia-Ting Tsai

Although the 3D integrated circuit (IC) placement problem has been studied for many years, few publications have been devoted to macro legalization. Due to the large sizes of macros, the macro placement problem is harder than cell placement, especially when preplaced macros exist in a multi-tier structure. In order to have a more global view, this paper proposes a partitioning-last macro-first flow to handle 3D placement for mixed-size designs, which performs tier partitioning after placement prototyping and then legalizes macros before cell placement. A novel two-step approach is proposed to handle 3D macro placement. The first step determines the locations of macros in a projection plane based on a new representation, named K-tier Partially Occupied Corner Stitching. It not only keeps the prototyping result but also guarantees a legal placement after tier assignment of macros. Next, macros are assigned to their respective tiers by an integer linear programming (ILP) algorithm. Experimental results show that our design flow obtains better solutions than other flows, especially in cases with more preplaced macros.

Routability-Driven Analytical Placement with Precise Penalty Models for Large-Scale 3D ICs

  • Jai-Ming Lin
  • Hao-Yuan Hsieh
  • Hsuan Kung
  • Hao-Jia Lin

The quality of a true 3D placement approach greatly relies on the correctness of the models used in its formulation. However, the models used by previous approaches are not precise enough. Moreover, they do not actually place TSVs, which prevents them from obtaining accurate wirelength estimates and constructing a correct congestion map. Besides, they rarely discuss routability, which is the most important issue considered in 2D placement. To resolve this insufficiency, this paper proposes more accurate models to estimate placement utilization and TSV count using the softmax function, which can align cells to exact tiers. Moreover, we propose a fast parallel algorithm to update the locations of TSVs when cells are moved during optimization. Finally, we present a novel penalty model to estimate the routing overflow of regions covered by cells and inflate cells in congested regions according to this model. Experimental results show that our methodology obtains better results than previous works.

SESSION: Special Session: Quantum Computing to Solve Chemistry, Physics and Security Problems (Virtual)

Session details: Special Session: Quantum Computing to Solve Chemistry, Physics and Security Problems (Virtual)

  • Swaroop Ghosh

Quantum Machine Learning for Material Synthesis and Hardware Security (Invited Paper)

  • Collin Beaudoin
  • Satwik Kundu
  • Rasit Onur Topaloglu
  • Swaroop Ghosh

Using quantum computing, this paper addresses two scientifically pressing and practically relevant problems, namely, chemical retrosynthesis, which is an important step in drug/material discovery, and the security of the semiconductor supply chain. We show that Quantum Long Short-Term Memory (QLSTM) is a viable tool for retrosynthesis. We achieve 65% training accuracy with QLSTM, whereas classical LSTM can achieve 100%. However, in testing we achieve 80% accuracy with the QLSTM, while classical LSTM peaks at only 70% accuracy. We also demonstrate an application of Quantum Neural Networks (QNNs) in the hardware security domain, specifically in Hardware Trojan (HT) detection using a set of power and area Trojan features. The QNN model achieves detection accuracy as high as 97.27%.

Quantum Machine Learning Applications in High-Energy Physics

  • Andrea Delgado
  • Kathleen E. Hamilton

Some of the most significant achievements of the modern era of particle physics, such as the discovery of the Higgs boson, have been made possible by the tremendous effort in building and operating large-scale experiments like the Large Hadron Collider or the Tevatron. In these facilities, the ultimate theory to describe matter at the most fundamental level is constantly probed and verified. These experiments often produce large amounts of data that require storing, processing, and analysis techniques that continually push the limits of traditional information processing schemes. Thus, the High-Energy Physics (HEP) field has benefited from advancements in information processing and the development of algorithms and tools for large datasets. More recently, quantum computing applications have been investigated to understand how the community can benefit from the advantages of quantum information science. Nonetheless, to unleash the full potential of quantum computing, there is a need to understand the quantum behavior and, thus, scale up current algorithms beyond what can be simulated in classical processors. In this work, we explore potential applications of quantum machine learning to data analysis tasks in HEP and how to overcome the limitations of algorithms targeted for Noisy Intermediate-Scale Quantum (NISQ) devices.

SESSION: Making Patterning Work (Virtual)

Session details: Making Patterning Work (Virtual)

  • Yuzhe Ma

DeePEB: A Neural Partial Differential Equation Solver for Post Exposure Baking Simulation in Lithography

  • Qipan Wang
  • Xiaohan Gao
  • Yibo Lin
  • Runsheng Wang
  • Ru Huang

Post Exposure Baking (PEB) has been widely utilized in advanced lithography. PEB simulation is critical in the lithography simulation flow, as it bridges the optical simulation result and the final developed profile in the photoresist. The process of PEB can be described by coupled partial differential equations (PDEs) with corresponding boundary and initial conditions. Recent years have witnessed a growing presence of machine learning algorithms in lithography simulation, while PEB simulation is often ignored or treated with compact models, given the huge cost of solving PDEs exactly. In this work, based on the observation of the physical essence of PEB, we propose DeePEB: a neural PDE solver for PEB simulation. This model is capable of predicting the PEB latent image with high accuracy and >100× acceleration (compared to a commercial rigorous simulation tool), paving the way for efficient and accurate photoresist modeling in lithography simulation and layout optimization.

AdaOPC: A Self-Adaptive Mask Optimization Framework for Real Design Patterns

  • Wenqian Zhao
  • Xufeng Yao
  • Ziyang Yu
  • Guojin Chen
  • Yuzhe Ma
  • Bei Yu
  • Martin D. F. Wong

Optical proximity correction (OPC) is a widely-used resolution enhancement technique (RET) for printability optimization. Recently, rigorous numerical optimization and fast machine learning have been the research focus of OPC in both academia and industry, each of which complements the other in terms of robustness or efficiency. We inspect the pattern distribution on a design layer and find that different sub-regions have different pattern complexity. Besides, we also find that many patterns appear repeatedly in the design layout, and these patterns may possibly share optimized masks. We exploit these properties and propose a self-adaptive OPC framework to improve efficiency. First, we choose different OPC solvers adaptively for patterns of different complexity from an extensible solver pool to reach a speed/accuracy co-optimization. In addition, we prove the feasibility of reusing optimized masks for repeated patterns and hence build a graph-based dynamic pattern library that reuses stored masks to further speed up the OPC flow. Experimental results show that our framework achieves substantial improvement in both performance and efficiency.
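
The mask-reuse idea can be summarized by a toy pattern library; the hash key, the edge-density complexity measure, and the placeholder solver pool below are all assumptions for illustration, not the paper's actual components.

    import numpy as np

    class PatternLibrary:
        # Toy mask-reuse library: patterns are binary layout clips; identical
        # clips map to the same stored mask, new clips are dispatched to a
        # solver chosen by a simple complexity measure and then cached.
        def __init__(self):
            self._masks = {}

        @staticmethod
        def _key(clip):
            return clip.astype(np.uint8).tobytes() + bytes(clip.shape)

        def get_or_optimize(self, clip, solvers):
            key = self._key(clip)
            if key in self._masks:                      # repeated pattern: reuse
                return self._masks[key], "reused"
            # hypothetical complexity measure: edge density of the clip
            complexity = np.abs(np.diff(clip.astype(int), axis=0)).mean()
            solver = solvers["ilt"] if complexity > 0.1 else solvers["ml"]
            mask = solver(clip)
            self._masks[key] = mask
            return mask, "optimized"

    solvers = {"ml": lambda c: c, "ilt": lambda c: c}   # placeholder solvers
    lib = PatternLibrary()
    clip = np.zeros((8, 8), dtype=np.uint8); clip[2:6, 3:5] = 1
    print(lib.get_or_optimize(clip, solvers)[1])        # "optimized"
    print(lib.get_or_optimize(clip.copy(), solvers)[1]) # "reused"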

LayouTransformer: Generating Layout Patterns with Transformer via Sequential Pattern Modeling

  • Liangjian Wen
  • Yi Zhu
  • Lei Ye
  • Guojin Chen
  • Bei Yu
  • Jianzhuang Liu
  • Chunjing Xu

Generating legal and diverse layout patterns to establish large pattern libraries is fundamental for many lithography design applications. Existing pattern generation models typically regard the pattern generation problem as image generation of layout maps and learn to model the patterns by capturing pixel-level coherence, which is insufficient for polygon-level modeling, e.g., the shape and layout of patterns, thus leading to poor generation quality. In this paper, we regard the pattern generation problem as an unsupervised sequence generation problem, in order to learn the pattern design rules by explicitly modeling the shapes of polygons and the layouts among polygons. Specifically, we first propose a sequential pattern representation scheme that fully describes the geometric information of polygons by encoding the 2D layout patterns as sequences of tokens, i.e., vertices and edges. Then we train a sequential generative model to capture the long-term dependency among tokens and thus learn the design rules from training examples. To generate a new pattern in sequence, each token is generated conditioned on the previously generated tokens, which may come from the same polygon or from different polygons in the same layout map. Our framework, termed LayouTransformer, is based on the Transformer architecture due to its remarkable ability in sequence modeling. Comprehensive experiments show that our LayouTransformer not only generates a large amount of legal patterns but also maintains high generation diversity, demonstrating its superiority over existing pattern generative models.
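
A minimal sketch of such a sequential representation is shown below, with a made-up token vocabulary (the paper's exact tokenization may differ): each polygon becomes a delimited run of vertex tokens, and a layout clip is the concatenation of its polygons' runs.

    def polygon_to_tokens(polygon):
        # Encode one rectilinear polygon as a flat token sequence:
        # <BOP>, then "x:<v>" / "y:<v>" tokens per vertex, then <EOP>.
        tokens = ["<BOP>"]
        for x, y in polygon:
            tokens += [f"x:{x}", f"y:{y}"]
        return tokens + ["<EOP>"]

    def layout_to_sequence(polygons):
        # A layout clip is the concatenation of its polygon sequences,
        # terminated by <EOS>; a sequence model is trained on such sequences.
        seq = []
        for poly in polygons:
            seq += polygon_to_tokens(poly)
        return seq + ["<EOS>"]

    clip = [[(0, 0), (4, 0), (4, 2), (0, 2)],      # two rectangles
            [(6, 0), (8, 0), (8, 6), (6, 6)]]
    print(layout_to_sequence(clip))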

WaferHSL: Wafer Failure Pattern Classification with Efficient Human-Like Staged Learning

  • Qijing Wang
  • Martin D. F. Wong

As the demand for semiconductor products increases and integrated circuit (IC) processes become more and more complex, wafer failure pattern classification is gaining more attention from manufacturers and researchers seeking to improve yield. To cope with the real-world scenario in which only very limited labeled data and no unlabeled data are available in the early manufacturing stage of new products, this work proposes an efficient human-like staged learning framework for wafer failure pattern classification named WaferHSL. Inspired by humans’ knowledge acquisition process, a mutually reinforcing task fusion scheme is designed to guide the deep learning model in simultaneously establishing knowledge of spatial relationships, geometric properties, and semantics. Furthermore, a progressive stage controller is deployed to partition and control the learning process, so as to enable human-like progressive advancement in the model. Experimental results show that with only 10% labeled samples and no unlabeled samples, WaferHSL can achieve better results than previous SOTA methods trained with 60% labeled samples and a large number of unlabeled samples, while the improvement is even more significant when using the same size of labeled training set.

SESSION: Advanced Verification Technologies (Virtual)

Session details: Advanced Verification Technologies (Virtual)

  • Takahide Yoshikawa

Combining BMC and Complementary Approximate Reachability to Accelerate Bug-Finding

  • Xiaoyu Zhang
  • Shengping Xiao
  • Jianwen Li
  • Geguang Pu
  • Ofer Strichman

Bounded Model Checking (BMC) is so far considered the best engine for bug-finding in hardware model checking. Given a bound K, BMC can detect whether there is a counterexample to a given temporal property within K steps from the initial state, thus performing a global-style search. Recently, a SAT-based model-checking technique called Complementary Approximate Reachability (CAR) was shown to be complementary to BMC, in the sense that each can frequently solve instances that the other cannot within the same time limit. CAR detects a counterexample gradually with the guidance of an over-approximating state sequence, and performs a local-style search. In this paper, we consider three different ways to combine BMC and CAR. Our experiments show that they all outperform BMC and CAR on their own, and solve instances that cannot be solved by either of these two techniques. Our findings are based on a comprehensive experimental evaluation using the benchmarks of two hardware model checking competitions.

Equivalence Checking of Dynamic Quantum Circuits

  • Xin Hong
  • Yuan Feng
  • Sanjiang Li
  • Mingsheng Ying

Despite the rapid development of quantum computing in recent years, state-of-the-art quantum devices still contain only a limited number of qubits. One possible way to execute more realistic algorithms on near-term quantum devices is to employ dynamic quantum circuits (DQCs). In DQCs, measurements can happen during the circuit, and their outcomes can be processed with classical computers and used to control other parts of the circuit. This technique can help significantly reduce the qubit resources required to implement a quantum algorithm. In this paper, we give a formal definition of DQCs and then characterise their functionality in terms of ensembles of linear operators, following the Kraus representation of superoperators. We further interpret DQCs as tensor networks, implement their functionality as tensor decision diagrams (TDDs), and reduce the equivalence of two DQCs to checking whether they have the same TDD representation. Experiments show that embedding classical logic into conventional quantum circuits does not incur a significant time and space burden.

SESSION: Routing with Cell Movement (Virtual)

Session details: Routing with Cell Movement (Virtual)

  • Guojie Luo

ATLAS: A Two-Level Layer-Aware Scheme for Routing with Cell Movement

  • Xinshi Zang
  • Fangzhou Wang
  • Jinwei Liu
  • Martin D. F. Wong

Placement and routing are two crucial steps in the physical design of integrated circuits (ICs). To close the gap between placement and routing, the problem of routing with cell movement has attracted great attention recently. In this problem, a certain number of cells can be moved to new positions and the nets can be rerouted to improve the total wire length. In this work, we advance the study of this problem by proposing a two-level layer-aware scheme, named ATLAS. A coarse-level cluster-based cell movement is first performed to optimize via usage and provide a better starting point for the subsequent fine-level single-cell movement. To further encourage routing on the upper metal layers, we utilize a set of adjusted layer weights to increase the routing cost on lower layers. Experimental results on the ICCAD 2020 contest benchmarks show that ATLAS achieves much more wire length reduction than the state-of-the-art routing with cell movement engine. Furthermore, applied to the ICCAD 2021 contest benchmarks, ATLAS outperforms the first-place team of the contest with much better solution quality while being 3× faster.
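
The adjusted layer weights can be modeled as a simple weighted wirelength cost; the linear weight ramp and layer count below are illustrative assumptions, not ATLAS's actual weights. Segments on lower metal layers are charged more, so routes that climb to upper layers look cheaper to the optimizer.

    def layer_weighted_cost(segments, num_layers, low_layer_penalty=2.0):
        # segments: (length, layer) pairs of a routed net; lower layers get a
        # larger weight so the cost function steers routing to upper metal.
        cost = 0.0
        for length, layer in segments:
            weight = 1.0 + low_layer_penalty * (num_layers - 1 - layer) / (num_layers - 1)
            cost += weight * length
        return cost

    route_on_lower = [(10, 1), (5, 2)]    # lengths routed on lower metal layers
    route_on_upper = [(10, 6), (5, 7)]    # same lengths on upper metal layers
    print(layer_weighted_cost(route_on_lower, num_layers=8))   # about 39.3
    print(layer_weighted_cost(route_on_upper, num_layers=8))   # about 17.9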

A Robust Global Routing Engine with High-Accuracy Cell Movement under Advanced Constraints

  • Ziran Zhu
  • Fuheng Shen
  • Yangjie Mei
  • Zhipeng Huang
  • Jianli Chen
  • Jun Yang

Placement and routing are typically defined as two separate problems to reduce the design complexity. However, such a divide-and-conquer approach inevitably incurs a degradation of solution quality because the objectives of placement and routing are not entirely consistent. Besides, with various constraints (e.g., timing, R/C characteristics, voltage areas, etc.) imposed by advanced circuit designs, bridging the gap between placement and routing while satisfying the advanced constraints has become more challenging. In this paper, we develop a robust global routing engine with high-accuracy cell movement under advanced constraints to narrow the gap and improve the routing solution. We first present a routing refinement technique to obtain a convergent routing result based on the fixed placement, which provides more accurate information for subsequent cell movement. To achieve fast and high-accuracy position prediction for cell movement, we construct a lookup table (LUT) considering complex constraints/objectives (e.g., routing direction and layer-based power consumption), and generate a timing-driven gain map for each cell based on the LUT. Finally, based on the prediction, we propose an alternating cell movement and cluster movement scheme followed by partial rip-up and reroute to optimize the routing solution. Experimental results on the ICCAD 2020 contest benchmarks show that our algorithm achieves the best total scores among all published works. Compared with the champion of the ICCAD 2021 contest, experimental results on the ICCAD 2021 contest benchmarks show that our algorithm achieves better solution quality in shorter runtime.

SESSION: Special Session: Hardware Security through Reconfigurability: Attacks, Defenses, and Challenges

Session details: Special Session: Hardware Security through Reconfigurability: Attacks, Defenses, and Challenges

  • Michael Raitza

Securing Hardware through Reconfigurable Nano-Structures

  • Nima Kavand
  • Armin Darjani
  • Shubham Rai
  • Akash Kumar

Hardware security has been an ever-growing concern of integrated circuit (IC) designers. Through different stages of the IC design and life cycle, an adversary can extract sensitive design information and private data stored in the circuit using logical, physical, and structural weaknesses. Besides, in recent times, ML-based attacks have become the new de facto standard in the hardware security community. Contemporary defense strategies often face unforeseen challenges in coping with these attack schemes. Additionally, the high overhead of CMOS-based secure add-on circuitry and the intrinsic limitations of these devices indicate the need for new nano-electronics. Emerging reconfigurable devices like Reconfigurable Field Effect Transistors (RFETs) provide unique features to fortify the design against various threats at different stages of the IC design and life cycle. In this manuscript, we investigate the applications of RFETs for securing designs against traditional and machine learning (ML)-based intellectual property (IP) piracy techniques and side-channel attacks (SCAs).

Reconfigurable Logic for Hardware IP Protection: Opportunities and Challenges

  • Luca Collini
  • Benjamin Tan
  • Christian Pilato
  • Ramesh Karri

Protecting the intellectual property (IP) of integrated circuit (IC) designs is becoming a significant concern of fab-less semiconductor design houses. Malicious actors can access the chip design at any stage, reverse engineer the functionality, and create illegal copies. On the one hand, defenders are crafting more and more solutions to hide the critical portions of the circuit. On the other hand, attackers are designing more and more powerful tools to extract useful information from the design and reverse engineer the functionality, especially when they can get access to working chips. In this context, the use of custom reconfigurable fabrics has recently been investigated for hardware IP protection. This paper discusses recent trends in hardware obfuscation with embedded FPGAs, focusing also on the open challenges that must be addressed to make this solution viable.

SESSION: Performance, Power and Temperature Aspects in Deep Learning

Session details: Performance, Power and Temperature Aspects in Deep Learning

  • Callie Hao
  • Jeff Zhang

RT-NeRF: Real-Time On-Device Neural Radiance Fields Towards Immersive AR/VR Rendering

  • Chaojian Li
  • Sixu Li
  • Yang Zhao
  • Wenbo Zhu
  • Yingyan Lin

Neural Radiance Field (NeRF) based rendering has attracted growing attention thanks to its state-of-the-art (SOTA) rendering quality and wide applications in Augmented and Virtual Reality (AR/VR). However, the interactions enabled by immersive, real-time (>30 FPS) NeRF-based rendering are still limited due to the low achievable throughput on AR/VR devices. To this end, we first profile SOTA efficient NeRF algorithms on commercial devices and identify two primary causes of the aforementioned inefficiency: (1) the uniform point sampling and (2) the dense accesses and computations of the required embeddings in NeRF. Furthermore, we propose RT-NeRF, which to the best of our knowledge is the first algorithm-hardware co-design acceleration of NeRF. Specifically, on the algorithm level, RT-NeRF integrates an efficient rendering pipeline that largely alleviates the inefficiency of the commonly adopted uniform point sampling method in NeRF by directly computing the geometry of pre-existing points. Additionally, RT-NeRF leverages a coarse-grained, view-dependent computing-ordering scheme to eliminate the (unnecessary) processing of invisible points. On the hardware level, our proposed RT-NeRF accelerator (1) adopts a hybrid encoding scheme to adaptively switch between a bitmap-based and a coordinate-based sparsity encoding format for NeRF’s sparse embeddings, aiming to maximize the storage savings and thus reduce the required DRAM accesses while supporting efficient NeRF decoding; and (2) integrates both a high-density sparse search unit and a dual-purpose bi-direction adder & search tree to coordinate the two aforementioned encoding formats. Extensive experiments on eight datasets consistently validate the effectiveness of RT-NeRF, achieving a large throughput improvement (e.g., 9.7×~3,201×) while maintaining the rendering quality as compared with SOTA efficient NeRF solutions.
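
The hybrid encoding decision reduces to comparing the storage the two formats need per embedding block; the rough numpy model below uses assumed bit costs rather than the accelerator's exact layout, and simply picks whichever format needs fewer bits for the occupancy information.

    import numpy as np

    def encoding_bits(block, index_bits=16):
        # Storage (in bits) of the occupancy information under two formats:
        # a bitmap (1 bit per entry) versus a coordinate list (index_bits per
        # nonzero).  The payload values are stored either way and ignored here.
        n_total = block.size
        n_nonzero = int(np.count_nonzero(block))
        return {"bitmap": n_total, "coordinate": n_nonzero * index_bits}

    def choose_format(block):
        bits = encoding_bits(block)
        return min(bits, key=bits.get), bits

    rng = np.random.default_rng(1)
    dense_block  = (rng.random(4096) < 0.30).astype(np.int8)   # 30% nonzero
    sparse_block = (rng.random(4096) < 0.01).astype(np.int8)   # 1% nonzero
    print(choose_format(dense_block)[0])    # bitmap wins at this density
    print(choose_format(sparse_block)[0])   # coordinate list wins here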

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

  • Yifan Gong
  • Zheng Zhan
  • Pu Zhao
  • Yushu Wu
  • Chao Wu
  • Caiwen Ding
  • Weiwen Jiang
  • Minghai Qin
  • Yanzhi Wang

During the deployment of deep neural networks (DNNs) on edge devices, many research efforts are devoted to the limited hardware resources. However, little attention is paid to the influence of dynamic power management. As edge devices typically have only a limited energy budget from batteries (rather than the nearly unlimited energy supply of servers or workstations), their dynamic power management often changes the execution frequency, as in the widely-used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed performance, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework that works with dynamic power management using DVFS. The framework can use only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the corresponding pruning ratio for a specific execution frequency (and voltage), we are able to achieve stable inference speed, i.e., keeping the difference in speed performance under various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces their variance of inference latency across frequencies, with the minimal memory consumption of only one model and one soft mask.
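
The reconfiguration step can be pictured as thresholding a single soft mask at a different pruning ratio per DVFS level; the sketch below is a behavioral model with made-up frequency-to-ratio pairs, not the published training procedure.

    import numpy as np

    def configure_model(weights, soft_mask, pruning_ratio):
        # Derive one pruned model from a single weight set and a single soft
        # mask: keep the (1 - pruning_ratio) fraction of weights whose mask
        # scores are largest.
        k = int(round(pruning_ratio * weights.size))
        threshold = np.partition(soft_mask.ravel(), k)[k] if k > 0 else -np.inf
        return weights * (soft_mask > threshold)

    rng = np.random.default_rng(0)
    W, M = rng.standard_normal((64, 64)), rng.random((64, 64))
    ratio_for_frequency = {"1.5GHz": 0.3, "1.0GHz": 0.6, "0.6GHz": 0.8}  # assumed

    for freq, ratio in ratio_for_frequency.items():
        pruned = configure_model(W, M, ratio)
        print(freq, f"{np.mean(pruned == 0):.2f} of weights pruned")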

Robustify ML-Based Lithography Hotspot Detectors

  • Jingyu Pan
  • Chen-Chia Chang
  • Zhiyao Xie
  • Jiang Hu
  • Yiran Chen

Deep learning has been widely applied in various VLSI design automation tasks, from layout quality estimation to design optimization. Though deep learning has shown state-of-the-art performance in several applications, recent studies reveal that deep neural networks exhibit intrinsic vulnerability to adversarial perturbations, which pose risks in the ML-aided VLSI design flow. Among the most effective strategies to improve robustness are regularization approaches, which adjust the optimization objective to make the deep neural network generalize better. In this paper, we examine several adversarial defense methods to improve the robustness of ML-based lithography hotspot detectors. We present an innovative design rule checking (DRC)-guided curvature regularization (CURE) approach, which is customized to robustify ML-based lithography hotspot detectors against white-box attacks. Our approach allows for improvements in both the robustness and the accuracy of the model. Experiments show that the model optimized by DRC-guided CURE achieves the highest robustness and accuracy compared with those trained using the baseline defense methods. Compared with the vanilla model, DRC-guided CURE decreases the average attack success rate by 53.9% and increases the average ROC-AUC by 12.1%. Compared with the best of the defense baselines, DRC-guided CURE reduces the average attack success rate by 18.6% and improves the average ROC-AUC by 4.3%.

Associative Memory Based Experience Replay for Deep Reinforcement Learning

  • Mengyuan Li
  • Arman Kazemi
  • Ann Franchesca Laguna
  • X. Sharon Hu

Experience replay is an essential component of deep reinforcement learning (DRL): it stores past experiences and supplies them to the agent for learning in real time. Recently, prioritized experience replay (PER) has been proven to be powerful and is widely deployed in DRL agents. However, implementing PER on traditional CPU or GPU architectures incurs significant latency overhead due to its frequent and irregular memory accesses. This paper proposes a hardware-software co-design approach to build an associative memory (AM) based PER, AMPER, with an AM-friendly priority sampling operation. AMPER replaces the widely-used, time-costly tree-traversal-based priority sampling in PER while preserving the learning performance. Further, we design an in-memory computing hardware architecture based on AM to support AMPER by leveraging parallel in-memory search operations. AMPER shows comparable learning performance while achieving a 55× to 270× latency improvement when running on the proposed hardware compared to the state-of-the-art PER running on a GPU.
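
Functionally, the priority sampling that PER usually implements with a sum tree, and that AMPER maps to parallel in-memory search, is proportional sampling over the stored priorities. The minimal software model below illustrates that operation only; it is not AMPER and uses an ordinary random-choice routine in place of any hardware search.

    import numpy as np

    class ProportionalReplay:
        # Minimal proportional-priority replay buffer: transitions are sampled
        # with probability proportional to their stored priorities.
        def __init__(self, capacity):
            self.priorities = np.zeros(capacity)
            self.data = [None] * capacity
            self.pos, self.size = 0, 0

        def add(self, transition, priority):
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
            self.pos = (self.pos + 1) % len(self.data)
            self.size = min(self.size + 1, len(self.data))

        def sample(self, batch_size, rng):
            p = self.priorities[:self.size]
            idx = rng.choice(self.size, size=batch_size, p=p / p.sum())
            return idx, [self.data[i] for i in idx]

    rng = np.random.default_rng(0)
    buf = ProportionalReplay(1024)
    for t in range(200):
        buf.add(("state", t), priority=rng.random() + 0.01)
    print(buf.sample(4, rng)[0])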

SESSION: Tutorial: TorchQuantum Case Study for Robust Quantum Circuits

Session details: Tutorial: TorchQuantum Case Study for Robust Quantum Circuits

  • Hanrui Wang

TorchQuantum Case Study for Robust Quantum Circuits

  • Hanrui Wang
  • Zhiding Liang
  • Jiaqi Gu
  • Zirui Li
  • Yongshan Ding
  • Weiwen Jiang
  • Yiyu Shi
  • David Z. Pan
  • Frederic T. Chong
  • Song Han

Quantum Computing has attracted much research attention because of its potential to achieve fundamental speed and efficiency improvements in various domains. Among different quantum algorithms, Parameterized Quantum Circuits (PQC) for Quantum Machine Learning (QML) show promise in realizing quantum advantages on the current Noisy Intermediate-Scale Quantum (NISQ) machines. Therefore, to facilitate QML and PQC research, a recent Python library called TorchQuantum has been released. It can construct, simulate, and train PQC for machine learning tasks with high speed and convenient debugging support. Besides quantum for ML, we want to raise the community’s attention to the reverse direction: ML for quantum. Specifically, the TorchQuantum library also supports using data-driven ML models to solve problems in quantum system research, such as predicting the impact of quantum noise on circuit fidelity and improving the quantum circuit compilation efficiency.

This paper presents a case study of the ML-for-quantum part of TorchQuantum. Since estimating the noise impact on circuit reliability is an essential step toward understanding and mitigating noise, we propose to leverage classical ML to predict the noise impact on circuit fidelity. Inspired by the natural graph representation of quantum circuits, we leverage a graph transformer model to predict the noisy circuit fidelity. We first collect a large dataset with a variety of quantum circuits and obtain their fidelity on noisy simulators and real machines. Then we embed each circuit into a graph with gate and noise properties as node features, and adopt a graph transformer to predict the fidelity. We can thus avoid the exponential cost of classical simulation and efficiently estimate fidelity with polynomial complexity.

Evaluated on 5 thousand random and algorithm circuits, the graph transformer predictor can provide accurate fidelity estimation with an RMSE of 0.04 and outperforms a simple neural network-based model by 0.02 on average. It achieves R2 scores of 0.99 and 0.95 for random and algorithm circuits, respectively. Compared with circuit simulators, the predictor has over 200× speedup for estimating the fidelity. The datasets and predictors can be accessed in the TorchQuantum library.

SESSION: Emerging Machine Learning Primitives: From Technology to Application

Session details: Emerging Machine Learning Primitives: From Technology to Application

  • Dharanidhar Dang
  • Hai Helen Lee

COSIME: FeFET Based Associative Memory for In-Memory Cosine Similarity Search

  • Che-Kai Liu
  • Haobang Chen
  • Mohsen Imani
  • Kai Ni
  • Arman Kazemi
  • Ann Franchesca Laguna
  • Michael Niemier
  • Xiaobo Sharon Hu
  • Liang Zhao
  • Cheng Zhuo
  • Xunzhao Yin

In a number of machine learning models, an input query is searched across the trained class vectors to find the closest feature class vector under the cosine similarity metric. However, performing the cosine similarities between the vectors in von Neumann machines involves a large number of multiplications, Euclidean normalizations, and division operations, thus incurring heavy hardware energy and latency overheads. Moreover, due to the memory wall problem present in conventional architectures, frequent cosine similarity-based searches (CSSs) over the class vectors require substantial data movement, limiting the throughput and efficiency of the system. To overcome the aforementioned challenges, this paper introduces COSIME, a general in-memory associative memory (AM) engine based on the ferroelectric FET (FeFET) device for efficient CSS. By leveraging the one-transistor AND gate function of FeFET devices, a current-based translinear analog circuit, and winner-take-all (WTA) circuitry, COSIME can realize parallel in-memory CSS across all the entries in a memory block and output the closest word to the input query under the cosine similarity metric. Evaluation results at the array level suggest that the proposed COSIME design achieves 333× and 90.5× latency and energy improvements, respectively, and realizes better classification accuracy when compared with an AM design implementing approximated CSS. The proposed in-memory computing fabric is evaluated for an HDC problem, showcasing that COSIME can achieve on average 47.1× and 98.5× speedup and energy efficiency improvements compared with a GPU implementation.
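
Functionally, the operation being accelerated is the following reference search; the numpy model below is only the mathematical specification of CSS, not the FeFET array behavior, and the vector sizes are arbitrary.

    import numpy as np

    def cosine_similarity_search(query, class_vectors):
        # Return the index of the stored class vector closest to the query
        # under cosine similarity, plus all similarity scores.
        q = query / np.linalg.norm(query)
        v = class_vectors / np.linalg.norm(class_vectors, axis=1, keepdims=True)
        scores = v @ q                     # one normalized dot product per word
        return int(np.argmax(scores)), scores

    rng = np.random.default_rng(0)
    classes = rng.standard_normal((16, 512))               # 16 stored class vectors
    query = classes[7] + 0.1 * rng.standard_normal(512)    # noisy copy of vector 7
    best, _ = cosine_similarity_search(query, classes)
    print(best)                                            # expected: 7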

DynaPAT: A Dynamic Pattern-Aware Encoding Technique for Robust MLC PCM-Based Deep Neural Networks

  • Thai-Hoang Nguyen
  • Muhammad Imran
  • Joon-Sung Yang

As the effectiveness of Deep Neural Networks (DNNs) rises over time, so does the need for highly scalable and efficient hardware architectures to capitalize on this effectiveness in many practical applications. The emerging non-volatile Phase Change Memory (PCM) technology has been found to be a promising candidate for future memory systems due to its better scalability, non-volatility, and low leakage/dynamic power consumption compared to conventional charge-based memories. Additionally, with its cell’s wide resistance span, PCM also has Flash-like Multi-Level Cell (MLC) capability, which enhances storage density and provides an opportunity for deploying data-intensive applications such as DNNs on resource-constrained edge devices. However, the practical deployment of MLC PCM is hampered by certain reliability challenges, among which resistance drift is considered a critical concern. In a DNN application, the presence of resistance drift in MLC PCM can severely impact the DNN’s accuracy if no drift-error-tolerance technique is utilized. This paper proposes DynaPAT, a low-cost and effective pattern-aware encoding technique to enhance the drift-error tolerance of MLC PCM-based Deep Neural Networks. DynaPAT is built on an insight into the DNN’s vulnerability to different data-pattern switching. Based on this insight, DynaPAT efficiently maps the most-frequent data pattern in the DNN’s parameters onto the least-drift-prone level of the MLC PCM, thus significantly enhancing the robustness of the system against drift errors. Various experiments on different DNN models and configurations demonstrate the effectiveness of DynaPAT. The experimental results indicate that DynaPAT can achieve up to a 500× enhancement in drift-error-tolerance capability over the baseline MLC PCM-based DNN while requiring only a negligible hardware overhead (below 1% storage overhead). Being orthogonal, DynaPAT can be integrated with existing drift-tolerance schemes for even higher gains in reliability.
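
The core mapping step can be sketched as counting 2-bit pattern frequencies in the quantized weights and pairing them with MLC levels ordered by drift proneness; the level ordering used below is a made-up assumption, not measured PCM data, and the weight distribution is synthetic.

    import numpy as np

    def build_pattern_mapping(weights_uint8, drift_prone_order):
        # Count how often each 2-bit pattern occurs in the weight bytes and map
        # the most frequent pattern to the least drift-prone MLC level.
        pairs = np.stack([(weights_uint8 >> s) & 0b11 for s in (0, 2, 4, 6)])
        counts = np.bincount(pairs.ravel(), minlength=4)
        by_frequency = np.argsort(counts)[::-1]            # most frequent first
        return {int(p): lvl for p, lvl in zip(by_frequency, drift_prone_order)}

    rng = np.random.default_rng(0)
    weights = np.clip(rng.normal(0, 12, 4096), -127, 127).astype(np.int8).view(np.uint8)
    # hypothetical MLC levels listed from least to most drift-prone
    mapping = build_pattern_mapping(weights, drift_prone_order=[0, 3, 1, 2])
    print(mapping)   # pattern -> level assignment, depends on the weight statistics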

Graph Neural Networks for Idling Error Mitigation

  • Vedika Servanan
  • Samah Mohamed Saeed

Dynamical Decoupling (DD)-based protocols have been shown to reduce the idling errors encountered in quantum circuits. However, current research on suppressing idling qubit errors suffers from scalability issues due to the large number of tuning quantum circuits that must be executed first to find the DD-sequence locations in the target quantum circuit that boost the output state fidelity. This process becomes tedious as the size of the quantum circuit increases. To address this challenge, we propose a Graph Neural Network (GNN) framework, which mitigates idling errors through an efficient insertion of DD sequences into quantum circuits by modeling their impact at different idle qubit windows. Our paper targets maximizing the benefit of DD sequences using a limited number of tuning circuits. We propose to classify the idle qubit windows into critical and non-critical (benign) windows using a data-driven reliability model. Our results obtained from the IBM Lagos quantum computer show that our proposed GNN models, which determine the locations of DD sequences in the quantum circuits, significantly improve the output state fidelity by a factor of 1.4× on average and up to 2.6× compared to the adaptive DD approach, which searches for the best locations of DD sequences at run-time.

Quantum Neural Network Compression

  • Zhirui Hu
  • Peiyan Dong
  • Zhepeng Wang
  • Youzuo Lin
  • Yanzhi Wang
  • Weiwen Jiang

Model compression, such as pruning and quantization, has been widely applied to optimize neural networks on resource-limited classical devices. Recently, there has been growing interest in variational quantum circuits (VQC), that is, a type of neural network on quantum computers (a.k.a., quantum neural networks). It is well known that near-term quantum devices have high noise and limited resources (i.e., quantum bits, qubits); yet, how to compress quantum neural networks has not been thoroughly studied. One might think it is straightforward to apply classical compression techniques to quantum scenarios. However, this paper reveals that there exist differences between the compression of quantum and classical neural networks. Based on our observations, we claim that compilation/transpilation has to be involved in the compression process. On top of this, we propose the first systematic framework, namely CompVQC, to compress quantum neural networks (QNNs). In CompVQC, the key component is a novel compression algorithm based on the alternating direction method of multipliers (ADMM) approach. Experiments demonstrate the advantage of CompVQC, reducing the circuit depth (by almost over 2.5×) with a negligible accuracy drop (<1%), which outperforms other competitors. Moreover, CompVQC can indeed improve the robustness of the QNN on near-term noisy quantum devices.

SESSION: Design for Low Energy, Low Resource, but High Quality

Session details: Design for Low Energy, Low Resource, but High Quality

  • Ravikumar Chakaravarthy
  • Cong “Callie” Hao

Squeezing Accumulators in Binary Neural Networks for Extremely Resource-Constrained Applications

  • Azat Azamat
  • Jaewoo Park
  • Jongeun Lee

The cost and power consumption of BNN (Binarized Neural Network) hardware is dominated by additions. In particular, accumulators account for a large fraction of the hardware overhead, which could be effectively reduced by using reduced-width accumulators. However, it is not straightforward to find the optimal accumulator width due to the complex interplay between width, scale, and the effect of training. In this paper, we present algorithmic and hardware-level methods to find the optimal accumulator size for BNN hardware with minimal impact on the quality of results. First, we present partial sum scaling, a top-down approach to minimize the BNN accumulator size based on advanced quantization techniques. We also present an efficient, zero-overhead hardware design for partial sum scaling. Second, we evaluate a bottom-up approach that uses a saturating accumulator, which is more robust against overflows. Our experimental results using the CIFAR-10 dataset demonstrate that our partial sum scaling, along with our optimized accumulator architecture, can reduce the area and power consumption of the datapath by 15.50% and 27.03%, respectively, with little impact on inference performance (less than 2%), compared to using a 16-bit accumulator.
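
The bottom-up option can be modeled behaviorally: accumulate the +/-1 partial products into a narrow signed accumulator that clamps instead of wrapping. The sketch below is such a model with an arbitrary product distribution, not the paper's hardware design or its partial sum scaling.

    import numpy as np

    def saturating_accumulate(products, width):
        # Accumulate +/-1 partial products into a signed accumulator of the
        # given bit width, clamping on overflow instead of wrapping around.
        hi, lo = 2 ** (width - 1) - 1, -2 ** (width - 1)
        acc = 0
        for p in products:
            acc = min(hi, max(lo, acc + int(p)))
        return acc

    rng = np.random.default_rng(0)
    xnor_products = rng.choice([-1, 1], size=1024, p=[0.45, 0.55])
    print(int(np.sum(xnor_products)))                        # full-precision reference
    print(saturating_accumulate(xnor_products, width=8))     # usually matches
    print(saturating_accumulate(xnor_products, width=6))     # clamps near the 6-bit limit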

WSQ-AdderNet: Efficient Weight Standardization Based Quantized AdderNet FPGA Accelerator Design with High-Density INT8 DSP-LUT Co-Packing Optimization

  • Yunxiang Zhang
  • Biao Sun
  • Weixiong Jiang
  • Yajun Ha
  • Miao Hu
  • Wenfeng Zhao

Convolutional neural networks (CNNs) have been widely adopted for various machine intelligence tasks. Nevertheless, CNNs are still known to be computationally demanding due to the convolutional kernels involving expensive Multiply-ACcumulate (MAC) operations. Recent proposals on hardware-optimal neural network architectures suggest that AdderNet, with a lightweight 1-norm based feature extraction kernel, can be an efficient alternative to the CNN counterpart, where the expensive MAC operations are substituted with efficient Sum-of-Absolute-Difference (SAD) operations. Nevertheless, AdderNet lacks an efficient hardware implementation methodology compared to the existing methodologies for CNNs, including efficient quantization, full-integer accelerator implementation, and judicious resource utilization of the DSP slices of FPGA devices. In this paper, we present WSQ-AdderNet, a generic framework to quantize and optimize AdderNet-based accelerator designs on embedded FPGA devices. First, we propose a weight standardization technique to facilitate weight quantization in AdderNet. Second, we demonstrate a full-integer quantization hardware implementation strategy, including weight and activation quantization methodologies. Third, we apply DSP packing optimization to maximize the DSP utilization efficiency, where Octo-INT8 can be achieved via DSP-LUT co-packing. Finally, we implement the design using Xilinx Vitis HLS (high-level synthesis) and Vivado, targeting the Xilinx Kria KV-260 FPGA. Our experimental results on ResNet-20 using WSQ-AdderNet demonstrate that the implementations achieve 89.9% inference accuracy with INT8 implementation, which shows little performance loss compared to the FP32 and INT8 CNN designs. At the hardware level, WSQ-AdderNet achieves up to 3.39× DSP density improvement with nearly the same throughput as the INT8 CNN design. The reduction in DSP utilization makes it possible to deploy large network models on resource-constrained devices. When further scaling up the PE sizes by 39.8%, WSQ-AdderNet can achieve 1.48× throughput improvement while still achieving 2.42× DSP density improvement.
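
The SAD kernel that replaces MAC in AdderNet can be written directly; the single-channel, stride-1 numpy sketch below is a functional reference only, not the quantized FPGA datapath described in the paper.

    import numpy as np

    def sad_feature_map(image, kernel):
        # AdderNet-style feature extraction: replace multiply-accumulate with
        # the negative sum of absolute differences between each input window
        # and the kernel.  Single channel, stride 1, no padding.
        kh, kw = kernel.shape
        H, W = image.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                window = image[i:i + kh, j:j + kw]
                out[i, j] = -np.abs(window - kernel).sum()
        return out

    rng = np.random.default_rng(0)
    img, k = rng.random((8, 8)), rng.random((3, 3))
    print(sad_feature_map(img, k).shape)        # (6, 6)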

Low-Cost 7T-SRAM Compute-in-Memory Design Based on Bit-Line Charge-Sharing Based Analog-to-Digital Conversion

  • Kyeongho Lee
  • Joonhyung Kim
  • Jongsun Park

Although compute-in-memory (CIM) is considered one of the promising solutions to overcome the memory wall problem, the variations in analog voltage computation and the analog-to-digital converter (ADC) cost remain design challenges. In this paper, we present a 7T SRAM CIM that seamlessly supports multiply-accumulate (MAC) operations between 4-bit inputs and 8-bit weights. In the proposed CIM, highly parallel and robust MAC operations are enabled by exploiting the bit-line charge-sharing scheme to simultaneously process multiple inputs. For the readout of analog MAC values, instead of adopting the conventional ADC structure, the bit-line charge-sharing is efficiently used to reduce the implementation cost of the reference voltage generation. Based on the in-SRAM reference voltage generation and the parallel analog readout in all columns, the proposed CIM efficiently reduces ADC power and area cost. In addition, variation models from Monte-Carlo simulations are used during training to reduce the accuracy drop due to process variations. The implementation of a 256×64 7T SRAM CIM using a 28nm CMOS process shows that it operates in the wide voltage range from 0.6V to 1.2V with an energy efficiency of 45.8 TOPS/W at 0.6V.

SESSION: Microarchitectural Attacks and Countermeasures

Session details: Microarchitectural Attacks and Countermeasures

  • Rajesh JS
  • Amin Rezaei

Speculative Load Forwarding Attack on Modern Processors

  • Hasini Witharana
  • Prabhat Mishra

Modern processors deliver high performance by utilizing advanced features such as out-of-order execution, branch prediction, speculative execution, and sophisticated buffer management. Unfortunately, these techniques have introduced diverse vulnerabilities including Spectre, Meltdown, and microarchitectural data sampling (MDS). Although Spectre and Meltdown can leak data via memory side channels, MDS has been shown to leak data from the CPU internal buffers in Intel architectures. AMD has reported that its processors are not vulnerable to MDS/Meltdown-type attacks. In this paper, we present a Meltdown/MDS-type attack to leak data from the load queue in AMD Zen family architectures. To the best of our knowledge, our approach is the first attempt at developing an attack on AMD architectures that uses speculative load forwarding to leak data through the load queue. Experimental evaluation demonstrates that our proposed attack is successful on multiple machines with AMD processors. We also explore a lightweight mitigation to defend against speculative load forwarding attacks on modern processors.

Fast, Robust and Accurate Detection of Cache-Based Spectre Attack Phases

  • Arash Pashrashid
  • Ali Hajiabadi
  • Trevor E. Carlson

Modern processors achieve high performance and efficiency by employing techniques such as speculative execution and sharing resources such as caches. However, recent attacks like Spectre and Meltdown exploit the speculative execution of modern processors to leak sensitive information from the system. Many mitigation strategies have been proposed to restrict the speculative execution of processors and protect potential side-channels. Currently, these techniques have shown a significant performance overhead. A solution that can detect memory leaks before the attacker has a chance to exploit them would allow the processor to reduce the performance overhead by enabling protections only when the system is at risk.

In this paper, we propose a mechanism to detect speculative execution attacks that use caches as a side-channel. In this detector, we track the phases of a successful attack and raise an alert before the attacker gets a chance to recover sensitive information. We accomplish this by monitoring the microarchitectural changes in the core and caches and detecting the memory locations that are potential memory data leaks. We achieve 100% accuracy and a negligible false positive rate in detecting Spectre attacks and evasive versions of Spectre that state-of-the-art detectors are unable to detect. Our detector has no performance overhead, with negligible power and area overheads.

CASU: Compromise Avoidance via Secure Update for Low-End Embedded Systems

  • Ivan De Oliveira Nunes
  • Sashidhar Jakkamsetti
  • Youngil Kim
  • Gene Tsudik

Guaranteeing runtime integrity of embedded system software is an open problem. Trade-offs between security and other priorities (e.g., cost or performance) are inherent, and resolving them is both challenging and important. The proliferation of runtime attacks that introduce malicious code (e.g., by injection) into embedded devices has prompted a range of mitigation techniques. One popular approach is Remote Attestation (RA), whereby a trusted entity (verifier) checks the current software state of an untrusted remote device (prover). RA yields a timely authenticated snapshot of prover state that verifier uses to decide whether an attack occurred.

Current RA schemes require verifier to explicitly initiate RA, based on some unclear criteria. Thus, in case of prover’s compromise, verifier only learns about it late, upon the next RA instance. While sufficient for compromise detection, some applications would benefit from a more proactive, prevention-based approach. To this end, we construct CASU: Compromise Avoidance via Secure Updates. CASU is an inexpensive hardware/software co-design enforcing: (i) runtime software immutability, thus precluding any illegal software modification, and (ii) authenticated updates as the sole means of modifying software. In CASU, a successful RA instance serves as a proof of successful update, and continuous subsequent software integrity is implicit, due to the runtime immutability guarantee. This obviates the need for RA in between software updates and leads to unobtrusive integrity assurance with guarantees akin to those of prior RA techniques, with better overall performance.

SESSION: Genetic Circuits Meet Ising Machines

Session details: Genetic Circuits Meet Ising Machines

  • Marc Riedel
  • Lei Yang

Technology Mapping of Genetic Circuits: From Optimal to Fast Solutions

  • Tobias Schwarz
  • Christian Hochberger

Synthetic Biology aims to create biological systems from scratch that do not exist in nature. An important method in this context is the engineering of DNA sequences such that cells realize Boolean functions that serve as control mechanisms in biological systems, e.g., in medical or agricultural applications. Libraries of logic gates exist as predefined gene sequences, based on the genetic mechanism of transcriptional regulation. Each individual gate is composed of different biological parts to allow for the differentiation of their output signals. Even gates of the same logic type therefore exhibit different transfer characteristics, i.e., the relation between input and output signals. Thus, simulation of the whole network of genetic gates is needed to determine the performance of a genetic circuit. This makes mapping Boolean functions to these libraries much more complicated than in EDA. Yet, optimal results are desired in the design phase due to high lab implementation costs. In this work, we identify fundamental features of the transfer characteristics of gates based on transcriptional regulation, which is widely used in genetic gate technologies. Based on this, we present novel exact (Branch-and-Bound) and heuristic (Branch-and-Bound, Simulated Annealing) algorithms for the problem of technology mapping of genetic circuits and evaluate them using a prominent gate library. In contrast to state-of-the-art tools, all obtained solutions feature a (near) optimal output performance. Our exact method explores only 6.5% of the design space, and the heuristics as little as 0.2%.

DaS: Implementing Dense Ising Machines Using Sparse Resistive Networks

  • Naomi Sagan
  • Jaijeet Roychowdhury

Ising machines have generated much excitement in recent years due to their promise for solving hard combinatorial optimization problems. However, achieving physical all-to-all connectivity in IC implementations of large, densely-connected Ising machines remains a key challenge. We present a novel approach, DaS, that uses low-rank decomposition to achieve effectively-dense Ising connectivity using only sparsely interconnected hardware. The innovation consists of two components. First, we use the SVD to find a low-rank approximation of the Ising coupling matrix while maintaining very high accuracy. This decomposition requires substantially fewer nonzeros to represent the dense Ising coupling matrix. Second, we develop a method to translate the low-rank decomposition to a hardware implementation that uses only sparse resistive interconnections. We validate DaS on the MU-MIMO detection problem, important in modern telecommunications. Our results indicate that as problem sizes scale, DaS can achieve dense Ising coupling using only 5%-20% of the resistors needed for brute-force dense connections (which would be physically infeasible in ICs). We also outline a crossbar-style physical layout scheme for realizing sparse resistive networks generated by DaS.
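
The first DaS component, the low-rank approximation, can be reproduced with a plain truncated SVD; the sketch below uses a synthetic low-rank coupling matrix and simply counts the nonzero coefficients needed by the dense versus factored forms. It says nothing about the resistive-network translation or the MU-MIMO evaluation.

    import numpy as np

    def low_rank_coupling(J, rank):
        # Truncated SVD: J ~ U_r diag(s_r) V_r^T.  The factored form needs far
        # fewer stored coefficients than the dense all-to-all coupling matrix.
        U, s, Vt = np.linalg.svd(J)
        J_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        dense_nnz = int(np.count_nonzero(J))
        factored_nnz = U[:, :rank].size + rank + Vt[:rank].size
        return J_hat, dense_nnz, factored_nnz

    rng = np.random.default_rng(0)
    B = rng.standard_normal((256, 12))
    J = B @ B.T                      # synthetic dense coupling with true rank 12
    J_hat, dense_nnz, factored_nnz = low_rank_coupling(J, rank=12)
    print(np.allclose(J, J_hat), dense_nnz, factored_nnz)   # True, 65536 vs 6156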

QuBRIM: A CMOS Compatible Resistively-Coupled Ising Machine with Quantized Nodal Interactions

  • Yiqiao Zhang
  • Uday Kumar Reddy Vengalam
  • Anshujit Sharma
  • Michael Huang
  • Zeljko Ignjatovic

Physical Ising machines have been shown to solve combinatorial optimization problems with orders-of-magnitude improvements in speed and energy efficiency over von Neumann systems. However, building such a system is still in its infancy, and a scalable, robust implementation remains challenging. CMOS-compatible electronic Ising machines (e.g., [1]) are promising, as the mature technology helps bring scale, speed, and energy efficiency to the dynamical system. However, subtle issues can arise when using voltage-controlled transistors to act as programmable resistive coupling. In this paper, we propose a version of a resistively-coupled Ising machine using quantized nodal interactions (QuBRIM), which significantly improves the predictability of the coupling resistors. The functionality of QuBRIM is demonstrated by solving the well-known Max-Cut problem using both behavioral and circuit-level simulations in a 45 nm CMOS technology node. We show that the dynamical system naturally seeks local minima in the objective function’s energy landscape and that, by applying spin-fix annealing, the system reaches a global minimum with high probability.

SESSION: Energy Efficient Neural Networks via Approximate Computations

Session details: Energy Efficient Neural Networks via Approximate Computations

  • M. Hasan Najafi
  • Vidya Chabria

Combining Gradients and Probabilities for Heterogeneous Approximation of Neural Networks

  • Elias Trommer
  • Bernd Waschneck
  • Akash Kumar

This work explores the search for heterogeneous approximate multiplier configurations for neural networks that produce high accuracy and low energy consumption. We discuss the validity of additive Gaussian noise added to accurate neural network computations as a surrogate model for behavioral simulation of approximate multipliers. The continuous and differentiable properties of the solution space spanned by the additive Gaussian noise model are used as a heuristic that generates meaningful estimates of layer robustness without the need for combinatorial optimization techniques. Instead, the amount of noise injected into the accurate computations is learned during network training using backpropagation. A probabilistic model of the multiplier error is presented to bridge the gap between the two domains; the model estimates the standard deviation of the approximate multiplier error, connecting solutions in the additive Gaussian noise space to actual hardware instances. Our experiments show that the combination of heterogeneous approximation and neural network retraining reduces the energy consumption for multiplications by 70% to 79% for different ResNet variants on the CIFAR-10 dataset with a Top-1 accuracy loss below one percentage point. For the more complex Tiny ImageNet task, our VGG16 model achieves a 53% reduction in energy consumption with a drop in Top-5 accuracy of 0.5 percentage points. We further demonstrate that our error model can predict the parameters of an approximate multiplier in the context of the commonly used additive Gaussian noise (AGN) model with high accuracy. Our software implementation is available at https://github.com/etrommer/agn-approx.
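
A minimal sketch of the surrogate idea, assuming a stand-in behavioral multiplier model (`approx_mul` below is hypothetical): estimate the standard deviation of the multiplier error empirically, then inject zero-mean Gaussian noise of that spread into the exact products.

```python
import numpy as np

def approx_mul(a, b):
    """Stand-in behavioral model of an approximate multiplier:
    here, simply truncate the low bits of the exact product (illustrative only)."""
    return (a * b) & ~0x7          # drop the 3 LSBs

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=100_000)
b = rng.integers(0, 256, size=100_000)

err = approx_mul(a, b) - a * b     # multiplier error samples
sigma = err.std()                  # standard deviation used by the AGN surrogate

# AGN surrogate: exact product plus zero-mean Gaussian noise of matching spread
exact = a * b
agn_products = exact + rng.normal(0.0, sigma, size=exact.shape)
```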

Tunable Precision Control for Approximate Image Filtering in an In-Memory Architecture with Embedded Neurons

  • Ayushi Dube
  • Ankit Wagle
  • Gian Singh
  • Sarma Vrudhula

This paper presents a novel hardware-software co-design consisting of a Processing-in-Memory (PiM) architecture with embedded neural processing elements (NPEs) that are highly reconfigurable. The PiM platform and the proposed approximation strategies are employed for various image filtering applications while providing the user with fine-grain dynamic control over energy efficiency, precision, and throughput (EPT). The proposed co-design can vary the Peak Signal-to-Noise Ratio (PSNR, the output quality metric for image filtering) from 25 dB to 50 dB (an acceptable PSNR range for such applications) without incurring any extra cost in energy or latency. When switching from the accurate to the approximate mode of computation, the maximum improvement in energy efficiency and throughput is 2X. Against a MAC-based PE array on the proposed memory platform, the gains in energy efficiency are 3X-6X, with corresponding throughput improvements of 2.26X-4.52X.
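
For reference, PSNR, the quality metric quoted above, can be computed as follows (8-bit images assumed):

```python
import numpy as np

def psnr(reference, approx, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference image and its
    approximate counterpart (both as arrays of 8-bit pixel values)."""
    mse = np.mean((reference.astype(np.float64) - approx.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")        # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```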

AppGNN: Approximation-Aware Functional Reverse Engineering Using Graph Neural Networks

  • Tim Bücher
  • Lilas Alrahis
  • Guilherme Paim
  • Sergio Bampi
  • Ozgur Sinanoglu
  • Hussam Amrouch

The globalization of the Integrated Circuit (IC) market is attracting an ever-growing number of partners while remarkably lengthening the supply chain. As a result, security concerns, such as those posed by functional Reverse Engineering (RE), have become paramount. RE leads to the disclosure of confidential information to competitors, potentially enabling the theft of intellectual property. Traditional functional RE methods analyze a given gate-level netlist by employing pattern matching to reconstruct the underlying basic blocks and, hence, reverse engineer the circuit's function.

In this work, we are the first to demonstrate that applying Approximate Computing (AxC) principles to circuits significantly improves their resiliency against RE. This is attributed to the increased complexity of the underlying pattern-matching process. The resiliency holds even against Graph Neural Networks (GNNs), presently among the most powerful state-of-the-art techniques in functional RE. Using AxC, we demonstrate a substantial reduction in GNN average classification accuracy, from 98% to a mere 53%. To surmount the challenges introduced by AxC in RE, we propose the AppGNN platform, which enables GNNs (trained only on exact circuits) to: (i) perform accurate classifications, and (ii) reverse engineer the circuit functionality, notwithstanding the applied approximation technique. AppGNN accomplishes this by implementing a novel graph-based node sampling approach that mimics generic approximation methodologies, requiring zero knowledge of the targeted approximation type.

We perform an extensive evaluation targeting wide-ranging adder and multiplier circuits that are approximated using various AxC techniques, including state-of-the-art evolutionary-based approaches. We show that our method improves the classification accuracy from 53% to 81% when classifying approximate adder circuits generated using evolutionary algorithms, even though our method is oblivious to them. Our AppGNN framework is publicly available at https://github.com/ML-CAD/AppGNN.

Seprox: Sequence-Based Approximations for Compressing Ultra-Low Precision Deep Neural Networks

  • Aradhana Mohan Parvathy
  • Sarada Krithivasan
  • Sanchari Sen
  • Anand Raghunathan

Compression techniques such as quantization and pruning are indispensable for deploying state-of-the-art Deep Neural Networks (DNNs) on resource-constrained edge devices. Quantization is widely used in practice: many commercial platforms already support 8-bit precision, with recent trends towards ultra-low precision (4 bits and below). Pruning, which increases network sparsity (the incidence of zero-valued weights), enables compression by storing only the nonzero weights and their indices. Unfortunately, the compression benefits of pruning deteriorate or even vanish in ultra-low precision DNNs. This is due to (i) the unfavorable tradeoff between the number of bits needed to store a weight (which reduces with lower precision) and the number of bits needed to encode an index (which remains unchanged), and (ii) the lower sparsity levels that are achievable at lower precisions.

We propose Seprox, a new compression scheme that overcomes the aforementioned challenges by exploiting two key observations about ultra-low precision DNNs. First, with lower precision, fewer distinct weight values are possible, leading to an increased incidence of frequently-occurring weights and weight sequences. Second, some weight values occur rarely and can be eliminated by replacing them with similar values. Leveraging these insights, Seprox encodes frequently-occurring weight sequences (as opposed to individual weights) while using the eliminated weight values to encode them, thereby avoiding indexing overheads and achieving higher compression. Additionally, Seprox uses approximation techniques to increase the frequencies of the encoded sequences. Across six ultra-low precision DNNs trained on the CIFAR-10 and ImageNet datasets, Seprox achieves model compression, energy improvements, and speed-ups of up to 35.2%, 14.8%, and 18.2%, respectively.
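
The first observation, that ultra-low precision concentrates weights on a few frequent values and sequences, can be checked with a short script; the sequence length and weight distribution below are illustrative, not Seprox's actual encoding.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
# toy 4-bit weights with a bell-shaped distribution, as in a trained DNN layer
weights = np.clip(np.round(rng.normal(0, 2, size=10_000)), -8, 7).astype(int)

# frequency of individual values and of length-2 sequences
value_freq = Counter(weights.tolist())
pair_freq = Counter(zip(weights[:-1].tolist(), weights[1:].tolist()))

rare = [v for v, c in value_freq.items() if c < 0.01 * len(weights)]
print("rare values that could be remapped:", rare)
print("most common pairs:", pair_freq.most_common(5))
```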

SESSION: Algorithms and Tools for Security Analysis and Secure Hardware Design

Session details: Algorithms and Tools for Security Analysis and Secure Hardware Design

  • Rosario Cammarota
  • Satwik Patnaik

Evaluating the Security of eFPGA-Based Redaction Algorithms

  • Amin Rezaei
  • Raheel Afsharmazayejani
  • Jordan Maynard

Hardware IP owners must envision procedures to avoid piracy and overproduction of their designs under a fabless paradigm. A recently proposed technique to obfuscate critical components in a logic design is eFPGA-based redaction, which replaces a sensitive sub-circuit with an embedded FPGA that is configured to perform the same functionality as the missing sub-circuit. In this case, the configuration bitstream acts as a hidden key known only to the hardware IP owner. In this paper, we first evaluate the security promise of existing eFPGA-based redaction algorithms as a preliminary study. Then, we break eFPGA-based redaction schemes with an initial but not necessarily efficient attack named DIP Exclusion, which excludes problematic input patterns from checking in a brute-force manner. Finally, by combining cycle breaking and unrolling, we propose a novel and powerful attack called Break & Unroll that is able to recover the bitstream of state-of-the-art eFPGA-based redaction schemes in a relatively short time, even in the presence of hard cycles and large key sizes. This study reveals that the common perception that eFPGA-based redaction is secure by default against oracle-guided attacks is unfounded. It also shows that additional research on how to systematically create an exponential number of non-combinational hard cycles is required to secure eFPGA-based redaction schemes.

An Approach to Unlocking Cyclic Logic Locking: LOOPLock 2.0

  • Pei-Pei Chen
  • Xiang-Min Yang
  • Yi-Ting Li
  • Yung-Chih Chen
  • Chun-Yao Wang

Cyclic logic locking is a new class of SAT-resistant techniques in hardware security. Recently, LOOPLock 2.0 was proposed, a cyclic logic locking method that deliberately creates cycles in the locked circuit to resist SAT Attack, CycSAT, BeSAT, and Removal Attack simultaneously. The key idea of LOOPLock 2.0 is that the resultant circuit remains cyclic regardless of whether the key vector is correct. This property thwarts attackers and has demonstrated its success in defending against them. In this paper, we propose an unlocking approach to LOOPLock 2.0 based on structural analysis and SAT solvers. Specifically, we identify and remove non-combinational cycles in the locked circuit before running SAT solvers. The experimental results show that the proposed unlocking approach is promising.
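
A minimal sketch of the structural step, assuming the locked netlist is modeled as a plain directed graph (an assumption for illustration, not the authors' data structure):

```python
import networkx as nx

def find_cycles(netlist_edges):
    """Return the simple cycles of a gate-level netlist modeled as a
    directed graph (gate -> gate edges). Cycles that do not correspond
    to valid combinational feedback are candidates for removal before
    invoking the SAT solver."""
    g = nx.DiGraph(netlist_edges)
    return list(nx.simple_cycles(g))

# toy locked netlist with one deliberate cycle g1 -> g2 -> g3 -> g1
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g1"), ("g3", "out")]
print(find_cycles(edges))   # -> [['g1', 'g2', 'g3']] (order may vary)
```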

Garbled EDA: Privacy Preserving Electronic Design Automation

  • Mohammad Hashemi
  • Steffi Roy
  • Fatemeh Ganji
  • Domenic Forte

The complexity of modern integrated circuits (ICs) necessitates collaboration between multiple distrusting parties, including third-party intellectual property (3PIP) vendors, design houses, CAD/EDA tool vendors, and foundries, which jeopardizes the confidentiality and integrity of each party's IP. IP protection standards and the existing techniques proposed by researchers are ad hoc and vulnerable to numerous structural, functional, and/or side-channel attacks. Our framework, Garbled EDA, proposes an alternative direction by formulating the problem in a secure multi-party computation setting, where the privacy of IPs, CAD tools, and process design kits (PDKs) is maintained. As a proof of concept, Garbled EDA is evaluated in the context of simulation, where multiple IP description formats (Verilog, C, S) are supported. Our results demonstrate a reasonable logical-resource cost and negligible memory overhead. To further reduce the overhead, we present another efficient implementation methodology, feasible when resource utilization is a bottleneck but communication between the two parties is not restricted. Interestingly, this implementation is private and secure even in the presence of malicious adversaries attempting to, e.g., gain access to the PDKs or in-house IPs of the CAD tool providers.

Don’t CWEAT It: Toward CWE Analysis Techniques in Early Stages of Hardware Design

  • Baleegh Ahmad
  • Wei-Kai Liu
  • Luca Collini
  • Hammond Pearce
  • Jason M. Fung
  • Jonathan Valamehr
  • Mohammad Bidmeshki
  • Piotr Sapiecha
  • Steve Brown
  • Krishnendu Chakrabarty
  • Ramesh Karri
  • Benjamin Tan

To help prevent hardware security vulnerabilities from propagating to later design stages, where fixes are costly, it is crucial to identify security concerns as early as possible, such as in RTL designs. In this work, we investigate the practical implications and feasibility of producing a set of security-specific scanners that operate on Verilog source files. The scanners indicate parts of code that might contain one of a set of MITRE's Common Weakness Enumerations (CWEs). We explore the CWE database to characterize the scope and attributes of the CWEs and identify those that are amenable to static analysis. We prototype scanners and evaluate them on 11 open-source designs – 4 system-on-chips (SoCs) and 7 processor cores – and explore the nature of the identified weaknesses. Our analysis reported 53 potential weaknesses in the OpenPiton SoC used in Hack@DAC-21, 11 of which we confirmed as security concerns.
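
A scanner of this kind can be as simple as a regular-expression pass over the Verilog source; the heuristic below, which flags `case` blocks lacking a `default` branch, is illustrative and not one of the paper's scanners.

```python
import re

def scan_missing_default(verilog_src):
    """Flag `case` blocks that have no `default:` branch -- a simple,
    illustrative heuristic for one class of RTL weaknesses."""
    findings = []
    for m in re.finditer(r"\bcase[xz]?\s*\(.*?\).*?\bendcase\b",
                         verilog_src, flags=re.S):
        block = m.group(0)
        if not re.search(r"\bdefault\s*:", block):
            line = verilog_src[: m.start()].count("\n") + 1
            findings.append((line, "case without default branch"))
    return findings

src = """
always @(*) begin
  case (state)
    2'b00: grant = 1'b0;
    2'b01: grant = 1'b1;
  endcase
end
"""
print(scan_missing_default(src))   # -> [(3, 'case without default branch')]
```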

SESSION: Special Session: Making ML Reliable: From Devices to Systems to Software

Session details: Special Session: Making ML Reliable: From Devices to Systems to Software

  • Krishnendu Chakrabarty
  • Partha Pande

Reliable Computing of ReRAM Based Compute-in-Memory Circuits for AI Edge Devices

  • Meng-Fan Chang
  • Je-Ming Hung
  • Ping-Cheng Chen
  • Tai-Hao Wen

Compute-in-memory macros based on non-volatile memory (nvCIM) are a promising approach to breaking through the memory bottleneck for artificial intelligence (AI) edge devices; however, their development involves unavoidable tradeoffs between reliability, energy efficiency, computing latency, and readout accuracy. This paper outlines the background of ReRAM-based nvCIM as well as the major challenges in its further development, including process variation in ReRAM devices and transistors and the small signal margins associated with variation in input-weight patterns. This paper also investigates the error model of an nvCIM macro and the corresponding degradation of inference accuracy as a function of that error model. Finally, we summarize recent trends and advances in the development of reliable ReRAM-based nvCIM macros.
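
Such accuracy degradation is typically studied by injecting a behavioral error model into the macro's multiply-accumulate outputs; a minimal sketch with a purely illustrative Gaussian readout-error model:

```python
import numpy as np

def nvcim_mac(inputs, weights, readout_sigma=0.0, rng=None):
    """Behavioral model of an nvCIM multiply-accumulate: the exact dot
    product plus additive readout error whose spread stands in for device
    variation and small signal margins (illustrative, not a fitted model)."""
    rng = rng or np.random.default_rng()
    exact = inputs @ weights
    return exact + rng.normal(0.0, readout_sigma, size=np.shape(exact))

# sweep the error level and watch a toy accuracy proxy degrade
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(1000, 64))          # binary inputs
w = rng.integers(-1, 2, size=(64, 10))           # ternary weights
clean = np.argmax(x @ w, axis=1)
for sigma in (0.0, 1.0, 4.0, 16.0):
    noisy = np.argmax(nvcim_mac(x, w, sigma, rng), axis=1)
    print(sigma, (noisy == clean).mean())
```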

Fault-Tolerant Deep Learning Using Regularization

  • Biresh Kumar Joardar
  • Aqeeb Iqbal Arka
  • Janardhan Rao Doppa
  • Partha Pratim Pande

Resistive random-access memory (ReRAM) has become one of the most popular choices for hardware implementation of machine learning workloads. However, these devices exhibit non-ideal behavior, which presents a challenge to widespread adoption. Training/inferencing on these faulty devices can lead to poor prediction accuracy, and existing fault-tolerance methods are associated with high implementation overheads. In this paper, we present new directions for solving reliability issues using software solutions. These software-based methods are inherent in deep learning training/inferencing and can also be used to address hardware reliability issues. They prevent accuracy drops during training/inferencing due to unreliable ReRAMs while incurring lower area and power overheads.
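
One common software-only pattern in this spirit is to expose the network to random weight faults during training so that the learned weights tolerate unreliable cells; the sketch below illustrates that general idea and is not the authors' specific method.

```python
import numpy as np

def faulty_forward(x, W, fault_rate, rng):
    """One linear layer evaluated with randomly injected stuck-at-zero
    weight faults. Training against such random fault masks acts as a
    dropout-like regularizer, so the learned weights tolerate unreliable
    ReRAM cells at inference time (illustrative sketch only)."""
    mask = rng.random(W.shape) >= fault_rate    # True where the cell is healthy
    return x @ (W * mask)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 128))
W = rng.standard_normal((128, 10))
y_faulty = faulty_forward(x, W, fault_rate=0.05, rng=rng)
```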

Machine Learning for Testing Machine-Learning Hardware: A Virtuous Cycle

  • Arjun Chaudhuri
  • Jonti Talukdar
  • Krishnendu Chakrabarty

The ubiquitous application of deep neural networks (DNNs) has led to a rise in demand for AI accelerators. DNN-specific functional criticality analysis identifies faults that cause measurable and significant deviations from acceptable requirements such as inferencing accuracy. This paper examines the problem of classifying structural faults in the processing elements (PEs) of systolic-array accelerators. We first present a two-tier machine-learning (ML) based method to assess the functional criticality of faults. While supervised learning techniques can accurately estimate fault criticality, they require a considerable amount of ground truth for model training. We therefore describe a neural-twin framework for analyzing fault criticality with a negligible amount of ground-truth data. We further describe a topological and probabilistic framework to estimate the expected number of a PE's primary outputs (POs) that flip in the presence of defects and use the PO-flip count as a surrogate for determining fault criticality. We demonstrate that the combination of PO-flip count and neural-twin-enabled sensitivity analysis of internal nets can be used as additional features in existing ML-based criticality classifiers.

Observation Point Insertion Using Deep Learning

  • Bonita Bhaskaran
  • Sanmitra Banerjee
  • Kaushik Narayanun
  • Shao-Chun Hung
  • Seyed Nima Mozaffari Mojaveri
  • Mengyun Liu
  • Gang Chen
  • Tung-Che Liang

Silent Data Corruption (SDC) is one of the critical problems in the field of testing, where errors or corruption do not manifest externally. As a result, there is increased focus on improving the outgoing quality of dies by striving for better correlation between structural and functional patterns to achieve a low DPPM. This is very important for NVIDIA's chips due to the various markets we target; for example, the automotive and data center markets have stringent in-field testing requirements. One aspect of these efforts is to target better testability while incurring lower test cost. Since structural testing is faster than functional testing, it is important to make structural test patterns as effective as possible and free of test escapes. However, with the rising cell count in today's digital circuits, it is becoming increasingly difficult to sensitize faults and propagate the fault effects to scan flops or primary outputs. Hence, methods to insert observation points that facilitate the detection of hard-to-detect (HtD) faults are being increasingly explored. In this work, we propose an Observation Point Insertion (OPI) scheme using deep learning, with the motivation of achieving (1) better-quality test points than commercial EDA tools, leading to a potentially lower pattern count, and (2) faster turnaround time to generate the test points. In order to achieve better pattern compaction than commercial EDA tools, we employ Graph Convolutional Networks (GCNs) to learn the topology of logic circuits along with the features that influence their testability. The graph structures are subsequently used to train two GCN-type deep learning models: the first model predicts signal probabilities at different nets, and the second model uses these signal probabilities along with other features to predict the reduction in test-pattern count when OPs are inserted at different locations in the design. The features we consider include structural features such as gate type, gate logic, and reconvergent fanouts, and testability features such as SCOAP. Our simulation results indicate that the proposed machine learning models can predict the probabilistic testability metrics with reasonable accuracy and can identify observation points that reduce pattern count.
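
A single GCN layer of the kind used in such models reduces to normalized neighborhood aggregation followed by a learned linear map; a minimal NumPy sketch with illustrative feature and graph sizes:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer: symmetrically normalized adjacency
    (with self-loops) times node features, times a weight matrix, then ReLU."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# toy netlist graph: 5 gates, 4 structural/testability features per gate
rng = np.random.default_rng(0)
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
X = rng.standard_normal((5, 4))                       # e.g. gate type, SCOAP, ...
W = rng.standard_normal((4, 8))
H = gcn_layer(A, X, W)                                # 5 x 8 node embeddings
```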

SESSION: Autonomous Systems and Machine Learning on Embedded Systems

Session details: Autonomous Systems and Machine Learning on Embedded Systems

  • Ibrahim (Abe) Elfadel
  • Mimi Xie

Romanus: Robust Task Offloading in Modular Multi-Sensor Autonomous Driving Systems

  • Luke Chen
  • Mohanad Odema
  • Mohammad Abdullah Al Faruque

Due to the high performance and safety requirements of self-driving applications, the complexity of modern autonomous driving systems (ADS) has been growing, driving the need for more sophisticated hardware that can add to the energy footprint of the ADS platform. Addressing this, edge computing is poised to encompass self-driving applications, enabling compute-intensive autonomy-related tasks to be offloaded for processing at compute-capable edge servers. Nonetheless, the intricate hardware architecture of ADS platforms, in addition to the stringent robustness demands, introduces complications for task offloading that are unique to autonomous driving. Hence, we present ROMANUS, a methodology for robust and efficient task offloading for modular ADS platforms with multi-sensor processing pipelines. Our methodology entails two phases: (i) the introduction of efficient offloading points along the execution path of the involved deep learning models, and (ii) the implementation of a runtime solution based on deep reinforcement learning to adapt the operating mode according to variations in the perceived road-scene complexity, network connectivity, and server load. Experiments on an object detection use case demonstrate that our approach is 14.99% more energy-efficient than pure local execution while achieving a 77.06% reduction in risky behavior compared to a robustness-agnostic offloading baseline.

ModelMap: A Model-Based Multi-Domain Application Framework for Centralized Automotive Systems

  • Soham Sinha
  • Anam Farrukh
  • Richard West

This paper presents ModelMap, a model-based multi-domain application development framework for DriveOS, our in-house centralized vehicle management software system. DriveOS runs on multicore x86 machines and uses hardware virtualization to host isolated RTOS and Linux guest OS sandboxes. In this work, we design Simulink interfaces for model-based vehicle control function development across multiple sandboxed domains in DriveOS. ModelMap provides abstractions to: (1) automatically generate periodic tasks bound to threads in different OS domains, (2) establish cross-domain synchronous and asynchronous communication interfaces, and (3) handle USB-based CAN I/O in Simulink. We introduce the concept of a nested binary for the deployment of ELF binary executable code in different sandboxed domains. We demonstrate ModelMap using a combination of synthetic benchmarks and experiments with Simulink models of a CAN Gateway and an HVAC service running on an electric car. ModelMap eases the development of applications, which are shown to achieve industry-target performance using a multicore hardware platform in DriveOS.

INDENT: Incremental Online Decision Tree Training for Domain-Specific Systems-on-Chip

  • Anish Krishnakumar
  • Radu Marculescu
  • Umit Ogras

The performance and energy efficiency potential of heterogeneous architectures has fueled domain-specific systems-on-chip (DSSoCs) that integrate general-purpose and domain-specialized hardware accelerators. Decision trees (DTs) perform high-quality, low-latency task scheduling to effectively utilize the massive parallelism and heterogeneity in DSSoCs. However, offline-trained DT scheduling policies can quickly become ineffective when applications or hardware configurations change. There is a critical need for runtime techniques to train DTs incrementally without sacrificing accuracy, since current training approaches have large memory and computational power requirements. To address this need, we propose INDENT, an incremental online DT framework to update the scheduling policy and adapt it to unseen scenarios. INDENT updates DT schedulers at runtime using only 1–8% of the original training data, embedded at training time. Thorough evaluations with hardware platforms and DSSoC simulators demonstrate that INDENT performs within 5% of a DT trained from scratch using the entire dataset and outperforms current state-of-the-art approaches.
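
A minimal sketch of the general idea, not the INDENT algorithm itself: refresh a small decision tree from a retained slice of the original data plus newly observed samples (scikit-learn used for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def update_scheduler(retained_X, retained_y, new_X, new_y, max_depth=6):
    """Refresh the decision-tree scheduling policy at runtime by fitting it
    on a small retained slice of the original training data plus newly
    observed (task-features, chosen-accelerator) samples. Illustrative only;
    this is not the INDENT update procedure."""
    X = np.vstack([retained_X, new_X])
    y = np.concatenate([retained_y, new_y])
    return DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
```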

SGIRR: Sparse Graph Index Remapping for ReRAM Crossbar Operation Unit and Power Optimization

  • Cheng-Yuan Wang
  • Yao-Wen Chang
  • Yuan-Hao Chang

Resistive Random Access Memory (ReRAM) crossbars are a promising processing-in-memory technology for reducing the enormous data movement overheads of large-scale graph processing between computation and memory units. ReRAM cells can be combined with crossbar arrays to effectively accelerate graph processing, and partitioning ReRAM crossbar arrays into Operation Units (OUs) can further improve the computation accuracy of ReRAM crossbars. However, previous work did not optimize operation unit utilization, incurring extra cost. This paper proposes a two-stage algorithm with a crossbar OU-aware scheme for sparse graph index remapping for ReRAM (SGIRR) crossbars, mitigating the influence of graph sparsity. In particular, this paper is the first to consider the given operation unit size in the index remapping algorithm, optimizing operation unit utilization and power dissipation. Experimental results show that our proposed algorithm reduces the utilization of crossbar OUs by 31.4%, improves the total OU block usage by 10.6%, and saves energy consumption by 17.2%, on average.