Yibo Lin, Author at SIGDA

Heba Abunahla

3 January 2023

Yibo Lin

No comments

Categories: Who's Who

January 2023

Heba Abunahla

Assistant Professor

Quantum and Computer Engineering department, TU Delft, Netherlands

Email:
Heba.nadhmi@gmail.com

Research interests

• Emerging RRAM devices
• Smart sensors
• Hardware security
• Graphene-based electronics
• CNTs-based electronics
• Neuromorphic computing

Short bio

Heba Abunahla is currently Assistant Professor at the Quantum and Computer Engineering department, Delft University of Technology. Abunahla received the BSc, MSc and PhD degrees (with honors) from United Arab Emirates University, University of Sharjah and Khalifa University, respectively, via competitive scholarship. Prior to joining TU Delft as an Assistant Professor, Abunahla spent over five years as Postdoctoral Fellow and Research Scientist working extensively on the design, fabrication and characterization of emerging memory devices with great emphasis on computing, sensing and security applications.

Abunahla owns two patents, has published one book, and has co-authored over 30 conference and journal papers. In 2017. Abunahla had a collaborative project with University of Adelaide, Australia, on developing novel non-enzymatic glucose sensor. According to her achievements, she received Australian Global Talent Permanent Residency in 2021. Moreover, Abunahla’s innovation of deploying emerging RRAM devices in neuromorphic computing has been selected by Nature-Scientific Reports to be among Top 100 in materials Science. Also, her recent achievement in fabricating RRAM-based tunable filters was selected to be published in the first issue of Innovation@UAE Magazine launched by Ministry of Education.

Abunahla has several awards and competitive scholarships. E.g., she is the recipient of Unique Fellowship for Top Female Academic Scientists – Electrical Engineering, Mathematics & Computer Science (2022) from Delft University of Technology. Abunahla serves as a lead Editor in Frontiers in Neuroscience. She is an active reviewer for several high impact journals and conferences.

Research highlights

Secure intelligent memory and sensors are crucial components in our daily electronic devices and systems. CMOS has been the core technology to provide such requirements for decades. However, the limitations associated to power and area have led to the need for an alternative technology, called Resistive-RAM (RRAM). RRAM devices are able to perform memory and computation in the same cell, which enables in-memory computation feature. Moreover, RRAM can be deployed as a smart sensor due to its ability to change its I-V characteristic against the surrounding environment. Inherit stochasticity in RRAM junctions is also a great asset for security applications.

Abunahla has built a strong expertise in the field of micro-electronics design, modeling, fabrication, and characterization of high-performance and high-density memory devices. Abunahla developed novel RRAM devices that have been uniquely deployed in sensing, computing, security, and communication applications. For instance, Abunahla demonstrated a novel approach to measuring glucose levels for an adult human, and demonstrated the ability to fabricate such biosensor using a simple, low-cost standard photolithography process. In contrast to other sensors, the developed sensor has the ability to accurately measure glucose levels at neutral pH conditions (i.e. pH=7). Abunahla filed a US patent for this device and all the details of the innovation are published by the prestigious Nature Scientific Reports. This work has great commercialization opportunity; being unique and cutting edge in nature, and Abunahla is currently working with her team toward providing lab-on-chip sensing approach based on this technology.

Furthermore, Abunahla has recently innovated flexible memory devices, namely NeuroMem, that can mimic the memorization behavior of the brain. This unique feature makes NeuroMem a potential candidate for emerging in-memory-computing applications. This work is the first to report on the great potential of this technology for Artificial Intelligence (AI) inference for edge devices. Abunahla filed a US patent for this innovation and published the work in the prestigious Nature Scientific Reports. Further, her innovative research in using nanoscale devices for Gamma-ray sensing using Sol-gel/drop-coated micro-think nanomaterials is very unique and has been filed as US patent and published by the prestigious Journal of Materials Chemistry and Physics. Moreover, Abunahla has fabricated novel RRAM-based tunable filters which prove the possibility of tuning RF devices without any localized surface mount device (SMD) element or complex realization technique. In the field of hardware security, Abunahla developed an efficient flexible RRAM-based true random number generation, named SecureMem. The data generated by SecureMem prototype passed all NIST tests without any post-processing or hardware overhead.

Aman Arora

3 January 2023

Yibo Lin

No comments

Categories: Who's Who

December, 2022

Aman Arora

Assistant Professor

School of Computing and Augmented Intelligence, Arizona State University

Email:
aman.kbm@asu.edu

Personal webpage
https://labs.engineering.asu.edu/advent/

Research interests

Reconfigurable computing, Domain-specific acceleration, Hardware for Machine Learning

Short bio

Aman Arora is a Graduate Fellow and Ph.D. Candidate in the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research vision is to minimize the gap between ASICs and FPGAs in terms of performance and efficiency, and to minimize the gap between CPUs/GPUs and FPGAs in terms of programmability. He imagines a future where FPGAs are first-class citizens in the world of computing and first-choice for accelerating new workloads. His PhD dissertation research focuses on the search for efficient reconfigurable fabrics for Deep Learning (DL) by proposing new DL-optimized blocks for FPGAs. His research has resulted in 11 paper publications in top conferences and journals in the field of reconfigurable computing and computer architecture and design. His work received a Best Paper Award at the IEEE FCCM conference in 2022, and he currently holds a fellowship from the UT Austin Graduate School. His research has been funded by the NSF. Aman has served as a secondary reviewer in top conferences like ACM FPGA (in 2021 and 2022). He is also the leader of the AI+FPGA committee at Open-Source FPGA (OSFPGA) Foundation, where he leads research efforts and organizes training webinars. He has 12 years of experience in the semiconductor industry in design, verification, testing and architecture roles. Most recently, he worked in the GPU Deep Learning architecture team at NVIDIA.

Research highlights

Aman’s past and current research focusses on architecting efficient reconfigurable acceleration substrates (or fabrics) for Deep Learning (DL). With Moore’s law slowing down, the requirements of resource-hungry applications like DL growing & changing rapidly, and climate change already knocking at our doors, this research theme has never been more relevant and important.

Aman has proposed changing the architecture of FPGAs to make them better DL accelerators. He proposed replacing a portion of the FPGA’s programmable logic area with new blocks called Tensor Slices, which are specialized for performing matrix operations like matrix-matrix multiplication and matrix-vector multiplication that are common in DL workloads. The FPGA industry has parallelly developed similar blocks like Intel AI Tensor Block and Achronix Machine Learning Processor.

In addition, Aman proposed adding compute capabilities to the on-chip memory blocks on FPGAs, so they can operate on data without having to move the data to compute units on the FPGA. He was the first to exploit the dual port nature of FPGA BRAMs to design these blocks instead of using technologies that significantly impact the circuitry of the RAM array and degrade its performance. He calls these new blocks CoMeFa RAMs. This work won the Best Paper Award at IEEE FCCM 2022.

Aman also led a team effort spanning three universities – UT Austin, University of Toronto, and University of New Brunswick – to develop an open-source DL benchmark suite called Koios. These benchmarks can be used to perform FPGA architecture and CAD research, and are integrated into VTR, which is the most popular open-source FPGA CAD flow.

Other research projects Aman has worked on or is working on include: (1) developing a parallel reconfigurable spatial acceleration fabric consisting of PIM (Processing-In-Memory) blocks connected using an FPGA-like interconnect, (2) implementing efficient accelerators for Weightless Neural Networks (WNNs) on FPGAs, (3) enabling support for open-source tools in FPGA research tools like COFFE, and (4) using Machine Learning (ML) to perform cross-prediction of power consumption on FPGAs and developing an open-source dataset that can be widely used for such prediction problems.

Aman hopes to start and direct a research lab at a university soon. His future research will straddle the entire stack of computer engineering: programmability at the top, architecture exploration in the middle, and hardware design at the bottom. The research thrusts he plans to focus on are next-gen reconfigurable fabrics, ML and FPGA related tooling, enabling the creation of an FPGA app store, and sustainable acceleration.

2022 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) Table of Content

22 December 2022

Yibo Lin

No comments

Categories: Publications

Full Citation in the ACM Digital Library

SESSION: The Role of Graph Neural Networks in Electronic Design Automation

Session details: The Role of Graph Neural Networks in Electronic Design Automation

Jeyavijayan Rajendran

Why are Graph Neural Networks Effective for EDA Problems?: (Invited Paper)

Haoxing Ren
Siddhartha Nath
Yanqing Zhang
Hao Chen
Mingjie Liu

In this paper, we discuss the source of effectiveness of Graph Neural Networks (GNNs) in EDA, particularly in the VLSI design automation domain. We argue that the effectiveness comes from the fact that GNNs implicitly embed the prior knowledge and inductive biases associated with given VLSI tasks, which is one of the three approaches to make a learning algorithm physics-informed. These inductive biases are different to those common used in GNNs designed for other structured data, such as social networks and citation networks. We will illustrate this principle with several recent GNN examples in the VLSI domain, including predictive tasks such as switching activity prediction, timing prediction, parasitics prediction, layout symmetry prediction, as well as optimization tasks such as gate sizing and macro and cell transistor placement. We will also discuss the challenges of applications of GNN and the opportunity of applying self-supervised learning techniques with GNN for VLSI optimization.

On Advancing Physical Design Using Graph Neural Networks

Yi-Chen Lu
Sung Kyu Lim

As modern Physical Design (PD) algorithms and methodologies evolve into the post-Moore era with the aid of machine learning, Graph Neural Networks (GNNs) are becoming increasingly ubiquitous given that netlists are essentially graphs. Recently, their ability to perform effective graph learning has provided significant insights to understand the underlying dynamics during netlist-to-layout transformations. GNNs follow a message-passing scheme, where the goal is to construct meaningful representations either at the entire graph or node-level by recursively aggregating and transforming the initial features. In the realm of PD, the GNN-learned representations have been leveraged to solve the tasks such as cell clustering, quality-of-result prediction, activity simulation, etc., which often overcome the limitations of traditional PD algorithms. In this work, we first revisit recent advancements that GNNs have made in PD. Second, we discuss how GNNs serve as the backbone of novel PD flows. Finally, we present our thoughts on ongoing and future PD challenges that GNNs can tackle and succeed.

Applying GNNs to Timing Estimation at RTL

Daniela Sánchez Lopera
Wolfgang Ecker

In the Electronic Design Automation (EDA) flow, signoff checks, such as timing analysis, are performed only after physical synthesis. Encountered timing violations cause re-iterations of the design flow. Hence, timing estimations at initial design stages, such as Register Transfer Level (RTL), would increase the quality of the results and lower the flow iterations. Machine learning has been used to estimate the timing behavior of chip components. However, existing solutions map EDA objects to Euclidean data without considering that EDA objects are represented naturally as graphs. Recent advances in Graph Neural Networks (GNNs) motivate the mapping from EDA objects to graphs for design metric prediction tasks at different stages. This paper maps RTL designs to directed, featured graphs with multidimensional node and edge features. These are the input to GNNs for estimating component delays and slews. An in-house hardware generation framework and open-source EDA tools for ASIC synthesis are employed for collecting training data. Experiments over unseen circuits show that GNN-based models are promising for timing estimation, even when the features come from early RTL implementations. Based on estimated delays, critical areas of the design can be detected, and proper RTL micro-architectures can be chosen without running long design iterations.

Embracing Graph Neural Networks for Hardware Security

Lilas Alrahis
Satwik Patnaik
Muhammad Shafique
Ozgur Sinanoglu

Graph neural networks (GNNs) have attracted increasing attention due to their superior performance in deep learning on graph-structured data. GNNs have succeeded across various domains such as social networks, chemistry, and electronic design automation (EDA). Electronic circuits have a long history of being represented as graphs, and to no surprise, GNNs have demonstrated state-of-the-art performance in solving various EDA tasks. More importantly, GNNs are now employed to address several hardware security problems, such as detecting intellectual property (IP) piracy and hardware Trojans (HTs), to name a few.

In this survey, we first provide a comprehensive overview of the usage of GNNs in hardware security and propose the first taxonomy to divide the state-of-the-art GNN-based hardware security systems into four categories: (i) HT detection systems, (ii) IP piracy detection systems, (iii) reverse engineering platforms, and (iv) attacks on logic locking. We summarize the different architectures, graph types, node features, benchmark data sets, and model evaluation of the employed GNNs. Finally, we elaborate on the lessons learned and discuss future directions.

SESSION: Compiler and System-Level Techniques for Efficient Machine Learning

Session details: Compiler and System-Level Techniques for Efficient Machine Learning

Sri Parameswaran
Martin Rapp

Fine-Granular Computation and Data Layout Reorganization for Improving Locality

Mahmut Kandemir
Xulong Tang
Jagadish Kotra
Mustafa Karakoy

While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), one of the important characteristics of prior approaches is that they transform loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarse-grain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third and more importantly, such coarse-grain approaches are local by their nature and difficult to achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves the data locality and outperforms existing schemes – 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.

An MLIR-based Compiler Flow for System-Level Design and Hardware Acceleration

Nicolas Bohm Agostini
Serena Curzel
Vinay Amatya
Cheng Tan
Marco Minutoli
Vito Giovanni Castellana
Joseph Manzano
David Kaeli
Antonino Tumeo

The generation of custom hardware accelerators for applications implemented within high-level productive programming frameworks requires considerable manual effort. To automate this process, we introduce SODA-OPT, a compiler tool that extends the MLIR infrastructure. SODA-OPT automatically searches, outlines, tiles, and pre-optimizes relevant code regions to generate high-quality accelerators through high-level synthesis. SODA-OPT can support any high-level programming framework and domain-specific language that interface with the MLIR infrastructure. By leveraging MLIR, SODA-OPT solves compiler optimization problems with specialized abstractions. Backend synthesis tools connect to SODA-OPT through progressive intermediate representation lowerings. SODA-OPT interfaces to a design space exploration engine to identify the combination of compiler optimization passes and options that provides high-performance generated designs for different backends and targets. We demonstrate the practical applicability of the compilation flow by exploring the automatic generation of accelerators for deep neural networks operators outlined at arbitrary granularity and by combining outlining with tiling on large convolution layers. Experimental results with kernels from the PolyBench benchmark show that our high-level optimizations improve execution delays of synthesized accelerators up to 60x. We also show that for the selected kernels, our solution outperforms the current of state-of-the art in more than 70% of the benchmarks and provides better average speedup in 55% of them. SODA-OPT is an open source project available at https://gitlab.pnnl.gov/sodalite/soda-opt.

Physics-Aware Differentiable Discrete Codesign for Diffractive Optical Neural Networks

Yingjie Li
Ruiyang Chen
Weilu Gao
Cunxi Yu

Diffractive optical neural networks (DONNs) have attracted lots of attention as they bring significant advantages in terms of power efficiency, parallelism, and computational speed compared with conventional deep neural networks (DNNs), which have intrinsic limitations when implemented on digital platforms. However, inversely mapping algorithm-trained physical model parameters onto real-world optical devices with discrete values is a non-trivial task as existing optical devices have non-unified discrete levels and non-monotonic properties. This work proposes a novel device-to-system hardware-software codesign framework, which enables efficient physics-aware training of DONNs w.r.t arbitrary experimental measured optical devices across layers. Specifically, Gumbel-Softmax is employed to enable differentiable discrete mapping from real-world device parameters into the forward function of DONNs, where the physical parameters in DONNs can be trained by simply minimizing the loss function of the ML task. The results have demonstrated that our proposed framework offers significant advantages over conventional quantization-based methods, especially with low-precision optical devices. Finally, the proposed algorithm is fully verified with physical experimental optical systems in low-precision settings.

Big-Little Chiplets for In-Memory Acceleration of DNNs: A Scalable Heterogeneous Architecture

Gokul Krishnan
A. Alper Goksoy
Sumit K. Mandal
Zhenyu Wang
Chaitali Chakrabarti
Jae-sun Seo
Umit Y. Ogras
Yu Cao

Monolithic in-memory computing (IMC) architectures face significant yield and fabrication cost challenges as the complexity of DNNs increases. Chiplet-based IMCs that integrate multiple dies with advanced 2.5D/3D packaging offers a low-cost and scalable solution. They enable heterogeneous architectures where the chiplets and their associated interconnection can be tailored to the non-uniform algorithmic structures to maximize IMC utilization and reduce energy consumption. This paper proposes a heterogeneous IMC architecture with big-little chiplets and a hybrid network-on-package (NoP) to optimize the utilization, interconnect bandwidth, and energy efficiency. For a given DNN, we develop a custom methodology to map the model onto the big-little architecture such that the early layers in the DNN are mapped to the little chiplets with higher NoP bandwidth and the subsequent layers are mapped to the big chiplets with lower NoP bandwidth. Furthermore, we achieve a scalable solution by incorporating a DRAM into each chiplet to support a wide range of DNNs beyond the area limit. Compared to a homogeneous chiplet-based IMC architecture, the proposed big-little architecture achieves up to 329× improvement in the energy-delay-area product (EDAP) and up to 2× higher IMC utilization. Experimental evaluation of the proposed big-little chiplet-based RRAM IMC architecture for ResNet-50 on ImageNet shows 259×, 139×, and 48× improvement in energy-efficiency at lower area compared to Nvidia V100 GPU, Nvidia T4 GPU, and SIMBA architecture, respectively.

SESSION: Addressing Sensor Security through Hardware/Software Co-Design

Session details: Addressing Sensor Security through Hardware/Software Co-Design

Marilyn Wolf

Attacks on Image Sensors

Marilyn Wolf
Kruttidipta Samal

This paper provides a taxonomy of security vulnerabilities of smart image sensor systems. Image sensors form an important class of sensors. Many image sensors include computation units that can provide traditional algorithms such as image or video compression along with machine learning tasks such as classification. Some attacks rely on the physics and optics of imaging. Other attacks take advantage of the complex logic and software required to perform imaging systems.

False Data Injection Attacks on Sensor Systems

Dimitrios Serpanos

False data injection attacks on sensor systems are an emerging threat to cyberphysical systems, creating significant risks to all application domains and, importantly, to critical infrastructures. Cyberphysical systems are process-dependent leading to differing false data injection attacks that target disruption of the specific processes (plants). We present a taxonomy of false data injection attacks, using a general model for cyberphysical systems, showing that global and continuous attacks are extremely powerful. In order to detect false data injection attacks, we describe three methods that can be employed to enable effective monitoring and detection of false data injection attacks during plant operation. Considering that sensor failures have equivalent effects to relative false data injection attacks, the methods are effective for sensor fault detection as well.

Stochastic Mixed-Signal Circuit Design for In-Sensor Privacy

Ningyuan Cao
Jianbo Liu
Boyang Cheng
Muya Chang

The ubiquitous data acquisition and extensive data exchange of sensors pose severe security and privacy concerns for the end-users and the public. To enable real-time protection of raw data, it is demanding to facilitate privacy-preserving algorithms at data generation, or in-sensory privacy. However, due to the severe sensor resource constraints and intensive computation/security cost, it remains an open question of how to enable data protection algorithms with efficient circuit techniques. To answer this question, this paper discusses the potential of a stochastic mixed-signal (SMS) circuit for ultra-low-power, small-foot-print data security. In particular, this paper discusses digitally-controlled-oscillators (DCO) and their advantages in (1) seamless analog interface, (2) stochastic computation efficiency, and (3) unified entropy generation over conventional digital circuit baselines. With DCO as an illustrative case, we target (1) SMS privacy-preserving architecture definition and systematic SMS analysis on its performance gains across various hardware/software configurations, and (2) revisit analog/mixed-signal voltage/transistor scaling in the context of entropy-based data protection.

Sensor Security: Current Progress, Research Challenges, and Future Roadmap (Invited Paper)

Anomadarshi Barua
Mohammad Abdullah Al Faruque

Sensors are one of the most pervasive and integral components of today’s safety-critical systems. Sensors serve as a bridge between physical quantities and connected systems. The connected systems with sensors blindly believe the sensor as there is no way to authenticate the signal coming from a sensor. This could be an entry point for an attacker. An attacker can inject a fake input signal along with the legitimate signal by using a suitable spoofing technique. As the sensor’s transducer is not smart enough to differentiate between a fake and legitimate signal, the injected fake signal eventually can collapse the connected system. This type of attack is known as the transduction attack. Over the last decade, several works have been published to provide a defense against the transduction attack. However, the defenses are proposed on an ad-hoc basis; hence, they are not well-structured. Our work begins to fill this gap by providing a checklist that a defense technique should always follow to be considered as an ideal defense against the transduction attack. We name this checklist as the Golden reference of sensor defense. We provide insights on how this Golden reference can be achieved and argue that sensors should be redesigned from the transducer level to the sensor electronics level. We point out that only hardware or software modification is not enough; instead, a hardware/software (HW/SW) co-design approach is required to ride on this future roadmap to the robust and resilient sensor.

SESSION: Advances in Partitioning and Physical Optimization

Session details: Advances in Partitioning and Physical Optimization

Markus Olbrich
Yu-Guang Chen

SpecPart: A Supervised Spectral Framework for Hypergraph Partitioning Solution Improvement

Ismail Bustany
Andrew B. Kahng
Ioannis Koutis
Bodhisatta Pramanik
Zhiang Wang

State-of-the-art hypergraph partitioners follow the multilevel paradigm that constructs multiple levels of progressively coarser hypergraphs that are used to drive cut refinements on each level of the hierarchy. Multilevel partitioners are subject to two limitations: (i) Hypergraph coarsening processes rely on local neighborhood structure without fully considering the global structure of the hypergraph. (ii) Refinement heuristics can stagnate on local minima. In this paper, we describe SpecPart, the first supervised spectral framework that directly tackles these two limitations. SpecPart solves a generalized eigenvalue problem that captures the balanced partitioning objective and global hypergraph structure in a low-dimensional vertex embedding while leveraging initial high-quality solutions from multilevel partitioners as hints. SpecPart further constructs a family of trees from the vertex embedding and partitions them with a tree-sweeping algorithm. Then, a novel overlay of multiple tree-based partitioning solutions, followed by lifting to a coarsened hypergraph, where an ILP partitioning instance is solved to alleviate local stagnation. We have validated SpecPart on multiple sets of benchmarks. Experimental results show that for some benchmarks, our SpecPart can substantially improve the cutsize by more than 50% with respect to the best published solutions obtained with leading partitioners hMETIS and KaHyPar.

HyperEF: Spectral Hypergraph Coarsening by Effective-Resistance Clustering

Ali Aghdaei
Zhuo Feng

This paper introduces a scalable algorithmic framework (HyperEF) for spectral coarsening (decomposition) of large-scale hypergraphs by exploiting hyperedge effective resistances. Motivated by the latest theoretical framework for low-resistance-diameter decomposition of simple graphs, HyperEF aims at decomposing large hypergraphs into multiple node clusters with only a few inter-cluster hyperedges. The key component in HyperEF is a nearly-linear time algorithm for estimating hyperedge effective resistances, which allows incorporating the latest diffusion-based non-linear quadratic operators defined on hypergraphs. To achieve good runtime scalability, HyperEF searches within the Krylov subspace (or approximate eigensubspace) for identifying the nearly-optimal vectors for approximating the hyperedge effective resistances. In addition, a node weight propagation scheme for multilevel spectral hypergraph decomposition has been introduced for achieving even greater node coarsening ratios. When compared with state-of-the-art hypergraph partitioning (clustering) methods, extensive experiment results on real-world VLSI designs show that HyperEF can more effectively coarsen (decompose) hypergraphs without losing key structural (spectral) properties of the original hypergraphs, while achieving over 70× runtime speedups over hMetis and 20× speedups over HyperSF.

Design and Technology Co-Optimization Utilizing Multi-Bit Flip-Flop Cells

Soomin Kim
Taewhan Kim

The benefit of multi-bit flip-flop (MBFF) as opposed to single-bit flip-flop is sharing in-cell clock inverters among the master and slave latches in the internal flip-flops of MBFF. Theoretically, the more flip-flops an MBFF has, the more power saving it can achieve. However, in practice, physically increasing the size of MBFF to accommodate many flip-flops imposes two new challenging problems in physical design: (1) non-flexible MBFF cell flipping for multiple D-to-Q signals and (2) unbalanced or wasted use of MBFF footprint space. In this work, we solve the two problems in a way to enhance routability and timing at the placement and routing stages. Precisely, for problem 1, we make the non-flexible MBFF cell flipping to be fully flexible by generating MBFF layouts supporting diverse D-to-Q flow directions in the detailed placement to improve routability and for problem 2, we enhance the setup and clock-to-Q delay on timing critical flip-flops in MBFF through gate upsizing (i.e., transistor folding) by using the unused space in MBFF to improve timing slack at the post-routing stage. Through experiments with benchmark circuits, it is shown that our proposed design and technology co-optimization (DTCO) flow using MBFFs that solves problems 1 and 2 is very promising.

Transitive Closure Graph-Based Warpage-Aware Floorplanning for Package Designs

Yang Hsu
Min-Hsuan Chung
Yao-Wen Chang
Ci-Hong Lin

In modern heterogeneous integration technologies, chips with different processes and functionality are integrated into a package with high interconnection density and large I/O counts. Integrating multiple chips into a package may suffer from severe warpage problems caused by the mismatch in coefficients of thermal expansion between different manufacturing materials, leading to deformation and malfunction in the manufactured package. The industry is eager to find a solution for warpage optimization. This paper proposes the first warpage-aware floorplanning algorithm for heterogeneous integration. We first present an efficient qualitative warpage model for a multi-chip package structure based on Suhir’s solution, more suitable for optimization than the time-consuming finite element analysis. Based on the transitive closure graph floorplan representation, we then propose three perturbations for simulated annealing to optimize the warpage more directly and can thus speed up the process. Finally, we develop a force-directed detailed floorplanning algorithm to further refine the solutions by utilizing the dead spaces. Experimental results demonstrate the effectiveness of our warpage model and algorithm.

SESSION: Democratizing Design Automation with Open-Source Tools: Perspectives, Opportunities, and Challenges

Session details: Democratizing Design Automation with Open-Source Tools: Perspectives, Opportunities, and Challenges

Antonino Tumeo

A Mixed Open-Source and Proprietary EDA Commons for Education and Prototyping

Andrew B. Kahng

In recent years, several open-source projects have shown potential to serve a future technology commons for EDA and design prototyping. This paper examines how open-source and proprietary EDA technologies will inevitably take on complementary roles within a future technology commons. Proprietary EDA technologies offer numerous benefits that will endure, including (i) exceptional technology and engineering; (ii) ever-increasing importance in design-based equivalent scaling and the overall semiconductor value chain; and (iii) well-established commercial and partner relationships. On the other hand, proprietary EDA technologies face challenges that will also endure, including (i) inability to pursue directions such as massive leverage of cloud compute, extreme reduction of turnaround times, or “free tools”; and (ii) difficulty in evolving and addressing new applications and markets. By contrast, open-source EDA technologies offer benefits that include (i) the capability to serve as a friction-free, democratized platform for education and future workforce development (i.e., as a platform for EDA research, and as a means of teaching / training both designers and EDA developers with public code); and (ii) addressing the needs of underserved, non-enterprise account markets (e.g., older nodes, research flows, cost-sensitive IoT, new devices and integrations, system-design-technology pathfinding). This said, open-source will always face challenges such as sustainability, governance, and how to achieve critical mass and critical quality. The paper will conclude with key directions and synergies for open-source and proprietary EDA within an EDA Commons for education and prototyping.

SODA Synthesizer: An Open-Source, Multi-Level, Modular, Extensible Compiler from High-Level Frameworks to Silicon

Nicolas Bohm Agostini
Ankur Limaye
Marco Minutoli
Vito Giovanni Castellana
Joseph Manzano
Antonino Tumeo
Serena Curzel
Fabrizio Ferrandi

The SODA Synthesizer is an open-source, modular, end-to-end hardware compiler framework. The SODA frontend, developed in MLIR, performs system-level design, code partitioning, and high-level optimizations to prepare the specifications for the hardware synthesis. The backend is based on a state-of-the-art high-level synthesis tool and generates the final hardware design. The backend can interface with logic synthesis tools for field programmable gate arrays or with commercial and open-source logic synthesis tools for application-specific integrated circuits. We discuss the opportunities and challenges in integrating with commercial and open-source tools both at the frontend and backend, and highlight the role that an end-to-end compiler framework like SODA can play in an open-source hardware design ecosystem.

A Scalable Methodology for Agile Chip Development with Open-Source Hardware Components

Maico Cassel dos Santos
Tianyu Jia
Martin Cochet
Karthik Swaminathan
Joseph Zuckerman
Paolo Mantovani
Davide Giri
Jeff Jun Zhang
Erik Jens Loscalzo
Gabriele Tombesi
Kevin Tien
Nandhini Chandramoorthy
John-David Wellman
David Brooks
Gu-Yeon Wei
Kenneth Shepard
Luca P. Carloni
Pradip Bose

We present a scalable methodology for the agile physical design of tile-based heterogeneous system-on-chip (SoC) architectures that simplifies the reuse and integration of open-source hardware components. The methodology leverages the regularity of the on-chip communication infrastructure, which is based on a multi-plane network-on-chip (NoC), and the modularity of socket interfaces, which connect the tiles to the NoC. Each socket also provides its tile with a set of platform services, including independent clocking and voltage control. As a result, the physical design of each tile can be decoupled from its location in the top-level floorplan of the SoC and the overall SoC design can benefit from a hierarchical timing-closure flow, design reuse and, if necessary, fast respin. With the proposed methodology we completed two SoC tapeouts of increasing complexity, which illustrate its capabilities and the resulting gains in terms of design productivity.

SESSION: Accelerators on A New Horizon

Session details: Accelerators on A New Horizon

Vaibhav Verma
Georgios Zervakis

GraphRC: Accelerating Graph Processing on Dual-Addressing Memory with Vertex Merging

Wei Cheng
Chun-Feng Wu
Yuan-Hao Chang
Ing-Chao Lin

Architectural innovation in graph accelerators attracts research attention due to foreseeable inflation in data sizes and the irregular memory access pattern of graph algorithms. Conventional graph accelerators ignore the potential of Non-Volatile Memory (NVM) crossbar as a dual-addressing memory and treat it as a traditional single-addressing memory with higher density and better energy efficiency. In this work, we present GraphRC, a graph accelerator that leverages the power of dual-addressing memory by mapping in-edge/out-edge requests to column/row-oriented memory accesses. Although the capability of dual-addressing memory greatly improves the performance of graph processing, some memory accesses still suffer from low-utilization issues. Therefore, we propose a vertex merging (VM) method that improves cache block utilization rate by merging memory requests from consecutive vertices. VM reduces the execution time of all 6 graph algorithms on all 4 datasets by 24.24% on average. We then identify the data dependency inherent in a graph limits the usage of VM, and its effectiveness is bounded by the percentage of mergeable vertices. To overcome this limitation, we propose an aggressive vertex merging (AVM) method that outperforms VM by ignoring the data dependency inherent in a graph. AVM significantly reduces the execution time of ranking-based algorithms on all 4 datasets while preserving the correct ranking of the top 20 vertices.

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

Matheus Cavalcante
Domenic Wüthrich
Matteo Perotti
Samuel Riedel
Luca Benini

While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PE should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve the Instruction Level Parallelism (ILP), which increases their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz’ performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. Those results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.

Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems with FPGA Accelerators

Edward Richter
Deming Chen

While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. However, studying SVM implementations is difficult, as there is no open and flexible system to explore trade-offs between different SVM implementations and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system which enables thorough study of SVM in heterogeneous computing environments for discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis on the overheads of three SVM architectures, and generate several insights which highlight the cost and benefits of each architecture. From these insights, we propose a flowchart of how to choose the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.

ReSiPI: A Reconfigurable Silicon-Photonic 2.5D Chiplet Network with PCMs for Energy-Efficient Interposer Communication

Ebadollah Taheri
Sudeep Pasricha
Mahdi Nikdast

2.5D chiplet systems have been proposed to improve the low manufacturing yield of large-scale chips. However, connecting the chiplets through an electronic interposer imposes a high traffic load on the interposer network. Silicon photonics technology has shown great promise towards handling a high volume of traffic with low latency in intra-chip network-on-chip (NoC) fabrics. Although recent advances in silicon photonic devices have extended photonic NoCs to enable high bandwidth communication in 2.5D chiplet systems, such interposer-based photonic networks still suffer from high power consumption. In this work, we design and analyze a novel Reconfigurable power-efficient and congestion-aware Silicon-Photonic 2.5D Interposer network, called ReSiPI. Considering runtime traffic, ReSiPI is able to dynamically deploy inter-chiplet photonic gateways to improve the overall network congestion. ReSiPI also employs switching elements based on phase change materials (PCMs) to dynamically reconfigure and power-gate the photonic interposer network, thereby improving the network power efficiency. Compared to the best prior state-of-the-art 2.5D photonic network, ReSiPI demonstrates, on average, 37% lower latency, 25% power reduction, and 53% energy minimization in the network.

SESSION: CAD for Confidentiality of Hardware IPS

Session details: CAD for Confidentiality of Hardware IPS

Swarup Bhunia

Hardware IP Protection against Confidentiality Attacks and Evolving Role of CAD Tool

Swarup Bhunia
Amitabh Das
Saverio Fazzari
Vivian Kammler
David Kehlet
Jeyavijayan Rajendran
Ankur Srivastava

With growing use of hardware intellectual property (IP) based integrated circuits (IC) design and increasing reliance on a globalized supply chain, the threats to confidentiality of hardware IPs have emerged as major security concerns to the IP producers and owners. These threats are diverse, including reverse engineering (RE), piracy, cloning, and extraction of design secrets, and span different phases of electronics life cycle. The academic research community and the semiconductor industry have made significant efforts over the past decade on developing effective methodologies and CAD tools targeted to protect hardware IPs against these threats. These solutions include watermarking, logic locking, obfuscation, camouflaging, split manufacturing, and hardware redaction. This paper focuses on key topics on confidentiality of hardware IPs encompassing the major threats, protection approaches, security analysis, and metrics. It discusses the strengths and limitations of the major solutions in protecting hardware IPs against the confidentiality attacks, and future directions to address the limitations in the modern supply chain ecosystem.

SESSION: Analyzing Reliability, Defects and Patterning

Session details: Analyzing Reliability, Defects and Patterning

Gaurav Rajavendra Reddy
Kostas Adam

Pin Accessibility and Routing Congestion Aware DRC Hotspot Prediction Using Graph Neural Network and U-Net

Kyeonghyeon Baek
Hyunbum Park
Suwan Kim
Kyumyung Choi
Taewhan Kim

An accurate DRC (design rule check) hotspot prediction at the placement stage is essential in order to reduce a substantial amount of design time required for the iterations of placement and routing. It is known that for implementing chips with advanced technology nodes, (1) pin accessibility and (2) routing congestion are two major causes of DRVs (design rule violations). Though many ML (machine learning) techniques have been proposed to address this prediction problem, it was not easy to assemble the aggregate data on items 1 and 2 in a unified fashion for training ML models, resulting in a considerable accuracy loss in DRC hotspot prediction. This work overcomes this limitation by proposing a novel ML based DRC hotspot prediction technique, which is able to accurately capture the combined impact of items 1 and 2 on DRC hotspots. Precisely, we devise a graph, called pin proximity graph, that effectively models the spatial information on cell I/O pins and the information on pin-to-pin disturbance relation. Then, we propose a new ML model, called PGNN, which tightly combines GNN (graph neural network) and U-net in a way that GNN is used to embed pin accessibility information abstracted from our pin proximity graph while U-net is used to extract routing congestion information from grid-based features. Through experiments with a set of benchmark designs using Nangate 15nm library, our PGNN outperforms the existing ML models on all benchmark designs, achieving on average 7.8~12.5% improvements on F1-score while taking 5.5× fast inference time in comparison with that of the state-of-the-art techniques.

A Novel Semi-Analytical Approach for Fast Electromigration Stress Analysis in Multi-Segment Interconnects

Olympia Axelou
Nestor Evmorfopoulos
George Floros
George Stamoulis
Sachin S. Sapatnekar

As integrated circuit technologies move below 10 nm, Electromigration (EM) has become an issue of great concern for the longterm reliability due to the stricter performance, thermal and power requirements. The problem of EM becomes even more pronounced in power grids due to the large unidirectional currents flowing in these structures. The attention for EM analysis during the past years has been drawn to accurate physics-based models describing the interplay between the electron wind force and the back stress force, in a single Partial Differential Equation (PDE) involving wire stress. In this paper, we present a fast semi-analytical approach for the solution of the stress PDE at discrete spatial points in multi-segment lines of power grids, which allows the analytical calculation of EM stress independently at any time in these lines. Our method exploits the specific form of the discrete stress coefficient matrix whose eigenvalues and eigenvectors are known beforehand. Thus, a closed-form equation can be constructed with almost linear time complexity without the need of time discretization. This closed-form equation can be subsequently used at any given time in transient stress analysis. Our experimental results, using the industrial IBM power grid benchmarks, demonstrate that our method has excellent accuracy compared to the industrial tool COMSOL while being orders of magnitude times faster.

HierPINN-EM: Fast Learning-Based Electromigration Analysis for Multi-Segment Interconnects Using Hierarchical Physics-Informed Neural Network

Wentian Jin
Liang Chen
Subed Lamichhane
Mohammadamir Kavousi
Sheldon X.-D. Tan

Electromigration (EM) becomes a major concern for VLSI circuits as the technology advances in the nanometer regime. The crux of problem is to solve the partial differential Korhonen equations, which remains challenging due to the increasing integrated density. Recently, scientific machine learning has been explored to solve partial differential equations (PDE) due to breakthrough success in deep neural networks and existing approach such as physics-informed neural networks (PINN) shows promising results for some small PDE problems. However, for large engineering problems like EM analysis for large interconnect trees, it was shown that the plain PINN does not work well due the to large number of variables. In this work, we propose a novel hierarchical PINN approach, HierPINN-EM for fast EM induced stress analysis for multi-segment interconnects. Instead of solving the interconnect tree as a whole, we first solve EM problem for one wire segment under different boundary and geometrical parameters using supervised learning. Then we apply unsupervised PINN concept to solve the whole interconnects by enforcing the physics laws in the boundaries for all wire segments. In this way, HierPINN-EM can significantly reduce the number of variables at plain PINN solver. Numerical results on a number of synthetic interconnect trees show that HierPINN-EM can lead to orders of magnitude speedup in training and more than 79× better accuracy over the plain PINN method. Furthermore, HierPINN-EM yields 19% better accuracy with 99% reduction in training cost over recently proposed Graph Neural Network-based EM solver, EMGraph.

Sub-Resolution Assist Feature Generation with Reinforcement Learning and Transfer Learning

Guan-Ting Liu
Wei-Chen Tai
Yi-Ting Lin
Iris Hui-Ru Jiang
James P. Shiely
Pu-Jen Cheng

As modern photolithography feature sizes continue to shrink, sub-resolution assist feature (SRAF) generation has become a key resolution enhancement technique to improve the manufacturing process window. State-of-the-art works resort to machine learning to overcome the deficiencies of model-based and rule-based approaches. Nevertheless, these machine learning-based methods do not consider or implicitly consider the optical interference between SRAFs, and highly rely on post-processing to satisfy SRAF mask manufacturing rules. In this paper, we are the first to generate SRAFs using reinforcement learning to address SRAF interference and produce mask-rule-compliant results directly. In this way, our two-phase learning enables us to emulate the style of model-based SRAFs while further improving the process variation (PV) band. A state alignment and action transformation mechanism is proposed to achieve orientation equivariance while expediting the training process. We also propose a transfer learning framework, allowing SRAF generation under different light sources without retraining the model. Compared with state-of-the-art works, our method improves the solution quality in terms of PV band and edge placement error (EPE) while reducing the overall runtime.

SESSION: New Frontier in Verification Technology

Session details: New Frontier in Verification Technology

Jyotirmoy Vinay
Zahra Ghodsi

Automatic Test Configuration and Pattern Generation (ATCPG) for Neuromorphic Chips

I-Wei Chiu
Xin-Ping Chen
Jennifer Shueh-Inn Hu
James Chien-Mo Li

The demand for low-power, high-performance neuromorphic chips is increasing. However, conventional testing is not applicable to neuromorphic chips due to three reasons: (1) lack of scan DfT, (2) stochastic characteristic, and (3) configurable functionality. In this paper, we present an automatic test configuration and pattern generation (ATCPG) method for testing a configurable stochastic neuromorphic chip without using scan DfT. We use machine learning to generate test configurations. Then, we apply a modified fast gradient sign method to generate test patterns. Finally, we determine test repetitions with statistical power of test. We conduct experiments on one of the neuromorphic architectures, spiking neural network, to evaluate the effectiveness of our ATCPG. The experimental results show that our ATCPG can achieve 100% fault coverage for the five fault models we use. For testing a 3-layer model at 0.05 significant level, we produce 5 test configurations and 67 test patterns. The average test repetitions of neuron faults and synapse faults are 2,124 and 4,557, respectively. Besides, our simulation results show that the overkill matched our significance level perfectly.

ScaleHD: Robust Brain-Inspired Hyperdimensional Computing via Adapative Scaling

Sizhe Zhang
Mohsen Imani
Xun Jiao

Brain-inspired hyperdimensional computing (HDC) has demonstrated promising capability in various cognition tasks such as robotics, bio-medical signal analysis, and natural language processing. Compared to deep neural networks, HDC models show advantages such as light-weight model and one/few-shot learning capabilities, making it a promising alternative paradigm to traditional resource-demanding deep learning models particularly in edge devices with limited resources. Despite the growing popularity of HDC, the robustness of HDC models and the approaches to enhance HDC robustness has not been systematically analyzed and sufficiently examined. HDC relies on high-dimensional numerical vectors referred to as hypervectors (HV) to perform cognition tasks and the values inside the HVs are critical to the robustness of an HDC model. We propose ScaleHD, an adaptive scaling method that scales the value of HVs in the associative memory of an HDC model to enhance the robustness of HDC models. We propose three different modes of ScaleHD including Global-ScaleHD, Class-ScaleHD, and (Class + Clip)-ScaleHD which are based on different adaptive scaling strategies. Results show that ScaleHD is able to enhance HDC robustness against memory errors up to 10,000X. Moreover, we leverage the enhanced HDC robustness in exchange for energy saving via voltage scaling method. Experimental results show that ScaleHD can reduce energy consumption on HDC memory system up to 72.2% with less than 1% accuracy loss.

Quantitative Verification and Design Space Exploration under Uncertainty with Parametric Stochastic Contracts

Chanwook Oh
Michele Lora
Pierluigi Nuzzo

This paper proposes an automated framework for quantitative verification and design space exploration of cyber-physical systems in the presence of uncertainty, leveraging assume-guarantee contracts expressed in Stochastic Signal Temporal Logic (StSTL). We introduce quantitative semantics for StSTL and formulations of the quantitative verification and design space exploration problems as bi-level optimization problems. We show that these optimization problems can be effectively solved for a class of stochastic systems and a fragment of bounded-time StSTL formulas. Our algorithm searches for partitions of the upper-level design space such that the solutions of the lower-level problems satisfy the upper-level constraints. A set of optimal parameter values are then selected within these partitions. We illustrate the effectiveness of our framework on the design of a multi-sensor perception system and an automatic cruise control system.

SESSION: Low Power Edge Intelligence

Session details: Low Power Edge Intelligence

Sabya Das
Jiang Hu

Reliable Machine Learning for Wearable Activity Monitoring: Novel Algorithms and Theoretical Guarantees

Dina Hussein
Taha Belkhouja
Ganapati Bhat
Janardhan Rao Doppa

Wearable devices are becoming popular for health and activity monitoring. The machine learning (ML) models for these applications are trained by collecting data in a laboratory with precise control of experimental settings. However, during real-world deployment/usage, the experimental settings (e.g., sensor position or sampling rate) may deviate from those used during training. This discrepancy can degrade the accuracy and effectiveness of the health monitoring applications. Therefore, there is a great need to develop reliable ML approaches that provide high accuracy for real-world deployment. In this paper, we propose a novel statistical optimization approach referred as StatOpt that automatically accounts for the real-world disturbances in sensing data to improve the reliability of ML models for wearable devices. We theoretically derive the upper bounds on sensor data disturbance for StatOpt to produce a ML model with reliability certificates. We validate StatOpt on two publicly available datasets for human activity recognition. Our results show that compared to standard ML algorithms, the reliable ML classifiers enabled by the StatOpt approach improve the accuracy up to 50% in real-world settings with zero overhead, while baseline approaches incur significant overhead and fail to achieve comparable accuracy.

Neurally-Inspired Hyperdimensional Classification for Efficient and Robust Biosignal Processing

Yang Ni
Nicholas Lesica
Fan-Gang Zeng
Mohsen Imani

The biosignals consist of several sensors that collect time series information. Since time series contain temporal dependencies, they are difficult to process by existing machine learning algorithms. Hyper-Dimensional Computing (HDC) is introduced as a brain-inspired paradigm for lightweight time series classification. However, there are the following drawbacks with existing HDC algorithms: (1) low classification accuracy that comes from linear hyperdimensional representation, (2) lack of real-time learning support due to costly and non-hardware friendly operations, and (3) unable to build up a strong model from partially labeled data.

In this paper, we propose TempHD, a novel hyperdimensional computing method for efficient and accurate biosignal classification. We first develop a novel non-linear hyperdimensional encoding that maps data points into high-dimensional space. Unlike existing HDC solutions that use costly mathematics for encoding, TempHD preserves spatial-temporal information of data in original space before mapping data into high-dimensional space. To obtain the most informative representation, our encoding method considers the non-linear interactions between both spatial sensors and temporally sampled data. Our evaluation shows that TempHD provides higher classification accuracy, significantly higher computation efficiency, and, more importantly, the capability to learn from partially labeled data. We evaluate TempHD effectiveness on noisy EEG data used for a brain-machine interface. Our results show that TempHD achieves, on average, 2.3% higher classification accuracy as well as 7.7× and 21.8× speedup for training and testing time compared to state-of-the-art HDC algorithms, respectively.

EVE: Environmental Adaptive Neural Network Models for Low-Power Energy Harvesting System

Sahidul Islam
Shanglin Zhou
Ran Ran
Yu-Fang Jin
Wujie Wen
Caiwen Ding
Mimi Xie

IoT devices are increasingly being implemented with neural network models to enable smart applications. Energy harvesting (EH) technology that harvests energy from ambient environment is a promising alternative to batteries for powering those devices due to the low maintenance cost and wide availability of the energy sources. However, the power provided by the energy harvester is low and has an intrinsic drawback of instability since it varies with the ambient environment. This paper proposes EVE, an automated machine learning (autoML) co-exploration framework to search for desired multi-models with shared weights for energy harvesting IoT devices. Those shared models incur significantly reduced memory footprint with different levels of model sparsity, latency, and accuracy to adapt to the environmental changes. An efficient on-device implementation architecture is further developed to efficiently execute each model on device. A run-time model extraction algorithm is proposed that retrieves individual model with negligible overhead when a specific model mode is triggered. Experimental results show that the neural networks models generated by EVE is on average 2.5× times faster than the baseline models without pruning and shared weights.

SESSION: Crossbars, Analog Accelerators for Neural Networks, and Neuromorphic Computing Based on Printed Electronics

Session details: Crossbars, Analog Accelerators for Neural Networks, and Neuromorphic Computing Based on Printed Electronics

Hussam Amrouch
Sheldon Tan

Designing Energy-Efficient Decision Tree Memristor Crossbar Circuits Using Binary Classification Graphs

Pranav Sinha
Sunny Raj

We propose a method to design in-memory, energy-efficient, and compact memristor crossbar circuits for implementing decision trees using flow-based computing. We develop a new tool called binary classification graph, which is equivalent to decision trees in accuracy but uses bit values of input features to make decisions instead of thresholds. Our proposed design is resilient to manufacturing errors and can scale to large crossbar sizes due to the utilization of sneak paths in computations. Our design uses zero transistor and one memristor (0T1R) crossbars with only two resistance states of high and low, which makes it resilient to resistance drift and radiation degradation. We test the performance of our designs on multiple standard machine learning datasets and show that our method utilizes circuits of size 5.23 × 10^-3 mm² and uses 20.5 pJ per decision, and outperforms state-of-the-art decision tree acceleration algorithms on these metrics.

Fuse and Mix: MACAM-Enabled Analog Activation for Energy-Efficient Neural Acceleration

Hanqing Zhu
Keren Zhu
Jiaqi Gu
Harrison Jin
Ray T. Chen
Jean Anne Incorvia
David Z. Pan

Analog computing has been recognized as a promising low-power alternative to digital counterparts for neural network acceleration. However, conventional analog computing is mainly in a mixed-signal manner. Tedious analog/digital (A/D) conversion cost significantly limits the overall system’s energy efficiency. In this work, we devise an efficient analog activation unit with magnetic tunnel junction (MTJ)-based analog content-addressable memory (MACAM), simultaneously realizing nonlinear activation and A/D conversion in a fused fashion. To compensate for the nascent and therefore currently limited representation capability of MACAM, we propose to mix our analog activation unit with digital activation dataflow. A fully differential framework, SuperMixer, is developed to search for an optimized activation workload assignment, adaptive to various activation energy constraints. The effectiveness of our proposed methods is evaluated on a silicon photonic accelerator. Compared to standard activation implementation, our mixed activation system with the searched assignment can achieve competitive accuracy with >60% energy saving on A/D conversion and activation.

Aging-Aware Training for Printed Neuromorphic Circuits

Haibin Zhao
Michael Hefenbrock
Michael Beigl
Mehdi B. Tahoori

Printed electronics allow for ultra-low-cost circuit fabrication with unique properties such as flexibility, non-toxicity, and stretchability. Because of these advanced properties, there is a growing interest in adapting printed electronics for emerging areas such as fast-moving consumer goods and wearable technologies. In such domains, analog signal processing in or near the sensor is favorable. Printed neuromorphic circuits have been recently proposed as a solution to perform such analog processing natively. Additionally, their learning-based design process allows high efficiency of their optimization and enables them to mitigate the high process variations associated with low-cost printed processes. In this work, we address the aging of the printed components. This effect can significantly degrade the accuracy of printed neuromorphic circuits over time. For this, we develop a stochastic aging-model to describe the behavior of aged printed resistors and modify the training objective by considering the expected loss over the lifetime of the device. This approach ensures to provide acceptable accuracy over the device lifetime. Our experiments show that an overall 35.8% improvement in terms of expected accuracy over the device lifetime can be achieved using the proposed learning approach.

SESSION: Designing DNN Accelerators

Session details: Designing DNN Accelerators

Elliott Delaye
Yiyu Shi

Workload-Balanced Graph Attention Network Accelerator with Top-K Aggregation Candidates

Naebeom Park
Daehyun Ahn
Jae-Joon Kim

Graph attention networks (GATs) are gaining attention for various transductive and inductive graph processing tasks due to their higher accuracy than conventional graph convolutional networks (GCNs). The power-law distribution of real-world graph-structured data, on the other hand, causes a severe workload imbalance problem for GAT accelerators. To reduce the degradation of PE utilization due to the workload imbalance, we present algorithm/hardware co-design results for a GAT accelerator that balances workload assigned to processing elements by allowing only K neighbor nodes to participate in aggregation phase. The proposed model selects the K neighbor nodes with high attention scores, which represent relevance between two nodes, to minimize accuracy drop. Experimental results show that our algorithm/hardware co-design of the GAT accelerator achieves higher processing speed and energy efficiency than the GAT accelerators using conventional workload balancing techniques. Furthermore, we demonstrate that the proposed GAT accelerators can be made faster than the GCN accelerators that typically process smaller number of computations.

Re2fresh: A Framework for Mitigating Read Disturbance in ReRAM-Based DNN Accelerators

Hyein Shin
Myeonggu Kang
Lee-Sup Kim

A severe read disturbance problem degrades the inference accuracy of a resistive RAM (ReRAM) based deep neural network (DNN) accelerator. Refresh, which reprograms the ReRAM cells, is the most obvious solution for the problem, but programming ReRAM consumes huge energy. To address the issue, we first analyze the resistance drift pattern of each conductance state and the actual read stress applied to the ReRAM array by considering the characteristics of ReRAM-based DNN accelerators. Based on the analysis, we cluster ReRAM cells into a few groups for each layer of DNN and generate a proper refresh cycle for each group in the offline phase. The individual refresh cycles reduce energy consumption by reducing the number of unnecessary refresh operations. In the online phase, the refresh controller selectively launches refresh operations according to the generated refresh cycles. ReRAM cells are selectively refreshed by minimally modifying the conventional structure of the ReRAM-based DNN accelerator. The proposed work successfully resolves the read disturbance problem by reducing 97% of the energy consumption for the refresh operation while preserving inference accuracy.

FastStamp: Accelerating Neural Steganography and Digital Watermarking of Images on FPGAs

Shehzeen Hussain
Nojan Sheybani
Paarth Neekhara
Xinqiao Zhang
Javier Duarte
Farinaz Koushanfar

Steganography and digital watermarking are the tasks of hiding recoverable data in image pixels. Deep neural network (DNN) based image steganography and watermarking techniques are quickly replacing traditional hand-engineered pipelines. DNN based water-marking techniques have drastically improved the message capacity, imperceptibility and robustness of the embedded watermarks. However, this improvement comes at the cost of increased computational overhead of the watermark encoder neural network. In this work, we design the first accelerator platform FastStamp to perform DNN based steganography and digital watermarking of images on hardware. We first propose a parameter efficient DNN model for embedding recoverable bit-strings in image pixels. Our proposed model can match the success metrics of prior state-of-the-art DNN based watermarking methods while being significantly faster and lighter in terms of memory footprint. We then design an FPGA based accelerator framework to further improve the model throughput and power consumption by leveraging data parallelism and customized computation paths. FastStamp allows embedding hardware signatures into images to establish media authenticity and ownership of digital media. Our best design achieves 68× faster inference as compared to GPU implementations of prior DNN based watermark encoder while consuming less power.

SESSION: Novel Chiplet Approaches from Interconnect to System (Virtual)

Session details: Novel Chiplet Approaches from Interconnect to System (Virtual)

Xinfei Guo

GIA: A Reusable General Interposer Architecture for Agile Chiplet Integration

Fuping Li
Ying Wang
Yuanqing Cheng
Yujie Wang
Yinhe Han
Huawei Li
Xiaowei Li

2.5D chiplet technology is gaining popularity for the efficiency of integrating multiple heterogeneous dies or chiplets on interposers, and it is also considered an ideal option for agile silicon system design by mitigating the huge design, verification, and manufacturing overhead of monolithic SoCs. Although it significantly reduces development costs by chiplet reuse, the design and fabrication of interposers also introduce additional high non-recurring engineering (NRE) costs and development cycles which might be prohibitive for application-specific designs having low volume.

To address this challenge, in this paper, we propose a reusable general interposer architecture (GIA) to amortize NRE costs and accelerate integration flows of interposers across different chiplet-based systems effectively. The proposed assembly-time configurable interposer architecture covers both active interposers and passive interposers considering diverse applications of 2.5D systems. The agile interposer integration is also facilitated by a novel end-to-end design automation framework to generate optimal system assembly configurations including the selection of chiplets, inter-chiplet network configuration, placement of chiplets, and mapping on GIA, which are specialized for the given target workload. The experimental results show that our proposed active GIA and passive GIA achieve 3.15x and 60.92x performance boost with 2.57x and 2.99x power saving over baselines respectively.

Accelerating Cache Coherence in Manycore Processor through Silicon Photonic Chiplet

Chengeng Li
Fan Jiang
Shixi Chen
Jiaxu Zhang
Yinyi Liu
Yuxiang Fu
Jiang Xu

Cache coherence overhead in manycore systems is becoming prominent with the increase of system scale. However, traditional electrical networks restrict the efficiency of cache coherence transactions in the system due to the limited bandwidth and long latency. Optical network promises high bandwidth and low latency, and supports both efficient unicast and multicast transmission, which can potentially accelerate cache coherence in manycore systems. This work proposes a novel photonic cache coherence network with a physically centralized logically distributed directory called PCCN for chiplet-based manycore systems. PCCN adopts a channel sharing method with a contention solving mechanism for efficient long-distance coherence-related packet transmission. Experiment results show that compared to state-of-the-art proposals, PCCN can speed up application execution time by 1.32x, reduce memory access latency by 26%, and improve energy efficiency by 1.26x, on average, in a 128-core system.

Re-LSM: A ReRAM-Based Processing-in-Memory Framework for LSM-Based Key-Value Store

Qian Wei
Zhaoyan Shen
Yiheng Tong
Zhiping Jia
Lei Ju
Jiezhi Chen
Bingzhe Li

Log-structured merge (LSM) tree based key-value (KV) stores organize writes into hierarchical batches for high-speed writing. However, the notorious compaction process of LSM-tree severely hurts system performance. It not only involves huge I/O operations but also consumes tremendous computation and memory resources. In this paper, first we find that when compaction happens in the high levels (i.e., L₀, L₁) of the LSM-tree, it may saturate all system computation and memory resources, and eventually stall the whole system. Based on this observation, we present Re-LSM, a ReRAM-based Processing-in-Memory (PIM) framework for LSM-based Key-Value Store. Specifically, in Re-LSM, we propose to offload certain computation and memory-intensive tasks in the high levels of the LSM-tree to the ReRAM-based PIM space. A high parallel ReRAM compaction accelerator is designed by decomposing the three-phased compaction into basic logic operating units. Evaluation results based on db_bench and YCSB show that Re-LSM achieves 2.2× improvement on the throughput of random writes compared to RocksDB, and the ReRAM-based compaction accelerator speedups the CPU-based implementation by 64.3× and saves 25.5× energy.

SESSION: Architecture for DNN Acceleration (Virtual)

Session details: Architecture for DNN Acceleration (Virtual)

Zhezhi He

Hidden-ROM: A Compute-in-ROM Architecture to Deploy Large-Scale Neural Networks on Chip with Flexible and Scalable Post-Fabrication Task Transfer Capability

Yiming Chen
Guodong Yin
Mingyen Lee
Wenjun Tang
Zekun Yang
Yongpan Liu
Huazhong Yang
Xueqing Li

Motivated by reducing the data transfer activities in data-intensive neural network computing, SRAM-based compute-in-memory (CiM) has made significant progress. Unfortunately, SRAM has low density and limited on-chip capacity. This makes the deployment of large models inefficient due to the frequent DRAM access to update the weight in SRAM. Recently, a ROM-based CiM design, YOLoC, reveals the unique opportunity of deploying a large-scale neural network in CMOS by exploring the intriguing high density of ROM. However, even though assisting SRAM has been adopted in YOLoC for task transfer within the same domain, it is still a big challenge to overcome the read-only limitation in ROM and enable more flexibility. Therefore, it is of paramount significance to develop new ROM-based CiM architectures and provide broader task space and model expansion capability for more complex tasks.

This paper presents Hidden-ROM for high flexibility of ROM-based CiM. Hidden-ROM provides several novel ideas beyond YOLoC. First, it adopts a one-SRAM-many-ROM method that “hides” ROM cells to support various datasets of different domains, including CIFAR10/100, FER2013, and ImageNet. Second, Hidden-ROM provides the model expansion capability after chip fabrication to update the model for more complex tasks when needed. Experiments show that Hidden-ROM designed for ResNet-18 pretrained on CIFAR100 (item classification) can achieve <0.5% accuracy loss in FER2013 (facial expression recognition), while YOLoC degrades by >40%. After expanding to ResNet-50/101, Hidden-ROM even achieves 68.6%/72.3% accuracy in ImageNet, close to 74.9%/76.4% by software. Such expansion costs only 7.6%/12.7% energy efficiency overhead while providing 12%/16% accuracy improvement after expansion.

DCIM-GCN: Digital Computing-in-Memory to Efficiently Accelerate Graph Convolutional Networks

Yikan Qiu
Yufei Ma
Wentao Zhao
Meng Wu
Le Ye
Ru Huang

Computing-in-memory (CIM) is emerging as a promising architecture to accelerate graph convolutional networks (GCNs) normally bounded by redundant and irregular memory transactions. Current analog based CIM requires frequent analog and digital conversions (AD/DA) that dominate the overall area and power consumption. Furthermore, the analog non-ideality degrades the accuracy and reliability of CIM. In this work, an SRAM based digital CIM system is proposed to accelerate memory intensive GCNs, namely DCIM-GCN, which covers innovations from CIM circuit level eliminating costly AD/DA converters to architecture level addressing irregularity and sparsity of graph data. DCIM-GCN achieves 2.07X, 1.76X, and 1.89× speedup and 29.98×, 1.29×, and 3.73× energy efficiency improvement on average over CIM based PIMGCN, TARe, and PIM-GCN, respectively.

Hardware Computation Graph for DNN Accelerator Design Automation without Inter-PU Templates

Jun Li
Wei Wang
Wu-Jun Li

Existing deep neural network (DNN) accelerator design automation (ADA) methods adopt architecture templates to predetermine parts of design choices and then explore the left design choices beyond templates. These templates can be classified into intra-PU templates and inter-PU templates according to the architecture hierarchy. Since templates limit the flexibility of ADA, designing effective ADA methods without templates has become an important research topic. Although there have appeared some works to enhance the flexibility of ADA by removing intra-PU templates, to the best of our knowledge no existing works have studied ADA methods without inter-PU templates. ADA with predetermined inter-PU templates is typically inefficient in terms of resource utilization, especially for DNNs with complex topology. In this paper, we propose a novel method, called hardware computation graph (HCG), for ADA without inter-PU templates. Experiments show that HCG method can achieve competitive latency while using only 1.4× ~ 5× fewer on-chip memory, compared with existing state-of-the-art ADA methods.

SESSION: Multi-Purpose Fundamental Digital Design Improvements (Virtual)

Session details: Multi-Purpose Fundamental Digital Design Improvements (Virtual)

Sabya Das
Mondira Pant

Dynamic Frequency Boosting Beyond Critical Path Delay

Nikolaos Zompakis
Sotirios Xydis

This paper introduces an innovative post-implementation Dynamic Frequency Boosting (DFB) technique to release “hidden” performance margins of digital circuit designs currently suppressed by typical critical path constraint design flows, thus defining higher limits of operation speed. The proposed technique goes beyond state-of-the-art and exploits the data-driven path delay variability incorporating an innovative hardware clocking mechanism that detects in real-time the paths’ activation. In contrast to timing speculation, the operating speed is adjusted on the nominal path delay activation, succeeding an error-free acceleration. The proposed technique has been evaluated on three FPGA-based use cases carefully selected to exhibit differing domain characteristics, i.e i) a third party DNN inference accelerator IP for CIFAR-10 images achieving an average speedup of 18%, ii) a highly designer-optimized Optical Digital Equalizer design, in which DBF delivered a speedup of 50% and iii) a set of 5 synthetic designs examining high frequency (beyond 400 MHz) applications in FPGAs, achieving accelerations of 20–60% depending on the underlying path variability.

ASPPLN: Accelerated Symbolic Probability Propagation in Logic Network

Weihua Xiao
Weikang Qian

Probability propagation is an important task used in logic network analysis, which propagates signal probabilities from its primary inputs to its primary outputs. It has many applications such as power estimation, reliability analysis, and error analysis for approximate circuits. Existing methods for the task can be divided into two categories: simulation-based and probability-based methods. However, most of them suffer from low accuracy or bad scalability. In this work, we propose ASPPLN, a method for accelerated symbolic probability propagation in logic network, which has a linear complexity with the network size. We first introduce a new definition in a graph called redundant input and take advantage of it to simplify the propagation process without losing accuracy. Then, a technique called symbol limitation is proposed to limit the complexity of each node’s propagation according to the partial probability significances of the symbols. The experimental results showed that compared to the existing methods, ASPPLN improves the estimation accuracy of switching activity by up to 24.70%, while it also has a speedup of up to 29X.

A High-Precision Stochastic Solver for Steady-State Thermal Analysis with Fourier Heat Transfer Robin Boundary Conditions

Longlong Yang
Cuiyang Ding
Changhao Yan
Dian Zhou
Xuan Zeng

In this work, we propose a path integral random walk (PIRW) solver, the first accurate stochastic method for steady-state thermal analysis with mixed boundary conditions, especially involving Fourier heat transfer Robin boundary conditions. We innovatively adopt the strictly correct calculation of the local time and the Feynman-Kac functional ê_c (t) to handle Neumann and Robin boundary conditions with high precision. Compared with ANSYS, experimental results show that PIRW achieves over 121× speedup and over 83× storage space reduction with a negligible error within 0.8° C at a single point. An application combining PIRW with low-accuracy ANSYS for the temperature calculation at hot-spots is provided as a more accurate and faster solution than only ANSYS used.

SESSION: GPU Acceleration for Routing Algorithms (Virtual)

Session details: GPU Acceleration for Routing Algorithms (Virtual)

Umamaheswara Rao Tida

Superfast Full-Scale CPU-Accelerated Global Routing

Shiju Lin
Martin D. F. Wong

Global routing is an essential step in physical design. Recently there are works on accelerating global routers using GPU. However, they only focus on certain stages of global routing, and have limited overall speedup. In this paper, we present a superfast full-scale GPU-accelerated global router and introduce useful parallelization techniques for routing. Experiments show that our 3D router achieves both good quality and short runtime compared to other state-of-the-art academic global routers.

X-Check: CPU-Accelerated Design Rule Checking via Parallel Sweepline Algorithms

Zhuolun He
Yuzhe Ma
Bei Yu

Design rule checking (DRC) is essential in physical verification to ensure high yield and reliability for VLSI circuit designs. To achieve reasonable design cycle time, acceleration for computationally intensive DRC tasks has been demanded to accommodate the ever-growing complexity of modern VLSI circuits. In this paper, we propose X-Check, a GPU-accelerated design rule checker. X-Check integrates novel parallel sweepline algorithms, which are both efficient in practice and with nontrivial theoretical guarantees. Experimental results have demonstrated significant speedup achieved by X-Check compared with a multi-threaded CPU checker.

GPU-Accelerated Rectilinear Steiner Tree Generation

Zizheng Guo
Feng Gu
Yibo Lin

Rectilinear Steiner minimum tree (RSMT) generation is a fundamental component in the VLSI design automation flow. Due to its extensive usage in circuit design iterations at early design stages like synthesis, placement, and routing, the performance of RSMT generation is critical for a reasonable design turnaround time. State-of-the-art RSMT generation algorithms, like fast look-up table estimation (FLUTE), are constrained by CPU-based parallelism with limited runtime improvements. The acceleration of RSMT on GPUs is an important yet difficult task, due to the complex and non-trivial divide-and-conquer computation patterns with recursions. In this paper, we present the first GPU-accelerated RSMT generation algorithm based on FLUTE. By designing GPU-efficient data structures and levelized decomposition, table look-up, and merging operations, we incorporate large-scale data parallelism into the generation of Steiner trees. An up to 10.47× runtime speed-up has been achieved compared with FLUTE running on 40 CPU cores, filling in a critical missing component in today’s GPU-accelerated design automation framework.

SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist I (Virtual)

Session details: Breakthroughs in Synthesis – Infrastructure and ML Assist I (Virtual)

Christian Pilato
Miroslav Velev

HECTOR: A Multi-Level Intermediate Representation for Hardware Synthesis Methodologies

Ruifan Xu
Youwei Xiao
Jin Luo
Yun Liang

Hardware synthesis requires a complicated process to generate synthesizable register transfer level (RTL) code. High-level synthesis tools can automatically transform a high-level description into hardware design, while hardware generators adopt domain specific languages and synthesis flows for specific applications. The implementation of these tools generally requires substantial engineering efforts due to RTL’s weak expressivity and low level of abstraction. Furthermore, different synthesis tools adopt different levels of intermediate representations (IR) and transformations. A unified IR obviously is a good way to lower the engineering cost and get competitive hardware design rapidly by exploring different synthesis methodologies.

In this paper, we propose Hector, a two-level IR providing a unified intermediate representation for hardware synthesis methodologies. The high-level IR binds computation with a control graph annotated with timing information, while the low-level IR provides a concise way to describe hardware modules and elastic interconnections among them. Implemented based on the multi-level compiler infrastructure (MLIR), Hector’s IRs can be converted to synthesizable RTL designs. To demonstrate the expressivity and versatility, we implement three synthesis approaches based on Hector: a high-level synthesis (HLS) tool, a systolic array generator, and a hardware accelerator. The hardware generated by Hector’s HLS approach is comparable to that generated by the state-of-the-art HLS tools, and the other two cases outperform HLS implementations in performance and productivity.

QCIR: Pattern Matching Based Universal Quantum Circuit Rewriting Framework

Mingyu Chen
Yu Zhang
Yongshang Li
Zhen Wang
Jun Li
Xiangyang Li

Due to multiple limitations of quantum computers in the NISQ era, quantum compilation efforts are required to efficiently execute quantum algorithms on NISQ devices Program rewriting based on pattern matching can improve the generalization ability of compiler optimization. However, it has rarely been explored for quantum circuit optimization, further considering physical features of target devices.

In this paper, we propose a pattern-matching based quantum circuit optimization framework QCIR with a novel pattern description format, enabling the user-configured cost model and two categories of patterns, i.e., generic patterns and folding patterns. To get better compilation latency, we propose a DAG representation of quantum circuit called QCir-DAG, and QVF algorithm for subcircuit matching. We implement continuous single-qubit optimization pass constructed by QCIR, achieving 10% and 20% optimization rate for benchmarks from Qiskit and ScaffCC, respectively. The practicality of QCIR is demonstrated by execution time and experimental results on the quantum simulator and quantum devices.

Batch Sequential Black-Box Optimization with Embedding Alignment Cells for Logic Synthesis

Chang Feng
Wenlong Lyu
Zhitang Chen
Junjie Ye
Mingxuan Yuan
Jianye Hao

During the logic synthesis flow of EDA, a sequence of graph transformation operators are applied to the circuits so that the Quality of Results (QoR) of the circuits highly depends on the chosen operators and their specific parameters in the sequence, making the search space operator-dependent and increasingly exponential. In this paper, we formulate the logic synthesis design space exploration as a conditional sequence optimization problem, where at each transformation step, an optimization operator is selected and its corresponding parameters are decided. To solve this problem, we propose a novel sequential black-box optimization approach without human intervention: 1) Due to the conditional and sequential structure of operator sequence with variable length, we build an embedding alignment cells based recurrent neural network as a surrogate model to estimate the QoR of the logic synthesis flow with historical data. 2) With the surrogate model, we construct acquisition function to balance exploration and exploitation with respect to each metric of the QoR. 3) We use multi-objective optimization algorithm to find the Pareto front of the acquisition functions, along which a batch of sequences, consisting of parameterized operators, are (randomly) selected to users for evaluation under the budget of computing resource. We repeat the above three steps until convergence or time limit. Experimental results on public EPFL benchmarks demonstrate the superiority of our approach over the expert-crafted optimization flows and other machine learning based methods. Compared to resyn2, we achieve 11.8% LUT-6 count descent improvements without sacrificing level values.

Heterogeneous Graph Neural Network-Based Imitation Learning for Gate Sizing Acceleration

Xinyi Zhou
Junjie Ye
Chak-Wa Pui
Kun Shao
Guangliang Zhang
Bin Wang
Jianye Hao
Guangyong Chen
Pheng Ann Heng

Gate Sizing is an important step in logic synthesis, where the cells are resized to optimize metrics such as area, timing, power, leakage, etc. In this work, we consider the gate sizing problem for leakage power optimization with timing constraints. Lagrangian Relaxation is a widely employed optimization method for gate sizing problems. We accelerate Lagrangian Relaxation-based algorithms by narrowing down the range of cells to resize. In particular, we formulate a heterogeneous directed graph to represent the timing graph, propose a heterogeneous graph neural network as the encoder, and train in the way of imitation learning to mimic the selection behavior of each iteration in Lagrangian Relaxation. This network is used to predict the set of cells that need to be changed during the optimization process of Lagrangian Relaxation. Experiments show that our accelerated gate sizer could achieve comparable performance to the baseline with an average of 22.5% runtime reduction.

SESSION: Smart Search (Virtual)

Session details: Smart Search (Virtual)

Jianlei Yang

NASA: Neural Architecture Search and Acceleration for Hardware Inspired Hybrid Networks

Huihong Shi
Haoran You
Yang Zhao
Zhongfeng Wang
Yingyan Lin

Multiplication is arguably the most cost-dominant operation in modern deep neural networks (DNNs), limiting their achievable efficiency and thus more extensive deployment in resource-constrained applications. To tackle this limitation, pioneering works have developed handcrafted multiplication-free DNNs, which require expert knowledge and time-consuming manual iteration, calling for fast development tools. To this end, we propose a Neural Architecture Search and Acceleration framework dubbed NASA, which enables automated multiplication-reduced DNN development and integrates a dedicated multiplication-reduced accelerator for boosting DNNs’ achievable efficiency. Specifically, NASA adopts neural architecture search (NAS) spaces that augment the state-of-the-art one with hardware inspired multiplication-free operators, such as shift and adder, armed with a novel progressive pretrain strategy (PGP) together with customized training recipes to automatically search for optimal multiplication-reduced DNNs; On top of that, NASA further develops a dedicated accelerator, which advocates a chunk-based template and auto-mapper dedicated for NASA-NAS resulting DNNs to better leverage their algorithmic properties for boosting hardware efficiency. Experimental results and ablation studies consistently validate the advantages of NASA’s algorithm-hardware co-design framework in terms of achievable accuracy and efficiency tradeoffs. Codes are available at https://github.com/shihuihong214/NASA.

Personalized Heterogeneity-Aware Federated Search Towards Better Accuracy and Energy Efficiency

Zhao Yang
Qingshuang Sun

Federated learning (FL), a new distributed technology, allows us to train the global model on the edge and embedded devices without local data sharing. However, due to the wide distribution of different types of devices, FL faces severe heterogeneity issues. The accuracy and efficiency of FL deployment at the edge are severely impacted by heterogeneous data and heterogeneous systems. In this paper, we perform joint FL model personalization for heterogeneous systems and heterogeneous data to address the challenges posed by heterogeneities. We begin by using model inference efficiency as a starting point to personalize network scale on each node. Furthermore, it can be used to guide the efficient FL training process, which can help to ease the problem of straggler devices and improve FL’s energy efficiency. During FL training, federated search is then used to acquire highly accurate personalized network structures. By taking into account the unique characteristics of FL deployment at edge devices, the personalized network structures obtained by our federated search framework with a lightweight search controller can achieve competitive accuracy with state-of-the-art (SOTA) methods, while reducing inference and training energy consumption by up to 3.57× and 1.82×, respectively.

SESSION: Reconfigurable Computing: Accelerators and Methodologies I (Virtual)

Session details: Reconfigurable Computing: Accelerators and Methodologies I (Virtual)

Cheng Tan

Towards High Performance and Accurate BNN Inference on FPGA with Structured Fine-Grained Pruning

Keqi Fu
Zhi Qi
Jiaxuan Cai
Xulong Shi

As the extreme case of quantization networks, Binary Neural Networks (BNNs) have received tremendous attention due to many hardware-friendly properties in terms of storage and computation. To reach the limit of compact models, we attempt to combine binarization with pruning techniques, further exploring the redundancy of BNNs. However, coarse-grained pruning methods may cause server accuracy drops, while traditional fine-grained ones induce irregular sparsity hard to be utilized by hardware. In this paper, we propose two advanced fine-grained BNN pruning modules, i.e., structured channel-wise kernel pruning and dynamic spatial pruning, from a joint perspective of algorithm and hardware. The pruned BNN models are trained from scratch and present not only a higher precision but also a high degree of parallelism. Then, we develop an accelerator architecture that can effectively exploit the sparsity caused by our algorithm. Finally, we implement the pruned BNN models on an embedded FPGA (Ultra96v2). The results show that our software and hardware codesign achieves 5.4x inference-speedup than the baseline BNN, with higher resource and energy efficiency compared with prior FPGA implemented BNN works.

Towards High-Quality CGRA Mapping with Graph Neural Networks and Reinforcement Learning

Yan Zhuang
Zhihao Zhang
Dajiang Liu

Coarse-Grained Reconfigurable Architectures (CGRA) is a promising solution to accelerate domain applications due to its good combination of energy-efficiency and flexibility. Loops, as computation-intensive parts of applications, are often mapped onto CGRA and modulo scheduling is commonly used to improve the execution performance. However, the actual performance using modulo scheduling is highly dependent on the mapping ability of the Data Dependency Graph (DDG) extracted from a loop. As existing approaches usually separate routing exploration of multi-cycle dependence from mapping for fast compilation, they may easily suffer from poor mapping quality. In this paper, we integrate the routing explorations into the mapping process and make it have more opportunities to find a globally optimized solution. Meanwhile, with a reduced resource graph defined, the searching space of the new mapping problem is not greatly increased. To efficiently solve the problem, we introduce graph neural network based reinforcement learning to predict a placement distribution over different resource nodes for all operations in a DDG. Using the routing connectivity as the reward signal, we optimize the parameters of neural network to find a valid mapping solution with a policy gradient method. Without much engineering and heuristic designing, our approach achieves 1.57× mapping quality, as compared to the state-of-the-art heuristic.

SESSION: Hardware Security: Attacks and Countermeasures (Virtual)

Session details: Hardware Security: Attacks and Countermeasures (Virtual)

Johann Knechtel
Lejla Batina

Attack Directories on ARM big.LITTLE Processors

Zili Kou
Sharad Sinha
Wenjian He
Wei Zhang

Eviction-based cache side-channel attacks take advantage of inclusive cache hierarchies and shared cache hardware. Processors with the template ARM big.LITTLE architecture do not guarantee such preconditions and therefore will not usually allow cross-core attacks let alone cross-cluster attacks. This work reveals a new side-channel based on the snoop filter (SF), an unexplored directory structure embedded in template ARM big.LITTLE processors. Our systematic reverse engineering unveils the undocumented structure and property of the SF, and we successfully utilize it to bootstrap cross-core and cross-cluster cache eviction. We demonstrate a comprehensive methodology to exploit the SF side-channel, including the construction of eviction sets, the covert channel, and attacks against RSA and AES. When attacking TrustZone, we conduct an interrupt-based side-channel attack to extract the key of RSA by a single profiling trace, despite the strict cache clean defense. Supported by detailed experiments, the SF side-channel not only achieves competitive performance but also overcomes the main challenge of cache side-channel attacks on ARM big.LITTLE processors.

AntiSIFA-CAD: A Framework to Thwart SIFA at the Layout Level

Rajat Sadhukhan
Sayandeep Saha
Debdeep Mukhopadhyay

Fault Attacks (FA) have gained a lot of attention from both industry and academia due to their practicality, and wide applicability to different domains of computing. In the context of symmetric-key cryptography, designing countermeasures against FA is still an open problem. Recently proposed attacks such as Statistical Ineffective Fault Analysis (SIFA) has shown that merely adding redundancy or infection-based countermeasure to detect the fault doesn’t work and a proper combination of masking and error correction/detection is required. In this work, we show that masking which is mathematically established as a good countermeasure against a certain class of SIFA faults, in practice may fall short if low-level details during physical design layout development are not taken care of. We initiate this study by demonstrating a successful SIFA attack on a post placed-and-routed masked crypto design for ASIC platform. Eventually, we propose a fully automated approach along with a proper choice of placement constraints which can be realized easily for any commercial CAD tools to successfully get rid of this vulnerability during the physical layout development process. Our experimental validation of our tool flow over masked implementation on PRESENT cipher establishes our claim.

SESSION: Advanced VLSI Routing and Layout Learning

Session details: Advanced VLSI Routing and Layout Learning

Wing-Kai Chow
David Chinnery

A Stochastic Approach to Handle Non-Determinism in Deep Learning-Based Design Rule Violation Predictions

Rongjian Liang
Hua Xiang
Jinwook Jung
Jiang Hu
Gi-Joon Nam

Deep learning is a promising approach to early DRV (Design Rule Violation) prediction. However, non-deterministic parallel routing hampers model training and degrades prediction accuracy. In this work, we propose a stochastic approach, called LGC-Net, to solve this problem. In this approach, we develop new techniques of Gaussian random field layer and focal likelihood loss function to seamlessly integrate Log Gaussian Cox process with deep learning. This approach provides not only statistical regression results but also classification ones with different thresholds without retraining. Experimental results with noisy training data on industrial designs demonstrate that LGC-Net achieves significantly better accuracy of DRV density prediction than prior arts.

Obstacle-Avoiding Multiple Redistribution Layer Routing with Irregular Structures

Yen-Ting Chen
Yao-Wen Chang

In advanced packages, redistribution layers (RDLs) are extra metal layers for high interconnections among the chips and printed circuit board (PCB). To better utilize the routing resources of RDLs, published works adopted flexible vias such that they can place the vias everywhere. Furthermore, some regions may be blocked for signal integrity protection or manually prerouted nets (such as power/ground nets or feeding lines of antennas) to achieve higher performance. These blocked regions will be treated as obstacles in the routing process. Since the positions of pads, obstacles, and vias can be arbitrary, the structures of RDLs become irregular. The obstacles and irregular structures substantially increase the difficulty of the routing process. This paper proposes a three-stage algorithm: First, the layout is partitioned by a method based on constrained Delaunay triangulation (CDT). Then we present a global routing graph model and generate routing guides for unified-assignment netlists. Finally, a novel tile routing method is developed to obtain detailed routes. Experiment results demonstrate the robustness and effectiveness of our proposed algorithm.

TAG: Learning Circuit Spatial Embedding from Layouts

Keren Zhu
Hao Chen
Walker J. Turner
George F. Kokai
Po-Hsuan Wei
David Z. Pan
Haoxing Ren

Analog and mixed-signal (AMS) circuit designs still rely on human design expertise. Machine learning has been assisting circuit design automation by replacing human experience with artificial intelligence. This paper presents TAG, a new paradigm of learning the circuit representation from layouts leveraging Text, self Attention and Graph. The embedding network model learns spatial information without manual labeling. We introduce text embedding and a self-attention mechanism to AMS circuit learning. Experimental results demonstrate the ability to predict layout distances between instances with industrial FinFET technology benchmarks. The effectiveness of the circuit representation is verified by showing the transferability to three other learning tasks with limited data in the case studies: layout matching prediction, wirelength estimation, and net parasitic capacitance prediction.

SESSION: Physical Attacks and Countermeasures

Session details: Physical Attacks and Countermeasures

Satwik Patnaik
Gang Qu

PowerTouch: A Security Objective-Guided Automation Framework for Generating Wired Ghost Touch Attacks on Touchscreens

Huifeng Zhu
Zhiyuan Yu
Weidong Cao
Ning Zhang
Xuan Zhang

The wired ghost touch attacks are the emerging and severe threats against modern touchscreens. The attackers can make touchscreens falsely report nonexistent touches (i.e., ghost touches) by injecting common-mode noise (CMN) into the target devices via power cables. Existing attacks rely on reverse-engineering the touchscreens, then manually crafting the CMN waveforms to control the types and locations of ghost touches. Although successful, they are limited in practicality and attack capability due to the touchscreens’ black-box nature and the immense search space of attack parameters. To overcome the above limitations, this paper presents PowerTouch, a framework that can automatically generate wired ghost touch attacks. We adopt a software-hardware co-design approach and propose a domain-specific genetic algorithm-based method that is tailored to account for the characteristics of the CMN waveform. Based on the security objectives, our framework automatically optimizes the CMN waveform towards injecting the desired type of ghost touches into regions specified by attackers. The effectiveness of PowerTouch is demonstrated by successfully launching attacks on touchscreen devices from two different brands given nine different objectives. Compared with the state-of-the-art attack, we seminally achieve controlling taps on an extra dimension and injecting swipes on both dimensions. We can place an average of 84.2% taps on the targeted side of the screen, with the location error in the other dimension no more than 1.53mm. An average of 94.5% of injected swipes with correct directions is also achieved. The quantitative comparison with the state-of-the-art method shows that a better attack performance can be achieved by PowerTouch.

A Combined Logical and Physical Attack on Logic Obfuscation

Michael Zuzak
Yuntao Liu
Isaac McDaniel
Ankur Srivastava

Logic obfuscation protects integrated circuits from an untrusted foundry attacker during manufacturing. To counter obfuscation, a number of logical (e.g. Boolean satisfiability) and physical (e.g. electro-optical probing) attacks have been proposed. By definition, these attacks use only a subset of the information leaked by a circuit to unlock it. Countermeasures often exploit the resulting blind-spots to thwart these attacks, limiting their scalability and generalizability. To overcome this, we propose a combined logical and physical attack against obfuscation called the CLAP attack. The CLAP attack leverages both the logical and physical properties of a locked circuit to prune the keyspace in a unified and theoretically-rigorous fashion, resulting in a more versatile and potent attack. To formulate the physical portion of the CLAP attack, we derive a logical formulation that provably identifies input sequences capable of sensitizing logically expressive regions in a circuit. We prove that electro-optically probing these regions infers portions of the key. For the logical portion of the attack, we integrate the physical attack results into a Boolean satisfiability attack to find the correct key. We evaluate the CLAP attack by launching it against four obfuscation schemes in benchmark circuits. The physical portion of the attack fully specified 60.6% of key bits and partially specified another 10.3%. The logical portion of the attack found the correct key in the physical-attack-limited keyspace in under 30 minutes. Thus, the CLAP attack unlocked each circuit despite obfuscation.

A Pragmatic Methodology for Blind Hardware Trojan Insertion in Finalized Layouts

Alexander Hepp
Tiago Perez
Samuel Pagliarini
Georg Sigl

A potential vulnerability for integrated circuits (ICs) is the insertion of hardware trojans (HTs) during manufacturing. Understanding the practicability of such an attack can lead to appropriate measures for mitigating it. In this paper, we demonstrate a pragmatic framework for analyzing HT susceptibility of finalized layouts. Our framework is representative of a fabrication-time attack, where the adversary is assumed to have access only to a layout representation of the circuit. The framework inserts trojans into tapeoutready layouts utilizing an Engineering Change Order (ECO) flow. The attacked security nodes are blindly searched utilizing reverse-engineering techniques. For our experimental investigation, we utilized three crypto-cores (AES-128, SHA-256, and RSA) and a microcontroller (RISC-V) as targets. We explored 96 combinations of triggers, payloads and targets for our framework. Our findings demonstrate that even in high-density designs, the covert insertion of sophisticated trojans is possible. All this while maintaining the original target logic, with minimal impact on power and performance. Furthermore, from our exploration, we conclude that it is too naive to only utilize placement resources as a metric for HT vulnerability. This work highlights that the HT insertion success is a complex function of the placement, routing resources, the position of the attacked nodes, and further design-specific characteristics. As a result, our framework goes beyond just an attack, we present the most advanced analysis tool to assess the vulnerability of HT insertion into finalized layouts.

SESSION: Tutorial: Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

Session details: Tutorial: Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

Rolf Drechsler

Polynomial Formal Verification: Ensuring Correctness under Resource Constraints

Rolf Drechsler
Alireza Mahzoon

Recently, a lot of effort has been put into developing formal verification approaches by both academic and industrial research. In practice, these techniques often give satisfying results for some types of circuits, while they fail for others. A major challenge in this domain is that the verification techniques suffer from unpredictability in their performance. The only way to overcome this challenge is the calculation of bounds for the space and time complexities. If a verification method has polynomial space and time complexities, scalability can be guaranteed.

In this tutorial paper, we review recent developments in formal verification techniques and give a comprehensive overview of Polynomial Formal Verification (PFV). In PFV, polynomial upper bounds for the run-time and memory needed during the entire verification task hold. Thus, correctness under resource constraints can be ensured. We discuss the importance and advantages of PFV in the design flow. Formal methods on the bit-level and the word-level, and their complexities when used to verify different types of circuits, like adders, multipliers, or ALUs are presented. The current status of this new research field and directions for future work are discussed.

SESSION: Scalable Verification Technologies

Session details: Scalable Verification Technologies

Viraphol Chaiyakul
Alex Orailoglu

Arjun: An Efficient Independent Support Computation Technique and its Applications to Counting and Sampling

Mate Soos
Kuldeep S. Meel

Given a Boolean formula ϕ over the set of variables X and a projection set P ⊆ X, then if I ⊆ P is independent support of P, then if two solutions of ϕ agree on I, then they also agree on P. The notion of independent support is related to the classical notion of definability dating back to 1901, and have been studied over the decades. Recently, the computational problem of determining independent support for a given formula has attained importance owing to the crucial importance of independent support for hashing-based counting and sampling techniques.

In this paper, we design an efficient and scalable independent support computation technique that can handle formulas arising from real-world benchmarks. Our algorithmic framework, called Arjun¹, employs implicit and explicit definability notions, and is based on a tight integration of gate-identification techniques and assumption-based framework. We demonstrate that augmenting the state-of-the-art model counter ApproxMC4 and sampler UniGen3 with Arjun leads to significant performance improvements. In particular, ApproxMC4 augmented with Arjun counts 576 more benchmarks out of 1896 while UniGen3 augmented with Arjun samples 335 more benchmarks within the same time limit.

Compositional Verification Using a Formal Component and Interface Specification

Yue Xing
Huaixi Lu
Aarti Gupta
Sharad Malik

Property-based specification s uch a s SystemVerilog Assertions (SVA) uses mathematical logic to specify the temporal behavior of RTL designs which can then be formally verified using model checking algorithms. These properties are specified for a single component (which may contain other components in the design hierarchy). Composing design components that have already been verified requires additional verification since incorrect communication at their interface may invalidate the properties that have been checked for the individual components. This paper focuses on a specification for their interface which can be checked individually for each component, and which guarantees that refinement-based properties checked for each component continue to hold after their composition. We do this in the setting of the Instruction-level Abstraction (ILA) specification and verification methodology. The ILA methodology provides a uniform specification for processors, accelerators and general modules at the instruction-level, and the automatic generation of a complete set of correctness properties for checking that the RTL model is a refinement of the ILA specification. We add an interface specification to model the inter-ILA communication. Further, we use our interface specification to generate a set of interface checking properties that check that the communication between the RTL components is correct. This provides the following guarantee: if each RTL component is a refinement of its ILA specification and the interface checks pass, then the RTL composition is a refinement of the ILA composition. We have applied the proposed methodology to six case studies including parts of large-scale designs such as parts of the FlexASR and NVDLA machine learning accelerators, demonstrating the practical applicability of our method.

Usage-Based RTL Subsetting for Hardware Accelerators

Qinhan Tan
Aarti Gupta
Sharad Malik

Recent years have witnessed increasing use of domain-specific accelerators in computing platforms to provide power-performance efficiency for emerging applications. To increase their applicability within the domain, these accelerators tend to support a large set of functions, e.g. Nvidia’s open-source Deep Learning Accelerator, NVDLA, supports five distinct groups of functions [17]. However, an individual use case of an accelerator may utilize only a subset of these functions. The unused functions lead to unnecessary overhead of silicon area, power, and hardware verification/hardware-software co-verification complexity. This motivates our research question: Given an RTL design for an accelerator and a subset of functions of interest, can we automatically extract a subset of the RTL that is sufficient for these functions and sequentially equivalent to the original RTL? We call this the Usage-based RTL Subsetting problem, referred to as the RTL subsetting problem in short. We first formally define this problem and show that it can be formulated as a program synthesis problem, which can be solved by performing expensive hyperproperty checks. To overcome the high cost, we propose multiple levels of sound over-approximations to construct an effective algorithm based on relatively less expensive temporal property checking and taint analysis for information flow checking. We demonstrate the acceptable computation cost and the quality of the results of our algorithm through several case studies of accelerators from different domains. The applicability of our proposed algorithm can be seen in its ability to subset the large NVDLA accelerator (with over 50,000 registers and 1,600,000 gates) for the group of convolution functions, where the subset reduces the total number of registers by 18.6% and the total number of gates by 37.1%.

SESSION: Optimizing Digital Design Aspects: From Gate Sizing to Multi-Bit Flip-Flops

Session details: Optimizing Digital Design Aspects: From Gate Sizing to Multi-Bit Flip-Flops

Amit Gupta
Kerim Kalafala

TransSizer: A Novel Transformer-Based Fast Gate Sizer

Siddhartha Nath
Geraldo Pradipta
Corey Hu
Tian Yang
Brucek Khailany
Haoxing Ren

Gate sizing is a fundamental netlist optimization move and researchers have used supervised learning-based models in gate sizers. Recently, Reinforcement Learning (RL) has been tried for sizing gates (and other EDA optimization problems) but are very runtime-intensive. In this work, we explore a novel Transformer-based gate sizer, TransSizer, to directly generate optimized gate sizes given a placed and unoptimized netlist. TransSizer is trained on datasets obtained from real tapeout-quality industrial designs in a foundry 5nm technology node. Our results indicate that TransSizer achieves 97% accuracy in predicting optimized gate sizes at the postroute optimization stage. Furthermore, TransSizer has a speedup of ~1400× while delivering similar timing, power and area metrics when compared to a leading-edge commercial tool for sizing-only optimization.

Generation of Mixed-Driving Multi-Bit Flip-Flops for Power Optimization

Meng-Yun Liu
Yu-Cheng Lai
Wai-Kei Mak
Ting-Chi Wang

Multi-bit flip-flops (MBFFs) are often used to reduce the number of clock sinks, resulting in a low-power design. A traditional MBFF is composed of individual FFs of uniform driving strength. However, if some but not all of the bits of an MBFF violate timing constraints, the MBFF has to be sized up or decomposed into smaller bit-width combinations to satisfy timing, which reduces the power saving. In this paper, we present a new MBFF generation approach considering mixed-driving MBFFs whose certain bits have a higher driving strength than the other bits. To maximize the FF merging rate (and hence to minimize the final amount of clock sinks), our approach will first perform aggressive FF merging subject to timing constraints. Our merging is aggressive in the sense that we are willing to possibly oversize some FFs and allow the presence of empty bits in an MBFF to merge FFs into MBFFs of uniform driving strengths as much as possible. The oversized individual FFs of an MBFF will be later downsized subject to timing constraints by our approach, which results in a mixed-driving MBFF. Our MBFF generation approach has been combined with a commercial place and route tool, and our experimental results show the superiority of our approach over a prior work that considers uniform-driving MBFFs only in terms of the clock sink count, the FF power, the clock buffer count, and the routed clock wirelength.

DEEP: Developing Extremely Efficient Runtime On-Chip Power Meters

Zhiyao Xie
Shiyu Li
Mingyuan Ma
Chen-Chia Chang
Jingyu Pan
Yiran Chen
Jiang Hu

Accurate and efficient on-chip power modeling is crucial to runtime power, energy, and voltage management. Such power monitoring can be achieved by designing and integrating on-chip power meters (OPMs) into the target design. In this work, we propose a new method named DEEP to automatically develop extremely efficient OPM solutions for a given design. DEEP selects OPM inputs from all individual bits in RTL signals. Such bit-level selection provides an unprecedentedly large number of input candidates and supports lower hardware cost, compared with signal-level selection in prior works. In addition, DEEP proposes a powerful two-step OPM input selection method, and it supports reporting both total power and the power of major design components. Experiments on a commercial microprocessor demonstrate that DEEP’s OPM solution achieves correlation R > 0.97 in per-cycle power prediction with an unprecedented low area overhead on hardware, i.e., < 0.1% of the microprocessor layout. This reduces the OPM hardware cost by 4 — 6× compared with the state-of-the-art solution.

SESSION: Energy Efficient Hardware Acceleration and Stochastic Computing

Session details: Energy Efficient Hardware Acceleration and Stochastic Computing

Sunil Khatri
Anish Krishnakumar

ReD-LUT: Reconfigurable In-DRAM LUTs Enabling Massive Parallel Computation

Ranyang Zhou
Arman Roohi
Durga Misra
Shaahin Angizi

In this paper, we propose a reconfigurable processing-in-DRAM architecture named ReD-LUT leveraging the high density of commodity main memory to enable a flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication, division, etc.) via only memory read operation. In addition, ReD-LUT enables bulk bit-wise in-memory logic by elevating the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally-intensive applications, i.e., low-precision deep learning acceleration, and the Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with the GPU and achieves ~37.8× speedup and 2.1× energy-efficiency over the best in-DRAM bit-wise accelerators. As for AES data-encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.

Sparse-T: Hardware Accelerator Thread for Unstructured Sparse Data Processing

Pranathi Vasireddy
Krishna Kavi
Gayatri Mehta

Sparse matrix-dense vector (SpMV) multiplication is inherent in most scientific, neural networks and machine learning algorithms. To efficiently exploit sparsity of data in SpMV computations, several compressed data representations have been used. However, compressed data representations of sparse data can result in overheads of locating nonzero values, requiring indirect memory accesses which increases instruction count and memory access delays. We call these translations of compressed representations as metadata processing. We propose a memory-side accelerator for metadata (or indexing) computations and supplying only the required nonzero values to the processor, additionally permitting an overlap of indexing with core computations on nonzero elements. In this contribution, we target our accelerator for low-end micro-controllers with very limited memory and processing capabilities. In this paper we will explore two dedicated ASIC designs of the proposed accelerator that handles the indexed memory accesses for compressed sparse row (CSR) format working alongside a simple RISC-like programmable core. One version of the accelerator supplies only vector values corresponding to nonzero matrix values and the second version supplies both nonzero matrix and matching vector values for SpMV computations. Our experiments show speedups ranging between 1.3 and 2.1 times for SpMV for different levels of sparsity. Our accelerator also results in energy savings ranging between 15.8% and 52.7% over different matrix sizes, when compared to the baseline system with primary RISC-V core performing all computations. We use smaller synthetic matrices with different sparsity levels and larger real-world matrices with higher sparsity (below 1% non-zeros) in our experimental evaluations.

Sound Source Localization Using Stochastic Computing

Peter Schober
Seyedeh Newsha Estiri
Sercan Aygun
Nima TaheriNejad
M. Hassan Najafi

Stochastic computing (SC) is an alternative computing paradigm that processes data in the form of long uniform bit-streams rather than conventional compact weighted binary numbers. SC is fault-tolerant and can compute on small, efficient circuits, promising advantages over conventional arithmetic for smaller computer chips. SC has been primarily used in scientific research, not in practical applications. Digital sound source localization (SSL) is a useful signal processing technique that locates speakers using multiple microphones in cell phones, laptops, and other voice-controlled devices. SC has not been integrated into SSL in practice or theory. In this work, for the first time to the best of our knowledge, we implement an SSL algorithm in the stochastic domain and develop a functional SC-based sound source localizer. The developed design can replace the conventional design of the algorithm. The practical part of this work shows that the proposed stochastic circuit does not rely on conventional analog-to-digital conversion and can process data in the form of pulse-width-modulated (PWM) signals. The proposed SC design consumes up to 39% less area than the conventional baseline design. The SC-based design can consume less power depending on the computational accuracy, for example, 6% less power consumption for 3-bit inputs. The presented stochastic circuit is not limited to SSL and is readily applicable to other practical applications such as radar ranging, wireless location, sonar direction finding, beamforming, and sensor calibration.

SESSION: Special Session: Approximate Computing and the Efficient Machine Learning Expedition

Session details: Special Session: Approximate Computing and the Efficient Machine Learning Expedition

Medhi Tahoori

Approximate Computing and the Efficient Machine Learning Expedition

Jörg Henkel
Hai Li
Anand Raghunathan
Mehdi B. Tahoori
Swagath Venkataramani
Xiaoxuan Yang
Georgios Zervakis

Approximate computing (AxC) has been long accepted as a design alternative for efficient system implementation at the cost of relaxed accuracy requirements. Despite the AxC research activities in various application domains, AxC thrived the past decade when it was applied in Machine Learning (ML). The by definition approximate notion of ML models but also the increased computational overheads associated with ML applications-that were effectively mitigated by corresponding approximations-led to a perfect matching and a fruitful synergy. AxC for AI/ML has transcended beyond academic prototypes. In this work, we enlighten the synergistic nature of AxC and ML and elucidate the impact of AxC in designing efficient ML systems. To that end, we present an overview and taxonomy of AxC for ML and use two descriptive application scenarios to demonstrate how AxC boosts the efficiency of ML systems.

SESSION: Co-Search Methods and Tools

Session details: Co-Search Methods and Tools

Cunxi Yu
Yingyan “Celine” Lin

ObfuNAS: A Neural Architecture Search-Based DNN Obfuscation Approach

Tong Zhou
Shaolei Ren
Xiaolin Xu

Malicious architecture extraction has been emerging as a crucial concern for deep neural network (DNN) security. As a defense, architecture obfuscation is proposed to remap the victim DNN to a different architecture. Nonetheless, we observe that, with only extracting an obfuscated DNN architecture, the adversary can still retrain a substitute model with high performance (e.g., accuracy), rendering the obfuscation techniques ineffective. To mitigate this under-explored vulnerability, we propose ObfuNAS, which converts the DNN architecture obfuscation into a neural architecture search (NAS) problem. Using a combination of function-preserving obfuscation strategies, ObfuNAS ensures that the obfuscated DNN architecture can only achieve lower accuracy than the victim. We validate the performance of ObfuNAS with open-source architecture datasets like NAS-Bench-101 and NAS-Bench-301. The experimental results demonstrate that ObfuNAS can successfully find the optimal mask for a victim model within a given FLOPs constraint, leading up to 2.6% inference accuracy degradation for attackers with only 0.14× FLOPs overhead. The code is available at: https://github.com/Tongzhou0101/ObfuNAS.

Deep Learning Toolkit-Accelerated Analytical Co-Optimization of CNN Hardware and Dataflow

Rongjian Liang
Jianfeng Song
Yuan Bo
Jiang Hu

The continuous growth of CNN complexity not only intensifies the need for hardware acceleration but also presents a huge challenge. That is, the solution space for CNN hardware design and dataflow mapping becomes enormously large besides the fact that it is discrete and lacks a well behaved structure. Most previous works either are stochastic metaheuristics, such as genetic algorithm, which are typically very slow for solving large problems, or rely on expensive sampling, e.g., Gumbel Softmax-based differentiable optimization and Bayesian optimization. We propose an analytical model for evaluating power and performance of CNN hardware design and dataflow solutions. Based on this model, we introduce a co-optimization method consisting of nonlinear programming and parallel local search. A key innovation in this model is its matrix form, which enables the use of deep learning toolkit for highly efficient computations of power/performance values and gradients in the optimization. In handling power-performance tradeoff, our method can lead to better solutions than minimizing a weighted sum of power and latency. The average relative error of our model compared with Timeloop is as small as 1%. Compared to state-of-the-art methods, our approach achieves solutions with up to 1.7 × shorter inference latency, 37.5% less power consumption, and 3 × less area on ResNet 18. Moreover, it provides a 6.2 × speedup of optimization runtime.

HDTorch: Accelerating Hyperdimensional Computing with GP-GPUs for Design Space Exploration

William Andrew Simon
Una Pale
Tomas Teijeiro
David Atienza

The HyperDimensional Computing (HDC) Machine Learning (ML) paradigm is highly interesting for applications involving continuous, semi-supervised learning for long-term monitoring. However, its accuracy is not yet on par with other ML approaches, necessitating frameworks enabling fast HDC algorithm design space exploration. To this end, we introduce HDTorch, an open-source, PyTorch-based HDC library with CUDA extensions for hypervector operations. We demonstrate HDTorch’s utility by analyzing four HDC benchmark datasets in terms of accuracy, runtime, and memory consumption, utilizing both classical and online HD training methodologies. We demonstrate average (training)/inference speedups of (111x/68x)/87x for classical/online HD, respectively. We also demonstrate how HDTorch enables exploration of HDC strategies applied to large, real-world datasets. We perform the first-ever HD training and inference analysis of the entirety of the CHB-MIT EEG epilepsy database. Results show that the typical approach of training on a subset of the data may not generalize to the entire dataset, an important factor when developing future HD models for medical wearable devices.

SESSION: Reconfigurable Computing: Accelerators and Methodologies II

Session details: Reconfigurable Computing: Accelerators and Methodologies II

Peipei Zhou

DARL: Distributed Reconfigurable Accelerator for Hyperdimensional Reinforcement Learning

Hanning Chen
Mariam Issa
Yang Ni
Mohsen Imani

Reinforcement Learning (RL) is a powerful technology to solve decisionmaking problems such as robotics control. Modern RL algorithms, i.e., Deep Q-Learning, are based on costly and resource hungry deep neural networks. This motivates us to deploy alternative models for powering RL agents on edge devices. Recently, brain-inspired Hyper-Dimensional Computing (HDC) has been introduced as a promising solution for lightweight and efficient machine learning, particularly for classification.

In this work, we develop a novel platform capable of real-time hyperdimensional reinforcement learning. Our heterogeneous CPU-FPGA platform, called DARL, maximizes FPGA’s computing capabilities by applying hardware optimizations to hyperdimensional computing’s critical operations, including hardware-friendly encoder IP, the hypervector chunk fragmentation, and the delayed model update. Aside from hardware innovation, we also extend the platform to basic single-agent RL to support multi-agents distributed learning. We evaluate the effectiveness of our approach on OpenAI Gym tasks. Our results show that the FPGA platform provides on average 20× speedup compared to current state-of-the-art hyperdimensional RL methods running on Intel Xeon 6226 CPU. In addition, DARL provides around 4.8× faster and 4.2× higher energy efficiency compared to the state-of-the-art RL accelerator while ensuring a better or comparable quality of learning.

Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping

Carl-Johannes Johnsen
Tiziano De Matteis
Tal Ben-Nun
Johannes de Fine Licht
Torsten Hoefler

The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed in high levels of abstraction, such as HLS, requiring the use of hand-optimized RTL. In this paper we show how to leverage multiple clock domains for computational subdomains on reconfigurable devices through data movement analysis on high-level programs. We offer a novel view on multi-pumping as a compiler optimization — a superclass of traditional vectorization. As multiple data elements are fed and consumed, the computations are packed temporally rather than spatially. The optimization is applied automatically using an intermediate representation that maps high-level code to HLS. Internally, the optimization injects modules into the generated designs, incorporating RTL for finegrained control over the clock domains. We obtain a reduction of resource consumption by up to 50% on critical components and 23% on average. For scalable designs, this can enable further parallelism, increasing overall performance.

SESSION: Compute-in-Memory for Neural Networks

Session details: Compute-in-Memory for Neural Networks

Bo Yuan

ISSA: Input-Skippable, Set-Associative Computing-in-Memory (SA-CIM) Architecture for Neural Network Accelerators

Yun-Chen Lo
Chih-Chen Yeh
Jun-Shen Wu
Chia-Chun Wang
Yu-Chih Tsai
Wen-Chien Ting
Ren-Shuo Liu

Among several emerging architectures, computing in memory (CIM), which features in-situ analog computation, is a potential solution to the data movement bottleneck of the Von Neumann architecture for artificial intelligence (AI). Interestingly, more strengths of CIM significantly different from in-situ analog computation are not widely known yet. In this work, we point out that mutually stationary vectors (MSVs), which can be maximized by introducing associativity to CIM, are another inherent power unique to CIM. By MSVs, CIM exhibits significant freedom to dynamically vectorize the stored data (e.g., weights) to perform agile computation using the dynamically formed vectors.

We have designed and realized an SA-CIM silicon prototype and corresponding architecture and acceleration schemes in the TSMC 28 nm process. More specifically, the contributions of this paper are fourfold: 1) We identify MSVs as new features that can be exploited to improve the current performance and energy challenges of the CIM-based hardware. 2) We propose SA-CIM to enhance MSVs for skipping the zeros, small values, and sparse vectors. 3) We propose a transposed systolic dataflow to efficiently conduct conv3×3 while being capable of exploiting input-skipping schemes. 4) We propose a design flow to search for optimal aggressive skipping scheme setups while satisfying the accuracy loss constraint.

The proposed ISSA architecture improves the throughput by 1.91× to 2.97× speedup and the energy efficiency by 2.5× to 4.2×.

Computing-In-Memory Neural Network Accelerators for Safety-Critical Systems: Can Small Device Variations Be Disastrous?

Zheyu Yan
Xiaobo Sharon Hu
Yiyu Shi

Computing-in-Memory (CiM) architectures based on emerging nonvolatile memory (NVM) devices have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, NVM devices suffer from various non-idealities, especially device-to-device variations due to fabrication defects and cycle-to-cycle variations due to the stochastic behavior of devices. As such, the DNN weights actually mapped to NVM devices could deviate significantly from the expected values, leading to large performance degradation. To address this issue, most existing works focus on maximizing average performance under device variations. This objective would work well for general-purpose scenarios. But for safety-critical applications, the worst-case performance must also be considered. Unfortunately, this has been rarely explored in the literature. In this work, we formulate the problem of determining the worst-case performance of CiM DNN accelerators under the impact of device variations. We further propose a method to effectively find the specific combination of device variation in the high-dimensional space that leads to the worst-case performance. We find that even with very small device variations, the accuracy of a DNN can drop drastically, causing concerns when deploying CiM accelerators in safety-critical applications. Finally, we show that surprisingly none of the existing methods used to enhance average DNN performance in CiM accelerators are very effective when extended to enhance the worst-case performance, and further research down the road is needed to address this problem.

SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist II

Session details: Breakthroughs in Synthesis – Infrastructure and ML Assist II

Sunil Khatri
Cunxi Yu

Language Equation Solving via Boolean Automata Manipulation

Wan-Hsuan Lin
Chia-Hsuan Su
Jie-Hong R. Jiang

Language equations are a powerful tool for compositional synthesis, modeled as the unknown component problem. Given a (sequential) system specification S and a fixed component F, we are asked to synthesize an unknown component X such that whose composition with F fulfills S. The synthesis of X can be formulated with language equation solving. Although prior work exploits partitioned representation for effective finite automata manipulation, it remains challenging to solve language equations involving a large number of states. In this work, we propose variants of Boolean automata as the underlying succinct representation for regular languages. They admit logic circuit manipulation and extend the scalability for solving language equations. Experimental results demonstrate the superiority of our method to the state-of-the-art in solving nine more cases out of the 36 studied benchmarks and achieving an average of 740× speedup.

How Good Is Your Verilog RTL Code?: A Quick Answer from Machine Learning

Prianka Sengupta
Aakash Tyagi
Yiran Chen
Jiang Hu

Hardware Description Language (HDL) is a common entry point for designing digital circuits. Differences in HDL coding styles and design choices may lead to considerably different design quality and performance-power tradeoff. In general, the impact of HDL coding is not clear until logic synthesis or even layout is completed. However, running synthesis merely as a feedback for HDL code is computationally not economical especially in early design phases when the code needs to be frequently modified. Furthermore, in late stages of design convergence burdened with high-impact engineering change orders (ECO’s), design iterations become prohibitively expensive. To this end, we propose a machine learning approach to Verilog-based Register-Transfer Level (RTL) design assessment without going through the synthesis process. It would allow designers to quickly evaluate the performance-power tradeoff among different options of RTL designs. Experimental results show that our proposed technique achieves an average of 95% prediction accuracy in terms of post-placement analysis, and is 6 orders of magnitude faster than evaluation by running logic synthesis and placement.

SESSION: In-Memory Computing Revisited

Session details: In-Memory Computing Revisited

Biresh Kumar Joardar
Ulf Schlichtmann

Logic Synthesis for Digital In-Memory Computing

Muhammad Rashedul Haq Rashed
Sumit Kumar Jha
Rickard Ewetz

Processing in-memory is a promising solution strategy for accelerating data-intensive applications. While analog in-memory computing is extremely efficient, the limited precision is only acceptable for approximate computing applications. Digital in-memory computing provides the deterministic precision required to accelerate high assurance applications. State-of-the-art digital in-memory computing schemes rely on manually decomposing arithmetic operations into in-memory compute kernels. In contrast, traditional digital circuits are synthesized using complex and automated design flows. In this paper, we propose a logic synthesis framework called LOGIC for mapping high-level applications into digital in-memory compute kernels that can be executed using non-volatile memory. We first propose techniques to decompose element-wise arithmetic operations into in-memory kernels while minimizing the number of in-memory operations. Next, the sequence of the in-memory operation is optimized to minimize non-volatile memory utilization. Lastly, data layout re-organization is used to efficiently accelerate applications dominated by sparse matrix-vector multiplication operations. The experimental evaluations show that the proposed synthesis approach improves the area and latency of fixed-point multiplication by 77% and 20% over the state-of-the-art, respectively. On scientific computing applications from Suite Sparse Matrix Collection, the proposed design improves the area, latency and, energy by 3.6X, 2.6X, and 8.3X, respectively.

Design Space and Memory Technology Co-Exploration for In-Memory Computing Based Machine Learning Accelerators

Kang He
Indranil Chakraborty
Cheng Wang
Kaushik Roy

In-Memory Computing (IMC) has become a promising paradigm for accelerating machine learning (ML) inference. While IMC architectures built on various memory technologies have demonstrated higher throughput and energy efficiency compared to conventional digital architectures, little research has been done from system-level perspective to provide comprehensive and fair comparisons of different memory technologies under the same hardware budget (area). Since large-scale analog IMC hardware relies on the costly analog-digital converters (ADCs) for robust digital communication, optimizing IMC architecture performance requires synergistic co-design of memory arrays and peripheral ADCs, wherein the trade-offs could depend on the underlying memory technologies. To that effect, we co-explore IMC macro design space and memory technology to identify the best design point for each memory type under iso-area budgets, aiming to make fair comparisons among different technologies, including SRAM, phase change memory, resistive RAM, ferroelectrics and spintronics. First, an extended simulation framework employing spatial architecture with off-chip DRAM is developed, capable of integrating both CMOS and nonvolatile memory technologies. Subsequently, we propose different modes of ADC operations with distinctive weight mapping schemes to cope with different on-chip area budgets. Our results show that under an iso-area budget, the various memory technologies being evaluated will need to adopt different IMC macro-level designs to deliver the optimal energy-delay-product (EDP) at system level. We demonstrate that under small area budgets, the choice of best memory technology is determined by its cell area and writing energy. While area budgets are larger, cell area becomes the dominant factor for technology selection.

SESSION: Special Session: 2022 CAD Contest at ICCAD

Session details: Special Session: 2022 CAD Contest at ICCAD

Yu-Guang Chen

Overview of 2022 CAD Contest at ICCAD

Yu-Guang Chen
Chun-Yao Wang
Tsung-Wei Huang
Takashi Sato

The “CAD Contest at ICCAD” is a challenging, multi-month, research and development competition, focusing on advanced, real-world problems in the field of electronic design automation (EDA). Since 2012, the contest has been publishing many sophisticated circuit design problems, from system-level design to physical design, together with industrial benchmarks and solution evaluators. Contestants can participate in one or more problems provided by EDA/IC industry. The winners will be awarded at an ICCAD special session dedicated to this contest. Every year, the contest attracts more than a hundred teams, fosters productive industry-academia collaborations, and leads to hundreds of publications in top-tier conferences and journals. The 2022 CAD Contest has 166 teams from all over the world. Moreover, the problems of this year cover state-of-the-art EDA research trends such as circuit security, 3D-IC, and design space exploration from well-known EDA/IC companies. We believe the contest keeps enhancing impact and boosting EDA researches.

2022 CAD Contest Problem A: Learning Arithmetic Operations from Gate-Level Circuit

Chung-Han Chou
Chih-Jen (Jacky) Hsu
Chi-An (Rocky) Wu
Kuan-Hua Tu

Extracting circuit functionality from a gate-level netlist is critical in CAD tools. For security, it helps designers to detect hardware Trojans or malicious design changes in the netlist with third-party resources such as fabrication services and soft/hard IP cores. For verification, it can reduce the complexity and effort of keeping design information in aggressive optimization strategies adopted by synthesis tools. For Engineering Change Order (ECO), it can keep the designer from locating the ECO gate in a sea of bit-level gates.

In this contest, we formulated a datapath learning and extraction problem. With a set of benchmarks and an evaluation metric, we expect contestants to develop a tool to learn the arithmetic equations from a synthesized gate-level netlist.

2022 ICCAD CAD Contest Problem B: 3D Placement with D2D Vertical Connections

Kai-Shun Hu
I-Jye Lin
Yu-Hui Huang
Hao-Yu Chi
Yi-Hsuan Wu
Chin-Fang Cindy Shen

In the chiplet era, the benefits from multiple factors can be observed by splitting a large single die into multiple small dies. By having the multiple small dies with die-to-die (D2D) vertical connections, the benefits including: 1) better yield, 2) better timing/performance, and 3) better cost. How to do the netlist partitioning, cell placement in each of the small dies, and also how to determine the location of the D2D inter-connection terminals becomes a new topic.

To address this chiplet era physical implementation problem, ICCAD-2022 contest encourages the research in the techniques of multi-die netlist partitioning and placement with D2D vertical connections. We provided (i) a set of benchmarks and (ii) an evaluation metric that facilitate contestants to develop, test, and evaluate their new algorithms.

2022 ICCAD CAD Contest Problem C: Microarchitecture Design Space Exploration

Sicheng Li
Chen Bai
Xuechao Wei
Bizhao Shi
Yen-Kuang Chen
Yuan Xie

It is vital to select microarchitectures to achieve good trade-offs between performance, power, and area in the chip development cycle. Combining high-level hardware description languages and optimization of electronic design automation tools empowers microarchitecture exploration at the circuit level. Due to the extremely large design space and high runtime cost to evaluate a microarchitecture, ICCAD 2022 CAD Contest Problem C calls for an effective design space exploration algorithm to solve the problem. We formulate the research topic as a contest problem and provide benchmark suites, contest benchmark platforms, etc., for all contestants to innovate and estimate their algorithms.

IEEE CEDA DATC: Expanding Research Foundations for IC Physical Design and ML-Enabled EDA

Jinwook Jung
Andrew B. Kahng
Ravi Varadarajan
Zhiang Wang

This paper describes new elements in the RDF-2022 release of the DATC Robust Design Flow, along with other activities of the IEEE CEDA DATC. The RosettaStone initiated with RDF-2021 has been augmented to include 35 benchmarks and four open-source technologies (ASAP7, NanGate45 and SkyWater130HS/HD), plus timing-sensible versions created using path-cutting. The Hier-RTLMP macro placer is now part of DATC RDF, enabling macro placement for large modern designs with hundreds of macros. To establish a clear baseline for macro placers, new open-source benchmark suites on open PDKs, with corresponding flows for fully reproducible results, are provided. METRICS2.1 infrastructure in OpenROAD and OpenROAD-flow-scripts now uses native JSON metrics reporting, which is more robust and general than the previous Python script-based method. Calibrations on open enablements have also seen notable updates in the RDF. Finally, we also describe an approach to establishing a generic, cloud-native large-scale design of experiments for ML-enabled EDA. Our paper closes with future research directions related to DATC’s efforts.

SESSION: Architectures and Methodologies for Advanced Hardware Security

Session details: Architectures and Methodologies for Advanced Hardware Security

Amin Rezaei
Gang Qu

Inhale: Enabling High-Performance and Energy-Efficient In-SRAM Cryptographic Hash for IoT

Jingyao Zhang
Elaheh Sadredini

In the age of big data, information security has become a major issue of debate, especially with the rise of the Internet of Things (IoT), where attackers can effortlessly obtain physical access to edge devices. The hash algorithm is the current foundation for data integrity and authentication. However, it is challenging to provide a high-performance, high-throughput, and energy-efficient solution on resource-constrained edge devices. In this paper, we propose Inhale, an in-SRAM architecture to effectively compute hash algorithms with innovative data alignment and efficient read/write strategies to implicitly execute data shift operations through the in-situ controller. We present two variations of Inhale: Inhale-Opt, which is optimized for latency, throughput, and area-overhead; and Inhale-Flex, which offers flexibility in repurposing a part of last-level caches for hash computation. We thoroughly evaluate our proposed architectures on both SRAM and ReRAM memories and compare them with the state-of-the-art in-memory and ASIC accelerators. Our performance evaluation confirms that Inhale can achieve 1.4× – 14.5× higher throughput-per-area and about two-orders-of-magnitude higher throughput-per-area-per-energy compared to the state-of-the-art solutions.

Accelerating N-Bit Operations over TFHE on Commodity CPU-FPGA

Kevin Nam
Hyunyoung Oh
Hyungon Moon
Yunheung Paek

TFHE is a fully homomorphic encryption (FHE) scheme that evaluates Boolean gates, which we will hereafter call Tgates, over encrypted data. TFHE is considered to have higher expressive power than many existing schemes in that it is able to compute not only N-bit Arithmetic operations but also Logical/Relational ones as arbitrary ALR operations can be represented by Tgate circuits. Despite such strength, TFHE has a weakness that like all other schemes, it suffers from colossal computational overhead. Incessant efforts to reduce the overhead have been made by exploiting the inherent parallelism of FHE operations on ciphertexts. Unlike other FHE schemes, the parallelism of TFHE can be decomposed into multilayers: one inside each FHE operation (equivalent to a single Tgate) and the other between Tgates. Unfortunately, previous works focused only on exploiting the parallelism inside Tgate. However, as each N-bit operation over TFHE corresponds to a Tgate circuit constructed from multiple Tgates, it is also necessary to utilize the parallelism between Tgates for optimizing an entire operation. This paper proposes an acceleration technique to maximize performance of a TFHE N-bit operation by simultaneously utilizing both parallelism comprising the operation. To fully profit from both layers of parallelism, we have implemented our technique on a commodity CPU-FPGA hybrid machine with parallel execution capabilities in hardware. Our implementation outperforms prior ones by 2.43× in throughput and 12.19× in throughput per watt when performing N-bit operations under the 128-bit quantum security parameters.

Fast and Compact Interleaved Modular Multiplication Based on Carry Save Addition

Oleg Mazonka
Eduardo Chielle
Deepraj Soni
Michail Maniatakos

Improving fully homomorphic encryption computation by designing specialized hardware is an active topic of research. The most prominent encryption schemes operate on long polynomials requiring many concurrent modular multiplications of very big numbers. Thus, it is crucial to use many small and efficient multipliers. Interleaved and Montgomery iterative multipliers are the best candidates for the task. Interleaved designs, however, suffer from longer latency as they require a number comparison within each iteration; Montgomery designs, on the other hand, need extra conversion of the operands or the result. In this work, we propose a novel hardware design that combines the best of both worlds: Exhibiting the carry save addition of Montgomery designs without the need for any domain conversions. Experimental results demonstrate improved latency-area product efficiency by up to 47% when compared to the standard Interleaved multiplier for large arithmetic word sizes.

Accelerating Fully Homomorphic Encryption by Bridging Modular and Bit-Level Arithmetic

Eduardo Chielle
Oleg Mazonka
Homer Gamil
Michail Maniatakos

The dramatic increase of data breaches in modern computing platforms has emphasized that access control is not sufficient to protect sensitive user data. Recent advances in cryptography allow end-to-end processing of encrypted data without the need for decryption using Fully Homomorphic Encryption (FHE). Such computation however, is still orders of magnitude slower than direct (unencrypted) computation. Depending on the underlying cryptographic scheme, FHE schemes can work natively either at bit-level using Boolean circuits, or over integers using modular arithmetic. Operations on integers are limited to addition/subtraction and multiplication. On the other hand, bit-level arithmetic is much more comprehensive allowing more operations, such as comparison and division. While modular arithmetic can emulate bit-level computation, there is a significant cost in performance. In this work, we propose a novel method, dubbed bridging, that blends faster and restricted modular computation with slower and comprehensive bit-level computation, making them both usable within the same application and with the same cryptographic scheme instantiation. We introduce and open source C++ types representing the two distinct arithmetic modes, offering the possibility to convert from one to the other. Experimental results show that bridging modular and bit-level arithmetic computation can lead to 1–2 orders of magnitude performance improvement for tested synthetic benchmarks, as well as one real-world FHE application: a genotype imputation case study.

SESSION: Special Session: The Dawn of Domain-Specific Hardware Accelerators for Robotic Computing

Session details: Special Session: The Dawn of Domain-Specific Hardware Accelerators for Robotic Computing

Jiang Hu

A Reconfigurable Hardware Library for Robot Scene Perception

Yanqi Liu
Anthony Opipari
Odest Chadwicke Jenkins
R. Iris Bahar

Perceiving the position and orientation of objects (i.e., pose estimation) is a crucial prerequisite for robots acting within their natural environment. We present a hardware acceleration approach to enable real-time and energy efficient articulated pose estimation for robots operating in unstructured environments. Our hardware accelerator implements Nonparametric Belief Propagation (NBP) to infer the belief distribution of articulated object poses. Our approach is on average, 26× more energy efficient than a high-end GPU and 11× faster than an embedded low-power GPU implementation. Moreover, we present a Monte-Carlo Perception Library generated from high-level synthesis to enable reconfigurable hardware designs on FPGA fabrics that are better tuned to user-specified scene, resource, and performance constraints.

Analyzing and Improving Resilience and Robustness of Autonomous Systems

Zishen Wan
Karthik Swaminathan
Pin-Yu Chen
Nandhini Chandramoorthy
Arijit Raychowdhury

Autonomous systems have reached a tipping point, with a myriad of self-driving cars, unmanned aerial vehicles (UAVs), and robots being widely applied and revolutionizing new applications. The continuous deployment of autonomous systems reveals the need for designs that facilitate increased resiliency and safety. The ability of an autonomous system to tolerate, or mitigate against errors, such as environmental conditions, sensor, hardware and software faults, and adversarial attacks, is essential to ensure its functional safety. Application-aware resilience metrics, holistic fault analysis frameworks, and lightweight fault mitigation techniques are being proposed for accurate and effective resilience and robustness assessment and improvement. This paper explores the origination of fault sources across the computing stack of autonomous systems, discusses the various fault impacts and fault mitigation techniques of different scales of autonomous systems, and concludes with challenges and opportunities for assessing and building next-generation resilient and robust autonomous systems.

Factor Graph Accelerator for LiDAR-Inertial Odometry (Invited Paper)

Yuhui Hao
Bo Yu
Qiang Liu
Shaoshan Liu
Yuhao Zhu

Factor graph is a graph representing the factorization of a probability distribution function, and has been utilized in many autonomous machine computing tasks, such as localization, tracking, planning and control etc. We are developing an architecture with the goal of using factor graph as a common abstraction for most, if not, all autonomous machine computing tasks. If successful, the architecture would provide a very simple interface of mapping autonomous machine functions to the underlying compute hardware. As a first step of such an attempt, this paper presents our most recent work of developing a factor graph accelerator for LiDAR-Inertial Odometry (LIO), an essential task in many autonomous machines, such as autonomous vehicles and mobile robots. By modeling LIO as a factor graph, the proposed accelerator not only supports multi-sensor fusion such as LiDAR, inertial measurement unit (IMU), GPS, etc., but solves the global optimization problem of robot navigation in batch or incremental modes. Our evaluation demonstrates that the proposed design significantly improves the real-time performance and energy efficiency of autonomous machine navigation systems. The initial success suggests the potential of generalizing the factor graph architecture as a common abstraction for autonomous machine computing, including tracking, planning, and control etc.

Hardware Architecture of Graph Neural Network-Enabled Motion Planner (Invited Paper)

Lingyi Huang
Xiao Zang
Yu Gong
Bo Yuan

Motion planning aims to find a collision-free trajectory from the start to goal configurations of a robot. As a key cognition task for all the autonomous machines, motion planning is fundamentally required in various real-world robotic applications, such as 2-D/3-D autonomous navigation of unmanned mobile and aerial vehicles and high degree-of-freedom (DoF) autonomous manipulation of industry/medical robot arms and graspers.

Motion planning can be performed using either non-learning-based classical algorithms or learning-based neural approaches. Most recently, the powerful capabilities of deep neural networks (DNNs) make neural planners become very attractive because of their superior planning performance over the classical methods. In particular, graph neural network (GNN)-enabled motion planner has demonstrated the state-of-the-art performance across a set of challenging high-dimensional planning tasks, motivating the efficient hardware acceleration to fully unleash its potential and promote its widespread deployment in practical applications.

To that end, in this paper we perform preliminary study of the efficient accelerator design of the GNN-based neural planner, especially for the neural explorer as the key component of the entire planning pipeline. By performing in-depth analysis on the different design choices, we identify that the hybrid architecture, instead of the uniform sparse matrix multiplication (SpMM)-based solution that is popularly adopted in the existing GNN hardware, is more suitable for our target neural explorer. With a set of optimization on microarchitecture and dataflow, several design challenges incurred by using hybrid architecture, such as extensive memory access and imbalanced workload, can be efficiently mitigated. Evaluation results show that our proposed customized hardware architecture achieves order-of-magnitude performance improvement over the CPU/GPU-based implementation with respect to area and energy efficiency in various working environments.

SESSION: From Logical to Physical Qubits: New Models and Techniques for Mapping

Session details: From Logical to Physical Qubits: New Models and Techniques for Mapping

Weiwen Jiang

A Robust Quantum Layout Synthesis Algorithm with a Qubit Mapping Checker

Tsou-An Wu
Yun-Jhe Jiang
Shao-Yun Fang

Layout synthesis in quantum circuits maps the logical qubits of a synthesized circuit onto the physical qubits of a hardware device (coupling graph) and complies with the hardware limitations. Existing studies on the problem usually suffer from intractable formulation complexity and thus prohibitively long runtimes. In this paper, we propose an efficient layout synthesizer by developing a satisfiability modulo theories (SMT)-based qubit mapping checker. The proposed qubit mapping checker can efficiently derive a SWAP-free solution if one exists. If no SWAP-free solution exists for a circuit, we propose a divide-and-conquer scheme that utilizes the checker to find SWAP-free sub-solutions for sub-circuits, and the overall solution is found by merging sub-solutions with SWAP insertion. Experimental results show that the proposed optimization flow can achieve more than 3000× runtime speedup over a state-of-the-art work to derive optimal solutions for a set of SWAP-free circuits. Moreover, for the other set of benchmark circuits requiring SWAP gates, our flow achieves more than 800× speedup and obtains near-optimal solutions with only 3% SWAP overhead.

Reinforcement Learning and DEAR Framework for Solving the Qubit Mapping Problem

Ching-Yao Huang
Chi-Hsiang Lien
Wai-Kei Mak

Quantum computing is gaining more and more attention due to its huge potential and the constant progress in quantum computer development. IBM and Google have released quantum architectures with more than 50 qubits. However, in these machines, the physical qubits are not fully connected so that two-qubit interaction can only be performed between specific pairs of the physical qubits. To execute a quantum circuit, it is necessary to transform it into a functionally equivalent one that respects the constraints imposed by the target architecture. Quantum circuit transformation inevitably introduces additional gates which reduces the fidelity of the circuit. Therefore, it is important that the transformation method completes the transformation with minimal overheads. It consists of two steps, initial mapping and qubit routing. Here we propose a reinforcement learning-based model to solve the initial mapping problem. Initial mapping is formulated as sequence-to-sequence learning and self-attention network is used to extract features from a circuit. For qubit routing, a DEAR (Dynamically-Extract-and-Route) framework is proposed. The framework iteratively extracts a subcircuit and uses A* search to determine when and where to insert additional gates. It helps to preserve the lookahead ability dynamically and to provide more accurate cost estimation efficiently during A* search. The experimental results show that our RL-model generates better initial mappings than the best known algorithms with 12% fewer additional gates in the qubit routing stage. Furthermore, our DEAR-framework outperforms the state-of-the-art qubit routing approach with 8.4% and 36.3% average reduction in the number of additional gates and execution time starting from the same initial mapping.

Qubit Mapping for Reconfigurable Atom Arrays

Bochen Tan
Dolev Bluvstein
Mikhail D. Lukin
Jason Cong

Because of the largest number of qubits available, and the massive parallel execution of entangling two-qubit gates, atom arrays is a promising platform for quantum computing. The qubits are selectively loaded into arrays of optical traps, some of which can be moved during the computation itself. By adjusting the locations of the traps and shining a specific global laser, different pairs of qubits, even those initially far away, can be entangled at different stages of the quantum program execution. In comparison, previous QC architectures only generate entanglement on a fixed set of quantum register pairs. Thus, reconfigurable atom arrays (RAA) present a new challenge for QC compilation, especially the qubit mapping/layout synthesis stage which decides the qubit placement and gate scheduling. In this paper, we consider an RAA QC architecture that contains multiple arrays, supports 2D array movements, represents cutting-edge experimental platforms, and is much more general than previous works. We start by systematically examining the fundamental constraints on RAA imposed by physics. Built upon this understanding, we discretize the state space of the architecture, and we formulate layout synthesis for such an architecture to a satisfactory modulo theories problem. Finally, we demonstrate our work by compiling the quantum approximate optimization algorithm (QAOA), one of the promising near-term quantum computing applications. Our layout synthesizer reduces the number of required native two-qubit gates in 22-qubit QAOA by 5.72x (geomean) compared to leading experiments on a superconducting architecture. Combined with a better coherence time, there is an order-of-magnitude increase in circuit fidelity.

MCQA: Multi-Constraint Qubit Allocation for Near-FTQC Device

Sunghye Park
Dohun Kim
Jae-Yoon Sim
Seokhyeong Kang

In response to the rapid development of quantum processors, quantum software must be advanced by considering the actual hardware limitations. Among the various design automation problems in quantum computing, qubit allocation modifies the input circuit to match the hardware topology constraints. In this work, we present an effective heuristic approach for qubit allocation that considers not only the hardware topology but also other constraints for near-fault-tolerant quantum computing (near-FTQC). We propose a practical methodology to find an effective initial mapping to reduce both the number of gates and circuit latency. We then perform dynamic scheduling to maximize the number of gates executed in parallel in the main mapping phase. Our experimental results with a Surface-17 processor confirmed a substantial reduction in the number of gates, latency, and runtime by 58%, 28%, and 99%, respectively, compared with the previous method [18]. Moreover, our mapping method is scalable and has a linear time complexity with respect to the number of gates.

SESSION: Smart Embedded Systems (Virtual)

Session details: Smart Embedded Systems (Virtual)

Leonidas Kosmidis
Pietro Mercati

Smart Scissor: Coupling Spatial Redundancy Reduction and CNN Compression for Embedded Hardware

Hao Kong
Di Liu
Shuo Huai
Xiangzhong Luo
Weichen Liu
Ravi Subramaniam
Christian Makaya
Qian Lin

Scaling down the resolution of input images can greatly reduce the computational overhead of convolutional neural networks (CNNs), which is promising for edge AI. However, as an image usually contains much spatial redundancy, e.g., background pixels, directly shrinking the whole image will lose important features of the foreground object and lead to severe accuracy degradation. In this paper, we propose a dynamic image cropping framework to reduce the spatial redundancy by accurately cropping the foreground object from images. To achieve the instance-aware fine cropping, we introduce a lightweight foreground predictor to efficiently localize and crop the foreground of an image. The finely cropped images can be correctly recognized even at a small resolution. Meanwhile, computational redundancy also exists in CNN architectures. To pursue higher execution efficiency on resource-constrained embedded devices, we also propose a compound shrinking strategy to coordinately compress the three dimensions (depth, width, resolution) of CNNs. Eventually, we seamlessly combine the proposed dynamic image cropping and compound shrinking into a unified compression framework, Smart Scissor, which is expected to significantly reduce the computational overhead of CNNs while still maintaining high accuracy. Experiments on ImageNet-1K demonstrate that our method reduces the computational cost of ResNet50 by 41.5% while improving the top-1 accuracy by 0.3%. Moreover, compared to HRank, the state-of-the-art CNN compression framework, our method achieves 4.1% higher top-1 accuracy at the same computational cost. The codes and data are available at https://github.com/ntuliuteam/smart-scissor

SHAPE: Scheduling of Fixed-Priority Tasks on Heterogeneous Architectures with Multiple CPUs and Many PEs

Yuankai Xu
Tiancheng He
Ruiqi Sun
Yehan Ma
Yier Jin
An Zou

Despite being employed in burgeoning efforts to accelerate artificial intelligence, heterogeneous architectures have yet to be well managed with strict timing constraints. As a classic task model, multi-segment self-suspension (MSSS) has been proposed for general I/O-intensive systems and computation offloading. However, directly applying this model to heterogeneous architectures with multiple CPUs and many processing units (PEs) suffers tremendous pessimism. In this paper, we present a real-time scheduling approach, SHAPE, for general heterogeneous architectures with significant schedulability and improved utilization rate. We start with building the general task execution pattern on a heterogeneous architecture integrating multiple CPU cores and many PEs such as GPU streaming multiprocessors and FPGA IP cores. A real-time scheduling strategy and corresponding schedulability analysis are presented following the task execution pattern. Compared with state-of-the-art scheduling algorithms through comprehensive experiments on unified and versatile tasks, SHAPE improves the schedulability by 11.1% – 100%. Moreover, experiments performed on the NVIDIA GPU systems further indicate up to 70.9% of pessimism reduction can be achieved by the proposed scheduling. Since we target general heterogeneous architectures, SHAPE can be directly applied to off-the-shelf heterogeneous computing systems with guaranteed deadlines and improved schedulability.

On Minimizing the Read Latency of Flash Memory to Preserve Inter-Tree Locality in Random Forest

Yu-Cheng Lin
Yu-Pei Liang
Tseng-Yi Chen
Yuan-Hao Chang
Shuo-Han Chen
Wei-Kuan Shih

Many prior research works have been widely discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor. That is, an inference model will be constructed on a personal computer or a server platform, and then integrated into embedded systems for just-in-time inference. With the consideration of the limited main memory space in embedded systems, an important problem for embedded machine learning systems is how to efficiently move inference model between the main memory and a secondary storage (e.g., flash memory). For tackling this problem, we need to consider how to preserve the locality inside the inference model during model construction. Therefore, we have proposed a solution, namely locality-aware random forest (LaRF), to preserve the inter-locality of all decision trees within a random forest model during the model construction process. Owing to the locality preservation, LaRF can improve the read latency by 81.5% at least, compared to the original random forest library.

SESSION: Analog/Mixed-Signal Simulation, Layout, and Packaging (Virtual)

Session details: Analog/Mixed-Signal Simulation, Layout, and Packaging (Virtual)

Biying Xu
Ilya Yusim

Numerically-Stable and Highly-Scalable Parallel LU Factorization for Circuit Simulation

Xiaoming Chen

A number of sparse linear systems are solved by sparse LU factorization in a circuit simulation process. The coefficient matrices of these linear systems have the identical structure but different values. Pivoting is usually needed in sparse LU factorization to ensure the numerical stability, which leads to the difficulty of predicting the exact dependencies for scheduling parallel LU factorization. However, the matrix values usually change smoothly in circuit simulation iterations, which provides the potential to “guess” the dependencies. This work proposes a novel parallel LU factorization algorithm with pivoting reduction, but the numerical stability is equivalent to LU factorization with pivoting. The basic idea is to reuse the previous structural and pivoting information as much as possible to perform highly-scalable parallel factorization without pivoting, which is scheduled by the “guessed” dependencies. Once a pivot is found to be too small, the remaining matrix is factorized with pivoting in a pipelined way. Comprehensive experiments including comparisons with state-of-the-art CPU- and GPU-based parallel sparse direct solvers on 66 circuit matrices and real SPICE DC simulations on 4 circuit netlists reveal the superior performance and scalability of the proposed algorithm. The proposed solver is available at https://github.com/chenxm1986/cktso.

EI-MOR: A Hybrid Exponential Integrator and Model Order Reduction Approach for Transient Power/Ground Network Analysis

Cong Wang
Dongen Yang
Quan Chen

Exponential integrator (EI) method has been proved to be an effective technique to accelerate large-scale transient power/ground network analysis. However, EI requires the inputs to be piece-wise linear (PWL) in one step, which greatly limits the step size when the inputs are poorly aligned. To address this issue, in this work we first elucidate with mathematical proof that EI, when used together with the rational Krylov subspace, is equivalent to performing a moment-matching model order reduction (MOR) with single input in each time step, then advancing the reduced system using EI in the same step. Based on this equivalence, we next devise a hybrid method, EI-MOR, to combine the usage of EI and MOR in the same transient simulation. A majority group of well-aligned inputs are still treated by EI as usual, while a few misaligned inputs are selected to be handled by a MOR process producing a reduced model that works for arbitrary inputs. Therefore the step size limitation imposed by the misaligned inputs can be largely alleviated. Numerical experiments are conducted to demonstrate the efficacy of the proposed method.

Multi-Package Co-Design for Chiplet Integration

Zhen Zhuang
Bei Yu
Kai-Yuan Chao
Tsung-Yi Ho

Due to the cost and design complexity associated with advanced technology nodes, it is difficult for traditional monolithic System-on-Chip to follow the Moore’s Law, which means the economic benefits have been weakened. Semiconductor industries are looking for advanced packages to improve the economic advantages. Since the multi-chiplet architecture supporting heterogeneous integration has the robust re-usability and effective cost reduction, chiplet integration has become the mainstream of advanced packages. Nowadays, the number of mounted chiplets in a package is continuously increasing with the requirement of high system performance. However, the large area caused by the increasing of chiplets leads to the serious reliability issues, including warpage and bump stress, which worsens the yield and cost. The multi-package architecture, which can distribute chiplets to multiple packages and use less area of each package, is a popular alternative to enhance the reliability and reduce the cost in advanced packages. However, the primary challenge of the multi-package architecture lies in the tradeoff between the inter-package costs, i.e., the interconnection among packages, and the intra-package costs, i.e., the reliability caused by warpage and bump stress. Therefore, a co-design methodology is indispensable to optimize multiple packages simultaneously to improve the quality of the whole system. To tackle this challenge, we adopt mathematical programming methods in the multi-package co-design problem regarding the nature of the synergistic optimization of multiple packages. To the best of our knowledge, this is the first work to solve the multi-package co-design problem.

SESSION: Advanced PIM and Biochip Technology and Stochastic Computing (Virtual)

Session details: Advanced PIM and Biochip Technology and Stochastic Computing (Virtual)

Grace Li Zhang

Gzippo: Highly-Compact Processing-in-Memory Graph Accelerator Alleviating Sparsity and Redundancy

Xing Li
Rachata Ausavarungnirun
Xiao Liu
Xueyuan Liu
Xuan Zhang
Heng Lu
Zhuoran Song
Naifeng Jing
Xiaoyao Liang

Graph application plays a significant role in real-world data computation. However, the memory access patterns become the performance bottleneck of the graph applications, which include low compute-to-communication ratio, poor temporal locality, and poor spatial locality. Existing RRAM-based processing-in-memory accelerators reduce the data movements but fail to address both sparsity and redundancy of graph data. In this work, we present Gzippo, a highly-compact design that supports graph computation in the compressed sparse format. Gzippo employs a tandem-isomorphic-crossbar architecture both to eliminate redundant searches and sequential indexing during iterations, and to remove sparsity leading to non-effective computation on zero values. Gzippo achieves a 3.0× (up to 17.4×) performance speedup, 23.9× (up to 163.2×) energy efficiency over state-of-the-art RRAM-based PIM accelerator, respectively.

CoMUX: Combinatorial-Coding-Based High-Performance Microfluidic Control Multiplexer Design

Siyuan Liang
Mengchu Li
Tsun-Ming Tseng
Ulf Schlichtmann
Tsung-Yi Ho

Flow-based microfluidic chips are one of the most promising platforms for biochemical experiments. Transportation channels and operation devices inside these chips are controlled by microvalves, which are driven by external pressure sources. As the complexity of experiments on these chips keeps increasing, control multiplexers (MUXes) become necessary for the actuation of the enormous number of valves. However, current binary-coding-based MUXes do not take full advantage of the coding capacity and suffer from the reliability problem caused by the high control channel density. In this work, we propose a novel MUX coding strategy, named Combinatorial Coding, along with an algorithm to synthesize combinatorial-coding-based MUXes (CoMUXes) of arbitrary sizes with the proven maximum coding capacity. Moreover, we develop a simplification method to reduce the number of valves and control channels in CoMUXes and thus improve their reliability. We compare CoMUX with the state-of-the-art MUXes under different control demands with up to 10 × 2¹³ independent control channels. Experiments show that CoMUXes can reliably control more independent control channels with fewer resources. For example, when the number of the to-be-controlled control channels is up to 10 × 2¹³, compared to a state-of-the-art MUX, the optimized CoMUX reduces the number of required flow channels by 44% and the number of valves by 90%.

Exploiting Uniform Spatial Distribution to Design Efficient Random Number Source for Stochastic Computing

Kuncai Zhong
Zexi Li
Haoran Jin
Weikang Qian

Stochastic computing (SC) generally suffers from long latency. One solution is to apply proper random number sources (RNSs). Nevertheless, current RNS designs either have high hardware cost or low accuracy. To address the issue, motivated by that the uniform spatial distribution generally leads to a high accuracy for an SC circuit, we propose a basic architecture to generate the uniform spatial distribution and a further detailed implementation of it. For the implementation, we further propose a method to optimize its hardware cost and a method to optimize its accuracy. The method for hardware cost optimization can optimize the hardware cost without affecting the accuracy. The experimental results show that our proposed implementation can achieve both low hardware cost and high accuracy. Compared to the state-of-the-art stochastic number generator design, the proposed design can reduce 88% area with close accuracy.

SESSION: On Automating Heterogeneous Designs (Virtual)

Session details: On Automating Heterogeneous Designs (Virtual)

Haocheng Li

A Novel Blockage-Avoiding Macro Placement Approach for 3D ICs Based on POCS

Jai-Ming Lin
Po-Chen Lu
Heng-Yu Lin
Jia-Ting Tsai

Although the 3D integrated circuit (IC) placement problem has been studied for many years, few publications devoted to the macro legalization. Due to large sizes of macros, the macro placement problem is harder than cell placement, especially when preplaced macros exist in a multi-tier structure. In order to have a more global view, this paper proposes the partitioning-last macro-first flow to handle 3D placement for mixed-size designs, which performs tier partitioning after placement prototyping and then legalizes macros before cell placement. A novel two-step approach is proposed to handle 3D macro placement. The first step determines locations of macros in a projection plane based on a new representation, named K-tier Partially Occupied Corner Stitching. It not only can keep the prototyping result but also guarantees a legal placement after tier assignment of macros. Next, macros are assigned to respective tiers by Integer Linear Programming (ILP) algorithm. Experimental results show that our design flow can obtain better solutions than other flows especially in the cases with more preplaced macros.

Routability-Driven Analytical Placement with Precise Penalty Models for Large-Scale 3D ICs

Jai-Ming Lin
Hao-Yuan Hsieh
Hsuan Kung
Hao-Jia Lin

Quality of a true 3D placement approach greatly relies on the correctness of the models used in its formulation. However, the models used by previous approaches are not precise enough. Moreover, they do not actually place TSVs which makes their approach unable to get accurate wirelength and construct a correct congestion map. Besides, they rarely discuss routability which is the most important issue considered in 2D placement. To resolve this insufficiency, this paper proposes more accurate models to estimate placement utilization and TSV number by the softmax function which can align cells to exact tiers. Moreover, we propose a fast parallel algorithm to update the locations of TSVs when cells are moved during optimization. Finally, we present a novel penalty model to estimate routing overflow of regions covered by cells and inflate cells in congested regions according to this model. Experimental results show that our methodology can obtain better results than previous works.

SESSION: Special Session: Quantum Computing to Solve Chemistry, Physics and Security Problems (Virtual)

Session details: Special Session: Quantum Computing to Solve Chemistry, Physics and Security Problems (Virtual)

Swaroop Ghosh

Quantum Machine Learning for Material Synthesis and Hardware Security (Invited Paper)

Collin Beaudoin
Satwik Kundu
Rasit Onur Topaloglu
Swaroop Ghosh

Using quantum computing, this paper addresses two scientifically-pressing and day to day-relevant problems, namely, chemical retrosynthesis which is an important step in drug/material discovery and security of semiconductor supply chain. We show that Quantum Long Short-Term Memory (QLSTM) is a viable tool for retrosynthesis. We achieve 65% training accuracy with QLSTM whereas classical LSTM can achieve 100%. However, in testing we achieve 80% accuracy with the QLSTM while classical LSTM peaks at only 70% accuracy! We also demonstrate an application of Quantum Neural Network (QNN) in the hardware security domain, specifically in Hardware Trojan (HT) detection using a set of power and area Trojan features. The QNN model achieves detection accuracy as high as 97.27%.

Quantum Machine Learning Applications in High-Energy Physics

Andrea Delgado
Kathleen E. Hamilton

Some of the most significant achievements of the modern era of particle physics, such as the discovery of the Higgs boson, have been made possible by the tremendous effort in building and operating large-scale experiments like the Large Hadron Collider or the Tevatron. In these facilities, the ultimate theory to describe matter at the most fundamental level is constantly probed and verified. These experiments often produce large amounts of data that require storing, processing, and analysis techniques that continually push the limits of traditional information processing schemes. Thus, the High-Energy Physics (HEP) field has benefited from advancements in information processing and the development of algorithms and tools for large datasets. More recently, quantum computing applications have been investigated to understand how the community can benefit from the advantages of quantum information science. Nonetheless, to unleash the full potential of quantum computing, there is a need to understand the quantum behavior and, thus, scale up current algorithms beyond what can be simulated in classical processors. In this work, we explore potential applications of quantum machine learning to data analysis tasks in HEP and how to overcome the limitations of algorithms targeted for Noisy Intermediate-Scale Quantum (NISQ) devices.

SESSION: Making Patterning Work (Virtual)

Session details: Making Patterning Work (Virtual)

Yuzhe Ma

DeePEB: A Neural Partial Differential Equation Solver for Post Exposure Baking Simulation in Lithography

Qipan Wang
Xiaohan Gao
Yibo Lin
Runsheng Wang
Ru Huang

Post Exposure Baking (PEB) has been widely utilized in advanced lithography. PEB simulation is critical in the lithography simulation flow, as it bridges the optical simulation result and the final developed profile in the photoresist. The process of PEB can be described by coupled partial differential equations (PDE) and corresponding boundary and initial conditions. Recent years have witnessed growing presence of machine learning algorithms in lithography simulation, while PEB simulation is often ignored or treated with compact models, considering the huge cost of solving PDEs exactly. In this work, based on the observation of the physical essence of PEB, we propose DeePEB: a neural PDE Solver for PEB simulation. This model is capable of predicting the PEB latent image with high accuracy and >100 × acceleration (compared to the commercial rigorous simulation tool), paving the way for efficient and accurate photoresist modeling in lithography simulation and layout optimization.

AdaOPC: A Self-Adaptive Mask Optimization Framework for Real Design Patterns

Wenqian Zhao
Xufeng Yao
Ziyang Yu
Guojin Chen
Yuzhe Ma
Bei Yu
Martin D. F. Wong

Optical proximity correction (OPC) is a widely-used resolution enhancement technique (RET) for printability optimization. Recently, rigorous numerical optimization and fast machine learning are the research focus of OPC in both academia and industry, each of which complements the other in terms of robustness or efficiency. We inspect the pattern distribution on a design layer and find that different sub-regions have different pattern complexity. Besides, we also find that many patterns repetitively appear in the design layout, and these patterns may possibly share optimized masks. We exploit these properties and propose a self-adaptive OPC framework to improve efficiency. Firstly we choose different OPC solvers adaptively for patterns of different complexity from an extensible solver pool to reach a speed/accuracy co-optimization. Apart from that, we prove the feasibility of reusing optimized masks for repeated patterns and hence, build a graph-based dynamic pattern library reusing stored masks to further speed up the OPC flow. Experimental results show that our framework achieves substantial improvement in both performance and efficiency.

LayouTransformer: Generating Layout Patterns with Transformer via Sequential Pattern Modeling

Liangjian Wen
Yi Zhu
Lei Ye
Guojin Chen
Bei Yu
Jianzhuang Liu
Chunjing Xu

Generating legal and diverse layout patterns to establish large pattern libraries is fundamental for many lithography design applications. Existing pattern generation models typically regard the pattern generation problem as image generation of layout maps and learn to model the patterns via capturing pixel-level coherence, which is insufficient to achieve polygon-level modeling, e.g., shape and layout of patterns, thus leading to poor generation quality. In this paper, we regard the pattern generation problem as an unsupervised sequence generation problem, in order to learn the pattern design rules by explicitly modeling the shapes of polygons and the layouts among polygons. Specifically, we first propose a sequential pattern representation scheme that fully describes the geometric information of polygons by encoding the 2D layout patterns as sequences of tokens, i.e., vertexes and edges. Then we train a sequential generative model to capture the long-term dependency among tokens and thus learn the design rules from training examples. To generate a new pattern in sequence, each token is generated conditioned on the previously generated tokens that are from the same polygon or different polygons in the same layout map. Our framework, termed LayouTransformer, is based on the Transformer architecture due to its remarkable ability in sequence modeling. Comprehensive experiments show that our LayouTransformer not only generates a large amount of legal patterns but also maintains high generation diversity, demonstrating its superiority over existing pattern generative models.

WaferHSL: Wafer Failure Pattern Classification with Efficient Human-Like Staged Learning

Qijing Wang
Martin D. F. Wong

As the demand for semiconductor products increases and the integrated circuits (IC) processes become more and more complex, wafer failure pattern classification is gaining more attention from manufacturers and researchers to improve yield. To further cope with the real-world scenario that there are only very limited labeled data and without any unlabeled data in the early manufacturing stage of new products, this work proposes an efficient human-like staged learning framework for wafer failure pattern classification named WaferHSL. Inspired by human’s knowledge acquisition process, a mutually reinforcing task fusion scheme is designed for guiding the deep learning model to simultaneously establish the knowledge of spatial relationships, geometry properties and semantics. Furthermore, a progressive stage controller is deployed to partition and control the learning process, so as to enable humanlike progressive advancement in the model. Experimental results show that with only 10% labeled samples and no unlabeled samples, WaferHSL can achieve better results than previous SOTA methods trained with 60% labeled samples and a large number of unlabeled samples, while the improvement is even more significant when using the same size of labeled training set.

SESSION: Advanced Verification Technologies (Virtual)

Session details: Advanced Verification Technologies (Virtual)

Takahide Yoshikawa

Combining BMC and Complementary Approximate Reachability to Accelerate Bug-Finding

Xiaoyu Zhang
Shengping Xiao
Jianwen Li
Geguang Pu
Ofer Strichman

Bounded Model Checking (BMC) is so far considered as the best engine for bug-finding in hardware model checking. Given a bound K, BMC can detect if there is a counterexample to a given temporal property within K steps from the initial state, thus performing a global-style search. Recently, a SAT-based model-checking technique called Complementary Approximate Reachability (CAR) was shown to be complementary to BMC, in the sense that frequently they can solve instances that the other technique cannot, within the same time limit. CAR detects a counterexample gradually with the guidance of an over-approximating state sequence, and performs a local-style search. In this paper, we consider three different ways to combine BMC and CAR. Our experiments show that they all outperform BMC and CAR on their own, and solve instances that cannot be solved by these two techniques. Our findings are based on a comprehensive experimental evaluation using the benchmarks of two hardware model checking competitions.

Equivalence Checking of Dynamic Quantum Circuits

Xin Hong
Yuan Feng
Sanjiang Li
Mingsheng Ying

Despite the rapid development of quantum computing these years, state-of-the-art quantum devices still contain only a limited number of qubits. One possible way to execute more realistic algorithms in near-term quantum devices is to employ dynamic quantum circuits (DQCs). In DQCs, measurements can happen during the circuit, and their outcomes can be processed with classical computers and used to control other parts of the circuit. This technique can help significantly reduce the qubit resources required to implement a quantum algorithm. In this paper, we give a formal definition of DQCs and then characterise their functionality in terms of ensembles of linear operators, following the Kraus representation of superoperators. We further interpret DQCs as tensor networks, implement their functionality as tensor decision diagrams (TDDs), and reduce the equivalence of two DQCs to checking if they have the same TDD representation. Experiments show that embedding classical logic into conventional quantum circuits does not incur a significant time and space burden.

SESSION: Routing with Cell Movement (Virtual)

Session details: Routing with Cell Movement (Virtual)

Guojie Luo

ATLAS: A Two-Level Layer-Aware Scheme for Routing with Cell Movement

Xinshi Zang
Fangzhou Wang
Jinwei Liu
Martin D. F. Wong

Placement and routing are two crucial steps in the physical design of integrated circuits (ICs). To close the gap between placement and routing, the routing with cell movement problem has attracted great attention recently. In this problem, a certain number of cells can be moved to new positions and the nets can be rerouted to improve the total wire length. In this work, we advance the study on this problem by proposing a two-level layer-aware scheme, named ATLAS. A coarse-level cluster-based cell movement is first performed to optimize via usage and provides a better starting point for the next fine-level single cell movement. To further encourage routing on the upper metal layers, we utilize a set of adjusted layer weights to increase the routing cost on lower layers. Experimental results on the ICCAD 2020 contest benchmarks show that ATLAS achieves much more wire length reduction compared with the state-of-the-art routing with cell movement engine. Furthermore, applied on the ICCAD 2021 contest benchmarks, ATLAS outperforms the first place team of the contest with much better solution quality while being 3× faster.

A Robust Global Routing Engine with High-Accuracy Cell Movement under Advanced Constraints

Ziran Zhu
Fuheng Shen
Yangjie Mei
Zhipeng Huang
Jianli Chen
Jun Yang

Placement and routing are typically defined as two separate problems to reduce the design complexity. However, such a divide-and-conquer approach inevitably incurs the degradation of solution quality due to the correlation/objectives of placement and routing are not entirely consistent. Besides, with various constraints (e.g., timing, R/C characteristic, voltage area, etc.) imposed by advanced circuit designs, bridging the gap between placement and routing while satisfying the advanced constraints has become more challenging. In this paper, we develop a robust global routing engine with high-accuracy cell movement under advanced constraints to narrow the gap and improve the routing solution. We first present a routing refinement technique to obtain the convergent routing result based on fixed placement, which provides more accurate information for subsequent cell movement. To achieve fast and high-accuracy position prediction for cell movement, we construct a lookup table (LUT) considering complex constraints/objectives (e.g., routing direction and layer-based power consumption), and generate a timing-driven gain map for each cell based on the LUT. Finally, based on the prediction, we propose an alternating cell movement and cluster movement scheme followed by partial rip-up and reroute to optimize the routing solution. Experimental results on the ICCAD 2020 contest benchmarks show that our algorithm achieves the best total scores among all published works. Compared with the champion of the ICCAD 2021 contest, experimental results on the ICCAD 2021 contest benchmarks show that our algorithm achieves better solution quality in shorter runtime.

SESSION: Special Session: Hardware Security through Reconfigurability: Attacks, Defenses, and Challenges

Session details: Special Session: Hardware Security through Reconfigurability: Attacks, Defenses, and Challenges

Michael Raitza

Securing Hardware through Reconfigurable Nano-Structures

Nima Kavand
Armin Darjani
Shubham Rai
Akash Kumar

Hardware security has been an ever-growing concern of the integrated circuit (IC) designers. Through different stages in the IC design and life cycle, an adversary can extract sensitive design information and private data stored in the circuit using logical, physical, and structural weaknesses. Besides, in recent times, ML-based attacks have become the new de facto standard in hardware security community. Contemporary defense strategies are often facing unforeseen challenges to cope up with these attack schemes. Additionally, the high overhead of the CMOS-based secure addon circuitry and intrinsic limitations of these devices indicate the need for new nano-electronics. Emerging reconfigurable devices like Reconfigurable Field Effect transistors (RFETs) provide unique features to fortify the design against various threats at different stages in the IC design and life cycle. In this manuscript, we investigate the applications of the RFETs for securing the design against traditional and machine learning (ML)-based intellectual property (IP) piracy techniques and side-channel attacks (SCAs).

Reconfigurable Logic for Hardware IP Protection: Opportunities and Challenges

Luca Collini
Benjamin Tan
Christian Pilato
Ramesh Karri

Protecting the intellectual property (IP) of integrated circuit (IC) design is becoming a significant concern of fab-less semiconductor design houses. Malicious actors can access the chip design at any stage, reverse engineer the functionality, and create illegal copies. On the one hand, defenders are crafting more and more solutions to hide the critical portions of the circuit. On the other hand, attackers are designing more and more powerful tools to extract useful information from the design and reverse engineer the functionality, especially when they can get access to working chips. In this context, the use of custom reconfigurable fabrics has recently been investigated for hardware IP protection. This paper will discuss recent trends in hardware obfuscation with embedded FPGAs, focusing also on the open challenges that must be necessarily addressed for making this solution viable.

SESSION: Performance, Power and Temperature Aspects in Deep Learning

Session details: Performance, Power and Temperature Aspects in Deep Learning

Callie Hao
Jeff Zhang

RT-NeRF: Real-Time On-Device Neural Radiance Fields Towards Immersive AR/VR Rendering

Chaojian Li
Sixu Li
Yang Zhao
Wenbo Zhu
Yingyan Lin

Neural Radiance Field (NeRF) based rendering has attracted growing attention thanks to its state-of-the-art (SOTA) rendering quality and wide applications in Augmented and Virtual Reality (AR/VR). However, immersive real-time (> 30 FPS) NeRF based rendering enabled interactions are still limited due to the low achievable throughput on AR/VR devices. To this end, we first profile SOTA efficient NeRF algorithms on commercial devices and identify two primary causes of the aforementioned inefficiency: (1) the uniform point sampling and (2) the dense accesses and computations of the required embeddings in NeRF. Furthermore, we propose RT-NeRF, which to the best of our knowledge is the first algorithm-hardware co-design acceleration of NeRF. Specifically, on the algorithm level, RT-NeRF integrates an efficient rendering pipeline for largely alleviating the inefficiency due to the commonly adopted uniform point sampling method in NeRF by directly computing the geometry of pre-existing points. Additionally, RT-NeRF leverages a coarse-grained view-dependent computing ordering scheme for eliminating the (unnecessary) processing of invisible points. On the hardware level, our proposed RT-NeRF accelerator (1) adopts a hybrid encoding scheme to adaptively switch between a bitmap- or coordinate-based sparsity encoding format for NeRF’s sparse embeddings, aiming to maximize the storage savings and thus reduce the required DRAM accesses while supporting efficient NeRF decoding; and (2) integrates both a high-density sparse search unit and a dual-purpose bi-direction adder & search tree to coordinate the two aforementioned encoding formats. Extensive experiments on eight datasets consistently validate the effectiveness of RT-NeRF, achieving a large throughput improvement (e.g., 9.7×~3,201×) while maintaining the rendering quality as compared with SOTA efficient NeRF solutions.

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

Yifan Gong
Zheng Zhan
Pu Zhao
Yushu Wu
Chao Wu
Caiwen Ding
Weiwen Jiang
Minghai Qin
Yanzhi Wang

During the deployment of deep neural networks (DNNs) on edge devices, many research efforts are devoted to the limited hardware resource. However, little attention is paid to the influence of dynamic power management. As edge devices typically only have a budget of energy with batteries (rather than almost unlimited energy support on servers or workstations), their dynamic power management often changes the execution frequency as in the widely-used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed performance, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We firstly identify this problem and then propose All-in-One, a highly representative pruning framework to work with dynamic power management using DVFS. The framework can use only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the corresponding pruning ratio for a specific execution frequency (and voltage), we are able to achieve stable inference speed, i.e., keeping the difference in speed performance under various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces their variance of inference latency for various frequencies, with minimal memory consumption of only one model and one soft mask.

Robustify ML-Based Lithography Hotspot Detectors

Jingyu Pan
Chen-Chia Chang
Zhiyao Xie
Jiang Hu
Yiran Chen

Deep learning has been widely applied in various VLSI design automation tasks, from layout quality estimation to design optimization. Though deep learning has shown state-of-the-art performance in several applications, recent studies reveal that deep neural networks exhibit intrinsic vulnerability to adversarial perturbations, which pose risks in the ML-aided VLSI design flow. One of the most effective strategies to improve robustness is regularization approaches, which adjust the optimization objective to make the deep neural network generalize better. In this paper, we examine several adversarial defense methods to improve the robustness of ML-based lithography hotspot detectors. We present an innovative design rule checking (DRC)-guided curvature regularization (CURE) approach, which is customized to robustify ML-based lithography hotspot detectors against white-box attacks. Our approach allows for improvements in both the robustness and the accuracy of the model. Experiments show that the model optimized by DRC-guided CURE achieves the highest robustness and accuracy compared with those trained using the baseline defense methods. Compared with the vanilla model, DRC-guided CURE decreases the average attack success rate by 53.9% and increases the average ROC-AUC by 12.1%. Compared with the best of the defense baselines, DRC-guided CURE reduces the average attack success rate by 18.6% and improves the average ROC-AUC by 4.3%.

Associative Memory Based Experience Replay for Deep Reinforcement Learning

Mengyuan Li
Arman Kazemi
Ann Franchesca Laguna
X. Sharon Hu

Experience replay is an essential component in deep reinforcement learning (DRL), which stores the experiences and generates experiences for the agent to learn in real time. Recently, prioritized experience replay (PER) has been proven to be powerful and widely deployed in DRL agents. However, implementing PER on traditional CPU or GPU architectures incurs significant latency overhead due to its frequent and irregular memory accesses. This paper proposes a hardware-software co-design approach to design an associative memory (AM) based PER, AMPER, with an AM-friendly priority sampling operation. AMPER replaces the widely-used time-costly tree-traversal-based priority sampling in PER while preserving the learning performance. Further, we design an in-memory computing hardware architecture based on AM to support AMPER by leveraging parallel in-memory search operations. AMPER shows comparable learning performance while achieving 55× to 270× latency improvement when running on the proposed hardware compared to the state-of-the-art PER running on GPU.

SESSION: Tutorial: TorchQuantum Case Study for Robust Quantum Circuits

Session details: Tutorial: TorchQuantum Case Study for Robust Quantum Circuits

Hanrui Wang

TorchQuantum Case Study for Robust Quantum Circuits

Hanrui Wang
Zhiding Liang
Jiaqi Gu
Zirui Li
Yongshan Ding
Weiwen Jiang
Yiyu Shi
David Z. Pan
Frederic T. Chong
Song Han

Quantum Computing has attracted much research attention because of its potential to achieve fundamental speed and efficiency improvements in various domains. Among different quantum algorithms, Parameterized Quantum Circuits (PQC) for Quantum Machine Learning (QML) show promises to realize quantum advantages on the current Noisy Intermediate-Scale Quantum (NISQ) Machines. Therefore, to facilitate the QML and PQC research, a recent python library called TorchQuantum has been released. It can construct, simulate, and train PQC for machine learning tasks with high speed and convenient debugging supports. Besides quantum for ML, we want to raise the community’s attention on the reversed direction: ML for quantum. Specifically, the TorchQuantum library also supports using data-driven ML models to solve problems in quantum system research, such as predicting the impact of quantum noise on circuit fidelity and improving the quantum circuit compilation efficiency.

This paper presents a case study of the ML for quantum part in TorchQuantum. Since estimating the noise impact on circuit reliability is an essential step toward understanding and mitigating noise, we propose to leverage classical ML to predict noise impact on circuit fidelity. Inspired by the natural graph representation of quantum circuits, we propose to leverage a graph transformer model to predict the noisy circuit fidelity. We firstly collect a large dataset with a variety of quantum circuits and obtain their fidelity on noisy simulators and real machines. Then we embed each circuit into a graph with gate and noise properties as node features, and adopt a graph transformer to predict the fidelity. We can avoid exponential classical simulation cost and efficiently estimate fidelity with polynomial complexity.

Evaluated on 5 thousand random and algorithm circuits, the graph transformer predictor can provide accurate fidelity estimation with RMSE error 0.04 and outperform a simple neural network-based model by 0.02 on average. It can achieve 0.99 and 0.95 R² scores for random and algorithm circuits, respectively. Compared with circuit simulators, the predictor has over 200× speedup for estimating the fidelity. The datasets and predictors can be accessed in the TorchQuantum library.

SESSION: Emerging Machine Learning Primitives: From Technology to Application

Session details: Emerging Machine Learning Primitives: From Technology to Application

Dharanidhar Dang
Hai Helen Lee

COSIME: FeFET Based Associative Memory for In-Memory Cosine Similarity Search

Che-Kai Liu
Haobang Chen
Mohsen Imani
Kai Ni
Arman Kazemi
Ann Franchesca Laguna
Michael Niemier
Xiaobo Sharon Hu
Liang Zhao
Cheng Zhuo
Xunzhao Yin

In a number of machine learning models, an input query is searched across the trained class vectors to find the closest feature class vector in cosine similarity metric. However, performing the cosine similarities between the vectors in Von-Neumann machines involves a large number of multiplications, Euclidean normalizations and division operations, thus incurring heavy hardware energy and latency overheads. Moreover, due to the memory wall problem that presents in the conventional architecture, frequent cosine similarity-based searches (CSSs) over the class vectors requires a lot of data movements, limiting the throughput and efficiency of the system. To overcome the aforementioned challenges, this paper introduces COSIME, a general in-memory associative memory (AM) engine based on the ferroelectric FET (FeFET) device for efficient CSS. By leveraging the one-transistor AND gate function of FeFET devices, current-based translinear analog circuit and winner-take-all (WTA) circuitry, COSIME can realize parallel in-memory CSS across all the entries in a memory block, and output the closest word to the input query in cosine similarity metric. Evaluation results at the array level suggest that the proposed COSIME design achieves 333× and 90.5× latency and energy improvements, respectively, and realizes better classification accuracy when compared with an AM design implementing approximated CSS. The proposed in-memory computing fabric is evaluated for an HDC problem, showcasing that COSIME can achieve on average 47.1× and 98.5× speedup and energy efficiency improvements compared with an GPU implementation.

DynaPAT: A Dynamic Pattern-Aware Encoding Technique for Robust MLC PCM-Based Deep Neural Networks

Thai-Hoang Nguyen
Muhammad Imran
Joon-Sung Yang

As the effectiveness of Deep Neural Networks (DNNs) is rising over time, so is the need for highly scalable and efficient hardware architectures to capitalize this effectiveness in many practical applications. Emerging non-volatile Phase Change Memory (PCM) technology has been found to be a promising candidate for future memory systems due to its better scalability, non-volatility and low leakage/dynamic power consumption, compared to conventional charged-based memories. Additionally, with its cell’s wide resistance span, PCM also has the Flash-like Multi-Level Cell (MLC) capability, which has enhanced storage density, providing an opportunity for the deployment of data-intensive applications such as DNNs on resource-constrained edge devices. However, the practical deployment of MLC PCM is hampered by certain reliability challenges, among which, the resistance drift is considered to be a critical concern. In a DNN application, the presence of resistance drift in MLC PCM can cause a severe impact to DNN’s accuracy if no drift-error-tolerance technique is utilized. This paper proposes DynaPAT, a low-cost and effective pattern-aware encoding technique to enhance the drift-error-tolerance of MLC PCM-based Deep Neural Networks. DynaPAT has been constructed on the insight into DNN’s vulnerability against different data pattern switching. Based on this insight, DynaPAT efficiently maps the most-frequent data pattern in DNN’s parameters to the least-drift-prone level of the MLC PCM, thus significantly enhancing the robustness of the system against drift errors. Various experiments on different DNN models and configurations demonstrate the effectiveness of DynaPAT. The experimental results indicate that DynaPAT can achieve up to 500× enhancement in the drift-errors-tolerance capability over the baseline MLC PCM based DNN while requiring only a negligible hardware overhead (below 1% storage overhead). Being orthogonal, DynaPAT can be integrated with existing drift-tolerance schemes for even higher gains in reliability.

Graph Neural Networks for Idling Error Mitigation

Vedika Servanan
Samah Mohamed Saeed

Dynamical Decoupling (DD)-based protocols have been shown to reduce the idling errors encountered in quantum circuits. However, the current research in suppressing idling qubit errors suffers from scalability issues due to the large number of tuning quantum circuits that should be executed first to find the locations of the DD sequences in the target quantum circuit, which boost the output state fidelity. This process becomes tedious as the size of the quantum circuit increases. To address this challenge, we propose a Graph Neural Network (GNN) framework, which mitigates idling errors through an efficient insertion of DD sequences into quantum circuits by modeling their impact at different idle qubit windows. Our paper targets maximizing the benefit of DD sequences using a limited number of tuning circuits. We propose to classify the idle qubit windows into critical and non-critical (benign) windows using a data-driven reliability model. Our results obtained from IBM Lagos quantum computer show that our proposed GNN models, which determine the locations of DD sequences in the quantum circuits, significantly improve the output state fidelity by a factor of 1.4x on average and up to 2.6x compared to the adaptive DD approach, which searches for the best locations of DD sequences at run-time.

Quantum Neural Network Compression

Zhirui Hu
Peiyan Dong
Zhepeng Wang
Youzuo Lin
Yanzhi Wang
Weiwen Jiang

Model compression, such as pruning and quantization, has been widely applied to optimize neural networks on resource-limited classical devices. Recently, there are growing interest in variational quantum circuits (VQC), that is, a type of neural network on quantum computers (a.k.a., quantum neural networks). It is well known that the near-term quantum devices have high noise and limited resources (i.e., quantum bits, qubits); yet, how to compress quantum neural networks has not been thoroughly studied. One might think it is straightforward to apply the classical compression techniques to quantum scenarios. However, this paper reveals that there exist differences between the compression of quantum and classical neural networks. Based on our observations, we claim that the compilation/traspilation has to be involved in the compression process. On top of this, we propose the very first systematical framework, namely CompVQC, to compress quantum neural networks (QNNs). In CompVQC, the key component is a novel compression algorithm, which is based on the alternating direction method of multipliers (ADMM) approach. Experiments demonstrate the advantage of the CompVQC, reducing the circuit depth (almost over 2.5×) with a negligible accuracy drop (<1%), which outperforms other competitors. Another promising truth is our CompVQC can indeed promote the robustness of the QNN on the near-term noisy quantum devices.

SESSION: Design for Low Energy, Low Resource, but High Quality

Session details: Design for Low Energy, Low Resource, but High Quality

Ravikumar Chakaravarthy
Cong “Callie” Hao

Squeezing Accumulators in Binary Neural Networks for Extremely Resource-Constrained Applications

Azat Azamat
Jaewoo Park
Jongeun Lee

The cost and power consumption of BNN (Binarized Neural Network) hardware is dominated by additions. In particular, accumulators account for a large fraction of hardware overhead, which could be effectively reduced by using reduced-width accumulators. However, it is not straightforward to find the optimal accumulator width due to the complex interplay between width, scale, and the effect of training. In this paper we present algorithmic and hardware-level methods to find the optimal accumulator size for BNN hardware with minimal impact on the quality of result. First, we present partial sum scaling, a top-down approach to minimize the BNN accumulator size based on advanced quantization techniques. We also present an efficient, zero-overhead hardware design for partial sum scaling. Second, we evaluate a bottom-up approach that is to use saturating accumulator, which is more robust against overflows. Our experimental results using CIFAR-10 dataset demonstrate that our partial sum scaling along with our optimized accumulator architecture can reduce the area and power consumption of datapath by 15.50% and 27.03%, respectively, with little impact on inference performance (less than 2%), compared to using 16-bit accumulator.

WSQ-AdderNet: Efficient Weight Standardization Based Quantized AdderNet FPGA Accelerator Design with High-Density INT8 DSP-LUT Co-Packing Optimization

Yunxiang Zhang
Biao Sun
Weixiong Jiang
Yajun Ha
Miao Hu
Wenfeng Zhao

Convolutional neural networks (CNNs) have been widely adopted for various machine intelligence tasks. Nevertheless, CNNs are still known to be computational demanding due to the convolutional kernels involving expensive Multiply-ACcumulate (MAC) operations. Recent proposals on hardware-optimal neural network architectures suggest that AdderNet with a lightweight ℓ₁-norm based feature extraction kernel can be an efficient alternative to the CNN counterpart, where the expensive MAC operations are substituted with efficient Sum-of-Absolute-Difference (SAD) operations. Nevertheless, it lacks an efficient hardware implementation methodology for AdderNet as compared to the existing methodologies for CNNs, including efficient quantization, full-integer accelerator implementation, and judicious resource utilization of DSP slices of FPGA devices. In this paper, we present WSQ-AdderNet, a generic framework to quantize and optimize AdderNet-based accelerator designs on embedded FPGA devices. First, we propose a weight standardization technique to facilitate weight quantization in AdderNet. Second, we demonstrate a full-integer quantization hardware implementation strategy, including weight and activation quantization methodologies. Third, we apply DSP packing optimization to maximize the DSP utilization efficiency, where Octo-INT8 can be achieved via DSP-LUT co-packing. Finally, we implement the design using Xilinx Vitis HLS (high-level synthesis) and Vivado to Xilinx Kria KV-260 FPGA. Our experimental results of ResNet-20 using WSQ-AdderNet demonstrate that the implementations achieve 89.9% inference accuracy with INT8 implementation, which shows little performance loss as compared to the FP32 and INT8 CNN designs. At the hardware level, WSQ-AdderNet achieves up to 3.39× DSP density improvement with nearly the same throughput as compared to INT8 CNN design. The reduction in DSP utilization makes it possible to deploy large network models on resource-constrained devices. When further scaling up the PE sizes by 39.8%, WSQ-AdderNet can achieve 1.48× throughput improvement while still achieving 2.42× DSP density improvement.

Low-Cost 7T-SRAM Compute-in-Memory Design Based on Bit-Line Charge-Sharing Based Analog-to-Digital Conversion

Kyeongho Lee
Joonhyung Kim
Jongsun Park

Although compute-in-memory (CIM) is considered as one of the promising solutions to overcome memory wall problem, the variations in analog voltage computation and analog-to-digital-converter (ADC) cost still remain as design challenges. In this paper, we present a 7T SRAM CIM that seamlessly supports multiply-accumulation (MAC) operation between 4-bit inputs and 8-bit weights. In the proposed CIM, highly parallel and robust MAC operations are enabled by exploiting the bit-line charge-sharing scheme to simultaneously process multiple inputs. For the readout of analog MAC values, instead of adopting the conventional ADC structure, the bit-line charge-sharing is efficiently used to reduce the implementation cost of the reference voltage generations. Based on the in-SRAM reference voltage generation and the parallel analog readout in all columns, the proposed CIM efficiently reduces ADC power and area cost. In addition, the variation models from Monte-Carlo simulations are also used during training to reduce the accuracy drop due to process variations. The implementation of 256×64 7T SRAM CIM using 28nm CMOS process shows that it operates in the wide voltage range from 0.6V to 1.2V with energy efficiency of 45.8-TOPS/W at 0.6V.

SESSION: Microarchitectural Attacks and Countermeasures

Session details: Microarchitectural Attacks and Countermeasures

Rajesh JS
Amin Rezaei

Speculative Load Forwarding Attack on Modern Processors

Hasini Witharana
Prabhat Mishra

Modern processors deliver high performance by utilizing advanced features such as out-of-order execution, branch prediction, speculative execution, and sophisticated buffer management. Unfortunately, these techniques have introduced diverse vulnerabilities including Spectre, Meltdown, and microarchitectural data sampling (MDS). Although Spectre and Meltdown can leak data via memory side channels, MDS has shown to leak data from the CPU internal buffers in Intel architectures. AMD has reported that its processors are not vulnerable to MDS/Meltdown type attacks. In this paper, we present a Meltdown/MDS type of attack to leak data from the load queue in AMD Zen family architectures. To the best of our knowledge, our approach is the first attempt in developing an attack on AMD architectures using speculative load forwarding to leak data through the load queue. Experimental evaluation demonstrates that our proposed attack is successful on multiple machines with AMD processors. We also explore a lightweight mitigation to defend against speculative load forwarding attack on modern processors.

Fast, Robust and Accurate Detection of Cache-Based Spectre Attack Phases

Arash Pashrashid
Ali Hajiabadi
Trevor E. Carlson

Modern processors achieve high performance and efficiency by employing techniques such as speculative execution and sharing resources such as caches. However, recent attacks like Spectre and Meltdown exploit the speculative execution of modern processors to leak sensitive information from the system. Many mitigation strategies have been proposed to restrict the speculative execution of processors and protect potential side-channels. Currently, these techniques have shown a significant performance overhead. A solution that can detect memory leaks before the attacker has a chance to exploit them would allow the processor to reduce the performance overhead by enabling protections only when the system is at risk.

In this paper, we propose a mechanism to detect speculative execution attacks that use caches as a side-channel. In this detector we track the phases of a successful attack and raise an alert before the attacker gets a chance to recover sensitive information. We accomplish this through monitoring the microarchitectural changes in the core and caches, and detect the memory locations that can be potential memory data leaks. We achieve 100% accuracy and negligible false positive rate in detecting Spectre attacks and evasive versions of Spectre that the state-of-the-art detectors are unable to detect. Our detector has no performance overhead with negligible power and area overheads.

CASU: Compromise Avoidance via Secure Update for Low-End Embedded Systems

Ivan De Oliveira Nunes
Sashidhar Jakkamsetti
Youngil Kim
Gene Tsudik

Guaranteeing runtime integrity of embedded system software is an open problem. Trade-offs between security and other priorities (e.g., cost or performance) are inherent, and resolving them is both challenging and important. The proliferation of runtime attacks that introduce malicious code (e.g., by injection) into embedded devices has prompted a range of mitigation techniques. One popular approach is Remote Attestation (RA), whereby a trusted entity (verifier) checks the current software state of an untrusted remote device (prover). RA yields a timely authenticated snapshot of prover state that verifier uses to decide whether an attack occurred.

Current RA schemes require verifier to explicitly initiate RA, based on some unclear criteria. Thus, in case of prover’s compromise, verifier only learns about it late, upon the next RA instance. While sufficient for compromise detection, some applications would benefit from a more proactive, prevention-based approach. To this end, we construct CASU: Compromise Avoidance via Secure Updates. CASU is an inexpensive hardware/software co-design enforcing: (i) runtime software immutability, thus precluding any illegal software modification, and (ii) authenticated updates as the sole means of modifying software. In CASU, a successful RA instance serves as a proof of successful update, and continuous subsequent software integrity is implicit, due to the runtime immutability guarantee. This obviates the need for RA in between software updates and leads to unobtrusive integrity assurance with guarantees akin to those of prior RA techniques, with better overall performance.

SESSION: Genetic Circuits Meet Ising Machines

Session details: Genetic Circuits Meet Ising Machines

Marc Riedel
Lei Yang

Technology Mapping of Genetic Circuits: From Optimal to Fast Solutions

Tobias Schwarz
Christian Hochberger

Synthetic Biology aims to create biological systems from scratch that do not exist in nature. An important method in this context is the engineering of DNA sequences such that cells realize Boolean functions that serve as control mechanisms in biological systems, e.g. in medical or agricultural applications. Libraries of logic gates exist as predefined gene sequences, based on the genetic mechanism of transcriptional regulation. Each individual gate is composed of different biological parts to allow for the differentiation of their output signals. Even gates of the same logic type therefore exhibit different transfer characteristics, i.e. relation from input to output signals. Thus, simulation of the whole network of genetic gates is needed to determine the performance of a genetic circuit. This makes mapping Boolean functions to these libraries much more complicated compared to EDA. Yet, optimal results are desired in the design phase due to high lab implementation costs. In this work, we identify fundamental features of the transfer characteristic of gates based on transcriptional regulation which is widely used in genetic gate technologies. Based on this, we present novel exact (Branch-and-Bound) and heuristic (Branch-and-Bound, Simulated Annealing) algorithms for the problem of technology mapping of genetic circuits and evaluate them using a prominent gate library. In contrast to state-of-the-art tools, all obtained solutions feature a (near) optimal output performance. Our exact method only explores 6.5 % and the heuristics even 0.2 % of the design space.

DaS: Implementing Dense Ising Machines Using Sparse Resistive Networks

Naomi Sagan
Jaijeet Roychowdhury

Ising machines have generated much excitement in recent years due to their promise for solving hard combinatorial optimization problems. However, achieving physical all-to-all connectivity in IC implementations of large, densely-connected Ising machines remains a key challenge. We present a novel approach, DaS, that uses low-rank decomposition to achieve effectively-dense Ising connectivity using only sparsely interconnected hardware. The innovation consists of two components. First, we use the SVD to find a low-rank approximation of the Ising coupling matrix while maintaining very high accuracy. This decomposition requires substantially fewer nonzeros to represent the dense Ising coupling matrix. Second, we develop a method to translate the low-rank decomposition to a hardware implementation that uses only sparse resistive interconnections. We validate DaS on the MU-MIMO detection problem, important in modern telecommunications. Our results indicate that as problem sizes scale, DaS can achieve dense Ising coupling using only 5%-20% of the resistors needed for brute-force dense connections (which would be physically infeasible in ICs). We also outline a crossbar-style physical layout scheme for realizing sparse resistive networks generated by DaS.

QuBRIM: A CMOS Compatible Resistively-Coupled Ising Machine with Quantized Nodal Interactions

Yiqiao Zhang
Uday Kumar Reddy Vengalam
Anshujit Sharma
Michael Huang
Zeljko Ignjatovic

Physical Ising machines have been shown to solve combinatoric optimization problems with orders-of-magnitude improvements in speed and energy efficiency o ver v on N eumann systems. However, building such a system is still in its infancy and a scalable, robust implementation remains challenging. CMOS-compatible electronic Ising machines (e.g., [1]) are promising as the mature technology helps bring scale, speed, and energy efficiency to the dynamical system. However, subtle issues can arise when using voltage-controlled transistors to act as programmable resistive coupling. In this paper, we propose a version of resistively-coupled Ising machine using quantized nodal interactions (QuBRIM), which significantly i mproved the predictability of the coupling resistor. The functionality of QuBRIM is demonstrated by solving the well-known Max-Cut problem using both behavioral and circuit level simulations in 45 nm CMOS technology node. We show that the dynamical system naturally seeks local minima in the objective function’s energy landscape and that by applying spin-fix a nnealing, t he system reaches a global minimum with a high probability.

SESSION: Energy Efficient Neural Networks via Approximate Computations

Session details: Energy Efficient Neural Networks via Approximate Computations

M. Hasan Najafi
Vidya Chabria

Combining Gradients and Probabilities for Heterogeneous Approximation of Neural Networks

Elias Trommer
Bernd Waschneck
Akash Kumar

This work explores the search for heterogeneous approximate multiplier configurations for neural networks that produce high accuracy and low energy consumption. We discuss the validity of additive Gaussian noise added to accurate neural network computations as a surrogate model for behavioral simulation of approximate multipliers. The continuous and differentiable properties of the solution space spanned by the additive Gaussian noise model are used as a heuristic that generates meaningful estimates of layer robustness without the need for combinatorial optimization techniques. Instead, the amount of noise injected into the accurate computations is learned during network training using backpropagation. A probabilistic model of the multiplier error is presented to bridge the gap between the domains; the model estimates the standard deviation of the approximate multiplier error, connecting solutions in the additive Gaussian noise space to actual hardware instances. Our experiments show that the combination of heterogeneous approximation and neural network retraining reduces the energy consumption for multiplications by 70% to 79% for different ResNet variants on the CIFAR-10 dataset with a Top-1 accuracy loss below one percentage point. For the more complex Tiny ImageNet task, our VGG16 model achieves a 53 % reduction in energy consumption with a drop in Top-5 accuracy of 0.5 percentage points. We further demonstrate that our error model can predict the parameters of an approximate multiplier in the context of the commonly used additive Gaussian noise (AGN) model with high accuracy. Our software implementation is available under https://github.com/etrommer/agn-approx.

Tunable Precision Control for Approximate Image Filtering in an In-Memory Architecture with Embedded Neurons

Ayushi Dube
Ankit Wagle
Gian Singh
Sarma Vrudhula

This paper presents a novel hardware-software co-design consisting of a Processing in-Memory (PiM) architecture with embedded neural processing elements (NPE) that are highly reconfigurable. The PiM platform and proposed approximation strategies are employed for various image filtering applications while providing the user with fine-grain dynamic control over energy efficiency, precision, and throughput (EPT). The proposed co-design can change the Peak Signal to Noise Ratio (PSNR, output quality metric for image filtering applications) from 25dB to 50dB (acceptable PSNR range for image filtering applications) without incurring any extra cost in terms of energy or latency. While switching from accurate to approximate mode of computation in the proposed co-design, the maximum improvement in energy efficiency and throughput is 2X. However, the gains in energy efficiency against a MAC-based PE array with the proposed memory platform are 3X-6X. The corresponding improvements in throughput are 2.26X-4.52X, respectively.

AppGNN: Approximation-Aware Functional Reverse Engineering Using Graph Neural Networks

Tim Bücher
Lilas Alrahis
Guilherme Paim
Sergio Bampi
Ozgur Sinanoglu
Hussam Amrouch

The globalization of the Integrated Circuit (IC) market is attracting an ever-growing number of partners, while remarkably lengthening the supply chain. Thereby, security concerns, such as those imposed by functional Reverse Engineering (RE), have become quintessential. RE leads to disclosure of confidential information to competitors, potentially enabling the theft of intellectual property. Traditional functional RE methods analyze a given gate-level netlist through employing pattern matching towards reconstructing the underlying basic blocks, and hence, reverse engineer the circuit’s function.

In this work, we are the first to demonstrate that applying Approximate Computing (AxC) principles to circuits significantly improves the resiliency against RE. This is attributed to the increased complexity in the underlying pattern-matching process. The resiliency remains effective even for Graph Neural Networks (GNNs) that are presently one of the most powerful state-of-the-art techniques in functional RE. Using AxC, we demonstrate a substantial reduction in GNN average classification accuracy- from 98% to a mere 53%. To surmount the challenges introduced by AxC in RE, we propose the highly promising AppGNN platform, which enables GNNs (still being trained on exact circuits) to: (i) perform accurate classifications, and (ii) reverse engineer the circuit functionality, notwithstanding the applied approximation technique. AppGNN accomplishes this by implementing a novel graph-based node sampling approach that mimics generic approximation methodologies, requiring zero knowledge of the targeted approximation type.

We perform an extensive evaluation targeting wide-ranging adder and multiplier circuits that are approximated using various AxC techniques, including state-of-the-art evolutionary-based approaches. We show that, using our method, we can improve the classification accuracy from 53% to 81% when classifying approximate adder circuits that have been generated using evolutionary algorithms, which our method is oblivious of. Our AppGNN framework is publicly available under https://github.com/ML-CAD/AppGNN

Seprox: Sequence-Based Approximations for Compressing Ultra-Low Precision Deep Neural Networks

Aradhana Mohan Parvathy
Sarada Krithivasan
Sanchari Sen
Anand Raghunathan

Compression techniques such as quantization and pruning are indispensable for deploying state-of-the-art Deep Neural Networks (DNNs) on resource-constrained edge devices. Quantization is widely used in practice – many commercial platforms already support 8-bits, with recent trends towards ultra-low precision (4-bits and below). Pruning, which increases network sparsity (incidence of zero-valued weights), enables compression by storing only the nonzero weights and their indices. Unfortunately, the compression benefits of pruning deteriorate or even vanish in ultra-low precision DNNs. This is due to (i) the unfavorable tradeoff between the number of bits needed to store a weight (which reduces with lower precision) and the number of bits needed to encode an index (which remains unchanged), and (ii) the lower sparsity levels that are achievable at lower precisions.

We propose Seprox, a new compression scheme that overcomes the aforementioned challenges by exploiting two key observations about ultra-low precision DNNs. First, with lower precision, fewer weight values are possible, leading to increased incidence of frequently-occurring weights and weight sequences. Second, some weight values occur rarely and can be eliminated by replacing them with similar values. Leveraging these insights, Seprox encodes frequently-occurring weight sequences (as opposed to individual weights) while using the eliminated weight values to encode them, thereby avoiding indexing overheads and achieving higher compression. Additionally, Seprox uses approximation techniques to increase the frequencies of the encoded sequences. Across six ultra-low precision DNNs trained on the Cifar10 and ImageNet datasets, Seprox achieves model compressions, energy improvements and speed-ups of up to 35.2%, 14.8% and 18.2% respectively.

SESSION: Algorithms and Tools for Security Analysis and Secure Hardware Design

Session details: Algorithms and Tools for Security Analysis and Secure Hardware Design

Rosario Cammarota
Satwik Patnaik

Evaluating the Security of eFPGA-Based Redaction Algorithms

Amin Rezaei
Raheel Afsharmazayejani
Jordan Maynard

Hardware IP owners must envision procedures to avoid piracy and overproduction of their designs under a fabless paradigm. A newly proposed technique to obfuscate critical components in a logic design is called eFPGA-based redaction, which replaces a sensitive sub-circuit with an embedded FPGA, and the eFPGA is configured to perform the same functionality as the missing sub-circuit. In this case, the configuration bitstream acts as a hidden key only known to the hardware IP owner. In this paper, we first evaluate the security promise of the existing eFPGA-based redaction algorithms as a preliminary study. Then, we break eFPGA-based redaction schemes by an initial but not necessarily efficient attack named DIP Exclusion that excludes problematic input patterns from checking in a brute-force manner. Finally, by combining cycle breaking and unrolling, we propose a novel and powerful attack called Break & Unroll that is able to recover the bitstream of state-of-the-art eFPGA-based redaction schemes in a relatively short time even with the existence of hard cycles and large size keys. This study reveals that the common perception that eFPGA-based redaction is by default secure against oracle-guided attacks, is prejudice. It also shows that additional research on how to systematically create an exponential number of non-combinational hard cycles is required to secure eFPGA-based redaction schemes.

An Approach to Unlocking Cyclic Logic Locking: LOOPLock 2.0

Pei-Pei Chen
Xiang-Min Yang
Yi-Ting Li
Yung-Chih Chen
Chun-Yao Wang

Cyclic logic locking is a new type of SAT-resistant techniques in hardware security. Recently, LOOPLock 2.0 was proposed, which is a cyclic logic locking method creating cycles deliberately in the locked circuit to resist SAT Attack, CycSAT, BeSAT, and Removal Attack simultaneously. The key idea of LOOPLock 2.0 is that the resultant circuit is still cyclic no matter the key vector is correct or not. This property refuses attackers and demonstrates its success on defending against attackers. In this paper, we propose an unlocking approach to LOOPLock 2.0 based on structure analysis and SAT solvers. Specifically, we identify and remove non-combinational cycles in the locked circuit before running SAT solvers. The experimental results show that the proposed unlocking approach is promising.

Garbled EDA: Privacy Preserving Electronic Design Automation

Mohammad Hashemi
Steffi Roy
Fatemeh Ganji
Domenic Forte

The complexity of modern integrated circuits (ICs) necessitates collaboration between multiple distrusting parties, including third-party intellectual property (3PIP) vendors, design houses, CAD/EDA tool vendors, and foundries, which jeopardizes confidentiality and integrity of each party’s IP. IP protection standards and the existing techniques proposed by researchers are ad hoc and vulnerable to numerous structural, functional, and/or side-channel attacks. Our framework, Garbled EDA, proposes an alternative direction through formulating the problem in a secure multi-party computation setting, where the privacy of IPs, CAD tools, and process design kits (PDKs) is maintained. As a proof-of-concept, Garbled EDA is evaluated in the context of simulation, where multiple IP description formats (Verilog, C, S) are supported. Our results demonstrate a reasonable logical-resource cost and negligible memory overhead. To further reduce the overhead, we present another efficient implementation methodology, feasible when the resource utilization is a bottleneck, but the communication between two parties is not restricted. Interestingly, this implementation is private and secure even in the presence of malicious adversaries attempting to, e.g., gain access to PDKs or in-house IPs of the CAD tool providers.

Don’t CWEAT It: Toward CWE Analysis Techniques in Early Stages of Hardware Design

Baleegh Ahmad
Wei-Kai Liu
Luca Collini
Hammond Pearce
Jason M. Fung
Jonathan Valamehr
Mohammad Bidmeshki
Piotr Sapiecha
Steve Brown
Krishnendu Chakrabarty
Ramesh Karri
Benjamin Tan

To help prevent hardware security vulnerabilities from propagating to later design stages where fixes are costly, it is crucial to identify security concerns as early as possible, such as in RTL designs. In this work, we investigate the practical implications and feasibility of producing a set of security-specific scanners that operate on Verilog source files. The scanners indicate parts of code that might contain one of a set of MITRE’s common weakness enumerations (CWEs). We explore the CWE database to characterize the scope and attributes of the CWEs and identify those that are amenable to static analysis. We prototype scanners and evaluate them on 11 open source designs – 4 system-on-chips (SoC) and 7 processor cores – and explore the nature of identified weaknesses. Our analysis reported 53 potential weaknesses in the OpenPiton SoC used in Hack@DAC-21, 11 of which we confirmed as security concerns.

SESSION: Special Session: Making ML Reliable: From Devices to Systems to Software

Session details: Special Session: Making ML Reliable: From Devices to Systems to Software

Krishnendu Chakrabarty
Partha Pande

Reliable Computing of ReRAM Based Compute-in-Memory Circuits for AI Edge Devices

Meng-Fan Chang
Je-Ming Hung
Ping-Cheng Chen
Tai-Hao Wen

Compute-in-memory macros based on non-volatile memory (nvCIM) are a promising approach to break through the memory bottleneck for artificial intelligence (AI) edge devices; however, the development of these devices involves unavoidable tradeoffs between reliability, energy efficiency, computing latency, and readout accuracy. This paper outlines the background of ReRAM-based nvCIM as well as the major challenges in its further development, including process variation in ReRAM devices and transistors and the small signal margins associated with variation in input-weight patterns. This paper also investigates the error model of a nvCIM macro, and the correspondent degradation of inference accuracy as a function of error model when using nvCIM macros. Finally, we summarize recent trends and advances in the development of reliable ReRAM-based nvCIM macro.

Fault-Tolerant Deep Learning Using Regularization

Biresh Kumar Joardar
Aqeeb Iqbal Arka
Janardhan Rao Doppa
Partha Pratim Pande

Resistive random-access memory has become one of the most popular choices of hardware implementation for machine learning application workloads. However, these devices exhibit non-ideal behavior, which presents a challenge towards widespread adoption. Training/inferencing on these faulty devices can lead to poor prediction accuracy. However, existing fault tolerant methods are associated with high implementation overheads. In this paper, we present some new directions for solving reliability issues using software solutions. These software-based methods are inherent in deep learning training/inferencing, and they can also be used to address hardware reliability issues as well. These methods prevent accuracy drop during training/inferencing due to unreliable ReRAMs and are associated with lower area and power overheads.

Machine Learning for Testing Machine-Learning Hardware: A Virtuous Cycle

Arjun Chaudhuri
Jonti Talukdar
Krishnendu Chakrabarty

The ubiquitous application of deep neural networks (DNN) has led to a rise in demand for AI accelerators. DNN-specific functional criticality analysis identifies faults that cause measurable and significant deviations from acceptable requirements such as the inferencing accuracy. This paper examines the problem of classifying structural faults in the processing elements (PEs) of systolic-array accelerators. We first present a two-tier machine-learning (ML) based method to assess the functional criticality of faults. While supervised learning techniques can be used to accurately estimate fault criticality, it requires a considerable amount of ground truth for model training. We therefore describe a neural-twin framework for analyzing fault criticality with a negligible amount of ground-truth data. We further describe a topological and probabilistic framework to estimate the expected number of PE’s primary outputs (POs) flipping in the presence of defects and use the PO-flip count as a surrogate for determining fault criticality. We demonstrate that the combination of PO-flip count and neural twin-enabled sensitivity analysis of internal nets can be used as additional features in existing ML-based criticality classifiers.

Observation Point Insertion Using Deep Learning

Bonita Bhaskaran
Sanmitra Banerjee
Kaushik Narayanun
Shao-Chun Hung
Seyed Nima Mozaffari Mojaveri
Mengyun Liu
Gang Chen
Tung-Che Liang

Silent Data Corruption (SDC) is one of the critical problems in the field of testing, where errors or corruption do not manifest externally. As a result, there is increased focus on improving the outgoing quality of dies by striving for better correlation between structural and functional patterns to achieve a low DPPM. This is very important for NVIDIA’s chips due to the various markets we target; for example, automotive and data center markets have stringent in-field testing requirements. One aspect of these efforts is to also target better testability while incurring lower test cost. Since structural testing is faster than functional tests, it is important to make these structural test patterns as effective as possible and free of test escapes. However, with the rising cell count in today’s digital circuits, it is becoming increasingly difficult to sensitize faults and propagate the fault effects to scan-flops or primary outputs. Hence, methods to insert observation points to facilitate the detection of hard-to-detect (HtD) faults are being increasingly explored. In this work, we propose an Observation Point Insertion (OPI) scheme using deep learning with the motivation of achieving – 1) better quality test points than commercial EDA tools leading to a potential lower pattern count 2) faster turnaround time to generate the test points. In order to achieve better pattern compaction than commercial EDA tools, we employ Graph Convolutional Networks (GCNs) to learn the topology of logic circuits along with the features that influence its testability. The graph structures are subsequently used to train two GCN-type deep learning models – the first model predicts signal probabilities at different nets and the second model uses these signal probabilities along with other features to predict the reduction in test-pattern count when OPs are inserted at different locations in the design. The features we consider include structural features like gate type, gate logic, reconvergent-fanouts and testability features like SCOAP. Our simulation results indicate that the proposed machine learning models can predict the probabilistic testability metrics with reasonable accuracy and can identify observation points that reduce pattern count.

SESSION: Autonomous Systems and Machine Learning on Embedded Systems

Session details: Autonomous Systems and Machine Learning on Embedded Systems

Ibrahim (Abe) Elfadel
Mimi Xie

Romanus: Robust Task Offloading in Modular Multi-Sensor Autonomous Driving Systems

Luke Chen
Mohanad Odema
Mohammad Abdullah Al Faruque

Due to the high performance and safety requirements of self-driving applications, the complexity of modern autonomous driving systems (ADS) has been growing, instigating the need for more sophisticated hardware which could add to the energy footprint of the ADS platform. Addressing this, edge computing is poised to encompass self-driving applications, enabling the compute-intensive autonomy-related tasks to be offloaded for processing at compute-capable edge servers. Nonetheless, the intricate hardware architecture of ADS platforms, in addition to the stringent robustness demands, set forth complications for task offloading which are unique to autonomous driving. Hence, we present ROMANUS, a methodology for robust and efficient task offloading for modular ADS platforms with multi-sensor processing pipelines. Our methodology entails two phases: (i) the introduction of efficient offloading points along the execution path of the involved deep learning models, and (ii) the implementation of a runtime solution based on Deep Reinforcement Learning to adapt the operating mode according to variations in the perceived road scene complexity, network connectivity, and server load. Experiments on the object detection use case demonstrated that our approach is 14.99% more energy-efficient than pure local execution while achieving a 77.06% reduction in risky behavior from a robust-agnostic offloading baseline.

ModelMap: A Model-Based Multi-Domain Application Framework for Centralized Automotive Systems

Soham Sinha
Anam Farrukh
Richard West

This paper presents ModelMap, a model-based multi-domain application development framework for DriveOS, our in-house centralized vehicle management software system. DriveOS runs on multicore x86 machines and uses hardware virtualization to host isolated RTOS and Linux guest OS sandboxes. In this work, we design Simulink interfaces for model-based vehicle control function development across multiple sandboxed domains in DriveOS. ModelMap provides abstractions to: (1) automatically generate periodic tasks bound to threads in different OS domains, (2) establish cross-domain synchronous and asynchronous communication interfaces, and (3) handle USB-based CAN I/O in Simulink. We introduce the concept of a nested binary, for the deployment of ELF binary executable code in different sandboxed domains. We demonstrate ModelMap using a combination of synthetic benchmarks, and experiments with Simulink models of a CAN Gateway and HVAC service running on an electric car. ModelMap eases the development of applications, which are shown to achieve industry-target performance using a multicore hardware platform in DriveOS.

INDENT: Incremental Online Decision Tree Training for Domain-Specific Systems-on-Chip

Anish Krishnakumar
Radu Marculescu
Umit Ogras

The performance and energy efficiency potential of heterogeneous architectures has fueled domain-specific systems-on-chip (DSSoCs) that integrate general-purpose and domain-specialized hardware accelerators. Decision trees (DTs) perform high-quality, low-latency task scheduling to utilize the massive parallelism and heterogeneity in DSSoCs effectively. However, offline trained DT scheduling policies can quickly become ineffective when applications or hardware configurations change. There is a critical need for runtime techniques to train DTs incrementally without sacrificing accuracy since current training approaches have large memory and computational power requirements. To address this need, we propose INDENT, an incremental online DT framework to update the scheduling policy and adapt it to unseen scenarios. INDENT updates DT schedulers at runtime using only 1–8% of the original training data embedded during training. Thorough evaluations with hardware platforms and DSSoC simulators demonstrate that INDENT performs within 5% of a DT trained from scratch using the entire dataset and outperforms current state-of-the-art approaches.

SGIRR: Sparse Graph Index Remapping for ReRAM Crossbar Operation Unit and Power Optimization

Cheng-Yuan Wang
Yao-Wen Chang
Yuan-Hao Chang

Resistive Random Access Memory (ReRAM) Crossbars are a promising process-in-memory technology to reduce enormous data movement overheads of large-scale graph processing between computation and memory units. ReRAM cells can combine with crossbar arrays to effectively accelerate graph processing, and partitioning ReRAM crossbar arrays into Operation Units (OUs) can further improve computation accuracy of ReRAM crossbars. The operation unit utilization was not optimized in previous work, incurring extra cost. This paper proposes a two-stage algorithm with a crossbar OU-aware scheme for sparse graph index remapping for ReRAM (SGIRR) crossbars, mitigating the influence of graph sparsity. In particular, this paper is the first to consider the given operation unit size with the remapping index algorithm, optimizing the operation unit and power dissipation. Experimental results show that our proposed algorithm reduces the utilization of crossbar OUs by 31.4%, improves the total OU block usage by 10.6%, and saves energy consumption by 17.2%, on average.

2021 ACM-IEEE International Conference on Formal Methods and Models for System Design (MEMOCODE) Table of Content

22 November 2022

Yibo Lin

No comments

Categories: Publications

Full Citation in the ACM Digital Library

Polynomial word-level verification of arithmetic circuits

Mohammed Barhoush
Alireza Mahzoon
Rolf Drechsler

Verifying the functional correctness of a circuit is often the most time-consuming part of the design process. Recently, world-level formal verification methods, e.g., Binary Moment Diagram (BMD) and Symbolic Computer Algebra (SCA) have reported very good results for proving the correctness of arithmetic circuits. However, these techniques still frequently fail due to memory or time requirements. The unknown complexity bounds of these techniques make it impossible to predict before invoking the verification tool whether it will successfully terminate or run for an indefinite amount of time.

In this paper, we formally prove that for integer arithmetic circuits, the entire verification process requires at most linear space and quadratic time with respect to the size of the circuit function. This is shown for the two main word-level verification methods: backward construction using BMD and backward substitution using SCA. We support the architectures which are used in the implementation of integer polynomial operations, e.g., X³ – XY² + XY. Finally, we show in practice that the required space and run times of the word-level methods match the predicted results in theory when it comes to the verification of different arithmetic circuits, including exponentiation circuits with different power values (X^P : 2 ≤ P ≤ 7) and more complicated circuits (e.g., X² + XY + X).

Simplification of numeric variables for PLC model checking

Ignacio D. Lopez-Miguel
Borja Fernández Adiego
Jean-Charles Tournier
Enrique Blanco Viñuela
Juan A. Rodriguez-Aguilar

Software model checking has recently started to be applied in the verification of programmable logic controller (PLC) programs. It works efficiently when the number of input variables is limited, their interaction is small and, thus, the number of states the program can reach is not large. As observed in the large code base of the CERN industrial PLC applications, this is usually not the case: it thus leads to the well-known state-space explosion problem, making it impossible to perform model checking. One of the main reasons that causes state-space explosion is the inclusion of numeric variables due to the wide range of values they can take. In this paper, we propose an approach to discretize PLC input numeric variables (modelled as non-deterministic). This discretization is complemented with a set of transformations on the control-flow automaton that models the PLC program so that no extra behaviours are added. This approach is then quantitatively evaluated with a set of empirical tests using the PLC model checking framework PLCverif and three different state-of-the-art model checkers (CBMC, nuXmv, and Theta), showing beneficial results for BDD-based model checkers.

Enforcement FSMs: specification and verification of non-functional properties of program executions on MPSoCs

Khalil Esper
Stefan Wildermann
Jürgen Teich

Many embedded system applications impose hard real-time, energy or safety requirements on corresponding programs typically concurrently executed on a given MPSoC target platform. Even when mutually isolating applications in space or time, the enforcement of such properties, e.g., by adjusting the number of processors allocated to a program or by scaling the voltage/frequency mode of involved processors, is a difficult problem to solve, particularly in view of typically largely varying environmental input (workload) per execution. In this paper, we formalize the related control problem using finite state machine models for the uncertain environment determining the workload, the system response (feedback), as well as the enforcer strategy. The contributions of this paper are as follows: a) Rather than trace-based simulation, the uncertain environment is modeled by a discrete-time Markov chain (DTMC) as a random process to characterize possible input sequences an application may experience. b) A number of important verification goals to analyze different enforcer FSMs are formulated in PCTL for the resulting stochastic verification problem, i.e., the likelihood of violating a timing or energy constraint, or the expected number of steps for a system to return to a given execution time corridor. c) Applying stochastic model checking, i.e., PRISM to analyze and compare enforcer FSMs in these properties, and finally d) proposing an approach for reducing the environment DTMC by partitioning equivalent environmental states (i.e., input states leading to an equal system response in each MPSoC mode) such that verification times can be reduced by orders of magnitude to just a few ms for real-world examples.

LION: real-time I/O transfer control for massively parallel processor arrays

Dominik Walter
Jürgen Teich

The performance of many accelerator architectures depends on the communication with external memory. During execution, new I/O data is continuously fetched forth and back to memory. This data exchange is very often performance-critical and a careful orchestration thus vital. To satisfy the I/O demand for accelerators of loop nests, it was shown that the individual reads and writes can be merged into larger blocks, which are subsequently transferred by a single DMA transfer. Furthermore, the order in which such DMA transfers must be issued, was shown to be reducible to a real-time task scheduling problem to be solved at run time. Rather than just concepts, we investigate in this paper efficient algorithms, data structures and their implementation in hardware of such a programmable Loop I/O Controller architecture called LION that only needs to be synthesized once for each processor array size and I/O buffer configuration, thus supporting a large class of processor arrays. Based on a proposed heap-based priority queue, LION is able to issue every 6 cycles a new DMA request to a memory bus. Even on a simple FPGA prototype running at just 200 MHz, this allows for more than 33 million DMA requests to be issued per second. Since the execution time of a typical DMA request is in general at least one order of magnitude longer, we can conclude that this rate is sufficient to fully utilize a given memory interface. Finally, we present implementations on FPGA and also 22nm FDX ASIC showing that the overall overhead of a LION typically amounts to less than 5% of an overall processor array design.

Learning optimal decisions for stochastic hybrid systems

Mathis Niehage
Arnd Hartmanns
Anne Remke

We apply reinforcement learning to approximate the optimal probability that a stochastic hybrid system satisfies a temporal logic formula. We consider systems with (non)linear continuous dynamics, random events following general continuous probability distributions, and discrete nondeterministic choices. We present a discretized view of states to the learner, but simulate the continuous system. Once we have learned a near-optimal scheduler resolving the choices, we use statistical model checking to estimate its probability of satisfying the formula. We implemented the approach using Q-learning in the tools HYPEG and modes, which support Petri net- and hybrid automata-based models, respectively. Via two case studies, we show the feasibility of the approach, and compare its performance and effectiveness to existing analytical techniques for a linear model. We find that our new approach quickly finds near-optimal prophetic as well as non-prophetic schedulers, which maximize or minimize the probability that a specific signal temporal logic property is satisfied.

A secure insulin infusion system using verification monitors

Abhinandan panda
Srinivas Pinisetty
Partha Roop

Wearable and implantable medical devices are being increasingly deployed for diagnosis, monitoring, and to provide therapy for critical medical conditions. Such medical devices are examples of safety-critical, cyber-physical systems. In this paper we focus on insulin infusion systems (IISs), which are used by diabetics to maintain safe blood glucose levels. These systems support wireless features introducing potential vulnerabilities. Although these devices go through rigorous safety certification processes, these are not able to mitigate security threats. Based on published literature, attackers can remotely command to inject an incorrect amount of insulin thereby posing threat to a patient’s life. While prior work based on formal methods have been proposed to detect potential attack vectors using different forms of static analysis, these have limitations in preventing attacks at run-time. Also, as these devices are safety critical, it is not possible to apply security patches, when new types of attacks are detected, due to the need for recertification.

This paper addresses these limitations by developing a formal framework for the detection of cyber-physical attacks on an IIS. First, we propose a wearable device that senses the familiar ECG to detect attacks. Thus, this device is separate from the insulin infusion system, ensuring no need for recertification of IISs. To facilitate the design of this device, we establish a correlation of ECG intervals and blood glucose levels using statistical analysis. This helps us in proposing a framework for security policy mining using the developed statistical analysis. This paves the way for the design of formal verification monitors for IISs for the first time. We perform performance evaluation of the verification monitor, which proves the technical feasibility for the design of wearable devices for attack detection of IISs. Our approach is amenable to the application of security patches, when new attack vectors are detected, making the approach ideal for run-time monitoring of medical CPSs.

Translating structured sequential programs to dataflow graphs

Klaus Schneider

In this paper, a translation from structured sequential programs to equivalent dataflow process networks (DPNs) is presented that is based on a carefully chosen set of nodes including load/store operations to access a shared global memory. For every data structure stored in the main memory, we use corresponding tokens to enforce the sequential ordering of load/store operations accessing that data structure as far as needed. Except for the load/store nodes, all nodes obey the Kahn principle so that they are deterministic in the sense that the same inputs are always mapped to the same outputs regardless of the execution schedule of the nodes. Due to the sequential ordering of load/store nodes, determinacy is also maintained by them. Moreover, the generated DPNs are quasi-static, i.e., they have schedules that are bounded in a very strict sense: For every statement of the sequential program, the corresponding DPN behaves like a homogeneous synchronous actor, i.e., it consumes one value of each input port and will finally provide one value on each output port. Hence, no more than one value needs to be stored in each buffer.

Online monitoring of spatio-temporal properties for imprecise signals

Ennio Visconti
Ezio Bartocci
Michele Loreti
Laura Nenzi

From biological systems to cyber-physical systems, monitoring the behavior of such dynamical systems often requires reasoning about complex spatio-temporal properties of physical and computational entities that are dynamically interconnected and arranged in a particular spatial configuration. Spatio-Temporal Reach and Escape Logic (STREL) is a recent logic-based formal language designed to specify and reason about spatio-temporal properties. STREL considers each system’s entity as a node of a dynamic weighted graph representing its spatial arrangement. Each node generates a set of mixed-analog signals describing the evolution over time of computational and physical quantities characterizing the node’s behavior. While there are offline algorithms available for monitoring STREL specifications over logged simulation traces, here we investigate for the first time an online algorithm enabling the runtime verification during the system’s execution or simulation. Our approach extends the original framework by considering imprecise signals and by enhancing the logics’ semantics with the possibility to express partial guarantees about the conformance of the system’s behavior with its specification. Finally, we demonstrate our approach in a real-world environmental monitoring case study.

Verified functional programming of an IoT operating system’s bootloader

Shenghao Yuan
Jean-Pierre Talpin

The fault of one device on a grid may incur severe economical or physical damages. Among the many critical components in such IoT devices, the operating system’s bootloader comes first to initiate the trusted function of the device on the network. However, a bootloader uses hardware-dependent features that make its functional correctness proof difficult. This paper uses verified programming to automate the verification of both the C libraries and assembly boot-sequence of such a, real-world, bootloader in an operating system for ARM-based IoT devices: RIoT. We first define the ARM ISA specification, semantics and properties in F* to model its critical assembly code boot sequence. We then use Low*, a DSL rendering a C-like memory model in F*, to implement the complete bootloader library and verify its functional correctness and memory safety. Other than fixing potential faults and vulnerabilities in the source C and ASM bootloader, our evaluation provides an optimized and formally documented code structure, a reasonable specification/implementation ratio, a high degree of proof automation and an equally efficient generated code.

Controller verification meets controller code: a case study

Felix Freiberger
Stefan Schupp
Holger Hermanns
Erika Ábrahám

Cyber-physical systems are notoriously hard to verify due to the complex interaction between continuous physical behavior and discrete control. A widespread and important class is formed by digital controllers that operate on fixed control cycles to interact with the physical environment they are embedded in. This paper presents a case study for integrating such controllers into a rigorous verification method for cyber-physical systems, using flowpipe-based verification methods to verify legally binding requirements for electrified vehicles to a custom bike design. The controller is integrated in the underlying model in a way that correctly represents the input discretization performed by any digital controller.

Translation of continuous function charts to imperative synchronous quartz programs

Marcel Christian Werner
Klaus Schneider

Programmable logic controllers operating in a sequential execution scheme are widely used for various applications in industrial environments with real-time requirements. The graphical programming languages described in the third part of IEC 61131 are often intended to perform open and closed loop control tasks. Continuous Function Charts (CFCs) represent an additional language accepted in practice which can be interpreted as an extension of IEC 61131-3 Function Block Diagrams. Those charts allow more flexible positioning and interconnection of function blocks, but can quickly become difficult to manage. Furthermore, the sequential execution order forces a sequential processing of possible independent and thus possibly parallel program paths. The question arises whether a translation of existing CFCs to synchronous programs considering independent actions can lead to a more manageable software model. While current formalization approaches for CFCs primarily focus on verification, the focus of this approach is on restructuring and possible reuse in engineering. This paper introduces a possible automated translation of CFCs to imperative synchronous Quartz programs and outlines the potential for reducing the states of equivalent extended finite state machines through restructuring.

Design and formal verification of a copland-based attestation protocol

Adam Petz
Grant Jurgensen
Perry Alexander

We present the design and formal analysis of a remote attestation protocol and accompanying security architecture that generate evidence of trustworthy execution for legacy software. For formal guarantees of measurement ordering and cryptographic evidence strength, we leverage the Copland language and Copland Virtual Machine execution semantics. For isolation of attestation mechanisms we design a layered attestation architecture that leverages the seL4 microkernel. The formal properties of the protocol and architecture together serve to discharge assumptions made by an existing higher-level model-finding tool to characterize all ways an active adversary can corrupt a target and go undetected. As a proof of concept, we instantiate this analysis framework with a specific Copland protocol and security architecture to measure a legacy flight planning application. By leveraging components that are amenable to formal analysis, we demonstrate a principled way to design an attestation protocol and argue for its end-to-end correctness.

Sampling of shape expressions with ShapEx

Nicolas Basset
Thao Dang
Felix Gigler
Cristinel Mateis
Dejan Ničković

In this paper we present ShapEx, a tool that generates random behaviors from shape expressions, a formal specification language for describing sophisticated temporal behaviors of CPS. The tool samples a random behavior in two steps: (1) it first explores the space of qualitative parameterized shapes and then (2) instantiates parameters by sampling a possibly non-linear constraint. We implement several sampling strategies in the tool that we present in the paper and demonstrate its applicability on two use scenarios.

SEESAW: a tool for detecting memory vulnerabilities in protocol stack implementations

Farhaan Fowze
Tuba Yavuz

As the number of Internet of Things (IoT) devices proliferate, an in-depth understanding of the IoT attack surface has become quintessential for dealing with the security and reliability risks. IoT devices and components execute implementations of various communication protocols. Vulnerabilities in the protocol stack implementations form an important part of the IoT attack surface. Therefore, finding memory errors in such implementations is essential for improving the IoT security and reliability. This paper presents a tool, SEESAW, that is built on top of a static analysis tool and a symbolic execution engine to achieve scalable analysis of protocol stack implementations. SEESAW leverages the API model of the analyzed code base to perform component-level analysis. SEESAW has been applied to the USB and Bluetooth modules within the Linux kernel. SEESAW can reproduce known memory vulnerabilities in a more scalable way compared to baseline symbolic execution.

Formal modelling of attack scenarios and mitigation strategies in IEEE 1588

Kelvin Anto
Partha S. Roop
Akshya K. Swain

IEEE 1588 is a time synchronization protocol that is extensively used by many Cyber-Physical Systems (CPSs). However, this protocol is prone to various types of attacks. We focus on a specific type of Man-in-the-Middle (MITM) attack, where the attacker introduces random delays to the messages being exchanged between a master and a slave. Such attacks have been modelled previously and some mitigation strategies have also been developed. However, the proposed methods work only under constant delay attacks and the developed mitigation strategies are ad-hoc. We propose the first formal framework for modelling and mitigating time delay attacks in IEEE 1588. Initially, the master, the slave and the communication medium are modelled as Timed Automata (TA) assuming the absence of any attacks. Subsequently, a generic attacker is modelled as a TA, which can formally represent various attacks including constant delay, linear delay and exponential delay. Finally, system identification methods of control theory is used to design proportional controllers for mitigating the effects of time delay attacks. We use model checking to ensure the resilience of protocol to time delay attacks using the proposed mitigation strategy.

Xun Jiao

28 October 2022

Yibo Lin

No comments

Categories: Who's Who

Nov 1st, 2022

Xun Jiao

Assistant Professor

Villanova University

Email:
xun.jiao@villanova.edu

Personal webpage
https://vu-detail.github.io/people/jiao

Research interests

Robust Computing, Efficient Computing, AI/Machine Learning, Brain-inspired Computing, Fuzz Testing

Short bio

Xun Jiao is an assistant professor in ECE department of Villanova University. He leads the Dependable, Efficient, and Intelligent Computing Lab (DETAIL). Before that, he obtained his Ph.D. degree from UC San Diego in 2018. He earned a dual first-class Bachelor degree from Joint Program of Queen Mary University of London and Beijing University of Posts and Telecommunications in 2013. His research interests are on robust and efficient computing for intelligent applications such as AI and machine learning. He published 50+ papers in international conferences and journals. He received 6 paper awards/nominations in international conferences such as DATE, EMSOFT, DSD, and SELSE. He is an associate editor of IEEE Trans on CAD, and a TPC member of DAC, ICCAD, ASP-DAC, GLSVLSI, LCTES. His research is sponsored by NSF, NIH, L3Harris, and Nvidia. He has delivered an invited presentation at U.S. Congressional House. He is a recipient of 2022 IEEE Young Engineer of the Year Award (Philadelphia Section).

Research highlights

Robust computing
• With continuous scaling of CMOS technology, circuits are even more susceptible to timing errors caused by microelectronic variations such as voltage and temperature variations, making them a notable threat to circuit/system reliability. Dr. Jiao has adopted a cross-layer approach (circuit-architecture-application) to combat errors/faults originated in hardware. Specifically, Dr. Jiao has pioneered in developing machine learning-based models to model/predict the errors in hardware and take proactive actions such as instruction-based frequency scaling to prevent errors. By exploiting the application-level error resilience of different applications (e.g., AI/machine learning, multimedia), Dr. Jiao has also developed various approximate computing techniques for more efficient execution.

Energy-efficient computing
• Energy efficiency has become a top priority for both high-performance computing systems and resource-constrained embedded systems. Dr. Jiao proposed solutions to this challenge at multiple abstraction levels. He proposed intelligent dynamic voltage and frequency scaling (DVFS) for circuits and systems, as well as designing novel efficient architecture such as in-memory computing and bloom filter to execute emerging workloads such as deep neural networks.

AI/brain-inspired computing
• Hyperdimensional computing (HDC) was introduced as an alternative computational model mimicking the “human brain” at the functionality level. Compared with DNNs, the advantages of HDC include smaller model size, less computation cost, and one/few-shot learning, making it a promising alternative computing paradigm. Dr. Jiao’s work has been pioneering the robustness of HDC against adversarial attacks and hardware errors, which has earned him a best paper nomination at DATE 2022. He also applied HDC to various application domains such as natural language processing, drug discovery, and anomaly detection, which demonstrated promising performance compared to traditional learning methods.

Fuzzing-based secure system
• Cyber-security in the digital age is a first-class concern. The ever-increasing use of digital devices, unfortunately, is facing significant challenges, due to the serious effects of security vulnerabilities. Dr. Jiao has developed a series of vulnerability detection techniques based on fuzzing, and has applied to software, firmware, and hardware. Over 100 previously unknown vulnerabilities are discovered and are reported to the US National Vulnerability Database with unique CVE assignments. He received two best paper nominations from EMSOFT 2019 and 2020.

CADathlon Brasil 2022 Highlights

9 October 2022

Yibo Lin

No comments

Categories: News

The CADathlon Brasil 2022 – 2^nd Brazilian Programming Contest for Design Automation of Integrated Circuits (https://csbc.sbc.org.br/2022/cadathlon-brasil-en/) took place on August 2^nd in Niteroi, Rio de Janeiro State, Brazil, as a co-located event of the 42^nd Annual Congress of SBC (Brazilian Computer Society). It was organized by Federal University of Santa Catarina (UFSC) and Fluminense Federal University (UFF) and sponsored by ACM/SIGDA, IEEE CEDA (Council on Electronic Design Automation), SBC/CECCI (SBC Special Committee on Integrated Circuits and Systems Design) and SBMicro (Brazilian Microelectronics Society). It was financially sponsored by Synopsys, Chipus Microelectronics, ACM/SIGDA and IEEE CEDA.

As in the first edition, CADAthlon Brasil 2022 followed the same format as the ACM/SIGDA CADathlon, which happens annually co-located with ICCAD (International Conference on Computer-Aided Design). During the whole day, 15 two-person teams of students coming from different regions of Brazil worked to solve 6 practical problems on classical EDA topics such as circuit design & analysis, physical design, logic synthesis, high-level synthesis, circuit verification, and application of AI to design automation. The problems were prepared by a team of researchers from industry and academia.

This year the first place was won by team “turma da Monica”, from University of Brasília (UnB), formed by Enzo Yoshio Niho and Eduardo Quirino de Oliveira, and the second place was won by team “Rabisco UFSC”, from the Federal University of Santa Catarina (UFSC), formed by Arthur Joao Lourenço and Bernardo Borges Sandoval. The top 2 teams were awarded money prizes offered by Synopsys.

The CADAthlon Brasil 2022 Organizing Committee greatly thanks the Congress of SBC organizers for the logistics support, the problem preparation team and all sponsors, specially the financial support from Synopsys, Chipus Microelectronics, ACM/SIGDA and IEEE CEDA (through the South Brazil Chapter), which made it possible to cover the travel expenses of the competitors, making the event a huge success.

The next edition of CADathlon Brasil will occur as a co-located event of the 43^rd Annual Congress of SBC, in July 2023, in Joao Pessoa, northeast region of Brazil.

Photos:

CADathlon Brasil 2022: team “Turma da monica”, winner of the First Place

CADathlon Brasil 2022: team “Rabisco UFSC”, winner of the Second Place

CADathlon Brasil 2022: the organizers, the societies representatives and the Platinum sponsor representative

CADathlon Brasil 2022 banner at the lab door

CADathlon Brasil 2022 dinner & award session: third place awarding

CADathlon Brasil 2022 dinner & award session: second place awarding

CADathlon Brasil 2022 dinner & award session: first place awarding (representative professor from UnB)

2022 ACM/IEEE Workshop on Machine Learning for CAD (MLCAD) Table of Content

10 September 2022

Yibo Lin

No comments

Categories: Publications

Full Citation in the ACM Digital Library

SESSION: Session 1: Physical Design and Optimization with ML

Placement Optimization via PPA-Directed Graph Clustering

Yi-Chen Lu
Tian Yang
Sung Kyu Lim
Haoxing Ren

In this paper, we present the first Power, Performance, and Area (PPA)-directed, end-to-end placement optimization framework that provides cell clustering constraints as placement guidance to advance commercial placers. Specifically, we formulate PPA metrics as Machine Learning (ML) loss functions, and use graph clustering techniques to optimize them by improving clustering assignments. Experimental results on 5 GPU/CPU designs in a 5nm technology not only show that our framework immediately improves the PPA metrics at the placement stage, but also demonstrate that the improvements last firmly to the post-route stage, where we observe improvements of 89% in total negative slack (TNS), 26% in effective frequency, 2.4% in wirelength, and 1.4% in clock power.

From Global Route to Detailed Route: ML for Fast and Accurate Wire Parasitics and Timing Prediction

Vidya A. Chhabria
Wenjing Jiang
Andrew B. Kahng
Sachin S. Sapatnekar

Timing prediction and optimization are challenging in design stages prior to detailed routing (DR) due to the unavailability of routing information. Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure. This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a “complete” netlist. The paper first documents that having “oracle knowledge” of the final post-DR parasitics enables post-global routing (GR) optimization to produce improved final timing outcomes. Machine learning (ML)-based models are proposed to bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization. These models show higher accuracy than GR-based timing estimation and, when used during post-GR optimization, show demonstrable improvements in post-DR circuit performance. Results on open 45nm and 130nm enablements using OpenROAD show efficient improvements in post-DR WNS and TNS metrics without increasing congestion.

Faster FPGA Routing by Forecasting and Pre-Loading Congestion Information

Umair Siddiqi
Timothy Martin
Sam Van Den Eijnden
Ahmed Shamli
Gary Grewal
Sadiq Sait
Shawki Areibi

Field Programmable Gate Array (FPGA) routing is one of the most time consuming tasks within the FPGA design flow, requiring hours and even days to complete for some large industrial designs. This is becoming a major concern for FPGA users and tool developers. This paper proposes a simple, yet effective, framework that reduces the runtime of PathFinder based routers. A supervised Machine Learning (ML) algorithm is developed to forecast costs (from the placement phase) associated with possible congestion and hot spot creation in the routing phase. These predicted costs are used to guide the router to avoid highly congested regions while routing nets, thus reducing the total number of iterations and rip-up and reroute operations involved. Results obtained indicate that the proposed ML approach achieves on average a 43 reduction in the number of routing iterations and 28.6 reduction in runtime when implemented in the state-of-the-art enhanced PathFinder algorithm.

SESSION: Session 2: Machine Learning for Analog Design

Deep Reinforcement Learning for Analog Circuit Sizing with an Electrical Design Space and Sparse Rewards

Yannick Uhlmann
Michael Essich
Lennart Bramlage
Jürgen Scheible
Cristóbal Curio

There is still a great reliance on human expert knowledge during the analog integrated circuit sizing design phase due to its complexity and scale, with the result that there is a very low level of automation associated with it. Current research shows that reinforcement learning is a promising approach for addressing this issue. Similarly, it has been shown that the convergence of conventional optimization approaches can be improved by transforming the design space from the geometrical domain into the electrical domain. Here, this design space transformation is employed as an alternative action space for deep reinforcement learning agents. The presented approach is based entirely on reinforcement learning, whereby agents are trained in the craft of analog circuit sizing without explicit expert guidance. After training and evaluating agents on circuits of varying complexity, their behavior when confronted with a different technology, is examined, showing the applicability, feasibility as well as transferability of this approach.

LinEasyBO: Scalable Bayesian Optimization Approach for Analog Circuit Synthesis via One-Dimensional Subspaces

Shuhan Zhang
Fan Yang
Changhao Yan
Dian Zhou
Xuan Zeng

A large body of literature has proved that the Bayesian optimization framework is especially efficient and effective in analog circuit synthesis. However, most of the previous research works only focus on designing informative surrogate models or efficient acquisition functions. Even if searching for the global optimum over the acquisition function surface is itself a difficult task, it has been largely ignored. In this paper, we propose a fast and robust Bayesian optimization approach via one-dimensional subspaces for analog circuit synthesis. By solely focusing on optimizing one-dimension subspaces at each iteration, we greatly reduce the computational overhead of the Bayesian optimization framework while safely maximizing the acquisition function. By combining the benefits of different dimension selection strategies, we adaptively balancing between searching globally and locally. By leveraging the batch Bayesian optimization framework, we further accelerate the optimization procedure by making full use of the hardware resources. Experimental results quantitatively show that our proposed algorithm can accelerate the optimization procedure by up to $9\times$ and $38\times$ compared to LP-EI and REMBOpBO respectively when the batch size is 15.

RobustAnalog: Fast Variation-Aware Analog Circuit Design Via Multi-task RL

Wei Shi
Hanrui Wang
Jiaqi Gu
Mingjie Liu
David Z. Pan
Song Han
Nan Sun

Analog/mixed-signal circuit design is one of the most complex and time-consuming stages in the whole chip design process. Due to various process, voltage, and temperature (PVT) variations from chip manufacturing, analog circuits inevitably suffer from performance degradation. Although there has been plenty of work on automating analog circuit design under the nominal condition, limited research has been done on exploring robust designs under the real and unpredictable silicon variations. Automatic analog design against variations requires prohibitive computation and time costs. To address the challenge, we present RobustAnalog, a robust circuit design framework that involves the variation information in the optimization process. Specifically, circuit optimizations under different variations are considered as a set of tasks. Similarities among tasks are leveraged and competitions are alleviated to realize a sample-efficient multi-task training. Moreover, RobustAnalog prunes the task space according to the current performance in each iteration, leading to a further simulation cost reduction. In this way, RobustAnalog can rapidly produce a set of circuit parameters that satisfies diverse constraints (e.g. gain, bandwidth, noise…) across variations. We compare RobustAnalog with Bayesian optimization, Evolutionary algorithm, and Deep Deterministic Policy Gradient (DDPG) and demonstrate that RobustAnalog can significantly reduce required the optimization time by 14x-30x. Therefore, our study provides a feasible method to handle various real silicon conditions.

Automatic Analog Schematic Diagram Generation based on Building Block Classification and Reinforcement Learning

Hung-Yun Hsu
Mark Po-Hung Lin

Schematic visualization is important for analog circuit designers to quickly recognize the structures and functions of transistor-level circuit netlists. However, most of the original analog design or other automatically extracted analog circuits are stored in the form of transistor-level netlists in the SPICE format. It can be error-prone and time-consuming to manually create an elegant and readable schematic from a netlist. Different from the conventional graph-based methods, this paper introduces a novel analog schematic diagram generation flow based on comprehensive building block classification and reinforcement learning. The experimental results show that the proposed method can effectively generate aesthetic analog circuit schematics with a higher building block compliance rate, and fewer numbers of wire bends and net crossings, resulting in better readability, compared with existing methods and modern tools.

SESSION: Plenary I

The Changing Landscape of AI-driven System Optimization for Complex Combinatorial Optimization

Somdeb Majumdar

With the unprecedented success of modern machine learning in areas like computer vision and natural language processing, a natural question is where can it have maximum impact in real life. At Intel Labs, we are actively investing in research that leverages the robustness and generalizability of deep learning to solve system optimization problems. Examples of such systems include individual hardware modules like memory schedulers and power management units on a chip, automated compiler and software design tools as well as broader problems like chip design. In this talk, I will address some of the open challenges in systems optimization and how Intel and others in the research community are harnessing the power of modern reinforcement learning to address those challenges. A particular aspect of problems in the domain of chip design is the very large combinatorial complexity of the solution space. For example, the number of possible ways to place standard cells and macros on a canvas for even small to medium sized netlists can approach 10100 to 101000. Importantly, only a very small subset of these possible outcomes are actually valid and performant.

Standard approaches like reinforcement learning struggle to learn effective policies under such conditions. For example, a sequential placement policy can get a reinforcing reward signal only after having taken several thousand individual placement actions. This reward is inherently noisy – especially when we need to assign credit to the earliest steps of the multi-step placement episode. This is an example of the classic credit assessment problem in reinforcement learning.

A different way to tackle such problems is to simply search over the solution space. Many approaches exist ranging from Genetic Algorithms to Monte Carlo Tree Search. However, they suffer from very slow convergence times due to the size of the search space.

In order to tackle such problems, we investigate an approach that combines the fast learning capabilities of reinforcement learning and the ability of search based methods to find performant solutions. We use deep reinforcement learning to strategies that are sub-optimal but quick to find. We use these partial solutions as anchors around which we constrain a genetic algorithm based search. This allows us to still exploit the power of genetic algorithms to find performant solutions while significantly reducing the overall search time.

I will describe this solution in the context of combinatorial optimization problems like device placement where we show the ability to learn effective strategies on combinatorial complexities of up to 10300. We also show that by representing these policies as neural networks, we are able to achieve reasonably good zero shot transfer learning performance on unseen problem configurations. Finally, I will touch upon how we are adapting this framework to handle similar combinatoric optimization problems for placement in EDA pipelines.

SESSION: Invited Session I

AI Chips Built by AI – Promise or Reality?: An Industry Perspective

Thomas Andersen

Artificial Intelligence is an avenue to innovation that is touching every industry worldwide. AI has made rapid advances in areas like speech and image recognition, gaming, and even self-driving cars, essentially automating less complex human tasks. In turn, this demand drives rapid growth across the semiconductor industry with new chip architectures emerging to deliver the specialized processing needed for the huge breadth of AI applications. Given the advances made to automate simple human tasks, can AI solve more complex tasks such as designing a computer chip? In this talk, we will discuss the challenges and opportunities of building advanced chip designs with the help of artificial intelligence, enabling higher performance, faster time to market, and utilizing reuse of machine-generated learning for successive products.

ML for Analog Design: Good Progress, but More to Do

Borivoje Nikolić

Analog and mixed-signal (AMS) blocks are often critical and time-consuming part of System-on-Chip (SoC) design, due to the largely manual process of circuit design, simulation and SoC integration iterations. There have been numerous efforts to realize AMS blocks from specification by using a process analogous to digital synthesis, with automated place and route techniques [1], [2], but although very effective within their application domains, they have been limited in scope. AMS block design process, outlined in Figure 1, starts with the derivation of its target performance specifications (gain, bandwidth, phase margin, settling time, etc.) from system requirements, and establishes a simulation testbench. Then, a designer relies on their expertise to choose the topology that is most likely to achieve the desired performance with minimum power consumption. Circuit sizing is a process of determining schematic-level transistor widths and lengths to attain the specifications, with minimum power consumption. Many of the commonly used analog circuits can be sized by using well-established heuristics to achieve near-optimal performance [3]-[5]. The performance is verified by running simulations, and there has been a notable progress in enriching the commercial simulators to automate the testbench design. Machine learning (ML) based techniques have recently been deployed in circuit sizing to achieve optimality without relying on design heuristics [6]-[8]. Many of the commonly employed ML techniques require a rich training dataset; reinforcement learning (RL) sidesteps this issue by using agent that interacts with its simulation environment through a trial-and-error process that mimics learning in humans. In each step, the RL agent, which contains a neural network, observes the state of the environment and takes a sizing action. The most time-consuming step in a traditional design procedure is layout, which is typically a manual iterative process. Layout parasitics degrade the schematic-level performance, requiring circuit resizing. However, the use of circuit generators, such as the Berkeley Analog Generator (BAG) [9] automates the layout iterations. RL agents have been coupled with BAG to automate the complete design process for a fixed circuit topology [7]. Simulations with post-layout parasitics are much slower than schematic-level simulations, which calls for deployment of RL techniques that limit the sampled space. Finally, the process of integrating an AMS block into an SoC and verifying its system-level performance can be very time consuming.

SESSION: Session 3: Circuit Evaluation and Simulation with ML

SpeedER: A Supervised Encoder-Decoder Driven Engine for Effective Resistance Estimation of Power Delivery Networks

Bing-Yue Wu
Shao-Yun Fang
Hsiang-Wen Chang
Peter Wei

Voltage (IR) analysis tools need to be launched multiple times during the Engineering Change Order (ECO) phase in the modern design cycle for Power Delivery Network (PDN) refinement, while analyzing the IR characteristics of advanced chip designs by using traditional IR analysis tools suffers from massive run-time. Multiple Machine Learning (ML)-driven IR analysis approaches have been frequently proposed to benefit from the fast inference time and flexible prediction ability. Among these ML-driven approaches, the Effective Resistance (effR) of a given PDN has been shown to be one of the most critical features that can greatly enhance model performance and thus prediction accuracy; however, calculating effR alone is still computationally expensive. In addition, in the ECO phase, even if only local adjustments of the PDN are required, the run-time of obtaining the regional effR changes by using traditional Laplacian Systems grows exponentially as the size of the chip grows. It is because the whole PDN needs to be considered in a Laplacian solver for computing the effR of any single network node. To address the problem, this paper proposes an ML-driven engine, SpeedER, that combines a U-Net model and a Fully Connected Neural Network (FCNN) with five selected features to speed up the process of estimating regional effRs. Experimental results show that SpeedER can be approximately four times faster than a commercial tool using a Laplacian System with errors of only around 1%.

XT-PRAGGMA: Crosstalk Pessimism Reduction Achieved with GPU Gate-level Simulations and Machine Learning

Vidya A. Chhabria
Ben Keller
Yanqing Zhang
Sandeep Vollala
Sreedhar Pratty
Haoxing Ren
Brucek Khailany

Accurate crosstalk-aware timing analysis is critical in nanometer-scale process nodes. While today’s VLSI flows rely on static timing analysis (STA) techniques to perform crosstalk-aware timing signoff, these techniques are limited due to their static nature as they use imprecise heuristics such as arbitrary aggressor filtering and simplified delay calculations. This paper proposes XT-PRAGGMA, a tool that uses GPU-accelerated dynamic gate-level simulations and machine learning to eliminate false aggressors and accurately predict crosstalk-induced delta delays. XT-PRAGGMA reduces STA pessimism and provides crucial information to identify crosstalk-critical nets, which can be considered for accurate SPICE simulation before signoff. The proposed technique is fast (less than two hours to simulate 30,000 vectors on million-gate designs) and reduces falsely-reported total negative slack in timing signoff by 70%.

Fast Prediction of Dynamic IR-Drop Using Recurrent U-Net Architecture

Yonghwi Kwon
Youngsoo Shin

Recurrent U-Net (RU-Net) is employed for fast prediction of dynamic IR-drop when power distribution network (PDN) contains capacitor components. Each capacitor can be modeled by a resistor and a current source, which is a function of v(t-Δ t) node voltages at time t – Δ t allow the PDN to be solved at time t which then allows the analysis at t + Δ t and so on. Provided that a quick prediction of IR-drop at one time instance can be done by U-Net, a image segmentation model, the analysis of PDN containing capacitors can be done by a number of U-Net instances connected in series, which become RU-Net architecture. Four input maps (effective PDN resistance map, PDN capacitance map, current map, and power pad distance map) are extracted from each layout clip, and are provided to RU-Net for IR-drop prediction. Experiments demonstrate that the proposed IR-drop prediction using the RU-Net is faster than a commercial tool by 16 times with about 12% error, while a simple U-Net-based prediction yields 19% error due to its inability to consider capacitors.

SESSION: Session 4: DRC, Test and Hotspot Detection using ML Methods

Efficient Design Rule Checking Script Generation via Key Information Extraction

Binwu Zhu
Xinyun Zhang
Yibo Lin
Bei Yu
Martin Wong

Design rule checking (DRC) is a critical step in integrated circuit design. DRC requires formatted scripts as the input to the design rule checker. However, these scripts are always generated manually in the foundry, and such a generation process is extremely inefficient, especially when encountering a large number of design rules. To mitigate this issue, we first propose a deep learning-based key information extractor to automatically identify the essential arguments of the scripts from rules. Then, a script translator is designed to organize the extracted arguments into executable DRC scripts. In addition, we incorporate three specific design rule generation techniques to improve the performance of our extractor. Experimental results demonstrate that our proposed method can significantly reduce the cost of script generation and show remarkable superiority over other baselines.

Scan Chain Clustering and Optimization with Constrained Clustering and Reinforcement Learning

Naiju Karim Abdul
George Antony
Rahul M. Rao
Suriya T. Skariah

Scan chains are used in design for test by providing controllability and observability at each register. Scan optimization is run during physical design after placement where scannable elements are re-ordered along the chain to reduce total wirelength (and power). In this paper, we present a machine learning based technique that leverages constrained clustering and reinforcement learning to obtain a wirelength efficient scan chain solution. Novel techniques like next-min sorted assignment, clustered assignment, node collapsing, partitioned Q-Learning and in-context start-end node determination are introduced to enable improved wire length while honoring design-for-test constraints. The proposed method is shown to provide up to 24% scan wirelength reduction over a traditional algorithmic optimization technique across 188 moderately sized blocks from an industrial 7nm design.

Autoencoder-Based Data Sampling for Machine Learning-Based Lithography Hotspot Detection

Mohamed Tarek Ismail
Hossam Sharara
Kareem Madkour
Karim Seddik

Technology scaling has increased the complexity of integrated circuit design. It has also led to more challenges in the field of Design for Manufacturing (DFM). One of these challenges is lithography hotspot detection. Hotspots (HS) are design patterns that negatively affect the output yield. Identifying these patterns early in the design phase is crucial for high yield fabrication. Machine Learning-based (ML) hotspot detection techniques are promising since they have shown superior results to other methods such as pattern matching. Training ML models is a challenging task due three main reasons. First, industrial training designs contain millions of unique patterns. It is impractical to train models using this large number of patterns due to limited computational and memory resources. Second, the HS detection problem has an imbalanced nature; datasets typically have a limited number of HS and a large number of non-hotspots. Lastly, hotspot and non-hotspot patterns can have very similar geometries causing models to be susceptible to high false positive rates. Due to these reasons, the use of data sampling techniques is needed to choose the best representative dataset for training. In this paper, a dataset sampling technique based on autoencoders is introduced. The autoencoders are used to identify latent data features that can reconstruct the input patterns. These features are used to group the patterns using Density-based spatial clustering of applications with noise (DBSCAN). Then, the clustered patterns are sampled to reduce the training set size. Experiments on the ICCAD-2019 dataset show that the proposed data sampling approach can reduce the dataset size while maintaining the levels of recall and precision that were obtained using the full dataset.

SESSION: Session 5: Power and Thermal Evaluation with ML

Driving Early Physical Synthesis Exploration through End-of-Flow Total Power Prediction

Yi-Chen Lu
Wei-Ting Chan
Vishal Khandelwal
Sung Kyu Lim

Leading-edge designs on advanced nodes are pushing physical design (PD) flow runtime into several weeks. Stringent time-to-market constraint necessitates efficient power, performance, and area (PPA) exploration by developing accurate models to evaluate netlist quality in early design stages. In this work, we propose PD-LSTM, a framework that leverages graph neural networks (GNNs) and long short-term memory (LSTM) networks to perform end-of-flow power predictions in early PD stages. Experimental results on two commercial CPU designs and five OpenCore netlists demonstrate that PD-LSTM achieves high fidelity total power prediction results within 4% normalized root-mean-squared error (NRMSE) on unseen netlists and a correlation coefficient score as high as 0.98.

Towards Neural Hardware Search: Power Estimation of CNNs for GPGPUs with Dynamic Frequency Scaling

Christopher A. Metz
Mehran Goli
Rolf Drechsler

Machine Learning (ML) algorithms are essential for emerging technologies such as autonomous driving and application-specific Internet of Things(IoT) devices. Convolutional Neural Network(CNN) is one of the major techniques used in such systems. This leads to leveraging ML accelerators like GPGPUs to meet the design constraints. However, GPGPUs have high power consumption, and selecting the most appropriate accelerator requires Design Space Exploration(DSE), which is usually time-consuming and needs high manual effort. Neural Hardware Search(NHS) is an upcoming approach to automate the DSE for Neural Networks. Therefore, automatic approaches for power, performance, and memory estimations are needed.

In this paper, we present a novel approach, enabling designers to fast and accurately estimate the power consumption of CNNs inferencing on GPGPUs with Dynamic Frequency Scaling(DFS) in the early stages of the design process. The proposed approach uses static analysis for feature extraction and Random Forest Tree regression analysis for predictive model generation. Experimental results demonstrate that our approach can predict the CNNs power consumption with a Mean Absolute Percentage Error(MAPE) of 5.03% compared to the actual hardware.

A Thermal Machine Learning Solver For Chip Simulation

Rishikesh Ranade
Haiyang He
Jay Pathak
Norman Chang
Akhilesh Kumar
Jimin Wen

Thermal analysis provides deeper insights into electronic chips’ behavior under different temperature scenarios and enables faster design exploration. However, obtaining detailed and accurate thermal profile on chip is very time-consuming using FEM or CFD. Therefore, there is an urgent need for speeding up the on-chip thermal solution to address various system scenarios. In this paper, we propose a thermal machine-learning (ML) solver to speed-up thermal simulations of chips. The thermal ML-Solver is an extension of the recent novel approach, CoAEMLSim (Composable Autoencoder Machine Learning Simulator) with modifications to the solution algorithm to handle constant and distributed HTC. The proposed method is validated against commercial solvers, such as Ansys MAPDL, as well as a latest ML baseline, UNet, under different scenarios to demonstrate its enhanced accuracy, scalability, and generalizability.

SESSION: Session 6: Performance Prediction with ML Models and Algorithms

Physically Accurate Learning-based Performance Prediction of Hardware-accelerated ML Algorithms

Hadi Esmaeilzadeh
Soroush Ghodrati
Andrew B. Kahng
Joon Kyung Kim
Sean Kinzer
Sayak Kundu
Rohan Mahapatra
Susmita Dey Manasi
Sachin S. Sapatnekar
Zhiang Wang
Ziqing Zeng

Parameterizable ML accelerators are the product of recent breakthroughs in machine learning (ML). To fully enable the design space exploration, we propose a physical-design-driven, learning-based prediction framework for hardware-accelerated deep neural network (DNN) and non-DNN ML algorithms. It employs a unified methodology, coupling backend power, performance and area (PPA) analysis with frontend performance simulation, thus achieving realistic estimation of both backend PPA and system metrics (runtime and energy). Experimental studies show that the approach provides excellent predictions for both ASIC (in a 12nm commercial process) and FPGA implementations on the VTA and VeriGOOD-ML platforms.

Graph Representation Learning for Gate Arrival Time Prediction

Pratik Shrestha
Saran Phatharodom
Ioannis Savidis

An accurate estimate of the timing profile at different stages of the physical design flow allows for pre-emptive changes to the circuit, significantly reducing the design time and effort. In this work, a graph based deep regression model is utilized to predict the gate level arrival time of the timing paths of a circuit. Three scenarios for post routing prediction are considered: prediction after completing floorplanning, prediction after completing placement, and prediction after completing clock tree synthesis (CTS). A commercial static timing analysis (STA) tool is utilized to determine the mean absolute percentage error (MAPE) and the mean absolute error (MAE) for each scenario. Results obtained across all models trained on the complete dataset indicate that the proposed methodology outperforms the baseline errors produced by the commercial physical design tools with an average improvement of 61.58 in the MAPE score when predicting the post-routing arrival time after completing floorplanning and 13.53 improvement when predicting the post-routing arrival time after completing placement. Additional prediction scenarios are analyzed, where the complete dataset is further sub-divided based on the size of the circuits, which leads to an average improvement of 34.83 in the MAPE score as compared to the commercial tool for post-floorplanning prediction of the post-routing arrival time and 22.71 improvement for post-placement prediction of the post-routing arrival time.

A Tale of EDA’s Long Tail: Long-Tailed Distribution Learning for Electronic Design Automation

Zixuan Jiang
Mingjie Liu
Zizheng Guo
Shuhan Zhang
Yibo Lin
David Pan

Long-tailed distribution is a common and critical issue in the field of machine learning. While prior work addressed data imbalance in several tasks in electronic design automation (EDA), insufficient attention has been paid to the long-tailed distribution in real-world EDA problems. In this paper, we argue that conventional performance metrics can be misleading, especially in EDA contexts. Through two public EDA problems using convolutional neural networks and graph neural networks, we demonstrate that simple yet effective model-agnostic methods can alleviate the issue induced by long-tailed distribution when applying machine learning algorithms in EDA.

SESSION: Plenary II

Industrial Experience with Open-Source EDA Tools

Christian Lück
Daniela Sánchez Lopera
Sven Wenzek
Wolfgang Ecker

Commonly, the design flow of integrated circuits from initial specifications to fabrication employs commercial, proprietary EDA tools. While these tools deliver high-quality, production-ready results, they can be seen as expensive black boxes and thus, are not suited for research and academic purposes. Innovations on the field are mostly focused on optimizing the quality of the results of the designs by modifying core elements of the tool chain or using techniques of the Machine Learning domain. In both cases, researchers require many or long runs of EDA tools for comparing results or generating training data for Machine Learning models. Using proprietary, commercial tools in those cases may be either not affordable or not possible at all.

With OpenROAD and OpenLane mature open-source alternatives emerged in the past couple of years. The development is driven by a growing community that is improving and extending the tools daily. In contrast to commercial tools, OpenROAD and OpenLane are transparent and allow inspection, modification and replacement of every tool aspect. They are also free and therefore are well suited for use cases such as Machine Learning data generation. Specifically, the fact that no licenses are needed neither for the tools nor for the default PDK enables even fresh students and starters on the field to quickly deploy their ideas and create initial proof of concepts.

Therefore, we at Infineon are using OpenROAD and OpenLane for more experimental and innovative projects. Our vision is to build initial prototypes using free software, and then improve upon them by cross-checking and polishing with commercial tools before delivering them for production. This talk will show Infineon’s experience with these open-source tools so far.

The first steps involved getting OpenLane installed in our company IT infrastructure. While their developers offer convenient build methods using Docker containers, these cannot be used in Infineon’s compute farm. This, and also the fact that most of the open-source tools are currently evolving quickly with little to no versioning, lead to the setup of an in-house continuous integration and continuous delivery system for nightly and weekly builds of the tools. Once the necessary tools were installed and running, effort was put into integrating Infineon’s in-house technology data.

At Infineon, we envision two use cases for OpenROAD/OpenLane: physical synthesis hyperparameter exploration (and tuning) and optimization of the complete flow starting from RTL. First, our goal is to use OpenROAD’s AutoTuner in the path-finding phase to automatically and cost-effectively find optimal parameters for the flow and then build upon these results within a commercial tool for the later steps near the tapeout. Second, we want to include not only the synthesis flow inside the optimization loop of the AutoTuner, but also our in-house RTL generation framework (MetaRTL). For instance, having RTL generators for a RISC-V CPU and also the results of simulated runtime benchmarks for each iteration, the AutoTuner should be able to change fundamental aspects (for example number of pipeline stages) of the RTL to reach certain power, performance, and area requirements when running the benchmark code on the CPU.

Overall, we see OpenROAD/OpenLane as a viable alternative to commercial tools, especially for research and academic use, where modifications to the tools are needed and where very long and otherwise costly tool runtimes are expected.

SESSION: Session 7: ML Models for Analog Design and Optimization

Invertible Neural Networks for Design of Broadband Active Mixers

Oluwaseyi Akinwande
Osama Waqar Bhatti
Xingchen Li
Madhavan Swaminathan

In this work, we present the invertible neural network for predicting the posterior distributions of the design space of broadband active mixers with RF from 100 MHz to 10 GHz. This invertible method gives a fast and accurate model when investigating crucial properties of active mixers such as conversion gain and noise figure. Our results show that the response generated by the invertible neural network model has close correlation with the output response from the circuit simulator.

High Dimensional Optimization for Electronic Design

Yuejiang Wen
Jacob Dean
Brian A. Floyd
Paul D. Franzon

Bayesian optimization (BO) samples points of interest to update a surrogate model for a blackbox function. This makes it a powerful technique to optimize electronic designs which have unknown objective functions and demand high computational cost of simulation. Unfortunately, Bayesian optimization suffers from scalability issues, e.g., it can perform well in problems up to 20 dimensions. This paper addresses the curse of dimensionality and proposes an algorithm entitled Inspection-based Combo Random Embedding Bayesian Optimization (IC-REMBO). IC-REMBO improves the effectiveness and efficiency of the Random EMbedding Bayesian Optimization (REMBO) approach, which is a state-of-the-art high dimensional optimization method. Generally, it inspects the space near local optima to explore more points near local optima, so that it mitigates the over-exploration on boundaries and embedding distortion in REMBO. Consequently, it helps escape from local optima and provides a family of feasible solutions when inspecting near global optimum within a limited number of iterations.

The effectiveness and efficiency of the proposed algorithm are compared with the state-of-the-art REMBO when optimizing a mmWave receiver with 38 calibration parameters to meet 4 objectives. The optimization results are close to that of a human expert. To the best of our knowledge, this is the first time applying REMBO or inspection method to electronic design.

Transfer of Performance Models Across Analog Circuit Topologies with Graph Neural Networks

Zhengfeng Wu
Ioannis Savidis

In this work, graph neural networks (GNNs) and transfer learning are leveraged to transfer device sizing knowledge learned from data of related analog circuit topologies to predict the performance of a new topology. A graph is generated from the netlist of a circuit, with nodes representing the devices and edges the connections between devices. To allow for the simultaneous training of GNNs on data of multiple topologies, graph isomorphism networks are adopted to address the limitation of graph convolutional networks in distinguishing between different graph structures. The techniques are applied to transfer predictions of performance across four op-amp topologies in a 65 nm technology, with 10000 sets of sizing and performance evaluations sampled for each circuit. Two scenarios, zero-shot learning and few-shot learning, are considered based on the availability of data in the target domain. Results from the analysis indicate that zero-shot learning with GNNs trained on all the data of the three related topologies is effective for coarse estimates of the performance of the fourth unseen circuit without requiring any data from the fourth circuit. Few-shot learning by fine-tuning the GNNs with a small dataset of 100 points from the target topology after pre-training on data from the other three topologies further boosts the model performance. The fine-tuned GNNs outperform the baseline artificial neural networks (ANNs) trained on the same dataset of 100 points from the target topology with an average reduction in the root-mean-square error of 70.6%. Applying the proposed techniques, specifically GNNs and transfer learning, improves the sample efficiency of the performance models of the analog ICs through the transfer of predictions across related circuit topologies.

RxGAN: Modeling High-Speed Receiver through Generative Adversarial Networks

Priyank Kashyap
Archit Gajjar
Yongjin Choi
Chau-Wai Wong
Dror Baron
Tianfu Wu
Chris Cheng
Paul Franzon

Creating models for modern high-speed receivers using circuit-level simulations is costly, as it requires computationally expensive simulations and upwards of months to finalize a model. Added to this is that many models do not necessarily agree with the final hardware they are supposed to emulate. Further, these models are complex due to the presence of various filters, such as a decision feedback equalizer (DFE) and continuous-time linear equalizer (CTLE), which enable the correct operation of the receiver. Other data-driven approaches tackle receiver modeling through multiple models to account for as many configurations as possible. This work proposes a data-driven approach using generative adversarial training to model a real-world receiver with varying DFE and CTLE configurations while handling different channel conditions and bitstreams. The approach is highly accurate as the eye height and width are within 1.59% and 1.12% of the ground truth. The horizontal and vertical bathtub curves match the ground truth and correlate to the ground truth bathtub curves.

Tsung-Wei Huang

2 September 2022

Yibo Lin

No comments

Categories: Who's Who

Sep 1st, 2022

Tsung-Wei Huang

Associate Professor

University of Wisconsin-Madison

Email:
tsung-wei.huang@wisc.edu

Personal webpage
https://tsung-wei-huang.github.io/

Research interests

Design automation and high-performance computing.

Short bio

Dr. Tsung-Wei Huang received his B.S. and M.S. degrees from the Department of Computer Science at Taiwan’s National Cheng-Kung University in 2010 and 2011, respectively. He then received his Ph.D. degree from the Department of Electrical and Computer Engineering (ECE) at the University of Illinois at Urbana-Champaign (UIUC) in 2017. He has been researching on high-performance computing systems with application focus on design automation algorithms and machine learning kernels. He has created several open-source software, such as Taskflow and OpenTimer, that are being used by many people. Dr. Huang receives several awards for his research contributions, including ACM SIGDA Outstanding PhD Dissertation Award in 2019, NSF CAREER Award in 2022, Humboldt Research Fellowship Award in 2022. He also received the 2022 ACM SIGDA Service Award for recognizing his community service that engaged students in design automation research.

Research highlights

(1) Parallel Programming Environment: Modern scientific computing relies on a heterogeneous mix of computational patterns, domain algorithms, and specialized hardware to achieve key scientific milestones that go beyond traditional capabilities. However, programming these applications often requires complex expert-level tools and a deep understanding of parallel decomposition methodologies. Our research investigates new programming environments to assist researchers and developers to tackle the implementation complexities of high-performance parallel and heterogeneous programs.

(2) Electronic Design Automation (EDA): The ever-increasing design complexity in VLSI implementation has far exceeded what many existing EDA tools can scale with reasonable design time and effort. A key fundamental challenge is that EDA must incorporate new parallel paradigms comprising manycore CPUs and GPUs to achieve transformational performance and productivity milestones. Our research investigates new computing methods to advance the current state-of-the-art by assisting everyone to efficiently tackle the challenges of designing, implementing, and deploying parallel EDA algorithms on heterogeneous nodes.

(3) Machine Learning Systems: Machine learning has become centric to a wide range of today’s applications, such as recommendation systems and natural language processing. Due to the unique performance characteristics, GPUs are increasingly used for machine learning applications and can dramatically accelerate neural network training and inference. Modern GPUs are fast and are equipped with new programming models and scheduling runtimes that can bring significant yet largely untapped performance benefits to many machine learning applications. Our research investigates novel parallel algorithms and frameworks to accelerate machine learning system kernels with order-of-magnitude performance breakthrough.

2022 ACM/IEEE Design Automation Conference (DAC) Table of Content

26 August 2022

Yibo Lin

No comments

Categories: Publications

Full Citation in the ACM Digital Library

QuantumNAT: quantum noise-aware training with noise injection, quantization and normalization

Hanrui Wang
Jiaqi Gu
Yongshan Ding
Zirui Li
Frederic T. Chong
David Z. Pan
Song Han

Parameterized Quantum Circuits (PQC) are promising towards quantum advantage on near-term quantum hardware. However, due to the large quantum noises (errors), the performance of PQC models has a severe degradation on real quantum devices. Take Quantum Neural Network (QNN) as an example, the accuracy gap between noise-free simulation and noisy results on IBMQ-Yorktown for MNIST-4 classification is over 60%. Existing noise mitigation methods are general ones without leveraging unique characteristics of PQC; on the other hand, existing PQC work does not consider noise effect. To this end, we present QuantumNAT, a PQC-specific framework to perform noise-aware optimizations in both training and inference stages to improve robustness. We experimentally observe that the effect of quantum noise to PQC measurement outcome is a linear map from noise-free outcome with a scaling and a shift factor. Motivated by that, we propose post-measurement normalization to mitigate the feature distribution differences between noise-free and noisy scenarios. Furthermore, to improve the robustness against noise, we propose noise injection to the training process by inserting quantum error gates to PQC according to realistic noise models of quantum hardware. Finally, post-measurement quantization is introduced to quantize the measurement outcomes to discrete values, achieving the denoising effect. Extensive experiments on 8 classification tasks using 6 quantum devices demonstrate that QuantumNAT improves accuracy by up to 43%, and achieves over 94% 2-class, 80% 4-class, and 34% 10-class classification accuracy measured on real quantum computers. The code for construction and noise-aware training of PQC is available in the TorchQuantum library.

Optimizing quantum circuit synthesis for permutations using recursion

Cynthia Chen
Bruno Schmitt
Helena Zhang
Lev S. Bishop
Ali Javadi-Abhar

We describe a family of recursive methods for the synthesis of qubit permutations on quantum computers with limited qubit connectivity. Two objectives are of importance: circuit size and depth. In each case we combine a scalable heuristic with a non-scalable, yet exact, synthesis.

A fast and scalable qubit-mapping method for noisy intermediate-scale quantum computers

Sunghye Park
Daeyeon Kim
Minhyuk Kweon
Jae-Yoon Sim
Seokhyeong Kang

This paper presents an efficient qubit-mapping method that redesigns a quantum circuit to overcome the limitations of qubit connectivity. We propose a recursive graph-isomorphism search to generate the scalable initial mapping. In the main mapping, we use an adaptive look-ahead window search to resolve the connectivity constraint within a short runtime. Compared with the state-of-the-art method [15], our proposed method reduced the number of additional gates by 23% on average and the runtime by 68% for the three largest benchmark circuits. Furthermore, our method improved circuit stability by reducing the circuit depth and thus can be a step forward towards fault tolerance.

Optimizing quantum circuit placement via machine learning

Hongxiang Fan
Ce Guo
Wayne Luk

Quantum circuit placement (QCP) is the process of mapping the synthesized logical quantum programs on physical quantum machines, which introduces additional SWAP gates and affects the performance of quantum circuits. Nevertheless, determining the minimal number of SWAP gates has been demonstrated to be an NP-complete problem. Various heuristic approaches have been proposed to address QCP, but they suffer from suboptimality due to the lack of exploration. Although exact approaches can achieve higher optimality, they are not scalable for large quantum circuits due to the massive design space and expensive runtime. By formulating QCP as a bilevel optimization problem, this paper proposes a novel machine learning (ML)-based framework to tackle this challenge. To address the lower-level combinatorial optimization problem, we adopt a policy-based deep reinforcement learning (DRL) algorithm with knowledge transfer to enable the generalization ability of our framework. An evolutionary algorithm is then deployed to solve the upper-level discrete search problem, which optimizes the initial mapping with a lower SWAP cost. The proposed ML-based approach provides a new paradigm to overcome the drawbacks in both traditional heuristic and exact approaches while enabling the exploration of optimality-runtime trade-off. Compared with the leading heuristic approaches, our ML-based method significantly reduces the SWAP cost by up to 100%. In comparison with the leading exact search, our proposed algorithm achieves the same level of optimality while reducing the runtime cost by up to 40 times.

HERO: hessian-enhanced robust optimization for unifying and improving generalization and quantization performance

Huanrui Yang
Xiaoxuan Yang
Neil Zhenqiang Gong
Yiran Chen

With the recent demand of deploying neural network models on mobile and edge devices, it is desired to improve the model’s generalizability on unseen testing data, as well as enhance the model’s robustness under fixed-point quantization for efficient deployment. Minimizing the training loss, however, provides few guarantees on the generalization and quantization performance. In this work, we fulfill the need of improving generalization and quantization performance simultaneously by theoretically unifying them under the framework of improving the model’s robustness against bounded weight perturbation and minimizing the eigenvalues of the Hessian matrix with respect to model weights. We therefore propose HERO, a Hessian-enhanced robust optimization method, to minimize the Hessian eigenvalues through a gradient-based training process, simultaneously improving the generalization and quantization performance. HERO enables up to a 3.8% gain on test accuracy, up to 30% higher accuracy under 80% training label perturbation, and the best post-training quantization accuracy across a wide range of precision, including a > 10% accuracy improvement over SGD-trained models for common model architectures on various datasets.

Neural computation for robust and holographic face detection

Mohsen Imani
Ali Zakeri
Hanning Chen
TaeHyun Kim
Prathyush Poduval
Hyunsei Lee
Yeseong Kim
Elaheh Sadredini
Farhad Imani

Face detection is an essential component of many tasks in computer vision with several applications. However, existing deep learning solutions are significantly slow and inefficient to enable face detection on embedded platforms. In this paper, we propose HDFace, a novel framework for highly efficient and robust face detection. HDFace exploits HyperDimensional Computing (HDC) as a neurally-inspired computational paradigm that mimics important brain functionalities towards high-efficiency and noise-tolerant computation. We first develop a novel technique that enables HDC to perform stochastic arithmetic computations over binary hypervectors. Next, we expand these arithmetic for efficient and robust processing of feature extraction algorithms in hyperspace. Finally, we develop an adaptive hyperdimensional classification algorithm for effective and robust face detection. We evaluate the effectiveness of HDFace on large-scale emotion detection and face detection applications. Our results indicate that HDFace provides, on average, 6.1X (4.6X) speedup and 3.0X (12.1X) energy efficiency as compared to neural networks running on CPU (FPGA), respectively.

FHDnn: communication efficient and robust federated learning for AIoT networks

Rishikanth Chandrasekaran
Kazim Ergun
Jihyun Lee
Dhanush Nanjunda
Jaeyoung Kang
Tajana Rosing

The advent of IoT and advances in edge computing inspired federated learning, a distributed algorithm to enable on device learning. Transmission costs, unreliable networks and limited compute power all of which are typical characteristics of IoT networks pose a severe bottleneck for federated learning. In this work we propose FHDnn, a synergetic federated learning framework that combines the salient aspects of CNNs and Hyperdimensional Computing. FHDnn performs hyperdimensional learning on features extracted from a self-supervised contrastive learning framework to accelerate training, lower communication costs, and increase robustness to network errors by avoiding the transmission of the CNN and training only the hyperdimensional component. Compared to CNNs, we show through experiments that FHDnn reduces communication costs by 66X, local client compute and energy consumption by 1.5 – 6X, while being highly robust to network errors with minimal loss in accuracy.

ODHD: one-class brain-inspired hyperdimensional computing for outlier detection

Ruixuan Wang
Xun Jiao
X. Sharon Hu

Outlier detection is a classical and important technique that has been used in different application domains such as medical diagnosis and Internet-of-Things. Recently, machine learning-based outlier detection algorithms, such as one-class support vector machine (OCSVM), isolation forest and autoencoder, have demonstrated promising results in outlier detection. In this paper, we take a radical departure from these classical learning methods and propose ODHD, an outlier detection method based on hyperdimensional computing (HDC). In ODHD, the outlier detection process is based on a P-U learning structure, in which we train a one-class HV based on inlier samples. This HV represents the abstraction information of all inlier samples; hence, any (testing) sample whose corresponding HV is dissimilar from this HV will be considered as an outlier. We perform an extensive evaluation using six datasets across different application domains and compare ODHD with multiple baseline methods including OCSVM, isolation forest, and autoencoder using three metrics including accuracy, F1 score and ROC-AUC. Experimental results show that ODHD outperforms all the baseline methods on every dataset for every metric. Moreover, we perform a design space exploration for ODHD to illustrate the tradeoff between performance and efficiency. The promising results presented in this paper provide a viable option and alternative to traditional learning algorithms for outlier detection.

High-level synthesis performance prediction using GNNs: benchmarking, modeling, and advancing

Nan Wu
Hang Yang
Yuan Xie
Pan Li
Cong Hao

Agile hardware development requires fast and accurate circuit quality evaluation from early design stages. Existing work of high-level synthesis (HLS) performance prediction usually requires extensive feature engineering after the synthesis process. To expedite circuit evaluation from as early design stage as possible, we propose rapid and accurate performance prediction methods, which exploit the representation power of graph neural networks (GNNs) by representing C/C++ programs as graphs. The contribution of this work is three-fold. (1) Benchmarking. We build a standard benchmark suite with 40k C programs, which includes synthetic programs and three sets of real-world HLS benchmarks. Each program is synthesized and implemented on FPGA to obtain post place-and-route performance metrics as the ground truth. (2) Modeling. We formally formulate the HLS performance prediction problem on graphs and propose multiple modeling strategies with GNNs that leverage different trade-offs between prediction timeliness (early/late prediction) and accuracy. (3) Advancing. We further propose a novel hierarchical GNN that does not sacrifice timeliness but largely improves prediction accuracy, significantly outperforming HLS tools. We apply extensive evaluations for both synthetic and unseen real-case programs; our proposed predictor largely outperforms HLS by up to 40X and excels existing predictors by 2X to 5X in terms of resource usage and timing prediction. The benchmark and explored GNN models are publicly available at https://github.com/lydiawunan/HLS-Perf-Prediction-with-GNNs.

Automated accelerator optimization aided by graph neural networks

Atefeh Sohrabizadeh
Yunsheng Bai
Yizhou Sun
Jason Cong

Using High-Level Synthesis (HLS), the hardware designers must describe only a high-level behavioral flow of the design. However, it still can take weeks to develop a high-performance architecture mainly because there are many design choices at a higher level to explore. Besides, it takes several minutes to hours to evaluate the design with the HLS tool. To solve this problem, we model the HLS tool with a graph neural network that is trained to be used for a wide range of applications. The experimental results demonstrate that our model can estimate the quality of design in milliseconds with high accuracy, resulting in up to 79X speedup (with an average of 48X) for optimizing the design compared to the previous state-of-the-art work relying on the HLS tool.

Functionality matters in netlist representation learning

Ziyi Wang
Chen Bai
Zhuolun He
Guangliang Zhang
Qiang Xu
Tsung-Yi Ho
Bei Yu
Yu Huang

Learning feasible representation from raw gate-level netlists is essential for incorporating machine learning techniques in logic synthesis, physical design, or verification. Existing message-passing-based graph learning methodologies focus merely on graph topology while overlooking gate functionality, which often fails to capture underlying semantic, thus limiting their generalizability. To address the concern, we propose a novel netlist representation learning framework that utilizes a contrastive scheme to acquire generic functional knowledge from netlists effectively. We also propose a customized graph neural network (GNN) architecture that learns a set of independent aggregators to better cooperate with the above framework. Comprehensive experiments on multiple complex real-world designs demonstrate that our proposed solution significantly outperforms state-of-the-art netlist feature learning flows.

EMS: efficient memory subsystem synthesis for spatial accelerators

Liancheng Jia
Yuyue Wang
Jingwen Leng
Yun Liang

Spatial accelerators provide massive parallelism with an array of homogeneous PEs, and enable efficient data reuse with PE array dataflow and on-chip memory. Many previous works have studied the dataflow architecture of spatial accelerators, including performance analysis and automatic generation. However, existing accelerator generators fail to exploit the entire memory-level reuse opportunities, and generate suboptimal designs with data duplication and inefficient interconnection.

In this paper, we propose EMS, an efficient memory subsystem synthesis and optimization framework for spatial accelerators. We first use space-time transformation (STT) to analyze both PE-level and memory-level data reuse. Based on the reuse analysis, we develop an algorithm to automatically generate data layout of the multi-banked scratchpad memory, data mapping, and access controller for the memory. Our generated memory subsystem supports multiple PE-memory interconnection topologies including direct, multicast, and rotated connection. The memory and interconnection generation approach can efficiently utilize the memory-level reuse to avoid duplicated data storage with low hardware cost. EMS can automatically synthesize tensor algebra to hardware designed in Chisel. Experiments show that our proposed memory generator reduces the on-chip memory size by an average of 28% than the state-of-the-art, and achieves comparable hardware performance.

DA PUF: dual-state analog PUF

Jiliang Zhang
Lin Ding
Zhuojun Chen
Wenshang Li
Gang Qu

Physical unclonable function (PUF) is a promising lightweight hardware security primitive that exploits process variations during chip fabrication for applications such as key generation and device authentication. Reliability of the PUF information plays a vital role and poses a major challenge for PUF design. In this paper, we propose a novel dual-state analog PUF (DA PUF) which has been successfully fabricated in 55nm process. The 40,960 bits generated by the fabricated DA PUF pass the NIST randomness test with reliability over 99.99% for working environment of -40 ~ 125° C (temperature) and 0.96 ~ 1.44V (voltage), outperforming the two state-of-the-art analog PUFs reported in JSSC 2016 and 2021.

PathFinder: side channel protection through automatic leaky paths identification and obfuscation

Haocheng Ma
Qizhi Zhang
Ya Gao
Jiaji He
Yiqiang Zhao
Yier Jin

Side-channel analysis (SCA) attacks show an enormous threat to cryptographic integrated circuits (ICs). To address this threat, designers try to adopt various countermeasures during the IC development process. However, many existing solutions are costly in terms of area, power and/or performance, and may require full-custom circuit design for proper implementations. In this paper, we propose a tool, namely PathFinder, to automatically identify leaky paths and protect the design, and is compatible with the commercial design flow. The tool first finds out partial logic cells that leak the most information through dynamic correlation analysis. PathFinder then exploits static security checking to construct complete leaky paths based on these cells. After leaky paths are identified, PathFinder will leverage proper hardware countermeasures, including Boolean masking and random precharge, to eliminate information leakage from these paths. The effectiveness of PathFinder is validated both through simulation and physical measurements on FPGA implementations. Results demonstrate more than 1000X improvements on side-channel resistance, with less than 6.53% penalty to the power, area and performance.

LOCK&ROLL: deep-learning power side-channel attack mitigation using emerging reconfigurable devices and logic locking

Gaurav Kolhe
Tyler Sheaves
Kevin Immanuel Gubbi
Soheil Salehi
Setareh Rafatirad
Sai Manoj PD
Avesta Sasan
Houman Homayoun

The security and trustworthiness of ICs are exacerbated by the modern globalized semiconductor business model. This model involves many steps performed at multiple locations by different providers and integrates various Intellectual Properties (IPs) from several vendors for faster time-to-market and cheaper fabrication costs. Many existing works have focused on mitigating the well-known SAT attack and its derivatives. Power Side-Channel Attacks (PSCAs) can retrieve the sensitive contents of the IP and can be leveraged to find the key to unlock the obfuscated circuit without simulating powerful SAT attacks. To mitigate P-SCA and SAT-attack together, we propose a multi-layer defense mechanism called LOCK&ROLL: Deep-Learning Power Side-Channel Attack Mitigation using Emerging Reconfigurable Devices and Logic Locking. LOCK&ROLL utilizes our proposed Magnetic Random-Access Memory (MRAM)-based Look Up Table called Symmetrical MRAM-LUT (SyM-LUT). Our simulation results using 45nm technology demonstrate that the SyM-LUT incurs a small overhead compared to traditional Static Random Access Memory LUT (SRAM-LUT). Additionally, SyM-LUT has a standby energy consumption of 20aJ while consuming 33fJ and 4.6fJ for write and read operations, respectively. LOCK&ROLL is resilient against various attacks such as SAT-attacks, removal attack, scan and shift attacks, and P-SCA.

Efficient access scheme for multi-bank based NTT architecture through conflict graph

Xiangren Chen
Bohan Yang
Yong Lu
Shouyi Yin
Shaojun Wei
Leibo Liu

Number Theoretical Transform (NTT) hardware accelerator becomes crucial building block in many cryptosystems like post-quantum cryptography. In this paper, we provide new insights into the construction of conflict-free memory mapping scheme (CFMMS) for multi-bank NTT architecture. Firstly, we offer parallel loop structure of arbitrary-radix NTT and propose two point-fetching modes. Afterwards, we transform the conflict-free mapping problem into conflict graph and develop novel heuristic to explore the design space of CFMMS, which turns out more efficient access scheme than classic works. To further verify the methodology, we design high-performance NTT/INTT kernels for Dilithium, whose area-time efficiency significantly outperforms state-of-the-art works on the similar FPGA platform.

InfoX: an energy-efficient ReRAM accelerator design with information-lossless low-bit ADCs

Yintao He
Songyun Qu
Ying Wang
Bing Li
Huawei Li
Xiaowei Li

ReRAM-based accelerators have shown great potential in neural network acceleration via in-memory analog computing. However, high-precision analog-to-digital converters (ADCs), which are required by the ReRAM crossbars to achieve high-accuracy network model inference, play an essential role in the energy-efficiency of the accelerators. Based on the discovery that the ADC precision requirements of crossbars are different, we propose the model-aware crossbarwise ADC precision assignment and the accompanied information-lossless low-bit ADCs to reduce energy overhead without sacrificing model accuracy. In experiments, the proposed information-lossless ReRAM accelerator, InfoX, only consumes 8.97% ADC energy of the SOTA baseline with no accuracy degradation at all.

PHANES: ReRAM-based photonic accelerator for deep neural networks

Yinyi Liu
Jiaqi Liu
Yuxiang Fu
Shixi Chen
Jiaxu Zhang
Jiang Xu

Resistive random access memory (ReRAM) has demonstrated great promises of in-situ matrix-vector multiplications to accelerate deep neural networks. However, subject to the intrinsic properties of analog processing, most of the proposed ReRAM-based accelerators require excessive costly ADC/DAC to avoid distortion of electronic analog signals during inter-tile transmission. Moreover, due to bit-shifting before addition, prior works require longer cycles to serially calculate partial sum compared to multiplications, which dramatically restricts the throughput and is more likely to stall the pipeline between layers of deep neural networks.

In this paper, we present a novel ReRAM-based photonic accelerator (PHANES) architecture, which calculates multiplications in ReRAM and parallel weighted accumulations during optical transmission. Such photonic paradigm also serves as high-fidelity analog-analog links to further reduce ADC/DAC. To circumvent the memory wall problem, we further propose a progressive bit-depth technique. Evaluations show that PHANES improves the energy efficiency by 6.09x and throughput density by 14.7x compared to state-of-the-art designs. Our photonic architecture also has great potentials for scalability towards very-large-scale accelerators.

CP-SRAM: charge-pulsation SRAM marco for ultra-high energy-efficiency computing-in-memory

He Zhang
Linjun Jiang
Jianxin Wu
Tingran Chen
Junzhan Liu
Wang Kang
Weisheng Zhao

SRAM-based computing-in-memory (SRAM-CIM) provides fast speed and good scalability with advanced process technology. However, the energy efficiency of the state-of-the-art current-domain SRAM-CIM bit-cell structure is limited and the peripheral circuitry (e.g., DAC/ADC) for high-precision is expensive. This paper proposes a charge-pulsation SRAM (CP-SRAM) structure to achieve ultra-high energy-efficiency thanks to its charge-domain mechanism. Furthermore, our proposed CP-SRAM CIM supports configurable precision (2/4/6-bit). The CP-SRAM CIM macro was designed in 180nm (with silicon verification) and 40nm (simulation) nodes. The simulation results in 40nm show that our macro can achieve energy efficiency of ~2950Tops/W at 2-bit precision, ~576.4 Tops/W at 4-bit precision and ~111.7 Tops/W at 6-bit precision, respectively.

CREAM: computing in ReRAM-assisted energy and area-efficient SRAM for neural network acceleration

Liukai Xu
Songyuan Liu
Zhi Li
Dengfeng Wang
Yiming Chen
Yanan Sun
Xueqing Li
Weifeng He
Shi Xu

Computing-in-memory has been widely explored to accelerate DNN. However, most existing CIM cannot store all NN weights due to limited SRAM capacity for edge AI devices, inducing a large amount off-chip DRAM access. In this paper, a new computing in ReRAM-assisted energy and area-efficient SRAM (CREAM) is proposed for implementing large-scale NNs while eliminating off-chip DRAM access. The weights of DNN are all stored in the high-dense on-chip ReRAM devices and restored to the proposed nvSRAM-CIM cells with array-level parallelism. A data-aware weight-mapping method is also proposed to enhance the CIM performance while fully exploiting the hardware utilization. Experiment results show that the proposed CREAM scheme enhances the storage density by up to 7.94x compared to the traditional SRAM arrays. The energy-efficiency of proposed CREAM is also enhanced by 2.14x and 1.99x, compared to the traditional SRAM-CIM with off-chip DRAM access and ReRAM-CIM circuits, respectively.

Chiplet actuary: a quantitative cost model and multi-chiplet architecture exploration

Yinxiao Feng
Kaisheng Ma

Multi-chip integration is widely recognized as the extension of Moore’s Law. Cost-saving is a frequently mentioned advantage, but previous works rarely present quantitative demonstrations on the cost superiority of multi-chip integration over monolithic SoC. In this paper, we build a quantitative cost model and put forward an analytical method for multi-chip systems based on three typical multi-chip integration technologies to analyze the cost benefits from yield improvement, chiplet and package reuse, and heterogeneity. We re-examine the actual cost of multi-chip systems from various perspectives and show how to reduce the total cost of the VLSI system through appropriate multi-chiplet architecture.

PANORAMA: divide-and-conquer approach for mapping complex loop kernels on CGRA

Dhananjaya Wijerathne
Zhaoying Li
Thilini Kaushalya Bandara
Tulika Mitra

CGRAs are well-suited as hardware accelerators due to power efficiency and reconfigurability. However, their potential is limited by the inability of the compiler to map complex loop kernels onto the architectures effectively. We propose PANORAMA, a fast and scalable compiler based on a divide-and-conquer approach to generate quality mapping for complex dataflow graphs (DFG) representing loop bodies onto larger CGRAs. PANORAMA improves the throughput of the mapped loops by up to 2.6x with 8.7x faster compilation time compared to the state-of-the-art techniques.

A fast parameter tuning framework via transfer learning and multi-objective bayesian optimization

Zheng Zhang
Tinghuan Chen
Jiaxin Huang
Meng Zhang

Design space exploration (DSE) can automatically and effectively determine design parameters to achieve the optimal performance, power and area (PPA) in very large-scale integration (VLSI) design. The lack of prior knowledge causes low efficient exploration. In this paper, a fast parameter tuning framework via transfer learning and multi-objective Bayesian optimization is proposed to quickly find the optimal design parameters. Gaussian Copula is utilized to establish the correlation of the implemented technology. The prior knowledge is integrated into multi-objective Bayesian optimization through transforming the PPA data to residual observation. The uncertainty-aware search acquisition function is employed to explore design space efficiently. Experiments on a CPU design show that this framework can achieve a higher quality of Pareto frontier with less design flow running than state-of-the-art methodologies.

PriMax: maximizing DSL application performance with selective primitive acceleration

Nicholas Wendt
Todd Austin
Valeria Bertacco

Domain-specific languages (DSLs) improve developer productivity by abstracting away low-level details of an algorithm’s implementation within a specialized domain. These languages often provide powerful primitives to describe complex operations, potentially granting flexibility during compilation to target hardware acceleration. This work proposes PriMax, a novel methodology to effectively map DSL applications to hardware accelerators. It builds decision trees based on benchmark results, which select between distinct implementations of accelerated primitives to maximize a target performance metric. In our graph analytics case study with two accelerators, PriMax produces a geometric mean speedup of 1.57x over a multicore CPU, higher than either target accelerator alone, and approaching the maximum 1.58x speedup attainable with these target accelerators.

Accelerating and pruning CNNs for semantic segmentation on FPGA

Pierpaolo Morì
Manoj-Rohit Vemparala
Nael Fasfous
Saptarshi Mitra
Sreetama Sarkar
Alexander Frickenstein
Lukas Frickenstein
Domenik Helms
Naveen Shankar Nagaraja
Walter Stechele
Claudio Passerone

Semantic segmentation is one of the popular tasks in computer vision, providing pixel-wise annotations for scene understanding. However, segmentation-based convolutional neural networks require tremendous computational power. In this work, a fully-pipelined hardware accelerator with support for dilated convolution is introduced, which cuts down the redundant zero multiplications. Furthermore, we propose a genetic algorithm based automated channel pruning technique to jointly optimize computational complexity and model accuracy. Finally, hardware heuristics and an accurate model of the custom accelerator design enable a hardware-aware pruning framework. We achieve 2.44X lower latency with minimal degradation in semantic prediction quality (−1.98 pp lower mean intersection over union) compared to the baseline DeepLabV3+ model, evaluated on an Arria-10 FPGA. The binary files of the FPGA design, baseline and pruned models can be found in github.com/pierpaolomori/SemanticSegmentationFPGA

SoftSNN: low-cost fault tolerance for spiking neural network accelerators under soft errors

Rachmad Vidya Wicaksana Putra
Muhammad Abdullah Hanif
Muhammad Shafique

Specialized hardware accelerators have been designed and employed to maximize the performance efficiency of Spiking Neural Networks (SNNs). However, such accelerators are vulnerable to transient faults (i.e., soft errors), which occur due to high-energy particle strikes, and manifest as bit flips at the hardware layer. These errors can change the weight values and neuron operations in the compute engine of SNN accelerators, thereby leading to incorrect outputs and accuracy degradation. However, the impact of soft errors in the compute engine and the respective mitigation techniques have not been thoroughly studied yet for SNNs. A potential solution is employing redundant executions (re-execution) for ensuring correct outputs, but it leads to huge latency and energy overheads. Toward this, we propose SoftSNN, a novel methodology to mitigate soft errors in the weight registers (synapses) and neurons of SNN accelerators without re-execution, thereby maintaining the accuracy with low latency and energy overheads. Our SoftSNN methodology employs the following key steps: (1) analyzing the SNN characteristics under soft errors to identify faulty weights and neuron operations, which are required for recognizing faulty SNN behavior; (2) a Bound-and-Protect technique that leverages this analysis to improve the SNN fault tolerance by bounding the weight values and protecting the neurons from faulty operations; and (3) devising lightweight hardware enhancements for the neural hardware accelerator to efficiently support the proposed technique. The experimental results show that, for a 900-neuron network with even a high fault rate, our SoftSNN maintains the accuracy degradation below 3%, while reducing latency and energy by up to 3x and 2.3x respectively, as compared to the re-execution technique.

A joint management middleware to improve training performance of deep recommendation systems with SSDs

Chun-Feng Wu
Carole-Jean Wu
Gu-Yeon Wei
David Brooks

As the sizes and variety of training data scale over time, data preprocessing is becoming an important performance bottleneck for training deep recommendation systems. This challenge becomes more serious when training data is stored in Solid-State Drives (SSDs). Due to the access behavior gap between recommendation systems and SSDs, unused training data may be read and filtered out during preprocessing. This work advocates a joint management middleware to avoid reading unused data by bridging the access behavior gap. The evaluation results show that our middleware can effectively improve the performance of the data preprocessing phase so as to boost training performance.

The larger the fairer?: small neural networks can achieve fairness for edge devices

Yi Sheng
Junhuan Yang
Yawen Wu
Kevin Mao
Yiyu Shi
Jingtong Hu
Weiwen Jiang
Lei Yang

Along with the progress of AI democratization, neural networks are being deployed more frequently in edge devices for a wide range of applications. Fairness concerns gradually emerge in many applications, such as face recognition and mobile medical. One fundamental question arises: what will be the fairest neural architecture for edge devices? By examining the existing neural networks, we observe that larger networks typically are fairer. But, edge devices call for smaller neural architectures to meet hardware specifications. To address this challenge, this work proposes a novel Fairness- and Hardware-aware Neural architecture search framework, namely FaHaNa. Coupled with a model freezing approach, FaHaNa can efficiently search for neural networks with balanced fairness and accuracy, while guaranteed to meet hardware specifications. Results show that FaHaNa can identify a series of neural networks with higher fairness and accuracy on a dermatology dataset. Target edge devices, FaHaNa finds a neural architecture with slightly higher accuracy, 5.28X smaller size, 15.14% higher fairness score, compared with MobileNetV2; meanwhile, on Raspberry PI and Odroid XU-4, it achieves 5.75X and 5.79X speedup.

SCAIE-V: an open-source SCAlable interface for ISA extensions for RISC-V processors

Mihaela Damian
Julian Oppermann
Christoph Spang
Andreas Koch

Custom instructions extending a base ISA are often used to increase performance. However, only few cores provide open interfaces for integrating such ISA Extensions (ISAX). In addition, the degree to which a core’s capabilities are exposed for extension varies wildly between interfaces. Thus, even when using open-source cores, the lack of standardized ISAX interfaces typically causes high engineering effort when implementing or porting ISAXes. We present SCAIE-V, a highly portable and feature-rich ISAX interface that supports custom control flow, decoupled execution, multi-cycle-instructions, and memory transactions. The cost of the interface itself scales with the complexity of the ISAXes actually used.

A scalable symbolic simulation tool for low power embedded systems

Subhash Sethumurugan
Shashank Hegde
Hari Cherupalli
John Sartori

Recent work has demonstrated the effectiveness of using symbolic simulation to perform hardware software co-analysis on an application-processor pair and developed a variety of hardware and software design techniques and optimizations, ranging from providing system security guarantees to automated generation of application-specific bespoke processors. Despite their potential benefits, current state-of-the-art symbolic simulation tools for hardware-software co-analysis are restricted in their applicability, since prior work relies on a costly process of building a custom simulation tool for each processor design to be simulated. Furthermore, prior work does not describe how to extend the symbolic analysis technique to other processor designs.

In an effort to generalize the technique for any processor design, we propose a custom symbolic simulator that uses iverilog to perform symbolic behavioral simulation. With iverilog – an open source synthesis and simulation tool – we implement a design-agnostic symbolic simulation tool for hardware-software co-analysis. To demonstrate the generality of our tool, we apply symbolic analysis to three embedded processors with different ISAs: bm32 (a MIPS-based processor), darkRiscV (a RISC-V-based processor), and openMSP430 (based on MSP430). We use analysis results to generate bespoke processors for each design and observe gate count reductions of 27%, 16%, and 56% on these processors, respectively. Our results demonstrate the versatility of our simulation tool and the uniqueness of each design with respect to symbolic analysis and the bespoke methodology.

Designing critical systems with iterative automated safety analysis

Ran Wei
Zhe Jiang
Xiaoran Guo
Haitao Mei
Athanasios Zolotas
Tim Kelly

Safety analysis is an important aspect in Safety-Critical Systems Engineering (SCSE) to discover design problems that can potentially lead to hazards and eventually, accidents. Performing safety analysis requires significant manual effort — its automation has become the research focus in the critical system domain due to the increasing complexity of systems and emergence of open adaptive systems. In this paper, we present a methodology, in which automated safety analysis drives the design of safety-critical systems. We discuss our approach with its tool support and evaluate its applicability. We briefly discuss how our approach fits into current practice of SCSE.

Efficient ensembles of graph neural networks

Amrit Nagarajan
Jacob R. Stevens
Anand Raghunathan

Ensembles improve the accuracy and robustness of Graph Neural Networks (GNNs), but suffer from high latency and storage requirements. To address this challenge, we propose GNN Ensembles through Error Node Isolation (GEENI). The key concept in GEENI is to identify nodes that are likely to be incorrectly classified (error nodes) and suppress their outgoing messages, leading to simultaneous accuracy and efficiency improvements. GEENI also enables aggressive approximations of the constituent models in the ensemble while maintaining accuracy. To improve the efficacy of GEENI, we propose techniques for diverse ensemble creation and accurate error node identification. Our experiments establish that GEENI models are simultaneously up to 4.6% (3.8%) more accurate and up to 2.8X (5.7X) faster compared to non-ensemble (conventional ensemble) GNN models.

Sign bit is enough: a learning synchronization framework for multi-hop all-reduce with ultimate compression

Feijie Wu
Shiqi He
Song Guo
Zhihao Qu
Haozhao Wang
Weihua Zhuang
Jie Zhang

Traditional one-bit compressed stochastic gradient descent can not be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. According to our theoretical findings, due to the cascading compression, the training process has considerable deterioration on the convergence performance. To overcome this limitation, we implement a sign-bit compression-based learning synchronization framework, Marsit. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation and its specific global compensation mechanism for mitigating compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces up to 35% training time while preserving the same accuracy as training without compression.

GLite: a fast and efficient automatic graph-level optimizer for large-scale DNNs

Jiaqi Li
Min Peng
Qingan Li
Meizheng Peng
Mengting Yuan

We propose a scalable graph-level optimizer named GLite to speed up search-based optimizations on large neural networks. GLite leverages a potential-based partitioning strategy to partition large computation graphs into small subgraphs without losing profitable substitution patterns. To avoid redundant subgraph matching, we propose a dynamic programming algorithm to reuse explored matching patterns. The experimental results show that GLite reduces the running time of search-based optimizations from hours to milliseconds, without compromising in inference performance.

Contrastive quant: quantization makes stronger contrastive learning

Yonggan Fu
Qixuan Yu
Meng Li
Xu Ouyang
Vikas Chandra
Yingyan Lin

Contrastive learning learns visual representations by enforcing feature consistency under different augmented views. In this work, we explore contrastive learning from a new perspective. Interestingly, we find that quantization, when properly engineered, can enhance the effectiveness of contrastive learning. To this end, we propose a novel contrastive learning framework, dubbed Contrastive Quant, to encourage feature consistency under both differently augmented inputs via various data transformations and differently augmented weights/activations via various quantization levels. Extensive experiments, built on top of two state-of-the-art contrastive learning methods SimCLR and BYOL, show that Contrastive Quant consistently improves the learned visual representation.

Serpens: a high bandwidth memory based accelerator for general-purpose sparse matrix-vector multiplication

Linghao Song
Yuze Chi
Licheng Guo
Jason Cong

Sparse matrix-vector multiplication (SpMV) multiplies a sparse matrix with a dense vector. SpMV plays a crucial role in many applications, from graph analytics to deep learning. The random memory accesses of the sparse matrix make accelerator design challenging. However, high bandwidth memory (HBM) based FPGAs are a good fit for designing accelerators for SpMV. In this paper, we present Serpens, an HBM based accelerator for general-purpose SpMV, which features memory-centric processing engines and index coalescing to support the efficient processing of arbitrary SpMVs. From the evaluation of twelve large-size matrices, Serpens is 1.91x and 1.76x better in terms of geomean throughput than the latest accelerators GraphLiLy and Sextans, respectively. We also evaluate 2,519 SuiteSparse matrices, and Serpens achieves 2.10x higher throughput than a K80 GPU. For the energy/bandwidth efficiency, Serpens is 1.71x/1.99x, 1.90x/2.69x, and 6.25x/4.06x better compared with GraphLily, Sextans, and K80, respectively. After scaling up to 24 HBM channels, Serpens achieves up to 60.55 GFLOP/s (30,204 MTEPS) and up to 3.79x over GraphLily. The code is available at https://github.com/UCLA-VAST/Serpens.

An energy-efficient seizure detection processor using event-driven multi-stage CNN classification and segmented data processing with adaptive channel selection

Jiahao Liu
Zirui Zhong
Yong Zhou
Hui Qiu
Jianbiao Xiao
Jiajing Fan
Zhaomin Zhang
Sixu Li
Yiming Xu
Siqi Yang
Weiwei Shan
Shuisheng Lin
Liang Chang
Jun Zhou

Recently wearable EEG monitoring devices with seizure detection processor using convolutional neural network (CNN) have been proposed to detect the seizure onset of patients in real time for alert or stimulation purpose. High energy efficiency and accuracy are required for the seizure detection processor due to the tight energy constraint of wearable devices. However, the use of CNN and multi-channel processing nature of seizure detection result in significant energy consumption. In this work, an energy-efficient seizure detection processor is proposed, featuring multi-stage CNN classification, segmented data processing and adaptive channel selection to reduce the energy consumption while achieving high accuracy. The design has been fabricated and tested using a 55nm process technology. Compared with several state-of-the-art designs, the proposed design achieves the lowest energy per classification (0.32 μJ) with high sensitivity (97.78%) and low false positive rate per hour (0.5).

PatterNet: explore and exploit filter patterns for efficient deep neural networks

Behnam Khaleghi
Uday Mallappa
Duygu Yaldiz
Haichao Yang
Monil Shah
Jaeyoung Kang
Tajana Rosing

Weight clustering is an effective technique for compressing deep neural networks (DNNs) memory by using a limited number of unique weights and low-bit weight indexes to store clustering information. In this paper, we propose PatterNet, which enforces shared clustering topologies on filters. Cluster sharing leads to a greater extent of memory reduction by reusing the index information. PatterNet effectively factorizes input activations and post-processes the unique weights, which saves multiplications by several orders of magnitude. Furthermore, PatterNet reduces the add operations by harnessing the fact that filters sharing a clustering pattern have the same factorized terms. We introduce techniques for determining and assigning clustering patterns and training a network to fulfill the target patterns. We also propose and implement an efficient accelerator that builds upon the patterned filters. Experimental results show that PatterNet shrinks the memory and operation count up to 80.2% and 73.1%, respectively, with similar accuracy to the baseline models. PatterNet accelerator improves the energy efficiency by 107x over Nvidia 1080 1080 GTX and 2.2x over state of the art.

E2SR: an end-to-end video CODEC assisted system for super resolution acceleration

Zhuoran Song
Zhongkai Yu
Naifeng Jing
Xiaoyao Liang

Nowadays high-resolution (HR) videos have been a popular choice for a better viewing experience. Recent works have shown that super-resolution (SR) algorithms can provide superior quality HR video by applying the deep neural network (DNN) to each low-resolution (LR) frame. Obviously, such per-frame DNN processing is compute-intensive and hampers the deployment of SR algorithms on mobile devices. Although many accelerators have proposed solutions, they only focus on mobile devices. Differently, we notice that the HR video is originally stored in the cloud server and should be well exploited to gain high accuracy and performance improvement. Based on this observation, this paper proposes an end-to-end video CODEC assisted system (E²SR), which tightly couples the cloud server with the device to deliver a smooth and real-time video viewing experience. We propose the motion vector search algorithm executed in the cloud server, which can search the motion vectors and residuals for part of HR video frames and then pack them as addons. We further propose the reconstruction algorithm executed in the device to fast reconstruct the corresponding HR frames using the addons to skip part of DNN computations. We design the corresponding E²SR architecture to enable the reconstruction algorithm in the device, which achieves significant speedup with minimal hardware overhead. Our experimental results show that the E²SR system achieves 3.4x performance improvement with less than 0.56 PSNR loss compared with the state-of-the-art “EDVR” scheme.

MATCHA: a fast and energy-efficient accelerator for fully homomorphic encryption over the torus

Lei Jiang
Qian Lou
Nrushad Joshi

Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to happen directly on ciphertexts using homomorphic logic gates. However, each TFHE gate on state-of-the-art hardware platforms such as GPUs and FPGAs is extremely slow (> 0.2ms). Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating point FFT and IFFT kernels. In this paper, we propose a fast and energy-efficient accelerator, MATCHA, to process TFHE gates. MATCHA supports aggressive bootstrapping key unrolling to accelerate TFHE gates without decryption errors by approximate multiplication-less integer FFTs and IFFTs, and a pipelined datapath. Compared to prior accelerators, MATCHA improves the TFHE gate processing throughput by 2.3x, and the throughput per Watt by 6.3x.

VirTEE: a full backward-compatible TEE with native live migration and secure I/O

Jianqiang Wang
Pouya Mahmoody
Ferdinand Brasser
Patrick Jauernig
Ahmad-Reza Sadeghi
Donghui Yu
Dahan Pan
Yuanyuan Zhang

Modern security architectures provide Trusted Execution Environments (TEEs) to protect critical data and applications against malicious privileged software in so-called enclaves. However, the seamless integration of existing TEEs into the cloud is hindered, as they require substantial adaptation of the software executing inside an enclave as well as the cloud management software to handle enclaved workloads. We tackle these challenges by presenting VirTEE, the first TEE architecture that allows strongly isolated execution of unmodified virtual machines (VMs) in enclaves, as well as secure live migration of VM enclaves between VirTEE-enabled servers. Combined with its secure I/O capabilities, VirTEE enables the integration of enclaved computing in today’s complex cloud infrastructure. We thoroughly evaluate our RISC-V-based prototype, and show its effectiveness and efficiency.

Apple vs. EMA: electromagnetic side channel attacks on apple CoreCrypto

Gregor Haas
Aydin Aysu

Cryptographic instruction set extensions are commonly used for ciphers which would otherwise face unacceptable side channel risks. A prominent example of such an extension is the ARMv8 Cryptographic Extension, or ARM CE for short, which defines dedicated instructions to securely accelerate AES. However, while these extensions may be resistant to traditional “digital” side channel attacks, they may still be vulnerable to physical side channel attacks.

In this work, we demonstrate the first such attack on a standard ARM CE AES implementation. We specifically focus on the implementation used by Apple’s CoreCrypto library which we run on the Apple A10 Fusion SoC. To that end, we implement an optimized side channel acquisition infrastructure involving both custom iPhone software and accelerated analysis code. We find that an adversary which can observe 5–30 million known-ciphertext traces can reliably extract secret AES keys using electromagnetic (EM) radiation as a side channel. This corresponds to an encryption operation on less than half of a gigabyte of data, which could be acquired in less than 2 seconds on the iPhone 7 we examined. Our attack thus highlights the need for side channel defenses for real devices and production, industry-standard encryption software.

Algorithm/architecture co-design for energy-efficient acceleration of multi-task DNN

Jaekang Shin
Seungkyu Choi
Jongwoo Ra
Lee-Sup Kim

Real-world AI applications, such as augmented reality or autonomous driving, require processing multiple CV tasks simultaneously. However, the enormous data size and the memory footprint have been a crucial hurdle for deep neural networks to be applied in resource-constrained devices. To solve the problem, we propose an algorithm/architecture co-design. The proposed algorithmic scheme, named SqueeD, reduces per-task weight and activation size by 21.9x and 2.1x, respectively, by sharing those data between tasks. Moreover, we design architecture and dataflow to minimize DRAM access by fully utilizing benefits from SqueeD. As a result, the proposed architecture reduces the DRAM access increment and energy consumption increment per task by 2.2x and 1.3x, respectively.

EBSP: evolving bit sparsity patterns for hardware-friendly inference of quantized deep neural networks

Fangxin Liu
Wenbo Zhao
Zongwu Wang
Yongbiao Chen
Zhezhi He
Naifeng Jing
Xiaoyao Liang
Li Jiang

Model compression has been extensively investigated for supporting efficient neural network inference on edge-computing platforms due to the huge model size and computation amount. Recent researches embrace joint-way compression across multiple techniques for extreme compression. However, most joint-way methods adopt a naive solution that applies two approaches sequentially, which can be sub-optimal, as it lacks a systematic approach to incorporate them.

This paper proposes the integration of aggressive joint-way compression into hardware design, namely EBSP. It is motivated by 1) the quantization allows simplifying hardware implementations; 2) the bit distribution of quantized weights can be viewed as an independent trainable variable; 3) the exploitation of bit sparsity in the quantized network has the potential to achieve better performance. To achieve that, this paper introduces the bit sparsity patterns to construct both highly expressive and inherently regular bit distribution in the quantized network. We further incorporate our sparsity constraint in training to evolve inherently bit distributions to the bit sparsity pattern. Moreover, the structure of the introduced bit sparsity pattern engenders minimum hardware implementation under competitive classification accuracy. Specifically, the quantized network constrained by bit sparsity pattern can be processed using LUTs with the fewest entries instead of multipliers in minimally modified computational hardware. Our experiments show that compared to Eyeriss, BitFusion, WAX, and OLAccel, EBSP with less than 0.8% accuracy loss, can achieve 87.3%, 79.7%, 75.2% and 58.9% energy reduction and 93.8%, 83.7%, 72.7% and 49.5% performance gain on average, respectively.

A time-to-first-spike coding and conversion aware training for energy-efficient deep spiking neural network processor design

Dongwoo Lew
Kyungchul Lee
Jongsun Park

In this paper, we present an energy-efficient SNN architecture, which can seamlessly run deep spiking neural networks (SNNs) with improved accuracy. First, we propose a conversion aware training (CAT) to reduce ANN-to-SNN conversion loss without hardware implementation overhead. In the proposed CAT, the activation function developed for simulating SNN during ANN training, is efficiently exploited to reduce the data representation error after conversion. Based on the CAT technique, we also present a time-to-first-spike coding that allows lightweight logarithmic computation by utilizing spike time information. The SNN processor design that supports the proposed techniques has been implemented using 28nm CMOS process. The processor achieves the top-1 accuracies of 91.7%, 67.9% and 57.4% with inference energy of 486.7uJ, 503.6uJ, and 1426uJ to process CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, when running VGG-16 with 5bit logarithmic weights.

XMA: a crossbar-aware multi-task adaption framework via shift-based mask learning method

Fan Zhang
Li Yang
Jian Meng
Jae-sun Seo
Yu (Kevin) Cao
Deliang Fan

ReRAM crossbar array as a high-parallel fast and energy-efficient structure attracts much attention, especially on the acceleration of Deep Neural Network (DNN) inference on one specific task. However, due to the high energy consumption of weight re-programming and the ReRAM cells’ low endurance problem, adapting the crossbar array for multiple tasks has not been well explored. In this paper, we propose XMA, a novel crossbar-aware shift-based mask learning method for multiple task adaption in the ReRAM crossbar DNN accelerator for the first time. XMA leverages the popular mask-based learning algorithm’s benefit to mitigate catastrophic forgetting and learn a task-specific, crossbar column-wise, and shift-based multi-level mask, rather than the most commonly used element-wise binary mask, for each new task based on a frozen backbone model. With our crossbar-aware design innovation, the required masking operation to adapt for a new task could be implemented in an existing crossbar-based convolution engine with minimal hardware/memory overhead and, more importantly, no need for power-hungry cell re-programming, unlike prior works. The extensive experimental results show that, compared with state-of-the-art multiple task adaption Piggyback method [1], XMA achieves 3.19% higher accuracy on average, while saving 96.6% memory overhead. Moreover, by eliminating cell re-programming, XMA achieves ~4.3x higher energy efficiency than Piggyback.

SWIM: selective write-verify for computing-in-memory neural accelerators

Zheyu Yan
Xiaobo Sharon Hu
Yiyu Shi

Computing-in-Memory architectures based on non-volatile emerging memories have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, these emerging devices can suffer from significant variations during the mapping process (i.e., programming weights to the devices), and if left undealt with, can cause significant accuracy degradation. The non-ideality of weight mapping can be compensated by iterative programming with a write-verify scheme, i.e., reading the conductance and rewriting if necessary. In all existing works, such a practice is applied to every single weight of a DNN as it is being mapped, which requires extensive programming time. In this work, we show that it is only necessary to select a small portion of the weights for write-verify to maintain the DNN accuracy, thus achieving significant speedup. We further introduce a second derivative based technique SWIM, which only requires a single pass of forward and backpropagation, to efficiently select the weights that need write-verify. Experimental results on various DNN architectures for different datasets show that SWIM can achieve up to 10x programming speedup compared with conventional full-blown write-verify while attaining a comparable accuracy.

Enabling efficient deep convolutional neural network-based sensor fusion for autonomous driving

Xiaoming Zeng
Zhendong Wang
Yang Hu

Autonomous driving demands accurate perception and safe decision-making. To achieve this, automated vehicles are typically equipped with multiple sensors (e.g., cameras, Lidar, etc.), enabling them to exploit complementary environmental contexts by fusing data from different sensing modalities. With the success of Deep Convolutional Neural Network (DCNN), the fusion between multiple DCNNs has been proved to be a promising strategy to achieve satisfactory perception accuracy. However, existing mainstream DCNN fusion strategies conduct fusion by simply element-wisely adding feature maps extracted from different modalities together at various stages, failing to consider whether the features being fused are matched or not. Therefore, we first propose a feature disparity metric to quantitatively measure the degree of feature disparity between the fusing feature maps. Then, we propose a Fusion-filter as the Feature-matching techniques to tackle the feature-mismatching issue. We also propose a Layer-sharing technique in the deep layer of the DCNN to achieve high accuracy. With the assistance of feature disparity working as an additional loss, our proposed technologies enable DCNN to learn corresponding feature maps with similar characteristics and complementary visual context from different modalities. Evaluations demonstrate that our proposed fusion techniques can achieve higher accuracy on KITTI dataset with less computation resources consumption.

Zhuyi: perception processing rate estimation for safety in autonomous vehicles

Yu-Shun Hsiao
Siva Kumar Sastry Hari
Michał Filipiuk
Timothy Tsai
Michael B. Sullivan
Vijay Janapa Reddi
Vasu Singh
Stephen W. Keckler

The processing requirement of autonomous vehicles (AVs) for high-accuracy perception in complex scenarios can exceed the resources offered by the in-vehicle computer, degrading safety and comfort. This paper proposes a sensor frame processing rate (FPR) estimation model, Zhuyi, that quantifies the minimum safe FPR continuously in a driving scenario. Zhuyi can be employed post-deployment as an online safety check and to prioritize work. Experiments conducted using a multi-camera state-of-the-art industry AV system show that Zhuyi’s estimated FPRs are conservative, yet the system can maintain safety by processing only 36% or fewer frames compared to a default 30-FPR system in the tested scenarios.

Processing-in-SRAM acceleration for ultra-low power visual 3D perception

Yuquan He
Songyun Qu
Gangliang Lin
Cheng Liu
Lei Zhang
Ying Wang

Real-time ego-motion tracking and 3D structural estimation are the fundamental tasks for the ubiquitous cyper-physical systems, and they can be conducted via the state-of-the-art Edge-Based Visual Odometry (EBVO) algorithm. However, the intrinsic data-intensive process of EBVO emplaces a memory-wall hurdle in practical deployment on conventional von-Neumann-style computing systems. In this work, we attempt to leverage SRAM based processing-in-memory (PIM) technique to alleviate such memory-wall bottleneck, so as to optimize the EBVO systematically from the perspectives of the algorithm layer and physical layer. In the algorithm layer, we first investigate the data reuse patterns of the essential computing kernels required for the feature detection and pose estimation steps in EBVO, and propose PIM friendly data layout and computing scheme for each kernel accordingly. We distill the basic logical and arithmetical operations required in the algorithm layer, and in the physical layer, we propose a novel bit-parallel and reconfigurable SRAM-PIM architecture to realize the operations with high computing precision and throughput. Our experimental result shows that the proposed multi-layer optimization allows for high tracking accuracy of EBVO, and it can improve 11x processing speed and reduce 20x energy consumption compared to the CPU implementation.

Response time analysis for dynamic priority scheduling in ROS2

Abdullah Al Arafat
Sudharsan Vaidhun
Kurt M. Wilson
Jinghao Sun
Zhishan Guo

Robot Operating System (ROS) is the most popular framework for developing robotics software. Typically, robotics software is safety-critical and employed in real-time systems requiring timing guarantees. Since the first generation of ROS provides no timing guarantee, the recent release of its second generation, ROS2, is necessary and timely, and has since received immense attention from practitioners and researchers. Unfortunately, the existing analysis of ROS2 showed the peculiar scheduling strategy of ROS2 executor, which severely affects the response time of ROS2 applications. This paper proposes a deadline-based scheduling strategy for the ROS2 executor. It further presents an analysis for an end-to-end response time of ROS2 workload (processing chain) and an evaluation of the proposed scheduling strategy for real workloads.

Voltage prediction of drone battery reflecting internal temperature

Jiwon Kim
Seunghyeok Jeon
Jaehyun Kim
Hojung Cha

Drones are commonly used in mission-critical applications, and the accurate estimation of available battery capacity before flight is critical for reliable and efficient mission planning. To this end, the battery voltage should be predicted accurately prior to launching a drone. However, in drone applications, a rise in the battery’s internal temperature changes the voltage significantly and leads to challenges in voltage prediction. In this paper, we propose a battery voltage prediction method that takes into account the battery’s internal temperature to accurately estimate the available capacity of the drone battery. To this end, we devise a temporal temperature factor (TTF) metric that is calculated by accumulating time series data about the battery’s discharge history. We employ a machine learning-based prediction model, reflecting the TTF metric, to achieve high prediction accuracy and low complexity. We validated the accuracy and complexity of our model with extensive evaluation. The results show that the proposed model is accurate with less than 1.5% error and readily operates on resource-constrained embedded devices.

A near-storage framework for boosted data preprocessing of mass spectrum clustering

Weihong Xu
Jaeyoung Kang
Tajana Rosing

Mass spectrometry (MS) has been a key to proteomics and metabolomics due to its unique ability to identify and analyze protein structures. Modern MS equipment generates massive amount of tandem mass spectra with high redundancy, making spectral analysis the major bottleneck in design of new medicines. Mass spectrum clustering is one promising solution as it greatly reduces data redundancy and boosts protein identification. However, state-of-the-art MS tools take many hours to run spectrum clustering. Spectra loading and preprocessing consumes average 82% execution time and energy during clustering. We propose a near-storage framework, MSAS, to speed up spectrum preprocessing. Instead of loading data into host memory and CPU, MSAS processes spectra near storage, thus reducing the expensive cost of data movement. We present two types of accelerators that leverage internal bandwidth at two storage levels: SSD and channel. The accelerators are optimized to match the data rate at each storage level with negligible overhead. Our results demonstrate that the channel-level design yields the best performance improvement for preprocessing – it is up to 187X and 1.8X faster than the CPU and the state-of-the-art in-storage computing solution, INSIDER, respectively. After integrating channel-level MSAS into existing MS clustering tools, we measure system level improvements in speed of 3.5X to 9.8X with 2.8X to 11.9X better energy efficiency.

MetaZip: a high-throughput and efficient accelerator for DEFLATE

Ruihao Gao
Xueqi Li
Yewen Li
Xun Wang
Guangming Tan

Booming data volume has become an important challenge for data center storage and bandwidth resources. Consequently, fast and efficient compression architecture is becoming the most fundamental design in data centers. However, the compression ratio (CR) and compression throughput are often difficult to achieve at the same time on existing computing platforms. DEFLATE is a widely used compression format in data centers, which is an ideal case for hardware acceleration. Unfortunately, Deflate has an inherent connection among its special memory access pattern, which limits a higher throughput.

In this paper, we propose MetaZip, a high-throughput and scalable data-compression architecture, which is targeted for FPGA-enabled data centers. To improve the compression throughput within the constraints of FPGA resources, we propose an adaptive parallel-width pipeline, which can be fed 64bytes per cycle. To balance the compression quality, we propose a series of sub-modules (e.g. 8-bytes MetaHistory, Seed Bypass, Serialization Predictor). Experimental results show that MetaZip achieves the throughput of 15.6GB/s with a single engine, which is 234X/2.78X than a CPU gzip baseline and FPGA based architecture, respectively.

Enabling fast uncertainty estimation: accelerating bayesian transformers via algorithmic and hardware optimizations

Hongxiang Fan
Martin Ferianc
Wayne Luk

Quantifying the uncertainty of neural networks (NNs) has been required by many safety-critical applications such as autonomous driving or medical diagnosis. Recently, Bayesian transformers have demonstrated their capabilities in providing high-quality uncertainty estimates paired with excellent accuracy. However, their real-time deployment is limited by the compute-intensive attention mechanism that is core to the transformer architecture, and the repeated Monte Carlo sampling to quantify the predictive uncertainty. To address these limitations, this paper accelerates Bayesian transformers via both algorithmic and hardware optimizations. On the algorithmic level, an evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload. On the hardware level, we demonstrate that the sparsity brings hardware performance improvement on our optimized CPU and GPU implementations. An adaptable hardware architecture is also proposed to accelerate Bayesian transformers on an FPGA. Extensive experiments demonstrate that the EA-based framework, together with hardware optimizations, reduce the latency of Bayesian transformers by up to 13, 12 and 20 times on CPU, GPU and FPGA platforms respectively, while achieving higher algorithmic performance.

Eventor: an efficient event-based monocular multi-view stereo accelerator on FPGA platform

Mingjun Li
Jianlei Yang
Yingjie Qi
Meng Dong
Yuhao Yang
Runze Liu
Weitao Pan
Bei Yu
Weisheng Zhao

Event cameras are bio-inspired vision sensors that asynchronously represent pixel-level brightness changes as event streams. Event-based monocular multi-view stereo (EMVS) is a technique that exploits the event streams to estimate semi-dense 3D structure with known trajectory. It is a critical task for event-based monocular SLAM. However, the required intensive computation workloads make it challenging for real-time deployment on embedded platforms. In this paper, Eventor is proposed as a fast and efficient EMVS accelerator by realizing the most critical and time-consuming stages including event back-projection and volumetric ray-counting on FPGA. Highly paralleled and fully pipelined processing elements are specially designed via FPGA and integrated with the embedded ARM as a heterogeneous system to improve the throughput and reduce the memory footprint. Meanwhile, the EMVS algorithm is reformulated to a more hardware-friendly manner by rescheduling, approximate computing and hybrid data quantization. Evaluation results on DAVIS dataset show that Eventor achieves up to 24X improvement in energy efficiency compared with Intel i5 CPU platform.

GEML: GNN-based efficient mapping method for large loop applications on CGRA

Mingyang Kou
Jun Zeng
Boxiao Han
Fei Xu
Jiangyuan Gu
Hailong Yao

Coarse-grained reconfigurable architecture (CGRA) is an emerging hardware architecture, with reconfigurable Processing Elements (PEs) for executing operations efficiently and flexibly. One major challenge for current CGRA compilers is the scalability issue for large loop applications, where valid loop mapping results cannot be obtained in an acceptable time. This paper proposes an enhanced loop mapping method based on Graph Neural Network (GNN), which effectively addresses the scalability issue and generates valid loop mapping results for large applications. Experimental results show that the proposed method enhances the compilation time by 10.8x on average over existing methods, with even better loop mapping solutions.

Mixed-granularity parallel coarse-grained reconfigurable architecture

Jinyi Deng
Linyun Zhang
Lei Wang
Jiawei Liu
Kexiang Deng
Shibin Tang
Jiangyuan Gu
Boxiao Han
Fei Xu
Leibo Liu
Shaojun Wei
Shouyi Yin

Coarse-Grained Reconfigurable Architecture (CGRA) is a high-performance computing architecture. However, existing CGRA silicon utilization is low due to the lack of fine-grained parallelism inside Processing Element (PE) and general coarse-grained parallel approach on PE array. No fine-grained parallelism in PE not only leads to low silicon utilization of PE, but also makes the mapping loose and irregular. No generalized parallel method for the mapping cause low PE utilization on CGRA. Our goal is to design an execution model and a Mixed-granularity Parallel CGRA (MP-CGRA), which is capable to fine-grained parallelize operators excution in PEs and parallelize data transmission in channels, leading to a compact mapping. A coarse-grained general parallel method is proposed to vectorize the compact mapping. Evaluated with Machsuite, MP-CGRA achieves an improvement of 104.65% silicon utilization on PE array and a 91.40% performance per area improvement compared with baseline-CGRA.

GuardNN: secure accelerator architecture for privacy-preserving deep learning

Weizhe Hua
Muhammad Umar
Zhiru Zhang
G. Edward Suh

This paper proposes GuardNN, a secure DNN accelerator that provides hardware-based protection for user data and model parameters even in an untrusted environment. GuardNN shows that the architecture and protection can be customized for a specific application to provide strong confidentiality and integrity guarantees with negligible overhead. The design of the GuardNN instruction set reduces the TCB to just the accelerator and allows confidentiality protection even when the instructions from a host cannot be trusted. GuardNN minimizes the overhead of memory encryption and integrity verification by customizing the off-chip memory protection for the known memory access patterns of a DNN accelerator. GuardNN is prototyped on an FPGA, demonstrating effective confidentiality protection with ~3% performance overhead for inference.

SRA: a secure ReRAM-based DNN accelerator

Lei Zhao
Youtao Zhang
Jun Yang

Deep Neural Network (DNN) accelerators are increasingly developed to pursue high efficiency in DNN computing. However, the IP protection of the DNNs deployed on such accelerators is an important topic that has been less addressed. Although there are previous works that targeted this problem for CMOS-based designs, there is still no solution for ReRAM-based accelerators which pose new security challenges due to their crossbar structure and non-volatility. ReRAM’s non-volatility retains data even after the system is powered off, making the stored DNN model vulnerable to attacks by simply reading out the ReRAM content. Because the crossbar structure can only compute on plaintext data, encrypting the ReRAM content is no longer a feasible solution in this scenario.

In this paper, we propose SRA – a secure ReRAM-based DNN accelerator that stores DNN weights on crossbars in an encrypted format while still maintaining ReRAM’s in-memory computing capability. The proposed encryption scheme also supports sharing bits among multiple weights, significantly reducing the storage overhead. In addition, SRA uses a novel high-bandwidth SC conversion scheme to protect each layer’s intermediate results, which also contain sensitive information of the model. Our experimental results show that SRA can effectively prevent pirating the deployed DNN weights as well as the intermediate results with negligible accuracy loss, and achieves 1.14X performance speedup and 9% energy reduction compared to ISAAC – a non-secure ReRAM-based baseline.

ABNN2: secure two-party arbitrary-bitwidth quantized neural network predictions

Liyan Shen
Ye Dong
Binxing Fang
Jinqiao Shi
Xuebin Wang
Shengli Pan
Ruisheng Shi

Data privacy and security issues are preventing a lot of potential on-cloud machine learning as services from happening. In the recent past, secure multi-party computation (MPC) has been used to achieve the secure neural network predictions, guaranteeing the privacy of data. However, the cost of the existing two-party solutions is expensive and they are impractical in real-world setting.

In this work, we utilize the advantages of quantized neural network (QNN) and MPC to present ABNN², a practical secure two-party framework that can realize arbitrary-bitwidth quantized neural network predictions. Concretely, we propose an efficient and novel matrix multiplication protocol based on 1-out-of-N OT extension and optimize the the protocol through a parallel scheme. In addition, we design optimized protocol for the ReLU function. The experiments demonstrate that our protocols are about 2X-36X and 1.4X–7X faster than SecureML (S&P’17) and MiniONN (CCS’17) respectively. And ABNN² obtain comparable efficiency as state of the art QNN prediction protocol QUOTIENT (CCS’19), but the later only supports ternary neural network.

Adaptive neural recovery for highly robust brain-like representation

Prathyush Poduval
Yang Ni
Yeseong Kim
Kai Ni
Raghavan Kumar
Rossario Cammarota
Mohsen Imani

Today’s machine learning platforms have major robustness issues dealing with insecure and unreliable memory systems. In conventional data representation, bit flips due to noise or attack can cause value explosion, which leads to incorrect learning prediction. In this paper, we propose RobustHD, a robust and noise-tolerant learning system based on HyperDimensional Computing (HDC), mimicking important brain functionalities. Unlike traditional binary representation, RobustHD exploits a redundant and holographic representation, ensuring all bits have the same impact on the computation. RobustHD also proposes a runtime framework that adaptively identifies and regenerates the faulty dimensions in an unsupervised way. Our solution not only provides security against possible bit-flip attacks but also provides a learning solution with high robustness to noises in the memory. We performed a cross-stacked evaluation from a conventional platform to emerging processing in-memory architecture. Our evaluation shows that under 10% random bit flip attack, RobustHD provides a maximum of 0.53% quality loss, while deep learning solutions are losing over 26.2% accuracy.

Efficiency attacks on spiking neural networks

Sarada Krithivasan
Sanchari Sen
Nitin Rathi
Kaushik Roy
Anand Raghunathan

Spiking Neural Networks are a class of artificial neural networks that process information as discrete spikes. The time and energy consumed in SNN implementations is strongly dependent on the number of spikes processed. We explore this sensitivity from an adversarial perspective and propose SpikeAttack, a completely new class of attacks on SNNs. SpikeAttack impacts the efficiency of SNNs via imperceptible perturbations that increase the overall spiking activity of the network, leading to increased time and energy consumption. Across four SNN benchmarks, SpikeAttackresults in 1.7x-2.5X increase in spike activity, leading to increases of 1.6x-2.3x and 1.4x-2.2x in latency and energy consumption, respectively.

L-QoCo: learning to optimize cache capacity overloading in storage systems

Ji Zhang
Xijun Li
Xiyao Zhou
Mingxuan Yuan
Zhuo Cheng
Keji Huang
Yifan Li

Cache plays an important role to maintain high and stable performance (i.e. high throughput, low tail latency and throughput jitter) in storage systems. Existing rule-based cache management methods, coupled with engineers’ manual configurations, cannot meet ever-growing requirements of both time-varying workloads and complex storage systems, leading to frequent cache overloading.

In this paper, we propose the first light-weight learning-based cache bandwidth control technique, called L-QoCo which can adaptively control the cache bandwidth so as to effectively prevent cache overloading in storage systems. Extensive experiments with various workloads on real systems show that L-QoCo, with its strong adaptability and fast learning ability, can adapt to various workloads to effectively control cache bandwidth, thereby significantly improving the storage performance (e.g. increasing the throughput by 10%-20% and reducing the throughput jitter and tail latency by 2X-6X and 1.5X-4X, respectively, compared with two representative rule-based methods).

Pipette: efficient fine-grained reads for SSDs

Shuhan Bai
Hu Wan
Yun Huang
Xuan Sun
Fei Wu
Changsheng Xie
Hung-Chih Hsieh
Tei-Wei Kuo
Chun Jason Xue

Big data applications, such as recommendation system and social network, often generate a huge number of fine-grained reads to the storage. Block-oriented storage devices tend to suffer from these fine-grained read operations in terms of I/O traffic as well as performance. Motivated by this challenge, a fine-grained read framework, Pipette, is proposed in this paper, as an extension to the traditional I/O framework. With an adaptive caching design, Pipette framework offers a tremendous reduction in I/O traffic as well as achieves significant performance gain. A Pipette prototype was implemented with Ext4 file system on an SSD for two real-world applications, where the I/O throughput is improved by 31.6% and 33.5%, and the I/O traffic is reduced by 95.6% and 93.6%, respectively.

CDB: critical data backup design for consumer devices with high-density flash based hybrid storage

Longfei Luo
Dingcui Yu
Liang Shi
Chuanmin Ding
Changlong Li
Edwin H.-M. Sha

Hybrid flash based storage constructed with high-density and low-cost flash memory are becoming increasingly popular in consumer devices during the last decade. However, to protect critical data, existing methods are designed for improving reliability of consumer devices with non-hybrid flash storage. Based on evaluations and analysis, these methods will result in significant performance and lifetime degradation in consumer devices with hybrid storage. The reason is that different kinds of memory in hybrid storage have different characteristics, such as performance and access granularity. To address the above problems, a critical data backup (CDB) method is proposed to backup designated critical data with making full use of different kinds of memory in hybrid storage. Experiment results show that compared with the state-of-the-arts, CDB achieves encouraging performance and lifetime improvement.

SS-LRU: a smart segmented LRU caching

Chunhua Li
Man Wu
Yuhan Liu
Ke Zhou
Ji Zhang
Yunqing Sun

Many caching policies use machine learning to predict data reuse, but they ignore the impact of incorrect prediction on cache performance, especially for large-size objects. In this paper, we propose a smart segmented LRU (SS-LRU) replacement policy, which adopts a size-aware classifier designed for cache scenarios and considers the cache cost caused by misprediction. Besides, SS-LRU enhances the migration rules of segmented LRU (SLRU) and implements a smart caching with unequal priorities and segment sizes based on prediction and multiple access patterns. We conducted Extensive experiments under the real-world workloads to demonstrate the superiority of our approach over state-of-the-art caching policies.

NobLSM: an LSM-tree with non-blocking writes for SSDs

Haoran Dang
Chongnan Ye
Yanpeng Hu
Chundong Wang

Solid-state drives (SSDs) are gaining popularity. Meanwhile, key-value stores built on log-structured merge-tree (LSM-tree) are widely deployed for data management. LSM-tree frequently calls syncs to persist newly-generated files for crash consistency. The blocking syncs are costly for performance. We revisit the necessity of syncs for LSM-tree. We find that Ext4 journaling embraces asynchronous commits to implicitly persist files. Hence, we design NobLSM that makes LSM-tree and Ext4 cooperate to substitute most syncs with non-blocking asynchronous commits, without losing consistency. Experiments show that NobLSM significantly outperforms state-of-the-art LSM-trees with higher throughput on an ordinary SSD.

TailCut: improving performance and lifetime of SSDs using pattern-aware state encoding

Jaeyong Lee
Myungsunk Kim
Wonil Choi
Sanggu Lee
Jihong Kim

Although lateral charge spreading is considered as a dominant error source in 3D NAND flash memory, little is known about its detailed characteristics at the storage system level. From a device characterization study, we observed that lateral charge spreading strongly depends on vertically adjacent state patterns and a few specific patterns are responsible for a large portion of bit errors from lateral charge spreading. We propose a new state encoding scheme, called TailCut, which removes vulnerable state patterns by modifying encoded states. By removing vulnerable patterns, TailCut can improve the SSD lifetime and read latency by 80% and 25%, respectively.

HIMap: a heuristic and iterative logic synthesis approach

Xing Li
Lei Chen
Fan Yang
Mingxuan Yuan
Hongli Yan
Yupeng Wan

Recently, many models show their superiority in sequence and parameter tuning. However, they usually generate non-deterministic flows and require lots of training data. We thus propose a heuristic and iterative flow, namely HIMap, for deterministic logic synthesis. In which, domain knowledge of the functionality and parameters of synthesis operators and their correlations to netlist PPA is fully utilized to design synthesis templates for various objetives. We also introduce deterministic and effective heuristics to tune the templates with relatively fixed operator combinations and iteratively improve netlist PPA. Two nested iterations with local searching and early stopping can thus generate dynamic sequence for various circuits and reduce runtime. HIMap improves 13 best results of the EPFL combinational benchmarks for delay (5 for area). Especially, for several arithmetic benchmarks, HIMap significantly reduces LUT-6 levels by 11.6 ~ 21.2% and delay after P&R by 5.0 ~ 12.9%.

Improving LUT-based optimization for ASICs

Walter Lau Neto
Luca Amarú
Vinicius Possani
Patrick Vuillod
Jiong Luo
Alan Mishchenko
Pierre-Emmanuel Gaillardon

LUT-based optimization techniques are finding new applications in synthesis of ASIC designs. Intuitively, packing logic into LUTs provides a better balance between functionality and structure in logic optimization. On this basis, the LUT-engine framework [1] was introduced to enhance the ASIC synthesis. In this paper, we present key improvements, at both algorithmic and flow levels, making a much stronger LUT-engine. We restructure the flow of LUT-engine, to benefit from a heterogeneous mixture of LUT sizes, and revisit its requirements for maximum scalability. We propose a dedicated LUT mapper for the new flow, based on FlowMap, natively balancing LUT-count and NAND2-count for a wide range LUT sizes. We describe a specialized Boolean factoring technique, exploiting the fanin bounds in LUT networks, resulting in a very fast LUT-based AIG minimization. By using the proposed methodology, we improve 9 of the best area results in the ongoing EPFL synthesis competition. Integrated in a complete EDA flow for ASICs, the new LUT-engine performs well on a set of 87 benchmarks: -4.60% area and -3.41% switching power at +5% runtime, compared to the baseline flow without LUT-based optimizations, and -3.02% area and -2.54% switching power with -1% runtime, compared to the original LUT-engine.

NovelRewrite: node-level parallel AIG rewriting

Shiju Lin
Jinwei Liu
Tianji Liu
Martin D. F. Wong
Evangeline F. Y. Young

Logic rewriting is an important part in logic optimization. It rewrites a circuit by replacing local subgraphs with logically equivalent ones, so that the area and the delay of the circuit can be optimized. This paper introduces a parallel AIG rewriting algorithm with a new concept of logical cuts. Experiments show that this algorithm implemented with one GPU can be on average 32X faster than the logic rewriting in the logic synthesis tool ABC on large benchmarks. Compared with other logic rewriting acceleration works, ours has the best quality and the shortest running time.

Search space characterization for approximate logic synthesis

Linus Witschen
Tobias Wiersema
Lucas Reuter
Marco Platzner

Approximate logic synthesis aims at trading off a circuit’s quality to improve a target metric. Corresponding methods explore a search space by approximating circuit components and verifying the resulting quality of the overall circuit, which is costly.

We propose a methodology that determines reasonable values for the component’s local error bounds prior to search space exploration. Utilizing formal verification on a novel approximation miter guarantees the circuit’s quality for such local error bounds, independent of employed approximation methods, resulting in reduced runtimes due to omitted verifications. Experiments show speed-ups of up to 3.7x for approximate logic synthesis using our method.

SEALS: sensitivity-driven efficient approximate logic synthesis

Chang Meng
Xuan Wang
Jiajun Sun
Sijun Tao
Wei Wu
Zhihang Wu
Leibin Ni
Xiaolong Shen
Junfeng Zhao
Weikang Qian

Approximate computing is an emerging computing paradigm to design energy-efficient systems. Many greedy approximate logic synthesis (ALS) methods have been proposed to automatically synthesize approximate circuits. They typically need to consider all local approximate changes (LACs) in each iteration of the ALS flow to select the best one, which is time-consuming. In this paper, we propose SEALS, a Sensitivity-driven Efficient ALS method to speed up a greedy ALS flow. SEALS centers around a newly proposed concept called sensitivity, which enables a fast and accurate error estimation method and an efficient method to filter out unpromising LACs. SEALS can handle any statistical error metric. The experimental results show that it outperforms a state-of-the-art ALS method in runtime by 12X to 15X without reducing circuit quality.

Beyond local optimality of buffer and splitter insertion for AQFP circuits

Siang-Yun Lee
Heinz Riener
Giovanni De Micheli

Adiabatic quantum-flux parametron (AQFP) is an energy-efficient superconducting technology. Buffer and splitter (B/S) cells must be inserted to an AQFP circuit to meet the technology-imposed constraints on path balancing and fanout branching. These cells account for a significant amount of the circuit’s area and delay. In this paper, we identify that B/S insertion is a scheduling problem, and propose (a) a linear-time algorithm for locally optimal B/S insertion subject to a given schedule; (b) an SMT formulation to find the global optimum; and (c) an efficient heuristic for global B/S optimization. Experimental results show a reduction of 4% on the B/S cost and 124X speed-up compared to the state-of-the-art algorithm, and capability to scale to a magnitude larger benchmarks.

NAX: neural architecture and memristive xbar based accelerator co-design

Shubham Negi
Indranil Chakraborty
Aayush Ankit
Kaushik Roy

Neural Architecture Search (NAS) has provided the ability to design efficient deep neural network (DNN) catered towards different hardwares like GPUs, CPUs etc. However, integrating NAS with Memristive Crossbar Array (MCA) based In-Memory Computing (IMC) accelerator remains an open problem. The hardware efficiency (energy, latency and area) as well as application accuracy (considering device and circuit non-idealities) of DNNs mapped to such hardware are co-dependent on network parameters such as kernel size, depth etc. and hardware architecture parameters such as crossbar size and the precision of analog-to-digital converters. Co-optimization of both network and hardware parameters presents a challenging search space comprising of different kernel sizes mapped to varying crossbar sizes. To that effect, we propose NAX – an efficient neural architecture search engine that co-designs neural network and IMC based hardware architecture. NAX explores the aforementioned search space to determine kernel and corresponding crossbar sizes for each DNN layer to achieve optimal tradeoffs between hardware efficiency and application accuracy. For CIFAR-10 and Tiny ImageNet, our models achieve 0.9% and 18.57% higher accuracy at 30% and -10.47% lower EDAP (energy-delay-area product), compared to baseline ResNet-20 and ResNet-18 models, respectively.

MC-CIM: a reconfigurable computation-in-memory for efficient stereo matching cost computation

Zhiheng Yue
Yabing Wang
Leibo Liu
Shaojun Wei
Shuoyi Yin

This paper proposes the design of a computation-in-memory for stereo matching cost computation. The matching cost computation incurs large energy and latency overhead because of frequent memory access. To overcome previous design limitations, this work, named MC-CIM, performs matching cost computation without incurring memory access and introduces several key features. (1) Lightweight balanced computing unit is integrated within cell array to reduce memory access and improve system throughput. (2) Self-optimized circuit design enables to alter arithmetic operation for matching algorithm in various scenario. (3) Flexible data mapping method and reconfigurable digital peripheral explore maximum parallelism on different algorithm and bit-precision. The proposed design is implemented in 28nm technology and achieves average performance of 277 TOPs/W.

iMARS: an in-memory-computing architecture for recommendation systems

Mengyuan Li
Ann Franchesca Laguna
Dayane Reis
Xunzhao Yin
Michael Niemier
X. Sharon Hu

Recommendation systems (RecSys) suggest items to users by predicting their preferences based on historical data. Typical RecSys handle large embedding tables and many embedding table related operations. The memory size and bandwidth of the conventional computer architecture restrict the performance of RecSys. This work proposes an in-memory-computing (IMC) architecture (iMARS) for accelerating the filtering and ranking stages of deep neural network-based RecSys. iMARS leverages IMC-friendly embedding tables implemented inside a ferroelectric FET based IMC fabric. Circuit-level and system-level evaluation show that iMARS achieves 16.8x (713x) end-to-end latency (energy) improvement compared to the GPU counterpart for the MovieLens dataset.

ReGNN: a ReRAM-based heterogeneous architecture for general graph neural networks

Cong Liu
Haikun Liu
Hai Jin
Xiaofei Liao
Yu Zhang
Zhuohui Duan
Jiahong Xu
Huize Li

Graph Neural Networks (GNNs) have both graph processing and neural network computational features. Traditional graph accelerators and NN accelerators cannot meet these dual characteristics of GNN applications simultaneously. In this work, we propose a ReRAM-based processing-in-memory (PIM) architecture called ReGNN for GNN acceleration. ReGNN is composed of analog PIM (APIM) modules for accelerating matrix vector multiplication (MVM) operations, and digital PIM (DPIM) modules for accelerating non-MVM aggregation operations. To improve data parallelism, ReGNN maps data to aggregation sub-engines based on the degree of vertices and the dimension of feature vectors. Experimental results show that ReGNN speeds up GNN inference by 228x and 8.4x, and reduces energy consumption by 305.2x and 10.5x, compared with GPU and the ReRAM-based GNN accelerator ReGraphX, respectively.

You only search once: on lightweight differentiable architecture search for resource-constrained embedded platforms

Xiangzhong Luo
Di Liu
Hao Kong
Shuo Huai
Hui Chen
Weichen Liu

Benefiting from the search efficiency, differentiable neural architecture search (NAS) has evolved as the most dominant alternative to automatically design competitive deep neural networks (DNNs). We note that DNNs must be executed under strictly hard performance constraints in real-world scenarios, for example, the runtime latency on autonomous vehicles. However, to obtain the architecture that meets the given performance constraint, previous hardware-aware differentiable NAS methods have to repeat a plethora of search runs to manually tune the hyper-parameters by trial and error, and thus the total design cost increases proportionally. To resolve this, we introduce a lightweight hardware-aware differentiable NAS framework dubbed LightNAS, striving to find the required architecture that satisfies various performance constraints through a one-time search (i.e., you only search once). Extensive experiments are conducted to show the superiority of LightNAS over previous state-of-the-art methods. Related codes will be released at https://github.com/stepbuystep/LightNAS.

EcoFusion: energy-aware adaptive sensor fusion for efficient autonomous vehicle perception

Arnav Vaibhav Malawade
Trier Mortlock
Mohammad Abdullah Al Faruque

Autonomous vehicles use multiple sensors, large deep-learning models, and powerful hardware platforms to perceive the environment and navigate safely. In many contexts, some sensing modalities negatively impact perception while increasing energy consumption. We propose EcoFusion: an energy-aware sensor fusion approach that uses context to adapt the fusion method and reduce energy consumption without affecting perception performance. EcoFusion performs up to 9.5% better at object detection than existing fusion methods with approximately 60% less energy and 58% lower latency on the industry-standard Nvidia Drive PX2 hardware platform. We also propose several context-identification strategies, implement a joint optimization between energy and performance, and present scenario-specific results.

Human emotion based real-time memory and computation management on resource-limited edge devices

Yijie Wei
Zhiwei Zhong
Jie Gu

Emotional AI or Affective Computing has been projected to grow rapidly in the upcoming years. Despite many existing developments in the application space, there has been a lack of hardware-level exploitation of the user’s emotions. In this paper, we propose a deep collaboration between user’s affects and the hardware system management on resource-limited edge devices. Based on classification results from efficient affect classifiers on smartphone devices, novel real-time management schemes for memory, and video processing are proposed to improve the energy efficiency of mobile devices. Case studies on H.264 / AVC video playback and Android smartphone usages are provided showing significant power saving of up to 23% and reduction of memory loading of up to 17% using the proposed affect adaptive architecture and system management schemes.

Hierarchical memory-constrained operator scheduling of neural architecture search networks

Zihan Wang
Chengcheng Wan
Yuting Chen
Ziyi Lin
He Jiang
Lei Qiao

Neural Architecture Search (NAS) is widely used in industry, searching for neural networks meeting task requirements. Meanwhile, it faces a challenge in scheduling networks satisfying memory constraints. This paper proposes HMCOS that performs hierarchical memory-constrained operator scheduling of NAS networks: given a network, HMCOS constructs a hierarchical computation graph and employs an iterative scheduling algorithm to progressively reduce peak memory footprints. We evaluate HMCOS against RPO and Serenity (two popular scheduling techniques). The results show that HMCOS outperforms existing techniques in supporting more NAS networks, reducing 8.7~42.4% of peak memory footprints, and achieving 137–283x of speedups in scheduling.

MIME: adapting a single neural network for multi-task inference with memory-efficient dynamic pruning

Abhiroop Bhattacharjee
Yeshwanth Venkatesha
Abhishek Moitra
Priyadarshini Panda

Recent years have seen a paradigm shift towards multi-task learning. This calls for memory and energy-efficient solutions for inference in a multi-task scenario. We propose an algorithm-hardware co-design approach called MIME. MIME reuses the weight parameters of a trained parent task and learns task-specific threshold parameters for inference on multiple child tasks. We find that MIME results in highly memory-efficient DRAM storage of neural-network parameters for multiple tasks compared to conventional multi-task inference. In addition, MIME results in input-dependent dynamic neuronal pruning, thereby enabling energy-efficient inference with higher throughput on a systolic-array hardware. Our experiments with benchmark datasets (child tasks)- CIFAR10, CIFAR100, and Fashion-MNIST, show that MIME achieves ~ 3.48x memory-efficiency and ~ 2.4 – 3.1x energy-savings compared to conventional multi-task inference in Pipelined task mode.

Sniper: cloud-edge collaborative inference scheduling with neural network similarity modeling

Weihong Liu
Jiawei Geng
Zongwei Zhu
Jing Cao
Zirui Lian

The cloud-edge collaborative inference demands scheduling the artificial intelligence (AI) tasks efficiently to the appropriate edge smart device. However, the continuously iterative deep neural networks (DNNs) and heterogeneous devices pose great challenges for inference tasks scheduling. In this paper, we propose a self-update cloud-edge collaborative inference scheduling system (Sniper) with time awareness. At first, considering that similar networks exhibit similar behaviors, we develop a non-invasive performance characterization network (PCN) based on neural network similarity (NNS) to accurately predict the inference time of DNNs. Moreover, PCN and time-based scheduling algorithms can be flexibly combined into the scheduling module of Sniper. Experimental results show that the average relative error of network inference time prediction is about 8.06%. Compared with the traditional method without time awareness, Sniper can reduce the waiting time by 52% on average while achieving a stable increase in throughput.

LPCA: learned MRC profiling based cache allocation for file storage systems

Yibin Gu
Yifan Li
Hua Wang
Li Liu
Ke Zhou
Wei Fang
Gang Hu
Jinhu Liu
Zhuo Cheng

File storage system (FSS) uses multi-caches to accelerate data accesses. Unfortunately, efficient FSS cache allocation remains extremely difficult. First, as the key of cache allocation, existing miss ratio curve (MRC) constructions are limited to LRU. Second, existing techniques are suitable for same-layer caches but not for hierarchical ones.

We present a Learned MRC Profiling based Cache Allocation (LPCA) scheme for FSS. To the best of our knowledge, LPCA is the first to apply machine learning to model MRC under non-LRU, LPCA also explores optimization target for hierarchical caches, in that LPCA can provide universal and efficient cache allocation for FSSs.

Equivalence checking paradigms in quantum circuit design: a case study

Tom Peham
Lukas Burgholzer
Robert Wille

As state-of-the-art quantum computers are capable of running increasingly complex algorithms, the need for automated methods to design and test potential applications rises. Equivalence checking of quantum circuits is an important, yet hardly automated, task in the development of the quantum software stack. Recently, new methods have been proposed that tackle this problem from widely different perspectives. However, there is no established baseline on which to judge current and future progress in equivalence checking of quantum circuits. In order to close this gap, we conduct a detailed case study of two of the most promising equivalence checking methodologies—one based on decision diagrams and one based on the ZX-calculus—and compare their strengths and weaknesses.

Accurate BDD-based unitary operator manipulation for scalable and robust quantum circuit verification

Chun-Yu Wei
Yuan-Hung Tsai
Chiao-Shan Jhang
Jie-Hong R. Jiang

Quantum circuit verification is essential, ensuring that quantum program compilation yields a sequence of primitive unitary operators executable correctly and reliably on a quantum processor. Most prior quantum circuit equivalence checking methods rely on edge-weighted decision diagrams and suffer from scalability and verification accuracy issues. This work overcomes these issues by extending a recent BDD-based algebraic representation of state vectors to support unitary operator manipulation. Experimental results demonstrate the superiority of the new method in scalability and exactness in contrast to the inexactness of prior approaches. Also, our method is much more robust in verifying dissimilar circuits than previous work.

Handling non-unitaries in quantum circuit equivalence checking

Lukas Burgholzer
Robert Wille

Quantum computers are reaching a level where interactions between classical and quantum computations can happen in real-time. This marks the advent of a new, broader class of quantum circuits: dynamic quantum circuits. They offer a broader range of available computing primitives that lead to new challenges for design tasks such as simulation, compilation, and verification. Due to the non-unitary nature of dynamic circuit primitives, most existing techniques and tools for these tasks are no longer applicable in an out-of-the-box fashion. In this work, we discuss the resulting consequences for quantum circuit verification, specifically equivalence checking, and propose two different schemes that eventually allow to treat the involved circuits as if they did not contain non-unitaries at all. As a result, we demonstrate methodically, as well as, experimentally that existing techniques for verifying the equivalence of quantum circuits can be kept applicable for this broader class of circuits.

A bridge-based algorithm for simultaneous primal and dual defects compression on topologically quantum-error-corrected circuits

Wei-Hsiang Tseng
Yao-Wen Chang

Topological quantum error correction (TQEC) using the surface code is among the most promising techniques for fault-tolerant quantum circuits. The required resource of a TQEC circuit can be modeled as a space-time volume of a three-dimensional diagram by describing the defect movement along the time axis. For large-scale complex problems, it is crucial to minimize the space-time volume for a quantum algorithm with a reasonable physical qubit number and computation time. Previous work proposed an automated tool to perform bridge compression on a large-scale TQEC circuit. However, the existing automated bridging compression is only for dual defects and not for primal defects. This paper presents an algorithm to perform bridge compression on primal and dual defects simultaneously. In addition, the automatic compression algorithm performs initialization/measurement simplification and flipping to improve the compression. Compared with the state-of-the-art work, experimental results show that our proposed algorithm can averagely reduce space-time volumes by 47%.

FaSe: fast selective flushing to mitigate contention-based cache timing attacks

Tuo Li
Sri Parameswaran

Caches are widely used to improve performance in modern processors. By carefully evicting cache lines and identifying cache hit/miss time, contention-based cache timing channel attacks can be orchestrated to leak information from the victim process. Existing hardware countermeasures explored cache partitioning and randomization, are either costly, not applicable for the L1 data cache, or are vulnerable to sophisticated attacks. Countermeasures using cache flush exist but are slow since all cache lines have to be evacuated during a cache flush. In this paper, we propose for the first time a hardware/software flush-based countermeasure, called fast selective flushing (FaSe). By utilizing an ISA extension and cache modification, FaSe selectively flushes cache lines and provides a mitigation method with a similar effect to methods using naive flush. FaSe is implemented on RISC-V Rocket Chip and evaluated on Xilinx FPGA running user programs and the Linux OS. Our experiments show that FaSe reduces time overhead by 36% for user programs and 42% for the OS compared to the methods with naive flushing, with less than 1% hardware overhead. Our security test shows FaSe can mitigate target cache timing attacks.

Conditional address propagation: an efficient defense mechanism against transient execution attacks

Peinan Li
Rui Hou
Lutan Zhao
Yifan Zhu
Dan Meng

Speculative execution is a critical technique in modern high performance processors. However, continuously exposed transient execution attacks, including Spectre and Meltdown, disclosed a large attack surface in mispredicted execution. Current state-of-the-art defense strategy blocks all memory accesses that use addresses loaded speculatively. However, propagation of base addresses is common in general applications and we find that more than 60% blocked memory accesses use propagated base rather than offset addresses. Therefore, we propose a novel hardware defense mechanism, named Conditional Address Propagation, to identify safe base addresses through taint tracking and address checking by a History Table. Then, the safe base addresses are allowed to be propagated to retrieve performance. For remaining unsafe addresses, they cannot be propagated for security. We constructed experiments on cycle-accurate Gem5 simulator. Compared to the representative study, STT, our mechanism effectively decreases the performance overhead from 13.27% to 1.92% targeting Spectre-type and 19.66% to 5.23% targeting all-type cache-based transient execution attacks.

Timed speculative attacks exploiting store-to-load forwarding bypassing cache-based countermeasures

Anirban Chakraborty
Nikhilesh Singh
Sarani Bhattacharya
Chester Rebeiro
Debdeep Mukhopadhyay

In this paper, we propose a novel class of speculative attacks, called Timed Speculative Attacks (TSA), that does not depend on the state changes in the cache memory. Instead, it makes use of the timing differences that occur due to store-to-load forwarding. We propose two attack strategies – Fill-and-Forward utilizing correctly speculated loads, and Fill-and-Misdirect using mis-speculated load instructions. While Fill-and-Forward exploits the shared store buffers in a multi-threaded CPU core, the Fill-and-Misdirect approach exploits the influence of rolled back mis-speculated loads on subsequent instructions. As case studies, we demonstrate a covert channel using Fill-and-Forward and key recovery attacks on OpenSSL AES and Romulus-N Authenticated Encryption with Associated Data scheme using Fill-and-Misdirect approach. Finally, we show that TSA is able to subvert popular cache-based countermeasures for transient attacks.

DARPT: defense against remote physical attack based on TDC in multi-tenant scenario

Fan Zhang
Zhiyong Wang
Haoting Shen
Bolin Yang
Qianmei Wu
Kui Ren

With rapidly increasing demands for cloud computing, Field Programmable Gate Array (FPGA) has become popular in cloud datacenters. Although it improves computing performance through flexible hardware acceleration, new security concerns also come along. For example, unavoidable physical leakage from the Power Distribution Network (PDN) can be utilized by attackers to mount remote Side-Channel Attacks (SCA), such as Correlation Power Attacks (CPA). Remote Fault Attacks (FA) can also be successfully presented by malicious tenants in a cloud multi-tenant scenario, posing a significant threat to legal tenants. There are few hardware-based countermeasures to defeat both remote attacks that aforementioned. In this work, we exploit Time-to-Digital Converter (TDC) and propose a novel defense technique called DARPT (Defense Against Remote Physical attack based on TDC) to protect sensitive information from CPA and FA. Specifically, DARPT produces random clock jitters to reduce possible information leakage through the power side-channel and provides an early warning of FA by constantly monitoring the variation of the voltage drop across PDN. In comparison to the fact that 8k traces are enough for a successful CPA on FPGA without DARPT, our experimental results show that up to 800k traces (100 times) are not enough for the same FPGA protected by DARPT. Meanwhile, the TDC-based voltage monitor presents significant readout changes (by 51.82% or larger) under FA with ring oscillators, demonstrating sufficient sensitivities to voltage-drop-based FA.

GNNIE: GNN inference engine with load-balancing and graph-specific caching

Sudipta Mondal
Susmita Dey Manasi
Kishor Kunal
Ramprasath S
Sachin S. Sapatnekar

Graph neural networks (GNN) inferencing involves weighting vertex feature vectors, followed by aggregating weighted vectors over a vertex neighborhood. High and variable sparsity in the input vertex feature vectors, and high sparsity and power-law degree distributions in the adjacency matrix, can lead to (a) unbalanced loads and (b) inefficient random memory accesses. GNNIE ensures load-balancing by splitting features into blocks, proposing a flexible MAC architecture, and employing load (re)distribution. GNNIE’s novel caching scheme bypasses the high costs of random DRAM accesses. GNNIE shows high speedups over CPUs/GPUs; it is faster and runs a broader range of GNNs than existing accelerators.

SALO: an efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences

Guan Shen
Jieru Zhao
Quan Chen
Jingwen Leng
Chao Li
Minyi Guo

The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention w.r.t the sequence length incurs heavy computational and memory burdens, especially for tasks with long sequences. Existing accelerators face performance degradation in these tasks. To this end, we propose SALO to enable hybrid sparse attention mechanisms for long sequences. SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the efficient attention computation. We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.

NN-LUT: neural approximation of non-linear operations for efficient transformer inference

Joonsang Yu
Junki Park
Seongmin Park
Minsoo Kim
Sihwa Lee
Dong Hyun Lee
Jungwook Choi

Non-linear operations such as GELU, Layer normalization, and Soft-max are essential yet costly building blocks of Transformer models. Several prior works simplified these operations with look-up tables or integer computations, but such approximations suffer inferior accuracy or considerable hardware cost with long latency. This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference. Our framework employs a simple neural network as a universal approximator with its structure equivalently transformed into a Look-up table(LUT). The proposed framework called Neural network generated LUT(NN-LUT) can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and latency.

Self adaptive reconfigurable arrays (SARA): learning flexible GEMM accelerator configuration and mapping-space using ML

Ananda Samajdar
Eric Qin
Michael Pellauer
Tushar Krishna

This work demonstrates a scalable reconfigurable accelerator (RA) architecture designed to extract maximum performance and energy efficiency for GEMM workloads. We also present a self-adaptive (SA) unit, which runs a learnt model for one-shot configuration optimization in hardware offloading the software stack thus easing the deployment of the proposed design. We evaluate an instance of the proposed methodology with a 32.768 TOPS reference implementation called SAGAR, that can provide the same mapping flexibility as a compute equivalent distributed system while achieving 3.5X more power efficiency and 3.2X higher compute density demonstrated via architectural and post-layout simulation.

Enabling hard constraints in differentiable neural network and accelerator co-exploration

Deokki Hong
Kanghyun Choi
Hye Yoon Lee
Joonsang Yu
Noseong Park
Youngsok Kim
Jinho Lee

Co-exploration of an optimal neural architecture and its hardware accelerator is an approach of rising interest which addresses the computational cost problem, especially in low-profile systems. The large co-exploration space is often handled by adopting the idea of differentiable neural architecture search. However, despite the superior search efficiency of the differentiable co-exploration, it faces a critical challenge of not being able to systematically satisfy hard constraints such as frame rate. To handle the hard constraint problem of differentiable co-exploration, we propose HDX, which searches for hard-constrained solutions without compromising the global design objectives. By manipulating the gradients in the interest of the given hard constraint, high-quality solutions satisfying the constraint can be obtained.

Heuristic adaptability to input dynamics for SpMM on CPUs

Guohao Dai
Guyue Huang
Shang Yang
Zhongming Yu
Hengrui Zhang
Yufei Ding
Yuan Xie
Huazhong Yang
Yu Wang

Sparse Matrix-Matrix Multiplication (SpMM) has served as fundamental components in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In this paper, we consider the challenge of input dynamics from a novel auto-tuning perspective, while following issues remain to be solved: (1) Orthogonal design principles considering sparsity. Orthogonal design principles for such a sparse problem should be extracted to form different algorithms, and further used for performance tuning. (2) Nontrivial implementations in the algorithm space. Combining orthogonal design principles to create new algorithms needs to tackle with new challenges like thread race handling. (3) Heuristic adaptability to input dynamics. The heuristic adaptability is required to dynamically optimize code for input dynamics.

To tackle these challenges, we first propose a novel three-loop model to extract orthogonal design principles for SpMM on GPUs. The model not only covers previous SpMM designs, but also comes up with new designs absent from previous studies. We propose techniques like conditional reduction to implement algorithms missing in previous studies. We further propose DA-SpMM, a Data-Aware heuristic GPU kernel for SpMM. DA-SpMM adaptively optimizes code considering input dynamics. Extensive experimental results show that, DA-SpMM achieves 1.26X~1.37X speedup compared with the best NVIDIA cuSPARSE algorithm on average, and brings up to 5.59X end-to-end speedup to Graph Neural Networks.

H2H: heterogeneous model to heterogeneous system mapping with computation and communication awareness

Xinyi Zhang
Cong Hao
Peipei Zhou
Alex Jones
Jingtong Hu

The complex nature of real-world problems calls for heterogeneity in both machine learning (ML) models and hardware systems. The heterogeneity in ML models comes from multi-sensor perceiving and multi-task learning, i.e., multi-modality multi-task (MMMT), resulting in diverse deep neural network (DNN) layers and computation patterns. The heterogeneity in systems comes from diverse processing components, as it becomes the prevailing method to integrate multiple dedicated accelerators into one system. Therefore, a new problem emerges: heterogeneous model to heterogeneous system mapping (H2H). While previous mapping algorithms mostly focus on efficient computations, in this work, we argue that it is indispensable to consider computation and communication simultaneously for better system efficiency. We propose a novel H2H mapping algorithm with both computation and communication awareness; by slightly trading computation for communication, the system overall latency and energy consumption can be largely reduced. The superior performance of our work is evaluated based on MAESTRO modeling, demonstrating 15%-74% latency reduction and 23%-64% energy reduction compared with existing computation-prioritized mapping algorithms. Code is publicly available at https://github.com/xyzxinyizhang/H2H.

PARIS and ELSA: an elastic scheduling algorithm for reconfigurable multi-GPU inference servers

Yunseong Kim
Yujeong Choi
Minsoo Rhu

Providing low latency to end-users while maximizing server utilization and system throughput is crucial for cloud ML servers. NVIDIA’s recently announced Ampere GPU architecture provides features to “reconfigure” one large, monolithic GPU into multiple smaller “GPU partitions”. Such feature provides cloud ML service providers the ability to utilize the reconfigurable GPU not only for large-batch training but also for small-batch inference with the potential to achieve high resource utilization. We study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server, presenting a sophisticated partitioning algorithm for reconfigurable GPUs combined with an elastic scheduling algorithm tailored for our heterogeneously partitioned GPU server.

Pursuing more effective graph spectral sparsifiers via approximate trace reduction

Zhiqiang Liu
Wenjian Yu

Spectral graph sparsification aims to find ultra-sparse subgraphs which can preserve spectral properties of original graphs. In this paper, a new spectral criticality metric based on trace reduction is first introduced for identifying spectrally important off-subgraph edges. Then, a physics-inspired truncation strategy and an approach using approximate inverse of Cholesky factor are proposed to compute the approximate trace reduction efficiently. Combining them with the iterative densification scheme in [8] and the strategy of excluding spectrally similar off-subgraph edges in [13], we develop a highly effective graph sparsification algorithm. The proposed method has been validated with various kinds of graphs. Experimental results show that it always produces sparsifiers with remarkably better quality than the state-of-the-art GRASS [8] in same computational cost, enabling more than 40% time reduction for preconditioned iterative equation solver on average. In the applications of power grid transient analysis and spectral graph partitioning, the derived iterative solver shows 3.3X or more advantages on runtime and memory cost, over the approach based on direct sparse solver.

Accelerating nonlinear DC circuit simulation with reinforcement learning

Zhou Jin
Haojie Pei
Yichao Dong
Xiang Jin
Xiao Wu
Wei W. Xing
Dan Niu

DC analysis is the foundation for nonlinear electronic circuit simulation. Pseudo transient analysis (PTA) methods have gained great success among various continuation algorithms. However, PTA tends to be computationally intensive without careful tuning of parameters and proper stepping strategies. In this paper, we harness the latest advancing in machine learning to resolve these challenges simultaneously. Particularly, an active learning is leveraged to provide a fine initial solver environment, in which a TD3-based Reinforcement Learning (RL) is implemented to accelerate the simulation on the fly. The RL agent is strengthen with dual agents, priority sampling, and cooperative learning to enhance its robustness and convergence. The proposed algorithms are implemented in an out-of-the-box SPICElike simulator, which demonstrated a significant speedup: up to 3.1X for the initial stage and 234X for the RL stage.

An efficient yield optimization method for analog circuits via gaussian process classification and varying-sigma sampling

Xiaodong Wang
Changhao Yan
Fan Yang
Dian Zhou
Xuan Zeng

This paper presents an efficient yield optimization method for analog circuits via Gaussian process classification and varying-sigma sampling. To quickly determine the better design, yield estimations are executed at varying sigma of process variations. Instead of regression methods requiring accurate yield values, a Gaussian process classification method is applied to model these preference information of designs with binary comparison results, and the preferential Bayesian optimization framework is implemented to guide the search. Additionally, a multi-fidelity surrogate model is adopted to learn the yield correlation at different sigmas. Compared with the state-of-the-art methods, the proposed method achieves up to 12× speed-up without loss of accuracy.

Partition and place finite element model on wafer-scale engine

Jinwei Liu
Xiaopeng Zhang
Shiju Lin
Xinshi Zang
Jingsong Chen
Bentian Jiang
Martin D. F. Wong
Evangeline F. Y. Young

The finite element method (FEM) is a well-known technique for approximately solving partial differential equations and it finds application in various engineering disciplines. The recently introduced wafer-scale engine (WSE) has shown the potential to accelerate FEM by up to 10,000×. However, accelerating FEM to the full potential of a WSE is non-trivial. Thus, in this work, we propose a partitioning algorithm to partition a 3D finite element model into tiles. The tiles can be thought of as a special netlist and are placed onto the 2D array of a WSE by our placement algorithm. Compared to the best-known approach, our partitioning has around 5% higher accuracy, and our placement algorithm can produce around 11% shorter wirelength (L_1.5-normalized) on average.

CNN-inspired analytical global placement for large-scale heterogeneous FPGAs

Huimin Wang
Xingyu Tong
Chenyue Ma
Runming Shi
Jianli Chen
Kun Wang
Jun Yu
Yao-Wen Chang

The fast-growing capacity and complexity are challenging for FPGA global placement. Besides, while many recent studies have focused on the eDensity-based placement as its great efficiency and quality, they suffer from redundant frequency translation. This paper presents a CNN-inspired analytical placement algorithm to effectively handle the redundant frequency translation problem for large-scale FPGAs. Specifically, we compute the density penalty by a fully-connected propagation and gradient to a discrete differential convolution backward. With the FPGA heterogeneity, vectorization plays a vital role in self-adjusting the density penalty factor and the learning rate. In addition, a pseudo net model is used to further optimize the site constraints by establishing connections between blocks and their nearest available regions. Finally, we formulate a refined objective function and a degree-specific gradient preconditioning to achieve a robust, high-quality solution. Experimental results show that our algorithm achieves an 8% reduction on HPWL and 15% less global placement runtime on average over leading commercial tools.

High-performance placement for large-scale heterogeneous FPGAs with clock constraints

Ziran Zhu
Yangjie Mei
Zijun Li
Jingwen Lin
Jianli Chen
Jun Yang
Yao-Wen Chang

With the increasing complexity of the field-programmable gate array (FPGA) architecture, heterogeneity and clock constraints have greatly challenged FPGA placement. In this paper, we present a high-performance placement algorithm for large-scale heterogeneous FPGAs with clock constraints. We first propose a connectivity-aware and type-balanced clustering method to construct the hierarchy and improve the scalability. In each hierarchy level, we develop a novel hybrid penalty and augmented Lagrangian method to formulate the heterogeneous and clock-aware placement as a sequence of unconstrained optimization subproblems and adopt the Adam method to solve each unconstrained optimization subproblem. Then, we present a matching-based IP blocks legalization to legalize the RAMs and DSPs, and a multi-stage packing technique is proposed to cluster FFs and LUTs into HCLBs. Finally, history-based legalization is developed to legalize CLBs in an FPGA. Based on the ISPD 2017 clock-aware FPGA placement contest benchmarks, experimental results show that our algorithm achieves the smallest routed wirelength for all the benchmarks among all published works in a reasonable runtime.

Multi-electrostatic FPGA placement considering SLICEL-SLICEM heterogeneity and clock feasibility

Jing Mai
Yibai Meng
Zhixiong Di
Yibo Lin

Modern field-programmable gate arrays (FPGAs) contain heterogeneous resources, including CLB, DSP, BRAM, IO, etc. A Configurable Logic Block (CLB) slice is further categorized to SLICEL and SLICEM, which can be configured as specific combinations of instances in {LUT, FF, distributed RAM, SHIFT, CARRY}. Such kind of heterogeneity challenges the existing FPGA placement algorithms. Meanwhile, limited clock routing resources also lead to complicated clock constraints, causing difficulties in achieving clock feasible placement solutions. In this work, we propose a heterogeneous FPGA placement framework considering SLICEL-SLICEM heterogeneity and clock feasibility based on a multi-electrostatic formulation. We support a comprehensive set of the aforementioned instance types with a uniform algorithm for wirelength, routability, and clock optimization. Experimental results on both academic and industrial benchmarks demonstrate that we outperform the state-of-the-art placers in both quality and efficiency.

QOC: quantum on-chip training with parameter shift and gradient pruning

Hanrui Wang
Zirui Li
Jiaqi Gu
Yongshan Ding
David Z. Pan
Song Han

Parameterized Quantum Circuits (PQC) are drawing increasing research interest thanks to its potential to achieve quantum advantages on near-term Noisy Intermediate Scale Quantum (NISQ) hardware. In order to achieve scalable PQC learning, the training process needs to be offloaded to real quantum machines instead of using exponential-cost classical simulators. One common approach to obtain PQC gradients is parameter shift whose cost scales linearly with the number of qubits. We present QOC, the first experimental demonstration of practical on-chip PQC training with parameter shift. Nevertheless, we find that due to the significant quantum errors (noises) on real machines, gradients obtained from naïve parameter shift have low fidelity and thus degrading the training accuracy. To this end, we further propose probabilistic gradient pruning to firstly identify gradients with potentially large errors and then remove them. Specifically, small gradients have larger relative errors than large ones, thus having a higher probability to be pruned. We perform extensive experiments with the Quantum Neural Network (QNN) benchmarks on 5 classification tasks using 5 real quantum machines. The results demonstrate that our on-chip training achieves over 90% and 60% accuracy for 2-class and 4-class image classification tasks. The probabilistic gradient pruning brings up to 7% PQC accuracy improvements over no pruning. Overall, we successfully obtain similar on-chip training accuracy compared with noise-free simulation but have much better training scalability. The QOC code is available in the TorchQuantum library.

Memory-efficient training of binarized neural networks on the edge

Mikail Yayla
Jian-Jia Chen

A visionary computing paradigm is to train resource efficient neural networks on the edge using dedicated low-power accelerators instead of cloud infrastructures, eliminating communication overheads and privacy concerns. One promising resource-efficient approach for inference is binarized neural networks (BNNs), which binarize parameters and activations. However, training BNNs remains resource demanding. State-of-the-art BNN training methods, such as the binary optimizer (Bop), require to store and update a large number of momentum values in the floating point (FP) format.

In this work, we focus on memory-efficient FP encodings for the momentum values in Bop. To achieve this, we first investigate the impact of arbitrary FP encodings. When the FP format is not properly chosen, we prove that the updates of the momentum values can be lost and the quality of training is therefore dropped. With the insights, we formulate a metric to determine the number of unchanged momentum values in a training iteration due to the FP encoding. Based on the metric, we develop an algorithm to find FP encodings that are more memory-efficient than the standard FP encodings. In our experiments, the memory usage in BNN training is decreased by factors 2.47x, 2.43x, 2.04x, depending on the BNN model, with minimal accuracy cost (smaller than 1%) compared to using 32-bit FP encoding.

DeepGate: learning neural representations of logic gates

Min Li
Sadaf Khan
Zhengyuan Shi
Naixing Wang
Huang Yu
Qiang Xu

Applying deep learning (DL) techniques in the electronic design automation (EDA) field has become a trending topic. Most solutions apply well-developed DL models to solve specific EDA problems. While demonstrating promising results, they require careful model tuning for every problem. The fundamental question on “How to obtain a general and effective neural representation of circuits?” has not been answered yet. In this work, we take the first step towards solving this problem. We propose DeepGate, a novel representation learning solution that effectively embeds both logic function and structural information of a circuit as vectors on each gate. Specifically, we propose transforming circuits into unified and-inverter graph format for learning and using signal probabilities as the supervision task in DeepGate. We then introduce a novel graph neural network that uses strong inductive biases in practical circuits as learning priors for signal probability prediction. Our experimental results show the efficacy and generalization capability of DeepGate.

Bipolar vector classifier for fault-tolerant deep neural networks

Suyong Lee
Insu Choi
Joon-Sung Yang

Deep Neural Networks (DNNs) surpass the human-level performance on specific tasks. The outperforming capability accelerate an adoption of DNNs to safety-critical applications such as autonomous vehicles and medical diagnosis. Millions of parameters in DNN requires a high memory capacity. A process technology scaling allows increasing memory density, however, the memory reliability confronts significant reliability issues causing errors in the memory. This can make stored weights in memory erroneous. Studies show that the erroneous weights can cause a significant accuracy loss. This motivates research on fault-tolerant DNN architectures. Despite of these efforts, DNNs are still vulnerable to errors, especially error in DNN classifier. In the worst case, because a classifier in convolutional neural network (CNN) is the last stage determining an input class, a single error in the classifier can cause a significant accuracy drop. To enhance the fault tolerance in CNN, this paper proposes a novel bipolar vector classifier which can be easily integrated with any CNN structures and can be incorporated with other fault tolerance approaches. Experimental results show that the proposed method stably maintains an accuracy with a high bit error rate up to 10⁻³ in the classifier.

HDLock: exploiting privileged encoding to protect hyperdimensional computing models against IP stealing

Shijin Duan
Shaolei Ren
Xiaolin Xu

Hyperdimensional Computing (HDC) is facing infringement issues due to straightforward computations. This work, for the first time, raises a critical vulnerability of HDC — an attacker can reverse engineer the entire model, only requiring the unindexed hypervector memory. To mitigate this attack, we propose a defense strategy, namely HDLock, which significantly increases the reasoning cost of encoding. Specifically, HDLock adds extra feature hypervector combination and permutation in the encoding module. Compared to the standard HDC model, a two-layer-key HDLock can increase the adversarial reasoning complexity by 10 order of magnitudes without inference accuracy loss, with only 21% latency overhead.

Terminator on SkyNet: a practical DVFS attack on DNN hardware IP for UAV object detection

Junge Xu
Bohan Xuan
Anlin Liu
Mo Sun
Fan Zhang
Zeke Wang
Kui Ren

With increasing computation of various applications, dynamic voltage and frequency scaling (DVFS) is gradually deployed on FPGAs. However, its reliability and security haven’t been sufficiently evaluated. In this paper, we present a practical DVFS fault attack targeting at the SkyNet accelerator IP and successfully destroy the detection accuracy. With no knowledge about the internal accelerator structure, our attack can achieve more than 98% detection accuracy loss under ten vulnerable operating point pairs (OPPs). Meanwhile, we explore the local injection with 1 ms duration and next double the intensity which can achieve more than 50% and 74% average accuracy loss respectively.

AL-PA: cross-device profiled side-channel attack using adversarial learning

Pei Cao
Hongyi Zhang
Dawu Gu
Yan Lu
Yidong Yuan

In this paper, we focus on the portability issue in profiled side-channel attacks (SCAs) that arises due to significant device-to-device variations. Device discrepancy is inevitable in realistic attacks, but it is often neglected in research works. In this paper, we identify such device variations and take a further step towards leveraging the transferability of neural networks. We propose a novel adversarial learning-based profiled attack (AL-PA), which enables our neural network to learn device-invariant features. We evaluated our strategy on eight XMEGA microcontrollers. Without the need for target-specific preprocessing and multiple profiling devices, our approach has outperformed the state-of-the-art methods.

DETERRENT: detecting trojans using reinforcement learning

Vasudev Gohil
Satwik Patnaik
Hao Guo
Dileep Kalathil
Jeyavijayan (JV) Rajendran

Insertion of hardware Trojans (HTs) in integrated circuits is a pernicious threat. Since HTs are activated under rare trigger conditions, detecting them using random logic simulations is infeasible. In this work, we design a reinforcement learning (RL) agent that circumvents the exponential search space and returns a minimal set of patterns that is most likely to detect HTs. Experimental results on a variety of benchmarks demonstrate the efficacy and scalability of our RL agent, which obtains a significant reduction (169×) in the number of test patterns required while maintaining or improving coverage (95.75%) compared to the state-of-the-art techniques.

Exploiting data locality in memory for ORAM to reduce memory access overheads

Jinxi Kuang
Minghua Shen
Yutong Lu
Nong Xiao

This paper proposes a locality-aware Oblivious RAM (ORAM) primitive, named Green ORAM, which exploits spatial locality of data in the physical memory for reducing ORAM overheads. The Green ORAM is novel consisting of three policies. The first is row-guided label allocation used for mapping spatial locality onto ORAM tree to reduce the number of memory commands. The second is segment-based path replacement able to improve the data locality within the path in the ORAM tree in order to remove the redundant memory accesses. The third is multi-path write-back able to improve the data locality between different paths in order to obtain theoretical best stash hit rate. Notably, the Green ORAM still maintains the security as we analyzed. Experimental results show that Green ORAM achieves a 28.72% access latency reduction, and a 19.06% memory energy consumption reduction on average, compared with the state-of-the-art String ORAM.

HWST128: complete memory safety accelerator on RISC-V with metadata compression

Hsu-Kang Dow
Tuo Li
Sri Parameswaran

Memory safety is paramount for secure systems. Pointer-based memory safety relies on additional information (metadata) to check validity when a pointer is dereferenced. Such operations on the metadata introduce significant performance overhead to the system. This paper presents HWST128, a system to reduce performance overhead by using hardware/software co-design. As a result, the system described achieves spatial and temporal safety by utilizing microarchitecture support, pointer analysis from the compiler, and metadata compression. HWST128 is the first complete solution for memory safety (spatial and temporal) on RISC-V. The system is implemented and tested on a Xilinx ZCU102 FPGA board with 1536 LUTs (+4.11%) and 112 FFs (+0.66%) on top of a Rocket Chip processor. HWST128 is 3.74× faster than the equivalent software-based safety system in the SPEC2006 benchmark suite while providing similar or better security coverage for the Juliet test suite.

RegVault: hardware assisted selective data randomization for operating system kernels

Jinyan Xu
Haoran Lin
Ziqi Yuan
Wenbo Shen
Yajin Zhou
Rui Chang
Lei Wu
Kui Ren

This paper presents RegVault, a hardware-assisted lightweight data randomization scheme for OS kernels. RegVault introduces novel cryptographically strong hardware primitives to protect both the confidentiality and integrity of register-grained data. RegVault leverages annotations to mark sensitive data and instruments their loads and stores automatically. Moreover, RegVault also introduces new techniques to protect the interrupt context and safeguard the sensitive data spilling. We implement a prototype of RegVault by extending RISC-V architecture to protect six types of sensitive data in Linux kernel. Our evaluations show that RegVault can defend against the kernel data attacks effectively with a minimal performance overhead.

ASAP: reconciling asynchronous real-time operations and proofs of execution in simple embedded systems

Adam Caulfield
Norrathep Rattanavipanon
Ivan De Oliveira Nunes

Embedded devices are increasingly ubiquitous and their importance is hard to overestimate. While they often support safety-critical functions (e.g., in medical devices and sensor-alarm combinations), they are usually implemented under strict cost/energy budgets, using low-end microcontroller units (MCUs) that lack sophisticated security mechanisms. Motivated by this issue, recent work developed architectures capable of generating Proofs of Execution (PoX) for the correct/expected software in potentially compromised low-end MCUs. In practice, this capability can be leveraged to provide “integrity from birth” to sensor data, by binding the sensed results/outputs to an unforgeable cryptographic proof of execution of the expected sensing process. Despite this significant progress, current PoX schemes for low-end MCUs ignore the real-time needs of many applications. In particular, security of current PoX schemes precludes any interrupts during the execution being proved. We argue that lack of asynchronous capabilities (i.e., interrupts within PoX) can obscure PoX usefulness, as several applications require processing real-time and asynchronous events. To bridge this gap, we propose, implement, and evaluate an Architecture for Secure Asynchronous Processing in PoX (ASAP). ASAP is secure under full software compromise, enables asynchronous PoX, and incurs less hardware overhead than prior work.

Towards a formally verified hardware root-of-trust for data-oblivious computing

Lucas Deutschmann
Johannes Müller
Mohammad R. Fadiheh
Dominik Stoffel
Wolfgang Kunz

The importance of preventing microarchitectural timing side channels in security-critical applications has surged immensely over the last several years. Constant-time programming has emerged as a best-practice technique to prevent leaking out secret information through timing. It builds on the assumption that certain basic machine instructions execute timing-independently w.r.t. their input data. However, whether an instruction fulfills this data-independent timing criterion varies strongly from architecture to architecture.

In this paper, we propose a novel methodology to formally verify data-oblivious behavior in hardware using standard property checking techniques. Each successfully verified instruction represents a trusted hardware primitive for developing data-oblivious algorithms. A counterexample, on the other hand, represents a restriction that must be communicated to the software developer. We evaluate the proposed methodology in multiple case studies, ranging from small arithmetic units to medium-sized processors. One case study uncovered a data-dependent timing violation in the extensively verified and highly secure Ibex RISC-V core.

A scalable SIMD RISC-V based processor with customized vector extensions for CRYSTALS-kyber

Huimin Li
Nele Mentens
Stjepan Picek

This paper uses RISC-V vector extensions to speed up lattice-based operations in architectures based on HW/SW co-design. We analyze the structure of the number-theoretic transform (NTT), inverse NTT (INTT), and coefficient-wise multiplication (CWM) in CRYSTALS-Kyber, a lattice-based key encapsulation mechanism. We propose 12 vector extensions for CRYSTALS-Kyber multiplication and four for finite field operations in combination with two optimizations of the HW/SW interface. This results in a speed-up of 141.7, 168.7, and 245.5 times for NTT, INTT, and CWM, respectively, compared with the baseline implementation, and a speed-up of over four times compared with the state-of-the-art HW/SW co-design using RV32IMC.

Hexagons are the bestagons: design automation for silicon dangling bond logic

Marcel Walter
Samuel Sze Hang Ng
Konrad Walus
Robert Wille

Field-coupled Nanocomputing (FCN) defines a class of post-CMOS nanotechnologies that promises compact layouts, low power operation, and high clock rates. Recent breakthroughs in the fabrication of Silicon Dangling Bonds (SiDBs) acting as quantum dots enabled the demonstration of a sub-30 nm² OR gate and wire segments. This motivated the research community to invest manual labor in the design of additional gates and whole circuits which, however, is currently severely limited by scalability issues. In this work, these limitations are overcome by the introduction of a design automation framework that establishes a flexible topology based on hexagons as well as a corresponding Bestagon gate library for this technology and, additionally, provides automatic methods for physical design. By this, the first design automation solution for the promising SiDB platform is proposed. In an effort to support open research and open data, the resulting framework and all design files will be made available.

Improving compute in-memory ECC reliability with successive correction

Brian Crafton
Zishen Wan
Samuel Spetalnick
Jong-Hyeok Yoon
Wei Wu
Carlos Tokunaga
Vivek De
Arijit Raychowdhury

Compute in-memory (CIM) is an exciting technique that minimizes data transport, maximizes memory throughput, and performs computation on the bitline of memory sub-arrays. This is especially interesting for machine learning applications, where increased memory bandwidth and analog domain computation offer improved area and energy efficiency. Unfortunately, CIM faces new challenges traditional CMOS architectures have avoided. In this work, we explore the impact of device variation (calibrated with measured data on foundry RRAM arrays) and propose a new class of error correcting codes (ECC) for hard and soft errors in CIM. We demonstrate single, double, and triple error correction offering over 16,000× reduction in bit error rate over a design without ECC and over 427× over prior work, while consuming only 29.1% area and 26.3% power overhead.

Energy efficient data search design and optimization based on a compact ferroelectric FET content addressable memory

Jiahao Cai
Mohsen Imani
Kai Ni
Grace Li Zhang
Bing Li
Ulf Schlichtmann
Cheng Zhuo
Xunzhao Yin

Content Addressable Memory (CAM) is widely used for associative search tasks in advanced machine learning models and data-intensive applications due to the highly parallel pattern matching capability. Most state-of-the-art CAM designs focus on reducing the CAM cell area by exploiting the nonvolatile memories (NVMs). There exists only little research on optimizing the design and energy efficiency of NVM based CAMs for practical deployment in edge devices and AI hardware. In this paper, we propose a general compact and energy efficient CAM design scheme that alleviates the design overhead by employing just one NVM device in the cell. We also propose an adaptive matchline (ML) precharge and discharge scheme that further optimizes the search energy by fully reducing the ML voltage swing. We consider Ferroelectric field effect transistors (FeFETs) as the representative NVM, and present a 2T-1FeFET CAM array including a sense amplifier implementing the proposed ML scheme. Evaluation results suggest that our proposed 2T-1FeFET CAM design achieves 6.64×/4.74×/9.14×/3.02× better energy efficiency compared with CMOS/ReRAM/STT-MRAM/2FeFET CAM arrays. Benchmarking results show that our approach provides 3.3×/2.1× energy-delay product improvement over the 2T-2R/2FeFET CAM in accelerating query processing applications.

CamSkyGate: camouflaged skyrmion gates for protecting ICs

Yuqiao Zhang
Chunli Tang
Peng Li
Ujjwal Guin

Magnetic skyrmion has the potential to become one of the candidates for emerging technologies due to its ultra-high integration density and ultra-low energy. Skyrmion is a magnetic pattern created by transverse current injection in the ferromagnetic (FM) layer. A skyrmion can be generated by localized spin-polarized current and behaves like a stable pseudoparticle. Different logic gates have been proposed, where the presence or absence of a single skyrmion is represented as binary logic 1 or logic 0, respectively. In this paper, we propose novel camouflaged logic gate designs to prevent an adversary from extracting the original netlist. The proposal uses differential doping to block the propagation of the skyrmions to realize the camouflaged gates. To the best of our knowledge, we are the first to propose camouflaged skyrmion gates to prevent an adversary from performing reverse engineering. We demonstrate the functionality of different camouflaged gates using the mumax³ micromagnetic simulator. We have also evaluated the security of the proposed camouflaged designs using SAT attacks. We show that the same security from the traditional CMOS-based camouflaged circuits can be retained.

GNN-based concentration prediction for random microfluidic mixers

Weiqing Ji
Xingzhuo Guo
Shouan Pan
Tsung-Yi Ho
Ulf Schlichtmann
Hailong Yao

Recent years have witnessed significant advances brought by microfluidic biochips in automating biochemical processing. Accurate preparation of fluid samples with microfluidic mixers is a fundamental step in various biomedical applications, where concentration prediction and generation are critical. Finite element analysis (FEA) is the most commonly used simulation method for accurate concentration prediction of a given biochip design, such as COMSOL. However, the FEA simulation process is time-consuming with poor scalability for large biochip sizes. This paper proposes a new concentration prediction method based on the graph neural networks (GNN), which efficiently and accurately predicts the generated concentration by random microfluidic mixers of different sizes. Experimental results show that compared with the state-of-the-art method, the proposed GNN-based simulation method obtains a reduction of 88% in terms of errors of predicted concentration, which validates the effectiveness of the proposed GNN model.

Designing ML-resilient locking at register-transfer level

Dominik Sisejkovic
Luca Collini
Benjamin Tan
Christian Pilato
Ramesh Karri
Rainer Leupers

Various logic-locking schemes have been proposed to protect hardware from intellectual property piracy and malicious design modifications. Since traditional locking techniques are applied on the gate-level netlist after logic synthesis, they have no semantic knowledge of the design function. Data-driven, machine-learning (ML) attacks can uncover the design flaws within gate-level locking. Recent proposals on register-transfer level (RTL) locking have access to semantic hardware information. We investigate the resilience of ASSURE, a state-of-the-art RTL locking method, against ML attacks. We used the lessons learned to derive two ML-resilient RTL locking schemes built to reinforce ASSURE locking. We developed ML-driven security metrics to evaluate the schemes against an RTL adaptation of the state-of-the-art, ML-based SnapShot attack.

O’clock: lock the clock via clock-gating for SoC IP protection

M Sazadur Rahman
Rui Guo
Hadi M Kamali
Fahim Rahman
Farimah Farahmandi
Mohamed Abdel-Moneum
Mark Tehranipoor

Existing logic locking techniques can prevent IP piracy or tampering. However, they often come at the expense of high overhead and are gradually becoming vulnerable to emerging deobfuscation attacks. To protect SoC IPs, we propose O’Clock, a fully-automated clock-gating-based approach that ‘locks the clock’ to protect IPs in complex SoCs. O’Clock obstructs data/control flows and makes the underlying logic dysfunctional for incorrect keys by manipulating the activity factor of the clock tree. O’Clock has minimal changes to the original design and no change to the IC design flow. Our experimental results show its high resiliency against state-of-the-art de-obfuscation attacks (e.g., oracle-guided SAT, unrolling-/BMC-based SAT, removal, and oracle-less machine learning-based attacks) at negligible power, performance, and area (PPA) overhead.

ALICE: an automatic design flow for eFPGA redaction

Chiara Muscari Tomajoli
Luca Collini
Jitendra Bhandari
Abdul Khader Thalakkattu Moosa
Benjamin Tan
Xifan Tang
Pierre-Emmanuel Gaillardon
Ramesh Karri
Christian Pilato

Fabricating an integrated circuit is becoming unaffordable for many semiconductor design houses. Outsourcing the fabrication to a third-party foundry requires methods to protect the intellectual property of the hardware designs. Designers can rely on embedded reconfigurable devices to completely hide the real functionality of selected design portions unless the configuration string (bitstream) is provided. However, selecting such portions and creating the corresponding reconfigurable fabrics are still open problems. We propose ALICE, a design flow that addresses the EDA challenges of this problem. ALICE partitions the RTL modules between one or more reconfigurable fabrics and the rest of the circuit, automating the generation of the corresponding redacted design.

DELTA: DEsigning a stealthy trigger mechanism for analog hardware trojans and its detection analysis

Nishant Gupta
Mohil Sandip Desai
Mark Wijtvliet
Shubham Rai
Akash Kumar

This paper presents a stealthy triggering mechanism that reduces the dependencies of analog hardware Trojans on the frequent toggling of the software-controlled rare nets. The trigger to activate the Trojan is generated by using a glitch generation circuit and a clock signal, which increases the selectivity and feasibility of the trigger signal. The proposed trigger is able to evade the state-of-the-art run-time detection (R2D2) and Built-In Acceleration Structure (BIAS) schemes. Furthermore, the simulation results show that the proposed trigger circuit incurs a minimal overhead in side-channel footprints in terms of area (29 transistors), delay (less than 1ps in the clock cycle), and power (1μW).

VIPR-PCB: a machine learning based golden-free PCB assurance framework

Aritra Bhattacharyay
Prabuddha Chakraborty
Jonathan Cruz
Swarup Bhunia

Printed circuit boards (PCBs) form an integral part of the electronics life cycle by providing mechanical support and electrical connections to microchips and discrete electronic components. PCBs follow a similar life cycle as microchips and are vulnerable to similar assurance issues. Malicious design alterations, i.e., hardware Trojan attacks, have emerged as a major threat to PCB assurance. Board-level Trojans are extremely challenging to detect due to (1) the lack of golden or reference models in most use cases, (2) potentially unbounded attack space, and (3) the growing complexity of commercial PCB designs. Existing PCB inspection techniques (e.g., optical and electrical) do not scale to large volume and are expensive, time-consuming, and often not reliable in covering diverse Trojan space. To address these issues, in this paper, we present VIPR-PCB, a board-level Trojan detection framework that employs a machine learning (ML) model to learn Trojan signatures in functional and structural space and uses a trained model to discover Trojans in suspect PCB designs with high fidelity. Using extensive evaluation with 10 open-source PCB designs and a wide variety of Trojan instances, we demonstrate that VIPR-PCB can achieve over 98% accuracy and is even capable of detecting Trojans in partially-recovered PCB designs.

CLIMBER: defending phase change memory against inconsistent write attacks

Zhuohui Duan
Haobo Wang
Haikun Liu
Xiaofei Liao
Hai Jin
Yu Zhang
Fubing Mao

Non-volatile Memories (NVMs) usually demonstrate vast endurance variation due to Process Variation (PV). They are vulnerable to an Inconsistent Write Attack (IWA) which reverses the write intensity distribution in two adjacent wear leveling windows. In this paper, we propose CLIMBER, a defense mechanism to neutralize IWA for NVMs. CLIMBER dynamically changes harmful address mappings so that intensive writes to weak cells are still redirected to strong cells. CLIMBER also conceals weak NVM cells from attackers by randomly mapping cold addresses to weak NVM regions. Experimental results show that CLIMBER can reduce maximum page wear rate by 43.2% compared with the state-of-the-art Toss-up Wear Leveling and prolong NVM lifetime from 4.19 years to 7.37 years with trivial performance/hardware overhead.

Rethinking key-value store for byte-addressable optane persistent memory

Sung-Ming Wu
Li-Pin Chang

Optane Persistent Memory (PM) is a pioneering solution to byte-addressable PM for commodity systems. However, the performance of Optane PM is highly workload-sensitive, rendering many prior designs of Key-Value (KV) store inefficient. To cope with this reality, we advocate rethinking KV store design for Optane PM. Our design follows a principle of Single-stream Writing with managed Multi-stream Reading (SWMR): Incoming KV pairs are written to PM through a single write stream and managed by an ordered index in DRAM. Through asynchronously sorting and rewriting large sets of KV pairs, range queries are handled with a managed number of concurrent streams. YCSB results show that our design improved upon existing ones by 116% and 21% for write-only throughput and read-write throughput, respectively.

libcrpm: improving the checkpoint performance of NVM

Feng Ren
Kang Chen
Yongwei Wu

libcrpm is a new programming library to improve the checkpoint performance for applications running in NVM. It proposes the failure-atomic differential checkpointing protocol, which addresses two problems simultaneously that exist in the current NVM-based checkpoint-recovery libraries: (1) high write amplification when page-granularity incremental checkpointing is used, and (2) high persistence costs from excessive memory fence instructions when fine-grained undo-log or copy-on-write is used. Evaluation results show that libcrpm reduces the checkpoint overhead in realistic workloads. For MPI-based parallel applications such as LULESH, the checkpoint overhead of libcrpm is only 44.78% of FTI, an application-level checkpoint-recovery library.

Scalable crash consistency for secure persistent memory

Ming Zhang
Yu Hua
Xuan Li
Hao Xu

Persistent memory (PM) suffers from data security and crash consistency issues due to non-volatility. Counter-mode encryption (CME) and bonsai merkle tree (BMT) have been adopted to ensure data security by using security metadata. The data and its security metadata need to be atomically persisted for correct recovery. To ensure crash consistency, durable transactions have been widely employed. However, the long-time BMT update increases the transaction latency, and the security metadata incur heavy write traffic. This paper presents Secon to ensure SEcurity and crash CONsistency for PM with high performance. Secon leverages a scalable write-through metadata cache to ensure the atomicity of data and its security metadata. To reduce the transaction latency, Secon proposes a transaction-specific epoch persistency model to minimize the ordering constraints. To reduce the amount of PM writes, Secon co-locates counters with log entries and coalesces BMT blocks. Experimental results demonstrate that Secon significantly improves the transaction performance and decreases the write traffic.

Don’t open row: rethinking row buffer policy for improving performance of non-volatile memories

Yongho Lee
Osang Kwon
Seokin Hong

Among the various NVM technologies, phase-change-memory (PCM) has attracted substantial attention as a candidate to replace the DRAM for next-generation memory. However, the characteristics of PCM cause it to have much longer read and write latencies than DRAM. This paper proposes a Write-Around PCM System that addresses this limitation using two novel schemes: Pseudo-Row Activation and Direct Write. Pseudo-Row Activation provides fast row activation for PCM writes by connecting a target row to bitlines, but it does not fetch the data into the row buffer. With the Direct Write scheme, our system allows for writing operations to update the data even if the target row is in the logically closed state.

SMART: on simultaneously marching racetracks to improve the performance of racetrack-based main memory

Xiangjun Peng
Ming-Chang Yang
Ho Ming Tsui
Chi Ngai Leung
Wang Kang

RaceTrack Memory (RTM) is a promising media for modern Main Memory subsystems. However, the “shift-before-access” principle, as the nature of RTM, introduces considerable overheads to the access latency. To obtain more insights for the mitigation of shift overheads, this work characterizes and observes that the access patterns, exhibited by the state-of-the-art RTM-based Main Memory, mismatches with the granularity of shift commands (i.e., a group of RaceTracks called Domain Block Cluster (DBC)). Based on the characterization, we propose a novel mechanism called SMART, which simultaneously and proactively marches all DBCs within a subarray, so that subsequent accesses to other DBCs can be served without additional shift commands. Evaluation results show that, averaged across 15 real-world workloads, SMART significantly outperforms other state-of-the-art proposals of RTM-based Main Memory by at least 1.53X in terms of the total execution time, on two different generations of RTM technologies.

SAPredictor: a simple and accurate self-adaptive predictor for hierarchical hybrid memory system

Yujuan Tan
Wei Chen
Zhulin Ma
Dan Xiao
Zhichao Yan
Duo Liu
Xianzhang Chen

In a hybrid memory system using DRAM as the NVM cache, DRAM and NVM can be accessed in serial or parallel mode. However, we found that using either mode alone will bring access latency and bandwidth problems. In this paper, we integrate these two access modes and design a simple but accurate predictor (called SAPredictor) to help choose the appropriate access mode, thereby avoiding long access latency and bandwidth problems to improve memory performance. Our experiments show that SAPredictor achieves an accuracy rate of up to 97.1% and helps reduce access latency by up to 35.6% at fairly low costs.

AVATAR: an aging- and variation-aware dynamic timing analyzer for application-based DVAFS

Zuodong Zhang
Zizheng Guo
Yibo Lin
Runsheng Wang
Ru Huang

As the timing guardband continues to increase with the continuous technology scaling, better-than-worst-case (BTWC) design has gained more and more attention. BTWC design can improve energy efficiency and/or performance by relaxing the conservative static timing constraints and exploiting the dynamic timing margin. However, to avoid potential reliability hazards, the existing dynamic timing analysis (DTA) tools have to add extra aging and variation guardbands, which are estimated under the worst-case corners of aging and variation. Such guardbanding method introduces unnecessary margin in timing analysis, thus reducing the performance and efficiency gains of BTWC designs. Therefore, in this paper, we propose AVATAR, an aging- and variation-aware dynamic timing analyzer that can perform DTA with the impact of transistor aging and random process variation. We also propose an application-based dynamic-voltage-accuracy-frequency-scaling (DVAFS) design flow based on AVATAR, which can improve energy efficiency by exploiting both dynamic timing slack (DTS) and the intrinsic error tolerance of the application. The results show that a 45.8% performance improvement and 68% power savings can be achieved by exploiting the intrinsic error tolerance. Compared with the conventional flow based on the corner-based DTA, the additional performance improvement of the proposed flow can be up to 14% or the additional power-saving can be up to 20%.

A defect tolerance framework for improving yield

Shiva Shankar Thiagarajan
Suriyaprakash Natarajan
Yiorgos Makris

In the latest technology nodes, there is a growing concern about yield loss due to timing failures and delay degradation resulting from manufacturing complexities. Largely, these process imperfections are fixed using empirical methods such as layout guidelines and process fixes which come late during the design cycle. In this work, we propose a framework for improving the design yield by synthesizing netlists with improved ability to withstand delay variations to reduce yield loss. We advocate a defect tolerant approach during early design stages to synthesize netlists by introducing defect-awareness to EDA synthesis, thereby generating robust netlists that can withstand delays induced by process imperfections. Toward this objective, we present a) a methodology to characterize standard library cells for delay defects to model the robustness of the cell delays, and b) a solution to drive design synthesis using the intelligence from the cell characterization to achieve design robustness to timing errors. We also introduce defect tolerance metrics to quantify the robustness of standard cells to timing variations, which we use to generate defect-aware libraries to guide defect-aware synthesis. Effectiveness of the proposed defect-aware methodology is evaluated on a set of benchmarks implemented in GF 12nm technology using static timing analysis (STA), revealing a 70–80% reduction of yield loss due to timing errors arising from manufacturing defects, with minimum impact on the area, power and no impact on performance.

Winograd convolution: a perspective from fault tolerance

Xinghua Xue
Haitong Huang
Cheng Liu
Tao Luo
Lei Zhang
Ying Wang

Winograd convolution is originally proposed to reduce the computing overhead by converting multiplication in neural network (NN) with addition via linear transformation. Other than the computing efficiency, we observe its great potential in improving NN fault tolerance and evaluate its fault tolerance comprehensively for the first time. Then, we explore the use of fault tolerance of winograd convolution for either fault-tolerant or energy-efficient NN processing. According to our experiments, winograd convolution can be utilized to reduce fault-tolerant design overhead by 27.49% or energy consumption by 7.19% without any accuracy loss compared to that without being aware of the fault tolerance.

Towards resilient analog in-memory deep learning via data layout re-organization

Muhammad Rashedul Haq Rashed
Amro Awad
Sumit Kumar Jha
Rickard Ewetz

Processing in-memory paves the way for neural network inference engines. An arising challenge is to develop the software/hardware interface to automatically compile deep learning models onto in-memory computing platforms. In this paper, we observe that the data layout organization of a deep neural network (DNN) model directly impacts the model’s classification accuracy. This stems from that the resistive parasitics within a crossbar introduces a dependency between the matrix data and the precision of the analog computation. To minimize the impact of the parasitics, we first perform a case study to understand the underlying matrix properties that result in computation with low and high precision, respectively. Next, we propose the XORG framework that performs data layout organization for DNNs deployed on in-memory computing platforms. The data layout organization improves precision by optimizing the weight matrix to crossbar assignments at compile time. The experimental results show that the XORG framework improves precision with up to 3.2X and 31% on the average. When accelerating DNNs using XORG, the write bit-accuracy requirements are relaxed with 1-bit and the robustness to random telegraph noise (RTN) is improved.

SEM-latch: a lost-cost and high-performance latch design for mitigating soft errors in nanoscale CMOS process

Zhong-Li Tang
Chia-Wei Liang
Ming-Hsien Hsiao
Charles H.-P. Wen

Soft errors (primarily single-event transients (SET) and single-event upsets (SEU)) are receiving increased attention due to the increasing prevalence of automotive and biomedical electronics. In recent years, several latch designs have been developed for SEU/SET protection, but each has its own issues regarding timing, area, and power. Therefore, we propose a novel soft-error mitigating latch design, called SEM-Latch, which extends QUATRO and incorporates a speed path whereas embedding a reference voltage generator (RVG) for simultaneously improving timing, area, and power in 45nm CMOS process. SEM-Latch effectively reduces the power, area, and PDAP (product of delay, area, and power) by an average of 1.4%, 12.5%, and 8.7%, respectively, in comparison to a previous latch (HPST) with equivalent SEU protection. Furthermore, in comparison to AMSER-Latch, SEM-Latch reduces area, timing overhead and PDAP by 27.2%, 48.2%, and 60.2%, respectively, to provide 99.9999% particle rejection rate for SET protection.

BlueSeer: AI-driven environment detection via BLE scans

Valentin Poirot
Oliver Harms
Hendric Martens
Olaf Landsiedel

IoT devices rely on environment detection to trigger specific actions, e.g., for headphones to adapt noise cancellation to the surroundings. While phones feature many sensors, from GNSS to cameras, small wearables must rely on the few energy-efficient components they already incorporate. In this paper, we demonstrate that a Bluetooth radio is the only component required to accurately classify environments and present BlueSeer, an environment-detection system that solely relies on received BLE packets and an embedded neural network. BlueSeer achieves an accuracy of up to 84% differentiating between 7 environments on resource-constrained devices, and requires only ~ 12 ms for inference on a 64 MHz microcontroller-unit.

Compressive sensing based asymmetric semantic image compression for resource-constrained IoT system

Yujun Huang
Bin Chen
Jianghui Zhang
Qiu Han
Shu-Tao Xia

The widespread application of Internet-of-Things (IoT) and deep learning have made machine-to-machine semantic communication possible. However, it remains challenging to deploy DNN model on IoT devices, due to their limited computing and storage capacity. In this paper, we propose Compressed Sensing based Asymmetric Semantic Image Compression (CS-ASIC) for resource-constrained IoT systems, which consists of a lightweight front encoder and a deep iterative decoder offloaded at the server. We further consider a task-oriented scenario and optimize CS-ASIC for the semantic recognition tasks. The experiment results demonstrate that CS-ASIC achieves considerable data-semantic rate-distortion trade-off, and low encoding complexity over prevailing codecs.

R2B: high-efficiency and fair I/O scheduling for multi-tenant with differentiated demands

Diansen Sun
Yunpeng Chai
Chaoyang Liu
Weihao Sun
Qingpeng Zhang

Big data applications have differentiated requirements for I/O resources in cloud environments. For instance, data analytic and AI/ML applications usually have periodical burst I/O traffic, and data stream processing and database applications often introduce fluctuating I/O loads based on a guaranteed I/O bandwidth. However, the existing resource isolation model (i.e., RLW) and methods (e.g., Token-bucket, mClock, and cgroup) cannot support the fluctuating I/O load and differentiated I/O demands well, and thus cannot achieve fairness, high resource utilization, and high performance for applications at the same time. In this paper, we propose a novel efficient and fair I/O resource isolation model and method called R2B, which can adapt to the differentiated I/O characteristics and requirements of different applications in a shared resource environment. R2B can simultaneously satisfy the fairness and achieve both high application efficiency and high bandwidth utilization.

This work aims to help the cloud provider achieve higher utilization by shifting the burden to the cloud customers to specify their type of workload.

Fast and scalable human pose estimation using mmWave point cloud

Sizhe An
Umit Y. Ogras

Millimeter-Wave (mmWave) radar can enable high-resolution human pose estimation with low cost and computational requirements. However, mmWave data point cloud, the primary input to processing algorithms, is highly sparse and carries significantly less information than other alternatives such as video frames. Furthermore, the scarce labeled mmWave data impedes the development of machine learning (ML) models that can generalize to unseen scenarios. We propose a fast and scalable human pose estimation (FUSE) framework that combines multi-frame representation and meta-learning to address these challenges. Experimental evaluations show that FUSE adapts to the unseen scenarios 4× faster than current supervised learning approaches and estimates human joint coordinates with about 7 cm mean absolute error.

VWR2A: a very-wide-register reconfigurable-array architecture for low-power embedded devices

Benoît W. Denkinger
Miguel Peón-Quirós
Mario Konijnenburg
David Atienza
Francky Catthoor

Edge-computing requires high-performance energy-efficient embedded systems. Fixed-function or custom accelerators, such as FFT or FIR filter engines, are very efficient at implementing a particular functionality for a given set of constraints. However, they are inflexible when facing application-wide optimizations or functionality upgrades. Conversely, programmable cores offer higher flexibility, but often with a penalty in area, performance, and, above all, energy consumption. In this paper, we propose VWR2A, an architecture that integrates high computational density and low power memory structures (i.e., very-wide registers and scratchpad memories). VWR2A narrows the energy gap with similar or better performance on FFT kernels with respect to an FFT accelerator. Moreover, VWR2A flexibility allows to accelerate multiple kernels, resulting in significant energy savings at the application level.

Alleviating datapath conflicts and design centralization in graph analytics acceleration

Haiyang Lin
Mingyu Yan
Duo Wang
Mo Zou
Fengbin Tu
Xiaochun Ye
Dongrui Fan
Yuan Xie

Previous graph analytics accelerators have achieved great improvement on throughput by alleviating irregular off-chip memory accesses. However, on-chip side datapath conflicts and design centralization have become the critical issues hindering further throughput improvement. In this paper, a general solution, Multiple-stage Decentralized Propagation network (MDP-network), is proposed to address these issues, inspired by the key idea of trading latency for throughput. Besides, a novel High throughput Graph analytics accelerator, HiGraph, is proposed by deploying MDP-network to address each issue in practice. The experiment shows that compared with state-of-the-art accelerator, HiGraph achieves up to 2.2× speedup (1.5× on average) as well as better scalability.

Hyperdimensional hashing: a robust and efficient dynamic hash table

Mike Heddes
Igor Nunes
Tony Givargis
Alexandru Nicolau
Alex Veidenbaum

Most cloud services and distributed applications rely on hashing algorithms that allow dynamic scaling of a robust and efficient hash table. Examples include AWS, Google Cloud and BitTorrent. Consistent and rendezvous hashing are algorithms that minimize key remapping as the hash table resizes. While memory errors in large-scale cloud deployments are common, neither algorithm offers both efficiency and robustness. Hyperdimensional Computing is an emerging computational model that has inherent efficiency, robustness and is well suited for vector or hardware acceleration. We propose Hyperdimensional (HD) hashing and show that it has the efficiency to be deployed in large systems. Moreover, a realistic level of memory errors causes more than 20% mismatches for consistent hashing while HD hashing remains unaffected.

In-situ self-powered intelligent vision system with inference-adaptive energy scheduling for BNN-based always-on perception

Maimaiti Nazhamaiti
Haijin Su
Han Xu
Zheyu Liu
Fei Qiao
Qi Wei
Zidong Du
Xinghua Yang
Li Luo

This paper proposes an in-situ self-powered BNN-based intelligent visual perception system that harvests light energy utilizing the indispensable image sensor itself. The harvested energy is allocated to the low-power BNN computation modules layer by layer, adopting a light-weighted duty-cycling-based energy scheduler. A software-hardware co-design method, which exploits the layer-wise error tolerance of BNN as well as the computing-error and energy consumption characteristics of the computation circuit, is proposed to determine the parameters of the energy scheduler, achieving high energy efficiency for self-powered BNN inference. Simulation results show that with the proposed inference-adaptive energy scheduling method, self-powered MNIST classification task can be performed at a frame rate of 4 fps if the harvesting power is 1μW, while guaranteeing at least 90% inference accuracy using binary LeNet-5 network.

Adaptive window-based sensor attack detection for cyber-physical systems

Lin Zhang
Zifan Wang
Mengyu Liu
Fanxin Kong

Sensor attacks alter sensor readings and spoof Cyber-Physical Systems (CPS) to perform dangerous actions. Existing detection works tend to minimize the detection delay and false alarms at the same time, while there is a clear trade-off between the two metrics. Instead, we argue that attack detection should dynamically balance the two metrics when a physical system is at different states. Along with this argument, we propose an adaptive sensor attack detection system that consists of three components – an adaptive detector, detection deadline estimator, and data logger. It can adapt the detection delay and thus false alarms at run time to meet a varying detection deadline and improve usability (or false alarms). Finally, we implement our detection system and validate it using multiple CPS simulators and a reduced-scale autonomous vehicle testbed.

Design-while-verify: correct-by-construction control learning with verification in the loop

Yixuan Wang
Chao Huang
Zhaoran Wang
Zhilu Wang
Qi Zhu

In the current control design of safety-critical cyber-physical systems, formal verification techniques are typically applied after the controller is designed to evaluate whether the required properties (e.g., safety) are satisfied. However, due to the increasing system complexity and the fundamental hardness of designing a controller with formal guarantees, such an open-loop process of design-then-verify often results in many iterations and fails to provide the necessary guarantees. In this paper, we propose a correct-by-construction control learning framework that integrates the verification into the control design process in a closed-loop manner, i.e., design-while-verify. Specifically, we leverage the verification results (computed reachable set of the system state) to construct feedback metrics for control learning, which measure how likely the current design of control parameters can meet the required reach-avoid property for safety and goal-reaching. We formulate an optimization problem based on such metrics for tuning the controller parameters, and develop an approximated gradient descent algorithm with a difference method to solve the optimization problem and learn the controller. The learned controller is formally guaranteed to meet the required reach-avoid property. By treating verifiability as a first-class objective and effectively leveraging the verification results during the control learning process, our approach can significantly improve the chance of finding a control design with formal property guarantees, demonstrated in a set of experiments that use model-based or neural network based controllers.

GaBAN: a generic and flexibly programmable vector neuro-processor on FPGA

Jiajie Chen
Le Yang
Youhui Zhang

Spiking neural network (SNN) is the main computational model of brain-inspired computing and neuroscience, which also acts as the bridge between them. With the rapid development of neuroscience, accurate and flexible SNN simulation with high performance is becoming important. This paper proposes GaBAN, a generic and flexibly programmable neuro-processor on FPGA. Different from the majority of current designs that realize neural components by custom hardware directly, it is centered on a compact, versatile vector instruction set, which supports multiple-precision vector calculation, indexed-/strided-memory access, and conditional execution to accommodate computational characteristics. By software and hardware co-design, the compiler extracts memory-accesses from SNN programs to generate micro-ops executed by an independent hardware unit; the latter interacts with the computing pipeline through an asynchronous buffering mechanism. Thus memory access delay can fully cover the calculation. Tests show that GaBAN can not only outperform the SOTA ISA-based FPGA solution remarkably but also be comparable with counterparts of the hardware-fixed model on some tasks. Moreover, in end-to-end testing, its simulation performance exceeds that of high-performance X86 processor (1.44–3.0x).

ADEPT: automatic differentiable DEsign of photonic tensor cores

Jiaqi Gu
Hanqing Zhu
Chenghao Feng
Zixuan Jiang
Mingjie Liu
Shuhan Zhang
Ray T. Chen
David Z. Pan

Photonic tensor cores (PTCs) are essential building blocks for optical artificial intelligence (AI) accelerators based on programmable photonic integrated circuits. PTCs can achieve ultra-fast and efficient tensor operations for neural network (NN) acceleration. Current PTC designs are either manually constructed or based on matrix decomposition theory, which lacks the adaptability to meet various hardware constraints and device specifications. To our best knowledge, automatic PTC design methodology is still unexplored. It will be promising to move beyond the manual design paradigm and “nurture” photonic neurocomputing with AI and design automation. Therefore, in this work, for the first time, we propose a fully differentiable framework, dubbed ADEPT, that can efficiently search PTC designs adaptive to various circuit footprint constraints and foundry PDKs. Extensive experiments show superior flexibility and effectiveness of the proposed ADEPT framework to explore a large PTC design space. On various NN models and benchmarks, our searched PTC topology outperforms prior manually-designed structures with competitive matrix representability, 2×-30× higher footprint compactness, and better noise robustness, demonstrating a new paradigm in photonic neural chip design. The code of ADEPT is available at link using the TorchONN library.

Unicorn: a multicore neuromorphic processor with flexible fan-in and unconstrained fan-out for neurons

Zhijie Yang
Lei Wang
Yao Wang
Linghui Peng
Xiaofan Chen
Xun Xiao
Yaohua Wang
Weixia Xu

Neuromorphic processor is popular due to its high energy efficiency for spatio-temporal applications. However, when running the spiking neural network (SNN) topologies with the ever-growing scale, existing neuromorphic architectures face challenges due to their restrictions on neuron fan-in and fan-out. This paper proposes Unicorn, a multicore neuromorphic processor with a spike train sliding multicasting mechanism (STSM) and neuron merging mechanism (NMM) to support unconstrained fan-out and flexible fan-in of neurons. Unicorn supports 36K neurons and 45M synapses and thus supports a variety of neuromorphic applications. The peak performance and energy efficiency of Unicorn reach 36TSOPS and 424GSOPS/W respectively. Experimental results show that Unicorn can achieve 2×-5.5× energy reduction over the state-of-the-art neuromorphic processor when running an SNN with a relatively large fan-out and fan-in.

Effective zero compression on ReRAM-based sparse DNN accelerators

Hoon Shin
Rihae Park
Seung Yul Lee
Yeonhong Park
Hyunseung Lee
Jae W. Lee

For efficient DNN inference Resistive RAM (ReRAM) crossbars have emerged as a promising building block to compute matrix multiplication in an area- and power-efficient manner. To improve inference throughput sparse models can be deployed on the ReRAM-based DNN accelerator. While unstructured pruning maintains both high accuracy and high sparsity, it performs poorly on the crossbar architecture due to the irregular locations of pruned weights. Meanwhile, due to the non-ideality of ReRAM cells and the high cost of ADCs, matrix multiplication is usually performed at a fine granularity, called Operation Unit (OU), along both wordline and bitline dimensions. While fine-grained, OU- based row compression (ORC) has recently been proposed to increase weight compression ratio, significant performance potentials are still left on the table due to sub-optimal weight mappings. Thus, we propose a novel weight mapping scheme that effectively clusters zero weights via OU-level filter reordering, hence improving the effective weight compression ratio. We also introduce a weight recovery scheme to further improve accuracy or compression ratio, or both. Our evaluation with three popular DNNs demonstrates that the proposed scheme effectively eliminates redundant weights in the crossbar array and hence ineffectual computation to achieve 3.27–4.26× of array compression ratio with negligible accuracy loss over the baseline ReRAM-based DNN accelerator.

Y-architecture-based flip-chip routing with dynamic programming-based bend minimization

Szu-Ru Nie
Yen-Ting Chen
Yao-Wen Chang

In modern VLSI designs, I/O counts have been growing continuously as the system becomes more complicated. To achieve higher routability, the hexagonal array is introduced with higher pad density and a larger pitch. However, the routing for hexagonal arrays is significantly different from that for traditional gird and staggered arrays. In this paper, we consider the Y-architecture-based flip-chip routing used for the hexagonal array. Unlike the conventional Manhattan and the X-architectures, the Y-architecture allows wires to be routed in three directions, namely, 0-, 60-, and 120-degrees. We first analyze the routing properties of the hexagonal array. Then, we propose a triangular tile model and a chord-based internal node division method that can handle both pre-assignment and free-assignment nets without wire crossing. Finally, we develop a novel dynamic programming-based bend minimization method to reduce the number of routing bends in the final solution. Experimental results show that our algorithm can achieve 100% routability with minimized total wirelength and the number of routing bends effectively.

Towards collaborative intelligence: routability estimation based on decentralized private data

Jingyu Pan
Chen-Chia Chang
Zhiyao Xie
Ang Li
Minxue Tang
Tunhou Zhang
Jiang Hu
Yiran Chen

Applying machine learning (ML) in design flow is a popular trend in Electronic Design Automation (EDA) with various applications from design quality predictions to optimizations. Despite its promise, which has been demonstrated in both academic researches and industrial tools, its effectiveness largely hinges on the availability of a large amount of high-quality training data. In reality, EDA developers have very limited access to the latest design data, which is owned by design companies and mostly confidential. Although one can commission ML model training to a design company, the data of a single company might be still inadequate or biased, especially for small companies. Such data availability problem is becoming the limiting constraint on future growth of ML for chip design. In this work, we propose an Federated-Learning based approach for well-studied ML applications in EDA. Our approach allows an ML model to be collaboratively trained with data from multiple clients but without explicit access to the data for respecting their data privacy. To further strengthen the results, we co-design a customized ML model FLNet and its personalization under the decentralized training scenario. Experiments on a comprehensive dataset show that collaborative training improves accuracy by 11% compared with individual local models, and our customized model FLNet significantly outperforms the best of previous routability estimators in this collaborative training flow.

A2-ILT: GPU accelerated ILT with spatial attention mechanism

Qijing Wang
Bentian Jiang
Martin D. F. Wong
Evangeline F. Y. Young

Inverse lithography technology (ILT) is one of the promising resolution enhancement techniques (RETs) in modern design-for-manufacturing closure, however, it suffers from huge computational overhead and unaffordable mask writing time. In this paper, we propose A2-ILT, a GPU-accelerated ILT framework with spatial attention mechanism. Based on the previous GPU-accelerated ILT flow, we significantly improve the ILT quality by introducing spatial attention map and on-the-fly mask rectilinearization, and strengthen the robustness by Reinforcement-Learning deployment. Experimental results show that, comparing to the state-of-the-art solutions, A2-ILT achieves 5.06% and 11.60% reduction in printing error and process variation band with a lower mask complexity and superior runtime performance.

Generic lithography modeling with dual-band optics-inspired neural networks

Haoyu Yang
Zongyi Li
Kumara Sastry
Saumyadip Mukhopadhyay
Mark Kilgard
Anima Anandkumar
Brucek Khailany
Vivek Singh
Haoxing Ren

Lithography simulation is a critical step in VLSI design and optimization for manufacturability. Existing solutions for highly accurate lithography simulation with rigorous models are computationally expensive and slow, even when equipped with various approximation techniques. Recently, machine learning has provided alternative solutions for lithography simulation tasks such as coarse-grained edge placement error regression and complete contour prediction. However, the impact of these learning-based methods has been limited due to restrictive usage scenarios or low simulation accuracy. To tackle these concerns, we introduce an dual-band optics-inspired neural network design that considers the optical physics underlying lithography. To the best of our knowledge, our approach yields the first published via/metal layer contour simulation at 1nm²/pixel resolution with any tile size. Compared to previous machine learning based solutions, we demonstrate that our framework can be trained much faster and offers a significant improvement on efficiency and image quality with 20× smaller model size. We also achieve 85× simulation speedup over traditional lithography simulator with ~ 1% accuracy loss.

Statistical computing framework and demonstration for in-memory computing systems

Bonan Zhang
Peter Deaville
Naveen Verma

With the increasing importance of data-intensive workloads, such as AI, in-memory computing (IMC) has demonstrated substantial energy/throughput benefits by addressing both compute and data-movement/accessing costs, and holds significant further promise by its ability to leverage emerging forms of highly-scaled memory technologies. However, IMC fundamentally derives its advantages through parallelism, which poses a trade-off with SNR, whereby variations and noise in nanoscaled devices directly limit possible gains. In this work, we propose novel training approaches to improve model tolerance to noise via a contrastive loss function and a progressive training procedure. We further propose a methodology for modeling and calibrating hardware noise, efficiently at the level of a macro operation and through a limited number of hardware measurements. The approaches are demonstrated on a fabricated MRAM-based IMC prototype in 22nm FD-SOI, together with a neural network training framework implemented in PyTorch. For CIFAR-10/100 classifications, model performance is restored to the level of ideal noise-free execution, and generalized performance of the trained model deployed across different chips is demonstrated.

Write or not: programming scheme optimization for RRAM-based neuromorphic computing

Ziqi Meng
Yanan Sun
Weikang Qian

One main fault-tolerant method for a neural network accelerator based on resistive random access memory crossbars is the programming-based method, which is also known as write-and-verify (W-V). In the basic W-V scheme, all devices in crossbars are programmed repeatedly until they are close enough to their targets, which costs huge overhead. To reduce the cost, we optimize the W-V scheme by proposing a probabilistic termination criterion on a single device and a systematic optimization method on multiple devices. Furthermore, we propose a joint algorithm that assists the novel W-V scheme by incremental retraining, which further reduces the W-V cost. Compared to the basic W-V scheme, our proposed method improves the accuracy by 0.23% for ResNet18 on CIFAR10 with only 9.7% W-V cost under variation with σ = 1.2.

ReSMA: accelerating approximate string matching using ReRAM-based content addressable memory

Huize Li
Hai Jin
Long Zheng
Yu Huang
Xiaofei Liao
Zhuohui Duan
Dan Chen
Chuangyi Gui

Approximate string matching (ASM) functions as the basic operation kernel for a large number of string processing applications. Existing Von-Neumann-based ASM accelerators suffer from huge intermediate data with the ever-increasing string data, leading to massive off-chip data transmissions. This paper presents a novel ASM processing-in-memory (PIM) accelerator, namely ReSMA, based on ReCAM- and ReRAM-arrays to eliminate the off-chip data transmissions in ASM. We develop a novel ReCAM-friendly filter-and-filtering algorithm to process the q-grams filtering in ReCAM memory. We also design a new data mapping strategy and a new verification algorithm, which enables computing the edit distances totally in ReRAM crossbars for energy saving. Experimental results show that ReSMA outperforms the CPU-, GPU-, FPGA-, ASIC-, and PIM-based solutions by 268.7×, 38.6×, 20.9×, 707.8×, and 14.7× in terms of performance, and 153.8×, 42.2×, 31.6×, 18.3×, and 5.3× in terms of energy-saving, respectively.

VStore: in-storage graph based vector search accelerator

Shengwen Liang
Ying Wang
Ziming Yuan
Cheng Liu
Huawei Li
Xiaowei Li

Graph-based vector search that finds best matches to user queries based on their semantic similarities using a graph data structure, becomes instrumental in data science and AI application. However, deploying graph-based vector search in production systems requires high accuracy and cost-efficiency with low latency and memory footprint, which existing work fails to offer. We present VStore, a graph-based vector search solution that collaboratively optimizes accuracy, latency, memory, and data movement on large-scale vector data based on in-storage computing. The evaluation shows that VStore exhibits significant search efficiency improvement and energy reduction while attaining accuracy over CPU, GPU, and ZipNN platforms.

Scaled-CBSC: scaled counting-based stochastic computing multiplication for improved accuracy

Shuyuan Yu
Sheldon X.-D. Tan

Stochastic computing (SC) can lead area-efficient implementation of logic designs. Existing SC multiplication, however, suffers a long-standing problem: large multiplication error with small inputs due to its intrinsic nature of bit-stream based computing. In this article, we propose a new scaled counting-based SC multiplication approach, called Scaled-CBSC, to mitigate this issue by introducing scaling bits to ensure the bit ‘1’ density of the stochastic number is sufficiently large. The idea is to convert the “small” inputs to “large” inputs, thus improve the accuracy of SC multiplication. But different from an existing stream-bit based approach, the new method uses the binary format and does not require stochastic addition as the SC multiplication always starts with binary numbers. Furthermore, Scaled-CBSC only requires all the numbers to be larger than 0.5 instead of arbitrary defined threshold, which leads to integer numbers only for the scaling term. The experimental results show that the 8-bit Scaled-CBSC multiplication with 3 scaling bits can achieve up to 46.6% and 30.4% improvements in mean error and standard deviation, respectively; reduce the peak relative error from 100% to 1.8%; and improve 12.6%, 51.5%, 57.6%, 58.4% in delay, area, area-delay product, energy consumption, respectively, over the state of art work. Furthermore, we evaluate the proposed multiplication approach in a discrete cosine transformation (DCT) application. The results show that with 3 scaling bits, 8-bit scaled counting-based SC multiplication can improve the image quality with 5.9dB upon the state of art work in average.

Tailor: removing redundant operations in memristive analog neural network accelerators

Xingchen Li
Zhihang Yuan
Guangyu Sun
Liang Zhao
Zhichao Lu

Analog in-situ computation based on memristive circuits has been regarded as a promising approach for designing high-performance and low-power neural network accelerators. However, despite the low-cost and highly parallel memristive crossbars, the peripheral circuits especially analog-digital-converters (ADCs) induce significant overhead. Quantitative analysis shows that ADCs can contribute up to 91% energy consumption and 72% chip area, which significantly offset the advantages of memristive NN accelerators.

To address this problem, we first mathematically analyze the computation flow in a memristive accelerator, and find that there are many useless operations. These operations significantly increase the demand for peripheral circuits. Then, based on our discovery, we propose a novel architecture, Tailor, which removes these unnecessary operations without accuracy loss. We design two types of Tailor. General Tailor is compatible with most existing memristive accelerators and can be easily applied to them. Customized Tailor is specialized for a certain NN application and can obtain more improvement. Experimental results show that, General Tailor can reduce 14% ~ 20% inference time and 33% ~ 41% energy consumption. Customized Tailor can further achieve 56% ~ 87% higher computation density.

Domain knowledge-infused deep learning for automated analog/radio-frequency circuit parameter optimization

Weidong Cao
Mouhacine Benosman
Xuan Zhang
Rui Ma

The design automation of analog circuits is a longstanding challenge. This paper presents a reinforcement learning method enhanced by graph learning to automate the analog circuit parameter optimization at the pre-layout stage, i.e., finding device parameters to fulfill desired circuit specifications. Unlike all prior methods, our approach is inspired by human experts who rely on domain knowledge of analog circuit design (e.g., circuit topology and couplings between circuit specifications) to tackle the problem. By originally incorporating such key domain knowledge into policy training with a multimodal network, the method best learns the complex relations between circuit parameters and design targets, enabling optimal decisions in the optimization process. Experimental results on exemplary circuits show it achieves human-level design accuracy (~99%) with 1.5× efficiency of existing best-performing methods. Our method also shows better generalization ability to unseen specifications and optimality in circuit performance optimization. Moreover, it applies to design radio-frequency circuits on emerging semiconductor technologies, breaking the limitations of prior learning methods in designing conventional analog circuits.

A cost-efficient fully synthesizable stochastic time-to-digital converter design based on integral nonlinearity scrambling

Qiaochu Zhang
Shiyu Su
Mike Shuo-Wei Chen

Stochastic time-to-digital converters (STDCs) are gaining increasing interest in submicron CMOS analog/mixed-signal design for their superior tolerance to nonlinear quantization levels. However, the large number of required delay units and time comparators for conventional STDC operation incurs excessive implementation costs. This paper presents a fully synthesizable STDC architecture based on an integral non-linearity (INL) scrambling technique, allowing order-of-magnitude cost reduction. The proposed technique randomizes and averages the STDC INL using a digital-to-time converter. Moreover, we propose an associated design automation flow and demonstrate an STDC design in 12nm FinFET process. Post-layout simulations show significant linearity and area/power efficiency improvements compared to prior arts.

Using machine learning to optimize graph execution on NUMA machines

Hiago Mayk G. de A. Rocha
Janaina Schwarzrock
Arthur F. Lorenzon
Antonio Carlos S. Beck

This paper proposes PredG, a Machine Learning framework to enhance the graph processing performance by finding the ideal thread and data mapping on NUMA systems. PredG is agnostic to the input graph: it uses the available graphs’ features to train an ANN to perform predictions as new graphs arrive – without any application execution after being trained. When evaluating PredG over representative graphs and algorithms on three NUMA systems, its solutions are up to 41% faster than the Linux OS Default and the Best Static – on average 2% far from the Oracle -, and it presents lower energy consumption.

HCG: optimizing embedded code generation of simulink with SIMD instruction synthesis

Zhuo Su
Zehong Yu
Dongyan Wang
Yixiao Yang
Yu Jiang
Rui Wang
Wanli Chang
Jiaguang Sun

Simulink is widely used for the model-driven design of embedded systems. It is able to generate optimized embedded control software code through expression folding, variable reuse, etc. However, for some commonly used computing-sensitive models, such as the models for signal processing applications, the efficiency of the generated code is still limited.

In this paper, we propose HCG, an optimized code generator for the Simulink model with SIMD instruction synthesis. It will select the optimal implementations for intensive computing actors based on adaptively pre-calculation of the input scales, and synthesize the appropriate SIMD instructions for batch computing actors based on the iterative dataflow graph mapping. We implemented and evaluated its performance on benchmark Simulink models. Compared to the built-in Simulink Coder and the most recent DFSynth, the code generated by HCG achieves an improvement of 38.9%-92.9% and 41.2%-76.8% in terms of execution time across different architectures and compilers, respectively.

Raven: a novel kernel debugging tool on RISC-V

Hongyi Lu
Fengwei Zhang

Debugging is an essential part of kernel development. However, debugging features are not available on RISC-V without the use of external hardware. In this paper, we leverage a security feature called Physical Memory Protection (PMP) as a debugging primitive to address this issue. Based on this debugging primitive, we design Raven, a novel kernel debugging tool with the standard functionalities (breakpoints, watchpoints, stepping, introspection). A prototype of Raven is implemented on a SiFive Unmatched development board. Our experiments show that Raven imposes a moderate but acceptable overhead to the kernel. Moreover, a real-world debugging scenario is set up to test its effectiveness.

GTuner: tuning DNN computations on GPU via graph attention network

Qi Sun
Xinyun Zhang
Hao Geng
Yuxuan Zhao
Yang Bai
Haisheng Zheng
Bei Yu

It is an open problem to compile DNN models on GPU and improve the performance. A novel framework, GTuner, is proposed to jointly learn from the structures of computational graphs and the statistical features of codes to find the optimal code implementations. A Graph ATtention network (GAT) is designed as the performance estimator in GTuner. In GAT, graph neural layers are used to propagate the information in the graph and a multi-head self-attention module is designed to learn the complicated relationships between the features. Under the guidance of GAT, the GPU codes are generated through auto-tuning. Experimental results demonstrate that our method outperforms the previous arts remarkably.

Pref-X: a framework to reveal data prefetching in commercial in-order cores

Quentin Huppert
Francky Catthoor
Lionel Torres
David Novo

Computer system simulators are major tools used by architecture researchers to develop and evaluate new ideas. Clearly, such evaluations are more conclusive when compared to commercial state-of-the-art architectures. However, the behavior of key components in existing processors is often not disclosed, complicating the construction of faithful reference models. The data prefetching engine is one of such obscured components that can have a significant impact on key metrics such as performance and energy.

In this paper, we propose Pref-X, a framework to analyze functional characteristics of data prefetching in commercial in-order cores. Our framework reveals data prefetches by X-raying into the cache memory at the request granularity, which allows linking memory access patterns with changes in the cache content. To demonstrate the power and accuracy of our methodology, we use Pref-X to replicate the data prefetching mechanisms of two representative processors, namely the Arm Cortex-A7 and the Arm Cortex-A53, with a 99.8% and 96.9% average accuracy, respectively.

Architecting DDR5 DRAM caches for non-volatile memory systems

Xin Xin
Wanyi Zhu
Li Zhao

With the release of Intel’s Optane DIMM, Non-Volatile Memories (NVMs) are emerging as viable alternatives to DRAM memories because of the advantage of higher capacity. However, the higher latency and lower bandwidth of Optane prevent it from outright replacing DRAM. A prevailing strategy is to employ existing DRAM as a data cache for Optane, thereby achieving overall benefit in capacity, bandwidth, and latency.

In this paper, we inspect new features in DDR5 to better support the DRAM cache design for Optane. Specifically, we leverage the two-level ECC scheme, i.e., DIMM ECC and on-die ECC, in DDR5 to construct a narrower channel for tag probing and propose a new operation for fast cache replacement. Experimental results show that our proposed strategy can achieve, on average, 26% performance improvement.

GraphRing: an HMC-ring based graph processing framework with optimized data movement

Zerun Li
Xiaoming Chen
Yinhe Han

Due to the irregular memory access and high bandwidth demanding, graph processing is usually inefficient on conventional computer architectures. The recent development of the processing-in-memory (PIM) technique such as hybrid memory cube (HMC) has provided a feasible design direction for graph processing accelerators. Although PIM provides high internal bandwidth, inter-node memory access is inevitable in large-scale graph processing, which greatly affects the performance. In this paper, we propose an HMC-based graph processing framework, GraphRing. GraphRing is a software-hardware codesign framework that optimizes inter-HMC communication. It contains a regularity- and locality-aware graph execution model and a ring-based multi-HMC architecture. The evaluation results based on 5 graph datasets and 4 graph algorithms show that GraphRing achieves on average 2.14× speedup and 3.07× inter-HMC communication energy saving, compared with GraphQ, a state-of-the-art graph processing architecture.

AxoNN: energy-aware execution of neural network inference on multi-accelerator heterogeneous SoCs

Ismet Dagli
Alexander Cieslewicz
Jedidiah McClurg
Mehmet E. Belviranli

The energy and latency demands of critical workload execution, such as object detection, in embedded systems vary based on the physical system state and other external factors. Many recent mobile and autonomous System-on-Chips (SoC) embed a diverse range of accelerators with unique power and performance characteristics. The execution flow of the critical workloads can be adjusted to span into multiple accelerators so that the trade-off between performance and energy fits to the dynamically changing physical factors.

In this study, we propose running neural network (NN) inference on multiple accelerators of an SoC. Our goal is to enable an energy-performance trade-off with an by distributing layers in a NN between a performance- and a power-efficient accelerator. We first provide an empirical modeling methodology to characterize execution and inter-layer transition times. We then find an optimal layers-to-accelerator mapping by representing the trade-off as a linear programming optimization constraint. We evaluate our approach on the NVIDIA Xavier AGX SoC with commonly used NN models. We use the Z3 SMT solver to find schedules for different energy consumption targets, with up to 98% prediction accuracy.

PIPF-DRAM: processing in precharge-free DRAM

Nezam Rohbani
Mohammad Arman Soleimani
Hamid Sarbazi-Azad

To alleviate costly data communication among processing cores and memory modules, parallel processing-in-memory (PIM) is a promising approach which exploits the huge available internal memory bandwidth. High capacity, wide row size, and maturity of DRAM technology, make DRAM an alluring structure for PIM. However, dense layout, high process variation, and noise vulnerability of DRAMs make it very challenging to apply PIM for DRAMs in practice. This work proposes a PIM structure which eliminates these DRAM limitations, exploiting a precharge-free DRAM (PF-DRAM) structure. The proposed PIM structure, called PIPF-DRAM, performs parallel bitwise operations only by modifying control signal sequences in PF-DRAM, with almost zero structural and circuit modifications. Comparing the state-of-the-art PIM techniques, PIPF-DRAM is 4.2× more robust to process variation, 4.1% faster in average cycle time of operations, and consumes 66.1% less energy.

TAIM: ternary activation in-memory computing hardware with 6T SRAM array

Nameun Kang
Hyungjun Kim
Hyunmyung Oh
Jae-Joon Kim

Recently, various in-memory computing accelerators for low precision neural networks have been proposed. While in-memory Binary Neural Network (BNN) accelerators achieved significant energy efficiency, BNNs show severe accuracy degradation compared to their full precision counterpart models. To mitigate the problem, we propose TAIM, an in-memory computing hardware that can support ternary activation with negligible hardware overhead. In TAIM, a 6T SRAM cell can compute the multiplication between ternary activation and binary weight. Since the 6T SRAM cell consumes no energy when the input activation is 0, the proposed TAIM hardware can achieve even higher energy efficiency compared to BNN case by exploiting input 0’s. We fabricated the proposed TAIM hardware in 28nm CMOS process and evaluated the energy efficiency on various image classification benchmarks. The experimental results show that the proposed TAIM hardware can achieve ~ 3.61× higher energy efficiency on average compared to previous designs which support ternary activation.

PIM-DH: ReRAM-based processing-in-memory architecture for deep hashing acceleration

Fangxin Liu
Wenbo Zhao
Yongbiao Chen
Zongwu Wang
Zhezhi He
Rui Yang
Qidong Tang
Tao Yang
Cheng Zhuo
Li Jiang

Deep hashing has gained growing momentum in large-scale image retrieval. However, deep hashing is computation- and memory-intensive, which demands hardware acceleration. The unique process of hash sequence computation in deep hashing is non-trivial to accelerate due to the lack of an efficient compute primitive for Hamming distance calculation and ranking.

This paper proposes the first PIM-based scheme for deep hashing accelerator, namely PIM-DH. PIM-DH is supported by an algorithm and architecture co-design. The proposed algorithm seeks to compress the hash sequence to increase the retrieval efficiency by exploiting the hash code sparsity without accuracy loss. Further, we design a lightweight circuit to assist CAM to optimize hash computation efficiency. This design leads to an elegant extension of current PIM-based architectures for adapting to various hashing algorithms and arbitrary size of hash sequence induced by pruning. Compared to the state-of-the-art software framework running on Intel Xeon CPU and NVIDIA RTX2080 GPU, PIM-DH achieves an average 4.75E+03 speedup with 4.64E+05 energy reduction over CPU, 2.30E+02 speedup with 3.38E+04 energy reduction over GPU. Compared with PIM architecture CASCADE, PIM-DH can improve computing efficiency by 17.49× and energy efficiency by 41.38×.

YOLoC: deploy large-scale neural network by ROM-based computing-in-memory using residual branch on a chip

Yiming Chen
Guodong Yin
Zhanhong Tan
Mingyen Lee
Zekun Yang
Yongpan Liu
Huazhong Yang
Kaisheng Ma
Xueqing Li

Computing-in-memory (CiM) is a promising technique to achieve high energy efficiency in data-intensive matrix-vector multiplication (MVM) by relieving the memory bottleneck. Unfortunately, due to the limited SRAM capacity, existing SRAM-based CiM needs to reload the weights from DRAM in large-scale networks. This undesired fact weakens the energy efficiency significantly. This work, for the first time, proposes the concept, design, and optimization of computing-in-ROM to achieve much higher on-chip memory capacity, and thus less DRAM access and lower energy consumption. Furthermore, to support different computing scenarios with varying weights, a weight fine-tune technique, namely Residual Branch (ReBranch), is also proposed. ReBranch combines ROM-CiM and assisting SRAM-CiM to achieve high versatility. YOLoC, a ReBranch-assisted ROM-CiM framework for object detection is presented and evaluated. With the same area in 28nm CMOS, YOLoC for several datasets has shown significant energy efficiency improvement by 14.8x for YOLO (DarkNet-19) and 4.8x for ResNet-18, with <8% latency overhead and almost no mean average precision (mAP) loss (−0.5% ~ +0.2%), compared with the fully SRAM-based CiM.

ASTERS: adaptable threshold spike-timing neuromorphic design with twin-column ReRAM synapses

Ziru Li
Qilin Zheng
Bonan Yan
Ru Huang
Bing Li
Yiran Chen

Complex event-driven neuron dynamics was an obstacle to implementing efficient brain-inspired computing architectures with VLSI circuits. To solve this problem and harness the event-driven advantage, we propose ASTERS, a resistive random-access memory (ReRAM) based neuromorphic design to conduct the time-to-first-spike SNN inference. In addition to the fundamental novel axon and neuron circuits, we also propose two techniques through hardware-software co-design: “Multi-Level Firing Threshold Adjustment” to mitigate the impact of ReRAM device process variations, and “Timing Threshold Adjustment” to further speed up the computation. Experimental results show that our cross-layer solution ASTERS achieves more than 34.7% energy savings compared to the existing spiking neuromorphic designs, meanwhile maintaining 90.1% accuracy under the process variations with a 20% standard deviation.

SATO: spiking neural network acceleration via temporal-oriented dataflow and architecture

Fangxin Liu
Wenbo Zhao
Zongwu Wang
Yongbiao Chen
Tao Yang
Zhezhi He
Xiaokang Yang
Li Jiang

Event-driven spiking neural networks (SNNs) have shown great promise for being strikingly energy-efficient. SNN neurons integrate the spikes, accumulate the membrane potential, and fire output spike when the potential exceeds a threshold. Existing SNN accelerators, however, have to carry out such accumulation-comparison operation in serial. Repetitive spike generation at each time step not only increases latency as well as overall energy budget, but also incurs memory access overhead of fetching membrane potentials, both of which lessen the efficiency of SNN accelerators. Meanwhile, inherent highly sparse spikes of SNNs lead to imbalanced workloads among neurons that hurdle the utilization of processing elements (PEs).

This paper proposes SATO, a temporal-parallel SNN accelerator that accumulates the membrane potential for all time steps in parallel. SATO architecture contains a novel binary adder-search tree to generate the output spike train, which decouples the chronological dependence in the accumulation-comparison operation. Moreover, SATO can evenly dispatch the compressed workloads to all PEs with maximized data locality of input spike trains based on a bucket-sort-based method. Our evaluations show that SATO outperforms the previous ANN accelerator 8-bit version of “Eyeriss” by 30.9× in terms of speedup and 12.3×, in terms of energy-saving. Compared with the state-of-the-art SNN accelerator “SpinalFlow”, SATO can also achieve 6.4× performance gain and 4.8× energy reduction, which is quite impressive for inference.

LeHDC: learning-based hyperdimensional computing classifier

Shijin Duan
Yejia Liu
Shaolei Ren
Xiaolin Xu

Thanks to the tiny storage and efficient execution, hyperdimensional Computing (HDC) is emerging as a lightweight learning framework on resource-constrained hardware. Nonetheless, the existing HDC training relies on various heuristic methods, significantly limiting their inference accuracy. In this paper, we propose a new HDC framework, called LeHDC, which leverages a principled learning approach to improve the model accuracy. Concretely, LeHDC maps the existing HDC framework into an equivalent Binary Neural Network architecture, and employs a corresponding training strategy to minimize the training loss. Experimental validation shows that LeHDC outperforms previous HDC training strategies and can improve on average the inference accuracy over 15% compared to the baseline HDC.

GENERIC: highly efficient learning engine on edge using hyperdimensional computing

Behnam Khaleghi
Jaeyoung Kang
Hanyang Xu
Justin Morris
Tajana Rosing

Hyperdimensional Computing (HDC) mimics the brain’s basic principles in performing cognitive tasks by encoding the data to high-dimensional vectors and employing non-complex learning techniques. Conventional processing platforms such as CPUs and GPUs are incapable of taking full advantage of the highly-parallel bit-level operations of HDC. On the other hand, existing HDC encoding techniques do not cover a broad range of applications to make a custom design plausible. In this paper, we first propose a novel encoding that achieves high accuracy for diverse applications. Thereafter, we leverage the proposed encoding and design a highly efficient and flexible ASIC accelerator, dubbed GENERIC, suited for the edge domain. GENERIC supports both classification (train and inference) and clustering for unsupervised learning on edge. Our design is flexible in the input size (hence it can run various applications) and hypervectors dimensionality, allowing it to trade off the accuracy and energy/performance on-demand. We augment GENERIC with application-opportunistic power-gating and voltage over-scaling (thanks to the notable error resiliency of HDC) for further energy reduction. GENERIC encoding improves the prediction accuracy over previous HDC and ML techniques by 3.5% and 6.5%, respectively. At 14 nm technology node, GENERIC occupies an area of 0.30 mm², and consumes 0.09 mW static and 1.97 mW active power. Compared to the previous inference-only accelerator, GENERIC reduces the energy consumption by 4.1×.

Solving traveling salesman problems via a parallel fully connected ising machine

Qichao Tao
Jie Han

Annealing-based Ising machines have shown promising results in solving combinatorial optimization problems. As a typical class of these problems, however, traveling salesman problems (TSPs) are very challenging to solve due to the constraints imposed on the solution. This article proposes a parallel annealing algorithm for a fully connected Ising machine that significantly improves the accuracy and performance in solving constrained combinatorial optimization problems such as the TSP. Unlike previous parallel annealing algorithms, this improved parallel annealing (IPA) algorithm efficiently solves TSPs using an exponential temperature function with a dynamic offset. Compared with digital annealing (DA) and momentum annealing (MA), the IPA reduces the run time by 44.4 times and 19.9 times for a 14-city TSP, respectively. Large scale TSPs can be more efficiently solved by taking a k-medoids clustering approach that decreases the average travel distance of a 22-city TSP by 51.8% compared with DA and by 42.0% compared with MA. This approach groups neighboring cities into clusters to form a reduced TSP, which is then solved in a hierarchical manner by using the IPA algorithm.

PATH: evaluation of boolean logic using path-based in-memory computing

Sven Thijssen
Sumit Kumar Jha
Rickard Ewetz

Processing in-memory breaks von Neumann-based constructs to accelerate data-intensive applications. Noteworthy efforts have been devoted to executing Boolean logic using digital in-memory computing. The limitation of state-of-the-art paradigms is that they heavily rely on repeatedly switching the state of the non-volatile resistive devices using expensive WRITE operations. In this paper, we propose a new in-memory computing paradigm called path-based computing for evaluating Boolean logic. Computation within the paradigm is performed using a one-time expensive compile phase and a fast and efficient evaluation phase. The key property of the paradigm is that the execution phase only involves cheap READ operations. Moreover, a synthesis tool called PATH is proposed to automatically map computation to a single crossbar design. The PATH tool also supports the synthesis of path-based computing systems where the total number of crossbars and the number of inter-crossbar connections are minimized. We evaluate the proposed paradigm using 10 circuits from the RevLib benchmark suite. Compared with state-of-the-art digital in-memory computing paradigms, path-based computing improves energy and latency up to 4.7X and 8.5X, respectively.

A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining

Hongwu Peng
Shaoyi Huang
Shiyang Chen
Bingbing Li
Tong Geng
Ang Li
Weiwen Jiang
Wujie Wen
Jinbo Bi
Hang Liu
Caiwen Ding

Transformers are considered one of the most important deep learning models since 2018, in part because it establishes state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite the remarkable triumphs, the prolonged turnaround time of Transformer models is a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead where inputs need to be zero-padded to the maximum sentence length in the batch to accommodate the parallel computing platforms. This paper targets the field-programmable gate array (FPGA) and proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration. Particularly, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator brings the complexity of attention-based models down to linear complexity and alleviates the off-chip memory traffic. The proposed length-aware resource hardware scheduling algorithm dynamically allocates the hardware resources to fill up the pipeline slots and eliminates bubbles for NLP tasks. Experiments show that our design has very small accuracy loss and has 80.2 × and 2.6 × speedup compared to CPU and GPU implementation, and 4 × higher energy efficiency than state-of-the-art GPU accelerator optimized via CUBLAS GEMM.

HDPG: hyperdimensional policy-based reinforcement learning for continuous control

Yang Ni
Mariam Issa
Danny Abraham
Mahdi Imani
Xunzhao Yin
Mohsen Imani

Traditional robot control or more general continuous control tasks often rely on carefully hand-crafted classic control methods. These models often lack the self-learning adaptability and intelligence to achieve human-level control. On the other hand, recent advancements in Reinforcement Learning (RL) present algorithms that have the capability of human-like learning. The integration of Deep Neural Networks (DNN) and RL thereby enables autonomous learning in robot control tasks. However, DNN-based RL brings both high-quality learning and high computation cost, which is no longer ideal for currently fast-growing edge computing scenarios.

In this paper, we introduce HDPG, a highly-efficient policy-based RL algorithm using Hyperdimensional Computing. Hyperdimensional computing is a lightweight brain-inspired learning methodology; its holistic representation of information leads to a well-defined set of hardware-friendly high-dimensional operations. Our HDPG fully exploits the efficient HDC for high-quality state value approximation and policy gradient update. In our experiments, we use HDPG for robotics tasks with continuous action space and achieve significantly higher rewards than DNN-based RL. Our evaluation also shows that HDPG achieves 4.7× faster and 5.3× higher energy efficiency than DNN-based RL running on embedded FPGA.

CarM: hierarchical episodic memory for continual learning

Soobee Lee
Minindu Weerakoon
Jonghyun Choi
Minjia Zhang
Di Wang
Myeongjae Jeon

Continual Learning (CL) is an emerging machine learning paradigm in mobile or IoT devices that learns from a continuous stream of tasks. To avoid forgetting of knowledge of the previous tasks, episodic memory (EM) methods exploit a subset of the past samples while learning from new data. Despite the promising results, prior studies are mostly simulation-based and unfortunately do not promise to meet an insatiable demand for both EM capacity and system efficiency in practical system setups. We propose CarM, the first CL framework that meets the demand by a novel hierarchical EM management strategy. CarM has EM on high-speed RAMs for system efficiency and exploits the abundant storage to preserve past experiences and alleviate the forgetting by allowing CL to efficiently migrate samples between memory and storage. Extensive evaluations show that our method significantly outperforms popular CL methods while providing high training efficiency.

Shfl-BW: accelerating deep neural network inference with tensor-core aware weight pruning

Guyue Huang
Haoran Li
Minghai Qin
Fei Sun
Yufei Ding
Yuan Xie

Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but struggles to bring practical speedup to the model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both difficult to yield from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves the model quality well but prohibits tensor-core acceleration, while highly-structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss.

In this work, we propose a novel sparse pattern, Shuffled Blockwise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure, while introduces negligible overheads using our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques can achieve the state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer [1] by 1.81, 4.18 and 1.90× on NVIDIA V100, T4 and A100 GPUs respectively at 75% sparsity.

QuiltNet: efficient deep learning inference on multi-chip accelerators using model partitioning

Jongho Park
HyukJun Kwon
Seowoo Kim
Junyoung Lee
Minho Ha
Euicheol Lim
Mohsen Imani
Yeseong Kim

We have seen many successful deployments of deep learning accelerator designs on different platforms and technologies, e.g., FPGA, ASIC, and Processing In-Memory platforms. However, the size of the deep learning models keeps increasing, making computations a burden on the accelerators. A naive approach to resolve this issue is to design larger accelerators; however, it is not scalable due to high resource requirements, e.g., power consumption and off-chip memory sizes. A promising solution is to utilize multiple accelerators and use them as needed, similar to conventional multiprocessing. For example, for smaller networks, we may use a single accelerator, while we may use multiple accelerators with proper network partitioning for larger networks. However, partitioning DNN models into multiple parts leads to large communication overheads due to inter-layer communications. In this paper, we propose a scalable solution to accelerate DNN models on multiple devices by devising a new model partitioning technique. Our technique transforms a DNN model into layer-wise partitioned models using an autoencoder. Since the autoencoder encodes a tensor output into a smaller dimension, we can split the neural network model into multiple pieces while significantly reducing the communication overhead to pipeline them. Our evaluation results conducted on state-of-the-art deep learning models show that the proposed technique significantly improves performance and energy efficiency. Our solution increases performance and energy efficiency by up to 30.5% and 28.4% with minimal accuracy loss as compared to running the same model on pipelined multi-block accelerators without the autoencoder.

Glimpse: mathematical embedding of hardware specification for neural compilation

Byung Hoon Ahn
Sean Kinzer
Hadi Esmaeilzadeh

Success of Deep Neural Networks (DNNs) and their computational intensity has heralded Cambrian explosion of DNN hardware. While hardware design has advanced significantly, optimizing the code for them is still an open challenge. Recent research has moved past traditional compilation techniques and taken a stochastic search algorithmic path that blindly generates rather stochastic samples of the binaries for real hardware measurements to guide the search. This paper opens a new dimension by incorporating the mathematical embedding of the hardware specification of the GPU accelerators dubbed Blueprint to better guide the search algorithm and focus on sub-spaces that have higher potential for yielding higher performance binaries. While various sample efficient yet blind hardware-agnostic techniques have been proposed, none of the state-of-the-art compilers have considered hardware specification as hints to improve the sample efficiency and the search. To mathematically embed the hardware specifications into the search, we devise a Bayesian optimization framework called Glimpse with multiple exclusively unique components. We first use the Blueprint as an input to generate prior distributions of different dimensions in the search space. Then, we devise a light-weight neural acquisition function that takes into account the Blueprint to conform to the hardware specification while balancing the exploration-exploitation trade-off. Finally, we generate an ensemble of predictors from the Blueprint that collectively vote to reject invalid binary samples. We compare Glimpse with hardware-agnostic compilers. Comparison to AutoTVM [3], Chameleon [2], and DGP [16] with multiple generations of GPUs shows that Glimpse provides 6.73×, 1.51×, and 1.92× faster compilation time, respectively, while also achieving the best inference latency.

Bringing source-level debugging frameworks to hardware generators

Keyi Zhang
Zain Asgar
Mark Horowitz

High-level hardware generators have significantly increased the productivity of design engineers. They use software engineering constructs to reduce the repetition required to express complex designs and enable more composability. However, these benefits are undermined by a lack of debugging infrastructure, requiring hardware designers to debug generated, usually incomprehensible, RTL code. This paper describes a framework that connects modern software source-level debugging frameworks to RTL created from hardware generators. Our working prototype offers an Integrated Development Environment (IDE) experience for generators such as RocketChip (Chisel), allowing designers to set breakpoints in complex source code, relate RTL simulation state back to source-level variables, and do forward and backward debugging, with almost no simulation overhead (less than 5%).

Verifying SystemC TLM peripherals using modern C++ symbolic execution tools

Pascal Pieper
Vladimir Herdt
Daniel Große
Rolf Drechsler

In this paper we propose an effective approach for verification of real-world SystemC TLM peripherals using modern C++ symbolic execution tools. We designed a lightweight SystemC peripheral kernel that enables an efficient integration with the modern symbolic execution engine KLEE and acts as a drop-in replacement for the normal SystemC kernel on pre-processed TLM peripherals. The pre-processing step essentially replaces context switches in SystemC threads with normal function calls which can be handled by KLEE. Our experiments, using a publicly available RISC-V specific interrupt controller, demonstrate the scalability and bug hunting effectiveness of our approach.

Formal verification of modular multipliers using symbolic computer algebra and boolean satisfiability

Alireza Mahzoon
Daniel Große
Christoph Scholl
Alexander Konrad
Rolf Drechsler

Modular multipliers are the essential components in cryptography and Residue Number System (RNS) designs. Especially, 2ⁿ – 1 and 2ⁿ + 1 modular multipliers have gained more attention due to their regular structures and a wide variety of applications. However, there is no automated formal verification method to prove the correctness of these multipliers. As a result, bugs might remain undetected after the design phase.

In this paper, we present our modular verifier that combines Symbolic Computer Algebra (SCA) and Boolean Satisfiability (SAT) to prove the correctness of 2ⁿ – 1 and 2ⁿ + 1 modular multipliers. Our verifier takes advantage of three techniques, i.e. coefficient correction, SAT-based local vanishing removal, and SAT-based output condition check to overcome the challenges of SCA-based verification. The efficiency of our verifier is demonstrated using an extensive set of modular multipliers with up to several million gates.

Silicon validation of LUT-based logic-locked IP cores

Gaurav Kolhe
Tyler Sheaves
Kevin Immanuel Gubbi
Tejas Kadale
Setareh Rafatirad
Sai Manoj PD
Avesta Sasan
Hamid Mahmoodi
Houman Homayoun

Modern semiconductor manufacturing often leverages a fabless model in which design and fabrication are partitioned. This has led to a large body of work attempting to secure designs sent to an untrusted third party through obfuscation methods. On the other hand, efficient de-obfuscation attacks have been proposed, such as Boolean Satisfiability attacks (SAT attacks). However, there is a lack of frameworks to validate the security and functionality of obfuscated designs. Additionally, unconventional obfuscated design flows, which vary from one obfuscation to another, have been key impending factors in realizing logic locking as a mainstream approach for securing designs. In this work, we address these two issues for Lookup Table-based obfuscation. We study both Volatile and Non-volatile versions of LUT-based obfuscation and develop a framework to validate SAT runtime using machine learning. We can achieve unparallel SAT-resiliency using LUT-based obfuscation while incurring 7% area and less than 1% power overheads. Following this, we discuss and implement a validation flow for obfuscated designs. We then fabricate a chip consisting of several benchmark designs and a RISC-V CPU in TSMC 65nm for post functionality validation. We show that the design flow and SAT-runtime validation can easily integrate LUT-based obfuscation into existing CAD tools while adding minimal verification overhead. Finally, we justify SAT-resilient LUT-based obfuscation as a promising candidate for securing designs.

Efficient bayesian yield analysis and optimization with active learning

Shuo Yin
Xiang Jin
Linxu Shi
Kang Wang
Wei W. Xing

Yield optimization for circuit design is computationally intensive due to the expensive yield estimation based on Monte Carlo methods and the difficult optimization process. In this work, a uniform framework to solve these problems simultaneously is proposed. Firstly, a novel efficient Bayesian yield analysis framework, BYA, is proposed by deriving a Bayesian estimation for the yield and introducing active learning based on reductions of integral entropy. A tractable convolutional entropy infill technique is then proposed to efficiently solve the entropy reduction problem. Lastly, we extend BYA for yield optimization by transforming knowledge across the design space and variational space. Experimental results based on SRAM and adder circuits show that BYA is 410x faster (in terms of the number of simulations) than standard MC and averagely 10x (up to 10000x) more accurate than the state-of-the-art method for yield estimation, and is about 5x faster than the SOTA yield optimization methods.

Accelerated synthesis of neural network-based barrier certificates using collaborative learning

Jun Xia
Ming Hu
Xin Chen
Mingsong Chen

Most of existing Neural Network (NN)-based barrier certificate synthesis methods cannot deal with high-dimensional continuous systems, since a large quantity of sampled data may easily result in inaccurate initial models coupled with slow convergence rate. To accelerate the synthesis of NN-based barrier certificates, this paper presents an effective two-stage approach named CL-BC, which fully exploits the parallel processing capability of underlying hardware to enable quick search for a barrier certificate. Unlike existing NN-based methods that adopt a random initial model for barrier certificate synthesis, in the first stage CL-BC pre-trains an initial model based on a small subset of sampling data. In this way, an approximate barrier certificate in an NN form can be quickly achieved with little overhead. Based on our proposed collaborative learning scheme, in the second stage CL-BC conducts the parallel learning on partitioned domains, where the learned knowledge from different partitions can be aggregated to accelerate the convergence of a global NN model for barrier certificate synthesis. In this way, the overall synthesis time of an NN-based barrier certificate can be drastically reduced. Experimental results show that our approach can not only drastically reduce barrier synthesis time, but also can synthesize barrier certificates for complex systems that cannot be handled by state-of-the-art.

A timing engine inspired graph neural network model for pre-routing slack prediction

Zizheng Guo
Mingjie Liu
Jiaqi Gu
Shuhan Zhang
David Z. Pan
Yibo Lin

Fast and accurate pre-routing timing prediction is essential for timing-driven placement since repetitive routing and static timing analysis (STA) iterations are expensive and unacceptable. Prior work on timing prediction aims at estimating net delay and slew, lacking the ability to model global timing metrics. In this work, we present a timing engine inspired graph neural network (GNN) to predict arrival time and slack at timing endpoints. We further leverage edge delays as local auxiliary tasks to facilitate model training with increased model performance. Experimental results on real-world open-source designs demonstrate improved model accuracy and explainability when compared with vanilla deep GNN models.

Accurate timing prediction at placement stage with look-ahead RC network

Xu He
Zhiyong Fu
Yao Wang
Chang Liu
Yang Guo

Timing closure is a critical but effort-taking task in VLSI designs. In placement stage, a fast and accurate net delay estimator is highly desirable to guide the timing optimization prior to routing, and thus reduce the timing pessimism and shorten the design turn-around time. To handle the timing uncertainty at the placement stage, we propose a fast net delay timing predictor based on machine learning, which extract the fully timing features using a look-ahead RC network. Experimental results show that the proposed timing predictor has achieved average correlation over 0.99 with the post-routing sign-off timing results obtained in Synopsys PrimeTime.

Timing macro modeling with graph neural networks

Kevin Kai-Chun Chang
Chun-Yao Chiang
Pei-Yu Lee
Iris Hui-Ru Jiang

Due to rapidly growing design complexity, timing macro modeling has been widely adopted to enable hierarchical and parallel timing analysis. The main challenge of timing macro modeling is to identify timing variant pins for achieving high timing accuracy while keeping a compact model size. To tackle this challenge, prior work applied ad-hoc techniques and threshold setting. In this work, we present a novel timing macro modeling approach based on graph neural networks (GNNs). A timing sensitivity metric is proposed to precisely evaluate the influence of each pin on the timing accuracy. Based on the timing sensitivity data and the circuit topology, the GNN model can effectively learn and capture timing variant pins. Experimental results show that our GNN-based framework reduces 10% model sizes while preserving the same timing accuracy as the state-of-the-art. Furthermore, taking common path pessimism removal (CPPR) as an example, the generality and applicability of our framework on various timing analysis models and modes are also validated empirically.

Worst-case dynamic power distribution network noise prediction using convolutional neural network

Xiao Dong
Yufei Chen
Xunzhao Yin
Cheng Zhuo

Worst-case dynamic PDN noise analysis is an essential step in PDN sign-off to ensure the performance and reliability of chips. However, with the growing PDN size and increasing scenarios to be validated, it becomes very time- and resource-consuming to conduct full-stack PDN simulation to check the worst-case noise for different test vectors. Recently, various works have proposed machine learning based methods for supply noise prediction, many of which still suffer from large training overhead, inefficiency, or non-scalability. Thus, this paper proposed an efficient and scalable framework for the worst-case dynamic PDN noise prediction. The framework first reduces the spatial and temporal redundancy in the PDN and input current vector, and then employs efficient feature extraction as well as a novel convolutional neural network architecture to predict the worst-case dynamic PDN noise. Experimental results show that the proposed framework consistently outperforms the commercial tool and the state-of-the-art machine learning method with only 0.63–1.02% mean relative error and 25–69× speedup.

GATSPI: GPU accelerated gate-level simulation for power improvement

Yanqing Zhang
Haoxing Ren
Akshay Sridharan
Brucek Khailany

In this paper, we present GATSPI, a novel GPU accelerated logic gate simulator that enables ultra-fast power estimation for industry-sized ASIC designs with millions of gates. GATSPI is written in PyTorch with custom CUDA kernels for ease of coding and maintainability. It achieves simulation kernel speedup of up to 1668X on a single-GPU system and up to 7412X on a multiple-GPU system when compared to a commercial gate-level simulator running on a single CPU core. GATSPI supports a range of simple to complex cell types from an industry standard cell library and SDF conditional delay statements without requiring prior calibration runs and produces industry-standard SAIF files from delay-aware gate-level simulation. Finally, we deploy GATSPI in a glitch-optimization flow, achieving a 1.4% power saving with a 449X speedup in turnaround time compared to a similar flow using a commercial simulator.

PPATuner: pareto-driven tool parameter auto-tuning in physical design via gaussian process transfer learning

Hao Geng
Qi Xu
Tsung-Yi Ho
Bei Yu

Thanks to the amazing semiconductor scaling, incredible design complexity makes the synthesis-centric very large-scale integration (VLSI) design flow increasingly rely on electronic design automation (EDA) tools. However, invoking EDA tools especially the physical synthesis tool may require several hours or even days for only one possible parameters combination. Even worse, for a new design, oceans of attempts to navigate high quality-of-results (QoR) after physical synthesis have to be made via multiple tool runs with numerous combinations of tunable tool parameters. Additionally, designers often puzzle over simultaneously considering multiple QoR metrics of interest (e.g., delay, power, and area). To tackle the dilemma within finite resource budget, designing a multi-objective parameter auto-tuning framework of the physical design tool which can learn from historical tool configurations and transfer the associated knowledge to new tasks is in demand. In this paper, we propose PPATuner, a Pareto-driven physical design tool parameter tuning methodology, to achieve a good trade-off among multiple QoR metrics of interest (e.g., power, area, delay) at the physical design stage. By incorporating the transfer Gaussian process (GP) model, it can autonomously learn the transfer knowledge from the existing tool parameter combinations. The experimental results on industrial benchmarks under the 7nm technology node demonstrate the merits of our framework.

Efficient maximum data age analysis for cause-effect chains in automotive systems

Ran Bi
Xinbin Liu
Jiankang Ren
Pengfei Wang
Huawei Lv
Guozhen Tan

Automotive systems are often subjected to stringent requirements on the maximum data age of certain cause-effect chains. In this paper, we present an efficient method for formally analyzing maximum data age of cause-effect chains. In particular, we decouple the problem of bounding the maximum data age of a chain into a problem of bounding the releasing interval of successive Last-to-Last data propagation instances in the chain. Owing to the problem decoupling, a relatively tighter data age upper bound can be effectively obtained in polynomial time. Experiments demonstrate that our approach can achieve high precision analysis with lower computational cost.

Optimizing parallel PREM compilation over nested loop structures

Zhao Gu
Rodolfo Pellizzoni

We consider automatic parallelization of a computational kernel executed according to the PRedictable Execution Model (PREM), where each thread is divided into execution and memory phases. We target a scratchpad-based architecture, where memory phases are executed by a dedicated DMA component. We employ data analysis and loop tiling to split the kernel execution into segments, and schedule them based on a DAG representation of data and execution dependencies. Our main observation is that properly selecting tile sizes is key to optimize the makespan of the kernel. We thus propose a heuristic that efficiently searches for optimized tile size and core assignments over deeply nested loops, and demonstrate its applicability and performance compared to the state-of-the-art in PREM compilation using the PolyBench-NN benchmark suite.

Scheduling and analysis of real-time tasks with parallel critical sections

Yang Wang
Xu Jiang
Nan Guan
Mingsong Lv
Dong Ji
Wang Yi

Locks are the most widely used mechanisms to coordinate simultaneous accesses to exclusive shared resources. While locking protocols and associated schedulability analysis techniques have been extensively studied for sequential real-time tasks, work for parallel tasks largely lags behind. In the limited existing work on this topic, a common assumption is that a critical section must execute sequentially. However, this is not necessarily the case with parallel programming languages. In this paper, we study the analysis of parallel heavy real-time tasks (the density of which is greater than 1) with critical sections in parallel structures. We show that applying existing analysis techniques directly could be unsafe or much pessimistic for the considered model, and develop new techniques to address these problems. Comprehensive experiments are conducted to evaluate the performance of our method.

This work was partially supported by the National Natural Science Foundation of China (NSFC 62102072) and Research Grants Council of Hong Kong (GRF 15206221).

BlueScale: a scalable memory architecture for predictable real-time computing on highly integrated SoCs

Zhe Jiang
Kecheng Yang
Neil Audsley
Nathan Fisher
Weisong Shi
Zheng Dong

In real-time embedded computing, time-predictability and performance are required simultaneously by memory transactions. However, with increasingly more elements being integrated into hardware, memory interconnects become a critical stumbling block to satisfying timing correctness, due to lack of hardware and scheduling scalability. In this paper, we propose a new hierarchically distributed memory interconnect, BlueScale, managing memory transactions using identical Scale Elements, which ensures hardware scalability. The Scale Element introduces two nested priority queues, achieving iterative compositional scheduling for memory transactions, guaranteeing transaction tasks’ scheduling schedulability. Associated with the new architecture, a theoretical model is established to improve BlueScale’s real-time performance.

Precise and scalable shared cache contention analysis for WCET estimation

Wei Zhang
Mingsong Lv
Wanli Chang
Lei Ju

Worst-Case Execution Time (WCET) analysis for real-time tasks must precisely predict cache hit/miss of memory accesses. While bringing great performance benefits, multi-core processors significantly complicate the cache analysis problem due to the shared cache contentions among different cores. Existing methods pessimistically consider that memory references of parallel executing tasks will contend with each other as long as they are mapped to the same cache line. However, in reality, numerous shared cache contentions are mutually exclusive, due to the partial orders among the programs executed in parallel. The presence of shared cache contentions greatly exacerbates the computational complexity of the WCET computation, as finding the longest path needs exploring an exponentially large partial ordering space. In this paper, we propose a quantitative method with O(n²) time complexity to precisely estimate the worst-case extra execution time (WCEET) caused by shared cache contentions. The proposed method can be easily integrated into the abstract-interpretation based WCET estimation framework. Experiments with MRTC benchmarks show that our method can averagely tighten the WCET estimation by 13% without sacrificing the analysis efficiency.

Predictable sharing of last-level cache partitions for multi-core safety-critical systems

Zhuanhao Wu
Hiren Patel

Last-level cache (LLC) partitioning is a technique to provide temporal isolation and low worst-case latency (WCL) bounds when cores access the shared LLC in multicore safety-critical systems. A typical approach to cache partitioning involves allocating a separate partition to a distinct core. A central criticism of this approach is its poor utilization of cache storage. Today’s trend of integrating a larger number of cores exacerbates this issue such that we are forced to consider shared LLC partitions for effective deployments. This work presents an approach to share LLC partitions among multiple cores while being able to provide low WCL bounds.

Thermal-aware optical-electrical routing codesign for on-chip signal communications

Yu-Sheng Lu
Kuan-Cheng Chen
Yu-Ling Hsu
Yao-Wen Chang

The optical interconnection is a promising solution for on-chip signal communication in modern system-on-chip (SoC) and heterogeneous integration designs, providing large bandwidth and high-speed transmission with low power consumption. Previous works do not handle two main issues for on-chip optical-electrical (O-E) co-design: the thermal impact during O-E routing and the trade-offs among power consumption, wirelength, and congestion. As a result, the thermal-induced band shift might incur transmission malfunction; the power consumption estimation is inaccurate; thus, only suboptimal results are obtained. To remedy these disadvantages, we present a thermal-aware optical-electrical routing co-design flow to minimize power consumption, thermal impact, and wirelength. Experimental results based on the ISPD 2019 contest benchmarks show that our co-design flow significantly outperforms state-of-the-art works in power consumption, thermal impact, and wire-length.

Power-aware pruning for ultrafast, energy-efficient, and accurate optical neural network design

Naoki Hattori
Yutaka Masuda
Tohru Ishihara
Akihiko Shinya
Masaya Notomi

With the rapid progress of the integrated nanophotonics technology, the optical neural network (ONN) architecture has been widely investigated. Although the ONN inference is fast, conventional densely connected network structures consume large amounts of power in laser sources. We propose a novel ONN design method that finds an ultrafast, energy-efficient, and accurate ONN structure. The key idea is power-aware edge pruning that derives the near-optimal numbers of edges in the entire network. Optoelectronic circuit simulation demonstrates the correct functional behavior of the ONN. Furthermore, experimental evaluations using tensor-flow show the proposed methods achieved 98.28% power reduction without significant loss of accuracy.

REACT: a heterogeneous reconfigurable neural network accelerator with software-configurable NoCs for training and inference on wearables

Mohit Upadhyay
Rohan Juneja
Bo Wang
Jun Zhou
Weng-Fai Wong
Li-Shiuan Peh

On-chip training improves model accuracy on personalised user data and preserves privacy. This work proposes REACT, an AI accelerator for wearables that has heterogeneous cores supporting both training and inference. REACT’s architecture is NoC-centric, with weights, features and gradients distributed across cores, accessed and computed efficiently through software-configurable NoCs. Unlike conventional dynamic NoCs, REACT’s NoCs have no buffer queues, flow control or routing, as they are entirely configured by software for each neural network. REACT’s online learning realises upto 75% accuracy improvement, and is upto 25× faster and 520× more energy-efficient than state-of-the-art accelerators with similar memory and computation footprint.

LHNN: lattice hypergraph neural network for VLSI congestion prediction

Bowen Wang
Guibao Shen
Dong Li
Jianye Hao
Wulong Liu
Yu Huang
Hongzhong Wu
Yibo Lin
Guangyong Chen
Pheng Ann Heng

Precise congestion prediction from a placement solution plays a crucial role in circuit placement. This work proposes the lattice hypergraph (LH-graph), a novel graph formulation for circuits, which preserves netlist data during the whole learning process, and enables the congestion information propagated geometrically and topologically. Based on the formulation, we further developed a heterogeneous graph neural network architecture LHNN, jointing the routing demand regression to support the congestion spot classification. LHNN constantly achieves more than 35% improvements compared with U-nets and Pix2Pix on the F1 score. We expect our work shall highlight essential procedures using machine learning for congestion prediction.

Floorplanning with graph attention

Yiting Liu
Ziyi Ju
Zhengming Li
Mingzhi Dong
Hai Zhou
Jia Wang
Fan Yang
Xuan Zeng
Li Shang

Floorplanning has long been a critical physical design task with high computation complexity. Its key objective is to determine the initial locations of macros and standard cells with optimized wirelength for a given area constraint. This paper presents Flora, a graph attention-based floorplanner to learn an optimized mapping between circuit connectivity and physical wirelength, and produce a chip floorplan using efficient model inference. Flora has been integrated with two state-of-the-art mixed-size placers. Experimental studies using both academic benchmarks and industrial designs demonstrate that compared to state-of-the-art mixed-size placers alone, Flora improves placement runtime by 18%, with 2% wirelength reduction on average.

Xplace: an extremely fast and extensible global placement framework

Lixin Liu
Bangqi Fu
Martin D. F. Wong
Evangeline F. Y. Young

Placement serves as a fundamental step in VLSI physical design. Recently, GPU-based global placer DREAMPlace[1] demonstrated its superiority over CPU-based global placers. In this work, we develop an extremely fast GPU accelerated global placer Xplace which achieves around 2x speedup with better solution quality compared to DREAMPlace. We also plug a novel Fourier neural network into Xplace as an extension to further improve the solution quality. We believe this work not only proposes a new, fast, extensible placement framework but also illustrates a possibility to incorporate a neural network component into a GPU accelerated analytical placer.

Differentiable-timing-driven global placement

Zizheng Guo
Yibo Lin

Placement is critical to the timing closure of the very-large-scale integrated (VLSI) circuit design flow. This paper proposes a differentiable-timing-driven global placement framework inspired by deep neural networks. By establishing the analogy between static timing analysis and neural network propagation, we propose a differentiable timing objective for placement to explicitly optimize timing metrics such as total negative slack (TNS) and worst negative slack (WNS). The framework can achieve at most 32.7% and 59.1% improvements on WNS and TNS respectively compared with the state-of-the-art timing-driven placer, and achieve 1.80× speed-up when both running on GPU.

TAAS: a timing-aware analytical strategy for AQFP-capable placement automation

Peiyan Dong
Yanyue Xie
Hongjia Li
Mengshu Sun
Olivia Chen
Nobuyuki Yoshikawa
Yanzhi Wang

Adiabatic Quantum-Flux-Parametron (AQFP) is a superconducting logic with extremely high energy efficiency. AQFP circuits adopt the deep pipeline structure, where the four-phase AC-power serves as both the energy supply and the clock signal and transfers the data from one clock phase to the next. However, the deep pipeline structure causes the stage delay of the data propagation is comparable to the delay of the zigzag clocking, which triggers timing violations easily. In this paper, we propose a timing-aware analytical strategy for the AQFP placement, TAAS, that immensely reduces timing violations under specific spacing constraints and wirelength constraints of AQFP. TAAS includes two main characteristics: 1) a timing-aware objective function that incorporates a four-phase timing model for the analytical global placement. 2) a unique detailed placement including the timing-aware dynamic programming technique and the time-space cell regularization. To validate the effectiveness of TAAS, various representative circuits are adopted as benchmarks. As shown in the experimental results, our strategy can increase the maximum operating frequency by up to 30% ~ 40% with a negligible wirelength increase -3.41%~1%.

A cross-layer approach to cognitive computing: invited

Gobinda Saha
Cheng Wang
Anand Raghunathan
Kaushik Roy

Remarkable advances in machine learning and artificial intelligence have been made in various domains, achieving near-human performance in a plethora of cognitive tasks including vision, speech and natural language processing. However, implementations of such cognitive algorithms in conventional “von-Neumann” architectures are orders of magnitude more area and power expensive than the biological brain. Therefore, it is imperative to search for fundamentally new approaches so that the improvement in computing performance and efficiency can keep up with the exponential growth of the AI computational demand. In this article, we present a cross-layer approach to the exploration of new paradigms in cognitive computing. This effort spans new learning algorithms inspired from biological information processing principles, network architectures best suited for such algorithms, and neuromorphic hardware substrates such as computing-in-memory fabrics in order to build intelligent machines that can achieve orders of improvement in energy efficiency at cognitive processing. We argue that such cross-layer innovations in cognitive computing are well-poised to enable a new wave of autonomous intelligence across the computing spectrum, from resource-constrained IoT devices to the cloud.

Generative self-supervised learning for gate sizing: invited

Siddhartha Nath
Geraldo Pradipta
Corey Hu
Tian Yang
Brucek Khailany
Haoxing Ren

Self-supervised learning has shown great promise in leveraging large amounts of unlabeled data to achieve higher accuracy than supervised learning methods in many domains. Generative self-supervised learning can generate new data based on the trained data distribution. In this paper, we evaluate the effectiveness of generative self-supervised learning on combinational gate sizing in VLSI designs. We propose a novel use of Transformers for gate sizing when trained on a large dataset generate from a commercial EDA tool. We demonstrate that our trained model can achieve 93% accuracy, 1440X speedup and fast design convergence when compared to a leading commercial EDA tool.

Hammer: a modular and reusable physical design flow tool: invited

Harrison Liew
Daniel Grubb
John Wright
Colin Schmidt
Nayiri Krzysztofowicz
Adam Izraelevitz
Edward Wang
Krste Asanović
Jonathan Bachrach
Borivoje Nikolić

Process technology scaling and hardware architecture specialization have vastly increased the need for chip design space exploration, while optimizing for power, performance, and area. Hammer is an open-source, reusable physical design (PD) flow generator that reduces design effort and increases portability by enforcing a separation among design-, tool-, and process technology-specific concerns with a modular software architecture. In this work, we outline Hammer’s structure and highlight recent extensions that support both physical chip designers and hardware architects evaluating the merit and feasibility of their proposed designs. This is accomplished through the integration of more tools and process technologies—some open-source—and the designer-driven development of flow step generators. An evaluation of chip designs in process technologies ranging from 130nm down to 12nm across a series of RISC-V-based chips shows how Hammer-generated flows are reusable and enable efficient optimization for diverse applications.

mflowgen: a modular flow generator and ecosystem for community-driven physical design: invited

Alex Carsello
James Thomas
Ankita Nayak
Po-Han Chen
Mark Horowitz
Priyanka Raina
Christopher Torng

Achieving high code reuse in physical design flows is challenging but increasingly necessary to build complex systems. Unfortunately, existing approaches based on parameterized Tcl generators support very limited reuse as designers customize flows for specific designs and technologies, preventing their reuse in future flows. We present a vision and framework based on modular flow generators that encapsulates coarse-grained and fine-grained reusable code in modular nodes and assembles them into complete flows. The key feature is a flow consistency and instrumentation layer embedded in Python, which supports mechanisms for rapid and early feedback on inconsistent composition. We evaluate the design flows of successive generations of silicon prototypes built in TSMC16, TSMC28, TSMC40, SKY130, and IBM180 technologies, showing how our approach can enable significant code reuse in future flows.

A distributed approach to silicon compilation: invited

Andreas Olofsson
William Ransohoff
Noah Moroze

Hardware specialization for the long tail of future energy constrained edge applications will require reducing design costs by orders of magnitude. In this work, we take a distributed approach to hardware compilation, with the goal of creating infrastructure that scales to thousands of developers and millions of servers. Technical contributions in this work include (i) a standardized hardware build system manifest, (ii) a light-weight flowgraph based programming model, (iii) a client/server execution model, and (iv) a provenance tracking system for distributed development. These ideas have been reduced to practice in SiliconCompiler, an open source build system that demonstrates an order of magnitude compilation speed up on multiple designs and PDKs compared to single threaded build systems.

Improving GNN-based accelerator design automation with meta learning

Yunsheng Bai
Atefeh Sohrabizadeh
Yizhou Sun
Jason Cong

Recently, there is a growing interest in developing learning-based models as a surrogate of the High-Level Synthesis (HLS) tools, where the key objective is rapid prediction of the quality of a candidate HLS design for automated design space exploration (DSE). Training is usually conducted on a given set of computation kernels (or kernels in short) needed for hardware acceleration. However, the model must also perform well on new kernels. The discrepancy between the training set and new kernels, called domain shift, frequently leads to model accuracy drop which in turn negatively impact the DSE performance. In this paper, we investigate the possibility of adapting an existing meta-learning approach, named MAML, to the task of design quality prediction. Experiments show the MAML-enhanced model outperforms a simple baseline based on fine tuning in terms of both offline evaluation on hold-out test sets and online evaluation for DSE speedup results¹.

Accelerator design with decoupled hardware customizations: benefits and challenges: invited

Debjit Pal
Yi-Hsiang Lai
Shaojie Xiang
Niansong Zhang
Hongzheng Chen
Jeremy Casas
Pasquale Cocchini
Zhenkun Yang
Jin Yang
Louis-Noël Pouchet
Zhiru Zhang

The past decade has witnessed increasing adoption of high-level synthesis (HLS) to implement specialized hardware accelerators targeting either FPGAs or ASICs. However, current HLS programming models entangle algorithm specifications with hardware customization techniques, which lowers both the productivity and portability of the accelerator design. To tackle this problem, recent efforts such as HeteroCL propose to decouple algorithm definition from essential hardware customization techniques in compute, data type, and memory, increasing productivity, portability, and performance.

While the decoupling of the algorithm and customizations provides benefits to the compilation/synthesis process, they also create new hurdles for the programmers to productively debug and validate the correctness of the optimized design. In this work, using HeteroCL and realistic machine learning applications as case studies, we first explain the key advantages of the decoupled programming model brought to a programmer to rapidly develop high-performance accelerators. Using the same case studies, we will further show how seemingly benign usage of the customization primitives can lead to new challenges to verification. We will then outline the research opportunities and discuss some of our recent efforts as the first step to enable a robust and viable verification solution in the future.

ScaleHLS: a scalable high-level synthesis framework with multi-level transformations and optimizations: invited

Hanchen Ye
HyeGang Jun
Hyunmin Jeong
Stephen Neuendorffer
Deming Chen

This paper presents an enhanced version of a scalable HLS (High-Level Synthesis) framework named ScaleHLS, which can compile HLS C/C++ programs and PyTorch models to highly-efficient and synthesizable C++ designs. The original version of ScaleHLS achieved significant speedup on both C/C++ kernels and PyTorch models [14]. In this paper, we first highlight the key features of ScaleHLS on tackling the challenges present in the representation, optimization, and exploration of large-scale HLS designs. To further improve the scalability of ScaleHLS, we then propose an enhanced HLS transform and analysis library supported in both C++ and Python, and a new design space exploration algorithm to handle HLS designs with hierarchical structures more effectively. Comparing to the original ScaleHLS, our enhanced version improves the speedup by up to 60.9× on FPGAs. ScaleHLS is fully open-sourced at https://github.com/hanchenye/scalehls.

The SODA approach: leveraging high-level synthesis for hardware/software co-design and hardware specialization: invited

Nicolas Bohm Agostini
Serena Curzel
Ankur Limaye
Vinay Amatya
Marco Minutoli
Vito Giovanni Castellana
Joseph Manzano
Antonino Tumeo
Fabrizio Ferrandi

Novel “converged” applications combine phases of scientific simulation with data analysis and machine learning. Each computational phase can benefit from specialized accelerators. However, algorithms evolve so quickly that mapping them on existing accelerators is suboptimal or even impossible. This paper presents the SODA (Software Defined Accelerators) framework, a modular, multi-level, open-source, no-human-in-the-loop, hardware synthesizer that enables end-to-end generation of specialized accelerators. SODA is composed of SODA-Opt, a high-level frontend developed in MLIR that interfaces with domain-specific programming frameworks and allows performing system level design, and Bambu, a state-of-the-art high-level synthesis engine that can target different device technologies. The framework implements design space exploration as compiler optimization passes. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of accelerators optimized for the computational patterns of converged applications. We then discuss some of the research opportunities that such a framework allows, including system-level design, profile driven optimization, and supporting new optimization metrics.

Automatic oracle generation in microsoft’s quantum development kit using QIR and LLVM passes

Mathias Soeken
Mariia Mykhailova

Automatic oracle generation techniques can find optimized quantum circuits for classical components in quantum algorithms. However, most implementations of oracle generation techniques require that the classical component is expressed in terms of a conventional logic representation such as logic networks, truth tables, or decision diagrams. We implemented LLVM passes that can automatically generate QIR functions representing classical Q# functions into QIR code implementing such functions quantumly. We are using state-of-the-art logic optimization and oracle generation techniques based on XOR-AND graphs for this purpose. This enables not only a more natural description of the quantum algorithm on a higher level of abstraction, but also enables technology-dependent or application-specific generation of the oracles.

The basis of design tools for quantum computing: arrays, decision diagrams, tensor networks, and ZX-calculus

Robert Wille
Lukas Burgholzer
Stefan Hillmich
Thomas Grurl
Alexander Ploier
Tom Peham

Quantum computers promise to efficiently solve important problems classical computers never will. However, in order to capitalize on these prospects, a fully automated quantum software stack needs to be developed. This involves a multitude of complex tasks from the classical simulation of quantum circuits, over their compilation to specific devices, to the verification of the circuits to be executed as well as the obtained results. All of these tasks are highly non-trivial and necessitate efficient data structures to tackle the inherent complexity. Starting from rather straight-forward arrays over decision diagrams (inspired by the design automation community) to tensor networks and the ZX-calculus, various complementary approaches have been proposed. This work provides a look “under the hood” of today’s tools and showcases how these means are utilized in them, e.g., for simulation, compilation, and verification of quantum circuits.

Secure by construction: addressing security vulnerabilities introduced during high-level synthesis: invited

Md Rafid Muttaki
Zahin Ibnat
Farimah Farahmandi

Working towards a higher level of abstraction (C/C++) facilitates designers to execute and validate complex designs faster in response to highly demanding time-to-market requirements. High-Level Synthesis (HLS) is an automatic process that translates the high-level description of the design behaviors into the corresponding hardware description language (HDL) modules. However, HLS translation steps/optimizations can cause security vulnerabilities since they have not been designed with security in mind. It is very important that HLS generates functionally correct RTL in a secure manner in the first place since it is not easy to read the automatically generated codes and trace them back to the source of vulnerabilities. Even if one manages to identify and fix the security vulnerabilities in one design, the core of the HLS engine remains vulnerable. Therefore, the same vulnerabilities will appear in all other HLS generated RTL codes. This paper shows a systematic approach for identifying the source of security vulnerabilities introduced during HLS and mitigating them.

High-level design methods for hardware security: is it the right choice? invited

Christian Pilato
Donatella Sciuto
Benjamin Tan
Siddharth Garg
Ramesh Karri

Due to the globalization of the electronics supply chain, hardware engineers are increasingly interested in modifying their chip designs to protect their intellectual property (IP) or the privacy of the final users. However, the integration of state-of-the-art solutions for hardware and hardware-assisted security is not fully automated, requiring the amendment of stable tools and industrial toolchains. This significantly limits the application in industrial designs, potentially affecting the security of the resulting chips. We discuss how existing solutions can be adapted to implement security features at higher levels of abstractions (during high-level synthesis or directly at the register-transfer level) and complement current industrial design and verification flows. Our modular framework allows designers to compose these solutions and create additional protection layers.

Trusting the trust anchor: towards detecting cross-layer vulnerabilities with hardware fuzzing

Chen Chen
Rahul Kande
Pouya Mahmoody
Ahmad-Reza Sadeghi
JV Rajendran

The rise in the development of complex and application-specific commercial and open-source hardware and the shrinking verification time are causing numerous hardware-security vulnerabilities. Traditional verification techniques are limited in both scalability and completeness. Research in this direction is hindered due to the lack of robust testing benchmarks. In this paper, in collaboration with our industry partners, we built an ecosystem mimicking the hardware-development cycle where we inject bugs inspired by real-world vulnerabilities into RISC-V SoC design and organized an open-to-all bug-hunting competition. We equipped the participating researchers with industry-standard static and dynamic verification tools in a ready-to-use environment. The findings from our competition shed light on the strengths and weaknesses of the existing verification tools and highlight the potential for future research in developing new vulnerability detection techniques.

Automating hardware security property generation: invited

Ryan Kastner
Francesco Restuccia
Andres Meza
Sayak Ray
Jason Fung
Cynthia Sturton

Security verification is an important part of the hardware design process. Security verification teams can uncover weaknesses, vulnerabilities, and flaws. Unfortunately, the verification process involves substantial manual analysis to create the threat model, identify important security assets, articulate weaknesses, define security requirements, and specify security properties that formally describe security requirements upon the hardware. This work describes current hardware security verification practices. Many of these rely on manual analysis. We argue that the property generation process is a first step towards scalable and reproducible hardware security verification.

Efficient timing propagation with simultaneous structural and pipeline parallelisms: late breaking results

Cheng-Hsiang Chiu
Tsung-Wei Huang

Graph-based timing propagation (GBP) is an essential component for all static timing analysis (STA) algorithms. To speed up GBP, the state-of-the-art timer leverages the task graph model to explore structural parallelism in an STA graph. However, many designs exhibit linear segments that cause the parallelism to serialize, degrading the performance significantly. To overcome this problem, we introduce an efficient GBP framework by exploring both structural and pipeline parallelisms in an STA task graph. Our framework identifies linear segments and parallelizes their propagation tasks using pipeline in an STA task graph. We have shown up to 25% performance improvement over the state-of-the-art task graph-based timer.

A fast and low-cost comparison-free sorting engine with unary computing: late breaking results

Amir Hossein Jalilvand
Seyedeh Newsha Estiri
Samaneh Naderi
M. Hassan Najafi
Mohsen Imani

Hardware-efficient implementation of sorting operation is crucial for numerous applications, particularly when fast and energy-efficient sorting of data is desired. Unary computing has been used for low-cost hardware sorting. This work proposes a comparison-free unary sorting engine by iteratively finding maximum values. Synthesis results show up to 81% reduction in hardware area compared to the state-of-the-art unary sorting design. By processing right-aligned unary bit-streams, our unary sorter is able to sort many inputs in fewer clock cycles.

Flexible chip placement via reinforcement learning: late breaking results

Fu-Chieh Chang
Yu-Wei Tseng
Ya-Wen Yu
Ssu-Rui Lee
Alexandru Cioba
I-Lun Tseng
Da-shan Shiu
Jhih-Wei Hsu
Cheng-Yuan Wang
Chien-Yi Yang
Ren-Chu Wang
Yao-Wen Chang
Tai-Chen Chen
Tung-Chieh Chen

Recently, successful applications of reinforcement learning to chip placement have emerged. Pretrained models are necessary to improve efficiency and effectiveness. Currently, the weights of objective metrics (e.g., wirelength, congestion, and timing) are fixed during pretraining. However, fixed-weighed models cannot generate the diversity of placements required for engineers to accommodate changing requirements as they arise. This paper proposes flexible multiple-objective reinforcement learning (MORL) to support objective functions with inference-time variable weights using just a single pretrained model. Our macro placement results show that MORL can generate the Pareto frontier of multiple objectives effectively.

FPGA-aware automatic acceleration framework for vision transformer with mixed-scheme quantization: late breaking results

Mengshu Sun
Zhengang Li
Alec Lu
Haoyu Ma
Geng Yuan
Yanyue Xie
Hao Tang
Yanyu Li
Miriam Leeser
Zhangyang Wang
Xue Lin
Zhenman Fang

Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework exploring model quantization. Compared with state-of-the-art ViT quantization work (algorithmic approach only without hardware acceleration), our quantization achieves 0.31% to 1.25% higher Top-1 accuracy under the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around 5.6× improvement on the frame rate (i.e., 56.4 FPS vs. 10.0 FPS) with 0.83% accuracy drop for DeiT-base.

Hardware-efficient stochastic rounding unit design for DNN training: late breaking results

Sung-En Chang
Geng Yuan
Alec Lu
Mengshu Sun
Yanyu Li
Xiaolong Ma
Zhengang Li
Yanyue Xie
Minghai Qin
Xue Lin
Zhenman Fang
Yanzhi Wang

Stochastic rounding is crucial in the training of low-bit deep neural networks (DNNs) to achieve high accuracy. Unfortunately, prior studies require a large number of high-precision stochastic rounding units (SRUs) to guarantee the low-bit DNN accuracy, which involves considerable hardware overhead. In this paper, we propose an automated framework to explore hardware-efficient low-bit SRUs (ESRUs) that can still generate high-quality random numbers to guarantee the accuracy of low-bit DNN training. Experimental results using state-of-the-art DNN models demonstrate that, compared to the prior 24-bit SRU with 24-bit pseudo random number generator (PRNG), our 8-bit with 3-bit PRNG reduces the SRU resource usage by 9.75× while achieving a higher accuracy.

Placement initialization via a projected eigenvector algorithm: late breaking results

Pengwen Chen
Chung-Kuan Cheng
Albert Chern
Chester Holtz
Aoxi Li
Yucheng Wang

Canonical methods for analytical placement of VLSI designs rely on solving nonlinear programs to minimize wirelength and cell overlap. We focus on producing initial layouts such that a global analytical placer performs better compared to existing heuristics for initialization. We reduce the problem of initialization to a quadratically constrained quadratic program. Our formulation is aware of fixed macros. We propose an efficient algorithm which can quickly generate initializations for testcases with millions of cells. We show that the our method for parameter initialization results in superior performance with respect to post-detailed placement wirelength.

Subgraph matching based reference placement for PCB designs: late breaking results

Miaodi Su
Yifeng Xiao
Shu Zhang
Haiyuan Su
Jiacen Xu
Huan He
Ziran Zhu
Jianli Chen
Yao-Wen Chang

Reference placement is promising to handle the increasing complexity in PCB design. We model the netlist into a graph and use a subgraph matching algorithm to find the isomorphism of the placed template in component combination to reuse the placement. The state-of-the-art VF3 algorithm can achieve high matching accuracy while suffering from high computation time in large-scale instances. Thus, we propose the D2BS algorithm to guarantee matching quality and efficiency. We build and filter the candidate set (CS) according to designed features to construct the CS structure. In the CS optimization, a graph diversity tolerance strategy is adopted to achieve inexact matching. Then, hierarchical match is developed to search the template embeddings in the CS structure guided by branch backtracking and matched nodes snatching. Experimental results show that D2BS outperforms VF3 in accuracy and runtime, achieving 100% accuracy on PCB instances.

Thermal-aware drone battery management: late breaking results

Hojun Choi
Youngmoon Lee

Users have reported that their drones unexpectedly shutoff even when they show more than 10% remaining battery capacity. We discovered that the causes of these unexpected shutoffs to be significant thermal degradation of a cell caused by thermal coupling between the drones and their battery cells. This causes a large voltage drop for the cell affected by the drone heat dissipation, which leads to low supply voltage and unexpected shutoffs. This paper describes the design and implementation of a thermal and battery-aware power management framework designed specifically for drones. Our framework provides an accurate state-of-charge and state-of-power estimation for individual battery cells by accounting for their different thermal degradation. We have implemented our framework on commodity drones without additional hardware or system modification. We have evaluated its effectiveness using three different batteries demonstrating our framework generates accurate state-of-charge and prevents unexpected shutoffs.

Waveform-based performance analysis of RISC-V processors: late breaking results

Lucas Klemmer
Daniel Große

In this paper, we demonstrate the use of the open-source domain specific language WAL to analyze performance metrics of RISC-V processors. The WAL programs calculate these metrics by evaluating the processors signals while “walking” over the simulation waveform (VCD). The presented WAL programs are flexible and generic, and can be easily adapted to different RISC-V cores.

Mohsen Imani

10 August 2022

Yibo Lin

No comments

Categories: Who's Who

Aug 1st, 2022

Mohsen Imani

Assistant Professor

Department of Computer Science,
University of California Irvine

Email:
m.imani@uci.edu

Personal webpage
https://www.ics.uci.edu/~mohseni/

Research interests

Brain-Inspired Computing, Computer Architecture, Embedded Systems

Short bio

Mohsen Imani is an Assistant Professor in the Department of Computer Science at UC Irvine. He is also a director of Bio-Inspired Architecture and Systems Laboratory (BIASLab). He is working on a wide range of practical problems in the area of brain-inspired computing, machine learning, computer architecture, and embedded systems. His research goal is to design real-time, robust, and programmable computing platforms that can natively support a wide range of learning and cognitive tasks on edge devices. Dr. Imani received his Ph.D. from the Department of Computer Science and Engineering at UC San Diego. He has a stellar record of publication with over 120 papers in top conferences/journals. His contribution has led to a new direction in brain-inspired hyperdimensional computing that enables ultra-efficient and real-time learning and cognitive support. His research was also the main initiative in opening up multiple industrial and governmental research programs. Dr. Imani’s research has been recognized with several awards, including the Bernard and Sophia Gordon Engineering Leadership Award, the Outstanding Researcher Award, and the Powell Fellowship Award. He also received the Best Doctorate Research from UCSD, the best paper award in Design Automation and Test in Europe (DATE) in 2022, and several best paper nomination awards at multiple top conferences including Design Automation Conference (DAC) in 2019 and 2020, Design Automation and Test in Europe (DATE) in 2020, and International Conference on Computer-Aided Design (ICCAD) in 2020.

Reasearch highlights

Dr. Imani’s research has been instrumental in developing practical implementations of Hyper-dimensional (HD) computing – a computational technique modeled after the brain. The Hyper-dimensional computing system enabled large-scale learning in real-time, including both training and inference. He has developed such a system by not only accelerating machine learning algorithms in hardware but also redesigning the algorithms themselves using strategies that more closely model the ultimate efficient learning machine: the human brain. HD computing is motivated by the observation that the key aspects of human memory, perception, and cognition can be explained by the mathematical properties of high-dimensional spaces. It thereby models the human memory using points of a high-dimensional space, that is, with hypervectors (tens of thousand dimensions.) These points can be manipulated under a formal algebra to represent semantic relationships between objects, and thus we can devise various cognitive solutions which memorize and learn from the relation of data. HD computing also mimics several desirable properties of the human brain including robustness to noise and failure of memory cells, and one-shot learning which does not require a gradient-based algorithm. Dr. Imani exploited these key principles of brain functionalities to create cognitive platforms. The platforms include (1) novel HD algorithms supporting classification and clustering which represent the most popular categories of algorithms used regularly by professional data scientists, (2) novel HD hardware accelerators capable of up to three orders of magnitude improvement in energy efficiency relative to GPU implementations, and (3) an integrated software infrastructure that makes it easy for users to integrate HD computing as a part of systems, and that enables secure distributed learning on encrypted information using HD computing. The software contributions are backed by efficient hardware acceleration in GPU, FPGA, and processing in-memory. Dr. Imani leveraged the memory-centric nature of HD computing to develop efficient hardware/software infrastructure for a highly-parallel PIM acceleration. In HD computing, hypervectors have holographic distribution, where the information is uniformly distributed over a large number of dimensions. This makes HD computing significantly robust to the failure of an individual memory component (Robust to ∼30% failure in the hardware). In particular, Dr. Imani exploited this robustness to design an approximate in-memory associative search that checks the similarity of hypervectors in about tens of nano-seconds, while providing orders of magnitude improvement in energy efficiency as compared to today’s exact processors.

Page 3 of 10