ICCAD ’22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design
SESSION: The Role of Graph Neural Networks in Electronic Design Automation
- Haoxing Ren
- Siddhartha Nath
- Yanqing Zhang
- Hao Chen
- Mingjie Liu
In this paper, we discuss the source of the effectiveness of Graph Neural Networks (GNNs) in EDA, particularly in the VLSI design automation domain. We argue that the effectiveness comes from the fact that GNNs implicitly embed the prior knowledge and inductive biases associated with given VLSI tasks, which is one of the three approaches to making a learning algorithm physics-informed. These inductive biases are different from those commonly used in GNNs designed for other structured data, such as social networks and citation networks. We illustrate this principle with several recent GNN examples in the VLSI domain, including predictive tasks such as switching activity prediction, timing prediction, parasitics prediction, and layout symmetry prediction, as well as optimization tasks such as gate sizing and macro and cell transistor placement. We also discuss the challenges of applying GNNs and the opportunity of applying self-supervised learning techniques with GNNs for VLSI optimization.
As modern Physical Design (PD) algorithms and methodologies evolve into the post-Moore era with the aid of machine learning, Graph Neural Networks (GNNs) are becoming increasingly ubiquitous, given that netlists are essentially graphs. Recently, their ability to perform effective graph learning has provided significant insights into the underlying dynamics of netlist-to-layout transformations. GNNs follow a message-passing scheme, where the goal is to construct meaningful representations at the graph or node level by recursively aggregating and transforming the initial features. In the realm of PD, GNN-learned representations have been leveraged to solve tasks such as cell clustering, quality-of-result prediction, and activity simulation, often overcoming the limitations of traditional PD algorithms. In this work, we first revisit recent advancements that GNNs have enabled in PD. Second, we discuss how GNNs serve as the backbone of novel PD flows. Finally, we present our thoughts on ongoing and future PD challenges that GNNs can tackle successfully.
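To make the message-passing scheme mentioned above concrete, the sketch below (plain NumPy; the array names and the mean-aggregation choice are illustrative assumptions, not any specific paper's implementation) shows one round of neighbor aggregation and transformation over a netlist-derived graph.

```python
import numpy as np

def message_passing_layer(features, adjacency, weight):
    """One round of mean-aggregation message passing.

    features:  (num_nodes, in_dim) initial node features, e.g., cell attributes
    adjacency: (num_nodes, num_nodes) 0/1 connectivity derived from the netlist
    weight:    (2 * in_dim, out_dim) learnable transformation
    """
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    neighbor_mean = adjacency @ features / degree         # aggregate neighbors
    combined = np.concatenate([features, neighbor_mean], axis=1)
    return np.maximum(combined @ weight, 0.0)             # transform + ReLU

# Toy 4-node graph: stacking such layers yields node-level representations;
# pooling them (e.g., a mean over all nodes) gives a graph-level representation.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
h = message_passing_layer(rng.random((4, 3)), adj, rng.random((6, 2)))
```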
- Daniela Sánchez Lopera
- Wolfgang Ecker
In the Electronic Design Automation (EDA) flow, signoff checks, such as timing analysis, are performed only after physical synthesis. Encountered timing violations cause re-iterations of the design flow. Hence, timing estimations at early design stages, such as Register Transfer Level (RTL), would increase the quality of results and reduce the number of flow iterations. Machine learning has been used to estimate the timing behavior of chip components. However, existing solutions map EDA objects to Euclidean data without considering that EDA objects are naturally represented as graphs. Recent advances in Graph Neural Networks (GNNs) motivate the mapping from EDA objects to graphs for design metric prediction tasks at different stages. This paper maps RTL designs to directed, featured graphs with multidimensional node and edge features. These are the input to GNNs for estimating component delays and slews. An in-house hardware generation framework and open-source EDA tools for ASIC synthesis are employed for collecting training data. Experiments over unseen circuits show that GNN-based models are promising for timing estimation, even when the features come from early RTL implementations. Based on the estimated delays, critical areas of the design can be detected, and proper RTL micro-architectures can be chosen without running long design iterations.
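As an illustration of the mapping described above (the component names, feature choices, and values below are hypothetical, not the paper's feature set), the sketch builds a small directed graph in which every RTL component carries a multidimensional feature vector; a GNN trained on such graphs against synthesis reports would then regress per-component delay and slew.

```python
import numpy as np

# Hypothetical RTL components with multidimensional node features.
nodes = {
    "add0": {"type_onehot": [1, 0, 0], "bitwidth": 32, "fanout": 2},
    "mux0": {"type_onehot": [0, 1, 0], "bitwidth": 32, "fanout": 1},
    "reg0": {"type_onehot": [0, 0, 1], "bitwidth": 32, "fanout": 3},
}
edges = [("add0", "mux0"), ("mux0", "reg0")]   # directed driver -> sink connections

def node_matrix(nodes):
    """Stack per-node features into the matrix form expected by GNN libraries."""
    return np.array(
        [v["type_onehot"] + [v["bitwidth"], v["fanout"]] for v in nodes.values()],
        dtype=float,
    )

X = node_matrix(nodes)   # shape (3, 5); edge features (e.g., connection width) attach to `edges`
```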
- Lilas Alrahis
- Satwik Patnaik
- Muhammad Shafique
- Ozgur Sinanoglu
Graph neural networks (GNNs) have attracted increasing attention due to their superior performance in deep learning on graph-structured data. GNNs have succeeded across various domains such as social networks, chemistry, and electronic design automation (EDA). Electronic circuits have a long history of being represented as graphs, and to no surprise, GNNs have demonstrated state-of-the-art performance in solving various EDA tasks. More importantly, GNNs are now employed to address several hardware security problems, such as detecting intellectual property (IP) piracy and hardware Trojans (HTs), to name a few.
In this survey, we first provide a comprehensive overview of the usage of GNNs in hardware security and propose the first taxonomy to divide the state-of-the-art GNN-based hardware security systems into four categories: (i) HT detection systems, (ii) IP piracy detection systems, (iii) reverse engineering platforms, and (iv) attacks on logic locking. We summarize the different architectures, graph types, node features, benchmark data sets, and model evaluation of the employed GNNs. Finally, we elaborate on the lessons learned and discuss future directions.
SESSION: Compiler and System-Level Techniques for Efficient Machine Learning
- Sri Parameswaran
- Martin Rapp
- Mahmut Kandemir
- Xulong Tang
- Jagadish Kotra
- Mustafa Karakoy
While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), one of the important characteristics of prior approaches is that they transform loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarse-grain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third and more importantly, such coarse-grain approaches are local in nature and can hardly achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves data locality and outperforms existing schemes – 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.
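As a textbook-style reminder of the matrix view used here (this example is generic, not drawn from the paper), a loop interchange of a two-deep nest corresponds to the unimodular transformation

$$ T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \begin{pmatrix} i' \\ j' \end{pmatrix} = T \begin{pmatrix} i \\ j \end{pmatrix} = \begin{pmatrix} j \\ i \end{pmatrix}, $$

so iteration $(i, j)$ executes as $(j, i)$; the fine-granular scheme described above selects several such matrices per nest, analogous permutations of array dimensions express the accompanying data layout transformations, and bipartite graph matching reconciles the loop-side and layout-side choices.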
- Nicolas Bohm Agostini
- Serena Curzel
- Vinay Amatya
- Cheng Tan
- Marco Minutoli
- Vito Giovanni Castellana
- Joseph Manzano
- David Kaeli
- Antonino Tumeo
The generation of custom hardware accelerators for applications implemented within high-level productive programming frameworks requires considerable manual effort. To automate this process, we introduce SODA-OPT, a compiler tool that extends the MLIR infrastructure. SODA-OPT automatically searches, outlines, tiles, and pre-optimizes relevant code regions to generate high-quality accelerators through high-level synthesis. SODA-OPT can support any high-level programming framework and domain-specific language that interfaces with the MLIR infrastructure. By leveraging MLIR, SODA-OPT solves compiler optimization problems with specialized abstractions. Backend synthesis tools connect to SODA-OPT through progressive intermediate representation lowerings. SODA-OPT interfaces with a design space exploration engine to identify the combination of compiler optimization passes and options that provides high-performance generated designs for different backends and targets. We demonstrate the practical applicability of the compilation flow by exploring the automatic generation of accelerators for deep neural network operators outlined at arbitrary granularity and by combining outlining with tiling on large convolution layers. Experimental results with kernels from the PolyBench benchmark show that our high-level optimizations improve the execution delay of synthesized accelerators by up to 60x. We also show that for the selected kernels, our solution outperforms the current state of the art in more than 70% of the benchmarks and provides better average speedup in 55% of them. SODA-OPT is an open source project available at https://gitlab.pnnl.gov/sodalite/soda-opt.
- Yingjie Li
- Ruiyang Chen
- Weilu Gao
- Cunxi Yu
Diffractive optical neural networks (DONNs) have attracted considerable attention as they bring significant advantages in terms of power efficiency, parallelism, and computational speed compared with conventional deep neural networks (DNNs), which have intrinsic limitations when implemented on digital platforms. However, inversely mapping algorithm-trained physical model parameters onto real-world optical devices with discrete values is a non-trivial task, as existing optical devices have non-unified discrete levels and non-monotonic properties. This work proposes a novel device-to-system hardware-software codesign framework, which enables efficient physics-aware training of DONNs with respect to arbitrary experimentally measured optical devices across layers. Specifically, Gumbel-Softmax is employed to enable a differentiable discrete mapping from real-world device parameters into the forward function of DONNs, where the physical parameters in DONNs can be trained by simply minimizing the loss function of the ML task. The results demonstrate that our proposed framework offers significant advantages over conventional quantization-based methods, especially with low-precision optical devices. Finally, the proposed algorithm is fully verified with physical experimental optical systems in low-precision settings.
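A minimal sketch of the differentiable discrete mapping idea is shown below (PyTorch; the device levels and the single-pixel setup are assumptions for illustration, not the paper's framework).

```python
import torch
import torch.nn.functional as F

# Hypothetical set of discrete, non-uniform phase levels measured from a device.
device_levels = torch.tensor([0.00, 0.21, 0.55, 0.62, 0.97])   # assumed values

# One learnable logit vector per optical "pixel" (a single pixel here for brevity).
logits = torch.zeros(len(device_levels), requires_grad=True)

def discrete_phase(logits, tau=1.0):
    """Differentiable selection of one discrete device level via Gumbel-Softmax."""
    probs = F.gumbel_softmax(logits, tau=tau, hard=True)   # one-hot in the forward pass,
    return (probs * device_levels).sum()                   # soft gradient in the backward pass

phase = discrete_phase(logits)   # would feed the DONN forward model
phase.backward()                 # gradients flow back into `logits`
```

With `hard=True`, the forward pass always emits an exact device level, while the straight-through gradient lets the ML-task loss train the underlying logits.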
- Gokul Krishnan
- A. Alper Goksoy
- Sumit K. Mandal
- Zhenyu Wang
- Chaitali Chakrabarti
- Jae-sun Seo
- Umit Y. Ogras
- Yu Cao
Monolithic in-memory computing (IMC) architectures face significant yield and fabrication cost challenges as the complexity of DNNs increases. Chiplet-based IMCs that integrate multiple dies with advanced 2.5D/3D packaging offer a low-cost and scalable solution. They enable heterogeneous architectures where the chiplets and their associated interconnection can be tailored to the non-uniform algorithmic structures to maximize IMC utilization and reduce energy consumption. This paper proposes a heterogeneous IMC architecture with big-little chiplets and a hybrid network-on-package (NoP) to optimize the utilization, interconnect bandwidth, and energy efficiency. For a given DNN, we develop a custom methodology to map the model onto the big-little architecture such that the early layers in the DNN are mapped to the little chiplets with higher NoP bandwidth and the subsequent layers are mapped to the big chiplets with lower NoP bandwidth. Furthermore, we achieve a scalable solution by incorporating a DRAM into each chiplet to support a wide range of DNNs beyond the area limit. Compared to a homogeneous chiplet-based IMC architecture, the proposed big-little architecture achieves up to 329× improvement in the energy-delay-area product (EDAP) and up to 2× higher IMC utilization. Experimental evaluation of the proposed big-little chiplet-based RRAM IMC architecture for ResNet-50 on ImageNet shows 259×, 139×, and 48× improvement in energy efficiency at lower area compared to the Nvidia V100 GPU, Nvidia T4 GPU, and SIMBA architecture, respectively.
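The layer-to-chiplet assignment described above can be pictured with the toy heuristic below (the even early/late split, the round-robin assignment, and the names are simplifying assumptions, not the paper's mapping methodology).

```python
def map_layers_to_chiplets(layer_names, num_little, num_big):
    """Assign early (activation-heavy) layers to little chiplets with higher NoP
    bandwidth and later layers to big chiplets, round-robin within each group."""
    split = len(layer_names) // 2          # simplistic early/late split (assumption)
    mapping = {}
    for i, name in enumerate(layer_names):
        if i < split:
            mapping[name] = ("little", i % num_little)
        else:
            mapping[name] = ("big", (i - split) % num_big)
    return mapping

# Example: a toy 6-layer network mapped onto 4 little and 2 big chiplets.
print(map_layers_to_chiplets([f"conv{i}" for i in range(6)], num_little=4, num_big=2))
```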
SESSION: Addressing Sensor Security through Hardware/Software Co-Design
- Marilyn Wolf
- Kruttidipta Samal
This paper provides a taxonomy of security vulnerabilities of smart image sensor systems. Image sensors form an important class of sensors. Many image sensors include computation units that can run traditional algorithms such as image or video compression along with machine learning tasks such as classification. Some attacks rely on the physics and optics of imaging. Other attacks take advantage of the complex logic and software required to implement imaging systems.
False data injection attacks on sensor systems are an emerging threat to cyberphysical systems, creating significant risks to all application domains and, importantly, to critical infrastructures. Cyberphysical systems are process-dependent, leading to differing false data injection attacks that target disruption of the specific processes (plants). We present a taxonomy of false data injection attacks, using a general model for cyberphysical systems, showing that global and continuous attacks are extremely powerful. To detect false data injection attacks, we describe three methods that can be employed to enable effective monitoring and detection of such attacks during plant operation. Considering that sensor failures have effects equivalent to those of the corresponding false data injection attacks, the methods are effective for sensor fault detection as well.
- Ningyuan Cao
- Jianbo Liu
- Boyang Cheng
- Muya Chang
The ubiquitous data acquisition and extensive data exchange of sensors pose severe security and privacy concerns for end-users and the public. To enable real-time protection of raw data, there is a pressing demand to facilitate privacy-preserving algorithms at the point of data generation, i.e., in-sensor privacy. However, due to severe sensor resource constraints and the intensive computation/security cost, it remains an open question how to enable data protection algorithms with efficient circuit techniques. To answer this question, this paper discusses the potential of stochastic mixed-signal (SMS) circuits for ultra-low-power, small-footprint data security. In particular, this paper discusses digitally controlled oscillators (DCOs) and their advantages in (1) seamless analog interfacing, (2) stochastic computation efficiency, and (3) unified entropy generation over conventional digital circuit baselines. With the DCO as an illustrative case, we target (1) defining an SMS privacy-preserving architecture and systematically analyzing its performance gains across various hardware/software configurations, and (2) revisiting analog/mixed-signal voltage/transistor scaling in the context of entropy-based data protection.
- Anomadarshi Barua
- Mohammad Abdullah Al Faruque
Sensors are among the most pervasive and integral components of today’s safety-critical systems. Sensors serve as a bridge between physical quantities and connected systems. The connected systems blindly trust their sensors, as there is no way to authenticate the signal coming from a sensor. This can be an entry point for an attacker. An attacker can inject a fake input signal along with the legitimate signal by using a suitable spoofing technique. As the sensor’s transducer is not smart enough to differentiate between a fake and a legitimate signal, the injected fake signal can eventually cause the connected system to fail. This type of attack is known as the transduction attack. Over the last decade, several works have been published to provide a defense against the transduction attack. However, the defenses have been proposed on an ad-hoc basis; hence, they are not well-structured. Our work begins to fill this gap by providing a checklist that a defense technique should always follow to be considered an ideal defense against the transduction attack. We name this checklist the Golden reference of sensor defense. We provide insights into how this Golden reference can be achieved and argue that sensors should be redesigned from the transducer level to the sensor electronics level. We point out that hardware or software modification alone is not enough; instead, a hardware/software (HW/SW) co-design approach is required to follow this roadmap toward robust and resilient sensors.
SESSION: Advances in Partitioning and Physical Optimization
- Markus Olbrich
- Yu-Guang Chen
- Ismail Bustany
- Andrew B. Kahng
- Ioannis Koutis
- Bodhisatta Pramanik
- Zhiang Wang
State-of-the-art hypergraph partitioners follow the multilevel paradigm, constructing multiple levels of progressively coarser hypergraphs that are used to drive cut refinements on each level of the hierarchy. Multilevel partitioners are subject to two limitations: (i) hypergraph coarsening processes rely on local neighborhood structure without fully considering the global structure of the hypergraph, and (ii) refinement heuristics can stagnate on local minima. In this paper, we describe SpecPart, the first supervised spectral framework that directly tackles these two limitations. SpecPart solves a generalized eigenvalue problem that captures the balanced partitioning objective and global hypergraph structure in a low-dimensional vertex embedding, while leveraging initial high-quality solutions from multilevel partitioners as hints. SpecPart further constructs a family of trees from the vertex embedding and partitions them with a tree-sweeping algorithm. A novel overlay of multiple tree-based partitioning solutions is then lifted to a coarsened hypergraph, where an ILP partitioning instance is solved to alleviate local stagnation. We have validated SpecPart on multiple sets of benchmarks. Experimental results show that for some benchmarks, SpecPart can substantially improve the cutsize by more than 50% with respect to the best published solutions obtained with the leading partitioners hMETIS and KaHyPar.
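In generic form (the exact operators SpecPart uses may differ), such a supervised spectral embedding arises from a generalized eigenvalue problem

$$ L_H\, x \;=\; \lambda\, B\, x, $$

where $L_H$ is a Laplacian-like operator encoding hypergraph connectivity, $B$ encodes the balance objective together with hint information distilled from the initial multilevel solutions, and the eigenvectors associated with the smallest nontrivial eigenvalues supply the low-dimensional vertex embedding over which the family of trees is constructed and swept.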
This paper introduces a scalable algorithmic framework (HyperEF) for spectral coarsening (decomposition) of large-scale hypergraphs by exploiting hyperedge effective resistances. Motivated by the latest theoretical framework for low-resistance-diameter decomposition of simple graphs, HyperEF aims at decomposing large hypergraphs into multiple node clusters with only a few inter-cluster hyperedges. The key component in HyperEF is a nearly-linear time algorithm for estimating hyperedge effective resistances, which allows incorporating the latest diffusion-based non-linear quadratic operators defined on hypergraphs. To achieve good runtime scalability, HyperEF searches within the Krylov subspace (or approximate eigensubspace) to identify the nearly-optimal vectors for approximating the hyperedge effective resistances. In addition, a node weight propagation scheme for multilevel spectral hypergraph decomposition has been introduced for achieving even greater node coarsening ratios. When compared with state-of-the-art hypergraph partitioning (clustering) methods, extensive experimental results on real-world VLSI designs show that HyperEF can more effectively coarsen (decompose) hypergraphs without losing key structural (spectral) properties of the original hypergraphs, while achieving over 70× runtime speedups over hMetis and 20× speedups over HyperSF.
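For orientation (this is the standard simple-graph identity; HyperEF's hypergraph generalization differs in detail), the effective resistance between nodes $p$ and $q$ of a graph with Laplacian $L$ is

$$ R_{\mathrm{eff}}(p,q) \;=\; (e_p - e_q)^{\top} L^{+} (e_p - e_q) \;=\; \sum_{i \ge 2} \frac{\bigl(u_i^{\top}(e_p - e_q)\bigr)^{2}}{\lambda_i}, $$

where $L^{+}$ is the Laplacian pseudoinverse and $(\lambda_i, u_i)$ are its nonzero eigenpairs; replacing the exact eigenvectors with a handful of Krylov-subspace (approximate eigensubspace) vectors is what yields the nearly-linear-time estimates used to rank hyperedges and form low-resistance clusters.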
The benefit of a multi-bit flip-flop (MBFF) over single-bit flip-flops is the sharing of in-cell clock inverters among the master and slave latches of the flip-flops inside the MBFF. Theoretically, the more flip-flops an MBFF contains, the more power it can save. In practice, however, physically enlarging an MBFF to accommodate many flip-flops imposes two new challenging problems in physical design: (1) non-flexible MBFF cell flipping for multiple D-to-Q signals and (2) unbalanced or wasted use of the MBFF footprint space. In this work, we solve both problems in a way that enhances routability and timing at the placement and routing stages. Precisely, for problem 1, we make the non-flexible MBFF cell flipping fully flexible by generating MBFF layouts that support diverse D-to-Q flow directions during detailed placement to improve routability, and for problem 2, we improve the setup and clock-to-Q delay of timing-critical flip-flops in the MBFF through gate upsizing (i.e., transistor folding), using the unused space in the MBFF to improve timing slack at the post-routing stage. Experiments with benchmark circuits show that our proposed design and technology co-optimization (DTCO) flow using MBFFs, which solves problems 1 and 2, is very promising.
- Yang Hsu
- Min-Hsuan Chung
- Yao-Wen Chang
- Ci-Hong Lin
In modern heterogeneous integration technologies, chips with different processes and functionality are integrated into a package with high interconnection density and large I/O counts. Integrating multiple chips into a package may suffer from severe warpage problems caused by the mismatch in coefficients of thermal expansion between different manufacturing materials, leading to deformation and malfunction in the manufactured package. The industry is eager to find a solution for warpage optimization. This paper proposes the first warpage-aware floorplanning algorithm for heterogeneous integration. We first present an efficient qualitative warpage model for a multi-chip package structure based on Suhir’s solution, more suitable for optimization than the time-consuming finite element analysis. Based on the transitive closure graph floorplan representation, we then propose three perturbations for simulated annealing to optimize the warpage more directly and can thus speed up the process. Finally, we develop a force-directed detailed floorplanning algorithm to further refine the solutions by utilizing the dead spaces. Experimental results demonstrate the effectiveness of our warpage model and algorithm.
SESSION: Democratizing Design Automation with Open-Source Tools: Perspectives, Opportunities, and Challenges
In recent years, several open-source projects have shown potential to serve a future technology commons for EDA and design prototyping. This paper examines how open-source and proprietary EDA technologies will inevitably take on complementary roles within a future technology commons. Proprietary EDA technologies offer numerous benefits that will endure, including (i) exceptional technology and engineering; (ii) ever-increasing importance in design-based equivalent scaling and the overall semiconductor value chain; and (iii) well-established commercial and partner relationships. On the other hand, proprietary EDA technologies face challenges that will also endure, including (i) inability to pursue directions such as massive leverage of cloud compute, extreme reduction of turnaround times, or “free tools”; and (ii) difficulty in evolving and addressing new applications and markets. By contrast, open-source EDA technologies offer benefits that include (i) the capability to serve as a friction-free, democratized platform for education and future workforce development (i.e., as a platform for EDA research, and as a means of teaching / training both designers and EDA developers with public code); and (ii) addressing the needs of underserved, non-enterprise account markets (e.g., older nodes, research flows, cost-sensitive IoT, new devices and integrations, system-design-technology pathfinding). This said, open-source will always face challenges such as sustainability, governance, and how to achieve critical mass and critical quality. The paper will conclude with key directions and synergies for open-source and proprietary EDA within an EDA Commons for education and prototyping.
- Nicolas Bohm Agostini
- Ankur Limaye
- Marco Minutoli
- Vito Giovanni Castellana
- Joseph Manzano
- Antonino Tumeo
- Serena Curzel
- Fabrizio Ferrandi
The SODA Synthesizer is an open-source, modular, end-to-end hardware compiler framework. The SODA frontend, developed in MLIR, performs system-level design, code partitioning, and high-level optimizations to prepare the specifications for the hardware synthesis. The backend is based on a state-of-the-art high-level synthesis tool and generates the final hardware design. The backend can interface with logic synthesis tools for field programmable gate arrays or with commercial and open-source logic synthesis tools for application-specific integrated circuits. We discuss the opportunities and challenges in integrating with commercial and open-source tools both at the frontend and backend, and highlight the role that an end-to-end compiler framework like SODA can play in an open-source hardware design ecosystem.
- Maico Cassel dos Santos
- Tianyu Jia
- Martin Cochet
- Karthik Swaminathan
- Joseph Zuckerman
- Paolo Mantovani
- Davide Giri
- Jeff Jun Zhang
- Erik Jens Loscalzo
- Gabriele Tombesi
- Kevin Tien
- Nandhini Chandramoorthy
- John-David Wellman
- David Brooks
- Gu-Yeon Wei
- Kenneth Shepard
- Luca P. Carloni
- Pradip Bose
We present a scalable methodology for the agile physical design of tile-based heterogeneous system-on-chip (SoC) architectures that simplifies the reuse and integration of open-source hardware components. The methodology leverages the regularity of the on-chip communication infrastructure, which is based on a multi-plane network-on-chip (NoC), and the modularity of socket interfaces, which connect the tiles to the NoC. Each socket also provides its tile with a set of platform services, including independent clocking and voltage control. As a result, the physical design of each tile can be decoupled from its location in the top-level floorplan of the SoC and the overall SoC design can benefit from a hierarchical timing-closure flow, design reuse and, if necessary, fast respin. With the proposed methodology we completed two SoC tapeouts of increasing complexity, which illustrate its capabilities and the resulting gains in terms of design productivity.
SESSION: Accelerators on A New Horizon
- Vaibhav Verma
- Georgios Zervakis
- Wei Cheng
- Chun-Feng Wu
- Yuan-Hao Chang
- Ing-Chao Lin
Architectural innovation in graph accelerators attracts research attention due to foreseeable inflation in data sizes and the irregular memory access patterns of graph algorithms. Conventional graph accelerators ignore the potential of the Non-Volatile Memory (NVM) crossbar as a dual-addressing memory and treat it as a traditional single-addressing memory with higher density and better energy efficiency. In this work, we present GraphRC, a graph accelerator that leverages the power of dual-addressing memory by mapping in-edge/out-edge requests to column/row-oriented memory accesses. Although the capability of dual-addressing memory greatly improves the performance of graph processing, some memory accesses still suffer from low utilization. Therefore, we propose a vertex merging (VM) method that improves the cache block utilization rate by merging memory requests from consecutive vertices. VM reduces the execution time of all 6 graph algorithms on all 4 datasets by 24.24% on average. We then identify that the data dependency inherent in a graph limits the usage of VM, and that its effectiveness is bounded by the percentage of mergeable vertices. To overcome this limitation, we propose an aggressive vertex merging (AVM) method that outperforms VM by ignoring the data dependency inherent in a graph. AVM significantly reduces the execution time of ranking-based algorithms on all 4 datasets while preserving the correct ranking of the top 20 vertices.
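The vertex merging idea can be pictured with the toy model below (the addresses, block size, and greedy policy are assumptions for illustration, not GraphRC's hardware mechanism): memory requests of consecutive vertices that land in the same cache block collapse into a single block fetch.

```python
def merge_requests(vertex_requests, block_size):
    """Greedily merge memory requests of consecutive vertices that land in the
    same cache block (toy model of the VM idea; addresses are hypothetical)."""
    merged_blocks = []
    for addr in sorted(vertex_requests):
        block = addr // block_size
        if not merged_blocks or merged_blocks[-1] != block:
            merged_blocks.append(block)
    return merged_blocks

# Six consecutive vertices whose edge lists start at these byte offsets, with a
# 64-byte cache block: only three block fetches remain after merging.
print(merge_requests([0, 16, 48, 64, 96, 128], block_size=64))   # -> [0, 1, 2]
```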
- Matheus Cavalcante
- Domenic Wüthrich
- Matteo Perotti
- Samuel Riedel
- Luca Benini
While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PE should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve the Instruction Level Parallelism (ILP), which increases their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz’ performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. Those results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
- Edward Richter
- Deming Chen
While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. However, studying SVM implementations is difficult, as there is no open and flexible system to explore trade-offs between different SVM implementations and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system which enables thorough study of SVM in heterogeneous computing environments for discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis on the overheads of three SVM architectures, and generate several insights which highlight the cost and benefits of each architecture. From these insights, we propose a flowchart of how to choose the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.
- Ebadollah Taheri
- Sudeep Pasricha
- Mahdi Nikdast
2.5D chiplet systems have been proposed to improve the low manufacturing yield of large-scale chips. However, connecting the chiplets through an electronic interposer imposes a high traffic load on the interposer network. Silicon photonics technology has shown great promise towards handling a high volume of traffic with low latency in intra-chip network-on-chip (NoC) fabrics. Although recent advances in silicon photonic devices have extended photonic NoCs to enable high bandwidth communication in 2.5D chiplet systems, such interposer-based photonic networks still suffer from high power consumption. In this work, we design and analyze a novel Reconfigurable power-efficient and congestion-aware Silicon-Photonic 2.5D Interposer network, called ReSiPI. Considering runtime traffic, ReSiPI is able to dynamically deploy inter-chiplet photonic gateways to improve the overall network congestion. ReSiPI also employs switching elements based on phase change materials (PCMs) to dynamically reconfigure and power-gate the photonic interposer network, thereby improving the network power efficiency. Compared to the best prior state-of-the-art 2.5D photonic network, ReSiPI demonstrates, on average, 37% lower latency, 25% power reduction, and 53% energy minimization in the network.
SESSION: CAD for Confidentiality of Hardware IPS
- Swarup Bhunia
- Amitabh Das
- Saverio Fazzari
- Vivian Kammler
- David Kehlet
- Jeyavijayan Rajendran
- Ankur Srivastava
With growing use of hardware intellectual property (IP) based integrated circuits (IC) design and increasing reliance on a globalized supply chain, the threats to confidentiality of hardware IPs have emerged as major security concerns to the IP producers and owners. These threats are diverse, including reverse engineering (RE), piracy, cloning, and extraction of design secrets, and span different phases of electronics life cycle. The academic research community and the semiconductor industry have made significant efforts over the past decade on developing effective methodologies and CAD tools targeted to protect hardware IPs against these threats. These solutions include watermarking, logic locking, obfuscation, camouflaging, split manufacturing, and hardware redaction. This paper focuses on key topics on confidentiality of hardware IPs encompassing the major threats, protection approaches, security analysis, and metrics. It discusses the strengths and limitations of the major solutions in protecting hardware IPs against the confidentiality attacks, and future directions to address the limitations in the modern supply chain ecosystem.
SESSION: Analyzing Reliability, Defects and Patterning
- Gaurav Rajavendra Reddy
- Kostas Adam
- Kyeonghyeon Baek
- Hyunbum Park
- Suwan Kim
- Kyumyung Choi
- Taewhan Kim
An accurate DRC (design rule check) hotspot prediction at the placement stage is essential in order to reduce the substantial design time spent on iterations of placement and routing. It is known that for implementing chips with advanced technology nodes, (1) pin accessibility and (2) routing congestion are two major causes of DRVs (design rule violations). Though many ML (machine learning) techniques have been proposed to address this prediction problem, it has not been easy to assemble the aggregate data on items 1 and 2 in a unified fashion for training ML models, resulting in considerable accuracy loss in DRC hotspot prediction. This work overcomes this limitation by proposing a novel ML-based DRC hotspot prediction technique, which is able to accurately capture the combined impact of items 1 and 2 on DRC hotspots. Precisely, we devise a graph, called the pin proximity graph, that effectively models the spatial information of cell I/O pins and the pin-to-pin disturbance relations. Then, we propose a new ML model, called PGNN, which tightly combines a GNN (graph neural network) and U-Net such that the GNN embeds pin accessibility information abstracted from our pin proximity graph while the U-Net extracts routing congestion information from grid-based features. Through experiments with a set of benchmark designs using the Nangate 15nm library, our PGNN outperforms the existing ML models on all benchmark designs, achieving on average 7.8~12.5% improvement in F1-score while delivering 5.5× faster inference than the state-of-the-art techniques.
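The combination of a graph branch and a grid branch can be sketched as below (PyTorch; the dimensions, the single-layer GNN stand-in, the single convolution standing in for a U-Net, and the scatter-based fusion are assumptions, not the PGNN architecture itself).

```python
import torch
import torch.nn as nn

class ToyPGNN(nn.Module):
    """Toy fusion of a GNN-style pin embedding with grid-based congestion features,
    in the spirit of the PGNN description above (all dimensions are assumptions)."""
    def __init__(self, pin_dim=8, grid_channels=4, emb_dim=16):
        super().__init__()
        self.pin_mlp = nn.Linear(2 * pin_dim, emb_dim)        # stands in for a GNN layer
        self.unet_stub = nn.Conv2d(grid_channels + emb_dim, 1, kernel_size=3, padding=1)

    def forward(self, pin_feats, adj, grid_feats, pin_to_tile):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = adj @ pin_feats / deg                           # neighbor aggregation
        emb = torch.relu(self.pin_mlp(torch.cat([pin_feats, agg], dim=1)))
        # Scatter pin embeddings onto the placement grid, concatenate them with the
        # routing-congestion features, and predict a per-tile hotspot probability map.
        H, W = grid_feats.shape[-2:]
        canvas = torch.zeros(emb.shape[1], H, W)
        canvas[:, pin_to_tile[:, 0], pin_to_tile[:, 1]] = emb.t()
        fused = torch.cat([grid_feats, canvas], dim=0).unsqueeze(0)
        return torch.sigmoid(self.unet_stub(fused))           # (1, 1, H, W)

model = ToyPGNN()
pins, grid = torch.rand(5, 8), torch.rand(4, 32, 32)
adj = (torch.rand(5, 5) > 0.5).float()
tiles = torch.randint(0, 32, (5, 2))
print(model(pins, adj, grid, tiles).shape)    # torch.Size([1, 1, 32, 32])
```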
- Olympia Axelou
- Nestor Evmorfopoulos
- George Floros
- George Stamoulis
- Sachin S. Sapatnekar
As integrated circuit technologies move below 10 nm, electromigration (EM) has become an issue of great concern for long-term reliability due to stricter performance, thermal, and power requirements. The problem of EM becomes even more pronounced in power grids due to the large unidirectional currents flowing in these structures. In recent years, attention in EM analysis has been drawn to accurate physics-based models describing the interplay between the electron wind force and the back-stress force in a single partial differential equation (PDE) involving wire stress. In this paper, we present a fast semi-analytical approach for the solution of the stress PDE at discrete spatial points in multi-segment lines of power grids, which allows the analytical calculation of EM stress independently at any time in these lines. Our method exploits the specific form of the discrete stress coefficient matrix, whose eigenvalues and eigenvectors are known beforehand. Thus, a closed-form equation can be constructed with almost linear time complexity, without the need for time discretization. This closed-form equation can subsequently be used at any given time in transient stress analysis. Our experimental results, using the industrial IBM power grid benchmarks, demonstrate that our method has excellent accuracy compared to the industrial tool COMSOL while being orders of magnitude faster.
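For context (the notation below is generic and slightly simplified relative to the paper's derivation and boundary treatment), the stress PDE in question is the Korhonen equation

$$ \frac{\partial \sigma}{\partial t} \;=\; \frac{\partial}{\partial x}\!\left[\kappa\!\left(\frac{\partial \sigma}{\partial x} + \frac{eZ\rho}{\Omega}\, j\right)\right], $$

which, after spatial discretization over the line segments, becomes a linear system $\dot{\boldsymbol{\sigma}} = \mathbf{A}\boldsymbol{\sigma} + \mathbf{b}$ with a structured coefficient matrix $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$ whose eigenpairs are known beforehand; the stress at any time $t$ then follows from the closed form $\boldsymbol{\sigma}(t) = \mathbf{V} e^{\boldsymbol{\Lambda} t}\mathbf{V}^{-1}\boldsymbol{\sigma}(0) + \int_0^t \mathbf{V} e^{\boldsymbol{\Lambda}(t-\tau)}\mathbf{V}^{-1}\mathbf{b}\, d\tau$ without any time stepping.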
- Wentian Jin
- Liang Chen
- Subed Lamichhane
- Mohammadamir Kavousi
- Sheldon X.-D. Tan
Electromigration (EM) becomes a major concern for VLSI circuits as technology advances into the nanometer regime. The crux of the problem is to solve the Korhonen partial differential equation, which remains challenging due to the increasing integration density. Recently, scientific machine learning has been explored to solve partial differential equations (PDEs) following breakthrough successes of deep neural networks, and existing approaches such as physics-informed neural networks (PINNs) show promising results for some small PDE problems. However, for large engineering problems like EM analysis of large interconnect trees, it has been shown that the plain PINN does not work well due to the large number of variables. In this work, we propose a novel hierarchical PINN approach, HierPINN-EM, for fast EM-induced stress analysis of multi-segment interconnects. Instead of solving the interconnect tree as a whole, we first solve the EM problem for one wire segment under different boundary and geometrical parameters using supervised learning. Then we apply the unsupervised PINN concept to solve the whole interconnect by enforcing the physics laws at the boundaries of all wire segments. In this way, HierPINN-EM can significantly reduce the number of variables compared to the plain PINN solver. Numerical results on a number of synthetic interconnect trees show that HierPINN-EM can lead to orders of magnitude speedup in training and more than 79× better accuracy over the plain PINN method. Furthermore, HierPINN-EM yields 19% better accuracy with 99% reduction in training cost over the recently proposed Graph Neural Network-based EM solver, EMGraph.
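The second, physics-enforcing stage can be pictured with the toy boundary-consistency loss below (PyTorch; the assumption that each per-segment surrogate returns a (stress, outward-flux) pair at a junction, and the dummy models, are illustrative and not HierPINN-EM's actual formulation).

```python
import torch

def boundary_consistency_loss(boundary_pairs):
    """Illustrative coupling loss: pre-trained per-segment surrogates return
    (stress, outward flux) at a shared junction, and physics is enforced by
    penalizing stress mismatch and non-zero net atom flux at that junction."""
    loss = torch.tensor(0.0)
    for (model_a, x_a), (model_b, x_b) in boundary_pairs:
        s_a, f_a = model_a(x_a)
        s_b, f_b = model_b(x_b)
        loss = loss + (s_a - s_b) ** 2 + (f_a + f_b) ** 2   # continuity + conservation
    return loss

# Two dummy segment surrogates standing in for the supervised stage-one models.
seg1 = lambda x: (0.9 * x, 0.1 * x)
seg2 = lambda x: (0.8 * x, -0.05 * x)
x = torch.tensor(1.0, requires_grad=True)
print(boundary_consistency_loss([((seg1, x), (seg2, x))]))
```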
- Guan-Ting Liu
- Wei-Chen Tai
- Yi-Ting Lin
- Iris Hui-Ru Jiang
- James P. Shiely
- Pu-Jen Cheng
As modern photolithography feature sizes continue to shrink, sub-resolution assist feature (SRAF) generation has become a key resolution enhancement technique to improve the manufacturing process window. State-of-the-art works resort to machine learning to overcome the deficiencies of model-based and rule-based approaches. Nevertheless, these machine learning-based methods do not consider or implicitly consider the optical interference between SRAFs, and highly rely on post-processing to satisfy SRAF mask manufacturing rules. In this paper, we are the first to generate SRAFs using reinforcement learning to address SRAF interference and produce mask-rule-compliant results directly. In this way, our two-phase learning enables us to emulate the style of model-based SRAFs while further improving the process variation (PV) band. A state alignment and action transformation mechanism is proposed to achieve orientation equivariance while expediting the training process. We also propose a transfer learning framework, allowing SRAF generation under different light sources without retraining the model. Compared with state-of-the-art works, our method improves the solution quality in terms of PV band and edge placement error (EPE) while reducing the overall runtime.
SESSION: New Frontier in Verification Technology
- Jyotirmoy Vinay
- Zahra Ghodsi
- I-Wei Chiu
- Xin-Ping Chen
- Jennifer Shueh-Inn Hu
- James Chien-Mo Li
The demand for low-power, high-performance neuromorphic chips is increasing. However, conventional testing is not applicable to neuromorphic chips for three reasons: (1) lack of scan DfT, (2) stochastic characteristics, and (3) configurable functionality. In this paper, we present an automatic test configuration and pattern generation (ATCPG) method for testing a configurable stochastic neuromorphic chip without using scan DfT. We use machine learning to generate test configurations. Then, we apply a modified fast gradient sign method to generate test patterns. Finally, we determine the number of test repetitions using the statistical power of the test. We conduct experiments on one of the neuromorphic architectures, a spiking neural network, to evaluate the effectiveness of our ATCPG. The experimental results show that our ATCPG can achieve 100% fault coverage for the five fault models we use. For testing a 3-layer model at a 0.05 significance level, we produce 5 test configurations and 67 test patterns. The average numbers of test repetitions for neuron faults and synapse faults are 2,124 and 4,557, respectively. In addition, our simulation results show that the overkill matches our significance level.
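For reference, the unmodified fast gradient sign method looks as follows (a generic sketch with a stand-in network and a simple MSE loss; the paper's modified variant and its fault-targeting objective are not reproduced here).

```python
import torch

def fgsm_pattern(model, x, target, eps=0.1):
    """Standard FGSM step: perturb an input pattern in the direction that most
    increases the loss, which can help expose discrepancies between a fault-free
    and a faulty response (the paper applies a modified variant of this idea)."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

# Toy usage with a stand-in "network under test".
net = torch.nn.Linear(4, 2)
pattern = fgsm_pattern(net, torch.rand(1, 4), target=torch.zeros(1, 2))
```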
- Sizhe Zhang
- Mohsen Imani
- Xun Jiao
Brain-inspired hyperdimensional computing (HDC) has demonstrated promising capability in various cognition tasks such as robotics, bio-medical signal analysis, and natural language processing. Compared to deep neural networks, HDC models show advantages such as lightweight models and one/few-shot learning capabilities, making HDC a promising alternative to traditional resource-demanding deep learning models, particularly in edge devices with limited resources. Despite the growing popularity of HDC, the robustness of HDC models and the approaches to enhance it have not been systematically analyzed or sufficiently examined. HDC relies on high-dimensional numerical vectors, referred to as hypervectors (HVs), to perform cognition tasks, and the values inside the HVs are critical to the robustness of an HDC model. We propose ScaleHD, an adaptive scaling method that scales the values of HVs in the associative memory of an HDC model to enhance its robustness. We propose three modes of ScaleHD, namely Global-ScaleHD, Class-ScaleHD, and (Class + Clip)-ScaleHD, which are based on different adaptive scaling strategies. Results show that ScaleHD is able to enhance HDC robustness against memory errors by up to 10,000X. Moreover, we leverage the enhanced HDC robustness in exchange for energy saving via voltage scaling. Experimental results show that ScaleHD can reduce energy consumption of the HDC memory system by up to 72.2% with less than 1% accuracy loss.
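One possible reading of the global scaling mode is sketched below (NumPy; the target range, the rounding step, and the random associative memory are assumptions for illustration, not ScaleHD's published procedure).

```python
import numpy as np

def global_scale(class_hvs, target_max=127):
    """Minimal sketch of a Global-ScaleHD-style step (the class-wise and clipping
    modes would scale per class or saturate outliers): rescale the class
    hypervectors in the associative memory to a narrower value range, the premise
    being that the similarity search becomes less sensitive to memory errors."""
    scale = target_max / np.abs(class_hvs).max()
    return np.round(class_hvs * scale).astype(np.int32)

# Ten classes of 1,000-dimensional class hypervectors with hypothetical values.
rng = np.random.default_rng(0)
associative_memory = rng.integers(-5000, 5000, size=(10, 1000))
print(global_scale(associative_memory).max())
```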
- Chanwook Oh
- Michele Lora
- Pierluigi Nuzzo
This paper proposes an automated framework for quantitative verification and design space exploration of cyber-physical systems in the presence of uncertainty, leveraging assume-guarantee contracts expressed in Stochastic Signal Temporal Logic (StSTL). We introduce quantitative semantics for StSTL and formulations of the quantitative verification and design space exploration problems as bi-level optimization problems. We show that these optimization problems can be effectively solved for a class of stochastic systems and a fragment of bounded-time StSTL formulas. Our algorithm searches for partitions of the upper-level design space such that the solutions of the lower-level problems satisfy the upper-level constraints. A set of optimal parameter values are then selected within these partitions. We illustrate the effectiveness of our framework on the design of a multi-sensor perception system and an automatic cruise control system.
SESSION: Low Power Edge Intelligence
- Dina Hussein
- Taha Belkhouja
- Ganapati Bhat
- Janardhan Rao Doppa
Wearable devices are becoming popular for health and activity monitoring. The machine learning (ML) models for these applications are trained by collecting data in a laboratory with precise control of the experimental settings. However, during real-world deployment/usage, the experimental settings (e.g., sensor position or sampling rate) may deviate from those used during training. This discrepancy can degrade the accuracy and effectiveness of health monitoring applications. Therefore, there is a great need to develop reliable ML approaches that provide high accuracy for real-world deployment. In this paper, we propose a novel statistical optimization approach, referred to as StatOpt, that automatically accounts for real-world disturbances in sensing data to improve the reliability of ML models for wearable devices. We theoretically derive upper bounds on sensor data disturbance for StatOpt to produce an ML model with reliability certificates. We validate StatOpt on two publicly available datasets for human activity recognition. Our results show that, compared to standard ML algorithms, the reliable ML classifiers enabled by the StatOpt approach improve accuracy by up to 50% in real-world settings with zero overhead, while baseline approaches incur significant overhead and fail to achieve comparable accuracy.
- Yang Ni
- Nicholas Lesica
- Fan-Gang Zeng
- Mohsen Imani
Biosignals are collected by several sensors that record time-series information. Since time series contain temporal dependencies, they are difficult for existing machine learning algorithms to process. Hyper-Dimensional Computing (HDC) has been introduced as a brain-inspired paradigm for lightweight time series classification. However, existing HDC algorithms have the following drawbacks: (1) low classification accuracy that comes from linear hyperdimensional representation, (2) lack of real-time learning support due to costly and non-hardware-friendly operations, and (3) inability to build a strong model from partially labeled data.
In this paper, we propose TempHD, a novel hyperdimensional computing method for efficient and accurate biosignal classification. We first develop a novel non-linear hyperdimensional encoding that maps data points into high-dimensional space. Unlike existing HDC solutions that use costly mathematics for encoding, TempHD preserves spatial-temporal information of data in original space before mapping data into high-dimensional space. To obtain the most informative representation, our encoding method considers the non-linear interactions between both spatial sensors and temporally sampled data. Our evaluation shows that TempHD provides higher classification accuracy, significantly higher computation efficiency, and, more importantly, the capability to learn from partially labeled data. We evaluate TempHD effectiveness on noisy EEG data used for a brain-machine interface. Our results show that TempHD achieves, on average, 2.3% higher classification accuracy as well as 7.7× and 21.8× speedup for training and testing time compared to state-of-the-art HDC algorithms, respectively.
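A rough illustration of a spatial-temporal hyperdimensional encoding is given below (NumPy; the dimensionality, the sign non-linearity, and the shift-based temporal binding are generic HDC conventions used for illustration, not TempHD's exact mathematics).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096   # hypervector dimensionality (assumed)

def encode_window(window, sensor_hvs):
    """Illustrative spatial-temporal HDC encoding: each sensor sample modulates
    that sensor's random base hypervector, temporal order is encoded by cyclic
    shifts, and a sign non-linearity makes the result non-linear in the inputs."""
    acc = np.zeros(D)
    for t, sample in enumerate(window):               # window: (T, num_sensors)
        spatial = (sample[:, None] * sensor_hvs).sum(axis=0)
        acc += np.roll(spatial, t)                    # temporal position via shift
    return np.sign(acc)

sensor_hvs = rng.choice([-1.0, 1.0], size=(3, D))     # 3 sensors, bipolar base HVs
window = rng.standard_normal((16, 3))                 # 16 time samples
hv = encode_window(window, sensor_hvs)                # encoded hypervector
```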
- Sahidul Islam
- Shanglin Zhou
- Ran Ran
- Yu-Fang Jin
- Wujie Wen
- Caiwen Ding
- Mimi Xie
IoT devices are increasingly being implemented with neural network models to enable smart applications. Energy harvesting (EH) technology, which harvests energy from the ambient environment, is a promising alternative to batteries for powering those devices due to its low maintenance cost and the wide availability of the energy sources. However, the power provided by the energy harvester is low and has the intrinsic drawback of instability, since it varies with the ambient environment. This paper proposes EVE, an automated machine learning (autoML) co-exploration framework to search for desired multi-models with shared weights for energy harvesting IoT devices. These shared models incur a significantly reduced memory footprint with different levels of model sparsity, latency, and accuracy to adapt to environmental changes. An efficient on-device implementation architecture is further developed to execute each model efficiently on the device. A run-time model extraction algorithm is proposed that retrieves an individual model with negligible overhead when a specific model mode is triggered. Experimental results show that the neural network models generated by EVE are on average 2.5× faster than baseline models without pruning and shared weights.
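The weight-sharing and run-time extraction idea can be pictured with the toy sketch below (NumPy; the mask-based extraction, the mode names, and the tensors are assumptions for illustration, not EVE's implementation).

```python
import numpy as np

def extract_mode(shared_weights, masks, mode):
    """Toy weight-shared multi-model extraction: each mode keeps a different
    sparse subset of one shared weight tensor, so switching modes only applies a
    pre-computed 0/1 mask instead of loading a new model from storage."""
    return {name: w * masks[mode][name] for name, w in shared_weights.items()}

shared = {"fc": np.arange(12.0).reshape(3, 4)}
masks = {
    "low_power": {"fc": (np.arange(12).reshape(3, 4) % 2 == 0).astype(float)},
    "accurate":  {"fc": np.ones((3, 4))},
}
print(extract_mode(shared, masks, "low_power")["fc"])
```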
SESSION: Crossbars, Analog Accelerators for Neural Networks, and Neuromorphic Computing Based on Printed Electronics
- Hussam Amrouch
- Sheldon Tan
We propose a method to design in-memory, energy-efficient, and compact memristor crossbar circuits for implementing decision trees using flow-based computing. We develop a new tool called binary classification graph, which is equivalent to decision trees in accuracy but uses bit values of input features to make decisions instead of thresholds. Our proposed design is resilient to manufacturing errors and can scale to large crossbar sizes due to the utilization of sneak paths in computations. Our design uses zero transistor and one memristor (0T1R) crossbars with only two resistance states of high and low, which makes it resilient to resistance drift and radiation degradation. We test the performance of our designs on multiple standard machine learning datasets and show that our method utilizes circuits of size 5.23 × 10-3 mm2 and uses 20.5 pJ per decision, and outperforms state-of-the-art decision tree acceleration algorithms on these metrics.
- Hanqing Zhu
- Keren Zhu
- Jiaqi Gu
- Harrison Jin
- Ray T. Chen
- Jean Anne Incorvia
- David Z. Pan
Analog computing has been recognized as a promising low-power alternative to digital counterparts for neural network acceleration. However, conventional analog computing is mainly in a mixed-signal manner. Tedious analog/digital (A/D) conversion cost significantly limits the overall system’s energy efficiency. In this work, we devise an efficient analog activation unit with magnetic tunnel junction (MTJ)-based analog content-addressable memory (MACAM), simultaneously realizing nonlinear activation and A/D conversion in a fused fashion. To compensate for the nascent and therefore currently limited representation capability of MACAM, we propose to mix our analog activation unit with digital activation dataflow. A fully differential framework, SuperMixer, is developed to search for an optimized activation workload assignment, adaptive to various activation energy constraints. The effectiveness of our proposed methods is evaluated on a silicon photonic accelerator. Compared to standard activation implementation, our mixed activation system with the searched assignment can achieve competitive accuracy with >60% energy saving on A/D conversion and activation.
- Haibin Zhao
- Michael Hefenbrock
- Michael Beigl
- Mehdi B. Tahoori
Printed electronics allow for ultra-low-cost circuit fabrication with unique properties such as flexibility, non-toxicity, and stretchability. Because of these properties, there is a growing interest in adapting printed electronics for emerging areas such as fast-moving consumer goods and wearable technologies. In such domains, analog signal processing in or near the sensor is favorable. Printed neuromorphic circuits have recently been proposed as a solution to perform such analog processing natively. Additionally, their learning-based design process allows highly efficient optimization and enables them to mitigate the high process variations associated with low-cost printed processes. In this work, we address the aging of the printed components, an effect that can significantly degrade the accuracy of printed neuromorphic circuits over time. To this end, we develop a stochastic aging model to describe the behavior of aged printed resistors and modify the training objective by considering the expected loss over the lifetime of the device. This approach ensures acceptable accuracy over the device lifetime. Our experiments show that an overall 35.8% improvement in expected accuracy over the device lifetime can be achieved using the proposed learning approach.
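In generic terms (the notation is illustrative, not the paper's exact formulation), such an aging-aware training objective takes the form

$$ \min_{\theta}\; \mathbb{E}_{t \sim \mathcal{U}(0,\,T_{\mathrm{life}})}\; \mathbb{E}_{\tilde{R} \sim p_{\mathrm{age}}(\,\cdot \mid R(\theta),\, t)} \Big[ \mathcal{L}\big(f_{\tilde{R}}(x),\, y\big) \Big], $$

where $R(\theta)$ are the nominal printed resistances produced by the learnable design parameters $\theta$, $p_{\mathrm{age}}$ is the stochastic aging model describing how those resistances drift by age $t$, and $f_{\tilde{R}}$ is the circuit response; in practice the expectations would be approximated by Monte-Carlo samples of ages and drifted resistance values during training.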
SESSION: Designing DNN Accelerators
- Naebeom Park
- Daehyun Ahn
- Jae-Joon Kim
Graph attention networks (GATs) are gaining attention for various transductive and inductive graph processing tasks due to their higher accuracy compared to conventional graph convolutional networks (GCNs). The power-law distribution of real-world graph-structured data, on the other hand, causes a severe workload imbalance problem for GAT accelerators. To reduce the degradation of PE utilization due to this workload imbalance, we present algorithm/hardware co-design results for a GAT accelerator that balances the workload assigned to processing elements by allowing only K neighbor nodes to participate in the aggregation phase. The proposed model selects the K neighbor nodes with the highest attention scores, which represent the relevance between two nodes, to minimize the accuracy drop. Experimental results show that our algorithm/hardware co-design of the GAT accelerator achieves higher processing speed and energy efficiency than GAT accelerators using conventional workload balancing techniques. Furthermore, we demonstrate that the proposed GAT accelerators can be made faster than GCN accelerators, which typically involve fewer computations.
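The top-K aggregation idea can be sketched as follows (PyTorch; a software-level simplification with precomputed attention scores and adjacency lists, not the paper's accelerator dataflow).

```python
import torch

def topk_attention_aggregate(h, attn_scores, neighbors, k):
    """Only the K neighbors with the highest attention scores contribute to each
    node's aggregation, so every processing element handles at most K partial
    sums per node (illustrative simplification of the co-design above)."""
    out = torch.zeros_like(h)
    for v, nbrs in enumerate(neighbors):
        if not nbrs:
            continue
        scores = attn_scores[v, nbrs]
        keep = torch.topk(scores, k=min(k, len(nbrs))).indices
        chosen = torch.tensor(nbrs)[keep]
        w = torch.softmax(scores[keep], dim=0)
        out[v] = (w.unsqueeze(1) * h[chosen]).sum(dim=0)
    return out

h = torch.rand(4, 8)                               # node features
attn = torch.rand(4, 4)                            # precomputed attention scores
neighbors = [[1, 2, 3], [0, 2], [0, 1, 3], [2]]    # adjacency lists
print(topk_attention_aggregate(h, attn, neighbors, k=2).shape)
```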
- Hyein Shin
- Myeonggu Kang
- Lee-Sup Kim
A severe read disturbance problem degrades the inference accuracy of a resistive RAM (ReRAM) based deep neural network (DNN) accelerator. Refresh, which reprograms the ReRAM cells, is the most obvious solution for the problem, but programming ReRAM consumes huge energy. To address the issue, we first analyze the resistance drift pattern of each conductance state and the actual read stress applied to the ReRAM array by considering the characteristics of ReRAM-based DNN accelerators. Based on the analysis, we cluster ReRAM cells into a few groups for each layer of DNN and generate a proper refresh cycle for each group in the offline phase. The individual refresh cycles reduce energy consumption by reducing the number of unnecessary refresh operations. In the online phase, the refresh controller selectively launches refresh operations according to the generated refresh cycles. ReRAM cells are selectively refreshed by minimally modifying the conventional structure of the ReRAM-based DNN accelerator. The proposed work successfully resolves the read disturbance problem by reducing 97% of the energy consumption for the refresh operation while preserving inference accuracy.
- Shehzeen Hussain
- Nojan Sheybani
- Paarth Neekhara
- Xinqiao Zhang
- Javier Duarte
- Farinaz Koushanfar
Steganography and digital watermarking are the tasks of hiding recoverable data in image pixels. Deep neural network (DNN) based image steganography and watermarking techniques are quickly replacing traditional hand-engineered pipelines. DNN-based watermarking techniques have drastically improved the message capacity, imperceptibility, and robustness of the embedded watermarks. However, this improvement comes at the cost of increased computational overhead of the watermark encoder neural network. In this work, we design FastStamp, the first accelerator platform to perform DNN-based steganography and digital watermarking of images in hardware. We first propose a parameter-efficient DNN model for embedding recoverable bit-strings in image pixels. Our proposed model can match the success metrics of prior state-of-the-art DNN-based watermarking methods while being significantly faster and lighter in terms of memory footprint. We then design an FPGA-based accelerator framework to further improve the model throughput and power consumption by leveraging data parallelism and customized computation paths. FastStamp allows embedding hardware signatures into images to establish media authenticity and ownership of digital media. Our best design achieves 68× faster inference compared to GPU implementations of prior DNN-based watermark encoders while consuming less power.
SESSION: Novel Chiplet Approaches from Interconnect to System (Virtual)
- Fuping Li
- Ying Wang
- Yuanqing Cheng
- Yujie Wang
- Yinhe Han
- Huawei Li
- Xiaowei Li
2.5D chiplet technology is gaining popularity for the efficiency of integrating multiple heterogeneous dies or chiplets on interposers, and it is also considered an ideal option for agile silicon system design as it mitigates the huge design, verification, and manufacturing overhead of monolithic SoCs. Although it significantly reduces development costs through chiplet reuse, the design and fabrication of interposers also introduce additional, high non-recurring engineering (NRE) costs and longer development cycles, which might be prohibitive for low-volume, application-specific designs.
To address this challenge, in this paper, we propose a reusable general interposer architecture (GIA) to amortize NRE costs and accelerate integration flows of interposers across different chiplet-based systems effectively. The proposed assembly-time configurable interposer architecture covers both active interposers and passive interposers considering diverse applications of 2.5D systems. The agile interposer integration is also facilitated by a novel end-to-end design automation framework to generate optimal system assembly configurations including the selection of chiplets, inter-chiplet network configuration, placement of chiplets, and mapping on GIA, which are specialized for the given target workload. The experimental results show that our proposed active GIA and passive GIA achieve 3.15x and 60.92x performance boost with 2.57x and 2.99x power saving over baselines respectively.
- Chengeng Li
- Fan Jiang
- Shixi Chen
- Jiaxu Zhang
- Yinyi Liu
- Yuxiang Fu
- Jiang Xu
Cache coherence overhead in manycore systems is becoming prominent as system scale increases. However, traditional electrical networks restrict the efficiency of cache coherence transactions due to their limited bandwidth and long latency. Optical networks promise high bandwidth and low latency and support both efficient unicast and multicast transmission, which can potentially accelerate cache coherence in manycore systems. This work proposes PCCN, a novel photonic cache coherence network with a physically centralized, logically distributed directory for chiplet-based manycore systems. PCCN adopts a channel sharing method with a contention-resolution mechanism for efficient long-distance transmission of coherence-related packets. Experimental results show that, compared to state-of-the-art proposals, PCCN speeds up application execution by 1.32×, reduces memory access latency by 26%, and improves energy efficiency by 1.26×, on average, in a 128-core system.
- Qian Wei
- Zhaoyan Shen
- Yiheng Tong
- Zhiping Jia
- Lei Ju
- Jiezhi Chen
- Bingzhe Li
Log-structured merge (LSM) tree based key-value (KV) stores organize writes into hierarchical batches for high-speed writing. However, the notorious compaction process of the LSM-tree severely hurts system performance: it not only involves huge I/O operations but also consumes tremendous computation and memory resources. In this paper, we first find that when compaction happens in the high levels (i.e., L0, L1) of the LSM-tree, it may saturate all system computation and memory resources and eventually stall the whole system. Based on this observation, we present Re-LSM, a ReRAM-based Processing-in-Memory (PIM) framework for LSM-based key-value stores. Specifically, in Re-LSM, we propose to offload certain computation- and memory-intensive tasks in the high levels of the LSM-tree to the ReRAM-based PIM space. A highly parallel ReRAM compaction accelerator is designed by decomposing the three-phase compaction into basic logic operating units. Evaluation results based on db_bench and YCSB show that Re-LSM achieves a 2.2× improvement in random-write throughput over RocksDB, and the ReRAM-based compaction accelerator speeds up the CPU-based implementation by 64.3× while reducing energy consumption by 25.5×.
SESSION: Architecture for DNN Acceleration (Virtual)
- Yiming Chen
- Guodong Yin
- Mingyen Lee
- Wenjun Tang
- Zekun Yang
- Yongpan Liu
- Huazhong Yang
- Xueqing Li
Motivated by reducing the data transfer activities in data-intensive neural network computing, SRAM-based compute-in-memory (CiM) has made significant progress. Unfortunately, SRAM has low density and limited on-chip capacity, which makes the deployment of large models inefficient due to the frequent DRAM accesses needed to update the weights in SRAM. Recently, a ROM-based CiM design, YOLoC, revealed the unique opportunity of deploying a large-scale neural network in CMOS by exploiting the intriguing high density of ROM. However, even though an assisting SRAM has been adopted in YOLoC for task transfer within the same domain, it remains a big challenge to overcome the read-only limitation of ROM and enable more flexibility. Therefore, it is of paramount significance to develop new ROM-based CiM architectures that provide a broader task space and model expansion capability for more complex tasks.
This paper presents Hidden-ROM for high flexibility of ROM-based CiM. Hidden-ROM provides several novel ideas beyond YOLoC. First, it adopts a one-SRAM-many-ROM method that “hides” ROM cells to support various datasets of different domains, including CIFAR10/100, FER2013, and ImageNet. Second, Hidden-ROM provides the model expansion capability after chip fabrication to update the model for more complex tasks when needed. Experiments show that Hidden-ROM designed for ResNet-18 pretrained on CIFAR100 (item classification) can achieve <0.5% accuracy loss in FER2013 (facial expression recognition), while YOLoC degrades by >40%. After expanding to ResNet-50/101, Hidden-ROM even achieves 68.6%/72.3% accuracy in ImageNet, close to 74.9%/76.4% by software. Such expansion costs only 7.6%/12.7% energy efficiency overhead while providing 12%/16% accuracy improvement after expansion.
- Yikan Qiu
- Yufei Ma
- Wentao Zhao
- Meng Wu
- Le Ye
- Ru Huang
Computing-in-memory (CIM) is emerging as a promising architecture to accelerate graph convolutional networks (GCNs), which are normally bounded by redundant and irregular memory transactions. Current analog-based CIM requires frequent analog-to-digital and digital-to-analog conversions (AD/DA) that dominate the overall area and power consumption. Furthermore, analog non-idealities degrade the accuracy and reliability of CIM. In this work, an SRAM-based digital CIM system, namely DCIM-GCN, is proposed to accelerate memory-intensive GCNs, covering innovations from the CIM circuit level, which eliminates costly AD/DA converters, to the architecture level, which addresses the irregularity and sparsity of graph data. DCIM-GCN achieves 2.07×, 1.76×, and 1.89× speedup and 29.98×, 1.29×, and 3.73× energy efficiency improvement on average over the CIM-based accelerators PIMGCN, TARe, and PIM-GCN, respectively.
- Jun Li
- Wei Wang
- Wu-Jun Li
Existing deep neural network (DNN) accelerator design automation (ADA) methods adopt architecture templates to predetermine parts of the design choices and then explore the remaining design choices beyond the templates. These templates can be classified into intra-PU templates and inter-PU templates according to the architecture hierarchy. Since templates limit the flexibility of ADA, designing effective ADA methods without templates has become an important research topic. Although some works have enhanced the flexibility of ADA by removing intra-PU templates, to the best of our knowledge no existing work has studied ADA methods without inter-PU templates. ADA with predetermined inter-PU templates is typically inefficient in terms of resource utilization, especially for DNNs with complex topology. In this paper, we propose a novel method, called hardware computation graph (HCG), for ADA without inter-PU templates. Experiments show that the HCG method achieves competitive latency while using 1.4×–5× less on-chip memory compared with existing state-of-the-art ADA methods.
SESSION: Multi-Purpose Fundamental Digital Design Improvements (Virtual)
- Nikolaos Zompakis
- Sotirios Xydis
This paper introduces an innovative post-implementation Dynamic Frequency Boosting (DFB) technique to release the “hidden” performance margins of digital circuit designs that are currently suppressed by typical critical-path-constrained design flows, thus defining higher limits of operation speed. The proposed technique goes beyond the state of the art and exploits the data-driven path delay variability, incorporating an innovative hardware clocking mechanism that detects path activation in real time. In contrast to timing speculation, the operating speed is adjusted based on the nominal delay of the activated paths, achieving error-free acceleration. The proposed technique has been evaluated on three FPGA-based use cases carefully selected to exhibit differing domain characteristics: i) a third-party DNN inference accelerator IP for CIFAR-10 images, achieving an average speedup of 18%; ii) a highly designer-optimized Optical Digital Equalizer design, in which DFB delivered a speedup of 50%; and iii) a set of 5 synthetic designs examining high-frequency (beyond 400 MHz) applications in FPGAs, achieving accelerations of 20–60% depending on the underlying path variability.
Probability propagation is an important task in logic network analysis, which propagates signal probabilities from the primary inputs to the primary outputs of a network. It has many applications, such as power estimation, reliability analysis, and error analysis for approximate circuits. Existing methods for the task can be divided into two categories: simulation-based and probability-based methods. However, most of them suffer from low accuracy or poor scalability. In this work, we propose ASPPLN, a method for accelerated symbolic probability propagation in logic networks, which has linear complexity in the network size. We first introduce a new graph concept called the redundant input and take advantage of it to simplify the propagation process without losing accuracy. Then, a technique called symbol limitation is proposed to limit the complexity of each node's propagation according to the partial probability significances of the symbols. The experimental results show that, compared to existing methods, ASPPLN improves the estimation accuracy of switching activity by up to 24.70% while also achieving a speedup of up to 29×.
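For readers unfamiliar with the underlying task, the sketch below shows plain (non-symbolic) signal-probability propagation over a topologically sorted netlist under the usual independence assumption at every gate. The gate list and probabilities are made up, and ASPPLN's symbolic machinery (redundant-input removal, symbol limitation) is deliberately omitted.

```python
# Minimal sketch of plain signal-probability propagation on a topologically
# sorted gate-level netlist (inputs of every gate assumed independent).

def propagate_probabilities(inputs, gates):
    """inputs: {name: P(signal = 1)};
    gates: list of (out, op, fanins) in topological order, op in {AND, OR, NOT}."""
    p = dict(inputs)
    for out, op, fanins in gates:
        if op == "NOT":
            p[out] = 1.0 - p[fanins[0]]
        elif op == "AND":
            prob = 1.0
            for f in fanins:
                prob *= p[f]
            p[out] = prob
        elif op == "OR":
            prob = 1.0
            for f in fanins:
                prob *= (1.0 - p[f])
            p[out] = 1.0 - prob
    return p

probs = propagate_probabilities(
    {"a": 0.5, "b": 0.5, "c": 0.2},
    [("n1", "AND", ["a", "b"]),    # P = 0.25
     ("n2", "NOT", ["c"]),         # P = 0.8
     ("y",  "OR",  ["n1", "n2"])]) # P = 1 - 0.75*0.2 = 0.85
print(probs["y"])
```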
- Longlong Yang
- Cuiyang Ding
- Changhao Yan
- Dian Zhou
- Xuan Zeng
In this work, we propose a path integral random walk (PIRW) solver, the first accurate stochastic method for steady-state thermal analysis with mixed boundary conditions, especially those involving Fourier heat transfer (Robin) boundary conditions. We innovatively adopt the strictly correct calculation of the local time and the Feynman–Kac functional e_c(t) to handle Neumann and Robin boundary conditions with high precision. Compared with ANSYS, experimental results show that PIRW achieves over 121× speedup and over 83× storage space reduction with a negligible error within 0.8 °C at a single point. An application combining PIRW with low-accuracy ANSYS for temperature calculation at hot spots is provided as a more accurate and faster solution than using ANSYS alone.
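To give a feel for the stochastic approach, here is a minimal grid-based random-walk estimator for a Dirichlet-only steady-state problem. It illustrates the general walk-to-the-boundary idea, not PIRW's local-time and Feynman–Kac treatment of Neumann and Robin boundaries; the grid and boundary temperatures are invented.

```python
# Minimal sketch of the random-walk idea behind stochastic steady-state
# solvers: on a uniform grid with Dirichlet boundaries only, the temperature
# at a point equals the expected boundary temperature hit by a symmetric
# random walk.
import random

def walk_estimate(x, y, n, boundary_temp, n_walks=20000, seed=1):
    """Estimate steady-state temperature at interior grid point (x, y) of an
    n-by-n grid whose boundary temperature is given by boundary_temp(i, j)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walks):
        i, j = x, y
        while 0 < i < n - 1 and 0 < j < n - 1:
            di, dj = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            i, j = i + di, j + dj
        total += boundary_temp(i, j)
    return total / n_walks

# One hot edge at 100, the rest at 25: the estimate lies between the two.
hot_left = lambda i, j: 100.0 if j == 0 else 25.0
print(walk_estimate(8, 8, 17, hot_left))
```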
SESSION: GPU Acceleration for Routing Algorithms (Virtual)
- Shiju Lin
- Martin D. F. Wong
Global routing is an essential step in physical design. Recently, several works have accelerated global routers using GPUs. However, they only focus on certain stages of global routing and achieve limited overall speedup. In this paper, we present a superfast full-scale GPU-accelerated global router and introduce useful parallelization techniques for routing. Experiments show that our 3D router achieves both good quality and short runtime compared to other state-of-the-art academic global routers.
- Zhuolun He
- Yuzhe Ma
- Bei Yu
Design rule checking (DRC) is essential in physical verification to ensure high yield and reliability for VLSI circuit designs. To achieve reasonable design cycle time, acceleration for computationally intensive DRC tasks has been demanded to accommodate the ever-growing complexity of modern VLSI circuits. In this paper, we propose X-Check, a GPU-accelerated design rule checker. X-Check integrates novel parallel sweepline algorithms, which are both efficient in practice and with nontrivial theoretical guarantees. Experimental results have demonstrated significant speedup achieved by X-Check compared with a multi-threaded CPU checker.
- Zizheng Guo
- Feng Gu
- Yibo Lin
Rectilinear Steiner minimum tree (RSMT) generation is a fundamental component in the VLSI design automation flow. Due to its extensive usage in circuit design iterations at early design stages like synthesis, placement, and routing, the performance of RSMT generation is critical for a reasonable design turnaround time. State-of-the-art RSMT generation algorithms, like fast look-up table estimation (FLUTE), are constrained by CPU-based parallelism with limited runtime improvements. The acceleration of RSMT on GPUs is an important yet difficult task, due to the complex and non-trivial divide-and-conquer computation patterns with recursions. In this paper, we present the first GPU-accelerated RSMT generation algorithm based on FLUTE. By designing GPU-efficient data structures and levelized decomposition, table look-up, and merging operations, we incorporate large-scale data parallelism into the generation of Steiner trees. An up to 10.47× runtime speed-up has been achieved compared with FLUTE running on 40 CPU cores, filling in a critical missing component in today’s GPU-accelerated design automation framework.
SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist I (Virtual)
- Christian Pilato
- Miroslav Velev
- Ruifan Xu
- Youwei Xiao
- Jin Luo
- Yun Liang
Hardware synthesis requires a complicated process to generate synthesizable register transfer level (RTL) code. High-level synthesis tools can automatically transform a high-level description into a hardware design, while hardware generators adopt domain-specific languages and synthesis flows for specific applications. The implementation of these tools generally requires substantial engineering effort due to RTL's weak expressivity and low level of abstraction. Furthermore, different synthesis tools adopt different levels of intermediate representations (IRs) and transformations. A unified IR is clearly a good way to lower the engineering cost and rapidly obtain competitive hardware designs by exploring different synthesis methodologies.
In this paper, we propose Hector, a two-level IR providing a unified intermediate representation for hardware synthesis methodologies. The high-level IR binds computation with a control graph annotated with timing information, while the low-level IR provides a concise way to describe hardware modules and elastic interconnections among them. Implemented based on the multi-level compiler infrastructure (MLIR), Hector’s IRs can be converted to synthesizable RTL designs. To demonstrate the expressivity and versatility, we implement three synthesis approaches based on Hector: a high-level synthesis (HLS) tool, a systolic array generator, and a hardware accelerator. The hardware generated by Hector’s HLS approach is comparable to that generated by the state-of-the-art HLS tools, and the other two cases outperform HLS implementations in performance and productivity.
- Mingyu Chen
- Yu Zhang
- Yongshang Li
- Zhen Wang
- Jun Li
- Xiangyang Li
Due to multiple limitations of quantum computers in the NISQ era, quantum compilation efforts are required to efficiently execute quantum algorithms on NISQ devices. Program rewriting based on pattern matching can improve the generalization ability of compiler optimization. However, it has rarely been explored for quantum circuit optimization, let alone while further considering the physical features of target devices.
In this paper, we propose QCIR, a pattern-matching-based quantum circuit optimization framework with a novel pattern description format, enabling a user-configured cost model and two categories of patterns, i.e., generic patterns and folding patterns. To reduce compilation latency, we propose a DAG representation of quantum circuits called QCir-DAG, and the QVF algorithm for subcircuit matching. We implement a continuous single-qubit optimization pass constructed with QCIR, achieving 10% and 20% optimization rates for benchmarks from Qiskit and ScaffCC, respectively. The practicality of QCIR is demonstrated by its execution time and by experimental results on quantum simulators and quantum devices.
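As a toy illustration of what pattern-based circuit rewriting means, the sketch below cancels adjacent self-inverse single-qubit gates acting on the same qubit (e.g., H;H reduces to the identity). It is a deliberately simple peephole pass on a flat gate list; QCIR's QCir-DAG representation, QVF subcircuit matching, and cost models are far more general.

```python
# Toy pattern-based rewrite: cancel immediately adjacent self-inverse
# single-qubit gates on the same qubit.
SELF_INVERSE = {"H", "X", "Y", "Z"}

def cancel_adjacent(circuit):
    """circuit: list of (gate, qubit) tuples for single-qubit gates."""
    out = []
    for gate, qubit in circuit:
        if out and out[-1] == (gate, qubit) and gate in SELF_INVERSE:
            out.pop()          # the pair multiplies to the identity
        else:
            out.append((gate, qubit))
    return out

print(cancel_adjacent([("H", 0), ("H", 0), ("X", 1), ("Z", 1), ("Z", 1)]))
# -> [('X', 1)]
```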
- Chang Feng
- Wenlong Lyu
- Zhitang Chen
- Junjie Ye
- Mingxuan Yuan
- Jianye Hao
During the logic synthesis flow of EDA, a sequence of graph transformation operators is applied to the circuit, so the Quality of Results (QoR) of the circuit highly depends on the chosen operators and their specific parameters in the sequence, making the search space operator-dependent and exponentially large. In this paper, we formulate logic synthesis design space exploration as a conditional sequence optimization problem, where at each transformation step an optimization operator is selected and its corresponding parameters are decided. To solve this problem, we propose a novel sequential black-box optimization approach without human intervention: 1) Due to the conditional and sequential structure of operator sequences with variable length, we build a recurrent neural network with embedding alignment cells as a surrogate model that estimates the QoR of the logic synthesis flow from historical data. 2) With the surrogate model, we construct acquisition functions to balance exploration and exploitation with respect to each metric of the QoR. 3) We use a multi-objective optimization algorithm to find the Pareto front of the acquisition functions, along which a batch of sequences, consisting of parameterized operators, is (randomly) selected for user evaluation under the available computing budget. We repeat the above three steps until convergence or the time limit. Experimental results on the public EPFL benchmarks demonstrate the superiority of our approach over expert-crafted optimization flows and other machine learning based methods. Compared to resyn2, we achieve an 11.8% reduction in LUT-6 count without sacrificing level values.
- Xinyi Zhou
- Junjie Ye
- Chak-Wa Pui
- Kun Shao
- Guangliang Zhang
- Bin Wang
- Jianye Hao
- Guangyong Chen
- Pheng Ann Heng
Gate Sizing is an important step in logic synthesis, where the cells are resized to optimize metrics such as area, timing, power, leakage, etc. In this work, we consider the gate sizing problem for leakage power optimization with timing constraints. Lagrangian Relaxation is a widely employed optimization method for gate sizing problems. We accelerate Lagrangian Relaxation-based algorithms by narrowing down the range of cells to resize. In particular, we formulate a heterogeneous directed graph to represent the timing graph, propose a heterogeneous graph neural network as the encoder, and train in the way of imitation learning to mimic the selection behavior of each iteration in Lagrangian Relaxation. This network is used to predict the set of cells that need to be changed during the optimization process of Lagrangian Relaxation. Experiments show that our accelerated gate sizer could achieve comparable performance to the baseline with an average of 22.5% runtime reduction.
SESSION: Smart Search (Virtual)
- Huihong Shi
- Haoran You
- Yang Zhao
- Zhongfeng Wang
- Yingyan Lin
Multiplication is arguably the most cost-dominant operation in modern deep neural networks (DNNs), limiting their achievable efficiency and thus more extensive deployment in resource-constrained applications. To tackle this limitation, pioneering works have developed handcrafted multiplication-free DNNs, which require expert knowledge and time-consuming manual iteration, calling for fast development tools. To this end, we propose a Neural Architecture Search and Acceleration framework dubbed NASA, which enables automated multiplication-reduced DNN development and integrates a dedicated multiplication-reduced accelerator for boosting DNNs’ achievable efficiency. Specifically, NASA adopts neural architecture search (NAS) spaces that augment the state-of-the-art one with hardware inspired multiplication-free operators, such as shift and adder, armed with a novel progressive pretrain strategy (PGP) together with customized training recipes to automatically search for optimal multiplication-reduced DNNs; On top of that, NASA further develops a dedicated accelerator, which advocates a chunk-based template and auto-mapper dedicated for NASA-NAS resulting DNNs to better leverage their algorithmic properties for boosting hardware efficiency. Experimental results and ablation studies consistently validate the advantages of NASA’s algorithm-hardware co-design framework in terms of achievable accuracy and efficiency tradeoffs. Codes are available at https://github.com/shihuihong214/NASA.
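As a concrete picture of what a hardware-inspired multiplication-free operator looks like, the toy sketch below replaces a multiply by a shift after quantizing the weight to a signed power of two. This is a generic illustration, not NASA's actual shift/adder operators, search space, or PGP training recipe.

```python
# Toy multiplication-free operator: quantize a weight to a signed power of
# two so that multiplying an integer activation reduces to a bit shift.
import math

def quantize_to_power_of_two(w):
    """Return (sign, exponent) such that w ~= sign * 2**exponent."""
    sign = -1 if w < 0 else 1
    exponent = round(math.log2(abs(w)))
    return sign, exponent

def shift_mul(x, w):
    """Approximate x * w for integer x using only a shift (and a sign flip)."""
    if w == 0:
        return 0
    sign, exp = quantize_to_power_of_two(w)
    return sign * (x << exp) if exp >= 0 else sign * (x >> -exp)

print(shift_mul(12, 4.0), shift_mul(12, -0.5), shift_mul(12, 3.0))  # 48, -6, 48 (~36)
```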
Federated learning (FL), a new distributed technology, allows us to train the global model on the edge and embedded devices without local data sharing. However, due to the wide distribution of different types of devices, FL faces severe heterogeneity issues. The accuracy and efficiency of FL deployment at the edge are severely impacted by heterogeneous data and heterogeneous systems. In this paper, we perform joint FL model personalization for heterogeneous systems and heterogeneous data to address the challenges posed by heterogeneities. We begin by using model inference efficiency as a starting point to personalize network scale on each node. Furthermore, it can be used to guide the efficient FL training process, which can help to ease the problem of straggler devices and improve FL’s energy efficiency. During FL training, federated search is then used to acquire highly accurate personalized network structures. By taking into account the unique characteristics of FL deployment at edge devices, the personalized network structures obtained by our federated search framework with a lightweight search controller can achieve competitive accuracy with state-of-the-art (SOTA) methods, while reducing inference and training energy consumption by up to 3.57× and 1.82×, respectively.
SESSION: Reconfigurable Computing: Accelerators and Methodologies I (Virtual)
- Keqi Fu
- Zhi Qi
- Jiaxuan Cai
- Xulong Shi
As the extreme case of quantized networks, Binary Neural Networks (BNNs) have received tremendous attention due to their many hardware-friendly properties in terms of storage and computation. To reach the limit of compact models, we attempt to combine binarization with pruning techniques, further exploring the redundancy of BNNs. However, coarse-grained pruning methods may cause severe accuracy drops, while traditional fine-grained ones induce irregular sparsity that is hard for hardware to exploit. In this paper, we propose two advanced fine-grained BNN pruning modules, i.e., structured channel-wise kernel pruning and dynamic spatial pruning, from a joint perspective of algorithm and hardware. The pruned BNN models are trained from scratch and present not only higher precision but also a high degree of parallelism. We then develop an accelerator architecture that can effectively exploit the sparsity produced by our algorithm. Finally, we implement the pruned BNN models on an embedded FPGA (Ultra96v2). The results show that our software and hardware co-design achieves a 5.4× inference speedup over the baseline BNN, with higher resource and energy efficiency compared with prior FPGA-implemented BNN works.
- Yan Zhuang
- Zhihao Zhang
- Dajiang Liu
Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising solution for accelerating domain applications due to their good combination of energy efficiency and flexibility. Loops, as the computation-intensive parts of applications, are often mapped onto CGRAs, and modulo scheduling is commonly used to improve execution performance. However, the actual performance obtained with modulo scheduling is highly dependent on the mapping ability of the Data Dependency Graph (DDG) extracted from a loop. As existing approaches usually separate the routing exploration of multi-cycle dependences from mapping for fast compilation, they may easily suffer from poor mapping quality. In this paper, we integrate the routing exploration into the mapping process, giving it more opportunities to find a globally optimized solution. Meanwhile, by defining a reduced resource graph, the search space of the new mapping problem is not greatly increased. To solve the problem efficiently, we introduce graph neural network based reinforcement learning to predict a placement distribution over the different resource nodes for all operations in a DDG. Using the routing connectivity as the reward signal, we optimize the parameters of the neural network with a policy gradient method to find a valid mapping solution. Without much engineering or heuristic design, our approach achieves 1.57× mapping quality compared to the state-of-the-art heuristic.
SESSION: Hardware Security: Attacks and Countermeasures (Virtual)
- Johann Knechtel
- Lejla Batina
- Zili Kou
- Sharad Sinha
- Wenjian He
- Wei Zhang
Eviction-based cache side-channel attacks take advantage of inclusive cache hierarchies and shared cache hardware. Processors with the template ARM big.LITTLE architecture do not guarantee such preconditions and therefore do not usually allow cross-core attacks, let alone cross-cluster attacks. This work reveals a new side channel based on the snoop filter (SF), an unexplored directory structure embedded in template ARM big.LITTLE processors. Our systematic reverse engineering unveils the undocumented structure and properties of the SF, and we successfully utilize it to bootstrap cross-core and cross-cluster cache eviction. We demonstrate a comprehensive methodology to exploit the SF side channel, including the construction of eviction sets, a covert channel, and attacks against RSA and AES. When attacking TrustZone, we conduct an interrupt-based side-channel attack to extract the RSA key from a single profiling trace, despite the strict cache-clean defense. Supported by detailed experiments, the SF side channel not only achieves competitive performance but also overcomes the main challenge of cache side-channel attacks on ARM big.LITTLE processors.
- Rajat Sadhukhan
- Sayandeep Saha
- Debdeep Mukhopadhyay
Fault Attacks (FA) have gained a lot of attention from both industry and academia due to their practicality and wide applicability to different domains of computing. In the context of symmetric-key cryptography, designing countermeasures against FA is still an open problem. Recently proposed attacks such as Statistical Ineffective Fault Analysis (SIFA) have shown that merely adding redundancy- or infection-based countermeasures to detect faults does not work, and that a proper combination of masking and error correction/detection is required. In this work, we show that masking, which is mathematically established as a good countermeasure against a certain class of SIFA faults, may fall short in practice if low-level details during physical design layout development are not taken care of. We initiate this study by demonstrating a successful SIFA attack on a post-place-and-route masked crypto design for an ASIC platform. We then propose a fully automated approach, along with a proper choice of placement constraints, that can be realized easily in any commercial CAD tool to eliminate this vulnerability during the physical layout development process. Experimental validation of our tool flow on a masked implementation of the PRESENT cipher establishes our claim.
SESSION: Advanced VLSI Routing and Layout Learning
- Wing-Kai Chow
- David Chinnery
- Rongjian Liang
- Hua Xiang
- Jinwook Jung
- Jiang Hu
- Gi-Joon Nam
Deep learning is a promising approach to early DRV (Design Rule Violation) prediction. However, non-deterministic parallel routing hampers model training and degrades prediction accuracy. In this work, we propose a stochastic approach, called LGC-Net, to solve this problem. In this approach, we develop new techniques of Gaussian random field layer and focal likelihood loss function to seamlessly integrate Log Gaussian Cox process with deep learning. This approach provides not only statistical regression results but also classification ones with different thresholds without retraining. Experimental results with noisy training data on industrial designs demonstrate that LGC-Net achieves significantly better accuracy of DRV density prediction than prior arts.
- Yen-Ting Chen
- Yao-Wen Chang
In advanced packages, redistribution layers (RDLs) are extra metal layers providing high-density interconnections among the chips and the printed circuit board (PCB). To better utilize the routing resources of RDLs, published works adopt flexible vias so that vias can be placed anywhere. Furthermore, some regions may be blocked for signal integrity protection or by manually prerouted nets (such as power/ground nets or antenna feeding lines) to achieve higher performance. These blocked regions are treated as obstacles in the routing process. Since the positions of pads, obstacles, and vias can be arbitrary, the structures of RDLs become irregular. The obstacles and irregular structures substantially increase the difficulty of routing. This paper proposes a three-stage algorithm: First, the layout is partitioned by a method based on constrained Delaunay triangulation (CDT). Then we present a global routing graph model and generate routing guides for unified-assignment netlists. Finally, a novel tile routing method is developed to obtain detailed routes. Experimental results demonstrate the robustness and effectiveness of our proposed algorithm.
- Keren Zhu
- Hao Chen
- Walker J. Turner
- George F. Kokai
- Po-Hsuan Wei
- David Z. Pan
- Haoxing Ren
Analog and mixed-signal (AMS) circuit designs still rely on human design expertise. Machine learning has been assisting circuit design automation by replacing human experience with artificial intelligence. This paper presents TAG, a new paradigm of learning the circuit representation from layouts leveraging Text, self Attention and Graph. The embedding network model learns spatial information without manual labeling. We introduce text embedding and a self-attention mechanism to AMS circuit learning. Experimental results demonstrate the ability to predict layout distances between instances with industrial FinFET technology benchmarks. The effectiveness of the circuit representation is verified by showing the transferability to three other learning tasks with limited data in the case studies: layout matching prediction, wirelength estimation, and net parasitic capacitance prediction.
SESSION: Physical Attacks and Countermeasures
- Huifeng Zhu
- Zhiyuan Yu
- Weidong Cao
- Ning Zhang
- Xuan Zhang
Wired ghost touch attacks are an emerging and severe threat against modern touchscreens. The attacker can make a touchscreen falsely report nonexistent touches (i.e., ghost touches) by injecting common-mode noise (CMN) into the target device via its power cable. Existing attacks rely on reverse-engineering the touchscreen and then manually crafting the CMN waveforms to control the types and locations of ghost touches. Although successful, they are limited in practicality and attack capability due to the touchscreen's black-box nature and the immense search space of attack parameters. To overcome these limitations, this paper presents PowerTouch, a framework that can automatically generate wired ghost touch attacks. We adopt a software-hardware co-design approach and propose a domain-specific genetic-algorithm-based method tailored to the characteristics of the CMN waveform. Based on the security objectives, our framework automatically optimizes the CMN waveform towards injecting the desired type of ghost touches into regions specified by the attacker. The effectiveness of PowerTouch is demonstrated by successfully launching attacks on touchscreen devices from two different brands under nine different objectives. Compared with the state-of-the-art attack, we are the first to control taps along an extra dimension and to inject swipes along both dimensions. We place an average of 84.2% of taps on the targeted side of the screen, with the location error in the other dimension no more than 1.53 mm, and an average of 94.5% of injected swipes have the correct direction. A quantitative comparison with the state-of-the-art method shows that PowerTouch achieves better attack performance.
- Michael Zuzak
- Yuntao Liu
- Isaac McDaniel
- Ankur Srivastava
Logic obfuscation protects integrated circuits from an untrusted foundry attacker during manufacturing. To counter obfuscation, a number of logical (e.g. Boolean satisfiability) and physical (e.g. electro-optical probing) attacks have been proposed. By definition, these attacks use only a subset of the information leaked by a circuit to unlock it. Countermeasures often exploit the resulting blind-spots to thwart these attacks, limiting their scalability and generalizability. To overcome this, we propose a combined logical and physical attack against obfuscation called the CLAP attack. The CLAP attack leverages both the logical and physical properties of a locked circuit to prune the keyspace in a unified and theoretically-rigorous fashion, resulting in a more versatile and potent attack. To formulate the physical portion of the CLAP attack, we derive a logical formulation that provably identifies input sequences capable of sensitizing logically expressive regions in a circuit. We prove that electro-optically probing these regions infers portions of the key. For the logical portion of the attack, we integrate the physical attack results into a Boolean satisfiability attack to find the correct key. We evaluate the CLAP attack by launching it against four obfuscation schemes in benchmark circuits. The physical portion of the attack fully specified 60.6% of key bits and partially specified another 10.3%. The logical portion of the attack found the correct key in the physical-attack-limited keyspace in under 30 minutes. Thus, the CLAP attack unlocked each circuit despite obfuscation.
- Alexander Hepp
- Tiago Perez
- Samuel Pagliarini
- Georg Sigl
A potential vulnerability for integrated circuits (ICs) is the insertion of hardware trojans (HTs) during manufacturing. Understanding the practicability of such an attack can lead to appropriate measures for mitigating it. In this paper, we demonstrate a pragmatic framework for analyzing the HT susceptibility of finalized layouts. Our framework is representative of a fabrication-time attack, where the adversary is assumed to have access only to a layout representation of the circuit. The framework inserts trojans into tapeout-ready layouts utilizing an Engineering Change Order (ECO) flow. The attacked security nodes are blindly searched utilizing reverse-engineering techniques. For our experimental investigation, we utilized three crypto cores (AES-128, SHA-256, and RSA) and a microcontroller (RISC-V) as targets. We explored 96 combinations of triggers, payloads, and targets for our framework. Our findings demonstrate that even in high-density designs, the covert insertion of sophisticated trojans is possible, while maintaining the original target logic with minimal impact on power and performance. Furthermore, from our exploration, we conclude that it is too naive to utilize only placement resources as a metric for HT vulnerability. This work highlights that HT insertion success is a complex function of the placement, the routing resources, the position of the attacked nodes, and further design-specific characteristics. As a result, our framework goes beyond just an attack: we present the most advanced analysis tool to assess the vulnerability of finalized layouts to HT insertion.
SESSION: Tutorial: Polynomial Formal Verification: Ensuring Correctness under Resource Constraints
- Rolf Drechsler
- Alireza Mahzoon
Recently, a lot of effort has been put into developing formal verification approaches by both academic and industrial research. In practice, these techniques often give satisfying results for some types of circuits, while they fail for others. A major challenge in this domain is that the verification techniques suffer from unpredictability in their performance. The only way to overcome this challenge is the calculation of bounds for the space and time complexities. If a verification method has polynomial space and time complexities, scalability can be guaranteed.
In this tutorial paper, we review recent developments in formal verification techniques and give a comprehensive overview of Polynomial Formal Verification (PFV). In PFV, polynomial upper bounds for the run-time and memory needed during the entire verification task hold. Thus, correctness under resource constraints can be ensured. We discuss the importance and advantages of PFV in the design flow. Formal methods on the bit-level and the word-level, and their complexities when used to verify different types of circuits, like adders, multipliers, or ALUs are presented. The current status of this new research field and directions for future work are discussed.
SESSION: Scalable Verification Technologies
- Viraphol Chaiyakul
- Alex Orailoglu
- Mate Soos
- Kuldeep S. Meel
Given a Boolean formula ϕ over a set of variables X and a projection set P ⊆ X, a set I ⊆ P is an independent support of P if any two solutions of ϕ that agree on I also agree on P. The notion of independent support is related to the classical notion of definability dating back to 1901 and has been studied over the decades. Recently, the computational problem of determining an independent support for a given formula has gained importance owing to the crucial role of independent supports in hashing-based counting and sampling techniques.
In this paper, we design an efficient and scalable independent support computation technique that can handle formulas arising from real-world benchmarks. Our algorithmic framework, called Arjun, employs implicit and explicit definability notions and is based on a tight integration of gate-identification techniques and an assumption-based framework. We demonstrate that augmenting the state-of-the-art model counter ApproxMC4 and the sampler UniGen3 with Arjun leads to significant performance improvements. In particular, ApproxMC4 augmented with Arjun counts 576 more benchmarks out of 1896, while UniGen3 augmented with Arjun samples 335 more benchmarks within the same time limit.
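The definition above can be checked directly, if very inefficiently, by enumeration. The sketch below is only a brute-force illustration on a toy formula with invented variable names; Arjun's definability-based machinery is what makes the computation scale to real benchmarks.

```python
# Brute-force check of the independent-support definition on a toy formula.
from itertools import product

def solutions(formula, variables):
    return [dict(zip(variables, bits))
            for bits in product([0, 1], repeat=len(variables))
            if formula(**dict(zip(variables, bits)))]

def is_independent_support(formula, variables, P, I):
    """I is an independent support of P if any two solutions agreeing on I
    also agree on all of P."""
    sols = solutions(formula, variables)
    for s1 in sols:
        for s2 in sols:
            if all(s1[v] == s2[v] for v in I) and any(s1[v] != s2[v] for v in P):
                return False
    return True

# phi enforces c == a xor b: fixing {a, b} fixes c, so {a, b} supports {a, b, c}.
phi = lambda a, b, c: c == (a ^ b)
print(is_independent_support(phi, ["a", "b", "c"], P={"a", "b", "c"}, I={"a", "b"}))  # True
print(is_independent_support(phi, ["a", "b", "c"], P={"a", "b", "c"}, I={"a"}))       # False
```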
- Yue Xing
- Huaixi Lu
- Aarti Gupta
- Sharad Malik
Property-based specifications such as SystemVerilog Assertions (SVA) use mathematical logic to specify the temporal behavior of RTL designs, which can then be formally verified using model checking algorithms. These properties are specified for a single component (which may contain other components in the design hierarchy). Composing design components that have already been verified requires additional verification, since incorrect communication at their interface may invalidate the properties that have been checked for the individual components. This paper focuses on a specification for their interface which can be checked individually for each component and which guarantees that refinement-based properties checked for each component continue to hold after their composition. We do this in the setting of the Instruction-level Abstraction (ILA) specification and verification methodology. The ILA methodology provides a uniform specification for processors, accelerators, and general modules at the instruction level, and the automatic generation of a complete set of correctness properties for checking that the RTL model is a refinement of the ILA specification. We add an interface specification to model the inter-ILA communication. Further, we use our interface specification to generate a set of interface-checking properties that check that the communication between the RTL components is correct. This provides the following guarantee: if each RTL component is a refinement of its ILA specification and the interface checks pass, then the RTL composition is a refinement of the ILA composition. We have applied the proposed methodology to six case studies, including parts of large-scale designs such as the FlexASR and NVDLA machine learning accelerators, demonstrating the practical applicability of our method.
- Qinhan Tan
- Aarti Gupta
- Sharad Malik
Recent years have witnessed increasing use of domain-specific accelerators in computing platforms to provide power-performance efficiency for emerging applications. To increase their applicability within the domain, these accelerators tend to support a large set of functions, e.g. Nvidia’s open-source Deep Learning Accelerator, NVDLA, supports five distinct groups of functions [17]. However, an individual use case of an accelerator may utilize only a subset of these functions. The unused functions lead to unnecessary overhead of silicon area, power, and hardware verification/hardware-software co-verification complexity. This motivates our research question: Given an RTL design for an accelerator and a subset of functions of interest, can we automatically extract a subset of the RTL that is sufficient for these functions and sequentially equivalent to the original RTL? We call this the Usage-based RTL Subsetting problem, referred to as the RTL subsetting problem in short. We first formally define this problem and show that it can be formulated as a program synthesis problem, which can be solved by performing expensive hyperproperty checks. To overcome the high cost, we propose multiple levels of sound over-approximations to construct an effective algorithm based on relatively less expensive temporal property checking and taint analysis for information flow checking. We demonstrate the acceptable computation cost and the quality of the results of our algorithm through several case studies of accelerators from different domains. The applicability of our proposed algorithm can be seen in its ability to subset the large NVDLA accelerator (with over 50,000 registers and 1,600,000 gates) for the group of convolution functions, where the subset reduces the total number of registers by 18.6% and the total number of gates by 37.1%.
SESSION: Optimizing Digital Design Aspects: From Gate Sizing to Multi-Bit Flip-Flops
- Amit Gupta
- Kerim Kalafala
- Siddhartha Nath
- Geraldo Pradipta
- Corey Hu
- Tian Yang
- Brucek Khailany
- Haoxing Ren
Gate sizing is a fundamental netlist optimization move, and researchers have used supervised learning-based models in gate sizers. Recently, Reinforcement Learning (RL) has been tried for sizing gates (and other EDA optimization problems) but is very runtime-intensive. In this work, we explore a novel Transformer-based gate sizer, TransSizer, to directly generate optimized gate sizes given a placed and unoptimized netlist. TransSizer is trained on datasets obtained from real tapeout-quality industrial designs in a foundry 5nm technology node. Our results indicate that TransSizer achieves 97% accuracy in predicting optimized gate sizes at the post-route optimization stage. Furthermore, TransSizer delivers a speedup of ~1400× while achieving similar timing, power, and area metrics when compared to a leading-edge commercial tool for sizing-only optimization.
- Meng-Yun Liu
- Yu-Cheng Lai
- Wai-Kei Mak
- Ting-Chi Wang
Multi-bit flip-flops (MBFFs) are often used to reduce the number of clock sinks, resulting in a low-power design. A traditional MBFF is composed of individual FFs of uniform driving strength. However, if some but not all of the bits of an MBFF violate timing constraints, the MBFF has to be sized up or decomposed into smaller bit-width combinations to satisfy timing, which reduces the power saving. In this paper, we present a new MBFF generation approach considering mixed-driving MBFFs, in which certain bits have a higher driving strength than the others. To maximize the FF merging rate (and hence minimize the final number of clock sinks), our approach first performs aggressive FF merging subject to timing constraints. Our merging is aggressive in the sense that we are willing to oversize some FFs and allow empty bits in an MBFF in order to merge FFs into MBFFs of uniform driving strength as much as possible. The oversized individual FFs of an MBFF are later downsized subject to timing constraints, which results in a mixed-driving MBFF. Our MBFF generation approach has been combined with a commercial place-and-route tool, and our experimental results show the superiority of our approach over a prior work that considers only uniform-driving MBFFs, in terms of clock sink count, FF power, clock buffer count, and routed clock wirelength.
- Zhiyao Xie
- Shiyu Li
- Mingyuan Ma
- Chen-Chia Chang
- Jingyu Pan
- Yiran Chen
- Jiang Hu
Accurate and efficient on-chip power modeling is crucial to runtime power, energy, and voltage management. Such power monitoring can be achieved by designing and integrating on-chip power meters (OPMs) into the target design. In this work, we propose a new method named DEEP to automatically develop extremely efficient OPM solutions for a given design. DEEP selects OPM inputs from all individual bits in RTL signals. Such bit-level selection provides an unprecedentedly large number of input candidates and supports lower hardware cost compared with the signal-level selection in prior works. In addition, DEEP proposes a powerful two-step OPM input selection method, and it supports reporting both total power and the power of major design components. Experiments on a commercial microprocessor demonstrate that DEEP's OPM solution achieves correlation R > 0.97 in per-cycle power prediction with an unprecedentedly low area overhead on hardware, i.e., < 0.1% of the microprocessor layout. This reduces the OPM hardware cost by 4–6× compared with the state-of-the-art solution.
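To make the bit-level idea concrete, the sketch below fits a tiny linear per-cycle power model over a handful of selected signal bits on synthetic data. The greedy correlation-based pick and the synthetic traces are illustrative stand-ins; DEEP's actual two-step input selection and hardware realization are not reproduced.

```python
# Sketch of a bit-level on-chip power model: per-cycle power approximated as
# a small linear combination of selected RTL signal bits (synthetic data).
import numpy as np

def select_and_fit(bit_toggles, power, n_inputs=8):
    """bit_toggles: (cycles, bits) 0/1 matrix; power: per-cycle power trace."""
    corr = np.abs(np.corrcoef(bit_toggles.T, power)[-1, :-1])
    chosen = np.argsort(corr)[-n_inputs:]                 # most correlated bits
    X = np.hstack([bit_toggles[:, chosen], np.ones((len(power), 1))])
    weights, *_ = np.linalg.lstsq(X, power, rcond=None)   # linear power model
    return chosen, weights

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(2000, 64)).astype(float)
true_w = np.zeros(64); true_w[[3, 17, 42]] = [2.0, 1.5, 0.7]
power = bits @ true_w + 5.0 + rng.normal(0, 0.05, 2000)
chosen, w = select_and_fit(bits, power)
print(sorted(int(c) for c in chosen))   # should include bits 3, 17, and 42
```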
SESSION: Energy Efficient Hardware Acceleration and Stochastic Computing
- Sunil Khatri
- Anish Krishnakumar
- Ranyang Zhou
- Arman Roohi
- Durga Misra
- Shaahin Angizi
In this paper, we propose a reconfigurable processing-in-DRAM architecture named ReD-LUT leveraging the high density of commodity main memory to enable a flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication, division, etc.) via only memory read operation. In addition, ReD-LUT enables bulk bit-wise in-memory logic by elevating the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally-intensive applications, i.e., low-precision deep learning acceleration, and the Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with the GPU and achieves ~37.8× speedup and 2.1× energy-efficiency over the best in-DRAM bit-wise accelerators. As for AES data-encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.
- Pranathi Vasireddy
- Krishna Kavi
- Gayatri Mehta
Sparse matrix-dense vector (SpMV) multiplication is inherent in most scientific computing, neural network, and machine learning algorithms. To efficiently exploit the sparsity of data in SpMV computations, several compressed data representations have been used. However, compressed representations of sparse data incur the overhead of locating nonzero values, requiring indirect memory accesses that increase instruction count and memory access delays. We call these translations of compressed representations metadata processing. We propose a memory-side accelerator that performs the metadata (or indexing) computations and supplies only the required nonzero values to the processor, additionally permitting an overlap of indexing with the core computations on nonzero elements. We target our accelerator at low-end microcontrollers with very limited memory and processing capabilities. We explore two dedicated ASIC designs of the proposed accelerator, which handles the indexed memory accesses for the compressed sparse row (CSR) format while working alongside a simple RISC-like programmable core. One version of the accelerator supplies only the vector values corresponding to nonzero matrix values, and the second version supplies both the nonzero matrix values and the matching vector values for SpMV computations. Our experiments show speedups between 1.3× and 2.1× for SpMV at different levels of sparsity. Our accelerator also yields energy savings between 15.8% and 52.7% over different matrix sizes, compared to a baseline system in which the primary RISC-V core performs all computations. We use smaller synthetic matrices with different sparsity levels and larger real-world matrices with higher sparsity (below 1% nonzeros) in our experimental evaluations.
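For reference, the short routine below is a plain CSR SpMV kernel: the row_ptr/col_idx traversal is exactly the metadata processing that the proposed memory-side accelerator takes off the core. The routine and its toy matrix are generic illustrations, not the paper's hardware interface.

```python
# Reference CSR sparse-matrix / dense-vector product.
def spmv_csr(row_ptr, col_idx, values, x):
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):   # metadata lookups
            y[row] += values[k] * x[col_idx[k]]           # indirect vector access
    return y

# 3x3 matrix [[5, 0, 0], [0, 0, 2], [1, 0, 3]] in CSR form.
row_ptr, col_idx, values = [0, 1, 2, 4], [0, 2, 0, 2], [5.0, 2.0, 1.0, 3.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))   # [5.0, 2.0, 4.0]
```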
- Peter Schober
- Seyedeh Newsha Estiri
- Sercan Aygun
- Nima TaheriNejad
- M. Hassan Najafi
Stochastic computing (SC) is an alternative computing paradigm that processes data in the form of long uniform bit-streams rather than conventional compact weighted binary numbers. SC is fault-tolerant and can compute on small, efficient circuits, promising advantages over conventional arithmetic for smaller computer chips. SC has been primarily used in scientific research, not in practical applications. Digital sound source localization (SSL) is a useful signal processing technique that locates speakers using multiple microphones in cell phones, laptops, and other voice-controlled devices. SC has not been integrated into SSL in practice or theory. In this work, for the first time to the best of our knowledge, we implement an SSL algorithm in the stochastic domain and develop a functional SC-based sound source localizer. The developed design can replace the conventional design of the algorithm. The practical part of this work shows that the proposed stochastic circuit does not rely on conventional analog-to-digital conversion and can process data in the form of pulse-width-modulated (PWM) signals. The proposed SC design consumes up to 39% less area than the conventional baseline design. The SC-based design can consume less power depending on the computational accuracy, for example, 6% less power consumption for 3-bit inputs. The presented stochastic circuit is not limited to SSL and is readily applicable to other practical applications such as radar ranging, wireless location, sonar direction finding, beamforming, and sensor calibration.
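As a brief primer on the paradigm, the sketch below shows unipolar stochastic multiplication: two values in [0, 1] are encoded as the '1'-density of independent bit-streams, and a single AND gate multiplies them. The stream length and values are arbitrary, and the paper's SSL pipeline and PWM-based front end are not modeled.

```python
# Minimal unipolar stochastic computing example: AND of two independent
# bit-streams multiplies the values they encode.
import random

def to_stream(p, length, rng):
    return [1 if rng.random() < p else 0 for _ in range(length)]

def stream_value(bits):
    return sum(bits) / len(bits)

rng = random.Random(42)
a, b, length = 0.6, 0.5, 4096
product_stream = [x & y for x, y in zip(to_stream(a, length, rng),
                                        to_stream(b, length, rng))]
print(stream_value(product_stream))   # close to 0.6 * 0.5 = 0.3
```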
SESSION: Special Session: Approximate Computing and the Efficient Machine Learning Expedition
- Jörg Henkel
- Hai Li
- Anand Raghunathan
- Mehdi B. Tahoori
- Swagath Venkataramani
- Xiaoxuan Yang
- Georgios Zervakis
Approximate computing (AxC) has long been accepted as a design alternative for efficient system implementation at the cost of relaxed accuracy requirements. Despite AxC research activities in various application domains, AxC has thrived over the past decade as it was applied to Machine Learning (ML). The by-definition approximate nature of ML models, together with the increased computational overheads associated with ML applications (which were effectively mitigated by corresponding approximations), led to a perfect match and a fruitful synergy. AxC for AI/ML has transcended academic prototypes. In this work, we highlight the synergistic nature of AxC and ML and elucidate the impact of AxC in designing efficient ML systems. To that end, we present an overview and taxonomy of AxC for ML and use two descriptive application scenarios to demonstrate how AxC boosts the efficiency of ML systems.
SESSION: Co-Search Methods and Tools
- Cunxi Yu
- Yingyan “Celine” Lin
- Tong Zhou
- Shaolei Ren
- Xiaolin Xu
Malicious architecture extraction has been emerging as a crucial concern for deep neural network (DNN) security. As a defense, architecture obfuscation is proposed to remap the victim DNN to a different architecture. Nonetheless, we observe that, by extracting only an obfuscated DNN architecture, the adversary can still retrain a substitute model with high performance (e.g., accuracy), rendering the obfuscation techniques ineffective. To mitigate this under-explored vulnerability, we propose ObfuNAS, which converts DNN architecture obfuscation into a neural architecture search (NAS) problem. Using a combination of function-preserving obfuscation strategies, ObfuNAS ensures that the obfuscated DNN architecture can only achieve lower accuracy than the victim. We validate the performance of ObfuNAS with open-source architecture datasets like NAS-Bench-101 and NAS-Bench-301. The experimental results demonstrate that ObfuNAS can successfully find the optimal mask for a victim model within a given FLOPs constraint, leading to up to 2.6% inference accuracy degradation for attackers with only 0.14× FLOPs overhead. The code is available at: https://github.com/Tongzhou0101/ObfuNAS.
- Rongjian Liang
- Jianfeng Song
- Yuan Bo
- Jiang Hu
The continuous growth of CNN complexity not only intensifies the need for hardware acceleration but also presents a huge challenge: the solution space for CNN hardware design and dataflow mapping becomes enormously large, besides being discrete and lacking a well-behaved structure. Most previous works are either stochastic metaheuristics, such as genetic algorithms, which are typically very slow for solving large problems, or rely on expensive sampling, e.g., Gumbel-Softmax-based differentiable optimization and Bayesian optimization. We propose an analytical model for evaluating the power and performance of CNN hardware designs and dataflow solutions. Based on this model, we introduce a co-optimization method consisting of nonlinear programming and parallel local search. A key innovation in this model is its matrix form, which enables the use of a deep learning toolkit for highly efficient computation of power/performance values and gradients during optimization. In handling the power-performance tradeoff, our method can lead to better solutions than minimizing a weighted sum of power and latency. The average relative error of our model compared with Timeloop is as small as 1%. Compared to state-of-the-art methods, our approach achieves solutions with up to 1.7× shorter inference latency, 37.5% less power consumption, and 3× less area on ResNet-18. Moreover, it provides a 6.2× speedup in optimization runtime.
- William Andrew Simon
- Una Pale
- Tomas Teijeiro
- David Atienza
The HyperDimensional Computing (HDC) Machine Learning (ML) paradigm is highly interesting for applications involving continuous, semi-supervised learning for long-term monitoring. However, its accuracy is not yet on par with other ML approaches, necessitating frameworks that enable fast HDC algorithm design space exploration. To this end, we introduce HDTorch, an open-source, PyTorch-based HDC library with CUDA extensions for hypervector operations. We demonstrate HDTorch's utility by analyzing four HDC benchmark datasets in terms of accuracy, runtime, and memory consumption, utilizing both classical and online HD training methodologies. We demonstrate average training speedups of 111×/68× for classical/online HD, respectively, and an average inference speedup of 87×. We also demonstrate how HDTorch enables the exploration of HDC strategies applied to large, real-world datasets. We perform the first-ever HD training and inference analysis of the entire CHB-MIT EEG epilepsy database. The results show that the typical approach of training on a subset of the data may not generalize to the entire dataset, an important factor when developing future HD models for medical wearable devices.
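For readers new to HDC, the sketch below shows the two hypervector operations such libraries accelerate, binding (XOR) and bundling (bit-wise majority), on dense binary hypervectors. It uses plain NumPy and is not HDTorch's API; dimensions and vectors are arbitrary.

```python
# Minimal illustration of binary hypervector operations: XOR binding,
# majority bundling, and Hamming-based similarity.
import numpy as np

D = 10_000
rng = np.random.default_rng(0)
def random_hv():       return rng.integers(0, 2, D, dtype=np.uint8)
def bind(a, b):        return a ^ b                                    # XOR binding
def bundle(*hvs):      return (np.sum(hvs, axis=0) > len(hvs) / 2).astype(np.uint8)
def similarity(a, b):  return 1.0 - np.mean(a != b)                    # 1 - normalized Hamming

key, value = random_hv(), random_hv()
record = bundle(bind(key, value), random_hv(), random_hv())
# Unbinding the record with the key recovers something much closer to 'value'
# than a random hypervector would be (about 0.75 vs. about 0.5 similarity).
print(similarity(bind(record, key), value), similarity(random_hv(), value))
```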
SESSION: Reconfigurable Computing: Accelerators and Methodologies II
- Hanning Chen
- Mariam Issa
- Yang Ni
- Mohsen Imani
Reinforcement Learning (RL) is a powerful technology for solving decision-making problems such as robotics control. Modern RL algorithms such as Deep Q-Learning are based on costly and resource-hungry deep neural networks. This motivates us to deploy alternative models for powering RL agents on edge devices. Recently, brain-inspired HyperDimensional Computing (HDC) has been introduced as a promising solution for lightweight and efficient machine learning, particularly for classification.
In this work, we develop a novel platform capable of real-time hyperdimensional reinforcement learning. Our heterogeneous CPU-FPGA platform, called DARL, maximizes the FPGA's computing capabilities by applying hardware optimizations to hyperdimensional computing's critical operations, including a hardware-friendly encoder IP, hypervector chunk fragmentation, and delayed model update. Aside from the hardware innovations, we also extend the platform beyond basic single-agent RL to support multi-agent distributed learning. We evaluate the effectiveness of our approach on OpenAI Gym tasks. Our results show that the FPGA platform provides on average 20× speedup compared to current state-of-the-art hyperdimensional RL methods running on an Intel Xeon 6226 CPU. In addition, DARL provides around 4.8× faster execution and 4.2× higher energy efficiency compared to the state-of-the-art RL accelerator, while ensuring a better or comparable quality of learning.
- Carl-Johannes Johnsen
- Tiziano De Matteis
- Tal Ben-Nun
- Johannes de Fine Licht
- Torsten Hoefler
The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed at high levels of abstraction, such as HLS, requiring the use of hand-optimized RTL. In this paper, we show how to leverage multiple clock domains for computational subdomains on reconfigurable devices through data movement analysis on high-level programs. We offer a novel view of multi-pumping as a compiler optimization: a superclass of traditional vectorization. As multiple data elements are fed and consumed, the computations are packed temporally rather than spatially. The optimization is applied automatically using an intermediate representation that maps high-level code to HLS. Internally, the optimization injects modules into the generated designs, incorporating RTL for fine-grained control over the clock domains. We obtain a reduction in resource consumption of up to 50% on critical components and 23% on average. For scalable designs, this can enable further parallelism, increasing overall performance.
SESSION: Compute-in-Memory for Neural Networks
- Yun-Chen Lo
- Chih-Chen Yeh
- Jun-Shen Wu
- Chia-Chun Wang
- Yu-Chih Tsai
- Wen-Chien Ting
- Ren-Shuo Liu
Among several emerging architectures, computing in memory (CIM), which features in-situ analog computation, is a potential solution to the data movement bottleneck of the Von Neumann architecture for artificial intelligence (AI). Interestingly, CIM has further strengths, quite different from in-situ analog computation, that are not yet widely known. In this work, we point out that mutually stationary vectors (MSVs), which can be maximized by introducing associativity to CIM, are another inherent advantage unique to CIM. Through MSVs, CIM gains significant freedom to dynamically vectorize the stored data (e.g., weights) and perform agile computation using the dynamically formed vectors.
We have designed and realized an SA-CIM silicon prototype and the corresponding architecture and acceleration schemes in the TSMC 28 nm process. More specifically, the contributions of this paper are fourfold: 1) We identify MSVs as new features that can be exploited to address the current performance and energy challenges of CIM-based hardware. 2) We propose SA-CIM to enhance MSVs for skipping zeros, small values, and sparse vectors. 3) We propose a transposed systolic dataflow to efficiently perform 3×3 convolutions while remaining capable of exploiting input-skipping schemes. 4) We propose a design flow to search for optimal aggressive skipping scheme setups while satisfying the accuracy loss constraint.
The proposed ISSA architecture improves throughput by 1.91× to 2.97× and energy efficiency by 2.5× to 4.2×.
- Zheyu Yan
- Xiaobo Sharon Hu
- Yiyu Shi
Computing-in-Memory (CiM) architectures based on emerging nonvolatile memory (NVM) devices have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, NVM devices suffer from various non-idealities, especially device-to-device variations due to fabrication defects and cycle-to-cycle variations due to the stochastic behavior of devices. As such, the DNN weights actually mapped to NVM devices could deviate significantly from the expected values, leading to large performance degradation. To address this issue, most existing works focus on maximizing average performance under device variations. This objective works well for general-purpose scenarios, but for safety-critical applications the worst-case performance must also be considered. Unfortunately, this has rarely been explored in the literature. In this work, we formulate the problem of determining the worst-case performance of CiM DNN accelerators under the impact of device variations. We further propose a method to effectively find the specific combination of device variations in the high-dimensional space that leads to the worst-case performance. We find that even with very small device variations, the accuracy of a DNN can drop drastically, causing concern when deploying CiM accelerators in safety-critical applications. Finally, we show that, surprisingly, none of the existing methods used to enhance average DNN performance in CiM accelerators is very effective when extended to enhance the worst-case performance, and further research is needed to address this problem.
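One plausible way to search the high-dimensional variation space for a bad corner is projected gradient ascent on the weights, in the spirit of adversarial weight perturbation. The sketch below illustrates the problem setup, not the authors’ exact method, and assumes additive, uniformly bounded perturbations plus a toy model and random batch.

```python
import torch
import torch.nn as nn

def worst_case_weights(model, x, y, bound=0.01, steps=50, lr=0.005):
    """Projected gradient ascent on the weights: nudge each weight within an
    assumed +/-bound variation interval so as to maximize the loss."""
    criterion = nn.CrossEntropyLoss()
    originals = [p.detach().clone() for p in model.parameters()]
    for _ in range(steps):
        model.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        with torch.no_grad():
            for p, o in zip(model.parameters(), originals):
                p.add_(lr * p.grad.sign())            # ascend the loss
                p.clamp_(o - bound, o + bound)        # stay inside the variation range
    return criterion(model(x), y).item()

# Purely illustrative model and batch.
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x, y = torch.randn(32, 8), torch.randint(0, 4, (32,))
print("nominal loss:", nn.CrossEntropyLoss()(net(x), y).item())
print("loss at searched variation corner:", worst_case_weights(net, x, y))
```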
SESSION: Breakthroughs in Synthesis – Infrastructure and ML Assist II
- Wan-Hsuan Lin
- Chia-Hsuan Su
- Jie-Hong R. Jiang
Language equations are a powerful tool for compositional synthesis, modeled as the unknown component problem. Given a (sequential) system specification S and a fixed component F, we are asked to synthesize an unknown component X whose composition with F fulfills S. The synthesis of X can be formulated as language equation solving. Although prior work exploits partitioned representation for effective finite automata manipulation, it remains challenging to solve language equations involving a large number of states. In this work, we propose variants of Boolean automata as the underlying succinct representation for regular languages. They admit logic circuit manipulation and extend the scalability of language equation solving. Experimental results demonstrate the superiority of our method over the state-of-the-art: it solves nine more cases out of the 36 studied benchmarks and achieves an average 740× speedup.
- Prianka Sengupta
- Aakash Tyagi
- Yiran Chen
- Jiang Hu
Hardware Description Language (HDL) is a common entry point for designing digital circuits. Differences in HDL coding styles and design choices may lead to considerably different design quality and performance-power tradeoffs. In general, the impact of HDL coding is not clear until logic synthesis or even layout is completed. However, running synthesis merely as feedback for HDL code is computationally uneconomical, especially in early design phases when the code needs to be frequently modified. Furthermore, in late stages of design convergence burdened with high-impact engineering change orders (ECOs), design iterations become prohibitively expensive. To this end, we propose a machine learning approach to Verilog-based Register-Transfer Level (RTL) design assessment without going through the synthesis process. It allows designers to quickly evaluate the performance-power tradeoff among different options of RTL designs. Experimental results show that our proposed technique achieves an average of 95% prediction accuracy in terms of post-placement analysis, and is six orders of magnitude faster than evaluation by running logic synthesis and placement.
SESSION: In-Memory Computing Revisited
- Biresh Kumar Joardar
- Ulf Schlichtmann
- Muhammad Rashedul Haq Rashed
- Sumit Kumar Jha
- Rickard Ewetz
Processing in-memory is a promising solution strategy for accelerating data-intensive applications. While analog in-memory computing is extremely efficient, its limited precision is only acceptable for approximate computing applications. Digital in-memory computing provides the deterministic precision required to accelerate high assurance applications. State-of-the-art digital in-memory computing schemes rely on manually decomposing arithmetic operations into in-memory compute kernels. In contrast, traditional digital circuits are synthesized using complex and automated design flows. In this paper, we propose a logic synthesis framework called LOGIC for mapping high-level applications into digital in-memory compute kernels that can be executed using non-volatile memory. We first propose techniques to decompose element-wise arithmetic operations into in-memory kernels while minimizing the number of in-memory operations. Next, the sequence of in-memory operations is optimized to minimize non-volatile memory utilization. Lastly, data layout re-organization is used to efficiently accelerate applications dominated by sparse matrix-vector multiplication operations. The experimental evaluations show that the proposed synthesis approach improves the area and latency of fixed-point multiplication by 77% and 20% over the state-of-the-art, respectively. On scientific computing applications from the SuiteSparse Matrix Collection, the proposed design improves area, latency, and energy by 3.6×, 2.6×, and 8.3×, respectively.
- Kang He
- Indranil Chakraborty
- Cheng Wang
- Kaushik Roy
In-Memory Computing (IMC) has become a promising paradigm for accelerating machine learning (ML) inference. While IMC architectures built on various memory technologies have demonstrated higher throughput and energy efficiency compared to conventional digital architectures, little research has been done from a system-level perspective to provide comprehensive and fair comparisons of different memory technologies under the same hardware budget (area). Since large-scale analog IMC hardware relies on costly analog-digital converters (ADCs) for robust digital communication, optimizing IMC architecture performance requires synergistic co-design of memory arrays and peripheral ADCs, wherein the trade-offs may depend on the underlying memory technologies. To that effect, we co-explore the IMC macro design space and memory technology to identify the best design point for each memory type under iso-area budgets, aiming to make fair comparisons among different technologies, including SRAM, phase change memory, resistive RAM, ferroelectrics, and spintronics. First, an extended simulation framework employing a spatial architecture with off-chip DRAM is developed, capable of integrating both CMOS and nonvolatile memory technologies. Subsequently, we propose different modes of ADC operation with distinctive weight mapping schemes to cope with different on-chip area budgets. Our results show that under an iso-area budget, the various memory technologies being evaluated need to adopt different IMC macro-level designs to deliver the optimal energy-delay product (EDP) at the system level. We demonstrate that under small area budgets, the choice of the best memory technology is determined by its cell area and write energy, whereas for larger area budgets, cell area becomes the dominant factor for technology selection.
SESSION: Special Session: 2022 CAD Contest at ICCAD
- Yu-Guang Chen
- Chun-Yao Wang
- Tsung-Wei Huang
- Takashi Sato
The “CAD Contest at ICCAD” is a challenging, multi-month research and development competition focusing on advanced, real-world problems in the field of electronic design automation (EDA). Since 2012, the contest has published many sophisticated circuit design problems, from system-level design to physical design, together with industrial benchmarks and solution evaluators. Contestants can participate in one or more problems provided by the EDA/IC industry. The winners are awarded at an ICCAD special session dedicated to this contest. Every year, the contest attracts more than a hundred teams, fosters productive industry-academia collaborations, and leads to hundreds of publications in top-tier conferences and journals. The 2022 CAD Contest has 166 teams from all over the world. Moreover, this year’s problems cover state-of-the-art EDA research trends such as circuit security, 3D-IC, and design space exploration from well-known EDA/IC companies. We believe the contest keeps enhancing its impact and boosting EDA research.
- Chung-Han Chou
- Chih-Jen (Jacky) Hsu
- Chi-An (Rocky) Wu
- Kuan-Hua Tu
Extracting circuit functionality from a gate-level netlist is critical in CAD tools. For security, it helps designers detect hardware Trojans or malicious design changes in netlists produced with third-party resources such as fabrication services and soft/hard IP cores. For verification, it can reduce the complexity and effort of keeping track of design information under the aggressive optimization strategies adopted by synthesis tools. For Engineering Change Order (ECO), it spares the designer from locating the ECO gates in a sea of bit-level gates.
In this contest, we formulated a datapath learning and extraction problem. With a set of benchmarks and an evaluation metric, we expect contestants to develop a tool to learn the arithmetic equations from a synthesized gate-level netlist.
- Kai-Shun Hu
- I-Jye Lin
- Yu-Hui Huang
- Hao-Yu Chi
- Yi-Hsuan Wu
- Chin-Fang Cindy Shen
In the chiplet era, splitting a large single die into multiple small dies brings benefits from multiple factors. With multiple small dies connected by die-to-die (D2D) vertical connections, the benefits include 1) better yield, 2) better timing/performance, and 3) better cost. How to perform netlist partitioning and cell placement in each of the small dies, and how to determine the locations of the D2D interconnection terminals, becomes a new topic.
To address this chiplet-era physical implementation problem, the ICCAD 2022 contest encourages research into techniques for multi-die netlist partitioning and placement with D2D vertical connections. We provided (i) a set of benchmarks and (ii) an evaluation metric that allow contestants to develop, test, and evaluate their new algorithms.
- Sicheng Li
- Chen Bai
- Xuechao Wei
- Bizhao Shi
- Yen-Kuang Chen
- Yuan Xie
It is vital to select microarchitectures that achieve good trade-offs between performance, power, and area in the chip development cycle. Combining high-level hardware description languages and the optimization of electronic design automation tools empowers microarchitecture exploration at the circuit level. Due to the extremely large design space and the high runtime cost of evaluating a microarchitecture, ICCAD 2022 CAD Contest Problem C calls for an effective design space exploration algorithm to solve the problem. We formulate the research topic as a contest problem and provide benchmark suites, contest benchmark platforms, etc., for all contestants to innovate and evaluate their algorithms.
- Jinwook Jung
- Andrew B. Kahng
- Ravi Varadarajan
- Zhiang Wang
This paper describes new elements in the RDF-2022 release of the DATC Robust Design Flow, along with other activities of the IEEE CEDA DATC. The RosettaStone initiated with RDF-2021 has been augmented to include 35 benchmarks and four open-source technologies (ASAP7, NanGate45 and SkyWater130HS/HD), plus timing-sensible versions created using path-cutting. The Hier-RTLMP macro placer is now part of DATC RDF, enabling macro placement for large modern designs with hundreds of macros. To establish a clear baseline for macro placers, new open-source benchmark suites on open PDKs, with corresponding flows for fully reproducible results, are provided. METRICS2.1 infrastructure in OpenROAD and OpenROAD-flow-scripts now uses native JSON metrics reporting, which is more robust and general than the previous Python script-based method. Calibrations on open enablements have also seen notable updates in the RDF. Finally, we also describe an approach to establishing a generic, cloud-native large-scale design of experiments for ML-enabled EDA. Our paper closes with future research directions related to DATC’s efforts.
SESSION: Architectures and Methodologies for Advanced Hardware Security
- Jingyao Zhang
- Elaheh Sadredini
In the age of big data, information security has become a major issue of debate, especially with the rise of the Internet of Things (IoT), where attackers can effortlessly obtain physical access to edge devices. Hash algorithms are the current foundation for data integrity and authentication. However, it is challenging to provide a high-performance, high-throughput, and energy-efficient solution on resource-constrained edge devices. In this paper, we propose Inhale, an in-SRAM architecture to effectively compute hash algorithms with innovative data alignment and efficient read/write strategies to implicitly execute data shift operations through the in-situ controller. We present two variations of Inhale: Inhale-Opt, which is optimized for latency, throughput, and area overhead; and Inhale-Flex, which offers flexibility in repurposing a part of the last-level cache for hash computation. We thoroughly evaluate our proposed architectures on both SRAM and ReRAM memories and compare them with the state-of-the-art in-memory and ASIC accelerators. Our performance evaluation confirms that Inhale can achieve 1.4× – 14.5× higher throughput-per-area and about two orders of magnitude higher throughput-per-area-per-energy compared to the state-of-the-art solutions.
- Kevin Nam
- Hyunyoung Oh
- Hyungon Moon
- Yunheung Paek
TFHE is a fully homomorphic encryption (FHE) scheme that evaluates Boolean gates, which we will hereafter call Tgates, over encrypted data. TFHE is considered to have higher expressive power than many existing schemes in that it can compute not only N-bit arithmetic operations but also logical/relational ones, since arbitrary arithmetic/logical/relational (ALR) operations can be represented by Tgate circuits. Despite this strength, TFHE, like all other schemes, suffers from colossal computational overhead. Incessant efforts to reduce the overhead have been made by exploiting the inherent parallelism of FHE operations on ciphertexts. Unlike other FHE schemes, the parallelism of TFHE can be decomposed into multiple layers: one inside each FHE operation (equivalent to a single Tgate) and the other between Tgates. Unfortunately, previous works focused only on exploiting the parallelism inside a Tgate. However, as each N-bit operation over TFHE corresponds to a Tgate circuit constructed from multiple Tgates, it is also necessary to utilize the parallelism between Tgates to optimize an entire operation. This paper proposes an acceleration technique that maximizes the performance of a TFHE N-bit operation by simultaneously utilizing both layers of parallelism comprising the operation. To fully profit from both layers of parallelism, we have implemented our technique on a commodity CPU-FPGA hybrid machine with parallel execution capabilities in hardware. Our implementation outperforms prior ones by 2.43× in throughput and 12.19× in throughput per watt when performing N-bit operations under 128-bit quantum security parameters.
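To illustrate the between-Tgates layer of parallelism (a schematic sketch, not the CPU-FPGA implementation), the snippet below levelizes a made-up Tgate netlist so that all gates within one level are data-independent and could be evaluated concurrently:

```python
from collections import defaultdict

def levelize(gates):
    """Group gates of a Boolean (Tgate) netlist into levels; all gates within a
    level have no data dependencies on each other and can run concurrently.
    `gates` maps gate name -> list of input wire names; wires that never
    appear as gate names are primary inputs."""
    level = {}
    def depth(w):
        if w not in gates:                 # primary input
            return 0
        if w not in level:
            level[w] = 1 + max(depth(i) for i in gates[w])
        return level[w]
    for g in gates:
        depth(g)
    by_level = defaultdict(list)
    for g, l in level.items():
        by_level[l].append(g)
    return [by_level[l] for l in sorted(by_level)]

# Made-up adder-like fragment: s0, p1, c1 depend only on primary inputs and
# form one parallel level; s1 depends on p1 and c1 and forms the next level.
circuit = {"s0": ["a0", "b0"], "p1": ["a1", "b1"], "c1": ["a0", "b0"], "s1": ["p1", "c1"]}
print(levelize(circuit))                   # [['s0', 'p1', 'c1'], ['s1']]
```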
- Oleg Mazonka
- Eduardo Chielle
- Deepraj Soni
- Michail Maniatakos
Improving fully homomorphic encryption computation by designing specialized hardware is an active topic of research. The most prominent encryption schemes operate on long polynomials requiring many concurrent modular multiplications of very big numbers. Thus, it is crucial to use many small and efficient multipliers. Interleaved and Montgomery iterative multipliers are the best candidates for the task. Interleaved designs, however, suffer from longer latency as they require a number comparison within each iteration; Montgomery designs, on the other hand, need extra conversion of the operands or the result. In this work, we propose a novel hardware design that combines the best of both worlds: it exhibits the carry-save addition of Montgomery designs without the need for any domain conversions. Experimental results demonstrate improved latency-area product efficiency by up to 47% when compared to the standard Interleaved multiplier for large arithmetic word sizes.
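For context, the two baseline iterative algorithms and the costs noted above can be sketched in software as follows (textbook bignum versions, not the proposed carry-save hardware; the modulus and word size are illustrative):

```python
def interleaved_modmul(a, b, m):
    """Textbook interleaved modular multiplication: scan the bits of a MSB-first.
    The compare-and-subtract in every iteration is the latency cost noted above."""
    b %= m
    r = 0
    for bit in bin(a)[2:]:
        r <<= 1
        if bit == "1":
            r += b
        while r >= m:                      # at most two subtractions keep r < m
            r -= m
    return r

def montgomery_modmul(a, b, m, k):
    """Textbook Montgomery multiplication with R = 2**k (m odd, m < R): the
    reduction avoids comparisons against m, but operands must be converted
    into and out of the Montgomery domain (the extra cost noted above)."""
    R = 1 << k
    m_prime = (-pow(m, -1, R)) % R

    def redc(t):                           # returns t * R^(-1) mod m
        u = (t + ((t * m_prime) & (R - 1)) * m) >> k
        return u - m if u >= m else u

    a_bar, b_bar = (a * R) % m, (b * R) % m    # into the Montgomery domain
    return redc(redc(a_bar * b_bar))           # reduce, then convert back out

m = 0xFFFFFFFB                             # an odd 32-bit modulus (illustrative)
a, b = 123456789, 987654321
assert interleaved_modmul(a, b, m) == montgomery_modmul(a, b, m, 32) == (a * b) % m
```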
- Eduardo Chielle
- Oleg Mazonka
- Homer Gamil
- Michail Maniatakos
The dramatic increase of data breaches in modern computing platforms has emphasized that access control is not sufficient to protect sensitive user data. Recent advances in cryptography allow end-to-end processing of encrypted data without the need for decryption using Fully Homomorphic Encryption (FHE). Such computation, however, is still orders of magnitude slower than direct (unencrypted) computation. Depending on the underlying cryptographic scheme, FHE schemes can work natively either at bit level using Boolean circuits, or over integers using modular arithmetic. Operations on integers are limited to addition/subtraction and multiplication. On the other hand, bit-level arithmetic is much more comprehensive, allowing more operations, such as comparison and division. While modular arithmetic can emulate bit-level computation, there is a significant cost in performance. In this work, we propose a novel method, dubbed bridging, that blends faster and restricted modular computation with slower and comprehensive bit-level computation, making them both usable within the same application and with the same cryptographic scheme instantiation. We introduce and open-source C++ types representing the two distinct arithmetic modes, offering the possibility to convert from one to the other. Experimental results show that bridging modular and bit-level arithmetic computation can lead to 1–2 orders of magnitude performance improvement for tested synthetic benchmarks, as well as one real-world FHE application: a genotype imputation case study.
SESSION: Special Session: The Dawn of Domain-Specific Hardware Accelerators for Robotic Computing
- Yanqi Liu
- Anthony Opipari
- Odest Chadwicke Jenkins
- R. Iris Bahar
Perceiving the position and orientation of objects (i.e., pose estimation) is a crucial prerequisite for robots acting within their natural environment. We present a hardware acceleration approach to enable real-time and energy-efficient articulated pose estimation for robots operating in unstructured environments. Our hardware accelerator implements Nonparametric Belief Propagation (NBP) to infer the belief distribution of articulated object poses. Our approach is, on average, 26× more energy efficient than a high-end GPU and 11× faster than an embedded low-power GPU implementation. Moreover, we present a Monte-Carlo Perception Library generated from high-level synthesis to enable reconfigurable hardware designs on FPGA fabrics that are better tuned to user-specified scene, resource, and performance constraints.
- Zishen Wan
- Karthik Swaminathan
- Pin-Yu Chen
- Nandhini Chandramoorthy
- Arijit Raychowdhury
Autonomous systems have reached a tipping point, with a myriad of self-driving cars, unmanned aerial vehicles (UAVs), and robots being widely applied and revolutionizing new applications. The continuous deployment of autonomous systems reveals the need for designs that facilitate increased resiliency and safety. The ability of an autonomous system to tolerate or mitigate errors, such as harsh environmental conditions; sensor, hardware, and software faults; and adversarial attacks, is essential to ensure its functional safety. Application-aware resilience metrics, holistic fault analysis frameworks, and lightweight fault mitigation techniques are being proposed for accurate and effective resilience and robustness assessment and improvement. This paper explores the origination of fault sources across the computing stack of autonomous systems, discusses the fault impacts and fault mitigation techniques for different scales of autonomous systems, and concludes with challenges and opportunities for assessing and building next-generation resilient and robust autonomous systems.
- Yuhui Hao
- Bo Yu
- Qiang Liu
- Shaoshan Liu
- Yuhao Zhu
A factor graph is a graph representing the factorization of a probability distribution function, and it has been utilized in many autonomous machine computing tasks, such as localization, tracking, planning, and control. We are developing an architecture with the goal of using the factor graph as a common abstraction for most, if not all, autonomous machine computing tasks. If successful, the architecture would provide a very simple interface for mapping autonomous machine functions to the underlying compute hardware. As a first step in this direction, this paper presents our most recent work on developing a factor graph accelerator for LiDAR-Inertial Odometry (LIO), an essential task in many autonomous machines, such as autonomous vehicles and mobile robots. By modeling LIO as a factor graph, the proposed accelerator not only supports multi-sensor fusion of LiDAR, inertial measurement unit (IMU), GPS, etc., but also solves the global optimization problem of robot navigation in batch or incremental modes. Our evaluation demonstrates that the proposed design significantly improves the real-time performance and energy efficiency of autonomous machine navigation systems. The initial success suggests the potential of generalizing the factor graph architecture as a common abstraction for autonomous machine computing, including tracking, planning, and control.
- Lingyi Huang
- Xiao Zang
- Yu Gong
- Bo Yuan
Motion planning aims to find a collision-free trajectory from the start to the goal configuration of a robot. As a key cognition task for all autonomous machines, motion planning is fundamentally required in various real-world robotic applications, such as 2-D/3-D autonomous navigation of unmanned mobile and aerial vehicles and high degree-of-freedom (DoF) autonomous manipulation of industrial/medical robot arms and graspers.
Motion planning can be performed using either non-learning-based classical algorithms or learning-based neural approaches. Most recently, the powerful capabilities of deep neural networks (DNNs) have made neural planners very attractive because of their superior planning performance over classical methods. In particular, graph neural network (GNN)-enabled motion planners have demonstrated state-of-the-art performance across a set of challenging high-dimensional planning tasks, motivating efficient hardware acceleration to fully unleash their potential and promote their widespread deployment in practical applications.
To that end, in this paper we perform a preliminary study of efficient accelerator design for the GNN-based neural planner, especially for the neural explorer as the key component of the entire planning pipeline. By performing an in-depth analysis of the different design choices, we identify that a hybrid architecture, instead of the uniform sparse matrix multiplication (SpMM)-based solution popularly adopted in existing GNN hardware, is more suitable for our target neural explorer. With a set of optimizations on microarchitecture and dataflow, several design challenges incurred by the hybrid architecture, such as extensive memory access and imbalanced workload, can be efficiently mitigated. Evaluation results show that our proposed customized hardware architecture achieves order-of-magnitude performance improvement over CPU/GPU-based implementations with respect to area and energy efficiency in various working environments.
SESSION: From Logical to Physical Qubits: New Models and Techniques for Mapping
- Tsou-An Wu
- Yun-Jhe Jiang
- Shao-Yun Fang
Layout synthesis in quantum circuits maps the logical qubits of a synthesized circuit onto the physical qubits of a hardware device (coupling graph) and complies with the hardware limitations. Existing studies on the problem usually suffer from intractable formulation complexity and thus prohibitively long runtimes. In this paper, we propose an efficient layout synthesizer by developing a satisfiability modulo theories (SMT)-based qubit mapping checker. The proposed qubit mapping checker can efficiently derive a SWAP-free solution if one exists. If no SWAP-free solution exists for a circuit, we propose a divide-and-conquer scheme that utilizes the checker to find SWAP-free sub-solutions for sub-circuits, and the overall solution is found by merging sub-solutions with SWAP insertion. Experimental results show that the proposed optimization flow can achieve more than 3000× runtime speedup over a state-of-the-art work to derive optimal solutions for a set of SWAP-free circuits. Moreover, for the other set of benchmark circuits requiring SWAP gates, our flow achieves more than 800× speedup and obtains near-optimal solutions with only 3% SWAP overhead.
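A toy version of the SWAP-free feasibility question (a z3py sketch under simplifying assumptions, not the paper’s SMT encoding) asks whether a single logical-to-physical mapping makes every two-qubit gate act on an edge of the coupling graph; the coupling graph and circuits below are made up:

```python
from z3 import And, Distinct, Int, Or, Solver, sat

def swap_free_mapping(num_logical, num_physical, coupling_edges, two_qubit_gates):
    """Return a logical->physical qubit mapping under which every two-qubit
    gate acts on an adjacent physical pair (no SWAPs needed), or None."""
    phys = [Int(f"m_{q}") for q in range(num_logical)]
    s = Solver()
    for p in phys:
        s.add(p >= 0, p < num_physical)
    s.add(Distinct(*phys))                               # injective mapping
    for a, b in two_qubit_gates:                         # adjacency per gate
        s.add(Or([Or(And(phys[a] == u, phys[b] == v),
                     And(phys[a] == v, phys[b] == u))
                  for u, v in coupling_edges]))
    if s.check() == sat:
        model = s.model()
        return {q: model[phys[q]].as_long() for q in range(num_logical)}
    return None

line = [(0, 1), (1, 2), (2, 3)]                          # 4-qubit linear coupling graph
print(swap_free_mapping(4, 4, line, [(0, 1), (1, 2), (2, 3)]))   # SWAP-free mapping found
print(swap_free_mapping(4, 4, line, [(0, 1), (1, 2), (0, 2)]))   # needs SWAPs -> None
```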
- Ching-Yao Huang
- Chi-Hsiang Lien
- Wai-Kei Mak
Quantum computing is gaining more and more attention due to its huge potential and the constant progress in quantum computer development. IBM and Google have released quantum architectures with more than 50 qubits. However, in these machines the physical qubits are not fully connected, so two-qubit interactions can only be performed between specific pairs of physical qubits. To execute a quantum circuit, it is necessary to transform it into a functionally equivalent one that respects the constraints imposed by the target architecture. Quantum circuit transformation inevitably introduces additional gates, which reduces the fidelity of the circuit. Therefore, it is important that the transformation method completes the transformation with minimal overhead. The transformation consists of two steps: initial mapping and qubit routing. Here we propose a reinforcement learning-based model to solve the initial mapping problem. Initial mapping is formulated as sequence-to-sequence learning, and a self-attention network is used to extract features from a circuit. For qubit routing, a DEAR (Dynamically-Extract-and-Route) framework is proposed. The framework iteratively extracts a subcircuit and uses A* search to determine when and where to insert additional gates. It helps to preserve the lookahead ability dynamically and to provide more accurate cost estimation efficiently during A* search. The experimental results show that our RL model generates better initial mappings than the best known algorithms, with 12% fewer additional gates in the qubit routing stage. Furthermore, our DEAR framework outperforms the state-of-the-art qubit routing approach with 8.4% and 36.3% average reductions in the number of additional gates and execution time, respectively, starting from the same initial mapping.
- Bochen Tan
- Dolev Bluvstein
- Mikhail D. Lukin
- Jason Cong
Because they offer the largest number of qubits available and massively parallel execution of entangling two-qubit gates, atom arrays are a promising platform for quantum computing. The qubits are selectively loaded into arrays of optical traps, some of which can be moved during the computation itself. By adjusting the locations of the traps and shining a specific global laser, different pairs of qubits, even those initially far apart, can be entangled at different stages of the quantum program execution. In comparison, previous QC architectures only generate entanglement on a fixed set of quantum register pairs. Thus, reconfigurable atom arrays (RAA) present a new challenge for QC compilation, especially the qubit mapping/layout synthesis stage, which decides the qubit placement and gate scheduling. In this paper, we consider an RAA QC architecture that contains multiple arrays, supports 2D array movements, represents cutting-edge experimental platforms, and is much more general than previous works. We start by systematically examining the fundamental constraints on RAA imposed by physics. Built upon this understanding, we discretize the state space of the architecture, and we formulate layout synthesis for such an architecture as a satisfiability modulo theories problem. Finally, we demonstrate our work by compiling the quantum approximate optimization algorithm (QAOA), one of the promising near-term quantum computing applications. Our layout synthesizer reduces the number of required native two-qubit gates in 22-qubit QAOA by 5.72× (geomean) compared to leading experiments on a superconducting architecture. Combined with a better coherence time, this yields an order-of-magnitude increase in circuit fidelity.
- Sunghye Park
- Dohun Kim
- Jae-Yoon Sim
- Seokhyeong Kang
In response to the rapid development of quantum processors, quantum software must be advanced by considering the actual hardware limitations. Among the various design automation problems in quantum computing, qubit allocation modifies the input circuit to match the hardware topology constraints. In this work, we present an effective heuristic approach for qubit allocation that considers not only the hardware topology but also other constraints for near-fault-tolerant quantum computing (near-FTQC). We propose a practical methodology to find an effective initial mapping to reduce both the number of gates and circuit latency. We then perform dynamic scheduling to maximize the number of gates executed in parallel in the main mapping phase. Our experimental results with a Surface-17 processor confirmed a substantial reduction in the number of gates, latency, and runtime by 58%, 28%, and 99%, respectively, compared with the previous method [18]. Moreover, our mapping method is scalable and has a linear time complexity with respect to the number of gates.
SESSION: Smart Embedded Systems (Virtual)
- Leonidas Kosmidis
- Pietro Mercati
- Hao Kong
- Di Liu
- Shuo Huai
- Xiangzhong Luo
- Weichen Liu
- Ravi Subramaniam
- Christian Makaya
- Qian Lin
Scaling down the resolution of input images can greatly reduce the computational overhead of convolutional neural networks (CNNs), which is promising for edge AI. However, as an image usually contains much spatial redundancy, e.g., background pixels, directly shrinking the whole image will lose important features of the foreground object and lead to severe accuracy degradation. In this paper, we propose a dynamic image cropping framework to reduce the spatial redundancy by accurately cropping the foreground object from images. To achieve the instance-aware fine cropping, we introduce a lightweight foreground predictor to efficiently localize and crop the foreground of an image. The finely cropped images can be correctly recognized even at a small resolution. Meanwhile, computational redundancy also exists in CNN architectures. To pursue higher execution efficiency on resource-constrained embedded devices, we also propose a compound shrinking strategy to coordinately compress the three dimensions (depth, width, resolution) of CNNs. Eventually, we seamlessly combine the proposed dynamic image cropping and compound shrinking into a unified compression framework, Smart Scissor, which is expected to significantly reduce the computational overhead of CNNs while still maintaining high accuracy. Experiments on ImageNet-1K demonstrate that our method reduces the computational cost of ResNet50 by 41.5% while improving the top-1 accuracy by 0.3%. Moreover, compared to HRank, the state-of-the-art CNN compression framework, our method achieves 4.1% higher top-1 accuracy at the same computational cost. The codes and data are available at https://github.com/ntuliuteam/smart-scissor
- Yuankai Xu
- Tiancheng He
- Ruiqi Sun
- Yehan Ma
- Yier Jin
- An Zou
Despite being employed in burgeoning efforts to accelerate artificial intelligence, heterogeneous architectures have yet to be well managed under strict timing constraints. As a classic task model, multi-segment self-suspension (MSSS) has been proposed for general I/O-intensive systems and computation offloading. However, directly applying this model to heterogeneous architectures with multiple CPUs and many processing elements (PEs) suffers from tremendous pessimism. In this paper, we present a real-time scheduling approach, SHAPE, for general heterogeneous architectures with significantly improved schedulability and utilization. We start by building the general task execution pattern on a heterogeneous architecture integrating multiple CPU cores and many PEs, such as GPU streaming multiprocessors and FPGA IP cores. A real-time scheduling strategy and the corresponding schedulability analysis are presented following the task execution pattern. Compared with state-of-the-art scheduling algorithms through comprehensive experiments on unified and versatile tasks, SHAPE improves schedulability by 11.1% – 100%. Moreover, experiments performed on NVIDIA GPU systems further indicate that up to 70.9% pessimism reduction can be achieved by the proposed scheduling. Since we target general heterogeneous architectures, SHAPE can be directly applied to off-the-shelf heterogeneous computing systems with guaranteed deadlines and improved schedulability.
- Yu-Cheng Lin
- Yu-Pei Liang
- Tseng-Yi Chen
- Yuan-Hao Chang
- Shuo-Han Chen
- Wei-Kuan Shih
Many prior research works have discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor. That is, an inference model is constructed on a personal computer or a server platform and then integrated into embedded systems for just-in-time inference. Given the limited main memory space in embedded systems, an important problem for embedded machine learning systems is how to efficiently move an inference model between the main memory and secondary storage (e.g., flash memory). To tackle this problem, we need to consider how to preserve the locality inside the inference model during model construction. Therefore, we propose a solution, namely locality-aware random forest (LaRF), to preserve the inter-locality of all decision trees within a random forest model during the model construction process. Owing to the locality preservation, LaRF improves read latency by at least 81.5% compared to the original random forest library.
SESSION: Analog/Mixed-Signal Simulation, Layout, and Packaging (Virtual)
A number of sparse linear systems are solved by sparse LU factorization in a circuit simulation process. The coefficient matrices of these linear systems have an identical structure but different values. Pivoting is usually needed in sparse LU factorization to ensure numerical stability, which makes it difficult to predict the exact dependencies for scheduling parallel LU factorization. However, the matrix values usually change smoothly across circuit simulation iterations, which provides the potential to “guess” the dependencies. This work proposes a novel parallel LU factorization algorithm with pivoting reduction whose numerical stability is equivalent to LU factorization with pivoting. The basic idea is to reuse the previous structural and pivoting information as much as possible to perform highly-scalable parallel factorization without pivoting, scheduled by the “guessed” dependencies. Once a pivot is found to be too small, the remaining matrix is factorized with pivoting in a pipelined way. Comprehensive experiments, including comparisons with state-of-the-art CPU- and GPU-based parallel sparse direct solvers on 66 circuit matrices and real SPICE DC simulations on 4 circuit netlists, reveal the superior performance and scalability of the proposed algorithm. The proposed solver is available at https://github.com/chenxm1986/cktso.
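A dense, sequential toy of the reuse-then-repivot idea (illustrative only; the actual solver is sparse, parallel, and pipelined): apply the previous factorization’s row order up front and fall back to partial pivoting only when a reused pivot turns out to be too small.

```python
import numpy as np

def lu_with_reused_pivots(A, prev_perm, tol=1e-10):
    """Factor A using a row order reused from a previous factorization; fall
    back to partial pivoting only when a reused pivot is too small."""
    n = A.shape[0]
    U = A[prev_perm, :].astype(float).copy()       # apply the "guessed" row order
    L = np.eye(n)
    perm = list(prev_perm)
    for k in range(n):
        if abs(U[k, k]) < tol:                      # reused pivot failed -> repivot
            p = k + np.argmax(np.abs(U[k:, k]))
            U[[k, p], :] = U[[p, k], :]
            L[[k, p], :k] = L[[p, k], :k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):                   # standard elimination step
            L[i, k] = U[i, k] / U[k, k]
            U[i, :] -= L[i, k] * U[k, :]
    return L, np.triu(U), perm

A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
L, U, perm = lu_with_reused_pivots(A, prev_perm=[2, 1, 0])   # reuse a prior pivot order
assert np.allclose(L @ U, A[perm])
```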
- Cong Wang
- Dongen Yang
- Quan Chen
The exponential integrator (EI) method has proved to be an effective technique for accelerating large-scale transient power/ground network analysis. However, EI requires the inputs to be piece-wise linear (PWL) within one step, which greatly limits the step size when the inputs are poorly aligned. To address this issue, in this work we first elucidate with mathematical proof that EI, when used together with the rational Krylov subspace, is equivalent to performing a moment-matching model order reduction (MOR) with a single input in each time step and then advancing the reduced system using EI in the same step. Based on this equivalence, we devise a hybrid method, EI-MOR, to combine the usage of EI and MOR in the same transient simulation. The majority of well-aligned inputs are still treated by EI as usual, while a few misaligned inputs are selected to be handled by a MOR process that produces a reduced model valid for arbitrary inputs. Therefore the step size limitation imposed by the misaligned inputs can be largely alleviated. Numerical experiments are conducted to demonstrate the efficacy of the proposed method.
- Zhen Zhuang
- Bei Yu
- Kai-Yuan Chao
- Tsung-Yi Ho
Due to the cost and design complexity associated with advanced technology nodes, it is difficult for traditional monolithic System-on-Chip designs to follow Moore’s Law, which means the economic benefits have weakened. Semiconductor industries are looking to advanced packages to improve the economic advantages. Since the multi-chiplet architecture supporting heterogeneous integration offers robust re-usability and effective cost reduction, chiplet integration has become the mainstream of advanced packaging. Nowadays, the number of mounted chiplets in a package keeps increasing with the requirement of high system performance. However, the large area caused by the increasing number of chiplets leads to serious reliability issues, including warpage and bump stress, which worsen yield and cost. The multi-package architecture, which can distribute chiplets to multiple packages and use less area per package, is a popular alternative to enhance reliability and reduce cost in advanced packages. However, the primary challenge of the multi-package architecture lies in the tradeoff between the inter-package costs, i.e., the interconnection among packages, and the intra-package costs, i.e., the reliability issues caused by warpage and bump stress. Therefore, a co-design methodology is indispensable for optimizing multiple packages simultaneously to improve the quality of the whole system. To tackle this challenge, we adopt mathematical programming methods for the multi-package co-design problem, given the synergistic nature of optimizing multiple packages. To the best of our knowledge, this is the first work to solve the multi-package co-design problem.
SESSION: Advanced PIM and Biochip Technology and Stochastic Computing (Virtual)
- Xing Li
- Rachata Ausavarungnirun
- Xiao Liu
- Xueyuan Liu
- Xuan Zhang
- Heng Lu
- Zhuoran Song
- Naifeng Jing
- Xiaoyao Liang
Graph applications play a significant role in real-world data computation. However, their memory access patterns, characterized by a low compute-to-communication ratio, poor temporal locality, and poor spatial locality, become the performance bottleneck. Existing RRAM-based processing-in-memory accelerators reduce data movement but fail to address both the sparsity and the redundancy of graph data. In this work, we present Gzippo, a highly compact design that supports graph computation in a compressed sparse format. Gzippo employs a tandem-isomorphic-crossbar architecture both to eliminate redundant searches and sequential indexing during iterations, and to remove the sparsity that leads to non-effective computation on zero values. Gzippo achieves a 3.0× (up to 17.4×) performance speedup and a 23.9× (up to 163.2×) energy efficiency improvement over a state-of-the-art RRAM-based PIM accelerator.
- Siyuan Liang
- Mengchu Li
- Tsun-Ming Tseng
- Ulf Schlichtmann
- Tsung-Yi Ho
Flow-based microfluidic chips are one of the most promising platforms for biochemical experiments. Transportation channels and operation devices inside these chips are controlled by microvalves, which are driven by external pressure sources. As the complexity of experiments on these chips keeps increasing, control multiplexers (MUXes) become necessary for actuating the enormous number of valves. However, current binary-coding-based MUXes do not take full advantage of the coding capacity and suffer from reliability problems caused by the high control channel density. In this work, we propose a novel MUX coding strategy, named Combinatorial Coding, along with an algorithm to synthesize combinatorial-coding-based MUXes (CoMUXes) of arbitrary sizes with the proven maximum coding capacity. Moreover, we develop a simplification method to reduce the number of valves and control channels in CoMUXes and thus improve their reliability. We compare CoMUXes with state-of-the-art MUXes under different control demands with up to 10 × 2^13 independent control channels. Experiments show that CoMUXes can reliably control more independent control channels with fewer resources. For example, when the number of to-be-controlled channels is up to 10 × 2^13, the optimized CoMUX reduces the number of required flow channels by 44% and the number of valves by 90% compared to a state-of-the-art MUX.
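As a back-of-the-envelope comparison of coding capacities (assuming a conventional binary-coded MUX spends its n control channels as n/2 complementary pairs, and a combinatorial code pressurizes k of n channels; the actual CoMUX structure and simplification method are described in the paper):

```python
from math import comb

def binary_capacity(n):
    """Channels addressable by a binary-coded MUX that uses its n control
    channels as n/2 complementary pairs (assumed conventional scheme)."""
    return 2 ** (n // 2)

def combinatorial_capacity(n):
    """Best k-of-n combinatorial code: choose k channels to pressurize."""
    return max(comb(n, k) for k in range(n + 1))

for n in (6, 10, 14):
    print(n, binary_capacity(n), combinatorial_capacity(n))
# e.g. with 10 control channels: 32 (binary) vs 252 (combinatorial)
```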
- Kuncai Zhong
- Zexi Li
- Haoran Jin
- Weikang Qian
Stochastic computing (SC) generally suffers from long latency. One solution is to apply proper random number sources (RNSs). Nevertheless, current RNS designs either have high hardware cost or low accuracy. To address this issue, motivated by the observation that a uniform spatial distribution generally leads to high accuracy for an SC circuit, we propose a basic architecture to generate the uniform spatial distribution, together with a detailed implementation of it. For the implementation, we further propose one method to optimize its hardware cost and another to optimize its accuracy; the hardware cost optimization does not affect the accuracy. The experimental results show that the proposed implementation achieves both low hardware cost and high accuracy. Compared to the state-of-the-art stochastic number generator design, the proposed design reduces area by 88% with comparable accuracy.
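To make the role of the RNS concrete (a toy sketch, not the proposed architecture): SC multiplies two values by ANDing their bitstreams, and the resulting accuracy depends on how the RNS samples the unit interval; the evenly spread "uniform" source below is only one illustrative way to obtain uniform spatial coverage.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256                              # bitstream length
a, b = 0.3, 0.6                      # values encoded as bit probabilities

def bitstream(p, rns):               # classic SNG: compare the value against RNS samples
    return (rns < p).astype(int)

# RNS 1: independent pseudo-random samples.
rand_src = (rng.random(N), rng.random(N))
# RNS 2: evenly spread samples covering [0, 1) uniformly, independently
# permuted per stream so the two streams stay uncorrelated.
grid = (np.arange(N) + 0.5) / N
unif_src = (rng.permutation(grid), rng.permutation(grid))

for (ra, rb), tag in ((rand_src, "random RNS "), (unif_src, "uniform RNS")):
    estimate = np.mean(bitstream(a, ra) & bitstream(b, rb))   # AND of streams ~ a*b
    print(tag, "estimate =", estimate, " abs error =", abs(estimate - a * b))
```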
SESSION: On Automating Heterogeneous Designs (Virtual)
- Jai-Ming Lin
- Po-Chen Lu
- Heng-Yu Lin
- Jia-Ting Tsai
Although the 3D integrated circuit (IC) placement problem has been studied for many years, few publications have been devoted to macro legalization. Due to the large sizes of macros, the macro placement problem is harder than cell placement, especially when preplaced macros exist in a multi-tier structure. In order to take a more global view, this paper proposes a partitioning-last, macro-first flow to handle 3D placement for mixed-size designs, which performs tier partitioning after placement prototyping and then legalizes macros before cell placement. A novel two-step approach is proposed to handle 3D macro placement. The first step determines the locations of macros in a projection plane based on a new representation, named K-tier Partially Occupied Corner Stitching. It not only keeps the prototyping result but also guarantees a legal placement after tier assignment of macros. Next, macros are assigned to their respective tiers by an Integer Linear Programming (ILP) algorithm. Experimental results show that our design flow obtains better solutions than other flows, especially in cases with more preplaced macros.
- Jai-Ming Lin
- Hao-Yuan Hsieh
- Hsuan Kung
- Hao-Jia Lin
The quality of a true 3D placement approach greatly relies on the correctness of the models used in its formulation. However, the models used by previous approaches are not precise enough. Moreover, they do not actually place TSVs, which prevents them from obtaining accurate wirelength and constructing a correct congestion map. Besides, they rarely discuss routability, which is the most important issue considered in 2D placement. To resolve this insufficiency, this paper proposes more accurate models to estimate placement utilization and TSV count using the softmax function, which can align cells to exact tiers. Moreover, we propose a fast parallel algorithm to update the locations of TSVs when cells are moved during optimization. Finally, we present a novel penalty model to estimate the routing overflow of regions covered by cells and inflate cells in congested regions according to this model. Experimental results show that our methodology obtains better results than previous works.
SESSION: Special Session: Quantum Computing to Solve Chemistry, Physics and Security Problems (Virtual)
- Collin Beaudoin
- Satwik Kundu
- Rasit Onur Topaloglu
- Swaroop Ghosh
Using quantum computing, this paper addresses two scientifically pressing and practically relevant problems, namely, chemical retrosynthesis, an important step in drug/material discovery, and the security of the semiconductor supply chain. We show that Quantum Long Short-Term Memory (QLSTM) is a viable tool for retrosynthesis. We achieve 65% training accuracy with QLSTM, whereas classical LSTM can achieve 100%. However, in testing we achieve 80% accuracy with the QLSTM, while the classical LSTM peaks at only 70% accuracy. We also demonstrate an application of Quantum Neural Networks (QNN) in the hardware security domain, specifically in Hardware Trojan (HT) detection using a set of power and area Trojan features. The QNN model achieves detection accuracy as high as 97.27%.
- Andrea Delgado
- Kathleen E. Hamilton
Some of the most significant achievements of the modern era of particle physics, such as the discovery of the Higgs boson, have been made possible by the tremendous effort in building and operating large-scale experiments like the Large Hadron Collider or the Tevatron. In these facilities, the ultimate theory to describe matter at the most fundamental level is constantly probed and verified. These experiments often produce large amounts of data that require storing, processing, and analysis techniques that continually push the limits of traditional information processing schemes. Thus, the High-Energy Physics (HEP) field has benefited from advancements in information processing and the development of algorithms and tools for large datasets. More recently, quantum computing applications have been investigated to understand how the community can benefit from the advantages of quantum information science. Nonetheless, to unleash the full potential of quantum computing, there is a need to understand the quantum behavior and, thus, scale up current algorithms beyond what can be simulated in classical processors. In this work, we explore potential applications of quantum machine learning to data analysis tasks in HEP and how to overcome the limitations of algorithms targeted for Noisy Intermediate-Scale Quantum (NISQ) devices.
SESSION: Making Patterning Work (Virtual)
- Qipan Wang
- Xiaohan Gao
- Yibo Lin
- Runsheng Wang
- Ru Huang
Post Exposure Baking (PEB) has been widely utilized in advanced lithography. PEB simulation is critical in the lithography simulation flow, as it bridges the optical simulation result and the final developed profile in the photoresist. The PEB process can be described by coupled partial differential equations (PDEs) with corresponding boundary and initial conditions. Recent years have witnessed a growing presence of machine learning algorithms in lithography simulation, while PEB simulation is often ignored or treated with compact models, considering the huge cost of solving PDEs exactly. In this work, based on an observation of the physical essence of PEB, we propose DeePEB: a neural PDE solver for PEB simulation. This model is capable of predicting the PEB latent image with high accuracy and >100× acceleration (compared to the commercial rigorous simulation tool), paving the way for efficient and accurate photoresist modeling in lithography simulation and layout optimization.
- Wenqian Zhao
- Xufeng Yao
- Ziyang Yu
- Guojin Chen
- Yuzhe Ma
- Bei Yu
- Martin D. F. Wong
Optical proximity correction (OPC) is a widely-used resolution enhancement technique (RET) for printability optimization. Recently, rigorous numerical optimization and fast machine learning have become the research focus of OPC in both academia and industry, each of which complements the other in terms of robustness or efficiency. We inspect the pattern distribution on a design layer and find that different sub-regions have different pattern complexity. Besides, we also find that many patterns appear repeatedly in the design layout, and these patterns may share optimized masks. We exploit these properties and propose a self-adaptive OPC framework to improve efficiency. First, we adaptively choose different OPC solvers for patterns of different complexity from an extensible solver pool to reach a speed/accuracy co-optimization. Apart from that, we prove the feasibility of reusing optimized masks for repeated patterns and hence build a graph-based dynamic pattern library that reuses stored masks to further speed up the OPC flow. Experimental results show that our framework achieves substantial improvement in both performance and efficiency.
- Liangjian Wen
- Yi Zhu
- Lei Ye
- Guojin Chen
- Bei Yu
- Jianzhuang Liu
- Chunjing Xu
Generating legal and diverse layout patterns to establish large pattern libraries is fundamental for many lithography design applications. Existing pattern generation models typically regard the pattern generation problem as image generation of layout maps and learn to model the patterns via capturing pixel-level coherence, which is insufficient to achieve polygon-level modeling, e.g., shape and layout of patterns, thus leading to poor generation quality. In this paper, we regard the pattern generation problem as an unsupervised sequence generation problem, in order to learn the pattern design rules by explicitly modeling the shapes of polygons and the layouts among polygons. Specifically, we first propose a sequential pattern representation scheme that fully describes the geometric information of polygons by encoding the 2D layout patterns as sequences of tokens, i.e., vertexes and edges. Then we train a sequential generative model to capture the long-term dependency among tokens and thus learn the design rules from training examples. To generate a new pattern in sequence, each token is generated conditioned on the previously generated tokens that are from the same polygon or different polygons in the same layout map. Our framework, termed LayouTransformer, is based on the Transformer architecture due to its remarkable ability in sequence modeling. Comprehensive experiments show that our LayouTransformer not only generates a large amount of legal patterns but also maintains high generation diversity, demonstrating its superiority over existing pattern generative models.
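A toy version of the sequential encoding idea (the paper’s actual token vocabulary and ordering may differ): each rectilinear polygon becomes vertex tokens interleaved with horizontal/vertical edge tokens, which a sequence model can then learn to generate token by token.

```python
def polygon_to_tokens(vertices):
    """Encode a rectilinear polygon as a token sequence: one vertex token per
    corner, interleaved with a horizontal/vertical edge token. A toy scheme;
    the vocabulary used by an actual sequence model may differ."""
    tokens = ["<BOS>"]
    n = len(vertices)
    for i, (x, y) in enumerate(vertices):
        nx, ny = vertices[(i + 1) % n]
        tokens.append(f"V({x},{y})")
        tokens.append("E_H" if ny == y else "E_V")
    tokens.append("<EOS>")
    return tokens

# A 4x2 rectangle becomes a short sequence of vertex and edge tokens.
print(polygon_to_tokens([(0, 0), (4, 0), (4, 2), (0, 2)]))
```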
- Qijing Wang
- Martin D. F. Wong
As the demand for semiconductor products increases and integrated circuit (IC) processes become more and more complex, wafer failure pattern classification is gaining more attention from manufacturers and researchers as a way to improve yield. To further cope with the real-world scenario in which only very limited labeled data, and no unlabeled data, are available in the early manufacturing stage of new products, this work proposes an efficient human-like staged learning framework for wafer failure pattern classification named WaferHSL. Inspired by the human knowledge acquisition process, a mutually reinforcing task fusion scheme is designed to guide the deep learning model to simultaneously establish knowledge of spatial relationships, geometric properties, and semantics. Furthermore, a progressive stage controller is deployed to partition and control the learning process, so as to enable human-like progressive advancement in the model. Experimental results show that with only 10% labeled samples and no unlabeled samples, WaferHSL can achieve better results than previous SOTA methods trained with 60% labeled samples and a large number of unlabeled samples, while the improvement is even more significant when using the same size of labeled training set.
SESSION: Advanced Verification Technologies (Virtual)
- Xiaoyu Zhang
- Shengping Xiao
- Jianwen Li
- Geguang Pu
- Ofer Strichman
Bounded Model Checking (BMC) is so far considered as the best engine for bug-finding in hardware model checking. Given a bound K, BMC can detect if there is a counterexample to a given temporal property within K steps from the initial state, thus performing a global-style search. Recently, a SAT-based model-checking technique called Complementary Approximate Reachability (CAR) was shown to be complementary to BMC, in the sense that frequently they can solve instances that the other technique cannot, within the same time limit. CAR detects a counterexample gradually with the guidance of an over-approximating state sequence, and performs a local-style search. In this paper, we consider three different ways to combine BMC and CAR. Our experiments show that they all outperform BMC and CAR on their own, and solve instances that cannot be solved by these two techniques. Our findings are based on a comprehensive experimental evaluation using the benchmarks of two hardware model checking competitions.
- Xin Hong
- Yuan Feng
- Sanjiang Li
- Mingsheng Ying
Despite the rapid development of quantum computing in recent years, state-of-the-art quantum devices still contain only a limited number of qubits. One possible way to execute more realistic algorithms on near-term quantum devices is to employ dynamic quantum circuits (DQCs). In DQCs, measurements can happen during the circuit, and their outcomes can be processed with classical computers and used to control other parts of the circuit. This technique can help significantly reduce the qubit resources required to implement a quantum algorithm. In this paper, we give a formal definition of DQCs and then characterise their functionality in terms of ensembles of linear operators, following the Kraus representation of superoperators. We further interpret DQCs as tensor networks, implement their functionality as tensor decision diagrams (TDDs), and reduce the equivalence checking of two DQCs to checking whether they have the same TDD representation. Experiments show that embedding classical logic into conventional quantum circuits does not incur a significant time and space burden.
SESSION: Routing with Cell Movement (Virtual)
- Xinshi Zang
- Fangzhou Wang
- Jinwei Liu
- Martin D. F. Wong
Placement and routing are two crucial steps in the physical design of integrated circuits (ICs). To close the gap between placement and routing, the routing with cell movement problem has attracted great attention recently. In this problem, a certain number of cells can be moved to new positions and the nets can be rerouted to improve the total wire length. In this work, we advance the study on this problem by proposing a two-level layer-aware scheme, named ATLAS. A coarse-level cluster-based cell movement is first performed to optimize via usage and provides a better starting point for the next fine-level single cell movement. To further encourage routing on the upper metal layers, we utilize a set of adjusted layer weights to increase the routing cost on lower layers. Experimental results on the ICCAD 2020 contest benchmarks show that ATLAS achieves much more wire length reduction compared with the state-of-the-art routing with cell movement engine. Furthermore, applied on the ICCAD 2021 contest benchmarks, ATLAS outperforms the first place team of the contest with much better solution quality while being 3× faster.
- Ziran Zhu
- Fuheng Shen
- Yangjie Mei
- Zhipeng Huang
- Jianli Chen
- Jun Yang
Placement and routing are typically defined as two separate problems to reduce design complexity. However, such a divide-and-conquer approach inevitably degrades solution quality because the objectives of placement and routing are not entirely consistent. Besides, with various constraints (e.g., timing, R/C characteristics, voltage areas) imposed by advanced circuit designs, bridging the gap between placement and routing while satisfying these constraints has become more challenging. In this paper, we develop a robust global routing engine with high-accuracy cell movement under advanced constraints to narrow the gap and improve the routing solution. We first present a routing refinement technique to obtain a convergent routing result based on the fixed placement, which provides more accurate information for subsequent cell movement. To achieve fast and high-accuracy position prediction for cell movement, we construct a lookup table (LUT) considering complex constraints/objectives (e.g., routing direction and layer-based power consumption), and generate a timing-driven gain map for each cell based on the LUT. Finally, based on the prediction, we propose an alternating cell movement and cluster movement scheme followed by partial rip-up and reroute to optimize the routing solution. Experimental results on the ICCAD 2020 contest benchmarks show that our algorithm achieves the best total scores among all published works. On the ICCAD 2021 contest benchmarks, our algorithm achieves better solution quality in shorter runtime than the champion of the ICCAD 2021 contest.
SESSION: Special Session: Hardware Security through Reconfigurability: Attacks, Defenses, and Challenges
- Nima Kavand
- Armin Darjani
- Shubham Rai
- Akash Kumar
Hardware security has been an ever-growing concern of integrated circuit (IC) designers. Through different stages of the IC design and life cycle, an adversary can extract sensitive design information and private data stored in the circuit using logical, physical, and structural weaknesses. Moreover, in recent times, ML-based attacks have become the de facto standard in the hardware security community, and contemporary defense strategies often face unforeseen challenges in coping with these attack schemes. Additionally, the high overhead of CMOS-based secure add-on circuitry and the intrinsic limitations of these devices indicate the need for new nano-electronics. Emerging reconfigurable devices such as Reconfigurable Field Effect Transistors (RFETs) provide unique features to fortify a design against various threats at different stages of the IC design and life cycle. In this manuscript, we investigate the applications of RFETs for securing designs against traditional and machine learning (ML)-based intellectual property (IP) piracy techniques and side-channel attacks (SCAs).
- Luca Collini
- Benjamin Tan
- Christian Pilato
- Ramesh Karri
Protecting the intellectual property (IP) of integrated circuit (IC) designs is becoming a significant concern for fabless semiconductor design houses. Malicious actors can access the chip design at any stage, reverse engineer the functionality, and create illegal copies. On the one hand, defenders are crafting more and more solutions to hide the critical portions of the circuit. On the other hand, attackers are designing more and more powerful tools to extract useful information from the design and reverse engineer the functionality, especially when they can get access to working chips. In this context, the use of custom reconfigurable fabrics has recently been investigated for hardware IP protection. This paper discusses recent trends in hardware obfuscation with embedded FPGAs, focusing also on the open challenges that must be addressed to make this solution viable.
SESSION: Performance, Power and Temperature Aspects in Deep Learning
- Chaojian Li
- Sixu Li
- Yang Zhao
- Wenbo Zhu
- Yingyan Lin
Neural Radiance Field (NeRF) based rendering has attracted growing attention thanks to its state-of-the-art (SOTA) rendering quality and wide applications in Augmented and Virtual Reality (AR/VR). However, immersive, real-time (>30 FPS) interactions enabled by NeRF-based rendering are still limited by the low achievable throughput on AR/VR devices. To this end, we first profile SOTA efficient NeRF algorithms on commercial devices and identify two primary causes of the aforementioned inefficiency: (1) uniform point sampling and (2) dense accesses to, and computations on, the required embeddings in NeRF. Furthermore, we propose RT-NeRF, which to the best of our knowledge is the first algorithm-hardware co-design for accelerating NeRF. Specifically, on the algorithm level, RT-NeRF integrates an efficient rendering pipeline that largely alleviates the inefficiency of the commonly adopted uniform point sampling method in NeRF by directly computing the geometry of pre-existing points. Additionally, RT-NeRF leverages a coarse-grained, view-dependent computing-ordering scheme to eliminate the (unnecessary) processing of invisible points. On the hardware level, our proposed RT-NeRF accelerator (1) adopts a hybrid encoding scheme that adaptively switches between a bitmap-based and a coordinate-based sparsity encoding format for NeRF's sparse embeddings, aiming to maximize storage savings and thus reduce the required DRAM accesses while supporting efficient NeRF decoding; and (2) integrates both a high-density sparse search unit and a dual-purpose bi-directional adder & search tree to coordinate the two aforementioned encoding formats. Extensive experiments on eight datasets consistently validate the effectiveness of RT-NeRF, achieving a large throughput improvement (e.g., 9.7×~3,201×) while maintaining rendering quality as compared with SOTA efficient NeRF solutions.
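The hybrid sparsity-encoding idea can be illustrated with a small, hypothetical sketch: depending on a block's occupancy, either a bitmap or a coordinate list is the cheaper representation, so the encoder picks whichever costs fewer bits. The block size and bit widths below are illustrative assumptions, not the accelerator's actual formats.

```python
import numpy as np

def encode_block(block, value_bits=8, index_bits=16):
    """Choose bitmap vs. coordinate encoding for one sparse embedding block.

    Returns (format, payload, cost_in_bits). Illustrative only: the real
    accelerator uses hardware-specific formats and thresholds.
    """
    nonzero = np.flatnonzero(block)
    bitmap_cost = block.size + nonzero.size * value_bits    # 1 bit/entry + values
    coord_cost = nonzero.size * (index_bits + value_bits)   # index + value per nonzero
    if bitmap_cost <= coord_cost:
        return "bitmap", (block != 0, block[nonzero]), bitmap_cost
    return "coordinate", (nonzero, block[nonzero]), coord_cost

rng = np.random.default_rng(0)
dense_block = rng.random(256) * (rng.random(256) < 0.5)    # ~50% occupancy -> bitmap wins
sparse_block = rng.random(256) * (rng.random(256) < 0.02)  # ~2% occupancy  -> coordinates win
for b in (dense_block, sparse_block):
    fmt, _, bits = encode_block(b)
    print(fmt, bits)
```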
- Yifan Gong
- Zheng Zhan
- Pu Zhao
- Yushu Wu
- Chao Wu
- Caiwen Ding
- Weiwen Jiang
- Minghai Qin
- Yanzhi Wang
During the deployment of deep neural networks (DNNs) on edge devices, many research efforts have been devoted to the limited hardware resources. However, little attention is paid to the influence of dynamic power management. As edge devices typically have only a limited energy budget from batteries (rather than the nearly unlimited energy supply of servers or workstations), their dynamic power management often changes the execution frequency, as in the widely used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework designed to work with dynamic power management using DVFS. The framework can use only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the pruning ratio corresponding to a specific execution frequency (and voltage), we are able to achieve stable inference speed, i.e., keeping the difference in speed performance under various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces the variance of their inference latency across frequencies, with the minimal memory consumption of only one model and one soft mask.
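A rough illustration of pairing pruning ratios with DVFS operating points, assuming one shared weight tensor and one soft mask of importance scores: each frequency selects a mask threshold, yielding a differently sparsified model from the same parameters. The frequency-to-ratio table is invented for the example.

```python
import numpy as np

# One shared weight tensor plus one soft mask (importance scores). The values
# and the frequency->pruning-ratio table below are illustrative assumptions.
rng = np.random.default_rng(1)
weights = rng.normal(size=(512, 512))
soft_mask = np.abs(rng.normal(size=weights.shape))      # importance scores

ratio_for_freq = {1.5e9: 0.0, 1.0e9: 0.5, 0.6e9: 0.75}  # Hz -> pruning ratio

def configure(freq_hz):
    """Return the pruned weights matching the current DVFS frequency."""
    ratio = ratio_for_freq[freq_hz]
    if ratio == 0.0:
        return weights
    threshold = np.quantile(soft_mask, ratio)            # prune lowest-importance weights
    return np.where(soft_mask >= threshold, weights, 0.0)

for f in ratio_for_freq:
    w = configure(f)
    print(f"{f / 1e9:.1f} GHz -> sparsity {np.mean(w == 0):.2f}")
```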
- Jingyu Pan
- Chen-Chia Chang
- Zhiyao Xie
- Jiang Hu
- Yiran Chen
Deep learning has been widely applied in various VLSI design automation tasks, from layout quality estimation to design optimization. Though deep learning has shown state-of-the-art performance in several applications, recent studies reveal that deep neural networks exhibit intrinsic vulnerability to adversarial perturbations, which pose risks in the ML-aided VLSI design flow. One of the most effective strategies to improve robustness is regularization, which adjusts the optimization objective to make the deep neural network generalize better. In this paper, we examine several adversarial defense methods to improve the robustness of ML-based lithography hotspot detectors. We present an innovative design rule checking (DRC)-guided curvature regularization (CURE) approach, which is customized to robustify ML-based lithography hotspot detectors against white-box attacks. Our approach allows for improvements in both the robustness and the accuracy of the model. Experiments show that the model optimized by DRC-guided CURE achieves the highest robustness and accuracy compared with those trained using the baseline defense methods. Compared with the vanilla model, DRC-guided CURE decreases the average attack success rate by 53.9% and increases the average ROC-AUC by 12.1%. Compared with the best of the defense baselines, DRC-guided CURE reduces the average attack success rate by 18.6% and improves the average ROC-AUC by 4.3%.
- Mengyuan Li
- Arman Kazemi
- Ann Franchesca Laguna
- X. Sharon Hu
Experience replay is an essential component in deep reinforcement learning (DRL); it stores past experiences and supplies them to the agent for learning in real time. Recently, prioritized experience replay (PER) has been proven to be powerful and is widely deployed in DRL agents. However, implementing PER on traditional CPU or GPU architectures incurs significant latency overhead due to its frequent and irregular memory accesses. This paper proposes a hardware-software co-design approach to design an associative memory (AM) based PER, AMPER, with an AM-friendly priority sampling operation. AMPER replaces the widely used, time-costly tree-traversal-based priority sampling in PER while preserving the learning performance. Further, we design an in-memory computing hardware architecture based on AM to support AMPER by leveraging parallel in-memory search operations. AMPER shows comparable learning performance while achieving a 55× to 270× latency improvement when running on the proposed hardware compared to the state-of-the-art PER running on a GPU.
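For context, the tree-traversal-based priority sampling that AMPER replaces is commonly implemented with a sum tree; the minimal sketch below (a generic PER building block, not the paper's AM-friendly operation) shows why each sample involves an irregular root-to-leaf walk with data-dependent memory accesses.

```python
import random

class SumTree:
    """Classic sum tree used by prioritized experience replay (illustrative)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)    # internal nodes store subtree sums

    def update(self, idx, priority):
        pos = idx + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:                        # propagate the change up to the root
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def sample(self):
        """Walk from the root to a leaf; each step is a data-dependent access."""
        s = random.uniform(0.0, self.tree[1])
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if s <= self.tree[left]:
                pos = left
            else:
                s -= self.tree[left]
                pos = left + 1
        return pos - self.capacity

tree = SumTree(8)
for i, p in enumerate([0.1, 0.4, 0.05, 0.2, 0.9, 0.3, 0.6, 0.15]):
    tree.update(i, p)
print([tree.sample() for _ in range(5)])
```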
SESSION: Tutorial: TorchQuantum Case Study for Robust Quantum Circuits
- Hanrui Wang
- Zhiding Liang
- Jiaqi Gu
- Zirui Li
- Yongshan Ding
- Weiwen Jiang
- Yiyu Shi
- David Z. Pan
- Frederic T. Chong
- Song Han
Quantum Computing has attracted much research attention because of its potential to achieve fundamental speed and efficiency improvements in various domains. Among different quantum algorithms, Parameterized Quantum Circuits (PQCs) for Quantum Machine Learning (QML) show promise for realizing quantum advantages on current Noisy Intermediate-Scale Quantum (NISQ) machines. Therefore, to facilitate QML and PQC research, a recent Python library called TorchQuantum has been released. It can construct, simulate, and train PQCs for machine learning tasks with high speed and convenient debugging support. Besides quantum for ML, we want to raise the community's attention to the reverse direction: ML for quantum. Specifically, the TorchQuantum library also supports using data-driven ML models to solve problems in quantum system research, such as predicting the impact of quantum noise on circuit fidelity and improving quantum circuit compilation efficiency.
This paper presents a case study of the ML for quantum part in TorchQuantum. Since estimating the noise impact on circuit reliability is an essential step toward understanding and mitigating noise, we propose to leverage classical ML to predict noise impact on circuit fidelity. Inspired by the natural graph representation of quantum circuits, we propose to leverage a graph transformer model to predict the noisy circuit fidelity. We firstly collect a large dataset with a variety of quantum circuits and obtain their fidelity on noisy simulators and real machines. Then we embed each circuit into a graph with gate and noise properties as node features, and adopt a graph transformer to predict the fidelity. We can avoid exponential classical simulation cost and efficiently estimate fidelity with polynomial complexity.
Evaluated on 5,000 random and algorithm circuits, the graph transformer predictor provides accurate fidelity estimation with an RMSE of 0.04 and outperforms a simple neural network-based model by 0.02 on average. It achieves R² scores of 0.99 and 0.95 for random and algorithm circuits, respectively. Compared with circuit simulators, the predictor offers over 200× speedup for estimating fidelity. The datasets and predictors can be accessed in the TorchQuantum library.
SESSION: Emerging Machine Learning Primitives: From Technology to Application
- Dharanidhar Dang
- Hai Helen Lee
- Che-Kai Liu
- Haobang Chen
- Mohsen Imani
- Kai Ni
- Arman Kazemi
- Ann Franchesca Laguna
- Michael Niemier
- Xiaobo Sharon Hu
- Liang Zhao
- Cheng Zhuo
- Xunzhao Yin
In a number of machine learning models, an input query is searched across the trained class vectors to find the closest feature class vector under the cosine similarity metric. However, computing cosine similarities between vectors on von Neumann machines involves a large number of multiplications, Euclidean normalizations, and division operations, thus incurring heavy hardware energy and latency overheads. Moreover, due to the memory wall problem present in conventional architectures, frequent cosine similarity-based searches (CSSs) over the class vectors require substantial data movement, limiting the throughput and efficiency of the system. To overcome the aforementioned challenges, this paper introduces COSIME, a general in-memory associative memory (AM) engine based on the ferroelectric FET (FeFET) device for efficient CSS. By leveraging the one-transistor AND gate function of FeFET devices, a current-based translinear analog circuit, and winner-take-all (WTA) circuitry, COSIME realizes parallel in-memory CSS across all the entries in a memory block and outputs the word closest to the input query under the cosine similarity metric. Evaluation results at the array level suggest that the proposed COSIME design achieves 333× and 90.5× latency and energy improvements, respectively, and realizes better classification accuracy when compared with an AM design implementing approximated CSS. The proposed in-memory computing fabric is evaluated on an HDC problem, showcasing that COSIME achieves on average 47.1× speedup and 98.5× energy efficiency improvement compared with a GPU implementation.
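As a purely functional reference for what the engine computes, the following NumPy sketch performs the same cosine-similarity search in software; it illustrates the normalizations and dot products a von Neumann baseline must execute per query, not the FeFET circuit itself.

```python
import numpy as np

def cosine_similarity_search(query, class_vectors):
    """Return the index of the class vector closest to `query` in cosine similarity.

    This is the software baseline: every search needs one dot product per stored
    vector plus norms and divisions, which is what the in-memory engine avoids
    executing on a conventional processor.
    """
    q = query / np.linalg.norm(query)
    c = class_vectors / np.linalg.norm(class_vectors, axis=1, keepdims=True)
    similarities = c @ q
    return int(np.argmax(similarities)), similarities

rng = np.random.default_rng(42)
classes = rng.normal(size=(16, 1024))            # e.g., 16 hyperdimensional class vectors
query = classes[3] + 0.1 * rng.normal(size=1024) # noisy version of class 3
best, _ = cosine_similarity_search(query, classes)
print(best)   # expected: 3
```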
- Thai-Hoang Nguyen
- Muhammad Imran
- Joon-Sung Yang
As the effectiveness of Deep Neural Networks (DNNs) rises over time, so does the need for highly scalable and efficient hardware architectures to capitalize on this effectiveness in many practical applications. Emerging non-volatile Phase Change Memory (PCM) technology has been found to be a promising candidate for future memory systems due to its better scalability, non-volatility, and low leakage/dynamic power consumption compared to conventional charge-based memories. Additionally, with its cells' wide resistance span, PCM also has Flash-like Multi-Level Cell (MLC) capability, which enhances storage density and provides an opportunity for deploying data-intensive applications such as DNNs on resource-constrained edge devices. However, the practical deployment of MLC PCM is hampered by certain reliability challenges, among which resistance drift is considered a critical concern. In a DNN application, the presence of resistance drift in MLC PCM can severely impact DNN accuracy if no drift-error-tolerance technique is utilized. This paper proposes DynaPAT, a low-cost and effective pattern-aware encoding technique to enhance the drift-error tolerance of MLC PCM-based Deep Neural Networks. DynaPAT is built on an insight into the DNN's varying vulnerability to different data-pattern switches. Based on this insight, DynaPAT efficiently maps the most frequent data pattern in the DNN's parameters to the least drift-prone level of the MLC PCM, thus significantly enhancing the robustness of the system against drift errors. Various experiments on different DNN models and configurations demonstrate the effectiveness of DynaPAT. The experimental results indicate that DynaPAT can achieve up to a 500× enhancement in drift-error-tolerance capability over the baseline MLC PCM-based DNN while requiring only negligible hardware overhead (below 1% storage overhead). Being orthogonal, DynaPAT can be integrated with existing drift-tolerance schemes for even higher gains in reliability.
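The mapping idea can be illustrated with a toy sketch: count how often each 2-bit symbol occurs in the quantized parameters and assign the most frequent symbol to the MLC level assumed least prone to drift. The symbol distribution and drift ordering below are invented for illustration, not the paper's measured device data.

```python
from collections import Counter
import numpy as np

# Toy quantized weights: a stream of 2-bit symbols (one MLC PCM cell each).
rng = np.random.default_rng(7)
symbols = rng.choice([0b00, 0b01, 0b10, 0b11], size=10_000,
                     p=[0.55, 0.25, 0.15, 0.05])        # skewed, as in real DNNs

# Assumed resistance levels ordered from least to most drift-prone (illustrative).
levels_by_robustness = [0, 3, 1, 2]

# Map the most frequent symbol to the most drift-robust level, and so on.
freq_order = [sym for sym, _ in Counter(symbols).most_common()]
encode = {sym: lvl for sym, lvl in zip(freq_order, levels_by_robustness)}
decode = {lvl: sym for sym, lvl in encode.items()}

stored = np.vectorize(encode.get)(symbols)               # what is written to the cells
recovered = np.vectorize(decode.get)(stored)
assert np.array_equal(recovered, symbols)                # the encoding is lossless
print(encode)
```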
- Vedika Servanan
- Samah Mohamed Saeed
Dynamical Decoupling (DD)-based protocols have been shown to reduce the idling errors encountered in quantum circuits. However, current research in suppressing idling qubit errors suffers from scalability issues due to the large number of tuning quantum circuits that must be executed first to find the locations of the DD sequences in the target quantum circuit that boost the output state fidelity. This process becomes tedious as the size of the quantum circuit increases. To address this challenge, we propose a Graph Neural Network (GNN) framework, which mitigates idling errors through an efficient insertion of DD sequences into quantum circuits by modeling their impact at different idle qubit windows. Our paper targets maximizing the benefit of DD sequences using a limited number of tuning circuits. We propose to classify the idle qubit windows into critical and non-critical (benign) windows using a data-driven reliability model. Our results obtained from the IBM Lagos quantum computer show that our proposed GNN models, which determine the locations of DD sequences in the quantum circuits, significantly improve the output state fidelity by a factor of 1.4x on average and up to 2.6x compared to the adaptive DD approach, which searches for the best locations of DD sequences at run-time.
- Zhirui Hu
- Peiyan Dong
- Zhepeng Wang
- Youzuo Lin
- Yanzhi Wang
- Weiwen Jiang
Model compression, such as pruning and quantization, has been widely applied to optimize neural networks on resource-limited classical devices. Recently, there has been growing interest in variational quantum circuits (VQCs), a type of neural network on quantum computers (a.k.a. quantum neural networks). It is well known that near-term quantum devices have high noise and limited resources (i.e., quantum bits, or qubits); yet, how to compress quantum neural networks has not been thoroughly studied. One might think it is straightforward to apply classical compression techniques to quantum scenarios. However, this paper reveals that there are differences between the compression of quantum and classical neural networks. Based on our observations, we claim that compilation/transpilation has to be involved in the compression process. On top of this, we propose the first systematic framework, namely CompVQC, to compress quantum neural networks (QNNs). In CompVQC, the key component is a novel compression algorithm based on the alternating direction method of multipliers (ADMM) approach. Experiments demonstrate the advantage of CompVQC, reducing circuit depth (by almost 2.5× or more) with a negligible accuracy drop (<1%), which outperforms other competitors. Another promising result is that CompVQC can indeed promote the robustness of QNNs on near-term noisy quantum devices.
SESSION: Design for Low Energy, Low Resource, but High Quality
- Ravikumar Chakaravarthy
- Cong “Callie” Hao
- Azat Azamat
- Jaewoo Park
- Jongeun Lee
The cost and power consumption of BNN (Binarized Neural Network) hardware are dominated by additions. In particular, accumulators account for a large fraction of the hardware overhead, which could be effectively reduced by using reduced-width accumulators. However, it is not straightforward to find the optimal accumulator width due to the complex interplay between width, scale, and the effect of training. In this paper we present algorithmic and hardware-level methods to find the optimal accumulator size for BNN hardware with minimal impact on the quality of results. First, we present partial sum scaling, a top-down approach to minimize the BNN accumulator size based on advanced quantization techniques. We also present an efficient, zero-overhead hardware design for partial sum scaling. Second, we evaluate a bottom-up approach that uses a saturating accumulator, which is more robust against overflows. Our experimental results using the CIFAR-10 dataset demonstrate that our partial sum scaling, along with our optimized accumulator architecture, can reduce the area and power consumption of the datapath by 15.50% and 27.03%, respectively, with little impact on inference performance (less than 2%), compared to using a 16-bit accumulator.
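To make the saturating-accumulator alternative concrete, the sketch below models a reduced-width accumulator that clips at its representable range instead of wrapping; the 8-bit width and random ±1 partial sums are illustrative assumptions, not the configurations evaluated in the paper.

```python
import numpy as np

def accumulate(partial_sums, width, saturate=True):
    """Accumulate +/-1 partial sums in a signed `width`-bit register.

    With saturate=False the register wraps around (ordinary two's complement),
    which is what makes narrow accumulators risky; with saturate=True it clips,
    which is far more benign for BNN accuracy.
    """
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    acc = 0
    for p in partial_sums:
        acc += int(p)
        if saturate:
            acc = max(lo, min(hi, acc))
        else:
            acc = (acc - lo) % (hi - lo + 1) + lo   # wrap into [lo, hi]
    return acc

rng = np.random.default_rng(3)
xnor_popcount_terms = rng.choice([-1, 1], size=1024)   # binarized activation*weight terms
print("full precision  :", int(xnor_popcount_terms.sum()))
print("8-bit saturating:", accumulate(xnor_popcount_terms, 8, saturate=True))
print("8-bit wrapping  :", accumulate(xnor_popcount_terms, 8, saturate=False))
```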
- Yunxiang Zhang
- Biao Sun
- Weixiong Jiang
- Yajun Ha
- Miao Hu
- Wenfeng Zhao
Convolutional neural networks (CNNs) have been widely adopted for various machine intelligence tasks. Nevertheless, CNNs are still known to be computationally demanding due to convolutional kernels involving expensive Multiply-ACcumulate (MAC) operations. Recent proposals on hardware-optimal neural network architectures suggest that AdderNet, with a lightweight ℓ1-norm based feature extraction kernel, can be an efficient alternative to its CNN counterpart, where the expensive MAC operations are substituted with efficient Sum-of-Absolute-Difference (SAD) operations. Nevertheless, AdderNet lacks an efficient hardware implementation methodology comparable to the existing methodologies for CNNs, including efficient quantization, full-integer accelerator implementation, and judicious utilization of the DSP slices of FPGA devices. In this paper, we present WSQ-AdderNet, a generic framework to quantize and optimize AdderNet-based accelerator designs on embedded FPGA devices. First, we propose a weight standardization technique to facilitate weight quantization in AdderNet. Second, we demonstrate a full-integer quantization hardware implementation strategy, including weight and activation quantization methodologies. Third, we apply DSP packing optimization to maximize DSP utilization efficiency, where Octo-INT8 can be achieved via DSP-LUT co-packing. Finally, we implement the design using Xilinx Vitis HLS (high-level synthesis) and Vivado, targeting the Xilinx Kria KV-260 FPGA. Our experimental results for ResNet-20 using WSQ-AdderNet demonstrate that the implementation achieves 89.9% inference accuracy with INT8, which shows little performance loss compared to the FP32 and INT8 CNN designs. At the hardware level, WSQ-AdderNet achieves up to 3.39× DSP density improvement with nearly the same throughput as the INT8 CNN design. The reduction in DSP utilization makes it possible to deploy large network models on resource-constrained devices. When further scaling up the PE sizes by 39.8%, WSQ-AdderNet achieves 1.48× throughput improvement while still achieving 2.42× DSP density improvement.
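For readers unfamiliar with AdderNet, the sketch below contrasts the ℓ1-norm (sum-of-absolute-differences) response with the conventional multiply-accumulate response on a single patch; it is a functional illustration only, not the quantized FPGA datapath described above.

```python
import numpy as np

def mac_response(patch, kernel):
    """Conventional convolution response: multiply-accumulate."""
    return float(np.sum(patch * kernel))

def sad_response(patch, kernel):
    """AdderNet-style response: negative sum of absolute differences (l1 norm)."""
    return float(-np.sum(np.abs(patch - kernel)))

rng = np.random.default_rng(0)
patch = rng.normal(size=(3, 3, 16))    # one 3x3x16 input patch
kernel = rng.normal(size=(3, 3, 16))   # one filter of the same shape

print("MAC response:", mac_response(patch, kernel))
print("SAD response:", sad_response(patch, kernel))   # no multiplications needed
```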
- Kyeongho Lee
- Joonhyung Kim
- Jongsun Park
Although compute-in-memory (CIM) is considered one of the promising solutions to overcome the memory wall problem, variations in analog voltage computation and analog-to-digital converter (ADC) cost remain design challenges. In this paper, we present a 7T SRAM CIM that seamlessly supports multiply-accumulate (MAC) operations between 4-bit inputs and 8-bit weights. In the proposed CIM, highly parallel and robust MAC operations are enabled by exploiting a bit-line charge-sharing scheme to simultaneously process multiple inputs. For the readout of analog MAC values, instead of adopting the conventional ADC structure, bit-line charge-sharing is efficiently used to reduce the implementation cost of reference voltage generation. Based on the in-SRAM reference voltage generation and the parallel analog readout in all columns, the proposed CIM efficiently reduces ADC power and area cost. In addition, variation models from Monte-Carlo simulations are used during training to reduce the accuracy drop due to process variations. The implementation of a 256×64 7T SRAM CIM in a 28nm CMOS process shows that it operates over a wide voltage range from 0.6V to 1.2V with an energy efficiency of 45.8 TOPS/W at 0.6V.
SESSION: Microarchitectural Attacks and Countermeasures
- Hasini Witharana
- Prabhat Mishra
Modern processors deliver high performance by utilizing advanced features such as out-of-order execution, branch prediction, speculative execution, and sophisticated buffer management. Unfortunately, these techniques have introduced diverse vulnerabilities including Spectre, Meltdown, and microarchitectural data sampling (MDS). Although Spectre and Meltdown can leak data via memory side channels, MDS has been shown to leak data from the CPU internal buffers in Intel architectures. AMD has reported that its processors are not vulnerable to MDS/Meltdown-type attacks. In this paper, we present a Meltdown/MDS-type attack that leaks data from the load queue in AMD Zen family architectures. To the best of our knowledge, our approach is the first attempt to develop an attack on AMD architectures using speculative load forwarding to leak data through the load queue. Experimental evaluation demonstrates that our proposed attack is successful on multiple machines with AMD processors. We also explore a lightweight mitigation to defend against speculative load forwarding attacks on modern processors.
- Arash Pashrashid
- Ali Hajiabadi
- Trevor E. Carlson
Modern processors achieve high performance and efficiency by employing techniques such as speculative execution and sharing resources such as caches. However, recent attacks like Spectre and Meltdown exploit the speculative execution of modern processors to leak sensitive information from the system. Many mitigation strategies have been proposed to restrict the speculative execution of processors and protect potential side-channels. Currently, these techniques have shown a significant performance overhead. A solution that can detect memory leaks before the attacker has a chance to exploit them would allow the processor to reduce the performance overhead by enabling protections only when the system is at risk.
In this paper, we propose a mechanism to detect speculative execution attacks that use caches as a side-channel. In this detector we track the phases of a successful attack and raise an alert before the attacker gets a chance to recover sensitive information. We accomplish this through monitoring the microarchitectural changes in the core and caches, and detect the memory locations that can be potential memory data leaks. We achieve 100% accuracy and negligible false positive rate in detecting Spectre attacks and evasive versions of Spectre that the state-of-the-art detectors are unable to detect. Our detector has no performance overhead with negligible power and area overheads.
- Ivan De Oliveira Nunes
- Sashidhar Jakkamsetti
- Youngil Kim
- Gene Tsudik
Guaranteeing runtime integrity of embedded system software is an open problem. Trade-offs between security and other priorities (e.g., cost or performance) are inherent, and resolving them is both challenging and important. The proliferation of runtime attacks that introduce malicious code (e.g., by injection) into embedded devices has prompted a range of mitigation techniques. One popular approach is Remote Attestation (RA), whereby a trusted entity (the verifier) checks the current software state of an untrusted remote device (the prover). RA yields a timely authenticated snapshot of the prover's state, which the verifier uses to decide whether an attack occurred.
Current RA schemes require the verifier to explicitly initiate RA, based on some unclear criteria. Thus, in case of the prover's compromise, the verifier learns about it only upon the next RA instance. While sufficient for compromise detection, some applications would benefit from a more proactive, prevention-based approach. To this end, we construct CASU: Compromise Avoidance via Secure Updates. CASU is an inexpensive hardware/software co-design enforcing: (i) runtime software immutability, thus precluding any illegal software modification, and (ii) authenticated updates as the sole means of modifying software. In CASU, a successful RA instance serves as a proof of a successful update, and continuous subsequent software integrity is implicit, due to the runtime immutability guarantee. This obviates the need for RA in between software updates and leads to unobtrusive integrity assurance with guarantees akin to those of prior RA techniques, with better overall performance.
SESSION: Genetic Circuits Meet Ising Machines
- Tobias Schwarz
- Christian Hochberger
Synthetic Biology aims to create from scratch biological systems that do not exist in nature. An important method in this context is the engineering of DNA sequences such that cells realize Boolean functions that serve as control mechanisms in biological systems, e.g., in medical or agricultural applications. Libraries of logic gates exist as predefined gene sequences, based on the genetic mechanism of transcriptional regulation. Each individual gate is composed of different biological parts to allow for the differentiation of their output signals. Even gates of the same logic type therefore exhibit different transfer characteristics, i.e., relations between input and output signals. Thus, simulation of the whole network of genetic gates is needed to determine the performance of a genetic circuit. This makes mapping Boolean functions to these libraries much more complicated than in conventional EDA. Yet, optimal results are desired in the design phase due to high lab implementation costs. In this work, we identify fundamental features of the transfer characteristics of gates based on transcriptional regulation, which is widely used in genetic gate technologies. Based on this, we present novel exact (Branch-and-Bound) and heuristic (Branch-and-Bound, Simulated Annealing) algorithms for the problem of technology mapping of genetic circuits and evaluate them using a prominent gate library. In contrast to state-of-the-art tools, all obtained solutions feature (near) optimal output performance. Our exact method explores only 6.5% of the design space, and the heuristics as little as 0.2%.
- Naomi Sagan
- Jaijeet Roychowdhury
Ising machines have generated much excitement in recent years due to their promise for solving hard combinatorial optimization problems. However, achieving physical all-to-all connectivity in IC implementations of large, densely-connected Ising machines remains a key challenge. We present a novel approach, DaS, that uses low-rank decomposition to achieve effectively-dense Ising connectivity using only sparsely interconnected hardware. The innovation consists of two components. First, we use the SVD to find a low-rank approximation of the Ising coupling matrix while maintaining very high accuracy. This decomposition requires substantially fewer nonzeros to represent the dense Ising coupling matrix. Second, we develop a method to translate the low-rank decomposition to a hardware implementation that uses only sparse resistive interconnections. We validate DaS on the MU-MIMO detection problem, important in modern telecommunications. Our results indicate that as problem sizes scale, DaS can achieve dense Ising coupling using only 5%-20% of the resistors needed for brute-force dense connections (which would be physically infeasible in ICs). We also outline a crossbar-style physical layout scheme for realizing sparse resistive networks generated by DaS.
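The first ingredient of DaS, a low-rank approximation of the dense coupling matrix via the SVD, can be reproduced in a few lines of NumPy; the matrix size, rank, and synthetic low-rank structure below are illustrative assumptions, and the second step (mapping the factors onto sparse resistive hardware) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank = 64, 8
# A synthetic dense, symmetric coupling matrix with low-rank structure.
B = rng.normal(size=(n, rank))
J = B @ B.T

# Truncated SVD: keep only the top-k singular triplets.
U, s, Vt = np.linalg.svd(J)
k = 8
J_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

dense_nonzeros = n * n
lowrank_nonzeros = 2 * n * k + k          # U_k, V_k and the k singular values
rel_error = np.linalg.norm(J - J_approx) / np.linalg.norm(J)
print(f"relative error {rel_error:.3e}, "
      f"nonzeros {lowrank_nonzeros} vs {dense_nonzeros}")
```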
- Yiqiao Zhang
- Uday Kumar Reddy Vengalam
- Anshujit Sharma
- Michael Huang
- Zeljko Ignjatovic
Physical Ising machines have been shown to solve combinatoric optimization problems with orders-of-magnitude improvements in speed and energy efficiency over von Neumann systems. However, building such a system is still in its infancy and a scalable, robust implementation remains challenging. CMOS-compatible electronic Ising machines (e.g., [1]) are promising as the mature technology helps bring scale, speed, and energy efficiency to the dynamical system. However, subtle issues can arise when using voltage-controlled transistors to act as programmable resistive coupling. In this paper, we propose a version of resistively-coupled Ising machine using quantized nodal interactions (QuBRIM), which significantly improves the predictability of the coupling resistor. The functionality of QuBRIM is demonstrated by solving the well-known Max-Cut problem using both behavioral and circuit-level simulations in a 45 nm CMOS technology node. We show that the dynamical system naturally seeks local minima in the objective function's energy landscape and that by applying spin-fix annealing, the system reaches a global minimum with a high probability.
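The Max-Cut benchmark used above maps onto the Ising energy in a standard way: with one spin per vertex and couplings equal to the edge weights, minimizing E = Σ_{i<j} J_ij s_i s_j maximizes the cut weight. The brute-force sketch below checks this correspondence on a toy graph and is unrelated to the CMOS implementation.

```python
import itertools
import numpy as np

# A small weighted graph as a symmetric coupling matrix (illustrative).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

def ising_energy(spins, J):
    return 0.5 * spins @ J @ spins          # counts each coupled pair once

def cut_weight(spins, W):
    s = np.asarray(spins)
    return 0.25 * np.sum(W * (1 - np.outer(s, s)))

# Exhaustive search over spin configurations (feasible only for tiny graphs).
best = min(itertools.product([-1, 1], repeat=len(W)),
           key=lambda s: ising_energy(np.array(s), W))
print("ground state:", best, "cut weight:", cut_weight(best, W))
```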
SESSION: Energy Efficient Neural Networks via Approximate Computations
- M. Hasan Najafi
- Vidya Chabria
- Elias Trommer
- Bernd Waschneck
- Akash Kumar
This work explores the search for heterogeneous approximate multiplier configurations for neural networks that produce high accuracy and low energy consumption. We discuss the validity of additive Gaussian noise added to accurate neural network computations as a surrogate model for behavioral simulation of approximate multipliers. The continuous and differentiable properties of the solution space spanned by the additive Gaussian noise model are used as a heuristic that generates meaningful estimates of layer robustness without the need for combinatorial optimization techniques. Instead, the amount of noise injected into the accurate computations is learned during network training using backpropagation. A probabilistic model of the multiplier error is presented to bridge the gap between the domains; the model estimates the standard deviation of the approximate multiplier error, connecting solutions in the additive Gaussian noise space to actual hardware instances. Our experiments show that the combination of heterogeneous approximation and neural network retraining reduces the energy consumption for multiplications by 70% to 79% for different ResNet variants on the CIFAR-10 dataset with a Top-1 accuracy loss below one percentage point. For the more complex Tiny ImageNet task, our VGG16 model achieves a 53 % reduction in energy consumption with a drop in Top-5 accuracy of 0.5 percentage points. We further demonstrate that our error model can predict the parameters of an approximate multiplier in the context of the commonly used additive Gaussian noise (AGN) model with high accuracy. Our software implementation is available under https://github.com/etrommer/agn-approx.
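A minimal sketch of the additive-Gaussian-noise idea, under the simplifying assumption that the approximate multiplier's error is zero-mean with a standard deviation proportional to the product magnitude (the paper learns the injected noise magnitude during training; the fixed 2% figure below is an arbitrary placeholder):

```python
import numpy as np

def approximate_matmul_agn(x, w, rel_err_std, rng):
    """Surrogate for a layer that uses approximate multipliers.

    Each elementwise product x_i * w_j is modeled as being disturbed by
    zero-mean Gaussian noise with standard deviation rel_err_std * |x_i * w_j|.
    In practice the noise magnitude would come from the multiplier error model.
    """
    exact = x[:, :, None] * w[None, :, :]               # (batch, in, out) products
    noisy = exact + rng.normal(scale=rel_err_std * np.abs(exact))
    return noisy.sum(axis=1)                            # accumulate as usual

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 128))          # batch of activations
w = rng.normal(size=(128, 32))         # layer weights
y_exact = x @ w
y_approx = approximate_matmul_agn(x, w, rel_err_std=0.02, rng=rng)
print("mean absolute deviation:", float(np.mean(np.abs(y_exact - y_approx))))
```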
- Ayushi Dube
- Ankit Wagle
- Gian Singh
- Sarma Vrudhula
This paper presents a novel hardware-software co-design consisting of a Processing-in-Memory (PiM) architecture with embedded neural processing elements (NPEs) that are highly reconfigurable. The PiM platform and the proposed approximation strategies are employed for various image filtering applications while providing the user with fine-grain dynamic control over energy efficiency, precision, and throughput (EPT). The proposed co-design can vary the Peak Signal to Noise Ratio (PSNR, the output quality metric for image filtering applications) across the acceptable range of 25 dB to 50 dB without incurring any extra cost in terms of energy or latency. When switching from the accurate to the approximate mode of computation in the proposed co-design, the maximum improvement in energy efficiency and throughput is 2X. The gains in energy efficiency against a MAC-based PE array with the proposed memory platform are 3X-6X, with corresponding throughput improvements of 2.26X-4.52X.
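For reference, the PSNR figures quoted above follow the standard definition PSNR = 10·log10(peak² / MSE); a small helper, assuming 8-bit images:

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(img + rng.normal(scale=2.0, size=img.shape), 0, 255).astype(np.uint8)
print(f"{psnr(img, noisy):.1f} dB")   # small noise -> PSNR well above 40 dB
```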
- Tim Bücher
- Lilas Alrahis
- Guilherme Paim
- Sergio Bampi
- Ozgur Sinanoglu
- Hussam Amrouch
The globalization of the Integrated Circuit (IC) market is attracting an ever-growing number of partners, while remarkably lengthening the supply chain. Thereby, security concerns, such as those imposed by functional Reverse Engineering (RE), have become paramount. RE leads to the disclosure of confidential information to competitors, potentially enabling the theft of intellectual property. Traditional functional RE methods analyze a given gate-level netlist by employing pattern matching to reconstruct the underlying basic blocks, and hence reverse engineer the circuit's function.
In this work, we are the first to demonstrate that applying Approximate Computing (AxC) principles to circuits significantly improves the resiliency against RE. This is attributed to the increased complexity of the underlying pattern-matching process. The resiliency remains effective even against Graph Neural Networks (GNNs), which are presently among the most powerful state-of-the-art techniques in functional RE. Using AxC, we demonstrate a substantial reduction in average GNN classification accuracy, from 98% to a mere 53%. To surmount the challenges introduced by AxC in RE, we propose the highly promising AppGNN platform, which enables GNNs (still trained on exact circuits) to: (i) perform accurate classifications, and (ii) reverse engineer the circuit functionality, notwithstanding the applied approximation technique. AppGNN accomplishes this by implementing a novel graph-based node sampling approach that mimics generic approximation methodologies, requiring zero knowledge of the targeted approximation type.
We perform an extensive evaluation targeting wide-ranging adder and multiplier circuits that are approximated using various AxC techniques, including state-of-the-art evolutionary-based approaches. We show that, using our method, we can improve the classification accuracy from 53% to 81% when classifying approximate adder circuits that have been generated using evolutionary algorithms, of which our method is oblivious. Our AppGNN framework is publicly available at https://github.com/ML-CAD/AppGNN.
- Aradhana Mohan Parvathy
- Sarada Krithivasan
- Sanchari Sen
- Anand Raghunathan
Compression techniques such as quantization and pruning are indispensable for deploying state-of-the-art Deep Neural Networks (DNNs) on resource-constrained edge devices. Quantization is widely used in practice – many commercial platforms already support 8-bits, with recent trends towards ultra-low precision (4-bits and below). Pruning, which increases network sparsity (incidence of zero-valued weights), enables compression by storing only the nonzero weights and their indices. Unfortunately, the compression benefits of pruning deteriorate or even vanish in ultra-low precision DNNs. This is due to (i) the unfavorable tradeoff between the number of bits needed to store a weight (which reduces with lower precision) and the number of bits needed to encode an index (which remains unchanged), and (ii) the lower sparsity levels that are achievable at lower precisions.
We propose Seprox, a new compression scheme that overcomes the aforementioned challenges by exploiting two key observations about ultra-low precision DNNs. First, with lower precision, fewer weight values are possible, leading to increased incidence of frequently-occurring weights and weight sequences. Second, some weight values occur rarely and can be eliminated by replacing them with similar values. Leveraging these insights, Seprox encodes frequently-occurring weight sequences (as opposed to individual weights) while using the eliminated weight values to encode them, thereby avoiding indexing overheads and achieving higher compression. Additionally, Seprox uses approximation techniques to increase the frequencies of the encoded sequences. Across six ultra-low precision DNNs trained on the Cifar10 and ImageNet datasets, Seprox achieves model compressions, energy improvements and speed-ups of up to 35.2%, 14.8% and 18.2% respectively.
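The indexing-overhead argument above can be made concrete with a back-of-the-envelope estimate: storing only nonzero weights costs value bits plus index bits per weight, so at ultra-low precision the index term dominates. The sparsity levels and 16-bit index width below are illustrative assumptions, not figures from the paper.

```python
def sparse_storage_bits(num_weights, sparsity, value_bits, index_bits=16):
    """Bits to store only the nonzero weights as (value, index) pairs."""
    nonzeros = int(num_weights * (1.0 - sparsity))
    return nonzeros * (value_bits + index_bits)

def dense_storage_bits(num_weights, value_bits):
    """Bits to store every weight densely, zeros included."""
    return num_weights * value_bits

n = 1_000_000
for value_bits, sparsity in [(8, 0.80), (4, 0.60), (2, 0.50)]:
    sparse = sparse_storage_bits(n, sparsity, value_bits)
    dense = dense_storage_bits(n, value_bits)
    # Ratios above 1.0 mean pruning-based compression no longer pays off.
    print(f"{value_bits}-bit, {sparsity:.0%} sparse: sparse/dense = {sparse / dense:.2f}")
```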
SESSION: Algorithms and Tools for Security Analysis and Secure Hardware Design
- Rosario Cammarota
- Satwik Patnaik
- Amin Rezaei
- Raheel Afsharmazayejani
- Jordan Maynard
Hardware IP owners must envision procedures to avoid piracy and overproduction of their designs under a fabless paradigm. A newly proposed technique to obfuscate critical components in a logic design is eFPGA-based redaction, which replaces a sensitive sub-circuit with an embedded FPGA, and the eFPGA is configured to perform the same functionality as the missing sub-circuit. In this case, the configuration bitstream acts as a hidden key known only to the hardware IP owner. In this paper, we first evaluate the security promise of the existing eFPGA-based redaction algorithms as a preliminary study. Then, we break eFPGA-based redaction schemes with an initial, but not necessarily efficient, attack named DIP Exclusion that excludes problematic input patterns from checking in a brute-force manner. Finally, by combining cycle breaking and unrolling, we propose a novel and powerful attack called Break & Unroll that is able to recover the bitstream of state-of-the-art eFPGA-based redaction schemes in a relatively short time, even in the presence of hard cycles and large keys. This study reveals that the common perception that eFPGA-based redaction is secure by default against oracle-guided attacks is a misconception. It also shows that additional research on how to systematically create an exponential number of non-combinational hard cycles is required to secure eFPGA-based redaction schemes.
- Pei-Pei Chen
- Xiang-Min Yang
- Yi-Ting Li
- Yung-Chih Chen
- Chun-Yao Wang
Cyclic logic locking is a new type of SAT-resistant technique in hardware security. Recently, LOOPLock 2.0 was proposed, a cyclic logic locking method that deliberately creates cycles in the locked circuit to resist SAT Attack, CycSAT, BeSAT, and Removal Attack simultaneously. The key idea of LOOPLock 2.0 is that the resultant circuit is still cyclic regardless of whether the key vector is correct or not. This property repels attackers and has demonstrated its success in defending against them. In this paper, we propose an unlocking approach to LOOPLock 2.0 based on structure analysis and SAT solvers. Specifically, we identify and remove non-combinational cycles in the locked circuit before running SAT solvers. The experimental results show that the proposed unlocking approach is promising.
- Mohammad Hashemi
- Steffi Roy
- Fatemeh Ganji
- Domenic Forte
The complexity of modern integrated circuits (ICs) necessitates collaboration between multiple distrusting parties, including third-party intellectual property (3PIP) vendors, design houses, CAD/EDA tool vendors, and foundries, which jeopardizes confidentiality and integrity of each party’s IP. IP protection standards and the existing techniques proposed by researchers are ad hoc and vulnerable to numerous structural, functional, and/or side-channel attacks. Our framework, Garbled EDA, proposes an alternative direction through formulating the problem in a secure multi-party computation setting, where the privacy of IPs, CAD tools, and process design kits (PDKs) is maintained. As a proof-of-concept, Garbled EDA is evaluated in the context of simulation, where multiple IP description formats (Verilog, C, S) are supported. Our results demonstrate a reasonable logical-resource cost and negligible memory overhead. To further reduce the overhead, we present another efficient implementation methodology, feasible when the resource utilization is a bottleneck, but the communication between two parties is not restricted. Interestingly, this implementation is private and secure even in the presence of malicious adversaries attempting to, e.g., gain access to PDKs or in-house IPs of the CAD tool providers.
- Baleegh Ahmad
- Wei-Kai Liu
- Luca Collini
- Hammond Pearce
- Jason M. Fung
- Jonathan Valamehr
- Mohammad Bidmeshki
- Piotr Sapiecha
- Steve Brown
- Krishnendu Chakrabarty
- Ramesh Karri
- Benjamin Tan
To help prevent hardware security vulnerabilities from propagating to later design stages where fixes are costly, it is crucial to identify security concerns as early as possible, such as in RTL designs. In this work, we investigate the practical implications and feasibility of producing a set of security-specific scanners that operate on Verilog source files. The scanners indicate parts of code that might contain one of a set of MITRE’s common weakness enumerations (CWEs). We explore the CWE database to characterize the scope and attributes of the CWEs and identify those that are amenable to static analysis. We prototype scanners and evaluate them on 11 open source designs – 4 system-on-chips (SoC) and 7 processor cores – and explore the nature of identified weaknesses. Our analysis reported 53 potential weaknesses in the OpenPiton SoC used in Hack@DAC-21, 11 of which we confirmed as security concerns.
SESSION: Special Session: Making ML Reliable: From Devices to Systems to Software
- Krishnendu Chakrabarty
- Partha Pande
- Meng-Fan Chang
- Je-Ming Hung
- Ping-Cheng Chen
- Tai-Hao Wen
Compute-in-memory macros based on non-volatile memory (nvCIM) are a promising approach to breaking through the memory bottleneck for artificial intelligence (AI) edge devices; however, the development of these devices involves unavoidable tradeoffs between reliability, energy efficiency, computing latency, and readout accuracy. This paper outlines the background of ReRAM-based nvCIM as well as the major challenges in its further development, including process variation in ReRAM devices and transistors and the small signal margins associated with variation in input-weight patterns. This paper also investigates the error model of an nvCIM macro and the corresponding degradation of inference accuracy as a function of that error model when using nvCIM macros. Finally, we summarize recent trends and advances in the development of reliable ReRAM-based nvCIM macros.
- Biresh Kumar Joardar
- Aqeeb Iqbal Arka
- Janardhan Rao Doppa
- Partha Pratim Pande
Resistive random-access memory has become one of the most popular choices of hardware implementation for machine learning workloads. However, these devices exhibit non-ideal behavior, which presents a challenge to widespread adoption. Training/inferencing on these faulty devices can lead to poor prediction accuracy. However, existing fault-tolerance methods are associated with high implementation overheads. In this paper, we present some new directions for solving reliability issues using software solutions. These software-based methods are inherent in deep learning training/inferencing, and they can be used to address hardware reliability issues as well. These methods prevent accuracy drop during training/inferencing due to unreliable ReRAMs and are associated with lower area and power overheads.
- Arjun Chaudhuri
- Jonti Talukdar
- Krishnendu Chakrabarty
The ubiquitous application of deep neural networks (DNNs) has led to a rise in demand for AI accelerators. DNN-specific functional criticality analysis identifies faults that cause measurable and significant deviations from acceptable requirements such as inferencing accuracy. This paper examines the problem of classifying structural faults in the processing elements (PEs) of systolic-array accelerators. We first present a two-tier machine-learning (ML) based method to assess the functional criticality of faults. While supervised learning techniques can be used to accurately estimate fault criticality, they require a considerable amount of ground truth for model training. We therefore describe a neural-twin framework for analyzing fault criticality with a negligible amount of ground-truth data. We further describe a topological and probabilistic framework to estimate the expected number of a PE's primary outputs (POs) flipping in the presence of defects and use the PO-flip count as a surrogate for determining fault criticality. We demonstrate that the combination of PO-flip count and neural-twin-enabled sensitivity analysis of internal nets can be used as additional features in existing ML-based criticality classifiers.
- Bonita Bhaskaran
- Sanmitra Banerjee
- Kaushik Narayanun
- Shao-Chun Hung
- Seyed Nima Mozaffari Mojaveri
- Mengyun Liu
- Gang Chen
- Tung-Che Liang
Silent Data Corruption (SDC) is one of the critical problems in the field of testing, where errors or corruption do not manifest externally. As a result, there is increased focus on improving the outgoing quality of dies by striving for better correlation between structural and functional patterns to achieve a low DPPM. This is very important for NVIDIA’s chips due to the various markets we target; for example, automotive and data center markets have stringent in-field testing requirements. One aspect of these efforts is to also target better testability while incurring lower test cost. Since structural testing is faster than functional tests, it is important to make these structural test patterns as effective as possible and free of test escapes. However, with the rising cell count in today’s digital circuits, it is becoming increasingly difficult to sensitize faults and propagate the fault effects to scan-flops or primary outputs. Hence, methods to insert observation points to facilitate the detection of hard-to-detect (HtD) faults are being increasingly explored. In this work, we propose an Observation Point Insertion (OPI) scheme using deep learning with the motivation of achieving – 1) better quality test points than commercial EDA tools leading to a potential lower pattern count 2) faster turnaround time to generate the test points. In order to achieve better pattern compaction than commercial EDA tools, we employ Graph Convolutional Networks (GCNs) to learn the topology of logic circuits along with the features that influence its testability. The graph structures are subsequently used to train two GCN-type deep learning models – the first model predicts signal probabilities at different nets and the second model uses these signal probabilities along with other features to predict the reduction in test-pattern count when OPs are inserted at different locations in the design. The features we consider include structural features like gate type, gate logic, reconvergent-fanouts and testability features like SCOAP. Our simulation results indicate that the proposed machine learning models can predict the probabilistic testability metrics with reasonable accuracy and can identify observation points that reduce pattern count.
SESSION: Autonomous Systems and Machine Learning on Embedded Systems
- Ibrahim (Abe) Elfadel
- Mimi Xie
- Luke Chen
- Mohanad Odema
- Mohammad Abdullah Al Faruque
Due to the high performance and safety requirements of self-driving applications, the complexity of modern autonomous driving systems (ADS) has been growing, instigating the need for more sophisticated hardware which could add to the energy footprint of the ADS platform. Addressing this, edge computing is poised to encompass self-driving applications, enabling the compute-intensive autonomy-related tasks to be offloaded for processing at compute-capable edge servers. Nonetheless, the intricate hardware architecture of ADS platforms, in addition to the stringent robustness demands, set forth complications for task offloading which are unique to autonomous driving. Hence, we present ROMANUS, a methodology for robust and efficient task offloading for modular ADS platforms with multi-sensor processing pipelines. Our methodology entails two phases: (i) the introduction of efficient offloading points along the execution path of the involved deep learning models, and (ii) the implementation of a runtime solution based on Deep Reinforcement Learning to adapt the operating mode according to variations in the perceived road scene complexity, network connectivity, and server load. Experiments on the object detection use case demonstrated that our approach is 14.99% more energy-efficient than pure local execution while achieving a 77.06% reduction in risky behavior from a robust-agnostic offloading baseline.
- Soham Sinha
- Anam Farrukh
- Richard West
This paper presents ModelMap, a model-based multi-domain application development framework for DriveOS, our in-house centralized vehicle management software system. DriveOS runs on multicore x86 machines and uses hardware virtualization to host isolated RTOS and Linux guest OS sandboxes. In this work, we design Simulink interfaces for model-based vehicle control function development across multiple sandboxed domains in DriveOS. ModelMap provides abstractions to: (1) automatically generate periodic tasks bound to threads in different OS domains, (2) establish cross-domain synchronous and asynchronous communication interfaces, and (3) handle USB-based CAN I/O in Simulink. We introduce the concept of a nested binary, for the deployment of ELF binary executable code in different sandboxed domains. We demonstrate ModelMap using a combination of synthetic benchmarks, and experiments with Simulink models of a CAN Gateway and HVAC service running on an electric car. ModelMap eases the development of applications, which are shown to achieve industry-target performance using a multicore hardware platform in DriveOS.
- Anish Krishnakumar
- Radu Marculescu
- Umit Ogras
The performance and energy efficiency potential of heterogeneous architectures has fueled domain-specific systems-on-chip (DSSoCs) that integrate general-purpose and domain-specialized hardware accelerators. Decision trees (DTs) perform high-quality, low-latency task scheduling to utilize the massive parallelism and heterogeneity in DSSoCs effectively. However, offline trained DT scheduling policies can quickly become ineffective when applications or hardware configurations change. There is a critical need for runtime techniques to train DTs incrementally without sacrificing accuracy since current training approaches have large memory and computational power requirements. To address this need, we propose INDENT, an incremental online DT framework to update the scheduling policy and adapt it to unseen scenarios. INDENT updates DT schedulers at runtime using only 1–8% of the original training data embedded during training. Thorough evaluations with hardware platforms and DSSoC simulators demonstrate that INDENT performs within 5% of a DT trained from scratch using the entire dataset and outperforms current state-of-the-art approaches.
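As a rough, hypothetical approximation of the incremental-update idea, the sketch below embeds a small fraction of the original training data with the deployed scheduler and refits on that subset plus newly observed samples; scikit-learn refitting stands in for the paper's actual runtime mechanism, and all data here is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Offline training data: task features -> chosen processing element (synthetic).
X_full = rng.normal(size=(10_000, 8))
y_full = (X_full[:, 0] + 0.5 * X_full[:, 3] > 0).astype(int)

# Embed only a small fraction (here 5%) of the data with the deployed scheduler.
keep = rng.choice(len(X_full), size=len(X_full) // 20, replace=False)
X_embedded, y_embedded = X_full[keep], y_full[keep]

scheduler = DecisionTreeClassifier(max_depth=6, random_state=0)
scheduler.fit(X_embedded, y_embedded)

# At runtime, an unseen application/hardware mix produces new labeled samples.
X_new = rng.normal(size=(500, 8)) + np.array([0, 0, 0, 2, 0, 0, 0, 0])
y_new = (X_new[:, 0] + 0.5 * X_new[:, 3] > 0).astype(int)

# Incremental update: refit on the embedded subset plus the new samples,
# instead of retraining on the full original dataset.
scheduler.fit(np.vstack([X_embedded, X_new]), np.concatenate([y_embedded, y_new]))
print("accuracy on new scenario:", scheduler.score(X_new, y_new))
```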
- Cheng-Yuan Wang
- Yao-Wen Chang
- Yuan-Hao Chang
Resistive Random Access Memory (ReRAM) crossbars are a promising processing-in-memory technology to reduce the enormous data movement overhead of large-scale graph processing between computation and memory units. ReRAM cells can be combined into crossbar arrays to effectively accelerate graph processing, and partitioning ReRAM crossbar arrays into Operation Units (OUs) can further improve the computation accuracy of ReRAM crossbars. Operation unit utilization was not optimized in previous work, incurring extra cost. This paper proposes a two-stage algorithm with a crossbar OU-aware scheme for sparse graph index remapping for ReRAM (SGIRR) crossbars, mitigating the influence of graph sparsity. In particular, this paper is the first to consider the given operation unit size in the remapping index algorithm, optimizing operation unit utilization and power dissipation. Experimental results show that our proposed algorithm reduces the utilization of crossbar OUs by 31.4%, improves the total OU block usage by 10.6%, and saves energy consumption by 17.2%, on average.