- Liang Feng
- Jieru Zhao
- Tingyuan Liang
- Sharad Sinha
- Wei Zhang
To satisfy increasing computing demands, heterogeneous computing platforms are gaining attention, especially CPU-FPGA platforms. Recently, emerging tightly coupled CPU-FPGA platforms with shared coherent caches (such as the Intel HARP and IBM POWER with CAPI) have been proposed to facilitate data communication and simplify the programming model. In this work, we propose LAMA, a static analysis and dynamic control combined framework for memory access management in such platforms, to further enhance the memory access efficiency and maintain the data consistency. Based on implementation results on the real Intel HARP2 platform, LAMA is shown to improve the performance by 34% on average with low overhead.
- Hsuan Hsiao
- Jason Anderson
In high-level synthesis (HLS), software multithreading constructs can be used to explicitly specify coarse-grained parallelism for multiple accelerators. While software threads typically operate independently and in isolation from each other on CPUs, HLS threads/accelerators are sub-components of one circuit. Since these components generally reside in the same clock domain, we can schedule their execution statically to avoid shared-resource contention among threads. We propose thread weaving, a technique that statically interleaves requests from different threads through scheduling constraints. With the guarantee of a contention-free schedule, we eliminate replication/arbitration of shared resources, reducing the area footprint of the circuit and improving its maximum operating frequency (Fmax).
- Junnan Shan
- Mario R. Casu
- Jordi Cortadella
- Luciano Lavagno
- Mihai T. Lazarescu
FPGA-based accelerators have demonstrated high energy efficiency compared to GPUs and CPUs. However, single-FPGA designs may not achieve sufficient task parallelism. In this work, we optimize the mapping of high-performance multi-kernel applications, like Convolutional Neural Networks, to multi-FPGA platforms. First, we formulate the system-level optimization problem, choosing within a huge design space the parallelism and number of compute units for each kernel in the pipeline. Then we solve it using a combination of Geometric Programming, which produces the optimal-performance solution under resource and DRAM bandwidth constraints, and a heuristic allocator that places the compute units on the FPGA cluster.
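The system-level formulation above maps naturally onto geometric programming. The snippet below is only a minimal sketch of that idea, not the paper's actual model: it assumes a linear pipeline whose throughput is limited by its slowest stage, relaxes compute-unit counts to real values, and uses made-up work/area numbers; the heuristic allocation onto the FPGA cluster is not modeled.

```python
# Minimal geometric-programming sketch (illustrative, not the paper's model):
# pick the number of compute units per kernel to minimize the latency of the
# slowest pipeline stage under a total area budget.
import cvxpy as cp

work = [8.0, 32.0, 16.0]    # assumed work per kernel (arbitrary units)
area = [2.0, 5.0, 3.0]      # assumed area per compute unit of each kernel
area_budget = 40.0

u = cp.Variable(3, pos=True)   # compute units per kernel (integrality relaxed)
t = cp.Variable(pos=True)      # latency of the slowest stage (epigraph variable)

constraints = [work[k] / (u[k] * t) <= 1 for k in range(3)]       # each stage fits in t
constraints.append(cp.sum(cp.multiply(area, u)) <= area_budget)   # total area budget

prob = cp.Problem(cp.Minimize(t), constraints)
prob.solve(gp=True)   # solve in geometric-programming mode
print("compute units per kernel:", u.value, "stage latency:", t.value)
```

In a real flow the relaxed unit counts would still have to be rounded and placed on concrete FPGAs, which is the role of the paper's heuristic allocator.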
- Timothy Martin
- Dani Maarouf
- Ziad Abuowaimer
- Abeer Alhyari
- Gary Grewal
- Shawki Areibi
In this paper, we propose a novel, flat analytic timing-driven placer without explicit packing for Xilinx UltraScale FPGA devices. Our work uses novel methods to simultaneously optimize for timing, wirelength and congestion throughout the global and detailed placement stages. We evaluate the effectiveness of the flat placer on the ISPD 2016 benchmark suite for the xcvu095 UltraScale device, as well as on industrial benchmarks. Experimental results show that on average, FTPlace achieves an 8% increase in maximum clock rate, an 18% decrease in routed wirelength, and produces placements that require 80% less time to route when compared to Xilinx Vivado 2018.1.
- Weiwen Jiang
- Xinyi Zhang
- Edwin H.-M. Sha
- Lei Yang
- Qingfeng Zhuge
- Yiyu Shi
- Jingtong Hu
A fundamental question lies in almost every application of deep neural networks: what is the optimal neural architecture given a specific data set? Recently, several Neural Architecture Search (NAS) frameworks have been developed that use reinforcement learning and evolutionary algorithms to search for the solution. However, most of them take a long time to find the optimal architecture due to the huge search space and the lengthy training process needed to evaluate each candidate. In addition, most of them aim at accuracy only and do not take into consideration the hardware that will be used to implement the architecture. This can potentially lead to excessive latencies beyond specifications, rendering the resulting architectures useless. To address both issues, in this paper we use Field Programmable Gate Arrays (FPGAs) as a vehicle to present a novel hardware-aware NAS framework, namely FNAS, which provides an optimal neural architecture with latency guaranteed to meet the specification. In addition, with a performance abstraction model to analyze the latency of neural architectures without training, our framework can quickly prune architectures that do not satisfy the specification, leading to higher efficiency. Experimental results on common data sets such as ImageNet show that in cases where the state-of-the-art generates architectures with latencies 7.81× longer than the specification, those from FNAS can meet the specs with less than 1% accuracy loss. Moreover, FNAS also achieves up to 11.13× speedup for the search process. To the best of the authors' knowledge, this is the very first hardware-aware NAS.
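To illustrate the pruning idea, here is a minimal sketch of latency-guided candidate filtering, assuming a simple additive per-layer latency lookup table; the layer names, latency numbers, and `estimate_latency` model are hypothetical and much cruder than FNAS's performance abstraction.

```python
# Sketch of latency-aware candidate pruning in NAS (hypothetical model, not
# FNAS itself): estimate each candidate's latency from a per-layer lookup
# table and discard candidates that violate the specification before any
# training time is spent on them. Table values are made up.
LAYER_LATENCY_MS = {"conv3x3_64": 1.8, "conv5x5_64": 4.1, "maxpool": 0.2, "fc_256": 0.9}

def estimate_latency(architecture):
    """Latency abstraction: sum of per-layer estimates (assumed additive)."""
    return sum(LAYER_LATENCY_MS[layer] for layer in architecture)

def prune_candidates(candidates, latency_spec_ms):
    """Keep only architectures whose estimated latency meets the spec."""
    return [arch for arch in candidates if estimate_latency(arch) <= latency_spec_ms]

candidates = [
    ["conv3x3_64", "maxpool", "conv3x3_64", "fc_256"],   # est. 4.7 ms
    ["conv5x5_64", "conv5x5_64", "maxpool", "fc_256"],   # est. 9.3 ms
]
print(prune_candidates(candidates, latency_spec_ms=6.0))  # only the first survives
```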
- Muhammad Abdullah Hanif
- Faiq Khalid
- Muhammad Shafique
Approximate Computing (AC) has emerged as a means for improving the performance, area and power-/energy-efficiency of a digital design at the cost of output quality degradation. Applications like machine learning (e.g., using DNNs-deep neural networks) are highly computationally intensive and, therefore, can significantly benefit from AC and specialized accelerators. However, the accuracy loss introduced because of approximations in the DNN accelerator hardware can result in undesirable results. This paper presents a novel method to design high-performance DNN accelerators where approximation error(s) from one stage/part of the design is “completely” compensated in the subsequent stage/part while offering significant efficiency gains. Towards this, the paper also presents a case-study for improving the performance of systolic array-based hardware architectures, which are commonly used for accelerating state-of-the-art deep learning algorithms.
- Sugil Lee
- Hyeonuk Sim
- Jooyeon Choi
- Jongeun Lee
Despite the multifaceted benefits of stochastic computing (SC) such as low cost, low power, and flexible precision, SC-based deep neural networks (DNNs) still suffer from the long-latency problem, especially for those with high precision requirements. While log quantization can be of help, it has its own accuracy-saturation problem due to uneven precision distribution. In this paper we propose successive log quantization (SLQ), which extends log quantization with significant improvements in precision and accuracy, and apply it to state-of-the-art SC-DNNs. SLQ reuses the existing datapath of log quantization, and thus retains its advantages such as simple multiplier hardware. Our experimental results demonstrate that SLQ can significantly improve both the accuracy and efficiency of SC-DNNs over state-of-the-art solutions, including linear-quantized and log-quantized SC-DNNs, achieving less than 1~1.5%p accuracy drop for AlexNet, SqueezeNet, and VGG-S at a mere 4~5-bit weight resolution.
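One plausible reading of the scheme (a sketch, not necessarily the paper's exact definition) is that each weight is represented by a short sum of signed powers of two, obtained by repeatedly log-quantizing the residual; the exponent range and number of terms below are illustrative.

```python
import numpy as np

def log_quantize(w, min_exp=-8, max_exp=0):
    """Plain log quantization: snap |w| to the nearest power of two."""
    if w == 0.0:
        return 0.0
    exp = int(np.clip(np.round(np.log2(abs(w))), min_exp, max_exp))
    return np.sign(w) * 2.0 ** exp

def successive_log_quantize(w, terms=2):
    """Sketch of successive log quantization: represent w as a sum of a few
    signed powers of two by repeatedly log-quantizing the residual."""
    approx, residual = 0.0, w
    for _ in range(terms):
        q = log_quantize(residual)
        approx += q
        residual -= q
    return approx

w = 0.3
print(log_quantize(w), successive_log_quantize(w))  # 0.25 vs 0.25 + 0.0625 = 0.3125
```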
- Daniel Peroni
- Mohsen Imani
- Hamid Nejatollahi
- Nikil Dutt
- Tajana Rosing
Many data-driven applications, including computer vision, speech recognition, and medical diagnostics, show tolerance to error during computation. These applications are often accelerated on GPUs, but high computational costs limit performance and increase energy usage. In this paper, we present ARGA, an approximate computing technique capable of accelerating GPGPU applications. ARGA provides an approximate lookup table to GPGPU cores to avoid recomputing instructions with identical or similar values. We propose multi-table parallel lookup, which enables computational reuse to significantly speed up GPGPU computation by checking incoming instructions in parallel. The inputs of each operation are searched for in a lookup table. Matches resulting in an exact or low error are removed from the floating-point pipeline and used directly as output. Matches producing highly inaccurate results are computed on exact hardware to minimize application error. We simulate our design by placing ARGA within each core of an Nvidia Kepler architecture Titan and an AMD Southern Islands 7970. We show our design improves performance throughput by up to 2.7× and improves EDP by 5.3× for 6 GPGPU applications while maintaining less than 5% output error. We also show ARGA accelerates inference of a LeNet NN by 2.1× and improves EDP by 3.7× without significantly impacting classification accuracy.
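A minimal software sketch of the reuse idea follows, assuming a single table keyed by quantized operands; the multi-table parallel lookup, error classification, and hardware placement are not modeled, and the quantization step is an arbitrary choice.

```python
# Sketch of approximate computational reuse in software (the multi-table
# parallel lookup and hardware details of ARGA are not modeled).
def make_approx_table(op, step=0.1):
    table = {}

    def lookup(a, b):
        key = (round(a / step), round(b / step))   # quantize the operands
        if key in table:
            return table[key], True                # hit: reuse, skip exact FP pipeline
        result = op(a, b)                          # miss: compute exactly and cache
        table[key] = result
        return result, False

    return lookup

approx_mul = make_approx_table(lambda a, b: a * b)
print(approx_mul(2.04, 3.98))   # miss -> exact result is computed and stored
print(approx_mul(1.97, 4.01))   # nearby operands hit the same entry -> reused
```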
- Hamid Tabani
- Leonidas Kosmidis
- Jaume Abella
- Francisco J. Cazorla
- Guillem Bernat
The complexity and size of Autonomous Driving (AD) software are comparably higher than those of software implementing other (standard) functionalities in the car. To make things worse, a big fraction of AD software is not specifically designed for the automotive (or any other critical) domain, but for the mainstream market. This brings uncertainty about the extent to which AD software adheres to the guidelines in safety standards. In this paper, we present our experience in applying ISO 26262 (the applicable functional safety standard for road vehicles) software safety guidelines to industrial AD software, in particular, Apollo, a heterogeneous Autonomous Driving framework used extensively in industry. We provide quantitative and qualitative metrics of compliance for many ISO 26262 recommendations on software design, implementation, and testing.
- Debayan Roy
- Wanli Chang
- Sanjoy K. Mitter
- Samarjit Chakraborty
In modern autonomous systems, there is typically a large number of connected components realizing complex functionalities. For example, in autonomous vehicles (AVs), there are tens of millions of lines of code implemented on hundreds of sensors, controllers, and actuators. AVs have been deployed, mostly in trials and restricted environments, showing that substantial progress has been made in functionality development. However, they are still faced with two major challenges: (i) performance guarantee of safety-critical functions under all possible scenarios; (ii) functionality implementation with limited resources. These two challenges are conflicting because safety guarantees necessitate a worst-case analysis that is often very pessimistic for complex hardware/software systems, and thus require more resources. To address this, we study an abstraction of a heterogeneous cyber-physical system architecture consisting of a mix of high- and low-quality resources, such as time- and event-triggered resources, or wired and wireless resources. We show that by properly managing such a mix of resources and formulating a formal verification (model checking) problem, it is possible to tightly dimension the high-quality resource to the minimum (50% in certain cases) while providing control performance guarantees.
- Chao Peng
- Yecheng Zhao
- Haibo Zeng
Today's automotive engine control systems adopt several control strategies that come with tradeoffs between computational load and performance. The current practice is that the switching speeds at which the engine control system changes its control strategy are fixed offline, typically based on the average driving need in a standard driving cycle (i.e., vehicle speed profile over time). This is clearly suboptimal since it fails to capture the variation in the driving cycle, and the actual driving cycle may be considerably different from the standard one. In this paper, we propose to dynamically adjust the switching speeds based on the predicted driving cycle. We develop a hybrid set of schedulability analysis techniques to tame the complexity of ensuring the real-time schedulability of engine control tasks. We design an effective and efficient optimization algorithm that provides close-to-optimal solutions. Experimental results demonstrate that our approach efficiently finds dynamic switching speeds that significantly improve engine performance over static ones.
- He Zhou
- Sunil P. Khatri
- Jiang Hu
- Frank Liu
Although Markov Decision Process (MDP) has wide applications in autonomous systems as a core model in Reinforcement Learning, a key bottleneck is the large memory utilization of the state transition probability matrices. This is particularly problematic for computational platforms with limited memory, or for Bayesian MDP, which requires dozens of such matrices. To mitigate this difficulty, we propose a highly memory-efficient representation for probability matrices using Binary Decision Diagram (BDD) based sampling, and develop a corresponding (Bayesian/classical) MDP solver on a CPU-GPU platform. Simulation results indicate our approach reduces memory by one and two orders of magnitude for Bayesian/classical MDP, respectively.
- Trong Huynh-Bao
- Anabela Veloso
- Sushil Sakhare
- Philippe Matagne
- Julien Ryckaert
- Manu Perumkunnil
- Davide Crotti
- Farrukh Yasin
- Alessio Spessot
- Arnaud Furnemont
- Gouri Kar
- Anda Mocuta
We present for the first time a co-integrated FinFET and vertical nanosheet transistor (VFET) process on a 300 mm silicon wafer for STT-MRAM applications and its related avenues, using a holistic design-technology co-optimization (DTCO) and power-performance-area-cost (PPAC) approach. The STT-MRAM bitcell and a 2 Mbit macro have been optimized and designed to address the viability of the co-integration process and the advantages of vertical-channel transistors as STT-MRAM selectors. The architectural system simulator gem5 has also been employed with Polybench workloads to assess energy savings at the system level. Enabling this co-integration requires four extra masks, which adds less than 10% cost in embedded chips. A 36% area reduction can be achieved for the STT-MRAM bitcell implemented with VFET selectors. With a UVLT flavor, the STT-MRAM bitcell comprising a 3-nanosheet selector could deliver the same performance as the 4-fin LVT FinFET selector. A 2 Mbit STT-MRAM macro designed with VFET selectors can offer a 17% and a 21% reduction in read access latency and energy per operation, respectively, and a 10% reduction in write energy per operation. A 7% energy saving for the STT-MRAM L2 cache using VFET selectors has been observed at the system level with Polybench workloads.
- Nam Sung Kim
- Choungki Song
- Woo Young Cho
- Jian Huang
- Myoungsoo Jung
PCM is a promising non-volatile memory technology, as it can offer a unique trade-off between density and latency compared with DRAM and flash memory. Although PCM is much faster than flash memory, it is still notably slower than DRAM, which can significantly degrade system performance. In this paper, we analyze a PCM implementation in depth and identify the primary cause of PCM's long latency, i.e., a long interconnect (high resistance/capacitance) path between a cell and a sense-amp/write-driver. This in turn requires (1) a very large charge pump consuming ~20% of PCM chip space, ~50% of the latency of write operations, and ~2× more power than a write operation itself; and (2) a large current sense-amp that takes a long time to pre-charge the interconnect path. We then propose the Low-Latency PCM (LL-PCM) architecture. Our analysis shows that LL-PCM can give 119% higher performance and consume 43% lower memory energy than PCM for memory-intensive applications. LL-PCM is only ~1% larger than PCM, as the cost of reducing the resistance/capacitance of the interconnect path is negated by its 4.1× smaller charge pump.
- Janki Bhimani
- Tirthak Patel
- Ningfang Mi
- Devesh Tiwari
Vibration generated in modern computing environments such as autonomous vehicles, edge computing infrastructure, and data center systems is an increasing concern. In this paper, we systematically measure, quantify and characterize the impact of vibration on the performance of SSD devices. Our experiments and analysis uncover that exposure to both short-term and long-term vibration, even within the vendor-specified limits, can significantly affect SSD I/O performance and reliability.
- Leilai Shao
- Sicheng Li
- Ting Lei
- Tsung-Ching Huang
- Raymond Beausoleil
- Zhenan Bao
- Kwang-Ting Cheng
Skin-inspired electronics emerges as a new paradigm due to the increasing demands for conformable and high-quality skin-sensor-silicon (SSS) interfacing in wearable, electronic skin and health monitoring applications. Advances in ultra-thin, flexible, stretchable and conformable materials have made skin electronics feasible. In this paper, we prototyped an active electrode (with a thickness ≤ 2 um), which integrates the electrode with a thin-film transistor (TFT) based amplifier, to effectively suppress motion artifacts. The fabricated ultra-thin amplifier can achieve a gain of 32 dB at 20 kHz, demonstrating the feasibility of the proposed active electrode. Using atrial fibrillation (AF) detection for electrocardiogram (ECG) as an application driver, we further develop a simulation framework taking into account all elements including the skin, the sensor, the amplifier and the silicon chip. Systematic and quantitative simulation results indicate that the proposed active electrode can effectively improve the signal quality under motion noises (achieving ≥30 dB improvement in signal-to-noise ratio (SNR)), which boosts classification accuracy by more than 19% for AF detection.
- Hanbin Hu
- Peng Li
- Jianhua Z. Huang
With increasing design complexity and stringent robustness requirements in applications such as automotive electronics, analog and mixed-signal (AMS) verification becomes a key bottleneck. Rare failure detection in a high-dimensional parameter space using minimal expensive simulation data is a major challenge. We address this challenge under a Bayesian learning framework using Bayesian optimization (BO). We formulate the failure detection as a BO problem where a chosen acquisition function is optimized to select the next (set of) optimal simulation sampling point(s) such that rare failures may be detected using a small amount of data. While providing an attractive black-box solution to design verification, in practice BO is limited in its ability to deal with high-dimensional problems. We propose to use random embedding to effectively reduce the dimensionality of a given verification problem to improve both the quality of BO-based optimal sampling and computational efficiency. We demonstrate the success of the proposed approach on detecting rare design failures under high-dimensional process variations which are completely missed by competitive smart-sampling and BO techniques without dimension reduction.
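A rough sketch of BO with a random embedding, in the spirit of the approach: optimization runs in a low-dimensional space z and a fixed random matrix maps z into the high-dimensional process-parameter space. The GP surrogate, expected-improvement acquisition, dimensions, and toy "simulator" below are illustrative stand-ins, not the paper's setup.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
D, d, n_init, n_iter = 200, 4, 10, 30      # high dim, embedded dim (illustrative)
A = rng.standard_normal((D, d))            # fixed random embedding: x = A z

def simulate(x):
    """Stand-in for an expensive AMS simulation returning a failure metric."""
    return -np.sum((x[:5] - 1.5) ** 2)      # toy: only a few dimensions matter

def embed(z):
    return np.clip(A @ z, -3.0, 3.0)        # keep parameters in a plausible range

Z = rng.uniform(-1, 1, size=(n_init, d))
y = np.array([simulate(embed(z)) for z in Z])

gp = GaussianProcessRegressor(normalize_y=True)
for _ in range(n_iter):
    gp.fit(Z, y)
    cand = rng.uniform(-1, 1, size=(512, d))        # candidate low-dim points
    mu, sigma = gp.predict(cand, return_std=True)
    imp = mu - y.max()
    ei = imp * norm.cdf(imp / (sigma + 1e-9)) + sigma * norm.pdf(imp / (sigma + 1e-9))
    z_next = cand[np.argmax(ei)]                    # expected-improvement pick
    Z = np.vstack([Z, z_next])
    y = np.append(y, simulate(embed(z_next)))

print("best failure metric found:", y.max())
```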
- Yuzhe Ma
- Haoxing Ren
- Brucek Khailany
- Harbinder Sikka
- Lijuan Luo
- Karthikeyan Natarajan
- Bei Yu
Applications of deep learning to electronic design automation (EDA) have recently begun to emerge, although they have mainly been limited to processing of regular structured data such as images. However, many EDA problems require processing irregular structures, and it can be non-trivial to manually extract important features in such cases. In this paper, a high-performance graph convolutional network (GCN) model is proposed for the purpose of processing irregular graph representations of logic circuits. A GCN classifier is first trained to predict observation point candidates in a netlist. The GCN classifier is then used as part of an iterative process to propose observation point insertion based on the classification results. Experimental results show the proposed GCN model has superior accuracy to classical machine learning models in predicting difficult-to-observe nodes. Compared with commercial testability analysis tools, the proposed observation point insertion flow achieves similar fault coverage with an 11% reduction in observation points and a 6% reduction in test pattern count.
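For reference, a single generic graph-convolution layer over a netlist graph looks like the sketch below (standard GCN propagation with self-loops and degree normalization); the paper's actual model architecture, node features, and training setup are not reproduced here.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One generic graph-convolution step: aggregate neighbor features with a
    degree-normalized adjacency (self-loops added), then apply a linear map
    and ReLU. This is the standard GCN form, not necessarily the paper's."""
    a_hat = adjacency + np.eye(adjacency.shape[0])       # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm_adj = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(norm_adj @ features @ weights, 0.0)

# toy netlist graph: 4 gates, 3 features per gate (e.g., fan-in, fan-out, level)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [0, 1, 0, 0]], dtype=float)
feats = np.random.rand(4, 3)
w = np.random.rand(3, 8)
print(gcn_layer(adj, feats, w).shape)  # (4, 8) node embeddings
```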
- Jung Min You
- Joon-Sung Yang
With the increasing integration density of semiconductor designs, many reliability problems have emerged; row-hammering is one of them. The row-hammering effect is a critical issue for reliable memory operation because it can cause unexpected errors, so it must be addressed. There are mainly two kinds of methods to deal with the row-hammering problem: counter-based methods and probabilistic methods. This paper proposes an improved version of the latter and compares it with other probabilistic methods, PARA and PRoHIT. According to the evaluation results, the proposed method increases the row-hammering reduction per refresh by 1.82× and 7.78× on average over PARA and PRoHIT, respectively.
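For context, the probabilistic baseline PARA can be sketched in a few lines: on every row activation, an adjacent row is refreshed with a small probability. The probability value and interface below are illustrative; the paper's improved method modifies this basic policy.

```python
import random

def on_row_activate(row, refresh_row, p=0.001):
    """PARA-style probabilistic mitigation (the baseline the paper compares
    against): on every activation, with probability p refresh one of the two
    physically adjacent rows, chosen at random. p is illustrative."""
    if random.random() < p:
        victim = row + random.choice((-1, 1))
        refresh_row(victim)

# usage: the memory controller calls on_row_activate(row, dram.refresh_row)
# for every ACT command it issues; heavily hammered rows then have their
# neighbors refreshed often enough to avoid disturbance errors.
```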
- Xiaoyi Sun
- Krishnendu Chakrabarty
- Ruirui Huang
- Yiquan Chen
- Bing Zhao
- Hai Cao
- Yinhe Han
- Xiaoyao Liang
- Li Jiang
Disk and memory faults are the leading causes of server breakdown. A proactive solution is to predict such hardware failures at runtime and then isolate the hardware at risk and back up the data. However, current model-based predictors are incapable of using discrete time-series data, such as the values of device attributes, which convey high-level information about the device behavior. In this paper, we propose a novel deep-learning-based prediction scheme for system-level hardware failure prediction. We normalize the distribution of samples' attributes from different vendors to make use of diverse training sets. We propose a temporal Convolutional Neural Network-based model that is insensitive to noise in the time dimension. Finally, we design a loss function to train the model effectively with extremely imbalanced samples. Experimental results from an open S.M.A.R.T. data set and an industrial data set show the effectiveness of the proposed scheme.
- Onur Mutlu
- Saugata Ghose
- Juan Gómez-Luna
- Rachata Ausavarungnirun
Modern computing systems suffer from the dichotomy between computation on one side, which is performed only in the processor (and accelerators), and data storage/movement on the other, which all other parts of the system are dedicated to. Due to this dichotomy, data moves a lot in order for the system to perform computation on it. Unfortunately, data movement is extremely expensive in terms of energy and latency, much more so than computation. As a result, a large fraction of system energy is spent and performance is lost solely on moving data in a modern computing system.
In this work, we re-examine the idea of reducing data movement by performing Processing in Memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked logic and DRAM, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the idea of PIM is not new, we examine two new approaches to enabling PIM: 1) exploiting analog properties of DRAM to perform massively-parallel operations in memory, and 2) exploiting 3D-stacked memory technology design to provide high bandwidth to in-memory logic. We conclude by discussing work on solving key challenges to the practical adoption of PIM.
The capacity of memory and storage devices is expected to increase drastically with the adoption of forthcoming memory and integration technologies. This is a welcome improvement, especially for datacenter servers running modern data-intensive applications. Nonetheless, for such servers to fully benefit from the increasing capacity, the bandwidth of the interconnects between processors and these devices must also increase proportionally, which becomes ever costlier under unabating physical constraints. As a promising alternative to tackle this challenge cost-effectively, a heterogeneous computing paradigm referred to as near-data processing (NDP) has emerged. However, NDP has not yet been widely adopted by the industry because of significant gaps between existing software stacks and those demanded by NDP-capable memory and storage devices. Aiming to overcome these gaps, we propose to turn memory and storage devices into familiar heterogeneous distributed computing systems. Then, we demonstrate the potential of such computing systems for existing data-intensive applications with two recently implemented NDP-capable devices. Finally, we conclude with a practical blueprint for exploiting NDP-based computing systems to speed up solving future computer-aided design and optimization problems.
- Ning Lin
- Hang Lu
- Xin Wei
- Xiaowei Li
Deep convolutional neural networks are well known for their extensive parameters and computation intensity. Structured pruning is an effective solution to obtain a more compact model for efficient inference on GPGPUs, without designing specific hardware accelerators. However, previous works resort to certain metrics in channel/filter pruning and count on labor-intensive fine-tuning to recover the accuracy loss. The "inception" of the pruned model, as another form factor, has an indispensable impact on the final accuracy, but its importance is often ignored in these works. In this paper, we show that an optimal inception is more likely to yield satisfactory performance and shortened fine-tuning iterations. We also propose a reinforcement learning based solution, termed HeadStart, which seeks to learn the best way of pruning aiming at the optimal inception. With the help of the specialized head-start network, it can automatically balance the tradeoff between the final accuracy and the preset speedup rather than tilting to one of them, which also differentiates it from existing works. Experimental results show that HeadStart can attain up to 2.25x inference speedup with only 1.16% accuracy loss tested with large-scale images on various GPGPUs, and generalizes well to various cutting-edge DCNN models.
- Seokwon Kang
- Yongseung Yu
- Jiho Kim
- Yongjun Park
Although approximate computing is widely used, it requires substantial programming effort to find appropriate approximation patterns among multiple pre-defined patterns to achieve high performance. Therefore, we propose an automatic approximation framework called GATE to uncover hidden opportunities in any data-parallel program, regardless of the code pattern or application characteristics, using two compiler techniques, namely subgraph-level approximation (SGLA) and approximate thread merge (ATM). GATE also features conservative/aggressive tuning and dynamic calibration to maximize performance while maintaining the target output quality (TOQ) level during runtime. Our framework achieves an average performance gain of 2.54x over the baseline with minimal accuracy loss.
- Yu-Chuan Chang
- Wei-Ming Chen
- Pi-Cheng Hsiu
- Yen-Yu Lin
- Tei-Wei Kuo
Perceptual similarity measurement allows mobile applications to eliminate unnecessary computations without compromising visual experience. Existing pixel-wise measures incur significant overhead with increasing display resolutions and frame rates. This paper presents an ultra lightweight similarity measure called LSIM, which assesses the similarity between frames based on the transformation matrices of graphics objects. To evaluate its efficacy, we integrate LSIM into the Open Graphics Library and conduct experiments on an Android smartphone with various mobile 3D games. The results show that LSIM is highly correlated with the most widely used pixel-wise measure SSIM, yet three to five orders of magnitude faster. We also apply LSIM to a CPU-GPU governor to suppress the rendering of similar frames, thereby further reducing computation energy consumption by up to 27.3% while maintaining satisfactory visual quality.
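A hedged sketch of the core idea: declare two frames similar when no object's transformation matrix has moved beyond a small threshold. The data layout, distance metric, and threshold below are assumptions for illustration, not LSIM's exact formulation.

```python
import numpy as np

def lsim_like_similarity(prev_transforms, cur_transforms, threshold=1e-3):
    """Sketch of a transformation-matrix-based frame similarity test (inspired
    by, but not identical to, LSIM): two frames are considered similar when
    every graphics object's 4x4 transform moved less than `threshold` in
    Frobenius norm. Object set and threshold are illustrative assumptions."""
    if prev_transforms.keys() != cur_transforms.keys():
        return False                      # objects appeared or disappeared
    for obj_id, prev_m in prev_transforms.items():
        if np.linalg.norm(cur_transforms[obj_id] - prev_m) >= threshold:
            return False
    return True

# usage inside a rendering loop: skip rendering (and reuse the last frame)
# whenever lsim_like_similarity(last_frame_transforms, this_frame_transforms)
# returns True, saving GPU work on visually indistinguishable frames.
```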
- Sivert T. Sliper
- Domenico Balsamo
- Nikos Nikoleris
- William Wang
- Alex S. Weddell
- Geoff V. Merrett
Reactive transient computing systems preserve computational progress despite frequent power failures by suspending (saving state to nonvolatile memory) when detecting a power failure, and restoring once power returns. Existing methods inefficiently save and restore all allocated memory. We propose lightweight memory management that applies the concept of paging to load pages only when needed, and save only modified pages. We then develop a model that maximises available execution time by dynamically adjusting the suspend and restore voltage thresholds. Experiments on an MSP430FR5994 microcontroller show that our method reduces state retention overheads by up to 86.9% and executes algorithms up to 5.3× faster than the state-of-the-art.
- Gagandeep Singh
- Juan Gómez-Luna
- Giovanni Mariani
- Geraldo F. Oliveira
- Stefano Corda
- Sander Stuijk
- Onur Mutlu
- Henk Corporaal
The cost of moving data between the memory/storage units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. A promising paradigm to alleviate this data movement bottleneck is near-memory computing (NMC), which consists of placing compute units close to the memory/storage units. There is substantial research effort that proposes NMC architectures and identifies workloads that can benefit from NMC. System architects typically use simulation techniques to evaluate the performance and energy consumption of their designs. However, simulation is extremely slow, imposing long times for design space exploration. In order to enable fast early-stage design space exploration of NMC architectures, we need high-level performance and energy models.
We present NAPEL, a high-level performance and energy estimation framework for NMC architectures. NAPEL leverages ensemble learning to develop a model that is based on microarchitectural parameters and application characteristics. NAPEL training uses a statistical technique, called design of experiments, to collect representative training data efficiently. NAPEL provides early design space exploration 220× faster than a state-of-the-art NMC simulator, on average, with error rates of 8.5% and 11.6% for performance and energy estimations, respectively, compared to the NMC simulator. NAPEL is also capable of making accurate predictions for previously unseen applications.
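The modeling flow can be sketched as follows: train an ensemble regressor on architecture/application features to predict runtime and energy, then use it in place of slow simulation. Everything below (feature set, synthetic targets, random forest as the ensemble learner) is an illustrative stand-in; NAPEL additionally uses design-of-experiments sampling rather than the plain random sampling shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# synthetic stand-in: each row = (NMC cores, frequency GHz, DRAM banks,
# memory intensity, ILP); targets = (runtime, energy) that a slow simulator
# would normally provide. Both features and targets here are made up.
X = rng.uniform([1, 0.5, 4, 0.0, 0.5], [16, 2.0, 32, 1.0, 4.0], size=(200, 5))
runtime = 1e3 / (X[:, 0] * X[:, 1]) * (1 + X[:, 3]) + rng.normal(0, 5, 200)
energy = 0.8 * runtime * X[:, 1] + rng.normal(0, 5, 200)
y = np.column_stack([runtime, energy])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
mape = np.mean(np.abs(pred - y_te) / y_te, axis=0) * 100
print("runtime / energy error (%):", mape)   # fast surrogate for the simulator
```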
- Andrew McCrabb
- Eric Winsor
- Valeria Bertacco
Graph-based algorithms have gained significant interest in several application domains. Solutions addressing the computational efficiency of such algorithms have mostly relied on many-core architectures. Cleverly laying out input graphs in storage, by placing adjacent vertices in a same storage unit (memory bank or cache unit), enables fast access during graph traversal. Dynamic graphs, however, must be continuously repartitioned to leverage this benefit. Yet software repartitioning solutions rely on costly, cross-vault communication to query and optimize the graph layout between algorithm iterations.
In this work, we propose DREDGE, a novel hardware solution to provide heuristic repartitioning optimizations in the background without extra communication. Our evaluation indicates that we achieve a 1.9x speedup, on average, over several graph algorithms and datasets, executing on a 24×24-core architecture, when compared against a baseline solution that does not repartition the dynamic graph. We estimated that DREDGE incurs only 1.5% area and 2.1% power overheads over an ARM A5 processor core.
- Xin Xin
- Youtao Zhang
- Jun Yang
DRAM-based memory-centric computing architectures are promising solutions to tackle the challenges of the memory wall. In this paper, we develop a novel DRAM-based processing-in-memory (PIM) architecture which requires fewer cycles for every basic operation than prior art. Our small yet fast in-memory computing units support basic logic operations including NOT, AND, and OR. Using those operations, along with shift and propagation, bitwise operations can be extended to word-wise operations, e.g., increment and comparison, with high efficiency. We also optimize the designs to exploit parallelism and data reuse to further improve the performance of compound operations. Compared with the most powerful state-of-the-art PIM architecture, we can achieve comparable or even better performance while consuming only 6% of its area overhead.
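To see how word-wise operations fall out of row-wide AND/OR/NOT plus carry propagation, here is a software stand-in for an in-memory increment; the boolean arrays model DRAM rows holding one bit position of many words, and the circuit-level details of the paper are not captured.

```python
import numpy as np

def bitwise_not(a):    return ~a
def bitwise_and(a, b): return a & b
def bitwise_or(a, b):  return a | b

def increment(word_bits):
    """Increment many words in parallel using only AND/OR/NOT plus carry
    propagation, as a software stand-in for in-DRAM bulk bitwise operations.
    word_bits[i] is a boolean 'row' holding bit i (LSB first) of every word."""
    carry = np.ones_like(word_bits[0])            # adding 1 to every word
    result = []
    for bit in word_bits:                         # ripple carry, LSB to MSB
        # XOR built from AND/OR/NOT: a ^ b = (a | b) & ~(a & b)
        s = bitwise_and(bitwise_or(bit, carry), bitwise_not(bitwise_and(bit, carry)))
        carry = bitwise_and(bit, carry)
        result.append(s)
    return result

words = [np.array([True, False]), np.array([True, True])]  # values 3 and 2 (LSB first)
# word0 (3) wraps to 0 in two bits, word1 (2) becomes 3
print([row.astype(int).tolist() for row in increment(words)])
```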
- Chih-Cheng Chang
- Ming-Hung Wu
- Jia-Wei Lin
- Chun-Hsien Li
- Vivek Parmar
- Heng-Yuan Lee
- Jeng-Hua Wei
- Shyh-Shyuan Sheu
- Manan Suri
- Tian-Sheuan Chang
- Tuo-Hung Hou
Binary STT-MRAM is a highly anticipated embedded nonvolatile memory technology in advanced logic nodes < 28 nm. How to enable its in-memory computing (IMC) capability is critical for enhancing AI Edge. Based on the soon-available STT-MRAM, we report the first binary deep convolutional neural network (NV-BNN) capable of both local and remote learning. Exploiting intrinsic cumulative switching probability, accurate online training of CIFAR-10 color images (~ 90%) is realized using a relaxed endurance spec (switching ≤ 20 times) and hybrid digital/IMC design. For offline training, the accuracy loss due to imprecise weight placement can be mitigated using a rapid non-iterative training-with-noise and fine-tuning scheme.
- Fan Yang
- Youyou Lu
- Youmin Chen
- Haiyu Mao
- Jiwu Shu
Data encryption and authentication are essential for secure NVM. However, the introduced security metadata needs to be atomically written back to NVM along with the data to provide crash consistency, which unfortunately incurs high overhead. To support fine-grained data protection without compromising performance, we propose cc-NVM. It first introduces an epoch-based mechanism that aggressively caches the security metadata in the CPU cache while retaining their consistency in NVM. Deferred spreading is also introduced to reduce the computation overhead of data authentication. Leveraging the hidden ability of data HMACs, we can always recover consistent but old security metadata to its newest version. Compared to Osiris, a state-of-the-art secure NVM, cc-NVM improves performance by 20.4% on average. When the system crashes, instead of dropping all the data due to malicious attacks, cc-NVM is able to detect and locate the exact tampered data while incurring only 29.6% extra write traffic on average.
- Jinsoo Jang
- Brent Byunghoon Kang
Memory disclosure vulnerabilities have been exploited to leak application secret data such as crypto keys (e.g., the Heartbleed Bug). To ameliorate this problem, we propose an in-process memory isolation mechanism by leveraging a common hardware feature, namely, hardware debugging. Specifically, we utilize a watchpoint to monitor a particular memory region containing secret data. We implemented a PoC of our approach based on the 64-bit ARM architecture, including the kernel patches and user APIs that help developers benefit from isolated memory use. We applied the approach to open-source applications such as OpenSSL and AESCrypt. The results of a performance evaluation show that our approach incurs a small amount of overhead.
- Liang Liu
- Rujia Wang
- Youtao Zhang
- Jun Yang
Oblivious RAM (ORAM) is an effective security primitive to prevent access pattern leakage. By adding redundant memory accesses, ORAM prevents attackers from revealing the patterns in the access sequences. However, ORAM tends to introduce a huge degradation on the performance. With growing address space to be protected, ORAM has to store the majority of data in the lower level storage, which further degrades the system performance.
In this paper, we propose Hybrid ORAM (H-ORAM), a novel ORAM primitive to address the large performance degradation incurred when overflowing user data to storage. H-ORAM consists of a batch scheduling scheme for enhancing memory bandwidth usage, and a novel ORAM interface that returns data without waiting for the I/O access each time. We evaluate H-ORAM on a real machine implementation. The experimental results show that H-ORAM outperforms the state-of-the-art Path ORAM by 20×.
- Jisung Park
- Youngdon Jung
- Jonghoon Won
- Minji Kang
- Sungjin Lee
- Jihong Kim
We present a low-overhead ransomware-proof SSD, called RansomBlocker (RBlocker). RBlocker provides full protection against all possible ransomware attacks by delaying every data deletion until it is guaranteed that no attack has occurred. To reduce the storage overhead of the delayed deletion, RBlocker employs a time-out-based backup policy. Based on the fact that ransomware must store an encrypted version of the target files, early deletion of obsolete data is allowed if no encrypted write was detected for a short interval. Otherwise, RBlocker keeps the data for an interval long enough to guarantee the no-attack condition. For accurate in-line detection of encrypted writes, we leverage entropy- and CNN-based detectors in an integrated fashion. Our experimental results show that RBlocker can defend against all types of ransomware attacks with negligible overhead.
- Zimeng Zhou
- Chenchen Fu
- Chun Jason Xue
- Song Han
This paper explores how to optimize the freshness of real-time data in energy-harvesting-based networked embedded systems. We introduce the concept of Age of Information (AoI) to quantitatively measure data freshness and present a comprehensive analysis of the average AoI of real-time data with stochastic update arrival and energy replenishment rates. Both an optimal offline solution and an effective online solution are designed to judiciously select a subset of the real-time data updates and determine their corresponding transmission times to optimize the average AoI subject to energy constraints. Our extensive experiments have validated the effectiveness of the proposed solutions, and shown that these two methods can significantly improve the average AoI by 47.2% compared to the state-of-the-art solutions for low energy replenishment rates.
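For readers unfamiliar with the metric, the time-average AoI of a single source can be computed from the sawtooth age curve as in the sketch below; the update times are toy numbers, and the energy-constrained selection problem itself is not modeled.

```python
def average_aoi(updates, horizon):
    """Time-average Age of Information over [0, horizon].
    `updates` is a list of (generation_time, delivery_time) pairs sorted by
    delivery time; the age at time t is t minus the generation time of the
    most recently delivered update (assumed to start from an update generated
    at t = 0). Standard sawtooth (trapezoid) integration; numbers are toy."""
    area, last_gen, t = 0.0, 0.0, 0.0
    for gen, dlv in updates:
        # age rises linearly from (t - last_gen) to (dlv - last_gen) until dlv
        area += (dlv - t) * ((t - last_gen) + (dlv - last_gen)) / 2.0
        t, last_gen = dlv, gen
    area += (horizon - t) * ((t - last_gen) + (horizon - last_gen)) / 2.0
    return area / horizon

# two updates generated at t=1 and t=4, delivered at t=2 and t=6
print(average_aoi([(1.0, 2.0), (4.0, 6.0)], horizon=8.0))  # -> 2.5
```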
- Marco Widmer
- Andrea Bonetti
- Andreas Burg
Embedded DRAM (eDRAM) requires frequent power-hungry refresh according to the worst-case retention time across PVT variations to avoid data loss. Abandoning the error-free paradigm, by choosing sub-critical refresh rates that gracefully degrade the eDRAM content, unlocks considerable power-saving opportunities, but requires understanding the effect of stochastic memory errors at the system/application level. We propose an FPGA-based platform featuring faulty eDRAM emulation based on advanced retention time models and silicon measurements for statistical error resilience evaluation of applications in a complete embedded system. We analyze the statistical QoS for various benchmarks under different sub-critical refresh rates and retention time distributions.
- Yajuan Du
- Yao Zhou
- Meng Zhang
- Wei Liu
- Shengwu Xiong
Existing studies have uncovered significant Raw Bit Error Rate (RBER) variations among different layers of 3D flash memories due to manufacturing process variation. These RBER variations cause significantly diverse read latencies when reading data with traditional Low-Density Parity-Check (LDPC) codes designed for planar flash memories, which induces sub-optimal read performance in flash-based Solid-State Drives (SSDs).
To investigate the latency diversity, this paper first performs a preliminary experiment and observes that LDPC read levels, which are proportional to read latencies, increase at diverse speeds as data retention progresses. Then, by exploiting this observation, a Multi-Granularity LDPC (MG-LDPC) read method is proposed to adapt the level-increase speed for each layer. Five LDPC engines with varied increase granularity are designed to match the different RBER increase speeds. Finally, two MG-LDPC implementations are applied to assign LDPC engines to each flash layer, either in a fixed way or dynamically according to prior read levels. Experimental results show that the two proposed implementations can reduce SSD read response time by 21% and 47% on average, respectively.
- Siva Satyendra Sahoo
- Bharadwaj Veeravalli
- Akash Kumar
Technology scaling and architectural innovations have led to increasing ubiquity of embedded systems across applications with widely varying and often constantly changing performance and reliability specifications. However, the increasing physical fault-rates in electronic systems have led to single-layer reliability approaches becoming infeasible for resource-constrained systems. Dynamic Cross-layer reliability (CLR) provides scope for efficient adaptation to such QoS variations and increasing unreliability. We propose a design methodology for enabling QoS-aware CLR-integrated runtime adaptation in heterogeneous MPSoC-based embedded systems. Specifically, we propose a combination of reconfiguration cost-aware optimization at design-time and an agent-based optimization at run-time. We report a reduction of up to 51% and 37% in average reconfiguration cost and average energy consumption respectively over state-of-the-art approaches.
- Yuan Zhou
- Haoxing Ren
- Yanqing Zhang
- Ben Keller
- Brucek Khailany
- Zhiru Zhang
This paper introduces PRIMAL, a novel learning-based framework that enables fast and accurate power estimation for ASIC designs. PRIMAL trains machine learning (ML) models with design verification testbenches for characterizing the power of reusable circuit building blocks. The trained models can then be used to generate detailed power profiles of the same blocks under different workloads. We evaluate the performance of several established ML models on this task, including ridge regression, gradient tree boosting, multi-layer perceptron, and convolutional neural network (CNN). For average power estimation, ML-based techniques can achieve an average error of less than 1% across a diverse set of realistic benchmarks, outperforming a commercial RTL power estimation tool in both accuracy and speed (15x faster). For cycle-by-cycle power estimation, PRIMAL is on average 50x faster than a commercial gate-level power analysis tool, with an average error less than 5%. In particular, our CNN-based method achieves a 35x speed-up and an error of 5.2% for cycle-by-cycle power estimation of a RISC-V processor core. Furthermore, our case study on a NoC router shows that PRIMAL can achieve a small estimation error of 4.5% using cycle-approximate traces from SystemC simulation.
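As a flavor of the approach, the sketch below trains one of the simpler models mentioned (ridge regression) to map per-cycle signal-toggle features to per-cycle power; the data is synthetic, whereas the real flow would obtain features from RTL simulation and labels from a gate-level power tool.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)

# synthetic stand-in: 5000 cycles x 300 signals; feature = did the signal
# toggle this cycle (0/1); target = per-cycle power generated from a hidden
# per-signal cost plus noise (in a real flow: gate-level power analysis).
toggles = rng.integers(0, 2, size=(5000, 300)).astype(float)
true_cost = rng.uniform(0.1, 2.0, size=300)
power = toggles @ true_cost + rng.normal(0, 1.0, size=5000)

model = Ridge(alpha=1.0).fit(toggles[:4000], power[:4000])
pred = model.predict(toggles[4000:])
err = np.mean(np.abs(pred - power[4000:]) / power[4000:]) * 100
print(f"cycle-by-cycle estimation error: {err:.1f}%")  # fast RTL-level proxy
```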
- Ilaria Scarabottolo
- Giovanni Ansaloni
- George A. Constantinides
- Laura Pozzi
Inexact hardware design techniques have become popular in error-tolerant systems, where energy efficiency is a primary concern. Several techniques aim to identify circuit portions that can be discarded under an error constraint, but research on systematic methods to determine such error is still at an early stage. We herein illustrate a generic, scalable algorithm that determines the influence of each circuit gate on the final output. The algorithm first partitions the graph representing the circuit, then determines the error propagation model of the resulting subgraphs. When applied to existing approximate design frameworks, our solution improves their efficiency and result quality.
- Martin Rapp
- Sami Salamin
- Hussam Amrouch
- Girish Pahwa
- Yogesh Chauhan
- Jörg Henkel
Negative Capacitance Field-Effect Transistor (NCFET) is an emerging technology that incorporates a ferroelectric layer within the transistor gate stack to overcome the fundamental limit of sub-threshold swing in transistors. Even though physics-based NCFET models have been recently proposed, system-level NCFET models do not exist and research is still in its infancy. In this work, we are the first to investigate the impact of NCFET on performance, energy and cooling costs in many-core processors. Our proposed methodology starts from accurate physics models all the way up to the system level, where the performance and power of a many-core are widely affected. Our new methodology and system-level models allow, for the first time, the exploration of the novel trade-offs between performance gains and power losses that NCFET now offers to system-level designers. We demonstrate that an optimal ferroelectric thickness does exist. In addition, we reveal that current state-of-the-art power management techniques fail when NCFET (with a thick ferroelectric layer) comes into play.
- Payman Behnam
- Mahdi Nazm Bojnordi
Data movement in large caches consumes a significant amount of energy in modern computer systems. Low power interfaces have been proposed to address this problem. Unfortunately, the energy-efficiency of these techniques is largely limited due to undue latency overheads of low power wires and complex coding mechanisms. This paper proposes a hybrid technique for slow-transition, fast-level (STFL) signaling that creates a balance between power and bandwidth in the last level cache interface. Combined with STFL codes, the signaling technique significantly mitigates the performance impacts of low power wires, thereby improving the energy efficiency of data movement in memory systems. When applied to the last level cache of a contemporary multicore system, STFL improves the CPU energy-delay product by 9% as compared to a voltage-frequency scaled baseline. Moreover, the proposed architecture reduces the CPU energy by 26% and achieves 98% of the performance provided by a high-performance baseline.
- Sayak Ray
- Nishant Ghosh
- Ramya Jayaram Masti
- Arun Kanuparthi
- Jason M. Fung
We present an effective methodology for formally verifying security-critical flows in a commercial System-on-Chip (SoC) which involve extensive interaction between firmware (FW) and hardware (HW). We describe several HW-FW interaction scenarios that are typical in commercial SoCs. We highlight unique challenges associated with formal verification of security properties of such interactions and discuss our approach of property-specific abstraction and software model checking to circumvent those challenges. To the best of our knowledge, this is the first exposition on formal co-verification of security-specific HW-FW interactions in the context and at the scale of a commercial SoC. Despite traditional scalability challenges, we demonstrate that many such flows are amenable to effective formal verification.
- Lejla Batina
- Patrick Jauernig
- Nele Mentens
- Ahmad-Reza Sadeghi
- Emmanuel Stapf
Data processing and communication in almost all electronic systems are based on Central Processing Units (CPUs). In order to guarantee confidentiality and integrity of the software running on a CPU, hardware-assisted security architectures are used. However, both the threat model and the non-functional platform requirements, i.e. performance and energy budget, differ when we go from high-end desktop computers and servers to low-end embedded devices that populate the internet of things (IoT). For high-end platforms, a relatively large energy budget is available to protect software against attacks. However, measures to optimize performance give rise to microarchitectural side-channel attacks. IoT devices, in contrast, are constrained in terms of energy consumption and do not incorporate the performance enhancements found in high-end CPUs. Hence, they are less likely to be susceptible to microarchitectural attacks, but give rise to physical attacks, exploiting, e.g., leakage in power consumption or through fault injection. Whereas previous work mostly concentrates on a specific architecture, this paper covers the whole spectrum of computing systems, comparing the corresponding hardware architectures, and most relevant threats.
- Elke De Mulder
- Samatha Gummalla
- Michael Hutter
Software (SW) implementations of cryptographic algorithms are vulnerable to Side-channel Analysis (SCA) attacks, basically relinquishing the key to the outside world through measurable physical properties of the processor like power consumption and electromagnetic radiation. Protected SW implementations typically have a significant timing and code size overhead as well as a substantially long development time because hands-on testing the result is crucial. Plenty of scientific publications offer solutions for this problem for all kinds of algorithms but they are not straightforward to implement as they rely on device assumptions which are rarely met, nor do these solutions take micro-architecture related leakages into account. We present a solution to this problem by integrating side-channel analysis countermeasures into a RISC-V implementation. Our solution protects against first-order power or electromagnetic attacks while keeping the implementation costs as low as possible. We made use of state of the art masking techniques and present a novel solution to protect memory access against SCA. Practical results are provided that demonstrate the leakage results of various cryptographic primitives running on our protected hardware platform.
- Boqian Wang
- Zhonghai Lu
- Shenggang Chen
We propose an admission control method in Network-on-Chip (NoC) with a centralized Artificial Neural Network (ANN) admission controller, which can improve system performance by predicting the most appropriate injection rate of each node via the network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node where the admission control will be applied. Compared to the previous work, our method builds up a high-fidelity model between the network status and the injection rate regulation. The full-system simulation results show that our proposed method can enhance application performance by 17.8% on average and up to 23.8%.
The design space for energy-efficient Network-on-Chips (NoCs) has expanded significantly, comprising a number of techniques. The simultaneous application of these techniques to yield maximum energy efficiency requires the monitoring of a large number of system parameters, which often results in substantial engineering effort and complicated control policies. This motivates us to explore the use of a reinforcement learning (RL) approach that automatically learns an optimal control policy to improve NoC energy efficiency. First, we deploy power-gating (PG) and dynamic voltage and frequency scaling (DVFS) to simultaneously reduce both static and dynamic power. Second, we use RL to automatically explore the dynamic interactions among PG, DVFS, and system parameters, learn the critical system parameters contained in the router and cache, and eventually evolve optimal per-router control policies that significantly improve energy efficiency. Moreover, we introduce an artificial neural network (ANN) to efficiently implement the large state-action table required by RL. Simulation results using the PARSEC benchmarks show that the proposed RL approach reduces power consumption by 26% while improving system performance by 7%, as compared to a combined PG and DVFS design without RL. Additionally, the ANN design yields a 67% area reduction as compared to a conventional RL implementation.
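A tabular Q-learning skeleton for a per-router control policy is sketched below; the state discretization, action set, and reward weighting are illustrative assumptions, and the paper replaces the Q-table with an ANN to keep the hardware cost low.

```python
import random
from collections import defaultdict

ACTIONS = ["power_gate", "vf_low", "vf_mid", "vf_high"]  # illustrative action set

class RouterAgent:
    """Tabular Q-learning sketch of a per-router power-management policy.
    State = discretized (buffer occupancy, link utilization); the reward
    trades off energy against latency (weights are assumptions)."""
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        if random.random() < self.epsilon:                 # explore
            return random.randrange(len(ACTIONS))
        return max(range(len(ACTIONS)), key=lambda a: self.q[state][a])

    def learn(self, state, action, reward, next_state):
        td_target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])

# per control epoch, each router would do something like:
#   state = (occupancy_bin, utilization_bin)
#   a = agent.act(state); apply ACTIONS[a]
#   reward = -(w_e * epoch_energy + w_l * epoch_latency)   # weights are assumptions
#   agent.learn(state, a, reward, next_state)
```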
- Venkata Yaswanth Raparti
- Sudeep Pasricha
Data-snooping is a serious security threat in NoC fabrics that can lead to theft of sensitive information from applications executing on manycore processors. Hardware Trojans (HTs) covertly embedded in NoC components can carry out such snooping attacks. In this paper, we first describe a low-overhead snooping invalidation module (SIM) to prevent malicious data replication by HTs in NoCs. We then devise a snooping detection module (THANOS) to also detect malicious applications that utilize such HTs. Experimental analysis shows that unlike state-of-the-art mechanisms, SIM and THANOS not only mitigate snooping attacks but also improve NoC performance by 48.4% in the presence of these attacks, with a minimal ~2.15% area and ~5.5% power overhead.
- Michihiro Koibuchi
- Lambert Leong
- Tomohiro Totoki
- Naoya Niwa
- Hiroki Matsutani
- Hideharu Amano
- Henri Casanova
Wireless interconnects based on inductive coupling technology are compelling propositions for designing 3-D integrated chips. This work addresses the heat dissipation problem on such systems. Although effective cooling technologies have been proposed for systems designed based on Through Silicon Via (TSV), their application to systems that use inductive coupling is problematic because of increased wireless-communication distance. For this reason, we propose two methods for designing sparse 3-D chips layouts and Networks on Chip (NoCs) based on inductive coupling. The first method computes an optimized 3-D chip layout and then generates a randomized network topology for this layout. The second method uses a standard stack chip layout with a standard network topology as a starting point, and then deterministically transforms it into either a “staircase” or a “checkerboard” layout. We quantitatively compare the designs produced by these two methods in terms of network and application performance. Our main finding is that the first method produces designs that ultimately lead to higher parallel application performance, as demonstrated for nine OpenMP applications in the NAS Parallel Benchmarks.
- Peng Wang
- Sobhan Niknam
- Sheng Ma
- Zhiying Wang
- Todor Stefanov
In this paper, we address the problem of how to achieve energy-efficient confined-interference communication on a bufferless NoC taking advantage of the low power consumption of such NoC. We propose a novel routing approach called Surfing on a Bufferless NoC (Surf-Bless) where packets are assigned to domains and Surf-Bless guarantees that interference between packets is confined within a domain, i.e., there is no interference between packets assigned to different domains. By experiments, we show that our Surf-Bless routing approach is effective in supporting confined-interference communication and consumes much less energy than the related approaches.
- Marcos Horro
- Mahmut T. Kandemir
- Louis-Noël Pouchet
- Gabriel Rodríguez
- Juan Touriño
Recent manycore processors are kept coherent using scalable distributed directories. A paramount example is the Xeon Phi Knights Landing. It features 38 tiles packed in a single die, organized into a 2D mesh. Before accessing remote data, tiles need to query the distributed directory. The effect of this coherence traffic is poorly understood. We show that the apparent UMA behavior results from the degradation of the peak performance. We develop ways to optimize the coherence traffic, the core-to-core-affinity, and the scheduling of a set of tasks on the mesh, leveraging the unique characteristics of processor units stemming from process variations.
- Mohsen Imani
- Justin Morris
- John Messerly
- Helen Shu
- Yaobang Deng
- Tajana Rosing
Brain-inspired Hyperdimensional (HD) computing is a new computing paradigm emulating the neuron's activity in high-dimensional space. The first step in HD computing is to map each data point into high-dimensional space (e.g., 10,000 dimensions), which requires the computation of thousands of operations for each element of data in the original domain. Encoding alone takes about 80% of the execution time of training. In this paper, we propose BRIC, a fully binary Brain-Inspired Classifier based on HD computing for energy-efficient and high-accuracy classification. BRIC introduces a novel encoding module based on random projection with a predictable memory access pattern which can efficiently be implemented in hardware. BRIC is the first HD-based approach which provides data projection with a 1:1 ratio to the original data and enables all training/inference computation to be performed using binary hypervectors. To further improve BRIC efficiency, we develop an online dimension reduction approach which removes insignificant hypervector dimensions during training. Additionally, we designed a fully pipelined FPGA implementation which accelerates BRIC in both training and inference phases. Our evaluation of BRIC on a wide range of classification applications shows that BRIC can achieve 64.1× energy efficiency and 9.8× speedup during training (43.8× and 6.1× during inference) as compared to baseline HD computing, while providing the same classification accuracy.
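A minimal sketch of binary random-projection encoding and nearest-class-hypervector inference is shown below; the dimensionality, Hamming-distance similarity, and majority-vote training are common HD-computing choices and not necessarily BRIC's exact pipeline (its online dimension reduction and FPGA design are omitted).

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10_000                     # hypervector dimensionality (typical HD choice)

def make_encoder(n_features, dim=D):
    proj = rng.standard_normal((dim, n_features))     # fixed random projection
    return lambda x: (proj @ x) >= 0                  # binarize -> bit hypervector

def train(encoder, X, y, n_classes):
    """Bundle (majority-vote) the encoded hypervectors of each class."""
    sums = np.zeros((n_classes, D))
    for xi, yi in zip(X, y):
        sums[yi] += encoder(xi)
    return sums >= (np.bincount(y, minlength=n_classes)[:, None] / 2.0)

def predict(encoder, class_hvs, x):
    hv = encoder(x)
    return int(np.argmin([(hv ^ c).sum() for c in class_hvs]))   # Hamming distance

X = rng.standard_normal((100, 20)); y = rng.integers(0, 3, 100)
enc = make_encoder(20)
class_hvs = train(enc, X, y, 3)
print(predict(enc, class_hvs, X[0]), y[0])
```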
- Seongsik Park
- Seijoon Kim
- Hyeokjun Choe
- Sungroh Yoon
Spiking neural networks (SNNs) are considered as one of the most promising artificial neural networks due to their energy-efficient computing capability. Recently, conversion of a trained deep neural network to an SNN has improved the accuracy of deep SNNs. However, most of the previous studies have not achieved satisfactory results in terms of inference speed and energy efficiency. In this paper, we propose a fast and energy-efficient information transmission method with burst spikes and hybrid neural coding scheme in deep SNNs. Our experimental results showed the proposed methods can improve inference energy efficiency and shorten the latency.
- Kangjun Bai
- Qiyuan An
- Yang Yi
Deep neural networks (DNNs), the brain-like machine learning architecture, have gained immense success in data-extensive applications. In this work, a hybrid structured deep delayed feedback reservoir (Deep-DFR) computing model is proposed and fabricated. Our Deep-DFR employs memristive synapses working in a hierarchical information processing fashion with DFR modules as the readout layer, leading our proposed deep learning structure to be both depth-in-space and depth-in-time. Our fabricated prototype along with experimental results demonstrate its high energy efficiency with low hardware implementation cost. For image classification on MNIST and SVHN, our Deep-DFR yields a 1.26~7.69X reduction in the testing error compared to state-of-the-art DNN designs.
- Tao Liu
- Wujie Wen
- Lei Jiang
- Yanzhi Wang
- Chengmo Yang
- Gang Quan
New DNN accelerators based on emerging technologies, such as resistive random access memory (ReRAM), are gaining increasing research attention given their potential of “in-situ” data processing. Unfortunately, device-level physical limitations that are unique to these technologies may cause weight disturbance in memory and thus compromise the performance and stability of DNN accelerators. In this work, we propose a novel fault-tolerant neural network architecture to mitigate the weight disturbance problem without involving expensive retraining. Specifically, we propose a novel collaborative logistic classifier to enhance the DNN stability by redesigning the binary classifiers augmented from both traditional error correction output code (ECOC) and modern DNN training algorithms. We also develop an optimized variable-length “decode-free” scheme to further boost the accuracy with a smaller number of classifiers. Experimental results on cutting-edge DNN models and complex datasets show that the proposed fault-tolerant neural network architecture can effectively rectify the accuracy degradation against weight disturbance for DNN accelerators with low cost, thus allowing for its deployment in a variety of mainstream DNNs.
- Zhenhua Zhu
- Hanbo Sun
- Yujun Lin
- Guohao Dai
- Lixue Xia
- Song Han
- Yu Wang
- Huazhong Yang
Convolutional Neural Networks (CNNs) play a vital role in machine learning. Emerging resistive random-access memories (RRAMs) and RRAM-based Processing-In-Memory architectures have demonstrated great potential in boosting both the performance and energy efficiency of CNNs. However, restricted by the immature process technology, it is hard to implement and fabricate a CNN accelerator chip based on multi-bit RRAM devices. In addition, existing single-bit RRAM based CNN accelerators only focus on binary or ternary CNNs, which have more than 10% accuracy loss compared with full-precision CNNs. This paper proposes a configurable multi-precision CNN computing framework based on single-bit RRAM, which consists of an RRAM computing overhead aware network quantization algorithm and a configurable multi-precision CNN computing architecture based on single-bit RRAM. The proposed method achieves accuracy equivalent to full-precision CNNs while also lowering storage consumption and latency via multi-precision quantization. The designed architecture supports accelerating multi-precision CNNs, even with different precisions across layers. Experimental results show that the proposed framework can reduce computing area by 70% and computing energy by 75% on average, with nearly no accuracy loss, and that the equivalent energy efficiency is 1.6~8.6× that of existing RRAM-based architectures with only 1.07% area overhead.
- Zhezhi He
- Jie Lin
- Rickard Ewetz
- Jiann-Shiun Yuan
- Deliang Fan
In this work, we investigate various non-ideal effects (Stuck-At-Fault (SAF), IR-drop, thermal noise, shot noise, and random telegraph noise) of the ReRAM crossbar when employing it as a dot-product engine for deep neural network (DNN) acceleration. In order to examine the impacts of those non-ideal effects, we first develop a comprehensive framework called PytorX based on the mainstream PyTorch DNN framework. PytorX can perform end-to-end training, mapping, and evaluation for crossbar-based neural network accelerators, considering all of the above non-ideal effects of the ReRAM crossbar together. Experiments based on PytorX show that directly mapping a trained large-scale DNN onto the crossbar without considering these non-ideal effects can lead to a complete system malfunction (i.e., equal to random guessing) when the neural network goes deeper and wider. In particular, to address SAF side effects, we propose a digital SAF error correction algorithm to compensate for crossbar output errors, which only needs one-time profiling to achieve almost no system accuracy degradation. Then, to overcome IR-drop effects, we propose a Noise Injection Adaption (NIA) methodology that incorporates the statistics of the current shift caused by IR drop in each crossbar as stochastic noise in the DNN training algorithm, which can efficiently regularize the DNN model to make it intrinsically adaptive to non-ideal ReRAM crossbars. It is a one-time training method that does not require retraining for every specific crossbar. Optimizing the system operating frequency can easily take care of the remaining non-ideal effects. Various experiments on different DNNs for image recognition applications are conducted to show the efficacy of our proposed methodology.
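A minimal, framework-agnostic sketch of the noise-injection idea is given below (PytorX itself is built on PyTorch and derives its noise statistics from per-crossbar IR-drop profiling; the Gaussian noise model and magnitudes here are placeholder assumptions):

```python
import numpy as np

def crossbar_matvec(weights, x, ir_drop_std=0.0, rng=None):
    """Ideal crossbar dot-product, optionally perturbed by zero-mean Gaussian noise
    standing in for the current shift caused by IR drop."""
    y = weights @ x
    if ir_drop_std > 0.0:
        rng = rng or np.random.default_rng()
        y = y + rng.normal(0.0, ir_drop_std * (np.abs(y).mean() + 1e-12), size=y.shape)
    return y

def train_step(weights, x, target, lr=0.01, ir_drop_std=0.05, rng=None):
    """One least-squares step with noise injected in the forward pass only, so the
    model learns to tolerate the perturbation (a regularization effect)."""
    y = crossbar_matvec(weights, x, ir_drop_std, rng)
    err = y - target
    grad = np.outer(err, x)       # gradient of 0.5*||y - target||^2 w.r.t. the weights
    return weights - lr * grad, float(0.5 * err @ err)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_true = rng.normal(size=(4, 8))
    W = np.zeros_like(W_true)
    for _ in range(2000):
        x = rng.normal(size=8)
        W, loss = train_step(W, x, W_true @ x, rng=rng)
    print("weight recovery error (MSE):", round(float(np.mean((W - W_true) ** 2)), 4))
```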
Microarchitectural covert channel attacks are a threat when multiple tenants share hardware resources such as the last-level cache. In this work, we propose a novel covert channel attack that exploits a new microarchitectural structure introduced to support memory encryption — in particular, the memory encryption engine (MEE) cache. The MEE cache is a shared resource, but it is only utilized when accessing the integrity tree data and therefore provides an opportunity for a stealthy covert channel attack. However, there are challenges since the MEE cache organization is not publicly known and its access behavior differs from that of a conventional cache. We demonstrate how the MEE cache can be exploited to establish covert channel communication.
- Zhenghong Jiang
- Hanchen Jin
- G. Edward Suh
- Zhiru Zhang
Designing a secure cryptographic accelerator is challenging as vulnerabilities may arise from design decisions and implementation flaws. To provide high security assurance, we propose to design and build cryptographic accelerators with hardware-level information flow control so that the security of an implementation can be formally verified. This paper uses an AES accelerator as a case study to demonstrate how to express security requirements of a cryptographic accelerator as information flow policies for security enforcement. Our AES prototype on an FPGA shows that the proposed protection has a marginal impact on area and performance.
- Khaled N. Khasawneh
- Esmaeil Mohammadian Koruyeh
- Chengyu Song
- Dmitry Evtyushkin
- Dmitry Ponomarev
- Nael Abu-Ghazaleh
Speculative attacks, such as Spectre and Meltdown, target speculative execution to access privileged data and leak it through a side-channel. In this paper, we introduce SafeSpec, a new model for supporting speculation in a way that is immune to side-channel leakage by storing the side effects of speculative instructions in separate structures until they commit. Additionally, we address the possibility of a covert channel from speculative instructions to committed instructions before these instructions are committed. We develop a cycle-accurate model of a modified x86-64 processor design and show that the performance impact is negligible.
- Jacob Fustos
- Farzad Farshchi
- Heechul Yun
Speculative execution is an essential performance-enhancing technique in modern processors, but it has been shown to be insecure. In this paper, we propose SpectreGuard, a novel defense mechanism against Spectre attacks. In our approach, sensitive memory blocks (e.g., secret keys) are marked using a simple OS/library API and are then selectively protected by hardware from Spectre attacks via a low-cost micro-architecture extension. This technique allows microprocessors to maintain high performance, while restoring control to software developers to make security and performance trade-offs.
- Daimeng Wang
- Zhiyun Qian
- Nael Abu-Ghazaleh
- Srikanth V. Krishnamurthy
CPU memory prefetchers can substantially interfere with prime and probe cache side-channel attacks, especially on in-order CPUs, which use aggressive prefetching. This interference is not accounted for in previous attacks. In this paper, we propose PAPP, a Prefetcher-Aware Prime Probe attack that can operate even in the presence of aggressive prefetchers. Specifically, we reverse engineer the prefetcher and replacement policy on several CPUs and use these insights to design a prime and probe attack that minimizes the impact of the prefetcher. We evaluate PAPP using the Cache Side-channel Vulnerability (CSV) metric and demonstrate substantial improvements in the quality of the channel under different conditions.
- Thomas Nyman
- Ghada Dessouky
- Shaza Zeitouni
- Aaro Lehikoinen
- Andrew Paverd
- N. Asokan
- Ahmad-Reza Sadeghi
Memory-unsafe programming languages like C and C++ leave many (embedded) systems vulnerable to attacks like control-flow hijacking. However, defenses against control-flow attacks, such as (fine-grained) randomization or control-flow integrity, are ineffective against data-oriented attacks and more expressive Data-oriented Programming (DOP) attacks that bypass state-of-the-art defenses.
We propose run-time scope enforcement (RSE), a novel approach that efficiently mitigates all currently known DOP attacks by enforcing compile-time memory safety constraints like variable visibility rules at run-time. We present Hardscope, a proof-of-concept implementation of hardware-assisted RSE for RISC-V, and show it has a low performance overhead of 3.2% for embedded benchmarks.
- Shuhan Zhang
- Wenlong Lyu
- Fan Yang
- Changhao Yan
- Dian Zhou
- Xuan Zeng
- Xiangdong Hu
This paper presents an efficient multi-fidelity Bayesian optimization approach for analog circuit synthesis. The proposed method can significantly reduce the overall computational cost by fusing the simple but potentially inaccurate low-fidelity model and a few accurate but expensive high-fidelity data. Gaussian Process (GP) models are employed to model the low- and high-fidelity black-box functions separately. The nonlinear map between the low-fidelity model and high-fidelity model is also modelled as a Gaussian process. A fusing GP model which combines the low- and high-fidelity models can thus be built. An acquisition function based on the fusing GP model is used to balance the exploitation and exploration. The fusing GP model is evolved gradually as new data points are selected sequentially by maximizing the acquisition function. Experimental results show that our proposed method reduces up to 65.5% of the simulation time compared with the state-of-the-art single-fidelity Bayesian optimization method, while exhibiting more stable performance and a more promising practical prospect.
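A compact sketch of the fusing-GP construction under simplifying assumptions (the analytic test functions, kernels, and the lower-confidence-bound acquisition below are illustrative stand-ins for the paper's circuit simulators and acquisition function):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def f_high(x):  # expensive "accurate simulation" (placeholder analytic function)
    return np.sin(8 * x) * x + 0.3 * x

def f_low(x):   # cheap, biased approximation of f_high
    return 0.8 * f_high(x) + 0.2 * x - 0.1

rng = np.random.default_rng(0)
X_low = rng.uniform(0, 1, 40).reshape(-1, 1)   # many cheap low-fidelity samples
X_high = rng.uniform(0, 1, 6).reshape(-1, 1)   # few expensive high-fidelity samples

# GP over the low-fidelity function.
gp_low = GaussianProcessRegressor(ConstantKernel() * RBF(0.2), normalize_y=True)
gp_low.fit(X_low, f_low(X_low).ravel())

# Model the (possibly nonlinear) low-to-high map with a second GP whose inputs
# include the low-fidelity prediction at the same location.
Z_high = np.column_stack([X_high, gp_low.predict(X_high)])
gp_fuse = GaussianProcessRegressor(ConstantKernel() * RBF([0.2, 1.0]), normalize_y=True)
gp_fuse.fit(Z_high, f_high(X_high).ravel())

# Lower-confidence-bound acquisition on the fused model (minimizing f_high).
Xc = np.linspace(0, 1, 200).reshape(-1, 1)
Zc = np.column_stack([Xc, gp_low.predict(Xc)])
mu, sigma = gp_fuse.predict(Zc, return_std=True)
x_next = Xc[np.argmin(mu - 2.0 * sigma)]
print("next point to simulate at high fidelity:", float(x_next))
```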
- Mohamed Baker Alawieh
- Sinead A. Williamson
- David Z. Pan
As integrated circuit technologies continue to scale, efficient performance modeling becomes indispensable. Recently, several new learning paradigms have been proposed to reduce the computational cost associated with accurate performance modeling. A common attribute among most of these paradigms is the leverage of the sparsity feature to build efficient performance models. In this work, we propose a new perspective to incorporate sparsity in the modeling task by utilizing spike and slab feature selection techniques. Practically, our proposed method uses two different priors on the different model coefficients based on their importance. This is incorporated into a mixture model that can be built using a hierarchical Bayesian framework to select the important features and find the model coefficients. Our numerical experiments demonstrate that the proposed approach can achieve better results compared to traditional sparse modeling techniques while also providing valuable insight about the important features in the model.
- Biying Xu
- Yibo Lin
- Xiyuan Tang
- Shaolan Li
- Linxiao Shen
- Nan Sun
- David Z. Pan
In back-end analog/mixed-signal (AMS) design flow, well generation persists as a fundamental challenge for layout compactness, routing complexity, circuit performance and robustness. The immaturity of AMS layout automation tools comes to a large extent from the difficulty in comprehending and incorporating designer expertise. To mimic the behavior of experienced designers in well generation, we propose a generative adversarial network (GAN) guided well generation framework with a post-refinement stage leveraging the previous high-quality manually-crafted layouts. Guiding regions for wells are first created by a trained GAN model, after which the well generation results are legalized through post-refinement to satisfy design rules. Experimental results show that the proposed technique is able to generate wells close to manual designs with comparable post-layout circuit performance.
- Zhengyu Chen
- Hai Zhou
- Jie Gu
Mixed-signal time-domain computing (TC) has recently drawn significant attention due to its high efficiency in applications such as machine learning accelerators. However, due to the nature of analog and mixed-signal design, there is a lack of a systematic flow of synthesis and place & route for time-domain circuits. This paper proposes a comprehensive design flow for TC. In the front-end, a variation-aware, digital-compatible synthesis flow is proposed. In the back-end, a placement technique using a graph-based optimization engine is proposed to deal with the especially stringent matching requirements in TC. Simulation results show significant improvement over prior analog placement methods. A 55nm test chip is used to demonstrate that the proposed design flow can meet the stringent timing matching target for TC with a significant performance boost over conventional digital design.
- Charalampos Antoniadis
- Nestor Evmorfopoulos
- Georgios Stamoulis
The integration of more components into modern Systems-on-Chip (SoCs) has led to very large RLC parasitic networks consisting of millions of nodes, which have to be simulated at many time points or frequencies to verify the proper operation of the chip. Model Order Reduction (MOR) techniques have been employed routinely to substitute the large-scale parasitic model with a model of lower order that has a similar response at the input/output ports. However, all established MOR techniques result in dense system matrices that render their simulation impractical. To this end, in this paper we propose a methodology for the sparsification of the dense circuit matrices resulting from Model Order Reduction of general RLC circuits, which employs a sequence of algorithms based on the computation of the nearest diagonally dominant matrix and the sparsification of the corresponding graph. Experimental results indicate that a high sparsity ratio of the reduced system matrices can be achieved with very small loss of accuracy.
- Sara Divanbeigi
- Evan Aditya
- Zhongpin Wang
- Markus Olbrich
In the era of advancing technology, increasing circuit complexity requires faster simulators for the verification step. The piece-wise linear simulation approach provides an efficient and accurate solution. In this paper, a state-of-the-art mixed-signal simulator is explained. The approach is extended to new exponential and quadratic stimuli. This requires a comprehensive derivation of mathematical equations, which remove the need for computationally expensive evaluation. The new stimuli are simulated in several circuits and compared to a conventional simulator. The result shows significant run-time acceleration with high accuracy. Therefore, it meets the industrial requirement, which demands simulation with various input forms and non-linear components.
- Heinz Riener
- Eleonora Testa
- Winston Haaswijk
- Alan Mishchenko
- Luca Amarù
- Giovanni De Micheli
- Mathias Soeken
This paper proposes a novel methodology for multi-level logic synthesis that is independent from a specific graph data structure, but formulates synthesis procedures using an abstract concept definition of a logic representation. The idea is to capture the essence of optimisations in a general manner and tailor only small performance-critical sections to the underlying logic representation. This generic yet scalable approach saves many man-months of development time and enables logic synthesis and technology-mapping procedures parameterised in a logic representation. We present the generic design methodology and demonstrate its practicality by providing a complete state-of-the-art logic synthesis flow.
- Victor N. Kravets
- Nian-Ze Lee
- Jie-Hong R. Jiang
The task of an engineering change order (ECO) is to update the current implementation of a design according to its revised specification with minimum modification. Prior studies show that the amount of design modification majorly depends on the selection of rectification points, i.e., the input pins of gates whose functionality should be rectified with some patch circuitry. In realistic ECOs, as the netlist of the current implementation has been heavily optimized to meet design objectives, it is usually structurally dissimilar to the netlist of a revised specification, which is synthesized only by lightweight optimization. This paper proposes an ECO solution for optimized designs, which is robust against structural dissimilarity caused by design optimization. It locates candidate rectification points in a sampling domain, which significantly improves the scalability of rectification search. To synthesize the circuitry of patches, a structurally independent rewiring formulation is proposed to reuse existing logic in the implementation. Based on the proposed method, a newly developed engine is evaluated on the engineering changes arising in the design of microprocessors. Its ability to derive patches of superior quality is demonstrated in comparison to industrial tools.
- Niels Gleinig
- Frances Ann Hubis
- Torsten Hoefler
In order to compute a non-invertible function on a reversible circuit, one needs to “embed” the function into a larger function which has some garbage bits, corresponding to additional lines. The problem of determining the minimal number of garbage bits that are needed to embed a given function has attracted extensive research, largely motivated by quantum computing, where the number of lines equals the number of qubits. However, all approaches that are known have either no theoretical quality guarantees (bounds on approximation factors) or require exponential runtime. We present an efficient probabilistic approximation algorithm with theoretical bounds.
- Hao Chen
- Shao-Chun Hung
- Jie-Hong R. Jiang
Threshold logic circuits are artificial neural networks with their neuron outputs being binarized, thus amenable for efficient, multiplier-free, hardware implementation of machine learning applications. In the reviving threshold logic synthesis, this work lays the foundations of disjoint-support decomposition and extraction operation of threshold logic functions. They lead to a synthesis procedure for interconnect minimization of threshold logic circuits, an important, but not well addressed, objective in both neural network and nanometer circuit designs. Experimental results show that our method can efficiently and effectively reduce interconnect as well as weight/threshold value over highly optimized circuits, thus suitable for implementation using emerging technologies.
- Eleonora Testa
- Mathias Soeken
- Luca Amarù
- Giovanni De Micheli
Reducing the number of AND gates plays a central role in many cryptography and security applications. We propose a logic synthesis algorithm and tool to minimize the number of AND gates in a logic network composed of AND, XOR, and inverter gates. Our approach is fully automatic and exploits cut enumeration algorithms to explore optimization potentials in local subcircuits. The experimental results show that our approach can reduce the number of AND gates by 34% on average compared to generic size optimization algorithms. Further, we are able to reduce the number of AND gates up to 76% in best-known benchmarks from the cryptography community.
- Rafael Trapani Possignolo
- Jose Renau
Designers wait several hours to get synthesis, placement and routing results even for small changes. Commercial FPGA flows allow for resynthesis after code changes; however, they target large code changes with incremental flows that are not very effective. We propose SMatch, a flow for FPGAs with a novel incremental elaboration and novel incremental FPGA placement and routing that improves the state-of-the-art by reducing the amount of placement and routing work needed. We evaluate our approach against commercial FPGA flows. Our method finishes synthesis, placement, and routing in under 30s for most changes of publicly available benchmarks with negligible QoR impact, being over 20× faster than existing incremental FPGA flows.
- Tutu Ajayi
- Vidya A. Chhabria
- Mateus Fogaça
- Soheil Hashemi
- Abdelrahman Hosny
- Andrew B. Kahng
- Minsoo Kim
- Jeongsup Lee
- Uday Mallappa
- Marina Neseem
- Geraldo Pradipta
- Sherief Reda
- Mehdi Saligane
- Sachin S. Sapatnekar
- Carl Sechen
- Mohamed Shalan
- William Swartz
- Lutong Wang
- Zhehong Wang
- Mingyu Woo
- Bangqi Xu
We describe the planned Alpha release of OpenROAD, an open-source end-to-end silicon compiler. OpenROAD will help realize the goal of “democratization of hardware design”, by reducing cost, expertise, schedule and risk barriers that confront system designers today. The development of open-source, self-driving design tools is in and of itself a “moon shot” with numerous technical and cultural challenges. The open-source flow incorporates a compatible open-source set of tools that span logic synthesis, floorplanning, placement, clock tree synthesis, global routing and detailed routing. The flow also incorporates analysis and support tools for static timing analysis, parasitic extraction, power integrity analysis, and cloud deployment. We also note several observed challenges, or “lessons learned”, with respect to development of open-source EDA tools and flows.
- Kishor Kunal
- Meghna Madhusudan
- Arvind K. Sharma
- Wenbin Xu
- Steven M. Burns
- Ramesh Harjani
- Jiang Hu
- Desmond A. Kirkpatrick
- Sachin S. Sapatnekar
This paper presents analog layout automation efforts under the ALIGN (“Analog Layout, Intelligently Generated from Netlists”) project for fast layout generation using a modular approach based on a mix of algorithmic and machine learning-based tools. The road to rapid turnaround is based on an approach that detects structure and hierarchy in the input netlist and uses a grid based philosophy for layout. The paper provides a view of the current status of the project, challenges in developing open-source code with an academic/industry team, and nuts-and-bolts issues such as working with abstracted PDKs, navigating the “wall” between secured IP and open-source software, and securing access to example designs.
- Tsung-Wei Huang
- Chun-Xun Lin
- Guannan Guo
- Martin D. F. Wong
Open source has started energizing both industrial and academic research and development in electronic design automation (EDA) systems. By moving to open source, we can speed up our effort and work with others who are working toward the same goals, while reducing costs and improving end products. However, building an open-source project is much more than placing the codebase on the web. In this paper, we will talk about essential building blocks to create an impactful open-source project, including the source repository, project landing page, documentation, and continuous integration. We will also cover the use of web-based frameworks to design a showcase project to attract the community’s attention. We will then share our experience in developing an open-source timing analyzer (OpenTimer) and a parallel task programming library (Cpp-Taskflow), both of which are being used in many industrial and academic EDA research projects.
- Elad Alon
- Krste Asanović
- Jonathan Bachrach
- Borivoje Nikolić
We describe our experience developing and promoting a set of open-source tools and IP over the last 9 years, including the Chisel hardware construction language, the Rocket Chip SoC generator, and the BAG analog layout generator.
- Huiyu Mo
- Leibo Liu
- Wenping Zhu
- Qiang Li
- Hong Liu
- Wenjing Hu
- Yao Wang
- Shaojun Wei
Face detection and alignment are highly-correlated, computation-intensive tasks that are not yet flexibly supported by any facial-oriented accelerator. This work proposes the first unified accelerator for multi-face detection and alignment, along with optimizations of the multi-task cascaded convolutional networks algorithm, to implement both multi-face detection and alignment. First, clustering non-maximum suppression is proposed to significantly reduce intersection-over-union computation and eliminate the hardware-interference sorting process, bringing a 16.0% speed-up without any loss. Second, a new pipeline architecture is presented to implement the proposal network in a more computation-efficient manner, with 41.7% less multiplier usage and a 38.3% decrease in memory capacity compared with the similar method. Third, a batch schedule mechanism is proposed to improve the hardware utilization of the fully-connected layer by 16.7% on average with a variable input number in batch processing. Based on the TSMC 28 nm CMOS process, this accelerator only consumes 6.7ms at 400 MHz to simultaneously process 5 faces for each image and achieves 1.17 TOPS/W power efficiency, which is 54.8× higher than the state-of-the-art solution.
- Angad S. Rekhi
- Brian Zimmer
- Nikola Nedovic
- Ningxi Liu
- Rangharajan Venkatesan
- Miaorong Wang
- Brucek Khailany
- William J. Dally
- C. Thomas Gray
Analog/mixed-signal (AMS) computation can be more energy efficient than digital approaches for deep learning inference, but incurs an accuracy penalty from precision loss. Prior AMS approaches focus on small networks/datasets, which can maintain accuracy even with 2b precision. We analyze applicability of AMS approaches to larger networks by proposing a generic AMS error model, implementing it in an existing training framework, and investigating its effect on ImageNet classification with ResNet-50. We demonstrate significant accuracy recovery by exposing the network to AMS error during retraining, and we show that batch normalization layers are responsible for this accuracy recovery. We also introduce an energy model to predict the requirements of high-accuracy AMS hardware running large networks and use it to show that for ADC-dominated designs, there is a direct tradeoff between energy efficiency and network accuracy. Our model predicts that achieving < 0.4% accuracy loss on ResNet-50 with AMS hardware requires a computation energy of at least ~300 fJ/MAC. Finally, we propose methods for improving the energy-accuracy tradeoff.
- Juejian Wu
- Hongtao Zhong
- Kai Ni
- Yongpan Liu
- Huazhong Yang
- Xueqing Li
Making embedded memory symmetric provides the capability of memory access in both rows and columns, which brings new opportunities for significant energy and time savings when only a portion of the data in the words needs to be accessed. This work investigates the use of ferroelectric field-effect transistors (FeFETs), an emerging nonvolatile, low-power, deeply-scalable, CMOS-compatible transistor technology, and proposes a new 3-transistor/cell symmetric nonvolatile memory (SymNVM). With ~1.67x higher density as compared with the prior FeFET design, significant benefits in energy and latency have been achieved, as evaluated and discussed in depth in this paper.
- William Simon
- Juan Galicia
- Alexandre Levisse
- Marina Zapater
- David Atienza
As the computational complexity of applications on the consumer market, such as high-definition video encoding and deep neural networks, becomes ever more demanding, novel ways to efficiently compute data-intensive workloads are being explored. In this context, In-Memory Computing (IMC) solutions, and particularly bitline computing in SRAM, appear promising as they mitigate one of the most energy-consuming aspects of computation: data movement. While IMC architectural-level characteristics have been defined by the research community, only a few works so far have explored the implementation of such memories at a low level. Furthermore, these proposed solutions are either slow (<1GHz), area hungry (10T SRAM), or suffer from read disturb and corruption issues. Overall, there is no extensive design study considering realistic assumptions at the circuit level. In this work we propose a fast (up to 2.2GHz), 6T SRAM-based, reliable (no read disturb issues), and wide voltage range (from 0.6 to 1V) IMC architecture using local bitlines. Beyond standard read and write, the proposed architecture can perform copy, addition and shift operations at the array level. As addition is the slowest operation, we propose a modified carry chain adder, providing a 2× carry propagation improvement. The proposed architecture is validated using a 28nm bulk high-performance technology PDK with CMOS variability and post-layout simulations. High-density SRAM bitcells (0.127μm) enable an area efficiency of 59.7% for a 256×128 array, on par with current industrial standards.
- Sungju Ryu
- Hyungjun Kim
- Wooseok Yi
- Jae-Joon Kim
Deep Neural Networks (DNNs) have various performance requirements and power constraints depending on applications. To maximize the energy-efficiency of hardware accelerators for different applications, the accelerators need to support various bit-width configurations. When designing bit-reconfigurable accelerators, each PE must have variable shift-addition logic, which takes a large amount of area and power. This paper introduces an area and energy efficient precision-scalable neural network accelerator (BitBlade), which reduces the control overhead for variable shift-addition using bitwise summation method. The proposed BitBlade, when synthesized in a 28nm CMOS technology, showed reduction in area by 41% and in energy by 36-46% compared to the state-of-the-art precision-scalable architecture [14].
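The arithmetic reordering behind bitwise summation can be shown in a few lines: partial products of the same bit significance are summed across the whole array first, and each shift is applied only once, instead of shift-adding inside every PE. Below is a small sketch under arbitrary assumptions (2-bit slices, 4-bit operands, 64 lanes), not the paper's exact configuration:

```python
import numpy as np

def bit_slices(values, num_slices, slice_bits):
    """Split unsigned integers into little-endian slices of `slice_bits` bits."""
    mask = (1 << slice_bits) - 1
    return [(values >> (s * slice_bits)) & mask for s in range(num_slices)]

def dot_bitwise_summation(a, b, slice_bits=2, num_slices=2):
    """Compute sum_i a[i]*b[i] by summing same-significance slice products across
    the whole array first, then applying each shift exactly once."""
    a_sl = bit_slices(a, num_slices, slice_bits)
    b_sl = bit_slices(b, num_slices, slice_bits)
    total = 0
    for i in range(num_slices):
        for j in range(num_slices):
            partial_sum = int(np.sum(a_sl[i] * b_sl[j]))     # one adder tree, no per-lane shifter
            total += partial_sum << ((i + j) * slice_bits)   # single shared shift per significance
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(0, 16, size=64)   # 4-bit operands = two 2-bit slices
    b = rng.integers(0, 16, size=64)
    assert dot_bitwise_summation(a, b) == int(np.dot(a, b))
    print("bitwise-summation dot product matches:", dot_bitwise_summation(a, b))
```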
- Gunhee Lee
- Hanmin Park
- Namhyung Kim
- Joonsang Yu
- Sujeong Jo
- Kiyoung Choi
The training process of a deep neural network commonly consists of three phases: forward propagation, backward propagation, and weight update. In this paper, we propose a hardware architecture to accelerate the backward propagation. Our approach applies to neural networks that use rectified linear unit. Considering that the backward propagation results in a zero activation gradient when the corresponding activation is zero, we can safely skip the gradient calculation. Based on this observation, we design an efficient hardware accelerator for training deep neural networks by selectively computing gradients. We show the effectiveness of our approach through experiments with various network models.
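The underlying observation is just the ReLU backward rule: where the forward activation is zero, the activation gradient is zero, so the corresponding gradient work can be skipped. A brief sketch of that selective computation follows (the row-level skipping granularity is an illustrative choice, not the paper's dataflow):

```python
import numpy as np

def relu_layer_backward_selective(weight, x, activation, grad_out):
    """Backward pass of y = relu(W @ x), skipping rows whose forward activation is
    zero. Those rows contribute nothing to grad_x or grad_W, so their work is elided."""
    grad_pre = grad_out * (activation > 0)
    active = np.flatnonzero(grad_pre)          # rows that actually need gradient work
    grad_x = np.zeros_like(x)
    grad_w = np.zeros_like(weight)
    for i in active:
        grad_x += grad_pre[i] * weight[i]      # accumulate dL/dx from active rows only
        grad_w[i] = grad_pre[i] * x            # dL/dW row; skipped rows stay zero
    return grad_x, grad_w, len(active) / len(activation)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 128))
    x = rng.normal(size=128)
    act = np.maximum(W @ x, 0.0)
    grad_out = rng.normal(size=256)
    gx, gw, density = relu_layer_backward_selective(W, x, act, grad_out)
    # Reference dense backward pass for comparison
    gp = grad_out * (act > 0)
    assert np.allclose(gx, W.T @ gp) and np.allclose(gw, np.outer(gp, x))
    print(f"fraction of rows that needed gradient work: {density:.2f}")
```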
- Matthew Sotoudeh
- Sara S. Baghsorkhi
Existing approaches to neural network compression have failed to holistically address algorithmic (training accuracy) and computational (inference performance) demands of real-world systems, particularly on resource-constrained devices. We present C3-Flow, a new approach adding non-uniformity to low-rank approximations and designed specifically to enable highly-efficient computation on common hardware architectures while retaining more accuracy than competing methods. Evaluation on two state-of-the-art acoustic models (versus existing work, empirical limit study approaches, and hand-tuned models) demonstrates up to 60% lower error. Finally, we show that our co-design approach achieves up to 14X inference speedup across three Haswell- and Broadwell-based platforms.
- Dong Wang
- Ke Xu
- Qun Jia
- Soheil Ghiasi
Hardware accelerators for convolutional neural network (CNN) inference have been extensively studied in recent years. The reported designs tend to utilize a similar underlying architecture based on multiplier-accumulator (MAC) arrays, which has the practical consequence of limiting the FPGA-based accelerator performance by the number of available on-chip DSP blocks, while leaving other resources under-utilized. To address this problem, we consider a transformation to the convolution computation, which leads to a transformation of the accelerator design space and relaxes the pressure on the required DSP resources. We demonstrate that our approach enables us to strike a judicious balance between utilization of the on-chip memory, logic, and DSP resources, due to which our accelerator considerably outperforms the state of the art. We report the effectiveness of our approach on a Stratix-V GXA7 FPGA, which shows 55% throughput improvement, while using 6.25% fewer DSP blocks, compared to the best reported CNN accelerator on the same device.
- Angshuman Karmakar
- Sujoy Sinha Roy
- Frederik Vercauteren
- Ingrid Verbauwhede
Sampling from a discrete Gaussian distribution has applications in lattice-based post-quantum cryptography. Several efficient solutions have been proposed in recent years. However, making a Gaussian sampler secure against timing attacks turned out to be a challenging research problem. In this work, we present a toolchain to instantiate an efficient constant-time discrete Gaussian sampler of arbitrary standard deviation and precision. We observe an interesting property of the mapping from input random bit strings to samples during a Knuth-Yao sampling algorithm and propose an efficient way of minimizing the Boolean expressions for the mapping. Our minimization approach results in up to 37% faster discrete Gaussian sampling compared to the previous work. Finally, we apply our optimized and secure Gaussian sampler in the lattice-based digital signature algorithm Falcon, which is a NIST submission, and provide experimental evidence that the overall performance of the signing algorithm degrades by at most 33% due to the additional overhead of ‘constant-time’ sampling, including the 60% overhead of random number generation. Contrary to a general belief, our results indirectly show that the use of discrete Gaussian samples in digital signature algorithms would be beneficial.
- Hadi Mardani Kamali
- Kimia Zamiri Azar
- Houman Homayoun
- Avesta Sasan
In this paper, we propose a novel SAT-resistant logic-locking technique, denoted as Full-Lock, to obfuscate and protect the hardware against threats including IP piracy and reverse engineering. Full-Lock is constructed using a set of small-size fully Programmable Logic and Routing block (PLR) networks. The PLRs are SAT-hard instances with reasonable power, performance and area overheads, which are used to obfuscate (1) the routing of a group of selected wires and (2) the logic of the gates leading to and proceeding from the selected wires. Full-Lock resists removal attacks and breaks a SAT attack by significantly increasing the complexity of each SAT iteration.
- Rajit Karmakar
- Suman Sekhar Jana
- Santanu Chattopadhyay
A popular countermeasure against IP piracy relies on obfuscating the Finite State Machine (FSM), which is assumed to be the heart of a digital system. In this paper, we propose to use a special class of non-group additive cellular automata (CA) called D1 * CA, and its counterpart D1 * CAdual, to obfuscate each state transition of an FSM. The synthesized FSM exhibits correct state transitions only for a correct key, which is a designer’s secret. The proposed easily testable key-controlled FSM synthesis scheme can thwart reverse engineering attacks, and thus offers IP protection.
- Jie Xu
- Dan Feng
- Yu Hua
- Fangting Huang
- Wen Zhou
- Wei Tong
- Jingning Liu
Non-volatile memories (NVMs) are vulnerable to serious threats due to endurance variation. We identify a new type of malicious attack, called Uniform Address Attack (UAA), which performs uniform and sequential writes to each line of the whole memory, and wears out the weaker lines (lines with lower endurance) early. Experimental results show that the lifetime of NVMs under UAA is reduced to 4.1% of the ideal lifetime. To address such attacks, we propose a spare-line replacement scheme called Max-WE (Maximize the Weak lines’ Endurance). By employing weak-priority and weak-strong-matching strategies for spare-line allocation, Max-WE is able to maximize the number of writes that the weakest lines can endure. Furthermore, Max-WE reduces the storage overhead of the mapping table by 85% through adopting a hybrid spare-line mapping scheme. Experimental results show that Max-WE can improve the lifetime by 9.5X with spare-line and mapping overheads of 10% and 0.016% of the total space, respectively.
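A simplified sketch of the weak-priority and weak-strong-matching idea: the weakest data lines are replaced first, and each one is backed by the strongest remaining spare line. The endurance values and the greedy pairing below are illustrative; the actual Max-WE scheme also manages the hybrid spare-line mapping table:

```python
def allocate_spare_lines(line_endurance, spare_endurance, budget):
    """Greedy weak-priority, weak-strong matching: replace the weakest data lines
    first and back each one with the strongest remaining spare line."""
    weakest_first = sorted(range(len(line_endurance)), key=lambda i: line_endurance[i])
    strongest_first = sorted(range(len(spare_endurance)),
                             key=lambda j: spare_endurance[j], reverse=True)
    mapping = {}
    for line, spare in zip(weakest_first[:budget], strongest_first[:budget]):
        mapping[line] = spare
    return mapping

if __name__ == "__main__":
    lines = [900, 120, 450, 80, 700, 300]   # remaining writes each data line can endure
    spares = [1000, 650, 820]               # endurance of the available spare lines
    print(allocate_spare_lines(lines, spares, budget=3))
    # -> weakest lines (indices 3, 1, 5) mapped to strongest spares (indices 0, 2, 1)
```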
- Daniel Casini
- Alessandro Biondi
- Giorgio Buttazzo
Although several works in the literature have targeted predictable execution models for parallel tasks, limited attention has been devoted to studying how specific implementation techniques may affect their execution. This paper highlights some issues that can arise when executing parallel tasks with thread pools, which may lead to deadlocks and performance degradation when adopting blocking synchronization mechanisms. A new parallel task model, inspired by a realistic design found in popular software systems, is first presented to study this problem. Then, formal conditions to ensure the absence of deadlocks and schedulability analysis techniques are proposed under both global and partitioned scheduling.
- Xu Jiang
- Nan Guan
- Weichen Liu
- Maolin Yang
This paper for the first time studies the scheduling and analysis of parallel real-time tasks with semaphores. In parallel task systems, each task may issue multiple requests to a semaphore, which raises new challenges to the design and analysis problems. We propose a new locking protocol LPP that limits the maximal number of requests to a semaphore by a task that can block other tasks at any time. We develop analysis techniques to safely bound the task response times, with which we prove that the best real-time performance is achieved if only one request to a semaphore by a task is allowed to block other tasks at a time. Experiments under different parameter settings are conducted to compare our proposed protocol and analysis techniques with the state-of-the-art spinlock protocol and analysis techniques for parallel real-time tasks.
- Jinghao Sun
- Nan Guan
- Xiaoqing Wang
- Chenhan Jin
- Yaoyao Chi
Synchronous parallel tasks are widely used in HPC to pursue high average performance, but little consideration is given to how to guarantee good timing predictability. OpenMP is a promising framework for multi-core real-time embedded systems. Synchronous OpenMP tasks are significantly more difficult to schedule and analyze due to constraints posed by the OpenMP specifications. An important OpenMP feature is the tied task, which must execute on the same thread during its whole life cycle. This paper designs a novel method, called group scheduling, to schedule synchronous OpenMP tasks, which divides tasks into several groups and assigns some of them to dedicated cores in order to isolate tied tasks. We derive a linear-time computable response time bound. Experiments with both randomly generated and realistic OpenMP tasks show that our new bound significantly outperforms the existing bound.
- Tomás Picornell
- José Flich
- Carles Hernández
- José Duato
The adoption of many-cores in safety-critical systems requires real-time capable networks on chip (NoC). In this paper we propose a new time-predictable NoC design paradigm where contention within the network is eliminated. This new paradigm builds on the Channel Dependency Graph (CDG) and guarantees by design the absence of contention. Our delayed conflict-free NoC (DCFNoC) is able to naturally inject messages using a TDM period equal to the optimal theoretical bound and without the need for a computationally demanding offline process. Results show that DCFNoC guarantees time predictability with very low implementation cost.
- Artur Mrowca
- Martin Nocker
- Sebastian Steinhorst
- Stephan Günnemann
Verification is essential to prevent malfunctioning of software systems. Model checking makes it possible to verify conformity with nominal behavior. As manual definition of specifications for such systems becomes infeasible, automated techniques to mine specifications from data become increasingly important. Existing approaches produce specifications of limited length, do not segregate functions, and do not easily allow the inclusion of expert input. We present BaySpec, a dynamic mining approach to extract temporal specifications from Bayesian models, which represent behavioral patterns. This allows learning specifications of arbitrary length from imperfect traces. Within this framework we introduce a novel extraction algorithm that, for the first time, mines LTL specifications from such models.
- Shuangnan Liu
- Francis CM Lau
- Benjamin Carrion Schafer
One of the advantages of High-Level Synthesis (HLS), also called C-based VLSI-design, over traditional RT-level VLSI design flows, is that multiple micro-architectures of unique area vs. performance can be automatically generated by setting different synthesis options, typically in the form of synthesis directives specified as pragmas in the source code. This design space exploration (DSE) is very time-consuming and can easily take multiple days for complex designs. At the same time, and because of the complexity in designing large ASICs, verification teams now routinely make use of emulation and prototyping to test the circuit before the silicon is taped out. This also allows the embedded software designers to start their work earlier in the design process and thus, further reducing the Turn-Around-Times (TAT). In this work, we present a method to automatically re-optimize ASIC designs specified as behavioral descriptions for HLS to FPGAs for emulation and prototyping, based on the observation that synthesis directives that lead to efficient micro-architectures for ASICs, do not directly translate into optimal micro-architectures in FPGAs. This implies that the HLS DSE process would have to be completely repeated for the target FPGA. To avoid this, this work presents a predictive model-based method that takes as inputs the results of an ASIC HLS DSE and automatically, without the need to re-explore the behavioral description, finds the Pareto-optimal micro-architectures for the target FPGA. Experimental results comparing our predictive-model based method vs. completely re-exploring the search space show that our proposed method works well.
- Ming Hu
- Tongquan Wei
- Min Zhang
- Frédéric Mallet
- Mingsong Chen
The Clock Constraint Specification Language (CCSL) has been widely investigated in verifying causal and temporal timing behaviors of real-time embedded systems. However, due to limited expertise in formal modeling, it is difficult for requirement engineers to completely and accurately derive CCSL specifications from natural language-based design descriptions. To address this problem, we present a novel approach that facilitates automated synthesis of CCSL specifications under the guidance of sampled (expected) timing behaviors of target systems. By encoding sampled behaviors and incomplete CCSL constraints provided by requirement engineers using our proposed transformation templates, the CCSL specification synthesis problem can be naturally converted into a SKETCH synthesis problem, which enables the automated generation of CCSL specifications with high accuracy. Experiments on both well-known benchmarks and synthetic examples demonstrate the effectiveness and scalability of our approach.
- Neetu Jindal
- Sandeep Chandran
- Preeti Ranjan Panda
- Sanjiva Prasad
- Abhay Mitra
- Kunal Singhal
- Shubham Gupta
- Shikhar Tuli
Runtime verification employs dedicated hardware or software monitors to check whether program properties hold at runtime. However, these monitors often incur high area and performance overheads depending on whether they are implemented in hardware or software. In this work, we propose DHOOM, an architectural framework for runtime monitoring of program assertions, which exploits the combination of a reconfigurable fabric present alongside a processor core with the vestigial on-chip Design-for-Debug hardware. This combination of hardware features allows DHOOM to minimize the overall performance overhead of runtime verification, even when subject to a given area constraint. We present an algorithm for dynamically selecting an effective subset of assertion monitors that can be accommodated in the available programmable fabric, while instrumenting the remaining assertions in software. We show that our proposed strategy, while respecting area constraints, reduces the performance overhead of runtime verification by up to 32% when compared with a baseline of software-only monitors.
- Dylan Stow
- Itir Akgun
- Wenqin Huangfu
- Yuan Xie
- Xueqi Li
- Gabriel H. Loh
Emerging Monolithic Three-Dimensional (M3D) integration technology will not only provide improved circuit density through the high-bandwidth coupling of multiple vertically-stacked layers, but it can also provide new architectural opportunities for on-chip computation, memory, and communication that are beyond the capabilities of existing process and packaging technologies. For example, with massive parallel communication between heterogeneous memory and compute layers, existing processing-in-memory architectures can be optimized and expanded, developing into efficient and flexible near-data processors. Additionally, multiple tiers of interconnect can be dynamically leveraged to provide an efficient, scalable interconnect fabric that spans the three-dimensional system. This work explores some of the challenges and opportunities presented by M3D technology for emerging computer architectures, with focus on improving efficiency and increasing system flexibility.
- Heechun Park
- Kyungwook Chang
- Bon Woong Ku
- Jinwoo Kim
- Edward Lee
- Daehyun Kim
- Arjun Chaudhuri
- Sanmitra Banerjee
- Saibal Mukhopadhyay
- Krishnendu Chakrabarty
- Sung Kyu Lim
Monolithic 3D IC overcomes the limitation of the existing through-silicon-via (TSV) based 3D IC by providing denser vertical connections with nano-scale inter-layer vias (ILVs). In this paper, we demonstrate a thorough RTL-to-GDS design flow for monolithic 3D IC, which is based on commercial 2D place-and-route (P&R) tools and clever ways to extend them to handle 3D IC designs and simulations. We also provide a low-cost built-in-self-test (BIST) method to detect various faults that can occur on ILVs. Lastly, we present a resistive random access memory (ReRAM) compiler that generates memory modules that are to be integrated in monolithic 3D ICs.
- Jiachen Mao
- Qing Yang
- Ang Li
- Hai Li
- Yiran Chen
In recent years, machine learning research has largely shifted focus from the cloud to the edge. While the resulting algorithm- and hardware-level optimizations have enabled local execution for the majority of deep neural networks (DNNs) on edge devices, the sheer magnitude of DNNs associated with real-time video detection workloads has forced them to remain relegated to remote execution in the cloud. This is problematic when combined with the strict latency requirements coupled with these workloads, and it imposes a unique set of challenges not directly addressed in prior works. In this work, we design MobiEye, a cloud-based video detection system optimized for deployment in real-time mobile applications. MobiEye is able to achieve up to a 32% reduction in latency when compared to a conventional implementation of a video detection system with only a marginal reduction in accuracy.
- Shuo-Han Chen
- Ming-Chang Yang
- Yuan-Hao Chang
- Chun-Feng Wu
Existing secure deletion approaches are inefficient in erasing data permanently because file systems have no knowledge of the data layout on the storage device, nor is the storage device aware of file information within the file systems. This inefficiency is exaggerated on the emerging shingled magnetic recording (SMR) drive due to its inherent sequential-write constraint. On SMR drives, secure deletion requests may lead to serious write amplification and performance degradation if the data layout is not properly configured. Such observation motivates us to propose a file-oriented fast secure deletion (FFSD) strategy to alleviate the negative impacts of SMR drives’ sequential-write constraint and improve the efficiency of secure deletion operations on SMR drives. A series of experiments was conducted to demonstrate the capability of the proposed strategy on improving the efficiency of secure deletion on SMR drives.
- Wei-Ming Chen
- Pi-Cheng Hsiu
- Tei-Wei Kuo
Self-powered intermittent systems enable accumulative execution in unstable power environments, where checkpointing is often adopted as a means to achieve data consistency and system recovery under power failures. However, existing approaches based on the checkpointing paradigm normally require system suspension and/or logging at runtime. This paper presents a design which enables failure-resilient intermittently-powered systems without runtime checkpointing. Our design enforces the consistency and serializability of concurrent task execution while maximizing computation progress, as well as allows instant system recovery after power resumption, by leveraging the characteristics of data accessed in hybrid memory. We integrated the design into FreeRTOS running on a Texas Instruments device. Experimental results show that our design achieves up to 11.8 times the computation progress achieved by checkpointing-based approaches, while reducing the recovery time by nearly 90%.
- Tinghuan Chen
- Bingqing Lin
- Hao Geng
- Bei Yu
Sensor drift is an intractable obstacle to practical temperature measurement in smart buildings. In this paper, we propose a sensor spatial correlation model. Given prior knowledge, maximum a posteriori (MAP) estimation is performed to calibrate drifts. MAP is formulated as a non-convex problem with three hyper-parameters. An alternating-based method is proposed to solve this non-convex formulation. Cross-validation and expectation-maximization with Gibbs sampling are further used to determine the hyper-parameters. Experimental results on benchmarks from the simulator EnergyPlus show that, compared with the state-of-the-art method, the proposed framework can achieve robust drift calibration and a better trade-off between accuracy and runtime.
- Erick Carvajal Barboza
- Nishchal Shukla
- Yiran Chen
- Jiang Hu
Optimizations at the placement stage need to be guided by timing estimation prior to routing. To handle timing uncertainty due to the lack of routing information, people tend to make very pessimistic predictions so that the performance specification can be ensured in the worst case. Such pessimism causes over-design that wastes chip resources or design effort. In this work, a machine learning-based pre-routing timing prediction approach is introduced. Experimental results show that it can reach accuracy near post-routing sign-off analysis. Compared to a commercial pre-routing timing estimation tool, it reduces the false positive rate by about 2/3 in reporting timing violations.
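A toy sketch of the general setup: train a regressor on pre-routing features to predict post-routing slack and flag violations. The feature list, the synthetic labels, and the random-forest model below are placeholders rather than the paper's actual features or learning algorithm:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical pre-routing features per timing path: estimated wirelength, fanout,
# logic depth, and local pin density. Labels stand in for post-routing slack.
rng = np.random.default_rng(0)
n_paths = 5000
features = np.column_stack([
    rng.uniform(10, 500, n_paths),    # estimated wirelength (um)
    rng.integers(1, 20, n_paths),     # max fanout on the path
    rng.integers(2, 30, n_paths),     # logic depth
    rng.uniform(0.1, 0.9, n_paths),   # local pin density
])
# Synthetic ground truth standing in for sign-off slack after routing (ns)
slack = 2.0 - 0.003 * features[:, 0] - 0.02 * features[:, 2] + rng.normal(0, 0.05, n_paths)

X_tr, X_te, y_tr, y_te = train_test_split(features, slack, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
violations_true = y_te < 0
violations_pred = pred < 0
false_pos = np.mean(violations_pred[~violations_true])   # non-violating paths flagged
print(f"mean abs slack error: {np.mean(np.abs(pred - y_te)):.3f} ns, "
      f"false-positive rate: {false_pos:.3f}")
```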
- Wei Ye
- Mohamed Baker Alawieh
- Yibo Lin
- David Z. Pan
Lithography simulation is one of the most fundamental steps in process modeling and physical verification. Conventional simulation methods suffer from a tremendous computational cost for achieving high accuracy. Recently, machine learning was introduced to trade off between accuracy and runtime through speeding up the resist modeling stage of the simulation flow. In this work, we propose LithoGAN, an end-to-end lithography modeling framework based on a generative adversarial network (GAN), to map the input mask patterns directly to the output resist patterns. Our experimental results show that LithoGAN can predict resist patterns with high accuracy while achieving orders of magnitude speedup compared to conventional lithography simulation and previous machine learning based approach.
- Kuan-Ming Lai
- Tsung-Wei Huang
- Tsung-Yi Ho
The recent TAU 2018 contest sought novel ideas for the efficient generation of timing reports. When the timing graph is updated, users query different forms of timing reports, and these queries happen subsequently and sequentially. This process is computationally expensive and inherently complex. Therefore, we introduce in this paper a general cache framework for the efficient generation of timing critical paths. Our framework efficiently supports (1) a cache scheme to minimize duplicate calculation, (2) graph contraction to reduce the search space, and (3) multi-threading. We evaluated our framework on the TAU 2018 contest benchmarks and demonstrated promising performance over the top performer.
This paper proposes a scalable algorithmic framework for effective-resistance preserving spectral reduction of large undirected graphs. The proposed method allows computing much smaller graphs while preserving the key spectral (structural) properties of the original graph. Our framework is built upon the following three key components: a spectrum-preserving node aggregation and reduction scheme, a spectral graph sparsification framework with iterative edge weight scaling, as well as effective-resistance preserving post-scaling and iterative solution refinement schemes. By leveraging recent similarity-aware spectral sparsification method and graph-theoretic algebraic multigrid (AMG) Laplacian solver, a novel constrained stochastic gradient descent (SGD) optimization approach has been proposed for achieving truly scalable performance (nearly-linear complexity) for spectral graph reduction. We show that the resultant spectrally-reduced graphs can robustly preserve the first few nontrivial eigenvalues and eigenvectors of the original graph Laplacian and thus allow for developing highly-scalable spectral graph partitioning and circuit simulation algorithms.
- Jinsoo Jang
- Brent Byunghoon Kang
Hardware debugging facilities, such as watchpoints, have been used for software development and analysis. In this paper, we expanded the use of watchpoints as a hardware security primitive for enhancing the runtime security of mobile devices. By analyzing the watchpoints in detail, we derived useful watchpoint properties that can be exploited to build security applications. Based on our analysis, we designed example applications for hardening the OS kernel by exploiting watchpoints. The proposed applications were implemented on a Juno development board with 64-bit ARM architecture (ARMv8). Hardening the kernel by fully enabling the proposed schemes was found to impose reasonable overhead, i.e., 3% with SPEC CPU2006.
- Daniele Jahier Pagliari
- Sara Vinco
- Enrico Macii
- Massimo Poncino
Smart meters communicate to the utility provider fine-grain information about a user’s energy consumption, which could be used to infer the user’s habits and thus pose a critical privacy risk. State-of-the-art solutions try to obfuscate the readings of a meter either by using a large re-chargeable battery to filter the trace or by adding random noise to alter it. Both solutions, however, have significant drawbacks: large batteries are prohibitively expensive, whereas digitally added noise implies that the user entrusts the utility provider to protect his/her privacy.
This work proposes a hybrid approach in which zero-average noise is inserted in the power trace by means of a small energy storage device (battery or supercapacitor); the distinguishing feature of our approach is that this obfuscating device is indistinguishable from any other load and therefore it complicates by construction the load disaggregation task performed by the provider or by a malicious third party. Simulation results show that our device can achieve comparable or superior privacy enhancement as that of a solution based on a large battery and therefore with smaller cost.
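A toy simulation of the core mechanism: a small storage element adds zero-mean noise to the trace seen by the grid, injecting or absorbing power only when its state of charge permits. The capacity, noise level, and clipping policy below are arbitrary choices; the paper's policy additionally shapes the noise so that the device remains indistinguishable from an ordinary load:

```python
import numpy as np

def obfuscate_trace(load, capacity_wh, noise_std, dt_h=1/60, seed=0):
    """Perturb a household load trace with zero-mean noise supplied by a small
    storage device. Positive noise -> device discharges (grid sees less power);
    negative noise -> device charges (grid sees more)."""
    rng = np.random.default_rng(seed)
    soc = capacity_wh / 2.0                       # start half full (Wh)
    grid = np.empty_like(load, dtype=float)
    for t, p in enumerate(load):
        delta = rng.normal(0.0, noise_std)        # desired perturbation (W)
        max_discharge = min(soc / dt_h, p)        # cannot push the grid trace below zero
        max_charge = (capacity_wh - soc) / dt_h
        delta = float(np.clip(delta, -max_charge, max_discharge))
        grid[t] = p - delta
        soc -= delta * dt_h                       # update the state of charge
    return grid

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    load = 300 + 200 * rng.random(24 * 60)        # one day of per-minute load (W)
    grid = obfuscate_trace(load, capacity_wh=50.0, noise_std=80.0)
    print("mean load:", round(load.mean(), 1), "W  mean grid trace:", round(grid.mean(), 1), "W")
```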
- Ebrahim M. Songhori
- M. Sadegh Riazi
- Siam U. Hussain
- Ahmad-Reza Sadeghi
- Farinaz Koushanfar
We present ARM2GC, a novel secure computation framework based on Yao’s Garbled Circuit (GC) protocol and the ARM processor. It allows users to develop privacy-preserving applications using standard high-level programming languages (e.g., C) and compile them using off-the-shelf ARM compilers, e.g., gcc-arm. The main enabler of this framework is the introduction of SkipGate, an algorithm that dynamically omits the communication and encryption cost of a gate when its output is independent of the private data. SkipGate greatly enhances the performance of ARM2GC by omitting costs of the gates associated with the instructions of the compiled binary, which is known by both parties involved in the computation. Our evaluation on benchmark functions demonstrates that ARM2GC outperforms the prior best solution by 156×.
- Song Bian
- Masayuki Hiromoto
- Takashi Sato
The (ring) learning with errors (RLWE/LWE) problem is one of the most promising candidates for constructing quantum-secure key exchange protocols. In this work, we design and implement specialized hardware multiplier units for both LWE and RLWE key exchange schemes to maximize their computational efficiency. By exploiting the algebraic structure with aggressive parameter sets, we show that the design and implementation of LWE key exchange on hardware is considerably easier and more flexible than RLWE. Using the proposed architectures, we show that client-side energy-efficiency of LWE-based key exchange can be on the same order, or even (slightly) better than RLWE-based schemes, making LWE an attractive option for designing post-quantum cryptographic suite.
- Jie Xu
- Dan Feng
- Yu Hua
- Wei Tong
- Jingning Liu
- Chunyan Li
- Gaoxiang Xu
- Yiran Chen
Data encoding methods have been proposed to alleviate the high write energy and limited write endurance of Non-Volatile Memories (NVMs). Although such methods are proven effective through theoretical analysis, they can become inefficient under the actual data patterns of workloads. We observe that the new cache line and the old cache line share many redundant (or unmodified) words, so the utilization ratio of the tag bits used by data encoding methods becomes very low and their efficiency decreases. To fully exploit the tag bits and reduce the bit flips of NVMs, we propose REdundant word Aware Data encoding (READ). The key idea of READ is to share the tag bits among all the words of a cache line and dynamically assign the tag bits to the modified words. The high utilization ratio of the tag bits in READ, however, leads to heavy bit flips of the tag bits themselves. To reduce these bit flips, we further propose Sequential flips Aware Encoding (SAE). SAE is based on the observation that many sequential bits of the new data and the old data are opposite; for such writes, the bit flips of the tag bits increase with the number of tag bits. SAE dynamically selects the encoding granularity that causes the minimum bit flips instead of always using the minimum encoding granularity. Experimental results show that our schemes can reduce energy consumption by 20.3%, decrease bit flips by 25.0%, and improve lifetime by 52.1%.
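For illustration, the sketch below shows a classic word-level invert-coding baseline (in the spirit of Flip-N-Write) rather than READ itself: a word is stored inverted whenever that flips fewer cells, with one tag bit per word recording the choice. READ's contribution is to share such tag bits across only the modified words of a cache line.

```python
def bit_flips(a, b, width=16):
    """Number of cells that change when overwriting word a with word b."""
    return bin((a ^ b) & ((1 << width) - 1)).count("1")

def invert_coded_write(old_word, new_word, width=16):
    """Return (stored_word, tag_bit, flips): store the word inverted if cheaper.

    One tag bit per word records whether the stored value is inverted
    (a Flip-N-Write-style baseline, not READ's shared-tag-bit scheme).
    """
    mask = (1 << width) - 1
    plain = bit_flips(old_word, new_word, width)
    inverted = bit_flips(old_word, new_word ^ mask, width)
    if inverted < plain:
        return new_word ^ mask, 1, inverted
    return new_word, 0, plain

stored, tag, flips = invert_coded_write(0x00FF, 0xFF0F)
print(hex(stored), tag, flips)  # storing inverted (0x00F0) costs 4 flips instead of 12
```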
- Farzaneh Zokaee
- Mingzhe Zhang
- Xiaochun Ye
- Dongrui Fan
- Lei Jiang
3D vertical ReRAM (3DV-ReRAM) emerges as one of the most promising alternatives to DRAM due to its good scalability beyond 10nm. Monolithic 3D (M3D) integration enables 3DV-ReRAM to improve its array area efficiency by stacking peripheral circuits underneath an array. A 3DV-ReRAM array has to be large enough to fully cover the peripheral circuits, but such a large array size significantly increases its access latency. In this paper, we propose Magma, an M3D stacked heterogeneous ReRAM array architecture for future main memory systems, built by stacking a large unipolar 3DV-ReRAM array on top of a small bipolar 3DV-ReRAM array and peripheral circuits shared by the two arrays. We further architect the small bipolar array as a direct-mapped cache for the main memory system. Compared to homogeneous ReRAMs, on average, Magma improves the system performance by 11.4%, reduces the system energy by 24.3% and obtains a > 5-year lifetime.
- Xianzhang Chen
- Zhuge Qingfeng
- Qiang Sun
- Edwin H.-M. Sha
- Shouzhen Gu
- Chaoshu Yang
- Chun Jason Xue
Emerging non-volatile memories (NVMs) are promising main memory candidates due to their advanced characteristics. However, the low endurance of NVM cells makes them vulnerable to frequent fine-grained updates. This paper proposes a Wear-leveling Aware Fine-grained Allocator (WAFA) for NVM. WAFA divides pages into basic memory units to support fine-grained updates. WAFA allocates the basic memory units of a page in a rotational manner to distribute fine-grained updates evenly over memory cells. The fragmented basic memory units of each page, caused by memory allocation and deallocation operations, are reorganized by a reform operation. We implement WAFA in Linux kernel 4.4.4. Experimental results show that WAFA can reduce the total writes of pages by 81.1% and 40.1% over NVMalloc and nvm_alloc, respectively, two state-of-the-art wear-conscious allocators for NVM. Meanwhile, WAFA shows 48.6% and 42.3% performance improvement over NVMalloc and nvm_alloc, respectively.
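The rotational idea can be sketched in a few lines of Python; the unit count and bookkeeping below are illustrative placeholders, not WAFA's kernel data structures.

```python
class RotationalPage:
    """Toy page that allocates fixed-size basic memory units rotationally."""

    def __init__(self, num_units=8):
        self.num_units = num_units
        self.free = set(range(num_units))
        self.next_start = 0  # rotating starting point for the next search

    def alloc(self):
        # Scan from the rotating offset so successive allocations are spread
        # over all units instead of always reusing the lowest-numbered ones.
        for i in range(self.num_units):
            unit = (self.next_start + i) % self.num_units
            if unit in self.free:
                self.free.remove(unit)
                self.next_start = (unit + 1) % self.num_units
                return unit
        return None  # page full

    def dealloc(self, unit):
        self.free.add(unit)

page = RotationalPage()
first_round = [page.alloc() for _ in range(4)]
for u in first_round:
    page.dealloc(u)
second_round = [page.alloc() for _ in range(4)]
print(first_round, second_round)  # [0, 1, 2, 3] then [4, 5, 6, 7]: wear is spread out
```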
- Yibo Lin
- Shounak Dhar
- Wuxi Li
- Haoxing Ren
- Brucek Khailany
- David Z. Pan
Placement for very-large-scale integrated (VLSI) circuits is one of the most important steps for design closure. This paper proposes DREAMPlace, a novel GPU-accelerated placement framework, by casting the analytical placement problem equivalently as training a neural network. Implemented on top of the widely-adopted deep learning toolkit PyTorch, with customized key kernels for wirelength and density computations, DREAMPlace achieves over 30× speedup in global placement without quality degradation compared to the state-of-the-art multi-threaded placer RePlAce. We believe this work will open up new directions for revisiting classical EDA problems with advancements in AI hardware and software.
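The casting of placement as training can be illustrated with a few lines of PyTorch: cell coordinates become trainable parameters and a differentiable wirelength becomes the loss. The toy netlist below uses a plain log-sum-exp wirelength and omits DREAMPlace's density term and custom CUDA kernels, so it is only a sketch of the mechanism.

```python
import torch

# Toy netlist: 4 movable cells, 3 nets listing the cells they connect (assumed).
nets = [[0, 1], [1, 2, 3], [0, 3]]
pos = torch.nn.Parameter(torch.rand(4, 2) * 100)  # (x, y) coordinates per cell
gamma = 4.0  # smoothing parameter of the log-sum-exp wirelength

def lse_wirelength(pos):
    """Differentiable approximation of half-perimeter wirelength."""
    total = 0.0
    for net in nets:
        p = pos[net]                      # coordinates of the net's pins
        for d in range(2):                # x and y directions independently
            total = total + gamma * (torch.logsumexp(p[:, d] / gamma, dim=0)
                                     + torch.logsumexp(-p[:, d] / gamma, dim=0))
    return total

opt = torch.optim.Adam([pos], lr=1.0)     # a placement iteration is a training step
for step in range(200):
    opt.zero_grad()
    loss = lse_wirelength(pos)
    loss.backward()
    opt.step()
print(float(lse_wirelength(pos)))
```

Without a density (overlap) term the cells simply cluster together; in the full framework that term is what turns the optimization into a legal placement.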
- Fan-Keng Sun
- Yao-Wen Chang
The analytical formulation has been shown to be the most effective for circuit placement. A key ingredient of analytical placement is its wirelength model, which needs to be differentiable and can accurately approximate a golden wirelength model such as half-perimeter wirelength. Existing wirelength models derive gradient from differentiating smooth maximum (minimum) functions, such as the log-sum-exp and weighted-average models. In this paper, we propose a novel bivariate gradient-based wirelength model, namely BiG, which directly derives a gradient with any bivariate smooth maximum (minimum) function without any differentiation. Our wirelength model can effectively combine the advantages of both multivariate and bivariate functions. Experimental results show that our BiG model effectively and efficiently improves placement solutions.
- Jai-Ming Lin
- Szu-Ting Li
- Yi-Ting Wang
Mixed-size placement has become a great challenge in modern VLSI design. To handle this problem, the three-stage mixed-size placement methodology is considered the most suitable approach for a commercial design flow, with placement prototyping being its most important stage. Since standard cells and macros have to be considered simultaneously in this stage, it is more complicated than the other two stages. To reduce complexity and improve design quality, this paper applies a multilevel framework with a design hierarchy-guided clustering scheme that produces a better coarsening result and thus improves the outcome of the following stages. We propose an efficient and effective clustering scheme to group standard cells and macros based on the tree built from their design hierarchies. More importantly, our clustering algorithm considers indirect connectivity between macros, which is ignored by previous works. Moreover, we propose a new overlapping bounding box constraint to avoid clustering improper macros which have connections to fixed pins. The experimental results show that wirelength and routability are improved by our methodology.
- Yih-Lang Li
- Shih-Ting Lin
- Shinichi Nishizawa
- Hong-Yan Su
- Ming-Jie Fong
- Oscar Chen
- Hidetoshi Onodera
For the 7nm technology node, cell placement with drain-to-drain abutment (DDA) requires additional filler cells, increasing placement area. This is the first work to fully automatically synthesize a DDA-aware cell library with an optimized number of drains on cell boundaries based on the ASAP 7nm PDK. We propose a DDA-aware dynamic programming based transistor placement. Previous works ignore the use of the M0 layer in cell routing. We first propose an ILP-based M0 routing planning. With M0 routing, the congestion of M1 routing can be reduced and pin accessibility can be improved due to the diminished use of M2 routing. To improve routing resource utilization, we propose an implicitly adjustable grid map, enabling the maze routing to explore more routing solutions. Experimental results show that block placement using the DDA-aware cell library requires 70.9% fewer filler cells than that using a traditional cell library, which achieves a block area reduction of 5.7%.
- Miloš Grujić
- Vladimir Rožić
- David Johnston
- John Kelsey
- Ingrid Verbauwhede
The generation of high quality true random numbers is essential in security applications. For secure communication, we also require high quality true random number generators (TRNGs) in embedded and IoT devices. This paper provides insights into modern TRNG design principles and their evaluation, based on the requirements of standards and on design experience. We illustrate our approach with a case study of a recently proposed delay chain based TRNG.
- Gai Liu
- Joseph Primmer
- Zhiru Zhang
The increasing popularity of compute acceleration for emerging domains such as artificial intelligence and computer vision has led to the growing need for domain-specific accelerators, often implemented as specialized processors that execute a set of domain-optimized instructions. The ability to rapidly explore (1) various possibilities of the customized instruction set, and (2) its corresponding micro-architectural features is critical to achieve the best quality-of-results (QoRs). However, this ability is frequently hindered by the manual design process at the register transfer level (RTL). Such an RTL-based methodology is often expensive and slow to react when the design specifications change at the instruction-set level and/or micro-architectural level.
We address this deficiency in domain-specific processor design with ASSIST, a behavior-level synthesis framework for RISC-V processors. From an untimed functional instruction set description, ASSIST generates a spectrum of RISC-V processors implementing varying micro-architectural design choices, which enables effective tradeoffs between different QoR metrics. We demonstrate the automatic synthesis of more than 60 in-order processor implementations with varying pipeline structures from the RISC-V 32I instruction set, some of which dominate the manually optimized counterparts in the area-performance Pareto frontier. In addition, we propose an autotuning-based approach for optimizing the implementations under a given performance constraint and the technology target. We further present case studies of synthesizing various custom instruction extensions and customized instruction sets for cryptography and machine learning applications.
- Vojtech Mrazek
- Muhammad Abdullah Hanif
- Zdenek Vasicek
- Lukas Sekanina
- Muhammad Shafique
Approximate computing is an emerging paradigm for developing highly energy-efficient computing systems such as various accelerators. In the literature, many libraries of elementary approximate circuits have already been proposed to simplify the design process of approximate accelerators. Because these libraries contain from tens to thousands of approximate implementations for a single arithmetic operation, it is intractable to find an optimal combination of approximate circuits in the library even for an application consisting of a few operations. An open problem is “how to effectively combine circuits from these libraries to construct complex approximate accelerators”. This paper proposes a novel methodology for searching, selecting and combining the most suitable approximate circuits from a set of available libraries to generate an approximate accelerator for a given application. To enable fast design space generation and exploration, the methodology utilizes machine learning techniques to create computational models estimating the overall quality of processing and hardware cost without performing full synthesis at the accelerator level. Using the methodology, we construct hundreds of approximate accelerators (for a Sobel edge detector) showing different but relevant tradeoffs between the quality of processing and hardware cost and identify a corresponding Pareto-frontier. Furthermore, when searching for approximate implementations of a generic Gaussian filter consisting of 17 arithmetic operations, the proposed approach allows us to identify approximately 10^3 highly relevant implementations from 10^23 possible solutions in a few hours, while an exhaustive search would take four months on a high-end processor.
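The last step of such a flow, keeping only the accelerator configurations that are not dominated in (quality loss, hardware cost), reduces to Pareto-front extraction; a minimal sketch is below, with placeholder numbers standing in for the ML-estimated metrics.

```python
def pareto_front(candidates):
    """Return candidates not dominated in (error, cost); lower is better for both."""
    front = []
    for name, err, cost in candidates:
        dominated = any(e <= err and c <= cost and (e < err or c < cost)
                        for _, e, c in candidates)
        if not dominated:
            front.append((name, err, cost))
    return sorted(front, key=lambda t: t[1])

# Placeholder (accelerator config, estimated error, estimated energy) tuples.
configs = [("acc_a", 0.02, 9.1), ("acc_b", 0.05, 6.0),
           ("acc_c", 0.06, 8.5), ("acc_d", 0.10, 5.9)]
print(pareto_front(configs))  # acc_c is dominated by acc_b and dropped
```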
Non-stencil kernels with irregular memory access patterns pose unique challenges to achieving high computing performance and hardware efficiency in FPGA high-level synthesis. We present a highly versatile and systematic approach, termed Graph-Morphing, to constructing a reconfigurable computing engine specifically optimized for non-stencil kernel computing. Graph-Morphing achieves significant performance improvement by fragmenting operations across loop iterations and subsequently rescheduling computation and data to maximize overall performance. In experiments, Graph-Morphing achieves a 2-13× performance improvement, albeit with significantly more hardware usage. For accelerating non-stencil kernel computing, Graph-Morphing proposes a new research direction.
- Xuechao Wei
- Yun Liang
- Jason Cong
Deep Neural Networks (DNNs) are becoming more and more complex. Previous hardware accelerator designs neglect the layer diversity in terms of computation and communication behavior. On-chip memory resources are underutilized for the memory-bound layers, leading to suboptimal performance. In addition, the increasing complexity of DNN structures makes it difficult to perform on-chip memory allocation. To address these issues, we propose a layer-conscious memory management framework for FPGA-based DNN hardware accelerators. Our framework exploits the layer diversity and the disjoint lifespan information of memory buffers to efficiently utilize the on-chip memory, improving the performance of the memory-bound layers and thus the overall performance of DNNs. It consists of four key techniques working in coordination with each other. We first devise a memory allocation algorithm to allocate on-chip buffers for the memory-bound layers. In addition, buffer sharing between different layers is applied to improve on-chip memory utilization. Finally, buffer prefetching and splitting are used to further reduce latency. Experiments show that our techniques can achieve 1.36X performance improvement compared with previous designs.
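The buffer-sharing step, in which two layers may reuse the same on-chip buffer whenever the lifespans of their data do not overlap, is essentially interval packing; a greedy sketch with made-up layer names, sizes, and lifespans is shown below.

```python
def share_buffers(buffers):
    """Greedily pack buffers with disjoint [start, end) lifespans into slots.

    buffers: list of (name, size_kb, start, end). Each physical slot is
    sized by the largest buffer ever placed in it.
    """
    slots = []  # each slot: {"members": [...], "free_at": int, "size": int}
    for name, size, start, end in sorted(buffers, key=lambda b: b[2]):
        for slot in slots:
            if start >= slot["free_at"]:          # lifespans do not overlap
                slot["members"].append(name)
                slot["free_at"] = end
                slot["size"] = max(slot["size"], size)
                break
        else:
            slots.append({"members": [name], "free_at": end, "size": size})
    return slots

# Hypothetical per-layer buffers: (name, size in KB, first use, last use).
layers = [("conv1_out", 64, 0, 2), ("conv2_out", 48, 2, 4),
          ("fc_in", 32, 1, 3), ("fc_out", 16, 4, 5)]
for slot in share_buffers(layers):
    print(slot["members"], "-> physical buffer of", slot["size"], "KB")
```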
- Marcos T. Leipnitz
- Gabriel L. Nazar
When attempting to make a design fit a set of the heterogeneous resources found in Field-Programmable Gate Arrays (FPGAs), designers using High-Level Synthesis (HLS) may resort to approximate approaches. However, current FPGA-oriented approximate HLS tools do not allow specifying constraints on heterogeneous resources such as lookup tables, flip-flops, and multipliers, being instead error-oriented. In this work, we propose a resource-oriented HLS methodology with which designers can specify heterogeneous resource constraints and satisfy them while minimizing the output error, attaining average improvements, over error-oriented approaches, of about 34% and 2.2 dB for mean-squared error and peak signal-to-noise ratio error metrics, respectively.
Loop pipelining is an important optimization in high-level synthesis to enable high-throughput pipelined execution of loop iterations. However, current pipeline scheduling approaches rely on fundamentally inexact heuristics based on ad hoc priority functions and lack guarantees of achieving the best throughput. To address this shortcoming, we propose a scheduling algorithm based on a system of integer difference constraints (SDC) and Boolean satisfiability (SAT) to exactly handle various pipeline scheduling constraints. Our techniques take advantage of conflict-driven learning and problem-specific specialization to optimally yet efficiently derive pipelining solutions. Experiments demonstrate that our approach achieves notable speedup in comparison to integer linear programming based techniques.
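The SDC half of such a formulation, constraints of the form t_v - t_u <= c, can be checked for feasibility and solved with a single-source shortest-path computation; a compact Bellman-Ford sketch is below. The SAT side, which handles the remaining pipeline constraints, is not shown.

```python
def solve_sdc(num_ops, constraints):
    """Solve difference constraints t[v] - t[u] <= c with Bellman-Ford.

    constraints: list of (u, v, c) meaning t[v] - t[u] <= c. Returns a
    feasible list of start times, or None if a negative cycle makes the
    system infeasible.
    """
    INF = float("inf")
    src = num_ops                        # virtual source connected to every op
    edges = [(src, v, 0) for v in range(num_ops)]
    edges += list(constraints)
    dist = [INF] * num_ops + [0]
    for _ in range(num_ops):             # |V| - 1 relaxation rounds
        for u, v, c in edges:
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
    if any(dist[u] + c < dist[v] for u, v, c in edges):
        return None                      # negative cycle: unsatisfiable
    base = -min(dist[:num_ops])
    return [d + base for d in dist[:num_ops]]

# t1 >= t0 + 2, t2 >= t1 + 1, and a latency bound t2 - t0 <= 4.
print(solve_sdc(3, [(1, 0, -2), (2, 1, -1), (0, 2, 4)]))  # [0, 2, 3]
```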
- Quan Deng
- Youtao Zhang
- Minxuan Zhang
- Jun Yang
PIM (Processing-in-memory)-based CNN (Convolutional neural network) accelerators leverage the characteristics of basic memory cells to enable simple logic and arithmetic operations so that the bandwidth constraint can be effectively alleviated. However, it remains a major challenge to support multiplication operations efficiently on PIM accelerators, in particular, DRAM-based PIM accelerators. This has prevented PIM-based accelerators from being immediately adopted for accurate CNN inference.
In this paper, we propose LAcc, a DRAM-based PIM accelerator that supports LUT-based (lookup table based) fast and accurate multiplication. By enabling LUT-based vector multiplication in DRAM, LAcc effectively decreases LUT size and improves its reuse. LAcc further adopts a hybrid mapping of weights and inputs to improve the hardware utilization rate. LAcc achieves 95 FPS at 5.3 W for AlexNet and a 6.3× efficiency improvement over the state of the art.
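The LUT-based multiplication idea can be mimicked in software: precompute a small table of nibble products and assemble the full product from lookups, shifts, and adds. The 4-bit decomposition below is an illustrative choice, not LAcc's actual in-DRAM data layout.

```python
# Precompute a 16x16 lookup table of 4-bit x 4-bit products.
LUT = [[a * b for b in range(16)] for a in range(16)]

def lut_mult8(x, y):
    """Multiply two 8-bit operands using only table lookups, shifts and adds."""
    xh, xl = x >> 4, x & 0xF
    yh, yl = y >> 4, y & 0xF
    return (LUT[xh][yh] << 8) + ((LUT[xh][yl] + LUT[xl][yh]) << 4) + LUT[xl][yl]

# Exhaustive check against exact multiplication for all 8-bit operand pairs.
assert all(lut_mult8(a, b) == a * b for a in range(256) for b in range(256))
print(lut_mult8(95, 53))  # 5035
```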
- Jae-San Kim
- Joon-Sung Yang
Various studies have been carried out to improve the operational efficiency of Deep Neural Networks (DNNs). However, the importance of reliability in DNNs has generally been overlooked. As the underlying semiconductor technology decreases in reliability, the probability that some components of computing devices fail also increases, preventing high accuracy in DNN operations. To achieve high accuracy, ensuring operational reliability, even if faults occur, is necessary.
In this paper, we introduce a DNN reliability improvement scheme in 3D die-stacked memory called DRIS-3, based on the correlation between the faults in weights and an accuracy loss. We analyze the fault characteristics of conventional DNN models to find the bits that cause significant accuracy loss when faults are injected into weights. On the basis of the findings, we propose a reliability improvement structure which can reduce faults on the bits that must be protected for accuracy, considering asymmetric soft error rate (SER) per layer in 3D die-stacked memory.
Experimental results show that with the proposed method, the fault tolerance is improved regardless of the type of model and the pruning applied. The fault tolerance based on bit error rate (BER) for a 1% accuracy loss is increased by up to 10^4 times over the conventional model.
- Ashish Ranjan
- Shubham Jain
- Jacob R. Stevens
- Dipankar Das
- Bharat Kaul
- Anand Raghunathan
Memory Augmented Neural Networks (MANNs) enhance a deep neural network with an external differentiable memory, enabling them to perform complex tasks well beyond the capabilities of conventional deep neural networks. We identify a unique challenge that arises in MANNs due to soft reads and writes to the differentiable memory, each of which requires access to all the memory locations. This characteristic of MANN workloads severely limits the performance of MANNs on CPUs, GPUs, and classical neural network accelerators. We present the first effort to design a hardware architecture that improves the efficiency of MANNs. Leveraging the intrinsic ability of resistive crossbars to efficiently realize in-memory computations, we propose X-MANN, a memory-centric crossbar-based architecture that is specialized to match the compute characteristics observed in MANNs. We design a transposable crossbar processing unit that can efficiently perform the different computational kernels of MANNs. To improve performance of soft writes in X-MANN, we propose an incremental write mechanism that leverages the characteristics of soft write operations. We develop an architectural simulator for X-MANN that utilizes array-level timing and power models of resistive crossbars calibrated from SPICE simulations. Across a suite of MANN benchmarks, X-MANN achieves 23.7×-45.7× speedup and 75.1×-267.1× reduction in energy over state-of-the-art GPU implementations.
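The soft read and soft write that make MANNs so memory-intensive touch every location: a read is a similarity-weighted sum over all rows, and a write updates all rows in proportion to the same weights. The numpy sketch below shows this differentiable-memory access pattern, not X-MANN's crossbar mapping.

```python
import numpy as np

def soft_weights(memory, key, beta=10.0):
    """Content-based addressing: softmax over cosine similarity to every row."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    e = np.exp(beta * sims)
    return e / e.sum()

def soft_read(memory, key):
    w = soft_weights(memory, key)
    return w @ memory                       # every memory row contributes

def soft_write(memory, key, erase, add):
    w = soft_weights(memory, key)           # every row is partially updated
    return memory * (1 - np.outer(w, erase)) + np.outer(w, add)

M = np.random.rand(128, 32)                 # 128 locations, 32-wide words
k = np.random.rand(32)
M = soft_write(M, k, erase=np.ones(32) * 0.5, add=np.random.rand(32))
print(soft_read(M, k).shape)                # (32,)
```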
- Haitong Li
- Mudit Bhargava
- Paul N. Whatmough
- H.-S. Philip Wong
Deep neural network (DNN) inference tasks have become ubiquitous workloads on mobile SoCs and demand energy-efficient hardware accelerators. Mobile DNN accelerators are heavily area-constrained, with only minimal on-chip SRAM, which results in heavy use of inefficient off-chip DRAM. With diminishing returns from conventional silicon technology scaling, emerging memory technologies that offer better area density than SRAM can boost accelerator efficiency by minimizing costly off-chip DRAM accesses. This paper presents a detailed design space exploration (DSE) of technology-system co-design for systolic-array accelerators. We focus on practical/mature on-chip memory technologies, including SRAM, eDRAM, MRAM, and 3D vertical RRAM (VRRAM). The DSE employs state-of-the-art optimizations (e.g., model compression and optimized buffer scheduling), and evaluates results on important models including ResNet-50, MobileNet, and Faster-RCNN. Compared to an SRAM/DRAM baseline, MRAM-based accelerators show up to 4.68× energy benefits (57% area overhead), while a 3D VRRAM-based design achieves 2.22× energy benefits (33% area reduction).
- Reza Hojabr
- Kamyar Givaki
- S. M. Reza Tayaranian
- Parsa Esfahanian
- Ahmad Khonsari
- Dara Rahmati
- M. Hassan Najafi
Employing convolutional neural networks (CNNs) in embedded devices calls for novel low-cost and energy-efficient CNN accelerators. Stochastic computing (SC) is a promising low-cost alternative to conventional binary implementations of CNNs. Despite the low-cost advantage, SC-based arithmetic units suffer from prohibitive execution time due to processing long bit-streams. In particular, multiplication, the main operation in convolution computation, is extremely time-consuming, which hampers employing SC methods in designing embedded CNNs.
In this work, we propose a novel architecture, called SkippyNN, that reduces the computation time of SC-based multiplications in the convolutional layers of CNNs. Each convolution in a CNN is composed of numerous multiplications where each input value is multiplied by a weight vector. Once the first product is available, the following multiplications can be performed by multiplying the input by the differences of the successive weights. Leveraging this property, we develop a differential Multiply-and-Accumulate unit, called DMAC, to reduce the time consumed by convolutions in SkippyNN. We evaluate the efficiency of SkippyNN using four modern CNNs. On average, SkippyNN offers a 1.2x speedup and 2.7x energy saving compared to the binary implementation of CNN accelerators.
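The arithmetic identity behind the DMAC is simply x·w_i = x·w_{i-1} + x·(w_i - w_{i-1}), so all but the first multiplication involve a (typically small) weight difference. A scalar Python sketch:

```python
def differential_products(x, weights):
    """Form x*w_i for each weight, computing later products incrementally.

    Only the first product uses a full-width multiply; each subsequent one
    reuses the previous product and multiplies x by the difference of
    successive weights, which is what shortens the stochastic bit-streams
    in a DMAC-style unit.
    """
    products = []
    for i, w in enumerate(weights):
        if i == 0:
            products.append(x * w)
        else:
            products.append(products[-1] + x * (w - weights[i - 1]))
    return products

x, ws = 0.8, [0.50, 0.52, 0.49, 0.51]
print(differential_products(x, ws))
print([x * w for w in ws])  # identical up to floating-point rounding
```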
- Fan Chen
- Linghao Song
- Hai Helen Li
- Yiran Chen
Generative Adversarial Networks (GANs) have recently demonstrated a great opportunity toward unsupervised learning, with the intention to mitigate the massive human effort spent on data labeling in supervised learning algorithms. A GAN combines a generative model and a discriminative model that oppose each other in an adversarial situation to refine their abilities. Existing nonvolatile memory based machine learning accelerators, however, cannot support the computational needs of GAN training. Specifically, the generator utilizes a new operator, called transposed convolution, which introduces significant resource underutilization when executed on conventional neural network accelerators, as it inserts massive zeros in its input before a convolution operation. In this work, we propose a novel computational deformation technique that synergistically optimizes the forward and backward functions in transposed convolution to eliminate the large resource underutilization. In addition, we present dedicated control units – a dataflow mapper and an operation scheduler, to support the proposed execution model with high parallelism and low energy consumption. ZARA is implemented with commodity ReRAM chips, and experimental results show that our design can improve GAN training performance by 1.6×~23× on average over CMOS-based GAN accelerators. Compared to state-of-the-art ReRAM-based accelerator designs, ZARA also provides a 1.15×~2.1× performance improvement.
- Debayan Das
- Anupam Golder
- Josef Danial
- Santosh Ghosh
- Arijit Raychowdhury
- Shreyas Sen
This article, for the first time, demonstrates a Cross-device Deep Learning Side-Channel Attack (X-DeepSCA), achieving an accuracy of > 99.9%, even in the presence of significantly higher inter-device variations compared to the inter-key variations. By augmenting traces captured from multiple devices for training and with a proper choice of hyper-parameters, the proposed 256-class Deep Neural Network (DNN) learns accurately from the power side-channel leakage of an AES-128 target encryption engine, and an N-trace (N ≤ 10) X-DeepSCA attack breaks different target devices within seconds compared to a few minutes for a correlational power analysis (CPA) attack, thereby increasing the threat surface for embedded devices significantly. Even for low-SNR scenarios, the proposed X-DeepSCA attack achieves ~ 10× lower minimum traces to disclosure (MTD) compared to a traditional CPA.
- Haocheng Li
- Satwik Patnaik
- Abhrajit Sengupta
- Haoyu Yang
- Johann Knechtel
- Bei Yu
- Evangeline F.Y. Young
- Ozgur Sinanoglu
The notion of integrated circuit split manufacturing, which delegates the front-end-of-line (FEOL) and back-end-of-line (BEOL) parts to different foundries, is to prevent overproduction, piracy of the intellectual property (IP), or targeted insertion of hardware Trojans by adversaries in the FEOL facility. In this work, we challenge the security promise of split manufacturing by formulating various layout-level placement and routing hints as vector- and image-based features. We construct a sophisticated deep neural network which can infer the missing BEOL connections with high accuracy. Compared with the publicly available network-flow attack [1], for the same set of ISCAS-85 benchmarks, we achieve 1.21× accuracy when splitting on M1 and 1.12× accuracy when splitting on M3 with less than 1% running time.
- Sayandeep Saha
- S. Nishok Kumar
- Sikhar Patranabis
- Debdeep Mukhopadhyay
- Pallab Dasgupta
Assessment of the security provided by a fault attack countermeasure is challenging, given that a protected cipher may leak the key if the countermeasure is not designed correctly. This paper proposes, for the first time, a statistical framework to detect information leakage in fault attack countermeasures. Based on the concept of non-interference, we formalize the leakage for fault attacks and provide a t-test based methodology for leakage assessment. One major strength of the proposed framework is that leakage can be detected without the complete knowledge of the countermeasure algorithm, solely by observing the faulty ciphertext distributions. Experimental evaluation over a representative set of countermeasures establishes the efficacy of the proposed methodology.
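The statistical core is a standard two-sample test: collect a ciphertext-derived statistic under two classes and flag leakage when Welch's t statistic exceeds the customary TVLA-style threshold of 4.5. The sketch below uses placeholder distributions; the fault-injection and class-partitioning details of the framework are not reproduced.

```python
import numpy as np

def welch_t(samples_a, samples_b):
    """Welch's t statistic between two sets of observations."""
    a, b = np.asarray(samples_a, float), np.asarray(samples_b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

def leaks(samples_a, samples_b, threshold=4.5):
    """Flag leakage when |t| exceeds the commonly used 4.5 threshold."""
    return abs(welch_t(samples_a, samples_b)) > threshold

rng = np.random.default_rng(0)
# Placeholder distributions standing in for a faulty-ciphertext statistic
# under two classes; a detectable mean shift violates non-interference.
class_a = rng.normal(100.0, 5.0, 10000)
class_b = rng.normal(100.4, 5.0, 10000)
print(welch_t(class_a, class_b), leaks(class_a, class_b))
```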
- Mohammad Mahmoodi
- Hussein Nili
- Shabnam Larimian
- Xinjie Guo
- Dmitri Strukov
We exploit randomness in static I-V characteristics and the reconfigurability of embedded flash memories to design a very efficient physically unclonable function. Leakage current and subthreshold slope variations, nonlinearity, nondeterministic tuning error, and sneak path current in the redesigned commercial flash memory arrays are exploited to create a unique digital fingerprint. A time-multiplexed architecture is designed to enhance the security and expand the challenge-response pair space to 10^211. Experimental results demonstrate 50.3% average uniformity, 49.99% average diffuseness, and a native <5% bit error rate. The analysis of the measured data also shows strong resilience against machine learning attacks and the possibility of extremely energy efficient, 0.56 pJ/b operation.
- Sying-Jyan Wang
- Yu-Shen Chen
- Katherine Shu-Min Li
The Physical Unclonable Function (PUF) has been proposed for the identification and authentication of devices and for cryptographic key generation. A strong PUF provides an extremely large number of device-specific challenge-response pairs (CRPs) which can be used for identification. Unfortunately, the CRP mechanism is vulnerable to modeling attacks, which use machine learning (ML) algorithms to predict PUF responses with high accuracy. Many methods have been developed to strengthen strong PUFs with complicated hardware; however, recent studies show that they are still vulnerable to attacks that leverage GPU-accelerated ML algorithms.
In this paper, we propose to deal with the problem from a different approach. With a slightly modified CRP mechanism, a PUF can provide poison data such that an accurate model of the PUF under attack cannot be built by ML algorithms. Experimental results show that the proposed method provides an effective countermeasure against modeling attacks on PUF. In addition, the proposed method is compatible with hardware strengthening schemes to provide even better protection for PUFs.
- Darshana Jayasinghe
- Aleksandar Ignjatovic
- Sri Parameswaran
Random execution time-based countermeasures against power analysis attacks have lower resource overheads than power-balancing and masking countermeasures. Previous randomization countermeasures use either a small number of clock frequencies or delays to randomize the execution. This paper presents a novel random frequency countermeasure (referred to as RFTC) that uses the dynamic reconfiguration ability of the clock managers of Field-Programmable Gate Arrays (FPGAs), such as the Xilinx Mixed-Mode Clock Manager (MMCM), which can change the frequency of operation at runtime. We show for the first time how the Advanced Encryption Standard (AES) block cipher algorithm can be executed using randomly selected clock frequencies (among thousands of frequencies carefully chosen) generated within the FPGA to mitigate power analysis attack vulnerabilities. To test the effectiveness of the proposed clock randomization, Correlation Power Analysis (CPA) attacks are performed on the collected power traces. Power analysis attacks based on preprocessing methods, such as Dynamic Time Warping (DTW), Principal Component Analysis (PCA) and Fast Fourier Transform (FFT), are also performed on the collected traces to test the effective removal of the random execution. Compared to the state of the art, where there were 83 distinct finishing times for each encryption, the method described in this paper can have more than 60,000 distinct finishing times for each encryption, making it resistant against power analysis attacks even when the traces are preprocessed; it is demonstrated to be secure up to four million traces.
- Wenqiang Zhang
- Xiaochen Peng
- Huaqiang Wu
- Bin Gao
- Hu He
- Youhui Zhang
- Shimeng Yu
- He Qian
The RRAM based neural-processing-unit (NPU) is emerging for processing general purpose machine intelligence algorithms with ultra-high energy efficiency, while the imperfections of the analog devices and cross-point arrays make practical applications more complicated. In order to improve the accuracy and robustness of the NPU, device-circuit-algorithm codesign with consideration of the underlying device and array characteristics should outperform the optimization of an individual device or algorithm. In this work, we provide a joint device-circuit-algorithm analysis and propose the corresponding design guidelines. Key innovations include: 1) an end-to-end simulator for the RRAM NPU is developed with an integrated framework from device to algorithm; 2) the complete circuit and architecture design for the RRAM NPU is provided to make the analysis much closer to a real prototype; 3) a large-scale neural network as well as other general-purpose networks are processed to study the device-circuit interaction; 4) accuracy loss from non-idealities of RRAM, such as I-V nonlinearity, noise in the analog resistance levels, voltage drop on the interconnect, and ADC/DAC precision, is evaluated for the NPU design.
- Abdullah Ash-Saki
- Mahabubul Alam
- Swaroop Ghosh
Concerted efforts by academia and industry, e.g., IBM, Google and Intel, have brought us to the era of Noisy Intermediate-Scale Quantum (NISQ) computers. Qubits, the basic elements of a quantum computer, have proven extremely susceptible to different noises. Recent experiments have exhibited spatial variations among the qubits in NISQ hardware. Therefore, conventional qubit mapping done without quality awareness results in a significant loss of fidelity for a given workload. In this paper, we analyze the effects of various noise sources on the overall fidelity of a given workload on real NISQ hardware. We also present a novel optimization technique, namely Qubit Re-allocation (QURE), to maximize the sequence fidelity of a given workload. QURE is scalable and can be applied to future large scale quantum computers. QURE can improve the fidelity of a quantum workload by up to 1.54X (1.39X on average) in simulation and up to 1.7X in a real device compared to variation-oblivious qubit allocation, without incurring any physical overhead.
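The objective QURE maximizes can be illustrated with a toy fidelity model: approximate the success probability of a mapped circuit as the product of per-gate success probabilities and compare candidate allocations. The error rates, device, and the two candidate mappings below are made-up placeholders.

```python
# Hypothetical per-physical-qubit single-gate error rates and per-link
# two-qubit (CNOT) error rates of a small NISQ device.
single_err = {0: 0.001, 1: 0.004, 2: 0.002, 3: 0.010}
cnot_err = {(0, 1): 0.02, (1, 2): 0.05, (2, 3): 0.08, (0, 2): 0.03}

def sequence_fidelity(program, mapping):
    """Product of per-gate success probabilities under a logical->physical map."""
    fid = 1.0
    for gate, qubits in program:
        phys = tuple(sorted(mapping[q] for q in qubits))
        if gate == "cx":
            fid *= 1.0 - cnot_err[phys]
        else:
            fid *= 1.0 - single_err[phys[0]]
    return fid

program = [("h", (0,)), ("cx", (0, 1)), ("cx", (1, 2)), ("h", (2,))]
naive = {0: 1, 1: 2, 2: 3}      # variation-oblivious allocation
better = {0: 0, 1: 1, 2: 2}     # allocation avoiding the weakest qubits/links
print(sequence_fidelity(program, naive), sequence_fidelity(program, better))
```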
- Robert Wille
- Lukas Burgholzer
- Alwin Zulehner
The recent progress in the physical realization of quantum computers (the first publicly available ones—IBM’s QX architectures—were launched in 2017) has motivated research on automatic methods that aid users in running quantum circuits on them. Here, certain physical constraints given by the architectures, which restrict the allowed interactions of the involved qubits, have to be satisfied. Thus far, this has been addressed by inserting SWAP and H operations. However, it remains unknown whether existing methods add a minimum number of SWAP and H operations or, if not, how far they are away from that minimum—an NP-complete problem. In this work, we address this by formulating the mapping task as a symbolic optimization problem that is solved using reasoning engines like Boolean satisfiability solvers. In this way, we not only provide a method that maps quantum circuits to IBM’s QX architectures with a minimal number of SWAP and H operations, but also show by experimental evaluation that the number of operations added by IBM’s heuristic solution exceeds the lower bound by more than 100% on average. An implementation of the proposed methodology is publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping.
- Xingyi Liu
- Keshab K. Parhi
This paper describes a novel approach to synthesize molecular reactions to compute a radial basis function (RBF) support vector machine (SVM) kernel. The approach is based on fractional coding where a variable is represented by two molecules. The synergy between fractional coding in molecular computing and stochastic logic implementations in electronic computing is key to translating known stochastic logic circuits to molecular computing. Although inspired by prior stochastic logic implementation of the RBF-SVM kernel, the proposed molecular reactions require non-obvious modifications. This paper introduces a new explicit bipolar-to-unipolar molecular converter for intermediate format conversion. Two designs are presented; one is based on the explicit and the other is based on implicit conversion from prior stochastic logic. When 5 support vectors are used, it is shown that the DNA RBF-SVM realized using the explicit format conversion has orders of magnitude less regression error than that based on implicit conversion.
- Shaahin Angizi
- Jiao Sun
- Wei Zhang
- Deliang Fan
Classified as a complex big data analytics problem, DNA short read alignment is a major sequential bottleneck for the massive amounts of data generated by next-generation sequencing platforms. With Von-Neumann computing architectures struggling to address such a computationally-expensive and memory-intensive task today, Processing-in-Memory (PIM) platforms are gaining growing interest. In this paper, an energy-efficient and parallel PIM accelerator (AlignS) is proposed to execute DNA short read alignment based on an optimized and hardware-friendly alignment algorithm. We first develop the AlignS platform, which harnesses SOT-MRAM as computational memory and transforms it into a fundamental processing unit for short read alignment. Accordingly, we present a novel, customized, highly parallel read alignment algorithm that requires only simple and parallel in-memory operations (i.e., comparisons and additions). AlignS is then optimized through a new correlated data partitioning and mapping methodology that allows local storage and processing of DNA sequences to fully exploit the algorithm-level parallelism, and to accelerate both exact and inexact matches. The device-to-architecture co-simulation results show that AlignS improves the short read alignment throughput per Watt per mm^2 by ~12× compared to an ASIC accelerator. Compared to a recent FM-index-based ReRAM platform, AlignS achieves 1.6× higher throughput per Watt.
- Xing Huang
- Tsung-Yi Ho
- Wenzhong Guo
- Bing Li
- Ulf Schlichtmann
Recent advances in continuous-flow microfluidics have enabled highly integrated lab-on-a-chip biochips. These chips can execute complex biochemical applications precisely and efficiently within a tiny area, but they require a large number of control ports and the corresponding control logic to generate the required pressure patterns for flow control, which, consequently, offsets their advantages and prevents their wide adoption. In this paper, we propose MiniControl, the first synthesis flow for continuous-flow microfluidic biochips (CFMBs) under strict constraints on control ports, incorporating high-level synthesis and physical design simultaneously, which has never been considered in previous work. With the maximum number of allowed control ports specified in advance, this synthesis flow generates a biochip architecture with high execution efficiency. Moreover, the overall cost of a CFMB can be reduced and the tradeoff between control logic and the execution efficiency of biochemical applications can be evaluated for the first time. Experimental results demonstrate that MiniControl leads to high execution efficiency and low overall platform cost, while strictly satisfying the given control port constraint.
- Ran Chen
- Wei Zhong
- Haoyu Yang
- Hao Geng
- Xuan Zeng
- Bei Yu
As the circuit feature size continuously shrinks down, hotspot detection has become a more challenging problem in modern DFM flows. Recently developed deep learning techniques have shown their advantages on hotspot detection tasks. However, existing hotspot detectors only accept small layout clips as input, with potential defects occurring at the center region of each clip, which is time-consuming and wastes computational resources when dealing with large full-chip layouts. In this paper, we develop a new end-to-end framework that can detect multiple hotspots in a large region at a time and promises better hotspot detection performance. We design a joint auto-encoder and inception module for efficient feature extraction. A two-stage classification and regression flow is proposed to efficiently locate hotspot regions roughly and conduct the final prediction with better accuracy and a lower false alarm penalty. Experimental results show that our framework enables a significant speed improvement over existing methods with higher accuracy and fewer false alarms.
- Yiyang Jiang
- Fan Yang
- Hengliang Zhu
- Bei Yu
- Dian Zhou
- Xuan Zeng
Layout hotspot detection is of great importance in the physical verification flow. Deep neural network models have been applied to hotspot detection and have achieved great success. Since layouts can be viewed as binary images, binarized neural networks are a natural fit for the hotspot detection problem. In this paper we propose a new deep learning architecture based on binarized neural networks (BNNs) to speed up the neural networks in hotspot detection. A new binarized residual neural network is carefully designed for hotspot detection. Experimental results on ICCAD 2012 Contest benchmarks show that our architecture outperforms all previous hotspot detectors in detection accuracy and has an 8x speedup over the best deep learning-based solution.
- Haoyu Yang
- Piyush Pathak
- Frank Gennari
- Ya-Chieh Lai
- Bei Yu
VLSI layout patterns provide critical resources for various design-for-manufacturability research, from early technology node development to back-end design and sign-off flows. However, a diverse layout pattern library is not always available due to the long logic-to-chip design cycle, which slows down technology node development. To address this issue, in this paper, we explore the capability of generative machine learning models to synthesize layout patterns. A transforming convolutional auto-encoder is developed to learn vector-based instantiations of squish pattern topologies. We show our framework can capture simple design rules and contributes to enlarging the existing squish topology space under certain transformations. Geometry information of each squish topology is obtained from an associated linear system derived from design rule constraints. Experiments on 7nm EUV designs show that our framework can generate diverse pattern libraries with DRC-clean patterns more effectively than a state-of-the-art industrial layout pattern generator.
- Mohamed Baker Alawieh
- Yibo Lin
- Zaiwei Zhang
- Meng Li
- Qixing Huang
- David Z. Pan
As the integrated circuits (IC) technology continues to scale, resolution enhancement techniques (RETs) are mandatory to obtain high manufacturing quality and yield. Among various RETs, sub-resolution assist feature (SRAF) generation is a key technique to improve the target pattern quality and lithographic process window. While model-based SRAF insertion techniques have demonstrated high accuracy, they usually suffer from high computational cost. Therefore, more efficient techniques that can achieve high accuracy while reducing runtime are in strong demand. In this work, we leverage the recent advancement in machine learning for image generation to tackle the SRAF insertion problem. In particular, we propose a new SRAF insertion framework, GAN-SRAF, which uses conditional generative adversarial networks (CGANs) to generate SRAFs directly for any given layout. Our proposed approach incorporates a novel layout to image encoding using multi-channel heatmaps to preserve the layout information and facilitate layout reconstruction. Our experimental results demonstrate ~14.6× reduction in runtime when compared to the previous best machine learning approach for SRAF generation, and ~144× reduction compared to model-based approach, while achieving comparable quality of results.
- Xiao Shi
- Hao Yan
- Qiancun Huang
- Jiajia Zhang
- Longxing Shi
- Lei He
“Curse of dimensionality” has become the major challenge for existing high-sigma yield analysis methods. In this paper, we develop a meta-model using Low-Rank Tensor Approximation (LRTA) to substitute expensive SPICE simulation. The polynomial degree of our LRTA model grows linearly with circuit dimension. This makes it especially promising for high-dimensional circuit problems. Our LRTA meta-model is solved efficiently with a robust greedy algorithm, and calibrated iteratively with an adaptive sampling method. Experiments on bit cell and SRAM column validate that proposed LRTA method outperforms other state-of-the-art approaches in terms of accuracy and efficiency.
- Yi-Ting Lin
- Iris Hui-Ru Jiang
Directed self-assembly (DSA) is one of the leading candidates for extending the resolution of optical lithography to sub-7nm and beyond. By incorporating DSA in multiple patterning lithography (DSA-MP), the flexibility and resolution of contact/via patterning can be further enhanced by using multiple block copolymer (BCP) materials. Prior work faces a dilemma between solution quality and efficiency and is unable to handle 2D templates. In this paper, we capture the essence of template and mask assignment in DSA-MP by a new graph model and a new problem reduction: our graph model explicitly represents spacing conflict edges and template hyperedges; thus, extra enumeration and manipulation of incompatible via grouping edges can be avoided, and arbitrary 1D/2D templates can be natively handled. We further reduce the assignment problem to exact cover, which is encoded by a sparse matrix. Our concise integer linear programming (ILP) formulation and fast backtracking heuristic achieve substantially superior solution quality and efficiency compared to the state-of-the-art work. Moreover, our method is flexible and extendible to utilize dummy vias to improve manufacturability.
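Exact cover, the reduction target, asks for a set of rows of a 0/1 matrix whose ones cover every column exactly once. A compact backtracking sketch in the spirit of Knuth's Algorithm X (without dancing links, and without the paper's template encoding) is shown below.

```python
def exact_cover(universe, subsets):
    """Return names of subsets covering every element exactly once, or None.

    subsets: dict name -> set of covered elements. Classic backtracking in
    the spirit of Knuth's Algorithm X, without the dancing-links optimization.
    """
    if not universe:
        return []
    # Branch on the element with the fewest covering subsets (fail fast).
    element = min(universe, key=lambda e: sum(e in s for s in subsets.values()))
    for name, chosen in subsets.items():
        if element not in chosen:
            continue
        # Keep only subsets disjoint from the chosen one.
        remaining = {n: s for n, s in subsets.items()
                     if n != name and not (s & chosen)}
        sub = exact_cover(universe - chosen, remaining)
        if sub is not None:
            return [name] + sub
    return None

U = {1, 2, 3, 4, 5, 6, 7}
S = {"A": {1, 4, 7}, "B": {1, 4}, "C": {4, 5, 7},
     "D": {3, 5, 6}, "E": {2, 3, 6, 7}, "F": {2, 7}}
print(exact_cover(U, S))  # ['B', 'D', 'F'] covers 1..7 exactly once
```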
- Marten Lohstroh
- Martin Schoeberl
- Andrés Goens
- Armin Wasicek
- Christopher Gill
- Marjan Sirjani
- Edward A. Lee
Programming time-critical systems is notoriously difficult. In this paper we propose an actor-oriented programming model with a semantic notion of time and a deterministic coordination semantics based on discrete events to exercise precise control over both the computational and timing aspects of the system behavior.
We present two contrasting approaches to achieving time predictability in the embedded compute engine, the basic building block of any Internet of Things (IoT) or Cyber-Physical System (CPS). The traditional approach offers predictability on top of unpredictable processors that include numerous optimizations for enhanced performance and programmability at the cost of huge variability in timing. Approaches such as Worst-Case Execution Time (WCET) analysis of software have been struggling to model the complex timing behavior of the underlying processor to provide guarantees. On the other hand, the inevitable slowdown of Moore’s Law and the end of Dennard scaling have curtailed the performance and energy scaling of processors. This stagnation, in conjunction with the importance of cognitive computing, has motivated the widespread adoption of non-von Neumann accelerators and architectures. We argue that these emerging architectures are inherently time-predictable, as they depend on software to orchestrate the computation and data movement, and are an excellent match for real-time processing needs.
- Benoît Dupont de Dinechin
The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. This problem has been addressed by suitably designing the architecture, implementation, and programming models, of the Kalray MPPA (Multi-Purpose Processor Array) family of single-chip many-core processors. We introduce the third-generation MPPA processor, whose key features are motivated by the high-performance and high-integrity functions of automated vehicles. High-performance computing functions, represented by deep learning inference and by computer vision, need to execute under soft real-time constraints. High-integrity functions are developed under model-based design, and must meet hard real-time constraints. Finally, the third-generation MPPA processor integrates a hardware root of trust, and its security architecture is able to support a security kernel for implementing the trusted execution environment functions required by applications.
- Sui Chen
- Faen Zhang
- Lei Liu
- Lu Peng
Non-volatile Random-Access Memories (NVRAM) have emerged in recent years to bridge the performance gap between the main memory and external storage devices. To utilize the non-volatility of NVRAMs, programs should allow durable stores, meaning consistency must be maintained during a power loss event. GPUs are designed for high throughput, leveraging high degrees of parallelism. However, with lower NVRAM write bandwidths compared to that of DRAMs, using NVRAM as-is may yield suboptimal overall system performance. To address this problem, we propose using Helper Warps to move persistence out of the critical path of transaction execution, alleviating the impact of latencies. Our mechanism achieves speedups of 4.4 and 1.5 under bandwidth limits of 1.6 GB/s and 12 GB/s, respectively, and is projected to maintain its speed advantage even when NVRAM bandwidth reaches hundreds of GB/s in certain cases.
- Jie Zhang
- Miryeong Kwon
- Hyojong Kim
- Hyesoon Kim
- Myoungsoo Jung
We propose FlashGPU, a new GPU architecture that tightly blends new flash (Z-NAND) with massive GPU cores. Specifically, we replace global memory with Z-NAND that exhibits ultra-low latency. We also architect a flash core to manage request dispatches and address translations underneath L2 cache banks of GPU cores. While Z-NAND is a hundred times faster than conventional 3D-stacked flash, its latency is still longer than DRAM. To address this shortcoming, we propose a dynamic page-placement and buffer manager in Z-NAND subsystems by being aware of bulk and parallel memory access characteristics of GPU applications, thereby offering high-throughput and low-energy consumption behaviors.
- Shuo Huai
- Weining Song
- Mengying Zhao
- Xiaojun Cai
- Zhiping Jia
Field programmable gate arrays (FPGAs) have been widely adopted in both high-performance servers and embedded systems. Since static random access memory (SRAM) has limited density and comparatively high leakage power, researchers have proposed FPGA architectures based on emerging non-volatile memories (NVMs) to satisfy the requirements of data-intensive and low-power applications. Block RAM is the on-chip memory of FPGAs; when implemented with NVM, it faces the challenge of limited endurance. Traditional wear leveling strategies cannot be directly applied to block RAM because they may induce large performance overhead. In this paper, we propose a performance-aware wear leveling scheme for block RAM in FPGAs to improve its lifetime. The placement strategy is improved by injecting wear leveling guidance. The evaluation shows that a 29.75% lifetime enhancement is achieved together with a 16.32% performance improvement, compared with traditional wear leveling.
- Zheng Liang
- Guangyu Sun
- Wang Kang
- Xing Chen
- Weisheng Zhao
Data insertion and deletion are common operations in various applications. However, traditional memory architectures can only perform an indirect insertion/deletion with multiple data read and write operations, which is significantly time- and energy-consuming. To mitigate this problem, we propose to leverage the unique capability of the emerging skyrmion racetrack memory technology, which can naturally support direct insertion/deletion operations inside a racetrack. In this work, we first present a circuit-level model for skyrmion racetrack memory. Then, we further propose a novel memory architecture to enable efficient large-size data insertion/deletion. With the help of the model and the architecture, we study several potential applications that leverage the insertion and deletion operations. Experimental results demonstrate that the efficiency of these operations can be substantially improved.
- Mohsen Imani
- Alice Sokolova
- Ricardo Garcia
- Andrew Huang
- Fan Wu
- Baris Aksanli
- Tajana Rosing
In a data-hungry world, approximate computing has emerged as one of the solutions for creating more energy-efficient and faster systems, while providing application-tailored quality. In this paper, we propose ApproxLP, an Approximate Multiplier based on Linear Planes. We introduce an iterative method for approximating the product of two operands using fitted linear functions with two inputs, referred to as linear planes. The linearization of multiplication allows multiplication operations to be completely replaced with weighted addition. The proposed technique is used to find the significand of the product of two floating point numbers, decreasing the high energy cost of floating point arithmetic. Our method fully exploits the trade-off between accuracy and energy consumption by offering various degrees of approximation at different energy costs. As the level of approximation increases, the approximated product asymptotically approaches the exact product in an iterative manner. The performance of ApproxLP is evaluated over a range of multimedia and machine learning applications. A GPU enhanced by ApproxLP yields significant energy-delay product (EDP) improvement. For multimedia, neural network, and hyperdimensional computing applications, ApproxLP offers on average 2.4×, 2.7×, and 4.3× EDP improvement respectively with sufficient computational quality for the application. ApproxLP also provides up to 4.5× EDP improvement and has 2.3× lower chip area than other state-of-the-art approximate multipliers.
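The core idea, replacing x·y by a fitted plane a·x + b·y + c so that multiplication becomes weighted addition, can be reproduced with a least-squares fit over the unit square; the single-plane sketch below omits ApproxLP's iterative domain refinement.

```python
import numpy as np

# Fit x*y over [0, 1]^2 with a single linear plane a*x + b*y + c.
xs, ys = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
x, y, z = xs.ravel(), ys.ravel(), (xs * ys).ravel()
A = np.column_stack([x, y, np.ones_like(x)])
(a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)

def approx_mult(u, v):
    """Multiplication replaced by weighted addition (one-plane approximation)."""
    return a * u + b * v + c

print(np.round([a, b, c], 3))            # the best single plane is ~[0.5, 0.5, -0.25]
err = np.abs(approx_mult(x, y) - z)
print("mean |error|:", err.mean(), "max |error|:", err.max())  # worst case at the corners
```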
- Vasileios Leon
- Konstantinos Asimakopoulos
- Sotirios Xydis
- Dimitrios Soudris
- Kiamal Pekmestzi
Approximate computing appears as an emerging and promising solution for energy-efficient system designs, exploiting the inherent error-tolerant nature of various applications. In this paper, targeting multiplication circuits, i.e., the energy-hungry counterpart of hardware accelerators, an extensive exploration of the error–energy trade-off, when combining arithmetic-level approximation techniques, is performed for the first time. Arithmetic-aware approximations deliver significant energy reductions, while allowing to control the error values with discipline by setting accordingly a configuration parameter. Inspired from the promising results of prior works with one configuration parameter, we propose 5 hybrid design families for approximate and energy-friendly hardware multipliers, consisting of two independent parameters to tune the approximation levels. Interestingly, the resolution of the state-of-the-art Pareto diagram is improved, giving the flexibility to achieve better energy gains for a specific error constraint imposed by the system. Moreover, we outperform prior works in the field of approximate multipliers by up to 60% energy reduction, and thus, we define the new Pareto front.
- Hassaan Saadat
- Haris Javaid
- Sri Parameswaran
We propose approximate dividers with near-zero error bias for both integer and floating-point numbers. The integer divider, INZeD, is designed using a novel, analytically deduced error-correction method in an approximate log based divider. The floating-point divider, FaNZeD, is based on a highly optimized mantissa divider that is inspired by INZeD. Both of the dividers are error configurable.
Our results show that the INZeD dividers have error bias in the range of 0.01-4.4% with area-delay product improvement of 25× – 95× and power improvement of 4.7× – 15× when compared to the accurate integer divider. Likewise, compared to IEEE single-precision floating-point divider, FaNZeD dividers offer up to 985× area-delay product and 77× power improvements with error bias in the range of 0.04-2.2%. Most importantly, using our FaNZeD dividers, floating-point arithmetic can be more resource-efficient than fixed-point arithmetic because most of the FaNZeD dividers are even smaller and have better area-delay product than the 8-bit and 16-bit accurate integer dividers. Finally, our dividers show negligible effect on the output quality when evaluated with AlexNet and JPEG compression applications.
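The starting point of such designs, Mitchell-style logarithmic arithmetic, treats n = 2^k(1+f) as having log2(n) ≈ k + f, so a quotient becomes the antilog of a difference. The sketch below implements this plain log-based divider and measures its error bias empirically; INZeD's analytically deduced correction term is not included.

```python
import random

def mitchell_log2(n):
    """Approximate log2(n) as k + f for n = 2**k * (1 + f), 0 <= f < 1."""
    k = n.bit_length() - 1
    return k + (n / (1 << k) - 1)

def mitchell_antilog2(v):
    """Approximate 2**v as 2**k * (1 + f) with k = floor(v), f = v - k."""
    k = int(v // 1)
    return (2.0 ** k) * (1 + (v - k))

def approx_div(a, b):
    return mitchell_antilog2(mitchell_log2(a) - mitchell_log2(b))

random.seed(1)
pairs = [(random.randint(1, 1 << 16), random.randint(1, 1 << 16)) for _ in range(10000)]
rel_err = [(approx_div(a, b) - a / b) / (a / b) for a, b in pairs]
print("mean relative error (bias): %.3f%%" % (100 * sum(rel_err) / len(rel_err)))
```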
Stochastic Computing (SC) is designed to minimize hardware area and power consumption compared to traditional binary-encoded computation, stemming from the bit-serial data representation and extremely straightforward logic. Though existing Stochastic Computing Units mostly assume uncorrelated bit streams, recent works find that correlation can be exploited for higher accuracy. We propose novel architectures for SC division and square root, which leverage correlation via low-cost in-stream mechanisms that eliminate expensive bit stream regeneration. We also introduce new metrics to better evaluate SC circuits relying on equilibrium via feedback loops. Experiments indicate that our division converges 46.3% faster with both 43.3% lower error and 45.6% less area.
- Fuxun Yu
- Zirui Xu
- Chenchen Liu
- Xiang Chen
Benefiting from the recent evolution of artificial intelligence, Automatic Speech Recognition (ASR) technology has achieved enormous performance improvements and wider application. Unfortunately, ASR is also heavily leveraged for speech eavesdropping, where ASR is used to translate large volumes of intercepted vocal speech into text content, causing considerable information leakage. In this work, we propose MASKER — a mobile security enhancement solution to protect mobile speech data from ASR-based eavesdropping. By identifying a ubiquitous vulnerability of ASR models, MASKER is designed to inject human-imperceptible adversarial noise into real-time speech on the mobile device (e.g., phone calls and voice messages). Even if the speech data is exposed to eavesdropping during data transmission, the adversarial noise can effectively perturb the ASR process, causing a significant Word Error Rate (WER). Meanwhile, MASKER is further optimized for mobile user perception quality and enhanced for adaptation to environmental noise. Moreover, MASKER has outstanding computational efficiency for mobile system integration. Experiments show that MASKER can achieve security enhancement with an average WER of 84.55% for ASR perturbation, 32% noise reduction for user perception quality, and 16× faster processing speed compared to the state-of-the-art method.
- Sai Manoj Pudukotai Dinakarrao
- Sairaj Amberkar
- Sahil Bhat
- Abhijitt Dhavlle
- Hossein Sayadi
- Avesta Sasan
- Houman Homayoun
- Setareh Rafatirad
To overcome the performance overhead incurred by traditional software-based malware detection techniques, Hardware-assisted Malware Detection (HMD) using machine learning (ML) classifiers has emerged as a promising way to detect malicious applications and secure systems. To classify benign and malicious applications, HMD primarily relies on low-level microarchitectural events captured through Hardware Performance Counters (HPCs). This work crafts an adversarial attack on HMD systems that tampers with their security by introducing perturbations into the HPC traces with the aid of an adversarial sample generator application. To craft the attack, we first deploy an adversarial sample predictor to predict the adversarial HPC pattern that would cause a given application to be misclassified by the ML classifier deployed in the HMD. Further, since the attacker has no direct access to manipulate the HPCs generated at runtime, we devise, based on the output of the adversarial sample predictor, an adversarial sample generator wrapped around a normal application to produce HPC patterns similar to the predicted adversarial HPC trace. As the crafted adversarial sample generator application does not contain any malicious operations, it is not detectable by traditional signature-based malware detection solutions. With the proposed attack, malware detection accuracy is reduced from 82.76% to 18.04%.
- Pu Zhao
- Siyue Wang
- Cheng Gongye
- Yanzhi Wang
- Yunsi Fei
- Xue Lin
Despite the great achievements of deep neural networks (DNNs), the vulnerability of state-of-the-art DNNs raises security concerns in many application domains requiring high reliability. We propose the fault sneaking attack on DNNs, where the adversary aims to misclassify certain input images into any target labels by modifying the DNN parameters. We apply ADMM (alternating direction method of multipliers) for solving the optimization problem of the fault sneaking attack with two constraints: 1) the classification of the other images should be unchanged, and 2) the parameter modifications should be minimized. Specifically, the first constraint requires us not only to inject designated faults (misclassifications), but also to hide the faults for stealthy or sneaking considerations by maintaining model accuracy. The second constraint requires us to minimize the parameter modifications (using the ℓ0 norm to measure the number of modifications and the ℓ2 norm to measure their magnitude). Comprehensive experimental evaluation demonstrates that the proposed framework can inject multiple sneaking faults without losing the overall test accuracy performance.
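Stated as an optimization problem in generic notation (the symbols below are placeholders, not the paper's exact formulation), the fault sneaking attack searches for modified parameters that stay as close as possible to the original ones, measured in both the ℓ0 and ℓ2 senses, while forcing the chosen misclassifications and leaving the remaining predictions untouched; ADMM is then applied to this constrained problem.

```latex
\min_{\hat{\theta}}\;
  \lambda_0\,\lVert \hat{\theta}-\theta \rVert_0
  + \lambda_2\,\lVert \hat{\theta}-\theta \rVert_2^2
\quad \text{s.t.}\quad
  f(x_i;\hat{\theta}) = t_i \;\; \forall i \in \mathcal{S}_{\mathrm{target}},
\qquad
  f(x_j;\hat{\theta}) = f(x_j;\theta) \;\; \forall j \in \mathcal{S}_{\mathrm{keep}}
```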
- Kanad Basu
- Rana Elnaggar
- Krishnendu Chakrabarty
- Ramesh Karri
Anti-virus software (AVS) tools are used to detect Malware in a system. However, software-based AVS are vulnerable to attacks. A malicious entity can exploit these vulnerabilities to subvert the AVS. Recently, hardware components such as Hardware Performance Counters (HPC) have been used for Malware detection. In this paper, we propose PREEMPT, a zero overhead, high-accuracy and low-latency technique to detect Malware by re-purposing the embedded trace buffer (ETB), a debug hardware component available in most modern processors. The ETB is used for post-silicon validation and debug and allows us to control and monitor the internal activities of a chip, beyond what is provided by the Input/Output pins. PREEMPT combines these hardware-level observations with machine learning-based classifiers to preempt Malware before it can cause damage. There are many benefits of re-using the ETB for Malware detection. It is difficult to hack into hardware compared to software, and hence, PREEMPT is more robust against attacks than AVS. PREEMPT does not incur performance penalties. Finally, PREEMPT has a high True Positive value of 94% and maintains a low False Positive value of 2%.
- Jiankang Ren
- Xiaoyan Su
- Guoqi Xie
- Chao Yu
- Guozhen Tan
- Guowei Wu
Multiprocessor platforms have been widely applied in safety-critical domains to accommodate the increasing computation requirement of modern real-time applications. In this paper, we present a workload-aware harmonic partitioned multiprocessor scheduling scheme for periodic real-time tasks with constrained deadlines under the fixed-priority preemptive scheduling policy. In particular, two grouping metrics effectively integrating both harmonicity and workload characteristic are designed to guide our task partition. With those metrics, our scheme can greatly improve system utilization by taking advantage of the combination of harmonic relationship exploration and workload awareness. Experiments show that our proposed scheme significantly outperforms existing approaches in terms of schedulability.
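As background for the harmonicity side of the grouping metrics, a task group is harmonic when every period divides every larger period, which is the property that lets such groups be scheduled at high utilization. A minimal check (not the paper's combined harmonicity/workload metric) looks like this:

```python
def is_harmonic(periods) -> bool:
    """True if every period divides every larger one (checking sorted
    neighbours suffices because divisibility is transitive)."""
    ps = sorted(periods)
    return all(ps[i + 1] % ps[i] == 0 for i in range(len(ps) - 1))

print(is_harmonic([10, 20, 40]))   # True:  a harmonic group
print(is_harmonic([10, 15, 30]))   # False: 15 is not a multiple of 10
```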
- Meng Xu
- Robert Gifford
- Linh Thi Xuan Phan
This paper presents vC2M, a holistic multi-resource allocation framework for real-time multicore virtualization. vC2M integrates shared cache allocation with memory bandwidth regulation to mitigate interference among concurrent tasks, thus providing better timing isolation among tasks and VMs. It reduces the abstraction overhead through task and VCPU release synchronization and through VCPU execution regulation, and it further introduces novel resource allocation algorithms that consider CPU, cache, and memory bandwidth altogether to optimize resources. Evaluations on our prototype show that vC2M can be implemented with minimal overhead, and that it substantially improves schedulability over existing solutions.
- Mina Niknafs
- Ivan Ukhov
- Petru Eles
- Zebo Peng
Modern embedded platforms need sophisticated resource managers in order to utilize the heterogeneous computational resources efficiently. Moreover, such platforms are exposed to fluctuating workloads unpredictable at design time. In such a context, predicting the incoming workload might improve the efficiency of resource management. But is this true? And, if yes, how significant is this improvement? How accurate does the prediction need to be in order to improve decisions instead of doing harm? By proposing a prediction-based resource manager aimed at minimizing energy consumption while meeting task deadlines and by running extensive experiments, we try to answer the above questions.
- Francesco Barchi
- Gianvito Urgese
- Enrico Macii
- Andrea Acquaviva
Modern heterogeneous platforms require compilers capable of choosing the appropriate device for the execution of program portions. This paper presents a machine learning method that supports mapping decisions by analyzing the program source code represented in the LLVM assembly language (IR), exploiting the advantages offered by this generalised and optimised representation. To evaluate our solution, we trained an LSTM neural network on OpenCL kernels compiled to LLVM-IR and processed with our tokenizer, which filters out less-informative tokens. The trained network reaches an accuracy of 85% in identifying the best computational unit.
- Ganapati Bhat
- Kunal Bagewadi
- Hyung Gyu Lee
- Umit Y. Ogras
The use of wearable and mobile devices for health and activity monitoring is growing rapidly. These devices need to maximize their accuracy and active time under a tight energy budget imposed by battery and form-factor constraints. This paper considers energy harvesting devices that run on a limited energy budget to recognize user activities over a given period. We propose a technique to co-optimize the accuracy and active time by utilizing multiple design points with different energy-accuracy trade-offs. The proposed technique switches between these design points at runtime to maximize a generalized objective function under tight harvested energy budget constraints. We evaluate our approach experimentally using a custom hardware prototype and 14 user studies. It achieves 46% higher expected accuracy and 66% longer active time compared to the highest performance design point.
- Yue Xu
- Hyung Gyu Lee
- Yujuan Tan
- Yu Wu
- Xianzhang Chen
- Liang Liang
- Lei Qiao
- Duo Liu
Energy harvesting technology has been widely adopted in embedded systems. However, an unstable energy source results in unsteady operation. In this paper, we devise a long-term energy-efficient task scheduling scheme targeting solar-powered sensor nodes. The proposed method combines reinforcement learning with a solar energy prediction method to maximize energy efficiency, which ultimately enhances the long-term quality of service (QoS) of the sensor nodes. Experimental results show that the proposed scheduling improves energy efficiency by 6.0% on average and achieves a 54.0% better QoS level compared with a state-of-the-art task scheduling algorithm.
- Pramesh Pandey
- Prabal Basu
- Koushik Chakraborty
- Sanghamitra Roy
The emergence of hardware accelerators has brought several orders of magnitude improvement in the speed of deep neural network (DNN) inference. Among such DNN accelerators, the Google Tensor Processing Unit (TPU) has transpired to be the best in class, offering more than 15× speedup over contemporary GPUs. However, the rapid growth of DNN workloads conspires to escalate the energy consumption of TPU-based data centers. In order to restrict the energy consumption of TPUs, we propose GreenTPU, a low-power near-threshold computing (NTC) TPU design paradigm. To ensure high inference accuracy at low-voltage operation, GreenTPU identifies patterns in the error-causing activation sequences in the systolic array and prevents further timing errors from the same sequence by intermittently boosting the operating voltage of the specific multiplier-and-accumulator units in the TPU. Compared to a cutting-edge timing error mitigation technique for TPUs, GreenTPU enables 2×–3× higher performance in an NTC TPU, with minimal loss in prediction accuracy.
- Minxuan Zhou
- Mohsen Imani
- Saransh Gupta
- Tajana Rosing
Recently, Processing-In-Memory (PIM) techniques exploiting resistive RAM (ReRAM) have been used to accelerate various big data applications. ReRAM-based in-memory search is a powerful operation which efficiently finds required data in a large data set. However, such operations draw a large amount of current, which may create serious thermal issues, especially in state-of-the-art 3D stacked chips. Therefore, designing PIM accelerators based on in-memory search requires careful consideration of temperature. In this work, we propose static and dynamic techniques to optimize the thermal behavior of PIM architectures running intensive in-memory search operations. Our experiments show the proposed design significantly reduces the peak chip temperature and dynamic management overhead. We test our proposed design on two important categories of applications which benefit from search-based PIM acceleration: hyper-dimensional computing and database query. Experiments show that the proposed method can reduce the steady-state temperature by at least 15.3 °C, which extends the lifetime of the ReRAM device by 57.2% on average. Furthermore, the proposed fine-grained dynamic thermal management provides a 17.6% performance improvement over state-of-the-art methods.
- Jeff Jun Zhang
- Kang Liu
- Faiq Khalid
- Muhammad Abdullah Hanif
- Semeen Rehman
- Theocharis Theocharides
- Alessandro Artussi
- Muhammad Shafique
- Siddharth Garg
Machine learning, in particular deep learning, is being used in almost all aspects of life to assist humans, especially in mobile and Internet of Things (IoT)-based applications. Due to its state-of-the-art performance, deep learning is also being employed in safety-critical applications, for instance, autonomous vehicles. Reliability and security are two of the key required characteristics for these applications because of the impact they can have on human life. Towards this, in this paper, we highlight the current progress, challenges, and research opportunities in the domain of robust systems for machine learning-based applications.
- Giulio Zizzo
- Chris Hankin
- Sergio Maffeis
- Kevin Jones
Machine learning systems have had enormous success in a wide range of fields, from computer vision and natural language processing to anomaly detection. However, such systems are vulnerable to attackers who can cause deliberate misclassification by introducing small perturbations. With machine learning systems being proposed for cyber attack detection, such attackers are cause for serious concern. Despite this, the vast majority of adversarial machine learning security research is focused on the image domain. This work gives a brief overview of adversarial machine learning and of machine learning used in cyber attack detection, highlights key differences between the traditional image domain of adversarial machine learning and the cyber domain, and finally demonstrates an adversarial machine learning attack on an industrial control system.
- Kun Wu
- Guohao Dai
- Xing Hu
- Shuangchen Li
- Xinfeng Xie
- Yu Wang
- Yuan Xie
Blockchain applications have shown huge potential in various domains. Proof of Work (PoW) is the key procedure in blockchain applications; it exhibits a memory-bound characteristic that hinders the performance improvement of blockchain accelerators. In order to mitigate the "memory wall" and improve the performance of memory-hard PoW accelerators, using Ethash as an example, we optimize the memory architecture from two perspectives: 1) hiding memory latency, by proposing a specialized context switch design to overcome the uncertain cycles of repetitive memory requests; and 2) increasing memory bandwidth utilization, by introducing on-chip memory that stores a portion of the Ethash directed acyclic graph (DAG) for larger effective memory bandwidth, and further adopting embedded NOR flash to fulfill that role. We then conduct extensive experiments to explore the design space of our optimized memory architecture for Ethash, including the number of hash cores and the on-chip/off-chip memory technologies and specifications. Based on this design space exploration, we provide guidance for designing memory-bound PoW accelerators. The experimental results show that our optimized designs achieve 8.7%–55% higher hash rate and 17%–120% higher hash rate per Joule compared with the baseline design in different configurations.
- Jinwoo Kim
- Gauthaman Murali
- Heechun Park
- Eric Qin
- Hyoukjun Kwon
- Venkata Chaitanya
- Krishna Chekuri
- Nihar Dasari
- Arvind Singh
- Minah Lee
- Hakki Mert Torun
- Kallol Roy
- Madhavan Swaminathan
- Saibal Mukhopadhyay
- Tushar Krishna
- Sung Kyu Lim
A new trend in complex SoC design is chiplet-based IP reuse using 2.5D integration. In this paper we present a highly integrated design flow that encompasses architecture, circuit, and package to build and simulate heterogeneous 2.5D designs. We chipletize each IP by adding logical protocol translators and physical interface modules. Next, these chiplets are placed and routed on a silicon interposer. Our package models are then used to calculate the PPA and signal/power integrity of the overall system. A design space exploration study using our tool flow shows that 2.5D integration incurs a 2.1× PPA overhead compared with its 2D SoC counterpart.
- Vijeta Rathore
- Vivek Chaturvedi
- Amit K. Singh
- Thambipillai Srikanthan
- Muhammad Shafique
Device scaling into the sub-deca-nanometer regime has made device aging a primary design concern. In manycore systems, inevitable process variation further adds to delay degradation and, coupled with the scalability issues of manycores, makes aging management while meeting performance demands a complex problem. LifeGuard is a performance-centric, reinforcement learning-based task mapping strategy that leverages the differing impact of applications on aging to improve system health. Experimental results on a 256-core system, comparing LifeGuard with two state-of-the-art aging-optimization techniques, show that LifeGuard improved the health of 57% and 74% of the cores, respectively, and also enhanced the aggregate core frequency.
We propose a framework that estimates the error rate experienced by an application as it runs on a timing-speculative processor. The framework uses an instruction error model that is comparable in accuracy to low-level simulations—as it considers the effects of operand values, preceding instructions, datapath configuration, and error correction scheme, as well as process variation, including its spatial correlation property—and yet efficient enough to allow its application in Monte Carlo experiments to characterize large program input datasets. We then use statistical limit theorems to estimate program error rate and quantify the effect of inter-instruction correlations.
- Jintaek Kang
- Dowhan Jung
- Kwanghyun Chung
- Soonhoi Ha
In the design of a neural processor, a cycle-accurate simulator is usually built to estimate the performance before hardware implementation. Since using the simulator to perform design space exploration (DSE) of hardware architecture is quite time consuming, we propose a novel method to use a high-level analytical model for fast DSE. In the model, non-deterministic execution delay is modeled with some parameters whose contribution to the performance is estimated statically by simulation. The viability of the proposed methodology is confirmed with two neural processors with different manycore architectures, achieving 2000 times speed-up within 3% accuracy error, compared with simulator-based DSE.
- Runbin Shi
- Junjie Liu
- Hayden K.-H. So
- Shuo Wang
- Yun Liang
Various models based on Long Short-Term Memory (LSTM) networks have demonstrated state-of-the-art performance in sequential information processing. Previous LSTM-specific architectures provision large on-chip memory for weight storage to alleviate the memory-bound issue and facilitate LSTM inference in cloud computing. In this paper, E-LSTM is proposed for embedded scenarios, taking chip area and limited data-access bandwidth into consideration. The heterogeneous hardware in E-LSTM tightly couples an LSTM co-processor with an embedded RISC-V CPU. The eSELL format is developed to represent the sparse weight matrix. With the proposed cell fusion optimization based on the inherent sparsity in computation, E-LSTM achieves up to 2.2× speedup in processing throughput.
- Zirui Xu
- Fuxun Yu
- Chenchen Liu
- Xiang Chen
Although Deep Neural Network (DNN) techniques have been widely applied, DNN-based applications are still too computationally intensive for resource-constrained mobile devices. Many works have been proposed to optimize DNN computation performance, but most of them are limited to an algorithmic perspective and ignore practical deployment issues. To achieve comprehensive DNN performance enhancement in practice, DNN optimization should closely cooperate with specific hardware and system constraints (i.e., computation capacity, energy cost, memory occupancy, and inference latency). Therefore, in this work, we propose ReForm, a resource-aware DNN optimization framework. Through thorough mobile DNN computing analysis and innovative model reconfiguration schemes (i.e., ADMM-based static model fine-tuning and dynamically selective computing), ReForm can efficiently and effectively reconfigure a pre-trained DNN model for practical mobile deployment under various static and dynamic computation resource constraints. Experiments show that ReForm achieves ~3.5× faster optimization than the state-of-the-art resource-aware optimization method. ReForm can also effectively reconfigure a DNN model for different mobile devices with distinct resource constraints. Moreover, ReForm achieves satisfying computation cost reduction with negligible accuracy drop in both static and dynamic computing scenarios (up to 18% workload, 16.23% latency, 48.63% memory, and 21.5% energy improvement).
- Bharath Srinivas Prabakaran
- Semeen Rehman
- Muhammad Shafique
Bio-signals exhibit high redundancy, and the algorithms for their processing are inherently error resilient. This property can be leveraged to improve the energy efficiency of IoT-Edge devices (wearables) through the emerging trend of approximate computing. This paper presents XBioSiP, a novel methodology for approximate bio-signal processing that employs two quality evaluation stages, during the pre-processing and bio-signal processing stages, to determine the approximation parameters. It thereby achieves high energy savings while satisfying the user-defined quality constraint. Our methodology achieves up to 19× and 22× reductions in the energy consumption of a QRS peak detection algorithm for 0% and < 1% loss in peak detection accuracy, respectively.
- Alireza Mahzoon
- Daniel Große
- Rolf Drechsler
In recent years, formal methods based on Symbolic Computer Algebra (SCA) have shown very good results in the verification of integer multipliers. This success rests on removing redundant terms (vanishing monomials) early, which avoids the explosion in the number of monomials during backward rewriting. However, SCA approaches still suffer from two major problems: (1) a high dependence on the detection of Half Adders (HAs) realized as AND-XOR gates in the multiplier netlist, and (2) an extremely large search space for finding the sources of the vanishing monomials. As a consequence, if the multiplier contains dirty logic, e.g., due to non-standard libraries or logic optimization, existing SCA methods are completely blind to the resulting polynomials, and their techniques for effective division fail.
In this paper, we present RevSCA. RevSCA brings light back into backward rewriting by identifying the atomic blocks of arithmetic circuits using dedicated reverse engineering techniques. Our approach takes advantage of these atomic blocks to detect all sources of vanishing monomials independent of the design architecture. Furthermore, it drastically cuts the local vanishing-monomial removal time by limiting the search space to a small part of the design. Experimental results confirm the efficiency of our approach in the verification of a wide variety of integer multipliers with up to 1024 output bits.
- Rehab Massoud
- Hoang M. Le
- Peter Chini
- Prakash Saivasan
- Roland Meyer
- Rolf Drechsler
This paper introduces a new method to trace cycle-accurately the temporal behavior of on-chip signals while operating in-field. Current cycle-accurate schemes incur unacceptable amounts of data for logging, storage and processing.
Our key idea to enable efficient yet cycle-accurate tracing is to bring timing to the front as the main traced artifact. We split signal tracing into consecutive (back-to-back) finite trace-cycles. Within a trace-cycle, each value-change instance of a signal is assigned an encoded timestamp. At the end of each trace-cycle, these encoded timestamps are aggregated into a logged timeprint, which summarizes the temporal behavior over the trace-cycle.
To retrieve the accurate timing, we reconstruct the exact instances from a timeprint via a SAT query. The experiments demonstrate how unprecedented lightweight tracing can be applied, and how timeprints enable the verification of cycle-accurate properties and the detection of sporadic temperature effects.
- Michael Schwarz
- Raphael Stahl
- Daniel Müller-Gritschneder
- Ulf Schlichtmann
- Dominik Stoffel
- Wolfgang Kunz
Customizing embedded computing platforms to specific application domains often necessitates optimizing the firmware and/or the HW/SW interface under tight resource constraints. Such optimizations frequently alter the communication between the firmware and the peripheral devices, possibly compromising functional correctness of the input/output behavior of the embedded system. This paper proposes a formal HW/SW co-equivalence checking technique for verifying correct I/O behavior of peripherals under a modified firmware. We demonstrate the great promise of our approach on RTL implementations of several open-source peripherals. In our experiments we successfully prove or disprove correctness of firmware optimizations for an industrial driver software. In addition, we also found a subtle bug in one of the peripherals and several undocumented preconditions for correct device behavior.
- Vladimir Herdt
- Daniel Große
- Hoang M. Le
- Rolf Drechsler
Extensive testing of IoT SW is very important to prevent errors and security vulnerabilities. In the SW domain, the automated concolic testing technique has been shown to be very effective.
In this paper we propose an approach for concolic testing of binaries targeting RISC-V systems with peripherals. Our approach works by integrating the Concolic Testing Engine (CTE) with the architecture-specific Instruction Set Simulator (ISS) inside a Virtual Prototype (VP). We provide a designated CTE interface to integrate (SystemC-based) peripherals into the concolic testing by means of SW models. This combination enables high simulation performance at the binary level with comparatively little effort to integrate peripherals with concolic execution capabilities. Our approach has been effective in finding several buffer-overflow-related security vulnerabilities in the FreeRTOS TCP/IP stack.
- Brendan L. West
- Jian Zhou
- Ronald G. Dreslinski
- J. Brian Fowlkes
- Oliver Kripfgans
- Chaitali Chakrabarti
- Thomas F. Wenisch
High volume acquisition rates are imperative for medical ultrasound imaging applications, such as 3D elastography and 3D vector flow imaging. Unfortunately, despite recent algorithmic improvements, high-volume-rate imaging remains computationally infeasible on known platforms.
In this paper, we propose Tetris, a novel hardware accelerator for ultrasound beamforming that enables volume acquisition rates up to the physics limits of acoustic propagation delay. Through algorithmic and hardware optimizations, we enable a streaming system design outclassing previously proposed accelerators in performance while lowering hardware complexity and storage requirements. For a representative imaging task, our proposed system generates physics-limited 13,020 volumes per second in a 2.5W power budget.
- Nimish Shah
- Laura I. Galindez Olascoaga
- Wannes Meert
- Marian Verhelst
Bayesian reasoning is a powerful mechanism for probabilistic inference in smart edge-devices. During such inferences, a low-precision arithmetic representation can enable improved energy efficiency. However, its impact on inference accuracy is not yet understood. Furthermore, general-purpose hardware does not natively support low-precision representation. To address this, we propose ProbLP, a framework that automates the analysis and design of low-precision probabilistic inference hardware. It automatically chooses an appropriate energy-efficient representation based on worst-case error-bounds and hardware energy-models. It generates custom hardware for the resulting inference network exploiting parallelism, pipelining and low-precision operation. The framework is validated on several embedded-sensing benchmarks.
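To illustrate the flavor of worst-case error-bound reasoning such a framework performs (a hedged sketch only; ProbLP's actual error and energy models are not reproduced), one can propagate fixed-point rounding bounds through the sum and product nodes of a small inference network whose values are probabilities in [0, 1]:

```python
def eps(frac_bits: int) -> float:
    """Worst-case rounding error of a value quantized to frac_bits fractional bits."""
    return 2.0 ** -(frac_bits + 1)

def sum_error(e_a: float, e_b: float, frac_bits: int) -> float:
    """Worst-case error bound of a quantized sum node."""
    return e_a + e_b + eps(frac_bits)

def product_error(e_a: float, e_b: float, frac_bits: int) -> float:
    """Worst-case error bound of a quantized product node for operands in [0, 1]."""
    return e_a + e_b + e_a * e_b + eps(frac_bits)

# Tiny network (p1*p2) + (p3*p4) with every value quantized to `frac` fractional bits
frac = 8
leaf = eps(frac)
bound = sum_error(product_error(leaf, leaf, frac),
                  product_error(leaf, leaf, frac), frac)
print(bound)   # bound on how far the low-precision output can drift from the exact one
```

Comparing such bounds across candidate bit-widths is one way to pick the cheapest representation that still meets an accuracy target.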
- Seungkyu Choi
- Jaekang Shin
- Yeongjae Choi
- Lee-Sup Kim
Personalization by incremental learning has become essential for IoT devices to enhance the performance of the deep learning models trained with global datasets. To avoid massive transmission traffic in the network, exploiting on-device learning is necessary. We propose a software/hardware co-design technique that builds an energy-efficient low-bit trainable system: (1) software optimizations by local low-bit quantization and computation freezing to minimize the on-chip storage requirement and computational complexity, (2) hardware design of a bit-flexible multiply-and-accumulate (MAC) array sharing the same resources in inference and training. Our scheme saves 99.2% on on-chip buffer storage and achieves 12.8x higher peak energy efficiency compared to previous trainable accelerators.
- Hong Liu
- Leibo Liu
- Wenping Zhu
- Qiang Li
- Huiyu Mo
- Shaojun Wei
A binary-weight hourglass network (B-HG) accelerator for landmark detection, built on the proposed look-up-table (LUT) based multi-level prediction-correction approach, enables high-speed and energy-efficient processing on IoT edge devices. First, a LUT with a unified mode is adopted to support convolutional neural networks with fully variable weight bit precision, minimizing the operations of B-HG and achieving a 1.33×–1.50× speedup on multi-bit-weight CNNs relative to a similar solution. Second, a multi-level prediction-correction model is proposed to achieve computationally efficient convolution with adaptive precision; the operations saved can be increased by about 30% compared to the two-stage model. Besides, nearly 77.4% of the operations in B-HG can be saved by combining these two methods, yielding a 2.3× inference speedup. Third, a block-computing-based pipeline is designed to mitigate the residual-block inefficiency in B-HG. It not only reduces off-chip memory access by about 66.2% compared to the baseline, but also saves 60% and 31% of on-chip memory space and accesses, respectively, compared to a similar fused-layer accelerator. The proposed B-HG accelerator achieves 450 fps at 500 MHz based on simulation in a TSMC 28 nm process. Meanwhile, the power efficiency reaches 8.5 TOPS/W, which is two orders of magnitude higher than the dedicated face landmark detection accelerator.
- Runze Liu
- Jianlei Yang
- Yiran Chen
- Weisheng Zhao
Simultaneous Localization and Mapping (SLAM) is a critical task for autonomous navigation. However, due to the computational complexity of SLAM algorithms, it is very difficult to achieve real-time implementations on low-power platforms. We propose an energy-efficient architecture for a real-time ORB (Oriented-FAST and Rotated-BRIEF) based visual SLAM system by accelerating the most time-consuming stages, feature extraction and matching, on an FPGA platform. Moreover, the original ORB descriptor pattern is reformulated in a rotationally symmetric manner, which is much more hardware friendly. Optimizations including rescheduling and parallelization are further applied to improve the throughput and reduce the memory footprint. Compared with Intel i7 and ARM Cortex-A9 CPUs on the TUM dataset, our FPGA realization achieves up to 3× and 31× frame rate improvements, as well as up to 71× and 25× energy efficiency improvements, respectively.
- Shijun Gong
- Jiajun Li
- Wenyan Lu
- Guihai Yan
- Xiaowei Li
Stream processing is an important and growing class of applications for analyzing continuous streams of real-time data. Sliding-window aggregations (SWAGs) dominate the computation time in such applications and demand an unprecedented computation capacity, which poses a great challenge to computing architectures. General-purpose processors cannot handle SWAGs efficiently because of their specific computation patterns. This paper proposes an efficient accelerator architecture for ubiquitous SWAGs, called ShuntFlow. ShuntFlow is a type of Kernel Processing Unit (KPU), where "kernel" represents the two main categories of SWAG operations widely used in stream processing. Meanwhile, we propose a shunt rule that enables ShuntFlow to efficiently handle SWAGs with arbitrary parameters. As a case study, we implemented ShuntFlow on an Altera Arria 10 AX115N FPGA board at 150 MHz and compared it to previous approaches. The experimental results show that ShuntFlow provides a tremendous throughput and latency advantage over CPU and GPU implementations on both reduce-like and index-like SWAGs.
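For readers unfamiliar with SWAGs, the following sketch shows the software form of a reduce-like sliding-window aggregation (a running windowed sum with O(1) work per incoming tuple); it illustrates the operation class ShuntFlow accelerates, not its hardware datapath.

```python
from collections import deque

class SlidingSum:
    """Reduce-like sliding-window aggregation: maintain the sum of the last
    `window` tuples incrementally instead of re-aggregating the whole window."""
    def __init__(self, window: int):
        self.window, self.buf, self.total = window, deque(), 0

    def insert(self, value):
        self.buf.append(value)
        self.total += value
        if len(self.buf) > self.window:      # evict the oldest tuple
            self.total -= self.buf.popleft()
        return self.total                    # current window aggregate

agg = SlidingSum(window=3)
print([agg.insert(v) for v in [4, 1, 7, 2, 9]])   # [4, 5, 12, 10, 18]
```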
- Xingchen Man
- Leibo Liu
- Jianfeng Zhu
- Shaojun Wei
Compilation has become a major challenge to the usability of coarse-grained reconfigurable architectures as ever more programmable resources must be orchestrated. Static compilation suffers from prohibitive time cost, while dynamic compilation still performs poorly in both generality and efficiency. This paper proposes a general pattern-based dynamic compilation framework, which utilizes statically generated patterns to directly determine runtime re-placement and routing, so that the runtime configuration creation algorithm has low complexity. Domain-specific communication characteristics are harnessed to improve the efficiency of the patterns. The experimental results show that compiled general applications can be transformed onto arbitrary resources at runtime, preserving 97% (39%~163%) of the original performance per resource on average, 7% (0~17%) better than the state-of-the-art non-general methods.
- Mahdi Nazm Bojnordi
- Farhan Nasrullah
3D die-stacking has enabled energy-efficient solutions for near-data processing by integrating multiple dice of high-density memory layers and processor cores within the same package. One promising approach is to employ the in-package memory as a gigascale last-level cache for data-intensive computing. Most existing in-package cache controllers rely on command scheduling policies borrowed from off-chip DRAM systems. Regrettably, these control policies are not specifically tailored to in-package cache traffic, which results in limited bandwidth efficiency. This paper proposes ReTagger, a DRAM cache controller that employs repeated tags to alleviate the cost of DRAM row-buffer misses. Our simulation results on a set of ten data-intensive applications indicate an average of 20% performance improvement for the proposed controller over state-of-the-art DRAM caches.
- Aviral Shrivastava
- Moslem Didehban
Advances in semiconductor technology have enabled unprecedented growth in safety-critical applications. However, due to unabated scaling, the unreliability of the underlying hardware is only getting worse. For many applications, just recovering from errors is not enough: the latency between the occurrence of a fault and its detection and recovery, i.e., in-time error resilience, is of vital importance. This is especially true for real-time applications, where the timing of application events is a crucial part of application correctness. Software techniques for resilience are highly desirable since they can be applied flexibly, but achieving reliable, in-time software resilience is still an elusive goal. A new class of recent techniques has started to tackle this problem. This paper presents a succinct overview of existing software resilience techniques from the point of view of in-time resilience and points out future challenges.
- Eric Cheng
- Daniel Müller-Gritschneder
- Jacob Abraham
- Pradip Bose
- Alper Buyuktosunoglu
- Deming Chen
- Hyungmin Cho
- Yanjing Li
- Uzair Sharif
- Kevin Skadron
- Mircea Stan
- Ulf Schlichtmann
- Subhasish Mitra
Resilience to errors in the underlying hardware is a key design objective for a large class of computing systems, from embedded systems all the way to the cloud. Sources of hardware errors include radiation, circuit aging, variability induced by manufacturing and operating conditions, manufacturing test escapes, and early-life failures. Many publications have suggested that cross-layer resilience, where multiple error resilience techniques from different layers of the system stack cooperate to achieve cost-effective resilience, is essential for designing cost-effective resilient digital systems. This paper presents a comprehensive overview of cross-layer resilience by addressing fundamental cross-layer resilience questions, by summarizing insights derived from recent advances in cross-layer resilience research, and by discussing future cross-layer resilience challenges.
- Michael Werner
- Keerthikumara Devarajegowda
- Moomen Chaari
- Wolfgang Ecker
Developing software in a slightly different way can have a dramatic impact on soft-error resilience. This observation can be turned into a process of improving existing code by transformations. These transformations are systematic in nature and can be automated. In this paper, we present a framework for generating low-level embedded software, commonly referred to as firmware, and for including safety measures in the generated code. The generation approach follows a three-stage process, starting with a formalized firmware specification using both platform-dependent and platform-independent firmware models. Finally, C code is generated from the view model in a straightforward way. Safety measures are included either as part of the translation step between the models or as transformations of single models.
- Ruizhou Ding
- Zeye Liu
- Ting-Wu Chin
- Diana Marculescu
- R. D. (Shawn) Blanton
To improve the throughput and energy efficiency of Deep Neural Networks (DNNs) on customized hardware, lightweight neural networks constrain the weights of DNNs to be a limited combination (denoted as k ∈ {1, 2}) of powers of 2. In such networks, the multiply-accumulate operation can be replaced with a single shift operation, or two shifts and an add operation. To provide even more design flexibility, the k for each convolutional filter can be optimally chosen instead of being fixed for every filter. In this paper, we formulate the selection of k to be differentiable, and describe model training for determining k-based weights on a per-filter basis. Over 46 FPGA-design experiments involving eight configurations and four data sets, lightweight neural networks with a flexible k value (dubbed FLightNNs) fully utilize the hardware resources on Field Programmable Gate Arrays (FPGAs); our experimental results show that FLightNNs can achieve 2× speedup when compared to lightweight NNs with k = 2, with only 0.1% accuracy degradation. Compared to a 4-bit fixed-point quantization, FLightNNs achieve higher accuracy and up to 2× inference speedup, due to their lightweight shift operations. In addition, our experiments also demonstrate that FLightNNs can achieve higher computational energy efficiency for ASIC implementation.
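The arithmetic simplification the abstract refers to can be seen in a few lines: when a weight is constrained to a signed sum of k powers of two, multiplying an activation by it reduces to k shifts and adds. The sketch below (with integer activations and non-negative exponents for brevity) only illustrates that constraint, not the FLightNN training procedure.

```python
def shift_multiply(x: int, exponents, signs) -> int:
    """Multiply x by a weight expressed as a signed sum of powers of two:
    weight = sum(s * 2**e), so the product needs len(exponents) shifts/adds."""
    return sum(s * (x << e) for s, e in zip(signs, exponents))

print(shift_multiply(13, [3], [+1]))          # k = 1: 13 * 8       = 104 (one shift)
print(shift_multiply(13, [3, 1], [+1, -1]))   # k = 2: 13 * (8 - 2) = 78  (two shifts, one add)
```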
- Shubham Jain
- Swagath Venkataramani
- Vijayalakshmi Srinivasan
- Jungwook Choi
- Kailash Gopalakrishnan
- Leland Chang
Fixed-point implementations (FxP) are prominently used to realize Deep Neural Networks (DNNs) efficiently on energy-constrained platforms. The choice of bit-width is often constrained by the ability of FxP to represent the entire range of numbers in the data structure with sufficient resolution. At low bit-widths (< 8 bits), state-of-the-art DNNs invariably suffer a loss in classification accuracy due to quantization/saturation errors.
In this work, we leverage a key insight that almost all data structures in DNNs are long-tailed, i.e., a significant majority of the elements are small in magnitude, with a small fraction being orders of magnitude larger. We propose BiScaled-FxP, a new number representation which caters to the disparate range and resolution needs of long-tailed data structures. The key idea is, whilst using the same number of bits to represent elements of both large and small magnitude, to employ two different scale factors, viz. scale-fine and scale-wide, in their quantization. Scale-fine allocates more fractional bits, providing resolution for small numbers, while scale-wide favors covering the entire range of large numbers, albeit at a coarser resolution. We develop a BiScaled DNN accelerator which computes on BiScaled-FxP tensors. A key challenge is to store the scale factor used in quantizing each element, as computations that use operands quantized with different scale factors need to scale their result. To minimize this overhead, we use a block-sparse format to store only the indices of scale-wide elements, which are few in number. We also enhance the BiScaled-FxP processing elements with shifters to scale their output when the operands of a computation use different scale factors. We develop a systematic methodology to identify the scale-fine and scale-wide factors for the weights and activations of any given DNN. Over 8 state-of-the-art image recognition benchmarks, BiScaled-FxP reduces 2 computation bits over conventional FxP, while also slightly improving classification accuracy in all cases. Compared to FxP8, the performance and energy benefits range between 1.43×–3.86× and 1.4×–3.7×, respectively.
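A toy version of the two-scale idea can be written as follows, with small-magnitude elements quantized under a scale-fine factor and the few large-magnitude elements under a scale-wide factor. This is a hedged sketch: the scale-selection rule, bit allocation, and block-sparse index used by BiScaled-FxP are not reproduced here.

```python
import numpy as np

def biscaled_quantize(x, bits=8, frac_fine=6, frac_wide=2):
    """Quantize a long-tailed tensor with two scale factors: scale-fine
    (more fractional bits) for small values, scale-wide (fewer fractional
    bits, larger range) for the few large values.  Illustrative only."""
    threshold = 2.0 ** (bits - 1 - frac_fine)        # largest magnitude scale-fine can hold
    wide = np.abs(x) >= threshold                    # typically a small fraction of elements
    frac = np.where(wide, frac_wide, frac_fine)
    q = np.clip(np.round(x * 2.0 ** frac), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q / 2.0 ** frac, wide                     # dequantized tensor + scale-wide index

x = np.array([0.031, -0.12, 0.006, 5.7, 0.044])      # long-tailed: one large element
xq, wide_idx = biscaled_quantize(x)
print(xq)        # [ 0.03125 -0.125    0.       5.75     0.046875]
print(wide_idx)  # [False False False  True False]
```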
- Ying Wang
- Shengwen Liang
- Huawei Li
- Xiaowei Li
Prior research on energy-efficient Convolutional Neural Network (CNN) inference accelerators has mostly focused on exploiting model sparsity, i.e., zero patterns in weights and activations, to reduce on-chip storage and computation overhead. In this work, we found that, in addition to zero patterns, a larger group of repetitive patterns and values exists in the working set of the CNN inference task, which we define as computation redundancy and which induces unnecessary performance and storage overhead in CNN accelerators. Based on this observation, we propose a redundancy-free architecture that detects and eliminates repetitive computation and storage patterns in CNNs for more efficient network inference. The architecture consists of two parts: an off-line parameter analyzer that extracts the repetitive patterns in the 3D tensors of parameters, and a dataflow accelerator. The proposed accelerator first preprocesses the weight patterns and the dynamically generated activations, and then caches these intermediate results in special P2-cache banks for further use in the convolution or fully-connected stage. Experiments show that the proposed Cavoluche architecture removes up to 89% of the repetitive operations from the layer inference process and reduces the on-chip storage space needed to store both redundancy-free weights and activations by 77%. The implementation of Cavoluche outperforms the state-of-the-art mobile GPGPU in both performance and energy efficiency. Compared to the latest sparsity-based accelerators, Cavoluche also achieves better operation-elimination effects.
- Morteza Hosseini
- Mark Horton
- Hiren Paneliya
- Uttej Kallakuri
- Houman Homayoun
- Tinoosh Mohsenin
In deep neural networks (DNNs), model size is an important factor affecting performance, energy efficiency, and scalability. Recent works on weight pruning have shown significant reductions in model size at the expense of irregularity in the DNN architecture, which necessitates additional indexing memory to address non-zero weights, thereby increasing chip size, energy consumption, and delay. In this paper, we propose cyclic sparsely connected (CSC) layers, with a memory/computation complexity of O(N log N), that can be used as an overlay for fully connected (FC) layers, whose number of parameters, O(N²), can dominate the parameters of the entire DNN model. The CSC layers are composed of a few sequential layers, referred to as support layers, which together provide full connectivity between the inputs and outputs of each CSC layer. We introduce an algorithm to train models with FC layers replaced by CSC layers in a bottom-up approach, incrementally increasing the CSC layers' characteristics, such as connectivity and number of synapses, to achieve the desired accuracy for a given compression rate. One advantage of the CSC layers is that they require no indexing of the non-zero weights. Our experimental results using AlexNet on ImageNet and LeNet300100 on MNIST indicate that by substituting FC layers with CSC layers, we can achieve 10× to 46× compression within a margin of 2% accuracy loss, which is comparable to non-structural pruning methods. A scalable parallel hardware architecture to implement CSC layers, and an equivalent scalable parallel architecture to efficiently implement non-structurally pruned FC layers, are designed and fully placed and routed on an Artix-7 FPGA and in ASIC 65nm CMOS technology for the LeNet300100 model. The results indicate that the proposed CSC hardware outperforms the conventional non-structurally pruned architecture with an equal compression rate by ~2× in power, energy, area, and resource utilization when running at the same frequency.
- Wonseok Choi
- Dongyeob Shin
- Jongsun Park
- Swaroop Ghosh
With the inherent algorithmic error resilience of deep neural networks (DNNs), supply voltage scaling could be a promising technique for energy-efficient DNN accelerator design. In this paper, we propose novel error-resilient techniques to enable aggressive voltage scaling by exploiting the different amounts of error resilience (sensitivity) of DNN layers, filters, and channels. First, to rapidly evaluate filter/channel-level weight sensitivities of large-scale DNNs, a first-order Taylor expansion is used, which accurately approximates the weight sensitivity obtained from actual error injection simulation. Using the measured timing error probability of each multiply-accumulate (MAC) unit under process variations, the sensitivity variation among filter weights can be leveraged in the DNN accelerator design, such that computations with more sensitive weights are assigned to more robust MAC units, while those with less sensitive weights are assigned to less robust MAC units. Based on post-synthesis timing simulations, 51% energy savings have been achieved on the CIFAR-10 dataset using VGG-9 compared to a state-of-the-art timing error recovery technique under the same constraint of 3% accuracy loss.
- Duy-Thanh Nguyen
- Nhut-Minh Ho
- Ik-Joon Chang
We present a stretchable DRAM refresh control for energy-efficient processing of DNNs, namely St-DRC. We exploit the characteristic that the recognition accuracy of DNNs is insensitive to errors in insignificant bits. By replacing some insignificant bits with parity bits for the error correction of significant bits, St-DRC can protect the significant bits under stretched refresh periods. This significantly reduces DRAM refresh energy without performance degradation of DNNs, and it is applicable to both training and inference. Our simulation shows that, in training, St-DRC obtains 23%/12% DRAM energy savings for graphics/main memories, respectively. Further, St-DRC accelerates training speed by 0.43~4.12%.
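The bit-reallocation idea can be illustrated with a toy byte layout (a hedged sketch under assumed parameters, not St-DRC's actual code or bit allocation): the four significant MSBs are protected by a Hamming-style parity triple written over three of the insignificant LSBs, so a single upset in a significant bit under a stretched refresh period can still be corrected.

```python
def encode_word(byte: int) -> int:
    """Keep the four significant MSBs of a byte and overwrite three
    insignificant LSBs with Hamming parity over those MSBs."""
    d1, d2, d3, d4 = [(byte >> i) & 1 for i in (7, 6, 5, 4)]
    p1, p2, p3 = d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4
    return (byte & 0xF8) | (p1 << 2) | (p2 << 1) | p3   # bit 3 stays as ordinary data

def decode_word(stored: int) -> int:
    """Recompute the parity syndrome and correct a single flipped MSB."""
    d1, d2, d3, d4 = [(stored >> i) & 1 for i in (7, 6, 5, 4)]
    p1, p2, p3 = (stored >> 2) & 1, (stored >> 1) & 1, stored & 1
    syndrome = (p1 ^ d1 ^ d2 ^ d4, p2 ^ d1 ^ d3 ^ d4, p3 ^ d2 ^ d3 ^ d4)
    flip = {(1, 1, 0): 7, (1, 0, 1): 6, (0, 1, 1): 5, (1, 1, 1): 4}.get(syndrome)
    if flip is not None:                                  # single-bit MSB upset detected
        stored ^= 1 << flip
    return stored & 0xF8                                  # significant bits (+ kept bit 3)

word = encode_word(0b10110110)
corrupted = word ^ (1 << 6)          # a significant bit decays under stretched refresh
print(bin(decode_word(corrupted)))   # 0b10110000 -- the four MSBs are restored
```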
- Cong Hao
- Xiaofan Zhang
- Yuhong Li
- Sitao Huang
- Jinjun Xiong
- Kyle Rupnow
- Wen-mei Hwu
- Deming Chen
While embedded FPGAs are attractive platforms for DNN acceleration on edge-devices due to their low latency and high energy efficiency, the scarcity of resources of edge-scale FPGA devices also makes it challenging for DNN deployment. In this paper, we propose a simultaneous FPGA/DNN co-design methodology with both bottom-up and top-down approaches: a bottom-up hardware-oriented DNN model search for high accuracy, and a top-down FPGA accelerator design considering DNN-specific characteristics. We also build an automatic co-design flow, including an Auto-DNN engine to perform hardware-oriented DNN model search, as well as an Auto-HLS engine to generate synthesizable C code of the FPGA accelerator for explored DNNs. We demonstrate our co-design approach on an object detection task using PYNQ-Z1 FPGA. Results show that our proposed DNN model and accelerator outperform the state-of-the-art FPGA designs in all aspects including Intersection-over-Union (IoU) (6.2% higher), frames per second (FPS) (2.48× higher), power consumption (40% lower), and energy efficiency (2.5× higher). Compared to GPU-based solutions, our designs deliver similar accuracy but consume far less energy.
- Junzhong Shen
- Deguang Wang
- You Huang
- Mei Wen
- Chunyuan Zhang
Three-dimensional convolutional neural networks (3D CNNs) have become a promising method for lung nodule segmentation. The high computational complexity and memory requirements of 3D CNNs make it challenging to accelerate them on a single FPGA. In this work, we focus on accelerating 3D CNN-based lung nodule segmentation on a multi-FPGA platform by proposing an efficient mapping scheme that takes advantage of the massive parallelism provided by the platform and maximizes the computational efficiency of the accelerators. Experimental results show that our system integrating four Xilinx VCU118 boards achieves state-of-the-art performance of 14.5 TOPS, along with a 29.4× performance gain over a CPU and 10.5× higher energy efficiency than a GPU.
- Eric Finnerty
- Zachary Sherer
- Hang Liu
- Yan Luo
The flexible architectures of Field Programmable Gate Arrays (FPGAs) lend themselves to an array of data analytics applications, among which Breadth-First Search (BFS), due to its vital importance, draws particular attention. Recent attempts to offload BFS onto FPGAs either simply imitate existing CPU- or Graphics Processing Unit (GPU)-based mechanisms or suffer from scalability issues. To this end, we introduce a novel data-centric design which extensively extracts the potential of FPGAs for BFS with the following two techniques. First, we advocate partitioning and compressing the BFS algorithmic metadata in order to buffer it in fast on-chip memory and circumvent expensive metadata accesses. Second, we propose a hierarchical coalescing method to improve the throughput of graph data access. Taken together, our evaluation demonstrates that the proposed design achieves, on average, 1.6× and 2.2× speedups over the state-of-the-art FPGA designs TorusBFS and Umuroglu, respectively, across a collection of graph datasets.
- Jaeha Kung
- Junki Park
- Sehun Park
- Jae-Joon Kim
In this paper, we present an integrated solution to design a high-performance LSTM accelerator. We propose a fast and flexible hardware architecture, named Peregrine, supported by a stack of innovations from algorithm to hardware design. Peregrine first minimizes the memory footprint by limiting the synaptic connection patterns within the LSTM network. Also, Peregrine provides parallel Huffman decoders with adaptive clocking to provide flexibility in dealing with a wide range of sparsity levels in the weight matrices. All these features are incorporated in a novel hardware architecture to maximize energy-efficiency. As a result, Peregrine improves performance by ~38% and energy-efficiency by ~33% in speech recognition compared to the state-of-the-art LSTM accelerator.
- Yongchen Wang
- Ying Wang
- Huawei Li
- Cong Shi
- Xiaowei Li
3D convolutional neural networks (CNNs) are gaining popularity in action/activity analysis. Compared to 2D convolutions that share filters in the 2D spatial domain, 3D convolutions further reuse filters in the temporal dimension to capture time-domain features. Prior works on specialized 3D-CNN accelerators employ additional on-chip memories and multi-cluster architectures to reuse data among the processing element (PE) arrays, which is too expensive for low-power chips. Instead of harvesting in-memory locality, we propose a 3D systolic-cube architecture that exploits the spatial and temporal localities of 3D CNNs, moving reusable data between PEs connected via a 3D-cube Network-on-Chip. Evaluation shows that the systolic cube delivers a considerable energy-efficiency boost on activity-recognition benchmarks.
- Jinhang Choi
- Zeinab Hakimi
- Philip W. Shin
- Jack Sampson
- Vijaykrishnan Narayanan
As the computing power of end-point devices grows, there has been interest in developing distributed deep neural networks specifically for hierarchical inference deployments on multi-sensor systems. However, as the existing approaches rely on latent parameters trained by machine learning, it is difficult to preemptively select front-end deep features across sensors, or understand individual feature’s relative importance for systematic global inference. In this paper, we propose multi-view convolutional neural networks exploiting likelihood estimation. Proof-of-concept experiments show that our likelihood-based context selection and weighted averaging collaboration scheme can decrease an endpoint’s communication and energy costs by a factor of 3×, while achieving high accuracy comparable to the original aggregation approaches.
- Shuo-Han Chen
- Ming-Chang Yang
- Yuan-Hao Chang
With the emergence of bit-alterable 3D NAND flash, programming and erasing a flash cell at bit-level granularity have become a reality. Bit-level operations can benefit the high-density, high-bit-error-rate 3D NAND flash by realizing the "bit-level rewrite operation," which can refresh error bits at bit-level granularity to reduce the error correction latency and improve read performance with minimal lifetime expense. Different from existing refresh techniques, bit-level operations can lower the lifetime expense by removing error bits directly without page-based rewrites. However, since bit-level rewrites may induce a similar amount of latency as conventional page-based rewrites and thus lead to low rewrite throughput, the efficiency of bit-level rewrites should be carefully considered. This observation motivates us to propose a bit-level error removal (BER) scheme to derive the most efficient way of utilizing bit-level operations for both lifetime and read performance optimization. A series of experiments was conducted to demonstrate the capability of the BER scheme, with encouraging results.
- Shunzhuo Wang
- Fei Wu
- Chengmo Yang
- Jiaona Zhou
- Changsheng Xie
- Jiguang Wan
Superblocks are widely employed in SSDs to improve performance. However, the standard superblock organization, which links blocks with the same block ID across planes into one superblock, leads to unavoidable lifetime waste in SSDs due to inter-block wear tolerance variations. This work proposes a wear-aware superblock management scheme, called WAS, which (1) dynamically organizes superblocks according to real-time block wear levels so that strong blocks relieve wear on weak ones, and (2) employs a wear-based garbage collection scheme to reduce the inter-block wear gap. Comprehensive experiments are carried out in SSDsim. Results show that WAS greatly prolongs SSD lifetime, by 51.3% compared with the state-of-the-art superblock management.
- Fei Li
- Youyou Lu
- Zhongjie Wu
- Jiwu Shu
With increased density, flash memory becomes more vulnerable to errors. Error correction incurs high overhead, which is especially sensitive in SSD caches. However, some applications, like multimedia processing, have an intrinsic tolerance of inaccuracies. In this paper, we propose ASCache, an approximate SSD cache, which allows bit errors within a controllable threshold for error-tolerant applications, so as to reduce the cache miss ratio caused by incorrect cache pages. ASCache further trades the strictness of error correction mechanisms for higher SSD access performance. Evaluations show ASCache reduces the average read latency by up to 30% and the cache miss ratio by 52%.
- Qiao Li
- Liang Shi
- Jun Yang
- Youtao Zhang
- Chun Jason Xue
With increasing bit density and the adoption of 3D NAND, flash memory suffers from increased errors. To address the issue, flash devices adopt error correction codes (ECC) with strong error correction capability, like low-density parity-check (LDPC) codes, to correct errors. The drawback of LDPC is that, to correct data with a high raw bit error rate (RBER), read latency is amplified. This work proposes to address this issue with the assistance of approximate data. First, studies are conducted that show there is an ample amount of approximate data available in flash storage. Second, a novel data organization is proposed to fortify the reliability of regular data by leaving approximate data unprotected. Finally, a new data allocation strategy and a modified garbage collection scheme are presented to complete the design. The experimental results show that the proposed approach can improve read performance by 30% on average compared to current techniques.
- Jingsong Chen
- Jinwei Liu
- Gengjie Chen
- Dan Zheng
- Evangeline F. Y. Young
The continuous development of modern VLSI technology has brought new challenges for on-chip interconnections. Different from classic net-by-net routing, bus routing requires all the nets (bits) in the same bus to share similar or even the same topology, besides considering wire length, via count, and other design rules. In this paper, we present MARCH, an efficient maze routing method under a concurrent and hierarchical scheme for buses. In MARCH, to achieve the same topology, all the bits in a bus are routed concurrently like marching in a path. For efficiency, our method is hierarchical, consisting of a coarse-grained topology-aware path planning and a fine-grained track assignment for bits. Additionally, an effective rip-up and reroute scheme is applied to further improve the solution quality. In experimental results, MARCH significantly outperforms the first place at 2018 IC/CAD Contest in both quality and runtime.
- Chen-Hao Hsu
- Shao-Chun Hung
- Hao Chen
- Fan-Keng Sun
- Yao-Wen Chang
As clock frequencies increase, topology-matching bus routing is desired to provide an initial routing result which facilitates the following buffer insertion to meet the timing constraints. Our algorithm consists of three main techniques: (1) a bus clustering method to reduce the routing complexity, (2) a DAG-based algorithm to connect a bus in the specific topology, and (3) a rip-up and re-route scheme to alleviate the routing congestion. Experimental results show that our proposed algorithm outperforms all the participating teams of the 2018 CAD Contest at ICCAD, where the top-3 routers result in 145%, 158%, and 420% higher costs than ours.
- Jihye Kwon
- Matthew M. Ziegler
- Luca P. Carloni
Logic synthesis and physical design (LSPD) tools automate complex design tasks previously performed by human designers. One time-consuming task that remains manual is configuring the LSPD flow parameters, which significantly impacts design results. To reduce the parameter-tuning effort, we propose an LSPD parameter recommender system that involves learning a collaborative prediction model through tensor decomposition and regression. Using a model trained with archived data from multiple state-of-the-art 14nm processors, we reduce the exploration cost while achieving comparable design quality. Furthermore, we demonstrate the transfer-learning properties of our approach by showing that this model can be successfully applied for 7nm designs.
The physical design process commonly consumes hours to days for large designs, and routing is known to be the most critical step. Demand for accurate routing quality prediction has risen to a new level to accelerate hardware innovation at advanced technology nodes. This work presents an approach that forecasts the density of all routing channels over the entire floorplan, using features collected up to the placement stage, with conditional GANs. Specifically, forecasting routing congestion is cast as an image translation (colorization) problem. The proposed approach is applied to a) placement exploration for minimum congestion, b) constrained placement exploration, and c) forecasting congestion in real time during incremental placement, using eight designs targeting a fixed FPGA architecture.
- Tao-Chun Yu
- Shao-Yun Fang
- Hsien-Shih Chiu
- Kai-Shun Hu
- Philip Hui-Yuh Tai
- Cindy Chin-Fang Shen
- Henry Sheng
- Bentian Jiang
- Xiaopeng Zhang
- Ran Chen
- Gengjie Chen
- Peishan Tu
- Wei Li
- Evangeline F. Y. Young
- Bei Yu
Dummy fill insertion is a mandatory step in modern semiconductor manufacturing processes to reduce dielectric thickness variation and provide nearly uniform pattern density for the chemical mechanical planarization (CMP) process. However, with the continuous shrinking of VLSI technology nodes, the coupling effects between the inserted metal fills and signal tracks can severely affect the original timing closure of the layout design. In this paper, we propose a robust, efficient, and high-performance framework for timing-aware dummy fill insertion, which simultaneously minimizes the coupling capacitance of critical signal wires and other wires. Experimental results on the 2018 CAD Contest at ICCAD benchmarks show that our proposed framework outperforms the contest winner by 8% on critical coupling capacitance with a 3.3× runtime speedup.
- Jonathan Cruz
- Prabhat Mishra
- Swarup Bhunia
Electronic hardware trust is an emerging concern for all stakeholders in the semiconductor industry. Trust issues in electronic hardware span all stages of its life cycle, from creation of intellectual property (IP) blocks to manufacturing, test, and deployment of hardware components, and all abstraction levels, from chips to printed circuit boards (PCBs) to systems. The trust issues originate from a horizontal business model that promotes reliance on untrusted third-party facilities, tools, and IPs in the hardware life cycle. Today, designers are tasked with verifying the integrity of third-party IPs before incorporating them into system-on-chip (SoC) designs. Existing trust metric frameworks have limited applicability since they are not comprehensive: they capture only a subset of vulnerabilities, such as potential vulnerabilities introduced through design mistakes and CAD tools, or quantify features in a design that target a particular Trojan model. Therefore, current practice relies on ad-hoc security analysis of IP cores. In this paper, we propose a vector-based comprehensive coverage metric that quantifies the overall trust of an IP considering both vulnerabilities and direct malicious modifications. We use a variable weighted sum of a design's functional coverage, structural coverage, and asset coverage to assess an IP's integrity. Designers can also effectively use our trust metric to compare the relative trustworthiness of functionally equivalent third-party IPs. To demonstrate the applicability and usefulness of the proposed metric, we apply our trust metric to Trojan-free and Trojan-inserted variants of an IP. Our results demonstrate that we are able to successfully distinguish between trusted and untrusted IPs.
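The metric itself is described as a variable weighted sum of three coverage components, so a minimal sketch is straightforward; the weights and coverage values below are hypothetical placeholders, not the paper's calibration.

```python
# Minimal sketch of a weighted-sum trust metric; the weights below are
# hypothetical placeholders, not the values used in the paper.

def trust_metric(functional_cov, structural_cov, asset_cov,
                 weights=(0.4, 0.3, 0.3)):
    """Each coverage value is assumed to be normalized to [0, 1];
    a higher score suggests a more trustworthy IP."""
    w_f, w_s, w_a = weights
    return w_f * functional_cov + w_s * structural_cov + w_a * asset_cov

# Comparing two functionally equivalent third-party IPs:
ip_a = trust_metric(0.92, 0.88, 0.95)
ip_b = trust_metric(0.76, 0.70, 0.40)   # e.g., unexercised logic around assets
print(ip_a, ip_b)
```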
- Hans Liljestrand
- Thomas Nyman
- Jan-Erik Ekberg
- N. Asokan
Shadow stacks are the go-to solution for perfect backward-edge control-flow integrity (CFI). Software shadow stacks trade off security for performance. Hardware-assisted shadow stacks are efficient and secure, but expensive to deploy. We present the authenticated call stack (ACS), a novel mechanism for precise verification of return addresses using aggregated message authentication codes. We show how ACS can be realized using ARMv8.3-A pointer authentication, a new low-overhead mechanism for protecting pointer integrity. Our solution achieves security comparable to hardware-assisted shadow stacks while incurring negligible performance overhead (< 0.5%) and requiring no additional hardware support.
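Conceptually, ACS binds each return address to the entire call history with an aggregated authentication token. The sketch below illustrates that chaining with HMAC in Python; the real mechanism uses ARMv8.3-A pointer-authentication instructions and keys, so this is only an assumption-laden stand-in, not the paper's implementation.

```python
# Conceptual sketch of an authenticated call stack using a chained MAC.
# Real ACS uses ARMv8.3-A pointer authentication (PAC) instructions; HMAC-SHA256
# here is only a stand-in to show how tokens aggregate across call frames.
import hmac, hashlib

KEY = b"per-boot-secret-key"

def mac(prev_token: bytes, ret_addr: int) -> bytes:
    msg = prev_token + ret_addr.to_bytes(8, "little")
    return hmac.new(KEY, msg, hashlib.sha256).digest()[:8]

def call(chain, ret_addr):
    # On function entry: bind the new return address to the whole call history.
    prev = chain[-1][1] if chain else b"\0" * 8
    return chain + [(ret_addr, mac(prev, ret_addr))]

def ret(chain, ret_addr):
    # On return: recompute and compare before using the address.
    prev = chain[-2][1] if len(chain) > 1 else b"\0" * 8
    assert hmac.compare_digest(mac(prev, ret_addr), chain[-1][1]), "return address corrupted"
    return chain[:-1]

chain = call(call([], 0x400A10), 0x400B20)
chain = ret(chain, 0x400B20)      # verifies and pops
# ret(chain, 0x401337)            # would trip the assertion (ROP-style tamper)
```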
- Urbi Chatterjee
- Pranesh Santikellur
- Rajat Sadhukhan
- Vidya Govindan
- Debdeep Mukhopadhyay
- Rajat Subhra Chakraborty
This work proposes a scheme to detect, isolate, and mitigate malicious disruption of electro-mechanical processes in legacy PLCs, where each PLC works as a finite state machine (FSM) and goes through predefined states depending on the control flow of the programs and the input-output mechanism. The scheme generates a group signature for a particular state by combining the signature shares from each of these PLCs using a (k,l)-threshold signature scheme. If some of the PLCs are affected by malicious code, the signature can still be verified by k of the l uncorrupted PLCs and used to detect the corrupted PLCs and the compromised state. We use the OpenPLC software to simulate a legacy PLC system on a Raspberry Pi and show an I/O pin configuration attack on digital and pulse width modulation (PWM) pins. We describe the protocol using a small prototype of five instances of legacy PLCs running simultaneously on the OpenPLC software. We show that when our proposed protocol is deployed, the aforementioned attacks are successfully detected and the controller takes corrective measures. This work was developed as part of the problem statement given in the Cyber Security Awareness Week 2017 competition.
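To illustrate the k-of-l idea, here is a toy Shamir secret-sharing sketch in which any k uncorrupted parties can reconstruct a group secret; the paper's actual (k,l)-threshold signature scheme is more involved, so treat this purely as an illustration of the threshold property.

```python
# Toy illustration of the k-of-l idea with Shamir secret sharing over a prime
# field; the paper uses a (k,l)-threshold *signature* scheme, which this
# simplified sketch does not implement.
import random

P = 2**61 - 1   # a Mersenne prime, large enough for a toy demo

def make_shares(secret, k, l):
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):      # Horner evaluation of the polynomial
            acc = (acc * x + c) % P
        return acc
    return [(x, f(x)) for x in range(1, l + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term (the secret).
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

group_secret = 123456789            # would key a group signature on a PLC state
shares = make_shares(group_secret, k=3, l=5)
print(reconstruct(shares[:3]) == group_secret)                 # any 3 of 5 suffice
print(reconstruct(random.sample(shares, 3)) == group_secret)
```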
- Sonal Yadav
- Vijay Laxmi
- Manoj Singh Gaur
- Hemangee K. Kapoor
The Network Demultiplexer (Net-Demux) is an essential hardware unit in multi-NoC systems for distributing traffic between the NoC networks. This paper proposes a novel placement of the Net-Demux at the control plane of the router's switch allocator, which improves static power and energy efficiency compared to the conventional data-plane placement at the Network Interface (NI).
- Manaar Alam
- Debdeep Mukhopadhyay
Deep learning has become a de facto paradigm for various prediction problems, including many privacy-preserving applications where the privacy of data is a serious concern. There have been efforts to analyze and exploit information leakage from DNNs to compromise data privacy. In this paper, we provide an evaluation strategy for such information leakage through a DNN by considering a case study on a CNN classifier. The approach utilizes low-level hardware information provided by hardware performance counters and hypothesis testing during the execution of the CNN to raise alarms if there is any information leakage about the actual input.
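A hedged sketch of the statistical side of such an evaluation: compare hardware-performance-counter traces collected for two input classes with a two-sample test and flag a potential leak when the distributions differ significantly. The counter values and the choice of Welch's t-test are assumptions for illustration, not the paper's exact statistic.

```python
# Hedged sketch: a two-sample t-test on hardware-performance-counter traces
# collected while the CNN classifies two different input classes.
# Counter values below are synthetic.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# e.g., branch-miss counts per inference, gathered via `perf` for two inputs
hpc_class_a = rng.normal(loc=10_500, scale=300, size=50)
hpc_class_b = rng.normal(loc=10_900, scale=300, size=50)

stat, p_value = ttest_ind(hpc_class_a, hpc_class_b, equal_var=False)
if p_value < 0.01:
    print(f"possible leakage: counters distinguish inputs (p = {p_value:.3g})")
else:
    print("no statistically significant leakage observed")
```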
- Riadul Islam
- Md Asif Shahjalal
At leading technology nodes, the industry faces a stiff challenge to make profitable ICs. One of the primary issues is design rule checking (DRC) violations. In this research, we work within the DARPA IDEA program, which aims for "no-human-in-the-loop", 24-hour turnaround time to implement an IC from design specifications. To reduce human effort, we introduce an ensemble random forest algorithm to predict DRC violations before global routing, which is considered the most time-consuming step in an IC design flow. In addition, we identify features that critically impact DRC violations. The algorithm achieves a 5.8% better F1-score compared to existing SVM classifiers.
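A minimal scikit-learn sketch of the ensemble random forest idea, trained on synthetic placement-stage features and scored with F1; the feature names and data are hypothetical stand-ins for the features the paper identifies.

```python
# Illustrative sketch with scikit-learn; features and labels are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([
    rng.uniform(0, 1, n),     # e.g., local pin density
    rng.uniform(0, 1, n),     # e.g., cell density in the placement bin
    rng.integers(0, 8, n),    # e.g., number of nets crossing the bin
])
y = (0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * X[:, 2]
     + rng.normal(0, 0.1, n)) > 0.7          # 1 = bin has a DRC violation

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
print("feature importances:", clf.feature_importances_)
```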
- Kourosh Hakhamaneshi
- Nick Werblun
- Pieter Abbeel
- Vladimir Stojanović
A deep neural network (DNN) based stochastic combinatorial optimization framework is presented that can find the optimal sizing of circuits in a sample-efficient manner. This sample efficiency allows us to unify this framework with generator-based tools like Berkeley Analog Generator (BAG) [1] to directly optimize layout, given the high level circuit specifications. We use this tool to design an optical link receiver layout, satisfying high-level design specifications, using post-layout simulations of only 348 design instances. Compared to an evolutionary algorithm without our DNN-based discriminator, our framework improves the sample efficiency and run time by more than 200x.
- Tsung-Wei Huang
- Chun-Xun Lin
- Martin D. F. Wong
As design complexities continue to grow, the need to efficiently analyze circuit timing with billions of transistors is quickly becoming the major bottleneck in the overall chip design flow. In this work we introduce a distributed timer that (1) has scalable performance, (2) can be seamlessly integrated into existing EDA applications, (3) enables transparent resource management, and (4) has robust fault-tolerant control. We evaluate the distributed timer using a set of large industry benchmarks on a cluster with 24 nodes. The results show that the proposed timer achieves full accuracy over all designs with high performance and good scalability.
- Onur Sahin
- Assel Aliyeva
- Hariharan Mathavan
- Ayse Coskun
- Manuel Egele
The ability to repeat the execution of a program is a fundamental requirement in evaluating computer systems and apps. Reproducing executions of mobile apps has proven difficult under real-life scenarios due to the many sources of external input and the interactive nature of the apps. We present a new practical record/replay framework for Android, RandR, which handles multiple sources of input and provides cross-device replay capabilities through a dynamic instrumentation approach. We demonstrate the feasibility of RandR by recording and replaying a set of real-world apps.
- Zheng-Hong Zhang
- Wei Chu
- Shi-Yu Huang
The Tunable Delay Line (TDL) is the most important building block in modern cell-based timing circuits such as the Phase-Locked Loop (PLL) and the Delay-Locked Loop (DLL). Previously proposed TDLs face a dilemma: they cannot be both power-efficient and environmentally adaptive at the same time. In this paper, we present an effective solution to this dilemma: a novel "ping-pong delay line" architecture. The idea is to use two small cell-based delay lines operated in a synergistic manner, in the sense that they exchange the "role of command" dynamically, as in a ping-pong game, thereby jointly reacting to severe environmental changes over a very wide range. The proposed ping-pong delay line has been incorporated into a Delay-Locked Loop (DLL) design to demonstrate its advantages through post-layout simulation.
- Po-Cheng Pan
- Chien-Chia Huang
- Hung-Ming Chen
An efficient synthesis technique for modern analog circuits is important yet challenging due to the repeated re-synthesis process, and precisely exploring the performance limits of an analog circuit in the target technology is time-consuming. This work presents a learning-based framework for searching for the performance limits of analog circuits. With a hierarchical architecture, the dimension of the solution space can be reduced. Bayesian linear regression and support vector machine models are selected to speed up the algorithm and achieve better performance quality. Experimental results show that our approach achieves up to a 9x runtime speed-up on two analog circuits without sacrificing performance quality.
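A small sketch of how the two surrogate models named above might be fit side by side with scikit-learn; the sizing parameters, the gain-like target, and the hyperparameters are synthetic assumptions, not the paper's setup.

```python
# Hedged sketch of the two surrogate models named in the abstract, using
# scikit-learn; the circuit data here is synthetic.
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.svm import SVR

rng = np.random.default_rng(2)
# X: hypothetical sizing parameters (W/L ratios, bias currents); y: e.g., gain
X = rng.uniform(0, 1, size=(300, 4))
y = 40 + 10 * X[:, 0] - 5 * X[:, 1] ** 2 + rng.normal(0, 0.5, 300)

blr = BayesianRidge().fit(X, y)              # fast linear trend estimate
svm = SVR(kernel="rbf", C=10.0).fit(X, y)    # captures nonlinear behavior

x_new = rng.uniform(0, 1, size=(1, 4))
print(blr.predict(x_new), svm.predict(x_new))
```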
- Bahar Asgari
- Ramyad Hadidi
- Hyesoon Kim
- Sudhakar Yalamanchili
The performance of sparse problems suffers from a lack of spatial locality and low memory bandwidth utilization. However, the distribution of non-zero values in the data structures of a class of sparse problems, such as matrix operations in neural networks, is modifiable so that it can be matched with efficient underlying hardware, such as systolic arrays. Such modification helps address the challenges coupled with sparsity. To efficiently execute sparse neural network inference on systolic arrays, we propose a structured pruning algorithm that increases the spatial locality in neural network models while maintaining inference accuracy.
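A minimal numpy sketch of structured (block-wise) pruning aligned to a systolic tile: whole tiles with the smallest norms are zeroed so the surviving non-zeros stay contiguous. The tile size, keep ratio, and norm criterion are illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch of structured (block-wise) pruning matched to a systolic tile.
import numpy as np

def prune_blocks(W, tile=8, keep_ratio=0.5):
    """Zero out whole tile x tile blocks with the smallest L2 norm, so the
    surviving non-zeros stay contiguous and map cleanly onto systolic tiles."""
    rows, cols = W.shape
    assert rows % tile == 0 and cols % tile == 0
    blocks = W.reshape(rows // tile, tile, cols // tile, tile)
    norms = np.linalg.norm(blocks, axis=(1, 3))            # one norm per block
    threshold = np.quantile(norms, 1.0 - keep_ratio)
    mask = (norms >= threshold)[:, None, :, None]          # broadcast over tiles
    return (blocks * mask).reshape(rows, cols)

W = np.random.default_rng(3).normal(size=(64, 64))
W_pruned = prune_blocks(W, tile=8, keep_ratio=0.5)
print("density:", np.count_nonzero(W_pruned) / W_pruned.size)
```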
- Ramyad Hadidi
- Jiashen Cao
- Michael S. Ryoo
- Hyesoon Kim
Internet of Things (IoT) devices have access to an abundance of raw data for processing. With deep neural networks (DNNs), not only is the demand for computing power on IoT devices increasing, but privacy concerns are also motivating close-to-edge computation. Executing a DNN by distributing its computation is common in IoT systems. However, managing unstable network latencies and intermittent failures is a serious challenge. Our work provides robustness and close-to-zero recovery latency by adapting coded distributed computing (CDC). We analyze robust execution on a mesh of Raspberry Pis by studying four DNNs.
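The CDC idea can be illustrated on a single layer's matrix-vector product: shard the weights across workers and add one parity shard so the result survives a failed worker. The sketch below is a toy single-parity example, not the paper's coding scheme.

```python
# Minimal sketch of coded distributed computing for one DNN layer:
# split a matrix-vector product across workers and add one parity worker so
# the result survives a single straggler/failure. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(6, 4))        # one layer's weights
x = rng.normal(size=4)             # activation vector

W1, W2 = W[:3], W[3:]              # data workers' shards
Wp = W1 + W2                       # parity worker's shard (the "code")

y1, y2, yp = W1 @ x, W2 @ x, Wp @ x    # each worker computes locally

# Suppose worker 2 fails: its partial result is recovered from the parity.
y2_recovered = yp - y1
y = np.concatenate([y1, y2_recovered])
print(np.allclose(y, W @ x))       # True: full output despite the failure
```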
- Pankaj Bhowmik
- Md Jubaer Hossain Pantho
- Christophe Bobda
This paper presents a reconfigurable hardware architecture for smart image sensors to speed up low-level image processing applications at the pixel level. For each pixel in the sensor plane, the design includes an activation module and a processor. The processor has a basic structure common to all applications and reconfigurable segments for specific applications. Visual-cortex-inspired computing, such as predictive coding in time, is implemented in the activation module to remove temporal redundancy. The ASIC implementation shows the design saves up to 84.01% dynamic power and achieves a 9x speedup at 800 MHz through accurate prediction.
- Shaohan Hu
- Dmitri Maslov
- Marco Pistoia
- Jay Gambetta
Quantum computing has increasingly drawn interest and investments from the academic, industrial, and governmental research communities worldwide. Among quantum algorithms, Quantum Search is important for its quadratic speedup over its classical-computing counterpart. A key ingredient in its implementation is the Multi-Control Toffoli (MCT) gate, which creates a Boolean product of control variables and XORs it into the target. On an idealized quantum computer, all-to-all connectivity would eliminate the need to use SWAP gates to communicate information. This is, however, not affordable in the current Noisy Intermediate-Scale Quantum (NISQ) computing era. In this work, we discuss how to efficiently implement MCT gates on 2D Square Lattices (2DSL), suitable for superconducting circuits, by taking advantage of relative-phase Toffoli gates and H-tree layouts to drastically reduce resulting circuits’ depths and the amount of SWAPping required.
- Chandan Kumar Jha
- Joycee Mekie
Approximate computing has gained a lot of popularity due to its energy benefits in a variety of error-tolerant applications. In this paper, we propose an adder that can perform either a single n-bit exact addition or dual approximate additions (SEDA) and is suitable for processors. The conversion from exact to approximate addition can be done dynamically at runtime. The maximum error of SEDA adders is bounded because the carry is not approximated. Our proposed design consumes 48% less energy, has 32% less delay, and occupies 24% less area compared to an exact mirror adder.
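Since the abstract only states that sum bits may be approximated while the carry is kept exact, the following behavioral sketch models that property generically (approximate low sum bits, exact carry into the upper part); it is an interpretation for illustration, not the SEDA circuit.

```python
# Behavioral sketch only -- not the SEDA transistor-level design. It models the
# general property the abstract states: sum bits may be approximated while the
# carry is generated exactly, which bounds the maximum error. Here the k low
# sum bits use a simplified OR, and the carry into bit k is exact.
def approx_add(a, b, n=8, k=4, approximate=True):
    if not approximate:
        return (a + b) & ((1 << n) - 1)           # exact n-bit addition
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)                  # approximated low sum bits
    carry = ((a & mask) + (b & mask)) >> k         # exact carry into the upper part
    high = ((a >> k) + (b >> k) + carry) & ((1 << (n - k)) - 1)
    return (high << k) | low

print(approx_add(100, 29, approximate=False))  # 129 (exact)
print(approx_add(100, 29, approximate=True))   # 141; error confined to the k low bits
```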
- Xiaoming Chen
- Longxiang Yin
- Bosheng Liu
- Yinhe Han
- Tianshi Wang
- Leon Wu
- Jaijeet Roychowdhury
In this paper, we report new results on a novel Ising machine technology for solving combinatorial optimization problems using networks of coupled self-sustaining oscillators. Specifically, we present several working hardware prototypes using CMOS electronic oscillators, built on bread-boards/perfboards and PCBs, implementing Ising machines consisting of up to 240 spins with programmable couplings. We also report that, just by simulating the differential equations of such Ising machines of larger sizes, good solutions can be achieved easily on benchmark optimization problems, demonstrating the effectiveness of oscillator-based Ising machines.
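A minimal numerical sketch of coupled-oscillator Ising dynamics: Kuramoto-type phase equations with a sub-harmonic injection-locking term that binarizes phases toward 0 or π, applied to a tiny MAX-CUT instance. The coupling signs, constants, and problem instance are illustrative assumptions rather than values from the paper.

```python
# Minimal numerical sketch of oscillator-based Ising dynamics (Kuramoto-type
# phases plus a sub-harmonic injection-locking term). Constants, signs, and
# the tiny MAX-CUT instance are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(5)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]    # 5-node ring, max cut = 4
n = 5
J = np.zeros((n, n))
for i, j in edges:                                   # negative coupling -> anti-phase
    J[i, j] = J[j, i] = -1.0

theta = rng.uniform(0, 2 * np.pi, n)                 # random initial phases
K, Ks, dt = 1.0, 0.5, 0.05
for _ in range(2000):
    diff = theta[:, None] - theta[None, :]
    dtheta = -K * np.sum(J * np.sin(diff), axis=1) - Ks * np.sin(2 * theta)
    theta += dt * dtheta

spins = np.where(np.cos(theta) >= 0, 1, -1)          # phases settle near 0 or pi
cut = sum(spins[i] != spins[j] for i, j in edges)
print("spins:", spins, "cut size:", cut)             # typically 4 (optimal for C5)
```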
- Renhai Chen
- Qiming Guan
- Guohua Yan
- Zhiyong Feng
In this paper, we lead the first efforts towards intelligent RDF data management in SSDs. We propose to deeply fuse the RDF data in SSDs. In detail, the operations (e.g., data query) applied to RDF can be directly achieved in SSDs. To this end, we explore two RDF data organizations (e.g., triple-based) with the consideration of the internal structure of SSDs. The experiment is conducted on the Patient Disease Drug (PDD) Graph dataset [11]. The experimental results show that the proposed two strategies achieve the comprehensive, scalable in-SSD computation from different aspects (e.g., space efficiency or query efficiency).