SLIP 2016 TOC

A Comparative Analysis of Front-End and Back-End Compatible Silicon Photonic On-Chip
Interconnects

  • Thakkar
    Ishan G.

Photonic devices fabricated with back-end compatible silicon photonic (BCSP) materials
can provide independence from the complex CMOS front-end compatible silicon photonic
(FCSP) process, to significantly enhance photonic network-on-chip (PNoC) …

Latch Clustering for Minimizing Detection-to-Boosting Latency Toward Low-Power Resilient
Circuits

  • Hsu
    Chih-Cheng

Dynamic voltage scaling (DVS) has become one of the most effective approaches to achieve
ultra-low-power SoC. To eliminate timing errors arising from DVS, several error-resilient
circuit design techniques were proposed to detect and/or correct timing …

Connectivity Effects on Energy and Area for Neuromorphic System with High Speed Asynchronous
Pulse Mode Links

  • Segal
    Carrie

Hardware neuromorphic systems are challenged to achieve biologically realistic levels
of interconnectivity. When building a physical implementation of a neural net, the
properties of the media immediately impose limits on the number of interconnects and

Buffered Interconnects in 3D IC Layout Design

  • Ahmed
    Mohammad A.

A very important challenge in designing through-silicon via (TSV)-based 3D ICs is
to accurately estimate, through all stages of the physical design, the interconnect
delay which is strongly dependent on the layout of 3D IC. The earlier in the design

Topologically-Geometric Routing

  • Bazylevych
    Roman

The paper introduces foundations of the “Flexible Routing Method” that belongs to
the topologically-geometric type. It develops the idea of dividing the routing problem
into two separate successive stages: topological and geometrical. At the first stage
it …

Revisiting 3DIC Benefit with Multiple Tiers

  • Chan
    Wei-Ting Jonas

3DICs with multiple tiers are expected to achieve large benefits (e.g., in terms of
power, area) as compared to conventional planar designs. However, few if any previous
works study upper bounds on power and area benefits from 3DIC integration with …

Spin-Hall Assisted STT-RAM Design and Discussion

  • Eken
    Enes

In recent years, Spin-Transfer Torque Random Access Memory (STT-RAM) has attracted
significant attention from both industry and academia due to its attractive attributes
such as small cell area and non-volatility. However, long switching time and large

A Demand-Aware Predictive Dynamic Bandwidth Allocation Mechanism for Wireless Network-on-Chip

  • Mansoor
    Naseef

Long distance data communication over multi-hop wireline paths in conventional Networks-on-Chips
(NoCs) causes high energy consumption and degradation in bandwidth. Wireless interconnects
in the millimeter-wave band have emerged as an energy-efficient …

ISPD 2018 TOC

SESSION: Keynote Address

Session details: Keynote Address

  • Chu
    Chris

Challenges and Opportunities in Automotive, Industrial, and IoT Physical Design

  • Hill
    Anthony M.

Taping out modern, complex SOCs presents a myriad of challenges in physical design.
Doing so for demanding markets such as automotive, industrial, and IoT multiplies
that complexity. In this talk we will take a broad look across the physical design

SESSION: Finding the Golden Tree in the Forest!

Session details: Finding the Golden Tree in the Forest!

  • Yeap
    Gary

Wot the L: Analysis of Real versus Random Placed Nets, and Implications for Steiner Tree Heuristics

  • Kahng
    Andrew B.

The NP-hard Rectilinear Steiner Minimum Tree (RSMT) problem has been studied in the
VLSI physical design literature for well over three decades. Fast estimators of RSMT
cost (which reflects routed wirelength) are a required ingredient of modern physical

Prim-Dijkstra Revisited: Achieving Superior Timing-driven Routing Trees

  • Alpert
    Charles J.

The Prim-Dijkstra (PD) construction [1] was first presented over 20 years ago as a way to efficiently
trade off between shortest-path and minimum-wirelength routing trees. This approach
has stood the test of time, having been integrated into leading …

Construction of All Rectilinear Steiner Minimum Trees on the Hanan Grid

  • Lin
    Sheng-En David

Given a set of pins, a Rectilinear Steiner Minimum Tree (RSMT) connects the pins using
only rectilinear edges with the minimum wirelength. RSMT construction is heavily used
at various design steps such as floorplanning, placement, routing, and …

SESSION: FPGA Special Session

Session details: FPGA Special Session

  • Das
    Sabya

Challenges in Large FPGA-based Logic Emulation Systems

  • Hung
    William N.N.

Functional verification is an important aspect of electronic design automation. Traditionally,
simulation at the register transfer-level has been the mainstream functional verification
approach. Formal verification and various static analysis checkers …

Flexibility: FPGAs and CAD in Deep Learning Acceleration

  • Chiu
    Gordon R.

Deep learning inference has become the key workload to accelerate in our AI-powered
world. FPGAs are an ideal platform for the acceleration of deep learning inference
by combining low-latency performance, power-efficiency, and flexibility. This paper

Exploration and Tradeoffs of different Kernels in FPGA Deep Learning Applications

  • Delaye
    Elliott

In the field of deep learning, efficient computational hardware has come to the forefront
of the large scale implementation and deployment of many applications. In the process
of designing hardware, various characteristics of hardware platforms have …

Architecture Exploration of Standard-Cell and FPGA-Overlay CGRAs Using the Open-Source
CGRA-ME Framework

  • Chin
    S. Alexander

We describe an open-source software framework, CGRA-ME, for the modeling and exploration
of coarse-grained reconfigurable architectures (CGRAs). CGRAs are programmable hardware
devices having large ALU-like logic blocks, and datapath bus-style inter-…

SESSION: Design Flow and Power Grid Optimization

Session details: Design Flow and Power Grid Optimization

  • Iyer
    Mahesh

Concurrent High Performance Processor Design: From Logic to PD in Parallel

  • Stok
    Leon

The design of a high-performance processor in an advanced technology node is a highly
concurrent process. While most SoCs are designed with (fairly) stable IP, several
trends are driving the design of the micro-architecture, the logic and the physical

Towards a VLSI Design Flow Based on Logic Computation and Signal Distribution

  • Reis
    André

This paper discusses directions for a VLSI design flow based on a novel paradigm of
local logic computation and global signal distribution. In the last years there has
been an increasing effort to perform a better integration between logic synthesis
and …

Power Grid Reduction by Sparse Convex Optimization

  • Ye
    Wei

With the dramatic increase in the complexity of modern integrated circuits (ICs),
direct analysis and verification of IC power distribution networks (PDNs) have become
extremely computationally expensive. Various power grid reduction methods are …

SESSION: Statistical and Machine Learning-Based CAD

Session details: Statistical and Machine Learning-Based CAD

  • Kissiov
    Ivan

Machine Learning Applications in Physical Design: Recent Results and Directions

  • Kahng
    Andrew B.

In the late-CMOS era, semiconductor and electronics companies face severe product
schedule and other competitive pressures. In this context, electronic design automation
(EDA) must deliver “design-based equivalent scaling” to help continue essential …

Machine Learning for Feature-Based Analytics

  • Wang
    Li-C.

Applying machine learning in Electronic Design Automation (EDA) has received growing
interest in recent years. One approach to analyze data in EDA applications can be
called feature-based analytics. In this context, the paper explains the inadequacy
of …

Data Efficient Lithography Modeling with Residual Neural Networks and Transfer Learning

  • Lin
    Yibo

Lithography simulation is one of the key steps in physical verification, enabled by
the substantial optical and resist models. A resist model bridges the aerial image
simulation to printed patterns. While the effectiveness of learning-based solutions

SESSION: Three Shades of Placement!

Session details: Three Shades of Placement!

  • Shinnerl
    Joseph

Compact-2D: A Physical Design Methodology to Build Commercial-Quality Face-to-Face-Bonded 3D ICs

  • Ku
    Bon Woong

The recent advancement of wafer bonding technology offers fine-grained and silicon-space
overhead-free 3D interconnections in face-to-face (F2F) bonded 3D ICs. In this paper,
we propose a full-chip RTL-to-GDSII physical design solution to build high-…

Analog Placement Constraint Extraction and Exploration with the Application to Layout
Retargeting

  • Xu
    Biying

In analog/mixed-signal (AMS) integrated circuits (ICs), most of the layout design
efforts are still handled manually, which is time-consuming and error-prone. Given
the previous high-quality manual layouts containing valuable design expertise of …

Pin Assignment Optimization for Multi-2.5D FPGA-based Systems

  • Kuo
    Wan-Sin

Advanced 2.5D FPGAs with larger logic capacity and higher pin counts compared to conventional
FPGAs are commercially available. Some multi-FPGA systems have already utilized 2.5D
FPGAs. Commercial 2.5D FPGA consists of multiple dies connected through an …

SESSION: Commemoration for Professor Te Chiang Hu

Session details: Commemoration for Professor Te Chiang Hu

  • Kahng
    Andrew B.

Influence of Professor T. C. Hu’s Works on Fundamental Approaches in Layout

  • Kahng
    Andrew B.

Professor T. C. Hu has made numerous pioneering and fundamental contributions in combinatorial
algorithms, mathematical programming and operations research. His seminal 1985 IEEE
book VLSI Circuit Layout: Theory and Design, coedited with Prof. E. S. Kuh,…

Tree Structures and Algorithms for Physical Design

  • Cheng
    Chung-Kuan

Tree structures and algorithms provide a fundamental and powerful data abstraction
and methods for computer science and operations research. In particular, they enable
significant advancement of IC physical design techniques and design optimization.
For …

Pioneer Research on Mathematical Models and Methods for Physical Design

  • Chu
    Chris

In the inaugural International Symposium on Physical Design (ISPD) in 1997, Prof.
Te Chiang Hu delivered the keynote address “Physical Design: Mathematical Models
and Methods” [1]. Without any question, Prof. Hu has made a lot of foundational and

Theory and Algorithms of Physical Design

  • Cheng
    Chung-Kuan

SESSION: Interconnect Optimization and Detailed Routing Contest Results

Session details: Interconnect Optimization and Detailed Routing Contest Results

  • Yan
    Jackey

Interconnect Optimization Considering Multiple Critical Paths

  • Hu
    Jiang

Interconnect optimization, including buffer insertion and Steiner tree construction,
continues to be a pillar technology that largely determines overall chip performance.
Buffer insertion algorithms in published literature are mostly focused on …

Interconnect Physical Optimization

  • Janac
    K. Charles

The SoC Interconnect is one of the most important IPs in modern chips as it is the
logical and physical instantiation of an SoC architecture and carries virtually all
the SoC data. Interconnect IPs have to carry non-coherent, cache coherent, subsystem

ISPD 2018 Initial Detailed Routing Contest and Benchmarks

  • Mantik
    Stefanus

In advanced technology nodes, detailed routing becomes the most complicated and runtime
consuming stage. To spur detailed routing research, ISPD 2018 initial detailed routing
contest is hosted and it is the first ISPD contest on detailed routing …

SESSION: How to Make Your Foundry Happier?

Session details: How to Make Your Foundry Happier?

  • Hu
    Jiang

The Pressing Need for Electromigration-Aware Physical Design

  • Lienig
    Jens

Electromigration (EM) is becoming a progressively intractable design challenge due
to increased interconnect current densities. It has changed from something designers
“should” think about to something they “must” think about, i.e., it is now a definite

On Coloring and Colorability Analysis of Integrated Circuits with Triple and Quadruple
Patterning Techniques

  • Lvov
    Alexey

The continued delay of higher resolution alternatives for lithography, such as EUV,
is forcing the continued adoption of multi-patterning solutions in new technology
nodes, which include triple and quadruple patterning using several lithography-etch

Standard CAD Tool-Based Method for Simulation of Laser-Induced Faults in Large-Scale
Circuits

  • Viera
    Raphael A.C.

Designing secure integrated systems requires methods and tools dedicated to simulating
at early design stages the effects of laser-induced transient faults maliciously
injected by attackers. Existing methods for simulation of laser-induced transient

GLSVLSI 2019 TOC

SESSION: Keynote & Invited Talks

Thoughts on Edge Intelligence

  • Wolf
    Marilyn

Machine learning methods have exploded in the past half-dozen years. Machine learning
is being applied to a huge range of problems across the spectrum of applications.
Initial results relied on server-oriented computations. But many applications will

Automatic Implementation of Secure Silicon

  • Leef
    Serge

Throughout the past decade, cybersecurity threats have evolved from attacks focused
high in the software stack to progressively lower levels of computational hierarchy.
With the explosion of popularity and growing deployment of internet connected …

Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation

  • Mutlu
    Onur

Today’s systems are overwhelmingly designed to move data to computation. This design
choice goes directly against at least three key trends in systems that cause performance,
scalability and energy bottlenecks: 1) data access from memory is already a …

Innovations in IoT for a Safe, Secure, and Sustainable Future

  • Bhunia
    Swarup

Internet of things (IoT) promises to usher in the fourth industrial revolution through
an exponential growth of smart connected devices deployed in myriad application domains.
It gives rise to new relationships between man and smart connected machines …

SESSION: Tech Session 1: Design and Integration of Hardware Security Primitives

LPN-based Device Authentication Using Resistive Memory

  • Arafin
    Md Tanvir

Recent progress in the design and implementation of resistive memory components such
as RRAMs and PCMs has introduced opportunities for developing novel hardware security
solutions using unique physical properties of these devices. In this work, we …

Leveraging On-Chip Voltage Regulators Against Fault Injection Attacks

  • Vosoughi
    Ali

The security implications of utilizing an on-chip voltage regulator as a countermeasure
against fault injection attacks are investigated in this paper. The effect of the
size of the capacitors and number of phases of the voltage regulator on the …

On the Theoretical Analysis of Memristor based True Random Number Generator

  • Uddin
    Mesbah

Emerging nano-devices like memristors display stochastic switching behavior which
poses a big uncertainty in their implementation as the next-generation CMOS alternative.
However, this stochasticity provides an opportunity to design circuits for …

Control-Lock: Securing Processor Cores Against Software-Controlled Hardware Trojans

  • Šišejković
    Dominik

Malicious circuit modifications known as hardware Trojans represent a rising threat
to the integrated circuit supply chain. As many Trojans are activated based on a specific
sequence of circuit states, we have recognized the ease of utilizing an …

Lightweight Authenticated Encryption for Network-on-Chip Communications

  • Harttung
    Julian

In recent years, Network-on-Chip (NoC) has gained increasing popularity as a promising
solution for the challenging interconnection problem in multi-processor systems-on-chip
(MPSoCs). However, the interest of adversaries to compromise such systems grew …

SESSION: Tech Session 2: VLSI Circuits and Power Aware Design

Design of a Low-power and Small-area Approximate Multiplier using First the Approximate
and then the Accurate Compression Method

  • Yang
    Tongxin

Recently emerging applications, such as convolution neural networks (CNNs), which
process thousands of convolutional computations, require a large amount of power.
Multiplication is the key arithmetic in these applications and an approximate multiplier

GraphiDe: A Graph Processing Accelerator leveraging In-DRAM-Computing

  • Angizi
    Shaahin

In this paper, we propose GraphiDe, a novel DRAM-based processing-in-memory (PIM)
accelerator for graph processing. It transforms current DRAM architecture to massively
parallel computational units exploiting the high internal bandwidth of the modern

An Efficient Time-based Stochastic Computing Circuitry Employing Neuron-MOS

  • Erlina
    Tati

A compact and low energy circuitry of time-based stochastic computing (TBSC) has
been designed. In the TBSC theory, stochastic numbers (SNs) are represented by duty-cycle
of periodic signals. Additionally, multiplication and addition operations of the …

Monolithic 8×8 SiPM with 4-bit Current-Mode Flash ADC with Tunable Dynamic Range

  • Vinayaka
    Vikas

A monolithic photon-counting receiver consisting of an integrated silicon-photomultiplier
and a current-mode analog-to-digital converter (ADC) was designed, simulated and fabricated
in the AMS 0.35 μm SiGe BiCMOS process. The silicon photomultiplier (…

SESSION: Tech Session 3: VLSI for Machine Learning and Artificial Intelligence

A Systolic SNN Inference Accelerator and its Co-optimized Software Framework

  • Guo
    Shasha

Although Deep Neural Network (DNN) architectures have made some breakthroughs in computer
vision tasks, they are not close to biological brain neurons. Spiking Neural Network
(SNN) is highly expected to bridge the gap between artificial computing …

Dynamic Beam Width Tuning for Energy-Efficient Recurrent Neural Networks

  • Jahier Pagliari
    Daniele

Recurrent Neural Networks (RNNs) are state-of-the-art models for many machine learning
tasks, such as language modeling and machine translation. Executing the inference
phase of a RNN directly in edge nodes, rather than in the cloud, would provide …

Efficient Softmax Hardware Architecture for Deep Neural Networks

  • Du
    Gaoming

Deep neural network (DNN) has become a pivotal machine learning and object recognition
technology in the big data era. The softmax layer is one of the key component layers
for completing multi-classification tasks. However, the softmax layer contains …

HSIM-DNN: Hardware Simulator for Computation-, Storage- and Power-Efficient Deep Neural Networks

  • Sun
    Mengshu

Deep learning that utilizes large-scale deep neural networks (DNNs) is effective in
automatic high-level feature extraction but also computation and memory intensive.
Constructing DNNs using block-circulant matrices can simultaneously achieve hardware

SESSION: Tech Session 4: Next Generation Interconnect: Architecture to Physical Design

An Area-Efficient Iterative Single-Precision Floating-Point Multiplier Architecture
for FPGA

  • Kim
    Sunwoong

Approximate multipliers have been widely used in critical applications, such as machine
learning and multimedia, which are tolerant to approximation errors. This paper proposes
a novel single-precision floating-point (SPFP) multiplication algorithm and …

An Automatic Transistor-Level Tool for GRM FPGA Interconnect Circuits Optimization

  • Li
    Zhengjie

Due to its dominance in FPGA area and delay, the interconnect circuit is traditionally
designed and optimized in full customized fashion, which can be extremely time consuming.
In this paper, we propose an automated transistor-level sizing optimization …

Low Voltage Clock Tree Synthesis with Local Gate Clusters

  • Sitik
    Can

In this paper, a novel local clock gate cluster-aware low voltage clock tree synthesis
methodology is introduced. In low voltage/swing clocking, timing closure is a challenging
problem due to tight skew and slew constraints. The clock gating makes this …

SESSION: Tech Session 5: Designing robust VLSI circuits. From approximate computing to hardware
security

TOIC: Timing Obfuscated Integrated Circuits

  • Alam
    Mahabubul

To counter the threats of reverse engineering (RE) and Trojan insertion, researchers
have considered gate-level obfuscation in integrated circuits (IC) as a viable solution.
However, several techniques are present in the literature to crack the …

Design for Eliminating Operation Specific Power Signatures from Digital Logic

  • Majumder
    Md Badruddoja

Conventional digital logic operations have distinguishable power signatures. Side
channel power analysis combined with classification algorithm can reveal unknown logic
operations. Revealing the underlying operations is the main task in reverse …

Non-Uniform Temperature Distribution in Interconnects and Its Impact on Electromigration

  • Abbasinasab
    Ali

We investigate the effect of electrically induced thermal load on interconnect reliability
and aging. We propose new models for uniform and non-uniform temperature evolution
and its steady state distribution in interconnects considering Joule heating …

Fault Classification and Coverage of Analog Circuits using DC Operating Point and
Frequency Response Analysis

  • Sanyal
    Sayandeep

Detection of faults in a mixed-signal SOC at the pre-silicon stage is a challenge,
especially when it has substantial analog components. Given the time taken for simulating
analog circuits, designing tests to detect faults in them is not a …

Crash Skipping: A Minimal-Cost Framework for Efficient Error Recovery in Approximate Computing Environments

  • Verdeja Herms
    Yan

We present a lightweight technique to minimize error recovery costs in approximate
computing environments. We take advantage of the key observation that if an application
crashes in a “non-critical” region of its execution, then skipping the crash and …

SESSION: Tech Session 6: Emerging Computing & Post-CMOS Technologies

Voltage-Controlled Magnetoelectric Memory Bit-cell Design With Assisted Body-bias
in FD-SOI

  • Cai
    Hao

Voltage-controlled magnetic anisotropy (VCMA)-magnetic tunnel junction (MTJ) is incorporated
into FD-SOI CMOS technology. The design space of 1 transistor-1 MTJ (1T-1M) bit-cell
is explored through varied VCMA pulse duration/amplitude and scaling down …

Low Cost Hybrid Spin-CMOS Compressor for Stochastic Neural Networks

  • Li
    Bingzhe

With expansion of neural network (NN) applications, lowering their hardware implementation
cost becomes an urgent task especially in back-end applications where the power-supply
is limited. Stochastic computing (SC) is a promising solution to realize low-…

Functionally Complete Boolean Logic and Adder Design Based on 2T2R RRAMs for Post-CMOS
In-Memory Computing

  • Yang
    Zongxian

In-memory computing (IMC) paradigm has attracted extensive attention for future electronics
to overcome the bottleneck and memory wall problem in the von Neumann systems. Nonvolatile
logic based on resistive random-access memory (RRAM) is a promising …

Jump Search: A Fast Technique for the Synthesis of Approximate Circuits

  • Witschen
    Linus

State-of-the-art frameworks for generating approximate circuits automatically explore
the search space in an iterative process – often greedily. Synthesis and verification
processes are invoked in each iteration to evaluate the found solutions and to …

SESSION: Tech Session 7: Physical Design and Obfuscation

SAT-Based Placement Adjustment of FinFETs inside Unroutable Standard Cells Targeting
Feasible DRC-Clean Routing

  • Sorokin
    Anton

In this paper, we present an algorithm of transistor placement that takes unroutable
standard cells and makes them routable by moving transistors in local windows. It
converts the task of placement of gridded FinFETs into a Boolean problem and employs

A Scalable and Process Variation Aware NVM-FPGA Placement Algorithm

  • Yang
    Chengmo

As non-volatile memory (NVM) based FPGAs gain increasing popularity, FPGA synthesis
tools start to tune the synthesis flow to match NVM characteristics. State-of-the-art
NVM FPGA placement algorithms tried to reduce the high reconfiguration cost induced

Functional Obfuscation of Hardware Accelerators through Selective Partial Design Extraction
onto an Embedded FPGA

  • Hu
    Bo

The protection of Intellectual Property (IP) has emerged as one of the most serious
areas of concern in the semiconductor industry. To address this issue, we present
a method and architecture to map selective portions of a design, given as a behavioral

HydraRoute: A Novel Approach to Circuit Routing

  • Khasawneh
    Mohammad

Routing for dense circuits is a major challenge for VLSI physical design. Most routing
approaches rely at least partially on a “rip-up and reroute” scheme, where solution
quality and run times can be impacted profoundly by the order in which nets are …

SESSION: Tech Session 8: Quantum Circuits and Emerging Technologies

Balanced Factorization and Rewriting Algorithms for Synthesizing Single Flux Quantum
Logic Circuits

  • Pasandi
    Ghasem

Single Flux Quantum (SFQ) logic with switching energy of 100zJ and switching delay
of 1ps is a promising post-CMOS candidate. Logic synthesis of these magnetic-pulse-based
circuits is a very important step in their design flow with a big impact on the …

A Majority Logic Synthesis Framework for Adiabatic Quantum-Flux-Parametron Superconducting
Circuits

  • Cai
    Ruizhe

Adiabatic Quantum-Flux-Parametron (AQFP) logic is an adiabatic superconductor logic
that has been proposed as an alternative to CMOS logic with extremely high energy efficiency.
In AQFP technology, majority-based gates have the same area as two-input AND/…

A Processing-In-Memory Implementation of SHA-3 Using a Voltage-Gated Spin Hall-Effect
Driven MTJ-based Crossbar

  • Yang
    Chengmo

Processing-In-Memory (PIM), which implements logic operations within memory cells,
opens up a new direction on organizing data and computation. Leveraging resistive
or magnetic characteristics of nonvolatile memory (NVM) devices, platforms such as
PLiM …

Exploring Processing In-Memory for Different Technologies

  • Gupta
    Saransh

The recent emergence of IoT has led to a substantial increase in the amount of data
processed. Today, a large number of applications are data intensive, involving massive
data transfers between processing core and memory. These transfers act as a …

SESSION: Tech Session 9: Towards Fast, Efficient, and Robust Memory

BLADE: A BitLine Accelerator for Devices on the Edge

  • Simon
    William Andrew

The increasing ubiquity of edge devices in the consumer market, along with their ever
more computationally expensive workloads, necessitate corresponding increases in computing
power to support such workloads. In-memory computing is attractive in edge …

Enhancing the Lifetime of Non-Volatile Caches by Exploiting Module-Wise Write Restriction

  • Agarwal
    Sukarn

The emerging Non-Volatile Memory (NVM) technologies offer a good combination of high
density and near-zero leakage power, becoming the strongest candidate in the memory
hierarchy including caches. However, the weak write endurance of these memories …

Mitigating the Performance and Quality of Parallelized Compressive Sensing Reconstruction
Using Image Stitching

  • Namazi
    Mahmoud

Orthogonal Matching Pursuit is an iterative greedy algorithm used to find a sparse
approximation for high-dimensional signals. The algorithm is most popularly used in
Compressive Sensing, which allows for the reconstruction of sparse signals at rates

Towards Optimizing Refresh Energy in embedded-DRAM Caches using Private Blocks

  • Manohar
    Sheel Sindhu

In recent years, the increased working set size of applications craves for more memory
demand in terms of large size Last Level Caches (LLC). To fulfill this, embedded DRAM
(eDRAM) caches have been considered as one of the best alternatives over …

SESSION: Tech Session 10: MSE

Extending Student Labs with SMT Circuit Implementation

  • Brunvand
    Erik

Computer Science and Computer Engineering classes related to digital circuits, embedded
systems, Human Computer Interaction (HCI), and a wide variety of “maker” subjects,
would often like to include physical computing projects. Extending these physical

Teaching the Next Generation of Cryptographic Hardware Design to the Next Generation
of Engineers

  • Aysu
    Aydin

Evolving threats against cryptographic systems and the increasing diversity of computing
platforms enforce teaching cryptographic engineering to a wider audience. This paper
describes the development of a new graduate course on hardware security taught …

A Web-based Remote FPGA Laboratory for Computer Organization Course

  • Wan
    Han

Learning in digital systems could be enhanced by applying a learn-by-doing mechanism.
In this paper the implementation of a web-based remote FPGA laboratory for Computer
Organization course is proposed. The projects created for this course are designed

System-on-a-Chip Design as a Platform for Teaching Design and Design Flow Integration

  • Covey
    Jacob

The design of microelectronic systems requires integration and cooperation across
multiple disciplines, but most curriculum is taught in unconnected pieces. This makes
the creation of manageable projects that reflect the design experience very …

SESSION: Poster Sessions I, II

UPIM: Unipolar Switching Logic for High Density Processing-in-Memory Applications

  • Sim
    Joonseop

Internet of Things (IoT) has built a network with billions of connected devices which
generate massive volumes of data. Processing large data on existing systems requires
significant costs for data movements between processors and memory due to limited

Fence-Region-Aware Mixed-Height Standard Cell Legalization

  • Do
    SangGi

We propose a fence-region-aware mixed-height standard cell legalization that can optimize
the placement of standard cells that have more than a two row height in various shapes
of the fence region. The algorithm consists of pre-legalization and mixed-…

A Case for Heterogeneous Network-on-Chip Based H.264 Video Decoders

  • Ghorbani Moghaddam
    Milad

The design of a heterogeneous network-on-chip (NoC) based H.264 video decoder is proposed.
A thorough investigation using a system simulator developed as the combination of
a cycle accurate NoC simulator together with complete implementations of all the …

A 16b Clockless Digital-to-Analog Converter with Ultra-Low-Cost Poly Resistors Supporting
Wide-Temperature Range from -40°C to 85°C

  • Wang
    Xuedi

High-precision digital-to-analog converter (DAC) is a critical component in process
control, data acquisition, and testing instruments. In order to achieve high resolution
and a wide-temperature range, conventional designs have been adopting high-cost …

A Skyrmion Racetrack Memory based Computing In-memory Architecture for Binary Neural
Convolutional Network

  • Pan
    Yu

A Skyrmion Racetrack Memory (SRM) based Computing In-Memory Architecture (SRM-CIM)
was proposed in this paper. Both data and computing operation can be achieved in SRM-CIM.
SRM-CIM is used to support convolutional computing in Binary Convolutional …

TASecure: Temperature-Aware Secure Deletion Scheme for Solid State Drives

  • Li
    Bingzhe

With the increasing concerns of security, the secure deletion for SSDs becomes very
costly due to its out-of-place update (i.e., an update is performed in a new location
leaving the old data un-touched). Some previous studies used a combined erase-based

An Asymmetric Dual Output On-Chip DC-DC Converter for Dynamic Workloads

  • Liu
    Xingye

We propose a novel two-stage hybrid on-chip DC-DC converter targeting low power applications
with multiple supply voltage domains and dynamic workloads. The converter has a nominal
input voltage of 1.2V and generates two asymmetrically regulated output …

CNNWire: Boosting Convolutional Neural Network with Winograd on ReRAM based Accelerators

  • Lin
    Jilan

Resistive random access memory (ReRAM) demonstrates the great potential of in-memory
processing for neural network (NN) acceleration. However, since the convolutional
neural network (CNN) is widely known as compute-bound, current ReRAM-based …

Feed-Forward XOR PUFs: Reliability and Attack-Resistance Analysis

  • Avvaru
    S. V. Sandeep

Physical unclonable functions (PUFs) can be used to generate unique signatures of
integrated circuit (IC) chips. XOR arbiter PUFs (XOR PUFs), that typically contain
multiple standard arbiter PUFs as their components, are more secure than standard

Exploring Design Trade-offs in Fault-Tolerant Behavioral Hardware Accelerators

  • Zhu
    Zhiqi

High-Level Synthesis (HLS) allows the automatic generation of hardware accelerators
with unique design metrics. This work leverages this unique feature and presents a
method to increase the search space of fault-tolerant hardware accelerators. The …

Automatic Extraction of Requirements from State-based Hardware Designs for Runtime
Verification

  • Seo
    Minjun

Runtime monitoring and verification enables a system to monitor itself and ensure
system requirements are met even in the presence of dynamic environments. For hardware,
state-based models are widely used, but verifying the correctness between the state-…

MirrorCache: An Energy-Efficient Relaxed Retention L1 STTRAM Cache

  • Kuan
    Kyle

Spin-Transfer Torque RAM (STTRAM) is a promising alternative to SRAMs in on-chip caches,
due to several advantages, including non-volatility, low leakage, high integration
density, and CMOS compatibility. However, STTRAMs’ wide adoption in resource-…

Design and Evaluation of DNU-Tolerant Registers for Resilient Architectural State
Storage

  • Alghareb
    Faris S.

In this work, we aim to maintain the correct execution of instructions in the pipeline
stages. To achieve that, the integrity for the data computed in registers during execution
should be maintained via protecting the susceptible registers. Thus, we …

Automated Analysis of Virtual Prototypes at Electronic System Level

  • Goli
    Mehran

The exponential increase in functionality of System-on-Chips (SoCs) and reduced Time-to-Market
(TTM) requirements have significantly altered the typical design and verification
flow. Virtual Prototyping (VP) at the Electronic System Level (ESL) using …

Dynamic Physically Unclonable Functions

  • Xiong
    Wenjie

Physical variations in the manufacturing processes of electronic devices have been
widely leveraged to design Physically Unclonable Functions (PUFs), which can be used
for authentication and key storage. Existing PUFs are static, as their PUF responses

RDTA: An Efficient Routability-Driven Track Assignment Algorithm

  • Liu
    Genggeng

This paper presents a routability-driven track assignment algorithm (RDTA) to efficiently
estimate routability. Routability has become a very challenging issue in modern IC
design and it can be effectively estimated by routing congestion. Track …

EraseMe: A Defense Mechanism against Information Leakage exploiting GPU Memory

  • Fang
    Hongyu

Graphics Processing Units (GPU) play a major role in speeding up computational tasks
of the users, especially in applications such as high volume text and image processing.
Recent works have demonstrated the security problems associated with GPU that do …

A Statistical Current and Delay Model Based on Log-Skew-Normal Distribution for Low
Voltage Region

  • Cao
    Peng

The increasing performance variation and non-Gaussian distribution pose remarkable
challenges to timing analysis for circuits operating in low voltage region. Accurate
modeling of the statistical characteristics is urgently required with process …

Enabling Approximate Storage through Lossy Media Data Compression

  • Worek
    Brian

As compute capabilities continue to scale, memory capacity and bandwidth continue
to lag behind. Data compression is an effective approach to improving memory capacity
and bandwidth; but prior works have focused primarily on lossless compression and

Thermal Fingerprinting of FPGA Designs through High-Level Synthesis

  • Chen
    Jianqi

This work investigates if temperature can be used to fingerprint FPGA designs and
presents a method to generate a large number of functionally equivalent FPGA designs
such that each design has a unique distinguishable thermal signature. The main …

Deep RNN-Oriented Paradigm Shift through BOCANet: Broken Obfuscated Circuit Attack

  • Tehranipoor
    Fatemeh

Logic encryption obfuscation has been used for thwarting counterfeiting, overproduction,
and reverse engineering, but it is vulnerable to attacks. However, it was recently shown
that satisfiability-checking (SAT) can potentially compromise hardware …

STAT: Mean and Variance Characterization for Robust Inference of DNNs on Memristor-based
Platforms

  • Zhang
    Baogang

An emerging solution to accelerate the inference phase of deep neural networks (DNNs)
is to utilize memristor crossbar arrays (MCAs) to perform highly efficient matrix-vector
multiplication in the analog domain. An adverse challenge is that memristor …

LSM: Novel Low-Complexity Unified Systolic Multiplier over Binary Extension Field

  • Xie
    Jiafeng

Unified (hybrid field-size) systolic multiplier over GF(2^m) (binary extension field)
has attracted significant attention from research communities recently as it can
be used in reconfigurable cryptographic processors. In this paper, we present a novel

Binarized Depthwise Separable Neural Network for Object Tracking in FPGA

  • Yang
    Li

Object tracking has achieved great advances in the past few years and has been widely
applied in vision-based applications. Nowadays, deep convolutional neural networks have
taken an important role in object tracking tasks. However, their enormous model size

An Analytical-based Hybrid Algorithm for FPGA Placement

  • Hu
    Chengyu

As the capacity of FPGAs increases, FPGA placers that adopt the Simulated Annealing (SA)
algorithm take more and more runtime. To solve this problem, this paper presents HCAS,
a Hybrid algorithm Combining Analytical method and SA. There are three …

Approximate Memory with Approximate DCT

  • Ma
    Shenghou

Approximate Computing is an emerging computing paradigm where one exploits inherent
error resilience of certain applications (e.g., digital signal processing, multimedia
and artificial intelligence) and trades off absolute computation precisions for …

AQuRate: MRAM-based Stochastic Oscillator for Adaptive Quantization Rate Sampling of Sparse
Signals

  • Salehi
    Soheil

Recently, the promising aspects of compressive sensing have inspired new circuit-level
approaches for their efficient realization within the literature. However, most of
these recent advances involving novel sampling techniques have been proposed …

Clockless Spin-based Look-Up Tables with Wide Read Margin

  • Salehi
    Soheil

In this paper, we develop a 6-input fracturable non-volatile Clockless LUT (C-LUT)
using spin Hall effect (SHE)-based Magnetic Tunnel Junctions (MTJs) and provide a
detailed comparison between the SHE-MTJ-based C-LUT and Spin Transfer Torque (STT)-MTJ-…

A Hybrid Framework for Functional Verification using Reinforcement Learning and Deep
Learning

  • Singh
    Karunveer

In this paper, we propose a novel hybrid verification framework (HVF) which uses Reinforcement
Learning (RL) and Deep Neural Networks (DNNs) to accelerate the verification of complex
systems. More precisely, our HVF incorporates RL to generate all …

SESSION: Special Session 1: In-Memory Processing for Future Electronics

Digital and Analog-Mixed-Signal In-Memory Processing in CMOS SRAM

  • Jaiswal
    Akhilesh

Ferroelectric FET Based In-Memory Computing for Few-Shot Learning

  • Laguna
    Ann Franchesca

As CMOS technology advances, the performance gap between the CPU and main memory has
not improved. Furthermore, the hardware deployed for Internet of Things (IoT) applications
needs to process ever-growing volumes of data, which can further exacerbate …

True In-memory Computing with the CRAM: From Technology to Applications

  • Zabihi
    Masoud

An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive
Applications

  • Li
    Bing

The conventional von Neumann architecture has been revealed as a major performance
and energy bottleneck for rising data-intensive applications. The decade-old idea
of leveraging in-memory processing to eliminate substantial data movements has returned

SESSION: Special Session 2: Approximate Computing Systems Design: Energy Efficiency and Security
Implications

Security Threats in Approximate Computing Systems

  • Yellu
    Pruthvy

Approximate computing systems improve energy efficiency and computation speed at the
cost of reduced accuracy on system outputs. Existing efforts mainly explore the feasible
approximation mechanisms and their implementation methods. There is limited …

Characterizing Approximate Adders and Multipliers Optimized under Different Design
Constraints

  • Jiang
    Honglan

Taking advantage of the error resilience in many applications as well as the perceptual
limitations of humans, numerous approximate arithmetic circuits have been proposed
that trade off accuracy for higher speed or lower power in emerging applications …

Approximate Communication Strategies for Energy-Efficient and High Performance NoC: Opportunities and Challenges

  • Reza
    Md Farhadur

With the advancement and miniaturization of transistor technology, hundreds of cores
can be integrated on a single chip. Network-on-Chips (NoCs) are the de facto on-chip
communication fabrics for multi/many core systems because of their benefits over …

Information Hiding behind Approximate Computation

  • Wang
    Ye

There are many interesting advances in approximate computing recently targeting the
energy efficiency in system design and execution. The basic idea is to trade computation
accuracy for power and energy during all phases of the computation, from data to …

MLPrivacyGuard: Defeating Confidence Information based Model Inversion Attacks on Machine Learning
Systems

  • Alves
    Tiago A. O.

As services based on Machine Learning (ML) applications find increasing use, there
is a growing risk of attack against such systems. Recently, adversarial machine learning
has received a lot of attention, where an adversary is able to craft an input or …

SESSION: Special Session 3: Recent Advances in Near and In-Memory Computing Circuit

XNOR-SRAM: In-Bitcell Computing SRAM Macro based on Resistive Computing Mechanism

  • Jiang
    Zhewei

We present an in-memory computing SRAM macro for binary neural networks. The memory
macro computes XNOR-and-accumulate for binary/ternary deep convolutional neural networks
on the bitline without row-by-row data access. It achieves 33X better energy and …

Efficient Process-in-Memory Architecture Design for Unsupervised GAN-based Deep Learning
using ReRAM

  • Chen
    Fan

The ending of Moore’s Law makes domain-specific architecture the future of computing.
The most representative is the emergence of various deep learning accelerators. Among
the proposed solutions, resistive random access memory (ReRAM) based process-…

DigitalPIM: Digital-based Processing In-Memory for Big Data Acceleration

  • Imani
    Mohsen

In this work, we design DigitalPIM, a Digital-based Processing In-Memory platform
capable of accelerating fundamental big data algorithms in real time with orders of
magnitude more energy efficient operation. Unlike the existing near-data processing

In-memory Processing based on Time-domain Circuit

  • Kong
    Yuyao

Deep Neural Networks (DNN) have emerged as a dominant algorithm for machine learning
(ML). High performance and extreme energy efficiency are critical for deployments
of DNN, especially in mobile platforms such as autonomous vehicles, cameras, and other

SESSION: Special Session 4: Opportunities and Challenges for Emerging Monolithic 3D Integrated
Circuits

An Overview of Thermal Challenges and Opportunities for Monolithic 3D ICs

  • Shukla
    Prachi

Monolithic 3D (Mono3D) is a three-dimensional integration technology that can overcome
some of the fundamental limitations faced by traditional, two-dimensional scaling.
This paper analyzes the unique thermal characteristics of Mono3D ICs by simulating

Logic Monolithic 3D ICs: PPA Benefits and EDA Tools Necessary

  • Pentapati
    Sai Surya Kiran

Monolithic 3D (M3D) ICs provide a way to achieve high performance and low power designs
within the same technology node, thereby bypassing the need for transistor scaling.
M3D ICs have multiple 2D tiers sequentially fabricated on top of each other and …

Investigation and Trade-offs in 3DIC Partitioning Methodologies

  • Sketopoulos
    Nikolaos

In this work, we compare alternative 3DIC partitioning methodologies, in terms of
slack, number of inter-tier vias, Tier Area Ratio (TAR) and HPWL design parameters.
The popular 3DIC post-placement, bin-based Fiduccia-Mattheyses (FM) partitioning flow
is …

Test and Design-for-Testability Solutions for Monolithic 3D Integrated Circuits

  • Koneru
    Abhishek

M3D integration can result in reduced area and higher performance when compared to
3D die stacking. Due to the benefits of M3D integration, there is growing interest
in industry towards the adoption of this technology. However, test challenges for
M3D …

N3XT Monolithic 3D Energy-Efficient Computing Systems

  • Aly
    Mohamed M. Sabry

The world’s appetite for analyzing massive amounts of structured and unstructured
data has grown dramatically. The computational demands of these abundant-data applications
far exceed the capabilities of today’s computing systems. The N3XT (Nano-…

SESSION: Special Session 5: Robust IC Authentication and Protected Intellectual Property: A
Special Session on Hardware Security

How to Generate Robust Keys from Noisy DRAMs?

  • Karimian
    Nima

Security primitives based on Dynamic Random Access Memory (DRAM) can provide cost-efficient
and practical security solutions, especially for resource-constrained devices, such
as hardware used in the Internet of Things (IoT), as DRAMs are an intrinsic …

Threats on Logic Locking: A Decade Later

  • Zamiri Azar
    Kimia

To reduce the cost of ICs and to meet the market’s demand, a considerable portion
of manufacturing supply chain, including silicon fabrication, packaging and testing
may be pushed offshore. Utilizing a global IC manufacturing supply chain, and inclusion

On Custom LUT-based Obfuscation

  • Kolhe
    Gaurav

Logic obfuscation yields hardware security against various threats, such as Intellectual
Property (IP) piracy and reverse engineering. Evolving Boolean satisfiability (SAT)
attacks have challenged the hardware security assurance rendered by various …

Securing Analog Mixed-Signal Integrated Circuits Through Shared Dependencies

  • Juretus
    Kyle

The transition to a horizontal integrated circuit (IC) design flow has raised concerns
regarding the security and protection of IC intellectual property (IP). Obfuscation
of an IC has been explored as a potential methodology to protect IP in both the …

SESSION: Special Session 6: Neuromorphic Computing and Deep Neural Network

Design Methodology for Embedded Approximate Artificial Neural Networks

  • Balaji
    Adarsha

Artificial neural networks (ANNs) have demonstrated significant promise while implementing
recognition and classification applications. The implementation of pre-trained ANNs
on embedded systems requires representation of data and design parameters in …

Exploration of Segmented Bus As Scalable Global Interconnect for Neuromorphic Computing

  • Balaji
    Adarsha

Spiking Neural Networks (SNNs) are efficient computation models for spatio-temporal
pattern recognition on resource and power constrained platforms. Dedicated SNN hardware,
also called neuromorphic hardware, can further reduce the energy consumption of …

ADMM-based Weight Pruning for Real-Time Deep Learning Acceleration on Mobile Devices

  • Li
    Hongjia

Deep learning solutions are being increasingly deployed in mobile applications, at
least for the inference phase. Due to the large model size and computational requirements,
model compression for deep neural networks (DNNs) becomes necessary, especially …

On the use of Deep Autoencoders for Efficient Embedded Reinforcement Learning

  • Prakash
    Bharat

In autonomous embedded systems, it is often vital to reduce the number of actions
taken in the real world and energy required to learn a policy. Training reinforcement
learning agents from high dimensional image representations can be very expensive
and …

SESSION: Panelist Position Papers

Tuning Track-based NVM Caches for Low-Power IoT Devices

  • Aghaei Khouzani
    Hoda

Track-based non-volatile memories, such as Domain Wall Memory (DWM) and Skyrmion,
are promising candidates to be used as CPU caches due to their ultra-high density
and low-static power. However, the access latency and energy of these devices are
highly …

Dynamic Computation Migration at the Edge: Is There an Optimal Choice?

  • Shahhosseini
    Sina

In the era of Fog computing where one can decide to compute certain time-critical
tasks at the edge of the network, designers often encounter a question whether the
sensor layer provides the optimal response time for a service, or the Fog layer, or

Solving Energy and Cybersecurity Constraints in IoT Devices Using Energy Recovery
Computing

  • Thapliyal
    Himanshu

With the growth of Internet-of-Things (IoT), the potential threat vectors for malicious
cyber and hardware attacks are rapidly expanding. As the IoT paradigm emerges, there
are challenging requirements to design energy-efficient and secure systems. To …

Right-Provisioned IoT Edge Computing: An Overview

  • Adegbija
    Tosiron

Edge computing on the Internet of Things (IoT) is an increasingly popular paradigm
in which computation is moved closer to the data source (i.e., edge devices). Edge
computing mitigates the overheads of cloud-based computing arising from increased

Secure Computing Systems Design Through Formal Micro-Contracts

  • Kinsy
    Michel A.

Two enduring concepts in computer system design are abstraction levels and layered
composition. The design generally takes a layered approach where each layer implements
a different abstraction of the system. The layers communicate through interfaces …

SBCCI 2019 TOC

PHICC: an error correction code for memory devices

  • Philippe Magalhães
  • Otávio Alcântara
  • Jarbas Silveira

With the evolution of technology in the microelectronics field, integrated circuits (ICs) have been developed with decreasing dimensions. Despite the advances provided by the scale reduction, the occurrence of Multiple Cell Upsets (MCUs) caused by interference such as ionizing radiation has become increasingly common. Error Correction Codes (ECCs) can increase the fault tolerance of computer systems; however, there must be a balance between error correction effectiveness and silicon implementation costs. The purpose of this article is to present the Parity Hamming Interleaved Correction Code (PHICC), a code capable of correcting multiple transient errors in memory cells with low implementation cost. The PHICC was validated through a comparative analysis of correction effectiveness, implementation costs, reliability and Mean Time to Failure (MTTF) against other ECCs. The results show that PHICC can maintain system reliability for a longer time, which makes it a strong candidate for use in critical applications.

Lightweight security mechanisms for MPSoCs

  • Anderson Camargo Sant’Ana
  • Henrique Martins Medina
  • Kevin Boucinha Fiorentin
  • Fernando Gehm Moraes

Computational systems tend to adopt parallel architectures by using multiprocessor systems-on-chip (MPSoCs). MPSoCs are vulnerable to software and hardware attacks, such as infected applications and Hardware Trojans, respectively. These attacks may aim to gain access to sensitive data, interrupt a given application or even damage the system physically. The literature presents countermeasures using dedicated routing algorithms, cryptography, firewalls and secure zones. These approaches present a significant hardware cost (firewalls, cryptography) or are too restrictive regarding the use of MPSoC resources (secure zones). The goal of this paper is to present lightweight security mechanisms for MPSoCs, using four techniques: spatial isolation of applications; a dedicated network to send sensitive data; a traffic blocking filter; and lightweight cryptography. These mechanisms protect the MPSoC against the most common software attacks, such as Denial of Service (DoS) and spoofing (man-in-the-middle), and ensure confidentiality and integrity for applications. Results show low area and latency overhead, as well as the effectiveness of the mechanisms in blocking malicious traffic.

Exploiting approximate computing for low-cost fault tolerant architectures

  • Gennaro S. Rodrigues
  • Juan Fonseca
  • Fabio Benevenuti
  • Fernanda Kastensmidt
  • Alberto Bosio

This work investigates how the approximate computing paradigm can be exploited to provide low-cost fault tolerant architectures. In particular, we focus on the implementation of Approximate Triple Modular Redundancy (ATMR) designs using the precision reduction technique. The proposed method is applied to two benchmarks and a multitude of ATMR designs with different degrees of approximation. The benchmarks are implemented on a Xilinx Zynq-7000 APSoC FPGA through high-level synthesis and evaluated concerning area usage and the inaccuracy caused by approximation. Fault injection experiments are performed by flipping bits of the FPGA configuration bitstream. Results show that the proposed approximation method can decrease the DSP usage of the hardware implementation by up to 80% and the number of sensitive configuration bits by up to 75% while maintaining an accuracy of more than 99.96%.

Fine-grain temperature monitoring for many-core systems

  • Alzemiro Lucas da Silva
  • André Luís del Mestre Martins
  • Fernando Gehm Moraes

The power density may limit the amount of energy a many-core system can consume. A many-core system running at maximum performance may violate safe temperature limits and, consequently, suffer reliability issues. Dynamic Thermal Management (DTM) techniques have been proposed to guarantee that many-core systems run at good performance without compromising reliability. DTM techniques rely on accurate temperature information and estimation, which is a computationally complex problem. However, related works usually abstract the temperature monitoring complexity, assuming available temperature sensors. An issue related to temperature sensors is their granularity: they frequently measure the temperature of a large system area instead of a processing element (PE) area. Therefore, the first goal of this work is to propose fine-grain (PE level) temperature monitoring for many-core systems. The second is to present a dedicated hardware accelerator to estimate the system temperature. Results show that software performance can be a limiting factor when applying an accurate model to provide temperature estimation for system management. On the other hand, the hardware accelerator connected to the many-core enables fine-grain temperature estimation at runtime without sacrificing system performance.

An adaptive discrete particle swarm optimization for mapping real-time applications onto network-on-a-chip based MPSoCs

  • Jessé Barreto de Barros
  • Renato Coral Sampaio
  • Carlos Humberto Llanos

This paper presents a modified version of the well-known Particle Swarm Optimization (PSO) algorithm as an alternative to the single-objective Genetic Algorithm (GA) that is currently the state-of-the-art method to map real-time application tasks onto Multiple Processors System-on-a-Chip (MPSoC) using preemptive-capable wormhole-based Network-on-a-Chip (NoC) as their communication architecture. A statistical study based on an experimental setup has been performed to compare the GA-based task mapper and the proposed method by using a real-time application as a benchmark, as well as a group of randomly generated ones. Preliminary results have shown that our method is capable of achieving quicker convergence than the GA-based method, and it even produces better results when the application utilization is smaller than the available processing capacity, i.e., a fully schedulable mapping solution exists.

Exploring Tabu search based algorithms for mapping and placement in NoC-based reconfigurable systems

  • Guilherme A. Silva Novaes
  • Luiz Carlos Moreira
  • Wang Jiang Chau

Nowadays, the development of systems based on Networks-on-Chip (NoCs) poses big challenges to designers due to scalability problems, such as efficient Mapping and Placement, which are NP-hard. Several solutions have been proposed for this type of problem, which is a variation of the Quadratic Assignment Problem (QAP), with Tabu Search (TS) algorithms showing the most promising results. In NoC-based dynamically reconfigurable systems (NoC-DRSs), both mapping and placement problems present several layers of complexity due to the reconfigurable scenarios. A previous work adopted TS algorithm variations, but the best solution is not found as frequently as desired. This paper introduces the original Forced Inversion (FI) Heuristic over Tabu Search algorithms for 2D-Mesh FPGA NoC-DRSs, in order to avoid local minima. Results for a series of benchmarks are presented, and the performance of the different approaches is compared quantitatively and qualitatively.

Performance evaluation of HEVC RCL applications mapped onto NoC-based embedded platforms

  • Wagner Penny
  • Daniel Palomino
  • Marcelo Porto
  • Bruno Zatt
  • Leandro Indrusiak

Today, several applications running on embedded systems have to fulfill soft or hard timing constraints. Video applications, such as the modern High Efficiency Video Coding (HEVC), most often have soft real-time constraints. However, in specific scenarios, such as robotic surgeries and the coupling of satellites, harder timing constraints arise and become a huge challenge. Although implementing such applications on Networks-on-Chip (NoCs) is an alternative to reduce their algorithmic complexity and meet real-time constraints, a performance evaluation of the mapped NoC and a schedulability analysis for a given application are mandatory. In this work, we evaluate the performance of the HEVC Residual Coding Loop (RCL) mapped onto a NoC-based embedded platform, considering the encoding of a single 1920×1080 pixel frame. A set of analyses exploring the combination of different NoC sizes and task mapping strategies was performed, showing, for the typical and upper-bound workload cases, the scenarios in which the application is schedulable and meets the real-time constraints.

An FPGA-based evaluation platform for energy harvesting embedded systems

  • Roberto Paulo Dias Alcantara Filho
  • Otavio Alcantara de Lima Junior
  • Corneli Gomes Furtado Junior

Extreme low-power embedded systems are essential in Smart Cities and the Internet of Things, since these systems are responsible for acquiring, processing, and transmitting valuable environmental data. Some of these systems should run for a very long time without any human intervention, even for battery replacement. Energy harvesting technologies allow embedded systems to be powered from the environment by converting surrounding energy sources into electrical energy. However, energy-harvesting embedded systems (EHES) depend heavily on the nature of the energy sources, which are mostly uncontrollable and unpredictable. To improve the evaluation of energy management techniques in EHES, we propose the emulation of I-V curves of low-power energy harvesting transducers. An FPGA-based platform controls the energy source emulation combined with an integrated logic analyzer, which allows real-time data gathering from the EHES in multiple evaluation scenarios. The experiments show that the platform replicates solar energy scenarios with only 0.56% mean error.

A comparison of two embedded systems to detect electrical disturbances using decision tree algorithm

  • Reneilson Santos
  • Edward David Moreno
  • Carlos Estombelo-Montesco

Electrical Power Quality (EPQ) is a relevant research subject because of its importance to real-world problems. Anomalies in an electrical network can cause severe losses of equipment and data. In this context, much research effort has been devoted to this kind of problem, seeking better accuracy in classifying the anomalies or building systems to detect them. This paper, therefore, compares two systems built to classify electrical disturbances even in noisy environments. For this purpose, a microprocessor system (Raspberry Pi3) and a micro-controller system (NodeMCU Amica) were used, and their time to classify the input signal was analyzed. The microprocessor achieves better results (45.50ms against 267.10ms), with an accuracy of 97.96% in an ideal environment and 76.79% in a noisy environment (20dB of SNR) for both systems.

FPGA hardware linear regression implementation using fixed-point arithmetic

  • Willian de Assis Pedrobon Ferreira
  • Ian Grout
  • Alexandre César Rodrigues da Silva

In this paper, a hardware design based on the field programmable gate array (FPGA) to implement a linear regression algorithm is presented. The arithmetic operations were optimized by applying a fixed-point number representation for all hardware-based computations. A floating-point training data set was initially created and stored on a personal computer (PC), then converted to fixed-point representation and transmitted to the FPGA via a serial communication link. With the proposed VHDL design description synthesized and implemented within the FPGA, the custom hardware architecture performs the linear regression algorithm based on matrix algebra, considering a fixed-size training data set. To validate the hardware fixed-point arithmetic operations, the same algorithm was implemented in the Python language and the results of the two computation approaches were compared. The power consumption of the proposed embedded FPGA system was estimated to be 136.82 mW.
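
As a rough illustration of the fixed-point arithmetic described above, the following Python sketch fits a line y = a*x + b using only integer operations. The Q16.16 format, the function names, and the scalar closed-form least-squares formula are assumptions made for this example; the abstract describes a matrix-algebra VHDL design whose word lengths are not stated here.

```python
FRAC_BITS = 16          # assumed Q16.16 format; the paper's word length is not given here
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(value):
    """Convert a float to a fixed-point integer (round to nearest)."""
    return int(round(value * ONE))


def to_float(fixed):
    """Convert a fixed-point integer back to a float."""
    return fixed / ONE


def fit_line_fixed(xs, ys):
    """Least-squares fit of y = a*x + b using integer arithmetic only.

    Python integers never overflow; real hardware would need wide accumulators.
    """
    n = len(xs)
    xq = [to_fixed(x) for x in xs]
    yq = [to_fixed(y) for y in ys]

    sum_x = sum(xq)                              # scale 2^F
    sum_y = sum(yq)                              # scale 2^F
    sum_xx = sum(x * x for x in xq)              # scale 2^(2F)
    sum_xy = sum(x * y for x, y in zip(xq, yq))  # scale 2^(2F)

    num = n * sum_xy - sum_x * sum_y
    den = n * sum_xx - sum_x * sum_x
    if den == 0:
        raise ValueError("all x values are identical")

    slope = (num << FRAC_BITS) // den            # rescale so the quotient has scale 2^F
    intercept = (sum_y - ((slope * sum_x) >> FRAC_BITS)) // n
    return slope, intercept


a_q, b_q = fit_line_fixed([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(to_float(a_q), to_float(b_q))  # 2.0 1.0
```

Comparing the converted results against a floating-point fit mirrors the validation step the abstract mentions, although the comparison here is only against known exact values.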

New insight for next generation SRAM: tunnel FET versus FinFET for different topologies

  • Adriana Arevalo
  • Romain Liautard
  • Daniel Romero
  • Lionel Trojman
  • Luis-Miguel Procel

The purpose of this work is to point out the main differences between Static Random-Access Memory (SRAM) cells implemented using Tunnel FET (TFET) and FinFET technologies. We have compared the behavior of SRAM cells implemented in both technologies for a supply voltage range from 0.4V to 1.2V. Furthermore, for our study, we have chosen different SRAM cell topologies, such as 6T, 8T, 9T and 10T. We have simulated all of these topologies for both technologies and extracted the Static Noise Margins (SNM) for the reading and writing processes. In addition, we have determined the power consumption in order to find the best trade-off between stability and power. By analyzing these results, we have determined the best topology for each technology. Finally, we have compared these best topologies in order to study their advantages and shortcomings. Our results show more advantages for TFET technology than for FinFET technology.

DNAr-logic: a constructive DNA logic circuit design library in R language for molecular computing

  • Renan A. Marks
  • Daniel K. S. Vieira
  • Marcos V. Guterres
  • Poliana A. C. Oliveira
  • Omar P. Vilela Neto

This paper describes DNAr-Logic: a software package in the R language that provides ease of use and scalability for the design process of digital logic circuits in molecular computing, more specifically, DNA computing. These devices may be used in vitro, in vivo, or even replace CMOS technology in some applications. Using a technique known as DNA strand displacement reaction (DSD) in conjunction with chemical reaction networks (CRNs), DNA strands can be used as “wet” hardware to construct molecular logic circuits analogous to electronic digital designs. The circuits designed using DNAr-Logic can be created in a constructive manner and simulated without requiring knowledge of chemistry or the DSD mechanism. The package implements all the main logic gates. We describe the design of a majority gate (also available in the package) and a full-adder circuit that uses only this gate. We describe the results and simulation of our design.

Finding optimal qubit permutations for IBM’s quantum computer architectures

  • Alexandre A. A. de Almeida
  • Gerhard W. Dueck
  • Alexandre C. R. da Silva

IBM offers quantum processors for Clifford+T circuits. The only restriction is that not all CNOT gates are implemented, and those that are missing must be substituted with alternate sequences of gates. Each CNOT has its own mapping with a respective cost. However, by permuting the qubits, the number of CNOTs that need mappings can be reduced. The problem is to find a good permutation without an exhaustive search. In this paper we propose a solution for this problem. The permutation problem is formulated as an Integer Linear Programming (ILP) problem. Solving the ILP problem guarantees the lowest-cost permutation for the CNOT mappings. To test and validate the proposed formulation, quantum architectures with 5 and 16 qubits were used. The ILP formulation along with mapping techniques found circuits with up to 64% fewer gates than other approaches.
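
To make the permutation problem concrete, the sketch below scores every qubit permutation by brute force. This is the exhaustive search the paper sets out to avoid, not its ILP formulation, and it is practical only for a handful of qubits. The coupling map, the cost values (`reverse_cost`, `remote_cost`), and the example circuit are hypothetical.

```python
# Minimal sketch of the qubit-permutation problem, solved by exhaustive search
# for illustration only. All names, costs, and the coupling map are assumptions.
from itertools import permutations


def cnot_cost(control, target, coupling, reverse_cost=4, remote_cost=10):
    """Extra gates needed to realize CNOT(control, target) on the device."""
    if (control, target) in coupling:
        return 0                      # natively supported
    if (target, control) in coupling:
        return reverse_cost           # e.g. wrap the reversed CNOT in Hadamards
    return remote_cost                # qubits not adjacent: needs a longer mapping


def circuit_cost(perm, cnots, coupling):
    """Total mapping cost when logical qubit q is placed on physical perm[q]."""
    return sum(cnot_cost(perm[c], perm[t], coupling) for c, t in cnots)


def best_permutation(n_qubits, cnots, coupling):
    """Return (cost, permutation) with the lowest total mapping cost."""
    best = None
    for perm in permutations(range(n_qubits)):
        cost = circuit_cost(perm, cnots, coupling)
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best


# Hypothetical 5-qubit device: (control, target) pairs that are native.
coupling = {(1, 0), (2, 0), (2, 1), (3, 2), (3, 4), (4, 2)}
# Hypothetical circuit: logical CNOTs as (control, target).
cnots = [(0, 1), (1, 2), (0, 2)]

identity_cost = circuit_cost(tuple(range(5)), cnots, coupling)
best_cost, best_perm = best_permutation(5, cnots, coupling)
```

With 16 qubits there are 16! permutations, which is why a formulation that an ILP solver can prune is attractive.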

Hardware implementation of a shape recognition algorithm based on invariant moments

  • Clement Raffaitin
  • Juan-Sebastian Romero
  • Luis-Miguel Procel

The present work describes a simple, fast shape detection algorithm and its hardware implementation in an FPGA system. The detection algorithm is based on the concepts of Hu’s moments, which are invariant to similarity transformations. The recognition algorithm is implemented by using a non-local means filter. The algorithm is implemented on an FPGA system by using a hardware description language. We present the different design stages of the algorithm implementation, which is based on the finite state machine technique. This algorithm is able to recognize a target shape over a test image. Furthermore, this work describes the advantages of the implementation in hardware, such as speed and parallelism in signal processing. Finally, we show some results of the implementation of this algorithm.
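
As a reminder of why Hu's moments suit shape recognition, the sketch below computes the first two Hu invariants of a binary shape and checks that they do not change under translation or a 90° rotation. It is a software illustration only; the pixel-set representation and the function name `hu_phi1_phi2` are assumptions and say nothing about the paper's hardware architecture.

```python
# Minimal sketch: first two Hu moment invariants of a binary shape given as a
# set of foreground pixel coordinates. Names and test shape are assumptions.

def hu_phi1_phi2(pixels):
    """Return (phi1, phi2) for a set of (x, y) foreground pixels."""
    m00 = len(pixels)                          # area (zeroth-order moment)
    cx = sum(x for x, _ in pixels) / m00       # centroid
    cy = sum(y for _, y in pixels) / m00

    def mu(p, q):
        # Central moment: translation-invariant by construction.
        return sum((x - cx) ** p * (y - cy) ** q for x, y in pixels)

    def eta(p, q):
        # Normalized central moment: removes the dependence on scale.
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2


# Hypothetical L-shaped test pattern and two transformed copies.
shape = {(0, 0), (1, 0), (2, 0), (0, 1), (0, 2), (0, 3)}
shifted = {(x + 5, y + 7) for x, y in shape}
rotated = {(-y, x) for x, y in shape}          # rotation by 90 degrees

ref = hu_phi1_phi2(shape)
```

On a pixel grid, invariance to arbitrary rotations and scalings holds only approximately because of resampling; translation and 90° rotations are exact.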

A custom processor for an FPGA-based platform for automatic license plate recognition

  • Guilherme A. M. Sborz
  • Guilherme A. Pohl
  • Felipe Viel
  • Cesar A. Zeferino

Automatic License Plate Recognition (ALPR) systems are used to identify a vehicle from an image that contains its plate. These systems have applications in a wide range of areas, such as toll payment, border control, and traffic surveillance. ALPR systems demand high computational power, especially for real-time applications. In this context, this paper describes the development of a custom processor designed to accelerate part of the processing of an FPGA-based ALPR system. This processor reduces the latency for computing the most expensive function of the ALPR system by a factor of almost 23, thus reducing the time necessary for detection of a vehicle plate.

Hardware design of DC/CFL intra-prediction decoder for the AV1 codec

  • Jones Goebel
  • Bruno Zatt
  • Luciano Agostini
  • Marcelo Porto

This paper presents a dedicated hardware design for the DC and Chroma from Luma (CFL) intra-prediction modes of the AV1 decoder. The hardware was designed to reach real-time performance when processing UHD 4K videos. AV1 is an open-source and royalty-free video codec developed by the AOMedia group, which is composed of multiple companies such as Google, Netflix, AMD, ARM, Intel, Nvidia, Microsoft, Mozilla and others. The proposed solution supports all 19 block sizes allowed in the AV1 encoder and is able to process UHD 4K videos at 60 frames per second. The DC/CFL modules were synthesized with the TSMC 40 nm cell library targeting a frequency of 132.1 MHz. Synthesis results show that the proposed hardware uses 89.39 Kgates and dissipates 7.96 mW.

Approximate interpolation filters for the fractional motion estimation in HEVC encoders and their VLSI design

  • Rafael da Silva
  • Ícaro Siqueira
  • Mateus Grellert

Motion Estimation (ME) is one of the most complex HEVC steps, consuming more than 60% of the average encoding time, most of which is spent on its fractional part (Fractional Motion Estimation – FME), in which sub-pixel samples are interpolated and searched over to find motion vectors with higher precision. This paper presents hardware designs for the sub-pixel interpolation unit of the FME step. The designs employ approximate computing techniques by reducing the number of taps in each filter to reduce memory access and hardware cost. The approximate filters were implemented in the HEVC reference software to assess their impact on coding performance. A complete interpolation architecture was implemented in VHDL and synthesized with different filter precisions and input sizes in order to assess the effect of these parameters on hardware area and performance. The approximate designs reduce the number of adders/subtractors by up to 67.65% and memory bandwidth by up to 75% with a tolerable loss in coding performance (less than 1% using the Bjontegaard Delta bitrate metric). When synthesized to an FPGA device, 52.9% fewer logic elements are required with a modest increase in frequency.

An SVM-based hardware accelerator for onboard classification of hyperspectral images

  • Lucas A. Martins
  • Guilherme A. M. Sborz
  • Felipe Viel
  • Cesar A. Zeferino

Hyperspectral images (HSIs) have been used in civil and military scenarios for ground recognition, urban development management, rare minerals identification, and diverse other purposes. However, HSIs have a significant volume of information and require high computational power, especially for real-time processing in embedded applications, as in onboard computers in satellites. These issues have driven the development of hardware-based solutions able to provide the processing power necessary to meet such requirements. In this paper, we present a hardware accelerator to enhance the performance of one of the most computationally expensive stages of HSI processing: the classification. We have employed the Entropy Multiple Correlation Ratio procedure to select the spectral bands to be used in the training process. For the classification step, we have applied a Support Vector Machine classifier with a Hamming Distance decision approach. The proposed custom processor was implemented in FPGA and compared with high-level implementations. The results obtained demonstrate that the processor has a silicon cost lower than similar solutions and can perform real-time pixel classification in 0.1 ms while achieving a state-of-the-art accuracy of 99.7%.

A sub-1mA highly linear inductorless wideband LNA with low IP3 sensitivity to variability for IoT applications

  • Arthur Liraneto Torres Costa
  • Hamilton Klimach
  • Sergio Bampi

This paper proposes a wideband 0.4-2 GHz cascode common-gate LNA that can be used as a building block for a noise-canceling topology (in which its noise is canceled at the output node). The design strategy is to set the operating point by analyzing the third-order coefficient (α3) of the output current and the output voltage, which is designed using a load composed of a diode-connected PMOS transistor and a resistor in parallel. This operating point allows a reasonable VGS spread while maintaining a high IIP3, which implies a low IIP3 sensitivity to process variability. The design strategy also achieves a current consumption under 1 mA and, depending on the technology node VDD (CMOS 130 nm in this case), it can consume under 1 mW of power. This makes the wideband LNA suitable for IoT applications. Monte Carlo simulations have been carried out to demonstrate the operating region sensitivity to variability, yielding a worst-case IIP3 of μ = +0.2 dBm with σ = 0.8 dBm (@ 2 GHz) up to a nominal 2.75 dBm @ 900 MHz, S11 < -23 dB, NF < 5.5 dB (canceled by virtue of its topology), a voltage gain of 11.6-14.6 dB (S21 = 6.4-9.4 dB with a buffer to 50 Ω), and a power consumption of just 1.19 mW from a 1.2 V supply.

Comparison between direct and indirect learnings for the digital pre-distortion of concurrent dual-band power amplifiers

  • Luis Schuartz
  • Artur T. Hara
  • André A. Mariano
  • Bernardo Leite
  • Eduardo G. Lima

Current radio-communication systems demand high linearity and high efficiency. The digital baseband pre-distorter (DPD) is a cost-effective solution to guarantee the required linearity without compromising the efficiency. In the design of a DPD for a single-band power amplifier (PA), the position of the inverse system is exchanged during the identification procedure to avoid the necessity of a PA model within a cumbersome closed-loop process. However, in a practical environment where only an approximation to the inverse is achieved, the linearization capability is affected by shifting the post-inverse placed after the PA to a pre-inverse located before the PA. In DPDs intended for concurrent dual-band PAs, an additional advantage of such an approach is that the post-inverse identifications for each band are completely independent of each other. This work performs a comparative analysis between two learning architectures applied to the linearization of two concurrent dual-band PAs stimulated by 2.4 GHz Wi-Fi and 3.5 GHz LTE signals. For the first PA, an exact PA model is known and the replacement of a post-inverse by a pre-inverse produces only negligible degradation in linearity. For the second PA, only an approximate PA model is available and the accuracy of such a PA model has a greater impact on the linearization capability than the shifting of the inverse.

Interactive evolutionary approach to reduce the optimization cycle time of a low noise amplifier

  • Rodrigo A. L. Moreto
  • Douglas Rocha
  • Carlos E. Thomaz
  • André Mariano
  • Salvador P. Gimenez

Nowadays, wireless communications at gigahertz frequencies are in increasing demand due to the ever-increasing number of electronic devices that use this type of communication. They are implemented by Radio Frequency (RF) circuits. However, the design of RF circuits is difficult, time-consuming and based on designer knowledge and experience. This work proposes an interactive evolutionary approach using the genetic algorithm, which is implemented in the in-house iMTGSPICE optimization tool, to perform the optimization process of a robust (corner and Monte Carlo analyses) Ultra Low-Power Low Noise Amplifier (LNA) dedicated to Wireless Sensor Networks (WSN), which is implemented in a 130 nm Bulk CMOS technology. We performed two experimental studies to optimize the LNA. The first one used the interactive approach of iMTGSPICE, which was monitored and assisted by a beginner designer during the optimization process. The second one used the conventional approach of iMTGSPICE (non-interactive), which was not assisted by a designer during the optimization process. The obtained results demonstrated that the interactive approach of iMTGSPICE performed the optimization process of the robust LNA around 94% faster (in only approximately 20 minutes) than the non-interactive evolutionary approach (in approximately 6 hours).

An innovative strategy to reduce die area of robust OTA by using iMTGSPICE and diamond layout style for MOSFETs

  • José Roberto Banin Júnior
  • Rodrigo Alves de Lima Moreto
  • Gabriel Augusto da Silva
  • Carlos Eduardo Thomaz
  • Salvador Pinillos Gimenez

This paper describes a pioneering design and optimization methodology that provides a remarkable die area reduction of robust analog Complementary Metal-Oxide-Semiconductor (CMOS) Integrated Circuits (ICs) by using a computational tool based on artificial intelligence (iMTGSPICE) and the Diamond layout style for MOSFETs. This innovative optimization strategy for analog CMOS ICs was validated on an Operational Transconductance Amplifier (OTA) by using a 180 nm CMOS IC technology. The main finding of this work is a remarkable reduction of around 30% in the total die area of a robust OTA when using Diamond MOSFETs with alpha angles of 45°, compared to the one implemented with standard rectangular MOSFETs.

NMLSim 2.0: a robust CAD and simulation tool for in-plane nanomagnetic logic based on the LLG equation

  • Lucas A. Lascasas Freitas
  • Omar P. Vilela Neto
  • João G. Nizer Rahmeier
  • Luiz G. C. Melo

Nanomagnetic Logic (NML) is a new technology based on the magnetization of nanometric magnets. Logic operations are performed via dipolar coupling through ferromagnetic and antiferromagnetic interactions. The low energy dissipation and the possibility of higher integration density in circuits are significant advantages over CMOS technology. Even so, there is a great need for simulation and CAD tools for the proper study of large NML circuits. This paper presents a high-efficiency tool that uses the Landau-Lifshitz-Gilbert equation to evolve the magnetization of the particles over time in a monodomain approach. The new version of NMLSim comes with flexibility in its code, allowing expansion of the tool with ease and consistency. The results of simulated structures show the reliability of the simulator when compared with the current state-of-the-art Object-Oriented Micromagnetic Framework (OOMMF). It also presents an improvement of up to 716 times in execution time and up to 41 times in memory usage.

Ropper: a placement and routing framework for field-coupled nanotechnologies

  • Ruan Evangelista Formigoni
  • Ricardo S. Ferreira
  • José Augusto M. Nacif

Field-Coupled Nanocomputing technologies are the subject of extensive research to overcome current CMOS limitations. These technologies include nanomagnetic and quantum structures, each with its design and synchronization challenges. In this scenario, clocking schemes are used to ensure circuit synchronization and avoid signal disruptions at the cost of some area overhead. Unfortunately, a nanocomputing technology is limited to a small subset of clocking schemes due to its number of clocking phases and signal propagation system. This leads to complex design challenges when tackling the placement and routing problem and results in technology-dependent solutions. Our work presents a novel framework developed by our team that solves these design challenges when using distinct schemes, thereby avoiding the need to design pre-defined routing algorithms for each one. The framework offers a technology-independent solution and provides interfaces for the implementation of efficient and scalable placement strategies. Moreover, it has full integration with reference state-of-the-art optimization and synthesis tools.

Toward nanometric scale integration: an automatic routing approach for NML circuits

  • Pedro Arthur R. L. Silva
  • Omar Paranaíba V. Neto
  • José Augusto M. Nacif

In recent years, many technologies have been studied to replace or complement CMOS. Some of these emerging technologies are known as Field Coupled Nanocomputing. However, these new technologies introduce the need for developing tools to perform circuit mapping, placement, and routing. Nanomagnetic Logic (NML) is one of these emerging technologies. It relies on the magnetization of nanomagnets to perform operations through majority logic. In this work, we propose an approach to map a gate-level circuit to an NML layout automatically. We use Breadth-First Search to perform the placement and the A* algorithm to traverse the circuit and build the routes for each node. To evaluate the effectiveness of our approach, we use a series of ISCAS’85 benchmarks. Our results show an area reduction varying from 20% to 60%.
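
The routing step can be illustrated with a generic A* search on a grid. The sketch below assumes a rectangular layout grid with unit step cost, four-neighbor moves, and a Manhattan-distance heuristic; the grid encoding (0 = free cell, 1 = occupied cell) and the function name `a_star` are assumptions, and NML-specific constraints such as clocking zones are not modeled.

```python
# Minimal sketch of A* routing on a grid. Grid encoding and names are assumed.
import heapq


def a_star(grid, start, goal):
    """Return a shortest path from start to goal as a list of cells, or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for 4-neighbor moves with unit cost.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]    # entries are (f = g + h, g, cell)
    best_g = {start: 0}
    came_from = {}

    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:      # walk back to the start
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g[cell]:
            continue                      # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = new_g
                    came_from[(nr, nc)] = cell
                    heapq.heappush(open_heap, (new_g + h((nr, nc)), new_g, (nr, nc)))
    return None                           # goal unreachable


# Hypothetical layout: a wall with a single gap between source and sink.
layout = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
route = a_star(layout, (0, 0), (2, 0))
```

After a net is routed, its cells would typically be marked as occupied before routing the next net, so that later routes avoid earlier ones.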

Energy efficient fJ/spike LTS e-Neuron using 55-nm node

  • Pietro M. Ferreira
  • Nathan De Carvalho
  • Geoffroy Klisnick
  • Aziz Benlarbi-Delai

While CMOS technology is currently reaching its limits in power consumption and circuit density, a challenger is emerging from the analogy between biology and silicon. Hardware-based neural networks may drive a new generation of bio-inspired computers, spurred by the need for a hardware solution for real-time applications. This paper redesigns a previously proposed electronic neuron (e-Neuron) for a higher firing rate to reduce the silicon area and achieve a better energy efficiency trade-off. In addition, an innovative schematic is proposed to establish an e-Neuron library based on Izhikevich’s model of neural firing patterns. Both e-Neuron circuits are designed using a 55 nm technology node. The physical design of transistors in weak inversion is discussed with the aim of minimizing leakage. Neural firing pattern behaviors are validated by post-layout simulations, demonstrating the spike frequency adaptation and the rebound spikes due to the post-inhibitory effect in the LTS e-Neuron. The presented results suggest that the time to rebound spikes depends on the excitation current amplitude. Both e-Neurons achieve fJ/spike energy efficiency and a smaller silicon area in comparison to Izhikevich’s library propositions in the literature.

CMOS analog four-quadrant multiplier free of voltage reference generators

  • Antonio José Sobrinho de Sousa
  • Fabian de Andrade
  • Hildeloi dos Santos
  • Gabriele Gonçalves
  • Maicon Deivid Pereira
  • Edson Santana
  • Ana Isabela Cunha

This work presents a CMOS four-quadrant analog multiplier architecture for application as the synapse element in analog cellular neural networks. The circuit has voltage-mode inputs and a current-mode output and includes a signal application method that avoids voltage or current reference generators. Simulations have been carried out for a CMOS 130 nm technology, featuring ±50 mV input voltage range, 60 μW static power and -25 dB maximum THD. The active area is 346 μm².

Amplifier-based MOS analog neural network implementation and weights optimization

  • Tiago Oliveira Weber
  • Diogo da Silva Labres
  • Fabián Leonardo Cabrera

Neural networks are achieving state-of-the-art performance in many applications, from speech recognition to computer vision. A neuron in a multi-layer network needs to multiply each input by its weight, sum the results and perform an activation function. This paper presents a variation of the implementation of an amplifier-based MOS analog neuron capable of performing these tasks and also the optimization of the synaptic weights using in-loop circuit simulations. MOS transistors operating in the triode region are used as variable resistors to convert the input and weight voltages to a proportional input current. To test the analog neuron in full networks, an automatic generator is developed to produce a netlist based on the number of neurons in each layer, inputs and weights. Simulation results using a CMOS 180 nm technology demonstrate the neuron’s proper transfer function and its functionality when trained on test datasets.
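
The three tasks of a neuron named above (multiply, sum, activate) can be written as a short behavioral model. This is a software sketch only: the `tanh` activation stands in for whatever saturating transfer function the amplifier provides, and the function names and example weights are assumptions rather than details from the paper.

```python
# Minimal behavioral sketch of a multi-layer network built from neurons that
# multiply inputs by weights, sum, and apply an activation. tanh is assumed.
import math


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a saturating activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return math.tanh(total)


def layer(inputs, weight_rows, biases):
    """One neuron per row of weights, all sharing the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def network(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    for weight_rows, biases in layers:
        inputs = layer(inputs, weight_rows, biases)
    return inputs


# Hypothetical 2-2-1 network.
layers = [
    ([[0.5, 0.5], [1.0, 0.0]], [0.0, 0.0]),   # hidden layer: 2 neurons
    ([[1.0, 1.0]], [0.0]),                    # output layer: 1 neuron
]
output = network([1.0, -1.0], layers)
```

A netlist generator of the kind described would take the same inputs (neurons per layer, weights) and emit circuit instances instead of evaluating the arithmetic directly.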

Reduction of neural network circuits by constant and nearly constant signal propagation

  • Augusto Andre Souza Berndt
  • Alan Mishchenko
  • Paulo Francisco Butzen
  • Andre Inacio Reis

This work focuses on optimizing circuits representing neural networks (NNs) in the form of and-inverter graphs (AIGs). The optimization is done by analyzing the training set of the neural network to find constant bit values at the primary inputs. The constant values are then propagated through the AIG, which results in removing unnecessary nodes. Furthermore, a trade-off between neural network accuracy and its reduction due to constant propagation is investigated by replacing with constants those inputs that are likely to be zero or one. The experimental results show a significant reduction in circuit size with negligible loss in accuracy.
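
The two steps described above, finding inputs that are constant (or nearly constant) over the training set and propagating them through the AIG, can be sketched as follows. The encoding is an assumption borrowed from the AIGER convention: a literal is `2 * var + inverted`, literal 0 is constant false and literal 1 is constant true, and AND nodes are listed in topological order. The function names and the tiny example are hypothetical.

```python
# Minimal sketch of constant propagation in an and-inverter graph (AIG).
# Literal encoding (AIGER-style) and all names are assumptions.

def near_constant_inputs(samples, threshold=1.0):
    """Map input variable -> 0/1 if it takes that value in at least
    `threshold` of the training samples. Input i is variable i + 1."""
    n = len(samples)
    consts = {}
    for i in range(len(samples[0])):
        ones = sum(s[i] for s in samples)
        if ones >= threshold * n:
            consts[i + 1] = 1
        elif n - ones >= threshold * n:
            consts[i + 1] = 0
    return consts


def propagate_constants(ands, const_inputs):
    """ands: dict var -> (lit_a, lit_b) in topological order.
    Returns (subst, kept): the literal each simplified var reduces to,
    and the AND nodes that remain."""
    subst = dict(const_inputs)            # var -> replacement literal

    def resolve(lit):
        var, inv = lit >> 1, lit & 1
        if var == 0:
            return lit                    # already a constant literal
        return subst.get(var, var << 1) ^ inv

    kept = {}
    for var, (a, b) in ands.items():
        a, b = resolve(a), resolve(b)
        if a == 0 or b == 0 or a == (b ^ 1):
            subst[var] = 0                # AND with 0, or x AND NOT x
        elif a == 1:
            subst[var] = b                # AND with 1 passes the other input
        elif b == 1 or a == b:
            subst[var] = a
        else:
            kept[var] = (a, b)            # node survives with resolved fanins
    return subst, kept


# Hypothetical training set over three inputs (variables 1, 2, 3).
samples = [(0, 1, 1), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
consts = near_constant_inputs(samples)    # only input 1 is always 0

# Hypothetical AIG: n4 = x1 & x2, n5 = !n4 & x3, n6 = n5 & x2.
ands = {4: (2, 4), 5: (9, 6), 6: (10, 4)}
subst, kept = propagate_constants(ands, consts)
```

Lowering `threshold` below 1.0 treats inputs that are only usually 0 or 1 as constants, which is the accuracy-versus-size trade-off the abstract mentions.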

A FPGA parameterizable multi-layer architecture for CNNs

  • Guilherme Korol
  • Fernando Gehm Moraes

Advances in hardware platforms boosted the use of Convolutional Neural Networks (CNNs) to solve problems in several fields such as Computer Vision and Natural Language Processing. With the improvements of algorithms involved in learning and inferencing for CNNs, dedicated hardware architectures have been proposed with the goal of speeding up the CNNs’ performance. However, the CNNs’ requirements in bandwidth and processing power challenge designers to create architectures fitted for ASICs and FPGAs. Embedded applications targeting IoT (including sensors and actuators), health devices, smartphones, and any other battery-powered device may benefit from CNNs. For that, the CNN design must follow a different path, where the cost function is a small area footprint and reduced power consumption. This paper is a step towards this goal, by proposing an architecture for the main modules of modern CNNs. The proposal uses the AlexNet CNN as a case study, targeting Xilinx FPGA devices. Compared to the literature, results show a reduction of up to 9 times in the number of required DSP modules.

Design of a low power 10-bit 12MS/s asynchronous SAR ADC in 65nm CMOS

  • Arthur Lombardi Campos
  • João Navarro
  • Maximiliam Luppe

During the last decades we have witnessed the performance improvement and the aggressive growth of the complexity of integrated circuits (ICs). The progressive size reduction of transistors in recent technological nodes has allowed IC designers to perform analog tasks in the digital domain, increasing the demand for analog-to-digital converters (ADCs). This work presents the design and implementation of a low power successive approximation register analog-to-digital converter (SAR ADC) in a 65 nm CMOS technology, suitable for low power frontend of wireless receivers with a flexible sampling rate up to 12 MS/s. At maximum sampling rate, the post-layout simulated circuit achieved an equivalent number of bits (ENOB) of 9.65 and a power consumption of 151.4 μW, leading to a Figure of Merit of 15.8 fJ/conversion-step, inside an area of 0.074 mm².
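
The abstract does not state which Figure of Merit definition is used. Assuming the widely used Walden form, which divides power by the sampling rate and the effective number of quantization levels, the reported values can be checked as follows:

```latex
\mathrm{FoM} = \frac{P}{2^{\mathrm{ENOB}} \cdot f_s}
             = \frac{151.4\ \mu\mathrm{W}}{2^{9.65} \times 12\ \mathrm{MS/s}}
             \approx \frac{151.4 \times 10^{-6}\ \mathrm{W}}{803.4 \times 12 \times 10^{6}\ \mathrm{s^{-1}}}
             \approx 15.7\ \mathrm{fJ/conversion\text{-}step}
```

This is in line with the reported 15.8 fJ/conversion-step; the small difference is plausibly due to rounding of the quoted ENOB and power values.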

A new algorithm for an incremental sigma-delta converter reconstruction filter

  • Li Huang
  • Caroline Lelandais-Perrault
  • Anthony Kolar
  • Philippe Benabes

Image sensors dedicated to Earth observation applications require medium-speed and high-resolution analog-to-digital converters (ADCs). For that purpose, an incremental sigma-delta analog-to-digital converter (IΣΔ ADC) has been designed. Post-layout simulations highlighted a degradation in resolution caused by the circuit imperfections. Therefore, a digital correction has been investigated. This paper proposes a new reconstruction filter which takes into account not only the bit values of the modulator output sequence but also the occurrence of certain patterns. This technique has been applied to an incremental sigma-delta analog-to-digital converter in order to correct the conversion errors. Operating with 400 clock periods for each conversion cycle, the theoretical resolution is 15.4 bits. Post-layout simulation shows that a 13.5-bit resolution is obtained by using the classical optimal filter whereas a 14.8-bit resolution is obtained with our reconstruction filter.

Behavioral modeling of a control module for an energy-investing piezoelectric harvester

  • Tales Luiz Bortolin
  • André Luiz Aita
  • João Baptista dos Santos Martins

This work analyzes a piezoelectric energy harvesting system that uses a single inductor and the concept of energy investment. The harvester behavior, with special focus on its control logic module and state machine, is fully described and modeled in Verilog-A. The needed sensors and control variables were also identified and modeled. Simulation results have shown the correct behavioral modeling of the piezoelectric energy harvester system and proposed control, highlighting the harvesting mechanism based on the concept of energy investment and the effect of the energy invested on the characteristics of the battery charging profile. The speed of the behavioral simulations when compared to electrical ones and the obtained model accuracy have shown a reliable and prospective higher-level design approach.

An IR-UWB pulse generator using PAM modulation with adaptive PSD in 130nm CMOS process

  • Luiz Carlos Moreira
  • José Fontebasso Neto
  • Walter Silva Oliveira
  • Thiago Ferauche
  • Guilherme Heck
  • Ney Laert Vilar Calazans
  • Fernando Gehm Moraes

This paper proposes an adaptive pulse generator using Pulse Amplitude Modulation (PAM). The circuit was implemented with eight Pulse Generator Units (PGUs) to produce up to eight monocycles per pulse. The number of monocycles per pulse is inversely proportional to the Power Spectrum Density (PSD) bandwidth in the Impulse Radio Ultra-Wide Band (IR-UWB). The complete circuit contains two pulse generator blocks, each one composed of eight PGUs to build a rectangular waveform at the output. The PGU has been implemented with Edge Combiners High (ECH) and Edge Combiners Low (ECL) to encode the information. Each Edge Combiner has a high impedance circuit that is selected by digital control signals. The circuit has been simulated, showing an output pulse amplitude of ≈70 mV for the high logic level and an amplitude of ≈35 mV for the low logic level, both at 100 MHz Pulse Repetition Frequency (PRF). This produces a mean pulse duration of ≈270 ps, a mean central frequency of ≈3.7 GHz and a power consumption of less than 0.22 μW. The pulse generator block occupies an area of 0.54 mm².

SRC-2018

ACM Student Research Competition at ICCAD 2018 (SRC@ICCAD’18)


DEADLINE: September 02, 2018 
Online Submission: https://www.easychair.org/conferences/?conf=srciccad18
 
Sponsored by Microsoft Research, the ACM Student Research Competition is an internationally recognized venue enabling undergraduate and graduate students who are ACM members to:

  • Experience the research world — for many undergraduates this is a first!
  • Share research results and exchange ideas with other students, judges, and conference attendees
  • Rub shoulders with academic and industry luminaries
  • Understand the practical applications of their research
  • Perfect their communication skills
  • Receive prizes and gain recognition from ACM and the greater computing community.

The ACM Special Interest Group on Design Automation (ACM SIGDA) is organizing such an event in conjunction with the International Conference on Computer Aided Design (ICCAD). Authors of accepted submissions will get travel grants up to $500 from ACM/Microsoft and ICCAD registration fee support from SIGDA. The event consists of several rounds, as described at http://src.acm.org/ and http://www.acm.org/student-research-competition, where you can also find more details on student eligibility and timeline.
 
The first-place winner in the graduate category at SRC@ICCAD’17, Meng Li (University of Texas at Austin), also won the First Place in the 2018 ACM SRC Grand Finals!
 
The first-place winner in the undergraduate category at SRC@ICCAD’16, Jennifer Vaccaro (Olin College of Engineering), also won the Second Place in the 2017 ACM SRC Grand Finals: http://www.acm.org/media-center/2017/june/src-2017-grand-finals.
 
Details on abstract submission:
Research projects from all areas of design automation are encouraged. The author submitting the abstract must still be a student at the time the abstract is due. Each submission should be made on the EasyChair submission site. Please include the author’s name, affiliation, postal address, and email address; research advisor’s name; ACM student member number; category (undergraduate or graduate); research title; and an extended abstract (maximum 2 pages or 800 words) containing the following sections:

  • Problem and Motivation: This section should clearly state the problem being addressed and explain the reasons for seeking a solution to this problem.
  • Background and Related Work: This section should describe the specialized (but pertinent) background necessary to appreciate the work. Include references to the literature where appropriate, and briefly explain where your work departs from that done by others. Reference lists do not count towards the limit on the length of the abstract.
  • Approach and Uniqueness: This section should describe your approach in attacking the problem and should clearly state how your approach is novel.
  • Results and Contributions: This section should clearly show how the results of your work contribute to computer science and should explain the significance of those results. Include a separate paragraph (maximum of 100 words) for possible publication in the conference proceedings that serves as a succinct description of the project.
  • Single paper summaries (or just cut & paste versions of published papers) are inappropriate for the ACM SRC. Submissions should include at least one year’s worth of research contributions, but not subsume an entire doctoral thesis load.

Note that this event is different than other ACM/SIGDA sponsored or supported events at DAC or ICCAD: YSSP brings together seniors and 1st year graduate students at DAC, UBooth features demos from research groups, DASS allows graduate students to get up to speed on lectures on design automation, while the PhD Forum showcases post-proposal PhD research at DAC and the CADathlon allows graduate students to compete in a programming contest at ICCAD.
The ACM Student Research Competition allows both graduate and undergraduate students to discuss their research with student peers, as well as academic and industry researchers, in an informal setting, while enabling them to attend DAC and compete with other ACM SRC winners from other computing areas in the ACM Grand Finals. Travel grant recipients cannot receive travel support from any other ICCAD or ACM/SIGDA sponsored program.
This year we plan to reserve as many as 5 poster session spots for undergraduate attendees to encourage their continued investigation in the design automation field. The exact number is subject to the total number of undergraduate submissions as well as the quality of the work.
 
Online Submission – EasyChair:
https://www.easychair.org/conferences/?conf=srciccad18
 
Important dates:

  • Abstract submission deadline: 11:59pm, PST, September 02, 2018
  • Acceptance notification: September 17, 2018
  • Poster session: November 5, 2018 from 11:30am–1:30pm, Private Dining Room
  • Presentation session: November 5, 2018 from 6:45pm–8:30pm, Saint Tropez Room
  • Award winners announced at ACM SIGDA Dinner: November 6, 2018, from 6:30pm
  • Grand Finals winners honored at ACM Awards Banquet: June 2019 (Estimated)

 
Requirement:
Students submitting and presenting their work at SRC@ICCAD’18 are required to be members of both ACM and ACM SIGDA.
 
Organizers:
Cheng Zhuo (Zhejiang University, China)
Bei Yu (The Chinese University of Hong Kong, Hong Kong)

NANOARCH 2018 TOC

Fast Estimations of Failure Probability Over Long Time Spans

  • Michail Noltsis
  • Panayiotis Englezakis
  • Eleni Maragkoudaki
  • Chrysostomos Nicopoulos
  • Dimitrios Rodopoulos
  • Francky Catthoor
  • Yiannakis Sazeides
  • Davide Zoni
  • Dimitrios Soudris

Shrinking of device dimensions has undoubtedly enabled the very large scale integration of transistors on electronic chips. However, it has also brought to the surface time-zero and time-dependent variation phenomena that degrade system performance and threaten functional operation. Hence, the need to capture and describe these mechanisms, as well as to model their impact effectively, is crucial. To this end, we follow existing models and propose a complete framework that evaluates the failure probability of electronic components. To assess our framework, a case study of packet-switched Network on Chip (NoC) routers is presented, studying the failure probability of their SRAM buffers.

A Probabilistic Error Model and Framework for Approximate Booth Multipliers

  • Yuying Zhu
  • Weiqiang Liu
  • Jie Han
  • Fabrizio Lombardi

Approximate computing is a paradigm for high performance and low power design by compromising computational accuracy. In this paper, the structure of an approximate modified radix-4 Booth multiplier is analyzed. A probabilistic error model is proposed to facilitate the evaluation of the approximate multiplier for errors from the approximate radix-4 Booth encoding, the approximate regular partial product array, and the approximate 4–2 compressor. The normalized mean error distances (NMEDs) of 8-bit and 16-bit approximate designs are found by utilizing the proposed model. The results from the error model and the corresponding analytical framework are close to those found by simulation, thus confirming the validity of the proposed approach.
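As background for the encoding stage named in this abstract, the exact (non-approximate) radix-4 Booth recoding can be sketched as follows. This is a minimal illustration of the textbook technique, not the authors' approximate design or error model; the function names `booth_radix4_digits` and `booth_value` are hypothetical.

```python
def booth_radix4_digits(y, n_bits=8):
    """Recode a signed n_bits-wide integer into radix-4 Booth digits.

    Digits are returned least significant first; each lies in {-2, ..., 2}.
    """
    assert n_bits % 2 == 0, "radix-4 recoding consumes bits in pairs"
    assert -(1 << (n_bits - 1)) <= y < (1 << (n_bits - 1))
    pattern = y & ((1 << n_bits) - 1)          # two's-complement bit pattern
    bits = [(pattern >> i) & 1 for i in range(n_bits)]
    digits = []
    prev = 0                                   # implicit bit y[-1] = 0
    for i in range(0, n_bits, 2):
        # Overlapping triplet (y[i+1], y[i], y[i-1]) selects one digit.
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]
    return digits


def booth_value(digits):
    """Reconstruct the integer encoded by radix-4 Booth digits."""
    return sum(d * 4 ** i for i, d in enumerate(digits))


if __name__ == "__main__":
    for y in (127, -1, -128, 45):
        print(y, booth_radix4_digits(y))
```

Each digit selects one partial product (0, ±x or ±2x), so an n-bit multiplier needs only n/2 partial products with exact recoding.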

Variability-Tolerant Memristor-based Ratioed Logic in Crossbar Array

  • M. Escudero
  • I. Vourkas
  • A. Rubio
  • F. Moll

The advent of the first TiO2-based memristor in 2008 revived scientific interest from both academia and industry in this device technology, with several emerging applications including that of logic circuits. Several memristive logic families have been proposed, each with different attributes, in the current quest for energy-efficient computing systems of the future. However, limited endurance of memristor devices and variations (both cycle-to-cycle and device-to-device) are important parameters to be considered in the evaluation of such logic families. In this work we build upon an accurate physics-based model of a bipolar metal-oxide resistive RAM device (supporting parasitics of the device structure and variability of switching voltages and resistance states) and use it to show how the performance of memristor-based logic circuits can be degraded owing to both variability and state-drift impact. Based on previous work on CMOS-like memristive logic circuits, we propose a memristive ratioed logic scheme, which is crossbar-compatible, i.e. suitable for in-/near-memory computing, and tolerant to device variability, while also not affecting device endurance, since computations do not involve switching the memristor states. As a figure of merit, we compare this new logic scheme with MAGIC, focusing on the universal NOR logic gate.

High-Endurance Bipolar ReRAM-Based Non-Volatile Flip-Flops with Run-Time Tunable Resistive States

  • Mehrdad Biglari
  • Tobias Lieske
  • Dietmar Fey

ReRAM technologies feature desired properties, e.g. fast switching and high read margin, that make them attractive candidates to be used in non-volatile flip-flops (NVFFs). However, they suffer from limited endurance. Therefore, cell degradation considerations are a necessity for practical deployment in non-volatile processors (NVPs). In this paper, we present two bipolar ReRAM-based NVFFs, Hypnos and Morpheus, with enhanced endurance and energy efficiency. Hypnos reduces the ReRAM electrical stress during set operation while keeping the imposed NVFF area overhead at a minimum. In Morpheus, a write-termination circuit is used to further enhance the ReRAM endurance and energy efficiency at the cost of an affordable area overhead. Moreover, both NVFFs feature run-time tunable resistive states to enable online adjustment of the tradeoff among endurance, retention, energy consumption, and restore success rate (in case of approximate computing). Experimental results demonstrate that Hypnos reduces the ReRAM set degradation by 91%, on average. Moreover, the write-termination mechanism in Morpheus further reduces the remaining degradation by 93%/97% in set/reset operation, on average. The results also demonstrate enhanced energy efficiency in both NVFFs.

An Aging Resilient Neural Network Architecture

  • Seyed Nima Mozaffari
  • Krishna Prasad Gnawali
  • Spyros Tragoudas

Recent artificial neural network architectures use memristors to store synaptic weights. The crossbar structure of memristors is used because of its dense structure and extreme parallelism. Transistor aging impacts their computational accuracy. An enhancement of the memristor-based neural network architecture is introduced using a built-in current-based calibration circuit. It is shown experimentally that the proposed approach alleviates the cell aging effect.

Overcoming Crossbar Nonidealities in Binary Neural Networks Through Learning

  • Mohammed E. Fouda
  • Jongeun Lee
  • Ahmed M. Eltawil
  • Fadi Kurdahi

Crossbar nonidealities may considerably degrade the accuracy of the matrix multiplication operation, which is the cornerstone of hardware-accelerated neural networks. In this paper, we show that the crossbar nonidealities, especially the wire resistance, should be taken into consideration for accurate evaluation. We also present a simple yet highly effective way to capture the wire resistance effect for the inference and training of deep neural networks without extensive SPICE simulations. Different scenarios have been studied and used to show the efficacy of our proposed method.

Real-Time Trainable Data Converters for General Purpose Applications

  • Loai Danial
  • Shahar Kvatinsky

Data converters are ubiquitous in data-abundant systems, where they are heterogeneously distributed across the analog-digital interface. Unfortunately, conventional data converters trade off speed, power, and accuracy. Furthermore, intrinsic real-time and post-silicon variations dramatically degrade their performance. In this paper, we employ novel neuro-inspired approaches to design smart data converters that could be trained in real-time for general purpose applications, using machine learning algorithms and artificial neural network architectures. Our approach integrates emerging memristor technology with CMOS. This concept will pave the way towards interfaces that adapt to the continuously varying conditions of data-driven applications.

Programmable Molecular-Nanoparticle Multi-junction Networks for Logic Operations

  • Angelika Balliou
  • Jiri Pfleger
  • George Skoulatakis
  • Samrana Kazim
  • Jan Rakusan
  • Stella Kennou
  • Nikos Glezos

We propose and investigate a nanoscale multi-junction network architecture that can be configured on the fly to perform Boolean logic functions at room temperature. The device exploits the electronic properties of randomly deposited molecule-interconnected metal nanoparticles, which act collectively as strongly nonlinear single-electron transistors. Disorder is incorporated in the modeling of their electrical behavior and the collective response of interacting nano-components is rationalized. The non-optimized energy consumption of the synaptic grid for a “then-if” logical computation is in the range of a few aJ.

Multi-Valued Logic Circuits on Graphene Quantum Point Contact Devices

  • Konstantinos Rallis
  • Georgios Ch. Sirakoulis
  • Ioannis Karafyllidis
  • Antonio Rubio

Graphene quantum point contacts (G-QPC) combine switching operations with quantized conductance, which can be modulated by top and back gates. Here we use the conductance quantization to design and simulate multi-valued logic (MVL) circuits and, more specifically an adder. The adder comprises two G-QPCs connected in parallel. We compute the conductance of the adder for various inputs and show that Graphene MVL circuits are feasible.

Sequential Circuit Design with Bilayer Avalanche Spin Diode Logic

  • Vaibhav Vyas
  • Joseph S. Friedman

Novel computing paradigms like the fully cascadable InSb bilayer avalanche spin-diode logic (BASDL) are capable of performing complex logic operations. Although the original work provides a comprehensive explanation for the device structure, the fundamental logic set and basic combinational circuits, it lacks the inclusion of sequential circuit design. This paper addresses the void by demonstrating the structural design of SR and D-type latches with BASDL. Novel latch topologies are proposed that take full advantage of the BASDL-based logic set while maintaining conventional latch functionality. The effective operation of these latches is verified through a complete logic-level analysis and a brief insight into their physical implementation.

Complementary Arranged Graphene Nanoribbon-based Boolean Gates

  • Yande Jiang
  • Nicoleta Cucu Laurenciu
  • Sorin Cotofana

With CMOS feature size heading towards atomic dimensions, unjustifiable static power, reliability, and economic implications are worsening, prompting research on new materials, devices, and/or computation paradigms. Within this context, Graphene Nanoribbons (GNRs), owing to graphene’s excellent electronic properties, may serve as basic blocks for carbon-based nanoelectronics. In this paper we build upon the fact that GNR behaviour can be controlled according to some desired functionality via top/back gate contacts and propose to combine GNRs with complementary functionalities to construct Boolean gates. To this end, we introduce a generic GNR-based Boolean gate structure, composed of two GNRs, i.e., a pull-up GNR performing the gate Boolean function and a pull-down GNR performing the inverted Boolean function. Subsequently, by properly adjusting GNRs’ dimensions and topology, we design 2-input AND, NAND, and XOR graphene-based Boolean gates, as well as 1-input gates, i.e., inverter and buffer. Our SPICE simulations indicate that the proposed gates exhibit a smaller propagation delay, from 23% for the XOR gate to 6x for the AND gate, and 2 orders of magnitude smaller power consumption, when compared with 7nm CMOS based counterparts, while requiring a 1 to 2 orders of magnitude smaller active area footprint. These results clearly indicate that GNR-based gates have great potential as basic building blocks for future beyond-CMOS, energy-effective nanoscale circuits.

CCE: A Combined SRAM and Non Volatile Cache for Endurance of Next Generation Multilevel Non Volatile Memories in Embedded Systems

  • Linbin Chen
  • Pilin Junsangsri
  • Pedro Reviriego
  • Fabrizio Lombardi

In this paper we present Combined Cache for Endurance (CCE), a scheme to enable the use of next generation high density multilevel non volatile memories in embedded systems. These memories are attractive as they can reduce the static power consumption dramatically, and a single memory can potentially be used, avoiding the need for both flash and SRAM or DRAM in a system. However, a common drawback of the new multilevel non volatile memories is that they support a limited number of write operations, and thus their endurance needs to be improved to make them a viable alternative for the main memory of embedded systems. The proposed CCE relies on the fact that most writes are concentrated on a few addresses. Therefore, a small SRAM cache can be used to store positions that are frequently written. However, this would not preserve the non volatile nature of the memory. To do so, in the proposed CCE, the cache cell has an SRAM part and a non volatile part. At power up the contents of the non volatile part are copied to the SRAM and the other way around at power down. As many embedded systems execute predictable workloads, this cache is statically set to cover the most frequently written addresses. The evaluation shows that CCE can increase the endurance of the memory by several orders of magnitude. At the same time the overheads required to implement the cache are small relative to the main memory. Therefore, CCE can be an interesting option to improve the endurance of next generation high density multilevel non volatile memories.
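The observation that a small, statically chosen cache can absorb most writes can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions, not the CCE design: the write trace, the helper names, and the simple "most frequently written addresses" selection are hypothetical, and the copy to the non-volatile part at power down is ignored.

```python
from collections import Counter


def pick_static_cache(trace, cache_size):
    """Choose the cache_size most frequently written addresses (hypothetical policy)."""
    return {addr for addr, _ in Counter(trace).most_common(cache_size)}


def max_nvm_writes_per_address(trace, cached):
    """Worst-case write count seen by any main-memory address.

    Writes to cached addresses are assumed to be absorbed by the SRAM part.
    """
    counts = Counter(addr for addr in trace if addr not in cached)
    return max(counts.values(), default=0)


if __name__ == "__main__":
    # Made-up, heavily skewed write trace: two hot addresses, two cold ones.
    trace = [0] * 1000 + [1] * 500 + [2] * 3 + [3] * 2
    cached = pick_static_cache(trace, cache_size=2)
    print("no cache:  ", max_nvm_writes_per_address(trace, set()))
    print("with cache:", max_nvm_writes_per_address(trace, cached))
```

Because endurance is limited by the most heavily written cells, reducing the worst-case per-address write count is what extends the memory's lifetime in this simplified view.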

Regular Expression Matching with Memristor TCAMs for Network Security

  • Catherine E. Graves
  • Wen Ma
  • Xia Sheng
  • Brent Buchanan
  • Le Zheng
  • Si-Ty Lam
  • Xuema Li
  • Sai Rahul Chalamalasetti
  • Lennie Kiyama
  • Martin Foltin
  • John Paul Strachan
  • Matthew P. Hardy

We propose using memristor-based TCAMs (Ternary Content Addressable Memory) to accelerate Regular Expression (RegEx) matching. RegEx matching is a key function in network security, where deep packet inspection finds and filters out malicious actors. However, RegEx matching latency and power can be incredibly high and current proposals are challenged to perform wire-speed matching for large scale rulesets. Our approach dramatically decreases RegEx matching operating power, provides high throughput, and the use of mTCAMs enables novel compression techniques to expand ruleset sizes and allows future exploitation of the multi-state (analog) capabilities of memristors. We fabricated and demonstrated nanoscale memristor TCAM cells. SPICE simulations investigate mTCAM performance at scale and an mTCAM power model at 22nm demonstrates 0.2 fJ/bit/search energy for a 36×400 mTCAM. We further propose a tiled architecture which implements a Snort ruleset and assess the application performance. Compared to a state-of-the-art FPGA approach (2 Gbps, ~1W), we show 4x throughput (8 Gbps) at about 60% of the power (0.62W) before applying standard TCAM power-saving techniques. Our performance comparison improves further when striding (searching multiple characters) is considered, resulting in 47.2 Gbps at 1.3W for our approach compared to 3.9 Gbps at 630mW for the strided FPGA NFA, demonstrating a promising path to wire-speed RegEx matching on large scale rulesets.
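The throughput and power figures quoted in this abstract can be converted into energy per bit for a rough comparison. The sketch below only divides the numbers given in the abstract (power in W by throughput in Gbps, which yields nJ/bit); the FPGA power is the abstract's approximate "~1W", and the function name is hypothetical.

```python
def energy_per_bit_nj(power_w, throughput_gbps):
    """Energy per processed bit in nJ: W / Gbps = (J/s) / (1e9 bit/s)."""
    return power_w / throughput_gbps


# (power in W, throughput in Gbps), as quoted in the abstract.
designs = {
    "FPGA (approx. 1 W)": (1.0, 2.0),
    "mTCAM": (0.62, 8.0),
    "FPGA NFA, strided": (0.63, 3.9),
    "mTCAM, strided": (1.3, 47.2),
}

if __name__ == "__main__":
    for name, (power_w, gbps) in designs.items():
        print(f"{name}: {energy_per_bit_nj(power_w, gbps):.4f} nJ/bit")
```

In both the unstrided and strided cases, the mTCAM figures work out to roughly one sixth of the FPGA energy per bit.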

A Novel Cross-point MRAM with Diode Selector Capable of High-Density, High-Speed, and Low-Power In-Memory Computation

  • Chaoxin Ding
  • Wang Kang
  • He Zhang
  • Youguang Zhang
  • Weisheng Zhao

In-Memory Computation (IMC), which is capable of reducing the power consumption and bandwidth requirement resulting from the data transfer between the processing and memory units, has been considered as a promising technology to break the von-Neumann bottleneck. In order to develop an effective and efficient IMC platform, the performance, such as density, operation speed and power consumption, of the memory itself is one of the most important keys. In this work, we report a cross-point magnetic random access memory (MRAM) with diode selector for IMC implementation. The memory cell consists of a magnetic tunnel junction (MTJ) device and a diode connected in series. The memory cells are arranged in a cross-point array structure, providing high storage density. The MTJ can be switched through the unipolar precessional voltage-controlled magnetic anisotropy (VCMA) effect, thus enabling high speed and low power. Further, Boolean logic functions can be realized via regular memory-like write & read operations. The feasibility and performance of the proposed IMC in the crosspoint MRAM are successfully demonstrated with hybrid VCMA-MTJ/CMOS circuit simulations under the 40 nm technology node.

Hardware Acceleration Implementation of Sparse Coding Algorithm with Spintronic Devices

  • Deming Zhang
  • Yanchun Hou
  • Chengzhi Wang
  • Jie Chen
  • Lang Zeng
  • Weisheng Zhao

In this paper, we explore the possibility of a hardware acceleration implementation of a sparse coding algorithm with spintronic devices through a series of design optimizations across the architecture, circuit and device levels. Firstly, a domain wall motion (DWM) based compound spintronic device (CSD) is engineered and modelled, which is envisioned to achieve multiple conductance states. Subsequently, a parallel architecture is presented based on a dense cross-point array of the proposed DWM based CSD, where each dictionary (D) value can be mapped into the conductance of the proposed DWM based CSD at the corresponding cross-point. Benefitting from its massively parallel read and write operation, the proposed parallel architecture can accelerate the selected sparse coding algorithm using a dedicated periphery read and write circuit. Experimental results show that the selected sparse coding algorithm can be accelerated by 1400x with the proposed parallel architecture in comparison with a software implementation. Moreover, its energy dissipation is 8 orders of magnitude smaller than that of the software implementation.

Quantum-dot Cellular Automata RAM design using Crossbar Architecture

  • Orestis Liolis
  • Vassilios A. Mardiris
  • Georgios Ch. Sirakoulis
  • Ioannis G. Karafyllidis

In this paper, a new approach to RAM circuits, using Quantum-dot Cellular Automata (QCA), based on programmable crossbar architecture, is presented. In addition, a methodology for 2n bits RAMs is presented. Using the aforementioned methodology any designer can design a RAM regardless of its size. The proposed designs utilize the benefits of QCA programmable crossbar architecture. Namely, the RAM circuit is characterized by regularity and the ability of customization. The features of the proposed RAM design methodology allow designers to use the available circuit area efficiently.

Integrated Synthesis Methodology for Crossbar Arrays

  • M. Ceylan Morgul
  • Onur Tunali
  • Mustafa Altun
  • Luca Frontini
  • Valentina Ciriani
  • E. Ioana Vatajelu
  • Lorena Anghel
  • Csaba Andras Moritz
  • Mircea R. Stan
  • Dan Alexandrescu

Nano-crossbar arrays have emerged as area and power efficient structures with an aim of achieving high performance computing beyond the limits of current CMOS. Due to the stochastic nature of nano-fabrication, nano arrays show different properties at both the structural and physical device levels compared to conventional technologies. These factors introduce random characteristics that need to be carefully considered by the synthesis process. For instance, a competent synthesis methodology must consider the basic technology preference for switching elements, the defect or fault rates of the given nano switching array, and the variation values as well as their effects on performance metrics including power, delay, and area. The synthesis methodology presented in this study comprehensively covers all the specified factors and provides optimization algorithms for each step of the process.

Minimal Disturbed Bits in Writing Resistive Crossbar Memories

  • Mohammed E Fouda
  • Ahmed M. Eltawil
  • Fadi Kurdahi

Resistive memories are promising candidates for non-volatile memories. Write disturb is one of the problems facing this kind of memory. In this paper, the write disturb problem is mathematically formulated in terms of the bias parameters and optimized analytically. A closed form solution for the optimal bias parameters is calculated. Results are compared with the 1/2 and 1/3 bias schemes, showing a significant improvement.
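For context, the 1/2 and 1/3 bias schemes used as baselines can be sketched by computing the voltage across every cell of an ideal crossbar. This is a minimal illustration assuming no wire resistance and the conventional line voltages for each scheme; it does not reproduce the paper's optimized bias parameters, and `cell_voltages` is a hypothetical helper.

```python
from fractions import Fraction


def cell_voltages(rows, cols, sel_row, sel_col, v_write, scheme):
    """Voltage across each cell (row line minus column line) during a write.

    The selected row is driven to v_write and the selected column to 0.
    Unselected lines follow the conventional "1/2" or "1/3" scheme.
    """
    v = Fraction(v_write)
    if scheme == "1/2":
        row_unsel, col_unsel = v / 2, v / 2
    elif scheme == "1/3":
        row_unsel, col_unsel = v / 3, 2 * v / 3
    else:
        raise ValueError("scheme must be '1/2' or '1/3'")
    row_bias = [v if r == sel_row else row_unsel for r in range(rows)]
    col_bias = [Fraction(0) if c == sel_col else col_unsel for c in range(cols)]
    return [[row_bias[r] - col_bias[c] for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    for scheme in ("1/2", "1/3"):
        grid = cell_voltages(3, 3, sel_row=0, sel_col=0, v_write=3, scheme=scheme)
        print(scheme, [[str(x) for x in row] for row in grid])
```

In the 1/2 scheme, cells sharing a line with the selected cell see half the write voltage and all other cells see none; in the 1/3 scheme, every unselected cell sees one third of the write voltage in magnitude.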

A Recursive Growing & Featuring Mechanism for Nanocomputing Structures

  • Mihaela Maliţa
  • Gheorghe M. Ştefan

The huge number of physical possibilities offered by the emerging nanotechnologies must be accompanied, beyond the uniform growing mechanisms supposed by the current serial and/or parallel extensions, by an appropriate structuring mechanism able to efficiently support the increasing functional demands. A recursive growing mechanism is proposed for the upcoming Nano-Era. The current growing mechanism involves only purely quantitative aspects. We consider as mandatory, for very large systems, another mechanism which interleaves the quantitative aspects with the functional ones. Because computational parallelism is implicit for large systems, the growing mechanism must also be supported by an appropriate computational model. For the current systems we started from gates. For the Nano-Era structuring mechanism we will start from cellular automata. The main difference is that for nanoarchitectures the growing mechanism and the featuring mechanism are unified in a unique recursive mechanism.

Free BDD based CAD of Compact Memristor Crossbars for in-Memory Computing

  • Amad Ul Hassen
  • Salman Anwar Khokhar
  • Haseeb Aslam Butt
  • Sumit Kumar Jha

The demise of Moore’s law, breakdown of Dennard Scaling, dark silicon phenomenon, process variation, leakage currents and quantum tunneling are some of the hurdles faced in the further advancement of computing systems today. As a result, there is a renewed interest in alternate computing paradigms using emerging nanoelectronic devices. This work uses free binary decision diagrams (FBDDs) for computer-aided design (CAD) of compact memristive crossbars for sneak-path based in-memory computing. The absence of a fixed variable ordering makes FBDDs more compact than their ordered counterpart called reduced ordered binary decision diagrams (ROBDDs). Our design has used the size of the circuit-representation of Boolean functions for selecting different variable orderings along different paths which results in compact FBDDs. We have demonstrated our approach by designing compact crossbars for a four-bit multiplier and other RevLib benchmarks. Our synthesis process yields a 50.1% reduction in area over the previous FBDD-based synthesis for the fourth-output-bit of the multiplier. Overall, our approach has reduced the multiplier area by 20.1%.

Crosstalk based Fine-Grained Reconfiguration Techniques for Polymorphic Circuits

  • Naveen Macha
  • Sandeep Geedipally
  • Bhavana Repalle
  • Md Arif Iqbal
  • Wafi Danesh
  • Mostafizur Rahman

Truly polymorphic circuits, whose functionality/circuit behavior can be altered using a control variable, can provide tremendous benefits in multi-functional system design and resource sharing. For secure and fault tolerant hardware designs these can be crucial as well. Polymorphic circuit works in the literature so far rely either on environmental parameters such as temperature, variation, etc. or on special devices such as ambipolar FETs, configurable magnetic devices, etc., which often result in inefficiencies in performance and/or realization. In this paper, we introduce a novel polymorphic circuit design approach where deterministic interference between nano-metal lines is leveraged for logic computing and configuration. For computing, the proposed approach relies on nano-metal lines, their interference and commonly used FETs. For polymorphism, it requires only an extra metal line that carries the control signal. In this paper, we show a wide range of crosstalk polymorphic logic gates and their evaluation results. We also show an example of a large circuit that performs both the functionalities of multiplier and sorter depending on the configuration signal. A comparison is made with respect to other existing approaches in the literature, and transistor count is benchmarked. For crosstalk-polymorphic circuits, the transistor count reductions range from 25% to 83% with respect to various other approaches. For example, the polymorphic AOI21-OA21 cell shows 83%, 85% and 50% transistor count reductions, and the Multiplier-Sorter circuit shows 40%, 36% and 28% transistor count reductions with respect to CMOS, genetically evolved, and ambipolar transistor based polymorphic circuits, respectively.

A Novel Analog to Digital Conversion Concept with Crosstalk Computing

  • Rajanikanth Desh
  • Naveen Kumar Macha
  • Sehtab Hossain
  • Repalle Bhavana Tejaswini
  • Mostafizur Rahman

Analog to Digital Converters (ADCs) are core components of computing systems, forming a link between external stimuli and digital microprocessor operations. Current CMOS based fast ADCs are difficult to scale due to the reliance on transistor sizing and high voltage operations. They also suffer from high power consumption. In this paper, we introduce a novel ADC design which uses the deterministic signal interference between metal lines as a mechanism for signal conversion. In contrast to CMOS ADCs, our approach uses a simple crosstalk tree network of metal lines to convert sampled analog levels to digital code. Here, the sampled analog signal is passed through an input metal line which is capacitively coupled to a series of metal lines in a tree-like layout, and the coupled voltages on the edge of the tree (the leaves) determine the output. The resolution is dependent on the number of branches. We show 2-bit and 3-bit ADCs implemented through this mechanism at the 16nm technology node. Our results indicate the possibility of huge power savings with Crosstalk ADCs in comparison to CMOS; for 2-bit and 3-bit ADCs the power consumption was found to be 43.51μW and 96.74μW respectively at 50 MHz sampling frequency.

Energy Efficiency of Low Swing Signaling for Emerging Interposer Technologies

  • Eleni Maragkoudaki
  • Przemyslaw Mroszczyk
  • Vasilis F. Pavlidis

Interconnects often constitute the major bottleneck in the design process of low power integrated circuits (IC). Although 2.5-D integration technologies support physical proximity, the dissipated power in the communication links remains high. In this work, the additional power savings for interposer-based interconnects enabled by low swing signaling are investigated. The energy consumed by a low swing scheme is, therefore, compared with a full swing solution and the critical length of the interconnect, above which the low swing solution starts to pay off, is determined for diverse interposer technologies. The energy consumption is compared for three different substrate materials: silicon, glass, and organic. Results indicate that the higher the load capacitance of the communication medium is, the greater the energy savings of the low swing circuit are. Specifically, in cases where electrostatic discharge (ESD) protection is required, the low swing circuit is always superior in terms of energy consumption due to the high capacitive load of the ESD circuit, regardless of the substrate material and the link length. Without ESD protection, the highest critical length is about 380 μm for glass and organic interposers. To further explore the limits of power reduction from low swing signaling for 2.5-D ICs, the effect of typical interconnect parameters such as width and space on the energy efficiency of low swing communication is evaluated.

Energy-Efficient 4T SRAM Bitcell with 2T Read-Port for Ultra-Low-Voltage Operations in 28 nm 3D Monolithic CoolCube™ Technology

  • Reda Boumchedda
  • Jean-Philippe Noel
  • Bastien Giraud
  • Adam Makosiej
  • Marco Antonio Rios
  • Eduardo Esmanhotto
  • Emilien Bourde-Cicé
  • Mathis Bellet
  • David Turgis
  • Edith Beigne

This paper presents a 4T-based SRAM bitcell optimized both for write and read operations at ultra-low voltage (ULV). The proposed bitcell is designed to respond to the requirements of energy constrained systems, as in the case of most IoT-oriented circuits and applications. The use of 3D CoolCube™ technology enables the design of a stable 4T SRAM bitcell by using data-dependent back biasing. The proposed bitcell architecture provides a major reduction of the write operation energy consumption compared to a conventional 6T bitcell. A dedicated read port coupled to a virtual GND (VGND) ensures full functionality of read operations at ULV. Simulation results show reliable operations down to 0.35 V close to six sigma (6 σ) without any assist techniques (e.g. negative bitlines), achieving in the worst case corner 300 ns and 125 ns in write and read access time, respectively. A 6x energy consumption reduction compared to a ULV ultra-low-leakage (ULL) 6T bitcell is demonstrated.

Energy-efficient MFCC extraction architecture in mixed-signal domain for automatic speech recognition

  • Qin Li
  • Huifeng Zhu
  • Fei Qiao
  • Qi Wei
  • Xinjun Liu
  • Huazhong Yang

This paper proposes a novel processing architecture to extract Mel-Frequency Cepstrum Coefficients (MFCC) for automatic speech recognition. Inspired by the human ear, energy-efficient analog-domain information processing is adopted to replace the energy-intensive Fourier Transform of the conventional digital domain. Moreover, the proposed architecture extracts the acoustic features in the mixed-signal domain, which significantly reduces the cost of the Analog-to-Digital Converter (ADC) and the computational complexity. We carry out circuit-level simulation based on 180nm CMOS technology, which shows an energy consumption of 2.4 nJ/frame and a processing speed of 45.79 μs/frame. The proposed architecture achieves 97.2% energy saving and about 6.4x speedup over the state of the art. Speech recognition simulation reaches a classification accuracy of 99% using the proposed MFCC features.

Power Analysis of an mRNA-Ribosome System

  • Pratima Chatterjee
  • Prasun Ghosal

Energy is at the heart of driving any device or machine. As researchers increasingly strive for low-energy operation, energy requirements are becoming one of the key measures of a device's performance. On the other hand, as conventional silicon-based computing is approaching a barrier, the need for non-conventional computing is increasing. Though several such computing platforms have arisen to prove themselves as suitable alternatives to silicon-based computing, a low energy requirement is certainly one of the most sought-after features in the competition among the new platforms. Moreover, there are certain scenarios where performing calculations in purely bio-molecular ways is highly desired. Although DNA computing has already demonstrated the success of bio-molecular computing in terms of energy/power requirements, its manual nature keeps it behind other computing techniques. Another new bio-molecular computing technique, Ribosomal Computing, though still in its infancy, has shown real promise due to its inherent automation. This work performs an analysis of the energy/power requirements of this computing technique. Given the promising results obtained, together with its inherent automation, ribosomal computing can claim to be a promising computing technique.

Controlling distilleries in fault-tolerant quantum circuits: problem statement and analysis towards a solution

  • Alexandru Paler

The failure susceptibility of the quantum hardware will force quantum computers to execute fault-tolerant quantum circuits. These circuits are based on quantum error correcting codes, and there is increasing evidence that one of the most practical choices is the surface code. Design methodologies of surface code based quantum circuits were focused on the layout of such circuits without emphasizing the reduced availability of hardware and its effect on the execution time. Circuit layout has not been investigated for practical scenarios, and the problem presented herein was neglected until now. For achieving fault-tolerance and implementing surface code based computations, a significant amount of computing resources (hardware and time) are necessary for preparing special quantum states in a procedure called distillation. This work introduces the problem of how distilleries (circuit portions responsible for state distillation) influence the layout of surface code protected quantum circuits, and analyses the tradeoffs for reducing the resources necessary for executing the circuits. A first algorithmic solution is presented, implemented and evaluated for addition quantum circuits.

Signal Synchronization in Large Scale Quantum-dot Cellular Automata Circuits

  • Vassilios A. Mardiris
  • Orestis Liolis
  • Georgios Ch. Sirakoulis
  • Ioannis G. Karafyllidis

Quantum-dot fabrication is a well-established nanotechnology, which has many applications in many different scientific fields. By placing four quantum dots on the corners of a square, a cell is formed in which digital information can be stored. This cell serves as the structural device of Quantum-dot Cellular Automata (QCA) circuits. Since QCA was introduced, several digital circuits and systems have been designed and proposed in the literature. However, one of the biggest problems QCA designers face on the way to functional, large-scale QCA circuits is signal synchronization. In this paper, a novel approach to the aforementioned problem is presented. This approach is inspired by the well-known computational problem of Firing Squad Synchronization (FSS). The FSS problem has many similarities with the synchronization problem of large-scale QCA circuits. In addition, the FSS problem has been studied by many researchers and many efficient solutions have been proposed in the literature.

Size Optimization of MIGs with an Application to QCA and STMG Technologies

  • Heinz Riener
  • Eleonora Testa
  • Luca Amaru
  • Mathias Soeken
  • Giovanni De Micheli

Majority-inverter graphs (MIGs) are a logic representation with remarkable algebraic and Boolean properties that enable efficient logic optimizations beyond the capabilities of traditional logic representations. Further, since many nano-emerging technologies, such as quantum-dot cellular automata (QCA) or spin torque majority gates (STMG), are inherently majority-based, MIGs serve as a natural logic representation to map into these technologies. So far, MIG optimization methods predominantly target to reduce the depth of the logic networks, corresponding to low delay implementations in the respective technologies. In this paper, we introduce several methods to optimize the size of MIGs. They can be applied such that the depth of the logic network is preserved; therefore our methods have a direct effect on the physical area, without worsening the delay. Some methods are inspired by existing size optimization algorithms for non-majority-based logic networks, others make explicit use of the majority function and its properties. All methods are Boolean—in contrast to algebraic optimization methods—which has a positive effect on the quality but challenges their implementation. Our experiments show that using our methods the size of MIGs in the EPFL combinational benchmark suite can be reduced by up to 7.12%. When mapped to QCA and STMG technologies we reduce the average area-delay-energy product by 2.31% and 2.07%, respectively.
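
The majority-function properties that the abstract alludes to can be checked exhaustively for three inputs. The sketch below is only an illustration of a few textbook identities (AND and OR as majority with a constant input, commutativity, self-duality); the function names are made up here, and it does not reproduce any of the paper's optimization methods.

```python
from itertools import product


def maj(a: int, b: int, c: int) -> int:
    """Three-input majority: 1 when at least two of the 0/1 inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def check_majority_identities() -> bool:
    """Exhaustively verify a few well-known majority identities."""
    for a, b, c in product((0, 1), repeat=3):
        # AND and OR are majority gates with one input tied to a constant.
        if maj(a, b, 0) != (a & b) or maj(a, b, 1) != (a | b):
            return False
        # Input order does not matter, and two equal inputs decide the output.
        if maj(a, b, c) != maj(c, a, b) or maj(a, a, b) != a:
            return False
        # Self-duality: complementing every input complements the output.
        if maj(1 - a, 1 - b, 1 - c) != 1 - maj(a, b, c):
            return False
    return True
```

Because AND and OR fall out as special cases, any AND/OR/inverter network can be rewritten with majority nodes and inverters, which is why majority-based technologies such as QCA are a natural mapping target.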

Representation of Qubit States using 3D Memristance Spaces: A first step towards a Memristive Quantum Simulator

  • Ioannis Karafyllidis
  • Georgios Ch. Sirakoulis
  • Panagiotis Dimitrakis

Development of quantum simulators is a major step towards the universal quantum computer. Quantum simulators are quantum systems that can perform specific quantum computations, or software packages that can reproduce most of the aspects of a general universal quantum computer on a general purpose classical computer. Development of quantum simulators using digital circuits, such as FPGAs is very difficult, mainly because the unit of quantum information, the qubit, has an infinite number of states, whereas the classical bit has only two. On the other hand, analog circuits comprising R, L and C elements have no internal state variables that can be used to reproduce and store qubit states. Here we take the first step towards the development of a new quantum simulator using memristors. The qubit state is mapped to a 3D space spanned by the memristances of three identical memristors. The qubit state evolution is reproduced by the input voltages applied to the memristors. We define the correspondence between the general qubit state rotation, i.e. the one-qubit quantum gates, and memristor input voltage variations and reproduce the rotations imposed by the action of quantum gates in the 3D memristance space. Our results show that, at least in principle, qubits and one-qubit quantum gates can be simulated by memristors.

ISLPED 2018 TOC

SESSION: Machine Learning – Inference

Value-driven Synthesis for Neural Network ASICs

  • Zhiyuan Yang
  • Ankur Srivastava

In order to enable low power and high performance evaluation of neural network (NN) applications, we investigate new design methodologies for synthesizing neural network ASICs (NN-ASICs). An NN-ASIC takes a trained NN and implements a chip with customized optimization. Knowing the NN topology and weights allows us to develop unique optimization schemes which are not available to regular ASICs. In this work, we investigate two types of value-driven optimized multipliers which exploit the knowledge of synaptic weights and we develop an algorithm to synthesize the multiplication of trained NNs using these special multipliers instead of general ones. The proposed method is evaluated using several Deep Neural Networks. Experimental results demonstrate that compared to traditional NNPs, our proposed NN-ASICs can achieve up to 6.5x and 55x improvement in performance and energy efficiency (i.e. inverse of Energy-Delay-Product), respectively.

CLINK: Compact LSTM Inference Kernel for Energy Efficient Neurofeedback Devices

  • Zhe Chen
  • Andrew Howe
  • Hugh T. Blair
  • Jason Cong

A neurofeedback device measures brain waves and generates a feedback signal in real time, and can be employed as a treatment for various neurological diseases. Such devices require high energy efficiency because they need to be worn or surgically implanted into patients and must support long battery lifetime. In this paper, we propose CLINK, a compact LSTM inference kernel, to achieve highly energy-efficient EEG signal processing for neurofeedback devices. The LSTM kernel can approximate conventional filtering functions while saving 84% of computational operations. Based on this method, we propose energy-efficient customizable circuits for realizing the CLINK function. We demonstrated a 128-channel EEG processing engine on Zynq-7030 with 0.8 W, and the scaled-up 2048-channel evaluation on Virtex-VU9P shows that our design can achieve 215x and 7.9x energy efficiency compared to highly optimized implementations on E5-2620 CPU and K80 GPU, respectively. We carried out the CLINK design in a 15-nm technology, and synthesis results show that it can achieve 272.8 pJ/inference energy efficiency, which further outperforms our design on the Virtex-VU9P by 99x.

Compact Convolution Mapping on Neuromorphic Hardware using Axonal Delay

  • Jinseok Kim
  • Yulhwa Kim
  • Sungho Kim
  • Jae-Joon Kim

Mapping a Convolutional Neural Network (CNN) to neuromorphic hardware has been inefficient in synapse memory usage because neither kernel reuse nor input reuse is exploited well. We propose a method to enable kernel reuse by utilizing axonal delay, which is a biological parameter for a spiking neuron. Using IBM TrueNorth as a test platform, we demonstrate that the number of cores, neurons, synapses, and synaptic operations per time step can be reduced by up to 20.9x, 27.9x, 88.4x, and 1586x, respectively, compared to the conventional scheme, which raises the possibility of implementing large-scale CNNs on neuromorphic hardware.

NNest: Early-Stage Design Space Exploration Tool for Neural Network Inference Accelerators

  • Liu Ke
  • Xin He
  • Xuan Zhang

Deep neural networks (DNNs) have achieved spectacular success in recent years. In response to DNNs’ enormous computation demand and memory footprint, numerous inference accelerators have been proposed. However, the diverse nature of DNNs, both at the algorithm level and the parallelization level, makes it hard to arrive at a “one-size-fits-all” hardware design. In this paper, we develop NNest, an early-stage design space exploration tool that can speedily and accurately estimate the area/performance/energy of DNN inference accelerators based on high-level network topology and architecture traits, without the need for low-level RTL code. Equipped with a generalized spatial architecture framework, NNest is able to perform fast high-dimensional design space exploration across a wide spectrum of architectural/micro-architectural parameters. Our proposed novel data movement strategies and multi-layer fitting schemes allow NNest to more effectively exploit the parallelism inherent in DNNs. Results generated by NNest demonstrate: 1) previously undiscovered accelerator design points that can outperform state-of-the-art implementations by 39.3% in energy efficiency; 2) Pareto frontier curves that comprehensively and quantitatively reveal the multi-objective tradeoffs in custom DNN accelerators; 3) holistic design exploration of different levels of quantization techniques, including the recently proposed binary neural network (BNN).

SESSION: Hardware Security

Blacklist Core: Machine-Learning Based Dynamic Operating-Performance-Point Blacklisting for Mitigating Power-Management Security Attacks

  • Sheng Zhang
  • Adrian Tang
  • Zhewei Jiang
  • Simha Sethumadhavan
  • Mingoo Seok

Most modern computing devices make available fine-grained control of operating frequency and voltage for power management. These interfaces, as demonstrated by recent attacks, open up a new class of software fault injection attacks that compromise security on commodity devices. CLKSCREW, a recently published attack that stretches the frequency of devices beyond their operational limits to induce faults, is one such attack. Statically and permanently limiting the frequency and voltage modulation space, i.e., guard-banding, could mitigate such attacks, but it incurs large performance degradation and long testing time. Instead, in this paper, we propose a run-time technique which dynamically blacklists unsafe operating performance points using a neural-net model. The model is first trained offline at design time and then adjusted at run-time by inspecting a selected set of features such as power management control registers, timing-error signals, and core temperature. We designed the algorithm and hardware, titled a BlackList (BL) core, which is capable of detecting and mitigating such power-management-based security attacks with high accuracy. The BL core incurs a reasonably small amount of overhead in power, delay, and area.

Threshold Defined Camouflaged Gates in 65nm Technology for Reverse Engineering Protection

  • Anirudh S. Iyengar
  • Deepak Vontela
  • Ithihasa Reddy
  • Swaroop Ghosh
  • Syedhamidreza Motaman
  • Jae-won Jang

Due to the ever-increasing threat of Reverse Engineering (RE) of Intellectual Property (IP) for malicious gains, camouflaging of logic gates is becoming very important. In this paper, we present an experimental demonstration of transistor threshold voltage-defined switch [2] based camouflaged logic gates that can hide six logic functionalities, i.e., NAND, AND, NOR, OR, XOR and XNOR. The proposed gates can be used to design the IP, forcing an adversary to perform brute-force guess-and-verify of the underlying functionality and thereby increasing the RE effort. We propose two flavors of camouflaging, one employing only a pass transistor (NMOS-switch) and the other utilizing a full pass transistor (CMOS-switch). The camouflaged gates are used to design Ring-Oscillators (RO) in ST 65nm technology, one for each functionality, on which we have performed temperature, voltage, and process-variation analysis. We observe that the CMOS-switch based camouflaged gate offers higher performance (~1.5-8X better) than the NMOS-switch based gate at an added area cost of only 5%. The proposed gates remain functional down to 0.65V. We are also able to reclaim lost performance by dynamically changing the switch gate voltage and show that robust operation can be achieved at lower voltage and under temperature fluctuation.

Reliability and Uniformity Enhancement in 8T-SRAM based PUFs operating at NTC

  • Pramesh Pandey
  • Asmita Pal
  • Koushik Chakraborty
  • Sanghamitra Roy

SRAM-based PUFs (SPUFs) have emerged as promising security primitives for low-power devices. However, operating 8T-SPUFs in the Near-Threshold Computing (NTC) realm is plagued by exacerbated process variation (PV) sensitivity, which thwarts their reliable operation. In this paper, we demonstrate the massive degradation in the reliability and uniformity characteristics of 8T-SPUF. By exploiting the opportunities bestowed by the schematic asymmetry of 8T-SPUF cells, we propose biasing and sizing based design strategies. Our techniques achieve an immense improvement of more than 55% in the percentage of unreliable cells and improve the proximity to ideal uniformity by 82%, over a baseline NTC 8T-SPUF with no enhancement.

Efficient and Secure Group Key Management in IoT using Multistage Interconnected PUF

  • Hongxiang Gu
  • Miodrag Potkonjak

Secure group-oriented communication is crucial to a wide range of applications in the Internet of Things (IoT). Security problems related to group-oriented communications in IoT-based applications placed in a privacy-sensitive environment have become a major concern along with the development of the technology. Unfortunately, many IoT devices are designed to be portable and light-weight; thus, their functionalities, including security modules, are heavily constrained by the limited energy resources (e.g., battery capacity). To address these problems, we propose a group key management scheme based on a novel physically unclonable function (PUF) design, the multistage interconnected PUF (MIPUF), to secure group communications in an energy-constrained environment. Our design is capable of performing key management tasks such as key distribution, key storage and rekeying securely and efficiently. We show that our design is secure against multiple attack methods, and our experimental results show that our design saves 47.33% of energy globally on average compared to a state-of-the-art elliptic-curve cryptography (ECC)-based key management scheme.

SESSION: Energy Efficient Wireline Circuits

An Energy-Efficient High-Swing PAM-4 Voltage-Mode Transmitter

  • Lejie Lu
  • Yong Wang
  • Hui Wu

As the data rate of high-speed I/Os continues to increase, four-level pulse amplitude modulation (PAM-4) is adopted to improve the bandwidth density and link margin at 50 Gb/s and beyond. Compared to non-return-to-zero (NRZ) signaling, however, the PAM-4 eye height is reduced, which calls for a larger transmitter swing to maintain the signal-to-noise ratio. A new energy-efficient transmitter is proposed to generate large-swing PAM-4 signals with a cascode voltage-mode driver and supporting pre-drivers and logic circuits. By reconfiguring the pull-up and pull-down branches based on the transmit data and steering the bypass currents, the proposed voltage-mode driver significantly reduces power consumption compared to a conventional implementation while maintaining impedance matching. A voltage stacking technique is adopted for the pre-drivers to further improve energy efficiency. To demonstrate the new transmitter design, a prototype 56 Gb/s PAM-4 transmitter is designed using a generic 28-nm CMOS technology with a 2-V power supply voltage. It achieves an overall output swing of 2 V and a minimum eye height of 490 mV with good linearity (98.7% level separation mismatch ratio). Compared to a conventional voltage-mode transmitter design with the same swing, the static power consumption of the new transmitter is reduced almost by half (from 30 mW to 16 mW), and its overall energy efficiency improves from 0.7 pJ/b to 0.5 pJ/b.

Energy-Efficient Dynamic Comparator with Active Inductor for Receiver of Memory Interfaces

  • Jae Whan Lee
  • Joo-Hyung Chae
  • Jihwan Park
  • Hyunkyu Park
  • Jaekwang Yun
  • Suhwan Kim

In this paper, we propose a dynamic comparator that improves the operating performance of a receiver (RX) while reducing power consumption. It is implemented as a double-tail StrongARM latch comparator with an active inductor, and its power consumption at high speed is minimized, resulting in better energy efficiency at the targeted high frequency. In this regard, our comparator is suitable for a memory-application RX that must be both low-power and high-speed. It is applied to a single-ended RX designed with a continuous-time linear equalizer, a clock generator and a quarter-rate 2-tap decision-feedback equalizer, which is appropriate for high-frequency memory applications. Compared to the conventional one, our design, fabricated in a 55nm CMOS process, provides an improvement of 7% in unit interval (UI) margin under the same power consumption and receives up to 10Gb/s PRBS15 data at BER < 10^-12 with 0.4 UI margin and energy efficiency of 0.67pJ/bit.

4-Channel Push-Pull VCSEL Drivers for HDMI Active Optical Cable in 0.18-μm CMOS

  • Jeongho Hwang
  • Hong-Seok Choi
  • Hyungrok Do
  • Gyu-Seob Jeong
  • Daehyun Koh
  • Seong Ho Park
  • Deog-Kyoon Jeong

The price and power consumption of standard HDMI cables rise exponentially when the data rate increases or the cable runs longer. HDMI active optical cable (AOC) can potentially solve the price and power issues since fibers are tolerant to loss. However, additional optical components such as a vertical-cavity surface-emitting laser (VCSEL) and a photodiode (PD) are required. Therefore, drivers and transimpedance amplifiers should be designed carefully for normal operation. In this paper, two types of 4-channel VCSEL drivers for HDMI AOC are presented. The first type of driver passes data and bias separately. It uses off-chip capacitors for AC coupling. The second type of driver, on the other hand, passes data including the DC value without using off-chip capacitors. The structures of both drivers are based on push-pull current-mode logic (CML) to achieve better power efficiency. Drivers fabricated in a 0.18-μm CMOS process consume 36.5 mW/channel at 6 Gb/s and 24.7 mW/channel at 12 Gb/s, respectively.

SESSION: Approximate Computing

RMAC: Runtime Configurable Floating Point Multiplier for Approximate Computing

  • Mohsen Imani
  • Ricardo Garcia
  • Saransh Gupta
  • Tajana Rosing

Approximate computing is a way to build fast and energy-efficient systems that provide responses of good enough quality tailored for different purposes. In this paper, we propose RMAC, a novel approximate floating point multiplier which efficiently multiplies two floating point numbers and yields a high-precision product. RMAC approximates the costly mantissa multiplication with a simple addition of the mantissas of the input operands. To tune the level of accuracy, RMAC looks at the first bit of the input mantissas as well as the first N bits of the result of the addition to dynamically estimate the maximum multiplication error rate. Then, RMAC decides to either accept the approximate result or re-execute the exact multiplication. Depending on the value of N, the proposed RMAC can be configured to achieve different levels of accuracy. We integrate the proposed RMAC in an AMD Southern Islands GPU by replacing the existing floating point units with RMAC. We test the efficiency and accuracy of the enhanced GPU on a wide range of applications, including multimedia and machine learning applications. Our evaluations show that a GPU enhanced by the proposed RMAC can achieve a 5.2x energy-delay product improvement over a GPU using conventional FPUs while ensuring less than 2% quality loss. Comparing our approach with other state-of-the-art approximate multipliers shows that RMAC can achieve 3.1x faster and 1.8x more energy-efficient computations while providing the same quality of service.
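
The idea of replacing the mantissa multiplication with an addition can be sketched in a few lines. Writing each operand as (1 + f) * 2^e with 0 <= f < 1, the exact product mantissa is 1 + fx + fy + fx*fy, and the approximation drops the fx*fy term. The code below is a minimal illustration under that assumption; the names `split`, `approx_mul` and `tunable_mul` are hypothetical, and the fallback rule computes the relative error directly rather than estimating it from leading bits as the abstract describes.

```python
import math


def split(x: float) -> tuple[float, int]:
    """Return (f, e) with |x| = (1 + f) * 2**e and 0 <= f < 1, for x != 0."""
    m, e = math.frexp(abs(x))  # |x| = m * 2**e with 0.5 <= m < 1
    return 2.0 * m - 1.0, e - 1


def approx_mul(x: float, y: float) -> float:
    """Approximate x * y by adding mantissa fractions instead of multiplying."""
    if x == 0.0 or y == 0.0:
        return 0.0
    fx, ex = split(x)
    fy, ey = split(y)
    sign = -1.0 if (x < 0) != (y < 0) else 1.0
    # (1 + fx)(1 + fy) = 1 + fx + fy + fx*fy; the fx*fy term is dropped.
    return sign * math.ldexp(1.0 + fx + fy, ex + ey)


def tunable_mul(x: float, y: float, max_rel_err: float) -> float:
    """Accept the approximation only if its relative error is small enough.

    Stand-in rule for illustration: the error is computed exactly here,
    which a hardware unit would avoid by inspecting a few leading bits.
    """
    if x == 0.0 or y == 0.0:
        return 0.0
    fx, _ = split(x)
    fy, _ = split(y)
    rel_err = (fx * fy) / ((1.0 + fx) * (1.0 + fy))
    return approx_mul(x, y) if rel_err <= max_rel_err else x * y
```

In this sketch the result is exact whenever either operand is a power of two (its fraction is zero), and the relative error approaches 25% as both fractions approach 1, which is why some form of accuracy tuning is needed.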

Designing Efficient Imprecise Adders using Multi-bit Approximate Building Blocks

  • Sarvenaz Tajasob
  • Morteza Rezaalipour
  • Masoud Dehyadegari
  • Mahdi Nazm Bojnordi

Energy-efficiency has become a major concern in designing computer systems. One of the most promising solutions to enhance power and energy-efficiency in error tolerant applications is approximate computing that balances accuracy, area, delay, and power consumption based on the computational needs. By trading accuracy of computation, approximate computing may achieve significant improvements in speed, power, and area consumption.

Adders are important arithmetic units widely used in almost every digital processing system, and they contribute significantly to power dissipation. With the emergence of deep learning tasks and fault-tolerant big data processing in every aspect of today’s computing, the demand for low-power and energy-efficient approximate adders has increased significantly. Numerous designs have been proposed in the literature that build multi-bit adders using novel approximate full adder circuits. Regrettably, relying only on single-bit building blocks limits the design space of approximate adders and prevents designers from achieving the most significant benefits of approximate circuits. This paper presents a novel approach to designing imprecise multi-bit adders, based on four novel approximate 2- and 3-bit adder building blocks. The proposed circuits are evaluated and compared with the existing low-power adders in terms of various design characteristics, such as area, delay, power, and error tolerance. Our simulation results indicate that the proposed adders achieve more than 60% reduction in power and area consumption compared to the state-of-the-art approximate adders while introducing 12-17% less error in computation.

An Energy-Efficient, Yet Highly-Accurate, Approximate Non-Iterative Divider

  • Marzieh Vaeztourshizi
  • Mehdi Kamal
  • Ali Afzali-Kusha
  • Massoud Pedram

In this paper, we present a highly accurate and energy-efficient non-iterative divider, which uses multiplication as its main building block. In this structure, the division operation is performed by first reforming both dividend and divisor inputs, and then multiplying the rounded value of the scaled dividend by the reciprocal of the rounded value of the scaled divisor. More precisely, the interval representing the fractional value of the scaled divisor is partitioned into non-overlapping sub-intervals, and the reciprocal of the scaled divisor is then approximated with a linear function in each of these sub-intervals. The efficacy of the proposed divider structure is assessed by comparing its design parameters and accuracy with state-of-the-art, non-iterative approximate dividers as well as exact dividers in 45nm digital CMOS technology. Circuit simulation results show that the mean absolute relative error of the proposed structure for a 32-bit division is less than 0.2%, while the proposed structure has significantly lower energy consumption than the exact divider. Finally, the effectiveness of the proposed divider in one image processing application is reported and discussed.
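
The reciprocal-by-linear-segments step can be illustrated as follows. This is a rough sketch, not the paper's circuit: it scales the divisor into [1, 2), splits that range into equal sub-intervals, and uses the straight line through the endpoints of each sub-interval as the approximation of 1/s. The segment count, the choice of endpoint interpolation, and the names `approx_reciprocal` and `approx_div` are assumptions made here, and the operand rounding mentioned in the abstract is omitted.

```python
import math


def approx_reciprocal(s: float, segments: int = 8) -> float:
    """Piecewise-linear approximation of 1/s for s in [1, 2)."""
    idx = min(int((s - 1.0) * segments), segments - 1)
    x0 = 1.0 + idx / segments
    x1 = x0 + 1.0 / segments
    # In hardware these endpoint values would be precomputed constants.
    y0, y1 = 1.0 / x0, 1.0 / x1
    t = (s - x0) * segments  # position inside the sub-interval, 0 <= t < 1
    return y0 + (y1 - y0) * t


def approx_div(n: float, d: float, segments: int = 8) -> float:
    """Approximate n / d as n times a piecewise-linear reciprocal of d."""
    if d == 0.0:
        raise ZeroDivisionError("division by zero")
    m, e = math.frexp(abs(d))  # |d| = m * 2**e with 0.5 <= m < 1
    s = 2.0 * m                # scaled divisor in [1, 2), |d| = s * 2**(e - 1)
    q = math.ldexp(abs(n) * approx_reciprocal(s, segments), -(e - 1))
    return -q if (n < 0) != (d < 0) else q
```

With eight segments the relative error of this particular interpolation stays below roughly 0.4%; more segments or better-fitted lines would reduce it further, at the cost of more stored coefficients.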

SESSION: Architectural Techniques

Aggressive Slack Recycling via Transparent Pipelines

  • Gokul Subramanian Ravi
  • Mikko H. Lipasti

In order to operate reliably and produce expected outputs, modern architectures set timing margins conservatively at design time to support extreme variations in workload and environment. Unfortunately, the conservative guard bands set to achieve this reliability create clock cycle slack and are detrimental to performance and energy efficiency. To combat this, we propose Aggressive Slack Recycling via Transparent Pipelines. Our proposal performs timing speculation while allowing data to flow asynchronously via transparent latches, between synchronous boundaries. This allows timing speculation to cater to the average slack across asynchronous operations rather than the slack of the most critical operation – maximizing slack conservation and timing speculation efficiency.

We design a slack tracking mechanism which runs in parallel with the transparent data path to estimate the accumulated slack across operation sequences. The mechanism then appropriately clocks synchronous boundaries early to minimize wasted slack and maximize clock cycle savings. We implement our proposal on a spatial fabric and achieve absolute speedups of up to 20% and relative improvements (vs. competing mechanisms) of up to 75%.
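
A toy model shows why accumulating slack across a sequence of operations can save cycles. It assumes each operation has a known delay and compares a fully synchronous pipeline, where every operation is rounded up to whole clock cycles, with a transparent chain, where only the total delay is rounded up. The function names and delay values are illustrative only, and the model ignores latch overheads and mis-speculation recovery.

```python
import math


def synchronous_cycles(delays: list[float], period: float) -> int:
    """Every operation ends at a clocked boundary, so each is rounded up."""
    return sum(math.ceil(d / period) for d in delays)


def transparent_cycles(delays: list[float], period: float) -> int:
    """Data flows through transparent latches; only the total is rounded up."""
    return math.ceil(sum(delays) / period)


def cycles_saved(delays: list[float], period: float) -> int:
    """Clock cycles recovered by recycling slack across the whole sequence."""
    return synchronous_cycles(delays, period) - transparent_cycles(delays, period)
```

For example, three chained operations taking 0.6, 0.7 and 0.5 of a clock period need three cycles when clocked individually but fit in two when their slack is pooled.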

Pareto-Optimal Power- and Cache-Aware Task Mapping for Many-Cores with Distributed Shared Last-Level Cache

  • Martin Rapp
  • Anuj Pathania
  • Jörg Henkel

Two factors primarily affect the performance of multi-threaded tasks on many-core processors with a shared but physically distributed Last-Level Cache (LLC): the power budget associated with a certain task mapping that aims to guarantee thermally safe operation, and the non-uniform LLC access latency of threads running on different cores. Spatially distributing threads across the many-core increases the power budget, but unfortunately also increases the associated LLC latency. On the other hand, mapping more threads to cores near the center of the many-core decreases the LLC latency, but unfortunately also decreases the power budget. Consequently, both metrics (LLC latency and power budget) cannot be simultaneously optimal, which leads to a Pareto-optimization that has not previously been exploited. We are the first to present a run-time task mapping algorithm, called PCMap, that exploits this trade-off. Our approach results in up to 8.6% reduction in the average task response time accompanied by a reduction of up to 8.5% in the energy consumption compared to the state-of-the-art.
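
The trade-off described above is a two-objective Pareto problem: LLC latency should be low and power budget should be high. The following sketch only shows how non-dominated candidate mappings could be filtered; it is not the PCMap algorithm, and the `Mapping` type and the numbers used in the usage note are hypothetical.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    name: str
    llc_latency: float   # lower is better
    power_budget: float  # higher is better


def dominates(a: Mapping, b: Mapping) -> bool:
    """a dominates b if it is no worse in both metrics and better in one."""
    no_worse = a.llc_latency <= b.llc_latency and a.power_budget >= b.power_budget
    better = a.llc_latency < b.llc_latency or a.power_budget > b.power_budget
    return no_worse and better


def pareto_front(candidates: list[Mapping]) -> list[Mapping]:
    """Keep only the mappings that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

Given made-up candidates such as a centered mapping (latency 10, budget 40), a spread mapping (latency 20, budget 80), an intermediate one (15, 60) and a poor one (20, 50), the first three form the front and the last is discarded because the spread mapping matches its latency with a larger budget.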

SPONGE: A Scalable Pivot-based On/Off Gating Engine for Reducing Static Power in NoC Routers

  • Hossein Farrokhbakht
  • Hadi Mardani Kamali
  • Natalie Enright Jerger
  • Shaahin Hessabi

Due to the high aggregate idle time of Network-on-Chip (NoC) routers in practical applications, power-gating techniques have been proposed to combat the ever-increasing share of static power. Nevertheless, sporadic packet arrivals compromise the effectiveness of power-gating by incurring significant latency and energy overhead. In this paper, we propose a Scalable Pivot-based On/Off Gating Engine (SPONGE), which efficiently manages power-gating decisions and the routing mechanism by adaptively selecting a small set of powered-on columns of routers and keeping the others in a power-gated state. To this end, a router architecture augmented with a novel routing algorithm is proposed in which a packet can traverse powered-off routers without waking them up, and can only turn at predetermined powered-on routers. Experimental results on SPLASH-2 benchmarks demonstrate that, compared to the conventional power-gating method, SPONGE on average not only improves static power consumption by 81.7% but also improves average packet latency by 63%.

SESSION: Machine Learning – Training

Taming the beast: Programming Peta-FLOP class Deep Learning Systems

  • Swagath Venkataramani
  • Vijayalakshmi Srinivasan
  • Jungwook Choi
  • Kailash Gopalakrishnan
  • Leland Chang

TrainWare: A Memory Optimized Weight Update Architecture for On-Device Convolutional Neural Network Training

  • Seungkyu Choi
  • Jaehyeong Sim
  • Myeonggu Kang
  • Lee-Sup Kim

Training convolutional neural networks on device has become essential, as it allows applications to take the user’s individual environment into account. Meanwhile, the weight update operation in the training process is the primary cause of high energy consumption due to its substantial memory accesses. We propose a dedicated weight update architecture with two key features: (1) a specialized local buffer for reducing DRAM accesses, and (2) a novel dataflow and a matching processing element array structure for weight gradient computation to optimize the energy consumed by internal memories. Our scheme achieves 14.3%-30.2% total energy reduction by drastically reducing memory accesses.

AxTrain: Hardware-Oriented Neural Network Training for Approximate Inference

  • Xin He
  • Liu Ke
  • Wenyan Lu
  • Guihai Yan
  • Xuan Zhang

The intrinsic error tolerance of neural network (NN) makes approximate computing a promising technique to improve the energy efficiency of NN inference. Conventional approximate computing focuses on balancing the efficiency-accuracy trade-off for existing pre-trained networks, which can lead to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented training framework to facilitate approximate computing for NN inference. Specifically, AxTrain leverages the synergy between two orthogonal methods—one actively searches for a network parameters distribution with high error tolerance, and the other passively learns resilient weights by numerically incorporating the noise distributions of the approximate hardware in the forward pass during the training phase. Experimental results from various datasets with near-threshold computing and approximation multiplication strategies demonstrate AxTrain’s ability to obtain resilient neural network parameters and system energy efficiency improvement.

Spin Orbit Torque Device based Stochastic Multi-bit Synapses for On-chip STDP Learning

  • Gyuseong Kang
  • Yunho Jang
  • Jongsun Park

As a large number of neurons and synapses are needed in spiking neural network (SNN) design, emerging devices have been employed to implement synapses and neurons. In this paper, we present a stochastic multi-bit spin orbit torque (SOT) memory based synapse, where only one SOT device is switched for potentiation and depression using a modified Gray code. The modified Gray code based approach needs only N devices to represent 2^N levels of synapse weights. An early read termination scheme is also adopted to reduce the power consumption of the training process by turning off less associated neurons and their ADCs. For the MNIST dataset, with comparable classification accuracy, the proposed SNN architecture using a 3-bit synapse achieves 68.7% reduction of ADC overhead compared to the conventional 8-level synapse.
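
The property that only one device switches per weight step is what a Gray code provides: adjacent levels differ in exactly one bit, so N binary devices can step through 2^N levels one flip at a time. The sketch below uses the standard binary-reflected Gray code as a stand-in, since the abstract does not spell out the authors' modified code; the function names are made up for illustration.

```python
def gray(level: int) -> int:
    """Binary-reflected Gray code of a non-negative integer level."""
    return level ^ (level >> 1)


def devices_switched(n_bits: int) -> list[int]:
    """Number of bits (devices) that flip between each pair of adjacent levels."""
    codes = [gray(i) for i in range(2 ** n_bits)]
    return [bin(a ^ b).count("1") for a, b in zip(codes, codes[1:])]


def level_table(n_bits: int) -> list[str]:
    """Device states for every weight level, as fixed-width bit strings."""
    return [format(gray(i), f"0{n_bits}b") for i in range(2 ** n_bits)]
```

With a plain binary encoding, stepping from level 3 to level 4 in a 3-bit synapse would flip all three devices (011 to 100); with the Gray encoding the same step flips one (010 to 110).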

SESSION: Non-volatile Memory

Enabling Intra-Plane Parallel Block Erase in NAND Flash to Alleviate the Impact of Garbage Collection

  • Tyler Garrett
  • Jun Yang
  • Youtao Zhang

Garbage collection (GC) in NAND flash can significantly decrease I/O performance in SSDs by copying valid data to other locations, thus blocking incoming I/O requests. To help improve performance, NAND flash utilizes various advanced commands to increase internal parallelism. Currently, these commands only parallelize operations across channels, chips, dies, and planes, neglecting the block level due to risk of disturbances that can compromise valid data by inducing errors. However, due to the triple-well structure of the NAND flash plane architecture, it is possible to erase multiple blocks within a plane, in parallel, without diminishing the integrity of the valid data. The number of page movements due to multiple block erases can be restrained so as to bound the overhead per GC. Moreover, more capacity can be reclaimed per GC which delays future GCs and effectively reduces their frequency. Such an Intra-Plane Parallel Block Erase (IPPBE) in turn diminishes the impact of GC on incoming requests, improving their response times. Experimental results show that IPPBE can reduce the time spent performing GC by up to 50.7% and 33.6% on average, read/write response time by up to 47.0%/45.4% and 16.5%/14.8% on average respectively, page movements by up to 52.2% and 26.6% on average, and blocks erased by up to 14.2% and 3.6% on average. An energy analysis conducted indicates that by reducing the number of page copies and the number of block erases, the energy cost of garbage collection can be reduced up to 44.1% and 19.3% on average.

Enhancing the Energy Efficiency of Journaling File System via Exploiting Multi-Write Modes on MLC NVRAM

  • Shuo-Han Chen
  • Yuan-Hao Chang
  • Tseng-Yi Chen
  • Yu-Ming Chang
  • Pei-Wen Hsiao
  • Hsin-Wen Wei
  • Wei-Kuan Shih

Non-volatile random-access memory (NVRAM) is regarded as a great alternative storage medium owing to its attractive features, including low idle energy consumption, byte addressability, and short read/write latency. In addition, multi-level-cell (MLC) NVRAM has also been proposed to provide higher bit density. However, MLC NVRAM has lower energy efficiency and longer write latency when compared with single-level-cell (SLC) NVRAM. These drawbacks could lead to higher energy consumption of MLC NVRAM-based storage systems. The energy consumption is magnified by existing journaling file systems (JFS) on MLC NVRAM-based storage devices due to the JFS’s fail-safe policy of writing the same data twice. Such observations motivate us to propose a multi-write-mode journaling file system (mwJFS) to alleviate the drawbacks of MLC NVRAM and lower the energy consumption of MLC NVRAM-based JFS. The proposed mwJFS differentiates the data retention requirement of journaled data and applies different write modes to enhance the energy efficiency with better access performance. A series of experiments was conducted to demonstrate the capability of mwJFS on an MLC NVRAM-based storage system.

Computing in memory with FeFETs

  • Dayane Reis
  • Michael Niemier
  • X. Sharon Hu

Data transfer between a processor and memory frequently represents a bottleneck with respect to improving application-level performance. Computing in memory (CiM), where logic and arithmetic operations are performed in memory, could significantly reduce both energy consumption and computational overheads associated with data transfer. Compact, low-power, and fast CiM designs could ultimately lead to improved application-level performance. This paper introduces a CiM architecture based on ferroelectric field effect transistors (FeFETs). The CiM design can serve as a general purpose, random access memory (RAM), and can also perform Boolean operations ((N)AND, (N)OR, X(N)OR, INV) as well as addition (ADD) between words in memory. Unlike existing CiM designs based on other emerging technologies, FeFET-CiM accomplishes the aforementioned operations via a single current reference in the sense amplifier, which leads to more compact designs and lower power. Furthermore, the high Ion/Ioff ratio of FeFETs enables an inexpensive voltage-based sense scheme. Simulation-based case studies suggest that our FeFET-CiM can achieve speed-ups (and energy reduction) of ~119X (~1.6X) and ~1.97X (~1.5X) over ReRAM and STT-RAM CiM designs with respect to in-memory addition of 32-bit words. Furthermore, our approach offers an average speedup of ~2.5X and energy reduction of ~1.7X when compared to a conventional (not in-memory) approach across a wide range of benchmarks.

Information Leakage Attacks on Emerging Non-Volatile Memory and Countermeasures

  • Mohammad Nasim Imtiaz Khan
  • Swaroop Ghosh

Emerging Non-Volatile Memories (NVMs) suffer from high and asymmetric read/write current and long write latency, which can result in supply noise such as supply voltage droop and ground bounce. The magnitude of supply noise depends on the old data and the new data that is being written (for a write operation) or on the stored data (for a read operation). Therefore, a victim’s write operation creates supply noise which propagates to the adversary’s memory space. The adversary can detect the victim’s write initiation and can leverage faster read latency (compared to write) to further sense the Hamming Weight (HW) of the victim’s write data by detecting read failures in his memory space. These attacks are specifically possible if exhaustive testing of the memory for all patterns, all possible location combinations, and all possible parallel read/write conditions is not performed under bit-to-bit process variations and specified (-10°C to 90°C) and unspecified temperature ranges (i.e., less than -10°C and greater than 90°C). Simulation results indicate that the adversary can sense HW of victim’s (near-by) write data = 66.77%, and further narrow the range based on read/write failure characteristics. Side Channel Attacks can utilize this information to strengthen the attacks.

SESSION: Energy-efficient Parallelism

Load-Triggered Warp Approximation on GPU

  • Zhenhong Liu
  • Daniel Wong
  • Nam Sung Kim

Value similarity of operands across warps has been exploited to improve energy efficiency of GPUs. Prior work, however, incurs significant overheads to check value similarity for every instruction and does not improve performance as it does not reduce the number of executed instructions. This work proposes Lock ‘n Load (LnL), which triggers approximate execution of code regions by only checking similarity of values returned from load instructions and fuses multiple approximated warps into a single warp.

GAS: A Heterogeneous Memory Architecture for Graph Processing

  • Minxuan Zhou
  • Mohsen Imani
  • Saransh Gupta
  • Tajana Rosing

Graph processing has become important for various applications in today’s big data era. However, most graph processing applications suffer from large memory overhead due to random memory accesses. Such a random memory access pattern provides little temporal and spatial locality and cannot be accelerated by the conventional hierarchical memory system. In this work, we propose GAS, a heterogeneous memory architecture, to accelerate graph applications implemented in the message-based vertex program model, which is widely used in various graph processing systems. GAS utilizes specialized content-addressable memory (CAM) to store random data, and determines exact access patterns by a series of associative searches. Thus, GAS not only removes the inefficiency of random accesses but also reduces the memory access latency by accurate prefetching. We test the efficiency of GAS with three important graph processing kernels on five well-known graphs. Our experimental results show that GAS can significantly reduce cache miss rate and improve the bandwidth utilization as compared to a conventional system with a state-of-the-art graph-specific prefetching mechanism. These enhancements result in 34% and 27% reduction in energy consumption and execution time, respectively.

ACE-GPU: Tackling Choke Point Induced Performance Bottlenecks in a Near-Threshold Computing GPU

  • Tahmoures Shabanian
  • Aatreyi Bal
  • Prabal Basu
  • Koushik Chakraborty
  • Sanghamitra Roy

The proliferation of multicore devices with a strict thermal budget has aided research in Near-Threshold Computing (NTC). However, the operation of a Graphics Processing Unit (GPU) in the NTC region has still remained recondite. In this work, we explore an important reliability predicament of NTC, called choke points, that severely throttles the performance of GPUs. Employing a cross-layer methodology, we demonstrate the potency of choke points in inducing timing errors in a GPU operating in the NTC region. We propose a holistic circuit-architectural solution that promotes an energy-efficient NTC-GPU design paradigm by gracefully tackling the choke point induced timing errors. Our proposed scheme offers 3.18x and 88.5% improvements in NTC-GPU performance and energy delay product, respectively, over a state-of-the-art timing error mitigation technique, with marginal area and power overheads.

SESSION: Self-powered Devices

HomeRun: HW/SW Co-Design for Program Atomicity on Self-Powered Intermittent Systems

  • Chih-Kai Kang
  • Chun-Han Lin
  • Pi-Cheng Hsiu
  • Ming-Syan Chen

Self-powered intermittent systems featuring nonvolatile processors (NVPs) allow for accumulative execution in unstable power environments. However, frequent power failures may cause incorrect NVP execution results due to invalid data generated intermittently. This paper presents a HW/SW co-design, called HomeRun, to guarantee atomicity by ensuring that an uninterruptible program section can be run through at one execution. We design a HW module to ensure that a power pulse is sufficient for an atomic section, and develop a SW mechanism for programmers to protect atomic sections. The proposed design is validated through the development of a prototype pattern locking system. Experimental results demonstrate that the proposed design can completely guarantee atomicity and significantly improve the energy utilization of self-powered intermittent systems.

EcoMicro: A Miniature Self-Powered Inertial Sensor Node Based on Bluetooth Low Energy

  • Cheng-Ting Lee
  • Yun-Hao Liang
  • Pai H. Chou
  • Ali Heydari Gorji
  • Seyede Mahya Safavi
  • Wen-Chan Shih
  • Wen-Tsuen Chen

This paper describes EcoMicro, a miniature, self-powered, wireless inertial-sensing node in a volume of 8 x 13 x 9.5 mm³, including energy storage and solar cells. It is smaller than existing systems with similar functionality while retaining rich functionality and efficiency. It is capable of measuring motion using an inertial measurement unit (IMU) and communicating over the Bluetooth Low Energy (BLE) protocol. It is self-powered by miniature solar cells and can perform maximum power point tracking (MPPT). Its integrated energy-storage device combines the longevity and power density of supercapacitors with the relatively flat discharge curve of batteries. Our power-ground gating circuit minimizes leakage current during sleep mode and is used in conjunction with the real-time clock for duty cycling. Experimental results show EcoMicro to be operational and efficient for a class of wireless sensing applications.

Dual Mode Ferroelectric Transistor based Non-Volatile Flip-Flops for Intermittently-Powered Systems

  • S. K. Thirumala
  • A. Raha
  • H. Jayakumar
  • K. Ma
  • V. Narayanan
  • V. Raghunathan
  • S. K. Gupta

In this work, we propose dual mode ferroelectric transistors (D-FEFETs) that exhibit dynamic tuning of operation between volatile and non-volatile modes with the help of a control signal. We utilize the unique features of D-FEFET to design two variants of non-volatile flip-flops (NVFFs). In both designs, D-FEFETs are operated in the volatile mode for normal operations and in the non-volatile mode to back up the state of the flip-flop during a power outage. The first design comprises a truly embedded non-volatile element (D-FEFET) which enables a fully automatic backup operation. In the second design, we introduce need-based backup, which lowers energy during normal operation at the cost of area with respect to the first design. Compared to a previously proposed FEFET based NVFF, the first design achieves 19% area reduction along with 96% lower backup energy and 9% lower restore energy, but at 14%-35% larger operation energy. The second design shows 11% lower area, 21% lower backup energy, 16% decrease in backup delay and similar operation energy but with a penalty of 17% and 19% in the restore energy and delay, respectively. System-level analysis of the proposed NVFFs in the context of a state-of-the-art intermittently-powered system using real benchmarks yielded 5%-33% energy savings.

SESSION: Design and 3D Integration

Multiple Combined Write-Read Peripheral Assists in 6T FinFET SRAMs for Low-VMIN IoT and Cognitive Applications

  • Arijit Banerjee
  • Sumanth Kamineni
  • Benton H. Calhoun

Battery-operated or energy-harvested IoT and cognitive SoCs in modern FinFET processes prefer the use of low-VMIN SRAMs for ultra-low power (ULP) operations. However, the 1:1:1 high-density (HD) FinFET 6T bitcell faces challenges in achieving a lower VMIN across process variation. The 6T bitcell VMIN improves either by increasing the size of the bitcell or by using combinations of peripheral assists (PAs), since a single PA cannot achieve the best VMIN across process variation. State-of-the-art works show some combinations of write and read PAs that lower the VMIN of 6T FinFET SRAMs. However, the better combinations of PAs for 14nm HD 6T FinFET SRAMs are unknown. This work compares all the possible dual combinations of PAs and reveals the better ones. We show that in a usual column mux scenario the combination of negative bitline with VDD boosting and VDD collapse with VDD boosting in a proportion of 14% and 6% (total 20%), respectively, maximizes the static VMIN improvement, close to 191mV, for ULP IoT and cognitive applications. We also show that a combination of wordline boosting with negative bitline and wordline boosting with VSS lowering achieves 150mV and 25mV of dynamic VMIN improvement at the 5GHz frequency for the worst-case write and read corners, respectively, beating other combinations.

Road to High-Performance 3D ICs: Performance Optimization Methodologies for Monolithic 3D ICs

  • Kyungwook Chang
  • Sai Pentapati
  • Da Eun Shim
  • Sung Kyu Lim

As we approach the limits of 2D device scaling, monolithic 3D IC (M3D) has emerged as a potential solution offering performance and power benefits. Although various studies have been done to increase power savings of M3D designs, efforts to improve their performance are rarely made. In this paper, we, for the first time, perform in-depth analysis of the factors that affect the performance of M3D, and present methodologies to improve the performance. Our methodologies outperform the state-of-the-art M3D design flow by offering 15.6% performance improvement and 16.2% energy-delay product (EDP) benefit over 2D designs.

A Monolithic-3D SRAM Design with Enhanced Robustness and In-Memory Computation Support

  • Srivatsa Srinivasa
  • Akshay Krishna Ramanathan
  • Xueqing Li
  • Wei-Hao Chen
  • Fu-Kuo Hsueh
  • Chih-Chao Yang
  • Chang-Hong Shen
  • Jia-Min Shieh
  • Sumeet Gupta
  • Meng-Fan Marvin Chang
  • Swaroop Ghosh
  • Jack Sampson
  • Vijaykrishnan Narayanan

We present a novel 3D-SRAM cell using a Monolithic 3D integration (M3D-IC) technology for realizing both robustness and In-memory Boolean logic compute support. The proposed two-layer design makes use of additional transistors over the SRAM layer to enable assist techniques as well as provide logic functions (such as AND/NAND, OR/NOR, XNOR/XOR) without degrading cell density. Through analysis, we provide insights into the benefits provided by three memory assist and two logic modes and evaluate the energy efficiency of our proposed design. Assist techniques improve SRAM read stability by 2.2x and increase the write margin by 17.6%, while staying within the SRAM footprint. By virtue of increased robustness, the cell enables seamless operation at lower supply voltages and thereby ensures energy efficiency. Energy Delay Product (EDP) reduces by 1.6x over standard 6T SRAM with a faster data access. Transistor placement and their biasing technique in layer-2 enables In-memory bitwise Boolean computation. When computing bulk In-memory operations, 6.5x energy savings is achieved as compared to computing outside the memory system.

SESSION: Industry ML/AI Compute

Across the Stack Opportunities for Deep Learning Acceleration

  • Vijayalakshmi Srinivasan
  • Bruce Fleischer
  • Sunil Shukla
  • Matthew Ziegler
  • Joel Silberman
  • Jinwook Oh
  • Jungwook Choi
  • Silvia Mueller
  • Ankur Agrawal
  • Tina Babinsky
  • Nianzheng Cao
  • Chia-Yu Chen
  • Pierce Chuang
  • Thomas Fox
  • George Gristede
  • Michael Guillorn
  • Howard Haynie
  • Michael Klaiber
  • Dongsoo Lee
  • Shih-Hsien Lo
  • Gary Maier
  • Michael Scheuermann
  • Swagath Venkataramani
  • Christos Vezyrtzis
  • Naigang Wang
  • Fanchieh Yee
  • Ching Zhou
  • Pong-Fei Lu
  • Brian Curran
  • Leland Chang
  • Kailash Gopalakrishnan

The combination of growth in compute capabilities and availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of 100s of ExaOps of computation; posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers.

One of the key enablers to improve operational efficiency of DNNs is the observation that when extracting deep insight from vast quantities of structured and unstructured data the exactness imposed by traditional computing is not required. Relaxing the “exactness” constraint enables exploiting opportunities for approximate computing across all layers of the system stack.

In this talk we present a multi-TOPS AI core [3] for acceleration of deep learning training and inference in systems from edge devices to data centers. We demonstrate that to derive high sustained utilization and energy efficiency from the AI core requires ground-up re-thinking to exploit approximate computing across the stack including algorithms, architecture, programmability, and hardware.

Model accuracy is the fundamental measure of deep learning quality. The compute engine precision in our AI core is carefully calibrated to realize significant reduction in area and power while not compromising numerical accuracy. Our research at the DL algorithms/applications-level [2] shows that it is possible to carefully tune the precision of both weights and activations to as low as 2 bits for inference, and was used to guide the choices of compute precision supported in the architecture and hardware for both training and inference. Similarly, distributed DL training’s scalability is impacted by the communication overhead to exchange gradients and weights after each mini-batch. Our research on gradient compression [1] shows that by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient, we achieve a compression ratio of 40X for convolutional layers, and up to 200X for fully-connected layers of the network without losing model accuracy. These results guide the choice of interconnection network topology exploration for a system of accelerators built using the AI core.

Overall, our work shows how the benefits from exploiting approximation using the algorithm/application’s robustness to tolerate reduced precision and compressed data communication can be combined effectively with the architecture and hardware of the accelerator designed to support this reduced-precision computation and compressed data communication. Our results demonstrate improved end-to-end efficiency of the DL accelerator across different metrics such as high sustained TOPs, high TOPs/watt and TOPs/mm² catering to different operating environments for both training and inference.
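The threshold-based gradient selection mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' scheme: the threshold here is a fixed, hypothetical value rather than one chosen by gradient importance, the ratio counts values only and ignores index overhead, and the names `sparsify` and `compression_ratio` are invented for the example.

```python
def sparsify(gradients, threshold):
    """Keep only gradients whose magnitude exceeds the threshold.

    Returns (index, value) pairs so a receiver could place them back.
    """
    return [(i, g) for i, g in enumerate(gradients) if abs(g) > threshold]


def compression_ratio(gradients, sent):
    """Ratio of original value count to transmitted value count."""
    return len(gradients) / max(len(sent), 1)


# Hypothetical gradient vector and threshold, for illustration only.
grads = [0.01, -0.5, 0.02, 0.0, 0.9, -0.03, 0.04, -0.01]
sent = sparsify(grads, threshold=0.1)
ratio = compression_ratio(grads, sent)

print(sent)
print(ratio)
```

In this toy vector, two of eight gradients exceed the threshold, so only those two values would be exchanged after the mini-batch.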

SESSION: Mobile Applications

App-Oriented Thermal Management of Mobile Devices

  • Jihoon Park
  • Seokjun Lee
  • Hojung Cha

The thermal issue for mobile devices becomes critical as the devices’ performance increases to handle complicated applications. Conventional thermal management limits the performance of the entire device, degrading the quality of both foreground and background applications. This is not desirable because the quality of the foreground application, i.e., the frames per second (FPS), is directly affected, whereas users are generally not aware of the performance of background applications. In this paper, we propose an app-oriented thermal management scheme that specifically restricts background applications to preserve the FPS of foreground applications. For efficient thermal management, we developed a model that predicts the heat contribution of individual applications based on hardware utilization. The proposed system gradually limits system resources for each background application according to its heat contribution. The scheme was implemented on a Galaxy S8+ smartphone, and its usefulness was validated with a thorough evaluation.

DiReCt: Resource-Aware Dynamic Model Reconfiguration for Convolutional Neural Network in Mobile Systems

  • Zirui Xu
  • Zhuwei Qin
  • Fuxun Yu
  • Chenchen Liu
  • Xiang Chen

Although Convolutional Neural Networks (CNNs) have been widely applied in various applications, their deployment in resource-constrained mobile systems remains a significant concern. To overcome the computation resource constraints, such as limited memory and energy capacity, many works have been proposed for mobile CNN optimization. However, most of them lack a comprehensive modeling analysis of the CNN computation consumption and merely focus on static optimization schemes regardless of different mobile computation scenarios. In this work, we propose DiReCt, a resource-aware CNN reconfiguration system. Leveraging accurate CNN computation consumption modeling and mobile resource constraint analysis, DiReCt can reconfigure a CNN with different accuracy and resource consumption levels to adapt to various mobile computation scenarios. The experiment results show that the proposed computation consumption models in DiReCt can estimate the CNN computation consumption with 94.1% accuracy, and DiReCt achieves at most 34.9% computation acceleration, 52.7% memory reduction, and 27.1% energy saving. Eventually, DiReCt can effectively adapt CNNs to dynamic mobile usage scenarios for optimal performance.

POSTER SESSION: Posters

A Low-power [email protected] H.265/HEVC Video Encoder for Smart Video Surveillance

  • Ke Xu
  • Yu Li
  • Bo Huang
  • Xiangkai Liu
  • Hong Wang
  • Zhuoyan Wu
  • Zhanpeng Yan
  • Xueying Tu
  • Tongqing Wu
  • Daibing Zeng

This paper presents the design and VLSI implementation of a low-power HEVC main profile encoder, which is able to process up to [email protected] 4:2:0 encoding in real-time with a five-stage pipeline architecture. A pyramid ME (Motion Estimation) engine is employed to reduce search complexity. To compensate for video sequences with fast moving objects, GME (Global Motion Estimation) is introduced to alleviate the effect of limited search range. We also implement an alternative 5×5 search along with 3×3 to boost video quality. For intra mode decision, original pixels, instead of reconstructed ones, are used to reduce pipeline stall. The encoder supports DVFS (Dynamic Voltage and Frequency Scaling) and features three operating modes, which helps to reduce power consumption by 25%. Scalable quality, which trades encoding quality for power by reducing the size of the search range and the number of intra prediction candidates, achieves 11.4% power reduction with 3.5% quality degradation. Furthermore, a lossless frame buffer compression is proposed which reduces DDR bandwidth by 49.1% and power consumption by 13.6%. The entire video surveillance SoC is fabricated with TSMC 28nm technology with 1.96 mm² area. It consumes 2.88M logic gates and 117KB SRAM. The measured power consumption is 103mW at 350MHz for 4K encoding with high-quality mode. The 0.39nJ/pixel energy efficiency of this work, which achieves 42% ~ 97% power reduction as compared with reference designs, makes it ideal for real-time low-power smart video surveillance applications.

Breaking POps/J Barrier with Analog Multiplier Circuits Based on Nonvolatile Memories

  • M. Reza Mahmoodi
  • Dmitri Strukov

Low-to-medium resolution analog vector-by-matrix multipliers (VMMs) offer remarkable energy/area efficiency as compared to their digital counterparts. Still, the maximum attainable performance in analog VMMs is often bounded by the overhead of the peripheral circuits. The main contribution of this paper is the design of novel sensing circuitry which improves the energy efficiency and density of analog multipliers. The proposed circuit is based on a translinear Gilbert cell, which is topologically combined with a floating nonlinear resistor and a low-gain amplifier. Several compensation techniques are employed to ensure reliability with respect to process, temperature, and supply voltage variations. As a case study, we consider implementation of couple-gate current-mode VMM with embedded split-gate NOR flash memory. Our simulation results show that a 4-bit 100×100 VMM circuit designed in 55 nm CMOS technology achieves the record-breaking performance of 3.63 POps/J.

Efficient Image Sensor Subsampling for DNN-Based Image Classification

  • Jia Guo
  • Hongxiang Gu
  • Miodrag Potkonjak

Today’s mobile devices are equipped with cameras capable of taking very high-resolution pictures. For computer vision tasks which require relatively low resolution, such as image classification, subsampling is desired to reduce the unnecessary power consumption of the image sensor. In this paper, we study the relationship between subsampling and the performance degradation of image classifiers that are based on deep neural networks (DNNs). We empirically show that subsampling with the same step size leads to very similar accuracy changes for different classifiers. In particular, we could achieve over 15x energy savings just by subsampling while suffering almost no accuracy loss. For even better energy-accuracy trade-offs, we propose AdaSkip, where the row sampling resolution is adaptively changed based on the image gradient. We implement AdaSkip on an FPGA and report its energy consumption.
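Fixed-step subsampling can be sketched in a few lines. The sketch below assumes an image stored as a list of pixel rows and keeps every `step`-th row and column. The function name `subsample` and the 8×8 test image are hypothetical, and the printed figure is a reduction in pixel count only; how that maps to sensor energy is not modeled here. The adaptive row selection of AdaSkip is not reproduced.

```python
def subsample(image, step):
    """Keep every `step`-th row and every `step`-th pixel within each kept row."""
    return [row[::step] for row in image[::step]]


# Hypothetical 8x8 image whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
small = subsample(image, 2)

# Factor by which the number of pixels read out is reduced.
pixel_reduction = (len(image) * len(image[0])) / (len(small) * len(small[0]))

print(small[0])
print(pixel_reduction)
```

With a step of 2 in both dimensions the pixel count drops by a factor of 4; in general a uniform step `s` reads out roughly 1/s² of the pixels.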

Input-Splitting of Large Neural Networks for Power-Efficient Accelerator with Resistive Crossbar Memory Array

  • Yulhwa Kim
  • Hyungjun Kim
  • Daehyun Ahn
  • Jae-Joon Kim

Resistive Crossbar memory Arrays (RCA) have been gaining interest as a promising platform to implement Convolutional Neural Networks (CNN). One of the major challenges in RCA-based design is that the number of rows in an RCA is often smaller than the number of input neurons in a layer. Previous works used high-resolution Analog-to-Digital Converters (ADCs) to compute the partial weighted sum in each array and merged partial sums from multiple arrays outside the RCAs. However, such an approach suffers from significant power consumption due to the need for high-resolution ADCs. In this paper, we propose a methodology to more efficiently construct a large CNN with multiple RCAs. By splitting the input feature map and retraining the CNN with proper initialization, we demonstrate that any CNN model can be represented with multiple arrays without using intermediate partial sums. The experimental results show that the ADC power of the proposed design is 32x smaller and the total chip power of the proposed design is 3x smaller than those of the baseline design.

Design Optimization of 3D Multi-Processor System-on-Chip with Integrated Flow Cell Arrays

  • Artem Andreev
  • Fulya Kaplan
  • Marina Zapater
  • Ayse K. Coskun
  • David Atienza

Integrated flow cell array (FCA) is an emerging technology, targeting the cooling and power delivery challenges of modern 2D/3D Multi-Processor Systems-on-Chip (MPSoCs). In FCA, electrolytic solutions are pumped through microchannels etched in the silicon of the chips, removing heat from the system, while, at the same time, generating power on-chip. In this work, we explore the impact of FCA system design on various 3D architectures and propose a methodology to optimize a 3D MPSoC with integrated FCA to run a given workload in the most energy-efficient way. Our results show that an optimized configuration can save up to 50% energy with respect to sub-optimal 3D MPSoC configurations.

Multi-Pattern Active Cell Balancing Architecture and Equalization Strategy for Battery Packs

  • Swaminathan Narayanaswamy
  • Sangyoung Park
  • Sebastian Steinhorst
  • Samarjit Chakraborty

Active cell balancing is the process of improving the usable capacity of a series-connected Lithium-Ion (Li-Ion) battery pack by redistributing the charge levels of individual cells. Depending upon the State-of-Charge (SoC) distribution of the individual cells in the pack, an appropriate charge transfer pattern (cell-to-cell, cell-to-module, module-to-cell or module-to-module) has to be selected for improving the usable energy of the battery pack. However, existing active cell balancing circuits are only capable of performing a limited number of charge transfer patterns and, therefore, have a reduced energy efficiency for different types of SoC distribution. In this paper, we propose a modular, multi-pattern active cell balancing architecture that is capable of performing multiple types of charge transfer patterns (cell-to-cell, cell-to-module, module-to-cell and module-to-module) with a reduced number of hardware components and control signals compared to existing solutions. We derive a closed-form, analytical model of our proposed balancing architecture with which we profile the efficiency of the individual charge transfer patterns enabled by our architecture. Using the profiling analysis, we propose a hybrid charge equalization strategy that automatically selects the most energy-efficient charge transfer pattern depending upon the SoC distribution of the battery pack and the characteristics of our proposed balancing architecture. Case studies show that our proposed balancing architecture and hybrid charge equalization strategy provide up to 46.83% improvement in energy efficiency compared to existing solutions.

Intrinsic and Database-free Watermarking in ICs by Exploiting Process and Design Dependent Variability in Metal-Oxide-Metal Capacitances

  • Ahish Shylendra
  • Swarup Bhunia
  • Amit Ranjan Trivedi

Authentication of integrated circuits (ICs) to verify their integrity has emerged as a critical need to address increasing concerns associated with counterfeit ICs in the supply chain. In this paper, a novel SAR-ADC based intrinsic and database-free authentication scheme is proposed. The proposed technique utilizes mismatch in the back end of line (BEOL) capacitors used in a charge-redistribution SAR ADC to generate an authentication signature. BEOL metal-oxide-metal (MOM) capacitors form a reliable source of process variation information and are less sensitive to aging and temperature induced variations. Line edge roughness (LER) is the primary source of mismatch in BEOL capacitors and thus, capacitor mismatch variation has been analyzed in terms of LER and geometric parameters. The resource overhead incurred by the proposed modifications to the ADC architecture to incorporate authentication ability is minimal, and existing on-chip calibration circuitry is used to extract the signature. The proposed technique does not require a sophisticated test setup, thereby simplifying the authentication procedure.

Scheduling of Hybrid Battery-Supercapacitor Control Instructions for Longevity in Systems with Power Gating

  • Sumanta Pyne

The in-rush current due to wake-up of power gating (PG) components causes faster discharge of the battery. This work introduces an instruction controlled hybrid battery-supercapacitor (B-SC) system for longer battery life in systems with instruction controlled PG. Two instructions have been introduced along with architectural support. The first instruction disconnects the battery from the PG components if the charge in the supercapacitor is greater than or equal to the charge required by wake-up of PG components. The other instruction connects the battery to the PG components for recharging the supercapacitor. Disconnecting the battery during wake-up minimizes rate capacity effect (C-rate) for longer battery life. An algorithm is designed to schedule the proposed battery control instructions within a program having PG instructions. The efficacy of the proposed method is evaluated on MiBench and MediaBench benchmark programs. The proposed method reduces C-rate by an average of 14.25% at the cost of an average performance loss of 6.87%.
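The condition checked by the first instruction can be written as a small decision function. This is a sketch of the comparison described in the abstract only. The function name `wakeup_source` and the charge values are hypothetical, and the scheduling algorithm that places the instructions in a program is not modeled.

```python
def wakeup_source(supercap_charge: float, wakeup_charge: float) -> str:
    """Choose which source supplies the wake-up of the power-gated components.

    The battery is disconnected only when the supercapacitor holds at least
    the charge that the wake-up requires.
    """
    if supercap_charge >= wakeup_charge:
        return "supercapacitor"
    return "battery"


# Hypothetical charge values in arbitrary units.
print(wakeup_source(5.0, 3.0))
print(wakeup_source(2.0, 3.0))
```

When the supercapacitor cannot cover the wake-up, the battery stays connected, which corresponds to the state in which the second instruction lets the supercapacitor recharge.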

Better-Than-Worst-Case Design Methodology for a Compact Integrated Switched-Capacitor DC-DC Converter

  • Dongkwun Kim
  • Mingoo Seok

We suggest a new methodology for co-designing an integrated switched-capacitor (SC) converter and a digital load. Conventionally, a load has been specified by the minimum supply voltage and the maximum power dissipation, each found at its own worst-case process, workload, and environment condition. Furthermore, in designing an SC DC-DC converter toward this worst-case load specification, designers have often added another separate pessimistic assumption on the power-switch resistance and flying-capacitor density of the SC converter. Such a worst-case design methodology can lead to a significantly over-sized flying capacitor and thereby limit on-chip integration of a converter. Our proposed methodology instead adopts the better-than-worst-case (BTWC) perspective to avoid over-design and thus optimizes the area of an SC converter. Specifically, we propose BTWC load modeling where we specify non-pessimistic sets of supply voltage requirement and load power dissipation across variations. In addition, by considering coupled variations between the SC converter and the load integrated in the same die, our methodology can further reduce the pessimism in power-switch resistance and capacitor density. The proposed co-design methodology is verified with a 2:1 SC converter and a digital load in a 65 nm process. The resulting converter achieves more than one order of magnitude reduction in the flying capacitor size as compared to the conventional worst-case design while maintaining the target conversion efficiency and target throughput. We also verified our methodology with a wide range of load characteristics in terms of their supply voltages and current draw and confirmed similar benefits.

Dynamic Bit-width Reconfiguration for Energy-Efficient Deep Learning Hardware

  • Daniele Jahier Pagliari
  • Enrico Macii
  • Massimo Poncino

Deep learning models have reached state-of-the-art performance in many machine learning tasks. Benefits in terms of energy, bandwidth, latency, etc., can be obtained by evaluating these models directly within Internet of Things end nodes, rather than in the cloud. This calls for implementations of deep learning tasks that can run in resource-limited environments with low energy footprints. Research and industry have recently investigated these aspects, coming up with specialized hardware accelerators for low-power deep learning. One effective technique adopted in these devices consists in reducing the bit-width of calculations, exploiting the error resilience of deep learning. However, bit-widths are typically set statically for a given model, regardless of input data. Unless models are retrained, this solution invariably sacrifices accuracy for energy efficiency.

In this paper, we propose a new approach for implementing input-dependent dynamic bit-width reconfiguration in deep learning accelerators. Our method is based on a fully automatic characterization phase, and can be applied to popular models without retraining. Using the energy data from a real deep learning accelerator chip, we show that 50% energy reduction can be achieved with respect to a static bit-width selection, with less than 1% accuracy loss.

Deploying Customized Data Representation and Approximate Computing in Machine Learning Applications

  • Mahdi Nazemi
  • Massoud Pedram

Major advancements in building general-purpose and customized hardware have been one of the key enablers of versatility and pervasiveness of machine learning models such as deep neural networks. To sustain this ubiquitous deployment of machine learning models and cope with their computational and storage complexity, several solutions such as low-precision representation of model parameters using fixed-point representation and deploying approximate arithmetic operations have been employed. Studying the potency of such solutions in different applications requires integrating them into existing machine learning frameworks for high-level simulations as well as implementing them in hardware to analyze their effects on power/energy dissipation, throughput, and chip area. Lop is a library for design space exploration that bridges the gap between machine learning and efficient hardware realization. It comprises a Python module, which can be integrated with some of the existing machine learning frameworks and implements various customizable data representations including fixed-point and floating-point as well as approximate arithmetic operations. Furthermore, it includes a highly-parameterized Scala module, which allows synthesizing hardware based on the said data representations and arithmetic operations. Lop allows researchers and designers to quickly compare quality of their models using various data representations and arithmetic operations in Python and contrast the hardware cost of viable representations by synthesizing them on their target platforms (e.g., FPGA or ASIC). To the best of our knowledge, Lop is the first library that allows both software simulation and hardware realization using customized data representations and approximate computing techniques.
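
To make the notion of a customizable fixed-point representation concrete, the sketch below rounds a real value to a signed fixed-point format with saturation. It does not use or reproduce the Lop API; the function name `quantize_fixed` and its parameters are hypothetical and only illustrate the kind of data representation the abstract refers to.

```python
def quantize_fixed(x, total_bits, frac_bits):
    """Round x to a signed fixed-point value with saturation (illustrative only).

    total_bits includes the sign bit; frac_bits is the number of fractional bits.
    """
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))        # most negative integer code
    hi = (1 << (total_bits - 1)) - 1     # most positive integer code
    # Python's round() rounds exact halves to the nearest even integer.
    code = max(lo, min(hi, round(x * scale)))
    return code / scale
```

With 8 total bits and 4 fractional bits, `quantize_fixed(0.3, 8, 4)` returns 0.3125, and values outside the representable range saturate at -8.0 and 7.9375.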

Battery-Aware Energy Model of Drone Delivery Tasks

  • Donkyu Baek
  • Yukai Chen
  • Alberto Bocca
  • Alberto Macii
  • Enrico Macii
  • Massimo Poncino

Drones are becoming increasingly popular in the commercial market for various package delivery services. In this scenario, the most widely adopted drones are quad-rotors (i.e., quadcopters). The energy consumed by a drone may become an issue, since it may affect (i) the delivery deadline (quality of service), (ii) the number of packages that can be delivered (throughput) and (iii) the battery lifetime (number of recharging cycles). It is thus fundamental to find the proper compromise between the energy used to complete the delivery and the speed at which the quadcopter flies to reach the destination. In order to achieve this, we have to consider that the energy required by the drone for completing a given delivery task does not exactly correspond to the energy requested from the battery, since the latter is a non-ideal power supply that is able to deliver power with different efficiencies depending on its state of charge. In this paper, we demonstrate that the proposed battery-aware delivery scheduling algorithm carries more packages than the traditional delivery model with the same battery capacity. Moreover, the battery-aware delivery model is 17% more accurate than the traditional delivery model for the same delivery scheme, which prevents unexpected drone landings.

A Fully Onchip Binarized Convolutional Neural Network FPGA Implementation with Accurate Inference

  • Li Yang
  • Zhezhi He
  • Deliang Fan

Deep convolutional neural networks have taken an important role in machine learning and have been widely used in computer vision tasks. However, their enormous model size and massive computation cost have become the main obstacle to deploying such powerful algorithms in low-power and resource-limited embedded systems, such as FPGAs. Recent works have shown that binarized neural networks (BNN), utilizing binarized (i.e. +1 and -1) convolution kernels and binary activation functions, can significantly reduce the model size and computation complexity, which paves a new road for energy-efficient FPGA implementation. In this work, we first propose a new BNN algorithm, called Parallel-Convolution BNN (i.e. PC-BNN), which replaces the original binary convolution layer in conventional BNN with two parallel binary convolution layers. PC-BNN achieves ~86% accuracy on the CIFAR-10 dataset with only 2.3Mb parameter size. We then deploy our proposed PC-BNN on the Xilinx PYNQ Z1 FPGA board with only 4.9Mb on-chip RAM. Owing to the ultra-small network parameter size, it is feasible to store all network parameters in on-chip RAM, which could greatly reduce the energy and delay overhead of loading network parameters from off-chip memory. Meanwhile, a new data streaming pipeline architecture is proposed in the PC-BNN FPGA implementation to further improve throughput. The experimental results show that our PC-BNN based FPGA implementation achieves 930 frames per second, 387.5 FPS/Watt and 396×10⁻⁴ FPS/LUT, which are among the best throughput and energy efficiency figures compared to most recent works.
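
Why +1/-1 kernels and activations reduce computation can be seen from the standard XNOR-popcount identity used in binarized networks in general. The sketch below illustrates that generic identity, not the PC-BNN architecture itself; the helper names `pack` and `binary_dot` and the bit encoding (+1 as bit 1, -1 as bit 0) are assumptions chosen for the example.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask      # XNOR: 1 where the signs match
    matches = bin(agree).count("1")        # popcount
    # Each match contributes +1, each mismatch -1: matches - (n - matches).
    return 2 * matches - n
```

A multiply-accumulate over n binary values thus collapses into one XNOR, one popcount, and a subtraction, which maps well to FPGA LUTs.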

In-situ Stochastic Training of MTJ Crossbar based Neural Networks

  • Ankit Mondal
  • Ankur Srivastava

Owing to high device density, scalability and non-volatility, Magnetic Tunnel Junction-based crossbars have garnered significant interest for implementing the weights of an artificial neural network. The existence of only two stable states in MTJs implies a high overhead of obtaining optimal binary weights in software. We illustrate that the inherent parallelism in the crossbar structure makes it highly appropriate for in-situ training, wherein the network is taught directly on the hardware. It leads to significantly smaller training overhead as the training time is independent of the size of the network, while also circumventing the effects of alternate current paths in the crossbar and accounting for manufacturing variations in the device. We show how the stochastic switching characteristics of MTJs can be leveraged to perform probabilistic weight updates using the gradient descent algorithm. We describe how the update operations can be performed on crossbars both with and without access transistors and perform simulations on them to demonstrate the effectiveness of our techniques. The results reveal that stochastically trained MTJ-crossbar NNs achieve a classification accuracy nearly the same as that of real-valued-weight networks trained in software and exhibit immunity to device variations.
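
The idea of a probabilistic weight update on two-state devices can be sketched in software as follows. This is a simplified model under stated assumptions, not the authors' scheme: weights are +1/-1, the switching probability is assumed to be `lr * |gradient|` clipped to 1, and the function name `stochastic_update` is hypothetical. In a real MTJ the switching probability would be set by the write pulse, which this sketch does not model.

```python
import random


def stochastic_update(weights, grads, lr, rng):
    """Probabilistically flip binary (+1/-1) weights along the descent direction."""
    updated = []
    for w, g in zip(weights, grads):
        target = -1 if g > 0 else 1           # gradient descent moves against g
        p_switch = min(1.0, lr * abs(g))      # assumed probability mapping
        # Only a weight that disagrees with the descent direction may switch.
        if g != 0 and w != target and rng.random() < p_switch:
            w = target
        updated.append(w)
    return updated
```

Calling `stochastic_update(weights, grads, lr=0.1, rng=random.Random(0))` repeatedly over training steps makes large-gradient weights switch more often than small-gradient ones, which is how a binary device can approximate a real-valued update on average.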

Variation-Aware Pipelined Cores through Path Shaping and Dynamic Cycle Adjustment: Case Study on a Floating-Point Unit

  • Ioannis Tsiokanos
  • Lev Mukhanov
  • Dimitrios S. Nikolopoulos
  • Georgios Karakonstantis

In this paper, we propose a framework for minimizing variation-induced timing failures in pipelined designs, while limiting any overhead incurred by conventional guardband based schemes. Our approach initially limits the long latency paths (LLPs) and isolates them in as few pipeline stages as possible by shaping the path distribution. Such a strategy facilitates the adoption of a special unit that predicts the excitation of the isolated LLPs and dynamically allows an extra cycle for the completion of only these error-prone paths. Moreover, our framework performs post-layout dynamic timing analysis based on real operands that we extract from a variety of applications. This allows us to estimate the bit error rates under potential delay variations, while considering the dynamic data dependent path excitation. When applied to the implementation of an IEEE-754 compatible double precision floating-point unit (FPU) in a 45nm process technology, the path shaping helps to reduce the bit error rates on average by 2.71x compared to the reference design under 8% delay variations. The integrated LLP prediction unit and the dynamic cycle adjustment avoid such failures and any quality loss at a cost of up to 0.61% throughput and 0.3% area overheads, while saving 37.95% power on average compared to an FPU with pessimistic margins.

A 2.6 mW Single-Ended Positive Feedback LNA for 5G Applications

  • Sana Arshad
  • Azam Beg
  • Rashad Ramzan

This paper presents the design of a single-ended positive feedback Common Gate (CG) Low Noise Amplifier (LNA) for 5G applications. Positive feedback is utilized to achieve the trade-off between the input matching, the gain and the noise factor (NF) of the LNA. The positive feedback inherently cancels the noise produced by the input CG transistor. The proposed LNA is designed and fabricated in 150 nm CMOS by L-Foundry. At 1.41 GHz, the measured S11 and S22 are better than -20 dB and -8.4 dB, respectively. The highest voltage gain is 16.17 dB with an NF of 3.64 dB. The complete chip has an area of 1 mm². The LNA’s power dissipation is only 2.6 mW with a 1 dB compression point of -13 dBm. The simple, low-power and single-ended architecture of the proposed LNA allows it to be implemented in phased array and Multiple Input Multiple Output (MIMO) radars, which have limited input and output pads and constrained power budgets for on-board components.

SESSION: Far-out Ideas

Insights from Biology: Low Power Circuits in the Fruit Fly

  • Louis K. Scheffer

Fruit flies (Drosophila melanogaster) are small insects, with correspondingly small power budgets. Despite this, they perform sophisticated neural computations in real time. Careful study of these insects is revealing how some of these circuits work. Insights from these systems might be helpful in designing other low power circuits.

ICCAD 2018 TOC

Full Citation in the ACM Digital Library

A fast thermal-aware fixed-outline floorplanning methodology based on analytical models

  • Jai-Ming Lin
  • Tai-Ting Chen
  • Yen-Fu Chang
  • Wei-Yi Chang
  • Ya-Ting Shyu
  • Yeong-Jar Chang
  • Juin-Ming Lu

High temperature or temperature non-uniformity has become a serious threat to the performance and reliability of high-performance integrated circuits (ICs). Thermal effects have become a non-ignorable issue in circuit design and physical design. To estimate temperature accurately, the locations of modules have to be determined, which makes efficient and effective thermal-aware floorplanning play a more important role. To resolve this problem, this paper proposes a differential nonlinear model which can approximate temperature and minimize wirelength at the same time during floorplanning. We also apply some techniques such as thermal-aware clustering or shrinking hot modules in the multi-level framework to further reduce temperature without inducing longer wirelength. The experimental results demonstrate that temperature and wirelength are greatly improved in our method compared to other works. More importantly, our runtime is quite fast and the fixed-outline constraint is also satisfied.

Analytical solution of Poisson’s equation and its application to VLSI global placement

  • Wenxing Zhu
  • Zhipeng Huang
  • Jianli Chen
  • Yao-Wen Chang

Poisson’s equation has been used in VLSI global placement for describing the potential field induced by a given charge density distribution. Unlike previous global placement methods that solve Poisson’s equation numerically, in this paper, we provide an analytical solution of the equation to calculate the potential energy of an electrostatic system. The analytical solution is derived based on the separation of variables method and an exact density function to model the block distribution in a placement region, which is an infinite series and converges absolutely. Using the analytical solution, we give a fast computation scheme of Poisson’s equation and develop an effective and efficient global placement algorithm called Pplace. Experimental results show that our Pplace achieves smaller placement wirelength than ePlace and NTUplace3, two leading wirelength-driven placers. With the pervasive applications of Poisson’s equation in scientific fields, in particular, our effective, efficient, and robust computation scheme for its analytical solution can provide substantial impacts to these fields.
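
As a hedged illustration of how separation of variables yields a series solution, consider a rectangular placement region with zero-flux (Neumann) boundaries and a zero-mean density expanded in cosines. This setting is a common one in electrostatics-based placement and is assumed here; the paper's exact density function and boundary conditions may differ.

```latex
% Assumed setting (not taken from the paper): region [0,W] x [0,H],
% Neumann boundaries, zero-mean density, indices u, v >= 0.
\nabla^2 \psi(x,y) = -\rho(x,y), \qquad
\rho(x,y) = \sum_{(u,v)\neq(0,0)} a_{uv}\cos(\omega_u x)\cos(\omega_v y),
\quad \omega_u = \frac{u\pi}{W},\ \omega_v = \frac{v\pi}{H}

% Substituting \psi = \sum b_{uv}\cos(\omega_u x)\cos(\omega_v y) gives
% \nabla^2\psi = -\sum (\omega_u^2+\omega_v^2)\, b_{uv}\cos(\omega_u x)\cos(\omega_v y), hence
b_{uv} = \frac{a_{uv}}{\omega_u^2 + \omega_v^2}, \qquad
\psi(x,y) = \sum_{(u,v)\neq(0,0)} \frac{a_{uv}}{\omega_u^2+\omega_v^2}
            \cos(\omega_u x)\cos(\omega_v y)
```

Each cosine product is an eigenfunction of the Laplacian, so the potential follows coefficient by coefficient once the density coefficients are known.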

Novel proximal group ADMM for placement considering fogging and proximity effects

  • Jianli Chen
  • Li Yang
  • Zheng Peng
  • Wenxing Zhu
  • Yao-Wen Chang

Fogging and proximity effects are two major factors that cause inaccurate exposure and thus layout pattern distortions in e-beam lithography. In this paper, we propose the first analytical placement algorithm to consider both the fogging and proximity effects. We first formulate the global placement problem as a separable minimization problem with linear constraints, where different objectives can be tackled one by one in an alternating fashion. Then, we propose a novel proximal group alternating direction method of multipliers (ADMM) to solve the separable minimization problem with two subproblems, where the first subproblem (mainly associated with wirelength and density) is solved by a steepest descent method without line-search, and the second one (mainly associated with the fogging and proximity effects) is handled by an analytical scheme. We prove the global convergence of the proximal group ADMM method. Finally, legalization and detailed placement are used to legalize and further improve the placement result. Experimental results show that our algorithm is effective and efficient for the addressed problem. Compared with the state-of-the-art work, our algorithm not only achieves 13.4% smaller fogging variation and 21.4% lower proximity variation, but also has a 1.65X speedup.

Simultaneous partitioning and signals grouping for time-division multiplexing in 2.5D FPGA-based systems

  • Shih-Chun Chen
  • Richard Sun
  • Yao-Wen Chang

The 2.5D FPGA is a promising technology to accommodate a large design in one FPGA chip, but the limited number of inter-die connections in a 2.5D FPGA may cause routing failures. To resolve the failures, input/output time-division multiplexing is adopted by grouping cross-die signals to go through one routing channel with a timing penalty after netlist partitioning. However, grouping signals after partitioning might lead to a suboptimal solution. Consequently, it is desirable to consider simultaneous partitioning and signal grouping although the optimization objectives of partitioning and grouping are different, and the time complexity of such simultaneous optimization is usually high. In this paper, we propose a simultaneous partitioning and grouping algorithm that can not only integrate the two objectives smoothly, but also reduce the time complexity to linear time per partitioning iteration. Experimental results show that our proposed algorithm outperforms the state-of-the-art flow in both cross-die signal timing criticality and system-clock periods.

IC/IP piracy assessment of reversible logic

  • Samah Mohamed Saeed
  • Xiaotong Cui
  • Alwin Zulehner
  • Robert Wille
  • Rolf Drechsler
  • Kaijie Wu
  • Ramesh Karri

Reversible logic is a building block for adiabatic and quantum computing in addition to other applications. Since common functions are non-reversible, one needs to embed them into proper-size reversible functions by adding ancillary inputs and garbage outputs. We explore the Intellectual Property (IP) piracy of reversible circuits. The number of embeddings of regular functions in a reversible function and the percent of leaked ancillary inputs measure the difficulty of recovering the embedded function. To illustrate the key concepts, we study reversible logic circuits designed using reversible logic synthesis tools based on Binary Decision Diagrams and Quantum Multi-valued Decision Diagrams.

TimingSAT: timing profile embedded SAT attack

  • Abhishek Chakraborty
  • Yuntao Liu
  • Ankur Srivastava

In order to enhance the security of logic obfuscation schemes, delay based logic locking has been proposed in combination with traditional functional logic locking approaches in recent literature. A circuit obfuscated using the aforementioned approach preserves the correct functionality only when both correct functional and delay keys are provided. In this paper, we develop a novel SAT formulation based approach called TimingSAT to deobfuscate the functionalities of such delay locked designs within a reasonable amount of time. The proposed technique models the timing characteristics of various types of gates present in the design as Boolean functions to build timing profile embedded SAT formulations in terms of targeted key inputs. The TimingSAT attack works in two stages: in the first stage the functional keys are found using the traditional SAT attack approach, and in the second stage the delay keys are deciphered utilizing the timing profile embedded SAT formulation of the circuit. In both stages of the attack, wrong keys are iteratively eliminated until a key belonging to the correct equivalence class is obtained. The experimental results highlight the effectiveness of the proposed TimingSAT attack in breaking delay logic locked benchmarks within a few hours.

Towards provably-secure analog and mixed-signal locking against overproduction

  • Nithyashankari Gummidipoondi Jayasankaran
  • Adriana Sanabria Borbon
  • Edgar Sanchez-Sinencio
  • Jiang Hu
  • Jeyavijayan Rajendran

Similar to digital circuits, analog and mixed-signal (AMS) circuits are also susceptible to supply-chain attacks such as piracy, overproduction, and Trojan insertion. However, unlike digital circuits, supply-chain security of AMS circuits is less explored. In this work, we propose to perform “logic locking” on the digital section of AMS circuits. The idea is to make the analog design intentionally suffer from the effects of process variations, which impede the operation of the circuit. Only on applying the correct key are the effects of process variations mitigated, so that the analog circuit performs as desired. We provide theoretical guarantees of the security of the circuit, and along with simulation results for the band-pass filter, low-noise amplifier, and low-dropout regulator, we also show experimental results of our technique on a band-pass filter.

Best of both worlds: integration of split manufacturing and camouflaging into a security-driven CAD flow for 3D ICs

  • Satwik Patnaik
  • Mohammed Ashraf
  • Ozgur Sinanoglu
  • Johann Knechtel

With the globalization of manufacturing and supply chains, ensuring the security and trustworthiness of ICs has become an urgent challenge. Split manufacturing (SM) and layout camouflaging (LC) are promising techniques to protect the intellectual property (IP) of ICs from malicious entities during and after manufacturing (i.e., from untrusted foundries and reverse-engineering by end-users). In this paper, we strive for “the best of both worlds,” that is of SM and LC. To do so, we extend both techniques towards 3D integration, an up-and-coming design and manufacturing paradigm based on stacking and interconnecting of multiple chips/dies/tiers.

Initially, we review prior art and their limitations. We also put forward a novel, practical threat model of IP piracy which is in line with the business models of present-day design houses. Next, we discuss how 3D integration is a naturally strong match to combine SM and LC. We propose a security-driven CAD and manufacturing flow for face-to-face (F2F) 3D ICs, along with obfuscation of interconnects. Based on this CAD flow, we conduct comprehensive experiments on DRC-clean layouts. Strengthened by an extensive security analysis (also based on a novel attack to recover obfuscated F2F interconnects), we argue that entering the next, third dimension is eminent for effective and efficient IP protection.

Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base

  • Sebastian Vogel
  • Mengyu Liang
  • Andre Guntoro
  • Walter Stechele
  • Gerd Ascheid

Efficient acceleration of Deep Neural Networks is a manifold task. In order to reduce memory requirements and energy consumption, we propose the use of dedicated accelerators with novel arithmetic processing elements which use bit shifts instead of multipliers. While a regular power-of-2 quantization scheme allows for multiplierless computation of multiply-accumulate operations, it suffers from high accuracy losses in neural networks. Therefore, we evaluate the use of powers-of-arbitrary-log-bases and confirm their suitability for quantization of pre-trained neural networks. The presented method works without retraining of the neural network and therefore is suitable for applications in which no labeled training data is available. In order to verify our proposed method, we implement the log-based processing elements in a neural network accelerator on an FPGA. The hardware efficiency is evaluated in terms of FPGA utilization and energy requirements in comparison to regular 8-bit fixed-point multiplier based acceleration. Using this approach, hardware resources are minimized and power consumption is reduced by 22.3%.
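
The effect of the log base on quantization error can be illustrated with a short sketch. It models only the rounding of a weight to the nearest power of a chosen base (in the log domain), not the shift-based hardware; the function name `log_quantize` and the exponent clamp range are assumptions made for the example.

```python
import math


def log_quantize(w, base, min_exp, max_exp):
    """Quantize w to sign(w) * base**e with an integer exponent e (illustrative)."""
    if w == 0:
        return 0.0
    # Round in the log domain, then clamp to the representable exponents.
    exp = round(math.log(abs(w), base))
    exp = max(min_exp, min(max_exp, exp))
    return math.copysign(base ** exp, w)
```

For a weight of 0.7, base 2 gives 0.5, whereas base √2 gives about 0.707: a finer base places representable values closer together, which is one way to see why a base other than 2 can lose less accuracy.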

NID: processing binary convolutional neural network in commodity DRAM

  • Jaehyeong Sim
  • Hoseok Seol
  • Lee-Sup Kim

Recent large-scale CNNs suffer from a severe memory wall problem as their number of weights ranges from tens to hundreds of millions. Processing in-memory (PIM) and binary CNN have been proposed to reduce the number of memory accesses and the memory footprint, respectively. By combining the two separate concepts, we propose a novel processing in-DRAM framework for binary CNN, called NID, where dominant convolution operations are processed using in-DRAM bulk bitwise operations. We first identify the problem that the bitcount operations with only bulk bitwise AND/OR/NOT incur significant overhead in terms of delay when the size of kernels gets larger. Then, we not only optimize the performance by efficiently allocating inputs and kernels to DRAM banks for both convolutional and fully-connected layers through design space explorations, but also mitigate the overhead of bitcount operations by splitting kernels into multiple parts. Partial sum accumulations and tasks of the other layers such as max-pooling and normalization layers are processed in the peripheral area of DRAM with negligible overheads. As a result, our NID framework achieves 19X-36X performance and 9X-14X EDP improvements for convolutional layers, and 9X-17X performance and 1.4X-4.5X EDP improvements for fully-connected layers over a previous PIM technique in four large-scale CNN models.

AXNet: approximate computing using an end-to-end trainable neural network

  • Zhenghao Peng
  • Xuyang Chen
  • Chengwen Xu
  • Naifeng Jing
  • Xiaoyao Liang
  • Cewu Lu
  • Li Jiang

Neural network based approximate computing is a universal architecture promising to gain tremendous energy-efficiency for many error resilient applications. To guarantee the approximation quality, existing works deploy two neural networks (NNs), e.g., an approximator and a predictor. The approximator provides the approximate results, while the predictor predicts whether the input data is safe to approximate with the given quality requirement. However, it is non-trivial and time-consuming to make these two neural networks coordinate by training them separately, since they have different optimization objectives. This paper proposes a novel neural network structure, AXNet, to fuse the two NNs into a holistic end-to-end trainable NN. Leveraging the philosophy of multi-task learning, AXNet can tremendously improve the invocation (proportion of safe-to-approximate samples) and reduce the approximation error. The training effort also decreases significantly. Experimental results show 50.7% more invocation and substantial cuts in training time when compared to existing neural network based approximate computing frameworks.

Scalable-effort ConvNets for multilevel classification

  • Valentino Peluso
  • Andrea Calimera

This work introduces the concept of scalable-effort Convolutional Neural Networks (ConvNets), an effort-accuracy scalable model for classification of data at multilevel abstraction. Scalable-effort ConvNets are able to adapt at run-time to the complexity of the classification problem, i.e. the level of abstraction defined by the application (or context), and reach a given classification accuracy with minimal computational effort. The mechanism is implemented using a single-weight scalable-precision model rather than an ensemble of quantized weight models; this makes the proposed strategy highly flexible and particularly suited for embedded architectures with limited resource availability.

The paper describes (i) a hardware/software vertical implementation of scalable-precision multiply&accumulate arithmetic, (ii) an accuracy-constrained heuristic that delivers near-optimal layer-by-layer precision mapping at a predefined level of abstraction. It also reports the validation for three state-of-the-art nets, i.e. AlexNet, SqueezeNet and MobileNet, trained and tested with ImageNet. Collected results show scalable-effort ConvNets guarantee flexibility and substantial savings: 47.07% computational effort reduction at minimum accuracy, or 30.6% accuracy improvement at maximum effort w.r.t. standard flat ConvNets (average over the three benchmarks for high-level classification).

Emerging reconfigurable nanotechnologies: can they support future electronics?

  • Shubham Rai
  • Srivatsa Srinivasa
  • Patsy Cadareanu
  • Xunzhao Yin
  • Xiaobo Sharon Hu
  • Pierre-Emmanuel Gaillardon
  • Vijaykrishnan Narayanan
  • Akash Kumar

Several emerging reconfigurable technologies have been explored in recent years offering device-level runtime reconfigurability. These technologies offer the freedom to choose between p- and n-type functionality from a single transistor. In order to optimally utilize the feature-sets of these technologies, circuit designs and storage elements require novel design to complement the existing and future electronic requirements. An important aspect to sustain such endeavors is to supplement the existing design flow from the device level to the circuit level. This should be backed by a thorough evaluation so as to ascertain the feasibility of such explorations. Additionally, since these technologies offer runtime reconfigurability and often encapsulate more than one function, hardware security features like polymorphic logic gates and on-chip key storage come naturally cheap with circuits based on these reconfigurable technologies. This paper presents innovative approaches devised for circuit designs harnessing the reconfigurable features of these nanotechnologies. New circuit design paradigms based on these nano devices will be discussed to brainstorm on exciting avenues for novel computing elements.

Design and algorithm for clock gating and flip-flop co-optimization

  • Giyoung Yang
  • Taewhan Kim

This work is the first to investigate how the design of data-driven (i.e., toggling based) clock gating can be closely integrated with the synthesis of flip-flops, a problem that has not been addressed in prior clock gating work. Our key observation is that some internal part of a flip-flop cell can be reused to generate its clock gating enable signal. Based on this, we propose a newly optimized flip-flop wiring structure, called eXOR-FF, in which an internal logic can be reused for every clock cycle to decide if the flip-flop is to be activated or inactivated through clock gating, thereby achieving area saving (thus, leakage as well as dynamic power saving) on every pair of flip-flop and its toggling detection logic. Then, we propose a comprehensive methodology of placement/timing-aware clock gating exploration that provides two unique strengths: it is best suited for maximally exploiting the benefit of eXOR-FFs, and it performs precise analyses on the decomposition of power consumption and timing impact, translating them into cost functions in the core engine of clock gating exploration.

Macro-aware row-style power delivery network design for better routability

  • Jai-Ming Lin
  • Jhih-Sheng Syu
  • I-Ru Chen

Reliability of a P/G network is one of the most important concerns in a chip design, which makes powerplanning the most critical step in the physical design. Traditional P/G network design mainly focuses on reducing usage of routing resource to satisfy voltage drop and electromigration constraints according to a regular mesh. As the number of macros in a modern design increases, this style may waste more routing resource and make routing congestion more severe in local regions. In order to save routing resource and increase routability, this paper proposes a delicate powerplanning method. First, we propose a row-style power mesh to facilitate connection of pre-placed macros and increase routability of signal nets in the later stage. Besides, an effective power stripe width which can reduce wastage of routing resource and provide stronger supply voltage is found. Moreover, this is the first work to use linear programming to minimize P/G routing area while considering routability at the same time. The experimental results show that routability of a design with many macros can be significantly improved by our row-style power networks.

Modeling and optimization of magnetic core TSV-inductor for on-chip DC-DC converter

  • Baixin Chen
  • Umamaheswara Tida
  • Cheng Zhuo
  • Yiyu Shi

Conventional on-chip spiral inductors consume significant top metal routing area, which prevents their adoption in many on-chip applications. Recently, a TSV-inductor with a magnetic core has been proved to be a viable option for on-chip DC-DC converters in a 14nm test chip. The operating conditions of such inductors play a major role in maximizing the performance and efficiency of the DC-DC converter. However, due to its unique TSV structure, unlike conventional spiral inductors, much of the modeling details remain unclear. This paper analyzes the modeling details of a magnetic core TSV-inductor and proposes a design methodology to optimize power losses of the inductor. With this methodology, designers can ensure fast and reliable inductor optimization for on-chip applications. Experimental results show that the optimized magnetic core TSV-inductor can achieve inductance density improvement of 6.0–7.7X and quality factor improvements of 1.3–1.6X while maintaining the same footprint.

Machine-learning-based dynamic IR drop prediction for ECO

  • Yen-Chun Fang
  • Heng-Yi Lin
  • Min-Yan Su
  • Chien-Mo Li
  • Eric Jia-Wei Fang

During design signoff, many iterations of Engineering Change Order (ECO) are needed to ensure that the IR drop of each cell instance meets the specified limit. This wastes resources because repeated dynamic IR drop simulations take a very long time on very similar designs. In this work, we train a machine learning model, based on data before ECO, and predict IR drop after ECO. To increase our prediction accuracy, we propose 17 timing-aware, power-aware, and physical-aware features. Our method is scalable because the feature dimension is fixed (937), independent of design size and cell library. Also, we propose to build regional models for cell instances near IR drop violations to improve both prediction accuracy and training time. Our experiments show that our prediction correlation coefficient is 0.97 and average error is 3.0mV on a 5-million-cell industry design. Our IR drop prediction for 100K cell instances can be completed within 2 minutes. Our proposed method provides a fast IR drop prediction to speed up ECO.

Privacy-preserving deep learning and inference

  • M. Sadegh Riazi
  • Farinaz Koushanfar

We provide a systemization of knowledge of the recent progress made in addressing the crucial problem of deep learning on encrypted data. The problem is important due to the prevalence of deep learning models across various applications, and privacy concerns over the exposure of deep learning IP and users' data. Our focus is on provably secure methodologies that rely on cryptographic primitives and not trusted third parties/platforms. The computational intensity of the learning models, together with the complexity of realizing the cryptographic algorithms, makes practical implementation a challenge. We provide a summary of the state-of-the-art, comparison of the existing solutions, as well as future challenges and opportunities.

Machine learning IP protection

  • Rosario Cammarota
  • Indranil Banerjee
  • Ofer Rosenberg

Machine learning, specifically deep learning, is becoming a key technology component in application domains such as identity management, finance, automotive, and healthcare, to name a few. Proprietary machine learning models (Machine Learning IP) are developed and deployed at the network edge, on end devices, and in the cloud to maximize user experience.

With the proliferation of applications embedding Machine Learning IPs, machine learning models and hyper-parameters become attractive to attackers, and require protection. Major players in the semiconductor industry provide mechanisms on device to protect the IP at rest and during execution from being copied, altered, reverse engineered, and abused by attackers. In this work we explore system security architecture mechanisms and their applications to Machine Learning IP protection.

Assured deep learning: practical defense against adversarial attacks

  • Bita Darvish Rouhani
  • Mohammad Samragh
  • Mojan Javaheripi
  • Tara Javidi
  • Farinaz Koushanfar

Deep Learning (DL) models have been shown to be vulnerable to adversarial attacks. In light of the adversarial attacks, it is critical to reliably quantify the confidence of the prediction in a neural network to enable safe adoption of DL models in autonomous sensitive tasks (e.g., unmanned vehicles and drones). This article discusses recent research advances for unsupervised model assurance against the strongest adversarial attacks known to date and quantitatively compares their performance. Given the widespread usage of DL models, it is imperative to provide model assurance by carefully looking into the feature maps automatically learned within DL models instead of looking back with regret when deep learning systems are compromised by adversaries.

Tetris: re-architecting convolutional neural network computation for machine learning accelerators

  • Hang Lu
  • Xin Wei
  • Ning Lin
  • Guihai Yan
  • Xiaowei Li

Inference efficiency is the predominant consideration in designing deep learning accelerators. Previous work mainly focuses on skipping zero values to deal with remarkable ineffectual computation, while zero bits in non-zero values, another major source of ineffectual computation, are often ignored. The reason lies in the difficulty of extracting essential bits during multiply-and-accumulate (MAC) operations in the processing element. Based on the fact that zero bits make up as much as 68.9% of the bits in the overall weights of modern deep convolutional neural network models, this paper first proposes a weight kneading technique that eliminates ineffectual computation caused by either zero-value weights or zero bits in non-zero weights, simultaneously. In addition, a split-and-accumulate (SAC) computing pattern that replaces conventional MAC, as well as the corresponding hardware accelerator design called Tetris, are proposed to support weight kneading at the hardware level. Experimental results show that Tetris can speed up inference by up to 1.50x and improve power efficiency by up to 5.33x compared with the state-of-the-art baselines.
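
The zero-bit statistic that motivates this abstract can be illustrated with a minimal sketch. It assumes weights stored as unsigned fixed-point integers of a chosen width; the function name, the 8-bit width, and the sample weights are hypothetical, and the code only measures the fraction of zero bits. It does not implement weight kneading or SAC, and it does not reproduce the 68.9% figure.

```python
def zero_bit_fraction(weights, bits=8):
    """Return the fraction of zero bits across all weights.

    Assumption: each weight is an unsigned fixed-point integer that fits
    in `bits` bits. This encoding is illustrative only.
    """
    mask = (1 << bits) - 1
    total_bits = len(weights) * bits
    one_bits = sum(bin(w & mask).count("1") for w in weights)
    return (total_bits - one_bits) / total_bits


# Hypothetical sample: one zero-value weight and three non-zero weights.
sample_weights = [0, 1, 255, 16]
fraction = zero_bit_fraction(sample_weights)
```

Skipping only the zero-value weight in this sample would avoid 8 of the 32 bit positions, whereas 22 of the 32 bits are zero, which is the gap the abstract points to.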

FCN-engine: accelerating deconvolutional layers in classic CNN processors

  • Dawen Xu
  • Kaijie Tu
  • Ying Wang
  • Cheng Liu
  • Bingsheng He
  • Huawei Li

Unlike standard Convolutional Neural Networks (CNNs) with fully-connected layers, Fully Convolutional Neural Networks (FCN) are prevalent in computer vision applications such as object detection, semantic/image segmentation, and the most popular generative tasks based on Generative Adversarial Networks (GAN). In an FCN, traditional convolutional layers and deconvolutional layers contribute to the majority of the computation complexity. However, prior deep learning accelerator designs mostly focus on CNN optimization. They either use independent compute-resources to handle deconvolution or convert deconvolutional layers (Deconv) into general convolution operations, which arouses considerable overhead.

To address this problem, we propose a unified fully convolutional accelerator aiming to handle both the deconvolutional and convolutional layers with a single processing element (PE) array. We re-optimize the conventional CNN accelerator architecture of a regular 2D processing element array to enable it to support the data flow of deconvolutional layer inference more efficiently. By exploiting the locality in deconvolutional filters, this architecture reduces the consumption of on-chip memory communication from 24.79 GB to 6.56 GB and improves the power efficiency significantly. Compared to a prior baseline deconvolution acceleration scheme, the proposed accelerator achieves 1.3X to 44.9X speedup and reduces the energy consumption by 14.6% to 97.6% on a set of representative benchmark applications. Meanwhile, it keeps similar CNN inference performance to that of an optimized CNN-only accelerator with negligible power consumption and chip area overhead.

Designing adaptive neural networks for energy-constrained image classification

  • Dimitrios Stamoulis
  • Ting-Wu (Rudy) Chin
  • Anand Krishnan Prakash
  • Haocheng Fang
  • Sribhuvan Sajja
  • Mitchell Bognar
  • Diana Marculescu

As convolutional neural networks (CNNs) enable state-of-the-art computer vision applications, their high energy consumption has emerged as a key impediment to their deployment on embedded and mobile devices. Towards efficient image classification under hardware constraints, prior work has proposed adaptive CNNs, i.e., systems of networks with different accuracy and computation characteristics, where a selection scheme adaptively selects the network to be evaluated for each input image. While previous efforts have investigated different network selection schemes, we find that they do not necessarily result in energy savings when deployed on mobile systems. The key limitation of existing methods is that they learn only how data should be processed among the CNNs and not the network architectures, with each network being treated as a blackbox.

To address this limitation, we pursue a more powerful design paradigm where the architecture settings of the CNNs are treated as hyper-parameters to be globally optimized. We cast the design of adaptive CNNs as a hyper-parameter optimization problem with respect to energy, accuracy, and communication constraints imposed by the mobile device. To efficiently solve this problem, we adapt Bayesian optimization to the properties of the design space, reaching near-optimal configurations in few tens of function evaluations. Our method reduces the energy consumed for image classification on a mobile device by up to 6X, compared to the best previously published work that uses CNNs as blackboxes. Finally, we evaluate two image classification practices, i.e., classifying all images locally versus over the cloud under energy and communication constraints.

FATE: fast and accurate timing error prediction framework for low power DNN accelerator design

  • Jeff (Jun) Zhang
  • Siddharth Garg

Deep neural networks (DNN) are increasingly being accelerated on application-specific hardware such as the Google TPU designed especially for deep learning. Timing speculation is a promising approach to further increase the energy efficiency of DNN accelerators. Architectural exploration for timing speculation requires detailed gate-level timing simulations that can be time-consuming for large DNNs which execute millions of multiply-and-accumulate (MAC) operations. In this paper we propose FATE, a new methodology for fast and accurate timing simulations of DNN accelerators like the Google TPU. FATE proposes two novel ideas: (i) DelayNet, a DNN based timing model for MAC units; and (ii) a statistical sampling methodology that reduces the number of MAC operations for which timing simulations are performed. We show that FATE results in between 8X and 58X speed-up in timing simulations, while introducing less than 2% error in classification accuracy estimates. We demonstrate the use of FATE by comparing a conventional DNN accelerator that uses 2’s complement (2C) arithmetic with one that uses signed magnitude representation (SMR). We show that the SMR implementation provides 18% more energy savings for the same classification accuracy than 2C, a result that might be of independent interest.

Waterfall is too slow, let’s go Agile: multi-domain coupling for synthesizing automotive cyber-physical systems

  • Debayan Roy
  • Michael Balszun
  • Thomas Heurung
  • Samarjit Chakraborty
  • Amol Naik

For future autonomous vehicles, the system development life cycle must keep up with the rapid rate of innovation and changing needs of the market. Waterfall is too slow to react to such changes, and therefore, there is a growing emphasis on adopting Agile development concepts in the automotive industry. Ensuring requirements traceability, and thus proving functional safety, is a serious challenge in this direction. Modern cars are complex cyber-physical systems and are traditionally designed using a set of disjoint tools, which adds to the challenge. In this paper, we point out that multi-domain coupling and design automation using correct-by-design approaches can lead to safe designs even in an Agile environment. In this context, we study current industry trends. We further outline the challenges involved in multi-domain coupling and demonstrate using a state-of-the-art approach how these challenges can be addressed by exploiting domain-specific knowledge.

Model-based and data-driven approaches for building automation and control

  • Tianshu Wei
  • Xiaoming Chen
  • Xin Li
  • Qi Zhu

Smart buildings in the future are complex cyber-physical-human systems that involve close interactions among embedded platform (for sensing, computation, communication and control), mechanical components, physical environment, building architecture, and occupant activities. The design and operation of such buildings require a new set of methodologies and tools that can address these heterogeneous domains in a holistic, quantitative and automated fashion. In this paper, we will present our design automation methods for improving building energy efficiency and offering comfortable services to occupants at low cost. In particular, we will highlight our work in developing both model-based and data-driven approaches for building automation and control, including methods for co-scheduling heterogeneous energy demands and supplies, for integrating intelligent building energy management with grid optimization through a proactive demand response framework, for optimizing HVAC control with deep reinforcement learning, and for accurately measuring in-building temperature by combining prior modeling information with few sensor measurements based upon Bayesian inference.

Design automation for battery systems

  • Swaminathan Narayanaswamy
  • Sangyoung Park
  • Sebastian Steinhorst
  • Samarjit Chakraborty

High power Lithium-Ion (Li-Ion) battery packs used in stationary Electrical Energy Storage (EES) systems and Electric Vehicle (EV) applications require a sophisticated Battery Management System (BMS) in order to maintain safe operation and improve their performance. With the increasing complexity of these battery packs and their demand for shorter time-to-market, decentralized approaches for battery management, providing a high degree of modularity, scalability and improved control performance are typically preferred. However, manual design approaches for these complex distributed systems are time consuming and are error-prone resulting in a reduced energy efficiency of the overall system. Here, special design automation techniques considering all abstraction-levels of the battery system are required to obtain highly optimized battery packs. This paper presents from a design automation perspective the recent advances in the domain of battery systems that are a combination of the electrochemical cells and their associated management modules. Specifically, we classify the battery systems into three abstraction levels, cell-level (battery cells and their interconnection schemes), module-level (sensing and charge balancing circuits) and pack-level (computation and control algorithms). We provide an overview of challenges that exist in each abstraction layer and give an outlook towards future design automation techniques that are required to overcome these limitations.

RFUZZ: coverage-directed fuzz testing of RTL on FPGAs

  • Kevin Laeufer
  • Jack Koenig
  • Donggyu Kim
  • Jonathan Bachrach
  • Koushik Sen

Dynamic verification is widely used to increase confidence in the correctness of RTL circuits during the pre-silicon design phase. Despite numerous attempts over the last decades to automate the stimuli generation based on coverage feedback, Coverage Directed Test Generation (CDG) has not found the widespread adoption that one would expect. Based on new ideas from the software testing community around coverage-guided mutational fuzz testing, we propose a new approach to the CDG problem which requires minimal setup and takes advantage of FPGA-accelerated simulation for rapid testing. We provide test input and coverage definitions that allow fuzz testing to be applied to RTL circuit verification. In addition, we propose and implement a series of transformation passes that make it feasible to reset arbitrary RTL designs quickly, a requirement for deterministic test execution. Alongside this paper we provide rfuzz, a fully featured implementation of our testing methodology which we make available as open-source software to the research community. An empirical evaluation of rfuzz shows promising results on achieving coverage for a wide range of different RTL designs ranging from communication IPs to an industry scale 64-bit CPU.
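
The general idea of coverage-guided mutational fuzzing mentioned above can be sketched in a few lines. This is a generic software-style loop, not the rfuzz implementation: `toy_dut`, its coverage points, and the single-bit-flip mutator are hypothetical stand-ins for a simulated circuit and its coverage definition, and the FPGA acceleration and fast-reset passes described in the abstract are not modeled.

```python
import random


def toy_dut(data: bytes) -> frozenset:
    """Hypothetical stand-in for a circuit under test.

    Returns the set of coverage points that the input exercised.
    """
    hit = {"reset"}
    if data and data[0] > 127:
        hit.add("high_first_byte")
        if len(data) > 1 and data[1] == 0:
            hit.add("zero_second_byte")
    return frozenset(hit)


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Flip one randomly chosen bit of a non-empty input."""
    buf = bytearray(data)
    pos = rng.randrange(len(buf))
    buf[pos] ^= 1 << rng.randrange(8)
    return bytes(buf)


def fuzz(seed_input: bytes, iterations: int, seed: int = 0):
    """Keep mutated inputs only when they reach new coverage points.

    Assumption: `seed_input` is non-empty, since the mutator flips a bit.
    """
    rng = random.Random(seed)
    corpus = [seed_input]
    covered = set(toy_dut(seed_input))
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        new_points = toy_dut(candidate) - covered
        if new_points:
            corpus.append(candidate)
            covered |= new_points
    return corpus, covered


corpus, covered = fuzz(b"\x00\x01", iterations=200)
```

Each corpus entry after the seed was kept because it exercised at least one coverage point that no earlier input had reached, which is the feedback signal that steers the mutations.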

Steep coverage-ascent directed test generation for shared-memory verification of multicore chips

  • Gabriel A. G. Andrade
  • Marleson Graf
  • Nícolas Pfeifer
  • Luiz C. V. dos Santos

This paper proposes a framework for functional verification of shared memory that relies on reusable coverage-driven directed test generation. It reveals a new mechanism to improve the quality of non-deterministic tests. The generator exploits general properties of coherence protocols and cache memories for better control on transition coverage, which serves as a proxy for increasing the actual coverage metric adopted in a given verification environment. Being independent of coverage metric, coherence protocol, and cache parameters, the proposed generator is reusable across quite different designs and verification environments. We report the coverage for 8, 16, and 32-core designs and the effort required for exposing nine different types of errors. The proposed technique was always able to reach similar coverage as a state-of-the-art generator, and it always did it faster above a certain threshold. For instance, when executing tests with 1K operations for verifying 32-core designs, the former reached 65% coverage around 5 times faster than the latter. Besides, we identified challenging errors that could hardly be found by the latter within one hour, but were exposed by our technique in 5 to 30 minutes.

SMTSampler: efficient stimulus generation from complex SMT constraints

  • Rafael Dutra
  • Jonathan Bachrach
  • Koushik Sen

Stimulus generation is an essential part of hardware verification, being at the core of widely applied constrained-random verification techniques. However, as verification problems get more and more complex, so do the constraints which must be satisfied. In this context, it is a challenge to efficiently generate random stimuli which can achieve a good coverage of the design space. We developed a new technique SMTSampler which can sample random solutions from Satisfiability Modulo Theories (SMT) formulas with bit-vectors, arrays, and uninterpreted functions. The technique uses a small number of calls to a constraint solver in order to generate up to millions of stimuli. Our evaluation on a large set of complex industrial SMT benchmarks shows that SMTSampler can handle a larger class of SMT problems, outperforming state-of-the-art constraint sampling techniques in the number of samples produced and the coverage of the constraint space.

DL-RSIM: a simulation framework to enable reliable ReRAM-based accelerators for deep learning

  • Meng-Yao Lin
  • Hsiang-Yun Cheng
  • Wei-Ting Lin
  • Tzu-Hsien Yang
  • I-Ching Tseng
  • Chia-Lin Yang
  • Han-Wen Hu
  • Hung-Sheng Chang
  • Hsiang-Pang Li
  • Meng-Fan Chang

Memristor-based deep learning accelerators provide a promising solution to improve the energy efficiency of neuromorphic computing systems. However, the electrical properties and crossbar structure of memristors make these accelerators error-prone. To enable reliable memristor-based accelerators, a simulation platform is needed to precisely analyze the impact of non-ideal circuit and device properties on the inference accuracy. In this paper, we propose a flexible simulation framework, DL-RSIM, to tackle this challenge. DL-RSIM simulates the error rates of every sum-of-products computation in the memristor-based accelerator and injects the errors in the targeted TensorFlow-based neural network model. A rich set of reliability impact factors are explored by DL-RSIM, and it can be incorporated with any deep learning neural network implemented by TensorFlow. Using three representative convolutional neural networks as case studies, we show that DL-RSIM can guide chip designers to choose a reliability-friendly design option and develop reliability optimization techniques.

A ferroelectric FET based power-efficient architecture for data-intensive computing

  • Yun Long
  • Taesik Na
  • Prakshi Rastogi
  • Karthik Rao
  • Asif Islam Khan
  • Sudhakar Yalamanchili
  • Saibal Mukhopadhyay

In this paper, we present a ferroelectric FET (FeFET) based power-efficient architecture to accelerate data-intensive applications such as deep neural networks (DNNs). We propose a cross-cutting solution combining emerging device technologies, circuit optimizations, and micro-architecture innovations. At device level, FeFET crossbar is utilized to perform vector-matrix multiplication (VMM). As a field effect device, FeFET significantly reduces the read/write energy compared with the resistive random-access memory (ReRAM). At circuit level, we propose an all-digital peripheral design, reducing the large overhead introduced by ADC and DAC in prior works. In terms of micro-architecture innovation, a dedicated hierarchical network-on-chip (H-NoC) is developed for input broadcasting and on-the-fly partial results processing, reducing the data transmission volume and latency. Speed, power, area and computing accuracy are evaluated based on detailed device characterization and system modeling. For DNN computing, our design achieves 254x and 9.7x gain in power efficiency (GOPS/W) compared to GPU and ReRAM based designs, respectively.

EMAT: an efficient multi-task architecture for transfer learning using ReRAM

  • Fan Chen
  • Hai Li

Transfer learning has recently demonstrated great success in general supervised learning by mitigating expensive training efforts. However, existing neural network accelerators have been proven inefficient in executing transfer learning because they fail to accommodate the layer-wise heterogeneity in computation and memory requirements. In this work, we propose EMAT, an efficient multi-task architecture for transfer learning built on resistive memory (ReRAM) technology. EMAT utilizes the energy-efficiency of ReRAM arrays for matrix-vector multiplication and realizes a hierarchical reconfigurable design with heterogeneous computation components to incorporate the data patterns in transfer learning. Compared to the GPU platform, EMAT achieves on average a 120X speedup and 87X energy saving. EMAT also obtains a 2.5X speedup compared to the state-of-the-art CMOS accelerator.

Co-manage power delivery and consumption for manycore systems using reinforcement learning

  • Haoran Li
  • Zhongyuan Tian
  • Rafael K. V. Maeda
  • Xuanqi Chen
  • Jun Feng
  • Jiang Xu

Maintaining high energy efficiency has become a critical design issue for high-performance systems. Many power management techniques have been proposed for the processor cores such as dynamic voltage and frequency scaling (DVFS). However, very few solutions consider the power losses suffered on the power delivery system (PDS), despite the fact that they have a significant impact on the overall system energy efficiency. With the explosive growth of system complexity and highly dynamic workload variations, it is challenging to find the optimal power management policies which can effectively match the power delivery with the power consumption. To tackle the above problems, we propose a reinforcement learning-based power management scheme for manycore systems to jointly monitor and adjust both the PDS and the processor cores, aiming to improve overall system energy efficiency. The learning agents distributed across power domains not only manage the power states of processor cores but also control the on/off states of on-chip VRs to proactively adapt to the workload variations. Experimental results with realistic applications show that when the proposed approach is applied to a large-scale system with a hybrid PDS, it lowers the overall system energy-delay-product (EDP) by 41% compared with a traditional monolithic DVFS approach with a bulky off-chip VR.

Adaptive-precision framework for SGD using deep Q-learning

  • Wentai Zhang
  • Hanxian Huang
  • Jiaxi Zhang
  • Ming Jiang
  • Guojie Luo

Stochastic gradient descent (SGD) is a widely-used algorithm in many applications, especially in the training process of deep learning models. Low-precision implementation for SGD has been studied as a major acceleration approach. However, if not appropriately used, low-precision implementation can deteriorate its convergence because of the rounding error when gradients become small near a local optimum. In this work, to balance throughput and algorithmic accuracy, we apply the Q-learning technique to adjust the precision of SGD automatically by designing an appropriate decision function. The proposed decision function for Q-learning takes the error rate of the objective function, its gradients, and the current precision configuration as the inputs. Q-learning then chooses proper precision adaptively for hardware efficiency and algorithmic accuracy. We use reconfigurable devices such as FPGAs to evaluate the adaptive precision configurations generated by the proposed Q-learning method. We prototype the framework using LeNet-5 model with MNIST and CIFAR10 datasets and implement it on a Xilinx KCU1500 FPGA board. In the experiments, we analyze the throughput of different precision representations and the precision-selection of our framework. The results show that the proposed framework with adaptive precision increases the throughput by up to 4.3x compared to the conventional 32-bit floating point setting, and it achieves both the best hardware efficiency and algorithmic accuracy.
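
The standard Q-learning update that underlies this kind of precision selection can be sketched as follows. This is a tabular simplification under stated assumptions, not the paper's decision function: the state labels, the candidate precisions in `PRECISIONS`, the reward values, and the learning-rate and discount settings are all hypothetical.

```python
from collections import defaultdict

# Hypothetical action set: candidate bit widths for the SGD arithmetic.
PRECISIONS = (8, 16, 32)


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


def greedy_action(q, state, actions):
    """Pick the action with the highest current Q-value for a state."""
    return max(actions, key=lambda a: q[(state, a)])


q = defaultdict(float)
# Hypothetical experience: 8-bit arithmetic was rewarded while gradients were large.
q_update(q, "large_gradient", 8, 1.0, "small_gradient", PRECISIONS)
choice = greedy_action(q, "large_gradient", PRECISIONS)
```

In the setting the abstract describes, the state would be derived from the objective error rate, the gradients, and the current precision configuration, and the reward would have to trade throughput against accuracy; both are left abstract here.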

Differentiated handling of physical scenes and virtual objects for mobile augmented reality

  • Chih-Hsuan Yen
  • Wei-Ming Chen
  • Pi-Cheng Hsiu
  • Tei-Wei Kuo

Mobile devices running augmented reality applications consume considerable energy for graphics-intensive workloads. This paper presents a scheme for the differentiated handling of camera-captured physical scenes and computer-generated virtual objects according to different perceptual quality metrics. We propose online algorithms and their realtime implementations to reduce energy consumption through dynamic frame rate adaptation while maintaining the visual quality required for augmented reality applications. To evaluate system efficacy, we integrate our scheme into Android and conduct extensive experiments on a commercial smartphone with various application scenarios. The results show that the proposed scheme can achieve energy savings of up to 39.1% in comparison to the native graphics system in Android while maintaining satisfactory visual quality.

DATC RDF: an academic flow from logic synthesis to detailed routing

  • Jinwook Jung
  • Iris Hui-Ru Jiang
  • Jianli Chen
  • Shih-Ting Lin
  • Yih-Lang Li
  • Victor N. Kravets
  • Gi-Joon Nam

In this paper, we present DATC Robust Design Flow (RDF) from logic synthesis to detailed routing. We further include detailed placement and detailed routing tools based on recent EDA research contests. We also demonstrate RDF in a scalable cloud infrastructure. Design methodology and cross-stage optimization research can be conducted via RDF.

Physical modeling of bitcell stability in subthreshold SRAMs for leakage-area optimization under PVT variations

  • Xin Fan
  • Rui Wang
  • Tobias Gemmeke

Subthreshold SRAM design is crucial for addressing the memory bottleneck in energy constrained applications. While statistical optimization can be applied based on Monte-Carlo (MC) simulation, exploration of bitcell design space is time consuming. This paper presents a framework for model-based design and optimization of subthreshold SRAM bitcells under random PVT variations. By incorporating key design and process features, a physical model of bitcell static noise margin (SNM) has been derived analytically. It captures intra-die SNM variations by the combination of a folded-normal distribution and a non-central chi-squared distribution. Validations with MC simulation show its accuracy of modeling SNM distributions down to 25mV beyond 6-sigma for typical bitcells in 28nm. Model-based tuning of subthreshold SRAM bitcells is investigated for design tradeoff between leakage, area and stability. When targeting a specific SNM constraint, we show that an optimal standby voltage exists which offers minimum bitcell leakage power – any deviation above or below increases the power consumption. When targeting a specific standby voltage, our design flow identifies bitcell instances of 12x less leakage power or 3x reductions in area as compared to the minimum-length design.

Comparing voltage adaptation performance between replica and in-situ timing monitors

  • Yutaka Masuda
  • Jun Nagayama
  • Hirotaka Takeno
  • Yoshimasa Ogawa
  • Yoichi Momiyama
  • Masanori Hashimoto

Adaptive voltage scaling (AVS) is a promising approach to overcome manufacturing variability, dynamic environmental fluctuation, and aging. This paper focuses on timing sensors necessary for AVS implementation and compares in-situ timing error predictive FF (TEP-FF) and critical path replica in terms of how much voltage margin can be reduced. For estimating the theoretical bound of ideal AVS, this work proposes linear programming based minimum supply voltage analysis and discusses the voltage adaptation performance quantitatively by investigating the gap between the lower bound and actual supply voltages. Experimental results show that TEP-FF based AVS and replica based AVS achieve up to 13.3% and 8.9% supply voltage reduction, respectively, while satisfying the target MTTF. AVS with TEP-FF tracks the theoretical bound with 2.5 to 5.6% voltage margin while AVS with replica needs 7.2 to 9.9% margin.

Strain-aware performance evaluation and correction for OTFT-based flexible displays

  • Tengtao Li
  • Sachin S. Sapatnekar

Organic thin-film transistors (OTFTs) are widely used in flexible circuits, such as flexible displays, sensor arrays, and radio frequency identification cards (RFIDs), because these technologies offer features such as better flexibility, lower cost, and easy manufacturability using a low-temperature fabrication process. This paper develops a procedure that evaluates the performance of flexible displays. By their very nature, flexible displays experience significant mechanical strain/stress in the field because of the deformation caused during daily use. These deformations can impact device and circuit performance, potentially causing a loss in functionality. This paper first models the effects of extrinsic strain due to two fundamental deformation modes, bending and twisting. Next, this strain is translated to variations in device mobility, after which analytical models for error analysis in the flexible display are derived based on the rendered image values in each pixel of the display. Finally, two error correction approaches for flexible displays are proposed, based on voltage compensation and flexible clocking.

Achieving fast sanitization with zero live data copy for MLC flash memory

  • Ping-Hsien Lin
  • Yu-Ming Chang
  • Yung-Chun Li
  • Wei-Chen Wang
  • Chien-Chung Ho
  • Yuan-Hao Chang

As data security has become the major concern in modern storage systems with low-cost multi-level-cell (MLC) flash memories, it is not trivial to realize data sanitization in such a system. Even though some existing works employ the encryption or the built-in erase to achieve this requirement, they still suffer the risk of being deciphered or the issue of performance degradation. In contrast to the existing work, a fast sanitization scheme is proposed to provide the highest degree of security for data sanitization; that is, every old version of data could be immediately sanitized with zero live-data-copy overhead once the new version of data is created/written. In particular, this scheme further considers the reliability issue of MLC flash memories; the proposed scheme includes a one-shot sanitization design to minimize the disturbance during data sanitization. The feasibility and the capability of the proposed scheme were evaluated through extensive experiments based on real flash chips. The results demonstrate that this scheme can achieve the data sanitization with zero live-data-copy, where performance overhead is less than 1%.

Architecting data placement in SSDs for efficient secure deletion implementation

  • Hoda Aghaei Khouzani
  • Chen Liu
  • Chengmo Yang

Secure deletion ensures user privacy by permanently removing invalid data from the secondary storage. This process is particularly critical to solid state drives (SSDs) wherein invalid data are generated not only upon deleting a file but also upon updating a file of which the user is not aware. While previous secure deletion schemes are usually applied to all invalid data on the SSD, our observation is that in many cases security is not required for all files on the SSD. This paper proposes an efficient secure deletion scheme targeting only the invalid data of files marked as “secure” by the user. A security-aware data allocation strategy is designed, which separates secure and unsecure data at lower (block) level but mixes them at higher levels of SSD hierarchical organization. Block-level separation minimizes secure deletion cost, while higher-level mixing mitigates the adverse impact of secure deletion on SSD lifetime. A two-level block management scheme is further developed to scatter secure blocks over the SSD for wear leveling. Experiments on real-world benchmarks confirm the advantage of the proposed scheme in reducing secure deletion cost and improving SSD lifetime.

AxBA: an approximate bus architecture framework

  • Jacob R. Stevens
  • Ashish Ranjan
  • Anand Raghunathan

The exponential growth in creation and consumption of various forms of digital data has led to the emergence of new application workloads such as machine learning, data analytics and search. These workloads process large amounts of data and hence pose increased demands on the on-chip and off-chip interconnects of modern computing systems. Therefore, techniques that can improve the energy-efficiency and performance of interconnects are becoming increasingly important.

Security: the dark side of approximate computing?

  • Francesco Regazzoni
  • Cesare Alippi
  • Ilia Polian

Approximate computing promises significant advantages over more traditional computing architectures with respect to circuit area, performance, power efficiency, flexibility, and cost. Its use is suitable in applications where limited and controlled inaccuracies are tolerable or uncertainty is intrinsic in the inputs or their processing, as happens, e.g., in (deep) machine learning, image and signal processing. This paper discusses a dimension of approximate computing that has been neglected so far, even though it is nowadays a major asset: security. A number of hardware-related security threats are considered, and the implications of approximate circuits or systems designed to address these threats are discussed.

Security aspects of neuromorphic MPSoCs

  • Johanna Sepulveda
  • Cezar Reinbrecht
  • Jean-Philippe Diguet

Neural networks and deep learning are promising techniques for bringing brain-inspired computing into embedded platforms. They pave the way to new kinds of associative memories, classifiers, data-mining, machine learning or search engines, which can be the basis of critical and sensitive applications such as autonomous driving. Emerging non-volatile memory technologies integrated in the so-called Multi-Processor System-on-Chip (MPSoC) architectures enable the realization of such computational paradigms. These architectures take advantage of the Network-on-Chip concept to efficiently carry out communications with dedicated distributed memories and processing elements. However, current MPSoC-based neuromorphic architectures are deployed without taking security into account. The growing complexity and the hyper-sharing of hardware resources of MPSoCs may become a threat, thus increasing the risk of malware infections and Trojans introduced at design time. Specifically, MPSoC microarchitectural side-channels and fault injection attacks can be exploited to leak sensitive information and to cause malfunctions. In this work we present three contributions to that issue: i) first analysis of security issues in MPSoC-based neuromorphic architectures; ii) discussion of the threat model of the neuromorphic architectures; iii) demonstration of the correlation between SNN input and the neural computation.

Vulnerability-tolerant secure architectures

  • Todd Austin
  • Valeria Bertacco
  • Baris Kasikci
  • Sharad Malik
  • Mohit Tiwari

Today, secure systems are built by identifying potential vulnerabilities and then adding protections to thwart the associated attacks. Unfortunately, the complexity of today’s systems makes it impossible to prove that all attacks are stopped, so clever attackers find a way around even the most carefully designed protections. In this article, we take a sobering look at the state of secure system design, and ask ourselves why the “security arms race” never ends. The answer lies in our inability to develop adequate security verification technologies. We then examine an advanced defensive system in nature – the human immune system – and we discover that it does not remove vulnerabilities; rather, it adds offensive measures to protect the body when its vulnerabilities are penetrated. We close the article with brief speculation on how the human immune system could inspire more capable secure system designs.

Machine learning for performance and power modeling of heterogeneous systems

  • Joseph L. Greathouse
  • Gabriel H. Loh

Modern processing systems with heterogeneous components (e.g., CPUs, GPUs) have numerous configuration and design options such as the number and types of cores, frequency, and memory bandwidth. Hardware architects must perform design space explorations in order to accurately target markets of interest under tight time-to-market constraints. This need highlights the importance of rapid performance and power estimation mechanisms.

This work describes the use of machine learning (ML) techniques within a methodology for estimating the performance and power of heterogeneous systems. In particular, we measure the power and performance of a large collection of test applications running on real hardware across numerous hardware configurations. We use these measurements to train an ML model; the model learns how the applications scale with the system’s key design parameters.

Later, new applications of interest are executed on a single configuration, and we gather hardware performance counter values which describe how the application used the hardware. These values are fed into our ML model’s inference algorithm, which quickly identifies how this application will scale across various design points. In this way, we can rapidly predict the performance and power of the new application across a wide range of system configurations.

Once the initial run of the program is complete, our ML algorithm can predict the application’s performance and power at many hardware points faster than running it at each of those points and with a level of accuracy comparable to cycle-level simulators.

Machine learning for design space exploration and optimization of manycore systems

  • Ryan Gary Kim
  • Janardhan Rao Doppa
  • Partha Pratim Pande

In the emerging data-driven science paradigm, computing systems ranging from IoT and mobile to manycores and datacenters play distinct roles. These systems need to be optimized for the objectives and constraints dictated by the needs of the application. In this paper, we describe how machine learning techniques can be leveraged to improve the computational-efficiency of hardware design optimization. This includes generic methodologies that are applicable for any hardware design space. As an example, we discuss a guided design space exploration framework to accelerate application-specific manycore systems design and advanced imitation learning techniques to improve on-chip resource management. We present some experimental results for application-specific manycore system design optimization and dynamic power management to demonstrate the efficacy of these methods over traditional EDA approaches.

Failure prediction based on anomaly detection for complex core routers

  • Shi Jin
  • Zhaobo Zhang
  • Krishnendu Chakrabarty
  • Xinli Gu

Data-driven prognostic health management is essential to ensure high reliability and rapid error recovery in commercial core router systems. The effectiveness of prognostic health management depends on whether failures can be accurately predicted with sufficient lead time. This paper describes how time-series analysis and machine-learning techniques can be used to detect anomalies and predict failures in complex core router systems. First, both a feature-categorization-based hybrid method and a changepoint-based method have been developed to detect anomalies in time-varying features with different statistical characteristics. Next, an SVM-based failure predictor is developed to predict both categories and lead time of system failures from collected anomalies. A comprehensive set of experimental results is presented for data collected during 30 days of field operation from over 20 core routers deployed by customers of a major telecom company.

Invocation-driven neural approximate computing with a multiclass-classifier and multiple approximators

  • Haiyue Song
  • Chengwen Xu
  • Qiang Xu
  • Zhuoran Song
  • Naifeng Jing
  • Xiaoyao Liang
  • Li Jiang

Neural approximate computing gains enormous energy-efficiency at the cost of tolerable quality-loss. A neural approximator can map the input data to output while a classifier determines whether the input data are safe to approximate with quality guarantee. However, existing works cannot maximize the invocation of the approximator, resulting in limited speedup and energy saving. By exploring the mapping space of those target functions, in this paper, we observe a nonuniform distribution of the approximation error incurred by the same approximator. We thus propose a novel approximate computing architecture with a Multiclass-Classifier and Multiple Approximators (MCMA). These approximators have identical network topologies, and thus can share the same hardware resource in a neural processing unit (NPU) chip. At runtime, MCMA can swap in the invoked approximator by merely shipping the synapse weights from the on-chip memory to the buffers near MAC within a cycle. We also propose efficient co-training methods for such MCMA architecture. Experimental results show a more substantial invocation of MCMA as well as the gain of energy-efficiency.

Deterministic methods for stochastic computing using low-discrepancy sequences

  • M. Hassan Najafi
  • David J. Lilja
  • Marc Riedel

Recently, deterministic approaches to stochastic computing (SC) have been proposed. These compute with the same constructs as stochastic computing but operate on deterministic bit streams. These approaches reduce the area, greatly reduce the latency (by an exponential factor), and produce completely accurate results. However, these methods do not scale well. Also, they lack the property of progressive precision enjoyed by SC. As a result, these deterministic approaches are not competitive for applications where some degree of inaccuracy can be tolerated. In this work we introduce two fast-converging, scalable deterministic approaches to SC based on low-discrepancy sequences. The results are completely accurate when running the operations for the required number of cycles. However, the computation can be truncated early if some inaccuracy is acceptable. Experimental results show that the proposed approaches significantly improve both the processing time and area-delay product compared to prior approaches.
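As a rough illustration of the idea in this abstract, the sketch below multiplies two values by ANDing deterministic bit streams generated from low-discrepancy sequences. It assumes a Halton-style pairing of van der Corput sequences in bases 2 and 3, which is not necessarily the construction the authors use; the names `van_der_corput`, `bit_stream`, and `sc_multiply` are hypothetical.

```python
def van_der_corput(i: int, base: int) -> float:
    """Return the i-th element of the van der Corput sequence in `base`."""
    value, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        value += digit / denom
    return value


def bit_stream(x: float, base: int, length: int) -> list[int]:
    """Encode x in [0, 1] as a bit stream: bit i is 1 when the i-th
    sequence element falls below x (comparator-style generation)."""
    return [1 if van_der_corput(i, base) < x else 0 for i in range(length)]


def sc_multiply(x: float, y: float, length: int) -> float:
    """Estimate x * y by ANDing two streams built from different bases.

    Assumption: using coprime bases (2 and 3) stands in for the
    independence that random streams would otherwise provide.
    """
    a = bit_stream(x, 2, length)
    b = bit_stream(y, 3, length)
    return sum(p & q for p, q in zip(a, b)) / length


if __name__ == "__main__":
    # Full-length run versus an early-truncated run of the same product.
    print(sc_multiply(0.5, 1 / 3, 36))
    print(sc_multiply(0.5, 1 / 3, 10))
```

Stopping the stream early yields a coarser estimate of the same product, which is the progressive-precision behaviour the abstract refers to.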

Design space exploration of multi-output logic function approximations

  • Jorge Echavarria
  • Stefan Wildermann
  • Jürgen Teich

Approximate Computing has emerged as a design paradigm that decreases hardware costs by reducing the accuracy of the computation for applications that are robust against such errors. In Boolean logic approximation, the number of terms and literals of a logic function can be reduced by allowing erroneous outputs for some input combinations. This paper proposes a novel methodology for the approximation of multi-output logic functions. Related work on multi-output logic approximation minimizes each output function separately. In this paper, we show that a huge optimization potential is thereby lost. As a remedy, our methodology considers the effect on all output functions when introducing errors, thus exploiting the cross-function minimization potential. Moreover, our approach is integrated into a design space exploration technique to obtain not only a single solution but a Pareto-set of designs with different trade-offs between hardware costs (terms and literals) and error (number of minterms that have been falsified). Experimental results show our technique is very efficient in exploring Pareto-optimal fronts. For some benchmarks, the number of terms could be reduced from an accurate function implementation by up to 15% and literals by up to 19% with degrees of inaccuracy around 0.1% w.r.t. accurate designs. Moreover, we show that the Pareto-fronts obtained by our methodology dominate the results obtained when applying related work.
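The error metric mentioned in this abstract (number of falsified minterms) can be illustrated with a small sketch. It assumes that an input combination counts once if any output differs from the exact function, which may not match the authors' precise definition; `falsified_minterms`, `exact`, and `approx` are hypothetical names and the two-output function is an invented toy example.

```python
from itertools import product


def falsified_minterms(exact, approx, num_inputs: int) -> int:
    """Count input combinations on which the approximate multi-output
    function disagrees with the exact one in at least one output."""
    return sum(
        1
        for bits in product((0, 1), repeat=num_inputs)
        if exact(*bits) != approx(*bits)
    )


def exact(a: int, b: int, c: int) -> tuple[int, int]:
    # Toy two-output function: f0 = ab + c, f1 = ab + ac.
    return ((a & b) | c, (a & b) | (a & c))


def approx(a: int, b: int, c: int) -> tuple[int, int]:
    # Approximation that drops the term ac from f1, saving one term
    # and two literals while sharing the term ab across both outputs.
    return ((a & b) | c, a & b)


if __name__ == "__main__":
    errors = falsified_minterms(exact, approx, 3)
    print(f"{errors} of {2 ** 3} input combinations falsified")
```

Each candidate approximation evaluated this way gives one (cost, error) point; a design space exploration would keep only the points not dominated in both dimensions.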

3DICT: a reliable and QoS capable mobile process-in-memory architecture for lookup-based CNNs in 3D XPoint ReRAMs

  • Qian Lou
  • Wujie Wen
  • Lei Jiang

It is extremely challenging to deploy computing-intensive convolutional neural networks (CNNs) with rich parameters in mobile devices because of their limited computing resources and low power budgets. Although prior works build fast and energy-efficient CNN accelerators by greatly sacrificing test accuracy, mobile devices have to guarantee high CNN test accuracy for critical applications, e.g., unlocking phones by face recognition. In this paper, we propose a 3D XPoint ReRAM-based process-in-memory architecture, 3DICT, to provide various test accuracies to applications with different priorities by lookup-based CNN tests that dynamically exploit the trade-off between test accuracy and latency. Compared to the state-of-the-art accelerators, on average, 3DICT improves the CNN test performance per Watt by 13% ~ 61X and guarantees 9-year endurance under various CNN test accuracy requirements.

Aliens: a novel hybrid architecture for resistive random-access memory

  • Bing Wu
  • Dan Feng
  • Wei Tong
  • Jingning Liu
  • Shuai Li
  • Mingshun Yang
  • Chengning Wang
  • Yang Zhang

Passive crossbar arrays of resistive random-access memory (RRAM) have shown great potential to meet the demands of future memory. By eliminating the per-cell transistor, the crossbar array achieves a higher memory density but introduces sneak currents, which incur extra energy waste and reliability issues. The complementary resistive switch (CRS), consisting of two anti-serially stacked memristors, is considered a promising solution to the sneak current problem. However, the destructive read of the CRS results in an additional recovery write operation, which strongly restricts its wider adoption. Exploiting the dual CRS/memristor mode of CRS devices, we propose Aliens, a novel hybrid architecture for resistive random-access memory which introduces one alien cell (memristor mode) for each wordline in the crossbar to provide a practical hybrid memory without the operating system’s intervention. Aliens draws advantages from both modes: restrained sneak current of CRS mode and non-destructive read of memristor mode. The simple and regular cell mode organization of Aliens enables an energy-saving read method and an effective mode switching strategy called Lazy-Switch. By exploiting memory access locality, Lazy-Switch delays and merges the recovery write operations of the CRS mode. Due to fewer recovery write operations and negligible sneak currents, Aliens achieves improvement in energy, overall endurance, and access performance. Experimental results show that our design offers average energy savings of 13.9X compared with memristor-only memory, a memory lifetime 5.3X longer than CRS-only memory, and a competitive performance compared with memristor-only memory.

FELIX: fast and energy-efficient logic in memory

  • Saransh Gupta
  • Mohsen Imani
  • Tajana Rosing

The Internet of Things (IoT) has led to the emergence of big data. Processing this amount of data poses a challenge for current computing systems. Processing in-memory (PIM) enables in-place computation which reduces data movement, a major latency bottleneck in conventional systems. In this paper, we propose an in-memory implementation of fast and energy-efficient logic (FELIX) which combines the functionality of PIM with memories. To the best of the authors’ knowledge, FELIX is the first PIM logic to enable single-cycle NOR, NOT, NAND, minority, and OR directly in crossbar memory. We exploit voltage threshold-based memristors to enable single-cycle operations. It is a purely in-memory execution which neither reads out data nor changes sense amplifiers, while preserving data in-memory. We extend these single-cycle operations to implement more complex functions like XOR and addition in memory with 2X lower latency than the fastest published PIM technique. We also increase the amount of in-memory parallelism in our design by segmenting bitlines using switches. To evaluate the efficiency of our design at the system level, we design a FELIX-based HyperDimensional (HD) computing accelerator. Our evaluation shows that for all applications tested using HD, FELIX provides on average 128.8X speedup and 5,589.3X lower energy consumption as compared to an AMD GPU. FELIX HD also achieves on average 2.21X higher energy efficiency, 1.86X speedup, and 1.68X less memory as compared to the fastest PIM technique.
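To make the composition step concrete, the following software sketch builds XOR and a one-bit full adder from the primitive gates the abstract lists (NOR, NOT, NAND, minority). It is only a functional model under assumed decompositions: it is not necessarily the gate sequence FELIX uses, and it does not model memristor behaviour or cycle counts. All function names are hypothetical.

```python
from itertools import product


def nor(a: int, b: int) -> int:
    return 1 - (a | b)


def nand(a: int, b: int) -> int:
    return 1 - (a & b)


def not_(a: int) -> int:
    return 1 - a


def minority(a: int, b: int, c: int) -> int:
    """1 when at most one of the three inputs is 1."""
    return 1 if a + b + c <= 1 else 0


def xor(a: int, b: int) -> int:
    # XOR = AND(OR, NAND) = NOR(NOT(OR), NOT(NAND)), and NOT(OR) is NOR.
    return nor(nor(a, b), not_(nand(a, b)))


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, carry); the carry is the majority, i.e. NOT(minority)."""
    total = xor(xor(a, b), cin)
    carry = not_(minority(a, b, cin))
    return total, carry


if __name__ == "__main__":
    for a, b, cin in product((0, 1), repeat=3):
        print(a, b, cin, full_adder(a, b, cin))
```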

DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs

  • Xiaofan Zhang
  • Junsong Wang
  • Chao Zhu
  • Yonghua Lin
  • Jinjun Xiong
  • Wen-mei Hwu
  • Deming Chen

Building a high-performance FPGA accelerator for Deep Neural Networks (DNNs) often requires RTL programming, hardware verification, and precise resource allocation, all of which can be time-consuming and challenging to perform even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder for building high-performance DNN hardware accelerators on FPGAs automatically. Novel techniques are developed to meet the throughput and latency requirements for both cloud and edge devices. A number of novel techniques including high-quality RTL neural network components, a fine-grained layer-based pipeline architecture, and a column-based cache scheme are developed to boost throughput, reduce latency, and save FPGA on-chip memory. To address the limited resource challenge, we design an automatic design space exploration tool to generate optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity. DNNBuilder is demonstrated on four DNNs (Alexnet, ZF, VGG16, and YOLO) on two FPGAs (XC7Z045 and KU115) corresponding to the edge- and cloud-computing, respectively. The fine-grained layer-based pipeline architecture and the column-based cache scheme contribute to 7.7x and 43x reduction of the latency and BRAM utilization compared to conventional designs. We achieve the best performance (up to 5.15x faster) and efficiency (up to 5.88x more efficient) compared to published FPGA-based classification-oriented DNN accelerators for both edge and cloud computing cases. We reach 4218 GOPS for running object detection DNN which is the highest throughput reported to the best of our knowledge. DNNBuilder can provide millisecond-scale real-time performance for processing HD video input and deliver higher efficiency (up to 4.35x) than the GPU-based solutions.

Algorithm-hardware co-design of single shot detector for fast object detection on FPGAs

  • Yufei Ma
  • Tu Zheng
  • Yu Cao
  • Sarma Vrudhula
  • Jae-sun Seo

The rapid improvement in computation capability has made convolutional neural networks (CNNs) a great success in recent years on image classification tasks, which has also spurred the development of object detection algorithms with significantly improved accuracy. However, during the deployment phase, many applications demand low-latency processing of one image under strict power consumption requirements, which reduces the efficiency of GPUs and other general-purpose platforms, bringing opportunities for specific acceleration hardware, e.g. FPGA, by customizing the digital circuit specific for the inference algorithm. Therefore, this work proposes to customize the detection algorithm, e.g. SSD, to benefit its hardware implementation with low data precision at the cost of marginal accuracy degradation. The proposed FPGA-based deep learning inference accelerator is demonstrated on two Intel FPGAs for the SSD algorithm, achieving up to 2.18 TOPS throughput and up to 3.3X superior energy-efficiency compared to a GPU.

TGPA: tile-grained pipeline architecture for low latency CNN inference

  • Xuechao Wei
  • Yun Liang
  • Xiuhong Li
  • Cody Hao Yu
  • Peng Zhang
  • Jason Cong

FPGAs have been increasingly used as reconfigurable hardware accelerators for applications leveraging convolutional neural networks (CNNs) in recent years. Previous designs normally adopt a uniform accelerator architecture that processes all layers of a given CNN model one after another. This homogeneous design methodology usually suffers from dynamic resource underutilization due to the tensor shape diversity of different layers. As a result, designs equipped with heterogeneous accelerators specific for different layers were proposed to resolve this issue. However, existing heterogeneous designs sacrifice latency for throughput by concurrent execution of multiple input images on different accelerators. In this paper, we propose an architecture named Tile-Grained Pipeline Architecture (TGPA) for low latency CNN inference. TGPA adopts a heterogeneous design which supports pipelined execution of multiple tiles within a single input image on multiple heterogeneous accelerators. The accelerators are partitioned onto different FPGA dies to guarantee high frequency. A partition strategy is designed to maximize on-chip resource utilization. Experiment results show that TGPA designs for different CNN models achieve up to 40% performance improvement over homogeneous designs, and 3X latency reduction over state-of-the-art designs.

Customized locking of IP blocks on a multi-million-gate SoC

  • Abhrajit Sengupta
  • Mohammed Nabeel
  • Mohammed Ashraf
  • Ozgur Sinanoglu

Reliance on off-site untrusted fabrication facilities has given rise to several threats such as intellectual property (IP) piracy, overbuilding and hardware Trojans. Logic locking is a promising defense technique against such malicious activities that is effected at the silicon layer. Over the past decade, several logic locking defenses and attacks have been presented, thereby, enhancing the state-of-the-art. Nevertheless, there has been little research aiming to demonstrate the applicability of logic locking with large-scale multi-million-gate industrial designs consisting of multiple IP blocks with different security requirements. In this work, we take on this challenge to successfully lock a multi-million-gate system-on-chip (SoC) provided by DARPA by taking it all the way to GDSII layout. We analyze how specific features, constraints, and security requirements of an IP block can be leveraged to lock its functionality in the most appropriate way. We show that the blocks of an SoC can be locked in a customized manner at 0.5%, 15.3%, and 1.5% chip-level overhead in power, performance, and area, respectively.

Dynamic resource management for heterogeneous many-cores

  • Jörg Henkel
  • Jürgen Teich
  • Stefan Wildermann
  • Hussam Amrouch

With the advent of many-core systems, use cases of embedded systems have become more dynamic: Plenty of applications are concurrently executed, but may dynamically be exchanged and modified even after deployment. Moreover, resources may temporarily or permanently become unavailable because of thermal aspects, dynamic power management, or the occurrence of faults. This poses new challenges for reaching objectives like timeliness for real-time or performance for best-effort program execution and maximizing system utilization. In this work, we first focus on dynamic management schemes for reliability/aging optimization under thermal constraints. The reliability of on-chip systems in the current and upcoming technology nodes is continuously degrading with every new generation because transistor scaling is approaching its fundamental limits. Protecting systems against degradation effects such as circuits’ aging comes with considerable losses in efficiency. We demonstrate in this work why sustaining reliability while maximizing the utilization of available resources and hence avoiding efficiency loss is quite challenging – this holds even more when thermal constraints come into play. Then, we discuss techniques for run-time management of multiple applications which sustain real-time properties. Our solution relies on hybrid application mapping denoting the combination of design-time analysis with run-time application mapping. We present a method for Real-time Mapping Reconfiguration (RMR) which enables the Run-Time Manager (RM) to execute real-time applications even in the presence of dynamic thermal- and reliability-aware resource management.

This paper is part of the ICCAD 2018 Special Session on “Managing Heterogeneous Many-cores for High-Performance and Energy-Efficiency”. The other two papers of this Special Session are [1] and [2].

Online learning for adaptive optimization of heterogeneous SoCs

  • Ganapati Bhat
  • Sumit K. Mandal
  • Ujjwal Gupta
  • Umit Y. Ogras

Energy efficiency and performance of heterogeneous multiprocessor systems-on-chip (SoC) depend critically on utilizing a diverse set of processing elements and managing their power states dynamically. Dynamic resource management techniques typically rely on power consumption and performance models to assess the impact of dynamic decisions. Despite the importance of these decisions, many existing approaches rely on fixed power and performance models learned offline. This paper presents an online learning framework to construct adaptive analytical models. We illustrate this framework for modeling GPU frame processing time, GPU power consumption and SoC power-temperature dynamics. Experiments on Intel Atom E3826, Qualcomm Snapdragon 810, and Samsung Exynos 5422 SoCs demonstrate that the proposed approach achieves less than 6% error under dynamically varying workloads.

Hybrid on-chip communication architectures for heterogeneous manycore systems

  • Biresh Kumar Joardar
  • Janardhan Rao Doppa
  • Partha Pratim Pande
  • Diana Marculescu
  • Radu Marculescu

The widespread adoption of big data has led to the search for high-performance and low-power computational platforms. Emerging heterogeneous manycore processing platforms consisting of CPU and GPU cores along with various types of accelerators offer power and area-efficient trade-offs for running these applications. However, heterogeneous manycore architectures need to satisfy the communication and memory requirements of the diverse computing elements that conventional Network-on-Chip (NoC) architectures are unable to handle effectively. Further, with increasing system sizes and level of heterogeneity, it becomes difficult to quickly explore the large design space and establish the appropriate design trade-offs. To address these challenges, machine learning-inspired heterogeneous manycore system design is a promising research direction to pursue. In this paper, we highlight various salient features of heterogeneous manycore architectures enabled by emerging interconnect technologies and machine learning techniques.

A practical detailed placement algorithm under multi-cell spacing constraints

  • Yu-Hsiang Cheng
  • Ding-Wei Huang
  • Wai-Kei Mak
  • Ting-Chi Wang

Multi-cell spacing constraints arise due to aggressive scaling and manufacturing issues. For example, we can incorporate multi-cell spacing constraints due to the pin accessibility problem in sub-10nm nodes. This work studies detailed placement considering multi-cell spacing constraints. A naive approach is to model each multi-cell spacing constraint as a set of 2-cell spacing constraints, but the resulting total cell displacement would be much larger than necessary. Thus, we aim to tackle this problem and propose a practical multi-cell method by first analyzing the initial layout to determine which cell pair in each multi-cell spacing constraint is the easiest to break apart. Secondly, we apply a single-row dynamic programming (SRDP)-based method one row at a time, called Intra-Row Move (IRM), to resolve a majority of violations while minimizing the total cell displacement or wirelength increase. With cell virtualization and movable region computation techniques, our IRM can be easily extended to handle mixed cell-height designs with only a slight modification of the cost computation in the SRDP method. Finally, we apply an integer linear programming-based method called Global Move (GM) to resolve the remaining violations. Experimental results indicate that our multi-cell method is much better than a 2-cell method both in solution quality and runtime.

Mixed-cell-height placement considering drain-to-drain abutment

  • Yu-Wei Tseng
  • Yao-Wen Chang

Along with device scaling, the drain-to-drain abutment (DDA) constraint arises as an emerging challenge in modern circuit designs, which incurs additional difficulties especially for designs with mixed-cell-height standard cells which have prevailed in advanced technology. This paper presents the first work to address the mixed-cell-height placement problem considering the DDA constraint from post global placement throughout detailed placement. Our algorithm consists of three major stages: (1) DDA-aware preprocessing, (2) legalization, and (3) detailed placement. In the DDA-aware preprocessing stage, we first align cells to desired rows, considering the distribution ratio of source nodes to drain nodes. After deciding the cell ordering of every row, we adopt the modulus-based matrix splitting iteration method to remove all cell overlaps with minimum total displacement in the legalization stage. For detailed placement, we propose a satisfiability-based approach which considers the whole layout to flip a subset of cells and swap pairs of adjacent cells simultaneously. Compared with a shortest-path method, experimental results show that our proposed algorithm can significantly reduce cell violations and displacements with reasonable runtime.

Mixed-cell-height legalization considering technology and region constraints

  • Ziran Zhu
  • Xingquan Li
  • Yuhang Chen
  • Jianli Chen
  • Wenxing Zhu
  • Yao-Wen Chang

Mixed-cell-height circuits have become popular in advanced technologies for better power, area, routability, and performance trade-offs. With the technology and region constraints imposed by modern circuit designs, the mixed-cell-height legalization problem has become more challenging. In this paper, we present an effective and efficient legalization algorithm for mixed-cell-height circuit designs with technology and region constraints. We first present a fence region handling technique to unify the fence regions and the default ones. To obtain a desired cell assignment, we then propose a movement-aware cell reassignment method by iteratively reassigning cells in locally dense areas to their desired rows. After cell reassignment, a technology-aware legalization is presented to remove cell overlaps while satisfying the technology constraints. Finally, we propose a technology-aware refinement to further reduce the average and maximum cell movements without increasing the technology constraint violations. Compared with the champion of the 2017 ICCAD CAD Contest and the state-of-the-art work, experimental results show that our algorithm achieves the best average and maximum cell movements and significantly fewer technology constraint violations, in a comparable runtime.

Mixed-cell-height placement with complex minimum-implant-area constraints

  • Jianli Chen
  • Peng Yang
  • Xingquan Li
  • Wenxing Zhu
  • Yao-Wen Chang

Mixed-cell-height standard cells are widely used in advanced technologies to achieve better design trade-offs among timing, power, and routability. As feature size decreases, placement of cells with multiple threshold voltages may violate the complex minimum-implant-area (MIA) layer rule arising from the limitations of patterning technologies. Existing works consider the mixed-cell-height placement problem only during legalization, or handle the MIA constraints during detailed placement. In this paper, we address the mixed-cell-height placement problem with MIA constraints in two major stages: post global placement and MIA-aware legalization. In the post global placement stage, we first present a continuous and differentiable cost function to address the Vdd/Vss alignment constraints, and add weighted pseudo nets to MIA violation cells dynamically. Then, we propose a proximal optimization method based on the given global placement result to simultaneously consider Vdd/Vss alignment constraints, MIA constraints, cell distribution, cell displacement, and total wirelength. In the MIA-aware legalization stage, we develop a graph-based method to cluster cells of specific threshold voltages, and apply a strip-packing-based binary linear programming to reshape cells. Then, we propose a matching-based technique to resolve intra-row MIA violations and reduce filler insertion. Furthermore, we formulate inter-row MIA-aware legalization as a quadratic programming problem, which is efficiently solved by a modulus-based matrix splitting iteration method. Finally, MIA-aware cell allocation and refinement are performed to further improve the result. Experimental results show that, without any extra area overhead, our algorithm can still achieve 8.5% shorter final total wirelength than the state-of-the-art work.

RAPID: read acceleration for improved performance and endurance in MLC/TLC NVMs

  • Poovaiah M. Palangappa
  • Kartik Mohanram

RAPID is a low-overhead critical-word-first read acceleration architecture for improved performance and endurance in MLC/TLC non-volatile memories (NVMs). RAPID encodes the critical words in a cache line using only the most significant bits (MSbs) of the MLC/TLC NVM cells. Since the MSbs of an NVM cell can be decoded using a single read strobe, the data (i.e., critical words) encoded using the MSbs can be decoded with low latency. System-level SPEC CPU2006 workload evaluations of a TLC RRAM architecture show that RAPID improves read latency by 21%, energy by 24%, and endurance by 2-4x over state-of-the-art striped NVM.

Sneak path free reconfiguration of via-switch crossbars based FPGA

  • Ryutaro Doi
  • Jaehoon Yu
  • Masanori Hashimoto

FPGAs that utilize via-switches, which are a kind of nonvolatile resistive RAM, for crossbar implementation are attracting attention due to higher integration density and performance. However, programming via-switches arbitrarily in a crossbar is not trivial since a programming current must be provided through signal wires that are shared by multiple via-switches. Consequently, depending on the previous programming status in sequential programming, unintentional switch programming may occur due to signal detour, which is called the sneak path problem. This problem interferes with the reconfiguration of via-switch FPGAs, and hence countermeasures for the sneak path problem are indispensable. This paper identifies the circuit status that causes the sneak path problem and proposes a sneak path avoidance method that gives a sneak path free programming order of via-switches in a crossbar. We prove that a sneak path free programming order necessarily exists for arbitrary on-off patterns in a crossbar as long as no loops exist, and also validate the proof and the proposed method with simulation-based evaluation. Thanks to the proposed method, any practical configuration of a via-switch FPGA can be successfully programmed without the sneak path problem.

Mixed size crossbar based RRAM CNN accelerator with overlapped mapping method

  • Zhenhua Zhu
  • Jilan Lin
  • Ming Cheng
  • Lixue Xia
  • Hanbo Sun
  • Xiaoming Chen
  • Yu Wang
  • Huazhong Yang

Convolutional Neural Networks (CNNs) play a vital role in machine learning. CNNs are typically both computing and memory intensive. Emerging resistive random-access memories (RRAMs) and RRAM crossbars have demonstrated great potential in boosting the performance and energy efficiency of CNNs. Compared with small crossbars, large crossbars show better energy efficiency with less interface overhead. However, conventional workload mapping methods for small crossbars cannot make full use of the computation ability of large crossbars. In this paper, we propose an Overlapped Mapping Method (OMM) and MIxed Size Crossbar based RRAM CNN Accelerator (MISCA) to solve this problem. MISCA with OMM can reduce the energy consumption caused by the interface circuits, and improve the parallelism of computation by leveraging the idle RRAM cells in crossbars. The simulation results show that MISCA with OMM can achieve 2.7x speedup, 30% utilization rate improvement, and 1.2x energy efficiency improvement on average compared with a fixed-size crossbar-based accelerator using the conventional mapping method. In comparison with a GPU platform, MISCA with OMM achieves 490.4x higher energy efficiency and 20x higher speedup on average. Compared with PRIME, an existing RRAM-based accelerator, MISCA has 26.4x speedup and 1.65x energy efficiency improvement.

Enhancing the solution quality of hardware ising-model solver via parallel tempering

  • Hidenori Gyoten
  • Masayuki Hiromoto
  • Takashi Sato

We propose an efficient Ising processor with approximated parallel tempering (IPAPT) implemented on an FPGA. Hardware-friendly approximations of the components of parallel tempering (PT) are proposed to enhance solution quality with low hardware overhead. Multiple replicas of Ising states having different temperatures run in parallel by sharing a single network structure, and the replicas are exchanged based on the approximated energy evaluation. The application of PT substantially improves the quality of optimization solutions. The experimental results on various max-cut problems show that utilization of PT significantly increases the probability of obtaining optimal solutions, and IPAPT obtains optimal solutions two orders of magnitude faster than a software solver.
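
As a point of reference for the replica exchange mentioned above, the following is a minimal sketch of the textbook (exact) parallel tempering swap rule, not the hardware-friendly approximation proposed in the paper. The function names and the list-based replica layout are assumptions made for illustration only.

```python
import math
import random


def swap_probability(beta_i, beta_j, energy_i, energy_j):
    """Standard Metropolis acceptance probability for exchanging two replicas.

    beta_* are inverse temperatures, energy_* are the current Ising energies.
    This is the exact textbook rule, not the paper's approximated evaluation.
    """
    exponent = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)


def maybe_swap(states, energies, betas, i, j, rng=random):
    """Exchange the states of replicas i and j with the probability above."""
    p = swap_probability(betas[i], betas[j], energies[i], energies[j])
    if rng.random() < p:
        states[i], states[j] = states[j], states[i]
        energies[i], energies[j] = energies[j], energies[i]
        return True
    return False
```

With this rule, a swap that hands the lower-energy state to the colder replica is always accepted, while the opposite swap is accepted only with an exponentially small probability.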

Defensive dropout for hardening deep neural networks under adversarial attacks

  • Siyue Wang
  • Xiao Wang
  • Pu Zhao
  • Wujie Wen
  • David Kaeli
  • Peter Chin
  • Xue Lin

Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. That is, adversarial examples, obtained by adding delicately crafted distortions onto original legal inputs, can mislead a DNN to classify them as any target labels. This work provides a solution to hardening DNNs under adversarial attacks through defensive dropout. Besides using dropout during training for the best test accuracy, we propose to use dropout also at test time to achieve strong defense effects. We consider the problem of building robust DNNs as an attacker-defender two-player game, where the attacker and the defender know each other’s strategies and try to optimize their own strategies towards an equilibrium. Based on the observations of the effect of test dropout rate on test accuracy and attack success rate, we propose a defensive dropout algorithm to determine an optimal test dropout rate given the neural network model and the attacker’s strategy for generating adversarial examples. We also investigate the mechanism behind the outstanding defense effects achieved by the proposed defensive dropout. Compared with stochastic activation pruning (SAP), another defense method that introduces randomness into the DNN model, we find that our defensive dropout achieves much larger variances of the gradients, which is the key to the improved defense effects (much lower attack success rate). For example, our defensive dropout can reduce the attack success rate from 100% to 13.89% under the currently strongest attack, i.e., the C&W attack, on the MNIST dataset.

Online human activity recognition using low-power wearable devices

  • Ganapati Bhat
  • Ranadeep Deb
  • Vatika Vardhan Chaurasia
  • Holly Shill
  • Umit Y. Ogras

Human activity recognition (HAR) has attracted significant research interest due to its applications in health monitoring and patient rehabilitation. Recent research on HAR focuses on using smartphones due to their widespread use. However, this leads to inconvenient use, limited choice of sensors and inefficient use of resources, since smartphones are not designed for HAR. This paper presents the first HAR framework that can perform both online training and inference. The proposed framework starts with a novel technique that generates features using the fast Fourier and discrete wavelet transforms of a textile-based stretch sensor and accelerometer data. Using these features, we design a neural network classifier which is trained online using the policy gradient algorithm. Experiments on a low power IoT device (TI-CC2650 MCU) with nine users show 97.7% accuracy in identifying six activities and their transitions with less than 12.5 mW power consumption.

Shadow attacks on MEDA biochips

  • Mohammed Shayan
  • Sukanta Bhattacharjee
  • Tung-Che Liang
  • Jack Tang
  • Krishnendu Chakrabarty
  • Ramesh Karri

The Micro-electrode-dot-array (MEDA) is a next-generation digital microfluidic biochip (DMFB) platform that supports fine-grained control and real-time sensing of droplet movements. These capabilities permit continuous monitoring and checkpoint-based validation of assay execution on MEDA. This paper presents a class of “shadow attacks” that abuse the timing slack in the assay execution. State-of-the-art checkpoint-based validation techniques cannot expose the shadow operations. We develop a defense that introduces extra checkpoints in the assay execution at time instances when the assay is prone to shadow attacks. Experiments confirm the effectiveness and practicality of the defense.

LeapChain: efficient blockchain verification for embedded IoT

  • Emanuel Regnath
  • Sebastian Steinhorst

Blockchain provides decentralized consensus in large, open networks without a trusted authority, making it a promising solution for the Internet of Things (IoT) to distribute verifiable data, such as firmware updates. However, verifying data integrity and consensus on a linearly growing blockchain quickly exceeds memory and processing capabilities of embedded systems.

As a remedy, we propose a generic blockchain extension that enables highly constrained devices to verify the inclusion and integrity of any block within a blockchain. Instead of traversing block by block, we construct a LeapChain that reduces verification steps without weakening the integrity guarantees of the blockchain. Applied to Proof-of-Work blockchains, our scheme can be used to verify consensus by proving a certain amount of work on top of a block.

Our analytical and experimental results show that, compared to existing approaches, only LeapChain provides deterministic and tight upper bounds on the memory requirements in the kilobyte range, significantly extending the possibilities of blockchain application on embedded IoT devices.

Robust object estimation using generative-discriminative inference for secure robotics applications

  • Yanqi Liu
  • Alessandro Costantini
  • R. Iris Bahar
  • Zhiqiang Sui
  • Zhefan Ye
  • Shiyang Lu
  • Odest Chadwicke Jenkins

Convolutional neural networks (CNNs) are increasingly widely used in robotics, especially for object recognition. However, such CNNs still lack several critical properties necessary for robots to properly perceive and function autonomously in uncertain, and potentially adversarial, environments. In this paper, we investigate factors for accurate, reliable, and resource-efficient object and pose recognition suitable for robotic manipulation in adversarial clutter. Our exploration is in the context of a three-stage pipeline of discriminative CNN-based recognition, generative probabilistic estimation, and robot manipulation. This pipeline proposes using a SAmpling Network Density filter, or SAND filter, to recover from potentially erroneous decisions produced by a CNN through generative probabilistic inference. We present experimental results from SAND filter perception for robotic manipulation in tabletop scenes with both benign and adversarial clutter. These experiments vary CNN model complexity for object recognition and evaluate levels of inaccuracy that can be recovered by generative pose inference. This scenario is extended to consider adversarial environmental modifications with varied lighting, occlusions, and surface modifications.

Efficient utilization of adversarial training towards robust machine learners and its analysis

  • Sai Manoj P D
  • Sairaj Amberkar
  • Setareh Rafatirad
  • Houman Homayoun

Advancements in machine learning have led to its adoption in numerous applications ranging from computer vision to security. Despite these advancements, the vulnerabilities of machine learning techniques are also being exploited. Adversarial samples are generated by adding crafted perturbations to normal input samples. This work presents an overview of different techniques for generating adversarial samples and of defenses that make classifiers robust. Furthermore, adversarial learning and its effective utilization to enhance robustness, along with the required constraints, are demonstrated experimentally, with up to 97.65% accuracy even against the CW attack. Although the effectiveness of adversarial learning is enhanced, this work shows that it can still be further exploited for vulnerabilities.

Majority logic synthesis

  • Luca Amarù
  • Eleonora Testa
  • Miguel Couceiro
  • Odysseas Zografos
  • Giovanni De Micheli
  • Mathias Soeken

The majority function <xyz> evaluates to true if at least two of its Boolean inputs evaluate to true. The majority function has frequently been studied as a central primitive in logic synthesis applications for many decades. Knuth refers to the majority function in the last volume of his seminal The Art of Computer Programming as “probably the most important ternary operation in the entire universe.” Majority logic synthesis has recently regained significant interest in the design automation community due to nanoemerging technologies which operate based on the majority function. In addition, majority logic synthesis has successfully been employed in CMOS-based applications such as standard cell or FPGA mapping.

This tutorial gives a broad introduction into the field of majority logic synthesis. It will review fundamental results and describe recent contributions from theory, practice, and applications.
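
To make the definition of the majority function concrete, here is a small sketch of the three-input majority on 0/1 values, together with two well-known identities: fixing one input to 0 gives AND, and fixing it to 1 gives OR. The helper name `maj` is chosen for illustration and does not come from the tutorial.

```python
from itertools import product


def maj(x, y, z):
    """Three-input majority on bits: 1 if at least two inputs are 1."""
    return (x & y) | (x & z) | (y & z)


# Exhaustively check the identities over all input combinations.
for x, y, z in product((0, 1), repeat=3):
    assert maj(x, y, z) == int(x + y + z >= 2)            # definition
    assert maj(1 - x, 1 - y, 1 - z) == 1 - maj(x, y, z)   # self-duality

for x, y in product((0, 1), repeat=2):
    assert maj(x, y, 0) == (x & y)   # AND as a special case
    assert maj(x, y, 1) == (x | y)   # OR as a special case
```

Together with inversion, these special cases show why majority can serve as a primitive for general logic networks.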

RouteNet: routability prediction for mixed-size designs using convolutional neural network

  • Zhiyao Xie
  • Yu-Hung Huang
  • Guan-Qi Fang
  • Haoxing Ren
  • Shao-Yun Fang
  • Yiran Chen
  • Nvidia Corporation

Early routability prediction helps designers and tools perform preventive measures so that design rule violations can be avoided in a proactive manner. However, it is a huge challenge to have a predictor that is both accurate and fast. In this work, we study how to leverage a convolutional neural network to address this challenge. The proposed method, called RouteNet, can either evaluate the overall routability of cell placement solutions without global routing or predict the locations of DRC (Design Rule Checking) hotspots. In both cases, large macros in mixed-size designs are taken into consideration. Experiments on benchmark circuits show that RouteNet can forecast overall routability with accuracy similar to that of a global router while using substantially less runtime. For DRC hotspot prediction, RouteNet improves accuracy by 50% compared to global routing. It also significantly outperforms other machine learning approaches such as support vector machine and logistic regression.

TritonRoute: an initial detailed router for advanced VLSI technologies

  • Andrew B. Kahng
  • Lutong Wang
  • Bangqi Xu

Detailed routing is a dead-or-alive critical element in design automation tooling for advanced node enablement. However, very few works address detailed routing in the recent open literature, particularly in the context of modern industrial designs and a complete, end-to-end flow. The ISPD-2018 Initial Detailed Routing Contest addressed this gap for modern industrial designs, using a reduced design rules set. In this work, we present TritonRoute, an initial detailed router for the ISPD-2018 contest. Given route guides from global routing, the initial detailed routing stage should generate a detailed routing solution honoring the route guides as much as possible, while minimizing wirelength, via count and various design rule violations. In our work, the key contribution is intra-layer parallel routing, where we partition each layer into parallel panels and route each panel using an Integer Linear Programming-based algorithm. We sequentially route layer by layer from the bottom to the top. We evaluate our router using the official ISPD-2018 benchmark suite and show that we reduce the contest metric by up to 74%, and on average 50%, compared to the first-place routing solution for each testcase.

A multithreaded initial detailed routing algorithm considering global routing guides

  • Fan-Keng Sun
  • Hao Chen
  • Ching-Yu Chen
  • Chen-Hao Hsu
  • Yao-Wen Chang

Detailed routing is the most complicated and time-consuming stage in VLSI design and has become a critical process for advanced node enablement. To handle the high complexity of modern detailed routing, initial detailed routing is often employed to minimize design-rule violations to facilitate final detailed routing, even though it is still not violation-free after initial routing. This paper presents a novel initial detailed routing algorithm to consider industrial design-rule constraints and optimize the total wirelength and via count. Our algorithm consists of three major stages: (1) an effective pin-access point generation method to identify valid points to model a complex pin shape, (2) a via-aware track assignment method to minimize the overlaps between assigned wire segments, and (3) a detailed routing algorithm with a novel negotiation-based rip-up and re-route scheme that enables multithreading and honors global routing information while minimizing design-rule violations. Experimental results show that our router outperforms all the winning teams of the 2018 ACM ISPD Initial Detailed Routing Contest, where the top-3 routers result in 23%, 52%, and 1224% higher costs than ours.

Extending ML-OARSMT to net open locator with efficient and effective boolean operations

  • Bing-Hui Jiang
  • Hung-Ming Chen

The multi-layer obstacle-avoiding rectilinear Steiner minimal tree (ML-OARSMT) problem has been extensively studied in recent years. In this work, we consider a variant of the ML-OARSMT problem and extend its applicability to the net open location finder. Since ECO or router limitations may cause open nets, we come up with a framework to detect and reconnect existing nets to resolve the net opens. Different from the prior connection-graph-based approach, we propose a technique that applies efficient Boolean operations to repair net opens. Our method has good quality and scalability and is highly parallelizable. Compared with the results of the ICCAD-2017 contest, we show that our proposed algorithm can achieve the smallest cost with a 4.81x speedup on average over the top-3 winners.

Logic synthesis of binarized neural networks for efficient circuit implementation

  • Chia-Chih Chi
  • Jie-Hong R. Jiang

Neural networks (NNs) are key to deep learning systems. Their efficient hardware implementation is crucial to applications at the edge. Binarized NNs (BNNs), where the weights and output of a neuron are of binary values {-1, +1} (or encoded in {0,1}), have been proposed recently. As no multiplier is required, they are particularly attractive and suitable for hardware realization. Most prior NN synthesis methods target hardware architectures with neural processing elements (NPEs), where the weights of a neuron are loaded and the output of the neuron is computed. The load-and-compute method, though area efficient, requires expensive memory access, which deteriorates energy and performance efficiency. In this work we aim at synthesizing BNN dense layers into dedicated logic circuits. We formulate the corresponding matrix covering problem and propose a scalable algorithm to reduce the area and routing cost of BNNs. Experimental results justify the effectiveness of the method in terms of area and net savings on FPGA implementation. Our method provides an alternative implementation of BNNs, and can be applied in combination with NPE-based implementation for area, speed, and power tradeoffs.
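
The remark that no multiplier is required can be illustrated with a short sketch. Under the common encoding -1 -> 0 and +1 -> 1, a {-1, +1} dot product equals twice the number of matching bit positions (the popcount of the bitwise XNOR) minus the vector length. The zero activation threshold and the function names below are assumptions for illustration; the paper's synthesis algorithm is not reproduced here.

```python
from itertools import product


def neuron_pm1(inputs, weights):
    """Reference neuron on {-1, +1} values with an assumed threshold of 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= 0 else -1


def neuron_xnor(input_bits, weight_bits):
    """Same neuron on {0, 1} encodings, using XNOR and popcount only."""
    n = len(input_bits)
    # A position contributes +1 to the dot product when the bits match.
    matches = sum(1 for a, b in zip(input_bits, weight_bits) if a == b)
    return 1 if 2 * matches - n >= 0 else 0


def encode(values):
    """Map {-1, +1} values to {0, 1} bits."""
    return [(v + 1) // 2 for v in values]


# Exhaustive equivalence check for a 4-input neuron.
for x in product((-1, 1), repeat=4):
    for w in product((-1, 1), repeat=4):
        expected = (neuron_pm1(x, w) + 1) // 2
        assert neuron_xnor(encode(x), encode(w)) == expected
```

Because the weights are constants after training, each such neuron is a fixed Boolean function of its inputs, which is what makes direct synthesis into logic possible.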

Canonicalization of threshold logic representation and its applications

  • Siang-Yun Lee
  • Nian-Ze Lee
  • Jie-Hong R. Jiang

Threshold logic functions gain revived attention due to their connection to neural networks employed in deep learning. Despite prior endeavors in the characterization of threshold logic functions, to the best of our knowledge, the quest for a canonical representation of threshold logic functions in the form of their realizing linear inequalities remains open. In this paper we devise a procedure to canonicalize a threshold logic function such that two threshold logic functions are equivalent if and only if their canonicalized linear inequalities are the same. We further strengthen the canonicity to ensure that symmetric variables of a threshold logic function receive the same weight in the canonicalized linear inequality. The canonicalization procedure invokes O(m) queries to a linear programming (resp. an integer linear programming) solver when a linear inequality solution with fractional (resp. integral) weight and threshold values is to be found, where m is the number of symmetry groups of the given threshold logic function. The guaranteed canonicity allows direct application to the classification of NP (input negation, input permutation) and NPN (input negation, input permutation, output negation) equivalence of threshold logic functions. It may thus enable applications such as equivalence checking, Boolean matching, and library construction for threshold circuit synthesis.
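
The need for a canonical form can be seen in a small sketch: a threshold logic function outputs 1 when the weighted sum of its inputs reaches the threshold, and different weight and threshold choices can realize the same function. The helper names and the two example inequalities are chosen for illustration and are not taken from the paper; the canonicalization procedure itself is not shown.

```python
from itertools import product


def threshold_fn(weights, threshold):
    """Return f(x) = 1 if sum(w_i * x_i) >= threshold, else 0."""
    def f(*bits):
        return int(sum(w * b for w, b in zip(weights, bits)) >= threshold)
    return f


def truth_table(f, num_inputs):
    """Evaluate f on all 0/1 input combinations in lexicographic order."""
    return tuple(f(*bits) for bits in product((0, 1), repeat=num_inputs))


# Two different linear inequalities that both realize 3-input majority.
table_a = truth_table(threshold_fn((1, 1, 1), 2), 3)
table_b = truth_table(threshold_fn((2, 2, 2), 3), 3)
assert table_a == table_b
```

Comparing truth tables as above takes time exponential in the number of inputs, which is one reason a canonical linear inequality is attractive for equivalence checking.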

DALS: delay-driven approximate logic synthesis

  • Zhuangzhuang Zhou
  • Yue Yao
  • Shuyang Huang
  • Sanbao Su
  • Chang Meng
  • Weikang Qian

Approximate computing is an emerging paradigm for error-tolerant applications. By introducing a reasonable amount of inaccuracy, both the area and delay of a circuit can be reduced significantly. To synthesize approximate circuits automatically, many approximate logic synthesis (ALS) algorithms have been proposed. However, they mainly focus on area reduction and are not optimal in reducing the delay of the circuits. In this paper, we propose DALS, a delay-driven ALS framework. DALS works on the AND-inverter graph (AIG) representation of a circuit. It supports a wide range of approximate local changes and some commonly-used error metrics, including error rate and mean error distance. In order to select an optimal set of nodes in the AIG to apply approximate local changes, DALS establishes a critical error network (CEN) from the AIG and formulates a maximum flow problem on the CEN. Our experimental results on a wide range of benchmarks show that DALS produces approximate circuits with significantly reduced delays.

Unlocking fine-grain parallelism for AIG rewriting

  • Vinicius Possani
  • Yi-Shan Lu
  • Alan Mishchenko
  • Keshav Pingali
  • Renato Ribas
  • Andre Reis

Parallel computing is a trend to enhance scalability of electronic design automation (EDA) tools using widely available multicore platforms. In order to benefit from parallelism, well-known EDA algorithms have to be reformulated and optimized for multicore implementation. This paper introduces a set of principles to enable a fine-grain parallel AND-inverter graph (AIG) rewriting. It presents a novel method to discover and rewrite in parallel parts of the AIG, without the need for graph partitioning. Experiments show that, when synthesizing large designs composed of millions of AIG nodes, the parallel rewriting on 40 physical cores is up to 36x and 68x faster than ABC commands rewrite -l and drw, respectively, with comparable quality of results in terms of AIG size and depth.

High-level synthesis with timing-sensitive information flow enforcement

  • Zhenghong Jiang
  • Steve Dai
  • G. Edward Suh
  • Zhiru Zhang

Specialized hardware accelerators are being increasingly integrated into today’s computer systems to achieve improved performance and energy efficiency. However, the resulting variety and complexity make it challenging to ensure the security of these accelerators. To mitigate complexity while guaranteeing security, we propose a high-level synthesis (HLS) infrastructure that incorporates static information flow analysis to enforce security policies on HLS-generated hardware accelerators. Our security-constrained HLS infrastructure is able to effectively identify both explicit and implicit information leakage. By detecting the security vulnerabilities at the behavioral level, our tool allows designers to address these vulnerabilities at an early stage of the design flow. We further propose a novel synthesis technique in HLS to eliminate timing channels in the generated accelerator. Our approach is able to remove timing channels in a verifiable manner while incurring lower performance overhead for high-security tasks on the accelerator.

Property specific information flow analysis for hardware security verification

  • Wei Hu
  • Armaiti Ardeshiricham
  • Mustafa S Gobulukoglu
  • Xinmu Wang
  • Ryan Kastner

Hardware information flow analysis detects security vulnerabilities resulting from unintended design flaws, timing channels, and hardware Trojans. These information flow models are typically generated in a general way, which includes a significant amount of redundancy that is irrelevant to the specified security properties. In this work, we propose a property specific approach for information flow security. We create information flow models tailored to the properties to be verified by performing a property specific search to identify security critical paths. This helps find suspicious signals that require closer inspection and quickly eliminates portions of the design that are free of security violations. Our property specific trimming technique reduces the complexity of the security model; this accelerates security verification and restricts potential security violations to a smaller region which helps quickly pinpoint hardware security vulnerabilities.

HISA: hardware isolation-based secure architecture for CPU-FPGA embedded systems

  • Mengmei Ye
  • Xianglong Feng
  • Sheng Wei

Heterogeneous CPU-FPGA systems have been shown to achieve significant performance gains in domain-specific computing. However, contrary to the huge efforts invested in performance acceleration, the community has not yet investigated the security consequences of incorporating FPGAs into the traditional CPU-based architecture. In fact, the interplay between CPU and FPGA in such a heterogeneous system may introduce brand new attack surfaces if not well controlled. We propose a hardware isolation-based secure architecture, namely HISA, to mitigate the identified new threats. HISA extends the CPU-based hardware isolation primitive to the heterogeneous FPGA components and achieves security guarantees by enforcing two types of security policies in the isolated secure environment, namely the access control policy and the output verification policy. We evaluate HISA using four reference FPGA IP cores together with a variety of reference security policies targeting representative CPU-FPGA attacks. Our implementation and experiments on real hardware prove that HISA is an effective security complement to the existing CPU-only and FPGA-only secure architectures.

SWAN: mitigating hardware trojans with design ambiguity

  • Timothy Linscott
  • Pete Ehrett
  • Valeria Bertacco
  • Todd Austin

For the past decade, security experts have warned that malicious engineers could modify hardware designs to include hardware backdoors (trojans), which, in turn, could grant attackers full control over a system. Proposed defenses to detect these attacks have been outpaced by the development of increasingly small, but equally dangerous, trojans. To thwart trojan-based attacks, we propose a novel architecture that maps the security-critical portions of a processor design to a one-time programmable, LUT-free fabric. The programmable fabric is automatically generated by analyzing the HDL of targeted modules. We present our tools to generate the fabric and map functionally equivalent designs onto the fabric. By having a trusted party randomly select a mapping and configure each chip, we prevent an attacker from knowing the physical location of targeted signals at manufacturing time. In addition, we provide decoy options (canaries) for the mapping of security-critical signals, such that hardware trojans hitting a decoy are thwarted and exposed. Using this defense approach, any trojan capable of analyzing the entire configurable fabric must employ complex logic functions with a large silicon footprint, thus exposing it to detection by inspection. We evaluated our solution on a RISC-V BOOM processor and demonstrated that, by providing the ability to map each critical signal to 6 distinct locations on the chip, we can reduce the chance of attack success by an undetectable trojan by 99%, incurring only a 27% area overhead.

Security for safety: a path toward building trusted autonomous vehicles

  • Raj Gautam Dutta
  • Feng Yu
  • Teng Zhang
  • Yaodan Hu
  • Yier Jin

Automotive systems have always been designed with safety in mind. In this regard, the functional safety standard, ISO 26262, was drafted with the intention of minimizing risk due to random hardware faults or systematic failure in the design of electrical and electronic components of an automobile. However, the growing complexity of the modern car has added another potential point of failure in the form of cyber or sensor attacks. Recently, researchers have demonstrated that vulnerabilities in a vehicle’s software or sensing units could enable them to remotely alter the intended operation of the vehicle. As such, in addition to safety, security should be considered as an important design goal. However, designing security solutions without the consideration of safety objectives could result in potential hazards. Consequently, in this paper we propose the notion of security for safety and show that by integrating safety conditions with our system-level security solution, which comprises a modified Kalman filter and a Chi-squared detector, we can prevent potential hazards that could occur due to violation of safety objectives during an attack. Furthermore, with the help of a car-following case study, where the follower car is equipped with an adaptive-cruise control unit, we show that our proposed system-level security solution preserves the safety constraints and prevents collisions between vehicles while under sensor attack.
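
For readers unfamiliar with the detector named above, the following is a minimal sketch of a textbook chi-squared residual test for a scalar measurement. It assumes a standard Kalman filter that supplies the innovation (measured minus predicted value) and its variance; the paper's modified filter and its integration with safety conditions are not reproduced, and the function names are hypothetical.

```python
# 95% quantile of the chi-squared distribution with one degree of freedom.
CHI2_95_1DOF = 3.841


def chi_squared_statistic(innovation, innovation_variance):
    """Normalized squared residual for a scalar measurement."""
    if innovation_variance <= 0:
        raise ValueError("innovation variance must be positive")
    return innovation * innovation / innovation_variance


def sensor_alarm(innovation, innovation_variance, threshold=CHI2_95_1DOF):
    """Flag the measurement when the statistic exceeds the threshold."""
    return chi_squared_statistic(innovation, innovation_variance) > threshold
```

For example, with unit innovation variance a residual of 0.5 gives a statistic of 0.25 and raises no alarm, while a residual of 3.0 gives 9.0 and is flagged.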

Hardware-accelerated data acquisition and authentication for high-speed video streams on future heterogeneous automotive processing platforms

  • Martin Geier
  • Fabian Franzen
  • Samarjit Chakraborty

With the increasing use of Ethernet-based communication backbones in safety-critical real-time domains, both efficient and predictable interfacing and cryptographically secure authentication of high-speed data streams are becoming very important. Although the increasing data rates of in-vehicle networks allow the integration of more demanding (e.g., camera-based) applications, processing speeds and, in particular, memory bandwidths are no longer scaling accordingly. The need for authentication, on the other hand, stems from the ongoing convergence of traditionally separated functional domains and the extended connectivity both in- (e.g., smart-phones) and outside (e.g., telemetry, cloud-based services and vehicle-to-X technologies) current vehicles. The inclusion of cryptographic measures thus requires careful interface design to meet throughput, latency, safety, security and power constraints given by the particular application domain. Over the last decades, this has forced system designers to not only optimize their software stacks accordingly, but also incrementally move interface functionalities from software to hardware. This paper discusses existing and emerging methods for dealing with high-speed data streams ranging from software-only via mixed-hardware/software approaches to fully hardware-based solutions. In particular, we introduce two approaches to acquire and authenticate GigE Vision Video Streams at full line rate of Gigabit Ethernet on Programmable SoCs suitable for future heterogeneous automotive processing platforms.

Network and system level security in connected vehicle applications

  • Hengyi Liang
  • Matthew Jagielski
  • Bowen Zheng
  • Chung-Wei Lin
  • Eunsuk Kang
  • Shinichi Shiraishi
  • Cristina Nita-Rotaru
  • Qi Zhu

Connected vehicle applications such as autonomous intersections and intelligent traffic signals have shown great promise in improving transportation safety and efficiency. However, security is a major concern in these systems, as vehicles and surrounding infrastructures communicate through ad-hoc networks. In this paper, we will first review security vulnerabilities in connected vehicle applications. We will then introduce and discuss some of the defense mechanisms at network and system levels, including (1) the Security Credential Management System (SCMS) proposed by the United States Department of Transportation, (2) an intrusion detection system (IDS) that we are developing and its application on collaborative adaptive cruise control, and (3) a partial consensus mechanism and its application on lane merging. These mechanisms can help improve the security of connected vehicle applications.

A safety and security architecture for reducing accidents in intelligent transportation systems

  • Qian Chen
  • Azizeh Khaled Sowan
  • Shouhuai Xu

The Internet of Things (IoT) technology is transforming the world into Smart Cities, which have a huge impact on future societal lifestyle, economy and business. Intelligent Transportation Systems (ITS), especially IoT-enabled Electric Vehicles (EVs), are anticipated to be an integral part of future Smart Cities. Assuring ITS safety and security is critical to the success of Smart Cities because human lives are at stake. The state-of-the-art understanding of this matter is very superficial because there are many new problems that have yet to be investigated. For example, the cyber-physical nature of ITS requires considering human-in-the-loop (i.e., drivers and pedestrians) and imposes many new challenges. In this paper, we systematically explore the threat model against ITS safety and security (e.g., malfunctions of connected EVs/transportation infrastructures, driver misbehavior and unexpected medical conditions, and cyber attacks). Then, we present a novel and systematic ITS safety and security architecture, which aims to reduce accidents caused or amplified by a range of threats. The architecture has appealing features: (i) it is centered at proactive cyber-physical-human defense; (ii) it facilitates the detection of early-warning signals of accidents; (iii) it automates effective defense against a range of threats.

The need and opportunities of electromigration-aware integrated circuit design

  • Steve Bigalke
  • Jens Lienig
  • Göran Jerke
  • Jürgen Scheible
  • Roland Jancke

Electromigration (EM) is becoming a progressively severe reliability challenge due to increased interconnect current densities. A shift from traditional (post-layout) EM verification to robust (pro-active) EM-aware design – where the circuit layout is designed with individual EM-robust solutions – is urgently needed. This tutorial will give an overview of EM and its effects on the reliability of present and future integrated circuits (ICs). We introduce the physical EM process and present its specific characteristics that can be affected during physical design. Examples of EM countermeasures which are applied in today’s commercial design flows are presented. We show how to improve the EM-robustness of metallization patterns and we also consider mission profiles to obtain application-oriented current-density limits. The increasing interaction of EM with thermal migration is investigated as well. We conclude with a discussion of application examples to shift from the current post-layout EM verification towards an EM-aware physical design process. Its methodologies, such as EM-aware routing, increase the EM-robustness of the layout with the overall goal of reducing the negative impact of EM on the circuit’s reliability.

Uncertainty quantification of electronic and photonic ICs with non-Gaussian correlated process variations

  • Chunfeng Cui
  • Zheng Zhang

Since the invention of generalized polynomial chaos in 2002, uncertainty quantification has impacted many engineering fields, including variation-aware design automation of integrated circuits and integrated photonics. Due to its fast convergence rate, the generalized polynomial chaos expansion has achieved orders-of-magnitude speedup over Monte Carlo in many applications. However, almost all existing generalized polynomial chaos methods have a strong assumption: the uncertain parameters are mutually independent or Gaussian correlated. This assumption rarely holds in realistic applications, and it has been a long-standing challenge for both theorists and practitioners.

This paper proposes a rigorous and efficient solution to address the challenge of non-Gaussian correlation. We first extend generalized polynomial chaos, and propose a class of smooth basis functions to efficiently handle non-Gaussian correlations. Then, we consider high-dimensional parameters, and develop a scalable tensor method to compute the proposed basis functions. Finally, we develop a sparse solver with adaptive sample selections to solve high-dimensional uncertainty quantification problems. We validate our theory and algorithm on electronic and photonic ICs with 19 to 57 non-Gaussian correlated variation parameters. The results show that our approach outperforms Monte Carlo by 2500× to 3000× in terms of efficiency. Moreover, our method can accurately predict the output density functions with multiple peaks caused by non-Gaussian correlations, which are hard for existing methods to handle.

Based on the results in this paper, many novel uncertainty quantification algorithms can be developed and can be further applied to a broad range of engineering domains.

Parallelizable Bayesian optimization for analog and mixed-signal rare failure detection with high coverage

  • Hanbin Hu
  • Peng Li
  • Jianhua Z. Huang

Due to inherent complex behaviors and stringent requirements in analog and mixed-signal (AMS) systems, verification becomes a key bottleneck in the product development cycle. For the first time, we present a Bayesian optimization (BO) based approach to the challenging problem of verifying AMS circuits with stringent low failure requirements. At the heart of the proposed BO process is a delicate balancing between two competing needs: exploitation of the current statistical model for quick identification of highly-likely failures and exploration of undiscovered design space so as to detect hard-to-find failures within a large parametric space. To do so, we simultaneously leverage multiple optimized acquisition functions to explore varying degrees of balancing between exploitation and exploration. This makes it possible to not only detect rare failures which other techniques fail to identify, but also do so with significantly improved efficiency. We further build a mechanism into the BO process to enable detection of multiple failure regions, hence providing a higher degree of coverage. Moreover, the proposed approach is readily parallelizable, further speeding up failure detection, particularly for large circuits for which acquisition of simulation/measurement data is very time-consuming. Our experimental study demonstrates that the proposed approach is very effective in finding very rare failures and multiple failure regions which existing statistical sampling techniques and other BO techniques can miss, thereby providing a more robust and cost-effective methodology for rare failure detection.
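
As a rough illustration of using several acquisition functions at once, the sketch below scores candidate parameter points with a lower confidence bound (LCB) at several exploration weights and collects one pick per weight into a batch that could be simulated in parallel. LCB is a common acquisition function chosen here only for illustration; the abstract does not say which acquisition functions the authors use. The function names, the surrogate means and standard deviations, and the convention that a lower predicted margin means closer to failure are all assumptions.

```python
def lower_confidence_bound(mean, std, kappa):
    """Optimistic estimate of the margin: small kappa exploits, large kappa explores."""
    return mean - kappa * std


def select_batch(candidates, means, stds, kappas):
    """Pick the LCB minimizer for each exploration weight; drop duplicate picks."""
    batch = []
    for kappa in kappas:
        scores = [lower_confidence_bound(m, s, kappa) for m, s in zip(means, stds)]
        best = candidates[scores.index(min(scores))]
        if best not in batch:
            batch.append(best)
    return batch


# Hypothetical surrogate predictions of a performance margin at three points.
candidates = ["a", "b", "c"]
means = [0.2, 0.5, 0.9]   # predicted margin (lower = closer to failing)
stds = [0.05, 0.1, 0.6]   # model uncertainty at each point
batch = select_batch(candidates, means, stds, kappas=[0.0, 0.5, 2.0])
```

With these made-up numbers the exploiting weights select the point with the lowest predicted margin, while the exploring weight selects the most uncertain point, so the batch covers both needs described in the abstract.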

Transient circuit simulation for differential algebraic systems using matrix exponential

  • Pengwen Chen
  • Chung-Kuan Cheng
  • Dongwon Park
  • Xinyuan Wang

Transient simulation has become a bottleneck for modern IC designs due to large numbers of transistors and interconnects and tight design margins. The modified nodal analysis (MNA) formulation can yield differential algebraic equations (DAEs), which consist of ordinary differential equations (ODEs) and algebraic equations. Solving DAEs with conventional multi-step integration methods has been a research topic for the last few decades. We adopt a matrix exponential based integration method for circuit transient analysis, for which stability and accuracy with DAEs remain an open problem. We identify that potential stability issues in the calculation of the matrix exponential and vector product (MEVP) with the rational Krylov method originate from the singular system matrix in DAEs. We then devise a robust algorithm to implicitly regularize the system matrix while maintaining its sparsity. With the new approach, φ functions are applied for MEVP to improve the accuracy of results. Moreover, our framework no longer suffers from the limitation on step sizes, so a large leap step is adopted to skip many simulation steps in between. Features of the algorithm are validated on large-scale power delivery networks, achieving high efficiency and accuracy.
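
To make the role of the φ (phi) functions and the large leap step concrete, here is a minimal scalar sketch. It assumes the simplest possible model, x' = a*x + b with constant a and b, for which the exponential update x(t+h) = exp(a*h)*x(t) + h*phi_1(a*h)*b with phi_1(z) = (exp(z) - 1)/z is exact. The paper works with large sparse and possibly singular matrices and a rational Krylov method, none of which is reproduced here; the function names and values are hypothetical.

```python
import math


def phi1(z):
    """phi_1(z) = (e^z - 1) / z, using the limit value 1 near z = 0."""
    if abs(z) < 1e-12:
        return 1.0
    return math.expm1(z) / z


def exponential_step(x, a, b, h):
    """Advance x' = a*x + b by one step of size h (exact for constant a and b)."""
    return math.exp(a * h) * x + h * phi1(a * h) * b


# One large leap step versus many small steps on a decaying scalar system.
a, b = -2.0, 4.0
leap = exponential_step(0.0, a, b, 10.0)

x = 0.0
for _ in range(100):
    x = exponential_step(x, a, b, 0.1)
```

Because the update is exact in this toy setting, the single leap and the hundred small steps land on the same value, which is the property that lets an exponential integrator skip intermediate time points when the inputs do not change.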

CustomTopo: a topology generation method for application-specific wavelength-routed optical NoCs

  • Mengchu Li
  • Tsun-Ming Tseng
  • Davide Bertozzi
  • Mahdi Tala
  • Ulf Schlichtmann

Optical network-on-chip (NoC) is a promising platform beyond electronic NoCs. In particular, wavelength-routed optical network-on-chip (WRONoC) is renowned for its high bandwidth and ultra-low signal delay. Current WRONoC topology generation approaches focus on full-connectivity, i.e. all masters are connected to all slaves. This assumption leads to wasted resources for application-specific designs. In this work, we propose CustomTopo: a general solution to the topology generation problem on WRONoCs that supports customized connectivity. CustomTopo models the topology structure and its communication behavior as an integer-linear-programming (ILP) problem, with an adjustable optimization target considering the number of add-drop filters (ADFs), the number of wavelengths, and insertion loss. The time for solving the ILP problem in general positively correlates with the network communication densities. Experimental results show that CustomTopo is applicable for various communication requirements, and the resulting customized topology enables a remarkable reduction in both resource usage and insertion loss.

A cross-layer methodology for design and optimization of networks in 2.5D systems

  • Ayse Coskun
  • Furkan Eris
  • Ajay Joshi
  • Andrew B. Kahng
  • Yenai Ma
  • Vaishnav Srinivas

2.5D integration technology is gaining popularity in the design of homogeneous and heterogeneous many-core computing systems. 2.5D network design, both inter- and intra-chiplet, impacts overall system performance as well as its manufacturing cost and thermal feasibility. This paper introduces a cross-layer methodology for designing networks in 2.5D systems. We optimize the network design and chiplet placement jointly across logical, physical, and circuit layers to achieve an energy-efficient network, while maximizing system performance, minimizing manufacturing cost, and adhering to thermal constraints. In the logical layer, our co-optimization considers eight different network topologies. In the physical layer, we consider routing, microbump assignment, and microbump pitch constraints to account for the extra costs associated with microbump utilization in the inter-chiplet communication. In the circuit layer, we consider both passive and active links with five different link types, including a gas station link design. Using our cross-layer methodology results in more accurate determination of (superior) inter-chiplet network and 2.5D system designs compared to prior methods. Compared to 2D systems, our approach achieves 29% better performance with the same manufacturing cost, or 25% lower cost with the same performance.

Wavefront-MCTS: multi-objective design space exploration of NoC architectures based on Monte Carlo tree search

  • Yong Hu
  • Daniel Mueller-Gritschneder
  • Ulf Schlichtmann

Application-specific MPSoCs profit immensely from a custom-fit Network-on-Chip (NoC) architecture in terms of network performance and power consumption. In this paper we suggest a new approach to explore application-specific NoC architectures. In contrast to other heuristics, our approach uses a set of network modifications defined with graph rewriting rules to model the design space exploration as a Markov Decision Process (MDP). The MDP can be efficiently explored using the Monte Carlo Tree Search (MCTS) heuristic. We formulate a weighted sum reward function to compute a single solution with a good trade-off between power and latency or a set of max reward functions to compute the complete Pareto front between the two objectives. The Wavefront feature adds additional efficiency when computing the Pareto front by exchanging solutions between parallel MCTS optimization processes. Comparison with other popular search heuristics demonstrates a higher efficiency of MCTS-based heuristics for several test cases. Additionally, the Wavefront-MCTS heuristic allows complete traceability and control by the designer to enable an interactive design space exploration process.
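
The two objective formulations in this abstract, a weighted sum for a single trade-off solution and a Pareto front over power and latency, can be sketched as follows. This is only an illustration under stated assumptions: both objectives are minimized and already normalized to comparable scales, the function names are hypothetical, and the exact reward definitions used by Wavefront-MCTS (including its max reward functions) are not given in the abstract and are not reproduced here.

```python
def weighted_sum_reward(power, latency, weight):
    """Higher is better; weight in [0, 1] trades power against latency."""
    return -(weight * power + (1.0 - weight) * latency)


def pareto_front(points):
    """Keep the (power, latency) points that no other point dominates."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical normalized (power, latency) results for four NoC candidates.
designs = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0)]
front = pareto_front(designs)
best = max(designs, key=lambda d: weighted_sum_reward(d[0], d[1], 0.5))
```

The weighted sum collapses the two objectives into one number and therefore returns a single design, whereas the Pareto front keeps every design that is not worse in both objectives than some other design.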

HLS-based optimization and design space exploration for applications with variable loop bounds

  • Young-kyu Choi
  • Jason Cong

In order to further increase the productivity of field-programmable gate array (FPGA) programmers, several design space exploration (DSE) frameworks for high-level synthesis (HLS) tools have been recently proposed to automatically determine the FPGA design parameters. However, one of the common limitations found in these tools is that they cannot find a design point with large speedup for applications with variable loop bounds. The reason is that loops with variable loop bounds cannot be efficiently parallelized or pipelined with simple insertion of HLS directives. Also, making highly accurate predictions of cycles and resource consumption over the entire design space becomes a challenging task because of the inaccuracy of the HLS tool cycle prediction and the wide design space. In this paper we present an HLS-based FPGA optimization and DSE framework that produces a high-performance design even in the presence of variable loop bounds. We propose code transformations that increase the utilization of the compute resources for variable loops, including several computation patterns with loop-carried dependency such as floating-point reduction and prefix sum. In order to rapidly perform DSE with high accuracy, we describe a resource and cycle estimation model constructed from the information obtained from the actual HLS synthesis. Experiments on applications with variable loop bounds in Polybench benchmarks with Vivado HLS show that our framework improves the baseline implementation by 75X on average and outperforms current state-of-the-art DSE frameworks.

HLSPredict: cross platform performance prediction for FPGA high-level synthesis

  • Kenneth O’Neal
  • Mitch Liu
  • Hans Tang
  • Amin Kalantar
  • Kennen DeRenard
  • Philip Brisk

FPGA application developers must explore increasingly large design spaces to identify regions of code to accelerate. High-Level Synthesis (HLS) tools automatically derive FPGA-based designs from high-level language specifications, which improves designer productivity; however, HLS tool run-times are cost-prohibitive for design space exploration, preventing designers from adequately answering cost-value decisions without expert guidance. To address this concern, this paper introduces a machine learning framework to predict FPGA performance and power consumption without relying on analytical models or HLS tools in-the-loop. For workloads that were manually optimized by appropriately setting pragmas, the framework obtains a worst-case relative error of 9.08% while running 43.78x faster than HLS; for unoptimized workloads, the framework obtains a worst-case relative error of 9.79% while running 36.24x faster than HLS.

C-GOOD: C-code generation framework for optimized on-device deep learning

  • Duseok Kang
  • Euiseok Kim
  • Inpyo Bae
  • Bernhard Egger
  • Soonhoi Ha

Executing deep learning algorithms on mobile embedded devices is challenging because embedded devices usually have tight constraints on the computational power, memory size, and energy consumption while the resource requirements of deep learning algorithms achieving high accuracy continue to increase. Thus it is typical to use an energy-efficient accelerator such as a mobile GPU, DSP array, or customized neural processor chip. Moreover, new deep learning algorithms that aim to balance accuracy, speed, and resource requirements are developed on a deep learning framework such as Caffe[16] or TensorFlow[1] that is assumed to run directly on the target hardware. However, embedded devices may not be able to run those frameworks directly due to hardware limitations or missing OS support. To overcome this difficulty, we develop a deep learning software framework that generates C code that can be run on any device. The framework provides various options for software optimization that can be performed according to the optimization methodology proposed in this paper. Another benefit is that it can generate various styles of C code, tailored for a specific compiler or accelerator architecture. Experiments on three platforms, NVIDIA Jetson TX2[23], Odroid XU4[10], and SRP (Samsung Reconfigurable Processor)[32], demonstrate the potential of the proposed approach.

LiteHAX: lightweight hardware-assisted attestation of program execution

  • Ghada Dessouky
  • Tigist Abera
  • Ahmad Ibrahim
  • Ahmad-Reza Sadeghi

Unlike traditional processors, embedded Internet of Things (IoT) devices lack the resources to incorporate protection against modern sophisticated attacks, resulting in critical consequences. Remote attestation (RA) is a security service to establish trust in the integrity of a remote device. While conventional RA is static and limited to detecting malicious modification to software binaries at load-time, recent research has made progress towards runtime attestation, such as attesting the control flow of an executing program. However, existing control-flow attestation schemes are inefficient and vulnerable to sophisticated data-oriented programming (DOP) attacks, which subvert these schemes while keeping the control flow of the code intact.

In this paper, we present LiteHAX, an efficient hardware-assisted remote attestation scheme for RISC-based embedded devices that enables detecting both control-flow attacks as well as DOP attacks. LiteHAX continuously tracks both the control-flow and data-flow events of a program executing on a remote device and reports them to a trusted verifying party. We implemented and evaluated LiteHAX on a RISC-V System-on-Chip (SoC) and show that it has minimal performance and area overhead.

SCADET: a side-channel attack detection tool for tracking prime+probe

  • Majid Sabbagh
  • Yunsi Fei
  • Thomas Wahl
  • A. Adam Ding

Microarchitectural side-channel attacks have posed serious threats to many computing systems, ranging from embedded systems and mobile devices to desktop workstations and cloud servers. Such attacks exploit side-channel vulnerabilities stemming from fundamental microarchitectural performance features, including the most common caches, out-of-order execution (for the newly revealed Meltdown exploit), and speculative execution (for Spectre). Prior efforts have focused on identifying and assessing these security vulnerabilities, and designing and implementing countermeasures against them. However, the efforts aiming at detecting specific side-channel attacks tend to be narrowly focused, which can make them effective but also makes them obsolete very quickly. In this paper, we propose a new methodology for detecting microarchitectural side-channel attacks that has the potential for a wide scope of applicability, as we demonstrate using a case study involving the Prime+Probe attack family. Instead of looking at the side-effects of side-channel attacks on microarchitectural elements such as hardware performance counters, we target the high-level semantics and invariant patterns of these attacks. We have applied our method to different Prime+Probe attack variants on the instruction cache, data cache, and last-level cache, as well as several benign programs as benchmarks. The method can detect all of the Prime+Probe attack variants with a true positive rate of 100% and an average false positive rate of 7.4%.

Industrial experiences with resource management under software randomization in ARINC653 avionics environments

  • Leonidas Kosmidis
  • Cristian Maxim
  • Victor Jegu
  • Francis Vatrinet
  • Francisco J. Cazorla

Injecting randomization in different layers of the computing platform has been shown to be beneficial for security, resilience to software bugs and timing analysis. In this paper, with a focus on the latter, we share our experience regarding memory and timing resource management when software randomization techniques are applied to one of the most stringent industrial environments, ARINC653-based avionics. We describe the challenges in this task, propose a set of solutions and present the results obtained for two commercial avionics applications, executed on COTS hardware and RTOS.

Single flux quantum circuit technology and CAD overview

  • Coenrad Fourie

Single Flux Quantum (SFQ) electronic circuits originated with the advent of Rapid Single Flux Quantum (RSFQ) logic in 1985 and have since evolved to include more energy-efficient technologies such as ERSFQ and eSFQ. SFQ logic circuits, based on the manipulation of quantized flux pulses, have been demonstrated to run at clock speeds in excess of 120 GHz, and with bit-switch energy below 1 aJ. Small SFQ microprocessors have been developed, but characteristics inherent to SFQ circuits and the lack of circuit design tools have hampered the development of large SFQ systems. SFQ circuit characteristics include fan-out of one and the subsequent demand for pulse splitters, gate-level clocking, susceptibility to magnetic fields and sensitivity to intra-gate and inter-gate inductance. Superconducting interconnects propagate data pulses at the speed of light, but suffer from reflections at vias that attenuate transmitted pulses. The recently started IARPA SuperTools program aims to deliver SFQ Computer-Aided Design (CAD) tools that can enable the successful design of 64 bit RISC processors given the characteristics of SFQ circuits. A discussion on the technology of SFQ circuits and the most modern SFQ fabrication processes is presented, with a focus on the unique electronic design automation CAD requirements for the design, layout and verification of SFQ circuits.

Design automation methodology and tools for superconductive electronics

  • Massoud Pedram
  • Yanzhi Wang

Josephson junction-based superconducting logic families have been proposed to implement analog and digital signals, which can achieve low energy dissipation and ultra-fast switching speed. There are two representative technologies: DC-biased RSFQ (rapid single flux quantum) technology and its variants that achieve a verified speed of 370 GHz, and AC-biased AQFP (adiabatic quantum-flux-parametron) that achieves an energy dissipation near quantum limits. Despite the extraordinary characteristics of the superconducting logic families, many technical challenges remain, including the choice of circuit fabrics and architectures that utilize the SFQ technology and the development of effective design automation methodologies and tools. This paper presents our work on developing design flows and tools for DC- and AC-biased SFQ circuits, leveraging unique characteristics and design requirements of the SFQ logic families. More precisely, physical design algorithms, including placement, clock tree routing, and signal routing algorithms targeting RSFQ circuits are presented first. Next, a majority/minority gate-based automatic synthesis framework targeting AQFP logic circuits is described. Finally, experimental results to demonstrate the efficacy of the proposed framework and tools are presented.

Multi-terminal routing with length-matching for rapid single flux quantum circuits

  • Pei-Yi Cheng
  • Kazuyoshi Takagi
  • Tsung-Yi Ho

With increasing clock frequencies, meeting the timing requirements of Rapid Single Flux Quantum (RSFQ) digital circuits is critical for achieving correct functionality. To meet these requirements, it is necessary to incorporate a length-matching constraint into the routing problem. However, the solutions of existing routing algorithms are inherently limited by pre-allocated splitters (SPLs), which complicates the subsequent routing stage under the length-matching constraint. Hence, in this paper, we reallocate SPLs to fully utilize routing resources to cope with length-matching effectively. We propose the first multi-terminal routing algorithm for RSFQ circuits that integrates SPL reallocation into the routing stage. The experimental results on a practical circuit show that our proposed algorithm achieves routing completion while reducing the required area by 17%. Compared to [2], we can still improve by 7% with less runtime when SPLs are pre-allocated.

Electromagnetic equalizer: an active countermeasure against EM side-channel attack

  • Chenguang Wang
  • Yici Cai
  • Haoyi Wang
  • Qiang Zhou

Electromagnetic (EM) analysis reveals secret information by analyzing the EM emission from a cryptographic device. The EM analysis (EMA) attack is emerging as a serious threat to hardware security. It has been noted that the on-chip power grid (PG) has security implications for EMA attacks because it affects the fluctuations of the supply current. However, there has been little study on exploiting this intrinsic property as an active countermeasure against EMA. In this paper, we investigate the effect of the PG on EM emission and propose an active countermeasure against EMA, i.e. EM Equalizer (EME). By adjusting the PG impedance, the current waveform can be flattened, equalizing the EM profile. Therefore, the correlation between secret data and EM emission is significantly reduced. As a first attempt at the co-optimization of power and EM security, we extend the EME method to fix the vulnerability to power analysis. To verify the EME method, several cryptographic designs are implemented. The measurements to disclose (MTD) are improved by 1138x with area and power overheads of 0.62% and 1.36%, respectively.

GPU acceleration of RSA is vulnerable to side-channel timing attacks

  • Chao Luo
  • Yunsi Fei
  • David Kaeli

The RSA algorithm [21] is a public-key cipher widely used in digital signatures and Internet protocols, including the Secure Sockets Layer (SSL) and Transport Layer Security (TLS). RSA entails excessive computational complexity compared with symmetric ciphers. For scenarios where an Internet domain is handling a large number of SSL connections and generating digital signatures for a large number of files, the amount of RSA computation becomes a major performance bottleneck. With the advent of general-purpose GPUs, the performance of RSA has been improved significantly by exploiting parallel computing on a GPU [9, 18, 23, 26], leveraging the Single Instruction Multiple Thread (SIMT) model.

Remote inter-chip power analysis side-channel attacks at board-level

  • Falk Schellenberg
  • Dennis R. E. Gnad
  • Amir Moradi
  • Mehdi B. Tahoori

The current practice in board-level integration is to incorporate chips and components from numerous vendors. A fully trusted supply chain for all used components and chipsets is an important, yet extremely difficult to achieve, prerequisite to validate a complete board-level system for safe and secure operation. An increasing risk is that most chips nowadays run software or firmware, typically updated throughout the system lifetime, making it practically impossible to validate the full system at every given point in the manufacturing, integration and operational life cycle. This risk is elevated in devices that run 3rd party firmware. In this paper we show that an FPGA used as a common accelerator in various boards can be reprogrammed by software to introduce a sensor, suitable as a remote power analysis side-channel attack vector at the board-level. We show successful power analysis attacks from one FPGA on the board to another chip implementing RSA and AES cryptographic modules. Since the sensor is only mapped through firmware, this threat is very hard to detect, because data can be exfiltrated without requiring inter-chip communication between victim and attacker. Our results also prove the potential vulnerability in which any untrusted chip on the board can launch such attacks on the remaining system.

Effective simple-power analysis attacks of elliptic curve cryptography on embedded systems

  • Chao Luo
  • Yunsi Fei
  • David Kaeli

Elliptic Curve Cryptography (ECC), initially proposed by Koblitz [17] and Miller [20], is a public-key cipher. Compared with other popular public-key ciphers (e.g., RSA), ECC features a shorter key length for the same level of security. For example, a 256-bit ECC cipher provides 128-bit security, equivalent to a 2048-bit RSA cipher [4]. Using smaller keys, ECC requires less memory for performing cryptographic operations. Embedded systems, especially given the proliferation of Internet-of-Things (IoT) devices and platforms, require efficient and low-power secure communications between edge devices and gateways/clouds. ECC has been widely adopted in IoT systems for authentication of communications, while RSA, which is much more costly to compute, remains the standard for desktops and servers.

SODA: stencil with optimized dataflow architecture

  • Yuze Chi
  • Jason Cong
  • Peng Wei
  • Peipei Zhou

Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial differential equations, and cellular automata. Many stencil kernels are complex, usually consist of multiple stages or iterations, and are often computation-bound. Such kernels are often offloaded to FPGAs to take advantage of the efficiency of dedicated hardware. However, implementing such complex kernels efficiently is not trivial, due to complicated data dependencies, the difficulties of programming FPGAs with RTL, and the large design space.

In this paper we present SODA, an automated framework for implementing Stencil algorithms with Optimized Dataflow Architecture on FPGAs. The SODA microarchitecture minimizes the on-chip reuse buffer size required by full data reuse and provides flexible and scalable fine-grained parallelism. The SODA automation framework takes high-level user input and generates an efficient, high-frequency dataflow implementation. This significantly reduces the difficulty of programming FPGAs efficiently for stencil algorithms. The SODA design-space exploration framework models the resource constraints and searches for the performance-optimized configuration with accurate models for post-synthesis resource utilization and on-board execution throughput. Experimental results from on-board execution using a wide range of benchmarks show up to 3.28x speedup over a 24-thread CPU, and our fully automated framework achieves better performance than manually designed state-of-the-art FPGA accelerators.
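For readers unfamiliar with the term, the following is a minimal sketch of what a stencil kernel looks like in software. It is a generic 5-point averaging stencil written for illustration only; the function names `jacobi_2d_step` and `run_stencil` are hypothetical and the code does not represent the SODA microarchitecture or its input format.

```python
def jacobi_2d_step(grid):
    """Apply one 5-point averaging stencil pass to the interior of a 2D grid."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary cells are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each output cell depends on a fixed neighborhood of input cells.
            out[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return out


def run_stencil(grid, iterations):
    """Iterate the stencil, as multi-stage or iterative kernels do."""
    for _ in range(iterations):
        grid = jacobi_2d_step(grid)
    return grid
```

Each interior input cell is read by up to five neighboring outputs, which is the kind of data reuse that an on-chip reuse buffer is meant to exploit.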

PolySA: polyhedral-based systolic array auto-compilation

  • Jason Cong
  • Jie Wang

Automatic systolic array generation has long been an interesting topic due to the need to reduce the lengthy development cycles of manual designs. The existing automatic systolic array generation approach builds dependency graphs from algorithms, and iteratively maps computation nodes in the graph into processing elements (PEs) with time stamps that specify the sequences of nodes that operate within the PE. A number of previous works implemented this idea and generated designs for ASICs. However, all of these works relied on human intervention and usually generated designs inferior to manual ones. In this work, we present our ongoing compilation framework named PolySA, which leverages the power of the polyhedral model to achieve end-to-end compilation for systolic array architectures on FPGAs. PolySA is the first fully automated compilation framework for generating high-performance systolic array architectures on the FPGA leveraging recent advances in high-level synthesis. We demonstrate PolySA on two key applications: matrix multiplication and convolutional neural networks. PolySA is able to generate optimal designs within one hour with performance comparable to state-of-the-art manual designs.

An efficient data reuse strategy for multi-pattern data access

  • Wensong Li
  • Fan Yang
  • Hengliang Zhu
  • Xuan Zeng
  • Dian Zhou

Memory partitioning has been widely adopted to increase the memory bandwidth. Data reuse is a hardware-efficient way to improve data access throughput by exploiting locality in memory access patterns. We found that for many applications in image and video processing, a global data reuse scheme can be shared by multiple patterns. In this paper, we propose an efficient data reuse strategy for multi-pattern data access. Firstly, a heuristic algorithm is proposed to extract the reuse information as well as find the non-reusable data elements of each pattern. Then the non-reusable elements are partitioned into several memory banks by an efficient memory partitioning algorithm. Moreover, the reuse information is utilized to generate the global data reuse logic shared by the multi-pattern. We design a novel algorithm to minimize the number of registers required by the data reuse logic. Experimental results show that compared with the state-of-the-art approach, our proposed method can reduce the number of required BRAMs by 62.2% on average, with the average reduction of 82.1% in SLICE, 87.1% in LUTs, 71.6% in Flip-Flops, 73.1% in DSP48Es, 83.8% in SRLs, 46.7% in storage overhead, 79.1% in dynamic power consumption, and 82.6% in execution time of memory partitioning. Besides, the performance is improved by 14.4%.

Optimizing data layout and system configuration on FPGA-based heterogeneous platforms

  • Hou-Jen Ko
  • Zhiyuan Li
  • Samuel Midkiff

The most attractive feature of field-programmable gate arrays (FPGAs) is their configuration flexibility. However, if the configuration is performed manually, this flexibility places a heavy burden on system designers to choose among a vast number of configuration parameters and program transformations. In this paper, we improve the state-of-the-art with two main innovations: First, we apply compiler-automated transformations to the data layout and program statements to create streaming accesses. Such accesses are turned into streaming interfaces when the kernels are implemented in hardware, allowing the kernels to run efficiently. Second, we use two-step mixed integer programming to first minimize the execution time and then to minimize energy dissipation. Configuration parameters are chosen automatically, including several important ones omitted by existing models. Experimental results demonstrate significant performance gains and energy savings using these techniques.

This work is sponsored in part by the National Science Foundation (Grant 1533822).

Design and optimization of edge computing distributed neural processor for biomedical rehabilitation with sensor fusion

  • Kofi Otseidu
  • Tianyu Jia
  • Joshua Bryne
  • Levi Hargrove
  • Jie Gu

Modern biomedical devices use sensor fusion techniques to improve the classification accuracy of users' motion intent for rehabilitation applications. The design of the motion classifier faces significant challenges due to the large number of channels and stringent communication latency requirements. This paper proposes an edge-computing distributed neural processor to effectively reduce the data traffic and physical wiring congestion. A special local and global networking architecture is introduced to significantly reduce traffic among multiple chips in edge computing. To optimize the design space of the features selected, a systematic design methodology is proposed. A novel mixed-signal feature extraction approach with the assistance of neural network distortion recovery is also provided to significantly reduce the silicon area. A 12-channel 55nm CMOS test chip was implemented to demonstrate the proposed systematic design methodology. Measurements show that the test chip consumes only 20uW of power, more than 10,000X less than the current clinically used microprocessor, and can perform edge-computing networking operations within 5ms.

Area-efficient and low-power face-to-face-bonded 3D liquid state machine design

  • Bon Woong Ku
  • Yu Liu
  • Yingyezhe Jin
  • Peng Li
  • Sung Kyu Lim

As small-form-factor and low-power end devices matter in the cloud networking and Internet-of-Things era, bio-inspired neuromorphic architectures have recently attracted great attention in the hope of reaching the energy efficiency of brain functions. Among the promising solutions, the liquid state machine (LSM), which consists of randomly and recurrently connected reservoir neurons and trainable readout neurons, has shown great promise in delivering brain-inspired computing power. In this work, we apply the state-of-the-art face-to-face (F2F)-bonded 3D IC flow named Compact-2D [4] to the LSM processor design, and study the power-area-accuracy benefits of 3D LSM ICs targeting next-generation commercial-grade neuromorphic computing platforms. First, we analyze how the size and connection density of a reservoir in the LSM architecture affect the learning performance using a real-world speech recognition benchmark. We also explore how much power-area design overhead must be paid to enable better classification accuracy. Based on the power-area-accuracy trade-off, we implement an F2F-bonded 3D LSM IC using the optimal LSM architecture, and finally show that 3D integration benefits the LSM processor design in practice with large form-factor and power savings while preserving the best learning performance.

DIMA: a depthwise CNN in-memory accelerator

  • Shaahin Angizi
  • Zhezhi He
  • Deliang Fan

In this work, we first propose a deep depthwise Convolutional Neural Network (CNN) structure, called Add-Net, which uses binarized depthwise separable convolution to replace conventional spatial convolution. In Add-Net, the computationally expensive convolution operations (i.e., Multiplication and Accumulation) are converted into hardware-friendly Addition operations. We meticulously investigate and analyze Add-Net’s performance (i.e., accuracy, parameter size and computational cost) in an object recognition application compared to a traditional baseline CNN using the popular large-scale ImageNet dataset. Accordingly, we propose a Depthwise CNN In-Memory Accelerator (DIMA) based on SOT-MRAM computational sub-arrays to efficiently accelerate Add-Net within non-volatile MRAM. Our device-to-architecture co-simulation results show that, with almost the same inference accuracy as the baseline CNN on different datasets, DIMA can obtain ~1.4× better energy efficiency and 15.7× speedup compared to ASICs, and ~1.6× better energy efficiency and 5.6× speedup over the best processing-in-DRAM accelerators.
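The conversion of multiply-accumulate into additions can be illustrated with a generic sketch. It assumes weights binarized to +1 or -1, which is a common binarization scheme and an assumption here, since the abstract does not spell out the Add-Net encoding. The function names `mac_dot` and `add_only_dot` are hypothetical.

```python
def mac_dot(inputs, weights):
    """Reference dot product using multiply-accumulate."""
    return sum(x * w for x, w in zip(inputs, weights))


def add_only_dot(inputs, sign_bits):
    """Dot product with weights in {+1, -1}, using only add and subtract.

    sign_bits[i] is True for a +1 weight and False for a -1 weight
    (assumed encoding, chosen for illustration).
    """
    acc = 0
    for x, positive in zip(inputs, sign_bits):
        acc = acc + x if positive else acc - x
    return acc


inputs = [3, -1, 4, 2]
weights = [1, -1, -1, 1]
sign_bits = [w > 0 for w in weights]
result = add_only_dot(inputs, sign_bits)  # same value as mac_dot(inputs, weights)
```

With binary weights, each multiplication collapses to a sign selection, so the accumulation needs no multiplier.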

Multi-channel and fault-tolerant control multiplexing for flow-based microfluidic biochips

  • Ying Zhu
  • Bing Li
  • Tsung-Yi Ho
  • Qin Wang
  • Hailong Yao
  • Robert Wille
  • Ulf Schlichtmann

Continuous flow-based biochips are one of the promising platforms used in biochemical and pharmaceutical laboratories due to their efficiency and low costs. Inside such a chip, fluid volumes of nanoliter size are transported between devices for various operations, such as mixing and detection. The transportation channels and corresponding operation devices are controlled by microvalves driven by external pressure sources. Since assigning an independent pressure source to every microvalve would be impractical due to high costs and limited system dimensions, states of microvalves are switched using a control logic by time multiplexing. Existing control logic designs, however, still switch only a single control channel per operation — leading to a low efficiency. In this paper, we propose the first automatic synthesis approach for a control logic that is able to switch multiple control channels simultaneously to reduce the overall switching time of valve states. In addition, we propose the first fault-aware design in control logic to introduce redundant control paths to maintain the correct function even when manufacturing defects occur. Compared with the existing direct connection method, the proposed multi-channel switching mechanism can reduce the switching time of valve states by up to 64%. In addition, all control paths for fault tolerance have been realized.

Multi-physics-based FEM analysis for post-voiding analysis of electromigration failure effects

  • Hengyang Zhao
  • Sheldon Tan

In this paper, we propose a new multi-physics finite element method (FEM) based analysis method for void growth simulation of confined copper interconnects. This new method for the first time considers three important physics simultaneously in the EM failure process and their time-varying interactions: the hydrostatic stress in the confined interconnect wire, the current density, and the Joule heating induced temperature. As a result, we end up solving a set of coupled partial differential equations consisting of the stress diffusion equation (Korhonen’s equation), the phase field equation (for modeling void boundary movement), the Laplace equation for current density, and the heat diffusion equation for Joule heating and wire temperature. In the new method, we show that each of the physics has different physical domains and differential boundary conditions, and how such coupled multi-physics transient analysis is carried out based on FEM with the different time scales properly handled. Experimental results show that by considering all three coupled physics (the stress, current density, and temperature) and their transient behaviors, the proposed FEM EM solver can predict the unique transient wire resistance change pattern for copper interconnect wires, which is well observed in published experimental data. We also show that the simulated void growth speed is less conservative than that of a recently proposed compact EM model.

Estimating and optimizing BTI aging effects: from physics to CAD

  • Hussam Amrouch
  • Victor M. van Santen
  • Jörg Henkel

Transistor aging due to Bias Temperature Instability (BTI) is a crucial degradation that affects the reliability of circuits over time. Aging-aware circuit design flows virtually do not exist yet, and even research is in its infancy. In this work, we demonstrate how the deleterious effects of BTI-induced degradations can be modeled from physics, where they do occur, all the way up to the system level, where they finally take place and affect the delay and power of circuits. To achieve that, we create degradation-aware cell libraries that properly capture the impact of BTI not only on the delay of standard cells but also on their static and dynamic power. Unlike the state of the art, which solely models the impact of BTI on the threshold voltage of transistors (Vth), we are the first to model the other key transistor parameters degraded by BTI, such as carrier mobility (μ), sub-threshold slope (SS), and gate-drain capacitance (Cgd).

Our cell libraries are compatible with existing commercial CAD tools. Employing the mature algorithms in such tools enables designers, after importing our cell libraries, to accurately estimate the overall impact of aging on the delay and/or power of any circuit, regardless of its complexity. We demonstrate that ΔVth alone (as done in the state of the art) is insufficient to correctly model the impact of BTI on either the delay or the power of circuits. On the one hand, neglecting BTI-induced μ and Cgd degradations leads to underestimating the impact that BTI has on increasing the delay of circuits. Hence, designers will employ narrower timing guardbands with which the reliability of circuits during their lifetime cannot be sustained. On the other hand, neglecting BTI-induced SS degradation leads to overestimating the impact that BTI has on static power reduction. Hence, the potential benefit of circuits from BTI will be exaggerated.

PVT2: process, voltage, temperature and time-dependent variability in scaled CMOS process

  • A. K. M. Mahfuzul Islam
  • Hidetoshi Onodera

In addition to the conventional PVT (Process, Voltage and Temperature) variation, time-dependent current fluctuation such as random telegraph noise (RTN) poses a new challenge to VLSI reliability. In this paper, we show that, in contrast to static random variation, the RTN amplitude of a particular device is not constant across supply voltages and temperatures. A device may show large RTN amplitude at one operating condition and small amplitude at another. As a result, the RTN amplitude distribution becomes uncorrelated across a wide range of voltage and temperature. The emergence of this uncorrelated distribution causes significant degradation of worst-case values. Analysis results based on variability models from a 65 nm silicon-on-insulator process show that uncorrelated RTN degrades the worst-case threshold voltage value significantly compared with the case where RTN is not considered. Delay variation analysis shows that considering RTN in the statistical analysis has little impact at high supply voltage. However, at low voltage operation, RTN can degrade the worst-case value by more than 5%.

Performance and accuracy in soft-error resilience evaluation using the multi-level processor simulator ETISS-ML

  • Daniel Mueller-Gritschneder
  • Uzair Sharif
  • Ulf Schlichtmann

Soft errors are a major safety concern in many devices, e.g., in automotive, industrial, control or medical applications. Ideally, safety-critical systems should be resilient against the impact of soft errors, but at a low cost. This requires evaluating the soft error resilience, which is typically done by extensive fault injection.

In this paper, we present ETISS-ML, a multi-level processor simulator, which achieves both accuracy and performance for fault simulation by intelligently switching the level of abstraction between an Instruction Set Simulator (ISS) and an RTL simulator. For a given software testcase and fault scenario, the software is first executed in ISS-mode until shortly before the fault injection. Then ETISS-ML switches to RTL-mode for accurate fault simulation. Once the impact of the fault has propagated completely out of the processor’s micro-architecture, the simulation can switch back to ISS-mode. This paper describes the methods needed to preserve accuracy during both of these switches. Experimental results show that ETISS-ML obtains near-ISS performance with RTL accuracy. It is also shown that ETISS-ML can be used as the processor model in SystemC / TLM virtual prototypes (VPs) and, hence, allows the impact of soft errors to be investigated at system level.

Computer-aided design for quantum computation

  • Robert Wille
  • Austin Fowler
  • Yehuda Naveh

Quantum computation is currently moving from an academic idea to a practical reality. The recent past has seen tremendous progress in the physical implementation of corresponding quantum computers, also involving big players such as IBM, Google, Intel, Rigetti, Microsoft, and Alibaba. These devices promise substantial speedups over conventional computers for applications like quantum chemistry, optimization, machine learning, cryptography, quantum simulation, and systems of linear equations. The Computer-Aided Design and Verification (jointly referred to as CAD) community needs to be ready for this revolutionizing new technology. While research on automatic design methods for quantum computers is currently underway, there is still far too little coordination between the CAD community and the quantum computation community. Consequently, many CAD approaches proposed in the past have either addressed the wrong problems or failed to reach the end users. In this summary paper, we provide a glimpse into both sides. To this end, we review and discuss selected accomplishments from the CAD domain as well as open challenges within the quantum domain. These examples showcase the recent state of the art but also outline the remaining work left to be done in both communities.

PolyCleaner: clean your polynomials before backward rewriting to verify million-gate multipliers

  • Alireza Mahzoon
  • Daniel Große
  • Rolf Drechsler

Nowadays, a variety of multipliers are used in different computationally intensive industrial applications. Most of these multipliers are highly parallelized and structurally complex. Therefore, the existing formal verification techniques fail to verify them.

In recent years, formal multiplier verification based on Symbolic Computer Algebra (SCA) has shown superior results in comparison to all other existing proof techniques. However, for non-trivial architectures a monomial explosion can still be observed. A common understanding is that this is caused by redundant monomials, also known as vanishing monomials. While several approaches have been proposed to overcome the explosion, the problem itself is still not fully understood.

In this paper we present a new theory for the origin of vanishing monomials and how they can be handled to prevent the explosion during backward rewriting. We implement our new approach as the SCA-verifier PolyCleaner. The experimental results show the efficiency of our proposed method in verification of non-trivial million-gate multipliers.

A formal instruction-level GPU model for scalable verification

  • Yue Xing
  • Bo-Yuan Huang
  • Aarti Gupta
  • Sharad Malik

GPUs have been widely used to accelerate big-data inference applications and scientific computing through their parallelized hardware resources and programming model. Their extreme parallelism increases the possibility of bugs such as data races and un-coalesced memory accesses, and thus verifying program correctness is critical. State-of-the-art GPU program verification efforts mainly focus on analyzing application-level programs, e.g., in C, and suffer from the following limitations: (1) high false-positive rate due to coarse-grained abstraction of synchronization primitives, (2) high complexity of reasoning about pointer arithmetic, and (3) keeping up with an evolving API for developing application-level programs.

In this paper, we address these limitations by modeling GPUs and reasoning about programs at the instruction level. We formally model the Nvidia GPU at the Parallel Thread Execution (PTX) level using the recently proposed Instruction-Level Abstraction (ILA) model for accelerators. PTX is analogous to the Instruction-Set Architecture (ISA) of a general-purpose processor. Our formal ILA model of the GPU includes non-synchronization instructions as well as all synchronization primitives, enabling us to verify multithreaded programs. We demonstrate the applicability of our ILA model in scalable GPU program verification of data-race checking. The evaluation shows that our checker outperforms state-of-the-art GPU data race checkers with fewer false positives and improved scalability.

Fast FPGA emulation of analog dynamics in digitally-driven systems

  • Steven Herbst
  • Byong Chan Lim
  • Mark Horowitz

In this paper, we propose an architecture for FPGA emulation of mixed-signal systems that achieves high accuracy at a high throughput. We represent the analog output of a block as a superposition of step responses to changes in its analog input, and the output is evaluated only when needed by the digital subsystem. Our architecture is therefore intended for digitally-driven systems; that is, those in which the inputs of analog dynamical blocks change only on digital clock edges. We implemented a high-speed link transceiver design using the proposed architecture on a Xilinx FPGA. This design demonstrates how our approach breaks the link between simulation rate and time resolution that is characteristic of prior approaches. The emulator is flexible, allowing for the real-time adjustment of analog dynamics, clock jitter, and various design parameters. We demonstrate that our architecture achieves 1% accuracy while running 3 orders of magnitude faster than a comparable high-performance CPU simulation.
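The step-response superposition idea can be sketched in a few lines. The example below assumes a first-order low-pass block with time constant `tau`, zero initial state, and an input that is piecewise constant between events; these modeling choices and the names `step_response` and `output_at` are illustrative assumptions, not details taken from the paper or its FPGA implementation.

```python
import math


def step_response(t, tau):
    """Unit step response of a first-order low-pass block (assumed model)."""
    return 1.0 - math.exp(-t / tau) if t >= 0.0 else 0.0


def output_at(t, input_changes, tau):
    """Evaluate the analog output at time t only, by superposition.

    input_changes is a time-sorted list of (time, new_input_value); the
    input is taken to be 0 before the first change.
    """
    y = 0.0
    previous = 0.0
    for t_k, value in input_changes:
        if t_k > t:
            break  # later input changes cannot affect the output at t
        # Each input change contributes a scaled, delayed step response.
        y += (value - previous) * step_response(t - t_k, tau)
        previous = value
    return y


# Input steps to 1.0 at t=0 and back to 0.0 at t=1; sample once at t=2.
sample = output_at(2.0, [(0.0, 1.0), (1.0, 0.0)], tau=1.0)
```

Because the output is computed in closed form at the requested instant, no fine-grained time stepping between digital clock edges is needed, which is the decoupling of simulation rate from time resolution the abstract describes.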

SPN dash: fast detection of adversarial attacks on mobile via sensor pattern noise fingerprinting

  • Kent W. Nixon
  • Jiachen Mao
  • Juncheng Shen
  • Huanrui Yang
  • Hai (Helen) Li
  • Yiran Chen

A concerning weakness of deep neural networks is their susceptibility to adversarial attacks. While methods exist to detect these attacks, they incur significant drawbacks, ignoring external features which could aid in the task of attack detection. In this work, we propose SPN Dash, a method for detection of adversarial attacks based on integrity of sensor pattern noise embedded in submitted images. Through experiment, we show that our SPN Dash method is capable of detecting the addition of adversarial noise with up to 94% accuracy for images of size 256×256. Analysis shows that SPN Dash is robust to image scaling techniques, as well as a small amount of image compression. This performance is on par with state of the art neural network-based detectors, while incurring an order of magnitude less computational and memory overhead.

Watermarking deep neural networks for embedded systems

  • Jia Guo
  • Miodrag Potkonjak

Deep neural networks (DNNs) have become an important tool for bringing intelligence to mobile and embedded devices. The increasingly wide deployment, sharing and potential commercialization of DNN models create a compelling need for intellectual property (IP) protection. Recently, DNN watermarking has emerged as a plausible IP protection method. Enabling DNN watermarking on embedded devices in a practical setting requires a black-box approach. Existing DNN watermarking frameworks either fail to meet the black-box requirement or are susceptible to several forms of attacks. We propose a watermarking framework that incorporates the author’s signature in the process of training DNNs. While functioning normally in regular cases, the resulting watermarked DNN behaves in a different, predefined pattern when given any signed inputs, thus proving the authorship. We demonstrate an example implementation of the framework on popular image classification datasets and show that strong watermarks can be embedded in the models.

DeepFense: online accelerated defense against adversarial deep learning

  • Bita Darvish Rouhani
  • Mohammad Samragh
  • Mojan Javaheripi
  • Tara Javidi
  • Farinaz Koushanfar

Recent advances in adversarial Deep Learning (DL) have opened up a largely unexplored surface for malicious attacks jeopardizing the integrity of autonomous DL systems. With the wide-spread usage of DL in critical and time-sensitive applications, including unmanned vehicles, drones, and video surveillance systems, online detection of malicious inputs is of utmost importance. We propose DeepFense, the first end-to-end automated framework that simultaneously enables efficient and safe execution of DL models. DeepFense formalizes the goal of thwarting adversarial attacks as an optimization problem that minimizes the rarely observed regions in the latent feature space spanned by a DL network. To solve the aforementioned minimization problem, a set of complementary but disjoint modular redundancies are trained to validate the legitimacy of the input samples in parallel with the victim DL model. DeepFense leverages hardware/software/algorithm co-design and customized acceleration to achieve just-in-time performance in resource-constrained settings. The proposed countermeasure is unsupervised, meaning that no adversarial sample is leveraged to train modular redundancies. We further provide an accompanying API to reduce the non-recurring engineering cost and ensure automated adaptation to various platforms. Extensive evaluations on FPGAs and GPUs demonstrate up to two orders of magnitude performance improvement while enabling online adversarial sample detection.

Enabling deep learning at the IoT edge

  • Liangzhen Lai
  • Naveen Suda

Deep learning algorithms have demonstrated super-human capabilities in many cognitive tasks, such as image classification and speech recognition. As a result, there is an increasing interest in deploying neural networks (NNs) on low-power processors found in always-on systems, such as those based on Arm Cortex-M microcontrollers. In this paper, we discuss the challenges of deploying neural networks on microcontrollers with limited memory, compute resources and power budgets. We introduce CMSIS-NN, a library of optimized software kernels to enable deployment of NNs on Cortex-M cores. We also present techniques for NN algorithm exploration to develop light-weight models suitable for resource constrained systems, using keyword spotting as an example.

Searching toward pareto-optimal device-aware neural architectures

  • An-Chieh Cheng
  • Jin-Dong Dong
  • Chi-Hung Hsu
  • Shu-Huan Chang
  • Min Sun
  • Shih-Chieh Chang
  • Jia-Yu Pan
  • Yu-Ting Chen
  • Wei Wei
  • Da-Cheng Juan

Recent breakthroughs in Neural Architectural Search (NAS) have achieved state-of-the-art performance in many tasks such as image classification and language understanding. However, most existing works only optimize for model accuracy and largely ignore other important factors imposed by the underlying hardware and devices, such as latency and energy, when performing inference. In this paper, we first introduce the problem of NAS and provide a survey of recent works. Then we take a deep dive into two recent advancements in extending NAS to multi-objective frameworks: MONAS [19] and DPP-Net [10]. Both MONAS and DPP-Net are capable of optimizing accuracy and other objectives imposed by devices, searching for neural architectures that can be best deployed on a wide spectrum of devices: from embedded systems and mobile devices to workstations. Experimental results are poised to show that architectures found by MONAS and DPP-Net achieve Pareto optimality w.r.t. the given objectives for various devices.
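As a reminder of what Pareto optimality means in this setting, the sketch below filters a set of candidate architectures down to the non-dominated ones. The objective pairs (error, latency in ms), the candidate names, and the functions `dominates` and `pareto_front` are made up for illustration; they are not results or APIs from MONAS or DPP-Net.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on one.

    Objectives are tuples where smaller is better (e.g. error, latency).
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the sorted names of candidates that no other one dominates."""
    return sorted(
        name for name, objs in candidates.items()
        if not any(dominates(other, objs)
                   for other_name, other in candidates.items()
                   if other_name != name)
    )


# Hypothetical (error, latency_ms) pairs for four architectures.
candidates = {
    "A": (0.10, 50.0),
    "B": (0.08, 80.0),
    "C": (0.12, 60.0),  # worse than A on both objectives
    "D": (0.08, 90.0),  # same error as B but slower
}
front = pareto_front(candidates)
```

Here A and B remain on the front because neither beats the other on both objectives, while C and D are each dominated by another candidate.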

Hardware-aware machine learning: modeling and optimization

  • Diana Marculescu
  • Dimitrios Stamoulis
  • Ermao Cai

Recent breakthroughs in Machine Learning (ML) applications, and especially in Deep Learning (DL), have made DL models a key component in almost every modern computing system. The increased popularity of DL applications deployed on a wide spectrum of platforms (from mobile devices to datacenters) has resulted in a plethora of design challenges related to the constraints introduced by the hardware itself. “What is the latency or energy cost for an inference made by a Deep Neural Network (DNN)?” “Is it possible to predict this latency or energy consumption before a model is even trained?” “If yes, how can machine learners take advantage of these models to design the hardware-optimal DNN for deployment?” From lengthening battery life of mobile devices to reducing the runtime requirements of DL models executing in the cloud, the answers to these questions have drawn significant attention.

One cannot optimize what isn’t properly modeled. Therefore, it is important to understand the hardware efficiency of DL models during serving for making an inference, before even training the model. This key observation has motivated the use of predictive models to capture the hardware performance or energy efficiency of ML applications. Furthermore, ML practitioners are currently challenged with the task of designing the DNN model, i.e., of tuning the hyper-parameters of the DNN architecture, while optimizing for both accuracy of the DL model and its hardware efficiency. Therefore, state-of-the-art methodologies have proposed hardware-aware hyper-parameter optimization techniques. In this paper, we provide a comprehensive assessment of state-of-the-art work and selected results on the hardware-aware modeling and optimization for ML applications. We also highlight several open questions that are poised to give rise to novel hardware-aware designs in the next few years, as DL applications continue to significantly impact associated hardware systems and platforms.

DAC 2019 TOC

LAMA: Link-Aware Hybrid Management for Memory Accesses in Emerging CPU-FPGA Platforms

  • Liang Feng
  • Jieru Zhao
  • Tingyuan Liang
  • Sharad Sinha
  • Wei Zhang

To satisfy increasing computing demands, heterogeneous computing platforms are gaining attention, especially CPU-FPGA platforms. Recently, emerging tightly coupled CPU-FPGA platforms with shared coherent caches (such as the Intel HARP and IBM POWER with CAPI) have been proposed to facilitate data communication and simplify the programming model. In this work, we propose LAMA, a framework combining static analysis and dynamic control for memory access management in such platforms, to further enhance memory access efficiency and maintain data consistency. Based on implementation results on the real Intel HARP2 platform, LAMA is shown to improve performance by 34% on average with low overhead.

Thread Weaving: Static Resource Scheduling for Multithreaded High-Level Synthesis

  • Hsuan Hsiao
  • Jason Anderson

In high-level synthesis (HLS), software multithreading constructs can be used to explicitly specify coarse-grained parallelism for multiple accelerators. While software threads typically operate independently and in isolation from each other on CPUs, HLS threads/accelerators are sub-components of one circuit. Since these components generally reside in the same clock domain, we can schedule their execution statically to avoid shared-resource contention among threads. We propose thread weaving, a technique that statically interleaves requests from different threads through scheduling constraints. With the guarantee of a contention-free schedule, we eliminate replication/arbitration of shared resources, reducing the area footprint of the circuit and improving its maximum operating frequency (Fmax).

Exact and Heuristic Allocation of Multi-kernel Applications to Multi-FPGA Platforms

  • Junnan Shan
  • Mario R. Casu
  • Jordi Cortadella
  • Luciano Lavagno
  • Mihai T. Lazarescu

FPGA-based accelerators demonstrated high energy efficiency compared to GPUs and CPUs. However, single FPGA designs may not achieve sufficient task parallelism. In this work, we optimize the mapping of high-performance multi-kernel applications, like Convolutional Neural Networks, to multi-FPGA platforms. First, we formulate the system level optimization problem, choosing within a huge design space the parallelism and number of compute units for each kernel in the pipeline. Then we solve it using a combination of Geometric Programming, producing the optimum performance solution given resource and DRAM bandwidth constraints, and a heuristic allocator of the compute units on the FPGA cluster.

A Flat Timing-Driven Placement Flow for Modern FPGAs

  • Timothy Martin
  • Dani Maarouf
  • Ziad Abuowaimer
  • Abeer Alhyari
  • Gary Grewal
  • Shawki Areibi

In this paper, we propose a novel, flat analytic timing-driven placer without explicit packing for Xilinx UltraScale FPGA devices. Our work uses novel methods to simultaneously optimize for timing, wirelength and congestion throughout the global and detailed placement stages. We evaluate the effectiveness of the flat placer on the ISPD 2016 benchmark suite for the xcvu095 UltraScale device, as well as on industrial benchmarks. Experimental results show that on average, FTPlace achieves an 8% increase in maximum clock rate, an 18% decrease in routed wirelength, and produces placements that require 80% less time to route when compared to Xilinx Vivado 2018.1.

Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation Aware Neural Architecture Search

  • Weiwen Jiang
  • Xinyi Zhang
  • Edwin H.-M. Sha
  • Lei Yang
  • Qingfeng Zhuge
  • Yiyu Shi
  • Jingtong Hu

A fundamental question lies in almost every application of deep neural networks: what is the optimal neural architecture given a specific data set? Recently, several Neural Architecture Search (NAS) frameworks have been developed that use reinforcement learning and evolutionary algorithms to search for the solution. However, most of them take a long time to find the optimal architecture due to the huge search space and the lengthy training process needed to evaluate each candidate. In addition, most of them aim at accuracy only and do not take into consideration the hardware that will be used to implement the architecture. This can lead to excessive latencies beyond specifications, rendering the resulting architectures useless. To address both issues, in this paper we use Field Programmable Gate Arrays (FPGAs) as a vehicle to present a novel hardware-aware NAS framework, namely FNAS, which will provide an optimal neural architecture with latency guaranteed to meet the specification. In addition, with a performance abstraction model to analyze the latency of neural architectures without training, our framework can quickly prune architectures that do not satisfy the specification, leading to higher efficiency. Experimental results on common data sets such as ImageNet show that in the cases where the state-of-the-art generates architectures with latencies 7.81× longer than the specification, those from FNAS can meet the specs with less than 1% accuracy loss. Moreover, FNAS also achieves up to 11.13× speedup for the search process. To the best of the authors’ knowledge, this is the very first hardware-aware NAS.

CANN: Curable Approximations for High-Performance Deep Neural Network Accelerators

  • Muhammad Abdullah Hanif
  • Faiq Khalid
  • Muhammad Shafique

Approximate Computing (AC) has emerged as a means for improving the performance, area and power-/energy-efficiency of a digital design at the cost of output quality degradation. Applications like machine learning (e.g., using deep neural networks, DNNs) are highly computationally intensive and, therefore, can significantly benefit from AC and specialized accelerators. However, the accuracy loss introduced because of approximations in the DNN accelerator hardware can result in undesirable results. This paper presents a novel method to design high-performance DNN accelerators where approximation error(s) from one stage/part of the design is “completely” compensated in the subsequent stage/part while offering significant efficiency gains. Towards this, the paper also presents a case-study for improving the performance of systolic array-based hardware architectures, which are commonly used for accelerating state-of-the-art deep learning algorithms.

Successive Log Quantization for Cost-Efficient Neural Networks Using Stochastic Computing

  • Sugil Lee
  • Hyeonuk Sim
  • Jooyeon Choi
  • Jongeun Lee

Despite the multifaceted benefits of stochastic computing (SC) such as low cost, low power, and flexible precision, SC-based deep neural networks (DNNs) still suffer from the long-latency problem, especially for those with high precision requirements. While log quantization can be of help, it has its own accuracy-saturation problem due to uneven precision distribution. In this paper we propose successive log quantization (SLQ), which extends log quantization with significant improvements in precision and accuracy, and apply it to state-of-the-art SC-DNNs. SLQ reuses the existing datapath of log quantization, and thus retains its advantages such as simple multiplier hardware. Our experimental results demonstrate that our SLQ can significantly extend both the accuracy and efficiency of SC-DNNs over the state-of-the-art solutions, including linear-quantized and log-quantized SC-DNNs, achieving less than 1~1.5%p accuracy drop for AlexNet, SqueezeNet, and VGG-S at mere 4~5-bit weight resolution.
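The idea of log quantization mentioned above can be illustrated with a small sketch. Plain log quantization rounds each value to the nearest signed power of two, so a multiplication reduces to a shift, but the gaps between representable values grow with magnitude. The `successive_log_quantize` function below is only one plausible reading of "successive" (re-quantizing the residual with the same rounding step); it is an assumption for illustration, not the definition used in the paper, and the function names and exponent range are hypothetical.

```python
import math


def log_quantize(x, min_exp=-8, max_exp=0):
    """Round x to the nearest signed power of two (nearest in the log domain)."""
    if x == 0:
        return 0.0
    exp = round(math.log2(abs(x)))
    exp = max(min_exp, min(max_exp, exp))  # clamp to the representable range
    return math.copysign(2.0 ** exp, x)


def successive_log_quantize(x, terms=2, min_exp=-8, max_exp=0):
    """Assumed scheme: approximate x by a sum of power-of-two terms.

    Each step log-quantizes the residual left by the previous steps,
    so the same quantizer is reused for every term.
    """
    approx = 0.0
    residual = x
    for _ in range(terms):
        q = log_quantize(residual, min_exp, max_exp)
        approx += q
        residual -= q
    return approx


if __name__ == "__main__":
    w = 0.3
    print(log_quantize(w))             # single power-of-two term
    print(successive_log_quantize(w))  # two power-of-two terms
```

With a single term, 0.3 maps to 0.25; adding a second term recovers part of the lost precision while every term remains a power of two.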

ARGA: Approximate Reuse for GPGPU Acceleration

  • Daniel Peroni
  • Mohsen Imani
  • Hamid Nejatollahi
  • Nikil Dutt
  • Tajana Rosing

Many data-driven applications including computer vision, speech recognition, and medical diagnostics show tolerance to error during computation. These applications are often accelerated on GPUs, but high computational costs limit performance and increase energy usage. In this paper, we present ARGA, an approximate computing technique capable of accelerating GPGPU applications. ARGA provides an approximate lookup table to GPGPU cores to avoid recomputing instructions with identical or similar values. We propose multi-table parallel lookup which enables computational reuse to significantly speed up GPGPU computation by checking incoming instructions in parallel. The inputs of each operation are searched for in a lookup table. Matches resulting in an exact or low error are removed from the floating point pipeline and used directly as output. Matches producing highly inaccurate results are computed on exact hardware to minimize application error. We simulate our design by placing ARGA within each core of an Nvidia Kepler Architecture Titan and an AMD Southern Island 7970. We show our design improves performance throughput by up to 2.7× and improves EDP by 5.3× for 6 GPGPU applications while maintaining less than 5% output error. We also show ARGA accelerates inference of a LeNet NN by 2.1× and improves EDP by 3.7× without significantly impacting classification accuracy.
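A software analogy may help make the reuse idea concrete. The sketch below memoizes an operation on coarsely quantized inputs, so that calls with identical or nearby operands return a stored result instead of being recomputed. It is a simplified illustration under stated assumptions: the class name, the `step` quantization granularity, and the capacity limit are hypothetical, and the hardware design in the paper (multi-table parallel lookup, error-based fallback to exact units) is not modeled.

```python
class ApproxReuseTable:
    """Memoize func on quantized operands (illustrative software analogy)."""

    def __init__(self, func, step=0.01, capacity=256):
        self.func = func
        self.step = step          # operands closer than ~step share an entry
        self.capacity = capacity  # maximum number of stored results
        self.table = {}
        self.hits = 0
        self.misses = 0

    def _key(self, args):
        # Quantize every operand so that similar inputs map to the same key.
        return tuple(round(a / self.step) for a in args)

    def __call__(self, *args):
        key = self._key(args)
        if key in self.table:
            self.hits += 1
            return self.table[key]      # reuse: skip the computation
        self.misses += 1
        result = self.func(*args)       # exact computation on a miss
        if len(self.table) < self.capacity:
            self.table[key] = result
        return result


if __name__ == "__main__":
    mul = ApproxReuseTable(lambda a, b: a * b, step=0.01)
    print(mul(1.000, 2.000))  # computed exactly and stored
    print(mul(1.001, 2.001))  # similar operands: stored result is reused
    print(mul.hits, mul.misses)
```

A larger `step` raises the hit rate at the cost of a larger worst-case error, which is the trade-off an approximate reuse scheme has to tune.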

Assessing the Adherence of an Industrial Autonomous Driving Framework to ISO 26262 Software Guidelines

  • Hamid Tabani
  • Leonidas Kosmidis
  • Jaume Abella
  • Francisco J. Cazorla
  • Guillem Bernat

The complexity and size of Autonomous Driving (AD) software are comparatively higher than those of software implementing other (standard) functionalities in the car. To make things worse, a big fraction of AD software is not specifically designed for the automotive (or any other critical) domain, but for the mainstream market. This brings uncertainty about the extent to which AD software adheres to guidelines in safety standards. In this paper, we present our experience in applying the software safety guidelines of ISO 26262 (the applicable functional safety standard for road vehicles) to industrial AD software, in particular, Apollo, a heterogeneous Autonomous Driving framework used extensively in industry. We provide quantitative and qualitative metrics of compliance for many ISO 26262 recommendations on software design, implementation, and testing.

Tighter Dimensioning of Heterogeneous Multi-Resource Autonomous CPS with Control Performance Guarantees

  • Debayan Roy
  • Wanli Chang
  • Sanjoy K. Mitter
  • Samarjit Chakraborty

In modern autonomous systems, there is typically a large number of connected components realizing complex functionalities. For example, in autonomous vehicles (AVs), there are tens of millions of lines of code implemented on hundreds of sensors, controllers, and actuators. AVs have been deployed, mostly in trials and restricted environments, showing that substantial progress has been made in functionality development. However, they are still faced with two major challenges: (i) performance guarantee of safety-critical functions under all possible scenarios; (ii) functionality implementation with limited resources. These two challenges are conflicting because safety guarantees necessitate a worst-case analysis that is often very pessimistic for complex hardware/software systems, and thus require more resources. To address this, we study an abstraction of a heterogeneous cyber-physical system architecture consisting of a mix of high- and low-quality resources, such as time- and event-triggered resources, or wired and wireless resources. We show that by properly managing such a mix of resources and formulating a formal verification (model checking) problem, it is possible to tightly dimension the high-quality resource to the minimum (50% in certain cases) while providing control performance guarantees.

Dynamic Switching Speed Reconfiguration for Engine Performance Optimization

  • Chao Peng
  • Yecheng Zhao
  • Haibo Zeng

Today’s automotive engine control systems adopt several control strategies that come with tradeoffs between computational load and performance. The current practice is that the switching speeds at which the engine control system changes control strategy are fixed offline, typically based on the average driving need in a standard driving cycle (i.e., vehicle speed profile over time). This is clearly suboptimal since it fails to capture the variation in the driving cycle, and the actual driving cycle may be considerably different from the standard one. In this paper, we propose to dynamically adjust switching speeds based on the predicted driving cycle. We develop a hybrid set of schedulability analysis techniques to tame the complexity of ensuring the real-time schedulability of engine control tasks. We design an effective and efficient optimization algorithm that provides close-to-optimal solutions. Experimental results demonstrate that our approach efficiently finds dynamic switching speeds that significantly improve engine performance over static ones.

A Memory-Efficient Markov Decision Process Computation Framework Using BDD-based Sampling Representation

  • He Zhou
  • Sunil P. Khatri
  • Jiang Hu
  • Frank Liu

Although Markov Decision Process (MDP) has wide applications in autonomous systems as a core model in Reinforcement Learning, a key bottleneck is the large memory utilization of the state transition probability matrices. This is particularly problematic for computational platforms with limited memory, or for Bayesian MDP, which requires dozens of such matrices. To mitigate this difficulty, we propose a highly memory-efficient representation for probability matrices using Binary Decision Diagram (BDD) based sampling, and develop a corresponding (Bayesian/classical) MDP solver on a CPU-GPU platform. Simulation results indicate our approach reduces memory by one and two orders of magnitude for Bayesian/classical MDP, respectively.

Process, Circuit and System Co-optimization of Wafer Level Co-Integrated FinFET with Vertical Nanosheet Selector for STT-MRAM Applications

  • Trong Huynh-Bao
  • Anabela Veloso
  • Sushil Sakhare
  • Philippe Matagne
  • Julien Ryckaert
  • Manu Perumkunnil
  • Davide Crotti
  • Farrukh Yasin
  • Alessio Spessot
  • Arnaud Furnemont
  • Gouri Kar
  • Anda Mocuta

We present for the first time a co-integrated FinFET with vertical nanosheet transistor (VFET) process on a 300 mm silicon wafer for STT-MRAM applications and its related avenues with a holistic design-technology-co-optimization (DTCO) and power-performance-area-cost (PPAC) approach. The STT-MRAM bitcell and a 2 Mbit macro have been optimized and designed to address the viability of the co-integration process and advantages of vertical channel transistors for STT-MRAM selectors. An architectural system simulator GEM5 has also been employed with Polybench workloads to assess energy saving at system-level. In order to enable this co-integration, four extra masks are required, which costs below 10% in embedded chips. A 36% area reduction can be achieved for the STT-MRAM bitcell implemented with VFET selectors. With a UVLT flavor, the STT-MRAM bitcell comprising 3 nanosheets could deliver the same performance as the 4-fin LVT FinFET selector. A 2 Mbit STT-MRAM macro designed with VFET selector can offer a 17% and a 21% reduction for read access latency and energy per operation respectively, and a 10% for write energy per operation. A 7% energy saving for the STT-MRAM L2 cache using VFET selector has been observed at the system level with Polybench workloads.

LL-PCM: Low-Latency Phase Change Memory Architecture

  • Nam Sung Kim
  • Choungki Song
  • Woo Young Cho
  • Jian Huang
  • Myoungsoo Jung

PCM is a promising non-volatile memory technology, as it can offer a unique trade-off between density and latency compared with DRAM and flash memory. Although PCM is much faster than flash memory, it is still notably slower than DRAM, which can significantly degrade system performance. In this paper, we analyze a PCM implementation in depth, and identify the primary cause of PCM’s long latency, i.e., a long interconnect (high resistance/capacitance) path between a cell and a sense-amp/write-driver. This in turn requires (1) a very large charge pump consuming: ~20% of PCM chip space, ~50% of latency of write operations, and ~2× more power than a write operation itself; and (2) a large current sense-amp with long time to pre-charge the interconnect path. Then, we propose Low-Latency PCM (LL-PCM) architecture. Our analysis shows that LL-PCM can give 119% higher performance and consume 43% lower memory energy than PCM for memory-intensive applications. LL-PCM is only ~1% larger than PCM, as the cost of reducing the resistance/capacitance of the interconnect path is negated by its 4.1× smaller charge pump.

What does Vibration do to Your SSD?

  • Janki Bhimani
  • Tirthak Patel
  • Ningfang Mi
  • Devesh Tiwari

Vibration generated in modern computing environments such as autonomous vehicles, edge computing infrastructure, and data center systems is an increasing concern. In this paper, we systematically measure, quantify and characterize the impact of vibration on the performance of SSD devices. Our experiments and analysis uncover that exposure to both short-term and long-term vibration, even within the vendor-specified limits, can significantly affect SSD I/O performance and reliability.

Ultra-thin Skin Electronics for High Quality and Continuous Skin-Sensor-Silicon Interfacing

  • Leilai Shao
  • Sicheng Li
  • Ting Lei
  • Tsung-Ching Huang
  • Raymond Beausoleil
  • Zhenan Bao
  • Kwang-Ting Cheng

Skin-inspired electronics emerges as a new paradigm due to the increasing demands for conformable and high-quality skin-sensor-silicon (SSS) interfacing in wearable, electronic skin and health monitoring applications. Advances in ultra-thin, flexible, stretchable and conformable materials have made skin electronics feasible. In this paper, we prototyped an active electrode (with a thickness ≤ 2 µm), which integrates the electrode with a thin-film transistor (TFT) based amplifier, to effectively suppress motion artifacts. The fabricated ultra-thin amplifier can achieve a gain of 32 dB at 20 kHz, demonstrating the feasibility of the proposed active electrode. Using atrial fibrillation (AF) detection for electrocardiogram (ECG) as an application driver, we further develop a simulation framework taking into account all elements including the skin, the sensor, the amplifier and the silicon chip. Systematic and quantitative simulation results indicate that the proposed active electrode can effectively improve the signal quality under motion noises (achieving ≥30 dB improvement in signal-to-noise ratio (SNR)), which boosts classification accuracy by more than 19% for AF detection.

Enabling High-Dimensional Bayesian Optimization for Efficient Failure Detection of Analog and Mixed-Signal Circuits

  • Hanbin Hu
  • Peng Li
  • Jianhua Z. Huang

With increasing design complexity and stringent robustness requirements in applications such as automotive electronics, analog and mixed-signal (AMS) verification becomes a key bottleneck. Rare failure detection in a high-dimensional parameter space using minimal expensive simulation data is a major challenge. We address this challenge under a Bayesian learning framework using Bayesian optimization (BO). We formulate the failure detection as a BO problem where a chosen acquisition function is optimized to select the next (set of) optimal simulation sampling point(s) such that rare failures may be detected using a small amount of data. While providing an attractive black-box solution to design verification, in practice BO is limited in its ability to deal with high-dimensional problems. We propose to use random embedding to effectively reduce the dimensionality of a given verification problem to improve both the quality of BO-based optimal sampling and computational efficiency. We demonstrate the success of the proposed approach on detecting rare design failures under high-dimensional process variations which are completely missed by competitive smart sampling and BO techniques without dimension reduction.
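The random-embedding step can be sketched independently of the optimizer. The search runs over a low-dimensional vector `y`, and a fixed random matrix `A` lifts each candidate to the full parameter space as `x = A y`, clipped to the parameter bounds. In the sketch below, plain uniform sampling stands in for the acquisition-function-driven selection of Bayesian optimization, which is a deliberate simplification. All names, the bounds of [-1, 1], and the toy `simulate` function are assumptions for illustration.

```python
import random


def make_embedding(high_dim, low_dim, seed=0):
    """Draw a fixed random Gaussian matrix A of shape high_dim x low_dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(low_dim)] for _ in range(high_dim)]


def lift(A, y, lo=-1.0, hi=1.0):
    """Map a low-dimensional point y to x = A y, clipped to the box [lo, hi]."""
    return [max(lo, min(hi, sum(a * yj for a, yj in zip(row, y)))) for row in A]


def search_worst_case(simulate, high_dim, low_dim, budget, seed=0):
    """Search the low-dimensional space for the lowest simulated margin.

    Uniform random sampling replaces the BO acquisition step here.
    """
    rng = random.Random(seed)
    A = make_embedding(high_dim, low_dim, seed)
    best = None
    for _ in range(budget):
        y = [rng.uniform(-1.0, 1.0) for _ in range(low_dim)]
        x = lift(A, y)
        margin = simulate(x)
        if best is None or margin < best[0]:
            best = (margin, x)
    return best


if __name__ == "__main__":
    # Toy "circuit": the margin depends on only 2 of 100 process parameters.
    toy = lambda x: 1.0 - abs(x[3]) - abs(x[57])
    margin, x = search_worst_case(toy, high_dim=100, low_dim=4, budget=200)
    print(margin, len(x))
```

The sketch relies on the usual motivation for random embeddings: when only a few directions of the high-dimensional space matter, a random low-dimensional subspace is likely to capture them.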

High Performance Graph Convolutional Networks with Applications in Testability Analysis

  • Yuzhe Ma
  • Haoxing Ren
  • Brucek Khailany
  • Harbinder Sikka
  • Lijuan Luo
  • Karthikeyan Natarajan
  • Bei Yu

Applications of deep learning to electronic design automation (EDA) have recently begun to emerge, although they have mainly been limited to processing of regular structured data such as images. However, many EDA problems require processing irregular structures, and it can be non-trivial to manually extract important features in such cases. In this paper, a high performance graph convolutional network (GCN) model is proposed for the purpose of processing irregular graph representations of logic circuits. A GCN classifier is first trained to predict observation point candidates in a netlist. The GCN classifier is then used as part of an iterative process to propose observation point insertion based on the classification results. Experimental results show the proposed GCN model has superior accuracy to classical machine learning models on the prediction of difficult-to-observe nodes. Compared with commercial testability analysis tools, the proposed observation point insertion flow achieves similar fault coverage with an 11% reduction in observation points and a 6% reduction in test pattern count.

MRLoc: Mitigating Row-hammering based on memory Locality

  • Jung Min You
  • Joon-Sung Yang

With the increasing integration of semiconductor designs, many problems have emerged, and row-hammering is one of them. The row-hammering effect is a critical issue for reliable memory operation because it can cause unexpected errors, so it must be addressed. There are two main methods for dealing with the row-hammering problem: a counter-based method and a probabilistic method. This paper proposes an improved version of the latter and compares it with other probabilistic methods, PARA and PRoHIT. According to the evaluation results, the proposed method increases row-hammering reduction per refresh by 1.82 and 7.78 times on average compared with PARA and PRoHIT, respectively.
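As background for the probabilistic class of methods, the sketch below simulates a PARA-style policy: on every row activation, the rows adjacent to the activated row are refreshed with a small probability `p`. It models only this generic baseline, not the method proposed in the paper. The disturbance counter, the row count, and the function name are simplifying assumptions for illustration.

```python
import random


def simulate_para(activations, p, num_rows=16, seed=0):
    """Simulate a PARA-style probabilistic neighbor refresh.

    Returns (worst, refreshes): the largest number of disturbances any row
    accumulated between refreshes, and the number of extra row refreshes issued.
    """
    rng = random.Random(seed)
    disturbance = [0] * num_rows
    worst = 0
    refreshes = 0
    for row in activations:
        neighbors = [n for n in (row - 1, row + 1) if 0 <= n < num_rows]
        if rng.random() < p:
            # Refresh the neighbors, clearing their accumulated disturbance.
            for n in neighbors:
                disturbance[n] = 0
                refreshes += 1
        else:
            # The activation disturbs both adjacent rows.
            for n in neighbors:
                disturbance[n] += 1
                worst = max(worst, disturbance[n])
    return worst, refreshes


if __name__ == "__main__":
    hammer = [5] * 100_000  # repeatedly activate the same aggressor row
    for p in (0.0, 0.001, 0.01):
        print(p, simulate_para(hammer, p))
```

Raising `p` lowers the worst-case disturbance but issues more refreshes, which is why a metric such as row-hammering reduction per refresh is a natural way to compare probabilistic schemes.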

System-level hardware failure prediction using deep learning

  • Xiaoyi Sun
  • Krishnendu Chakrabarty
  • Ruirui Huang
  • Yiquan Chen
  • Bing Zhao
  • Hai Cao
  • Yinhe Han
  • Xiaoyao Liang
  • Li Jiang

Disk and memory faults are the leading causes of server breakdown. A proactive solution is to predict such hardware failures at runtime and then isolate the hardware at risk and back up the data. However, the current model-based predictors are incapable of using discrete time-series data, such as the values of device attributes, which convey high-level information about the device behavior. In this paper, we propose a novel deep-learning based prediction scheme for system-level hardware failure prediction. We normalize the distribution of samples’ attributes from different vendors to make use of diverse training sets. We propose a temporal Convolution Neural Network based model that is insensitive to the noise in the time dimension. Finally, we design a loss function to train the model with extremely imbalanced samples effectively. Experimental results from an open S.M.A.R.T data set and an industrial data set show the effectiveness of the proposed scheme.

Enabling Practical Processing in and near Memory for Data-Intensive Computing

  • Onur Mutlu
  • Saugata Ghose
  • Juan Gómez-Luna
  • Rachata Ausavarungnirun

Modern computing systems suffer from the dichotomy between computation on one side, which is performed only in the processor (and accelerators), and data storage/movement on the other, which all other parts of the system are dedicated to. Due to this dichotomy, data moves a lot in order for the system to perform computation on it. Unfortunately, data movement is extremely expensive in terms of energy and latency, much more so than computation. As a result, a large fraction of system energy is spent and performance is lost solely on moving data in a modern computing system.

In this work, we re-examine the idea of reducing data movement by performing Processing in Memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked logic and DRAM, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the idea of PIM is not new, we examine two new approaches to enabling PIM: 1) exploiting analog properties of DRAM to perform massively-parallel operations in memory, and 2) exploiting 3D-stacked memory technology design to provide high bandwidth to in-memory logic. We conclude by discussing work on solving key challenges to the practical adoption of PIM.

Practical Near-Data Processing to Evolve Memory and Storage Devices into Mainstream Heterogeneous Computing Systems

  • Nam Sung Kim
  • Pankaj Mehra

The capacity of memory and storage devices is expected to increase drastically with adoption of the forthcoming memory and integration technologies. This is a welcome improvement especially for datacenter servers running modern data-intensive applications. Nonetheless, for such servers to fully benefit from the increasing capacity, the bandwidth of interconnects between processors and these devices must also increase proportionally, which becomes ever costlier under unabating physical constraints. As a promising alternative to tackle this challenge cost-effectively, a heterogeneous computing paradigm referred to as near-data processing (NDP) has emerged. However, NDP has not yet been widely adopted by the industry because of significant gaps between existing software stacks and demanded ones for NDP-capable memory and storage devices. Aiming to overcome the gaps, we propose to turn memory and storage devices into familiar heterogeneous distributed computing systems. Then, we demonstrate potentials of such computing systems for existing data-intensive applications with two recently implemented NDP-capable devices. Finally, we conclude with a practical blueprint to exploit the NDP-based computing systems for speeding up solving future computer-aided design and optimization problems.

HeadStart: Enforcing Optimal Inceptions in Pruning Deep Neural Networks for Efficient Inference on GPGPUs

  • Ning Lin
  • Hang Lu
  • Xin Wei
  • Xiaowei Li

Deep convolutional neural networks are well known for their extensive parameters and computation intensity. Structured pruning is an effective solution to obtain a more compact model for efficient inference on GPGPUs, without designing specific hardware accelerators. However, previous works resort to certain metrics in channel/filter pruning and count on labor-intensive fine-tuning to recover the accuracy loss. The “inception” of the pruned model, as another form factor, has an indispensable impact on the final accuracy, but its importance is often ignored in these works. In this paper, we prove that an optimal inception is more likely to induce satisfactory performance and fewer fine-tuning iterations. We also propose a reinforcement learning based solution, termed HeadStart, that seeks to learn the best way of pruning, aiming at the optimal inception. With the help of the specialized head-start network, it can automatically balance the tradeoff between the final accuracy and the preset speedup rather than tilting toward one of them, which also differentiates it from existing works. Experimental results show that HeadStart attains up to 2.25x inference speedup with only 1.16% accuracy loss when tested with large-scale images on various GPGPUs, and generalizes well to various cutting-edge DCNN models.

GATE: A Generalized Dataflow-level Approximation Tuning Engine For Data Parallel Architectures

  • Seokwon Kang
  • Yongseung Yu
  • Jiho Kim
  • Yongjun Park

Although approximate computing is widely used, it requires substantial programming effort to find appropriate approximation patterns among multiple pre-defined patterns to achieve high performance. Therefore, we propose an automatic approximation framework called GATE to uncover hidden opportunities from any data-parallel program regardless of the code pattern or application characteristics using two compiler techniques, namely subgraph-level approximation (SGLA) and approximate thread merge (ATM). GATE also features conservative/aggressive tuning and dynamic calibration to maximize the performance while maintaining the TOQ level during runtime. Our framework achieves an average performance gain of 2.54x over the baseline with minimum accuracy loss.

LSIM: Ultra Lightweight Similarity Measurement for Mobile Graphics Applications

  • Yu-Chuan Chang
  • Wei-Ming Chen
  • Pi-Cheng Hsiu
  • Yen-Yu Lin
  • Tei-Wei Kuo

Perceptual similarity measurement allows mobile applications to eliminate unnecessary computations without compromising visual experience. Existing pixel-wise measures incur significant overhead with increasing display resolutions and frame rates. This paper presents an ultra lightweight similarity measure called LSIM, which assesses the similarity between frames based on the transformation matrices of graphics objects. To evaluate its efficacy, we integrate LSIM into the Open Graphics Library and conduct experiments on an Android smartphone with various mobile 3D games. The results show that LSIM is highly correlated with the most widely used pixel-wise measure SSIM, yet three to five orders of magnitude faster. We also apply LSIM to a CPU-GPU governor to suppress the rendering of similar frames, thereby further reducing computation energy consumption by up to 27.3% while maintaining satisfactory visual quality.

Efficient State Retention through Paged Memory Management for Reactive Transient Computing

  • Sivert T. Sliper
  • Domenico Balsamo
  • Nikos Nikoleris
  • William Wang
  • Alex S. Weddell
  • Geoff V. Merrett

Reactive transient computing systems preserve computational progress despite frequent power failures by suspending (saving state to nonvolatile memory) when detecting a power failure, and restoring once power returns. Existing methods inefficiently save and restore all allocated memory. We propose lightweight memory management that applies the concept of paging to load pages only when needed, and save only modified pages. We then develop a model that maximises available execution time by dynamically adjusting the suspend and restore voltage thresholds. Experiments on an MSP430FR5994 microcontroller show that our method reduces state retention overheads by up to 86.9% and executes algorithms up to 5.3× faster than the state-of-the-art.
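The paging idea described above (load a page only when it is touched, save only pages that were modified) can be sketched in a few lines. The model below is a toy, not the authors' implementation: nonvolatile memory is a plain dictionary, the page size is arbitrary, and the class and counter names are hypothetical. The voltage-threshold model from the abstract is not represented.

```python
class PagedState:
    """Toy model of demand-loaded pages with dirty-page write-back."""

    def __init__(self, nonvolatile, page_size=4):
        self.nv = nonvolatile       # dict: page number -> list of words
        self.page_size = page_size
        self.resident = {}          # pages currently held in volatile memory
        self.dirty = set()          # resident pages modified since loading
        self.pages_loaded = 0
        self.pages_saved = 0

    def _ensure_loaded(self, addr):
        page_no = addr // self.page_size
        if page_no not in self.resident:
            # Load on first touch only; untouched pages are never copied.
            stored = self.nv.get(page_no, [0] * self.page_size)
            self.resident[page_no] = list(stored)
            self.pages_loaded += 1
        return page_no

    def read(self, addr):
        page_no = self._ensure_loaded(addr)
        return self.resident[page_no][addr % self.page_size]

    def write(self, addr, value):
        page_no = self._ensure_loaded(addr)
        self.resident[page_no][addr % self.page_size] = value
        self.dirty.add(page_no)

    def suspend(self):
        """On power failure: write back modified pages only, then drop state."""
        for page_no in self.dirty:
            self.nv[page_no] = list(self.resident[page_no])
            self.pages_saved += 1
        self.dirty.clear()
        self.resident.clear()


if __name__ == "__main__":
    nv = {}
    mem = PagedState(nv)
    mem.write(5, 42)   # touches and modifies page 1
    mem.read(0)        # touches page 0 without modifying it
    mem.suspend()
    print(mem.pages_loaded, mem.pages_saved, nv)
```

In this run two pages are loaded but only one is saved at suspend time, which is the saving a paged scheme offers over copying all allocated memory.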

NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning

  • Gagandeep Singh
  • Juan Gómez-Luna
  • Giovanni Mariani
  • Geraldo F. Oliveira
  • Stefano Corda
  • Sander Stuijk
  • Onur Mutlu
  • Henk Corporaal

The cost of moving data between the memory/storage units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. A promising paradigm to alleviate this data movement bottleneck is near-memory computing (NMC), which consists of placing compute units close to the memory/storage units. There is substantial research effort that proposes NMC architectures and identifies workloads that can benefit from NMC. System architects typically use simulation techniques to evaluate the performance and energy consumption of their designs. However, simulation is extremely slow, imposing long times for design space exploration. In order to enable fast early-stage design space exploration of NMC architectures, we need high-level performance and energy models.

We present NAPEL, a high-level performance and energy estimation framework for NMC architectures. NAPEL leverages ensemble learning to develop a model that is based on microarchitectural parameters and application characteristics. NAPEL training uses a statistical technique, called design of experiments, to collect representative training data efficiently. NAPEL provides early design space exploration 220× faster than a state-of-the-art NMC simulator, on average, with error rates of to 8.5% and 11.6% for performance and energy estimations, respectively, compared to the NMC simulator. NAPEL is also capable of making accurate predictions for previously-unseen applications.

DREDGE: Dynamic Repartitioning during Dynamic Graph Execution

  • Andrew McCrabb
  • Eric Winsor
  • Valeria Bertacco

Graph-based algorithms have gained significant interest in several application domains. Solutions addressing the computational efficiency of such algorithms have mostly relied on many-core architectures. Cleverly laying out input graphs in storage, by placing adjacent vertices in the same storage unit (memory bank or cache unit), enables fast access during graph traversal. Dynamic graphs, however, must be continuously repartitioned to leverage this benefit. Yet software repartitioning solutions rely on costly, cross-vault communication to query and optimize the graph layout between algorithm iterations.

In this work, we propose DREDGE, a novel hardware solution to provide heuristic repartitioning optimizations in the background without extra communication. Our evaluation indicates that we achieve a 1.9x speedup, on average, over several graph algorithms and datasets, executing on a 24×24-core architecture, when compared against a baseline solution that does not repartition the dynamic graph. We estimated that DREDGE incurs only 1.5% area and 2.1% power overheads over an ARM A5 processor core.

ROC: DRAM-based Processing with Reduced Operation Cycles

  • Xin Xin
  • Youtao Zhang
  • Jun Yang

DRAM-based memory-centric computing architectures are promising solutions to tackle the challenges of the memory wall. In this paper, we develop a novel DRAM-based processing-in-memory (PIM) architecture that needs fewer cycles for every basic operation than prior art. Our small yet fast in-memory computing units support basic logic operations including NOT, AND, and OR. Using those operations, along with shift and propagation, bitwise operations can be extended to word-wise operations, e.g., increment and comparison, with high efficiency. We also optimize the designs to exploit parallelism and data reuse to further improve the performance of compound operations. Compared with the most powerful state-of-the-art PIM architecture, we can achieve comparable or even better performance while consuming only 6% of its area overhead.
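
As an illustration of how a word-wise increment can be composed from only NOT, AND, OR, and shift, the following Python sketch may help. It is a software analogy under our own assumptions, not the ROC circuit: the function names and the 8-bit default width are hypothetical.

```python
def bit_not(x, width):
    # Bitwise NOT restricted to `width` bits.
    return ~x & ((1 << width) - 1)

def bit_xor(a, b, width):
    # XOR built only from OR, AND, NOT: (a OR b) AND NOT (a AND b).
    return (a | b) & bit_not(a & b, width)

def increment(x, width=8):
    # Add 1 by repeatedly propagating a carry word one bit to the left.
    mask = (1 << width) - 1
    carry = 1
    while carry:
        total = bit_xor(x, carry, width)   # sum bits without carry
        carry = ((x & carry) << 1) & mask  # carry shifted to the next bit
        x = total
    return x

print(increment(0b0111))  # 8
print(increment(255))     # 0 (wraps around at 8 bits)
```

Each loop iteration corresponds to one shift-and-propagate step, so the iteration count is bounded by the word width.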

NV-BNN: An Accurate Deep Convolutional Neural Network Based on Binary STT-MRAM for Adaptive AI Edge

  • Chih-Cheng Chang
  • Ming-Hung Wu
  • Jia-Wei Lin
  • Chun-Hsien Li
  • Vivek Parmar
  • Heng-Yuan Lee
  • Jeng-Hua Wei
  • Shyh-Shyuan Sheu
  • Manan Suri
  • Tian-Sheuan Chang
  • Tuo-Hung Hou

Binary STT-MRAM is a highly anticipated embedded nonvolatile memory technology in advanced logic nodes < 28 nm. How to enable its in-memory computing (IMC) capability is critical for enhancing AI Edge. Based on the soon-available STT-MRAM, we report the first binary deep convolutional neural network (NV-BNN) capable of both local and remote learning. Exploiting intrinsic cumulative switching probability, accurate online training of CIFAR-10 color images (~ 90%) is realized using a relaxed endurance spec (switching ≤ 20 times) and hybrid digital/IMC design. For offline training, the accuracy loss due to imprecise weight placement can be mitigated using a rapid non-iterative training-with-noise and fine-tuning scheme.

No Compromises: Secure NVM with Crash Consistency, Write-Efficiency and High-Performance

  • Fan Yang
  • Youyou Lu
  • Youmin Chen
  • Haiyu Mao
  • Jiwu Shu

Data encryption and authentication are essential for secure NVM. However, the introduced security metadata needs to be atomically written back to NVM along with data, so as to provide crash consistency, which unfortunately incurs high overhead. To support fine-grained data protection without compromising the performance, we propose cc-NVM. It first introduces an epoch-based mechanism to aggressively cache the security metadata in the CPU cache while retaining its consistency in NVM. Deferred spreading is also introduced to reduce the computation overhead of data authentication. Leveraging the hidden ability of data HMACs, we can always recover the consistent but old security metadata to its newest version. Compared to Osiris, a state-of-the-art secure NVM, cc-NVM improves performance by 20.4% on average. When the system crashes, instead of dropping all the data due to malicious attacks, cc-NVM is able to detect and locate the exact tampered data while incurring only 29.6% extra write traffic on average.

In-process Memory Isolation Using Hardware Watchpoint

  • Jinsoo Jang
  • Brent Byunghoon Kang

Memory disclosure vulnerabilities have been exploited in the leaking of application secret data such as crypto keys (e.g., the Heartbleed Bug). To ameliorate this problem, we propose an in-process memory isolation mechanism by leveraging a common hardware feature, namely, hardware debugging. Specifically, we utilize a watchpoint to monitor a particular memory region containing secret data. We implemented the PoC of our approach based on the 64-bit ARM architecture, including the kernel patches and user APIs that help developers benefit from isolated memory use. We applied the approach to open-source applications such as OpenSSL and AESCrypt. The results of a performance evaluation show that our approach incurs a small amount of overhead.

H-ORAM: A Cacheable ORAM Interface for Efficient I/O Accesses

  • Liang Liu
  • Rujia Wang
  • Youtao Zhang
  • Jun Yang

Oblivious RAM (ORAM) is an effective security primitive to prevent access pattern leakage. By adding redundant memory accesses, ORAM prevents attackers from revealing the patterns in the access sequences. However, ORAM tends to introduce a huge degradation on the performance. With growing address space to be protected, ORAM has to store the majority of data in the lower level storage, which further degrades the system performance.

In this paper, we propose Hybrid ORAM (H-ORAM), a novel ORAM primitive to address large performance degradation when overflowing the user data to storage. H-ORAM consists of a batch scheduling scheme for enhancing the memory bandwidth usage, and a novel ORAM interface that returns data without waiting for the I/O access each time. We evaluate H-ORAM on a real machine implementation. The experimental results show that H-ORAM outperforms the state-of-the-art Path ORAM by 20×.

RansomBlocker: a Low-Overhead Ransomware-Proof SSD

  • Jisung Park
  • Youngdon Jung
  • Jonghoon Won
  • Minji Kang
  • Sungjin Lee
  • Jihong Kim

We present a low-overhead ransomware-proof SSD, called RansomBlocker (RBlocker). RBlocker provides full protection against all possible ransomware attacks by delaying every data deletion until the absence of an attack is guaranteed. To reduce the storage overhead of the delayed deletion, RBlocker employs a time-out based backup policy. Based on the fact that ransomware must store an encrypted version of target files, early deletions of obsolete data are allowed if no encrypted write was detected for a short interval. Otherwise, RBlocker keeps the data for an interval long enough to guarantee a no-attack condition. For accurate in-line detection of encrypted writes, we leverage entropy- and CNN-based detectors in an integrated fashion. Our experimental results show that RBlocker can defend against all types of ransomware attacks with negligible overheads.
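
To make the entropy-based part of such a detector concrete, here is a minimal Python sketch that computes the Shannon entropy of a written block in bits per byte. It is an assumption-laden illustration rather than RBlocker's detector: the function names and the 7.5 bits/byte threshold are hypothetical, and the CNN-based component is not modeled.

```python
import math
from collections import Counter

def shannon_entropy(block: bytes) -> float:
    # Entropy in bits per byte; ranges from 0.0 to 8.0.
    if not block:
        return 0.0
    n = len(block)
    counts = Counter(block)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_encrypted(block: bytes, threshold: float = 7.5) -> bool:
    # Hypothetical threshold: encrypted data is close to uniformly random.
    return shannon_entropy(block) >= threshold

print(looks_encrypted(b"plain text " * 100))  # False
print(looks_encrypted(bytes(range(256)) * 16))  # True
```

High entropy alone cannot distinguish encrypted data from compressed data, so a threshold like this is at best one signal among several.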

Transmit or Discard: Optimizing Data Freshness in Networked Embedded Systems with Energy Harvesting Sources

  • Zimeng Zhou
  • Chenchen Fu
  • Chun Jason Xue
  • Song Han

This paper explores how to optimize the freshness of real-time data in energy harvesting based networked embedded systems. We introduce the concept of Age of Information (AoI) to quantitatively measure the data freshness and present a comprehensive analysis on the average AoI of the real-time data with stochastic update arrival and energy replenishment rates. Both an optimal offline solution and an effective online solution are designed to judiciously select a subset of the real-time data updates and determine their corresponding transmission times to optimize the average AoI subject to energy constraints. Our extensive experiments validate the effectiveness of the proposed solutions, and show that these two methods can significantly improve the average AoI by 47.2% compared to the state-of-the-art solutions for low energy replenishment rate.
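
The average AoI metric can be sketched as the time average of a sawtooth: the age grows linearly and drops whenever a fresher update is delivered. The Python sketch below computes it for a fixed delivery schedule. It assumes a fresh sample at time 0 and a hypothetical `(generated_at, delivered_at)` input format; it does not implement the paper's offline or online selection policies.

```python
def average_aoi(updates, horizon):
    # updates: (generated_at, delivered_at) pairs sorted by delivery time.
    # Assumes the receiver holds a sample generated at time 0 initially.
    area = 0.0
    last_gen = 0.0  # generation time of the freshest delivered update
    t = 0.0
    for generated_at, delivered_at in updates:
        if delivered_at > horizon:
            break
        # Age rises linearly on [t, delivered_at]; add the trapezoid area.
        area += ((t - last_gen) + (delivered_at - last_gen)) / 2 * (delivered_at - t)
        t = delivered_at
        last_gen = max(last_gen, generated_at)
    # Tail segment from the last delivery to the end of the horizon.
    area += ((t - last_gen) + (horizon - last_gen)) / 2 * (horizon - t)
    return area / horizon

print(average_aoi([], 10))        # 5.0
print(average_aoi([(4, 5)], 10))  # 3.0
```

Comparing the two calls shows how a single delivered update lowers the average age; choosing which updates to send under an energy budget is the optimization the abstract describes.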

FPGA-Based Emulation of Embedded DRAMs for Statistical Error Resilience Evaluation of Approximate Computing Systems

  • Marco Widmer
  • Andrea Bonetti
  • Andreas Burg

Embedded DRAM (eDRAM) requires frequent power-hungry refresh according to the worst-case retention time across PVT variations to avoid data loss. Abandoning the error-free paradigm, by choosing sub-critical refresh rates that gracefully degrade the eDRAM content, unlocks considerable power-saving opportunities, but requires understanding the effect of stochastic memory errors at the system/application level. We propose an FPGA-based platform featuring faulty eDRAM emulation based on advanced retention time models and silicon measurements for statistical error resilience evaluation of applications in a complete embedded system. We analyze the statistical QoS for various benchmarks under different sub-critical refresh rates and retention time distributions.

Adapting Layer RBERs Variations of 3D Flash Memories via Multi-granularity Progressive LDPC Reading

  • Yajuan Du
  • Yao Zhou
  • Meng Zhang
  • Wei Liu
  • Shengwu Xiong

Existing studies have uncovered that significant Raw Bit Error Rate (RBER) variations exist among different layers of 3D flash memories due to manufacturing process variation. These RBER variations cause widely varying read latencies when reading data with traditional Low-Density Parity-Check (LDPC) codes designed for planar flash memories, which induces sub-optimal read performance of flash-based Solid-State Drives (SSDs).

To investigate the latency diversity, this paper first performs a preliminary experiment and observes that LDPC read levels, which are proportional to read latencies, increase at different speeds as data retention time grows. Then, by exploiting this observation, a Multi-Granularity LDPC (MG-LDPC) read method is proposed to adapt the level increase speed for each layer. Five LDPC engines with varied increase granularity are designed to adapt to RBER speed requirements. Finally, two implementations of MG-LDPC are applied to assign LDPC engines to each flash layer, either in a fixed way or dynamically according to prior read levels. Experimental results show that the two proposed implementations can reduce SSD read response time by 21% and 47% on average, respectively.

A Hybrid Agent-based Design Methodology for Dynamic Cross-layer Reliability in Heterogeneous Embedded Systems

  • Siva Satyendra Sahoo
  • Bharadwaj Veeravalli
  • Akash Kumar

Technology scaling and architectural innovations have led to increasing ubiquity of embedded systems across applications with widely varying and often constantly changing performance and reliability specifications. However, the increasing physical fault-rates in electronic systems have led to single-layer reliability approaches becoming infeasible for resource-constrained systems. Dynamic Cross-layer reliability (CLR) provides scope for efficient adaptation to such QoS variations and increasing unreliability. We propose a design methodology for enabling QoS-aware CLR-integrated runtime adaptation in heterogeneous MPSoC-based embedded systems. Specifically, we propose a combination of reconfiguration cost-aware optimization at design-time and an agent-based optimization at run-time. We report a reduction of up to 51% and 37% in average reconfiguration cost and average energy consumption respectively over state-of-the-art approaches.

PRIMAL: Power Inference using Machine Learning

  • Yuan Zhou
  • Haoxing Ren
  • Yanqing Zhang
  • Ben Keller
  • Brucek Khailany
  • Zhiru Zhang

This paper introduces PRIMAL, a novel learning-based framework that enables fast and accurate power estimation for ASIC designs. PRIMAL trains machine learning (ML) models with design verification testbenches for characterizing the power of reusable circuit building blocks. The trained models can then be used to generate detailed power profiles of the same blocks under different workloads. We evaluate the performance of several established ML models on this task, including ridge regression, gradient tree boosting, multi-layer perceptron, and convolutional neural network (CNN). For average power estimation, ML-based techniques can achieve an average error of less than 1% across a diverse set of realistic benchmarks, outperforming a commercial RTL power estimation tool in both accuracy and speed (15x faster). For cycle-by-cycle power estimation, PRIMAL is on average 50x faster than a commercial gate-level power analysis tool, with an average error less than 5%. In particular, our CNN-based method achieves a 35x speed-up and an error of 5.2% for cycle-by-cycle power estimation of a RISC-V processor core. Furthermore, our case study on a NoC router shows that PRIMAL can achieve a small estimation error of 4.5% using cycle-approximate traces from SystemC simulation.

Partition and Propagate: an Error Derivation Algorithm for the Design of Approximate Circuits

  • Ilaria Scarabottolo
  • Giovanni Ansaloni
  • George A. Constantinides
  • Laura Pozzi

Inexact hardware design techniques have become popular in error-tolerant systems, where energy efficiency is a primary concern. Several techniques aim to identify circuit portions that can be discarded under an error constraint, but research on systematic methods to determine such error is still at an early stage. We herein illustrate a generic, scalable algorithm that determines the influence of each circuit gate on the final output. The algorithm first partitions the graph representing the circuit, then determines the error propagation model of the resulting subgraphs. When applied to existing approximate design frameworks, our solution improves their efficiency and result quality.

Performance, Power and Cooling Trade-Offs with NCFET-based Many-Cores

  • Martin Rapp
  • Sami Salamin
  • Hussam Amrouch
  • Girish Pahwa
  • Yogesh Chauhan
  • Jörg Henkel

Negative Capacitance Field-Effect Transistor (NCFET) is an emerging technology that incorporates a ferroelectric layer within the transistor gate stack to overcome the fundamental limit of sub-threshold swing in transistors. Even though physics-based NCFET models have been recently proposed, system-level NCFET models do not exist and research is still in its infancy. In this work, we are the first to investigate the impact of NCFET on performance, energy and cooling costs in many-core processors. Our proposed methodology starts from accurate physics models all the way up to the system level, where the performance and power of a many-core are widely affected. Our new methodology and system-level models allow, for the first time, the exploration of the novel trade-offs between performance gains and power losses that NCFET now offers to system-level designers. We demonstrate that an optimal ferroelectric thickness does exist. In addition, we reveal that current state-of-the-art power management techniques fail when NCFET (with a thick ferroelectric layer) comes into play.

STFL: Energy-Efficient Data Movement with Slow Transition Fast Level Signaling

  • Payman Behnam
  • Mahdi Nazm Bojnordi

Data movement in large caches consumes a significant amount of energy in modern computer systems. Low power interfaces have been proposed to address this problem. Unfortunately, the energy-efficiency of these techniques is largely limited due to undue latency overheads of low power wires and complex coding mechanisms. This paper proposes a hybrid technique for slow-transition, fast-level (STFL) signaling that creates a balance between power and bandwidth in the last level cache interface. Combined with STFL codes, the signaling technique significantly mitigates the performance impacts of low power wires, thereby improving the energy efficiency of data movement in memory systems. When applied to the last level cache of a contemporary multicore system, STFL improves the CPU energy-delay product by 9% as compared to a voltage-frequency scaled baseline. Moreover, the proposed architecture reduces the CPU energy by 26% and achieves 98% of the performance provided by a high-performance baseline.

Formal Verification of Security Critical Hardware-Firmware Interactions in Commercial SoCs

  • Sayak Ray
  • Nishant Ghosh
  • Ramya Jayaram Masti
  • Arun Kanuparthi
  • Jason M. Fung

We present an effective methodology for formally verifying security-critical flows in a commercial System-on-Chip (SoC) which involve extensive interaction between firmware (FW) and hardware (HW). We describe several HW-FW interaction scenarios that are typical in commercial SoCs. We highlight unique challenges associated with formal verification of security properties of such interactions and discuss our approach of property-specific abstraction and software model checking to circumvent those challenges. To the best of our knowledge, this is the first exposition on formal co-verification of security-specific HW-FW interactions in the context and scale of commercial SoCs. Despite traditional scalability challenges, we demonstrate that many such flows are amenable to effective formal verification.

In Hardware We Trust: Gains and Pains of Hardware-assisted Security

  • Lejla Batina
  • Patrick Jauernig
  • Nele Mentens
  • Ahmad-Reza Sadeghi
  • Emmanuel Stapf

Data processing and communication in almost all electronic systems are based on Central Processing Units (CPUs). In order to guarantee confidentiality and integrity of the software running on a CPU, hardware-assisted security architectures are used. However, both the threat model and the non-functional platform requirements, i.e. performance and energy budget, differ when we go from high-end desktop computers and servers to low-end embedded devices that populate the internet of things (IoT). For high-end platforms, a relatively large energy budget is available to protect software against attacks. However, measures to optimize performance give rise to microarchitectural side-channel attacks. IoT devices, in contrast, are constrained in terms of energy consumption and do not incorporate the performance enhancements found in high-end CPUs. Hence, they are less likely to be susceptible to microarchitectural attacks, but give rise to physical attacks, exploiting, e.g., leakage in power consumption or through fault injection. Whereas previous work mostly concentrates on a specific architecture, this paper covers the whole spectrum of computing systems, comparing the corresponding hardware architectures, and most relevant threats.

Protecting RISC-V against Side-Channel Attacks

  • Elke De Mulder
  • Samatha Gummalla
  • Michael Hutter

Software (SW) implementations of cryptographic algorithms are vulnerable to Side-channel Analysis (SCA) attacks, basically relinquishing the key to the outside world through measurable physical properties of the processor like power consumption and electromagnetic radiation. Protected SW implementations typically have a significant timing and code size overhead as well as a substantially long development time because hands-on testing of the result is crucial. Plenty of scientific publications offer solutions to this problem for all kinds of algorithms, but they are not straightforward to implement as they rely on device assumptions which are rarely met, nor do these solutions take micro-architecture related leakages into account. We present a solution to this problem by integrating side-channel analysis countermeasures into a RISC-V implementation. Our solution protects against first-order power or electromagnetic attacks while keeping the implementation costs as low as possible. We made use of state-of-the-art masking techniques and present a novel solution to protect memory access against SCA. Practical results are provided that demonstrate the leakage results of various cryptographic primitives running on our protected hardware platform.

ANN Based Admission Control for On-Chip Networks

  • Boqian Wang
  • Zhonghai Lu
  • Shenggang Chen

We propose an admission control method in Network-on-Chip (NoC) with a centralized Artificial Neural Network (ANN) admission controller, which can improve system performance by predicting the most appropriate injection rate of each node via the network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node where the admission control will be applied. Compared to the previous work, our method builds up a high-fidelity model between the network status and the injection rate regulation. The full-system simulation results show that our proposed method can enhance application performance by 17.8% on average and up to 23.8%.

An Energy-Efficient Network-on-Chip Design using Reinforcement Learning

  • Hao Zheng
  • Ahmed Louri

The design space for energy-efficient Network-on-Chips (NoCs) has expanded significantly comprising a number of techniques. The simultaneous application of these techniques to yield maximum energy efficiency requires the monitoring of a large number of system parameters which often results in substantial engineering efforts and complicated control policies. This motivates us to explore the use of reinforcement learning (RL) approach that automatically learns an optimal control policy to improve NoC energy efficiency. First, we deploy power-gating (PG) and dynamic voltage and frequency scaling (DVFS) to simultaneously reduce both static and dynamic power. Second, we use RL to automatically explore the dynamic interactions among PG, DVFS, and system parameters, learn the critical system parameters contained in the router and cache, and eventually evolve optimal per-router control policies that significantly improve energy efficiency. Moreover, we introduce an artificial neural network (ANN) to efficiently implement the large state-action table required by RL. Simulation results using PARSEC benchmark show that the proposed RL approach improves power consumption by 26%, while improving system performance by 7%, as compared to a combined PG and DVFS design without RL. Additionally, the ANN design yields 67% area reduction, as compared to a conventional RL implementation.

Lightweight Mitigation of Hardware Trojan Attacks in NoC-based Manycore Computing

  • Venkata Yaswanth Raparti
  • Sudeep Pasricha

Data-snooping is a serious security threat in NoC fabrics that can lead to theft of sensitive information from applications executing on manycore processors. Hardware Trojans (HTs) covertly embedded in NoC components can carry out such snooping attacks. In this paper, we first describe a low-overhead snooping invalidation module (SIM) to prevent malicious data replication by HTs in NoCs. We then devise a snooping detection module (THANOS) to also detect malicious applications that utilize such HTs. Experimental analysis shows that unlike state-of-the-art mechanisms, SIM and THANOS not only mitigate snooping attacks but also improve NoC performance by 48.4% in the presence of these attacks, with a minimal ~2.15% area and ~5.5% power overhead.

Sparse 3-D NoCs with Inductive Coupling

  • Michihiro Koibuchi
  • Lambert Leong
  • Tomohiro Totoki
  • Naoya Niwa
  • Hiroki Matsutani
  • Hideharu Amano
  • Henri Casanova

Wireless interconnects based on inductive coupling technology are compelling propositions for designing 3-D integrated chips. This work addresses the heat dissipation problem on such systems. Although effective cooling technologies have been proposed for systems designed based on Through Silicon Via (TSV), their application to systems that use inductive coupling is problematic because of increased wireless-communication distance. For this reason, we propose two methods for designing sparse 3-D chip layouts and Networks on Chip (NoCs) based on inductive coupling. The first method computes an optimized 3-D chip layout and then generates a randomized network topology for this layout. The second method uses a standard stack chip layout with a standard network topology as a starting point, and then deterministically transforms it into either a “staircase” or a “checkerboard” layout. We quantitatively compare the designs produced by these two methods in terms of network and application performance. Our main finding is that the first method produces designs that ultimately lead to higher parallel application performance, as demonstrated for nine OpenMP applications in the NAS Parallel Benchmarks.

Surf-Bless: A Confined-interference Routing for Energy-Efficient Communication in NoCs

  • Peng Wang
  • Sobhan Niknam
  • Sheng Ma
  • Zhiying Wang
  • Todor Stefanov

In this paper, we address the problem of how to achieve energy-efficient confined-interference communication on a bufferless NoC taking advantage of the low power consumption of such NoC. We propose a novel routing approach called Surfing on a Bufferless NoC (Surf-Bless) where packets are assigned to domains and Surf-Bless guarantees that interference between packets is confined within a domain, i.e., there is no interference between packets assigned to different domains. By experiments, we show that our Surf-Bless routing approach is effective in supporting confined-interference communication and consumes much less energy than the related approaches.

Effect of Distributed Directories in Mesh Interconnects

  • Marcos Horro
  • Mahmut T. Kandemir
  • Louis-Noël Pouchet
  • Gabriel Rodríguez
  • Juan Touriño

Recent manycore processors are kept coherent using scalable distributed directories. A paramount example is the Xeon Phi Knights Landing. It features 38 tiles packed in a single die, organized into a 2D mesh. Before accessing remote data, tiles need to query the distributed directory. The effect of this coherence traffic is poorly understood. We show that the apparent UMA behavior results from the degradation of the peak performance. We develop ways to optimize the coherence traffic, the core-to-core-affinity, and the scheduling of a set of tasks on the mesh, leveraging the unique characteristics of processor units stemming from process variations.

BRIC: Locality-based Encoding for Energy-Efficient Brain-Inspired Hyperdimensional Computing

  • Mohsen Imani
  • Justin Morris
  • John Messerly
  • Helen Shu
  • Yaobang Deng
  • Tajana Rosing

Brain-inspired Hyperdimensional (HD) computing is a new computing paradigm emulating the neuron’s activity in high-dimensional space. The first step in HD computing is to map each data point into high-dimensional space (e.g., 10,000), which requires the computation of thousands of operations for each element of data in the original domain. Encoding alone takes about 80% of the execution time of training. In this paper, we propose BRIC, a fully binary Brain-Inspired Classifier based on HD computing for energy-efficient and high-accuracy classification. BRIC introduces a novel encoding module based on random projection with a predictable memory access pattern which can efficiently be implemented in hardware. BRIC is the first HD-based approach which provides data projection with a 1:1 ratio to the original data and enables all training/inference computation to be performed using binary hypervectors. To further improve BRIC efficiency, we develop an online dimension reduction approach which removes insignificant hypervector dimensions during training. Additionally, we designed a fully pipelined FPGA implementation which accelerates BRIC in both training and inference phases. Our evaluation of BRIC on a wide range of classification applications shows that BRIC can achieve 64.1× and 9.8× (43.8× and 6.1×) energy efficiency and speedup as compared to baseline HD computing during training (inference) while providing the same classification accuracy.
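
For readers unfamiliar with HD encoding, the following Python sketch shows a generic random-projection mapping of a feature vector to a binary hypervector, with Hamming distance as the similarity measure. This is a textbook-style baseline under our own assumptions, not BRIC's locality-based encoder: the function names, the ±1 projection matrix, and the 1,000-dimension size are all hypothetical.

```python
import random

def make_projection(num_features, dims, seed=0):
    # Random +1/-1 matrix with one row per hypervector dimension.
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(num_features)]
            for _ in range(dims)]

def encode(features, projection):
    # Binarize each projected value by its sign to get a binary hypervector.
    return [1 if sum(w * x for w, x in zip(row, features)) >= 0 else 0
            for row in projection]

def hamming(a, b):
    # Number of positions in which two binary hypervectors differ.
    return sum(x != y for x, y in zip(a, b))

projection = make_projection(num_features=4, dims=1000)
h = encode([0.2, 0.5, 0.1, 0.9], projection)
print(len(h), hamming(h, h))  # 1000 0
```

The inner loop performs one multiply-accumulate per feature per dimension, which is the kind of encoding cost the abstract identifies as dominating training time.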

Fast and Efficient Information Transmission with Burst Spikes in Deep Spiking Neural Networks

  • Seongsik Park
  • Seijoon Kim
  • Hyeokjun Choe
  • Sungroh Yoon

Spiking neural networks (SNNs) are considered as one of the most promising artificial neural networks due to their energy-efficient computing capability. Recently, conversion of a trained deep neural network to an SNN has improved the accuracy of deep SNNs. However, most of the previous studies have not achieved satisfactory results in terms of inference speed and energy efficiency. In this paper, we propose a fast and energy-efficient information transmission method with burst spikes and hybrid neural coding scheme in deep SNNs. Our experimental results showed the proposed methods can improve inference energy efficiency and shorten the latency.

Deep-DFR: A Memristive Deep Delayed Feedback Reservoir Computing System with Hybrid Neural Network Topology

  • Kangjun Bai
  • Qiyuan An
  • Yang Yi

Deep neural networks (DNNs), the brain-like machine learning architecture, have gained immense success in data-extensive applications. In this work, a hybrid structured deep delayed feedback reservoir (Deep-DFR) computing model is proposed and fabricated. Our Deep-DFR employs memristive synapses working in a hierarchical information processing fashion with DFR modules as the readout layer, leading our proposed deep learning structure to be both depth-in-space and depth-in-time. Our fabricated prototype along with experimental results demonstrates its high energy efficiency with low hardware implementation cost. On the image classification datasets MNIST and SVHN, our Deep-DFR yields a 1.26× to 7.69× reduction in testing error compared to state-of-the-art DNN designs.

A Fault-Tolerant Neural Network Architecture

  • Tao Liu
  • Wujie Wen
  • Lei Jiang
  • Yanzhi Wang
  • Chengmo Yang
  • Gang Quan

New DNN accelerators based on emerging technologies, such as resistive random access memory (ReRAM), are gaining increasing research attention given their potential of “in-situ” data processing. Unfortunately, device-level physical limitations that are unique to these technologies may cause weight disturbance in memory, thus compromising the performance and stability of DNN accelerators. In this work, we propose a novel fault-tolerant neural network architecture to mitigate the weight disturbance problem without involving expensive retraining. Specifically, we propose a novel collaborative logistic classifier to enhance the DNN stability by redesigning the binary classifiers augmented from both traditional error correction output code (ECOC) and modern DNN training algorithm. We also develop an optimized variable-length “decode-free” scheme to further boost the accuracy with fewer classifiers. Experimental results on cutting-edge DNN models and complex datasets show that the proposed fault-tolerant neural network architecture can effectively rectify the accuracy degradation against weight disturbance for DNN accelerators with low cost, thus allowing for its deployment in a variety of mainstream DNNs.

A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM

  • Zhenhua Zhu
  • Hanbo Sun
  • Yujun Lin
  • Guohao Dai
  • Lixue Xia
  • Song Han
  • Yu Wang
  • Huazhong Yang

Convolutional Neural Networks (CNNs) play a vital role in machine learning. Emerging resistive random-access memories (RRAMs) and RRAM-based Processing-In-Memory architectures have demonstrated great potential in boosting both the performance and energy efficiency of CNNs. However, restricted by the immature process technology, it is hard to implement and fabricate a CNN accelerator chip based on multi-bit RRAM devices. In addition, existing single bit RRAM based CNN accelerators only focus on binary or ternary CNNs which have more than 10% accuracy loss compared with full precision CNNs. This paper proposes a configurable multi-precision CNN computing framework based on single bit RRAM, which consists of an RRAM computing overhead aware network quantization algorithm and a configurable multi-precision CNN computing architecture based on single bit RRAM. The proposed method can achieve accuracy equivalent to a full precision CNN, with lower storage consumption and latency via multiple precision quantization. The designed architecture supports accelerating multi-precision CNNs, even with different precisions across layers. Experimental results show that the proposed framework can reduce computing area by 70% and computing energy by 75% on average, with nearly no accuracy loss. The equivalent energy efficiency is 1.6× to 8.6× that of existing RRAM based architectures, with only 1.07% area overhead.

Noise Injection Adaption: End-to-End ReRAM Crossbar Non-ideal Effect Adaption for Neural Network Mapping

  • Zhezhi He
  • Jie Lin
  • Rickard Ewetz
  • Jiann-Shiun Yuan
  • Deliang Fan

In this work, we investigate various non-ideal effects (Stuck-At-Fault (SAF), IR-drop, thermal noise, shot noise, and random telegraph noise) of the ReRAM crossbar when employing it as a dot-product engine for deep neural network (DNN) acceleration. In order to examine the impacts of those non-ideal effects, we first develop a comprehensive framework called PytorX based on the mainstream PyTorch DNN framework. PytorX can perform end-to-end training, mapping, and evaluation for crossbar-based neural network accelerators, considering all of the above non-ideal effects of the ReRAM crossbar together. Experiments based on PytorX show that directly mapping a trained large-scale DNN into crossbars without considering these non-ideal effects could lead to a complete system malfunction (i.e., accuracy equal to random guessing) when the neural network goes deeper and wider. In particular, to address SAF side effects, we propose a digital SAF error correction algorithm to compensate for crossbar output errors, which only needs one-time profiling to achieve almost no system accuracy degradation. Then, to overcome IR drop effects, we propose a Noise Injection Adaption (NIA) methodology that incorporates statistics of the current shift caused by IR drop in each crossbar as stochastic noise in the DNN training algorithm, which could efficiently regularize the DNN model to make it intrinsically adaptive to non-ideal ReRAM crossbars. It is a one-time training method that does not require retraining for every specific crossbar. Optimizing the system operating frequency could easily take care of the remaining non-ideal effects. Various experiments on different DNNs using an image recognition application are conducted to show the efficacy of our proposed methodology.
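
The noise-injection idea described above can be illustrated with a minimal sketch. It is not PytorX and not the authors' NIA implementation: the single linear neuron, the Gaussian noise model, and the names `noisy_dot` and `train_step` are assumptions made purely for illustration. The forward pass adds random noise to the ideal dot product, standing in for the statistics of the crossbar current shift, and training proceeds on the noisy output.

```python
import random

def noisy_dot(weights, inputs, noise_mean, noise_std, rng):
    """Ideal dot product plus a Gaussian perturbation (assumed noise model)."""
    ideal = sum(w * x for w, x in zip(weights, inputs))
    return ideal + rng.gauss(noise_mean, noise_std)

def train_step(weights, inputs, target, lr, noise_mean, noise_std, rng):
    """One SGD step on squared error with noise injected in the forward pass."""
    y = noisy_dot(weights, inputs, noise_mean, noise_std, rng)
    err = y - target
    # The injected noise does not depend on the weights, so dy/dw_i = x_i.
    return [w - lr * 2.0 * err * x for w, x in zip(weights, inputs)]

# Hypothetical toy task: learn y = 0.5*x0 - 0.25*x1 despite the injected noise.
rng = random.Random(0)
w = [0.0, 0.0]
for _ in range(2000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    target = 0.5 * x[0] - 0.25 * x[1]
    w = train_step(w, x, target, lr=0.05, noise_mean=0.0, noise_std=0.05, rng=rng)
```

In a real flow the noise mean and standard deviation would come from profiling or simulating the crossbars; here they are fixed placeholder values.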

A Novel Covert Channel Attack Using Memory Encryption Engine Cache

  • Youngkwang Han
  • John Kim

Microarchitectural covert channel attacks are a threat when multiple tenants share hardware resources such as the last-level cache. In this work, we propose a novel covert channel attack that exploits new microarchitectural structures that have been introduced to support memory encryption, in particular the memory encryption engine (MEE) cache. The MEE cache is a shared resource but is only utilized when accessing the integrity tree data, and it provides an opportunity for a stealthy covert channel attack. However, there are challenges since the MEE cache organization is not publicly known and its access behavior differs from that of a conventional cache. We demonstrate how the MEE cache can be exploited to establish covert channel communication.

Designing Secure Cryptographic Accelerators with Information Flow Enforcement: A Case Study on AES

  • Zhenghong Jiang
  • Hanchen Jin
  • G. Edward Suh
  • Zhiru Zhang

Designing a secure cryptographic accelerator is challenging as vulnerabilities may arise from design decisions and implementation flaws. To provide high security assurance, we propose to design and build cryptographic accelerators with hardware-level information flow control so that the security of an implementation can be formally verified. This paper uses an AES accelerator as a case study to demonstrate how to express security requirements of a cryptographic accelerator as information flow policies for security enforcement. Our AES prototype on an FPGA shows that the proposed protection has a marginal impact on area and performance.

SafeSpec: Banishing the Spectre of a Meltdown with Leakage-Free Speculation

  • Khaled N. Khasawneh
  • Esmaeil Mohammadian Koruyeh
  • Chengyu Song
  • Dmitry Evtyushkin
  • Dmitry Ponomarev
  • Nael Abu-Ghazaleh

Speculative attacks, such as Spectre and Meltdown, target speculative execution to access privileged data and leak it through a side-channel. In this paper, we introduce SafeSpec, a new model for supporting speculation in a way that is immune to side-channel leakage by storing side effects of speculative instructions in separate structures until they commit. Additionally, we address the possibility of a covert channel from speculative instructions to committed instructions before these instructions are committed. We develop a cycle-accurate model of a modified design of an x86-64 processor and show that the performance impact is negligible.

SpectreGuard: An Efficient Data-centric Defense Mechanism against Spectre Attacks

  • Jacob Fustos
  • Farzad Farshchi
  • Heechul Yun

Speculative execution is an essential performance-enhancing technique in modern processors, but it has been shown to be insecure. In this paper, we propose SpectreGuard, a novel defense mechanism against Spectre attacks. In our approach, sensitive memory blocks (e.g., secret keys) are marked using a simple OS/library API and are then selectively protected by hardware from Spectre attacks via a low-cost microarchitecture extension. This technique allows microprocessors to maintain high performance, while restoring control to software developers to make security and performance trade-offs.

PAPP: Prefetcher-Aware Prime and Probe Side-channel Attack

  • Daimeng Wang
  • Zhiyun Qian
  • Nael Abu-Ghazaleh
  • Srikanth V. Krishnamurthy

CPU memory prefetchers can substantially interfere with prime and probe cache side-channel attacks, especially on in-order CPUs which use aggressive prefetching. This interference is not accounted for in previous attacks. In this paper, we propose PAPP, a Prefetcher-Aware Prime and Probe attack that can operate even in the presence of aggressive prefetchers. Specifically, we reverse engineer the prefetcher and replacement policy on several CPUs and use these insights to design a prime and probe attack that minimizes the impact of the prefetcher. We evaluate PAPP using the Cache Side-channel Vulnerability (CSV) metric and demonstrate substantial improvements in the quality of the channel under different conditions.

HardScope: Hardening Embedded Systems Against Data-Oriented Attacks

  • Thomas Nyman
  • Ghada Dessouky
  • Shaza Zeitouni
  • Aaro Lehikoinen
  • Andrew Paverd
  • N. Asokan
  • Ahmad-Reza Sadeghi

Memory-unsafe programming languages like C and C++ leave many (embedded) systems vulnerable to attacks like control-flow hijacking. However, defenses against control-flow attacks, such as (fine-grained) randomization or control-flow integrity, are ineffective against data-oriented attacks and more expressive Data-oriented Programming (DOP) attacks that bypass state-of-the-art defenses.

We propose run-time scope enforcement (RSE), a novel approach that efficiently mitigates all currently known DOP attacks by enforcing compile-time memory safety constraints like variable visibility rules at run-time. We present HardScope, a proof-of-concept implementation of hardware-assisted RSE for RISC-V, and show it has a low performance overhead of 3.2% for embedded benchmarks.

An Efficient Multi-fidelity Bayesian Optimization Approach for Analog Circuit Synthesis

  • Shuhan Zhang
  • Wenlong Lyu
  • Fan Yang
  • Changhao Yan
  • Dian Zhou
  • Xuan Zeng
  • Xiangdong Hu

This paper presents an efficient multi-fidelity Bayesian optimization approach for analog circuit synthesis. The proposed method can significantly reduce the overall computational cost by fusing the simple but potentially inaccurate low-fidelity model and a few accurate but expensive high-fidelity data. Gaussian Process (GP) models are employed to model the low- and high-fidelity black-box functions separately. The nonlinear map between the low-fidelity model and high-fidelity model is also modelled as a Gaussian process. A fusing GP model which combines the low- and high-fidelity models can thus be built. An acquisition function based on the fusing GP model is used to balance the exploitation and exploration. The fusing GP model is evolved gradually as new data points are selected sequentially by maximizing the acquisition function. Experimental results show that our proposed method reduces up to 65.5% of the simulation time compared with the state-of-the-art single-fidelity Bayesian optimization method, while exhibiting more stable performance and a more promising practical prospect.

Rethinking Sparsity in Performance Modeling for Analog and Mixed Circuits using Spike and Slab Models

  • Mohamed Baker Alawieh
  • Sinead A. Williamson
  • David Z. Pan

As integrated circuit technologies continue to scale, efficient performance modeling becomes indispensable. Recently, several new learning paradigms have been proposed to reduce the computational cost associated with accurate performance modeling. A common attribute among most of these paradigms is the leverage of the sparsity feature to build efficient performance models. In this work, we propose a new perspective to incorporate sparsity in the modeling task by utilizing spike and slab feature selection techniques. Practically, our proposed method uses two different priors on the different model coefficients based on their importance. This is incorporated into a mixture model that can be built using a hierarchical Bayesian framework to select the important features and find the model coefficients. Our numerical experiments demonstrate that the proposed approach can achieve better results compared to traditional sparse modeling techniques while also providing valuable insight about the important features in the model.

WellGAN: Generative-Adversarial-Network-Guided Well Generation for Analog/Mixed-Signal Circuit Layout

  • Biying Xu
  • Yibo Lin
  • Xiyuan Tang
  • Shaolan Li
  • Linxiao Shen
  • Nan Sun
  • David Z. Pan

In back-end analog/mixed-signal (AMS) design flow, well generation persists as a fundamental challenge for layout compactness, routing complexity, circuit performance and robustness. The immaturity of AMS layout automation tools comes to a large extent from the difficulty in comprehending and incorporating designer expertise. To mimic the behavior of experienced designers in well generation, we propose a generative adversarial network (GAN) guided well generation framework with a post-refinement stage leveraging the previous high-quality manually-crafted layouts. Guiding regions for wells are first created by a trained GAN model, after which the well generation results are legalized through post-refinement to satisfy design rules. Experimental results show that the proposed technique is able to generate wells close to manual designs with comparable post-layout circuit performance.

Digital Compatible Synthesis, Placement and Implementation of Mixed-Signal Time-Domain Computing

  • Zhengyu Chen
  • Hai Zhou
  • Jie Gu

Mixed-signal time-domain computing (TC) has recently drawn significant attention due to its high efficiency in applications such as machine learning accelerators. However, due to the nature of analog and mixed-signal design, there is a lack of a systematic flow of synthesis and place & route for time-domain circuits. This paper proposes a comprehensive design flow for TC. In the front-end, a variation-aware digital compatible synthesis flow is proposed. In the back-end, a placement technique using a graph-based optimization engine is proposed to deal with the especially stringent matching requirement in TC. Simulation results show significant improvement over the prior analog placement methods. A 55nm test chip is used to demonstrate that the proposed design flow can meet the stringent timing matching target for TC with a significant performance boost over conventional digital design.

A Rigorous Approach for the Sparsification of Dense Matrices in Model Order Reduction of RLC Circuits

  • Charalampos Antoniadis
  • Nestor Evmorfopoulos
  • Georgios Stamoulis

The integration of more components into modern Systems-on-Chip (SoCs) has led to very large RLC parasitic networks consisting of millions of nodes, which have to be simulated at many time points or frequencies to verify the proper operation of the chip. Model Order Reduction (MOR) techniques have been employed routinely to substitute the large-scale parasitic model with a model of lower order with similar response at the input/output ports. However, all established MOR techniques result in dense system matrices that render their simulation impractical. To this end, in this paper we propose a methodology for the sparsification of the dense circuit matrices resulting from Model Order Reduction of general RLC circuits, which employs a sequence of algorithms based on the computation of the nearest diagonally dominant matrix and the sparsification of the corresponding graph. Experimental results indicate that a high sparsity ratio of the reduced system matrices can be achieved with very small loss of accuracy.

Enabling Complex Stimuli in Accelerated Mixed-Signal Simulation

  • Sara Divanbeigi
  • Evan Aditya
  • Zhongpin Wang
  • Markus Olbrich

In the era of advancing technology, increasing circuit complexity requires faster simulators for the verification step. The piece-wise linear simulation approach provides an efficient and accurate solution. In this paper, a state-of-the-art mixed-signal simulator is explained. The approach is extended to new exponential and quadratic stimuli. This requires a comprehensive derivation of mathematical equations, which remove the need for computationally expensive evaluation. The new stimuli are simulated in several circuits and compared to a conventional simulator. The result shows significant run-time acceleration with high accuracy. Therefore, it meets the industrial requirement, which demands simulation with various input forms and non-linear components.

Scalable Generic Logic Synthesis: One Approach to Rule Them All

  • Heinz Riener
  • Eleonora Testa
  • Winston Haaswijk
  • Alan Mishchenko
  • Luca Amarù
  • Giovanni De Micheli
  • Mathias Soeken

This paper proposes a novel methodology for multi-level logic synthesis that is independent from a specific graph data-structure, but formulates synthesis procedures using an abstract concept definition of a logic representation. The idea is to capture the essence of optimisations in a general manner and tailor only small performance-critical sections to the underlying logic representation. This generic, yet scalable approach, saves many man-months of development time and enables logic synthesis and technology-mapping procedures parameterised in a logic representation. We present the generic design methodology and demonstrate its practicality by providing a complete state-of-the-art logic synthesis flow.

Comprehensive Search for ECO Rectification Using Symbolic Sampling

  • Victor N. Kravets
  • Nian-Ze Lee
  • Jie-Hong R. Jiang

The task of an engineering change order (ECO) is to update the current implementation of a design according to its revised specification with minimum modification. Prior studies show that the amount of design modification majorly depends on the selection of rectification points, i.e., the input pins of gates whose functionality should be rectified with some patch circuitry. In realistic ECOs, as the netlist of the current implementation has been heavily optimized to meet design objectives, it is usually structurally dissimilar to the netlist of a revised specification, which is synthesized only by lightweight optimization. This paper proposes an ECO solution for optimized designs, which is robust against structural dissimilarity caused by design optimization. It locates candidate rectification points in a sampling domain, which significantly improves the scalability of rectification search. To synthesize the circuitry of patches, a structurally independent rewiring formulation is proposed to reuse existing logic in the implementation. Based on the proposed method, a newly developed engine is evaluated on the engineering changes arising in the design of microprocessors. Its ability to derive patches of superior quality is demonstrated in comparison to industrial tools.

Embedding Functions Into Reversible Circuits: A Probabilistic Approach to the Number of Lines

  • Niels Gleinig
  • Frances Ann Hubis
  • Torsten Hoefler

In order to compute a non-invertible function on a reversible circuit, one needs to “embed” the function into a larger function which has some garbage bits, corresponding to additional lines. The problem of determining the minimal number of garbage bits that are needed to embed a given function has attracted extensive research, largely motivated by quantum computing, where the number of lines equals the number of qubits. However, all approaches that are known have either no theoretical quality guarantees (bounds on approximation factors) or require exponential runtime. We present an efficient probabilistic approximation algorithm with theoretical bounds.
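
For intuition about what an embedding must achieve, the sketch below computes the classical exact count from an explicit truth table: if the most frequent output pattern occurs μ times, ⌈log2 μ⌉ garbage bits are needed to tell those inputs apart. This is not the probabilistic algorithm of the paper; it enumerates the full truth table and therefore takes time exponential in the number of inputs, which is the cost the abstract refers to. The function name `min_garbage_bits` is hypothetical.

```python
from collections import Counter

def min_garbage_bits(truth_table):
    """truth_table: one output pattern (a tuple of bits) per input assignment."""
    most_common_count = max(Counter(truth_table).values())
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    return (most_common_count - 1).bit_length()

# 2-input AND: output 0 occurs for three of the four input assignments.
and_table = [(0,), (0,), (0,), (1,)]
# 2-input XOR: each output value occurs exactly twice.
xor_table = [(0,), (1,), (1,), (0,)]

and_garbage = min_garbage_bits(and_table)
xor_garbage = min_garbage_bits(xor_table)
```

For AND the result is two garbage bits, giving three lines in total, which matches the familiar Toffoli-gate realization; for XOR one garbage bit suffices, matching a CNOT on two lines.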

Disjoint-Support Decomposition and Extraction for Interconnect-Driven Threshold Logic Synthesis

  • Hao Chen
  • Shao-Chun Hung
  • Jie-Hong R. Jiang

Threshold logic circuits are artificial neural networks with their neuron outputs being binarized, thus amenable for efficient, multiplier-free, hardware implementation of machine learning applications. In the reviving threshold logic synthesis, this work lays the foundations of disjoint-support decomposition and extraction operation of threshold logic functions. They lead to a synthesis procedure for interconnect minimization of threshold logic circuits, an important, but not well addressed, objective in both neural network and nanometer circuit designs. Experimental results show that our method can efficiently and effectively reduce interconnect as well as weight/threshold value over highly optimized circuits, thus suitable for implementation using emerging technologies.
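
As a reminder of what a threshold logic function is, the sketch below evaluates a single threshold gate under the common convention that the output is 1 when the weighted input sum reaches the threshold. It does not implement the decomposition or extraction operations of the paper; the function name and the example weights are chosen only for illustration.

```python
from itertools import product

def threshold_gate(weights, threshold, inputs):
    """Return 1 if the weighted sum of the binary inputs is at least the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# Three-input majority as a threshold function: weights (1, 1, 1), threshold 2.
majority = [threshold_gate((1, 1, 1), 2, x) for x in product((0, 1), repeat=3)]
```

Each weight corresponds to one incoming wire and its magnitude, which is why interconnect count and weight/threshold values are natural cost metrics for such circuits.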

Reducing the Multiplicative Complexity in Logic Networks for Cryptography and Security Applications

  • Eleonora Testa
  • Mathias Soeken
  • Luca Amarù
  • Giovanni De Micheli

Reducing the number of AND gates plays a central role in many cryptography and security applications. We propose a logic synthesis algorithm and tool to minimize the number of AND gates in a logic network composed of AND, XOR, and inverter gates. Our approach is fully automatic and exploits cut enumeration algorithms to explore optimization potentials in local subcircuits. The experimental results show that our approach can reduce the number of AND gates by 34% on average compared to generic size optimization algorithms. Further, we are able to reduce the number of AND gates up to 76% in best-known benchmarks from the cryptography community.
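
A small, well-known example shows what reducing the number of AND gates means in a network of AND, XOR, and inverter gates. The three-input majority function can be written with three AND gates or, equivalently, with a single AND gate plus extra XOR gates. The sketch below checks the equivalence exhaustively; it illustrates the cost metric only and is not the synthesis algorithm of the paper.

```python
from itertools import product

def maj_three_ands(a, b, c):
    """Majority as the XOR of all pairwise ANDs: three AND gates."""
    return (a & b) ^ (a & c) ^ (b & c)

def maj_one_and(a, b, c):
    """Majority as ((a XOR b) AND (a XOR c)) XOR a: one AND gate."""
    return ((a ^ b) & (a ^ c)) ^ a

# Exhaustive equivalence check over all eight input assignments.
equivalent = all(
    maj_three_ands(a, b, c) == maj_one_and(a, b, c)
    for a, b, c in product((0, 1), repeat=3)
)
```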

SMatch: Structural Matching for Fast Resynthesis in FPGAs

  • Rafael Trapani Possignolo
  • Jose Renau

Designers wait several hours to get synthesis, placement and routing results even for small changes. Commercial FPGA flows allow for resynthesis after code changes; however, they target large code changes with not-so-effective incremental flows. We propose SMatch, a flow for FPGAs that has a novel incremental elaboration and novel incremental FPGA placement and routing that improves on the state of the art by reducing the amount of placement and routing work needed. We evaluate our approach against commercial FPGA flows. Our method finishes synthesis, placement, and routing in under 30s for most changes of publicly available benchmarks with negligible QoR impact, being over 20× faster than existing incremental FPGA flows.

Toward an Open-Source Digital Flow: First Learnings from the OpenROAD Project

  • Tutu Ajayi
  • Vidya A. Chhabria
  • Mateus Fogaça
  • Soheil Hashemi
  • Abdelrahman Hosny
  • Andrew B. Kahng
  • Minsoo Kim
  • Jeongsup Lee
  • Uday Mallappa
  • Marina Neseem
  • Geraldo Pradipta
  • Sherief Reda
  • Mehdi Saligane
  • Sachin S. Sapatnekar
  • Carl Sechen
  • Mohamed Shalan
  • William Swartz
  • Lutong Wang
  • Zhehong Wang
  • Mingyu Woo
  • Bangqi Xu

We describe the planned Alpha release of OpenROAD, an open-source end-to-end silicon compiler. OpenROAD will help realize the goal of “democratization of hardware design”, by reducing cost, expertise, schedule and risk barriers that confront system designers today. The development of open-source, self-driving design tools is in and of itself a “moon shot” with numerous technical and cultural challenges. The open-source flow incorporates a compatible open-source set of tools that span logic synthesis, floorplanning, placement, clock tree synthesis, global routing and detailed routing. The flow also incorporates analysis and support tools for static timing analysis, parasitic extraction, power integrity analysis, and cloud deployment. We also note several observed challenges, or “lessons learned”, with respect to development of open-source EDA tools and flows.

ALIGN: Open-Source Analog Layout Automation from the Ground Up

  • Kishor Kunal
  • Meghna Madhusudan
  • Arvind K. Sharma
  • Wenbin Xu
  • Steven M. Burns
  • Ramesh Harjani
  • Jiang Hu
  • Desmond A. Kirkpatrick
  • Sachin S. Sapatnekar

This paper presents analog layout automation efforts under the ALIGN (“Analog Layout, Intelligently Generated from Netlists”) project for fast layout generation using a modular approach based on a mix of algorithmic and machine learning-based tools. The road to rapid turnaround is based on an approach that detects structure and hierarchy in the input netlist and uses a grid based philosophy for layout. The paper provides a view of the current status of the project, challenges in developing open-source code with an academic/industry team, and nuts-and-bolts issues such as working with abstracted PDKs, navigating the “wall” between secured IP and open-source software, and securing access to example designs.

Essential Building Blocks for Creating an Open-source EDA Project

  • Tsung-Wei Huang
  • Chun-Xun Lin
  • Guannan Guo
  • Martin D. F. Wong

Open source has started energizing both industrial and academic research and development in electronic design automation (EDA) systems. By moving to open source, we can speed up our effort and work with others who are working toward the same goals, while reducing costs and improving end products. However, building an open-source project is much more than placing the codebase on the web. In this paper, we will talk about essential building blocks to create an impactful open-source project, including source repository, project landing page, documentation, and continuous integration. We will also cover the use of web-based frameworks to design a showcase project to bring community’s attention. We will then share our experience in developing an open-source timing analyzer (OpenTimer) and a parallel task programming library (Cpp-Taskflow), both of which are being used in many industrial and academic EDA research projects.

Open-Source EDA Tools and IP, A View from the Trenches

  • Elad Alon
  • Krste Asanović
  • Jonathan Bachrach
  • Borivoje Nikolić

We describe our experience developing and promoting a set of open-source tools and IP over the last 9 years, including the Chisel hardware construction language, the Rocket Chip SoC generator, and the BAG analog layout generator.

A 1.17 TOPS/W, 150fps Accelerator for Multi-Face Detection and Alignment

  • Huiyu Mo
  • Leibo Liu
  • Wenping Zhu
  • Qiang Li
  • Hong Liu
  • Wenjing Hu
  • Yao Wang
  • Shaojun Wei

Face detection and alignment are highly correlated, computation-intensive tasks that are not yet flexibly supported by any facial-oriented accelerator. This work proposes the first unified accelerator for multi-face detection and alignment, along with optimizations of the multi-task cascaded convolutional networks algorithm, to implement both multi-face detection and alignment. First, clustering non-maximum suppression is proposed to significantly reduce intersection over union computation and eliminate the hardware-interference sorting process, bringing a 16.0% speed-up without any loss. Second, a new pipeline architecture is presented to implement the proposal network in a more computation-efficient manner, with 41.7% less multiplier usage and a 38.3% decrease in memory capacity compared with the similar method. Third, a batch schedule mechanism is proposed to improve hardware utilization of the fully-connected layer by 16.7% on average with a variable number of inputs in batch processing. Based on the TSMC 28 nm CMOS process, this accelerator takes only 6.7 ms at 400 MHz to simultaneously process 5 faces per image and achieves 1.17 TOPS/W power efficiency, which is 54.8× higher than the state-of-the-art solution.

Analog/Mixed-Signal Hardware Error Modeling for Deep Learning Inference

  • Angad S. Rekhi
  • Brian Zimmer
  • Nikola Nedovic
  • Ningxi Liu
  • Rangharajan Venkatesan
  • Miaorong Wang
  • Brucek Khailany
  • William J. Dally
  • C. Thomas Gray

Analog/mixed-signal (AMS) computation can be more energy efficient than digital approaches for deep learning inference, but incurs an accuracy penalty from precision loss. Prior AMS approaches focus on small networks/datasets, which can maintain accuracy even with 2b precision. We analyze applicability of AMS approaches to larger networks by proposing a generic AMS error model, implementing it in an existing training framework, and investigating its effect on ImageNet classification with ResNet-50. We demonstrate significant accuracy recovery by exposing the network to AMS error during retraining, and we show that batch normalization layers are responsible for this accuracy recovery. We also introduce an energy model to predict the requirements of high-accuracy AMS hardware running large networks and use it to show that for ADC-dominated designs, there is a direct tradeoff between energy efficiency and network accuracy. Our model predicts that achieving < 0.4% accuracy loss on ResNet-50 with AMS hardware requires a computation energy of at least ~300 fJ/MAC. Finally, we propose methods for improving the energy-accuracy tradeoff.

A 3T/Cell Practical Embedded Nonvolatile Memory Supporting Symmetric Read and Write Access Based on Ferroelectric FETs

  • Juejian Wu
  • Hongtao Zhong
  • Kai Ni
  • Yongpan Liu
  • Huazhong Yang
  • Xueqing Li

Making embedded memory symmetric provides the capability of memory access in both rows and columns, which brings new opportunities of significant energy and time savings if only a portion of data in the words need to be accessed. This work investigates the use of ferroelectric field-effect transistors (FeFETs), an emerging nonvolatile, low-power, deeply-scalable, CMOS-compatible transistor technology, and proposes a new 3-transistor/cell symmetric nonvolatile memory (SymNVM). With ~1.67x higher density as compared with the prior FeFET design, significant benefits of energy and latency improvement have been achieved, as evaluated and discussed in depth in this paper.

A Fast, Reliable and Wide-Voltage-Range In-Memory Computing Architecture

  • William Simon
  • Juan Galicia
  • Alexandre Levisse
  • Marina Zapater
  • David Atienza

As the computational complexity of applications on the consumer market, such as high-definition video encoding and deep neural networks, becomes ever more demanding, novel ways to efficiently compute data-intensive workloads are being explored. In this context, In-Memory Computing (IMC) solutions, and particularly bitline computing in SRAM, appear promising as they mitigate one of the most energy-consuming aspects of computation: data movement. While IMC architectural-level characteristics have been defined by the research community, only a few works so far have explored the implementation of such memories at a low level. Furthermore, these proposed solutions are either slow (<1 GHz), area hungry (10T SRAM), or suffer from read disturb and corruption issues. Overall, there is no extensive design study considering realistic assumptions at the circuit level. In this work we propose a fast (up to 2.2 GHz), 6T SRAM-based, reliable (no read disturb issues), and wide voltage range (from 0.6 to 1 V) IMC architecture using local bitlines. Beyond standard read and write, the proposed architecture can perform copy, addition and shift operations at the array level. As addition is the slowest operation, we propose a modified carry chain adder, providing a 2× carry propagation improvement. The proposed architecture is validated using a 28nm bulk high-performance technology PDK with CMOS variability and post-layout simulations. High-density SRAM bitcells (0.127 μm²) enable area efficiency of 59.7% for a 256×128 array, on par with current industrial standards.

BitBlade: Area and Energy-Efficient Precision-Scalable Neural Network Accelerator with Bitwise Summation

  • Sungju Ryu
  • Hyungjun Kim
  • Wooseok Yi
  • Jae-Joon Kim

Deep Neural Networks (DNNs) have various performance requirements and power constraints depending on applications. To maximize the energy-efficiency of hardware accelerators for different applications, the accelerators need to support various bit-width configurations. When designing bit-reconfigurable accelerators, each PE must have variable shift-addition logic, which takes a large amount of area and power. This paper introduces an area and energy efficient precision-scalable neural network accelerator (BitBlade), which reduces the control overhead for variable shift-addition using bitwise summation method. The proposed BitBlade, when synthesized in a 28nm CMOS technology, showed reduction in area by 41% and in energy by 36-46% compared to the state-of-the-art precision-scalable architecture [14].
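
One way to read the bitwise summation idea is sketched below, under stated assumptions: operands are unsigned, each is split into 2-bit chunks, and partial products that share the same shift amount are summed across the whole vector before a single shift is applied. The chunk width, the unsigned arithmetic, and the names `chunks` and `dot_bitwise_summation` are illustrative choices, not details taken from the paper.

```python
def chunks(value, n_chunks, chunk_bits=2):
    """Split an unsigned integer into n_chunks pieces, least significant first."""
    mask = (1 << chunk_bits) - 1
    return [(value >> (chunk_bits * k)) & mask for k in range(n_chunks)]

def dot_bitwise_summation(a_vec, b_vec, bits=4, chunk_bits=2):
    """Unsigned dot product with one shift per chunk-position pair."""
    n = bits // chunk_bits
    a_chunks = [chunks(a, n, chunk_bits) for a in a_vec]
    b_chunks = [chunks(b, n, chunk_bits) for b in b_vec]
    total = 0
    for i in range(n):
        for j in range(n):
            # Sum every partial product that needs the same shift first...
            group_sum = sum(ac[i] * bc[j] for ac, bc in zip(a_chunks, b_chunks))
            # ...then shift once per group instead of once per partial product.
            total += group_sum << (chunk_bits * (i + j))
    return total

a = [5, 12, 15]
b = [3, 9, 15]
result = dot_bitwise_summation(a, b)
```

The number of shift operations depends only on the number of chunk-position pairs, not on the vector length, which is the intuition behind sharing the shift-addition logic instead of replicating it in every processing element.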

Acceleration of DNN Backward Propagation by Selective Computation of Gradients

  • Gunhee Lee
  • Hanmin Park
  • Namhyung Kim
  • Joonsang Yu
  • Sujeong Jo
  • Kiyoung Choi

The training process of a deep neural network commonly consists of three phases: forward propagation, backward propagation, and weight update. In this paper, we propose a hardware architecture to accelerate the backward propagation. Our approach applies to neural networks that use rectified linear unit. Considering that the backward propagation results in a zero activation gradient when the corresponding activation is zero, we can safely skip the gradient calculation. Based on this observation, we design an efficient hardware accelerator for training deep neural networks by selectively computing gradients. We show the effectiveness of our approach through experiments with various network models.
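
The observation above can be sketched in software for one fully-connected layer followed by ReLU. Wherever a ReLU activation is zero, its gradient is zero, so the weighted sum that would otherwise be computed for that position is skipped. This is a minimal illustration, not the proposed hardware architecture; the function name `backward_selective` and the weight layout are assumptions.

```python
def backward_selective(weights, grad_next, activations):
    """Backpropagate through a dense layer whose inputs are ReLU activations.

    weights[j][i] connects input activation i to output j.
    Returns the gradients at the ReLU inputs and how many were actually computed.
    """
    grads = [0.0] * len(activations)
    computed = 0
    for i, act in enumerate(activations):
        if act == 0.0:
            continue  # ReLU derivative is 0 here, so the gradient stays 0
        grads[i] = sum(weights[j][i] * grad_next[j] for j in range(len(grad_next)))
        computed += 1
    return grads, computed

weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
grad_next = [1.0, -1.0]
activations = [0.5, 0.0, 2.0]  # the middle unit was clipped by ReLU
grads, computed = backward_selective(weights, grad_next, activations)
```

The savings grow with activation sparsity: in this toy case one of three gradient computations is skipped.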

C3-Flow: Compute Compression Co-Design Flow for Deep Neural Networks

  • Matthew Sotoudeh
  • Sara S. Baghsorkhi

Existing approaches to neural network compression have failed to holistically address algorithmic (training accuracy) and computational (inference performance) demands of real-world systems, particularly on resource-constrained devices. We present C3-Flow, a new approach adding non-uniformity to low-rank approximations and designed specifically to enable highly-efficient computation on common hardware architectures while retaining more accuracy than competing methods. Evaluation on two state-of-the-art acoustic models (versus existing work, empirical limit study approaches, and hand-tuned models) demonstrates up to 60% lower error. Finally, we show that our co-design approach achieves up to 14X inference speedup across three Haswell- and Broadwell-based platforms.

ABM-SpConv: A Novel Approach to FPGA-Based Acceleration of Convolutional Neural Network Inference

  • Dong Wang
  • Ke Xu
  • Qun Jia
  • Soheil Ghiasi

Hardware accelerators for convolutional neural network (CNN) inference have been extensively studied in recent years. The reported designs tend to utilize a similar underlying architecture based on multiplier-accumulator (MAC) arrays, which has the practical consequence of limiting the FPGA-based accelerator performance by the number of available on-chip DSP blocks, while leaving other resources under-utilized. To address this problem, we consider a transformation to the convolution computation, which leads to a transformation of the accelerator design space and relaxes the pressure on the required DSP resources. We demonstrate that our approach enables us to strike a judicious balance between utilization of the on-chip memory, logic, and DSP resources, due to which our accelerator considerably outperforms the state of the art. We report the effectiveness of our approach on a Stratix-V GXA7 FPGA, which shows a 55% throughput improvement while using 6.25% fewer DSP blocks, compared to the best reported CNN accelerator on the same device.

Pushing the speed limit of constant-time discrete Gaussian sampling. A case study on the Falcon signature scheme

  • Angshuman Karmakar
  • Sujoy Sinha Roy
  • Frederik Vercauteren
  • Ingrid Verbauwhede

Sampling from a discrete Gaussian distribution has applications in lattice-based post-quantum cryptography. Several efficient solutions have been proposed in recent years. However, making a Gaussian sampler secure against timing attacks turned out to be a challenging research problem. In this work, we present a toolchain to instantiate an efficient constant-time discrete Gaussian sampler of arbitrary standard deviation and precision. We observe an interesting property of the mapping from input random bit strings to samples during a Knuth-Yao sampling algorithm and propose an efficient way of minimizing the Boolean expressions for the mapping. Our minimization approach results in up to 37% faster discrete Gaussian sampling compared to the previous work. Finally, we apply our optimized and secure Gaussian sampler in the lattice-based digital signature algorithm Falcon, which is a NIST submission, and provide experimental evidence that the overall performance of the signing algorithm degrades by at most 33% only due to the additional overhead of ‘constant-time’ sampling, including the 60% overhead of random number generation. Breaking a general belief, our results indirectly show that the use of discrete Gaussian samples in digital signature algorithms would be beneficial.

Full-Lock: Hard Distributions of SAT instances for Obfuscating Circuits using Fully Configurable Logic and Routing Blocks

  • Hadi Mardani Kamali
  • Kimia Zamiri Azar
  • Houman Homayoun
  • Avesta Sasan

In this paper, we propose a novel and SAT-resistant logic-locking technique, denoted as Full-Lock, to obfuscate and protect the hardware against threats including IP-piracy and reverse-engineering. The Full-Lock is constructed using a set of small-size fully Programmable Logic and Routing block (PLR) networks. The PLRs are SAT-hard instances with reasonable power, performance and area overheads which are used to obfuscate (1) the routing of a group of selected wires and (2) the logic of the gates preceding and following the selected wires. The Full-Lock resists removal attacks and breaks a SAT attack by significantly increasing the complexity of each SAT iteration.

A Cellular Automata Guided Obfuscation Strategy For Finite-State-Machine Synthesis

  • Rajit Karmakar
  • Suman Sekhar Jana
  • Santanu Chattopadhyay

A popular countermeasure against IP piracy relies on obfuscating the Finite State Machine (FSM), which is assumed to be the heart of a digital system. In this paper, we propose to use a special class of non-group additive cellular automata (CA) called D1 * CA, and its counterpart D1 * CAdual, to obfuscate each state-transition of an FSM. The synthesized FSM exhibits correct state-transitions only for a correct key, which is a designer’s secret. The proposed easily testable key-controlled FSM synthesis scheme can thwart reverse engineering attacks and thus offers IP protection.

An Efficient Spare-Line Replacement Scheme to Enhance NVM Security

  • Jie Xu
  • Dan Feng
  • Yu Hua
  • Fangting Huang
  • Wen Zhou
  • Wei Tong
  • Jingning Liu

Non-volatile memories (NVMs) are vulnerable to serious threats due to endurance variation. We identify a new type of malicious attack, called Uniform Address Attack (UAA), which performs uniform and sequential writes to each line of the whole memory, and wears out the weaker lines (lines with lower endurance) early. Experimental results show that the lifetime of NVMs under UAA is reduced to 4.1% of the ideal lifetime. To address this attack, we propose a spare-line replacement scheme called Max-WE (Maximize the Weak lines’ Endurance). By employing weak-priority and weak-strong-matching strategies for spare-line allocation, Max-WE is able to maximize the number of writes that the weakest lines can endure. Furthermore, Max-WE reduces the storage overhead of the mapping table by 85% through adopting a hybrid spare-line mapping scheme. Experimental results show that Max-WE can improve the lifetime by 9.5X with spare-line overhead and mapping overhead of 10% and 0.016% of the total space, respectively.
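
To see why uniform writes hurt a memory with endurance variation, the following minimal sketch compares the writes served before the first line fails against an idealized lifetime. It is not the paper's evaluation model: the function names, the endurance values, and the assumption that the memory fails as soon as its weakest line wears out (no spare lines, no wear leveling) are all illustrative.

```python
def lifetime_under_uniform_writes(endurance):
    """Approximate total writes served before the first line wears out
    when every line is written equally often (assumed: no spares)."""
    return len(endurance) * min(endurance)


def ideal_lifetime(endurance):
    """Total writes served if each line could be used up to its own endurance."""
    return sum(endurance)


# Hypothetical per-line endurance values (writes until failure).
endurance = [100, 1000, 1000, 1000]
ratio = lifetime_under_uniform_writes(endurance) / ideal_lifetime(endurance)
print(f"lifetime under uniform writes: {ratio:.1%} of ideal")
```

The wider the gap between the weakest line and the rest, the smaller this ratio becomes, which is the effect a spare-line scheme tries to counter.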

Analyzing Parallel Real-Time Tasks Implemented with Thread Pools

  • Daniel Casini
  • Alessandro Biondi
  • Giorgio Buttazzo

Although several works in the literature have targeted predictable execution models for parallel tasks, limited attention has been devoted to studying how specific implementation techniques may affect their execution. This paper highlights some issues that can arise when executing parallel tasks with thread pools, which may lead to deadlocks and performance degradation when adopting blocking synchronization mechanisms. A new parallel task model, inspired by a realistic design found in popular software systems, is first presented to study this problem. Then, formal conditions to ensure the absence of deadlocks and schedulability analysis techniques are proposed under both global and partitioned scheduling.

Scheduling and Analysis of Parallel Real-Time Tasks with Semaphores

  • Xu Jiang
  • Nan Guan
  • Weichen Liu
  • Maolin Yang

This paper for the first time studies the scheduling and analysis of parallel real-time tasks with semaphores. In parallel task systems, each task may issue multiple requests to a semaphore, which raises new challenges to the design and analysis problems. We propose a new locking protocol LPP that limits the maximal number of requests to a semaphore by a task that can block other tasks at any time. We develop analysis techniques to safely bound the task response times, with which we prove that the best real-time performance is achieved if only one request to a semaphore by a task is allowed to block other tasks at a time. Experiments under different parameter settings are conducted to compare our proposed protocol and analysis techniques with the state-of-the-art spinlock protocol and analysis techniques for parallel real-time tasks.

Real-Time Scheduling and Analysis of Synchronous OpenMP Task Systems with Tied Tasks

  • Jinghao Sun
  • Nan Guan
  • Xiaoqing Wang
  • Chenhan Jin
  • Yaoyao Chi

Synchronous parallel tasks are widely used in HPC for pursuing high average performance, but rarely consider how to guarantee good timing predictability. OpenMP is a promising framework for multi-core real-time embedded systems. The synchronous OpenMP tasks are significantly more difficult to schedule and analyze due to constraints posed by OpenMP specifications. An important OpenMP feature is the tied task, which must execute on the same thread during its whole life cycle. This paper designs a novel method, called group scheduling, to schedule synchronous OpenMP tasks, which divides tasks into several groups, and assigns some of them to dedicated cores, in order to isolate tied tasks. We derive a linear-time computable response time bound. Experiments with both randomly generated and realistic OpenMP tasks show that our new bound significantly outperforms the existing bound.

DCFNoC: A Delayed Conflict-Free Time Division Multiplexing Network on Chip

  • Tomás Picornell
  • José Flich
  • Carles Hernández
  • José Duato

The adoption of many-cores in safety-critical systems requires real-time capable networks on chip (NoC). In this paper we propose a new time-predictable NoC design paradigm where contention within the network is eliminated. This new paradigm builds on the Channel Dependency Graph (CDG) and guarantees by design the absence of contention. Our delayed conflict-free NoC (DCFNoC) is able to naturally inject messages using a TDM period equal to the optimal theoretical bound and without the need of using a computationally demanding offline process. Results show that DCFNoC guarantees time predictability with very low implementation cost.

Learning Temporal Specifications from Imperfect Traces Using Bayesian Inference

  • Artur Mrowca
  • Martin Nocker
  • Sebastian Steinhorst
  • Stephan Günnemann

Verification is essential to prevent malfunctioning of software systems. Model checking allows verifying conformity with nominal behavior. As manual definition of specifications for such systems becomes infeasible, automated techniques to mine specifications from data become increasingly important. Existing approaches produce specifications of limited length, do not segregate functions, and do not easily allow including expert input. We present BaySpec, a dynamic mining approach to extract temporal specifications from Bayesian models, which represent behavioral patterns. This allows learning specifications of arbitrary length from imperfect traces. Within this framework we introduce a novel extraction algorithm that for the first time mines LTL specifications from such models.

Accelerating FPGA Prototyping through Predictive Model-Based HLS Design Space Exploration

  • Shuangnan Liu
  • Francis CM Lau
  • Benjamin Carrion Schafer

One of the advantages of High-Level Synthesis (HLS), also called C-based VLSI-design, over traditional RT-level VLSI design flows, is that multiple micro-architectures of unique area vs. performance can be automatically generated by setting different synthesis options, typically in the form of synthesis directives specified as pragmas in the source code. This design space exploration (DSE) is very time-consuming and can easily take multiple days for complex designs. At the same time, and because of the complexity in designing large ASICs, verification teams now routinely make use of emulation and prototyping to test the circuit before the silicon is taped out. This also allows the embedded software designers to start their work earlier in the design process, thus further reducing the Turn-Around-Time (TAT). In this work, we present a method to automatically re-optimize ASIC designs specified as behavioral descriptions for HLS to FPGAs for emulation and prototyping, based on the observation that synthesis directives that lead to efficient micro-architectures for ASICs do not directly translate into optimal micro-architectures in FPGAs. This implies that the HLS DSE process would have to be completely repeated for the target FPGA. To avoid this, this work presents a predictive model-based method that takes as inputs the results of an ASIC HLS DSE and automatically, without the need to re-explore the behavioral description, finds the Pareto-optimal micro-architectures for the target FPGA. Experimental results comparing our predictive-model based method vs. completely re-exploring the search space show that our proposed method works well.

Sample-Guided Automated Synthesis for CCSL Specifications

  • Ming Hu
  • Tongquan Wei
  • Min Zhang
  • Frédéric Mallet
  • Mingsong Chen

The Clock Constraint Specification Language (CCSL) has been widely investigated in verifying causal and temporal timing behaviors of real-time embedded systems. However, due to limited expertise in formal modeling, it is difficult for requirement engineers to completely and accurately derive CCSL specifications from natural language-based design descriptions. To address this problem, we present a novel approach that facilitates automated synthesis of CCSL specifications under the guidance of sampled (expected) timing behaviors of target systems. By encoding sampled behaviors and incomplete CCSL constraints provided by requirement engineers using our proposed transformation templates, the CCSL specification synthesis problem can be naturally converted into a SKETCH synthesis problem, which enables the automated generation of CCSL specifications with high accuracy. Experiments on both well-known benchmarks and synthetic examples demonstrate the effectiveness and scalability of our approach.

DHOOM: Reusing Design-for-Debug Hardware for Online Monitoring

  • Neetu Jindal
  • Sandeep Chandran
  • Preeti Ranjan Panda
  • Sanjiva Prasad
  • Abhay Mitra
  • Kunal Singhal
  • Shubham Gupta
  • Shikhar Tuli

Runtime verification employs dedicated hardware or software monitors to check whether program properties hold at runtime. However, these monitors often incur high area and performance overheads depending on whether they are implemented in hardware or software. In this work, we propose DHOOM, an architectural framework for runtime monitoring of program assertions, which exploits the combination of a reconfigurable fabric present alongside a processor core with the vestigial on-chip Design-for-Debug hardware. This combination of hardware features allows DHOOM to minimize the overall performance overhead of runtime verification, even when subject to a given area constraint. We present an algorithm for dynamically selecting an effective subset of assertion monitors that can be accommodated in the available programmable fabric, while instrumenting the remaining assertions in software. We show that our proposed strategy, while respecting area constraints, reduces the performance overhead of runtime verification by up to 32% when compared with a baseline of software-only monitors.

Efficient System Architecture in the Era of Monolithic 3D: Dynamic Inter-tier Interconnect and Processing-in-Memory

  • Dylan Stow
  • Itir Akgun
  • Wenqin Huangfu
  • Yuan Xie
  • Xueqi Li
  • Gabriel H. Loh

Emerging Monolithic Three-Dimensional (M3D) integration technology will not only provide improved circuit density through the high-bandwidth coupling of multiple vertically-stacked layers, but it can also provide new architectural opportunities for on-chip computation, memory, and communication that are beyond the capabilities of existing process and packaging technologies. For example, with massive parallel communication between heterogeneous memory and compute layers, existing processing-in-memory architectures can be optimized and expanded, developing into efficient and flexible near-data processors. Additionally, multiple tiers of interconnect can be dynamically leveraged to provide an efficient, scalable interconnect fabric that spans the three-dimensional system. This work explores some of the challenges and opportunities presented by M3D technology for emerging computer architectures, with focus on improving efficiency and increasing system flexibility.

RTL-to-GDS Tool Flow and Design-for-Test Solutions for Monolithic 3D ICs

  • Heechun Park
  • Kyungwook Chang
  • Bon Woong Ku
  • Jinwoo Kim
  • Edward Lee
  • Daehyun Kim
  • Arjun Chaudhuri
  • Sanmitra Banerjee
  • Saibal Mukhopadhyay
  • Krishnendu Chakrabarty
  • Sung Kyu Lim

Monolithic 3D IC overcomes the limitation of the existing through-silicon-via (TSV) based 3D IC by providing denser vertical connections with nano-scale inter-layer vias (ILVs). In this paper, we demonstrate a thorough RTL-to-GDS design flow for monolithic 3D IC, which is based on commercial 2D place-and-route (P&R) tools and clever ways to extend them to handle 3D IC designs and simulations. We also provide a low-cost built-in-self-test (BIST) method to detect various faults that can occur on ILVs. Lastly, we present a resistive random access memory (ReRAM) compiler that generates memory modules that are to be integrated in monolithic 3D ICs.

MobiEye: An Efficient Cloud-based Video Detection System for Real-time Mobile Applications

  • Jiachen Mao
  • Qing Yang
  • Ang Li
  • Hai Li
  • Yiran Chen

In recent years, machine learning research has largely shifted focus from the cloud to the edge. While the resulting algorithm- and hardware-level optimizations have enabled local execution for the majority of deep neural networks (DNNs) on edge devices, the sheer magnitude of DNNs associated with real-time video detection workloads has forced them to remain relegated to remote execution in the cloud. This is problematic when combined with the strict latency requirements that are coupled with these workloads, and imposes a unique set of challenges not directly addressed in prior works. In this work, we design MobiEye, a cloud-based video detection system optimized for deployment in real-time mobile applications. MobiEye is able to achieve up to a 32% reduction in latency when compared to a conventional implementation of a video detection system with only a marginal reduction in accuracy.

Enabling File-Oriented Fast Secure Deletion on Shingled Magnetic Recording Drives

  • Shuo-Han Chen
  • Ming-Chang Yang
  • Yuan-Hao Chang
  • Chun-Feng Wu

Existing secure deletion approaches are inefficient in erasing data permanently because file systems have no knowledge of the data layout on the storage device, nor is the storage device aware of file information within the file systems. This inefficiency is exacerbated on the emerging shingled magnetic recording (SMR) drive due to its inherent sequential-write constraint. On SMR drives, secure deletion requests may lead to serious write amplification and performance degradation if the data layout is not properly configured. This observation motivates us to propose a file-oriented fast secure deletion (FFSD) strategy to alleviate the negative impacts of SMR drives’ sequential-write constraint and improve the efficiency of secure deletion operations on SMR drives. A series of experiments was conducted to demonstrate the capability of the proposed strategy in improving the efficiency of secure deletion on SMR drives.

Enabling Failure-resilient Intermittently-powered Systems Without Runtime Checkpointing

  • Wei-Ming Chen
  • Pi-Cheng Hsiu
  • Tei-Wei Kuo

Self-powered intermittent systems enable accumulative execution in unstable power environments, where checkpointing is often adopted as a means to achieve data consistency and system recovery under power failures. However, existing approaches based on the checkpointing paradigm normally require system suspension and/or logging at runtime. This paper presents a design which enables failure-resilient intermittently-powered systems without runtime checkpointing. By leveraging the characteristics of data accessed in hybrid memory, our design enforces the consistency and serializability of concurrent task execution while maximizing computation progress, and allows instant system recovery after power resumption. We integrated the design into FreeRTOS running on a Texas Instruments device. Experimental results show that our design achieves up to 11.8 times the computation progress achieved by checkpointing-based approaches, while reducing the recovery time by nearly 90%.

Sensor Drift Calibration via Spatial Correlation Model in Smart Building

  • Tinghuan Chen
  • Bingqing Lin
  • Hao Geng
  • Bei Yu

Sensor drift is an intractable obstacle to practical temperature measurement in smart buildings. In this paper, we propose a sensor spatial correlation model. Given prior knowledge, maximum a posteriori (MAP) estimation is performed to calibrate drifts. MAP is formulated as a non-convex problem with three hyper-parameters. An alternating-based method is proposed to solve this non-convex formulation. Cross-validation and expectation-maximization with Gibbs sampling are further used to determine the hyper-parameters. Experimental results show that, on benchmarks from the simulator EnergyPlus, the proposed framework achieves robust drift calibration and a better trade-off between accuracy and runtime than the state-of-the-art method.

Machine Learning-Based Pre-Routing Timing Prediction with Reduced Pessimism

  • Erick Carvajal Barboza
  • Nishchal Shukla
  • Yiran Chen
  • Jiang Hu

Optimizations at placement stage need to be guided by timing estimation prior to routing. To handle timing uncertainty due to the lack of routing information, people tend to make very pessimistic predictions such that performance specification can be ensured in the worst case. Such pessimism causes over-design that wastes chip resources or design effort. In this work, a machine learning-based pre-routing timing prediction approach is introduced. Experimental results show that it can reach accuracy near post-routing sign-off analysis. Compared to a commercial pre-routing timing estimation tool, it reduces false positive rate by about 2/3 in reporting timing violations.

LithoGAN: End-to-End Lithography Modeling with Generative Adversarial Networks

  • Wei Ye
  • Mohamed Baker Alawieh
  • Yibo Lin
  • David Z. Pan

Lithography simulation is one of the most fundamental steps in process modeling and physical verification. Conventional simulation methods suffer from a tremendous computational cost for achieving high accuracy. Recently, machine learning was introduced to trade off between accuracy and runtime through speeding up the resist modeling stage of the simulation flow. In this work, we propose LithoGAN, an end-to-end lithography modeling framework based on a generative adversarial network (GAN), to map the input mask patterns directly to the output resist patterns. Our experimental results show that LithoGAN can predict resist patterns with high accuracy while achieving orders of magnitude speedup compared to conventional lithography simulation and previous machine learning-based approaches.

A General Cache Framework for Efficient Generation of Timing Critical Paths

  • Kuan-Ming Lai
  • Tsung-Wei Huang
  • Tsung-Yi Ho

The recent TAU 2018 contest sought novel ideas for efficient generation of timing reports. When the timing graph is updated, users subsequently and sequentially query different forms of timing reports. This process is computationally expensive and inherently complex. Therefore, we introduce in this paper a general cache framework for efficient generation of timing critical paths. Our framework efficiently supports (1) a cache scheme to minimize duplicate calculation, (2) graph contraction to reduce the search space, and (3) multi-threading. We evaluated our framework on the TAU 2018 contest benchmarks and demonstrated promising performance over the top performer.

Effective-Resistance Preserving Spectral Reduction of Graphs

  • Zhiqiang Zhao
  • Zhuo Feng

This paper proposes a scalable algorithmic framework for effective-resistance preserving spectral reduction of large undirected graphs. The proposed method allows computing much smaller graphs while preserving the key spectral (structural) properties of the original graph. Our framework is built upon the following three key components: a spectrum-preserving node aggregation and reduction scheme, a spectral graph sparsification framework with iterative edge weight scaling, as well as effective-resistance preserving post-scaling and iterative solution refinement schemes. By leveraging a recent similarity-aware spectral sparsification method and a graph-theoretic algebraic multigrid (AMG) Laplacian solver, a novel constrained stochastic gradient descent (SGD) optimization approach is proposed for achieving truly scalable performance (nearly-linear complexity) for spectral graph reduction. We show that the resultant spectrally-reduced graphs can robustly preserve the first few nontrivial eigenvalues and eigenvectors of the original graph Laplacian and thus allow for developing highly-scalable spectral graph partitioning and circuit simulation algorithms.
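
As a reminder of the quantity being preserved, the sketch below computes the effective resistance between two nodes of a small weighted graph directly from its Laplacian. It illustrates the definition only, not the reduction framework described above; the function name and the dense Gaussian-elimination solver are choices made for this example and would not scale to large graphs. Edge weights are treated as conductances.

```python
def effective_resistance(n, edges, s, t):
    """Effective resistance between nodes s and t of a connected,
    undirected graph with n nodes. edges is a list of (u, v, weight)."""
    if s == t:
        return 0.0
    # Build the graph Laplacian L = D - W.
    lap = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        lap[u][u] += w
        lap[v][v] += w
        lap[u][v] -= w
        lap[v][u] -= w
    # Ground node t (drop its row and column) so the system is nonsingular,
    # then inject a unit current at s.
    keep = [i for i in range(n) if i != t]
    a = [[lap[i][j] for j in keep] for i in keep]
    b = [1.0 if i == s else 0.0 for i in keep]
    m = len(keep)
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution yields the node voltages.
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        acc = b[r] - sum(a[r][c] * x[c] for c in range(r + 1, m))
        x[r] = acc / a[r][r]
    # With t grounded and unit current, the voltage at s is the resistance.
    return x[keep.index(s)]


path = [(0, 1, 1.0), (1, 2, 1.0)]
triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
print(effective_resistance(3, path, 0, 2))
print(effective_resistance(3, triangle, 0, 1))
```

A reduced graph that preserves effective resistances would return approximately the same values for the node pairs that survive the reduction.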

Revisiting the ARM Debug Facility for OS Kernel Security

  • Jinsoo Jang
  • Brent Byunghoon Kang

Hardware debugging facilities, such as watchpoints, have been used for software development and analysis. In this paper, we expanded the use of watchpoints as a hardware security primitive for enhancing the runtime security of mobile devices. By analyzing the watchpoints in detail, we derived useful watchpoint properties that can be exploited to build security applications. Based on our analysis, we designed example applications for hardening the OS kernel by exploiting watchpoints. The proposed applications were implemented on a Juno development board with 64-bit ARM architecture (ARMv8). Hardening the kernel by fully enabling the proposed schemes was found to impose reasonable overhead, i.e., 3% with SPEC CPU2006.

Low-Overhead Power Trace Obfuscation for Smart Meter Privacy

  • Daniele Jahier Pagliari
  • Sara Vinco
  • Enrico Macii
  • Massimo Poncino

Smart meters communicate to the utility provider fine-grain information about a user’s energy consumption, which could be used to infer the user’s habits and thus pose a critical privacy risk. State-of-the-art solutions try to obfuscate the readings of a meter either by using a large re-chargeable battery to filter the trace or by adding random noise to alter it. Both solutions, however, have significant drawbacks: large batteries are prohibitively expensive, whereas digitally added noise implies that the user entrusts the utility provider to protect his/her privacy.

This work proposes a hybrid approach in which zero-average noise is inserted in the power trace by means of a small energy storage device (battery or supercapacitor); the distinguishing feature of our approach is that this obfuscating device is indistinguishable from any other load and therefore it complicates by construction the load disaggregation task performed by the provider or by a malicious third party. Simulation results show that our device can achieve privacy enhancement comparable or superior to that of a solution based on a large battery, at a smaller cost.

ARM2GC: Succinct Garbled Processor for Secure Computation

  • Ebrahim M. Songhori
  • M. Sadegh Riazi
  • Siam U. Hussain
  • Ahmad-Reza Sadeghi
  • Farinaz Koushanfar

We present ARM2GC, a novel secure computation framework based on Yao’s Garbled Circuit (GC) protocol and the ARM processor. It allows users to develop privacy-preserving applications using standard high-level programming languages (e.g., C) and compile them using off-the-shelf ARM compilers, e.g., gcc-arm. The main enabler of this framework is the introduction of SkipGate, an algorithm that dynamically omits the communication and encryption cost of a gate when its output is independent of the private data. SkipGate greatly enhances the performance of ARM2GC by omitting costs of the gates associated with the instructions of the compiled binary, which is known by both parties involved in the computation. Our evaluation on benchmark functions demonstrates that ARM2GC outperforms the prior best solution by 156×.

Filianore: Better Multiplier Architectures for LWE-based Post-Quantum Key Exchange

  • Song Bian
  • Masayuki Hiromoto
  • Takashi Sato

The (ring) learning with errors (RLWE/LWE) problem is one of the most promising candidates for constructing quantum-secure key exchange protocols. In this work, we design and implement specialized hardware multiplier units for both LWE and RLWE key exchange schemes to maximize their computational efficiency. By exploiting the algebraic structure with aggressive parameter sets, we show that the design and implementation of LWE key exchange on hardware is considerably easier and more flexible than RLWE. Using the proposed architectures, we show that client-side energy-efficiency of LWE-based key exchange can be on the same order as, or even (slightly) better than, RLWE-based schemes, making LWE an attractive option for designing post-quantum cryptographic suites.

Adaptive Granularity Encoding for Energy-efficient Non-Volatile Main Memory

  • Jie Xu
  • Dan Feng
  • Yu Hua
  • Wei Tong
  • Jingning Liu
  • Chunyan Li
  • Gaoxiang Xu
  • Yiran Chen

Data encoding methods have been proposed to alleviate the high write energy and limited write endurance disadvantages of Non-Volatile Memories (NVMs). Encoding methods are proved to be effective through theoretical analysis. Under the data patterns of workloads, existing encoding methods could become inefficient. We observe that the new cache line and the old cache line have many redundant (or unmodified) words. This makes the utilization ratio of the tag bits of data encoding methods become very low, and the efficiency of data encoding method decreases. To fully exploit the tag bits to reduce the bit flips of NVMs, we propose REdundant word Aware Data encoding (READ). The key idea of READ is to share the tag bits among all the words of the cache line and dynamically assign the tag bits to the modified words. The high utilization ratio of the tag bits in READ leads to heavy bit flips of the tag bits. To reduce the bit flips of the tag bits in READ, we further propose Sequential flips Aware Encoding (SAE). SAE is designed based on the observation that many sequential bits of the new data and the old data are opposite. For those writes, the bit flips of the tag bits will increase with the number of tag bits. SAE dynamically selects the encoding granularity which causes the minimum bit flips instead of using the minimum encoding granularity. Experimental results show that our schemes can reduce the energy consumption by 20.3%, decrease the bit flips by 25.0%, and improve the lifetime by 52.1%.
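
For readers unfamiliar with tag-bit encoding, the sketch below shows a classic Flip-N-Write style scheme with one tag bit per word: the word is stored inverted whenever that causes fewer bit flips. This is assumed here as a representative baseline encoding; it is neither READ nor SAE, and the function names and the 8-bit word width are illustrative.

```python
def flip_n_write(old_word, old_tag, new_data, width=8):
    """Return (stored_word, tag) representing new_data with few bit flips.

    old_word and old_tag are the bits currently stored; a tag of 1 means
    the stored word is the bitwise inverse of the logical data.
    """
    mask = (1 << width) - 1
    plain = new_data & mask
    inverted = ~new_data & mask
    # Count cell flips for each option, including a flip of the tag bit.
    cost_plain = bin(old_word ^ plain).count("1") + (old_tag != 0)
    cost_inverted = bin(old_word ^ inverted).count("1") + (old_tag != 1)
    if cost_inverted < cost_plain:
        return inverted, 1
    return plain, 0


def decode(stored_word, tag, width=8):
    """Recover the logical data from a stored word and its tag bit."""
    mask = (1 << width) - 1
    return (~stored_word & mask) if tag else stored_word


stored, tag = flip_n_write(0b00000000, 0, 0b11111110)
print(bin(stored), tag, bin(decode(stored, tag)))
```

When a word is unmodified, its tag bit does no useful work in this baseline, which is the under-utilization the abstract points to.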

Magma: A Monolithic 3D Vertical Heterogeneous ReRAM-based Main Memory Architecture

  • Farzaneh Zokaee
  • Mingzhe Zhang
  • Xiaochun Ye
  • Dongrui Fan
  • Lei Jiang

3D vertical ReRAM (3DV-ReRAM) emerges as one of the most promising alternatives to DRAM due to its good scalability beyond 10nm. Monolithic 3D (M3D) integration enables 3DV-ReRAM to improve its array area efficiency by stacking peripheral circuits underneath an array. A 3DV-ReRAM array has to be large enough to fully cover the peripheral circuits, but such a large array size significantly increases its access latency. In this paper, we propose Magma, an M3D stacked heterogeneous ReRAM array architecture, for future main memory systems by stacking a large unipolar 3DV-ReRAM array on top of a small bipolar 3DV-ReRAM array and peripheral circuits shared by the two arrays. We further architect the small bipolar array as a direct-mapped cache for the main memory system. Compared to homogeneous ReRAMs, on average, Magma improves the system performance by 11.4%, reduces the system energy by 24.3% and obtains > 5-year lifetime.

A Wear-Leveling-Aware Fine-Grained Allocator for Non-Volatile Memory

  • Xianzhang Chen
  • Zhuge Qingfeng
  • Qiang Sun
  • Edwin H.-M. Sha
  • Shouzhen Gu
  • Chaoshu Yang
  • Chun Jason Xue

Emerging non-volatile memories (NVMs) are promising candidates for main memory thanks to their advanced characteristics. However, the low endurance of NVM cells makes them vulnerable to frequent fine-grained updates. This paper proposes a Wear-leveling Aware Fine-grained Allocator (WAFA) for NVM. WAFA divides pages into basic memory units to support fine-grained updates. WAFA allocates the basic memory units of a page in a rotational manner to distribute fine-grained updates evenly on memory cells. The fragmented basic memory units of each page caused by the memory allocation and deallocation operations are reorganized by a reform operation. We implement WAFA in Linux kernel 4.4.4. Experimental results show that WAFA reduces the total writes of pages by 81.1% and 40.1% compared with NVMalloc and nvm_alloc, the state-of-the-art wear-conscious allocators for NVM, respectively. Meanwhile, WAFA shows 48.6% and 42.3% performance improvement over NVMalloc and nvm_alloc, respectively.
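
The idea of handing out a page's basic memory units in a rotational manner can be sketched as follows. This is a hypothetical illustration, not WAFA's kernel implementation: the class name, the free-list representation, and the absence of a reform operation are all simplifications made for this example.

```python
class RotatingPageAllocator:
    """Hypothetical allocator for one page split into fixed-size units."""

    def __init__(self, num_units):
        self.free = [True] * num_units
        self.next_unit = 0  # rotation pointer

    def alloc(self):
        """Return the first free unit at or after the pointer, or None."""
        n = len(self.free)
        for step in range(n):
            unit = (self.next_unit + step) % n
            if self.free[unit]:
                self.free[unit] = False
                # Advance the pointer so a just-freed unit is not reused at once.
                self.next_unit = (unit + 1) % n
                return unit
        return None

    def release(self, unit):
        """Mark a unit as free again."""
        self.free[unit] = True


page = RotatingPageAllocator(4)
first, second = page.alloc(), page.alloc()
page.release(first)
print(page.alloc())  # continues past the pointer instead of reusing unit 0
```

Because the pointer keeps moving forward, repeated allocate-and-free cycles spread writes over all units of the page rather than hammering the lowest-numbered free unit.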

DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement

  • Yibo Lin
  • Shounak Dhar
  • Wuxi Li
  • Haoxing Ren
  • Brucek Khailany
  • David Z. Pan

Placement for very-large-scale integrated (VLSI) circuits is one of the most important steps for design closure. This paper proposes a novel GPU-accelerated placement framework DREAMPlace, by casting the analytical placement problem equivalently to training a neural network. Implemented on top of a widely-adopted deep learning toolkit PyTorch, with customized key kernels for wirelength and density computations, DREAMPlace can achieve over 30× speedup in global placement without quality degradation compared to the state-of-the-art multi-threaded placer RePlAce. We believe this work shall open up new directions for revisiting classical EDA problems with advancement in AI hardware and software.

BiG: A Bivariate Gradient-Based Wirelength Model for Analytical Circuit Placement

  • Fan-Keng Sun
  • Yao-Wen Chang

The analytical formulation has been shown to be the most effective for circuit placement. A key ingredient of analytical placement is its wirelength model, which needs to be differentiable and to accurately approximate a golden wirelength model such as half-perimeter wirelength. Existing wirelength models derive gradients by differentiating smooth maximum (minimum) functions, such as the log-sum-exp and weighted-average models. In this paper, we propose a novel bivariate gradient-based wirelength model, namely BiG, which directly derives a gradient with any bivariate smooth maximum (minimum) function without any differentiation. Our wirelength model can effectively combine the advantages of both multivariate and bivariate functions. Experimental results show that our BiG model effectively and efficiently improves placement solutions.
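
For context, the two existing smooth models named in the abstract can be sketched in one dimension as follows. This is a minimal illustration of the standard log-sum-exp and weighted-average approximations of half-perimeter wirelength, not of the BiG model itself. The function names and the smoothing parameter `gamma` are chosen here for illustration.

```python
import math


def hpwl_1d(xs):
    """Exact half-perimeter wirelength of one net along one axis."""
    return max(xs) - min(xs)


def lse_wirelength(xs, gamma):
    """Log-sum-exp approximation: smooth max(x) plus smooth max(-x)."""
    def smooth_max(vals):
        m = max(vals)  # shift by the maximum for numerical stability
        return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in vals))
    return smooth_max(xs) + smooth_max([-x for x in xs])


def wa_wirelength(xs, gamma):
    """Weighted-average approximation: softmax-weighted mean of the pins."""
    def smooth_max(vals):
        m = max(vals)
        ws = [math.exp((v - m) / gamma) for v in vals]
        return sum(v * w for v, w in zip(vals, ws)) / sum(ws)
    return smooth_max(xs) + smooth_max([-x for x in xs])
```

Both functions are differentiable in the pin coordinates. The log-sum-exp value never falls below the exact wirelength and the weighted-average value never exceeds it, and both tighten as `gamma` shrinks.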

Routability-driven Mixed-size Placement Prototyping Approach Considering Design Hierarchy and Indirect Connectivity Between Macros

  • Jai-Ming Lin
  • Szu-Ting Li
  • Yi-Ting Wang

Mixed-size placement has become a great challenge in modern VLSI design. To handle this problem, the three-stage mixed-size placement methodology is considered the most suitable approach for a commercial design flow, where placement prototyping is the most important stage. Since standard cells and macros have to be considered simultaneously in this stage, it is more complicated than the other two stages. To reduce complexity and improve design quality, this paper applies the multilevel framework with a design hierarchy-guided clustering scheme to obtain a better coarsening result and thereby improve the outcome of the following stages. We propose an efficient and effective clustering scheme to group standard cells and macros based on the tree built from their design hierarchies. More importantly, our clustering algorithm considers indirect connectivity between macros, which is ignored by previous works. Moreover, we propose a new overlapping bounding box constraint to avoid clustering improper macros which have connections to fixed pins. The experimental results show that wirelength and routability are improved by our methodology.

NCTUcell: A DDA-Aware Cell Library Generator for FinFET Structure with Implicitly Adjustable Grid Map

  • Yih-Lang Li
  • Shih-Ting Lin
  • Shinichi Nishizawa
  • Hong-Yan Su
  • Ming-Jie Fong
  • Oscar Chen
  • Hidetoshi Onodera

For the 7nm technology node, cell placement with drain-to-drain abutment (DDA) requires additional filler cells, increasing placement area. This is the first work to fully automatically synthesize a DDA-aware cell library with an optimized number of drains on the cell boundary based on the ASAP 7nm PDK. We propose a DDA-aware dynamic programming based transistor placement. Previous works ignore the use of the M0 layer in cell routing. We are the first to propose an ILP-based M0 routing planning. With M0 routing, the congestion of M1 routing can be reduced and the pin accessibility can be improved due to the diminished use of M2 routing. To improve the routing resource utilization, we propose an implicitly adjustable grid map, which enables the maze routing to explore more routing solutions. Experimental results show that block placement using the DDA-aware cell library requires 70.9% fewer filler cells than placement using a traditional cell library, which achieves a block area reduction rate of 5.7%.

Design Principles for True Random Number Generators for Security Applications

  • Miloš Grujić
  • Vladimir Rožić
  • David Johnston
  • John Kelsey
  • Ingrid Verbauwhede

The generation of high-quality true random numbers is essential in security applications. For secure communication, we also require high-quality true random number generators (TRNGs) in embedded and IoT devices. This paper provides insights into modern TRNG design principles and their evaluation, based on the requirements of standards and on design experience. We illustrate our approach with a case study of a recently proposed delay chain based TRNG.

Rapid Generation of High-Quality RISC-V Processors from Functional Instruction Set Specifications

  • Gai Liu
  • Joseph Primmer
  • Zhiru Zhang

The increasing popularity of compute acceleration for emerging domains such as artificial intelligence and computer vision has led to the growing need for domain-specific accelerators, often implemented as specialized processors that execute a set of domain-optimized instructions. The ability to rapidly explore (1) various possibilities of the customized instruction set, and (2) its corresponding micro-architectural features is critical to achieve the best quality-of-results (QoRs). However, this ability is frequently hindered by the manual design process at the register transfer level (RTL). Such an RTL-based methodology is often expensive and slow to react when the design specifications change at the instruction-set level and/or micro-architectural level.

We address this deficiency in domain-specific processor design with ASSIST, a behavior-level synthesis framework for RISC-V processors. From an untimed functional instruction set description, ASSIST generates a spectrum of RISC-V processors implementing varying micro-architectural design choices, which enables effective tradeoffs between different QoR metrics. We demonstrate the automatic synthesis of more than 60 in-order processor implementations with varying pipeline structures from the RISC-V 32I instruction set, some of which dominate the manually optimized counterparts in the area-performance Pareto frontier. In addition, we propose an autotuning-based approach for optimizing the implementations under a given performance constraint and the technology target. We further present case studies of synthesizing various custom instruction extensions and customized instruction sets for cryptography and machine learning applications.

autoAx: An Automatic Design Space Exploration and Circuit Building Methodology utilizing Libraries of Approximate Components

  • Vojtech Mrazek
  • Muhammad Abdullah Hanif
  • Zdenek Vasicek
  • Lukas Sekanina
  • Muhammad Shafique

Approximate computing is an emerging paradigm for developing highly energy-efficient computing systems such as various accelerators. In the literature, many libraries of elementary approximate circuits have already been proposed to simplify the design process of approximate accelerators. Because these libraries contain from tens to thousands of approximate implementations for a single arithmetic operation, it is intractable to find an optimal combination of approximate circuits in the library even for an application consisting of a few operations. An open problem is “how to effectively combine circuits from these libraries to construct complex approximate accelerators”. This paper proposes a novel methodology for searching, selecting and combining the most suitable approximate circuits from a set of available libraries to generate an approximate accelerator for a given application. To enable fast design space generation and exploration, the methodology utilizes machine learning techniques to create computational models estimating the overall quality of processing and hardware cost without performing full synthesis at the accelerator level. Using the methodology, we construct hundreds of approximate accelerators (for a Sobel edge detector) showing different but relevant tradeoffs between the quality of processing and hardware cost and identify a corresponding Pareto-frontier. Furthermore, when searching for approximate implementations of a generic Gaussian filter consisting of 17 arithmetic operations, the proposed approach allows us to identify approximately 10^3 highly relevant implementations from 10^23 possible solutions in a few hours, while the exhaustive search would take four months on a high-end processor.
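
The Pareto-frontier step mentioned in the abstract can be sketched generically. The code below is a plain quadratic-time dominance filter, not the autoAx search itself. It assumes each candidate accelerator is summarized by a hypothetical `(error, cost)` pair in which lower is better for both values.

```python
def pareto_front(points):
    """Keep the (error, cost) points that no other point dominates.

    A point q dominates p when q is no worse in both objectives and
    differs from p (so it is strictly better in at least one).
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))  # drop duplicates, order by error


# Hypothetical candidates: (quality loss, hardware cost)
candidates = [(0.1, 9), (0.2, 5), (0.3, 6), (0.5, 2), (0.5, 3)]
front = pareto_front(candidates)
```

Here `(0.3, 6)` is dropped because `(0.2, 5)` is better in both objectives, and `(0.5, 3)` is dropped because `(0.5, 2)` has the same error at lower cost.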

Graph-Morphing: Exploiting Hidden Parallelism of Non-Stencil Computation in High-Level Synthesis

  • Yu Zou
  • Mingjie Lin

Non-stencil kernels with irregular memory access patterns pose unique challenges to achieving high computing performance and hardware efficiency in FPGA high-level synthesis. We present a highly versatile and systematic approach, termed as Graph-Morphing, to constructing a reconfigurable computing engine specifically optimized to perform non-stencil kernel computing. Graph-Morphing achieves significant performance improvement by fragmenting operations across loop iterations and subsequently rescheduling computation and data to maximize overall performance. In experiments, Graph-Morphing achieves 2-13 times performance improvement albeit with significantly more hardware usage. For accelerating non-stencil kernel computing, Graph-Morphing proposes a new research direction.

Overcoming Data Transfer Bottlenecks in FPGA-based DNN Accelerators via Layer Conscious Memory Management

  • Xuechao Wei
  • Yun Liang
  • Jason Cong

Deep Neural Networks (DNNs) are becoming increasingly complex. Previous hardware accelerator designs neglect the layer diversity in terms of computation and communication behavior. On-chip memory resources are underutilized for the memory-bound layers, leading to suboptimal performance. In addition, the increasing complexity of DNN structures makes on-chip memory allocation difficult. To address these issues, we propose a layer conscious memory management framework for FPGA-based DNN hardware accelerators. Our framework exploits the layer diversity and the disjoint lifespan information of memory buffers to efficiently utilize the on-chip memory, improving the performance of the memory-bound layers and thus the overall performance of DNNs. It consists of four key techniques that work in coordination with each other. We first devise a memory allocation algorithm to allocate on-chip buffers for the memory-bound layers. In addition, buffer sharing between different layers is applied to improve on-chip memory utilization. Finally, buffer prefetching and splitting are used to further reduce latency. Experiments show that our techniques can achieve 1.36X performance improvement compared with previous designs.

High-Level Synthesis of Resource-oriented Approximate Designs for FPGAs

  • Marcos T. Leipnitz
  • Gabriel L. Nazar

When attempting to make a design fit a set of the heterogeneous resources found in Field-Programmable Gate Arrays (FPGAs), designers using High-Level Synthesis (HLS) may resort to approximate approaches. However, current FPGA-oriented approximate HLS tools do not allow specifying constraints on heterogeneous resources such as lookup tables, flip-flops, and multipliers, being instead error-oriented. In this work, we propose a resource-oriented HLS methodology with which designers can specify heterogeneous resource constraints and satisfy them while minimizing the output error, attaining average improvements, over error-oriented approaches, of about 34% and 2.2 dB for mean-squared error and peak signal-to-noise ratio error metrics, respectively.

Improving Scalability of Exact Modulo Scheduling with Specialized Conflict-Driven Learning

  • Steve Dai
  • Zhiru Zhang

Loop pipelining is an important optimization in high-level synthesis to enable high-throughput pipelined execution of loop iterations. However, current pipeline scheduling approaches rely on fundamentally inexact heuristics based on ad hoc priority functions and lack any guarantee of achieving the best throughput. To address this shortcoming, we propose a scheduling algorithm based on system of integer difference constraints (SDC) and Boolean satisfiability (SAT) to exactly handle various pipeline scheduling constraints. Our techniques take advantage of conflict-driven learning and problem-specific specialization to optimally yet efficiently derive pipelining solutions. Experiments demonstrate that our approach achieves notable speedup in comparison to integer linear programming based techniques.

LAcc: Exploiting Lookup Table-based Fast and Accurate Vector Multiplication in DRAM-based CNN Accelerator

  • Quan Deng
  • Youtao Zhang
  • Minxuan Zhang
  • Jun Yang

PIM (Processing-in-memory)-based CNN (Convolutional neural network) accelerators leverage the characteristics of basic memory cells to enable simple logic and arithmetic operations so that the bandwidth constraint can be effectively alleviated. However, it remains a major challenge to support multiplication operations efficiently on PIM accelerators, in particular, DRAM-based PIM accelerators. This has prevented PIM-based accelerators from being immediately adopted for accurate CNN inference.

In this paper, we propose LAcc, a DRAM-based PIM accelerator to support LUT- (lookup table) based fast and accurate multiplication. By enabling LUT-based vector multiplication in DRAM, LAcc effectively decreases LUT size and improves its reuse. LAcc further adopts a hybrid mapping of weights and inputs to improve the hardware utilization rate. LAcc achieves 95 FPS at 5.3 W for AlexNet and 6.3× efficiency improvement over the state-of-the-art.

DRIS-3: Deep Neural Network Reliability Improvement Scheme in 3D Die-Stacked Memory based on Fault Analysis

  • Jae-San Kim
  • Joon-Sung Yang

Various studies have been carried out to improve the operational efficiency of the Deep Neural Networks (DNNs). However, the importance of the reliability in DNNs has generally been overlooked. As the underlying semiconductor technology decreases in reliability, the probability that some components of computing devices fail also increases, preventing high accuracy in DNN operations. To achieve high accuracy, ensuring operational reliability, even if faults occur, is necessary.

In this paper, we introduce a DNN reliability improvement scheme in 3D die-stacked memory called DRIS-3, based on the correlation between the faults in weights and an accuracy loss. We analyze the fault characteristics of conventional DNN models to find the bits that cause significant accuracy loss when faults are injected into weights. On the basis of the findings, we propose a reliability improvement structure which can reduce faults on the bits that must be protected for accuracy, considering asymmetric soft error rate (SER) per layer in 3D die-stacked memory.

Experimental results show that with the proposed method, the fault tolerance is improved regardless of the type of model and the pruning applied. The fault tolerance based on bit error rate (BER) for a 1% accuracy loss is increased up to 10^4 times over the conventional model.
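
Why some weight bits matter far more than others can be seen with a single-bit fault injection. The sketch below assumes weights stored as IEEE 754 single-precision values, which is an assumption made here and not a statement about the formats analyzed in the paper. The helper name `flip_bit` is hypothetical.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit of a float32 (bit 0 = mantissa LSB, bit 31 = sign)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty


weight = 0.5
low_bit_fault = flip_bit(weight, 0)    # least significant mantissa bit
exp_msb_fault = flip_bit(weight, 30)   # most significant exponent bit
sign_fault = flip_bit(weight, 31)      # sign bit
```

A fault in the lowest mantissa bit changes the weight by less than one part in a million, whereas a fault in the top exponent bit turns 0.5 into a value above 10^38. Under this assumption, protecting only a few high-order bits already removes the most damaging faults.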

X-MANN: A Crossbar based Architecture for Memory Augmented Neural Networks

  • Ashish Ranjan
  • Shubham Jain
  • Jacob R. Stevens
  • Dipankar Das
  • Bharat Kaul
  • Anand Raghunathan

Memory Augmented Neural Networks (MANNs) enhance a deep neural network with an external differentiable memory, enabling them to perform complex tasks well beyond the capabilities of conventional deep neural networks. We identify a unique challenge that arises in MANNs due to soft reads and writes to the differentiable memory, each of which requires access to all the memory locations. This characteristic of MANN workloads severely limits the performance of MANNs on CPUs, GPUs, and classical neural network accelerators. We present the first effort to design a hardware architecture that improves the efficiency of MANNs. Leveraging the intrinsic ability of resistive crossbars to efficiently realize in-memory computations, we propose X-MANN, a memory-centric crossbar-based architecture that is specialized to match the compute characteristics observed in MANNs. We design a transposable crossbar processing unit that can efficiently perform the different computational kernels of MANNs. To improve performance of soft writes in X-MANN, we propose an incremental write mechanism that leverages the characteristics of soft write operations. We develop an architectural simulator for X-MANN that utilizes array-level timing and power models of resistive crossbars calibrated from SPICE simulations. Across a suite of MANN benchmarks, X-MANN achieves 23.7×-45.7× speedup and 75.1×-267.1× reduction in energy over state-of-the-art GPU implementations.

On-Chip Memory Technology Design Space Explorations for Mobile Deep Neural Network Accelerators

  • Haitong Li
  • Mudit Bhargava
  • Paul N. Whatmough
  • H.-S. Philip Wong

Deep neural network (DNN) inference tasks have become ubiquitous workloads on mobile SoCs and demand energy-efficient hardware accelerators. Mobile DNN accelerators are heavily area-constrained, with only minimal on-chip SRAM, which results in heavy use of inefficient off-chip DRAM. With diminishing returns from conventional silicon technology scaling, emerging memory technologies that offer better area density than SRAM can boost accelerator efficiency by minimizing costly off-chip DRAM accesses. This paper presents a detailed design space exploration (DSE) of technology-system co-design for systolic-array accelerators. We focus on practical/mature on-chip memory technologies, including SRAM, eDRAM, MRAM, and 3D vertical RRAM (VRRAM). The DSE employs state-of-the-art optimizations (e.g., model compression and optimized buffer scheduling), and evaluates results on important models including ResNet-50, MobileNet, and Faster-RCNN. Compared to an SRAM/DRAM baseline, MRAM-based accelerators show up to 4.68× energy benefits (57% area overhead), while a 3D VRRAM-based design achieves 2.22× energy benefits (33% area reduction).

SkippyNN: An Embedded Stochastic-Computing Accelerator for Convolutional Neural Networks

  • Reza Hojabr
  • Kamyar Givaki
  • S. M. Reza Tayaranian
  • Parsa Esfahanian
  • Ahmad Khonsari
  • Dara Rahmati
  • M. Hassan Najafi

Employing convolutional neural networks (CNNs) in embedded devices calls for novel low-cost and energy-efficient CNN accelerators. Stochastic computing (SC) is a promising low-cost alternative to conventional binary implementations of CNNs. Despite the low-cost advantage, SC-based arithmetic units suffer from prohibitive execution time due to processing long bit-streams. In particular, multiplication, the main operation in convolution computation, is an extremely time-consuming operation, which hampers employing SC methods in designing embedded CNNs.

In this work, we propose a novel architecture, called SkippyNN, that reduces the computation time of SC-based multiplications in the convolutional layers of CNNs. Each convolution in a CNN is composed of numerous multiplications where each input value is multiplied by a weight vector. Once the result of the first multiplication has been produced, the following multiplications can be performed by multiplying the input by the differences of the successive weights. Leveraging this property, we develop a differential Multiply-and-Accumulate unit, called DMAC, to reduce the time consumed by convolutions in SkippyNN. We evaluate the efficiency of SkippyNN using four modern CNNs. On average, SkippyNN offers 1.2x speedup and 2.7x energy saving compared to the binary implementation of CNN accelerators.
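
The differential property rests on the identity x·w_i = x·w_(i-1) + x·(w_i − w_(i-1)). The sketch below illustrates it in ordinary integer arithmetic. It does not model stochastic bit-streams or the DMAC hardware, and the function name `differential_products` is hypothetical.

```python
def differential_products(x, weights):
    """Compute x * w for each weight, reusing the previous product.

    Only the first step is a full multiplication (the previous weight is
    taken as 0); every later step multiplies x by a weight difference.
    """
    products = []
    prev_weight = 0
    prev_product = 0
    for w in weights:
        prev_product = prev_product + x * (w - prev_weight)
        prev_weight = w
        products.append(prev_product)
    return products


# Example: one input value multiplied by a small weight vector.
result = differential_products(3, [5, 7, 6, 6])
```

The result matches the direct products `[15, 21, 18, 18]`. When successive weights are close, the differences are small, and repeated weights need no new multiplication at all.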

ZARA: A Novel Zero-free Dataflow Accelerator for Generative Adversarial Networks in 3D ReRAM

  • Fan Chen
  • Linghao Song
  • Hai Helen Li
  • Yiran Chen

Generative Adversarial Networks (GANs) recently demonstrated a great opportunity toward unsupervised learning with the intention to mitigate the massive human efforts on data labeling in supervised learning algorithms. A GAN combines a generative model and a discriminative model that oppose each other in an adversarial situation to refine their abilities. Existing nonvolatile memory based machine learning accelerators, however, cannot support the computational needs of GAN training. Specifically, the generator utilizes a new operator, called transposed convolution, which introduces significant resource underutilization when executed on conventional neural network accelerators as it inserts massive zeros in its input before a convolution operation. In this work, we propose a novel computational deformation technique that synergistically optimizes the forward and backward functions in transposed convolution to eliminate the large resource underutilization. In addition, we present dedicated control units, namely a dataflow mapper and an operation scheduler, to support the proposed execution model with high parallelism and low energy consumption. ZARA is implemented with commodity ReRAM chips, and experimental results show that our design can improve GAN training performance by 1.6× to 23× on average over CMOS-based GAN accelerators. Compared to state-of-the-art ReRAM-based accelerator designs, ZARA also provides 1.15× to 2.1× performance improvement.
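
The zero-insertion overhead described in the abstract can be made concrete in one dimension. The sketch below is not ZARA's deformation technique. It only shows, with hypothetical function names, that a transposed convolution computed directly (scatter form) equals an ordinary convolution over a zero-inserted, zero-padded input, and that the second form spends most of its multiplications on zeros.

```python
def transposed_conv1d(x, kernel, stride):
    """Direct form: every input element scatters a scaled copy of the kernel."""
    out = [0] * ((len(x) - 1) * stride + len(kernel))
    for i, xv in enumerate(x):
        for j, kv in enumerate(kernel):
            out[i * stride + j] += xv * kv
    return out


def zero_insert(x, stride):
    """Place stride - 1 zeros between neighbouring input elements."""
    out = [0] * ((len(x) - 1) * stride + 1)
    out[::stride] = x
    return out


def conv_on_zero_inserted(x, kernel, stride):
    """Same result via a sliding window over the zero-inserted input."""
    pad = [0] * (len(kernel) - 1)
    padded = pad + zero_insert(x, stride) + pad
    flipped = kernel[::-1]
    n_out = len(padded) - len(kernel) + 1
    return [
        sum(padded[t + j] * flipped[j] for j in range(len(kernel)))
        for t in range(n_out)
    ]
```

For a 3-element input, a 3-tap kernel, and stride 2, the sliding-window form performs 21 multiplications against 9 in the direct form. The extra ones all involve inserted or padded zeros.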

X-DeepSCA: Cross-Device Deep Learning Side Channel Attack

  • Debayan Das
  • Anupam Golder
  • Josef Danial
  • Santosh Ghosh
  • Arijit Raychowdhury
  • Shreyas Sen

This article, for the first time, demonstrates Cross-device Deep Learning Side-Channel Attack (X-DeepSCA), achieving an accuracy of > 99.9%, even in presence of significantly higher inter-device variations compared to the inter-key variations. Augmenting traces captured from multiple devices for training and with proper choice of hyper-parameters, the proposed 256-class Deep Neural Network (DNN) learns accurately from the power side-channel leakage of an AES-128 target encryption engine, and an N-trace (N ≤ 10) X-DeepSCA attack breaks different target devices within seconds compared to a few minutes for a correlational power analysis (CPA) attack, thereby increasing the threat surface for embedded devices significantly. Even for low SNR scenarios, the proposed X-DeepSCA attack achieves ~ 10× lower minimum traces to disclosure (MTD) compared to a traditional CPA.

Attacking Split Manufacturing from a Deep Learning Perspective

  • Haocheng Li
  • Satwik Patnaik
  • Abhrajit Sengupta
  • Haoyu Yang
  • Johann Knechtel
  • Bei Yu
  • Evangeline F.Y. Young
  • Ozgur Sinanoglu

The notion of integrated circuit split manufacturing which delegates the front-end-of-line (FEOL) and back-end-of-line (BEOL) parts to different foundries, is to prevent overproduction, piracy of the intellectual property (IP), or targeted insertion of hardware Trojans by adversaries in the FEOL facility. In this work, we challenge the security promise of split manufacturing by formulating various layout-level placement and routing hints as vector- and image-based features. We construct a sophisticated deep neural network which can infer the missing BEOL connections with high accuracy. Compared with the publicly available network-flow attack [1], for the same set of ISCAS-85 benchmarks, we achieve 1.21× accuracy when splitting on M1 and 1.12× accuracy when splitting on M3 with less than 1% running time.

ALAFA: Automatic Leakage Assessment for Fault Attack Countermeasures

  • Sayandeep Saha
  • S. Nishok Kumar
  • Sikhar Patranabis
  • Debdeep Mukhopadhyay
  • Pallab Dasgupta

Assessment of the security provided by a fault attack countermeasure is challenging, given that a protected cipher may leak the key if the countermeasure is not designed correctly. This paper proposes, for the first time, a statistical framework to detect information leakage in fault attack countermeasures. Based on the concept of non-interference, we formalize the leakage for fault attacks and provide a t-test based methodology for leakage assessment. One major strength of the proposed framework is that leakage can be detected without the complete knowledge of the countermeasure algorithm, solely by observing the faulty ciphertext distributions. Experimental evaluation over a representative set of countermeasures establishes the efficacy of the proposed methodology.
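
A t-test on observed distributions, the statistical core named in the abstract, can be sketched with the standard library. This is a generic Welch's t-statistic, not the ALAFA framework. The two sample sets stand in for faulty-ciphertext observations collected under two conditions, the data below is synthetic, and the threshold of 4.5 is the value commonly used in side-channel leakage testing, taken here as an assumption.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


def leaks(a, b, threshold=4.5):
    """Flag leakage when the two distributions differ significantly."""
    return abs(welch_t(a, b)) > threshold


# Synthetic observations (illustrative only, not measured data).
baseline = [i % 16 for i in range(1600)]
same_dist = [(i * 7) % 16 for i in range(1600)]   # same values, other order
shifted = [(i % 16) + 1 for i in range(1600)]     # mean shifted by one
```

On this data, `leaks(baseline, same_dist)` is false because both samples have the same distribution, while `leaks(baseline, shifted)` is true because the shift in the mean pushes the statistic past the threshold.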

ChipSecure: A Reconfigurable Analog eFlash-Based PUF with Machine Learning Attack Resiliency in 55nm CMOS

  • Mohammad Mahmoodi
  • Hussein Nili
  • Shabnam Larimian
  • Xinjie Guo
  • Dmitri Strukov

We exploit randomness in static I-V characteristics and reconfigurability of embedded flash memories to design a very efficient physically unclonable function. Leakage current and subthreshold slope variations, nonlinearity, nondeterministic tuning error, and sneak path current in the redesigned commercial flash memory arrays are exploited to create a unique digital fingerprint. A time-multiplexed architecture is designed to enhance the security and expand the challenge-response pair space to 10^211. Experimental results demonstrate 50.3% average uniformity, 49.99% average diffuseness, and native <5% bit error rate. The analysis of the measured data also shows strong resilience against machine learning attacks and the possibility of extremely energy-efficient, 0.56 pJ/b operation.

Adversarial Attack against Modeling Attack on PUFs

  • Sying-Jyan Wang
  • Yu-Shen Chen
  • Katherine Shu-Min Li

The Physical Unclonable Function (PUF) has been proposed for the identification and authentication of devices and cryptographic key generation. A strong PUF provides an extremely large number of device-specific challenge-response pairs (CRP) which can be used for identification. Unfortunately, the CRP mechanism is vulnerable to modeling attacks, which use machine learning (ML) algorithms to predict PUF responses with high accuracy. Many methods have been developed to strengthen strong PUFs with complicated hardware; however, recent studies show that they are still vulnerable to attacks that leverage GPU-accelerated ML algorithms.

In this paper, we propose to deal with the problem from a different approach. With a slightly modified CRP mechanism, a PUF can provide poison data such that an accurate model of the PUF under attack cannot be built by ML algorithms. Experimental results show that the proposed method provides an effective countermeasure against modeling attacks on PUF. In addition, the proposed method is compatible with hardware strengthening schemes to provide even better protection for PUFs.

RFTC: Runtime Frequency Tuning Countermeasure Using FPGA Dynamic Reconfiguration to Mitigate Power Analysis Attacks

  • Darshana Jayasinghe
  • Aleksandar Ignjatovic
  • Sri Parameswaran

Random execution time-based countermeasures against power analysis attacks have reduced resource overheads when compared to balancing power dissipation and masking countermeasures. The previous countermeasures on randomization use either a small number of clock frequencies or delays to randomize the execution. This paper presents a novel random frequency countermeasure (referred to as RFTC) using the dynamic reconfiguration ability of clock managers of Field-Programmable Gate Arrays — FPGAs (such as Xilinx Mixed-Mode Clock Manager — MMCM) which can change the frequency of operation at runtime. We show for the first time how Advanced Encryption Standard (AES) block cipher algorithm can be executed using randomly selected clock frequencies (amongst thousands of frequencies carefully chosen) generated within the FPGA to mitigate power analysis attack vulnerabilities. To test the effectiveness of the proposed clock randomization, Correlation Power analysis (CPA) attacks are performed on the collected power traces. Preprocessing methods, such as Dynamic Time Warping (DTW), Principal Component Analysis (PCA) and Fast Fourier Transform (FFT), based power analysis attacks are performed on the collected traces to test the effective removal of random execution. Compared to the state of the art, where there were 83 distinct finishing times for each encryption, the method described in this paper can have more than 60,000 distinct finishing times for each encryption, making it resistant against power analysis attacks when preprocessed and demonstrated to be secure up to four million traces.

Design Guidelines of RRAM based Neural-Processing-Unit: A Joint Device-Circuit-Algorithm Analysis

  • Wenqiang Zhang
  • Xiaochen Peng
  • Huaqiang Wu
  • Bin Gao
  • Hu He
  • Youhui Zhang
  • Shimeng Yu
  • He Qian

The RRAM-based neural-processing-unit (NPU) is emerging for processing general-purpose machine intelligence algorithms with ultra-high energy efficiency, while the imperfections of the analog devices and cross-point arrays make the practical application more complicated. In order to improve accuracy and robustness of the NPU, device-circuit-algorithm codesign with consideration of underlying device and array characteristics should outperform the optimization of an individual device or algorithm. In this work, we provide a joint device-circuit-algorithm analysis and propose the corresponding design guidelines. Key innovations include: 1) An end-to-end simulator for RRAM NPU is developed with an integrated framework from device to algorithm. 2) The complete design of circuit and architecture for RRAM NPU is provided to make the analysis much closer to the real prototype. 3) A large-scale neural network as well as other general-purpose networks are processed for the study of device-circuit interaction. 4) Accuracy loss from non-idealities of RRAM, such as I-V nonlinearity, noise in analog resistance levels, voltage drop along interconnect, and ADC/DAC precision, is evaluated for the NPU design.

QURE: Qubit Re-allocation in Noisy Intermediate-Scale Quantum Computers

  • Abdullah Ash-Saki
  • Mahabubul Alam
  • Swaroop Ghosh

Concerted efforts by academia and industry (e.g., IBM, Google, and Intel) have brought us to the era of Noisy Intermediate-Scale Quantum (NISQ) computers. Qubits, the basic elements of a quantum computer, have proven extremely susceptible to various kinds of noise. Recent experiments have exhibited spatial variations among the qubits in NISQ hardware. Therefore, conventional qubit mapping done without quality awareness results in significant loss of fidelity for a given workload. In this paper, we analyze the effects of various noise sources on the overall fidelity of a given workload on real NISQ hardware. We also present a novel optimization technique, namely Qubit Re-allocation (QURE), to maximize the sequence fidelity of a given workload. QURE is scalable and can be applied to future large-scale quantum computers. QURE can improve the fidelity of a quantum workload by up to 1.54X (1.39X on average) in simulation and by up to 1.7X on a real device compared to variation-oblivious qubit allocation, without incurring any physical overhead.

Mapping Quantum Circuits to IBM QX Architectures Using the Minimal Number of SWAP and H Operations

  • Robert Wille
  • Lukas Burgholzer
  • Alwin Zulehner

The recent progress in the physical realization of quantum computers (the first publicly available ones, IBM’s QX architectures, were launched in 2017) has motivated research on automatic methods that aid users in running quantum circuits on them. Here, certain physical constraints given by the architectures, which restrict the allowed interactions of the involved qubits, have to be satisfied. Thus far, this has been addressed by inserting SWAP and H operations. However, it remains unknown whether existing methods add a minimum number of SWAP and H operations or, if not, how far they are away from that minimum, an NP-complete problem. In this work, we address this by formulating the mapping task as a symbolic optimization problem that is solved using reasoning engines like Boolean satisfiability solvers. In this way, we not only provide a method that maps quantum circuits to IBM’s QX architectures with a minimal number of SWAP and H operations, but also show by experimental evaluation that the number of operations added by IBM’s heuristic solution exceeds the lower bound by more than 100% on average. An implementation of the proposed methodology is publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping.

Computing Radial Basis Function Support Vector Machine using DNA via Fractional Coding

  • Xingyi Liu
  • Keshab K. Parhi

This paper describes a novel approach to synthesize molecular reactions to compute a radial basis function (RBF) support vector machine (SVM) kernel. The approach is based on fractional coding where a variable is represented by two molecules. The synergy between fractional coding in molecular computing and stochastic logic implementations in electronic computing is key to translating known stochastic logic circuits to molecular computing. Although inspired by prior stochastic logic implementation of the RBF-SVM kernel, the proposed molecular reactions require non-obvious modifications. This paper introduces a new explicit bipolar-to-unipolar molecular converter for intermediate format conversion. Two designs are presented; one is based on the explicit and the other is based on implicit conversion from prior stochastic logic. When 5 support vectors are used, it is shown that the DNA RBF-SVM realized using the explicit format conversion has orders of magnitude less regression error than that based on implicit conversion.

AlignS: A Processing-In-Memory Accelerator for DNA Short Read Alignment Leveraging SOT-MRAM

  • Shaahin Angizi
  • Jiao Sun
  • Wei Zhang
  • Deliang Fan

Classified as a complex big data analytics problem, DNA short read alignment is a major sequential bottleneck in processing the massive amounts of data generated by next-generation sequencing platforms. With von Neumann computing architectures struggling to address such computationally expensive and memory-intensive tasks today, Processing-in-Memory (PIM) platforms are attracting growing interest. In this paper, an energy-efficient and parallel PIM accelerator (AlignS) is proposed to execute DNA short read alignment based on an optimized and hardware-friendly alignment algorithm. We first develop the AlignS platform, which harnesses SOT-MRAM as computational memory and transforms it into a fundamental processing unit for short read alignment. Accordingly, we present a novel, customized, highly parallel read alignment algorithm that requires only the proposed simple and parallel in-memory operations (i.e., comparisons and additions). AlignS is then optimized through a new correlated data partitioning and mapping methodology that allows local storage and processing of DNA sequences to fully exploit algorithm-level parallelism and to accelerate both exact and inexact matches. The device-to-architecture co-simulation results show that AlignS improves the short read alignment throughput per Watt per mm² by ~12× compared to the ASIC accelerator. Compared to a recent FM-index-based ReRAM platform, AlignS achieves 1.6× higher throughput per Watt.

MiniControl: Synthesis of Continuous-Flow Microfluidics with Strictly Constrained Control Ports

  • Xing Huang
  • Tsung-Yi Ho
  • Wenzhong Guo
  • Bing Li
  • Ulf Schlichtmann

Recent advances in continuous-flow microfluidics have enabled highly integrated lab-on-a-chip biochips. These chips can execute complex biochemical applications precisely and efficiently within a tiny area, but they require a large number of control ports and the corresponding control logic to generate required pressure patterns for flow control, which, consequently, offset their advantages and prevent their wide adoption. In this paper, we propose the first synthesis flow called MiniControl, for continuous-flow microfluidic biochips (CFMBs) under strict constraints for control ports, incorporating high-level synthesis and physical design simultaneously, which has never been considered in previous work. With the maximum number of allowed control ports specified in advance, this synthesis flow generates a biochip architecture with high execution efficiency. Moreover, the overall cost of a CFMB can be reduced and the tradeoff between control logic and execution efficiency of biochemical applications can be evaluated for the first time. Experimental results demonstrate that MiniControl leads to high execution efficiency and low overall platform cost, while satisfying the given control port constraint strictly.

Faster Region-based Hotspot Detection

  • Ran Chen
  • Wei Zhong
  • Haoyu Yang
  • Hao Geng
  • Xuan Zeng
  • Bei Yu

As circuit feature sizes continue to shrink, hotspot detection has become a more challenging problem in modern DFM flows. Recently developed deep learning techniques have shown their advantages on hotspot detection tasks. However, existing hotspot detectors only accept small layout clips as input, with potential defects occurring at the center region of each clip, which is time-consuming and wastes substantial computational resources when dealing with large full-chip layouts. In this paper, we develop a new end-to-end framework that can detect multiple hotspots in a large region at a time and promises better hotspot detection performance. We design a joint auto-encoder and inception module for efficient feature extraction. A two-stage classification and regression flow is proposed to efficiently locate hotspot regions roughly and conduct final prediction with better accuracy and false alarm penalty. Experimental results show that our framework enables a significant speed improvement over existing methods with higher accuracy and fewer false alarms.

Efficient Layout Hotspot Detection via Binarized Residual Neural Network

  • Yiyang Jiang
  • Fan Yang
  • Hengliang Zhu
  • Bei Yu
  • Dian Zhou
  • Xuan Zeng

Layout hotspot detection is of great importance in the physical verification flow. Deep neural network models have been applied to hotspot detection and achieved great successes. The layouts can be viewed as binary images. The binarized neural network can thus be suitable for the hotspot detection problem. In this paper we propose a new deep learning architecture based on binarized neural networks (BNNs) to speed up the neural networks in hotspot detection. A new binarized residual neural network is carefully designed for hotspot detection. Experimental results on ICCAD 2012 Contest benchmarks show that our architecture outperforms all previous hotspot detectors in detecting accuracy and has an 8x speedup over the best deep learning-based solution.

DeePattern: Layout Pattern Generation with Transforming Convolutional Auto-Encoder

  • Haoyu Yang
  • Piyush Pathak
  • Frank Gennari
  • Ya-Chieh Lai
  • Bei Yu

VLSI layout patterns provide critical resources for various design-for-manufacturability research efforts, from early technology node development to back-end design and sign-off flows. However, a diverse layout pattern library is not always available due to the long logic-to-chip design cycle, which slows down the technology node development procedure. To address this issue, in this paper, we explore the capability of generative machine learning models to synthesize layout patterns. A transforming convolutional auto-encoder is developed to learn vector-based instantiations of squish pattern topologies. We show that our framework can capture simple design rules and contributes to enlarging the existing squish topology space under certain transformations. Geometry information of each squish topology is obtained from an associated linear system derived from design rule constraints. Experiments on 7nm EUV designs show that our framework can more effectively generate diverse pattern libraries with DRC-clean patterns compared to a state-of-the-art industrial layout pattern generator.

GAN-SRAF: Sub-Resolution Assist Feature Generation Using Conditional Generative Adversarial Networks

  • Mohamed Baker Alawieh
  • Yibo Lin
  • Zaiwei Zhang
  • Meng Li
  • Qixing Huang
  • David Z. Pan

As the integrated circuits (IC) technology continues to scale, resolution enhancement techniques (RETs) are mandatory to obtain high manufacturing quality and yield. Among various RETs, sub-resolution assist feature (SRAF) generation is a key technique to improve the target pattern quality and lithographic process window. While model-based SRAF insertion techniques have demonstrated high accuracy, they usually suffer from high computational cost. Therefore, more efficient techniques that can achieve high accuracy while reducing runtime are in strong demand. In this work, we leverage the recent advancement in machine learning for image generation to tackle the SRAF insertion problem. In particular, we propose a new SRAF insertion framework, GAN-SRAF, which uses conditional generative adversarial networks (CGANs) to generate SRAFs directly for any given layout. Our proposed approach incorporates a novel layout to image encoding using multi-channel heatmaps to preserve the layout information and facilitate layout reconstruction. Our experimental results demonstrate ~14.6× reduction in runtime when compared to the previous best machine learning approach for SRAF generation, and ~144× reduction compared to model-based approach, while achieving comparable quality of results.

Meta-Model based High-Dimensional Yield Analysis using Low-Rank Tensor Approximation

  • Xiao Shi
  • Hao Yan
  • Qiancun Huang
  • Jiajia Zhang
  • Longxing Shi
  • Lei He

The “curse of dimensionality” has become the major challenge for existing high-sigma yield analysis methods. In this paper, we develop a meta-model using Low-Rank Tensor Approximation (LRTA) to substitute for expensive SPICE simulation. The polynomial degree of our LRTA model grows linearly with circuit dimension. This makes it especially promising for high-dimensional circuit problems. Our LRTA meta-model is solved efficiently with a robust greedy algorithm and calibrated iteratively with an adaptive sampling method. Experiments on a bit cell and an SRAM column validate that the proposed LRTA method outperforms other state-of-the-art approaches in terms of accuracy and efficiency.

Novel Guiding Template and Mask Assignment for DSA-MP Hybrid Lithography Using Multiple BCP Materials

  • Yi-Ting Lin
  • Iris Hui-Ru Jiang

Directed self-assembly (DSA) is one of the leading candidates for extending the resolution of optical lithography to sub-7nm and beyond. By incorporating DSA in multiple patterning lithography (DSA-MP), the flexibility and resolution of contact/via patterning can be further enhanced by using multiple block copolymer (BCP) materials. Prior work faces a dilemma between solution quality and efficiency and is unable to handle 2D templates. In this paper, we capture the essence of template and mask assignment in DSA-MP by a new graph model and a new problem reduction. Our graph model explicitly represents spacing conflict edges and template hyperedges; thus, extra enumeration and manipulation of incompatible via grouping edges can be avoided, and arbitrary 1D/2D templates can be natively handled. We further reduce the assignment problem to exact cover, which is encoded by a sparse matrix. Our concise integer linear programming (ILP) formulation and fast backtracking heuristic achieve substantially superior solution quality and efficiency compared to the state-of-the-art work. Moreover, our method is flexible and extendible to utilize dummy vias to improve manufacturability.
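
As a rough illustration of the exact cover problem mentioned in the abstract above, the following sketch enumerates exact covers by backtracking. It is a generic textbook-style solver, not the paper's formulation or heuristic; the function name `exact_covers`, the toy universe, and the subset labels are all hypothetical. In the DSA-MP setting one might think of the elements as vias and the subsets as candidate template/mask assignments, but that mapping is an assumption made here for illustration only.

```python
from typing import Dict, Hashable, Iterator, List, Set


def exact_covers(universe: Set[Hashable],
                 subsets: Dict[str, Set[Hashable]]) -> Iterator[List[str]]:
    """Yield every choice of subset names that covers each element exactly once.

    Assumes every subset is contained in the universe.
    """
    if not universe:
        yield []  # nothing left to cover: one (empty) solution
        return
    # Branch on the element with the fewest candidate subsets.
    target = min(universe, key=lambda e: sum(e in s for s in subsets.values()))
    for name, members in subsets.items():
        if target not in members:
            continue
        # After choosing `name`, only subsets disjoint from it stay usable.
        remaining = {n: s for n, s in subsets.items() if not (s & members)}
        for rest in exact_covers(universe - members, remaining):
            yield [name] + rest


# Hypothetical toy instance: seven elements, six candidate subsets.
toy_universe = {1, 2, 3, 4, 5, 6, 7}
toy_subsets = {
    "A": {1, 4, 7},
    "B": {1, 4},
    "C": {4, 5, 7},
    "D": {3, 5, 6},
    "E": {2, 3, 6, 7},
    "F": {2, 7},
}
solutions = [sorted(s) for s in exact_covers(toy_universe, toy_subsets)]
```

In this toy instance the only exact cover is B, D, F. A sparse-matrix encoding, as the abstract mentions, would store the same element-to-subset incidence more compactly than the dictionaries used here.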

Actors Revisited for Time-Critical Systems

  • Marten Lohstroh
  • Martin Schoeberl
  • Andrés Goens
  • Armin Wasicek
  • Christopher Gill
  • Marjan Sirjani
  • Edward A. Lee

Programming time-critical systems is notoriously difficult. In this paper we propose an actor-oriented programming model with a semantic notion of time and a deterministic coordination semantics based on discrete events to exercise precise control over both the computational and timing aspects of the system behavior.

Time-Predictable Computing by Design: Looking Back, Looking Forward

  • Tulika Mitra

We present two contrasting approaches to achieve time predictability in the embedded compute engine, the basic building block of any Internet of Things (IoT) or Cyber-Physical (CPS) system. The traditional approach offers predictability on top of unpredictable processors with numerous optimizations for enhanced performance and programmability at the cost of huge variability in timing. Approaches such as Worst-Case Execution Time (WCET) analysis of software have been struggling to model the complex timing behavior of the underlying processor to provide guarantees. On the other hand, the inevitable slowdown of Moore’s Law and the end of Dennard scaling have curtailed the performance and energy scaling of the processors. This stagnation in conjunction with the importance of cognitive computing have motivated widespread adoption of non-von Neumann accelerators and architectures. We argue that these emerging architectures are inherently time-predictable as they depend on software to orchestrate the computation and data movement and are an excellent match for the real-time processing needs.

Consolidating High-Integrity, High-Performance, and Cyber-Security Functions on a Manycore Processor

  • Benoît Dupont de Dinechin

The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. This problem has been addressed by suitably designing the architecture, implementation, and programming models, of the Kalray MPPA (Multi-Purpose Processor Array) family of single-chip many-core processors. We introduce the third-generation MPPA processor, whose key features are motivated by the high-performance and high-integrity functions of automated vehicles. High-performance computing functions, represented by deep learning inference and by computer vision, need to execute under soft real-time constraints. High-integrity functions are developed under model-based design, and must meet hard real-time constraints. Finally, the third-generation MPPA processor integrates a hardware root of trust, and its security architecture is able to support a security kernel for implementing the trusted execution environment functions required by applications.

Efficient GPU NVRAM Persistence with Helper Warps

  • Sui Chen
  • Faen Zhang
  • Lei Liu
  • Lu Peng

Non-volatile Random-Access Memories (NVRAM) have emerged in recent years to bridge the performance gap between the main memory and external storage devices. To utilize the non-volatility of NVRAMs, programs should allow durable stores, meaning consistency must be maintained during a power loss event. GPUs are designed with high throughput, leveraging high degrees of parallelism. However, with lower NVRAM write bandwidths compared to that of DRAMs, using NVRAM as is may yield suboptimal overall system performance. To address this problem, we propose using Helper Warps to move persistence out of the critical path of transaction execution, alleviating the impact of latencies. Our mechanism achieves a speedup of 4.4 and 1.5 under bandwidth limits of 1.6 GB/s and 12 GB/s and is projected to maintain speed advantage even when NVRAM bandwidth gets as high as hundreds of GB/s in certain cases.

FlashGPU: Placing New Flash Next to GPU Cores

  • Jie Zhang
  • Miryeong Kwon
  • Hyojong Kim
  • Hyesoon Kim
  • Myoungsoo Jung

We propose FlashGPU, a new GPU architecture that tightly blends new flash (Z-NAND) with massive GPU cores. Specifically, we replace global memory with Z-NAND that exhibits ultra-low latency. We also architect a flash core to manage request dispatches and address translations underneath L2 cache banks of GPU cores. While Z-NAND is a hundred times faster than conventional 3D-stacked flash, its latency is still longer than DRAM. To address this shortcoming, we propose a dynamic page-placement and buffer manager in Z-NAND subsystems by being aware of bulk and parallel memory access characteristics of GPU applications, thereby offering high-throughput and low-energy consumption behaviors.

Performance-aware Wear Leveling for Block RAM in Nonvolatile FPGAs

  • Shuo Huai
  • Weining Song
  • Mengying Zhao
  • Xiaojun Cai
  • Zhiping Jia

Field programmable gate arrays (FPGAs) have been widely adopted in both high-performance servers and embedded systems. Since static random access memory (SRAM) has limited density and comparatively high leakage power, researchers have proposed FPGA architectures based on emerging non-volatile memories (NVMs) to satisfy the requirements of data-intensive and low-power applications. Block RAM is the on-chip memory of FPGAs; when it is implemented with NVM, it faces the challenge of limited endurance. Traditional wear leveling strategies cannot be directly applied to block RAM because they may induce large performance overhead. In this paper, we propose a performance-aware wear leveling scheme for block RAM in FPGAs to improve its lifetime. The placement strategy is improved by injecting wear leveling guidance. The evaluation shows that a 29.75% lifetime enhancement is achieved together with a 16.32% performance improvement, compared with traditional wear leveling.

ZUMA: Enabling Direct Insertion/Deletion Operations with Emerging Skyrmion Racetrack Memory

  • Zheng Liang
  • Guangyu Sun
  • Wang Kang
  • Xing Chen
  • Weisheng Zhao

Data insertion and deletion are common operations in various applications. However, traditional memory architectures can only perform indirect insertion/deletion with multiple data read and write operations, which consumes significant time and energy. To mitigate this problem, we propose to leverage the unique capability of emerging skyrmion racetrack memory technology to naturally support direct insertion/deletion operations inside a racetrack. In this work, we first present a circuit-level model for skyrmion racetrack memory. Then, we propose a novel memory architecture to enable efficient large-size data insertion/deletion. With the help of the model and the architecture, we study several potential applications that leverage the insertion and deletion operations. Experimental results demonstrate that the efficiency of these operations can be substantially improved.

ApproxLP: Approximate Multiplication with Linearization and Iterative Error Control

  • Mohsen Imani
  • Alice Sokolova
  • Ricardo Garcia
  • Andrew Huang
  • Fan Wu
  • Baris Aksanli
  • Tajana Rosing

In a data hungry world, approximate computing has emerged as one of the solutions to create higher energy efficiency and faster systems, while providing application tailored quality. In this paper, we propose ApproxLP, an Approximate Multiplier based on Linear Planes. We introduce an iterative method for approximating the product of two operands using fitted linear functions with two inputs, referred to as linear planes. The linearization of multiplication allows multiplication operations to be completely replaced with weighted addition. The proposed technique is used to find the significand of the product of two floating point numbers, decreasing the high energy cost of floating point arithmetic. Our method fully exploits the trade-off between accuracy and energy consumption by offering various degrees of approximation at different energy costs. As the level of approximation increases, the approximated product asymptotically approaches the exact product in an iterative manner. The performance of ApproxLP is evaluated over a range of multimedia and machine learning applications. A GPU enhanced by ApproxLP yields significant energy-delay product (EDP) improvement. For multimedia, neural network, and hyperdimensional computing applications, ApproxLP offers on average 2.4×, 2.7×, and 4.3× EDP improvement respectively with sufficient computational quality for the application. ApproxLP also provides up to 4.5× EDP improvement and has 2.3× lower chip area than other state-of-the-art approximate multipliers.
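
To make the idea of replacing multiplication with fitted linear planes concrete, here is a minimal sketch under simplifying assumptions. It is not ApproxLP's actual fitting procedure or hardware design: operands are assumed to lie in [0, 1), the domain is split into `splits` × `splits` equal sub-squares, and each sub-square uses the plane through its midpoint (mx, my), namely my·x + mx·y − mx·my. The residual of that plane is (x − mx)(y − my), so the worst-case error shrinks as the subdivision gets finer. The names `plane_multiply` and `max_error` are hypothetical.

```python
def plane_multiply(x: float, y: float, splits: int = 1) -> float:
    """Approximate x*y for x, y in [0, 1) with one linear plane per sub-square."""
    if not (0.0 <= x < 1.0 and 0.0 <= y < 1.0):
        raise ValueError("operands must lie in [0, 1)")
    # Midpoint of the sub-square that contains (x, y).
    mx = (int(x * splits) + 0.5) / splits
    my = (int(y * splits) + 0.5) / splits
    # Plane for this sub-square: a weighted sum of x and y plus a constant,
    # where the weights are fixed per region rather than data-dependent.
    return my * x + mx * y - mx * my


def max_error(splits: int, grid: int = 200) -> float:
    """Largest absolute error over a grid of sample points in [0, 1)^2."""
    pts = [(i + 0.5) / grid for i in range(grid)]
    return max(abs(plane_multiply(x, y, splits) - x * y)
               for x in pts for y in pts)
```

Under these assumptions the absolute error is bounded by (1 / (2·splits))², so each doubling of `splits` cuts the bound by a factor of four. This mirrors, in a simplified way, the accuracy/energy trade-off the abstract describes: more regions cost more selection logic but give a closer product.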

Cooperative Arithmetic-Aware Approximation Techniques for Energy-Efficient Multipliers

  • Vasileios Leon
  • Konstantinos Asimakopoulos
  • Sotirios Xydis
  • Dimitrios Soudris
  • Kiamal Pekmestzi

Approximate computing appears as an emerging and promising solution for energy-efficient system designs, exploiting the inherent error-tolerant nature of various applications. In this paper, targeting multiplication circuits, i.e., the energy-hungry counterpart of hardware accelerators, an extensive exploration of the error–energy trade-off when combining arithmetic-level approximation techniques is performed for the first time. Arithmetic-aware approximations deliver significant energy reductions, while allowing the error values to be controlled in a disciplined way by setting a configuration parameter accordingly. Inspired by the promising results of prior works with one configuration parameter, we propose 5 hybrid design families for approximate and energy-friendly hardware multipliers, each with two independent parameters to tune the approximation levels. Interestingly, the resolution of the state-of-the-art Pareto diagram is improved, giving the flexibility to achieve better energy gains for a specific error constraint imposed by the system. Moreover, we outperform prior works in the field of approximate multipliers by up to 60% energy reduction, and thus, we define the new Pareto front.

Approximate Integer and Floating-Point Dividers with Near-Zero Error Bias

  • Hassaan Saadat
  • Haris Javaid
  • Sri Parameswaran

We propose approximate dividers with near-zero error bias for both integer and floating-point numbers. The integer divider, INZeD, is designed using a novel, analytically deduced error-correction method in an approximate log based divider. The floating-point divider, FaNZeD, is based on a highly optimized mantissa divider that is inspired by INZeD. Both of the dividers are error configurable.

Our results show that the INZeD dividers have error bias in the range of 0.01-4.4% with area-delay product improvement of 25× – 95× and power improvement of 4.7× – 15× when compared to the accurate integer divider. Likewise, compared to IEEE single-precision floating-point divider, FaNZeD dividers offer up to 985× area-delay product and 77× power improvements with error bias in the range of 0.04-2.2%. Most importantly, using our FaNZeD dividers, floating-point arithmetic can be more resource-efficient than fixed-point arithmetic because most of the FaNZeD dividers are even smaller and have better area-delay product than the 8-bit and 16-bit accurate integer dividers. Finally, our dividers show negligible effect on the output quality when evaluated with AlexNet and JPEG compression applications.
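
The abstract does not spell out the underlying approximate log-based divider or the error-correction terms, so the sketch below only shows the classic Mitchell-style logarithmic division that such designs commonly start from; it is an assumption that this baseline resembles the one used, and the function name `mitchell_divide` is hypothetical. Each operand is written as 2^k · (1 + m) with m in [0, 1), and log2(1 + m) is approximated by m.

```python
def mitchell_divide(a: int, b: int) -> float:
    """Approximate a / b for positive integers using log2(1 + m) ~= m."""
    if a <= 0 or b <= 0:
        raise ValueError("operands must be positive")
    # Position of the leading one gives the integer part of log2.
    ka, kb = a.bit_length() - 1, b.bit_length() - 1
    # Fractional parts m in [0, 1), so that a = 2**ka * (1 + ma).
    ma = a / (1 << ka) - 1.0
    mb = b / (1 << kb) - 1.0
    diff = ma - mb
    if diff >= 0:
        # Approximate log of the quotient is (ka - kb) + diff.
        return 2.0 ** (ka - kb) * (1.0 + diff)
    # Borrow one from the integer part so the fraction stays in [0, 1).
    return 2.0 ** (ka - kb - 1) * (2.0 + diff)
```

This uncorrected baseline never underestimates the quotient, and its relative error can approach 12.5%, so the error is one-sided. That one-sidedness is the kind of error bias a correction scheme would aim to remove; the specific correction in INZeD and FaNZeD is not reproduced here.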

In-Stream Stochastic Division and Square Root via Correlation

  • Di Wu
  • Joshua San Miguel

Stochastic Computing (SC) is designed to minimize hardware area and power consumption compared to traditional binary-encoded computation, stemming from the bit-serial data representation and extremely straightforward logic. Though existing Stochastic Computing Units mostly assume uncorrelated bit streams, recent works find that correlation can be exploited for higher accuracy. We propose novel architectures for SC division and square root, which leverage correlation via low-cost in-stream mechanisms that eliminate expensive bit stream regeneration. We also introduce new metrics to better evaluate SC circuits relying on equilibrium via feedback loops. Experiments indicate that our division converges 46.3% faster with both 43.3% lower error and 45.6% less area.
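
As background for the role of correlation, the following sketch shows a well-known effect in unipolar stochastic computing: an AND gate computes a product when its input streams are uncorrelated, but computes the minimum when both streams are generated from the same random sequence. This is only an illustration of why correlation changes what a gate computes; it does not reproduce the paper's division or square-root architectures. All names (`encode`, `value`, `bitwise_and`) and the chosen probabilities are hypothetical.

```python
import random
from typing import List


def encode(p: float, randoms: List[float]) -> List[int]:
    """Unipolar bit stream: a bit is 1 when its random sample falls below p."""
    return [1 if r < p else 0 for r in randoms]


def value(stream: List[int]) -> float:
    """Decode a stream as the fraction of ones."""
    return sum(stream) / len(stream)


def bitwise_and(a: List[int], b: List[int]) -> List[int]:
    return [x & y for x, y in zip(a, b)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000
shared = [rng.random() for _ in range(n)]
other = [rng.random() for _ in range(n)]

pa, pb = 0.6, 0.3
# Independent random sources: AND approximates the product pa * pb.
uncorrelated = bitwise_and(encode(pa, shared), encode(pb, other))
# Same random source for both inputs: AND yields the stream for min(pa, pb).
correlated = bitwise_and(encode(pa, shared), encode(pb, shared))
```

With the shared source, the AND output is bit-for-bit identical to the stream that encodes min(pa, pb), because both comparisons use the same sample. With independent sources, the decoded value only approaches pa · pb as the stream length grows.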

MASKER: Adaptive Mobile Security Enhancement against Automatic Speech Recognition in Eavesdropping

  • Fuxun Yu
  • Zirui Xu
  • Chenchen Liu
  • Xiang Chen

Benefiting from recent advances in artificial intelligence, Automatic Speech Recognition (ASR) technology has achieved enormous performance improvement and wider application. Unfortunately, ASR is also heavily leveraged in speech eavesdropping, where it is used to translate large volumes of intercepted speech into text, causing considerable information leakage. In this work, we propose MASKER, a mobile security enhancement solution to protect mobile speech data from ASR in eavesdropping. By identifying ASR models’ ubiquitous vulnerability, MASKER is designed to inject human-imperceptible adversarial noise into real-time speech on the mobile device (e.g., phone calls and voice messages). Even if the speech data is exposed to eavesdropping during data transmission, the adversarial noise can effectively perturb the ASR process, causing a significant Word Error Rate (WER). Meanwhile, MASKER is further optimized for mobile user perception quality and enhanced for adaptation to environmental noise. Moreover, MASKER has outstanding computation efficiency for mobile system integration. Experiments show that MASKER can achieve security enhancement with an average WER of 84.55% for ASR perturbation, 32% noise reduction for user perception quality, and 16× faster processing speed compared to the state-of-the-art method.

Adversarial Attack on Microarchitectural Events based Malware Detectors

  • Sai Manoj Pudukotai Dinakarrao
  • Sairaj Amberkar
  • Sahil Bhat
  • Abhijitt Dhavlle
  • Hossein Sayadi
  • Avesta Sasan
  • Houman Homayoun
  • Setareh Rafatirad

To overcome the performance overheads incurred by traditional software-based malware detection techniques, Hardware-assisted Malware Detection (HMD) using machine learning (ML) classifiers has emerged as a panacea to detect malicious applications and secure systems. To classify benign and malicious applications, HMD primarily relies on low-level microarchitectural events captured through Hardware Performance Counters (HPCs). This work creates an adversarial attack on HMD systems that tampers with their security by introducing perturbations in the HPC traces with the aid of an adversarial sample generator application. To craft the attack, we first deploy an adversarial sample predictor to predict the adversarial HPC pattern that causes a given application to be misclassified by the ML classifier deployed in the HMD. Further, as the attacker has no direct access to manipulate the HPCs generated during runtime, we devise, based on the output of the adversarial sample predictor, an adversarial sample generator wrapped around a normal application to produce HPC patterns similar to the predicted adversarial HPC trace. As the crafted adversarial sample generator application does not contain any malicious operations, it is not detectable with traditional signature-based malware detection solutions. With the proposed attack, malware detection accuracy is reduced from 82.76% to 18.04%.

Fault Sneaking Attack: a Stealthy Framework for Misleading Deep Neural Networks

  • Pu Zhao
  • Siyue Wang
  • Cheng Gongye
  • Yanzhi Wang
  • Yunsi Fei
  • Xue Lin

Despite the great achievements of deep neural networks (DNNs), the vulnerability of state-of-the-art DNNs raises security concerns of DNNs in many application domains requiring high reliability. We propose the fault sneaking attack on DNNs, where the adversary aims to misclassify certain input images into any target labels by modifying the DNN parameters. We apply ADMM (alternating direction method of multipliers) for solving the optimization problem of the fault sneaking attack with two constraints: 1) the classification of the other images should be unchanged and 2) the parameter modifications should be minimized. Specifically, the first constraint requires us not only to inject designated faults (misclassifications), but also to hide the faults for stealthy or sneaking considerations by maintaining model accuracy. The second constraint requires us to minimize the parameter modifications (using ℓ0 norm to measure the number of modifications and ℓ2 norm to measure the magnitude of modifications). Comprehensive experimental evaluation demonstrates that the proposed framework can inject multiple sneaking faults without losing the overall test accuracy performance.

PREEMPT: PReempting Malware by Examining Embedded Processor Traces

  • Kanad Basu
  • Rana Elnaggar
  • Krishnendu Chakrabarty
  • Ramesh Karri

Anti-virus software (AVS) tools are used to detect Malware in a system. However, software-based AVS are vulnerable to attacks. A malicious entity can exploit these vulnerabilities to subvert the AVS. Recently, hardware components such as Hardware Performance Counters (HPC) have been used for Malware detection. In this paper, we propose PREEMPT, a zero overhead, high-accuracy and low-latency technique to detect Malware by re-purposing the embedded trace buffer (ETB), a debug hardware component available in most modern processors. The ETB is used for post-silicon validation and debug and allows us to control and monitor the internal activities of a chip, beyond what is provided by the Input/Output pins. PREEMPT combines these hardware-level observations with machine learning-based classifiers to preempt Malware before it can cause damage. There are many benefits of re-using the ETB for Malware detection. It is difficult to hack into hardware compared to software, and hence, PREEMPT is more robust against attacks than AVS. PREEMPT does not incur performance penalties. Finally, PREEMPT has a high True Positive value of 94% and maintains a low False Positive value of 2%.

Workload-Aware Harmonic Partitioned Scheduling of Periodic Real-Time Tasks with Constrained Deadlines

  • Jiankang Ren
  • Xiaoyan Su
  • Guoqi Xie
  • Chao Yu
  • Guozhen Tan
  • Guowei Wu

Multiprocessor platforms have been widely applied in safety-critical domains to accommodate the increasing computation requirement of modern real-time applications. In this paper, we present a workload-aware harmonic partitioned multiprocessor scheduling scheme for periodic real-time tasks with constrained deadlines under the fixed-priority preemptive scheduling policy. In particular, two grouping metrics effectively integrating both harmonicity and workload characteristic are designed to guide our task partition. With those metrics, our scheme can greatly improve system utilization by taking advantage of the combination of harmonic relationship exploration and workload awareness. Experiments show that our proposed scheme significantly outperforms existing approaches in terms of schedulability.
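
The abstract does not define its two grouping metrics, so the sketch below only illustrates the two ingredients it names, harmonicity and workload, in their simplest textbook form. A set of periods is treated as harmonic when every period divides every longer one, and workload is measured as total utilization (execution time divided by period). The helper names `is_harmonic` and `utilization` and the example task values are hypothetical, and the sketch ignores the constrained deadlines that the paper addresses.

```python
from typing import Iterable, List, Tuple


def is_harmonic(periods: Iterable[int]) -> bool:
    """True if every period divides each longer period in the set."""
    ordered = sorted(periods)
    # Divisibility is transitive, so checking neighbours in sorted order suffices.
    return all(longer % shorter == 0
               for shorter, longer in zip(ordered, ordered[1:]))


def utilization(tasks: List[Tuple[int, int]]) -> float:
    """Total utilization of (execution_time, period) pairs."""
    return sum(c / t for c, t in tasks)


# Hypothetical task group: (execution time, period) in the same time unit.
group = [(1, 4), (2, 8), (4, 16)]
group_is_harmonic = is_harmonic(t for _, t in group)
group_utilization = utilization(group)
```

For implicit-deadline tasks under rate-monotonic priorities, a harmonic group like this one is known to be schedulable on a single processor as long as its utilization does not exceed 1, which is one reason partitioning schemes try to keep harmonic tasks together.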

Holistic multi-resource allocation for multicore real-time virtualization

  • Meng Xu
  • Robert Gifford
  • Linh Thi Xuan Phan

This paper presents vC2M, a holistic multi-resource allocation framework for real-time multicore virtualization. vC2M integrates shared cache allocation with memory bandwidth regulation to mitigate interferences among concurrent tasks, thus providing better timing isolation among tasks and VMs. It reduces the abstraction overhead through task and VCPU release synchronization and through VCPU execution regulation, and it further introduces novel resource allocation algorithms that consider CPU, cache, and memory bandwidth altogether to optimize resources. Evaluations on our prototype show that vC2M can be implemented with minimal overhead, and that it substantially improves schedulability over existing solutions.

Runtime Resource Management with Workload Prediction

  • Mina Niknafs
  • Ivan Ukhov
  • Petru Eles
  • Zebo Peng

Modern embedded platforms need sophisticated resource managers in order to utilize the heterogeneous computational resources efficiently. Moreover, such platforms are exposed to fluctuating workloads unpredictable at design time. In such a context, predicting the incoming workload might improve the efficiency of resource management. But is this true? And, if yes, how significant is this improvement? How accurate does the prediction need to be in order to improve decisions instead of doing harm? By proposing a prediction-based resource manager aimed at minimizing energy consumption while meeting task deadlines and by running extensive experiments, we try to answer the above questions.

Code Mapping in Heterogeneous Platforms Using Deep Learning and LLVM-IR

  • Francesco Barchi
  • Gianvito Urgese
  • Enrico Macii
  • Andrea Acquaviva

Modern heterogeneous platforms require compilers capable of choosing the appropriate device for the execution of program portions. This paper presents a machine learning method that supports mapping decisions by analysing program source code represented in LLVM assembly language (IR), exploiting the advantages offered by this generalised and optimised representation. To evaluate our solution, we trained an LSTM neural network on OpenCL kernels compiled to LLVM-IR and processed with our tokenizer, which filters out less-informative tokens. In our tests, the network reaches an accuracy of 85% in identifying the best computational unit.

REAP: Runtime Energy-Accuracy Optimization for Energy Harvesting IoT Devices

  • Ganapati Bhat
  • Kunal Bagewadi
  • Hyung Gyu Lee
  • Umit Y. Ogras

The use of wearable and mobile devices for health and activity monitoring is growing rapidly. These devices need to maximize their accuracy and active time under a tight energy budget imposed by battery and form-factor constraints. This paper considers energy harvesting devices that run on a limited energy budget to recognize user activities over a given period. We propose a technique to co-optimize the accuracy and active time by utilizing multiple design points with different energy-accuracy trade-offs. The proposed technique switches between these design points at runtime to maximize a generalized objective function under tight harvested energy budget constraints. We evaluate our approach experimentally using a custom hardware prototype and 14 user studies. It achieves 46% higher expected accuracy and 66% longer active time compared to the highest performance design point.

Tumbler: Energy Efficient Task Scheduling for Dual-Channel Solar-Powered Sensor Nodes

  • Yue Xu
  • Hyung Gyu Lee
  • Yujuan Tan
  • Yu Wu
  • Xianzhang Chen
  • Liang Liang
  • Lei Qiao
  • Duo Liu

Energy harvesting technology has been widely adopted in embedded systems. However, an unstable energy source results in unsteady operation. In this paper, we devise a long-term energy-efficient task scheduling method for solar-powered sensor nodes. The proposed method combines reinforcement learning with a solar energy prediction method to maximize energy efficiency, which in turn enhances the long-term quality of service (QoS) of the sensor nodes. Experimental results show that the proposed scheduling improves energy efficiency by 6.0% on average and achieves a 54.0% better QoS level, compared with a state-of-the-art task scheduling algorithm.

GreenTPU: Improving Timing Error Resilience of a Near-Threshold Tensor Processing Unit

  • Pramesh Pandey
  • Prabal Basu
  • Koushik Chakraborty
  • Sanghamitra Roy

The emergence of hardware accelerators has brought about several orders of magnitude improvement in the speed of deep neural-network (DNN) inference. Among such DNN accelerators, the Google Tensor Processing Unit (TPU) has emerged as the best-in-class, offering more than 15× speedup over contemporary GPUs. However, the rapid growth in DNN workloads is escalating the energy consumption of TPU-based data centers. In order to restrict the energy consumption of TPUs, we propose GreenTPU, a low-power near-threshold computing (NTC) TPU design paradigm. To ensure a high inference accuracy at low-voltage operation, GreenTPU identifies the patterns in the error-causing activation sequences in the systolic array, and prevents further timing errors from the same sequence by intermittently boosting the operating voltage of the specific multiplier-and-accumulator units in the TPU. Compared to a cutting-edge timing error mitigation technique for TPUs, GreenTPU enables 2X–3X higher performance in an NTC TPU, with a minimal loss in prediction accuracy.

Thermal-Aware Design and Management for Search-based In-Memory Acceleration

  • Minxuan Zhou
  • Mohsen Imani
  • Saransh Gupta
  • Tajana Rosing

Recently, Processing-In-Memory (PIM) techniques exploiting resistive RAM (ReRAM) have been used to accelerate various big data applications. ReRAM-based in-memory search is a powerful operation which efficiently finds required data in a large data set. However, such operations result in a large amount of current which may create serious thermal issues, especially in state-of-the-art 3D stacking chips. Therefore, designing PIM accelerators based on in-memory searches requires a careful consideration of temperature. In this work, we propose static and dynamic techniques to optimize the thermal behavior of PIM architectures running intensive in-memory search operations. Our experiments show the proposed design significantly reduces the peak chip temperature and dynamic management overhead. We test our proposed design in two important categories of applications which benefit from the search-based PIM acceleration – hyper-dimensional computing and database query. Validated experiments show that the proposed method can reduce the steady-state temperature by at least 15.3 °C which extends the lifetime of the ReRAM device by 57.2% on average. Furthermore, the proposed fine-grained dynamic thermal management provides 17.6% performance improvement over state-of-the-art methods.

Building Robust Machine Learning Systems: Current Progress, Research Challenges, and Opportunities

  • Jeff Jun Zhang
  • Kang Liu
  • Faiq Khalid
  • Muhammad Abdullah Hanif
  • Semeen Rehman
  • Theocharis Theocharides
  • Alessandro Artussi
  • Muhammad Shafique
  • Siddharth Garg

Machine learning, in particular deep learning, is being used in almost all aspects of life to assist humans, specifically in mobile and Internet of Things (IoT)-based applications. Due to its state-of-the-art performance, deep learning is also being employed in safety-critical applications, for instance, autonomous vehicles. Reliability and security are two of the key required characteristics for these applications because of the impact they can have on human life. To this end, in this paper, we highlight the current progress, challenges and research opportunities in the domain of robust systems for machine learning-based applications.

Adversarial Machine Learning Beyond the Image Domain

  • Giulio Zizzo
  • Chris Hankin
  • Sergio Maffeis
  • Kevin Jones

Machine learning systems have had enormous success in a wide range of fields, including computer vision, natural language processing, and anomaly detection. However, such systems are vulnerable to attackers who can cause deliberate misclassification by introducing small perturbations. With machine learning systems being proposed for cyber attack detection, such attackers are cause for serious concern. Despite this, the vast majority of adversarial machine learning security research is focused on the image domain. This work gives a brief overview of adversarial machine learning and machine learning used in cyber attack detection, and suggests key differences between the traditional image domain of adversarial machine learning and the cyber domain. Finally, we show an adversarial machine learning attack on an industrial control system.

Memory-Bound Proof-of-Work Acceleration for Blockchain Applications

  • Kun Wu
  • Guohao Dai
  • Xing Hu
  • Shuangchen Li
  • Xinfeng Xie
  • Yu Wang
  • Yuan Xie

Blockchain applications have shown huge potential in various domains. Proof of Work (PoW) is the key procedure in blockchain applications; it is memory-bound, which hinders the performance improvement of blockchain accelerators. In order to mitigate the “memory wall” and improve the performance of memory-hard PoW accelerators, using Ethash as an example, we optimize the memory architecture from two perspectives: 1) Hiding memory latency. We propose a specialized context switch design to overcome the uncertain cycles of repetitive memory requests. 2) Increasing memory bandwidth utilization. We introduce on-chip memory that stores a portion of the Ethash directed acyclic graph (DAG) for larger effective memory bandwidth, and further propose adopting embedded NOR flash to fulfill the role. Then, we conduct extensive experiments to explore the design space of our optimized memory architecture for Ethash, including the number of hash cores and on-chip/off-chip memory technologies and specifications. Based on the design space exploration, we finally provide guidance for designing memory-bound PoW accelerators. The experimental results show that our optimized designs achieve 8.7% to 55% higher hash rate and 17% to 120% higher hash rate per Joule compared with the baseline design in different configurations.

Architecture, Chip, and Package Co-design Flow for 2.5D IC Design Enabling Heterogeneous IP Reuse

  • Jinwoo Kim
  • Gauthaman Murali
  • Heechun Park
  • Eric Qin
  • Hyoukjun Kwon
  • Venkata Chaitanya Krishna Chekuri
  • Nihar Dasari
  • Arvind Singh
  • Minah Lee
  • Hakki Mert Torun
  • Kallol Roy
  • Madhavan Swaminathan
  • Saibal Mukhopadhyay
  • Tushar Krishna
  • Sung Kyu Lim

A new trend in complex SoC design is chiplet-based IP reuse using 2.5D integration. In this paper we present a highly-integrated design flow that encompasses architecture, circuit, and package to build and simulate heterogeneous 2.5D designs. We chipletize each IP by adding logical protocol translators and physical interface modules. These chiplets are then placed and routed on a silicon interposer. Our package models are then used to calculate PPA and signal/power integrity of the overall system. Our design space exploration study using our tool flow shows that 2.5D integration incurs 2.1x PPA overhead compared with its 2D SoC counterpart.

LifeGuard: A Reinforcement Learning-Based Task Mapping Strategy for Performance-Centric Aging Management

  • Vijeta Rathore
  • Vivek Chaturvedi
  • Amit K. Singh
  • Thambipillai Srikanthan
  • Muhammad Shafique

Device scaling to the sub-deca-nanometer regime has made device aging a primary design concern. In manycore systems, inevitable process variation further adds to delay degradation and, coupled with the scalability issues in manycores, makes aging management, while meeting performance demands, a complex problem. LifeGuard is a performance-centric reinforcement learning-based task mapping strategy that leverages the different impact of applications on aging for improving system health. Experimental results, comparing LifeGuard with two state-of-the-art aging optimizing techniques, on a 256-core system, showed that LifeGuard led to improved health for, respectively, 57% and 74% of the cores, and also an enhanced aggregate core frequency.

Accurate Estimation of Program Error Rate for Timing-Speculative Processors

  • Omid Assare
  • Rajesh Gupta

We propose a framework that estimates the error rate experienced by an application as it runs on a timing-speculative processor. The framework uses an instruction error model that is comparable in accuracy to low-level simulations—as it considers the effects of operand values, preceding instructions, datapath configuration, and error correction scheme, as well as process variation, including its spatial correlation property—and yet efficient enough to allow its application in Monte Carlo experiments to characterize large program input datasets. We then use statistical limit theorems to estimate program error rate and quantify the effect of inter-instruction correlations.

Fast Performance Estimation and Design Space Exploration of Manycore-based Neural Processors

  • Jintaek Kang
  • Dowhan Jung
  • Kwanghyun Chung
  • Soonhoi Ha

In the design of a neural processor, a cycle-accurate simulator is usually built to estimate the performance before hardware implementation. Since using the simulator to perform design space exploration (DSE) of hardware architecture is quite time consuming, we propose a novel method to use a high-level analytical model for fast DSE. In the model, non-deterministic execution delay is modeled with some parameters whose contribution to the performance is estimated statically by simulation. The viability of the proposed methodology is confirmed with two neural processors with different manycore architectures, achieving 2000 times speed-up within 3% accuracy error, compared with simulator-based DSE.

E-LSTM: Efficient Inference of Sparse LSTM on Embedded Heterogeneous System

  • Runbin Shi
  • Junjie Liu
  • Hayden K.-H. So
  • Shuo Wang
  • Yun Liang

Various models with Long Short-Term Memory (LSTM) networks have demonstrated state-of-the-art performance in sequential information processing. Previous LSTM-specific architectures use large on-chip memory for weight storage to alleviate the memory-bound issue and facilitate LSTM inference in cloud computing. In this paper, E-LSTM is proposed for embedded scenarios, taking chip area and limited data-access bandwidth into consideration. The heterogeneous hardware in E-LSTM tightly couples an LSTM co-processor with an embedded RISC-V CPU. The eSELL format is developed to represent the sparse weight matrix. With the proposed cell fusion optimization based on the inherent sparsity in computation, E-LSTM achieves up to 2.2× speedup of processing throughput.

ReForm: Static and Dynamic Resource-Aware DNN Reconfiguration Framework for Mobile Device

  • Zirui Xu
  • Fuxun Yu
  • Chenchen Liu
  • Xiang Chen

Although the Deep Neural Network (DNN) technique has been widely applied in various applications, DNN-based applications are still too computationally intensive for resource-constrained mobile devices. Many works have been proposed to optimize DNN computation performance, but most of them are limited to an algorithmic perspective, ignoring certain computing issues in practical deployment. To achieve comprehensive DNN performance enhancement in practice, DNN optimization should closely cooperate with specific hardware and system constraints (i.e. computation capacity, energy cost, memory occupancy, and inference latency). Therefore, in this work, we propose ReForm, a resource-aware DNN optimization framework. Through thorough mobile DNN computing analysis and innovative model reconfiguration schemes (i.e. ADMM based static model fine-tuning, dynamically selective computing), ReForm can efficiently and effectively reconfigure a pre-trained DNN model for practical mobile deployment with regard to various static and dynamic computation resource constraints. Experiments show that ReForm has ~3.5× faster optimization speed than a state-of-the-art resource-aware optimization method. Also, ReForm can effectively reconfigure a DNN model for different mobile devices with distinct resource constraints. Moreover, ReForm achieves satisfactory computation cost reduction with negligible accuracy drop in both static and dynamic computing scenarios (at most 18% workload, 16.23% latency, 48.63% memory, and 21.5% energy enhancement).

XBioSiP: A Methodology for Approximate Bio-Signal Processing at the Edge

  • Bharath Srinivas Prabakaran
  • Semeen Rehman
  • Muhammad Shafique

Bio-signals exhibit high redundancy, and the algorithms for their processing are inherently error resilient. This property can be leveraged to improve the energy-efficiency of IoT-Edge (wearables) through the emerging trend of approximate computing. This paper presents XBioSiP, a novel methodology for approximate bio-signal processing that employs two quality evaluation stages, during the pre-processing and bio-signal processing stages, to determine the approximation parameters. It thereby achieves high energy savings while satisfying the user-determined quality constraint. Our methodology achieves up to 19× and 22× reduction in the energy consumption of a QRS peak detection algorithm for 0% and < 1% loss in peak detection accuracy, respectively.

RevSCA: Using Reverse Engineering to Bring Light into Backward Rewriting for Big and Dirty Multipliers

  • Alireza Mahzoon
  • Daniel Große
  • Rolf Drechsler

In recent years, formal methods based on Symbolic Computer Algebra (SCA) have shown very good results in verification of integer multipliers. The success is based on removing redundant terms (vanishing monomials) early, which avoids the explosion in the number of monomials during backward rewriting. However, the SCA approaches still suffer from two major problems: (1) high dependence on the detection of Half Adders (HAs) realized as AND-XOR gates in the multiplier netlist, and (2) extremely large search space for finding the source of the vanishing monomials. As a consequence, if the multiplier consists of dirty logic, for instance because of non-standard libraries or logic optimization, the existing SCA methods are completely blind to the resulting polynomials, and their techniques for effective division fail.

In this paper, we present RevSCA. RevSCA brings back light into backward rewriting by identifying the atomic blocks of the arithmetic circuits using dedicated reverse engineering techniques. Our approach takes advantage of these atomic blocks to detect all sources of vanishing monomials independent of the design architecture. Furthermore, it cuts the local vanishing removal time drastically due to limiting the search space to a small part of the design only. Experimental results confirm the efficiency of our approach in verification of a wide variety of integer multipliers with up to 1024 output bits.

Temporal Tracing of On-Chip Signals using Timeprints

  • Rehab Massoud
  • Hoang M. Le
  • Peter Chini
  • Prakash Saivasan
  • Roland Meyer
  • Rolf Drechsler

This paper introduces a new method to trace cycle-accurately the temporal behavior of on-chip signals while operating in-field. Current cycle-accurate schemes incur unacceptable amounts of data for logging, storage and processing.

Our key idea to enable efficient yet cycle-accurate tracing is to bring timing to the front as a main traced artifact. We split the signal tracing into consecutive (back-to-back) finite trace-cycles. Within a trace-cycle, a signal’s value-change instance gets assigned an encoded timestamp. At the end of each trace-cycle, these encoded timestamps are aggregated into a logged timeprint, which summarizes the temporal behavior over the trace-cycle.

To retrieve the accurate timing, we reconstruct the exact instances from a timeprint via a SAT query. The experiments demonstrate how unprecedented lightweight tracing can be applied, and how timeprints enable the verification of cycle-accurate properties and the detection of sporadic temperature effects.

ACCESS: HW/SW Co-Equivalence Checking for Firmware Optimization

  • Michael Schwarz
  • Raphael Stahl
  • Daniel Müller-Gritschneder
  • Ulf Schlichtmann
  • Dominik Stoffel
  • Wolfgang Kunz

Customizing embedded computing platforms to specific application domains often necessitates optimizing the firmware and/or the HW/SW interface under tight resource constraints. Such optimizations frequently alter the communication between the firmware and the peripheral devices, possibly compromising functional correctness of the input/output behavior of the embedded system. This paper proposes a formal HW/SW co-equivalence checking technique for verifying correct I/O behavior of peripherals under a modified firmware. We demonstrate the great promise of our approach on RTL implementations of several open-source peripherals. In our experiments we successfully prove or disprove correctness of firmware optimizations for an industrial driver software. In addition, we also found a subtle bug in one of the peripherals and several undocumented preconditions for correct device behavior.

Early Concolic Testing of Embedded Binaries with Virtual Prototypes: A RISC-V Case Study

  • Vladimir Herdt
  • Daniel Große
  • Hoang M. Le
  • Rolf Drechsler

Extensive testing of IoT SW is very important to prevent errors and security vulnerabilities. In the SW domain the automated concolic testing technique has been shown to be very effective.

In this paper we propose an approach for concolic testing of binaries targeting RISC-V systems with peripherals. Our approach works by integrating the Concolic Testing Engine (CTE) with the architecture specific Instruction Set Simulator (ISS) inside of a Virtual Prototype (VP). We provide a designated CTE-interface to integrate (SystemC-based) peripherals into the concolic testing by means of SW models. This combination enables a high simulation performance at binary level with comparatively little effort to integrate peripherals with concolic execution capabilities. Our approach has been effective in finding several buffer overflow related security vulnerabilities in the FreeRTOS TCP/IP stack.

Tetris: A Streaming Accelerator for Physics-Limited 3D Plane-Wave Ultrasound Imaging

  • Brendan L. West
  • Jian Zhou
  • Ronald G. Dreslinski
  • J. Brian Fowlkes
  • Oliver Kripfgans
  • Chaitali Chakrabarti
  • Thomas F. Wenisch

High volume acquisition rates are imperative for medical ultrasound imaging applications, such as 3D elastography and 3D vector flow imaging. Unfortunately, despite recent algorithmic improvements, high-volume-rate imaging remains computationally infeasible on known platforms.

In this paper, we propose Tetris, a novel hardware accelerator for ultrasound beamforming that enables volume acquisition rates up to the physics limits of acoustic propagation delay. Through algorithmic and hardware optimizations, we enable a streaming system design outclassing previously proposed accelerators in performance while lowering hardware complexity and storage requirements. For a representative imaging task, our proposed system generates physics-limited 13,020 volumes per second in a 2.5W power budget.

ProbLP: A framework for low-precision probabilistic inference

  • Nimish Shah
  • Laura I. Galindez Olascoaga
  • Wannes Meert
  • Marian Verhelst

Bayesian reasoning is a powerful mechanism for probabilistic inference in smart edge-devices. During such inferences, a low-precision arithmetic representation can enable improved energy efficiency. However, its impact on inference accuracy is not yet understood. Furthermore, general-purpose hardware does not natively support low-precision representation. To address this, we propose ProbLP, a framework that automates the analysis and design of low-precision probabilistic inference hardware. It automatically chooses an appropriate energy-efficient representation based on worst-case error-bounds and hardware energy-models. It generates custom hardware for the resulting inference network exploiting parallelism, pipelining and low-precision operation. The framework is validated on several embedded-sensing benchmarks.

An Optimized Design Technique of Low-bit Neural Network Training for Personalization on IoT Devices

  • Seungkyu Choi
  • Jaekang Shin
  • Yeongjae Choi
  • Lee-Sup Kim

Personalization by incremental learning has become essential for IoT devices to enhance the performance of the deep learning models trained with global datasets. To avoid massive transmission traffic in the network, exploiting on-device learning is necessary. We propose a software/hardware co-design technique that builds an energy-efficient low-bit trainable system: (1) software optimizations by local low-bit quantization and computation freezing to minimize the on-chip storage requirement and computational complexity, (2) hardware design of a bit-flexible multiply-and-accumulate (MAC) array sharing the same resources in inference and training. Our scheme saves 99.2% on on-chip buffer storage and achieves 12.8x higher peak energy efficiency compared to previous trainable accelerators.

L-MPC: A LUT based Multi-Level Prediction-Correction Architecture for Accelerating Binary-Weight Hourglass Network

  • Hong Liu
  • Leibo Liu
  • Wenping Zhu
  • Qiang Li
  • Huiyu Mo
  • Shaojun Wei

A binary-weight hourglass network (B-HG) accelerator for landmark detection, built on the proposed look-up-table (LUT) based multi-level prediction-correction approach, is presented for high-speed and energy-efficient processing on IoT edge devices. First, a LUT with a unified mode is adopted to support convolutional neural networks with fully variable weight bit precision and to minimize the operations of B-HG, which achieves 1.33×-1.50× speedup on multi-bit weight CNNs relative to a similar solution. Second, a multi-level prediction-correction model is proposed to achieve computationally efficient convolution with adaptive precision. It increases the operations saved by about 30% over the two-stage model. In addition, nearly 77.4% of the operations in B-HG can be saved by combining these two methods, yielding a 2.3× inference speedup. Third, a block-computing-based pipeline is designed to improve the residual block deficiency in B-HG. It not only reduces off-chip memory access by about 66.2% relative to the baseline, but also saves 60% of on-chip memory space and 31% of on-chip memory access compared to a similar fused-layer accelerator. The proposed B-HG accelerator achieves 450 fps at 500 MHz based on simulation in a TSMC 28 nm process. Meanwhile, the power efficiency is up to 8.5 TOPS/W, which is two orders of magnitude higher than that of a dedicated face landmark detection accelerator.

eSLAM: An Energy-Efficient Accelerator for Real-Time ORB-SLAM on FPGA Platform

  • Runze Liu
  • Jianlei Yang
  • Yiran Chen
  • Weisheng Zhao

Simultaneous Localization and Mapping (SLAM) is a critical task for autonomous navigation. However, due to the computational complexity of SLAM algorithms, it is very difficult to achieve real-time implementation on low-power platforms. We propose an energy-efficient architecture for a real-time ORB (Oriented-FAST and Rotated-BRIEF) based visual SLAM system by accelerating the most time-consuming stages of feature extraction and matching on an FPGA platform. Moreover, the original ORB descriptor pattern is reformed in a rotationally symmetric manner, which is much more hardware friendly. Optimizations including rescheduling and parallelizing are further utilized to improve the throughput and reduce the memory footprint. Compared with Intel i7 and ARM Cortex-A9 CPUs on the TUM dataset, our FPGA realization achieves up to 3× and 31× frame rate improvement, as well as up to 71× and 25× energy efficiency improvement, respectively.

ShuntFlow: An Efficient and Scalable Dataflow Accelerator Architecture for Streaming Applications

  • Shijun Gong
  • Jiajun Li
  • Wenyan Lu
  • Guihai Yan
  • Xiaowei Li

Streaming processing is an important and growing class of applications for analyzing continuous streams of real-time data. Sliding-window aggregations (SWAGs) dominate the computation time in such applications and demand unprecedented computation capacity, which poses a great challenge to computing architectures. General-purpose processors cannot efficiently handle SWAGs because of their specific computation patterns. This paper proposes an efficient accelerator architecture for ubiquitous SWAGs, called ShuntFlow. ShuntFlow is a typical type of Kernel Processing Unit (KPU) where “Kernel” represents two main categories of SWAG operations widely used in streaming processing. Meanwhile, we propose a shunt rule to enable ShuntFlow to efficiently handle SWAGs with arbitrary parameters. As a case study, we implemented ShuntFlow on an Altera Arria 10 AX115N FPGA board at 150 MHz and compared it to previous approaches. The experimental results show that ShuntFlow provides a tremendous throughput and latency advantage over CPU and GPU implementations on both reduce-like and index-like SWAGs.
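For readers unfamiliar with sliding-window aggregation, the following is a minimal software sketch of one such operation, a windowed sum with a configurable window size and slide. It is assumed here to be an example of what the abstract calls a reduce-like SWAG. The function name and parameters are hypothetical, and the code says nothing about how ShuntFlow itself is organized.

```python
from collections import deque


def sliding_window_sum(stream, window, slide=1):
    """Emit the sum of the last `window` items every `slide` items.

    The first result is produced once `window` items have arrived.
    """
    buf = deque(maxlen=window)  # oldest item is dropped automatically
    results = []
    for count, value in enumerate(stream, start=1):
        buf.append(value)
        if count >= window and (count - window) % slide == 0:
            results.append(sum(buf))
    return results


if __name__ == "__main__":
    print(sliding_window_sum([1, 2, 3, 4, 5], window=3))           # [6, 9, 12]
    print(sliding_window_sum([1, 2, 3, 4, 5], window=3, slide=2))  # [6, 12]
```

This naive version recomputes each window from scratch, so its cost grows with the window size; that per-window recomputation is the kind of work a dedicated accelerator would aim to avoid.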

A General Pattern-Based Dynamic Compilation Framework for Coarse-Grained Reconfigurable Architectures

  • Xingchen Man
  • Leibo Liu
  • Jianfeng Zhu
  • Shaojun Wei

Compilation has become a major challenge to the usability of coarse-grained reconfigurable architectures as an increasing number of programmable resources must be orchestrated. Static compilation is insufficient because of its prohibitive time cost, while dynamic compilation still performs poorly in both generality and efficiency. This paper proposes a general pattern-based dynamic compilation framework, which utilizes statically-generated patterns to straightforwardly determine runtime re-placement and routing so that the runtime configuration creation algorithm has low complexity. Domain-specific communication characteristics are harnessed to help improve the efficiency of patterns. The experimental results show that compiled general applications can be transformed onto arbitrary resources at runtime, reserving 97% (39%~163%) of the original performance/resource on average, 7% (0~17%) better than the state-of-the-art non-general methods.

ReTagger: An Efficient Controller for DRAM Cache Architectures

  • Mahdi Nazm Bojnordi
  • Farhan Nasrullah

3D die-stacking has enabled energy-efficient solutions for near data processing by integrating multiple dice of high-density memory layers and processor cores within the same package. One promising approach is to employ the in-package memory as a gigascale last-level cache for data-intensive computing. Most existing in-package cache controllers rely on the command scheduling policies borrowed from the off-chip DRAM system. Regrettably, these control policies are not specifically tailored for in-package cache traffic, which results in limited bandwidth efficiency. This paper proposes ReTagger, a DRAM cache controller that employs repeated tags to alleviate the cost of DRAM row buffer misses. Our simulation results on a set of ten data-intensive applications indicate an average of 20% performance improvement for the proposed controller over the state-of-the-art DRAM caches.

Software Approaches for In-time Resilience

  • Aviral Shrivastava
  • Moslem Didehban

Advances in semiconductor technology have enabled unprecedented growth in safety-critical applications. However, due to unabated scaling, the unreliability of the underlying hardware is only getting worse. For many applications, just recovering from errors is not enough: the latency between the occurrence of a fault and its detection and recovery, i.e., in-time error resilience, is of vital importance. This is especially true for real-time applications, where the timing of application events is a crucial part of the correctness of the application. While software techniques for resilience are highly desirable since they can be flexibly applied, achieving reliable, in-time software resilience is still an elusive goal. A new class of recent techniques has started to tackle this problem. This paper presents a succinct overview of existing software resilience techniques from the point of view of in-time resilience, and points out future challenges.

Cross-Layer Resilience: Challenges, Insights, and the Road Ahead

  • Eric Cheng
  • Daniel Mueller-Gritschneder
  • Jacob Abraham
  • Pradip Bose
  • Alper Buyuktosunoglu
  • Deming Chen
  • Hyungmin Cho
  • Yanjing Li
  • Uzair Sharif
  • Kevin Skadron
  • Mircea Stan
  • Ulf Schlichtmann
  • Subhasish Mitra

Resilience to errors in the underlying hardware is a key design objective for a large class of computing systems, from embedded systems all the way to the cloud. Sources of hardware errors include radiation, circuit aging, variability induced by manufacturing and operating conditions, manufacturing test escapes, and early-life failures. Many publications have suggested that cross-layer resilience, where multiple error resilience techniques from different layers of the system stack cooperate to achieve cost-effective resilience, is essential for designing cost-effective resilient digital systems. This paper presents a comprehensive overview of cross-layer resilience by addressing fundamental cross-layer resilience questions, by summarizing insights derived from recent advances in cross-layer resilience research, and by discussing future cross-layer resilience challenges.

Increasing Soft Error Resilience by Software Transformation

  • Michael Werner
  • Keerthikumara Devarajegowda
  • Moomen Chaari
  • Wolfgang Ecker

Developing software in a slightly different way can have a dramatic impact on soft error resilience. This observation can be turned into a process for improving existing code through transformations. These transformations are systematic in nature and can be automated. In this paper, we present a framework for generating low-level embedded software (commonly referred to as firmware) and for including safety measures in the generated code. The generation approach follows a three-stage process starting with formalized firmware specification using both platform-dependent and platform-independent firmware models. Finally, C code is generated from the view model in a straightforward way. Safety measures are included either as part of the translation step between the models or as transformations of single models.

FLightNNs: Lightweight Quantized Deep Neural Networks for Fast and Accurate Inference

  • Ruizhou Ding
  • Zeye Liu
  • Ting-Wu Chin
  • Diana Marculescu
  • R. D. (Shawn) Blanton

To improve the throughput and energy efficiency of Deep Neural Networks (DNNs) on customized hardware, lightweight neural networks constrain the weights of DNNs to be a limited combination (denoted as k ∈ {1, 2}) of powers of 2. In such networks, the multiply-accumulate operation can be replaced with a single shift operation, or two shifts and an add operation. To provide even more design flexibility, the k for each convolutional filter can be optimally chosen instead of being fixed for every filter. In this paper, we formulate the selection of k to be differentiable, and describe model training for determining k-based weights on a per-filter basis. Over 46 FPGA-design experiments involving eight configurations and four data sets reveal that lightweight neural networks with a flexible k value (dubbed FLightNNs) fully utilize the hardware resources on Field Programmable Gate Arrays (FPGAs). Our experimental results show that FLightNNs can achieve 2× speedup when compared to lightweight NNs with k = 2, with only 0.1% accuracy degradation. Compared to a 4-bit fixed-point quantization, FLightNNs achieve higher accuracy and up to 2× inference speedup, due to their lightweight shift operations. In addition, our experiments demonstrate that FLightNNs can achieve higher computational energy efficiency for ASIC implementation.
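
The shift-based multiply described in this abstract can be illustrated with a minimal sketch. It assumes integer activations and weights of the form ±2^a (k = 1) or ±2^a ± 2^b (k = 2) with non-negative exponents. The function names and the (sign, exponent) encoding are hypothetical and are not taken from the paper.

```python
def shift_multiply(x, terms):
    """Multiply integer x by a weight given as a list of (sign, exponent) terms.

    Each term stands for sign * 2**exponent, so k = len(terms) is 1 or 2.
    The encoding is an illustrative assumption, not the paper's format.
    """
    acc = 0
    for sign, exponent in terms:
        acc += sign * (x << exponent)  # one shift per power-of-two term
    return acc


def weight_value(terms):
    """Return the ordinary integer value of the encoded weight."""
    return sum(sign * (1 << exponent) for sign, exponent in terms)


w_k1 = [(1, 3)]            # weight 8: a single shift
w_k2 = [(1, 2), (-1, 0)]   # weight 4 - 1 = 3: two shifts and one add

print(shift_multiply(5, w_k1), shift_multiply(5, w_k2))
```

With k = 1 the loop body runs once (one shift); with k = 2 it runs twice, which corresponds to the "two shifts and an add" case.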

BiScaled-DNN: Quantizing Long-tailed Datastructures with Two Scale Factors for Deep Neural Networks

  • Shubham Jain
  • Swagath Venkataramani
  • Vijayalakshmi Srinivasan
  • Jungwook Choi
  • Kailash Gopalakrishnan
  • Leland Chang

Fixed-point implementations (FxP) are prominently used to realize Deep Neural Networks (DNNs) efficiently on energy-constrained platforms. The choice of bit-width is often constrained by the ability of FxP to represent the entire range of numbers in the datastructure with sufficient resolution. At low bit-widths (< 8 bits), state-of-the-art DNNs invariably suffer a loss in classification accuracy due to quantization/saturation errors.

In this work, we leverage a key insight that almost all datastructures in DNNs are long-tailed, i.e., a significant majority of the elements are small in magnitude, with a small fraction being orders of magnitude larger. We propose BiScaled-FxP, a new number representation which caters to the disparate range and resolution needs of long-tailed datastructures. The key idea is that, whilst using the same number of bits to represent elements of both large and small magnitude, we employ two different scale factors, viz. scale-fine and scale-wide, in their quantization. Scale-fine allocates more fractional bits, providing resolution for small numbers, while scale-wide favors covering the entire range of large numbers, albeit at a coarser resolution. We develop a BiScaled DNN accelerator which computes on BiScaled-FxP tensors. A key challenge is to store the scale factor used in quantizing each element, as computations that use operands quantized with different scale factors need to scale their result. To minimize this overhead, we use a block sparse format to store only the indices of scale-wide elements, which are few in number. Also, we enhance the BiScaled-FxP processing elements with shifters to scale their output when operands to computations use different scale factors. We develop a systematic methodology to identify the scale-fine and scale-wide factors for the weights and activations of any given DNN. Over 8 state-of-the-art image recognition benchmarks, BiScaled-FxP uses 2 fewer computation bits than conventional FxP, while also slightly improving classification accuracy in all cases. Compared to FxP8, the performance and energy benefits range between 1.43× and 3.86× and between 1.4× and 3.7×, respectively.
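
The two-scale-factor idea can be sketched as follows. This is a simplified illustration under stated assumptions: scale factors are expressed as numbers of fractional bits, an element is treated as scale-wide whenever it does not fit the scale-fine range, and the scale-wide indices are kept in a plain list rather than the block sparse format mentioned in the abstract. All function and parameter names are hypothetical.

```python
def quantize(value, frac_bits, total_bits):
    """Round value to a signed fixed-point integer code, with saturation."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = round(value * (1 << frac_bits))
    return max(lo, min(hi, code))


def biscaled_quantize(values, total_bits, frac_fine, frac_wide):
    """Use scale-fine where the value fits, otherwise scale-wide.

    Returns the integer codes and the indices of scale-wide elements.
    """
    fine_max = ((1 << (total_bits - 1)) - 1) / (1 << frac_fine)
    codes, wide_idx = [], []
    for i, v in enumerate(values):
        if abs(v) <= fine_max:
            codes.append(quantize(v, frac_fine, total_bits))
        else:
            codes.append(quantize(v, frac_wide, total_bits))
            wide_idx.append(i)  # only the few large elements are recorded
    return codes, wide_idx


def dequantize(codes, wide_idx, frac_fine, frac_wide):
    """Recover real values using the per-element scale factor."""
    wide = set(wide_idx)
    return [c / (1 << (frac_wide if i in wide else frac_fine))
            for i, c in enumerate(codes)]


# Illustrative data: mostly small values plus one large outlier.
values = [0.25, -0.5, 12.0, 0.03125]
codes, wide_idx = biscaled_quantize(values, total_bits=6, frac_fine=5, frac_wide=1)
restored = dequantize(codes, wide_idx, frac_fine=5, frac_wide=1)
```

In this toy case a single 6-bit format with 5 fractional bits would saturate on 12.0, while one with 1 fractional bit would lose 0.03125 entirely; using both scale factors preserves all four values.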

A None-Sparse Inference Accelerator that Distills and Reuses the Computation Redundancy in CNNs

  • Ying Wang
  • Shengwen Liang
  • Huawei Li
  • Xiaowei Li

Prior research on energy-efficient Convolutional Neural Network (CNN) inference accelerators mostly focuses on exploiting model sparsity, i.e., zero patterns in weights and activations, to reduce on-chip storage and computation overhead. In this work, we find that, in addition to zero patterns, a larger group of repetitive patterns and values exists in the working set of a CNN inference task; we define this as computation redundancy, which induces unnecessary performance and storage overhead in CNN accelerators. Based on this observation, we propose a redundancy-free architecture that detects and eliminates the repetitive computation and storage patterns in CNNs for more efficient network inference. The architecture consists of two parts: an off-line parameter analyzer that extracts the repetitive patterns in the 3D tensor of parameters, and a dataflow accelerator. The proposed accelerator first preprocesses the weight patterns and the dynamically generated activations, and then caches these intermediate results in special P2-cache banks for later use in the convolution or full-connection stage. Experiments show that the proposed Cavoluche architecture removes up to 89% of the repetitive operations from the layer inference process and reduces by 77% the on-chip storage space needed to store both redundancy-free weights and activations. Experiments also show that the implementation of Cavoluche outperforms the state-of-the-art mobile GPGPU in both performance and energy efficiency. When compared to the latest sparsity-based accelerators, Cavoluche also achieves better operation elimination.

On the Complexity Reduction of Dense Layers from O(N²) to O(N log N) with Cyclic Sparsely Connected Layers

  • Morteza Hosseini
  • Mark Horton
  • Hiren Paneliya
  • Uttej Kallakuri
  • Houman Homayoun
  • Tinoosh Mohsenin

In deep neural networks (DNNs), model size is an important factor affecting performance, energy efficiency and scalability. Recent works on weight pruning have shown significant reduction in model size at the expense of irregularity in the DNN architecture, which necessitates additional indexing memory to address non-zero weights, thereby increasing chip size, energy consumption and delays. In this paper, we propose cyclic sparsely connected (CSC) layers, with a memory/computation complexity of O(N log N), that can be used as an overlay for fully connected (FC) layers whose number of parameters, O(N²), can dominate the parameters of the entire DNN model. The CSC layers are composed of a few sequential layers, referred to as support layers, which result in full connectivity between the inputs and outputs of each CSC layer. We introduce an algorithm to train models with FC layers replaced with CSC layers in a bottom-up approach by incrementally increasing the CSC layers' characteristics, such as connectivity and number of synapses, to achieve the desired accuracy given a compression rate. One advantage of the CSC layers is that there is no requirement for indexing the non-zero weights. Our experimental results using AlexNet on ImageNet and LeNet300100 on MNIST indicate that by substituting FC layers with CSC layers, we can achieve 10× to 46× compression within a margin of 2% accuracy loss, which is comparable to non-structural pruning methods. A scalable parallel hardware architecture to implement CSC layers, and an equivalent scalable parallel architecture to efficiently implement non-structurally pruned FC layers, are designed and fully placed and routed on an Artix-7 FPGA and in 65nm CMOS ASIC technology for the LeNet300100 model. The results indicate that the proposed CSC hardware outperforms the conventional non-structurally pruned architecture with an equal compression rate by ~2× in power, energy, area and resource utilization when running at the same frequency.
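
To see how a few sparse support layers can still give full input-to-output connectivity with about N log N parameters, consider the sketch below. The specific wiring (each node in support layer l reads nodes at cyclic offsets j · F^l, for a fan-in F) is an assumption chosen for illustration; the abstract does not specify the exact pattern, and the function names are hypothetical.

```python
def csc_support_layers(n, fan_in):
    """Build a cyclic sparse adjacency list for each support layer.

    Layer l connects node i to nodes (i + j * fan_in**l) % n for
    j in range(fan_in). This dilation pattern is an illustrative
    assumption. Requires fan_in >= 2.
    """
    num_layers = 1
    while fan_in ** num_layers < n:
        num_layers += 1
    layers = []
    for l in range(num_layers):
        stride = fan_in ** l
        layers.append([[(i + j * stride) % n for j in range(fan_in)]
                       for i in range(n)])
    return layers


def reachable_inputs(layers, n):
    """For each final output, compute the set of original inputs it depends on."""
    reach = [{i} for i in range(n)]  # before any layer, node i sees only input i
    for layer in layers:
        reach = [set().union(*(reach[src] for src in layer[i]))
                 for i in range(n)]
    return reach


n, fan_in = 16, 2
layers = csc_support_layers(n, fan_in)
reach = reachable_inputs(layers, n)
csc_params = sum(len(row) for layer in layers for row in layer)
fc_params = n * n
```

For N = 16 and fan-in 2, this wiring needs four support layers and N · F · log_F N = 128 weights, compared with 256 for the fully connected layer, and every output still depends on every input. Because the wiring is fixed and regular, no per-weight index needs to be stored.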

Sensitivity based Error Resilient Techniques for Energy Efficient Deep Neural Network Accelerators

  • Wonseok Choi
  • Dongyeob Shin
  • Jongsun Park
  • Swaroop Ghosh

With the inherent algorithmic error resilience of deep neural networks (DNNs), supply voltage scaling could be a promising technique for energy-efficient DNN accelerator design. In this paper, we propose novel error-resilient techniques to enable aggressive voltage scaling by exploiting the different amounts of error resilience (sensitivity) across DNN layers, filters, and channels. First, to rapidly evaluate filter/channel-level weight sensitivities of large-scale DNNs, a first-order Taylor expansion is used, which accurately approximates the weight sensitivity obtained from actual error injection simulation. With the measured timing error probability of each multiply-accumulate (MAC) unit under process variations, the sensitivity variation among filter weights can be leveraged to design a DNN accelerator such that computations with more sensitive weights are assigned to more robust MAC units, while those with less sensitive weights are assigned to less robust MAC units. Based on post-synthesis timing simulations, 51% energy savings are achieved on the CIFAR-10 dataset using VGG-9 compared to a state-of-the-art timing error recovery technique under the same constraint of 3% accuracy loss.
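
The sensitivity-driven mapping can be sketched in a few lines. Two assumptions are made here for illustration: the first-order Taylor score of a filter is taken as the sum of |weight × gradient| over its weights (one common form of this estimate, not necessarily the paper's exact formula), and there are as many MAC units as filters. The function names and the numbers are hypothetical.

```python
def filter_sensitivity(weights, grads):
    """First-order Taylor estimate of how much the loss reacts to this filter."""
    return sum(abs(w * g) for w, g in zip(weights, grads))


def assign_filters_to_macs(sensitivities, mac_error_probs):
    """Map filter index -> MAC index.

    The most sensitive filter gets the MAC unit with the lowest
    timing error probability, and so on down both rankings.
    """
    filters = sorted(range(len(sensitivities)),
                     key=lambda f: sensitivities[f], reverse=True)
    macs = sorted(range(len(mac_error_probs)),
                  key=lambda m: mac_error_probs[m])
    return dict(zip(filters, macs))


# Illustrative values only.
sensitivities = [0.1, 0.9, 0.4]        # per-filter scores
mac_error_probs = [0.05, 0.001, 0.02]  # per-MAC timing error probability
mapping = assign_filters_to_macs(sensitivities, mac_error_probs)
```

Here filter 1 (most sensitive) lands on MAC 1 (most robust), and filter 0 (least sensitive) lands on MAC 0 (least robust).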

St-DRC: Stretchable DRAM Refresh Controller with No Parity-overhead Error Correction Scheme for Energy-efficient DNNs

  • Duy-Thanh Nguyen
  • Nhut-Minh Ho
  • Ik-Joon Chang

We present a stretchable DRAM refresh controller for energy-efficient processing of DNNs, namely St-DRC. We exploit the characteristic that the recognition accuracy of DNNs is insensitive to errors in insignificant bits. By replacing some insignificant bits with parity bits for the error correction of significant bits, the St-DRC can protect the significant bits under stretched refresh periods. This significantly reduces DRAM refresh energy without performance degradation of DNNs, and is applicable to both training and inference operations. Our simulation shows that in training, the St-DRC obtains 23%/12% DRAM energy savings for graphic/main memories, respectively. Further, the St-DRC accelerates the training speed by 0.43% to 4.12%.

FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge

  • Cong Hao
  • Xiaofan Zhang
  • Yuhong Li
  • Sitao Huang
  • Jinjun Xiong
  • Kyle Rupnow
  • Wen-mei Hwu
  • Deming Chen

While embedded FPGAs are attractive platforms for DNN acceleration on edge devices due to their low latency and high energy efficiency, the scarcity of resources on edge-scale FPGA devices also makes DNN deployment challenging. In this paper, we propose a simultaneous FPGA/DNN co-design methodology with both bottom-up and top-down approaches: a bottom-up hardware-oriented DNN model search for high accuracy, and a top-down FPGA accelerator design considering DNN-specific characteristics. We also build an automatic co-design flow, including an Auto-DNN engine to perform hardware-oriented DNN model search, as well as an Auto-HLS engine to generate synthesizable C code of the FPGA accelerator for explored DNNs. We demonstrate our co-design approach on an object detection task using a PYNQ-Z1 FPGA. Results show that our proposed DNN model and accelerator outperform the state-of-the-art FPGA designs in all aspects, including Intersection-over-Union (IoU) (6.2% higher), frames per second (FPS) (2.48× higher), power consumption (40% lower), and energy efficiency (2.5× higher). Compared to GPU-based solutions, our designs deliver similar accuracy but consume far less energy.

Scale-out Acceleration for 3D CNN-based Lung Nodule Segmentation on a Multi-FPGA System

  • Junzhong Shen
  • Deguang Wang
  • You Huang
  • Mei Wen
  • Chunyuan Zhang

Three-dimensional convolutional neural networks (3D CNNs) have become a promising method in lung nodule segmentation. The high computational complexity and memory requirements of 3D CNNs make it challenging to accelerate them on a single FPGA. In this work, we focus on accelerating 3D CNN-based lung nodule segmentation on a multi-FPGA platform by proposing an efficient mapping scheme that takes advantage of the massive parallelism provided by the platform while maximizing the computational efficiency of the accelerators. Experimental results show that our system, which integrates four Xilinx VCU118 boards, can achieve state-of-the-art performance of 14.5 TOPS, along with a 29.4× performance gain over a CPU and 10.5× higher energy efficiency than a GPU.

Dr. BFS: Data Centric Breadth-First Search on FPGAs

  • Eric Finnerty
  • Zachary Sherer
  • Hang Liu
  • Yan Luo

The flexible architectures of Field Programmable Gate Arrays (FPGAs) lend themselves to an array of data analytical applications, among which Breadth-First Search (BFS), due to its vital importance, draws particular attention. Recent attempts that offload BFS to FPGAs either simply imitate the existing CPU- or Graphics Processing Unit (GPU)-based mechanisms or suffer from scalability issues. To this end, we introduce a novel data-centric design which extensively exploits the potential of FPGAs for BFS with the following two techniques. First, we advocate partitioning and compressing the BFS algorithmic metadata in order to buffer them in fast on-chip memory and circumvent expensive metadata accesses. Second, we propose a hierarchical coalescing method to improve the throughput of graph data access. Taken together, our evaluation demonstrates that the proposed design achieves, on average, 1.6× and 2.2× speedups over the state-of-the-art FPGA designs TorusBFS and Umuroglu, respectively, across a collection of graph datasets.

Peregrine: A Flexible Hardware Accelerator for LSTM with Limited Synaptic Connection Patterns

  • Jaeha Kung
  • Junki Park
  • Sehun Park
  • Jae-Joon Kim

In this paper, we present an integrated solution to design a high-performance LSTM accelerator. We propose a fast and flexible hardware architecture, named Peregrine, supported by a stack of innovations from algorithm to hardware design. Peregrine first minimizes the memory footprint by limiting the synaptic connection patterns within the LSTM network. Also, Peregrine provides parallel Huffman decoders with adaptive clocking to provide flexibility in dealing with a wide range of sparsity levels in the weight matrices. All these features are incorporated in a novel hardware architecture to maximize energy-efficiency. As a result, Peregrine improves performance by ~38% and energy-efficiency by ~33% in speech recognition compared to the state-of-the-art LSTM accelerator.

Systolic Cube: A Spatial 3D CNN Accelerator Architecture for Low Power Video Analysis

  • Yongchen Wang
  • Ying Wang
  • Huawei Li
  • Cong Shi
  • Xiaowei Li

3D convolutional neural networks (CNNs) are gaining popularity in action/activity analysis. Compared to 2D convolutions that share the filters in the 2D spatial domain, 3D convolutions further reuse filters in the temporal dimension to capture time-domain features. Prior works on specialized 3D-CNN accelerators employ additional on-chip memories and a multi-cluster architecture to reuse data among the processing element (PE) arrays, which is too expensive for low-power chips. Instead of harvesting in-memory locality, we propose a 3D systolic cube architecture to exploit the spatial and temporal localities of 3D CNNs, which moves the reusable data between PEs connected via a 3D-cube Network-on-Chip. Evaluation shows that the systolic cube delivers a considerable energy-efficiency boost for activity-recognition benchmarks.

Context-Aware Convolutional Neural Network over Distributed System in Collaborative Computing

  • Jinhang Choi
  • Zeinab Hakimi
  • Philip W. Shin
  • Jack Sampson
  • Vijaykrishnan Narayanan

As the computing power of end-point devices grows, there has been interest in developing distributed deep neural networks specifically for hierarchical inference deployments on multi-sensor systems. However, as the existing approaches rely on latent parameters trained by machine learning, it is difficult to preemptively select front-end deep features across sensors, or to understand an individual feature’s relative importance for systematic global inference. In this paper, we propose multi-view convolutional neural networks exploiting likelihood estimation. Proof-of-concept experiments show that our likelihood-based context selection and weighted averaging collaboration scheme can decrease an endpoint’s communication and energy costs by a factor of 3, while achieving high accuracy comparable to the original aggregation approaches.

The Best of Both Worlds: On Exploiting Bit-Alterable NAND Flash for Lifetime and Read Performance Optimization

  • Shuo-Han Chen
  • Ming-Chang Yang
  • Yuan-Hao Chang

With the emergence of bit-alterable 3D NAND flash, programming and erasing a flash cell at bit-level granularity have become a reality. Bit-level operations can benefit the high-density, high-bit-error-rate 3D NAND flash by realizing the “bit-level rewrite operation,” which can refresh error bits at bit-level granularity to reduce the error correction latency and improve the read performance with minimal lifetime expense. Unlike existing refresh techniques, bit-level operations can lower the lifetime expense by removing error bits directly without page-based rewrites. However, since bit-level rewrites may induce a similar amount of latency as conventional page-based rewrites and thus lead to low rewrite throughput, the efficiency of bit-level rewrites should be carefully considered. This observation motivates us to propose a bit-level error removal (BER) scheme to derive the most efficient way of utilizing bit-level operations for both lifetime and read performance optimization. A series of experiments was conducted to demonstrate the capability of the BER scheme, with encouraging results.

WAS: Wear Aware Superblock Management for Prolonging SSD Lifetime

  • Shunzhuo Wang
  • Fei Wu
  • Chengmo Yang
  • Jiaona Zhou
  • Changsheng Xie
  • Jiguang Wan

Superblocks are widely employed in SSDs for improving performance. However, the standard superblock organization, which links blocks with the same block ID across planes into one superblock, inevitably wastes SSD lifetime due to inter-block variations in wear tolerance. This work proposes a wear-aware superblock management scheme, called WAS, which (1) dynamically organizes superblocks according to real-time block wear levels so that strong blocks relieve wear on weak ones, and (2) employs a wear-based garbage collection scheme to reduce the inter-block wear gap. Comprehensive experiments are carried out in SSDsim. Results show that WAS prolongs SSD lifetime by 51.3% compared with the state-of-the-art superblock management.

ASCache: An Approximate SSD Cache for Error-Tolerant Applications

  • Fei Li
  • Youyou Lu
  • Zhongjie Wu
  • Jiwu Shu

With increased density, flash memory becomes more vulnerable to errors. Error correction incurs high overhead, which is particularly costly in an SSD cache. However, some applications, such as multimedia processing, have an intrinsic tolerance for inaccuracies. In this paper, we propose ASCache, an approximate SSD cache, which allows bit errors within a controllable threshold for error-tolerant applications, so as to reduce the cache miss ratio caused by incorrect cache pages. ASCache further trades the strictness of error correction mechanisms for higher SSD access performance. Evaluations show that ASCache reduces the average read latency by at most 30% and the cache miss ratio by 52%.

Leveraging Approximate Data for Robust Flash Storage

  • Qiao Li
  • Liang Shi
  • Jun Yang
  • Youtao Zhang
  • Chun Jason Xue

With the increasing bit density and adoption of 3D NAND, flash memory suffers from increased errors. To address the issue, flash devices adopt error correction codes (ECC) with strong error correction capability, such as low-density parity-check (LDPC) codes. The drawback of LDPC is that correcting data with a high raw bit error rate (RBER) amplifies read latency. This work proposes to address this issue with the assistance of approximate data. First, studies have been conducted which show that an ample amount of approximate data is available in flash storage. Second, a novel data organization is proposed to fortify the reliability of regular data by leaving approximate data unprotected. Finally, a new data allocation strategy and a modified garbage collection scheme are presented to complete the design. The experimental results show that the proposed approach can improve read performance by 30% on average compared to current techniques.

MARCH: MAze Routing Under a Concurrent and Hierarchical Scheme for Buses

  • Jingsong Chen
  • Jinwei Liu
  • Gengjie Chen
  • Dan Zheng
  • Evangeline F. Y. Young

The continuous development of modern VLSI technology has brought new challenges for on-chip interconnections. Different from classic net-by-net routing, bus routing requires all the nets (bits) in the same bus to share similar or even the same topology, besides considering wire length, via count, and other design rules. In this paper, we present MARCH, an efficient maze routing method under a concurrent and hierarchical scheme for buses. In MARCH, to achieve the same topology, all the bits in a bus are routed concurrently, like marching along a path. For efficiency, our method is hierarchical, consisting of coarse-grained topology-aware path planning and fine-grained track assignment for bits. Additionally, an effective rip-up and reroute scheme is applied to further improve the solution quality. In experimental results, MARCH significantly outperforms the first-place winner of the 2018 IC/CAD Contest in both quality and runtime.

A DAG-Based Algorithm for Obstacle-Aware Topology-Matching On-Track Bus Routing

  • Chen-Hao Hsu
  • Shao-Chun Hung
  • Hao Chen
  • Fan-Keng Sun
  • Yao-Wen Chang

As clock frequencies increase, topology-matching bus routing is desired to provide an initial routing result which facilitates the following buffer insertion to meet the timing constraints. Our algorithm consists of three main techniques: (1) a bus clustering method to reduce the routing complexity, (2) a DAG-based algorithm to connect a bus in the specific topology, and (3) a rip-up and re-route scheme to alleviate the routing congestion. Experimental results show that our proposed algorithm outperforms all the participating teams of the 2018 CAD Contest at ICCAD, where the top-3 routers result in 145%, 158%, and 420% higher costs than ours.

A Learning-Based Recommender System for Autotuning Design Flows of Industrial High-Performance Processors

  • Jihye Kwon
  • Matthew M. Ziegler
  • Luca P. Carloni

Logic synthesis and physical design (LSPD) tools automate complex design tasks previously performed by human designers. One time-consuming task that remains manual is configuring the LSPD flow parameters, which significantly impacts design results. To reduce the parameter-tuning effort, we propose an LSPD parameter recommender system that involves learning a collaborative prediction model through tensor decomposition and regression. Using a model trained with archived data from multiple state-of-the-art 14nm processors, we reduce the exploration cost while achieving comparable design quality. Furthermore, we demonstrate the transfer-learning properties of our approach by showing that this model can be successfully applied to 7nm designs.

Painting on Placement: Forecasting Routing Congestion using Conditional Generative Adversarial Nets

  • Cunxi Yu
  • Zhiru Zhang

The physical design process commonly consumes hours to days for large designs, and routing is known as the most critical step. Demand for accurate routing quality prediction has risen to a new level in order to accelerate hardware innovation with advanced technology nodes. This work presents an approach that forecasts the density of all routing channels over the entire floorplan, with features collected up to placement, using conditional GANs. Specifically, forecasting the routing congestion is formulated as an image translation (colorization) problem. The proposed approach is applied to a) placement exploration for minimum congestion, b) constrained placement exploration, and c) forecasting congestion in real time during incremental placement, using eight designs targeting a fixed FPGA architecture.

Pin Accessibility Prediction and Optimization with Deep Learning-based Pin Pattern Recognition

  • Tao-Chun Yu
  • Shao-Yun Fang
  • Hsien-Shih Chiu
  • Kai-Shun Hu
  • Philip Hui-Yuh Tai
  • Cindy Chin-Fang Shen
  • Henry Sheng

FIT: Fill Insertion Considering Timing

  • Bentian Jiang
  • Xiaopeng Zhang
  • Ran Chen
  • Gengjie Chen
  • Peishan Tu
  • Wei Li
  • Evangeline F. Y. Young
  • Bei Yu

Dummy fill insertion is a mandatory step in the modern semiconductor manufacturing process to reduce dielectric thickness variation and provide nearly uniform pattern density for the chemical mechanical planarization (CMP) process. However, with the continuous shrinking of VLSI technology nodes, the coupling effects between the inserted metal fills and signal tracks can severely affect the original timing closure of the layout design. In this paper, we propose a robust, efficient and high-performance framework for timing-aware dummy fill insertion, which simultaneously minimizes the coupling capacitance of critical signal wires and other wires. The experimental results on the IC/CAD 2018 contest benchmarks show that our proposed framework outperforms the contest winner by 8% on critical coupling capacitance with a 3.3× runtime speedup.

The Metric Matters: The Art of Measuring Trust in Electronics

  • Jonathan Cruz
  • Prabhat Mishra
  • Swarup Bhunia

Electronic hardware trust is an emerging concern for all stakeholders in the semiconductor industry. Trust issues in electronic hardware span all stages of its life cycle, from creation of intellectual property (IP) blocks to manufacturing, test, and deployment of hardware components, and all abstraction levels, from chips to printed circuit boards (PCBs) to systems. The trust issues originate from a horizontal business model that promotes reliance on third-party untrusted facilities, tools, and IPs in the hardware life cycle. Today, designers are tasked with verifying the integrity of third-party IPs before incorporating them into system-on-chip (SoC) designs. Existing trust metric frameworks have limited applicability since they are not comprehensive. They capture only a subset of vulnerabilities, such as potential vulnerabilities introduced through design mistakes and CAD tools, or quantify features in a design that target a particular Trojan model. Therefore, current practice uses ad-hoc security analysis of IP cores. In this paper, we propose a vector-based comprehensive coverage metric that quantifies the overall trust of an IP considering both vulnerabilities and direct malicious modifications. We use a variable weighted sum of a design’s functional coverage, structural coverage, and asset coverage to assess an IP’s integrity. Designers can also effectively use our trust metric to compare the relative trustworthiness of functionally equivalent third-party IPs. To demonstrate the applicability and usefulness of the proposed metric, we apply our trust metric to Trojan-free and Trojan-inserted variants of an IP. Our results demonstrate that we are able to successfully distinguish between trusted and untrusted IPs.
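
The weighted-sum form of the metric can be illustrated with a minimal sketch. It assumes each coverage value lies in [0, 1] and that the weights are normalized by their sum; the normalization, the function name, and all numbers below are illustrative assumptions rather than values from the paper.

```python
def trust_score(coverage, weights):
    """Weighted sum of coverage components, normalized by the total weight.

    coverage and weights are dicts keyed by component name. The
    normalization step is an assumption made so scores stay in [0, 1].
    """
    total = sum(weights.values())
    return sum(weights[name] * coverage[name] for name in weights) / total


# Hypothetical coverage numbers for two functionally equivalent IPs.
weights = {"functional": 1.0, "structural": 1.0, "asset": 2.0}
ip_a = {"functional": 0.9, "structural": 0.8, "asset": 0.5}
ip_b = {"functional": 0.9, "structural": 0.8, "asset": 0.9}

score_a = trust_score(ip_a, weights)
score_b = trust_score(ip_b, weights)
more_trusted = "ip_b" if score_b > score_a else "ip_a"
```

Changing the weights lets a designer emphasize whichever coverage component matters most for a given use case, which is one way to read "variable" in the abstract.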

Authenticated Call Stack

  • Hans Liljestrand
  • Thomas Nyman
  • Jan-Erik Ekberg
  • N. Asokan

Shadow stacks are the go-to solution for perfect backward-edge control-flow integrity (CFI). Software shadow stacks trade off security for performance. Hardware-assisted shadow stacks are efficient and secure, but expensive to deploy. We present authenticated call stack (ACS), a novel mechanism for precise verification of return addresses using aggregated message authentication codes. We show how ACS can be realized using ARMv8.3-A pointer authentication, a new low-overhead mechanism for protecting pointer integrity. Our solution achieves security comparable to hardware-assisted shadow stacks, while incurring negligible performance overhead (< 0.5%) and requiring no additional hardware support.
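
The idea of aggregating authentication codes over a chain of return addresses can be modeled in software as below. This is a conceptual sketch only: it uses HMAC-SHA256 from the Python standard library as a stand-in for pointer authentication instructions, keeps the newest token in an attribute that plays the role of a protected register, and treats the `stack` list as attacker-writable memory. The class name, the token length, and these modeling choices are assumptions, not details from the paper.

```python
import hashlib
import hmac

TOKEN_BYTES = 4  # truncated token length; an illustrative choice


def auth_token(key, ret_addr, prev_token):
    """Token that binds a return address to the token of the caller's frame."""
    msg = ret_addr.to_bytes(8, "little") + prev_token
    return hmac.new(key, msg, hashlib.sha256).digest()[:TOKEN_BYTES]


class AuthenticatedCallStack:
    def __init__(self, key):
        self.key = key
        self.top = bytes(TOKEN_BYTES)  # models a register the attacker cannot write
        self.stack = []                # models ordinary, writable stack memory

    def call(self, ret_addr):
        """Spill the return address and previous token, then chain a new token."""
        self.stack.append((ret_addr, self.top))
        self.top = auth_token(self.key, ret_addr, self.top)

    def ret(self):
        """Verify the spilled values against the protected token before returning."""
        ret_addr, prev = self.stack.pop()
        expected = auth_token(self.key, ret_addr, prev)
        if not hmac.compare_digest(expected, self.top):
            raise RuntimeError("return address verification failed")
        self.top = prev
        return ret_addr
```

Because each token depends on the previous one, the single protected value at the top vouches for the whole chain of return addresses, so overwriting any spilled entry is detected when that frame returns.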

United We Stand: A Threshold Signature Scheme for Identifying Outliers in PLCs

  • Urbi Chatterjee
  • Pranesh Santikellur
  • Rajat Sadhukhan
  • Vidya Govindan
  • Debdeep Mukhopadhyay
  • Rajat Subhra Chakraborty

This work proposes a scheme to detect, isolate and mitigate malicious disruption of electro-mechanical processes in legacy PLCs, where each PLC works as a finite state machine (FSM) and goes through predefined states depending on the control flow of the programs and the input-output mechanism. The scheme generates a group signature for a particular state by combining the signature shares from each of these PLCs using a (k, l)-threshold signature scheme. If some of them are affected by malicious code, the signature can be verified by k out of l uncorrupted PLCs and can be used to detect the corrupted PLCs and the compromised state. We use OpenPLC software to simulate a legacy PLC system on a Raspberry Pi and demonstrate an I/O pin configuration attack on digital and pulse width modulation (PWM) pins. We describe the protocol using a small prototype of five instances of legacy PLCs simultaneously running on OpenPLC software. We show that when our proposed protocol is deployed, the aforementioned attacks are successfully detected and the controller takes corrective measures. This work has been developed as a part of the problem statement given in the Cyber Security Awareness Week-2017 competition.

Improving Static Power Efficiency via Placement of Network Demultiplexer over Control Plane of Router in Multi-NoCs

  • Sonal Yadav
  • Vijay Laxmi
  • Manoj Singh Gaur
  • Hemangee K. Kapoor

The Network Demultiplexer (Net-Demux) is an essential hardware unit in multiple NoCs for traffic distribution between the NoC networks. This paper proposes placing the Net-Demux at the control plane of the router's switch allocator to improve static power and energy efficiency compared to the conventional data plane placement at the Network Interface (NI).

How Secure are Deep Learning Algorithms from Side-Channel based Reverse Engineering?

  • Manaar Alam
  • Debdeep Mukhopadhyay

Deep Learning has become a de facto paradigm for various prediction problems, including many privacy-preserving applications, where the privacy of data is a serious concern. There have been efforts to analyze and exploit information leakages from DNNs to compromise data privacy. In this paper, we provide an evaluation strategy for such information leakages through DNNs by considering a case study on a CNN classifier. The approach utilizes low-level hardware information provided by Hardware Performance Counters and hypothesis testing during the execution of a CNN to produce alarms if there exists any information leakage on the actual input.
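
A hypothesis test over performance-counter readings can be sketched as follows. The abstract does not say which test is used, so Welch's t-test is chosen here purely as an example; the counter values, the threshold, and the function names are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b))


def leaks(a, b, threshold):
    """Raise an alarm when the two distributions are distinguishable."""
    return abs(welch_t(a, b)) > threshold


# Hypothetical counter readings (e.g. cache misses) while the CNN
# classifies inputs from two different categories.
counts_class_0 = [100, 102, 98, 101, 99]
counts_class_1 = [120, 118, 122, 119, 121]

alarm = leaks(counts_class_0, counts_class_1, threshold=4.5)
```

If the counter distributions for different input categories can be told apart, an observer of the counters learns something about the input, which is the kind of leakage the alarm is meant to flag.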

Predicting DRC Violations Using Ensemble Random Forest Algorithm

  • Riadul Islam
  • Md Asif Shahjalal

At leading technology nodes, the industry is facing a stiff challenge to make profitable ICs. One of the primary issues is design rule checking (DRC) violations. In this research, we align with the DARPA IDEA program, which aims for “no-human-in-the-loop” operation and a 24-hour turnaround time to implement an IC from design specifications. In order to reduce human effort, we introduce the ensemble random forest algorithm to predict DRC violations before global routing, which is considered the most time-consuming step in an IC design flow. In addition, we identify features that critically impact DRC violations. The algorithm has a 5.8% better F1-score compared to existing SVM classifiers.

Analog Circuit Generator based on Deep Neural Network enhanced Combinatorial Optimization

  • Kourosh Hakhamaneshi
  • Nick Werblun
  • Pieter Abbeel
  • Vladimir Stojanović

A deep neural network (DNN) based stochastic combinatorial optimization framework is presented that can find the optimal sizing of circuits in a sample-efficient manner. This sample efficiency allows us to unify this framework with generator-based tools like Berkeley Analog Generator (BAG) [1] to directly optimize layout, given the high level circuit specifications. We use this tool to design an optical link receiver layout, satisfying high-level design specifications, using post-layout simulations of only 348 design instances. Compared to an evolutionary algorithm without our DNN-based discriminator, our framework improves the sample efficiency and run time by more than 200x.

Distributed Timing Analysis at Scale

  • Tsung-Wei Huang
  • Chun-Xun Lin
  • Martin D. F. Wong

As design complexities continue to grow, the need to efficiently analyze the timing of circuits with billions of transistors is quickly becoming the major bottleneck in the overall chip design flow. In this work we introduce a distributed timer that (1) has scalable performance, (2) can be seamlessly integrated into existing EDA applications, (3) enables transparent resource management, and (4) has robust fault-tolerant control. We evaluate the distributed timer using a set of large industry benchmarks on a cluster with 24 nodes. The results show that the proposed timer achieves full accuracy over all designs with high performance and good scalability.

Towards Practical Record and Replay for Mobile Applications

  • Onur Sahin
  • Assel Aliyeva
  • Hariharan Mathavan
  • Ayse Coskun
  • Manuel Egele

The ability to repeat the execution of a program is a fundamental requirement in evaluating computer systems and apps. Reproducing executions of mobile apps has proven difficult under real-life scenarios because of the different sources of external input and the interactive nature of the apps. We present a new practical record/replay framework for Android, RandR, which handles multiple sources of input and provides cross-device replay capabilities through a dynamic instrumentation approach. We demonstrate the feasibility of RandR by recording and replaying a set of real-world apps.

The Ping-Pong Tunable Delay Line In A Super-Resilient Delay-Locked Loop

  • Zheng-Hong Zhang
  • Wei Chu
  • Shi-Yu Huang

The Tunable Delay Line (TDL) is the most important building block in a modern cell-based timing circuit such as a Phase-Locked Loop (PLL) or Delay-Locked Loop (DLL). Previously proposed TDLs face one dilemma: they cannot be both power efficient and environmentally adaptive at the same time. In this paper, we present an effective solution to this dilemma, a novel “ping-pong delay line” architecture. The idea is to use two small cell-based delay lines operated in a synergistic manner, in the sense that they exchange the “role of command” dynamically as in a ping-pong game, thereby jointly reacting to severe environmental changes over a very wide range. The proposed ping-pong delay line has been incorporated in a Delay-Locked Loop (DLL) design to demonstrate its advantages by post-layout simulation.

An Efficient Learning-based Approach for Performance Exploration on Analog and RF Circuit Synthesis

  • Po-Cheng Pan
  • Chien-Chia Huang
  • Hung-Ming Chen

An efficient synthesis technique for modern analog circuits is important yet challenging because of the repeated re-synthesis process. Precisely exploring the performance limits of an analog circuit in the required technology is time-consuming. This work presents a learning-based framework for searching for the performance limits of analog circuits. With a hierarchical architecture, the dimension of the solution space can be reduced. Bayesian linear regression and support vector machine models are selected to speed up the algorithm and to obtain better performance quality. Experimental results on two analog circuits show that our approach can achieve up to 9x runtime speed-up without sacrificing performance quality.

LODESTAR: Creating Locally-Dense CNNs for Efficient Inference on Systolic Arrays

  • Bahar Asgari
  • Ramyad Hadidi
  • Hyesoon Kim
  • Sudhakar Yalamanchili

The performance of sparse problems suffers from a lack of spatial locality and low memory bandwidth utilization. However, the distribution of non-zero values in the data structures of a class of sparse problems, such as matrix operations in neural networks, is modifiable so that it can be matched to efficient underlying hardware, such as systolic arrays. Such modification helps address the challenges that come with sparsity. To efficiently execute sparse neural network inference on systolic arrays, we propose a structured pruning algorithm that increases the spatial locality in neural network models while maintaining the accuracy of inference.

Robustly Executing DNNs in IoT Systems Using Coded Distributed Computing

  • Ramyad Hadidi
  • Jiashen Cao
  • Michael S. Ryoo
  • Hyesoon Kim

Internet of Things (IoT) devices have access to an abundance of raw data for processing. With deep neural networks (DNNs), not only is the demand for computing power on IoT devices increasing, but privacy concerns are also motivating close-to-edge computation. Executing a DNN by distributing its computation is common in IoT systems. However, managing unstable network latencies and intermittent failures is a serious challenge. Our work provides robustness and close-to-zero recovery latency by adapting coded distributed computing (CDC). We analyze robust execution on a mesh of Raspberry Pis by studying four DNNs.
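
To illustrate the general idea behind coded distributed computing, the following is a minimal sketch and not the authors' scheme: a matrix-vector product is split across two workers, and a hypothetical third "parity" worker computes the product with the sum of the two weight blocks, so that the output of any one failed worker can be recovered by subtraction. The function names and the sum-based code are assumptions made for illustration.

```python
def matvec(rows, x):
    """Multiply a matrix (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def encode_parity(w1, w2):
    """Build the parity block as the element-wise sum of two weight blocks."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(w1, w2)]


def recover(parity_out, surviving_out):
    """Recover the lost worker's output: (W1 + W2)x - W1x = W2x."""
    return [p - s for p, s in zip(parity_out, surviving_out)]


# Two weight blocks, each assigned to one worker (illustrative values).
w1 = [[1, 2], [3, 4]]
w2 = [[5, 6], [7, 8]]
x = [2, 1]

y1 = matvec(w1, x)                           # worker 1 result
y2 = matvec(w2, x)                           # worker 2 result (assume it fails)
y_parity = matvec(encode_parity(w1, w2), x)  # parity worker result

y2_recovered = recover(y_parity, y1)         # rebuilt without re-running worker 2
```

Because the recovery is a single subtraction on results that are already available, no recomputation on the failed device is needed in this toy setting.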

Visual Cortex Inspired Pixel-Level Re-configurable Processors for Smart Image Sensors

  • Pankaj Bhowmik
  • Md Jubaer Hossain Pantho
  • Christophe Bobda

This paper presents a reconfigurable hardware architecture for smart image sensors that speeds up low-level image processing applications at the pixel level. For each pixel in the sensor plane, the design includes an activation module and a processor. The processor has a basic structure that is common to all applications, plus reconfigurable segments for specific applications. Visual-cortex-inspired computing, such as predictive coding in time, is implemented in the activation module to remove temporal redundancy. The ASIC implementation shows that, through accurate prediction, the design saves up to 84.01% dynamic power and achieves a 9x speedup at 800 MHz.

Efficient Circuits for Quantum Search over 2D Square Lattice Architecture

  • Shaohan Hu
  • Dmitri Maslov
  • Marco Pistoia
  • Jay Gambetta

Quantum computing has increasingly drawn interest and investments from the academic, industrial, and governmental research communities worldwide. Among quantum algorithms, Quantum Search is important for its quadratic speedup over its classical-computing counterpart. A key ingredient in its implementation is the Multi-Control Toffoli (MCT) gate, which creates a Boolean product of control variables and XORs it into the target. On an idealized quantum computer, all-to-all connectivity would eliminate the need to use SWAP gates to communicate information. This is, however, not affordable in the current Noisy Intermediate-Scale Quantum (NISQ) computing era. In this work, we discuss how to efficiently implement MCT gates on 2D Square Lattices (2DSL), suitable for superconducting circuits, by taking advantage of relative-phase Toffoli gates and H-tree layouts to drastically reduce resulting circuits’ depths and the amount of SWAPping required.
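
As a small illustration of the sentence describing the MCT gate, the sketch below models only its action on classical (computational-basis) bit strings: the target bit is XORed with the AND of all control bits. It does not model superposition, relative phases, SWAP insertion, or the 2DSL layouts discussed in the paper, and the function name `mct` is an assumption made for illustration.

```python
def mct(bits, controls, target):
    """Apply a Multi-Control Toffoli to a classical bit tuple.

    The target bit is flipped exactly when every control bit is 1,
    i.e. target ^= AND(controls).
    """
    out = list(bits)
    product = 1
    for c in controls:
        product &= out[c]      # Boolean product of the control variables
    out[target] ^= product     # XOR the product into the target
    return tuple(out)


# Three controls (indices 0..2) and one target (index 3).
flipped = mct((1, 1, 1, 0), controls=[0, 1, 2], target=3)
unchanged = mct((1, 0, 1, 0), controls=[0, 1, 2], target=3)
```

Applying the gate twice restores the input, reflecting the fact that the MCT gate is its own inverse.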

SEDA – Single Exact Dual Approximate Adders for Approximate Processors

  • Chandan Kumar Jha
  • Joycee Mekie

Approximate computing has gained a lot of popularity due to its energy benefits in a variety of error-tolerant applications. In this paper we propose an adder that can perform either an n-bit single exact addition or a dual approximate addition (SEDA) and is suitable for processors. The conversion from exact to approximate addition can be done dynamically at runtime. The maximum error is bounded for SEDA adders because the carry is not approximated. Our proposed design consumes 48% less energy, has 32% less delay, and occupies 24% less area compared to an exact mirror adder.

Merging Everything (ME): A Unified FPGA Architecture Based on Logic-in-Memory Techniques

  • Xiaoming Chen
  • Longxiang Yin
  • Bosheng Liu
  • Yinhe Han

New Computational Results and Hardware Prototypes for Oscillator-based Ising Machines

  • Tianshi Wang
  • Leon Wu
  • Jaijeet Roychowdhury

In this paper, we report new results on a novel Ising machine technology for solving combinatorial optimization problems using networks of coupled self-sustaining oscillators. Specifically, we present several working hardware prototypes using CMOS electronic oscillators, built on bread-boards/perfboards and PCBs, implementing Ising machines consisting of up to 240 spins with programmable couplings. We also report that, just by simulating the differential equations of such Ising machines of larger sizes, good solutions can be achieved easily on benchmark optimization problems, demonstrating the effectiveness of oscillator-based Ising machines.
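
For readers unfamiliar with the problem such machines target, here is a minimal sketch of the Ising energy function, using the common convention H = -Σ J_ij s_i s_j over coupled pairs with spins s_i in {-1, +1}. The exhaustive search below is only a reference for tiny instances; it is not the oscillator-based method reported in the paper, and the names and coupling values are assumptions made for illustration.

```python
from itertools import product


def ising_energy(spins, couplings):
    """Energy H = -sum(J_ij * s_i * s_j) over all coupled pairs (i, j)."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())


def brute_force_ground_state(n, couplings):
    """Return one minimum-energy spin assignment by trying all 2**n states."""
    return min(product((-1, 1), repeat=n),
               key=lambda s: ising_energy(s, couplings))


# A 3-spin chain with antiferromagnetic couplings (J < 0 favors opposite spins).
couplings = {(0, 1): -1.0, (1, 2): -1.0}
best = brute_force_ground_state(3, couplings)
best_energy = ising_energy(best, couplings)
```

The search cost doubles with every added spin, which is why hardware or heuristic solvers are of interest for instances of realistic size.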

Internal Structure Aware RDF Data Management in SSDs

  • Renhai Chen
  • Qiming Guan
  • Guohua Yan
  • Zhiyong Feng

In this paper, we lead the first efforts towards intelligent RDF data management in SSDs. We propose to deeply fuse RDF data into SSDs. Specifically, operations applied to RDF (e.g., data queries) can be carried out directly in SSDs. To this end, we explore two RDF data organizations (e.g., triple-based) that take the internal structure of SSDs into account. The experiment is conducted on the Patient Disease Drug (PDD) Graph dataset [11]. The experimental results show that the two proposed strategies achieve comprehensive, scalable in-SSD computation from different aspects (e.g., space efficiency or query efficiency).

TODAES

ACM Transactions on Design Automation of Electronic Systems

The ACM Transactions on Design Automation of Electronic Systems (TODAES) is the premier journal that publishes recent significant results of research and development efforts in the area of design automation of electronic systems. The TODAES editorial board invites submission of technical papers describing recent results of research and development efforts in the area of design automation of electronic systems. The journal intends to provide a comprehensive coverage of innovative works concerning the specification, design, analysis, simulation, testing, and evaluation of very large scale integrated electronic systems, emphasizing a computer science/engineering orientation.
 
Further information on submission, the editorial board, subscription, and other details can be found at the journal’s home page:
http://www.acm.org/todaes/
ACM TODAES Editor-In-Chief
X. Sharon Hu
shu@nd.edu