GLSVLSI 2019 TOC

SESSION: Keynote & Invited Talks

Thoughts on Edge Intelligence

  • Marilyn Wolf

Machine learning methods have exploded in the past half-dozen years. Machine learning
is being applied to a huge range of problems across the spectrum of applications.
Initial results relied on server-oriented computations. But many applications will …

Automatic Implementation of Secure Silicon

  • Serge Leef

Throughout the past decade, cybersecurity threats have evolved from attacks focused
high in the software stack to progressively lower levels of the computational hierarchy.
With the explosion of popularity and growing deployment of internet connected …

Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation

  • Onur Mutlu

Today’s systems are overwhelmingly designed to move data to computation. This design
choice goes directly against at least three key trends in systems that cause performance,
scalability and energy bottlenecks: 1) data access from memory is already a …

Innovations in IoT for a Safe, Secure, and Sustainable Future

  • Swarup Bhunia

Internet of things (IoT) promises to usher in the fourth industrial revolution through
an exponential growth of smart connected devices deployed in myriad application domains.
It gives rise to new relationships between man and smart connected machines …

SESSION: Tech Session 1: Design and Integration of Hardware Security Primitives

LPN-based Device Authentication Using Resistive Memory

  • Md Tanvir Arafin

Recent progress in the design and implementation of resistive memory components such
as RRAMs and PCMs has introduced opportunities for developing novel hardware security
solutions using unique physical properties of these devices. In this work, we …

Leveraging On-Chip Voltage Regulators Against Fault Injection Attacks

  • Ali Vosoughi

The security implications of utilizing an on-chip voltage regulator as a countermeasure
against fault injection attacks are investigated in this paper. The effect of the
size of the capacitors and number of phases of the voltage regulator on the …

On the Theoretical Analysis of Memristor based True Random Number Generator

  • Mesbah Uddin

Emerging nano-devices like memristors display stochastic switching behavior which
poses a big uncertainty in their implementation as the next-generation CMOS alternative.
However, this stochasticity provides an opportunity to design circuits for …

Control-Lock: Securing Processor Cores Against Software-Controlled Hardware Trojans

  • Dominik Šišejković

Malicious circuit modifications known as hardware Trojans represent a rising threat
to the integrated circuit supply chain. As many Trojans are activated based on a specific
sequence of circuit states, we have recognized the ease of utilizing an …

Lightweight Authenticated Encryption for Network-on-Chip Communications

  • Julian Harttung

In recent years, Network-on-Chip (NoC) has gained increasing popularity as a promising
solution for the challenging interconnection problem in multi-processor systems-on-chip
(MPSoCs). However, the interest of adversaries to compromise such systems grew …

SESSION: Tech Session 2: VLSI Circuits and Power Aware Design

Design of a Low-power and Small-area Approximate Multiplier using First the Approximate
and then the Accurate Compression Method

  • Tongxin Yang

Recently emerging applications, such as convolutional neural networks (CNNs), which
process thousands of convolutional computations, require a large amount of power.
Multiplication is the key arithmetic in these applications and an approximate multiplier …

GraphiDe: A Graph Processing Accelerator leveraging In-DRAM-Computing

  • Shaahin Angizi

In this paper, we propose GraphiDe, a novel DRAM-based processing-in-memory (PIM)
accelerator for graph processing. It transforms the current DRAM architecture into massively
parallel computational units exploiting the high internal bandwidth of the modern …

An Efficient Time-based Stochastic Computing Circuitry Employing Neuron-MOS

  • Tati Erlina

A compact, low-energy time-based stochastic computing (TBSC) circuit has been
designed. In TBSC theory, stochastic numbers (SNs) are represented by the duty cycle
of periodic signals. Additionally, multiplication and addition operations of the …

Monolithic 8×8 SiPM with 4-bit Current-Mode Flash ADC with Tunable Dynamic Range

  • Vikas Vinayaka

A monolithic photon-counting receiver consisting of an integrated silicon-photomultiplier
and a current-mode analog-to-digital converter (ADC) was designed, simulated and fabricated
in the AMS 0.35 μm SiGe BiCMOS process. The silicon photomultiplier (…

SESSION: Tech Session 3: VLSI for Machine Learning and Artificial Intelligence

A Systolic SNN Inference Accelerator and its Co-optimized Software Framework

  • Shasha Guo

Although Deep Neural Network (DNN) architectures have made some breakthroughs in computer
vision tasks, they are not close to biological brain neurons. Spiking Neural Networks
(SNNs) are highly expected to bridge the gap between artificial computing …

Dynamic Beam Width Tuning for Energy-Efficient Recurrent Neural Networks

  • Daniele Jahier Pagliari

Recurrent Neural Networks (RNNs) are state-of-the-art models for many machine learning
tasks, such as language modeling and machine translation. Executing the inference
phase of an RNN directly in edge nodes, rather than in the cloud, would provide …

Efficient Softmax Hardware Architecture for Deep Neural Networks

  • Gaoming Du

Deep neural network (DNN) has become a pivotal machine learning and object recognition
technology in the big data era. The softmax layer is one of the key component layers
for completing multi-classification tasks. However, the softmax layer contains …

HSIM-DNN: Hardware Simulator for Computation-, Storage- and Power-Efficient Deep Neural Networks

  • Mengshu Sun

Deep learning that utilizes large-scale deep neural networks (DNNs) is effective in
automatic high-level feature extraction but also computation and memory intensive.
Constructing DNNs using block-circulant matrices can simultaneously achieve hardware …

SESSION: Tech Session 4: Next Generation Interconnect: Architecture to Physical Design

An Area-Efficient Iterative Single-Precision Floating-Point Multiplier Architecture
for FPGA

  • Sunwoong Kim

Approximate multipliers have been widely used in critical applications, such as machine
learning and multimedia, which are tolerant to approximation errors. This paper proposes
a novel single-precision floating-point (SPFP) multiplication algorithm and …

An Automatic Transistor-Level Tool for GRM FPGA Interconnect Circuits Optimization

  • Zhengjie Li

Due to its dominance in FPGA area and delay, the interconnect circuit is traditionally
designed and optimized in a fully customized fashion, which can be extremely time-consuming.
In this paper, we propose an automated transistor-level sizing optimization …

Low Voltage Clock Tree Synthesis with Local Gate Clusters

  • Can Sitik

In this paper, a novel local clock gate cluster-aware low voltage clock tree synthesis
methodology is introduced. In low voltage/swing clocking, timing closure is a challenging
problem due to tight skew and slew constraints. The clock gating makes this …

SESSION: Tech Session 5: Designing Robust VLSI Circuits: From Approximate Computing to
Hardware Security

TOIC: Timing Obfuscated Integrated Circuits

  • Mahabubul Alam

To counter the threats of reverse engineering (RE) and Trojan insertion, researchers
have considered gate-level obfuscation in integrated circuits (ICs) as a viable solution.
However, several techniques are present in the literature to crack the …

Design for Eliminating Operation Specific Power Signatures from Digital Logic

  • Md Badruddoja Majumder

Conventional digital logic operations have distinguishable power signatures. Side-channel
power analysis combined with a classification algorithm can reveal unknown logic
operations. Revealing the underlying operations is the main task in reverse …

Non-Uniform Temperature Distribution in Interconnects and Its Impact on Electromigration

  • Ali Abbasinasab

We investigate the effect of electrically induced thermal load on interconnect reliability
and aging. We propose new models for uniform and non-uniform temperature evolution
and its steady state distribution in interconnects considering Joule heating …

Fault Classification and Coverage of Analog Circuits using DC Operating Point and
Frequency Response Analysis

  • Sayandeep Sanyal

Detection of faults in a mixed-signal SOC at the pre-silicon stage is a challenge,
especially when it has substantial analog components. Given the time taken for simulating
analog circuits, designing tests to detect faults in them is not a …

Crash Skipping: A Minimal-Cost Framework for Efficient Error Recovery in Approximate Computing Environments

  • Yan Verdeja Herms

We present a lightweight technique to minimize error recovery costs in approximate
computing environments. We take advantage of the key observation that if an application
crashes in a “non-critical” region of its execution, then skipping the crash and …

SESSION: Tech Session 6: Emerging Computing & Post-CMOS Technologies

Voltage-Controlled Magnetoelectric Memory Bit-cell Design With Assisted Body-bias
in FD-SOI

  • Hao Cai

Voltage-controlled magnetic anisotropy (VCMA)-magnetic tunnel junction (MTJ) is incorporated
into FD-SOI CMOS technology. The design space of 1 transistor-1 MTJ (1T-1M) bit-cell
is explored through varied VCMA pulse duration/amplitude and scaling down …

Low Cost Hybrid Spin-CMOS Compressor for Stochastic Neural Networks

  • Bingzhe Li

With the expansion of neural network (NN) applications, lowering their hardware implementation
cost becomes an urgent task, especially in back-end applications where the power supply
is limited. Stochastic computing (SC) is a promising solution to realize low-…

Functionally Complete Boolean Logic and Adder Design Based on 2T2R RRAMs for Post-CMOS
In-Memory Computing

  • Zongxian Yang

The in-memory computing (IMC) paradigm has attracted extensive attention for future electronics
to overcome the bottleneck and memory-wall problem of von Neumann systems. Nonvolatile
logic based on resistive random-access memory (RRAM) is a promising …

Jump Search: A Fast Technique for the Synthesis of Approximate Circuits

  • Linus Witschen

State-of-the-art frameworks for generating approximate circuits automatically explore
the search space in an iterative process – often greedily. Synthesis and verification
processes are invoked in each iteration to evaluate the found solutions and to …

SESSION: Tech Session 7: Physical Design and Obfuscation

SAT-Based Placement Adjustment of FinFETs inside Unroutable Standard Cells Targeting
Feasible DRC-Clean Routing

  • Anton Sorokin

In this paper, we present a transistor placement algorithm that takes unroutable
standard cells and makes them routable by moving transistors in local windows. It
converts the task of placing gridded FinFETs into a Boolean problem and employs …

A Scalable and Process Variation Aware NVM-FPGA Placement Algorithm

  • Chengmo Yang

As non-volatile memory (NVM) based FPGAs gain increasing popularity, FPGA synthesis
tools start to tune the synthesis flow to match NVM characteristics. State-of-the-art
NVM FPGA placement algorithms tried to reduce the high reconfiguration cost induced …

Functional Obfuscation of Hardware Accelerators through Selective Partial Design Extraction
onto an Embedded FPGA

  • Bo Hu

The protection of Intellectual Property (IP) has emerged as one of the most serious
areas of concern in the semiconductor industry. To address this issue, we present
a method and architecture to map selective portions of a design, given as a behavioral …

HydraRoute: A Novel Approach to Circuit Routing

  • Mohammad Khasawneh

Routing for dense circuits is a major challenge for VLSI physical design. Most routing
approaches rely at least partially on a “rip-up and reroute” scheme, where solution
quality and run times can be impacted profoundly by the order in which nets are …

SESSION: Tech Session 8: Quantum Circuits and Emerging Technologies

Balanced Factorization and Rewriting Algorithms for Synthesizing Single Flux Quantum
Logic Circuits

  • Ghasem Pasandi

Single Flux Quantum (SFQ) logic, with a switching energy of 100 zJ and a switching delay
of 1 ps, is a promising post-CMOS candidate. Logic synthesis of these magnetic-pulse-based
circuits is a very important step in their design flow, with a big impact on the …

A Majority Logic Synthesis Framework for Adiabatic Quantum-Flux-Parametron Superconducting
Circuits

  • Ruizhe Cai

Adiabatic Quantum-Flux-Parametron (AQFP) logic is an adiabatic superconductor logic
that has been proposed as an alternative to CMOS logic with extremely high energy efficiency.
In AQFP technology, majority-based gates have the same area as two-input AND/…

A Processing-In-Memory Implementation of SHA-3 Using a Voltage-Gated Spin Hall-Effect
Driven MTJ-based Crossbar

  • Chengmo Yang

Processing-In-Memory (PIM), which implements logic operations within memory cells,
opens up a new direction on organizing data and computation. Leveraging resistive
or magnetic characteristics of nonvolatile memory (NVM) devices, platforms such as
PLiM …

Exploring Processing In-Memory for Different Technologies

  • Saransh Gupta

The recent emergence of IoT has led to a substantial increase in the amount of data
processed. Today, a large number of applications are data intensive, involving massive
data transfers between processing core and memory. These transfers act as a …

SESSION: Tech Session 9: Towards Fast, Efficient, and Robust Memory

BLADE: A BitLine Accelerator for Devices on the Edge

  • William Andrew Simon

The increasing ubiquity of edge devices in the consumer market, along with their ever
more computationally expensive workloads, necessitate corresponding increases in computing
power to support such workloads. In-memory computing is attractive in edge …

Enhancing the Lifetime of Non-Volatile Caches by Exploiting Module-Wise Write Restriction

  • Sukarn Agarwal

The emerging Non-Volatile Memory (NVM) technologies offer a good combination of high
density and near-zero leakage power, making them the strongest candidates in the memory
hierarchy, including caches. However, the weak write endurance of these memories …

Mitigating the Performance and Quality of Parallelized Compressive Sensing Reconstruction
Using Image Stitching

  • Mahmoud Namazi

Orthogonal Matching Pursuit is an iterative greedy algorithm used to find a sparse
approximation for high-dimensional signals. The algorithm is most popularly used in
Compressive Sensing, which allows for the reconstruction of sparse signals at rates …

Towards Optimizing Refresh Energy in embedded-DRAM Caches using Private Blocks

  • Sheel Sindhu Manohar

In recent years, the increased working set size of applications has driven the demand
for larger Last-Level Caches (LLCs). To fulfill this demand, embedded DRAM
(eDRAM) caches have been considered one of the best alternatives over …

SESSION: Tech Session 10: MSE

Extending Student Labs with SMT Circuit Implementation

  • Erik Brunvand

Computer Science and Computer Engineering classes related to digital circuits, embedded
systems, Human Computer Interaction (HCI), and a wide variety of “maker” subjects,
would often like to include physical computing projects. Extending these physical …

Teaching the Next Generation of Cryptographic Hardware Design to the Next Generation
of Engineers

  • Aydin Aysu

Evolving threats against cryptographic systems and the increasing diversity of computing
platforms enforce teaching cryptographic engineering to a wider audience. This paper
describes the development of a new graduate course on hardware security taught …

A Web-based Remote FPGA Laboratory for Computer Organization Course

  • Han Wan

Learning in digital systems could be enhanced by applying a learn-by-doing mechanism.
In this paper, the implementation of a web-based remote FPGA laboratory for a Computer
Organization course is proposed. The projects created for this course are designed …

System-on-a-Chip Design as a Platform for Teaching Design and Design Flow Integration

  • Jacob Covey

The design of microelectronic systems requires integration and cooperation across
multiple disciplines, but most curricula are taught in unconnected pieces. This makes
the creation of manageable projects that reflect the design experience very …

SESSION: Poster Sessions I, II

UPIM: Unipolar Switching Logic for High Density Processing-in-Memory Applications

  • Joonseop Sim

The Internet of Things (IoT) has built a network of billions of connected devices which
generate massive volumes of data. Processing this data on existing systems incurs
significant costs for data movement between processors and memory due to limited …

Fence-Region-Aware Mixed-Height Standard Cell Legalization

  • SangGi Do

We propose a fence-region-aware mixed-height standard cell legalization that can optimize
the placement of standard cells taller than two rows in fence regions of various
shapes. The algorithm consists of pre-legalization and mixed-…

A Case for Heterogeneous Network-on-Chip Based H.264 Video Decoders

  • Milad Ghorbani Moghaddam

The design of a heterogeneous network-on-chip (NoC) based H.264 video decoder is proposed.
A thorough investigation using a system simulator developed as the combination of
a cycle accurate NoC simulator together with complete implementations of all the …

A 16b Clockless Digital-to-Analog Converter with Ultra-Low-Cost Poly Resistors Supporting
Wide-Temperature Range from -40°C to 85°C

  • Xuedi Wang

A high-precision digital-to-analog converter (DAC) is a critical component in process
control, data acquisition, and testing instruments. In order to achieve high resolution
and a wide temperature range, conventional designs have been adopting high-cost …

A Skyrmion Racetrack Memory based Computing In-memory Architecture for Binary Neural
Convolutional Network

  • Yu Pan

A Skyrmion Racetrack Memory (SRM) based Computing In-Memory Architecture (SRM-CIM)
is proposed in this paper. Both data storage and computing operations can be achieved in
SRM-CIM, which is used to support convolutional computing in Binary Convolutional …

TASecure: Temperature-Aware Secure Deletion Scheme for Solid State Drives

  • Bingzhe Li

With increasing concerns about security, secure deletion for SSDs becomes very
costly due to their out-of-place updates (i.e., an update is performed in a new location,
leaving the old data untouched). Some previous studies used a combined erase-based …

An Asymmetric Dual Output On-Chip DC-DC Converter for Dynamic Workloads

  • Xingye Liu

We propose a novel two-stage hybrid on-chip DC-DC converter targeting low power applications
with multiple supply voltage domains and dynamic workloads. The converter has a nominal
input voltage of 1.2V and generates two asymmetrically regulated output …

CNNWire: Boosting Convolutional Neural Network with Winograd on ReRAM based Accelerators

  • Jilan Lin

Resistive random access memory (ReRAM) demonstrates the great potential of in-memory
processing for neural network (NN) acceleration. However, since the convolutional
neural network (CNN) is widely known as compute-bound, current ReRAM-based …

Feed-Forward XOR PUFs: Reliability and Attack-Resistance Analysis

  • S. V. Sandeep Avvaru

Physical unclonable functions (PUFs) can be used to generate unique signatures of
integrated circuit (IC) chips. XOR arbiter PUFs (XOR PUFs), that typically contain
multiple standard arbiter PUFs as their components, are more secure than standard …

Exploring Design Trade-offs in Fault-Tolerant Behavioral Hardware Accelerators

  • Zhiqi Zhu

High-Level Synthesis (HLS) allows the automatic generation of hardware accelerators
with unique design metrics. This work leverages this unique feature and presents a
method to increase the search space of fault-tolerant hardware accelerators. The …

Automatic Extraction of Requirements from State-based Hardware Designs for Runtime
Verification

  • Minjun Seo

Runtime monitoring and verification enables a system to monitor itself and ensure
system requirements are met even in the presence of dynamic environments. For hardware,
state-based models are widely used, but verifying the correctness between the state-…

MirrorCache: An Energy-Efficient Relaxed Retention L1 STTRAM Cache

  • Kyle Kuan

Spin-Transfer Torque RAM (STTRAM) is a promising alternative to SRAMs in on-chip caches,
due to several advantages, including non-volatility, low leakage, high integration
density, and CMOS compatibility. However, STTRAMs’ wide adoption in resource-…

Design and Evaluation of DNU-Tolerant Registers for Resilient Architectural State
Storage

  • Faris S. Alghareb

In this work, we aim to maintain the correct execution of instructions in the pipeline
stages. To achieve that, the integrity for the data computed in registers during execution
should be maintained via protecting the susceptible registers. Thus, we …

Automated Analysis of Virtual Prototypes at Electronic System Level

  • Mehran Goli

The exponential increase in functionality of System-on-Chips (SoCs) and reduced Time-to-Market
(TTM) requirements have significantly altered the typical design and verification
flow. Virtual Prototyping (VP) at the Electronic System Level (ESL) using …

Dynamic Physically Unclonable Functions

  • Wenjie Xiong

Physical variations in the manufacturing processes of electronic devices have been
widely leveraged to design Physically Unclonable Functions (PUFs), which can be used
for authentication and key storage. Existing PUFs are static, as their PUF responses …

RDTA: An Efficient Routability-Driven Track Assignment Algorithm

  • Genggeng Liu

This paper presents a routability-driven track assignment algorithm (RDTA) to efficiently
estimate routability. Routability has become a very challenging issue in modern IC
design and it can be effectively estimated by routing congestion. Track …

EraseMe: A Defense Mechanism against Information Leakage exploiting GPU Memory

  • Hongyu Fang

Graphics Processing Units (GPU) play a major role in speeding up computational tasks
of the users, especially in applications such as high volume text and image processing.
Recent works have demonstrated the security problems associated with GPU that do …

A Statistical Current and Delay Model Based on Log-Skew-Normal Distribution for Low
Voltage Region

  • Peng Cao

The increasing performance variation and non-Gaussian distributions pose remarkable
challenges to timing analysis for circuits operating in the low-voltage region. Accurate
modeling of the statistical characteristics is urgently required with process …

Enabling Approximate Storage through Lossy Media Data Compression

  • Brian Worek

As compute capabilities continue to scale, memory capacity and bandwidth continue
to lag behind. Data compression is an effective approach to improving memory capacity
and bandwidth, but prior works have focused primarily on lossless compression and …

Thermal Fingerprinting of FPGA Designs through High-Level Synthesis

  • Jianqi Chen

This work investigates if temperature can be used to fingerprint FPGA designs and
presents a method to generate a large number of functionally equivalent FPGA designs
such that each design has a unique distinguishable thermal signature. The main …

Deep RNN-Oriented Paradigm Shift through BOCANet: Broken Obfuscated Circuit Attack

  • Fatemeh Tehranipoor

Logic encryption obfuscation has been used for thwarting counterfeiting, overproduction,
and reverse engineering, but it is vulnerable to attacks: it was recently shown
that satisfiability checking (SAT) can potentially compromise hardware …

STAT: Mean and Variance Characterization for Robust Inference of DNNs on Memristor-based
Platforms

  • Baogang Zhang

An emerging solution to accelerate the inference phase of deep neural networks (DNNs)
is to utilize memristor crossbar arrays (MCAs) to perform highly efficient matrix-vector
multiplication in the analog domain. An adverse challenge is that memristor …

LSM: Novel Low-Complexity Unified Systolic Multiplier over Binary Extension Field

  • Jiafeng Xie

Unified (hybrid field-size) systolic multipliers over GF(2^m) (binary extension field)
have attracted significant attention from research communities recently as they can
be used in reconfigurable cryptographic processors. In this paper, we present a novel …

Binarized Depthwise Separable Neural Network for Object Tracking in FPGA

  • Li Yang

Object tracking has achieved great advances in the past few years and has been widely
applied in vision-based applications. Nowadays, deep convolutional neural networks have
taken an important role in object tracking tasks. However, their enormous model size …

An Analytical-based Hybrid Algorithm for FPGA Placement

  • Chengyu Hu

As the capacity of FPGAs increases, FPGA placers that adopt the Simulated Annealing (SA)
algorithm take more and more runtime. To solve this problem, this paper presents HCAS,
a hybrid algorithm combining the analytical method and SA. There are three …

Approximate Memory with Approximate DCT

  • Shenghou Ma

Approximate Computing is an emerging computing paradigm where one exploits inherent
error resilience of certain applications (e.g., digital signal processing, multimedia
and artificial intelligence) and trades off absolute computation precisions for …

AQuRate: MRAM-based Stochastic Oscillator for Adaptive Quantization Rate Sampling of Sparse
Signals

  • Soheil Salehi

Recently, the promising aspects of compressive sensing have inspired new circuit-level
approaches for their efficient realization within the literature. However, most of
these recent advances involving novel sampling techniques have been proposed …

Clockless Spin-based Look-Up Tables with Wide Read Margin

  • Soheil Salehi

In this paper, we develop a 6-input fracturable non-volatile Clockless LUT (C-LUT)
using spin Hall effect (SHE)-based Magnetic Tunnel Junctions (MTJs) and provide a
detailed comparison between the SHE-MTJ-based C-LUT and Spin Transfer Torque (STT)-MTJ-…

A Hybrid Framework for Functional Verification using Reinforcement Learning and Deep
Learning

  • Karunveer Singh

In this paper, we propose a novel hybrid verification framework (HVF) which uses Reinforcement
Learning (RL) and Deep Neural Networks (DNNs) to accelerate the verification of complex
systems. More precisely, our HVF incorporates RL to generate all …

SESSION: Special Session 1: In-Memory Processing for Future Electronics

Digital and Analog-Mixed-Signal In-Memory Processing in CMOS SRAM

  • Akhilesh Jaiswal

Ferroelectric FET Based In-Memory Computing for Few-Shot Learning

  • Ann Franchesca Laguna

As CMOS technology advances, the performance gap between the CPU and main memory has
not improved. Furthermore, the hardware deployed for Internet of Things (IoT) applications
need to process ever growing volumes of data, which can further exacerbate …

True In-memory Computing with the CRAM: From Technology to Applications

  • Masoud Zabihi

An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive
Applications

  • Bing Li

The conventional von Neumann architecture has been revealed as a major performance
and energy bottleneck for rising data-intensive applications. The decade-old idea
of leveraging in-memory processing to eliminate substantial data movements has returned …

SESSION: Special Session 2: Approximate Computing Systems Design: Energy Efficiency and Security
Implications

Security Threats in Approximate Computing Systems

  • Pruthvy Yellu

Approximate computing systems improve energy efficiency and computation speed at the
cost of reduced accuracy on system outputs. Existing efforts mainly explore the feasible
approximation mechanisms and their implementation methods. There is limited …

Characterizing Approximate Adders and Multipliers Optimized under Different Design
Constraints

  • Honglan Jiang

Taking advantage of the error resilience in many applications as well as the perceptual
limitations of humans, numerous approximate arithmetic circuits have been proposed
that trade off accuracy for higher speed or lower power in emerging applications …

Approximate Communication Strategies for Energy-Efficient and High Performance NoC: Opportunities and Challenges

  • Md Farhadur Reza

With the advancement and miniaturization of transistor technology, hundreds of cores
can be integrated on a single chip. Network-on-Chips (NoCs) are the de facto on-chip
communication fabrics for multi/many core systems because of their benefits over …

Information Hiding behind Approximate Computation

  • Ye Wang

There are many interesting advances in approximate computing recently targeting the
energy efficiency in system design and execution. The basic idea is to trade computation
accuracy for power and energy during all phases of the computation, from data to …

MLPrivacyGuard: Defeating Confidence Information based Model Inversion Attacks on Machine Learning
Systems

  • Tiago A. O. Alves

As services based on Machine Learning (ML) applications find increasing use, there
is a growing risk of attack against such systems. Recently, adversarial machine learning
has received a lot of attention, where an adversary is able to craft an input or …

SESSION: Special Session 3: Recent Advances in Near and In-Memory Computing Circuits

XNOR-SRAM: In-Bitcell Computing SRAM Macro based on Resistive Computing Mechanism

  • Zhewei Jiang

We present an in-memory computing SRAM macro for binary neural networks. The memory
macro computes XNOR-and-accumulate for binary/ternary deep convolutional neural networks
on the bitline without row-by-row data access. It achieves 33X better energy and …

Efficient Process-in-Memory Architecture Design for Unsupervised GAN-based Deep Learning
using ReRAM

  • Fan Chen

The ending of Moore’s Law makes domain-specific architectures the future of computing.
The most representative development is the emergence of various deep learning accelerators.
Among the proposed solutions, resistive random access memory (ReRAM) based process-…

DigitalPIM: Digital-based Processing In-Memory for Big Data Acceleration

  • Mohsen Imani

In this work, we design DigitalPIM, a digital-based Processing In-Memory platform
capable of accelerating fundamental big data algorithms in real time with orders-of-magnitude
more energy-efficient operation. Unlike the existing near-data processing …

In-memory Processing based on Time-domain Circuit

  • Yuyao Kong

Deep Neural Networks (DNNs) have emerged as a dominant algorithm for machine learning
(ML). High performance and extreme energy efficiency are critical for deployments
of DNNs, especially in mobile platforms such as autonomous vehicles, cameras, and other …

SESSION: Special Session 4: Opportunities and Challenges for Emerging Monolithic 3D Integrated
Circuits

An Overview of Thermal Challenges and Opportunities for Monolithic 3D ICs

  • Prachi Shukla

Monolithic 3D (Mono3D) is a three-dimensional integration technology that can overcome
some of the fundamental limitations faced by traditional, two-dimensional scaling.
This paper analyzes the unique thermal characteristics of Mono3D ICs by simulating …

Logic Monolithic 3D ICs: PPA Benefits and EDA Tools Necessary

  • Sai Surya Kiran Pentapati

Monolithic 3D (M3D) ICs provide a way to achieve high performance and low power designs
within the same technology node, thereby bypassing the need for transistor scaling.
M3D ICs have multiple 2D tiers sequentially fabricated on top of each other and …

Investigation and Trade-offs in 3DIC Partitioning Methodologies

  • Nikolaos Sketopoulos

In this work, we compare alternative 3DIC partitioning methodologies in terms of
slack, number of inter-tier vias, Tier Area Ratio (TAR) and HPWL design parameters.
The popular 3DIC post-placement, bin-based Fiduccia-Mattheyses (FM) partitioning flow
is …

Test and Design-for-Testability Solutions for Monolithic 3D Integrated Circuits

  • Abhishek Koneru

M3D integration can result in reduced area and higher performance when compared to
3D die stacking. Due to the benefits of M3D integration, there is growing interest
in industry towards the adoption of this technology. However, test challenges for
M3D …

N3XT Monolithic 3D Energy-Efficient Computing Systems

  • Mohamed M. Sabry Aly

The world’s appetite for analyzing massive amounts of structured and unstructured
data has grown dramatically. The computational demands of these abundant-data applications
far exceed the capabilities of today’s computing systems. The N3XT (Nano-…

SESSION: Special Session 5: Robust IC Authentication and Protected Intellectual Property: A
Special Session on Hardware Security

How to Generate Robust Keys from Noisy DRAMs?

  • Nima Karimian

Security primitives based on Dynamic Random Access Memory (DRAM) can provide cost-efficient
and practical security solutions, especially for resource-constrained devices, such
as hardware used in the Internet of Things (IoT), as DRAMs are an intrinsic …

Threats on Logic Locking: A Decade Later

  • Kimia Zamiri Azar

To reduce the cost of ICs and to meet the market’s demand, a considerable portion
of the manufacturing supply chain, including silicon fabrication, packaging and testing,
may be pushed offshore. Utilizing a global IC manufacturing supply chain, and inclusion …

On Custom LUT-based Obfuscation

  • Gaurav Kolhe

Logic obfuscation yields hardware security against various threats, such as Intellectual
Property (IP) piracy and reverse engineering. Evolving Boolean satisfiability (SAT)
attacks have challenged the hardware security assurance rendered by various …

Securing Analog Mixed-Signal Integrated Circuits Through Shared Dependencies

  • Kyle Juretus

The transition to a horizontal integrated circuit (IC) design flow has raised concerns
regarding the security and protection of IC intellectual property (IP). Obfuscation
of an IC has been explored as a potential methodology to protect IP in both the …

SESSION: Special Session 6: Neuromorphic Computing and Deep Neural Network

Design Methodology for Embedded Approximate Artificial Neural Networks

  • Adarsha Balaji

Artificial neural networks (ANNs) have demonstrated significant promise while implementing
recognition and classification applications. The implementation of pre-trained ANNs
on embedded systems requires representation of data and design parameters in …

Exploration of Segmented Bus As Scalable Global Interconnect for Neuromorphic Computing

  • Adarsha Balaji

Spiking Neural Networks (SNNs) are efficient computation models for spatio-temporal
pattern recognition on resource and power constrained platforms. Dedicated SNN hardware,
also called neuromorphic hardware, can further reduce the energy consumption of …

ADMM-based Weight Pruning for Real-Time Deep Learning Acceleration on Mobile Devices

  • Hongjia Li

Deep learning solutions are being increasingly deployed in mobile applications, at
least for the inference phase. Due to the large model size and computational requirements,
model compression for deep neural networks (DNNs) becomes necessary, especially …

On the use of Deep Autoencoders for Efficient Embedded Reinforcement Learning

  • Bharat Prakash

In autonomous embedded systems, it is often vital to reduce the amount of actions
taken in the real world and energy required to learn a policy. Training reinforcement
learning agents from high dimensional image representations can be very expensive
and …

SESSION: Panelist Position Papers

Tuning Track-based NVM Caches for Low-Power IoT Devices

  • Hoda Aghaei Khouzani

Track-based non-volatile memories, such as Domain Wall Memory (DWM) and Skyrmion,
are promising candidates to be used as CPU caches due to their ultra-high density
and low-static power. However, the access latency and energy of these devices are
highly …

Dynamic Computation Migration at the Edge: Is There an Optimal Choice?

  • Sina Shahhosseini

In the era of Fog computing, where one can decide to compute certain time-critical
tasks at the edge of the network, designers often encounter the question of whether the
sensor layer provides the optimal response time for a service, or the Fog layer, or …

Solving Energy and Cybersecurity Constraints in IoT Devices Using Energy Recovery
Computing

  • Himanshu Thapliyal

With the growth of Internet-of-Things (IoT), the potential threat vectors for malicious
cyber and hardware attacks are rapidly expanding. As the IoT paradigm emerges, there
are challenging requirements to design energy-efficient and secure systems. To …

Right-Provisioned IoT Edge Computing: An Overview

  • Tosiron Adegbija

Edge computing on the Internet of Things (IoT) is an increasingly popular paradigm
in which computation is moved closer to the data source (i.e., edge devices). Edge
computing mitigates the overheads of cloud-based computing arising from increased …

Secure Computing Systems Design Through Formal Micro-Contracts

  • Michel A. Kinsy

Two enduring concepts in computer system design are abstraction levels and layered
composition. The design generally takes a layered approach where each layer implements
a different abstraction of the system. The layers communicate through interfaces …

SBCCI 2019 TOC

PHICC: an error correction code for memory devices

  • Philippe Magalhães
  • Otávio Alcântara
  • Jarbas Silveira

With the evolution of technology in the microelectronics field, integrated circuits (ICs) have been developed with decreasing dimensions. Despite the advances provided by the scale reduction, the occurrence of Multiple Cell Upsets (MCUs), caused by interference such as ionizing radiation, has become increasingly common. Error Correction Codes (ECCs) are capable of augmenting the fault tolerance of computer systems; however, there must be a balance between error correction effectiveness and silicon implementation costs. The purpose of this article is to present the Parity Hamming Interleaved Correction Code (PHICC), a code capable of correcting multiple transient errors in memory cells with low implementation cost. The validation of PHICC was performed through a comparative analysis of correction effectiveness, implementation costs, reliability and Mean Time to Failure (MTTF) against other ECCs. The results show that PHICC can maintain system reliability for a longer time, which makes it a strong candidate for use in critical applications.
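
As background for readers unfamiliar with the combination of Hamming codes and interleaving that PHICC builds on, the sketch below pairs a textbook Hamming(7,4) single-error-correcting code with bit interleaving, so that a burst of adjacent upsets is spread across codewords and each codeword sees at most one error. It illustrates the general principle only; the actual PHICC construction is defined in the paper.

    # Illustrative sketch: textbook Hamming(7,4) SEC code plus bit interleaving.
    # NOT the PHICC construction from the paper, just the underlying principle.

    def hamming74_encode(d):
        # d: list of 4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(c):
        # Recompute parity checks; the syndrome is the 1-indexed error position.
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3
        if syndrome:                      # single-bit error: flip it back
            c[syndrome - 1] ^= 1
        return [c[2], c[4], c[5], c[6]]   # extract d1..d4

    def interleave(words):
        # Column-major layout: adjacent physical bits belong to different words.
        return [words[w][b] for b in range(7) for w in range(len(words))]

    def deinterleave(bits, n):
        return [[bits[b * n + w] for b in range(7)] for w in range(n)]

    # A burst of up to n adjacent upsets hits each of the n codewords at most once.
    n = 4
    data = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0], [0, 1, 0, 1]]
    mem = interleave([hamming74_encode(d) for d in data])
    for i in range(8, 8 + n):             # inject a 4-bit burst
        mem[i] ^= 1
    restored = [hamming74_decode(c) for c in deinterleave(mem, n)]
    assert restored == data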

Lightweight security mechanisms for MPSoCs

  • Anderson Camargo Sant’Ana
  • Henrique Martins Medina
  • Kevin Boucinha Fiorentin
  • Fernando Gehm Moraes

Computational systems tend to adopt parallel architectures by using multiprocessor systems-on-chip (MPSoCs). MPSoCs are vulnerable to software and hardware attacks, such as infected applications and Hardware Trojans, respectively. These attacks may aim to gain access to sensitive data, interrupt a given application or even damage the system physically. The literature presents countermeasures using dedicated routing algorithms, cryptography, firewalls and secure zones. These approaches present a significant hardware cost (firewalls, cryptography) or are too restrictive regarding the use of MPSoC resources (secure zones). The goal of this paper is to present lightweight security mechanisms for MPSoCs, using four techniques: spatial isolation of applications; a dedicated network to send sensitive data; a traffic-blocking filter; and lightweight cryptography. These mechanisms protect the MPSoC against the most common software attacks, such as Denial of Service (DoS) and spoofing (man-in-the-middle), and ensure confidentiality and integrity for applications. Results show low area and latency overhead, as well as the effectiveness of the mechanisms in blocking malicious traffic.

Exploiting approximate computing for low-cost fault tolerant architectures

  • Gennaro S. Rodrigues
  • Juan Fonseca
  • Fabio Benevenuti
  • Fernanda Kastensmidt
  • Alberto Bosio

This work investigates how the approximate computing paradigm can be exploited to provide low-cost fault tolerant architectures. In particular, we focus on the implementation of Approximate Triple Modular Redundancy (ATMR) designs using the precision reduction technique. The proposed method is applied to two benchmarks and a multitude of ATMR designs with different degrees of approximation. The benchmarks are implemented on a Xilinx Zynq-7000 APSoC FPGA through high-level synthesis and evaluated concerning area usage and the inaccuracy caused by approximation. Fault injection experiments are performed by flipping bits of the FPGA configuration bitstream. Results show that the proposed approximation method can decrease the DSP usage of the hardware implementation up to 80% and the number of sensitive configuration bits up to 75% while maintaining an accuracy of more than 99.96%.
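
For intuition about the precision-reduction idea, here is a minimal software sketch of an ATMR-style voter: two replicas compute a truncated (approximate) version of an operation, and the voter accepts any result confirmed by a second copy within the known truncation error bound. The multiply operation and the 4-bit truncation are hypothetical stand-ins; the paper's designs are hardware modules generated through high-level synthesis.

    # Illustrative sketch of precision-reduction ATMR (not the paper's HLS designs).

    def truncate(x, drop_bits):
        # Precision reduction: clear the least significant bits of the result.
        return x & ~((1 << drop_bits) - 1)

    def multiply_exact(a, b):              # hypothetical protected operation
        return a * b

    def atmr_multiply(a, b, drop_bits=4):
        tol = (1 << drop_bits) - 1         # worst-case error of the truncation
        results = [
            multiply_exact(a, b),                       # precise module
            truncate(multiply_exact(a, b), drop_bits),  # approximate replica 1
            truncate(multiply_exact(a, b), drop_bits),  # approximate replica 2
        ]
        for r in results:
            agree = sum(abs(r - other) <= tol for other in results)
            if agree >= 2:                 # r itself plus at least one other copy
                return r
        raise RuntimeError("no majority: multiple modules faulty")

    # A single faulty replica is out-voted; the output stays within 'tol'.
    assert abs(atmr_multiply(1234, 5678) - 1234 * 5678) <= (1 << 4) - 1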

Fine-grain temperature monitoring for many-core systems

  • Alzemiro Lucas da Silva
  • André Luís del Mestre Martins
  • Fernando Gehm Moraes

The power density may limit the amount of energy a many-core system can consume. A many-core system at its maximum performance may violate safe temperature limits and, consequently, suffer reliability issues. Dynamic Thermal Management (DTM) techniques have been proposed to guarantee that many-core systems run at good performance without compromising reliability. DTM techniques rely on accurate temperature information and estimation, which is a computationally complex problem. However, related works usually abstract away the complexity of temperature monitoring, assuming available temperature sensors. An issue related to temperature sensors is their granularity, frequently measuring the temperature of a large system area instead of a processing element (PE) area. Therefore, the first goal of this work is to propose fine-grain (PE-level) temperature monitoring for many-core systems. The second is to present a dedicated hardware accelerator to estimate the system temperature. Results show that software performance can be a limiting factor when applying an accurate model to provide temperature estimation for system management. On the other hand, the hardware accelerator connected to the many-core enables fine-grain temperature estimation at runtime without sacrificing system performance.

An adaptive discrete particle swarm optimization for mapping real-time applications onto network-on-a-chip based MPSoCs

  • Jessé Barreto de Barros
  • Renato Coral Sampaio
  • Carlos Humberto Llanos

This paper presents a modified version of the well-known Particle Swarm Optimization (PSO) algorithm as an alternative to the single-objective Genetic Algorithm (GA) that is currently the state-of-the-art method to map real-time application tasks onto Multiprocessor Systems-on-Chip (MPSoCs) that use a preemption-capable wormhole-based Network-on-Chip (NoC) as their communication architecture. A statistical study based on an experimental setup has been performed to compare the GA-based task mapper and the proposed method, using a real-time application as a benchmark as well as a group of randomly generated ones. Preliminary results have shown that our method is capable of achieving quicker convergence than the GA-based method, and it even produces better results when the application utilization is smaller than the available processing capacity, i.e., when a fully schedulable mapping solution exists.
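
As a rough illustration of how PSO can be discretized for task mapping, the sketch below treats each particle as a task-to-PE assignment vector and turns the pbest/gbest attraction into gene-wise copying with a small mutation rate; the Manhattan-distance communication cost, the probabilities, and the tiny example traffic are hypothetical placeholders rather than the paper's exact operators.

    import random

    # Hypothetical communication cost: communicating tasks pay a penalty
    # proportional to the hop distance between their PEs on a mesh NoC.
    def mapping_cost(assign, traffic, mesh_w):
        cost = 0
        for (t1, t2), vol in traffic.items():
            x1, y1 = assign[t1] % mesh_w, assign[t1] // mesh_w
            x2, y2 = assign[t2] % mesh_w, assign[t2] // mesh_w
            cost += vol * (abs(x1 - x2) + abs(y1 - y2))
        return cost

    def discrete_pso(n_tasks, n_pes, traffic, mesh_w, swarm=20, iters=300,
                     c1=0.3, c2=0.4, pm=0.05):
        parts = [[random.randrange(n_pes) for _ in range(n_tasks)]
                 for _ in range(swarm)]
        pbest = [p[:] for p in parts]
        gbest = min(parts, key=lambda p: mapping_cost(p, traffic, mesh_w))[:]
        for _ in range(iters):
            for i, p in enumerate(parts):
                for t in range(n_tasks):
                    r = random.random()
                    if r < c1:                     # attraction to personal best
                        p[t] = pbest[i][t]
                    elif r < c1 + c2:              # attraction to global best
                        p[t] = gbest[t]
                    elif r < c1 + c2 + pm:         # mutation keeps diversity
                        p[t] = random.randrange(n_pes)
                if mapping_cost(p, traffic, mesh_w) < mapping_cost(pbest[i], traffic, mesh_w):
                    pbest[i] = p[:]
                    if mapping_cost(p, traffic, mesh_w) < mapping_cost(gbest, traffic, mesh_w):
                        gbest = p[:]
        return gbest

    # Four tasks on a 2x2 mesh; pairs (0,1) and (2,3) communicate heavily.
    best = discrete_pso(4, 4, {(0, 1): 10, (2, 3): 10, (1, 2): 1}, mesh_w=2)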

Exploring Tabu search based algorithms for mapping and placement in NoC-based reconfigurable systems

  • Guilherme A. Silva Novaes
  • Luiz Carlos Moreira
  • Wang Jiang Chau

Nowadays, the development of systems based on Networks-on-Chip (NoCs) poses big challenges to designers due to scalability problems, such as efficient Mapping and Placement, which are NP-hard. Several solutions have been proposed for this type of problem, a variation of the Quadratic Assignment Problem (QAP), with Tabu Search (TS) algorithms showing the most promising results. In NoC-based dynamically reconfigurable systems (NoC-DRSs), both the mapping and placement problems present additional layers of complexity due to the reconfigurable scenarios. A previous work adopted TS algorithm variations, but the best solution is not reached as frequently as desired. This paper introduces the original Forced Inversion (FI) heuristic over Tabu Search algorithms for 2D-mesh FPGA NoC-DRSs, in order to escape local minima. Results for a series of benchmarks are presented and the performance of the different approaches is compared quantitatively and qualitatively.
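
As a reference point for the TS variants discussed, a bare-bones tabu search over pairwise swap moves for a QAP-style mapping problem looks like the sketch below; the tenure, the random QAP objective, and the stopping rule are illustrative, and the paper's Forced Inversion diversification step is not shown.

    import itertools, random

    def tabu_search(n, cost, iters=500, tenure=15):
        # Solution: a permutation mapping logical nodes to physical slots.
        cur = list(range(n))
        random.shuffle(cur)
        best, best_c = cur[:], cost(cur)
        tabu = {}  # move (i, j) -> iteration until which it is forbidden
        for it in range(iters):
            cand, cand_c, cand_mv = None, float("inf"), None
            for i, j in itertools.combinations(range(n), 2):
                nb = cur[:]
                nb[i], nb[j] = nb[j], nb[i]
                c = cost(nb)
                # Accept a tabu move only if it beats the best known (aspiration).
                if (tabu.get((i, j), -1) < it or c < best_c) and c < cand_c:
                    cand, cand_c, cand_mv = nb, c, (i, j)
            if cand is None:
                break
            cur = cand
            tabu[cand_mv] = it + tenure          # forbid reversing this swap
            if cand_c < best_c:
                best, best_c = cur[:], cand_c
        return best, best_c

    # Example: minimize a random QAP objective over an 8-node mapping.
    flow = [[random.randint(0, 9) for _ in range(8)] for _ in range(8)]
    dist = [[abs(i - j) for j in range(8)] for i in range(8)]
    qap = lambda p: sum(flow[a][b] * dist[p[a]][p[b]] for a in range(8) for b in range(8))
    print(tabu_search(8, qap))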

Performance evaluation of HEVC RCL applications mapped onto NoC-based embedded platforms

  • Wagner Penny
  • Daniel Palomino
  • Marcelo Porto
  • Bruno Zatt
  • Leandro Indrusiak

Today, several applications running on embedded systems have to fulfill soft or hard timing constraints. Video applications, like the modern High Efficiency Video Coding (HEVC), most often have soft real-time constraints. However, in specific scenarios, such as robotic surgeries, the coupling of satellites and so on, harder timing constraints arise, becoming a huge challenge. Although implementing such applications on Networks-on-Chip (NoCs) is an alternative to reduce their algorithmic complexity and meet real-time constraints, a performance evaluation of the mapped NoC and a schedulability analysis for the given application are mandatory. In this work we perform a performance evaluation of the HEVC Residual Coding Loop (RCL) mapped onto a NoC-based embedded platform, considering the encoding of a single 1920×1080-pixel frame. A set of analyses exploring combinations of different NoC sizes and task mapping strategies was performed, showing, for the typical and upper-bound workload cases, scenarios in which the application is schedulable and meets the real-time constraints.

An FPGA-based evaluation platform for energy harvesting embedded systems

  • Roberto Paulo Dias Alcantara Filho
  • Otavio Alcantara de Lima Junior
  • Corneli Gomes Furtado Junior

Extreme low-power embedded systems are essential in Smart Cities and the Internet of Things, since these systems are responsible for acquiring, processing, and transmitting valuable environmental data. Some of these systems should run for a very long time without any human intervention, even for battery replacement. Energy harvesting technologies allow embedded systems to be powered from the environment by converting surrounding energy sources into electrical energy. However, energy-harvesting embedded systems (EHES) heavily depend on the nature of the energy sources, which are mostly uncontrollable and unpredictable. To improve the evaluation of energy management techniques in EHES, we propose the emulation of I-V curves of low-power energy harvesting transducers. An FPGA-based platform controls the energy source emulation, combined with an integrated logic analyzer, which allows real-time data gathering from the EHES in multiple evaluation scenarios. The experiments show that the platform replicates solar energy scenarios with only 0.56% mean error.

A comparison of two embedded systems to detect electrical disturbances using decision tree algorithm

  • Reneilson Santos
  • Edward David Moreno
  • Carlos Estombelo-Montesco

The Electrical Power Quality (EPQ) is a relevant subject in the academic area because of its importance in real-world problems. Anomalies in an electrical network can cause severe losses of equipment and data. In this context, much research effort has been devoted to solutions for this kind of problem, seeking better accuracy in classifying the anomalies or building systems to detect them. This paper, therefore, aims to compare two systems built to classify electrical disturbances even in noisy environments. For this purpose, a microprocessor system (Raspberry Pi 3) and a microcontroller system (NodeMCU Amica) were used, analyzing their time to classify the input signal. The microprocessor achieves better results (45.50 ms against 267.10 ms), with an accuracy of 97.96% in an ideal environment and 76.79% in a noisy environment (20 dB of SNR) for both systems.

FPGA hardware linear regression implementation using fixed-point arithmetic

  • Willian de Assis Pedrobon Ferreira
  • Ian Grout
  • Alexandre César Rodrigues da Silva

In this paper, a hardware design based on a field programmable gate array (FPGA) to implement a linear regression algorithm is presented. The arithmetic operations were optimized by applying a fixed-point number representation for all hardware-based computations. A floating-point training data set was initially created and stored on a personal computer (PC); it was then converted to fixed-point representation and transmitted to the FPGA via a serial communication link. With the proposed VHDL design description synthesized and implemented within the FPGA, the custom hardware architecture performs the linear regression algorithm based on matrix algebra, considering a fixed-size training data set. To validate the hardware fixed-point arithmetic operations, the same algorithm was implemented in the Python language and the results of the two computation approaches were compared. The power consumption of the proposed embedded FPGA system was estimated to be 136.82 mW.
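
A minimal sketch of the same idea in software: represent all quantities in a Q16.16 fixed-point format, rescale after every multiplication, and solve the closed-form least-squares equations in integer arithmetic. The Q-format and the five-point data set are illustrative; the paper's VHDL design applies the same principle, via matrix algebra, to a fixed-size training set received over the serial link.

    # Q16.16 fixed-point linear regression (y = a*x + b) via the closed-form
    # least-squares solution, validated against the float result.
    FRAC = 16                                  # fractional bits (Q16.16)

    def to_fix(x):  return int(round(x * (1 << FRAC)))
    def to_flt(x):  return x / (1 << FRAC)
    def fmul(a, b): return (a * b) >> FRAC     # rescale after multiply
    def fdiv(a, b): return (a << FRAC) // b    # pre-scale before divide

    def linreg_fixed(xs, ys):
        n = len(xs)
        sx  = sum(xs); sy = sum(ys)
        sxx = sum(fmul(x, x) for x in xs)
        sxy = sum(fmul(x, y) for x, y in zip(xs, ys))
        slope = fdiv(n * sxy - fmul(sx, sy), n * sxx - fmul(sx, sx))
        intercept = (sy - fmul(slope, sx)) // n
        return slope, intercept

    xs = [to_fix(x) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
    ys = [to_fix(y) for y in (1.1, 3.0, 5.2, 6.9, 9.1)]
    a, b = linreg_fixed(xs, ys)
    print(to_flt(a), to_flt(b))   # close to the float solution (~1.99, ~1.08)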

New insight for next generation SRAM: tunnel FET versus FinFET for different topologies

  • Adriana Arevalo
  • Romain Liautard
  • Daniel Romero
  • Lionel Trojman
  • Luis-Miguel Procel

The purpose of this work is to point out the main differences between Static Random-Access Memory (SRAM) cells implemented using Tunnel FET (TFET) and FinFET technologies. We have compared the behavior of SRAM cells implemented in both technologies for a supply voltage range from 0.4 V to 1.2 V. Furthermore, for our study, we have chosen different SRAM cell topologies, such as 6T, 8T, 9T and 10T. We have simulated all of these topologies for both technologies and extracted the Static Noise Margins (SNM) for the reading and writing processes. In addition, we have determined the power consumption in order to find the best trade-off between stability and power. By analyzing these results, we have determined the best topology for each technology. Finally, we have compared the best topologies of the two technologies in order to study their advantages and shortcomings. Our results show more advantages for the TFET technology than for the FinFET one.

DNAr-logic: a constructive DNA logic circuit design library in R language for molecular computing

  • Renan A. Marks
  • Daniel K. S. Vieira
  • Marcos V. Guterres
  • Poliana A. C. Oliveira
  • Omar P. Vilela Neto

This paper describes DNAr-Logic: an implementation of a software package in the R language that provides ease of use and scalability in the design process of digital logic circuits in molecular computing, more specifically, DNA computing. These devices may be used in vitro, in vivo, or even replace CMOS technology in some applications. Using a technique known as DNA strand displacement (DSD) reactions in conjunction with chemical reaction networks (CRNs), DNA strands can be used as “wet” hardware to construct molecular logic circuits analogous to electronic digital designs. Circuits designed using DNAr-Logic can be created in a constructive manner and simulated without requiring knowledge of chemistry or the DSD mechanism. The package implements all the main logic gates. We describe the design of a majority gate (also available in the package) and a full-adder circuit that only uses this gate. We describe the results and simulation of our design.

Finding optimal qubit permutations for IBM’s quantum computer architectures

  • Alexandre A. A. de Almeida
  • Gerhard W. Dueck
  • Alexandre C. R. da Silva

IBM offers quantum processors for Clifford+T circuits. The only restriction is that not all CNOT gates are implemented directly; the others must be substituted with alternative sequences of gates. Each CNOT has its own mapping with a respective cost. However, by permuting the qubits, the number of CNOTs that need mappings can be reduced. The problem is to find a good permutation without an exhaustive search. In this paper we propose a solution for this problem: the permutation problem is formulated as an Integer Linear Programming (ILP) problem. Solving the ILP problem guarantees the lowest-cost permutation for the CNOT mappings. To test and validate the proposed formulation, quantum architectures with 5 and 16 qubits were used. The ILP formulation, along with mapping techniques, found circuits with up to 64% fewer gates than other approaches.
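
To make the underlying cost model concrete: given a per-CNOT substitution cost table for the physical device, the objective that the ILP minimizes can be evaluated for any candidate permutation as in the sketch below. The exhaustive search shown is only the baseline that the ILP formulation avoids, and the 5-qubit cost table is hypothetical.

    from itertools import permutations

    # Hypothetical cost table for a 5-qubit device: COST[c][t] is the number of
    # extra gates needed to realize CNOT(control=c, target=t); 0 means native.
    COST = [
        [0, 0, 4, 4, 10],
        [4, 0, 0, 4, 10],
        [4, 4, 0, 0, 10],
        [10, 4, 4, 0, 0],
        [10, 10, 10, 4, 0],
    ]

    def mapping_cost(cnots, perm):
        # perm[logical] = physical; sum the substitution cost of every CNOT.
        return sum(COST[perm[c]][perm[t]] for c, t in cnots)

    def best_permutation(cnots, n=5):
        # Exhaustive baseline (n! candidates); the paper's ILP finds the
        # optimum without enumerating permutations.
        return min(permutations(range(n)), key=lambda p: mapping_cost(cnots, p))

    circuit = [(0, 1), (1, 2), (0, 2), (3, 4)]   # CNOTs as (control, target)
    p = best_permutation(circuit)
    print(p, mapping_cost(circuit, p))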

Hardware implementation of a shape recognition algorithm based on invariant moments

  • Clement Raffaitin
  • Juan-Sebastian Romero
  • Luis-Miguel Procel

The present work describes a simple, fast shape detection algorithm and its hardware implementation on an FPGA system. The detection algorithm is based on the concepts of Hu’s moments, which are invariant to similarity transformations. The recognition algorithm employs a non-local means filter. The algorithm is implemented on an FPGA system using a hardware description language. We present the different design stages of the algorithm implementation, which is based on the finite state machine technique. This algorithm is able to recognize a target shape in a test image. Furthermore, this work describes the advantages of the implementation in hardware, such as speed and parallelism in signal processing. Finally, we show some results of the implementation of this algorithm.

A custom processor for an FPGA-based platform for automatic license plate recognition

  • Guilherme A. M. Sborz
  • Guilherme A. Pohl
  • Felipe Viel
  • Cesar A. Zeferino

Automatic License Plate Recognition (ALPR) systems are used to identify a vehicle from an image that contains its plate. These systems have applications in a wide range of areas, such as toll payment, border control, and traffic surveillance. ALPR systems demand high computational power, especially for real-time applications. In this context, this paper describes the development of a custom processor designed to accelerate part of the processing of an FPGA-based ALPR system. This processor reduces the latency for computing the most expensive function of the ALPR system by almost 23 times, thus reducing the time necessary to detect a vehicle plate.

Hardware design of DC/CFL intra-prediction decoder for the AV1 codec

  • Jones Goebel
  • Bruno Zatt
  • Luciano Agostini
  • Marcelo Porto

This paper presents a dedicated hardware design for the DC and Chroma from Luma (CFL) intra-prediction modes of the AV1 decoder. The hardware was designed to reach real-time operation when processing UHD 4K videos. The AV1 codec is an open-source, royalty-free video coding format developed by the AOMedia group, which is composed of multiple companies, including Google, Netflix, AMD, ARM, Intel, Nvidia, Microsoft and Mozilla. The proposed solution supports all 19 block sizes allowed in the AV1 encoder and is able to process UHD 4K videos at 60 frames per second. The DC/CFL modules were synthesized to the TSMC 40 nm cell library targeting a frequency of 132.1 MHz. Synthesis results show that the proposed hardware uses 89.39 Kgates and dissipates 7.96 mW.

Approximate interpolation filters for the fractional motion estimation in HEVC encoders and their VLSI design

  • Rafael da Silva
  • Ícaro Siqueira
  • Mateus Grellert

Motion Estimation (ME) is one of the most complex HEVC steps, consuming more than 60% of the average encoding time, most of which is spent on its fractional part (Fractional Motion Estimation – FME), in which sub-pixel samples are interpolated and searched over to find motion vectors with higher precision. This paper presents hardware designs for the sub-pixel interpolation unit of the FME step. The designs employ approximate computing techniques by reducing the number of taps in each filter to reduce memory access and hardware cost. The approximate filters were implemented in the HEVC reference software to assess their impact on coding performance. A complete interpolation architecture was implemented in VHDL and synthesized with different filter precisions and input sizes in order to assess the effect of these parameters on hardware area and performance. The approximate designs reduce the number of adders/subtractors by up to 67.65% and memory bandwidth by up to 75% with a tolerable loss in coding performance (less than 1% using the Bjontegaard Delta bitrate metric). When synthesized to an FPGA device, 52.9% fewer logic elements are required with a modest increase in frequency.
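
For a flavor of the tap-reduction trade-off, the sketch below interpolates half-pel samples with the standard HEVC 8-tap luma filter and with a hypothetical 4-tap approximation, then reports the resulting error; the reduced coefficients are illustrative and are not those of the paper's designs.

    import numpy as np

    # HEVC 8-tap DCT-IF luma half-pel filter (divided by 64 after filtering).
    FULL = np.array([-1, 4, -11, 40, 40, -11, 4, -1])
    # Hypothetical 4-tap approximation (illustrative only): keep the four
    # central taps and renormalize so the coefficients still sum to 64.
    APPROX = np.array([-9, 41, 41, -9])

    def half_pel(samples, taps):
        # 'valid' convolution: each output is one interpolated half-pel sample.
        return np.convolve(samples, taps[::-1], mode="valid") / 64.0

    row = np.random.default_rng(0).integers(0, 256, size=64).astype(float)
    ref = half_pel(row, FULL)
    # Align the shorter filter on the same sample positions (drop 2 per side).
    app = half_pel(row[2:-2], APPROX)
    print("mean abs error:", np.abs(ref - app).mean())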

An SVM-based hardware accelerator for onboard classification of hyperspectral images

  • Lucas A. Martins
  • Guilherme A. M. Sborz
  • Felipe Viel
  • Cesar A. Zeferino

Hyperspectral images (HSIs) have been used in civil and military scenarios for ground recognition, urban development management, rare-mineral identification, and diverse other purposes. However, HSIs carry a significant volume of information and require high computational power, especially for real-time processing in embedded applications, as in onboard computers in satellites. These issues have driven the development of hardware-based solutions able to provide the processing power necessary to meet such requirements. In this paper, we present a hardware accelerator to enhance the performance of one of the most computationally expensive stages of HSI processing: classification. We employed the Entropy Multiple Correlation Ratio procedure to select the spectral bands to be used in the training process. For the classification step, we applied a Support Vector Machine classifier with a Hamming-distance decision approach. The proposed custom processor was implemented in an FPGA and compared with high-level implementations. The results demonstrate that the processor has a lower silicon cost than similar solutions, performs real-time pixel classification in 0.1 ms, and achieves a state-of-the-art accuracy of 99.7%.

A sub-1mA highly linear inductorless wideband LNA with low IP3 sensitivity to variability for IoT applications

  • Arthur Liraneto Torres Costa
  • Hamilton Klimach
  • Sergio Bampi

This paper proposes a wideband 0.4-2 GHz cascode common-gate LNA that can be used as a building block in a noise-canceling topology (which allows its noise to be canceled at the output node). The design strategy is to set the operating point by analyzing the third-order coefficient (α3) of the output current and the output voltage, the latter designed using a load composed of a diode-connected PMOS transistor and a resistor in parallel. This operating point tolerates a reasonable VGS spread while maintaining a high IIP3, which implies low IIP3 sensitivity to process variability. The design strategy also achieves a current consumption under 1 mA and, depending on the technology-node VDD (CMOS 130 nm in this case), a power consumption under 1 mW. This makes the wideband LNA suitable for IoT applications. Monte Carlo simulations were carried out to demonstrate the operating region’s sensitivity to variability, yielding a worst-case IIP3 of μ = +0.2 dBm with σ = 0.8 dBm (at 2 GHz), up to a nominal 2.75 dBm at 900 MHz, with S11 < -23 dB, NF < 5.5 dB (noise canceled by virtue of the topology), and a voltage gain of 11.6-14.6 dB (S21 = 6.4-9.4 dB with a buffer to 50 Ω), while consuming just 1.19 mW from a 1.2 V supply.

Comparison between direct and indirect learnings for the digital pre-distortion of concurrent dual-band power amplifiers

  • Luis Schuartz
  • Artur T. Hara
  • André A. Mariano
  • Bernardo Leite
  • Eduardo G. Lima

Current radio-communication systems demand high linearity and high efficiency. The digital baseband pre-distorter (DPD) is a cost-effective solution to guarantee the required linearity without compromising efficiency. In the design of a DPD for a single-band power amplifier (PA), the position of the inverse system is exchanged during the identification procedure to avoid the need for a PA model within a cumbersome closed-loop process. However, in a practical environment where only an approximation of the inverse is achieved, the linearization capability is affected by shifting the post-inverse placed after the PA to a pre-inverse located before the PA. In DPDs intended for concurrent dual-band PAs, an additional advantage of this approach is that the post-inverse identifications for each band are completely independent of each other. This work performs a comparative analysis between two learning architectures applied to the linearization of two concurrent dual-band PAs stimulated by 2.4 GHz Wi-Fi and 3.5 GHz LTE signals. For the first PA, an exact PA model is known, and the replacement of the post-inverse by a pre-inverse produces only negligible degradation in linearity. For the second PA, only an approximate PA model is available, and the accuracy of this PA model has a greater impact on the linearization capability than the shifting of the inverse.

Interactive evolutionary approach to reduce the optimization cycle time of a low noise amplifier

  • Rodrigo A. L. Moreto
  • Douglas Rocha
  • Carlos E. Thomaz
  • André Mariano
  • Salvador P. Gimenez

Nowadays, wireless communications at gigahertz frequencies face increasing demand due to the ever-increasing number of electronic devices that use this type of communication, implemented with Radio Frequency (RF) circuits. However, the design of RF circuits is difficult, time-consuming, and based on designer knowledge and experience. This work proposes an interactive evolutionary approach using a genetic algorithm, implemented in the in-house iMTGSPICE optimization tool, to optimize a robust (corner and Monte Carlo analyses) Ultra-Low-Power Low Noise Amplifier (LNA) dedicated to Wireless Sensor Networks (WSN) and implemented in a 130 nm bulk CMOS technology. We performed two experimental studies to optimize the LNA. The first used the interactive approach of iMTGSPICE, monitored and assisted by a beginner designer during the optimization process. The second used the conventional, non-interactive approach of iMTGSPICE, with no designer assistance during the optimization. The results demonstrated that the interactive approach of iMTGSPICE completed the optimization of the robust LNA around 94% faster (in approximately 20 minutes) than the non-interactive evolutionary approach (approximately 6 hours).

An innovative strategy to reduce die area of robust OTA by using iMTGSPICE and diamond layout style for MOSFETs

  • José Roberto Banin Júnior
  • Rodrigo Alves de Lima Moreto
  • Gabriel Augusto da Silva
  • Carlos Eduardo Thomaz
  • Salvador Pinillos Gimenez

This paper describes a pioneering design and optimization methodology that provides a remarkable die-area reduction for robust analog Complementary Metal-Oxide-Semiconductor (CMOS) Integrated Circuits (ICs) by using a computational tool based on artificial intelligence (iMTGSPICE) and the Diamond layout style for MOSFETs. This innovative optimization strategy was validated on an Operational Transconductance Amplifier (OTA) in a 180 nm CMOS technology. The main finding is a reduction of around 30% in the total die area of a robust OTA when using Diamond MOSFETs with alpha angles of 45°, compared to the one implemented with standard rectangular MOSFETs.

NMLSim 2.0: a robust CAD and simulation tool for in-plane nanomagnetic logic based on the LLG equation

  • Lucas A. Lascasas Freitas
  • Omar P. Vilela Neto
  • João G. Nizer Rahmeier
  • Luiz G. C. Melo

Nanomagnetic Logic (NML) is a new technology based on the magnetization of nanometric magnets. Logic operations are performed via dipolar coupling through ferromagnetic and antiferromagnetic interactions. The low energy dissipation and the possibility of higher integration density are significant advantages over CMOS technology. Even so, there is a great need for simulation and CAD tools for the proper study of large NML circuits. This paper presents a high-efficiency tool that uses the Landau-Lifshitz-Gilbert equation to evolve the magnetization of the particles over time in a monodomain approach. The new version of NMLSim comes with a flexible codebase, allowing the tool to be extended with ease and consistency. Results for simulated structures show the reliability of the simulator when compared with the state-of-the-art Object-Oriented Micromagnetic Framework (OOMMF), along with an improvement of up to 716 times in execution time and up to 41 times in memory usage.
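
For reference, the Landau-Lifshitz-Gilbert equation integrated per single-domain magnet is, in its standard form (γ the gyromagnetic ratio, α the Gilbert damping, H_eff the effective field, m the unit magnetization):

    \frac{d\mathbf{m}}{dt} = -\gamma\,\mathbf{m}\times\mathbf{H}_{\mathrm{eff}} + \alpha\,\mathbf{m}\times\frac{d\mathbf{m}}{dt}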

Ropper: a placement and routing framework for field-coupled nanotechnologies

  • Ruan Evangelista Formigoni
  • Ricardo S. Ferreira
  • José Augusto M. Nacif

Field-Coupled Nanocomputing technologies are the subject of extensive research to overcome current CMOS limitations. These technologies include nanomagnetic and quantum structures, each with its own design and synchronization challenges. In this scenario, clocking schemes are used to ensure circuit synchronization and avoid signal disruptions, at the cost of some area overhead. Unfortunately, a nanocomputing technology is limited to a small subset of clocking schemes by its number of clocking phases and its signal propagation system, leading to complex placement-and-routing challenges and technology-dependent solutions. We present a novel framework that solves these design challenges across distinct schemes, avoiding the need for pre-defined routing algorithms for each one. The framework offers a technology-independent solution, provides interfaces for the implementation of efficient and scalable placement strategies, and is fully integrated with reference state-of-the-art optimization and synthesis tools.

Toward nanometric scale integration: an automatic routing approach for NML circuits

  • Pedro Arthur R. L. Silva
  • Omar Paranaíba V. Neto
  • José Augusto M. Nacif

In recent years, many technologies have been studied to replace or complement CMOS. Some of these emerging technologies are known as Field-Coupled Nanocomputing. However, these new technologies introduce the need for tools that perform circuit mapping, placement, and routing. Nanomagnetic Logic (NML) is one of these emergent technologies. It relies on the magnetization of nanomagnets to perform operations through majority logic. In this work, we propose an approach to automatically map a gate-level circuit to an NML layout. We use Breadth-First Search to perform the placement and the A* algorithm to traverse the circuit and build the routes for each node. To evaluate the effectiveness of our approach, we use a series of ISCAS’85 benchmarks. Our results show an area reduction varying from 20% to 60%.
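
A compact sketch of the routing step as the abstract describes it (grid encoding and unit cost are assumptions, not the authors’ code): A* over the layout grid with the admissible Manhattan-distance heuristic:

    import heapq

    def a_star(grid, start, goal):
        """Route start -> goal on a 4-connected grid; grid[y][x] == 1 is
        blocked. Manhattan distance is admissible here, so the first
        path popped at the goal is shortest."""
        h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
        frontier = [(h(start), 0, start, [start])]
        seen = set()
        while frontier:
            _, g, node, path = heapq.heappop(frontier)
            if node == goal:
                return path
            if node in seen:
                continue
            seen.add(node)
            y, x = node
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if (0 <= ny < len(grid) and 0 <= nx < len(grid[0])
                        and grid[ny][nx] == 0 and (ny, nx) not in seen):
                    heapq.heappush(frontier, (g + 1 + h((ny, nx)), g + 1,
                                              (ny, nx), path + [(ny, nx)]))
        return None  # no route exists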

Energy efficient fJ/spike LTS e-Neuron using 55-nm node

  • Pietro M. Ferreira
  • Nathan De Carvalho
  • Geoffroy Klisnick
  • Aziz Benlarbi-Delai

While CMOS technology is currently reaching its limits in power consumption and circuit density, a challenger is emerging from the analogy between biology and silicon. Hardware-based neural networks may drive a new generation of bio-inspired computers, driven by the need for hardware solutions to real-time applications. This paper redesigns a previously proposed electronic neuron (e-Neuron) for a higher firing rate, to reduce the silicon area and highlight a better energy-efficiency trade-off. In addition, an innovative schematic is proposed to establish an e-Neuron library based on Izhikevich’s model of neural firing patterns. Both e-Neuron circuits are designed in a 55 nm technology node. The physical design of transistors in weak inversion is discussed with a view to minimal leakage. Neural firing-pattern behaviors are validated by post-layout simulations, demonstrating the spike-frequency adaptation and the rebound spikes due to the post-inhibitory effect in the LTS e-Neuron. The presented results suggest that the time to rebound spikes depends on the excitation current amplitude. Both e-Neurons present fJ/spike energy efficiency and a smaller silicon area in comparison to Izhikevich-library propositions in the literature.

CMOS analog four-quadrant multiplier free of voltage reference generators

  • Antonio José Sobrinho de Sousa
  • Fabian de Andrade
  • Hildeloi dos Santos
  • Gabriele Gonçalves
  • Maicon Deivid Pereira
  • Edson Santana
  • Ana Isabela Cunha

This work presents a CMOS four-quadrant analog multiplier architecture for application as the synapse element in analog cellular neural networks. The circuit has voltage-mode inputs and a current-mode output and includes a signal-application method that avoids voltage or current reference generators. Simulations were performed for a CMOS 130 nm technology, featuring a ±50 mV input voltage range, 60 μW static power, and -25 dB maximum THD. The active area is 346 μm2.

Amplifier-based MOS analog neural network implementation and weights optimization

  • Tiago Oliveira Weber
  • Diogo da Silva Labres
  • Fabián Leonardo Cabrera

Neural networks are achieving state-of-the-art performance in many applications, from speech recognition to computer vision. A neuron in a multi-layer network needs to multiply each input by its weight, sum the results, and apply an activation function. This paper presents a variation of an amplifier-based MOS analog neuron capable of performing these tasks, together with the optimization of the synaptic weights using in-loop circuit simulations. MOS transistors operating in the triode region are used as variable resistors to convert the input and weight voltages into a proportional input current. To test the analog neuron in full networks, an automatic generator is developed to produce a netlist based on the number of neurons in each layer, the inputs, and the weights. Simulation results using a CMOS 180 nm technology demonstrate the neuron’s proper transfer function and its functionality when trained on test datasets.

Reduction of neural network circuits by constant and nearly constant signal propagation

  • Augusto Andre Souza Berndt
  • Alan Mishchenko
  • Paulo Francisco Butzen
  • Andre Inacio Reis

This work focuses on optimizing circuits representing neural networks (NNs) in the form of and-inverter graphs (AIGs). The optimization is done by analyzing the training set of the neural network to find constant bit values at the primary inputs. The constant values are then propagated through the AIG, which results in removing unnecessary nodes. Furthermore, a trade-off between neural network accuracy and its reduction due to constant propagation is investigated by replacing with constants those inputs that are likely to be zero or one. The experimental results show a significant reduction in circuit size with negligible loss in accuracy.
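
The core rewriting step behind such reductions is standard two-level AND simplification: once an input is (or is assumed) constant, every AND node with a constant fanin collapses. A minimal sketch with an assumed tuple encoding of nodes (not the authors’ AIG package):

    def simplify_and(a, b):
        """Propagate constants through one AIG AND node.
        Inputs are the constants 0/1 or symbolic subgraphs."""
        if a == 0 or b == 0:
            return 0              # x AND 0 = 0: the node disappears
        if a == 1:
            return b              # x AND 1 = x: forward the other input
        if b == 1:
            return a
        if a == b:
            return a              # idempotence: x AND x = x
        return ("and", a, b)      # nothing to simplify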

A FPGA parameterizable multi-layer architecture for CNNs

  • Guilherme Korol
  • Fernando Gehm Moraes

Advances in hardware platforms have boosted the use of Convolutional Neural Networks (CNNs) to solve problems in several fields, such as Computer Vision and Natural Language Processing. With the improvement of the algorithms involved in learning and inference for CNNs, dedicated hardware architectures have been proposed with the goal of speeding up CNN performance. However, the bandwidth and processing-power requirements of CNNs challenge designers to create architectures fitted for ASICs and FPGAs. Embedded applications targeting IoT (including sensors and actuators), health devices, smartphones, and any other battery-powered device may benefit from CNNs. For these, the CNN design must follow a different path, where the cost function is a small area footprint and reduced power consumption. This paper is a step towards this goal, proposing an architecture for the main modules of modern CNNs. The proposal uses the AlexNet CNN as a case study, targeting Xilinx FPGA devices. Compared to the literature, results show a reduction of up to 9 times in the number of required DSP modules.

Design of a low power 10-bit 12MS/s asynchronous SAR ADC in 65nm CMOS

  • Arthur Lombardi Campos
  • João Navarro
  • Maximiliam Luppe

During the last decades, we have witnessed performance improvements and aggressive growth in the complexity of integrated circuits (ICs). The progressive size reduction of transistors in recent technology nodes has allowed IC designers to perform analog tasks in the digital domain, increasing the demand for analog-to-digital converters (ADCs). This work presents the design and implementation of a low-power successive approximation register analog-to-digital converter (SAR ADC) in a 65 nm CMOS technology, suitable for low-power front-ends of wireless receivers, with a flexible sampling rate of up to 12 MS/s. At the maximum sampling rate, the post-layout simulated circuit achieved an effective number of bits (ENOB) of 9.65 and a power consumption of 151.4 μW, leading to a figure of merit of 15.8 fJ/conversion-step, within an area of 0.074 mm2.
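
The quoted figure of merit is consistent with the standard Walden formula FoM = P / (2^ENOB · f_s); a quick check with the abstract’s numbers:

    P, ENOB, fs = 151.4e-6, 9.65, 12e6             # values from the abstract
    fom = P / (2 ** ENOB * fs)                     # Walden figure of merit
    print(f"{fom * 1e15:.1f} fJ/conversion-step")  # ~15.7, matching ~15.8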

A new algorithm for an incremental sigma-delta converter reconstruction filter

  • Li Huang
  • Caroline Lelandais-Perrault
  • Anthony Kolar
  • Philippe Benabes

Image sensors dedicated to Earth-observation applications require medium-speed, high-resolution analog-to-digital converters (ADCs). For that purpose, an incremental sigma-delta analog-to-digital converter (IΣΔ ADC) has been designed. Post-layout simulations highlighted a degradation in resolution caused by circuit imperfections, so a digital correction has been investigated. This paper proposes a new reconstruction filter that takes into account not only the bit values of the modulator output sequence but also the occurrence of certain patterns. This technique has been applied to an incremental sigma-delta analog-to-digital converter in order to correct the conversion errors. Operating with 400 clock periods per conversion cycle, the theoretical resolution is 15.4 bits. Post-layout simulation shows that a 13.5-bit resolution is obtained using the classical optimal filter, whereas a 14.8-bit resolution is obtained with our reconstruction filter.

Behavioral modeling of a control module for an energy-investing piezoelectric harvester

  • Tales Luiz Bortolin
  • André Luiz Aita
  • João Baptista dos Santos Martins

This work analyzes a piezoelectric energy-harvesting system that uses a single inductor and the concept of energy investment. The harvester behavior, with special focus on its control-logic module and state machine, is fully described and modeled in Verilog-A. The needed sensors and control variables were also identified and modeled. Simulation results show the correct behavioral modeling of the piezoelectric energy-harvester system and the proposed control, highlighting the harvesting mechanism based on energy investment and the effect of the invested energy on the characteristics of the battery-charging profile. The speed of the behavioral simulations compared to electrical ones, together with the obtained model accuracy, demonstrates a reliable and promising higher-level design approach.

An IR-UWB pulse generator using PAM modulation with adaptive PSD in 130nm CMOS process

  • Luiz Carlos Moreira
  • José Fontebasso Neto
  • Walter Silva Oliveira
  • Thiago Ferauche
  • Guilherme Heck
  • Ney Laert Vilar Calazans
  • Fernando Gehm Moraes

This paper proposes an adaptive pulse generator using Pulse Amplitude Modulation (PAM). The circuit was implemented with eight Pulse Generator Units (PGUs) to produce up to eight monocycles per pulse. The number of monocycles per pulse is inversely proportional to the Power Spectral Density (PSD) bandwidth in Impulse Radio Ultra-Wide Band (IR-UWB). The complete circuit contains two pulse-generator blocks, each composed of eight PGUs, to build a rectangular waveform at the output. The PGU is implemented with Edge Combiners High (ECH) and Edge Combiners Low (ECL) to encode the information. Each Edge Combiner has a high-impedance circuit that is selected by digital control signals. The circuit has been simulated, showing an output pulse amplitude of ≈70 mV for the high logic level and ≈35 mV for the low logic level, both at a 100 MHz Pulse Repetition Frequency (PRF). This yields a mean pulse duration of ≈270 ps, a mean central frequency of ≈3.7 GHz, and a power consumption of less than 0.22 μW. The pulse-generator block occupies an area of 0.54 mm2.

NANOARCH 2018 TOC

Full Citation in the ACM Digital Library

Fast Estimations of Failure Probability Over Long Time Spans

  • Michail Noltsis
  • Panayiotis Englezakis
  • Eleni Maragkoudaki
  • Chrysostomos Nicopoulos
  • Dimitrios Rodopoulos
  • Francky Catthoor
  • Yiannakis Sazeides
  • Davide Zoni
  • Dimitrios Soudris

The shrinking of device dimensions has undoubtedly enabled the very-large-scale integration of transistors on electronic chips. However, it has also brought to the surface time-zero and time-dependent variation phenomena that degrade a system’s performance and threaten functional operation. Hence, the need to capture and describe these mechanisms, as well as to effectively model their impact, is crucial. To this end, we follow existing models and propose a complete framework that evaluates the failure probability of electronic components. To assess our framework, a case study of packet-switched Network-on-Chip (NoC) routers is presented, studying the failure probability of their SRAM buffers.

A Probabilistic Error Model and Framework for Approximate Booth Multipliers

  • Yuying Zhu
  • Weiqiang Liu
  • Jie Han
  • Fabrizio Lombardi

Approximate computing is a paradigm for high performance and low power design by compromising computational accuracy. In this paper, the structure of an approximate modified radix-4 Booth multiplier is analyzed. A probabilistic error model is proposed to facilitate the evaluation of the approximate multiplier for errors from the approximate radix-4 Booth encoding, the approximate regular partial product array, and the approximate 4–2 compressor. The normalized mean error distances (NMEDs) of 8-bit and 16-bit approximate designs are found by utilizing the proposed model. The results from the error model and the corresponding analytical framework are close to those found by simulation, thus confirming the validity of the proposed approach.
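
For reference, the NMED used here is conventionally defined as the mean error distance normalized by the maximum output magnitude (for an n×n unsigned multiplier, D = (2^n - 1)^2):

    \mathrm{NMED} = \frac{1}{N \cdot D} \sum_{i=1}^{N} \left| M_{\mathrm{approx}}(i) - M_{\mathrm{exact}}(i) \right|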

Variability-Tolerant Memristor-based Ratioed Logic in Crossbar Array

  • M. Escudero
  • I. Vourkas
  • A. Rubio
  • F. Moll

The advent of the first TiO2-based memristor in 2008 revived scientific interest, both from academia and industry, in this device technology, with several emerging applications including logic circuits. Several memristive logic families have been proposed, each with different attributes, in the current quest for the energy-efficient computing systems of the future. However, the limited endurance of memristor devices and variations (both cycle-to-cycle and device-to-device) are important parameters to be considered in the evaluation of such logic families. In this work, we build upon an accurate physics-based model of a bipolar metal-oxide resistive RAM device (supporting parasitics of the device structure and variability of switching voltages and resistance states) and use it to show how the performance of memristor-based logic circuits can be degraded owing to both variability and state drift. Based on previous work on CMOS-like memristive logic circuits, we propose a memristive ratioed-logic scheme which is crossbar-compatible, i.e., suitable for in-/near-memory computing, and tolerant to device variability; it also does not affect device endurance, since computations do not involve switching the memristor states. As a figure of merit, we compare this new logic scheme with MAGIC, focusing on the universal NOR logic gate.

High-Endurance Bipolar ReRAM-Based Non-Volatile Flip-Flops with Run-Time Tunable Resistive States

  • Mehrdad Biglari
  • Tobias Lieske
  • Dietmar Fey

ReRAM technologies feature desirable properties, e.g., fast switching and high read margin, that make them attractive candidates for non-volatile flip-flops (NVFFs). However, they suffer from limited endurance; therefore, cell-degradation considerations are a necessity for practical deployment in non-volatile processors (NVPs). In this paper, we present two bipolar ReRAM-based NVFFs, Hypnos and Morpheus, with enhanced endurance and energy efficiency. Hypnos reduces the ReRAM electrical stress during the set operation while keeping the imposed NVFF area overhead at a minimum. In Morpheus, a write-termination circuit is used to further enhance the ReRAM endurance and energy efficiency at the cost of an affordable area overhead. Moreover, both NVFFs feature run-time tunable resistive states to enable online adjustment of the tradeoff among endurance, retention, energy consumption, and restore success rate (in the case of approximate computing). Experimental results demonstrate that Hypnos reduces the ReRAM set degradation by 91% on average. Moreover, the write-termination mechanism in Morpheus further reduces the remaining degradation by 93%/97% in set/reset operations, on average. The results also demonstrate enhanced energy efficiency in both NVFFs.

An Aging Resilient Neural Network Architecture

  • Seyed Nima Mozaffari
  • Krishna Prasad Gnawali
  • Spyros Tragoudas

Recent artificial neural network architectures use memristors to store synaptic weights. The crossbar structure of memristors is used because of its density and extreme parallelism. Transistor aging impacts the computational accuracy of such architectures. An enhancement of the memristor-based neural network architecture is introduced, using a built-in current-based calibration circuit. It is shown experimentally that the proposed approach alleviates the cell-aging effect.

Overcoming Crossbar Nonidealities in Binary Neural Networks Through Learning

  • Mohammed E. Fouda
  • Jongeun Lee
  • Ahmed M. Eltawil
  • Fadi Kurdahi

Crossbar nonidealities may considerably degrade the accuracy of the matrix multiplication operation, which is the cornerstone of hardware-accelerated neural networks. In this paper, we show that crossbar nonidealities, especially the wire resistance, should be taken into consideration for accurate evaluation. We also present a simple yet highly effective way to capture the wire-resistance effect for the inference and training of deep neural networks without extensive SPICE simulations. Different scenarios have been studied and used to show the efficacy of our proposed method.

Real-Time Trainable Data Converters for General Purpose Applications

  • Loai Danial
  • Shahar Kvatinsky

Data converters are ubiquitous in data-abundant systems, where they are heterogeneously distributed across the analog-digital interface. Unfortunately, conventional data converters trade off speed, power, and accuracy. Furthermore, intrinsic real-time and post-silicon variations dramatically degrade their performance. In this paper, we employ novel neuro-inspired approaches to design smart data converters that could be trained in real-time for general purpose applications, using machine learning algorithms and artificial neural network architectures. Our approach integrates emerging memristor technology with CMOS. This concept will pave the way towards adaptive interfaces with the continuous varying conditions of data driven applications.

Programmable Molecular-Nanoparticle Multi-junction Networks for Logic Operations

  • Angelika Balliou
  • Jiri Pfleger
  • George Skoulatakis
  • Samrana Kazim
  • Jan Rakusan
  • Stella Kennou
  • Nikos Glezos

We propose and investigate a nanoscale multi-junction network architecture that can be configured on the fly to perform Boolean logic functions at room temperature. The device exploits the electronic properties of randomly deposited, molecule-interconnected metal nanoparticles, which act collectively as strongly nonlinear single-electron transistors. Disorder is incorporated in the modeling of their electrical behavior, and the collective response of the interacting nano-components is rationalized. The non-optimized energy consumption of the synaptic grid for a “then-if” logical computation is in the range of a few aJ.

Multi-Valued Logic Circuits on Graphene Quantum Point Contact Devices

  • Konstantinos Rallis
  • Georgios Ch. Sirakoulis
  • Ioannis Karafyllidis
  • Antonio Rubio

Graphene quantum point contacts (G-QPCs) combine switching operations with quantized conductance, which can be modulated by top and back gates. Here we use the conductance quantization to design and simulate multi-valued logic (MVL) circuits and, more specifically, an adder. The adder comprises two G-QPCs connected in parallel. We compute the conductance of the adder for various inputs and show that graphene MVL circuits are feasible.

Sequential Circuit Design with Bilayer Avalanche Spin Diode Logic

  • Vaibhav Vyas
  • Joseph S. Friedman

Novel computing paradigms like the fully cascadable InSb bilayer avalanche spin-diode logic (BASDL) are capable of performing complex logic operations. Although the original work provides a comprehensive explanation of the device structure, the fundamental logic set, and basic combinational circuits, it lacks sequential circuit design. This paper addresses that void by demonstrating the structural design of SR and D-type latches in BASDL. Novel latch topologies are proposed that take full advantage of the BASDL-based logic set while maintaining conventional latch functionality. The effective operation of these latches is verified through a complete logic-level analysis and a brief insight into their physical implementation.

Complementary Arranged Graphene Nanoribbon-based Boolean Gates

  • Yande Jiang
  • Nicoleta Cucu Laurenciu
  • Sorin Cotofana

With CMOS feature size heading towards atomic dimensions, unjustifiable static power, reliability, and economic implications are exacerbating, prompting research on new materials, devices, and/or computation paradigms. Within this context, Graphene Nanoribbons (GNRs), owing to graphene’s excellent electronic properties, may serve as basic blocks for carbon-based nanoelectronics. In this paper, we build upon the fact that GNR behaviour can be controlled according to some desired functionality via top/back gate contacts, and propose to combine GNRs with complementary functionalities to construct Boolean gates. To this end, we introduce a generic GNR-based Boolean gate structure composed of two GNRs, i.e., a pull-up GNR performing the gate’s Boolean function and a pull-down GNR performing the inverted Boolean function. Subsequently, by properly adjusting the GNRs’ dimensions and topology, we design 2-input AND, NAND, and XOR graphene-based Boolean gates, as well as 1-input gates, i.e., an inverter and a buffer. Our SPICE simulations indicate that the proposed gates exhibit a smaller propagation delay (from 23% smaller for the XOR gate to 6x smaller for the AND gate) and two orders of magnitude lower power consumption when compared with 7 nm CMOS counterparts, while requiring a 1-to-2-orders-of-magnitude smaller active area footprint. These results clearly indicate that GNR-based gates have great potential as basic building blocks for future beyond-CMOS, energy-effective nanoscale circuits.

CCE: A Combined SRAM and Non Volatile Cache for Endurance of Next Generation Multilevel Non Volatile Memories in Embedded Systems

  • Linbin Chen
  • Pilin Junsangsri
  • Pedro Reviriego
  • Fabrizio Lombardi

In this paper, we present Combined Cache for Endurance (CCE), a scheme to enable the use of next-generation high-density multilevel non-volatile memories in embedded systems. These memories are attractive as they can reduce static power consumption dramatically, and a single memory can potentially be used, avoiding the need for both flash and SRAM or DRAM in a system. However, a common drawback of the new multilevel non-volatile memories is that they support a limited number of write operations, and thus their endurance needs to be improved to make them a viable alternative for the main memory of embedded systems. The proposed CCE relies on the fact that most writes are concentrated on a few addresses. Therefore, a small SRAM cache can be used to store positions that are frequently written. However, this alone would not preserve the non-volatile nature of the memory. To do so, in the proposed CCE, the cache cell has an SRAM part and a non-volatile part. At power-up, the contents of the non-volatile part are copied to the SRAM, and the other way around at power-down. As many embedded systems execute predictable workloads, this cache is statically set to cover the most frequently written addresses. The evaluation shows that CCE can increase the endurance of the memory by several orders of magnitude. At the same time, the overheads required to implement the cache are small relative to the main memory. Therefore, CCE can be an interesting option to improve the endurance of next-generation high-density multilevel non-volatile memories.

Regular Expression Matching with Memristor TCAMs for Network Security

  • Catherine E. Graves
  • Wen Ma
  • Xia Sheng
  • Brent Buchanan
  • Le Zheng
  • Si-Ty Lam
  • Xuema Li
  • Sai Rahul Chalamalasetti
  • Lennie Kiyama
  • Martin Foltin
  • John Paul Strachan
  • Matthew P. Hardy

We propose using memristor-based TCAMs (Ternary Content-Addressable Memories) to accelerate Regular Expression (RegEx) matching. RegEx matching is a key function in network security, where deep packet inspection finds and filters out malicious actors. However, RegEx matching latency and power can be very high, and current proposals struggle to perform wire-speed matching for large-scale rulesets. Our approach dramatically decreases RegEx matching power, provides high throughput, and the use of mTCAMs enables novel compression techniques to expand ruleset sizes and allows future exploitation of the multi-state (analog) capabilities of memristors. We fabricated and demonstrated nanoscale memristor TCAM cells. SPICE simulations investigate mTCAM performance at scale, and an mTCAM power model at 22 nm demonstrates 0.2 fJ/bit/search energy for a 36×400 mTCAM. We further propose a tiled architecture that implements a Snort ruleset and assess the application performance. Compared to a state-of-the-art FPGA approach (2 Gbps, ~1 W), we show 4x throughput (8 Gbps) at 60% of the power (0.62 W) before applying standard TCAM power-saving techniques. Our performance comparison improves further when striding (searching multiple characters) is considered, resulting in 47.2 Gbps at 1.3 W for our approach compared to 3.9 Gbps at 630 mW for the strided FPGA NFA, demonstrating a promising path to wire-speed RegEx matching on large-scale rulesets.

A Novel Cross-point MRAM with Diode Selector Capable of High-Density, High-Speed, and Low-Power In-Memory Computation

  • Chaoxin Ding
  • Wang Kang
  • He Zhang
  • Youguang Zhang
  • Weisheng Zhao

In-Memory Computation (IMC), which is capable of reducing the power consumption and bandwidth requirements resulting from data transfer between processing and memory units, has been considered a promising technology for breaking the von Neumann bottleneck. In order to develop an effective and efficient IMC platform, the performance of the memory itself, such as density, operation speed, and power consumption, is one of the most important keys. In this work, we report a cross-point magnetic random access memory (MRAM) with a diode selector for IMC implementation. The memory cell consists of a magnetic tunnel junction (MTJ) device and a diode connected in series. The memory cells are arranged in a cross-point array structure, providing high storage density. The MTJ can be switched through the unipolar precessional voltage-controlled magnetic anisotropy (VCMA) effect, thus enabling high speed and low power. Further, Boolean logic functions can be realized via regular memory-like write and read operations. The feasibility and performance of the proposed IMC in the cross-point MRAM are successfully demonstrated with hybrid VCMA-MTJ/CMOS circuit simulations at the 40 nm technology node.

Hardware Acceleration Implementation of Sparse Coding Algorithm with Spintronic Devices

  • Deming Zhang
  • Yanchun Hou
  • Chengzhi Wang
  • Jie Chen
  • Lang Zeng
  • Weisheng Zhao

In this paper, we explore hardware acceleration of a sparse coding algorithm with spintronic devices through a series of design optimizations across the architecture, circuit, and device levels. First, a domain-wall-motion (DWM) based compound spintronic device (CSD) is engineered and modelled, which is envisioned to achieve multiple conductance states. Subsequently, a parallel architecture is presented based on a dense cross-point array of the proposed DWM-based CSD, where each dictionary (D) value can be mapped to the conductance of the DWM-based CSD at the corresponding cross-point. Benefiting from massively parallel read and write operations, the proposed parallel architecture can accelerate the selected sparse coding algorithm using a dedicated peripheral read-and-write circuit. Experimental results show that the selected sparse coding algorithm can be accelerated by 1400x with the proposed parallel architecture in comparison with a software implementation. Moreover, its energy dissipation is 8 orders of magnitude smaller than that of the software implementation.

Quantum-dot Cellular Automata RAM design using Crossbar Architecture

  • Orestis Liolis
  • Vassilios A. Mardiris
  • Georgios Ch. Sirakoulis
  • Ioannis G. Karafyllidis

In this paper, a new approach to RAM circuits using Quantum-dot Cellular Automata (QCA), based on a programmable crossbar architecture, is presented. In addition, a methodology for 2^n-bit RAMs is presented; using it, a designer can design a RAM of any size. The proposed designs utilize the benefits of the QCA programmable crossbar architecture: namely, the RAM circuit is characterized by regularity and the ability to be customized. These features of the proposed RAM design methodology allow designers to use the available circuit area efficiently.

Integrated Synthesis Methodology for Crossbar Arrays

  • M. Ceylan Morgul
  • Onur Tunali
  • Mustafa Altun
  • Luca Frontini
  • Valentina Ciriani
  • E. Ioana Vatajelu
  • Lorena Anghel
  • Csaba Andras Moritz
  • Mircea R. Stan
  • Dan Alexandrescu

Nano-crossbar arrays have emerged as area- and power-efficient structures with the aim of achieving high-performance computing beyond the limits of current CMOS. Due to the stochastic nature of nano-fabrication, nano arrays show different properties, at both the structural and physical device levels, compared to conventional technologies. These factors introduce random characteristics that need to be carefully considered by the synthesis process. For instance, a competent synthesis methodology must consider the basic technology preference for switching elements, the defect or fault rates of the given nano switching array, and the variation values, as well as their effects on performance metrics including power, delay, and area. The synthesis methodology presented in this study comprehensively covers all of these factors and provides optimization algorithms for each step of the process.

Minimal Disturbed Bits in Writing Resistive Crossbar Memories

  • Mohammed E Fouda
  • Ahmed M. Eltawil
  • Fadi Kurdahi

Resistive memories are promising candidates for non-volatile memories. Write disturb is one of the problems facing this kind of memory. In this paper, the write-disturb problem is mathematically formulated in terms of the bias parameters and optimized analytically. A closed-form solution for the optimal bias parameters is derived. Results are compared with the 1/2 and 1/3 bias schemes, showing a significant improvement.
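
As background on the baselines (the classic schemes, not the paper’s derivation): when a selected cell is written with V_w, the V/2 scheme exposes half-selected cells on the active row and column to V_w/2 while fully unselected cells see 0 V; the V/3 scheme exposes every non-selected cell to V_w/3. A one-line summary in code:

    def disturb_profile(Vw, scheme):
        """Voltages across non-selected cells under the classic biases."""
        if scheme == "V/2":
            return {"half-selected": Vw / 2, "unselected": 0.0}
        if scheme == "V/3":
            return {"half-selected": Vw / 3, "unselected": Vw / 3}
        raise ValueError(f"unknown scheme: {scheme}")

The paper’s contribution, as described, is to treat these fractions as free bias parameters and solve for the disturb-optimal values in closed form.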

A Recursive Growing & Featuring Mechanism for Nanocomputing Structures

  • Mihaela Maliţa
  • Gheorghe M. Ştefan

The huge range of physical possibilities offered by emerging nanotechnologies must be accompanied, beyond the uniform growing mechanisms assumed by current serial and/or parallel extensions, by an appropriate structuring mechanism able to support the increasing functional demands efficiently. A recursive growing mechanism is proposed for the upcoming Nano-Era. The current growing mechanism involves only purely quantitative aspects; for very large systems, we consider a mechanism that interleaves quantitative aspects with functional ones to be mandatory. Because computational parallelism is implicit in large systems, the growing mechanism must also be supported by an appropriate computational model. Current systems start from gates; for the Nano-Era structuring mechanism, we start from cellular automata. The main difference is that for nanoarchitectures the growing mechanism and the featuring mechanism are unified in a single recursive mechanism.

Free BDD based CAD of Compact Memristor Crossbars for in-Memory Computing

  • Amad Ul Hassen
  • Salman Anwar Khokhar
  • Haseeb Aslam Butt
  • Sumit Kumar Jha

The demise of Moore’s law, the breakdown of Dennard scaling, the dark silicon phenomenon, process variation, leakage currents, and quantum tunneling are some of the hurdles facing the further advancement of computing systems today. As a result, there is renewed interest in alternative computing paradigms using emerging nanoelectronic devices. This work uses free binary decision diagrams (FBDDs) for the computer-aided design (CAD) of compact memristive crossbars for sneak-path-based in-memory computing. The absence of a fixed variable ordering makes FBDDs more compact than their ordered counterparts, reduced ordered binary decision diagrams (ROBDDs). Our design uses the size of the circuit representation of Boolean functions to select different variable orderings along different paths, which results in compact FBDDs. We demonstrate our approach by designing compact crossbars for a four-bit multiplier and other RevLib benchmarks. Our synthesis process yields a 50.1% reduction in area over the previous FBDD-based synthesis for the fourth output bit of the multiplier. Overall, our approach reduces the multiplier area by 20.1%.

Crosstalk based Fine-Grained Reconfiguration Techniques for Polymorphic Circuits

  • Naveen Macha
  • Sandeep Geedipally
  • Bhavana Repalle
  • Md Arif Iqbal
  • Wafi Danesh
  • Mostafizur Rahman

Truly polymorphic circuits, whose functionality/circuit behavior can be altered using a control variable, can provide tremendous benefits in multi-functional system design and resource sharing; they can also be crucial for secure and fault-tolerant hardware designs. Polymorphic circuits in the literature so far rely either on environmental parameters, such as temperature or variation, or on special devices, such as ambipolar FETs and configurable magnetic devices, which often results in inefficiencies in performance and/or realization. In this paper, we introduce a novel polymorphic circuit design approach in which deterministic interference between nano-metal lines is leveraged for logic computing and configuration. For computing, the proposed approach relies on nano-metal lines, their interference, and commonly used FETs. For polymorphism, it requires only an extra metal line that carries the control signal. In this paper, we show a wide range of crosstalk polymorphic logic gates and their evaluation results. We also show an example of a larger circuit that performs the functionalities of both a multiplier and a sorter, depending on the configuration signal. A comparison is made with respect to other existing approaches in the literature, and transistor count is benchmarked. For crosstalk-polymorphic circuits, the transistor-count reduction ranges from 25% to 83% with respect to various other approaches. For example, the polymorphic AOI21-OA21 cell shows 83%, 85%, and 50% transistor-count reductions, and the Multiplier-Sorter circuit shows 40%, 36%, and 28% transistor-count reductions, with respect to CMOS, genetically evolved, and ambipolar-transistor-based polymorphic circuits, respectively.

A Novel Analog to Digital Conversion Concept with Crosstalk Computing

  • Rajanikanth Desh
  • Naveen Kumar Macha
  • Sehtab Hossain
  • Repalle Bhavana Tejaswini
  • Mostafizur Rahman

Analog-to-Digital Converters (ADCs) are core components of computing systems, forming a link between external stimuli and digital microprocessor operations. Current CMOS-based fast ADCs are difficult to scale due to their reliance on transistor sizing and high-voltage operation, and they also suffer from high power consumption. In this paper, we introduce a novel ADC design that uses the deterministic signal interference between metal lines as the mechanism for signal conversion. In contrast to CMOS ADCs, our approach uses a simple crosstalk tree network of metal lines to convert sampled analog levels to a digital code. Here, the sampled analog signal is passed through an input metal line that is capacitively coupled to a series of metal lines in a tree-like layout, and the coupled voltages on the edge of the tree (the leaves) determine the output. The resolution depends on the number of branches. We show 2-bit and 3-bit ADCs implemented through this mechanism at the 16 nm technology node. Our results indicate the possibility of large power savings with Crosstalk ADCs in comparison to CMOS; for the 2-bit and 3-bit ADCs, the power consumption was found to be 43.51 μW and 96.74 μW, respectively, at a 50 MHz sampling frequency.
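
The conversion mechanism reduces, to first order, to a capacitive divider on each branch of the tree (a simplified view, not the paper’s extracted model): a coupling capacitance C_c driving a leaf node of ground capacitance C_g settles to

    V_{\mathrm{out}} = \frac{C_c}{C_c + C_g}\, V_{\mathrm{in}}

so choosing branch geometries sets the thresholds that map an analog level to a digital code.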

Energy Efficiency of Low Swing Signaling for Emerging Interposer Technologies

  • Eleni Maragkoudaki
  • Przemyslaw Mroszczyk
  • Vasilis F. Pavlidis

Interconnects often constitute the major bottleneck in the design of low-power integrated circuits (ICs). Although 2.5-D integration technologies support physical proximity, the power dissipated in the communication links remains high. In this work, the additional power savings enabled by low-swing signaling for interposer-based interconnects are investigated. The energy consumed by a low-swing scheme is compared with a full-swing solution, and the critical length of the interconnect, above which the low-swing solution starts to pay off, is determined for diverse interposer technologies. The energy consumption is compared for three different substrate materials: silicon, glass, and organic. Results indicate that the higher the load capacitance of the communication medium, the greater the energy savings of the low-swing circuit. Specifically, in cases where electrostatic discharge (ESD) protection is required, the low-swing circuit is always superior in terms of energy consumption due to the high capacitive load of the ESD circuit, regardless of the substrate material and the link length. Without ESD protection, the highest critical length is about 380 μm for glass and organic interposers. To further explore the limits of power reduction from low-swing signaling for 2.5-D ICs, the effect of typical interconnect parameters, such as width and spacing, on the energy efficiency of low-swing communication is evaluated.

Energy-Efficient 4T SRAM Bitcell with 2T Read-Port for Ultra-Low-Voltage Operations in 28 nm 3D Monolithic CoolCubeTM Technology

  • Reda Boumchedda
  • Jean-Philippe Noel
  • Bastien Giraud
  • Adam Makosiej
  • Marco Antonio Rios
  • Eduardo Esmanhotto
  • Emilien Bourde-Cicé
  • Mathis Bellet
  • David Turgis
  • Edith Beigne

This paper presents a 4T-based SRAM bitcell optimized for both write and read operations at ultra-low voltage (ULV). The proposed bitcell is designed to meet the requirements of energy-constrained systems, as in the case of most IoT-oriented circuits and applications. The use of 3D CoolCubeTM technology enables the design of a stable 4T SRAM bitcell by using data-dependent back biasing. The proposed bitcell architecture provides a major reduction of the write-operation energy consumption compared to a conventional 6T bitcell. A dedicated read port coupled to a virtual GND (VGND) ensures full functionality of read operations at ULV. Simulation results show reliable operation down to 0.35 V, close to six sigma (6σ), without any assist techniques (e.g., negative bitlines), achieving, in the worst-case corner, 300 ns write and 125 ns read access times. A 6x energy-consumption reduction compared to a ULV ultra-low-leakage (ULL) 6T bitcell is demonstrated.

Energy-efficient MFCC extraction architecture in mixed-signal domain for automatic speech recognition

  • Qin Li
  • Huifeng Zhu
  • Fei Qiao
  • Qi Wei
  • Xinjun Liu
  • Huazhong Yang

This paper proposes a novel processing architecture to extract Mel-Frequency Cepstrum Coefficients (MFCC) for automatic speech recognition. Inspired by the human ear, energy-efficient analog-domain information processing is adopted to replace the energy-intensive Fourier Transform of the conventional digital domain. Moreover, the proposed architecture extracts the acoustic features in the mixed-signal domain, which significantly reduces the cost of the Analog-to-Digital Converter (ADC) and the computational complexity. We carry out circuit-level simulation based on a 180 nm CMOS technology, which shows an energy consumption of 2.4 nJ/frame and a processing speed of 45.79 μs/frame. The proposed architecture achieves 97.2% energy savings and about a 6.4x speedup over the state of the art. Speech-recognition simulation reaches a classification accuracy of 99% using the proposed MFCC features.
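
For orientation, the conventional digital MFCC pipeline that the authors partially move into the analog domain is: windowed frame -> power spectrum -> mel filterbank -> log -> DCT. A minimal digital reference in NumPy/SciPy (the mel filterbank matrix is assumed precomputed):

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_frame(frame, mel_fb, n_coeffs=13):
        """Reference digital MFCC for one windowed frame.
        mel_fb: precomputed mel filterbank (n_filters x n_rfft_bins).
        The paper replaces the FFT/filterbank stages with analog blocks."""
        spectrum = np.abs(np.fft.rfft(frame)) ** 2         # power spectrum
        mel_energies = mel_fb @ spectrum                   # mel weighting
        log_energies = np.log(mel_energies + 1e-12)        # compression
        return dct(log_energies, norm="ortho")[:n_coeffs]  # decorrelation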

Power Analysis of an mRNA-Ribosome System

  • Pratima Chatterjee
  • Prasun Ghosal

Energy is at the heart of any device. As researchers push toward ever lower-energy operation, energy requirements have become one of the key measures of a device’s performance. At the same time, as conventional silicon-based computing approaches a barrier, the need for non-conventional computing is increasing. Though several such computing platforms have arisen as potential alternatives to silicon-based computing, low energy requirements are certainly among the most sought-after features in the competition among the new platforms. Moreover, there are scenarios where performing computations in purely bio-molecular ways is highly desirable. Although DNA computing has already demonstrated the promise of bio-molecular computing in terms of energy/power requirements, its manual nature keeps it behind other computing techniques. Another bio-molecular computing technique, ribosomal computing, though still in its infancy, has shown real promise due to its inherent automation. This work analyzes the energy/power requirements of this computing technique. With the promising results obtained, ribosomal computing, combined with its inherent automation, can claim to be a promising computing technique.

Controlling distilleries in fault-tolerant quantum circuits: problem statement and analysis towards a solution

  • Alexandru Paler

The failure susceptibility of the quantum hardware will force quantum computers to execute fault-tolerant quantum circuits. These circuits are based on quantum error correcting codes, and there is increasing evidence that one of the most practical choices is the surface code. Design methodologies of surface code based quantum circuits were focused on the layout of such circuits without emphasizing the reduced availability of hardware and its effect on the execution time. Circuit layout has not been investigated for practical scenarios, and the problem presented herein was neglected until now. For achieving fault-tolerance and implementing surface code based computations, a significant amount of computing resources (hardware and time) are necessary for preparing special quantum states in a procedure called distillation. This work introduces the problem of how distilleries (circuit portions responsible for state distillation) influence the layout of surface code protected quantum circuits, and analyses the tradeoffs for reducing the resources necessary for executing the circuits. A first algorithmic solution is presented, implemented and evaluated for addition quantum circuits.

Signal Synchronization in Large Scale Quantum-dot Cellular Automata Circuits

  • Vassilios A. Mardiris
  • Orestis Liolis
  • Georgios Ch. Sirakoulis
  • Ioannis G. Karafyllidis

Quantum-dot fabrication is a well-established nanotechnology with applications in many different scientific fields. By placing four quantum dots at the corners of a square, a cell is formed in which digital information can be stored. This cell serves as the structural device of Quantum-dot Cellular Automata (QCA) circuits. Since QCA was introduced, several digital circuits and systems have been designed and proposed in the literature. However, one of the biggest problems QCA designers face in the successful design of functional, large-scale QCA circuits is signal synchronization. In this paper, a novel approach to this problem is presented, inspired by the well-known computational problem of Firing Squad Synchronization (FSS). The FSS problem has many similarities with the synchronization problem of large-scale QCA circuits; in addition, it has been studied by many researchers, and many efficient solutions have been proposed in the literature.

Size Optimization of MIGs with an Application to QCA and STMG Technologies

  • Heinz Riener
  • Eleonora Testa
  • Luca Amaru
  • Mathias Soeken
  • Giovanni De Micheli

Majority-inverter graphs (MIGs) are a logic representation with remarkable algebraic and Boolean properties that enable efficient logic optimizations beyond the capabilities of traditional logic representations. Further, since many emerging nanotechnologies, such as quantum-dot cellular automata (QCA) or spin-torque majority gates (STMG), are inherently majority-based, MIGs serve as a natural logic representation to map into these technologies. So far, MIG optimization methods have predominantly aimed to reduce the depth of the logic networks, corresponding to low-delay implementations in the respective technologies. In this paper, we introduce several methods to optimize the size of MIGs. They can be applied such that the depth of the logic network is preserved; therefore, our methods have a direct effect on the physical area without worsening the delay. Some methods are inspired by existing size-optimization algorithms for non-majority-based logic networks; others make explicit use of the majority function and its properties. All methods are Boolean, in contrast to algebraic optimization methods, which has a positive effect on quality but challenges their implementation. Our experiments show that using our methods, the size of MIGs in the EPFL combinational benchmark suite can be reduced by up to 7.12%. When mapped to QCA and STMG technologies, we reduce the average area-delay-energy product by 2.31% and 2.07%, respectively.

Representation of Qubit States using 3D Memristance Spaces: A first step towards a Memristive Quantum Simulator

  • Ioannis Karafyllidis
  • Georgios Ch. Sirakoulis
  • Panagiotis Dimitrakis

The development of quantum simulators is a major step towards the universal quantum computer. Quantum simulators are quantum systems that can perform specific quantum computations, or software packages that can reproduce most aspects of a general universal quantum computer on a general-purpose classical computer. Developing quantum simulators using digital circuits, such as FPGAs, is very difficult, mainly because the unit of quantum information, the qubit, has an infinite number of states, whereas the classical bit has only two. On the other hand, analog circuits comprising R, L, and C elements have no internal state variables that can be used to reproduce and store qubit states. Here we take the first step towards the development of a new quantum simulator using memristors. The qubit state is mapped to a 3D space spanned by the memristances of three identical memristors. The qubit state evolution is reproduced by the input voltages applied to the memristors. We define the correspondence between general qubit-state rotations, i.e., the one-qubit quantum gates, and memristor input-voltage variations, and reproduce the rotations imposed by the action of quantum gates in the 3D memristance space. Our results show that, at least in principle, qubits and one-qubit quantum gates can be simulated by memristors.
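
The mapping rests on the standard parametrization of a pure qubit state, whose two continuous degrees of freedom (θ, φ) are what the three memristances jointly encode:

    |\psi\rangle = \cos\tfrac{\theta}{2}\,|0\rangle + e^{i\varphi}\,\sin\tfrac{\theta}{2}\,|1\rangle

with the corresponding Bloch vector (sin θ cos φ, sin θ sin φ, cos θ) giving the natural 3D target space.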

ISLPED 2018 TOC

Full Citation in the ACM Digital Library

SESSION: Machine Learning – Inference

Value-driven Synthesis for Neural Network ASICs

  • Zhiyuan Yang
  • Ankur Srivastava

In order to enable low-power and high-performance evaluation of neural network (NN) applications, we investigate new design methodologies for synthesizing neural network ASICs (NN-ASICs). An NN-ASIC takes a trained NN and implements a chip with customized optimization. Knowing the NN topology and weights allows us to develop unique optimization schemes that are not available to regular ASICs. In this work, we investigate two types of value-driven optimized multipliers that exploit knowledge of the synaptic weights, and we develop an algorithm to synthesize the multiplication of trained NNs using these special multipliers instead of general ones. The proposed method is evaluated using several Deep Neural Networks. Experimental results demonstrate that, compared to traditional NNPs, our proposed NN-ASICs can achieve up to 6.5x and 55x improvements in performance and energy efficiency (i.e., inverse of Energy-Delay Product), respectively.

CLINK: Compact LSTM Inference Kernel for Energy Efficient Neurofeedback Devices

  • Zhe Chen
  • Andrew Howe
  • Hugh T. Blair
  • Jason Cong

Neurofeedback devices measure brain waves and generate feedback signals in real time, and they can be employed as treatments for various neurological diseases. Such devices require high energy efficiency because they need to be worn by or surgically implanted into patients and must support long battery lifetimes. In this paper, we propose CLINK, a compact LSTM inference kernel, to achieve highly energy-efficient EEG signal processing for neurofeedback devices. The LSTM kernel can approximate conventional filtering functions while saving 84% of the computational operations. Based on this method, we propose energy-efficient customizable circuits for realizing the CLINK function. We demonstrated a 128-channel EEG processing engine on a Zynq-7030 at 0.8 W, and a scaled-up 2048-channel evaluation on a Virtex-VU9P shows that our design can achieve 215x and 7.9x higher energy efficiency than highly optimized implementations on an E5-2620 CPU and a K80 GPU, respectively. We also carried out the CLINK design in a 15-nm technology, and synthesis results show that it can achieve 272.8 pJ/inference energy efficiency, which further outperforms our design on the Virtex-VU9P by 99x.

Compact Convolution Mapping on Neuromorphic Hardware using Axonal Delay

  • Jinseok Kim
  • Yulhwa Kim
  • Sungho Kim
  • Jae-Joon Kim

Mapping a Convolutional Neural Network (CNN) to neuromorphic hardware has been inefficient in synapse memory usage because neither kernel reuse nor input reuse is exploited well. We propose a method to enable kernel reuse by utilizing axonal delay, a biological parameter of a spiking neuron. Using IBM TrueNorth as a test platform, we demonstrate that the number of cores, neurons, synapses, and synaptic operations per time step can be reduced by up to 20.9x, 27.9x, 88.4x, and 1586x, respectively, compared to the conventional scheme, which raises the possibility of implementing large-scale CNNs on neuromorphic hardware.

NNest: Early-Stage Design Space Exploration Tool for Neural Network Inference Accelerators

  • Liu Ke
  • Xin He
  • Xuan Zhang

Deep neural networks (DNNs) have achieved spectacular success in recent years. In response to DNNs' enormous computation demands and memory footprints, numerous inference accelerators have been proposed. However, the diverse nature of DNNs, at both the algorithm level and the parallelization level, makes it hard to arrive at a "one-size-fits-all" hardware design. In this paper, we develop NNest, an early-stage design space exploration tool that can speedily and accurately estimate the area/performance/energy of DNN inference accelerators based on high-level network topology and architecture traits, without the need for low-level RTL code. Equipped with a generalized spatial architecture framework, NNest is able to perform fast high-dimensional design space exploration across a wide spectrum of architectural/micro-architectural parameters. Our proposed novel data movement strategies and multi-layer fitting schemes allow NNest to more effectively exploit the parallelism inherent in DNNs. Results generated by NNest demonstrate: 1) previously undiscovered accelerator design points that outperform state-of-the-art implementations by 39.3% in energy efficiency; 2) Pareto frontier curves that comprehensively and quantitatively reveal the multi-objective tradeoffs in custom DNN accelerators; 3) holistic design exploration of different levels of quantization techniques, including the recently proposed binary neural network (BNN).
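
The flavor of such early-stage exploration can be conveyed with a toy loop that sweeps two architectural parameters under a stand-in analytic cost model and keeps the Pareto-optimal (latency, energy) points; the model and its coefficients are made up for illustration and are not NNest's:

    def estimate(pe, buf_kb, macs=1e9):
        # Crude stand-in cost model: more PEs cut latency but add
        # data-movement energy; bigger buffers cut DRAM traffic but leak longer.
        latency = macs / (pe * 1e9)                                   # seconds
        energy = (macs * 1e-12 * (1 + 64 / buf_kb) * (1 + pe / 256)   # joules
                  + buf_kb * 2e-4 * latency)
        return latency, energy

    configs = [(pe, buf, *estimate(pe, buf))
               for pe in (64, 256, 1024) for buf in (32, 128, 512)]
    pareto = [c for c in configs
              if not any(o[2] <= c[2] and o[3] <= c[3] and o[2:] != c[2:]
                         for o in configs)]
    for pe, buf, lat, en in sorted(pareto):
        print(f"PEs={pe:4d}  buf={buf:3d}KB  latency={lat:.2e}s  energy={en:.2e}J")

A real tool replaces the toy model with calibrated area/performance/energy estimators and sweeps many more dimensions, but the dominance filter that yields the Pareto frontier is the same.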

SESSION: Hardware Security

Blacklist Core: Machine-Learning Based Dynamic Operating-Performance-Point Blacklisting for Mitigating Power-Management Security Attacks

  • Sheng Zhang
  • Adrian Tang
  • Zhewei Jiang
  • Simha Sethumadhavan
  • Mingoo Seok

Most modern computing devices make available fine-grained control of operating frequency and voltage for power management. These interfaces, as demonstrated by recent attacks, open up a new class of software fault injection attacks that compromise security on commodity devices. CLKSCREW, a recently published attack that stretches the frequency of devices beyond their operational limits to induce faults, is one such attack. Statically and permanently limiting the frequency and voltage modulation space, i.e., guard-banding, could mitigate such attacks, but it incurs large performance degradation and long testing time. Instead, in this paper, we propose a run-time technique which dynamically blacklists unsafe operating performance points using a neural-net model. The model is first trained offline at design time and then adjusted at run time by inspecting a selected set of features such as power-management control registers, timing-error signals, and core temperature. We design the algorithm and hardware, termed the BlackList (BL) core, which is capable of detecting and mitigating such power-management-based security attacks with high accuracy. The BL core incurs a reasonably small overhead in power, delay, and area.
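
A minimal sketch of the run-time decision (the feature set follows the paper's description; the logistic form and weights are made-up stand-ins for the trained neural-net model):

    import numpy as np

    def opp_is_unsafe(freq_ghz, volt, temp_c, timing_errs):
        # Tiny logistic "blacklist" score over run-time features.
        x = np.array([freq_ghz, volt, temp_c / 100.0, timing_errs])
        w = np.array([1.8, -3.5, 0.9, 2.2])         # illustrative weights only
        score = 1.0 / (1.0 + np.exp(-(w @ x - 1.0)))
        return score > 0.5                          # blacklist this OPP?

    print(opp_is_unsafe(2.8, 0.9, 75, 3))   # aggressive point with errors seen
    print(opp_is_unsafe(1.2, 1.0, 40, 0))   # conservative point

The run-time adjustment described in the paper then corresponds to re-fitting such a model online as new timing-error and temperature observations arrive.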

Threshold Defined Camouflaged Gates in 65nm Technology for Reverse Engineering Protection

  • Anirudh S. Iyengar
  • Deepak Vontela
  • Ithihasa Reddy
  • Swaroop Ghosh
  • Syedhamidreza Motaman
  • Jae-won Jang

Due to the ever-increasing threat of Reverse Engineering (RE) of Intellectual Property (IP) for malicious gains, camouflaging of logic gates is becoming very important. In this paper, we present an experimental demonstration of transistor threshold-voltage-defined switch [2] based camouflaged logic gates that can hide six logic functionalities, i.e., NAND, AND, NOR, OR, XOR, and XNOR. The proposed gates can be used to design the IP, forcing an adversary to perform brute-force guess-and-verify of the underlying functionality, thereby increasing the RE effort. We propose two flavors of camouflaging, one employing only a pass transistor (NMOS switch) and the other utilizing a full pass transistor (CMOS switch). The camouflaged gates are used to design ring oscillators (ROs) in ST 65nm technology, one for each functionality, on which we have performed temperature, voltage, and process-variation analyses. We observe that the CMOS-switch-based camouflaged gate offers higher performance (~1.5-8X better) than the NMOS-switch-based gate at an added area cost of only 5%. The proposed gates remain functional down to 0.65V. We are also able to reclaim lost performance by dynamically changing the switch gate voltage, and we show that robust operation can be achieved at lower voltage and under temperature fluctuations.

Reliability and Uniformity Enhancement in 8T-SRAM based PUFs operating at NTC

  • Pramesh Pandey
  • Asmita Pal
  • Koushik Chakraborty
  • Sanghamitra Roy

SRAM-based PUFs (SPUFs) have emerged as promising security primitives for low-power devices. However, operating 8T-SPUFs in the Near-Threshold Computing (NTC) realm is plagued by exacerbated process variation (PV) sensitivity, which thwarts their reliable operation. In this paper, we demonstrate the massive degradation in the reliability and uniformity characteristics of 8T-SPUFs at NTC. By exploiting the opportunities afforded by the schematic asymmetry of 8T-SPUF cells, we propose biasing- and sizing-based design strategies. Our techniques reduce the percentage of unreliable cells by more than 55% and improve the proximity to ideal uniformity by 82%, over a baseline NTC 8T-SPUF with no enhancement.

Efficient and Secure Group Key Management in IoT using Multistage Interconnected PUF

  • Hongxiang Gu
  • Miodrag Potkonjak

Secure group-oriented communication is crucial to a wide range of applications in the Internet of Things (IoT). Security problems related to group-oriented communications in IoT-based applications placed in privacy-sensitive environments have become a major concern along with the development of the technology. Unfortunately, many IoT devices are designed to be portable and lightweight; thus, their functionalities, including security modules, are heavily constrained by limited energy resources (e.g., battery capacity). To address these problems, we propose a group key management scheme based on a novel physically unclonable function (PUF) design, the multistage interconnected PUF (MIPUF), to secure group communications in an energy-constrained environment. Our design is capable of performing key management tasks such as key distribution, key storage, and rekeying securely and efficiently. We show that our design is secure against multiple attack methods, and our experimental results show that our design saves 47.33% of energy on average compared to a state-of-the-art Elliptic-curve cryptography (ECC)-based key management scheme.

SESSION: Energy Efficient Wireline Circuits

An Energy-Efficient High-Swing PAM-4 Voltage-Mode Transmitter

  • Lejie Lu
  • Yong Wang
  • Hui Wu

As the data rate of high-speed I/Os continues to increase, four-level pulse amplitude modulation (PAM-4) is adopted to improve bandwidth density and link margin at 50 Gb/s and beyond. Compared to non-return-to-zero (NRZ) signaling, however, the PAM-4 eye height is reduced, which calls for a larger transmitter swing to maintain the signal-to-noise ratio. A new energy-efficient transmitter is proposed to generate large-swing PAM-4 signals with a cascode voltage-mode driver and supporting pre-drivers and logic circuits. By reconfiguring the pull-up and pull-down branches based on the transmit data and steering the bypass currents, the proposed voltage-mode driver significantly reduces power consumption compared to conventional implementations while maintaining impedance matching. A voltage-stacking technique is adopted for the pre-drivers to further improve energy efficiency. To demonstrate the new transmitter design, a prototype 56 Gb/s PAM-4 transmitter is designed using a generic 28-nm CMOS technology with a 2-V power supply voltage. It achieves an overall output swing of 2 V and a minimum eye height of 490 mV with good linearity (98.7% level separation mismatch ratio). Compared to a conventional voltage-mode transmitter design with the same swing, the static power consumption of the new transmitter is reduced by almost half (from 30 mW to 16 mW), and its overall energy efficiency improves from 0.7 pJ/b to 0.5 pJ/b.
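
As signaling background (generic PAM-4 encoding, not the paper's driver circuitry), each symbol carries two bits on one of four amplitude levels, which is why the per-eye height shrinks to roughly a third of the NRZ eye for the same swing:

    # Standard Gray-coded PAM-4 mapping of bit pairs to normalized levels.
    PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}

    def pam4_encode(bits):
        assert len(bits) % 2 == 0
        return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

    print(pam4_encode([0, 1, 1, 0, 1, 1]))   # -> [-1, 3, 1]
    # With a 2 V swing, adjacent levels sit 2/3 V apart, so each of the
    # three eyes starts at about one third of the NRZ eye height.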

Energy-Efficient Dynamic Comparator with Active Inductor for Receiver of Memory Interfaces

  • Jae Whan Lee
  • Joo-Hyung Chae
  • Jihwan Park
  • Hyunkyu Park
  • Jaekwang Yun
  • Suhwan Kim

In this paper, we propose a dynamic comparator that improves the performance of a receiver (RX) while reducing its power consumption. It is implemented as a double-tail StrongARM latch comparator with an active inductor, with power consumption minimized at high speed, resulting in better energy efficiency at the targeted high frequency. In this regard, our comparator is suitable for a memory-application RX that must satisfy both low power and high speed. It is applied to a single-ended RX designed with a continuous-time linear equalizer, a clock generator, and a quarter-rate 2-tap decision-feedback equalizer, which is appropriate for high-frequency memory applications. Compared to the conventional design, our design, fabricated in a 55nm CMOS process, provides an improvement of 7% in unit interval (UI) margin under the same power consumption and receives up to 10Gb/s PRBS15 data at BER < 10^-12 with a 0.4 UI margin and an energy efficiency of 0.67pJ/bit.

4-Channel Push-Pull VCSEL Drivers for HDMI Active Optical Cable in 0.18-μm CMOS

  • Jeongho Hwang
  • Hong-Seok Choi
  • Hyungrok Do
  • Gyu-Seob Jeong
  • Daehyun Koh
  • Seong Ho Park
  • Deog-Kyoon Jeong

The price and power consumption of standard HDMI cables rise steeply as the data rate increases or the cable gets longer. An HDMI active optical cable (AOC) can potentially solve the price and power issues since optical fibers are tolerant to loss. However, additional optical components, such as a vertical-cavity surface-emitting laser (VCSEL) and a photodiode (PD), are required; therefore, the drivers and transimpedance amplifiers should be designed carefully for reliable operation. In this paper, two types of 4-channel VCSEL drivers for HDMI AOC are presented. The first type of driver passes data and bias separately, using off-chip capacitors for AC coupling. The second type passes data including its DC value, without off-chip capacitors. Both drivers are based on push-pull current-mode logic (CML) to achieve better power efficiency. Fabricated in a 0.18-μm CMOS process, the drivers consume 36.5 mW/channel at 6 Gb/s and 24.7 mW/channel at 12 Gb/s, respectively.

SESSION: Approximate Computing

RMAC: Runtime Configurable Floating Point Multiplier for Approximate Computing

  • Mohsen Imani
  • Ricardo Garcia
  • Saransh Gupta
  • Tajana Rosing

Approximate computing is a way to build fast and energy-efficient systems that provide responses of good-enough quality tailored for different purposes. In this paper, we propose a novel approximate floating-point multiplier which efficiently multiplies two floating-point numbers and yields a high-precision product. RMAC approximates the costly mantissa multiplication by a simple addition between the mantissas of the input operands. To tune the level of accuracy, RMAC looks at the first bit of the input mantissas as well as the first N bits of the result of the addition to dynamically estimate the maximum multiplication error rate. RMAC then decides to either accept the approximate result or re-execute the exact multiplication. Depending on the value of N, the proposed RMAC can be configured to achieve different levels of accuracy. We integrate the proposed RMAC into an AMD Southern Islands GPU by replacing the existing floating-point units with RMAC. We test the efficiency and accuracy of the enhanced GPU on a wide range of applications, including multimedia and machine learning applications. Our evaluations show that a GPU enhanced by the proposed RMAC can achieve a 5.2x energy-delay product improvement over a GPU using conventional FPUs while ensuring less than 2% quality loss. Comparing our approach with other state-of-the-art approximate multipliers shows that RMAC can achieve 3.1x faster and 1.8x more energy-efficient computation while providing the same quality of service.
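
A rough software model of the mantissa-addition idea (our own sketch of the principle for positive inputs, not the paper's exact hardware or its error-checking logic):

    import math

    def rmac_multiply(a, b):
        # Approximate a*b by replacing the mantissa product with a mantissa
        # sum: (1+fa)*(1+fb) ~= 1 + fa + fb, dropping the fa*fb term.
        ma, ea = math.frexp(a)             # a = ma * 2**ea, ma in [0.5, 1)
        mb, eb = math.frexp(b)
        fa, fb = 2 * ma - 1, 2 * mb - 1    # fractional mantissas in [0, 1)
        m = 1 + fa + fb                    # approximate (1+fa)*(1+fb)
        return math.ldexp(m, ea + eb - 2)  # undo the frexp scaling

    x, y = 1.75, 2.5
    approx, exact = rmac_multiply(x, y), x * y
    print(approx, exact, abs(approx - exact) / exact)   # 4.0 vs 4.375, ~8.6%

The dropped fa*fb term bounds the error, and in the paper's scheme, inspecting the leading bits of the mantissa sum is what lets the hardware decide cheaply whether to fall back to an exact multiply.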

Designing Efficient Imprecise Adders using Multi-bit Approximate Building Blocks

  • Sarvenaz Tajasob
  • Morteza Rezaalipour
  • Masoud Dehyadegari
  • Mahdi Nazm Bojnordi

Energy efficiency has become a major concern in designing computer systems. One of the most promising solutions to enhance power and energy efficiency in error-tolerant applications is approximate computing, which balances accuracy, area, delay, and power consumption based on computational needs. By trading off computational accuracy, approximate computing can achieve significant improvements in speed, power, and area.

Adders are important arithmetic units, widely used in almost every digital processing system, and they contribute significant amounts of power dissipation. With the emergence of deep learning tasks and fault-tolerant big-data processing in every aspect of today's computing, the demand for low-power and energy-efficient approximate adders has increased significantly. Numerous designs have been proposed in the literature that build multi-bit adders using novel approximate full-adder circuits. Regrettably, relying only on single-bit building blocks limits the design space of approximate adders and prevents designers from achieving the most significant benefits of approximate circuits. This paper presents a novel approach to designing imprecise multi-bit adders, based on four novel approximate 2- and 3-bit adder building blocks. The proposed circuits are evaluated and compared with existing low-power adders in terms of various design characteristics, such as area, delay, power, and error tolerance. Our simulation results indicate that the proposed adders achieve more than 60% reduction in power and area compared to state-of-the-art approximate adders while introducing 12-17% less error in computation.

An Energy-Efficient, Yet Highly-Accurate, Approximate Non-Iterative Divider

  • Marzieh Vaeztourshizi
  • Mehdi Kamal
  • Ali Afzali-Kusha
  • Massoud Pedram

In this paper, we present a highly accurate and energy-efficient non-iterative divider, which uses multiplication as its main building block. In this structure, the division operation is performed by first reforming both the dividend and divisor inputs, and then multiplying the rounded value of the scaled dividend by the reciprocal of the rounded value of the scaled divisor. Precisely, the interval representing the fractional value of the scaled divisor is partitioned into non-overlapping sub-intervals, and the reciprocal of the scaled divisor is then approximated with a linear function in each of these sub-intervals. The efficacy of the proposed divider structure is assessed by comparing its design parameters and accuracy with state-of-the-art non-iterative approximate dividers as well as exact dividers in a 45nm digital CMOS technology. Circuit simulation results show that the mean absolute relative error of the proposed structure for 32-bit division is less than 0.2%, while the proposed structure has significantly lower energy consumption than the exact divider. Finally, the effectiveness of the proposed divider in an image processing application is reported and discussed.
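
In spirit (a simplified software sketch with our own interval count and chord-based fit, not the paper's exact reforming and rounding steps), the piecewise-linear reciprocal divider works like this:

    import math

    def approx_reciprocal(s, pieces=8):
        # s in [1, 2): piecewise-linear interpolation of 1/s between the
        # endpoints of each sub-interval (interval count is illustrative).
        i = min(int((s - 1.0) * pieces), pieces - 1)
        lo, hi = 1.0 + i / pieces, 1.0 + (i + 1) / pieces
        t = (s - lo) / (hi - lo)
        return (1 - t) / lo + t / hi

    def approx_divide(n, d):
        # Positive operands only, for brevity.
        m, e = math.frexp(d)               # d = m * 2**e, m in [0.5, 1)
        s = 2 * m                          # scaled divisor in [1, 2)
        return n * approx_reciprocal(s) * 2.0 ** (1 - e)

    print(approx_divide(355.0, 113.0), 355.0 / 113.0)   # ~3.1432 vs ~3.1416

More sub-intervals shrink the maximum error of the linear fit, which is the accuracy/area knob such a divider exposes.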

SESSION: Architectural Techniques

Aggressive Slack Recycling via Transparent Pipelines

  • Gokul Subramanian Ravi
  • Mikko H. Lipasti

In order to operate reliably and produce expected outputs, modern architectures set timing margins conservatively at design time to support extreme variations in workload and environment. Unfortunately, the conservative guard bands set to achieve this reliability create clock cycle slack and are detrimental to performance and energy efficiency. To combat this, we propose Aggressive Slack Recycling via Transparent Pipelines. Our proposal performs timing speculation while allowing data to flow asynchronously via transparent latches, between synchronous boundaries. This allows timing speculation to cater to the average slack across asynchronous operations rather than the slack of the most critical operation – maximizing slack conservation and timing speculation efficiency.

We design a slack tracking mechanism which runs in parallel with the transparent data path to estimate the accumulated slack across operation sequences. The mechanism then appropriately clocks synchronous boundaries early to minimize wasted slack and maximize clock cycle savings. We implement our proposal on a spatial fabric and achieve absolute speedups of up to 20% and relative improvements (vs. competing mechanisms) of up to 75%.

Pareto-Optimal Power- and Cache-Aware Task Mapping for Many-Cores with Distributed Shared Last-Level Cache

  • Martin Rapp
  • Anuj Pathania
  • Jörg Henkel

Two factors primarily affect the performance of multi-threaded tasks on many-core processors with a shared but physically distributed Last-Level Cache (LLC): the power budget associated with a certain task mapping, which aims to guarantee thermally safe operation, and the non-uniform LLC access latency of threads running on different cores. Spatially distributing threads across the many-core increases the power budget, but unfortunately also increases the associated LLC latency. Conversely, mapping more threads to cores near the center of the many-core decreases the LLC latency, but unfortunately also decreases the power budget. Consequently, both metrics (LLC latency and power budget) cannot be simultaneously optimal, which leads to a Pareto optimization that has not previously been exploited. We are the first to present a run-time task mapping algorithm, called PCMap, that exploits this trade-off. Our approach results in up to an 8.6% reduction in the average task response time, accompanied by a reduction of up to 8.5% in energy consumption, compared to the state-of-the-art.

SPONGE: A Scalable Pivot-based On/Off Gating Engine for Reducing Static Power in NoC Routers

  • Hossein Farrokhbakht
  • Hadi Mardani Kamali
  • Natalie Enright Jerger
  • Shaahin Hessabi

Due to the high aggregate idle time of Network-on-Chip (NoC) routers in practical applications, power-gating techniques have been proposed to combat the ever-increasing ratio of static power. Nevertheless, sporadic packet arrivals compromise the effectiveness of power-gating by incurring significant latency and energy overhead. In this paper, we propose a Scalable Pivot-based On/Off Gating Engine (SPONGE) which efficiently manages power-gating decisions and the routing mechanism by adaptively selecting a small set of powered-on columns of routers and keeping the others in the power-gated state. To this end, a router architecture augmented with a novel routing algorithm is proposed in which a packet can traverse powered-off routers without waking them up and can only turn at predetermined powered-on routers. Experimental results on SPLASH-2 benchmarks demonstrate that, compared to the conventional power-gating method, SPONGE not only reduces static power consumption by 81.7% on average but also improves average packet latency by 63%.

SESSION: Machine Learning – Training

Taming the beast: Programming Peta-FLOP class Deep Learning Systems

  • Swagath Venkataramani
  • Vijayalakshmi Srinivasan
  • Jungwook Choi
  • Kailash Gopalakrishnan
  • Leland Chang

TrainWare: A Memory Optimized Weight Update Architecture for On-Device Convolutional Neural Network Training

  • Seungkyu Choi
  • Jaehyeong Sim
  • Myeonggu Kang
  • Lee-Sup Kim

Training convolutional neural networks on-device has become essential, as it allows applications to account for a user's individual environment. Meanwhile, the weight update operation in the training process is the primary source of high energy consumption due to its substantial memory accesses. We propose a dedicated weight update architecture with two key features: (1) a specialized local buffer to reduce DRAM accesses, and (2) a novel dataflow and a matching processing-element array structure for weight gradient computation that optimize the energy consumed by internal memories. Our scheme achieves a 14.3%-30.2% total energy reduction by drastically eliminating memory accesses.

AxTrain: Hardware-Oriented Neural Network Training for Approximate Inference

  • Xin He
  • Liu Ke
  • Wenyan Lu
  • Guihai Yan
  • Xuan Zhang

The intrinsic error tolerance of neural networks (NNs) makes approximate computing a promising technique to improve the energy efficiency of NN inference. Conventional approximate computing focuses on balancing the efficiency-accuracy trade-off for existing pre-trained networks, which can lead to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented training framework to facilitate approximate computing for NN inference. Specifically, AxTrain leverages the synergy between two orthogonal methods: one actively searches for a network parameter distribution with high error tolerance, and the other passively learns resilient weights by numerically incorporating the noise distributions of the approximate hardware into the forward pass during the training phase. Experimental results from various datasets with near-threshold computing and approximate multiplication strategies demonstrate AxTrain's ability to obtain resilient neural network parameters and improve system energy efficiency.
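
The passive half of the idea, exposing training to a model of the approximate hardware's error, can be sketched as a noisy forward pass; the Gaussian relative-noise model and its magnitude below are stand-in assumptions, not the paper's measured distributions:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_forward(x, w, rel_sigma=0.02):
        # Emulate approximate-multiplier error by perturbing each product
        # with relative Gaussian noise before accumulation.
        products = x[:, None, :] * w[None, :, :]      # (batch, out, in)
        products *= 1.0 + rel_sigma * rng.standard_normal(products.shape)
        return products.sum(axis=-1)                  # noisy dot products

    x = rng.standard_normal((4, 8))    # batch of 4 samples, 8 features
    w = rng.standard_normal((3, 8))    # 3 output neurons
    print(noisy_forward(x, w).shape)   # (4, 3)

Training against such a perturbed forward pass nudges the learned weights toward regions that tolerate the modeled hardware error, which is the resilience the paper reports.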

Spin Orbit Torque Device based Stochastic Multi-bit Synapses for On-chip STDP Learning

  • Gyuseong Kang
  • Yunho Jang
  • Jongsun Park

As a large number of neurons and synapses are needed in spiking neural network (SNN) design, emerging devices have been employed to implement synapses and neurons. In this paper, we present a stochastic multi-bit spin orbit torque (SOT) memory based synapse, where only one SOT device is switched for potentiation and depression using a modified Gray code. The modified-Gray-code approach needs only N devices to represent 2^N levels of synapse weights. An early-read-termination scheme is also adopted to reduce the power consumption of the training process by turning off less-associated neurons and their ADCs. For the MNIST dataset, with comparable classification accuracy, the proposed SNN architecture using a 3-bit synapse achieves a 68.7% reduction in ADC overhead compared to the conventional 8-level synapse.
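
The single-device-switch property is exactly the one-bit-change property of a Gray code; a quick check with the standard binary-reflected code (illustrative; the paper uses a modified variant):

    def gray(i):
        # Binary-reflected Gray code of integer i.
        return i ^ (i >> 1)

    N = 3                                    # N devices -> 2^N weight levels
    codes = [gray(i) for i in range(2 ** N)]
    for a, b in zip(codes, codes[1:]):
        assert bin(a ^ b).count("1") == 1    # exactly one device switches
    print([format(c, f"0{N}b") for c in codes])

Because adjacent weight levels differ in a single bit, each potentiation or depression step programs only one SOT device, saving write energy.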

SESSION: Non-volatile Memory

Enabling Intra-Plane Parallel Block Erase in NAND Flash to Alleviate the Impact of Garbage Collection

  • Tyler Garrett
  • Jun Yang
  • Youtao Zhang

Garbage collection (GC) in NAND flash can significantly decrease I/O performance in SSDs by copying valid data to other locations, thus blocking incoming I/O requests. To help improve performance, NAND flash utilizes various advanced commands to increase internal parallelism. Currently, these commands only parallelize operations across channels, chips, dies, and planes, neglecting the block level due to the risk of disturbances that can compromise valid data by inducing errors. However, due to the triple-well structure of the NAND flash plane architecture, it is possible to erase multiple blocks within a plane, in parallel, without diminishing the integrity of the valid data. The number of page movements due to multiple block erases can be restrained so as to bound the overhead per GC. Moreover, more capacity can be reclaimed per GC, which delays future GCs and effectively reduces their frequency. Such Intra-Plane Parallel Block Erase (IPPBE) in turn diminishes the impact of GC on incoming requests, improving their response times. Experimental results show that IPPBE can reduce the time spent performing GC by up to 50.7% (33.6% on average), read/write response times by up to 47.0%/45.4% (16.5%/14.8% on average), page movements by up to 52.2% (26.6% on average), and blocks erased by up to 14.2% (3.6% on average). An energy analysis indicates that by reducing the number of page copies and block erases, the energy cost of garbage collection can be reduced by up to 44.1% (19.3% on average).

Enhancing the Energy Efficiency of Journaling File System via Exploiting Multi-Write Modes on MLC NVRAM

  • Shuo-Han Chen
  • Yuan-Hao Chang
  • Tseng-Yi Chen
  • Yu-Ming Chang
  • Pei-Wen Hsiao
  • Hsin-Wen Wei
  • Wei-Kuan Shih

Non-volatile random-access memory (NVRAM) is regarded as a great alternative storage medium owing to its attractive features, including low idle energy consumption, byte addressability, and short read/write latency. In addition, multi-level-cell (MLC) NVRAM has been proposed to provide higher bit density. However, MLC NVRAM has lower energy efficiency and longer write latency than single-level-cell (SLC) NVRAM. These drawbacks can lead to higher energy consumption in MLC NVRAM-based storage systems. The energy consumption is magnified by existing journaling file systems (JFS) on MLC NVRAM-based storage devices due to the JFS's fail-safe policy of writing the same data twice. These observations motivate us to propose a multi-write-mode journaling file system (mwJFS) to alleviate the drawbacks of MLC NVRAM and lower the energy consumption of an MLC NVRAM-based JFS. The proposed mwJFS differentiates the data retention requirements of journaled data and applies different write modes to enhance energy efficiency with better access performance. A series of experiments was conducted to demonstrate the capability of mwJFS on an MLC NVRAM-based storage system.

Computing in memory with FeFETs

  • Dayane Reis
  • Michael Niemier
  • X. Sharon Hu

Data transfer between a processor and memory frequently represents a bottleneck with respect to improving application-level performance. Computing in memory (CiM), where logic and arithmetic operations are performed in memory, could significantly reduce both energy consumption and computational overheads associated with data transfer. Compact, low-power, and fast CiM designs could ultimately lead to improved application-level performance. This paper introduces a CiM architecture based on ferroelectric field effect transistors (FeFETs). The CiM design can serve as a general purpose, random access memory (RAM), and can also perform Boolean operations ((N)AND, (N)OR, X(N)OR, INV) as well as addition (ADD) between words in memory. Unlike existing CiM designs based on other emerging technologies, FeFET-CiM accomplishes the aforementioned operations via a single current reference in the sense amplifier, which leads to more compact designs and lower power. Furthermore, the high Ion/Ioff ratio of FeFETs enables an inexpensive voltage-based sense scheme. Simulation-based case studies suggest that our FeFET-CiM can achieve speed-ups (and energy reduction) of ~119X (~1.6X) and ~1.97X (~1.5X) over ReRAM and STT-RAM CiM designs with respect to in-memory addition of 32-bit words. Furthermore, our approach offers an average speedup of ~2.5X and energy reduction of ~1.7X when compared to a conventional (not in-memory) approach across a wide range of benchmarks.

Information Leakage Attacks on Emerging Non-Volatile Memory and Countermeasures

  • Mohammad Nasim Imtiaz Khan
  • Swaroop Ghosh

Emerging Non-Volatile Memories (NVMs) suffer from high and asymmetric read/write currents and long write latency, which can result in supply noise such as supply voltage droop and ground bounce. The magnitude of the supply noise depends on the old data and the new data being written (for a write operation) or on the stored data (for a read operation). Therefore, a victim's write operation creates supply noise which propagates to the adversary's memory space. The adversary can detect the victim's write initiation and can leverage the faster read latency (compared to write) to further sense the Hamming Weight (HW) of the victim's write data by detecting read failures in his memory space. These attacks are possible specifically if exhaustive testing of the memory for all patterns, all possible location combinations, and all possible parallel read/write conditions is not performed under bit-to-bit process variations and over both the specified (-10°C to 90°C) and unspecified temperature ranges (i.e., less than -10°C or greater than 90°C). Simulation results indicate that the adversary can sense the HW of the victim's (nearby) write data with 66.77% accuracy, and can further narrow the range based on read/write failure characteristics. Side-channel attacks can utilize this information to strengthen the attacks.

SESSION: Energy-efficient Parallelism

Load-Triggered Warp Approximation on GPU

  • Zhenhong Liu
  • Daniel Wong
  • Nam Sung Kim

Value similarity of operands across warps has been exploited to improve the energy efficiency of GPUs. Prior work, however, incurs significant overhead to check value similarity for every instruction and does not improve performance, as it does not reduce the number of executed instructions. This work proposes Lock ‘n Load (LnL), which triggers approximate execution of code regions by checking the similarity of values returned from load instructions only, and fuses multiple approximated warps into a single warp.

GAS: A Heterogeneous Memory Architecture for Graph Processing

  • Minxuan Zhou
  • Mohsen Imani
  • Saransh Gupta
  • Tajana Rosing

Graph processing has become important for various applications in today's big data era. However, most graph processing applications suffer from large memory overhead due to random memory accesses. Such random access patterns provide little temporal and spatial locality, which cannot be exploited by the conventional hierarchical memory system. In this work, we propose GAS, a heterogeneous memory architecture, to accelerate graph applications implemented in the message-based vertex programming model, which is widely used in various graph processing systems. GAS utilizes specialized content-addressable memory (CAM) to store random data and determines exact access patterns by a series of associative searches. Thus, GAS not only removes the inefficiency of random accesses but also reduces the memory access latency through accurate prefetching. We test the efficiency of GAS with three important graph processing kernels on five well-known graphs. Our experimental results show that GAS significantly reduces the cache miss rate and improves bandwidth utilization compared to a conventional system with a state-of-the-art graph-specific prefetching mechanism. These enhancements result in 34% and 27% reductions in energy consumption and execution time, respectively.

ACE-GPU: Tackling Choke Point Induced Performance Bottlenecks in a Near-Threshold Computing GPU

  • Tahmoures Shabanian
  • Aatreyi Bal
  • Prabal Basu
  • Koushik Chakraborty
  • Sanghamitra Roy

The proliferation of multicore devices with strict thermal budgets has spurred research in Near-Threshold Computing (NTC). However, the operation of a Graphics Processing Unit (GPU) in the NTC region remains largely unexplored. In this work, we explore an important reliability predicament of NTC, called choke points, that severely throttles the performance of GPUs. Employing a cross-layer methodology, we demonstrate the potency of choke points in inducing timing errors in a GPU operating in the NTC region. We propose a holistic circuit-architectural solution that promotes an energy-efficient NTC-GPU design paradigm by gracefully tackling choke-point-induced timing errors. Our proposed scheme offers 3.18x and 88.5% improvements in NTC-GPU performance and energy-delay product, respectively, over a state-of-the-art timing-error mitigation technique, with marginal area and power overheads.

SESSION: Self-powered Devices

HomeRun: HW/SW Co-Design for Program Atomicity on Self-Powered Intermittent Systems

  • Chih-Kai Kang
  • Chun-Han Lin
  • Pi-Cheng Hsiu
  • Ming-Syan Chen

Self-powered intermittent systems featuring nonvolatile processors (NVPs) allow for accumulative execution in unstable power environments. However, frequent power failures may cause incorrect NVP execution results due to invalid data generated intermittently. This paper presents a HW/SW co-design, called HomeRun, that guarantees atomicity by ensuring that an uninterruptible program section runs to completion in a single execution. We design a HW module to ensure that a power pulse is sufficient for an atomic section, and we develop a SW mechanism for programmers to protect atomic sections. The proposed design is validated through the development of a prototype pattern-locking system. Experimental results demonstrate that the proposed design completely guarantees atomicity and significantly improves the energy utilization of self-powered intermittent systems.

EcoMicro: A Miniature Self-Powered Inertial Sensor Node Based on Bluetooth Low Energy

  • Cheng-Ting Lee
  • Yun-Hao Liang
  • Pai H. Chou
  • Ali Heydari Gorji
  • Seyede Mahya Safavi
  • Wen-Chan Shih
  • Wen-Tsuen Chen

This paper describes EcoMicro, a miniature, self-powered, wireless inertial-sensing node in a volume of 8 x 13 x 9.5 mm3, including energy storage and solar cells. It is smaller than existing systems with similar functionality while retaining rich functionality and efficiency. It is capable of measuring motion using an inertial measurement unit (IMU) and communicating over the Bluetooth Low Energy (BLE) protocol. It is self-powered by miniature solar cells and can perform maximum power point tracking (MPPT). Its integrated energy-storage device combines the longevity and power density of supercapacitors with the relatively flat discharge curve of batteries. Our power-ground gating circuit minimizes leakage current during sleep mode and is used in conjunction with the real-time clock for duty cycling. Experimental results show EcoMicro to be operational and efficient for a class of wireless sensing applications.

Dual Mode Ferroelectric Transistor based Non-Volatile Flip-Flops for Intermittently-Powered Systems

  • S. K. Thirumala
  • A. Raha
  • H. Jayakumar
  • K. Ma
  • V. Narayanan
  • V. Raghunathan
  • S. K. Gupta

In this work, we propose dual-mode ferroelectric transistors (D-FEFETs) whose operation can be dynamically tuned between volatile and non-volatile modes with the help of a control signal. We utilize the unique features of D-FEFETs to design two variants of non-volatile flip-flops (NVFFs). In both designs, D-FEFETs operate in the volatile mode for normal operation and in the non-volatile mode to back up the state of the flip-flop during a power outage. The first design comprises a truly embedded non-volatile element (D-FEFET), which enables a fully automatic backup operation. In the second design, we introduce need-based backup, which lowers energy during normal operation at the cost of area with respect to the first design. Compared to a previously proposed FEFET-based NVFF, the first design achieves a 19% area reduction along with 96% lower backup energy and 9% lower restore energy, but at 14%-35% larger operation energy. The second design shows 11% lower area, 21% lower backup energy, a 16% decrease in backup delay, and similar operation energy, but with penalties of 17% and 19% in restore energy and delay, respectively. System-level analysis of the proposed NVFFs in the context of a state-of-the-art intermittently-powered system using real benchmarks yielded 5%-33% energy savings.

SESSION: Design and 3D Integration

Multiple Combined Write-Read Peripheral Assists in 6T FinFET SRAMs for Low-VMIN IoT and Cognitive Applications

  • Arijit Banerjee
  • Sumanth Kamineni
  • Benton H. Calhoun

Battery-operated or energy-harvested IoT and cognitive SoCs in modern FinFET processes prefer low-VMIN SRAMs for ultra-low-power (ULP) operation. However, the 1:1:1 high-density (HD) FinFET 6T bitcell faces challenges in achieving a lower VMIN across process variation. The 6T bitcell VMIN improves either by increasing the size of the bitcell or by using combinations of peripheral assists (PAs), since a single PA cannot achieve the best VMIN across process variation. State-of-the-art works show some combinations of write and read PAs that lower the VMIN of 6T FinFET SRAMs; however, the best combinations of PAs for 14nm HD 6T FinFET SRAMs are unknown. This work compares all possible dual combinations of PAs and reveals the better ones. We show that, in a typical column-mux scenario, combining negative bitline with VDD boosting and VDD collapse with VDD boosting in proportions of 14% and 6% (20% in total), respectively, maximizes the static VMIN improvement at close to 191mV for ULP IoT and cognitive applications. We also show that combinations of wordline boosting with negative bitline and of wordline boosting with VSS lowering achieve 150mV and 25mV of dynamic VMIN improvement at 5GHz for the worst-case write and read corners, respectively, beating other combinations.

Road to High-Performance 3D ICs: Performance Optimization Methodologies for Monolithic 3D ICs

  • Kyungwook Chang
  • Sai Pentapati
  • Da Eun Shim
  • Sung Kyu Lim

As we approach the limits of 2D device scaling, monolithic 3D ICs (M3D) have emerged as a potential solution offering performance and power benefits. Although various studies have sought to increase the power savings of M3D designs, efforts to improve their performance are rarely made. In this paper, we perform, for the first time, an in-depth analysis of the factors that affect the performance of M3D and present methodologies to improve it. Our methodologies outperform the state-of-the-art M3D design flow, offering 15.6% performance improvement and 16.2% energy-delay product (EDP) benefit over 2D designs.

A Monolithic-3D SRAM Design with Enhanced Robustness and In-Memory Computation Support

  • Srivatsa Srinivasa
  • Akshay Krishna Ramanathan
  • Xueqing Li
  • Wei-Hao Chen
  • Fu-Kuo Hsueh
  • Chih-Chao Yang
  • Chang-Hong Shen
  • Jia-Min Shieh
  • Sumeet Gupta
  • Meng-Fan Marvin Chang
  • Swaroop Ghosh
  • Jack Sampson
  • Vijaykrishnan Narayanan

We present a novel 3D-SRAM cell using a monolithic 3D integration (M3D-IC) technology to realize both robustness and in-memory Boolean logic compute support. The proposed two-layer design makes use of additional transistors over the SRAM layer to enable assist techniques as well as to provide logic functions (such as AND/NAND, OR/NOR, XNOR/XOR) without degrading cell density. Through analysis, we provide insights into the benefits provided by three memory-assist and two logic modes, and we evaluate the energy efficiency of our proposed design. The assist techniques improve SRAM read stability by 2.2x and increase the write margin by 17.6% while staying within the SRAM footprint. By virtue of this increased robustness, the cell enables seamless operation at lower supply voltages and thereby ensures energy efficiency. The Energy-Delay Product (EDP) is reduced by 1.6x over a standard 6T SRAM with faster data access. Transistor placement and biasing in layer 2 enable in-memory bitwise Boolean computation. When computing bulk in-memory operations, 6.5x energy savings are achieved compared to computing outside the memory system.

SESSION: Industry ML/AI Compute

Across the Stack Opportunities for Deep Learning Acceleration

  • Vijayalakshmi Srinivasan
  • Bruce Fleischer
  • Sunil Shukla
  • Matthew Ziegler
  • Joel Silberman
  • Jinwook Oh
  • Jungwook Choi
  • Silvia Mueller
  • Ankur Agrawal
  • Tina Babinsky
  • Nianzheng Cao
  • Chia-Yu Chen
  • Pierce Chuang
  • Thomas Fox
  • George Gristede
  • Michael Guillorn
  • Howard Haynie
  • Michael Klaiber
  • Dongsoo Lee
  • Shih-Hsien Lo
  • Gary Maier
  • Michael Scheuermann
  • Swagath Venkataramani
  • Christos Vezyrtzis
  • Naigang Wang
  • Fanchieh Yee
  • Ching Zhou
  • Pong-Fei Lu
  • Brian Curran
  • Leland Chang
  • Kailash Gopalakrishnan

The combination of growth in compute capabilities and the availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of hundreds of ExaOps of computation, posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers.

One of the key enablers for improving the operational efficiency of DNNs is the observation that, when extracting deep insight from vast quantities of structured and unstructured data, the exactness imposed by traditional computing is not required. Relaxing the "exactness" constraint enables exploiting opportunities for approximate computing across all layers of the system stack.

In this talk we present a multi-TOPS AI core [3] for acceleration of deep learning training and inference in systems from edge devices to data centers. We demonstrate that deriving high sustained utilization and energy efficiency from the AI core requires ground-up re-thinking to exploit approximate computing across the stack, including algorithms, architecture, programmability, and hardware.

Model accuracy is the fundamental measure of deep learning quality. The compute-engine precision in our AI core is carefully calibrated to realize a significant reduction in area and power while not compromising numerical accuracy. Our research at the DL algorithms/applications level [2] shows that it is possible to carefully tune the precision of both weights and activations to as low as 2 bits for inference; this finding was used to guide the choices of compute precision supported in the architecture and hardware for both training and inference. Similarly, the scalability of distributed DL training is limited by the communication overhead of exchanging gradients and weights after each mini-batch. Our research on gradient compression [1] shows that by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient, we achieve a compression ratio of 40X for convolutional layers, and up to 200X for fully-connected layers of the network, without losing model accuracy (see the sketch after this abstract). These results guide the choice of interconnection network topology for a system of accelerators built using the AI core.

Overall, our work shows how the benefits of exploiting approximation, using an algorithm/application's robustness to tolerate reduced precision, and compressed data communication can be combined effectively with the architecture and hardware of an accelerator designed to support reduced-precision computation and compressed data communication. Our results demonstrate improved end-to-end efficiency of the DL accelerator across different metrics, such as high sustained TOPs, high TOPs/watt, and TOPs/mm2, catering to different operating environments for both training and inference.
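
A minimal sketch of the threshold-based gradient sparsification mentioned above (a generic residual-accumulation formulation; the published algorithm's importance-based threshold selection is only summarized in a comment):

    import numpy as np

    def compress_gradients(grad, residual, threshold):
        # Send only entries whose accumulated magnitude exceeds the
        # threshold; bank the rest locally as a residual for later rounds.
        acc = grad + residual
        mask = np.abs(acc) > threshold
        sent = np.where(mask, acc, 0.0)
        return sent, acc - sent            # (communicated, new residual)

    rng = np.random.default_rng(1)
    g = rng.standard_normal(10_000) * 1e-3
    sent, res = compress_gradients(g, np.zeros_like(g), threshold=2e-3)
    print(f"compression ratio ~ {g.size / max(np.count_nonzero(sent), 1):.0f}x")
    # In [1], the threshold is additionally chosen based on gradient
    # importance, pushing ratios to 40X-200X without accuracy loss.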

SESSION: Mobile Applications

App-Oriented Thermal Management of Mobile Devices

  • Jihoon Park
  • Seokjun Lee
  • Hojung Cha

The thermal issue for mobile devices becomes critical as the devices’ performance increases to handle complicated applications. Conventional thermal management limits the performance of the entire device, degrading the quality of both foreground and background applications. This is not desirable because the quality of the foreground application, i.e., the frames per second (FPS), is directly affected, whereas users are generally not aware of the performance of background applications. In this paper, we propose an app-oriented thermal management scheme that specifically restricts background applications to preserve the FPS of foreground applications. For efficient thermal management, we developed a model that predicts the heat contribution of individual applications based on hardware utilization. The proposed system gradually limits system resources for each background application according to its heat contribution. The scheme was implemented on a Galaxy S8+ smartphone, and its usefulness was validated with a thorough evaluation.
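
A toy version of a per-app heat-contribution model (the linear form, utilization features, and numbers are illustrative assumptions, not the paper's fitted model) regresses the observed temperature rise against per-app hardware utilization:

    import numpy as np

    # Rows: observation windows; columns: one app's CPU/GPU/network
    # utilization fractions during that window (made-up data).
    util = np.array([[0.6, 0.1, 0.2],
                     [0.2, 0.7, 0.0],
                     [0.1, 0.3, 0.8]])
    temp_rise = np.array([3.1, 4.0, 3.4])     # observed delta-T per window

    coef, *_ = np.linalg.lstsq(util, temp_rise, rcond=None)
    print("heat per unit utilization:", coef)
    # Background apps with the largest predicted contribution are the
    # first candidates for resource throttling.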

DiReCt: Resource-Aware Dynamic Model Reconfiguration for Convolutional Neural Network in Mobile Systems

  • Zirui Xu
  • Zhuwei Qin
  • Fuxun Yu
  • Chenchen Liu
  • Xiang Chen

Although Convolutional Neural Networks (CNNs) have been widely applied in various applications, their deployment in resource-constrained mobile systems remains a significant concern. To overcome computation resource constraints, such as limited memory and energy capacity, many works have been proposed for mobile CNN optimization. However, most of them lack a comprehensive modeling analysis of CNN computation consumption and merely focus on static optimization schemes, regardless of different mobile computation scenarios. In this work, we propose DiReCt — a resource-aware CNN reconfiguration system. Leveraging accurate CNN computation consumption modeling and mobile resource constraint analysis, DiReCt can reconfigure a CNN with different accuracy and resource consumption levels to adapt to various mobile computation scenarios. The experimental results show that the proposed computation consumption models in DiReCt estimate CNN computation consumption with 94.1% accuracy, and DiReCt achieves up to 34.9% computation acceleration, 52.7% memory reduction, and 27.1% energy saving. Eventually, DiReCt can effectively adapt CNNs to dynamic mobile usage scenarios for optimal performance.

POSTER SESSION: Posters

A Low-power 4K H.265/HEVC Video Encoder for Smart Video Surveillance

  • Ke Xu
  • Yu Li
  • Bo Huang
  • Xiangkai Liu
  • Hong Wang
  • Zhuoyan Wu
  • Zhanpeng Yan
  • Xueying Tu
  • Tongqing Wu
  • Daibing Zeng

This paper presents the design and VLSI implementation of a low-power HEVC main-profile encoder, which is able to process 4K 4:2:0 encoding in real time with a five-stage pipeline architecture. A pyramid ME (Motion Estimation) engine is employed to reduce search complexity. To compensate for video sequences with fast-moving objects, GME (Global Motion Estimation) is introduced to alleviate the effect of the limited search range. We also implement an alternative 5×5 search along with 3×3 to boost video quality. For intra mode decision, original pixels, instead of reconstructed ones, are used to reduce pipeline stalls. The encoder supports DVFS (Dynamic Voltage and Frequency Scaling) and features three operating modes, which helps to reduce power consumption by 25%. A scalable-quality mode, which trades encoding quality for power by reducing the search range and the number of intra prediction candidates, achieves 11.4% power reduction with 3.5% quality degradation. Furthermore, a lossless frame-buffer compression is proposed, which reduces DDR bandwidth by 49.1% and power consumption by 13.6%. The entire video surveillance SoC is fabricated in TSMC 28nm technology with a 1.96 mm2 area. It consumes 2.88M logic gates and 117KB of SRAM. The measured power consumption is 103mW at 350MHz for 4K encoding in high-quality mode. The 0.39nJ/pixel energy efficiency of this work, which achieves 42%~97% power reduction compared with reference designs, makes it ideal for real-time low-power smart video surveillance applications.

Breaking POps/J Barrier with Analog Multiplier Circuits Based on Nonvolatile Memories

  • M. Reza Mahmoodi
  • Dmitri Strukov

Low-to-medium resolution analog vector-by-matrix multipliers (VMMs) offer remarkable energy/area efficiency compared to their digital counterparts. Still, the maximum attainable performance of analog VMMs is often bounded by the overhead of the peripheral circuits. The main contribution of this paper is the design of novel sensing circuitry that improves the energy efficiency and density of analog multipliers. The proposed circuit is based on a translinear Gilbert cell, which is topologically combined with a floating nonlinear resistor and a low-gain amplifier. Several compensation techniques are employed to ensure reliability with respect to process, temperature, and supply voltage variations. As a case study, we consider an implementation of a couple-gate current-mode VMM with embedded split-gate NOR flash memory. Our simulation results show that a 4-bit 100×100 VMM circuit designed in 55 nm CMOS technology achieves record-breaking performance of 3.63 POps/J.

Efficient Image Sensor Subsampling for DNN-Based Image Classification

  • Jia Guo
  • Hongxiang Gu
  • Miodrag Potkonjak

Today’s mobile devices are equipped with cameras capable of taking very high-resolution pictures. For computer vision tasks which require relatively low resolution, such as image classification, subsampling is desirable to reduce the unnecessary power consumption of the image sensor. In this paper, we study the relationship between subsampling and the performance degradation of image classifiers based on deep neural networks (DNNs). We empirically show that subsampling with the same step size leads to very similar accuracy changes for different classifiers. In particular, we could achieve over 15x energy savings just by subsampling while suffering almost no accuracy loss. For even better energy-accuracy trade-offs, we propose AdaSkip, where the row sampling resolution is adaptively changed based on the image gradient. We implement AdaSkip on an FPGA and report its energy consumption.
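
Uniform step-size subsampling is a one-line array operation; the savings come from performing it at the sensor readout rather than in software. The sketch below, including the gradient-based row flagging in the spirit of AdaSkip, is our own illustration rather than the paper's implementation:

    import numpy as np

    img = np.random.randint(0, 256, (3000, 4000), dtype=np.uint8)  # full-res frame

    step = 4
    small = img[::step, ::step]      # uniform subsampling: 16x fewer pixels
    print(small.shape)               # (750, 1000)

    # AdaSkip-style idea (sketch): flag regions with large vertical gradient
    # for denser row sampling, skipping more rows elsewhere.
    row_grad = np.abs(np.diff(img.astype(np.int16), axis=0)).mean(axis=1)
    dense_rows = row_grad > np.percentile(row_grad, 75)
    print(f"{dense_rows.mean():.0%} of rows flagged for denser sampling")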

Input-Splitting of Large Neural Networks for Power-Efficient Accelerator with Resistive Crossbar Memory Array

  • Yulhwa Kim
  • Hyungjun Kim
  • Daehyun Ahn
  • Jae-Joon Kim

Resistive Crossbar memory Arrays (RCA) have been gaining interest as a promising platform to implement Convolutional Neural Networks (CNNs). One of the major challenges in RCA-based design is that the number of rows in an RCA is often smaller than the number of input neurons in a layer. Previous works used high-resolution Analog-to-Digital Converters (ADCs) to compute the partial weighted sum in each array and merged partial sums from multiple arrays outside the RCAs. However, such an approach suffers from significant power consumption due to the need for high-resolution ADCs. In this paper, we propose a methodology to construct a large CNN with multiple RCAs more efficiently. By splitting the input feature map and retraining the CNN with proper initialization, we demonstrate that any CNN model can be represented with multiple arrays without using intermediate partial sums. The experimental results show that the ADC power of the proposed design is 32x smaller, and the total chip power is 3x smaller, than those of the baseline design.

Design Optimization of 3D Multi-Processor System-on-Chip with Integrated Flow Cell Arrays

  • Artem Andreev
  • Fulya Kaplan
  • Marina Zapater
  • Ayse K. Coskun
  • David Atienza

Integrated flow cell array (FCA) is an emerging technology, targeting the cooling and power delivery challenges of modern 2D/3D Multi-Processor Systems-on-Chip (MPSoCs). In FCA, electrolytic solutions are pumped through microchannels etched in the silicon of the chips, removing heat from the system, while, at the same time, generating power on-chip. In this work, we explore the impact of FCA system design on various 3D architectures and propose a methodology to optimize a 3D MPSoC with integrated FCA to run a given workload in the most energy-efficient way. Our results show that an optimized configuration can save up to 50% energy with respect to sub-optimal 3D MPSoC configurations.

Multi-Pattern Active Cell Balancing Architecture and Equalization Strategy for Battery Packs

  • Swaminathan Narayanaswamy
  • Sangyoung Park
  • Sebastian Steinhorst
  • Samarjit Chakraborty

Active cell balancing is the process of improving the usable capacity of a series-connected Lithium-Ion (Li-Ion) battery pack by redistributing the charge levels of individual cells. Depending upon the State-of-Charge (SoC) distribution of the individual cells in the pack, an appropriate charge transfer pattern (cell-to-cell, cell-to-module, module-to-cell, or module-to-module) has to be selected to improve the usable energy of the battery pack. However, existing active cell balancing circuits are only capable of performing a limited number of charge transfer patterns and, therefore, have reduced energy efficiency for different types of SoC distributions. In this paper, we propose a modular, multi-pattern active cell balancing architecture that is capable of performing multiple types of charge transfer patterns (cell-to-cell, cell-to-module, module-to-cell, and module-to-module) with a reduced number of hardware components and control signals compared to existing solutions. We derive a closed-form, analytical model of our proposed balancing architecture, with which we profile the efficiency of the individual charge transfer patterns enabled by our architecture. Using this profiling analysis, we propose a hybrid charge equalization strategy that automatically selects the most energy-efficient charge transfer pattern depending upon the SoC distribution of the battery pack and the characteristics of our proposed balancing architecture. Case studies show that our proposed balancing architecture and hybrid charge equalization strategy provide up to 46.83% improvement in energy efficiency compared to existing solutions.

Intrinsic and Database-free Watermarking in ICs by Exploiting Process and Design Dependent Variability in Metal-Oxide-Metal Capacitances

  • Ahish Shylendra
  • Swarup Bhunia
  • Amit Ranjan Trivedi

Authentication of integrated circuits (ICs) to verify their integrity has emerged as a critical need to address increasing concerns associated with counterfeit ICs in the supply chain. In this paper, a novel SAR-ADC-based intrinsic and database-free authentication scheme is proposed. The proposed technique utilizes mismatch in back-end-of-line (BEOL) capacitors used in a charge-redistribution SAR ADC to generate an authentication signature. BEOL metal-oxide-metal (MOM) capacitors form a reliable source of process-variation information and are less sensitive to aging- and temperature-induced variations. Line edge roughness (LER) is the primary source of mismatch in BEOL capacitors; thus, capacitor mismatch variation has been analyzed in terms of LER and geometric parameters. The resource overhead incurred by the proposed modifications to the ADC architecture to incorporate authentication ability is minimal, and the existing on-chip calibration circuitry is used to extract the signature. The proposed technique does not require a sophisticated test setup, thereby simplifying the authentication procedure.

Scheduling of Hybrid Battery-Supercapacitor Control Instructions for Longevity in Systems with Power Gating

  • Sumanta Pyne

The in-rush current caused by the wake-up of power-gated (PG) components accelerates battery discharge. This work introduces an instruction-controlled hybrid battery-supercapacitor (B-SC) system for longer battery life in systems with instruction-controlled PG. Two instructions are introduced, along with architectural support. The first instruction disconnects the battery from the PG components if the charge in the supercapacitor is greater than or equal to the charge required for the wake-up of the PG components. The other instruction reconnects the battery to the PG components to recharge the supercapacitor. Disconnecting the battery during wake-up reduces the rate-capacity effect (C-rate), extending battery life. An algorithm is designed to schedule the proposed battery control instructions within a program containing PG instructions. The efficacy of the proposed method is evaluated on MiBench and MediaBench benchmark programs. The proposed method reduces C-rate by an average of 14.25% at the cost of an average performance loss of 6.87%.
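
A toy rendering of the rule behind the first instruction may help: the battery is bypassed during a wake-up whenever the supercapacitor alone holds the in-rush charge. The normalized charge values are illustrative.

```python
Q_WAKEUP = 0.2   # illustrative charge drawn by one wake-up of the PG components

def serve_wakeup(q_supercap):
    """Return (supply source, remaining supercap charge) for one wake-up."""
    if q_supercap >= Q_WAKEUP:        # instruction 1: disconnect the battery
        return "supercap", q_supercap - Q_WAKEUP
    return "battery", q_supercap      # battery absorbs the in-rush (high C-rate)

print(serve_wakeup(0.5))   # -> ('supercap', 0.3): battery spared the C-rate hit
print(serve_wakeup(0.1))   # -> ('battery', 0.1)
```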

Better-Than-Worst-Case Design Methodology for a Compact Integrated Switched-Capacitor DC-DC Converter

  • Dongkwun Kim
  • Mingoo Seok

We suggest a new methodology for co-designing an integrated switched-capacitor (SC) converter and a digital load. Conventionally, a load is specified by the minimum supply voltage and the maximum power dissipation, each found at its own worst-case process, workload, and environment condition. Furthermore, when designing an SC DC-DC converter toward this worst-case load specification, designers often add another separate pessimistic assumption on the power-switch resistance and flying-capacitor density of the SC converter. Such a worst-case design methodology can lead to a significantly over-sized flying capacitor and thereby limit on-chip integration of the converter. Our proposed methodology instead adopts the better-than-worst-case (BTWC) perspective to avoid over-design and thus optimizes the area of an SC converter. Specifically, we propose BTWC load modeling, where we specify non-pessimistic sets of supply voltage requirements and load power dissipation across variations. In addition, by considering coupled variations between the SC converter and the load integrated on the same die, our methodology further reduces the pessimism in power-switch resistance and capacitor density. The proposed co-design methodology is verified with a 2:1 SC converter and a digital load in a 65 nm process. The resulting converter achieves more than an order-of-magnitude reduction in flying-capacitor size compared to the conventional worst-case design while maintaining the target conversion efficiency and target throughput. We also verified our methodology across a wide range of load characteristics, in terms of supply voltages and current draw, and confirmed similar benefits.

Dynamic Bit-width Reconfiguration for Energy-Efficient Deep Learning Hardware

  • Daniele Jahier Pagliari
  • Enrico Macii
  • Massimo Poncino

Deep learning models have reached state-of-the-art performance in many machine learning tasks. Benefits in terms of energy, bandwidth, latency, etc., can be obtained by evaluating these models directly within Internet of Things end nodes, rather than in the cloud. This calls for implementations of deep learning tasks that can run in resource-limited environments with low energy footprints. Research and industry have recently investigated these aspects, coming up with specialized hardware accelerators for low-power deep learning. One effective technique adopted in these devices consists in reducing the bit-width of calculations, exploiting the error resilience of deep learning. However, bit-widths are typically set statically for a given model, regardless of input data. Unless models are retrained, this solution invariably sacrifices accuracy for energy efficiency.

In this paper, we propose a new approach for implementing input-dependent dynamic bit-width reconfiguration in deep learning accelerators. Our method is based on a fully automatic characterization phase and can be applied to popular models without retraining. Using energy data from a real deep learning accelerator chip, we show that a 50% energy reduction can be achieved with respect to a static bit-width selection, with less than 1% accuracy loss.
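
A minimal sketch of what input-dependent bit-width selection can look like, assuming a model exposed as a callable that returns a 1-D array of class probabilities and a softmax-margin confidence proxy; the paper's actual characterization phase and decision rule are not reproduced here.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of values in [-1, 1] to `bits` bits."""
    scale = 2 ** (bits - 1) - 1
    return np.round(np.clip(x, -1, 1) * scale) / scale

def adaptive_predict(model, x, margin=0.2, widths=(4, 8, 16)):
    """Try cheap bit-widths first; widen only when confidence looks low."""
    for bits in widths:
        probs = model(quantize(x, bits))      # model: callable returning probs
        top2 = np.sort(probs)[-2:]
        if top2[1] - top2[0] >= margin:       # confident enough: stop early
            return probs, bits
    return probs, widths[-1]                  # fall back to the widest config
```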

Deploying Customized Data Representation and Approximate Computing in Machine Learning Applications

  • Mahdi Nazemi
  • Massoud Pedram

Major advancements in building general-purpose and customized hardware have been one of the key enablers of the versatility and pervasiveness of machine learning models such as deep neural networks. To sustain this ubiquitous deployment of machine learning models and cope with their computational and storage complexity, several solutions have been employed, such as low-precision representation of model parameters using fixed-point formats and approximate arithmetic operations. Studying the potency of such solutions in different applications requires integrating them into existing machine learning frameworks for high-level simulation, as well as implementing them in hardware to analyze their effects on power/energy dissipation, throughput, and chip area. Lop is a library for design space exploration that bridges the gap between machine learning and efficient hardware realization. It comprises a Python module, which can be integrated with some of the existing machine learning frameworks and implements various customizable data representations, including fixed-point and floating-point, as well as approximate arithmetic operations. Furthermore, it includes a highly parameterized Scala module, which allows synthesizing hardware based on the said data representations and arithmetic operations. Lop allows researchers and designers to quickly compare the quality of their models under various data representations and arithmetic operations in Python, and to contrast the hardware cost of viable representations by synthesizing them on their target platforms (e.g., FPGA or ASIC). To the best of our knowledge, Lop is the first library that allows both software simulation and hardware realization using customized data representations and approximate computing techniques.
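
To make the data-representation idea concrete, here is a small hypothetical fixed-point quantizer of the kind such a library exposes; the abstract does not show Lop's real API, so the names and signatures below are invented for illustration.

```python
import numpy as np

class FixedPoint:
    """Hypothetical signed fixed-point representation (int_bits.frac_bits)."""
    def __init__(self, int_bits, frac_bits):
        self.step = 2.0 ** -frac_bits
        self.lo = -(2.0 ** (int_bits - 1))
        self.hi = 2.0 ** (int_bits - 1) - self.step

    def __call__(self, x):
        # Round to the nearest representable value, then saturate.
        return np.clip(np.round(x / self.step) * self.step, self.lo, self.hi)

q = FixedPoint(int_bits=2, frac_bits=6)
print(q(np.array([0.7183, -1.9, 5.0])))   # -> [ 0.71875 -1.90625  1.984375]
```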

Battery-Aware Energy Model of Drone Delivery Tasks

  • Donkyu Baek
  • Yukai Chen
  • Alberto Bocca
  • Alberto Macii
  • Enrico Macii
  • Massimo Poncino

Drones are becoming increasingly popular in the commercial market for various package delivery services. In this scenario, the most commonly adopted drones are quad-rotors (i.e., quadcopters). The energy consumed by a drone may become an issue, since it may affect (i) the delivery deadline (quality of service), (ii) the number of packages that can be delivered (throughput) and (iii) the battery lifetime (number of recharging cycles). It is thus fundamental to find a proper compromise between the energy used to complete the delivery and the speed at which the quadcopter flies to reach the destination. To achieve this, we have to consider that the energy required by the drone to complete a given delivery task does not exactly correspond to the energy requested from the battery, since the latter is a non-ideal power supply that delivers power with different efficiencies depending on its state of charge. In this paper, we demonstrate that the proposed battery-aware delivery scheduling algorithm carries more packages than the traditional delivery model with the same battery capacity. Moreover, the battery-aware delivery model is 17% more accurate than the traditional delivery model for the same delivery scheme, which prevents unexpected drone landings.

A Fully Onchip Binarized Convolutional Neural Network FPGA Implementation with Accurate Inference

  • Li Yang
  • Zhezhi He
  • Deliang Fan

Deep convolutional neural networks play an important role in machine learning and have been widely used in computer vision tasks. However, their enormous model size and massive computation cost have become the main obstacles to deploying such powerful algorithms in low-power, resource-limited embedded systems, such as FPGAs. Recent works have shown that binarized neural networks (BNNs), utilizing binarized (i.e., +1 and -1) convolution kernels and binary activation functions, can significantly reduce model size and computation complexity, which paves a new road for energy-efficient FPGA implementation. In this work, we first propose a new BNN algorithm, called Parallel-Convolution BNN (PC-BNN), which replaces the original binary convolution layer in a conventional BNN with two parallel binary convolution layers. PC-BNN achieves ~86% accuracy on the CIFAR-10 dataset with only a 2.3Mb parameter size. We then deploy our proposed PC-BNN onto the Xilinx PYNQ Z1 FPGA board, which has only 4.9Mb of on-chip RAM. Thanks to the ultra-small network parameter size, it is feasible to store the whole network in on-chip RAM, which greatly reduces the energy and delay overhead of loading network parameters from off-chip memory. Meanwhile, a new data-streaming pipeline architecture is proposed in the PC-BNN FPGA implementation to further improve throughput. The experimental results show that our PC-BNN-based FPGA implementation achieves 930 frames per second, 387.5 FPS/Watt and 396×10^-4 FPS/LUT, which are among the best throughput and energy-efficiency results of recent works.
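
The arithmetic that makes BNNs so cheap on FPGAs is XNOR plus popcount; a minimal sketch follows (a bit-packed dot product only, not the PC-BNN parallel-layer structure itself).

```python
def pack(v):
    """Pack a {-1,+1} vector into an integer bit mask (+1 -> 1-bit)."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, w_bits, n):
    """Dot product of two packed {-1,+1} vectors via XNOR and popcount."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # 1 where the signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n                       # matches minus mismatches

a, w = [+1, -1, +1, +1], [+1, +1, -1, +1]
print(binary_dot(pack(a), pack(w), 4))           # -> 0, same as sum(x*y)
```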

In-situ Stochastic Training of MTJ Crossbar based Neural Networks

  • Ankit Mondal
  • Ankur Srivastava

Owing to high device density, scalability and non-volatility, Magnetic Tunnel Junction (MTJ) based crossbars have garnered significant interest for implementing the weights of an artificial neural network. The existence of only two stable states in MTJs implies a high overhead for obtaining optimal binary weights in software. We illustrate that the inherent parallelism in the crossbar structure makes it highly appropriate for in-situ training, wherein the network is taught directly on the hardware. This leads to a significantly smaller training overhead, as the training time is independent of the size of the network, while also circumventing the effects of alternate current paths in the crossbar and accounting for manufacturing variations in the devices. We show how the stochastic switching characteristics of MTJs can be leveraged to perform probabilistic weight updates using the gradient descent algorithm. We describe how the update operations can be performed on crossbars both with and without access transistors and perform simulations on them to demonstrate the effectiveness of our techniques. The results reveal that stochastically trained MTJ-crossbar NNs achieve a classification accuracy nearly the same as that of real-valued-weight networks trained in software, and exhibit immunity to device variations.
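
A sketch of a probabilistic update rule in this spirit: the chance of flipping a binary weight grows with the gradient magnitude, mimicking an MTJ whose switching probability grows with pulse strength. The exponential switching model and its gain are illustrative assumptions, not the paper's device model.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_update(weights, grads, gain=4.0):
    """weights in {-1,+1}; flip toward -sign(grad) with probability ~ |grad|."""
    p_switch = 1.0 - np.exp(-gain * np.abs(grads))     # MTJ-like switching prob.
    want = -np.sign(grads)                             # gradient-descent direction
    flip = (want != weights) & (rng.random(weights.shape) < p_switch)
    return np.where(flip, want, weights)

w = np.array([+1.0, -1.0, +1.0])
g = np.array([+0.9, -0.8, +0.01])
print(stochastic_update(w, g))   # large |grad| flips almost surely, tiny rarely
```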

Variation-Aware Pipelined Cores through Path Shaping and Dynamic Cycle Adjustment: Case Study on a Floating-Point Unit

  • Ioannis Tsiokanos
  • Lev Mukhanov
  • Dimitrios S. Nikolopoulos
  • Georgios Karakonstantis

In this paper, we propose a framework for minimizing variation-induced timing failures in pipelined designs, while limiting the overhead incurred by conventional guardband-based schemes. Our approach first limits the long-latency paths (LLPs) and isolates them in as few pipeline stages as possible by shaping the path distribution. Such a strategy facilitates the adoption of a special unit that predicts the excitation of the isolated LLPs and dynamically allows an extra cycle for the completion of only these error-prone paths. Moreover, our framework performs post-layout dynamic timing analysis based on real operands extracted from a variety of applications. This allows us to estimate bit error rates under potential delay variations, while considering dynamic, data-dependent path excitation. When applied to an IEEE-754-compatible double-precision floating-point unit (FPU) implemented in a 45nm process technology, path shaping reduces the bit error rates on average by 2.71x compared to the reference design under 8% delay variations. The integrated LLP prediction unit and the dynamic cycle adjustment avoid such failures and any quality loss at a cost of up to 0.61% throughput and 0.3% area overhead, while saving 37.95% power on average compared to an FPU with pessimistic margins.

A 2.6 mW Single-Ended Positive Feedback LNA for 5G Applications

  • Sana Arshad
  • Azam Beg
  • Rashad Ramzan

This paper presents the design of a single-ended positive feedback Common Gate (CG) Low Noise Amplifier (LNA) for 5G applications. Positive feedback is utilized to achieve the trade-off between the input matching, the gain and the noise factor (NF) of the LNA, and it inherently cancels the noise produced by the input CG transistor. The proposed LNA is designed and fabricated in 150 nm CMOS by L-Foundry. At 1.41 GHz, the measured S11 and S22 are better than -20 dB and -8.4 dB, respectively. The highest voltage gain is 16.17 dB with an NF of 3.64 dB. The complete chip has an area of 1 mm2. The LNA’s power dissipation is only 2.6 mW, with a 1 dB compression point of -13 dBm. The simple, low-power, single-ended architecture of the proposed LNA allows it to be implemented in phased-array and Multiple Input Multiple Output (MIMO) radars, which have limited input and output pads and constrained power budgets for on-board components.

SESSION: Far-out Ideas

Insights from Biology: Low Power Circuits in the Fruit Fly

  • Louis K. Scheffer

Fruit flies (Drosophila melanogaster) are small insects, with correspondingly small power budgets. Despite this, they perform sophisticated neural computations in real time. Careful study of these insects is revealing how some of these circuits work. Insights from these systems might be helpful in designing other low power circuits.

ICCAD 2018 TOC

A fast thermal-aware fixed-outline floorplanning methodology based on analytical models

  • Jai-Ming Lin
  • Tai-Ting Chen
  • Yen-Fu Chang
  • Wei-Yi Chang
  • Ya-Ting Shyu
  • Yeong-Jar Chang
  • Juin-Ming Lu

High temperature and temperature non-uniformity have become a serious threat to the performance and reliability of high-performance integrated circuits (ICs). Thermal effects are now a non-negligible issue in circuit and physical design. To estimate temperature accurately, the locations of modules have to be determined, which makes efficient and effective thermal-aware floorplanning play an increasingly important role. To address this problem, this paper proposes a differential nonlinear model which can approximate temperature and minimize wirelength at the same time during floorplanning. We also apply techniques such as thermal-aware clustering and the shrinking of hot modules in a multi-level framework to further reduce temperature without inducing longer wirelength. The experimental results demonstrate that both temperature and wirelength are greatly improved by our method compared to other works. More importantly, our runtime is quite fast and the fixed-outline constraint is also satisfied.

Analytical solution of Poisson’s equation and its application to VLSI global placement

  • Wenxing Zhu
  • Zhipeng Huang
  • Jianli Chen
  • Yao-Wen Chang

Poisson’s equation has been used in VLSI global placement for describing the potential field induced by a given charge density distribution. Unlike previous global placement methods that solve Poisson’s equation numerically, in this paper we provide an analytical solution of the equation to calculate the potential energy of an electrostatic system. The analytical solution is derived using the separation-of-variables method and an exact density function to model the block distribution in a placement region; it is an infinite series and converges absolutely. Using the analytical solution, we give a fast computation scheme for Poisson’s equation and develop an effective and efficient global placement algorithm called Pplace. Experimental results show that our Pplace achieves smaller placement wirelength than ePlace and NTUplace3, two leading wirelength-driven placers. Given the pervasive applications of Poisson’s equation across scientific fields, our effective, efficient, and robust computation scheme for its analytical solution can have substantial impact on these fields as well.
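
Schematically, the formulation the abstract refers to looks as follows (Neumann boundary conditions on a unit placement region and a cosine-series solution; the paper's exact density function and coefficients may differ):

```latex
% Poisson's equation for the placement potential (schematic form)
\nabla^2 \psi(x,y) = -\rho(x,y) \quad \text{on } R = [0,1]^2,
\qquad \left.\frac{\partial \psi}{\partial \mathbf{n}}\right|_{\partial R} = 0,
% with the separation-of-variables cosine-series solution
\psi(x,y) = \sum_{u,v \ge 0} a_{uv}\, \cos(u\pi x)\cos(v\pi y),
\qquad a_{uv} = \frac{\rho_{uv}}{\pi^2 (u^2 + v^2)} \quad ((u,v) \neq (0,0)),
```

where the coefficients rho_uv are the cosine-expansion coefficients of the density rho.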

Novel proximal group ADMM for placement considering fogging and proximity effects

  • Jianli Chen
  • Li Yang
  • Zheng Peng
  • Wenxing Zhu
  • Yao-Wen Chang

Fogging and proximity effects are two major factors that cause inaccurate exposure and thus layout pattern distortions in e-beam lithography. In this paper, we propose the first analytical placement algorithm to consider both the fogging and proximity effects. We first formulate the global placement problem as a separable minimization problem with linear constraints, where different objectives can be tackled one by one in an alternating fashion. Then, we propose a novel proximal group alternating direction method of multipliers (ADMM) to solve the separable minimization problem with two subproblems, where the first subproblem (mainly associated with wirelength and density) is solved by a steepest descent method without line search, and the second one (mainly associated with the fogging and proximity effects) is handled by an analytical scheme. We prove the global convergence of the proximal group ADMM method. Finally, legalization and detailed placement are used to legalize and further improve the placement result. Experimental results show that our algorithm is effective and efficient for the addressed problem. Compared with the state-of-the-art work, our algorithm not only achieves 13.4% smaller fogging variation and 21.4% lower proximity variation, but also runs 1.65X faster.

Simultaneous partitioning and signals grouping for time-division multiplexing in 2.5D FPGA-based systems

  • Shih-Chun Chen
  • Richard Sun
  • Yao-Wen Chang

The 2.5D FPGA is a promising technology to accommodate a large design in one FPGA chip, but the limited number of inter-die connections in a 2.5D FPGA may cause routing failures. To resolve these failures, input/output time-division multiplexing is adopted by grouping cross-die signals to go through one routing channel, at a timing penalty, after netlist partitioning. However, grouping signals after partitioning might lead to a suboptimal solution. Consequently, it is desirable to consider simultaneous partitioning and signal grouping, although the optimization objectives of partitioning and grouping differ and the time complexity of such simultaneous optimization is usually high. In this paper, we propose a simultaneous partitioning and grouping algorithm that not only integrates the two objectives smoothly, but also reduces the time complexity to linear time per partitioning iteration. Experimental results show that our proposed algorithm outperforms the state-of-the-art flow in both cross-die signal timing criticality and system clock period.

IC/IP piracy assessment of reversible logic

  • Samah Mohamed Saeed
  • Xiaotong Cui
  • Alwin Zulehner
  • Robert Wille
  • Rolf Drechsler
  • Kaijie Wu
  • Ramesh Karri

Reversible logic is a building block for adiabatic and quantum computing in addition to other applications. Since common functions are non-reversible, one needs to embed them into proper-size reversible functions by adding ancillary inputs and garbage outputs. We explore the Intellectual Property (IP) piracy of reversible circuits. The number of embeddings of regular functions in a reversible function and the percent of leaked ancillary inputs measure the difficulty of recovering the embedded function. To illustrate the key concepts, we study reversible logic circuits designed using reversible logic synthesis tools based on Binary Decision Diagrams and Quantum Multi-valued Decision Diagrams.

TimingSAT: timing profile embedded SAT attack

  • Abhishek Chakraborty
  • Yuntao Liu
  • Ankur Srivastava

In order to enhance the security of logic obfuscation schemes, delay-based logic locking has been proposed in recent literature in combination with traditional functional logic locking. A circuit obfuscated using this approach preserves the correct functionality only when both the correct functional and delay keys are provided. In this paper, we develop a novel SAT-formulation-based approach called TimingSAT to deobfuscate the functionality of such delay-locked designs within a reasonable amount of time. The proposed technique models the timing characteristics of the various types of gates in the design as Boolean functions to build timing-profile-embedded SAT formulations in terms of the targeted key inputs. The TimingSAT attack works in two stages: in the first stage, the functional keys are found using the traditional SAT attack approach; in the second stage, the delay keys are deciphered using the timing-profile-embedded SAT formulation of the circuit. In both stages of the attack, wrong keys are iteratively eliminated until a key belonging to the correct equivalence class is obtained. The experimental results highlight the effectiveness of the proposed TimingSAT attack in breaking delay-locked benchmarks within a few hours.

Towards provably-secure analog and mixed-signal locking against overproduction

  • Nithyashankari Gummidipoondi Jayasankaran
  • Adriana Sanabria Borbon
  • Edgar Sanchez-Sinencio
  • Jiang Hu
  • Jeyavijayan Rajendran

Similar to digital circuits, analog and mixed-signal (AMS) circuits are also susceptible to supply-chain attacks such as piracy, overproduction, and Trojan insertion. However, unlike for digital circuits, the supply-chain security of AMS circuits is less explored. In this work, we propose to perform “logic locking” on the digital section of AMS circuits. The idea is to make the analog design intentionally suffer from the effects of process variations, which impede the operation of the circuit. Only upon applying the correct key are the effects of process variations mitigated, so that the analog circuit performs as desired. We provide theoretical guarantees of the security of the circuit, along with simulation results for a band-pass filter, a low-noise amplifier, and a low-dropout regulator, and we also show experimental results of our technique on a band-pass filter.

Best of both worlds: integration of split manufacturing and camouflaging into a security-driven CAD flow for 3D ICs

  • Satwik Patnaik
  • Mohammed Ashraf
  • Ozgur Sinanoglu
  • Johann Knechtel

With the globalization of manufacturing and supply chains, ensuring the security and trustworthiness of ICs has become an urgent challenge. Split manufacturing (SM) and layout camouflaging (LC) are promising techniques to protect the intellectual property (IP) of ICs from malicious entities during and after manufacturing (i.e., from untrusted foundries and reverse engineering by end-users). In this paper, we strive for “the best of both worlds,” that is, of SM and LC. To do so, we extend both techniques towards 3D integration, an up-and-coming design and manufacturing paradigm based on the stacking and interconnecting of multiple chips/dies/tiers.

Initially, we review prior art and its limitations. We also put forward a novel, practical threat model of IP piracy that is in line with the business models of present-day design houses. Next, we discuss how 3D integration is a naturally strong match for combining SM and LC. We propose a security-driven CAD and manufacturing flow for face-to-face (F2F) 3D ICs, along with obfuscation of interconnects. Based on this CAD flow, we conduct comprehensive experiments on DRC-clean layouts. Strengthened by an extensive security analysis (including a novel attack to recover obfuscated F2F interconnects), we argue that entering the next, third dimension is essential for effective and efficient IP protection.

Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base

  • Sebastian Vogel
  • Mengyu Liang
  • Andre Guntoro
  • Walter Stechele
  • Gerd Ascheid

Efficient acceleration of Deep Neural Networks is a manifold task. In order to reduce memory requirements and energy consumption, we propose dedicated accelerators with novel arithmetic processing elements that use bit shifts instead of multipliers. While a regular power-of-2 quantization scheme allows for multiplierless computation of multiply-accumulate operations, it suffers from high accuracy losses in neural networks. Therefore, we evaluate the use of powers of arbitrary log bases and confirm their suitability for quantization of pre-trained neural networks. The presented method works without retraining of the neural network and is therefore suitable for applications in which no labeled training data is available. To verify our proposed method, we implement the log-based processing elements in a neural network accelerator on an FPGA. The hardware efficiency is evaluated in terms of FPGA utilization and energy requirements in comparison to regular 8-bit fixed-point multiplier-based acceleration. Using this approach, hardware resources are minimized and power consumption is reduced by 22.3%.
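
A compact sketch of power-of-arbitrary-log-base quantization: with base b = 2^(1/s), s = 1 recovers the classic power-of-two scheme (pure shifts), while larger s narrows the step between representable magnitudes. The choice of s values and the error metric below are illustrative.

```python
import numpy as np

def log_quantize(w, s=2):
    """Quantize w to sign(w) * (2**(1/s))**k, with k an integer exponent."""
    k = np.round(s * np.log2(np.abs(w) + 1e-12))   # exponent in base 2**(1/s)
    return np.sign(w) * 2.0 ** (k / s)

w = np.random.default_rng(1).normal(0.0, 0.5, 10000)
for s in (1, 2, 3):
    mse = np.mean((w - log_quantize(w, s)) ** 2)
    print(f"s={s}: mse={mse:.2e}")   # finer log bases track the weights closer
```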

NID: processing binary convolutional neural network in commodity DRAM

  • Jaehyeong Sim
  • Hoseok Seol
  • Lee-Sup Kim

Recent large-scale CNNs suffer from a severe memory-wall problem, as their numbers of weights range from tens to hundreds of millions. Processing in-memory (PIM) and binary CNNs have been proposed to reduce the number of memory accesses and the memory footprint, respectively. Combining these two separate concepts, we propose a novel processing-in-DRAM framework for binary CNNs, called NID, where the dominant convolution operations are processed using in-DRAM bulk bitwise operations. We first identify the problem that bitcount operations built from only bulk bitwise AND/OR/NOT incur significant delay overhead as kernels get larger. Then, we not only optimize performance by efficiently allocating inputs and kernels to DRAM banks for both convolutional and fully-connected layers through design-space exploration, but also mitigate the overhead of the bitcount operations by splitting kernels into multiple parts. Partial-sum accumulations and the tasks of the other layers, such as max-pooling and normalization, are processed in the peripheral area of the DRAM with negligible overhead. In our results, the NID framework achieves 19X-36X performance and 9X-14X EDP improvements for convolutional layers, and 9X-17X performance and 1.4X-4.5X EDP improvements for fully-connected layers, over the previous PIM technique on four large-scale CNN models.

AXNet: approximate computing using an end-to-end trainable neural network

  • Zhenghao Peng
  • Xuyang Chen
  • Chengwen Xu
  • Naifeng Jing
  • Xiaoyao Liang
  • Cewu Lu
  • Li Jiang

Neural-network-based approximate computing is a universal architecture promising tremendous energy efficiency for many error-resilient applications. To guarantee the approximation quality, existing works deploy two neural networks (NNs): an approximator and a predictor. The approximator provides the approximate results, while the predictor predicts whether the input data is safe to approximate under the given quality requirement. However, it is non-trivial and time-consuming to make these two neural networks coordinate—they have different optimization objectives—by training them separately. This paper proposes a novel neural network structure—AXNet—that fuses the two NNs into a holistic, end-to-end trainable NN. Leveraging the philosophy of multi-task learning, AXNet can tremendously improve the invocation (the proportion of safe-to-approximate samples) and reduce the approximation error. The training effort also decreases significantly. Experimental results show 50.7% more invocation and substantial cuts in training time compared to existing neural-network-based approximate computing frameworks.

Scalable-effort ConvNets for multilevel classification

  • Valentino Peluso
  • Andrea Calimera

This work introduces the concept of scalable-effort Convolutional Neural Networks (ConvNets), an effort-accuracy scalable model for classification of data at multiple levels of abstraction. Scalable-effort ConvNets are able to adapt at run-time to the complexity of the classification problem, i.e., the level of abstraction defined by the application (or context), and reach a given classification accuracy with minimal computational effort. The mechanism is implemented using a single scalable-precision weight model rather than an ensemble of quantized-weight models; this makes the proposed strategy highly flexible and particularly suited for embedded architectures with limited resource availability.

The paper describes (i) a hardware/software vertical implementation of scalable-precision multiply-and-accumulate arithmetic, and (ii) an accuracy-constrained heuristic that delivers near-optimal layer-by-layer precision mapping at a predefined level of abstraction. It also reports validation on three state-of-the-art nets, i.e., AlexNet, SqueezeNet and MobileNet, trained and tested with ImageNet. The collected results show that scalable-effort ConvNets guarantee flexibility and substantial savings: 47.07% computational-effort reduction at minimum accuracy, or 30.6% accuracy improvement at maximum effort, w.r.t. standard flat ConvNets (averaged over the three benchmarks for high-level classification).

Emerging reconfigurable nanotechnologies: can they support future electronics?

  • Shubham Rai
  • Srivatsa Srinivasa
  • Patsy Cadareanu
  • Xunzhao Yin
  • Xiaobo Sharon Hu
  • Pierre-Emmanuel Gaillardon
  • Vijaykrishnan Narayanan
  • Akash Kumar

Several emerging reconfigurable technologies have been explored in recent years, offering device-level runtime reconfigurability. These technologies offer the freedom to choose between p- and n-type functionality from a single transistor. In order to optimally utilize the feature sets of these technologies, circuit designs and storage elements require novel designs to complement existing and future electronic requirements. An important aspect of sustaining such endeavors is to supplement the existing design flow from the device level to the circuit level, backed by a thorough evaluation to ascertain the feasibility of such explorations. Additionally, since these technologies offer runtime reconfigurability and often encapsulate more than one function, hardware security features such as polymorphic logic gates and on-chip key storage come naturally and cheaply with circuits based on these reconfigurable technologies. This paper presents innovative approaches devised for circuit designs harnessing the reconfigurable features of these nanotechnologies. New circuit design paradigms based on these nano-devices are discussed to brainstorm on exciting avenues for novel computing elements.

Design and algorithm for clock gating and flip-flop co-optimization

  • Giyoung Yang
  • Taewhan Kim

This work is the first to investigate how the design of data-driven (i.e., toggling-based) clock gating can be closely integrated with the synthesis of flip-flops, which has never been addressed in prior clock gating works. Our key observation is that some internal part of a flip-flop cell can be reused to generate its clock gating enable signal. Based on this, we propose a newly optimized flip-flop wiring structure, called eXOR-FF, in which internal logic can be reused every clock cycle to decide whether the flip-flop is to be activated or deactivated through clock gating, thereby achieving area savings (and thus leakage as well as dynamic power savings) on every pair of a flip-flop and its toggling-detection logic. We then propose a comprehensive methodology of placement- and timing-aware clock gating exploration that provides two unique strengths: it is best suited to maximally exploit the benefit of eXOR-FFs, and it precisely analyzes the decomposition of power consumption and timing impact, translating them into cost functions in the core engine of the clock gating exploration.

Macro-aware row-style power delivery network design for better routability

  • Jai-Ming Lin
  • Jhih-Sheng Syu
  • I-Ru Chen

The reliability of a P/G network is one of the most important concerns in chip design, which makes power planning the most critical step in physical design. Traditional P/G network design mainly focuses on reducing the usage of routing resources to satisfy voltage-drop and electromigration constraints according to a regular mesh. As the number of macros in modern designs increases, this style may waste routing resources and make routing congestion more severe in local regions. In order to save routing resources and increase routability, this paper proposes a refined power-planning method. First, we propose a row-style power mesh to facilitate the connection of pre-placed macros and increase the routability of signal nets in the later stages. Besides, an effective power-stripe width is found which reduces the waste of routing resources and provides a stronger supply voltage. Moreover, we present the first work to use a linear programming algorithm to minimize the P/G routing area while considering routability at the same time. The experimental results show that the routability of a design with many macros can be significantly improved by our row-style power networks.

Modeling and optimization of magnetic core TSV-inductor for on-chip DC-DC converter

  • Baixin Chen
  • Umamaheswara Tida
  • Cheng Zhuo
  • Yiyu Shi

Conventional on-chip spiral inductors consume significant top-metal routing area, which has prevented their adoption in many on-chip applications. Recently, a TSV-inductor with a magnetic core has been proven a viable option for an on-chip DC-DC converter in a 14nm test chip. The operating conditions of such inductors play a major role in maximizing the performance and efficiency of the DC-DC converter. However, due to its unique TSV structure, unlike the conventional spiral inductor, many of the modeling details remain unclear. This paper analyzes the modeling details of a magnetic-core TSV-inductor and proposes a design methodology to optimize the power losses of the inductor. With this methodology, designers can ensure fast and reliable inductor optimization for on-chip applications. Experimental results show that the optimized magnetic-core TSV-inductor achieves inductance density improvements of 6.0–7.7X and quality factor improvements of 1.3–1.6X while maintaining the same footprint.

Machine-learning-based dynamic IR drop prediction for ECO

  • Yen-Chun Fang
  • Heng-Yi Lin
  • Min-Yan Su
  • Chien-Mo Li
  • Eric Jia-Wei Fang

During design signoff, many Engineering Change Order (ECO) iterations are needed to ensure that the IR drop of each cell instance meets the specified limit. This wastes resources, because repeated dynamic IR drop simulations take a very long time on very similar designs. In this work, we train a machine learning model on data from before an ECO and predict IR drop after the ECO. To increase prediction accuracy, we propose 17 timing-aware, power-aware, and physical-aware features. Our method is scalable because the feature dimension is fixed (937), independent of design size and cell library. We also propose building regional models for cell instances near IR drop violations, which improves both prediction accuracy and training time. Our experiments show a prediction correlation coefficient of 0.97 and an average error of 3.0mV on a 5-million-cell industrial design. Our IR drop prediction for 100K cell instances completes within 2 minutes. The proposed method thus provides fast IR drop prediction to speed up ECO.
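
In outline, the modeling step amounts to fitting a regressor on fixed-length per-instance feature vectors. The sketch below uses scikit-learn with random stand-in data; the gradient-boosting learner is an assumption, as the abstract does not name the model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_pre_eco = rng.random((5000, 937))               # 937 features per cell instance
ir_drop_mv = X_pre_eco[:, :10].sum(axis=1) * 3.0  # synthetic IR-drop target (mV)

# Train on pre-ECO data, then predict post-ECO IR drop without re-simulating.
model = GradientBoostingRegressor().fit(X_pre_eco, ir_drop_mv)

X_post_eco = rng.random((100, 937))               # instances after the ECO
print(model.predict(X_post_eco)[:3])              # fast IR-drop estimates
```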

Privacy-preserving deep learning and inference

  • M. Sadegh Riazi
  • Farinaz Koushanfar

We provide a systematization of knowledge of the recent progress made in addressing the crucial problem of deep learning on encrypted data. The problem is important due to the prevalence of deep learning models across various applications, and privacy concerns over the exposure of deep learning IP and users’ data. Our focus is on provably secure methodologies that rely on cryptographic primitives and not on trusted third parties/platforms. The computational intensity of the learning models, together with the complexity of realizing the cryptographic algorithms, makes practical implementation a challenge. We provide a summary of the state of the art, a comparison of the existing solutions, as well as future challenges and opportunities.

Machine learning IP protection

  • Rosario Cammarota
  • Indranil Banerjee
  • Ofer Rosenberg

Machine learning, specifically deep learning, is becoming a key technology component in application domains such as identity management, finance, automotive, and healthcare, to name a few. Proprietary machine learning models – Machine Learning IP – are developed and deployed at the network edge, on end devices and in the cloud, to maximize user experience.

With the proliferation of applications embedding Machine Learning IP, machine learning models and hyper-parameters become attractive targets for attackers, and require protection. Major players in the semiconductor industry provide on-device mechanisms to protect the IP at rest and during execution from being copied, altered, reverse engineered, and abused by attackers. In this work we explore system security architecture mechanisms and their application to Machine Learning IP protection.

Assured deep learning: practical defense against adversarial attacks

  • Bita Darvish Rouhani
  • Mohammad Samragh
  • Mojan Javaheripi
  • Tara Javidi
  • Farinaz Koushanfar

Deep Learning (DL) models have been shown to be vulnerable to adversarial attacks. In light of such attacks, it is critical to reliably quantify the confidence of a neural network’s predictions to enable the safe adoption of DL models in autonomous, sensitive tasks (e.g., unmanned vehicles and drones). This article discusses recent research advances in unsupervised model assurance against the strongest adversarial attacks known to date and quantitatively compares their performance. Given the widespread usage of DL models, it is imperative to provide model assurance by carefully looking into the feature maps automatically learned within DL models, instead of looking back with regret when deep learning systems are compromised by adversaries.

Tetris: re-architecting convolutional neural network computation for machine learning accelerators

  • Hang Lu
  • Xin Wei
  • Ning Lin
  • Guihai Yan
  • Xiaowei Li

Inference efficiency is the predominant consideration in designing deep learning accelerators. Previous work mainly focuses on skipping zero values to deal with the remarkable amount of ineffectual computation, while zero bits in non-zero values, another major source of ineffectual computation, are often ignored. The reason lies in the difficulty of extracting essential bits while operating multiply-and-accumulate (MAC) units in the processing element. Based on the fact that zero bits occupy as much as a 68.9% fraction of the overall weights of modern deep convolutional neural network models, this paper first proposes a weight kneading technique that eliminates ineffectual computation caused by either zero-valued weights or zero bits in non-zero weights, simultaneously. Besides, a split-and-accumulate (SAC) computing pattern replacing conventional MAC, as well as the corresponding hardware accelerator design called Tetris, are proposed to support weight kneading at the hardware level. Experimental results show that Tetris speeds up inference by up to 1.50x and improves power efficiency by up to 5.33x compared with state-of-the-art baselines.
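
The bit-level idea behind SAC can be shown in a few lines: a multiplication is replaced by shift-and-accumulate over only the set bits of the weight, so zero bits cost nothing. The kneading step that packs essential bits across weights is not modeled here.

```python
def essential_bits(w):
    """Positions of the 1-bits of a non-negative integer weight."""
    return [i for i in range(w.bit_length()) if (w >> i) & 1]

def sac_mac(activations, weights):
    """Multiply-accumulate via shift-and-accumulate over essential bits only."""
    acc = 0
    for a, w in zip(activations, weights):
        for b in essential_bits(abs(w)):          # zero bits are simply skipped
            acc += (a << b) if w > 0 else -(a << b)
    return acc

print(sac_mac([3, 5, 7], [2, 0, 6]))   # -> 48, i.e. 3*2 + 5*0 + 7*6
```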

FCN-engine: accelerating deconvolutional layers in classic CNN processors

  • Dawen Xu
  • Kaijie Tu
  • Ying Wang
  • Cheng Liu
  • Bingsheng He
  • Huawei Li

Unlike standard Convolutional Neural Networks (CNNs) with fully-connected layers, Fully Convolutional Neural Networks (FCNs) are prevalent in computer vision applications such as object detection, semantic/image segmentation, and the most popular generative tasks based on Generative Adversarial Networks (GANs). In an FCN, traditional convolutional layers and deconvolutional layers contribute the majority of the computation complexity. However, prior deep learning accelerator designs mostly focus on CNN optimization. They either use independent compute resources to handle deconvolution or convert deconvolutional layers (Deconv) into general convolution operations, which incurs considerable overhead.

To address this problem, we propose a unified fully convolutional accelerator aiming to handle both deconvolutional and convolutional layers with a single processing element (PE) array. We re-optimize the conventional CNN accelerator architecture of a regular 2D processing element array to enable it to more efficiently support the data flow of deconvolutional layer inference. By exploiting the locality in deconvolutional filters, this architecture reduces on-chip memory communication from 24.79 GB to 6.56 GB and improves power efficiency significantly. Compared to the prior baseline deconvolution acceleration scheme, the proposed accelerator achieves 1.3X — 44.9X speedup and reduces energy consumption by 14.6%-97.6% on a set of representative benchmark applications. Meanwhile, it maintains CNN inference performance similar to that of an optimized CNN-only accelerator, with negligible power consumption and chip area overhead.

Designing adaptive neural networks for energy-constrained image classification

  • Dimitrios Stamoulis
  • Ting-Wu (Rudy) Chin
  • Anand Krishnan Prakash
  • Haocheng Fang
  • Sribhuvan Sajja
  • Mitchell Bognar
  • Diana Marculescu

As convolutional neural networks (CNNs) enable state-of-the-art computer vision applications, their high energy consumption has emerged as a key impediment to their deployment on embedded and mobile devices. Towards efficient image classification under hardware constraints, prior work has proposed adaptive CNNs, i.e., systems of networks with different accuracy and computation characteristics, where a selection scheme adaptively selects the network to be evaluated for each input image. While previous efforts have investigated different network selection schemes, we find that they do not necessarily result in energy savings when deployed on mobile systems. The key limitation of existing methods is that they learn only how data should be processed among the CNNs and not the network architectures themselves, with each network treated as a black box.

To address this limitation, we pursue a more powerful design paradigm where the architecture settings of the CNNs are treated as hyper-parameters to be globally optimized. We cast the design of adaptive CNNs as a hyper-parameter optimization problem with respect to the energy, accuracy, and communication constraints imposed by the mobile device. To efficiently solve this problem, we adapt Bayesian optimization to the properties of the design space, reaching near-optimal configurations in a few tens of function evaluations. Our method reduces the energy consumed for image classification on a mobile device by up to 6X, compared to the best previously published work that uses CNNs as black boxes. Finally, we evaluate two image classification practices, i.e., classifying all images locally versus over the cloud, under energy and communication constraints.

FATE: fast and accurate timing error prediction framework for low power DNN accelerator design

  • Jeff (Jun) Zhang
  • Siddharth Garg

Deep neural networks (DNNs) are increasingly being accelerated on application-specific hardware such as the Google TPU, designed especially for deep learning. Timing speculation is a promising approach to further increase the energy efficiency of DNN accelerators. Architectural exploration for timing speculation requires detailed gate-level timing simulations, which can be time-consuming for large DNNs that execute millions of multiply-and-accumulate (MAC) operations. In this paper we propose FATE, a new methodology for fast and accurate timing simulations of DNN accelerators like the Google TPU. FATE introduces two novel ideas: (i) DelayNet, a DNN-based timing model for MAC units; and (ii) a statistical sampling methodology that reduces the number of MAC operations for which timing simulations are performed. We show that FATE results in an 8X–58X speed-up in timing simulations, while introducing less than 2% error in classification accuracy estimates. We demonstrate the use of FATE by comparing a conventional DNN accelerator that uses 2’s complement (2C) arithmetic with one that uses signed-magnitude representation (SMR). We show that the SMR implementation provides 18% more energy savings than 2C for the same classification accuracy, a result that might be of independent interest.

Waterfall is too slow, let’s go Agile: multi-domain coupling for synthesizing automotive cyber-physical systems

  • Debayan Roy
  • Michael Balszun
  • Thomas Heurung
  • Samarjit Chakraborty
  • Amol Naik

For future autonomous vehicles, the system development life cycle must keep up with the rapid rate of innovation and the changing needs of the market. Waterfall is too slow to react to such changes, and therefore there is a growing emphasis on adopting Agile development concepts in the automotive industry. Ensuring requirements traceability, and thus proving functional safety, is a serious challenge in this direction. Modern cars are complex cyber-physical systems and are traditionally designed using a set of disjoint tools, which adds to the challenge. In this paper, we point out that multi-domain coupling and design automation using correct-by-design approaches can lead to safe designs even in an Agile environment. In this context, we study current industry trends. We further outline the challenges involved in multi-domain coupling and demonstrate, using a state-of-the-art approach, how these challenges can be addressed by exploiting domain-specific knowledge.

Model-based and data-driven approaches for building automation and control

  • Tianshu Wei
  • Xiaoming Chen
  • Xin Li
  • Qi Zhu

Smart buildings in the future are complex cyber-physical-human systems that involve close interactions among embedded platform (for sensing, computation, communication and control), mechanical components, physical environment, building architecture, and occupant activities. The design and operation of such buildings require a new set of methodologies and tools that can address these heterogeneous domains in a holistic, quantitative and automated fashion. In this paper, we will present our design automation methods for improving building energy efficiency and offering comfortable services to occupants at low cost. In particular, we will highlight our work in developing both model-based and data-driven approaches for building automation and control, including methods for co-scheduling heterogeneous energy demands and supplies, for integrating intelligent building energy management with grid optimization through a proactive demand response framework, for optimizing HVAC control with deep reinforcement learning, and for accurately measuring in-building temperature by combining prior modeling information with few sensor measurements based upon Bayesian inference.

Design automation for battery systems

  • Swaminathan Narayanaswamy
  • Sangyoung Park
  • Sebastian Steinhorst
  • Samarjit Chakraborty

High-power Lithium-Ion (Li-Ion) battery packs used in stationary Electrical Energy Storage (EES) systems and Electric Vehicle (EV) applications require a sophisticated Battery Management System (BMS) in order to maintain safe operation and improve performance. With the increasing complexity of these battery packs and their demand for shorter time-to-market, decentralized approaches to battery management, providing a high degree of modularity, scalability and improved control performance, are typically preferred. However, manual design approaches for these complex distributed systems are time-consuming and error-prone, resulting in reduced energy efficiency of the overall system. Here, dedicated design automation techniques considering all abstraction levels of the battery system are required to obtain highly optimized battery packs. This paper presents, from a design automation perspective, recent advances in the domain of battery systems, i.e., the combination of electrochemical cells and their associated management modules. Specifically, we classify battery systems into three abstraction levels: the cell level (battery cells and their interconnection schemes), the module level (sensing and charge-balancing circuits) and the pack level (computation and control algorithms). We provide an overview of the challenges that exist at each abstraction level and give an outlook on the future design automation techniques required to overcome these limitations.

RFUZZ: coverage-directed fuzz testing of RTL on FPGAs

  • Kevin Laeufer
  • Jack Koenig
  • Donggyu Kim
  • Jonathan Bachrach
  • Koushik Sen

Dynamic verification is widely used to increase confidence in the correctness of RTL circuits during the pre-silicon design phase. Despite numerous attempts over the last decades to automate stimuli generation based on coverage feedback, Coverage-Directed Test Generation (CDG) has not found the widespread adoption that one would expect. Based on new ideas from the software testing community around coverage-guided mutational fuzz testing, we propose a new approach to the CDG problem that requires minimal setup and takes advantage of FPGA-accelerated simulation for rapid testing. We provide test input and coverage definitions that allow fuzz testing to be applied to RTL circuit verification. In addition, we propose and implement a series of transformation passes that make it feasible to reset arbitrary RTL designs quickly, a requirement for deterministic test execution. Alongside this paper we provide rfuzz, a fully featured implementation of our testing methodology, which we make available as open-source software to the research community. An empirical evaluation of rfuzz shows promising results in achieving coverage for a wide range of RTL designs, from communication IPs to an industry-scale 64-bit CPU.

Steep coverage-ascent directed test generation for shared-memory verification of multicore chips

  • Gabriel A. G. Andrade
  • Marleson Graf
  • Nícolas Pfeifer
  • Luiz C. V. dos Santos

This paper proposes a framework for the functional verification of shared memory that relies on reusable coverage-driven directed test generation. It reveals a new mechanism for improving the quality of non-deterministic tests. The generator exploits general properties of coherence protocols and cache memories for better control of transition coverage, which serves as a proxy for increasing the actual coverage metric adopted in a given verification environment. Being independent of the coverage metric, coherence protocol, and cache parameters, the proposed generator is reusable across quite different designs and verification environments. We report the coverage for 8-, 16-, and 32-core designs and the effort required to expose nine different types of errors. The proposed technique was always able to reach coverage similar to that of a state-of-the-art generator, and above a certain threshold it always did so faster. For instance, when executing tests with 1K operations to verify 32-core designs, the former reached 65% coverage around 5 times faster than the latter. Besides, we identified challenging errors that could hardly be found by the latter within one hour, but were exposed by our technique in 5 to 30 minutes.

SMTSampler: efficient stimulus generation from complex SMT constraints

  • Rafael Dutra
  • Jonathan Bachrach
  • Koushik Sen

Stimulus generation is an essential part of hardware verification, being at the core of widely applied constrained-random verification techniques. However, as verification problems get more and more complex, so do the constraints which must be satisfied. In this context, it is a challenge to efficiently generate random stimuli which can achieve a good coverage of the design space. We developed a new technique SMTSampler which can sample random solutions from Satisfiability Modulo Theories (SMT) formulas with bit-vectors, arrays, and uninterpreted functions. The technique uses a small number of calls to a constraint solver in order to generate up to millions of stimuli. Our evaluation on a large set of complex industrial SMT benchmarks shows that SMTSampler can handle a larger class of SMT problems, outperforming state-of-the-art constraint sampling techniques in the number of samples produced and the coverage of the constraint space.
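
For orientation, a naive constrained-random loop over an SMT solver looks like the sketch below (using the z3 Python bindings; `pip install z3-solver`). SMTSampler's contribution is to derive many samples from few solver calls, which this baseline does not do; the constraints shown are illustrative.

```python
from z3 import BitVec, Solver, sat

a, b = BitVec("a", 8), BitVec("b", 8)
s = Solver()
s.add(a + b == 100, a > 10, b > 10)       # example stimulus constraints

samples = []
while len(samples) < 5 and s.check() == sat:
    m = s.model()
    samples.append((m[a].as_long(), m[b].as_long()))
    s.add(a != m[a])                      # block this value of a, then resample
print(samples)                            # one solver call per sample: the
                                          # inefficiency SMTSampler avoids
```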

DL-RSIM: a simulation framework to enable reliable ReRAM-based accelerators for deep learning

  • Meng-Yao Lin
  • Hsiang-Yun Cheng
  • Wei-Ting Lin
  • Tzu-Hsien Yang
  • I-Ching Tseng
  • Chia-Lin Yang
  • Han-Wen Hu
  • Hung-Sheng Chang
  • Hsiang-Pang Li
  • Meng-Fan Chang

Memristor-based deep learning accelerators provide a promising solution to improve the energy efficiency of neuromorphic computing systems. However, the electrical properties and crossbar structure of memristors make these accelerators error-prone. To enable reliable memristor-based accelerators, a simulation platform is needed to precisely analyze the impact of non-ideal circuit and device properties on the inference accuracy. In this paper, we propose a flexible simulation framework, DL-RSIM, to tackle this challenge. DL-RSIM simulates the error rates of every sum-of-products computation in the memristor-based accelerator and injects the errors in the targeted TensorFlow-based neural network model. A rich set of reliability impact factors are explored by DL-RSIM, and it can be incorporated with any deep learning neural network implemented by TensorFlow. Using three representative convolutional neural networks as case studies, we show that DL-RSIM can guide chip designers to choose a reliability-friendly design option and develop reliability optimization techniques.

A ferroelectric FET based power-efficient architecture for data-intensive computing

  • Yun Long
  • Taesik Na
  • Prakshi Rastogi
  • Karthik Rao
  • Asif Islam Khan
  • Sudhakar Yalamanchili
  • Saibal Mukhopadhyay

In this paper, we present a ferroelectric FET (FeFET) based power-efficient architecture to accelerate data-intensive applications such as deep neural networks (DNNs). We propose a cross-cutting solution combining emerging device technologies, circuit optimizations, and micro-architecture innovations. At device level, FeFET crossbar is utilized to perform vector-matrix multiplication (VMM). As a field effect device, FeFET significantly reduces the read/write energy compared with the resistive random-access memory (ReRAM). At circuit level, we propose an all-digital peripheral design, reducing the large overhead introduced by ADC and DAC in prior works. In terms of micro-architecture innovation, a dedicated hierarchical network-on-chip (H-NoC) is developed for input broadcasting and on-the-fly partial results processing, reducing the data transmission volume and latency. Speed, power, area and computing accuracy are evaluated based on detailed device characterization and system modeling. For DNN computing, our design achieves 254x and 9.7x gain in power efficiency (GOPS/W) compared to GPU and ReRAM based designs, respectively.

EMAT: an efficient multi-task architecture for transfer learning using ReRAM

  • Fan Chen
  • Hai Li

Transfer learning has recently demonstrated great success in general supervised learning by mitigating expensive training efforts. However, existing neural network accelerators have proven inefficient at executing transfer learning, because they fail to accommodate the layer-wise heterogeneity in computation and memory requirements. In this work, we propose EMAT—an efficient multi-task architecture for transfer learning built on resistive memory (ReRAM) technology. EMAT utilizes the energy efficiency of ReRAM arrays for matrix-vector multiplication and realizes a hierarchical reconfigurable design with heterogeneous computation components to incorporate the data patterns in transfer learning. Compared to the GPU platform, EMAT achieves on average a 120X performance speedup and 87X energy saving. EMAT also obtains a 2.5X speedup compared to the state-of-the-art CMOS accelerator.

Co-manage power delivery and consumption for manycore systems using reinforcement learning

  • Haoran Li
  • Zhongyuan Tian
  • Rafael K. V. Maeda
  • Xuanqi Chen
  • Jun Feng
  • Jiang Xu

Maintaining high energy efficiency has become a critical design issue for high-performance systems. Many power management techniques, such as dynamic voltage and frequency scaling (DVFS), have been proposed for processor cores. However, very few solutions consider the power losses incurred in the power delivery system (PDS), despite the fact that they have a significant impact on the system's overall energy efficiency. With the explosive growth of system complexity and highly dynamic workload variations, it is challenging to find optimal power management policies that effectively match power delivery with power consumption. To tackle these problems, we propose a reinforcement learning-based power management scheme for manycore systems that jointly monitors and adjusts both the PDS and the processor cores to improve overall system energy efficiency. The learning agents distributed across power domains not only manage the power states of processor cores but also control the on/off states of on-chip VRs to proactively adapt to workload variations. Experimental results with realistic applications show that when the proposed approach is applied to a large-scale system with a hybrid PDS, it lowers the system's overall energy-delay product (EDP) by 41% compared to a traditional monolithic DVFS approach with a bulky off-chip VR.

Adaptive-precision framework for SGD using deep Q-learning

  • Wentai Zhang
  • Hanxian Huang
  • Jiaxi Zhang
  • Ming Jiang
  • Guojie Luo

Stochastic gradient descent (SGD) is a widely used algorithm in many applications, especially in the training process of deep learning models. Low-precision implementation of SGD has been studied as a major acceleration approach. However, if not used appropriately, low-precision implementation can deteriorate convergence because of rounding error when gradients become small near a local optimum. In this work, to balance throughput and algorithmic accuracy, we apply the Q-learning technique to adjust the precision of SGD automatically by designing an appropriate decision function. The proposed decision function for Q-learning takes the error rate of the objective function, its gradients, and the current precision configuration as inputs. Q-learning then chooses the proper precision adaptively for hardware efficiency and algorithmic accuracy. We use reconfigurable devices such as FPGAs to evaluate the adaptive precision configurations generated by the proposed Q-learning method. We prototype the framework using the LeNet-5 model with the MNIST and CIFAR10 datasets and implement it on a Xilinx KCU1500 FPGA board. In the experiments, we analyze the throughput of different precision representations and the precision selection of our framework. The results show that the proposed framework with adaptive precision increases throughput by up to 4.3x compared to the conventional 32-bit floating-point setting, and it achieves both the best hardware efficiency and algorithmic accuracy.
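
The following toy sketch illustrates the tabular Q-learning loop that could drive such precision selection; the three bit widths, the state encoding, and the reward are hypothetical stand-ins for the paper's decision function.

    import random

    PRECISIONS = [8, 16, 32]            # candidate bit widths (hypothetical)
    Q = {}                              # Q[(state, action)] -> estimated value

    def choose(state, eps=0.1):
        """Epsilon-greedy choice of a precision index for the next epoch."""
        if random.random() < eps:
            return random.randrange(len(PRECISIONS))
        return max(range(len(PRECISIONS)), key=lambda a: Q.get((state, a), 0.0))

    def update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
        """Standard one-step Q-learning update."""
        best_next = max(Q.get((next_state, a), 0.0) for a in range(len(PRECISIONS)))
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

    # e.g., trying 16-bit while gradients were small paid off this epoch:
    update("small-gradients", 1, reward=1.0, next_state="small-gradients")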

Differentiated handling of physical scenes and virtual objects for mobile augmented reality

  • Chih-Hsuan Yen
  • Wei-Ming Chen
  • Pi-Cheng Hsiu
  • Tei-Wei Kuo

Mobile devices running augmented reality applications consume considerable energy for graphics-intensive workloads. This paper presents a scheme for the differentiated handling of camera-captured physical scenes and computer-generated virtual objects according to different perceptual quality metrics. We propose online algorithms and their real-time implementations to reduce energy consumption through dynamic frame rate adaptation while maintaining the visual quality required for augmented reality applications. To evaluate system efficacy, we integrate our scheme into Android and conduct extensive experiments on a commercial smartphone with various application scenarios. The results show that the proposed scheme can achieve energy savings of up to 39.1% in comparison to the native graphics system in Android while maintaining satisfactory visual quality.

DATC RDF: an academic flow from logic synthesis to detailed routing

  • Jinwook Jung
  • Iris Hui-Ru Jiang
  • Jianli Chen
  • Shih-Ting Lin
  • Yih-Lang Li
  • Victor N. Kravets
  • Gi-Joon Nam

In this paper, we present the DATC Robust Design Flow (RDF), an academic flow from logic synthesis to detailed routing. We further include detailed placement and detailed routing tools based on recent EDA research contests. We also demonstrate RDF on a scalable cloud infrastructure. Design methodology and cross-stage optimization research can be conducted via RDF.

Physical modeling of bitcell stability in subthreshold SRAMs for leakage-area optimization under PVT variations

  • Xin Fan
  • Rui Wang
  • Tobias Gemmeke

Subthreshold SRAM design is crucial for addressing the memory bottleneck in energy-constrained applications. While statistical optimization can be applied based on Monte-Carlo (MC) simulation, exploring the bitcell design space this way is time-consuming. This paper presents a framework for model-based design and optimization of subthreshold SRAM bitcells under random PVT variations. By incorporating key design and process features, a physical model of bitcell static noise margin (SNM) is derived analytically. It captures intra-die SNM variations by combining a folded-normal distribution with a non-central chi-squared distribution. Validation against MC simulation shows that it accurately models SNM distributions down to 25mV, beyond 6-sigma, for typical bitcells in 28nm. Model-based tuning of subthreshold SRAM bitcells is investigated for the design trade-off between leakage, area, and stability. When targeting a specific SNM constraint, we show that an optimal standby voltage exists which offers minimum bitcell leakage power; any deviation above or below it increases power consumption. When targeting a specific standby voltage, our design flow identifies bitcell instances with 12x less leakage power or a 3x reduction in area compared to the minimum-length design.

Comparing voltage adaptation performance between replica and in-situ timing monitors

  • Yutaka Masuda
  • Jun Nagayama
  • Hirotaka Takeno
  • Yoshimasa Ogawa
  • Yoichi Momiyama
  • Masanori Hashimoto

Adaptive voltage scaling (AVS) is a promising approach to overcome manufacturing variability, dynamic environmental fluctuation, and aging. This paper focuses on the timing sensors necessary for AVS implementation and compares in-situ timing error predictive FFs (TEP-FFs) and critical path replicas in terms of how much voltage margin can be reduced. To estimate the theoretical bound of ideal AVS, this work proposes a linear programming based minimum supply voltage analysis and discusses voltage adaptation performance quantitatively by investigating the gap between the lower bound and actual supply voltages. Experimental results show that TEP-FF based AVS and replica based AVS achieve up to 13.3% and 8.9% supply voltage reduction, respectively, while satisfying the target MTTF. AVS with TEP-FFs tracks the theoretical bound with a 2.5 to 5.6% voltage margin, while AVS with a replica needs a 7.2 to 9.9% margin.

Strain-aware performance evaluation and correction for OTFT-based flexible displays

  • Tengtao Li
  • Sachin S. Sapatnekar

Organic thin-film transistors (OTFTs) are widely used in flexible circuits, such as flexible displays, sensor arrays, and radio-frequency identification cards (RFIDs), because these technologies offer better flexibility, lower cost, and easy manufacturability using low-temperature fabrication processes. This paper develops a procedure that evaluates the performance of flexible displays. By their very nature, flexible displays experience significant mechanical strain/stress in the field due to the deformation caused by daily use. These deformations can impact device and circuit performance, potentially causing a loss of functionality. This paper first models the effects of extrinsic strain due to two fundamental deformation modes, bending and twisting. Next, this strain is translated into variations in device mobility, after which analytical models for error analysis in the flexible display are derived based on the rendered image values in each pixel of the display. Finally, two error correction approaches for flexible displays are proposed, based on voltage compensation and flexible clocking.

Achieving fast sanitization with zero live data copy for MLC flash memory

  • Ping-Hsien Lin
  • Yu-Ming Chang
  • Yung-Chun Li
  • Wei-Chen Wang
  • Chien-Chung Ho
  • Yuan-Hao Chang

As data security has become a major concern in modern storage systems built on low-cost multi-level-cell (MLC) flash memories, realizing data sanitization in such systems is not trivial. Even though some existing works employ encryption or the built-in erase operation to achieve this requirement, they still suffer from the risk of being deciphered or from performance degradation. In contrast to existing work, a fast sanitization scheme is proposed to provide the highest degree of security for data sanitization; that is, every old version of data can be immediately sanitized with zero live-data-copy overhead once the new version of data is created/written. In particular, this scheme further considers the reliability issue of MLC flash memories and includes a one-shot sanitization design to minimize the disturbance during data sanitization. The feasibility and the capability of the proposed scheme were evaluated through extensive experiments based on real flash chips. The results demonstrate that this scheme can achieve data sanitization with zero live-data-copy at a performance overhead of less than 1%.

Architecting data placement in SSDs for efficient secure deletion implementation

  • Hoda Aghaei Khouzani
  • Chen Liu
  • Chengmo Yang

Secure deletion ensures user privacy by permanently removing invalid data from secondary storage. This process is particularly critical for solid-state drives (SSDs), wherein invalid data are generated not only when a file is deleted but also when a file is updated without the user being aware of it. While previous secure deletion schemes are usually applied to all invalid data on the SSD, our observation is that in many cases security is not required for all files on the SSD. This paper proposes an efficient secure deletion scheme targeting only the invalid data of files marked as "secure" by the user. A security-aware data allocation strategy is designed, which separates secure and unsecure data at the lower (block) level but mixes them at higher levels of the SSD's hierarchical organization. Block-level separation minimizes secure deletion cost, while higher-level mixing mitigates the adverse impact of secure deletion on SSD lifetime. A two-level block management scheme is further developed to scatter secure blocks over the SSD for wear leveling. Experiments on real-world benchmarks confirm the advantage of the proposed scheme in reducing secure deletion cost and improving SSD lifetime.
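
A simplified sketch of the allocation idea, with hypothetical Block/Allocator structures: secure and unsecure pages never share a flash block, so scrubbing at deletion time touches only the blocks that ever held secure data.

    class Block:
        def __init__(self):
            self.pages = []
            self.secure = None          # a block holds only one class of data

    class Allocator:
        def __init__(self, n_blocks=8):
            self.blocks = [Block() for _ in range(n_blocks)]

        def write(self, page, secure):
            for b in self.blocks:       # reuse any block of the same class
                if b.secure in (None, secure):
                    b.secure = secure
                    b.pages.append(page)
                    return
            raise RuntimeError("out of blocks")

        def secure_delete(self):
            for b in self.blocks:       # scrub only blocks that held secure data
                if b.secure:
                    b.pages.clear()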

AxBA: an approximate bus architecture framework

  • Jacob R. Stevens
  • Ashish Ranjan
  • Anand Raghunathan

The exponential growth in creation and consumption of various forms of digital data has led to the emergence of new application workloads such as machine learning, data analytics and search. These workloads process large amounts of data and hence pose increased demands on the on-chip and off-chip interconnects of modern computing systems. Therefore, techniques that can improve the energy-efficiency and performance of interconnects are becoming increasingly important.

Security: the dark side of approximate computing?

  • Francesco Regazzoni
  • Cesare Alippi
  • Ilia Polian

Approximate computing promises significant advantages over more traditional computing architectures with respect to circuit area, performance, power efficiency, flexibility, and cost. It is suitable for applications where limited and controlled inaccuracies are tolerable or where uncertainty is intrinsic to the inputs or their processing, as in (deep) machine learning, image processing, and signal processing. This paper discusses a dimension of approximate computing that has been neglected so far despite representing a major concern today: security. A number of hardware-related security threats are considered, and the implications of approximate circuits or systems designed to address these threats are discussed.

Security aspects of neuromorphic MPSoCs

  • Johanna Sepulveda
  • Cezar Reinbrecht
  • Jean-Philippe Diguet

Neural networks and deep learning are promising techniques for bringing brain-inspired computing into embedded platforms. They pave the way to new kinds of associative memories, classifiers, data mining, machine learning, and search engines, which can be the basis of critical and sensitive applications such as autonomous driving. Emerging non-volatile memory technologies integrated in so-called Multi-Processor System-on-Chip (MPSoC) architectures enable the realization of such computational paradigms. These architectures take advantage of the Network-on-Chip concept to efficiently carry out communications with dedicated distributed memories and processing elements. However, current MPSoC-based neuromorphic architectures are deployed without taking security into account. The growing complexity and the hyper-sharing of hardware resources of MPSoCs may become a threat, increasing the risk of malware infections and Trojans introduced at design time. In particular, MPSoC microarchitectural side-channels and fault injection attacks can be exploited to leak sensitive information and to cause malfunctions. In this work we present three contributions to this issue: i) a first analysis of security issues in MPSoC-based neuromorphic architectures; ii) a discussion of the threat model of neuromorphic architectures; and iii) a demonstration of the correlation between SNN input and the neural computation.

Vulnerability-tolerant secure architectures

  • Todd Austin
  • Valeria Bertacco
  • Baris Kasikci
  • Sharad Malik
  • Mohit Tiwari

Today, secure systems are built by identifying potential vulnerabilities and then adding protections to thwart the associated attacks. Unfortunately, the complexity of today's systems makes it impossible to prove that all attacks are stopped, so clever attackers find a way around even the most carefully designed protections. In this article, we take a sobering look at the state of secure system design and ask ourselves why the "security arms race" never ends. The answer lies in our inability to develop adequate security verification technologies. We then examine an advanced defensive system found in nature, the human immune system, and discover that it does not remove vulnerabilities; rather, it adds offensive measures to protect the body when its vulnerabilities are penetrated. We close the article with brief speculation on how the human immune system could inspire more capable secure system designs.

Machine learning for performance and power modeling of heterogeneous systems

  • Joseph L. Greathouse
  • Gabriel H. Loh

Modern processing systems with heterogeneous components (e.g., CPUs, GPUs) have numerous configuration and design options such as the number and types of cores, frequency, and memory bandwidth. Hardware architects must perform design space explorations in order to accurately target markets of interest under tight time-to-market constraints. This need highlights the importance of rapid performance and power estimation mechanisms.

This work describes the use of machine learning (ML) techniques within a methodology for estimating the performance and power of heterogeneous systems. In particular, we measure the power and performance of a large collection of test applications running on real hardware across numerous hardware configurations. We use these measurements to train an ML model; the model learns how the applications scale with the system's key design parameters.

Later, new applications of interest are executed on a single configuration, and we gather hardware performance counter values that describe how the application used the hardware. These values are fed into our ML model's inference algorithm, which quickly identifies how this application will scale across various design points. In this way, we can rapidly predict the performance and power of the new application across a wide range of system configurations.

Once the initial run of the program is complete, our ML algorithm can predict the application’s performance and power at many hardware points faster than running it at each of those points and with a level of accuracy comparable to cycle-level simulators.
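
A hedged sketch of the overall flow with synthetic data: train a regressor that maps hardware performance counters plus configuration knobs to a performance target, then query it for a newly profiled application. The feature layout and the random-forest choice are illustrative assumptions, not the production model described above.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.random((500, 12))       # 10 counter values + 2 config knobs per sample
    y = X[:, 0] * X[:, 10] + X[:, 1] * X[:, 11]   # stand-in "scaling" target

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    new_app = rng.random((1, 12))   # counters from one profiling run + target config
    print(model.predict(new_app))   # predicted performance at that design point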

Machine learning for design space exploration and optimization of manycore systems

  • Ryan Gary Kim
  • Janardhan Rao Doppa
  • Partha Pratim Pande

In the emerging data-driven science paradigm, computing systems ranging from IoT and mobile to manycores and datacenters play distinct roles. These systems need to be optimized for the objectives and constraints dictated by the needs of the application. In this paper, we describe how machine learning techniques can be leveraged to improve the computational efficiency of hardware design optimization. This includes generic methodologies that are applicable to any hardware design space. As an example, we discuss a guided design space exploration framework to accelerate application-specific manycore systems design and advanced imitation learning techniques to improve on-chip resource management. We present experimental results for application-specific manycore system design optimization and dynamic power management to demonstrate the efficacy of these methods over traditional EDA approaches.

Failure prediction based on anomaly detection for complex core routers

  • Shi Jin
  • Zhaobo Zhang
  • Krishnendu Chakrabarty
  • Xinli Gu

Data-driven prognostic health management is essential to ensure high reliability and rapid error recovery in commercial core router systems. The effectiveness of prognostic health management depends on whether failures can be accurately predicted with sufficient lead time. This paper describes how time-series analysis and machine-learning techniques can be used to detect anomalies and predict failures in complex core router systems. First, both a feature-categorization-based hybrid method and a changepoint-based method have been developed to detect anomalies in time-varying features with different statistical characteristics. Next, an SVM-based failure predictor is developed to predict both the categories and the lead time of system failures from collected anomalies. A comprehensive set of experimental results is presented for data collected during 30 days of field operation from over 20 core routers deployed by customers of a major telecom company.
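
For flavor, a minimal scikit-learn sketch of the final stage, an SVM classifier over per-window anomaly features; the feature dimensionality and class labels here are synthetic placeholders, not the paper's dataset.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.random((300, 8))             # per-window anomaly features (synthetic)
    y = rng.integers(0, 3, size=300)     # 0 = healthy, 1 and 2 = failure categories

    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.predict(X[:5]))            # predicted categories for new windows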

Invocation-driven neural approximate computing with a multiclass-classifier and multiple approximators

  • Haiyue Song
  • Chengwen Xu
  • Qiang Xu
  • Zhuoran Song
  • Naifeng Jing
  • Xiaoyao Liang
  • Li Jiang

Neural approximate computing gains enormous energy efficiency at the cost of tolerable quality loss. A neural approximator maps input data to outputs, while a classifier determines whether the input data are safe to approximate with a quality guarantee. However, existing works cannot maximize the invocation of the approximator, resulting in limited speedup and energy saving. By exploring the mapping space of the target functions, in this paper, we observe a nonuniform distribution of the approximation error incurred by the same approximator. We thus propose a novel approximate computing architecture with a Multiclass-Classifier and Multiple Approximators (MCMA). These approximators have identical network topologies and thus can share the same hardware resources in a neural processing unit (NPU) chip. At runtime, MCMA can swap in the invoked approximator by merely shipping the synapse weights from the on-chip memory to the buffers near the MACs within a cycle. We also propose efficient co-training methods for the MCMA architecture. Experimental results show a substantially higher invocation rate for MCMA as well as gains in energy efficiency.

Deterministic methods for stochastic computing using low-discrepancy sequences

  • M. Hassan Najafi
  • David J. Lilja
  • Marc Riedel

Recently, deterministic approaches to stochastic computing (SC) have been proposed. These compute with the same constructs as stochastic computing but operate on deterministic bit streams. These approaches reduce the area, greatly reduce the latency (by an exponential factor), and produce completely accurate results. However, these methods do not scale well. Also, they lack the property of progressive precision enjoyed by SC. As a result, these deterministic approaches are not competitive for applications where some degree of inaccuracy can be tolerated. In this work we introduce two fast-converging, scalable deterministic approaches to SC based on low-discrepancy sequences. The results are completely accurate when running the operations for the required number of cycles. However, the computation can be truncated early if some inaccuracy is acceptable. Experimental results show that the proposed approaches significantly improve both the processing time and area-delay product compared to prior approaches.
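
The sketch below shows the low-discrepancy flavor of deterministic SC on a multiplication: each operand is encoded by thresholding a van der Corput sequence, and coprime bases keep the two streams uncorrelated. This illustrates the principle under assumed parameters, not the paper's exact sequence choice.

    def van_der_corput(n, base):
        """First n points of the base-b van der Corput low-discrepancy sequence."""
        seq = []
        for i in range(n):
            q, denom, x = i, 1.0, 0.0
            while q:
                q, r = divmod(q, base)
                denom *= base
                x += r / denom
            seq.append(x)
        return seq

    N = 256
    a, b = 0.5, 0.75
    sa = [u < a for u in van_der_corput(N, 2)]   # deterministic unary stream for a
    sb = [u < b for u in van_der_corput(N, 3)]   # coprime base decorrelates streams
    prod = sum(x and y for x, y in zip(sa, sb)) / N
    print(prod)   # approaches a*b = 0.375; truncating N early trades accuracy for time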

Design space exploration of multi-output logic function approximations

  • Jorge Echavarria
  • Stefan Wildermann
  • Jürgen Teich

Approximate computing has emerged as a design paradigm that allows hardware costs to be decreased by reducing the accuracy of computation for applications that are robust against such errors. In Boolean logic approximation, the number of terms and literals of a logic function can be reduced by allowing erroneous outputs for some input combinations. This paper proposes a novel methodology for the approximation of multi-output logic functions. Related work on multi-output logic approximation minimizes each output function separately. In this paper, we show that a huge optimization potential is thereby lost. As a remedy, our methodology considers the effect on all output functions when introducing errors, thus exploiting the cross-function minimization potential. Moreover, our approach is integrated into a design space exploration technique to obtain not only a single solution but a Pareto set of designs with different trade-offs between hardware costs (terms and literals) and error (number of minterms that have been falsified). Experimental results show that our technique is very efficient in exploring Pareto-optimal fronts. For some benchmarks, the number of terms could be reduced from an accurate function implementation by up to 15% and the number of literals by up to 19%, with degrees of inaccuracy around 0.1% w.r.t. accurate designs. Moreover, we show that the Pareto fronts obtained by our methodology dominate the results obtained when applying related work.

3DICT: a reliable and QoS capable mobile process-in-memory architecture for lookup-based CNNs in 3D XPoint ReRAMs

  • Qian Lou
  • Wujie Wen
  • Lei Jiang

It is extremely challenging to deploy computing-intensive convolutional neural networks (CNNs) with rich parameters on mobile devices because of their limited computing resources and low power budgets. Although prior works build fast and energy-efficient CNN accelerators by greatly sacrificing test accuracy, mobile devices have to guarantee high CNN test accuracy for critical applications, e.g., unlocking phones by face recognition. In this paper, we propose a 3D XPoint ReRAM-based process-in-memory architecture, 3DICT, to provide various test accuracies to applications with different priorities by lookup-based CNN tests that dynamically exploit the trade-off between test accuracy and latency. Compared to state-of-the-art accelerators, on average, 3DICT improves the CNN test performance per Watt by 13% to 61X and guarantees 9-year endurance under various CNN test accuracy requirements.

Aliens: a novel hybrid architecture for resistive random-access memory

  • Bing Wu
  • Dan Feng
  • Wei Tong
  • Jingning Liu
  • Shuai Li
  • Mingshun Yang
  • Chengning Wang
  • Yang Zhang

Passive crossbar arrays of resistive random-access memory (RRAM) have shown great potential to meet the demands of future memory. By eliminating the per-cell transistor, the crossbar array achieves a higher memory density but introduces sneak currents, which incur extra energy waste and reliability issues. The complementary resistive switch (CRS), consisting of two anti-serially stacked memristors, is considered a promising solution to the sneak current problem. However, the destructive read of the CRS results in an additional recovery write operation, which strongly restricts its further adoption. Exploiting the dual CRS/memristor mode of CRS devices, we propose Aliens, a novel hybrid architecture for resistive random-access memory which introduces one alien cell (memristor mode) for each wordline in the crossbar to provide a practical hybrid memory without operating system intervention. Aliens draws advantages from both modes: the restrained sneak current of CRS mode and the non-destructive read of memristor mode. The simple and regular cell mode organization of Aliens enables an energy-saving read method and an effective mode switching strategy called Lazy-Switch. By exploiting memory access locality, Lazy-Switch delays and merges the recovery write operations of the CRS mode. Due to fewer recovery write operations and negligible sneak currents, Aliens achieves improvements in energy, overall endurance, and access performance. The experimental results show that our design offers average energy savings of 13.9X compared with memristor-only memory, a memory lifetime 5.3X longer than CRS-only memory, and competitive performance compared with memristor-only memory.

FELIX: fast and energy-efficient logic in memory

  • Saransh Gupta
  • Mohsen Imani
  • Tajana Rosing

The Internet of Things (IoT) has led to the emergence of big data. Processing this amount of data poses a challenge for current computing systems. Processing in-memory (PIM) enables in-place computation, which reduces data movement, a major latency bottleneck in conventional systems. In this paper, we propose an in-memory implementation of fast and energy-efficient logic (FELIX) which combines the functionality of PIM with memories. To the best of the authors' knowledge, FELIX is the first PIM logic to enable single-cycle NOR, NOT, NAND, minority, and OR directly in crossbar memory. We exploit voltage threshold-based memristors to enable single-cycle operations. Execution is purely in-memory, neither reading out data nor changing sense amplifiers, while preserving data in memory. We extend these single-cycle operations to implement more complex functions like XOR and addition in memory with 2X lower latency than the fastest published PIM technique. We also increase the amount of in-memory parallelism in our design by segmenting bitlines using switches. To evaluate the efficiency of our design at the system level, we design a FELIX-based HyperDimensional (HD) computing accelerator. Our evaluation shows that for all applications tested using HD, FELIX provides on average 128.8X speedup and 5,589.3X lower energy consumption compared to an AMD GPU. FELIX HD also achieves on average 2.21X higher energy efficiency, 1.86X speedup, and 1.68X less memory compared to the fastest PIM technique.
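
As a quick sanity check of how the single-cycle primitives compose into the more complex functions mentioned above, this snippet builds XOR from OR, NAND, and NOT and verifies it over the full truth table; the three-step decomposition is only illustrative, and the actual crossbar cycle counts are in the paper.

    OR   = lambda a, b: a | b
    NAND = lambda a, b: 1 - (a & b)
    NOT  = lambda a: 1 - a

    def xor(a, b):
        # x XOR y == (x OR y) AND NOT(x AND y), built from the listed primitives
        return NOT(NAND(OR(a, b), NAND(a, b)))

    assert all(xor(a, b) == a ^ b for a in (0, 1) for b in (0, 1))
    print("XOR composes from OR/NAND/NOT")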

DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs

  • Xiaofan Zhang
  • Junsong Wang
  • Chao Zhu
  • Yonghua Lin
  • Jinjun Xiong
  • Wen-mei Hwu
  • Deming Chen

Building a high-performance FPGA accelerator for Deep Neural Networks (DNNs) often requires RTL programming, hardware verification, and precise resource allocation, all of which can be time-consuming and challenging even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder for automatically building high-performance DNN hardware accelerators on FPGAs. A number of novel techniques, including high-quality RTL neural network components, a fine-grained layer-based pipeline architecture, and a column-based cache scheme, are developed to meet the throughput and latency requirements of both cloud and edge devices by boosting throughput, reducing latency, and saving FPGA on-chip memory. To address the limited-resource challenge, we design an automatic design space exploration tool that generates optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity. DNNBuilder is demonstrated on four DNNs (AlexNet, ZF, VGG16, and YOLO) on two FPGAs (XC7Z045 and KU115) corresponding to edge and cloud computing, respectively. The fine-grained layer-based pipeline architecture and the column-based cache scheme contribute 7.7x and 43x reductions in latency and BRAM utilization, respectively, compared to conventional designs. We achieve the best performance (up to 5.15x faster) and efficiency (up to 5.88x more efficient) compared to published FPGA-based classification-oriented DNN accelerators for both edge and cloud computing cases. We reach 4218 GOPS running an object detection DNN, which is, to the best of our knowledge, the highest reported throughput. DNNBuilder can provide millisecond-scale real-time performance for processing HD video input and deliver higher efficiency (up to 4.35x) than GPU-based solutions.

Algorithm-hardware co-design of single shot detector for fast object detection on FPGAs

  • Yufei Ma
  • Tu Zheng
  • Yu Cao
  • Sarma Vrudhula
  • Jae-sun Seo

The rapid improvement in computation capability has made convolutional neural networks (CNNs) a great success in recent years on image classification tasks, which has also spurred the development of object detection algorithms with significantly improved accuracy. However, during the deployment phase, many applications demand low-latency processing of one image under strict power consumption requirements, which reduces the efficiency of GPUs and other general-purpose platforms and brings opportunities for specialized acceleration hardware, e.g., FPGAs, by customizing the digital circuit specifically for the inference algorithm. Therefore, this work proposes to customize the detection algorithm, e.g., SSD, to benefit its hardware implementation with low data precision at the cost of marginal accuracy degradation. The proposed FPGA-based deep learning inference accelerator is demonstrated on two Intel FPGAs for the SSD algorithm, achieving up to 2.18 TOPS throughput and up to 3.3X superior energy efficiency compared to a GPU.

TGPA: tile-grained pipeline architecture for low latency CNN inference

  • Xuechao Wei
  • Yun Liang
  • Xiuhong Li
  • Cody Hao Yu
  • Peng Zhang
  • Jason Cong

FPGAs have been increasingly used in recent years as reconfigurable hardware accelerators for applications leveraging convolutional neural networks (CNNs). Previous designs normally adopt a uniform accelerator architecture that processes all layers of a given CNN model one after another. This homogeneous design methodology usually suffers from dynamic resource underutilization due to the tensor shape diversity of different layers. As a result, designs equipped with heterogeneous accelerators specific to different layers were proposed to resolve this issue. However, existing heterogeneous designs sacrifice latency for throughput by concurrently executing multiple input images on different accelerators. In this paper, we propose an architecture named Tile-Grained Pipeline Architecture (TGPA) for low-latency CNN inference. TGPA adopts a heterogeneous design which supports pipelined execution of multiple tiles within a single input image on multiple heterogeneous accelerators. The accelerators are partitioned onto different FPGA dies to guarantee high frequency. A partition strategy is designed to maximize on-chip resource utilization. Experimental results show that TGPA designs for different CNN models achieve up to 40% performance improvement over homogeneous designs and 3X latency reduction over state-of-the-art designs.

Customized locking of IP blocks on a multi-million-gate SoC

  • Abhrajit Sengupta
  • Mohammed Nabeel
  • Mohammed Ashraf
  • Ozgur Sinanoglu

Reliance on off-site untrusted fabrication facilities has given rise to several threats such as intellectual property (IP) piracy, overbuilding, and hardware Trojans. Logic locking is a promising defense technique against such malicious activities that is effected at the silicon layer. Over the past decade, several logic locking defenses and attacks have been presented, thereby enhancing the state of the art. Nevertheless, there has been little research aiming to demonstrate the applicability of logic locking to large-scale multi-million-gate industrial designs consisting of multiple IP blocks with different security requirements. In this work, we take on this challenge and successfully lock a multi-million-gate system-on-chip (SoC) provided by DARPA, taking it all the way to GDSII layout. We analyze how the specific features, constraints, and security requirements of an IP block can be leveraged to lock its functionality in the most appropriate way. We show that the blocks of an SoC can be locked in a customized manner at 0.5%, 15.3%, and 1.5% chip-level overhead in power, performance, and area, respectively.

Dynamic resource management for heterogeneous many-cores

  • Jörg Henkel
  • Jürgen Teich
  • Stefan Wildermann
  • Hussam Amrouch

With the advent of many-core systems, use cases of embedded systems have become more dynamic: plenty of applications are executed concurrently, but they may be dynamically exchanged and modified even after deployment. Moreover, resources may temporarily or permanently become unavailable because of thermal aspects, dynamic power management, or the occurrence of faults. This poses new challenges for reaching objectives like timeliness for real-time program execution, performance for best-effort program execution, and maximal system utilization. In this work, we first focus on dynamic management schemes for reliability/aging optimization under thermal constraints. The reliability of on-chip systems in current and upcoming technology nodes continuously degrades with every new generation because transistor scaling is approaching its fundamental limits. Protecting systems against degradation effects such as circuit aging comes with considerable losses in efficiency. We demonstrate why sustaining reliability while maximizing the utilization of available resources, and hence avoiding efficiency loss, is quite challenging; this holds even more when thermal constraints come into play. Then, we discuss techniques for run-time management of multiple applications that sustain real-time properties. Our solution relies on hybrid application mapping, denoting the combination of design-time analysis with run-time application mapping. We present a method for Real-time Mapping Reconfiguration (RMR) which enables the Run-Time Manager (RM) to execute real-time applications even in the presence of dynamic thermal- and reliability-aware resource management.

This paper is part of the ICCAD 2018 Special Session on "Managing Heterogeneous Many-cores for High-Performance and Energy-Efficiency". The other two papers of this special session are [1] and [2].

Online learning for adaptive optimization of heterogeneous SoCs

  • Ganapati Bhat
  • Sumit K. Mandal
  • Ujjwal Gupta
  • Umit Y. Ogras

The energy efficiency and performance of heterogeneous multiprocessor systems-on-chip (SoCs) depend critically on utilizing a diverse set of processing elements and managing their power states dynamically. Dynamic resource management techniques typically rely on power consumption and performance models to assess the impact of dynamic decisions. Despite the importance of these decisions, many existing approaches rely on fixed power and performance models learned offline. This paper presents an online learning framework to construct adaptive analytical models. We illustrate this framework for modeling GPU frame processing time, GPU power consumption, and SoC power-temperature dynamics. Experiments on Intel Atom E3826, Qualcomm Snapdragon 810, and Samsung Exynos 5422 SoCs demonstrate that the proposed approach achieves less than 6% error under dynamically varying workloads.
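
One concrete way to realize such online model construction is recursive least squares, sketched below with hypothetical features (e.g., frequency, utilization, temperature); the paper's exact update rule may differ, so treat this as an assumption-laden illustration of the idea.

    import numpy as np

    class RLS:
        """Recursive least squares with a forgetting factor for drift tracking."""
        def __init__(self, n, lam=0.98):
            self.w = np.zeros(n)          # linear model coefficients
            self.P = np.eye(n) * 1e3      # inverse correlation matrix
            self.lam = lam                # forgetting factor (< 1 forgets old data)

        def update(self, x, y):
            x = np.asarray(x, dtype=float)
            k = self.P @ x / (self.lam + x @ self.P @ x)   # gain vector
            self.w = self.w + k * (y - self.w @ x)         # correct by prediction error
            self.P = (self.P - np.outer(k, x @ self.P)) / self.lam

    model = RLS(3)
    model.update([1.0, 0.4, 0.7], 2.1)   # features could be freq, utilization, temp
    print(model.w @ [1.0, 0.4, 0.7])     # refreshed power/latency prediction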

Hybrid on-chip communication architectures for heterogeneous manycore systems

  • Biresh Kumar Joardar
  • Janardhan Rao Doppa
  • Partha Pratim Pande
  • Diana Marculescu
  • Radu Marculescu

The widespread adoption of big data has led to the search for high-performance and low-power computational platforms. Emerging heterogeneous manycore processing platforms consisting of CPU and GPU cores along with various types of accelerators offer power and area-efficient trade-offs for running these applications. However, heterogeneous manycore architectures need to satisfy the communication and memory requirements of the diverse computing elements that conventional Network-on-Chip (NoC) architectures are unable to handle effectively. Further, with increasing system sizes and level of heterogeneity, it becomes difficult to quickly explore the large design space and establish the appropriate design trade-offs. To address these challenges, machine learning-inspired heterogeneous manycore system design is a promising research direction to pursue. In this paper, we highlight various salient features of heterogeneous manycore architectures enabled by emerging interconnect technologies and machine learning techniques.

A practical detailed placement algorithm under multi-cell spacing constraints

  • Yu-Hsiang Cheng
  • Ding-Wei Huang
  • Wai-Kei Mak
  • Ting-Chi Wang

Multi-cell spacing constraints arise due to aggressive scaling and manufacturing issues. For example, multi-cell spacing constraints can be incorporated to address the pin accessibility problem in sub-10nm nodes. This work studies detailed placement considering multi-cell spacing constraints. A naive approach is to model each multi-cell spacing constraint as a set of 2-cell spacing constraints, but the resulting total cell displacement would be much larger than necessary. Thus, we aim to tackle this problem and propose a practical multi-cell method: we first analyze the initial layout to determine which cell pair in each multi-cell spacing constraint is the easiest to break apart. Second, we apply a single-row dynamic programming (SRDP)-based method called Intra-Row Move (IRM), one row at a time, to resolve a majority of violations while minimizing the total cell displacement or wirelength increase. With cell virtualization and movable-region computation techniques, IRM can easily be extended to handle mixed cell-height designs with only a slight modification of the cost computation in the SRDP method. Finally, we apply an integer linear programming-based method called Global Move (GM) to resolve the remaining violations. Experimental results indicate that our multi-cell method is much better than a 2-cell method in both solution quality and runtime.

Mixed-cell-height placement considering drain-to-drain abutment

  • Yu-Wei Tseng
  • Yao-Wen Chang

Along with device scaling, the drain-to-drain abutment (DDA) constraint arises as an emerging challenge in modern circuit designs, incurring additional difficulties especially for designs with mixed-cell-height standard cells, which have prevailed in advanced technologies. This paper presents the first work to address the mixed-cell-height placement problem considering the DDA constraint from post global placement through detailed placement. Our algorithm consists of three major stages: (1) DDA-aware preprocessing, (2) legalization, and (3) detailed placement. In the DDA-aware preprocessing stage, we first align cells to desired rows, considering the distribution ratio of source nodes to drain nodes. After deciding the cell ordering of every row, we adopt the modulus-based matrix splitting iteration method to remove all cell overlaps with minimum total displacement in the legalization stage. For detailed placement, we propose a satisfiability-based approach which considers the whole layout to flip a subset of cells and swap pairs of adjacent cells simultaneously. Compared with a shortest-path method, experimental results show that our proposed algorithm can significantly reduce cell violations and displacements with reasonable runtime.

Mixed-cell-height legalization considering technology and region constraints

  • Ziran Zhu
  • Xingquan Li
  • Yuhang Chen
  • Jianli Chen
  • Wenxing Zhu
  • Yao-Wen Chang

Mixed-cell-height circuits have become popular in advanced technologies for better power, area, routability, and performance trade-offs. With the technology and region constraints imposed by modern circuit designs, the mixed-cell-height legalization problem has become more challenging. In this paper, we present an effective and efficient legalization algorithm for mixed-cell-height circuit designs with technology and region constraints. We first present a fence region handling technique to unify the fence regions and the default ones. To obtain a desired cell assignment, we then propose a movement-aware cell reassignment method by iteratively reassigning cells in locally dense areas to their desired rows. After cell reassignment, a technology-aware legalization is presented to remove cell overlaps while satisfying the technology constraints. Finally, we propose a technology-aware refinement to further reduce the average and maximum cell movements without increasing the technology constraints violations. Compared with the champion of the 2017 ICCAD CAD Contest and the state-of-the-art work, experimental results show that our algorithm achieves the best average and maximum cell movements and significantly fewer technology constraint violations, in a comparable runtime.

Mixed-cell-height placement with complex minimum-implant-area constraints

  • Jianli Chen
  • Peng Yang
  • Xingquan Li
  • Wenxing Zhu
  • Yao-Wen Chang

Mixed-cell-height standard cells are prevalent in advanced technologies for achieving better design trade-offs among timing, power, and routability. As feature size decreases, placement of cells with multiple threshold voltages may violate the complex minimum-implant-area (MIA) layer rule arising from the limitations of patterning technologies. Existing works consider the mixed-cell-height placement problem only during legalization, or handle the MIA constraints only during detailed placement. In this paper, we address the mixed-cell-height placement problem with MIA constraints in two major stages: post global placement and MIA-aware legalization. In the post global placement stage, we first present a continuous and differentiable cost function to address the Vdd/Vss alignment constraints, and dynamically add weighted pseudo nets to cells with MIA violations. Then, we propose a proximal optimization method based on the given global placement result to simultaneously consider Vdd/Vss alignment constraints, MIA constraints, cell distribution, cell displacement, and total wirelength. In the MIA-aware legalization stage, we develop a graph-based method to cluster cells of specific threshold voltages, and apply a strip-packing-based binary linear programming formulation to reshape cells. Then, we propose a matching-based technique to resolve intra-row MIA violations and reduce filler insertion. Furthermore, we formulate inter-row MIA-aware legalization as a quadratic programming problem, which is efficiently solved by a modulus-based matrix splitting iteration method. Finally, MIA-aware cell allocation and refinement are performed to further improve the result. Experimental results show that, without any extra area overhead, our algorithm can still achieve 8.5% shorter final total wirelength than the state-of-the-art work.

RAPID: read acceleration for improved performance and endurance in MLC/TLC NVMs

  • Poovaiah M. Palangappa
  • Kartik Mohanram

RAPID is a low-overhead critical-word-first read acceleration architecture for improved performance and endurance in MLC/TLC non-volatile memories (NVMs). RAPID encodes the critical words in a cache line using only the most significant bits (MSbs) of the MLC/TLC NVM cells. Since the MSbs of an NVM cell can be decoded using a single read strobe, the data (i.e., critical words) encoded using the MSbs can be decoded with low latency. System-level SPEC CPU2006 workload evaluations of a TLC RRAM architecture show that RAPID improves read latency by 21%, energy by 24%, and endurance by 2-4x over state-of-the-art striped NVM.
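
The intuition is easy to see in a few lines: if the critical word lives in the most-significant bit of each TLC cell, a single MSb-only sense recovers it without resolving the remaining levels. The cell contents below are made-up values for illustration.

    cells = [0b101, 0b011, 0b110, 0b001]         # four TLC cells, 3 bits each
    critical = [(c >> 2) & 1 for c in cells]     # MSb-only read: one strobe per cell
    print(critical)                              # -> [1, 0, 1, 0]: the critical word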

Sneak path free reconfiguration of via-switch crossbars based FPGA

  • Ryutaro Doi
  • Jaehoon Yu
  • Masanori Hashimoto

FPGAs that utilize via-switches, a kind of nonvolatile resistive RAM, for crossbar implementation are attracting attention due to higher integration density and performance. However, programming via-switches arbitrarily in a crossbar is not trivial, since a programming current must be provided through signal wires that are shared by multiple via-switches. Consequently, depending on the previous programming status in sequential programming, unintentional switch programming may occur due to signal detours, which is called the sneak path problem. This problem interferes with the reconfiguration of via-switch FPGAs, and hence countermeasures for the sneak path problem are indispensable. This paper identifies the circuit status that causes the sneak path problem and proposes a sneak path avoidance method that gives a sneak-path-free programming order of via-switches in a crossbar. We prove that a sneak-path-free programming order necessarily exists for arbitrary on-off patterns in a crossbar as long as no loops exist, and we validate the proof and the proposed method with simulation-based evaluation. Thanks to the proposed method, any practical configuration of a via-switch FPGA can be successfully programmed without the sneak path problem.

Mixed size crossbar based RRAM CNN accelerator with overlapped mapping method

  • Zhenhua Zhu
  • Jilan Lin
  • Ming Cheng
  • Lixue Xia
  • Hanbo Sun
  • Xiaoming Chen
  • Yu Wang
  • Huazhong Yang

Convolutional Neural Networks (CNNs) play a vital role in machine learning. CNNs are typically both computing and memory intensive. Emerging resistive random-access memories (RRAMs) and RRAM crossbars have demonstrated great potential in boosting the performance and energy efficiency of CNNs. Compared with small crossbars, large crossbars show better energy efficiency with less interface overhead. However, conventional workload mapping methods for small crossbars cannot make full use of the computation ability of large crossbars. In this paper, we propose an Overlapped Mapping Method (OMM) and MISCA, a MIxed Size Crossbar based RRAM CNN Accelerator, to solve this problem. MISCA with OMM can reduce the energy consumption caused by the interface circuits and improve the parallelism of computation by leveraging the idle RRAM cells in crossbars. Simulation results show that MISCA with OMM achieves 2.7x speedup, 30% higher utilization rate, and 1.2x better energy efficiency on average compared with a fixed-size-crossbar-based accelerator using the conventional mapping method. In comparison with a GPU platform, MISCA with OMM achieves on average 490.4x higher energy efficiency and 20x speedup. Compared with PRIME, an existing RRAM-based accelerator, MISCA achieves 26.4x speedup and 1.65x energy efficiency improvement.

Enhancing the solution quality of hardware ising-model solver via parallel tempering

  • Hidenori Gyoten
  • Masayuki Hiromoto
  • Takashi Sato

We propose an efficient Ising processor with approximated parallel tempering (IPAPT) implemented on an FPGA. Hardware-friendly approximations of the components of parallel tempering (PT) are proposed to enhance solution quality with low hardware overhead. Multiple replicas of Ising states at different temperatures run in parallel by sharing a single network structure, and the replicas are exchanged based on an approximated energy evaluation. The application of PT substantially improves the quality of optimization solutions. Experimental results on various max-cut problems show that PT significantly increases the probability of obtaining optimal solutions, and IPAPT obtains optimal solutions two orders of magnitude faster than a software solver.
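
Below is a minimal software rendition of parallel tempering on a random Ising instance: each replica performs Metropolis updates at its own temperature, and neighboring temperatures swap states with the standard exchange acceptance probability. IPAPT approximates these components in hardware; this sketch shows only the exact baseline, with made-up problem sizes.

    import math, random

    N = 16                                      # spins per replica (illustrative)
    T = [0.5, 1.0, 2.0, 4.0]                    # temperature ladder
    J = [[0] * N for _ in range(N)]             # random symmetric couplings
    for i in range(N):
        for j in range(i + 1, N):
            J[i][j] = J[j][i] = random.choice((-1, 1))
    reps = [[random.choice((-1, 1)) for _ in range(N)] for _ in T]

    def energy(s):
        return -sum(J[i][j] * s[i] * s[j] for i in range(N) for j in range(i + 1, N))

    for step in range(200):
        for s, t in zip(reps, T):               # one Metropolis update per replica
            i = random.randrange(N)
            dE = 2 * s[i] * sum(J[i][j] * s[j] for j in range(N))
            if dE <= 0 or random.random() < math.exp(-dE / t):
                s[i] = -s[i]
        for r in range(len(T) - 1):             # attempt neighbor exchange
            d = (1 / T[r] - 1 / T[r + 1]) * (energy(reps[r]) - energy(reps[r + 1]))
            if d >= 0 or random.random() < math.exp(d):
                reps[r], reps[r + 1] = reps[r + 1], reps[r]

    print("best energy found:", min(energy(s) for s in reps))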

Defensive dropout for hardening deep neural networks under adversarial attacks

  • Siyue Wang
  • Xiao Wang
  • Pu Zhao
  • Wujie Wen
  • David Kaeli
  • Peter Chin
  • Xue Lin

Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks: adversarial examples, obtained by adding delicately crafted distortions to original legal inputs, can mislead a DNN into classifying them as any target label. This work provides a solution for hardening DNNs under adversarial attacks through defensive dropout. Besides using dropout during training for the best test accuracy, we propose using dropout at test time as well to achieve strong defense effects. We consider the problem of building robust DNNs as an attacker-defender two-player game, where the attacker and the defender know each other's strategies and try to optimize their own strategies towards an equilibrium. Based on observations of the effect of the test dropout rate on test accuracy and attack success rate, we propose a defensive dropout algorithm that determines an optimal test dropout rate given the neural network model and the attacker's strategy for generating adversarial examples. We also investigate the mechanism behind the outstanding defense effects achieved by the proposed defensive dropout. Compared with stochastic activation pruning (SAP), another defense method that introduces randomness into the DNN model, we find that our defensive dropout achieves much larger variances of the gradients, which is the key to the improved defense effects (much lower attack success rate). For example, our defensive dropout can reduce the attack success rate from 100% to 13.89% under the currently strongest attack, i.e., the C&W attack, on the MNIST dataset.
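
The core trick is a one-liner in most frameworks: keep dropout active at inference so the gradients an attacker estimates vary from query to query. A Keras sketch follows; the dropout rate 0.3 and the layer sizes are arbitrary placeholders, whereas the paper tunes the rate against the attacker's strategy.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),    # test-time dropout rate (placeholder value)
        tf.keras.layers.Dense(10),
    ])

    x = tf.random.uniform((1, 784))
    logits = model(x, training=True)     # training=True keeps dropout on at inference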

Online human activity recognition using low-power wearable devices

  • Ganapati Bhat
  • Ranadeep Deb
  • Vatika Vardhan Chaurasia
  • Holly Shill
  • Umit Y. Ogras

Human activity recognition (HAR) has attracted significant research interest due to its applications in health monitoring and patient rehabilitation. Recent research on HAR focuses on using smartphones due to their widespread use. However, this leads to inconvenient use, a limited choice of sensors, and inefficient use of resources, since smartphones are not designed for HAR. This paper presents the first HAR framework that can perform both online training and inference. The proposed framework starts with a novel technique that generates features using the fast Fourier and discrete wavelet transforms of data from a textile-based stretch sensor and an accelerometer. Using these features, we design a neural network classifier which is trained online using the policy gradient algorithm. Experiments on a low-power IoT device (TI-CC2650 MCU) with nine users show 97.7% accuracy in identifying six activities and their transitions with less than 12.5 mW power consumption.
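
A sketch of the feature-generation step with synthetic signals: FFT magnitudes of the stretch-sensor window plus single-level Haar DWT coefficients of the accelerometer window. The window length and the number of retained coefficients are illustrative choices, not the paper's.

    import numpy as np
    import pywt   # PyWavelets

    stretch = np.random.rand(128)       # one window of stretch-sensor samples
    accel = np.random.rand(128)         # one axis of accelerometer samples

    fft_feats = np.abs(np.fft.rfft(stretch))[:8]        # leading FFT magnitudes
    approx, detail = pywt.dwt(accel, "haar")            # single-level Haar DWT
    features = np.concatenate([fft_feats, approx[:8]])  # input vector to the classifier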

Shadow attacks on MEDA biochips

  • Mohammed Shayan
  • Sukanta Bhattacharjee
  • Tung-Che Liang
  • Jack Tang
  • Krishnendu Chakrabarty
  • Ramesh Karri

The Micro-electrode-dot-array (MEDA) is a next-generation digital microfluidic biochip (DMFB) platform that supports fine-grained control and real-time sensing of droplet movements. These capabilities permit continuous monitoring and checkpoint-based validation of assay execution on MEDA. This paper presents a class of “shadow attacks” that abuse the timing slack in the assay execution. State-of-the-art checkpoint-based validation techniques cannot expose the shadow operations. We develop a defense that introduces extra checkpoints in the assay execution at time instances when the assay is prone to shadow attacks. Experiments confirm the effectiveness and practicality of the defense.

LeapChain: efficient blockchain verification for embedded IoT

  • Emanuel Regnath
  • Sebastian Steinhorst

Blockchain provides decentralized consensus in large, open networks without a trusted authority, making it a promising solution for the Internet of Things (IoT) to distribute verifiable data, such as firmware updates. However, verifying data integrity and consensus on a linearly growing blockchain quickly exceeds memory and processing capabilities of embedded systems.

As a remedy, we propose a generic blockchain extension that enables highly constrained devices to verify the inclusion and integrity of any block within a blockchain. Instead of traversing block by block, we construct a LeapChain that reduces verification steps without weakening the integrity guarantees of the blockchain. Applied to Proof-of-Work blockchains, our scheme can be used to verify consensus by proving a certain amount of work on top of a block.

Our analytical and experimental results show that, compared to existing approaches, only LeapChain provides deterministic and tight upper bounds on the memory requirements in the kilobyte range, significantly extending the possibilities of blockchain application on embedded IoT devices.

Robust object estimation using generative-discriminative inference for secure robotics applications

  • Yanqi Liu
  • Alessandro Costantini
  • R. Iris Bahar
  • Zhiqiang Sui
  • Zhefan Ye
  • Shiyang Lu
  • Odest Chadwicke Jenkins

Convolutional neural networks (CNNs) are of increasing widespread use in robotics, especially for object recognition. However, such CNNs still lack several critical properties necessary for robots to properly perceive and function autonomously in uncertain, and potentially adversarial, environments. In this paper, we investigate factors for accurate, reliable, and resource-efficient object and pose recognition suitable for robotic manipulation in adversarial clutter. Our exploration is in the context of a three-stage pipeline of discriminative CNN-based recognition, generative probabilistic estimation, and robot manipulation. This pipeline proposes using a SAmpling Network Density filter, or SAND filter, to recover from potentially erroneous decisions produced by a CNN through generative probabilistic inference. We present experimental results from SAND filter perception for robotic manipulation in tabletop scenes with both benign and adversarial clutter. These experiments vary CNN model complexity for object recognition and evaluate levels of inaccuracy that can be recovered by generative pose inference. This scenario is extended to consider adversarial environmental modifications with varied lighting, occlusions, and surface modifications.

Efficient utilization of adversarial training towards robust machine learners and its analysis

  • Sai Manoj P D
  • Sairaj Amberkar
  • Setareh Rafatirad
  • Houman Homayoun

Advancements in machine learning have led to its adoption in numerous applications, ranging from computer vision to security. Despite these advancements, the vulnerabilities of machine learning techniques are being exploited as well. Adversarial samples are samples generated by adding crafted perturbations to normal input samples. This work presents an overview of different techniques to generate adversarial samples and of defenses to make classifiers robust. Furthermore, adversarial learning and its effective utilization to enhance robustness, along with the required constraints, are demonstrated experimentally, achieving up to 97.65% accuracy even against the C&W attack. Though adversarial learning enhances robustness, we show in this work that it can still be exploited for vulnerabilities.

Majority logic synthesis

  • Luca Amarù
  • Eleonora Testa
  • Miguel Couceiro
  • Odysseas Zografos
  • Giovanni De Micheli
  • Mathias Soeken

The majority function <xyz> evaluates to true if at least two of its Boolean inputs evaluate to true. The majority function has frequently been studied as a central primitive in logic synthesis applications for many decades. Knuth refers to the majority function in the last volume of his seminal The Art of Computer Programming as "probably the most important ternary operation in the entire universe." Majority logic synthesis has recently regained significant interest in the design automation community due to emerging nanotechnologies which operate based on the majority function. In addition, majority logic synthesis has successfully been employed in CMOS-based applications such as standard cell or FPGA mapping.

This tutorial gives a broad introduction into the field of majority logic synthesis. It will review fundamental results and describe recent contributions from theory, practice, and applications.
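
For concreteness, the three-input majority operator has a simple Boolean expansion and subsumes AND and OR as special cases, which is one reason it serves so well as a synthesis primitive:

```latex
% Majority of three Boolean inputs and its standard expansion:
\langle xyz \rangle = xy \lor yz \lor xz
% AND and OR arise by fixing one input:
\langle xy0 \rangle = x \land y, \qquad \langle xy1 \rangle = x \lor y
```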

RouteNet: routability prediction for mixed-size designs using convolutional neural network

  • Zhiyao Xie
  • Yu-Hung Huang
  • Guan-Qi Fang
  • Haoxing Ren
  • Shao-Yun Fang
  • Yiran Chen

Early routability prediction helps designers and tools perform preventive measures so that design rule violations can be avoided in a proactive manner. However, it is a huge challenge to have a predictor that is both accurate and fast. In this work, we study how to leverage a convolutional neural network to address this challenge. The proposed method, called RouteNet, can either evaluate the overall routability of cell placement solutions without global routing or predict the locations of DRC (Design Rule Checking) hotspots. In both cases, large macros in mixed-size designs are taken into consideration. Experiments on benchmark circuits show that RouteNet can forecast overall routability with accuracy similar to that of a global router while using substantially less runtime. For DRC hotspot prediction, RouteNet improves accuracy by 50% compared to global routing. It also significantly outperforms other machine learning approaches such as support vector machines and logistic regression.
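
As a rough illustration of the model class involved, a fully convolutional network can map per-tile layout features to a per-tile hotspot score; the architecture and feature planes below are hypothetical stand-ins, not RouteNet's actual design.

```python
# A minimal fully convolutional sketch for DRC-hotspot map prediction
# (hypothetical architecture; input planes might be pin density, macro
# mask, and a congestion estimate).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1), nn.Sigmoid(),  # per-tile hotspot score
)

x = torch.randn(1, 3, 64, 64)   # one placement, 3 feature planes, 64x64 tiles
print(model(x).shape)           # torch.Size([1, 1, 64, 64])
```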

TritonRoute: an initial detailed router for advanced VLSI technologies

  • Andrew B. Kahng
  • Lutong Wang
  • Bangqi Xu

Detailed routing is a dead-or-alive critical element in design automation tooling for advanced node enablement. However, very few works address detailed routing in the recent open literature, particularly in the context of modern industrial designs and a complete, end-to-end flow. The ISPD-2018 Initial Detailed Routing Contest addressed this gap for modern industrial designs, using a reduced set of design rules. In this work, we present TritonRoute, an initial detailed router for the ISPD-2018 contest. Given route guides from global routing, the initial detailed routing stage should generate a detailed routing solution honoring the route guides as much as possible, while minimizing wirelength, via count and various design rule violations. In our work, the key contribution is intra-layer parallel routing, where we partition each layer into parallel panels and route each panel using an Integer Linear Programming-based algorithm. We sequentially route layer by layer from the bottom to the top. We evaluate our router using the official ISPD-2018 benchmark suite and show that we reduce the contest metric by up to 74%, and on average 50%, compared to the first-place routing solution for each testcase.
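
To convey the flavor of ILP-based panel routing, the toy formulation below assigns each net in a panel to one track while forbidding overlapping nets from sharing a track; this is a hypothetical miniature, not TritonRoute's actual model, and PuLP is used only for illustration.

```python
# Toy track-assignment ILP for one routing panel (hypothetical formulation).
import pulp

nets = {"a": (0, 4), "b": (2, 6), "c": (5, 8)}    # net -> horizontal span
tracks = [0, 1]
overlapping = [("a", "b"), ("b", "c")]            # spans that intersect

x = pulp.LpVariable.dicts("x", (nets, tracks), cat="Binary")
prob = pulp.LpProblem("panel_routing", pulp.LpMinimize)
prob += pulp.lpSum(t * x[n][t] for n in nets for t in tracks)  # proxy cost
for n in nets:                                    # each net gets exactly one track
    prob += pulp.lpSum(x[n][t] for t in tracks) == 1
for (m, n) in overlapping:                        # no overlap on a shared track
    for t in tracks:
        prob += x[m][t] + x[n][t] <= 1
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({n: next(t for t in tracks if x[n][t].value() == 1) for n in nets})
```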

A multithreaded initial detailed routing algorithm considering global routing guides

  • Fan-Keng Sun
  • Hao Chen
  • Ching-Yu Chen
  • Chen-Hao Hsu
  • Yao-Wen Chang

Detailed routing is the most complicated and time-consuming stage in VLSI design and has become a critical process for advanced node enablement. To handle the high complexity of modern detailed routing, initial detailed routing is often employed to minimize design-rule violations and facilitate final detailed routing, even though the result is still not violation-free after initial routing. This paper presents a novel initial detailed routing algorithm that considers industrial design-rule constraints and optimizes the total wirelength and via count. Our algorithm consists of three major stages: (1) an effective pin-access point generation method to identify valid points to model a complex pin shape, (2) a via-aware track assignment method to minimize the overlaps between assigned wire segments, and (3) a detailed routing algorithm with a novel negotiation-based rip-up and re-route scheme that enables multithreading and honors global routing information while minimizing design-rule violations. Experimental results show that our router outperforms all the winning teams of the 2018 ACM ISPD Initial Detailed Routing Contest, where the top-3 routers result in 23%, 52%, and 1224% higher costs than ours.

Extending ML-OARSMT to net open locator with efficient and effective Boolean operations

  • Bing-Hui Jiang
  • Hung-Ming Chen

The multi-layer obstacle-avoiding rectilinear Steiner minimal tree (ML-OARSMT) problem has been extensively studied in recent years. In this work, we consider a variant of the ML-OARSMT problem and extend its applicability to a net open location finder. Since ECOs or router limitations may cause open nets, we develop a framework to detect and reconnect existing nets to resolve the net opens. Different from prior connection-graph-based approaches, we propose a technique that applies efficient Boolean operations to repair net opens. Our method has good quality and scalability and is highly parallelizable. Compared with the results of the ICCAD-2017 contest, we show that our proposed algorithm achieves the smallest cost, with an average speedup of 4.81× over the top-3 winners.

Logic synthesis of binarized neural networks for efficient circuit implementation

  • Chia-Chih Chi
  • Jie-Hong R. Jiang

Neural networks (NNs) are key to deep learning systems. Their efficient hardware implementation is crucial to applications at the edge. Binarized NNs (BNNs), where the weights and output of a neuron are of binary values {-1, +1} (or encoded in {0,1}), have been proposed recently. As no multiplier is required, they are particularly attractive and suitable for hardware realization. Most prior NN synthesis methods target hardware architectures with neural processing elements (NPEs), where the weights of a neuron are loaded and the output of the neuron is computed. The load-and-compute method, though area efficient, requires expensive memory access, which deteriorates energy and performance efficiency. In this work we aim at synthesizing BNN dense layers into dedicated logic circuits. We formulate the corresponding matrix covering problem and propose a scalable algorithm to reduce the area and routing cost of BNNs. Experimental results justify the effectiveness of the method in terms of area and net savings on FPGA implementation. Our method provides an alternative implementation of BNNs, and can be applied in combination with NPE-based implementations for area, speed, and power tradeoffs.
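
The hardware appeal of BNNs comes from the fact that a {-1, +1} dot product reduces to XNOR and popcount, which is what makes multiplier-free logic implementations possible; the bit-packed sketch below is a generic illustration of that arithmetic, not the paper's synthesis algorithm.

```python
# XNOR-popcount dot product for {-1,+1} vectors packed into integers
# (bit value 1 encodes +1, bit value 0 encodes -1).
def bnn_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element binarized vectors."""
    matches = bin(~(a_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n   # each match is +1, each mismatch is -1

# a = [+1, -1, +1], w = [+1, +1, -1] (LSB first) -> 1 - 1 - 1 = -1
assert bnn_dot(0b101, 0b011, 3) == -1
```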

Canonicalization of threshold logic representation and its applications

  • Siang-Yun Lee
  • Nian-Ze Lee
  • Jie-Hong R. Jiang

Threshold logic functions have gained revived attention due to their connection to the neural networks employed in deep learning. Despite prior endeavors in the characterization of threshold logic functions, to the best of our knowledge, the quest for a canonical representation of threshold logic functions in the form of their realizing linear inequalities remains open. In this paper we devise a procedure to canonicalize a threshold logic function such that two threshold logic functions are equivalent if and only if their canonicalized linear inequalities are the same. We further strengthen the canonicity to ensure that symmetric variables of a threshold logic function receive the same weight in the canonicalized linear inequality. The canonicalization procedure invokes O(m) queries to a linear programming (resp. integer linear programming) solver when a linear inequality solution with fractional (resp. integral) weight and threshold values is to be found, where m is the number of symmetry groups of the given threshold logic function. The guaranteed canonicity allows direct application to the classification of NP (input negation, input permutation) and NPN (input negation, input permutation, output negation) equivalence of threshold logic functions. It may thus enable applications such as equivalence checking, Boolean matching, and library construction for threshold circuit synthesis.
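
For reference, a threshold logic function is one realized by a single linear inequality over its inputs; canonicalization then amounts to fixing one representative weight-threshold pair per function:

```latex
% A threshold logic function with weights w_i and threshold T over
% Boolean inputs x_i:
f(x_1,\dots,x_n) = 1 \iff \sum_{i=1}^{n} w_i x_i \ge T
% Canonical form: f \equiv g exactly when their canonicalized (w, T)
% coincide, with symmetric variables forced to share the same weight.
```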

DALS: delay-driven approximate logic synthesis

  • Zhuangzhuang Zhou
  • Yue Yao
  • Shuyang Huang
  • Sanbao Su
  • Chang Meng
  • Weikang Qian

Approximate computing is an emerging paradigm for error-tolerant applications. By introducing a reasonable amount of inaccuracy, both the area and delay of a circuit can be reduced significantly. To synthesize approximate circuits automatically, many approximate logic synthesis (ALS) algorithms have been proposed. However, they mainly focus on area reduction and are not optimal in reducing the delay of the circuits. In this paper, we propose DALS, a delay-driven ALS framework. DALS works on the AND-inverter graph (AIG) representation of a circuit. It supports a wide range of approximate local changes and some commonly-used error metrics, including error rate and mean error distance. In order to select an optimal set of nodes in the AIG to apply approximate local changes, DALS establishes a critical error network (CEN) from the AIG and formulates a maximum flow problem on the CEN. Our experimental results on a wide range of benchmarks show that DALS produces approximate circuits with significantly reduced delays.
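
The cut-based selection idea can be appreciated with a generic max-flow/min-cut toy; the graph below is hypothetical and does not reproduce DALS's actual critical error network construction.

```python
# Generic max-flow/min-cut illustration with networkx: the min cut gives
# a cheapest edge set separating error "sources" from the output, the
# kind of structure a CEN-based node selection exploits.
import networkx as nx

G = nx.DiGraph()
G.add_edge("src", "n1", capacity=3)   # capacities model error costs (toy values)
G.add_edge("src", "n2", capacity=2)
G.add_edge("n1", "out", capacity=1)
G.add_edge("n2", "out", capacity=4)

flow_value, _ = nx.maximum_flow(G, "src", "out")
cut_value, (s_side, t_side) = nx.minimum_cut(G, "src", "out")
print(flow_value, cut_value)          # 3 3, equal by max-flow/min-cut duality
```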

Unlocking fine-grain parallelism for AIG rewriting

  • Vinicius Possani
  • Yi-Shan Lu
  • Alan Mishchenko
  • Keshav Pingali
  • Renato Ribas
  • Andre Reis

Parallel computing is a growing trend for enhancing the scalability of electronic design automation (EDA) tools on widely available multicore platforms. In order to benefit from parallelism, well-known EDA algorithms have to be reformulated and optimized for multicore implementation. This paper introduces a set of principles to enable fine-grain parallel AND-inverter graph (AIG) rewriting. It presents a novel method to discover and rewrite parts of the AIG in parallel, without the need for graph partitioning. Experiments show that, when synthesizing large designs composed of millions of AIG nodes, parallel rewriting on 40 physical cores is up to 36x and 68x faster than the ABC commands rewrite -l and drw, respectively, with comparable quality of results in terms of AIG size and depth.
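
For readers unfamiliar with the underlying data structure, an AIG is a network of two-input AND nodes with optional edge inversions, typically deduplicated by structural hashing; the sketch below shows that structure only and is not ABC's or the paper's implementation.

```python
# Minimal AIG node store with structural hashing (illustrative only).
# A literal is 2*node_id + complement_bit; ANDs store an ordered fanin pair.
class AIG:
    def __init__(self):
        self.nodes = []    # node id -> (lit_a, lit_b)
        self.strash = {}   # canonical fanin pair -> existing node id

    def new_and(self, lit_a, lit_b):
        key = (min(lit_a, lit_b), max(lit_a, lit_b))
        if key in self.strash:        # reuse a structurally identical gate
            return self.strash[key]
        self.nodes.append(key)
        self.strash[key] = len(self.nodes) - 1
        return self.strash[key]

g = AIG()
a, b = 2, 4                           # literals of two primary inputs
assert g.new_and(a, b) == g.new_and(b, a)   # hashed to the same node
```

Concurrent rewriting must keep exactly this kind of shared table consistent, which is where fine-grain parallelization gets delicate.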

High-level synthesis with timing-sensitive information flow enforcement

  • Zhenghong Jiang
  • Steve Dai
  • G. Edward Suh
  • Zhiru Zhang

Specialized hardware accelerators are being increasingly integrated into today’s computer systems to achieve improved performance and energy efficiency. However, the resulting variety and complexity make it challenging to ensure the security of these accelerators. To mitigate complexity while guaranteeing security, we propose a high-level synthesis (HLS) infrastructure that incorporates static information flow analysis to enforce security policies on HLS-generated hardware accelerators. Our security-constrained HLS infrastructure is able to effectively identify both explicit and implicit information leakage. By detecting the security vulnerabilities at the behavioral level, our tool allows designers to address these vulnerabilities at an early stage of the design flow. We further propose a novel synthesis technique in HLS to eliminate timing channels in the generated accelerator. Our approach is able to remove timing channels in a verifiable manner while incurring lower performance overhead for high-security tasks on the accelerator.

Property specific information flow analysis for hardware security verification

  • Wei Hu
  • Armaiti Ardeshiricham
  • Mustafa S Gobulukoglu
  • Xinmu Wang
  • Ryan Kastner

Hardware information flow analysis detects security vulnerabilities resulting from unintended design flaws, timing channels, and hardware Trojans. These information flow models are typically generated in a general way, which includes a significant amount of redundancy that is irrelevant to the specified security properties. In this work, we propose a property specific approach for information flow security. We create information flow models tailored to the properties to be verified by performing a property specific search to identify security critical paths. This helps find suspicious signals that require closer inspection and quickly eliminates portions of the design that are free of security violations. Our property specific trimming technique reduces the complexity of the security model; this accelerates security verification and restricts potential security violations to a smaller region which helps quickly pinpoint hardware security vulnerabilities.

HISA: hardware isolation-based secure architecture for CPU-FPGA embedded systems

  • Mengmei Ye
  • Xianglong Feng
  • Sheng Wei

Heterogeneous CPU-FPGA systems have been shown to achieve significant performance gains in domain-specific computing. However, in contrast to the huge effort invested in performance acceleration, the community has not yet investigated the security consequences of incorporating FPGAs into the traditional CPU-based architecture. In fact, the interplay between CPU and FPGA in such a heterogeneous system may introduce brand new attack surfaces if not well controlled. We propose a hardware isolation-based secure architecture, namely HISA, to mitigate the identified new threats. HISA extends the CPU-based hardware isolation primitive to the heterogeneous FPGA components and achieves security guarantees by enforcing two types of security policies in the isolated secure environment, namely the access control policy and the output verification policy. We evaluate HISA using four reference FPGA IP cores together with a variety of reference security policies targeting representative CPU-FPGA attacks. Our implementation and experiments on real hardware prove that HISA is an effective security complement to the existing CPU-only and FPGA-only secure architectures.

SWAN: mitigating hardware trojans with design ambiguity

  • Timothy Linscott
  • Pete Ehrett
  • Valeria Bertacco
  • Todd Austin

For the past decade, security experts have warned that malicious engineers could modify hardware designs to include hardware backdoors (trojans), which, in turn, could grant attackers full control over a system. Proposed defenses to detect these attacks have been outpaced by the development of increasingly small, but equally dangerous, trojans. To thwart trojan-based attacks, we propose a novel architecture that maps the security-critical portions of a processor design to a one-time programmable, LUT-free fabric. The programmable fabric is automatically generated by analyzing the HDL of targeted modules. We present our tools to generate the fabric and map functionally equivalent designs onto the fabric. By having a trusted party randomly select a mapping and configure each chip, we prevent an attacker from knowing the physical location of targeted signals at manufacturing time. In addition, we provide decoy options (canaries) for the mapping of security-critical signals, such that hardware trojans hitting a decoy are thwarted and exposed. Using this defense approach, any trojan capable of analyzing the entire configurable fabric must employ complex logic functions with a large silicon footprint, thus exposing it to detection by inspection. We evaluated our solution on a RISC-V BOOM processor and demonstrated that, by providing the ability to map each critical signal to 6 distinct locations on the chip, we can reduce the chance of attack success by an undetectable trojan by 99%, incurring only a 27% area overhead.

Security for safety: a path toward building trusted autonomous vehicles

  • Raj Gautam Dutta
  • Feng Yu
  • Teng Zhang
  • Yaodan Hu
  • Yier Jin

Automotive systems have always been designed with safety in mind. In this regard, the functional safety standard ISO 26262 was drafted with the intention of minimizing risk due to random hardware faults or systematic failures in the design of electrical and electronic components of an automobile. However, the growing complexity of modern cars has added another potential point of failure in the form of cyber or sensor attacks. Recently, researchers have demonstrated that vulnerabilities in a vehicle's software or sensing units could enable them to remotely alter the intended operation of the vehicle. As such, in addition to safety, security should be considered an important design goal. However, designing security solutions without consideration of safety objectives could result in potential hazards. Consequently, in this paper we propose the notion of security for safety and show that by integrating safety conditions with our system-level security solution, which comprises a modified Kalman filter and a Chi-squared detector, we can prevent potential hazards that could occur due to violation of safety objectives during an attack. Furthermore, with the help of a car-following case study, where the follower car is equipped with an adaptive cruise control unit, we show that our proposed system-level security solution preserves the safety constraints and prevents collisions between vehicles while under sensor attack.

Hardware-accelerated data acquisition and authentication for high-speed video streams on future heterogeneous automotive processing platforms

  • Martin Geier
  • Fabian Franzen
  • Samarjit Chakraborty

With the increasing use of Ethernet-based communication backbones in safety-critical real-time domains, both efficient and predictable interfacing and cryptographically secure authentication of high-speed data streams are becoming very important. Although the increasing data rates of in-vehicle networks allow the integration of more demanding (e.g., camera-based) applications, processing speeds and, in particular, memory bandwidths are no longer scaling accordingly. The need for authentication, on the other hand, stems from the ongoing convergence of traditionally separated functional domains and the extended connectivity both in- (e.g., smart-phones) and outside (e.g., telemetry, cloud-based services and vehicle-to-X technologies) current vehicles. The inclusion of cryptographic measures thus requires careful interface design to meet throughput, latency, safety, security and power constraints given by the particular application domain. Over the last decades, this has forced system designers to not only optimize their software stacks accordingly, but also incrementally move interface functionalities from software to hardware. This paper discusses existing and emerging methods for dealing with high-speed data streams ranging from software-only via mixed-hardware/software approaches to fully hardware-based solutions. In particular, we introduce two approaches to acquire and authenticate GigE Vision Video Streams at full line rate of Gigabit Ethernet on Programmable SoCs suitable for future heterogeneous automotive processing platforms.

Network and system level security in connected vehicle applications

  • Hengyi Liang
  • Matthew Jagielski
  • Bowen Zheng
  • Chung-Wei Lin
  • Eunsuk Kang
  • Shinichi Shiraishi
  • Cristina Nita-Rotaru
  • Qi Zhu

Connected vehicle applications such as autonomous intersections and intelligent traffic signals have shown great promise in improving transportation safety and efficiency. However, security is a major concern in these systems, as vehicles and surrounding infrastructures communicate through ad-hoc networks. In this paper, we first review security vulnerabilities in connected vehicle applications. We then introduce and discuss some of the defense mechanisms at the network and system levels, including (1) the Security Credential Management System (SCMS) proposed by the United States Department of Transportation, (2) an intrusion detection system (IDS) that we are developing and its application to collaborative adaptive cruise control, and (3) a partial consensus mechanism and its application to lane merging. These mechanisms can help improve the security of connected vehicle applications.

A safety and security architecture for reducing accidents in intelligent transportation systems

  • Qian Chen
  • Azizeh Khaled Sowan
  • Shouhuai Xu

The Internet of Things (IoT) technology is transforming the world into Smart Cities, which have a huge impact on future societal lifestyle, economy and business. Intelligent Transportation Systems (ITS), especially IoT-enabled Electric Vehicles (EVs), are anticipated to be an integral part of future Smart Cities. Assuring ITS safety and security is critical to the success of Smart Cities because human lives are at stake. The state-of-the-art understanding of this matter is very superficial because there are many new problems that have yet to be investigated. For example, the cyber-physical nature of ITS requires considering human-in-the-loop (i.e., drivers and pedestrians) and imposes many new challenges. In this paper, we systematically explore the threat model against ITS safety and security (e.g., malfunctions of connected EVs/transportation infrastructures, driver misbehavior and unexpected medical conditions, and cyber attacks). Then, we present a novel and systematic ITS safety and security architecture, which aims to reduce accidents caused or amplified by a range of threats. The architecture has appealing features: (i) it is centered at proactive cyber-physical-human defense; (ii) it facilitates the detection of early-warning signals of accidents; (iii) it automates effective defense against a range of threats.

The need and opportunities of electromigration-aware integrated circuit design

  • Steve Bigalke
  • Jens Lienig
  • Göran Jerke
  • Jürgen Scheible
  • Roland Jancke

Electromigration (EM) is becoming a progressively severe reliability challenge due to increased interconnect current densities. A shift from traditional (post-layout) EM verification to robust (pro-active) EM-aware design – where the circuit layout is designed with individual EM-robust solutions – is urgently needed. This tutorial will give an overview of EM and its effects on the reliability of present and future integrated circuits (ICs). We introduce the physical EM process and present its specific characteristics that can be affected during physical design. Examples of EM countermeasures which are applied in today’s commercial design flows are presented. We show how to improve the EM-robustness of metallization patterns and we also consider mission profiles to obtain application-oriented current-density limits. The increasing interaction of EM with thermal migration is investigated as well. We conclude with a discussion of application examples to shift from the current post-layout EM verification towards an EM-aware physical design process. Its methodologies, such as EM-aware routing, increase the EM-robustness of the layout with the overall goal of reducing the negative impact of EM on the circuit’s reliability.

Uncertainty quantification of electronic and photonic ICs with non-Gaussian correlated process variations

  • Chunfeng Cui
  • Zheng Zhang

Since the invention of generalized polynomial chaos in 2002, uncertainty quantification has impacted many engineering fields, including variation-aware design automation of integrated circuits and integrated photonics. Due to its fast convergence rate, the generalized polynomial chaos expansion has achieved orders-of-magnitude speedups over Monte Carlo in many applications. However, almost all existing generalized polynomial chaos methods make a strong assumption: the uncertain parameters are mutually independent or Gaussian correlated. This assumption rarely holds in many realistic applications, and it has been a long-standing challenge for both theorists and practitioners.

This paper proposes a rigorous and efficient solution to address the challenge of non-Gaussian correlation. We first extend generalized polynomial chaos and propose a class of smooth basis functions to efficiently handle non-Gaussian correlations. Then, we consider high-dimensional parameters and develop a scalable tensor method to compute the proposed basis functions. Finally, we develop a sparse solver with adaptive sample selection to solve high-dimensional uncertainty quantification problems. We validate our theory and algorithm on electronic and photonic ICs with 19 to 57 non-Gaussian correlated variation parameters. The results show that our approach outperforms Monte Carlo by 2500× to 3000× in terms of efficiency. Moreover, our method can accurately predict the output density functions with multiple peaks caused by non-Gaussian correlations, which is hard to handle by existing methods.
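
For reference, the generalized polynomial chaos expansion approximates an output of random parameters by polynomials orthonormal with respect to the parameters' joint density; the classical construction of this basis is precisely where the independence or Gaussian-correlation assumption enters:

```latex
% gPC expansion of an output y in terms of random parameters \xi:
y(\xi) \approx \sum_{|\alpha| \le p} c_{\alpha}\,\Phi_{\alpha}(\xi),
\qquad
\mathbb{E}\!\left[\Phi_{\alpha}(\xi)\,\Phi_{\beta}(\xi)\right]
  = \delta_{\alpha\beta}
```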

Based on the results in this paper, many novel uncertainty quantification algorithms can be developed and can be further applied to a broad range of engineering domains.

Parallelizable Bayesian optimization for analog and mixed-signal rare failure detection with high coverage

  • Hanbin Hu
  • Peng Li
  • Jianhua Z. Huang

Due to inherent complex behaviors and stringent requirements in analog and mixed-signal (AMS) systems, verification becomes a key bottleneck in the product development cycle. For the first time, we present a Bayesian optimization (BO) based approach to the challenging problem of verifying AMS circuits with stringent low failure requirements. At the heart of the proposed BO process is a delicate balance between two competing needs: exploitation of the current statistical model for quick identification of highly likely failures, and exploration of undiscovered design space so as to detect hard-to-find failures within a large parametric space. To do so, we simultaneously leverage multiple optimized acquisition functions to explore varying degrees of balance between exploitation and exploration. This makes it possible to not only detect rare failures which other techniques fail to identify, but also do so with significantly improved efficiency. We further build a mechanism into the BO process to enable detection of multiple failure regions, hence providing a higher degree of coverage. Moreover, the proposed approach is readily parallelizable, further speeding up failure detection, particularly for large circuits for which acquisition of simulation/measurement data is very time-consuming. Our experimental study demonstrates that the proposed approach is very effective in finding very rare failures and multiple failure regions which existing statistical sampling techniques and other BO techniques can miss, thereby providing a more robust and cost-effective methodology for rare failure detection.
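
The exploitation-versus-exploration trade-off can be sketched with a family of upper-confidence-bound acquisition functions whose exploration weight varies; this is a generic illustration, and the paper employs its own set of simultaneously optimized acquisition functions.

```python
# Scoring candidate samples under several acquisition functions
# (generic UCB sketch; mu/sigma are a surrogate model's mean and
# uncertainty over three candidate parameter points, toy values).
import numpy as np

def ucb(mu, sigma, kappa):
    return mu + kappa * sigma          # exploit mu, explore sigma

mu = np.array([0.1, 0.5, 0.3])         # predicted failure likelihood
sigma = np.array([0.4, 0.05, 0.3])     # predictive uncertainty

for kappa in (0.0, 1.0, 4.0):          # pure exploitation ... heavy exploration
    pick = int(np.argmax(ucb(mu, sigma, kappa)))
    print(f"kappa={kappa}: next sample -> candidate {pick}")
```

Running the three acquisition functions in parallel sends the search to different candidates (here 1, 2, and 0), which is what lets the process cover multiple failure regions at once.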

Transient circuit simulation for differential algebraic systems using matrix exponential

  • Pengwen Chen
  • Chung-Kuan Cheng
  • Dongwon Park
  • Xinyuan Wang

Transient simulation has become a bottleneck for modern IC designs due to large numbers of transistors and interconnects and tight design margins. The modified nodal analysis (MNA) formulation yields differential algebraic equations (DAEs), which consist of ordinary differential equations (ODEs) and algebraic equations. Solving DAEs with conventional multi-step integration methods has been a research topic for the last few decades. We adopt a matrix exponential based integration method for circuit transient analysis, whose stability and accuracy with DAEs remain an open problem. We identify that potential stability issues in the calculation of the matrix exponential and vector product (MEVP) with the rational Krylov method originate from the singular system matrix in DAEs. We then devise a robust algorithm to implicitly regularize the system matrix while maintaining its sparsity. With the new approach, φ functions are applied in MEVP to improve the accuracy of the results. Moreover, our framework no longer suffers from the limitation on step sizes, so a large leap step is adopted to skip many simulation steps in between. Features of the algorithm are validated on large-scale power delivery networks, on which it achieves high efficiency and accuracy.
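
For orientation, the exponential integrator underlying MEVP has the following form for a linear MNA system; it is written here assuming a nonsingular C purely for illustration, since the singular-C (DAE) case is exactly what the paper's regularization addresses:

```latex
% Exponential integration of C\,\dot{x} = -G\,x + b with A = -C^{-1}G:
x(t+h) = e^{hA}\,x(t) + h\,\varphi_1(hA)\,C^{-1}b,
\qquad
\varphi_1(z) = \frac{e^{z}-1}{z}
```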

CustomTopo: a topology generation method for application-specific wavelength-routed optical NoCs

  • Mengchu Li
  • Tsun-Ming Tseng
  • Davide Bertozzi
  • Mahdi Tala
  • Ulf Schlichtmann

Optical network-on-chip (NoC) is a promising platform beyond electronic NoCs. In particular, wavelength-routed optical network-on-chip (WRONoC) is renowned for its high bandwidth and ultra-low signal delay. Current WRONoC topology generation approaches focus on full connectivity, i.e., all masters are connected to all slaves. This assumption leads to wasted resources for application-specific designs. In this work, we propose CustomTopo: a general solution to the topology generation problem on WRONoCs that supports customized connectivity. CustomTopo models the topology structure and its communication behavior as an integer-linear-programming (ILP) problem, with an adjustable optimization target considering the number of add-drop filters (ADFs), the number of wavelengths, and insertion loss. The time for solving the ILP problem generally correlates positively with the network communication density. Experimental results show that CustomTopo is applicable to various communication requirements, and the resulting customized topology enables a remarkable reduction in both resource usage and insertion loss.

A cross-layer methodology for design and optimization of networks in 2.5D systems

  • Ayse Coskun
  • Furkan Eris
  • Ajay Joshi
  • Andrew B. Kahng
  • Yenai Ma
  • Vaishnav Srinivas

2.5D integration technology is gaining popularity in the design of homogeneous and heterogeneous many-core computing systems. 2.5D network design, both inter- and intra-chiplet, impacts overall system performance as well as its manufacturing cost and thermal feasibility. This paper introduces a cross-layer methodology for designing networks in 2.5D systems. We optimize the network design and chiplet placement jointly across logical, physical, and circuit layers to achieve an energy-efficient network, while maximizing system performance, minimizing manufacturing cost, and adhering to thermal constraints. In the logical layer, our co-optimization considers eight different network topologies. In the physical layer, we consider routing, microbump assignment, and microbump pitch constraints to account for the extra costs associated with microbump utilization in the inter-chiplet communication. In the circuit layer, we consider both passive and active links with five different link types, including a gas station link design. Using our cross-layer methodology results in more accurate determination of (superior) inter-chiplet network and 2.5D system designs compared to prior methods. Compared to 2D systems, our approach achieves 29% better performance with the same manufacturing cost, or 25% lower cost with the same performance.

Wavefront-MCTS: multi-objective design space exploration of NoC architectures based on Monte Carlo tree search

  • Yong Hu
  • Daniel Mueller-Gritschneder
  • Ulf Schlichtmann

Application-specific MPSoCs profit immensely from a custom-fit Network-on-Chip (NoC) architecture in terms of network performance and power consumption. In this paper we suggest a new approach to explore application-specific NoC architectures. In contrast to other heuristics, our approach uses a set of network modifications defined with graph rewriting rules to model the design space exploration as a Markov Decision Process (MDP). The MDP can be efficiently explored using the Monte Carlo Tree Search (MCTS) heuristic. We formulate a weighted-sum reward function to compute a single solution with a good trade-off between power and latency, or a set of max reward functions to compute the complete Pareto front between the two objectives. The Wavefront feature adds additional efficiency when computing the Pareto front by exchanging solutions between parallel MCTS optimization processes. Comparison with other popular search heuristics demonstrates a higher efficiency of MCTS-based heuristics for several test cases. Additionally, the Wavefront-MCTS heuristic allows complete traceability and control by the designer to enable an interactive design space exploration process.

HLS-based optimization and design space exploration for applications with variable loop bounds

  • Young-kyu Choi
  • Jason Cong

In order to further increase the productivity of field-programmable gate array (FPGA) programmers, several design space exploration (DSE) frameworks for high-level synthesis (HLS) tools have been recently proposed to automatically determine the FPGA design parameters. However, one of the common limitations found in these tools is that they cannot find a design point with large speedup for applications with variable loop bounds. The reason is that loops with variable loop bounds cannot be efficiently parallelized or pipelined with simple insertion of HLS directives. Also, making highly accurate prediction of cycles and resource consumption on the entire design space becomes a challenging task because of the inaccuracy of the HLS tool cycle prediction and the wide design space. In this paper we present an HLS-based FPGA optimization and DSE framework that produces a high-performance design even in the presence of variable loop bounds. We propose code transformations that increase the utilization of the compute resources for variable loops, including several computation patterns with loop-carried dependency such as floating-point reduction and prefix sum. In order to rapidly perform DSE with high accuracy, we describe a resource and cycle estimation model constructed from the information obtained from the actual HLS synthesis. Experiments on applications with variable loop bounds in Polybench benchmarks with Vivado HLS show that our framework improves the baseline implementation by 75X on average and outperforms current state-of-the-art DSE frameworks.
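
As an example of the computation patterns named above, a prefix sum carries a dependency from each iteration to the next; a blocked two-pass form exposes parallelism within blocks. The sketch below illustrates the idea only and is not the paper's actual code transformation.

```python
# Loop-carried dependency (serial prefix sum) vs. a blocked two-pass form.
def prefix_sum_serial(a):
    out, acc = [], 0
    for x in a:                # each iteration depends on the previous one
        acc += x
        out.append(acc)
    return out

def prefix_sum_blocked(a, block=4):
    blocks = [a[i:i + block] for i in range(0, len(a), block)]
    partial = [prefix_sum_serial(b) for b in blocks]   # pass 1: parallelizable
    out, offset = [], 0
    for p in partial:                                  # pass 2: apply offsets
        out.extend(v + offset for v in p)
        offset += p[-1]
    return out

data = list(range(1, 11))
assert prefix_sum_serial(data) == prefix_sum_blocked(data)
```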

HLSPredict: cross platform performance prediction for FPGA high-level synthesis

  • Kenneth O’Neal
  • Mitch Liu
  • Hans Tang
  • Amin Kalantar
  • Kennen DeRenard
  • Philip Brisk

FPGA application developers must explore increasingly large design spaces to identify regions of code to accelerate. High-Level Synthesis (HLS) tools automatically derive FPGA-based designs from high-level language specifications, which improves designer productivity; however, HLS tool run-times are cost-prohibitive for design space exploration, preventing designers from adequately answering cost-value decisions without expert guidance. To address this concern, this paper introduces a machine learning framework to predict FPGA performance and power consumption without relying on analytical models or HLS tools in-the-loop. For workloads that were manually optimized by appropriately setting pragmas, the framework obtains a worst-case relative error of 9.08% while running 43.78x faster than HLS; for unoptimized workloads, the framework obtains a worst-case relative error of 9.79% while running 36.24x faster than HLS.

C-GOOD: C-code generation framework for optimized on-device deep learning

  • Duseok Kang
  • Euiseok Kim
  • Inpyo Bae
  • Bernhard Egger
  • Soonhoi Ha

Executing deep learning algorithms on mobile embedded devices is challenging because embedded devices usually have tight constraints on computational power, memory size, and energy consumption, while the resource requirements of deep learning algorithms achieving high accuracy continue to increase. Thus it is typical to use an energy-efficient accelerator such as a mobile GPU, DSP array, or customized neural processor chip. Moreover, new deep learning algorithms that aim to balance accuracy, speed, and resource requirements are developed on deep learning frameworks such as Caffe[16] and Tensorflow[1], which are assumed to run directly on the target hardware. However, embedded devices may not be able to run those frameworks directly due to hardware limitations or missing OS support. To overcome this difficulty, we develop a deep learning software framework that generates C code that can run on any device. The framework offers various options for software optimization that can be applied according to the optimization methodology proposed in this paper. Another benefit is that it can generate various styles of C code, tailored for a specific compiler or accelerator architecture. Experiments on three platforms, NVIDIA Jetson TX2[23], Odroid XU4[10], and SRP (Samsung Reconfigurable Processor)[32], demonstrate the potential of the proposed approach.

LiteHAX: lightweight hardware-assisted attestation of program execution

  • Ghada Dessouky
  • Tigist Abera
  • Ahmad Ibrahim
  • Ahmad-Reza Sadeghi

Unlike traditional processors, embedded Internet of Things (IoT) devices lack the resources to incorporate protection against modern sophisticated attacks, with potentially critical consequences. Remote attestation (RA) is a security service to establish trust in the integrity of a remote device. While conventional RA is static and limited to detecting malicious modification of software binaries at load-time, recent research has made progress towards runtime attestation, such as attesting the control flow of an executing program. However, existing control-flow attestation schemes are inefficient and vulnerable to sophisticated data-oriented programming (DOP) attacks, which subvert these schemes while keeping the control flow of the code intact.

In this paper, we present LiteHAX, an efficient hardware-assisted remote attestation scheme for RISC-based embedded devices that enables detecting both control-flow attacks as well as DOP attacks. LiteHAX continuously tracks both the control-flow and data-flow events of a program executing on a remote device and reports them to a trusted verifying party. We implemented and evaluated LiteHAX on a RISC-V System-on-Chip (SoC) and show that it has minimal performance and area overhead.

SCADET: a side-channel attack detection tool for tracking prime+probe

  • Majid Sabbagh
  • Yunsi Fei
  • Thomas Wahl
  • A. Adam Ding

Microarchitectural side-channel attacks have posed serious threats to many computing systems, ranging from embedded systems and mobile devices to desktop workstations and cloud servers. Such attacks exploit side-channel vulnerabilities stemming from fundamental microarchitectural performance features, including the most common caches, out-of-order execution (for the newly revealed Meltdown exploit), and speculative execution (for Spectre). Prior efforts have focused on identifying and assessing these security vulnerabilities, and designing and implementing countermeasures against them. However, the efforts aiming at detecting specific side-channel attacks tend to be narrowly focused, which can make them effective but also makes them obsolete very quickly. In this paper, we propose a new methodology for detecting microarchitectural side-channel attacks that has the potential for a wide scope of applicability, as we demonstrate using a case study involving the Prime+Probe attack family. Instead of looking at the side-effects of side-channel attacks on microarchitectural elements such as hardware performance counters, we target the high-level semantics and invariant patterns of these attacks. We have applied our method to different Prime+Probe attack variants on the instruction cache, data cache, and last-level cache, as well as several benign programs as benchmarks. The method can detect all of the Prime+Probe attack variants with a true positive rate of 100% and an average false positive rate of 7.4%.

Industrial experiences with resource management under software randomization in ARINC653 avionics environments

  • Leonidas Kosmidis
  • Cristian Maxim
  • Victor Jegu
  • Francis Vatrinet
  • Francisco J. Cazorla

Injecting randomization in different layers of the computing platform has been shown beneficial for security, resilience to software bugs and timing analysis. In this paper, with focus on the latter, we show our experience regarding memory and timing resource management when software randomization techniques are applied to one of the most stringent industrial environments, ARINC653-based avionics. We describe the challenges in this task, we propose a set of solutions and present the results obtained for two commercial avionics applications, executed on COTS hardware and RTOS.

Single flux quantum circuit technology and CAD overview

  • Coenrad Fourie

Single Flux Quantum (SFQ) electronic circuits originated with the advent of Rapid Single Flux Quantum (RSFQ) logic in 1985 and have since evolved to include more energy-efficient technologies such as ERSFQ and eSFQ. SFQ logic circuits, based on the manipulation of quantized flux pulses, have been demonstrated to run at clock speeds in excess of 120 GHz, and with bit-switch energy below 1 aJ. Small SFQ microprocessors have been developed, but characteristics inherent to SFQ circuits and the lack of circuit design tools have hampered the development of large SFQ systems. SFQ circuit characteristics include fan-out of one and the subsequent demand for pulse splitters, gate-level clocking, susceptibility to magnetic fields and sensitivity to intra-gate and inter-gate inductance. Superconducting interconnects propagate data pulses at the speed of light, but suffer from reflections at vias that attenuate transmitted pulses. The recently started IARPA SuperTools program aims to deliver SFQ Computer-Aided Design (CAD) tools that can enable the successful design of 64 bit RISC processors given the characteristics of SFQ circuits. A discussion on the technology of SFQ circuits and the most modern SFQ fabrication processes is presented, with a focus on the unique electronic design automation CAD requirements for the design, layout and verification of SFQ circuits.

Design automation methodology and tools for superconductive electronics

  • Massoud Pedram
  • Yanzhi Wang

Josephson junction-based superconducting logic families have been proposed for implementing analog and digital circuits, which can achieve low energy dissipation and ultra-fast switching speed. There are two representative technologies: DC-biased RSFQ (rapid single flux quantum) technology and its variants, which achieve a verified speed of 370 GHz, and AC-biased AQFP (adiabatic quantum-flux-parametron), which achieves energy dissipation near quantum limits. Despite the extraordinary characteristics of the superconducting logic families, many technical challenges remain, including the choice of circuit fabrics and architectures that utilize the SFQ technology and the development of effective design automation methodologies and tools. This paper presents our work on developing design flows and tools for DC- and AC-biased SFQ circuits, leveraging unique characteristics and design requirements of the SFQ logic families. More precisely, physical design algorithms, including placement, clock tree routing, and signal routing algorithms targeting RSFQ circuits, are presented first. Next, a majority/minority gate-based automatic synthesis framework targeting AQFP logic circuits is described. Finally, experimental results that demonstrate the efficacy of the proposed framework and tools are presented.

Multi-terminal routing with length-matching for rapid single flux quantum circuits

  • Pei-Yi Cheng
  • Kazuyoshi Takagi
  • Tsung-Yi Ho

With increasing clock frequencies, the timing requirement of Rapid Single Flux Quantum (RSFQ) digital circuits is critical for achieving correct functionality. To meet this requirement, it is necessary to incorporate the length-matching constraint into the routing problem. However, the solutions of existing routing algorithms are inherently limited by pre-allocated splitters (SPLs), which complicate the subsequent routing stage under the length-matching constraint. Hence, in this paper, we reallocate SPLs to fully utilize routing resources and cope with length-matching effectively. We propose the first multi-terminal routing algorithm for RSFQ circuits that integrates SPL reallocation into the routing stage. Experimental results on a practical circuit show that our proposed algorithm achieves routing completion while reducing the required area by 17%. Compared to [2], we still improve by 7% with less runtime when SPLs are pre-allocated.

Electromagnetic equalizer: an active countermeasure against EM side-channel attack

  • Chenguang Wang
  • Yici Cai
  • Haoyi Wang
  • Qiang Zhou

Electromagnetic (EM) analysis reveals secret information by analyzing the EM emission from a cryptographic device. The EM analysis (EMA) attack is emerging as a serious threat to hardware security. It has been noted that the on-chip power grid (PG) has a security implication for EMA attacks by affecting the fluctuations of the supply current. However, there is little study on exploiting this intrinsic property as an active countermeasure against EMA. In this paper, we investigate the effect of the PG on EM emission and propose an active countermeasure against EMA, the EM Equalizer (EME). By adjusting the PG impedance, the current waveform can be flattened, equalizing the EM profile. Therefore, the correlation between secret data and EM emission is significantly reduced. As a first attempt at co-optimization for power and EM security, we extend the EME method by fixing a vulnerability to power analysis. To verify the EME method, several cryptographic designs are implemented. The measurements-to-disclosure (MTD) metric is improved by 1138x with area and power overheads of 0.62% and 1.36%, respectively.

GPU acceleration of RSA is vulnerable to side-channel timing attacks

  • Chao Luo
  • Yunsi Fei
  • David Kaeli

The RSA algorithm [21] is a public-key cipher widely used in digital signatures and Internet protocols, including the Security Socket Layer (SSL) and Transport Layer Security (TLS). RSA entails excessive computational complexity compared with symmetric ciphers. For scenarios where an Internet domain is handling a large number of SSL connections and generating digital signatures for a large number of files, the amount of RSA computation becomes a major performance bottleneck. With the advent of general-purpose GPUs, the performance of RSA has been improved significantly by exploiting parallel computing on a GPU [9, 18, 23, 26], leveraging the Single Instruction Multiple Thread (SIMT) model.

Remote inter-chip power analysis side-channel attacks at board-level

  • Falk Schellenberg
  • Dennis R. E. Gnad
  • Amir Moradi
  • Mehdi B. Tahoori

The current practice in board-level integration is to incorporate chips and components from numerous vendors. A fully trusted supply chain for all used components and chipsets is an important, yet extremely difficult to achieve, prerequisite to validate a complete board-level system for safe and secure operation. An increasing risk is that most chips nowadays run software or firmware, typically updated throughout the system lifetime, making it practically impossible to validate the full system at every given point in the manufacturing, integration and operational life cycle. This risk is elevated in devices that run 3rd party firmware. In this paper we show that an FPGA used as a common accelerator in various boards can be reprogrammed by software to introduce a sensor, suitable as a remote power analysis side-channel attack vector at the board-level. We show successful power analysis attacks from one FPGA on the board to another chip implementing RSA and AES cryptographic modules. Since the sensor is only mapped through firmware, this threat is very hard to detect, because data can be exfiltrated without requiring inter-chip communication between victim and attacker. Our results also prove the potential vulnerability in which any untrusted chip on the board can launch such attacks on the remaining system.

Effective simple-power analysis attacks of elliptic curve cryptography on embedded systems

  • Chao Luo
  • Yunsi Fei
  • David Kaeli

Elliptic Curve Cryptography (ECC), initially proposed by Koblitz [17] and Miller [20], is a public-key cipher. Compared with other popular public-key ciphers (e.g., RSA), ECC features a shorter key length for the same level of security. For example, a 256-bit ECC cipher provides 128-bit security, equivalent to a 2048-bit RSA cipher [4]. Using smaller keys, ECC requires less memory for performing cryptographic operations. Embedded systems, especially given the proliferation of Internet-of-Things (IoT) devices and platforms, require efficient and low-power secure communications between edge devices and gateways/clouds. ECC has been widely adopted in IoT systems for authentication of communications, while RSA, which is much more costly to compute, remains the standard for desktops and servers.

SODA: stencil with optimized dataflow architecture

  • Yuze Chi
  • Jason Cong
  • Peng Wei
  • Peipei Zhou

Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial differential equations, and cellular automata. Many stencil kernels are complex, usually consisting of multiple stages or iterations, and are often computation-bound. Such kernels are often offloaded to FPGAs to take advantage of the efficiency of dedicated hardware. However, implementing such complex kernels efficiently is not trivial, due to complicated data dependencies, the difficulty of programming FPGAs with RTL, and the large design space.

In this paper we present SODA, an automated framework for implementing Stencil algorithms with Optimized Dataflow Architecture on FPGAs. The SODA microarchitecture minimizes the on-chip reuse buffer size required by full data reuse and provides flexible and scalable fine-grained parallelism. The SODA automation framework takes high-level user input and generates efficient, high-frequency dataflow implementations. This significantly reduces the difficulty of programming FPGAs efficiently for stencil algorithms. The SODA design-space exploration framework models the resource constraints and searches for the performance-optimized configuration with accurate models for post-synthesis resource utilization and on-board execution throughput. Experimental results from on-board execution using a wide range of benchmarks show up to a 3.28x speedup over a 24-thread CPU, and our fully automated framework achieves better performance compared with manually designed state-of-the-art FPGA accelerators.
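
For readers unfamiliar with the kernel class, a 5-point Jacobi stencil is a representative example; the NumPy version below only illustrates the computation pattern, while a SODA-style FPGA design streams the grid through on-chip line buffers so that each input element is fetched from external memory once.

```python
# A 5-point Jacobi stencil step (illustrative of the kernel class only).
import numpy as np

def jacobi_step(u):
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((8, 8))
u[0, :] = 1.0                 # hot top boundary
for _ in range(10):
    u = jacobi_step(u)        # heat diffuses inward over iterations
```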

PolySA: polyhedral-based systolic array auto-compilation

  • Jason Cong
  • Jie Wang

Automatic systolic array generation has long been an interesting topic due to the need to reduce the lengthy development cycles of manual designs. Existing automatic systolic array generation approaches build dependency graphs from algorithms and iteratively map computation nodes in the graph into processing elements (PEs), with time stamps that specify the sequences of nodes that operate within each PE. A number of previous works implemented this idea and generated designs for ASICs. However, all of these works relied on human intervention and usually generated inferior designs compared to manual ones. In this work, we present our ongoing compilation framework named PolySA, which leverages the power of the polyhedral model to achieve end-to-end compilation for systolic array architectures on FPGAs. PolySA is the first fully automated compilation framework for generating high-performance systolic array architectures on the FPGA leveraging recent advances in high-level synthesis. We demonstrate PolySA on two key applications: matrix multiplication and convolutional neural networks. PolySA is able to generate optimal designs within one hour with performance comparable to state-of-the-art manual designs.

An efficient data reuse strategy for multi-pattern data access

  • Wensong Li
  • Fan Yang
  • Hengliang Zhu
  • Xuan Zeng
  • Dian Zhou

Memory partitioning has been widely adopted to increase memory bandwidth. Data reuse is a hardware-efficient way to improve data access throughput by exploiting locality in memory access patterns. We found that for many applications in image and video processing, a global data reuse scheme can be shared by multiple patterns. In this paper, we propose an efficient data reuse strategy for multi-pattern data access. First, a heuristic algorithm is proposed to extract the reuse information as well as find the non-reusable data elements of each pattern. Then the non-reusable elements are partitioned into several memory banks by an efficient memory partitioning algorithm. Moreover, the reuse information is utilized to generate the global data reuse logic shared by the multiple patterns. We design a novel algorithm to minimize the number of registers required by the data reuse logic. Experimental results show that compared with the state-of-the-art approach, our proposed method can reduce the number of required BRAMs by 62.2% on average, with average reductions of 82.1% in SLICEs, 87.1% in LUTs, 71.6% in Flip-Flops, 73.1% in DSP48Es, 83.8% in SRLs, 46.7% in storage overhead, 79.1% in dynamic power consumption, and 82.6% in the execution time of memory partitioning. In addition, performance is improved by 14.4%.

Optimizing data layout and system configuration on FPGA-based heterogeneous platforms

  • Hou-Jen Ko
  • Zhiyuan Li
  • Samuel Midkiff

The most attractive feature of field-programmable gate arrays (FPGAs) is their configuration flexibility. However, if the configuration is performed manually, this flexibility places a heavy burden on system designers to choose among a vast number of configuration parameters and program transformations. In this paper, we improve the state-of-the-art with two main innovations: First, we apply compiler-automated transformations to the data layout and program statements to create streaming accesses. Such accesses are turned into streaming interfaces when the kernels are implemented in hardware, allowing the kernels to run efficiently. Second, we use two-step mixed integer programming to first minimize the execution time and then to minimize energy dissipation. Configuration parameters are chosen automatically, including several important ones omitted by existing models. Experimental results demonstrate significant performance gains and energy savings using these techniques.

Design and optimization of edge computing distributed neural processor for biomedical rehabilitation with sensor fusion

  • Kofi Otseidu
  • Tianyu Jia
  • Joshua Bryne
  • Levi Hargrove
  • Jie Gu

Modern biomedical devices use sensor fusion techniques to improve the classification accuracy of users' motion intent for rehabilitation applications. The design of the motion classifier faces significant challenges due to the large number of channels and the stringent communication latency requirement. This paper proposes an edge-computing distributed neural processor to effectively reduce the data traffic and physical wiring congestion. A special local and global networking architecture is introduced to significantly reduce traffic among multiple chips in edge computing. To optimize the design space of the selected features, a systematic design methodology is proposed. A novel mixed-signal feature extraction approach with the assistance of neural-network distortion recovery is also provided to significantly reduce the silicon area. A 12-channel 55nm CMOS test chip was implemented to demonstrate the proposed systematic design methodology. Measurements show the test chip consumes only 20 uW, more than 10,000X less power than the microprocessor currently used clinically, and can perform edge-computing networking operations within 5 ms.

Area-efficient and low-power face-to-face-bonded 3D liquid state machine design

  • Bon Woong Ku
  • Yu Liu
  • Yingyezhe Jin
  • Peng Li
  • Sung Kyu Lim

As small form factors and low power consumption matter for end devices in the cloud networking and Internet-of-Things era, bio-inspired neuromorphic architectures have recently attracted great attention in the hope of reaching the energy efficiency of brain functions. Among promising solutions, a liquid state machine (LSM), which consists of randomly and recurrently connected reservoir neurons and trainable readout neurons, has shown great promise in delivering brain-inspired computing power. In this work, we adopt the state-of-the-art face-to-face (F2F)-bonded 3D IC flow named Compact-2D [4] for the LSM processor design, and study the power-area-accuracy benefits of 3D LSM ICs targeting next-generation commercial-grade neuromorphic computing platforms. First, we analyze how the size and connection density of a reservoir in the LSM architecture affect the learning performance using a real-world speech recognition benchmark. We also explore how much power-area design overhead should be paid to enable better classification accuracy. Based on the power-area-accuracy trade-off, we implement a F2F-bonded 3D LSM IC using the optimal LSM architecture, and finally show that 3D integration practically benefits the LSM processor design with huge form factor and power savings while preserving the best learning performance.

DIMA: a depthwise CNN in-memory accelerator

  • Shaahin Angizi
  • Zhezhi He
  • Deliang Fan

In this work, we first propose a deep depthwise Convolutional Neural Network (CNN) structure, called Add-Net, which uses binarized depthwise separable convolution to replace conventional spatial convolution. In Add-Net, the computationally expensive convolution operations (i.e., multiplication and accumulation) are converted into hardware-friendly addition operations. We meticulously investigate and analyze Add-Net's performance (i.e., accuracy, parameter size and computational cost) in object recognition compared to a traditional baseline CNN using the popular large-scale ImageNet dataset. Accordingly, we propose a Depthwise CNN In-Memory Accelerator (DIMA) based on SOT-MRAM computational sub-arrays to efficiently accelerate Add-Net within non-volatile MRAM. Our device-to-architecture co-simulation results show that, with almost the same inference accuracy as the baseline CNN on different datasets, DIMA can obtain ~1.4× better energy-efficiency and 15.7× speedup compared to ASICs, and ~1.6× better energy-efficiency and 5.6× speedup over the best processing-in-DRAM accelerators.

Multi-channel and fault-tolerant control multiplexing for flow-based microfluidic biochips

  • Ying Zhu
  • Bing Li
  • Tsung-Yi Ho
  • Qin Wang
  • Hailong Yao
  • Robert Wille
  • Ulf Schlichtmann

Continuous flow-based biochips are one of the promising platforms used in biochemical and pharmaceutical laboratories due to their efficiency and low costs. Inside such a chip, fluid volumes of nanoliter size are transported between devices for various operations, such as mixing and detection. The transportation channels and corresponding operation devices are controlled by microvalves driven by external pressure sources. Since assigning an independent pressure source to every microvalve would be impractical due to high costs and limited system dimensions, the states of microvalves are switched by a control logic using time multiplexing. Existing control logic designs, however, still switch only a single control channel per operation, leading to low efficiency. In this paper, we propose the first automatic synthesis approach for a control logic that can switch multiple control channels simultaneously to reduce the overall switching time of valve states. In addition, we propose the first fault-aware control logic design, which introduces redundant control paths to maintain correct function even when manufacturing defects occur. Compared with the existing direct connection method, the proposed multi-channel switching mechanism reduces the switching time of valve states by up to 64%. In addition, all control paths for fault tolerance have been realized.

Multi-physics-based FEM analysis for post-voiding analysis of electromigration failure effects

  • Hengyang Zhao
  • Sheldon Tan

In this paper, we propose a new multi-physics finite element method (FEM) based analysis method for void growth simulation of confined copper interconnects. This new method for the first time considers three important physics simultaneously in the EM failure process, along with their time-varying interactions: the hydrostatic stress in the confined interconnect wire, the current density, and the Joule-heating-induced temperature. As a result, we solve a set of coupled partial differential equations consisting of the stress diffusion equation (Korhonen's equation), the phase-field equation (modeling void boundary motion), the Laplace equation for current density, and the heat diffusion equation for Joule heating and wire temperature. We show that each of the physics has a different physical domain and boundary conditions, and describe how the coupled multi-physics transient analysis is carried out with FEM while properly handling the different time scales. Experimental results show that by considering all three coupled physics (stress, current density, and temperature) and their transient behaviors, the proposed FEM EM solver can predict the characteristic transient wire-resistance change pattern of copper interconnect wires, which agrees well with published experimental data. We also show that the simulated void growth speed is less conservative than that of a recently proposed compact EM model.
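
For reference, three of the four coupled equations have standard forms (a sketch in common notation, with D_a the atomic diffusivity, B the effective bulk modulus, Ω the atomic volume, Z the effective charge number, e the electron charge, ρ the electrical resistivity, j the current density, k_t the thermal conductivity and ρ_m c_p the volumetric heat capacity; the phase-field equation for the void boundary and all boundary conditions follow the paper):

```latex
% Korhonen's stress-diffusion equation for the hydrostatic stress \sigma:
\frac{\partial \sigma}{\partial t} =
  \frac{\partial}{\partial x}\left[\frac{D_a B \Omega}{k_B T}
  \left(\frac{\partial \sigma}{\partial x} + \frac{e Z \rho j}{\Omega}\right)\right]

% Laplace equation for the electric potential V (with j = -\nabla V / \rho):
\nabla \cdot \left(\tfrac{1}{\rho}\,\nabla V\right) = 0

% Heat-diffusion equation with Joule heating \rho j^2 as the source term:
\rho_m c_p \frac{\partial T}{\partial t} = \nabla \cdot (k_t \nabla T) + \rho j^2
```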

Estimating and optimizing BTI aging effects: from physics to CAD

  • Hussam Amrouch
  • Victor M. van Santen
  • Jörg Henkel

Transistor aging due to Bias Temperature Instability (BTI) is a crucial degradation effect for circuit reliability over time. Aging-aware circuit design flows virtually do not exist yet, and even research is in its infancy. In this work, we demonstrate how the deleterious effects of BTI-induced degradations can be modeled from the physical level, where they occur, all the way up to the system level, where they ultimately affect the delay and power of circuits. To achieve that, we create degradation-aware cell libraries that properly capture the impact of BTI not only on the delay of standard cells but also on their static and dynamic power. Unlike the state of the art, which solely models the impact of BTI on the threshold voltage of transistors (Vth), we are the first to model the other key transistor parameters degraded by BTI: carrier mobility (μ), sub-threshold slope (SS), and gate-drain capacitance (Cgd).

Our cell libraries are compatible with existing commercial CAD tools. Employing the mature algorithms in such tools enables designers, after importing our cell libraries, to accurately estimate the overall impact of aging on the delay and/or power of any circuit, regardless of its complexity. We demonstrate that ΔVth alone (as used in the state of the art) is insufficient to correctly model the impact of BTI on either the delay or the power of circuits. On the one hand, neglecting BTI-induced μ and Cgd degradations underestimates the impact of BTI on circuit delay; designers would then employ timing guardbands too narrow to sustain circuit reliability over the lifetime. On the other hand, neglecting BTI-induced SS degradation overestimates the static-power reduction due to BTI, exaggerating the potential benefit that circuits draw from BTI.

PVT2: process, voltage, temperature and time-dependent variability in scaled CMOS process

  • A. K. M. Mahfuzul Islam
  • Hidetoshi Onodera

In addition to conventional PVT (Process, Voltage and Temperature) variation, time-dependent current fluctuation such as random telegraph noise (RTN) poses a new challenge to VLSI reliability. In this paper, we show that, unlike static random variation, the RTN amplitude of a particular device is not constant across supply voltages and temperatures. A device may show a large RTN amplitude at one operating condition and a small amplitude at another. As a result, the RTN amplitude distribution becomes uncorrelated across a wide range of voltages and temperatures, which significantly degrades worst-case values. Analysis results based on variability models from a 65 nm silicon-on-insulator process show that uncorrelated RTN degrades the worst-case threshold voltage significantly compared with the case where RTN is not considered. Delay variation analysis shows that including RTN in the statistical analysis has little impact at high supply voltage; at low-voltage operation, however, RTN can degrade the worst-case value by more than 5%.

Performance and accuracy in soft-error resilience evaluation using the multi-level processor simulator ETISS-ML

  • Daniel Mueller-Gritschneder
  • Uzair Sharif
  • Ulf Schlichtmann

Soft errors are a major safety concern in many devices, e.g., in automotive, industrial, control or medical applications. Ideally, safety-critical systems should be resilient against the impact of soft errors, but at low cost. This requires evaluating the soft-error resilience, which is typically done by extensive fault injection.

In this paper, we present ETISS-ML, a multi-level processor simulator that achieves both accuracy and performance for fault simulation by intelligently switching the level of abstraction between an Instruction Set Simulator (ISS) and an RTL simulator. For a given software testcase and fault scenario, the software is first executed in ISS mode until shortly before the fault injection. ETISS-ML then switches to RTL mode for accurate fault simulation. Once the impact of the fault has completely propagated out of the processor's micro-architecture, the simulation can switch back to ISS mode. This paper describes the methods needed to preserve accuracy during both of these switches. Experimental results show that ETISS-ML obtains near-ISS performance with RTL accuracy. It is also shown that ETISS-ML can be used as the processor model in SystemC/TLM virtual prototypes (VPs) and hence allows investigating the impact of soft errors at the system level.
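
The switching policy can be summarized as follows (a sketch with hypothetical `ISS`/`RTLSim` interfaces invented for illustration, not the actual ETISS-ML API):

```python
GUARD_CYCLES = 1000  # safety margin before the injection point

def run_fault_injection(testcase, fault):
    iss = ISS(testcase)      # hypothetical fast instruction-set simulator
    rtl = RTLSim(testcase)   # hypothetical cycle-accurate RTL simulator

    # 1) Fast-forward in ISS mode until shortly before the fault injection.
    iss.run_until(fault.inject_time - GUARD_CYCLES)

    # 2) Transfer architectural state and switch to RTL mode for injection.
    rtl.load_state(iss.save_state())
    rtl.inject(fault)

    # 3) Stay in RTL mode while the fault may still live in the pipeline.
    while rtl.fault_effects_in_microarchitecture():
        rtl.step()

    # 4) Effects are now architecturally visible (or masked): back to ISS mode.
    iss.load_state(rtl.save_state())
    return iss.run_to_completion()
```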

Computer-aided design for quantum computation

  • Robert Wille
  • Austin Fowler
  • Yehuda Naveh

Quantum computation is currently moving from an academic idea to a practical reality. The recent past has seen tremendous progress in the physical implementation of corresponding quantum computers, also involving big players such as IBM, Google, Intel, Rigetti, Microsoft, and Alibaba. These devices promise substantial speedups over conventional computers for applications like quantum chemistry, optimization, machine learning, cryptography, quantum simulation, and systems of linear equations. The Computer-Aided Design and Verification (jointly referred to as CAD) community needs to be ready for this revolutionary new technology. While research on automatic design methods for quantum computers is currently underway, there is still far too little coordination between the CAD community and the quantum computation community. Consequently, many CAD approaches proposed in the past have either addressed the wrong problems or failed to reach the end users. In this summary paper, we provide a glimpse into both sides. To this end, we review and discuss selected accomplishments from the CAD domain as well as open challenges within the quantum domain. These examples showcase the current state of the art but also outline the work remaining to be done in both communities.

PolyCleaner: clean your polynomials before backward rewriting to verify million-gate multipliers

  • Alireza Mahzoon
  • Daniel Große
  • Rolf Drechsler

Nowadays, a variety of multipliers are used in different computationally intensive industrial applications. Most of these multipliers are highly parallelized and structurally complex. Therefore, the existing formal verification techniques fail to verify them.

In recent years, formal multiplier verification based on Symbolic Computer Algebra (SCA) has shown superior results in comparison to all other existing proof techniques. However, for non-trivial architectures a monomial explosion can still be observed. A common understanding is that this is caused by redundant monomials, also known as vanishing monomials. While several approaches have been proposed to overcome the explosion, the problem itself is still not fully understood.

In this paper we present a new theory of the origin of vanishing monomials and show how they can be handled to prevent the explosion during backward rewriting. We implement our new approach as the SCA verifier PolyCleaner. The experimental results show the efficiency of our proposed method in verifying non-trivial million-gate multipliers.

A formal instruction-level GPU model for scalable verification

  • Yue Xing
  • Bo-Yuan Huang
  • Aarti Gupta
  • Sharad Malik

GPUs have been widely used to accelerate big-data inference applications and scientific computing through their parallelized hardware resources and programming model. Their extreme parallelism increases the possibility of bugs such as data races and un-coalesced memory accesses, and thus verifying program correctness is critical. State-of-the-art GPU program verification efforts mainly focus on analyzing application-level programs, e.g., in C, and suffer from the following limitations: (1) high false-positive rate due to coarse-grained abstraction of synchronization primitives, (2) high complexity of reasoning about pointer arithmetic, and (3) keeping up with an evolving API for developing application-level programs.

In this paper, we address these limitations by modeling GPUs and reasoning about programs at the instruction level. We formally model the Nvidia GPU at the Parallel Thread Execution (PTX) level using the recently proposed Instruction-Level Abstraction (ILA) model for accelerators. PTX is analogous to the Instruction-Set Architecture (ISA) of a general-purpose processor. Our formal ILA model of the GPU includes non-synchronization instructions as well as all synchronization primitives, enabling us to verify multithreaded programs. We demonstrate the applicability of our ILA model in scalable GPU program verification of data-race checking. The evaluation shows that our checker outperforms state-of-the-art GPU data-race checkers with fewer false positives and improved scalability.

Fast FPGA emulation of analog dynamics in digitally-driven systems

  • Steven Herbst
  • Byong Chan Lim
  • Mark Horowitz

In this paper, we propose an architecture for FPGA emulation of mixed-signal systems that achieves high accuracy at a high throughput. We represent the analog output of a block as a superposition of step responses to changes in its analog input, and the output is evaluated only when needed by the digital subsystem. Our architecture is therefore intended for digitally-driven systems; that is, those in which the inputs of analog dynamical blocks change only on digital clock edges. We implemented a high-speed link transceiver design using the proposed architecture on a Xilinx FPGA. This design demonstrates how our approach breaks the link between simulation rate and time resolution that is characteristic of prior approaches. The emulator is flexible, allowing for the real-time adjustment of analog dynamics, clock jitter, and various design parameters. We demonstrate that our architecture achieves 1% accuracy while running 3 orders of magnitude faster than a comparable high-performance CPU simulation.
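
The core idea, representing an analog block's output as a superposition of step responses evaluated only when the digital side needs a value, can be sketched as follows (a first-order RC step response stands in for the real block dynamics; the time constant is arbitrary):

```python
import numpy as np

TAU = 1e-9  # illustrative first-order time constant, in seconds

def step_response(dt):
    """Unit step response evaluated dt seconds after the input step."""
    return 1.0 - np.exp(-dt / TAU) if dt >= 0 else 0.0

def analog_output(input_steps, t_query):
    """Superpose the responses to all past input changes, lazily at t_query.

    input_steps: list of (t_k, delta_u_k) pairs; in a digitally-driven
    system these changes occur only on digital clock edges.
    """
    return sum(du * step_response(t_query - tk) for tk, du in input_steps)

steps = [(0.0, 1.0), (2e-9, -0.5)]   # input steps at two clock edges
print(analog_output(steps, 5e-9))    # evaluate only when the digital side asks
```

Because output evaluation is decoupled from a fixed simulation time step, time resolution no longer limits the emulation rate, which is the property the paper exploits.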

SPN dash: fast detection of adversarial attacks on mobile via sensor pattern noise fingerprinting

  • Kent W. Nixon
  • Jiachen Mao
  • Juncheng Shen
  • Huanrui Yang
  • Hai (Helen) Li
  • Yiran Chen

A concerning weakness of deep neural networks is their susceptibility to adversarial attacks. While methods exist to detect these attacks, they incur significant drawbacks and ignore external features that could aid in attack detection. In this work, we propose SPN Dash, a method for detecting adversarial attacks based on the integrity of the sensor pattern noise embedded in submitted images. Through experiments, we show that our SPN Dash method is capable of detecting the addition of adversarial noise with up to 94% accuracy for images of size 256×256. Analysis shows that SPN Dash is robust to image scaling techniques, as well as a small amount of image compression. This performance is on par with state-of-the-art neural network-based detectors, while incurring an order of magnitude less computational and memory overhead.

Watermarking deep neural networks for embedded systems

  • Jia Guo
  • Miodrag Potkonjak

Deep neural networks (DNNs) have become an important tool for bringing intelligence to mobile and embedded devices. The increasingly wide deployment, sharing and potential commercialization of DNN models create a compelling need for intellectual property (IP) protection. Recently, DNN watermarking has emerged as a plausible IP protection method. Enabling DNN watermarking on embedded devices in a practical setting requires a black-box approach. Existing DNN watermarking frameworks either fail to meet the black-box requirement or are susceptible to several forms of attack. We propose a watermarking framework that incorporates the author’s signature in the process of training DNNs. While functioning normally in regular cases, the resulting watermarked DNN behaves in a different, predefined pattern when given any signed inputs, thus proving authorship. We demonstrate an example implementation of the framework on popular image classification datasets and show that strong watermarks can be embedded in the models.
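
The black-box flavor of the scheme can be illustrated schematically: signed inputs derived from the author's signature are assigned a predefined label during training, so authorship can later be proven by querying only the model's outputs. A sketch in which the signing transform is a hypothetical additive pattern, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(42)

def sign_input(x, signature):
    """Embed the author's signature as a faint additive pattern (illustrative)."""
    return np.clip(x + 0.05 * signature, 0.0, 1.0)

signature = rng.standard_normal((28, 28))
watermark_label = 7                    # predefined response to signed inputs

train_x = rng.random((1000, 28, 28))   # stand-ins for a real dataset
train_y = rng.integers(0, 10, 1000)

# Mix signed inputs carrying the predefined label into the training set.
signed_x = np.array([sign_input(x, signature) for x in train_x[:50]])
signed_y = np.full(50, watermark_label)
aug_x = np.concatenate([train_x, signed_x])
aug_y = np.concatenate([train_y, signed_y])

# Train any classifier on (aug_x, aug_y). At verification time, a model that
# consistently outputs watermark_label on freshly signed inputs proves authorship.
```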

DeepFense: online accelerated defense against adversarial deep learning

  • Bita Darvish Rouhani
  • Mohammad Samragh
  • Mojan Javaheripi
  • Tara Javidi
  • Farinaz Koushanfar

Recent advances in adversarial Deep Learning (DL) have opened up a largely unexplored surface for malicious attacks jeopardizing the integrity of autonomous DL systems. With the widespread usage of DL in critical and time-sensitive applications, including unmanned vehicles, drones, and video surveillance systems, online detection of malicious inputs is of utmost importance. We propose DeepFense, the first end-to-end automated framework that simultaneously enables efficient and safe execution of DL models. DeepFense formalizes the goal of thwarting adversarial attacks as an optimization problem that minimizes the rarely observed regions in the latent feature space spanned by a DL network. To solve this minimization problem, a set of complementary but disjoint modular redundancies is trained to validate the legitimacy of input samples in parallel with the victim DL model. DeepFense leverages hardware/software/algorithm co-design and customized acceleration to achieve just-in-time performance in resource-constrained settings. The proposed countermeasure is unsupervised, meaning that no adversarial samples are used to train the modular redundancies. We further provide an accompanying API to reduce the non-recurring engineering cost and ensure automated adaptation to various platforms. Extensive evaluations on FPGAs and GPUs demonstrate up to two orders of magnitude performance improvement while enabling online adversarial sample detection.

Enabling deep learning at the IoT edge

  • Liangzhen Lai
  • Naveen Suda

Deep learning algorithms have demonstrated super-human capabilities in many cognitive tasks, such as image classification and speech recognition. As a result, there is increasing interest in deploying neural networks (NNs) on low-power processors found in always-on systems, such as those based on Arm Cortex-M microcontrollers. In this paper, we discuss the challenges of deploying neural networks on microcontrollers with limited memory, compute resources and power budgets. We introduce CMSIS-NN, a library of optimized software kernels that enables deployment of NNs on Cortex-M cores. We also present techniques for NN algorithm exploration to develop lightweight models suitable for resource-constrained systems, using keyword spotting as an example.

Searching toward pareto-optimal device-aware neural architectures

  • An-Chieh Cheng
  • Jin-Dong Dong
  • Chi-Hung Hsu
  • Shu-Huan Chang
  • Min Sun
  • Shih-Chieh Chang
  • Jia-Yu Pan
  • Yu-Ting Chen
  • Wei Wei
  • Da-Cheng Juan

Recent breakthroughs in Neural Architecture Search (NAS) have achieved state-of-the-art performance in many tasks such as image classification and language understanding. However, most existing works optimize only for model accuracy and largely ignore other important factors imposed by the underlying hardware and devices, such as latency and energy, when making inferences. In this paper, we first introduce the NAS problem and survey recent work. We then take a deep dive into two recent advances that extend NAS into multiple-objective frameworks: MONAS [19] and DPP-Net [10]. Both MONAS and DPP-Net are capable of optimizing accuracy alongside other objectives imposed by devices, searching for neural architectures that can be best deployed on a wide spectrum of devices: from embedded systems and mobile devices to workstations. Experimental results show that architectures found by MONAS and DPP-Net achieve Pareto optimality w.r.t. the given objectives for various devices.

Hardware-aware machine learning: modeling and optimization

  • Diana Marculescu
  • Dimitrios Stamoulis
  • Ermao Cai

Recent breakthroughs in Machine Learning (ML) applications, and especially in Deep Learning (DL), have made DL models a key component in almost every modern computing system. The increased popularity of DL applications deployed on a wide spectrum of platforms (from mobile devices to datacenters) has resulted in a plethora of design challenges related to the constraints introduced by the hardware itself. “What is the latency or energy cost for an inference made by a Deep Neural Network (DNN)?” “Is it possible to predict this latency or energy consumption before a model is even trained?” “If yes, how can machine learners take advantage of these models to design the hardware-optimal DNN for deployment?” From lengthening the battery life of mobile devices to reducing the runtime requirements of DL models executing in the cloud, the answers to these questions have drawn significant attention.

One cannot optimize what isn’t properly modeled. It is therefore important to understand the hardware efficiency of DL models during inference serving, before the model is even trained. This key observation has motivated the use of predictive models to capture the hardware performance or energy efficiency of ML applications. Furthermore, ML practitioners are currently challenged with the task of designing the DNN model, i.e., tuning the hyper-parameters of the DNN architecture, while optimizing for both the accuracy of the DL model and its hardware efficiency. State-of-the-art methodologies have therefore proposed hardware-aware hyper-parameter optimization techniques. In this paper, we provide a comprehensive assessment of state-of-the-art work and selected results on hardware-aware modeling and optimization for ML applications. We also highlight several open questions that are poised to give rise to novel hardware-aware designs in the next few years, as DL applications continue to significantly impact associated hardware systems and platforms.

DAC 2019 TOC

LAMA: Link-Aware Hybrid Management for Memory Accesses in Emerging CPU-FPGA Platforms

  • Liang Feng
  • Jieru Zhao
  • Tingyuan Liang
  • Sharad Sinha
  • Wei Zhang

To satisfy increasing computing demands, heterogeneous computing platforms are gaining attention, especially CPU-FPGA platforms. Recently, emerging tightly coupled CPU-FPGA platforms with shared coherent caches (such as the Intel HARP and IBM POWER with CAPI) have been proposed to facilitate data communication and simplify the programming model. In this work, we propose LAMA, a framework combining static analysis and dynamic control for memory access management in such platforms, to further enhance memory access efficiency and maintain data consistency. Based on implementation results on the real Intel HARP2 platform, LAMA is shown to improve performance by 34% on average with low overhead.

Thread Weaving: Static Resource Scheduling for Multithreaded High-Level Synthesis

  • Hsuan Hsiao
  • Jason Anderson

In high-level synthesis (HLS), software multithreading constructs can be used to explicitly specify coarse-grained parallelism for multiple accelerators. While software threads typically operate independently and in isolation of each other on CPUs, HLS threads/accelerators are sub-components of one circuit. Since these components generally reside in the same clock domain, we can schedule their execution statically to avoid shared-resource contention among threads. We propose thread weaving, a technique that statically interleaves requests from different threads through scheduling constraints. With the guarantee of a contention-free schedule, we eliminate replication/arbitration of shared resources, reducing the area footprint of the circuit and improving its maximum operating frequency (Fmax).

Exact and Heuristic Allocation of Multi-kernel Applications to Multi-FPGA Platforms

  • Junnan Shan
  • Mario R. Casu
  • Jordi Cortadella
  • Luciano Lavagno
  • Mihai T. Lazarescu

FPGA-based accelerators have demonstrated high energy efficiency compared to GPUs and CPUs. However, single-FPGA designs may not achieve sufficient task parallelism. In this work, we optimize the mapping of high-performance multi-kernel applications, like Convolutional Neural Networks, to multi-FPGA platforms. First, we formulate the system-level optimization problem, choosing within a huge design space the parallelism and number of compute units for each kernel in the pipeline. We then solve it using a combination of Geometric Programming, which produces the optimum-performance solution given resource and DRAM bandwidth constraints, and a heuristic allocator that places the compute units on the FPGA cluster.
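
The performance-optimal part of the flow can be illustrated as a small geometric program (a toy model assuming the cvxpy library with a GP-capable solver; the workloads, throughputs, and area budget are invented):

```python
import cvxpy as cp

work = [8.0, 4.0, 2.0]   # operations per kernel (arbitrary units)
thr = [1.0, 1.0, 2.0]    # throughput of one compute unit, per kernel
area = [3.0, 2.0, 1.0]   # area of one compute unit, per kernel
AREA_BUDGET = 20.0

n = cp.Variable(3, pos=True)  # compute units per kernel (continuous relaxation)
t = cp.Variable(pos=True)     # pipeline bottleneck time

constraints = [work[i] / (thr[i] * n[i]) <= t for i in range(3)]
constraints += [area[0] * n[0] + area[1] * n[1] + area[2] * n[2] <= AREA_BUDGET]

prob = cp.Problem(cp.Minimize(t), constraints)
prob.solve(gp=True)  # solve as a geometric program
print("units per kernel:", n.value, "bottleneck time:", t.value)
```

In the paper's flow a heuristic allocator then snaps such fractional solutions onto the discrete FPGAs of the cluster; the continuous GP relaxation above only bounds the achievable pipeline throughput.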

A Flat Timing-Driven Placement Flow for Modern FPGAs

  • Timothy Martin
  • Dani Maarouf
  • Ziad Abuowaimer
  • Abeer Alhyari
  • Gary Grewal
  • Shawki Areibi

In this paper, we propose a novel, flat analytic timing-driven placer without explicit packing for Xilinx UltraScale FPGA devices. Our work uses novel methods to simultaneously optimize for timing, wirelength and congestion throughout the global and detailed placement stages. We evaluate the effectiveness of the flat placer on the ISPD 2016 benchmark suite for the xcvu095 UltraScale device, as well as on industrial benchmarks. Experimental results show that on average, FTPlace achieves an 8% increase in maximum clock rate, an 18% decrease in routed wirelength, and produces placements that require 80% less time to route when compared to Xilinx Vivado 2018.1.

Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation Aware Neural Architecture Search

  • Weiwen Jiang
  • Xinyi Zhang
  • Edwin H.-M. Sha
  • Lei Yang
  • Qingfeng Zhuge
  • Yiyu Shi
  • Jingtong Hu

A fundamental question lies in almost every application of deep neural networks: what is the optimal neural architecture given a specific data set? Recently, several Neural Architecture Search (NAS) frameworks have been developed that use reinforcement learning and evolutionary algorithms to search for the solution. However, most take a long time to find the optimal architecture due to the huge search space and the lengthy training process needed to evaluate each candidate. In addition, most aim at accuracy only and do not take into consideration the hardware that will be used to implement the architecture, which can lead to excessive latencies beyond specifications, rendering the resulting architectures useless. To address both issues, in this paper we use Field Programmable Gate Arrays (FPGAs) as a vehicle to present a novel hardware-aware NAS framework, namely FNAS, which provides an optimal neural architecture whose latency is guaranteed to meet the specification. In addition, with a performance abstraction model that analyzes the latency of neural architectures without training, our framework can quickly prune architectures that do not satisfy the specification, leading to higher efficiency. Experimental results on common data sets such as ImageNet show that in cases where the state of the art generates architectures with latencies 7.81× longer than the specification, those from FNAS can meet the specs with less than 1% accuracy loss. Moreover, FNAS also achieves up to an 11.13× speedup for the search process. To the best of the authors’ knowledge, this is the very first hardware-aware NAS.

CANN: Curable Approximations for High-Performance Deep Neural Network Accelerators

  • Muhammad Abdullah Hanif
  • Faiq Khalid
  • Muhammad Shafique

Approximate Computing (AC) has emerged as a means for improving the performance, area and power/energy efficiency of a digital design at the cost of output quality degradation. Applications like machine learning (e.g., using deep neural networks, DNNs) are highly computationally intensive and can therefore benefit significantly from AC and specialized accelerators. However, the accuracy loss introduced by approximations in the DNN accelerator hardware can produce undesirable results. This paper presents a novel method to design high-performance DNN accelerators in which approximation error(s) from one stage/part of the design is “completely” compensated in a subsequent stage/part while offering significant efficiency gains. The paper also presents a case study on improving the performance of systolic array-based hardware architectures, which are commonly used for accelerating state-of-the-art deep learning algorithms.

Successive Log Quantization for Cost-Efficient Neural Networks Using Stochastic Computing

  • Sugil Lee
  • Hyeonuk Sim
  • Jooyeon Choi
  • Jongeun Lee

Despite the multifaceted benefits of stochastic computing (SC) such as low cost, low power, and flexible precision, SC-based deep neural networks (DNNs) still suffer from the long-latency problem, especially those with high precision requirements. While log quantization can help, it has its own accuracy-saturation problem due to uneven precision distribution. In this paper we propose successive log quantization (SLQ), which extends log quantization with significant improvements in precision and accuracy, and apply it to state-of-the-art SC-DNNs. SLQ reuses the existing datapath of log quantization, and thus retains its advantages such as simple multiplier hardware. Our experimental results demonstrate that SLQ can significantly extend both the accuracy and efficiency of SC-DNNs over state-of-the-art solutions, including linear-quantized and log-quantized SC-DNNs, achieving less than a 1~1.5%p accuracy drop for AlexNet, SqueezeNet, and VGG-S at a mere 4~5-bit weight resolution.
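
The mechanism, quantize to the nearest power of two and then log-quantize the residual again, can be sketched as follows (a minimal sketch; the paper's exact bit allocation and level count may differ):

```python
import numpy as np

def log_quantize(w):
    """Round magnitudes to the nearest power of two, keeping signs."""
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    nz = w != 0
    out[nz] = np.sign(w[nz]) * 2.0 ** np.round(np.log2(np.abs(w[nz])))
    return out

def slq(w, levels=2):
    """Successive log quantization: repeatedly log-quantize the residual."""
    residual = np.asarray(w, dtype=float)
    approx = np.zeros_like(residual)
    for _ in range(levels):
        q = log_quantize(residual)
        approx += q
        residual = residual - q
    return approx

w = np.array([0.30, -0.11, 0.75, 0.05])
print(slq(w, levels=1))  # plain log quantization
print(slq(w, levels=2))  # the residual term cuts the quantization error
```

Each term remains a signed power of two, so multiplication stays shift-and-add, which is why SLQ can reuse the existing log-quantization datapath.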

ARGA: Approximate Reuse for GPGPU Acceleration

  • Daniel Peroni
  • Mohsen Imani
  • Hamid Nejatollahi
  • Nikil Dutt
  • Tajana Rosing

Many data-driven applications, including computer vision, speech recognition, and medical diagnostics, tolerate error during computation. These applications are often accelerated on GPUs, but high computational costs limit performance and increase energy usage. In this paper, we present ARGA, an approximate computing technique capable of accelerating GPGPU applications. ARGA provides an approximate lookup table to GPGPU cores to avoid recomputing instructions with identical or similar values. We propose multi-table parallel lookup, which enables computational reuse to significantly speed up GPGPU computation by checking incoming instructions in parallel. The inputs of each operation are searched for in a lookup table. Matches resulting in exact or low error are removed from the floating-point pipeline and used directly as output. Matches producing highly inaccurate results are computed on exact hardware to minimize application error. We simulate our design by placing ARGA within each core of an Nvidia Kepler-architecture Titan and an AMD Southern Islands 7970. We show our design improves performance throughput by up to 2.7× and improves EDP by 5.3× for 6 GPGPU applications while maintaining less than 5% output error. We also show ARGA accelerates inference of a LeNet NN by 2.1× and improves EDP by 3.7× without significantly impacting classification accuracy.
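
The reuse mechanism can be sketched in software terms: incoming operand pairs are quantized into a lookup key, and near matches skip the exact computation (a sketch; ARGA realizes this as parallel hardware tables inside each GPU core, not as a software dictionary):

```python
class ApproxReuseTable:
    def __init__(self, key_bits=8):
        self.table = {}
        self.scale = 2 ** key_bits

    def _key(self, a, b):
        # Quantize operands so nearby inputs collide on the same entry.
        return (round(a * self.scale), round(b * self.scale))

    def fmul(self, a, b):
        k = self._key(a, b)
        if k in self.table:
            return self.table[k]  # reuse: skip the floating-point pipeline
        result = a * b            # exact hardware path
        self.table[k] = result
        return result

t = ApproxReuseTable()
print(t.fmul(0.5012, 0.2501))  # computed exactly, then cached
print(t.fmul(0.5009, 0.2499))  # close operands: approximate result reused
```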

Assessing the Adherence of an Industrial Autonomous Driving Framework to ISO 26262 Software Guidelines

  • Hamid Tabani
  • Leonidas Kosmidis
  • Jaume Abella
  • Francisco J. Cazorla
  • Guillem Bernat

The complexity and size of Autonomous Driving (AD) software are comparably higher than that of software implementing other (standard) functionalities in the car. To make things worse, a big fraction of AD software is not specifically designed for the automotive (or any other critical) domain, but the mainstream market. This brings uncertainty on to which extent AD software adheres to guidelines in safety standards. In this paper, we present our experience in applying ISO 26262 — the applicable functional safety standard for road vehicles — software safety guidelines to industrial AD software, in particular, Apollo, a heterogeneous Autonomous Driving framework used extensively in industry. We provide quantitative and qualitative metrics of compliance for many ISO 26262 recommendations on software design, implementation, and testing.

Tighter Dimensioning of Heterogeneous Multi-Resource Autonomous CPS with Control Performance Guarantees

  • Debayan Roy
  • Wanli Chang
  • Sanjoy K. Mitter
  • Samarjit Chakraborty

In modern autonomous systems, there is typically a large number of connected components realizing complex functionalities. For example, in autonomous vehicles (AVs), there are tens of millions of lines of code implemented on hundreds of sensors, controllers, and actuators. AVs have been deployed, mostly in trials and restricted environments, showing that substantial progress has been made in functionality development. However, they are still faced with two major challenges: (i) performance guarantee of safety-critical functions under all possible scenarios; (ii) functionality implementation with limited resources. These two challenges are conflicting because safety guarantees necessitate a worst-case analysis that is often very pessimistic for complex hardware/software systems, and thus require more resources. To address this, we study an abstraction of a heterogeneous cyber-physical system architecture consisting of a mix of high- and low-quality resources, such as time- and event-triggered resources, or wired and wireless resources. We show that by properly managing such a mix of resources and formulating a formal verification (model checking) problem, it is possible to tightly dimension the high-quality resource to the minimum (50% in certain cases) while providing control performance guarantees.

Dynamic Switching Speed Reconfiguration for Engine Performance Optimization

  • Chao Peng
  • Yecheng Zhao
  • Haibo Zeng

Today’s automotive engine control systems adopt several control strategies that come with tradeoffs between computational load and performance. The current practice is that the switching speeds at which the engine control system changes control strategy are fixed offline, typically based on the average driving need in a standard driving cycle (i.e., the vehicle speed profile over time). This is clearly suboptimal, since it fails to capture variation in the driving cycle, and the actual driving cycle may differ considerably from the standard one. In this paper, we propose to dynamically adjust switching speeds based on the predicted driving cycle. We develop a hybrid set of schedulability analysis techniques to tame the complexity of ensuring the real-time schedulability of engine control tasks. We design an effective and efficient optimization algorithm that provides close-to-optimal solutions. Experimental results demonstrate that our approach efficiently finds dynamic switching speeds that significantly improve engine performance over static ones.

A Memory-Efficient Markov Decision Process Computation Framework Using BDD-based Sampling Representation

  • He Zhou
  • Sunil P. Khatri
  • Jiang Hu
  • Frank Liu

Although the Markov Decision Process (MDP) has wide application in autonomous systems as a core model in Reinforcement Learning, a key bottleneck is the large memory footprint of the state-transition probability matrices. This is particularly problematic for computational platforms with limited memory, or for Bayesian MDP, which requires dozens of such matrices. To mitigate this difficulty, we propose a highly memory-efficient representation for probability matrices using Binary Decision Diagram (BDD) based sampling, and develop a corresponding (Bayesian/classical) MDP solver on a CPU-GPU platform. Simulation results indicate our approach reduces memory by one and two orders of magnitude for Bayesian and classical MDP, respectively.

Process, Circuit and System Co-optimization of Wafer Level Co-Integrated FinFET with Vertical Nanosheet Selector for STT-MRAM Applications

  • Trong Huynh-Bao
  • Anabela Veloso
  • Sushil Sakhare
  • Philippe Matagne
  • Julien Ryckaert
  • Manu Perumkunnil
  • Davide Crotti
  • Farrukh Yasin
  • Alessio Spessot
  • Arnaud Furnemont
  • Gouri Kar
  • Anda Mocuta

We present, for the first time, a co-integrated FinFET and vertical nanosheet transistor (VFET) process on a 300 mm silicon wafer for STT-MRAM applications, together with a holistic design-technology co-optimization (DTCO) and power-performance-area-cost (PPAC) analysis. The STT-MRAM bitcell and a 2 Mbit macro have been optimized and designed to assess the viability of the co-integration process and the advantages of vertical-channel transistors as STT-MRAM selectors. The GEM5 architectural system simulator has also been employed with Polybench workloads to assess energy savings at the system level. Enabling this co-integration requires four extra masks, which adds less than 10% cost for embedded chips. A 36% area reduction can be achieved for the STT-MRAM bitcell implemented with VFET selectors. With a UVLT flavor, an STT-MRAM bitcell comprising 3 nanosheets can deliver the same performance as one with a 4-fin LVT FinFET selector. A 2 Mbit STT-MRAM macro designed with VFET selectors offers a 17% reduction in read access latency, a 21% reduction in read energy per operation, and a 10% reduction in write energy per operation. A 7% energy saving for an STT-MRAM L2 cache using VFET selectors has been observed at the system level with Polybench workloads.

LL-PCM: Low-Latency Phase Change Memory Architecture

  • Nam Sung Kim
  • Choungki Song
  • Woo Young Cho
  • Jian Huang
  • Myoungsoo Jung

PCM is a promising non-volatile memory technology, as it offers a unique trade-off between density and latency compared with DRAM and flash memory. Although PCM is much faster than flash memory, it is still notably slower than DRAM, which can significantly degrade system performance. In this paper, we analyze a PCM implementation in depth and identify the primary cause of PCM’s long latency: a long interconnect (high resistance/capacitance) path between a cell and a sense-amp/write-driver. This in turn requires (1) a very large charge pump consuming ~20% of PCM chip space, ~50% of write-operation latency, and ~2× more power than the write operation itself; and (2) a large current sense-amp with a long time to pre-charge the interconnect path. We then propose the Low-Latency PCM (LL-PCM) architecture. Our analysis shows that LL-PCM can deliver 119% higher performance and consume 43% lower memory energy than PCM for memory-intensive applications. LL-PCM is only ~1% larger than PCM, as the cost of reducing the resistance/capacitance of the interconnect path is negated by its 4.1× smaller charge pump.

What does Vibration do to Your SSD?

  • Janki Bhimani
  • Tirthak Patel
  • Ningfang Mi
  • Devesh Tiwari

Vibration generated in modern computing environments such as autonomous vehicles, edge computing infrastructure, and data center systems is an increasing concern. In this paper, we systematically measure, quantify and characterize the impact of vibration on the performance of SSD devices. Our experiments and analysis uncover that exposure to both short-term and long-term vibration, even within the vendor-specified limits, can significantly affect SSD I/O performance and reliability.

Ultra-thin Skin Electronics for High Quality and Continuous Skin-Sensor-Silicon Interfacing

  • Leilai Shao
  • Sicheng Li
  • Ting Lei
  • Tsung-Ching Huang
  • Raymond Beausoleil
  • Zhenan Bao
  • Kwang-Ting Cheng

Skin-inspired electronics is emerging as a new paradigm due to increasing demands for conformable, high-quality skin-sensor-silicon (SSS) interfacing in wearable, electronic-skin and health monitoring applications. Advances in ultra-thin, flexible, stretchable and conformable materials have made skin electronics feasible. In this paper, we prototyped an active electrode (with a thickness ≤ 2 µm) that integrates the electrode with a thin-film transistor (TFT) based amplifier to effectively suppress motion artifacts. The fabricated ultra-thin amplifier achieves a gain of 32 dB at 20 kHz, demonstrating the feasibility of the proposed active electrode. Using atrial fibrillation (AF) detection from electrocardiogram (ECG) signals as an application driver, we further develop a simulation framework that takes into account all elements: the skin, the sensor, the amplifier and the silicon chip. Systematic and quantitative simulation results indicate that the proposed active electrode can effectively improve signal quality under motion noise (achieving a ≥30 dB improvement in signal-to-noise ratio (SNR)), which boosts classification accuracy by more than 19% for AF detection.

Enabling High-Dimensional Bayesian Optimization for Efficient Failure Detection of Analog and Mixed-Signal Circuits

  • Hanbin Hu
  • Peng Li
  • Jianhua Z. Huang

With increasing design complexity and stringent robustness requirements in applications such as automotive electronics, analog and mixed-signal (AMS) verification becomes a key bottleneck. Rare failure detection in a high-dimensional parameter space using minimal expensive simulation data is a major challenge. We address this challenge under a Bayesian learning framework using Bayesian optimization (BO). We formulate failure detection as a BO problem in which a chosen acquisition function is optimized to select the next (set of) optimal simulation sampling point(s) such that rare failures may be detected using a small amount of data. While providing an attractive black-box solution to design verification, in practice BO is limited in its ability to deal with high-dimensional problems. We propose to use random embedding to effectively reduce the dimensionality of a given verification problem, improving both the quality of BO-based optimal sampling and computational efficiency. We demonstrate the success of the proposed approach in detecting rare design failures under high-dimensional process variations that are completely missed by competitive smart sampling and BO techniques without dimension reduction.
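
Random embedding turns a D-dimensional sampling problem into a d-dimensional one by optimizing over z and mapping x = Az back into the full space. A sketch assuming scikit-optimize's gp_minimize as the BO engine; the "simulation" below is a made-up stand-in for an expensive AMS testbench:

```python
import numpy as np
from skopt import gp_minimize

D, d = 200, 5  # full and embedded dimensionality
rng = np.random.default_rng(1)
A = rng.standard_normal((D, d))  # fixed random embedding matrix

def simulate_circuit(x):
    """Stand-in for an expensive AMS simulation; returns a failure margin."""
    return float(np.sum(x[:3] ** 2) - 0.5)

def objective(z):
    x = np.clip(A @ np.asarray(z), -1.0, 1.0)  # embed into the full space
    return simulate_circuit(x)                  # BO drives this toward failure

res = gp_minimize(objective, [(-2.0, 2.0)] * d, n_calls=30, random_state=0)
print("closest-to-failure margin found:", res.fun)
```

The Gaussian-process surrogate only has to model a 5-dimensional function, which is what restores BO's sample efficiency at D = 200.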

High Performance Graph Convolutional Networks with Applications in Testability Analysis

  • Yuzhe Ma
  • Haoxing Ren
  • Brucek Khailany
  • Harbinder Sikka
  • Lijuan Luo
  • Karthikeyan Natarajan
  • Bei Yu

Applications of deep learning to electronic design automation (EDA) have recently begun to emerge, although they have mainly been limited to processing regular structured data such as images. However, many EDA problems require processing irregular structures, and it can be non-trivial to manually extract important features in such cases. In this paper, a high-performance graph convolutional network (GCN) model is proposed for processing irregular graph representations of logic circuits. A GCN classifier is first trained to predict observation point candidates in a netlist. The classifier is then used in an iterative process that proposes observation point insertion based on the classification results. Experimental results show the proposed GCN model has superior accuracy to classical machine learning models in predicting difficult-to-observe nodes. Compared with commercial testability analysis tools, the proposed observation point insertion flow achieves similar fault coverage with an 11% reduction in observation points and a 6% reduction in test pattern count.
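
The classifier is built from standard graph-convolution layers following the propagation rule H' = ReLU(D^(-1/2)(A+I)D^(-1/2)HW) (a generic sketch on a toy netlist graph, not the paper's exact model):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution: normalized neighbor aggregation + ReLU."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

# Toy 4-node circuit graph with 3 hypothetical features per node and
# 2 output channels.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.default_rng(0).random((4, 3))
W = np.random.default_rng(1).random((3, 2))
print(gcn_layer(A, H, W))  # per-node embeddings fed to the classifier head
```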

MRLoc: Mitigating Row-hammering based on memory Locality

  • Jung Min You
  • Joon-Sung Yang

With the increasing integration density of semiconductor designs, many problems have emerged; row-hammering is one of them. The row-hammering effect is a critical issue for reliable memory operation because repeated activations of a DRAM row can cause unexpected errors in adjacent rows, so it is necessary to address this problem. There are two main methods to deal with row-hammering: counter-based methods and probabilistic methods. This paper proposes an improved version of the latter and compares it with other probabilistic methods, PARA and PRoHIT. According to the evaluation results, the proposed method increases row-hammering reduction per refresh by 1.82× and 7.78× over PARA and PRoHIT on average, respectively.
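
Probabilistic mitigation in the PARA style refreshes a row's physical neighbors with small probability on every activation; MRLoc's refinement is to bias that probability using recently observed access locality. A sketch of the general idea (the locality heuristic below is a simplified illustration, not the paper's exact policy):

```python
import random
from collections import deque

P_BASE = 0.001               # PARA-style base refresh probability
history = deque(maxlen=64)   # recently activated rows (locality window)

def on_activate(row, refresh):
    """Call on every row activation; refresh(r) rewrites the cells of row r."""
    repeats = history.count(row)         # hammering shows up as locality
    p = min(1.0, P_BASE * (1 + repeats))
    history.append(row)
    if random.random() < p:
        refresh(row - 1)                 # victims are the physical neighbors
        refresh(row + 1)

refreshed = []
for _ in range(100_000):
    on_activate(1234, refreshed.append)  # one row hammered repeatedly
print("neighbor refreshes:", len(refreshed) // 2)
```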

System-level hardware failure prediction using deep learning

  • Xiaoyi Sun
  • Krishnendu Chakrabarty
  • Ruirui Huang
  • Yiquan Chen
  • Bing Zhao
  • Hai Cao
  • Yinhe Han
  • Xiaoyao Liang
  • Li Jiang

Disk and memory faults are the leading causes of server breakdown. A proactive solution is to predict such hardware failures at runtime, then isolate the hardware at risk and back up the data. However, current model-based predictors are incapable of using discrete time-series data, such as the values of device attributes, which convey high-level information about device behavior. In this paper, we propose a novel deep-learning based scheme for system-level hardware failure prediction. We normalize the distribution of sample attributes from different vendors to make use of diverse training sets. We propose a temporal Convolutional Neural Network based model that is insensitive to noise in the time dimension. Finally, we design a loss function to train the model effectively with extremely imbalanced samples. Experimental results from an open S.M.A.R.T. data set and an industrial data set show the effectiveness of the proposed scheme.

Enabling Practical Processing in and near Memory for Data-Intensive Computing

  • Onur Mutlu
  • Saugata Ghose
  • Juan Gómez-Luna
  • Rachata Ausavarungnirun

Modern computing systems suffer from the dichotomy between computation on one side, which is performed only in the processor (and accelerators), and data storage/movement on the other, which all other parts of the system are dedicated to. Due to this dichotomy, data moves a lot in order for the system to perform computation on it. Unfortunately, data movement is extremely expensive in terms of energy and latency, much more so than computation. As a result, a large fraction of system energy is spent and performance is lost solely on moving data in a modern computing system.

In this work, we re-examine the idea of reducing data movement by performing Processing in Memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked logic and DRAM, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the idea of PIM is not new, we examine two new approaches to enabling PIM: 1) exploiting analog properties of DRAM to perform massively-parallel operations in memory, and 2) exploiting 3D-stacked memory technology design to provide high bandwidth to in-memory logic. We conclude by discussing work on solving key challenges to the practical adoption of PIM.

Practical Near-Data Processing to Evolve Memory and Storage Devices into Mainstream Heterogeneous Computing Systems

  • Nam Sung Kim
  • Pankaj Mehra

The capacity of memory and storage devices is expected to increase drastically with adoption of the forthcoming memory and integration technologies. This is a welcome improvement, especially for datacenter servers running modern data-intensive applications. Nonetheless, for such servers to fully benefit from the increasing capacity, the bandwidth of interconnects between processors and these devices must also increase proportionally, which becomes ever costlier under unabating physical constraints. As a promising alternative to tackle this challenge cost-effectively, a heterogeneous computing paradigm referred to as near-data processing (NDP) has emerged. However, NDP has not yet been widely adopted by industry because of significant gaps between existing software stacks and those demanded by NDP-capable memory and storage devices. Aiming to overcome these gaps, we propose to turn memory and storage devices into familiar heterogeneous distributed computing systems. We then demonstrate the potential of such computing systems for existing data-intensive applications with two recently implemented NDP-capable devices. Finally, we conclude with a practical blueprint for exploiting NDP-based computing systems to speed up solving future computer-aided design and optimization problems.

HeadStart: Enforcing Optimal Inceptions in Pruning Deep Neural Networks for Efficient Inference on GPGPUs

  • Ning Lin
  • Hang Lu
  • Xin Wei
  • Xiaowei Li

Deep convolutional neural networks are well-known for their extensive parameters and computation intensity. Structured pruning is an effective solution to obtain a more compact model for efficient inference on GPGPUs without designing specific hardware accelerators. However, previous works resort to certain metrics for channel/filter pruning and count on labor-intensive fine-tuning to recover the accuracy loss. The “inception” of the pruned model, as another form factor, has an indispensable impact on the final accuracy, but its importance is often ignored in these works. In this paper, we show that an optimal inception is more likely to induce satisfactory performance and shortened fine-tuning iterations. We also propose a reinforcement learning based solution, termed HeadStart, that seeks to learn the best way of pruning toward the optimal inception. With the help of the specialized head-start network, it can automatically balance the tradeoff between the final accuracy and the preset speedup, rather than tilting toward one of them, which also differentiates it from existing works. Experimental results show that HeadStart can attain up to 2.25x inference speedup with only 1.16% accuracy loss, tested with large-scale images on various GPGPUs, and generalizes well to various cutting-edge DCNN models.

GATE: A Generalized Dataflow-level Approximation Tuning Engine For Data Parallel Architectures

  • Seokwon Kang
  • Yongseung Yu
  • Jiho Kim
  • Yongjun Park

Although approximate computing is widely used, it requires substantial programming effort to find appropriate approximation patterns among multiple pre-defined patterns to achieve high performance. We therefore propose an automatic approximation framework called GATE to uncover hidden opportunities in any data-parallel program, regardless of code pattern or application characteristics, using two compiler techniques: subgraph-level approximation (SGLA) and approximate thread merge (ATM). GATE also features conservative/aggressive tuning and dynamic calibration to maximize performance while maintaining the target output quality (TOQ) during runtime. Our framework achieves an average performance gain of 2.54x over the baseline with minimal accuracy loss.

LSIM: Ultra Lightweight Similarity Measurement for Mobile Graphics Applications

  • Yu-Chuan Chang
  • Wei-Ming Chen
  • Pi-Cheng Hsiu
  • Yen-Yu Lin
  • Tei-Wei Kuo

Perceptual similarity measurement allows mobile applications to eliminate unnecessary computations without compromising visual experience. Existing pixel-wise measures incur significant overhead with increasing display resolutions and frame rates. This paper presents an ultra lightweight similarity measure called LSIM, which assesses the similarity between frames based on the transformation matrices of graphics objects. To evaluate its efficacy, we integrate LSIM into the Open Graphics Library and conduct experiments on an Android smartphone with various mobile 3D games. The results show that LSIM is highly correlated with the most widely used pixel-wise measure SSIM, yet three to five orders of magnitude faster. We also apply LSIM to a CPU-GPU governor to suppress the rendering of similar frames, thereby further reducing computation energy consumption by up to 27.3% while maintaining satisfactory visual quality.
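
The measure can be sketched as a comparison of per-object transformation matrices between consecutive frames, replacing any per-pixel work (a sketch; the threshold and the all-objects aggregation rule are illustrative):

```python
import numpy as np

def lsim_similar(frame_a, frame_b, eps=1e-3):
    """frame_*: dict mapping object id -> 4x4 model-view transform.

    Frames count as similar when every object's transform barely moved,
    so the cost is independent of display resolution.
    """
    if frame_a.keys() != frame_b.keys():  # an object appeared or vanished
        return False
    return all(np.linalg.norm(frame_b[k] - frame_a[k]) < eps for k in frame_a)

I = np.eye(4)
frame1 = {"car": I, "tree": I}
frame2 = {"car": I + 1e-5, "tree": I}  # sub-threshold motion
print(lsim_similar(frame1, frame2))    # True: the renderer may skip this frame
```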

Efficient State Retention through Paged Memory Management for Reactive Transient Computing

  • Sivert T. Sliper
  • Domenico Balsamo
  • Nikos Nikoleris
  • William Wang
  • Alex S. Weddell
  • Geoff V. Merrett

Reactive transient computing systems preserve computational progress despite frequent power failures by suspending (saving state to nonvolatile memory) when detecting a power failure, and restoring once power returns. Existing methods inefficiently save and restore all allocated memory. We propose lightweight memory management that applies the concept of paging to load pages only when needed, and save only modified pages. We then develop a model that maximises available execution time by dynamically adjusting the suspend and restore voltage thresholds. Experiments on an MSP430FR5994 microcontroller show that our method reduces state retention overheads by up to 86.9% and executes algorithms up to 5.3× faster than the state-of-the-art.
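
The page-level policy can be sketched as: restore loads pages on demand, and suspend writes back only the pages dirtied since the last suspend (a sketch with a software dirty set; the MSP430 implementation tracks this inside the memory-management layer):

```python
PAGE_SIZE = 256

class PagedStateManager:
    def __init__(self, nvm):
        self.nvm = nvm       # nonvolatile backing store: a list of pages
        self.resident = {}   # page index -> volatile RAM copy
        self.dirty = set()

    def _page(self, p):
        if p not in self.resident:        # demand paging on restore
            self.resident[p] = list(self.nvm[p])
        return self.resident[p]

    def read(self, addr):
        return self._page(addr // PAGE_SIZE)[addr % PAGE_SIZE]

    def write(self, addr, value):
        p = addr // PAGE_SIZE
        self._page(p)[addr % PAGE_SIZE] = value
        self.dirty.add(p)                 # only dirtied pages need saving

    def suspend(self):                    # called when a power failure looms
        for p in self.dirty:
            self.nvm[p] = list(self.resident[p])
        self.dirty.clear()

nvm = [[0] * PAGE_SIZE for _ in range(8)]
mgr = PagedStateManager(nvm)
mgr.write(300, 42)   # touches only page 1
mgr.suspend()        # saves one page instead of the whole allocated memory
```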

NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning

  • Gagandeep Singh
  • Juan Gómez-Luna
  • Giovanni Mariani
  • Geraldo F. Oliveira
  • Stefano Corda
  • Sander Stuijk
  • Onur Mutlu
  • Henk Corporaal

The cost of moving data between the memory/storage units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. A promising paradigm to alleviate this data movement bottleneck is near-memory computing (NMC), which consists of placing compute units close to the memory/storage units. There is substantial research effort that proposes NMC architectures and identifies workloads that can benefit from NMC. System architects typically use simulation techniques to evaluate the performance and energy consumption of their designs. However, simulation is extremely slow, imposing long times for design space exploration. In order to enable fast early-stage design space exploration of NMC architectures, we need high-level performance and energy models.

We present NAPEL, a high-level performance and energy estimation framework for NMC architectures. NAPEL leverages ensemble learning to develop a model based on microarchitectural parameters and application characteristics. NAPEL training uses a statistical technique, called design of experiments, to collect representative training data efficiently. NAPEL provides early design space exploration 220× faster than a state-of-the-art NMC simulator, on average, with error rates of 8.5% and 11.6% for performance and energy estimations, respectively, compared to the NMC simulator. NAPEL is also capable of making accurate predictions for previously-unseen applications.

DREDGE: Dynamic Repartitioning during Dynamic Graph Execution

  • Andrew McCrabb
  • Eric Winsor
  • Valeria Bertacco

Graph-based algorithms have gained significant interest in several application domains. Solutions addressing the computational efficiency of such algorithms have mostly relied on many-core architectures. Cleverly laying out input graphs in storage, by placing adjacent vertices in the same storage unit (memory bank or cache unit), enables fast access during graph traversal. Dynamic graphs, however, must be continuously repartitioned to retain this benefit. Yet software repartitioning solutions rely on costly cross-vault communication to query and optimize the graph layout between algorithm iterations.

In this work, we propose DREDGE, a novel hardware solution to provide heuristic repartitioning optimizations in the background without extra communication. Our evaluation indicates that we achieve a 1.9x speedup, on average, over several graph algorithms and datasets, executing on a 24×24-core architecture, when compared against a baseline solution that does not repartition the dynamic graph. We estimated that DREDGE incurs only 1.5% area and 2.1% power overheads over an ARM A5 processor core.

ROC: DRAM-based Processing with Reduced Operation Cycles

  • Xin Xin
  • Youtao Zhang
  • Jun Yang

DRAM-based memory-centric computing architectures are promising solutions to tackle the challenges of the memory wall. In this paper, we develop a novel DRAM-based processing-in-memory (PIM) architecture that achieves fewer cycles for every basic operation than prior art. Our small yet fast in-memory computing units support basic logic operations including NOT, AND, and OR. Using those operations, along with shift and propagation, bitwise operations can be extended to word-wise operations, e.g., increment and comparison, with high efficiency. We also optimize the designs to exploit parallelism and data reuse to further improve the performance of compound operations. Compared with the most powerful state-of-the-art PIM architecture, we achieve comparable or even better performance while consuming only 6% of its area overhead.
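
Word-wise increment really can be composed from the supported bulk primitives: XOR is built from AND/OR/NOT, and the carry is rippled with shifts. A sketch over Python integers standing in for whole memory rows (bit i of the integer plays the role of column i):

```python
def bulk_xor(a, b):
    # XOR from the native PIM primitives: (a OR b) AND NOT (a AND b)
    return (a | b) & ~(a & b)

def pim_increment(x, width=8):
    """Increment a row-wide word using only AND/OR/NOT plus shifts."""
    mask = (1 << width) - 1
    carry, result = 1, x           # adding one seeds the carry
    for _ in range(width):         # ripple the carry: shift + AND
        new_carry = (result & carry) << 1
        result = bulk_xor(result, carry)
        carry = new_carry
    return result & mask

print(pim_increment(0b00001111))   # -> 16
print(pim_increment(0b11111111))   # -> wraps to 0
```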

NV-BNN: An Accurate Deep Convolutional Neural Network Based on Binary STT-MRAM for Adaptive AI Edge

  • Chih-Cheng Chang
  • Ming-Hung Wu
  • Jia-Wei Lin
  • Chun-Hsien Li
  • Vivek Parmar
  • Heng-Yuan Lee
  • Jeng-Hua Wei
  • Shyh-Shyuan Sheu
  • Manan Suri
  • Tian-Sheuan Chang
  • Tuo-Hung Hou

Binary STT-MRAM is a highly anticipated embedded nonvolatile memory technology for advanced logic nodes below 28 nm. Enabling its in-memory computing (IMC) capability is critical for enhancing AI at the edge. Based on the soon-available STT-MRAM, we report the first binary deep convolutional neural network (NV-BNN) capable of both local and remote learning. Exploiting intrinsic cumulative switching probability, accurate online training on CIFAR-10 color images (~90%) is realized using a relaxed endurance spec (switching ≤ 20 times) and a hybrid digital/IMC design. For offline training, the accuracy loss due to imprecise weight placement can be mitigated using a rapid non-iterative training-with-noise and fine-tuning scheme.

No Compromises: Secure NVM with Crash Consistency, Write-Efficiency and High-Performance

  • Fan Yang
  • Youyou Lu
  • Youmin Chen
  • Haiyu Mao
  • Jiwu Shu

Data encryption and authentication are essential for secure NVM. However, the security metadata they introduce must be written back to NVM atomically along with the data to provide crash consistency, which unfortunately incurs high overhead. To support fine-grained data protection without compromising performance, we propose cc-NVM. It first introduces an epoch-based mechanism that aggressively caches the security metadata in the CPU cache while retaining their consistency in NVM. Deferred spreading is also introduced to reduce the computation overhead of data authentication. Leveraging the hidden capability of data HMACs, we can always recover consistent but stale security metadata to its newest version. Compared to Osiris, a state-of-the-art secure NVM, cc-NVM improves performance by 20.4% on average. When the system crashes, instead of dropping all the data due to malicious attacks, cc-NVM is able to detect and locate the exact tampered data while incurring only 29.6% extra write traffic on average.

In-process Memory Isolation Using Hardware Watchpoint

  • Jinsoo Jang
  • Brent Byunghoon Kang

Memory disclosure vulnerabilities have been exploited to leak application secrets such as crypto keys (e.g., the Heartbleed Bug). To ameliorate this problem, we propose an in-process memory isolation mechanism that leverages a common hardware feature, namely hardware debugging. Specifically, we utilize a watchpoint to monitor a particular memory region containing secret data. We implemented a proof of concept (PoC) of our approach on the 64-bit ARM architecture, including the kernel patches and user APIs that help developers benefit from isolated memory use. We applied the approach to open-source applications such as OpenSSL and AESCrypt. The results of a performance evaluation show that our approach incurs a small amount of overhead.

H-ORAM: A Cacheable ORAM Interface for Efficient I/O Accesses

  • Liang Liu
  • Rujia Wang
  • Youtao Zhang
  • Jun Yang

Oblivious RAM (ORAM) is an effective security primitive for preventing access pattern leakage. By adding redundant memory accesses, ORAM prevents attackers from revealing the patterns in access sequences. However, ORAM tends to introduce severe performance degradation. As the address space to be protected grows, ORAM has to store the majority of data in lower-level storage, which further degrades system performance.

In this paper, we propose Hybrid ORAM (H-ORAM), a novel ORAM primitive to address the large performance degradation incurred when user data overflows to storage. H-ORAM consists of a batch scheduling scheme that enhances memory bandwidth usage, and a novel ORAM interface that returns data without waiting for the I/O access each time. We evaluate H-ORAM on a real machine implementation. The experimental results show that H-ORAM outperforms the state-of-the-art Path ORAM by 20×.

RansomBlocker: a Low-Overhead Ransomware-Proof SSD

  • Jisung Park
  • Youngdon Jung
  • Jonghoon Won
  • Minji Kang
  • Sungjin Lee
  • Jihong Kim

We present a low-overhead ransomware-proof SSD, called RansomBlocker (RBlocker). RBlocker provides 100% protection against all possible ransomware attacks by delaying every data deletion until the absence of an attack is guaranteed. To reduce the storage overhead of delayed deletion, RBlocker employs a time-out-based backup policy. Based on the fact that ransomware must store an encrypted version of the target files, early deletion of obsolete data is allowed if no encrypted write was detected for a short interval. Otherwise, RBlocker keeps the data for an interval long enough to guarantee the no-attack condition. For accurate in-line detection of encrypted writes, we leverage entropy- and CNN-based detectors in an integrated fashion. Our experimental results show that RBlocker can defend against all types of ransomware attacks with negligible overheads.
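
A minimal sketch of the entropy half of such a detector (the CNN half and the SSD integration are out of scope here): encrypted pages exhibit near-maximal byte entropy, so a threshold on Shannon entropy can flag suspicious writes. The buffers and the threshold mentioned in the comment are illustrative.

```python
import math, os
from collections import Counter

def byte_entropy(buf: bytes) -> float:
    """Shannon entropy of a buffer in bits/byte (max is 8.0)."""
    counts = Counter(buf)
    n = len(buf)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

plain = b"The quick brown fox jumps over the lazy dog. " * 100
random_like = os.urandom(4096)   # stands in for an encrypted page
print(f"plaintext entropy:  {byte_entropy(plain):.2f} bits/byte")
print(f"ciphertext entropy: {byte_entropy(random_like):.2f} bits/byte")
# A sustained run of writes above some threshold (e.g., ~7.5 bits/byte)
# would steer the SSD toward the delayed-deletion path.
```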

Transmit or Discard: Optimizing Data Freshness in Networked Embedded Systems with Energy Harvesting Sources

  • Zimeng Zhou
  • Chenchen Fu
  • Chun Jason Xue
  • Song Han

This paper explores how to optimize the freshness of real-time data in energy-harvesting-based networked embedded systems. We introduce the concept of Age of Information (AoI) to quantitatively measure data freshness and present a comprehensive analysis of the average AoI of real-time data with stochastic update arrival and energy replenishment rates. Both an optimal offline solution and an effective online solution are designed to judiciously select a subset of the real-time data updates and determine their corresponding transmission times to optimize the average AoI subject to energy constraints. Our extensive experiments validate the effectiveness of the proposed solutions and show that the two methods can improve the average AoI by 47.2% compared to state-of-the-art solutions at low energy replenishment rates.
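
To pin down the metric being optimized: AoI grows linearly between deliveries and, at each delivery, resets to the age of the update just received. The sketch below computes time-averaged AoI for a made-up delivery schedule; it illustrates the objective, not the paper's scheduling algorithms.

```python
def average_aoi(gen_times, recv_times, horizon):
    """Time-averaged age over [0, horizon]; inputs sorted, age starts at 0."""
    area, age, t = 0.0, 0.0, 0.0
    for g, r in zip(gen_times, recv_times):
        dt = r - t
        area += age * dt + dt * dt / 2   # age rises linearly between deliveries
        age = r - g                      # reset to the delivered update's age
        t = r
    dt = horizon - t
    area += age * dt + dt * dt / 2
    return area / horizon

# Updates generated at t=1,4,9 and delivered at t=2,6,10 (arbitrary numbers).
print(f"average AoI: {average_aoi([1, 4, 9], [2, 6, 10], horizon=12):.2f}")
```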

FPGA-Based Emulation of Embedded DRAMs for Statistical Error Resilience Evaluation of Approximate Computing Systems

  • Marco Widmer
  • Andrea Bonetti
  • Andreas Burg

Embedded DRAM (eDRAM) requires frequent, power-hungry refresh according to the worst-case retention time across PVT variations to avoid data loss. Abandoning the error-free paradigm by choosing sub-critical refresh rates that gracefully degrade the eDRAM content unlocks considerable power-saving opportunities, but requires understanding the effect of stochastic memory errors at the system/application level. We propose an FPGA-based platform featuring faulty eDRAM emulation based on advanced retention time models and silicon measurements for statistical error resilience evaluation of applications in a complete embedded system. We analyze the statistical QoS for various benchmarks under different sub-critical refresh rates and retention time distributions.
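
As a software caricature of what the platform emulates in hardware, the sketch below samples a per-bit retention time and flips any bit whose retention falls short of the refresh period; the log-normal parameters are placeholders, not the paper's silicon-calibrated model.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_retention_errors(words, refresh_period_s, n_bits=32):
    """Flip any bit whose sampled retention time is below the refresh period."""
    retention = rng.lognormal(np.log(0.05), 1.0, size=(words.size, n_bits))
    weights = 1 << np.arange(n_bits, dtype=np.uint64)
    masks = ((retention < refresh_period_s) * weights).sum(axis=1).astype(np.uint32)
    return words ^ masks

mem = rng.integers(0, 2**32, size=8, dtype=np.uint32)
for period in (0.01, 0.05, 0.2):   # seconds; longer period = more sub-critical
    corrupted = np.count_nonzero(inject_retention_errors(mem, period) != mem)
    print(f"refresh every {period:5.2f}s -> {corrupted}/8 corrupted words")
```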

Adapting Layer RBERs Variations of 3D Flash Memories via Multi-granularity Progressive LDPC Reading

  • Yajuan Du
  • Yao Zhou
  • Meng Zhang
  • Wei Liu
  • Shengwu Xiong

Existing studies have uncovered significant Raw Bit Error Rate (RBER) variations among different layers of 3D flash memories due to manufacturing process variation. These RBER variations cause widely varying read latencies when reading data with traditional Low-Density Parity-Check (LDPC) codes designed for planar flash memories, which leads to sub-optimal read performance of flash-based Solid-State Drives (SSDs).

To investigate this latency diversity, this paper first performs a preliminary experiment and observes that LDPC read levels, which are proportional to latencies, increase at diverse speeds as data retention progresses. Then, by exploiting this observation, a Multi-Granularity LDPC (MG-LDPC) read method is proposed to adapt the level increase speed for each layer. Five LDPC engines with varied increase granularities are designed to match different RBER growth rates. Finally, two implementations of MG-LDPC are applied to assign LDPC engines to each flash layer, either in a fixed way or dynamically according to prior read levels. Experimental results show that the two proposed implementations reduce SSD read response time by 21% and 47% on average, respectively.
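
The retry dynamics can be illustrated with a toy model: each retry raises the number of sensing levels, and a coarser-grained engine jumps levels in larger steps, reaching a decodable precision in fewer costly retries for fast-degrading layers. Level counts and step sizes below are invented.

```python
def reads_until_decode(required_level, step):
    """Return the retry count when sensing levels grow by `step` per retry."""
    level, retries = 1, 0
    while level < required_level:
        level += step
        retries += 1
    return retries

for layer, required in [("top (low RBER)", 2), ("middle", 5), ("bottom (high RBER)", 11)]:
    fine = reads_until_decode(required, step=1)     # fine-grained engine
    coarse = reads_until_decode(required, step=4)   # coarse-grained engine
    print(f"{layer:<18} fine: {fine} retries, coarse: {coarse} retries")
```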

A Hybrid Agent-based Design Methodology for Dynamic Cross-layer Reliability in Heterogeneous Embedded Systems

  • Siva Satyendra Sahoo
  • Bharadwaj Veeravalli
  • Akash Kumar

Technology scaling and architectural innovations have led to increasing ubiquity of embedded systems across applications with widely varying and often constantly changing performance and reliability specifications. However, the increasing physical fault-rates in electronic systems have led to single-layer reliability approaches becoming infeasible for resource-constrained systems. Dynamic Cross-layer reliability (CLR) provides scope for efficient adaptation to such QoS variations and increasing unreliability. We propose a design methodology for enabling QoS-aware CLR-integrated runtime adaptation in heterogeneous MPSoC-based embedded systems. Specifically, we propose a combination of reconfiguration cost-aware optimization at design-time and an agent-based optimization at run-time. We report a reduction of up to 51% and 37% in average reconfiguration cost and average energy consumption respectively over state-of-the-art approaches.

PRIMAL: Power Inference using Machine Learning

  • Yuan Zhou
  • Haoxing Ren
  • Yanqing Zhang
  • Ben Keller
  • Brucek Khailany
  • Zhiru Zhang

This paper introduces PRIMAL, a novel learning-based framework that enables fast and accurate power estimation for ASIC designs. PRIMAL trains machine learning (ML) models with design verification testbenches for characterizing the power of reusable circuit building blocks. The trained models can then be used to generate detailed power profiles of the same blocks under different workloads. We evaluate the performance of several established ML models on this task, including ridge regression, gradient tree boosting, multi-layer perceptron, and convolutional neural network (CNN). For average power estimation, ML-based techniques can achieve an average error of less than 1% across a diverse set of realistic benchmarks, outperforming a commercial RTL power estimation tool in both accuracy and speed (15x faster). For cycle-by-cycle power estimation, PRIMAL is on average 50x faster than a commercial gate-level power analysis tool, with an average error of less than 5%. In particular, our CNN-based method achieves a 35x speed-up and an error of 5.2% for cycle-by-cycle power estimation of a RISC-V processor core. Furthermore, our case study on a NoC router shows that PRIMAL can achieve a small estimation error of 4.5% using cycle-approximate traces from SystemC simulation.
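
In the same spirit, though with synthetic data in place of real testbench traces and no claim of matching PRIMAL's feature engineering, one can regress per-cycle power on per-cycle toggle activity:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n_cycles, n_signals = 2000, 64

# Per-cycle toggle activity of each signal, plus hidden switching costs.
toggles = rng.integers(0, 2, size=(n_cycles, n_signals)).astype(float)
per_signal_cost = rng.uniform(0.1, 1.0, n_signals)
power = toggles @ per_signal_cost + rng.normal(0, 0.2, n_cycles)

train = 1500
model = GradientBoostingRegressor().fit(toggles[:train], power[:train])
pred = model.predict(toggles[train:])
rel_err = np.abs(pred - power[train:]).mean() / power[train:].mean()
print(f"mean cycle-by-cycle relative error: {100 * rel_err:.1f}%")
```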

Partition and Propagate: an Error Derivation Algorithm for the Design of Approximate Circuits

  • Ilaria Scarabottolo
  • Giovanni Ansaloni
  • George A. Constantinides
  • Laura Pozzi

Inexact hardware design techniques have become popular in error-tolerant systems, where energy efficiency is a primary concern. Several techniques aim to identify circuit portions that can be discarded under an error constraint, but research on systematic methods to determine such error is still at an early stage. We herein illustrate a generic, scalable algorithm that determines the influence of each circuit gate on the final output. The algorithm first partitions the graph representing the circuit, then determines the error propagation model of the resulting subgraphs. When applied to existing approximate design frameworks, our solution improves their efficiency and result quality.

Performance, Power and Cooling Trade-Offs with NCFET-based Many-Cores

  • Martin Rapp
  • Sami Salamin
  • Hussam Amrouch
  • Girish Pahwa
  • Yogesh Chauhan
  • Jörg Henkel

Negative Capacitance Field-Effect Transistor (NCFET) is an emerging technology that incorporates a ferroelectric layer within the transistor gate stack to overcome the fundamental limit of sub-threshold swing in transistors. Even though physics-based NCFET models have been recently proposed, system-level NCFET models do not exist and research is still in its infancy. In this work, we are the first to investigate the impact of NCFET on performance, energy and cooling costs in many-core processors. Our proposed methodology starts from accurate physics models all the way up to the system level, where the performance and power of a many-core are widely affected. Our new methodology and system-level models allow, for the first time, the exploration of the novel trade-offs between performance gains and power losses that NCFET now offers to system-level designers. We demonstrate that an optimal ferroelectric thickness does exist. In addition, we reveal that current state-of-the-art power management techniques fail when NCFET (with a thick ferroelectric layer) comes into play.

STFL: Energy-Efficient Data Movement with Slow Transition Fast Level Signaling

  • Payman Behnam
  • Mahdi Nazm Bojnordi

Data movement in large caches consumes a significant amount of energy in modern computer systems. Low power interfaces have been proposed to address this problem. Unfortunately, the energy-efficiency of these techniques is largely limited due to undue latency overheads of low power wires and complex coding mechanisms. This paper proposes a hybrid technique for slow-transition, fast-level (STFL) signaling that creates a balance between power and bandwidth in the last level cache interface. Combined with STFL codes, the signaling technique significantly mitigates the performance impacts of low power wires, thereby improving the energy efficiency of data movement in memory systems. When applied to the last level cache of a contemporary multicore system, STFL improves the CPU energy-delay product by 9% as compared to a voltage-frequency scaled baseline. Moreover, the proposed architecture reduces the CPU energy by 26% and achieves 98% of the performance provided by a high-performance baseline.

Formal Verification of Security Critical Hardware-Firmware Interactions in Commercial SoCs

  • Sayak Ray
  • Nishant Ghosh
  • Ramya Jayaram Masti
  • Arun Kanuparthi
  • Jason M. Fung

We present an effective methodology for formally verifying security-critical flows in a commercial System-on-Chip (SoC) that involve extensive interaction between firmware (FW) and hardware (HW). We describe several HW-FW interaction scenarios that are typical in commercial SoCs. We highlight unique challenges associated with formal verification of the security properties of such interactions and discuss our approach of property-specific abstraction and software model checking to circumvent those challenges. To the best of our knowledge, this is the first exposition on formal co-verification of security-specific HW-FW interactions at the context and scale of commercial SoCs. Despite traditional scalability challenges, we demonstrate that many such flows are amenable to effective formal verification.

In Hardware We Trust: Gains and Pains of Hardware-assisted Security

  • Lejla Batina
  • Patrick Jauernig
  • Nele Mentens
  • Ahmad-Reza Sadeghi
  • Emmanuel Stapf

Data processing and communication in almost all electronic systems are based on Central Processing Units (CPUs). In order to guarantee confidentiality and integrity of the software running on a CPU, hardware-assisted security architectures are used. However, both the threat model and the non-functional platform requirements, i.e., performance and energy budget, differ when we go from high-end desktop computers and servers to low-end embedded devices that populate the Internet of Things (IoT). For high-end platforms, a relatively large energy budget is available to protect software against attacks. However, measures to optimize performance give rise to microarchitectural side-channel attacks. IoT devices, in contrast, are constrained in terms of energy consumption and do not incorporate the performance enhancements found in high-end CPUs. Hence, they are less likely to be susceptible to microarchitectural attacks, but are exposed to physical attacks that exploit, e.g., leakage in power consumption or fault injection. Whereas previous work mostly concentrates on a specific architecture, this paper covers the whole spectrum of computing systems, comparing the corresponding hardware architectures and the most relevant threats.

Protecting RISC-V against Side-Channel Attacks

  • Elke De Mulder
  • Samatha Gummalla
  • Michael Hutter

Software (SW) implementations of cryptographic algorithms are vulnerable to Side-channel Analysis (SCA) attacks, essentially relinquishing the key to the outside world through measurable physical properties of the processor such as power consumption and electromagnetic radiation. Protected SW implementations typically have significant timing and code size overhead as well as substantially long development time, because hands-on testing of the result is crucial. Plenty of scientific publications offer solutions to this problem for all kinds of algorithms, but they are not straightforward to implement, as they rely on device assumptions that are rarely met and do not take microarchitecture-related leakage into account. We present a solution to this problem by integrating side-channel analysis countermeasures into a RISC-V implementation. Our solution protects against first-order power or electromagnetic attacks while keeping the implementation costs as low as possible. We make use of state-of-the-art masking techniques and present a novel solution to protect memory access against SCA. Practical results are provided that demonstrate the leakage results of various cryptographic primitives running on our protected hardware platform.

ANN Based Admission Control for On-Chip Networks

  • Boqian Wang
  • Zhonghai Lu
  • Shenggang Chen

We propose an admission control method for Networks-on-Chip (NoCs) with a centralized Artificial Neural Network (ANN) admission controller, which improves system performance by predicting the most appropriate injection rate for each node from network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node, where the admission control is applied. Compared to previous work, our method builds a high-fidelity model between the network status and the injection rate regulation. Full-system simulation results show that our proposed method enhances application performance by 17.8% on average and by up to 23.8%.
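
As a sketch of the learning component (with made-up features and a made-up optimal-rate law, since the real inputs come from full-system NoC state), a small MLP can map observed network status to an injection-rate decision:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
# Features: [current injection rate, avg packet latency (cycles), buffer occupancy]
X = rng.uniform([0.0, 10.0, 0.0], [1.0, 200.0, 1.0], size=(1000, 3))
# Pretend optimum: back off as latency and buffer occupancy rise.
y = np.clip(1.0 - X[:, 1] / 250.0 - 0.5 * X[:, 2], 0.05, 1.0)

ann = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000,
                   random_state=0).fit(X, y)
congested = np.array([[0.8, 150.0, 0.9]])   # a heavily loaded node
print(f"suggested injection rate: {ann.predict(congested)[0]:.2f}")
```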

An Energy-Efficient Network-on-Chip Design using Reinforcement Learning

  • Hao Zheng
  • Ahmed Louri

The design space for energy-efficient Network-on-Chips (NoCs) has expanded significantly, comprising a number of techniques. The simultaneous application of these techniques to yield maximum energy efficiency requires the monitoring of a large number of system parameters, which often results in substantial engineering effort and complicated control policies. This motivates us to explore the use of a reinforcement learning (RL) approach that automatically learns an optimal control policy to improve NoC energy efficiency. First, we deploy power-gating (PG) and dynamic voltage and frequency scaling (DVFS) to simultaneously reduce both static and dynamic power. Second, we use RL to automatically explore the dynamic interactions among PG, DVFS, and system parameters, learn the critical system parameters contained in the router and cache, and eventually evolve optimal per-router control policies that significantly improve energy efficiency. Moreover, we introduce an artificial neural network (ANN) to efficiently implement the large state-action table required by RL. Simulation results using the PARSEC benchmarks show that the proposed RL approach reduces power consumption by 26% while improving system performance by 7%, as compared to a combined PG and DVFS design without RL. Additionally, the ANN design yields a 67% area reduction, as compared to a conventional RL implementation.
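
To ground the RL formulation, here is a toy tabular Q-learning loop for a single router choosing among power gating and two V/F levels; the states, reward shaping, and traffic model are simplified stand-ins, and the paper further replaces the table with an ANN:

```python
import numpy as np

rng = np.random.default_rng(3)
N_LOAD = 4                         # quantized router load levels
ACTIONS = ["power_gate", "low_vf", "high_vf"]
Q = np.zeros((N_LOAD, len(ACTIONS)))

def reward(load, action):
    # Penalize energy, and penalize slowness most when the router is busy.
    energy = {"power_gate": 0.0, "low_vf": 0.5, "high_vf": 1.0}[ACTIONS[action]]
    perf_penalty = load * (2 - action)
    return -(energy + perf_penalty)

alpha, gamma, eps = 0.1, 0.9, 0.1
load = 0
for step in range(20000):
    a = rng.integers(len(ACTIONS)) if rng.random() < eps else int(Q[load].argmax())
    r = reward(load, a)
    next_load = rng.integers(N_LOAD)   # traffic fluctuates randomly in this toy
    Q[load, a] += alpha * (r + gamma * Q[next_load].max() - Q[load, a])
    load = next_load

for s in range(N_LOAD):
    print(f"load {s}: best action = {ACTIONS[int(Q[s].argmax())]}")
```

The learned policy gates the router at low load and raises the V/F level as load grows, which is the qualitative behavior the abstract describes.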

Lightweight Mitigation of Hardware Trojan Attacks in NoC-based Manycore Computing

  • Venkata Yaswanth Raparti
  • Sudeep Pasricha

Data-snooping is a serious security threat in NoC fabrics that can lead to theft of sensitive information from applications executing on manycore processors. Hardware Trojans (HTs) covertly embedded in NoC components can carry out such snooping attacks. In this paper, we first describe a low-overhead snooping invalidation module (SIM) to prevent malicious data replication by HTs in NoCs. We then devise a snooping detection module (THANOS) to also detect malicious applications that utilize such HTs. Experimental analysis shows that unlike state-of-the-art mechanisms, SIM and THANOS not only mitigate snooping attacks but also improve NoC performance by 48.4% in the presence of these attacks, with a minimal ~2.15% area and ~5.5% power overhead.

Sparse 3-D NoCs with Inductive Coupling

  • Michihiro Koibuchi
  • Lambert Leong
  • Tomohiro Totoki
  • Naoya Niwa
  • Hiroki Matsutani
  • Hideharu Amano
  • Henri Casanova

Wireless interconnects based on inductive coupling technology are compelling propositions for designing 3-D integrated chips. This work addresses the heat dissipation problem on such systems. Although effective cooling technologies have been proposed for systems designed based on Through Silicon Via (TSV), their application to systems that use inductive coupling is problematic because of increased wireless-communication distance. For this reason, we propose two methods for designing sparse 3-D chips layouts and Networks on Chip (NoCs) based on inductive coupling. The first method computes an optimized 3-D chip layout and then generates a randomized network topology for this layout. The second method uses a standard stack chip layout with a standard network topology as a starting point, and then deterministically transforms it into either a “staircase” or a “checkerboard” layout. We quantitatively compare the designs produced by these two methods in terms of network and application performance. Our main finding is that the first method produces designs that ultimately lead to higher parallel application performance, as demonstrated for nine OpenMP applications in the NAS Parallel Benchmarks.

Surf-Bless: A Confined-interference Routing for Energy-Efficient Communication in NoCs

  • Peng Wang
  • Sobhan Niknam
  • Sheng Ma
  • Zhiying Wang
  • Todor Stefanov

In this paper, we address the problem of how to achieve energy-efficient confined-interference communication on a bufferless NoC, taking advantage of the low power consumption of such NoCs. We propose a novel routing approach called Surfing on a Bufferless NoC (Surf-Bless), where packets are assigned to domains and Surf-Bless guarantees that interference between packets is confined within a domain, i.e., there is no interference between packets assigned to different domains. By experiments, we show that our Surf-Bless routing approach is effective in supporting confined-interference communication and consumes much less energy than related approaches.

Effect of Distributed Directories in Mesh Interconnects

  • Marcos Horro
  • Mahmut T. Kandemir
  • Louis-Noël Pouchet
  • Gabriel Rodríguez
  • Juan Touriño

Recent manycore processors are kept coherent using scalable distributed directories. A paramount example is the Xeon Phi Knights Landing. It features 38 tiles packed in a single die, organized into a 2D mesh. Before accessing remote data, tiles need to query the distributed directory. The effect of this coherence traffic is poorly understood. We show that the apparent UMA behavior results from degradation of the peak performance. We develop ways to optimize the coherence traffic, core-to-core affinity, and the scheduling of a set of tasks on the mesh, leveraging the unique characteristics of processor units stemming from process variations.

BRIC: Locality-based Encoding for Energy-Efficient Brain-Inspired Hyperdimensional Computing

  • Mohsen Imani
  • Justin Morris
  • John Messerly
  • Helen Shu
  • Yaobang Deng
  • Tajana Rosing

Brain-inspired Hyperdimensional (HD) computing is a new computing paradigm emulating neuronal activity in high-dimensional space. The first step in HD computing is to map each data point into high-dimensional space (e.g., 10,000 dimensions), which requires the computation of thousands of operations for each element of data in the original domain. Encoding alone takes about 80% of the execution time of training. In this paper, we propose BRIC, a fully binary Brain-Inspired Classifier based on HD computing for energy-efficient and high-accuracy classification. BRIC introduces a novel encoding module based on random projection with a predictable memory access pattern which can efficiently be implemented in hardware. BRIC is the first HD-based approach which provides data projection with a 1:1 ratio to the original data and enables all training/inference computation to be performed using binary hypervectors. To further improve BRIC's efficiency, we develop an online dimension reduction approach which removes insignificant hypervector dimensions during training. Additionally, we designed a fully pipelined FPGA implementation which accelerates BRIC in both the training and inference phases. Our evaluation of BRIC on a wide range of classification applications shows that it can achieve 64.1× and 9.8× (43.8× and 6.1×) energy efficiency improvement and speedup, respectively, as compared to baseline HD computing during training (inference), while providing the same classification accuracy.
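
The core encoding step is straightforward to sketch: a fixed random projection followed by binarization maps each input to a binary hypervector, and nearby inputs map to similar hypervectors. This is a generic random-projection encoder, not necessarily BRIC's exact hardware-friendly variant:

```python
import numpy as np

rng = np.random.default_rng(4)
D, n_features = 10_000, 64
projection = rng.choice([-1, 1], size=(D, n_features))   # generated once, fixed

def encode(x):
    """Project into D dimensions and binarize by sign."""
    return (projection @ x >= 0).astype(np.uint8)

def hamming_similarity(h1, h2):
    return 1.0 - np.count_nonzero(h1 != h2) / h1.size

a = rng.normal(size=n_features)
b = a + 0.1 * rng.normal(size=n_features)   # nearby point
c = rng.normal(size=n_features)             # unrelated point
print(f"sim(a, b) = {hamming_similarity(encode(a), encode(b)):.2f}")  # high
print(f"sim(a, c) = {hamming_similarity(encode(a), encode(c)):.2f}")  # ~0.5
```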

Fast and Efficient Information Transmission with Burst Spikes in Deep Spiking Neural Networks

  • Seongsik Park
  • Seijoon Kim
  • Hyeokjun Choe
  • Sungroh Yoon

Spiking neural networks (SNNs) are considered one of the most promising artificial neural networks due to their energy-efficient computing capability. Recently, conversion of a trained deep neural network to an SNN has improved the accuracy of deep SNNs. However, most previous studies have not achieved satisfactory results in terms of inference speed and energy efficiency. In this paper, we propose a fast and energy-efficient information transmission method with burst spikes and a hybrid neural coding scheme in deep SNNs. Our experimental results show that the proposed methods can improve inference energy efficiency and shorten latency.

Deep-DFR: A Memristive Deep Delayed Feedback Reservoir Computing System with Hybrid Neural Network Topology

  • Kangjun Bai
  • Qiyuan An
  • Yang Yi

Deep neural networks (DNNs), the brain-like machine learning architecture, have gained immense success in data-intensive applications. In this work, a hybrid structured deep delayed feedback reservoir (Deep-DFR) computing model is proposed and fabricated. Our Deep-DFR employs memristive synapses working in a hierarchical information processing fashion with DFR modules as the readout layer, making our proposed deep learning structure both depth-in-space and depth-in-time. Our fabricated prototype, along with experimental results, demonstrates high energy efficiency with low hardware implementation cost. In applications to image classification on MNIST and SVHN, our Deep-DFR yields a 1.26~7.69X reduction in testing error compared to state-of-the-art DNN designs.

A Fault-Tolerant Neural Network Architecture

  • Tao Liu
  • Wujie Wen
  • Lei Jiang
  • Yanzhi Wang
  • Chengmo Yang
  • Gang Quan

New DNN accelerators based on emerging technologies, such as resistive random access memory (ReRAM), are gaining increasing research attention given their potential for “in-situ” data processing. Unfortunately, device-level physical limitations that are unique to these technologies may cause weight disturbance in memory and thus compromise the performance and stability of DNN accelerators. In this work, we propose a novel fault-tolerant neural network architecture to mitigate the weight disturbance problem without involving expensive retraining. Specifically, we propose a novel collaborative logistic classifier to enhance DNN stability by redesigning the binary classifiers augmented from both the traditional error correction output code (ECOC) and modern DNN training algorithms. We also develop an optimized variable-length “decode-free” scheme to further boost accuracy with a smaller number of classifiers. Experimental results on cutting-edge DNN models and complex datasets show that the proposed fault-tolerant neural network architecture can effectively rectify the accuracy degradation against weight disturbance for DNN accelerators at low cost, thus allowing for its deployment in a variety of mainstream DNNs.

A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM

  • Zhenhua Zhu
  • Hanbo Sun
  • Yujun Lin
  • Guohao Dai
  • Lixue Xia
  • Song Han
  • Yu Wang
  • Huazhong Yang

Convolutional Neural Networks (CNNs) play a vital role in machine learning. Emerging resistive random-access memories (RRAMs) and RRAM-based Processing-In-Memory architectures have demonstrated great potential in boosting both the performance and energy efficiency of CNNs. However, restricted by the immature process technology, it is hard to implement and fabricate a CNN accelerator chip based on multi-bit RRAM devices. In addition, existing single-bit RRAM based CNN accelerators only focus on binary or ternary CNNs, which suffer more than 10% accuracy loss compared with full-precision CNNs. This paper proposes a configurable multi-precision CNN computing framework based on single-bit RRAM, which consists of an RRAM computing overhead aware network quantization algorithm and a configurable multi-precision CNN computing architecture based on single-bit RRAM. The proposed method achieves accuracy equivalent to full-precision CNNs while lowering storage consumption and latency via multiple-precision quantization. The designed architecture supports accelerating multi-precision CNNs even with various precisions among different layers. Experimental results show that the proposed framework can reduce computing area by 70% and computing energy by 75% on average, with nearly no accuracy loss. The equivalent energy efficiency is 1.6~8.6× that of existing RRAM-based architectures with only 1.07% area overhead.
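
A common way to realize multi-bit weights on single-bit cells, and a plausible mental model for such a framework (the paper's exact mapping may differ), is bit slicing: one binary crossbar per weight bit-plane, with shift-and-add recombination of the partial sums:

```python
import numpy as np

rng = np.random.default_rng(5)
n_in, n_out, W_BITS = 16, 4, 4

weights = rng.integers(0, 2**W_BITS, size=(n_in, n_out))
x = rng.integers(0, 2, size=n_in)                 # binary activations

# One binary crossbar per weight bit-plane.
planes = [(weights >> b) & 1 for b in range(W_BITS)]
partial = [x @ plane for plane in planes]         # each is a crossbar readout
result = sum(p << b for b, p in enumerate(partial))

assert np.array_equal(result, x @ weights)        # matches full-precision MAC
```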

Noise Injection Adaption: End-to-End ReRAM Crossbar Non-ideal Effect Adaption for Neural Network Mapping

  • Zhezhi He
  • Jie Lin
  • Rickard Ewetz
  • Jiann-Shiun Yuan
  • Deliang Fan

In this work, we investigate various non-ideal effects (stuck-at faults (SAF), IR drop, thermal noise, shot noise, and random telegraph noise) of ReRAM crossbars when employing them as dot-product engines for deep neural network (DNN) acceleration. In order to examine the impact of those non-ideal effects, we first develop a comprehensive framework called PytorX based on the mainstream PyTorch DNN framework. PytorX can perform end-to-end training, mapping, and evaluation for crossbar-based neural network accelerators, considering all of the above non-ideal effects of ReRAM crossbars together. Experiments based on PytorX show that directly mapping a trained large-scale DNN onto a crossbar without considering these non-ideal effects can lead to complete system malfunction (i.e., accuracy equal to random guessing) as the neural network goes deeper and wider. In particular, to address SAF side effects, we propose a digital SAF error correction algorithm to compensate for crossbar output errors, which needs only one-time profiling to achieve almost no system accuracy degradation. Then, to overcome IR drop effects, we propose a Noise Injection Adaption (NIA) methodology that incorporates the statistics of the current shift caused by IR drop in each crossbar as stochastic noise in the DNN training algorithm, which efficiently regularizes the DNN model to make it intrinsically adaptive to non-ideal ReRAM crossbars. It is a one-time training method that does not require retraining for every specific crossbar. Optimizing the system operating frequency can easily take care of the remaining non-ideal effects. Various experiments on different DNNs for image recognition are conducted to show the efficacy of our proposed methodology.
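
The essence of training-with-noise is small enough to sketch: perturb the weights with crossbar-like noise in the forward pass while updating the clean weights, so the learned model tolerates deployment-time perturbations. The tiny logistic model and noise scale are stand-ins, not the paper's IR-drop statistics:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 400, 10
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w, lr, noise_std = np.zeros(d), 0.1, 0.05
for epoch in range(200):
    w_noisy = w + rng.normal(0, noise_std, d)   # emulate crossbar perturbation
    p = 1 / (1 + np.exp(-(X @ w_noisy)))        # noisy forward pass
    w -= lr * X.T @ (p - y) / n                 # update the clean weights

# Evaluate under fresh deployment-time noise:
p = 1 / (1 + np.exp(-(X @ (w + rng.normal(0, noise_std, d)))))
print(f"accuracy under noise: {((p > 0.5) == y).mean():.2f}")
```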

A Novel Covert Channel Attack Using Memory Encryption Engine Cache

  • Youngkwang Han
  • John Kim

Microarchitectural covert channel attacks are a threat when multiple tenants share hardware resources such as the last-level cache. In this work, we propose a novel covert channel attack that exploits new microarchitecture that has been introduced to support memory encryption, in particular the memory encryption engine (MEE) cache. The MEE cache is a shared resource that is utilized only when accessing the integrity tree data, and it provides an opportunity for a stealthy covert channel attack. However, there are challenges, since the MEE cache organization is not publicly known and its access behavior differs from that of a conventional cache. We demonstrate how the MEE cache can be exploited to establish covert channel communication.

Designing Secure Cryptographic Accelerators with Information Flow Enforcement: A Case Study on AES

  • Zhenghong Jiang
  • Hanchen Jin
  • G. Edward Suh
  • Zhiru Zhang

Designing a secure cryptographic accelerator is challenging as vulnerabilities may arise from design decisions and implementation flaws. To provide high security assurance, we propose to design and build cryptographic accelerators with hardware-level information flow control so that the security of an implementation can be formally verified. This paper uses an AES accelerator as a case study to demonstrate how to express security requirements of a cryptographic accelerator as information flow policies for security enforcement. Our AES prototype on an FPGA shows that the proposed protection has a marginal impact on area and performance.

SafeSpec: Banishing the Spectre of a Meltdown with Leakage-Free Speculation

  • Khaled N. Khasawneh
  • Esmaeil Mohammadian Koruyeh
  • Chengyu Song
  • Dmitry Evtyushkin
  • Dmitry Ponomarev
  • Nael Abu-Ghazaleh

Speculative attacks, such as Spectre and Meltdown, target speculative execution to access privileged data and leak it through a side channel. In this paper, we introduce SafeSpec, a new model for supporting speculation in a way that is immune to side-channel leakage, by storing the side effects of speculative instructions in separate structures until they commit. Additionally, we address the possibility of covert channels from speculative instructions to committed instructions. We develop a cycle-accurate model of a modified x86-64 processor design and show that the performance impact is negligible.

SpectreGuard: An Efficient Data-centric Defense Mechanism against Spectre Attacks

  • Jacob Fustos
  • Farzad Farshchi
  • Heechul Yun

Speculative execution is an essential performance-enhancing technique in modern processors, but it has been shown to be insecure. In this paper, we propose SpectreGuard, a novel defense mechanism against Spectre attacks. In our approach, sensitive memory blocks (e.g., secret keys) are marked using a simple OS/library API and are then selectively protected by hardware from Spectre attacks via a low-cost micro-architecture extension. This technique allows microprocessors to maintain high performance while giving software developers the control to make security and performance trade-offs.

PAPP: Prefetcher-Aware Prime and Probe Side-channel Attack

  • Daimeng Wang
  • Zhiyun Qian
  • Nael Abu-Ghazaleh
  • Srikanth V. Krishnamurthy

CPU memory prefetchers can substantially interfere with prime-and-probe cache side-channel attacks, especially on in-order CPUs, which use aggressive prefetching. This interference is not accounted for in previous attacks. In this paper, we propose PAPP, a Prefetcher-Aware Prime and Probe attack that can operate even in the presence of aggressive prefetchers. Specifically, we reverse engineer the prefetcher and replacement policy on several CPUs and use these insights to design a prime-and-probe attack that minimizes the impact of the prefetcher. We evaluate PAPP using the Cache Side-channel Vulnerability (CSV) metric and demonstrate substantial improvements in channel quality under different conditions.

HardScope: Hardening Embedded Systems Against Data-Oriented Attacks

  • Thomas Nyman
  • Ghada Dessouky
  • Shaza Zeitouni
  • Aaro Lehikoinen
  • Andrew Paverd
  • N. Asokan
  • Ahmad-Reza Sadeghi

Memory-unsafe programming languages like C and C++ leave many (embedded) systems vulnerable to attacks like control-flow hijacking. However, defenses against control-flow attacks, such as (fine-grained) randomization or control-flow integrity, are ineffective against data-oriented attacks and the more expressive Data-oriented Programming (DOP) attacks that bypass state-of-the-art defenses.

We propose run-time scope enforcement (RSE), a novel approach that efficiently mitigates all currently known DOP attacks by enforcing compile-time memory safety constraints, such as variable visibility rules, at run-time. We present HardScope, a proof-of-concept implementation of hardware-assisted RSE for RISC-V, and show it has a low performance overhead of 3.2% for embedded benchmarks.

An Efficient Multi-fidelity Bayesian Optimization Approach for Analog Circuit Synthesis

  • Shuhan Zhang
  • Wenlong Lyu
  • Fan Yang
  • Changhao Yan
  • Dian Zhou
  • Xuan Zeng
  • Xiangdong Hu

This paper presents an efficient multi-fidelity Bayesian optimization approach for analog circuit synthesis. The proposed method significantly reduces the overall computational cost by fusing a simple but potentially inaccurate low-fidelity model with a few accurate but expensive high-fidelity samples. Gaussian Process (GP) models are employed to model the low- and high-fidelity black-box functions separately. The nonlinear map between the low-fidelity model and the high-fidelity model is also modelled as a Gaussian process. A fused GP model combining the low- and high-fidelity models can thus be built. An acquisition function based on the fused GP model is used to balance exploitation and exploration. The fused GP model evolves gradually as new data points are selected sequentially by maximizing the acquisition function. Experimental results show that our proposed method reduces simulation time by up to 65.5% compared with the state-of-the-art single-fidelity Bayesian optimization method, while exhibiting more stable performance and a more promising practical prospect.
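
A stripped-down version of the fusion idea, using a simple additive residual GP rather than the paper's nonlinear map, can be written in a few lines with scikit-learn; the test functions are arbitrary:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def hi(x):  return np.sin(3 * x) + 0.3 * x   # "expensive" accurate model
def lo(x):  return np.sin(3 * x)             # cheap low-fidelity surrogate

x_lo = np.linspace(0, 3, 30)[:, None]        # many cheap samples
x_hi = np.array([[0.2], [1.1], [2.0], [2.8]])  # few expensive samples

gp_lo = GaussianProcessRegressor(RBF()).fit(x_lo, lo(x_lo.ravel()))
resid = hi(x_hi.ravel()) - gp_lo.predict(x_hi)   # high-minus-low residual
gp_res = GaussianProcessRegressor(RBF()).fit(x_hi, resid)

x_test = np.array([[1.5]])
fused = gp_lo.predict(x_test) + gp_res.predict(x_test)
print(f"fused: {fused[0]:.3f}  true: {hi(1.5):.3f}")
```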

Rethinking Sparsity in Performance Modeling for Analog and Mixed Circuits using Spike and Slab Models

  • Mohamed Baker Alawieh
  • Sinead A. Williamson
  • David Z. Pan

As integrated circuit technologies continue to scale, efficient performance modeling becomes indispensable. Recently, several new learning paradigms have been proposed to reduce the computational cost associated with accurate performance modeling. A common attribute among most of these paradigms is leveraging sparsity to build efficient performance models. In this work, we propose a new perspective on incorporating sparsity into the modeling task by utilizing spike-and-slab feature selection techniques. Practically, our proposed method uses two different priors on the model coefficients, chosen according to their importance. This is incorporated into a mixture model that can be built within a hierarchical Bayesian framework to select the important features and find the model coefficients. Our numerical experiments demonstrate that the proposed approach achieves better results than traditional sparse modeling techniques while also providing valuable insight into the important features of the model.

WellGAN: Generative-Adversarial-Network-Guided Well Generation for Analog/Mixed-Signal Circuit Layout

  • Biying Xu
  • Yibo Lin
  • Xiyuan Tang
  • Shaolan Li
  • Linxiao Shen
  • Nan Sun
  • David Z. Pan

In back-end analog/mixed-signal (AMS) design flow, well generation persists as a fundamental challenge for layout compactness, routing complexity, circuit performance and robustness. The immaturity of AMS layout automation tools comes to a large extent from the difficulty in comprehending and incorporating designer expertise. To mimic the behavior of experienced designers in well generation, we propose a generative adversarial network (GAN) guided well generation framework with a post-refinement stage leveraging the previous high-quality manually-crafted layouts. Guiding regions for wells are first created by a trained GAN model, after which the well generation results are legalized through post-refinement to satisfy design rules. Experimental results show that the proposed technique is able to generate wells close to manual designs with comparable post-layout circuit performance.

Digital Compatible Synthesis, Placement and Implementation of Mixed-Signal Time-Domain Computing

  • Zhengyu Chen
  • Hai Zhou
  • Jie Gu

Mixed-signal time-domain computing (TC) has recently drawn significant attention due to its high efficiency in applications such as machine learning accelerators. However, due to the nature of analog and mixed-signal design, there is a lack of a systematic synthesis and place-and-route flow for time-domain circuits. This paper proposes a comprehensive design flow for TC. In the front-end, a variation-aware, digital-compatible synthesis flow is proposed. In the back-end, a placement technique using a graph-based optimization engine is proposed to deal with the especially stringent matching requirements in TC. Simulation results show significant improvement over prior analog placement methods. A 55nm test chip demonstrates that the proposed design flow can meet the stringent timing matching targets for TC with a significant performance boost over conventional digital design.

A Rigorous Approach for the Sparsification of Dense Matrices in Model Order Reduction of RLC Circuits

  • Charalampos Antoniadis
  • Nestor Evmorfopoulos
  • Georgios Stamoulis

The integration of more components into modern Systems-on-Chip (SoCs) has led to very large RLC parasitic networks consisting of millions of nodes, which have to be simulated at many time points or frequencies to verify the proper operation of the chip. Model Order Reduction (MOR) techniques have been employed routinely to replace the large-scale parasitic model with a model of lower order that has a similar response at the input/output ports. However, all established MOR techniques result in dense system matrices that render their simulation impractical. To this end, in this paper we propose a methodology for the sparsification of the dense circuit matrices resulting from Model Order Reduction of general RLC circuits, which employs a sequence of algorithms based on the computation of the nearest diagonally dominant matrix and the sparsification of the corresponding graph. Experimental results indicate that a high sparsity ratio of the reduced system matrices can be achieved with very small loss of accuracy.

Enabling Complex Stimuli in Accelerated Mixed-Signal Simulation

  • Sara Divanbeigi
  • Evan Aditya
  • Zhongpin Wang
  • Markus Olbrich

In the era of advancing technology, increasing circuit complexity requires faster simulators for the verification step. The piecewise-linear simulation approach provides an efficient and accurate solution. In this paper, a state-of-the-art mixed-signal simulator is described. The approach is extended to new exponential and quadratic stimuli. This requires a comprehensive derivation of mathematical equations, which removes the need for computationally expensive evaluation. The new stimuli are simulated in several circuits and compared to a conventional simulator. The results show significant run-time acceleration with high accuracy. The approach therefore meets industrial requirements, which demand simulation with various input forms and non-linear components.

Scalable Generic Logic Synthesis: One Approach to Rule Them All

  • Heinz Riener
  • Eleonora Testa
  • Winston Haaswijk
  • Alan Mishchenko
  • Luca Amarù
  • Giovanni De Micheli
  • Mathias Soeken

This paper proposes a novel methodology for multi-level logic synthesis that is independent of a specific graph data structure, and instead formulates synthesis procedures using an abstract concept definition of a logic representation. The idea is to capture the essence of optimisations in a general manner and tailor only small performance-critical sections to the underlying logic representation. This generic yet scalable approach saves many man-months of development time and enables logic synthesis and technology-mapping procedures parameterised in a logic representation. We present the generic design methodology and demonstrate its practicality by providing a complete state-of-the-art logic synthesis flow.

Comprehensive Search for ECO Rectification Using Symbolic Sampling

  • Victor N. Kravets
  • Nian-Ze Lee
  • Jie-Hong R. Jiang

The task of an engineering change order (ECO) is to update the current implementation of a design according to its revised specification with minimal modification. Prior studies show that the amount of design modification depends largely on the selection of rectification points, i.e., the input pins of gates whose functionality should be rectified with some patch circuitry. In realistic ECOs, as the netlist of the current implementation has been heavily optimized to meet design objectives, it is usually structurally dissimilar to the netlist of the revised specification, which is synthesized with only lightweight optimization. This paper proposes an ECO solution for optimized designs that is robust against the structural dissimilarity caused by design optimization. It locates candidate rectification points in a sampling domain, which significantly improves the scalability of the rectification search. To synthesize the patch circuitry, a structurally independent rewiring formulation is proposed to reuse existing logic in the implementation. Based on the proposed method, a newly developed engine is evaluated on engineering changes arising in the design of microprocessors. Its ability to derive patches of superior quality is demonstrated in comparison to industrial tools.

Embedding Functions Into Reversible Circuits: A Probabilistic Approach to the Number of Lines

  • Niels Gleinig
  • Frances Ann Hubis
  • Torsten Hoefler

In order to compute a non-invertible function on a reversible circuit, one needs to “embed” the function into a larger function that has some garbage bits, corresponding to additional lines. The problem of determining the minimal number of garbage bits needed to embed a given function has attracted extensive research, largely motivated by quantum computing, where the number of lines equals the number of qubits. However, all known approaches either lack theoretical quality guarantees (bounds on approximation factors) or require exponential runtime. We present an efficient probabilistic approximation algorithm with theoretical bounds.
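
For context, the exact-but-exponential computation is tiny: by a standard result, the minimum number of garbage bits is ceil(log2(mu)), where mu is the largest number of inputs mapping to a single output. The brute-force sketch below enumerates all inputs, which is exactly the exponential cost the paper's probabilistic algorithm avoids:

```python
from collections import Counter
from math import ceil, log2

def min_garbage_bits(f, n_inputs):
    """f: function on integers 0..2^n - 1; returns the minimal garbage count."""
    multiplicity = Counter(f(x) for x in range(2**n_inputs))
    mu = max(multiplicity.values())   # most-repeated output pattern
    return ceil(log2(mu))

# 2-bit AND: three inputs map to 0, one maps to 1 -> needs 2 garbage bits.
print(min_garbage_bits(lambda x: (x & 1) & (x >> 1), n_inputs=2))  # -> 2
```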

Disjoint-Support Decomposition and Extraction for Interconnect-Driven Threshold Logic Synthesis

  • Hao Chen
  • Shao-Chun Hung
  • Jie-Hong R. Jiang

Threshold logic circuits are artificial neural networks whose neuron outputs are binarized, and are thus amenable to efficient, multiplier-free hardware implementation of machine learning applications. Amid the revival of threshold logic synthesis, this work lays the foundations of disjoint-support decomposition and the extraction operation for threshold logic functions. They lead to a synthesis procedure for interconnect minimization of threshold logic circuits, an important, but not well addressed, objective in both neural network and nanometer circuit design. Experimental results show that our method can efficiently and effectively reduce interconnect as well as weight/threshold values over highly optimized circuits, making it suitable for implementation using emerging technologies.

Reducing the Multiplicative Complexity in Logic Networks for Cryptography and Security Applications

  • Eleonora Testa
  • Mathias Soeken
  • Luca Amarù
  • Giovanni De Micheli

Reducing the number of AND gates plays a central role in many cryptography and security applications. We propose a logic synthesis algorithm and tool to minimize the number of AND gates in a logic network composed of AND, XOR, and inverter gates. Our approach is fully automatic and exploits cut enumeration algorithms to explore optimization potential in local subcircuits. The experimental results show that our approach can reduce the number of AND gates by 34% on average compared to generic size optimization algorithms. Further, we are able to reduce the number of AND gates by up to 76% in the best-known benchmarks from the cryptography community.

SMatch: Structural Matching for Fast Resynthesis in FPGAs

  • Rafael Trapani Possignolo
  • Jose Renau

Designers wait several hours to get synthesis, placement, and routing results even for small changes. Commercial FPGA flows allow for resynthesis after code changes; however, they target large code changes, with incremental flows that are not very effective. We propose SMatch, a flow for FPGAs with a novel incremental elaboration and novel incremental FPGA placement and routing that improves the state-of-the-art by reducing the amount of placement and routing work needed. We evaluate our approach against commercial FPGA flows. Our method finishes synthesis, placement, and routing in under 30s for most changes in publicly available benchmarks with negligible QoR impact, being over 20× faster than existing incremental FPGA flows.

Toward an Open-Source Digital Flow: First Learnings from the OpenROAD Project

  • Tutu Ajayi
  • Vidya A. Chhabria
  • Mateus Fogaça
  • Soheil Hashemi
  • Abdelrahman Hosny
  • Andrew B. Kahng
  • Minsoo Kim
  • Jeongsup Lee
  • Uday Mallappa
  • Marina Neseem
  • Geraldo Pradipta
  • Sherief Reda
  • Mehdi Saligane
  • Sachin S. Sapatnekar
  • Carl Sechen
  • Mohamed Shalan
  • William Swartz
  • Lutong Wang
  • Zhehong Wang
  • Mingyu Woo
  • Bangqi Xu

We describe the planned Alpha release of OpenROAD, an open-source end-to-end silicon compiler. OpenROAD will help realize the goal of “democratization of hardware design”, by reducing cost, expertise, schedule and risk barriers that confront system designers today. The development of open-source, self-driving design tools is in and of itself a “moon shot” with numerous technical and cultural challenges. The open-source flow incorporates a compatible open-source set of tools that span logic synthesis, floorplanning, placement, clock tree synthesis, global routing and detailed routing. The flow also incorporates analysis and support tools for static timing analysis, parasitic extraction, power integrity analysis, and cloud deployment. We also note several observed challenges, or “lessons learned”, with respect to development of open-source EDA tools and flows.

ALIGN: Open-Source Analog Layout Automation from the Ground Up

  • Kishor Kunal
  • Meghna Madhusudan
  • Arvind K. Sharma
  • Wenbin Xu
  • Steven M. Burns
  • Ramesh Harjani
  • Jiang Hu
  • Desmond A. Kirkpatrick
  • Sachin S. Sapatnekar

This paper presents analog layout automation efforts under the ALIGN (“Analog Layout, Intelligently Generated from Netlists”) project for fast layout generation using a modular approach based on a mix of algorithmic and machine learning-based tools. The road to rapid turnaround is based on an approach that detects structure and hierarchy in the input netlist and uses a grid based philosophy for layout. The paper provides a view of the current status of the project, challenges in developing open-source code with an academic/industry team, and nuts-and-bolts issues such as working with abstracted PDKs, navigating the “wall” between secured IP and open-source software, and securing access to example designs.

Essential Building Blocks for Creating an Open-source EDA Project

  • Tsung-Wei Huang
  • Chun-Xun Lin
  • Guannan Guo
  • Martin D. F. Wong

Open source has started energizing both industrial and academic research and development in electronic design automation (EDA) systems. By moving to open source, we can speed up our efforts and work with others who are working toward the same goals, while reducing costs and improving end products. However, building an open-source project is much more than placing the codebase on the web. In this paper, we discuss the essential building blocks for creating an impactful open-source project, including the source repository, project landing page, documentation, and continuous integration. We also cover the use of web-based frameworks to design a showcase project that attracts the community's attention. We then share our experience in developing an open-source timing analyzer (OpenTimer) and a parallel task programming library (Cpp-Taskflow), both of which are being used in many industrial and academic EDA research projects.

Open-Source EDA Tools and IP, A View from the Trenches

  • Elad Alon
  • Krste Asanović
  • Jonathan Bachrach
  • Borivoje Nikolić

We describe our experience developing and promoting a set of open-source tools and IP over the last 9 years, including the Chisel hardware construction language, the Rocket Chip SoC generator, and the BAG analog layout generator.

A 1.17 TOPS/W, 150fps Accelerator for Multi-Face Detection and Alignment

  • Huiyu Mo
  • Leibo Liu
  • Wenping Zhu
  • Qiang Li
  • Hong Liu
  • Wenjing Hu
  • Yao Wang
  • Shaojun Wei

Face detection and alignment are highly correlated, computation-intensive tasks that have not yet been flexibly supported by any facial-oriented accelerator. This work proposes the first unified accelerator for multi-face detection and alignment, along with optimizations of the multi-task cascaded convolutional networks algorithm, to implement both multi-face detection and alignment. First, clustering non-maximum suppression is proposed to significantly reduce intersection-over-union computation and eliminate the hardware-interference sorting process, bringing a 16.0% speed-up without any loss. Second, a new pipeline architecture is presented to implement the proposal network in a more computation-efficient manner, with 41.7% less multiplier usage and a 38.3% decrease in memory capacity compared with the similar method. Third, a batch schedule mechanism is proposed to improve the hardware utilization of the fully-connected layer by 16.7% on average with variable input numbers in batch processing. Based on the TSMC 28 nm CMOS process, this accelerator consumes only 6.7ms at 400 MHz to simultaneously process 5 faces per image and achieves 1.17 TOPS/W power efficiency, which is 54.8× higher than the state-of-the-art solution.

Analog/Mixed-Signal Hardware Error Modeling for Deep Learning Inference

  • Angad S. Rekhi
  • Brian Zimmer
  • Nikola Nedovic
  • Ningxi Liu
  • Rangharajan Venkatesan
  • Miaorong Wang
  • Brucek Khailany
  • William J. Dally
  • C. Thomas Gray

Analog/mixed-signal (AMS) computation can be more energy efficient than digital approaches for deep learning inference, but incurs an accuracy penalty from precision loss. Prior AMS approaches focus on small networks/datasets, which can maintain accuracy even with 2b precision. We analyze applicability of AMS approaches to larger networks by proposing a generic AMS error model, implementing it in an existing training framework, and investigating its effect on ImageNet classification with ResNet-50. We demonstrate significant accuracy recovery by exposing the network to AMS error during retraining, and we show that batch normalization layers are responsible for this accuracy recovery. We also introduce an energy model to predict the requirements of high-accuracy AMS hardware running large networks and use it to show that for ADC-dominated designs, there is a direct tradeoff between energy efficiency and network accuracy. Our model predicts that achieving < 0.4% accuracy loss on ResNet-50 with AMS hardware requires a computation energy of at least ~300 fJ/MAC. Finally, we propose methods for improving the energy-accuracy tradeoff.

A 3T/Cell Practical Embedded Nonvolatile Memory Supporting Symmetric Read and Write Access Based on Ferroelectric FETs

  • Juejian Wu
  • Hongtao Zhong
  • Kai Ni
  • Yongpan Liu
  • Huazhong Yang
  • Xueqing Li

Making embedded memory symmetric provides the capability of memory access along both rows and columns, which opens up opportunities for significant energy and time savings when only a portion of the data in a word needs to be accessed. This work investigates the use of ferroelectric field-effect transistors (FeFETs), an emerging nonvolatile, low-power, deeply-scalable, CMOS-compatible transistor technology, and proposes a new 3-transistor/cell symmetric nonvolatile memory (SymNVM). With ~1.67× higher density than the prior FeFET design, significant energy and latency improvements are achieved, as evaluated and discussed in depth in this paper.

A Fast, Reliable and Wide-Voltage-Range In-Memory Computing Architecture

  • William Simon
  • Juan Galicia
  • Alexandre Levisse
  • Marina Zapater
  • David Atienza

As the computational complexity of applications on the consumer market, such as high-definition video encoding and deep neural networks, becomes ever more demanding, novel ways to efficiently compute data-intensive workloads are being explored. In this context, In-Memory Computing (IMC) solutions, and particularly bitline computing in SRAM, appear promising as they mitigate one of the most energy-consuming aspects of computation: data movement. While IMC architectural-level characteristics have been defined by the research community, only a few works so far have explored the implementation of such memories at a low level. Furthermore, the proposed solutions are either slow (<1 GHz), area-hungry (10T SRAM), or suffer from read disturb and corruption issues. Overall, there is no extensive design study considering realistic assumptions at the circuit level. In this work we propose a fast (up to 2.2 GHz), 6T SRAM-based, reliable (no read disturb issues), and wide-voltage-range (from 0.6 to 1 V) IMC architecture using local bitlines. Beyond standard read and write, the proposed architecture can perform copy, addition and shift operations at the array level. As addition is the slowest operation, we propose a modified carry chain adder, providing a 2× carry propagation improvement. The proposed architecture is validated using a 28 nm bulk high-performance technology PDK with CMOS variability and post-layout simulations. High-density SRAM bitcells (0.127 μm²) enable an area efficiency of 59.7% for a 256×128 array, on par with current industrial standards.
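
The underlying bitline-computing primitive is easy to state functionally (a behavioral sketch only; the paper's contribution is the circuit-level realization with local bitlines): activating two SRAM rows at once makes each bitline behave as a wired-AND of the two cells and each complementary bitline as a NOR.

```python
def bitline_compute(word_a, word_b, width=32):
    """Behavioral model of 6T SRAM bitline computing: reading two rows
    simultaneously yields AND on the bitlines and NOR on their complements."""
    mask = (1 << width) - 1
    bl = word_a & word_b              # BL stays high only if both cells store 1
    blb = ~(word_a | word_b) & mask   # BLB: high only if both cells store 0
    return bl, blb
```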

BitBlade: Area and Energy-Efficient Precision-Scalable Neural Network Accelerator with Bitwise Summation

  • Sungju Ryu
  • Hyungjun Kim
  • Wooseok Yi
  • Jae-Joon Kim

Deep Neural Networks (DNNs) have various performance requirements and power constraints depending on the application. To maximize the energy efficiency of hardware accelerators across applications, the accelerators need to support various bit-width configurations. When designing bit-reconfigurable accelerators, each PE must have variable shift-addition logic, which takes a large amount of area and power. This paper introduces an area- and energy-efficient precision-scalable neural network accelerator (BitBlade), which reduces the control overhead for variable shift-addition using a bitwise summation method. The proposed BitBlade, when synthesized in a 28nm CMOS technology, showed a 41% reduction in area and a 36-46% reduction in energy compared to the state-of-the-art precision-scalable architecture [14].
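
The bitwise-summation idea can be sketched as follows (an illustrative reconstruction, with 2-bit slices assumed): partial products that share the same shift amount are summed first across all operand pairs, so the expensive shift-add is applied once per shift amount rather than once per PE.

```python
def bitwise_summation_dot(a_list, b_list, bits=8, slice_bits=2):
    """Dot product via 2b x 2b partial products grouped by shift amount."""
    groups = {}                                    # shift -> accumulated partial sum
    for a, b in zip(a_list, b_list):
        for i in range(0, bits, slice_bits):
            for j in range(0, bits, slice_bits):
                pa = (a >> i) & ((1 << slice_bits) - 1)
                pb = (b >> j) & ((1 << slice_bits) - 1)
                groups[i + j] = groups.get(i + j, 0) + pa * pb
    # one shared shift-add per distinct shift amount
    return sum(psum << shift for shift, psum in groups.items())

assert bitwise_summation_dot([13, 200], [77, 9]) == 13 * 77 + 200 * 9
```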

Acceleration of DNN Backward Propagation by Selective Computation of Gradients

  • Gunhee Lee
  • Hanmin Park
  • Namhyung Kim
  • Joonsang Yu
  • Sujeong Jo
  • Kiyoung Choi

The training process of a deep neural network commonly consists of three phases: forward propagation, backward propagation, and weight update. In this paper, we propose a hardware architecture to accelerate the backward propagation. Our approach applies to neural networks that use rectified linear units (ReLUs). Considering that the backward propagation results in a zero activation gradient whenever the corresponding activation is zero, we can safely skip those gradient calculations. Based on this observation, we design an efficient hardware accelerator for training deep neural networks by selectively computing gradients. We show the effectiveness of our approach through experiments with various network models.
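
The observation is directly visible in the ReLU backward pass; below is a minimal NumPy sketch (illustrative of the skipping rule, not of the paper's hardware datapath):

```python
import numpy as np

def relu_backward_selective(grad_out, activations):
    """Gradient w.r.t. the ReLU input is zero wherever the activation is zero,
    so those gradient computations can simply be skipped."""
    mask = activations > 0
    grad_in = np.zeros_like(grad_out)
    grad_in[mask] = grad_out[mask]   # only the surviving positions are computed
    return grad_in
```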

C3-Flow: Compute Compression Co-Design Flow for Deep Neural Networks

  • Matthew Sotoudeh
  • Sara S. Baghsorkhi

Existing approaches to neural network compression have failed to holistically address algorithmic (training accuracy) and computational (inference performance) demands of real-world systems, particularly on resource-constrained devices. We present C3-Flow, a new approach adding non-uniformity to low-rank approximations and designed specifically to enable highly-efficient computation on common hardware architectures while retaining more accuracy than competing methods. Evaluation on two state-of-the-art acoustic models (versus existing work, empirical limit study approaches, and hand-tuned models) demonstrates up to 60% lower error. Finally, we show that our co-design approach achieves up to 14X inference speedup across three Haswell- and Broadwell-based platforms.

ABM-SpConv: A Novel Approach to FPGA-Based Acceleration of Convolutional Neural Network Inference

  • Dong Wang
  • Ke Xu
  • Qun Jia
  • Soheil Ghiasi

Hardware accelerators for convolutional neural network (CNN) inference have been extensively studied in recent years. The reported designs tend to utilize a similar underlying architecture based on multiplier-accumulator (MAC) arrays, which has the practical consequence of limiting FPGA-based accelerator performance by the number of available on-chip DSP blocks, while leaving other resources under-utilized. To address this problem, we consider a transformation to the convolution computation, which transforms the accelerator design space and relaxes the pressure on the required DSP resources. We demonstrate that our approach enables us to strike a judicious balance between utilization of the on-chip memory, logic, and DSP resources, due to which our accelerator considerably outperforms the state of the art. We report the effectiveness of our approach on a Stratix-V GXA7 FPGA, showing a 55% throughput improvement while using 6.25% fewer DSP blocks, compared to the best reported CNN accelerator on the same device.

Pushing the speed limit of constant-time discrete Gaussian sampling. A case study on the Falcon signature scheme

  • Angshuman Karmakar
  • Sujoy Sinha Roy
  • Frederik Vercauteren
  • Ingrid Verbauwhede

Sampling from a discrete Gaussian distribution has applications in lattice-based post-quantum cryptography. Several efficient solutions have been proposed in recent years. However, making a Gaussian sampler secure against timing attacks has turned out to be a challenging research problem. In this work, we present a toolchain to instantiate an efficient constant-time discrete Gaussian sampler of arbitrary standard deviation and precision. We observe an interesting property of the mapping from input random bit strings to samples during Knuth-Yao sampling and propose an efficient way of minimizing the Boolean expressions for the mapping. Our minimization approach results in up to 37% faster discrete Gaussian sampling compared to previous work. Finally, we apply our optimized and secure Gaussian sampler in the lattice-based digital signature algorithm Falcon, a NIST submission, and provide experimental evidence that the overall performance of the signing algorithm degrades by at most 33% due to the additional overhead of ‘constant-time’ sampling, including the 60% overhead of random number generation. Contrary to general belief, our results indirectly show that the use of discrete Gaussian samples in digital signature algorithms can be beneficial.
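
The paper builds on Knuth-Yao sampling; purely for intuition, the sketch below shows the simpler constant-time CDT (cumulative table) approach instead, whose defining property is that every table entry is touched regardless of the random input. Table values and the precision are placeholders, and Python cannot, of course, guarantee true constant time.

```python
import secrets

def ct_cdt_sample(cdf, precision_bits=16):
    """Constant-time-style sampling: scan the entire CDF table so the
    access pattern is independent of the secret random value."""
    r = secrets.randbits(precision_bits)
    idx = 0
    for threshold in cdf:             # always a full scan, never an early exit
        idx += int(r >= threshold)    # branch-free accumulation
    return idx

# toy half-Gaussian CDF scaled to 16 bits (placeholder values)
sample = ct_cdt_sample([26000, 50000, 60000, 64000, 65300, 65536])
```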

Full-Lock: Hard Distributions of SAT instances for Obfuscating Circuits using Fully Configurable Logic and Routing Blocks

  • Hadi Mardani Kamali
  • Kimia Zamiri Azar
  • Houman Homayoun
  • Avesta Sasan

In this paper, we propose a novel SAT-resistant logic-locking technique, denoted Full-Lock, to obfuscate and protect hardware against threats including IP piracy and reverse engineering. Full-Lock is constructed using a set of small-size fully Programmable Logic and Routing block (PLR) networks. The PLRs are SAT-hard instances with reasonable power, performance and area overheads, which are used to obfuscate (1) the routing of a group of selected wires and (2) the logic of the gates preceding and succeeding the selected wires. Full-Lock resists removal attacks and breaks a SAT attack by significantly increasing the complexity of each SAT iteration.

A Cellular Automata Guided Obfuscation Strategy For Finite-State-Machine Synthesis

  • Rajit Karmakar
  • Suman Sekhar Jana
  • Santanu Chattopadhyay

A popular countermeasure against IP piracy relies on obfuscating the Finite State Machine (FSM), which is assumed to be the heart of a digital system. In this paper, we propose to use a special class of non-group additive cellular automata (CA) called D1*CA, and its counterpart D1*CAdual, to obfuscate each state transition of an FSM. The synthesized FSM exhibits correct state transitions only for a correct key, which is the designer’s secret. The proposed easily testable, key-controlled FSM synthesis scheme can thwart reverse engineering attacks, thus offering IP protection.

An Efficient Spare-Line Replacement Scheme to Enhance NVM Security

  • Jie Xu
  • Dan Feng
  • Yu Hua
  • Fangting Huang
  • Wen Zhou
  • Wei Tong
  • Jingning Liu

Non-volatile memories (NVMs) are vulnerable to serious threats due to endurance variation. We identify a new type of malicious attack, called Uniform Address Attack (UAA), which performs uniform and sequential writes to each line of the whole memory and wears out the weaker lines (lines with lower endurance) early. Experimental results show that the lifetime of NVMs under UAA is reduced to 4.1% of the ideal lifetime. To counter such attacks, we propose a spare-line replacement scheme called Max-WE (Maximize the Weak lines’ Endurance). By employing weak-priority and weak-strong-matching strategies for spare-line allocation, Max-WE is able to maximize the number of writes that the weakest lines can endure. Furthermore, Max-WE reduces the storage overhead of the mapping table by 85% by adopting a hybrid spare-line mapping scheme. Experimental results show that Max-WE can improve the lifetime by 9.5×, with spare-line and mapping-table overheads of 10% and 0.016% of the total space, respectively.

Analyzing Parallel Real-Time Tasks Implemented with Thread Pools

  • Daniel Casini
  • Alessandro Biondi
  • Giorgio Buttazzo

Although several works in the literature have targeted predictable execution models for parallel tasks, limited attention has been devoted to studying how specific implementation techniques may affect their execution. This paper highlights issues that can arise when executing parallel tasks with thread pools, which may lead to deadlocks and performance degradation when adopting blocking synchronization mechanisms. A new parallel task model, inspired by a realistic design found in popular software systems, is first presented to study this problem. Then, formal conditions to ensure the absence of deadlocks and schedulability analysis techniques are proposed under both global and partitioned scheduling.
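
The deadlock pattern the paper warns about is easy to reproduce (a generic illustration, not the authors' task model): a parent task blocks on a child that can never start, because the parent itself occupies the pool's only worker.

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)

def parent():
    child = pool.submit(lambda: 42)   # child is queued behind the running parent
    return child.result()             # parent blocks while holding the only worker

# pool.submit(parent).result() would never return: the child cannot run
# until the parent finishes, and the parent cannot finish until the child runs.
```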

Scheduling and Analysis of Parallel Real-Time Tasks with Semaphores

  • Xu Jiang
  • Nan Guan
  • Weichen Liu
  • Maolin Yang

This paper, for the first time, studies the scheduling and analysis of parallel real-time tasks with semaphores. In parallel task systems, each task may issue multiple requests to a semaphore, which raises new challenges for the design and analysis problems. We propose a new locking protocol, LPP, which limits the maximum number of requests to a semaphore by a task that can block other tasks at any time. We develop analysis techniques to safely bound task response times, with which we prove that the best real-time performance is achieved if only one request to a semaphore by a task is allowed to block other tasks at a time. Experiments under different parameter settings are conducted to compare our proposed protocol and analysis techniques with the state-of-the-art spinlock protocol and analysis techniques for parallel real-time tasks.

Real-Time Scheduling and Analysis of Synchronous OpenMP Task Systems with Tied Tasks

  • Jinghao Sun
  • Nan Guan
  • Xiaoqing Wang
  • Chenhan Jin
  • Yaoyao Chi

Synchronous parallel tasks are widely used in HPC for pursuing high average performance, but little attention has been paid to guaranteeing their timing predictability. OpenMP is a promising framework for multi-core real-time embedded systems, yet synchronous OpenMP tasks are significantly more difficult to schedule and analyze due to constraints posed by the OpenMP specification. An important OpenMP feature is the tied task, which must execute on the same thread throughout its whole life cycle. This paper designs a novel method, called group scheduling, to schedule synchronous OpenMP tasks: it divides tasks into several groups and assigns some of them to dedicated cores in order to isolate tied tasks. We derive a response time bound computable in linear time. Experiments with both randomly generated and realistic OpenMP tasks show that our new bound significantly outperforms the existing bound.

DCFNoC: A Delayed Conflict-Free Time Division Multiplexing Network on Chip

  • Tomás Picornell
  • José Flich
  • Carles Hernández
  • José Duato

The adoption of many-cores in safety-critical systems requires real-time-capable networks-on-chip (NoCs). In this paper we propose a new time-predictable NoC design paradigm in which contention within the network is eliminated. This paradigm builds on the Channel Dependency Graph (CDG) and guarantees the absence of contention by design. Our delayed conflict-free NoC (DCFNoC) is able to naturally inject messages using a TDM period equal to the optimal theoretical bound, without the need for a computationally demanding offline process. Results show that DCFNoC guarantees time predictability at very low implementation cost.

Learning Temporal Specifications from Imperfect Traces Using Bayesian Inference

  • Artur Mrowca
  • Martin Nocker
  • Sebastian Steinhorst
  • Stephan Günnemann

Verification is essential to prevent malfunctioning of software systems. Model checking allows verifying conformity with nominal behavior. As the manual definition of specifications for such systems becomes infeasible, automated techniques to mine specifications from data become increasingly important. Existing approaches produce specifications of limited length, do not segregate functions, and do not easily accommodate expert input. We present BaySpec, a dynamic mining approach that extracts temporal specifications from Bayesian models, which represent behavioral patterns. This allows learning specifications of arbitrary length from imperfect traces. Within this framework we introduce a novel extraction algorithm that, for the first time, mines LTL specifications from such models.

Accelerating FPGA Prototyping through Predictive Model-Based HLS Design Space Exploration

  • Shuangnan Liu
  • Francis CM Lau
  • Benjamin Carrion Schafer

One of the advantages of High-Level Synthesis (HLS), also called C-based VLSI design, over traditional RT-level VLSI design flows is that multiple micro-architectures with unique area vs. performance tradeoffs can be automatically generated by setting different synthesis options, typically in the form of synthesis directives specified as pragmas in the source code. This design space exploration (DSE) is very time-consuming and can easily take multiple days for complex designs. At the same time, because of the complexity of designing large ASICs, verification teams now routinely make use of emulation and prototyping to test the circuit before the silicon is taped out. This also allows embedded software designers to start their work earlier in the design process, further reducing turn-around times (TAT). In this work, we present a method to automatically re-optimize ASIC designs specified as behavioral descriptions for HLS to FPGAs for emulation and prototyping, based on the observation that synthesis directives that lead to efficient micro-architectures for ASICs do not directly translate into optimal micro-architectures in FPGAs. This implies that the HLS DSE process would have to be completely repeated for the target FPGA. To avoid this, this work presents a predictive model-based method that takes as input the results of an ASIC HLS DSE and automatically, without the need to re-explore the behavioral description, finds the Pareto-optimal micro-architectures for the target FPGA. Experimental results comparing our predictive model-based method against completely re-exploring the search space show that our proposed method works well.

Sample-Guided Automated Synthesis for CCSL Specifications

  • Ming Hu
  • Tongquan Wei
  • Min Zhang
  • Frédéric Mallet
  • Mingsong Chen

The Clock Constraint Specification Language (CCSL) has been widely investigated in verifying causal and temporal timing behaviors of real-time embedded systems. However, due to limited expertise in formal modeling, it is difficult for requirement engineers to completely and accurately derive CCSL specifications from natural language-based design descriptions. To address this problem, we present a novel approach that facilitates automated synthesis of CCSL specifications under the guidance of sampled (expected) timing behaviors of target systems. By encoding sampled behaviors and incomplete CCSL constraints provided by requirement engineers using our proposed transformation templates, the CCSL specification synthesis problem can be naturally converted into a SKETCH synthesis problem, which enables the automated generation of CCSL specifications with high accuracy. Experiments on both well-known benchmarks and synthetic examples demonstrate the effectiveness and scalability of our approach.

DHOOM: Reusing Design-for-Debug Hardware for Online Monitoring

  • Neetu Jindal
  • Sandeep Chandran
  • Preeti Ranjan Panda
  • Sanjiva Prasad
  • Abhay Mitra
  • Kunal Singhal
  • Shubham Gupta
  • Shikhar Tuli

Runtime verification employs dedicated hardware or software monitors to check whether program properties hold at runtime. However, these monitors often incur high area or performance overheads depending on whether they are implemented in hardware or software. In this work, we propose DHOOM, an architectural framework for runtime monitoring of program assertions, which exploits a reconfigurable fabric present alongside a processor core in combination with the vestigial on-chip Design-for-Debug hardware. This combination of hardware features allows DHOOM to minimize the overall performance overhead of runtime verification, even when subject to a given area constraint. We present an algorithm for dynamically selecting an effective subset of assertion monitors that can be accommodated in the available programmable fabric, while instrumenting the remaining assertions in software. We show that our proposed strategy, while respecting area constraints, reduces the performance overhead of runtime verification by up to 32% compared with a baseline of software-only monitors.

Efficient System Architecture in the Era of Monolithic 3D: Dynamic Inter-tier Interconnect and Processing-in-Memory

  • Dylan Stow
  • Itir Akgun
  • Wenqin Huangfu
  • Yuan Xie
  • Xueqi Li
  • Gabriel H. Loh

Emerging Monolithic Three-Dimensional (M3D) integration technology will not only provide improved circuit density through the high-bandwidth coupling of multiple vertically-stacked layers, but it can also provide new architectural opportunities for on-chip computation, memory, and communication that are beyond the capabilities of existing process and packaging technologies. For example, with massive parallel communication between heterogeneous memory and compute layers, existing processing-in-memory architectures can be optimized and expanded, developing into efficient and flexible near-data processors. Additionally, multiple tiers of interconnect can be dynamically leveraged to provide an efficient, scalable interconnect fabric that spans the three-dimensional system. This work explores some of the challenges and opportunities presented by M3D technology for emerging computer architectures, with focus on improving efficiency and increasing system flexibility.

RTL-to-GDS Tool Flow and Design-for-Test Solutions for Monolithic 3D ICs

  • Heechun Park
  • Kyungwook Chang
  • Bon Woong Ku
  • Jinwoo Kim
  • Edward Lee
  • Daehyun Kim
  • Arjun Chaudhuri
  • Sanmitra Banerjee
  • Saibal Mukhopadhyay
  • Krishnendu Chakrabarty
  • Sung Kyu Lim

Monolithic 3D ICs overcome the limitations of existing through-silicon-via (TSV) based 3D ICs by providing denser vertical connections with nano-scale inter-layer vias (ILVs). In this paper, we demonstrate a complete RTL-to-GDS design flow for monolithic 3D ICs, which is based on commercial 2D place-and-route (P&R) tools and clever ways of extending them to handle 3D IC designs and simulations. We also provide a low-cost built-in self-test (BIST) method to detect the various faults that can occur on ILVs. Lastly, we present a resistive random access memory (ReRAM) compiler that generates memory modules to be integrated in monolithic 3D ICs.

MobiEye: An Efficient Cloud-based Video Detection System for Real-time Mobile Applications

  • Jiachen Mao
  • Qing Yang
  • Ang Li
  • Hai Li
  • Yiran Chen

In recent years, machine learning research has largely shifted focus from the cloud to the edge. While the resulting algorithm- and hardware-level optimizations have enabled local execution of the majority of deep neural networks (DNNs) on edge devices, the sheer magnitude of DNNs associated with real-time video detection workloads has forced them to remain relegated to remote execution in the cloud. This is problematic when combined with the strict latency requirements of such workloads, and it imposes a unique set of challenges not directly addressed in prior works. In this work, we design MobiEye, a cloud-based video detection system optimized for deployment in real-time mobile applications. MobiEye is able to achieve up to a 32% reduction in latency compared to a conventional implementation of a video detection system, with only a marginal reduction in accuracy.

Enabling File-Oriented Fast Secure Deletion on Shingled Magnetic Recording Drives

  • Shuo-Han Chen
  • Ming-Chang Yang
  • Yuan-Hao Chang
  • Chun-Feng Wu

Existing secure deletion approaches are inefficient at erasing data permanently because file systems have no knowledge of the data layout on the storage device, nor is the storage device aware of file information within the file systems. This inefficiency is exaggerated on the emerging shingled magnetic recording (SMR) drive due to its inherent sequential-write constraint. On SMR drives, secure deletion requests may lead to serious write amplification and performance degradation if the data layout is not properly configured. This observation motivates us to propose a file-oriented fast secure deletion (FFSD) strategy to alleviate the negative impacts of SMR drives’ sequential-write constraint and improve the efficiency of secure deletion operations on SMR drives. A series of experiments demonstrates the capability of the proposed strategy to improve the efficiency of secure deletion on SMR drives.

Enabling Failure-resilient Intermittently-powered Systems Without Runtime Checkpointing

  • Wei-Ming Chen
  • Pi-Cheng Hsiu
  • Tei-Wei Kuo

Self-powered intermittent systems enable accumulative execution in unstable power environments, where checkpointing is often adopted as a means to achieve data consistency and system recovery under power failures. However, existing approaches based on the checkpointing paradigm normally require system suspension and/or logging at runtime. This paper presents a design which enables failure-resilient intermittently-powered systems without runtime checkpointing. Our design enforces the consistency and serializability of concurrent task execution while maximizing computation progress, as well as allows instant system recovery after power resumption, by leveraging the characteristics of data accessed in hybrid memory. We integrated the design into FreeRTOS running on a Texas Instruments device. Experimental results show that our design achieves up to 11.8 times the computation progress achieved by checkpointing-based approaches, while reducing the recovery time by nearly 90%.

Sensor Drift Calibration via Spatial Correlation Model in Smart Building

  • Tinghuan Chen
  • Bingqing Lin
  • Hao Geng
  • Bei Yu

Sensor drift is an intractable obstacle to practical temperature measurement in smart buildings. In this paper, we propose a sensor spatial correlation model. Given prior knowledge, maximum a posteriori (MAP) estimation is performed to calibrate drifts. The MAP estimation is formulated as a non-convex problem with three hyper-parameters, and an alternating method is proposed to solve this non-convex formulation. Cross-validation and expectation-maximization with Gibbs sampling are further employed to determine the hyper-parameters. Experimental results on benchmarks from the EnergyPlus simulator show that, compared with the state-of-the-art method, the proposed framework achieves robust drift calibration and a better trade-off between accuracy and runtime.
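
In generic form (the symbols are ours, not the paper's notation), calibration estimates the drift vector d from the readings y as the maximizer of the posterior, with the prior p(d) encoding the spatial correlation between sensors:

```latex
\hat{\mathbf{d}}
= \arg\max_{\mathbf{d}} \; p(\mathbf{y} \mid \mathbf{d})\, p(\mathbf{d})
= \arg\min_{\mathbf{d}} \; \bigl[ -\log p(\mathbf{y} \mid \mathbf{d}) - \log p(\mathbf{d}) \bigr].
```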

Machine Learning-Based Pre-Routing Timing Prediction with Reduced Pessimism

  • Erick Carvajal Barboza
  • Nishchal Shukla
  • Yiran Chen
  • Jiang Hu

Optimizations at the placement stage need to be guided by timing estimation prior to routing. To handle the timing uncertainty due to the lack of routing information, designers tend to make very pessimistic predictions so that the performance specification can be ensured in the worst case. Such pessimism causes over-design that wastes chip resources or design effort. In this work, a machine learning-based pre-routing timing prediction approach is introduced. Experimental results show that it can reach accuracy near post-routing sign-off analysis. Compared to a commercial pre-routing timing estimation tool, it reduces the false-positive rate by about 2/3 in reporting timing violations.

LithoGAN: End-to-End Lithography Modeling with Generative Adversarial Networks

  • Wei Ye
  • Mohamed Baker Alawieh
  • Yibo Lin
  • David Z. Pan

Lithography simulation is one of the most fundamental steps in process modeling and physical verification. Conventional simulation methods suffer from a tremendous computational cost to achieve high accuracy. Recently, machine learning has been introduced to trade off accuracy against runtime by speeding up the resist modeling stage of the simulation flow. In this work, we propose LithoGAN, an end-to-end lithography modeling framework based on a generative adversarial network (GAN), to map input mask patterns directly to output resist patterns. Our experimental results show that LithoGAN can predict resist patterns with high accuracy while achieving orders of magnitude speedup compared to conventional lithography simulation and previous machine learning based approaches.

A General Cache Framework for Efficient Generation of Timing Critical Paths

  • Kuan-Ming Lai
  • Tsung-Wei Huang
  • Tsung-Yi Ho

The recent TAU 2018 contest sought novel ideas for the efficient generation of timing reports. When the timing graph is updated, users subsequently and sequentially query different forms of timing reports. This process is computationally expensive and inherently complex. Therefore, we introduce in this paper a general cache framework for the efficient generation of timing critical paths. Our framework efficiently supports (1) a cache scheme to minimize duplicate calculation, (2) graph contraction to reduce the search space, and (3) multi-threading. We evaluated our framework on the TAU 2018 contest benchmarks and demonstrated promising performance over the top performer.

Effective-Resistance Preserving Spectral Reduction of Graphs

  • Zhiqiang Zhao
  • Zhuo Feng

This paper proposes a scalable algorithmic framework for effective-resistance preserving spectral reduction of large undirected graphs. The proposed method computes much smaller graphs while preserving the key spectral (structural) properties of the original graph. Our framework is built upon three key components: a spectrum-preserving node aggregation and reduction scheme, a spectral graph sparsification framework with iterative edge weight scaling, and effective-resistance preserving post-scaling and iterative solution refinement schemes. By leveraging a recent similarity-aware spectral sparsification method and a graph-theoretic algebraic multigrid (AMG) Laplacian solver, a novel constrained stochastic gradient descent (SGD) optimization approach is proposed to achieve truly scalable (nearly-linear complexity) spectral graph reduction. We show that the resulting spectrally-reduced graphs robustly preserve the first few nontrivial eigenvalues and eigenvectors of the original graph Laplacian, and thus allow for developing highly-scalable spectral graph partitioning and circuit simulation algorithms.
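
For reference, the quantity being preserved has a standard closed form: the effective resistance between nodes u and v of a graph with Laplacian L is

```latex
R_{\mathrm{eff}}(u,v) \;=\; (\mathbf{e}_u - \mathbf{e}_v)^{\top} L^{+} (\mathbf{e}_u - \mathbf{e}_v),
```

where L^+ is the Moore-Penrose pseudoinverse of L and e_u is the indicator vector of node u.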

Revisiting the ARM Debug Facility for OS Kernel Security

  • Jinsoo Jang
  • Brent Byunghoon Kang

Hardware debugging facilities, such as watchpoints, have been used for software development and analysis. In this paper, we expanded the use of watchpoints as a hardware security primitive for enhancing the runtime security of mobile devices. By analyzing the watchpoints in detail, we derived useful watchpoint properties that can be exploited to build security applications. Based on our analysis, we designed example applications for hardening the OS kernel by exploiting watchpoints. The proposed applications were implemented on a Juno development board with 64-bit ARM architecture (ARMv8). Hardening the kernel by fully enabling the proposed schemes was found to impose reasonable overhead, i.e., 3% with SPEC CPU2006.

Low-Overhead Power Trace Obfuscation for Smart Meter Privacy

  • Daniele Jahier Pagliari
  • Sara Vinco
  • Enrico Macii
  • Massimo Poncino

Smart meters communicate fine-grain information about a user’s energy consumption to the utility provider, which could be used to infer the user’s habits and thus poses a critical privacy risk. State-of-the-art solutions try to obfuscate a meter’s readings either by using a large rechargeable battery to filter the trace or by adding random noise to alter it. Both solutions, however, have significant drawbacks: large batteries are prohibitively expensive, whereas digitally added noise implies that the user must entrust the utility provider with protecting his/her privacy.

This work proposes a hybrid approach in which zero-average noise is inserted into the power trace by means of a small energy storage device (battery or supercapacitor); the distinguishing feature of our approach is that this obfuscating device is indistinguishable from any other load and therefore complicates, by construction, the load disaggregation task performed by the provider or by a malicious third party. Simulation results show that our device can achieve privacy enhancement comparable or superior to that of a solution based on a large battery, and therefore at a smaller cost.

ARM2GC: Succinct Garbled Processor for Secure Computation

  • Ebrahim M. Songhori
  • M. Sadegh Riazi
  • Siam U. Hussain
  • Ahmad-Reza Sadeghi
  • Farinaz Koushanfar

We present ARM2GC, a novel secure computation framework based on Yao’s Garbled Circuit (GC) protocol and the ARM processor. It allows users to develop privacy-preserving applications using standard high-level programming languages (e.g., C) and compile them using off-the-shelf ARM compilers, e.g., gcc-arm. The main enabler of this framework is the introduction of SkipGate, an algorithm that dynamically omits the communication and encryption cost of a gate when its output is independent of the private data. SkipGate greatly enhances the performance of ARM2GC by omitting costs of the gates associated with the instructions of the compiled binary, which is known by both parties involved in the computation. Our evaluation on benchmark functions demonstrates that ARM2GC outperforms the prior best solution by 156×.

Filianore: Better Multiplier Architectures for LWE-based Post-Quantum Key Exchange

  • Song Bian
  • Masayuki Hiromoto
  • Takashi Sato

The (ring) learning with errors (RLWE/LWE) problem is one of the most promising candidates for constructing quantum-secure key exchange protocols. In this work, we design and implement specialized hardware multiplier units for both LWE and RLWE key exchange schemes to maximize their computational efficiency. By exploiting the algebraic structure with aggressive parameter sets, we show that the design and implementation of LWE key exchange on hardware is considerably easier and more flexible than RLWE. Using the proposed architectures, we show that the client-side energy efficiency of LWE-based key exchange can be on the same order as, or even slightly better than, RLWE-based schemes, making LWE an attractive option for designing a post-quantum cryptographic suite.

Adaptive Granularity Encoding for Energy-efficient Non-Volatile Main Memory

  • Jie Xu
  • Dan Feng
  • Yu Hua
  • Wei Tong
  • Jingning Liu
  • Chunyan Li
  • Gaoxiang Xu
  • Yiran Chen

Data encoding methods have been proposed to alleviate the high write energy and limited write endurance of Non-Volatile Memories (NVMs). Although encoding methods are proven effective through theoretical analysis, they can become inefficient under the actual data patterns of workloads. We observe that the new cache line and the old cache line share many redundant (unmodified) words. This makes the utilization ratio of the tag bits of data encoding methods very low and decreases their efficiency. To fully exploit the tag bits to reduce the bit flips of NVMs, we propose REdundant word Aware Data encoding (READ). The key idea of READ is to share the tag bits among all the words of the cache line and dynamically assign the tag bits to the modified words. The high utilization ratio of the tag bits in READ, however, leads to heavy bit flips in the tag bits themselves. To reduce these, we further propose Sequential flips Aware Encoding (SAE), based on the observation that many sequential bits of the new data and the old data are opposite; for those writes, the bit flips of the tag bits grow with the number of tag bits. SAE therefore dynamically selects the encoding granularity that causes the minimum bit flips instead of always using the minimum encoding granularity. Experimental results show that our schemes can reduce energy consumption by 20.3%, decrease bit flips by 25.0%, and improve lifetime by 52.1%.
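
For context, here is a minimal sketch of the classic flip-reduction encoding that tag-bit schemes like READ and SAE refine (this is the textbook Flip-N-Write idea, not the paper's algorithm): a one-bit tag records whether a word is stored inverted, which bounds the cell flips per write to half the word width.

```python
def encode_write(cells, new_word, width=32):
    """Flip-N-Write-style encoding: store the complement of new_word (tag = 1)
    whenever that flips fewer memory cells than storing it directly."""
    mask = (1 << width) - 1
    flips_plain = bin(cells ^ new_word).count("1")
    flips_inverted = bin(cells ^ (~new_word & mask)).count("1")
    if flips_inverted < flips_plain:
        return ~new_word & mask, 1   # new raw cell contents, tag bit set
    return new_word, 0               # at most width/2 cells flip either way
```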

Magma: A Monolithic 3D Vertical Heterogeneous ReRAM-based Main Memory Architecture

  • Farzaneh Zokaee
  • Mingzhe Zhang
  • Xiaochun Ye
  • Dongrui Fan
  • Lei Jiang

3D vertical ReRAM (3DV-ReRAM) emerges as one of the most promising alternatives to DRAM due to its good scalability beyond 10 nm. Monolithic 3D (M3D) integration enables 3DV-ReRAM to improve its array area efficiency by stacking peripheral circuits underneath an array. A 3DV-ReRAM array has to be large enough to fully cover the peripheral circuits, but such a large array size significantly increases its access latency. In this paper, we propose Magma, an M3D-stacked heterogeneous ReRAM array architecture for future main memory systems, built by stacking a large unipolar 3DV-ReRAM array on top of a small bipolar 3DV-ReRAM array and peripheral circuits shared by the two arrays. We further architect the small bipolar array as a direct-mapped cache for the main memory system. Compared to homogeneous ReRAMs, on average, Magma improves system performance by 11.4%, reduces system energy by 24.3% and obtains a > 5-year lifetime.

A Wear-Leveling-Aware Fine-Grained Allocator for Non-Volatile Memory

  • Xianzhang Chen
  • Zhuge Qingfeng
  • Qiang Sun
  • Edwin H.-M. Sha
  • Shouzhen Gu
  • Chaoshu Yang
  • Chun Jason Xue

Emerging non-volatile memories (NVMs) are promising candidates for main memory due to their advanced characteristics. However, the low endurance of NVM cells makes them vulnerable to frequent fine-grained updates. This paper proposes a Wear-leveling Aware Fine-grained Allocator (WAFA) for NVM. WAFA divides pages into basic memory units to support fine-grained updates, and allocates the basic memory units of a page in a rotational manner to distribute fine-grained updates evenly across memory cells. The fragmented basic memory units of each page, caused by memory allocation and deallocation operations, are reorganized by a reform operation. We implement WAFA in Linux kernel 4.4.4. Experimental results show that WAFA reduces total page writes by 81.1% and 40.1% over NVMalloc and nvm_alloc, respectively, the state-of-the-art wear-conscious allocators for NVM. Meanwhile, WAFA shows 48.6% and 42.3% performance improvements over NVMalloc and nvm_alloc, respectively.

DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement

  • Yibo Lin
  • Shounak Dhar
  • Wuxi Li
  • Haoxing Ren
  • Brucek Khailany
  • David Z. Pan

Placement for very-large-scale integrated (VLSI) circuits is one of the most important steps for design closure. This paper proposes DREAMPlace, a novel GPU-accelerated placement framework, by casting the analytical placement problem as equivalent to training a neural network. Implemented on top of the widely-adopted deep learning toolkit PyTorch, with customized key kernels for wirelength and density computations, DREAMPlace can achieve over 30× speedup in global placement without quality degradation compared to the state-of-the-art multi-threaded placer RePlAce. We believe this work will open up new directions for revisiting classical EDA problems with advances in AI hardware and software.
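
The core trick, treating cell coordinates as trainable parameters and letting autograd descend on a smooth wirelength objective, fits in a few PyTorch lines. This is a toy single-net sketch under our own simplifications; DREAMPlace adds density terms and custom CUDA kernels.

```python
import torch

def lse_wirelength(x, gamma=1.0):
    """Log-sum-exp smoothing of max(x) - min(x), a differentiable HPWL proxy."""
    return gamma * (torch.logsumexp(x / gamma, dim=0) +
                    torch.logsumexp(-x / gamma, dim=0))

x = torch.randn(8, requires_grad=True)   # x-coordinates of one net's pins
opt = torch.optim.SGD([x], lr=0.1)
for _ in range(100):                     # "training" = placement iterations
    opt.zero_grad()
    lse_wirelength(x).backward()
    opt.step()
```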

BiG: A Bivariate Gradient-Based Wirelength Model for Analytical Circuit Placement

  • Fan-Keng Sun
  • Yao-Wen Chang

The analytical formulation has been shown to be the most effective for circuit placement. A key ingredient of analytical placement is its wirelength model, which needs to be differentiable and to accurately approximate a golden wirelength model such as half-perimeter wirelength. Existing wirelength models derive gradients by differentiating smooth maximum (minimum) functions, such as the log-sum-exp and weighted-average models. In this paper, we propose a novel bivariate gradient-based wirelength model, namely BiG, which directly derives a gradient from any bivariate smooth maximum (minimum) function without any differentiation. Our wirelength model can effectively combine the advantages of both multivariate and bivariate functions. Experimental results show that our BiG model effectively and efficiently improves placement solutions.
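
For the two smooth-maximum models the abstract refers to, the standard per-net, per-axis forms are (gamma is the smoothing parameter):

```latex
\mathrm{LSE}_\gamma(x) = \gamma \ln \sum_i e^{x_i/\gamma}, \qquad
\mathrm{WA}_\gamma(x)  = \frac{\sum_i x_i\, e^{x_i/\gamma}}{\sum_i e^{x_i/\gamma}},
```

each approximating max_i x_i; the wirelength along one axis is then the smooth maximum of the pin coordinates plus the smooth maximum of their negations.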

Routability-driven Mixed-size Placement Prototyping Approach Considering Design Hierarchy and Indirect Connectivity Between Macros

  • Jai-Ming Lin
  • Szu-Ting Li
  • Yi-Ting Wang

Mixed-size placement is a great challenge in modern VLSI design. To handle this problem, the three-stage mixed-size placement methodology is considered the most suitable approach for a commercial design flow, with placement prototyping being the most important stage. Since standard cells and macros have to be considered simultaneously, this stage is more complicated than the other two. To reduce complexity and improve design quality, this paper applies a multilevel framework with a design hierarchy-guided clustering scheme to obtain a better coarsening result and thereby improve the outcome of the following stages. We propose an efficient and effective clustering scheme that groups standard cells and macros based on the tree built from their design hierarchies. More importantly, our clustering algorithm considers indirect connectivity between macros, which is ignored by previous works. Moreover, we propose a new overlapping bounding box constraint to avoid clustering improper macros that have connections to fixed pins. The experimental results show that wirelength and routability are improved by our methodology.

NCTUcell: A DDA-Aware Cell Library Generator for FinFET Structure with Implicitly Adjustable Grid Map

  • Yih-Lang Li
  • Shih-Ting Lin
  • Shinichi Nishizawa
  • Hong-Yan Su
  • Ming-Jie Fong
  • Oscar Chen
  • Hidetoshi Onodera

For the 7 nm technology node, cell placement with drain-to-drain abutment (DDA) requires additional filler cells, increasing placement area. This is the first work to fully automatically synthesize a DDA-aware cell library, with an optimized number of drains on the cell boundary, based on the ASAP 7nm PDK. We propose a DDA-aware, dynamic programming-based transistor placement. Previous works ignore the use of the M0 layer in cell routing; we first propose an ILP-based M0 routing planning method. With M0 routing, the congestion of M1 routing can be reduced and pin accessibility can be improved thanks to the diminished use of M2 routing. To improve routing resource utilization, we propose an implicitly adjustable grid map, enabling the maze router to explore more routing solutions. Experimental results show that block placement using the DDA-aware cell library requires 70.9% fewer filler cells than placement using a traditional cell library, achieving a block area reduction of 5.7%.

Design Principles for True Random Number Generators for Security Applications

  • Miloš Grujić
  • Vladimir Rožić
  • David Johnston
  • John Kelsey
  • Ingrid Verbauwhede

The generation of high-quality true random numbers is essential in security applications. For secure communication, we also require high-quality true random number generators (TRNGs) in embedded and IoT devices. This paper provides insights into modern TRNG design principles and their evaluation, based on standards’ requirements and design experience. We illustrate our approach with a case study of a recently proposed delay-chain based TRNG.

Rapid Generation of High-Quality RISC-V Processors from Functional Instruction Set Specifications

  • Gai Liu
  • Joseph Primmer
  • Zhiru Zhang

The increasing popularity of compute acceleration for emerging domains such as artificial intelligence and computer vision has led to the growing need for domain-specific accelerators, often implemented as specialized processors that execute a set of domain-optimized instructions. The ability to rapidly explore (1) various possibilities of the customized instruction set, and (2) its corresponding micro-architectural features is critical to achieve the best quality-of-results (QoRs). However, this ability is frequently hindered by the manual design process at the register transfer level (RTL). Such an RTL-based methodology is often expensive and slow to react when the design specifications change at the instruction-set level and/or micro-architectural level.

We address this deficiency in domain-specific processor design with ASSIST, a behavior-level synthesis framework for RISC-V processors. From an untimed functional instruction set description, ASSIST generates a spectrum of RISC-V processors implementing varying micro-architectural design choices, which enables effective tradeoffs between different QoR metrics. We demonstrate the automatic synthesis of more than 60 in-order processor implementations with varying pipeline structures from the RISC-V 32I instruction set, some of which dominate the manually optimized counterparts in the area-performance Pareto frontier. In addition, we propose an autotuning-based approach for optimizing the implementations under a given performance constraint and the technology target. We further present case studies of synthesizing various custom instruction extensions and customized instruction sets for cryptography and machine learning applications.

autoAx: An Automatic Design Space Exploration and Circuit Building Methodology utilizing Libraries of Approximate Components

  • Vojtech Mrazek
  • Muhammad Abdullah Hanif
  • Zdenek Vasicek
  • Lukas Sekanina
  • Muhammad Shafique

Approximate computing is an emerging paradigm for developing highly energy-efficient computing systems such as various accelerators. In the literature, many libraries of elementary approximate circuits have already been proposed to simplify the design process of approximate accelerators. Because these libraries contain tens to thousands of approximate implementations for a single arithmetic operation, it is intractable to find an optimal combination of approximate circuits from a library even for an application consisting of a few operations. An open problem is how to effectively combine circuits from these libraries to construct complex approximate accelerators. This paper proposes a novel methodology for searching, selecting and combining the most suitable approximate circuits from a set of available libraries to generate an approximate accelerator for a given application. To enable fast design space generation and exploration, the methodology utilizes machine learning techniques to create computational models estimating the overall quality of processing and hardware cost without performing full synthesis at the accelerator level. Using the methodology, we construct hundreds of approximate accelerators (for a Sobel edge detector) showing different but relevant tradeoffs between the quality of processing and hardware cost, and identify the corresponding Pareto frontier. Furthermore, when searching for approximate implementations of a generic Gaussian filter consisting of 17 arithmetic operations, the proposed approach allows us to identify approximately 10^3 highly relevant implementations out of 10^23 possible solutions in a few hours, while an exhaustive search would take four months on a high-end processor.

Graph-Morphing: Exploiting Hidden Parallelism of Non-Stencil Computation in High-Level Synthesis

  • Yu Zou
  • Mingjie Lin

Non-stencil kernels with irregular memory access patterns pose unique challenges to achieving high computing performance and hardware efficiency in FPGA high-level synthesis. We present a highly versatile and systematic approach, termed Graph-Morphing, to constructing a reconfigurable computing engine specifically optimized for non-stencil kernel computing. Graph-Morphing achieves significant performance improvement by fragmenting operations across loop iterations and subsequently rescheduling computation and data to maximize overall performance. In experiments, Graph-Morphing achieves a 2-13× performance improvement, albeit with significantly more hardware usage. For accelerating non-stencil kernel computing, Graph-Morphing opens a new research direction.

Overcoming Data Transfer Bottlenecks in FPGA-based DNN Accelerators via Layer Conscious Memory Management

  • Xuechao Wei
  • Yun Liang
  • Jason Cong

Deep Neural Networks (DNNs) are becoming increasingly complex. Previous hardware accelerator designs neglect the layer diversity in computation and communication behavior: on-chip memory resources are underutilized for the memory-bound layers, leading to suboptimal performance. In addition, the increasing complexity of DNN structures makes on-chip memory allocation difficult. To address these issues, we propose a layer-conscious memory management framework for FPGA-based DNN hardware accelerators. Our framework exploits layer diversity and the disjoint lifespans of memory buffers to efficiently utilize on-chip memory, improving the performance of the memory-bound layers and thus the overall performance of DNNs. It consists of four key techniques that work in coordination. We first devise a memory allocation algorithm to allocate on-chip buffers for the memory-bound layers. In addition, buffer sharing between different layers is applied to improve on-chip memory utilization. Finally, buffer prefetching and splitting are used to further reduce latency. Experiments show that our techniques achieve a 1.36× performance improvement compared with previous designs.
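
The lifespan-based buffer sharing can be illustrated with a small greedy planner (our own toy formulation, not the paper's algorithm): buffers whose layer lifespans do not overlap may reuse the same on-chip address range.

```python
def plan_buffers(buffers):
    """buffers: list of (size, first_layer, last_layer).
    Returns (offset, size, first, last) tuples where buffers with overlapping
    lifespans occupy disjoint space; disjoint lifespans may share addresses."""
    placed = []
    for size, s, e in sorted(buffers, key=lambda b: -b[0]):   # big buffers first
        offset = 0
        for o, sz, ps, pe in sorted(placed):                  # scan by offset
            live_together = not (e < ps or pe < s)
            if live_together and o < offset + size and offset < o + sz:
                offset = o + sz                               # bump past conflict
        placed.append((offset, size, s, e))
    return placed
```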

High-Level Synthesis of Resource-oriented Approximate Designs for FPGAs

  • Marcos T. Leipnitz
  • Gabriel L. Nazar

When attempting to make a design fit a set of the heterogeneous resources found in Field-Programmable Gate Arrays (FPGAs), designers using High-Level Synthesis (HLS) may resort to approximate approaches. However, current FPGA-oriented approximate HLS tools do not allow specifying constraints on heterogeneous resources such as lookup tables, flip-flops, and multipliers, being instead error-oriented. In this work, we propose a resource-oriented HLS methodology with which designers can specify heterogeneous resource constraints and satisfy them while minimizing the output error, attaining average improvements, over error-oriented approaches, of about 34% and 2.2 dB for mean-squared error and peak signal-to-noise ratio error metrics, respectively.

Improving Scalability of Exact Modulo Scheduling with Specialized Conflict-Driven Learning

  • Steve Dai
  • Zhiru Zhang

Loop pipelining is an important optimization in high-level synthesis for enabling high-throughput pipelined execution of loop iterations. However, current pipeline scheduling approaches rely on fundamentally inexact heuristics based on ad hoc priority functions and offer no guarantee of achieving the best throughput. To address this shortcoming, we propose a scheduling algorithm based on a system of integer difference constraints (SDC) and Boolean satisfiability (SAT) to exactly handle various pipeline scheduling constraints. Our techniques take advantage of conflict-driven learning and problem-specific specialization to derive pipelining solutions optimally yet efficiently. Experiments demonstrate that our approach achieves a notable speedup compared to integer linear programming based techniques.
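
The SDC part is classical and worth making concrete (a generic solver sketch, not the authors' SAT-coupled engine): every constraint s_v - s_u <= c becomes an edge u -> v of weight c in a constraint graph, and Bellman-Ford either finds feasible start times or exposes a negative cycle, i.e., infeasibility.

```python
def solve_sdc(n, constraints):
    """constraints: list of (u, v, c) meaning s_v - s_u <= c.
    Returns feasible start times (normalized to start at 0) or None."""
    dist = [0] * n                       # implicit source with 0-weight edges
    for _ in range(n + 1):
        changed = False
        for u, v, c in constraints:
            if dist[u] + c < dist[v]:    # relax edge u -> v
                dist[v] = dist[u] + c
                changed = True
        if not changed:
            return [d - min(dist) for d in dist]
    return None                          # negative cycle: infeasible system

# s1 - s0 <= 2 and s0 - s1 <= -1 (i.e., s1 >= s0 + 1): feasible, e.g., [0, 1]
print(solve_sdc(2, [(0, 1, 2), (1, 0, -1)]))
```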

LAcc: Exploiting Lookup Table-based Fast and Accurate Vector Multiplication in DRAM-based CNN Accelerator

  • Quan Deng
  • Youtao Zhang
  • Minxuan Zhang
  • Jun Yang

PIM (Processing-in-memory)-based CNN (Convolutional neural network) accelerators leverage the characteristics of basic memory cells to enable simple logic and arithmetic operations so that the bandwidth constraint can be effectively alleviated. However, it remains a major challenge to support multiplication operations efficiently on PIM accelerators, in particular, DRAM-based PIM accelerators. This has prevented PIM-based accelerators from being immediately adopted for accurate CNN inference.

In this paper, we propose LAcc, a DRAM-based PIM accelerator supporting LUT- (lookup table) based fast and accurate multiplication. By enabling LUT-based vector multiplication in DRAM, LAcc effectively decreases the LUT size and improves its reuse. LAcc further adopts a hybrid mapping of weights and inputs to improve the hardware utilization rate. LAcc achieves 95 FPS at 5.3 W for AlexNet and a 6.3× efficiency improvement over the state of the art.
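
The table-lookup trick is straightforward to state in software (operand sizes here are illustrative; LAcc's contribution is realizing and reusing such tables inside DRAM): a 16x16 table of 4-bit products reconstructs an 8-bit multiply from four lookups plus shift-adds.

```python
# 4b x 4b product table: 256 entries, small enough to replicate widely
LUT = [[a * b for b in range(16)] for a in range(16)]

def lut_mul8(a, b):
    """8-bit multiply assembled from four 4-bit table lookups and shift-adds."""
    ah, al, bh, bl = a >> 4, a & 0xF, b >> 4, b & 0xF
    return (LUT[ah][bh] << 8) + ((LUT[ah][bl] + LUT[al][bh]) << 4) + LUT[al][bl]

assert lut_mul8(173, 59) == 173 * 59
```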

DRIS-3: Deep Neural Network Reliability Improvement Scheme in 3D Die-Stacked Memory based on Fault Analysis

  • Jae-San Kim
  • Joon-Sung Yang

Various studies have been carried out to improve the operational efficiency of Deep Neural Networks (DNNs). However, the importance of reliability in DNNs has generally been overlooked. As the underlying semiconductor technology decreases in reliability, the probability that some components of computing devices fail also increases, preventing high accuracy in DNN operations. To achieve high accuracy, ensuring operational reliability, even when faults occur, is necessary.

In this paper, we introduce a DNN reliability improvement scheme in 3D die-stacked memory called DRIS-3, based on the correlation between the faults in weights and an accuracy loss. We analyze the fault characteristics of conventional DNN models to find the bits that cause significant accuracy loss when faults are injected into weights. On the basis of the findings, we propose a reliability improvement structure which can reduce faults on the bits that must be protected for accuracy, considering asymmetric soft error rate (SER) per layer in 3D die-stacked memory.

Experimental results show that with the proposed method, fault tolerance is improved regardless of the model type and the pruning applied. The fault tolerance in terms of bit error rate (BER) for a 1% accuracy loss is increased by up to 10^4 times over the conventional model.

X-MANN: A Crossbar based Architecture for Memory Augmented Neural Networks

  • Ashish Ranjan
  • Shubham Jain
  • Jacob R. Stevens
  • Dipankar Das
  • Bharat Kaul
  • Anand Raghunathan

Memory Augmented Neural Networks (MANNs) enhance a deep neural network with an external differentiable memory, enabling them to perform complex tasks well beyond the capabilities of conventional deep neural networks. We identify a unique challenge that arises in MANNs due to soft reads and writes to the differentiable memory, each of which requires access to all the memory locations. This characteristic of MANN workloads severely limits the performance of MANNs on CPUs, GPUs, and classical neural network accelerators. We present the first effort to design a hardware architecture that improves the efficiency of MANNs. Leveraging the intrinsic ability of resistive crossbars to efficiently realize in-memory computations, we propose X-MANN, a memory-centric crossbar-based architecture that is specialized to match the compute characteristics observed in MANNs. We design a transposable crossbar processing unit that can efficiently perform the different computational kernels of MANNs. To improve performance of soft writes in X-MANN, we propose an incremental write mechanism that leverages the characteristics of soft write operations. We develop an architectural simulator for X-MANN that utilizes array-level timing and power models of resistive crossbars calibrated from SPICE simulations. Across a suite of MANN benchmarks, X-MANN achieves 23.7×-45.7× speedup and 75.1×-267.1× reduction in energy over state-of-the-art GPU implementations.

On-Chip Memory Technology Design Space Explorations for Mobile Deep Neural Network Accelerators

  • Haitong Li
  • Mudit Bhargava
  • Paul N. Whatmough
  • H.-S. Philip Wong

Deep neural network (DNN) inference tasks have become ubiquitous workloads on mobile SoCs and demand energy-efficient hardware accelerators. Mobile DNN accelerators are heavily area-constrained, with only minimal on-chip SRAM, which results in heavy use of inefficient off-chip DRAM. With diminishing returns from conventional silicon technology scaling, emerging memory technologies that offer better area density than SRAM can boost accelerator efficiency by minimizing costly off-chip DRAM accesses. This paper presents a detailed design space exploration (DSE) of technology-system co-design for systolic-array accelerators. We focus on practical/mature on-chip memory technologies, including SRAM, eDRAM, MRAM, and 3D vertical RRAM (VRRAM). The DSE employs state-of-the-art optimizations (e.g., model compression and optimized buffer scheduling), and evaluates results on important models including ResNet-50, MobileNet, and Faster-RCNN. Compared to an SRAM/DRAM baseline, MRAM-based accelerators show up to 4.68× energy benefits (57% area overhead), while a 3D VRRAM-based design achieves 2.22× energy benefits (33% area reduction).

SkippyNN: An Embedded Stochastic-Computing Accelerator for Convolutional Neural Networks

  • Reza Hojabr
  • Kamyar Givaki
  • S. M. Reza Tayaranian
  • Parsa Esfahanian
  • Ahmad Khonsari
  • Dara Rahmati
  • M. Hassan Najafi

Employing convolutional neural networks (CNNs) in embedded devices calls for novel low-cost and energy-efficient CNN accelerators. Stochastic computing (SC) is a promising low-cost alternative to conventional binary implementations of CNNs. Despite the low-cost advantage, SC-based arithmetic units suffer from prohibitive execution times due to processing long bit-streams. In particular, multiplication, the main operation in convolution computation, is extremely time-consuming, which hampers the use of SC methods in designing embedded CNNs.

In this work, we propose a novel architecture, called SkippyNN, that reduces the computation time of SC-based multiplications in the convolutional layers of CNNs. Each convolution in a CNN is composed of numerous multiplications in which each input value is multiplied by a weight vector. Once the result of the first multiplication is produced, the following multiplications can be performed by multiplying the input by the differences of successive weights. Leveraging this property, we develop a differential Multiply-and-Accumulate unit, called DMAC, to reduce the time consumed by convolutions in SkippyNN. We evaluate the efficiency of SkippyNN using four modern CNNs. On average, SkippyNN offers 1.2x speedup and 2.7x energy saving compared to binary implementations of CNN accelerators.
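
The differential trick behind DMAC can be summarized in a few lines. The sketch below is a plain-Python illustration, not the stochastic-computing hardware: each product after the first is obtained from the previous one plus the input times a weight difference, which is the small quantity SkippyNN can encode with short bit-streams.

```python
# Differential multiply-and-accumulate: after x*w[0], each following product
# reuses the previous product plus x*(w[i] - w[i-1]).
def dmac(x, weights):
    product = x * weights[0]
    acc = product
    for prev, curr in zip(weights, weights[1:]):
        product += x * (curr - prev)       # incremental update via the delta
        acc += product
    return acc

assert dmac(3, [2, 5, 4]) == 3 * (2 + 5 + 4)
```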

ZARA: A Novel Zero-free Dataflow Accelerator for Generative Adversarial Networks in 3D ReRAM

  • Fan Chen
  • Linghao Song
  • Hai Helen Li
  • Yiran Chen

Generative Adversarial Networks (GANs) have recently demonstrated a great opportunity toward unsupervised learning, aiming to mitigate the massive human effort spent on data labeling for supervised learning algorithms. A GAN combines a generative model and a discriminative model that oppose each other in an adversarial situation to refine their abilities. Existing nonvolatile-memory-based machine learning accelerators, however, cannot support the computational needs of GAN training. Specifically, the generator utilizes a new operator, called transposed convolution, which introduces significant resource underutilization when executed on conventional neural network accelerators, as it inserts massive numbers of zeros into its input before a convolution operation. In this work, we propose a novel computational deformation technique that synergistically optimizes the forward and backward functions of transposed convolution to eliminate this underutilization. In addition, we present dedicated control units, a dataflow mapper and an operation scheduler, to support the proposed execution model with high parallelism and low energy consumption. ZARA is implemented with commodity ReRAM chips, and experimental results show that our design can improve GAN training performance by 1.6×~23× on average over CMOS-based GAN accelerators. Compared to state-of-the-art ReRAM-based accelerator designs, ZARA also provides 1.15×~2.1× performance improvement.
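
The underutilization comes from the zero-insertion step of transposed convolution, sketched below in NumPy. This is the standard operator, not ZARA's deformed dataflow: upsampling a feature map by stride 2 makes roughly two-thirds of the multiply-accumulate operands zero.

```python
import numpy as np

# Standard zero-insertion in transposed convolution: the input is upsampled
# with zeros before an ordinary convolution, so most MACs see a zero operand.
def zero_insert(x, stride=2):
    h, w = x.shape
    out = np.zeros((h * stride - (stride - 1), w * stride - (stride - 1)))
    out[::stride, ::stride] = x
    return out

x = np.arange(9.0).reshape(3, 3)
up = zero_insert(x)          # 5x5 map in which 16 of 25 entries are zeros
```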

X-DeepSCA: Cross-Device Deep Learning Side Channel Attack

  • Debayan Das
  • Anupam Golder
  • Josef Danial
  • Santosh Ghosh
  • Arijit Raychowdhury
  • Shreyas Sen

This article, for the first time, demonstrates the Cross-device Deep Learning Side-Channel Attack (X-DeepSCA), achieving an accuracy of >99.9% even in the presence of inter-device variations significantly larger than the inter-key variations. By augmenting the training set with traces captured from multiple devices and with a proper choice of hyper-parameters, the proposed 256-class Deep Neural Network (DNN) learns accurately from the power side-channel leakage of an AES-128 target encryption engine, and an N-trace (N ≤ 10) X-DeepSCA attack breaks different target devices within seconds, compared to a few minutes for a correlational power analysis (CPA) attack, thereby significantly increasing the threat surface for embedded devices. Even in low-SNR scenarios, the proposed X-DeepSCA attack achieves ~10× lower minimum traces to disclosure (MTD) compared to traditional CPA.
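
As a rough illustration of the profiled-attack setup, the sketch below builds a 256-class classifier over raw power traces, one class per AES key-byte value. The layer sizes and trace length are placeholders we chose, not the authors' architecture.

```python
import torch
import torch.nn as nn

# Illustrative 256-class DNN over power traces (all sizes are assumptions).
TRACE_LEN = 500                            # assumed samples per trace
model = nn.Sequential(
    nn.Linear(TRACE_LEN, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 256),                   # one logit per key-byte value
)
traces = torch.randn(32, TRACE_LEN)        # a batch of measured traces
key_byte_guess = model(traces).argmax(dim=1)
```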

Attacking Split Manufacturing from a Deep Learning Perspective

  • Haocheng Li
  • Satwik Patnaik
  • Abhrajit Sengupta
  • Haoyu Yang
  • Johann Knechtel
  • Bei Yu
  • Evangeline F.Y. Young
  • Ozgur Sinanoglu

The notion of integrated circuit split manufacturing, which delegates the front-end-of-line (FEOL) and back-end-of-line (BEOL) parts to different foundries, is to prevent overproduction, piracy of intellectual property (IP), and targeted insertion of hardware Trojans by adversaries in the FEOL facility. In this work, we challenge the security promise of split manufacturing by formulating various layout-level placement and routing hints as vector- and image-based features. We construct a sophisticated deep neural network that can infer the missing BEOL connections with high accuracy. Compared with the publicly available network-flow attack [1], on the same set of ISCAS-85 benchmarks, we achieve 1.21× accuracy when splitting on M1 and 1.12× accuracy when splitting on M3, with less than 1% of the running time.

ALAFA: Automatic Leakage Assessment for Fault Attack Countermeasures

  • Sayandeep Saha
  • S. Nishok Kumar
  • Sikhar Patranabis
  • Debdeep Mukhopadhyay
  • Pallab Dasgupta

Assessment of the security provided by a fault attack countermeasure is challenging, given that a protected cipher may leak the key if the countermeasure is not designed correctly. This paper proposes, for the first time, a statistical framework to detect information leakage in fault attack countermeasures. Based on the concept of non-interference, we formalize leakage for fault attacks and provide a t-test-based methodology for leakage assessment. One major strength of the proposed framework is that leakage can be detected without complete knowledge of the countermeasure algorithm, solely by observing the faulty ciphertext distributions. Experimental evaluation over a representative set of countermeasures establishes the efficacy of the proposed methodology.
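
A minimal form of such a t-test-based assessment is sketched below, assuming two sets of faulty-ciphertext statistics are available. The threshold of 4.5 is the value commonly used in TVLA-style leakage testing; the statistic here is an illustration, not the paper's exact formalization.

```python
import numpy as np
from scipy import stats

# Compare faulty-ciphertext samples from the device under test against a
# reference (non-interfering) distribution; a large |t| flags leakage.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, 10000)    # ideal countermeasure output
observed = rng.integers(0, 256, 10000)     # device-under-test output
t_stat, _ = stats.ttest_ind(reference, observed, equal_var=False)  # Welch
print("leakage suspected" if abs(t_stat) > 4.5 else "no evidence of leakage")
```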

ChipSecure: A Reconfigurable Analog eFlash-Based PUF with Machine Learning Attack Resiliency in 55nm CMOS

  • Mohammad Mahmoodi
  • Hussein Nili
  • Shabnam Larimian
  • Xinjie Guo
  • Dmitri Strukov

We exploit randomness in static I-V characteristics and the reconfigurability of embedded flash memories to design a very efficient physically unclonable function (PUF). Leakage current and subthreshold slope variations, nonlinearity, nondeterministic tuning error, and sneak-path currents in the redesigned commercial flash memory arrays are exploited to create a unique digital fingerprint. A time-multiplexed architecture is designed to enhance security and expand the challenge-response pair space to 10^211. Experimental results demonstrate 50.3% average uniformity, 49.99% average diffuseness, and a native <5% bit error rate. The analysis of the measured data also shows strong resilience against machine learning attacks and the possibility of extremely energy-efficient operation at 0.56 pJ/b.

Adversarial Attack against Modeling Attack on PUFs

  • Sying-Jyan Wang
  • Yu-Shen Chen
  • Katherine Shu-Min Li

The Physical Unclonable Function (PUF) has been proposed for the identification and authentication of devices and for cryptographic key generation. A strong PUF provides an extremely large number of device-specific challenge-response pairs (CRPs) which can be used for identification. Unfortunately, the CRP mechanism is vulnerable to modeling attacks, which use machine learning (ML) algorithms to predict PUF responses with high accuracy. Many methods have been developed to strengthen strong PUFs with complicated hardware; however, recent studies show that they are still vulnerable to attacks leveraging GPU-accelerated ML algorithms.

In this paper, we approach the problem from a different angle. With a slightly modified CRP mechanism, a PUF can provide poisoned data such that ML algorithms cannot build an accurate model of the PUF under attack. Experimental results show that the proposed method provides an effective countermeasure against modeling attacks on PUFs. In addition, the proposed method is compatible with hardware strengthening schemes to provide even better protection for PUFs.

RFTC: Runtime Frequency Tuning Countermeasure Using FPGA Dynamic Reconfiguration to Mitigate Power Analysis Attacks

  • Darshana Jayasinghe
  • Aleksandar Ignjatovic
  • Sri Parameswaran

Random-execution-time countermeasures against power analysis attacks incur lower resource overheads than power-balancing and masking countermeasures. Previous randomization countermeasures use either a small number of clock frequencies or delays to randomize the execution. This paper presents a novel random frequency countermeasure (referred to as RFTC) that uses the dynamic reconfiguration ability of FPGA clock managers (such as the Xilinx Mixed-Mode Clock Manager, MMCM), which can change the frequency of operation at runtime. We show for the first time how the Advanced Encryption Standard (AES) block cipher can be executed using randomly selected clock frequencies (amongst thousands of carefully chosen frequencies) generated within the FPGA to mitigate power analysis vulnerabilities. To test the effectiveness of the proposed clock randomization, Correlation Power Analysis (CPA) attacks are performed on the collected power traces. Attacks based on preprocessing methods, such as Dynamic Time Warping (DTW), Principal Component Analysis (PCA), and Fast Fourier Transform (FFT), are also performed on the collected traces to test how effectively the random execution is removed. Compared to the state of the art, which offers 83 distinct finishing times for each encryption, the method described in this paper can produce more than 60,000 distinct finishing times, making it resistant to power analysis attacks even when traces are preprocessed; it is demonstrated to be secure up to four million traces.

Design Guidelines of RRAM based Neural-Processing-Unit: A Joint Device-Circuit-Algorithm Analysis

  • Wenqiang Zhang
  • Xiaochen Peng
  • Huaqiang Wu
  • Bin Gao
  • Hu He
  • Youhui Zhang
  • Shimeng Yu
  • He Qian

RRAM-based neural-processing-units (NPUs) are emerging for processing general-purpose machine intelligence algorithms with ultra-high energy efficiency, but the imperfections of the analog devices and cross-point arrays make practical applications more complicated. To improve the accuracy and robustness of the NPU, device-circuit-algorithm codesign with consideration of the underlying device and array characteristics should outperform the optimization of individual devices or algorithms. In this work, we provide a joint device-circuit-algorithm analysis and propose the corresponding design guidelines. Key innovations include: 1) an end-to-end simulator for RRAM NPUs, developed with an integrated framework from device to algorithm; 2) a complete circuit and architecture design for the RRAM NPU, which brings the analysis much closer to a real prototype; 3) a large-scale neural network as well as other general-purpose networks processed to study device-circuit interaction; 4) an evaluation of the accuracy loss from RRAM non-idealities, such as I-V nonlinearity, noise on analog resistance levels, voltage drop across interconnect, and ADC/DAC precision, for the NPU design.

QURE: Qubit Re-allocation in Noisy Intermediate-Scale Quantum Computers

  • Abdullah Ash-Saki
  • Mahabubul Alam
  • Swaroop Ghosh

Concerted efforts by academia and industry (e.g., IBM, Google, and Intel) have brought us to the era of Noisy Intermediate-Scale Quantum (NISQ) computers. Qubits, the basic elements of a quantum computer, have proven extremely susceptible to different noise sources. Recent experiments have exhibited spatial variations among the qubits in NISQ hardware. Therefore, conventional qubit mapping done without quality awareness results in a significant loss of fidelity for a given workload. In this paper, we analyze the effects of various noise sources on the overall fidelity of a given workload on real NISQ hardware. We also present a novel optimization technique, Qubit Re-allocation (QURE), to maximize the sequence fidelity of a given workload. QURE is scalable and can be applied to future large-scale quantum computers. QURE can improve the fidelity of a quantum workload by up to 1.54X (1.39X on average) in simulation and up to 1.7X on a real device compared to variation-oblivious qubit allocation, without incurring any physical overhead.

Mapping Quantum Circuits to IBM QX Architectures Using the Minimal Number of SWAP and H Operations

  • Robert Wille
  • Lukas Burgholzer
  • Alwin Zulehner

The recent progress in the physical realization of quantum computers (the first publicly available ones, IBM's QX architectures, were launched in 2017) has motivated research on automatic methods that aid users in running quantum circuits on them. Here, certain physical constraints given by the architectures, which restrict the allowed interactions of the involved qubits, have to be satisfied. Thus far, this has been addressed by inserting SWAP and H operations. However, it remains unknown whether existing methods add a minimum number of SWAP and H operations or, if not, how far they are from that minimum (an NP-complete problem). In this work, we address this by formulating the mapping task as a symbolic optimization problem that is solved using reasoning engines like Boolean satisfiability solvers. By this, we not only provide a method that maps quantum circuits to IBM's QX architectures with a minimal number of SWAP and H operations, but also show by experimental evaluation that the number of operations added by IBM's heuristic solution exceeds the lower bound by more than 100% on average. An implementation of the proposed methodology is publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping.

Computing Radial Basis Function Support Vector Machine using DNA via Fractional Coding

  • Xingyi Liu
  • Keshab K. Parhi

This paper describes a novel approach to synthesizing molecular reactions that compute a radial basis function (RBF) support vector machine (SVM) kernel. The approach is based on fractional coding, where a variable is represented by two molecules. The synergy between fractional coding in molecular computing and stochastic logic implementations in electronic computing is key to translating known stochastic logic circuits to molecular computing. Although inspired by a prior stochastic logic implementation of the RBF-SVM kernel, the proposed molecular reactions require non-obvious modifications. This paper introduces a new explicit bipolar-to-unipolar molecular converter for intermediate format conversion. Two designs are presented: one based on the explicit conversion and the other on the implicit conversion from prior stochastic logic. When 5 support vectors are used, the DNA RBF-SVM realized using explicit format conversion is shown to have orders of magnitude less regression error than the one based on implicit conversion.
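
For reference, the encodings involved can be written compactly. The definitions below are the standard stochastic-logic/fractional-coding conventions rather than the paper's specific reaction set: a value is carried by the relative concentrations of a molecule pair (X1, X0), and the explicit converter realizes the affine map from bipolar to unipolar format.

```latex
% Standard fractional-coding conventions (not the paper's reaction set):
x_{\mathrm{uni}} = \frac{[X_1]}{[X_1] + [X_0]}, \qquad
x_{\mathrm{bi}} = \frac{[X_1] - [X_0]}{[X_1] + [X_0]}, \qquad
x_{\mathrm{uni}} = \frac{x_{\mathrm{bi}} + 1}{2}
```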

AlignS: A Processing-In-Memory Accelerator for DNA Short Read Alignment Leveraging SOT-MRAM

  • Shaahin Angizi
  • Jiao Sun
  • Wei Zhang
  • Deliang Fan

Classified as a complex big-data analytics problem, DNA short read alignment is a major sequential bottleneck for the massive amounts of data generated by next-generation sequencing platforms. With von Neumann computing architectures struggling to address such a computationally expensive and memory-intensive task, Processing-in-Memory (PIM) platforms are gaining growing interest. In this paper, an energy-efficient and parallel PIM accelerator (AlignS) is proposed to execute DNA short read alignment based on an optimized and hardware-friendly alignment algorithm. We first develop the AlignS platform, which harnesses SOT-MRAM as computational memory and transforms it into a fundamental processing unit for short read alignment. Accordingly, we present a novel, customized, highly parallel read alignment algorithm that requires only simple and parallel in-memory operations (i.e., comparisons and additions). AlignS is then optimized through a new correlated data partitioning and mapping methodology that allows local storage and processing of DNA sequences to fully exploit the algorithm-level parallelism and to accelerate both exact and inexact matches. Device-to-architecture co-simulation results show that AlignS improves short read alignment throughput per Watt per mm² by ~12× compared to an ASIC accelerator. Compared to a recent FM-index-based ReRAM platform, AlignS achieves 1.6× higher throughput per Watt.

MiniControl: Synthesis of Continuous-Flow Microfluidics with Strictly Constrained Control Ports

  • Xing Huang
  • Tsung-Yi Ho
  • Wenzhong Guo
  • Bing Li
  • Ulf Schlichtmann

Recent advances in continuous-flow microfluidics have enabled highly integrated lab-on-a-chip biochips. These chips can execute complex biochemical applications precisely and efficiently within a tiny area, but they require a large number of control ports and the corresponding control logic to generate the required pressure patterns for flow control, which, consequently, offsets their advantages and prevents their wide adoption. In this paper, we propose the first synthesis flow, called MiniControl, for continuous-flow microfluidic biochips (CFMBs) under strict control-port constraints, incorporating high-level synthesis and physical design simultaneously, which has never been considered in previous work. With the maximum number of allowed control ports specified in advance, this synthesis flow generates a biochip architecture with high execution efficiency. Moreover, the overall cost of a CFMB can be reduced, and the tradeoff between control logic and the execution efficiency of biochemical applications can be evaluated for the first time. Experimental results demonstrate that MiniControl leads to high execution efficiency and low overall platform cost, while strictly satisfying the given control-port constraint.

Faster Region-based Hotspot Detection

  • Ran Chen
  • Wei Zhong
  • Haoyu Yang
  • Hao Geng
  • Xuan Zeng
  • Bei Yu

As circuit feature sizes continuously shrink, hotspot detection has become a more challenging problem in modern DFM flows. Recently developed deep learning techniques have shown their advantages on hotspot detection tasks. However, existing hotspot detectors only accept small layout clips as input, with potential defects occurring at the center region of each clip, which is time-consuming and wastes considerable computational resources when dealing with large full-chip layouts. In this paper, we develop a new end-to-end framework that can detect multiple hotspots in a large region at a time and promises better hotspot detection performance. We design a joint auto-encoder and inception module for efficient feature extraction. A two-stage classification and regression flow is proposed to first locate hotspot regions roughly and then conduct the final prediction with better accuracy and a lower false alarm penalty. Experimental results show that our framework enables a significant speed improvement over existing methods, with higher accuracy and fewer false alarms.

Efficient Layout Hotspot Detection via Binarized Residual Neural Network

  • Yiyang Jiang
  • Fan Yang
  • Hengliang Zhu
  • Bei Yu
  • Dian Zhou
  • Xuan Zeng

Layout hotspot detection is of great importance in the physical verification flow. Deep neural network models have been applied to hotspot detection and have achieved great success. Since layouts can be viewed as binary images, binarized neural networks are a natural fit for the hotspot detection problem. In this paper we propose a new deep learning architecture based on binarized neural networks (BNNs) to speed up neural-network-based hotspot detection. A new binarized residual neural network is carefully designed for hotspot detection. Experimental results on the ICCAD 2012 Contest benchmarks show that our architecture outperforms all previous hotspot detectors in detection accuracy and achieves an 8x speedup over the best deep learning-based solution.
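
The efficiency argument rests on how binarized layers compute: with ±1 weights and activations packed into machine words, a dot product reduces to XNOR and popcount. The sketch below shows the generic BNN kernel, not the paper's specific architecture.

```python
# Binary dot product of two {-1,+1} vectors packed into integer bitmasks
# (bit = 1 encodes +1): dot = N - 2 * popcount(a XOR b).
def bnn_dot(a_bits, b_bits, n):
    return n - 2 * bin(a_bits ^ b_bits).count("1")

# Example: a = [+1,-1,+1,+1] -> 0b1011, b = [+1,+1,-1,+1] -> 0b1101
assert bnn_dot(0b1011, 0b1101, 4) == 0   # (+1)+(-1)+(-1)+(+1) = 0
```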

DeePattern: Layout Pattern Generation with Transforming Convolutional Auto-Encoder

  • Haoyu Yang
  • Piyush Pathak
  • Frank Gennari
  • Ya-Chieh Lai
  • Bei Yu

VLSI layout patterns provide critical resources for various design-for-manufacturability studies, from early technology node development to back-end design and sign-off flows. However, a diverse layout pattern library is not always available due to the long logic-to-chip design cycle, which slows down technology node development. To address this issue, in this paper we explore the capability of generative machine learning models to synthesize layout patterns. A transforming convolutional auto-encoder is developed to learn vector-based instantiations of squish pattern topologies. We show that our framework can capture simple design rules and helps enlarge the existing squish topology space under certain transformations. The geometry information of each squish topology is obtained from an associated linear system derived from design rule constraints. Experiments on 7nm EUV designs show that our framework can generate diverse pattern libraries with DRC-clean patterns more effectively than a state-of-the-art industrial layout pattern generator.

GAN-SRAF: Sub-Resolution Assist Feature Generation Using Conditional Generative Adversarial Networks

  • Mohamed Baker Alawieh
  • Yibo Lin
  • Zaiwei Zhang
  • Meng Li
  • Qixing Huang
  • David Z. Pan

As integrated circuit (IC) technology continues to scale, resolution enhancement techniques (RETs) are mandatory to obtain high manufacturing quality and yield. Among various RETs, sub-resolution assist feature (SRAF) generation is a key technique to improve the target pattern quality and lithographic process window. While model-based SRAF insertion techniques have demonstrated high accuracy, they usually suffer from high computational cost. Therefore, more efficient techniques that can achieve high accuracy while reducing runtime are in strong demand. In this work, we leverage recent advancements in machine learning for image generation to tackle the SRAF insertion problem. In particular, we propose a new SRAF insertion framework, GAN-SRAF, which uses conditional generative adversarial networks (CGANs) to generate SRAFs directly for any given layout. Our proposed approach incorporates a novel layout-to-image encoding using multi-channel heatmaps to preserve the layout information and facilitate layout reconstruction. Our experimental results demonstrate a ~14.6× reduction in runtime compared to the previous best machine learning approach for SRAF generation, and a ~144× reduction compared to a model-based approach, while achieving comparable quality of results.

Meta-Model based High-Dimensional Yield Analysis using Low-Rank Tensor Approximation

  • Xiao Shi
  • Hao Yan
  • Qiancun Huang
  • Jiajia Zhang
  • Longxing Shi
  • Lei He

The "curse of dimensionality" has become the major challenge for existing high-sigma yield analysis methods. In this paper, we develop a meta-model using Low-Rank Tensor Approximation (LRTA) to substitute for expensive SPICE simulation. The polynomial degree of our LRTA model grows linearly with circuit dimension, which makes it especially promising for high-dimensional circuit problems. Our LRTA meta-model is solved efficiently with a robust greedy algorithm and calibrated iteratively with an adaptive sampling method. Experiments on a bit cell and an SRAM column validate that the proposed LRTA method outperforms other state-of-the-art approaches in terms of accuracy and efficiency.
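
The general shape of such a low-rank surrogate is the canonical rank-R decomposition below, where each factor is a low-degree univariate polynomial expansion. This is the standard LRTA form, given here for orientation; the paper's basis choice and calibration details are not reproduced. The number of unknown coefficients, R·D·(p+1), grows only linearly with the dimension D, which is what defeats the curse of dimensionality.

```latex
% Canonical low-rank (rank-R) surrogate in D dimensions (standard LRTA form):
\hat{f}(x_1,\ldots,x_D) = \sum_{r=1}^{R} \prod_{d=1}^{D} v_r^{(d)}(x_d),
\qquad
v_r^{(d)}(x_d) = \sum_{k=0}^{p} \beta_{r,k}^{(d)} \, \phi_k(x_d)
```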

Novel Guiding Template and Mask Assignment for DSA-MP Hybrid Lithography Using Multiple BCP Materials

  • Yi-Ting Lin
  • Iris Hui-Ru Jiang

Directed self-assembly (DSA) is one of the leading candidates for extending the resolution of optical lithography to sub-7nm and beyond. By incorporating DSA in multiple patterning lithography (DSA-MP), the flexibility and resolution of contact/via patterning can be further enhanced by using multiple block copolymer (BCP) materials. Prior work faces a dilemma between solution quality and efficiency and is unable to handle 2D templates. In this paper, we capture the essence of template and mask assignment in DSA-MP with a new graph model and a new problem reduction: our graph model explicitly represents spacing conflict edges and template hyperedges; thus, extra enumeration and manipulation of incompatible via grouping edges can be avoided, and arbitrary 1D/2D templates can be natively handled. We further reduce the assignment problem to exact cover, which is encoded by a sparse matrix. Our concise integer linear programming (ILP) formulation and fast backtracking heuristic achieve substantially superior solution quality and efficiency compared to the state-of-the-art work. Moreover, our method is flexible and extensible, allowing dummy vias to be utilized to improve manufacturability.

Actors Revisited for Time-Critical Systems

  • Marten Lohstroh
  • Martin Schoeberl
  • Andrés Goens
  • Armin Wasicek
  • Christopher Gill
  • Marjan Sirjani
  • Edward A. Lee

Programming time-critical systems is notoriously difficult. In this paper we propose an actor-oriented programming model with a semantic notion of time and a deterministic coordination semantics based on discrete events to exercise precise control over both the computational and timing aspects of the system behavior.

Time-Predictable Computing by Design: Looking Back, Looking Forward

  • Tulika Mitra

We present two contrasting approaches to achieving time predictability in the embedded compute engine, the basic building block of any Internet of Things (IoT) or Cyber-Physical System (CPS). The traditional approach offers predictability on top of unpredictable processors with numerous optimizations for enhanced performance and programmability, at the cost of huge variability in timing. Approaches such as Worst-Case Execution Time (WCET) analysis of software have been struggling to model the complex timing behavior of the underlying processor in order to provide guarantees. On the other hand, the inevitable slowdown of Moore's Law and the end of Dennard scaling have curtailed the performance and energy scaling of processors. This stagnation, in conjunction with the importance of cognitive computing, has motivated widespread adoption of non-von Neumann accelerators and architectures. We argue that these emerging architectures are inherently time-predictable, as they depend on software to orchestrate the computation and data movement, and are an excellent match for real-time processing needs.

Consolidating High-Integrity, High-Performance, and Cyber-Security Functions on a Manycore Processor

  • Benoît Dupont de Dinechin

The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. This problem has been addressed by suitably designing the architecture, implementation, and programming models, of the Kalray MPPA (Multi-Purpose Processor Array) family of single-chip many-core processors. We introduce the third-generation MPPA processor, whose key features are motivated by the high-performance and high-integrity functions of automated vehicles. High-performance computing functions, represented by deep learning inference and by computer vision, need to execute under soft real-time constraints. High-integrity functions are developed under model-based design, and must meet hard real-time constraints. Finally, the third-generation MPPA processor integrates a hardware root of trust, and its security architecture is able to support a security kernel for implementing the trusted execution environment functions required by applications.

Efficient GPU NVRAM Persistence with Helper Warps

  • Sui Chen
  • Faen Zhang
  • Lei Liu
  • Lu Peng

Non-volatile Random-Access Memories (NVRAM) have emerged in recent years to bridge the performance gap between main memory and external storage devices. To utilize the non-volatility of NVRAMs, programs should allow durable stores, meaning consistency must be maintained across a power loss event. GPUs are designed for high throughput, leveraging high degrees of parallelism. However, with NVRAM write bandwidths lower than those of DRAMs, using NVRAM as-is may yield suboptimal overall system performance. To address this problem, we propose using Helper Warps to move persistence out of the critical path of transaction execution, alleviating the impact of latencies. Our mechanism achieves speedups of 4.4× and 1.5× under bandwidth limits of 1.6 GB/s and 12 GB/s, respectively, and is projected to maintain its speed advantage even when NVRAM bandwidth reaches hundreds of GB/s in certain cases.

FlashGPU: Placing New Flash Next to GPU Cores

  • Jie Zhang
  • Miryeong Kwon
  • Hyojong Kim
  • Hyesoon Kim
  • Myoungsoo Jung

We propose FlashGPU, a new GPU architecture that tightly blends new flash (Z-NAND) with massive GPU cores. Specifically, we replace global memory with Z-NAND, which exhibits ultra-low latency. We also architect a flash core to manage request dispatches and address translations underneath the L2 cache banks of GPU cores. While Z-NAND is a hundred times faster than conventional 3D-stacked flash, its latency is still longer than that of DRAM. To address this shortcoming, we propose a dynamic page-placement and buffer manager in the Z-NAND subsystem that is aware of the bulk and parallel memory-access characteristics of GPU applications, thereby offering high throughput and low energy consumption.

Performance-aware Wear Leveling for Block RAM in Nonvolatile FPGAs

  • Shuo Huai
  • Weining Song
  • Mengying Zhao
  • Xiaojun Cai
  • Zhiping Jia

Field programmable gate arrays (FPGAs) have been widely adopted in both high-performance servers and embedded systems. Since static random access memory (SRAM) has limited density and comparatively high leakage power, researchers have proposed FPGA architectures based on emerging non-volatile memories (NVMs) to satisfy the requirements of data-intensive and low-power applications. Block RAM is the on-chip memory of FPGAs; when implemented with NVM, it faces the challenge of limited endurance. Traditional wear-leveling strategies cannot be directly applied to block RAM because they may induce large performance overhead. In this paper, we propose a performance-aware wear-leveling scheme for block RAM in FPGAs to improve its lifetime. The placement strategy is improved by injecting wear-leveling guidance. The evaluation shows that a 29.75% lifetime enhancement is achieved together with a 16.32% performance improvement, compared with traditional wear leveling.

ZUMA: Enabling Direct Insertion/Deletion Operations with Emerging Skyrmion Racetrack Memory

  • Zheng Liang
  • Guangyu Sun
  • Wang Kang
  • Xing Chen
  • Weisheng Zhao

Data insertion and deletion are common operations in various applications. However, traditional memory architectures can only perform an indirect insertion/deletion with multiple data read and write operations, which is significantly time- and energy-consuming. To mitigate this problem, we propose to leverage the unique capability of emerging skyrmion racetrack memory technology to naturally support direct insertion/deletion operations inside a racetrack. In this work, we first present a circuit-level model for skyrmion racetrack memory. Then, we propose a novel memory architecture to enable efficient large-size data insertion/deletion. With the help of the model and the architecture, we study several potential applications that leverage the insertion and deletion operations. Experimental results demonstrate that the efficiency of these operations can be substantially improved.

ApproxLP: Approximate Multiplication with Linearization and Iterative Error Control

  • Mohsen Imani
  • Alice Sokolova
  • Ricardo Garcia
  • Andrew Huang
  • Fan Wu
  • Baris Aksanli
  • Tajana Rosing

In a data-hungry world, approximate computing has emerged as one of the solutions for creating more energy-efficient and faster systems while providing application-tailored quality. In this paper, we propose ApproxLP, an Approximate Multiplier based on Linear Planes. We introduce an iterative method for approximating the product of two operands using fitted linear functions of two inputs, referred to as linear planes. The linearization of multiplication allows multiplication operations to be completely replaced with weighted addition. The proposed technique is used to find the significand of the product of two floating-point numbers, decreasing the high energy cost of floating-point arithmetic. Our method fully exploits the trade-off between accuracy and energy consumption by offering various degrees of approximation at different energy costs. As the level of approximation increases, the approximated product asymptotically approaches the exact product. The performance of ApproxLP is evaluated over a range of multimedia and machine learning applications. A GPU enhanced by ApproxLP yields significant energy-delay product (EDP) improvement: for multimedia, neural network, and hyperdimensional computing applications, ApproxLP offers on average 2.4×, 2.7×, and 4.3× EDP improvement, respectively, with sufficient computational quality for the application. ApproxLP also provides up to 4.5× EDP improvement and 2.3× lower chip area than other state-of-the-art approximate multipliers.
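
The core substitution can be illustrated with a first-order tangent-plane approximation. The coefficients below come from a Taylor expansion around an assumed operating point; they stand in for ApproxLP's fitted planes and iterative error-control terms, which the paper derives differently.

```python
# One linear plane replacing a multiplication: x*y ~ a*x + b*y + c.
# Here (a, b, c) come from the tangent plane at (x0, y0); ApproxLP instead
# fits planes per region and iterates on the residual error.
def plane_mul(x, y, x0=0.75, y0=0.75):
    return y0 * x + x0 * y - x0 * y0       # only weighted additions remain

print(plane_mul(0.7, 0.8), 0.7 * 0.8)      # 0.5625 vs 0.56
```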

Cooperative Arithmetic-Aware Approximation Techniques for Energy-Efficient Multipliers

  • Vasileios Leon
  • Konstantinos Asimakopoulos
  • Sotirios Xydis
  • Dimitrios Soudris
  • Kiamal Pekmestzi

Approximate computing appears as an emerging and promising solution for energy-efficient system designs, exploiting the inherent error-tolerant nature of various applications. In this paper, targeting multiplication circuits, i.e., the energy-hungry components of hardware accelerators, an extensive exploration of the error-energy trade-off when combining arithmetic-level approximation techniques is performed for the first time. Arithmetic-aware approximations deliver significant energy reductions while allowing the error values to be controlled in a disciplined manner by setting a configuration parameter accordingly. Inspired by the promising results of prior works with one configuration parameter, we propose 5 hybrid design families for approximate and energy-friendly hardware multipliers, each with two independent parameters to tune the approximation level. Interestingly, the resolution of the state-of-the-art Pareto diagram is improved, giving the flexibility to achieve better energy gains for a specific error constraint imposed by the system. Moreover, we outperform prior works in the field of approximate multipliers by up to 60% energy reduction, and thus we define the new Pareto front.

Approximate Integer and Floating-Point Dividers with Near-Zero Error Bias

  • Hassaan Saadat
  • Haris Javaid
  • Sri Parameswaran

We propose approximate dividers with near-zero error bias for both integer and floating-point numbers. The integer divider, INZeD, is designed using a novel, analytically deduced error-correction method in an approximate log based divider. The floating-point divider, FaNZeD, is based on a highly optimized mantissa divider that is inspired by INZeD. Both of the dividers are error configurable.

Our results show that the INZeD dividers have an error bias in the range of 0.01-4.4%, with area-delay product improvements of 25×-95× and power improvements of 4.7×-15× compared to the accurate integer divider. Likewise, compared to an IEEE single-precision floating-point divider, FaNZeD dividers offer up to 985× area-delay product and 77× power improvements with an error bias in the range of 0.04-2.2%. Most importantly, using our FaNZeD dividers, floating-point arithmetic can be more resource-efficient than fixed-point arithmetic, because most of the FaNZeD dividers are even smaller and have a better area-delay product than the 8-bit and 16-bit accurate integer dividers. Finally, our dividers show a negligible effect on output quality when evaluated with AlexNet and JPEG compression applications.
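
For orientation, the class of divider INZeD builds on works in the log domain, as in the Mitchell-style sketch below. INZeD's contribution, the analytically deduced error-correction term, is omitted here, and the exact antilog at the end is a simplification of what the hardware would approximate.

```python
# Mitchell-style approximate division: approximate log2 of each operand from
# the leading-one position plus the remaining bits read as a fraction, then
# subtract in the log domain.
def log2_approx(n):
    k = n.bit_length() - 1                 # position of the leading one
    return k + (n - (1 << k)) / (1 << k)   # linear fraction below the MSB

def approx_div(a, b):
    return 2.0 ** (log2_approx(a) - log2_approx(b))

print(approx_div(100, 7), 100 / 7)         # ~14.05 vs ~14.29
```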

In-Stream Stochastic Division and Square Root via Correlation

  • Di Wu
  • Joshua San Miguel

Stochastic Computing (SC) minimizes hardware area and power consumption compared to traditional binary-encoded computation, owing to its bit-serial data representation and extremely simple logic. Though existing stochastic computing units mostly assume uncorrelated bit streams, recent works find that correlation can be exploited for higher accuracy. We propose novel architectures for SC division and square root, which leverage correlation via low-cost in-stream mechanisms that eliminate expensive bit stream regeneration. We also introduce new metrics to better evaluate SC circuits that rely on reaching equilibrium via feedback loops. Experiments indicate that our division converges 46.3% faster with both 43.3% lower error and 45.6% less area.
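
The role of correlation is easiest to see in the basic SC setting, sketched below with generic bit-stream generation (not the paper's divider): the same AND gate computes a different function depending on whether its input streams are independent or share a random source.

```python
import random

# Unipolar SC: a value in [0,1] is the probability of a 1 in a bit-stream.
# With independent streams, AND computes x*y; with maximally correlated
# streams (shared random source), AND computes min(x, y) instead.
n, x, y = 100_000, 0.8, 0.5

rng_a, rng_b = random.Random(1), random.Random(2)
indep = sum((rng_a.random() < x) & (rng_b.random() < y)
            for _ in range(n)) / n

rng_c = random.Random(3)
shared = [rng_c.random() for _ in range(n)]
corr = sum((u < x) & (u < y) for u in shared) / n

print(indep, corr)    # ~0.40 (= x*y) vs ~0.50 (= min(x, y))
```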

MASKER: Adaptive Mobile Security Enhancement against Automatic Speech Recognition in Eavesdropping

  • Fuxun Yu
  • Zirui Xu
  • Chenchen Liu
  • Xiang Chen

Benefiting from the recent evolution of artificial intelligence, Automatic Speech Recognition (ASR) technology has achieved enormous performance improvements and wider application. Unfortunately, ASR is also heavily leveraged for speech eavesdropping, where it is used to translate large volumes of intercepted vocal speech into text, causing considerable information leakage. In this work, we propose MASKER, a mobile security enhancement solution that protects mobile speech data from ASR-based eavesdropping. By identifying a ubiquitous vulnerability of ASR models, MASKER is designed to inject human-imperceptible adversarial noise into real-time speech on the mobile device (e.g., phone calls and voice messages). Even if the speech data is exposed to eavesdropping during transmission, the adversarial noise effectively perturbs the ASR process, producing a significant Word Error Rate (WER). Meanwhile, MASKER is further optimized for mobile user perception quality and enhanced for environmental noise adaptation. Moreover, MASKER has outstanding computation efficiency for mobile system integration. Experiments show that MASKER achieves security enhancement with an average WER of 84.55% for ASR perturbation, 32% noise reduction for user perception quality, and 16× faster processing speed compared to the state-of-the-art method.

Adversarial Attack on Microarchitectural Events based Malware Detectors

  • Sai Manoj Pudukotai Dinakarrao
  • Sairaj Amberkar
  • Sahil Bhat
  • Abhijitt Dhavlle
  • Hossein Sayadi
  • Avesta Sasan
  • Houman Homayoun
  • Setareh Rafatirad

To overcome the performance overheads incurred by traditional software-based malware detection techniques, Hardware-assisted Malware Detection (HMD) using machine learning (ML) classifiers has emerged as a panacea to detect malicious applications and secure systems. To classify benign and malicious applications, HMD primarily relies on low-level microarchitectural events captured through Hardware Performance Counters (HPCs). This work crafts an adversarial attack on HMD systems that tampers with their security by introducing perturbations into the HPC traces with the aid of an adversarial sample generator application. To craft the attack, we first deploy an adversarial sample predictor to predict the adversarial HPC pattern for a given application to be misclassified by the ML classifier deployed in the HMD. Further, as the attacker has no direct access to manipulate the HPCs generated at runtime, we devise, based on the output of the adversarial sample predictor, an adversarial sample generator wrapped around a normal application to produce HPC patterns similar to the adversarial predictor's HPC trace. As the crafted adversarial sample generator application does not contain any malicious operations, it is not detectable by traditional signature-based malware detection solutions. With the proposed attack, malware detection accuracy is reduced from 82.76% to 18.04%.

Fault Sneaking Attack: a Stealthy Framework for Misleading Deep Neural Networks

  • Pu Zhao
  • Siyue Wang
  • Cheng Gongye
  • Yanzhi Wang
  • Yunsi Fei
  • Xue Lin

Despite the great achievements of deep neural networks (DNNs), the vulnerability of state-of-the-art DNNs raises security concerns in many application domains requiring high reliability. We propose the fault sneaking attack on DNNs, where the adversary aims to misclassify certain input images into arbitrary target labels by modifying the DNN parameters. We apply ADMM (alternating direction method of multipliers) to solve the optimization problem of the fault sneaking attack with two constraints: 1) the classification of the other images should be unchanged, and 2) the parameter modifications should be minimized. Specifically, the first constraint requires us not only to inject the designated faults (misclassifications), but also to hide the faults for stealthy or sneaking considerations by maintaining model accuracy. The second constraint requires us to minimize the parameter modifications (using the ℓ0 norm to measure the number of modifications and the ℓ2 norm to measure their magnitude). Comprehensive experimental evaluation demonstrates that the proposed framework can inject multiple sneaking faults without degrading the overall test accuracy.
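
In equation form, the attack described above solves a constrained problem of roughly the following shape; the notation is ours, paraphrasing the abstract rather than reproducing the paper's exact formulation.

```latex
% theta = trained parameters, delta = modification,
% A = set of attacked inputs x_i with target labels t_i.
\min_{\delta} \; \|\delta\|_0 + \lambda \|\delta\|_2^2
\quad \text{s.t.} \quad
f_{\theta+\delta}(x_i) = t_i \;\; \forall i \in \mathcal{A}, \qquad
f_{\theta+\delta}(x_j) = f_{\theta}(x_j) \;\; \forall j \notin \mathcal{A}
```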

PREEMPT: PReempting Malware by Examining Embedded Processor Traces

  • Kanad Basu
  • Rana Elnaggar
  • Krishnendu Chakrabarty
  • Ramesh Karri

Anti-virus software (AVS) tools are used to detect malware in a system. However, software-based AVS is vulnerable to attacks: a malicious entity can exploit these vulnerabilities to subvert the AVS. Recently, hardware components such as Hardware Performance Counters (HPCs) have been used for malware detection. In this paper, we propose PREEMPT, a zero-overhead, high-accuracy, and low-latency technique to detect malware by re-purposing the embedded trace buffer (ETB), a debug hardware component available in most modern processors. The ETB is used for post-silicon validation and debug and allows us to control and monitor the internal activities of a chip beyond what is provided by the input/output pins. PREEMPT combines these hardware-level observations with machine learning-based classifiers to preempt malware before it can cause damage. There are many benefits to re-using the ETB for malware detection. It is more difficult to hack into hardware than software, and hence PREEMPT is more robust against attacks than AVS. PREEMPT does not incur performance penalties. Finally, PREEMPT achieves a high true-positive rate of 94% while maintaining a low false-positive rate of 2%.

Workload-Aware Harmonic Partitioned Scheduling of Periodic Real-Time Tasks with Constrained Deadlines

  • Jiankang Ren
  • Xiaoyan Su
  • Guoqi Xie
  • Chao Yu
  • Guozhen Tan
  • Guowei Wu

Multiprocessor platforms have been widely applied in safety-critical domains to accommodate the increasing computation requirements of modern real-time applications. In this paper, we present a workload-aware harmonic partitioned multiprocessor scheduling scheme for periodic real-time tasks with constrained deadlines under the fixed-priority preemptive scheduling policy. In particular, two grouping metrics that effectively integrate both harmonicity and workload characteristics are designed to guide our task partitioning. With these metrics, our scheme can greatly improve system utilization by combining harmonic-relationship exploration with workload awareness. Experiments show that our proposed scheme significantly outperforms existing approaches in terms of schedulability.
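
A task set is harmonic when every period divides each longer period; harmonic groups are attractive because fixed-priority scheduling can then run them at high utilization on a core. The check itself is simple, as in the sketch below; the paper's grouping metrics additionally weigh workload, which we omit here.

```python
# Harmonicity check: sorted periods must divide pairwise, which (by
# transitivity) reduces to checking adjacent pairs.
def is_harmonic(periods):
    p = sorted(periods)
    return all(nxt % cur == 0 for cur, nxt in zip(p, p[1:]))

assert is_harmonic([10, 20, 40])        # 10 | 20 | 40
assert not is_harmonic([10, 25, 40])    # 25 is not a multiple of 10
```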

Holistic Multi-Resource Allocation for Multicore Real-Time Virtualization

  • Meng Xu
  • Robert Gifford
  • Linh Thi Xuan Phan

This paper presents vC2M, a holistic multi-resource allocation framework for real-time multicore virtualization. vC2M integrates shared cache allocation with memory bandwidth regulation to mitigate interferences among concurrent tasks, thus providing better timing isolation among tasks and VMs. It reduces the abstraction overhead through task and VCPU release synchronization and through VCPU execution regulation, and it further introduces novel resource allocation algorithms that consider CPU, cache, and memory bandwidth altogether to optimize resources. Evaluations on our prototype show that vC2M can be implemented with minimal overhead, and that it substantially improves schedulability over existing solutions.

Runtime Resource Management with Workload Prediction

  • Mina Niknafs
  • Ivan Ukhov
  • Petru Eles
  • Zebo Peng

Modern embedded platforms need sophisticated resource managers in order to utilize the heterogeneous computational resources efficiently. Moreover, such platforms are exposed to fluctuating workloads unpredictable at design time. In such a context, predicting the incoming workload might improve the efficiency of resource management. But is this true? And, if yes, how significant is this improvement? How accurate does the prediction need to be in order to improve decisions instead of doing harm? By proposing a prediction-based resource manager aimed at minimizing energy consumption while meeting task deadlines and by running extensive experiments, we try to answer the above questions.

Code Mapping in Heterogeneous Platforms Using Deep Learning and LLVM-IR

  • Francesco Barchi
  • Gianvito Urgese
  • Enrico Macii
  • Andrea Acquaviva

Modern heterogeneous platforms require compilers capable of choosing the appropriate device for the execution of program portions. This paper presents a machine learning method designed to support mapping decisions through the analysis of the program source code represented in the LLVM assembly language (IR), exploiting the advantages offered by this generalised and optimised representation. To evaluate our solution, we trained an LSTM neural network on OpenCL kernels compiled to LLVM-IR and processed with our tokenizer, which filters out less-informative tokens. The resulting network reaches an accuracy of 85% in distinguishing the best computational unit.

REAP: Runtime Energy-Accuracy Optimization for Energy Harvesting IoT Devices

  • Ganapati Bhat
  • Kunal Bagewadi
  • Hyung Gyu Lee
  • Umit Y. Ogras

The use of wearable and mobile devices for health and activity monitoring is growing rapidly. These devices need to maximize their accuracy and active time under a tight energy budget imposed by battery and form-factor constraints. This paper considers energy harvesting devices that run on a limited energy budget to recognize user activities over a given period. We propose a technique to co-optimize the accuracy and active time by utilizing multiple design points with different energy-accuracy trade-offs. The proposed technique switches between these design points at runtime to maximize a generalized objective function under tight harvested energy budget constraints. We evaluate our approach experimentally using a custom hardware prototype and 14 user studies. It achieves 46% higher expected accuracy and 66% longer active time compared to the highest performance design point.

Tumbler: Energy Efficient Task Scheduling for Dual-Channel Solar-Powered Sensor Nodes

  • Yue Xu
  • Hyung Gyu Lee
  • Yujuan Tan
  • Yu Wu
  • Xianzhang Chen
  • Liang Liang
  • Lei Qiao
  • Duo Liu

Energy harvesting technology has been widely adopted in embedded systems. However, an unstable energy source results in unsteady operation. In this paper, we devise a long-term energy-efficient task scheduling scheme for solar-powered sensor nodes. The proposed method exploits reinforcement learning together with a solar energy prediction method to maximize energy efficiency, which ultimately enhances the long-term quality of service (QoS) of the sensor nodes. Experimental results show that the proposed scheduling improves energy efficiency by 6.0% on average and achieves a 54.0% better QoS level, compared with a state-of-the-art task scheduling algorithm.

GreenTPU: Improving Timing Error Resilience of a Near-Threshold Tensor Processing Unit

  • Pramesh Pandey
  • Prabal Basu
  • Koushik Chakraborty
  • Sanghamitra Roy

The emergence of hardware accelerators has brought about several orders of magnitude improvement in the speed of deep neural-network (DNN) inference. Among such DNN accelerators, the Google Tensor Processing Unit (TPU) has transpired to be the best in class, offering more than 15× speedup over contemporary GPUs. However, the rapid growth of several DNN workloads conspires to escalate the energy consumption of TPU-based data centers. In order to restrict the energy consumption of TPUs, we propose GreenTPU, a low-power near-threshold-computing (NTC) TPU design paradigm. To ensure high inference accuracy at low-voltage operation, GreenTPU identifies the patterns in the error-causing activation sequences in the systolic array, and prevents further timing errors from the same sequences by intermittently boosting the operating voltage of the specific multiplier-and-accumulator units in the TPU. Compared to a cutting-edge timing error mitigation technique for TPUs, GreenTPU enables 2X-3X higher performance in an NTC TPU, with a minimal loss in prediction accuracy.

Thermal-Aware Design and Management for Search-based In-Memory Acceleration

  • Minxuan Zhou
  • Mohsen Imani
  • Saransh Gupta
  • Tajana Rosing

Recently, Processing-In-Memory (PIM) techniques exploiting resistive RAM (ReRAM) have been used to accelerate various big-data applications. ReRAM-based in-memory search is a powerful operation that efficiently finds required data in a large data set. However, such operations draw a large amount of current, which may create serious thermal issues, especially in state-of-the-art 3D stacked chips. Therefore, designing PIM accelerators based on in-memory search requires careful consideration of temperature. In this work, we propose static and dynamic techniques to optimize the thermal behavior of PIM architectures running intensive in-memory search operations. Our experiments show that the proposed design significantly reduces the peak chip temperature and the dynamic management overhead. We test our proposed design on two important categories of applications that benefit from search-based PIM acceleration: hyperdimensional computing and database query. Validated experiments show that the proposed method can reduce the steady-state temperature by at least 15.3 °C, which extends the lifetime of the ReRAM device by 57.2% on average. Furthermore, the proposed fine-grained dynamic thermal management provides a 17.6% performance improvement over state-of-the-art methods.

Building Robust Machine Learning Systems: Current Progress, Research Challenges, and Opportunities

  • Jeff Jun Zhang
  • Kang Liu
  • Faiq Khalid
  • Muhammad Abdullah Hanif
  • Semeen Rehman
  • Theocharis Theocharides
  • Alessandro Artussi
  • Muhammad Shafique
  • Siddharth Garg

Machine learning, in particular deep learning, is being used in almost all aspects of life to assist humans, specifically in mobile and Internet of Things (IoT)-based applications. Due to its state-of-the-art performance, deep learning is also being employed in safety-critical applications, for instance, autonomous vehicles. Reliability and security are two of the key required characteristics for these applications because of the impact they can have on human life. Towards this, in this paper, we highlight the current progress, challenges, and research opportunities in the domain of robust systems for machine learning-based applications.

Adversarial Machine Learning Beyond the Image Domain

  • Giulio Zizzo
  • Chris Hankin
  • Sergio Maffeis
  • Kevin Jones

Machine learning systems have had enormous success in a wide range of fields, from computer vision and natural language processing to anomaly detection. However, such systems are vulnerable to attackers who can cause deliberate misclassification by introducing small perturbations. With machine learning systems being proposed for cyber attack detection, such attackers are cause for serious concern. Despite this, the vast majority of adversarial machine learning security research is focused on the image domain. This work gives a brief overview of adversarial machine learning and of machine learning used in cyber attack detection, and suggests key differences between the traditional image domain of adversarial machine learning and the cyber domain. Finally, we demonstrate an adversarial machine learning attack against an industrial control system.

Memory-Bound Proof-of-Work Acceleration for Blockchain Applications

  • Kun Wu
  • Guohao Dai
  • Xing Hu
  • Shuangchen Li
  • Xinfeng Xie
  • Yu Wang
  • Yuan Xie

Blockchain applications have shown huge potential in various domains. Proof of Work (PoW) is the key procedure in blockchain applications; it exhibits a memory-bound characteristic that hinders the performance improvement of blockchain accelerators. In order to mitigate the "memory wall" and improve the performance of memory-hard PoW accelerators, using Ethash as an example, we optimize the memory architecture from two perspectives: 1) hiding memory latency, by proposing a specialized context switch design to overcome the uncertain cycles of repetitive memory requests; and 2) increasing memory bandwidth utilization, by introducing on-chip memory that stores a portion of the Ethash directed acyclic graph (DAG) for larger effective memory bandwidth, and further adopting embedded NOR flash to fulfill that role. We then conduct extensive experiments to explore the design space of our optimized memory architecture for Ethash, including the number of hash cores and the on-chip/off-chip memory technologies and specifications. Based on the design space exploration, we provide guidance for designing memory-bound PoW accelerators. The experimental results show that our optimized designs achieve 8.7%-55% higher hash rate and 17%-120% higher hash rate per Joule compared with the baseline design across different configurations.
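
A toy model of the memory-bound behavior is shown below; the DAG size, mixing function, and round count are stand-ins for the real Ethash algorithm, but they capture why throughput tracks random-access memory bandwidth: each round issues a data-dependent read that defeats caching.

```python
import random

# Toy Ethash-like inner loop: per nonce, a chain of pseudo-random,
# data-dependent DAG reads dominates the runtime.
DAG_WORDS = 1 << 20                        # real DAGs hold billions of words
dag = [random.getrandbits(64) for _ in range(DAG_WORDS)]

def toy_pow(nonce, rounds=64):
    mix = nonce
    for _ in range(rounds):
        mix ^= dag[mix % DAG_WORDS]        # uncacheable random access
        mix = (mix * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF   # cheap mixing
    return mix
```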

Architecture, Chip, and Package Co-design Flow for 2.5D IC Design Enabling Heterogeneous IP Reuse

  • Jinwoo Kim
  • Gauthaman Murali
  • Heechun Park
  • Eric Qin
  • Hyoukjun Kwon
  • Venkata Chaitanya
  • Krishna Chekuri
  • Nihar Dasari
  • Arvind Singh
  • Minah Lee
  • Hakki Mert Torun
  • Kallol Roy
  • Madhavan Swaminathan
  • Saibal Mukhopadhyay
  • Tushar Krishna
  • Sung Kyu Lim

A new trend in complex SoC design is chiplet-based IP reuse using 2.5D integration. In this paper we present a highly integrated design flow that encompasses architecture, circuit, and package to build and simulate heterogeneous 2.5D designs. We chipletize each IP by adding logical protocol translators and physical interface modules. These chiplets are then placed and routed on a silicon interposer. Our package models are used to calculate the PPA and signal/power integrity of the overall system. A design space exploration study using our tool flow shows that 2.5D integration incurs a 2.1x PPA overhead compared with its 2D SoC counterpart.

LifeGuard: A Reinforcement Learning-Based Task Mapping Strategy for Performance-Centric Aging Management

  • Vijeta Rathore
  • Vivek Chaturvedi
  • Amit K. Singh
  • Thambipillai Srikanthan
  • Muhammad Shafique

Device scaling into the sub-deca-nanometer regime has made device aging a primary design concern. In manycore systems, inevitable process variation further adds to delay degradation and, coupled with the scalability issues of manycores, makes aging management while meeting performance demands a complex problem. LifeGuard is a performance-centric, reinforcement learning-based task mapping strategy that leverages the different aging impact of applications to improve system health. Experimental results on a 256-core system, comparing LifeGuard with two state-of-the-art aging optimization techniques, show that LifeGuard improves the health of 57% and 74% of the cores, respectively, while also enhancing the aggregate core frequency.

Accurate Estimation of Program Error Rate for Timing-Speculative Processors

  • Omid Assare
  • Rajesh Gupta

We propose a framework that estimates the error rate experienced by an application as it runs on a timing-speculative processor. The framework uses an instruction error model that is comparable in accuracy to low-level simulations—as it considers the effects of operand values, preceding instructions, datapath configuration, and error correction scheme, as well as process variation, including its spatial correlation property—and yet efficient enough to allow its application in Monte Carlo experiments to characterize large program input datasets. We then use statistical limit theorems to estimate program error rate and quantify the effect of inter-instruction correlations.

Fast Performance Estimation and Design Space Exploration of Manycore-based Neural Processors

  • Jintaek Kang
  • Dowhan Jung
  • Kwanghyun Chung
  • Soonhoi Ha

In the design of a neural processor, a cycle-accurate simulator is usually built to estimate performance before hardware implementation. Since using such a simulator for design space exploration (DSE) of the hardware architecture is quite time-consuming, we propose a novel method that uses a high-level analytical model for fast DSE. In the model, non-deterministic execution delay is modeled with parameters whose contribution to performance is estimated ahead of time by simulation. The viability of the proposed methodology is confirmed with two neural processors with different manycore architectures, achieving a 2000× speed-up within 3% accuracy error compared with simulator-based DSE.

E-LSTM: Efficient Inference of Sparse LSTM on Embedded Heterogeneous System

  • Runbin Shi
  • Junjie Liu
  • Hayden K.-H. So
  • Shuo Wang
  • Yun Liang

Various models based on the Long Short-Term Memory (LSTM) network have demonstrated state-of-the-art performance in sequential information processing. Previous LSTM-specific architectures provision large on-chip memories for weight storage to alleviate the memory-bound issue and facilitate LSTM inference in cloud computing. In this paper, E-LSTM is proposed for embedded scenarios, taking chip area and limited data-access bandwidth into consideration. The heterogeneous hardware in E-LSTM tightly couples an LSTM co-processor with an embedded RISC-V CPU. The eSELL format is developed to represent the sparse weight matrix. With the proposed cell-fusion optimization based on the inherent sparsity in computation, E-LSTM achieves up to a 2.2× speedup in processing throughput.

ReForm: Static and Dynamic Resource-Aware DNN Reconfiguration Framework for Mobile Device

  • Zirui Xu
  • Fuxun Yu
  • Chenchen Liu
  • Xiang Chen

Although the Deep Neural Network (DNN) technique has been widely applied in various applications, DNN-based applications are still too computationally intensive for resource-constrained mobile devices. Many works have been proposed to optimize DNN computation performance, but most of them are limited to an algorithmic perspective, ignoring certain computing issues in practical deployment. To achieve comprehensive DNN performance enhancement in practice, DNN optimization works should closely cooperate with specific hardware and system constraints (i.e., computation capacity, energy cost, memory occupancy, and inference latency). Therefore, in this work, we propose ReForm, a resource-aware DNN optimization framework. Through thorough mobile DNN computing analysis and innovative model reconfiguration schemes (i.e., ADMM-based static model fine-tuning and dynamically selective computing), ReForm can efficiently and effectively reconfigure a pre-trained DNN model for practical mobile deployment with regard to various static and dynamic computation resource constraints. Experiments show that ReForm achieves ~3.5× faster optimization than the state-of-the-art resource-aware optimization method. Also, ReForm can effectively reconfigure a DNN model for different mobile devices with distinct resource constraints. Moreover, ReForm achieves satisfactory computation cost reduction with negligible accuracy drop in both static and dynamic computing scenarios (up to 18% workload, 16.23% latency, 48.63% memory, and 21.5% energy improvement).

XBioSiP: A Methodology for Approximate Bio-Signal Processing at the Edge

  • Bharath Srinivas Prabakaran
  • Semeen Rehman
  • Muhammad Shafique

Bio-signals exhibit high redundancy, and the algorithms for their processing are inherently error resilient. This property can be leveraged to improve the energy efficiency of IoT-Edge devices (wearables) through the emerging trend of approximate computing. This paper presents XBioSiP, a novel methodology for approximate bio-signal processing that employs two quality evaluation stages, during pre-processing and bio-signal processing, to determine the approximation parameters. It thereby achieves high energy savings while satisfying the user-defined quality constraint. Our methodology achieves up to 19× and 22× reductions in the energy consumption of a QRS peak detection algorithm for 0% and < 1% loss in peak detection accuracy, respectively.

RevSCA: Using Reverse Engineering to Bring Light into Backward Rewriting for Big and Dirty Multipliers

  • Alireza Mahzoon
  • Daniel Große
  • Rolf Drechsler

In recent years, formal methods based on Symbolic Computer Algebra (SCA) have shown very good results in the verification of integer multipliers. This success rests on removing redundant terms (vanishing monomials) early, which avoids the explosion in the number of monomials during backward rewriting. However, SCA approaches still suffer from two major problems: (1) a high dependence on the detection of Half Adders (HAs) realized as AND-XOR gates in the multiplier netlist, and (2) an extremely large search space for finding the source of the vanishing monomials. As a consequence, if the multiplier contains dirty logic, e.g., due to non-standard libraries or logic optimization, the existing SCA methods are completely blind to the resulting polynomials, and their techniques for effective division fail.

In this paper, we present RevSCA. RevSCA brings light back into backward rewriting by identifying the atomic blocks of arithmetic circuits using dedicated reverse engineering techniques. Our approach takes advantage of these atomic blocks to detect all sources of vanishing monomials independent of the design architecture. Furthermore, it drastically cuts the local vanishing-monomial removal time by limiting the search space to only a small part of the design. Experimental results confirm the efficiency of our approach in the verification of a wide variety of integer multipliers with up to 1024 output bits.

Temporal Tracing of On-Chip Signals using Timeprints

  • Rehab Massoud
  • Hoang M. Le
  • Peter Chini
  • Prakash Saivasan
  • Roland Meyer
  • Rolf Drechsler

This paper introduces a new method to trace the temporal behavior of on-chip signals cycle-accurately while operating in-field. Current cycle-accurate schemes generate unacceptable amounts of data for logging, storage, and processing.

Our key idea to enable efficient yet cycle-accurate tracing is to bring timing to the front as the main traced artifact. We split signal tracing into consecutive (back-to-back) finite trace-cycles. Within a trace-cycle, each value-change instance of a signal is assigned an encoded timestamp. At the end of each trace-cycle, these encoded timestamps are aggregated into a logged timeprint, which summarizes the temporal behavior over the trace-cycle.

To retrieve the accurate timing, we reconstruct the exact instances from a timeprint via a SAT query. The experiments demonstrate that unprecedentedly lightweight tracing can be applied, and that timeprints enable the verification of cycle-accurate properties and the detection of sporadic temperature effects.
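
A toy analogue of the idea, with an encoding and brute-force decoder of our own choosing (the decoder plays the role of the paper's SAT query; the real scheme uses encodings chosen so that reconstruction is unique):

```python
# Toy illustration of the timeprint idea: instead of logging every value-change
# time inside a trace-cycle, log a small aggregate and reconstruct exactly later.
# The encoding (count + sum + sum of squares) and the exhaustive decoder are
# ours, purely for illustration; some event sets could be ambiguous here.
from itertools import combinations

TRACE_CYCLE = 32  # cycles per trace-cycle

def timeprint(change_times):
    """Aggregate encoded timestamps into a fixed-size summary."""
    return (len(change_times),
            sum(change_times),
            sum(t * t for t in change_times))

def reconstruct(tp):
    """Recover the change instants from a timeprint (exhaustive 'solver')."""
    k, s1, s2 = tp
    for cand in combinations(range(TRACE_CYCLE), k):
        if sum(cand) == s1 and sum(t * t for t in cand) == s2:
            return cand
    return None

events = (3, 11, 29)                 # cycles where the signal toggled
tp = timeprint(events)
print("logged:", tp)                 # three small numbers instead of a full trace
print("reconstructed:", reconstruct(tp))
```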

ACCESS: HW/SW Co-Equivalence Checking for Firmware Optimization

  • Michael Schwarz
  • Raphael Stahl
  • Daniel Müller-Gritschneder
  • Ulf Schlichtmann
  • Dominik Stoffel
  • Wolfgang Kunz

Customizing embedded computing platforms to specific application domains often necessitates optimizing the firmware and/or the HW/SW interface under tight resource constraints. Such optimizations frequently alter the communication between the firmware and the peripheral devices, possibly compromising functional correctness of the input/output behavior of the embedded system. This paper proposes a formal HW/SW co-equivalence checking technique for verifying correct I/O behavior of peripherals under a modified firmware. We demonstrate the great promise of our approach on RTL implementations of several open-source peripherals. In our experiments we successfully prove or disprove correctness of firmware optimizations for an industrial driver software. In addition, we also found a subtle bug in one of the peripherals and several undocumented preconditions for correct device behavior.

Early Concolic Testing of Embedded Binaries with Virtual Prototypes: A RISC-V Case Study

  • Vladimir Herdt
  • Daniel Große
  • Hoang M. Le
  • Rolf Drechsler

Extensive testing of IoT SW is very important to prevent errors and security vulnerabilities. In the SW domain, the automated concolic testing technique has proven very effective.

In this paper we propose an approach for concolic testing of binaries targeting RISC-V systems with peripherals. Our approach works by integrating the Concolic Testing Engine (CTE) with the architecture-specific Instruction Set Simulator (ISS) inside a Virtual Prototype (VP). We provide a designated CTE interface to integrate (SystemC-based) peripherals into the concolic testing by means of SW models. This combination enables high simulation performance at the binary level with comparatively little effort to integrate peripherals with concolic execution capabilities. Our approach has been effective in finding several buffer-overflow-related security vulnerabilities in the FreeRTOS TCP/IP stack.

Tetris: A Streaming Accelerator for Physics-Limited 3D Plane-Wave Ultrasound Imaging

  • Brendan L. West
  • Jian Zhou
  • Ronald G. Dreslinski
  • J. Brian Fowlkes
  • Oliver Kripfgans
  • Chaitali Chakrabarti
  • Thomas F. Wenisch

High volume acquisition rates are imperative for medical ultrasound imaging applications, such as 3D elastography and 3D vector flow imaging. Unfortunately, despite recent algorithmic improvements, high-volume-rate imaging remains computationally infeasible on known platforms.

In this paper, we propose Tetris, a novel hardware accelerator for ultrasound beamforming that enables volume acquisition rates up to the physics limit of acoustic propagation delay. Through algorithmic and hardware optimizations, we enable a streaming system design that outclasses previously proposed accelerators in performance while lowering hardware complexity and storage requirements. For a representative imaging task, our proposed system generates a physics-limited 13,020 volumes per second within a 2.5 W power budget.
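
For readers unfamiliar with the underlying computation, the following minimal delay-and-sum sketch (illustrative geometry and constants of our own, not the Tetris design) shows the per-voxel delay arithmetic a beamformer must sustain for every channel and every image point, which is what makes physics-limited volume rates so demanding:

```python
# Minimal delay-and-sum plane-wave beamforming sketch; array geometry and
# constants are illustrative stand-ins, not the Tetris design itself.
import numpy as np

C = 1540.0                 # speed of sound in tissue, m/s
FS = 40e6                  # ADC sample rate, Hz
elems = np.linspace(-0.01, 0.01, 64)      # 64-element linear array (x, m)
rf = np.random.randn(64, 4096)            # recorded RF samples per channel

def beamform_point(x, z):
    """Focus one image point (x, z): transmit delay + per-element receive delay."""
    tx = z / C                                 # plane wave reaches depth z at z/c
    rx = np.sqrt((x - elems) ** 2 + z ** 2) / C
    idx = np.round((tx + rx) * FS).astype(int)
    idx = np.clip(idx, 0, rf.shape[1] - 1)
    return rf[np.arange(64), idx].sum()        # coherent sum across channels

img = np.array([[beamform_point(x, z)
                 for x in np.linspace(-0.005, 0.005, 32)]
                for z in np.linspace(0.01, 0.04, 32)])
print(img.shape)
```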

ProbLP: A framework for low-precision probabilistic inference

  • Nimish Shah
  • Laura I. Galindez Olascoaga
  • Wannes Meert
  • Marian Verhelst

Bayesian reasoning is a powerful mechanism for probabilistic inference in smart edge devices. During such inferences, a low-precision arithmetic representation can enable improved energy efficiency. However, its impact on inference accuracy is not yet understood. Furthermore, general-purpose hardware does not natively support low-precision representations. To address this, we propose ProbLP, a framework that automates the analysis and design of low-precision probabilistic inference hardware. It automatically chooses an appropriate energy-efficient representation based on worst-case error bounds and hardware energy models. It generates custom hardware for the resulting inference network, exploiting parallelism, pipelining, and low-precision operation. The framework is validated on several embedded-sensing benchmarks.
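
The flavor of worst-case error bookkeeping such a framework automates can be sketched in a simplified form of our own (standard interval-style bounds for sums and products of probabilities in [0, 1]; the bit-width selection rule is our simplification):

```python
# Sketch of worst-case error-bound propagation for fixed-point arithmetic on
# probabilities in [0, 1]; the recurrences are standard, the selection rule ours.
def quant_err(frac_bits: int) -> float:
    return 2.0 ** -(frac_bits + 1)     # round-to-nearest error of one operand

def sum_err(e1, e2):
    return e1 + e2                     # |(a+e1)+(b+e2) - (a+b)| <= e1+e2

def prod_err(e1, e2):
    # |(a+e1)(b+e2) - ab| <= |b|e1 + |a|e2 + e1*e2 <= e1 + e2 + e1*e2 on [0,1]
    return e1 + e2 + e1 * e2

# smallest fractional bit-width keeping a small sum-of-products within budget
budget = 1e-3
for f in range(4, 24):
    e = quant_err(f)
    total = sum_err(prod_err(e, e), prod_err(e, e))   # (a*b) + (c*d)
    if total <= budget:
        print(f"{f} fractional bits suffice (bound {total:.2e})")
        break
```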

An Optimized Design Technique of Low-bit Neural Network Training for Personalization on IoT Devices

  • Seungkyu Choi
  • Jaekang Shin
  • Yeongjae Choi
  • Lee-Sup Kim

Personalization by incremental learning has become essential for IoT devices to enhance the performance of deep learning models trained with global datasets. To avoid massive transmission traffic in the network, exploiting on-device learning is necessary. We propose a software/hardware co-design technique that builds an energy-efficient, low-bit trainable system: (1) software optimizations by local low-bit quantization and computation freezing to minimize the on-chip storage requirement and computational complexity, and (2) a hardware design of a bit-flexible multiply-and-accumulate (MAC) array that shares the same resources in inference and training. Our scheme reduces on-chip buffer storage by 99.2% and achieves 12.8× higher peak energy efficiency compared to previous trainable accelerators.

L-MPC: A LUT based Multi-Level Prediction-Correction Architecture for Accelerating Binary-Weight Hourglass Network

  • Hong Liu
  • Leibo Liu
  • Wenping Zhu
  • Qiang Li
  • Huiyu Mo
  • Shaojun Wei

A binary-weight hourglass network (B-HG) accelerator for landmark detection, built on the proposed look-up-table (LUT) based multi-level prediction-correction approach, enables high-speed and energy-efficient processing on IoT edge devices. First, a LUT with a unified mode is adopted to support convolutional neural networks with fully variable weight bit precision, minimizing the operations of B-HG and achieving a 1.33×-1.50× speedup on multi-bit-weight CNNs relative to a similar solution. Second, a multi-level prediction-correction model is proposed to achieve computation-efficient convolution with adaptive precision; the operations saved increase by about 30% compared with the two-stage model. Besides, nearly 77.4% of the operations in B-HG can be saved by combining these two methods, yielding a 2.3× inference speedup. Third, a block-computing-based pipeline is designed to mitigate the residual-block deficiency in B-HG. It not only reduces off-chip memory accesses by about 66.2% compared with the baseline, but also saves 60% and 31% of on-chip memory space and accesses, respectively, compared with a similar fused-layer accelerator. The proposed B-HG accelerator achieves 450 fps at 500 MHz based on simulation in a TSMC 28 nm process. Meanwhile, the power efficiency is up to 8.5 TOPS/W, two orders of magnitude higher than the dedicated face-landmark-detection accelerator.

eSLAM: An Energy-Efficient Accelerator for Real-Time ORB-SLAM on FPGA Platform

  • Runze Liu
  • Jianlei Yang
  • Yiran Chen
  • Weisheng Zhao

Simultaneous Localization and Mapping (SLAM) is a critical task for autonomous navigation. However, due to the computational complexity of SLAM algorithms, it is very difficult to achieve real-time implementation on low-power platforms. We propose an energy-efficient architecture for a real-time ORB (Oriented-FAST and Rotated-BRIEF) based visual SLAM system by accelerating its most time-consuming stages, feature extraction and matching, on an FPGA platform. Moreover, the original ORB descriptor pattern is reformed in a rotationally symmetric manner, which is much more hardware friendly. Optimizations including rescheduling and parallelizing are further applied to improve the throughput and reduce the memory footprint. Compared with Intel i7 and ARM Cortex-A9 CPUs on the TUM dataset, our FPGA realization achieves up to 3× and 31× frame-rate improvements, as well as up to 71× and 25× energy-efficiency improvements, respectively.

ShuntFlow: An Efficient and Scalable Dataflow Accelerator Architecture for Streaming Applications

  • Shijun Gong
  • Jiajun Li
  • Wenyan Lu
  • Guihai Yan
  • Xiaowei Li

Stream processing is an important and growing class of applications for analyzing continuous streams of real-time data. Sliding-window aggregations (SWAGs) dominate the computation time in such applications and demand an unprecedented computation capacity, which poses a great challenge to computing architectures. General-purpose processors cannot handle SWAGs efficiently because of their specific computation patterns. This paper proposes an efficient accelerator architecture for ubiquitous SWAGs, called ShuntFlow. ShuntFlow is a typical Kernel Processing Unit (KPU), where “kernel” refers to the two main categories of SWAG operations widely used in stream processing. Meanwhile, we propose a shunt rule that enables ShuntFlow to efficiently handle SWAGs with arbitrary parameters. As a case study, we implemented ShuntFlow on an Altera Arria 10 AX115N FPGA board at 150 MHz and compared it to previous approaches. The experimental results show that ShuntFlow provides a tremendous throughput and latency advantage over CPU and GPU implementations on both reduce-like and index-like SWAGs.

A General Pattern-Based Dynamic Compilation Framework for Coarse-Grained Reconfigurable Architectures

  • Xingchen Man
  • Leibo Liu
  • Jianfeng Zhu
  • Shaojun Wei

Compilation has become a major challenge to the usability of coarse-grained reconfigurable architectures, as increasing programmable resources must be orchestrated. Static compilation suffers from prohibitive time cost, while dynamic compilation still performs poorly in both generality and efficiency. This paper proposes a general pattern-based dynamic compilation framework, which utilizes statically generated patterns to directly determine runtime re-placement and routing so that the runtime configuration-creation algorithm has low complexity. Domain-specific communication characteristics are harnessed to help improve the efficiency of the patterns. The experimental results show that compiled general applications can be transformed onto arbitrary resources at runtime, preserving 97% (39%~163%) of the original performance/resource on average, 7% (0~17%) better than the state-of-the-art non-general methods.

ReTagger: An Efficient Controller for DRAM Cache Architectures

  • Mahdi Nazm Bojnordi
  • Farhan Nasrullah

3D die-stacking has enabled energy-efficient solutions for near-data processing by integrating multiple dies of high-density memory layers and processor cores within the same package. One promising approach is to employ the in-package memory as a gigascale last-level cache for data-intensive computing. Most existing in-package cache controllers rely on command scheduling policies borrowed from the off-chip DRAM system. Regrettably, these control policies are not specifically tailored to in-package cache traffic, which results in limited bandwidth efficiency. This paper proposes ReTagger, a DRAM cache controller that employs repeated tags to alleviate the cost of DRAM row-buffer misses. Our simulation results on a set of ten data-intensive applications indicate an average of 20% performance improvement for the proposed controller over state-of-the-art DRAM caches.

Software Approaches for In-time Resilience

  • Aviral Shrivastava
  • Moslem Didehban

Advances in semiconductor technology have enabled unprecedented growth in safety-critical applications. However, due to unabated scaling, the unreliability of the underlying hardware is only getting worse. For many applications, just recovering from errors is not enough: the latency from the occurrence of a fault to its detection and recovery, i.e., in-time error resilience, is of vital importance. This is especially true for real-time applications, where the timing of application events is a crucial part of the application's correctness. While software techniques for resilience are highly desirable since they can be flexibly applied, achieving reliable, in-time software resilience is still an elusive goal. A new class of recent techniques has started to tackle this problem. This paper presents a succinct overview of existing software resilience techniques from the point of view of in-time resilience, and points out future challenges.

Cross-Layer Resilience: Challenges, Insights, and the Road Ahead

  • Eric Cheng
  • Daniel Mueller-Gritschneder
  • Jacob Abraham
  • Pradip Bose
  • Alper Buyuktosunoglu
  • Deming Chen
  • Hyungmin Cho
  • Yanjing Li
  • Uzair Sharif
  • Kevin Skadron
  • Mircea Stan
  • Ulf Schlichtmann
  • Subhasish Mitra

Resilience to errors in the underlying hardware is a key design objective for a large class of computing systems, from embedded systems all the way to the cloud. Sources of hardware errors include radiation, circuit aging, variability induced by manufacturing and operating conditions, manufacturing test escapes, and early-life failures. Many publications have suggested that cross-layer resilience, where multiple error resilience techniques from different layers of the system stack cooperate to achieve cost-effective resilience, is essential for designing cost-effective resilient digital systems. This paper presents a comprehensive overview of cross-layer resilience by addressing fundamental cross-layer resilience questions, by summarizing insights derived from recent advances in cross-layer resilience research, and by discussing future cross-layer resilience challenges.

Increasing Soft Error Resilience by Software Transformation

  • Michael Werner
  • Keerthikumara Devarajegowda
  • Moomen Chaari
  • Wolfgang Ecker

Developing software in a slightly different way can have a dramatic impact on soft error resilience. This observation can be turned into a process that improves existing code through transformations. These transformations are systematic in nature and can be automated. In this paper, we present a framework for low-level embedded software generation, commonly referred to as firmware, and the inclusion of safety measures in the generated code. The generation approach follows a three-stage process, starting with a formalized firmware specification using both platform-dependent and platform-independent firmware models. Finally, C code is generated from the view model in a straightforward way. Safety measures are included either as part of the translation step between the models or as transformations of single models.

FLightNNs: Lightweight Quantized Deep Neural Networks for Fast and Accurate Inference

  • Ruizhou Ding
  • Zeye Liu
  • Ting-Wu Chin
  • Diana Marculescu
  • R. D. (Shawn) Blanton

To improve the throughput and energy efficiency of Deep Neural Networks (DNNs) on customized hardware, lightweight neural networks constrain the weights of DNNs to be a limited combination (denoted as k ∈ {1, 2}) of powers of 2. In such networks, the multiply-accumulate operation can be replaced with a single shift operation, or two shifts and an add operation. To provide even more design flexibility, the k for each convolutional filter can be optimally chosen instead of being fixed for every filter. In this paper, we formulate the selection of k to be differentiable, and describe model training for determining k-based weights on a per-filter basis. Over 46 FPGA-design experiments involving eight configurations and four data sets reveal that lightweight neural networks with a flexible k value (dubbed FLightNNs) fully utilize the hardware resources on Field Programmable Gate Arrays (FPGAs). Our experimental results show that FLightNNs can achieve a 2× speedup compared to lightweight NNs with k = 2, with only 0.1% accuracy degradation. Compared to a 4-bit fixed-point quantization, FLightNNs achieve higher accuracy and up to 2× inference speedup, due to their lightweight shift operations. In addition, our experiments also demonstrate that FLightNNs can achieve higher computational energy efficiency for ASIC implementation.
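
To see why such weights are hardware friendly, consider this toy quantizer of ours (a greedy approximation, not the paper's differentiable training procedure): a weight constrained to k signed powers of two turns each multiplication into at most k shifts and adds:

```python
# Toy power-of-two weight quantization: multiply becomes shift(-and-add).
import math

def quantize_pow2(w: float, k: int):
    """Greedily approximate w by k signed powers of two (illustrative only)."""
    terms, r = [], w
    for _ in range(k):
        if r == 0:
            break
        s = 1 if r > 0 else -1
        e = round(math.log2(abs(r)))
        terms.append((s, e))
        r -= s * 2.0 ** e
    return terms

def shift_multiply(x: int, terms) -> int:
    """Multiply integer activation x by the quantized weight using shifts/adds."""
    acc = 0
    for s, e in terms:
        acc += s * (x << e if e >= 0 else x >> -e)
    return acc

w = 0.59375
t = quantize_pow2(w, k=2)        # [(1, -1), (1, -3)] -> 0.5 + 0.125 = 0.625
print(t, shift_multiply(256, t), round(256 * w))   # approximate vs exact product
```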

BiScaled-DNN: Quantizing Long-tailed Datastructures with Two Scale Factors for Deep Neural Networks

  • Shubham Jain
  • Swagath Venkataramani
  • Vijayalakshmi Srinivasan
  • Jungwook Choi
  • Kailash Gopalakrishnan
  • Leland Chang

Fixed-point implementations (FxP) are prominently used to realize Deep Neural Networks (DNNs) efficiently on energy-constrained platforms. The choice of bit-width is often constrained by the ability of FxP to represent the entire range of numbers in the datastructure with sufficient resolution. At low bit-widths (< 8 bits), state-of-the-art DNNs invariably suffer a loss in classification accuracy due to quantization/saturation errors.

In this work, we leverage a key insight that almost all datastructures in DNNs are long-tailed, i.e., a significant majority of the elements are small in magnitude, with a small fraction being orders of magnitude larger. We propose BiScaled-FxP, a new number representation that caters to the disparate range and resolution needs of long-tailed data-structures. The key idea is, whilst using the same number of bits to represent elements of both large and small magnitude, to employ two different scale factors, viz. scale-fine and scale-wide, in their quantization. Scale-fine allocates more fractional bits, providing resolution for small numbers, while scale-wide favors covering the entire range of large numbers, albeit at a coarser resolution. We develop a BiScaled-DNN accelerator which computes on BiScaled-FxP tensors. A key challenge is to store the scale factor used in quantizing each element, as computations that use operands quantized with different scale factors need to scale their result. To minimize this overhead, we use a block-sparse format to store only the indices of the scale-wide elements, which are few in number. Also, we enhance the BiScaled-FxP processing elements with shifters to scale their output when the operands of a computation use different scale factors. We develop a systematic methodology to identify the scale-fine and scale-wide factors for the weights and activations of any given DNN. Over 8 state-of-the-art image recognition benchmarks, BiScaled-FxP reduces computation bit-width by 2 bits compared with conventional FxP, while also slightly improving classification accuracy in all cases. Compared to FxP8, the performance and energy benefits range between 1.43×-3.86× and 1.4×-3.7×, respectively.
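
A minimal sketch of the two-scale idea, with parameters of our own choosing (the real work derives the scale factors systematically and uses a block-sparse index format):

```python
# Two-scale quantization for a long-tailed tensor: most elements use a fine
# scale; the few large-magnitude outliers use a wide scale, and only their
# indices are stored. All bit-widths here are illustrative.
import numpy as np

def biscaled_quantize(x, bits=8, fine_frac_bits=6, wide_frac_bits=2):
    qmax = 2 ** (bits - 1) - 1
    fine, wide = 2.0 ** -fine_frac_bits, 2.0 ** -wide_frac_bits
    wide_idx = np.flatnonzero(np.abs(x) > qmax * fine)   # the long tail
    q = np.round(x / fine).clip(-qmax - 1, qmax).astype(np.int8)
    q[wide_idx] = np.round(x[wide_idx] / wide).clip(-qmax - 1, qmax)
    return q, wide_idx

def biscaled_dequantize(q, wide_idx, fine_frac_bits=6, wide_frac_bits=2):
    x = q.astype(np.float64) * 2.0 ** -fine_frac_bits
    x[wide_idx] = q[wide_idx] * 2.0 ** -wide_frac_bits
    return x

rng = np.random.default_rng(0)
t = rng.normal(0, 0.3, 1000)
t[rng.choice(1000, 10, replace=False)] *= 20          # long-tail outliers
q, idx = biscaled_quantize(t)
err = np.abs(biscaled_dequantize(q, idx) - t).max()
print(f"{len(idx)} scale-wide elements, max error {err:.3f}")
```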

A None-Sparse Inference Accelerator that Distills and Reuses the Computation Redundancy in CNNs

  • Ying Wang
  • Shengwen Liang
  • Huawei Li
  • Xiaowei Li

Prior research on energy-efficient Convolutional Neural Network (CNN) inference accelerators mostly focuses on exploiting model sparsity, i.e., zero patterns in weights and activations, to reduce on-chip storage and computation overhead. In this work, we found that in addition to zero patterns, a larger group of repetitive patterns and values exists in the working set of a CNN inference task; we define this as computation redundancy, and it induces unnecessary performance and storage overhead in CNN accelerators. Based on this observation, we propose a redundancy-free architecture that detects and eliminates the repetitive computation and storage patterns in CNNs for more efficient network inference. The architecture consists of two parts: an off-line parameter analyzer that extracts the repetitive patterns in the 3D tensor of parameters, and the dataflow accelerator. The proposed accelerator first preprocesses the weight patterns and the dynamically generated activations, and then caches these intermediate results in special P2-cache banks for further use in the convolution or fully-connected stage. Experiments show that the proposed Cavoluche architecture removes up to 89% of the repetitive operations from the layer inference process and reduces the on-chip storage space needed for the redundancy-free weights and activations by 77%. The implementation of Cavoluche outperforms the state-of-the-art mobile GPGPU in both performance and energy efficiency. When compared to the latest sparsity-based accelerators, Cavoluche also achieves better operation-elimination effects.

On the Complexity Reduction of Dense Layers from O(N²) to O(N log N) with Cyclic Sparsely Connected Layers

  • Morteza Hosseini
  • Mark Horton
  • Hiren Paneliya
  • Uttej Kallakuri
  • Houman Homayoun
  • Tinoosh Mohsenin

In deep neural networks (DNNs), model size is an important factor affecting performance, energy efficiency, and scalability. Recent works on weight pruning have shown significant reduction in model size at the expense of irregularity in the DNN architecture, which necessitates additional indexing memory to address non-zero weights, thereby increasing chip size, energy consumption, and delays. In this paper, we propose cyclic sparsely connected (CSC) layers, with a memory/computation complexity of O(N log N), that can be used as an overlay for fully connected (FC) layers, whose number of parameters, O(N²), can dominate the parameters of an entire DNN model. Each CSC layer is composed of a few sequential layers, referred to as support layers, which together provide full connectivity between the inputs and outputs of the CSC layer. We introduce an algorithm to train models with FC layers replaced by CSC layers in a bottom-up approach, incrementally increasing the CSC layers' characteristics, such as connectivity and number of synapses, to achieve the desired accuracy for a given compression rate. One advantage of CSC layers is that no indexing of the non-zero weights is required. Our experimental results using AlexNet on ImageNet and LeNet-300-100 on MNIST indicate that by substituting FC layers with CSC layers, we can achieve 10× to 46× compression within a margin of 2% accuracy loss, which is comparable to non-structural pruning methods. A scalable parallel hardware architecture to implement CSC layers, and an equivalent scalable parallel architecture to efficiently implement non-structurally pruned FC layers, are designed and fully placed and routed on an Artix-7 FPGA and in ASIC 65 nm CMOS technology for the LeNet-300-100 model. The results indicate that the proposed CSC hardware outperforms the conventional non-structurally pruned architecture with an equal compression rate by ~2× in power, energy, area, and resource utilization when running at the same frequency.
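
The connectivity argument can be illustrated with a small sketch (our own variant of a cyclic pattern, not necessarily the paper's exact construction): log2(N) support layers with cyclic power-of-two offsets compose to full input-output connectivity using only O(N log N) edges:

```python
# Support layers with cyclic power-of-two offsets: composing log2(N) of them
# connects every input to every output with 2*N*log2(N) edges instead of N^2.
import numpy as np

N = 16
layers = []
for s in range(int(np.log2(N))):                # log2(N) support layers
    m = np.eye(N, dtype=int)
    for i in range(N):
        m[i, (i + 2 ** s) % N] = 1              # cyclic offset of 2^s
    layers.append(m)

paths = np.eye(N, dtype=int)
for m in layers:
    paths = paths @ m                           # count input->output paths

edges = sum(int(m.sum()) for m in layers)
print("fully connected:", bool((paths > 0).all()))        # True
print(f"edges: {edges} (= 2*N*log2(N)) vs dense {N * N}")  # 128 vs 256
```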

Sensitivity based Error Resilient Techniques for Energy Efficient Deep Neural Network Accelerators

  • Wonseok Choi
  • Dongyeob Shin
  • Jongsun Park
  • Swaroop Ghosh

With the inherent algorithmic error resilience of deep neural networks (DNNs), supply voltage scaling is a promising technique for energy-efficient DNN accelerator design. In this paper, we propose novel error-resilience techniques that enable aggressive voltage scaling by exploiting the different amounts of error resilience (sensitivity) of DNN layers, filters, and channels. First, to rapidly evaluate the filter/channel-level weight sensitivities of large-scale DNNs, a first-order Taylor expansion is used, which accurately approximates the weight sensitivity obtained from actual error-injection simulation. Using the measured timing-error probability of each multiply-accumulate (MAC) unit under process variations, the sensitivity variation among filter weights can be leveraged in DNN accelerator design, such that computations with more sensitive weights are assigned to more robust MAC units, while those with less sensitive weights are assigned to less robust MAC units. Based on post-synthesis timing simulations, 51% energy savings have been achieved on the CIFAR-10 dataset using VGG-9 compared to a state-of-the-art timing-error recovery technique under the same constraint of 3% accuracy loss.

St-DRC: Stretchable DRAM Refresh Controller with No Parity-overhead Error Correction Scheme for Energy-efficient DNNs

  • Duy-Thanh Nguyen
  • Nhut-Minh Ho
  • Ik-Joon Chang

We present a stretchable DRAM refresh controller for the energy-efficient processing of DNNs, namely St-DRC. We exploit the characteristic that the recognition accuracy of DNNs is insensitive to errors in insignificant bits. By replacing some insignificant bits with parity bits for the error correction of significant bits, St-DRC can protect the significant bits under stretched refresh periods. This significantly reduces DRAM refresh energy without performance degradation of DNNs, and is applicable to both training and inference. Our simulation shows that in training, St-DRC obtains 23%/12% DRAM energy savings for graphics/main memories, respectively. Further, St-DRC accelerates training speed by 0.43~4.12%.
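
A toy version of the bit-repurposing idea (detection only, with a per-value parity layout of our own choosing; the actual St-DRC scheme differs in granularity and correction capability):

```python
# Sacrifice an insignificant bit of a stored byte to hold parity protecting its
# significant bits, so refresh errors in the protected bits become detectable.
def encode(byte: int) -> int:
    hi = byte >> 4                               # significant bits we care about
    parity = bin(hi).count("1") & 1
    return (hi << 4) | (byte & 0x0E) | parity    # LSB replaced by parity

def check(stored: int) -> bool:
    hi = stored >> 4
    return (bin(hi).count("1") & 1) == (stored & 1)

v = encode(0b1011_0111)
print(check(v))                        # True: significant bits intact
print(check(v ^ 0b0100_0000))          # False: weak-cell flip in the high bits
print(check(v ^ 0b0000_0100))          # True: low-bit error ignored (OK for DNNs)
```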

FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge

  • Cong Hao
  • Xiaofan Zhang
  • Yuhong Li
  • Sitao Huang
  • Jinjun Xiong
  • Kyle Rupnow
  • Wen-mei Hwu
  • Deming Chen

While embedded FPGAs are attractive platforms for DNN acceleration on edge devices due to their low latency and high energy efficiency, the scarcity of resources on edge-scale FPGA devices also makes DNN deployment challenging. In this paper, we propose a simultaneous FPGA/DNN co-design methodology with both bottom-up and top-down approaches: a bottom-up hardware-oriented DNN model search for high accuracy, and a top-down FPGA accelerator design considering DNN-specific characteristics. We also build an automatic co-design flow, including an Auto-DNN engine to perform hardware-oriented DNN model search, as well as an Auto-HLS engine to generate synthesizable C code of the FPGA accelerator for explored DNNs. We demonstrate our co-design approach on an object detection task using a PYNQ-Z1 FPGA. Results show that our proposed DNN model and accelerator outperform the state-of-the-art FPGA designs in all aspects, including Intersection-over-Union (IoU) (6.2% higher), frames per second (FPS) (2.48× higher), power consumption (40% lower), and energy efficiency (2.5× higher). Compared to GPU-based solutions, our designs deliver similar accuracy but consume far less energy.

Scale-out Acceleration for 3D CNN-based Lung Nodule Segmentation on a Multi-FPGA System

  • Junzhong Shen
  • Deguang Wang
  • You Huang
  • Mei Wen
  • Chunyuan Zhang

Three-dimensional convolutional neural networks (3D CNNs) have become a promising method for lung nodule segmentation. The high computational complexity and memory requirements of 3D CNNs make it challenging to accelerate them on a single FPGA. In this work, we focus on accelerating 3D CNN-based lung nodule segmentation on a multi-FPGA platform by proposing an efficient mapping scheme that takes advantage of the massive parallelism provided by the platform and maximizes the computational efficiency of the accelerators. Experimental results show that our system, integrating four Xilinx VCU118 boards, achieves state-of-the-art performance of 14.5 TOPS, along with a 29.4× performance gain over a CPU and 10.5× higher energy efficiency than a GPU.

Dr. BFS: Data Centric Breadth-First Search on FPGAs

  • Eric Finnerty
  • Zachary Sherer
  • Hang Liu
  • Yan Luo

The flexible architecture of Field Programmable Gate Arrays (FPGAs) lends itself to an array of data-analytics applications, among which Breadth-First Search (BFS), due to its vital importance, draws particular attention. Recent attempts to offload BFS onto FPGAs either simply imitate the existing CPU- or Graphics Processing Unit (GPU)-based mechanisms or suffer from scalability issues. To this end, we introduce a novel data-centric design that extensively extracts the potential of FPGAs for BFS with the following two techniques. First, we advocate partitioning and compressing the BFS algorithmic metadata in order to buffer it in fast on-chip memory and circumvent expensive metadata accesses. Second, we propose a hierarchical coalescing method to improve the throughput of graph data access. Taken together, our evaluation demonstrates that the proposed design achieves, on average, 1.6× and 2.2× speedups over the state-of-the-art FPGA designs TorusBFS and Umuroglu, respectively, across a collection of graph datasets.

Peregrine: A Flexible Hardware Accelerator for LSTM with Limited Synaptic Connection Patterns

  • Jaeha Kung
  • Junki Park
  • Sehun Park
  • Jae-Joon Kim

In this paper, we present an integrated solution to design a high-performance LSTM accelerator. We propose a fast and flexible hardware architecture, named Peregrine, supported by a stack of innovations from algorithm to hardware design. Peregrine first minimizes the memory footprint by limiting the synaptic connection patterns within the LSTM network. Also, Peregrine provides parallel Huffman decoders with adaptive clocking to provide flexibility in dealing with a wide range of sparsity levels in the weight matrices. All these features are incorporated in a novel hardware architecture to maximize energy-efficiency. As a result, Peregrine improves performance by ~38% and energy-efficiency by ~33% in speech recognition compared to the state-of-the-art LSTM accelerator.

Systolic Cube: A Spatial 3D CNN Accelerator Architecture for Low Power Video Analysis

  • Yongchen Wang
  • Ying Wang
  • Huawei Li
  • Cong Shi
  • Xiaowei Li

3D convolutional neural networks (CNNs) are gaining popularity in action/activity analysis. Compared to 2D convolutions, which share filters in the 2D spatial domain, 3D convolutions further reuse filters in the temporal dimension to capture time-domain features. Prior works on specialized 3D-CNN accelerators employ additional on-chip memories and multi-cluster architectures to reuse data among the processing element (PE) arrays, which is too expensive for low-power chips. Instead of harvesting in-memory locality, we propose a 3D systolic-cube architecture that exploits the spatial and temporal localities of 3D CNNs, moving reusable data between PEs connected via a 3D-cube Network-on-Chip. Evaluation shows that the systolic cube delivers a considerable energy-efficiency boost on activity-recognition benchmarks.

Context-Aware Convolutional Neural Network over Distributed System in Collaborative Computing

  • Jinhang Choi
  • Zeinab Hakimi
  • Philip W. Shin
  • Jack Sampson
  • Vijaykrishnan Narayanan

As the computing power of end-point devices grows, there has been interest in developing distributed deep neural networks specifically for hierarchical inference deployments on multi-sensor systems. However, as the existing approaches rely on latent parameters trained by machine learning, it is difficult to preemptively select front-end deep features across sensors, or understand individual feature’s relative importance for systematic global inference. In this paper, we propose multi-view convolutional neural networks exploiting likelihood estimation. Proof-of-concept experiments show that our likelihood-based context selection and weighted averaging collaboration scheme can decrease an endpoint’s communication and energy costs by a factor of 3×, while achieving high accuracy comparable to the original aggregation approaches.

The Best of Both Worlds: On Exploiting Bit-Alterable NAND Flash for Lifetime and Read Performance Optimization

  • Shuo-Han Chen
  • Ming-Chang Yang
  • Yuan-Hao Chang

With the emergence of bit-alterable 3D NAND flash, programming and erasing a flash cell at bit-level granularity have become a reality. Bit-level operations can benefit the high-density, high-bit-error-rate 3D NAND flash by realizing the “bit-level rewrite operation,” which can refresh error bits at bit-level granularity to reduce the error correction latency and improve read performance with minimal lifetime expense. Different from existing refresh techniques, bit-level operations can lower the lifetime expense by removing error bits directly without page-based rewrites. However, since bit-level rewrites may induce a similar amount of latency as conventional page-based rewrites and thus lead to low rewrite throughput, the efficiency of bit-level rewrites should be carefully considered. This observation motivates us to propose a bit-level error removal (BER) scheme to derive the most efficient way of utilizing bit-level operations for both lifetime and read performance optimization. A series of experiments was conducted to demonstrate the capability of the BER scheme, with encouraging results.

WAS: Wear Aware Superblock Management for Prolonging SSD Lifetime

  • Shunzhuo Wang
  • Fei Wu
  • Chengmo Yang
  • Jiaona Zhou
  • Changsheng Xie
  • Jiguang Wan

Superblocks are widely employed in SSDs to improve performance. However, the standard superblock organization, which links blocks with the same block ID across planes into one superblock, leads to unavoidable lifetime waste in SSDs due to inter-block wear-tolerance variations. This work proposes a wear-aware superblock management scheme, called WAS, which (1) dynamically organizes superblocks according to real-time block wear levels so that strong blocks relieve wear on weak ones, and (2) employs a wear-based garbage collection scheme to reduce the inter-block wear gap. Comprehensive experiments are carried out in SSDsim. Results show that WAS greatly prolongs SSD lifetime, by 51.3% compared with the state-of-the-art superblock management.

ASCache: An Approximate SSD Cache for Error-Tolerant Applications

  • Fei Li
  • Youyou Lu
  • Zhongjie Wu
  • Jiwu Shu

With increased density, flash memory becomes more vulnerable to errors. Error correction incurs high overhead, which is especially costly in an SSD cache. However, some applications, like multimedia processing, have an intrinsic tolerance of inaccuracies. In this paper, we propose ASCache, an approximate SSD cache, which allows bit errors below a controllable threshold for error-tolerant applications, so as to reduce the cache miss ratio caused by incorrect cache pages. ASCache further trades the strictness of error correction mechanisms for higher SSD access performance. Evaluations show ASCache reduces the average read latency by up to 30% and the cache miss ratio by 52%.

Leveraging Approximate Data for Robust Flash Storage

  • Qiao Li
  • Liang Shi
  • Jun Yang
  • Youtao Zhang
  • Chun Jason Xue

With increasing bit density and the adoption of 3D NAND, flash memory suffers from increased errors. To address the issue, flash devices adopt error correction codes (ECC) with strong error correction capability, such as the low-density parity-check (LDPC) code. The drawback of LDPC is that, to correct data with a high raw bit error rate (RBER), read latency is amplified. This work proposes to address this issue with the assistance of approximate data. First, our studies show that there is an ample amount of approximate data available in flash storage. Second, a novel data organization is proposed to fortify the reliability of regular data by leaving approximate data unprotected. Finally, a new data allocation strategy and a modified garbage collection scheme are presented to complete the design. The experimental results show that the proposed approach can improve read performance by 30% on average compared with current techniques.

MARCH: MAze Routing Under a Concurrent and Hierarchical Scheme for Buses

  • Jingsong Chen
  • Jinwei Liu
  • Gengjie Chen
  • Dan Zheng
  • Evangeline F. Y. Young

The continuous development of modern VLSI technology has brought new challenges for on-chip interconnections. Different from classic net-by-net routing, bus routing requires all the nets (bits) in the same bus to share similar or even the same topology, besides considering wire length, via count, and other design rules. In this paper, we present MARCH, an efficient maze routing method under a concurrent and hierarchical scheme for buses. In MARCH, to achieve the same topology, all the bits in a bus are routed concurrently, as if marching along a path. For efficiency, our method is hierarchical, consisting of coarse-grained topology-aware path planning and fine-grained track assignment for bits. Additionally, an effective rip-up and reroute scheme is applied to further improve the solution quality. In experimental results, MARCH significantly outperforms the first-place winner of the 2018 IC/CAD Contest in both quality and runtime.

A DAG-Based Algorithm for Obstacle-Aware Topology-Matching On-Track Bus Routing

  • Chen-Hao Hsu
  • Shao-Chun Hung
  • Hao Chen
  • Fan-Keng Sun
  • Yao-Wen Chang

As clock frequencies increase, topology-matching bus routing is desired to provide an initial routing result which facilitates the following buffer insertion to meet the timing constraints. Our algorithm consists of three main techniques: (1) a bus clustering method to reduce the routing complexity, (2) a DAG-based algorithm to connect a bus in the specific topology, and (3) a rip-up and re-route scheme to alleviate the routing congestion. Experimental results show that our proposed algorithm outperforms all the participating teams of the 2018 CAD Contest at ICCAD, where the top-3 routers result in 145%, 158%, and 420% higher costs than ours.

A Learning-Based Recommender System for Autotuning Design Flows of Industrial High-Performance Processors

  • Jihye Kwon
  • Matthew M. Ziegler
  • Luca P. Carloni

Logic synthesis and physical design (LSPD) tools automate complex design tasks previously performed by human designers. One time-consuming task that remains manual is configuring the LSPD flow parameters, which significantly impacts design results. To reduce the parameter-tuning effort, we propose an LSPD parameter recommender system that involves learning a collaborative prediction model through tensor decomposition and regression. Using a model trained with archived data from multiple state-of-the-art 14nm processors, we reduce the exploration cost while achieving comparable design quality. Furthermore, we demonstrate the transfer-learning properties of our approach by showing that this model can be successfully applied for 7nm designs.

Painting on Placement: Forecasting Routing Congestion using Conditional Generative Adversarial Nets

  • Cunxi Yu
  • Zhiru Zhang

The physical design process commonly consumes hours to days for large designs, and routing is known as the most critical step. Demand for accurate routing quality prediction is rising to a new level to accelerate hardware innovation at advanced technology nodes. This work presents an approach that forecasts the density of all routing channels over the entire floorplan, with features collected up to the placement stage, using conditional GANs. Specifically, forecasting routing congestion is cast as an image translation (colorization) problem. The proposed approach is applied to a) placement exploration for minimum congestion, b) constrained placement exploration, and c) real-time congestion forecasting during incremental placement, using eight designs targeting a fixed FPGA architecture.

Pin Accessibility Prediction and Optimization with Deep Learning-based Pin Pattern Recognition

  • Tao-Chun Yu
  • Shao-Yun Fang
  • Hsien-Shih Chiu
  • Kai-Shun Hu
  • Philip Hui-Yuh Tai
  • Cindy Chin-Fang Shen
  • Henry Sheng

FIT: Fill Insertion Considering Timing

  • Bentian Jiang
  • Xiaopeng Zhang
  • Ran Chen
  • Gengjie Chen
  • Peishan Tu
  • Wei Li
  • Evangeline F. Y. Young
  • Bei Yu

Dummy fill insertion is a mandatory step in modern semiconductor manufacturing processes to reduce dielectric thickness variation and provide nearly uniform pattern density for the chemical mechanical planarization (CMP) process. However, with the continuous shrinking of VLSI technology nodes, the coupling effects between the inserted metal fills and signal tracks can severely affect the original timing closure of the layout design. In this paper, we propose a robust, efficient, and high-performance framework for timing-aware dummy fill insertion, which simultaneously minimizes the coupling capacitance of critical signal wires and other wires. The experimental results on the IC/CAD 2018 contest benchmarks show that our proposed framework outperforms the contest winner by 8% on critical coupling capacitance with a 3.3× runtime speedup.

The Metric Matters: The Art of Measuring Trust in Electronics

  • Jonathan Cruz
  • Prabhat Mishra
  • Swarup Bhunia

Electronic hardware trust is an emerging concern for all stakeholders in the semiconductor industry. Trust issues in electronic hardware span all stages of its life cycle – from the creation of intellectual property (IP) blocks to the manufacturing, test, and deployment of hardware components – and all abstraction levels, from chips to printed circuit boards (PCBs) to systems. The trust issues originate from a horizontal business model that promotes reliance on third-party untrusted facilities, tools, and IPs in the hardware life cycle. Today, designers are tasked with verifying the integrity of third-party IPs before incorporating them into system-on-chip (SoC) designs. Existing trust metric frameworks have limited applicability since they are not comprehensive: they capture only a subset of vulnerabilities, such as potential vulnerabilities introduced through design mistakes and CAD tools, or quantify features in a design that target a particular Trojan model. Therefore, current practice uses ad-hoc security analysis of IP cores. In this paper, we propose a vector-based comprehensive coverage metric that quantifies the overall trust of an IP considering both vulnerabilities and direct malicious modifications. We use a variable weighted sum of a design's functional coverage, structural coverage, and asset coverage to assess an IP's integrity. Designers can also effectively use our trust metric to compare the relative trustworthiness of functionally equivalent third-party IPs. To demonstrate the applicability and usefulness of the proposed metric, we apply our trust metric to Trojan-free and Trojan-inserted variants of an IP. Our results demonstrate that we are able to successfully distinguish between trusted and untrusted IPs.

Authenticated Call Stack

  • Hans Liljestrand
  • Thomas Nyman
  • Jan-Erik Ekberg
  • N. Asokan

Shadow stacks are the go-to solution for perfect backward-edge control-flow integrity (CFI). Software shadow stacks trade off security for performance. Hardware-assisted shadow stacks are efficient and secure, but expensive to deploy. We present authenticated call stack (ACS), a novel mechanism for precise verification of return addresses using aggregated message authentication codes. We show how ACS can be realized using ARMv8.3-A pointer authentication, a new low-overhead mechanism for protecting pointer integrity. Our solution achieves security comparable to hardware-assisted shadow stacks, while incurring negligible performance overhead (< 0.5%) but requiring no additional hardware support.
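
The chaining idea can be sketched in a few lines (HMAC-SHA256 stands in here for the ARMv8.3-A pointer-authentication primitive; the frame layout is our simplification of the scheme described above):

```python
# Sketch of an authenticated call stack via chained MACs: each frame's tag
# binds the return address to the tag of the frame below, so a forged or
# reordered return address fails verification.
import hmac, hashlib

KEY = b"per-boot-cpu-key"

def push(chain_tag: bytes, ret_addr: int) -> bytes:
    msg = ret_addr.to_bytes(8, "little") + chain_tag
    return hmac.new(KEY, msg, hashlib.sha256).digest()

def verify(chain_tag: bytes, ret_addr: int, expected: bytes) -> bool:
    return hmac.compare_digest(push(chain_tag, ret_addr), expected)

# call chain: main -> f -> g
t0 = b"\x00" * 32
t1 = push(t0, 0x400100)            # f's frame authenticates return into main
t2 = push(t1, 0x400200)            # g's frame chains onto f's tag

print(verify(t1, 0x400200, t2))    # True: legitimate return
print(verify(t1, 0x400666, t2))    # False: overwritten return address
print(verify(t0, 0x400200, t2))    # False: stale/reordered chain tag
```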

United We Stand: A Threshold Signature Scheme for Identifying Outliers in PLCs

  • Urbi Chatterjee
  • Pranesh Santikellur
  • Rajat Sadhukhan
  • Vidya Govindan
  • Debdeep Mukhopadhyay
  • Rajat Subhra Chakraborty

This work proposes a scheme to detect, isolate, and mitigate malicious disruption of electro-mechanical processes in legacy PLCs, where each PLC works as a finite state machine (FSM) and goes through predefined states depending on the control flow of the programs and the input-output mechanism. The scheme generates a group signature for a particular state by combining the signature shares from each of these PLCs using a (k,l)-threshold signature scheme. If some of them are affected by malicious code, the signature can be verified by k out of the l uncorrupted PLCs and used to detect the corrupted PLCs and the compromised state. We use the OpenPLC software to simulate a legacy PLC system on a Raspberry Pi and show an I/O pin configuration attack on digital and pulse width modulation (PWM) pins. We describe the protocol using a small prototype of five instances of legacy PLCs running simultaneously on the OpenPLC software. We show that when our proposed protocol is deployed, the aforementioned attacks are successfully detected and the controller takes corrective measures. This work was developed as part of the problem statement given in the Cyber Security Awareness Week 2017 competition.
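
For intuition, a minimal (k, l)-threshold sketch of our own (Shamir secret sharing over a prime field plus an HMAC-based state signature; the paper's actual protocol and parameters differ):

```python
# A group secret is Shamir-shared among l PLCs; any k honest PLCs can
# reconstruct it and recompute the state signature. Illustrative only.
import hmac, hashlib, random

P = 2**127 - 1                      # prime field
K, L = 3, 5

def share_secret(secret, k=K, l=L):
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, l + 1)]

def reconstruct(shares):            # Lagrange interpolation at x = 0
    s = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        s = (s + yi * num * pow(den, P - 2, P)) % P
    return s

def state_signature(secret, state: bytes) -> bytes:
    return hmac.new(secret.to_bytes(16, "big"), state, hashlib.sha256).digest()

random.seed(1)
secret = random.randrange(P)
shares = share_secret(secret)
sig = state_signature(secret, b"STATE=FILL_TANK")

# any K uncorrupted PLCs can re-derive and check the group signature
got = reconstruct(shares[:K])
print(state_signature(got, b"STATE=FILL_TANK") == sig)   # True
```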

Improving Static Power Efficiency via Placement of Network Demultiplexer over Control Plane of Router in Multi-NoCs

  • Sonal Yadav
  • Vijay Laxmi
  • Manoj Singh Gaur
  • Hemangee K. Kapoor

The Network Demultiplexer (Net-Demux) is an essential hardware unit in multi-NoC systems for traffic distribution between the NoC networks. This paper proposes the novel idea of placing the Net-Demux at the control plane of the router's switch allocator to improve static power and energy efficiency, compared to the conventional data-plane placement at the Network Interface (NI).

How Secure are Deep Learning Algorithms from Side-Channel based Reverse Engineering?

  • Manaar Alam
  • Debdeep Mukhopadhyay

Deep Learning has become a de-facto paradigm for various prediction problems, including many privacy-preserving applications where the privacy of data is a serious concern. There have been efforts to analyze and exploit information leakages from DNNs to compromise data privacy. In this paper, we provide an evaluation strategy for such information leakages through a DNN by considering a case study on a CNN classifier. The approach utilizes low-level hardware information provided by Hardware Performance Counters, together with hypothesis testing during the execution of the CNN, to raise alarms if there is any information leakage about the actual input.

Predicting DRC Violations Using Ensemble Random Forest Algorithm

  • Riadul Islam
  • Md Asif Shahjalal

At leading technology nodes, the industry is facing a stiff challenge to make profitable ICs. One of the primary issues is design rule checking (DRC) violations. In this research, we align with the DARPA IDEA program, which aims for a “no-human-in-the-loop,” 24-hour turnaround time to implement an IC from design specifications. In order to reduce human effort, we introduce an ensemble random forest algorithm to predict DRC violations before global routing, which is considered the most time-consuming step in an IC design flow. In addition, we identify the features that critically impact DRC violations. The algorithm achieves a 5.8% better F1-score compared to existing SVM classifiers.
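
The overall setup can be sketched as follows (synthetic stand-in features and labels; the paper's feature set and training data come from real design flows):

```python
# Train a random-forest classifier on per-region placement features to flag
# likely DRC violations before global routing. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.uniform(0, 1, n),        # pin density in the region (hypothetical)
    rng.uniform(0, 1, n),        # local routing-demand estimate (hypothetical)
    rng.integers(0, 6, n),       # macro blockage overlap count (hypothetical)
])
y = ((X[:, 0] + X[:, 1] + 0.2 * rng.normal(size=n)) > 1.2).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
print("F1:", round(f1_score(yte, clf.predict(Xte)), 3))
print("feature importances:", clf.feature_importances_.round(3))
```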

Analog Circuit Generator based on Deep Neural Network enhanced Combinatorial Optimization

  • Kourosh Hakhamaneshi
  • Nick Werblun
  • Pieter Abbeel
  • Vladimir Stojanović

A deep neural network (DNN) based stochastic combinatorial optimization framework is presented that can find the optimal sizing of circuits in a sample-efficient manner. This sample efficiency allows us to unify the framework with generator-based tools like the Berkeley Analog Generator (BAG) [1] to directly optimize layout, given the high-level circuit specifications. We use this tool to design an optical link receiver layout satisfying high-level design specifications using post-layout simulations of only 348 design instances. Compared to an evolutionary algorithm without our DNN-based discriminator, our framework improves sample efficiency and run time by more than 200x.

Distributed Timing Analysis at Scale

  • Tsung-Wei Huang
  • Chun-Xun Lin
  • Martin D. F. Wong

As design complexities continue to grow, the need to efficiently analyze circuit timing with billions of transistors is quickly becoming the major bottleneck of the overall chip design flow. In this work we introduce a distributed timer that (1) has scalable performance, (2) can be seamlessly integrated into existing EDA applications, (3) enables transparent resource management, and (4) has robust fault-tolerant control. We evaluate the distributed timer using a set of large industry benchmarks on a cluster with 24 nodes. The results show that the proposed timer achieves full accuracy over all designs with high performance and good scalability.

Towards Practical Record and Replay for Mobile Applications

  • Onur Sahin
  • Assel Aliyeva
  • Hariharan Mathavan
  • Ayse Coskun
  • Manuel Egele

The ability to repeat the execution of a program is a fundamental requirement in evaluating computer systems and apps. Reproducing executions of mobile apps has proven difficult under real-life scenarios due to the different sources of external input and the interactive nature of the apps. We present a new practical record/replay framework for Android, RandR, which handles multiple sources of input and provides cross-device replay capabilities through a dynamic instrumentation approach. We demonstrate the feasibility of RandR by recording and replaying a set of real-world apps.

The Ping-Pong Tunable Delay Line In A Super-Resilient Delay-Locked Loop

  • Zheng-Hong Zhang
  • Wei Chu
  • Shi-Yu Huang

The Tunable Delay Line (TDL) is the most important building block in a modern cell-based timing circuit such as a Phase-Locked Loop (PLL) or Delay-Locked Loop (DLL). Previously proposed TDLs face a dilemma: they cannot be both power efficient and environmentally adaptive at the same time. In this paper, we present an effective solution to this dilemma: a novel “ping-pong delay line” architecture. The idea is to use two small cell-based delay lines operated in a synergistic manner, exchanging the “role of command” dynamically as in a ping-pong game, and thereby jointly reacting to severe environmental changes over a very wide range. The proposed ping-pong delay line has been incorporated into a Delay-Locked Loop (DLL) design to demonstrate its advantages through post-layout simulation.

An Efficient Learning-based Approach for Performance Exploration on Analog and RF Circuit Synthesis

  • Po-Cheng Pan
  • Chien-Chia Huang
  • Hung-Ming Chen

Efficient synthesis of modern analog circuits is important yet challenging due to the repeated re-synthesis process. Precisely exploring the performance limits of an analog circuit in a given technology is time-consuming. This work presents a learning-based framework for exploring the performance limits of analog circuits. With a hierarchical architecture, the dimension of the solution space can be reduced. Bayesian linear regression and support vector machine models are selected to speed up the algorithm and retrieve better performance quality. Experimental results on two analog circuits show that our approach achieves up to 9x runtime speed-up without sacrificing performance quality.

LODESTAR: Creating Locally-Dense CNNs for Efficient Inference on Systolic Arrays

  • Bahar Asgari
  • Ramyad Hadidi
  • Hyesoon Kim
  • Sudhakar Yalamanchili

The performance of sparse problems suffers from a lack of spatial locality and low memory-bandwidth utilization. However, the distribution of non-zero values in the data structures of a class of sparse problems, such as matrix operations in neural networks, is modifiable so that it can be matched to efficient underlying hardware such as systolic arrays. Such modification helps address the challenges coupled with sparsity. To efficiently execute sparse neural network inference on systolic arrays, we propose a structured pruning algorithm that increases the spatial locality in neural network models while maintaining inference accuracy.
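As a concrete, simplified illustration of locality-increasing structured pruning, the sketch below prunes a weight matrix at the granularity of fixed-size tiles, keeping the tiles with the largest L1 norm so the surviving non-zeros stay locally dense. The tile size and keep ratio are assumptions, not LODESTAR’s parameters.

```python
# Hedged sketch: block-wise (tile) pruning so non-zeros stay locally dense.
import numpy as np

def block_prune(W, tile=4, keep=0.5):
    """Keep the `keep` fraction of tiles with the largest L1 norm."""
    out = np.zeros_like(W)
    scores, coords = [], []
    for r in range(0, W.shape[0], tile):
        for c in range(0, W.shape[1], tile):
            scores.append(np.abs(W[r:r+tile, c:c+tile]).sum())
            coords.append((r, c))
    for i in np.argsort(scores)[::-1][: int(len(scores) * keep)]:
        r, c = coords[i]
        out[r:r+tile, c:c+tile] = W[r:r+tile, c:c+tile]  # surviving dense tile
    return out

W = np.random.default_rng(2).standard_normal((16, 16))
Wp = block_prune(W)
print("kept density:", np.count_nonzero(Wp) / Wp.size)  # ~0.5, in dense tiles
```

In practice the kept weights would then be fine-tuned to recover accuracy; only the locality-shaping step is shown here.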

Robustly Executing DNNs in IoT Systems Using Coded Distributed Computing

  • Ramyad Hadidi
  • Jiashen Cao
  • Michael S. Ryoo
  • Hyesoon Kim

Internet of Things (IoT) devices have access to an abundance of raw data for processing. With deep neural networks (DNNs), not only is the demand for computing power on IoT devices increasing, but privacy concerns are also motivating close-to-edge computation. Executing a DNN by distributing its computation across devices is common in IoT systems. However, managing unstable network latencies and intermittent failures is a serious challenge. Our work provides robustness and close-to-zero recovery latency by adapting coded distributed computing (CDC). We analyze robust execution on a mesh of Raspberry Pis by studying four DNNs.
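A minimal sketch of the coded-computing idea behind the robustness claim, assuming the simplest possible code: a linear layer’s weights are split across two workers and a third “parity” worker holds their sum, so either data worker’s output can be rebuilt with no recomputation. The real framework targets full DNNs and general failure patterns.

```python
# Hedged sketch: sum-parity coding for a distributed linear layer.
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((8, 4))       # layer weights
x = rng.standard_normal(4)            # input activations

W1, W2 = W[:4], W[4:]                 # shards for workers 1 and 2
Wp = W1 + W2                          # parity shard for worker 3

y1, y2, yp = W1 @ x, W2 @ x, Wp @ x   # the three workers run in parallel

# Worker 1 fails: its share is recovered from the parity, not recomputed.
y1_recovered = yp - y2
assert np.allclose(y1_recovered, y1)
assert np.allclose(np.concatenate([y1_recovered, y2]), W @ x)
```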

Visual Cortex Inspired Pixel-Level Re-configurable Processors for Smart Image Sensors

  • Pankaj Bhowmik
  • Md Jubaer Hossain Pantho
  • Christophe Bobda

This paper presents a reconfigurable hardware architecture for smart image sensors that speeds up low-level image processing applications at the pixel level. For each pixel in the sensor plane, the design includes an activation module and a processor. The processor has a basic structure common to all applications and reconfigurable segments for specific applications. Visual-cortex-inspired computing, such as temporal predictive coding, is implemented in the activation module to remove temporal redundancy. The ASIC implementation shows the design saves up to 84.01% dynamic power and achieves a 9x speedup at 800 MHz through accurate prediction.

Efficient Circuits for Quantum Search over 2D Square Lattice Architecture

  • Shaohan Hu
  • Dmitri Maslov
  • Marco Pistoia
  • Jay Gambetta

Quantum computing has increasingly drawn interest and investments from the academic, industrial, and governmental research communities worldwide. Among quantum algorithms, Quantum Search is important for its quadratic speedup over its classical-computing counterpart. A key ingredient in its implementation is the Multi-Control Toffoli (MCT) gate, which creates a Boolean product of control variables and XORs it into the target. On an idealized quantum computer, all-to-all connectivity would eliminate the need to use SWAP gates to communicate information. This is, however, not affordable in the current Noisy Intermediate-Scale Quantum (NISQ) computing era. In this work, we discuss how to efficiently implement MCT gates on 2D Square Lattices (2DSL), suitable for superconducting circuits, by taking advantage of relative-phase Toffoli gates and H-tree layouts to drastically reduce resulting circuits’ depths and the amount of SWAPping required.

SEDA – Single Exact Dual Approximate Adders for Approximate Processors

  • Chandan Kumar Jha
  • Joycee Mekie

Approximate computing has gained a lot of popularity due to its energy benefits in a variety of error-tolerant applications. In this paper, we propose an adder that can perform n-bit single exact addition or dual approximate addition (SEDA) and is suitable for processors. The conversion from exact to approximate addition can be done dynamically at runtime. The maximum error of SEDA adders is bounded because the carry is not approximated. Our proposed design consumes 48% less energy, has 32% less delay, and occupies 24% less area than an exact mirror adder.

Merging Everything (ME): A Unified FPGA Architecture Based on Logic-in-Memory Techniques

  • Xiaoming Chen
  • Longxiang Yin
  • Bosheng Liu
  • Yinhe Han

New Computational Results and Hardware Prototypes for Oscillator-based Ising Machines

  • Tianshi Wang
  • Leon Wu
  • Jaijeet Roychowdhury

In this paper, we report new results on a novel Ising machine technology for solving combinatorial optimization problems using networks of coupled self-sustaining oscillators. Specifically, we present several working hardware prototypes using CMOS electronic oscillators, built on bread-boards/perfboards and PCBs, implementing Ising machines consisting of up to 240 spins with programmable couplings. We also report that, just by simulating the differential equations of such Ising machines of larger sizes, good solutions can be achieved easily on benchmark optimization problems, demonstrating the effectiveness of oscillator-based Ising machines.
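To convey what “simulating the differential equations” looks like, here is a toy phase-model simulation in the style commonly used for oscillator Ising machines (Kuramoto-type coupling plus a second-harmonic locking term that binarizes phases). The graph, constants, and sign conventions are illustrative assumptions; the run yields a good, though not guaranteed optimal, cut.

```python
# Hedged sketch: forward-Euler simulation of coupled oscillator phases;
# spins are read out as the sign of cos(phase).
import numpy as np

rng = np.random.default_rng(4)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]  # toy MAX-CUT
n = 6
J = np.zeros((n, n))
for i, j in edges:
    J[i, j] = J[j, i] = -1.0     # this sign drives coupled phases apart

phi = rng.uniform(0.0, 2.0 * np.pi, n)
K, Ks, dt = 1.0, 2.0, 0.01
for _ in range(5000):
    coupling = (J * np.sin(phi[:, None] - phi[None, :])).sum(axis=1)
    phi += dt * (-K * coupling - Ks * np.sin(2.0 * phi))  # SHIL binarization

spins = np.sign(np.cos(phi))
cut = sum(spins[i] != spins[j] for i, j in edges)
print("spins:", spins, "| cut edges:", cut, "of", len(edges))
```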

Internal Structure Aware RDF Data Management in SSDs

  • Renhai Chen
  • Qiming Guan
  • Guohua Yan
  • Zhiyong Feng

In this paper, we make the first effort towards intelligent RDF data management in SSDs. We propose to deeply integrate RDF data processing into SSDs, so that operations on RDF data (e.g., queries) can be executed directly inside the SSD. To this end, we explore two RDF data organizations (e.g., triple-based) that take the internal structure of SSDs into consideration. Experiments are conducted on the Patient Disease Drug (PDD) Graph dataset [11]. The results show that the two proposed strategies achieve comprehensive, scalable in-SSD computation from different aspects (e.g., space efficiency and query efficiency).

TODAES

ACM Transactions on Design Automation of Electronic Systems

The ACM Transactions on Design Automation of Electronic Systems (TODAES) is the premier journal publishing significant results of research and development in the area of design automation of electronic systems. The editorial board invites submission of technical papers describing recent results in this area. The journal provides comprehensive coverage of innovative work concerning the specification, design, analysis, simulation, testing, and evaluation of very-large-scale integrated electronic systems, emphasizing a computer science/engineering orientation.
 
Further information on submission, the editorial board, subscription, and other details can be found at the journal’s home page:
http://www.acm.org/todaes/
ACM TODAES Editor-In-Chief
X. Sharon Hu
shu@nd.edu

Newsletter

The SIGDA Electronic Newsletter, E-Newsletter, is the main resource for the EDA professional, disseminating news and upcoming events in the general areas of Design Automation and related domains. The E-Newsletter was started in late 2002, replacing the printed version.
The E-Newsletter is distributed monthly on the 1st of each month. It includes several sections of interest to the EDA professional, including:

  • News headlines (SIGDA News).
  • “What is…?”, a column featuring recent technical concepts of broad DA interest.
  • Upcoming deadlines for conferences, symposia, and workshops.
  • Upcoming events, such as conferences, symposia, and workshops.
  • Announcements for upcoming SIGDA-sponsored and EDA-related events.

SIGDA E-Newsletter Editor

SIGDA E-Newsletter Associate Editors

  • Prof. Xiang Chen, George Mason University
  • Prof. Yanzhi Wang, Northeastern University
  • Prof. Xunzhao Yin, Zhejiang University
  • Prof. Zheng Zhang, UCSB
  • Dr. Jayita Das, Intel
  • Prof. Qinru Qiu, Syracuse University
  • Prof. Yiyu Shi, University of Notre Dame
  • Dr. Rajsaktish Sankaranarayanan, Intel
  • Dr. Ying Wang, Chinese Academy of Sciences
  • Dr. Jiaqi Zhang, AMD Inc.

 
To view archived newsletters prior to 2002, visit the ACM Digital Library.


Conferences

ACM/SIGDA Sponsored, Co-Sponsored, or In-Cooperation Events

Sponsored/Co-Sponsored Conferences:

ASPDAC: Asia and South Pacific Design Automation Conference
(January) :: http://www.aspdac.com/

FPGA: International Symposium on Field Programmable Gate Arrays
(February) :: http://www.isfpga.org/

TAU: International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems
(February) :: http://www.tauworkshop.com/

DATE: Design, Automation and Test in Europe
(March) :: http://www.date-conference.com/

ISPD: International Symposium on Physical Design
(April) :: http://www.ispd.cc/

GLSVLSI: Great Lakes Symposium on VLSI
(April) :: http://www.glsvlsi.org/

DAC: Design Automation Conference
(June) :: http://www.dac.com/

SLIP: International Workshop on System Level Interconnect Prediction
(June) :: http://www.sliponline.org/

ISLPED: International Symposium on Low Power Electronics and Design
(July) :: http://www.islped.org/

SBCCI: Symposium on Integrated Circuits and Systems Design
(August) :: http://www.psi.poli.usp.br/chipinsampa/sbcci

MLCAD: Workshop on Machine Learning for CAD
(September) :: http://mlcad.itec.kit.edu/

ESWEEK: Embedded Systems Week
(October) :: http://www.esweek.org/

NOCS: International Symposium on Networks-on-Chip
(October) :: https://esweek.org/nocs/

MEMOCODE: International Conference on Formal Methods and Models for System Design
(October) :: http://memocode.irisa.fr/

ICCAD: International Conference on Computer-Aided Design
(November) :: http://iccad.com/

In-Cooperation Conferences:

ISVLSI: IEEE Computer Society Annual Symposium on VLSI
(January) :: http://www.eng.ucy.ac.cy/theocharides/isvlsi21/

ASPDAC 2019 TOC

Full Citation in the ACM Digital Library

SESSION: University design contest

A wide conversion ratio, 92.8% efficiency, 3-level buck converter with adaptive on/off-time control and shared charge pump intermediate voltage regulator

  • Kousuke Miyaji
  • Yuki Karasawa
  • Takanobu Fukuoka

An efficient cascode 3-level buck converter with adaptive on/off-time (AOOT) control and shared charge pump (CP) intermediate voltage (Vmid) regulator is proposed and demonstrated. The conversion ratio (CR) Vout/Vin is enhanced by using the proposed AOOT control scheme, where the control switches between adaptive on-time (AOnT) and adaptive off-time (AOffT) mode according to the target CR. The proposed CP shares flying capacitor Cfly and power switches in the 3-level buck converter to generate Vmid=Vin/2 achieving both small size and low loss. The proposed 3-level buck converter is implemented in a standard 0.25μm CMOS process. 92.8% maximum efficiency and wide CR are obtained with the integrated Vmid regulator.

A three-dimensional millimeter-wave frequency-shift based CMOS biosensor using vertically stacked spiral inductors in LC oscillators

  • Maya Matsunaga
  • Taiki Nakanishi
  • Atsuki Kobayashi
  • Kiichi Niitsu

This paper presents a millimeter-wave frequency-shift-based CMOS biosensor that is capable of providing three-dimensional (3D) resolution. The vertical resolution from the sensor surface is obtained using dual-layer LC oscillators, which enable 3D target detection. The LC oscillators produce different frequency shifts from the desired resonant frequency due to the frequency-dependent complex relative permittivity of the biomolecular target. The measurement results from a 65-nm test chip demonstrated the feasibility of achieving 3D resolution.

Design of 385 x 385 μm2 0.165V 270pW fully-integrated supply-modulated OOK transmitter in 65nm CMOS for glasses-free, self-powered, and fuel-cell-embedded continuous glucose monitoring contact lens

  • Kenya Hayashi
  • Shigeki Arata
  • Ge Xu
  • Shunya Murakami
  • Cong Dang Bui
  • Takuyoshi Doike
  • Maya Matsunaga
  • Atsuki Kobayashi
  • Kiichi Niitsu

This work presents the lowest-power-consumption sub-mm2 supply-modulated OOK transmitter for enabling a self-powered continuous glucose monitoring (CGM) contact lens. By combining the transmitter with a glucose fuel cell that functions as both the power source and the sensing transducer, a self-powered CGM contact lens can be realized. The 385 x 385 μm2 test chip, implemented in 65-nm standard CMOS technology, operates at 270 pW under 0.165 V and successfully demonstrates self-powered operation using a 2 x 2 mm2 solid-state glucose fuel cell.

2D optical imaging using photosystem I photosensor platform with 32×32 CMOS biosensor array

  • Kiichi Niitsu
  • Taichi Sakabe
  • Mariko Miyachi
  • Yoshinori Yamanoi
  • Hiroshi Nishihara
  • Tatsuya Tomo
  • Kazuo Nakazato

This paper presents 2D imaging using a photosensor platform with a newly proposed large-scale CMOS biosensor array in 0.6-μm standard CMOS. The platform combines photosystem I (PSI), isolated from Thermosynechococcus elongatus, with a large-scale CMOS biosensor array. PSI converts absorbed photons into electrons, which are then sensed by the CMOS biosensor array. The prototyped photosensor enables CMOS-based 2D imaging using PSI for the first time.

Design of gate-leakage-based timer using an amplifier-less replica-bias switching technique in 55-nm DDC CMOS

  • Atsuki Kobayashi
  • Yuya Nishio
  • Kenya Hayashi
  • Shigeki Arata
  • Kiichi Niitsu

A gate-leakage-based timer using an amplifier-less replica-bias switching technique, which realizes stable low-voltage operation, is presented. To generate a stable oscillation frequency, a topology is employed that discharges a pre-charged capacitor via a gate-leaking MOS capacitor with a low-leakage switch and logic circuits. The test chip, fabricated in 55-nm deeply depleted channel (DDC) CMOS technology, achieves an Allan deviation floor of 200 ppm at a supply voltage of 350 mV in a 0.0022 mm2 area.

A low-voltage CMOS electrophoresis IC using electroless gold plating for small-form-factor biomolecule manipulation

  • Kiichi Niitsu
  • Yuuki Yamaji
  • Atsuki Kobayashi
  • Kazuo Nakazato

We present a sub-1-V CMOS-based electrophoresis method for small-form-factor biomolecule manipulation, contained in a microchip. This is the first time this type of device has been presented in the literature. By combining CMOS technology with electroless gold plating, the electrode pitch can be reduced and the required input voltage decreased to less than 1 V. We fabricated the CMOS electrophoresis chip in a cost-competitive 0.6-μm standard CMOS process. A sample-and-hold circuit in each cell generates a constant output from an analog input. After forming gold electrodes using an electroless gold plating technique, we were able to manipulate red food coloring with a 0–0.7 V input voltage range. The results show that the proposed CMOS chip is effective for electrophoresis-based manipulation.

A low-voltage low-power multi-channel neural interface IC using level-shifted feedback technology

  • Liangjian Lyu
  • Yu Wang
  • Chixiao Chen
  • C. -J. Richard Shi

A low-voltage low-power 16-channel neural interface front-end IC for in-vivo neural recording applications is presented in this paper. A current reuse telescope amplifier is used to achieve better noise efficiency factor (NEF). Power efficiency factor (PEF) is further improved by reducing supply voltage with the proposed level-shifted feedback (LSFB) technique. The neural interface is fabricated in a 65 nm CMOS process. It operates under 0.6V supply voltage consuming 1.07 μW/channel. An input referred noise of 5.18 μV is measured, leading to a NEF of 2.94 and a PEF of 5.19 over 10 kHz bandwidth.

Development of a high stability, low standby power six-transistor CMOS SRAM employing a single power supply

  • Nobuaki Kobayashi
  • Tadayoshi Enomoto

We developed and applied a new circuit, called the “Self-controllable Voltage Level (SVL)” circuit, not only to expand both “write” and “read” stability, but also to achieve low standby power and data-holding capability in a single-power-supply, 90-nm, 2-kbit, six-transistor CMOS SRAM. The SVL circuit adaptively lowers the wordline voltage for “read” operations and raises it for “write” operations. It also adaptively lowers the memory-cell supply voltage for “write” and “hold” operations and raises it for “read” operations. The Si area overhead of the SVL circuit is only 1.383% of the conventional SRAM.

Design of heterogeneously-integrated memory system with storage class memories and NAND flash memories

  • Chihiro Matsui
  • Ken Takeuchi

A heterogeneously-integrated memory system is configured with various types of storage class memories (SCMs) and NAND flash memories. SCMs are faster than NAND flash and are divided into memory and storage types according to their characteristics. NAND flash memories are also classified by the number of stored bits per memory cell. These non-volatile memories involve trade-offs among access speed, capacity, and bit cost. Therefore, mixing and matching various non-volatile memories is essential to simultaneously achieve the best speed and cost for storage. This paper proposes a design methodology with a unique interaction of device, circuit, and system to find the appropriate configuration of the heterogeneously-integrated memory system for each application.

A 65-nm CMOS fully-integrated circulating tumor cell and exosome analyzer using an on-chip vector network analyzer and a transmission-line-based detection window

  • Taiki Nakanishi
  • Maya Matsunaga
  • Shunya Murakami
  • Atsuki Kobayashi
  • Kiichi Niitsu

A fully-integrated CMOS circuit based on a vector network analyzer (VNA) and a transmission-line-based detection window for circulating tumor cell (CTC) and exosome analysis is presented. We introduce a fully-integrated architecture that eliminates undesired parasitic components and enables high sensitivity for the analysis of extremely low-concentration CTCs in blood. To validate the operation of the proposed system, a test chip was fabricated using 65-nm CMOS technology. Measurement results show the effectiveness of the approach.

Low standby power CMOS delay flip-flop with data retention capability

  • Nobuaki Kobayashi
  • Tadayoshi Enomoto

We developed and applied a new circuit, called the self-controllable voltage level (SVL) circuit, to achieve not only low standby power dissipation (Pst) while retaining data, but also quick switching between operational and standby modes, in a single-power-source, 90-nm CMOS delay flip-flop (D-FF). The Pst of the developed D-FF is only 5.585 nW/bit, i.e., 14.81% of the 37.71 nW/bit of the conventional D-FF at a supply voltage (Vdd) of 1.0 V. The static-noise margin of the developed D-FF is 0.2576 V, while that of the conventional D-FF is 0.3576 V (at a Vdd of 1.0 V). The Si area overhead of the SVL circuit is 11.62% of the conventional D-FF.

Accelerate pattern recognition for cyber security analysis

  • Mohammad Tahghighi
  • Wei Zhang

Network security analysis processes network equipment log records to capture malicious and anomalous traffic. Scrutinizing huge amounts of records to capture complex patterns is time-consuming and difficult to parallelize. In this paper, we propose a hardware/software co-designed system to address this problem for specific IP-chaining patterns.

FPGA laboratory system supporting power measurement for low-power digital design

  • Marco Winzker
  • Andrea Schwandt

Power measurement of a digital design implementation supports development of low-power systems and gives insight into the performance of a circuit. A laboratory system is presented that consists of an FPGA board for use in a hands-on and remote laboratory. Measurement results show how the system can be utilized for teaching and research.

SESSION: Real-time embedded software

Towards limiting the impact of timing anomalies in complex real-time processors

  • Pedro Benedicte
  • Jaume Abella
  • Carles Hernandez
  • Enrico Mezzetti
  • Francisco J. Cazorla

Timing verification of embedded critical real-time systems is hindered by complex designs. Timing anomalies, deeply analyzed in static timing analysis, require specific solutions to bound their impact. For the first time, we study the concept and impact of timing anomalies in measurement-based timing analysis, the approach most used in industry, showing that they must be considered and handled differently. In addition, we analyze anomalies in the context of Measurement-Based Probabilistic Timing Analysis, which simplifies quantifying their impact.

SeRoHAL: generation of selectively robust hardware abstraction layers for efficient protection of mixed-criticality systems

  • Petra R. Kleeberger
  • Juana Rivera
  • Daniel Mueller-Gritschneder
  • Ulf Schlichtmann

A major challenge in mixed-criticality system design is to ensure safe behavior under the influence of hardware errors while complying with cost and performance constraints. SeRoHAL generates hardware abstraction layers with software-based safety mechanisms to handle errors in peripheral interfaces. To reduce performance and memory overheads, SeRoHAL can select protection mechanisms, depending on the criticality of the hardware accesses.

We evaluated SeRoHAL on robot-arm control software. Under fault injection, it prevents up to 76% of assertion failures. Selective protection customized to the criticality of the accesses significantly reduces the induced overheads compared to protecting all hardware accesses.

Partitioned and overhead-aware scheduling of mixed-criticality real-time systems

  • Yuanbin Zhou
  • Soheil Samii
  • Petru Eles
  • Zebo Peng

Modern real-time embedded and cyber-physical systems comprise a large number of applications, often of different criticalities, executing on the same computing platform. Partitioned scheduling is used to provide temporal isolation among tasks with different criticalities. Isolation is often a requirement, for example, to prevent a low-criticality task from overrunning or failing in a way that causes a failure in a high-criticality task. When the number of partitions increases in mixed-criticality systems, the schedule table can become extremely large, which is a critical bottleneck given the design-time and memory constraints of embedded systems. In addition, switching between partitions at runtime incurs CPU overhead due to preemption. In this paper, we propose a design framework comprising a hyper-period optimization algorithm, which reduces the size of the schedule table while preserving schedulability, and a re-scheduling algorithm that reduces the number of preemptions. Extensive experiments demonstrate the effectiveness of the proposed algorithms and design framework.
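The schedule-table problem is easy to see numerically: the table must cover one hyper-period, the least common multiple (LCM) of all task periods, and nudging periods (here, only shrinking them, so every deadline is still met) can collapse that LCM. The period values below are invented for illustration; `math.lcm` needs Python 3.9+.

```python
# Hedged sketch: hyper-period (LCM of periods) before/after harmonization.
from math import lcm

original = [24, 40, 70]   # task periods in ms (illustrative)
shrunk = [20, 40, 60]     # periods only shrink, so tasks run at least as often

print("hyper-period before:", lcm(*original), "ms")  # 840 ms
print("hyper-period after: ", lcm(*shrunk), "ms")    # 120 ms
# A 7x shorter hyper-period means a proportionally smaller schedule
# table, traded against the higher utilization of shorter periods.
```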

SESSION: Hardware and system security

Layout recognition attacks on split manufacturing

  • Wenbin Xu
  • Lang Feng
  • Jeyavijayan (JV) Rajendran
  • Jiang Hu

One technique to prevent attacks from an untrusted foundry is split manufacturing, where only part of the layout is sent to the untrusted high-end foundry and the rest is manufactured at a trusted low-end foundry. The untrusted foundry has the front-end-of-line (FEOL) layout and the original circuit netlist, and attempts to identify critical components in the layout for Trojan insertion. Although defense methods for this scenario have been developed, the corresponding attack techniques are not well explored. For instance, the Boolean satisfiability (SAT) based bijective mapping attack has been mentioned without detailed study. Hence, the defense methods are mostly evaluated with the k-security metric rather than against actual attacks. We provide the first systematic study, to the best of our knowledge, of attack techniques in this scenario. Besides implementing the SAT-based bijective mapping attack, we develop a new attack technique based on structural pattern matching. Experimental comparison with the bijective mapping attack shows that the new technique achieves about the same success rate with much faster speed for cases without the k-security defense, and a much better success rate at the same runtime for cases with the k-security defense. The results offer an alternative and practical interpretation of k-security in split manufacturing.

Execution of provably secure assays on MEDA biochips to thwart attacks

  • Tung-Che Liang
  • Mohammed Shayan
  • Krishnendu Chakrabarty
  • Ramesh Karri

Digital microfluidic biochips (DMFBs) have emerged as a promising platform for DNA sequencing, clinical chemistry, and point-of-care diagnostics. Recent research has shown that DMFBs are susceptible to various types of malicious attacks. Defenses proposed thus far only offer probabilistic guarantees of security due to the limitation of on-chip sensor resources. A micro-electrode-dot-array (MEDA) biochip is a next-generation DMFB that enables the sensing of on-chip droplet locations, which are captured in the form of a droplet-location map. We propose a security mechanism that validates assay execution by reconstructing the sequencing graph (i.e., the assay specification) from the droplet-location maps and comparing it against the golden sequencing graph. We prove that there is a unique (one-to-one) mapping from the set of droplet-location maps (over the duration of the assay) to the set of possible sequencing graphs. Any deviation in the droplet-location maps due to an attack is detected by this countermeasure because the resulting derived sequencing graph is not isomorphic to the original sequencing graph. We highlight the strength of the security mechanism by simulating attacks on real-life bioassays.
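The final comparison step amounts to a graph-isomorphism check between the reconstructed sequencing graph and the golden one. A minimal sketch with networkx, where the “reconstructed” graphs are hand-made stand-ins for the droplet-map analysis:

```python
# Hedged sketch: detect assay tampering via sequencing-graph isomorphism.
import networkx as nx

golden = nx.DiGraph([("dispense_A", "mix"), ("dispense_B", "mix"),
                     ("mix", "dilute"), ("dispense_C", "dilute")])

benign = nx.DiGraph([("d1", "m"), ("d2", "m"),      # reconstructed from
                     ("m", "x"), ("d3", "x")])      # droplet-location maps

tampered = nx.DiGraph([("d1", "m"), ("d2", "m"),    # an attack re-routed
                       ("m", "x"), ("d3", "m")])    # one droplet

print("benign run accepted:", nx.is_isomorphic(golden, benign))    # True
print("attack detected:", not nx.is_isomorphic(golden, tampered))  # True
```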

TAD: time side-channel attack defense of obfuscated source code

  • Alexander Fell
  • Hung Thinh Pham
  • Siew-Kei Lam

Program obfuscation is widely used to protect commercial software against reverse engineering. However, an adversary can still download, disassemble, and analyze binaries of the obfuscated code executed on an embedded System-on-Chip (SoC) and, by correlating execution times to input values, extract secret information from the program. In this paper, we show (1) the impact of widely used obfuscation methods on timing leakage, and (2) that well-known software countermeasures for reducing the timing leakage of programs are not always effective in the low-noise environments found in embedded systems. We propose two methods for mitigating timing leakage in obfuscated code. The first is a compiler-driven method, called TAD, which removes conditional branches with distinguishable execution times for an input program. In the second method (TADCI), TAD is combined with dynamic hardware diversity by replacing primitive instructions with Custom Instructions (CIs) that exhibit non-deterministic execution times at runtime. Experimental results on the RISC-V platform show that information leakage is reduced by 92% and 82% when TADCI is applied to the original and obfuscated source code, respectively.
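The first method’s goal (no secret-dependent branches with distinguishable execution times) follows the classic branch-balancing idea. A small Python illustration of the transformation, not TAD’s actual compiler pass:

```python
# Hedged sketch: replace a secret-dependent if/else with branchless
# masking so the same work is always performed.
def leaky_select(secret_bit, a, b):
    if secret_bit:               # the two paths take visibly different time
        return a * a + a         # "slow" arm
    return b                     # "fast" arm

def balanced_select(secret_bit, a, b):
    mask = -int(secret_bit)      # 0 -> ...000, 1 -> ...111 (two's complement)
    slow = a * a + a             # always evaluated, regardless of the secret
    return (mask & slow) | (~mask & b)

assert balanced_select(1, 3, 7) == leaky_select(1, 3, 7) == 12
assert balanced_select(0, 3, 7) == leaky_select(0, 3, 7) == 7
```

(Python itself is a poor vehicle for constant-time code; the sketch only shows the shape of the source-level transformation.)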

SESSION: Thermal- and power-aware design and optimization

Leakage-aware thermal management for multi-core systems using piecewise linear model based predictive control

  • Xingxing Guo
  • Hai Wang
  • Chi Zhang
  • He Tang
  • Yuan Yuan

Performing thermal management on new-generation IC chips is challenging because leakage power, which is significant in today’s chips, is nonlinearly related to temperature, resulting in a complex nonlinear control problem. In this paper, a new dynamic thermal management (DTM) method with piecewise linear (PWL) thermal-model-based predictive control is proposed to solve this nonlinear control problem. First, a PWL thermal model is built by combining multiple local linear thermal models expanded at several Taylor expansion points. These expansion points are carefully selected by a systematic scheme that exploits the thermal behavior of IC chips. Based on the PWL thermal model, a new predictive control method is proposed to compute future power recommendations for DTM. By approximating the nonlinearity accurately with the PWL thermal model and being equipped with the predictive control technique, the new DTM achieves overall high-quality temperature management with smooth and accurate temperature tracking. Experimental results show the new method outperforms a linear model predictive control based method in temperature management quality with negligible computing overhead.
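The PWL construction can be sketched in a few lines: linearize a toy exponential leakage-versus-temperature model at a few Taylor expansion points and, at runtime, pick the local model nearest the current temperature. All constants are assumptions for illustration.

```python
# Hedged sketch: piecewise-linear approximation of nonlinear leakage.
import numpy as np

def leakage(T):                    # toy exponential leakage model (W vs. K)
    return 2.0 * np.exp(0.02 * (T - 300.0))

expansion_pts = np.array([310.0, 340.0, 370.0])   # chosen Taylor points

def pwl_leakage(T):
    T0 = expansion_pts[np.argmin(np.abs(expansion_pts - T))]
    p0 = leakage(T0)
    return p0 + 0.02 * p0 * (T - T0)   # first-order Taylor term at T0

for T in (315.0, 355.0, 365.0):
    print(f"T={T}: exact={leakage(T):.3f} W, PWL={pwl_leakage(T):.3f} W")
# Each local model is linear, so the predictive controller solves a
# tractable linear problem over its horizon instead of a nonlinear one.
```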

Multi-angle bended heat pipe design using x-architecture routing with dynamic thermal weight on mobile devices

  • Hsuan-Hsuan Hsiao
  • Hong-Wen Chiou
  • Yu-Min Lee

The heat pipe is an effective passive cooling technique for mobile devices. This work builds a thermal model of a multi-angle bended heat pipe and presents an X-architecture routing engine, guided by dynamic thermal weights, to construct the heat pipe path and reduce the operating temperatures of a smartphone. Compared with a commercial tool, the error of the thermal model is only 4.79%. The routing engine can efficiently reduce the operating temperatures of application processors by at least 13.20% in smartphones.

Fully-automated synthesis of power management controllers from UPF

  • Dustin Peterson
  • Oliver Bringmann

We present a methodology for automatic synthesis of power management controllers for System-on-Chip designs by using an extended version of the Unified Power Format (UPF). Our methodology takes an SoC design and a UPF-based power design, and automatically generates a power management controller in Verilog/VHDL that implements the power state machine specified in UPF. It performs a priority-based scheduling for all power state machine actions, connects each power management signal to the corresponding logic wire in the UPF design and integrates the controller into the System-on-Chip using a configurable bus interface. We implemented the proposed approach as a plugin for Synopsys Design Compiler to close the gap in today’s power management flows and evaluated it by a RISC-V System-on-Chip.

SESSION: Reverse engineering: growing more mature – and facing powerful countermeasures

Integrated flow for reverse engineering of nanoscale technologies

  • Bernhard Lippmann
  • Michael Werner
  • Niklas Unverricht
  • Aayush Singla
  • Peter Egger
  • Anja Dübotzky
  • Horst Gieser
  • Martin Rasche
  • Oliver Kellermann
  • Helmut Graeb

In view of the potential risks of piracy and malicious manipulation of complex integrated circuits built in technologies of 45 nm and below, there is an increasing need for an effective and efficient reverse engineering process. This paper provides an overview of the current process and details a new tool for the acquisition and synthesis of large-area images and the extraction of a layout. For the first time, the error between the generated layout and the known drawn GDS is compared quantitatively as a figure of merit (FOM). From this layout, the circuit graph of an ECC encryption circuit is extracted and partitioned into circuit blocks.

NETA: when IP fails, secrets leak

  • Travis Meade
  • Jason Portillo
  • Shaojie Zhang
  • Yier Jin

Assuring the quality and trustworthiness of third-party resources has been a hard problem to tackle. Researchers have shown that analyzing Integrated Circuits (ICs) without the aid of golden models is challenging. In this paper we discuss a toolset, NETA, designed to aid IP users in assuring the confidentiality, integrity, and accessibility of their IC or third-party IP core. The toolset provides a slew of gate-level analysis tools, many of which are heuristic-based, for the purpose of extracting high-level circuit design information. NETA comprises the following tools: RELIC, REBUS, REPCA, REFSM, and REPATH.

The first step in netlist analysis is signal classification. RELIC uses a heuristic-based fan-in structure matcher to determine the uniqueness of each signal in the netlist. REBUS finds word groups by leveraging the data bus in the netlist in conjunction with RELIC’s signal comparison, through heuristic verification of input structures. REPCA, on the other hand, improves upon the standard brute-force RELIC comparison by leveraging the data analysis technique of PCA and a sparse RELIC analysis on all signals. Given a netlist and a set of registers, REFSM reconstructs the logic that represents the behavior of a particular register set over the course of the netlist’s operation. REFSM has been shown useful for examining register interaction at a higher level. REPATH, similar to REFSM, finds a series of input patterns that forces a logical FSM, initialized with some reset state, into a state specified by the user. Finally, REFSM 2 is introduced, which utilizes linear-time precomputation to improve on the original REFSM.

Machine learning and structural characteristics for reverse engineering

  • Johanna Baehr
  • Alessandro Bernardini
  • Georg Sigl
  • Ulf Schlichtmann

In the past years, much of the research into hardware reverse engineering has focused on the abstraction of gate level netlists to a human readable form. However, none of the proposed methods consider a realistic reverse engineering scenario, where the netlist is physically extracted from a chip. This paper analyzes how errors caused by this extraction and the later partitioning of the netlist affect the ability to identify the functionality. Current formal verification based methods, which compare against a golden model, are incapable of dealing with such erroneous netlists. Two new methods are proposed, which focus on the idea that structural similarity implies functional similarity. The first approach uses fuzzy structural similarity matching to compare the structural characteristics of an unknown design against designs in a golden model library using machine learning. The second approach proposes a method for inexact graph matching using fuzzy graph isomorphisms, based on the functionalities of gates used within the design. For realistic error percentages, both approaches are able to match more than 90% of designs correctly. This is an important first step for hardware reverse engineering methods beyond formal verification based equivalence matching.

Towards cognitive obfuscation: impeding hardware reverse engineering based on psychological insights

  • Carina Wiesen
  • Nils Albartus
  • Max Hoffmann
  • Steffen Becker
  • Sebastian Wallat
  • Marc Fyrbiak
  • Nikol Rummel
  • Christof Paar

In contrast to software reverse engineering, there are hardly any tools available that support hardware reversing. Therefore, the reversing process is conducted by human analysts combining several complex semi-automated steps. However, countermeasures against reversing are evaluated solely against mathematical models. Our research goal is the establishment of cognitive obfuscation based on the exploration of underlying psychological processes. We aim to identify problems which are hard to solve for human analysts and derive novel quantification metrics, thus enabling stronger obfuscation techniques.

Insights into the mind of a trojan designer: the challenge to integrate a trojan into the bitstream

  • Maik Ender
  • Pawel Swierczynski
  • Sebastian Wallat
  • Matthias Wilhelm
  • Paul Martin Knopp
  • Christof Paar

The threat of inserting hardware Trojans during design, production, or in the field poses a danger for integrated circuits in real-world applications. A particularly critical case of hardware Trojans is the malicious manipulation of third-party FPGA configurations. In addition to attack vectors during the design process, FPGAs can be infiltrated in a non-invasive manner after shipment through alterations of the bitstream. First, we present an improved methodology for reversing the bitstream file format. Second, we introduce a novel idea for Trojan insertion.

SESSION: All about PIM

GraphSAR: a sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs

  • Guohao Dai
  • Tianhao Huang
  • Yu Wang
  • Huazhong Yang
  • John Wawrzynek

Large-scale graph processing has drawn great attention in recent years. The emerging metal-oxide resistive random access memory (ReRAM) and ReRAM crossbars have shown huge potential in accelerating graph processing. However, the sparsity of natural graphs hinders the performance of graph processing on ReRAMs. Previous work on graph processing on ReRAMs stored and computed edges separately, leading to high energy consumption and long data-transfer latency. In this paper, we present GraphSAR, a sparsity-aware processing-in-memory accelerator for large-scale graph processing on ReRAMs. Computations over edges are performed in the memory, eliminating the overhead of transferring edges. Moreover, graphs are partitioned with sparsity in mind: subgraphs with low density are further divided into smaller ones to minimize wasted memory space. Extensive experimental results show that GraphSAR achieves 4.43x energy reduction and 1.85x speedup (8.19x lower energy-delay product, EDP) over a previous ReRAM-based graph processing architecture (GraphR [1]).

ParaPIM: a parallel processing-in-memory accelerator for binary-weight deep neural networks

  • Shaahin Angizi
  • Zhezhi He
  • Deliang Fan

Recent algorithmic progress has achieved competitive classification accuracy despite constraining neural network weights to binary values (+1/-1). These findings reveal remarkable optimization opportunities: the need for computationally intensive multiplications is eliminated, reducing memory accesses and storage. In this paper, we present the ParaPIM architecture, which transforms current Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) sub-arrays into massively parallel computational units capable of running inference for Binary-Weight Deep Neural Networks (BWNNs). ParaPIM’s in-situ computing architecture greatly reduces the energy consumed in convolutional layers, accelerates BWNN inference, eliminates unnecessary off-chip accesses, and provides ultra-high internal bandwidth. Device-to-architecture co-simulation results indicate ~4x higher energy efficiency and 7.3x speedup over recent processing-in-DRAM acceleration, or roughly 5x higher energy efficiency and 20.5x speedup over recent ASIC approaches, while maintaining inference accuracy comparable to baseline designs.
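Why binary weights remove multiplications is easy to verify: with weights constrained to +1/-1, a dot product collapses into a signed sum of activations, which in-memory hardware can evaluate with adds and subtracts. A small numpy check of the equivalence:

```python
# Hedged sketch: multiplication-free dot product with +1/-1 weights.
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(16)              # input activations
w = rng.choice([-1.0, 1.0], size=16)     # binarized weights

y_mul = np.dot(x, w)                         # conventional MAC form
y_addsub = x[w > 0].sum() - x[w < 0].sum()   # additions/subtractions only
assert np.isclose(y_mul, y_addsub)
```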

CompRRAE: RRAM-based convolutional neural network accelerator with reduced computations through a runtime activation estimation

  • Xizi Chen
  • Jingyang Zhu
  • Jingbo Jiang
  • Chi-Ying Tsui

Recently Resistive-RAM (RRAM) crossbar has been used in the design of the accelerator of convolutional neural networks (CNNs) to solve the memory wall issue. However, the intensive multiply-accumulate computations (MACs) executed at the crossbars during the inference phase are still the bottleneck for the further improvement of energy efficiency and throughput. In this work, we explore several methods to reduce the computations for the RRAM-based CNN accelerators. First, the output sparsity resulting from the widely employed Rectified Linear Unit is exploited, and a significant portion of computations are bypassed through an early detection of the negative output activations. Second, an adaptive approximation is proposed to terminate the MAC early when the sum of the partial results of the remaining computations is considered to be within a certain range of the intermediate accumulated result and thus has an insignificant contribution to the inference. In order to determine these redundant computations, a novel runtime estimation on the maximum and minimum values of each output activation is developed and used during the MAC operation. Experimental results show that around 70% of the computations can be reduced during the inference with a negligible accuracy loss smaller than 0.2%. As a result, the energy efficiency and the throughput are improved by over 2.9 and 2.8 times, respectively, compared with the state-of-the-art RRAM-based accelerators.
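The early-termination idea can be sketched with a deliberately simplified bound: stop accumulating as soon as even the largest possible remaining contribution cannot make the pre-activation sum positive, since the ReLU output is then certainly zero. The paper’s runtime min/max estimation is more refined than this suffix bound.

```python
# Hedged sketch: MAC early exit for ReLU outputs that are surely negative.
import numpy as np

def relu_mac_early(x, w):
    # remaining[i] bounds |contribution of terms i..n-1|
    remaining = np.cumsum((np.abs(x) * np.abs(w))[::-1])[::-1]
    acc = 0.0
    for i in range(len(x)):
        acc += x[i] * w[i]
        if i + 1 < len(x) and acc + remaining[i + 1] < 0.0:
            return 0.0, i + 1        # sum can no longer turn positive
    return max(acc, 0.0), len(x)

rng = np.random.default_rng(6)
x, w = rng.standard_normal(64), rng.standard_normal(64)
out, macs = relu_mac_early(x, w)
assert np.isclose(out, max(np.dot(x, w), 0.0))
print(f"used {macs}/64 MACs")
```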

CuckooPIM: an efficient and less-blocking coherence mechanism for processing-in-memory systems

  • Sheng Xu
  • Xiaoming Chen
  • Ying Wang
  • Yinhe Han
  • Xiaowei Li

The ever-growing processing ability of in-memory processing logic makes data sharing and coherence between processors and in-memory logic play an increasingly important role in Processing-in-Memory (PIM) systems. Unfortunately, existing state-of-the-art coarse-grained PIM coherence solutions suffer from unnecessary data movements and stalls caused by a data ping-pong issue. This work proposes CuckooPIM, a criticality-aware and less-blocking coherence mechanism that effectively avoids unnecessary data movements and stalls. Experiments reveal that CuckooPIM achieves a 1.68x average speedup compared with coarse-grained PIM coherence.

AERIS: area/energy-efficient 1T2R ReRAM based processing-in-memory neural network system-on-a-chip

  • Jinshan Yue
  • Yongpan Liu
  • Fang Su
  • Shuangchen Li
  • Zhe Yuan
  • Zhibo Wang
  • Wenyu Sun
  • Xueqing Li
  • Huazhong Yang

ReRAM-based processing-in-memory (PIM) architecture is a promising solution for deep neural networks (NNs), due to its high energy efficiency and small footprint. However, traditional PIM architectures must use separate crossbar arrays to store positive and negative (P/N) weights, which limits both energy and area efficiency. Even worse, imbalanced running times across layers and idle ADCs/DACs further lower whole-system efficiency. This paper proposes AERIS, an Area/Energy-efficient 1T2R ReRAM-based processing-In-memory NN System-on-a-chip, to enhance both energy and area efficiency. We propose an area-efficient 1T2R ReRAM structure to represent both P/N weights in a single array, and a reference current cancelling scheme (RCS) is also presented for better accuracy. Moreover, a layer-balance scheduling strategy, as well as power gating for interface circuits such as ADCs/DACs, is adopted for higher energy efficiency. Experimental results show that, compared with state-of-the-art ReRAM-based architectures, AERIS achieves 8.5x/1.3x peak energy/area efficiency improvements in total, owing to the layer-balance scheduling, the power gating of interface circuits, and the 1T2R ReRAM circuits. Furthermore, we demonstrate that the proposed RCS compensates for the non-ideal factors of ReRAM and improves NN accuracy by 5.2% for the XNOR net on the CIFAR-10 dataset.

SESSION: Design for reliability

IR-ATA: IR annotated timing analysis, a flow for closing the loop between PDN design, IR analysis & timing closure

  • Ashkan Vakil
  • Houman Homayoun
  • Avesta Sasan

This paper presents IR-ATA, a novel flow for modeling the timing impact of IR drop during the physical design and timing closure of an ASIC chip. We first illustrate how the conventional mechanism of budgeting IR drop and voltage noise with hard margins leads to sub-optimal designs. Consequently, we propose a new approach for modeling and margining against voltage noise, in which each timing path is margined based on its own topology and its own view of voltage noise. With such path-based margining, the margins for IR drop and voltage noise on most timing paths are safely relaxed. The reduced margin increases the available timing slack, which can be used to improve the power, performance, and area of a design. Finally, we illustrate how IR-ATA can be used to track the timing impact of physical or PDN changes, allowing physical designers to explore tradeoffs that were previously, for lack of a methodology, not possible.

Learning-based prediction of package power delivery network quality

  • Yi Cao
  • Andrew B. Kahng
  • Joseph Li
  • Abinash Roy
  • Vaishnav Srinivas
  • Bangqi Xu

Power Delivery Network (PDN) is a critical component in modern System-on-Chip (SoC) designs. With the rapid development in applications, the quality of PDN, especially Package (PKG) PDN, determines whether a sufficient amount of power can be delivered to critical computing blocks. In conventional PKG design, PDN design typically takes multiple weeks including many manual iterations for optimization. Also, there is a large discrepancy between (i) quick simulation tools used for quick PDN quality assessment during the design phase, and (ii) the golden extraction tool used for signoff. This discrepancy may introduce more iterations. In this work, we propose a learning-based methodology to perform PKG PDN quality assessment both before layout (when only bump/ball maps, but no package routing, are available) and after layout (when routing is completed but no signoff analysis has been launched). Our contributions include (i) identification of important parameters to estimate the achievable PKG PDN quality in terms of bump inductance; (ii) the avoidance of unnecessary manual trial and error overheads in PKG PDN design; and (iii) more accurate design-phase PKG PDN quality assessment. We validate accuracy of our predictive models on PKG designs from industry. Experimental results show that, across a testbed of 17 industry PKG designs, we can predict bump inductance with an average absolute percentage error of 21.2% or less, given only pinmap and technology information. We improve prediction accuracy to achieve an average absolute percentage error of 17.5% or less when layout information is considered.

Tackling signal electromigration with learning-based detection and multistage mitigation

  • Wei Ye
  • Mohamed Baker Alawieh
  • Yibo Lin
  • David Z. Pan

With the continuous scaling of integrated circuit (IC) technologies, electromigration (EM) prevails as one of the major reliability challenges facing the design of robust circuits. With such aggressive scaling in advanced technology nodes, signal nets experience high switching frequencies, which further exacerbates the signal EM effect. Traditionally, signal EM fixing approaches analyze EM violations after the routing stage and attempt repair via iterative incremental routing or cell resizing. However, these “EM-analysis-then-fix” approaches are ill-equipped to face the ever-growing EM violations in advanced technology nodes. In this work, we propose a novel signal EM handling framework that (i) incorporates EM detection and fixing techniques into earlier stages of the physical design process, and (ii) integrates machine-learning-based detection alongside a multistage mitigation. Experimental results demonstrate that our framework achieves a 15x speedup over a state-of-the-art EDA tool while achieving similar performance in terms of EM mitigation and overhead.

ROBIN: incremental oblique interleaved ECC for reliability improvement in STT-MRAM caches

  • Elham Cheshmikhani
  • Hamed Farbeh
  • Hossein Asadi

Spin-Transfer Torque Magnetic RAM (STT-MRAM) is a promising alternative for SRAMs in on-chip cache memories. Besides all its advantages, high error rate in STT-MRAM is a major limiting factor for on-chip cache memories. In this paper, we first present a comprehensive analysis that reveals that the conventional Error-Correcting Codes (ECCs) lose their efficiency due to data-dependent error patterns, and then propose an efficient ECC configuration, so-called ROBIN, to improve the correction capability. The evaluations show that the inefficiency of conventional ECC increases the cache error rate by an average of 151.7% while ROBIN reduces this value by more than 28.6x.

Aging-aware chip health prediction adopting an innovative monitoring strategy

  • Yun-Ting Wang
  • Kai-Chiang Wu
  • Chung-Han Chou
  • Shih-Chieh Chang

Concerns exist that chip reliability is worsening because of technology downscaling. Among various reliability challenges, device aging is a dominant concern because it degrades circuit performance over time. Traditionally, runtime monitoring approaches are proposed to estimate aging effects. However, such techniques tend to predict and monitor delay-degradation status for circuit mitigation measures rather than the health condition of the chip. In this paper, we propose an aging-aware chip health prediction methodology that adapts to workload conditions and process, supply voltage, and temperature variations. Our methodology adopts an innovative on-chip delay monitoring strategy that traces representative aging-aware delay behavior. The delay behavior is then fed into a machine learning engine to predict the age of the tested chips. Experimental results indicate that our strategy obtains 97.40% accuracy with 4.14% area overhead on average. To the authors’ knowledge, this is the first method that accurately predicts current chip age and provides information regarding future chip health.

SESSION: New advances in emerging computing paradigms

Compiling SU(4) quantum circuits to IBM QX architectures

  • Alwin Zulehner
  • Robert Wille

The Noisy Intermediate-Scale Quantum (NISQ) technology is currently investigated by major players in the field to build the first practically useful quantum computer. IBM QX architectures are the first ones which are already publicly available today. However, in order to use them, the respective quantum circuits have to be compiled for the respectively used target architecture. While first approaches have been proposed for this purpose, they are infeasible for a certain set of SU(4) quantum circuits which have recently been introduced to benchmark corresponding compilers. In this work, we analyze the bottlenecks of existing compilers and provide a dedicated method for compiling this kind of circuits to IBM QX architectures. Our experimental evaluation (using tools provided by IBM) shows that the proposed approach significantly outperforms IBM’s own solution regarding fidelity of the compiled circuit as well as runtime. Moreover, the solution proposed in this work has been declared winner of the IBM QISKit Developer Challenge. An implementation of the proposed methodology is publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping.

Quantum circuit compilers using gate commutation rules

  • Toshinari Itoko
  • Rudy Raymond
  • Takashi Imamichi
  • Atsushi Matsuo
  • Andrew W. Cross

The use of noisy intermediate-scale quantum computers (NISQCs), which consist of dozens of noisy qubits with limited coupling constraints, has been increasing. A circuit compiler, which transforms an input circuit into an equivalent output circuit that conforms to the coupling constraints with as few additional gates as possible, is essential for running applications on NISQCs. We propose a formulation and two algorithms that exploit gate commutation rules to obtain a better circuit compiler.

Scalable design for field-coupled nanocomputing circuits

  • Marcel Walter
  • Robert Wille
  • Frank Sill Torres
  • Daniel Große
  • Rolf Drechsler

Field-coupled Nanocomputing (FCN) technologies are considered a solution to overcome the physical boundaries of conventional CMOS approaches. But despite ground-breaking advances in their physical implementation, e.g. as Quantum-dot Cellular Automata (QCA), Nanomagnet Logic (NML), and many more, there is an unsettling lack of methods for large-scale design automation of FCN circuits. In fact, design automation for this class of technologies is still in its infancy, relying heavily either on manual labor or on automatic methods applicable only to rather small functionality. This work presents a design method that, for the first time, allows for the scalable design of FCN circuits satisfying the dedicated constraints of these technologies. The proposed scheme is capable of handling around 40000 gates within seconds, while the current state of the art takes hours to handle around 20 gates. This is confirmed by experimental results at the layout level for various established benchmark libraries.

BDD-based synthesis of optical logic circuits exploiting wavelength division multiplexing

  • Ryosuke Matsuo
  • Jun Shiomi
  • Tohru Ishihara
  • Hidetoshi Onodera
  • Akihiko Shinya
  • Masaya Notomi

Optical circuits using nanophotonic devices attract significant interest due to their ultra-high-speed operation. As a consequence, synthesis methods for optical circuits are also attracting increasing attention. However, existing methods for synthesizing optical circuits mostly rely on straightforward mappings from established data structures such as Binary Decision Diagrams (BDDs). The strategy of simply mapping a BDD to an optical circuit sometimes results in an explosion of size and involves significant power losses in branches and optical devices. To address these issues, this paper proposes a method for reducing the size of BDD-based optical logic circuits by exploiting wavelength division multiplexing (WDM). The paper also proposes a method for reducing the number of branches in a BDD-based circuit, which reduces the power dissipation in laser sources. Experimental results obtained using a partial product accumulation circuit in parallel multipliers demonstrate significant advantages of our method over existing approaches in terms of area and power consumption.

Hybrid binary-unary hardware accelerator

  • S. Rasoul Faraji
  • Kia Bazargan

Stochastic computing has been used in recent years to create designs with significantly smaller area by harnessing unary encoding of data. However, the low-area advantage comes at an exponential price in latency, making the area x delay cost unattractive. In this paper, we present a novel method that uses a hybrid binary/unary representation to perform computations. We first divide the input range into a few sub-regions, perform unary computations on each sub-region individually, and finally pack the outputs of all sub-regions back to compact binary. Moreover, we propose a synthesis methodology and a regression model to predict an optimal or sub-optimal design in the design space. The proposed method is especially well-suited to FPGAs due to the abundant availability of routing and flip-flop resources. To the best of our knowledge, we are the first to show a scalable method based on the principles of stochastic computing that can beat conventional binary in terms of a real cost, i.e., area x delay. Our method outperforms the binary and fully unary methods on a number of functions and on a common edge detection algorithm. In terms of area x delay cost, our cost is on average only 2.51% and 10.2% of the binary method’s for 8- and 10-bit resolutions, respectively. These numbers are 2–3 orders of magnitude better than the results of traditional stochastic methods. Our method is not competitive with the binary method for high-resolution oscillating functions such as sin(15x).

SESSION: Design, testing, and fault tolerance of neuromorphic systems

Fault tolerance in neuromorphic computing systems

  • Mengyun Liu
  • Lixue Xia
  • Yu Wang
  • Krishnendu Chakrabarty

Resistive Random Access Memory (RRAM) and RRAM-based computing systems (RCS) provide energy-efficient technology options for neuromorphic computing. However, the applicability of RCS is limited by reliability problems that arise from the immature fabrication process. In order to take advantage of RCS in practical applications, fault-tolerant design is a key challenge. We present a survey of fault-tolerant designs for RRAM-based neuromorphic computing systems. We first describe RRAM-based crossbars and training architectures in RCS. Following this, we classify fault models into different categories and review post-fabrication testing methods. Subsequently, online testing methods are presented. Finally, we present various fault-tolerant techniques designed to tolerate different types of RRAM faults. The methods reviewed in this survey represent recent trends in fault-tolerant designs of RCS and are expected to motivate further research in this field.

Build reliable and efficient neuromorphic design with memristor technology

  • Bing Li
  • Bonan Yan
  • Chenchen Liu
  • Hai (Helen) Li

Neuromorphic computing is a revolutionary approach to computation that attempts to mimic the human brain’s mechanisms for extremely high implementation efficiency and intelligence. Recent research has shown that memristor technology has great potential for realizing power- and area-efficient neuromorphic computing systems (NCS). On the other hand, memristor device processing is still under development. Unreliable devices can severely degrade system performance, which arises as one of the major challenges in developing memristor-based NCS. In this paper, we first review the impacts of the limited reliability of memristor devices and summarize recent research progress in building reliable and efficient memristor-based NCS. In the end, we discuss the main difficulties and trends in memristor-based NCS development.

Reliable in-memory neuromorphic computing using spintronics

  • Christopher Münch
  • Rajendra Bishnoi
  • Mehdi B. Tahoori

Recently, Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) technology has drawn a lot of attention for the direct implementation of neural networks, because it offers several advantages such as near-zero leakage, high endurance, good scalability, small footprint, and CMOS compatibility. The storage device in this technology, the Magnetic Tunnel Junction (MTJ), is built from magnetic layers that require new fabrication materials and processes. Due to the complexity of these fabrication steps and materials, MTJ cells are subject to various failure mechanisms. As a consequence, the functionality of a neuromorphic computing architecture based on this technology is severely affected. In this paper, we develop a framework to analyze the functional capability of neural network inference in the presence of various MTJ defects. Using this framework, we determine the memory array size required to tolerate a given amount of defects, and show how to actively decrease this overhead by disabling parts of the network.

SESSION: Memory-centric design and synthesis

A staircase structure for scalable and efficient synthesis of memristor-aided logic

  • Alwin Zulehner
  • Kamalika Datta
  • Indranil Sengupta
  • Robert Wille

The identification of the memristor as the fourth fundamental circuit element and, eventually, its fabrication at HP Labs provide new capabilities for in-memory computing. While sophisticated methods already exist for realizing logic gates with memristors, mapping them to crossbar structures (which can easily be fabricated) still constitutes a challenging task. This is particularly the case since several (complementary) design objectives have to be satisfied: the design method has to be scalable, should yield designs requiring a low number of timesteps and utilized memristors, and should result in a layout that is hardly skewed. However, all solutions proposed thus far focus on only one of these objectives and hardly address the others. Consequently, state-of-the-art design methods for memristor-aided logic generate rather imperfect solutions. In this work, we propose an automatic design solution that addresses all these design objectives at once. To this end, a staircase structure is utilized which employs an almost square-like layout and remains perfectly scalable while, at the same time, keeping the number of timesteps and utilized memristors close to the minimum. Experimental evaluations confirm that the proposed approach indeed satisfies all design objectives at once.

On-chip memory optimization for high-level synthesis of multi-dimensional data on FPGA

  • Daewoo Kim
  • Sugil Lee
  • Jongeun Lee

It is very challenging to design an on-chip memory architecture for high-performance kernels with large amounts of computation and data. The on-chip memory architecture must support efficient data access from both the computation part and the external memory part, which often have very different expectations about how data should be accessed and stored. Previous work provides only a limited set of optimizations. In this paper we show how to fundamentally restructure on-chip buffers by decoupling the logical array view from the physical buffer view and providing general mapping schemes between the two. Our framework considers the entire data flow from the external memory to the computation part in order to minimize resource usage without creating a performance bottleneck. Our experimental results demonstrate that the proposed technique can generate solutions that reduce memory usage significantly (2X over the conventional method), and successfully generates optimized on-chip buffer architectures without costly design iterations for highly optimized computation kernels.

HUBPA: high utilization bidirectional pipeline architecture for neuromorphic computing

  • Houxiang Ji
  • Li Jiang
  • Tianjian Li
  • Naifeng Jing
  • Jing Ke
  • Xiaoyao Liang

Training Convolutional Neural Networks (CNNs) is both memory- and computation-intensive. Resistive random access memory (ReRAM) has shown its advantage in accelerating such tasks with high energy efficiency. However, the ReRAM-based pipeline architecture suffers from low utilization of computing resources, caused by the imbalanced data throughput in different pipeline stages due to the inherent down-sampling effect in CNNs and the inflexible usage of ReRAM cells. In this paper, we propose a novel ReRAM-based bidirectional pipeline architecture, named HUBPA, to accelerate training with higher utilization of the computing resources. The two stages of CNN training, forward and backward propagation, are scheduled in HUBPA dynamically to share the computing resources. We design an accessory control scheme for the context switching between these two tasks. We also propose an efficient algorithm to allocate computing resources to each neural network layer. Our experiment results show that, compared with the state-of-the-art ReRAM pipeline architecture, HUBPA improves performance by 1.7X and reduces energy consumption by 1.5X on current benchmarks.

SESSION: Efficient modeling of analog, mixed signal and arithmetic circuits

Efficient sparsification of dense circuit matrices in model order reduction

  • Charalampos Antoniadis
  • Nestor Evmorfopoulos
  • Georgios Stamoulis

The integration of more components into ICs due to ever-increasing technology scaling has led to very large parasitic networks consisting of millions of nodes, which have to be simulated at many time points or frequencies to verify the proper operation of the chip. Model Order Reduction (MOR) techniques have been employed routinely to substitute the large-scale parasitic model with a model of lower order that has a similar response at the input/output ports. However, all established MOR techniques result in dense system matrices that render their simulation impractical. To this end, in this paper we propose a methodology for the sparsification of the dense circuit matrices resulting from Model Order Reduction, which employs a sequence of algorithms based on the computation of the nearest diagonally dominant matrix and the sparsification of the corresponding graph. Experimental results indicate that a high sparsity ratio of the reduced system matrices can be achieved with very small loss of accuracy.
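
To make the pipeline concrete, here is a small numpy sketch of the two-phase idea as I read the abstract: first push the dense reduced matrix toward diagonal dominance, then drop relatively small off-diagonal entries. The authors’ actual nearest-diagonally-dominant computation and graph sparsification are more principled; the thresholds and helper names here are illustrative assumptions.

```python
# Minimal sketch of sparsifying a dense reduced matrix: nudge it toward
# diagonal dominance, then zero off-diagonals that are small relative to
# their row. Illustrative stand-in for the paper's algorithm sequence.
import numpy as np

def make_diagonally_dominant(A):
    B = A.copy()
    row_sums = np.abs(B).sum(axis=1) - np.abs(np.diag(B))
    np.fill_diagonal(B, np.sign(np.diag(B)) *
                     np.maximum(np.abs(np.diag(B)), row_sums))
    return B

def sparsify(A, tol=0.1):
    B = A.copy()
    thresh = tol * np.abs(B).max(axis=1, keepdims=True)
    mask = np.abs(B) < thresh          # entries below 10% of the row maximum
    np.fill_diagonal(mask, False)      # never drop the diagonal
    B[mask] = 0.0
    return B

G = np.random.default_rng(0).normal(size=(6, 6))
G = G @ G.T                            # SPD, like a reduced conductance matrix
Gs = sparsify(make_diagonally_dominant(G))
print(f"nonzeros: {np.count_nonzero(G)} -> {np.count_nonzero(Gs)}")
```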

Spectral approach to verifying non-linear arithmetic circuits

  • Cunxi Yu
  • Tiankai Su
  • Atif Yasin
  • Maciej Ciesielski

This paper presents a fast and effective computer-algebraic method for analyzing and verifying non-linear integer arithmetic circuits using a novel algebraic spectral model. It introduces the concept of an algebraic spectrum, a numerical form of a polynomial expression, and uses the distribution of monomial coefficients to determine the type of arithmetic function under verification. In contrast to previous works, the proof of functional correctness is achieved by computing an algebraic spectrum combined with local rewriting of word-level polynomials. The speedup is achieved by propagating coefficients through the circuit using an And-Inverter Graph (AIG) data structure. The effectiveness of the method is demonstrated with experiments including standard and Booth multipliers, and other synthesized non-linear arithmetic circuits up to 1024 bits containing over 12 million gates.

S2-PM: semi-supervised learning for efficient performance modeling of analog and mixed signal circuits

  • Mohamed Baker Alawieh
  • Xiyuan Tang
  • David Z. Pan

As integrated circuit technologies continue to scale, variability modeling is becoming more crucial, yet more challenging. In this paper, we propose a novel performance modeling method based on semi-supervised co-learning. We exploit the multiple representations of process variation in any analog and mixed-signal circuit to establish a co-learning framework where unlabeled samples are leveraged to improve the model accuracy without incurring any simulation cost. Practically, our proposed method relies on a small set of labeled data and the availability of no-cost unlabeled data to efficiently build an accurate performance model for any analog and mixed-signal circuit design. Our numerical experiments demonstrate that the proposed approach achieves up to 30% reduction in simulation cost compared to the state-of-the-art modeling technique without surrendering any accuracy.

SESSION: Logic and precision optimization for neural network designs

Energy-efficient, low-latency realization of neural networks through boolean logic minimization

  • Mahdi Nazemi
  • Ghasem Pasandi
  • Massoud Pedram

Deep neural networks have been successfully deployed in a wide variety of applications including computer vision and speech recognition. To cope with the computational and storage complexity of these models, this paper presents a training method that enables a radically different approach to the realization of deep neural networks through Boolean logic minimization. The aforementioned realization completely removes the energy-hungry step of accessing memory for obtaining model parameters, consumes about two orders of magnitude fewer computing resources compared to realizations that use floating-point operations, and has substantially lower latency.

Log-quantized stochastic computing for memory and computation efficient DNNs

  • Hyeonuk Sim
  • Jongeun Lee

For energy efficiency, many low-bit quantization methods for deep neural networks (DNNs) have been proposed. Among them, logarithmic quantization stands out, achieving acceptable deep learning performance while drastically reducing the memory footprint and simplifying high-cost multipliers. Meanwhile, stochastic computing (SC) has been proposed for low-cost DNN acceleration, and a recently proposed SC multiplier significantly improved accuracy and latency, the main drawbacks of SC. However, in that binary-interfaced system, which still costs much less than storing entire stochastic streams, quantization is linear, the same as in conventional fixed-point binary. We applied logarithmically quantized DNNs to the state-of-the-art SC multiplier and studied the resulting benefits. We found that SC multiplication on logarithmically quantized inputs is more accurate and aids the fine-tuning process. Furthermore, we designed a much lower-cost SC-DNN accelerator that exploits the reduced complexity of the inputs. While logarithmic quantization also benefits the data flow, the proposed architecture achieves 40% less area and 24% less power consumption than the previous SC-DNN accelerator. Its area × latency product is smaller even than that of a shifter-based accelerator.
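
For intuition, here is a small numpy sketch of logarithmic quantization and the shift-based multiplication it enables; the bit-width, rounding rule, and clipping range are my own illustrative choices, not the paper’s exact format.

```python
# Minimal sketch of logarithmic weight quantization and the shift-based
# "multiplication" it enables; illustrative only, not the authors' design.
import numpy as np

def log_quantize(w, n_bits=4):
    """Quantize |w| to the nearest power of two, keeping the sign."""
    sign = np.sign(w)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)),
                  -(2 ** (n_bits - 1)), 0)
    return sign, exp.astype(int)

def shift_multiply(x, sign, exp):
    """x * w becomes a shift: x * sign * 2**exp (hardware: a barrel shifter)."""
    return sign * np.ldexp(x, exp)

w = np.array([0.31, -0.07, 0.52])
x = np.array([1.0, 2.0, 3.0])
s, e = log_quantize(w)
print(shift_multiply(x, s, e))   # [ 0.25  -0.125  1.5 ]
print(x * w)                     # exact: [ 0.31 -0.14  1.56]
```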

Cell division: weight bit-width reduction technique for convolutional neural network hardware accelerators

  • Hanmin Park
  • Kiyoung Choi

The datapath bit-width of hardware accelerators for convolutional neural network (CNN) inference is generally chosen to be wide enough that they can process upcoming, unknown CNNs. Here we introduce the cell division technique, a variant of function-preserving transformations. With this technique, it is guaranteed that CNNs whose weights are quantized to fixed-point formats of arbitrary bit-width can be transformed into CNNs with narrower weight bit-widths without any accuracy drop (indeed, without any accuracy change). As a result, CNN hardware accelerators are released from the weight bit-width constraint, which has been preventing them from having narrower datapaths. In addition, CNNs with wider weight bit-widths than those assumed by a CNN hardware accelerator can be executed on the accelerator. Experimental results on LeNet-300-100, LeNet-5, AlexNet, and VGG-16 show that weights can be reduced down to 2–5 bits with a 2.5X–5.2X decrease in weight storage requirement, without any accuracy drop.
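
A toy numeric sketch of the function-preserving split, as I read the abstract: a neuron with a wide fixed-point weight is divided into two "cells" whose narrower weights reconstruct the original contribution exactly, so accuracy is unchanged by construction. The unsigned high/low split rule and the names below are my own illustration, not necessarily the paper’s exact transformation.

```python
# Minimal sketch of a function-preserving weight split: an 8-bit weight
# becomes two 4-bit weights whose (scaled) sum is exactly the original.

def split_weight(w_int, narrow_bits=4):
    """Split an integer weight into high/low parts, each fitting narrow_bits
    (unsigned here for brevity): w = w_hi * 2**narrow_bits + w_lo."""
    w_hi, w_lo = divmod(w_int, 1 << narrow_bits)
    return w_hi, w_lo

def neuron(x, w_int):
    return x * w_int

def divided_cells(x, w_int, narrow_bits=4):
    w_hi, w_lo = split_weight(w_int, narrow_bits)
    # Two narrow-weight cells; the 2**narrow_bits factor folds into the
    # following layer (or a shift), so the datapath only sees 4-bit weights.
    return (x * w_hi) * (1 << narrow_bits) + x * w_lo

x, w = 7, 0b10110101                         # 8-bit weight = 181
assert neuron(x, w) == divided_cells(x, w)   # exact, no accuracy change
print(split_weight(w))                       # (11, 5): two 4-bit weights
```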

SESSION: Modern mask optimization: from shallow to deep learning

LithoROC: lithography hotspot detection with explicit ROC optimization

  • Wei Ye
  • Yibo Lin
  • Meng Li
  • Qiang Liu
  • David Z. Pan

As modern integrated circuits scale up with escalating complexity of layout design patterns, lithography hotspot detection, a key stage of physical verification to ensure layout finishing and design closure, faces higher demands on its efficiency and accuracy. Among all hotspot detection approaches, machine learning distinguishes itself by achieving high accuracy while maintaining low false alarms. However, due to the class imbalance problem, the conventional practice of using accuracy and false-alarm metrics to evaluate different machine learning models is becoming less effective. In this work, we propose the use of the area under the ROC curve (AUC), which provides a more holistic measure for imbalanced datasets than the previous metrics. To systematically handle class imbalance, we further propose surrogate loss functions for direct AUC maximization as a substitute for the conventional cross-entropy loss. Experimental results demonstrate that the new surrogate loss functions can outperform the cross-entropy loss when applied to the state-of-the-art neural network model for hotspot detection.
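
As one concrete instance of the idea, the numpy sketch below implements a pairwise squared-hinge surrogate for AUC: the loss is zero exactly when every hotspot is scored above every non-hotspot by the margin, so minimizing it directly pushes the AUC toward 1. The specific surrogate and margin are my assumptions; the paper evaluates its own family of surrogate losses.

```python
# Minimal sketch of a pairwise surrogate loss for direct AUC maximization,
# assuming a squared-hinge penalty on (hotspot, non-hotspot) score pairs.
import numpy as np

def auc_surrogate_loss(scores, labels, margin=1.0):
    """Mean squared-hinge loss over all (positive, negative) pairs:
    penalize whenever a positive is not scored above a negative by margin."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]      # all pairwise score differences
    return np.mean(np.maximum(0.0, margin - diffs) ** 2)

scores = np.array([2.1, 0.3, 1.5, -0.2])    # model outputs
labels = np.array([1,   0,   1,    0])
# 0.0 when every hotspot outscores every non-hotspot by the margin
print(auc_surrogate_loss(scores, labels))
```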

Detecting multi-layer layout hotspots with adaptive squish patterns

  • Haoyu Yang
  • Piyush Pathak
  • Frank Gennari
  • Ya-Chieh Lai
  • Bei Yu

Layout hotspot detection is one of the critical steps in the modern integrated circuit design flow. It aims to find potential weak points in layouts before feeding them into the manufacturing stage. The rapid development of machine learning has made it a preferable alternative to traditional hotspot detection solutions. Recent research ranges from layout feature extraction to learning model design. However, only single-layer layout hotspots are considered in state-of-the-art hotspot detectors, and certain defects such as metal-to-via failures are not naturally supported. In this paper, we propose an adaptive squish representation for multilayer layouts, which is storage-efficient, lossless, and compatible with deep neural networks. We conduct experiments on 14nm industrial designs with a metal layer and its two adjacent via layers that contain metal-to-via hotspots. Results show that the adaptive squish representation can achieve satisfactory hotspot detection accuracy by incorporating a medium-sized convolutional neural network.

A local optimal method on DSA guiding template assignment with redundant/dummy via insertion

  • Xingquan Li
  • Bei Yu
  • Jianli Chen
  • Wenxing Zhu

As an emerging manufacturing technology, block copolymer directed self-assembly (DSA) is promising for via-layer fabrication. Meanwhile, redundant via insertion is considered an essential step for yield improvement. For better reliability and manufacturability, in this paper we concurrently consider DSA guiding template assignment with redundant via and dummy via insertion at the post-routing stage. First, by analyzing the structural properties of guiding templates, we propose a building-block-based solution expression to discard redundant solutions. Then, honoring the compact solution expression, we construct a conflict graph with dummy via insertion and formulate the problem as an integer linear program (ILP). To make a good trade-off between solution quality and runtime, we relax the ILP to an unconstrained nonlinear program (UNP). Finally, a line-search optimization algorithm is proposed to solve the UNP. Experimental results verify the effectiveness of our new solution expression and the efficiency of our proposed algorithm.

Deep learning-based framework for comprehensive mask optimization

  • Bo-Yi Yu
  • Yong Zhong
  • Shao-Yun Fang
  • Hung-Fei Kuo

With the dramatic increase of design complexity and the advance of semiconductor technology nodes, design for manufacturability faces huge difficulties with existing lithography solutions. Sub-resolution assist feature (SRAF) insertion and optical proximity correction (OPC) are both indispensable resolution enhancement techniques (RETs) to maximize the process window and ensure feature printability. Conventional model-based SRAF insertion and OPC methods are widely applied in industry but suffer from extremely long runtimes due to their iterative optimization processes. In this paper, we propose the first work to develop a deep learning framework that simultaneously performs SRAF insertion and edge-based OPC. In addition, to make the optimized masks more reliable and convincing for industrial application, we employ a commercial lithography simulation tool to assess the quality of the wafer image with various lithographic metrics. The effectiveness and efficiency of the proposed framework are demonstrated in experimental results, which also show the promise of machine learning-based lithography optimization techniques for today’s complex and large-scale circuit layouts.

SESSION: System level modelling methods I

AxDNN: towards the cross-layer design of approximate DNNs

  • Yinghui Fan
  • Xiaoxi Wu
  • Jiying Dong
  • Zhi Qi

Thanks to the inherent error resilience of neural networks, approximate computing has become a promising and hardware-friendly technique to improve the energy efficiency of DNNs. From the algorithm and architecture layers down to circuits, there are many possibilities for implementing approximate DNNs. However, the complicated interactions between major design concerns (e.g., power and performance) and the lack of an efficient simulator across multiple design layers have led the conventional design method to suboptimal approximate DNNs. In this paper, we present a systematic framework for the cross-layer design of approximate DNNs. By introducing hardware imperfection into the training phase, the accuracy of DNN models can be recovered by up to 5.32% when the most aggressive approximate multiplier is used. Integrated with the techniques of activation pruning and voltage scaling, the energy efficiency of the approximate DNN accelerator can be improved by 52.5% on average. We also build a pre-RTL simulation environment where we can easily express accelerator architectures, try combinations of different approximation strategies, and evaluate power consumption. Experiments demonstrate that the pre-RTL simulation achieves a ~20X speedup compared with the traditional RTL method when evaluating the same target. The convenient pre-RTL simulation helps us quickly determine the trade-off between accuracy and energy at the design stage of an approximate DNN accelerator.
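
The cross-layer hook, "introducing hardware imperfection into the training phase", can be pictured with a small numpy sketch: the forward pass routes every matrix product through an error model of the approximate multiplier, so the weights adapt to the hardware during training. The bounded-relative-noise error model and all names below are stand-ins; a real flow would plug in the actual multiplier’s behavior.

```python
# Minimal sketch of approximation-aware training: the forward pass uses an
# error model of an approximate multiplier so weights learn around it.
import numpy as np

rng = np.random.default_rng(0)

def approx_matmul(x, w, rel_err=0.05):
    """Exact product perturbed by the multiplier's relative-error envelope."""
    exact = x @ w
    return exact * (1.0 + rng.uniform(-rel_err, rel_err, size=exact.shape))

def forward(x, w1, w2):
    h = np.maximum(approx_matmul(x, w1), 0.0)   # ReLU hidden layer
    return approx_matmul(h, w2)

# Training would backprop through this noisy forward pass (straight-through
# for the noise), letting accuracy recover despite the approximate hardware.
x = rng.normal(size=(4, 8))
w1, w2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 3))
print(forward(x, w1, w2).shape)   # (4, 3)
```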

Simulate-the-hardware: training accurate binarized neural networks for low-precision neural accelerators

  • Jiajun Li
  • Ying Wang
  • Bosheng Liu
  • Yinhe Han
  • Xiaowei Li

This work investigates how to effectively train binarized neural networks (BNNs) for specialized low-precision neural accelerators. When mapping BNNs onto specialized neural accelerators that adopt fixed-point feature data representation and binary parameters, operation overflow caused by short fixed-point coding makes the BNN inference results from deep learning frameworks on CPU/GPU inconsistent with those from the accelerators. This issue leads to a large deviation between the training environment and the inference implementation, and causes potential model accuracy losses when deployed on the accelerators. Therefore, we present a series of methods to contain the overflow phenomenon and enable typical deep learning frameworks such as Tensorflow to effectively train BNNs that work with high accuracy and convergence speed on the specialized neural accelerators.
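
The inconsistency the abstract describes is easy to reproduce: a framework accumulates in wide floats while the accelerator’s short fixed-point register wraps around. A minimal sketch of simulating that wrap-around inside the training graph (assuming signed two’s-complement wrapping, my assumption) looks like this:

```python
# Minimal sketch of simulating accelerator fixed-point overflow inside the
# training framework, so CPU/GPU training matches accelerator inference.
import numpy as np

def wrap_fixed_point(acc, n_bits=8):
    """Wrap an integer accumulation into the signed n-bit range
    (two's-complement wrap-around, assumed overflow behavior)."""
    m = 1 << n_bits
    return ((acc + (m >> 1)) % m) - (m >> 1)

x = np.array([90, 50, -30])        # integer feature data
w = np.array([1, 1, -1])           # binary weights in {-1, +1}
acc = int(x @ w)                   # 170: overflows a signed 8-bit register
print(acc, "->", wrap_fixed_point(acc))   # 170 -> -86
```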

An N-way group association architecture and sparse data group association load balancing algorithm for sparse CNN accelerators

  • Jingyu Wang
  • Zhe Yuan
  • Ruoyang Liu
  • Huazhong Yang
  • Yongpan Liu

In recent years, ASIC CNN accelerators have attracted great attention among researchers for their high performance and energy efficiency. Some former works utilize the sparsity of CNN networks to improve performance and energy efficiency. However, these methods bring tremendous overhead to the output memory, and performance suffers from hash collisions. This paper presents: 1) an N-Way Group Association Architecture to reduce the memory overhead of sparse CNN accelerators; and 2) a Sparse Data Group Association Load Balancing Algorithm, implemented by the Scheduler module in the architecture, to reduce the collision rate and improve performance. Compared with the state-of-the-art accelerator, this work achieves either 1) 1.74x performance with 50% memory overhead reduction in the 4-way associated design or 2) 1.91x performance without memory overhead reduction in the 2-way associated design, which is close to the theoretical performance limit (without collisions).

Maximizing power state cross coverage in firmware-based power management

  • Vladimir Herdt
  • Hoang M. Le
  • Daniel Große
  • Rolf Drechsler

Virtual Prototypes (VPs) are becoming increasingly attractive for the early analysis of SoC power management, which is nowadays mostly implemented in firmware (FW). Power and timing constraints can be monitored and validated by executing a set of test-cases in a power-aware FW/VP co-simulation. In this context, cross coverage of power states is an effective but challenging quality metric. This paper proposes a novel coverage-driven approach to automatically generate test-cases maximizing this cross coverage. In particular, we integrate a coverage-loop that successively refines the generation process based on previous results. We demonstrate our approach on a LEON3-based VP.
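
The coverage-loop can be pictured with a generic sketch: run a candidate test-case in the FW/VP co-simulation, record which power-state cross products it hits, and keep refining generation until the cross-coverage goal closes. The two-domain state space and random generator below are illustrative stand-ins, not the paper’s VP.

```python
# Minimal sketch of a coverage-driven generation loop for power-state
# cross coverage; the co-simulation is mocked by a random stand-in.
import itertools, random

STATES = ["OFF", "SLEEP", "ON"]                 # per-domain power states
GOAL = set(itertools.product(STATES, STATES))   # cross coverage of 2 domains

def run_testcase(seed):
    """Stand-in for FW/VP co-simulation: returns observed state pairs."""
    rng = random.Random(seed)
    return {(rng.choice(STATES), rng.choice(STATES)) for _ in range(3)}

covered, seed = set(), 0
while covered != GOAL:
    hits = run_testcase(seed)
    if hits - covered:            # keep only test-cases that add coverage
        covered |= hits
    seed += 1
print(f"cross coverage closed after exploring {seed} seeds")
```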

SESSION: Testing and design for security

Improving scan chain diagnostic accuracy using multi-stage artificial neural networks

  • Mason Chern
  • Shih-Wei Lee
  • Shi-Yu Huang
  • Yu Huang
  • Gaurav Veda
  • Kun-Han (Hans) Tsai
  • Wu-Tung Cheng

Diagnosis of intermittent scan chain failures remains a hard problem. We demonstrate that Artificial Neural Networks (ANNs) can be used to achieve significantly higher accuracy. The key is to exploit domain knowledge and use a multi-stage process incorporating ANNs with gradually refined focuses. Experimental results on benchmark circuits show that this method is, on average, 20% more accurate than a state-of-the-art commercial tool for intermittent stuck-at faults, and improves the hit rate from 25.3% to 73.9% for some test-cases.

Testing stuck-open faults of priority address encoder in content addressable memories

  • Tsai-Ling Tsai
  • Jin-Fu Li
  • Chun-Lung Hsu
  • Chi-Tien Su

Content addressable memory (CAM) is widely used in systems that need parallel search. The testing of CAM is more difficult than that of random access memory (RAM) due to CAM’s more complicated functionality. As with RAM testing, CAM testing should cover the cell array and the peripheral circuits. In this paper, we propose a March-like test, March-PCL, for detecting the stuck-open faults (SOFs) of the priority address encoder of CAMs. To the best of our knowledge, this is the first work to discuss the testing of SOFs of the priority address encoder of CAMs. March-PCL requires 4N Write and 4N Compare operations to cover 100% of the SOFs.

ScanSAT: unlocking obfuscated scan chains

  • Lilas Alrahis
  • Muhammad Yasin
  • Hani Saleh
  • Baker Mohammad
  • Mahmoud Al-Qutayri
  • Ozgur Sinanoglu

While financially advantageous, outsourcing key steps such as testing to potentially untrusted Outsourced Semiconductor Assembly and Test (OSAT) companies may pose a risk of compromising on-chip assets. Obfuscation of scan chains is a technique that hides the actual scan data from untrusted testers; logic inserted between the scan cells, driven by a secret key, hides the transformation functions between the scan-in stimulus (scan-out response) and the delivered scan pattern (captured response). In this paper, we propose ScanSAT: an attack that transforms a scan-obfuscated circuit to its logic-locked version and applies a variant of the Boolean satisfiability (SAT) based attack, thereby extracting the secret key. Our empirical results demonstrate that ScanSAT can easily break naive scan obfuscation techniques using only three or fewer attack iterations even for large key sizes and in the presence of scan compression.

CycSAT-unresolvable cyclic logic encryption using unreachable states

  • Amin Rezaei
  • You Li
  • Yuanqi Shen
  • Shuyu Kong
  • Hai Zhou

Logic encryption has attracted much attention due to increasing IC design costs and the growing number of untrusted foundries. Unreachable states in a design provide a space of flexibility for logic encryption to explore. However, due to available scan-chain access, traditional combinational encryption cannot leverage the benefit of such flexibility. Cyclic logic encryption inserts key-controlled feedbacks into the original circuit to prevent piracy and overproduction. Based on our discovery, cyclic logic encryption can utilize unreachable states to improve security. Even though cyclic encryption is vulnerable to a powerful attack called CycSAT, we develop a new form of cyclic encryption that utilizes unreachable states to defeat CycSAT. The attack complexity of the proposed scheme is discussed and its robustness is demonstrated.

SESSION: Network-centric design and system

Routing in optical network-on-chip: minimizing contention with guaranteed thermal reliability

  • Mengquan Li
  • Weichen Liu
  • Lei Yang
  • Peng Chen
  • Duo Liu
  • Nan Guan

Communication contention and thermal susceptibility are two potential issues in the optical network-on-chip (ONoC) architecture, and both are critical for ONoC designs. However, minimizing contention and guaranteeing thermal reliability are incompatible in most cases. In this paper, we present a routing criterion at the network level that, combined with device-level thermal tuning, enables a thermally reliable ONoC. We further propose two routing approaches (a mixed-integer linear programming (MILP) model and a heuristic algorithm (CAR)) to minimize communication contention under the guaranteed thermal reliability and, meanwhile, mitigate the energy overheads of thermal regulation in the presence of chip thermal variations. By applying the criterion, our approaches achieve excellent performance with largely reduced complexity of design-space exploration. Evaluation results on synthetic communication traces and realistic benchmarks show that the MILP-based approach achieves an average of 112.73% improvement in communication performance and 4.18% reduction in energy overhead compared to state-of-the-art techniques. Our heuristic algorithm introduces only a 4.40% performance difference compared to the optimal results and is more scalable to large-size ONoCs.

Bidirectional tuning of microring-based silicon photonic transceivers for optimal energy efficiency

  • Yuyang Wang
  • M. Ashkan Seyedi
  • Jared Hulme
  • Marco Fiorentino
  • Raymond G. Beausoleil
  • Kwang-Ting Cheng

Microring-based silicon photonic transceivers are promising for resolving the communication bottleneck of future high-performance computing systems. To rectify process variations in microring resonance wavelengths, thermal tuning is usually preferred over electrical tuning because it preserves extinction ratios and quality factors. However, the low energy efficiency of resistive thermal tuners results in nontrivial tuning cost and overall energy consumption for the transceiver. In this study, we propose a hybrid tuning strategy that involves both thermal and electrical tuning. Our strategy determines the tuning direction of each resonance wavelength with the goal of optimizing the transceiver energy efficiency without compromising signal integrity. Formulated as an integer programming problem and solved by a genetic algorithm, our tuning strategy yields 32%~53% savings in overall energy per bit for measured data of 5-channel transceivers at 5~10 Gb/s per channel, and up to 24% savings for synthetic data of 30-channel transceivers generated from process variation models built upon the measured data. We further investigated a polynomial-time approximation method which achieves over 100x speedup in tuning-scheme computation while still maintaining considerable energy-per-bit savings.

Redeeming chip-level power efficiency by collaborative management of the computation and communication

  • Ning Lin
  • Hang Lu
  • Xin Wei
  • Xiaowei Li

Power consumption is the first-order design constraint in future many-core processors. Conventional power management approaches usually focus on certain functional components, either computation or communication hardware resources, trying to optimize their power consumption as much as possible while leaving the other part untouched. However, such a unilateral power-control concept, though it has some potential to contribute to overall power reduction, cannot guarantee the optimal power efficiency of the chip. In this paper, we propose a novel collaborative management approach, termed CoCom, that coordinates the computation and communication infrastructure in tandem. Unlike prior work that deals with power control separately, it leverages the correlations between the two parts as the “key chain” that guides their respective power-state coordination in the appropriate direction. Besides, it uses dedicated hybrid on-chip/off-chip mechanisms to minimize the control cost and simultaneously guarantee effectiveness. Experimental results show that, compared with conventional unilateral baselines, CoCom achieves substantial power reduction with minimal performance degradation.

A high-level modeling and simulation approach using test-driven cellular automata for fast performance analysis of RTL NoC designs

  • Moon Gi Seok
  • Hessam S. Sarjoughian
  • Daejin Park

Speeding up the simulation of packet transmission in RTL NoC designs is essential for analyzing performance and optimizing NoC parameters for various combinations of intellectual-property (IP) blocks, which requires repeated computation for parameter-space exploration. In this paper, we propose a high-level modeling and simulation (M&S) approach using a revised cellular automata (CA) concept to speed up the simulation of dynamic flit movements and queue occupancy within a target RTL NoC. The CA abstracts the detailed RTL operations by deciding each cell’s action state (related to moving packet flits and changing the connections between CA cells) from its own high-level state and those of its neighbors, and then executing the operations relevant to the decided action state. While performing these operations, including connection requests and acceptances, architecture-independent, user-developed routing and arbitration functions are utilized. The decision regarding the action states follows a rule set generated by the proposed test environment. The proposed method was applied to an open-source Verilog NoC, achieving simulation speedups of approximately 8 to 31 times for a given parameter set.

SESSION: Advanced memory systems

A sharing-aware L1.5D cache for data reuse in GPGPUs

  • Jianfei Wang
  • Li Jiang
  • Jing Ke
  • Xiaoyao Liang
  • Naifeng Jing

With GPUs heading toward general-purpose computing, hardware caching, e.g., the first-level data (L1D) cache, has been introduced into the on-chip memory hierarchy of GPGPUs. However, in the face of GPGPU massive multi-threading, the small L1D requires better management to achieve a higher hit rate and benefit performance. In this paper, observing L1D usage inefficiencies such as data duplication among streaming multiprocessors (SMs) that wastes the precious L1D resources, we first propose a shared L1.5D cache that substitutes for the private L1D caches of several SMs, reducing duplicated data and in turn increasing the effective cache size for each SM. We evaluate and adopt a suitable layout of the L1.5D cache to meet the timing requirements in GPGPUs. Then, to protect sharable data from early eviction, we propose sharable-data-aware cache management, which leverages a lightweight PC-based history table to protect sharable data on cache replacement. The experiments demonstrate that the proposed design achieves an average 20.1% performance improvement and a 16.9% higher on-chip hit rate for applications with sharable data.

NeuralHMC: an efficient HMC-based accelerator for deep neural networks

  • Chuhan Min
  • Jiachen Mao
  • Hai Li
  • Yiran Chen

In Deep Neural Network (DNN) applications, the energy consumption and performance cost of moving data between the memory hierarchy and computational units are significantly higher than those of the computation itself. Processing-in-memory (PIM) architectures such as the Hybrid Memory Cube (HMC) are excellent candidates to improve data locality for efficient DNN execution. However, it is still hard to efficiently deploy the large-scale matrix computation of DNNs on HMC because of its coarse-grained packet protocol. In this work, we propose NeuralHMC, the first HMC-based accelerator tailored for efficient DNN execution. Experimental results show that NeuralHMC reduces data movement by 1.4x to 2.5x (depending on the DNN data reuse strategy) compared to a Von Neumann architecture. Furthermore, compared to a state-of-the-art PIM-based DNN accelerator, NeuralHMC improves system performance by 4.1x and reduces energy by 1.5x, on average.

Boosting chipkill capability under retention-error induced reliability emergency

  • Xianwei Zhang
  • Rujia Wang
  • Youtao Zhang
  • Jun Yang

The DRAM-based main memory of high-end embedded systems faces two design challenges: (i) degrading reliability and (ii) increasing power and energy consumption. While chipkill ECC (error correction code) and multi-rate refresh may be adopted to address them, respectively, a simple integration of the two results in 3x or more SDC (silent data corruption) errors and fails to meet the system reliability guarantee. This is referred to as a reliability emergency.

In this paper, we propose PlusN, a hardware-assisted memory error protection design that adaptively boosts the baseline chipkill capability to address the reliability emergency. Based on the error probability assessment at runtime, the system switches its memory protection between the baseline chipkill and PlusN — the latter generates a stronger ECC with low storage and access overheads. Our experimental results show that PlusN can effectively enforce the system reliability guarantee under different reliability emergency scenarios.

SESSION: Learning: make patterning light and right

SRAF insertion via supervised dictionary learning

  • Hao Geng
  • Haoyu Yang
  • Yuzhe Ma
  • Joydeep Mitra
  • Bei Yu

In the modern VLSI design flow, sub-resolution assist feature (SRAF) insertion is one of the resolution enhancement techniques (RETs) used to improve chip manufacturing yield. With aggressive feature sizes continuously scaling down, layout feature learning becomes extremely critical. In this paper, for the first time, we enhance conventional manual feature construction by proposing a supervised online dictionary learning algorithm for simultaneous feature extraction and dimensionality reduction. By taking advantage of label information, the proposed dictionary learning engine can discriminatively and accurately represent the input data. We further consider SRAF design rules in a global view and design an integer linear programming model for the post-processing stage of the SRAF insertion framework. Experimental results demonstrate that, compared with a state-of-the-art SRAF insertion tool, our framework not only boosts mask optimization quality in terms of edge placement error (EPE) and process variation (PV) band area, but also achieves some speed-up.

A fast machine learning-based mask printability predictor for OPC acceleration

  • Bentian Jiang
  • Hang Zhang
  • Jinglei Yang
  • Evangeline F. Y. Young

The continuous shrinking of VLSI technology nodes brings us powerful chips with lower power consumption, but it also introduces many manufacturability issues. The lithography simulation process for new feature sizes suffers from large computational overhead. As a result, the conventional mask optimization process has become drastically resource-consuming in terms of both time and cost. In this paper, we propose a high-performance machine learning-based mask printability evaluation framework for lithography-related applications, and apply it in a conventional mask optimization tool to verify its effectiveness.

Semi-supervised hotspot detection with self-paced multi-task learning

  • Ying Chen
  • Yibo Lin
  • Tianyang Gai
  • Yajuan Su
  • Yayi Wei
  • David Z. Pan

Lithography simulation is computationally expensive for hotspot detection. Machine learning-based hotspot detection is a promising technique to reduce the simulation overhead. However, most learning approaches rely on a large amount of training data to achieve good accuracy and generality. At the early stage of developing a new technology node, the amount of data with labeled hotspots or non-hotspots is very limited. In this paper, we propose a semi-supervised hotspot detection approach with a self-paced multi-task learning paradigm, leveraging data samples both with and without labels to improve model accuracy and generality. Experimental results demonstrate that our approach achieves 2.9–4.5% better accuracy at the same false-alarm levels than the state-of-the-art work while using 10%–50% of the training data. The source code and trained models are released on https://github.com/qwepi/SSL.

SESSION: Design and CAD for emerging memories

Exploring emerging CNFET for efficient last level cache design

  • Dawen Xu
  • Li Li
  • Ying Wang
  • Cheng Liu
  • Huawei Li

Carbon nanotube field-effect transistors (CNFETs) have emerged as a promising alternative to conventional CMOS for their much higher speed and power efficiency. They are particularly suitable for building the power-hungry last-level cache (LLC). However, process variation (PV) in CNFETs substantially affects operational stability and thus the worst-case timing, which dramatically limits the LLC operating frequency in a fully synchronous design. To address this problem, we developed a variation-aware cache in which each part of the cache can run at its optimal frequency, so that overall cache performance is improved significantly.

Because the variation unique to the CNFET fabrication process is asymmetrically correlated, the cache latency distribution is closely related to the LLC layout. For the two typical LLC layouts, we propose the variation-aware-set (VAS) cache and the variation-aware-way (VAW) cache, respectively, to make the best use of the CNFET cache architecture. For the VAS cache, we further propose a static page mapping to ensure the most frequently used data are mapped to the fast cache region. Similarly, we apply a latency-aware LRU replacement strategy to assign the most recently used data to the fast cache region. According to the experiments, the optimized CNFET-based LLC improves performance by 39% and reduces power consumption by 10% on average compared to the baseline CNFET LLC design.

Mosaic: an automated synthesis flow for boolean logic based on memristor crossbar

  • Lei Xie

A memristor crossbar stacked on top of CMOS circuitry is a promising candidate for future VLSI circuits, due to its great scalability, near-zero standby power consumption, etc. In order to design large-scale logic circuits, an automated synthesis flow is highly desirable for mapping Boolean functions onto memristor crossbars. This paper proposes such a synthesis flow, Mosaic, which reuses part of the existing CMOS synthesis flow. In addition, two schemes are proposed to optimize designs in terms of delay and power consumption. To verify Mosaic and its optimization schemes, four types of adders are used as a case study; the incurred delay, area, and power costs for both the crossbar and its CMOS controller are evaluated. The results show that the optimized adders reduce delay (>26%), power consumption (>21%), and area (>23%) compared to the initial ones. To show the potential of Mosaic for design-space exploration, we use nine more complex benchmarks. The results show that the designs can be significantly optimized in terms of both area (4.5x to 82.9x) and delay (2.4x to 9.5x).

Handling stuck-at-faults in memristor crossbar arrays using matrix transformations

  • Baogang Zhang
  • Necati Uysal
  • Deliang Fan
  • Rickard Ewetz

Matrix-vector multiplication is the dominant computational workload in the inference phase of neural networks. Memristor crossbar arrays (MCAs) can inherently execute matrix-vector multiplication with low latency and small power consumption. A key challenge is that classification accuracy may be severely degraded by stuck-at-fault defects. Earlier studies have shown that the accuracy loss can be recovered by retraining each neural network or by utilizing additional hardware. In this paper, we propose to handle stuck-at-faults using matrix transformations. A transformation T changes a weight matrix W into a weight matrix Ŵ = T(W) that is more robust to stuck-at-faults. In particular, we propose a row flipping transformation, a permutation transformation, and a value range transformation. The row flipping transformation translates stuck-off (stuck-on) faults into stuck-on (stuck-off) faults. The permutation transformation maps small (large) weights to memristors stuck-off (stuck-on). The value range transformation reduces the magnitude of the smallest and largest elements in the matrix, so that each stuck-at-fault introduces an error of smaller magnitude. The experimental results demonstrate that the proposed framework is capable of recovering 99% of the accuracy loss introduced by stuck-at-faults without requiring the neural network to be retrained.
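
To make the row flipping transformation concrete, here is a numpy sketch under an assumed bipolar weight encoding in which a stuck-on cell reads as +W_MAX and a stuck-off cell as -W_MAX. Storing -w in a row (and negating that row’s output, which preserves the function) swaps the effect of the two fault types, so each row keeps whichever orientation yields the smaller fault-induced weight error. The encoding and error metric are my assumptions, not the paper’s exact model.

```python
# Minimal sketch of the row flipping transformation on a faulty crossbar.
import numpy as np

W_MAX = 1.0

def row_error(row, on, off, flipped):
    """Weight error a row's stuck-at faults induce on the effective weights."""
    on_val  = -W_MAX if flipped else W_MAX   # stuck-on cell after compensation
    off_val =  W_MAX if flipped else -W_MAX  # stuck-off cell after compensation
    return np.abs(on_val - row[on]).sum() + np.abs(off_val - row[off]).sum()

def flip_rows(W, on, off):
    Wt = W.copy()
    for i in range(len(W)):
        if row_error(W[i], on[i], off[i], True) < row_error(W[i], on[i], off[i], False):
            Wt[i] = -W[i]                    # negate stored row and its output
    return Wt

rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, size=(4, 6))
on  = rng.random((4, 6)) < 0.1               # stuck-on defect map
off = rng.random((4, 6)) < 0.1               # stuck-off defect map
before = sum(row_error(W[i], on[i], off[i], False) for i in range(len(W)))
after  = sum(min(row_error(W[i], on[i], off[i], f) for f in (False, True))
             for i in range(len(W)))
print(f"fault-induced weight error: {before:.2f} -> {after:.2f}")
```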

SESSION: Optimized training for neural networks

CAPTOR: a class adaptive filter pruning framework for convolutional neural networks in mobile applications

  • Zhuwei Qin
  • Fuxun Yu
  • Chenchen Liu
  • Xiang Chen

Nowadays, the evolution of deep learning and cloud services significantly promotes neural network-based mobile applications. Although intelligent and prolific, those applications still lack certain flexibility: for classification tasks, neural networks are generally trained online with vast numbers of classification targets to cover various utilization contexts. However, only a subset of classes is practically exercised, due to individual mobile user preference and application specificity, and the unneeded classes cause considerable computation and communication cost. In this work, we propose CAPTOR, a class-level reconfiguration framework for Convolutional Neural Networks (CNNs). By identifying the class activation preference of convolutional filters through feature-interest visualization and gradient analysis, CAPTOR can effectively cluster and adaptively prune the filters associated with unneeded classes. Therefore, CAPTOR enables class-level CNN reconfiguration for network model compression and local deployment on mobile devices. Experiments show that CAPTOR can reduce the computation load for VGG-16 by up to 40.5% and energy consumption by 37.9% with negligible loss of accuracy. For AlexNet, CAPTOR also reduces the computation load by up to 42.8% and energy consumption by 37.6% with less than 3% loss in accuracy.

TNPU: an efficient accelerator architecture for training convolutional neural networks

  • Jiajun Li
  • Guihai Yan
  • Wenyan Lu
  • Shuhao Jiang
  • Shijun Gong
  • Jingya Wu
  • Junchao Yan
  • Xiaowei Li

Training large-scale convolutional neural networks (CNNs) is an extremely computation- and memory-intensive task that requires massive computational resources and training time. Recently, many accelerator solutions have been proposed to improve the performance and efficiency of CNNs. Existing approaches mainly focus on the inference phase of CNNs and can hardly address the new challenges posed by CNN training: the diversity of resource requirements and the bidirectional data dependency between convolutional layers (CVLs) and fully-connected layers (FCLs). To overcome this problem, this paper presents a new accelerator architecture for CNN training, called TNPU, which leverages the complementary resource requirements of CVLs and FCLs. Unlike prior approaches that optimize CVLs and FCLs separately, we take an alternative approach that smartly orchestrates the computation of CVLs and FCLs in a single computing unit to work concurrently, so that both computing and memory resources maintain high utilization, thereby boosting performance. We also propose a simplified out-of-order scheduling mechanism to address the bidirectional data dependency issue in CNN training. The experiments show that TNPU achieves speedups of 1.5x and 1.3x, with average energy reductions of 35.7% and 24.1%, over comparably provisioned state-of-the-art accelerators (DNPU and DaDianNao), respectively.

REIN: a robust training method for enhancing generalization ability of neural networks in autonomous driving systems

  • Fuxun Yu
  • Chenchen Liu
  • Xiang Chen

In recent years, neural networks have shown great potential in autonomous driving systems. However, theoretically well-trained neural networks often fail when facing real-world examples with unexpected physical variations. As current neural networks still suffer from limited generalization ability, those unexpected variations can cause considerable accuracy degradation and critical safety issues. Therefore, the generalization ability of neural networks becomes one of the most critical challenges for autonomous driving system design. In this work, we propose a robust training method to enhance a neural network’s generalization ability in various practical autonomous driving scenarios. Based on detailed practical variation modeling and neural network generalization ability analysis, the proposed training method consistently improves model classification accuracy, by up to 25%, in various scenarios (e.g., rainy/foggy weather, dark lighting, and camera discrepancy). Even with adversarial corner cases, our model still achieves up to 40% accuracy improvement over a naturally trained model.

SESSION: New trends in biochips

Factorization based dilution of biochemical fluids with micro-electrode-dot-array biochips

  • Sohini Saha
  • Debraj Kundu
  • Sudip Roy
  • Sukanta Bhattacharjee
  • Krishnendu Chakrabarty
  • Partha P. Chakrabarti
  • Bhargab B. Bhattacharya

Sample preparation, an essential preprocessing step for biochemical protocols, is concerned with the generation of fluids satisfying specific target ratios and error tolerances. Recent micro-electrode-dot-array (MEDA)-based DMF biochips provide the advantage of supporting both discrete and dynamic mixing models, a power that has not yet been fully harnessed for implementing on-chip dilution and mixing of fluids. In this paper, we propose a novel factorization-based algorithm called FacDA for efficient and accurate dilution of a sample fluid on a MEDA chip. Simulation results reveal that, over a large number of test-cases with the mixing volume constraint in the range of 4–10 units, FacDA requires around 38% fewer mixing steps, uses 52% fewer sample units, and generates approximately 23% less wastage, all on average, compared to two prior dilution algorithms used for MEDA chips.
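
For contrast with FacDA, the conventional (1:1)-mixing dilution baseline is easy to sketch: write the target concentration as a d-bit binary fraction and, bit by bit, mix the intermediate droplet with raw sample (1) or buffer (0). The sketch below is of that classic baseline only, under my own naming; FacDA’s factorization over MEDA’s multi-droplet mixing models is what reduces the step and sample counts below this.

```python
# Minimal sketch of the conventional (1:1)-mixing dilution baseline that
# factorization-based methods like FacDA improve upon.
def twoway_mix_dilution(target, d=10):
    """Return mixing schedule and achieved concentration (error < 2**-d)."""
    c_approx = round(target * (1 << d)) / (1 << d)
    bits = [(int(c_approx * (1 << d)) >> i) & 1 for i in range(d)]
    conc, steps = 0.0, []
    for b in bits:                      # LSB first
        other = 1.0 if b else 0.0       # mix with raw sample or buffer
        steps.append("sample" if b else "buffer")
        conc = (conc + other) / 2       # a (1:1) mix averages concentrations
    return steps, conc

steps, conc = twoway_mix_dilution(0.30)
print(len(steps), conc)                 # 10 mixing steps, conc ≈ 0.2998
```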

Sample preparation for multiple-reactant bioassays on micro-electrode-dot-array biochips

  • Tung-Che Liang
  • Yun-Sheng Chan
  • Tsung-Yi Ho
  • Krishnendu Chakrabarty
  • Chen-Yi Lee

Sample preparation, a key procedure in many biochemical protocols, mixes various samples and/or reagents into solutions that contain the target concentrations. Digital microfluidic biochips (DMFBs) have been adopted as a platform for sample preparation because they provide automatic procedures that require less reactant consumption and reduce human-induced errors. However, traditional DMFBs only utilize the (1:1) mixing model, i.e., only two droplets of the same volume can be mixed at a time, which results in higher completion time and the wastage of valuable reactants. To overcome this limitation, a next-generation micro-electrode-dot-array (MEDA) architecture was proposed that provides the flexibility of mixing multiple droplets of different volumes in a single operation. In this paper, we present a generic multiple-reactant sample preparation algorithm that exploits the novel fluidic operations on MEDA biochips. Simulated experiments show that the proposed method outperforms existing methods in terms of saving reactant cost, minimizing the number of operations, and reducing the amount of waste.

Robust sample preparation on digital microfluidic biochips

  • Zhanwei Zhong
  • Robert Wille
  • Krishnendu Chakrabarty

Sample preparation is an important application for the digital microfluidic biochip (DMFB) platform, and many methods have been developed to reduce the time and reagent usage associated with on-chip sample preparation. However, errors in fluidic operations can result in the concentration of the resulting droplet falling outside the calibration range. Current error-recovery methods have the drawback that they require on-chip sensors and additional re-execution time. In this paper, we present two dilution-chain structures that can generate a droplet with a desired concentration even if volume variations occur during droplet splitting. Experimental results show the effectiveness of the proposed method compared to previous methods.

SESSION: Power-efficient machine learning hardware design

SAADI: a scalable accuracy approximate divider for dynamic energy-quality scaling

  • Setareh Behroozi
  • Jingjie Li
  • Jackson Melchert
  • Younghyun Kim

Approximate computing can significantly improve the energy efficiency of arithmetic operations in error-resilient applications. In this paper, we propose an approximate divider design that facilitates dynamic energy-quality scaling. Conventional approximate dividers lack runtime energy-quality scalability, which is the key to maximizing energy efficiency while meeting dynamically varying accuracy requirements. Our divider design, named SAADI, approximates the reciprocal of the divisor in an incremental manner, so the division speed and energy efficiency can be dynamically traded for accuracy by controlling the number of iterations. For approximate 8-bit quotients of 32-bit/16-bit divisions, the average accuracy of SAADI can be adjusted between 92.5% and 99.0% by varying the latency by up to 7x. We evaluate the accuracy and energy consumption of SAADI for various design parameters and demonstrate its efficacy for low-power signal processing applications.
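
The energy-quality knob can be illustrated with a small stand-in for the incremental reciprocal: for a divisor normalized into [0.5, 1), 1/D = 1/(1 - x) with x = 1 - D, and each extra iteration adds one more term of the geometric series, buying accuracy at the cost of latency. This series scheme and the normalization are my own illustration, not SAADI’s exact datapath.

```python
# Minimal sketch of iterative reciprocal refinement for energy-quality
# scaling: more iterations -> more accuracy -> more latency/energy.
def approx_divide(n, d, iterations):
    """Approximate n/d; d is normalized into [0.5, 1) by shifting."""
    shift = 0
    while d < 0.5:  d *= 2; shift += 1
    while d >= 1.0: d /= 2; shift -= 1
    x, recip, term = 1.0 - d, 0.0, 1.0
    for _ in range(iterations):      # one added series term per iteration
        recip += term
        term *= x
    return n * recip * (2 ** shift)

for it in (2, 4, 8):
    q = approx_divide(100.0, 7.0, it)
    print(f"{it} iterations: {q:.4f} (exact {100/7:.4f})")
```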

SeFAct: selective feature activation and early classification for CNNs

  • Farhana Sharmin Snigdha
  • Ibrahim Ahmed
  • Susmita Dey Manasi
  • Meghna G. Mankalale
  • Jiang Hu
  • Sachin S. Sapatnekar

This work presents a dynamic energy reduction approach for hardware accelerators for convolutional neural networks (CNNs). Two methods are used: (1) an adaptive, data-dependent scheme that selectively activates a subset of all neurons by narrowing down the possible activated classes, and (2) static bit-width reduction. The former is applied in the late layers of the CNN, while the latter is more effective in the early layers. Even accounting for the implementation overheads, the results show 20%–25% energy savings with 5–10% accuracy loss.

FACH: FPGA-based acceleration of hyperdimensional computing by reducing computational complexity

  • Mohsen Imani
  • Sahand Salamat
  • Saransh Gupta
  • Jiani Huang
  • Tajana Rosing

Brain-inspired hyperdimensional (HD) computing explores computing with hypervectors, emulating cognition as an alternative to computing with numbers. In HD, input symbols are mapped to hypervectors and an associative search is performed for reasoning and classification. An associative memory, which finds the closest match between a set of learned hypervectors and a query hypervector, uses a simple Hamming distance metric for the similarity check. However, we observe that, in order to provide acceptable classification accuracy, HD needs to store a non-binarized model in associative memory and use costly similarity metrics such as cosine similarity for the reasoning task. This makes HD computationally expensive when used for realistic classification problems. In this paper, we propose an FPGA-based acceleration of HD (FACH) which significantly improves computational efficiency by removing the majority of multiplications during the reasoning task. FACH identifies representative values in each class hypervector using a clustering algorithm. It then creates a new HD model with hardware-friendly operations, and we accordingly propose an FPGA-based implementation to accelerate such tasks. Our evaluations on several classification problems show that FACH can provide 5.9X energy efficiency improvement and 5.1X speedup compared to a baseline FPGA-based implementation, while ensuring the same quality of classification.
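
The multiplication-removal trick can be pictured with a tiny numpy sketch: snap each class hypervector’s components to a few representative values with 1-D k-means; a dot product then needs only one multiply per representative, because the query components inside each cluster are summed first. The cluster count and names are illustrative assumptions, not FACH’s exact design.

```python
# Minimal sketch of clustering class-hypervector values so that a similarity
# dot product needs one multiply per cluster instead of one per dimension.
import numpy as np

def cluster_values(v, k=4, iters=20):
    """1-D k-means over a hypervector's component values."""
    reps = np.linspace(v.min(), v.max(), k)
    for _ in range(iters):
        assign = np.abs(v[:, None] - reps[None, :]).argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                reps[j] = v[assign == j].mean()
    return reps, assign

rng = np.random.default_rng(0)
D = 1024
class_hv, query = rng.normal(size=D), rng.normal(size=D)

reps, assign = cluster_values(class_hv)
# k multiplies instead of D: sum the query inside each cluster first
approx = sum(reps[j] * query[assign == j].sum() for j in range(len(reps)))
print(f"exact {class_hv @ query:.2f}  approx {approx:.2f}  "
      f"multiplies: {D} -> {len(reps)}")
```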

SESSION: Security of machine learning and machine learning for security: progress and challenges for secure, machine intelligent mobile systems

ADMM attack: an enhanced adversarial attack for deep neural networks with undetectable distortions

  • Pu Zhao
  • Kaidi Xu
  • Sijia Liu
  • Yanzhi Wang
  • Xue Lin

Many recent studies demonstrate that state-of-the-art deep neural networks (DNNs) can be easily fooled by adversarial examples, generated by adding carefully crafted and visually imperceptible distortions onto original legal inputs through adversarial attacks. Adversarial examples can lead the DNN to misclassify them as any target label. In the literature, various methods have been proposed to minimize different lp norms of the distortion. However, there lacks a versatile framework for all types of adversarial attacks. To achieve a better understanding of the security properties of DNNs, we propose a general framework for constructing adversarial examples that leverages the Alternating Direction Method of Multipliers (ADMM) to split the optimization approach for effective minimization of various lp norms of the distortion, including the l0, l1, l2, and l∞ norms. Thus, the proposed general framework unifies the methods for crafting l0, l1, l2, and l∞ attacks. The experimental results demonstrate that the proposed ADMM attacks achieve both a high attack success rate and minimal distortion for misclassification compared with state-of-the-art attack methods.
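
The splitting structure can be sketched on a toy problem. Below, an l1-distortion attack on a linear classifier is split ADMM-style: delta carries the misclassification loss, z carries the norm term (its proximal operator is soft-thresholding), and the scaled dual u ties them together. The toy model, the hinge loss, and all parameters are my own illustrative assumptions; the paper applies the split to DNNs and to all four norms.

```python
# Minimal sketch of ADMM variable splitting for an adversarial attack,
# shown with an l1-norm distortion on a toy linear classifier.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_l1_attack(x, w, b, target=1, lam=10.0, rho=2.0, kappa=3.0,
                   lr=0.05, steps=200):
    delta = np.zeros_like(x); z = np.zeros_like(x); u = np.zeros_like(x)
    for _ in range(steps):
        # delta-update: gradient step on lam*hinge + (rho/2)||delta - z + u||^2
        margin = target * (w @ (x + delta) + b)
        g_loss = -lam * target * w if margin < kappa else np.zeros_like(w)
        delta -= lr * (g_loss + rho * (delta - z + u))
        z = soft_threshold(delta + u, 1.0 / rho)   # z-update: prox of ||.||_1
        u += delta - z                             # dual update
    # prefer the sparse z; fall back to the dense delta if z has not flipped
    return z if np.sign(w @ (x + z) + b) == target else delta

rng = np.random.default_rng(0)
w, b = rng.normal(size=10), 0.0
x = rng.normal(size=10)
x = x if w @ x + b < 0 else -x        # start on the wrong side of target +1
d = admm_l1_attack(x, w, b)
print(f"label {np.sign(w @ x + b):+.0f} -> {np.sign(w @ (x + d) + b):+.0f}, "
      f"l1 distortion = {np.abs(d).sum():.3f}")
```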

A system-level perspective to understand the vulnerability of deep learning systems

  • Tao Liu
  • Nuo Xu
  • Qi Liu
  • Yanzhi Wang
  • Wujie Wen

Deep neural networks (DNNs) are nowadays achieving human-level performance on many machine learning applications such as self-driving cars, gaming, and computer-aided diagnosis. However, recent studies show that this promising technique has gradually become a major attack target, significantly threatening the safety of machine learning services. On one hand, adversarial or poisoning attacks that exploit DNN algorithm vulnerabilities can mislead decisions with very high confidence. On the other hand, system-level DNN attacks, built upon the models, training/inference algorithms, and the hardware and software involved in DNN execution, have also emerged, causing more diversified damage such as denial of service and private data theft. In this paper, we present an overview of such emerging system-level DNN attacks by systematically formulating their attack routines. Several representative cases are selected in our study to summarize the characteristics of system-level DNN attacks. Based on our formulation, we further discuss the challenges and several possible techniques to mitigate such emerging system-level DNN attacks.

HAMPER: high-performance adaptive mobile security enhancement against malicious speech and image recognition

  • Zirui Xu
  • Fuxun Yu
  • Chenchen Liu
  • Xiang Chen

Recently, machine learning technologies have been widely used in cognitive applications such as Automatic Speech Recognition (ASR) and Image Recognition (IR). Unfortunately, these techniques have also been massively applied to unauthorized audio/image data analysis, causing serious privacy leakage. To address this issue, we propose HAMPER, a data encryption framework that protects audio/image data from unauthorized ASR/IR analysis. Leveraging machine learning models' vulnerability to adversarial examples, HAMPER encrypts the audio/image data with adversarial noise that perturbs the recognition results of ASR/IR systems. To deploy the proposed framework on a wide range of platforms (e.g., mobile devices), HAMPER also takes into consideration computation efficiency, perturbation transferability, and data attribute configuration. Therefore, rather than focusing on the high-level machine learning models, HAMPER generates adversarial examples from low-level features. Taking advantage of the light computation load, fundamental impact, and direct configurability of the low-level features, the generated adversarial examples can efficiently and effectively affect whole ASR/IR systems. Experimental results show that HAMPER can effectively perturb unauthorized ASR/IR analysis with an 85% Word-Error-Rate (WER) and an 83% Image-Error-Rate (IER), respectively. HAMPER also achieves faster processing, with a 1.5X speedup for image encryption and up to 26X for audio, compared to state-of-the-art methods. Moreover, HAMPER achieves strong transferability and configures adversarial examples with desired attributes for better scenario adaptation.

AdverQuil: an efficient adversarial detection and alleviation technique for black-box neuromorphic computing systems

  • Hsin-Pai Cheng
  • Juncheng Shen
  • Huanrui Yang
  • Qing Wu
  • Hai Li
  • Yiran Chen

In recent years, neuromorphic computing systems (NCS) have gained popularity in accelerating neural network computation because of their high energy efficiency. The known vulnerability of neural networks to adversarial attacks, however, raises a severe security concern for NCS. In addition, there are certain application scenarios in which users have limited access to the NCS. In such scenarios, defense technologies that require changing the training methods of the NCS, e.g., adversarial training, become impractical. In this work, we propose AdverQuil – an efficient adversarial detection and alleviation technique for black-box NCS. AdverQuil can identify the adversarial strength of input examples and select the best strategy for the NCS to respond to the attack, without changing the structure/parameters of the original neural network or its training method. Experimental results show that on the MNIST and CIFAR-10 datasets, AdverQuil achieves a high efficiency of 79.5–167K images/sec/watt. AdverQuil introduces less than 25% hardware overhead, and can be combined with various adversarial alleviation techniques to provide a flexible trade-off between hardware cost, energy efficiency, and classification accuracy.

SESSION: System level modelling methods II

SIMULTime: Context-sensitive timing simulation on intermediate code representation for rapid platform explorations

  • Alessandro Cornaglia
  • Alexander Viehl
  • Oliver Bringmann
  • Wolfgang Rosenstiel

Nowadays, product lines are common practice in the embedded systems domain, as they allow for substantial reductions in development costs and time-to-market through the consistent application of design paradigms such as variability and structured reuse management. In that context, accurate and fast timing predictions are essential for an early evaluation of all relevant variants of a product line with respect to target platform properties. Context-sensitive simulations provide attractive benefits for timing analysis. Nevertheless, these simulations depend strongly on a single configuration pair of compiler and hardware platform. To cope with this limitation, we present SIMULTime, a new technique for context-sensitive timing simulation based on the software intermediate representation. Simulation throughput increases significantly by simultaneously simulating different hardware platforms and compiler configurations: multiple accurate timing predictions are produced by running the simulator only once. Applying our approach to several applications shows that SIMULTime increases the average simulation throughput by 90% when at least four configurations are analyzed in parallel.

Modeling processor idle times in MPSoC platforms to enable integrated DPM, DVFS, and task scheduling subject to a hard deadline

  • Amirhossein Esmaili
  • Mahdi Nazemi
  • Massoud Pedram

Energy efficiency is one of the most critical design criteria for modern embedded systems such as multiprocessor system-on-chips (MPSoCs). Dynamic voltage and frequency scaling (DVFS) and dynamic power management (DPM) are two major techniques for reducing energy consumption in such embedded systems. Furthermore, MPSoCs are becoming more popular for many real-time applications. One of the challenges of integrating DPM with DVFS and task scheduling of real-time applications on MPSoCs is the modeling of idle intervals on these platforms. In this paper, we present a novel approach for modeling idle intervals in MPSoC platforms which leads to a mixed integer linear programming (MILP) formulation integrating DPM, DVFS, and task scheduling of periodic task graphs subject to a hard deadline. We also present a heuristic approach for solving the MILP and compare its results with those obtained by solving the MILP exactly.
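
To make the shape of such a formulation concrete, here is a deliberately simplified MILP sketch in the same spirit: binary variables assign each task a voltage/frequency level, subject to precedence and a hard deadline. The notation is illustrative only; the paper's model additionally captures idle intervals and DPM sleep decisions.

```latex
\begin{aligned}
\min_{x,\,t}\quad & \textstyle\sum_{i}\sum_{v} x_{i,v}\,E_{i,v} \\
\text{s.t.}\quad  & \textstyle\sum_{v} x_{i,v} = 1 \quad \forall i, \\
                  & t_i + \textstyle\sum_{v} x_{i,v}\,d_{i,v} \le t_j \quad \forall (i\!\to\! j)\in G, \\
                  & t_i + \textstyle\sum_{v} x_{i,v}\,d_{i,v} \le D \quad \forall i,
                    \qquad x_{i,v}\in\{0,1\},
\end{aligned}
```

where E_{i,v} and d_{i,v} are the energy and delay of task i at level v, G is the periodic task graph, and D is the hard deadline.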

Phone-nomenon: a system-level thermal simulator for handheld devices

  • Hong-Wen Chiou
  • Yu-Min Lee
  • Shin-Yu Shiau
  • Chi-Wen Pan
  • Tai-Yu Chen

This work presents a system-level thermal simulator, Phone-nomenon, to predict the thermal behavior of smartphones. First, we study the nonlinearity of internal and external heat transfer mechanisms and propose a compact thermal model. We then develop an iterative framework to handle the nonlinearity. Compared with a commercial tool, ANSYS Icepak, Phone-nomenon achieves two and three orders of magnitude speedup with 3.58% maximum error and 1.72°C difference for steady-state and transient-state simulations, respectively. Phone-nomenon also closely fits the measured data of a fabricated thermal test vehicle.

Virtual prototyping of heterogeneous automotive applications: matlab, SystemC, or both?

  • Xiao Pan
  • Carna Zivkovic
  • Christoph Grimm

We present a case study on virtual prototyping of automotive applications. We address the co-simulation of HW/SW systems involving firmware, communication protocols, and physical/mechanical systems in the context of model-based and agile development processes. The case study compares the Matlab/Simulink and SystemC-based approaches using an e-gas benchmark, examining simulation performance, modeling capabilities, and applicability in different stages of the development process.

SESSION: Placement

Diffusion break-aware leakage power optimization and detailed placement in sub-10nm VLSI

  • Sun ik Heo
  • Andrew B. Kahng
  • Minsoo Kim
  • Lutong Wang

A diffusion break (DB) isolates two neighboring devices in a standard cell-based design and has a stress effect on delay and leakage power. In foundry sub-10nm design enablements, device performance is changed according to the type of DB – single diffusion break (SDB) or double diffusion break (DDB) – that is used in the library cell layout. Crucially, local layout effect (LLE) can substantially affect device performance and leakage. Our present work focuses on the 2nd DB effect, a type of LLE in which distance to the second-closest DB (i.e., a distance that depends on the placement of a given cell’s neighboring cell) also impacts performance of a given device. In this work, we implement a 2nd DB-aware timing and leakage analysis flow, and show how a lack of 2nd DB awareness can misguide current optimization in place-and-route stages. We then develop 2nd DB-aware leakage optimization and detailed placement heuristics. Experimental results in a scaled foundry 14nm technology indicate that our 2nd DB-aware analysis and optimization flow achieves, on average, 80% recovery of the leakage increment that is induced by the 2nd DB effect, without changing design performance.

MDP-trees: multi-domain macro placement for ultra large-scale mixed-size designs

  • Yen-Chun Liu
  • Tung-Chieh Chen
  • Yao-Wen Chang
  • Sy-Yen Kuo

In this paper, we present a new hybrid representation of slicing trees and multi-packing trees, called multi-domain-packing trees (MDP-trees), for macro placement to handle ultra large-scale multi-domain mixed-size designs. A multi-domain design typically consists of a set of mixed-size domains, each with hundreds/thousands of large macros and (tens of) millions of standard cells, which is often seen in modern high-end applications (e.g., 4G LTE products and upcoming 5G ones). To the best of our knowledge, there is still no published work that tackles domain planning and macro placement simultaneously. Based on binary trees, the MDP-tree is very efficient and effective for handling macro placement with multiple domains. Previous works on macro placement can handle only single-domain designs and do not consider the global interactions among domains. In contrast, our MDP-trees plan domain regions globally and optimize the interconnections among domains and macro/cell positions simultaneously. The placement area of each domain is well reserved, and macro displacement from the initial macro positions of the design prototype is minimized. Experimental results show that our approach can significantly reduce both the average half-perimeter wirelength and the average global routing wirelength.

A shape-driven spreading algorithm using linear programming for global placement

  • Shounak Dhar
  • Love Singhal
  • Mahesh A. Iyer
  • David Z. Pan

In this paper, we consider the problem of finding the global shape for the placement of cells in a chip that results in minimum wirelength. Under certain assumptions, we theoretically prove that some shapes are better than others for minimizing wirelength while treating overlap removal as a key constraint of the placer. We derive conditions for the optimal shape and obtain a shape which is numerically close to the optimum. We also propose a linear-programming-based spreading algorithm with parameters to tune the resultant shape, and derive a cost function that is better than the total or maximum displacement objectives traditionally used in many numerical global placers. Our new cost function does not require explicit wirelength computation, and our spreading algorithm largely preserves the relative order among the cells placed after a numerical placer iteration. Our experimental results demonstrate that our shape-driven spreading algorithm improves wirelength, routing congestion, and runtime compared to a bi-partitioning-based spreading algorithm used in a state-of-the-art academic global placer for FPGAs.

Finding placement-relevant clusters with fast modularity-based clustering

  • Mateus Fogaça
  • Andrew B. Kahng
  • Ricardo Reis
  • Lutong Wang

In advanced technology nodes, IC implementation faces increasing design complexity as well as ever-more demanding design schedule requirements. This raises the need for new decomposition approaches that can help reduce problem complexity, in conjunction with new predictive methodologies that can help avoid bottlenecks and loops in the physical implementation flow. Notably, with modern design methodologies it would be very valuable to better predict final placement of the gate-level netlist: this would enable more accurate early assessment of performance, congestion and floorplan viability in the SOC floorplanning/RTL planning stages of design. In this work, we study a new criterion for the classic challenge of VLSI netlist clustering: how well netlist clusters “stay together” through final implementation. We propose use of several evaluators of this criterion. We also explore the use of modularity-driven clustering to identify natural clusters in a given graph without the tuning of parameters and size balance constraints typically required by VLSI CAD partitioning methods. We find that the netlist hypergraph-to-graph mapping can significantly affect quality of results, and we experimentally identify an effective recipe for weighting that also comprehends topological proximity to I/Os. Further, we empirically demonstrate that modularity-based clustering achieves better correlation to actual netlist placements than traditional VLSI CAD methods (our method is also 4X faster than use of hMetis for our largest testcases). Finally, we show a potential flow with fast “blob placement” of clusters to evaluate netlist and floorplan viability in early design stages; this flow can predict gate-level placement of 370K cells in 200 seconds on a single core.
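
As an illustration of the kind of parameter-free clustering the abstract refers to, the sketch below runs greedy modularity maximization on a toy graph with the open-source networkx package. The tiny graph and edge weights stand in for a netlist's hypergraph-to-graph mapping; no cluster count or balance constraints are supplied.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy stand-in for a netlist graph: nodes are cells, weighted edges come
# from some hypergraph-to-graph mapping (e.g., clique expansion of nets).
G = nx.Graph()
G.add_weighted_edges_from([
    ("u1", "u2", 1.0), ("u2", "u3", 0.5), ("u1", "u3", 1.0),  # cluster A
    ("u4", "u5", 1.0), ("u5", "u6", 1.0), ("u4", "u6", 0.5),  # cluster B
    ("u3", "u4", 0.1),                                        # weak cut edge
])

# Modularity-driven clustering finds the "natural" clusters directly.
clusters = greedy_modularity_communities(G, weight="weight")
for i, c in enumerate(clusters):
    print(f"cluster {i}: {sorted(c)}")
```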

SESSION: Algorithms and architectures for emerging applications

An approximation algorithm to the optimal switch control of reconfigurable battery packs

  • Shih-Yu Chen
  • Jie-Hong R. Jiang
  • Shou-Hung Welkin Ling
  • Shih-Hao Liang
  • Mao-Cheng Huang

The broad application of lithium-ion batteries in cyber-physical systems has attracted intensive research on building energy-efficient battery systems. Reconfigurable battery packs have been proposed to improve reliability and energy efficiency. Despite recent efforts, how to simultaneously maximize battery usage time and minimize the switching count during reconfiguration is rarely addressed. In this work, we devise a control algorithm that, under a simplified battery model, achieves the longest usage time under a given constant power load while keeping the switching count within twice the minimum. The algorithm is further generalized to arbitrary power loads and adjusted for refined battery models. Simulation experiments show promising benefits of the proposed algorithm.

Autonomous vehicle routing in multiple intersections

  • Sheng-Hao Lin
  • Tsung-Yi Ho

Advancements in artificial intelligence and the Internet of Things indicate that commercial autonomous vehicles are close to realization. With autonomous vehicles come new approaches to current traffic problems such as fuel consumption, congestion, and high incident rates. Autonomous Intersection Management (AIM) is an example that utilizes the unique attributes of autonomous vehicles to improve the efficiency of a single intersection. However, in a system of interconnected intersections, improving individual intersections alone does not guarantee a system optimum. We therefore extend from a single intersection to a grid of intersections and propose a novel vehicle routing method for autonomous vehicles that effectively reduces the travel time of each vehicle. With dedicated short-range communications and the fine-grained control of autonomous vehicles, we are able to apply wire routing algorithms with modified constraints to vehicle routing. Our method intelligently avoids congestion by simulating future traffic, thereby achieving a system optimum.

GRAM: graph processing in a ReRAM-based computational memory

  • Minxuan Zhou
  • Mohsen Imani
  • Saransh Gupta
  • Yeseong Kim
  • Tajana Rosing

The performance of graph processing on real-world graphs is limited by inefficient memory behavior in traditional systems because of random memory access patterns. Offloading computation to the memory is a promising strategy to overcome such challenges. In this paper, we exploit resistive memory (ReRAM) based processing-in-memory (PIM) technology to accelerate graph applications. The proposed solution, GRAM, can efficiently execute the vertex-centric model, which is widely used in large-scale parallel graph processing programs, in the computational memory. The hardware-software co-design used in GRAM maximizes computation parallelism while minimizing the number of data movements. Based on our experiments with three important graph kernels on seven real-world graphs, GRAM provides 122.5X and 11.1X speedup compared with an in-memory graph system and optimized multithreading algorithms running on a multi-core CPU. Compared to a GPU-based graph acceleration library and a recently proposed PIM accelerator, GRAM improves performance by 7.1X and 3.8X, respectively.
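
The vertex-centric model the abstract refers to is easiest to see in code. Below is a minimal, generic software rendition (a Pregel-style frontier iteration for single-source shortest paths); it illustrates the programming model only, not GRAM's in-memory hardware mapping.

```python
from collections import defaultdict

def vertex_centric_sssp(edges, source):
    """Generic vertex-centric iteration: in each superstep, every active
    vertex applies an edge program to its neighbors; vertices whose value
    improves become active in the next superstep."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
    dist = defaultdict(lambda: float("inf"))
    dist[source] = 0
    frontier = {source}
    while frontier:
        nxt = set()
        for u in frontier:                 # a PIM substrate processes these in parallel
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:  # edge "program": relax the distance
                    dist[v] = dist[u] + w
                    nxt.add(v)
        frontier = nxt
    return dict(dist)

print(vertex_centric_sssp([("a", "b", 1), ("b", "c", 2), ("a", "c", 5)], "a"))
```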

ADEPOS: anomaly detection based power saving for predictive maintenance using edge computing

  • Sumon Kumar Bose
  • Bapi Kar
  • Mohendra Roy
  • Pradeep Kumar Gopalakrishnan
  • Arindam Basu

In Industry 4.0, predictive maintenance (PdM) is one of the most important applications pertaining to the Internet of Things (IoT). Machine learning is used to predict the possible failure of a machine before the actual event occurs. However, the main challenges in PdM are: (a) lack of enough data from failing machines, and (b) paucity of power and bandwidth to transmit sensor data to the cloud throughout the lifetime of the machine. Alternatively, edge computing approaches reduce data transmission and consume little energy. In this paper, we propose an Anomaly Detection based Power Saving (ADEPOS) scheme that uses approximate computing throughout the lifetime of the machine. At the beginning of the machine's life, low-accuracy computations are used while the machine is healthy. However, on detection of anomalies as time progresses, the system is switched to higher-accuracy modes. We show using the NASA bearing dataset that ADEPOS needs 8.8X fewer neurons on average and, based on post-layout results, yields energy savings of 6.4–6.65X.
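
A small sketch of the mode-switching policy just described. The thresholds, anomaly scores, and mode encodings are invented for illustration; the paper's detector is an anomaly-detecting neural network whose size grows with the selected mode.

```python
def adepos_monitor(anomaly_scores, thresholds=(0.6, 0.8), modes=(8, 16, 32)):
    """Illustrative approximate-computing policy: start in the cheapest
    (lowest-accuracy) mode while the machine looks healthy, and step up
    to more accurate modes as the anomaly score grows."""
    mode = 0  # index into `modes`, cheapest first
    for score in anomaly_scores:
        while mode < len(thresholds) and score > thresholds[mode]:
            mode += 1  # anomaly suspected: switch to a higher-accuracy mode
        yield modes[mode]

# Healthy early life, then degradation toward failure.
print(list(adepos_monitor([0.1, 0.2, 0.65, 0.7, 0.85])))  # -> [8, 8, 16, 16, 32]
```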

SESSION: Embedded software for parallel architecture

Efficient sporadic task handling in parallel AUTOSAR applications using runnable migration

  • Milan Copic
  • Rainer Leupers
  • Gerd Ascheid

Automotive software has become immensely complex. To manage this complexity, a safety-critical application is commonly written following the AUTOSAR standard and deployed on a multi-core ECU. However, parallelization of an AUTOSAR task is hindered by data dependencies between runnables, the smallest code fragments executed by the run-time system. Consequently, a substantial number of idle intervals is introduced. We propose to utilize such intervals for sporadic tasks by migrating runnables that were originally scheduled to execute in the scope of periodic tasks.

A heuristic for multi objective software application mappings on heterogeneous MPSoCs

  • Gereon Onnebrink
  • Ahmed Hallawa
  • Rainer Leupers
  • Gerd Ascheid
  • Awaid-Ud-Din Shaheen

Efficient development of parallel software is one of the biggest hurdles to exploiting the advantages of heterogeneous multi-core architectures. Fast and accurate compiler technology is required for determining the trade-off between multiple objectives, such as power and performance. To tackle this problem, the paper at hand proposes the novel heuristic TONPET. Furthermore, it is integrated into the SLX tool suite for a detailed evaluation and an applicability study. TONPET is tested against representative benchmarks on three different platforms and compared to a state-of-the-art Evolutionary Multi-Objective Algorithm (EMOA). On average, TONPET produces 6% better Pareto fronts, while being 18X faster in the worst case.

ReRAM-based processing-in-memory architecture for blockchain platforms

  • Fang Wang
  • Zhaoyan Shen
  • Lei Han
  • Zili Shao

Blockchain's decentralized consensus mechanism has attracted many applications, including IoT devices. A blockchain maintains a linked list of blocks and grows by mining new blocks. However, blockchain mining consumes huge computation resources and energy, which is unacceptable for resource-limited embedded devices. This paper presents, for the first time, a ReRAM-based processing-in-memory architecture for blockchain mining, called Re-Mining. Re-Mining includes a message schedule module and a SHA computation module. These modules are composed of several basic ReRAM-based logic operation units, such as ROR, RSF, and XOR. Re-Mining further designs intra-transaction and inter-transaction parallel mechanisms to accelerate blockchain mining. Simulation results show that the proposed Re-Mining architecture significantly outperforms CPU-based and GPU-based implementations.
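
To see why ROR, shift, and XOR units are exactly what a mining engine needs, consider the SHA-256 message schedule, the expansion step a message schedule module implements. The standard computation is sketched below in software, purely for illustration of the operation mix:

```python
def ror(x, n):
    """32-bit rotate right, one of the bitwise primitives mapped to ReRAM."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def sha256_message_schedule(block):
    """Expand 16 input words into the 64-word SHA-256 schedule using only
    ROR, right-shift, XOR, and modular addition (the kinds of primitive
    operations an in-memory SHA engine must provide)."""
    w = list(block)                   # block: 16 x 32-bit words
    for i in range(16, 64):
        s0 = ror(w[i - 15], 7) ^ ror(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = ror(w[i - 2], 17) ^ ror(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & 0xFFFFFFFF)
    return w

print(sha256_message_schedule(range(16))[:20])
```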

SESSION: Machine learning and hardware security

Towards practical homomorphic email filtering: a hardware-accelerated secure naïve Bayesian filter

  • Song Bian
  • Masayuki Hiromoto
  • Takashi Sato

A secure version of the naïve Bayesian filter (NBF), called SNBF, is proposed utilizing a partially homomorphic encryption (PHE) scheme. SNBF can be implemented with only the additive homomorphism from the Paillier system, and we derive new techniques to reduce the computational cost of PHE-based SNBF. In our experiments, we implemented SNBF both in software and hardware. Compared to the best existing PHE scheme, we achieved a 1,200x (resp., 398,840x) runtime reduction in the CPU (resp., ASIC) implementation, with an additional 1,919x power reduction on the designated hardware multiplier. Our hardware implementation is able to classify an average-length email in 0.5 s, making it one of the most practical NBF schemes to date.
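
The additive homomorphism the scheme relies on is easy to demonstrate with the open-source python-paillier (`phe`) package: ciphertexts can be added and multiplied by plaintext scalars, so a naive Bayes log-likelihood score can be accumulated without decrypting. The word weights and counts below are invented for illustration and are unrelated to the paper's filter:

```python
# Additive homomorphism: E(a) + E(b) = E(a + b), and E(a) * c = E(a * c).
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Server-side model: per-word log-likelihood ratios (spam vs. ham),
# scaled to integers so they can serve as plaintext scalars.
weights = {"free": 40, "meeting": -25, "winner": 55}

# Client encrypts its word-count vector and sends only ciphertexts.
email_counts = {"free": 2, "meeting": 0, "winner": 1}
enc_counts = {w: public_key.encrypt(c) for w, c in email_counts.items()}

# Server computes the encrypted NBF score without ever seeing the email:
# score = sum_w count(w) * weight(w), via scalar-multiply and ciphertext add.
enc_score = sum(enc_counts[w] * weights[w] for w in weights)

# Only the key holder can decrypt and threshold the score.
print("spam" if private_key.decrypt(enc_score) > 0 else "ham")
```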

A 0.16pJ/bit recurrent neural network based PUF for enhanced machine learning attack resistance

  • Nimesh Shah
  • Manaar Alam
  • Durga Prasad Sahoo
  • Debdeep Mukhopadhyay
  • Arindam Basu

Physically Unclonable Function (PUF) circuits are finding widespread use due to the increasing adoption of IoT devices. However, existing strong PUFs such as Arbiter PUFs (APUFs) and their compositions are susceptible to machine learning (ML) attacks because the challenge-response pairs have a linear relationship. In this paper, we present a Recurrent-Neural-Network PUF (RNN-PUF) which uses a combination of feedback and an XOR function to significantly improve resistance to ML attacks without a significant reduction in reliability. ML attack accuracy is also partly reduced by using a shared comparator with offset cancellation to remove bias and save power. From simulation results, we obtain an ML attack accuracy of 62% for different ML algorithms, while reliability stays above 93%. This represents a 33.5% improvement in our figure of merit. Power consumption is estimated to be 12.3μW, with an energy/bit of ≈ 0.16pJ.

P3M: a PIM-based neural network model protection scheme for deep learning accelerator

  • Wen Li
  • Ying Wang
  • Huawei Li
  • Xiaowei Li

This work is oriented toward the edge computing scenario in which terminal deep learning accelerators use pre-trained neural network models distributed by third-party providers (e.g., data center clouds) to process private data locally instead of sending it to the cloud. In this scenario, the network model is exposed to the risk of being attacked on unverified devices if its parameters and hyper-parameters are transmitted and processed in an unencrypted way. Our work tackles this security problem by using on-chip memory Physical Unclonable Functions (PUFs) and Processing-In-Memory (PIM). We allow model execution only on authorized devices and protect the model from white-box attacks, black-box attacks, and model tampering attacks. The proposed PUFs-and-PIM based Protection method for neural Models (P3M) can utilize unstable PUFs to protect neural models in edge deep learning accelerators with negligible performance overhead. The experimental results show considerable performance improvement over the two state-of-the-art solutions we evaluated.

SESSION: Memory architecture for efficient neural network computing

Learning the sparsity for ReRAM: mapping and pruning sparse neural network for ReRAM based accelerator

  • Jilan Lin
  • Zhenhua Zhu
  • Yu Wang
  • Yuan Xie

With its in-memory processing ability, ReRAM-based computing is becoming more and more attractive for accelerating neural networks (NNs). However, most ReRAM-based accelerators cannot support efficient mapping of sparse NNs: the whole dense matrix must be mapped onto the ReRAM crossbar array to achieve O(1) computation complexity. In this paper, we propose a sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization. Further, we propose a crossbar-grained pruning algorithm to remove crossbars with low utilization. Finally, since most current ReRAM devices cannot achieve high precision, we analyze the effect of quantization precision on sparse NNs, and propose to perform high-precision composing in the analog domain, designing the related peripheral circuits. In our experiments, we discuss how the system performs with different crossbar sizes in order to choose the optimized design. Our results show that our mapping scheme for sparse NNs with the proposed pruning algorithm achieves 3–5X energy efficiency and 2.5–6X speedup compared with accelerators for dense NNs. The accuracy experiments also show that our pruning method incurs almost no accuracy loss.
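
A toy rendition of the crossbar-grained pruning idea: tile a sparse weight matrix into crossbar-sized blocks and drop tiles whose utilization is too low to justify a physical crossbar. The tile size, threshold, and synthetic matrix are all hypothetical, not the paper's parameters.

```python
import numpy as np

def crossbar_grained_prune(weights, xbar=128, min_util=0.05):
    """Tile the (already sparse) weight matrix into xbar-by-xbar blocks,
    measure each block's nonzero utilization, and zero out blocks below
    the threshold so they never occupy a physical ReRAM crossbar."""
    pruned = weights.copy()
    rows, cols = weights.shape
    for r in range(0, rows, xbar):
        for c in range(0, cols, xbar):
            block = pruned[r:r + xbar, c:c + xbar]
            if np.count_nonzero(block) / block.size < min_util:
                block[:] = 0.0  # this tile is skipped in the mapping
    return pruned

rng = np.random.default_rng(0)
w = np.zeros((256, 256))  # synthetic sparse layer with uneven density
w[:128, :128] = rng.normal(size=(128, 128)) * (rng.random((128, 128)) < 0.30)
w[128:, 128:] = rng.normal(size=(128, 128)) * (rng.random((128, 128)) < 0.01)
print(np.count_nonzero(crossbar_grained_prune(w)), "weights kept")
```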

In-memory batch-normalization for resistive memory based binary neural network hardware

  • Hyungjun Kim
  • Yulhwa Kim
  • Jae-Joon Kim

Binary Neural Networks (BNNs) have great potential to be implemented on Resistive memory Crossbar Array (RCA)-based hardware accelerators because they require only 1-bit precision for weights and activations. While general structures to implement convolutional or fully-connected layers in RCA-based BNN hardware were actively studied in previous works, the Batch-Normalization (BN) layer, another key layer of BNNs, has not yet been discussed in depth. In this work, we propose in-memory batch-normalization schemes which integrate BN layers on the RCA so that the area/energy-efficiency of BNN accelerators can be maximized. In addition, we show that sense amplifier errors due to device mismatch can be suppressed using the proposed in-memory BN design.
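
The reason BN integrates so naturally into a binary pipeline is the textbook folding of BN followed by the sign activation into a single threshold test, shown below in general form (this sketch does not capture the paper's circuit-level realization on the RCA):

```latex
\operatorname{sign}\!\Big(\gamma\,\tfrac{x-\mu}{\sigma}+\beta\Big)=
\begin{cases}
\operatorname{sign}(x-\tau), & \gamma>0,\\
-\operatorname{sign}(x-\tau), & \gamma<0,
\end{cases}
\qquad
\tau=\mu-\frac{\beta\sigma}{\gamma},\quad \sigma=\sqrt{\operatorname{Var}[x]+\epsilon}.
```

Per output neuron, BN thus reduces to comparing the accumulated crossbar output against a single threshold, which is why it can be absorbed into the sensing stage rather than computed as a separate layer.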

XOMA: exclusive on-chip memory architecture for energy-efficient deep learning acceleration

  • Hyeonuk Sim
  • Jason H. Anderson
  • Jongeun Lee

State-of-the-art deep neural networks (DNNs) require hundreds of millions of multiply-accumulate (MAC) computations to perform inference, e.g., in image-recognition tasks. To improve performance and energy efficiency, deep learning accelerators have been proposed, realized both on FPGAs and as custom ASICs. Generally, such accelerators comprise many parallel processing elements capable of executing large numbers of concurrent MAC operations. From the energy perspective, however, most consumption arises from memory accesses, both to off-chip external memory and to on-chip buffers. In this paper, we propose an on-chip DNN co-processor architecture where minimizing memory accesses is the primary design objective. To the maximum possible extent, off-chip memory accesses are eliminated, providing the lowest possible energy consumption for inference. Compared to a state-of-the-art ASIC, our architecture requires 36% fewer external memory accesses and 53% less energy consumption for low-latency image classification.

SESSION: Logic-level security and synthesis

BeSAT: behavioral SAT-based attack on cyclic logic encryption

  • Yuanqi Shen
  • You Li
  • Amin Rezaei
  • Shuyu Kong
  • David Dlott
  • Hai Zhou

Cyclic logic encryption is a newly proposed technique in the area of hardware security. It introduces feedback cycles into the circuit to defeat existing logic decryption techniques. To ensure that the circuit is acyclic under the correct key, CycSAT was developed to add the acyclic condition as a CNF formula to the SAT-based attack. However, we found that it is impossible to capture all cycles in an arbitrary graph with any fixed set of feedback signals, as is done in the CycSAT algorithm. In this paper, we propose a behavioral SAT-based attack called BeSAT. BeSAT observes the behavior of the encrypted circuit on top of the structural analysis, so the stateful and oscillatory keys missed by CycSAT can still be blocked. The experimental results show that BeSAT successfully overcomes the drawback of CycSAT.

Structural rewriting in XOR-majority graphs

  • Zhufei Chu
  • Mathias Soeken
  • Yinshui Xia
  • Lunyao Wang
  • Giovanni De Micheli

In this paper, we present a structural rewriting method for the recently proposed XOR-Majority graph (XMG), which has exclusive-OR (XOR), majority-of-three (MAJ), and inverters as primitives. XMGs are an extension of Majority-Inverter Graphs (MIGs). Previous work presented an axiomatic system, Ω, and its derived transformation rules for the manipulation of MIGs. With the additional XOR primitive, identities over mixed MAJ-XOR operations can be exploited to enable powerful logic rewriting in XMGs. We first propose two MAJ-XOR identities and exploit their optimization opportunities during structural rewriting. We then discuss the rewriting rules that can be used for different operations. Finally, we also address the structural XOR detection problem in MIGs. The experimental results on the EPFL benchmark suites show that the proposed method can optimize the size/depth product of XMGs and their mapped look-up tables (LUTs), which in turn benefits quantum circuit synthesis flows that use XMGs as the underlying logic representation.
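
The paper's two identities are not spelled out in the abstract, but the standard bridge between the two primitives already suggests why mixed MAJ-XOR rewriting is productive: majority itself has a well-known XOR expansion, and constant inputs specialize it to AND/OR. (These are textbook Boolean identities, not the paper's new ones.)

```latex
\langle x\,y\,z\rangle = (x\wedge y)\oplus(x\wedge z)\oplus(y\wedge z),
\qquad
\langle x\,y\,0\rangle = x\wedge y,
\qquad
\langle x\,y\,1\rangle = x\vee y .
```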

Design automation for adiabatic circuits

  • Alwin Zulehner
  • Michael P. Frank
  • Robert Wille

Adiabatic circuits are heavily investigated since they allow for computations with asymptotically close to zero energy dissipation per operation, serving as an alternative technology for the many scenarios where energy efficiency is preferred over fast execution. Their concepts are motivated by the fact that the information lost in conventional circuits results in an entropy increase which causes energy dissipation. To overcome this issue, computations are performed in a (conditionally) reversible fashion which, additionally, has to satisfy switching rules that differ from those of conventional circuitry, crying out for dedicated design automation solutions. While previous approaches either focus on the electrical realization (resulting in small, hand-crafted circuits only) or on designing fully reversible building blocks (an unnecessary overhead), this work provides an automatic and dedicated design scheme that explicitly takes recent findings in this domain into account. To this end, we review the theoretical and technical background of adiabatic circuits and present automated methods that realize the desired function as an adiabatic circuit. The resulting methods are further optimized, leading to automatic and efficient design automation for this promising technology. Evaluations confirm the benefits and applicability of the proposed solution.

SESSION: Analysis and algorithms for digital design verification

A figure of merit for assertions in verification

  • Samuel Hertz
  • Debjit Pal
  • Spencer Offenberger
  • Shobha Vasudevan

Assertion quality is critical to the confidence and claims in a design’s verification. In current practice, there is no metric to evaluate assertions. We introduce a methodology to rank register transfer level (RTL) assertions. We define assertion importance and assertion complexity and present efficient algorithms to compute them. Our method ranks each assertion according to its importance and complexity. We demonstrate the effectiveness of our ranking for pre-silicon verification on a detailed case study. For completeness, we study the relevance of our highly ranked assertions in a post-silicon validation context, using traced and restored signal values from the design’s netlist.

Suspect2vec: a suspect prediction model for directed RTL debugging

  • Neil Veira
  • Zissis Poulos
  • Andreas Veneris

Automated debugging tools based on Boolean Satisfiability (SAT) have greatly alleviated the time and effort required to diagnose and rectify a failing design. Practical experience shows that long-running debugging instances can often be resolved faster using partial results that are available before the SAT solver completes its search. In such cases it is preferable for the tool to maximize the number of suspects it returns during the early stages of its deployment. To capitalize on this observation, this paper proposes a directed SAT-based debugging algorithm which prioritizes examining design locations that are more likely to be suspects. This prioritization is determined by suspect2vec, a model which learns from historical debug data to predict the suspect locations that will be found. Experiments show that this algorithm is expected to find 16% more suspects than the baseline algorithm if terminated prematurely, while still retaining the ability to find all suspects if executed to completion. Key to its performance, and a contribution of this work, is the accuracy of the suspect prediction model, because incorrect predictions introduce overhead in exploring parts of the search space where few or no solutions exist. Suspect2vec is experimentally demonstrated to outperform existing suspect prediction methods by 5–20% in average accuracy.

Path controllability analysis for high quality designs

  • Li-Jie Chen
  • Hong-Zu Chou
  • Kai-Hui Chang
  • Sy-Yen Kuo
  • Chi-Lai Huang

Given a design variable and its fanin cone, determining whether one fanin variable has controlling power over other fanin variables can benefit many design steps such as verification, synthesis and test generation. In this work we formulate this path controllability problem and propose several algorithms that not only solve this problem but also return values that enable or block other fanin variables. Empirical results show that our algorithms can effectively perform path controllability analysis and help produce high-quality designs.

SESSION: FPGA and optics-based neural network designs

Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS

  • Qin Li
  • Xiaofan Zhang
  • JinJun Xiong
  • Wen-mei Hwu
  • Deming Chen

Neural machine translation (NMT) is a popular topic in natural language processing which uses deep neural networks (DNNs) to translate from a source to a target language. With emerging techniques such as bidirectional Gated Recurrent Units (GRUs), attention mechanisms, and beam-search algorithms, NMT can deliver improved translation quality compared to conventional statistics-based methods, especially for long sentences. However, higher translation quality means more complicated models, higher computation/memory demands, and longer translation time, which hinders practical use. In this paper, we propose a design methodology for implementing the inference of a real-life NMT model (with a problem size of 172 GFLOP) on an FPGA for improved run-time latency and energy efficiency. We use High-Level Synthesis (HLS) to build high-performance parameterized IPs for handling the most basic operations (multiply-accumulations) and compose these IPs to accelerate the matrix-vector multiplication (MVM) kernels that are used frequently in NMT. We also perform a design space exploration that considers both computation resources and memory access bandwidth when exploiting the hardware parallelism in the model, and generate the best parameter configurations of the proposed IPs. Accordingly, we propose a novel hybrid parallel structure for accelerating NMT with affordable resource overhead on the targeted FPGA. Our design is demonstrated on a Xilinx VCU118 with an overall performance of 7.16 GFLOPS.

Efficient FPGA implementation of local binary convolutional neural network

  • Aidyn Zhakatayev
  • Jongeun Lee

Binarized Neural Networks (BNNs) have shown the capability of performing various classification tasks while taking advantage of computational simplicity and memory savings. The problem with BNNs, however, is low accuracy on large convolutional neural networks (CNNs). The Local Binary Convolutional Neural Network (LBCNN) compensates for the accuracy loss of BNNs by using standard convolutional layers together with binary convolutional layers, and can achieve accuracy as high as the standard AlexNet CNN. For the first time, we propose an FPGA hardware architecture for LBCNN and address its unique challenges. We present a performance and resource-usage predictor along with a design space exploration framework. Our architecture on LBCNN AlexNet shows 76.6% higher performance in terms of GOPS, and 2.6X and 2.7X higher performance density in terms of GOPS/Slice and GOPS/DSP, respectively, compared to a previous FPGA implementation of the standard AlexNet CNN.

Hardware-software co-design of slimmed optical neural networks

  • Zheng Zhao
  • Derong Liu
  • Meng Li
  • Zhoufeng Ying
  • Lu Zhang
  • Biying Xu
  • Bei Yu
  • Ray T. Chen
  • David Z. Pan

The optical neural network (ONN) is a neuromorphic computing hardware based on optical components. Since its first on-chip experimental demonstration, it has attracted more and more research interest due to its advantage of ultra-high-speed inference with low power consumption. In this work, we design a novel slimmed architecture for realizing optical neural networks, considering both software and hardware implementations. Different from the originally proposed ONN architecture based on singular value decomposition, which results in two implementation-expensive unitary matrices, we show a more area-efficient architecture which uses a sparse tree network block, a single unitary block, and a diagonal block for each neural network layer. In the experiments, we demonstrate that by leveraging the training engine, we are able to reach accuracy comparable to that of the previous architecture, which provides the flexibility of using the slimmed implementation. The area cost in terms of Mach-Zehnder interferometers, the core optical components of ONNs, is 15%–38% lower for various sizes of optical neural networks.

SESSION: The resurgence of reconfigurable computing in the post-Moore era

Software defined architectures for data analytics

  • Vito Giovanni Castellana
  • Marco Minutoli
  • Antonino Tumeo
  • Marco Lattuada
  • Pietro Fezzardi
  • Fabrizio Ferrandi

Data analytics applications increasingly are complex workflows composed of phases with very different program behaviors (e.g., graph algorithms and machine learning, algorithms operating on sparse and dense data structures, etc.). To reach the levels of efficiency required to process these workflows in real time, upcoming architectures will need to leverage even more workload specialization. If, at one end, we may find even more heterogeneous processors composed of a myriad of specialized processing elements, at the other end we may see novel reconfigurable architectures, composed of sets of functional units and memories interconnected with (re)configurable on-chip networks, able to adapt dynamically to the workload characteristics. Field Programmable Gate Arrays are increasingly used for accelerating various workloads, in particular inference in machine learning, providing higher efficiency than other solutions. However, their fine-grained nature still leads to issues for the design software and still makes dynamic reconfiguration impractical. Future, more coarse-grained architectures could offer the features to execute diverse workloads at high efficiency while providing better reconfiguration mechanisms for dynamic adaptability. Nevertheless, we argue that the challenges for reconfigurable computing remain in the software. In this position paper, we describe a possible toolchain for reconfigurable architectures targeted at data analytics.

Runtime reconfigurable memory hierarchy in embedded scalable platforms

  • Davide Giri
  • Paolo Mantovani
  • Luca P. Carloni

In heterogeneous systems-on-chip, the optimal choice of the cache-coherence model for a loosely-coupled accelerator may vary at each invocation, depending on workload and system status. We propose a runtime adaptive algorithm to manage the coherence of accelerators. The algorithm’s choices are based on the combination of static and dynamic features of the active accelerators and their workloads. We evaluate the algorithm by leveraging our FPGA-based platform for rapid SoC prototyping. Experimental results, obtained through the deployment of a multi-core and multi-accelerator system that runs Linux SMP, show the benefits of our approach in terms of execution time and memory accesses.

XPPE: cross-platform performance estimation of hardware accelerators using machine learning

  • Hosein Mohammadi Makrani
  • Hossein Sayadi
  • Tinoosh Mohsenin
  • Setareh Rafatirad
  • Avesta Sasan
  • Houman Homayoun

The increasing heterogeneity of the applications to be processed means that ASICs no longer stand as the single most efficient processing platform. Hybrid platforms such as CPU+FPGA are emerging as powerful alternatives that support efficient processing for a diverse range of applications. Hardware/software co-design enables designers to take advantage of these new hybrid platforms such as Zynq. However, dividing an application into a part that runs on the CPU and a part that is converted into a hardware accelerator implemented on the FPGA makes platform selection difficult for developers, as there is significant variation in the application's performance across platforms. Developers are required to fully implement the design on each platform just to estimate its performance, a process that is tedious when the number of available platforms is large. To address this challenge, we propose XPPE, a neural-network-based cross-platform performance estimator. XPPE utilizes the resource utilization of an application on a specific FPGA to estimate its performance on other FPGAs. The estimation is performed for a wide range of applications and evaluated against a vast set of platforms. Moreover, XPPE enables developers to explore the design space without having to fully implement and map the application. Our evaluation results show that the correlation between the speedup estimated by XPPE and the actual speedup of applications on a hybrid platform over an ARM processor is more than 0.98.
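
In the spirit of XPPE's estimator, the sketch below trains a small scikit-learn MLP regressor to map resource-utilization feature vectors to speedup labels. The feature set, network size, and data are all placeholders, shown only to make the "learn speedup from utilization" idea concrete:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical training data: one row per (application, target FPGA) pair.
# Features might be the application's resource utilization on a reference
# FPGA (LUT/FF/BRAM/DSP fractions) plus target-device capacities; the
# label is the measured speedup over an ARM processor.
rng = np.random.default_rng(1)
X = rng.random((200, 8))                   # made-up feature vectors
y = 1.0 + 10.0 * X[:, :4].mean(axis=1)     # made-up speedup labels

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=1)
model.fit(X, y)

# Estimate speedup on an unseen configuration without implementing the design.
print(model.predict(rng.random((1, 8))))
```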

SESSION: Hardware acceleration

Addressing the issue of processing element under-utilization in general-purpose systolic deep learning accelerators

  • Bosheng Liu
  • Xiaoming Chen
  • Ying Wang
  • Yinhe Han
  • Jiajun Li
  • Haobo Xu
  • Xiaowei Li

As an energy-efficient hardware solution for deep neural network (DNN) inference, systolic accelerators are particularly popular in both embedded and datacenter computing scenarios. Despite their excellent performance and energy efficiency, however, systolic DNN accelerators naturally face a resource under-utilization problem: not all DNN models map well onto the fixed processing elements (PEs) of a systolic array implementation, because typical DNN models vary significantly from application to application. Consequently, state-of-the-art hardware solutions cannot be expected to deliver the nominal (peak) performance and energy efficiency as claimed, because of resource under-utilization. To deal with this dilemma, this study proposes a novel systolic DNN accelerator with a flexible computation mapping and dataflow scheme. By providing three types of parallelism (channel-direction mapping, planar mapping, and hybrid) and dynamically switching among them, our accelerator offers the adaptability to match various DNN models to the fixed hardware resources, and thus enables flexible exploitation of PE provision and data reuse for a wide range of DNN models to achieve optimal performance and energy efficiency.

ALook: adaptive lookup for GPGPU acceleration

  • Daniel Peroni
  • Mohsen Imani
  • Tajana Rosing

Associative memory in the form of a look-up table can decrease the energy consumption of GPGPU applications by exploiting data locality and reducing the number of redundant computations. State-of-the-art architectures utilize associative memory as static look-up tables. Static designs lack the ability to adapt to applications at runtime, limiting them to small segments of code with high redundancy. In this paper, we propose an adaptive look-up based approach, called ALook, which uses a dynamic update policy to maintain a set of recently used operations in associative memory. ALook updates the table with values computed by floating-point units (FPUs) at runtime to adapt to the workload, and matches stored results to avoid recomputing similar operations. ALook utilizes a novel FPU architecture which accelerates GPU computation by parallelizing the operation lookup process. We test the efficiency of ALook on image processing, general purpose, and machine learning applications by integrating it beside the FPUs in an AMD Southern Islands GPU. Our evaluation shows that ALook provides 3.6X EDP (Energy-Delay Product) improvement and 32.8% performance speedup, compared to an unmodified GPU, for applications accepting less than 5% output error. The proposed ALook architecture improves GPU performance by 2.0X compared to state-of-the-art computational reuse methods at the same level of output error.
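
A toy software analogue of the mechanism: quantize operands to form a lookup key, reuse a stored result on a hit, and insert on a miss with a simple eviction policy. The quantization step, capacity, and FIFO eviction below are invented stand-ins for ALook's hardware policy:

```python
class ALookTable:
    """Toy analogue of an adaptive operation look-up table: approximate
    hits return a stored result; misses compute exactly and insert."""

    def __init__(self, quant=0.01, capacity=1024):
        self.quant, self.capacity, self.table = quant, capacity, {}

    def _key(self, a, b):
        # Quantizing the operands makes "similar" operations collide.
        return (round(a / self.quant), round(b / self.quant))

    def mul(self, a, b):
        key = self._key(a, b)
        if key in self.table:          # approximate hit: reuse stored result
            return self.table[key]
        result = a * b                 # miss: compute on the FPU ...
        if len(self.table) >= self.capacity:
            self.table.pop(next(iter(self.table)))  # ... and adapt the table
        self.table[key] = result
        return result

lut = ALookTable()
print(lut.mul(1.234, 5.678), lut.mul(1.2341, 5.6779))  # second call hits
```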

Collaborative accelerators for in-memory MapReduce on scale-up machines

  • Abraham Addisie
  • Valeria Bertacco

Relying on efficient data analytics platforms is increasingly becoming crucial for both small and large scale datasets. While MapReduce implementations, such as Hadoop and Spark, were originally proposed for petascale processing in scale-out clusters, it has been noted that, today, most data-center processes operate on gigabyte-order or smaller datasets, which are best processed on single high-end scale-up machines. In this context, Phoenix++ is a highly optimized MapReduce framework available for chip-multiprocessor (CMP) scale-up machines. In this paper we observe that Phoenix++ suffers from inefficient utilization of the memory subsystem and serialized execution of the MapReduce stages. To overcome these inefficiencies, we propose CASM, an architecture that equips each core in a CMP design with a dedicated instance of a specialized hardware unit (a CASM accelerator). These units collaborate to manage the key-value data structure and minimize both on- and off-chip communication costs. Our experimental evaluation on a 64-core design indicates that CASM provides more than a 4X speedup over the highly optimized Phoenix++ framework, while keeping the area overhead at only 6% and reducing energy demands by over 3.5X.

SESSION: Routing

Detailed routing by sparse grid graph and minimum-area-captured path search

  • Gengjie Chen
  • Chak-Wa Pui
  • Haocheng Li
  • Jingsong Chen
  • Bentian Jiang
  • Evangeline F. Y. Young

Different from global routing, detailed routing must honor many detailed design rules and is performed on a significantly larger routing grid graph. In advanced technology nodes, it has become the most complicated and time-consuming stage. We propose Dr. CU, an efficient and effective detailed router, to tackle these challenges. To handle a 3D detailed routing grid graph of enormous size, a set of two-level sparse data structures is designed for runtime and memory efficiency. For handling the minimum-area constraint, an optimal correct-by-construction path search algorithm is proposed. Besides, an efficient bulk synchronous parallel scheme is adopted to further reduce the runtime. Compared with the first-place router of the ISPD 2018 Contest, our router improves the routing quality by up to 65% and on average 39%, according to the contest metric. At the same time, it achieves 80–93% memory reduction and 2.5–15X speedup.

Latency constraint guided buffer sizing and layer assignment for clock trees with useful skew

  • Necati Uysal
  • Wen-Hao Liu
  • Rickard Ewetz

Closing timing using clock tree optimization (CTO) is a tremendously challenging problem that may require designer intervention. CTO is performed by specifying and realizing delay adjustments in an initially constructed clock tree. Delay adjustments are typically realized by inserting delay buffers or detour wires. In this paper, we propose a latency-constraint-guided buffer sizing and layer assignment framework for clock trees with useful skew, called the BLU framework. The BLU framework realizes delay adjustments during CTO by performing buffer sizing and layer assignment. Given an initial clock tree, the BLU framework first predicts the final timing quality and specifies a set of delay adjustments, which are translated into latency constraints. Next, buffer sizing and layer assignment are performed with respect to the latency constraints using an extension of van Ginneken's algorithm. Moreover, the framework includes a feature that reduces power consumption by relaxing the latency constraints, and a method that improves timing performance by tightening them. The experimental results demonstrate that the proposed framework reduces the capacitive cost by 13% on average, while the total negative slack (TNS) and worst negative slack (WNS) are reduced by up to 58% and 20%, respectively.