- Liang Feng
- Jieru Zhao
- Tingyuan Liang
- Sharad Sinha
- Wei Zhang
To satisfy increasing computing demands, heterogeneous computing platforms are gaining attention, especially CPU-FPGA platforms. Recently, emerging tightly coupled CPU-FPGA platforms with shared coherent caches (such as the Intel HARP and IBM POWER with CAPI) have been proposed to facilitate data communication and simplify the programming model. In this work, we propose LAMA, a static analysis and dynamic control combined framework for memory access management in such platforms, to further enhance the memory access efficiency and maintain the data consistency. Based on implementation results on the real Intel HARP2 platform, LAMA is shown to improve the performance by 34% on average with low overhead.
- Hsuan Hsiao
- Jason Anderson
In high-level synthesis (HLS), software multithreading constructs can be used to explicitly specify coarse-grained parallelism for multiple accelerators. While software threads typically operate independently and in isolation from each other on CPUs, HLS threads/accelerators are sub-components of one circuit. Since these components generally reside in the same clock domain, we can schedule their execution statically to avoid shared-resource contention among threads. We propose thread weaving, a technique that statically interleaves requests from different threads through scheduling constraints. With the guarantee of a contention-free schedule, we eliminate replication/arbitration of shared resources, reducing the area footprint of the circuit and improving its maximum operating frequency (Fmax).
- Junnan Shan
- Mario R. Casu
- Jordi Cortadella
- Luciano Lavagno
- Mihai T. Lazarescu
FPGA-based accelerators have demonstrated high energy efficiency compared to GPUs and CPUs. However, single-FPGA designs may not achieve sufficient task parallelism. In this work, we optimize the mapping of high-performance multi-kernel applications, like Convolutional Neural Networks, to multi-FPGA platforms. First, we formulate the system-level optimization problem, choosing within a huge design space the parallelism and number of compute units for each kernel in the pipeline. Then we solve it using a combination of Geometric Programming, which produces the optimal-performance solution under resource and DRAM bandwidth constraints, and a heuristic allocator that places the compute units on the FPGA cluster.
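The system-level formulation above maps naturally onto geometric programming. The snippet below is only a minimal sketch of that idea, not the paper's actual model: it assumes a linear pipeline whose throughput is limited by its slowest stage, relaxes compute-unit counts to real values, and uses made-up work/area numbers; the heuristic allocation onto the FPGA cluster is not modeled.

```python
# Minimal geometric-programming sketch (illustrative, not the paper's model):
# pick the number of compute units per kernel to minimize the latency of the
# slowest pipeline stage under a total area budget.
import cvxpy as cp

work = [8.0, 32.0, 16.0]    # assumed work per kernel (arbitrary units)
area = [2.0, 5.0, 3.0]      # assumed area per compute unit of each kernel
area_budget = 40.0

u = cp.Variable(3, pos=True)   # compute units per kernel (integrality relaxed)
t = cp.Variable(pos=True)      # latency of the slowest stage (epigraph variable)

constraints = [work[k] / (u[k] * t) <= 1 for k in range(3)]       # each stage fits in t
constraints.append(cp.sum(cp.multiply(area, u)) <= area_budget)   # total area budget

prob = cp.Problem(cp.Minimize(t), constraints)
prob.solve(gp=True)   # solve in geometric-programming mode
print("compute units per kernel:", u.value, "stage latency:", t.value)
```

In a real flow the relaxed unit counts would still have to be rounded and placed on concrete FPGAs, which is the role of the paper's heuristic allocator.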
- Timothy Martin
- Dani Maarouf
- Ziad Abuowaimer
- Abeer Alhyari
- Gary Grewal
- Shawki Areibi
In this paper, we propose a novel, flat analytic timing-driven placer without explicit packing for Xilinx UltraScale FPGA devices. Our work uses novel methods to simultaneously optimize for timing, wirelength and congestion throughout the global and detailed placement stages. We evaluate the effectiveness of the flat placer on the ISPD 2016 benchmark suite for the xcvu095 UltraScale device, as well as on industrial benchmarks. Experimental results show that on average, FTPlace achieves an 8% increase in maximum clock rate, an 18% decrease in routed wirelength, and produces placements that require 80% less time to route when compared to Xilinx Vivado 2018.1.
- Weiwen Jiang
- Xinyi Zhang
- Edwin H.-M. Sha
- Lei Yang
- Qingfeng Zhuge
- Yiyu Shi
- Jingtong Hu
A fundamental question lies in almost every application of deep neural networks: what is the optimal neural architecture given a specific data set? Recently, several Neural Architecture Search (NAS) frameworks have been developed that use reinforcement learning and evolutionary algorithms to search for the solution. However, most of them take a long time to find the optimal architecture due to the huge search space and the lengthy training process needed to evaluate each candidate. In addition, most of them aim at accuracy only and do not take into consideration the hardware that will be used to implement the architecture. This can potentially lead to excessive latencies beyond specifications, rendering the resulting architectures useless. To address both issues, in this paper we use Field Programmable Gate Arrays (FPGAs) as a vehicle to present a novel hardware-aware NAS framework, namely FNAS, which provides an optimal neural architecture with latency guaranteed to meet the specification. In addition, with a performance abstraction model to analyze the latency of neural architectures without training, our framework can quickly prune architectures that do not satisfy the specification, leading to higher efficiency. Experimental results on common data sets such as ImageNet show that in cases where the state-of-the-art generates architectures with latencies 7.81× longer than the specification, those from FNAS can meet the specs with less than 1% accuracy loss. Moreover, FNAS also achieves up to 11.13× speedup for the search process. To the best of the authors' knowledge, this is the very first hardware-aware NAS.
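To illustrate the pruning idea, here is a minimal sketch of latency-guided candidate filtering, assuming a simple additive per-layer latency lookup table; the layer names, latency numbers, and `estimate_latency` model are hypothetical and much cruder than FNAS's performance abstraction.

```python
# Sketch of latency-aware candidate pruning in NAS (hypothetical model, not
# FNAS itself): estimate each candidate's latency from a per-layer lookup
# table and discard candidates that violate the specification before any
# training time is spent on them. Table values are made up.
LAYER_LATENCY_MS = {"conv3x3_64": 1.8, "conv5x5_64": 4.1, "maxpool": 0.2, "fc_256": 0.9}

def estimate_latency(architecture):
    """Latency abstraction: sum of per-layer estimates (assumed additive)."""
    return sum(LAYER_LATENCY_MS[layer] for layer in architecture)

def prune_candidates(candidates, latency_spec_ms):
    """Keep only architectures whose estimated latency meets the spec."""
    return [arch for arch in candidates if estimate_latency(arch) <= latency_spec_ms]

candidates = [
    ["conv3x3_64", "maxpool", "conv3x3_64", "fc_256"],   # est. 4.7 ms
    ["conv5x5_64", "conv5x5_64", "maxpool", "fc_256"],   # est. 9.3 ms
]
print(prune_candidates(candidates, latency_spec_ms=6.0))  # only the first survives
```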
- Muhammad Abdullah Hanif
- Faiq Khalid
- Muhammad Shafique
Approximate Computing (AC) has emerged as a means for improving the performance, area and power-/energy-efficiency of a digital design at the cost of output quality degradation. Applications like machine learning (e.g., using DNNs-deep neural networks) are highly computationally intensive and, therefore, can significantly benefit from AC and specialized accelerators. However, the accuracy loss introduced because of approximations in the DNN accelerator hardware can result in undesirable results. This paper presents a novel method to design high-performance DNN accelerators where approximation error(s) from one stage/part of the design is “completely” compensated in the subsequent stage/part while offering significant efficiency gains. Towards this, the paper also presents a case-study for improving the performance of systolic array-based hardware architectures, which are commonly used for accelerating state-of-the-art deep learning algorithms.
- Sugil Lee
- Hyeonuk Sim
- Jooyeon Choi
- Jongeun Lee
Despite the multifaceted benefits of stochastic computing (SC) such as low cost, low power, and flexible precision, SC-based deep neural networks (DNNs) still suffer from the long-latency problem, especially for those with high precision requirements. While log quantization can be of help, it has its own accuracy-saturation problem due to uneven precision distribution. In this paper we propose successive log quantization (SLQ), which extends log quantization with significant improvements in precision and accuracy, and apply it to state-of-the-art SC-DNNs. SLQ reuses the existing datapath of log quantization, and thus retains its advantages such as simple multiplier hardware. Our experimental results demonstrate that SLQ can significantly improve both the accuracy and efficiency of SC-DNNs over state-of-the-art solutions, including linear-quantized and log-quantized SC-DNNs, achieving less than 1~1.5%p accuracy drop for AlexNet, SqueezeNet, and VGG-S at a mere 4~5-bit weight resolution.
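One plausible reading of the scheme (a sketch, not necessarily the paper's exact definition) is that each weight is represented by a short sum of signed powers of two, obtained by repeatedly log-quantizing the residual; the exponent range and number of terms below are illustrative.

```python
import numpy as np

def log_quantize(w, min_exp=-8, max_exp=0):
    """Plain log quantization: snap |w| to the nearest power of two."""
    if w == 0.0:
        return 0.0
    exp = int(np.clip(np.round(np.log2(abs(w))), min_exp, max_exp))
    return np.sign(w) * 2.0 ** exp

def successive_log_quantize(w, terms=2):
    """Sketch of successive log quantization: represent w as a sum of a few
    signed powers of two by repeatedly log-quantizing the residual."""
    approx, residual = 0.0, w
    for _ in range(terms):
        q = log_quantize(residual)
        approx += q
        residual -= q
    return approx

w = 0.3
print(log_quantize(w), successive_log_quantize(w))  # 0.25 vs 0.25 + 0.0625 = 0.3125
```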
- Daniel Peroni
- Mohsen Imani
- Hamid Nejatollahi
- Nikil Dutt
- Tajana Rosing
Many data-driven applications, including computer vision, speech recognition, and medical diagnostics, show tolerance to error during computation. These applications are often accelerated on GPUs, but high computational costs limit performance and increase energy usage. In this paper, we present ARGA, an approximate computing technique capable of accelerating GPGPU applications. ARGA provides an approximate lookup table to GPGPU cores to avoid recomputing instructions with identical or similar values. We propose multi-table parallel lookup, which enables computational reuse to significantly speed up GPGPU computation by checking incoming instructions in parallel. The inputs of each operation are searched for in a lookup table. Matches resulting in an exact or low error are removed from the floating-point pipeline and used directly as output. Matches producing highly inaccurate results are computed on exact hardware to minimize application error. We simulate our design by placing ARGA within each core of an Nvidia Kepler architecture Titan and an AMD Southern Islands 7970. We show our design improves performance throughput by up to 2.7× and improves EDP by 5.3× for 6 GPGPU applications while maintaining less than 5% output error. We also show ARGA accelerates inference of a LeNet NN by 2.1× and improves EDP by 3.7× without significantly impacting classification accuracy.
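A minimal software sketch of the reuse idea follows, assuming a single table keyed by quantized operands; the multi-table parallel lookup, error classification, and hardware placement are not modeled, and the quantization step is an arbitrary choice.

```python
# Sketch of approximate computational reuse in software (the multi-table
# parallel lookup and hardware details of ARGA are not modeled).
def make_approx_table(op, step=0.1):
    table = {}

    def lookup(a, b):
        key = (round(a / step), round(b / step))   # quantize the operands
        if key in table:
            return table[key], True                # hit: reuse, skip exact FP pipeline
        result = op(a, b)                          # miss: compute exactly and cache
        table[key] = result
        return result, False

    return lookup

approx_mul = make_approx_table(lambda a, b: a * b)
print(approx_mul(2.04, 3.98))   # miss -> exact result is computed and stored
print(approx_mul(1.97, 4.01))   # nearby operands hit the same entry -> reused
```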
- Hamid Tabani
- Leonidas Kosmidis
- Jaume Abella
- Francisco J. Cazorla
- Guillem Bernat
The complexity and size of Autonomous Driving (AD) software are comparably higher than those of software implementing other (standard) functionalities in the car. To make things worse, a big fraction of AD software is not specifically designed for the automotive (or any other critical) domain, but for the mainstream market. This brings uncertainty about the extent to which AD software adheres to the guidelines in safety standards. In this paper, we present our experience in applying ISO 26262 (the applicable functional safety standard for road vehicles) software safety guidelines to industrial AD software, in particular, Apollo, a heterogeneous Autonomous Driving framework used extensively in industry. We provide quantitative and qualitative metrics of compliance for many ISO 26262 recommendations on software design, implementation, and testing.
- Debayan Roy
- Wanli Chang
- Sanjoy K. Mitter
- Samarjit Chakraborty
In modern autonomous systems, there is typically a large number of connected components realizing complex functionalities. For example, in autonomous vehicles (AVs), there are tens of millions of lines of code implemented on hundreds of sensors, controllers, and actuators. AVs have been deployed, mostly in trials and restricted environments, showing that substantial progress has been made in functionality development. However, they are still faced with two major challenges: (i) performance guarantee of safety-critical functions under all possible scenarios; (ii) functionality implementation with limited resources. These two challenges are conflicting because safety guarantees necessitate a worst-case analysis that is often very pessimistic for complex hardware/software systems, and thus require more resources. To address this, we study an abstraction of a heterogeneous cyber-physical system architecture consisting of a mix of high- and low-quality resources, such as time- and event-triggered resources, or wired and wireless resources. We show that by properly managing such a mix of resources and formulating a formal verification (model checking) problem, it is possible to tightly dimension the high-quality resource to the minimum (50% in certain cases) while providing control performance guarantees.
- Chao Peng
- Yecheng Zhao
- Haibo Zeng
Today's automotive engine control systems adopt several control strategies that come with tradeoffs between computational load and performance. The current practice is that the switching speeds at which the engine control system changes its control strategy are fixed offline, typically based on the average driving need in a standard driving cycle (i.e., vehicle speed profile over time). This is clearly suboptimal since it fails to capture the variation in the driving cycle, and the actual driving cycle may be considerably different from the standard one. In this paper, we propose to dynamically adjust the switching speeds based on the predicted driving cycle. We develop a hybrid set of schedulability analysis techniques to tame the complexity of ensuring the real-time schedulability of engine control tasks. We design an effective and efficient optimization algorithm that provides close-to-optimal solutions. Experimental results demonstrate that our approach efficiently finds dynamic switching speeds that significantly improve engine performance over static ones.
- He Zhou
- Sunil P. Khatri
- Jiang Hu
- Frank Liu
Although Markov Decision Process (MDP) has wide applications in autonomous systems as a core model in Reinforcement Learning, a key bottleneck is the large memory utilization of the state transition probability matrices. This is particularly problematic for computational platforms with limited memory, or for Bayesian MDP, which requires dozens of such matrices. To mitigate this difficulty, we propose a highly memory-efficient representation for probability matrices using Binary Decision Diagram (BDD) based sampling, and develop a corresponding (Bayesian/classical) MDP solver on a CPU-GPU platform. Simulation results indicate our approach reduces memory by one and two orders of magnitude for Bayesian/classical MDP, respectively.
- Trong Huynh-Bao
- Anabela Veloso
- Sushil Sakhare
- Philippe Matagne
- Julien Ryckaert
- Manu Perumkunnil
- Davide Crotti
- Farrukh Yasin
- Alessio Spessot
- Arnaud Furnemont
- Gouri Kar
- Anda Mocuta
We present for the first time a co-integrated FinFET and vertical nanosheet transistor (VFET) process on a 300 mm silicon wafer for STT-MRAM applications and its related avenues, using a holistic design-technology co-optimization (DTCO) and power-performance-area-cost (PPAC) approach. The STT-MRAM bitcell and a 2 Mbit macro have been optimized and designed to address the viability of the co-integration process and the advantages of vertical-channel transistors as STT-MRAM selectors. The architectural system simulator gem5 has also been employed with Polybench workloads to assess energy savings at the system level. Enabling this co-integration requires four extra masks, which adds less than 10% cost in embedded chips. A 36% area reduction can be achieved for the STT-MRAM bitcell implemented with VFET selectors. With a UVLT flavor, the STT-MRAM bitcell comprising a 3-nanosheet selector could deliver the same performance as the 4-fin LVT FinFET selector. A 2 Mbit STT-MRAM macro designed with VFET selectors can offer a 17% and a 21% reduction in read access latency and energy per operation, respectively, and a 10% reduction in write energy per operation. A 7% energy saving for the STT-MRAM L2 cache using VFET selectors has been observed at the system level with Polybench workloads.
- Nam Sung Kim
- Choungki Song
- Woo Young Cho
- Jian Huang
- Myoungsoo Jung
PCM is a promising non-volatile memory technology, as it can offer a unique trade-off between density and latency compared with DRAM and flash memory. Although PCM is much faster than flash memory, it is still notably slower than DRAM, which can significantly degrade system performance. In this paper, we analyze a PCM implementation in depth and identify the primary cause of PCM's long latency, i.e., a long interconnect (high resistance/capacitance) path between a cell and a sense-amp/write-driver. This in turn requires (1) a very large charge pump consuming ~20% of PCM chip space, ~50% of the latency of write operations, and ~2× more power than a write operation itself; and (2) a large current sense-amp that takes a long time to pre-charge the interconnect path. We then propose the Low-Latency PCM (LL-PCM) architecture. Our analysis shows that LL-PCM can give 119% higher performance and consume 43% lower memory energy than PCM for memory-intensive applications. LL-PCM is only ~1% larger than PCM, as the cost of reducing the resistance/capacitance of the interconnect path is negated by its 4.1× smaller charge pump.
- Janki Bhimani
- Tirthak Patel
- Ningfang Mi
- Devesh Tiwari
Vibration generated in modern computing environments such as autonomous vehicles, edge computing infrastructure, and data center systems is an increasing concern. In this paper, we systematically measure, quantify and characterize the impact of vibration on the performance of SSD devices. Our experiments and analysis uncover that exposure to both short-term and long-term vibration, even within the vendor-specified limits, can significantly affect SSD I/O performance and reliability.
- Leilai Shao
- Sicheng Li
- Ting Lei
- Tsung-Ching Huang
- Raymond Beausoleil
- Zhenan Bao
- Kwang-Ting Cheng
Skin-inspired electronics emerges as a new paradigm due to the increasing demands for conformable and high-quality skin-sensor-silicon (SSS) interfacing in wearable, electronic skin and health monitoring applications. Advances in ultra-thin, flexible, stretchable and conformable materials have made skin electronics feasible. In this paper, we prototyped an active electrode (with a thickness ≤ 2 um), which integrates the electrode with a thin-film transistor (TFT) based amplifier, to effectively suppress motion artifacts. The fabricated ultra-thin amplifier can achieve a gain of 32 dB at 20 kHz, demonstrating the feasibility of the proposed active electrode. Using atrial fibrillation (AF) detection for electrocardiogram (ECG) as an application driver, we further develop a simulation framework taking into account all elements including the skin, the sensor, the amplifier and the silicon chip. Systematic and quantitative simulation results indicate that the proposed active electrode can effectively improve the signal quality under motion noises (achieving ≥30 dB improvement in signal-to-noise ratio (SNR)), which boosts classification accuracy by more than 19% for AF detection.
- Hanbin Hu
- Peng Li
- Jianhua Z. Huang
With increasing design complexity and stringent robustness requirements in applications such as automotive electronics, analog and mixed-signal (AMS) verification becomes a key bottleneck. Rare failure detection in a high-dimensional parameter space using minimal expensive simulation data is a major challenge. We address this challenge under a Bayesian learning framework using Bayesian optimization (BO). We formulate the failure detection as a BO problem where a chosen acquisition function is optimized to select the next (set of) optimal simulation sampling point(s) such that rare failures may be detected using a small amount of data. While providing an attractive black-box solution to design verification, in practice BO is limited in its ability to deal with high-dimensional problems. We propose to use random embedding to effectively reduce the dimensionality of a given verification problem to improve both the quality of BO-based optimal sampling and computational efficiency. We demonstrate the success of the proposed approach on detecting rare design failures under high-dimensional process variations which are completely missed by competitive smart-sampling and BO techniques without dimension reduction.
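A rough sketch of BO with a random embedding, in the spirit of the approach: optimization runs in a low-dimensional space z and a fixed random matrix maps z into the high-dimensional process-parameter space. The GP surrogate, expected-improvement acquisition, dimensions, and toy "simulator" below are illustrative stand-ins, not the paper's setup.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
D, d, n_init, n_iter = 200, 4, 10, 30      # high dim, embedded dim (illustrative)
A = rng.standard_normal((D, d))            # fixed random embedding: x = A z

def simulate(x):
    """Stand-in for an expensive AMS simulation returning a failure metric."""
    return -np.sum((x[:5] - 1.5) ** 2)      # toy: only a few dimensions matter

def embed(z):
    return np.clip(A @ z, -3.0, 3.0)        # keep parameters in a plausible range

Z = rng.uniform(-1, 1, size=(n_init, d))
y = np.array([simulate(embed(z)) for z in Z])

gp = GaussianProcessRegressor(normalize_y=True)
for _ in range(n_iter):
    gp.fit(Z, y)
    cand = rng.uniform(-1, 1, size=(512, d))        # candidate low-dim points
    mu, sigma = gp.predict(cand, return_std=True)
    imp = mu - y.max()
    ei = imp * norm.cdf(imp / (sigma + 1e-9)) + sigma * norm.pdf(imp / (sigma + 1e-9))
    z_next = cand[np.argmax(ei)]                    # expected-improvement pick
    Z = np.vstack([Z, z_next])
    y = np.append(y, simulate(embed(z_next)))

print("best failure metric found:", y.max())
```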
- Yuzhe Ma
- Haoxing Ren
- Brucek Khailany
- Harbinder Sikka
- Lijuan Luo
- Karthikeyan Natarajan
- Bei Yu
Applications of deep learning to electronic design automation (EDA) have recently begun to emerge, although they have mainly been limited to processing of regular structured data such as images. However, many EDA problems require processing irregular structures, and it can be non-trivial to manually extract important features in such cases. In this paper, a high-performance graph convolutional network (GCN) model is proposed for the purpose of processing irregular graph representations of logic circuits. A GCN classifier is first trained to predict observation point candidates in a netlist. The GCN classifier is then used as part of an iterative process to propose observation point insertion based on the classification results. Experimental results show the proposed GCN model has superior accuracy to classical machine learning models in predicting difficult-to-observe nodes. Compared with commercial testability analysis tools, the proposed observation point insertion flow achieves similar fault coverage with an 11% reduction in observation points and a 6% reduction in test pattern count.
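For reference, a single generic graph-convolution layer over a netlist graph looks like the sketch below (standard GCN propagation with self-loops and degree normalization); the paper's actual model architecture, node features, and training setup are not reproduced here.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One generic graph-convolution step: aggregate neighbor features with a
    degree-normalized adjacency (self-loops added), then apply a linear map
    and ReLU. This is the standard GCN form, not necessarily the paper's."""
    a_hat = adjacency + np.eye(adjacency.shape[0])       # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm_adj = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(norm_adj @ features @ weights, 0.0)

# toy netlist graph: 4 gates, 3 features per gate (e.g., fan-in, fan-out, level)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [0, 1, 0, 0]], dtype=float)
feats = np.random.rand(4, 3)
w = np.random.rand(3, 8)
print(gcn_layer(adj, feats, w).shape)  # (4, 8) node embeddings
```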
- Jung Min You
- Joon-Sung Yang
With the increasing integration density of semiconductor designs, many reliability problems have emerged; row-hammering is one of them. The row-hammering effect is a critical issue for reliable memory operation because it can cause unexpected errors, so it must be addressed. There are mainly two kinds of methods to deal with the row-hammering problem: counter-based methods and probabilistic methods. This paper proposes an improved version of the latter and compares it with other probabilistic methods, PARA and PRoHIT. According to the evaluation results, the proposed method increases the row-hammering reduction per refresh by 1.82× and 7.78× on average over PARA and PRoHIT, respectively.
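For context, the probabilistic baseline PARA can be sketched in a few lines: on every row activation, an adjacent row is refreshed with a small probability. The probability value and interface below are illustrative; the paper's improved method modifies this basic policy.

```python
import random

def on_row_activate(row, refresh_row, p=0.001):
    """PARA-style probabilistic mitigation (the baseline the paper compares
    against): on every activation, with probability p refresh one of the two
    physically adjacent rows, chosen at random. p is illustrative."""
    if random.random() < p:
        victim = row + random.choice((-1, 1))
        refresh_row(victim)

# usage: the memory controller calls on_row_activate(row, dram.refresh_row)
# for every ACT command it issues; heavily hammered rows then have their
# neighbors refreshed often enough to avoid disturbance errors.
```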
- Xiaoyi Sun
- Krishnendu Chakrabarty
- Ruirui Huang
- Yiquan Chen
- Bing Zhao
- Hai Cao
- Yinhe Han
- Xiaoyao Liang
- Li Jiang
Disk and memory faults are the leading causes of server breakdown. A proactive solution is to predict such hardware failures at runtime and then isolate the hardware at risk and back up the data. However, current model-based predictors are incapable of using discrete time-series data, such as the values of device attributes, which convey high-level information about the device behavior. In this paper, we propose a novel deep-learning-based prediction scheme for system-level hardware failure prediction. We normalize the distribution of samples' attributes from different vendors to make use of diverse training sets. We propose a temporal Convolutional Neural Network-based model that is insensitive to noise in the time dimension. Finally, we design a loss function to train the model effectively with extremely imbalanced samples. Experimental results from an open S.M.A.R.T. data set and an industrial data set show the effectiveness of the proposed scheme.
- Onur Mutlu
- Saugata Ghose
- Juan Gómez-Luna
- Rachata Ausavarungnirun
Modern computing systems suffer from the dichotomy between computation on one side, which is performed only in the processor (and accelerators), and data storage/movement on the other, which all other parts of the system are dedicated to. Due to this dichotomy, data moves a lot in order for the system to perform computation on it. Unfortunately, data movement is extremely expensive in terms of energy and latency, much more so than computation. As a result, a large fraction of system energy is spent and performance is lost solely on moving data in a modern computing system.
In this work, we re-examine the idea of reducing data movement by performing Processing in Memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked logic and DRAM, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the idea of PIM is not new, we examine two new approaches to enabling PIM: 1) exploiting analog properties of DRAM to perform massively-parallel operations in memory, and 2) exploiting 3D-stacked memory technology design to provide high bandwidth to in-memory logic. We conclude by discussing work on solving key challenges to the practical adoption of PIM.
The capacity of memory and storage devices is expected to increase drastically with the adoption of forthcoming memory and integration technologies. This is a welcome improvement, especially for datacenter servers running modern data-intensive applications. Nonetheless, for such servers to fully benefit from the increasing capacity, the bandwidth of the interconnects between processors and these devices must also increase proportionally, which becomes ever costlier under unabating physical constraints. As a promising alternative to tackle this challenge cost-effectively, a heterogeneous computing paradigm referred to as near-data processing (NDP) has emerged. However, NDP has not yet been widely adopted by the industry because of significant gaps between existing software stacks and those demanded by NDP-capable memory and storage devices. Aiming to overcome these gaps, we propose to turn memory and storage devices into familiar heterogeneous distributed computing systems. Then, we demonstrate the potential of such computing systems for existing data-intensive applications with two recently implemented NDP-capable devices. Finally, we conclude with a practical blueprint for exploiting NDP-based computing systems to speed up solving future computer-aided design and optimization problems.
- Ning Lin
- Hang Lu
- Xin Wei
- Xiaowei Li
Deep convolutional neural networks are well known for their extensive parameters and computation intensity. Structured pruning is an effective solution to obtain a more compact model for efficient inference on GPGPUs, without designing specific hardware accelerators. However, previous works resort to certain metrics in channel/filter pruning and count on labor-intensive fine-tuning to recover the accuracy loss. The "inception" of the pruned model, as another form factor, has an indispensable impact on the final accuracy, but its importance is often ignored in these works. In this paper, we show that an optimal inception is more likely to yield satisfactory performance and shortened fine-tuning iterations. We also propose a reinforcement learning based solution, termed HeadStart, which seeks to learn the best way of pruning aiming at the optimal inception. With the help of the specialized head-start network, it can automatically balance the tradeoff between the final accuracy and the preset speedup rather than tilting to one of them, which also differentiates it from existing works. Experimental results show that HeadStart can attain up to 2.25x inference speedup with only 1.16% accuracy loss tested with large-scale images on various GPGPUs, and generalizes well to various cutting-edge DCNN models.
- Seokwon Kang
- Yongseung Yu
- Jiho Kim
- Yongjun Park
Although approximate computing is widely used, it requires substantial programming effort to find appropriate approximation patterns among multiple pre-defined patterns to achieve high performance. Therefore, we propose an automatic approximation framework called GATE to uncover hidden opportunities in any data-parallel program, regardless of the code pattern or application characteristics, using two compiler techniques, namely subgraph-level approximation (SGLA) and approximate thread merge (ATM). GATE also features conservative/aggressive tuning and dynamic calibration to maximize performance while maintaining the target output quality (TOQ) level during runtime. Our framework achieves an average performance gain of 2.54x over the baseline with minimal accuracy loss.
- Yu-Chuan Chang
- Wei-Ming Chen
- Pi-Cheng Hsiu
- Yen-Yu Lin
- Tei-Wei Kuo
Perceptual similarity measurement allows mobile applications to eliminate unnecessary computations without compromising visual experience. Existing pixel-wise measures incur significant overhead with increasing display resolutions and frame rates. This paper presents an ultra lightweight similarity measure called LSIM, which assesses the similarity between frames based on the transformation matrices of graphics objects. To evaluate its efficacy, we integrate LSIM into the Open Graphics Library and conduct experiments on an Android smartphone with various mobile 3D games. The results show that LSIM is highly correlated with the most widely used pixel-wise measure SSIM, yet three to five orders of magnitude faster. We also apply LSIM to a CPU-GPU governor to suppress the rendering of similar frames, thereby further reducing computation energy consumption by up to 27.3% while maintaining satisfactory visual quality.
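A hedged sketch of the core idea: declare two frames similar when no object's transformation matrix has moved beyond a small threshold. The data layout, distance metric, and threshold below are assumptions for illustration, not LSIM's exact formulation.

```python
import numpy as np

def lsim_like_similarity(prev_transforms, cur_transforms, threshold=1e-3):
    """Sketch of a transformation-matrix-based frame similarity test (inspired
    by, but not identical to, LSIM): two frames are considered similar when
    every graphics object's 4x4 transform moved less than `threshold` in
    Frobenius norm. Object set and threshold are illustrative assumptions."""
    if prev_transforms.keys() != cur_transforms.keys():
        return False                      # objects appeared or disappeared
    for obj_id, prev_m in prev_transforms.items():
        if np.linalg.norm(cur_transforms[obj_id] - prev_m) >= threshold:
            return False
    return True

# usage inside a rendering loop: skip rendering (and reuse the last frame)
# whenever lsim_like_similarity(last_frame_transforms, this_frame_transforms)
# returns True, saving GPU work on visually indistinguishable frames.
```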
- Sivert T. Sliper
- Domenico Balsamo
- Nikos Nikoleris
- William Wang
- Alex S. Weddell
- Geoff V. Merrett
Reactive transient computing systems preserve computational progress despite frequent power failures by suspending (saving state to nonvolatile memory) when detecting a power failure, and restoring once power returns. Existing methods inefficiently save and restore all allocated memory. We propose lightweight memory management that applies the concept of paging to load pages only when needed, and save only modified pages. We then develop a model that maximises available execution time by dynamically adjusting the suspend and restore voltage thresholds. Experiments on an MSP430FR5994 microcontroller show that our method reduces state retention overheads by up to 86.9% and executes algorithms up to 5.3× faster than the state-of-the-art.
- Gagandeep Singh
- Juan Gómez-Luna
- Giovanni Mariani
- Geraldo F. Oliveira
- Stefano Corda
- Sander Stuijk
- Onur Mutlu
- Henk Corporaal
The cost of moving data between the memory/storage units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. A promising paradigm to alleviate this data movement bottleneck is near-memory computing (NMC), which consists of placing compute units close to the memory/storage units. There is substantial research effort that proposes NMC architectures and identifies workloads that can benefit from NMC. System architects typically use simulation techniques to evaluate the performance and energy consumption of their designs. However, simulation is extremely slow, imposing long times for design space exploration. In order to enable fast early-stage design space exploration of NMC architectures, we need high-level performance and energy models.
We present NAPEL, a high-level performance and energy estimation framework for NMC architectures. NAPEL leverages ensemble learning to develop a model that is based on microarchitectural parameters and application characteristics. NAPEL training uses a statistical technique, called design of experiments, to collect representative training data efficiently. NAPEL provides early design space exploration 220× faster than a state-of-the-art NMC simulator, on average, with error rates of 8.5% and 11.6% for performance and energy estimations, respectively, compared to the NMC simulator. NAPEL is also capable of making accurate predictions for previously unseen applications.
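The modeling flow can be sketched as follows: train an ensemble regressor on architecture/application features to predict runtime and energy, then use it in place of slow simulation. Everything below (feature set, synthetic targets, random forest as the ensemble learner) is an illustrative stand-in; NAPEL additionally uses design-of-experiments sampling rather than the plain random sampling shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# synthetic stand-in: each row = (NMC cores, frequency GHz, DRAM banks,
# memory intensity, ILP); targets = (runtime, energy) that a slow simulator
# would normally provide. Both features and targets here are made up.
X = rng.uniform([1, 0.5, 4, 0.0, 0.5], [16, 2.0, 32, 1.0, 4.0], size=(200, 5))
runtime = 1e3 / (X[:, 0] * X[:, 1]) * (1 + X[:, 3]) + rng.normal(0, 5, 200)
energy = 0.8 * runtime * X[:, 1] + rng.normal(0, 5, 200)
y = np.column_stack([runtime, energy])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
mape = np.mean(np.abs(pred - y_te) / y_te, axis=0) * 100
print("runtime / energy error (%):", mape)   # fast surrogate for the simulator
```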
- Andrew McCrabb
- Eric Winsor
- Valeria Bertacco
Graph-based algorithms have gained significant interest in several application domains. Solutions addressing the computational efficiency of such algorithms have mostly relied on many-core architectures. Cleverly laying out input graphs in storage, by placing adjacent vertices in a same storage unit (memory bank or cache unit), enables fast access during graph traversal. Dynamic graphs, however, must be continuously repartitioned to leverage this benefit. Yet software repartitioning solutions rely on costly, cross-vault communication to query and optimize the graph layout between algorithm iterations.
In this work, we propose DREDGE, a novel hardware solution to provide heuristic repartitioning optimizations in the background without extra communication. Our evaluation indicates that we achieve a 1.9x speedup, on average, over several graph algorithms and datasets, executing on a 24×24-core architecture, when compared against a baseline solution that does not repartition the dynamic graph. We estimated that DREDGE incurs only 1.5% area and 2.1% power overheads over an ARM A5 processor core.
- Xin Xin
- Youtao Zhang
- Jun Yang
DRAM-based memory-centric computing architectures are promising solutions to tackle the challenges of the memory wall. In this paper, we develop a novel DRAM-based processing-in-memory (PIM) architecture which requires fewer cycles for every basic operation than prior art. Our small yet fast in-memory computing units support basic logic operations including NOT, AND, and OR. Using those operations, along with shift and propagation, bitwise operations can be extended to word-wise operations, e.g., increment and comparison, with high efficiency. We also optimize the designs to exploit parallelism and data reuse to further improve the performance of compound operations. Compared with the most powerful state-of-the-art PIM architecture, we can achieve comparable or even better performance while consuming only 6% of its area overhead.
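To see how word-wise operations fall out of row-wide AND/OR/NOT plus carry propagation, here is a software stand-in for an in-memory increment; the boolean arrays model DRAM rows holding one bit position of many words, and the circuit-level details of the paper are not captured.

```python
import numpy as np

def bitwise_not(a):    return ~a
def bitwise_and(a, b): return a & b
def bitwise_or(a, b):  return a | b

def increment(word_bits):
    """Increment many words in parallel using only AND/OR/NOT plus carry
    propagation, as a software stand-in for in-DRAM bulk bitwise operations.
    word_bits[i] is a boolean 'row' holding bit i (LSB first) of every word."""
    carry = np.ones_like(word_bits[0])            # adding 1 to every word
    result = []
    for bit in word_bits:                         # ripple carry, LSB to MSB
        # XOR built from AND/OR/NOT: a ^ b = (a | b) & ~(a & b)
        s = bitwise_and(bitwise_or(bit, carry), bitwise_not(bitwise_and(bit, carry)))
        carry = bitwise_and(bit, carry)
        result.append(s)
    return result

words = [np.array([True, False]), np.array([True, True])]  # values 3 and 2 (LSB first)
# word0 (3) wraps to 0 in two bits, word1 (2) becomes 3
print([row.astype(int).tolist() for row in increment(words)])
```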
- Chih-Cheng Chang
- Ming-Hung Wu
- Jia-Wei Lin
- Chun-Hsien Li
- Vivek Parmar
- Heng-Yuan Lee
- Jeng-Hua Wei
- Shyh-Shyuan Sheu
- Manan Suri
- Tian-Sheuan Chang
- Tuo-Hung Hou
Binary STT-MRAM is a highly anticipated embedded nonvolatile memory technology in advanced logic nodes < 28 nm. How to enable its in-memory computing (IMC) capability is critical for enhancing AI Edge. Based on the soon-available STT-MRAM, we report the first binary deep convolutional neural network (NV-BNN) capable of both local and remote learning. Exploiting intrinsic cumulative switching probability, accurate online training of CIFAR-10 color images (~ 90%) is realized using a relaxed endurance spec (switching ≤ 20 times) and hybrid digital/IMC design. For offline training, the accuracy loss due to imprecise weight placement can be mitigated using a rapid non-iterative training-with-noise and fine-tuning scheme.
- Fan Yang
- Youyou Lu
- Youmin Chen
- Haiyu Mao
- Jiwu Shu
Data encryption and authentication are essential for secure NVM. However, the introduced security metadata needs to be atomically written back to NVM along with the data to provide crash consistency, which unfortunately incurs high overhead. To support fine-grained data protection without compromising performance, we propose cc-NVM. It first introduces an epoch-based mechanism that aggressively caches the security metadata in the CPU cache while retaining their consistency in NVM. Deferred spreading is also introduced to reduce the computation overhead of data authentication. Leveraging the hidden ability of data HMACs, we can always recover consistent but old security metadata to its newest version. Compared to Osiris, a state-of-the-art secure NVM, cc-NVM improves performance by 20.4% on average. When the system crashes, instead of dropping all the data due to malicious attacks, cc-NVM is able to detect and locate the exact tampered data while incurring only 29.6% extra write traffic on average.
- Jinsoo Jang
- Brent Byunghoon Kang
Memory disclosure vulnerabilities have been exploited to leak application secret data such as crypto keys (e.g., the Heartbleed Bug). To ameliorate this problem, we propose an in-process memory isolation mechanism by leveraging a common hardware feature, namely, hardware debugging. Specifically, we utilize a watchpoint to monitor a particular memory region containing secret data. We implemented a PoC of our approach based on the 64-bit ARM architecture, including the kernel patches and user APIs that help developers benefit from isolated memory use. We applied the approach to open-source applications such as OpenSSL and AESCrypt. The results of a performance evaluation show that our approach incurs a small amount of overhead.
- Liang Liu
- Rujia Wang
- Youtao Zhang
- Jun Yang
Oblivious RAM (ORAM) is an effective security primitive to prevent access pattern leakage. By adding redundant memory accesses, ORAM prevents attackers from revealing the patterns in the access sequences. However, ORAM tends to introduce a huge degradation on the performance. With growing address space to be protected, ORAM has to store the majority of data in the lower level storage, which further degrades the system performance.
In this paper, we propose Hybrid ORAM (H-ORAM), a novel ORAM primitive to address the large performance degradation incurred when overflowing user data to storage. H-ORAM consists of a batch scheduling scheme for enhancing memory bandwidth usage, and a novel ORAM interface that returns data without waiting for the I/O access each time. We evaluate H-ORAM on a real machine implementation. The experimental results show that H-ORAM outperforms the state-of-the-art Path ORAM by 20×.
- Jisung Park
- Youngdon Jung
- Jonghoon Won
- Minji Kang
- Sungjin Lee
- Jihong Kim
We present a low-overhead ransomware-proof SSD, called RansomBlocker (RBlocker). RBlocker provides full protection against all possible ransomware attacks by delaying every data deletion until it is guaranteed that no attack has occurred. To reduce the storage overhead of the delayed deletion, RBlocker employs a time-out-based backup policy. Based on the fact that ransomware must store an encrypted version of the target files, early deletion of obsolete data is allowed if no encrypted write was detected for a short interval. Otherwise, RBlocker keeps the data for an interval long enough to guarantee the no-attack condition. For accurate in-line detection of encrypted writes, we leverage entropy- and CNN-based detectors in an integrated fashion. Our experimental results show that RBlocker can defend against all types of ransomware attacks with negligible overhead.
- Zimeng Zhou
- Chenchen Fu
- Chun Jason Xue
- Song Han
This paper explores how to optimize the freshness of real-time data in energy-harvesting-based networked embedded systems. We introduce the concept of Age of Information (AoI) to quantitatively measure data freshness and present a comprehensive analysis of the average AoI of real-time data with stochastic update arrival and energy replenishment rates. Both an optimal offline solution and an effective online solution are designed to judiciously select a subset of the real-time data updates and determine their corresponding transmission times to optimize the average AoI subject to energy constraints. Our extensive experiments have validated the effectiveness of the proposed solutions, and shown that these two methods can significantly improve the average AoI by 47.2% compared to the state-of-the-art solutions for low energy replenishment rates.
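For readers unfamiliar with the metric, the time-average AoI of a single source can be computed from the sawtooth age curve as in the sketch below; the update times are toy numbers, and the energy-constrained selection problem itself is not modeled.

```python
def average_aoi(updates, horizon):
    """Time-average Age of Information over [0, horizon].
    `updates` is a list of (generation_time, delivery_time) pairs sorted by
    delivery time; the age at time t is t minus the generation time of the
    most recently delivered update (assumed to start from an update generated
    at t = 0). Standard sawtooth (trapezoid) integration; numbers are toy."""
    area, last_gen, t = 0.0, 0.0, 0.0
    for gen, dlv in updates:
        # age rises linearly from (t - last_gen) to (dlv - last_gen) until dlv
        area += (dlv - t) * ((t - last_gen) + (dlv - last_gen)) / 2.0
        t, last_gen = dlv, gen
    area += (horizon - t) * ((t - last_gen) + (horizon - last_gen)) / 2.0
    return area / horizon

# two updates generated at t=1 and t=4, delivered at t=2 and t=6
print(average_aoi([(1.0, 2.0), (4.0, 6.0)], horizon=8.0))  # -> 2.5
```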
- Marco Widmer
- Andrea Bonetti
- Andreas Burg
Embedded DRAM (eDRAM) requires frequent power-hungry refresh according to the worst-case retention time across PVT variations to avoid data loss. Abandoning the error-free paradigm, by choosing sub-critical refresh rates that gracefully degrade the eDRAM content, unlocks considerable power-saving opportunities, but requires understanding the effect of stochastic memory errors at the system/application level. We propose an FPGA-based platform featuring faulty eDRAM emulation based on advanced retention time models and silicon measurements for statistical error resilience evaluation of applications in a complete embedded system. We analyze the statistical QoS for various benchmarks under different sub-critical refresh rates and retention time distributions.
- Yajuan Du
- Yao Zhou
- Meng Zhang
- Wei Liu
- Shengwu Xiong
Existing studies have uncovered significant Raw Bit Error Rate (RBER) variations among different layers of 3D flash memories due to manufacturing process variation. These RBER variations cause significantly diverse read latencies when reading data with traditional Low-Density Parity-Check (LDPC) codes designed for planar flash memories, which induces sub-optimal read performance in flash-based Solid-State Drives (SSDs).
To investigate the latency diversity, this paper first performs a preliminary experiment and observes that LDPC read levels, which are proportional to read latencies, increase at diverse speeds as data retention progresses. Then, by exploiting this observation, a Multi-Granularity LDPC (MG-LDPC) read method is proposed to adapt the level-increase speed for each layer. Five LDPC engines with varied increase granularity are designed to match the different RBER increase speeds. Finally, two MG-LDPC implementations are applied to assign LDPC engines to each flash layer, either in a fixed way or dynamically according to prior read levels. Experimental results show that the two proposed implementations can reduce SSD read response time by 21% and 47% on average, respectively.
- Siva Satyendra Sahoo
- Bharadwaj Veeravalli
- Akash Kumar
Technology scaling and architectural innovations have led to increasing ubiquity of embedded systems across applications with widely varying and often constantly changing performance and reliability specifications. However, the increasing physical fault-rates in electronic systems have led to single-layer reliability approaches becoming infeasible for resource-constrained systems. Dynamic Cross-layer reliability (CLR) provides scope for efficient adaptation to such QoS variations and increasing unreliability. We propose a design methodology for enabling QoS-aware CLR-integrated runtime adaptation in heterogeneous MPSoC-based embedded systems. Specifically, we propose a combination of reconfiguration cost-aware optimization at design-time and an agent-based optimization at run-time. We report a reduction of up to 51% and 37% in average reconfiguration cost and average energy consumption respectively over state-of-the-art approaches.
- Yuan Zhou
- Haoxing Ren
- Yanqing Zhang
- Ben Keller
- Brucek Khailany
- Zhiru Zhang
This paper introduces PRIMAL, a novel learning-based framework that enables fast and accurate power estimation for ASIC designs. PRIMAL trains machine learning (ML) models with design verification testbenches for characterizing the power of reusable circuit building blocks. The trained models can then be used to generate detailed power profiles of the same blocks under different workloads. We evaluate the performance of several established ML models on this task, including ridge regression, gradient tree boosting, multi-layer perceptron, and convolutional neural network (CNN). For average power estimation, ML-based techniques can achieve an average error of less than 1% across a diverse set of realistic benchmarks, outperforming a commercial RTL power estimation tool in both accuracy and speed (15x faster). For cycle-by-cycle power estimation, PRIMAL is on average 50x faster than a commercial gate-level power analysis tool, with an average error less than 5%. In particular, our CNN-based method achieves a 35x speed-up and an error of 5.2% for cycle-by-cycle power estimation of a RISC-V processor core. Furthermore, our case study on a NoC router shows that PRIMAL can achieve a small estimation error of 4.5% using cycle-approximate traces from SystemC simulation.
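As a flavor of the approach, the sketch below trains one of the simpler models mentioned (ridge regression) to map per-cycle signal-toggle features to per-cycle power; the data is synthetic, whereas the real flow would obtain features from RTL simulation and labels from a gate-level power tool.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)

# synthetic stand-in: 5000 cycles x 300 signals; feature = did the signal
# toggle this cycle (0/1); target = per-cycle power generated from a hidden
# per-signal cost plus noise (in a real flow: gate-level power analysis).
toggles = rng.integers(0, 2, size=(5000, 300)).astype(float)
true_cost = rng.uniform(0.1, 2.0, size=300)
power = toggles @ true_cost + rng.normal(0, 1.0, size=5000)

model = Ridge(alpha=1.0).fit(toggles[:4000], power[:4000])
pred = model.predict(toggles[4000:])
err = np.mean(np.abs(pred - power[4000:]) / power[4000:]) * 100
print(f"cycle-by-cycle estimation error: {err:.1f}%")  # fast RTL-level proxy
```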
- Ilaria Scarabottolo
- Giovanni Ansaloni
- George A. Constantinides
- Laura Pozzi
Inexact hardware design techniques have become popular in error-tolerant systems, where energy efficiency is a primary concern. Several techniques aim to identify circuit portions that can be discarded under an error constraint, but research on systematic methods to determine such error is still at an early stage. We herein illustrate a generic, scalable algorithm that determines the influence of each circuit gate on the final output. The algorithm first partitions the graph representing the circuit, then determines the error propagation model of the resulting subgraphs. When applied to existing approximate design frameworks, our solution improves their efficiency and result quality.
- Martin Rapp
- Sami Salamin
- Hussam Amrouch
- Girish Pahwa
- Yogesh Chauhan
- Jörg Henkel
Negative Capacitance Field-Effect Transistor (NCFET) is an emerging technology that incorporates a ferroelectric layer within the transistor gate stack to overcome the fundamental limit of sub-threshold swing in transistors. Even though physics-based NCFET models have been recently proposed, system-level NCFET models do not exist and research is still in its infancy. In this work, we are the first to investigate the impact of NCFET on performance, energy and cooling costs in many-core processors. Our proposed methodology starts from accurate physics models all the way up to the system level, where the performance and power of a many-core are widely affected. Our new methodology and system-level models allow, for the first time, the exploration of the novel trade-offs between performance gains and power losses that NCFET now offers to system-level designers. We demonstrate that an optimal ferroelectric thickness does exist. In addition, we reveal that current state-of-the-art power management techniques fail when NCFET (with a thick ferroelectric layer) comes into play.
- Payman Behnam
- Mahdi Nazm Bojnordi
Data movement in large caches consumes a significant amount of energy in modern computer systems. Low power interfaces have been proposed to address this problem. Unfortunately, the energy-efficiency of these techniques is largely limited due to undue latency overheads of low power wires and complex coding mechanisms. This paper proposes a hybrid technique for slow-transition, fast-level (STFL) signaling that creates a balance between power and bandwidth in the last level cache interface. Combined with STFL codes, the signaling technique significantly mitigates the performance impacts of low power wires, thereby improving the energy efficiency of data movement in memory systems. When applied to the last level cache of a contemporary multicore system, STFL improves the CPU energy-delay product by 9% as compared to a voltage-frequency scaled baseline. Moreover, the proposed architecture reduces the CPU energy by 26% and achieves 98% of the performance provided by a high-performance baseline.
- Sayak Ray
- Nishant Ghosh
- Ramya Jayaram Masti
- Arun Kanuparthi
- Jason M. Fung
We present an effective methodology for formally verifying security-critical flows in a commercial System-on-Chip (SoC) which involve extensive interaction between firmware (FW) and hardware (HW). We describe several HW-FW interaction scenarios that are typical in commercial SoCs. We highlight unique challenges associated with formal verification of security properties of such interactions and discuss our approach of property-specific abstraction and software model checking to circumvent those challenges. To the best of our knowledge, this is the first exposition on formal co-verification of security-specific HW-FW interactions in the context and at the scale of a commercial SoC. Despite traditional scalability challenges, we demonstrate that many such flows are amenable to effective formal verification.
- Lejla Batina
- Patrick Jauernig
- Nele Mentens
- Ahmad-Reza Sadeghi
- Emmanuel Stapf
Data processing and communication in almost all electronic systems are based on Central Processing Units (CPUs). In order to guarantee confidentiality and integrity of the software running on a CPU, hardware-assisted security architectures are used. However, both the threat model and the non-functional platform requirements, i.e. performance and energy budget, differ when we go from high-end desktop computers and servers to low-end embedded devices that populate the internet of things (IoT). For high-end platforms, a relatively large energy budget is available to protect software against attacks. However, measures to optimize performance give rise to microarchitectural side-channel attacks. IoT devices, in contrast, are constrained in terms of energy consumption and do not incorporate the performance enhancements found in high-end CPUs. Hence, they are less likely to be susceptible to microarchitectural attacks, but give rise to physical attacks, exploiting, e.g., leakage in power consumption or through fault injection. Whereas previous work mostly concentrates on a specific architecture, this paper covers the whole spectrum of computing systems, comparing the corresponding hardware architectures, and most relevant threats.
- Elke De Mulder
- Samatha Gummalla
- Michael Hutter
Software (SW) implementations of cryptographic algorithms are vulnerable to Side-channel Analysis (SCA) attacks, basically relinquishing the key to the outside world through measurable physical properties of the processor like power consumption and electromagnetic radiation. Protected SW implementations typically have a significant timing and code size overhead as well as a substantially long development time because hands-on testing the result is crucial. Plenty of scientific publications offer solutions for this problem for all kinds of algorithms but they are not straightforward to implement as they rely on device assumptions which are rarely met, nor do these solutions take micro-architecture related leakages into account. We present a solution to this problem by integrating side-channel analysis countermeasures into a RISC-V implementation. Our solution protects against first-order power or electromagnetic attacks while keeping the implementation costs as low as possible. We made use of state of the art masking techniques and present a novel solution to protect memory access against SCA. Practical results are provided that demonstrate the leakage results of various cryptographic primitives running on our protected hardware platform.
- Boqian Wang
- Zhonghai Lu
- Shenggang Chen
We propose an admission control method in Network-on-Chip (NoC) with a centralized Artificial Neural Network (ANN) admission controller, which can improve system performance by predicting the most appropriate injection rate of each node via the network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node where the admission control will be applied. Compared to the previous work, our method builds up a high-fidelity model between the network status and the injection rate regulation. The full-system simulation results show that our proposed method can enhance application performance by 17.8% on average and up to 23.8%.
The design space for energy-efficient Network-on-Chips (NoCs) has expanded significantly, comprising a number of techniques. The simultaneous application of these techniques to yield maximum energy efficiency requires the monitoring of a large number of system parameters, which often results in substantial engineering effort and complicated control policies. This motivates us to explore the use of a reinforcement learning (RL) approach that automatically learns an optimal control policy to improve NoC energy efficiency. First, we deploy power-gating (PG) and dynamic voltage and frequency scaling (DVFS) to simultaneously reduce both static and dynamic power. Second, we use RL to automatically explore the dynamic interactions among PG, DVFS, and system parameters, learn the critical system parameters contained in the router and cache, and eventually evolve optimal per-router control policies that significantly improve energy efficiency. Moreover, we introduce an artificial neural network (ANN) to efficiently implement the large state-action table required by RL. Simulation results using the PARSEC benchmarks show that the proposed RL approach reduces power consumption by 26% while improving system performance by 7%, as compared to a combined PG and DVFS design without RL. Additionally, the ANN design yields a 67% area reduction as compared to a conventional RL implementation.
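A tabular Q-learning skeleton for a per-router control policy is sketched below; the state discretization, action set, and reward weighting are illustrative assumptions, and the paper replaces the Q-table with an ANN to keep the hardware cost low.

```python
import random
from collections import defaultdict

ACTIONS = ["power_gate", "vf_low", "vf_mid", "vf_high"]  # illustrative action set

class RouterAgent:
    """Tabular Q-learning sketch of a per-router power-management policy.
    State = discretized (buffer occupancy, link utilization); the reward
    trades off energy against latency (weights are assumptions)."""
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        if random.random() < self.epsilon:                 # explore
            return random.randrange(len(ACTIONS))
        return max(range(len(ACTIONS)), key=lambda a: self.q[state][a])

    def learn(self, state, action, reward, next_state):
        td_target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])

# per control epoch, each router would do something like:
#   state = (occupancy_bin, utilization_bin)
#   a = agent.act(state); apply ACTIONS[a]
#   reward = -(w_e * epoch_energy + w_l * epoch_latency)   # weights are assumptions
#   agent.learn(state, a, reward, next_state)
```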
- Venkata Yaswanth Raparti
- Sudeep Pasricha
Data-snooping is a serious security threat in NoC fabrics that can lead to theft of sensitive information from applications executing on manycore processors. Hardware Trojans (HTs) covertly embedded in NoC components can carry out such snooping attacks. In this paper, we first describe a low-overhead snooping invalidation module (SIM) to prevent malicious data replication by HTs in NoCs. We then devise a snooping detection module (THANOS) to also detect malicious applications that utilize such HTs. Experimental analysis shows that unlike state-of-the-art mechanisms, SIM and THANOS not only mitigate snooping attacks but also improve NoC performance by 48.4% in the presence of these attacks, with a minimal ~2.15% area and ~5.5% power overhead.
- Michihiro Koibuchi
- Lambert Leong
- Tomohiro Totoki
- Naoya Niwa
- Hiroki Matsutani
- Hideharu Amano
- Henri Casanova
Wireless interconnects based on inductive coupling technology are compelling propositions for designing 3-D integrated chips. This work addresses the heat dissipation problem on such systems. Although effective cooling technologies have been proposed for systems designed based on Through Silicon Via (TSV), their application to systems that use inductive coupling is problematic because of increased wireless-communication distance. For this reason, we propose two methods for designing sparse 3-D chips layouts and Networks on Chip (NoCs) based on inductive coupling. The first method computes an optimized 3-D chip layout and then generates a randomized network topology for this layout. The second method uses a standard stack chip layout with a standard network topology as a starting point, and then deterministically transforms it into either a “staircase” or a “checkerboard” layout. We quantitatively compare the designs produced by these two methods in terms of network and application performance. Our main finding is that the first method produces designs that ultimately lead to higher parallel application performance, as demonstrated for nine OpenMP applications in the NAS Parallel Benchmarks.
- Peng Wang
- Sobhan Niknam
- Sheng Ma
- Zhiying Wang
- Todor Stefanov
In this paper, we address the problem of how to achieve energy-efficient confined-interference communication on a bufferless NoC taking advantage of the low power consumption of such NoC. We propose a novel routing approach called Surfing on a Bufferless NoC (Surf-Bless) where packets are assigned to domains and Surf-Bless guarantees that interference between packets is confined within a domain, i.e., there is no interference between packets assigned to different domains. By experiments, we show that our Surf-Bless routing approach is effective in supporting confined-interference communication and consumes much less energy than the related approaches.
- Marcos Horro
- Mahmut T. Kandemir
- Louis-Noël Pouchet
- Gabriel Rodríguez
- Juan Touriño
Recent manycore processors are kept coherent using scalable distributed directories. A paramount example is the Xeon Phi Knights Landing. It features 38 tiles packed in a single die, organized into a 2D mesh. Before accessing remote data, tiles need to query the distributed directory. The effect of this coherence traffic is poorly understood. We show that the apparent UMA behavior results from the degradation of the peak performance. We develop ways to optimize the coherence traffic, the core-to-core-affinity, and the scheduling of a set of tasks on the mesh, leveraging the unique characteristics of processor units stemming from process variations.
- Mohsen Imani
- Justin Morris
- John Messerly
- Helen Shu
- Yaobang Deng
- Tajana Rosing
Brain-inspired Hyperdimensional (HD) computing is a new computing paradigm emulating the neuron's activity in high-dimensional space. The first step in HD computing is to map each data point into high-dimensional space (e.g., 10,000 dimensions), which requires the computation of thousands of operations for each element of data in the original domain. Encoding alone takes about 80% of the execution time of training. In this paper, we propose BRIC, a fully binary Brain-Inspired Classifier based on HD computing for energy-efficient and high-accuracy classification. BRIC introduces a novel encoding module based on random projection with a predictable memory access pattern which can efficiently be implemented in hardware. BRIC is the first HD-based approach which provides data projection with a 1:1 ratio to the original data and enables all training/inference computation to be performed using binary hypervectors. To further improve BRIC efficiency, we develop an online dimension reduction approach which removes insignificant hypervector dimensions during training. Additionally, we designed a fully pipelined FPGA implementation which accelerates BRIC in both training and inference phases. Our evaluation of BRIC on a wide range of classification applications shows that BRIC can achieve 64.1× energy efficiency and 9.8× speedup during training (43.8× and 6.1× during inference) as compared to baseline HD computing, while providing the same classification accuracy.
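A minimal sketch of binary random-projection encoding and nearest-class-hypervector inference is shown below; the dimensionality, Hamming-distance similarity, and majority-vote training are common HD-computing choices and not necessarily BRIC's exact pipeline (its online dimension reduction and FPGA design are omitted).

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10_000                     # hypervector dimensionality (typical HD choice)

def make_encoder(n_features, dim=D):
    proj = rng.standard_normal((dim, n_features))     # fixed random projection
    return lambda x: (proj @ x) >= 0                  # binarize -> bit hypervector

def train(encoder, X, y, n_classes):
    """Bundle (majority-vote) the encoded hypervectors of each class."""
    sums = np.zeros((n_classes, D))
    for xi, yi in zip(X, y):
        sums[yi] += encoder(xi)
    return sums >= (np.bincount(y, minlength=n_classes)[:, None] / 2.0)

def predict(encoder, class_hvs, x):
    hv = encoder(x)
    return int(np.argmin([(hv ^ c).sum() for c in class_hvs]))   # Hamming distance

X = rng.standard_normal((100, 20)); y = rng.integers(0, 3, 100)
enc = make_encoder(20)
class_hvs = train(enc, X, y, 3)
print(predict(enc, class_hvs, X[0]), y[0])
```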
- Seongsik Park
- Seijoon Kim
- Hyeokjun Choe
- Sungroh Yoon
Spiking neural networks (SNNs) are considered as one of the most promising artificial neural networks due to their energy-efficient computing capability. Recently, conversion of a trained deep neural network to an SNN has improved the accuracy of deep SNNs. However, most of the previous studies have not achieved satisfactory results in terms of inference speed and energy efficiency. In this paper, we propose a fast and energy-efficient information transmission method with burst spikes and hybrid neural coding scheme in deep SNNs. Our experimental results showed the proposed methods can improve inference energy efficiency and shorten the latency.
- Kangjun Bai
- Qiyuan An
- Yang Yi
Deep neural networks (DNNs), the brain-like machine learning architecture, have gained immense success in data-extensive applications. In this work, a hybrid structured deep delayed feedback reservoir (Deep-DFR) computing model is proposed and fabricated. Our Deep-DFR employs memristive synapses working in a hierarchical information processing fashion with DFR modules as the readout layer, leading our proposed deep learning structure to be both depth-in-space and depth-in-time. Our fabricated prototype along with experimental results demonstrate its high energy efficiency with low hardware implementation cost. For image classification on MNIST and SVHN, our Deep-DFR yields a 1.26~7.69X reduction in the testing error compared to state-of-the-art DNN designs.
- Tao Liu
- Wujie Wen
- Lei Jiang
- Yanzhi Wang
- Chengmo Yang
- Gang Quan
New DNN accelerators based on emerging technologies, such as resistive random access memory (ReRAM), are gaining increasing research attention given their potential of “in-situ” data processing. Unfortunately, device-level physical limitations that are unique to these technologies may cause weight disturbance in memory and thus compromise the performance and stability of DNN accelerators. In this work, we propose a novel fault-tolerant neural network architecture to mitigate the weight disturbance problem without involving expensive retraining. Specifically, we propose a novel collaborative logistic classifier to enhance the DNN stability by redesigning the binary classifiers augmented from both traditional error correction output code (ECOC) and modern DNN training algorithms. We also develop an optimized variable-length “decode-free” scheme to further boost the accuracy with a smaller number of classifiers. Experimental results on cutting-edge DNN models and complex datasets show that the proposed fault-tolerant neural network architecture can effectively rectify the accuracy degradation against weight disturbance for DNN accelerators with low cost, thus allowing for its deployment in a variety of mainstream DNNs.
- Zhenhua Zhu
- Hanbo Sun
- Yujun Lin
- Guohao Dai
- Lixue Xia
- Song Han
- Yu Wang
- Huazhong Yang
Convolutional Neural Networks (CNNs) play a vital role in machine learning. Emerging resistive random-access memories (RRAMs) and RRAM-based Processing-In-Memory architectures have demonstrated great potential in boosting both the performance and energy efficiency of CNNs. However, restricted by the immature process technology, it is hard to implement and fabricate a CNN accelerator chip based on multi-bit RRAM devices. In addition, existing single-bit RRAM based CNN accelerators only focus on binary or ternary CNNs, which have more than 10% accuracy loss compared with full-precision CNNs. This paper proposes a configurable multi-precision CNN computing framework based on single-bit RRAM, which consists of an RRAM computing overhead aware network quantization algorithm and a configurable multi-precision CNN computing architecture based on single-bit RRAM. The proposed method achieves accuracy equivalent to full-precision CNNs while also lowering storage consumption and latency via multi-precision quantization. The designed architecture supports accelerating multi-precision CNNs, even with different precisions across layers. Experimental results show that the proposed framework can reduce computing area by 70% and computing energy by 75% on average, with nearly no accuracy loss, and that the equivalent energy efficiency is 1.6~8.6× that of existing RRAM-based architectures with only 1.07% area overhead.
- Zhezhi He
- Jie Lin
- Rickard Ewetz
- Jiann-Shiun Yuan
- Deliang Fan
In this work, we investigate various non-ideal effects (Stuck-At-Fault (SAF), IR-drop, thermal noise, shot noise, and random telegraph noise) of the ReRAM crossbar when employing it as a dot-product engine for deep neural network (DNN) acceleration. In order to examine the impacts of those non-ideal effects, we first develop a comprehensive framework called PytorX based on the mainstream PyTorch DNN framework. PytorX can perform end-to-end training, mapping, and evaluation for crossbar-based neural network accelerators, considering all of the above non-ideal effects of the ReRAM crossbar together. Experiments based on PytorX show that directly mapping a trained large-scale DNN onto the crossbar without considering these non-ideal effects can lead to a complete system malfunction (i.e., equal to random guessing) when the neural network goes deeper and wider. In particular, to address SAF side effects, we propose a digital SAF error correction algorithm to compensate for crossbar output errors, which only needs one-time profiling to achieve almost no system accuracy degradation. Then, to overcome IR-drop effects, we propose a Noise Injection Adaption (NIA) methodology that incorporates the statistics of the current shift caused by IR drop in each crossbar as stochastic noise in the DNN training algorithm, which can efficiently regularize the DNN model to make it intrinsically adaptive to non-ideal ReRAM crossbars. It is a one-time training method that does not require retraining for every specific crossbar. Optimizing the system operating frequency can easily take care of the remaining non-ideal effects. Various experiments on different DNNs for image recognition applications are conducted to show the efficacy of our proposed methodology.
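A minimal, framework-agnostic sketch of the noise-injection idea is given below (PytorX itself is built on PyTorch and derives its noise statistics from per-crossbar IR-drop profiling; the Gaussian noise model and magnitudes here are placeholder assumptions):

```python
import numpy as np

def crossbar_matvec(weights, x, ir_drop_std=0.0, rng=None):
    """Ideal crossbar dot-product, optionally perturbed by zero-mean Gaussian noise
    standing in for the current shift caused by IR drop."""
    y = weights @ x
    if ir_drop_std > 0.0:
        rng = rng or np.random.default_rng()
        y = y + rng.normal(0.0, ir_drop_std * (np.abs(y).mean() + 1e-12), size=y.shape)
    return y

def train_step(weights, x, target, lr=0.01, ir_drop_std=0.05, rng=None):
    """One least-squares step with noise injected in the forward pass only, so the
    model learns to tolerate the perturbation (a regularization effect)."""
    y = crossbar_matvec(weights, x, ir_drop_std, rng)
    err = y - target
    grad = np.outer(err, x)       # gradient of 0.5*||y - target||^2 w.r.t. the weights
    return weights - lr * grad, float(0.5 * err @ err)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_true = rng.normal(size=(4, 8))
    W = np.zeros_like(W_true)
    for _ in range(2000):
        x = rng.normal(size=8)
        W, loss = train_step(W, x, W_true @ x, rng=rng)
    print("weight recovery error (MSE):", round(float(np.mean((W - W_true) ** 2)), 4))
```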
Microarchitectural covert channel attacks are a threat when multiple tenants share hardware resources such as the last-level cache. In this work, we propose a novel covert channel attack that exploits a new microarchitectural structure introduced to support memory encryption — in particular, the memory encryption engine (MEE) cache. The MEE cache is a shared resource, but it is only utilized when accessing the integrity tree data and therefore provides an opportunity for a stealthy covert channel attack. However, there are challenges since the MEE cache organization is not publicly known and its access behavior differs from that of a conventional cache. We demonstrate how the MEE cache can be exploited to establish covert channel communication.
- Zhenghong Jiang
- Hanchen Jin
- G. Edward Suh
- Zhiru Zhang
Designing a secure cryptographic accelerator is challenging as vulnerabilities may arise from design decisions and implementation flaws. To provide high security assurance, we propose to design and build cryptographic accelerators with hardware-level information flow control so that the security of an implementation can be formally verified. This paper uses an AES accelerator as a case study to demonstrate how to express security requirements of a cryptographic accelerator as information flow policies for security enforcement. Our AES prototype on an FPGA shows that the proposed protection has a marginal impact on area and performance.
- Khaled N. Khasawneh
- Esmaeil Mohammadian Koruyeh
- Chengyu Song
- Dmitry Evtyushkin
- Dmitry Ponomarev
- Nael Abu-Ghazaleh
Speculative attacks, such as Spectre and Meltdown, target speculative execution to access privileged data and leak it through a side-channel. In this paper, we introduce SafeSpec, a new model for supporting speculation in a way that is immune to side-channel leakage by storing the side effects of speculative instructions in separate structures until they commit. Additionally, we address the possibility of a covert channel from speculative instructions to committed instructions before these instructions are committed. We develop a cycle-accurate model of a modified x86-64 processor design and show that the performance impact is negligible.
- Jacob Fustos
- Farzad Farshchi
- Heechul Yun
Speculative execution is an essential performance-enhancing technique in modern processors, but it has been shown to be insecure. In this paper, we propose SpectreGuard, a novel defense mechanism against Spectre attacks. In our approach, sensitive memory blocks (e.g., secret keys) are marked using a simple OS/library API and are then selectively protected by hardware from Spectre attacks via a low-cost micro-architecture extension. This technique allows microprocessors to maintain high performance, while restoring control to software developers to make security and performance trade-offs.
- Daimeng Wang
- Zhiyun Qian
- Nael Abu-Ghazaleh
- Srikanth V. Krishnamurthy
CPU memory prefetchers can substantially interfere with prime and probe cache side-channel attacks, especially on in-order CPUs, which use aggressive prefetching. This interference is not accounted for in previous attacks. In this paper, we propose PAPP, a Prefetcher-Aware Prime Probe attack that can operate even in the presence of aggressive prefetchers. Specifically, we reverse engineer the prefetcher and replacement policy on several CPUs and use these insights to design a prime and probe attack that minimizes the impact of the prefetcher. We evaluate PAPP using the Cache Side-channel Vulnerability (CSV) metric and demonstrate substantial improvements in the quality of the channel under different conditions.
- Thomas Nyman
- Ghada Dessouky
- Shaza Zeitouni
- Aaro Lehikoinen
- Andrew Paverd
- N. Asokan
- Ahmad-Reza Sadeghi
Memory-unsafe programming languages like C and C++ leave many (embedded) systems vulnerable to attacks like control-flow hijacking. However, defenses against control-flow attacks, such as (fine-grained) randomization or control-flow integrity, are ineffective against data-oriented attacks and more expressive Data-oriented Programming (DOP) attacks that bypass state-of-the-art defenses.
We propose run-time scope enforcement (RSE), a novel approach that efficiently mitigates all currently known DOP attacks by enforcing compile-time memory safety constraints like variable visibility rules at run-time. We present Hardscope, a proof-of-concept implementation of hardware-assisted RSE for RISC-V, and show it has a low performance overhead of 3.2% for embedded benchmarks.
- Shuhan Zhang
- Wenlong Lyu
- Fan Yang
- Changhao Yan
- Dian Zhou
- Xuan Zeng
- Xiangdong Hu
This paper presents an efficient multi-fidelity Bayesian optimization approach for analog circuit synthesis. The proposed method can significantly reduce the overall computational cost by fusing the simple but potentially inaccurate low-fidelity model and a few accurate but expensive high-fidelity data. Gaussian Process (GP) models are employed to model the low- and high-fidelity black-box functions separately. The nonlinear map between the low-fidelity model and high-fidelity model is also modelled as a Gaussian process. A fusing GP model which combines the low- and high-fidelity models can thus be built. An acquisition function based on the fusing GP model is used to balance the exploitation and exploration. The fusing GP model is evolved gradually as new data points are selected sequentially by maximizing the acquisition function. Experimental results show that our proposed method reduces up to 65.5% of the simulation time compared with the state-of-the-art single-fidelity Bayesian optimization method, while exhibiting more stable performance and a more promising practical prospect.
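A compact sketch of the fusing-GP construction under simplifying assumptions (the analytic test functions, kernels, and the lower-confidence-bound acquisition below are illustrative stand-ins for the paper's circuit simulators and acquisition function):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def f_high(x):  # expensive "accurate simulation" (placeholder analytic function)
    return np.sin(8 * x) * x + 0.3 * x

def f_low(x):   # cheap, biased approximation of f_high
    return 0.8 * f_high(x) + 0.2 * x - 0.1

rng = np.random.default_rng(0)
X_low = rng.uniform(0, 1, 40).reshape(-1, 1)   # many cheap low-fidelity samples
X_high = rng.uniform(0, 1, 6).reshape(-1, 1)   # few expensive high-fidelity samples

# GP over the low-fidelity function.
gp_low = GaussianProcessRegressor(ConstantKernel() * RBF(0.2), normalize_y=True)
gp_low.fit(X_low, f_low(X_low).ravel())

# Model the (possibly nonlinear) low-to-high map with a second GP whose inputs
# include the low-fidelity prediction at the same location.
Z_high = np.column_stack([X_high, gp_low.predict(X_high)])
gp_fuse = GaussianProcessRegressor(ConstantKernel() * RBF([0.2, 1.0]), normalize_y=True)
gp_fuse.fit(Z_high, f_high(X_high).ravel())

# Lower-confidence-bound acquisition on the fused model (minimizing f_high).
Xc = np.linspace(0, 1, 200).reshape(-1, 1)
Zc = np.column_stack([Xc, gp_low.predict(Xc)])
mu, sigma = gp_fuse.predict(Zc, return_std=True)
x_next = Xc[np.argmin(mu - 2.0 * sigma)]
print("next point to simulate at high fidelity:", float(x_next))
```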
- Mohamed Baker Alawieh
- Sinead A. Williamson
- David Z. Pan
As integrated circuit technologies continue to scale, efficient performance modeling becomes indispensable. Recently, several new learning paradigms have been proposed to reduce the computational cost associated with accurate performance modeling. A common attribute among most of these paradigms is the leverage of the sparsity feature to build efficient performance models. In this work, we propose a new perspective to incorporate sparsity in the modeling task by utilizing spike and slab feature selection techniques. Practically, our proposed method uses two different priors on the different model coefficients based on their importance. This is incorporated into a mixture model that can be built using a hierarchical Bayesian framework to select the important features and find the model coefficients. Our numerical experiments demonstrate that the proposed approach can achieve better results compared to traditional sparse modeling techniques while also providing valuable insight about the important features in the model.
- Biying Xu
- Yibo Lin
- Xiyuan Tang
- Shaolan Li
- Linxiao Shen
- Nan Sun
- David Z. Pan
In back-end analog/mixed-signal (AMS) design flow, well generation persists as a fundamental challenge for layout compactness, routing complexity, circuit performance and robustness. The immaturity of AMS layout automation tools comes to a large extent from the difficulty in comprehending and incorporating designer expertise. To mimic the behavior of experienced designers in well generation, we propose a generative adversarial network (GAN) guided well generation framework with a post-refinement stage leveraging the previous high-quality manually-crafted layouts. Guiding regions for wells are first created by a trained GAN model, after which the well generation results are legalized through post-refinement to satisfy design rules. Experimental results show that the proposed technique is able to generate wells close to manual designs with comparable post-layout circuit performance.
- Zhengyu Chen
- Hai Zhou
- Jie Gu
Mixed-signal time-domain computing (TC) has recently drawn significant attention due to its high efficiency in applications such as machine learning accelerators. However, due to the nature of analog and mixed-signal design, there is a lack of a systematic flow of synthesis and place & route for time-domain circuits. This paper proposes a comprehensive design flow for TC. In the front-end, a variation-aware, digital-compatible synthesis flow is proposed. In the back-end, a placement technique using a graph-based optimization engine is proposed to deal with the especially stringent matching requirements in TC. Simulation results show significant improvement over prior analog placement methods. A 55nm test chip is used to demonstrate that the proposed design flow can meet the stringent timing matching target for TC with a significant performance boost over conventional digital design.
- Charalampos Antoniadis
- Nestor Evmorfopoulos
- Georgios Stamoulis
The integration of more components into modern Systems-on-Chip (SoCs) has led to very large RLC parasitic networks consisting of millions of nodes, which have to be simulated at many time points or frequencies to verify the proper operation of the chip. Model Order Reduction (MOR) techniques have been employed routinely to substitute the large-scale parasitic model with a model of lower order that has a similar response at the input/output ports. However, all established MOR techniques result in dense system matrices that render their simulation impractical. To this end, in this paper we propose a methodology for the sparsification of the dense circuit matrices resulting from Model Order Reduction of general RLC circuits, which employs a sequence of algorithms based on the computation of the nearest diagonally dominant matrix and the sparsification of the corresponding graph. Experimental results indicate that a high sparsity ratio of the reduced system matrices can be achieved with very small loss of accuracy.
- Sara Divanbeigi
- Evan Aditya
- Zhongpin Wang
- Markus Olbrich
In the era of advancing technology, increasing circuit complexity requires faster simulators for the verification step. The piece-wise linear simulation approach provides an efficient and accurate solution. In this paper, a state-of-the-art mixed-signal simulator is explained. The approach is extended to new exponential and quadratic stimuli. This requires a comprehensive derivation of mathematical equations, which remove the need for computationally expensive evaluation. The new stimuli are simulated in several circuits and compared to a conventional simulator. The result shows significant run-time acceleration with high accuracy. Therefore, it meets the industrial requirement, which demands simulation with various input forms and non-linear components.
- Heinz Riener
- Eleonora Testa
- Winston Haaswijk
- Alan Mishchenko
- Luca Amarù
- Giovanni De Micheli
- Mathias Soeken
This paper proposes a novel methodology for multi-level logic synthesis that is independent from a specific graph data structure, but formulates synthesis procedures using an abstract concept definition of a logic representation. The idea is to capture the essence of optimisations in a general manner and tailor only small performance-critical sections to the underlying logic representation. This generic yet scalable approach saves many man-months of development time and enables logic synthesis and technology-mapping procedures parameterised in a logic representation. We present the generic design methodology and demonstrate its practicality by providing a complete state-of-the-art logic synthesis flow.
- Victor N. Kravets
- Nian-Ze Lee
- Jie-Hong R. Jiang
The task of an engineering change order (ECO) is to update the current implementation of a design according to its revised specification with minimum modification. Prior studies show that the amount of design modification majorly depends on the selection of rectification points, i.e., the input pins of gates whose functionality should be rectified with some patch circuitry. In realistic ECOs, as the netlist of the current implementation has been heavily optimized to meet design objectives, it is usually structurally dissimilar to the netlist of a revised specification, which is synthesized only by lightweight optimization. This paper proposes an ECO solution for optimized designs, which is robust against structural dissimilarity caused by design optimization. It locates candidate rectification points in a sampling domain, which significantly improves the scalability of rectification search. To synthesize the circuitry of patches, a structurally independent rewiring formulation is proposed to reuse existing logic in the implementation. Based on the proposed method, a newly developed engine is evaluated on the engineering changes arising in the design of microprocessors. Its ability to derive patches of superior quality is demonstrated in comparison to industrial tools.
- Niels Gleinig
- Frances Ann Hubis
- Torsten Hoefler
In order to compute a non-invertible function on a reversible circuit, one needs to “embed” the function into a larger function which has some garbage bits, corresponding to additional lines. The problem of determining the minimal number of garbage bits that are needed to embed a given function has attracted extensive research, largely motivated by quantum computing, where the number of lines equals the number of qubits. However, all approaches that are known have either no theoretical quality guarantees (bounds on approximation factors) or require exponential runtime. We present an efficient probabilistic approximation algorithm with theoretical bounds.
- Hao Chen
- Shao-Chun Hung
- Jie-Hong R. Jiang
Threshold logic circuits are artificial neural networks with their neuron outputs being binarized, thus amenable for efficient, multiplier-free, hardware implementation of machine learning applications. In the reviving threshold logic synthesis, this work lays the foundations of disjoint-support decomposition and extraction operation of threshold logic functions. They lead to a synthesis procedure for interconnect minimization of threshold logic circuits, an important, but not well addressed, objective in both neural network and nanometer circuit designs. Experimental results show that our method can efficiently and effectively reduce interconnect as well as weight/threshold value over highly optimized circuits, thus suitable for implementation using emerging technologies.
- Eleonora Testa
- Mathias Soeken
- Luca Amarù
- Giovanni De Micheli
Reducing the number of AND gates plays a central role in many cryptography and security applications. We propose a logic synthesis algorithm and tool to minimize the number of AND gates in a logic network composed of AND, XOR, and inverter gates. Our approach is fully automatic and exploits cut enumeration algorithms to explore optimization potentials in local subcircuits. The experimental results show that our approach can reduce the number of AND gates by 34% on average compared to generic size optimization algorithms. Further, we are able to reduce the number of AND gates up to 76% in best-known benchmarks from the cryptography community.
- Rafael Trapani Possignolo
- Jose Renau
Designers wait several hours to get synthesis, placement and routing results even for small changes. Commercial FPGA flows allow for resynthesis after code changes; however, they target large code changes with incremental flows that are not very effective. We propose SMatch, a flow for FPGAs with a novel incremental elaboration and novel incremental FPGA placement and routing that improves the state-of-the-art by reducing the amount of placement and routing work needed. We evaluate our approach against commercial FPGA flows. Our method finishes synthesis, placement, and routing in under 30s for most changes of publicly available benchmarks with negligible QoR impact, being over 20× faster than existing incremental FPGA flows.
- Tutu Ajayi
- Vidya A. Chhabria
- Mateus Fogaça
- Soheil Hashemi
- Abdelrahman Hosny
- Andrew B. Kahng
- Minsoo Kim
- Jeongsup Lee
- Uday Mallappa
- Marina Neseem
- Geraldo Pradipta
- Sherief Reda
- Mehdi Saligane
- Sachin S. Sapatnekar
- Carl Sechen
- Mohamed Shalan
- William Swartz
- Lutong Wang
- Zhehong Wang
- Mingyu Woo
- Bangqi Xu
We describe the planned Alpha release of OpenROAD, an open-source end-to-end silicon compiler. OpenROAD will help realize the goal of “democratization of hardware design”, by reducing cost, expertise, schedule and risk barriers that confront system designers today. The development of open-source, self-driving design tools is in and of itself a “moon shot” with numerous technical and cultural challenges. The open-source flow incorporates a compatible open-source set of tools that span logic synthesis, floorplanning, placement, clock tree synthesis, global routing and detailed routing. The flow also incorporates analysis and support tools for static timing analysis, parasitic extraction, power integrity analysis, and cloud deployment. We also note several observed challenges, or “lessons learned”, with respect to development of open-source EDA tools and flows.
- Kishor Kunal
- Meghna Madhusudan
- Arvind K. Sharma
- Wenbin Xu
- Steven M. Burns
- Ramesh Harjani
- Jiang Hu
- Desmond A. Kirkpatrick
- Sachin S. Sapatnekar
This paper presents analog layout automation efforts under the ALIGN (“Analog Layout, Intelligently Generated from Netlists”) project for fast layout generation using a modular approach based on a mix of algorithmic and machine learning-based tools. The road to rapid turnaround is based on an approach that detects structure and hierarchy in the input netlist and uses a grid based philosophy for layout. The paper provides a view of the current status of the project, challenges in developing open-source code with an academic/industry team, and nuts-and-bolts issues such as working with abstracted PDKs, navigating the “wall” between secured IP and open-source software, and securing access to example designs.
- Tsung-Wei Huang
- Chun-Xun Lin
- Guannan Guo
- Martin D. F. Wong
Open source has started energizing both industrial and academic research and development in electronic design automation (EDA) systems. By moving to open source, we can speed up our effort and work with others who are working toward the same goals, while reducing costs and improving end products. However, building an open-source project is much more than placing the codebase on the web. In this paper, we will talk about essential building blocks to create an impactful open-source project, including the source repository, project landing page, documentation, and continuous integration. We will also cover the use of web-based frameworks to design a showcase project to attract the community’s attention. We will then share our experience in developing an open-source timing analyzer (OpenTimer) and a parallel task programming library (Cpp-Taskflow), both of which are being used in many industrial and academic EDA research projects.
- Elad Alon
- Krste Asanović
- Jonathan Bachrach
- Borivoje Nikolić
We describe our experience developing and promoting a set of open-source tools and IP over the last 9 years, including the Chisel hardware construction language, the Rocket Chip SoC generator, and the BAG analog layout generator.
- Huiyu Mo
- Leibo Liu
- Wenping Zhu
- Qiang Li
- Hong Liu
- Wenjing Hu
- Yao Wang
- Shaojun Wei
Face detection and alignment are highly-correlated, computation-intensive tasks that are not yet flexibly supported by any facial-oriented accelerator. This work proposes the first unified accelerator for multi-face detection and alignment, along with optimizations of the multi-task cascaded convolutional networks algorithm, to implement both multi-face detection and alignment. First, clustering non-maximum suppression is proposed to significantly reduce intersection-over-union computation and eliminate the hardware-interference sorting process, bringing a 16.0% speed-up without any loss. Second, a new pipeline architecture is presented to implement the proposal network in a more computation-efficient manner, with 41.7% less multiplier usage and a 38.3% decrease in memory capacity compared with the similar method. Third, a batch schedule mechanism is proposed to improve the hardware utilization of the fully-connected layer by 16.7% on average with a variable input number in batch processing. Based on the TSMC 28 nm CMOS process, this accelerator only consumes 6.7ms at 400 MHz to simultaneously process 5 faces for each image and achieves 1.17 TOPS/W power efficiency, which is 54.8× higher than the state-of-the-art solution.
- Angad S. Rekhi
- Brian Zimmer
- Nikola Nedovic
- Ningxi Liu
- Rangharajan Venkatesan
- Miaorong Wang
- Brucek Khailany
- William J. Dally
- C. Thomas Gray
Analog/mixed-signal (AMS) computation can be more energy efficient than digital approaches for deep learning inference, but incurs an accuracy penalty from precision loss. Prior AMS approaches focus on small networks/datasets, which can maintain accuracy even with 2b precision. We analyze applicability of AMS approaches to larger networks by proposing a generic AMS error model, implementing it in an existing training framework, and investigating its effect on ImageNet classification with ResNet-50. We demonstrate significant accuracy recovery by exposing the network to AMS error during retraining, and we show that batch normalization layers are responsible for this accuracy recovery. We also introduce an energy model to predict the requirements of high-accuracy AMS hardware running large networks and use it to show that for ADC-dominated designs, there is a direct tradeoff between energy efficiency and network accuracy. Our model predicts that achieving < 0.4% accuracy loss on ResNet-50 with AMS hardware requires a computation energy of at least ~300 fJ/MAC. Finally, we propose methods for improving the energy-accuracy tradeoff.
- Juejian Wu
- Hongtao Zhong
- Kai Ni
- Yongpan Liu
- Huazhong Yang
- Xueqing Li
Making embedded memory symmetric provides the capability of memory access in both rows and columns, which brings new opportunities for significant energy and time savings when only a portion of the data in the words needs to be accessed. This work investigates the use of ferroelectric field-effect transistors (FeFETs), an emerging nonvolatile, low-power, deeply-scalable, CMOS-compatible transistor technology, and proposes a new 3-transistor/cell symmetric nonvolatile memory (SymNVM). With ~1.67x higher density as compared with the prior FeFET design, significant benefits in energy and latency have been achieved, as evaluated and discussed in depth in this paper.
- William Simon
- Juan Galicia
- Alexandre Levisse
- Marina Zapater
- David Atienza
As the computational complexity of applications on the consumer market, such as high-definition video encoding and deep neural networks, becomes ever more demanding, novel ways to efficiently compute data-intensive workloads are being explored. In this context, In-Memory Computing (IMC) solutions, and particularly bitline computing in SRAM, appear promising as they mitigate one of the most energy-consuming aspects of computation: data movement. While IMC architectural-level characteristics have been defined by the research community, only a few works so far have explored the implementation of such memories at a low level. Furthermore, these proposed solutions are either slow (<1GHz), area hungry (10T SRAM), or suffer from read disturb and corruption issues. Overall, there is no extensive design study considering realistic assumptions at the circuit level. In this work we propose a fast (up to 2.2GHz), 6T SRAM-based, reliable (no read disturb issues), and wide voltage range (from 0.6 to 1V) IMC architecture using local bitlines. Beyond standard read and write, the proposed architecture can perform copy, addition and shift operations at the array level. As addition is the slowest operation, we propose a modified carry chain adder, providing a 2× carry propagation improvement. The proposed architecture is validated using a 28nm bulk high-performance technology PDK with CMOS variability and post-layout simulations. High-density SRAM bitcells (0.127μm) enable an area efficiency of 59.7% for a 256×128 array, on par with current industrial standards.
- Sungju Ryu
- Hyungjun Kim
- Wooseok Yi
- Jae-Joon Kim
Deep Neural Networks (DNNs) have various performance requirements and power constraints depending on applications. To maximize the energy-efficiency of hardware accelerators for different applications, the accelerators need to support various bit-width configurations. When designing bit-reconfigurable accelerators, each PE must have variable shift-addition logic, which takes a large amount of area and power. This paper introduces an area and energy efficient precision-scalable neural network accelerator (BitBlade), which reduces the control overhead for variable shift-addition using bitwise summation method. The proposed BitBlade, when synthesized in a 28nm CMOS technology, showed reduction in area by 41% and in energy by 36-46% compared to the state-of-the-art precision-scalable architecture [14].
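The arithmetic reordering behind bitwise summation can be shown in a few lines: partial products of the same bit significance are summed across the whole array first, and each shift is applied only once, instead of shift-adding inside every PE. Below is a small sketch under arbitrary assumptions (2-bit slices, 4-bit operands, 64 lanes), not the paper's exact configuration:

```python
import numpy as np

def bit_slices(values, num_slices, slice_bits):
    """Split unsigned integers into little-endian slices of `slice_bits` bits."""
    mask = (1 << slice_bits) - 1
    return [(values >> (s * slice_bits)) & mask for s in range(num_slices)]

def dot_bitwise_summation(a, b, slice_bits=2, num_slices=2):
    """Compute sum_i a[i]*b[i] by summing same-significance slice products across
    the whole array first, then applying each shift exactly once."""
    a_sl = bit_slices(a, num_slices, slice_bits)
    b_sl = bit_slices(b, num_slices, slice_bits)
    total = 0
    for i in range(num_slices):
        for j in range(num_slices):
            partial_sum = int(np.sum(a_sl[i] * b_sl[j]))     # one adder tree, no per-lane shifter
            total += partial_sum << ((i + j) * slice_bits)   # single shared shift per significance
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(0, 16, size=64)   # 4-bit operands = two 2-bit slices
    b = rng.integers(0, 16, size=64)
    assert dot_bitwise_summation(a, b) == int(np.dot(a, b))
    print("bitwise-summation dot product matches:", dot_bitwise_summation(a, b))
```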
- Gunhee Lee
- Hanmin Park
- Namhyung Kim
- Joonsang Yu
- Sujeong Jo
- Kiyoung Choi
The training process of a deep neural network commonly consists of three phases: forward propagation, backward propagation, and weight update. In this paper, we propose a hardware architecture to accelerate the backward propagation. Our approach applies to neural networks that use rectified linear unit. Considering that the backward propagation results in a zero activation gradient when the corresponding activation is zero, we can safely skip the gradient calculation. Based on this observation, we design an efficient hardware accelerator for training deep neural networks by selectively computing gradients. We show the effectiveness of our approach through experiments with various network models.
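The underlying observation is just the ReLU backward rule: where the forward activation is zero, the activation gradient is zero, so the corresponding gradient work can be skipped. A brief sketch of that selective computation follows (the row-level skipping granularity is an illustrative choice, not the paper's dataflow):

```python
import numpy as np

def relu_layer_backward_selective(weight, x, activation, grad_out):
    """Backward pass of y = relu(W @ x), skipping rows whose forward activation is
    zero. Those rows contribute nothing to grad_x or grad_W, so their work is elided."""
    grad_pre = grad_out * (activation > 0)
    active = np.flatnonzero(grad_pre)          # rows that actually need gradient work
    grad_x = np.zeros_like(x)
    grad_w = np.zeros_like(weight)
    for i in active:
        grad_x += grad_pre[i] * weight[i]      # accumulate dL/dx from active rows only
        grad_w[i] = grad_pre[i] * x            # dL/dW row; skipped rows stay zero
    return grad_x, grad_w, len(active) / len(activation)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 128))
    x = rng.normal(size=128)
    act = np.maximum(W @ x, 0.0)
    grad_out = rng.normal(size=256)
    gx, gw, density = relu_layer_backward_selective(W, x, act, grad_out)
    # Reference dense backward pass for comparison
    gp = grad_out * (act > 0)
    assert np.allclose(gx, W.T @ gp) and np.allclose(gw, np.outer(gp, x))
    print(f"fraction of rows that needed gradient work: {density:.2f}")
```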
- Matthew Sotoudeh
- Sara S. Baghsorkhi
Existing approaches to neural network compression have failed to holistically address algorithmic (training accuracy) and computational (inference performance) demands of real-world systems, particularly on resource-constrained devices. We present C3-Flow, a new approach adding non-uniformity to low-rank approximations and designed specifically to enable highly-efficient computation on common hardware architectures while retaining more accuracy than competing methods. Evaluation on two state-of-the-art acoustic models (versus existing work, empirical limit study approaches, and hand-tuned models) demonstrates up to 60% lower error. Finally, we show that our co-design approach achieves up to 14X inference speedup across three Haswell- and Broadwell-based platforms.
- Dong Wang
- Ke Xu
- Qun Jia
- Soheil Ghiasi
Hardware accelerators for convolutional neural network (CNN) inference have been extensively studied in recent years. The reported designs tend to utilize a similar underlying architecture based on multiplier-accumulator (MAC) arrays, which has the practical consequence of limiting the FPGA-based accelerator performance by the number of available on-chip DSP blocks, while leaving other resources under-utilized. To address this problem, we consider a transformation to the convolution computation, which leads to a transformation of the accelerator design space and relaxes the pressure on the required DSP resources. We demonstrate that our approach enables us to strike a judicious balance between utilization of the on-chip memory, logic, and DSP resources, due to which our accelerator considerably outperforms the state of the art. We report the effectiveness of our approach on a Stratix-V GXA7 FPGA, which shows 55% throughput improvement, while using 6.25% fewer DSP blocks, compared to the best reported CNN accelerator on the same device.
- Angshuman Karmakar
- Sujoy Sinha Roy
- Frederik Vercauteren
- Ingrid Verbauwhede
Sampling from a discrete Gaussian distribution has applications in lattice-based post-quantum cryptography. Several efficient solutions have been proposed in recent years. However, making a Gaussian sampler secure against timing attacks turned out to be a challenging research problem. In this work, we present a toolchain to instantiate an efficient constant-time discrete Gaussian sampler of arbitrary standard deviation and precision. We observe an interesting property of the mapping from input random bit strings to samples during a Knuth-Yao sampling algorithm and propose an efficient way of minimizing the Boolean expressions for the mapping. Our minimization approach results in up to 37% faster discrete Gaussian sampling compared to the previous work. Finally, we apply our optimized and secure Gaussian sampler in the lattice-based digital signature algorithm Falcon, which is a NIST submission, and provide experimental evidence that the overall performance of the signing algorithm degrades by at most 33% due to the additional overhead of ‘constant-time’ sampling, including the 60% overhead of random number generation. Contrary to a general belief, our results indirectly show that the use of discrete Gaussian samples in digital signature algorithms would be beneficial.
- Hadi Mardani Kamali
- Kimia Zamiri Azar
- Houman Homayoun
- Avesta Sasan
In this paper, we propose a novel SAT-resistant logic-locking technique, denoted as Full-Lock, to obfuscate and protect the hardware against threats including IP piracy and reverse engineering. Full-Lock is constructed using a set of small-size fully Programmable Logic and Routing block (PLR) networks. The PLRs are SAT-hard instances with reasonable power, performance and area overheads, which are used to obfuscate (1) the routing of a group of selected wires and (2) the logic of the gates leading to and proceeding from the selected wires. Full-Lock resists removal attacks and breaks a SAT attack by significantly increasing the complexity of each SAT iteration.
- Rajit Karmakar
- Suman Sekhar Jana
- Santanu Chattopadhyay
A popular countermeasure against IP piracy relies on obfuscating the Finite State Machine (FSM), which is assumed to be the heart of a digital system. In this paper, we propose to use a special class of non-group additive cellular automata (CA) called D1 * CA, and its counterpart D1 * CAdual, to obfuscate each state transition of an FSM. The synthesized FSM exhibits correct state transitions only for a correct key, which is a designer’s secret. The proposed easily testable key-controlled FSM synthesis scheme can thwart reverse engineering attacks, and thus offers IP protection.
- Jie Xu
- Dan Feng
- Yu Hua
- Fangting Huang
- Wen Zhou
- Wei Tong
- Jingning Liu
Non-volatile memories (NVMs) are vulnerable to serious threats due to endurance variation. We identify a new type of malicious attack, called Uniform Address Attack (UAA), which performs uniform and sequential writes to each line of the whole memory, and wears out the weaker lines (lines with lower endurance) early. Experimental results show that the lifetime of NVMs under UAA is reduced to 4.1% of the ideal lifetime. To address such attacks, we propose a spare-line replacement scheme called Max-WE (Maximize the Weak lines’ Endurance). By employing weak-priority and weak-strong-matching strategies for spare-line allocation, Max-WE is able to maximize the number of writes that the weakest lines can endure. Furthermore, Max-WE reduces the storage overhead of the mapping table by 85% through adopting a hybrid spare-line mapping scheme. Experimental results show that Max-WE can improve the lifetime by 9.5X with spare-line and mapping overheads of 10% and 0.016% of the total space, respectively.
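A simplified sketch of the weak-priority and weak-strong-matching idea: the weakest data lines are replaced first, and each one is backed by the strongest remaining spare line. The endurance values and the greedy pairing below are illustrative; the actual Max-WE scheme also manages the hybrid spare-line mapping table:

```python
def allocate_spare_lines(line_endurance, spare_endurance, budget):
    """Greedy weak-priority, weak-strong matching: replace the weakest data lines
    first and back each one with the strongest remaining spare line."""
    weakest_first = sorted(range(len(line_endurance)), key=lambda i: line_endurance[i])
    strongest_first = sorted(range(len(spare_endurance)),
                             key=lambda j: spare_endurance[j], reverse=True)
    mapping = {}
    for line, spare in zip(weakest_first[:budget], strongest_first[:budget]):
        mapping[line] = spare
    return mapping

if __name__ == "__main__":
    lines = [900, 120, 450, 80, 700, 300]   # remaining writes each data line can endure
    spares = [1000, 650, 820]               # endurance of the available spare lines
    print(allocate_spare_lines(lines, spares, budget=3))
    # -> weakest lines (indices 3, 1, 5) mapped to strongest spares (indices 0, 2, 1)
```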
- Daniel Casini
- Alessandro Biondi
- Giorgio Buttazzo
Although several works in the literature have targeted predictable execution models for parallel tasks, limited attention has been devoted to studying how specific implementation techniques may affect their execution. This paper highlights some issues that can arise when executing parallel tasks with thread pools, which may lead to deadlocks and performance degradation when adopting blocking synchronization mechanisms. A new parallel task model, inspired by a realistic design found in popular software systems, is first presented to study this problem. Then, formal conditions to ensure the absence of deadlocks and schedulability analysis techniques are proposed under both global and partitioned scheduling.
- Xu Jiang
- Nan Guan
- Weichen Liu
- Maolin Yang
This paper for the first time studies the scheduling and analysis of parallel real-time tasks with semaphores. In parallel task systems, each task may issue multiple requests to a semaphore, which raises new challenges to the design and analysis problems. We propose a new locking protocol LPP that limits the maximal number of requests to a semaphore by a task that can block other tasks at any time. We develop analysis techniques to safely bound the task response times, with which we prove that the best real-time performance is achieved if only one request to a semaphore by a task is allowed to block other tasks at a time. Experiments under different parameter settings are conducted to compare our proposed protocol and analysis techniques with the state-of-the-art spinlock protocol and analysis techniques for parallel real-time tasks.
- Jinghao Sun
- Nan Guan
- Xiaoqing Wang
- Chenhan Jin
- Yaoyao Chi
Synchronous parallel tasks are widely used in HPC to pursue high average performance, but little consideration is given to how to guarantee good timing predictability. OpenMP is a promising framework for multi-core real-time embedded systems. Synchronous OpenMP tasks are significantly more difficult to schedule and analyze due to constraints posed by the OpenMP specifications. An important OpenMP feature is the tied task, which must execute on the same thread during its whole life cycle. This paper designs a novel method, called group scheduling, to schedule synchronous OpenMP tasks, which divides tasks into several groups and assigns some of them to dedicated cores in order to isolate tied tasks. We derive a linear-time computable response time bound. Experiments with both randomly generated and realistic OpenMP tasks show that our new bound significantly outperforms the existing bound.
- Tomás Picornell
- José Flich
- Carles Hernández
- José Duato
The adoption of many-cores in safety-critical systems requires real-time capable networks on chip (NoC). In this paper we propose a new time-predictable NoC design paradigm where contention within the network is eliminated. This new paradigm builds on the Channel Dependency Graph (CDG) and guarantees by design the absence of contention. Our delayed conflict-free NoC (DCFNoC) is able to naturally inject messages using a TDM period equal to the optimal theoretical bound and without the need for a computationally demanding offline process. Results show that DCFNoC guarantees time predictability with very low implementation cost.
- Artur Mrowca
- Martin Nocker
- Sebastian Steinhorst
- Stephan Günnemann
Verification is essential to prevent malfunctioning of software systems. Model checking makes it possible to verify conformity with nominal behavior. As manual definition of specifications for such systems becomes infeasible, automated techniques to mine specifications from data become increasingly important. Existing approaches produce specifications of limited length, do not segregate functions, and do not easily allow the inclusion of expert input. We present BaySpec, a dynamic mining approach to extract temporal specifications from Bayesian models, which represent behavioral patterns. This allows learning specifications of arbitrary length from imperfect traces. Within this framework we introduce a novel extraction algorithm that, for the first time, mines LTL specifications from such models.
- Shuangnan Liu
- Francis CM Lau
- Benjamin Carrion Schafer
One of the advantages of High-Level Synthesis (HLS), also called C-based VLSI-design, over traditional RT-level VLSI design flows, is that multiple micro-architectures of unique area vs. performance can be automatically generated by setting different synthesis options, typically in the form of synthesis directives specified as pragmas in the source code. This design space exploration (DSE) is very time-consuming and can easily take multiple days for complex designs. At the same time, and because of the complexity in designing large ASICs, verification teams now routinely make use of emulation and prototyping to test the circuit before the silicon is taped out. This also allows the embedded software designers to start their work earlier in the design process and thus, further reducing the Turn-Around-Times (TAT). In this work, we present a method to automatically re-optimize ASIC designs specified as behavioral descriptions for HLS to FPGAs for emulation and prototyping, based on the observation that synthesis directives that lead to efficient micro-architectures for ASICs, do not directly translate into optimal micro-architectures in FPGAs. This implies that the HLS DSE process would have to be completely repeated for the target FPGA. To avoid this, this work presents a predictive model-based method that takes as inputs the results of an ASIC HLS DSE and automatically, without the need to re-explore the behavioral description, finds the Pareto-optimal micro-architectures for the target FPGA. Experimental results comparing our predictive-model based method vs. completely re-exploring the search space show that our proposed method works well.
- Ming Hu
- Tongquan Wei
- Min Zhang
- Frédéric Mallet
- Mingsong Chen
The Clock Constraint Specification Language (CCSL) has been widely investigated in verifying causal and temporal timing behaviors of real-time embedded systems. However, due to limited expertise in formal modeling, it is difficult for requirement engineers to completely and accurately derive CCSL specifications from natural language-based design descriptions. To address this problem, we present a novel approach that facilitates automated synthesis of CCSL specifications under the guidance of sampled (expected) timing behaviors of target systems. By encoding sampled behaviors and incomplete CCSL constraints provided by requirement engineers using our proposed transformation templates, the CCSL specification synthesis problem can be naturally converted into a SKETCH synthesis problem, which enables the automated generation of CCSL specifications with high accuracy. Experiments on both well-known benchmarks and synthetic examples demonstrate the effectiveness and scalability of our approach.
- Neetu Jindal
- Sandeep Chandran
- Preeti Ranjan Panda
- Sanjiva Prasad
- Abhay Mitra
- Kunal Singhal
- Shubham Gupta
- Shikhar Tuli
Runtime verification employs dedicated hardware or software monitors to check whether program properties hold at runtime. However, these monitors often incur high area and performance overheads depending on whether they are implemented in hardware or software. In this work, we propose DHOOM, an architectural framework for runtime monitoring of program assertions, which exploits the combination of a reconfigurable fabric present alongside a processor core with the vestigial on-chip Design-for-Debug hardware. This combination of hardware features allows DHOOM to minimize the overall performance overhead of runtime verification, even when subject to a given area constraint. We present an algorithm for dynamically selecting an effective subset of assertion monitors that can be accommodated in the available programmable fabric, while instrumenting the remaining assertions in software. We show that our proposed strategy, while respecting area constraints, reduces the performance overhead of runtime verification by up to 32% when compared with a baseline of software-only monitors.
- Dylan Stow
- Itir Akgun
- Wenqin Huangfu
- Yuan Xie
- Xueqi Li
- Gabriel H. Loh
Emerging Monolithic Three-Dimensional (M3D) integration technology will not only provide improved circuit density through the high-bandwidth coupling of multiple vertically-stacked layers, but it can also provide new architectural opportunities for on-chip computation, memory, and communication that are beyond the capabilities of existing process and packaging technologies. For example, with massive parallel communication between heterogeneous memory and compute layers, existing processing-in-memory architectures can be optimized and expanded, developing into efficient and flexible near-data processors. Additionally, multiple tiers of interconnect can be dynamically leveraged to provide an efficient, scalable interconnect fabric that spans the three-dimensional system. This work explores some of the challenges and opportunities presented by M3D technology for emerging computer architectures, with focus on improving efficiency and increasing system flexibility.
- Heechun Park
- Kyungwook Chang
- Bon Woong Ku
- Jinwoo Kim
- Edward Lee
- Daehyun Kim
- Arjun Chaudhuri
- Sanmitra Banerjee
- Saibal Mukhopadhyay
- Krishnendu Chakrabarty
- Sung Kyu Lim
Monolithic 3D IC overcomes the limitation of the existing through-silicon-via (TSV) based 3D IC by providing denser vertical connections with nano-scale inter-layer vias (ILVs). In this paper, we demonstrate a thorough RTL-to-GDS design flow for monolithic 3D IC, which is based on commercial 2D place-and-route (P&R) tools and clever ways to extend them to handle 3D IC designs and simulations. We also provide a low-cost built-in-self-test (BIST) method to detect various faults that can occur on ILVs. Lastly, we present a resistive random access memory (ReRAM) compiler that generates memory modules that are to be integrated in monolithic 3D ICs.
- Jiachen Mao
- Qing Yang
- Ang Li
- Hai Li
- Yiran Chen
In recent years, machine learning research has largely shifted focus from the cloud to the edge. While the resulting algorithm- and hardware-level optimizations have enabled local execution for the majority of deep neural networks (DNNs) on edge devices, the sheer magnitude of DNNs associated with real-time video detection workloads has forced them to remain relegated to remote execution in the cloud. This is problematic when combined with the strict latency requirements coupled with these workloads, and it imposes a unique set of challenges not directly addressed in prior works. In this work, we design MobiEye, a cloud-based video detection system optimized for deployment in real-time mobile applications. MobiEye is able to achieve up to a 32% reduction in latency when compared to a conventional implementation of a video detection system with only a marginal reduction in accuracy.
- Shuo-Han Chen
- Ming-Chang Yang
- Yuan-Hao Chang
- Chun-Feng Wu
Existing secure deletion approaches are inefficient in erasing data permanently because file systems have no knowledge of the data layout on the storage device, nor is the storage device aware of file information within the file systems. This inefficiency is exaggerated on the emerging shingled magnetic recording (SMR) drive due to its inherent sequential-write constraint. On SMR drives, secure deletion requests may lead to serious write amplification and performance degradation if the data layout is not properly configured. Such observation motivates us to propose a file-oriented fast secure deletion (FFSD) strategy to alleviate the negative impacts of SMR drives’ sequential-write constraint and improve the efficiency of secure deletion operations on SMR drives. A series of experiments was conducted to demonstrate the capability of the proposed strategy on improving the efficiency of secure deletion on SMR drives.
- Wei-Ming Chen
- Pi-Cheng Hsiu
- Tei-Wei Kuo
Self-powered intermittent systems enable accumulative execution in unstable power environments, where checkpointing is often adopted as a means to achieve data consistency and system recovery under power failures. However, existing approaches based on the checkpointing paradigm normally require system suspension and/or logging at runtime. This paper presents a design which enables failure-resilient intermittently-powered systems without runtime checkpointing. Our design enforces the consistency and serializability of concurrent task execution while maximizing computation progress, as well as allows instant system recovery after power resumption, by leveraging the characteristics of data accessed in hybrid memory. We integrated the design into FreeRTOS running on a Texas Instruments device. Experimental results show that our design achieves up to 11.8 times the computation progress achieved by checkpointing-based approaches, while reducing the recovery time by nearly 90%.
- Tinghuan Chen
- Bingqing Lin
- Hao Geng
- Bei Yu
Sensor drift is an intractable obstacle to practical temperature measurement in smart buildings. In this paper, we propose a sensor spatial correlation model. Given prior knowledge, maximum a posteriori (MAP) estimation is performed to calibrate drifts. MAP is formulated as a non-convex problem with three hyper-parameters. An alternating-based method is proposed to solve this non-convex formulation. Cross-validation and expectation-maximization with Gibbs sampling are further used to determine the hyper-parameters. Experimental results on benchmarks from the simulator EnergyPlus show that, compared with the state-of-the-art method, the proposed framework can achieve robust drift calibration and a better trade-off between accuracy and runtime.
- Erick Carvajal Barboza
- Nishchal Shukla
- Yiran Chen
- Jiang Hu
Optimizations at the placement stage need to be guided by timing estimation prior to routing. To handle timing uncertainty due to the lack of routing information, people tend to make very pessimistic predictions so that the performance specification can be ensured in the worst case. Such pessimism causes over-design that wastes chip resources or design effort. In this work, a machine learning-based pre-routing timing prediction approach is introduced. Experimental results show that it can reach accuracy near post-routing sign-off analysis. Compared to a commercial pre-routing timing estimation tool, it reduces the false positive rate by about 2/3 in reporting timing violations.
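A toy sketch of the general setup: train a regressor on pre-routing features to predict post-routing slack and flag violations. The feature list, the synthetic labels, and the random-forest model below are placeholders rather than the paper's actual features or learning algorithm:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical pre-routing features per timing path: estimated wirelength, fanout,
# logic depth, and local pin density. Labels stand in for post-routing slack.
rng = np.random.default_rng(0)
n_paths = 5000
features = np.column_stack([
    rng.uniform(10, 500, n_paths),    # estimated wirelength (um)
    rng.integers(1, 20, n_paths),     # max fanout on the path
    rng.integers(2, 30, n_paths),     # logic depth
    rng.uniform(0.1, 0.9, n_paths),   # local pin density
])
# Synthetic ground truth standing in for sign-off slack after routing (ns)
slack = 2.0 - 0.003 * features[:, 0] - 0.02 * features[:, 2] + rng.normal(0, 0.05, n_paths)

X_tr, X_te, y_tr, y_te = train_test_split(features, slack, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
violations_true = y_te < 0
violations_pred = pred < 0
false_pos = np.mean(violations_pred[~violations_true])   # non-violating paths flagged
print(f"mean abs slack error: {np.mean(np.abs(pred - y_te)):.3f} ns, "
      f"false-positive rate: {false_pos:.3f}")
```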
- Wei Ye
- Mohamed Baker Alawieh
- Yibo Lin
- David Z. Pan
Lithography simulation is one of the most fundamental steps in process modeling and physical verification. Conventional simulation methods suffer from a tremendous computational cost for achieving high accuracy. Recently, machine learning was introduced to trade off between accuracy and runtime through speeding up the resist modeling stage of the simulation flow. In this work, we propose LithoGAN, an end-to-end lithography modeling framework based on a generative adversarial network (GAN), to map the input mask patterns directly to the output resist patterns. Our experimental results show that LithoGAN can predict resist patterns with high accuracy while achieving orders of magnitude speedup compared to conventional lithography simulation and previous machine learning based approach.
- Kuan-Ming Lai
- Tsung-Wei Huang
- Tsung-Yi Ho
The recent TAU 2018 contest sought novel ideas for the efficient generation of timing reports. When the timing graph is updated, users query different forms of timing reports, and these queries happen subsequently and sequentially. This process is computationally expensive and inherently complex. Therefore, we introduce in this paper a general cache framework for the efficient generation of timing critical paths. Our framework efficiently supports (1) a cache scheme to minimize duplicate calculation, (2) graph contraction to reduce the search space, and (3) multi-threading. We evaluated our framework on the TAU 2018 contest benchmarks and demonstrated promising performance over the top performer.
This paper proposes a scalable algorithmic framework for effective-resistance preserving spectral reduction of large undirected graphs. The proposed method allows computing much smaller graphs while preserving the key spectral (structural) properties of the original graph. Our framework is built upon the following three key components: a spectrum-preserving node aggregation and reduction scheme, a spectral graph sparsification framework with iterative edge weight scaling, as well as effective-resistance preserving post-scaling and iterative solution refinement schemes. By leveraging recent similarity-aware spectral sparsification method and graph-theoretic algebraic multigrid (AMG) Laplacian solver, a novel constrained stochastic gradient descent (SGD) optimization approach has been proposed for achieving truly scalable performance (nearly-linear complexity) for spectral graph reduction. We show that the resultant spectrally-reduced graphs can robustly preserve the first few nontrivial eigenvalues and eigenvectors of the original graph Laplacian and thus allow for developing highly-scalable spectral graph partitioning and circuit simulation algorithms.
- Jinsoo Jang
- Brent Byunghoon Kang
Hardware debugging facilities, such as watchpoints, have been used for software development and analysis. In this paper, we expanded the use of watchpoints as a hardware security primitive for enhancing the runtime security of mobile devices. By analyzing the watchpoints in detail, we derived useful watchpoint properties that can be exploited to build security applications. Based on our analysis, we designed example applications for hardening the OS kernel by exploiting watchpoints. The proposed applications were implemented on a Juno development board with 64-bit ARM architecture (ARMv8). Hardening the kernel by fully enabling the proposed schemes was found to impose reasonable overhead, i.e., 3% with SPEC CPU2006.
- Daniele Jahier Pagliari
- Sara Vinco
- Enrico Macii
- Massimo Poncino
Smart meters communicate to the utility provider fine-grain information about a user’s energy consumption, which could be used to infer the user’s habits and thus pose a critical privacy risk. State-of-the-art solutions try to obfuscate the readings of a meter either by using a large re-chargeable battery to filter the trace or by adding random noise to alter it. Both solutions, however, have significant drawbacks: large batteries are prohibitively expensive, whereas digitally added noise implies that the user entrusts the utility provider to protect his/her privacy.
This work proposes a hybrid approach in which zero-average noise is inserted in the power trace by means of a small energy storage device (battery or supercapacitor); the distinguishing feature of our approach is that this obfuscating device is indistinguishable from any other load and therefore it complicates by construction the load disaggregation task performed by the provider or by a malicious third party. Simulation results show that our device can achieve comparable or superior privacy enhancement as that of a solution based on a large battery and therefore with smaller cost.
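A toy simulation of the core mechanism: a small storage element adds zero-mean noise to the trace seen by the grid, injecting or absorbing power only when its state of charge permits. The capacity, noise level, and clipping policy below are arbitrary choices; the paper's policy additionally shapes the noise so that the device remains indistinguishable from an ordinary load:

```python
import numpy as np

def obfuscate_trace(load, capacity_wh, noise_std, dt_h=1/60, seed=0):
    """Perturb a household load trace with zero-mean noise supplied by a small
    storage device. Positive noise -> device discharges (grid sees less power);
    negative noise -> device charges (grid sees more)."""
    rng = np.random.default_rng(seed)
    soc = capacity_wh / 2.0                       # start half full (Wh)
    grid = np.empty_like(load, dtype=float)
    for t, p in enumerate(load):
        delta = rng.normal(0.0, noise_std)        # desired perturbation (W)
        max_discharge = min(soc / dt_h, p)        # cannot push the grid trace below zero
        max_charge = (capacity_wh - soc) / dt_h
        delta = float(np.clip(delta, -max_charge, max_discharge))
        grid[t] = p - delta
        soc -= delta * dt_h                       # update the state of charge
    return grid

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    load = 300 + 200 * rng.random(24 * 60)        # one day of per-minute load (W)
    grid = obfuscate_trace(load, capacity_wh=50.0, noise_std=80.0)
    print("mean load:", round(load.mean(), 1), "W  mean grid trace:", round(grid.mean(), 1), "W")
```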
- Ebrahim M. Songhori
- M. Sadegh Riazi
- Siam U. Hussain
- Ahmad-Reza Sadeghi
- Farinaz Koushanfar
We present ARM2GC, a novel secure computation framework based on Yao’s Garbled Circuit (GC) protocol and the ARM processor. It allows users to develop privacy-preserving applications using standard high-level programming languages (e.g., C) and compile them using off-the-shelf ARM compilers, e.g., gcc-arm. The main enabler of this framework is the introduction of SkipGate, an algorithm that dynamically omits the communication and encryption cost of a gate when its output is independent of the private data. SkipGate greatly enhances the performance of ARM2GC by omitting costs of the gates associated with the instructions of the compiled binary, which is known by both parties involved in the computation. Our evaluation on benchmark functions demonstrates that ARM2GC outperforms the prior best solution by 156×.
- Song Bian
- Masayuki Hiromoto
- Takashi Sato
The (ring) learning with errors (RLWE/LWE) problem is one of the most promising candidates for constructing quantum-secure key exchange protocols. In this work, we design and implement specialized hardware multiplier units for both LWE and RLWE key exchange schemes to maximize their computational efficiency. By exploiting the algebraic structure with aggressive parameter sets, we show that the design and implementation of LWE key exchange on hardware is considerably easier and more flexible than RLWE. Using the proposed architectures, we show that client-side energy-efficiency of LWE-based key exchange can be on the same order, or even (slightly) better than RLWE-based schemes, making LWE an attractive option for designing post-quantum cryptographic suite.
- Jie Xu
- Dan Feng
- Yu Hua
- Wei Tong
- Jingning Liu
- Chunyan Li
- Gaoxiang Xu
- Yiran Chen
Data encoding methods have been proposed to alleviate the high write energy and limited write endurance of Non-Volatile Memories (NVMs). Although such methods are proven effective through theoretical analysis, they can become inefficient under the actual data patterns of workloads. We observe that the new cache line and the old cache line share many redundant (or unmodified) words, so the utilization ratio of the tag bits used by data encoding methods becomes very low and their efficiency decreases. To fully exploit the tag bits and reduce the bit flips of NVMs, we propose REdundant word Aware Data encoding (READ). The key idea of READ is to share the tag bits among all the words of a cache line and dynamically assign the tag bits to the modified words. The high utilization ratio of the tag bits in READ, however, leads to heavy bit flips of the tag bits themselves. To reduce these bit flips, we further propose Sequential flips Aware Encoding (SAE). SAE is based on the observation that many sequential bits of the new data and the old data are opposite; for such writes, the bit flips of the tag bits increase with the number of tag bits. SAE dynamically selects the encoding granularity that causes the minimum bit flips instead of always using the minimum encoding granularity. Experimental results show that our schemes can reduce energy consumption by 20.3%, decrease bit flips by 25.0%, and improve lifetime by 52.1%.
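For illustration, the sketch below shows a classic word-level invert-coding baseline (in the spirit of Flip-N-Write) rather than READ itself: a word is stored inverted whenever that flips fewer cells, with one tag bit per word recording the choice. READ's contribution is to share such tag bits across only the modified words of a cache line.

```python
def bit_flips(a, b, width=16):
    """Number of cells that change when overwriting word a with word b."""
    return bin((a ^ b) & ((1 << width) - 1)).count("1")

def invert_coded_write(old_word, new_word, width=16):
    """Return (stored_word, tag_bit, flips): store the word inverted if cheaper.

    One tag bit per word records whether the stored value is inverted
    (a Flip-N-Write-style baseline, not READ's shared-tag-bit scheme).
    """
    mask = (1 << width) - 1
    plain = bit_flips(old_word, new_word, width)
    inverted = bit_flips(old_word, new_word ^ mask, width)
    if inverted < plain:
        return new_word ^ mask, 1, inverted
    return new_word, 0, plain

stored, tag, flips = invert_coded_write(0x00FF, 0xFF0F)
print(hex(stored), tag, flips)  # storing inverted (0x00F0) costs 4 flips instead of 12
```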
- Farzaneh Zokaee
- Mingzhe Zhang
- Xiaochun Ye
- Dongrui Fan
- Lei Jiang
3D vertical ReRAM (3DV-ReRAM) emerges as one of the most promising alternatives to DRAM due to its good scalability beyond 10nm. Monolithic 3D (M3D) integration enables 3DV-ReRAM to improve its array area efficiency by stacking peripheral circuits underneath an array. A 3DV-ReRAM array has to be large enough to fully cover the peripheral circuits, but such a large array size significantly increases its access latency. In this paper, we propose Magma, an M3D stacked heterogeneous ReRAM array architecture for future main memory systems, built by stacking a large unipolar 3DV-ReRAM array on top of a small bipolar 3DV-ReRAM array and peripheral circuits shared by the two arrays. We further architect the small bipolar array as a direct-mapped cache for the main memory system. Compared to homogeneous ReRAMs, on average, Magma improves the system performance by 11.4%, reduces the system energy by 24.3% and obtains a > 5-year lifetime.
- Xianzhang Chen
- Zhuge Qingfeng
- Qiang Sun
- Edwin H.-M. Sha
- Shouzhen Gu
- Chaoshu Yang
- Chun Jason Xue
Emerging non-volatile memories (NVMs) are promising main memory candidates due to their advanced characteristics. However, the low endurance of NVM cells makes them vulnerable to frequent fine-grained updates. This paper proposes a Wear-leveling Aware Fine-grained Allocator (WAFA) for NVM. WAFA divides pages into basic memory units to support fine-grained updates. WAFA allocates the basic memory units of a page in a rotational manner to distribute fine-grained updates evenly over memory cells. The fragmented basic memory units of each page, caused by memory allocation and deallocation operations, are reorganized by a reform operation. We implement WAFA in Linux kernel 4.4.4. Experimental results show that WAFA can reduce the total writes of pages by 81.1% and 40.1% over NVMalloc and nvm_alloc, respectively, two state-of-the-art wear-conscious allocators for NVM. Meanwhile, WAFA shows 48.6% and 42.3% performance improvement over NVMalloc and nvm_alloc, respectively.
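The rotational idea can be sketched in a few lines of Python; the unit count and bookkeeping below are illustrative placeholders, not WAFA's kernel data structures.

```python
class RotationalPage:
    """Toy page that allocates fixed-size basic memory units rotationally."""

    def __init__(self, num_units=8):
        self.num_units = num_units
        self.free = set(range(num_units))
        self.next_start = 0  # rotating starting point for the next search

    def alloc(self):
        # Scan from the rotating offset so successive allocations are spread
        # over all units instead of always reusing the lowest-numbered ones.
        for i in range(self.num_units):
            unit = (self.next_start + i) % self.num_units
            if unit in self.free:
                self.free.remove(unit)
                self.next_start = (unit + 1) % self.num_units
                return unit
        return None  # page full

    def dealloc(self, unit):
        self.free.add(unit)

page = RotationalPage()
first_round = [page.alloc() for _ in range(4)]
for u in first_round:
    page.dealloc(u)
second_round = [page.alloc() for _ in range(4)]
print(first_round, second_round)  # [0, 1, 2, 3] then [4, 5, 6, 7]: wear is spread out
```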
- Yibo Lin
- Shounak Dhar
- Wuxi Li
- Haoxing Ren
- Brucek Khailany
- David Z. Pan
Placement for very-large-scale integrated (VLSI) circuits is one of the most important steps for design closure. This paper proposes DREAMPlace, a novel GPU-accelerated placement framework, by casting the analytical placement problem equivalently as training a neural network. Implemented on top of the widely-adopted deep learning toolkit PyTorch, with customized key kernels for wirelength and density computations, DREAMPlace achieves over 30× speedup in global placement without quality degradation compared to the state-of-the-art multi-threaded placer RePlAce. We believe this work will open up new directions for revisiting classical EDA problems with advancements in AI hardware and software.
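The casting of placement as training can be illustrated with a few lines of PyTorch: cell coordinates become trainable parameters and a differentiable wirelength becomes the loss. The toy netlist below uses a plain log-sum-exp wirelength and omits DREAMPlace's density term and custom CUDA kernels, so it is only a sketch of the mechanism.

```python
import torch

# Toy netlist: 4 movable cells, 3 nets listing the cells they connect (assumed).
nets = [[0, 1], [1, 2, 3], [0, 3]]
pos = torch.nn.Parameter(torch.rand(4, 2) * 100)  # (x, y) coordinates per cell
gamma = 4.0  # smoothing parameter of the log-sum-exp wirelength

def lse_wirelength(pos):
    """Differentiable approximation of half-perimeter wirelength."""
    total = 0.0
    for net in nets:
        p = pos[net]                      # coordinates of the net's pins
        for d in range(2):                # x and y directions independently
            total = total + gamma * (torch.logsumexp(p[:, d] / gamma, dim=0)
                                     + torch.logsumexp(-p[:, d] / gamma, dim=0))
    return total

opt = torch.optim.Adam([pos], lr=1.0)     # a placement iteration is a training step
for step in range(200):
    opt.zero_grad()
    loss = lse_wirelength(pos)
    loss.backward()
    opt.step()
print(float(lse_wirelength(pos)))
```

Without a density (overlap) term the cells simply cluster together; in the full framework that term is what turns the optimization into a legal placement.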
- Fan-Keng Sun
- Yao-Wen Chang
The analytical formulation has been shown to be the most effective for circuit placement. A key ingredient of analytical placement is its wirelength model, which needs to be differentiable and can accurately approximate a golden wirelength model such as half-perimeter wirelength. Existing wirelength models derive gradient from differentiating smooth maximum (minimum) functions, such as the log-sum-exp and weighted-average models. In this paper, we propose a novel bivariate gradient-based wirelength model, namely BiG, which directly derives a gradient with any bivariate smooth maximum (minimum) function without any differentiation. Our wirelength model can effectively combine the advantages of both multivariate and bivariate functions. Experimental results show that our BiG model effectively and efficiently improves placement solutions.
- Jai-Ming Lin
- Szu-Ting Li
- Yi-Ting Wang
Mixed-size placement has become a great challenge in modern VLSI design. To handle this problem, the three-stage mixed-size placement methodology is considered the most suitable approach for a commercial design flow, with placement prototyping being its most important stage. Since standard cells and macros have to be considered simultaneously in this stage, it is more complicated than the other two stages. To reduce complexity and improve design quality, this paper applies a multilevel framework with a design hierarchy-guided clustering scheme that produces a better coarsening result and thus improves the outcome of the following stages. We propose an efficient and effective clustering scheme to group standard cells and macros based on the tree built from their design hierarchies. More importantly, our clustering algorithm considers indirect connectivity between macros, which is ignored by previous works. Moreover, we propose a new overlapping bounding box constraint to avoid clustering improper macros which have connections to fixed pins. The experimental results show that wirelength and routability are improved by our methodology.
- Yih-Lang Li
- Shih-Ting Lin
- Shinichi Nishizawa
- Hong-Yan Su
- Ming-Jie Fong
- Oscar Chen
- Hidetoshi Onodera
For the 7nm technology node, cell placement with drain-to-drain abutment (DDA) requires additional filler cells, increasing placement area. This is the first work to fully automatically synthesize a DDA-aware cell library with an optimized number of drains on cell boundaries based on the ASAP 7nm PDK. We propose a DDA-aware dynamic programming based transistor placement. Previous works ignore the use of the M0 layer in cell routing. We first propose an ILP-based M0 routing planning. With M0 routing, the congestion of M1 routing can be reduced and pin accessibility can be improved due to the diminished use of M2 routing. To improve routing resource utilization, we propose an implicitly adjustable grid map, enabling the maze routing to explore more routing solutions. Experimental results show that block placement using the DDA-aware cell library requires 70.9% fewer filler cells than that using a traditional cell library, which achieves a block area reduction of 5.7%.
- Miloš Grujić
- Vladimir Rožić
- David Johnston
- John Kelsey
- Ingrid Verbauwhede
The generation of high quality true random numbers is essential in security applications. For secure communication, we also require high quality true random number generators (TRNGs) in embedded and IoT devices. This paper provides insights into modern TRNG design principles and their evaluation, based on the requirements of standards and on design experience. We illustrate our approach with a case study of a recently proposed delay chain based TRNG.
- Gai Liu
- Joseph Primmer
- Zhiru Zhang
The increasing popularity of compute acceleration for emerging domains such as artificial intelligence and computer vision has led to the growing need for domain-specific accelerators, often implemented as specialized processors that execute a set of domain-optimized instructions. The ability to rapidly explore (1) various possibilities of the customized instruction set, and (2) its corresponding micro-architectural features is critical to achieve the best quality-of-results (QoRs). However, this ability is frequently hindered by the manual design process at the register transfer level (RTL). Such an RTL-based methodology is often expensive and slow to react when the design specifications change at the instruction-set level and/or micro-architectural level.
We address this deficiency in domain-specific processor design with ASSIST, a behavior-level synthesis framework for RISC-V processors. From an untimed functional instruction set description, ASSIST generates a spectrum of RISC-V processors implementing varying micro-architectural design choices, which enables effective tradeoffs between different QoR metrics. We demonstrate the automatic synthesis of more than 60 in-order processor implementations with varying pipeline structures from the RISC-V 32I instruction set, some of which dominate the manually optimized counterparts in the area-performance Pareto frontier. In addition, we propose an autotuning-based approach for optimizing the implementations under a given performance constraint and the technology target. We further present case studies of synthesizing various custom instruction extensions and customized instruction sets for cryptography and machine learning applications.
- Vojtech Mrazek
- Muhammad Abdullah Hanif
- Zdenek Vasicek
- Lukas Sekanina
- Muhammad Shafique
Approximate computing is an emerging paradigm for developing highly energy-efficient computing systems such as various accelerators. In the literature, many libraries of elementary approximate circuits have already been proposed to simplify the design process of approximate accelerators. Because these libraries contain from tens to thousands of approximate implementations for a single arithmetic operation, it is intractable to find an optimal combination of approximate circuits in the library even for an application consisting of a few operations. An open problem is “how to effectively combine circuits from these libraries to construct complex approximate accelerators”. This paper proposes a novel methodology for searching, selecting and combining the most suitable approximate circuits from a set of available libraries to generate an approximate accelerator for a given application. To enable fast design space generation and exploration, the methodology utilizes machine learning techniques to create computational models estimating the overall quality of processing and hardware cost without performing full synthesis at the accelerator level. Using the methodology, we construct hundreds of approximate accelerators (for a Sobel edge detector) showing different but relevant tradeoffs between the quality of processing and hardware cost and identify a corresponding Pareto-frontier. Furthermore, when searching for approximate implementations of a generic Gaussian filter consisting of 17 arithmetic operations, the proposed approach allows us to identify approximately 10^3 highly relevant implementations from 10^23 possible solutions in a few hours, while an exhaustive search would take four months on a high-end processor.
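The last step of such a flow, keeping only the accelerator configurations that are not dominated in (quality loss, hardware cost), reduces to Pareto-front extraction; a minimal sketch is below, with placeholder numbers standing in for the ML-estimated metrics.

```python
def pareto_front(candidates):
    """Return candidates not dominated in (error, cost); lower is better for both."""
    front = []
    for name, err, cost in candidates:
        dominated = any(e <= err and c <= cost and (e < err or c < cost)
                        for _, e, c in candidates)
        if not dominated:
            front.append((name, err, cost))
    return sorted(front, key=lambda t: t[1])

# Placeholder (accelerator config, estimated error, estimated energy) tuples.
configs = [("acc_a", 0.02, 9.1), ("acc_b", 0.05, 6.0),
           ("acc_c", 0.06, 8.5), ("acc_d", 0.10, 5.9)]
print(pareto_front(configs))  # acc_c is dominated by acc_b and dropped
```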
Non-stencil kernels with irregular memory access patterns pose unique challenges to achieving high computing performance and hardware efficiency in FPGA high-level synthesis. We present a highly versatile and systematic approach, termed Graph-Morphing, to constructing a reconfigurable computing engine specifically optimized for non-stencil kernel computing. Graph-Morphing achieves significant performance improvement by fragmenting operations across loop iterations and subsequently rescheduling computation and data to maximize overall performance. In experiments, Graph-Morphing achieves a 2-13× performance improvement, albeit with significantly more hardware usage. For accelerating non-stencil kernel computing, Graph-Morphing proposes a new research direction.
- Xuechao Wei
- Yun Liang
- Jason Cong
Deep Neural Networks (DNNs) are becoming more and more complex. Previous hardware accelerator designs neglect the layer diversity in terms of computation and communication behavior. On-chip memory resources are underutilized for the memory-bound layers, leading to suboptimal performance. In addition, the increasing complexity of DNN structures makes it difficult to perform on-chip memory allocation. To address these issues, we propose a layer-conscious memory management framework for FPGA-based DNN hardware accelerators. Our framework exploits the layer diversity and the disjoint lifespan information of memory buffers to efficiently utilize the on-chip memory, improving the performance of the memory-bound layers and thus the overall performance of DNNs. It consists of four key techniques working in coordination with each other. We first devise a memory allocation algorithm to allocate on-chip buffers for the memory-bound layers. In addition, buffer sharing between different layers is applied to improve on-chip memory utilization. Finally, buffer prefetching and splitting are used to further reduce latency. Experiments show that our techniques can achieve 1.36X performance improvement compared with previous designs.
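The buffer-sharing step, in which two layers may reuse the same on-chip buffer whenever the lifespans of their data do not overlap, is essentially interval packing; a greedy sketch with made-up layer names, sizes, and lifespans is shown below.

```python
def share_buffers(buffers):
    """Greedily pack buffers with disjoint [start, end) lifespans into slots.

    buffers: list of (name, size_kb, start, end). Each physical slot is
    sized by the largest buffer ever placed in it.
    """
    slots = []  # each slot: {"members": [...], "free_at": int, "size": int}
    for name, size, start, end in sorted(buffers, key=lambda b: b[2]):
        for slot in slots:
            if start >= slot["free_at"]:          # lifespans do not overlap
                slot["members"].append(name)
                slot["free_at"] = end
                slot["size"] = max(slot["size"], size)
                break
        else:
            slots.append({"members": [name], "free_at": end, "size": size})
    return slots

# Hypothetical per-layer buffers: (name, size in KB, first use, last use).
layers = [("conv1_out", 64, 0, 2), ("conv2_out", 48, 2, 4),
          ("fc_in", 32, 1, 3), ("fc_out", 16, 4, 5)]
for slot in share_buffers(layers):
    print(slot["members"], "-> physical buffer of", slot["size"], "KB")
```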
- Marcos T. Leipnitz
- Gabriel L. Nazar
When attempting to make a design fit a set of the heterogeneous resources found in Field-Programmable Gate Arrays (FPGAs), designers using High-Level Synthesis (HLS) may resort to approximate approaches. However, current FPGA-oriented approximate HLS tools do not allow specifying constraints on heterogeneous resources such as lookup tables, flip-flops, and multipliers, being instead error-oriented. In this work, we propose a resource-oriented HLS methodology with which designers can specify heterogeneous resource constraints and satisfy them while minimizing the output error, attaining average improvements, over error-oriented approaches, of about 34% and 2.2 dB for mean-squared error and peak signal-to-noise ratio error metrics, respectively.
Loop pipelining is an important optimization in high-level synthesis to enable high-throughput pipelined execution of loop iterations. However, current pipeline scheduling approaches rely on fundamentally inexact heuristics based on ad hoc priority functions and lack guarantees of achieving the best throughput. To address this shortcoming, we propose a scheduling algorithm based on a system of integer difference constraints (SDC) and Boolean satisfiability (SAT) to exactly handle various pipeline scheduling constraints. Our techniques take advantage of conflict-driven learning and problem-specific specialization to optimally yet efficiently derive pipelining solutions. Experiments demonstrate that our approach achieves notable speedup in comparison to integer linear programming based techniques.
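The SDC half of such a formulation, constraints of the form t_v - t_u <= c, can be checked for feasibility and solved with a single-source shortest-path computation; a compact Bellman-Ford sketch is below. The SAT side, which handles the remaining pipeline constraints, is not shown.

```python
def solve_sdc(num_ops, constraints):
    """Solve difference constraints t[v] - t[u] <= c with Bellman-Ford.

    constraints: list of (u, v, c) meaning t[v] - t[u] <= c. Returns a
    feasible list of start times, or None if a negative cycle makes the
    system infeasible.
    """
    INF = float("inf")
    src = num_ops                        # virtual source connected to every op
    edges = [(src, v, 0) for v in range(num_ops)]
    edges += list(constraints)
    dist = [INF] * num_ops + [0]
    for _ in range(num_ops):             # |V| - 1 relaxation rounds
        for u, v, c in edges:
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
    if any(dist[u] + c < dist[v] for u, v, c in edges):
        return None                      # negative cycle: unsatisfiable
    base = -min(dist[:num_ops])
    return [d + base for d in dist[:num_ops]]

# t1 >= t0 + 2, t2 >= t1 + 1, and a latency bound t2 - t0 <= 4.
print(solve_sdc(3, [(1, 0, -2), (2, 1, -1), (0, 2, 4)]))  # [0, 2, 3]
```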
- Quan Deng
- Youtao Zhang
- Minxuan Zhang
- Jun Yang
PIM (Processing-in-memory)-based CNN (Convolutional neural network) accelerators leverage the characteristics of basic memory cells to enable simple logic and arithmetic operations so that the bandwidth constraint can be effectively alleviated. However, it remains a major challenge to support multiplication operations efficiently on PIM accelerators, in particular, DRAM-based PIM accelerators. This has prevented PIM-based accelerators from being immediately adopted for accurate CNN inference.
In this paper, we propose LAcc, a DRAM-based PIM accelerator that supports LUT-based (lookup table based) fast and accurate multiplication. By enabling LUT-based vector multiplication in DRAM, LAcc effectively decreases LUT size and improves its reuse. LAcc further adopts a hybrid mapping of weights and inputs to improve the hardware utilization rate. LAcc achieves 95 FPS at 5.3 W for AlexNet and a 6.3× efficiency improvement over the state of the art.
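The LUT-based multiplication idea can be mimicked in software: precompute a small table of nibble products and assemble the full product from lookups, shifts, and adds. The 4-bit decomposition below is an illustrative choice, not LAcc's actual in-DRAM data layout.

```python
# Precompute a 16x16 lookup table of 4-bit x 4-bit products.
LUT = [[a * b for b in range(16)] for a in range(16)]

def lut_mult8(x, y):
    """Multiply two 8-bit operands using only table lookups, shifts and adds."""
    xh, xl = x >> 4, x & 0xF
    yh, yl = y >> 4, y & 0xF
    return (LUT[xh][yh] << 8) + ((LUT[xh][yl] + LUT[xl][yh]) << 4) + LUT[xl][yl]

# Exhaustive check against exact multiplication for all 8-bit operand pairs.
assert all(lut_mult8(a, b) == a * b for a in range(256) for b in range(256))
print(lut_mult8(95, 53))  # 5035
```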
- Jae-San Kim
- Joon-Sung Yang
Various studies have been carried out to improve the operational efficiency of Deep Neural Networks (DNNs). However, the importance of reliability in DNNs has generally been overlooked. As the underlying semiconductor technology decreases in reliability, the probability that some components of computing devices fail also increases, preventing high accuracy in DNN operations. To achieve high accuracy, ensuring operational reliability, even if faults occur, is necessary.
In this paper, we introduce a DNN reliability improvement scheme in 3D die-stacked memory called DRIS-3, based on the correlation between the faults in weights and an accuracy loss. We analyze the fault characteristics of conventional DNN models to find the bits that cause significant accuracy loss when faults are injected into weights. On the basis of the findings, we propose a reliability improvement structure which can reduce faults on the bits that must be protected for accuracy, considering asymmetric soft error rate (SER) per layer in 3D die-stacked memory.
Experimental results show that with the proposed method, the fault tolerance is improved regardless of the type of model and the pruning applied. The fault tolerance based on bit error rate (BER) for a 1% accuracy loss is increased by up to 10^4 times over the conventional model.
- Ashish Ranjan
- Shubham Jain
- Jacob R. Stevens
- Dipankar Das
- Bharat Kaul
- Anand Raghunathan
Memory Augmented Neural Networks (MANNs) enhance a deep neural network with an external differentiable memory, enabling them to perform complex tasks well beyond the capabilities of conventional deep neural networks. We identify a unique challenge that arises in MANNs due to soft reads and writes to the differentiable memory, each of which requires access to all the memory locations. This characteristic of MANN workloads severely limits the performance of MANNs on CPUs, GPUs, and classical neural network accelerators. We present the first effort to design a hardware architecture that improves the efficiency of MANNs. Leveraging the intrinsic ability of resistive crossbars to efficiently realize in-memory computations, we propose X-MANN, a memory-centric crossbar-based architecture that is specialized to match the compute characteristics observed in MANNs. We design a transposable crossbar processing unit that can efficiently perform the different computational kernels of MANNs. To improve performance of soft writes in X-MANN, we propose an incremental write mechanism that leverages the characteristics of soft write operations. We develop an architectural simulator for X-MANN that utilizes array-level timing and power models of resistive crossbars calibrated from SPICE simulations. Across a suite of MANN benchmarks, X-MANN achieves 23.7×-45.7× speedup and 75.1×-267.1× reduction in energy over state-of-the-art GPU implementations.
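The soft read and soft write that make MANNs so memory-intensive touch every location: a read is a similarity-weighted sum over all rows, and a write updates all rows in proportion to the same weights. The numpy sketch below shows this differentiable-memory access pattern, not X-MANN's crossbar mapping.

```python
import numpy as np

def soft_weights(memory, key, beta=10.0):
    """Content-based addressing: softmax over cosine similarity to every row."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    e = np.exp(beta * sims)
    return e / e.sum()

def soft_read(memory, key):
    w = soft_weights(memory, key)
    return w @ memory                       # every memory row contributes

def soft_write(memory, key, erase, add):
    w = soft_weights(memory, key)           # every row is partially updated
    return memory * (1 - np.outer(w, erase)) + np.outer(w, add)

M = np.random.rand(128, 32)                 # 128 locations, 32-wide words
k = np.random.rand(32)
M = soft_write(M, k, erase=np.ones(32) * 0.5, add=np.random.rand(32))
print(soft_read(M, k).shape)                # (32,)
```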
- Haitong Li
- Mudit Bhargava
- Paul N. Whatmough
- H.-S. Philip Wong
Deep neural network (DNN) inference tasks have become ubiquitous workloads on mobile SoCs and demand energy-efficient hardware accelerators. Mobile DNN accelerators are heavily area-constrained, with only minimal on-chip SRAM, which results in heavy use of inefficient off-chip DRAM. With diminishing returns from conventional silicon technology scaling, emerging memory technologies that offer better area density than SRAM can boost accelerator efficiency by minimizing costly off-chip DRAM accesses. This paper presents a detailed design space exploration (DSE) of technology-system co-design for systolic-array accelerators. We focus on practical/mature on-chip memory technologies, including SRAM, eDRAM, MRAM, and 3D vertical RRAM (VRRAM). The DSE employs state-of-the-art optimizations (e.g., model compression and optimized buffer scheduling), and evaluates results on important models including ResNet-50, MobileNet, and Faster-RCNN. Compared to an SRAM/DRAM baseline, MRAM-based accelerators show up to 4.68× energy benefits (57% area overhead), while a 3D VRRAM-based design achieves 2.22× energy benefits (33% area reduction).
- Reza Hojabr
- Kamyar Givaki
- S. M. Reza Tayaranian
- Parsa Esfahanian
- Ahmad Khonsari
- Dara Rahmati
- M. Hassan Najafi
Employing convolutional neural networks (CNNs) in embedded devices calls for novel low-cost and energy-efficient CNN accelerators. Stochastic computing (SC) is a promising low-cost alternative to conventional binary implementations of CNNs. Despite the low-cost advantage, SC-based arithmetic units suffer from prohibitive execution time due to processing long bit-streams. In particular, multiplication, the main operation in convolution computation, is extremely time-consuming, which hampers employing SC methods in designing embedded CNNs.
In this work, we propose a novel architecture, called SkippyNN, that reduces the computation time of SC-based multiplications in the convolutional layers of CNNs. Each convolution in a CNN is composed of numerous multiplications where each input value is multiplied by a weight vector. Once the first product is available, the following multiplications can be performed by multiplying the input by the differences of the successive weights. Leveraging this property, we develop a differential Multiply-and-Accumulate unit, called DMAC, to reduce the time consumed by convolutions in SkippyNN. We evaluate the efficiency of SkippyNN using four modern CNNs. On average, SkippyNN offers a 1.2x speedup and 2.7x energy saving compared to the binary implementation of CNN accelerators.
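The arithmetic identity behind the DMAC is simply x·w_i = x·w_{i-1} + x·(w_i - w_{i-1}), so all but the first multiplication involve a (typically small) weight difference. A scalar Python sketch:

```python
def differential_products(x, weights):
    """Form x*w_i for each weight, computing later products incrementally.

    Only the first product uses a full-width multiply; each subsequent one
    reuses the previous product and multiplies x by the difference of
    successive weights, which is what shortens the stochastic bit-streams
    in a DMAC-style unit.
    """
    products = []
    for i, w in enumerate(weights):
        if i == 0:
            products.append(x * w)
        else:
            products.append(products[-1] + x * (w - weights[i - 1]))
    return products

x, ws = 0.8, [0.50, 0.52, 0.49, 0.51]
print(differential_products(x, ws))
print([x * w for w in ws])  # identical up to floating-point rounding
```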
- Fan Chen
- Linghao Song
- Hai Helen Li
- Yiran Chen
Generative Adversarial Networks (GANs) have recently demonstrated a great opportunity toward unsupervised learning, with the intention to mitigate the massive human effort spent on data labeling in supervised learning algorithms. A GAN combines a generative model and a discriminative model that oppose each other in an adversarial situation to refine their abilities. Existing nonvolatile memory based machine learning accelerators, however, cannot support the computational needs of GAN training. Specifically, the generator utilizes a new operator, called transposed convolution, which introduces significant resource underutilization when executed on conventional neural network accelerators, as it inserts massive zeros in its input before a convolution operation. In this work, we propose a novel computational deformation technique that synergistically optimizes the forward and backward functions in transposed convolution to eliminate the large resource underutilization. In addition, we present dedicated control units – a dataflow mapper and an operation scheduler, to support the proposed execution model with high parallelism and low energy consumption. ZARA is implemented with commodity ReRAM chips, and experimental results show that our design can improve GAN training performance by 1.6×~23× on average over CMOS-based GAN accelerators. Compared to state-of-the-art ReRAM-based accelerator designs, ZARA also provides a 1.15×~2.1× performance improvement.
- Debayan Das
- Anupam Golder
- Josef Danial
- Santosh Ghosh
- Arijit Raychowdhury
- Shreyas Sen
This article, for the first time, demonstrates a Cross-device Deep Learning Side-Channel Attack (X-DeepSCA), achieving an accuracy of > 99.9%, even in the presence of significantly higher inter-device variations compared to the inter-key variations. By augmenting traces captured from multiple devices for training and with a proper choice of hyper-parameters, the proposed 256-class Deep Neural Network (DNN) learns accurately from the power side-channel leakage of an AES-128 target encryption engine, and an N-trace (N ≤ 10) X-DeepSCA attack breaks different target devices within seconds compared to a few minutes for a correlational power analysis (CPA) attack, thereby increasing the threat surface for embedded devices significantly. Even for low-SNR scenarios, the proposed X-DeepSCA attack achieves ~ 10× lower minimum traces to disclosure (MTD) compared to a traditional CPA.
- Haocheng Li
- Satwik Patnaik
- Abhrajit Sengupta
- Haoyu Yang
- Johann Knechtel
- Bei Yu
- Evangeline F.Y. Young
- Ozgur Sinanoglu
The notion of integrated circuit split manufacturing, which delegates the front-end-of-line (FEOL) and back-end-of-line (BEOL) parts to different foundries, is to prevent overproduction, piracy of the intellectual property (IP), or targeted insertion of hardware Trojans by adversaries in the FEOL facility. In this work, we challenge the security promise of split manufacturing by formulating various layout-level placement and routing hints as vector- and image-based features. We construct a sophisticated deep neural network which can infer the missing BEOL connections with high accuracy. Compared with the publicly available network-flow attack [1], for the same set of ISCAS-85 benchmarks, we achieve 1.21× accuracy when splitting on M1 and 1.12× accuracy when splitting on M3 with less than 1% running time.
- Sayandeep Saha
- S. Nishok Kumar
- Sikhar Patranabis
- Debdeep Mukhopadhyay
- Pallab Dasgupta
Assessment of the security provided by a fault attack countermeasure is challenging, given that a protected cipher may leak the key if the countermeasure is not designed correctly. This paper proposes, for the first time, a statistical framework to detect information leakage in fault attack countermeasures. Based on the concept of non-interference, we formalize the leakage for fault attacks and provide a t-test based methodology for leakage assessment. One major strength of the proposed framework is that leakage can be detected without the complete knowledge of the countermeasure algorithm, solely by observing the faulty ciphertext distributions. Experimental evaluation over a representative set of countermeasures establishes the efficacy of the proposed methodology.
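The statistical core is a standard two-sample test: collect a ciphertext-derived statistic under two classes and flag leakage when Welch's t statistic exceeds the customary TVLA-style threshold of 4.5. The sketch below uses placeholder distributions; the fault-injection and class-partitioning details of the framework are not reproduced.

```python
import numpy as np

def welch_t(samples_a, samples_b):
    """Welch's t statistic between two sets of observations."""
    a, b = np.asarray(samples_a, float), np.asarray(samples_b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

def leaks(samples_a, samples_b, threshold=4.5):
    """Flag leakage when |t| exceeds the commonly used 4.5 threshold."""
    return abs(welch_t(samples_a, samples_b)) > threshold

rng = np.random.default_rng(0)
# Placeholder distributions standing in for a faulty-ciphertext statistic
# under two classes; a detectable mean shift violates non-interference.
class_a = rng.normal(100.0, 5.0, 10000)
class_b = rng.normal(100.4, 5.0, 10000)
print(welch_t(class_a, class_b), leaks(class_a, class_b))
```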
- Mohammad Mahmoodi
- Hussein Nili
- Shabnam Larimian
- Xinjie Guo
- Dmitri Strukov
We exploit randomness in static I-V characteristics and the reconfigurability of embedded flash memories to design a very efficient physically unclonable function. Leakage current and subthreshold slope variations, nonlinearity, nondeterministic tuning error, and sneak path current in the redesigned commercial flash memory arrays are exploited to create a unique digital fingerprint. A time-multiplexed architecture is designed to enhance the security and expand the challenge-response pair space to 10^211. Experimental results demonstrate 50.3% average uniformity, 49.99% average diffuseness, and a native <5% bit error rate. The analysis of the measured data also shows strong resilience against machine learning attacks and the possibility of extremely energy efficient, 0.56 pJ/b operation.
- Sying-Jyan Wang
- Yu-Shen Chen
- Katherine Shu-Min Li
The Physical Unclonable Function (PUF) has been proposed for the identification and authentication of devices and for cryptographic key generation. A strong PUF provides an extremely large number of device-specific challenge-response pairs (CRPs) which can be used for identification. Unfortunately, the CRP mechanism is vulnerable to modeling attacks, which use machine learning (ML) algorithms to predict PUF responses with high accuracy. Many methods have been developed to strengthen strong PUFs with complicated hardware; however, recent studies show that they are still vulnerable to attacks that leverage GPU-accelerated ML algorithms.
In this paper, we propose to deal with the problem from a different approach. With a slightly modified CRP mechanism, a PUF can provide poison data such that an accurate model of the PUF under attack cannot be built by ML algorithms. Experimental results show that the proposed method provides an effective countermeasure against modeling attacks on PUF. In addition, the proposed method is compatible with hardware strengthening schemes to provide even better protection for PUFs.
- Darshana Jayasinghe
- Aleksandar Ignjatovic
- Sri Parameswaran
Random execution time-based countermeasures against power analysis attacks have lower resource overheads than power-balancing and masking countermeasures. Previous randomization countermeasures use either a small number of clock frequencies or delays to randomize the execution. This paper presents a novel random frequency countermeasure (referred to as RFTC) that uses the dynamic reconfiguration ability of the clock managers of Field-Programmable Gate Arrays (FPGAs), such as the Xilinx Mixed-Mode Clock Manager (MMCM), which can change the frequency of operation at runtime. We show for the first time how the Advanced Encryption Standard (AES) block cipher algorithm can be executed using randomly selected clock frequencies (among thousands of frequencies carefully chosen) generated within the FPGA to mitigate power analysis attack vulnerabilities. To test the effectiveness of the proposed clock randomization, Correlation Power Analysis (CPA) attacks are performed on the collected power traces. Power analysis attacks based on preprocessing methods, such as Dynamic Time Warping (DTW), Principal Component Analysis (PCA) and Fast Fourier Transform (FFT), are also performed on the collected traces to test the effective removal of the random execution. Compared to the state of the art, where there were 83 distinct finishing times for each encryption, the method described in this paper can have more than 60,000 distinct finishing times for each encryption, making it resistant against power analysis attacks even when the traces are preprocessed; it is demonstrated to be secure up to four million traces.
- Wenqiang Zhang
- Xiaochen Peng
- Huaqiang Wu
- Bin Gao
- Hu He
- Youhui Zhang
- Shimeng Yu
- He Qian
The RRAM based neural-processing-unit (NPU) is emerging for processing general purpose machine intelligence algorithms with ultra-high energy efficiency, while the imperfections of the analog devices and cross-point arrays make practical applications more complicated. In order to improve the accuracy and robustness of the NPU, device-circuit-algorithm codesign with consideration of the underlying device and array characteristics should outperform the optimization of an individual device or algorithm. In this work, we provide a joint device-circuit-algorithm analysis and propose the corresponding design guidelines. Key innovations include: 1) an end-to-end simulator for the RRAM NPU is developed with an integrated framework from device to algorithm; 2) the complete circuit and architecture design for the RRAM NPU is provided to make the analysis much closer to a real prototype; 3) a large-scale neural network as well as other general-purpose networks are processed to study the device-circuit interaction; 4) accuracy loss from non-idealities of RRAM, such as I-V nonlinearity, noise in the analog resistance levels, voltage drop on the interconnect, and ADC/DAC precision, is evaluated for the NPU design.
- Abdullah Ash-Saki
- Mahabubul Alam
- Swaroop Ghosh
Concerted efforts by academia and industry, e.g., IBM, Google and Intel, have brought us to the era of Noisy Intermediate-Scale Quantum (NISQ) computers. Qubits, the basic elements of a quantum computer, have proven extremely susceptible to different noises. Recent experiments have exhibited spatial variations among the qubits in NISQ hardware. Therefore, conventional qubit mapping done without quality awareness results in a significant loss of fidelity for a given workload. In this paper, we analyze the effects of various noise sources on the overall fidelity of a given workload on real NISQ hardware. We also present a novel optimization technique, namely Qubit Re-allocation (QURE), to maximize the sequence fidelity of a given workload. QURE is scalable and can be applied to future large scale quantum computers. QURE can improve the fidelity of a quantum workload by up to 1.54X (1.39X on average) in simulation and up to 1.7X in a real device compared to variation-oblivious qubit allocation, without incurring any physical overhead.
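The objective QURE maximizes can be illustrated with a toy fidelity model: approximate the success probability of a mapped circuit as the product of per-gate success probabilities and compare candidate allocations. The error rates, device, and the two candidate mappings below are made-up placeholders.

```python
# Hypothetical per-physical-qubit single-gate error rates and per-link
# two-qubit (CNOT) error rates of a small NISQ device.
single_err = {0: 0.001, 1: 0.004, 2: 0.002, 3: 0.010}
cnot_err = {(0, 1): 0.02, (1, 2): 0.05, (2, 3): 0.08, (0, 2): 0.03}

def sequence_fidelity(program, mapping):
    """Product of per-gate success probabilities under a logical->physical map."""
    fid = 1.0
    for gate, qubits in program:
        phys = tuple(sorted(mapping[q] for q in qubits))
        if gate == "cx":
            fid *= 1.0 - cnot_err[phys]
        else:
            fid *= 1.0 - single_err[phys[0]]
    return fid

program = [("h", (0,)), ("cx", (0, 1)), ("cx", (1, 2)), ("h", (2,))]
naive = {0: 1, 1: 2, 2: 3}      # variation-oblivious allocation
better = {0: 0, 1: 1, 2: 2}     # allocation avoiding the weakest qubits/links
print(sequence_fidelity(program, naive), sequence_fidelity(program, better))
```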
- Robert Wille
- Lukas Burgholzer
- Alwin Zulehner
The recent progress in the physical realization of quantum computers (the first publicly available ones—IBM’s QX architectures—were launched in 2017) has motivated research on automatic methods that aid users in running quantum circuits on them. Here, certain physical constraints given by the architectures, which restrict the allowed interactions of the involved qubits, have to be satisfied. Thus far, this has been addressed by inserting SWAP and H operations. However, it remains unknown whether existing methods add a minimum number of SWAP and H operations or, if not, how far they are away from that minimum—an NP-complete problem. In this work, we address this by formulating the mapping task as a symbolic optimization problem that is solved using reasoning engines like Boolean satisfiability solvers. In this way, we not only provide a method that maps quantum circuits to IBM’s QX architectures with a minimal number of SWAP and H operations, but also show by experimental evaluation that the number of operations added by IBM’s heuristic solution exceeds the lower bound by more than 100% on average. An implementation of the proposed methodology is publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping.
- Xingyi Liu
- Keshab K. Parhi
This paper describes a novel approach to synthesize molecular reactions to compute a radial basis function (RBF) support vector machine (SVM) kernel. The approach is based on fractional coding where a variable is represented by two molecules. The synergy between fractional coding in molecular computing and stochastic logic implementations in electronic computing is key to translating known stochastic logic circuits to molecular computing. Although inspired by prior stochastic logic implementation of the RBF-SVM kernel, the proposed molecular reactions require non-obvious modifications. This paper introduces a new explicit bipolar-to-unipolar molecular converter for intermediate format conversion. Two designs are presented; one is based on the explicit and the other is based on implicit conversion from prior stochastic logic. When 5 support vectors are used, it is shown that the DNA RBF-SVM realized using the explicit format conversion has orders of magnitude less regression error than that based on implicit conversion.
- Shaahin Angizi
- Jiao Sun
- Wei Zhang
- Deliang Fan
Classified as a complex big data analytics problem, DNA short read alignment is a major sequential bottleneck for the massive amounts of data generated by next-generation sequencing platforms. With Von-Neumann computing architectures struggling to address such a computationally-expensive and memory-intensive task today, Processing-in-Memory (PIM) platforms are gaining growing interest. In this paper, an energy-efficient and parallel PIM accelerator (AlignS) is proposed to execute DNA short read alignment based on an optimized and hardware-friendly alignment algorithm. We first develop the AlignS platform, which harnesses SOT-MRAM as computational memory and transforms it into a fundamental processing unit for short read alignment. Accordingly, we present a novel, customized, highly parallel read alignment algorithm that requires only simple and parallel in-memory operations (i.e., comparisons and additions). AlignS is then optimized through a new correlated data partitioning and mapping methodology that allows local storage and processing of DNA sequences to fully exploit the algorithm-level parallelism, and to accelerate both exact and inexact matches. The device-to-architecture co-simulation results show that AlignS improves the short read alignment throughput per Watt per mm^2 by ~12× compared to an ASIC accelerator. Compared to a recent FM-index-based ReRAM platform, AlignS achieves 1.6× higher throughput per Watt.
- Xing Huang
- Tsung-Yi Ho
- Wenzhong Guo
- Bing Li
- Ulf Schlichtmann
Recent advances in continuous-flow microfluidics have enabled highly integrated lab-on-a-chip biochips. These chips can execute complex biochemical applications precisely and efficiently within a tiny area, but they require a large number of control ports and the corresponding control logic to generate the required pressure patterns for flow control, which, consequently, offsets their advantages and prevents their wide adoption. In this paper, we propose MiniControl, the first synthesis flow for continuous-flow microfluidic biochips (CFMBs) under strict constraints on control ports, incorporating high-level synthesis and physical design simultaneously, which has never been considered in previous work. With the maximum number of allowed control ports specified in advance, this synthesis flow generates a biochip architecture with high execution efficiency. Moreover, the overall cost of a CFMB can be reduced and the tradeoff between control logic and the execution efficiency of biochemical applications can be evaluated for the first time. Experimental results demonstrate that MiniControl leads to high execution efficiency and low overall platform cost, while strictly satisfying the given control port constraint.
- Ran Chen
- Wei Zhong
- Haoyu Yang
- Hao Geng
- Xuan Zeng
- Bei Yu
As the circuit feature size continuously shrinks down, hotspot detection has become a more challenging problem in modern DFM flows. Recently developed deep learning techniques have shown their advantages on hotspot detection tasks. However, existing hotspot detectors only accept small layout clips as input, with potential defects occurring at the center region of each clip, which is time-consuming and wastes computational resources when dealing with large full-chip layouts. In this paper, we develop a new end-to-end framework that can detect multiple hotspots in a large region at a time and promises better hotspot detection performance. We design a joint auto-encoder and inception module for efficient feature extraction. A two-stage classification and regression flow is proposed to efficiently locate hotspot regions roughly and conduct the final prediction with better accuracy and a lower false alarm penalty. Experimental results show that our framework enables a significant speed improvement over existing methods with higher accuracy and fewer false alarms.
- Yiyang Jiang
- Fan Yang
- Hengliang Zhu
- Bei Yu
- Dian Zhou
- Xuan Zeng
Layout hotspot detection is of great importance in the physical verification flow. Deep neural network models have been applied to hotspot detection and have achieved great success. Since layouts can be viewed as binary images, binarized neural networks are a natural fit for the hotspot detection problem. In this paper we propose a new deep learning architecture based on binarized neural networks (BNNs) to speed up the neural networks in hotspot detection. A new binarized residual neural network is carefully designed for hotspot detection. Experimental results on ICCAD 2012 Contest benchmarks show that our architecture outperforms all previous hotspot detectors in detection accuracy and has an 8x speedup over the best deep learning-based solution.
- Haoyu Yang
- Piyush Pathak
- Frank Gennari
- Ya-Chieh Lai
- Bei Yu
VLSI layout patterns provide critical resources for various design-for-manufacturability research, from early technology node development to back-end design and sign-off flows. However, a diverse layout pattern library is not always available due to the long logic-to-chip design cycle, which slows down technology node development. To address this issue, in this paper, we explore the capability of generative machine learning models to synthesize layout patterns. A transforming convolutional auto-encoder is developed to learn vector-based instantiations of squish pattern topologies. We show our framework can capture simple design rules and contributes to enlarging the existing squish topology space under certain transformations. Geometry information of each squish topology is obtained from an associated linear system derived from design rule constraints. Experiments on 7nm EUV designs show that our framework can generate diverse pattern libraries with DRC-clean patterns more effectively than a state-of-the-art industrial layout pattern generator.
- Mohamed Baker Alawieh
- Yibo Lin
- Zaiwei Zhang
- Meng Li
- Qixing Huang
- David Z. Pan
As the integrated circuits (IC) technology continues to scale, resolution enhancement techniques (RETs) are mandatory to obtain high manufacturing quality and yield. Among various RETs, sub-resolution assist feature (SRAF) generation is a key technique to improve the target pattern quality and lithographic process window. While model-based SRAF insertion techniques have demonstrated high accuracy, they usually suffer from high computational cost. Therefore, more efficient techniques that can achieve high accuracy while reducing runtime are in strong demand. In this work, we leverage the recent advancement in machine learning for image generation to tackle the SRAF insertion problem. In particular, we propose a new SRAF insertion framework, GAN-SRAF, which uses conditional generative adversarial networks (CGANs) to generate SRAFs directly for any given layout. Our proposed approach incorporates a novel layout to image encoding using multi-channel heatmaps to preserve the layout information and facilitate layout reconstruction. Our experimental results demonstrate ~14.6× reduction in runtime when compared to the previous best machine learning approach for SRAF generation, and ~144× reduction compared to model-based approach, while achieving comparable quality of results.
- Xiao Shi
- Hao Yan
- Qiancun Huang
- Jiajia Zhang
- Longxing Shi
- Lei He
“Curse of dimensionality” has become the major challenge for existing high-sigma yield analysis methods. In this paper, we develop a meta-model using Low-Rank Tensor Approximation (LRTA) to substitute expensive SPICE simulation. The polynomial degree of our LRTA model grows linearly with circuit dimension. This makes it especially promising for high-dimensional circuit problems. Our LRTA meta-model is solved efficiently with a robust greedy algorithm, and calibrated iteratively with an adaptive sampling method. Experiments on bit cell and SRAM column validate that proposed LRTA method outperforms other state-of-the-art approaches in terms of accuracy and efficiency.
- Yi-Ting Lin
- Iris Hui-Ru Jiang
Directed self-assembly (DSA) is one of the leading candidates for extending the resolution of optical lithography to sub-7nm and beyond. By incorporating DSA in multiple patterning lithography (DSA-MP), the flexibility and resolution of contact/via patterning can be further enhanced by using multiple block copolymer (BCP) materials. Prior work faces a dilemma between solution quality and efficiency and is unable to handle 2D templates. In this paper, we capture the essence of template and mask assignment in DSA-MP by a new graph model and a new problem reduction: our graph model explicitly represents spacing conflict edges and template hyperedges; thus, extra enumeration and manipulation of incompatible via grouping edges can be avoided, and arbitrary 1D/2D templates can be natively handled. We further reduce the assignment problem to exact cover, which is encoded by a sparse matrix. Our concise integer linear programming (ILP) formulation and fast backtracking heuristic achieve substantially superior solution quality and efficiency compared to the state-of-the-art work. Moreover, our method is flexible and extendible to utilize dummy vias to improve manufacturability.
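Exact cover, the reduction target, asks for a set of rows of a 0/1 matrix whose ones cover every column exactly once. A compact backtracking sketch in the spirit of Knuth's Algorithm X (without dancing links, and without the paper's template encoding) is shown below.

```python
def exact_cover(universe, subsets):
    """Return names of subsets covering every element exactly once, or None.

    subsets: dict name -> set of covered elements. Classic backtracking in
    the spirit of Knuth's Algorithm X, without the dancing-links optimization.
    """
    if not universe:
        return []
    # Branch on the element with the fewest covering subsets (fail fast).
    element = min(universe, key=lambda e: sum(e in s for s in subsets.values()))
    for name, chosen in subsets.items():
        if element not in chosen:
            continue
        # Keep only subsets disjoint from the chosen one.
        remaining = {n: s for n, s in subsets.items()
                     if n != name and not (s & chosen)}
        sub = exact_cover(universe - chosen, remaining)
        if sub is not None:
            return [name] + sub
    return None

U = {1, 2, 3, 4, 5, 6, 7}
S = {"A": {1, 4, 7}, "B": {1, 4}, "C": {4, 5, 7},
     "D": {3, 5, 6}, "E": {2, 3, 6, 7}, "F": {2, 7}}
print(exact_cover(U, S))  # ['B', 'D', 'F'] covers 1..7 exactly once
```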
- Marten Lohstroh
- Martin Schoeberl
- Andrés Goens
- Armin Wasicek
- Christopher Gill
- Marjan Sirjani
- Edward A. Lee
Programming time-critical systems is notoriously difficult. In this paper we propose an actor-oriented programming model with a semantic notion of time and a deterministic coordination semantics based on discrete events to exercise precise control over both the computational and timing aspects of the system behavior.
We present two contrasting approaches to achieving time predictability in the embedded compute engine, the basic building block of any Internet of Things (IoT) or Cyber-Physical System (CPS). The traditional approach offers predictability on top of unpredictable processors that include numerous optimizations for enhanced performance and programmability at the cost of huge variability in timing. Approaches such as Worst-Case Execution Time (WCET) analysis of software have been struggling to model the complex timing behavior of the underlying processor to provide guarantees. On the other hand, the inevitable slowdown of Moore’s Law and the end of Dennard scaling have curtailed the performance and energy scaling of processors. This stagnation, in conjunction with the importance of cognitive computing, has motivated the widespread adoption of non-von Neumann accelerators and architectures. We argue that these emerging architectures are inherently time-predictable, as they depend on software to orchestrate the computation and data movement, and are an excellent match for real-time processing needs.
- Benoît Dupont de Dinechin
The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. This problem has been addressed by suitably designing the architecture, implementation, and programming models, of the Kalray MPPA (Multi-Purpose Processor Array) family of single-chip many-core processors. We introduce the third-generation MPPA processor, whose key features are motivated by the high-performance and high-integrity functions of automated vehicles. High-performance computing functions, represented by deep learning inference and by computer vision, need to execute under soft real-time constraints. High-integrity functions are developed under model-based design, and must meet hard real-time constraints. Finally, the third-generation MPPA processor integrates a hardware root of trust, and its security architecture is able to support a security kernel for implementing the trusted execution environment functions required by applications.
- Sui Chen
- Faen Zhang
- Lei Liu
- Lu Peng
Non-volatile Random-Access Memories (NVRAM) have emerged in recent years to bridge the performance gap between the main memory and external storage devices. To utilize the non-volatility of NVRAMs, programs should allow durable stores, meaning consistency must be maintained during a power loss event. GPUs are designed for high throughput, leveraging high degrees of parallelism. However, with lower NVRAM write bandwidths compared to that of DRAMs, using NVRAM as-is may yield suboptimal overall system performance. To address this problem, we propose using Helper Warps to move persistence out of the critical path of transaction execution, alleviating the impact of latencies. Our mechanism achieves speedups of 4.4 and 1.5 under bandwidth limits of 1.6 GB/s and 12 GB/s, respectively, and is projected to maintain its speed advantage even when NVRAM bandwidth reaches hundreds of GB/s in certain cases.
- Jie Zhang
- Miryeong Kwon
- Hyojong Kim
- Hyesoon Kim
- Myoungsoo Jung
We propose FlashGPU, a new GPU architecture that tightly blends new flash (Z-NAND) with massive GPU cores. Specifically, we replace global memory with Z-NAND that exhibits ultra-low latency. We also architect a flash core to manage request dispatches and address translations underneath L2 cache banks of GPU cores. While Z-NAND is a hundred times faster than conventional 3D-stacked flash, its latency is still longer than DRAM. To address this shortcoming, we propose a dynamic page-placement and buffer manager in Z-NAND subsystems by being aware of bulk and parallel memory access characteristics of GPU applications, thereby offering high-throughput and low-energy consumption behaviors.
- Shuo Huai
- Weining Song
- Mengying Zhao
- Xiaojun Cai
- Zhiping Jia
Field programmable gate arrays (FPGAs) have been widely adopted in both high-performance servers and embedded systems. Since static random access memory (SRAM) has limited density and comparatively high leakage power, researchers have proposed FPGA architectures based on emerging non-volatile memories (NVMs) to satisfy the requirements of data-intensive and low-power applications. Block RAM is the on-chip memory of FPGAs; when implemented with NVM, it faces the challenge of limited endurance. Traditional wear leveling strategies cannot be directly applied to block RAM because they may induce large performance overhead. In this paper, we propose a performance-aware wear leveling scheme for block RAM in FPGAs to improve its lifetime. The placement strategy is improved by injecting wear leveling guidance. The evaluation shows that a 29.75% lifetime enhancement is achieved together with a 16.32% performance improvement, compared with traditional wear leveling.
- Zheng Liang
- Guangyu Sun
- Wang Kang
- Xing Chen
- Weisheng Zhao
Data insertion and deletion are common operations in various applications. However, traditional memory architectures can only perform an indirect insertion/deletion with multiple data read and write operations, which is significantly time- and energy-consuming. To mitigate this problem, we propose to leverage the unique capability of the emerging skyrmion racetrack memory technology, which can naturally support direct insertion/deletion operations inside a racetrack. In this work, we first present a circuit-level model for skyrmion racetrack memory. Then, we further propose a novel memory architecture to enable efficient large-size data insertion/deletion. With the help of the model and the architecture, we study several potential applications that leverage the insertion and deletion operations. Experimental results demonstrate that the efficiency of these operations can be substantially improved.
- Mohsen Imani
- Alice Sokolova
- Ricardo Garcia
- Andrew Huang
- Fan Wu
- Baris Aksanli
- Tajana Rosing
In a data-hungry world, approximate computing has emerged as one of the solutions for creating more energy-efficient and faster systems, while providing application-tailored quality. In this paper, we propose ApproxLP, an Approximate Multiplier based on Linear Planes. We introduce an iterative method for approximating the product of two operands using fitted linear functions with two inputs, referred to as linear planes. The linearization of multiplication allows multiplication operations to be completely replaced with weighted addition. The proposed technique is used to find the significand of the product of two floating point numbers, decreasing the high energy cost of floating point arithmetic. Our method fully exploits the trade-off between accuracy and energy consumption by offering various degrees of approximation at different energy costs. As the level of approximation increases, the approximated product asymptotically approaches the exact product in an iterative manner. The performance of ApproxLP is evaluated over a range of multimedia and machine learning applications. A GPU enhanced by ApproxLP yields significant energy-delay product (EDP) improvement. For multimedia, neural network, and hyperdimensional computing applications, ApproxLP offers on average 2.4×, 2.7×, and 4.3× EDP improvement respectively with sufficient computational quality for the application. ApproxLP also provides up to 4.5× EDP improvement and has 2.3× lower chip area than other state-of-the-art approximate multipliers.
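The core idea, replacing x·y by a fitted plane a·x + b·y + c so that multiplication becomes weighted addition, can be reproduced with a least-squares fit over the unit square; the single-plane sketch below omits ApproxLP's iterative domain refinement.

```python
import numpy as np

# Fit x*y over [0, 1]^2 with a single linear plane a*x + b*y + c.
xs, ys = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
x, y, z = xs.ravel(), ys.ravel(), (xs * ys).ravel()
A = np.column_stack([x, y, np.ones_like(x)])
(a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)

def approx_mult(u, v):
    """Multiplication replaced by weighted addition (one-plane approximation)."""
    return a * u + b * v + c

print(np.round([a, b, c], 3))            # the best single plane is ~[0.5, 0.5, -0.25]
err = np.abs(approx_mult(x, y) - z)
print("mean |error|:", err.mean(), "max |error|:", err.max())  # worst case at the corners
```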
- Vasileios Leon
- Konstantinos Asimakopoulos
- Sotirios Xydis
- Dimitrios Soudris
- Kiamal Pekmestzi
Approximate computing appears as an emerging and promising solution for energy-efficient system designs, exploiting the inherent error-tolerant nature of various applications. In this paper, targeting multiplication circuits, i.e., the energy-hungry counterpart of hardware accelerators, an extensive exploration of the error–energy trade-off, when combining arithmetic-level approximation techniques, is performed for the first time. Arithmetic-aware approximations deliver significant energy reductions, while allowing to control the error values with discipline by setting accordingly a configuration parameter. Inspired from the promising results of prior works with one configuration parameter, we propose 5 hybrid design families for approximate and energy-friendly hardware multipliers, consisting of two independent parameters to tune the approximation levels. Interestingly, the resolution of the state-of-the-art Pareto diagram is improved, giving the flexibility to achieve better energy gains for a specific error constraint imposed by the system. Moreover, we outperform prior works in the field of approximate multipliers by up to 60% energy reduction, and thus, we define the new Pareto front.
- Hassaan Saadat
- Haris Javaid
- Sri Parameswaran
We propose approximate dividers with near-zero error bias for both integer and floating-point numbers. The integer divider, INZeD, is designed using a novel, analytically deduced error-correction method in an approximate log based divider. The floating-point divider, FaNZeD, is based on a highly optimized mantissa divider that is inspired by INZeD. Both of the dividers are error configurable.
Our results show that the INZeD dividers have error bias in the range of 0.01-4.4% with area-delay product improvement of 25× – 95× and power improvement of 4.7× – 15× when compared to the accurate integer divider. Likewise, compared to IEEE single-precision floating-point divider, FaNZeD dividers offer up to 985× area-delay product and 77× power improvements with error bias in the range of 0.04-2.2%. Most importantly, using our FaNZeD dividers, floating-point arithmetic can be more resource-efficient than fixed-point arithmetic because most of the FaNZeD dividers are even smaller and have better area-delay product than the 8-bit and 16-bit accurate integer dividers. Finally, our dividers show negligible effect on the output quality when evaluated with AlexNet and JPEG compression applications.
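The starting point of such designs, Mitchell-style logarithmic arithmetic, treats n = 2^k(1+f) as having log2(n) ≈ k + f, so a quotient becomes the antilog of a difference. The sketch below implements this plain log-based divider and measures its error bias empirically; INZeD's analytically deduced correction term is not included.

```python
import random

def mitchell_log2(n):
    """Approximate log2(n) as k + f for n = 2**k * (1 + f), 0 <= f < 1."""
    k = n.bit_length() - 1
    return k + (n / (1 << k) - 1)

def mitchell_antilog2(v):
    """Approximate 2**v as 2**k * (1 + f) with k = floor(v), f = v - k."""
    k = int(v // 1)
    return (2.0 ** k) * (1 + (v - k))

def approx_div(a, b):
    return mitchell_antilog2(mitchell_log2(a) - mitchell_log2(b))

random.seed(1)
pairs = [(random.randint(1, 1 << 16), random.randint(1, 1 << 16)) for _ in range(10000)]
rel_err = [(approx_div(a, b) - a / b) / (a / b) for a, b in pairs]
print("mean relative error (bias): %.3f%%" % (100 * sum(rel_err) / len(rel_err)))
```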
Stochastic Computing (SC) is designed to minimize hardware area and power consumption compared to traditional binary-encoded computation, stemming from the bit-serial data representation and extremely straightforward logic. Though existing Stochastic Computing Units mostly assume uncorrelated bit streams, recent works find that correlation can be exploited for higher accuracy. We propose novel architectures for SC division and square root, which leverage correlation via low-cost in-stream mechanisms that eliminate expensive bit stream regeneration. We also introduce new metrics to better evaluate SC circuits relying on equilibrium via feedback loops. Experiments indicate that our division converges 46.3% faster with both 43.3% lower error and 45.6% less area.
- Fuxun Yu
- Zirui Xu
- Chenchen Liu
- Xiang Chen
Benefiting from the recent evolution of artificial intelligence, Automatic Speech Recognition (ASR) technology has achieved enormous performance improvements and wider application. Unfortunately, ASR is also heavily leveraged for speech eavesdropping, where ASR is used to translate large volumes of intercepted vocal speech into text content, causing considerable information leakage. In this work, we propose MASKER — a mobile security enhancement solution to protect mobile speech data from ASR-based eavesdropping. By identifying a ubiquitous vulnerability of ASR models, MASKER is designed to inject human-imperceptible adversarial noise into real-time speech on the mobile device (e.g., phone calls and voice messages). Even if the speech data is exposed to eavesdropping during data transmission, the adversarial noise can effectively perturb the ASR process, causing a significant Word Error Rate (WER). Meanwhile, MASKER is further optimized for mobile user perception quality and enhanced for adaptation to environmental noise. Moreover, MASKER has outstanding computational efficiency for mobile system integration. Experiments show that MASKER can achieve security enhancement with an average WER of 84.55% for ASR perturbation, 32% noise reduction for user perception quality, and 16× faster processing speed compared to the state-of-the-art method.
- Sai Manoj Pudukotai Dinakarrao
- Sairaj Amberkar
- Sahil Bhat
- Abhijitt Dhavlle
- Hossein Sayadi
- Avesta Sasan
- Houman Homayoun
- Setareh Rafatirad
To overcome the performance overhead incurred by traditional software-based malware detection techniques, Hardware-assisted Malware Detection (HMD) using machine learning (ML) classifiers has emerged as a promising way to detect malicious applications and secure systems. To classify benign and malicious applications, HMD primarily relies on low-level microarchitectural events captured through Hardware Performance Counters (HPCs). This work crafts an adversarial attack on HMD systems that tampers with their security by introducing perturbations into the HPC traces with the aid of an adversarial sample generator application. To craft the attack, we first deploy an adversarial sample predictor to predict the adversarial HPC pattern that would cause a given application to be misclassified by the ML classifier deployed in the HMD. Further, since the attacker has no direct access to manipulate the HPCs generated at runtime, we devise, based on the output of the adversarial sample predictor, an adversarial sample generator wrapped around a normal application to produce HPC patterns similar to the predicted adversarial HPC trace. As the crafted adversarial sample generator application does not contain any malicious operations, it is not detectable by traditional signature-based malware detection solutions. With the proposed attack, malware detection accuracy is reduced from 82.76% to 18.04%.
- Pu Zhao
- Siyue Wang
- Cheng Gongye
- Yanzhi Wang
- Yunsi Fei
- Xue Lin
Despite the great achievements of deep neural networks (DNNs), the vulnerability of state-of-the-art DNNs raises security concerns in many application domains requiring high reliability. We propose the fault sneaking attack on DNNs, where the adversary aims to misclassify certain input images into any target labels by modifying the DNN parameters. We apply ADMM (alternating direction method of multipliers) for solving the optimization problem of the fault sneaking attack with two constraints: 1) the classification of the other images should be unchanged, and 2) the parameter modifications should be minimized. Specifically, the first constraint requires us not only to inject designated faults (misclassifications), but also to hide the faults for stealthy or sneaking considerations by maintaining model accuracy. The second constraint requires us to minimize the parameter modifications (using the ℓ0 norm to measure the number of modifications and the ℓ2 norm to measure their magnitude). Comprehensive experimental evaluation demonstrates that the proposed framework can inject multiple sneaking faults without losing the overall test accuracy performance.
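Stated as an optimization problem in generic notation (the symbols below are placeholders, not the paper's exact formulation), the fault sneaking attack searches for modified parameters that stay as close as possible to the original ones, measured in both the ℓ0 and ℓ2 senses, while forcing the chosen misclassifications and leaving the remaining predictions untouched; ADMM is then applied to this constrained problem.

```latex
\min_{\hat{\theta}}\;
  \lambda_0\,\lVert \hat{\theta}-\theta \rVert_0
  + \lambda_2\,\lVert \hat{\theta}-\theta \rVert_2^2
\quad \text{s.t.}\quad
  f(x_i;\hat{\theta}) = t_i \;\; \forall i \in \mathcal{S}_{\mathrm{target}},
\qquad
  f(x_j;\hat{\theta}) = f(x_j;\theta) \;\; \forall j \in \mathcal{S}_{\mathrm{keep}}
```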
- Kanad Basu
- Rana Elnaggar
- Krishnendu Chakrabarty
- Ramesh Karri
Anti-virus software (AVS) tools are used to detect Malware in a system. However, software-based AVS are vulnerable to attacks. A malicious entity can exploit these vulnerabilities to subvert the AVS. Recently, hardware components such as Hardware Performance Counters (HPC) have been used for Malware detection. In this paper, we propose PREEMPT, a zero overhead, high-accuracy and low-latency technique to detect Malware by re-purposing the embedded trace buffer (ETB), a debug hardware component available in most modern processors. The ETB is used for post-silicon validation and debug and allows us to control and monitor the internal activities of a chip, beyond what is provided by the Input/Output pins. PREEMPT combines these hardware-level observations with machine learning-based classifiers to preempt Malware before it can cause damage. There are many benefits of re-using the ETB for Malware detection. It is difficult to hack into hardware compared to software, and hence, PREEMPT is more robust against attacks than AVS. PREEMPT does not incur performance penalties. Finally, PREEMPT has a high True Positive value of 94% and maintains a low False Positive value of 2%.
- Jiankang Ren
- Xiaoyan Su
- Guoqi Xie
- Chao Yu
- Guozhen Tan
- Guowei Wu
Multiprocessor platforms have been widely applied in safety-critical domains to accommodate the increasing computation requirement of modern real-time applications. In this paper, we present a workload-aware harmonic partitioned multiprocessor scheduling scheme for periodic real-time tasks with constrained deadlines under the fixed-priority preemptive scheduling policy. In particular, two grouping metrics effectively integrating both harmonicity and workload characteristic are designed to guide our task partition. With those metrics, our scheme can greatly improve system utilization by taking advantage of the combination of harmonic relationship exploration and workload awareness. Experiments show that our proposed scheme significantly outperforms existing approaches in terms of schedulability.
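As background for the harmonicity side of the grouping metrics, a task group is harmonic when every period divides every larger period, which is the property that lets such groups be scheduled at high utilization. A minimal check (not the paper's combined harmonicity/workload metric) looks like this:

```python
def is_harmonic(periods) -> bool:
    """True if every period divides every larger one (checking sorted
    neighbours suffices because divisibility is transitive)."""
    ps = sorted(periods)
    return all(ps[i + 1] % ps[i] == 0 for i in range(len(ps) - 1))

print(is_harmonic([10, 20, 40]))   # True:  a harmonic group
print(is_harmonic([10, 15, 30]))   # False: 15 is not a multiple of 10
```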
- Meng Xu
- Robert Gifford
- Linh Thi Xuan Phan
This paper presents vC2M, a holistic multi-resource allocation framework for real-time multicore virtualization. vC2M integrates shared cache allocation with memory bandwidth regulation to mitigate interference among concurrent tasks, thus providing better timing isolation among tasks and VMs. It reduces the abstraction overhead through task and VCPU release synchronization and through VCPU execution regulation, and it further introduces novel resource allocation algorithms that consider CPU, cache, and memory bandwidth altogether to optimize resources. Evaluations on our prototype show that vC2M can be implemented with minimal overhead, and that it substantially improves schedulability over existing solutions.
- Mina Niknafs
- Ivan Ukhov
- Petru Eles
- Zebo Peng
Modern embedded platforms need sophisticated resource managers in order to utilize the heterogeneous computational resources efficiently. Moreover, such platforms are exposed to fluctuating workloads unpredictable at design time. In such a context, predicting the incoming workload might improve the efficiency of resource management. But is this true? And, if yes, how significant is this improvement? How accurate does the prediction need to be in order to improve decisions instead of doing harm? By proposing a prediction-based resource manager aimed at minimizing energy consumption while meeting task deadlines and by running extensive experiments, we try to answer the above questions.
- Francesco Barchi
- Gianvito Urgese
- Enrico Macii
- Andrea Acquaviva
Modern heterogeneous platforms require compilers capable of choosing the appropriate device for the execution of program portions. This paper presents a machine learning method that supports mapping decisions by analyzing the program source code represented in the LLVM assembly language (IR), exploiting the advantages offered by this generalised and optimised representation. To evaluate our solution, we trained an LSTM neural network on OpenCL kernels compiled to LLVM-IR and processed with our tokenizer, which filters out less-informative tokens. The trained network reaches an accuracy of 85% in identifying the best computational unit.
- Ganapati Bhat
- Kunal Bagewadi
- Hyung Gyu Lee
- Umit Y. Ogras
The use of wearable and mobile devices for health and activity monitoring is growing rapidly. These devices need to maximize their accuracy and active time under a tight energy budget imposed by battery and form-factor constraints. This paper considers energy harvesting devices that run on a limited energy budget to recognize user activities over a given period. We propose a technique to co-optimize the accuracy and active time by utilizing multiple design points with different energy-accuracy trade-offs. The proposed technique switches between these design points at runtime to maximize a generalized objective function under tight harvested energy budget constraints. We evaluate our approach experimentally using a custom hardware prototype and 14 user studies. It achieves 46% higher expected accuracy and 66% longer active time compared to the highest performance design point.
- Yue Xu
- Hyung Gyu Lee
- Yujuan Tan
- Yu Wu
- Xianzhang Chen
- Liang Liang
- Lei Qiao
- Duo Liu
Energy harvesting technology has been widely adopted in embedded systems. However, an unstable energy source results in unsteady operation. In this paper, we devise a long-term energy-efficient task scheduling scheme targeting solar-powered sensor nodes. The proposed method combines reinforcement learning with a solar energy prediction method to maximize energy efficiency, which ultimately enhances the long-term quality of service (QoS) of the sensor nodes. Experimental results show that the proposed scheduling improves energy efficiency by 6.0% on average and achieves a 54.0% better QoS level compared with a state-of-the-art task scheduling algorithm.
- Pramesh Pandey
- Prabal Basu
- Koushik Chakraborty
- Sanghamitra Roy
The emergence of hardware accelerators has brought several orders of magnitude improvement in the speed of deep neural network (DNN) inference. Among such DNN accelerators, the Google Tensor Processing Unit (TPU) has transpired to be the best in class, offering more than 15× speedup over contemporary GPUs. However, the rapid growth of DNN workloads conspires to escalate the energy consumption of TPU-based data centers. In order to restrict the energy consumption of TPUs, we propose GreenTPU, a low-power near-threshold computing (NTC) TPU design paradigm. To ensure high inference accuracy at low-voltage operation, GreenTPU identifies patterns in the error-causing activation sequences in the systolic array and prevents further timing errors from the same sequence by intermittently boosting the operating voltage of the specific multiplier-and-accumulator units in the TPU. Compared to a cutting-edge timing error mitigation technique for TPUs, GreenTPU enables 2×–3× higher performance in an NTC TPU, with minimal loss in prediction accuracy.
- Minxuan Zhou
- Mohsen Imani
- Saransh Gupta
- Tajana Rosing
Recently, Processing-In-Memory (PIM) techniques exploiting resistive RAM (ReRAM) have been used to accelerate various big data applications. ReRAM-based in-memory search is a powerful operation which efficiently finds required data in a large data set. However, such operations draw a large amount of current, which may create serious thermal issues, especially in state-of-the-art 3D stacked chips. Therefore, designing PIM accelerators based on in-memory search requires careful consideration of temperature. In this work, we propose static and dynamic techniques to optimize the thermal behavior of PIM architectures running intensive in-memory search operations. Our experiments show the proposed design significantly reduces the peak chip temperature and dynamic management overhead. We test our proposed design on two important categories of applications which benefit from search-based PIM acceleration: hyper-dimensional computing and database query. Experiments show that the proposed method can reduce the steady-state temperature by at least 15.3 °C, which extends the lifetime of the ReRAM device by 57.2% on average. Furthermore, the proposed fine-grained dynamic thermal management provides a 17.6% performance improvement over state-of-the-art methods.
- Jeff Jun Zhang
- Kang Liu
- Faiq Khalid
- Muhammad Abdullah Hanif
- Semeen Rehman
- Theocharis Theocharides
- Alessandro Artussi
- Muhammad Shafique
- Siddharth Garg
Machine learning, in particular deep learning, is being used in almost all aspects of life to assist humans, especially in mobile and Internet of Things (IoT)-based applications. Due to its state-of-the-art performance, deep learning is also being employed in safety-critical applications, for instance, autonomous vehicles. Reliability and security are two of the key required characteristics for these applications because of the impact they can have on human life. Towards this, in this paper, we highlight the current progress, challenges, and research opportunities in the domain of robust systems for machine learning-based applications.
- Giulio Zizzo
- Chris Hankin
- Sergio Maffeis
- Kevin Jones
Machine learning systems have had enormous success in a wide range of fields, from computer vision and natural language processing to anomaly detection. However, such systems are vulnerable to attackers who can cause deliberate misclassification by introducing small perturbations. With machine learning systems being proposed for cyber attack detection, such attackers are cause for serious concern. Despite this, the vast majority of adversarial machine learning security research is focused on the image domain. This work gives a brief overview of adversarial machine learning and of machine learning used in cyber attack detection, highlights key differences between the traditional image domain of adversarial machine learning and the cyber domain, and finally demonstrates an adversarial machine learning attack on an industrial control system.
- Kun Wu
- Guohao Dai
- Xing Hu
- Shuangchen Li
- Xinfeng Xie
- Yu Wang
- Yuan Xie
Blockchain applications have shown huge potential in various domains. Proof of Work (PoW) is the key procedure in blockchain applications; it exhibits a memory-bound characteristic that hinders the performance improvement of blockchain accelerators. In order to mitigate the "memory wall" and improve the performance of memory-hard PoW accelerators, using Ethash as an example, we optimize the memory architecture from two perspectives: 1) hiding memory latency, by proposing a specialized context switch design to overcome the uncertain cycles of repetitive memory requests; and 2) increasing memory bandwidth utilization, by introducing on-chip memory that stores a portion of the Ethash directed acyclic graph (DAG) for larger effective memory bandwidth, and further adopting embedded NOR flash to fulfill that role. We then conduct extensive experiments to explore the design space of our optimized memory architecture for Ethash, including the number of hash cores and the on-chip/off-chip memory technologies and specifications. Based on this design space exploration, we provide guidance for designing memory-bound PoW accelerators. The experimental results show that our optimized designs achieve 8.7%–55% higher hash rate and 17%–120% higher hash rate per Joule compared with the baseline design in different configurations.
- Jinwoo Kim
- Gauthaman Murali
- Heechun Park
- Eric Qin
- Hyoukjun Kwon
- Venkata Chaitanya
- Krishna Chekuri
- Nihar Dasari
- Arvind Singh
- Minah Lee
- Hakki Mert Torun
- Kallol Roy
- Madhavan Swaminathan
- Saibal Mukhopadhyay
- Tushar Krishna
- Sung Kyu Lim
A new trend in complex SoC design is chiplet-based IP reuse using 2.5D integration. In this paper we present a highly integrated design flow that encompasses architecture, circuit, and package to build and simulate heterogeneous 2.5D designs. We chipletize each IP by adding logical protocol translators and physical interface modules. Next, these chiplets are placed and routed on a silicon interposer. Our package models are then used to calculate the PPA and signal/power integrity of the overall system. A design space exploration study using our tool flow shows that 2.5D integration incurs a 2.1× PPA overhead compared with its 2D SoC counterpart.
- Vijeta Rathore
- Vivek Chaturvedi
- Amit K. Singh
- Thambipillai Srikanthan
- Muhammad Shafique
Device scaling into the sub-deca-nanometer regime has made device aging a primary design concern. In manycore systems, inevitable process variation further adds to delay degradation and, coupled with the scalability issues of manycores, makes aging management while meeting performance demands a complex problem. LifeGuard is a performance-centric, reinforcement learning-based task mapping strategy that leverages the differing impact of applications on aging to improve system health. Experimental results on a 256-core system, comparing LifeGuard with two state-of-the-art aging-optimization techniques, show that LifeGuard improved the health of 57% and 74% of the cores, respectively, and also enhanced the aggregate core frequency.
We propose a framework that estimates the error rate experienced by an application as it runs on a timing-speculative processor. The framework uses an instruction error model that is comparable in accuracy to low-level simulations—as it considers the effects of operand values, preceding instructions, datapath configuration, and error correction scheme, as well as process variation, including its spatial correlation property—and yet efficient enough to allow its application in Monte Carlo experiments to characterize large program input datasets. We then use statistical limit theorems to estimate program error rate and quantify the effect of inter-instruction correlations.
- Jintaek Kang
- Dowhan Jung
- Kwanghyun Chung
- Soonhoi Ha
In the design of a neural processor, a cycle-accurate simulator is usually built to estimate the performance before hardware implementation. Since using the simulator to perform design space exploration (DSE) of hardware architecture is quite time consuming, we propose a novel method to use a high-level analytical model for fast DSE. In the model, non-deterministic execution delay is modeled with some parameters whose contribution to the performance is estimated statically by simulation. The viability of the proposed methodology is confirmed with two neural processors with different manycore architectures, achieving 2000 times speed-up within 3% accuracy error, compared with simulator-based DSE.
- Runbin Shi
- Junjie Liu
- Hayden K.-H. So
- Shuo Wang
- Yun Liang
Various models based on Long Short-Term Memory (LSTM) networks have demonstrated state-of-the-art performance in sequential information processing. Previous LSTM-specific architectures provision large on-chip memory for weight storage to alleviate the memory-bound issue and facilitate LSTM inference in cloud computing. In this paper, E-LSTM is proposed for embedded scenarios, taking chip area and limited data-access bandwidth into consideration. The heterogeneous hardware in E-LSTM tightly couples an LSTM co-processor with an embedded RISC-V CPU. The eSELL format is developed to represent the sparse weight matrix. With the proposed cell fusion optimization based on the inherent sparsity in computation, E-LSTM achieves up to 2.2× speedup in processing throughput.
- Zirui Xu
- Fuxun Yu
- Chenchen Liu
- Xiang Chen
Although Deep Neural Network (DNN) techniques have been widely applied, DNN-based applications are still too computationally intensive for resource-constrained mobile devices. Many works have been proposed to optimize DNN computation performance, but most of them are limited to an algorithmic perspective and ignore practical deployment issues. To achieve comprehensive DNN performance enhancement in practice, DNN optimization should closely cooperate with specific hardware and system constraints (i.e., computation capacity, energy cost, memory occupancy, and inference latency). Therefore, in this work, we propose ReForm, a resource-aware DNN optimization framework. Through thorough mobile DNN computing analysis and innovative model reconfiguration schemes (i.e., ADMM-based static model fine-tuning and dynamically selective computing), ReForm can efficiently and effectively reconfigure a pre-trained DNN model for practical mobile deployment under various static and dynamic computation resource constraints. Experiments show that ReForm achieves ~3.5× faster optimization than the state-of-the-art resource-aware optimization method. ReForm can also effectively reconfigure a DNN model for different mobile devices with distinct resource constraints. Moreover, ReForm achieves satisfying computation cost reduction with negligible accuracy drop in both static and dynamic computing scenarios (up to 18% workload, 16.23% latency, 48.63% memory, and 21.5% energy improvement).
- Bharath Srinivas Prabakaran
- Semeen Rehman
- Muhammad Shafique
Bio-signals exhibit high redundancy, and the algorithms for their processing are inherently error resilient. This property can be leveraged to improve the energy efficiency of IoT-Edge devices (wearables) through the emerging trend of approximate computing. This paper presents XBioSiP, a novel methodology for approximate bio-signal processing that employs two quality evaluation stages, during the pre-processing and bio-signal processing stages, to determine the approximation parameters. It thereby achieves high energy savings while satisfying the user-defined quality constraint. Our methodology achieves up to 19× and 22× reductions in the energy consumption of a QRS peak detection algorithm for 0% and < 1% loss in peak detection accuracy, respectively.
- Alireza Mahzoon
- Daniel Große
- Rolf Drechsler
In recent years, formal methods based on Symbolic Computer Algebra (SCA) have shown very good results in the verification of integer multipliers. This success rests on removing redundant terms (vanishing monomials) early, which avoids the explosion in the number of monomials during backward rewriting. However, SCA approaches still suffer from two major problems: (1) a high dependence on the detection of Half Adders (HAs) realized as AND-XOR gates in the multiplier netlist, and (2) an extremely large search space for finding the sources of the vanishing monomials. As a consequence, if the multiplier contains dirty logic, e.g., due to non-standard libraries or logic optimization, existing SCA methods are completely blind to the resulting polynomials, and their techniques for effective division fail.
In this paper, we present RevSCA. RevSCA brings light back into backward rewriting by identifying the atomic blocks of arithmetic circuits using dedicated reverse engineering techniques. Our approach takes advantage of these atomic blocks to detect all sources of vanishing monomials independent of the design architecture. Furthermore, it drastically cuts the local vanishing-monomial removal time by limiting the search space to a small part of the design. Experimental results confirm the efficiency of our approach in the verification of a wide variety of integer multipliers with up to 1024 output bits.
- Rehab Massoud
- Hoang M. Le
- Peter Chini
- Prakash Saivasan
- Roland Meyer
- Rolf Drechsler
This paper introduces a new method to trace cycle-accurately the temporal behavior of on-chip signals while operating in-field. Current cycle-accurate schemes incur unacceptable amounts of data for logging, storage and processing.
Our key idea to enable efficient yet cycle-accurate tracing is to bring timing to the front as the main traced artifact. We split signal tracing into consecutive (back-to-back) finite trace-cycles. Within a trace-cycle, each value-change instance of a signal is assigned an encoded timestamp. At the end of each trace-cycle, these encoded timestamps are aggregated into a logged timeprint, which summarizes the temporal behavior over the trace-cycle.
To retrieve the accurate timing, we reconstruct the exact instances from a timeprint via a SAT query. The experiments demonstrate how unprecedented lightweight tracing can be applied, and how timeprints enable the verification of cycle-accurate properties and the detection of sporadic temperature effects.
- Michael Schwarz
- Raphael Stahl
- Daniel Müller-Gritschneder
- Ulf Schlichtmann
- Dominik Stoffel
- Wolfgang Kunz
Customizing embedded computing platforms to specific application domains often necessitates optimizing the firmware and/or the HW/SW interface under tight resource constraints. Such optimizations frequently alter the communication between the firmware and the peripheral devices, possibly compromising functional correctness of the input/output behavior of the embedded system. This paper proposes a formal HW/SW co-equivalence checking technique for verifying correct I/O behavior of peripherals under a modified firmware. We demonstrate the great promise of our approach on RTL implementations of several open-source peripherals. In our experiments we successfully prove or disprove correctness of firmware optimizations for an industrial driver software. In addition, we also found a subtle bug in one of the peripherals and several undocumented preconditions for correct device behavior.
- Vladimir Herdt
- Daniel Große
- Hoang M. Le
- Rolf Drechsler
Extensive testing of IoT SW is very important to prevent errors and security vulnerabilities. In the SW domain, the automated concolic testing technique has been shown to be very effective.
In this paper we propose an approach for concolic testing of binaries targeting RISC-V systems with peripherals. Our approach works by integrating the Concolic Testing Engine (CTE) with the architecture-specific Instruction Set Simulator (ISS) inside a Virtual Prototype (VP). We provide a designated CTE interface to integrate (SystemC-based) peripherals into the concolic testing by means of SW models. This combination enables high simulation performance at the binary level with comparatively little effort to integrate peripherals with concolic execution capabilities. Our approach has been effective in finding several buffer-overflow-related security vulnerabilities in the FreeRTOS TCP/IP stack.
- Brendan L. West
- Jian Zhou
- Ronald G. Dreslinski
- J. Brian Fowlkes
- Oliver Kripfgans
- Chaitali Chakrabarti
- Thomas F. Wenisch
High volume acquisition rates are imperative for medical ultrasound imaging applications, such as 3D elastography and 3D vector flow imaging. Unfortunately, despite recent algorithmic improvements, high-volume-rate imaging remains computationally infeasible on known platforms.
In this paper, we propose Tetris, a novel hardware accelerator for ultrasound beamforming that enables volume acquisition rates up to the physics limits of acoustic propagation delay. Through algorithmic and hardware optimizations, we enable a streaming system design outclassing previously proposed accelerators in performance while lowering hardware complexity and storage requirements. For a representative imaging task, our proposed system generates physics-limited 13,020 volumes per second in a 2.5W power budget.
- Nimish Shah
- Laura I. Galindez Olascoaga
- Wannes Meert
- Marian Verhelst
Bayesian reasoning is a powerful mechanism for probabilistic inference in smart edge-devices. During such inferences, a low-precision arithmetic representation can enable improved energy efficiency. However, its impact on inference accuracy is not yet understood. Furthermore, general-purpose hardware does not natively support low-precision representation. To address this, we propose ProbLP, a framework that automates the analysis and design of low-precision probabilistic inference hardware. It automatically chooses an appropriate energy-efficient representation based on worst-case error-bounds and hardware energy-models. It generates custom hardware for the resulting inference network exploiting parallelism, pipelining and low-precision operation. The framework is validated on several embedded-sensing benchmarks.
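To illustrate the flavor of worst-case error-bound reasoning such a framework performs (a hedged sketch only; ProbLP's actual error and energy models are not reproduced), one can propagate fixed-point rounding bounds through the sum and product nodes of a small inference network whose values are probabilities in [0, 1]:

```python
def eps(frac_bits: int) -> float:
    """Worst-case rounding error of a value quantized to frac_bits fractional bits."""
    return 2.0 ** -(frac_bits + 1)

def sum_error(e_a: float, e_b: float, frac_bits: int) -> float:
    """Worst-case error bound of a quantized sum node."""
    return e_a + e_b + eps(frac_bits)

def product_error(e_a: float, e_b: float, frac_bits: int) -> float:
    """Worst-case error bound of a quantized product node for operands in [0, 1]."""
    return e_a + e_b + e_a * e_b + eps(frac_bits)

# Tiny network (p1*p2) + (p3*p4) with every value quantized to `frac` fractional bits
frac = 8
leaf = eps(frac)
bound = sum_error(product_error(leaf, leaf, frac),
                  product_error(leaf, leaf, frac), frac)
print(bound)   # bound on how far the low-precision output can drift from the exact one
```

Comparing such bounds across candidate bit-widths is one way to pick the cheapest representation that still meets an accuracy target.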
- Seungkyu Choi
- Jaekang Shin
- Yeongjae Choi
- Lee-Sup Kim
Personalization by incremental learning has become essential for IoT devices to enhance the performance of the deep learning models trained with global datasets. To avoid massive transmission traffic in the network, exploiting on-device learning is necessary. We propose a software/hardware co-design technique that builds an energy-efficient low-bit trainable system: (1) software optimizations by local low-bit quantization and computation freezing to minimize the on-chip storage requirement and computational complexity, (2) hardware design of a bit-flexible multiply-and-accumulate (MAC) array sharing the same resources in inference and training. Our scheme saves 99.2% on on-chip buffer storage and achieves 12.8x higher peak energy efficiency compared to previous trainable accelerators.
- Hong Liu
- Leibo Liu
- Wenping Zhu
- Qiang Li
- Huiyu Mo
- Shaojun Wei
A binary-weight hourglass network (B-HG) accelerator for landmark detection, built on the proposed look-up-table (LUT) based multi-level prediction-correction approach, enables high-speed and energy-efficient processing on IoT edge devices. First, a LUT with a unified mode is adopted to support convolutional neural networks with fully variable weight bit precision, minimizing the operations of B-HG and achieving a 1.33×–1.50× speedup on multi-bit-weight CNNs relative to a similar solution. Second, a multi-level prediction-correction model is proposed to achieve computationally efficient convolution with adaptive precision; the operations saved can be increased by about 30% compared to the two-stage model. Besides, nearly 77.4% of the operations in B-HG can be saved by combining these two methods, yielding a 2.3× inference speedup. Third, a block-computing-based pipeline is designed to mitigate the residual-block inefficiency in B-HG. It not only reduces off-chip memory access by about 66.2% compared to the baseline, but also saves 60% and 31% of on-chip memory space and accesses, respectively, compared to a similar fused-layer accelerator. The proposed B-HG accelerator achieves 450 fps at 500 MHz based on simulation in a TSMC 28 nm process. Meanwhile, the power efficiency reaches 8.5 TOPS/W, which is two orders of magnitude higher than the dedicated face landmark detection accelerator.
- Runze Liu
- Jianlei Yang
- Yiran Chen
- Weisheng Zhao
Simultaneous Localization and Mapping (SLAM) is a critical task for autonomous navigation. However, due to the computational complexity of SLAM algorithms, it is very difficult to achieve real-time implementations on low-power platforms. We propose an energy-efficient architecture for a real-time ORB (Oriented-FAST and Rotated-BRIEF) based visual SLAM system by accelerating the most time-consuming stages, feature extraction and matching, on an FPGA platform. Moreover, the original ORB descriptor pattern is reformulated in a rotationally symmetric manner, which is much more hardware friendly. Optimizations including rescheduling and parallelization are further applied to improve the throughput and reduce the memory footprint. Compared with Intel i7 and ARM Cortex-A9 CPUs on the TUM dataset, our FPGA realization achieves up to 3× and 31× frame rate improvements, as well as up to 71× and 25× energy efficiency improvements, respectively.
- Shijun Gong
- Jiajun Li
- Wenyan Lu
- Guihai Yan
- Xiaowei Li
Stream processing is an important and growing class of applications for analyzing continuous streams of real-time data. Sliding-window aggregations (SWAGs) dominate the computation time in such applications and demand an unprecedented computation capacity, which poses a great challenge to computing architectures. General-purpose processors cannot handle SWAGs efficiently because of their specific computation patterns. This paper proposes an efficient accelerator architecture for ubiquitous SWAGs, called ShuntFlow. ShuntFlow is a type of Kernel Processing Unit (KPU), where "kernel" represents the two main categories of SWAG operations widely used in stream processing. Meanwhile, we propose a shunt rule that enables ShuntFlow to efficiently handle SWAGs with arbitrary parameters. As a case study, we implemented ShuntFlow on an Altera Arria 10 AX115N FPGA board at 150 MHz and compared it to previous approaches. The experimental results show that ShuntFlow provides a tremendous throughput and latency advantage over CPU and GPU implementations on both reduce-like and index-like SWAGs.
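For readers unfamiliar with SWAGs, the following sketch shows the software form of a reduce-like sliding-window aggregation (a running windowed sum with O(1) work per incoming tuple); it illustrates the operation class ShuntFlow accelerates, not its hardware datapath.

```python
from collections import deque

class SlidingSum:
    """Reduce-like sliding-window aggregation: maintain the sum of the last
    `window` tuples incrementally instead of re-aggregating the whole window."""
    def __init__(self, window: int):
        self.window, self.buf, self.total = window, deque(), 0

    def insert(self, value):
        self.buf.append(value)
        self.total += value
        if len(self.buf) > self.window:      # evict the oldest tuple
            self.total -= self.buf.popleft()
        return self.total                    # current window aggregate

agg = SlidingSum(window=3)
print([agg.insert(v) for v in [4, 1, 7, 2, 9]])   # [4, 5, 12, 10, 18]
```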
- Xingchen Man
- Leibo Liu
- Jianfeng Zhu
- Shaojun Wei
Compilation has become a major challenge to the usability of coarse-grained reconfigurable architectures as ever more programmable resources must be orchestrated. Static compilation suffers from prohibitive time cost, while dynamic compilation still performs poorly in both generality and efficiency. This paper proposes a general pattern-based dynamic compilation framework, which utilizes statically generated patterns to directly determine runtime re-placement and routing, so that the runtime configuration creation algorithm has low complexity. Domain-specific communication characteristics are harnessed to improve the efficiency of the patterns. The experimental results show that compiled general applications can be transformed onto arbitrary resources at runtime, preserving 97% (39%~163%) of the original performance per resource on average, 7% (0~17%) better than the state-of-the-art non-general methods.
- Mahdi Nazm Bojnordi
- Farhan Nasrullah
3D die-stacking has enabled energy-efficient solutions for near-data processing by integrating multiple dice of high-density memory layers and processor cores within the same package. One promising approach is to employ the in-package memory as a gigascale last-level cache for data-intensive computing. Most existing in-package cache controllers rely on command scheduling policies borrowed from off-chip DRAM systems. Regrettably, these control policies are not specifically tailored to in-package cache traffic, which results in limited bandwidth efficiency. This paper proposes ReTagger, a DRAM cache controller that employs repeated tags to alleviate the cost of DRAM row-buffer misses. Our simulation results on a set of ten data-intensive applications indicate an average of 20% performance improvement for the proposed controller over state-of-the-art DRAM caches.
- Aviral Shrivastava
- Moslem Didehban
Advances in semiconductor technology have enabled unprecedented growth in safety-critical applications. However, due to unabated scaling, the unreliability of the underlying hardware is only getting worse. For many applications, just recovering from errors is not enough: the latency between the occurrence of a fault and its detection and recovery, i.e., in-time error resilience, is of vital importance. This is especially true for real-time applications, where the timing of application events is a crucial part of application correctness. Software techniques for resilience are highly desirable since they can be applied flexibly, but achieving reliable, in-time software resilience is still an elusive goal. A new class of recent techniques has started to tackle this problem. This paper presents a succinct overview of existing software resilience techniques from the point of view of in-time resilience and points out future challenges.
- Eric Cheng
- Daniel Müller-Gritschneder
- Jacob Abraham
- Pradip Bose
- Alper Buyuktosunoglu
- Deming Chen
- Hyungmin Cho
- Yanjing Li
- Uzair Sharif
- Kevin Skadron
- Mircea Stan
- Ulf Schlichtmann
- Subhasish Mitra
Resilience to errors in the underlying hardware is a key design objective for a large class of computing systems, from embedded systems all the way to the cloud. Sources of hardware errors include radiation, circuit aging, variability induced by manufacturing and operating conditions, manufacturing test escapes, and early-life failures. Many publications have suggested that cross-layer resilience, where multiple error resilience techniques from different layers of the system stack cooperate to achieve cost-effective resilience, is essential for designing cost-effective resilient digital systems. This paper presents a comprehensive overview of cross-layer resilience by addressing fundamental cross-layer resilience questions, by summarizing insights derived from recent advances in cross-layer resilience research, and by discussing future cross-layer resilience challenges.
- Michael Werner
- Keerthikumara Devarajegowda
- Moomen Chaari
- Wolfgang Ecker
Developing software in a slightly different way can have a dramatic impact on soft-error resilience. This observation can be turned into a process of improving existing code by transformations. These transformations are systematic in nature and can be automated. In this paper, we present a framework for generating low-level embedded software, commonly referred to as firmware, and for including safety measures in the generated code. The generation approach follows a three-stage process, starting with a formalized firmware specification using both platform-dependent and platform-independent firmware models. Finally, C code is generated from the view model in a straightforward way. Safety measures are included either as part of the translation step between the models or as transformations of single models.
- Ruizhou Ding
- Zeye Liu
- Ting-Wu Chin
- Diana Marculescu
- R. D. (Shawn) Blanton
To improve the throughput and energy efficiency of Deep Neural Networks (DNNs) on customized hardware, lightweight neural networks constrain the weights of DNNs to be a limited combination (denoted as k ∈ {1, 2}) of powers of 2. In such networks, the multiply-accumulate operation can be replaced with a single shift operation, or two shifts and an add operation. To provide even more design flexibility, the k for each convolutional filter can be optimally chosen instead of being fixed for every filter. In this paper, we formulate the selection of k to be differentiable, and describe model training for determining k-based weights on a per-filter basis. Over 46 FPGA-design experiments involving eight configurations and four data sets, lightweight neural networks with a flexible k value (dubbed FLightNNs) fully utilize the hardware resources on Field Programmable Gate Arrays (FPGAs); our experimental results show that FLightNNs can achieve 2× speedup when compared to lightweight NNs with k = 2, with only 0.1% accuracy degradation. Compared to a 4-bit fixed-point quantization, FLightNNs achieve higher accuracy and up to 2× inference speedup, due to their lightweight shift operations. In addition, our experiments also demonstrate that FLightNNs can achieve higher computational energy efficiency for ASIC implementation.
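The arithmetic simplification the abstract refers to can be seen in a few lines: when a weight is constrained to a signed sum of k powers of two, multiplying an activation by it reduces to k shifts and adds. The sketch below (with integer activations and non-negative exponents for brevity) only illustrates that constraint, not the FLightNN training procedure.

```python
def shift_multiply(x: int, exponents, signs) -> int:
    """Multiply x by a weight expressed as a signed sum of powers of two:
    weight = sum(s * 2**e), so the product needs len(exponents) shifts/adds."""
    return sum(s * (x << e) for s, e in zip(signs, exponents))

print(shift_multiply(13, [3], [+1]))          # k = 1: 13 * 8       = 104 (one shift)
print(shift_multiply(13, [3, 1], [+1, -1]))   # k = 2: 13 * (8 - 2) = 78  (two shifts, one add)
```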
- Shubham Jain
- Swagath Venkataramani
- Vijayalakshmi Srinivasan
- Jungwook Choi
- Kailash Gopalakrishnan
- Leland Chang
Fixed-point implementations (FxP) are prominently used to realize Deep Neural Networks (DNNs) efficiently on energy-constrained platforms. The choice of bit-width is often constrained by the ability of FxP to represent the entire range of numbers in the data structure with sufficient resolution. At low bit-widths (< 8 bits), state-of-the-art DNNs invariably suffer a loss in classification accuracy due to quantization/saturation errors.
In this work, we leverage a key insight that almost all data structures in DNNs are long-tailed, i.e., a significant majority of the elements are small in magnitude, with a small fraction being orders of magnitude larger. We propose BiScaled-FxP, a new number representation which caters to the disparate range and resolution needs of long-tailed data structures. The key idea is, whilst using the same number of bits to represent elements of both large and small magnitude, to employ two different scale factors, viz. scale-fine and scale-wide, in their quantization. Scale-fine allocates more fractional bits, providing resolution for small numbers, while scale-wide favors covering the entire range of large numbers, albeit at a coarser resolution. We develop a BiScaled DNN accelerator which computes on BiScaled-FxP tensors. A key challenge is to store the scale factor used in quantizing each element, as computations that use operands quantized with different scale factors need to scale their result. To minimize this overhead, we use a block-sparse format to store only the indices of scale-wide elements, which are few in number. We also enhance the BiScaled-FxP processing elements with shifters to scale their output when the operands of a computation use different scale factors. We develop a systematic methodology to identify the scale-fine and scale-wide factors for the weights and activations of any given DNN. Over 8 state-of-the-art image recognition benchmarks, BiScaled-FxP reduces 2 computation bits over conventional FxP, while also slightly improving classification accuracy in all cases. Compared to FxP8, the performance and energy benefits range between 1.43×–3.86× and 1.4×–3.7×, respectively.
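A toy version of the two-scale idea can be written as follows, with small-magnitude elements quantized under a scale-fine factor and the few large-magnitude elements under a scale-wide factor. This is a hedged sketch: the scale-selection rule, bit allocation, and block-sparse index used by BiScaled-FxP are not reproduced here.

```python
import numpy as np

def biscaled_quantize(x, bits=8, frac_fine=6, frac_wide=2):
    """Quantize a long-tailed tensor with two scale factors: scale-fine
    (more fractional bits) for small values, scale-wide (fewer fractional
    bits, larger range) for the few large values.  Illustrative only."""
    threshold = 2.0 ** (bits - 1 - frac_fine)        # largest magnitude scale-fine can hold
    wide = np.abs(x) >= threshold                    # typically a small fraction of elements
    frac = np.where(wide, frac_wide, frac_fine)
    q = np.clip(np.round(x * 2.0 ** frac), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q / 2.0 ** frac, wide                     # dequantized tensor + scale-wide index

x = np.array([0.031, -0.12, 0.006, 5.7, 0.044])      # long-tailed: one large element
xq, wide_idx = biscaled_quantize(x)
print(xq)        # [ 0.03125 -0.125    0.       5.75     0.046875]
print(wide_idx)  # [False False False  True False]
```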
- Ying Wang
- Shengwen Liang
- Huawei Li
- Xiaowei Li
Prior research on energy-efficient Convolutional Neural Network (CNN) inference accelerators has mostly focused on exploiting model sparsity, i.e., zero patterns in weights and activations, to reduce on-chip storage and computation overhead. In this work, we found that, in addition to zero patterns, a larger group of repetitive patterns and values exists in the working set of the CNN inference task, which we define as computation redundancy and which induces unnecessary performance and storage overhead in CNN accelerators. Based on this observation, we propose a redundancy-free architecture that detects and eliminates repetitive computation and storage patterns in CNNs for more efficient network inference. The architecture consists of two parts: an off-line parameter analyzer that extracts the repetitive patterns in the 3D tensors of parameters, and a dataflow accelerator. The proposed accelerator first preprocesses the weight patterns and the dynamically generated activations, and then caches these intermediate results in special P2-cache banks for further use in the convolution or fully-connected stage. Experiments show that the proposed Cavoluche architecture removes up to 89% of the repetitive operations from the layer inference process and reduces the on-chip storage space needed to store both redundancy-free weights and activations by 77%. The implementation of Cavoluche outperforms the state-of-the-art mobile GPGPU in both performance and energy efficiency. Compared to the latest sparsity-based accelerators, Cavoluche also achieves better operation-elimination effects.
- Morteza Hosseini
- Mark Horton
- Hiren Paneliya
- Uttej Kallakuri
- Houman Homayoun
- Tinoosh Mohsenin
In deep neural networks (DNNs), model size is an important factor affecting performance, energy efficiency, and scalability. Recent works on weight pruning have shown significant reductions in model size at the expense of irregularity in the DNN architecture, which necessitates additional indexing memory to address non-zero weights, thereby increasing chip size, energy consumption, and delay. In this paper, we propose cyclic sparsely connected (CSC) layers, with a memory/computation complexity of O(N log N), that can be used as an overlay for fully connected (FC) layers, whose number of parameters, O(N²), can dominate the parameters of the entire DNN model. The CSC layers are composed of a few sequential layers, referred to as support layers, which together provide full connectivity between the inputs and outputs of each CSC layer. We introduce an algorithm to train models with FC layers replaced by CSC layers in a bottom-up approach, incrementally increasing the CSC layers' characteristics, such as connectivity and number of synapses, to achieve the desired accuracy for a given compression rate. One advantage of the CSC layers is that they require no indexing of the non-zero weights. Our experimental results using AlexNet on ImageNet and LeNet300100 on MNIST indicate that by substituting FC layers with CSC layers, we can achieve 10× to 46× compression within a margin of 2% accuracy loss, which is comparable to non-structural pruning methods. A scalable parallel hardware architecture to implement CSC layers, and an equivalent scalable parallel architecture to efficiently implement non-structurally pruned FC layers, are designed and fully placed and routed on an Artix-7 FPGA and in ASIC 65nm CMOS technology for the LeNet300100 model. The results indicate that the proposed CSC hardware outperforms the conventional non-structurally pruned architecture with an equal compression rate by ~2× in power, energy, area, and resource utilization when running at the same frequency.
- Wonseok Choi
- Dongyeob Shin
- Jongsun Park
- Swaroop Ghosh
With the inherent algorithmic error resilience of deep neural networks (DNNs), supply voltage scaling could be a promising technique for energy-efficient DNN accelerator design. In this paper, we propose novel error-resilient techniques to enable aggressive voltage scaling by exploiting the different amounts of error resilience (sensitivity) of DNN layers, filters, and channels. First, to rapidly evaluate filter/channel-level weight sensitivities of large-scale DNNs, a first-order Taylor expansion is used, which accurately approximates the weight sensitivity obtained from actual error injection simulation. Using the measured timing error probability of each multiply-accumulate (MAC) unit under process variations, the sensitivity variation among filter weights can be leveraged in the DNN accelerator design, such that computations with more sensitive weights are assigned to more robust MAC units, while those with less sensitive weights are assigned to less robust MAC units. Based on post-synthesis timing simulations, 51% energy savings have been achieved on the CIFAR-10 dataset using VGG-9 compared to a state-of-the-art timing error recovery technique under the same constraint of 3% accuracy loss.
- Duy-Thanh Nguyen
- Nhut-Minh Ho
- Ik-Joon Chang
We present a stretchable DRAM refresh control for energy-efficient processing of DNNs, namely St-DRC. We exploit the characteristic that the recognition accuracy of DNNs is insensitive to errors in insignificant bits. By replacing some insignificant bits with parity bits for the error correction of significant bits, St-DRC can protect the significant bits under stretched refresh periods. This significantly reduces DRAM refresh energy without performance degradation of DNNs, and it is applicable to both training and inference. Our simulation shows that, in training, St-DRC obtains 23%/12% DRAM energy savings for graphics/main memories, respectively. Further, St-DRC accelerates training speed by 0.43~4.12%.
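The bit-reallocation idea can be illustrated with a toy byte layout (a hedged sketch under assumed parameters, not St-DRC's actual code or bit allocation): the four significant MSBs are protected by a Hamming-style parity triple written over three of the insignificant LSBs, so a single upset in a significant bit under a stretched refresh period can still be corrected.

```python
def encode_word(byte: int) -> int:
    """Keep the four significant MSBs of a byte and overwrite three
    insignificant LSBs with Hamming parity over those MSBs."""
    d1, d2, d3, d4 = [(byte >> i) & 1 for i in (7, 6, 5, 4)]
    p1, p2, p3 = d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4
    return (byte & 0xF8) | (p1 << 2) | (p2 << 1) | p3   # bit 3 stays as ordinary data

def decode_word(stored: int) -> int:
    """Recompute the parity syndrome and correct a single flipped MSB."""
    d1, d2, d3, d4 = [(stored >> i) & 1 for i in (7, 6, 5, 4)]
    p1, p2, p3 = (stored >> 2) & 1, (stored >> 1) & 1, stored & 1
    syndrome = (p1 ^ d1 ^ d2 ^ d4, p2 ^ d1 ^ d3 ^ d4, p3 ^ d2 ^ d3 ^ d4)
    flip = {(1, 1, 0): 7, (1, 0, 1): 6, (0, 1, 1): 5, (1, 1, 1): 4}.get(syndrome)
    if flip is not None:                                  # single-bit MSB upset detected
        stored ^= 1 << flip
    return stored & 0xF8                                  # significant bits (+ kept bit 3)

word = encode_word(0b10110110)
corrupted = word ^ (1 << 6)          # a significant bit decays under stretched refresh
print(bin(decode_word(corrupted)))   # 0b10110000 -- the four MSBs are restored
```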
- Cong Hao
- Xiaofan Zhang
- Yuhong Li
- Sitao Huang
- Jinjun Xiong
- Kyle Rupnow
- Wen-mei Hwu
- Deming Chen
While embedded FPGAs are attractive platforms for DNN acceleration on edge-devices due to their low latency and high energy efficiency, the scarcity of resources of edge-scale FPGA devices also makes it challenging for DNN deployment. In this paper, we propose a simultaneous FPGA/DNN co-design methodology with both bottom-up and top-down approaches: a bottom-up hardware-oriented DNN model search for high accuracy, and a top-down FPGA accelerator design considering DNN-specific characteristics. We also build an automatic co-design flow, including an Auto-DNN engine to perform hardware-oriented DNN model search, as well as an Auto-HLS engine to generate synthesizable C code of the FPGA accelerator for explored DNNs. We demonstrate our co-design approach on an object detection task using PYNQ-Z1 FPGA. Results show that our proposed DNN model and accelerator outperform the state-of-the-art FPGA designs in all aspects including Intersection-over-Union (IoU) (6.2% higher), frames per second (FPS) (2.48× higher), power consumption (40% lower), and energy efficiency (2.5× higher). Compared to GPU-based solutions, our designs deliver similar accuracy but consume far less energy.
- Junzhong Shen
- Deguang Wang
- You Huang
- Mei Wen
- Chunyuan Zhang
Three-dimensional convolutional neural networks (3D CNNs) have become a promising method for lung nodule segmentation. The high computational complexity and memory requirements of 3D CNNs make it challenging to accelerate them on a single FPGA. In this work, we focus on accelerating 3D CNN-based lung nodule segmentation on a multi-FPGA platform by proposing an efficient mapping scheme that takes advantage of the massive parallelism provided by the platform and maximizes the computational efficiency of the accelerators. Experimental results show that our system integrating four Xilinx VCU118 boards achieves state-of-the-art performance of 14.5 TOPS, along with a 29.4× performance gain over a CPU and 10.5× higher energy efficiency than a GPU.
- Eric Finnerty
- Zachary Sherer
- Hang Liu
- Yan Luo
The flexible architectures of Field Programmable Gate Arrays (FPGAs) lend themselves to an array of data analytics applications, among which Breadth-First Search (BFS), due to its vital importance, draws particular attention. Recent attempts to offload BFS onto FPGAs either simply imitate existing CPU- or Graphics Processing Unit (GPU)-based mechanisms or suffer from scalability issues. To this end, we introduce a novel data-centric design which extensively extracts the potential of FPGAs for BFS with the following two techniques. First, we advocate partitioning and compressing the BFS algorithmic metadata in order to buffer it in fast on-chip memory and circumvent expensive metadata accesses. Second, we propose a hierarchical coalescing method to improve the throughput of graph data access. Taken together, our evaluation demonstrates that the proposed design achieves, on average, 1.6× and 2.2× speedups over the state-of-the-art FPGA designs TorusBFS and Umuroglu, respectively, across a collection of graph datasets.
- Jaeha Kung
- Junki Park
- Sehun Park
- Jae-Joon Kim
In this paper, we present an integrated solution to design a high-performance LSTM accelerator. We propose a fast and flexible hardware architecture, named Peregrine, supported by a stack of innovations from algorithm to hardware design. Peregrine first minimizes the memory footprint by limiting the synaptic connection patterns within the LSTM network. Also, Peregrine provides parallel Huffman decoders with adaptive clocking to provide flexibility in dealing with a wide range of sparsity levels in the weight matrices. All these features are incorporated in a novel hardware architecture to maximize energy-efficiency. As a result, Peregrine improves performance by ~38% and energy-efficiency by ~33% in speech recognition compared to the state-of-the-art LSTM accelerator.
- Yongchen Wang
- Ying Wang
- Huawei Li
- Cong Shi
- Xiaowei Li
3D convolutional neural networks (CNNs) are gaining popularity in action/activity analysis. Compared to 2D convolutions that share filters in the 2D spatial domain, 3D convolutions further reuse filters in the temporal dimension to capture time-domain features. Prior works on specialized 3D-CNN accelerators employ additional on-chip memories and multi-cluster architectures to reuse data among the processing element (PE) arrays, which is too expensive for low-power chips. Instead of harvesting in-memory locality, we propose a 3D systolic-cube architecture that exploits the spatial and temporal localities of 3D CNNs, moving reusable data between PEs connected via a 3D-cube Network-on-Chip. Evaluation shows that the systolic cube delivers a considerable energy-efficiency boost on activity-recognition benchmarks.
- Jinhang Choi
- Zeinab Hakimi
- Philip W. Shin
- Jack Sampson
- Vijaykrishnan Narayanan
As the computing power of end-point devices grows, there has been interest in developing distributed deep neural networks specifically for hierarchical inference deployments on multi-sensor systems. However, as the existing approaches rely on latent parameters trained by machine learning, it is difficult to preemptively select front-end deep features across sensors, or understand individual feature’s relative importance for systematic global inference. In this paper, we propose multi-view convolutional neural networks exploiting likelihood estimation. Proof-of-concept experiments show that our likelihood-based context selection and weighted averaging collaboration scheme can decrease an endpoint’s communication and energy costs by a factor of 3×, while achieving high accuracy comparable to the original aggregation approaches.
- Shuo-Han Chen
- Ming-Chang Yang
- Yuan-Hao Chang
With the emergence of bit-alterable 3D NAND flash, programming and erasing a flash cell at bit-level granularity have become a reality. Bit-level operations can benefit the high-density, high-bit-error-rate 3D NAND flash by realizing the "bit-level rewrite operation," which can refresh error bits at bit-level granularity to reduce the error correction latency and improve read performance with minimal lifetime expense. Different from existing refresh techniques, bit-level operations can lower the lifetime expense by removing error bits directly without page-based rewrites. However, since bit-level rewrites may induce a similar amount of latency as conventional page-based rewrites and thus lead to low rewrite throughput, the efficiency of bit-level rewrites should be carefully considered. This observation motivates us to propose a bit-level error removal (BER) scheme to derive the most efficient way of utilizing bit-level operations for both lifetime and read performance optimization. A series of experiments was conducted to demonstrate the capability of the BER scheme, with encouraging results.
- Shunzhuo Wang
- Fei Wu
- Chengmo Yang
- Jiaona Zhou
- Changsheng Xie
- Jiguang Wan
Superblocks are widely employed in SSDs to improve performance. However, the standard superblock organization, which links blocks with the same block ID across planes into one superblock, leads to unavoidable lifetime waste in SSDs due to inter-block wear tolerance variations. This work proposes a wear-aware superblock management scheme, called WAS, which (1) dynamically organizes superblocks according to real-time block wear levels so that strong blocks relieve wear on weak ones, and (2) employs a wear-based garbage collection scheme to reduce the inter-block wear gap. Comprehensive experiments are carried out in SSDsim. Results show that WAS greatly prolongs SSD lifetime, by 51.3% compared with the state-of-the-art superblock management.
- Fei Li
- Youyou Lu
- Zhongjie Wu
- Jiwu Shu
With increased density, flash memory becomes more vulnerable to errors. Error correction incurs high overhead, which is especially sensitive in SSD caches. However, some applications, like multimedia processing, have an intrinsic tolerance of inaccuracies. In this paper, we propose ASCache, an approximate SSD cache, which allows bit errors within a controllable threshold for error-tolerant applications, so as to reduce the cache miss ratio caused by incorrect cache pages. ASCache further trades the strictness of error correction mechanisms for higher SSD access performance. Evaluations show ASCache reduces the average read latency by up to 30% and the cache miss ratio by 52%.
- Qiao Li
- Liang Shi
- Jun Yang
- Youtao Zhang
- Chun Jason Xue
With increasing bit density and the adoption of 3D NAND, flash memory suffers from increased errors. To address the issue, flash devices adopt error correction codes (ECC) with strong error correction capability, like low-density parity-check (LDPC) codes, to correct errors. The drawback of LDPC is that, to correct data with a high raw bit error rate (RBER), read latency is amplified. This work proposes to address this issue with the assistance of approximate data. First, studies are conducted that show there is an ample amount of approximate data available in flash storage. Second, a novel data organization is proposed to fortify the reliability of regular data by leaving approximate data unprotected. Finally, a new data allocation strategy and a modified garbage collection scheme are presented to complete the design. The experimental results show that the proposed approach can improve read performance by 30% on average compared to current techniques.
- Jingsong Chen
- Jinwei Liu
- Gengjie Chen
- Dan Zheng
- Evangeline F. Y. Young
The continuous development of modern VLSI technology has brought new challenges for on-chip interconnections. Different from classic net-by-net routing, bus routing requires all the nets (bits) in the same bus to share similar or even the same topology, besides considering wire length, via count, and other design rules. In this paper, we present MARCH, an efficient maze routing method under a concurrent and hierarchical scheme for buses. In MARCH, to achieve the same topology, all the bits in a bus are routed concurrently like marching in a path. For efficiency, our method is hierarchical, consisting of a coarse-grained topology-aware path planning and a fine-grained track assignment for bits. Additionally, an effective rip-up and reroute scheme is applied to further improve the solution quality. In experimental results, MARCH significantly outperforms the first place at 2018 IC/CAD Contest in both quality and runtime.
- Chen-Hao Hsu
- Shao-Chun Hung
- Hao Chen
- Fan-Keng Sun
- Yao-Wen Chang
As clock frequencies increase, topology-matching bus routing is desired to provide an initial routing result which facilitates the following buffer insertion to meet the timing constraints. Our algorithm consists of three main techniques: (1) a bus clustering method to reduce the routing complexity, (2) a DAG-based algorithm to connect a bus in the specific topology, and (3) a rip-up and re-route scheme to alleviate the routing congestion. Experimental results show that our proposed algorithm outperforms all the participating teams of the 2018 CAD Contest at ICCAD, where the top-3 routers result in 145%, 158%, and 420% higher costs than ours.
- Jihye Kwon
- Matthew M. Ziegler
- Luca P. Carloni
Logic synthesis and physical design (LSPD) tools automate complex design tasks previously performed by human designers. One time-consuming task that remains manual is configuring the LSPD flow parameters, which significantly impacts design results. To reduce the parameter-tuning effort, we propose an LSPD parameter recommender system that involves learning a collaborative prediction model through tensor decomposition and regression. Using a model trained with archived data from multiple state-of-the-art 14nm processors, we reduce the exploration cost while achieving comparable design quality. Furthermore, we demonstrate the transfer-learning properties of our approach by showing that this model can be successfully applied for 7nm designs.
The physical design process commonly consumes hours to days for large designs, and routing is known to be the most critical step. Demand for accurate routing quality prediction has risen to a new level to accelerate hardware innovation at advanced technology nodes. This work presents an approach that forecasts the density of all routing channels over the entire floorplan, using features collected up to the placement stage, with conditional GANs. Specifically, forecasting routing congestion is cast as an image translation (colorization) problem. The proposed approach is applied to a) placement exploration for minimum congestion, b) constrained placement exploration, and c) forecasting congestion in real time during incremental placement, using eight designs targeting a fixed FPGA architecture.
- Tao-Chun Yu
- Shao-Yun Fang
- Hsien-Shih Chiu
- Kai-Shun Hu
- Philip Hui-Yuh Tai
- Cindy Chin-Fang Shen
- Henry Sheng
- Bentian Jiang
- Xiaopeng Zhang
- Ran Chen
- Gengjie Chen
- Peishan Tu
- Wei Li
- Evangeline F. Y. Young
- Bei Yu
Dummy fill insertion is a mandatory step in modern semiconductor manufacturing processes to reduce dielectric thickness variation and provide nearly uniform pattern density for the chemical mechanical planarization (CMP) process. However, with the continuous shrinking of VLSI technology nodes, the coupling effects between the inserted metal fills and signal tracks can severely affect the original timing closure of the layout design. In this paper, we propose a robust, efficient, and high-performance framework for timing-aware dummy fill insertion, which simultaneously minimizes the coupling capacitance of critical signal wires and other wires. Experimental results on the 2018 CAD Contest at ICCAD benchmarks show that our proposed framework outperforms the contest winner by 8% on critical coupling capacitance with a 3.3× runtime speedup.
- Jonathan Cruz
- Prabhat Mishra
- Swarup Bhunia
Electronic hardware trust is an emerging concern for all stakeholders in the semiconductor industry. Trust issues in electronic hardware span all stages of its life cycle, from creation of intellectual property (IP) blocks to manufacturing, test, and deployment of hardware components, and all abstraction levels, from chips to printed circuit boards (PCBs) to systems. The trust issues originate from a horizontal business model that promotes reliance on untrusted third-party facilities, tools, and IPs in the hardware life cycle. Today, designers are tasked with verifying the integrity of third-party IPs before incorporating them into system-on-chip (SoC) designs. Existing trust metric frameworks have limited applicability since they are not comprehensive: they capture only a subset of vulnerabilities, such as potential vulnerabilities introduced through design mistakes and CAD tools, or quantify features in a design that target a particular Trojan model. Therefore, current practice relies on ad-hoc security analysis of IP cores. In this paper, we propose a vector-based comprehensive coverage metric that quantifies the overall trust of an IP considering both vulnerabilities and direct malicious modifications. We use a variable weighted sum of a design's functional coverage, structural coverage, and asset coverage to assess an IP's integrity. Designers can also effectively use our trust metric to compare the relative trustworthiness of functionally equivalent third-party IPs. To demonstrate the applicability and usefulness of the proposed metric, we apply our trust metric to Trojan-free and Trojan-inserted variants of an IP. Our results demonstrate that we are able to successfully distinguish between trusted and untrusted IPs.
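The metric itself is described as a variable weighted sum of three coverage components, so a minimal sketch is straightforward; the weights and coverage values below are hypothetical placeholders, not the paper's calibration.

```python
# Minimal sketch of a weighted-sum trust metric; the weights below are
# hypothetical placeholders, not the values used in the paper.

def trust_metric(functional_cov, structural_cov, asset_cov,
                 weights=(0.4, 0.3, 0.3)):
    """Each coverage value is assumed to be normalized to [0, 1];
    a higher score suggests a more trustworthy IP."""
    w_f, w_s, w_a = weights
    return w_f * functional_cov + w_s * structural_cov + w_a * asset_cov

# Comparing two functionally equivalent third-party IPs:
ip_a = trust_metric(0.92, 0.88, 0.95)
ip_b = trust_metric(0.76, 0.70, 0.40)   # e.g., unexercised logic around assets
print(ip_a, ip_b)
```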
- Hans Liljestrand
- Thomas Nyman
- Jan-Erik Ekberg
- N. Asokan
Shadow stacks are the go-to solution for perfect backward-edge control-flow integrity (CFI). Software shadow stacks trade off security for performance. Hardware-assisted shadow stacks are efficient and secure, but expensive to deploy. We present the authenticated call stack (ACS), a novel mechanism for precise verification of return addresses using aggregated message authentication codes. We show how ACS can be realized using ARMv8.3-A pointer authentication, a new low-overhead mechanism for protecting pointer integrity. Our solution achieves security comparable to hardware-assisted shadow stacks while incurring negligible performance overhead (< 0.5%) and requiring no additional hardware support.
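Conceptually, ACS binds each return address to the entire call history with an aggregated authentication token. The sketch below illustrates that chaining with HMAC in Python; the real mechanism uses ARMv8.3-A pointer-authentication instructions and keys, so this is only an assumption-laden stand-in, not the paper's implementation.

```python
# Conceptual sketch of an authenticated call stack using a chained MAC.
# Real ACS uses ARMv8.3-A pointer authentication (PAC) instructions; HMAC-SHA256
# here is only a stand-in to show how tokens aggregate across call frames.
import hmac, hashlib

KEY = b"per-boot-secret-key"

def mac(prev_token: bytes, ret_addr: int) -> bytes:
    msg = prev_token + ret_addr.to_bytes(8, "little")
    return hmac.new(KEY, msg, hashlib.sha256).digest()[:8]

def call(chain, ret_addr):
    # On function entry: bind the new return address to the whole call history.
    prev = chain[-1][1] if chain else b"\0" * 8
    return chain + [(ret_addr, mac(prev, ret_addr))]

def ret(chain, ret_addr):
    # On return: recompute and compare before using the address.
    prev = chain[-2][1] if len(chain) > 1 else b"\0" * 8
    assert hmac.compare_digest(mac(prev, ret_addr), chain[-1][1]), "return address corrupted"
    return chain[:-1]

chain = call(call([], 0x400A10), 0x400B20)
chain = ret(chain, 0x400B20)      # verifies and pops
# ret(chain, 0x401337)            # would trip the assertion (ROP-style tamper)
```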
- Urbi Chatterjee
- Pranesh Santikellur
- Rajat Sadhukhan
- Vidya Govindan
- Debdeep Mukhopadhyay
- Rajat Subhra Chakraborty
This work proposes a scheme to detect, isolate, and mitigate malicious disruption of electro-mechanical processes in legacy PLCs, where each PLC works as a finite state machine (FSM) and goes through predefined states depending on the control flow of the programs and the input-output mechanism. The scheme generates a group signature for a particular state by combining the signature shares from each of these PLCs using a (k,l)-threshold signature scheme. If some of the PLCs are affected by malicious code, the signature can still be verified by k of the l uncorrupted PLCs and used to detect the corrupted PLCs and the compromised state. We use the OpenPLC software to simulate a legacy PLC system on a Raspberry Pi and show an I/O pin configuration attack on digital and pulse width modulation (PWM) pins. We describe the protocol using a small prototype of five instances of legacy PLCs running simultaneously on the OpenPLC software. We show that when our proposed protocol is deployed, the aforementioned attacks are successfully detected and the controller takes corrective measures. This work was developed as part of the problem statement given in the Cyber Security Awareness Week 2017 competition.
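To illustrate the k-of-l idea, here is a toy Shamir secret-sharing sketch in which any k uncorrupted parties can reconstruct a group secret; the paper's actual (k,l)-threshold signature scheme is more involved, so treat this purely as an illustration of the threshold property.

```python
# Toy illustration of the k-of-l idea with Shamir secret sharing over a prime
# field; the paper uses a (k,l)-threshold *signature* scheme, which this
# simplified sketch does not implement.
import random

P = 2**61 - 1   # a Mersenne prime, large enough for a toy demo

def make_shares(secret, k, l):
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):      # Horner evaluation of the polynomial
            acc = (acc * x + c) % P
        return acc
    return [(x, f(x)) for x in range(1, l + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term (the secret).
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

group_secret = 123456789            # would key a group signature on a PLC state
shares = make_shares(group_secret, k=3, l=5)
print(reconstruct(shares[:3]) == group_secret)                 # any 3 of 5 suffice
print(reconstruct(random.sample(shares, 3)) == group_secret)
```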
- Sonal Yadav
- Vijay Laxmi
- Manoj Singh Gaur
- Hemangee K. Kapoor
The Network Demultiplexer (Net-Demux) is an essential hardware unit in multi-NoC systems for distributing traffic between the NoC networks. This paper proposes a novel placement of the Net-Demux at the control plane of the router's switch allocator, which improves static power and energy efficiency compared to the conventional data-plane placement at the Network Interface (NI).
- Manaar Alam
- Debdeep Mukhopadhyay
Deep learning has become a de facto paradigm for various prediction problems, including many privacy-preserving applications where the privacy of data is a serious concern. There have been efforts to analyze and exploit information leakage from DNNs to compromise data privacy. In this paper, we provide an evaluation strategy for such information leakage through a DNN by considering a case study on a CNN classifier. The approach utilizes low-level hardware information provided by hardware performance counters and hypothesis testing during the execution of the CNN to raise alarms if there is any information leakage about the actual input.
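A hedged sketch of the statistical side of such an evaluation: compare hardware-performance-counter traces collected for two input classes with a two-sample test and flag a potential leak when the distributions differ significantly. The counter values and the choice of Welch's t-test are assumptions for illustration, not the paper's exact statistic.

```python
# Hedged sketch: a two-sample t-test on hardware-performance-counter traces
# collected while the CNN classifies two different input classes.
# Counter values below are synthetic.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# e.g., branch-miss counts per inference, gathered via `perf` for two inputs
hpc_class_a = rng.normal(loc=10_500, scale=300, size=50)
hpc_class_b = rng.normal(loc=10_900, scale=300, size=50)

stat, p_value = ttest_ind(hpc_class_a, hpc_class_b, equal_var=False)
if p_value < 0.01:
    print(f"possible leakage: counters distinguish inputs (p = {p_value:.3g})")
else:
    print("no statistically significant leakage observed")
```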
- Riadul Islam
- Md Asif Shahjalal
At leading technology nodes, the industry faces a stiff challenge to make profitable ICs. One of the primary issues is design rule checking (DRC) violations. In this research, we work within the DARPA IDEA program, which aims for "no-human-in-the-loop", 24-hour turnaround time to implement an IC from design specifications. To reduce human effort, we introduce an ensemble random forest algorithm to predict DRC violations before global routing, which is considered the most time-consuming step in an IC design flow. In addition, we identify features that critically impact DRC violations. The algorithm achieves a 5.8% better F1-score compared to existing SVM classifiers.
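A minimal scikit-learn sketch of the ensemble random forest idea, trained on synthetic placement-stage features and scored with F1; the feature names and data are hypothetical stand-ins for the features the paper identifies.

```python
# Illustrative sketch with scikit-learn; features and labels are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([
    rng.uniform(0, 1, n),     # e.g., local pin density
    rng.uniform(0, 1, n),     # e.g., cell density in the placement bin
    rng.integers(0, 8, n),    # e.g., number of nets crossing the bin
])
y = (0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * X[:, 2]
     + rng.normal(0, 0.1, n)) > 0.7          # 1 = bin has a DRC violation

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
print("feature importances:", clf.feature_importances_)
```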
- Kourosh Hakhamaneshi
- Nick Werblun
- Pieter Abbeel
- Vladimir Stojanović
A deep neural network (DNN) based stochastic combinatorial optimization framework is presented that can find the optimal sizing of circuits in a sample-efficient manner. This sample efficiency allows us to unify this framework with generator-based tools like Berkeley Analog Generator (BAG) [1] to directly optimize layout, given the high level circuit specifications. We use this tool to design an optical link receiver layout, satisfying high-level design specifications, using post-layout simulations of only 348 design instances. Compared to an evolutionary algorithm without our DNN-based discriminator, our framework improves the sample efficiency and run time by more than 200x.
- Tsung-Wei Huang
- Chun-Xun Lin
- Martin D. F. Wong
As design complexities continue to grow, the need to efficiently analyze circuit timing with billions of transistors is quickly becoming the major bottleneck in the overall chip design flow. In this work we introduce a distributed timer that (1) has scalable performance, (2) can be seamlessly integrated into existing EDA applications, (3) enables transparent resource management, and (4) has robust fault-tolerant control. We evaluate the distributed timer using a set of large industry benchmarks on a cluster with 24 nodes. The results show that the proposed timer achieves full accuracy over all designs with high performance and good scalability.
- Onur Sahin
- Assel Aliyeva
- Hariharan Mathavan
- Ayse Coskun
- Manuel Egele
The ability to repeat the execution of a program is a fundamental requirement in evaluating computer systems and apps. Reproducing executions of mobile apps has proven difficult under real-life scenarios due to the many sources of external input and the interactive nature of the apps. We present a new practical record/replay framework for Android, RandR, which handles multiple sources of input and provides cross-device replay capabilities through a dynamic instrumentation approach. We demonstrate the feasibility of RandR by recording and replaying a set of real-world apps.
- Zheng-Hong Zhang
- Wei Chu
- Shi-Yu Huang
The Tunable Delay Line (TDL) is the most important building block in modern cell-based timing circuits such as the Phase-Locked Loop (PLL) and the Delay-Locked Loop (DLL). Previously proposed TDLs face a dilemma: they cannot be both power-efficient and environmentally adaptive at the same time. In this paper, we present an effective solution to this dilemma: a novel "ping-pong delay line" architecture. The idea is to use two small cell-based delay lines operated in a synergistic manner, in the sense that they exchange the "role of command" dynamically, as in a ping-pong game, thereby jointly reacting to severe environmental changes over a very wide range. The proposed ping-pong delay line has been incorporated into a Delay-Locked Loop (DLL) design to demonstrate its advantages through post-layout simulation.
- Po-Cheng Pan
- Chien-Chia Huang
- Hung-Ming Chen
An efficient synthesis technique for modern analog circuits is important yet challenging due to the repeated re-synthesis process, and precisely exploring the performance limits of an analog circuit in the target technology is time-consuming. This work presents a learning-based framework for searching for the performance limits of analog circuits. With a hierarchical architecture, the dimension of the solution space can be reduced. Bayesian linear regression and support vector machine models are selected to speed up the algorithm and achieve better performance quality. Experimental results show that our approach achieves up to a 9x runtime speed-up on two analog circuits without sacrificing performance quality.
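A small sketch of how the two surrogate models named above might be fit side by side with scikit-learn; the sizing parameters, the gain-like target, and the hyperparameters are synthetic assumptions, not the paper's setup.

```python
# Hedged sketch of the two surrogate models named in the abstract, using
# scikit-learn; the circuit data here is synthetic.
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.svm import SVR

rng = np.random.default_rng(2)
# X: hypothetical sizing parameters (W/L ratios, bias currents); y: e.g., gain
X = rng.uniform(0, 1, size=(300, 4))
y = 40 + 10 * X[:, 0] - 5 * X[:, 1] ** 2 + rng.normal(0, 0.5, 300)

blr = BayesianRidge().fit(X, y)              # fast linear trend estimate
svm = SVR(kernel="rbf", C=10.0).fit(X, y)    # captures nonlinear behavior

x_new = rng.uniform(0, 1, size=(1, 4))
print(blr.predict(x_new), svm.predict(x_new))
```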
- Bahar Asgari
- Ramyad Hadidi
- Hyesoon Kim
- Sudhakar Yalamanchili
The performance of sparse problems suffers from a lack of spatial locality and low memory bandwidth utilization. However, the distribution of non-zero values in the data structures of a class of sparse problems, such as matrix operations in neural networks, is modifiable so that it can be matched with efficient underlying hardware, such as systolic arrays. Such modification helps address the challenges coupled with sparsity. To efficiently execute sparse neural network inference on systolic arrays, we propose a structured pruning algorithm that increases the spatial locality in neural network models while maintaining inference accuracy.
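A minimal numpy sketch of structured (block-wise) pruning aligned to a systolic tile: whole tiles with the smallest norms are zeroed so the surviving non-zeros stay contiguous. The tile size, keep ratio, and norm criterion are illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch of structured (block-wise) pruning matched to a systolic tile.
import numpy as np

def prune_blocks(W, tile=8, keep_ratio=0.5):
    """Zero out whole tile x tile blocks with the smallest L2 norm, so the
    surviving non-zeros stay contiguous and map cleanly onto systolic tiles."""
    rows, cols = W.shape
    assert rows % tile == 0 and cols % tile == 0
    blocks = W.reshape(rows // tile, tile, cols // tile, tile)
    norms = np.linalg.norm(blocks, axis=(1, 3))            # one norm per block
    threshold = np.quantile(norms, 1.0 - keep_ratio)
    mask = (norms >= threshold)[:, None, :, None]          # broadcast over tiles
    return (blocks * mask).reshape(rows, cols)

W = np.random.default_rng(3).normal(size=(64, 64))
W_pruned = prune_blocks(W, tile=8, keep_ratio=0.5)
print("density:", np.count_nonzero(W_pruned) / W_pruned.size)
```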
- Ramyad Hadidi
- Jiashen Cao
- Michael S. Ryoo
- Hyesoon Kim
Internet of Things (IoT) devices have access to an abundance of raw data for processing. With deep neural networks (DNNs), not only is the demand for computing power on IoT devices increasing, but privacy concerns are also motivating close-to-edge computation. Executing a DNN by distributing its computation is common in IoT systems. However, managing unstable network latencies and intermittent failures is a serious challenge. Our work provides robustness and close-to-zero recovery latency by adapting coded distributed computing (CDC). We analyze robust execution on a mesh of Raspberry Pis by studying four DNNs.
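The CDC idea can be illustrated on a single layer's matrix-vector product: shard the weights across workers and add one parity shard so the result survives a failed worker. The sketch below is a toy single-parity example, not the paper's coding scheme.

```python
# Minimal sketch of coded distributed computing for one DNN layer:
# split a matrix-vector product across workers and add one parity worker so
# the result survives a single straggler/failure. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(6, 4))        # one layer's weights
x = rng.normal(size=4)             # activation vector

W1, W2 = W[:3], W[3:]              # data workers' shards
Wp = W1 + W2                       # parity worker's shard (the "code")

y1, y2, yp = W1 @ x, W2 @ x, Wp @ x    # each worker computes locally

# Suppose worker 2 fails: its partial result is recovered from the parity.
y2_recovered = yp - y1
y = np.concatenate([y1, y2_recovered])
print(np.allclose(y, W @ x))       # True: full output despite the failure
```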
- Pankaj Bhowmik
- Md Jubaer Hossain Pantho
- Christophe Bobda
This paper presents a reconfigurable hardware architecture for smart image sensors to speed up low-level image processing applications at the pixel level. For each pixel in the sensor plane, the design includes an activation module and a processor. The processor has a basic structure common to all applications and reconfigurable segments for specific applications. Visual-cortex-inspired computing, such as predictive coding in time, is implemented in the activation module to remove temporal redundancy. The ASIC implementation shows the design saves up to 84.01% dynamic power and achieves a 9x speedup at 800 MHz through accurate prediction.
- Shaohan Hu
- Dmitri Maslov
- Marco Pistoia
- Jay Gambetta
Quantum computing has increasingly drawn interest and investments from the academic, industrial, and governmental research communities worldwide. Among quantum algorithms, Quantum Search is important for its quadratic speedup over its classical-computing counterpart. A key ingredient in its implementation is the Multi-Control Toffoli (MCT) gate, which creates a Boolean product of control variables and XORs it into the target. On an idealized quantum computer, all-to-all connectivity would eliminate the need to use SWAP gates to communicate information. This is, however, not affordable in the current Noisy Intermediate-Scale Quantum (NISQ) computing era. In this work, we discuss how to efficiently implement MCT gates on 2D Square Lattices (2DSL), suitable for superconducting circuits, by taking advantage of relative-phase Toffoli gates and H-tree layouts to drastically reduce resulting circuits’ depths and the amount of SWAPping required.
- Chandan Kumar Jha
- Joycee Mekie
Approximate computing has gained a lot of popularity due to its energy benefits in a variety of error-tolerant applications. In this paper, we propose an adder that can perform either a single n-bit exact addition or dual approximate additions (SEDA) and is suitable for processors. The conversion from exact to approximate addition can be done dynamically at runtime. The maximum error of SEDA adders is bounded because the carry is not approximated. Our proposed design consumes 48% less energy, has 32% less delay, and occupies 24% less area compared to an exact mirror adder.
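Since the abstract only states that sum bits may be approximated while the carry is kept exact, the following behavioral sketch models that property generically (approximate low sum bits, exact carry into the upper part); it is an interpretation for illustration, not the SEDA circuit.

```python
# Behavioral sketch only -- not the SEDA transistor-level design. It models the
# general property the abstract states: sum bits may be approximated while the
# carry is generated exactly, which bounds the maximum error. Here the k low
# sum bits use a simplified OR, and the carry into bit k is exact.
def approx_add(a, b, n=8, k=4, approximate=True):
    if not approximate:
        return (a + b) & ((1 << n) - 1)           # exact n-bit addition
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)                  # approximated low sum bits
    carry = ((a & mask) + (b & mask)) >> k         # exact carry into the upper part
    high = ((a >> k) + (b >> k) + carry) & ((1 << (n - k)) - 1)
    return (high << k) | low

print(approx_add(100, 29, approximate=False))  # 129 (exact)
print(approx_add(100, 29, approximate=True))   # 141; error confined to the k low bits
```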
- Xiaoming Chen
- Longxiang Yin
- Bosheng Liu
- Yinhe Han
- Tianshi Wang
- Leon Wu
- Jaijeet Roychowdhury
In this paper, we report new results on a novel Ising machine technology for solving combinatorial optimization problems using networks of coupled self-sustaining oscillators. Specifically, we present several working hardware prototypes using CMOS electronic oscillators, built on bread-boards/perfboards and PCBs, implementing Ising machines consisting of up to 240 spins with programmable couplings. We also report that, just by simulating the differential equations of such Ising machines of larger sizes, good solutions can be achieved easily on benchmark optimization problems, demonstrating the effectiveness of oscillator-based Ising machines.
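A minimal numerical sketch of coupled-oscillator Ising dynamics: Kuramoto-type phase equations with a sub-harmonic injection-locking term that binarizes phases toward 0 or π, applied to a tiny MAX-CUT instance. The coupling signs, constants, and problem instance are illustrative assumptions rather than values from the paper.

```python
# Minimal numerical sketch of oscillator-based Ising dynamics (Kuramoto-type
# phases plus a sub-harmonic injection-locking term). Constants, signs, and
# the tiny MAX-CUT instance are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(5)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]    # 5-node ring, max cut = 4
n = 5
J = np.zeros((n, n))
for i, j in edges:                                   # negative coupling -> anti-phase
    J[i, j] = J[j, i] = -1.0

theta = rng.uniform(0, 2 * np.pi, n)                 # random initial phases
K, Ks, dt = 1.0, 0.5, 0.05
for _ in range(2000):
    diff = theta[:, None] - theta[None, :]
    dtheta = -K * np.sum(J * np.sin(diff), axis=1) - Ks * np.sin(2 * theta)
    theta += dt * dtheta

spins = np.where(np.cos(theta) >= 0, 1, -1)          # phases settle near 0 or pi
cut = sum(spins[i] != spins[j] for i, j in edges)
print("spins:", spins, "cut size:", cut)             # typically 4 (optimal for C5)
```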
- Renhai Chen
- Qiming Guan
- Guohua Yan
- Zhiyong Feng
In this paper, we lead the first efforts towards intelligent RDF data management in SSDs. We propose to deeply fuse the RDF data in SSDs. In detail, the operations (e.g., data query) applied to RDF can be directly achieved in SSDs. To this end, we explore two RDF data organizations (e.g., triple-based) with the consideration of the internal structure of SSDs. The experiment is conducted on the Patient Disease Drug (PDD) Graph dataset [11]. The experimental results show that the proposed two strategies achieve the comprehensive, scalable in-SSD computation from different aspects (e.g., space efficiency or query efficiency).