FPGA 2021 TOC

FPGA ’21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

SESSION: Session 1: FPGA Architecture

Top-down Physical Design of Soft Embedded FPGA Fabrics

  • Prashanth Mohan
  • Oguz Atli
  • Onur Kibar
  • Mohammed Zackriya
  • Larry Pileggi
  • Ken Mai

In recent years, IC reverse engineering and IC fabrication supply chain security have
grown to become significant economic and security threats for designers, system integrators,
and end customers. Many of the existing logic locking and obfuscation techniques have
been shown to be vulnerable to attack once the attacker has access to the design netlist,
either through reverse engineering or through an untrusted fabrication facility. We
introduce soft embedded FPGA redaction, a hardware obfuscation approach that allows
the designer to substitute security-critical IP blocks within a design with a synthesizable
eFPGA fabric. This method fully conceals the logic and the routing of the critical
IP and is compatible with standard ASIC flows for easy integration and process portability.
To demonstrate eFPGA redaction, we obfuscate a RISC-V control path and a GPS P-code
generator. We also show that the modified netlists are resilient to SAT attacks with
moderate VLSI overheads. The secure RISC-V design has 1.89x area and 2.36x delay overhead
while the GPS design has 1.39x area and negligible delay overhead when implemented
on an industrial 22nm FinFET CMOS process.

NetCracker: A Peek into the Routing Architecture of Xilinx 7-Series FPGAs

  • Morten B. Petersen
  • Stefan Nikolić
  • Mirjana Stojilović

Novel applications have triggered significant changes at the system level of FPGA
architecture design, such as the introduction of embedded VLIW processor arrays and
hardened NoCs. However, the routing architecture of the soft logic fabric has largely
remained unchanged in recent years. Since hunger for acceleration of ever more varied
tasks with various power budgets—as well as complications related to technology
scaling—is likely to remain significant, it is foreseeable that the routing architecture
too will have to evolve. In this work, we do not try to suggest what routing architectures
of tomorrow should look like. Instead, we analyze an existing architecture from a
popular commercial FPGA family, discussing the possible origins of various design
decisions and pointing out aspects that may merit future research. Moreover, we present
an open-source tool that greatly eases such analyses, relying only on data readily
available from the vendor CAD tools. Our hope is that this work will help the academic
research community in catching up with the current developments in industry and accelerate
its contributions to FPGA architectures of the future.

Tensor Slices to the Rescue: Supercharging ML Acceleration on FPGAs

  • Aman Arora
  • Samidh Mehta
  • Vaughn Betz
  • Lizy K. John

FPGAs are well-suited for accelerating deep learning (DL) applications owing to the
rapidly changing algorithms, network architectures and computation requirements in
this field. However, the generic building blocks available on traditional FPGAs limit
the acceleration that can be achieved. Many modifications to FPGA architecture have
been proposed and deployed including adding specialized artificial intelligence (AI)
processing engines, adding support for IEEE half-precision (fp16) math in DSP slices,
adding hard matrix multiplier blocks, etc. In this paper, we describe replacing a
small percentage of the FPGA’s programmable logic area with Tensor Slices. At their heart,
these slices are arrays of processing elements that support multiple tensor operations
and multiple dynamically-selectable precisions, and they can be dynamically fractured into
individual adders, multipliers and MACs (multiply-and-accumulate). These tiles have a local
crossbar at the inputs that eases the routing pressure caused by a large slice.
By spending ~3% of the FPGA’s area on Tensor Slices, we observe an average frequency increase
of 2.45x and average area reduction by 0.41x across several ML benchmarks, including
a TPU-like design, compared to an Intel Agilex-like baseline FPGA. We also study the
impact of spending area on Tensor slices on non-ML applications. We observe an average
reduction of 1% in frequency and an average increase of 1% in routing wirelength compared
to the baseline across the non-ML benchmarks we studied. Adding these ML-specific
coarse-grained hard blocks makes the proposed FPGA a much more efficient hardware accelerator
for ML applications, while still keeping the vast majority of the FPGA’s real estate
programmable at a fine grain.

Global Is the New Local: FPGA Architecture at 5nm and Beyond

  • Stefan Nikolić
  • Francky Catthoor
  • Zsolt Tőkei
  • Paolo Ienne

It takes only high-school physics to appreciate that the resistance of a wire grows
with a diminishing cross section, and a quick look at any plot about Moore’s law immediately
suggests that such cross section must decrease over time. Clearly, everyone can easily
imagine that this trend must have a deep influence on FPGA architectures. What is
difficult to predict is whether and when well-established architectural ideas will
break—and what can replace them. Unfortunately, in architectural research, we often
use fairly simplistic models of the underlying technology nodes which limit our ability
to visualize the detailed impact of technology evolution. In this paper, we develop,
from the available industrial disclosures, a consistent electrical model of the metal
stacks of recent and current technologies, as well as future trends. We combine it
with a plausible layout strategy to form an accurate idea of how wire characteristics
now play into architectural decisions. To demonstrate our models, necessarily
speculative due to the paucity of reliable industrial information, we use them to
explore the evolution of a typical architectural family across technology nodes and
to reevaluate one of the most basic design parameters—namely, cluster size. We notice
effects which may in fact explain some recent changes in commercial architectures.
We also observe how conventional architectures may fail to take advantage of the performance
improvements of future nodes. Although conceptually straightforward, this study signals
how profoundly our understanding of FPGAs will be affected by technology while moving
towards the 3 nm node.

FABulous: An Embedded FPGA Framework

  • Dirk Koch
  • Nguyen Dao
  • Bea Healy
  • Jing Yu
  • Andrew Attwood

At the end of CMOS scaling, the role of architecture design is gaining
importance. Supporting this trend, customizable embedded FPGAs are an ingredient in
ASIC architectures to provide the advantages of reconfigurable hardware exactly where
and how it is most beneficial. To enable this, we are introducing the FABulous embedded
open-source FPGA framework. FABulous is designed to fulfill the objectives of ease
of use, maximum portability to different process nodes, good control for customization,
and delivering good area, power, and performance characteristics of the generated
FPGA fabrics. The framework provides templates for logic, arithmetic, memory, and
I/O blocks that can be easily stitched together, whilst enabling users to add their
own fully customized blocks and primitives. The FABulous ecosystem generates the embedded
FPGA fabric for chip fabrication, integrates Yosys, ABC, VPR and nextpnr as FPGA CAD
tools, and handles bitstream generation and post-fabrication tests. Additionally,
we provide an emulation path for system development. FABulous was demonstrated for
an ASIC integrating a RISC-V core with an embedded FPGA fabric for custom instruction
set extensions using a TSMC 180nm process and an open-source 45nm process node.

Stratix 10 NX Architecture and Applications

  • Martin Langhammer
  • Eriko Nurvitadhi
  • Bogdan Pasca
  • Sergey Gribok

The advent of AI has driven the adoption of high density low precision arithmetic
on FPGAs. This has resulted in new methods in mapping both arithmetic functions as
well as dataflows onto the fabric, as well as some changes to the embedded DSP Blocks.
Technologies outside of the FPGA realm have also evolved, such as the addition of
tensor structures for GPUs, and also the introduction of numerous AI ASSPs, all of
which have a higher claimed performance and efficiency than current FPGAs. In this
paper we will introduce the Stratix 10 NX device (NX), which is a variant of FPGA
specifically optimized for the AI application space. In addition to the computational
capabilities of the standard programmable soft logic fabric, a new type of DSP Block
provides the dense arrays of low precision multipliers typically used in AI implementations.
The architecture of the block is tuned for the common matrix-matrix or vector-matrix
multiplications in AI, with capabilities designed to work efficiently for both small
and large matrix sizes. The base precisions are INT8 and INT4, along with shared exponent
support for block floating point FP16 and FP12 numerics. All additions/accumulations
can be done in INT32 or IEEE754 single precision floating point (FP32), and multiple
blocks can be cascaded together to support larger matrices. We will also describe
methods by which the smaller precision multipliers can be aggregated to create larger
multipliers that are more applicable to standard signal processing requirements. In
terms of overall compute throughput, Stratix 10 NX achieves 143 INT8/FP16 TOPS/TFLOPS,
or 286 INT4/FP12 TOPS/TFLOPS, at 600MHz. Depending on the configuration, power efficiency
is in the range of 1-4 TOPS/W or TFLOPS/W.

SESSION: Keynote 1

Scientific Applications of FPGAs at the LHC

  • Philip Harris

The next generation of high throughput data acquisition systems is capable of acquisition
at rates far exceeding our ability to save data. To process data in real time, specialized
computing systems with incredibly high throughput are needed so that data can be quickly
assessed to determine whether it is sufficiently interesting for further processing.
With a raw data rate exceeding 1 Petabit per second, particle detectors at the Large
Hadron Collider at the European Organization for Nuclear Research (CERN) contend with some
of the largest data rates ever encountered. With planned upgrades in the near future,
these rates will continue to grow, further complicating our ability to process data
effectively to continue to understand the fundamental properties of the universe.

In this talk, we present the current, FPGA-based LHC data acquisition system, and
we discuss the plenitude of data challenges that are currently being addressed. Furthermore,
we discuss various aspects of the system, and we present deep-learning-based solutions
that are quickly being adopted by the LHC. We also discuss the lower-throughput,
computationally complex systems and how FPGAs can augment them, leading
to enhanced physics performance. Throughout the talk, we discuss the scientific implications
possible with an improved system. Finally, we discuss related problems in other scientific
fields, including astrophysics and materials science. We present new challenges that,
if solved, can open paths to new avenues of fundamental scientific research.

SESSION: Session 2: Abstractions and Tools

ThunderGP: HLS-based Graph Processing Framework on FPGAs

  • Xinyu Chen
  • Hongshi Tan
  • Yao Chen
  • Bingsheng He
  • Weng-Fai Wong
  • Deming Chen

FPGAs have become an emerging computing infrastructure in datacenters, benefiting from
fine-grained parallelism, energy efficiency, and reconfigurability. Meanwhile,
graph processing has attracted tremendous interest in data analytics, and its performance
is in increasing demand with the rapid growth of data. Many works have been proposed
to tackle the challenges of designing efficient FPGA-based accelerators for graph
processing. However, programmability remains largely overlooked: such designs still require
hardware design expertise and sizable development effort from developers.

In order to close the gap, we propose ThunderGP, an open-source HLS-based graph processing
framework on FPGAs, with which developers can enjoy the performance of FPGA-accelerated
graph processing by writing only a few high-level functions with no knowledge of the
hardware. ThunderGP adopts the Gather-Apply-Scatter (GAS) model as the abstraction
of various graph algorithms and realizes the model with a built-in, highly parallel,
and memory-efficient accelerator template. With high-level functions as inputs, ThunderGP
automatically explores the massive resources and memory bandwidth of multiple Super
Logic Regions (SLRs) on FPGAs to generate the accelerator, then deploys it and schedules
its tasks. We evaluate ThunderGP with seven common graph
applications. The results show that accelerators on real hardware platforms deliver
a 2.9x speedup over the state-of-the-art approach, running at 250MHz and achieving
throughput up to 6,400 MTEPS (Million Traversed Edges Per Second). We also conduct
a case study with ThunderGP, which delivers up to 419 times speedup over the CPU-based
design and requires significantly reduced development efforts. This work is open-sourced
on GitHub at https://github.com/Xtra-Computing/ThunderGP.
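
To give a flavor of the programming model, the sketch below shows the kind of scatter, gather, and apply functions a developer might write for a PageRank-style algorithm. It is a minimal C++ illustration of the GAS abstraction only; the function names and signatures are hypothetical and are not ThunderGP's actual API (see the repository above for the real interfaces).

    // Hypothetical GAS-style user functions for a PageRank-like algorithm.
    // Illustrative only -- not ThunderGP's actual API.
    #include <cstdint>

    using prop_t = float;  // per-vertex property (e.g., current rank)

    // Scatter: value propagated along each outgoing edge of a source vertex.
    inline prop_t scatterFunc(prop_t srcProp, uint32_t srcOutDegree) {
        return srcProp / static_cast<prop_t>(srcOutDegree);
    }

    // Gather: fold an incoming edge contribution into the destination's accumulator.
    inline prop_t gatherFunc(prop_t accumulated, prop_t incoming) {
        return accumulated + incoming;
    }

    // Apply: per-vertex update after all edges have been processed in this iteration.
    inline prop_t applyFunc(prop_t gathered, prop_t /*oldProp*/) {
        const prop_t damping = 0.85f;
        return (1.0f - damping) + damping * gathered;
    }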

AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design
on Multi-Die FPGAs

  • Licheng Guo
  • Yuze Chi
  • Jie Wang
  • Jason Lau
  • Weikang Qiao
  • Ecenur Ustun
  • Zhiru Zhang
  • Jason Cong

Despite an increasing adoption of high-level synthesis (HLS) for its design productivity
advantages, there remains a significant gap in the achievable clock frequency between
an HLS-generated design and a handcrafted RTL one. A key factor that limits the timing
quality of the HLS outputs is the difficulty in accurately estimating the interconnect
delay at the HLS level. Unfortunately, this problem becomes even worse when large
HLS designs are implemented on the latest multi-die FPGAs, where die-crossing interconnects
incur a high delay penalty.

To tackle this challenge, we propose AutoBridge, an automated framework that couples
a coarse-grained floorplanning step with pipelining during HLS compilation. First,
our approach provides HLS with a view on the global physical layout of the design,
allowing HLS to more easily identify and pipeline the long wires, especially those
crossing the die boundaries. Second, by exploiting the flexibility of HLS pipelining,
the floorplanner is able to distribute the design logic across multiple dies on the
FPGA device without degrading clock frequency. This prevents the placer from aggressively
packing the logic on a single die which often results in local routing congestion
that eventually degrades timing. Since pipelining may introduce additional latency,
we further present analysis and algorithms to ensure the added latency will not compromise
the overall throughput.

AutoBridge can be integrated into the existing CAD toolflow for Xilinx FPGAs. In our
experiments with a total of 43 design configurations, we improve the average frequency
from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible
change in resource utilization. Notably, in 16 experiments we make the originally
unroutable designs achieve 274 MHz on average. The tool is available at https://github.com/Licheng-Guo/AutoBridge.

AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA

  • Jie Wang
  • Licheng Guo
  • Jason Cong

While systolic array architectures have the potential to deliver tremendous performance,
it is notoriously challenging to customize an efficient systolic array processor for
a target application. Designing systolic arrays requires knowledge of both high-level
characteristics of the application and low-level hardware details, thus making it
a demanding and inefficient process. To relieve users from the manual iterative trial-and-error
process, we present AutoSA, an end-to-end compilation framework for generating systolic
arrays on FPGA. AutoSA is based on the polyhedral framework, and further incorporates
a set of optimizations on different dimensions to boost performance. An efficient
and comprehensive design space exploration is performed to search for high-performance
designs. We have demonstrated AutoSA on a wide range of applications, on which AutoSA
achieves high performance within a short amount of time. As an example, for matrix
multiplication, AutoSA achieves 934 GFLOPS, 3.41 TOPS, and 6.95 TOPS for floating-point,
16-bit integer, and 8-bit integer data types, respectively, on a Xilinx Alveo U250.

Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers
through Microbenchmarking

  • Alec Lu
  • Zhenman Fang
  • Weihua Liu
  • Lesley Shannon

With the public availability of FPGAs from major cloud service providers like AWS,
Alibaba, and Nimbix, hardware and software developers can now easily access FPGA platforms.
However, it is nontrivial to develop efficient FPGA accelerators, especially for software
programmers who use high-level synthesis (HLS).

The major goal of this paper is to figure out how to efficiently access the memory
system of modern datacenter FPGAs in HLS-based accelerator designs. This is especially
important for memory-bound applications; for example, a naive accelerator design only
utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal,
we first identify a comprehensive set of factors that affect the memory bandwidth,
including 1) the number of concurrent memory access ports, 2) the data width of each
port, 3) the maximum burst access length for each port, and 4) the size of consecutive
data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively
evaluate the performance of the Xilinx Alveo U200 and U280 FPGA memory systems when
changing those affecting factors, and provide insights into efficient memory access
in HLS-based accelerator designs. To demonstrate the usefulness of our insights, we
also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN)
and sparse matrix-vector multiplication (SpMV) algorithms. Compared to the baseline
designs, optimized designs leveraging our insights achieve about 3.5x and 8.5x speedups
for the KNN and SpMV accelerators.
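
As a rough illustration of the first three factors in Vitis HLS-style C++, the kernel below exposes the number of memory ports (via separate bundles), the port data width (512 bits), and the burst length hint, while its loop issues consecutive accesses. This is a hedged sketch only; the specific pragma values are examples for illustration, not settings or conclusions from the paper.

    // Illustrative Vitis HLS kernel exposing the memory-bandwidth knobs above:
    // two concurrent m_axi ports (separate bundles), 512-bit data width, and a
    // long burst hint; the loop performs consecutive accesses. Values are
    // examples only, not the paper's recommendations.
    #include <ap_int.h>

    extern "C" void copy512(const ap_uint<512>* in, ap_uint<512>* out, int n) {
    #pragma HLS INTERFACE m_axi port=in  bundle=gmem0 max_read_burst_length=256
    #pragma HLS INTERFACE m_axi port=out bundle=gmem1 max_write_burst_length=256
        for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
            out[i] = in[i];  // wide, sequential accesses enable long bursts
        }
    }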

HBM Connect: High-Performance HLS Interconnect for FPGA HBM

  • Young-kyu Choi
  • Yuze Chi
  • Weikang Qiao
  • Nikola Samardzic
  • Jason Cong

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers
can now exploit unprecedented external memory bandwidth. This allows more memory-bounded
applications to benefit from FPGA acceleration. However, fully utilizing the available
bandwidth may not be an easy task. If an application requires multiple processing
elements to access multiple HBM channels, we observe a significant drop in the effective
bandwidth. The existing high-level synthesis (HLS) programming environment has limitations
in producing an efficient communication architecture. In order to solve this problem,
we propose HBM Connect, a high-performance customized interconnect for FPGA HBM boards.
Novel HLS-based optimization techniques are introduced to increase the throughput
of AXI bus masters and switching elements. We also present a high-performance customized
crossbar that may replace the built-in crossbar. The effectiveness of HBM Connect
is demonstrated using Xilinx’s Alveo U280 HBM board. Based on bucket sort and merge
sort case studies, we explore several design spaces and find the design point with
the best resource-performance trade-off. The result shows that HBM Connect improves
the resource-performance metrics by 6.5x-211x.

PRGA: An Open-Source FPGA Research and Prototyping Framework

  • Ang Li
  • David Wentzlaff

Field Programmable Gate Arrays (FPGAs) are being used in a fast-growing range of scenarios,
and heterogeneous CPU-FPGA systems are being tapped as a possible way to mitigate
the challenges posed by the end of Moore’s Law. This growth in diverse use cases has
fueled the need to customize FPGA architectures for particular applications or application
domains. While high-level FPGA models can help explore the FPGA architecture space,
as FPGAs move to more advanced design nodes, there is an increased need for low-level
FPGA research and prototyping platforms that can be brought all the way to fabrication.

This paper presents Princeton Reconfigurable Gate Array (PRGA), a highly customizable, scalable, and complete open-source framework for building
custom FPGAs. The framework’s core functions include generating synthesizable Verilog
from user-specified FPGA architectures, and providing a complete, auto-generated,
open-source CAD toolchain for the custom FPGAs. Developed in Python, PRGA provides
a user-friendly API and supports use both as a standalone FPGA as well as an embedded
FPGA. PRGA is a great platform for FPGA architecture research, FPGA configuration
memory research, FPGA CAD tool research, and heterogeneous systems research. It is
also a completely open-source framework for designers who need a free and customizable
FPGA IP core. An FPGA designed with PRGA is placed and routed using standard cell
libraries. The design is evaluated and compared to prior works, providing comparable
performance and increased configurability.

Interactive Debugging at IP Block Interfaces in FPGAs

  • Marco Antonio Merlini
  • Isamu Poy
  • Paul Chow

Recent developments have shown FPGAs to be effective for data centre applications,
but debugging support in that environment has not evolved correspondingly. This presents
an additional barrier to widespread adoption. This work proposes Debug Governors,
a new open-source debugger designed for controllability and interactive debugging
that can help to locate issues across multiple FPGAs.

A Debug Governor can pause, log, drop, and/or inject data into any streaming interface.
These operations enable single-stepping, unit testing, and interfacing with software.
Hundreds of Debug Governors can fit in a single FPGA and, because they are transparent
when inactive, can be left “dormant” in production designs.

We show how Debug Governors can be used to resolve functional problems on a real FPGA,
and how they can be extended to memory-mapped protocols.
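
To make the pause/log/drop/inject operations concrete, the fragment below sketches a governor-like element sitting on a streaming interface, written in HLS-flavored C++. It is a hypothetical illustration of the idea only, not the Debug Governor implementation or its control interface.

    // Hypothetical sketch of a streaming "governor" on an AXI-Stream-like link:
    // it can pass data through transparently, pause it, drop it, or inject a
    // software-supplied word. Illustrative only -- not the paper's design.
    #include <ap_int.h>
    #include <hls_stream.h>

    enum Mode : int { PASS = 0, PAUSE = 1, DROP = 2, INJECT = 3 };

    void governor(hls::stream<ap_uint<32> >& in, hls::stream<ap_uint<32> >& out,
                  int mode, unsigned injectValue) {
        switch (mode) {
        case PASS:   if (!in.empty()) out.write(in.read()); break;  // transparent when inactive
        case PAUSE:  break;                                         // hold upstream data (single-step)
        case DROP:   if (!in.empty()) in.read(); break;             // consume and discard
        case INJECT: out.write(injectValue); break;                 // data supplied from software
        }
    }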

SESSION: Poster Session 1

Probabilistic Optimization for High-Level Synthesis

  • Jianyi Cheng
  • John Wickerson
  • George A. Constantinides

High-level synthesis (HLS) tools automatically transform a high-level program, for
example in C/C++, into a low-level hardware description. A key challenge in HLS tools
is scheduling, i.e. determining the start time of all the operations in the untimed
program. There are three approaches to scheduling: static, dynamic and hybrid.

Static scheduling has been well studied; however, statically analysing dynamic hardware
behaviours is still challenging due to the unpredictability caused by run-time dependencies.
Existing approaches either assume the worst-case timing behaviour, which can cause
significant performance loss or area overhead, or use simulation, which takes significant
time to explore a sufficiently large number of program traces.

In this work, we introduce a novel probabilistic model allowing HLS tools to efficiently
estimate and optimize the cycle-level timing behaviour of HLS-generated hardware.
Our framework offers insights to assist both hardware engineers and HLS tools when
estimating and optimizing hardware performance.

A Framework for Optimizing GCN Inference on FPGA

  • Bingyi Zhang
  • Rajgopal Kannan
  • Viktor Prasanna

Graph convolutional networks (GCNs) have revolutionized many big data applications.
However, accelerating GCN inference is still challenging due to (1) massive external
memory traffic and irregular memory access, (2) workload imbalance because of the
skewed degree distribution, and (3) intra-stage load imbalance between feature aggregation
and feature transformation steps. To address the above challenges, we propose a framework
to optimize GCN inference on FPGA. First, we propose a novel Partition-Centric Feature
Aggregation (PCFA) scheme to increase the data locality and reduce the number of random
memory accesses in the feature aggregation step. Second, we propose a novel hardware architecture
to enable pipelined execution of the two heterogeneous computation steps. Then, a
low-overhead task scheduling strategy is proposed to achieve stall-free execution
of the two computation steps. Third, we provide a complete GCN acceleration framework
on FPGA, and define key parameters for users to fine-tune the throughput. The model-specific
operators can be customized to support a wide-range of GCN models. Using our framework,
we design accelerators on a state-of-the-art FPGA. We evaluate our work using widely
used datasets. Experimental results show that the accelerators produced by our framework
achieve significant speedup compared with state-of-the-art implementations on CPU
(≈100x), GPU (≈30x), and FPGA (4.5-32x).

Clockwork: Resource-Efficient Static Scheduling for Multi-Rate Image Processing Applications
on FPGAs

  • Dillon Huff
  • Steve Dai
  • Pat Hanrahan

Image processing algorithms can benefit tremendously from hardware acceleration. However,
hardware accelerators for image processing algorithms look very different from the
programs that image processing algorithm designers are accustomed to writing. Many
image processing hardware compilers have been proposed to close this gap. Unfortunately,
all of them either exclude crucial access patterns, do not scale to realistic size
applications, or rely on a compilation process in which each stage of the application
is an independently scheduled module that sends data to its consumers through FIFOs,
which adds resource and energy overhead while inhibiting synthesis optimizations.
In this work we present a new algorithm for compiling image processing applications
to hardware, Clockwork, that combines insights from polyhedral analysis and synchronous
dataflow to overcome these limitations. Clockwork achieves an average of 43% reduction
in LUTs, 22% reduction in flip-flops, and 17% reduction in BRAMs compared to a state-of-the-art
stencil compiler at the same throughput while handling a wider range of access patterns.
For an image processing application with dozens of stages, Clockwork achieves energy
efficiency 265x that of an 8 core CPU, 17x that of an NVIDIA K80 GPU, and 2.4x that
of an NVIDIA V100 GPU.

LEAP: A Deep Learning based Aging-Aware Architecture Exploration Framework for FPGAs

  • Behnam Ghavami
  • Seyed Milad Ebrahimi
  • Zhenman Fang
  • Lesley Shannon

Transistor aging raises a vital lifetime reliability challenge for FPGA devices in
advanced technology nodes. In this paper, we design a tool called LEAP to enable the
aging-aware FPGA architecture exploration. The core idea of LEAP is to efficiently
model the aging-induced delay degradation at the coarse-grained FPGA basic block level
using deep neural networks (DNNs), while achieving almost the same accuracy as the
transistor-level simulation. For each type of FPGA basic block, such as LUT and
DSP, we first characterize its accurate delay degradation via transistor-level SPICE
simulation under a versatile set of aging factors from the FPGA fabric and in-field
operation. Then we train one DNN model for each block type to learn the relation between
its delay degradation and aging factors. Moreover, we integrate our DNN models into
the widely used Verilog-to-Routing (VTR 8) toolflow and generate the aging-aware FPGA
architecture file. Experimental results demonstrate that our proposed flow can predict
the delay degradation of FPGA blocks more than 10⁴x to 10⁷x faster than transistor-level SPICE simulation, with a maximum prediction error
of less than 0.7%. Therefore, FPGA architects can leverage LEAP to explore better
aging-aware FPGA architectures.

Modeling FPGA-Based Systems via Few-Shot Learning

  • Gagandeep Singh
  • Dionysios Diamantopoulos
  • Juan Gómez-Luna
  • Sander Stuijk
  • Onur Mutlu
  • Henk Corporaal

Machine-learning-based models have recently gained traction as a way to overcome the
slow downstream implementation process of FPGAs by building models that provide fast
and accurate performance predictions. However, these models suffer from two main limitations:
(1) a model trained for a specific environment cannot predict for a new, unknown environment;
(2) training requires large amounts of data (features extracted from FPGA synthesis
and implementation reports), which is cost-inefficient because of the time-consuming
FPGA design cycle. In various systems (e.g., cloud systems), where getting access
to platforms is typically costly, error-prone, and sometimes infeasible, collecting
enough data is even more difficult. Our research aims to answer the following question:
for an FPGA-based system, can we leverage and transfer our ML-based performance models
trained on a low-end local system to a new, unknown, high-end FPGA-based system, thereby
avoiding the aforementioned two main limitations of traditional ML-based approaches?
To this end, we propose a transfer-learning-based approach for FPGA-based systems
that adapts an existing ML-based model to a new, unknown environment to provide fast
and accurate performance and resource utilization predictions.

APCNN: Explore Multi-Layer Cooperation for CNN Optimization and Acceleration on FPGA

  • Beilei Jiang
  • Xianwei Cheng
  • Sihai Tang
  • Xu Ma
  • Zhaochen Gu
  • Hui Zhao
  • Song Fu

In this paper, we introduce APCNN, which explores algorithm-hardware co-design and
provides a CNN acceleration framework with multi-layer cooperative optimization and
customized design on FPGA. In terms of the algorithm design, the pooling layer is
moved before the non-linear activation function and normalization in APCNN, which
we prove causes negligible accuracy loss; the pooling layer is then co-optimized with
the convolutional layer by means of redundant multiplication elimination, local addition
reuse, and global addition reuse. We further design a dedicated accelerator to take
full advantage of convolutional-pooling cross-layer optimization to not only accelerate
computation but also reduce on-/off-chip data communication on the FPGA. We demonstrate
that our novel APCNN can achieve 75% multiplication and 75% addition reduction in
the best case. For on-/off-chip data communication, a max{Row, Col}/(Row × Col) percent
of the memory footprint can be eliminated, where Row and Col are the number of rows and
columns in the activation feature map respectively. We have implemented a prototype
of APCNN and evaluated its performance on LeNet-5 and VGG16 using both an accelerator-level
cycle and energy model and an RTL implementation. Our experimental results show that
APCNN achieves a 2.5× speedup and 4.7× energy efficiency compared with the dense CNN.
(This research was supported in part by NSF grants CCF-1563750, OAC-2017564, and CNS-2037982.)
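
One way to see why hoisting the pooling layer is benign: for max pooling followed by a monotonic activation such as ReLU, the two orderings are mathematically identical, so the approximation (and the reported negligible accuracy loss) concerns the interaction with normalization rather than with the activation itself. The small C++ check below illustrates this identity; it is an illustrative sketch, not code from APCNN.

    // Illustration: max-pool(ReLU(x)) == ReLU(max-pool(x)) for a monotonic
    // non-decreasing activation, so pooling can be moved ahead of ReLU exactly.
    // Sketch only -- not APCNN's implementation.
    #include <algorithm>
    #include <cassert>
    #include <vector>

    static float relu(float x) { return std::max(0.0f, x); }

    int main() {
        std::vector<float> window = {-1.5f, 0.2f, 3.0f, -0.7f};  // one 2x2 pooling window

        // Pool first, then activate.
        float poolThenAct = relu(*std::max_element(window.begin(), window.end()));

        // Activate first, then pool.
        float actThenPool = relu(window[0]);
        for (float v : window) actThenPool = std::max(actThenPool, relu(v));

        assert(poolThenAct == actThenPool);  // identical results
        return 0;
    }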

ScalaBFS: A Scalable BFS Accelerator on FPGA-HBM Platform

  • Chenhao Liu
  • Zhiyuan Shao
  • Kexin Li
  • Minkang Wu
  • Jiajie Chen
  • Ruoshi Li
  • Xiaofei Liao
  • Hai Jin

High Bandwidth Memory (HBM) provides massive aggregated memory bandwidth by exposing
multiple memory channels to the processing units. To achieve high performance, an
accelerator built on top of an FPGA configured with HBM (i.e., FPGA-HBM platform)
needs to scale its performance according to the available memory channels. In this
paper, we propose an accelerator for BFS (Breadth-First Search), named ScalaBFS,
which decouples memory accessing from processing to scale its performance with available
HBM memory channels. Moreover, by configuring each HBM memory channel with multiple
processing elements, ScalaBFS sufficiently exploits the memory bandwidth of HBM. We
implement the prototype system of ScalaBFS and conduct BFS in both real-world and
synthetic scale-free graphs on Xilinx Alveo U280 Data Center Accelerator card (real
hardware). The experimental results show that ScalaBFS scales its performance almost
linearly according to the available memory pseudo channels (PCs) from the HBM2 subsystem
of U280. By fully using the 32 PCs and building 64 processing elements (PEs) on U280,
ScalaBFS achieves a performance up to 19.7 GTEPS (Giga Traversed Edges Per Second).
When conducting BFS in sparse real-world graphs, ScalaBFS achieves equivalent GTEPS
to Gunrock running on the state-of-the-art NVIDIA V100 GPU, which features 64-PC HBM2 (twice
the memory bandwidth of the U280).

AutoDSE: Enabling Software Programmers Design Efficient FPGA Accelerators

  • Atefeh Sohrabizadeh
  • Cody Hao Yu
  • Min Gao
  • Jason Cong

Adopting FPGA as an accelerator in datacenters is becoming mainstream for customized
computing, but the fact that FPGAs are hard to program creates a steep learning curve
for software programmers. Even with the help of high-level synthesis (HLS), accelerator
designers still must manually perform code reconstruction and cumbersome parameter
tuning to achieve the optimal performance. While many learning models have been leveraged
by existing work to automate the design of efficient accelerators, the unpredictability
of modern HLS tools becomes a major obstacle for them to maintain high accuracy. We
address this problem by incorporating an automated DSE framework – AutoDSE – that
leverages a bottleneck-guided gradient optimizer to systematically find a better design
point. AutoDSE finds the bottleneck of the design in each step and focuses on high-impact
parameters to overcome it, similar to the approach an expert would take. The
experimental results show that AutoDSE is able to find the design point that achieves,
on the geometric mean, 19.9x speedup over one CPU core for Machsuite and Rodinia benchmarks
and 1.04x over the manually designed HLS accelerated vision kernels in Xilinx Vitis
libraries, yet with a 26x reduction in optimization pragmas.

SWIFT: Small-World-based Structural Pruning to Accelerate DNN Inference on FPGA

  • Yufei Ma
  • Gokul Krishnan
  • Yu Cao
  • Le Ye
  • Ru Huang

State-of-the-art DNN pruning approaches have achieved high sparsity. However, these methods
usually do not consider the intrinsic graph property of DNNs, leading to an irregular
pruned network. Consequently, hardware accelerators cannot directly benefit from such
pruning, suffering additional costs in indexing, control, and data paths. Inspired by
the observation that the brain and real-world networks follow a Small-World model,
we propose a graph-based progressive structural pruning technique, SWIFT, that integrates
local clusters and global sparsity in DNNs to benefit the dataflow and workload balance
of the accelerators. In particular, we propose an output stationary FPGA architecture
to accelerate DNN inference and integrate it with the structural sparsity by SWIFT,
so that the communication and computation of clustered zero weights are eliminated.
In addition, a full mesh data router is designed to adaptively direct inputs into
corresponding processing elements (PEs) for different layer configurations and skipping
zero operations. The proposed SWIFT is evaluated with multiple DNNs on different datasets.
It achieves sparsity ratios of up to 76% for CIFAR-10, 83% for CIFAR-100, and 76% for the
SVHN datasets. Moreover, our proposed SWIFT FPGA accelerator achieves up to 4.4× improvement
in throughput for different dense networks with a marginal hardware overhead.

Fuzzing High-Level Synthesis Tools

  • Zewei Du
  • Yann Herklotz
  • Nadesh Ramanathan
  • John Wickerson

High-level synthesis (HLS) is becoming an increasingly important part of the computing
landscape, even in safety-critical domains where correctness is key. As such, HLS
tools are increasingly relied upon. But are they trustworthy?

We have subjected three widely used HLS tools – LegUp, Xilinx Vivado HLS, and the
Intel HLS Compiler – to a rigorous fuzzing campaign using thousands of random, valid
C programs that we generated using a modified version of the Csmith tool. For each
C program, we compiled it to a hardware design using the HLS tool under test and checked
whether that hardware design generates the same output as an executable generated
by the GCC compiler. When discrepancies arose between GCC and the HLS tool under test,
we reduced the C program to a minimal example in order to zero in on the potential
bug. Our testing campaign has revealed that all three HLS tools can be made either
to crash or to generate wrong code when given valid C programs, and thereby underlines
the need for these increasingly trusted tools to be more rigorously engineered. Out
of 6700 test cases, we found 272 programs that failed in at least one tool, out of
which we were able to discern at least 6 unique bugs.

RIFL: A Reliable Link Layer Network Protocol for FPGA-to-FPGA Communication

  • Qianfeng (Clark) Shen
  • Jun Zheng
  • Paul Chow

More and more latency-sensitive applications are being introduced into the data center.
Performance of such applications can be limited by the high latency of the network
interconnect. Because the conventional network stack is designed not only for LAN,
but also for WAN, it carries a great amount of redundancy that is not required in
a data center network. This paper introduces the concept of a three-layer protocol
stack that can replace the conventional network stack and fulfill the exact demands
of data center network communications. The detailed design and implementation of the
first layer of the stack, which we call RIFL, is presented. A novel low latency in-band
hop-by-hop re-transmission protocol is proposed and adopted in RIFL, which guarantees
lossless transmission for links whose longest wire segment is no more than 150 meters.
Experimental results show that RIFL achieves 218 nanoseconds round-trip latency on
3 meter zero-hop links, at a throughput of 104.7 Gbps. RIFL is a multi-lane protocol
with scalable throughput from 500 Mbps to above 200 Gbps. It is portable to most of
the recent FPGAs. It can be the enabler of low latency, high throughput, flexible,
scalable, and lossless data center networks.

SESSION: Session 3: Machine Learning and Supporting Algorithms

GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing

  • Qinggang Wang
  • Long Zheng
  • Yu Huang
  • Pengcheng Yao
  • Chuangyi Gui
  • Xiaofei Liao
  • Hai Jin
  • Wenbin Jiang
  • Fubing Mao

Existing FPGA-based graph accelerators, typically designed for static graphs, rarely
handle dynamic graphs that often involve substantial graph updates (e.g., edge/node
insertion and deletion) over time. In this paper, we aim to fill this gap. The key
innovation of this work is to build an FPGA-based dynamic graph accelerator easily
from any off-the-shelf static graph accelerator with minimal hardware engineering
efforts (rather than from scratch). We observe spatial similarity of dynamic graph
updates, in the sense that most graph updates involve only a small fraction
of vertices. We therefore propose an FPGA library, called GraSU, to exploit spatial
similarity for fast graph updates. GraSU uses differential data management, which
retains the high-value data (that will be frequently accessed) in specialized
on-chip UltraRAM while the overwhelming majority of low-value data resides in off-chip
memory. Thus, GraSU can transform most off-chip communications arising in dynamic
graph updates into fast on-chip memory accesses. Our experience shows that GraSU can
be easily integrated into existing state-of-the-art static graph accelerators with
only 11 lines of code modifications. Our implementation atop AccuGraph using a Xilinx
Alveo™ U250 board outperforms two state-of-the-art CPU-based dynamic graph
systems, Stinger and Aspen, by an average of 34.24× and 4.42× in terms of update throughput,
further improving overall efficiency by 9.80× and 3.07× on average.

Folded Integer Multiplication for FPGAs

  • Martin Langhammer
  • Bogdan Pasca

Encryption – especially key exchange algorithms such as RSA – is an increasingly common
use model for FPGAs, driven by the adoption of the FPGA as a SmartNIC in the datacenter.
While bulk encryption such as AES maps well to generic FPGA features, the very large
multipliers required for RSA are a much more difficult problem. Although FPGAs contain
thousands of small integer multipliers in DSP Blocks, aggregating them into very large
multipliers is very challenging because of the large amount of soft logic required
– especially in the form of long adders, and the high embedded multiplier count. In
this paper, we describe a large multiplier architecture that operates in a multi-cycle
format and which has a linear area/throughput ratio. We show results for a 2048-bit
multiplier that has a latency of 118 cycles, inputs data every 9th cycle and closes
timing at 377MHz in an Intel Arria 10 FPGA, and over 400MHz in a Stratix 10. The proposed
multiplier uses 1/9 of the DSP resources typically used in a 2048-bit Karatsuba implementation,
showing a perfectly linear throughput to DSP-count ratio. Our proposed solution outperforms
recently reported results in either arithmetic complexity – by making use of
Karatsuba techniques – or in scheduling efficiency – embedded DSP resources are fully
utilized.

FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations

  • Yichi Zhang
  • Junhao Pan
  • Xinheng Liu
  • Hongzheng Chen
  • Deming Chen
  • Zhiru Zhang

Binary neural networks (BNNs) have 1-bit weights and activations. Such networks are
well suited for FPGAs, as their dominant computations are bitwise arithmetic and the
memory requirement is also significantly reduced. However, compared to state-of-the-art
compact convolutional neural network (CNN) models, BNNs tend to produce a much lower
accuracy on realistic datasets such as ImageNet. In addition, the input layer of BNNs
has gradually become a major compute bottleneck, because it is conventionally excluded
from binarization to avoid a large accuracy loss.

This work proposes FracBNN, which exploits fractional activations to substantially
improve the accuracy of BNNs. Specifically, our approach employs a dual-precision
activation scheme to compute features with up to two bits, using an additional sparse
binary convolution. We further binarize the input layer using a novel thermometer
encoding. Overall, FracBNN preserves the key benefits of conventional BNNs, where
all convolutional layers are computed in pure binary MAC operations (BMACs). We design
an efficient FPGA-based accelerator for our novel BNN model that supports the fractional
activations. To evaluate the performance of FracBNN under a resource-constrained scenario,
we implement the entire optimized network architecture on an embedded FPGA (Xilinx
Ultra96 v2). Our experiments on ImageNet show that FracBNN achieves an accuracy comparable
to MobileNetV2, surpassing the best-known BNN design on FPGAs with an increase of
28.9% in top-1 accuracy and a 2.5x reduction in model size. FracBNN also outperforms
a recently introduced BNN model with an increase of 2.4% in top-1 accuracy while using
the same model size. On the embedded FPGA device, FracBNN demonstrates the ability
to perform real-time image classification.
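
For intuition, thermometer encoding maps an input pixel to a unary bit vector whose set bits count up to a threshold, so the first layer can consume purely binary inputs. The sketch below is a generic C++ illustration with an arbitrary 32-level choice; it is not taken from FracBNN, and the actual encoding granularity may differ.

    // Generic thermometer encoding of an 8-bit pixel into a unary bit vector,
    // so the binarized input layer only ever sees {0, 1} activations.
    // 32 levels chosen arbitrarily for illustration -- not FracBNN's configuration.
    #include <bitset>
    #include <cstdint>

    constexpr int kLevels = 32;            // number of thermometer bits per pixel
    constexpr int kStep   = 256 / kLevels; // pixel range covered by each bit

    std::bitset<kLevels> thermometerEncode(uint8_t pixel) {
        std::bitset<kLevels> code;
        for (int i = 0; i < kLevels; ++i)
            code[i] = (pixel > i * kStep); // bit i turns on once pixel passes its threshold
        return code;
    }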

DYNAMAP: Dynamic Algorithm Mapping Framework for Low Latency CNN Inference

  • Yuan Meng
  • Sanmukh Kuppannagari
  • Rajgopal Kannan
  • Viktor Prasanna

Most of the existing work on FPGA acceleration of Convolutional Neural Networks (CNNs)
focuses on employing a single strategy (algorithm, dataflow, etc.) across all the
layers. Such an approach does not achieve optimal latency on complex and deep CNNs.
Emerging CNNs have diverse per-layer computation characteristics including parallelism,
arithmetic intensity, locality, and memory footprint. Per-layer strategy selection
and fine-grained tuning are required to achieve low end-to-end latency. However, specialized
hardware modules dedicated to each layer limit the per-layer utilization and adversely
affect end-to-end latency. In this paper, we address these problems by an algorithm-architecture
co-optimization framework, DYNAMAP, consisting of (1) a unified hardware overlay that
can be reused across layers, supporting dynamic mapping of all three families of popular
convolution algorithms, and further allowing flexible dataflow switching to maximize
hardware utilization for each layer; (2) a novel software Design Space Exploration
(DSE) flow that customizes the hardware overlay and chooses optimal strategy mapping.
We show that the algorithm mapping space increases exponentially with network depth,
and while the optimal algorithm selection problem is NP-hard in general, by exploiting
the series-parallel structure of CNN models, we demonstrate a polynomial-time solution
for optimal algorithm mapping. DYNAMAP is optimized for any CNN, including those having
diverse computation and memory requirements across the layers. We demonstrate DYNAMAP
using two state-of-the-art CNNs – GoogleNet and Inception-V4. The generated accelerators
achieve up to 2.8x and 1.4x speedups, respectively, with respect to inference latency compared
with the state-of-the-art FPGA implementations.

S2N2: A FPGA Accelerator for Streaming Spiking Neural Networks

  • Alireza Khodamoradi
  • Kristof Denolf
  • Ryan Kastner

Spiking Neural Networks (SNNs) are the next generation of Artificial Neural Networks
(ANNs) that utilize an event-based representation to perform more efficient computation.
Most SNN implementations have a systolic array-based architecture and, by assuming
high sparsity in spikes, significantly reduce computing in their designs. This work
shows this assumption does not hold for applications with signals of large temporal
dimension. We develop a streaming SNN (S2N2) architecture that can support fixed-per-layer
axonal and synaptic delays for its network. Our architecture is built upon FINN and
thus efficiently utilizes FPGA resources. We show how radio frequency processing matches
our S2N2 computational model. By not performing tick-batching, a stream of RF samples
can efficiently be processed by S2N2, improving the memory utilization by more than
three orders of magnitude.

CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs

  • Qijing Huang
  • Dequan Wang
  • Zhen Dong
  • Yizhao Gao
  • Yaohui Cai
  • Tian Li
  • Bichen Wu
  • Kurt Keutzer
  • John Wawrzynek

Deploying deep learning models on embedded systems for computer vision tasks has been
challenging due to limited compute resources and strict energy budgets. The majority
of existing work focuses on accelerating image classification, while other fundamental
vision problems, such as object detection, have not been adequately addressed. Compared
with image classification, detection problems are more sensitive to the spatial variance
of objects, and therefore, require specialized convolutions to aggregate spatial information.
To address this need, recent work introduces dynamic deformable convolution to augment
regular convolutions. Regular convolutions process a fixed grid of pixels across all
the spatial locations in an image, while dynamic deformable convolution may access
arbitrary pixels in the image with the access pattern being input-dependent and varying
with spatial location. These properties lead to inefficient memory accesses of inputs
with existing hardware.

In this work, we harness the flexibility of FPGAs to develop a novel object detection
pipeline with deformable convolutions. We show the speed-accuracy tradeoffs for a
set of algorithm modifications including irregular-access versus limited-range and
fixed-shape on a flexible hardware accelerator. We evaluate these algorithmic changes
with corresponding hardware optimizations and show a 1.36x and 9.76x speedup respectively
for the full and depthwise deformable convolution on hardware with minor accuracy
loss. We then co-design a network called CoDeNet with the modified deformable convolution
for object detection and quantize the network to 4-bit weights and 8-bit activations.
With our high-efficiency implementation, our solution reaches 26.9 frames per second
with a tiny model size of 0.76 MB while achieving 61.7 AP50 on the standard object
detection dataset, Pascal VOC. With our higher-accuracy implementation, our model
gets to 67.1 AP50 on Pascal VOC with only 2.9 MB of parameters, 20.9x smaller but
10% more accurate than Tiny-YOLO.

Efficient FPGA Modular Multiplication Implementation

  • Martin Langhammer
  • Bogdan Pasca

Barrett’s algorithm is the most commonly known method of performing a modular multiplication,
which is the core of many modern encryption algorithms such as RSA. Barrett’s algorithm
requires an accurate quotient estimation which in turn requires accurate multiplications.
These multiplications operating on word sizes of thousands of bits are particularly
expensive to implement in FPGAs, requiring many hundreds or even thousands of embedded
DSP components along with large amounts of logic and routing. In this work we show
that approximate quotient estimates as results of aggressive multiplier truncations
can significantly reduce implementation cost. The looser modified Barrett’s output
[0, YM) is reduced to [0, M) using a shallow reduction technique based on table lookups
and wide additions, taking advantage of new techniques which have recently been introduced
for FPGA. We first use these techniques to develop an improved standard Barrett’s
implementation for 1024b modular multiplication, followed by our approximate method
which reduces logic cost in the LSB truncated multiplier by approximately 10%. The
effect is more pronounced for very large word sizes, where our relaxed error bounds
in the LSB truncated multiplication can reduce the number of operations by 20%.
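
For reference, the textbook (non-truncated) form of Barrett reduction that the paper starts from is sketched below at a small 32-bit word size, purely to make the quotient-estimation step concrete. The paper's contribution lies in truncating the very wide multiplications that compute this quotient at 1024-bit and larger operand sizes, which this sketch does not attempt.

    // Textbook Barrett reduction for a 32-bit modulus (sketch for intuition only;
    // the paper targets 1024-bit and larger operands and truncates the wide multiplies).
    #include <cassert>
    #include <cstdint>

    using u64  = uint64_t;
    using u128 = unsigned __int128;

    struct Barrett {
        u64 M;   // modulus (assumed > 1 and < 2^32 here)
        u64 mu;  // precomputed floor(2^64 / M)

        explicit Barrett(u64 m) : M(m), mu(static_cast<u64>((u128(1) << 64) / m)) {}

        // Reduce x < M^2 modulo M with one quotient estimate plus a shallow
        // correction, instead of a full-width division.
        u64 reduce(u128 x) const {
            u64 q = static_cast<u64>((x * mu) >> 64);   // quotient estimate
            u64 r = static_cast<u64>(x - u128(q) * M);  // remainder estimate, slightly large
            while (r >= M) r -= M;                      // at most a couple of corrections
            return r;
        }
    };

    int main() {
        Barrett b(0xFFFFFFFBu);                         // a 32-bit prime modulus
        u128 x  = u128(0x12345678u) * 0x9ABCDEF1u;      // product of two residues
        assert(b.reduce(x) == static_cast<u64>(x % b.M));
        return 0;
    }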

SESSION: Keynote 2

Are We Alone? Searching for ET with FPGAs

  • Dan Werthimer

What is the possibility of other intelligent life in the universe? Can we detect radio,
infrared, or visible light signals from alien civilizations? Current and future projects
searching for such signals may provide an answer. Dan will describe SETI@home, the
new PANOSETI observatory, future searches, and show how FPGAs and new technologies
are revolutionizing the search for extra-terrestrial intelligence (SETI).

Dan will also describe the Collaboration for Astronomy Signal Processing and Electronics
Research (CASPER) open source hardware, tools and libraries for FPGA based radio astronomy
instrumentation that produced the first image of a black hole and discovered many
fast radio bursts, pulsars, and a planet made from solid diamond. Next generation
radio telescopes will be composed of hundreds to thousands of smaller telescopes;
these large arrays require peta-ops per second of real time processing to combine
telescope signals and generate spectral-images. Dan will describe these telescopes
and their real time signal processing systems.

Open source hardware, software, libraries, tools, reference designs and video training
are available at http://casper.berkeley.edu

SESSION: Poster Session 2

Stealing Neural Network Structure through Remote FPGA Side-channel Analysis

  • Yicheng Zhang
  • Rozhin Yasaei
  • Hao Chen
  • Zhou Li
  • Mohammad Abdullah Al Faruque

Deep Neural Network (DNN) models have been extensively developed by companies for
a wide range of applications. The development of a customized DNN model with great
performance requires costly investments, and its structure (layers and hyper-parameters)
is considered intellectual property and holds immense value. However, in this paper,
we find that the model secret is vulnerable when a cloud-based FPGA accelerator executes
it. We demonstrate an end-to-end attack based on remote power side-channel analysis
and machine-learning-based secret inference against different DNN models. The evaluation
result shows that an attacker can reconstruct the layer and hyper-parameter sequence
at over 90% accuracy using our method, which can significantly reduce her model development
workload. We believe the threat presented by our attack is tangible, and new defense
mechanisms should be developed against this threat.

Exploring PGAS Communication for Heterogeneous Clusters with FPGAs

  • Varun Sharma
  • Paul Chow

This work presents a heterogeneous communication library for generic clusters of processors
and FPGAs. This library, Shoal, supports the Partitioned Global Address Space (PGAS)
memory model for applications. PGAS is a shared memory model for clusters that creates
a distinction between local and remote memory access. Through Shoal and its common
application programming interface for hardware and software, applications can be more
freely migrated to the optimal platform and deployed onto dynamic cluster topologies.

The library is tested using a thorough suite of microbenchmarks to establish latency
and throughput performance. We also show an implementation of the Jacobi iterative
method that demonstrates the ease with which applications can be moved between platforms
to yield faster run times.

Extending High-Level Synthesis for Task-Parallel Programs

  • Yuze Chi
  • Licheng Guo
  • Young-kyu Choi
  • Jie Wang
  • Jason Cong

C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable
gate array (FPGA) accelerators in many application domains in recent years, thanks
to its competitive quality of result (QoR) and short development cycle compared with
the traditional register-transfer level (RTL) design approach. Yet, limited by the
sequential C semantics, it remains challenging to adopt the same highly productive
high-level programming approach in many other application domains, where coarse-grained
tasks run in parallel and communicate with each other at a fine-grained level. While
current HLS tools support task-parallel programs, the productivity is greatly limited
in the code development, correctness verification, and QoR tuning cycles, due to the
poor programmability, restricted software simulation, and slow code generation, respectively.
Such limited productivity often defeats the purpose of HLS and hinders programmers
from adopting HLS for task-parallel FPGA accelerators.

In this paper, we extend the HLS C++ language and present a fully automated framework
with programmer-friendly interfaces, universal software simulation, and fast code
generation to overcome these limitations. Experimental results based on a wide range
of real-world task-parallel programs show that, on average, the lines of kernel and
host code are reduced by 22% and 51%, respectively, which considerably improves the
programmability. The correctness verification and the iterative QoR tuning cycles
are both greatly accelerated by 3.2× and 6.8×, respectively.

Simulating and Evaluating a Quaternary Logic FPGA Based on Floating-gate Memories
and Voltage Division

  • Ayokunle Fadamiro
  • Pouyan Rezaie
  • Spencer Millican
  • Christopher Harris

Technology scaling cannot meet consumer demands, especially for binary circuits. Previous
studies proposed addressing this with multi-valued logic (MVL) architectures, but
these architectures use non-standard fabrication techniques and optimistic performance
analysis. This study presents a new quaternary FPGA (QFPGA) architecture based on
floating-gate memories that standard CMOS fabrication can fabricate: programming floating-gates
implement a voltage divider, and these divided voltages represent one of four distinct
logic values. When simulated with open-source FinFET SPICE models, the proposed architecture
obtains competitive delay and power performance compared to equivalent binary and
QFPGA architectures from literature. Results show the proposed QFPGA basic logic element
(BLE) requires half the area and dissipates a third of the power density compared
to QFPGA architectures from literature. When projecting BLE performance onto benchmark
circuits, implementing circuits requires up to 55% less area and one-third the power,
and the proposed architecture can operate at clock speeds up to three times faster
than binary equivalents. Future studies will investigate accurate modeling of interconnects
to better account for their performance impacts and will explore efficient architectures
for programming MVL memories when they’re used in FPGAs.

Resource Sharing in Dataflow Circuits

  • Lana Josipović
  • Axel Marmet
  • Andrea Guerrieri
  • Paolo Ienne

To achieve resource-efficient hardware designs, high-level synthesis tools share functional
units among operations of the same type. This optimization is typically performed
in conjunction with operation scheduling to ensure the best possible unit usage at
each point in time. Dataflow circuits have emerged as an alternative HLS approach
to efficiently handle irregular and control-dominated code. However, these circuits
do not have a predetermined schedule; in its absence, it is challenging to determine
which operations can share a functional unit without a performance penalty. Additionally,
although sharing seems to imply only trivial circuitry, sharing units in dataflow
circuits may cause deadlock by blocking certain data transfers and preventing operations
from executing. We developed a complete methodology to implement resource sharing
in dataflow designs. Our approach automatically identifies performance-acceptable
resource sharing opportunities based on average unit utilization with data tokens.
Our sharing mechanism achieves functionally correct and deadlock-free circuits by
regulating the multiplexing of tokens at the inputs of the shared unit. On a set of
benchmarks obtained from C code, we show that our approach effectively implements
resource sharing and results in significant area savings compared to dataflow circuits
that do not support this feature. Our sharing mechanism is key to achieving different
area-performance tradeoffs in dataflow designs and to making them competitive in terms
of computational resources with circuits generated using standard HLS techniques.
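A hedged sketch of the utilization criterion described above, with invented rates and latencies: two operations can share one functional unit without throttling steady-state throughput only if their combined demand stays within the unit's capacity.

```python
# Illustrative sketch (invented numbers): decide whether dataflow operations can
# share one pipelined functional unit based on their average token utilization.
# Each operation is (average tokens processed per cycle, unit initiation interval).
ops = {
    "mul_a": (0.20, 1.0),   # fires on 20% of cycles, unit accepts one token/cycle
    "mul_b": (0.35, 1.0),
    "mul_c": (0.60, 1.0),
}

def combined_utilization(names, ops):
    """Fraction of the shared unit's capacity consumed by the listed operations."""
    return sum(rate * interval for rate, interval in (ops[n] for n in names))

def can_share(names, ops, budget=1.0):
    """Sharing is performance-acceptable if the total utilization fits in one unit."""
    return combined_utilization(names, ops) <= budget

print(can_share(["mul_a", "mul_b"], ops))           # True  (0.55 <= 1.0)
print(can_share(["mul_a", "mul_b", "mul_c"], ops))  # False (1.15 >  1.0)
```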

Triggered Scheduling: Efficient Detection of Dataflow Network Idleness on Heterogeneous
Systems

  • Mahyar Emami
  • Endri Bezati
  • Jörn W. Janneck
  • James Larus

Hardware-software codesign for FPGAs requires flexible and changeable boundaries between
hardware and software. Design space exploration is facilitated by expressing programs
in a language that can be compiled for both CPU and FPGA execution. Such an approach
requires efficient and general communication mechanisms between hardware and software.
We present a practical solution to this problem for heterogeneous programs expressed
in CAL, an actor-based language, running on a PCIe-based FPGA system where communication
between a processor and FPGA is relatively expensive. We show how a network of continuously
executing software and hardware actors with fine-grained communication can be expressed
as a coprocessor model that executes the network in discrete steps with efficient
coarse-grained transfers across the PCIe bus.

To this end, we present the Triggered Scheduling (TS) algorithm to detect idleness
(i.e. lack of forward progress) of a dynamic actor network with unpredictable consumption/production
rates. With TS, it is possible to treat a network of actors running on hardware as
a coprocessor that can be called by software. We show how TS can be used to build
a truly heterogeneous system on an HLS platform. Using four large benchmarks, we analyze
the performance and resource utilization of the Triggered Scheduling algorithm.
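The notion of idleness (no actor can make forward progress) that Triggered Scheduling detects can be pictured with a small, generic actor-network simulation; the actors and firing rules below are invented for illustration and are not the paper's benchmarks.

```python
# Illustrative sketch: run a network of actors in discrete steps and report
# idleness when no actor can fire, i.e. no forward progress is possible.
from collections import deque

def make_network():
    fifos = {"a2b": deque(), "b2out": deque()}
    def actor_a(state):
        # Source actor: produces 5 tokens, then stops.
        if state["produced"] < 5:
            fifos["a2b"].append(state["produced"])
            state["produced"] += 1
            return True
        return False
    def actor_b(state):
        # Doubles every token it receives.
        if fifos["a2b"]:
            fifos["b2out"].append(2 * fifos["a2b"].popleft())
            return True
        return False
    return fifos, [actor_a, actor_b], {"produced": 0}

fifos, actors, state = make_network()
while True:
    # Attempt every actor once per step; the list keeps all firing attempts.
    progressed = any([actor(state) for actor in actors])
    if not progressed:            # network is idle: hand control back to software
        break
print(list(fifos["b2out"]))       # [0, 2, 4, 6, 8]
```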

Classifying Computations on Multi-Tenant FPGAs

  • Mustafa Gobulukoglu
  • Colin Drewes
  • Bill Hunter
  • Dustin Richmond
  • Ryan Kastner

Modern data centers leverage large FPGAs to provide low latency, high throughput,
and low energy computation. FPGA multi-tenancy is an attractive option to maximize
utilization, yet it opens the door to unique security threats. In this work, we develop
a remote classification pipeline that targets the confidentiality of multi-tenant
cloud FPGA environments. We design a unique Dual-Edged voltage fluctuation sensor
that measures subtle changes in the power distribution network caused by co-located
computations. The sensor measurements are given to a classification pipeline that
is able to deduce information about co-located applications including the type of
computation and its implementation. We study the importance of the trace length, signal
conditioning algorithms, and other aspects that affect classification accuracy. Our
results show that we can determine if another co-tenant is present with 96% accuracy.
We can classify with 98% accuracy whether a power waster circuit is operating. Furthermore,
we are able to determine whether a cryptographic operation is occurring, differentiate
between different cryptographic algorithms (AES and PRESENT), and distinguish microarchitectural
implementations (Microblaze, ORCA, and PicoRV32).

NPE: An FPGA-based Overlay Processor for Natural Language Processing

  • Hamza Khan
  • Asma Khan
  • Zainab Khan
  • Lun Bin Huang
  • Kun Wang
  • Lei He

In recent years, transformer-based models have shown state-of-the-art results for
Natural Language Processing (NLP). In particular, the introduction of the BERT language
model brought with it breakthroughs in tasks such as question answering and natural
language inference, advancing applications that allow humans to interact naturally
with embedded devices. FPGA-based overlay processors have been shown as effective
solutions for edge image and video processing applications, which mostly rely on low
precision linear matrix operations. In contrast, transformer-based NLP techniques
employ a variety of higher precision nonlinear operations with significantly higher
frequency. We present NPE, an FPGA-based overlay processor that can efficiently execute
a variety of NLP models. NPE offers software-like programmability to the end user
and, unlike FPGA designs that implement specialized accelerators for each nonlinear
function, can be upgraded for future NLP models without requiring reconfiguration.
NPE can meet real-time conversational AI latency targets for the BERT language model
with 4x lower power than CPUs and 6x lower power than GPUs. We also show NPE uses
3x fewer FPGA resources relative to comparable BERT network-specific accelerators
in the literature. NPE provides a cost-effective and power-efficient FPGA-based solution
for Natural Language Processing at the edge.

PyLog: An Algorithm-Centric Python-Based FPGA Programming and Synthesis Flow

  • Sitao Huang
  • Kun Wu
  • Hyunmin Jeong
  • Chengyue Wang
  • Deming Chen
  • Wen-mei Hwu

The exploding complexity and computation efficiency requirements of applications are
stimulating a strong demand for hardware acceleration with heterogeneous platforms
such as FPGAs. However, a high-quality FPGA design is very hard to create as it requires
FPGA expertise and a long design iteration time. In contrast, software applications
are typically developed in a short development cycle, in high-level languages like
Python, which is at a much higher level of abstraction than all existing hardware
design flows. To close this gap between hardware design flows and software applications,
and simplify FPGA programming, we create PyLog, a high-level, algorithm-centric Python-based
programming and synthesis flow for FPGA. PyLog is powered by a set of compiler optimization
passes and a type inference system to generate high-quality design. It abstracts away
the implementation details and allows designers to focus on algorithm specification.
PyLog captures more high-level computation patterns for better optimization than traditional
HLS systems. PyLog also has a runtime for running PyLog code directly on FPGA platforms
without any extra code development. Evaluation shows that PyLog significantly improves
FPGA design productivity and generates highly efficient FPGA designs that outperform
highly optimized CPU and FPGA versions by 3.17× and 1.24× on average, respectively.

MLBlocks: FPGA Blocks for Machine Learning Applications

  • Seyedramin Rasoulinezhad
  • David Boland
  • Philip H.W. Leong

The underlying goal of FPGA architecture research is to devise flexible substrates
which implement a wide variety of circuits efficiently. Contemporary FPGA architectures
have been optimized to support networking, signal processing and image processing
applications through high precision digital signal processing (DSP) blocks. The recent
emergence of machine learning has created a new set of demands characterized by: 1)
higher computational density and 2) low precision arithmetic requirements. With the
goal of exploring this new design space in a methodical manner, we first propose a
problem formulation involving computing nested loops over multiply-accumulate (MAC)
operations, which covers many basic linear algebra primitives and standard deep neural
network (DNN) layers. A quantitative methodology for deriving efficient coarse-grained
compute block architectures from benchmarks is then proposed together with a family
of new compute units, called MLBlocks. These blocks are flexible mesh-based systolic
array units parameterized with different data movements, data reuse, and multi-precision
support. They utilize a columnar arrangement which is compatible with existing FPGA
architectures. Finally, using synthetic benchmarks, we demonstrate that MLBlocks offer
significantly improved performance over the commercial Xilinx DSP48E2, while maintaining
similar area and timing requirements to current DSPs.
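The nested-loop multiply-accumulate formulation that the authors use as their problem statement can be written down directly; a minimal Python sketch of one instance of it (a dense matrix-matrix multiply with toy dimensions):

```python
# Minimal sketch of the nested-loop multiply-accumulate (MAC) formulation that
# covers dense linear-algebra primitives and standard DNN layers, instantiated
# here as a small matrix-matrix multiply with toy dimensions.
M, N, K = 4, 3, 5

A = [[float(i * K + k) for k in range(K)] for i in range(M)]    # M x K
B = [[float(k * N + j) for j in range(N)] for k in range(K)]    # K x N
C = [[0.0 for _ in range(N)] for _ in range(M)]                 # M x N

for i in range(M):            # outer loops select an output element ...
    for j in range(N):
        acc = 0.0
        for k in range(K):    # ... the inner loop is a chain of MAC operations
            acc += A[i][k] * B[k][j]
        C[i][j] = acc

print(C[0])   # first output row
```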

3M-AI: A Multi-task and Multi-core Virtualization Framework for Multi-FPGA AI Systems
in the Cloud

  • Shulin Zeng
  • Guohao Dai
  • Hanbo Sun
  • Jun Liu
  • Hongren Zheng
  • Yusong Wu
  • Fan Zhang
  • Xinhao Yang
  • Yi Cai
  • Yu Wang
  • Huazhong Yang

With the ever-growing demands for online Artificial Intelligence (AI), the hardware
virtualization support for deep learning accelerators is vital for providing AI capability
in the cloud. Three basic features, multi-task, dynamic workload, and remote access,
are fundamental for hardware virtualization. However, most deep learning accelerators
do not support concurrent execution of multiple tasks. Besides, state-of-the-art multi-DNN
scheduling algorithms for NN accelerators consider neither multi-task concurrent
execution nor resource allocation for multi-core DNN accelerators. Moreover,
existing GPU virtualization solutions can introduce a large remote access latency overhead,
resulting in a severe system performance drop.

In order to tackle these challenges, we propose 3M-AI, a Multi-task and Multi-core
virtualization framework for Multi-FPGA AI systems in the cloud. 3M-AI enables model
parallelism on multi-FPGA by optimizing data synchronization and movement between
FPGAs. 3M-AI exploits a heuristic hardware resource allocation algorithm and an accurate
multi-core latency prediction model. 3M-AI significantly reduces the remote API access
overhead to nearly 1%, and achieves better NN inference latency at batch size
1 compared with GPU virtualization solutions.

SESSION: Session 4: Applications

Reconfigurable Acceleration of Short Read Mapping with Biological Consideration

  • Ho-Cheung Ng
  • Izaak Coleman
  • Shuanglong Liu
  • Wayne Luk

Existing FPGA accelerators for short read mapping often sacrifice the complete
biological information in sequencing data in favor of simple hardware designs, leading to missed
or incorrect alignments. Furthermore, their performance may not be optimized across
hardware platforms. This paper proposes a novel alignment pipeline that considers
all information in sequencing data for biologically accurate acceleration of short
read mapping. To ensure that the performance of the proposed design is optimized across different
platforms, we accelerate the memory-bound operations which have been a bottleneck
in short read mapping. Specifically, we partition the FM-index into buckets. The length
of each bucket is equal to an optimal multiple of the memory burst size and is determined
through data-driven exploration. A tool has been developed to obtain the optimal parameters
of the design for different hardware platforms to enhance performance optimization.
Experimental results indicate that our design maximizes alignment accuracy compared
to the state-of-the-art software Bowtie while mapping reads 4.48x as fast. Compared to
a previous hardware aligner, our design achieves 97.7% accuracy and reports 4.48 M
more valid alignments at a similar speed.
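The bucketed FM-index layout described above can be illustrated with a small, generic sketch: the BWT is split into fixed-size buckets, each carrying the cumulative symbol counts up to its start, so one rank query touches a single bucket. The bucket length and alphabet here are illustrative, not the tool's tuned parameters.

```python
# Illustrative sketch: bucketed occurrence (rank) structure over a BWT string.
# Each bucket stores the cumulative counts up to its start plus its slice of the
# BWT, so a rank query reads a single bucket -- the idea behind sizing buckets
# to a multiple of the memory burst length.
BWT = "ACGT$TGCAAGTCCGATTAGC"
ALPHABET = "ACGT$"
BUCKET_LEN = 8                      # stand-in for (multiple of burst size) / symbol width

buckets = []
running = {c: 0 for c in ALPHABET}
for start in range(0, len(BWT), BUCKET_LEN):
    buckets.append({"counts": dict(running), "chars": BWT[start:start + BUCKET_LEN]})
    for ch in BWT[start:start + BUCKET_LEN]:
        running[ch] += 1

def rank(symbol: str, i: int) -> int:
    """Number of occurrences of `symbol` in BWT[0:i], using one bucket access."""
    b = buckets[i // BUCKET_LEN]
    return b["counts"][symbol] + b["chars"][: i % BUCKET_LEN].count(symbol)

assert all(rank("A", i) == BWT[:i].count("A") for i in range(len(BWT) + 1))
print(rank("A", 13))   # 3
```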

An FPGA-based 7-ENOB 600 MSample/s ADC without any External Components

  • Lukas Leuenberger
  • Dorian Amiet
  • Tao Wei
  • Paul Zbinden

Analog to digital converters (ADCs) are indispensable nowadays. Analog signals are
digitized earlier and earlier in the processing chain to reduce the need for complex
analog signal processing. For this reason, ADCs are often integrated directly into
field-programmable gate arrays (FPGAs) or microprocessors. However, such ADCs are designed
for a specific set of requirements with limited flexibility. In this paper, a new
structure of an FPGA-based ADC is proposed. The ADC is based on the slope ADC, where
a time-to-digital converter (TDC) measures the time from the beginning of a reference
slope until the slope reaches the voltage-to-be-measured. Only FPGA-internal elements
are used to build the ADC. It is fully reconfigurable and does not require any external
components. This innovation offers the flexibility to convert almost any digital input/output
(I/O) into an ADC. Considering the very high number of digital I/O ports available
in today’s FPGA systems, this enables the construction of a massive and powerful ADC
array directly on a standard FPGA. The proposed ADC has a resolution of 9.3 bit and
achieves an effective number of bits (ENOB) of 7 at a sample rate of 600 MSample/s.
The differential nonlinearity (DNL) ranges from -0.9 to 0.9 bit, and the integral
nonlinearity (INL) is in the range between -1.1 and 0.9 bit. An alternative version
of the ADC operates at 1.2 GSample/s and achieves an ENOB of 5.3.
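The slope-ADC principle behind the design, where a TDC measures how long a known reference ramp takes to cross the input voltage, reduces to simple arithmetic; the ramp slope and TDC resolution in this toy Python sketch are invented.

```python
# Toy sketch of the slope-ADC principle (invented parameters): a reference ramp
# rises at a known rate, a time-to-digital converter (TDC) counts how long the
# ramp stays below the input voltage, and that count is the digital output code.
RAMP_SLOPE = 1.2e9       # volts per second (illustrative)
TDC_RESOLUTION = 10e-12  # seconds per TDC step (illustrative)

def slope_adc_code(v_in: float) -> int:
    """Digital code = crossing time of the ramp, quantized by the TDC."""
    t_cross = v_in / RAMP_SLOPE            # time for the ramp to reach v_in
    return round(t_cross / TDC_RESOLUTION) # TDC output code

def code_to_voltage(code: int) -> float:
    """Back-conversion used to inspect the quantization error."""
    return code * TDC_RESOLUTION * RAMP_SLOPE

for v in (0.05, 0.33, 0.71):
    code = slope_adc_code(v)
    print(f"Vin={v:.3f} V -> code {code} (~{code_to_voltage(code):.4f} V)")
```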

A Framework for Customizable FPGA-based Image Registration Accelerators

  • Davide Conficconi
  • Eleonora D’Arnese
  • Emanuele Del Sozzo
  • Donatella Sciuto
  • Marco D. Santambrogio

Image Registration is a highly compute-intensive optimization procedure that determines
the geometric transformation to align a floating image to a reference one. Generally,
the registration targets are images taken from different time instances, acquisition
angles, and/or sensor types. Several methodologies are employed in the literature
to address the limiting factors of this class of algorithms, among which hardware
accelerators seem the most promising solution to boost performance. However, most
hardware implementations are either closed-source or tailored to a specific context,
limiting their application to different fields. For these reasons, we propose an open-source
hardware-software framework to generate a configurable architecture for the most compute-intensive
part of registration algorithms, namely the similarity metric computation. This metric
is Mutual Information, a well-known measure from information theory used
in several optimization procedures. Through different design parameter configurations,
we explore several design choices of our highly customizable architecture and validate
it on multiple FPGAs. We evaluated various architectures against an optimized Matlab
implementation on an Intel Xeon Gold, reaching a speedup of up to 2.86x, and remarkable
performance and power efficiency against other state-of-the-art approaches.
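For reference, the Mutual Information between a floating and a reference image is commonly computed from their joint intensity histogram; a compact NumPy sketch (image sizes and bin count are arbitrary):

```python
# Compact sketch of the Mutual Information (MI) similarity metric computed from
# a joint intensity histogram, the quantity an image-registration loop evaluates.
import numpy as np

def mutual_information(img_a: np.ndarray, img_b: np.ndarray, bins: int = 32) -> float:
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_ab = joint / joint.sum()              # joint probability of intensity pairs
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of image A
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of image B
    nonzero = p_ab > 0                      # avoid log(0)
    return float(np.sum(p_ab[nonzero] * np.log(p_ab[nonzero] / (p_a @ p_b)[nonzero])))

rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(64, 64)).astype(float)
aligned = reference + rng.normal(0, 5, size=reference.shape)   # nearly aligned
shuffled = rng.permutation(reference.ravel()).reshape(64, 64)  # unrelated

print(mutual_information(reference, aligned))    # high MI: images are related
print(mutual_information(reference, shuffled))   # low MI: intensities are independent
```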

NASCENT: Near-Storage Acceleration of Database Sort on SmartSSD

  • Sahand Salamat
  • Armin Haj Aboutalebi
  • Behnam Khaleghi
  • Joo Hwan Lee
  • Yang Seok Ki
  • Tajana Rosing

As the size of data generated every day grows dramatically, the computational bottleneck
of computer systems has shifted toward the storage devices. Thanks to recent
developments in storage devices, the interface between the storage and the computational
platforms has become the main limitation as it provides limited bandwidth which does
not scale when the number of storage devices increases. Interconnect networks limit
the performance of the system when independent operations are executing on different
storage devices since they do not provide simultaneous accesses to all the storage
devices. Offloading the computations to the storage devices eliminates the burden
of data transfer from the interconnects. Emerging as a nascent computing trend, near
storage computing offloads a portion of computation to the storage devices to accelerate
the big data applications. In this paper, we propose a near storage accelerator for
database sort, NASCENT, which utilizes Samsung SmartSSD, an NVMe flash drive with
an on-board FPGA chip that processes data in-situ. We propose, to the best of our
knowledge, the first near storage database sort based on bitonic sort which considers
the specifications of the storage devices to increase the scalability of computer
systems as the number of storage devices increases. NASCENT improves both performance
and energy efficiency as the number of storage devices increases. With 12 SmartSSDs,
NASCENT is 7.6x (147.2x) faster and 5.6x (131.4x) more energy efficient than the FPGA
(CPU) baseline.
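Bitonic sort, the algorithm NASCENT builds on, has a fixed, data-independent compare-and-swap pattern, which is what makes it attractive for FPGA and near-storage implementation; a short recursive Python sketch of the classic algorithm (for power-of-two input lengths):

```python
# Classic bitonic sort (power-of-two length): its compare-and-swap schedule is
# fixed and independent of the data, which is what makes it hardware-friendly.
def bitonic_sort(data, ascending=True):
    if len(data) <= 1:
        return list(data)
    half = len(data) // 2
    first = bitonic_sort(data[:half], True)    # build an ascending half ...
    second = bitonic_sort(data[half:], False)  # ... and a descending half
    return _bitonic_merge(first + second, ascending)

def _bitonic_merge(seq, ascending):
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    seq = list(seq)
    for i in range(half):                      # fixed compare-and-swap stage
        if (seq[i] > seq[i + half]) == ascending:
            seq[i], seq[i + half] = seq[i + half], seq[i]
    return _bitonic_merge(seq[:half], ascending) + _bitonic_merge(seq[half:], ascending)

values = [19, 3, 27, 1, 8, 42, 5, 16]
print(bitonic_sort(values))                    # [1, 3, 5, 8, 16, 19, 27, 42]
```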

MOCHA: Multinode Cost Optimization in Heterogeneous Clouds with Accelerators

  • Peipei Zhou
  • Jiayi Sheng
  • Cody Hao Yu
  • Peng Wei
  • Jie Wang
  • Di Wu
  • Jason Cong

FPGAs have been widely deployed in public clouds, e.g., Amazon Web Services (AWS)
and Huawei Cloud. However, simply offloading accelerated kernels from CPU hosts to
PCIe-based FPGAs does not guarantee out-of-pocket cost savings in a pay-as-you-go
public cloud. Taking Genome Analysis Toolkit (GATK) applications as case studies,
although the adoption of FPGAs reduces the overall execution time, it introduces 2.56×
extra cost, due to insufficient application-level speedup by Amdahl’s law. To optimize
the out-of-pocket cost while keeping high speedup and throughput, we propose Mocha
framework as a distributed runtime system to fully utilize the accelerator resource
by accelerator sharing and CPU-FPGA partial task offloading. Evaluation results on
Haplotype Caller (HTC) and Mutect2 in GATK show that, compared with a straightforward
CPU-FPGA integration solution, Mocha reduces the application cost by 2.82x for HTC and
1.06x for Mutect2 on AWS, and by 1.22x and 1.52x respectively on Huawei Cloud, with less
than 5.1% performance overhead.
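The cost effect of Amdahl's law that motivates Mocha can be made concrete with back-of-the-envelope arithmetic; the prices, speedups, and run times below are invented placeholders, not the paper's measured numbers.

```python
# Back-of-the-envelope sketch (invented prices/speedups): offloading a kernel to
# an FPGA instance can raise the out-of-pocket cost even though it cuts run time,
# because Amdahl's law limits the application-level speedup.
CPU_PRICE = 1.00          # $/hour for a CPU-only instance (illustrative)
FPGA_PRICE = 3.00         # $/hour for a CPU+FPGA instance (illustrative)

accel_fraction = 0.40     # fraction of run time that the FPGA kernel covers
kernel_speedup = 10.0     # speedup of that kernel on the FPGA

def amdahl_speedup(f: float, s: float) -> float:
    """Application-level speedup when a fraction f is accelerated by s."""
    return 1.0 / ((1.0 - f) + f / s)

baseline_hours = 10.0
app_speedup = amdahl_speedup(accel_fraction, kernel_speedup)
fpga_hours = baseline_hours / app_speedup

cpu_cost = baseline_hours * CPU_PRICE
fpga_cost = fpga_hours * FPGA_PRICE
print(f"app speedup   = {app_speedup:.2f}x")   # ~1.56x
print(f"CPU-only cost = ${cpu_cost:.2f}")      # $10.00
print(f"CPU+FPGA cost = ${fpga_cost:.2f}")     # ~$19.23 -> faster but pricier
```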

Design Principles for Packet Deparsers on FPGAs

  • Thomas Luinaud
  • Jeferson Santiago da Silva
  • J.M. Pierre Langlois
  • Yvon Savaria

The P4 language has drastically changed the networking field, as it makes it possible to quickly
describe and implement new networking applications. Although a large variety of applications
can be described with the P4 language, current programmable switch architectures impose
significant constraints on P4 programs. To address this shortcoming, FPGAs have been
explored as potential targets for P4 applications. P4 applications are described using
three abstractions: a packet parser, match-action tables, and a packet deparser, which
reassembles the output packet with the result of the match-action tables. While implementations
of packet parsers and match-action tables on FPGAs have been widely covered in the
literature, no general design principles have been presented for the packet deparser.
Indeed, implementing a high-speed and efficient deparser on FPGAs remains an open
issue because it requires a large amount of interconnections and the architecture
must be tailored to a P4 program. As a result, in several works where a P4 application
is implemented on FPGAs, the deparser consumes a significant proportion of chip resources.
Hence, in this paper, we address this issue by presenting design principles for efficient
and high-speed deparsers on FPGAs. As an artifact, we introduce a tool that generates
an efficient vendor-agnostic deparser architecture from a P4 program. Our design has
been validated and simulated with a cocotb-based framework. The resulting architecture
is implemented on Xilinx Ultrascale+ FPGAs and supports a throughput of more than
200 Gbps while reducing resource usage by almost 10x compared to other solutions.

ASPDAC 2021 TOC

ASPDAC ’21: Proceedings of the 26th Asia and South Pacific Design Automation Conference 

SESSION: 1A: University Design Contest I

A DSM-based Polar Transmitter with 23.8% System Efficiency

  • Yuncheng Zhang
  • Bangan Liu
  • Xiaofan Gu
  • Chun Wang
  • Atsushi Shirane
  • Kenichi Okada

An energy efficient digital polar transmitter (TX) based on 1.5bit Delta-Sigma modulator
(DSM) and fractional-N injection-locked phase-locked loop (IL-PLL) is proposed. In
the proposed TX, redundant charge and discharge of turned-off capacitors in the conventional
switched-capacitor power amplifiers (SCPAs) are avoided, which drastically improves
the efficiency at power back-off. In the PLL, a spur-mitigation technique is proposed
to reduce the frequency mismatch between the oscillator and the reference. The transmitter,
implemented in 65nm CMOS, achieves a PAE of 29% at an EVM of -25.1dB, and a system
efficiency of 23.8%.

A 0.41W 34Gb/s 300GHz CMOS Wireless Transceiver

  • Ibrahim Abdo
  • Takuya Fujimura
  • Tsuyoshi Miura
  • Korkut K. Tokgoz
  • Atsushi Shirane
  • Kenichi Okada

A 300GHz CMOS-only wireless transceiver that achieves a maximum data rate of 34Gb/s
while consuming a total power of 0.41W from a 1V supply is introduced. A subharmonic
mixer with low conversion loss is proposed to compensate for the absence of RF amplifiers
in the TX and RX, as a mixer-last-mixer-first topology is adopted. The TRX covers 19 IEEE802.15.3d
channels (13-23, 39-43, 52-53, 59).

Capacitive Sensor Circuit with Relative Slope-Boost Method Based on a Relaxation Oscillator

  • Ryo Onishi
  • Koki Miyamoto
  • Korkut Kaan Tokgoz
  • Noboru Ishihara
  • Hiroyuki Ito

This paper presents a relative slope-boosting technique for a capacitive sensor circuit
based on a relaxation oscillator. Our technique improves jitter, i.e. resolution,
by changing both the voltage slope on the sensing and the reference sides with respect
to the sensor capacitance. The sensor prototype circuit is implemented in a 180-nm
standard CMOS process and achieves a resolution of 710 aF while consuming 12.7 pJ of energy
per cycle at a 13.78 kHz output frequency. The measured power consumption from a 1.2
V DC supply is 430 nW.

28GHz Phase Shifter with Temperature Compensation for 5G NR Phased-array Transceiver

  • Yi Zhang
  • Jian Pang
  • Kiyoshi Yanagizawa
  • Atsushi Shirane
  • Kenichi Okada

A phase shifter with temperature compensation for 28GHz phased-array TRX is presented.
A precise low-voltage current reference is proposed for the IDAC biasing circuit.
The total gain variation for a single TX path including phase shifter and post stage
amplifiers over -40°C to 80°C is only 1dB in measurement and the overall phase error
due to temperature is less than 1 degree without off-chip calibration.

An up to 35 dBc/Hz Phase Noise Improving Design Methodology for Differential-Ring-Oscillators
Applied in Ultra-Low Power Systems

  • Peter Toth
  • Hiroki Ishikuro

This work presents a novel control loop concept to dynamically adjust a differential
ring oscillator's (DRO) biasing in order to improve its phase noise (PN) performance
in the ultra-low-power domain. The proposed feedback system can be applied to any DRO with
a tail current source. The paper presents the proposed concept
and includes measurements of a 180 nm CMOS integrated prototype system, which underlines
the feasibility of the discussed idea. Measurements show an up to 35 dBc/Hz phase
noise improvement with an active control loop. Moreover, the tuning range of the implemented
ring oscillator is extended by about 430 % compared to fixed bias operation. These
values are measured at a minimum oscillation power consumption of 55 pW/Hz.

University LSI Design Contest ASP-DAC 2021

Gate Voltage Optimization in Capacitive DC-DC Converters for Thermoelectric Energy
Harvesting

  • Yi Tan
  • Yohsuke Shiiki
  • Hiroki Ishikuro

This paper presents a gate-voltage-optimized, fully integrated charge pump for thermoelectric
energy harvesting applications. The trade-off created by raising the
gate voltage of the switching transistors is discussed. The proposed 5/3-stage design,
implemented in a 180 nm CMOS process, achieves startup voltages as low as 0.12V/0.13V,
respectively, with the proposed technique. A 20% peak power conversion efficiency
improvement is achieved compared with a similar 3-stage linear charge pump from
previous state-of-the-art research.

A 0.57-GOPS/DSP Object Detection PIM Accelerator on FPGA

  • Bo Jiao
  • Jinshan Zhang
  • Yuanyuan Xie
  • Shunli Wang
  • Haozhe Zhu
  • Xiaoyang Kang
  • Zhiyan Dong
  • Lihua Zhang
  • Chixiao Chen

The paper presents an object detection accelerator featuring a processing-in-memory
(PIM) architecture on FPGAs. PIM architectures are well known for their energy efficiency
and avoidance of the memory wall. In the accelerator, a PIM unit is developed using
BRAM and LUT based counters, which also helps to improve the DSP performance density.
The overall architecture consists of 64 PIM units and three memory buffers to store
inter-layer results. A shrunk and quantized Tiny-YOLO network is mapped to the PIM
accelerator, where DRAM access is fully eliminated during inference. The design achieves
a throughput of 201.6 GOPS at a 100MHz clock rate and, correspondingly, a performance
density of 0.57 GOPS/DSP.

Supply Noise Reduction Filter for Parallel Integrated Transimpedance Amplifiers

  • Shinya Tanimura
  • Akira Tsuchiya
  • Toshiyuki Inoue
  • Keiji Kishine

This paper presents a supply noise reduction technique for transimpedance amplifiers (TIAs) for
optical interconnection. TIAs integrated in parallel suffer from inter-channel interference
via the supply and ground lines. We employ an RC filter to reduce the supply noise.
The filter is inserted at the first stage of the TIA and does not need extra power. The
proposed circuit was fabricated in a 180-nm CMOS process. The measurement results verify
a 38% noise reduction at 5 Gbps operation.

SESSION: 1B: Accelerating Design and Simulation

A Fast Yet Accurate Message-level Communication Bus Model for Timing Prediction of
SDFGs on MPSoC

  • Hai-Dang Vu
  • S. Le Nours
  • S. Pillement
  • Ralf Stemmer
  • Kim Grüttner

Fast yet accurate performance and timing prediction of complex parallel data flow
applications on multi-processor systems remains a difficult discipline. The difficulty
stems from the complexity of the data flow applications and of the hardware platform
with shared resources, such as buses and memories. This combination may lead to complex
timing interferences that are difficult to express in pure analytical or classical
simulation-based approaches. In this work, we propose a message-level communication
model for timing and performance prediction of Synchronous Data Flow (SDF) applications
on MPSoCs with shared memories. We compare our work against measurement and TLM simulation-based
performance prediction models on two case-studies from the computer vision domain.
We show that our simulation outperforms existing approaches in both accuracy and execution
time and is suitable for fast yet accurate design space exploration.

Simulation of Ideally Switched Circuits in SystemC

  • Breytner Joseph Fernández-Mesa
  • Liliana Andrade
  • Frédéric Pétrot

Modeling and simulation of power systems at low levels of abstraction is supported
by specialized tools such as SPICE and MATLAB. But when power systems are part of
larger systems including digital hardware and software, low-level models become over-detailed;
at the system level, models must be simple and execute fast. We present an extension
to SystemC that relies on efficient modeling, simulation, and synchronization strategies
for Ideally Switched Circuits. Our solution enables designers to specify circuits
and to jointly simulate them with other SystemC hardware and software models. We test
our extension with three power converter case studies and show a simulation speed-up
between 1.2 and 2.7 times while preserving accuracy when compared to the reference
tool. This work demonstrates the suitability of SystemC for the simulation of heterogeneous
models to meet system-level goals such as validation, verification, and integration.

HW-BCP: A Custom Hardware Accelerator for SAT Suitable for Single Chip Implementation for
Large Benchmarks

  • Soowang Park
  • Jae-Won Nam
  • Sandeep K. Gupta

Boolean Satisfiability (SAT) has broad usage in Electronic Design Automation (EDA),
artificial intelligence (AI), and theoretical studies. Further, since SAT is NP-complete,
its acceleration will also enable acceleration of a wide range of combinatorial
problems.

We propose a completely new custom hardware design to accelerate SAT. Starting with
the well-known fact that Boolean Constraint Propagation (BCP) takes most of the SAT
solving time (80-90%), we focus on accelerating BCP. By profiling a widely-used software
SAT solver, MiniSAT v2.2.0 (MiniSAT2) [1], we identify opportunities to accelerate
BCP via parallelization and elimination of von Neumann overheads, especially data
movement. The proposed hardware for BCP (HW-BCP) achieves these goals via a customized
combination of content-addressable memory (CAM) cells, SRAM cells, logic circuitry,
and optimized interconnects.

In 65nm technology, on the largest SAT instances in the SAT Competition 2017 benchmark
suite, our HW-BCP dramatically accelerates BCP (4.5ns per BCP in simulations) and
hence provides a 62-185x speedup over an optimized software implementation running on
general-purpose processors.

Finally, we extrapolate our HW-BCP design to 7nm technology and estimate area and
delay. The analysis shows that in 7nm, in a realistic chip size, HW-BCP would be large
enough for the largest SAT instances in the benchmark suite.
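Boolean Constraint Propagation itself is repeated unit propagation over the clause database; a small software sketch of the operation HW-BCP parallelizes (the clause and assignment encodings here are generic, not the paper's hardware data structures):

```python
# Small software sketch of Boolean Constraint Propagation (unit propagation):
# repeatedly find clauses with all-but-one literal falsified and imply the last
# literal, until a fixed point or a conflict.  Literals are signed ints (+v/-v).
def bcp(clauses, assignment):
    """assignment maps var -> True/False; returns (assignment, conflict_flag)."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                var, want = abs(lit), lit > 0
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == want:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return assignment, True           # all literals false: conflict
            if len(unassigned) == 1:              # unit clause: imply the literal
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment, False

cnf = [[1, 2], [-1, 3], [-3, 2, 4]]
print(bcp(cnf, {2: False}))   # ({2: False, 1: True, 3: True, 4: True}, False)
```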

SESSION: 1C: Process-in-Memory for Efficient and Robust AI

A Novel DRAM-Based Process-in-Memory Architecture and its Implementation for CNNs

  • Chirag Sudarshan
  • Taha Soliman
  • Cecilia De la Parra
  • Christian Weis
  • Leonardo Ecco
  • Matthias Jung
  • Norbert Wehn
  • Andre Guntoro

Processing-in-Memory (PIM) is an emerging approach to bridge the memory-computation
gap. One of the key challenges of PIM architectures in the scope of neural network
inference is the deployment of traditional area-intensive arithmetic multipliers in
memory technology, especially for DRAM-based PIM architectures. Hence, existing DRAM
PIM architectures are either confined to binary networks or exploit the analog property
of the sub-array bitlines to perform bulk bit-wise logic operations. The former reduces
the accuracy of predictions, i.e. Quality-of-results, while the latter increases overall
latency and power consumption.

In this paper, we present a novel DRAM-based PIM architecture and implementation for
multi-bit-precision CNN inference. The proposed implementation relies on shifter-based
approximate multiplications specially designed to fit into commodity DRAM architectures
and its technology. The main goal of this work is to propose an architecture that
is fully compatible with commodity DRAM architecture and to maintain a similar thermal
design power (i.e. < 1 W). Our evaluation shows that the proposed DRAM-based PIM has
a small area overhead of 6.6% when compared with an 8 Gb commodity DRAM. Moreover,
the architecture delivers a peak performance of 8.192 TOPS per memory channel while
maintaining a very high energy efficiency. Finally, our evaluation also shows that
the use of approximate multipliers results in a negligible drop in prediction accuracy
(i.e. < 2 %) in comparison with conventional CNN inference that relies on traditional
arithmetic multipliers.
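One common way to realize shifter-based approximate multiplication, used here only as a generic illustration of the idea and not as the paper's exact DRAM-side scheme, is to round one operand to its nearest power of two so the product reduces to a bit shift:

```python
# Generic illustration of shifter-based approximate multiplication: round one
# operand to the nearest power of two, so the multiply becomes a single shift.
# This is an illustrative scheme, not the paper's exact implementation.
import math
import random

def approx_mul(x: int, w: int) -> int:
    """Approximate x*w by rounding |w| to the nearest power of two (a shift)."""
    if w == 0 or x == 0:
        return 0
    sign = -1 if (w < 0) != (x < 0) else 1
    shift = max(round(math.log2(abs(w))), 0)   # exponent of the nearest power of two
    return sign * (abs(x) << shift)

random.seed(1)
errors = []
for _ in range(1000):
    x, w = random.randint(1, 127), random.randint(1, 127)
    exact = x * w
    errors.append(abs(exact - approx_mul(x, w)) / exact)
print(f"mean relative error: {sum(errors) / len(errors):.2%}")
```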

A Quantized Training Framework for Robust and Accurate ReRAM-based Neural Network
Accelerators

  • Chenguang Zhang
  • Pingqiang Zhou

Neural networks (NN), especially deep neural networks (DNN), have achieved great success
in many fields. The ReRAM crossbar, as a promising candidate, is widely employed to
accelerate neural networks owing to its natural ability to perform matrix-vector
multiplication (MVM). However, ReRAM crossbars suffer from high conductance variation
due to many non-ideal effects, resulting in significant inference accuracy degradation.
Recent works use uniform quantization to enhance the tolerance to conductance variation,
but these methods still suffer high accuracy loss under large variation. In this paper,
we first analyze the impact of quantization
and conductance variation on the accuracy. Then, based on two observations, we propose
a quantized training framework to enhance the robustness and accuracy of the neural
network running on the accelerator, by introducing a smart non-uniform quantizer.
This framework consists of a robust trainable quantizer and a corresponding training
method; it incurs no extra hardware overhead and is compatible with a standard neural
network training procedure. Experimental results show that our proposed method can
improve inference accuracy by 10% ~ 30% under large variation, compared with uniform
quantization methods.

Attention-in-Memory for Few-Shot Learning with Configurable Ferroelectric FET Arrays

  • Dayane Reis
  • Ann Franchesca Laguna
  • Michael Niemier
  • Xiaobo Sharon Hu

Attention-in-Memory (AiM), a computing-in-memory (CiM) design, is introduced to implement
the attentional layer of Memory Augmented Neural Networks (MANNs). AiM consists of
a memory array based on Ferroelectric FETs (FeFET) along with CMOS peripheral circuits
implementing configurable functionalities, i.e., it can be dynamically changed from
a ternary content-addressable memory (TCAM) to a general-purpose (GP) CiM. When compared
to state-of-the art accelerators, AiM achieves comparable end-to-end speed-up and
energy for MANNs, with better accuracy (95.14% vs. 92.21%, and 95.14% vs. 91.98%)
at iso-memory size, for a 5-way 5-shot inference task with the Omniglot dataset.

SESSION: 1D: Validation and Verification

Mutation-based Compliance Testing for RISC-V

  • Vladimir Herdt
  • Sören Tempel
  • Daniel Große
  • Rolf Drechsler

Compliance testing for RISC-V is very important. Essentially, it ensures that compatibility
is maintained between RISC-V implementations and the ever growing RISC-V ecosystem.
Therefore, an official Compliance Test-suite (CT) is being actively developed. However,
it is very difficult to ensure that all relevant functional behavior is comprehensively
tested.

In this paper, we propose a mutation-based approach to boost RISC-V compliance testing
by providing more comprehensive testing results. To this end, we define mutation classes
tailored for RISC-V to assess the quality of the CT and provide a symbolic execution
framework to generate new test-cases that kill the undetected mutants. Our experimental
results demonstrate the effectiveness of our approach. We identified several serious
gaps in the CT and generated new tests to close these gaps.
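The mutation-testing idea at the core of the approach can be shown generically: apply a small, deliberate change (a mutant) to a reference behavior and check whether the existing test suite detects it. The toy instruction model and mutant below are invented for illustration and are not one of the paper's RISC-V mutation classes.

```python
# Generic mutation-testing sketch (toy instruction model, invented mutant class):
# a mutant of the reference behavior survives if no test in the suite detects it,
# which signals a gap in the test suite.
def ref_slt(rs1: int, rs2: int) -> int:
    """Reference 'set-less-than' behavior: 1 if rs1 < rs2 else 0."""
    return 1 if rs1 < rs2 else 0

def mutant_slt(rs1: int, rs2: int) -> int:
    """Mutant: comparison operator replaced (< becomes <=)."""
    return 1 if rs1 <= rs2 else 0

weak_suite = [(1, 5), (7, 2)]          # never exercises the rs1 == rs2 corner case
stronger_suite = weak_suite + [(4, 4)] # added test kills the mutant

def mutant_killed(suite) -> bool:
    return any(ref_slt(a, b) != mutant_slt(a, b) for a, b in suite)

print(mutant_killed(weak_suite))       # False -> gap in the test suite
print(mutant_killed(stronger_suite))   # True  -> new test closes the gap
```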

A General Equivalence Checking Framework for Multivalued Logic

  • Chia-Chun Lin
  • Hsin-Ping Yen
  • Sheng-Hsiu Wei
  • Pei-Pei Chen
  • Yung-Chih Chen
  • Chun-Yao Wang

Logic equivalence checking is a critical task in the ASIC design flow. Due to the
rapid development in nanotechnology-based devices, an efficient implementation of
multivalued logic becomes practical. As a result, many synthesis algorithms for ternary
logic were proposed. In this paper, we present an equivalence checking framework
for multivalued logic that exploits modern SAT solvers. Furthermore, a structural
conflict-driven clause learning (SCDCL) technique is also proposed to accelerate the
SAT solving process. The SCDCL algorithm deploys some strategies to cut off the search
space for SAT algorithms. The experimental results show that the proposed SCDCL technique
saves 42% of SAT-solver CPU time on average over a set of industrial benchmarks.

ATLaS: Automatic Detection of Timing-based Information Leakage Flows for SystemC HLS Designs

  • Mehran Goli
  • Rolf Drechsler

In order to meet the time-to-market constraint, High-level Synthesis (HLS) is being
increasingly adopted by the semiconductor industry. HLS designs, which can be automatically
translated into the Register Transfer Level (RTL), are typically written in SystemC
at the Electronic System Level (ESL). Timing-based information leakage and its countermeasures,
while well-known at RTL and below, have not been yet considered for HLS. The paper
makes a contribution to this emerging research area by proposing ATLaS, a novel approach
for detecting timing-based information leakage flows in SystemC HLS designs. The efficiency
of our approach in identifying timing channels in SystemC HLS designs is demonstrated
on two security-critical architectures: a shared interconnect and a crypto core.

SESSION: 1E: Design Automation Methods for Various Microfluidic Platforms

A Multi-Commodity Network Flow Based Routing Algorithm for Paper-Based Digital Microfluidic
Biochips

  • Nai-Ren Shih
  • Tsung-Yi Ho

Paper-based digital microfluidic biochips (P-DMFBs) have emerged as a safe, low-cost,
and fast-responsive platform for biochemical assays. In P-DMFB, droplet manipulations
are executed by the electrowetting technology. In order to enable the electrowetting
technology, pattern arrays of electrodes and control lines are coated on paper with
a hydrophobic Teflon film and dielectric parylene-C film. Different from traditional
DMFBs, the manufacturing of P-DMFBs is efficient and inexpensive since the electrodes
and control lines are printed on photo paper with an inkjet printer. Active paper-based
hybridized chip (APHC) is a type of P-DMFB that has both open and closed parts. An APHC enjoys
more convenience than common P-DMFBs since it has no need to fabricate and maintain
the micro gap between glass and paper chip, which requires highly delicate treatments.
However, the pattern rails of electrodes in APHCs are denser than traditional P-DMFBs,
which makes existing electrode routing algorithms fail on APHCs. To deal with the challenge
in electrode routing of APHCs, this paper proposes a multi-commodity network flow-based
routing algorithm, which simultaneously maximizes the routability and minimizes the
total wire length of control lines. The multi-commodity flow model can utilize the
pin-sharing between electrodes, which can improve routability and reduce the detour
of routing lines. Moreover, the activation sequences of electrodes are considered,
which guarantees that the bioassay will not be interfered with after pin-sharing.
The proposed method achieves a 100% successful routing rate on real-life APHCs, while
other electrode routing methods cannot solve the electrode routing of APHCs successfully.

Interference-Free Design Methodology for Paper-Based Digital Microfluidic Biochips

  • Yun-Chen Lo
  • Bing Li
  • Sooyong Park
  • Kwanwoo Shin
  • Tsung-Yi Ho

Paper-based digital microfluidic biochips (P-DMFBs) have recently attracted great
attention for their low-cost, in-place, and fast fabrication. This technology is essential
for agile bio-assay development and deployment. P-DMFBs print electrodes and associated
control lines on paper to control droplets and complete bio-assays. However, P-DMFBs
have the following issues: 1) control line interference may cause unwanted droplet movements,
2) avoiding control interference degrades assay performance and routability, 3) single
layer fabrication limits routability, and 4) expensive ink cost limits low-cost benefits
of P-DMFBs. To solve above issues, this work proposes an interference-free design
methodology to design P-DMFBs with fast assay speed, better routability, and compact
printing area. The contributions are as follows: First, we categorize control interference
into soft and hard. Second, we show that only soft interference occurs and propose
to remove soft control interference constraints. Third, we propose an interference-free
design methodology. Finally, we propose a cost-efficient ILP-based fluidic design
module. Experimental results show that the proposed method outperforms prior work [14] across
all bio-assay benchmarks. Compared to previous work, our cost-optimized designs use
only 47%~78% area, gain 3.6%~16.2% more routing resources, and achieve 0.97x~1.5x
shorter assay completion time. Our performance-optimized designs can accelerate assay
speed by 1.05x~1.65x using 81%~96% printed area.

Accurate and Efficient Simulation of Microfluidic Networks

  • Gerold Fink
  • Philipp Ebner
  • Medina Hamidović
  • Werner Haselmayr
  • Robert Wille

Microfluidics is a promising field that provides technological advances to the
life sciences. However, the design process for microfluidic devices is still in its
infancy and frequently results in a “trial-and-error” scheme. In order to overcome
this problem, simulation methods provide a powerful solution—allowing for deriving
a design, validating its functionality, or exploring alternatives without the need
of an actual fabricated and costly prototype. To this end, several physical models
are available such as Computational Fluid Dynamics (CFD) or the 1-dimensional analysis
model. However, while CFD-simulations have high accuracy, they also have high costs
with respect to setup and simulation time. On the other hand, the 1D-analysis model
is very efficient but lacks in accuracy when it comes to certain phenomena. In this
work, we present ideas to combine these two models and, thus, to provide an accurate
and efficient simulation approach for microfluidic networks. A case study confirms
the general suitability of the proposed approach.
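The 1D-analysis model mentioned above typically treats microchannels like resistors in an electrical network (Hagen-Poiseuille flow), which is what makes it so cheap compared with CFD; a minimal sketch of that analogy for a single circular channel, with illustrative dimensions and fluid properties:

```python
# Minimal sketch of the 1D-analysis analogy: a microchannel behaves like a
# resistor, with flow rate Q = dP / R_hyd (Hagen-Poiseuille for a circular
# channel).  Dimensions and fluid properties below are illustrative.
import math

MU_WATER = 1.0e-3        # dynamic viscosity of water, Pa*s

def hydraulic_resistance_circular(length_m: float, radius_m: float, mu: float = MU_WATER) -> float:
    """R_hyd = 8 * mu * L / (pi * r^4) for a circular cross-section."""
    return 8.0 * mu * length_m / (math.pi * radius_m ** 4)

length = 10e-3           # 10 mm channel
radius = 50e-6           # 50 um radius
delta_p = 1.0e3          # 1 kPa pressure drop across the channel

r_hyd = hydraulic_resistance_circular(length, radius)
flow = delta_p / r_hyd   # m^3/s, the "Ohm's law" of the 1D model
print(f"R_hyd = {r_hyd:.3e} Pa*s/m^3, Q = {flow * 1e9 * 60:.2f} uL/min")
```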

SESSION: 2A: University Design Contest II

A 65nm CMOS Process Li-ion Battery Charging Cascode SIDO Boost Converter with 89%
Maximum Efficiency for RF Wireless Power Transfer Receiver

  • Yasuaki Isshiki
  • Dai Suzuki
  • Ryo Ishida
  • Kousuke Miyaji

This paper proposes a 65nm CMOS process cascode single-inductor-dual-output (SIDO)
boost converter for RF wireless power transfer (WPT) receiver. In order to withstand
4.2V Li-ion battery output, cascode 2.5V I/O PFETs are used at the power stage while
2.5V cascode NFETs are used for the 1V output to supply the low-voltage control circuit. By
using NFETs, a 1V output with 5V tolerance can be achieved. Measurement results show a
conversion efficiency of 89% at PIN=7.9mW and Vbat=3.4V.

A High Accuracy Phase and Amplitude Detection Circuit for Calibration of 28GHz Phased
Array Beamformer System

  • Joshua Alvin
  • Jian Pang
  • Atsushi Shirane
  • Kenichi Okada

This paper presents high-accuracy phase and amplitude detection circuits for the calibration
of 5G millimeter-wave phased array beamformer systems. The phase and amplitude detection
circuits, which are implemented in a 65nm CMOS process, can realize phase and amplitude
detections with RMS phase error of 0.17 degree and RMS gain error of 0.12 dB, respectively.
The total power consumption of the circuits is 59mW.

A Highly Integrated Energy-efficient CMOS Millimeter-wave Transceiver with Direct-modulation
Digital Transmitter, Quadrature Phased-coupled Frequency Synthesizer and Substrate-Integrated
Waveguide E-shaped Patch Antenna

  • Wei Deng
  • Zheng Song
  • Ruichang Ma
  • Haikun Jia
  • Baoyong Chi

An energy-efficient millimeter-wave transceiver with direct-modulation digital transmitter
(TX), I/Q phase-coupled frequency synthesizer and Substrate-Integrated Waveguide (SIW)
E-shaped patch antenna is presented in this paper. The proposed transceiver achieves
a 10-Gbps data rate while consuming 340.4 mW. The measured Over-the-Air (OTA) EVM
is -13.8 dB. The energy efficiency is 34 pJ/bit, which is a significant improvement
compared with the state-of-the-art mm-wave transceivers.

A 3D-Stacked SRAM Using Inductive Coupling Technology for AI Inference Accelerator
in 40-nm CMOS

  • Kota Shiba
  • Tatsuo Omori
  • Mototsugu Hamada
  • Tadahiro Kuroda

A 3D-stacked SRAM using an inductive coupling wireless inter-chip communication technology
(TCI) is presented for an AI inference accelerator. The energy and area efficiency
are improved thanks to the introduction of a proposed low-voltage NMOS push-pull transmitter
and a 12:1 SerDes. A termination scheme to short unused open coils is proposed to
eliminate the ringing in an inductive coupling bus. Test chips were fabricated in
a 40-nm CMOS technology confirming 0.40-V operation of the proposed transmitter with
successful stacked SRAM operation.

Sub-10-μm Coil Design for Multi-Hop Inductive Coupling Interface

  • Tatsuo Omori
  • Kota Shiba
  • Mototsugu Hamada
  • Tadahiro Kuroda

Sub-10-μm on-chip coils are designed and prototyped for the multi-hop inductive coupling
interface in a 40-nm CMOS. Multi-layer coils and a new receiver circuit are employed
to compensate for the decrease in the coupling coefficient due to the small coil size.
The prototype emulates a 3D stacked module with 8 dies in a 7-nm CMOS and shows that
a 0.1-pJ/bit and 41-Tb/s/mm2 inductive coupling interface is achievable.

Current-Starved Chaotic Oscillator Over Multiple Frequency Decades on Low-Cost CMOS: Towards Distributed and Scalable Environmental Sensing with a Myriad of Nodes

  • Korkut Kaan Tokgoz
  • Ludovico Minati
  • Hiroyuki Ito

This work presents a current-starved cross-coupled chaotic oscillator achieving multiple
decades of oscillation frequency spanning 2 kHz to 15 MHz. The main circuit characteristics
are low-power consumption (<100 nW to 25 μW, at 1 V supply voltage), and controllability
of the oscillation frequency, enabling future applications such as in distributed
environmental sensing. The IC was implemented in 180 nm standard CMOS process, yielding
a core area of 0.028 mm2.

TCI Tester: Tester for Through Chip Interface

  • Hideto Kayashima
  • Hideharu Amano

An 18 Bit Time-to-Digital Converter Design with Large Dynamic Range and Automated
Multi-Cycle Concept

  • Peter Toth
  • Hiroki Ishikuro

This paper presents a wide-dynamic-range high-resolution time-domain converter concept
tailored for low-power sensor interfaces. The unique system structure applies different
techniques to reduce circuit complexity, power consumption, and noise sensitivity.
A multi-cycle concept allows a virtual delay line extension and is applied to achieve
high resolution down to 1ns. At the same time, it expands the dynamic range drastically
up to 2.35 ms. Moreover, individually tunable delay elements in the range of 1ns to
12 ns allow on-demand flexible operation in a low- or high-resolution mode for smart
sensing applications and flexible power control. The concept of this paper is evaluated
by a custom-designed FPGA supported PCB. The presented concept is highly suitable
for on-chip integration.

University LSI Design Contest ASP-DAC 2021

SESSION: 2B: Emerging Non-Volatile Processing-In-Memory for Next Generation Computing

Connection-based Processing-In-Memory Engine Design Based on Resistive Crossbars

  • Shuhang Zhang
  • Hai Helen Li
  • Ulf Schlichtmann

Deep neural networks have successfully been applied to various fields. The efficient
deployment of neural network models emerges as a new challenge. Processing-in-memory
(PIM) engines that carry out computation within memory structures are widely studied
for improving computation efficiency and data communication speed. In particular,
resistive memory crossbars can naturally realize the dot-product operations and show
great potential in PIM design. The common practice of a current-based design is to
map a matrix to a crossbar, apply the input data from one side of the crossbar, and
extract the accumulated currents as the computation results at the orthogonal direction.
In this study, we propose a novel PIM design concept that is based on the crossbar
connections. Our analysis of the star-mesh network transformation reveals that in a crossbar
storing both input data and weight matrix, the dot-product result is embedded within
the network connection. Our proposed connection-based PIM design leverages this feature
and discovers the latent dot-products directly from the connection information. Moreover,
in the connection-based PIM design, the output current range of resistive crossbars
can easily be adjusted, leading to more linear conversion to voltage values, and the
output circuitry can be shared by multiple resistive crossbars. The simulation results
show that our design can achieve on average 46.23% and 33.11% reductions in area and
energy consumption, with a mere 3.85% latency overhead compared with current-based
designs.
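For reference, the star-mesh network transformation the analysis relies on eliminates the star's center node and connects every pair of its neighbors with an equivalent conductance G_ij = G_i * G_j / (G_1 + ... + G_n); a small Python sketch of that standard transformation, with arbitrary conductance values:

```python
# Small sketch of the star-mesh transformation: eliminating the center node of a
# star with branch conductances G_i yields a mesh whose pairwise conductances are
# G_ij = G_i * G_j / sum(G).  Conductance values here are arbitrary.
from itertools import combinations

def star_to_mesh(branch_g: dict) -> dict:
    """branch_g maps neighbor name -> conductance to the star's center node."""
    total = sum(branch_g.values())
    return {(a, b): branch_g[a] * branch_g[b] / total
            for a, b in combinations(sorted(branch_g), 2)}

star = {"in0": 2.0, "in1": 3.0, "out": 5.0}   # three branches around one center
for pair, g in star_to_mesh(star).items():
    print(pair, round(g, 3))
# ('in0', 'in1') 0.6   ('in0', 'out') 1.0   ('in1', 'out') 1.5
```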

FePIM: Contention-Free In-Memory Computing Based on Ferroelectric Field-Effect Transistors

  • Xiaoming Chen
  • Yuping Wu
  • Yinhe Han

The memory wall bottleneck has caused a large portion of the energy to be consumed
by data transfer between processors and memories when dealing with data-intensive
workloads. By giving some processing abilities to memories, processing-in-memory (PIM)
is a promising technique to alleviate the memory wall bottleneck. In this work, we
propose a novel PIM architecture by employing ferroelectric field-effect transistors
(FeFETs). The proposed design, named FePIM, is able to perform in-memory bitwise logic
and add operations between two selected rows or between one selected row and an immediate
operand. By utilizing unique features of FeFET devices, we further propose novel solutions
to eliminate simultaneous-read-and-write (SRAW) contentions such that stalls are eliminated.
Experimental results show that FePIM reduces 15% of the memory access latency and
44% of the memory access energy, compared with an enhanced version of a state-of-the-art
FeFET-based PIM design which cannot handle SRAW contentions.

RIME: A Scalable and Energy-Efficient Processing-In-Memory Architecture for Floating-Point
Operations

  • Zhaojun Lu
  • Md Tanvir Arafin
  • Gang Qu

Processing in-memory (PIM) is an emerging technology poised to break the memory-wall
in the conventional von Neumann architecture. PIM reduces data movement from the memory
systems to the CPU by utilizing memory cells for logic computation. However, existing
PIM designs do not support high precision computation (e.g., floating-point operations)
essential for critical data-intensive applications. Furthermore, PIM architectures
require complex control modules and costly peripheral circuits to harness the full
potential of in-memory computation. These peripherals and control modules usually
suffer from scalability and efficiency issues.

Hence, in this paper, we explore the analog properties of the resistive random access
memory (RRAM) crossbar and propose a scalable RRAM-based in-memory floating-point
computation architecture (RIME). RIME uses single-cycle NOR, NAND, and Minority logic
to achieve floating-point operations. RIME features a centralized control module and
a simplified peripheral circuit to eliminate data movement during parallel computation.
An experimental 32-bit RIME multiplier demonstrates a 4.8X speedup, 1.9X area improvement,
and 5.4X better energy efficiency compared with state-of-the-art RRAM-based PIM multipliers.

A Non-Volatile Computing-In-Memory Framework With Margin Enhancement Based CSA and
Offset Reduction Based ADC

  • Yuxuan Huang
  • Yifan He
  • Jinshan Yue
  • Huazhong Yang
  • Yongpan Liu

Nowadays, deep neural networks (DNNs) play an important role in machine learning.
Non-volatile computing-in-memory (nvCIM) for DNNs has become a new architecture for optimizing
hardware performance and energy efficiency. However, existing nvCIM accelerators
focus on system-level performance but ignore analog factors. In this paper, the sense
margin and offset are considered in the proposed nvCIM framework. The margin enhancement
based current-mode sense amplifier (MECSA) and the offset reduction based analog-to-digital
converter (ORADC) are proposed to improve the accuracy of the ADC. Based on the above
methods, the nvCIM framework is presented, and the experimental results show that the
proposed framework improves area, power, and latency while maintaining high network-model
accuracy, and achieves 2.3 – 20.4x higher energy efficiency than existing
RRAM-based nvCIM accelerators.

SESSION: 2C: Emerging Trends for Cross-Layer Co-Design: From Device, Circuit, to Architecture,
Application

Cross-layer Design for Computing-in-Memory: From Devices, Circuits, to Architectures and Applications

  • Hussam Amrouch
  • Xiaobo Sharon Hu
  • Mohsen Imani
  • Ann Franchesca Laguna
  • Michael Niemier
  • Simon Thomann
  • Xunzhao Yin
  • Cheng Zhuo

The era of Big Data, Artificial Intelligence (AI) and Internet of Things (IoT) is
approaching, but our underlying computing infrastructures are not sufficiently ready.
The end of Moore’s law and process scaling as well as the memory wall associated with
von Neumann architectures have throttled the rapid development of conventional architectures
based on CMOS technology, and cross-layer efforts that involve the interactions from
low-end devices to high-end applications have been prominently studied to overcome
the aforementioned challenges. On one hand, various emerging devices, e.g., Ferroelectric
FET, have been proposed to either sustain the scaling trends or enable novel circuit
and architecture innovations. On the other hand, novel computing architectures/algorithms,
e.g., computing-in-memory (CiM), have been proposed to address the challenges faced
by conventional von Neumann architectures. Naturally, integrated approaches across
the emerging devices and computing architectures/algorithms for data-intensive applications
are of great interest. This paper uses the FeFET as a representative device and
discusses the challenges, opportunities, and contributions of the emerging trend
of cross-layer co-design for CiM.

SESSION: 2D: Machine Learning Techniques for EDA in Analog/Mixed-Signal ICs

Automatic Surrogate Model Generation and Debugging of Analog/Mixed-Signal Designs
Via Collaborative Stimulus Generation and Machine Learning

  • Jun Yang Lei
  • Abhijit Chatterjee

In top-down analog and mixed-signal design, a key problem is to ensure that the netlist
or physical design does not contain unanticipated behaviors. Mismatches between netlist
level circuit descriptions and high level behavioral models need to be captured at
all stages of the design process for accuracy of system level simulation as well as
fast convergence of the design. To support the above, we present a guided test generation
algorithm that explores the input stimulus space and generates new stimuli which are
likely to excite differences between the model and its netlist description. Subsequently,
a recurrent neural network (RNN) based learning model is used to learn divergent model
and netlist behaviors and absorb them into the model to minimize these differences.
The process is repeated iteratively and in each iteration, a Bayesian optimization
algorithm is used to find optimal RNN hyperparameters to maximize behavior learning.
The result is a circuit-accurate behavioral model that is also much faster to simulate
than a circuit simulator. In addition, another sub-goal is to perform design bug diagnosis
to track the source of observed behavioral anomalies down to individual modules or
small levels of circuit detail. An optimization-based diagnosis approach using Volterra
learning kernels that is easily integrated into circuit simulators is proposed. Results
on representative circuits are presented.

A Robust Batch Bayesian Optimization for Analog Circuit Synthesis via Local Penalization

  • Jiangli Huang
  • Fan Yang
  • Changhao Yan
  • Dian Zhou
  • Xuan Zeng

Bayesian optimization has been successfully introduced to analog circuit synthesis
recently. Since the performance evaluations are computationally expensive, batch
Bayesian optimization has been proposed to run simulations in parallel. However, circuit
simulations may fail during the optimization due to improper design variables.
In such cases, Bayesian optimization methods may have poor performance. In this paper,
we propose a Robust Batch Bayesian Optimization approach (RBBO) for analog circuit
synthesis. Local penalization (LP) is used to capture the local repulsion between
query points in one batch. The diversity of the query points can thus be guaranteed.
The failed points and their neighborhoods can also be excluded by LP. Moreover, we
propose an Adaptive Local Penalization (ALP) strategy to adaptively scale the penalized
areas to improve the convergence of our proposed RBBO method. The proposed approach
is compared with the state-of-the-art algorithms with several practical analog circuits.
The experimental results have demonstrated the efficiency and robustness of the proposed
method.
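Local penalization in batch Bayesian optimization works by multiplying the acquisition function by factors that suppress it near points already selected for the batch (including failed points), so the next query lands elsewhere. The minimal 1-D sketch below uses an invented acquisition function and a simple distance-based penalty, not the paper's exact penalizer.

```python
# Minimal 1-D sketch of batch selection with local penalization: each already
# chosen (or failed) point multiplies the acquisition by a factor that drops to
# zero nearby, pushing subsequent batch points apart.  The acquisition function
# and penalty radius are invented for illustration.
import numpy as np

def acquisition(x):
    """Invented stand-in for an acquisition function with several peaks."""
    return (np.exp(-(x - 2.0) ** 2)
            + 0.8 * np.exp(-(x - 6.0) ** 2)
            + 0.6 * np.exp(-(x - 9.0) ** 2))

def penalty(x, center, radius=1.0):
    """Factor in [0, 1] that vanishes at `center` and recovers with distance."""
    return np.clip(np.abs(x - center) / radius, 0.0, 1.0)

grid = np.linspace(0.0, 10.0, 2001)
batch, penalized = [], acquisition(grid)
for _ in range(3):                         # pick a batch of 3 diverse query points
    x_next = grid[np.argmax(penalized)]
    batch.append(float(x_next))
    penalized = penalized * penalty(grid, x_next)

print([round(x, 2) for x in batch])        # e.g. [2.0, 6.0, 9.0]: well-separated points
```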

Layout Symmetry Annotation for Analog Circuits with Graph Neural Networks

  • Xiaohan Gao
  • Chenhui Deng
  • Mingjie Liu
  • Zhiru Zhang
  • David Z. Pan
  • Yibo Lin

The performance of analog circuits is susceptible to various layout constraints, such
as symmetry, matching, etc. Modern analog placement and routing algorithms usually
need to take these constraints as input for high quality solutions, while manually
annotating such constraints is tedious and requires design expertise. Thus, automatic
constraint annotation from circuit netlists is a critical step to analog layout automation.
In this work, we propose a graph learning based framework to learn the general rules
for annotation of the symmetry constraints with path-based feature extraction and
label filtering techniques. Experimental results on open-source analog circuit designs
demonstrate that our framework achieves significantly higher accuracy
compared with the most recent works on symmetry constraint detection leveraging graph
similarity and signal flow analysis techniques. The framework is general and can be
extended to other pairwise constraints as well.

Fast and Efficient Constraint Evaluation of Analog Layout Using Machine Learning Models

  • Tonmoy Dhar
  • Jitesh Poojary
  • Yaguang Li
  • Kishor Kunal
  • Meghna Madhusudan
  • Arvind K. Sharma
  • Susmita Dey Manasi
  • Jiang Hu
  • Ramesh Harjani
  • Sachin S. Sapatnekar

Placement algorithms for analog circuits explore numerous layout configurations in
their iterative search. To steer these engines towards layouts that meet the electrical
constraints on the design, this work develops a fast feasibility predictor to guide
the layout engine. The flow first discerns rough bounds on layout parasitics and prunes
the feature space. Next, a Latin hypercube sampling technique is used to sample the
reduced search space, and the labeled samples are classified by a linear support vector
machine (SVM). If necessary, a denser sample set is used for the SVM, or if the constraints
are found to be nonlinear, a multilayer perceptron (MLP) is employed. The resulting
machine learning model is demonstrated to rapidly evaluate candidate placements in a
placer and is used to build layouts for several analog blocks.
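
For illustration, the sample-and-classify step can be sketched as Latin hypercube
sampling followed by a linear SVM (the feasibility check and the unit-scaled parasitic
space below are assumptions made for the sketch, not the paper's models):

    import numpy as np
    from scipy.stats import qmc
    from sklearn.svm import LinearSVC

    def feasible(x):
        # Hypothetical electrical constraint on (scaled) layout parasitics;
        # in the real flow, labels would come from circuit simulation.
        return (0.8 * x[:, 0] + 1.5 * x[:, 1] + 0.4 * x[:, 2]) < 1.2

    sampler = qmc.LatinHypercube(d=3, seed=0)   # pruned 3-D parasitic space
    X = sampler.random(n=500)
    y = feasible(X)

    clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)   # fast feasibility predictor
    print("training accuracy:", clf.score(X, y))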

SESSION: 2E: Innovating Ideas in VLSI Routing Optimization

TreeNet: Deep Point Cloud Embedding for Routing Tree Construction

  • Wei Li
  • Yuxiao Qu
  • Gengjie Chen
  • Yuzhe Ma
  • Bei Yu

In the routing tree construction, both wirelength (WL) and path-length (PL) are of
importance. Among all methods, PD-II and SALT are the two most prominent ones. However,
neither PD-II nor SALT always dominates the other one in terms of both WL and PL for
all nets. In addition, estimating the best parameters for both algorithms is still
an open problem. In this paper, we model the pins of a net as a point cloud and formalize
a set of special properties of such point clouds. Considering these properties, we
propose a novel deep neural net architecture, TreeNet, to obtain the embedding of
the point cloud. Based on the obtained cloud embedding, an adaptive workflow is designed
for the routing tree construction. Experimental results show that the proposed TreeNet
is superior to other mainstream point-cloud models on classification tasks.
Moreover, the proposed adaptive workflow for the routing tree construction outperforms
SALT and PD-II in terms of both efficiency and effectiveness.

A Unified Printed Circuit Board Routing Algorithm With Complicated Constraints and
Differential Pairs

  • Ting-Chou Lin
  • Devon Merrill
  • Yen-Yi Wu
  • Chester Holtz
  • Chung-Kuan Cheng

The printed circuit board (PCB) routing problem has been studied extensively in recent
years. Due to continually growing net/pin counts, extremely high pin density, and
unique physical constraints, the manual routing of PCBs has become a time-consuming
task to reach design closure. Previous works break down the problem into escape routing
and area routing and focus on these problems separately. However, there is always
a gap between these two problems, requiring a massive amount of human effort to fine-tune
the algorithms back and forth. Moreover, previous works on area routing mainly focus
on routing between escape-routed ball-grid-array (BGA) packages. Nevertheless, in
practice, many components are not in the form of BGA packages, such as passive devices,
decoupling capacitors, and through-hole pin arrays. To mitigate the deficiencies of
previous works, we propose a full-board routing algorithm that can handle multiple
real-world complicated constraints to facilitate the printed circuit board routing
and produce high-quality manufacturable layouts. Experimental results show that our
algorithm is effective and efficient. Specifically, for all given test cases, our
router can achieve 100% routability without any design rule violation while the other
two state-of-the-art routers fail to complete the routing for some test cases and
incur design rule violations.

Multi-FPGA Co-optimization: Hybrid Routing and Competitive-based Time Division Multiplexing Assignment

  • Dan Zheng
  • Xiaopeng Zhang
  • Chak-Wa Pui
  • Evangeline F.Y. Young

In multi-FPGA systems, time-division multiplexing (TDM) is a widely used technique
to transfer signals between FPGAs. While TDM can greatly increase logic utilization,
the inter-FPGA delay will also become longer. A good time-multiplexing scheme for
inter-FPGA signals is very important for optimizing the system performance. In this
work, we propose a fast algorithm to generate high-quality time-multiplexed routing
results for multi-FPGA systems. A hybrid routing algorithm is proposed to route
the nets between FPGAs, by maze routing and by a fast minimum terminal spanning tree
method. After obtaining a routing topology, a two-step method is applied to perform
TDM assignment to optimize timing, which includes an initial assignment and a competitive-based
refinement. Experiments show that our system-level routing and TDM assignment algorithm
can outperform both the top winner of the ICCAD 2019 Contest and the state-of-the-art
methods. Moreover, compared to the state-of-the-art works [17, 22], our approach improves
run time by more than 2x with better or comparable TDM performance.

Boosting Pin Accessibility Through Cell Layout Topology Diversification

  • Suwan Kim
  • Kyeongrok Jo
  • Taewhan Kim

As the layout of standard cells becomes denser, accessing pins is much harder in
detailed routing. The conventional solutions to resolving the pin access issue are
to attempt cell flipping, cell shifting, cell swapping, and/or cell dilating in the
placement optimization stage, expecting to acquire high pin accessibility. However,
those solutions do not guarantee close-to-100% pin accessibility to ensure safe manual
fixing afterward in the routing stage. Furthermore, there is no easy and effective
methodology to fix the inaccessibility in the detailed routing stage as yet. This
work addresses the problem of fixing the inaccessibility in the detailed routing stage.
Precisely, (1) we produce, for each type of cell, multiple layouts with diverse pin
locations and access points by modifying the core engines, i.e., gate poly ordering
and middle-of-line dummy insertion, in the flow of design-technology co-optimization
based automatic cell layout generation. Then, (2) we propose a systematic method to
make use of those layouts to fix the routing failures caused by pin inaccessibility
in the ECO (Engineering Change Order) routing stage. Experimental results demonstrate
that our proposed cell layout diversification and replacement approach can fix 93.22%
of metal-2 shorts in the ECO routing stage.

SESSION: 3A: ML-Driven Approximate Computing

Approximate Computing for ML: State-of-the-art, Challenges and Visions

  • Georgios Zervakis
  • Hassaan Saadat
  • Hussam Amrouch
  • Andreas Gerstlauer
  • Sri Parameswaran
  • Jörg Henkel

In this paper, we present our state-of-the-art approximate techniques that cover the
main pillars of approximate computing research. Our analysis considers both static
and reconfigurable approximation techniques as well as operation-specific approximate
components (e.g., multipliers) and generalized approximate high-level synthesis approaches.
As our application target, we discuss the improvements that such techniques bring
to machine learning and neural networks. In addition to the conventionally analyzed
performance and energy gains, we also evaluate the improvements that approximate computing
brings in operating temperature.

SESSION: 3B: Architecture-Level Exploration

Bridging the Frequency Gap in Heterogeneous 3D SoCs through Technology-Specific NoC
Router Architectures

  • Jan Moritz Joseph
  • Lennart Bamberg
  • Geonhwa Jeong
  • Ruei-Ting Chien
  • Rainer Leupers
  • Alberto García-Ortiz
  • Tushar Krishna
  • Thilo Pionteck

In heterogeneous 3D System-on-Chips (SoCs), NoCs with uniform properties suffer one
major limitation: the clock frequency of routers varies due to different manufacturing
technologies. For example, digital nodes allow for a higher clock frequency of routers
than mixed-signal nodes. This large frequency gap is commonly tackled by complex and
expensive pseudo-mesochronous or asynchronous router architectures. Here, a more efficient
approach is chosen to bridge the frequency gap. We propose to use a heterogeneous
network architecture. We show that reducing the number of VCs allows bridging a frequency
gap of up to 2x. We achieve a system-level latency improvement of up to 47% for uniform
random traffic and up to 59% for PARSEC benchmarks, a maximum throughput increase
of 50%, up to 68% reduced area and 38% reduced power in an exemplary setting combining
15-nm digital and 30-nm mixed-signal nodes and comparing against a homogeneous synchronous
network architecture. Compared with asynchronous and pseudo-mesochronous router architectures,
the proposed optimization consistently performs better in area and power, and the average
flit latency improvement can be larger than 51%.

Combining Memory Partitioning and Subtask Generation for Parallel Data Access on CGRAs

  • Cheng Li
  • Jiangyuan Gu
  • Shouyi Yin
  • Leibo Liu
  • Shaojun Wei

Coarse-Grained Reconfigurable Architectures (CGRAs) are attractive reconfigurable
platforms with the advantages of high performance and power efficiency. In a CGRA
based computing system, the computations are often mapped onto the CGRA with parallel
memory accesses. To fully exploit the on-chip memory bandwidth, memory partitioning
algorithms are widely used to reduce access conflicts. CGRAs have a fixed storage
fabric and limited-size memory due to severe area constraints. Previous memory
partitioning algorithms assumed that data could be completely transferred into the
target memory. However, in practice, we often encounter situations where on-chip storage
is insufficient to store the complete data. In order to perform the computation of
these applications in the memory-limited CGRA, we first develop a memory partitioning
strategy with continual placement, which can also avoid data preprocessing, and then
divide the kernel into multiple subtasks that suit the size of the target memory.
Experimental results show that, compared to the state-of-the-art method, our approach
achieves a 43.2% reduction in data preparation time and an 18.5% improvement in overall
performance. If the subtask generation scheme is adopted, our approach can achieve
a 14.4% overall performance improvement while reducing memory requirements by 99.7%.

A Dynamic Link-latency Aware Cache Replacement Policy (DLRP)

  • Yen-Hao Chen
  • Allen C.-H. Wu
  • TingTing Hwang

Multiprocessor system-on-chips (MPSoCs) in modern devices have mostly adopted the
non-uniform cache architecture (NUCA) [1], which features varied physical distance
from cores to data locations and, as a result, varied access latency. In the past,
researchers focused on minimizing the average access latency of the NUCA. We found
that dynamic latency is also a critical index of the performance. A cache access pattern
with long dynamic latency will result in a significant cache performance degradation
without considering dynamic latency. We have also observed that a set of commonly
used neural network application kernels, including the neural network fully-connected
and convolutional layers, contains substantial accessing patterns with long dynamic
latency. This paper proposes a hardware-friendly dynamic latency identification mechanism
to detect such patterns and a dynamic link-latency aware replacement policy (DLRP)
to improve cache performance based on the NUCA.

The proposed DLRP, on average, outperforms the least recently used (LRU) policy by
53% with little hardware overhead. Moreover, on average, our method achieves 45% and
24% more performance improvement than the not recently used (NRU) policy and the static
re-reference interval prediction (SRRIP) policy, respectively, normalized to LRU.

Prediction of Register Instance Usage and Time-sharing Register for Extended Register
Reuse Scheme

  • Shuxin Zhou
  • Huandong Wang
  • Dong Tong

Register renaming is key to the performance of out-of-order processors. However,
the release mechanism of physical registers can waste registers along the time dimension.
The register reuse technique is the earliest solution for releasing a physical register
at the renaming stage; it takes advantage of register instances that are used only
once. However, the range of possible reuse mined by this scheme is limited, and the
physical register structure has to be modified. To address these two problems, we
propose an extended register reuse scheme. Our work presents: 1) prediction of the
number of uses of a register instance, so that physical registers can be reused at
the end of their last use, expanding the range of possible reuse; and 2) a low-overhead
time-sharing register file design implemented with backup registers, which avoids
modifying the physical register structure. Compared with the original register reuse
technique, this work achieves an 8.5% performance improvement or, alternatively, a
9.6% reduction in the number of physical registers, with minor hardware overhead.

SESSION: 3C: Core Circuits for AI Accelerators

Residue-Net: Multiplication-free Neural Network by In-situ No-loss Migration to Residue Number
Systems

  • Sahand Salamat
  • Sumiran Shubhi
  • Behnam Khaleghi
  • Tajana Rosing

Deep neural networks are widely deployed on embedded devices to solve a wide range
of problems from edge-sensing to autonomous driving. The accuracy of these networks
is usually proportional to their complexity. Quantization of model parameters (i.e.,
weights) and/or activations to alleviate the complexity of these networks while preserving
accuracy is a popular and powerful technique. Nonetheless, previous studies have shown
that the achievable quantization level is limited, as network accuracy eventually degrades.
We propose Residue-Net, a multiplication-free accelerator for neural networks that
uses the Residue Number System (RNS) to achieve substantial energy reduction. RNS breaks
the operations down into several smaller operations that are simpler to implement.
Moreover, Residue-Net replaces the copious costly multiplications with simple, energy-efficient
shift and add operations to further reduce the computational complexity of neural
networks. To evaluate the efficiency of our proposed accelerator, we compared the
performance of Residue-Net with a baseline FPGA implementation of four widely-used
networks, viz., LeNet, AlexNet, VGG16, and ResNet-50. When delivering the same performance
as the baseline, Residue-Net reduces area and power (hence energy) by 36% and 23%,
respectively, on average, with no accuracy loss. Leveraging the saved area to accelerate
the quantized RNS network through parallelism, Residue-Net improves its throughput
by 2.8x and energy by 2.7x.
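
For illustration, the arithmetic behind RNS can be sketched generically in Python
(this is not the Residue-Net datapath; the moduli and operands are arbitrary):

    # Residue number system (RNS) sketch: an integer is represented by its
    # residues modulo pairwise-coprime moduli; addition and multiplication
    # become independent, narrower element-wise operations, and the result
    # is recovered with the Chinese remainder theorem.
    from math import prod

    MODULI = (7, 11, 13)                      # pairwise coprime; range = 1001

    def to_rns(x):
        return tuple(x % m for m in MODULI)

    def rns_mul(a, b):
        return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

    def from_rns(r):
        M = prod(MODULI)
        x = 0
        for ri, mi in zip(r, MODULI):
            Mi = M // mi
            x += ri * Mi * pow(Mi, -1, mi)    # modular inverse (Python 3.8+)
        return x % M

    a, b = 23, 37
    assert from_rns(rns_mul(to_rns(a), to_rns(b))) == a * b   # 851 < 1001
    print(to_rns(a), to_rns(b), from_rns(rns_mul(to_rns(a), to_rns(b))))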

A Multiple-Precision Multiply and Accumulation Design with Multiply-Add Merged Strategy
for AI Accelerating

  • Song Zhang
  • Jiangyuan Gu
  • Shouyi Yin
  • Leibo Liu
  • Shaojun Wei

Multiply-and-accumulate (MAC) operations are fundamental for domain-specific accelerators
in AI applications ranging from filtering to convolutional neural networks (CNNs).
This paper proposes an energy-efficient MAC design supporting a wide range of bit-widths
for both signed and unsigned operands. First, based on the classic Booth algorithm,
we propose a multiply-add merged strategy. The design
can not only support both signed and unsigned operations but also eliminate the delay,
area and power overheads from the adder of traditional MAC units. Then a multiply-add
merged design method for flexible bit-width adjustment is proposed using the fusion
strategy. In addition, treating the addend as a partial product makes the operation
easy to pipeline and balanced. The comprehensive improvement in delay, area and power
can meet various requirements from different applications and hardware design. By
using the proposed method, we have synthesized MAC units for several operation modes
using a SMIC 40-nm library. Comparison with other MAC designs shows that the proposed
design method can achieve up to 24.1% and 28.2% PDP and ADP improvement for bit-width
fixed MAC designs, and 28.43%~38.16% for bit-width adjustable ones. When pipelined,
the design decreases latency by more than 13%. The improvements in power and area
are up to 8.0% and 8.1%, respectively.
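
For reference, the classic radix-4 Booth recoding that such merged designs build on
can be modeled in software as follows (a textbook sketch, not the authors' hardware;
the bit-width is assumed even):

    # Radix-4 Booth recoding: the multiplier is recoded into digits in
    # {-2, -1, 0, 1, 2}, each yielding one partial product, so an n-bit
    # multiply needs about n/2 partial products.
    def booth_radix4_digits(y, bits):
        y &= (1 << bits) - 1                  # two's-complement view of y
        prev = 0
        digits = []
        for i in range(0, bits, 2):
            trip = ((y >> i) & 0b11) << 1 | prev
            prev = (y >> (i + 1)) & 1
            digits.append({0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                           0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}[trip])
        return digits                         # digit i carries weight 4**i

    def booth_multiply(x, y, bits=8):
        digits = booth_radix4_digits(y, bits)
        return sum(d * x * (4 ** i) for i, d in enumerate(digits))

    assert booth_multiply(13, -7) == 13 * -7
    assert booth_multiply(-100, 55, bits=16) == -100 * 55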

DeepOpt: Optimized Scheduling of CNN Workloads for ASIC-based Systolic Deep Learning Accelerators

  • Susmita Dey Manasi
  • Sachin S. Sapatnekar

Scheduling computations in each layer of a convolutional neural network on a deep
learning (DL) accelerator involves a large number of choices, each of which involves
a different set of memory reuse and memory access patterns. Since memory transactions
are the primary bottleneck in DL acceleration, these choices can strongly impact the
energy and throughput of the accelerator. This work proposes an optimization framework,
DeepOpt, for general ASIC-based systolic hardware accelerators, which determines a
layer-specific and hardware-specific scheduling strategy for each layer of a CNN to
optimize energy and latency. Optimal hardware allocation significantly reduces execution
cost as compared
to generic static hardware resource allocation, e.g., improvements of up to 50x in
the energy-delay product for VGG-16 and 41x for GoogleNet-v1.

Value-Aware Error Detection and Correction for SRAM Buffers in Low-Bitwidth, Floating-Point
CNN Accelerators

  • Jun-Shen Wu
  • Chi-En Wang
  • Ren-Shuo Liu

Low-power CNN accelerators are a key technique to enable the future artificial intelligence
world. Dynamic voltage scaling is an essential low-power strategy, but it is bottlenecked
by on-chip SRAM. More specifically, SRAM can exhibit stuck-at (SA) faults at a rate
as high as 0.1% when the supply voltage is lowered to, e.g., 0.5 V. Although this
issue has been studied in CPU cache design, since their solutions are tailored for
CPUs instead of CNN accelerators, they inevitably incur unnecessary design complexity
and SRAM capacity overhead.

To address the above issue, we conduct simulations and analyses that enable us to propose
error detection and correction mechanisms tailored to our target low-bitwidth,
floating-point (LBFP) CNN accelerators. We analyze the impacts of SA faults in different
SRAM positions, and we also analyze the impacts of different SA types, i.e., stuck-at-one
(SA1) and stuck-at-zero (SA0). The analysis results lead us to the error detecting
and correcting mechanisms that prioritize fixing SA1 appearing at SRAM positions where
the exponent bits of LBFP are stored. The evaluation results show that our proposed
mechanisms can help to push the voltage scaling limit down to a voltage level with
0.1% SA faults (e.g., 0.5 V).
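
A toy experiment illustrates why SA1 faults on exponent bits dominate the error (the
8-bit 1-4-3 format and bias used here are assumptions made for the sketch, not the
accelerator's actual LBFP encoding):

    # Force each bit of a small floating-point word to 1 (SA1) in turn and
    # observe the decoded value: exponent-bit faults change the magnitude by
    # orders of magnitude, mantissa-bit faults only slightly.
    def decode_lbfp(bits):                    # bits: 8-char string, MSB first
        s, e, m = int(bits[0]), int(bits[1:5], 2), int(bits[5:], 2)
        return (-1) ** s * (1 + m / 8) * 2 ** (e - 7)

    word = "00111010"                         # decodes to 1.25
    for pos in range(8):
        faulty = word[:pos] + "1" + word[pos + 1:]
        print(f"SA1 at bit {pos}: {decode_lbfp(word):.3f} -> {decode_lbfp(faulty):.3f}")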

SESSION: 3D: Stochastic and Approximate Computing

MIPAC: Dynamic Input-Aware Accuracy Control for Dynamic Auto-Tuning of Iterative Approximate
Computing

  • Taylor Kemp
  • Yao Yao
  • Younghyun Kim

For many applications that exhibit strong error resilience, such as machine learning
and signal processing, energy efficiency and performance can be dramatically improved
by allowing for slight errors in intermediate computations. Iterative methods (IMs),
wherein the solution is improved over multiple executions of an approximation algorithm,
allow for energy-quality trade-off at run-time by adjusting the number of iterations
(NOI). However, in prior IM circuits, NOI adjustment has been made based on a pre-characterized
NOI-quality mapping, which is input-agnostic and thus results in an undesirably large
variation in output quality. In this paper, we propose a novel design framework that
incorporates a lightweight quality controller that makes input-dependent predictions
on the output quality and determines the optimal NOI at run-time. The proposed quality
controller is composed of accurate yet low-overhead NOI predictors, generated by a
novel logic reduction technique. We evaluate the proposed design framework on several
IM circuits and demonstrate significant improvements in energy-quality performance.

Normalized Stability: A Cross-Level Design Metric for Early Termination in Stochastic Computing

  • Di Wu
  • Ruokai Yin
  • Joshua San Miguel

Stochastic computing is a statistical computing scheme that represents data as serial
bit streams to greatly reduce hardware complexity. The key trade-off is that processing
more bits in the streams yields higher computation accuracy at the cost of more latency
and energy consumption. To maximize efficiency, it is desirable to account for the
error tolerance of applications and terminate stochastic computations early when the
result is acceptably accurate. Currently, the stochastic computing community lacks
a standard means of measuring a circuit’s potential for early termination and predicting
at what cycle it would be safe to terminate. To fill this gap, we propose normalized
stability, a metric that measures how fast a bit stream converges under a given accuracy
budget. Our unit-level experiments show that normalized stability accurately reflects
and contrasts the early-termination capabilities of varying stochastic computing units.
Furthermore, our application-level experiments on low-density parity-check decoding,
machine learning and image processing show that normalized stability can reduce the
design space and predict the timing to terminate early.
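
The notion of "converging and staying converged" can be pictured with a small
experiment (an illustrative convergence check only, not the paper's normalized-stability
definition; the stream length, budget, and value are arbitrary):

    # Generate a unipolar stochastic bitstream for value p, track the running
    # estimate, and report the first cycle after which the estimate stays
    # within the accuracy budget until the end of the stream.
    import numpy as np

    def first_stable_cycle(p, length=1024, budget=0.05, seed=0):
        rng = np.random.default_rng(seed)
        bits = (rng.random(length) < p).astype(float)
        estimate = np.cumsum(bits) / np.arange(1, length + 1)
        ok = np.abs(estimate - p) <= budget
        stable_from = length                  # length + 1 means "never stable"
        for i in range(length - 1, -1, -1):
            if ok[i]:
                stable_from = i
            else:
                break
        return stable_from + 1

    print(first_stable_cycle(0.7))            # cycles needed before stopping is safe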

Zero Correlation Error: A Metric for Finite-Length Bitstream Independence in Stochastic Computing

  • Hsuan Hsiao
  • Joshua San Miguel
  • Yuko Hara-Azumi
  • Jason Anderson

Stochastic computing (SC), with its probabilistic data representation format, has
sparked renewed interest due to its ability to use very simple circuits to implement
complex operations. Unlike traditional binary computing, however, SC needs to carefully
handle correlations that exist across data values to avoid the risk of unacceptably
inaccurate results. With many SC circuits designed to operate under the assumption
that input values are independent, it is important to provide the ability to accurately
measure and characterize independence of SC bitstreams. We propose zero correlation
error (ZCE), a metric that quantifies how independent two finite-length bitstreams
are, and show that it addresses fundamental limitations in metrics currently used
by the SC community. Through evaluation at both the functional unit level and application
level, we demonstrate how ZCE can be an effective tool for analyzing SC bitstreams,
simulating circuits and design space exploration.

An Efficient Approximate Node Merging with an Error Rate Guarantee

  • Kit Seng Tam
  • Chia-Chun Lin
  • Yung-Chih Chen
  • Chun-Yao Wang

Approximate computing is an emerging design paradigm for error-tolerant applications,
e.g., signal processing and machine learning. In approximate computing, the area,
delay, or power consumption of an approximate circuit can be improved by trading off
its accuracy. In this paper, we propose an approximate logic synthesis approach based
on a node-merging technique with an error rate guarantee. The ideas of our approach
are to replace internal nodes by constant values and to merge two similar nodes in
the circuit in terms of functionality. We conduct experiments on a set of IWLS 2005
and MCNC benchmarks. The experimental results show that our approach can reduce area
by up to 80%, and 31% on average. Compared with the state-of-the-art method, our
approach achieves a 51x speedup under the same 5% error rate constraint.

SESSION: 3E: Timing Analysis and Timing-Aware Design

An Adaptive Delay Model for Timing Yield Estimation under Wide-Voltage Range

  • Hao Yan
  • Xiao Shi
  • Chengzhen Xuan
  • Peng Cao
  • Longxing Shi

Yield analysis for wide-voltage circuit design is a strongly nonlinear integration problem.
The most challenging task is how to accurately estimate the yield of long-tail distribution.
This paper proposes an adaptive delay model to substitute expensive transistor-level
simulation for timing yield estimation. We use the Low-Rank Tensor Approximation (LRTA)
to model the delay variation from a large number of process parameters. Moreover,
an adaptive nonlinear sampling algorithm is adopted to calibrate the model iteratively,
which can capture the larger variability of delay distribution for different voltage
regions. The proposed method is validated on TAU15 benchmark circuits in the 45nm
FreePDK. The experimental results show that our method achieves a 20-100X speedup
compared to Monte Carlo simulation at the same accuracy level.

ATM: A High Accuracy Extracted Timing Model for Hierarchical Timing Analysis

  • Kuan-Ming Lai
  • Tsung-Wei Huang
  • Pei-Yu Lee
  • Tsung-Yi Ho

As technology advances, the complexity and size of integrated circuits continue to
grow. Hierarchical design flow is a mainstream solution to speed up timing closure.
Static timing analysis is a pivotal step in the flow but it can be timing-consuming
on large flat designs. To reduce the long runtime, we introduce ATM, a high-accuracy
extracted timing model for hierarchical timing analysis. Interface logic model (ILM)
and extracted timing model (ETM) are the two popular paradigms for generating timing
macros. ILM is accurate but large in model size, and ETM is compact but less accurate.
Recent research has applied graph compression techniques to ILM to reduce model size
while maintaining high accuracy. However, the generated models are still very large
compared to ETM, and their efficiency in in-context usage may be limited. We base ATM
on the ETM paradigm and address its accuracy limitation. Experimental results on TAU
2017 benchmarks show that ATM reduces the maximum absolute error of ETM from 131 ps
to less than 1 ps. Compared to the ILM-based approach, our accuracy differs within
1 ps and the generated model can be up to 270x smaller.

Mode-wise Voltage-scalable Design with Activation-aware Slack Assignment for Energy
Minimization

  • TaiYu Cheng
  • Yutaka Masuda
  • Jun Nagayama
  • Yoichi Momiyama
  • Jun Chen
  • Masanori Hashimoto

This paper proposes a design optimization methodology that achieves a mode-wise
voltage scalable (MWVS) design by applying activation-aware slack assignment (ASA).
Originally, ASA allocates the timing margin of critical paths with a stochastic
treatment of timing errors, which limits its application. Instead, this work employs
ASA while guaranteeing no timing errors. The MWVS design is formulated as an optimization
problem that minimizes the overall power consumption considering each mode duration,
achievable voltage reduction, and accompanied circuit overhead explicitly, and explores
the solution space with the downhill simplex algorithm, which does not require numerical
derivatives. To obtain a solution, i.e., a design, in the optimization process,
we exploit the multi-corner multi-mode design flow in a commercial tool for performing
mode-wise ASA with sets of false paths dedicated to individual modes. Experimental
results based on a RISC-V design show that the proposed methodology saves 20% more power
compared to the conventional voltage scaling approach and attains a 15% gain over
single-mode ASA. Also, cycle-by-cycle fine-grained false path identification reduces
leakage power by 42%.

A Timing Prediction Framework for Wide Voltage Design with Data Augmentation Strategy

  • Peng Cao
  • Wei Bao
  • Kai Wang
  • Tai Yang

Wide voltage design has been widely used to achieve power reduction and energy efficiency
improvement. The consequent increasing number of PVT corners poses severe challenges
to timing analysis in terms of accuracy and efficiency. The data insufficiency issue
during path delay acquisition raises the difficulty for the training of machine learning
models, especially at low voltage corners due to tremendous library characterization
effort and/or simulation cost. In this paper, a learning-based timing prediction framework
is proposed to predict path delays across a wide voltage region using LightGBM (Light Gradient
Boosting Machine) with data augmentation strategies including CTGAN (Conditional Generative
Adversarial Networks) and SMOTER (Synthetic Minority Oversampling Technique for Regression),
which generate realistic synthetic data of circuit delays to improve prediction precision
and reduce data sampling effort. Experimental results demonstrate that with the proposed
framework, path delays at low voltage can be predicted from their delays at high-voltage
corners with an rRMSE of less than 5%, owing to the data augmentation strategies,
which reduce prediction error by up to 12x.
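
A bare-bones version of the prediction setup, without the CTGAN/SMOTER augmentation,
might look as follows (the synthetic delay data and the linear high-to-low-voltage
relationship are placeholders for real library/SPICE characterization data):

    # Train a LightGBM regressor that maps a path's delays at high-voltage
    # corners to its delay at a low-voltage corner, then report rRMSE.
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    n_paths = 2000
    hi_v = rng.uniform(0.5, 3.0, size=(n_paths, 3))        # delays @ 3 high-V corners
    low_v = hi_v @ np.array([1.8, 2.5, 1.2]) \
            + 0.05 * rng.standard_normal(n_paths)          # toy low-V relationship

    model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
    model.fit(hi_v[:1500], low_v[:1500])
    pred = model.predict(hi_v[1500:])
    rrmse = np.sqrt(np.mean((pred - low_v[1500:]) ** 2)) / np.mean(low_v[1500:])
    print(f"rRMSE: {rrmse:.3f}")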

SESSION: 4A: Technological Advancements inside the AI chips, and using the AI Chips

Energy-Efficient Deep Neural Networks with Mixed-Signal Neurons and Dense-Local and
Sparse-Global Connectivity

  • Baibhab Chatterjee
  • Shreyas Sen

Neuromorphic Computing has become tremendously popular due to its ability to solve
certain classes of learning tasks better than traditional von-Neumann computers. Data-intensive
classification and pattern recognition problems have been of special interest to Neuromorphic
Engineers, as these problems present complex use-cases for Deep Neural Networks (DNNs)
which are motivated from the architecture of the human brain, and employ densely connected
neurons and synapses organized in a hierarchical manner. However, as these systems
become larger in order to handle an increasing amount of data and higher dimensionality
of features, the designs often become connectivity constrained. To solve this, the
computation is divided into multiple cores/islands, called processing engines (PEs).
Today, the communication among these PEs is carried out through a power-hungry network-on-chip
(NoC), and hence the optimal distribution of these islands along with energy-efficient
compute and communication strategies become extremely important in reducing the overall
energy of the neuromorphic computer, which is currently orders of magnitude higher
than the biological human brain. In this paper, we extensively analyze the choice
of the size of the islands based on mixed-signal neurons/synapses for 3-8 bit-resolution
within allowable ranges for system-level classification error, determined by the analog
non-idealities (noise and mismatch) in the neurons, and propose strategies involving
local and global communication for reduction of the system-level energy consumption.
AC-coupled mixed-signal neurons are shown to have 10X lower non-idealities than DC-coupled
ones, while the choice of the number of islands is shown to be a function of the network,
constrained by the analog-to-digital conversion (or vice versa) power at the interface
of the islands. The maximum number of layers in an island is analyzed and a global
bus-based sparse connectivity is proposed, which consumes orders of magnitude lower
power than the competing powerline communication techniques.

Merged Logic and Memory Fabrics for AI Workloads

  • Brian Crafton
  • Samuel Spetalnick
  • Arijit Raychowdhury

As we approach the end of the silicon roadmap, we observe a steady increase in both
the research effort toward and quality of embedded non-volatile memories (eNVM). Integrated
in a dense array, eNVM such as resistive random access memory (RRAM), spin transfer
torque based random access memory, or phase change random access memory (PCRAM) can
perform compute in-memory (CIM) using the physical properties of the device. The combination
of eNVM and CIM seeks to minimize both data transport and leakage power while offering
density up to 10x that of traditional 6T SRAM. Despite these exciting new properties,
these devices introduce problems that were not faced by traditional CMOS and SRAM
based designs. While some of these problems will be solved by further research and
development, properties such as significant cell-to-cell variance and high write power
will persist due to the physical limitations of the devices. As a result, circuit
and system level designs must account for and mitigate the problems that arise. In
this work we introduce these problems from the system level and propose solutions
that improve performance while mitigating the impact of the non-ideal properties of
eNVM. Using statistics from the application and known properties of the eNVM, we can
configure a CIM accelerator to minimize error from cell-to-cell variance and maximize
throughput while minimizing write energy.

Vision Control Unit in Fully Self Driving Vehicles using Xilinx MPSoC and Opensource
Stack

  • Ravikumar V. Chakaravarthy
  • Hyun Kwon
  • Hua Jiang

Fully self-driving (FSD) vehicles have become increasingly popular over the last few
years, and companies are investing significantly in their research and development.
In recent years, FSD technology innovators like Tesla, Google, etc. have been working
on proprietary autonomous driving stacks and have successfully brought vehicles to
the road. On the other hand, organizations like the Autoware Foundation
and Baidu are fueling the growth of self-driving mobility using open source stacks.
These organizations firmly believe in enabling autonomous driving technology for everyone
and support developing software stacks through the open source community that is SoC
vendor agnostic. In this proposed solution we describe a vision control unit for a
fully self-driving vehicle developed on Xilinx MPSoC platform using open source software
components.

The vision control unit of an FSD vehicle is responsible for camera video capture,
image processing and rendering, AI algorithm processing, and data and metadata transfer
to the next stage of the FSD pipeline. In this proposed solution we have used many open
source stacks and frameworks for video and AI processing. The processing of the video
pipeline and algorithms takes full advantage of pipelining and parallelism using
all the heterogeneous cores of the Xilinx MPSoC. In addition, we have developed an
extensible, scalable, adaptable and configurable AI backend framework, XTA, for acceleration
purposes that is derived from a popular, open source AI backend framework, TVM-VTA.
XTA uses all the MPSoC cores for its computation in a parallel and pipelined fashion.
XTA also adapts to the compute and memory parameters of the system and can scale to
achieve optimal performance for any given AI problem. The FSD system design is based
on a distributed system architecture and uses open source components like Autoware
for autonomous driving algorithms, ROS and Distributed Data Services as a messaging
middleware between the functional nodes and a real-time kernel to coordinate the actions.
The details of image capture, rendering and AI processing of the vision perception
pipeline will be presented along with the performance measurements of the vision pipeline.

In this proposed solution we demonstrate some of the key use cases of the vision
perception unit, such as surround vision and object detection. In addition, we also
show the capability of Xilinx MPSoC technology to handle multiple channels of real-time
camera input and the integration with Lidar/Radar point cloud data to feed into
the decision-making unit of the overall system. The system is also designed with the
capability to update the vision control unit through Over the Air Update (OTA). It
is also envisioned that the core AI engine will require regular updates with the latest
training values; hence a built-in platform level mechanism supporting such capability
is essential for real world deployment.

SESSION: 4B: System-Level Modeling, Simulation, and Exploration

Constrained Conservative State Symbolic Co-analysis for Ultra-low-power Embedded Systems

  • Shashank Hegde
  • Subhash Sethumurugan
  • Hari Cherupalli
  • Henry Duwe
  • John Sartori

Symbolic simulation and symbolic execution techniques have long been used for verifying
designs and testing software. Recently, using symbolic hardware-software co-analysis
to characterize unused hardware resources across all possible executions of an application
running on a processor has been leveraged to enable application-specific analysis
and optimization techniques. Like other symbolic simulation techniques, symbolic hardware-software
co-analysis does not scale well to complex applications, due to an explosion in the
number of execution paths that must be analyzed to characterize all possible executions
of an application. To overcome this issue, prior work proposed a scalable approach
by maintaining conservative states of the system at previously-visited locations in
the application. However, this approach can be too pessimistic in determining the
exercisable subset of resources of a hardware design. In this paper, we propose a
technique for performing symbolic co-analysis of an application on a processor’s netlist
by identifying, propagating, and imposing constraints from the software level onto
the gate-level simulation. This produces a more precise, less pessimistic estimate
of the gates that an application can exercise when executing on a processor, while
guaranteeing coverage of all possible gates that the application can exercise. This
also significantly reduces the simulation time of the analysis by eliminating the
need to explore many simulation paths in the application. Compared to the state-of-the-art
analysis based on conservative states, our constrained approach reduces the number
of gates identified as exercisable by up to 34.98%, 11.52% on average, and analysis
runtime by up to 84.61%, 43.83% on average.

Arbitrary and Variable Precision Floating-Point Arithmetic Support in Dynamic Binary
Translation

  • Marie Badaroux
  • Frédéric Pétrot

Floating-point hardware support was more or less settled 35 years ago by the
adoption of the IEEE 754 standard. However, many scientific applications require higher
accuracy than what can be represented on 64 bits, and to that end make use of dedicated
arbitrary precision software libraries. To reach a good performance/accuracy trade-off,
developers use variable precision, requiring e.g. more accuracy as the computation
progresses. Hardware accelerators for this kind of computations do not exist yet,
and independently of the actual quality of the underlying arithmetic computations,
defining the right instruction set architecture, memory representations, etc, for
them is a challenging task. We investigate in this paper the support for arbitrary
and variable precision arithmetic in a dynamic binary translator, to help gain insight
into what such an accelerator could provide as an interface to compilers, and
thus programmers. We detail our design and present an implementation in QEMU using
the MPFR library for the RISC-V processor.
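
The kind of variable-precision computation such support targets can be sketched from
software, here with Python's mpmath rather than MPFR, purely for illustration:

    # Re-evaluate the same sum at increasing precisions: the precision is a
    # run-time knob, which is what variable-precision hardware (or a
    # DBT-emulated instruction set) would have to expose.
    from mpmath import mp, mpf

    def sum_inverse_squares(n_terms, precision_bits):
        mp.prec = precision_bits              # set working precision in bits
        return sum(mpf(1) / (mpf(k) * k) for k in range(1, n_terms + 1))

    for bits in (24, 53, 113, 256):           # fp32/fp64/fp128-like and beyond
        print(bits, sum_inverse_squares(10000, bits))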

Optimizing Temporal Decoupling using Event Relevance

  • Lukas Jünger
  • Carmine Bianco
  • Kristof Niederholtmeyer
  • Dietmar Petras
  • Rainer Leupers

Over the last decades, HW/SW systems have grown ever more complex. System simulators,
so-called virtual platforms, have been an important tool for developing and testing
these systems. However, the rise in overall complexity has also impacted the simulators.
Complex platforms require fast simulation components and a sophisticated simulation
infrastructure to meet today’s performance demands. With the introduction of SystemC
TLM2.0, temporal decoupling has become a staple in the arsenal of simulation acceleration
techniques. Temporal decoupling yields a significant simulation performance increase
at the cost of diminished accuracy. The two prevalent approaches are called static
quantum and dynamic quantum. In this work both are analyzed using a state-of-the-art,
industrial virtual platform as a case study. While dynamic quantum offers an ideal
trade-off between simulation performance and accuracy in a single-core scenario, performance
reductions can be observed in multi-core platforms. To address this, a novel performance
optimization is proposed, achieving a 14.32% performance gain in our case study while
keeping near-perfect accuracy.

Design Space Exploration of Heterogeneous-Accelerator SoCs with Hyperparameter Optimization

  • Thanh Cong
  • François Charot

Modern SoC systems consist of general-purpose processor cores augmented with large
numbers of specialized accelerators. Building such systems requires a design flow
allowing the design space to be explored at the system level with an appropriate strategy.
In this paper, we describe a methodology for exploring the design space of power-performance
heterogeneous SoCs by combining an architecture simulator (gem5-Aladdin) and a hyperparameter
optimization method (Hyperopt). This methodology allows different types of parallelism
with loop unrolling strategies and memory coherency interfaces to be swept. The flow
has been applied to a convolutional neural network algorithm. We show that the most
energy efficient architecture achieves a 2x to 4x improvement in energy-delay-product
compared to an architecture without parallelism. Furthermore, the obtained solution
is more efficient than commonly implemented architectures (Systolic, 2D-mapping, and
Tiling). We also applied the methodology to find the optimal architecture including
its coherency interface for a complex SoC made up of six accelerated-workloads. We
show that a hybrid interface appears to be the most efficient; it reaches 22% and
12% improvements in energy-delay product compared to using only non-coherent and
only LLC-coherent models, respectively.
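
A minimal Hyperopt-style exploration loop looks like the following (the search space
and the analytic energy-delay model are stand-ins for real gem5-Aladdin simulations,
not the paper's setup):

    # Tree-structured Parzen Estimator (TPE) search over a small accelerator
    # configuration space, minimizing a placeholder energy-delay product.
    from hyperopt import fmin, tpe, hp

    space = {
        "unroll":    hp.choice("unroll", [1, 2, 4, 8, 16]),
        "coherent":  hp.choice("coherent", [False, True]),
        "buffer_kb": hp.quniform("buffer_kb", 16, 256, 16),
    }

    def edp(cfg):
        # Placeholder cost model; the real flow would invoke gem5-Aladdin here.
        delay = 100.0 / cfg["unroll"] + (5.0 if cfg["coherent"] else 20.0)
        energy = cfg["unroll"] * 2.0 + cfg["buffer_kb"] * 0.05
        return delay * energy

    best = fmin(fn=edp, space=space, algo=tpe.suggest, max_evals=100)
    print(best)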

SESSION: 4C: Neural Network Optimizations for Compact AI Inference

DNR: A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs

  • Souvik Kundu
  • Mahdi Nazemi
  • Peter A. Beerel
  • Massoud Pedram

This paper presents a dynamic network rewiring (DNR) method to generate pruned deep
neural network (DNN) models that are robust against adversarial attacks yet maintain
high accuracy on clean images. In particular, the disclosed DNR method is based on
a unified constrained optimization formulation using a hybrid loss function that merges
ultra-high model compression with robust adversarial training. This training strategy
dynamically adjusts inter-layer connectivity based on per-layer normalized momentum
computed from the hybrid loss function. In contrast to existing robust pruning frameworks
that require multiple training iterations, the proposed learning strategy achieves
an overall target pruning ratio with only a single training iteration and can be tuned
to support both irregular and structured channel pruning. To evaluate the merits of
DNR, experiments were performed with two widely accepted models, namely VGG16 and
ResNet-18, on CIFAR-10, CIFAR-100 as well as with VGG16 on Tiny-ImageNet. Compared
to the baseline uncompressed models, DNR provides over 20x compression on all the
datasets with no significant drop in either clean or adversarial classification accuracy.
Moreover, our experiments show that DNR consistently finds compressed models with
better clean and adversarial image classification performance than what is achievable
through state-of-the-art alternatives. Our models and test codes are available at
https://github.com/ksouvik52/DNR_ASP_DAC2021.

Dynamic Programming Assisted Quantization Approaches for Compressing Normal and Robust
DNN Models

  • Dingcheng Yang
  • Wenjian Yu
  • Haoyuan Mu
  • Gary Yao

In this work, we present effective quantization approaches for compressing the deep
neural networks (DNNs). A key ingredient is a novel dynamic programming (DP) based
algorithm to obtain the optimal solution of scalar K-means clustering. Based on it,
two weight quantization approaches for compressing normal DNNs, called DPR and DPQ,
are proposed, which employ regularization and a quantization function, respectively.
Experiments show that they produce models with higher inference accuracy than recently
proposed counterparts while achieving the same or larger compression. They are also extended
for compressing robust DNNs, and the relevant experiments show 16X compression of
the robust ResNet-18 model with less than 3% accuracy drop on both natural and adversarial
examples.
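
Optimal scalar (1-D) K-means can indeed be solved exactly by dynamic programming; a
compact O(K·n^2) sketch of that sub-step (for clarity, not the paper's optimized
implementation) is:

    # D[i][j] = minimum within-cluster sum of squares for the first i sorted
    # values split into j clusters; prefix sums give O(1) segment costs.
    import numpy as np

    def optimal_1d_kmeans_cost(values, k):
        x = np.sort(np.asarray(values, dtype=float))
        n = len(x)
        s1 = np.concatenate(([0.0], np.cumsum(x)))
        s2 = np.concatenate(([0.0], np.cumsum(x * x)))

        def sse(a, b):                        # cost of cluster x[a..b]
            cnt = b - a + 1
            tot = s1[b + 1] - s1[a]
            return (s2[b + 1] - s2[a]) - tot * tot / cnt

        D = np.full((n + 1, k + 1), np.inf)
        D[0, 0] = 0.0
        for j in range(1, k + 1):
            for i in range(j, n + 1):
                D[i, j] = min(D[m - 1, j - 1] + sse(m - 1, i - 1)
                              for m in range(j, i + 1))
        return D[n, k]                        # centroids follow by backtracking

    print(optimal_1d_kmeans_cost([0.1, 0.12, 0.5, 0.52, 0.9], k=2))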

Accelerate Non-unit Stride Convolutions with Winograd Algorithms

  • Junhao Pan
  • Deming Chen

While computer vision tasks target increasingly challenging scenarios, the need for
real-time processing of images rises as well, requiring more efficient methods to
accelerate convolutional neural networks. For unit stride convolutions, we use FFT-based
methods and Winograd algorithms to compute matrix convolutions, which effectively
lower the computing complexity by reducing the number of multiplications. For non-unit
stride convolutions, we usually cannot directly apply those algorithms to accelerate
the computations. In this work, we propose a novel universal approach to construct
the non-unit stride convolution algorithms for any given stride and filter sizes from
Winograd algorithms. Specifically, we first demonstrate the steps to decompose an
arbitrary convolutional kernel and apply the Winograd algorithms separately to compute
non-unit stride convolutions. We then present the derivation of this method and proof
by construction to confirm the validity of this approach. Finally, we discuss the
minimum number of multiplications and additions necessary for the non-unit stride
convolutions and evaluate the performance of the decomposed Winograd algorithms. From
our analysis of the computational complexity, the new approach can benefit from 1.5x
to 3x fewer multiplications. In our experiments on real DNN layers, we achieve around
a 1.3x speedup (T_old/T_new) of the Winograd algorithms over the conventional convolution
algorithm in various experimental settings.
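
The decomposition idea can be checked numerically for the 1-D, stride-2 case (a
simplified sketch: a stride-2 convolution equals the sum of two stride-1 convolutions
on the even/odd phases of the input and filter, each of which can then use a fast
algorithm; this is not the paper's full multi-dimensional construction):

    # Correlation form, "valid" outputs only.
    import numpy as np

    def conv_strided(x, w, stride):
        n_out = (len(x) - len(w)) // stride + 1
        return np.array([np.dot(x[i * stride:i * stride + len(w)], w)
                         for i in range(n_out)])

    def conv_stride2_decomposed(x, w):
        xe, xo = x[0::2], x[1::2]             # even / odd input phases
        we, wo = w[0::2], w[1::2]             # even / odd filter taps
        ye = conv_strided(xe, we, stride=1)
        yo = conv_strided(xo, wo, stride=1)
        n = min(len(ye), len(yo))
        return ye[:n] + yo[:n]

    x = np.random.rand(32)
    w = np.random.rand(6)
    ref = conv_strided(x, w, stride=2)
    dec = conv_stride2_decomposed(x, w)
    assert np.allclose(ref, dec[:len(ref)])   # both compute the same outputs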

Efficient Accuracy Recovery in Approximate Neural Networks by Systematic Error Modelling

  • Cecilia De la Parra
  • Andre Guntoro
  • Akash Kumar

Approximate Computing is a promising paradigm for mitigating the computational demands
of Deep Neural Networks (DNNs) by improving their performance in area, throughput,
or power. The DNN accuracy, affected by such approximations, can then be effectively
improved through retraining. In this paper, we present a novel methodology for modelling
the approximation error introduced by approximate hardware in DNNs, which accelerates
retraining and achieves negligible accuracy loss. To this end, we implement the behavioral
simulation of several approximate multipliers and model the error generated by such
approximations on pre-trained DNNs for image classification on CIFAR10 and ImageNet.
Finally, we optimize the DNN parameters by applying our error model during DNN retraining,
to recover the accuracy lost due to approximations. Experimental results demonstrate
the efficiency of our proposed method for accelerated retraining (11x faster for
CIFAR10 and 8x faster for ImageNet) for full DNN approximation, which allows us to
deploy approximate multipliers with energy savings of up to 36% for 8-bit precision
DNNs with an accuracy loss lower than 1%.
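
The error-modelling idea can be pictured as injecting a statistical error into exact
products during retraining (the Gaussian relative-error model below is a placeholder
assumption, not the fitted behavioral models from the paper):

    # Replace a bit-exact simulation of the approximate multiplier with a
    # statistical error model applied to the exact result.
    import numpy as np

    rng = np.random.default_rng(0)

    def approx_matmul(a, w, rel_err_std=0.02):
        exact = a @ w
        return exact * (1.0 + rng.normal(0.0, rel_err_std, exact.shape))

    a = rng.standard_normal((4, 16))          # activations
    w = rng.standard_normal((16, 8))          # weights
    print(np.max(np.abs(approx_matmul(a, w) - a @ w)))   # injected error magnitude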

SESSION: 4D: Brain-Inspired Computing

Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators

  • Sitao Huang
  • Aayush Ankit
  • Plinio Silveira
  • Rodrigo Antunes
  • Sai Rahul Chalamalasetti
  • Izzat El Hajj
  • Dong Eun Kim
  • Glaucimar Aguiar
  • Pedro Bruel
  • Sergey Serebryakov
  • Cong Xu
  • Can Li
  • Paolo Faraboschi
  • John Paul Strachan
  • Deming Chen
  • Kaushik Roy
  • Wen-mei Hwu
  • Dejan Milojicic

ReRAM-based accelerators have shown great potential for accelerating DNN inference
because ReRAM crossbars can perform analog matrix-vector multiplication operations
with low latency and energy consumption. However, these crossbars require the use
of ADCs which constitute a significant fraction of the cost of MVM operations. The
overhead of ADCs can be mitigated via partial sum quantization. However, prior quantization
flows for DNN inference accelerators do not consider partial sum quantization which
is not highly relevant to traditional digital architectures. To address this issue,
we propose a mixed precision quantization scheme for ReRAM-based DNN inference accelerators
where weight quantization, input quantization, and partial sum quantization are jointly
applied for each DNN layer. We also propose an automated quantization flow powered
by deep reinforcement learning to search for the best quantization configuration in
the large design space. Our evaluation shows that the proposed mixed precision quantization
scheme and quantization flow reduce inference latency and energy consumption by up
to 3.89x and 4.84x, respectively, while only losing 1.18% in DNN inference accuracy.

A reduced-precision streaming SpMV architecture for Personalized PageRank on FPGA

  • Alberto Parravicini
  • Francesco Sgherzi
  • Marco D. Santambrogio

Sparse matrix-vector multiplication is often employed in many data-analytic workloads
in which low latency and high throughput are more valuable than exact numerical convergence.
FPGAs provide quick execution times while offering precise control over the accuracy
of the results thanks to reduced-precision fixed-point arithmetic. In this work, we
propose a novel streaming implementation of Coordinate Format (COO) sparse matrix-vector
multiplication, and study its effectiveness when applied to the Personalized PageRank
algorithm, a common building block of recommender systems in e-commerce websites and
social networks. Our implementation achieves speedups up to 6x over a reference floating-point
FPGA architecture and a state-of-the-art multi-threaded CPU implementation on 8 different
data-sets, while preserving the numerical fidelity of the results and reaching up
to 42x higher energy efficiency compared to the CPU implementation.
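
A software model of the streaming COO kernel takes only a few lines (this mimics the
dataflow, not the fixed-point FPGA datapath; the small matrix is illustrative):

    # The matrix arrives as a stream of (row, col, value) triples and is
    # accumulated into the output vector, the form a PageRank-style
    # iteration consumes.
    import numpy as np

    def spmv_coo(rows, cols, vals, x, n_rows):
        y = np.zeros(n_rows)
        for r, c, v in zip(rows, cols, vals):   # one triple per "cycle"
            y[r] += v * x[c]
        return y

    rows = np.array([0, 0, 1, 2, 2, 3])
    cols = np.array([1, 3, 0, 2, 3, 1])
    vals = np.array([0.5, 0.5, 1.0, 0.7, 0.3, 1.0])
    x = np.array([0.25, 0.25, 0.25, 0.25])
    print(spmv_coo(rows, cols, vals, x, n_rows=4))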

HyperRec: Efficient Recommender Systems with Hyperdimensional Computing

  • Yunhui Guo
  • Mohsen Imani
  • Jaeyoung Kang
  • Sahand Salamat
  • Justin Morris
  • Baris Aksanli
  • Yeseong Kim
  • Tajana Rosing

Recommender systems are important tools for many commercial applications such as online
shopping websites. There are several issues that make the recommendation task very
challenging in practice. The first is that an efficient and compact representation
is needed to represent users, items, and relations. The second issue is that online
markets change dynamically; it is thus important that the recommendation algorithm
supports fast updates and hardware acceleration. In this paper, we propose
a new hardware-friendly recommendation algorithm based on Hyperdimensional Computing,
called HyperRec. Unlike existing solutions, which leverage floating-point numbers
for the data representation, in HyperRec users and items are modeled with binary
vectors in a high dimension. The binary representation enables the reasoning process
of the proposed algorithm to be performed using only Boolean operations, which is efficient
on various computing platforms and suitable for hardware acceleration. In this work,
we show how to utilize GPU and FPGA to accelerate the proposed HyperRec. When compared
with the state-of-the-art methods for rating prediction, the CPU-based HyperRec implementation
is 13.75x faster and consumes 87% less memory, while decreasing the mean squared error
(MSE) of the prediction by as much as 31.84%. Our FPGA implementation is on average
67.0x faster and 6.9x more energy efficient than the CPU. Our GPU implementation
further achieves on average a 3.1x speedup over the FPGA, while providing only
1.2x lower energy efficiency.
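
A toy hyperdimensional encoding along these lines is sketched below (bundling binary
hypervectors by bitwise majority and comparing by Hamming distance; the dimension,
the item set, and the omission of binding/permutation details are simplifications,
not HyperRec itself):

    # Items liked by a user are bundled into a single binary hypervector;
    # similarity to liked items stays high (~0.75), to others ~0.5.
    import numpy as np

    D = 10_000
    rng = np.random.default_rng(0)
    rand_hv = lambda: rng.integers(0, 2, D, dtype=np.uint8)

    items = {n: rand_hv() for n in ("book", "laptop", "headphones", "desk", "lamp")}
    liked = ("book", "headphones", "lamp")              # items the user rated highly

    stack = np.stack([items[n] for n in liked])
    user_hv = (stack.sum(axis=0) * 2 > len(liked)).astype(np.uint8)  # bitwise majority

    def similarity(a, b):
        return 1.0 - np.count_nonzero(a ^ b) / D        # 1 - normalized Hamming

    for name, hv in items.items():
        print(name, round(similarity(user_hv, hv), 3))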

Efficient Techniques for Training the Memristor-based Spiking Neural Networks Targeting
Better Speed, Energy and Lifetime

  • Yu Ma
  • Pingqiang Zhou

Speed and energy consumption are two important metrics in designing spiking neural
networks (SNNs). The inference process of current SNNs is terminated after a preset
number of time steps for all images, which leads to a waste of time and spikes. We
can instead terminate the inference process after a proper number of time steps for
each image. Besides, the normalization method also influences the time and spikes
of SNNs. In this work, we first use a reinforcement learning algorithm to develop an
efficient termination
strategy which can help find the right number of time steps for each image. Then we
propose a model tuning technique for memristor-based crossbar circuit to optimize
the weight and bias of a given SNN. Experimental results show that the proposed techniques
can reduce crossbar energy consumption by about 58.7% and time consumption by over
62.5%, and double the drift lifetime of the memristor-based SNN.

SESSION: 4E: Cross-Layer Hardware Security

PCBench: Benchmarking of Board-Level Hardware Attacks and Trojans

  • Huifeng Zhu
  • Xiaolong Guo
  • Yier Jin
  • Xuan Zhang

Most modern electronic systems are hosted by printed circuit boards (PCBs), making
them a ubiquitous system component that can take many different shapes and forms.
In order to achieve a high level of economy of scale, the global supply chain of electronic
systems has evolved into disparate segments for the design, fabrication, assembly,
and testing of PCB boards and their various associated components. As a consequence,
the modern PCB supply chain exposes many vulnerabilities along its different stages,
allowing adversaries to introduce malicious alterations to facilitate board-level
attacks.

As an emerging hardware threat, the attack and defense techniques at the board level
have not yet been systemically explored and thus require a thorough and comprehensive
investigation. In the absence of a standard board-level attack benchmark, current research
on prospective countermeasures is likely to be evaluated on proprietary variants of
ad-hoc attacks, preventing credible and verifiable comparison among different techniques.
To address this need, in this paper we systematically define and categorize a broad
range of board-level attacks. For the first time, the attack vectors and construction
rules for board-level attacks are developed. A practical and reliable board-level
attack benchmark generation scheme is also developed, which can be used to produce
references for evaluating countermeasures. Finally, based on the proposed approach,
we have created a comprehensive set of board-level attack benchmarks for open-source
release.

Cache-Aware Dynamic Skewed Tree for Fast Memory Authentication

  • Saru Vig
  • Siew-Kei Lam
  • Rohan Juneja

Memory integrity trees are widely-used to protect external memories in embedded systems
against bus attacks. However, existing methods often result in high performance overheads
incurred during memory authentication. To reduce memory accesses during authentication,
the tree nodes are cached on-chip. In this paper, we propose a cache-aware technique
to dynamically skew the integrity tree based on the application workloads in order
to reduce the performance overhead. The tree is initialized using a van Emde Boas (vEB)
organization to take advantage of locality of reference. At run time, the nodes of
the integrity tree are dynamically positioned based on their memory access patterns.
In particular, frequently accessed nodes are placed closer to the root to reduce the
memory access overheads. The proposed technique is compared with existing methods
on Multi2Sim using benchmarks from SPEC-CPU2006, SPLASH-2 and PARSEC to demonstrate
its performance benefits.

Automated Test Generation for Hardware Trojan Detection using Reinforcement Learning

  • Zhixin Pan
  • Prabhat Mishra

Due to the globalized semiconductor supply chain, there is an increasing risk of exposing
System-on-Chip (SoC) designs to malicious implants, popularly known as hardware Trojans.
Unfortunately, traditional simulation-based validation using millions of test vectors
is unsuitable for detecting stealthy Trojans with extremely rare trigger conditions
due to exponential input space complexity of modern SoCs. There is a critical need
to develop efficient Trojan detection techniques to ensure trustworthy SoCs. While
there are promising test generation approaches, they have serious limitations in terms
of scalability and detection accuracy. In this paper, we propose a novel logic testing
approach for Trojan detection using an effective combination of testability analysis
and reinforcement learning. Specifically, this paper makes three important contributions.
1) Unlike existing approaches, we utilize both controllability and observability analysis
along with rareness of signals to significantly improve the trigger coverage. 2) Utilization
of reinforcement learning considerably reduces the test generation time without sacrificing
the test quality. 3) Experimental results demonstrate that our approach can drastically
improve both trigger coverage (14.5% on average) and test generation time (6.5 times
on average) compared to state-of-the-art techniques.
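
As a rough illustration of reward-guided test mutation (a toy hill-climbing sketch over made-up rare-signal triggers, not the paper's testability-plus-reinforcement-learning formulation), consider:

    # Toy sketch of reward-guided test-vector mutation for rare-trigger coverage
    # (hypothetical circuit model; a real RL agent would learn a policy over actions).
    import random

    random.seed(0)
    N_BITS, N_RARE = 32, 8
    # Each "rare signal" fires only when a small random sub-pattern of bits matches.
    rare = [{random.randrange(N_BITS): random.randint(0, 1) for _ in range(6)}
            for _ in range(N_RARE)]

    def coverage(vec):
        return sum(all(vec[i] == v for i, v in sig.items()) for sig in rare)

    vec = [random.randint(0, 1) for _ in range(N_BITS)]
    best = coverage(vec)
    for step in range(2000):
        i = random.randrange(N_BITS)          # action: flip one bit (a learned policy
        vec[i] ^= 1                           # would choose this more cleverly)
        r = coverage(vec)
        if r >= best:
            best = r                          # keep the mutation (positive reward)
        else:
            vec[i] ^= 1                       # undo it (negative reward)
    print("rare signals covered:", best, "of", N_RARE)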

On the Impact of Aging on Power Analysis Attacks Targeting Power-Equalized Cryptographic
Circuits

  • Md Toufiq Hasan Anik
  • Bijan Fadaeinia
  • Amir Moradi
  • Naghmeh Karimi

Side-channel analysis attacks exploit the physical characteristics of cryptographic
chip implementations to extract their embedded secret keys. In particular, Power Analysis
(PA) attacks make use of the dependency of the power consumption on the data being
processed by the cryptographic devices. To tackle the vulnerability of cryptographic circuits against PA attacks, various countermeasures have been proposed in the literature and adopted by industry, among which a branch of hiding schemes opts to equalize the power consumption of the chip regardless of the processed data. Although these countermeasures are supposed to reduce the information leakage of cryptographic chips, they fail to consider the impact of aging that occurs during the device lifetime. Due to aging, the specifications of transistors, and in particular their threshold voltage, deviate from their fabrication-time values, leading to changes in the circuit's delay and power consumption over time. In this paper, we show that the aging-induced
impacts result in imbalances in the equalized power consumption achieved by hiding
countermeasures. This makes such protected cryptographic chips vulnerable to PA attacks
when aged. The experimental results extracted through the aging simulation of the
PRESENT cipher protected by Sense Amplifier Based Logic (SABL), one of the well-known
hiding countermeasures, show that the achieved protection may not last during the
circuit lifetime.

SESSION: 5B: Embedded Operating Systems and Information Retrieval

Energy-Performance Co-Management of Mixed-Sensitivity Workloads on Heterogeneous Multi-core
Systems

  • Elham Shamsa
  • Anil Kanduri
  • Amir M. Rahmani
  • Pasi Liljeberg

Satisfying performance of complex workload scenarios with respect to energy consumption
on Heterogeneous Multi-core Platforms (HMPs) is challenging when considering i) the
increasing variety of applications, and ii) the large space of resource management
configurations. Existing run-time resource management approaches use online and offline
learning to handle such complexity. However, they focus on one type of application,
neglecting concurrent execution of mixed sensitivity workloads. In this work, we propose
an energy-performance co-management method which prioritizes mixed types of applications
at run-time, and searches in the configuration space to find the optimal configuration
for each application which satisfies the performance requirements while saving energy.
We evaluate our approach on a real Odroid XU3 platform over mixed-sensitivity embedded
workloads. Experimental results show our approach provides 54% lower performance violation
with 50% higher energy saving compared to the existing approaches.

Optimizing Inter-Core Data-Propagation Delays in Industrial Embedded Systems under
Partitioned Scheduling

  • Lamija Hasanagić
  • Tin Vidović
  • Saad Mubeen
  • Mahammad Ashjaei
  • Matthias Becker

This paper addresses the scheduling of industrial time-critical applications on multi-core
embedded systems. A novel scheduling technique under partitioned scheduling is proposed
that minimizes inter-core data-propagation delays between tasks that are activated
with different periods. The proposed technique is based on the read-execute-write
model for the execution of tasks to guarantee temporal isolation when accessing the
shared resources. A Constraint Programming formulation is presented to find the schedule
for each core. Evaluations are performed to assess the scalability as well as the resulting schedulability ratio, which remains at 18% even when two cores are each utilized at 90%. Furthermore, an automotive industrial case study is performed to demonstrate
the applicability of the proposed technique to industrial systems. The case study
also presents a comparative evaluation of the schedules generated by (i) the proposed
technique and (ii) the Rubus-ICE industrial tool suite with respect to jitter, inter-core
data-propagation delays and their impact on data age of task chains that span multiple
cores.

LiteIndex: Memory-Efficient Schema-Agnostic Indexing for JSON documents in SQLite

  • Siqi Shang
  • Qihong Wu
  • Tianyu Wang
  • Zili Shao

SQLite with JSON (JavaScript Object Notation) format is widely adopted for local data
storage in mobile applications such as Twitter and Instagram. As more data is generated and stored, it becomes vitally important to efficiently index and search JSON records in SQLite. However, current methods in SQLite either require full-text search (which incurs large memory usage and long query latency) or expression-based indexing (which must be created manually by specifying search keys). On the other hand, existing
JSON automatic indexing techniques, mainly focusing on big data and cloud environments,
depend on a colossal tree structure that cannot be applied in memory-constrained mobile
devices.

In this paper, we propose a novel schema-agnostic indexing technique called LiteIndex
that can automatically index JSON records by extracting keywords from long text and
maintaining user-preferred items within a given memory constraint. This is achieved
by memory-efficient index organization with light-weight keyword extraction from long
text and user-preference-aware reinforcement-learning-based index pruning mechanism.
LiteIndex has been implemented on an Android smartphone platform and evaluated with a Twitter dataset. Experimental results show that LiteIndex can significantly reduce
the query latency by up to 18x with less memory usage compared with SQLite with FTS3/FTS4
extensions.

SESSION: 5C: Security Issues in AI and Their Impacts on Hardware Security

Micro-architectural Cache Side-Channel Attacks and Countermeasures

  • Chaoqun Shen
  • Congcong Chen
  • Jiliang Zhang

The Central Processing Unit (CPU) is considered the brain of a computer. If the CPU has vulnerabilities, the security of the software running on it is difficult to guarantee.
In recent years, various micro-architectural cache side-channel attacks on the CPU
such as Spectre and Meltdown have appeared. They exploit contention on internal components
of the processor to leak secret information between processes. This newly evolving
research area has aroused significant interest due to the broad application range
and harmfulness of these attacks. This article reviews recent research progress on
micro-architectural cache side-channel attacks and defenses. First, the various micro-architectural
cache side-channel attacks are classified and discussed. Then, the corresponding countermeasures
are summarized. Finally, the limitations and future development trends are discussed.

Security of Neural Networks from Hardware Perspective: A Survey and Beyond

  • Qian Xu
  • Md Tanvir Arafin
  • Gang Qu

Recent advances in neural networks (NNs) and their applications in deep learning techniques
have made the security aspects of NNs an important and timely topic for fundamental
research. In this paper, we survey the security challenges and opportunities in the
computing hardware used in implementing deep neural networks (DNN). First, we explore
the hardware attack surfaces for DNN. Then, we report the current state-of-the-art
hardware-based attacks on DNN with focus on hardware Trojan insertion, fault injection,
and side-channel analysis. Next, we discuss the recent development on detecting these
hardware-oriented attacks and the corresponding countermeasures. We also study the
application of secure enclaves for the trusted execution of NN-based algorithms. Finally,
we consider the emerging topic of intellectual property protection for deep learning
systems. Based on our study, we find ample opportunities for hardware-based research
to secure the next generation of DNN-based artificial intelligence and machine learning
platforms.

Learning Assisted Side Channel Delay Test for Detection of Recycled ICs

  • Ashkan Vakil
  • Farzad Niknia
  • Ali Mirzaeian
  • Avesta Sasan
  • Naghmeh Karimi

With the outsourcing of design flow, ensuring the security and trustworthiness of
integrated circuits has become more challenging. Among the security threats, IC counterfeiting
and recycled ICs have received a lot of attention due to their inferior quality, and
in turn, their negative impact on the reliability and security of the underlying devices.
Detecting recycled ICs is challenging due to the effect of process variations and
process drift occurring during the chip fabrication. Moreover, relying on a golden
chip as a basis for comparison is not always feasible. Accordingly, this paper presents
a recycled IC detection scheme based on delay side-channel testing. The proposed method
relies on the features extracted during the design flow and the sample delays extracted
from the target chip to build a Neural Network model with which the target chip can be reliably classified as new or recycled. The proposed method classifies the timing
paths of the target chip into two groups based on their vulnerability to aging using
the information collected from the design and detects the recycled ICs based on the
deviation of the delay of these two sets from each other.

ML-augmented Methodology for Fast Thermal Side-channel Emission Analysis

  • Norman Chang
  • Deqi Zhu
  • Lang Lin
  • Dinesh Selvakumaran
  • Jimin Wen
  • Stephen Pan
  • Wenbo Xia
  • Hua Chen
  • Calvin Chow
  • Gary Chen

Accurate side-channel attacks can non-invasively or semi-invasively extract secure
information from hardware devices using “side-channel” measurements. The thermal
profile of an IC is one class of side channel that can be used to exploit the security
weaknesses in a design. Measurement of junction temperature from an on-chip thermal
sensor or top metal layer temperature using an infrared thermal image of an IC with
the package being removed can disclose secret keys of a cryptographic design through
correlation power analysis. In order to identify the design vulnerabilities to thermal
side channel attacks, design time simulation tools are highly important. However,
simulation of thermal side-channel emission is highly complex and computationally
intensive due to the scale of simulation vectors required and the multi-physics simulation
models involved. Hence, in this paper, we have proposed a fast and comprehensive Machine
Learning (ML) augmented thermal simulation methodology for thermal Side-Channel emission
Analysis (SCeA). We have developed an innovative tile-based Delta-T Predictor using
a data-driven DNN-based thermal solver. The developed tile-based Delta-T Predictor temperature is used to perform the thermal side-channel analysis, which models the
scenario of thermal attacks with the measurement of junction temperature. This method
can be 100-1000x faster depending on the size of the chip compared to traditional
FEM-based thermal solvers with the same level of accuracy. Furthermore, this simulation
allows for the determination of location-dependent wire temperature on the top metal layer to validate the scenario of thermal attack with top metal layer temperature. We have demonstrated the leakage of the encryption key in a 128-bit AES chip using
both proposed tile-based temperature calculations and top metal wire temperature calculations,
quantified by simulation MTD (Measurements-to-Disclosure).

SESSION: 5D: Advances in Logic and High-level Synthesis

1st-Order to 2nd-Order Threshold Logic Gate Transformation with an Enhanced ILP-based
Identification Method

  • Li-Cheng Zheng
  • Hao-Ju Chang
  • Yung-Chih Chen
  • Jing-Yang Jou

This paper introduces a method to enhance an integer linear programming (ILP)-based
method for transforming a 1st-order threshold logic gate (1-TLG) to a 2nd-order TLG
(2-TLG) with lower area cost. We observe that for a 2-TLG, most of the 2nd-order weights
(2-weights) are zero. That is, in the ILP formulation, most of the variables for the
2-weights could be set to zero. Thus, we first propose three sufficient conditions
for transforming a 1-TLG to a 2-TLG by extracting 2-weights. These extracted weights are more likely to be non-zero. Then, we simplify the ILP formulation by eliminating
the non-extracted 2-weights to speed up the ILP solving. The experimental results
show that, to transform a set of 1-TLGs to 2-TLGs, the enhanced method saves an average
of 24% CPU time with only an average of 1.87% quality loss in terms of the area cost
reduction rate.
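
For readers unfamiliar with ILP-based threshold-logic identification, below is a minimal feasibility-style sketch for a 1-TLG using the PuLP solver (an assumed dependency); the paper's contribution extends a formulation of this kind with 2nd-order weight variables and prunes most of them:

    # Sketch: fit a 1st-order threshold logic gate (1-TLG) to a truth table with ILP.
    # PuLP is an assumed dependency; this is not the paper's enhanced 2-TLG formulation.
    import itertools
    import pulp

    def fit_1tlg(truth_table, n, wmax=8):
        """Return (weights, threshold) realizing truth_table as a 1-TLG, or None."""
        prob = pulp.LpProblem("tlg", pulp.LpMinimize)
        w = [pulp.LpVariable(f"w{i}", -wmax, wmax, cat="Integer") for i in range(n)]
        T = pulp.LpVariable("T", -n * wmax, n * wmax, cat="Integer")
        a = [pulp.LpVariable(f"a{i}", 0, wmax) for i in range(n)]
        prob += pulp.lpSum(a)                      # minimize total |weight| as a crude cost proxy
        for i in range(n):
            prob += a[i] >= w[i]
            prob += a[i] >= -w[i]
        for bits in itertools.product([0, 1], repeat=n):
            s = pulp.lpSum(w[i] * bits[i] for i in range(n))
            if truth_table[bits]:
                prob += s >= T                     # on-set inputs must reach the threshold
            else:
                prob += s <= T - 1                 # off-set inputs must stay below it
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        if pulp.LpStatus[prob.status] != "Optimal":
            return None
        return [int(v.value()) for v in w], int(T.value())

    # Example: 3-input majority is a threshold function (e.g., w = (1, 1, 1), T = 2).
    maj = {b: int(sum(b) >= 2) for b in itertools.product([0, 1], repeat=3)}
    print(fit_1tlg(maj, 3))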

A Novel Technology Mapper for Complex Universal Gates

  • Meng-Che Wu
  • Ai Quoc Dao
  • Mark Po-Hung Lin

Complex universal logic gates, which may have higher density and flexibility than
basic logic gates and look-up tables (LUT), are useful for cost-effective or security-oriented
VLSI design requirements. However, most of the technology mapping algorithms aim to
optimize combinational logic with basic standard cells or LUT components. It is desirable
to investigate optimal technology mappers for complex universal gates in addition
to basic standard cells and LUT components. This paper proposes a novel technology
mapper for complex universal gates with a tight integration of the following techniques:
Boolean network simulation with permutation classification, supergate library construction,
dynamic-programming-based cut enumeration, and Boolean matching with optimal universal cell covering. Experimental results show that the proposed method outperforms the
state-of-the-art technology mapper in ABC, in terms of both area and delay.

High-Level Synthesis of Transactional Memory

  • Omar Ragheb
  • Jason H. Anderson

The rising popularity of high-level synthesis (HLS) is due to the complexity and amount
of background knowledge required to design hardware circuits. Despite significant
recent advances in HLS research, HLS-generated circuits may be of lower quality than
human-expert-designed circuits, from the performance, power, or area perspectives.
In this work, we aim to raise circuit performance by introducing a transactional memory
(TM) synchronization model to the open-source LegUp HLS tool [1]. LegUp HLS supports
the synthesis of multi-threaded software into parallel hardware [4], including support
for mutual-exclusion lock-based synchronization. With the introduction of transactional
memory-based synchronization, location-specific (i.e., finer-grained) memory locks
are made possible, where instead of placing an access lock around an entire array,
one can place a lock around individual array elements. Significant circuit performance
improvements are observed through reduced stalls due to contention, and greater memory-access
parallelism. On a set of 5 parallel benchmarks, wall-clock time is improved by 2.0x,
on average, by the TM synchronization model vs. mutex-based locks.

SESSION: 5E: Hardware-Oriented Threats and Solutions in Neural Networks

VADER: Leveraging the Natural Variation of Hardware to Enhance Adversarial Attack

  • Hao Lv
  • Bing Li
  • Ying Wang
  • Cheng Liu
  • Lei Zhang

Adversarial attacks have been viewed as the primary threat to the security of neural
networks. Hence, extensive adversarial defense techniques have been proposed to protect
the neural networks from adversarial attacks, allowing for the application of neural
networks to the security-sensitive tasks. Recently, the emerging devices, e.g., Resistive
RAM (RRAM), attracted extensive attention for establishing the hardware platform for
neural networks to tackle the inadequate computing capability of the traditional computing
platform. Though the emerging devices exhibit inherent instability issues due to the advanced manufacturing technology, including hardware variations and defects, the error-resilience capability of neural networks enables their wide deployment on these devices. In this work, we find that the natural instability
in emerging devices impairs the security of neural networks. Specifically, we design
an enhanced adversarial attack, Variation-oriented ADvERsarial (VADER) attack which
leverages the inherent hardware variations in RRAM chips to penetrate the protection
of adversarial defenses and mislead the prediction of neural networks. We evaluated
the effectiveness of VADER across various protected neural network models and the
results show that VADER achieves a higher attack success rate than other adversarial attacks.

Entropy-Based Modeling for Estimating Adversarial Bit-flip Attack Impact on Binarized
Neural Network

  • Navid Khoshavi
  • Saman Sargolzaei
  • Yu Bi
  • Arman Roohi

Over the past years, the high demand for efficiently processing deep learning (DL) models has driven the market of chip design companies. However, the new Deep Chip architectures, a common term for DL hardware accelerators, have paid little attention to the security requirements of quantized neural networks (QNNs), while black/white-box adversarial attacks can jeopardize the integrity of the inference accelerator. Therefore, in this paper, a comprehensive study of the resiliency of QNN topologies to black-box attacks is presented. Herein, different attack scenarios are performed
on an FPGA-processor co-design, and the collected results are extensively analyzed
to give an estimation of the impact’s degree of different types of attacks on the
QNN topology. To be specific, we evaluated the sensitivity of the QNN accelerator
to a range of bit-flip attacks (BFAs) that might occur in the operational lifetime
of the device. The BFAs are injected at uniformly distributed times either across
the entire QNN or per individual layer during the image classification. The acquired
results are utilized to build the entropy-based model that can be leveraged to construct
resilient QNN architectures to bit-flip attacks.

A Low Cost Weight Obfuscation Scheme for Security Enhancement of ReRAM Based Neural
Network Accelerators

  • Yuhang Wang
  • Song Jin
  • Tao Li

The resistive random-access memory (ReRAM) based accelerator can execute the large
scale neural network (NN) applications in an extremely energy efficient way. However,
the non-volatile feature of the ReRAM introduces some security vulnerabilities. The
weight parameters of a well-trained NN model deployed on the ReRAM based accelerator
are persisted even after the chip is powered off. The adversaries who have the physical
access to the accelerator can hence launch the model stealing attack and extract these
weights by some micro-probing methods. Run-time encryption of the weights is an intuitive way to protect the NN model but largely degrades execution performance and device endurance, while obfuscation of the weight rows incurs tremendous hardware area overhead to achieve high security. In view of the above-mentioned problems, in this
paper we propose a low cost weight obfuscation scheme to secure the NN model deployed
on the ReRAM based accelerators from the model stealing attack. We partition the crossbar
into many virtual operation units (VOUs) and perform full permutation on the weights
of the VOUs along the column dimension. Without the keys, the attacker cannot perform
the correct NN computations even if they have obtained the obfuscated model. Compared with weight-row-based obfuscation, our scheme can achieve the same level of security with an order of magnitude less hardware area and power overhead.
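
A minimal numpy sketch of key-driven, block-wise column permutation conveys the flavor of such obfuscation (block size and key handling here are illustrative, not the paper's exact VOU scheme):

    # Sketch: key-driven permutation of weight columns inside fixed-size blocks
    # ("virtual operation units"); illustrative only, not the paper's exact scheme.
    import numpy as np

    def obfuscate(W, key, block=4):
        rng = np.random.default_rng(key)          # the key seeds the permutations
        W_obf = W.copy()
        perms = []
        for c in range(0, W.shape[1], block):
            p = rng.permutation(min(block, W.shape[1] - c))
            W_obf[:, c:c + block] = W_obf[:, c:c + block][:, p]
            perms.append(p)
        return W_obf, perms

    def deobfuscate(W_obf, perms, block=4):
        W = W_obf.copy()
        for idx, c in enumerate(range(0, W.shape[1], block)):
            inv = np.argsort(perms[idx])          # invert each block's permutation
            W[:, c:c + block] = W[:, c:c + block][:, inv]
        return W

    W = np.arange(32, dtype=float).reshape(4, 8)
    W_obf, perms = obfuscate(W, key=1234)
    x = np.random.rand(8)
    print(np.allclose(W @ x, deobfuscate(W_obf, perms) @ x))   # True with the key
    print(np.allclose(W @ x, W_obf @ x))                       # False without it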

SESSION: 6B: Advanced Optimizations for Embedded Systems

Puncturing the memory wall: Joint optimization of network compression with approximate memory for ASR application

  • Qin Li
  • Peiyan Dong
  • Zijie Yu
  • Changlu Liu
  • Fei Qiao
  • Yanzhi Wang
  • Huazhong Yang

The automatic speech recognition (ASR) system is becoming increasingly irreplaceable
in smart speech interaction applications. Nonetheless, these applications confront
the memory wall when embedded in the energy and memory constrained Internet of Things
devices. Therefore, it is extremely challenging but imperative to design a memory-saving
and energy-saving ASR system. This paper proposes a joint-optimized scheme of network
compression with approximate memory for the economical ASR system. At the algorithm
level, this work presents block-based pruning and quantization with error model (BPQE),
an optimized compression framework including a novel pruning technique coordinated
with low-precision quantization and the approximate memory scheme. The BPQE compressed
recurrent neural network (RNN) model comes with an ultra-high compression rate and a fine-grained structured pattern that reduces the amount of memory access immensely.
At the hardware level, this work presents an ASR-adapted incremental retraining method
to further obtain optimal power saving. This retraining method stimulates the utility
of the approximate memory scheme, while maintaining considerable accuracy. According
to the experiment results, the proposed joint-optimized scheme achieves 58.6% power
saving and 40x memory saving with a phone error rate of 20%.
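
A small numpy sketch of generic block-based weight pruning (block size and sparsity target are illustrative, not the BPQE settings) looks as follows:

    # Sketch: block-based pruning. Zero out the weight blocks with the smallest L2 norm
    # until a target sparsity is reached (illustrative; BPQE also coordinates quantization
    # and the approximate-memory scheme).
    import numpy as np

    def block_prune(W, block=(4, 4), sparsity=0.5):
        rows, cols = W.shape
        br, bc = block
        norms = []
        for r in range(0, rows, br):
            for c in range(0, cols, bc):
                norms.append((np.linalg.norm(W[r:r + br, c:c + bc]), r, c))
        norms.sort()                              # weakest blocks first
        n_drop = int(len(norms) * sparsity)
        Wp = W.copy()
        for _, r, c in norms[:n_drop]:
            Wp[r:r + br, c:c + bc] = 0.0          # prune the whole block
        return Wp

    W = np.random.randn(16, 16).astype(np.float32)
    Wp = block_prune(W, sparsity=0.75)
    print("nonzero fraction:", np.count_nonzero(Wp) / Wp.size)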

Canonical Huffman Decoder on Fine-grain Many-core Processor Arrays

  • Satyabrata Sarangi
  • Bevan Baas

Canonical Huffman codecs have been used in a wide variety of platforms ranging from
mobile devices to data centers which all demand high energy efficiency and high throughput.
This work presents bit-parallel canonical Huffman decoder implementations on a fine-grain
many-core array built using simple RISC-style programmable processors. We develop
multiple energy-efficient and area-efficient decoder implementations and the results
are compared with an Intel i7-4850HQ and a massively parallel GT 750M GPU executing
the corpus benchmarks: Calgary, Canterbury, Artificial, and Large. The many-core implementations
achieve a scaled throughput per chip area that is 324x and 2.7x greater on average
than the i7 and GT 750M respectively. In addition, the many-core implementations yield
a scaled energy efficiency (bytes decoded per energy) that is 24.1x and 4.6x greater
than the i7 and GT 750M respectively.
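
As background, a bit-serial canonical Huffman decoder can be written compactly in Python; the implementations in the paper are bit-parallel and mapped to a many-core array, but they rely on the same canonical code tables sketched here:

    # Reference bit-serial canonical Huffman decoder (illustrative; the paper's
    # decoders are bit-parallel hardware implementations built on the same tables).
    def build_tables(code_lengths):
        """code_lengths: {symbol: length}. Returns per-length first code and symbol lists."""
        max_len = max(code_lengths.values())
        by_len = {l: sorted(s for s, L in code_lengths.items() if L == l)
                  for l in range(1, max_len + 1)}
        first_code, code = {}, 0
        for l in range(1, max_len + 1):
            first_code[l] = code
            code = (code + len(by_len[l])) << 1
        return first_code, by_len

    def decode(bits, code_lengths, n_symbols):
        first_code, by_len = build_tables(code_lengths)
        out, code, length, i = [], 0, 0, 0
        while len(out) < n_symbols:
            code = (code << 1) | bits[i]; length += 1; i += 1
            cands = by_len.get(length, [])
            if cands and first_code[length] <= code < first_code[length] + len(cands):
                out.append(cands[code - first_code[length]])
                code, length = 0, 0
        return out

    lengths = {'a': 1, 'b': 2, 'c': 3, 'd': 3}     # canonical codes: a=0, b=10, c=110, d=111
    bits = [1, 0, 0, 1, 1, 0, 1, 1, 1]             # encodes "b a c d"
    print(decode(bits, lengths, 4))                # ['b', 'a', 'c', 'd']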

A Decomposition-Based Synthesis Algorithm for Sparse Matrix-Vector Multiplication
in Parallel Communication Structure

  • Mingfei Yu
  • Ruitao Gao
  • Masahiro Fujita

There is an obvious trend that hardware including many-core CPU, GPU and FPGA are
always made use of to conduct computationally intensive tasks of deep learning implementations,
a large proportion of which can be formulated as sparse matrix-vector multiplication (SpMV). In contrast with dense matrix-vector multiplication (DMV), scheduling solutions for SpMV targeting parallel processing turn out to be irregular, leading to the dilemma that the optimal synthesis problem becomes time-consuming or even infeasible as the size of the involved matrix increases. In this paper, the minimum scheduling
problem of 4×4 SpMV on ring-connected architecture is first studied, with two concepts
named Multi-Input Vector and Multi-Output Vector introduced. The classification of
4×4 sparse matrices has been conducted, on account of which a decomposition-based
synthesis algorithm for larger matrices is put forward. As the proposed method is
guided by known sub-scheduling solutions, search space of the synthesis problem is
considerably reduced. Through comparison with an exhaustive search method and a brute
force-based parallel scheduling method, the proposed method is shown to offer scheduling solutions of high quality: on average, they utilize 65.27% of the sparseness of the involved matrices and achieve 91.39% of the performance of the solutions generated by exhaustive search, with a remarkable saving of compilation time and the best scalability among the above-mentioned approaches.
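
For reference, the kernel being scheduled is a plain sparse matrix-vector product; a CSR-format Python version is shown below (the paper's contribution is how the nonzero multiply-accumulates of such a kernel are mapped onto a ring-connected architecture, which this sketch does not model):

    # Plain CSR sparse matrix-vector multiplication, the kernel whose multiply-accumulate
    # operations are scheduled onto the parallel communication structure.
    import numpy as np

    def spmv_csr(values, col_idx, row_ptr, x):
        y = np.zeros(len(row_ptr) - 1)
        for row in range(len(y)):
            for k in range(row_ptr[row], row_ptr[row + 1]):
                y[row] += values[k] * x[col_idx[k]]   # only nonzero entries are touched
        return y

    # 4x4 example with 5 nonzeros
    A = np.array([[5, 0, 0, 1],
                  [0, 0, 2, 0],
                  [0, 3, 0, 0],
                  [0, 0, 0, 4]], dtype=float)
    values  = np.array([5, 1, 2, 3, 4], dtype=float)
    col_idx = np.array([0, 3, 2, 1, 3])
    row_ptr = np.array([0, 2, 3, 4, 5])
    x = np.arange(1, 5, dtype=float)
    print(np.allclose(spmv_csr(values, col_idx, row_ptr, x), A @ x))   # True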

SESSION: 6C: Design and Learning of Logic Circuits and Systems

Learning Boolean Circuits from Examples for Approximate Logic Synthesis

  • Sina Boroumand
  • Christos-Savvas Bouganis
  • George A. Constantinides

Many computing applications are inherently error resilient. Thus, it is possible to
decrease computing accuracy to achieve greater efficiency in area, performance, and/or
energy consumption. In recent years, a slew of automatic techniques for approximate
computing has been proposed; however, most of these techniques require full knowledge
of an exact, or ‘golden’ circuit description. In contrast, there has been significant
recent interest in synthesizing computation from examples, a form of supervised learning.
In this paper, we explore the relationship between supervised learning of Boolean
circuits and existing work on synthesizing incompletely-specified functions. We show
that when considered through a machine learning lens, the latter work provides a good
training accuracy but poor test accuracy. We contrast this with prior work from the
1990s which uses mutual information to steer the search process, aiming for good generalization.
By combining this early work with a recent approach to learning logic functions, we
are able to achieve a scalable and efficient machine learning approach for Boolean
circuits in terms of area/delay/test-error trade-off.

Read your Circuit: Leveraging Word Embedding to Guide Logic Optimization

  • Walter Lau Neto
  • Matheus Trevisan Moreira
  • Luca Amaru
  • Cunxi Yu
  • Pierre-Emmanuel Gaillardon

To tackle the involved complexity, Electronic Design Automation (EDA) tools are broken into well-defined steps, each operating at a different abstraction level. Higher levels of abstraction shorten the flow run-time while sacrificing correlation with the physical circuit implementation. Bridging this gap between Logic Synthesis and Physical Design (PnR) tools is key to improving Quality of Results (QoR), while possibly shortening the time-to-market. To address this problem, in this work, we formalize logic paths
as sentences, with the gates being a bag of words. Thus, we show how word embedding
can be leveraged to represent generic paths and predict if a given path is likely
to be critical post-PnR. We demonstrate the effectiveness of our approach, with accuracy over 90% for our test cases. Finally, we go a step further and introduce an intelligent and non-intrusive flow that uses this information to guide optimization. Our flow achieves improvements of up to 15.53% in area-delay product (ADP) and 18.56% in power-delay product (PDP), compared to a standard flow.
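
A minimal sketch of the paths-as-sentences idea, using gensim's Word2Vec and a scikit-learn classifier as assumed dependencies (gate names, labels, and the downstream model are made up for illustration):

    # Sketch: treat logic paths as sentences of gate "words", learn embeddings, and feed
    # averaged path vectors to a classifier that predicts post-PnR criticality.
    # gensim and scikit-learn are assumed dependencies; the data here is made up.
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    paths = [                                   # each path: a sequence of gate types/names
        ["AND2", "INV", "DFF"],
        ["NAND3", "NAND3", "XOR2", "DFF"],
        ["INV", "BUF", "DFF"],
        ["MUX2", "NAND3", "XOR2", "XOR2", "DFF"],
    ]
    labels = [0, 1, 0, 1]                       # 1 = critical after place & route (made up)

    emb = Word2Vec(sentences=paths, vector_size=16, window=3, min_count=1, epochs=50, seed=0)

    def path_vector(path):
        return np.mean([emb.wv[g] for g in path], axis=0)   # average the gate embeddings

    X = np.stack([path_vector(p) for p in paths])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))                       # sanity check on the toy training set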

Exploiting HLS-Generated Multi-Version Kernels to Improve CPU-FPGA Cloud Systems

  • Bernardo Neuhaus Lignati
  • Michael Guilherme Jordan
  • Guilherme Korol
  • Mateus Beck Rutzig
  • Antonio Carlos Schneider Beck

Cloud Warehouses have been exploiting CPU-FPGA collaborative execution environments,
where multiple clients share the same infrastructure to maximize resource utilization with the highest possible energy efficiency and scalability. However, resource provisioning is challenging in these environments, since kernels may be dispatched to both CPU and FPGA concurrently in a highly variable scenario in terms of available resources and workload characteristics. In this work, we propose MultiVers,
a framework that leverages automatic HLS generation to enable further gains in such
CPU-FPGA collaborative systems. MultiVers exploits the automatic generation from HLS
to build libraries containing multiple versions of each incoming kernel request, greatly enlarging the design space available for optimization by the allocation strategies of the cloud provider. MultiVers makes kernel multiversioning and the allocation strategy work symbiotically, allowing fine-tuning in terms of resource
usage, performance, energy, or any combination of these parameters. We show the efficiency
of MultiVers by using real-world cloud request scenarios with a diversity of benchmarks,
achieving average improvements on makespan and energy of up to 4.62x and 19.04x, respectively,
over traditional allocation strategies executing non-optimized kernels.

SESSION: 6D: Hardware Locking and Obfuscation

Area Efficient Functional Locking through Coarse Grained Runtime Reconfigurable Architectures

  • Jianqi Chen
  • Benjamin Carrion Schafer

The protection of Intellectual Property (IP) has emerged as one of the most important
issues in the hardware design industry. Most VLSI design companies are now fabless
and need to protect their IP from being illegally distributed. One of the main approaches to address this has been logic locking. Logic locking prevents IPs from being reverse engineered and prevents overbuilding of the hardware circuit by untrusted foundries. One of the main problems with existing logic locking techniques is that the foundry has full access to the entire design, including the logic locking mechanism. Because of the importance of this topic, ever more robust locking mechanisms are continuously being proposed, and equally fast, new methods to break them appear. One alternative approach is to
lock a circuit through omission. The main idea is to selectively map a portion of
the IP onto an embedded FPGA (eFPGA). Because the foundry does not have access to
the bitstream, the circuit cannot be used until programmed by the legitimate user.
One of the main problems with this approach is the large overhead in terms of area
and power, as well as timing degradation. Area is especially a concern for price sensitive
applications. To address this, in this work we present a method to map portions of
a design onto a Coarse Grained Runtime Reconfigurable Architecture (CGRRA) such that
multiple parts of a design can be hidden onto the CGRRA, substantially amortizing
the area overhead introduced by the CGRRA.

ObfusX: Routing Obfuscation with Explanatory Analysis of a Machine Learning Attack

  • Wei Zeng
  • Azadeh Davoodi
  • Rasit Onur Topaloglu

This is the first work that incorporates recent advancements in “explainability” of
machine learning (ML) to build a routing obfuscator called ObfusX. We adopt a recent
metric—the SHAP value—which explains to what extent each layout feature can reveal
each unknown connection for a recent ML-based split manufacturing attack model. The
unique benefits of SHAP-based analysis include the ability to identify the best candidates
for obfuscation, together with the dominant layout features which make them vulnerable.
As a result, compared to prior work, ObfusX achieves a much better (97% lower) attack hit rate while perturbing significantly fewer nets when obfuscating using a via perturbation scheme.
When imposing the same wirelength limit using a wire lifting scheme, ObfusX performs
significantly better in performance metrics (e.g., 2.4 times more reduction on average
in percentage of netlist recovery).
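
A small sketch of the kind of SHAP analysis involved, using the shap package with a tree-based surrogate model (the features, data, and model below are made up; they are not the split-manufacturing attack model from the paper):

    # Sketch: use SHAP values to rank which nets an ML model can most easily recover
    # and which layout features drive that (shap and scikit-learn assumed installed;
    # features and labels are hypothetical).
    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    features = ["via_count", "wirelength", "layer", "pin_distance"]  # hypothetical layout features
    X = rng.random((200, len(features)))
    y = X[:, 0] + 0.5 * X[:, 3] + 0.05 * rng.standard_normal(200)    # toy "connection revealed" score

    surrogate = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    shap_vals = shap.TreeExplainer(surrogate).shap_values(X)         # shape: (samples, features)

    per_net = np.abs(shap_vals).sum(axis=1)        # which nets are most exposed -> obfuscate first
    per_feature = np.abs(shap_vals).mean(axis=0)   # which layout features leak the most
    print("most exposed nets:", np.argsort(per_net)[-5:])
    print(dict(zip(features, per_feature.round(3))))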

Breaking Analog Biasing Locking Techniques via Re-Synthesis

  • Julian Leonhard
  • Mohamed Elshamy
  • Marie-Minerve Louërat
  • Haralampos-G. Stratigopoulos

We demonstrate an attack to break all analog circuit locking techniques that act upon
the biasing of the circuit. The attack is based on re-synthesizing the biasing circuits
and requires only the use of an optimization algorithm. It is generally applicable
to any analog circuit class. For the attacker the method requires no in-depth understanding
or analysis of the circuit. The attack is demonstrated on a bias-locked Low-Dropout
(LDO) regulator. As the underlying optimization algorithm we employ a Genetic Algorithm
(GA).

SESSION: 6E: Efficient Solutions for Emerging Technologies

Energy and QoS-Aware Dynamic Reliability Management of IoT Edge Computing Systems

  • Kazim Ergun
  • Raid Ayoub
  • Pietro Mercati
  • Dancheng Liu
  • Tajana Rosing

Internet of Things (IoT) systems, like any electronic or mechanical system, are
prone to failures. Hard failures in hardware due to aging and degradation are particularly
important since they are irrecoverable, requiring maintenance for the replacement
of defective parts, at high costs. In this paper, we propose a novel dynamic reliability
management (DRM) technique for IoT edge computing systems to satisfy the Quality of
Service (QoS) and reliability requirements while maximizing the remaining energy of
the edge device batteries. We formulate a state-space optimal control problem with
a battery energy objective, QoS, and terminal reliability constraints. We decompose
the problem into low-overhead subproblems and solve it employing a hierarchical and
multi-timescale control approach, distributed over the edge devices and the gateway.
Our results, based on real measurements and trace-driven simulation, demonstrate that
the proposed scheme can achieve a similar battery lifetime compared to the state-of-the-art
approaches while satisfying reliability requirements, where other approaches fail
to do so.

Light: A Scalable and Efficient Wavelength-Routed Optical Networks-On-Chip Topology

  • Zhidan Zheng
  • Mengchu Li
  • Tsun-Ming Tseng
  • Ulf Schlichtmann

Wavelength-routed optical networks-on-chip (WRONoCs) are known for delivering collision-
and arbitration-free on-chip communication in many-core systems. While appealing for their low latency and high predictability, WRONoCs are challenged by scalability concerns for two reasons: (1) State-of-the-art WRONoC topologies use a large number of microring resonators (MRRs), which results in high MRR tuning power and crosstalk noise. (2) The
positions of master and slave nodes in current topologies do not match realistic layout
constraints. Thus, many additional waveguide crossings will be introduced during physical
implementation, which degrades the network performance. In this work, we propose an
N x (N – 1) WRONoC topology: Light with a 4 x 3 router Hash as the basic building
block, and a simple but efficient approach to configure the resonant wavelength for
each MRR. Experimental results show that Light outperforms state-of-the-art topologies
in terms of enhancing signal-to-noise ratio (SNR) and reducing insertion loss, especially
for large-scale networks. Furthermore, Light can be easily implemented onto a physical
plane without causing external waveguide crossings.

One-pass Synthesis for Field-coupled Nanocomputing Technologies

  • Marcel Walter
  • Winston Haaswijk
  • Robert Wille
  • Frank Sill Torres
  • Rolf Drechsler

Field-coupled Nanocomputing (FCN) is a class of post-CMOS emerging technologies, which
promises to overcome certain physical limitations of conventional solutions such as
CMOS by allowing for high computational throughput with low power dissipation. Despite
their promises, the design of corresponding FCN circuits is still in its infancy.
In fact, state-of-the-art solutions still heavily rely on conventional synthesis approaches
that do not take the tight physical constraints of FCN circuits (particularly with
respect to routability and clocking) into account. Instead, physical design is conducted
in a second step in which a classical logic network is mapped onto an FCN layout.
Using this two-stage approach with a classical and FCN-oblivious logic network as
an intermediate result, frequently leads to substantial quality loss or completely
impractical results. In this work, we propose a one-pass synthesis scheme for FCN
circuits, which conducts both steps, synthesis and physical design, in a single run.
For the first time, this allows generating exact, i.e., minimal, FCN circuits for a given functionality.

SESSION: 7A: Platform-Specific Neural Network Acceleration

Real-Time Mobile Acceleration of DNNs: From Computer Vision to Medical Applications

  • Hongjia Li
  • Geng Yuan
  • Wei Niu
  • Yuxuan Cai
  • Mengshu Sun
  • Zhengang Li
  • Bin Ren
  • Xue Lin
  • Yanzhi Wang

With the growth of mobile vision applications, there is a growing need to break through
the current performance limitation of mobile platforms, especially for computationally
intensive applications, such as object detection, action recognition, and medical
diagnosis. To achieve this goal, we present our unified real-time mobile DNN inference
acceleration framework, seamlessly integrating hardware-friendly, structured model
compression with mobile-targeted compiler optimizations. We aim at unprecedented, real-time performance for such large-scale neural network inference on mobile devices.
A fine-grained block-based pruning scheme is proposed to be universally applicable
to all types of DNN layers, such as convolutional layers with different kernel sizes
and fully connected layers. Moreover, it is also successfully extended to 3D convolutions.
With the assistance of our compiler optimizations, the fine-grained block-based sparsity
is fully utilized to achieve high model accuracy and high hardware acceleration simultaneously.
To validate our framework, three representative fields of applications are implemented
and demonstrated, object detection, activity detection, and medical diagnosis. All
applications achieve real-time inference using an off-the-shelf smartphone, outperforming
the representative mobile DNN inference acceleration frameworks by up to 6.7x in speed.
The demonstrations of these applications can be found in the following link: https://bit.ly/39lWpYu.

Dynamic Neural Network to Enable Run-Time Trade-off between Accuracy and Latency

  • Li Yang
  • Deliang Fan

To deploy powerful deep neural network (DNN) into smart, but resource limited IoT
devices, many prior works have been proposed to compress DNN to reduce the network
size and computation complexity with negligible accuracy degradation, such as weight
quantization, network pruning, convolution decomposition, etc. However, with conventional DNN compression methods, a smaller but fixed network is generated from a relatively large background model to achieve hardware acceleration on resource-limited devices. Such optimization lacks the ability to adjust the network structure in real time to adapt to dynamic computing hardware resource allocation and workloads. In this
paper, we mainly review our two prior works [13, 15] to tackle this challenge, discussing
how to construct a dynamic DNN by means of either uniform or non-uniform sub-nets
generation methods. Moreover, to generate multiple nonuniform sub-nets, [15] needs
to fully retrain the background model for each sub-net individually, named as multi-path
method. To reduce the training cost, in this work, we further propose a single-path
sub-nets generation method that can sample multiple sub-nets in different epochs within
one training round. The constructed dynamic DNN, consisting of multiple sub-nets,
provides the ability to trade off inference accuracy and latency at run time according to hardware resources and environment requirements. In the end, we study the dynamic DNNs with different sub-net generation methods on both the CIFAR-10 and ImageNet datasets.
We also present the run-time tuning of accuracy and latency on both GPU and CPU.

When Machine Learning Meets Quantum Computers: A Case Study

  • Weiwen Jiang
  • Jinjun Xiong
  • Yiyu Shi

Along with the development of AI democratization, the machine learning approach, in
particular neural networks, has been applied to wide-range applications. In different
application scenarios, the neural network will be accelerated on the tailored computing
platform. The acceleration of neural networks on classical computing platforms, such
as CPUs, GPUs, FPGAs, and ASICs, has been widely studied; however, as the scale of applications consistently grows, the memory bottleneck becomes obvious, widely known as the memory wall. In response to this challenge, advanced quantum computing, which can represent 2^N states with N quantum bits (qubits), is regarded as a promising solution. It is therefore imperative to know how to design quantum circuits for accelerating neural networks. Most recently,
there are initial works studying how to map neural networks to actual quantum processors.
To better understand the state-of-the-art design and inspire new design methodology,
this paper carries out a case study to demonstrate an end-to-end implementation. On
the neural network side, we employ the multilayer perceptron to complete image classification
tasks using the standard and widely used MNIST dataset. On the quantum computing side,
we target IBM Quantum processors, which can be programmed and simulated by using IBM
Qiskit. This work targets the acceleration of the inference phase of a trained neural
network on the quantum processor. Along with the case study, we will demonstrate the
typical procedure for mapping neural networks to quantum circuits.
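
As a concrete starting point, a tiny parameterized circuit built with Qiskit (assumed installed) shows the kind of building block such a mapping manipulates; this is a generic illustration rather than the paper's MLP-to-circuit mapping:

    # Minimal Qiskit example of a small parameterized quantum circuit of the kind used
    # when mapping neural-network operations onto qubits (generic illustration only).
    from qiskit import QuantumCircuit
    from qiskit.circuit import Parameter

    theta = Parameter("theta")
    qc = QuantumCircuit(2, 2)
    qc.h(0)                    # put the "input" qubit into superposition
    qc.ry(theta, 0)            # a trainable rotation, playing the role of a weight
    qc.cx(0, 1)                # entangle, mixing information across qubits
    qc.measure([0, 1], [0, 1])

    bound = qc.assign_parameters({theta: 0.3})
    print(bound.draw())        # simulate with Aer or run on an IBM Quantum backend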

Improving Efficiency in Neural Network Accelerator using Operands Hamming Distance
Optimization

  • Meng Li
  • Yilei Li
  • Vikas Chandra

Neural network accelerator is a key enabler for the on-device AI inference, for which
energy efficiency is an important metric. The datapath energy, including the computation
energy and the data movement energy among the arithmetic units, claims a significant
part of the total accelerator energy. By revisiting the basic physics of the arithmetic
logic circuits, we show that the datapath energy is highly correlated with the bit
flips when streaming the input operands into the arithmetic units, defined as the
hamming distance (HD) of the input operand matrices. Based on the insight, we propose
a post-training optimization algorithm and a HD-aware training algorithm to co-design
and co-optimize the accelerator and the network synergistically. The experimental
results based on post-layout simulation with MobileNetV2 demonstrate on average 2.85x
datapath energy reduction and up to 8.51x datapath energy reduction for certain layers.
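
A small numpy sketch shows the metric itself and a greedy operand reordering that reduces it (the reordering heuristic is purely illustrative; the paper co-optimizes the accelerator and the network instead):

    # Sketch: count bit flips (Hamming distance) between consecutive operands streamed
    # into an arithmetic unit, and greedily reorder the stream to reduce them.
    import numpy as np

    def stream_bitflips(vals, bits=8):
        flips = 0
        for a, b in zip(vals, vals[1:]):
            flips += bin((int(a) ^ int(b)) & ((1 << bits) - 1)).count("1")
        return flips

    def greedy_reorder(vals):
        remaining = list(vals)
        order = [remaining.pop(0)]
        while remaining:                       # always pick the nearest operand in HD
            last = order[-1]
            nxt = min(remaining, key=lambda v: bin(int(v) ^ int(last)).count("1"))
            remaining.remove(nxt)
            order.append(nxt)
        return order

    rng = np.random.default_rng(0)
    ops = rng.integers(0, 256, size=64)
    print("original bit flips:", stream_bitflips(ops))
    print("reordered bit flips:", stream_bitflips(greedy_reorder(ops)))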

Lightweight Run-Time Working Memory Compression for Deployment of Deep Neural Networks
on Resource-Constrained MCUs

  • Zhepeng Wang
  • Yawen Wu
  • Zhenge Jia
  • Yiyu Shi
  • Jingtong Hu

This work aims to achieve intelligence on embedded devices by deploying deep neural
networks (DNNs) onto resource-constrained microcontroller units (MCUs). Apart from
the low frequency (e.g., 1-16 MHz) and limited storage (e.g., 16KB to 256KB ROM),
one of the largest challenges is the limited RAM (e.g., 2KB to 64KB), which is needed
to save the intermediate feature maps of a DNN. Most existing neural network compression
algorithms aim to reduce the model size of DNNs so that they can fit into limited
storage. However, they do not reduce the size of intermediate feature maps significantly,
which is referred to as working memory and might exceed the capacity of RAM. Therefore,
it is possible that DNNs cannot run in MCUs even after compression. To address this
problem, this work proposes a technique to dynamically prune the activation values
of the intermediate output feature maps at run time to ensure that they can fit
into limited RAM. The results of our experiments show that this method could significantly
reduce the working memory of DNNs to satisfy the hard constraint of RAM size, while
maintaining satisfactory accuracy with relatively low overhead on memory and run-time
latency.
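
A toy numpy sketch of the general idea, keeping only the k largest-magnitude activations so a feature map fits a fixed budget (the keep rule is illustrative, not the paper's pruning policy), is shown below:

    # Sketch: prune an intermediate feature map at run time so that only the k largest
    # activations (by magnitude) are kept, fitting a fixed working-memory budget.
    import numpy as np

    def prune_to_budget(fmap, budget_bytes, bytes_per_value=1):
        k = budget_bytes // bytes_per_value            # how many activations we can keep
        flat = fmap.ravel()
        if k >= flat.size:
            return fmap
        keep_idx = np.argpartition(np.abs(flat), -k)[-k:]
        pruned = np.zeros_like(flat)
        pruned[keep_idx] = flat[keep_idx]              # keep only the largest-magnitude values
        return pruned.reshape(fmap.shape)

    fmap = np.random.randn(8, 16, 16).astype(np.float32)   # C x H x W intermediate output
    out = prune_to_budget(fmap, budget_bytes=512, bytes_per_value=1)
    print("kept activations:", np.count_nonzero(out), "of", fmap.size)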

SESSION: 7B: Toward Energy-Efficient Embedded Systems

EHDSktch: A Generic Low Power Architecture for Sketching in Energy Harvesting Devices

  • Priyanka Singla
  • Chandran Goodchild
  • Smruti R. Sarangi

Energy harvesting devices (EHDs) are becoming extremely prevalent in remote and hazardous
environments. They sense the ambient parameters and compute some statistics on them,
which are then sent to a remote server. Due to the resource-constrained nature of
EHDs, it is challenging to perform exact computations on streaming data; however,
if we are willing to tolerate a slight amount of inaccuracy, we can leverage the power
of sketching algorithms to provide quick answers with significantly lower energy consumption.

In this paper, we propose a novel hardware architecture called EHDSktch — a set of
IP blocks that can be used to implement most of the popular sketching algorithms.
We demonstrate an energy savings of 4-10X and a speedup of more than 10X over state-of-the-art
software implementations. Leveraging the temporal locality further provides us a performance
gain of 3-20% in energy and time and reduces the on-chip memory requirement by at
least 50-75%.
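
As an example of the class of algorithms such IP blocks target, a compact Count-Min sketch in Python (sizes and hashing are illustrative) is shown below:

    # A compact Count-Min sketch, one of the popular streaming-sketch algorithms that
    # hardware blocks like these typically implement (sizes and hashing are illustrative).
    import hashlib

    class CountMinSketch:
        def __init__(self, width=256, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _hash(self, item, row):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            return int(h, 16) % self.width

        def add(self, item, count=1):
            for row in range(self.depth):
                self.table[row][self._hash(item, row)] += count

        def estimate(self, item):
            # the true count is over-estimated, never under-estimated
            return min(self.table[row][self._hash(item, row)] for row in range(self.depth))

    cms = CountMinSketch()
    for reading in ["23C", "24C", "23C", "23C", "25C"]:
        cms.add(reading)
    print(cms.estimate("23C"))   # >= 3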

Energy-Aware Design Methodology for Myocardial Infarction Detection on Low-Power Wearable
Devices

  • Mohanad Odema
  • Nafiul Rashid
  • Mohammad Abdullah Al Faruque

Myocardial Infarction (MI) is a heart disease that damages the heart muscle and requires
immediate treatment. Its silent and recurrent nature necessitates real-time continuous
monitoring of patients. Nowadays, wearable devices are smart enough to perform on-device
processing of heartbeat segments and report any irregularities in them. However, the
small form factor of wearable devices imposes resource constraints and requires energy-efficient
solutions to satisfy them. In this paper, we propose a design methodology to automate
the design space exploration of neural network architectures for MI detection. This
methodology incorporates Neural Architecture Search (NAS) using Multi-Objective Bayesian
Optimization (MOBO) to render Pareto optimal architectural models. These models minimize
both detection error and energy consumption on the target device. The design space
is inspired by Binary Convolutional Neural Networks (BCNNs) suited for mobile health
applications with limited resources. The models’ performance is validated using the
PTB diagnostic ECG database from PhysioNet. Moreover, energy-related measurements
are directly obtained from the target device in a typical hardware-in-the-loop fashion.
Finally, we benchmark our models against other related works. One model exceeds state-of-the-art
accuracy on wearable devices (reaching 91.22%), whereas others trade off some accuracy
to reduce their energy consumption (by a factor reaching 8.26x).

Power-Efficient Layer Mapping for CNNs on Integrated CPU and GPU Platforms: A Case Study

  • Tian Wang
  • Kun Cao
  • Junlong Zhou
  • Gongxuan Zhang
  • Xiji Wang

Heterogeneous MPSoCs consisting of integrated CPUs and GPUs are suitable platforms
for embedded applications running on handheld devices such as smart phones. As the
handheld devices are mostly powered by battery, the integrated CPU and GPU MPSoC is
usually designed with an emphasis on low power rather than performance. In this paper,
we are interested in exploring a power-efficient layer mapping of convolution neural
networks (CNNs) deployed on integrated CPU and GPU platforms. Specifically, we investigate
the impact of layer mapping of YoloV3-Tiny (i.e., a widely-used CNN in both industry
and academia) on system power consumption through numerous experiments on NVIDIA board
Jetson TX2. The experimental results indicate that 1) almost all of the convolution
layers are not suitable for mapping to CPU, 2) the pooling layer can be mapped to
CPU for reducing power consumption, but the mapping may lead to a decrease in inference
speed when the layer’s output tensor size is large, 3) the detection layer can be
mapped to CPU as long as its floating-point operation scale is not too large, and
4) the channel and upsampling layers are both suitable for mapping to CPU. These observations
obtained in this study can be further utilized to guide the design of power-efficient
layer mapping strategies for integrated CPU and GPU platforms.

A Write-friendly Arithmetic Coding Scheme for Achieving Energy-Efficient Non-Volatile
Memory Systems

  • Yi-Shen Chen
  • Chun-Feng Wu
  • Yuan-Hao Chang
  • Tei-Wei Kuo

In the era of the Internet of Things (IoT), wearable IoT devices have become popular and closely related to our lives. Most of these devices are based on embedded systems that have to operate on limited energy resources, such as batteries or energy harvesters. Therefore, energy efficiency is one of the critical issues for these devices. To reduce energy consumption by cutting the total number of accesses to the memory and storage layers, storage-class memory (SCM) and data compression techniques are applied to eliminate data movements and squeeze the data size, respectively. However, the information gap between them hinders cooperation between the two techniques in further minimizing energy consumption. This work proposes a write-friendly arithmetic coding scheme that jointly manages both techniques to achieve energy-efficient non-volatile memory (NVM) systems. In particular, the concept of
“ignorable bits” is introduced to further skip the write operations while storing
the compressed data into SCM devices. The proposed design was evaluated by a series
of intensive experiments, and the results are encouraging.
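
A toy sketch of the underlying write-reduction idea, comparing each new compressed word with the word already stored and programming only the differing bit cells (the paper's "ignorable bits" go further by exploiting slack in the coded representation itself):

    # Toy sketch: before writing a compressed word to NVM, compare it with the word
    # already stored and program only the bits that actually change.
    def write_word(stored, new, bits=16):
        diff = (stored ^ new) & ((1 << bits) - 1)
        bit_writes = bin(diff).count("1")          # only differing cells are programmed
        return new, bit_writes

    memory = [0x0000, 0xFFFF, 0x00FF, 0xF0F0]      # current NVM contents
    updates = [0x0001, 0xFFFF, 0x0FFF, 0xF0F1]     # newly compressed data to store

    total = 0
    for addr, new in enumerate(updates):
        memory[addr], writes = write_word(memory[addr], new)
        total += writes
    print("bit-cell writes performed:", total, "instead of", 16 * len(updates))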

SESSION: 7C: Software and System Support for Nonvolatile Memory

DP-Sim: A Full-stack Simulation Infrastructure for Digital Processing In-Memory Architectures

  • Minxuan Zhou
  • Mohsen Imani
  • Yeseong Kim
  • Saransh Gupta
  • Tajana Rosing

Digital processing in-memory (DPIM) is a promising technology that significantly reduces
data movements while providing high parallelism. In this work, we design and implement
the first full-stack DPIM simulation infrastructure, DP-Sim, which evaluates a comprehensive
range of DPIM-specific design space concerning both software and hardware. DP-Sim
provides a C++ library to enable DPIM acceleration in general programs while supporting
several aspects of software-level exploration by a convenient interface. The DP-Sim
software front-end generates specialized instructions that can be processed by a hardware
simulator based on a new DPIM-enabled architecture model which is 10.3% faster than
conventional memory simulation models. We use DP-Sim to explore the DPIM-specific
design space of acceleration for various emerging applications. Our experiments show
that bank-level control is 11.3x faster than conventional channel-level control because
of higher computing parallelism. Furthermore, cost-aware memory allocation can provide
at least 2.2x speedup vs. heuristic methods, showing the importance of data layout
in DPIM acceleration.

SAC: A Stream Aware Write Cache Scheme for Multi-Streamed Solid State Drives

  • Bo Zhou
  • Chuanming Ding
  • Yina Lv
  • Chun Jason Xue
  • Qingfeng Zhuge
  • Edwin H.-M. Sha
  • Liang Shi

This work finds that state-of-the-art multi-streamed SSDs are used inefficiently due to two issues. First, the write cache inside SSDs is not aware of data from different streams, which induces conflicts among streams. Second, current stream identification methods are not accurate and should be optimized inside SSDs. This work proposes a novel write cache scheme to efficiently utilize and optimize the multiple streams. First, an inter-stream-aware cache partitioning scheme is proposed to manage the data from different streams. Second, an intra-stream-based active cache eviction scheme is proposed that preferentially evicts data to blocks with more invalid pages. Experimental results show that the proposed scheme significantly reduces the write amplification factor (WAF) of multi-streamed SSDs by up to 28% with negligible cost.

Providing Plug N’ Play for Processing-in-Memory Accelerators

  • Paulo C. Santos
  • Bruno E. Forlin
  • Luigi Carro

Although Processing-in-Memory (PIM) emerged as a solution to avoid unnecessary and
expensive data movements to/from hosts and accelerators, its widespread usage is still difficult, given that, to effectively use a PIM device, huge and costly modifications must be made on the host processor side to allow instruction offloading, cache coherence, virtual memory management, and communication between different PIM instances. The
present work addresses these challenges by presenting non-invasive solutions for those
requirements. We demonstrate that, at compile-time, and without any host modifications
or programmer intervention, it is possible to exploit already available resources
to allow efficient host-PIM communication and task partitioning, without disturbing either the host or the memory hierarchy. We present Plug&PIM, a plug n' play strategy for
PIM adoption with minimal performance penalties.

Aging-Aware Request Scheduling for Non-Volatile Main Memory

  • Shihao Song
  • Anup Das
  • Onur Mutlu
  • Nagarajan Kandasamy

Modern computing systems are embracing non-volatile memory (NVM) to implement high-capacity
and low-cost main memory. Elevated operating voltages of NVM accelerate the aging
of CMOS transistors in the peripheral circuitry of each memory bank. Aggressive device
scaling increases power density and temperature, which further accelerates aging,
challenging the reliable operation of NVM-based main memory. We propose HEBE, an architectural
technique to mitigate the circuit aging-related problems of NVM-based main memory.
HEBE is built on three contributions. First, we propose a new analytical model that
can dynamically track the aging in the peripheral circuitry of each memory bank based
on the bank’s utilization. Second, we develop an intelligent memory request scheduler
that exploits this aging model at run time to de-stress the peripheral circuitry of
a memory bank only when its aging exceeds a critical threshold. Third, we introduce
an isolation transistor to decouple parts of a peripheral circuit operating at different
voltages, allowing the decoupled logic blocks to undergo long-latency de-stress operations
independently and off the critical path of memory read and write accesses, improving
performance. We evaluate HEBE with workloads from the SPEC CPU2017 Benchmark suite.
Our results show that HEBE significantly improves both performance and lifetime of
NVM-based main memory.

SESSION: 7D: Learning-Driven VLSI Layout Automation Techniques

Placement for Wafer-Scale Deep Learning Accelerator

  • Benzheng Li
  • Qi Du
  • Dingcheng Liu
  • Jingchong Zhang
  • Gengjie Chen
  • Hailong You

To meet the growing demand from deep learning applications for computing resources,
ASIC-based accelerators are necessary. A wafer-scale engine (WSE) was recently proposed [1], which is able to simultaneously accelerate multiple layers from a neural network (NN). However, without a high-quality placement that properly maps NN layers onto the WSE, the acceleration efficiency cannot be achieved. Here, the WSE placement resembles the traditional ASIC floorplanning problem of placing blocks onto a chip region, but they are fundamentally different. Since the slowest layer determines the compute time of the whole NN on the WSE, a layer with a heavier workload needs more computing resources. Besides, the locations of layers and the protocol adapter cost of internal I/O connections will influence inter-layer communication overhead. In this paper, we propose GigaPlacer
to handle this new challenge. A binary-search-based framework is developed to obtain
a minimum compute time of the NN. Two dynamic-programming-based algorithms with different
optimizing strategies are integrated to produce legal placement. The distance and
adapter cost between connected layers will be further minimized by some refinements.
Compared with the first place of the ISPD2020 Contest, GigaPlacer reduces the contest
metric by up to 6.89% and on average 2.09%, while running 7.23x faster.

Net2: A Graph Attention Network Method Customized for Pre-Placement Net Length Estimation

  • Zhiyao Xie
  • Rongjian Liang
  • Xiaoqing Xu
  • Jiang Hu
  • Yixiao Duan
  • Yiran Chen

Net length is a key proxy metric for optimizing timing and power across various stages
of a standard digital design flow. However, the bulk of net length information is
not available until cell placement, and hence it is a significant challenge to explicitly
consider net length optimization in design stages prior to placement, such as logic
synthesis. This work addresses this challenge by proposing a graph attention network
method with customization, called Net2, to estimate individual net length before cell
placement. Its accuracy-oriented version Net2a achieves about 15% better accuracy
than several previous works in identifying both long nets and long critical paths.
Its fast version Net2f is more than 1000x faster than placement while still outperforming
previous works and other neural network techniques in terms of various accuracy metrics.

Machine Learning-based Structural Pre-route Insertability Prediction and Improvement
with Guided Backpropagation

  • Tao-Chun Yu
  • Shao-Yun Fang
  • Hsien-Shih Chiu
  • Kai-Shun Hu
  • Chin-Hsiung Hsu
  • Philip Hui-Yuh Tai
  • Cindy Chin-Fang Shen

As semiconductor technology nodes advance, standard cells become smaller and their
count increases dramatically to bring more functionality into integrated circuits (ICs).
However, the shrinking of standard cell sizes causes many IC problems related to timing,
power, and electromigration (EM). To tackle these problems, a new style of structural
pre-route (SPR) has been proposed. This type of pre-route is composed of redundant parallel
metals and vias, so that the low resistance and the redundant sub-structures can improve
timing and yield. However, the large area overhead becomes the major obstacle to inserting
such pre-routes throughout a design. In this paper, we propose a machine learning-based
approach to predict the insertability of SPRs for placed designs. In addition, we apply
a pattern visualization method using a guided backpropagation technique to look inside
our model and identify the problematic layout features that cause SPR insertion failures.
The experimental results not only show the excellent performance of our model, but also
show that avoiding the identified critical features during legalization can improve SPR
insertability compared to a commercial SPR-aware placement tool.
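
Guided backpropagation itself is a standard visualization technique; a minimal PyTorch sketch is shown below, assuming a trained CNN classifier is available. The hook clamps negative gradients at every ReLU so that only layout pixels which positively support the predicted insertion-failure score are highlighted; the tiny model here is a placeholder, not the paper's network.

```python
import torch
import torch.nn as nn

def guided_backprop_saliency(model, layout_image):
    """Guided backpropagation: ReLU backward passes only positive gradients,
    highlighting layout pixels that increase the predicted failure score.
    `model` is assumed to be a trained CNN taking a CxHxW layout clip."""
    hooks = []

    def relu_hook(module, grad_input, grad_output):
        # keep only positive gradients flowing back through the ReLU
        return (torch.clamp(grad_input[0], min=0.0),)

    for m in model.modules():
        if isinstance(m, nn.ReLU):
            hooks.append(m.register_full_backward_hook(relu_hook))

    x = layout_image.clone().unsqueeze(0).requires_grad_(True)  # [1, C, H, W]
    score = model(x)[0].max()        # score of the predicted class (e.g. "insertion fails")
    model.zero_grad()
    score.backward()

    for h in hooks:
        h.remove()
    return x.grad[0]                 # per-pixel sensitivity map

# usage with a placeholder model and clip size:
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
saliency = guided_backprop_saliency(model, torch.rand(1, 64, 64))
print(saliency.shape)
```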

Standard Cell Routing with Reinforcement Learning and Genetic Algorithm in Advanced
Technology Nodes

  • Haoxing Ren
  • Matthew Fojtik

Standard cell layout in advanced technology nodes is done manually in the industry
today. Automating the standard cell layout process, in particular the routing step, is
challenging because of the enormous number of design rule constraints. In this paper we
propose a machine learning-based approach that applies a genetic algorithm to create
initial routing candidates and uses reinforcement learning (RL) to fix the design
rule violations incrementally. A design rule checker feeds the violations back to the
RL agent, and the agent learns how to fix them based on the data. This approach is
also applicable to future technology nodes with unseen design rules. We demonstrate
the effectiveness of this approach on a number of standard cells. We show that
it can route a cell which was deemed unroutable by manual layout, reducing the cell size
by 11%.

SESSION: 7E: DNN-Based Physical Analysis and DNN Accelerator Design

Thermal and IR Drop Analysis Using Convolutional Encoder-Decoder Networks

  • Vidya A. Chhabria
  • Vipul Ahuja
  • Ashwath Prabhu
  • Nikhil Patil
  • Palkesh Jain
  • Sachin S. Sapatnekar

Computationally expensive temperature and power grid analyses are required during
the design cycle to guide IC design. This paper employs encoder-decoder based generative
(EDGe) networks to map these analyses to fast and accurate image-to-image and sequence-to-sequence
translation tasks. The network takes a power map as input and outputs the temperature
or IR drop map. We propose two networks: (i) ThermEDGe, a static and dynamic full-chip
temperature estimator, and (ii) IREDGe, a full-chip static IR drop predictor based
on input power, power grid distribution, and power pad distribution patterns. The
models are design-independent and must be trained just once for a particular technology
and packaging solution. ThermEDGe and IREDGe are demonstrated to rapidly predict on-chip
temperature and IR drop contours in milliseconds (in contrast with commercial tools
that require several hours or more) and provide an average error of 0.6% and 0.008%,
respectively.
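
An EDGe network is, at its core, a convolutional encoder-decoder that translates a power-map image into a temperature or IR-drop map. The sketch below shows the shape of such a model in PyTorch with illustrative layer sizes; it is not the architecture or training setup used by ThermEDGe/IREDGe.

```python
import torch
import torch.nn as nn

class TinyEDGe(nn.Module):
    """Illustrative encoder-decoder: a power map in, a temperature / IR-drop map out.
    Channel counts and depth are placeholders, not the architecture from the paper."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # H/2 x W/2
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # H/4 x W/4
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),     # back to H x W
        )

    def forward(self, power_map):
        return self.decoder(self.encoder(power_map))

model = TinyEDGe()
power_map = torch.rand(1, 1, 64, 64)     # normalized full-chip power map
thermal_map = model(power_map)           # would approximate the temperature contour after training
print(thermal_map.shape)                 # torch.Size([1, 1, 64, 64])
```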

GRA-LPO: Graph Convolution Based Leakage Power Optimization

  • Uday Mallappa
  • Chung-Kuan Cheng

Static power consumption is a critical challenge for IC designs, particularly for
mobile and IoT applications. A final post-layout step in modern design flows involves
a leakage recovery step that is embedded in signoff static timing analysis tools.
The goal of such recovery is to make use of the positive slack (if any) and recover
the leakage power by performing cell swaps with footprint compatible variants. Though
such swaps result in unaltered routing, the hard constraint is not to introduce any
new timing violations. This process can require up to tens of hours of runtime, just
before the tapeout, when schedule and resource constraints are tightest. The physical
design teams can benefit greatly from a fast predictor of the leakage recovery step:
if the eventual recovery will be too small, the entire step can be skipped, and the
resources can be allocated elsewhere. If we represent the circuit netlist as a graph
with cells as vertices and nets connecting these cells as edges, the leakage recovery
step is an optimization step on this graph. If we can learn these optimizations over
several graphs with various logic-cone structures, we can generalize the learning
to unseen graphs. Using graph convolution neural networks, we develop a learning-based
model that predicts per-cell recoverable slack and translates these slack values
into equivalent power savings. For designs with up to 1.6M instances, our inference step
takes less than 12 seconds on a Tesla P100 GPU, with additional feature extraction and
post-processing steps consuming 420 seconds. The model is accurate, with relative error
under 6.2% in the design-specific context.
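
A single graph-convolution layer over the netlist graph can be sketched in NumPy: adjacency normalization, neighborhood aggregation, and a linear transform produce per-cell embeddings, which a final regression head maps to a recoverable-slack estimate. The weights below are random stand-ins for trained parameters.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution layer: symmetric normalization of the adjacency
    (with self-loops), neighborhood aggregation, then a linear transform + ReLU."""
    a_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)

# toy netlist graph: 4 cells, edges where cells share a net
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
feats = np.random.rand(4, 3)        # per-cell features (cell type, slack, load, ...)
w1 = np.random.rand(3, 8)           # stand-ins for trained weights
w_out = np.random.rand(8, 1)

hidden = gcn_layer(adj, feats, w1)
recoverable_slack = hidden @ w_out  # per-cell regression output
print(recoverable_slack.ravel())
```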

DEF: Differential Encoding of Featuremaps for Low Power Convolutional Neural Network Accelerators

  • Alexander Montgomerie-Corcoran
  • Christos-Savvas Bouganis

As the deployment of Deep Learning applications on edge-based devices
becomes ever more prominent, power consumption starts to become a limiting
factor on the performance that the computational platforms can achieve. A significant
source of power consumption for these edge-based machine learning accelerators is
off-chip memory transactions. In the case of Convolutional Neural Network (CNN) workloads,
a predominant workload in deep learning applications, those memory transactions are
typically attributed to the store and recall of feature-maps. There is therefore a
need to explicitly reduce the power dissipation of these transactions whilst minimising
any overheads needed to do so. In this work, a Differential Encoding of Feature-maps
(DEF) scheme is proposed, which aims at minimising activity on the memory data bus,
specifically for CNN workloads. The coding scheme uses domain-specific knowledge,
exploiting statistics of feature-maps alongside knowledge of the data types commonly
used in machine learning accelerators as a means of reducing power consumption. DEF
is able to out-perform recent state-of-the-art coding schemes, with significantly
less overhead, achieving up to 50% reduction of activity across a number of modern
CNNs.
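
The mechanism can be illustrated with a toy stream: switching activity is the Hamming distance between consecutive words on the bus, and sending (zig-zag encoded) differences of correlated values instead of the raw values tends to concentrate activity in a few low-order bits. The sketch below only demonstrates this generic effect on made-up data; DEF's actual coding additionally exploits feature-map statistics and accelerator data types.

```python
import math

def bus_toggles(words):
    """Bit toggles on the memory data bus when `words` are sent back-to-back
    (Hamming distance between consecutive binary words)."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def zigzag(d):
    """Map a signed difference to a small non-negative code: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return 2 * d if d >= 0 else -2 * d - 1

# Toy "feature-map" stream: a smooth signal hovering around a power-of-two boundary,
# so adjacent values are numerically close but their raw binary words differ in many bits.
fmap = [256 + round(3 * math.sin(i / 5.0)) for i in range(4096)]

raw_stream = fmap
diff_stream = [fmap[0]] + [zigzag(b - a) for a, b in zip(fmap, fmap[1:])]

print("raw stream:         ", bus_toggles(raw_stream), "bit toggles")
print("differential stream:", bus_toggles(diff_stream), "bit toggles")
```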

Temperature-Aware Optimization of Monolithic 3D Deep Neural Network Accelerators

  • Prachi Shukla
  • Sean S. Nemtzow
  • Vasilis F. Pavlidis
  • Emre Salman
  • Ayse K. Coskun

We propose an automated method to facilitate the design of energy-efficient Mono3D
DNN accelerators with safe on-chip temperatures for mobile systems. We introduce an
optimizer to investigate the effect of different aspect ratios and footprint specifications
of the chip, and select energy-efficient accelerators under user-specified thermal
and performance constraints. We also demonstrate that using our optimizer, we can
reduce energy consumption by 1.6x and area by 2x with a maximum of 9.5% increase in
latency compared to a Mono3D DNN accelerator optimized only for performance.

SESSION: 8B: Embedded Neural Networks and File Systems

Gravity: An Artificial Neural Network Compiler for Embedded Applications

  • Tony Givargis

This paper introduces the Gravity compiler. Gravity is an open source optimizing Artificial
Neural Network (ANN) to ANSI C compiler with two unique design features that make
it ideal for use in resource constrained embedded systems: (1) the generated ANSI
C code is self-contained and void of any library or platform dependencies and (2)
the generated ANSI C code is optimized for maximum performance and minimum memory
usage. Moreover, Gravity is constructed as a modern compiler consisting of an intuitive
input language, an expressive Intermediate Representation (IR), a mapping to a Fictitious
Instruction Set Machine (FISM) and a retargetable backend, making it an ideal research
tool for exploring high-performance embedded software strategies in AI and Deep-Learning
applications. We validate the efficacy of Gravity by solving the MNIST handwritten
digit recognition task on an embedded device. We measured a 300x reduction in memory, a 2.5x
speedup in inference, and a 33% speedup in training compared to TensorFlow. We also outperformed
TVM by over 2.4x in inference speed.
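
The flavor of the generated code can be illustrated with a toy emitter: a trained dense layer is lowered to self-contained ANSI C with weights baked into static arrays and no library dependencies. This is an illustration of the idea only, not Gravity's IR/FISM-based code generator; all names are hypothetical.

```python
def emit_dense_layer_c(weights, biases, name="layer0"):
    """Emit self-contained ANSI C (no library dependencies) for one dense layer
    with a ReLU activation, weights baked in as static const arrays."""
    n_out, n_in = len(weights), len(weights[0])
    w_flat = ", ".join(f"{w:.8f}f" for row in weights for w in row)
    b_flat = ", ".join(f"{b:.8f}f" for b in biases)
    return f"""
static const float {name}_w[{n_out * n_in}] = {{ {w_flat} }};
static const float {name}_b[{n_out}] = {{ {b_flat} }};

void {name}_forward(const float in[{n_in}], float out[{n_out}]) {{
    int i, j;
    for (i = 0; i < {n_out}; ++i) {{
        float acc = {name}_b[i];
        for (j = 0; j < {n_in}; ++j)
            acc += {name}_w[i * {n_in} + j] * in[j];
        out[i] = acc > 0.0f ? acc : 0.0f; /* ReLU */
    }}
}}
"""

print(emit_dense_layer_c([[0.5, -0.25], [1.0, 0.75]], [0.1, -0.2]))
```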

A Self-Test Framework for Detecting Fault-induced Accuracy Drop in Neural Network
Accelerators

  • Fanruo Meng
  • Fateme S. Hosseini
  • Chengmo Yang

Hardware accelerators built with SRAM or emerging memory devices are essential to
the accommodation of the ever-increasing Deep Neural Network (DNN) workloads on resource-constrained
devices. After deployment, however, the performance of these accelerators is threatened
by the faults in their on-chip and off-chip memories where millions of DNN weights
are held. Different types of faults may exist depending on the underlying memory technology,
degrading inference accuracy. To tackle this challenge, this paper proposes an online
self-test framework that monitors the accuracy of the accelerator with a small set
of test images selected from the test dataset. Upon detecting a noticeable level of
accuracy drop, the framework uses additional test images to identify the corresponding
fault type and predict the severity of the faults by analyzing the change in the ranking
of the test images. Experimental results show that our method can quickly detect the
fault status of a DNN accelerator and provide accurate fault type and fault severity
information, allowing for a subsequent recovery and self-healing process.
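
The monitoring loop can be sketched as follows, with stand-in models: the framework periodically evaluates the accelerator on a small probe set and raises a flag when accuracy falls a margin below the fault-free baseline. The paper's ranking-based fault-type and severity analysis is not reproduced here, and all function names are hypothetical.

```python
import random

def evaluate(accelerator_fn, test_set):
    """Accuracy of the (possibly faulty) accelerator on a small probe set."""
    correct = sum(1 for x, label in test_set if accelerator_fn(x) == label)
    return correct / len(test_set)

def self_test(accelerator_fn, probe_set, baseline_acc, margin=0.05):
    """Flag a fault when accuracy drops noticeably below the fault-free baseline."""
    acc = evaluate(accelerator_fn, probe_set)
    return acc, (baseline_acc - acc) > margin

# Hypothetical stand-ins: a fault-free model and one whose weights were corrupted.
golden = lambda x: x % 10
faulty = lambda x: (x % 10) if random.random() > 0.3 else (x + 1) % 10

probe_set = [(i, i % 10) for i in range(200)]   # small set of labelled test images
baseline, _ = self_test(golden, probe_set, baseline_acc=1.0)

acc, fault_detected = self_test(faulty, probe_set, baseline_acc=baseline)
print(f"probe accuracy = {acc:.2f}, fault detected = {fault_detected}")
```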

Facilitating the Efficiency of Secure File Data and Metadata Deletion on SMR-based
Ext4 File System

  • Ping-Xiang Chen
  • Shuo-Han Chen
  • Yuan-Hao Chang
  • Yu-Pei Liang
  • Wei-Kuan Shih

The efficiency of secure deletion is highly dependent on the data layout of underlying
storage devices. In particular, owing to the sequential-write constraint of the emerging
Shingled Magnetic Recording (SMR) technology, an improper data layout could lead to
serious write amplification and hinder the performance of secure deletion. The performance
degradation of secure deletion on SMR drives is further aggravated with the need to
securely erase the file system metadata of deleted files due to the small-size nature
of file system metadata. Such an observation motivates us to propose a secure-deletion
and SMR-aware space allocation (SSSA) strategy to facilitate the process of securely
erasing both the deleted files and their metadata simultaneously. The proposed strategy
is integrated within the widely-used extended file system 4 (ext4) and is evaluated
through a series of experiments to demonstrate the effectiveness of the proposed strategy.
The evaluation results show that the proposed strategy can reduce the secure deletion
latency by 91.3% on average when compared with a naive SMR-based ext4 file system.

SESSION: 8C: Design Automation for Future Autonomy

Efficient Computing Platform Design for Autonomous Driving Systems

  • Shuang Liang
  • Changcheng Tang
  • Xuefei Ning
  • Shulin Zeng
  • Jincheng Yu
  • Yu Wang
  • Kaiyuan Guo
  • Diange Yang
  • Tianyi Lu
  • Huazhong Yang

Autonomous driving is becoming a hot topic in both the academic and industrial communities.
Traditional algorithms can hardly handle the complex tasks or meet the high safety
criteria. Recent research on deep learning shows significant performance improvement
over traditional algorithms and is believed to be a strong candidate for autonomous
driving systems. Despite the attractive performance, deep learning does not solve the
problem entirely. The application scenario requires that an autonomous driving system
must work in real time to ensure safety. But the high computation complexity of neural
network models, together with complicated pre-processing and post-processing, brings great
challenges. System designers need to perform dedicated optimizations to build a practical
computing platform for autonomous driving. In this paper, we introduce our work on
efficient computing platform design for autonomous driving systems. At the software
level, we introduce neural network compression and hardware-aware architecture search
to reduce the workload. At the hardware level, we propose customized hardware accelerators
for the pre- and post-processing of deep learning algorithms. Finally, we introduce the hardware
platform design, NOVA-30, and our on-vehicle evaluation project.

On Designing Computing Systems for Autonomous Vehicles: a PerceptIn Case Study

  • Bo Yu
  • Jie Tang
  • Shaoshan Liu

PerceptIn develops and commercializes autonomous vehicles for micromobility around
the globe. This paper makes a holistic summary of PerceptIn’s development and operating
experiences. It provides the business tale behind our product, and presents the development
of the computing system for our vehicles. We illustrate the design decisions made for
the computing system, and show the advantage of offloading localization workloads
onto an FPGA platform.

Runtime Software Selection for Adaptive Automotive Systems

  • Chia-Ching Fu
  • Ben-Hau Chia
  • Chung-Wei Lin

As automotive systems become more intelligent than ever, they need to handle many
functional tasks, resulting in more and more software programs running in automotive
systems. However, whether a software program should be executed depends on the environmental
conditions (surrounding conditions). For example, a deraining algorithm supporting
object detection and image recognition should only be executed when it is raining.
Supported by the advance of over-the-air (OTA) updates and plug-and-play systems,
adaptive automotive systems, where the software programs are updated, activated, and
deactivated before driving and during driving, can be realized. In this paper, we
consider the upcoming environmental conditions of an automotive system and target
the corresponding software selection problem during runtime. We formulate the problem
as a set cover problem with timing constraints and then propose a heuristic approach
to solve it. The approach is efficient enough to be applied at runtime, and it is a
preliminary step towards the broad realization of adaptive automotive systems.
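
The abstract does not spell out the heuristic, so the sketch below shows one plausible shape for it: a greedy weighted set-cover pass that repeatedly picks the program covering the most still-uncovered conditions per unit of execution time while respecting the timing budget. Program names, coverage sets, and costs are hypothetical.

```python
def select_programs(conditions, programs, time_budget):
    """Greedy set-cover heuristic: repeatedly pick the program covering the most
    still-uncovered environmental conditions per unit of execution time, while
    staying within the timing budget. `programs` maps name -> (covered set, cost)."""
    uncovered = set(conditions)
    selected, used_time = [], 0.0
    while uncovered:
        best, best_ratio = None, 0.0
        for name, (covers, cost) in programs.items():
            if name in selected or used_time + cost > time_budget:
                continue
            gain = len(uncovered & covers)
            if gain and gain / cost > best_ratio:
                best, best_ratio = name, gain / cost
        if best is None:                      # nothing affordable covers anything new
            break
        selected.append(best)
        used_time += programs[best][1]
        uncovered -= programs[best][0]
    return selected, uncovered

# Hypothetical conditions and programs (coverage sets and execution-time costs):
conditions = {"rain", "night", "fog", "highway"}
programs = {
    "deraining":    ({"rain"}, 2.0),
    "low_light":    ({"night"}, 1.5),
    "defogging":    ({"fog"}, 2.5),
    "lane_keeping": ({"highway"}, 1.0),
    "all_weather":  ({"rain", "fog"}, 3.5),
}
print(select_programs(conditions, programs, time_budget=6.0))
```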

Safety-Assured Design and Adaptation of Learning-Enabled Autonomous Systems

  • Qi Zhu
  • Chao Huang
  • Ruochen Jiao
  • Shuyue Lan
  • Hengyi Liang
  • Xiangguo Liu
  • Yixuan Wang
  • Zhilu Wang
  • Shichao Xu

Future autonomous systems will employ sophisticated machine learning techniques for
the sensing and perception of the surroundings and for making the corresponding decisions
for planning, control, and other actions. They often operate in highly dynamic, uncertain
and challenging environments, and need to meet stringent timing, resource, and mission
requirements. In particular, it is critical and yet very challenging to ensure the
safety of these autonomous systems, given the uncertainties of the system inputs,
the constant disturbances on the system operations, and the lack of analyzability
for many machine learning methods (particularly those based on neural networks). In
this paper, we will discuss some of these challenges, and present our work in developing
automated, quantitative, and formalized methods and tools for ensuring the safety
of autonomous systems in their design and during their runtime adaptation. We argue
that it is essential to take a holistic approach in addressing system safety and other
safety-related properties, vertically across the functional, software, and hardware
layers, and horizontally across the autonomy pipeline of sensing, perception, planning,
and control modules. This approach could be further extended from a single autonomous
system to a multi-agent system where multiple autonomous agents perform tasks in a
collaborative manner. We will use connected and autonomous vehicles (CAVs) as the
main application domain to illustrate the importance of such a holistic approach and
show our initial efforts in this direction.

SESSION: 8D: Emerging Hardware Verification

System-Level Verification of Linear and Non-Linear Behaviors of RF Amplifiers using
Metamorphic Relations

  • Muhammad Hassan
  • Daniel Große
  • Rolf Drechsler

System-on-Chips (SoC) have imposed new yet stringent design specifications on the
Radio Frequency (RF) subsystems. The Timed Data Flow (TDF) model of computation available
in SystemC-AMS offers here a good trade-off between accuracy and simulation-speed
at the system-level. However, one of the main challenges in system-level verification
is the availability of reference models traditionally used to verify the correctness
of the Design Under Verification (DUV). Recently, Metamorphic testing (MT) introduced
a new verification perspective in the software domain to alleviate this problem. MT
uncovers bugs just by using and relating test-cases.

In this paper, we present a novel MT-based verification approach to verify the linear
and non-linear behaviors of RF amplifiers at the system-level. The central element
of our MT-approach is a set of Metamorphic Relations (MRs) which describes the relation
of the inputs and outputs of consecutive DUV executions. For the class of Low Noise
Amplifiers (LNAs) we identify 12 high-quality MRs. We demonstrate the effectiveness
of our proposed MT-based verification approach in an extensive set of experiments
on an industrial system-level LNA model without the need of a reference model.
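
One representative (and deliberately simple) metamorphic relation for the linear region: scaling the input amplitude by k should scale the output by k, which can be checked with two DUV executions and no reference model. The DUV below is a toy saturating amplifier, not the industrial LNA model, and the 12 MRs identified in the paper are not reproduced.

```python
import math

def lna_duv(amplitude, gain=10.0, saturation=1.0):
    """Stand-in Design Under Verification: a memoryless amplifier model with
    soft saturation (tanh), so it is linear for small inputs only."""
    return saturation * math.tanh(gain * amplitude / saturation)

def mr_linearity(duv, amplitude, k=2.0, tolerance=0.05):
    """Metamorphic relation: in the linear region, scaling the input by k should
    scale the output by k. Two DUV executions are related; no reference model needed."""
    y1 = duv(amplitude)
    y2 = duv(k * amplitude)
    return abs(y2 - k * y1) <= tolerance * abs(k * y1)

print(mr_linearity(lna_duv, amplitude=1e-4))   # small signal: relation should hold
print(mr_linearity(lna_duv, amplitude=0.2))    # near saturation: relation is violated
```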

Random Stimuli Generation for the Verification of Quantum Circuits

  • Lukas Burgholzer
  • Richard Kueng
  • Robert Wille

Verification of quantum circuits is essential for guaranteeing correctness of quantum
algorithms and/or quantum descriptions across various levels of abstraction. In this
work, we show that there are promising ways to check the correctness of quantum circuits
using simulative verification and random stimuli. To this end, we investigate how
to properly generate stimuli for efficiently checking the correctness of a quantum
circuit. More precisely, we introduce, illustrate, and analyze three schemes for quantum
stimuli generation—offering a trade-off between the error detection rate (as well
as the required number of stimuli) and efficiency. In contrast to the verification
in the classical realm, we show (both theoretically and empirically) that even if
only a few randomly-chosen stimuli (generated from the proposed schemes) are considered,
high error detection rates can be achieved for quantum circuits. The results of these
conceptual and theoretical considerations have also been empirically confirmed—with
a grand total of approximately 10^6 simulations conducted across 50,000 benchmark instances.

Exploiting Extended Krylov Subspace for the Reduction of Regular and Singular Circuit
Models

  • Chrysostomos Chatzigeorgiou
  • Dimitrios Garyfallou
  • George Floros
  • Nestor Evmorfopoulos
  • George Stamoulis

During the past decade, Model Order Reduction (MOR) has become a key enabler for the
efficient simulation of large circuit models. MOR techniques based on moment-matching
are well established due to their simplicity and computational performance in the
reduction process. However, moment-matching methods based on the ordinary Krylov subspace
are usually inadequate to accurately approximate the original circuit behaviour. In
this paper, we present a moment-matching method which is based on the extended Krylov
subspace and exploits the superposition property in order to deal with many terminals.
The proposed method can handle large-scale regular and singular circuits, and generate
accurate and efficient reduced-order models for circuit simulation. Experimental results
on industrial IBM power grid benchmarks demonstrate that our method achieves an error
reduction up to 83.69% over a standard Krylov subspace technique.

SESSION: 8E: Optimization and Mapping Methods for Quantum Technologies

Algebraic and Boolean Optimization Methods for AQFP Superconducting Circuits

  • Eleonora Testa
  • Siang-Yun Lee
  • Heinz Riener
  • Giovanni De Micheli

Adiabatic quantum-flux-parametron (AQFP) circuits are a family of superconducting
electronic (SCE) circuits that have recently gained growing interest due to their
low energy consumption, and may serve as an alternative technology to overcome the down-scaling
limitations of CMOS. AQFP logic design differs from classic digital design because
logic cells are natively abstracted by the majority function, require data and clocking
in specific timing windows, and have fan-out limitations. We describe here a novel
majority-based logic synthesis flow addressing AQFP technology. In particular, we
present both algebraic and Boolean methods over majority-inverter graphs (MIGs) aiming
at optimizing size and depth of logic circuits. The technology limitations and constraints
of the AQFP technology (e.g., path balancing and maximum fanout) are considered during
optimization. The experimental results show that our flow reduces both size and depth
of MIGs, while meeting the constraints of the AQFP technology. Further, we show an
improvement for both area and delay when the MIGs are mapped into the AQFP technology.

Dynamical Decomposition and Mapping of MPMCT Gates to Nearest Neighbor Architectures

  • Atsushi Matsuo
  • Wakaki Hattori
  • Shigeru Yamashita

We usually use Mixed-Polarity Multiple-Control Toffoli (MPMCT) gates to realize large
control logic functions for quantum computation. A logic circuit consisting of MPMCT
gates needs to be mapped to a quantum computing device that has some physical limitations:
(1) we need to decompose MPMCT gates into one- or two-qubit gates, and then (2) we
need to insert SWAP gates such that all the gates can be performed on Nearest Neighbor
Architectures (NNAs). To date, the above two processes have been studied intensively
but independently. This paper points out that we can decrease the total number of
gates in a circuit if the above two processes are considered dynamically as a
single step; we propose a method that inserts SWAP gates while decomposing MPMCT gates,
unlike most of the existing methods. Our additional idea is to carefully consider the
effect on the latter part of a circuit by taking the qubit layout into account when
decomposing an MPMCT gate. We show some experimental results to confirm the effectiveness
of our method.

Exploiting Quantum Teleportation in Quantum Circuit Mapping

  • Stefan Hillmich
  • Alwin Zulehner
  • Robert Wille

Quantum computers are constantly growing in their number of qubits, but continue to
suffer from restrictions such as the limited pairs of qubits that may interact with
each other. Thus far, this problem is addressed by mapping and moving qubits to suitable
positions for the interaction (known as quantum circuit mapping). However, this movement
requires additional gates to be incorporated into the circuit, whose number should
be kept as small as possible since each gate increases the likelihood of errors and
decoherence. State-of-the-art mapping methods utilize swapping and bridging to move
the qubits along the static paths of the coupling map—solving this problem without
exploiting all means the quantum domain has to offer. In this paper, we propose to
additionally exploit quantum teleportation as a possible complementary method. Quantum
teleportation conceptually allows moving the state of a qubit over arbitrarily long
distances with constant overhead—providing the potential of determining cheaper
mappings. The potential is demonstrated by a case study on the IBM Q Tokyo architecture
which already shows promising improvements. With the emergence of larger quantum computing
architectures, quantum teleportation will become more effective in generating cheaper
mappings.

SESSION: 9B: Emerging System Architectures for Edge-AI

Hardware-Aware NAS Framework with Layer Adaptive Scheduling on Embedded System

  • Chuxi Li
  • Xiaoya Fan
  • Shengbing Zhang
  • Zhao Yang
  • Miao Wang
  • Danghui Wang
  • Meng Zhang

Neural Architecture Search (NAS) has been proven to be an effective solution for building
Deep Convolutional Neural Network (DCNN) models automatically. Subsequently, several
hardware-aware NAS frameworks incorporate hardware latency into the search objectives
to avoid the potential risk that the searched network cannot be deployed on target
platforms. However, the mismatch between NAS and hardware persists because the applicability
of the searched network's layer characteristics to the hardware mapping is not reconsidered.
A convolutional neural network layer can be executed with various hardware dataflows
that yield different performance, since the way on-chip data are used varies to fit the
parallel structure. This mismatch also results in significant performance degradation
for some maladaptive layers obtained from NAS, which might achieve a much better latency
when the adopted dataflow changes. To address the issue that network latency alone is
insufficient to evaluate deployment efficiency, this paper proposes a novel hardware-aware
NAS framework that considers the adaptability between layers and dataflow patterns.
Besides, we develop an optimized layer-adaptive data scheduling strategy as well as a
coarse-grained reconfigurable computing architecture so as to deploy the searched networks
with high power efficiency by selecting the most appropriate dataflow pattern layer by
layer under limited resources. Evaluation results show that the proposed NAS framework
can search DCNNs with accuracy similar to the state-of-the-art ones as well as low
inference latency, and the proposed architecture provides both power-efficiency improvement
and energy consumption savings.

Dataflow-Architecture Co-Design for 2.5D DNN Accelerators using Wireless Network-on-Package

  • Robert Guirado
  • Hyoukjun Kwon
  • Sergi Abadal
  • Eduard Alarcón
  • Tushar Krishna

Deep neural network (DNN) models continue to grow in size and complexity, demanding
higher computational power to enable real-time inference. To efficiently deliver such
computational demands, hardware accelerators are being developed and deployed across
scales. This naturally requires an efficient scale-out mechanism for increasing compute
density as required by the application. 2.5D integration over interposer has emerged
as a promising solution, but as we show in this work, the limited interposer bandwidth
and multiple hops in the Network-on-Package (NoP) can diminish the benefits of the
approach. To cope with this challenge, we propose WIENNA, a wireless NoP-based 2.5D
DNN accelerator. In WIENNA, the wireless NoP connects an array of DNN accelerator
chiplets to the global buffer chiplet, providing high-bandwidth multicasting capabilities.
Here, we also identify the dataflow style that most efficiently exploits the wireless
NoP’s high-bandwidth multicasting capability on each layer. With modest area and power
overheads, WIENNA achieves 2.2X-5.1X higher throughput and 38.2% lower energy than
an interposer-based NoP design.

Block-Circulant Neural Network Accelerator Featuring Fine-Grained Frequency-Domain
Quantization and Reconfigurable FFT Modules

  • Yifan He
  • Jinshan Yue
  • Yongpan Liu
  • Huazhong Yang

Block-circulant based compression is a popular technique to accelerate neural network
inference. Though storage and computing costs can be reduced by transforming weights
into block-circulant matrices, this method incurs uneven data distribution in the
frequency domain and imbalanced workload. In this paper, we propose RAB: a Reconfigurable
Architecture Block-Circulant Neural Network Accelerator to solve the problems via
two techniques. First, a fine-grained frequency-domain quantization is proposed to
accelerate MAC operations. Second, a reconfigurable architecture is designed to transform
FFT/IFFT modules into MAC modules, which alleviates the imbalanced workload and further
improves efficiency. Experimental results show that RAB can achieve 1.9x/1.8x area/energy
efficiency improvement compared with the state-of-the-art block-circulant compression
based accelerator.
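
The arithmetic that the FFT/IFFT and MAC modules implement can be sketched in a few lines: a circulant weight block, defined by its first column, multiplies an activation vector through element-wise products in the frequency domain. Block size and data below are illustrative.

```python
import numpy as np

def circulant_matvec_fft(first_col, x):
    """Multiply a circulant matrix (defined by its first column) with a vector
    using FFTs: y = IFFT( FFT(c) * FFT(x) ). This is the frequency-domain MAC
    that block-circulant accelerators implement in hardware."""
    return np.real(np.fft.ifft(np.fft.fft(first_col) * np.fft.fft(x)))

def circulant_from_first_col(c):
    """Dense circulant matrix, used only to check the FFT result."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
block_size = 8
c = rng.standard_normal(block_size)   # one block-circulant weight block (first column)
x = rng.standard_normal(block_size)   # input activations for this block

y_fft = circulant_matvec_fft(c, x)
y_ref = circulant_from_first_col(c) @ x
print(np.allclose(y_fft, y_ref))      # True: O(n log n) instead of O(n^2) MACs
```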

BatchSizer: Power-Performance Trade-off for DNN Inference

  • Seyed Morteza Nabavinejad
  • Sherief Reda
  • Masoumeh Ebrahimi

GPU accelerators can deliver significant improvement for DNN processing; however,
their performance is limited by internal and external parameters. A well-known parameter
that restricts the performance of various computing platforms in real-world setups,
including GPU accelerators, is the power cap imposed usually by an external power
controller. A common approach to meet the power cap constraint is using the Dynamic
Voltage Frequency Scaling (DVFS) technique. However, the functionality of this technique
is limited and platform-dependent. To improve the performance of DNN inference on
GPU accelerators, we propose a new control knob, which is the size of input batches
fed to the GPU accelerator in DNN inference applications. After evaluating the impact
of this control knob on power consumption and performance of GPU accelerators and
DNN inference applications, we introduce the design and implementation of a fast and
lightweight runtime system, called BatchSizer. This runtime system leverages the new
control knob for managing the power consumption of GPU accelerators in the presence
of the power cap. Conducting several experiments using a modern GPU and several DNN
models and input datasets, we show that our BatchSizer can significantly surpass the
conventional DVFS technique regarding performance (up to 29%), while successfully
meeting the power cap.
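
A minimal sketch of the control idea, assuming a made-up power model in place of real GPU telemetry: measure power each step and step the batch size down when the cap is exceeded, up when there is headroom. BatchSizer's actual measurement and decision logic are more elaborate, and all names here are hypothetical.

```python
def fake_gpu_power(batch_size, idle_w=60.0, per_sample_w=2.5):
    """Hypothetical stand-in for a measured GPU power draw (watts) at a given batch size."""
    return idle_w + per_sample_w * batch_size

def batch_size_controller(power_cap, steps=20, batch=64, min_batch=1, max_batch=256):
    """Feedback loop: shrink the batch when the measured power exceeds the cap,
    grow it (to recover throughput) when there is headroom."""
    for _ in range(steps):
        power = fake_gpu_power(batch)            # in a real system: read from power sensors
        if power > power_cap:
            batch = max(min_batch, batch // 2)   # back off quickly to respect the cap
        elif power < 0.9 * power_cap:
            batch = min(max_batch, batch + 4)    # creep up while comfortably below the cap
    return batch

print(batch_size_controller(power_cap=150.0))    # settles near a cap-compliant batch size
```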

SESSION: 9C: Cutting-Edge EDA Techniques for Advanced Process Technologies

Deep Learning for Mask Synthesis and Verification: A Survey

  • Yibo Lin

Achieving lithography compliance is increasingly difficult in advanced technology
nodes. Due to complicated lithography modeling and long simulation cycles, verifying
and optimizing photomasks becomes extremely expensive. To speed up design closure,
deep learning techniques have been introduced to enable data-assisted optimization
and verification. Such approaches have demonstrated promising results with high solution
quality and efficiency. Recent research efforts show that learning-based techniques
can accomplish more and more tasks, from classification, simulation, to optimization,
etc. In this paper, we will survey the successful attempts of advancing mask synthesis
and verification with deep learning and highlight the domain-specific learning techniques.
We hope this survey can shed light on the future development of learning-based design
automation methodologies.

Physical Synthesis for Advanced Neural Network Processors

  • Zhuolun He
  • Peiyu Liao
  • Siting Liu
  • Yuzhe Ma
  • Yibo Lin
  • Bei Yu

The remarkable breakthroughs in deep learning have led to a dramatic thirst for computational
resources to tackle interesting real-world problems. Various neural network processors
have been proposed for the purpose, yet, far fewer discussions have been made on the
physical synthesis for such specialized processors, especially in advanced technology
nodes. In this paper, we review several physical synthesis techniques for advanced
neural network processors. We especially argue that datapath design is an essential
methodology in the above procedures due to the organized computational graph of neural
networks. As a case study, we investigate a wafer-scale deep learning accelerator
placement problem in detail.

Advancements and Challenges on Parasitic Extraction for Advanced Process Technologies

  • Wenjian Yu
  • Mingye Song
  • Ming Yang

As the feature size scales down, the process technology becomes more complicated and
the design margin shrinks, so accurate parasitic extraction during IC design is in great
demand. In this invited paper, we survey the recent advancements in parasitic extraction
techniques, especially those enhancing the floating random walk based capacitance
solver and those incorporating machine learning methods. Work dealing with process variation
is also addressed. After that, we briefly discuss the challenges for capacitance
extraction under advanced process technologies, including manufacture-aware geometry
variations and middle-end-of-line (MEOL) parasitic extraction, etc.

SESSION: 9D: Robust and Reliable Memory Centric Computing at Post-Moore

Reliability-Aware Training and Performance Modeling for Processing-In-Memory Systems

  • Hanbo Sun
  • Zhenhua Zhu
  • Yi Cai
  • Shulin Zeng
  • Kaizhong Qiu
  • Yu Wang
  • Huazhong Yang

Memristor based Processing-In-Memory (PIM) systems give alternative solutions to boost
the computing energy efficiency of Convolutional Neural Network (CNN) based algorithms.
However, Analog-to-Digital Converters’ (ADCs) high interface costs and the limited
size of the memristor crossbars make it challenging to map CNN models onto PIM systems
with both high accuracy and high energy efficiency. Besides, it takes a long time
to simulate the performance of large-scale PIM systems, resulting in unacceptable
development time for the PIM system. To address these problems, we propose a reliability-aware
training framework and a behavior-level modeling tool (MNSIM 2.0) for PIM accelerators.
The proposed reliability-aware training framework, containing network splitting/merging
analysis and a PIM-based non-uniform activation quantization scheme, can improve the
energy efficiency by reducing the ADC resolution requirements in memristor crossbars.
Moreover, MNSIM 2.0 provides a general modeling method for PIM architecture design
and computation data flow; it can evaluate both accuracy and hardware performance
within a short time. Experiments based on MNSIM 2.0 show that the reliability-aware
training framework can improve 3.4x energy efficiency of PIM accelerators with little
accuracy loss. The equivalent energy efficiency is 9.02 TOPS/W, nearly 2.6~4.2x compared
with the existing work. We also evaluate more case studies of MNSIM 2.0, which help
us balance the trade-off between accuracy and hardware performance.

Robustness of Neuromorphic Computing with RRAM-based Crossbars and Optical Neural
Networks

  • Grace Li Zhang
  • Bing Li
  • Ying Zhu
  • Tianchen Wang
  • Yiyu Shi
  • Xunzhao Yin
  • Cheng Zhuo
  • Huaxi Gu
  • Tsung-Yi Ho
  • Ulf Schlichtmann

RRAM-based crossbars and optical neural networks are attractive platforms to accelerate
neuromorphic computing. However, both accelerators suffer from hardware uncertainties
such as process variations. If these uncertainty issues are left unaddressed, the inference
accuracy of these computing platforms can degrade significantly. In this paper, a
statistical training method where weights under process variations and noise are modeled
as statistical random variables is presented. To incorporate these statistical weights
into training, the computations in neural networks are modified accordingly. For optical
neural networks, we modify the cost function during software training to reduce the
effects of process variations and thermal imbalance. In addition, the residual effects
of process variations are extracted and calibrated in hardware test, and thermal variations
on devices are also compensated in advance. Simulation results demonstrate that the
inference accuracy can be improved significantly under hardware uncertainties for
both platforms.

Uncertainty Modeling of Emerging Device based Computing-in-Memory Neural Accelerators
with Application to Neural Architecture Search

  • Zheyu Yan
  • Da-Cheng Juan
  • Xiaobo Sharon Hu
  • Yiyu Shi

Emerging device based Computing-in-Memory (CiM) has been proven to be a promising
candidate for high-energy-efficiency deep neural network (DNN) computations. However,
most emerging devices suffer from uncertainty issues, resulting in a difference between
the actual data stored and the weight value the device is designed to hold. This leads to
an accuracy drop from trained models to actually deployed platforms. In this work, we offer
a thorough analysis of the effect of such uncertainty-induced changes in DNN models.
To reduce the impact of device uncertainties, we propose UAE, an uncertainty-aware
Neural Architecture Search scheme to identify a DNN model that is both accurate and
robust against device uncertainties.

A Physical-Aware Framework for Memory Network Design Space Exploration

  • Tianhao Shen
  • Di Gao
  • Li Zhang
  • Jishen Zhao
  • Cheng Zhuo

In the era of big data, there have been growing demands for server memory capacity
and performance. A memory network is a promising alternative to provide high bandwidth
and low latency through distributed memory nodes connected by a high-speed interconnect.
However, most existing designs are developed at a purely logical level and ignore the
physical impact of network interconnect latency, processor placement, and the interplay
between processor and memory. In this work, we propose a Physical-Aware framework
for memory network design space exploration, which facilitates the design of an energy
efficient and physical-aware memory network system. Experimental results on various
workloads show that the proposed framework can help customize network topology with
significant improvements on various design metrics when compared to the other commonly
used topologies.

SESSION: 9E: Design for Manufacturing and Soft Error Tolerance

Manufacturing-Aware Power Staple Insertion Optimization by Enhanced Multi-Row Detailed
Placement Refinement

  • Yu-Jin Xie
  • Kuan-Yu Chen
  • Wai-Kei Mak

Power staple insertion is a new methodology for IR drop mitigation in advanced technology
nodes. Detailed placement refinement which perturbs an initial placement slightly
is an effective strategy to increase the success rate of power staple insertion. We
are the first to address the manufacturing-aware power staple insertion optimization
problem by triple-row placement refinement. We present a correct-by-construction approach
based on dynamic programming to maximize the total number of legal power staples inserted
subject to the design rule for 1D patterning. Instead of using a multidimensional
array which incurs huge space overhead, we show how to construct a directed acyclic
graph (DAG) on the fly efficiently to implement the dynamic program for multi-row
optimization in order to conserve memory usage. The memory usage can thus be reduced
by a few orders of magnitude in practice.

A Hierarchical Assessment Strategy on Soft Error Propagation in Deep Learning Controller

  • Ting Liu
  • Yuzhuo Fu
  • Yan Zhang
  • Bin Shi

Deep learning techniques have been introduced into the field of intelligent controller
design in recent years and become an effective alternative in complex control scenarios.
In addition to improving control robustness, deep learning controllers (DLCs) also provide
a potential fault tolerance to internal disturbances (such as soft errors) due to
the inherent redundant structure of deep neural networks (DNNs). In this paper, we
propose a hierarchical assessment to characterize the impact of soft errors on the
dependability of a PID controller and its DLC alternative. Single-bit-flip injections
in underlying hardware and time series data collection from multiple abstraction layers
(ALs) are performed on a virtual prototype system based on an ARM Cortex-A9 CPU, with
a PID controller and corresponding recurrent neural network (RNN) implemented DLC
deployed on it. We employ generative adversarial networks and Bayesian networks to
characterize the local and global dependencies caused by soft errors across the system.
By analyzing cross-AL fault propagation paths and component sensitivities, we discover
that the parallel data processing pipelines and regular feature size scaling mechanism
in the DLC can effectively prevent critical failure-causing faults from propagating to
the control output.

Attacking a CNN-based Layout Hotspot Detector Using Group Gradient Method

  • Haoyu Yang
  • Shifan Zhang
  • Kang Liu
  • Siting Liu
  • Benjamin Tan
  • Ramesh Karri
  • Siddharth Garg
  • Bei Yu
  • Evangeline F.Y. Young

Deep neural networks are being used in disparate VLSI design automation tasks, including
layout printability estimation, mask optimization, and routing congestion analysis.
Preliminary results show the power of deep learning as an alternate solution in state-of-the-art
design and sign-off flows. However, deep learning is vulnerable to adversarial attacks.
In this paper, we examine the risk of state-of-the-art deep learning-based layout
hotspot detectors under practical attack scenarios. We show that legacy gradient-based
attacks do not adequately consider the design rule constraints. We present an innovative
adversarial attack formulation to attack the layout clips and propose a fast group
gradient method to solve it. Experiments show that the attack can deceive the deep
neural networks using small perturbations in clips which preserve layout functionality
while meeting the design rules. The source code is available at https://github.com/phdyang007/dlhsd/tree/dct_as_conv.

Bayesian Inference on Introduced General Region: An Efficient Parametric Yield Estimation Method for Integrated Circuits

  • Zhengqi Gao
  • Zihao Chen
  • Jun Tao
  • Yangfeng Su
  • Dian Zhou
  • Xuan Zeng

In this paper, we propose an efficient parametric yield estimation method based on
Bayesian Inference. Observing that nowadays analog and mixed-signal circuits are
designed via a multi-stage flow, and that the circuit performance correlation between the
early stage and the late stage is naturally symmetrical, we introduce a general region to capture
the common features of the early and late stage. Meanwhile, two private regions are
also incorporated to represent the unique features of these two stages respectively.
Afterwards, we introduce classifiers one for each region to explicitly encode the
correlation information. Next, we set up a graphical model, and consequently adopt
Bayesian Inference to calculate the model parameters. Finally, based on the obtained
optimal model parameters, we can accurately and efficiently estimate the parametric
yield with a simple sampling method. Our numerical experiments demonstrate that compared
to the state-of-the-art algorithms, our proposed method can better estimate the yield
while significantly reducing the number of circuit simulations.

Analog IC Aging-induced Degradation Estimation via Heterogeneous Graph Convolutional
Networks

  • Tinghuan Chen
  • Qi Sun
  • Canhui Zhan
  • Changze Liu
  • Huatao Yu
  • Bei Yu

With continued scaling, transistor aging induced by Hot Carrier Injection and Bias
Temperature Instability causes a gradual failure of nanometer-scale integrated circuits
(ICs). In this paper, to characterize the multi-typed devices and connection ports,
a heterogeneous directed multigraph is adopted to efficiently represent analog IC
post-layout netlists. We investigate a heterogeneous graph convolutional network (H-GCN)
to quickly and accurately estimate aging-induced transistor degradation. In the proposed
H-GCN, an embedding generation algorithm with a latent space mapping method is developed
to aggregate information from the node itself and its multi-typed neighboring nodes
through multi-typed edges. Since our proposed H-GCN is independent of dynamic stress
conditions, it can replace static aging analysis. We conduct experiments on very advanced
5nm industrial designs. Compared to traditional machine learning and graph learning
methods, our proposed H-GCN can achieve more accurate estimations of aging-induced
transistor degradation. Compared to an industrial reliability tool, our proposed H-GCN
can achieve 24.623x speedup on average.

MLCAD’20 TOC

MLCAD ’20: Proceedings of the 2020 ACM/IEEE Workshop on Machine Learning for CAD



SESSION: Keynote Talk I

Session details: Keynote Talk I

  • Ulf Schlichtmann

MLCAD Today and Tomorrow: Learning, Optimization and Scaling

  • Andrew B. Kahng

The scaling imperative challenges us to always do better and faster, with fewer resources.
The semiconductor industry has looked to machine learning (ML) as a design-based lever
for scaling that will reduce design costs and design schedules while improving quality
of results. As a result, in recent years Machine Learning for CAD (MLCAD) has dominated
the conversation at leading conferences. Numerous ML-based enhancements and their
benefits have been highlighted by EDA vendors and their customers. With this as backdrop,
this talk will offer some thoughts on future directions for MLCAD.

First, MLCAD lies on a road to “self-driving IC design tools and flows” that make
design ideation and design space exploration both accurate and accessible. Eventually,
MLCAD (ML for CAD) will lead us to MLDA (ML-enabled Design Automation). But for this
to happen, researchers and practitioners will need to deliver (i) human-quality prediction,
evaluation and decision-making with no humans; (ii) design tools and flows that never
require iteration and never fail; (iii) modeling of design processes that continually
improves; and more.

Second, the trajectory of MLCAD will need to keep three concepts in foreground: Learning,
Optimization and Scaling. “Learning” seems obvious from “ML”, but it brings open questions
about data and models, ranging from statistics to standards, sharing and openness.
“Optimization” is the essence of CAD, and brings open questions about both synergies
and boundaries with learning. “Scaling” is how practical realization of Learning and
Optimization will satisfy the needs of design within an ever-tighter box of compute,
schedule and other resources. Finally, there is a meta-question of how the MLCAD community
will itself learn, optimize, and scale.

SESSION: Session 1: DNN for CAD

Session details: Session 1: DNN for CAD

  • Hussam Amrouch

An Adaptive Analytic FPGA Placement Framework based on Deep-Learning

  • Abeer Al-hyari
  • Ahmed Shamli
  • Timothy Martin
  • Shawki Areibi
  • Gary Grewal

In this work, a Convolutional Encoder-Decoder (CED) is utilized to significantly reduce
placement runtimes for large, high-utilization designs. The proposed CED uses features
available during the early stages of placement to predict the congestion present in
subsequent placement iterations including the final placement. This congestion information
is then used by the placer to improve decision making leading to reduced runtimes.
Experimental results show that reductions in placer runtime between 27% and 40% are
achievable with no significant deterioration in quality-of-result.

Design Rule Checking with a CNN Based Feature Extractor

  • Luis Francisco
  • Tanmay Lagare
  • Arpit Jain
  • Somal Chaudhary
  • Madhura Kulkarni
  • Divya Sardana
  • W. Rhett Davis
  • Paul Franzon

Design rule checking (DRC) is getting increasingly complex in advanced node technologies.
It would be highly desirable to have a fast interactive DRC engine that could be used
during layout. In this work, we establish the proof of feasibility for such an engine.
The proposed model consists of a convolutional neural network (CNN) trained to detect
DRC violations. The model was trained with artificial data that was derived from a
set of 50 SRAM designs. The focus in this demonstration was metal 1 rules. Using this
solution, we can detect multiple DRC violations 32x faster than Boolean checkers with
an accuracy of up to 92%. The proposed solution can be easily expanded to a complete
rule set.

Using DNNs and Smart Sampling for Coverage Closure Acceleration

  • Raviv Gal
  • Eldad Haber
  • Avi Ziv

Coverage Directed Generation refers to algorithms that are used to create tests or
test-templates for hitting coverage events. Standard approaches for solving the problem
use either the user's intuition or random sampling. Recent work has used optimization
algorithms to hit a single hard-to-hit event. In this work we extend the
optimization technique to many events and show that by using a deep neural network
one can accelerate the optimization significantly. The algorithms are demonstrated on
the NorthStar simulator, where we show substantial improvement over random-based techniques
and a factor larger than 2 over other optimization-based techniques.

R2AD: Randomization and Reconstructor-based Adversarial Defense on Deep Neural Network

  • Marzieh Ashrafiamiri
  • Sai Manoj Pudukotai Dinakarrao
  • Amir Hosein Afandizadeh Zargari
  • Minjun Seo
  • Fadi Kurdahi
  • Houman Homayoun

Machine learning (ML) has been widely adopted in a plethora of applications ranging
from simple time-series forecasting to computer security and autonomous systems. Despite
the robustness by the ML algorithms against random noise, it has been shown that inclusion
of specially crafted perturbations to the input data termed as adversarial samples
can lead to a significant degradation in the ML performance. Existing defenses to
mitigate or minimize the impact of adversarial samples including adversarial training
or randomization are confined to specific categories of adversaries, compute-intensive
and/or often lead to reduce performance even without adversaries. To overcome the
shortcomings of the existing works on adversarial defense, we propose a two-stage
adversarial defense technique (R2AD). To thwart the exploitation of the deep neural
network by the attacker, we first include a random nullification (RNF) layer. The
RNF nullifies/removes some of the features from the input randomly to reduce the impact
of adversarial noise and minimizes the attacker’s ability to extract the model parameters.
However, the removal of input features through RNF leads to a reduction in the performance
of the ML. As an antidote, we equip the network with a Reconstructor. The Reconstructor
primarily contributes to reconstructing the input data by utilizing an autoencoder
network, but based on the distribution of the normal samples, thereby improving the
performance, and also being robust to the adversarial noise. We evaluated the performance
of the proposed multi-stage R2AD on the MNIST digits and Fashion-MNIST datasets against
multiple adversarial attacks including FGSM, JSMA, BIM, Deepfool, and CW attacks.
Our findings report improvements as high as 80% in the performance compared to the
existing defenses such as adversarial training and randomization-based defense.
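
The first stage is straightforward to sketch: the random nullification layer zeroes a random fraction of input features at inference time, and a reconstructor (an autoencoder trained on clean data, stubbed as an identity below) restores the input before classification. Function names and the drop fraction are illustrative.

```python
import numpy as np

def random_nullification(x, drop_fraction=0.2, rng=None):
    """RNF layer: randomly zero out a fraction of input features at inference time
    to disturb adversarial perturbations and model-extraction attempts."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= drop_fraction
    return x * mask

def reconstructor(x):
    """Placeholder for the autoencoder that restores nullified features from the
    distribution of normal samples; a trained model would go here."""
    return x

rng = np.random.default_rng(0)
image = rng.random((28, 28))     # e.g. an MNIST input, possibly adversarial
defended_input = reconstructor(random_nullification(image, drop_fraction=0.2, rng=rng))
print(f"{(defended_input == 0).mean():.2f} of the pixels were nullified")
```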

DAVE: Deriving Automatically Verilog from English

  • Hammond Pearce
  • Benjamin Tan
  • Ramesh Karri

Specifications for digital systems are provided in natural language, and engineers
undertake significant efforts to translate these into the programming languages understood
by compilers for digital systems. Automating this process allows designers to work
with the language in which they are most comfortable – the original natural language
– and focus instead on other downstream design challenges. We explore the use of state-of-the-art
machine learning (ML) to automatically derive Verilog snippets from English via fine-tuning
GPT-2, a natural language ML system. We describe our approach for producing a suitable
dataset of novice-level digital design tasks and provide a detailed exploration of
GPT-2, finding encouraging translation performance across our task sets (94.8% correct),
with the ability to handle both simple and abstract design tasks.
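
A hedged sketch of the inference side using the Hugging Face transformers API: the fine-tuned checkpoint from the paper is not public here, so the code loads the base "gpt2" weights as a placeholder, which would not emit valid Verilog without the fine-tuning the paper describes.

```python
# Sketch of prompting a fine-tuned GPT-2 for English -> Verilog translation.
# The checkpoint name is a placeholder; the base "gpt2" weights will not emit valid Verilog.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

checkpoint = "gpt2"                      # replace with the fine-tuned checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained(checkpoint)
model = GPT2LMHeadModel.from_pretrained(checkpoint)

prompt = ("Task: translate the specification into Verilog.\n"
          "Specification: a 2-to-1 multiplexer with inputs a, b, select s and output y.\n"
          "Verilog:\n")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```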

SESSION: Plenary I

Session details: Plenary I

  • Paul Franzon

Accelerating Chip Design with Machine Learning

  • Brucek Khailany

As Moore’s law has provided an exponential increase in chip transistor density, the
unique features we can now include in large chips are no longer predominantly limited
by area constraints. Instead, new capabilities are increasingly limited by the engineering
effort associated with digital design, verification, and implementation. As applications
demand more performance and energy efficiency from specialization in the post-Moore’s-law
era, we expect required complexity and design effort to increase.

Historically, these challenges have been met through levels of abstraction and automation.
Over the last few decades, Electronic Design Automation (EDA) algorithms and methodologies
were developed for all aspects of chip design – design verification and simulation,
logic synthesis, place-and-route, and timing and physical signoff analysis. With each
increase in automation, total work per chip has increased, but more work has also
been offloaded from manual effort to software. Just as machine learning (ML) has transformed
software in many domains, we expect advancements in ML will also transform EDA software
and as a result, chip design workflows.

In this talk, we highlight work from our research group and the community applying
ML to various chip design prediction tasks [1]. We show how deep convolutional neural
networks [2] and graph-based neural networks [3] can be used in the areas of automatic
design space exploration, power analysis, VLSI physical design, and analog design.
We also present a future vision of an AI-assisted chip design workflow to automate
optimization tasks. In this future vision, GPU acceleration, neural-network predictors,
and deep reinforcement learning techniques combine to automate VLSI design and optimization.

SESSION: Keynote Talk II

Session details: Keynote Talk II

  • Ulf Schlichtmann

SoC Design Automation with ML – It’s Time for Research

  • Vijay Deep Bhatt
  • Wolfgang Ecker
  • Volkan Esen
  • Zhao Han
  • Daniela Sanchez Lopera
  • Rituj Patel
  • Lorenzo Servadei
  • Sahil Singla
  • Sven Wenzek
  • Vijaydeep Yadav
  • Elena Zennaro

The AI hype started a few years ago, with advances in object recognition. Soon the
EDA research community made proposals on applying AI in EDA, and all major players
announced new AI-based tools at DAC 2018. Unfortunately, few new AI-based EDA tools
have made it into productive use today. This talk analyses the general challenges of AI in EDA,
outlines promising use cases, and motivates more AI research in EDA: more HI (=Human
Intelligence) is needed to make AI successful in EDA.

Motivation: For a long time, hardware design has resided in an area between the hell of
complexity and the hell of physics. Continuously decreasing feature sizes make it possible
to put more and more transistors on a square millimeter of silicon. This allows continuously
building new applications at a reasonable form factor. However, the functionality of the
application must be designed first, and the deep submicron effects must be considered properly.

EDA tools help to automate design, but face challenges keeping up with the continuously
increasing productivity demand. Therefore, the design teams have increased in size
to step up the design of the chips. So, any further innovation is welcome. The re-discovery
of AI in general and ML in particular created visions of learning from designers and
automatically creating automation from these learnings. To give an example, Google
describes in [1] how to accelerate Chip Placement from weeks to hours.

SESSION: Session 2: Design Methodology and Optimization

Session details: Session 2: Design Methodology and Optimization

  • Hai (Helen) Li

Cost Optimization at Early Stages of Design Using Deep Reinforcement Learning

  • Lorenzo Servadei
  • Jiapeng Zheng
  • José Arjona-Medina
  • Michael Werner
  • Volkan Esen
  • Sepp Hochreiter
  • Wolfgang Ecker
  • Robert Wille

With the increase in the complexity of modern Systems-on-Chip (SoCs) and the demand
for a lower time-to-market, automation becomes essential in hardware design. This
is particularly relevant for complex and time-consuming tasks such as the optimization of design
cost for a hardware component. Design cost, in fact, may depend on several objectives,
such as the hardware-software trade-off. Given the complexity of this task, the designer
often has no means to perform a fast and effective optimization, in particular for
large and complex designs. In this paper, we introduce Deep Reinforcement Learning (DRL)
for design cost optimization at the early stages of the design process. We first show
that DRL is a perfectly suitable solution for the problem at hand. Afterward, by means
of a Pointer Network, a neural network specifically applied to combinatorial problems,
we benchmark three DRL algorithms on the selected problem. Results obtained in
different settings show the improvements achieved by DRL algorithms compared to conventional
optimization methods. Additionally, by using the reward redistribution proposed in the
recently introduced RUDDER method, we obtain significant improvements in complex designs.
Here, the obtained optimization is on average 15.18% on the area as well as 8.25%
and 8.12% on the application size and execution time, respectively, on a dataset of
industrial hardware/software interface designs.

F-LEMMA: Fast Learning-based Energy Management for Multi-/Many-core Processors

  • An Zou
  • Karthik Garimella
  • Benjamin Lee
  • Christopher Gill
  • Xuan Zhang

Over the last two decades, as microprocessors have evolved to achieve higher computational
performance, their power density also has increased at an accelerated rate. Improving
energy efficiency and reducing power consumption is therefore of critical importance
to modern computing systems. One effective technique to improve energy efficiency
is dynamic voltage and frequency scaling (DVFS). In this paper, we propose F-LEMMA:
a fast learning-based power management framework consisting of a global power allocator
in userspace, a reinforcement learning-based power management scheme at the architecture
level, and a swift controller at the digital circuit level. This hierarchical approach
leverages computation at the system and architecture levels, and the short response
times of the swift controllers, to achieve effective and rapid μs-level power management.
Our experimental results demonstrate that F-LEMMA can achieve significant energy savings
(35.2% on average) across a broad range of workload benchmarks. Compared with existing
state-of-the-art DVFS-based power management strategies that can only operate at millisecond
timescales, F-LEMMA is able to provide notable (up to 11%) Energy Delay Product improvements
when evaluated across benchmarks.

CALT: Classification with Adaptive Labeling Thresholds for Analog Circuit Sizing

  • Zhengfeng Wu
  • Ioannis Savidis

A novel simulation-based framework that applies classification with adaptive labeling
thresholds (CALT) is developed to auto-generate the component sizes of an analog
integrated circuit. Classifiers are applied to predict whether the target specifications
are satisfied. To address the lack of data points with positive labels due to the
large dimensionality of the parameter space, the labeling threshold is adaptively
set to a certain percentile of the distribution of a given circuit performance metric
in the dataset. Random forest classifiers are executed for surrogate prediction modeling
and provide a ranking of the design parameters. For each iteration of the simulation
loop, optimization is utilized to determine new query points. CALT is applied to the
design of a low noise amplifier (LNA) in a 65 nm technology. Qualified design solutions
are generated for two sets of specifications, requiring an average of 4 and 17
iterations of the optimization loop, an average of 1287 and 2190 simulation
samples, and average execution times of 5.4 hours and 23.2 hours, respectively.
CALT is a specification-driven design framework to automate the sizing of the components
(transistors, capacitors, inductors, etc.) of an analog circuit. CALT generates interpretable
models and achieves high sample efficiency without requiring the use of prior circuit
models.
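
As an illustration of the adaptive-labeling idea described above, the sketch below (assumed names and synthetic data, not the authors' code) derives positive labels from a percentile of the observed performance distribution and uses a random forest to rank design parameters:

```python
# Hedged sketch of the CALT-style labeling step: positive labels are assigned relative to a
# percentile of the observed metric distribution, and a random forest ranks design parameters.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 6))                                # candidate sizings (e.g., W, L, Ibias, ...)
gain = 20 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 1, 500)     # stand-in "simulated" performance metric

percentile = 80                                               # adaptive threshold: top 20% count as "pass"
threshold = np.percentile(gain, percentile)
y = (gain >= threshold).astype(int)                           # labels derived from the data distribution,
                                                              # so positives exist even far from the spec

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(clf.feature_importances_)[::-1]
print("design-parameter ranking (most to least influential):", ranking)
```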

Decision Making in Synthesis cross Technologies using LSTMs and Transfer Learning

  • Cunxi Yu
  • Wang Zhou

We propose a general approach that precisely estimates the Quality-of-Result (QoR),
such as delay and area, of unseen synthesis flows for specific designs. The main idea
is to leverage an LSTM-based network to forecast the QoR, where the inputs are synthesis
flows represented with a novel timed-flow model and the QoRs are the ground truth. This approach
is demonstrated with 1.2 million data points collected using 14nm, 7nm regular-voltage
(RVT), and 7nm low-voltage (LVT) technologies with twelve IC designs. The accuracy
of predicting the QoRs (delay and area) is evaluated using the mean absolute prediction error
(MAPE). Since collecting training data points in EDA can be extremely challenging,
we incorporate transfer learning into our approach, which enables accurate
predictions across different technologies and different IC designs. Our transfer learning
approach obtains an estimation MAPE of 3.7% over 960,000 test points collected on the 7nm technologies,
with only 100 data points used to adapt the LSTM network pre-trained on the 14nm
dataset.
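
The following is a minimal PyTorch sketch of the general recipe the abstract describes: pre-train a sequence model on QoR data from one technology, then adapt it to another with few samples. The flow encoding, network sizes, and fine-tuning schedule are assumptions for illustration only.

```python
# Illustrative only: pre-train on a "source" technology, adapt with 100 "target" samples.
import torch
import torch.nn as nn

NUM_TRANSFORMS, FLOW_LEN = 20, 25          # vocabulary of synthesis transformations, flow length

class QoRNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_TRANSFORMS, 32)
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.head = nn.Linear(64, 2)       # predicts (delay, area)

    def forward(self, flows):              # flows: (batch, FLOW_LEN) integer-encoded transformations
        h, _ = self.lstm(self.embed(flows))
        return self.head(h[:, -1])         # use the final hidden state

def fit(model, flows, qor, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.l1_loss(model(flows), qor)   # MAE, in the spirit of a MAPE-style metric
        loss.backward()
        opt.step()

model = QoRNet()
flows_src, qor_src = torch.randint(0, NUM_TRANSFORMS, (1000, FLOW_LEN)), torch.randn(1000, 2)
fit(model, flows_src, qor_src)                            # pre-train on the source technology

flows_tgt, qor_tgt = torch.randint(0, NUM_TRANSFORMS, (100, FLOW_LEN)), torch.randn(100, 2)
fit(model, flows_tgt, qor_tgt, epochs=20, lr=1e-4)        # adapt with only 100 target samples
```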

Application of Quantum Machine Learning to VLSI Placement

  • Isaac Turtletaub
  • George Li
  • Mohannad Ibrahim
  • Paul Franzon

Considerable advances in quantum computing with functioning noisy, near-term devices
have allowed the application space to grow as an emerging field for problems with
large solution spaces. However, current quantum hardware is limited in scale and noisy
in generated data, necessitating hybrid quantum-classical solutions for viability
of results and convergence. A quantum backend generates data for classical algorithms
to optimize control parameters with, creating a hybrid quantum-classical computing
loop. VLSI placement problems have shown potential for such utilization, where traditional
heuristic solutions such as Kernighan-Lin (KL) are used. The Variational Quantum Eigensolver
(VQE) is used to formulate a recursive Balanced Min-Cut (BMC) algorithm, and we suggest
that quantum machine learning techniques can lower error rates and allow for faster
convergence to an optimal solution.

SESSION: Plenary II

Session details: Plenary II

  • Ulf Schlichtmann

From Tuning to Learning: Why the FPGA Physical Design Flow Offers a Compelling Case for ML?

  • Ismail S. Bustany

MLCAD is particularly suited for the FPGA physical design (PD) flow since each device
family generation innately provides a rich platform for device/design feature data
harvesting: (1) a vast amount of device architecture-specific interconnect/layout
fabric data and (2) a significant amount of large design suite data from a broad
set of application domains. These bode well for developing robust predictive ML models.
Furthermore, the long lifespan of these device families affords a favorable ROI. In
this talk, we will highlight some data harvesting and ML solutions we have developed
in Xilinx's Vivado PD flow and share some initial results. These include a strategy
recommendation framework for design closure, design classification for computational
resource allocation, device characteristics modeling, and routing congestion estimation.
Furthermore, we will outline potential MLCAD opportunities in trend identification,
algorithm parameter optimization, and reinforcement learning paradigms where we foresee
potential collaborations with the academic community.

Biography: Ismail Bustany is a Distinguished Engineer at Xilinx, where he works on
physical design, MLCAD, and sparse computation hardware acceleration. He has served
on the technical programming committees for the ISPD, the ISQED, and DAC. He was the
2019 ISPD general chair. He currently serves on the organizing committees for ICCAD
and SLIP. He organized the 2014 and 2015 ISPD detailed routing-driven placement contests
and co-organized the 2017 ICCAD detailed placement contest. His research interests
include physical design, computationally efficient optimization algorithms, MLCAD,
sparse matrix computations, and hardware acceleration. He earned his M.S. and Ph.D.
in electrical engineering from UC Berkeley.

SESSION: Keynote Talk III

Session details: Keynote Talk III

  • Paul Franzon

Data-driven CAD or Algorithm-Driven CAD: Competitors or Collaborators?

  • Rajeev Jain
  • Pankaj Kukkal

Motivation: Despite decades of R&D in algorithm-driven CAD, the design and implementation
of SoCs requires an ever-increasing amount of resources in terms of designers, compute
servers, and tool licenses. Design automation has not scaled with the complexity of
deep sub-micron fabrication processes or the complexity of optimizing the power, performance,
and area (PPA) of modern SoCs. There seems to be a fundamental limit to algorithm-driven
CAD that prevents tools from scaling to meet the increasing complexity. As technology
scaling reaches its limits and the PPA gains from technology scaling diminish,
the need for design tools to close the PPA gap through design will increase significantly,
making this problem worse.

Problem statement: SoC design consists of taking a chip hardware spec and generating the fabrication
mask spec, involving two main tasks: (1) logic synthesis and (2) physical design.
While algorithm-driven CAD tools exist to automate both of these tasks, they cannot meet
the PPA targets without a large number of manually guided design iterations that consume manpower,
compute, and tool resources.

Approach: Data-driven CAD can capture the learning from manual PPA optimization, and
data-driven tools inherently scale with design complexity. We explore the open problems
in using Data-driven CAD, to complement the automation capabilities of algorithm-driven
CAD and meet the increasing PPA demands of modern SoCs in deep-submicron technologies.

SESSION: Session 3: ML for Reliability Improvement

Session details: Session 3: ML for Reliability Improvement

  • Bei Yu

Data-Driven Fast Electrostatics and TDDB Aging Analysis

  • Shaoyi Peng
  • Wentian Jin
  • Liang Chen
  • Sheldon X.-D. Tan

Computing the electric potential and electric field is a critical step for modeling
and analysis of VLSI chips such as TDDB (Time dependent dielectric breakdown) aging
analysis. Data-driven deep learning approaches provide new perspectives for learning
physical laws and representations of the physical dynamics from data. In this
work, we propose a new data-driven, learning-based approach for fast 2D analysis of
electric potential and electric fields based on DNNs (deep neural networks). Our work
is based on the observation that a synthesized VLSI layout with multiple interconnect
layers can be viewed as layered images. Image transformation techniques via a CNN (convolutional
neural network) are adopted for the analysis. Once trained, the model is applicable
to any synthesized layout in the same technology. Training and testing are done on
a dataset built from a synthesized CPU chip. Results show that the proposed method
is around 138x faster than the conventional numerical-method-based software COMSOL,
while keeping 99% of the accuracy for potential analysis and 97% for TDDB aging analysis.
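
A minimal sketch of the layout-as-image idea, assuming a toy convolutional architecture and random stand-in data (not the authors' network), is shown below; it maps a multi-layer layout raster to a 2D potential map.

```python
# Illustrative image-to-image mapping from rasterized interconnect layers to a potential map.
import torch
import torch.nn as nn

class PotentialCNN(nn.Module):
    def __init__(self, in_layers=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_layers, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),        # one output channel: electric potential
        )

    def forward(self, layout):                     # layout: (batch, layers, H, W) metal/via masks
        return self.net(layout)

model = PotentialCNN()
layout = torch.rand(8, 4, 64, 64)                  # stand-in rasterized interconnect layers
target = torch.rand(8, 1, 64, 64)                  # stand-in solver-generated potential maps
loss = nn.functional.mse_loss(model(layout), target)
loss.backward()                                    # a training loop with an optimizer would follow
```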

HAT-DRL: Hotspot-Aware Task Mapping for Lifetime Improvement of Multicore System using
Deep Reinforcement Learning

  • Jinwei Zhang
  • Sheriff Sadiqbatcha
  • Yuanqi Gao
  • Michael O’Dea
  • Nanpeng Yu
  • Sheldon X.-D. Tan

In this work, we propose a novel learning-based task-to-core mapping technique to
improve lifetime and reliability based on advanced deep reinforcement learning. The
new method, called HAT-DRL, is based on the observation that on-chip temperature sensors
may not capture the true hotspots of the chip, which can lead to sub-optimal control
decisions. In the new method, we first perform data-driven learning to model the hotspot
activation indicator with respect to the resource utilization of different workloads.
On top of this, we propose to employ a recently proposed, highly robust, sample-efficient
soft-actor-critic deep reinforcement learning algorithm, which can learn optimal maximum
entropy policies to improve the long-term reliability and minimize the performance
degradation from NBTI/HCI effects. Lifetime and reliability improvement is achieved
by assigning a reward function, which penalizes continuously stressing the same hotspots
and encourages even stressing of cores. The proposed algorithm is validated with an
Intel i7-8650U four-core CPU platform executing CPU benchmark workloads for various
hotspot activation profiles. Our experimental results show that HAT-DRL balances the
stress between all cores and hotspots, and achieves 50% and 160% longer lifetime compared
to non-hotspot-aware and Linux default scheduling respectively. The proposed method
can also reduce the average temperature by exploiting the true-hotspot information.
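
A hedged illustration of the kind of reward shaping the abstract describes, penalizing repeated stress on the same hotspot and uneven aging across cores, is sketched below; the exact reward used by HAT-DRL is not reproduced here.

```python
# Illustrative reward: discourage repeatedly stressing one hotspot and encourage even stress.
import numpy as np

def shaped_reward(stress_history, chosen_core, perf_penalty=0.0):
    """stress_history: accumulated stress per core; chosen_core: core the task was mapped onto."""
    stress = stress_history.copy()
    stress[chosen_core] += 1.0                                # stressing the chosen core's hotspot
    imbalance = stress.max() - stress.mean()                  # uneven aging across cores is penalized
    repeat_penalty = stress[chosen_core] / stress.sum()       # repeatedly hitting the same hotspot
    return -(imbalance + repeat_penalty + perf_penalty), stress

reward, history = shaped_reward(np.zeros(4), chosen_core=2)
print(reward, history)
```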

Can Wear-Aware Memory Allocation be Intelligent?

  • Christian Hakert
  • Kuan-Hsun Chen
  • Jian-Jia Chen

Many non-volatile memories (NVM) suffer from severely reduced cell endurance and therefore
require wear-leveling. Heap memory, as one segment which may be mapped to
an NVM, shows a strongly application-dependent characteristic regarding the amount of memory
accesses and allocations. A simple deterministic strategy for wear leveling of the
heap may suffer when the available action space becomes too large. Therefore, we investigate
the employment of a reinforcement learning agent as a substitute for such a strategy in
this paper. The agent's objective is to learn a strategy that is optimal with respect
to the total memory wear-out. We conclude this work with an evaluation in which we compare
the deterministic strategy with the proposed agent. We report that our proposed agent
outperforms the simple deterministic strategy in several cases. However, we also report
further optimization potential in the agent design and deployment.

An Enhanced Machine Learning Model for Adaptive Monte Carlo Yield Analysis

  • Richard Kimmel
  • Tong Li
  • David Winston

This paper presents a novel methodology for generating machine learning models used
by an adaptive Monte Carlo analysis. The advantages of this methodology are that model
generation occurs at the beginning of the analysis with no retraining required, it
applies to both classification and regression models, and accuracy of the Monte Carlo
analysis is not impacted by the accuracy of the model. This paper discusses the details
of constructing and enhancing the machine learning model with emphasis on model training.
It will then show how the model enables a Monte Carlo analysis that monitors and adapts
to model mispredictions.

Towards NN-based Online Estimation of the Full-Chip Temperature and the Rate of Temperature
Change

  • Martin Rapp
  • Omar Elfatairy
  • Marilyn Wolf
  • Jörg Henkel
  • Hussam Amrouch

We propose a novel technique to estimate at run-time both the dynamic thermal map
of the whole chip and the rate of temperature change. Knowledge of the current temperature
is crucial for thermal management. Additional knowledge of the rate of temperature
change allows for predictions of temperatures in the near future, and, therefore,
enables proactive management. However, neither is achievable with existing thermal
sensors due to their limited number. Our technique is based on a neural network (NN)
to predict the rate of temperature change based on performance counter readings and
the current estimate of the thermal map. The thermal map is then updated based on
the prediction. At design-time, we create training data for the NN by recording performance
counters and the dynamic thermal map during the execution of mixed workloads. The
thermal map is recorded with the help of an infrared (IR) camera. At run-time, our
technique requires only performance counter readings. Our technique predicts temperature
changes accurately. However, absolute temperature estimation suffers from instability.
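
A minimal sketch of the run-time estimation loop, under the assumption of a forward-Euler update and a toy stand-in for the trained network, could look as follows.

```python
# Illustrative loop: a learned model predicts dT/dt from performance counters and the current
# thermal-map estimate, and the map is advanced by explicit integration.
import numpy as np

def estimate_step(thermal_map, perf_counters, rate_model, dt=0.1):
    """thermal_map: current per-pixel estimate [C]; rate_model: callable returning dT/dt."""
    features = np.concatenate([perf_counters, thermal_map])
    dT_dt = rate_model(features)               # NN inference; only counters are measured at run-time
    return thermal_map + dt * dT_dt            # forward-Euler update of the full-chip map

def toy_rate_model(features, n_counters=8, ambient=45.0):
    # stand-in for a trained network: activity heats the chip, which otherwise cools toward ambient
    activity = features[:n_counters].mean()
    temps = features[n_counters:]
    return 2.0 * activity - 0.1 * (temps - ambient)

tmap = np.full(64, 45.0)                       # 8x8 thermal map, flattened
for _ in range(10):
    tmap = estimate_step(tmap, np.random.rand(8), toy_rate_model)
print(tmap.reshape(8, 8).round(1))
```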

SESSION: Plenary III

Session details: Plenary III

  • Bei Yu

Design Challenges on Post Moore’s Law Era

  • Pak Hei Matthew Leung

IC companies nowadays struggle between increasing challenges in deep-submicron
processes and, at the same time, ever more stringent time-to-market cycles to satisfy
increasingly demanding consumers. As a result, engineers have to turn to more holistic optimizations
across software, architecture, micro-architecture, circuit design, and physical implementation.
The increase in complexity also demands a high level of automation and help from
design tools. We shall look into some of the solutions that we are exploring to cope
with the situation.

Biography: Mr. Matthew Leung serves as the director of Huawei Hong Kong Research Center,
with a current focus on the development of hardware, software, and algorithms for artificial
intelligence. Prior to that, he served as the director and a founding member of the HiSilicon
(a subsidiary of Huawei) Hong Kong R&D Center. His expertise and experience lie in
the fields of VLSI design for advanced communication chipsets, microprocessors, and
artificial intelligence. Mr. Leung received his BSc and MSc degrees in Electrical
Engineering from the University of Michigan and Stanford University, respectively.

SESSION: Keynote Talk IV

Session details: Keynote Talk IV

  • Hai Li (Helen) Li

Machine Learning in EDA: Opportunities and Challenges

  • Elias Fallon

Electronic Design Automation software has delivered semiconductor design productivity
improvements for decades. The next leap in productivity will come from the addition
of machine learning techniques to the toolbox of computational software capabilities
employed by EDA developers. Recent research and development into machine learning
for EDA points to clear patterns for how it impacts EDA tools, flows, and design challenges.
This research has also illustrated some of the challenges that will come with production
deployment of machine learning techniques into EDA tools and flows. This talk will
detail patterns observed in ML for EDA development, as well as discussing challenges
with productization of ML for EDA developments and the opportunities that it presents
for researchers.

Biography: Elias Fallon is currently Engineering Group Director at Cadence Design
Systems, a leading Electronic Design Automation company. He has been involved in EDA
for more than 20 years, from the founding of Neolinear, Inc., which was acquired by
Cadence in 2004. Elias was co-Principal Investigator on the MAGESTIC project, funded
by DARPA to investigate the application of Machine Learning to EDA for Package/PCB
and Analog IC. Elias also leads an innovation incubation team within the Custom IC
R&D group as well as other traditional EDA product teams. Beyond his work developing
electronic design automation tools, he has led software quality improvement initiatives
within Cadence, partnering with the Carnegie Mellon Software Engineering Institute.
Elias graduated from Carnegie Mellon University with an M.S. and B.S. in Electrical
and Computer Engineering. Elias, his wife and two children live north of Pittsburgh,
PA. https://www.linkedin.com/in/elias-fallon/

SESSION: Session 4: Intelligent Modeling

Session details: Session 4: Intelligent Modeling

  • Hai Li (Helen) Li

Track-Assignment Detailed Routing Using Attention-based Policy Model With Supervision

  • Haiguang Liao
  • Qingyi Dong
  • Weiyi Qi
  • Elias Fallon
  • Levent Burak Kara

Detailed routing is one of the most critical steps in analog circuit design. Complete
routing has become increasingly challenging in advanced-node analog circuits,
making advances in efficient automatic routers ever more necessary. In this work,
we propose a machine learning driven method for solving the track-assignment detailed
routing problem for advanced node analog circuits. Our approach adopts an attention-based
reinforcement learning (RL) policy model. Our main insight and advancement over this
RL model is the use of supervision as a way to leverage solutions generated by a conventional
genetic algorithm (GA). For this, our approach minimizes the Kullback-Leibler divergence
loss between the output from the RL policy model and a solution distribution obtained
from the genetic solver. The key advantage of this approach is that the router can
learn a policy in an offline setting with supervision, while improving the run-time
performance nearly 100× over the genetic solver. Moreover, the quality of the solutions
our approach produces matches well with those generated by the GA. We show that, especially
for complex problems, our supervised RL method provides good-quality solutions similar
to conventional attention-based RL without compromising run-time performance. The ability
to learn from example designs and train the router to get similar solutions with orders
of magnitude run-time improvement can impact the design flow dramatically, potentially
enabling increased design exploration and routability-driven placement.
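
The supervision signal described above can be illustrated with a small PyTorch sketch: a policy network's output distribution is pulled toward a GA-derived solution distribution via a KL-divergence loss. The state and action encodings here are placeholders, not the authors' implementation.

```python
# Illustrative KL-divergence supervision of a routing policy against a genetic-solver distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))  # 8 candidate track assignments

state = torch.rand(32, 16)                               # batch of encoded routing states
ga_targets = torch.softmax(torch.rand(32, 8), dim=-1)    # solution distribution from the genetic solver

log_probs = F.log_softmax(policy(state), dim=-1)
loss = F.kl_div(log_probs, ga_targets, reduction="batchmean")  # pull policy toward GA distribution
loss.backward()                                          # offline supervised training step
```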

Compact Models for Initial MOSFET Sizing Based on Higher-order Artificial Neural Networks

  • Husni Habal
  • Dobroslav Tsonev
  • Matthias Schweikardt

Simple MOSFET models intended for hand analysis are inaccurate in deep sub-micrometer
process technologies and in the moderate inversion region of device operation. Accurate
models, such as the Berkeley BSIM6 model, are too complex for use in hand analysis
and are intended for circuit simulators. Artificial neural networks (ANNs) are efficient
at capturing both linear and non-linear multivariate relationships. In this work,
a straightforward modeling technique is presented using ANNs to replace the BSIM model
equations. Existing open-source libraries are used to quickly build models with error
rates generally below 3%. When combined with a novel approach, such as the gm/Id systematic
design method, the presented models are sufficiently accurate for use in the initial
sizing of analog circuit components without simulation.
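
As a hedged illustration of replacing compact-model equations with a learned lookup (here a plain MLP on synthetic square-law data rather than the higher-order networks and SPICE-swept data used in the paper), consider the following sketch.

```python
# Illustrative ANN device model: map bias and geometry to metrics used in gm/Id-style sizing.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# features: Vgs [V], Vds [V], W [um], L [um]; stand-in samples, real data would come from SPICE sweeps
X = rng.uniform([0.2, 0.1, 0.5, 0.03], [1.0, 1.0, 20.0, 1.0], size=(2000, 4))
Id = 1e-4 * X[:, 2] / X[:, 3] * np.maximum(X[:, 0] - 0.4, 0.0) ** 2   # toy square-law "truth"
gm = 2e-4 * X[:, 2] / X[:, 3] * np.maximum(X[:, 0] - 0.4, 0.0)
Y = np.column_stack([Id, gm])

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0))
model.fit(X, Y)
Id_hat, gm_hat = model.predict([[0.7, 0.6, 10.0, 0.1]])[0]
print("predicted gm/Id:", gm_hat / Id_hat)          # the ratio drives initial transistor sizing
```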

An Efficient and Flexible Learning Framework for Dynamic Power and Thermal Co-Management

  • Yuan Cao
  • Tianhao Shen
  • Li Zhang
  • Xunzhao Yin
  • Cheng Zhuo

In the era of Artificial Intelligence and the Internet of Things (AIoT), battery-powered
mobile devices are required to perform more sophisticated tasks featuring fast-varying
workloads and a constrained power supply, demanding more efficient run-time
power management. In this paper, we propose a deep reinforcement learning framework
for dynamic power and thermal co-management. We build several machine learning models
that incorporate the physical details for an ARM Cortex-A72, with on average 3% and
1% error for power and temperature predictions, respectively. We then build an efficient
deep reinforcement learning control incorporating the machine learning models and
facilitating the run-time dynamic voltage and frequency scaling (DVFS) strategy selection
based on the predicted power, workloads and temperature. We evaluate our proposed
framework, and compare the performance with existing management methods. The results
suggest that our proposed framework can achieve 6.8% performance improvement compared
with other alternatives.

Partial Sharing Neural Networks for Multi-Target Regression on Power and Performance
of Embedded Memories

  • Felix Last
  • Ulf Schlichtmann

Memories contribute significantly to the overall power, performance and area (PPA)
of modern integrated electronic systems. Owing to their regular structure, memories
are generated by memory compilers in modern industrial designs. Although such compilers
provide PPA-efficient and silicon-verified layouts, the large and growing number of
input parameters to the compilers themselves results in a new challenge of compiler
parameter selection given design requirements. The dimensionality of the search space
as well as the count of memories prohibit manual tuning in fast-paced design cycles.
To efficiently select optimal compiler parameters, we devise regression neural networks
as PPA models of memory compilers, based on which an optimal parameterization can
be selected. Highly accurate PPA estimates are a prerequisite to a reliable optimization.
While regression with multiple targets can easily be achieved by neural networks with
multiple output units, model accuracy depends highly on architecture and hyperparameters.
We study how neural network prediction error on multi-target regression problems can
be reduced, validating recent findings that partial parameter sharing is beneficial
to this class of problems. Our real-world application confirms the benefits of partial
sharing for multi-target regression, and asserts the applicability to the sigmoid
activation function. The accuracy of memory compiler PPA prediction is improved by
approximately ten percent on average, decreasing worst-case prediction errors by over
50 percent.
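
A minimal PyTorch sketch of partial parameter sharing for multi-target regression, with shared early layers, target-specific heads, and sigmoid activations, is given below; sizes and the degree of sharing are illustrative assumptions.

```python
# Illustrative partial-sharing network: a shared trunk with unshared per-target branches.
import torch
import torch.nn as nn

class PartialSharingNet(nn.Module):
    def __init__(self, in_dim=12, targets=("power", "performance", "area")):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, 64), nn.Sigmoid())      # shared layers
        self.heads = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(64, 32), nn.Sigmoid(), nn.Linear(32, 1))
            for t in targets                                                   # target-specific layers
        })

    def forward(self, x):
        h = self.shared(x)
        return torch.cat([head(h) for head in self.heads.values()], dim=-1)

net = PartialSharingNet()
params = torch.rand(16, 12)          # memory-compiler input parameters (batch of 16)
ppa = net(params)                    # (16, 3) predicted power / performance / area
```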

Explaining and Interpreting Machine Learning CAD Decisions: An IC Testing Case Study

  • Prashanth Krishnamurthy
  • Animesh Basak Chowdhury
  • Benjamin Tan
  • Farshad Khorrami
  • Ramesh Karri

We provide a methodology to explain and interpret machine learning decisions in Computer-Aided
Design (CAD) flows. We demonstrate the efficacy of the methodology on a VLSI testing
case. Such a tool will provide designers with insight into the “black box” machine
learning models/classifiers through human readable sentences based on normally understood
design rules or new design rules. The methodology builds on an intrinsically explainable,
rule-based ML framework, called Sentences in Feature Subsets (SiFS), to mine human
readable decision rules from empirical data sets. SiFS derives decision rules as compact
Boolean logic sentences involving subsets of features in the input data. The approach
is applied to the test point insertion problem in circuits and compared to the ground
truth and traditional design rules.

SESSION: Plenary IV

Session details: Plenary IV

  • Raviv Gal

Machine-Learning Enabled Next-Generation Physical Design – An EDA Perspective

  • Vishal Khandelwal

Physical design is an ensemble of NP-complete problems that P&R tools attempt to solve
in (pseudo) linear time. Advanced process nodes and complex signoff requirements bring
in new physical and timing constraints into the implementation flow, making it harder
for physical design algorithms to deliver industry-leading power, performance, area
(PPA), without giving up design turn-around time. The relentless pursuit of low-power,
high-performance designs is putting constant pressure on limiting any over-design, creating
an acute need for better models/predictions and advanced analytics to drive implementation
flows. Given the advancements in supervised and reinforcement learning, combined with
the availability of large-scale compute, Machine Learning (ML) has the potential to
become a disruptive paradigm change for EDA tools. In this talk, I would like to share
some of the challenges and opportunities for innovation in next-generation physical
design using ML.

Biography: Vishal leads the physical optimization team for the Digital Implementation
products at Synopsys. He has 15 years of R&D experience in building state-of-the-art
optimization engines and P&R flows targeting advanced-node low-power high-performance
designs. More recently, he has been looking at bringing machine-learning paradigms
into digital implementation tools to improve power, performance, area and productivity.
Vishal has a B.Tech. from the Indian Institute of Technology, Kanpur, and a Ph.D. from
the University of Maryland, College Park. He has won a best paper award at ISPD, co-authored
several patents and over 20 IEEE/ACM publications.

SESSION: Panel

Session details: Panel

  • Raviv Gal

ML for CAD – Where is the Treasure Hiding?

  • Raviv Gal
  • David Z. Pan
  • Haoxing Ren
  • Manish Pandey
  • Marilyn Wolf
  • Avi Ziv

Advances in ML have revolutionized its effectiveness for a variety of applications.
Indeed, in areas like image classification and NLP, ML (AI) has changed the rules
of the game and opened the door to incredible advances. Design processes seem to match
the ML paradigm perfectly. This mature area is highly automated, combines advanced
analytic techniques, and generates large volumes of data that are used during the
processes. With the promise of saving resources and improving quality, ML for CAD
has attracted a lot of attention in the industry and academia. This is well reflected
in conferences and journals, and the most advanced success stories and works-in-progress
are being presented at MLCAD 2020.

SESSION: Session 5: ML for Systems

Session details: Session 5: ML for Systems

  • Hussam Amrouch

Using Machine Learning Clustering To Find Large Coverage Holes

  • Raviv Gal
  • Giora Simchoni
  • Avi Ziv

Identifying large and important coverage holes is a time-consuming process that requires
expertise in the design and its verification environment. This paper describes a novel
machine learning-based technique for finding large coverage holes when the coverage
events are individually defined. The technique is based on clustering the events according
to their names and mapping the clusters into cross-products. Our proposed technique
is being used in the verification of high-end servers. It has already improved the
quality of coverage analysis and helped identify several environment problems.
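
A hedged illustration of name-based event clustering (not the authors' tooling): coverage-event names are vectorized and grouped so that each cluster can be inspected as a candidate cross-product containing potential holes.

```python
# Illustrative clustering of coverage events by their hierarchical names.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

events = [
    "core0.lsu.load.hit", "core0.lsu.load.miss",
    "core0.lsu.store.hit", "core0.lsu.store.miss",
    "core1.fpu.add.normal", "core1.fpu.add.denormal",
]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(events)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for name, cluster in zip(events, labels):
    print(cluster, name)    # events in one cluster map onto a cross-product (unit x op x outcome)
```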

Exploring Logic Optimizations with Reinforcement Learning and Graph Convolutional
Network

  • Keren Zhu
  • Mingjie Liu
  • Hao Chen
  • Zheng Zhao
  • David Z. Pan

Logic synthesis for combinational circuits aims to find the minimum equivalent representation
of Boolean logic functions. A well-adopted logic synthesis paradigm represents the
Boolean logic with standardized logic networks, such as and-inverter graphs (AIG),
and performs logic minimization operations over the graph iteratively. Although
research on different logic representations and operations is fruitful, the sequence
in which the operations are applied is often determined by heuristics. We propose a Markov decision
process (MDP) formulation of the logic synthesis problem and a reinforcement learning
(RL) algorithm incorporating a graph convolutional network to explore the solution
search space. The experimental results show that the proposed method outperforms
well-known logic synthesis heuristics with the same sequence length and action space.
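
The MDP view can be illustrated with a toy sketch in which the state summarizes the current logic network, actions are synthesis operations, and the reward is the node-count reduction; the environment below is a stand-in, not an interface to an actual synthesis tool.

```python
# Illustrative toy environment for the synthesis-as-MDP formulation.
import random

ACTIONS = ["balance", "rewrite", "refactor", "resub"]

class ToySynthesisEnv:
    def __init__(self, initial_nodes=10000):
        self.nodes = initial_nodes

    def step(self, action):
        reduction = random.uniform(0.0, 0.05) * self.nodes   # stand-in effect of the operation
        self.nodes -= reduction
        reward = reduction                                   # RL maximizes the total node reduction
        return {"nodes": self.nodes, "last_action": action}, reward

env = ToySynthesisEnv()
for a in ["rewrite", "balance", "resub", "rewrite"]:         # a fixed sequence; RL would learn one
    state, r = env.step(a)
print(state)
```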

AdaPool: Multi-Armed Bandits for Adaptive Virology Screening on Cyber-Physical Digital-Microfluidic
Biochips

  • Mohamed Ibrahim

Cyber-physical digital microfluidics is a versatile lab-on-chip technology that offers
key advantages in reconfigurability, manufacturability, and sensor integration. Critical
applications such as point-of-care testing (POCT) are expected to benefit the most
from this technology, thus motivating a great body of literature that addresses performance,
cost, and reliability using design-automation methodologies. Despite this effort,
today’s solutions are unable to support the most critical application in the modern
era; that is, cost-effective POCT for rapid virology screening. This application poses
new design challenges related to the testing capacity and adaptability to the infection
distribution within target populations. To support this application, we present a
reinforcement-learning method that enables a cyber-physical digital-microfluidic platform
to learn from its testing results. The proposed method, named AdaPool, uses multi-armed
bandits to infer the dynamics of viral infection and hence adapt the microfluidic
system to an effective testing strategy. Simulation results illustrate the effectiveness
of the proposed method at different infection conditions.
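
A minimal sketch of the multi-armed-bandit idea, with each arm a candidate pooling strategy and a toy surrogate reward (all names and numbers are illustrative assumptions), might look like this.

```python
# Illustrative epsilon-greedy bandit: observed screening outcomes update per-arm value estimates.
import numpy as np

rng = np.random.default_rng(0)
arms = ["individual", "pool_of_4", "pool_of_8"]            # candidate testing strategies
counts, values = np.zeros(3), np.zeros(3)

def observed_reward(arm, infection_rate=0.05):
    # toy surrogate: larger pools save tests at low prevalence but pay for retests otherwise
    tests_saved = {"individual": 0.0, "pool_of_4": 0.6, "pool_of_8": 0.8}[arms[arm]]
    retest_cost = {"individual": 0.0, "pool_of_4": 3.0, "pool_of_8": 7.0}[arms[arm]] * infection_rate
    return tests_saved - retest_cost + rng.normal(0, 0.05)

for t in range(500):                                       # epsilon-greedy selection loop
    arm = rng.integers(3) if rng.random() < 0.1 else int(np.argmax(values))
    r = observed_reward(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]         # running-mean value estimate

print(dict(zip(arms, values.round(3))))
```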

Automatic compiler optimization on embedded software through k-means clustering

  • Michael Werner
  • Lorenzo Servadei
  • Robert Wille
  • Wolfgang Ecker

Generating instead of implementing variable design platforms is becoming increasingly
popular in the development of System on Chips. This shift also poses the challenge
of rapid compiler optimization that adapts to each newly generated platform. In this
paper, we evaluate the impact of 104 compiler flags on memory usage and core execution
time against standard optimization levels. Each flag has a different influence on
these costs, which is difficult to predict. In this work, we apply cost estimation
methods to predict the impact of each flag on the generated core using unsupervised
Machine Learning, in the form of k-means clustering. The key strengths of the approach
are the low need for data, the adaptability to new cores, and the ease of use. This
helps the designer understand the impact of flags on related applications, showing
which combination optimizes the most. As a result, we obtain a 20.93% optimization
on software size, 3.10% on performance, and 1.75% on their trade-off beyond
the -O3 optimization level.
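
A minimal sketch of the clustering step, assuming each flag is represented by its measured impact on code size and runtime relative to a baseline (synthetic numbers here), could be:

```python
# Illustrative k-means grouping of compiler flags by their measured cost impact.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# one row per flag: [% change in size, % change in runtime] vs. a reference optimization level
flag_impact = rng.normal(0, 5, size=(104, 2))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(flag_impact)
for cluster_id in range(5):
    members = np.flatnonzero(kmeans.labels_ == cluster_id)
    centroid = kmeans.cluster_centers_[cluster_id]
    print(f"cluster {cluster_id}: {len(members)} flags, mean impact {centroid.round(2)}")
```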

Transfer Learning for Design-Space Exploration with High-Level Synthesis

  • Jihye Kwon
  • Luca P. Carloni

High-level synthesis (HLS) raises the level of design abstraction, expedites the process
of hardware design, and enriches the set of final designs by automatically translating
a behavioral specification into a hardware implementation. To obtain different implementations,
HLS users can apply a variety of knobs, such as loop unrolling or function inlining,
to particular code regions of the specification. The applied knob configuration significantly
affects the synthesized design’s performance and cost, e.g., application latency and
area utilization. Hence, HLS users face the design-space exploration (DSE) problem,
i.e., determining which knob configurations result in Pareto-optimal implementations
in this multi-objective space. Since it can be costly in time and resources to run
HLS flows with an enormous number of knob configurations, machine learning approaches
can be employed to predict the performance and cost. Still, they require a sufficient
number of sample HLS runs. To enhance the training performance and reduce the sample
complexity, we propose a transfer learning approach that reuses the knowledge obtained
from previously explored design spaces in exploring a new target design space. We
develop a novel neural network model for mixed-sharing multi-domain transfer learning.
Experimental results demonstrate that the proposed model outperforms both single-domain
and hard-sharing models in predicting the performance and cost at early stages of
HLS-driven DSE.

Footprint Classification of Electric Components on Printed Circuit Boards

  • Yun-Jie Ni
  • Yan-Jhih Wang
  • Tsung-Yi Ho

The market for Printed Circuit Boards (PCBs) is growing fast with the growth of
the Internet of Things. Therefore, PCB manufacturers require an effective design methodology
to accelerate the PCB manufacturing processes. To design PCBs for new components,
footprints that contain component information are needed to mount components on a
PCB. However, current footprint design relies on experienced engineers who may
not follow consistent rule guidelines, which makes it a time-consuming task in the design flow.
To achieve footprint design automation, analysis of footprint design rules is necessary,
and footprint classification can help sort out design rules for different types
of components. In this paper, we adopt both footprint and file name information to
classify footprints. Through the proposed methodology, we can classify footprint
libraries with higher accuracy so as to achieve footprint design automation.

NOCS 2019 TOC

NOCS ’19: Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip



SESSION: NoC and router design

UBERNoC: unified buffer power-efficient router for network-on-chip

  • Hossein Farrokhbakht
  • Henry Kao
  • Natalie Enright Jerger

Networks-on-Chip (NoCs) address many shortcomings of traditional interconnects. However,
they consume a considerable portion of a chip’s total power – particularly when
utilization is low. As transistor sizes continue to shrink, we expect NoCs to contribute
even more, especially in static power. A wide range of prior art focuses on reducing
NoC power consumption. These approaches can be categorized into two main
groups: (1) power-gating and (2) simplified router microarchitectures. Maintaining
the performance and the flexibility of the network are key challenges that have not
yet been addressed by these two groups of low-power architectures. In this paper,
we propose UBERNoC, a simplified router microarchitecture, which reduces underutilized
buffer space by leveraging an observation that for most switch traversals, only a
single packet is present. We use a unified buffer with multiple virtual channels shared
amongst the input ports to reduce both power and area. The empirical results demonstrate
that compared to a conventional router, UBERNoC achieves 58% and 69% reduction in
power and area respectively, with negligible latency overhead.

Ghost routers: energy-efficient asymmetric multicore processors with symmetric NoCs

  • Hyojun Son
  • Hanjoon Kim
  • Hao Wang
  • Nam Sung Kim
  • John Kim

Asymmetric multicore architectures have been proposed to exploit the benefits of heterogeneous
cores. However, asymmetric cores present a challenge to network-on-chip (NoC) designers
since the floorplan is not necessarily regular, with “nodes” of different sizes.
In contrast, most previously proposed NoC topologies assume a regular
or symmetric floorplan with equal-size nodes. In this work, we first describe how
an asymmetric floorplan leads to an asymmetric topology and can limit overall performance.
To overcome the asymmetric floorplan, we present Ghost Routers – extra “dummy” routers that are added to the NoC to create a symmetric NoC architecture
for asymmetric multicore architectures. Ghost routers provide higher network path
diversity and higher network performance, which leads to higher system performance.
Ghost routers also enable simpler routing algorithms because of the symmetric NoC
architecture. While ghost routers are a simple modification to the NoC architecture,
they do increase NoC cost. However, ghost routers exploit the observation that, in
realistic systems, the cost of the NoC is not a significant fraction of overall system
cost. Our evaluations show that ghost routers can improve performance by up to 21%
while improving the overall energy efficiency of the system by up to 26%.

BINDU: deadlock-freedom with one bubble in the network

  • Mayank Parasar
  • Tushar Krishna

Every interconnection network must ensure, for its functional correctness, that it
is deadlock free. A routing deadlock occurs when there is a cyclic dependency of packets
when acquiring the buffers of the routers. Prior solutions have provisioned an extra
set of escape buffers to resolve deadlocks, or restrict the path that a packet can take in the
network by disallowing certain turns. This either pays higher power/area overhead
or impacts performance. In this work, we demonstrate that (i) keeping one virtual-channel
in the entire network (called ‘Bindu’) empty, and (ii) forcing it to move through
all input ports of every router in the network via a pre-defined path, can guarantee
deadlock-freedom. We show that our scheme (a) is topology agnostic (we evaluate it
on multiple topologies, both regular and irregular), (b) does not impose any turn
restrictions on packets, (c) does not require an extra set of escape buffers, and
(d) is free from the complex circuitry for detecting and recovering from deadlocks.
We report 15% average improvement in throughput for synthetic traffic and 7% average
reduction in runtime for real applications over state-of-the-art deadlock freedom
schemes.

SESSION: Best paper nominees

NoC-enabled software/hardware co-design framework for accelerating k-mer counting

  • Biresh Kumar Joardar
  • Priyanka Ghosh
  • Partha Pratim Pande
  • Ananth Kalyanaraman
  • Sriram Krishnamoorthy

Counting k-mers (substrings of fixed length k) in DNA and protein sequences generates non-uniform and irregular memory access patterns.
Processing-in-Memory (PIM) architectures have the potential to significantly reduce
the overheads associated with such frequent and irregular memory accesses. However,
existing k-mer counting algorithms are not designed to exploit the advantages of PIM architectures.
Furthermore, owing to thermal constraints, the allowable power budget is limited in
conventional PIM designs. Moreover, k-mer counting generates unbalanced and long-range traffic patterns that need to be handled
by an efficient Network-on-Chip (NoC). In this paper, we present an NoC-enabled software/hardware
co-design framework to implement high-performance k-mer counting. The proposed architecture enables more computational power, efficient communication
between cores/memory – all without creating a thermal bottleneck; while the software
component exposes more in-memory opportunities to exploit the PIM and aids in the
NoC design. Experimental results show that the proposed architecture outperforms a
state-of-the-art software implementation of k-mer counting utilizing Hybrid Memory Cube (HMC), by up to 7.14X, while allowing significantly
higher power budgets.

SMART++: reducing cost and improving efficiency of multi-hop bypass in NoC routers

  • Iván Pérez
  • Enrique Vallejo
  • Ramón Beivide

Low latency and low implementation cost are two key requirements in NoCs. SMART routers
implement multi-hop bypass, obtaining latency values close to an ideal point-to-point
interconnect. However, it requires a significant amount of resources such as Virtual
Channels (VCs), which are not used as efficiently as possible, preventing bypass in
certain scenarios. This translates into increased area and delay, compared to an ideal
implementation.

In this paper, we introduce SMART++, an efficient multi-hop bypass mechanism which
combines four key ideas: SMART bypass, multi-packet buffers, Non-Empty Buffer Bypass and Per-packet allocation. SMART++ relies on a more aggressive VC reallocation policy
and supports bypass of buffers even when they are not completely free. With these
desirable characteristics, SMART++ requires limited resources and exhibits high performance.

SMART++ is evaluated using functional simulation and HDL synthesis tools. SMART++
without VCs and with a reduced amount of buffer slots outperforms the original SMART
using 8 VCs, while reducing the amount of logic and dynamic power in an FPGA by 5.5x
and 5.0x, respectively. Additionally, it allows for an up to 2.1x higher frequency; this might
translate into more than a 31.9% base latency reduction and a 42.2% throughput increase.

APEC: improved acknowledgement prioritization through erasure coding in bufferless NoCs

  • Michael Vonbun
  • Adrian Schiechel
  • Nguyen Anh Vu Doan
  • Thomas Wild
  • Andreas Herkersdorf

Bufferless NoCs have been proposed as they come with a decreased silicon area footprint
and a reduced power consumption, when compared to buffered NoCs. However, while known
for their inherent simplicity, they suffer from early saturation and depend on additional
measures to ensure reliable packet delivery, such as control protocols based on ACKs
or NACKs. In this paper, we propose APEC, a novel concept for bufferless NoCs that
allows ACKs and NACKs to be prioritized over single payload flits of colliding packets
by discarding the latter. Lightweight heuristic erasure codes are used to compensate
for discarded payload flits. By trading off the erasure code overhead for packet retransmissions,
a more efficient network operation is achieved. For ACK-based networks, APEC saturates
at 2.1x and 2.875x higher generation rates than a conventional ACK-based bufferless
NoC for packets between 5 and 17 flits. For NACK-based networks, APEC does not require
concepts such as deflection routing or circuit-switched overlay NACK-networks, as
prior work does. Therefore, it can simplify the network implementation compared to
prior work while achieving similar performance.

SESSION: NoC potpourri

ClusCross: a new topology for silicon interposer-based network-on-chip

  • Hesam Shabani
  • Xiaochen Guo

The increasing number of cores challenges the scalability of chip multiprocessors.
Recent studies proposed the idea of disintegration by partitioning a large chip into
multiple smaller chips and using silicon interposer-based integration (2.5D) to connect
these smaller chips. This method can improve yield, but as the number of small chips
increases, the chip-to-chip communication becomes a performance bottleneck.

This paper proposes a new network topology, ClusCross, to improve network performance
for multicore interconnection networks on silicon interposer-based systems. The key
idea is to treat each small chip as a cluster and use cross-cluster long links to
increase bisection width and decrease average hop count without increasing the number
of ports in the routers. Synthetic traffic patterns and real applications are simulated
on a cycle-accurate simulator. Network latency reduction and saturation throughput
improvement are demonstrated as compared to previously proposed topologies. Two versions
of the ClusCross topology are evaluated. One version of ClusCross has a 10% average
latency reduction for coherence traffic as compared to the state-of-the-art network-on-interposer
topology, the misaligned ButterDonut. The other version of ClusCross has a 7% and
a 10% reduction in power consumption as compared to the FoldedTorus and the ButterDonut
topologies, respectively.

Distributed SDN architecture for NoC-based many-core SoCs

  • Marcelo Ruaro
  • Nedison Velloso
  • Axel Jantsch
  • Fernando G. Moraes

In the Software-Defined Networking (SDN) paradigm, routers are generic and programmable
forwarding units that transmit packets according to a given policy defined by a software
controller. Recent research has shown the potential of such a communication concept
for NoC management, resulting in hardware complexity reduction, management flexibility,
real-time guarantees, and self-adaptation. However, a centralized SDN controller is
a bottleneck for large-scale systems.

Assuming an NoC with multiple physical subnets, this work proposes a distributed SDN
architecture (D-SDN), with each controller managing one cluster of routers. Controllers
work in parallel for local (intra-cluster) paths. For global (inter-cluster) paths,
the controllers execute a synchronization protocol inspired by VLSI routing, with
global and detailed routing phases. This work also proposes a short path establishment
heuristic for global paths that exploits the controllers' parallelism.

D-SDN outperforms a centralized approach (C-SDN) for larger networks without loss
of success rate. Evaluations with up to 2,304 cores and 6 subnets show that: (i) D-SDN outperforms C-SDN in path establishment latency by up to 69.7% for 1 subnet above
32 cores, and by 51% for 6 subnets above 1,024 cores; (ii) D-SDN achieves a smaller latency than C-SDN (on average 54%) for scenarios with
more than 70% of local paths; (iii) the path success rate, for all scenarios, is similar in both approaches, with an
average difference of 1.7%; (iv) the data storage for the C-SDN controller increases with the system size, while
it remains constant for D-SDN.

Approximate nanophotonic interconnects

  • Jaechul Lee
  • Cédric Killian
  • Sébastien Le Beux
  • Daniel Chillet

The energy consumption of manycore architectures is dominated by data movement, which calls for
energy-efficient and high-bandwidth interconnects. Integrated optics is a promising
technology to overcome the bandwidth limitations of electrical interconnects. However,
it suffers from high power overhead related to low efficiency lasers, which calls
for the use of approximate communications for error tolerant applications. In this
context, this paper investigates the design of an Optical NoC supporting the transmission
of approximate data. For this purpose, the least significant bits of floating point
numbers are transmitted with low power optical signals. A transmission model allows
estimating the laser power according to the targeted BER and a micro-architecture
allows configuring, at run-time, the number of approximated bits and the laser output
powers. Simulation results show that, compared to an interconnect involving only
robust communications, approximations in the optical transmission lead to up to a 42%
laser power reduction for an image processing application with limited degradation
at the application level.

Direct-modulated optical networks for interposer systems

  • Mohammad Reza Jokar
  • Lunkai Zhang
  • John M. Dallesasse
  • Frederic T. Chong
  • Yanjing Li

We present a new interposer-level optical network based on direct-modulated lasers
such as vertical-cavity surface-emitting lasers (VCSELs) or transistor lasers (TLs).
Our key observation is that the physics of these lasers is such that they must transmit
significantly more power (21x) than is needed by the receiver. We take advantage of
this excess optical power to create a new network architecture called Rome, which splits optical signals using passive splitters to allow flexible bandwidth
allocation among different transmitter and receiver pairs while imposing minimal power
and design costs. Using multi-chip module GPUs (MCM-GPUs) as a case study, we thoroughly
evaluate network power and performance, and show that (1) Rome is capable of efficiently
scaling up MCM-GPUs with up to 1024 streaming multiprocessors, and (2) Rome outperforms
various competing designs in terms of energy efficiency (by up to 4x) and performance
(by up to 143%).

SESSION: Interconnection networks for deep neural networks

NoC-based DNN accelerator: a future design paradigm

  • Kun-Chih (Jimmy) Chen
  • Masoumeh Ebrahimi
  • Ting-Yi Wang
  • Yuch-Chi Yang

Deep Neural Networks (DNN) have shown significant advantages in many domains such
as pattern recognition, prediction, and control optimization. The edge computing demand
in the Internet-of-Things era has motivated many kinds of computing platforms to accelerate
the DNN operations. The most common platforms are CPU, GPU, ASIC, and FPGA. However,
these platforms suffer from low performance (i.e., CPU and GPU), large power consumption (i.e., CPU, GPU, ASIC, and FPGA), or low computational flexibility at runtime (i.e., FPGA and ASIC). In this paper, we suggest the NoC-based DNN platform as a new accelerator
design paradigm. The NoC-based designs can reduce the off-chip memory accesses through
a flexible interconnect that facilitates data exchange between processing elements
on the chip. We first comprehensively investigate conventional platforms and methodologies
used in DNN computing. Then we study and analyze different design parameters to implement
the NoC-based DNN accelerator. The presented accelerator is based on mesh topology,
neuron clustering, random mapping, and XY-routing. The experimental results on LeNet,
MobileNet, and VGG-16 models show the benefits of the NoC-based DNN accelerator in
reducing off-chip memory accesses and improving runtime computational flexibility.

Energy-efficient and high-performance NoC architecture and mapping solution for deep
neural networks

  • Md Farhadur Reza
  • Paul Ampadu

With the advancement and miniaturization of transistor technology, hundreds of cores
can be integrated on a single chip. Network-on-Chips (NoCs) are the de facto on-chip communication fabrics for multi/many core systems because of their benefits
over the traditional bus in terms of scalability, parallelism, and power efficiency
[20]. Because of these properties of NoC, communication architecture for different
layers of a deep neural network can be developed using NoC. However, traditional NoC
architectures and strategies may not be suitable for running deep neural networks
because of the different types of communication patterns (e.g. one-to-many and many-to-one
communication between layers and zero communication within a single layer) in neural
networks. Furthermore, because of the different communication patterns, computations
of the different layers of a neural network need to be mapped in a way that reduces
the communication bottleneck in the NoC. Therefore, we explore different NoC architectures
and mapping solutions for deep neural networks, and then propose an efficient concentrated
mesh NoC architecture and a load-balanced mapping solution (including mathematical
model) for accelerating deep neural networks. We also present preliminary results
to show the effectiveness of our proposed approaches to accelerate deep neural networks
while achieving energy-efficient and high-performance NoC.

Flow mapping and data distribution on mesh-based deep learning accelerator

  • Seyedeh Yasaman Hosseini Mirmahaleh
  • Midia Reshadi
  • Hesam Shabani
  • Xiaochen Guo
  • Nader Bagherzadeh

Convolutional neural networks have been proposed as an approach for classifying data
corresponding to labeled and unlabeled datasets. The fast-growing data empowers deep
learning algorithms to achieve higher accuracy. Numerous trained models have been
proposed, which involve complex algorithms and increasing network depth. The main
challenges of implementing deep convolutional neural networks are high energy consumption,
high on-chip and off-chip bandwidth requirements, and large memory footprint. Different
types of on-chip communication networks and traffic distribution methods have been
proposed to reduce memory access latency and energy consumption of data movement.
This paper proposes a new traffic distribution mechanism on a mesh topology using
distributor nodes, considering the memory access mechanisms of the AlexNet, VggNet, and
GoogleNet trained models. We also propose a flow mapping method (FMM) based on dataflow
stationarity, which reduces energy consumption by 8%.

SESSION: Heterogeneous integration and interconnect fabrics

3D NoCs with active interposer for multi-die systems

  • Vasil Pano
  • Ragh Kuttappa
  • Baris Taskin

Advances in interconnect technologies for system-in-package manufacturing have re-introduced
multi-chip module (MCM) architectures as an alternative to the current monolithic
approach. MCMs or multi-die systems implement multiple smaller chiplets in a single
package. These MCMs are connected through various package interconnect technologies,
such as current industry solutions in AMD’s Infinity Fabric, Intel’s Foveros active
interposer, and Marvell’s Mochi Interconnect. Although MCMs improve manufacturing
yields and are cost-effective, additional challenges on the Network-on-Chip (NoC)
within a single chiplet and across multiple chiplets need to be addressed. These challenges
include routing, scalability, performance, and resource allocation. This work introduces
a scalable MCM 3D interconnect infrastructure called “MCM-3D-NoC” with multiple 3D
chiplets connected through an active interposer. System-level simulations of MCM-3D-NoC
are performed to validate the proposed architecture and provide performance evaluation
of network latency, throughput, and EDP.

Global and semi-global communication on Si-IF

  • Boris Vaisband
  • Subramanian S. Iyer

On-chip scaling continues to pose significant technological and design challenges.
Nonetheless, the key obstacle in on-chip scaling is the high fabrication cost of the
state-of-the-art technology nodes. An opportunity exists however, to continue scaling
at the system level. Silicon interconnect fabric (Si-IF) is a platform that aims to
replace both the package and printed circuit board to enable heterogeneous integration
and high inter-chip performance. Bare dies are attached directly to the Si-IF at fine
vertical interconnect pitch (2 to 10 μm) and small inter-die spacing (≤ 100 μm). The
Si-IF is a single-hierarchy integration construct that supports dies of any process,
technology, and dimensions. In addition to development of the fabrication and integration
processes, system-level challenges need to be addressed to enable integration of heterogeneous
systems on the Si-IF. Communication is a fundamental challenge on large Si-IF platforms
(up to 300 mm diameter wafers). Different technological and design approaches for
global and semi-global communication are discussed in this paper. The area overhead
associated with global communication on the Si-IF is determined.

A 7.5-mW 10-Gb/s 16-QAM wireline transceiver with carrier synchronization and threshold
calibration for mobile inter-chip communications in 16-nm FinFET

  • Jieqiong Du
  • Chien-Heng Wong
  • Yo-Hao Tu
  • Wei-Han Cho
  • Yilei Li
  • Yuan Du
  • Po-Tsang Huang
  • Sheau-Jiung Lee
  • Mau-Chung Frank Chang

A compact energy-efficient 16-QAM wireline transceiver with carrier synchronization
and threshold calibration is proposed to leverage high-density fine-pitch interconnects.
Utilizing frequency-division multiplexing, the transceiver transfers four-bit data
through one RF band to reduce intersymbol interferences. A forwarded clock is also
transmitted through the same interconnect with the data simultaneously to enable low-power
PVT-insensitive symbol clock recovery. A carrier synchronization algorithm is proposed
to overcome nontrivial current and phase mismatches by including DC offset calibration
and dedicated I/Q phase adjustments. Along with this carrier synchronization, a threshold
calibration process is used for the transceiver to tolerate channel and circuit variations.
The transceiver implemented in 16-nm FinFET occupies only 0.006-mm2 and achieves 10 Gb/s with 0.75-pJ/bit efficiency and <2.5-ns latency.

SESSION: Work in progress posters

Reinforcement learning based interconnection routing for adaptive traffic optimization

  • Sheng-Chun Kao
  • Chao-Han Huck Yang
  • Pin-Yu Chen
  • Xiaoli Ma
  • Tushar Krishna

Applying Machine Learning (ML) techniques to design and optimize computer architectures
is a promising research direction. Optimizing the runtime performance of a Network-on-Chip
(NoC) necessitates a continuous learning framework. In this work, we demonstrate the
promise of applying reinforcement learning (RL) to optimize NoC runtime performance.
We present three RL-based methods for learning optimal routing algorithms. The experimental
results show the algorithms can successfully learn a near-optimal solution across
different environment states.
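
As a purely illustrative Python sketch (not the authors' methods or environment model), the following shows the flavor of tabular Q-learning applied to next-hop selection on a small mesh NoC, with a per-hop penalty and a delivery bonus as the reward; the mesh size, reward values, and learning parameters are all assumptions.

    import random

    MESH = 4                                   # hypothetical 4x4 mesh
    ACTIONS = ["N", "S", "E", "W"]             # candidate output ports
    Q = {}                                     # Q[(node, dest)][action] -> value

    def step(node, action):
        """Move one hop on the mesh, clamping at the edges."""
        x, y = node
        if action == "N": y = min(MESH - 1, y + 1)
        if action == "S": y = max(0, y - 1)
        if action == "E": x = min(MESH - 1, x + 1)
        if action == "W": x = max(0, x - 1)
        return (x, y)

    def route(src, dst, alpha=0.5, gamma=0.9, eps=0.1, max_hops=32):
        """Route one packet, updating Q-values from the observed hop costs."""
        node, hops = src, 0
        while node != dst and hops < max_hops:
            q = Q.setdefault((node, dst), {a: 0.0 for a in ACTIONS})
            a = random.choice(ACTIONS) if random.random() < eps else max(q, key=q.get)
            nxt = step(node, a)
            reward = 10.0 if nxt == dst else -1.0          # -1 per hop, bonus on delivery
            q_next = max(Q.setdefault((nxt, dst), {x: 0.0 for x in ACTIONS}).values())
            q[a] += alpha * (reward + gamma * q_next - q[a])
            node, hops = nxt, hops + 1
        return hops

    for _ in range(2000):                      # train on random traffic
        route((random.randrange(MESH), random.randrange(MESH)),
              (random.randrange(MESH), random.randrange(MESH)))
    print("hops (0,0)->(3,3):", route((0, 0), (3, 3), eps=0.0))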

Power efficient photonic network-on-chip for a scalable GPU

  • Janibul Bashir
  • Khushal Sethi
  • Smruti R. Sarangi

In this paper, we propose an energy efficient and scalable optical interconnect for
GPUs. We intelligently divide the components in a GPU into different types of clusters
and enable these clusters to communicate optically with each other. In order to reduce
the network delay, we use separate networks for coherence and non-coherence traffic.
Moreover, to reduce the static power consumption in optical interconnects, we modulate
the off-chip light source by proposing a novel GPU specific prediction scheme for
on-chip network traffic. Using our design, we were able to increase the performance
by 17% and achieve a 65% reduction in ED2 as compared to a state-of-the-art optical topology.

CDMA-based multiple multicast communications on WiNOC for efficient parallel computing

  • Navonil Chatterjee
  • Hemanta Kumar Mondal
  • Rodrigo Cataldo
  • Jean-Philippe Diguet

In this work, we introduce a hybrid WiNoC, which judiciously uses the wired and wireless
interconnects for broadcasting/multicasting of packets. A code division multiple access
(CDMA) method is used to support multiple broadcast operations originating from multiple
applications executed on the multiprocessor platform. The CDMA-based WiNoC is compared
in terms of network latency and power consumption with wired-broadcast/multicast NoC.

Channel mapping strategies for effective protection switching in fail-operational
hard real-time NoCs

  • Max Koenen
  • Nguyen Anh Vu Doan
  • Thomas Wild
  • Andreas Herkersdorf

With Multi Processor System-on-Chips (MPSoC) scaling up to thousands of processing
elements, bus-based solutions have been dropped in favor of Network-on-Chips (NoC)
as proposed in [2]. However, MPSoCs are still only hesitantly adopted in safety-critical
fields, mainly due to the difficulty of ensuring strict isolation between different
applications running on a single MPSoC as well as providing communication with Guaranteed
Service (GS) to critical applications. This is particularly difficult in the NoC as
it constitutes a network of shared resources. Moreover, safety-critical applications
require some degree of Fault-Tolerance (FT) to guarantee safe operation at all times.

Multi-carrier spread-spectrum transceiver for WiNoC

  • Joel Ortiz
  • Olivier Sentieys
  • Christian Roland
  • Cedric Killian

In this paper, we propose a low-power, high-speed, multi-carrier reconfigurable transceiver
based on Frequency Division Multiplexing to ensure data transfer in future Wireless
NoCs. The proposed transceiver supports a medium access control method to sustain
unicast, broadcast and multicast communication patterns, providing dynamic data exchange
among wireless nodes. The proposed transceiver designed using a 28-nm FDSOI technology
consumes only 2.37 mW and 4.82 mW in unicast/broadcast and multicast modes, respectively,
with an area footprint of 0.0138 mm2.

Detection and prevention protocol for black hole attack in network-on-chip

  • Luka Daoud
  • Nader Rafla

Network-on-Chip (NoC) has become exposed to security threats. It can be infected with
a Hardware Trojan (HT) to degrade the system performance and apply a denial of service
attack. In this paper, we propose a new HT-based threat model, known as the Black Hole
Router (BHR), which deliberately drops packets from the NoC. We propose a detection
and prevention protocol against such BHR attacks with reasonably low overhead. The results
show 10.83%, 27.78%, and 21.31% overhead in area, power, and performance, respectively.
However, our proposed protocol not only detects the BHR attack but also avoids it
and assures packet delivery.

Analyzing networks-on-chip based deep neural networks

  • Giuseppe Ascia
  • Vincenzo Catania
  • Salvatore Monteleone
  • Maurizio Palesi
  • Davide Patti
  • John Jose

One of the most promising architectures for performing deep neural network inferences
on resource-constrained embedded devices is based on massive parallel and specialized
cores interconnected by means of a Network-on-Chip (NoC). In this paper, we extensively
evaluate NoC-based deep neural network accelerators by exploring the design space
spanned by several architectural parameters. We show how latency is mainly dominated
by the on-chip communication, whereas energy consumption is mainly accounted for by memory
(both on-chip and off-chip).

DAWN

We are thrilled to announce Design Automation WebiNar (DAWN) to drive research momentum and ensure our community remains at the cutting edge. Different from conventional keynote and individual speaker webinars, DAWN is a special-session-style webinar. DAWN is formed by multiple presentations on focused topics by leading experts in our community.

Recent events:

For more information, please visit: https://dawn-webinar.github.io/DAWN/

SRC-2019

ACM Student Research Competition at ICCAD 2019 (SRC@ICCAD’19)


Winners in the graduate category

  • 1st: Stefan Hillmich (Johannes Kepler University Linz), Decision Diagrams for Quantum Computing
  • 2nd: Justin Sanchez (UNC Charlotte), Architectures Leveraging Edge and Real-time Template Systems
  • 3rd: Mengchu Li (Technical University of Munich), High-Level Synthesis for Microfluidics Large-Scale Integration

Winners in the undergraduate category

  • 1st: Milind Srivastava (Indian Institute of Technology Madras), Sauron: An Automated Framework for Detecting Fault Attack Vulnerabilities in Hardware
  • 2nd: Shuting Cheng (Yuan Ze University), A Novel Approach for Improving Lifetime of Multi-core Systems: How Asymmetric Aging Can Lead a Way

DEADLINE: August 17, 2019
Online Submission: https://www.easychair.org/conferences/?conf=srciccad2019
 
Sponsored by Microsoft Research, the ACM Student Research Competition is an internationally recognized venue enabling undergraduate and graduate students who are ACM members to:

  • Experience the research world — for many undergraduates, this is a first!
  • Share research results and exchange ideas with other students, judges, and conference attendees
  • Rub shoulders with academic and industry luminaries
  • Understand the practical applications of their research
  • Perfect their communication skills
  • Receive prizes and gain recognition from ACM and the greater computing community.

The ACM Special Interest Group on Design Automation (ACM SIGDA) is organizing such an event in conjunction with the International Conference on Computer Aided Design (ICCAD). Authors of accepted submissions will get travel grants up to $500 from ACM/Microsoft and ICCAD registration fee support from SIGDA. The event consists of several rounds, as described at http://src.acm.org/ and http://www.acm.org/student-research-competition, where you can also find more details on student eligibility and timeline.
 


At SRC@ICCAD’18, the first-place winner in the graduate category, Gengjie Chen (Chinese University of Hong Kong), and the first-place winner in the undergraduate category, Zhuangzhuang Zhou (Shanghai Jiaotong University), both won First Place in the 2019 ACM SRC Grand Finals! (https://www.acm.org/media-center/2019/may/src-2019-grand-finals)

The first-place winner in the graduate category at SRC@ICCAD’17, Meng Li (University of Texas at Austin), also won the First Place in the 2018 ACM SRC Grand Finals! (https://www.acm.org/media-center/2018/june/src-2018-grand-finals)
 
The first-place winner in the undergraduate category at SRC@ICCAD’16, Jennifer Vaccaro (Olin College of Engineering), also won the Second Place in the 2017 ACM SRC Grand Finals: http://www.acm.org/media-center/2017/june/src-2017-grand-finals.



Details on abstract submission:
Research projects from all areas of design automation are encouraged. The author submitting the abstract must still be a student at the time the abstract is due. Each submission should be made on the EasyChair submission site. Please include the author’s name, affiliation, postal address, and email address; research advisor’s name; ACM student member number; category (undergraduate or graduate); research title; and an extended abstract (maximum 2 pages or 800 words) containing the following sections:

  • Problem and Motivation: This section should clearly state the problem being addressed and explain the reasons for seeking a solution to this problem.
  • Background and Related Work: This section should describe the specialized (but pertinent) background necessary to appreciate the work. Include references to the literature where appropriate, and briefly explain where your work departs from that done by others. Reference lists do not count towards the limit on the length of the abstract.
  • Approach and Uniqueness: This section should describe your approach in attacking the problem and should clearly state how your approach is novel.
  • Results and Contributions: This section should clearly show how the results of your work contribute to computer science and should explain the significance of those results. Include a separate paragraph (maximum of 100 words) for possible publication in the conference proceedings that serves as a succinct description of the project.
  • Single paper summaries (or just cut-and-paste versions of published papers) are inappropriate for the ACM SRC. Submissions should include at least one year's worth of research contributions, but should not subsume an entire doctoral thesis.

Note that this event is different from other ACM/SIGDA sponsored or supported events at DAC or ICCAD: RNYF brings together seniors and first-year graduate students at DAC, UBooth features demos from research groups, DASS allows graduate students to get up to speed through lectures on design automation, the PhD Forum showcases post-proposal PhD research at DAC, and the CADathlon allows graduate students to compete in a programming contest at ICCAD.

The ACM Student Research Competition allows both graduate and undergraduate students to discuss their research with student peers, as well as academic and industry researchers, in an informal setting, while enabling them to attend ICCAD and compete with other ACM SRC winners from other computing areas in the ACM Grand Finals. Travel grant recipients cannot receive travel support from any other ICCAD or ACM/SIGDA sponsored program.

This year we plan to reserve as many as 5 poster session spots for undergraduate attendees to encourage their continued investigation in the design automation field. The exact number is subject to the total number of undergraduate submissions as well as the quality of the work.
 
Online Submission – EasyChair:
https://www.easychair.org/conferences/?conf=srciccad2019
 
Important dates:

  • Abstract submission deadline: 11:59pm, PST, August 17, 2019
  • Acceptance notification: September 08, 2019
  • Poster session: 11:30am–1:30pm, Nov. 04 (Monday) @Westminster Foyer
  • Presentation session: 6:45–8:15pm, Nov. 04 (Monday) @Westminster I Ballroom
  • Award winners announced at ACM SIGDA Dinner: 6:45–8:30pm, Nov. 5 (Tuesday) @Legacy Ballroom
  • Grand Finals winners honored at ACM Awards Banquet: June 2020 (Estimated)


Requirement:
Students submitting and presenting their work at SRC@ICCAD’19 are required to be members of both ACM and ACM SIGDA.
 
Organizers:
Bei Yu (Chinese University of Hong Kong, Hong Kong)
Robert Wille (Johannes Kepler University Linz, Austria)

ISLPED 2020 TOC

ISLPED ’20: Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design

Full Citation in the ACM Digital Library

SESSION: ML related software and systems

How to cultivate a green decision tree without loss of accuracy?

  • Tseng-Yi Chen
  • Yuan-Hao Chang
  • Ming-Chang Yang
  • Huang-Wei Chen

Decision tree is the core algorithm of random forest learning, which has been widely applied to classification and regression problems in the machine learning field. To avoid underfitting, a decision tree algorithm stops growing its tree model only when the model is a fully-grown tree. However, a fully-grown tree often results in overfitting, which reduces the accuracy of the decision tree. In this dilemma, post-pruning strategies have been proposed to reduce the model complexity of the fully-grown decision tree. Nevertheless, such a process is very energy-inefficient on a non-volatile-memory-based (NVM-based) system because NVMs generally have high write costs (i.e., energy consumption and I/O latency). The tree data that is written and later pruned induces high write energy consumption and long I/O latency on NVM-based architectures, especially for low-power-oriented embedded systems. In order to establish a green decision tree (i.e., a tree model with minimized construction energy consumption), this study rethinks the pruning process and proposes a duo-phase pruning framework, which significantly decreases the energy consumption of an NVM-based computing system without loss of accuracy.

Approximate inference systems (AxIS): end-to-end approximations for energy-efficient inference at the edge

  • Soumendu Kumar Ghosh
  • Arnab Raha
  • Vijay Raghunathan

The rapid proliferation of the Internet-of-Things (IoT) and the dramatic resurgence of artificial intelligence (AI) based application workloads has led to immense interest in performing inference on energy-constrained edge devices. Approximate computing (a design paradigm that yields large energy savings at the cost of a small degradation in application quality) is a promising technique to enable energy-efficient inference at the edge. This paper introduces the concept of an approximate inference system (AxIS) and proposes a systematic methodology to perform joint approximations across different subsystems in a deep neural network-based inference system, leading to significant energy benefits compared to approximating individual subsystems in isolation. We use a smart camera system that executes various convolutional neural network (CNN) based image recognition applications to illustrate how the sensor, memory, compute, and communication subsystems can all be approximated synergistically. We demonstrate our proposed methodology using two variants of a smart camera system: (a) Camedge, where the CNN executes locally on the edge device, and (b) Camcloud, where the edge device sends the captured image to a remote cloud server that executes the CNN. We have prototyped such an approximate inference system using an Altera Stratix IV GX-based Terasic TR4-230 FPGA development board. Experimental results obtained using six CNNs demonstrate significant energy savings (around 1.7× for Camedge and 3.5× for Camcloud) for minimal (< 1%) loss in application quality. Compared to approximating a single subsystem in isolation, AxIS achieves additional energy benefits of 1.6×–1.7× (Camedge) and 1.4×–3.4× (Camcloud) on average for minimal application-level quality loss.

Time-step interleaved weight reuse for LSTM neural network computing

  • Naebeom Park
  • Yulhwa Kim
  • Daehyun Ahn
  • Taesu Kim
  • Jae-Joon Kim

In Long Short-Term Memory (LSTM) neural network models, a weight matrix tends to be repeatedly loaded from DRAM if the size of on-chip storage of the processor is not large enough to store the entire matrix. To alleviate heavy overhead of DRAM access for weight loading in LSTM computations, we propose a weight reuse scheme which utilizes the weight sharing characteristics in two adjacent time-step computations. Experimental results show that the proposed weight reuse scheme reduces the energy consumption by 28.4-57.3% and increases the overall throughput by 110.8% compared to the conventional schemes.
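
A minimal Python sketch of the memory-traffic idea (not the paper's hardware or a full LSTM cell): each weight tile fetched from DRAM is applied to two adjacent time steps before it is evicted, roughly halving the number of weight loads. The recurrent dependency and gating logic are omitted, and the tile size and matrix dimensions are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    T, ROWS, COLS, TILE = 8, 1024, 256, 64
    W = rng.standard_normal((ROWS, COLS)).astype(np.float32)   # one weight matrix of the cell
    X = rng.standard_normal((T, COLS)).astype(np.float32)      # per-time-step input vectors

    def run(interleave):
        out = np.zeros((T, ROWS), dtype=np.float32)
        loads = 0
        step = 2 if interleave else 1
        for t0 in range(0, T, step):
            for r in range(0, ROWS, TILE):
                tile = W[r:r + TILE]                    # one "DRAM load" of a weight tile
                loads += 1
                for t in range(t0, min(t0 + step, T)):  # reuse the tile for one or two steps
                    out[t, r:r + TILE] = tile @ X[t]
        return out, loads

    base, loads_base = run(interleave=False)
    inter, loads_int = run(interleave=True)
    assert np.allclose(base, inter)                     # same results, fewer weight loads
    print(f"weight-tile DRAM loads: baseline={loads_base}, interleaved={loads_int}")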

Sound event detection with binary neural networks on tightly power-constrained IoT devices

  • Gianmarco Cerutti
  • Renzo Andri
  • Lukas Cavigelli
  • Elisabetta Farella
  • Michele Magno
  • Luca Benini

Sound event detection (SED) is a hot topic in consumer and smart city applications. Existing approaches based on deep neural networks (DNNs) are very effective, but highly demanding in terms of memory, power, and throughput when targeting ultra-low power always-on devices.

Latency, availability, cost, and privacy requirements are pushing recent IoT systems to process the data on the node, close to the sensor, with a very limited energy supply and tight constraints on memory size and processing capabilities, precluding the execution of state-of-the-art DNNs.

In this paper, we explore the combination of extreme quantization to a small-footprint binary neural network (BNN) with the highly energy-efficient, RISC-V-based (8+1)-core GAP8 microcontroller. Starting from an existing CNN for SED whose footprint (815 kB) exceeds the 512 kB of memory available on our platform, we retrain the network using binary filters and activations to match these memory constraints. (Fully) binary neural networks come with a natural drop in accuracy of 12-18% on the challenging ImageNet object recognition challenge compared to their equivalent full-precision baselines. This BNN reaches a 77.9% accuracy, just 7% lower than the full-precision version, with 58 kB (7.2× less) for the weights and 262 kB (2.4× less) memory in total. With our BNN implementation, we reach a peak throughput of 4.6 GMAC/s and 1.5 GMAC/s over the full network, including preprocessing with Mel bins, which corresponds to an efficiency of 67.1 GMAC/s/W and 31.3 GMAC/s/W, respectively. Compared to the performance of an ARM Cortex-M4 implementation, our system has a 10.3× faster execution time and a 51.1× higher energy-efficiency.

SESSION: Low power circuit designs

Analysis of crosstalk in NISQ devices and security implications in multi-programming regime

  • Abdullah Ash-Saki
  • Mahabubul Alam
  • Swaroop Ghosh

The noisy intermediate-scale quantum (NISQ) computers suffer from unwanted coupling across qubits, referred to as crosstalk. The existing literature largely ignores crosstalk effects, which can introduce significant error in circuit optimization. In this work, we present a crosstalk modeling and analysis framework for near-term quantum computers after extracting the error rates experimentally. Our analysis reveals that crosstalk can be of the same order as the gate error, which is considered a dominant error source in NISQ devices. We also propose adversarial fault injection using crosstalk in a multi-programming environment where the victim and the adversary share the same quantum hardware. Our simulation and experimental results from IBM quantum computers demonstrate that the adversary can inject faults and launch a denial-of-service attack. Finally, we propose system- and device-level countermeasures.

An 88.6nW ozone pollutant sensing interface IC with a 159 dB dynamic range

  • Rishika Agarwala
  • Peng Wang
  • Akhilesh Tanneeru
  • Bongmook Lee
  • Veena Misra
  • Benton H. Calhoun

This paper presents a low power resistive sensor interface IC designed at 0.6V for ozone pollutant sensing. The large resistance range of gas sensors poses challenges in designing a low power sensor interface. Existing architectures are insufficient for achieving a high dynamic range while enabling low VDD operation, resulting in high power consumption regardless of the adopted architecture. We present an adaptive architecture that provides baseline resistance cancellation and dynamic current control to enable low VDD operation while maintaining a dynamic range of 159dB across 20kΩ-1MΩ. The sensor interface IC is fabricated in a 65nm bulk CMOS process and consumes 88.6nW of power, which is 300x lower than the state-of-the-art. The full system power ranges between 116 nW – 1.09 μW, which includes the proposed sensor interface IC, the analog-to-digital converter, and peripheral circuits. The sensor interface’s performance was verified using custom resistive metal-oxide sensors for ozone concentrations from 50 ppb to 900 ppb.

A 1.2-V, 1.8-GHz low-power PLL using a class-F VCO for driving 900-MHz SRD band SC-circuits

  • Tim Schumacher
  • Markus Stadelmayer
  • Thomas Faseth
  • Harald Pretl

This work presents a 1.6 GHz to 2 GHz integer PLL with 2 MHz stepping, which is optimized for driving low-power 180 nm switched-capacitor (SC) circuits at a 1.2 V supply. To reduce the overall power consumption, a class-F VCO is implemented. Due to the enriched odd harmonics of the oscillator, a rectangular oscillator signal is generated, which allows omitting output buffering stages. The rectangular signal results in lower power consumption and makes it possible to directly drive SC-filters and an RF-divider with the oscillator signal. In addition, the proposed RF-divider includes differential 4-phase signal generation at the 868 MHz and 915 MHz SRD band frequencies, which can be used for complex modulation schemes. With a fully integrated loop filter, a high degree of integration is achieved. A test-chip was manufactured in a 1P6M 180 nm CMOS technology with triple-well option and confirms a PLL with a total active power consumption of 4.1 mW. It achieves a phase noise of -111 dBc/Hz at 1 MHz offset and a -42 dBc spurious response from a 1 MHz reference.

A 640pW 32kHz switched-capacitor ILO analog-to-time converter for wake-up sensor applications

  • Nicolas Goux
  • Jean-Baptiste Casanova
  • Gaël Pillonnet
  • Franck Badets

This paper presents the architecture and ultra-low power (ULP) implementation of a switched-capacitor injection-locked oscillator (SC-ILO) used as an analog-to-time converter for wake-up sensor applications. Thanks to a novel injection-locking scheme based on switched capacitors, the SC-ILO architecture avoids the use of power-hungry constant injection current sources. The SC-ILO design parameters and transfer function, derived from an analytical study, are used to optimize the design. The ULP implementation strategy regarding power consumption, gain, modulation bandwidth, and output phase dynamic range is presented and optimized for audio wake-up sensor applications, which require ultra-low power consumption but only modest dynamic range. This paper reports experimental measurements of the SC-ILO circuit, fabricated in a 22 nm FDSOI process. The measured chip exhibits a 129° phase-shift range and a 6 kHz bandwidth, leading to a 34.6 dB dynamic range for a power consumption of 640 pW under 0.4 V.

SESSION: Low power management

Dynamic idle core management and leakage current reuse in MPSoC platforms

  • MD Shazzad Hossain
  • Ioannis Savidis

In this paper, algorithmic and circuit techniques are proposed for dynamic power management that allow for the reuse of the leakage current of idle circuit blocks and cores in a multiprocessor system-on-chip platform. First, a novel scheduling algorithm, longest idle time – leakage reuse (LIT-LR), is proposed for energy-efficient reuse of leakage current, which generates a supply voltage of 340 mV with less than ±3% variation across the tt, ff, and ss process corners. The LIT-LR algorithm reduces the energy consumption of the leakage control blocks and the peak power consumption by 25% and 7.4%, respectively, as compared to random assignment of idle cores for leakage reuse. Second, a novel usage-ranking-based algorithm, longest idle time – simultaneous leakage reuse and power gating (LIT-LRPG), is proposed for simultaneous implementation of power gating and leakage reuse. Applying power gating with leakage reuse reduces the total energy consumption of the MPSoC by 50.2%, 14.4%, and 5.7% as compared to, respectively, a baseline topology that includes neither leakage reuse nor power gating, one that only includes power gating, and one that only includes leakage reuse.

Towards wearable piezoelectric energy harvesting: modeling and experimental validation

  • Yigit Tuncel
  • Shiva Bandyopadhyay
  • Shambhavi V. Kulshrestha
  • Audrey Mendez
  • Umit Y. Ogras

Motion energy harvesting is an ideal alternative to batteries in wearable applications since it can produce energy on demand. So far, widespread use of this technology has been hindered by bulky, inflexible, and impractical designs. New flexible piezoelectric materials enable comfortable use of this technology. However, the energy harvesting potential of this approach has not been thoroughly investigated to date. This paper presents a novel mathematical model for estimating the energy that can be harvested from joint movements on the human body. The proposed model is validated using two different piezoelectric materials attached to a 3D model of the human knee. To the best of our knowledge, this is the first study that combines analytical modeling and experimental validation for joint movements. Thorough experimental evaluations show that (1) users can generate on average 13 μW of power while walking, and (2) we can predict the generated power with 4.8% modeling error.

RAMANN: in-SRAM differentiable memory computations for memory-augmented neural networks

  • Mustafa Ali
  • Amogh Agrawal
  • Kaushik Roy

Memory-Augmented Neural Networks (MANNs) have been shown to outperform Recurrent Neural Networks (RNNs) in terms of long-term dependencies. Since MANNs are equipped with an external memory, they can store and retrieve more data through longer periods of time. A MANN generally consists of a network controller and an external memory. Unlike conventional memory having read/write operations to specific addresses, a differentiable memory has soft read and write operations involving all the data stored in the memory. Such soft read and write operations present new computational challenges for hardware implementation of MANNs. In this work, we present a novel in-memory computing primitive to accelerate the differentiable memory operations of MANNs in SRAMs. We propose a 9T SRAM macro capable of performing both Hamming similarity and dot products (crucial for soft read/write and addressing mechanisms in MANNs). Regarding Hamming similarity, we operate the 9T cell in analog Content-Addressable Memory (CAM) mode by applying the key at the bitlines (RBLs/RBLBs) in each column, and reading out the analog output at the sourceline (SL). To perform dot product operation, the input data is applied at the wordlines, and the current passing through RBLs represents the dot product between the input data and the stored bits. The proposed SRAM array performs computations that reliably match the operations required for a differentiable memory, thereby leading to energy-efficient on-chip acceleration of MANNs. Compared to standard GPU systems, the proposed scheme achieves 43x and 85x performance and energy improvements respectively, for computing the differentiable memory operations.

Swan: a two-step power management for distributed search engines

  • Liang Zhou
  • Laxmi N. Bhuyan
  • K. K. Ramakrishnan

The service quality of web search depends considerably on the request tail latency from Index Serving Nodes (ISNs), prompting data centers to operate them at low utilization and wasting server power. ISNs can be made more energy efficient by utilizing Dynamic Voltage and Frequency Scaling (DVFS) or sleep-state techniques to take advantage of slack in the latency of search queries. However, state-of-the-art frameworks use a single distribution to predict a request’s service time and select a high-percentile tail latency to derive the CPU’s frequency or sleep states. Unfortunately, this misses plenty of energy saving opportunities. In this paper, we develop a simple linear regression predictor to estimate each individual search request’s service time, based on the length of the request’s posting list. To use this prediction for power management, the major challenge lies in reducing deadline miss rates caused by prediction errors while improving energy efficiency. We present Swan, a two-Step poWer mAnagement for distributed search eNgines. For each request, Swan selects an initial, lower frequency to optimize power, and then boosts the CPU frequency just at the right time to meet the deadline. Additionally, when a critical request arrives, we re-configure the time instant for boosting the frequency to avoid deadline violations. Swan is implemented on the widely-used Solr search engine and evaluated with two representative, large query traces. Evaluations show that Swan outperforms state-of-the-art approaches, saving at least 39% CPU power on average.
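
A hedged Python sketch of the two-step idea (the profiling numbers, frequencies, and deadline below are invented, and the real Swan policy is more involved): fit a linear service-time predictor on posting-list length, then compute when to boost from a low initial frequency so the predicted work still finishes by the deadline.

    import numpy as np

    # Hypothetical profiling data: posting-list length vs. measured service time at f_max (ms).
    lengths = np.array([1e4, 5e4, 1e5, 5e5, 1e6])
    service_ms = np.array([2.0, 4.5, 7.0, 20.0, 38.0])
    slope, intercept = np.polyfit(lengths, service_ms, 1)   # simple linear predictor

    F_MAX, F_LOW = 2.4e9, 1.2e9                              # assumed CPU frequencies (Hz)
    DEADLINE_MS = 40.0

    def plan(posting_len):
        work_at_fmax = slope * posting_len + intercept       # predicted time at f_max (ms)
        # Run at F_LOW first, then boost at t_b so that the work done at F_LOW until t_b
        # plus the remaining work done at F_MAX finishes exactly at the deadline.
        ratio = F_MAX / F_LOW
        t_b = (DEADLINE_MS - work_at_fmax) * ratio / (ratio - 1.0)
        t_b = max(0.0, min(t_b, DEADLINE_MS))                # clamp: never boost early/late
        return work_at_fmax, t_b

    pred, boost_at = plan(3e5)
    print(f"predicted service time at f_max: {pred:.1f} ms, boost frequency at t = {boost_at:.1f} ms")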

SESSION: Tuning the design flow for low power: From synthesis to pin assignment

Deep-PowerX: a deep learning-based framework for low-power approximate logic synthesis

  • Ghasem Pasandi
  • Mackenzie Peterson
  • Moises Herrera
  • Shahin Nazarian
  • Massoud Pedram

This paper aims at integrating three powerful techniques, namely Deep Learning, Approximate Computing, and Low Power Design, into a strategy to optimize logic at the synthesis level. We utilize advances in deep learning to guide an approximate logic synthesis engine to minimize the dynamic power consumption of a given digital CMOS circuit, subject to a predetermined error rate at the primary outputs. Our framework, Deep-PowerX, focuses on replacing or removing gates on a technology-mapped network and uses a Deep Neural Network (DNN) to predict error rates at primary outputs of the circuit when a specific part of the netlist is approximated. The primary goal of Deep-PowerX is to reduce the dynamic power, whereas area reduction serves as a secondary objective. Using the said DNN, Deep-PowerX is able to reduce the exponential time complexity of standard approximate logic synthesis to linear time. Experiments are done on numerous open source benchmark circuits. Results show significant reductions in power and area, by up to 1.47× and 1.43× compared to exact solutions and by up to 22% and 27% compared to state-of-the-art approximate logic synthesis tools, while having orders of magnitude lower run-time.

Steady state driven power gating for lightening always-on state retention storage

  • Taehwan Kim
  • Gyounghwan Hyun
  • Taewhan Kim

It is generally known that a considerable portion of the flip-flops in a circuit have a mux-feedback loop (called a self-loop); these are the critical (inherently unavoidable) bottleneck in minimizing the total (always-on) storage size when allocating non-uniform multi-bit storage for retaining flip-flop states in power-gated circuits. This is because every self-loop flip-flop must be replaced with a distinct retention flip-flop with at least one bit of storage for retaining its state, since there is no clue as to where the flip-flop state, upon wake-up, comes from, i.e., from the mux-feedback loop or from driving flip-flops other than itself. This work breaks this bottleneck by safely treating a large portion of the self-loop flip-flops as if they were flip-flops with no self-loop. Specifically, we design a novel mechanism of steady-state monitoring, operating for a few cycles just before sleeping, on a partial set of self-loop flip-flops, by which expensive state retention storage is never needed for the monitored flip-flops, contributing to a significant saving in the total size of the always-on state retention storage for power gating.

Pin-in-the-middle: an efficient block pin assignment methodology for block-level monolithic 3D ICs

  • Bon Woong Ku
  • Sung Kyu Lim

In a 2D design, the periphery of a block serves as the optimal pin location since blocks are placed side by side in a single placement layer. However, Monolithic 3D (M3D) integration relieves this boundary constraint by allowing vertical block communication between different tiers based on a nm-scale 3D interconnection pitch. In this paper, we present a design methodology named Pin-in-the-Middle that assigns block pins in the middle of a block using commercial 2D P&R tools to enable efficient block implementation and integration for two-tier block-level M3D ICs. Based on a 28nm two-tier M3D hierarchical design result, we show that our solution offers 13.6% and 24.7% energy-delay-product reduction compared to the M3D design with pins assigned at the block boundaries and its 2D counterpart, respectively.

SESSION: ML related

GRLC: grid-based run-length compression for energy-efficient CNN accelerator

  • Yoonho Park
  • Yesung Kang
  • Sunghoon Kim
  • Eunji Kwon
  • Seokhyeong Kang

Convolutional neural networks (CNNs) require a huge amount of off-chip DRAM access, which accounts for most of its energy consumption. Compression of feature maps can reduce the energy consumption of DRAM access. However, previous compression methods show poor compression ratio if the feature maps are either extremely sparse or dense. To improve the compression ratio efficiently, we have exploited the spatial correlation and the distribution of non-zero activations in output feature maps. In this work, we propose a grid-based run-length compression (GRLC) and have implemented a hardware for the GRLC. Compared with a previous compression method [1], GRLC reduces 11% of the DRAM access and 5% of the energy consumption on average in VGG-16, ExtractionNet and ResNet-18.
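
An illustrative Python sketch of grid-wise zero run-length encoding (a generic simplification, not the exact GRLC format): each small block of a sparse output feature map is stored as (zero-run, value) pairs, so the traffic to DRAM tracks the sparsity of each block rather than the full tile size.

    import numpy as np

    def encode_grid(tile, grid=4):
        """Encode each grid x grid block of a 2D tile as (trailing_zeros, [(zeros_before, value), ...])."""
        blocks = []
        for r in range(0, tile.shape[0], grid):
            for c in range(0, tile.shape[1], grid):
                flat = tile[r:r + grid, c:c + grid].ravel()
                pairs, run = [], 0
                for v in flat:
                    if v == 0:
                        run += 1
                    else:
                        pairs.append((run, float(v)))
                        run = 0
                blocks.append((run, pairs))          # trailing zero count + non-zero pairs
        return blocks

    def decode_grid(blocks, shape, grid=4):
        out = np.zeros(shape, dtype=np.float32)
        it = iter(blocks)
        for r in range(0, shape[0], grid):
            for c in range(0, shape[1], grid):
                tail, pairs = next(it)
                flat = []
                for zeros, v in pairs:
                    flat.extend([0.0] * zeros + [v])
                flat.extend([0.0] * tail)
                out[r:r + grid, c:c + grid] = np.array(flat).reshape(grid, grid)
        return out

    fmap = np.random.rand(8, 8).astype(np.float32)
    fmap[fmap < 0.7] = 0                              # ReLU-like sparsity
    enc = encode_grid(fmap)
    assert np.allclose(decode_grid(enc, fmap.shape), fmap)
    print("non-zero values stored:", sum(len(p) for _, p in enc), "of", fmap.size)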

NS-KWS: joint optimization of near-sensor processing architecture and low-precision GRU for always-on keyword spotting

  • Qin Li
  • Sheng Lin
  • Changlu Liu
  • Yidong Liu
  • Fei Qiao
  • Yanzhi Wang
  • Huazhong Yang

Keyword spotting (KWS) is a crucial front-end module in the whole speech interaction system. The always-on KWS module detects input words, then activates the energy-consuming complex backend system when keywords are detected. The performance of the KWS determines the standby performance of the whole system, and the conventional KWS module encounters a power-consumption bottleneck in the data conversion near the microphone sensor. In this paper, we propose an energy-efficient near-sensor processing architecture for always-on KWS, which could enhance continuous perception of the whole speech interaction system. By implementing the keyword detection in the analog domain after the microphone sensor, this architecture avoids energy-consuming data converters and achieves faster speed than conventional realizations. In addition, we propose a lightweight gated recurrent unit (GRU) with negligible accuracy loss to ensure the recognition performance. We also implement and fabricate the proposed KWS system in a CMOS 0.18μm process. In the system-view evaluation results, the hardware-software co-design architecture achieves a 65.6% energy consumption saving and a 71× speedup over the state of the art.

Multi-channel precision-sparsity-adapted inter-frame differential data codec for video neural network processor

  • Yixiong Yang
  • Zhe Yuan
  • Fang Su
  • Fanyang Cheng
  • Zhuqing Yuan
  • Huazhong Yang
  • Yongpan Liu

Activation I/O traffic is a critical bottleneck of video neural network processors. Recent works adopted an inter-frame difference method to reduce activation size. However, current methods cannot fully adapt to the varying precision and sparsity of the differential data. In this paper, we propose a multi-channel precision-sparsity-adapted codec, which separates the differential activations and encodes them in multiple channels. We analyze the best-suited encoding for each channel and select the optimal channel number with the best performance. A two-channel codec hardware has been implemented in an ASIC accelerator, which can encode/decode activations in parallel. Experimental results show that our coding achieves a 2.2x-18.2x compression rate in three scenarios with no accuracy loss, and the hardware provides 42x/174x improvements in speed and energy-efficiency compared with a software codec.

SESSION: Non-ML low-power architecture

Slumber: static-power management for GPGPU register files

  • Devashree Tripathy
  • Hadi Zamani
  • Debiprasanna Sahoo
  • Laxmi N. Bhuyan
  • Manoranjan Satpathy

Leakage power dissipation has become one of the major concerns with technology scaling. The GPGPU register file has grown in size over the last decade in order to support the parallel execution of thousands of threads. Given that each thread has its own dedicated set of physical registers, these registers remain idle when the corresponding threads wait on long-latency operations. Existing research shows that the leakage energy consumption of the register file can be reduced by undervolting the idle registers to a data-retentive low-leakage voltage (drowsy voltage) to ensure that the data is not lost while not in use. In this paper, we develop a realistic model for determining the wake-up time of registers from various undervolting and power gating modes. Next, we propose a hybrid energy saving technique where a combination of power gating and undervolting is used to save optimum energy depending on the idle period of the registers, with a negligible performance penalty. Our simulations show that the hybrid energy-saving technique results in 94% leakage energy savings in register files on average when compared with the conventional clock gating technique, and 9% higher leakage energy savings compared to the state-of-the-art technique.
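
A small Python sketch of the mode-selection idea under assumed (made-up) leakage and wake-up costs: given a predicted idle interval for a register bank, pick the cheaper of drowsy undervolting and full power gating, subject to a wake-up latency budget.

    MODES = {
        #           leak_per_cycle (pJ)  wakeup_cycles  wakeup_energy (pJ)  -- all invented
        "active":  (1.00,                0,             0.0),
        "drowsy":  (0.25,                2,             1.0),
        "gated":   (0.02,                12,            8.0),
    }

    def energy(mode, idle_cycles):
        leak, _, wake = MODES[mode]
        return leak * idle_cycles + wake            # retention leakage plus wake-up overhead

    def best_mode(idle_cycles, latency_budget_cycles=20):
        candidates = {m: energy(m, idle_cycles)
                      for m, (_, wk, _) in MODES.items() if wk <= latency_budget_cycles}
        return min(candidates, key=candidates.get)

    for idle in (5, 40, 400):
        print(f"idle for {idle} cycles -> {best_mode(idle)}")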

STINT: selective transmission for low-energy physiological monitoring

  • Tao-Yi Lee
  • Khuong Vo
  • Wongi Baek
  • Michelle Khine
  • Nikil Dutt

Noninvasive and continuous physiological sensing enabled by novel wearable sensors is generating unprecedented diagnostic insights in many medical practices. However, the limited battery capacity of these wearable sensors poses a critical challenge in extending device lifetime in order to prevent omission of informative events. In this work, we exploit the inherent sparsity of physiological signals to intelligently enable selective transmission of these signals and thereby improve the energy efficiency of wearable sensors. We propose STINT, a selective transmission framework that generates a sparse representation of the raw signal based on domain-specific knowledge, and which can be integrated into a wide range of resource-constrained embedded sensing IoT platforms. STINT employs a neural network (NN) for selective transmission: the NN identifies and transmits only the informative parts of the raw signal, thereby achieving low power operation. We validate STINT and establish its efficacy in the domain of IoT for energy-efficient physiological monitoring, by testing our framework on EcoBP, a novel miniaturized and wireless continuous blood pressure sensor. Early experimental results on the EcoBP device demonstrate that the STINT-enabled EcoBP sensor reduces sensor energy consumption by 14% compared to the native platform, with room for additional energy savings via complementary Bluetooth and wireless optimizations.

Reconfigurable tiles of computing-in-memory SRAM architecture for scalable vectorization

  • R. Gauchi
  • V. Egloff
  • M. Kooli
  • J.-P. Noel
  • B. Giraud
  • P. Vivet
  • S. Mitra
  • H.-P. Charles

For big data applications, bringing computation to the memory is expected to drastically reduce data transfers, which can be done using recent concepts of Computing-In-Memory (CIM). To address kernels with larger memory data sets, we propose a reconfigurable tile-based architecture composed of Computational-SRAM (C-SRAM) tiles, each enabling arithmetic and logic operations within the memory. The proposed horizontal scalability and vertical data communication are combined to select the optimal vector width for maximum performance. These schemes allow vector-based kernels written for existing SIMD engines to be used on the targeted CIM architecture. For architecture exploration, we propose an instruction-accurate simulation platform using SystemC/TLM to quantify the performance and energy of various kernels. For detailed performance evaluation, the platform is calibrated with data extracted from the placed-and-routed C-SRAM circuit, designed in 22nm FDSOI technology. Compared to a 512-bit SIMD architecture, the proposed CIM architecture achieves an EDP reduction of up to 60× and 34× for memory-bound kernels and compute-bound kernels, respectively.

SESSION: Memory technology and in-memory computing

FeFET-based low-power bitwise logic-in-memory with direct write-back and data-adaptive dynamic sensing interface

  • Mingyen Lee
  • Wenjun Tang
  • Bowen Xue
  • Juejian Wu
  • Mingyuan Ma
  • Yu Wang
  • Yongpan Liu
  • Deliang Fan
  • Vijaykrishnan Narayanan
  • Huazhong Yang
  • Xueqing Li

Compute-in-memory (CiM) is a promising method for mitigating the memory wall problem in data-intensive applications. The proposed bitwise logic-in-memory (BLiM) is targeted at data-intensive applications such as databases and data encryption. This work proposes a low-power BLiM approach using emerging nonvolatile ferroelectric FETs with direct write-back and a data-adaptive dynamic sensing interface. Apart from general-purpose random-access memory, it also supports BLiM operations such as copy, NOT, NAND, XOR, and full adder (FA). The novel features of the proposed architecture include: (i) direct result write-back based on the remnant bitline BLiM charge, which avoids bitline sensing and charging operations; (ii) a fully dynamic sensing interface that needs no static reference current, but adopts data-adaptive voltage references for certain multi-operand operations; and (iii) selective bitline charging from the wordline (instead of pre-charging all bitlines) to save power and also enable direct write-back. Detailed BLiM operations and benchmarking against conventional approaches show the promise of low-power computing with the FeFET-based circuit techniques.

Enabling efficient ReRAM-based neural network computing via crossbar structure adaptive optimization

  • Chenchen Liu
  • Fuxun Yu
  • Zhuwei Qin
  • Xiang Chen

Resistive random-access memory (ReRAM) based accelerators have been widely studied to achieve efficient neural network computing in terms of speed and energy. Neural network optimization algorithms such as sparsity have been developed to achieve efficient neural network computing on traditional computer architectures such as CPUs and GPUs. However, such computing efficiency improvement is hindered when deploying these algorithms on ReRAM-based accelerators because of their unique crossbar-structured computations, and a specific algorithm and hardware co-optimization for the ReRAM-based architecture is still lacking. In this work, we propose an efficient neural network computing framework that is specialized for the crossbar-structured computations on ReRAM-based accelerators. The proposed framework includes a crossbar-specific feature map pruning and an adaptive neural network deployment. Experimental results show our design can improve the computing accuracy by 9.1% compared with the state-of-the-art sparse neural networks. Based on a well-known ReRAM-based DNN accelerator, the proposed framework demonstrates up to 1.4× speedup, 4.3× power efficiency, and 4.4× area saving.

Embedding error correction into crossbars for reliable matrix vector multiplication using emerging devices

  • Qiuwen Lou
  • Tianqi Gao
  • Patrick Faley
  • Michael Niemier
  • X. Sharon Hu
  • Siddharth Joshi

Emerging memory devices are an attractive choice for implementing very energy-efficient in-situ matrix-vector multiplication (MVM) for use in intelligent edge platforms. Despite their great potential, device-level non-idealities have a large impact on the application-level accuracy of deep neural network (DNN) inference. We introduce a low-density parity-check code (LDPC) based approach to correct non-ideality induced errors encountered during in-situ MVM. We first encode the weights using error correcting codes (ECC), perform MVM on the encoded weights, and then decode the result after in-situ MVM. We show that partial encoding of weights can maintain DNN inference accuracy while minimizing the overhead of LDPC decoding. Within two iterations, our ECC method recovers 60% of the accuracy in MVM computations when 5% of underlying computations are error-prone. Compared to an alternative ECC method which uses arithmetic codes, using LDPC improves AlexNet classification accuracy by 0.8% at iso-energy. Similarly, at iso-energy, we demonstrate an improvement in CIFAR-10 classification accuracy of 54% with VGG-11 when compared to a strategy that uses 2× redundancy in weights. Further design space explorations demonstrate that we can leverage the resilience endowed by ECC to improve energy efficiency (by reducing operating voltage). A 3.3× energy efficiency improvement in DNN inference on CIFAR-10 dataset with VGG-11 is achieved at iso-accuracy.
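
The paper uses LDPC codes; as a deliberately simplified stand-in, the Python sketch below shows the same encode, in-situ compute, decode flow with a single ABFT-style checksum column appended to the weights, which lets an erroneous MVM result be detected after the fact. All values are synthetic and the fallback recomputation is only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 6))                      # weights mapped onto the crossbar
    x = rng.standard_normal(8)

    W_enc = np.hstack([W, W.sum(axis=1, keepdims=True)])  # encode: append a checksum column

    y_enc = x @ W_enc                                     # "in-situ" MVM over the encoded weights
    y_enc[2] += 0.8                                       # a device non-ideality corrupts one output

    residual = y_enc[:-1].sum() - y_enc[-1]               # decode: checksum must match the data sum
    if abs(residual) > 1e-6:                              # error detected in this MVM
        y = x @ W                                         # e.g. fall back to a trusted recomputation
    else:
        y = y_enc[:-1]
    print("error detected:", abs(residual) > 1e-6,
          "| max deviation after handling:", float(np.max(np.abs(y - x @ W))))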

SESSION: Low power system and NVM

A comprehensive methodology to determine optimal coherence interfaces for many-accelerator SoCs

  • Kshitij Bhardwaj
  • Marton Havasi
  • Yuan Yao
  • David M. Brooks
  • José Miguel Hernández-Lobato
  • Gu-Yeon Wei

Modern systems-on-chip (SoCs) include not only general-purpose CPUs but also specialized hardware accelerators. Typically, there are three coherence model choices to integrate an accelerator with the memory hierarchy: no coherence, coherent with the last-level cache (LLC), and private cache based full coherence. However, there has been very limited research on finding which coherence models are optimal for the accelerators of a complex many-accelerator SoC. This paper focuses on determining a cost-aware coherence interface for an SoC and its target application: find the best coherence models for the accelerators that optimize their power and performance, considering both workload characteristics and system-level contention. A novel comprehensive methodology is proposed that uses Bayesian optimization to efficiently find the cost-aware coherence interfaces for SoCs that are modeled using the gem5-Aladdin architectural simulator. For a complete analysis, gem5-Aladdin is extended to support LLC coherence in addition to already-supported no coherence and full coherence. For a heterogeneous SoC targeting applications with varying amount of accelerator-level parallelism, the proposed framework rapidly finds cost-aware coherence interfaces that show significant performance and power benefits over the other commonly-used coherence interfaces.

DidaSel: dirty data based selection of VC for effective utilization of NVM buffers in on-chip interconnects

  • Khushboo Rani
  • Sukarn Agarwal
  • Hemangee K. Kapoor

In a multi-core system, communication across cores is managed by an on-chip interconnect called Network-on-Chip (NoC). The utilization of NoC results in limitations such as high communication delay and high network power consumption. The buffers of the NoC router consume a considerable amount of leakage power. This paper attempts to reduce leakage power consumption by using Non-Volatile Memory technology-based buffers. NVM technology has the advantage of higher density and low leakage but suffers from costly write operation, and weaker write endurance. These characteristics impact on the total network power consumption, network latency, and lifetime of the router as a whole.

In this paper, we propose a write reduction technique, which is based on dirty flits present in write-back data packets. The method also suggests a dirty flit based Virtual Channel (VC) allocation technique that distributes writes in NVM technology-based VCs to improve the lifetime of NVM buffers.

The experimental evaluation on the full system simulator shows that the proposed policy obtains a 53% reduction in write-back flits, which results in 27% lesser total network flit on average. All these results in a significant decrease in total and dynamic network power consumption. The policy also shows remarkable improvement in the lifetime.

WELCOMF: wear leveling assisted compression using frequent words in non-volatile main memories

  • Arijit Nath
  • Hemangee K. Kapoor

Emerging Non-Volatile memories such as Phase Change Memory (PCM) and Resistive RAM are projected as potential replacements of the traditional DRAM-based main memories. However, limited write endurance and high write energy limit their chances of adoption as a mainstream main memory standard.

In this paper, we propose a word-level compression scheme called COMF to reduce bitflips in PCMs by removing the most repeated words from the cache lines before writing into memory. We also propose an intra-line wear leveling technique called WELCOMF that extends COMF to improve lifetime. Experimental results show that the proposed technique improves lifetime by 75%, and reduces bit flips and energy by 45% and 46%, respectively, over the baseline.
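
An illustrative Python sketch of word-level frequent-word compression in the spirit of COMF (not the paper's exact encoding or metadata layout): the most repeated word in a cache line is stored once together with a position bitmap, so fewer words, and hence fewer PCM bits, are written.

    from collections import Counter

    def compress_line(words):
        """words: 16 four-byte words (ints). Store the most frequent word once, a 16-bit
        bitmap marking its positions, and the remaining words in order."""
        frequent, _ = Counter(words).most_common(1)[0]
        bitmap = [1 if w == frequent else 0 for w in words]
        rest = [w for w in words if w != frequent]
        return frequent, bitmap, rest

    def decompress_line(frequent, bitmap, rest):
        it = iter(rest)
        return [frequent if b else next(it) for b in bitmap]

    line = [0x0, 0x0, 0xdeadbeef, 0x0, 0x0, 0x0, 0x12345678, 0x0,
            0x0, 0x0, 0x0, 0x0, 0xcafebabe, 0x0, 0x0, 0x0]
    f, bmp, rest = compress_line(line)
    assert decompress_line(f, bmp, rest) == line
    print(f"words written: {1 + len(rest)} instead of {len(line)} (plus a 16-bit bitmap)")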

SESSION: ML-based low-power architecture

Low-power object counting with hierarchical neural networks

  • Abhinav Goel
  • Caleb Tung
  • Sara Aghajanzadeh
  • Isha Ghodgaonkar
  • Shreya Ghosh
  • George K. Thiruvathukal
  • Yung-Hsiang Lu

Deep Neural Networks (DNNs) achieve state-of-the-art accuracy in many computer vision tasks, such as object counting. Object counting takes two inputs: an image and an object query and reports the number of occurrences of the queried object. To achieve high accuracy, DNNs require billions of operations, making them difficult to deploy on resource-constrained, low-power devices. Prior work shows that a significant number of DNN operations are redundant and can be eliminated without affecting the accuracy. To reduce these redundancies, we propose a hierarchical DNN architecture for object counting. This architecture uses a Region Proposal Network (RPN) to propose regions-of-interest (RoIs) that may contain the queried objects. A hierarchical classifier then efficiently finds the RoIs that actually contain the queried objects. The hierarchy contains groups of visually similar object categories. Small DNNs at each node of the hierarchy classify between these groups. The RoIs are incrementally processed by the hierarchical classifier. If the object in an RoI is in the same group as the queried object, then the next DNN in the hierarchy processes the RoI further; otherwise, the RoI is discarded. By using a few small DNNs to process each image, this method reduces the memory requirement, inference time, energy consumption, and number of operations with negligible accuracy loss when compared with the existing techniques.
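
A toy Python sketch of the hierarchical filtering idea (the hierarchy, labels, and the stub classifier below are hypothetical; the paper uses small DNNs at each node): an RoI keeps descending the hierarchy only while its predicted group matches the group containing the queried object, so most RoIs are discarded after one or two cheap classifications.

    HIERARCHY = {
        "root": ["animals", "vehicles"],
        "animals": ["cat", "dog"],
        "vehicles": ["car", "bus"],
    }

    def group_of(node, label):
        """Return which child group of `node` contains `label`, or None."""
        for child in HIERARCHY.get(node, []):
            if child == label or (child in HIERARCHY and group_of(child, label)):
                return child
        return None

    def classify(node, roi):
        """Stub for the small per-node DNN: here it simply reads the RoI's true group."""
        return group_of(node, roi["label"])

    def count(rois, query):
        total, dnn_evals = 0, 0
        for roi in rois:
            node, matched = "root", True
            while node in HIERARCHY:                 # descend until a leaf or a mismatch
                dnn_evals += 1
                predicted = classify(node, roi)
                if predicted != group_of(node, query):
                    matched = False                  # RoI discarded early
                    break
                node = predicted
            if matched:
                total += 1
        return total, dnn_evals

    rois = [{"label": l} for l in ["cat", "dog", "car", "cat", "bus"]]
    print(count(rois, "cat"))                        # (count of 'cat', classifier evaluations)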

Integrating event-based dynamic vision sensors with sparse hyperdimensional computing: a low-power accelerator with online learning capability

  • Michael Hersche
  • Edoardo Mello Rella
  • Alfio Di Mauro
  • Luca Benini
  • Abbas Rahimi

We propose to embed features extracted from event-driven dynamic vision sensors to binary sparse representations in hyperdimensional (HD) space for regression. This embedding compresses events generated across 346×260 differential pixels to a sparse 8160-bit vector by applying random activation functions. The sparse representation not only simplifies inference, but also enables online learning with the same memory footprint. Specifically, it allows efficient updates by retaining binary vector components over the course of online learning that cannot be otherwise achieved with dense representations demanding multibit vector components. We demonstrate online learning capability: using estimates and confidences of an initial model trained with only 25% of data, our method continuously updates the model for the remaining 75% of data, resulting in a close match with accuracy obtained with an oracle model on ground truth labels. When mapped on an 8-core accelerator, our method also achieves lower error, latency, and energy compared to other sparse/dense alternatives. Furthermore, it is 9.84× more energy-efficient and 6.25× faster than an optimized 9-layer perceptron with comparable accuracy.
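
A generic Python sketch of a sparse binary hyperdimensional embedding (not the paper's exact encoder; the dimensionality matches the 8160-bit figure, but the random projection and top-k activation are assumptions): a feature vector is projected into HD space and only the strongest components are set to 1.

    import numpy as np

    D, SPARSITY = 8160, 0.05                         # hypervector width and active-bit fraction
    rng = np.random.default_rng(0)

    def embed(features, proj):
        scores = proj @ features                     # random projection into HD space
        k = int(SPARSITY * D)
        hv = np.zeros(D, dtype=np.uint8)
        hv[np.argpartition(scores, -k)[-k:]] = 1     # keep only the k strongest components
        return hv

    features = rng.standard_normal(256)              # e.g. pooled event-camera activity
    proj = rng.standard_normal((D, 256))
    hv = embed(features, proj)
    print("active bits:", int(hv.sum()), "of", D)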

FTRANS: energy-efficient acceleration of transformers using FPGA

  • Bingbing Li
  • Santosh Pandey
  • Haowen Fang
  • Yanjun Lyv
  • Ji Li
  • Jieyang Chen
  • Mimi Xie
  • Lipeng Wan
  • Hang Liu
  • Caiwen Ding

In natural language processing (NLP), the “Transformer” architecture was proposed as the first transduction model relying entirely on self-attention mechanisms, without using sequence-aligned recurrent neural networks (RNNs) or convolution, and it achieved significant improvements for sequence-to-sequence tasks. The intensive computation and storage of these pre-trained language representations has impeded their adoption on computation- and memory-constrained devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms for its high parallelism and low latency. However, the trained models are still too large to fit in an FPGA fabric. In this paper, we propose an efficient acceleration framework, Ftrans, for transformer-based large-scale language representations. Our framework includes an enhanced block-circulant matrix (BCM)-based weight representation to enable model compression of large-scale language representations at the algorithm level with little accuracy degradation, and an acceleration design at the architecture level. Experimental results show that our proposed framework significantly reduces the model size of NLP models, by up to 16 times. Our FPGA design achieves 27.07× and 81× improvements in performance and energy efficiency compared to CPU, and up to 8.80× improvement in energy efficiency compared to GPU.
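
A small Python/NumPy sketch of the block-circulant matrix (BCM) representation that the compression builds on (dimensions are toy values, and the paper's FPGA design is of course very different): each b×b block is defined by a single length-b vector, and its matrix-vector product reduces to an FFT-based circular convolution, cutting storage by roughly a factor of b.

    import numpy as np

    def circulant(c):
        """Dense circulant matrix whose first column is c (for reference only)."""
        b = len(c)
        return np.array([[c[(i - j) % b] for j in range(b)] for i in range(b)])

    def bcm_matvec(blocks, x, b):
        """y = W x where W is block-circulant; blocks[i][j] is the defining vector of block (i, j)."""
        rows, cols = len(blocks), len(blocks[0])
        y = np.zeros(rows * b)
        for i in range(rows):
            acc = np.zeros(b, dtype=complex)
            for j in range(cols):                     # FFT-domain accumulation per block row
                acc += np.fft.fft(blocks[i][j]) * np.fft.fft(x[j * b:(j + 1) * b])
            y[i * b:(i + 1) * b] = np.fft.ifft(acc).real
        return y

    b, rows, cols = 4, 2, 3
    rng = np.random.default_rng(1)
    blocks = [[rng.standard_normal(b) for _ in range(cols)] for _ in range(rows)]
    x = rng.standard_normal(cols * b)

    W_dense = np.block([[circulant(c) for c in row] for row in blocks])
    assert np.allclose(W_dense @ x, bcm_matvec(blocks, x, b))
    print("BCM parameters:", rows * cols * b, "vs dense parameters:", rows * cols * b * b)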

POSTER SESSION: Poster papers

BrainWave: an energy-efficient EEG monitoring system – evaluation and trade-offs

  • Barry de Bruin
  • Kamlesh Singh
  • Jos Huisken
  • Henk Corporaal

This paper presents the design and evaluation of an energy-efficient seizure detection system for emerging EEG-based monitoring applications, such as non-convulsive epileptic seizure detection and Freezing-of-Gait (FoG) detection. As part of the BrainWave system, a BrainWave processor for flexible and energy-efficient signal processing is designed. The key system design parameters, including algorithmic optimizations, feature offloading and near-threshold computing are evaluated in this work. The BrainWave processor is evaluated while executing a complex EEG-based epileptic seizure detection algorithm. In a 28-nm FDSOI technology, 325 μJ per classification at 0.9 V and 290 μJ at 0.5 V are achieved using an optimized software-only implementation. By leveraging a Coarse-Grained Reconfigurable Array (CGRA), 160 μJ and 135 μJ are obtained, respectively, while maintaining a high level of flexibility. Near-threshold computing combined with CGRA acceleration leads to an energy reduction of up to 59%, or 55% including idle-time overhead.

QUANOS: adversarial noise sensitivity driven hybrid quantization of neural networks

  • Priyadarshini Panda

Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial attacks, wherein a model gets fooled by applying slight perturbations to the input. In this paper, we investigate the use of quantization to potentially resist adversarial attacks. Several recent studies have reported remarkable results in reducing the energy requirement of a DNN through quantization. However, no prior work has considered the relationship between the adversarial sensitivity of a DNN and its effect on quantization. We propose QUANOS, a framework that performs layer-specific hybrid quantization based on Adversarial Noise Sensitivity (ANS). We identify a novel noise stability metric (ANS) for DNNs, i.e., the sensitivity of each layer’s computation to adversarial noise. ANS allows for a principled way of determining the optimal bit-width per layer that provides adversarial robustness as well as energy-efficiency with minimal loss in accuracy. Essentially, QUANOS assigns layer significance based on its contribution to adversarial perturbation and accordingly scales the precision of the layers. We evaluate the benefits of QUANOS on precision-scalable Multiply and Accumulate (MAC) hardware architectures with data gating and subword parallelism capabilities. Our experiments on the CIFAR10 and CIFAR100 datasets show that QUANOS outperforms a homogeneously quantized 8-bit precision baseline in terms of adversarial robustness (3-4% higher) while yielding improved compression (>5×) and energy savings (>2×) at iso-accuracy. At iso-compression rate, QUANOS yields significantly higher adversarial robustness (>10%) than a similar-sized baseline against strong white-box attacks. We also find that combining QUANOS with state-of-the-art defense methods outperforms the state-of-the-art in robustness (~5-16% higher) against very strong attacks.
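
A hedged Python sketch of sensitivity-driven mixed precision (the per-layer ANS values and the linear bit-allocation rule are invented stand-ins for the paper's metric): more adversarially sensitive layers are given wider bit-widths, and a simple uniform quantizer shows what a chosen bit-width does to a single weight.

    ans = {"conv1": 0.92, "conv2": 0.55, "conv3": 0.31, "fc": 0.12}   # assumed ANS per layer

    def assign_bits(sensitivity, min_bits=2, max_bits=8):
        """Linearly map each layer's sensitivity to a bit-width in [min_bits, max_bits]."""
        lo, hi = min(sensitivity.values()), max(sensitivity.values())
        bits = {}
        for layer, s in sensitivity.items():
            frac = (s - lo) / (hi - lo) if hi > lo else 1.0
            bits[layer] = round(min_bits + frac * (max_bits - min_bits))
        return bits

    def quantize(x, bits):
        """Uniform symmetric quantization of a value in [-1, 1] to the given bit-width."""
        levels = 2 ** (bits - 1) - 1
        return round(x * levels) / levels

    print(assign_bits(ans))            # e.g. {'conv1': 8, 'conv2': 5, 'conv3': 3, 'fc': 2}
    print(quantize(0.437, 4))          # one weight quantized to 4 bits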

Pre-layout clock tree estimation and optimization using artificial neural network

  • Sunwha Koh
  • Yonghwi Kwon
  • Youngsoo Shin

Clock tree synthesis (CTS) takes place at a very late design stage, so most of the time power consumption is analyzed while a circuit does not yet contain a clock tree. We build an artificial neural network (ANN) to estimate the number of clock buffers and apply it to each clock gater as well as the clock source of the ideal clock network. A clock structure is then constructed using the estimated clock buffers. Experiments with a few test circuits demonstrate very high accuracy for this method, with an average clock power estimation error of less than 5%. The proposed method also allows us to find the minimum possible number of clock buffers with optimized clock parameters (e.g., target skew and clock transition time). The minimum number of buffers can be found by a binary search algorithm; at each step of the algorithm, the trained ANN is used to find clock parameters for the target number of buffers. Using the proposed clock parameter optimization, we found that the number of buffers in the clock network can be reduced by 31% on average.
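
A minimal sketch of the binary search described above, assuming a hypothetical predict_buffers(params) stand-in for the trained ANN and an assumed candidate set of clock-parameter settings:

```python
# Binary search for the smallest buffer budget that some clock-parameter
# setting can meet, using an ANN-based estimator (hypothetical names).
def min_feasible_buffers(predict_buffers, candidate_params, lo, hi):
    best = hi
    while lo <= hi:
        target = (lo + hi) // 2
        # Feasible if some clock-parameter setting is predicted to need <= target buffers.
        if any(predict_buffers(p) <= target for p in candidate_params):
            best, hi = target, target - 1
        else:
            lo = target + 1
    return best

# Toy usage with a fake estimator: a more relaxed setting (larger p) needs fewer buffers.
print(min_feasible_buffers(lambda p: 100 - p, candidate_params=range(0, 41, 10), lo=1, hi=100))
```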

GC-eDRAM design using hybrid FinFET/NC-FinFET

  • Ramin Rajaei
  • Yen-Kai Lin
  • Sayeef Salahuddin
  • Michael Niemier
  • X. Sharon Hu

Gain cell embedded DRAMs (GC-eDRAMs) are a potential alternative to conventional static random access memories thanks to attractive advantages such as high density, low leakage, and two-ported operation. As CMOS technology nodes scale down, the design of GC-eDRAM at deeply scaled nanometer nodes becomes more challenging: deeply scaled technology nodes suffer from high leakage currents, resulting in low data retention times (DRTs) for GC-eDRAMs. Negative capacitance FinFETs (NC-FinFETs) are a promising emerging device for ultra-low-power VLSI design. Due to their lower leakage currents, NC-FinFETs can facilitate GC-eDRAM designs with higher DRTs. We show that although NC-FinFETs have lower OFF currents and higher ION/IOFF ratios, their ON current is lower than that of FinFETs by approximately 30%, which results in lower performance. To benefit from the potential power efficiency and high DRTs of NC-FinFETs without sacrificing performance, we propose hybrid FinFET/NC-FinFET configurations for several prior 2T, 3T, and 4T GC-eDRAM cells. Simulations based on a 14nm experimentally calibrated NC-FinFET model suggest that the hybrid designs offer up to 96.8% and 86.3% improvements in DRT and static power consumption, respectively, when compared to the FinFET implementation. They also offer up to 47% read delay improvement over the NC-FinFET design. We also study the effects of voltage scaling on the DRT and refresh energy of the proposed GC-eDRAM cells. The associated simulation results reveal that, across different supply voltages, the proposed hybrid 4T GC-eDRAM cell offers up to 370× less refresh energy than the other designs.

SAOU: safe adaptive overclocking and undervolting for energy-efficient GPU computing

  • Hadi Zamani
  • Devashree Tripathy
  • Laxmi Bhuyan
  • Zizhong Chen

The current trend of ever-increasing performance in scientific applications comes with tremendous growth in energy consumption. In this paper, we present a framework for GPU applications that reduces energy consumption through Safe Overclocking and Undervolting (SAOU) without sacrificing performance. The idea is to increase the frequency beyond the safe maximum frequency f_safeMax and undervolt below the safe minimum voltage V_safeMin to obtain maximum energy savings. Since such overclocking and undervolting may give rise to faults, we employ an enhanced checkpoint-recovery technique to cover the possible errors. Empirically, we explore different errors and derive a fault model that can set the undervolting and overclocking levels for maximum energy saving. As an example of scientific applications, we target the cuBLAS matrix multiplication (cuBLAS-MM) kernel for error correction using the checkpoint-and-recovery (CR) technique. For cuBLAS, SAOU achieves up to 22% energy reduction through undervolting and overclocking without sacrificing performance.

SparTANN: sparse training accelerator for neural networks with threshold-based sparsification

  • Hyeonuk Sim
  • Jooyeon Choi
  • Jongeun Lee

While sparsity has been exploited in many inference accelerators, little work has been done on training accelerators. Exploiting sparsity in training accelerators involves multiple issues, including where to find sparsity, how to exploit it, and how to create more of it. In this paper, we present a novel sparse training architecture that can exploit the sparsity in gradient tensors in both back-propagation and weight-update computation. We also propose a single-pass sparsification algorithm, a hardware-friendly version of a recently proposed sparse training algorithm, that can aggressively create additional sparsity during training. Our experimental results using large networks such as AlexNet and GoogLeNet demonstrate that our sparse training architecture can accelerate convolution-layer training time by 4.20~8.88× over baseline dense training without accuracy loss, and further increase training speed by 7.30~11.87× over the baseline with minimal accuracy loss.
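
As an illustration of threshold-based sparsification (not the paper's exact single-pass algorithm), the sketch below zeroes out gradient entries whose magnitude falls below a threshold, producing the additional sparsity that back-propagation and weight-update hardware can then skip. Names are our own.

```python
# Illustration of threshold-based gradient sparsification.
import numpy as np

def sparsify_gradients(grad, threshold):
    """Zero out gradient entries below the threshold; return the tensor and its sparsity."""
    mask = np.abs(grad) >= threshold
    return grad * mask, 1.0 - mask.mean()

g = np.random.default_rng(0).standard_normal((4, 4)) * 0.1
sparse_g, sparsity = sparsify_gradients(g, threshold=0.05)
print(f"fraction of zeroed gradient entries: {sparsity:.2f}")
```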

BLINK: bit-sparse LSTM inference kernel enabling efficient calcium trace extraction for neurofeedback devices

  • Zhe Chen
  • Garrett J. Blair
  • Hugh T. Blair
  • Jason Cong

Miniaturized fluorescent calcium imaging microscopes are widely used for monitoring the activity of a large population of neurons in freely behaving animals in vivo. Conventional calcium image analyses extract calcium traces by iterative, bulk image processing, making it hard to meet the power and latency requirements of neurofeedback devices. In this paper, we propose a calcium image processing pipeline based on a bit-sparse long short-term memory (LSTM) inference kernel (BLINK) for efficient calcium trace extraction. It greatly reduces power and latency while maintaining trace extraction accuracy. We implemented the customized pipeline on the Ultra96 platform. It can extract calcium traces from up to 1024 cells with sub-ms latency on a single FPGA device. We designed the BLINK circuits in a 28-nm technology. Evaluation shows that the proposed bit-sparse representation can reduce the circuit area by 38.7% and the power consumption by 38.4% without accuracy loss. The BLINK circuits achieve 410 pJ/inference, representing 6293× and 52.4× gains in energy efficiency compared to evaluations on a high-performance CPU and GPU, respectively.
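
The abstract does not define "bit-sparse"; one common reading is that each fixed-point weight is restricted to a few nonzero bits, i.e., approximated by a small sum of signed powers of two so that multiplications reduce to shifts and adds. The greedy quantizer below illustrates that reading only and is not the BLINK scheme.

```python
# One possible reading of "bit-sparse" (an assumption): approximate each weight
# by at most `num_terms` signed powers of two.
import numpy as np

def to_signed_powers_of_two(w, num_terms=2):
    approx, residual = 0.0, float(w)
    for _ in range(num_terms):
        if residual == 0.0:
            break
        exp = int(np.floor(np.log2(abs(residual))))   # largest power of two not above |residual|
        term = np.sign(residual) * 2.0 ** exp
        approx += term
        residual -= term
    return approx

print([to_signed_powers_of_two(w) for w in (0.3, -0.7, 0.05)])
# e.g. 0.3 ~ 0.25 + 0.03125, -0.7 ~ -0.5 - 0.125, 0.05 ~ 0.03125 + 0.015625
```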

BiasP: a DVFS based exploit to undermine resource allocation fairness in linux platforms

  • Harshit Kumar
  • Nikhil Chawla
  • Saibal Mukhopadhyay

Dynamic Voltage and Frequency Scaling (DVFS) plays an integral role in reducing the energy consumption of mobile devices while meeting targeted performance requirements. We examine the security obliviousness of CPUFreq, the DVFS framework in Linux-kernel based systems. Since Linux-kernel based operating systems are present in a wide array of applications, the high-level CPUFreq policies are designed to be platform-independent. Using these policies, we present the BiasP exploit, which restricts the allocation of CPU resources to a set of targeted applications, thereby degrading their performance. The exploit involves detecting the execution of instructions on the CPU core pertinent to the targeted applications, and thereafter using CPUFreq policies to limit the CPU resources available to those instructions. We demonstrate the practicality of the exploit by operating it on a commercial smartphone running Android OS based on the Linux kernel. We can successfully degrade the User Interface (UI) performance of the targeted applications, increasing the frame processing time and the number of dropped frames by up to 200% and 947%, respectively, for animations belonging to the targeted applications. We see a reduction of up to 66% in the number of retired instructions of the targeted applications. Furthermore, we propose a robust detector that is capable of detecting exploits aimed at undermining resource allocation fairness through malicious use of the DVFS framework.

Resiliency analysis and improvement of variational quantum factoring in superconducting qubit

  • Ling Qiu
  • Mahabubul Alam
  • Abdullah Ash-Saki
  • Swaroop Ghosh

Variational algorithms using the Quantum Approximate Optimization Algorithm (QAOA) can solve the prime factorization problem on near-term noisy quantum computers. Conventional Variational Quantum Factoring (VQF) requires a large number of 2-qubit gates (especially when factoring a large number), resulting in deep circuits. The output quality of a deep quantum circuit is degraded by errors, limiting the computational power of quantum computing. In this paper, we explore various transformations to optimize the QAOA circuit for integer factorization. We propose two criteria to select the optimal quantum circuit that can improve the noise resiliency of VQF.

HIPE-MAGIC: a technology-aware synthesis and mapping flow for highly parallel execution of memristor-aided LoGIC

  • Arash Fayyazi
  • Amirhossein Esmaili
  • Massoud Pedram

Recent efforts for finding novel computing paradigms that meet today’s design requirements have given rise to a new trend of processing-in-memory relying on non-volatile memories. In this paper, we present HIPE-MAGIC, a technology-aware synthesis and mapping flow for highly parallel execution of the memristor-based logic. Our framework is built upon two fundamental contributions: balancing techniques during the logic synthesis, mainly targeting benefits of the parallelism offered by memristive crossbar arrays (MCAs), and an efficient technology mapping framework to maximize the performance and area-efficiency of the memristor-based logic. Our experimental evaluations across several benchmark suites demonstrate the superior performance of HIPE-MAGIC in terms of throughput and energy efficiency compared to recently developed synthesis and mapping flows targeting MCAs, as well as the conventional CPU computing.

SHEARer: highly-efficient hyperdimensional computing by software-hardware enabled multifold approximation

  • Behnam Khaleghi
  • Sahand Salamat
  • Anthony Thomas
  • Fatemeh Asgarinejad
  • Yeseong Kim
  • Tajana Rosing

Hyperdimensional computing (HD) is an emerging paradigm for machine learning based on the evidence that the brain computes on high-dimensional, distributed representations of data. The main operation of HD is encoding, which transfers the input data to hyperspace by mapping each input feature to a hypervector, followed by a bundling procedure that adds up the hypervectors to produce the encoding hypervector. The operations of HD are simple and highly parallelizable, but the large number of operations hampers the efficiency of HD in the embedded domain. In this paper, we propose SHEARer, an algorithm-hardware co-optimization to improve the performance and energy consumption of HD computing. We gain insight from a prudent scheme for approximating the hypervectors that, thanks to the error resiliency of HD, has minimal impact on accuracy while providing ample opportunity for hardware optimization. Unlike previous works that generate the encoding hypervectors in full precision and then perform ex-post quantization, we compute the encoding hypervectors in an approximate manner that saves resources yet affords high accuracy. We also propose a novel FPGA architecture that achieves striking performance through massive parallelism with low power consumption. Moreover, we develop a software framework that enables training HD models by emulating the proposed approximate encodings. The FPGA implementation of SHEARer achieves an average throughput boost of 104,904× (15.7×) and energy savings of up to 56,044× (301×) compared to state-of-the-art encoding methods implemented on a Raspberry Pi 3 (GeForce GTX 1080 Ti) using practical machine learning datasets.
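
For reference, a plain (exact, full-precision) version of the encoding step described above can be sketched as follows; SHEARer's contribution is to approximate this computation, and that approximation is not reproduced here. The position/level hypervector scheme and every name below are assumptions for illustration.

```python
# Baseline HD encoding: map each feature to a hypervector, then bundle by addition.
import numpy as np

def hd_encode(features, base_hvs, level_hvs, num_levels):
    """base_hvs: (F, D) bipolar hypervectors, one per feature position;
    level_hvs: (L, D) bipolar hypervectors encoding quantized feature values;
    features : length-F vector assumed normalized to [0, 1)."""
    levels = np.clip((features * num_levels).astype(int), 0, num_levels - 1)
    bound = base_hvs * level_hvs[levels]     # bind each position with its value level
    return bound.sum(axis=0)                 # bundle into one encoding hypervector

D, F, L = 10000, 64, 16
rng = np.random.default_rng(0)
base = rng.choice([-1, 1], size=(F, D))
lev = rng.choice([-1, 1], size=(L, D))
hv = hd_encode(rng.random(F), base, lev, L)
```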

Implementing binary neural networks in memory with approximate accumulation

  • Saransh Gupta
  • Mohsen Imani
  • Hengyu Zhao
  • Fan Wu
  • Jishen Zhao
  • Tajana Šimunić Rosing

Processing in-memory (PIM) has shown great potential to accelerate the inference tasks of binarized neural networks (BNNs) by reducing data movement between processing units and memory. However, existing PIM architectures require analog/mixed-signal circuits that do not scale with CMOS technology. In contrast, we propose BitNAP (Binarized neural network acceleration with in-memory ThreSholding), which performs optimization at the operation, peripheral, and architecture levels for an efficient BNN accelerator. BitNAP supports row-parallel bitwise operations in crossbar memory by exploiting the switching of 1-bit bipolar resistive devices and a unique hybrid tunable thresholding operation. To reduce the area overhead of sensing-based operations, BitNAP presents a memory sense-amplifier sharing scheme and a novel operation pipelining technique to reduce the latency overhead of sharing. We evaluate the efficiency of BitNAP on the MNIST and ImageNet datasets using popular neural networks. BitNAP is on average 1.24× (10.7×) faster and 185.6× (10.5×) more energy-efficient than the state-of-the-art PIM accelerator for simple (complex) networks.
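
As background for the thresholding operation mentioned above, a binarized fully-connected layer is conventionally computed as XNOR, popcount, and a threshold comparison; the sketch below shows that standard formulation in software, with assumed shapes and names, not BitNAP's in-memory circuit design.

```python
# Standard XNOR / popcount / threshold formulation of a binarized layer.
import numpy as np

def bnn_layer(x_bits, w_bits, thresholds):
    """x_bits: (N,) {0,1}; w_bits: (M, N) {0,1} with 0 encoding -1; thresholds: (M,)."""
    xnor = ~(x_bits[None, :] ^ w_bits) & 1              # 1 where input and weight agree
    popcount = xnor.sum(axis=1)                         # per-output count of agreements
    return (popcount >= thresholds).astype(np.uint8)    # next layer's binary activations

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=8, dtype=np.uint8)
w = rng.integers(0, 2, size=(4, 8), dtype=np.uint8)
print(bnn_layer(x, w, thresholds=np.full(4, 4)))
```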

MEMOCODE 2019 TOC

Full Citation in the ACM Digital Library

A compositional semantics of Simulink/Stateflow based on quantized state hybrid automata

  • Jin Woo Ro
  • Avinash Malik
  • Partha Roop

Simulink/Stateflow® is the de-facto tool for the design of Cyber-Physical Systems (CPS). CPS include hybrid systems, where a discrete controller guides a continuous plant. Hybrid systems are characterised by their continuous-time dynamics with sudden discontinuities, caused by level/zero crossings. Stateflow can graphically capture hybrid phenomena, making it popular with control engineers. However, Stateflow is unable to correctly and efficiently simulate complex hybrid systems, especially those characterised by an even number of level crossings.

In this paper we first propose a new formal model for hybrid systems called the Quantized State Hybrid Input Output Automaton (QSHIOA). QSHIOA is used to give a deterministic semantics to Stateflow, in addition to efficiently handling an even number of level-crossing detections. In the proposed compositional semantics, a network of Stateflow charts can be compiled into a network of QSHIOAs. Benchmark results show that in the median case the proposed Stateflow execution technique, via QSHIOA, is 84% faster than using the best variable-step-size solvers in Simulink/Stateflow®.

Further sub-cycle and multi-cycle schedulling support for Bluespec Verilog

  • David J. Greaves

Bluespec [13] is a hardware description language where all behaviour is expressed in rules that execute atomically. The standard compilation semantics for Bluespec enforce a particular mapping between rule firing and hardware clock cycles, such as a register only being updated by exactly one firing of at most one rule in any clock cycle. Also, the standard compiler does not introduce any additional state, such as credit-based or round-robin arbiters to guarantee fairness between rules over time. On the other hand, many useful hardware resources, such as complex ALUs and synchronous RAMs, are pipelined. Unlike typical high-level synthesis tools, in standard Bluespec such resources cannot be invoked using infix operators in expressions such as A[e] or e1*e2 since binding to specific instances and multi-clock cycle schedules are required. In this paper we extend the reference semantics of Bluespec to decouple it from clock cycles, allowing multiple updates to a register within one clock cycle and automatic instantiation of arbiters for multi-clock cycle behaviour. We describe the new semantic packing rules as extensions of our standard compilation rules and we report early results from an open-source, fully-functional implementation.

Securing implantable medical devices with runtime enforcement hardware

  • Hammond Pearce
  • Matthew M. Y. Kuo
  • Partha S. Roop
  • Srinivas Pinisetty

In recent years we have seen numerous proof-of-concept attacks on implantable medical devices such as pacemakers. Attackers aim to breach the strict operational constraints that these devices operate within, with the end-goal of compromising patient safety and health. Most efforts to prevent these kinds of attacks are informal, and focus on application- and system-level security — for instance, using encrypted communications and digital certificates for program verification. However, these approaches will struggle to prevent all classes of attacks. Runtime verification has been proposed as a formal methodology for monitoring the status of implantable medical devices. Here, if an attack is detected a warning is generated. This leaves open the risk that the attack can succeed before intervention can occur. In this paper, we propose a runtime-enforcement based approach for ensuring patient security. Custom hardware is constructed for individual patients to ensure a safe minimum quality of service at all times. To ensure correctness we formally verify the hardware using a model-checker. We present our approach through a pacemaker case study and demonstrate that it incurs minimal overhead in terms of execution time and power consumption.

A timeless model for the verification of quasi-periodic distributed systems

  • Maryam Dabaghchian
  • Zvonimir Rakamarić

A cyber-physical system often consists of distributed multi-rate periodic processes that communicate using message passing; each process owns a local clock not synchronized with others. We call such systems quasi-periodic distributed systems. Traditionally, one would model them using timed automata, thereby having to deal with high-complexity verification problems. Recently, several researchers proposed discrete-time abstractions based on the calendar model to make the verification more tractable. However, even the calendar model contains a notion of time in the form of a global clock. We propose a novel, timeless computation model for quasi-periodic distributed systems to facilitate their verification. The main idea behind our model is to judiciously replace synchronization using a global clock and calendar with synchronization over lengths of message buffers. We introduce a simple domain-specific language for programming of such systems and use it to formalize the semantics of both the calendar and timeless model. Then, we prove that our timeless model is an overapproximation of the calendar model. Finally, we evaluate our timeless model using several benchmarks.

RTL bug localization through LTL specification mining (WIP)

  • Vighnesh Iyer
  • Donggyu Kim
  • Borivoje Nikolic
  • Sanjit A. Seshia

As the complexity of contemporary hardware designs continues to grow, functional verification demands more effort and resources in the design cycle than ever. As a result, manually debugging RTL designs is extremely challenging even with full signal traces after detecting errors in chip-level software simulation or FPGA emulation. Therefore, it is necessary to reduce the burden of verification by automating RTL debugging processes.

In this paper, we propose a novel approach for debugging with the use of LTL specification mining. In this approach, we extract fine-grained assertions that are implicitly encoded in the RTL design, representing the designer’s assumptions, to localize bugs that are only detected when high-level properties are violated from long-running full-system simulations. We employ template-based RTL spec mining to infer both safety and bounded liveness properties. We propose strategies to convert multi-bit signals to atomic propositions based on common RTL design idioms such as ready-valid handshakes and specific state transitions using automatic static analysis.

Our initial results with a tiny RISC-V core design show that this methodology is promising for localizing bugs in time and space by demonstrating that the mined fine-grained LTL properties are violated before a high-level test failure condition occurs, such as a timeout or hanging, and can point to specific lines of suspect RTL.

Encoding and monitoring responsibility sensitive safety rules for automated vehicles in signal temporal logic

  • Mohammad Hekmatnejad
  • Shakiba Yaghoubi
  • Adel Dokhanchi
  • Heni Ben Amor
  • Aviral Shrivastava
  • Lina Karam
  • Georgios Fainekos

As Automated Vehicles (AV) get ready to hit the public roads unsupervised, many practical questions still remain open. For example, there is no commonly acceptable formal definition of what safe driving is. A formal definition of safe driving can be utilized in developing the vehicle behaviors as well as in certification and legal cases. Toward that goal, the Responsibility-Sensitive Safety (RSS) model was developed as a first step toward formalizing safe driving behavior upon which the broader AV community can expand. In this paper, we demonstrate that the RSS model can be encoded in Signal Temporal Logic (STL). Moreover, using the S-TaLiRo tools, we present a case study of monitoring RSS requirements on selected traffic scenarios from CommonRoad. We conclude that monitoring RSS rules encoded in STL is efficient even in heavy traffic scenarios. One interesting observation is that for the selected traffic data, vehicle parameters and response times, the RSS model violations are not frequent.
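
For context, the longitudinal safe-distance rule at the core of RSS (as defined in the original RSS formulation) is the kind of requirement such an STL encoding monitors; the exact STL formulae used in the paper are not reproduced here.

```latex
% Minimum safe longitudinal distance between a rear vehicle r and a front vehicle f
% (\rho is r's response time; [x]_+ = max(x, 0)). An STL monitor checks, roughly,
% the invariant G( d_long >= d_min ).
\[
  d_{\min} \;=\; \Bigl[\, v_r\,\rho
      \;+\; \tfrac{1}{2}\, a_{\max,\mathrm{accel}}\,\rho^{2}
      \;+\; \frac{\bigl(v_r + \rho\, a_{\max,\mathrm{accel}}\bigr)^{2}}{2\, a_{\min,\mathrm{brake}}}
      \;-\; \frac{v_f^{2}}{2\, a_{\max,\mathrm{brake}}} \,\Bigr]_{+}
\]
```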

A compositional approach for real-time machine learning

  • Nathan Allen
  • Yash Raje
  • Jin Woo Ro
  • Partha Roop

Cyber-Physical Systems are highly safety critical, especially since they have to provide both functional and timing guarantees. Increasingly, Cyber-Physical Systems such as autonomous vehicles are relying on Artificial Neural Networks in their decision making and this has obvious safety implications. While many formal approaches have been recently developed for ensuring functional correctness of machine learning modules involving Artificial Neural Networks, the issue of timing correctness has received scant attention.

This paper proposes a new compiler from the well-known Keras neural network library to hardware to mitigate the above problem. In the developed approach, we compile networks of Artificial Neural Networks, called Meta Neural Networks, to hardware implementations using a new synchronous semantics for their execution. The developed semantics enables compilation of Meta Neural Networks to a parallel hardware implementation involving limited hardware resources. The developed compiler is semantics-driven and guarantees that the generated implementation is deterministic and time-predictable. The compiler also provides a better alternative for the realisation of non-linear functions in hardware. Overall, we show that the developed approach is significantly more efficient than a software approach, without the burden of complex algorithms needed for software Worst Case Execution Time analysis.

Polyhedral fragments: an efficient representation for symbolically generating code for processor arrays

  • Michael Witterauf
  • Frank Hannig
  • Jürgen Teich

To leverage the vast parallelism of loops, embedded loop accelerators often take the form of processor arrays with many, but simple, processing elements. Each processing element executes a subset of a loop’s iterations in parallel, exploiting instruction- and data-level parallelism by tightly scheduling iterations using software pipelining and packing instructions into compact, individual programs. However, loop bounds are often unknown until runtime, which complicates the static generation of programs because the bounds influence each program’s control flow.

Existing solutions, like generating and storing all possible programs or full just-in-time compilation, are prohibitively expensive, especially in embedded systems. As a remedy, we propose a hybrid approach introducing a tree-like program representation, whose generation front-loads all intractable sub-problems to compile time, and from which all concrete program variants can efficiently be stitched together at runtime. The tree consists of so-called polyhedral fragments that represent concrete program parts and are annotated with iteration-dependent conditions.

We show that this representation is both space- and time-efficient: it requires polynomial space to store, whereas storing all possibly generated programs requires non-polynomial space, and polynomial time to evaluate, whereas just-in-time compilation requires solving NP-hard problems. In a case study, we show for a representative loop program that using a tree of polyhedral fragments saves 98.88% of space compared to storing all program variants.

Security-driven metrics and models for efficient evaluation of logic encryption schemes

  • Yinghua Hu
  • Vivek V. Menon
  • Andrew Schmidt
  • Joshua Monson
  • Matthew French
  • Pierluigi Nuzzo

Research in logic encryption over the last decade has resulted in various techniques to prevent different security threats such as Trojan insertion, intellectual property leakage, and reverse engineering. However, there is little agreement on a uniform set of metrics and models to efficiently assess the achieved security level and the trade-offs between security and overhead. This paper addresses the above challenges by relying on a general logic encryption model that can encompass all the existing techniques, and a uniform set of metrics that can capture multiple, possibly conflicting, security concerns. We apply our modeling approach to four state-of-the-art encryption techniques, showing that it enables fast and accurate evaluation of design trade-offs, average prediction errors that are at least 2× smaller than previous approaches, and the evaluation of compound encryption methods.

Modeling observability in adaptive systems to defend against advanced persistent threats

  • Cody Kinneer
  • Ryan Wagner
  • Fei Fang
  • Claire Le Goues
  • David Garlan

Advanced persistent threats (APTs) are a particularly troubling challenge for software systems. The adversarial nature of the security domain, and APTs in particular, poses unresolved challenges to the design of self-* systems, such as how to defend against multiple types of attackers with different goals and capabilities. In this interaction, the observability of each side is an important and under-investigated issue in the self-* domain. We propose a model of APT defense that elevates observability as a first-class concern. We evaluate this model by showing how an informed approach that uses observability improves the defender’s utility compared to a uniform random strategy, can enable robust planning through sensitivity analysis, and can inform observability-related architectural design decisions.

Approximate computing for multithreaded programs in shared memory architectures

  • Bernard Nongpoh
  • Rajarshi Ray
  • Ansuman Banerjee

In multicore, multi-cache architectures, cache coherence is ensured with a coherence protocol. However, the performance benefits of caching diminish due to the cost associated with the protocol implementation. In this paper, we propose a novel technique to improve the performance of multithreaded programs running on shared-memory multicore processors by embracing approximate computing. Our idea is to relax the coherence requirement selectively in order to reduce the cost associated with a cache-coherence protocol while ensuring a bounded QoS degradation with probabilistic reliability. In particular, we detect instructions in a multithreaded program that write to shared data, which we call Shared-Write-Access-Points (SWAPs), and propose an automated statistical analysis to identify those that can tolerate coherence faults. We call such SWAPs approximable. Our experiments on 9 applications from the SPLASH 3.0 benchmark suite reveal that an average of 57% of the tested SWAPs are approximable. To leverage this observation, we propose an adapted cache-coherence protocol that relaxes the coherence requirement on stores from approximable SWAPs. Additionally, our protocol uses stale values for load misses due to coherence, the stale value being the version at the time of invalidation. We observe an average 15% reduction in CPU cycles and 11% reduction in energy footprint from architectural simulation of the 9 applications using our approximate execution scheme.

Compositional construction of bounded error over-approximations of acyclic interconnected continuous dynamical systems

  • Ratan Lal
  • Pavithra Prabhakar

We consider the problem of bounded time safety verification of interconnections of input-output continuous dynamical systems. We present a compositional framework for computing bounded error approximations of the complete system from those of the components. The main crux of our approach consists of capturing the input-output signal behaviors of a component using an abstraction predicate that represents the input-output sample behaviors corresponding to the signal behaviors. We define a semantics for the abstraction predicate that captures an over-approximation of the input-output signal behaviors of a component. Next, we define how to compose abstraction predicates of components to obtain an abstraction predicate for the composed system. We instantiate our compositional abstraction construction framework for linear dynamical systems by providing concrete methods for constructing the input-output abstraction predicates for the individual systems.

Security analysis of cloud-connected industrial control systems using combinatorial testing

  • Peter W. V. Tran-Jørgensen
  • Tomas Kulik
  • Jalil Boudjadar
  • Peter Gorm Larsen

Industrial control systems are moving from monolithic to distributed and cloud-connected architectures, which increases system complexity and vulnerability, thus complicating security analysis. When exhaustive verification accounts for this complexity, the state space to be searched grows drastically as the system model evolves and more details are considered. Eventually this may lead to state space explosion, which makes exhaustive verification infeasible. To address this, we use VDM-SL’s combinatorial testing feature to generate security attacks that are executed against the model to verify whether the system has the desired security properties. We demonstrate our approach using a cloud-connected industrial control system that is responsible for performing safety-critical tasks and handling client requests sent to the control network. Although the approach is not exhaustive, it enables verification of mitigation strategies for a large number of attacks and complex systems within reasonable time.

Detecting security leaks in hybrid systems with information flow analysis

  • Luan Viet Nguyen
  • Gautam Mohan
  • James Weimer
  • Oleg Sokolsky
  • Insup Lee
  • Rajeev Alur

Information flow analysis is an effective way to check useful security properties, such as whether secret information can leak to adversaries. Despite being widely investigated in the realm of programming languages, information-flow-based security analysis has not been widely studied in the domain of cyber-physical systems (CPS). CPS provide interesting challenges to traditional type-based techniques, as they model mixed discrete-continuous behaviors and are usually expressed as a composition of state machines. In this paper, we propose a lightweight static analysis methodology that enables checking information security properties of CPS models. We introduce a set of security rules for hybrid automata that characterizes the property of non-interference. Based on those rules, we propose an algorithm that generates security constraints between the sub-components of hybrid automata, and then transforms these constraints into a directed dependency graph to search for non-interference violations. The proposed algorithm can be applied directly to parallel compositions of automata without resorting to model-flattening techniques. Our static checker works on hybrid systems modeled in Simulink/Stateflow format and decides whether or not the model satisfies non-interference given a user-provided security annotation for each variable. Moreover, our approach can also infer the security labels of variables, allowing a designer to verify the correctness of partial security annotations. We demonstrate the potential benefits of the proposed methodology on two case studies.
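
The final step described above, searching a dependency graph for flows from secret to public variables, can be sketched as follows; the rule set that produces the graph and the label-inference machinery are in the paper, and all names here are illustrative assumptions.

```python
# Sketch: find flows from 'high' (secret) sources to 'low' (public) sinks in a
# variable-dependency graph.
def find_noninterference_violations(edges, labels):
    """edges: var -> iterable of vars that depend on it; labels: var -> 'high'/'low'."""
    violations = []
    for src, lbl in labels.items():
        if lbl != 'high':
            continue
        stack, seen = [src], {src}
        while stack:                      # depth-first reachability from each high source
            v = stack.pop()
            for w in edges.get(v, ()):
                if w in seen:
                    continue
                seen.add(w)
                if labels.get(w) == 'low':
                    violations.append((src, w))
                stack.append(w)
    return violations

# Flags the flows from 'secret' and 'tmp' (both high) into the public 'output'.
print(find_noninterference_violations(
    {'secret': ['tmp'], 'tmp': ['output']},
    {'secret': 'high', 'tmp': 'high', 'output': 'low'}))
```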

Logical specification and uniform synthesis of robust controllers

  • Paritosh K. Pandya
  • Amol Wakankar

This paper investigates the synthesis of robust controllers from a logical specification of regular properties given in the interval temporal logic QDDC. Our specification encompasses both hard robustness and soft robustness. Here, hard robustness guarantees the invariance of the commitment under relaxed (weakened) assumptions. A systematic framework for logically specifying the assumption weakening by means of a QDDC formula Rb(A), called a robustness criterion, is presented. This can be used with any user-specified assumption DA to obtain a relaxed (weakened) assumption Rb(DA). A variety of robustness criteria, encompassing some existing notions such as (k, b)-resilience as well as some new notions like tolerating non-burst errors and recovery from transient errors, are formulated logically. The soft robustness pertains to the ability of the controller to maintain the commitment for as many inputs as possible, irrespective of any assumption. We present a uniform method for the synthesis of a robust controller that guarantees the invariance of the specified hard robustness and optimizes the expected value of occurrence of the commitment across input sequences. Through the case study of a synchronous bus arbiter, we experimentally show the impact of a variety of hard robustness criteria, as well as the soft robustness, on the ability of the synthesized controllers to meet the commitment “as much as possible”.

Lattice-based SMT for program verification

  • Karine Even-Mendoza
  • Antti E. J. Hyvärinen
  • Hana Chockler
  • Natasha Sharygina

We present a lattice-based satisfiability modulo theory for verification of programs with library functions, for which the mathematical libraries supporting these functions contain a high number of equations and inequalities. Common strategies for dealing with library functions include treating them as uninterpreted functions or using the theories under which the functions are fully defined. The full definition could in most cases lead to instances that are too large to solve efficiently.

Our lightweight theory uses lattices for efficient representation of library functions by a subset of guarded literals. These lattices are constructed from equations and inequalities of properties of the library functions. These subsets are found during the lattice traversal. We generalise the method to a number of lattices for functions whose values depend on each other in the program, and we describe a simultaneous traversal algorithm of several lattices, so that a combination of guarded literals from all lattices does not lead to contradictory values of their variables.

We evaluate our approach on benchmarks taken from the robotics community, and our experimental results demonstrate that we are able to solve a number of instances that were previously unsolvable by existing SMT solvers.

Establishing a refinement relation between binaries and abstract code

  • Freek Verbeek
  • Joshua Bockenek
  • Abhijith Bharadwaj
  • Binoy Ravindran
  • Ian Roessle

This paper presents a method for establishing a refinement relation between a binary and a high-level abstract model. The abstract model is based on standard notions of control flow, such as if-then-else statements, while loops and variable scoping. Moreover, it contains high-level data structures such as lists and records. This makes the abstract model amenable for off-the-shelf verification techniques such as model checking or interactive theorem proving. The refinement relation translates, e.g., sets of memory locations to high-level datatypes, or pointer arithmetic to standard HOL functions such as list operations or record accessors. We show applicability of our approach by verifying functions from a binary containing the Network Security Services framework from Mozilla Firefox, running on the x86-64 architecture. Our methodology is interactive. We show that we are able to verify approximately 1000 lines of x86-64 machine code (corresponding to about 400 lines of source code) in one person month.

Backup Bylaws

BYLAWS of the Special Interest Group on DESIGN AUTOMATION of the Association for Computing Machinery, Inc.

  • Adopted – 27 October 1979
  • Revised – 9 March 1994
  • Revised – 7 July 2004
  • Revised – 24 March 2005
  • Revised – 20 January 2009

Article 1. Name and Scope

  1. This organization is called the Special Interest Group on Design Automation (SIGDA) of the Association for Computing Machinery, Inc. (the “ACM”).
  2. The scope of SIGDA’s specialty is to enhance the utility of computers as engineering tools in the design, fabrication, and test of systems and structures.

Article 2. Purpose

  1. SIGDA is organized and operated exclusively for educational, scientific, and technical purposes in design automation.
  2. The purpose of SIGDA and its activities includes:
    1. Collecting and disseminating information in design automation through a newsletter and other publications;
    2. Organizing sessions at conferences of the ACM;
    3. Sponsoring conferences, symposia, and workshops;
    4. Organizing projects and working groups for education, research, and development;
    5. Serving as a source of technical information for the Council and subunits of the ACM; and
    6. Representing the opinions and expertise of the membership on matters of technical interest to SIGDA or ACM.

Article 3. Charter

SIGDA will exist until dissolved as provided in Bylaw 6 of the ACM.

Article 4. Officers

  1. SIGDA officers are the Chair and Chairs for Awards, Conferences, Technical Activities, Educational Activities, Communications, and Finance; one of the named Chairs will also be a Vice-Chair. The Past Chair is not an elected official and may fill one of the named Chair positions. The officers are elected for three-year terms beginning July 1 of 2009. No extension of terms shall be allowed.
  2. The Chair is the principal officer, being responsible for leading SIGDA and managing its activities. The duties of the Chair are:
    1. Calling and presiding at SIGDA Executive Committee and business meetings;
    2. Conducting all of SIGDA’s activities in accordance with the policies of the ACM; and
    3. Making all appointments as authorized herein.
  3. The duties of the Vice-Chair are:
    1. Assisting the Chair in leading and managing SIGDA; and
    2. Presiding at meetings when the Chair is absent.
  4. The duties of the Past Chair are:
    1. Filling one of the named Chair positions below, or acting as a member of the Advisory Board; and
    2. Chairing the Nominating Committee for SIGDA officer elections.
  5. The duties of the Communications Chair are:
    1. Maintaining the records and correspondence of SIGDA;
    2. Keeping and distributing the minutes and action items of business and Executive Committee meetings.
  6. The duties of the Finance Chair are:
    1. Managing SIGDA’s finances according to the Financial Accountability Policy of the ACM. This includes preparing the annual budget, monitoring disbursements for adherence to the annual budget, and preparing financial reports as required.
    2. Managing the SIGDA Travel Grants program, if applicable.
  7. The duties of the Awards Chair are:
    1. Providing a single point of contact for all SIGDA sponsored awards;
    2. Coordinating the process of nominating ACM/SIGDA members for Fellow, Distinguished, and Senior grades.
  8. The duties of the Conference Chair are:
    1. Providing a single point of contact for all SIGDA sponsored, co-sponsored, in-coop events except events for which other SIGDA Advisory Board members have been specifically assigned;
    2. Coordinating the review and approval of all conference/symposia/workshop budgets.
  9. The duties of the Technical Activities Chair are:
    1. Providing a single point of contact for all SIGDA Technical Committees and other technical activities;
    2. Coordinating and reviewing SIGDA TC activities and other technical activities.
  10. The duties of the Educational Activities Chair are:
    1. Providing a single point of contact for all SIGDA educational activities;
    2. Coordinating and reviewing all SIGDA educational activities.

Article 5. The Executive Committee

  1. The Executive Committee comprises the officers.
  2. Specific duties of the Executive Committee include:
    1. Approval of bylaw amendments before submission to members;
    2. Approval of annual dues for SIGDA;
    3. Approval of the annual budget and review all expenditures in excess of 1% of the fiscal year’s opening Fund Balance on a quarterly basis;
    4. Approval of conferences, symposia, workshops or sessions sponsored, co-sponsored or held in cooperation with SIGDA; and
    5. All the major management policy decisions of SIGDA must be approved by the Executive Committee.
  3. A quorum is a majority of the members of the Executive Committee and approval requires a majority vote of those present. Approval by mail ballot requires a majority vote.
  4. Only a member of the Executive Committee can make a motion for a vote by the Executive Committee.
  5. All members of, or candidates for, the Executive Committee must be voting Members of ACM and of SIGDA.

Article 6. Vacancies and Appointments

  1. Should the Chair leave office before his term expires, the Vice-Chair will assume the duties of Chair. Should any other elected office (including Past Chair) become vacant, the Chair of the SIG Governing Board may, on nomination by the SIGDA Chair, and approval by majority vote of the Executive Committee, fill the vacancy. The Chair may fill vacancies in positions appointed by the Chair, according to the procedures for making the original appointments as provided herein.
  2. Should a vacancy be unfilled, either because of inadequacy of these bylaws or because of a dispute or for any other reason, the SIG Governing Board Chair may fill it.
  3. All appointments expire automatically when the Chair’s term of office expires.

Article 7. The Newsletter

  1. SIGDA will publish a newsletter at regular intervals as determined by the Executive Committee. The newsletter will be distributed to all members.
  2. The Chair will nominate an Editor of the Newsletter, to be approved by majority vote of the Executive Committee.

Article 8. The Advisory Board

  1. The Advisory Board includes the Executive Committee (officers). It also includes members-at-large who are nominated by the SIGDA Chair. The Chair normally nominates up to ten members-at-large to the Advisory Board for his or her term of office. Appointments to the Advisory Board must be approved by a majority vote of the Executive Committee.
  2. The purpose of the Advisory Board is to allow members outside the Executive Committee to participate in setting policy and direction for, and assist in the operation of, SIGDA. The Advisory Board members are typically the program managers or coordinators of SIGDA sponsored activities.
  3. The Advisory Board members are non-voting members of the SIGDA Board, and while the Advisory Board may participate in a vote, their votes are non-binding, and only the Executive Committee votes are binding.

Article 9. Membership, Dues, and Voting Privileges

  1. A person becomes a member only after enrolling and paying the required dues. The dues for SIGDA are determined by the SIGDA Executive Committee with the approval of the Chair of the SIG Governing Board.
  2. All members of SIGDA may vote in any ballot conducted by SIGDA. On any ballot, the votes cast by non-ACM members of SIGDA will, if necessary, be prorated downward so that their effective total cannot exceed 50% of the eligible votes.

Article 10. Reports and Records

The SIGDA Chair is responsible for filing reports about SIGDA as required by the SIG Board. These include:

  1. An annual report on the activities during the previous year;
  2. All reports required by the Financial Accountability Policy of the ACM; and
  3. Closing reports on conferences and symposia.

The membership records of SIGDA will be maintained by ACM headquarters.

Article 11. Elections

  1. The Chair shall appoint a nominating committee in the autumn of each election year. This committee will nominate at least two candidates for the position of the Chair and at least six other candidates for the members-at-large, who consent to serve on the Executive Committee and fill one of the named Chair positions if elected. The person winning the most votes among those nominated for the Chair will be elected to that position. The six (or seven, if the Past Chair does not wish to fill a named Chair position) receiving the highest number of votes among members-at-large are elected to the Executive Committee. A report of the nominating committee must be presented to the SIGDA membership before an election can be held.
  2. All applicants for the chair should have significant service experience of at least 3 years in the design automation community and SIGDA, in particular. They should have served at least one term in the executive committee in roles other than the chair. Equivalent experience through service to SIGDA-approved sponsored conferences as deemed acceptable by the nominating committee is allowed.  
  3. A petition from at least ten voting members of SIGDA will place other consenting candidates on the ballot for any of the EC positions, subject to meeting the requirements of 11(b) for the chair position. Petitions must be received by the Past Chair no later than April 15 in the year of election or within one month after the nominating committee has announced the candidates selected by the committee, whichever is later.
  4. Elections must be announced by direct communication to the SIGDA Membership with sufficient time before the election such that the membership has an opportunity to petition to be placed on the ballot.
  5. The election will be conducted among eligible voters by ballot sent by the nominating committee or by ACM Headquarters, following the election procedures of the ACM. The SIG Board will resolve ties.
  6. All named Chair positions, except that of the Chair, are to be decided by ballot of the new Executive Committee, from those elected as members-at-large. The new Executive Committee votes for each position: Vice-Chair, Finance, Communications, Conferences, Technical Activities, Educational Activities, and Awards.

Article 12. Amendments

  1. These bylaws may be amended by a majority vote of the ACM Executive Committee, or by a vote of SIGDA’s members as provided below. With the approval of the SIGDA Executive Committee, and the Executive Committee of the ACM, 2/3 of all the members of the SIG Board may amend Article 1 of these bylaws without a referendum of the members.
  2. Amendments to these bylaws may be proposed by the SIGDA Executive Committee, the SIG Governing Board, or by a petition from 10 voting members of SIGDA. All proposed amendments must be approved, prior to being submitted for a vote of the membership, by the Chairperson of both the SIG Governing Board and the Constitution and Bylaws Committee of ACM after the Executive Director of ACM has provided his advice.
  3. The ballot on the proposed amendment(s) will be conducted among the eligible voters by ACM Headquarters following the procedures of the ACM for voting bylaw amendments, unless a different procedure has been approved by the SIG Board. The proposal is adopted only if at least 2/3 of the effective votes of returned ballots approve it, and only if at least 10% of the ballots are returned. The Secretary/Treasurer will send a clean copy of the amended bylaws to the Executive Director of ACM and to the Chair of the SIG Governing Board.

Article 13. Dissolution

Should SIGDA be dissolved, control of its assets will revert to the ACM.

Article 14. Meetings

SIGDA will conduct at least one business meeting each year, normally in conjunction with the annual Design Automation Conference. All meetings sponsored by SIGDA must be open to all members of the ACM. SIGDA may hold meetings only in places that are open to all classes of members of the ACM. The Executive Committee may meet in closed sessions during business meetings.

Article 15. Consistency

The Constitution, Bylaws, and policies of the ACM and of the SIG Governing Board take precedence over any conflicting provisions of these bylaws or internal policies of SIGDA.

Info for Organizers of SIGDA Sponsored Events

ACM and SIGDA are closely monitoring the COVID-19 (2019-nCoV coronavirus) situation and its potential impact on ACM conferences. We are following updates on the situation from the World Health Organization (WHO) and the Centers for Disease Control and Prevention (CDC). We encourage all Conference Leaders to keep informed on risks, precautions, and symptoms to make educated decisions for their community.

An ACM Presidential Task Force was formed to provide advice to conference organizers facing the need to move their conference online in light of the social distancing recommendations and global restrictions on travel due to the COVID-19 pandemic. The Task Force's guide, What Conferences Can Do to Replace Face-to-Face Meetings, is available at https://people.clarkson.edu/~jmatthew/acm/VirtualConferences_GuideToBestPractices_CURRENT.pdf.

Conference Leaders should contact the ACM SIGDA liaison, Sade Rodriguez, for guidance on any concerns related to the potential impact this may have on conference planning, and should review the ACM Conference Planning Guide, which provides a good overview of the ACM support available. For a SIGDA-sponsored conference, it is important that SIGDA leaders be included in all discussions regarding any changes to the conference.