ASPDAC 2021 TOC
ASPDAC ’21: Proceedings of the 26th Asia and South Pacific Design Automation Conference
SESSION: 1A: University Design Contest I
A DSM-based Polar Transmitter with 23.8% System Efficiency
An energy-efficient digital polar transmitter (TX) based on a 1.5-bit Delta-Sigma modulator
(DSM) and a fractional-N injection-locked phase-locked loop (IL-PLL) is proposed. In
the proposed TX, the redundant charging and discharging of turned-off capacitors in conventional
switched-capacitor power amplifiers (SCPAs) is avoided, which drastically improves
the efficiency at power back-off. In the PLL, a spur-mitigation technique is proposed
to reduce the frequency mismatch between the oscillator and the reference. The transmitter,
implemented in 65nm CMOS, achieves a PAE of 29% at an EVM of -25.1dB and a system
efficiency of 23.8%.
A 0.41W 34Gb/s 300GHz CMOS Wireless Transceiver
A 300GHz CMOS-only wireless transceiver that achieves a maximum data rate of 34Gb/s
while consuming a total power of 0.41W from a 1V supply is introduced. A subharmonic
mixer with low conversion loss is proposed to compensate for the absence of RF amplifiers
in the TX and RX, as a mixer-last (TX) / mixer-first (RX) topology is adopted. The TRX covers 19 IEEE802.15.3d
channels (13-23, 39-43, 52-53, 59).
Capacitive Sensor Circuit with Relative Slope-Boost Method Based on a Relaxation Oscillator
This paper presents a relative slope-boosting technique for a capacitive sensor circuit
based on a relaxation oscillator. Our technique improves jitter, i.e., resolution,
by changing the voltage slopes on both the sensing and reference sides with respect
to the sensor capacitance. The prototype sensor circuit is implemented in a 180-nm
standard CMOS process and achieves a resolution of 710 aF while consuming 12.7 pJ per
cycle at a 13.78 kHz output frequency. The measured power consumption from a 1.2
V DC supply is 430 nW.
28GHz Phase Shifter with Temperature Compensation for 5G NR Phased-array Transceiver
A phase shifter with temperature compensation for a 28GHz phased-array TRX is presented.
A precise low-voltage current reference is proposed for the IDAC biasing circuit.
The total gain variation for a single TX path, including the phase shifter and post-stage
amplifiers, over -40°C to 80°C is only 1dB in measurement, and the overall phase error
due to temperature is less than 1 degree without off-chip calibration.
An up to 35 dBc/Hz Phase Noise Improving Design Methodology for Differential-Ring-Oscillators
Applied in Ultra-Low Power Systems
This work presents a novel control loop concept to dynamically adjust the biasing of a
differential ring oscillator (DRO) in order to improve its phase noise (PN) performance
in the ultra-low-power domain. The proposed feedback system can be applied to any DRO with
a tail current source. This paper presents the proposed concept
and includes measurements of a prototype system integrated in 180 nm CMOS, which underline
the feasibility of the discussed idea. Measurements show an up to 35 dBc/Hz phase
noise improvement with an active control loop. Moreover, the tuning range of the implemented
ring oscillator is extended by about 430% compared to fixed-bias operation. These
values are measured at a minimum oscillation power consumption of 55 pW/Hz.
University LSI Design Contest ASP-DAC 2021
Gate Voltage Optimization in Capacitive DC-DC Converters for Thermoelectric Energy
Harvesting
This paper presents a gate-voltage-optimized fully integrated charge pump for thermoelectric
energy harvesting applications. The trade-offs generated by raising the
gate voltage of the switching transistors are discussed. The proposed 5-stage/3-stage design,
implemented in a 180 nm CMOS technology, achieves startup voltages down to 0.12V/0.13V,
respectively, with the proposed technique. A 20% peak power conversion efficiency
improvement is achieved when compared with a similar 3-stage linear charge pump from
previous state-of-the-art research.
A 0.57-GOPS/DSP Object Detection PIM Accelerator on FPGA
This paper presents an object detection accelerator featuring a processing-in-memory
(PIM) architecture on FPGAs. PIM architectures are well known for their energy efficiency
and avoidance of the memory wall. In the accelerator, a PIM unit is developed using
BRAM and LUT-based counters, which also helps to improve the DSP performance density.
The overall architecture consists of 64 PIM units and three memory buffers that store
inter-layer results. A shrunk and quantized Tiny-YOLO network is mapped to the PIM
accelerator, where DRAM access is fully eliminated during inference. The design achieves
a throughput of 201.6 GOPS at a 100MHz clock rate and, correspondingly, a performance
density of 0.57 GOPS/DSP.
Supply Noise Reduction Filter for Parallel Integrated Transimpedance Amplifiers
This paper presents a supply noise reduction technique for transimpedance amplifiers (TIAs)
in optical interconnects. TIAs integrated in parallel suffer from inter-channel interference
via the supply and ground lines. We employ an RC filter to reduce the supply noise.
The filter is inserted at the first stage of the TIA and does not need extra power. The
proposed circuit was fabricated in a 180-nm CMOS process. The measurement results verify
a 38% noise reduction at 5 Gbps operation.
SESSION: 1B: Accelerating Design and Simulation
A Fast Yet Accurate Message-level Communication Bus Model for Timing Prediction of
SDFGs on MPSoC
Fast yet accurate performance and timing prediction of complex parallel data flow
applications on multi-processor systems remains a difficult discipline. The difficulty
stems from the complexity of the data flow applications and of the hardware platform
with shared resources, like buses and memories. This combination may lead to complex
timing interferences that are difficult to express in purely analytical or classical
simulation-based approaches. In this work, we propose a message-level communication
model for timing and performance prediction of Synchronous Data Flow (SDF) applications
on MPSoCs with shared memories. We compare our work against measurements and TLM simulation-based
performance prediction models on two case studies from the computer vision domain.
We show that the accuracy and execution time of our simulation outperform existing
approaches and are suitable for fast yet accurate design space exploration.
Simulation of Ideally Switched Circuits in SystemC
Modeling and simulation of power systems at low levels of abstraction is supported
by specialized tools such as SPICE and MATLAB. But when power systems are part of
larger systems including digital hardware and software, low-level models become over-detailed;
at the system level, models must be simple and execute fast. We present an extension
to SystemC that relies on efficient modeling, simulation, and synchronization strategies
for Ideally Switched Circuits. Our solution enables designers to specify circuits
and to jointly simulate them with other SystemC hardware and software models. We test
our extension with three power converter case studies and show a simulation speed-up
between 1.2 and 2.7 times while preserving accuracy when compared to the reference
tool. This work demonstrates the suitability of SystemC for the simulation of heterogeneous
models to meet system-level goals such as validation, verification, and integration.
HW-BCP: A Custom Hardware Accelerator for SAT Suitable for Single Chip Implementation for
Large Benchmarks
Boolean Satisfiability (SAT) has broad usage in Electronic Design Automation (EDA),
artificial intelligence (AI), and theoretical studies. Further, since SAT is an NP-complete
problem, its acceleration will also enable acceleration of a wide range of combinatorial
problems.
We propose a completely new custom hardware design to accelerate SAT. Starting with
the well-known fact that Boolean Constraint Propagation (BCP) takes most of the SAT
solving time (80-90%), we focus on accelerating BCP. By profiling a widely-used software
SAT solver, MiniSAT v2.2.0 (MiniSAT2) [1], we identify opportunities to accelerate
BCP via parallelization and elimination of von Neumann overheads, especially data
movement. The proposed hardware for BCP (HW-BCP) achieves these goals via a customized
combination of content-addressable memory (CAM) cells, SRAM cells, logic circuitry,
and optimized interconnects.
In 65nm technology, on the largest SAT instances in the SAT Competition 2017 benchmark
suite, our HW-BCP dramatically accelerates BCP (4.5ns per BCP in simulations) and
hence provides a 62-185x speedup over an optimized software implementation running on
general-purpose processors.
Finally, we extrapolate our HW-BCP design to 7nm technology and estimate its area and
delay. The analysis shows that in 7nm, at a realistic chip size, HW-BCP would be large
enough to handle the largest SAT instances in the benchmark suite.
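For readers unfamiliar with the kernel being accelerated, below is a minimal software sketch of Boolean Constraint Propagation (unit propagation), the operation that dominates SAT solving time; it illustrates what HW-BCP parallelizes, not the hardware design itself, and the data structures are simplified assumptions (software solvers such as MiniSAT use watched-literal schemes to avoid re-scanning every clause).

```python
# Minimal sketch of Boolean Constraint Propagation (unit propagation) on a CNF
# formula. Literals are signed integers in DIMACS style: +v is variable v,
# -v is its negation. This naive scan illustrates the operation HW-BCP
# parallelizes in hardware.

def bcp(clauses, assignment):
    """Propagate forced assignments until a fixpoint or a conflict.

    clauses    : list of clauses, each a list of literals, e.g. [[1, -2], [2, 3]]
    assignment : dict {variable: bool} of already-decided variables
    Returns ("ok" | "conflict", final assignment).
    """
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var, want = abs(lit), lit > 0
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == want:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:              # every literal is false
                return "conflict", assignment
            if len(unassigned) == 1:        # unit clause: its literal is forced
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return "ok", assignment


if __name__ == "__main__":
    cnf = [[1, -2], [2, 3], [-1, -3]]
    # Deciding x1=True forces x3=False via [-1, -3], then x2=True via [2, 3].
    print(bcp(cnf, {1: True}))   # ('ok', {1: True, 3: False, 2: True})
```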
SESSION: 1C: Process-in-Memory for Efficient and Robust AI
A Novel DRAM-Based Process-in-Memory Architecture and its Implementation for CNNs
Processing-in-Memory (PIM) is an emerging approach to bridge the memory-computation
gap. One of the key challenges of PIM architectures in the scope of neural network
inference is the deployment of traditional area-intensive arithmetic multipliers in
memory technology, especially for DRAM-based PIM architectures. Hence, existing DRAM
PIM architectures are either confined to binary networks or exploit the analog properties
of the sub-array bitlines to perform bulk bit-wise logic operations. The former reduces
the accuracy of predictions, i.e., the quality of results, while the latter increases overall
latency and power consumption.
In this paper, we present a novel DRAM-based PIM architecture and implementation for
multi-bit-precision CNN inference. The proposed implementation relies on shifter-based
approximate multiplications specially designed to fit into commodity DRAM architectures
and their technology. The main goal of this work is to propose an architecture that
is fully compatible with commodity DRAM architectures while maintaining a similar thermal
design power (i.e., < 1 W). Our evaluation shows that the proposed DRAM-based PIM has
a small area overhead of 6.6% when compared with an 8 Gb commodity DRAM. Moreover,
the architecture delivers a peak performance of 8.192 TOPS per memory channel while
maintaining a very high energy efficiency. Finally, our evaluation also shows that
the use of approximate multipliers results in a negligible drop in prediction accuracy
(i.e., < 2%) in comparison with conventional CNN inference that relies on traditional
arithmetic multipliers.
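As an illustration of the general idea behind shifter-based approximate multiplication, the sketch below rounds one operand to the nearest power of two so the multiply collapses into a shift. This is a common software analogy assumed for illustration only, not the paper's in-DRAM circuit or its exact error behavior.

```python
# Illustrative sketch of shifter-based approximate multiplication: round one
# operand to the nearest power of two so that a costly multiply becomes a
# cheap shift. This mirrors the general idea only; the DRAM-PIM design in the
# paper uses its own circuit-level scheme.

def round_to_pow2(x):
    """Return the shift amount whose power of two is closest to x (x > 0)."""
    shift = max(x.bit_length() - 1, 0)
    lo, hi = 1 << shift, 1 << (shift + 1)
    return shift if (x - lo) <= (hi - x) else shift + 1

def approx_mul(a, w):
    """Approximate a * w by shifting a according to the rounded weight w."""
    if w == 0 or a == 0:
        return 0
    sign = -1 if (a < 0) ^ (w < 0) else 1
    return sign * (abs(a) << round_to_pow2(abs(w)))

if __name__ == "__main__":
    for a, w in [(17, 6), (23, 9), (-12, 5)]:
        exact, approx = a * w, approx_mul(a, w)
        rel_err = abs(approx - exact) / abs(exact)
        print(f"{a} * {w}: exact={exact}, approx={approx}, rel.err={rel_err:.1%}")
```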
A Quantized Training Framework for Robust and Accurate ReRAM-based Neural Network
Accelerators
Neural networks (NNs), especially deep neural networks (DNNs), have achieved great success
in many fields. The ReRAM crossbar, as a promising candidate, is widely employed to
accelerate neural networks owing to its natural ability to perform matrix-vector
multiplication (MVM). However, ReRAM crossbars
suffer from high conductance variation due to many non-ideal effects, resulting in severe
inference accuracy degradation. Recent works use uniform quantization to enhance the
tolerance to conductance variation, but these methods still suffer high accuracy loss
under large variation. In this paper, we first analyze the impact of quantization
and conductance variation on accuracy. Then, based on two observations, we propose
a quantized training framework to enhance the robustness and accuracy of the neural
network running on the accelerator by introducing a smart non-uniform quantizer.
This framework consists of a robust trainable quantizer and a corresponding training
method, needs no extra hardware overhead, and is compatible with a standard neural
network training procedure. Experimental results show that our proposed method can
improve inference accuracy by 10% ~ 30% under large variation, compared with the uniform
quantization method.
Attention-in-Memory for Few-Shot Learning with Configurable Ferroelectric FET Arrays
Attention-in-Memory (AiM), a computing-in-memory (CiM) design, is introduced to implement
the attentional layer of Memory Augmented Neural Networks (MANNs). AiM consists of
a memory array based on Ferroelectric FETs (FeFETs) along with CMOS peripheral circuits
implementing configurable functionalities, i.e., it can be dynamically changed from
a ternary content-addressable memory (TCAM) to a general-purpose (GP) CiM. When compared
to state-of-the-art accelerators, AiM achieves comparable end-to-end speed-up and
energy for MANNs, with better accuracy (95.14% vs. 92.21%, and 95.14% vs. 91.98%)
at iso-memory size, for a 5-way 5-shot inference task on the Omniglot dataset.
SESSION: 1D: Validation and Verification
Mutation-based Compliance Testing for RISC-V
Compliance testing for RISC-V is very important. Essentially, it ensures that compatibility
is maintained between RISC-V implementations and the ever-growing RISC-V ecosystem.
Therefore, an official Compliance Test-suite (CT) is being actively developed. However,
it is very difficult to ensure that all relevant functional behavior is comprehensively
tested.
In this paper, we propose a mutation-based approach to boost RISC-V compliance testing
by providing more comprehensive testing results. To this end, we define mutation classes
tailored for RISC-V to assess the quality of the CT and provide a symbolic execution
framework to generate new test-cases that kill the undetected mutants. Our experimental
results demonstrate the effectiveness of our approach. We identified several serious
gaps in the CT and generated new tests to close these gaps.
A General Equivalence Checking Framework for Multivalued Logic
Logic equivalence checking is a critical task in the ASIC design flow. Due to the
rapid development of nanotechnology-based devices, efficient implementations of
multivalued logic are becoming practical. As a result, many synthesis algorithms for ternary
logic have been proposed. In this paper, we introduce an equivalence checking framework
for multivalued logic that exploits modern SAT solvers. Furthermore, a structural
conflict-driven clause learning (SCDCL) technique is also proposed to accelerate the
SAT solving process. The SCDCL algorithm deploys strategies to cut off the search
space for the SAT algorithms. The experimental results show that the proposed SCDCL technique
saves 42% of SAT-solver CPU time on average over a set of industrial benchmarks.
ATLaS: Automatic Detection of Timing-based Information Leakage Flows for SystemC HLS Designs
In order to meet time-to-market constraints, High-Level Synthesis (HLS) is being
increasingly adopted by the semiconductor industry. HLS designs, which can be automatically
translated into the Register Transfer Level (RTL), are typically written in SystemC
at the Electronic System Level (ESL). Timing-based information leakage and its countermeasures,
while well-known at RTL and below, have not yet been considered for HLS. This paper
contributes to this emerging research area by proposing ATLaS, a novel approach for
detecting timing-based information leakage flows in SystemC HLS designs. The efficiency
of our approach in identifying timing channels for SystemC HLS designs is demonstrated
on two security-critical architectures: a shared interconnect and a crypto core.
SESSION: 1E: Design Automation Methods for Various Microfluidic Platforms
A Multi-Commodity Network Flow Based Routing Algorithm for Paper-Based Digital Microfluidic
Biochips
Paper-based digital microfluidic biochips (P-DMFBs) have emerged as a safe, low-cost,
and fast-responsive platform for biochemical assays. In P-DMFBs, droplet manipulations
are executed by electrowetting. To enable electrowetting,
patterned arrays of electrodes and control lines on the paper are coated
with a hydrophobic Teflon film and a dielectric parylene-C film. Unlike traditional
DMFBs, the manufacturing of P-DMFBs is efficient and inexpensive since the electrodes
and control lines are printed on photo paper with an inkjet printer. The active paper-based
hybridized chip (APHC) is a type of P-DMFB that has both open and closed parts. APHCs are
more convenient than common P-DMFBs since there is no need to fabricate and maintain
the micro gap between the glass and the paper chip, which requires highly delicate treatment.
However, the pattern rails of electrodes in APHCs are denser than in traditional P-DMFBs,
which makes existing electrode routing algorithms fail on APHCs. To deal with the challenge
of electrode routing for APHCs, this paper proposes a multi-commodity network flow-based
routing algorithm, which simultaneously maximizes routability and minimizes the
total wire length of the control lines. The multi-commodity flow model can exploit
pin-sharing between electrodes, which improves routability and reduces detours
of routing lines. Moreover, the activation sequences of electrodes are considered,
which guarantees that the bioassay will not be interfered with after pin-sharing.
The proposed method achieves a 100% successful routing rate on real-life APHCs, while
other electrode routing methods cannot solve the electrode routing of APHCs successfully.
Interference-Free Design Methodology for Paper-Based Digital Microfluidic Biochips
Paper-based digital microfluidic biochips (P-DMFBs) have recently attracted great
attention for their low-cost, in-place, and fast fabrication. This technology is essential
for agile bio-assay development and deployment. P-DMFBs print electrodes and associated
control lines on paper to control droplets and complete bio-assays. However, P-DMFBs
have the following issues: 1) control line interference may cause unwanted droplet movements,
2) avoiding control interference degrades assay performance and routability, 3) single-layer
fabrication limits routability, and 4) expensive ink cost limits the low-cost benefits
of P-DMFBs. To solve the above issues, this work proposes an interference-free design
methodology for P-DMFBs with fast assay speed, better routability, and compact
printing area. The contributions are as follows: First, we categorize control interference
into soft and hard. Second, we identify that only soft interference happens and propose
to remove the soft control interference constraints. Third, we propose an interference-free
design methodology. Finally, we propose a cost-efficient ILP-based fluidic design
module. Experimental results show that the proposed method outperforms prior work [14] across
all bio-assay benchmarks. Compared to previous work, our cost-optimized designs use
only 47%~78% of the area, gain 3.6%~16.2% more routing resources, and achieve 0.97x~1.5x
shorter assay completion time. Our performance-optimized designs can accelerate assay
speed by 1.05x~1.65x using 81%~96% of the printed area.
Accurate and Efficient Simulation of Microfluidic Networks
Microfluidics is a promising field that provides technological advances to the
life sciences. However, the design process for microfluidic devices is still in its
infancy and frequently results in a “trial-and-error” scheme. In order to overcome
this problem, simulation methods provide a powerful solution, allowing designers to derive
a design, validate its functionality, or explore alternatives without the need
for an actual, costly fabricated prototype. To this end, several physical models
are available, such as Computational Fluid Dynamics (CFD) or the 1-dimensional analysis
model. However, while CFD simulations have high accuracy, they also have high costs
with respect to setup and simulation time. On the other hand, the 1D-analysis model
is very efficient but lacks accuracy when it comes to certain phenomena. In this
work, we present ideas to combine these two models and, thus, provide an accurate
and efficient simulation approach for microfluidic networks. A case study confirms
the general suitability of the proposed approach.
SESSION: 2A: University Design Contest II
A 65nm CMOS Process Li-ion Battery Charging Cascode SIDO Boost Converter with 89%
Maximum Efficiency for RF Wireless Power Transfer Receiver
This paper proposes a cascode single-inductor-dual-output (SIDO) boost converter in a
65nm CMOS process for an RF wireless power transfer (WPT) receiver. In order to withstand
the 4.2V Li-ion battery output, cascode 2.5V I/O PFETs are used in the power stage, while
cascode 2.5V NFETs are used for the 1V output that supplies the low-voltage control circuit.
By using NFETs, a 1V output with 5V tolerance can be achieved. Measurement results show a
conversion efficiency of 89% at PIN = 7.9mW and Vbat = 3.4V.
A High Accuracy Phase and Amplitude Detection Circuit for Calibration of 28GHz Phased
Array Beamformer System
This paper presents high-accuracy phase and amplitude detection circuits for the calibration
of 5G millimeter-wave phased-array beamformer systems. The phase and amplitude detection
circuits, which are implemented in a 65nm CMOS process, realize phase and amplitude
detection with an RMS phase error of 0.17 degrees and an RMS gain error of 0.12 dB, respectively.
The total power consumption of the circuits is 59mW.
A Highly Integrated Energy-efficient CMOS Millimeter-wave Transceiver with Direct-modulation
Digital Transmitter, Quadrature Phased-coupled Frequency Synthesizer and Substrate-Integrated
Waveguide E-shaped Patch Antenna
An energy-efficient millimeter-wave transceiver with a direct-modulation digital transmitter
(TX), an I/Q phase-coupled frequency synthesizer, and a Substrate-Integrated Waveguide (SIW)
E-shaped patch antenna is presented in this paper. The proposed transceiver achieves
a 10-Gbps data rate while consuming 340.4 mW. The measured Over-the-Air (OTA) EVM
is -13.8 dB. The energy efficiency is 34 pJ/bit, which is a significant improvement
compared with state-of-the-art mm-wave transceivers.
A 3D-Stacked SRAM Using Inductive Coupling Technology for AI Inference Accelerator
in 40-nm CMOS
A 3D-stacked SRAM using an inductive coupling wireless inter-chip communication technology
(TCI) is presented for an AI inference accelerator. The energy and area efficiency
are improved thanks to the introduction of a proposed low-voltage NMOS push-pull transmitter
and a 12:1 SerDes. A termination scheme that shorts unused open coils is proposed to
eliminate ringing in the inductive coupling bus. Test chips were fabricated in
a 40-nm CMOS technology, confirming 0.40-V operation of the proposed transmitter with
successful stacked-SRAM operation.
Sub-10-μm Coil Design for Multi-Hop Inductive Coupling Interface
Sub-10-μm on-chip coils are designed and prototyped for a multi-hop inductive coupling
interface in a 40-nm CMOS. Multi-layer coils and a new receiver circuit are employed
to compensate for the decrease in coupling coefficient due to the small coil size.
The prototype emulates a 3D-stacked module with 8 dies in a 7-nm CMOS and shows that
a 0.1-pJ/bit, 41-Tb/s/mm2 inductive coupling interface is achievable.
Current-Starved Chaotic Oscillator Over Multiple Frequency Decades on Low-Cost CMOS: Towards Distributed and Scalable Environmental Sensing with a Myriad of Nodes
This work presents a current-starved cross-coupled chaotic oscillator achieving multiple
decades of oscillation frequency, spanning 2 kHz to 15 MHz. The main circuit characteristics
are low power consumption (<100 nW to 25 μW at a 1 V supply voltage) and controllability
of the oscillation frequency, enabling future applications such as distributed
environmental sensing. The IC was implemented in a 180 nm standard CMOS process, yielding
a core area of 0.028 mm2.
TCI Tester: Tester for Through Chip Interface
An 18 Bit Time-to-Digital Converter Design with Large Dynamic Range and Automated
Multi-Cycle Concept
This paper presents a wide-dynamic-range, high-resolution time-domain converter concept
tailored for low-power sensor interfaces. The unique system structure applies different
techniques to reduce circuit complexity, power consumption, and noise sensitivity.
A multi-cycle concept allows a virtual delay-line extension and is applied to achieve
high resolution down to 1ns. At the same time, it drastically expands the dynamic range
up to 2.35 ms. Moreover, individually tunable delay elements in the range of 1ns to
12 ns allow on-demand flexible operation in a low- or high-resolution mode for smart
sensing applications and flexible power control. The concept of this paper is evaluated
on a custom-designed, FPGA-supported PCB. The presented concept is highly suitable
for on-chip integration.
University LSI Design Contest ASP-DAC 2021
SESSION: 2B: Emerging Non-Volatile Processing-In-Memory for Next Generation Computing
Connection-based Processing-In-Memory Engine Design Based on Resistive Crossbars
Deep neural networks have been successfully applied to various fields. The efficient
deployment of neural network models emerges as a new challenge. Processing-in-memory
(PIM) engines that carry out computation within memory structures are widely studied
for improving computation efficiency and data communication speed. In particular,
resistive memory crossbars can naturally realize dot-product operations and show
great potential in PIM design. The common practice of a current-based design is to
map a matrix to a crossbar, apply the input data from one side of the crossbar, and
extract the accumulated currents as the computation results in the orthogonal direction.
In this study, we propose a novel PIM design concept that is based on the crossbar
connections. Our analysis of star-mesh network transformation reveals that, in a crossbar
storing both the input data and the weight matrix, the dot-product result is embedded within
the network connections. Our proposed connection-based PIM design leverages this feature
and discovers the latent dot-products directly from the connection information. Moreover,
in the connection-based PIM design, the output current range of the resistive crossbars
can easily be adjusted, leading to a more linear conversion to voltage values, and the
output circuitry can be shared by multiple resistive crossbars. The simulation results
show that our design can achieve, on average, 46.23% and 33.11% reductions in area and
energy consumption, respectively, with a mere 3.85% latency overhead compared with current-based
designs.
FePIM: Contention-Free In-Memory Computing Based on Ferroelectric Field-Effect Transistors
The memory wall bottleneck causes a large portion of the energy to be consumed
by data transfers between processors and memories when dealing with data-intensive
workloads. By giving some processing abilities to memories, processing-in-memory (PIM)
is a promising technique to alleviate the memory wall bottleneck. In this work, we
propose a novel PIM architecture employing ferroelectric field-effect transistors
(FeFETs). The proposed design, named FePIM, is able to perform in-memory bitwise logic
and addition operations between two selected rows or between one selected row and an immediate
operand. By utilizing unique features of FeFET devices, we further propose novel solutions
to eliminate simultaneous-read-and-write (SRAW) contentions and the resulting stalls.
Experimental results show that FePIM reduces memory access latency by 15% and
memory access energy by 44%, compared with an enhanced version of a state-of-the-art
FeFET-based PIM design that cannot handle SRAW contentions.
RIME: A Scalable and Energy-Efficient Processing-In-Memory Architecture for Floating-Point
Operations
Processing in-memory (PIM) is an emerging technology poised to break the memory wall
in the conventional von Neumann architecture. PIM reduces data movement from the memory
systems to the CPU by utilizing memory cells for logic computation. However, existing
PIM designs do not support the high-precision computation (e.g., floating-point operations)
essential for critical data-intensive applications. Furthermore, PIM architectures
require complex control modules and costly peripheral circuits to harness the full
potential of in-memory computation. These peripherals and control modules usually
suffer from scalability and efficiency issues.
Hence, in this paper, we explore the analog properties of the resistive random access
memory (RRAM) crossbar and propose RIME, a scalable RRAM-based in-memory floating-point
computation architecture. RIME uses single-cycle NOR, NAND, and Minority logic
to achieve floating-point operations. RIME features a centralized control module and
a simplified peripheral circuit to eliminate data movement during parallel computation.
An experimental 32-bit RIME multiplier demonstrates a 4.8X speedup, 1.9X area improvement,
and 5.4X better energy efficiency compared with state-of-the-art RRAM-based PIM multipliers.
A Non-Volatile Computing-In-Memory Framework With Margin Enhancement Based CSA and
Offset Reduction Based ADC
Nowadays, deep neural networks (DNNs) play an important role in machine learning.
Non-volatile computing-in-memory (nvCIM) for DNNs has become a new architecture to optimize
hardware performance and energy efficiency. However, existing nvCIM accelerators
focus on system-level performance but ignore analog factors. In this paper, the sense
margin and offset are considered in the proposed nvCIM framework. The margin enhancement
based current-mode sense amplifier (MECSA) and the offset reduction based analog-to-digital
converter (ORADC) are proposed to improve the accuracy of the ADC. Based on the above
methods, the nvCIM framework is presented, and the experimental results show that the
proposed framework improves area, power, and latency while maintaining the high accuracy
of network models, with an energy efficiency of 2.3 – 20.4x compared to existing
RRAM-based nvCIM accelerators.
SESSION: 2C: Emerging Trends for Cross-Layer Co-Design: From Device, Circuit, to Architecture,
Application
Cross-layer Design for Computing-in-Memory: From Devices, Circuits, to Architectures and Applications
The era of Big Data, Artificial Intelligence (AI) and the Internet of Things (IoT) is
approaching, but our underlying computing infrastructures are not sufficiently ready.
The end of Moore’s law and process scaling, as well as the memory wall associated with
von Neumann architectures, have throttled the rapid development of conventional architectures
based on CMOS technology, and cross-layer efforts that involve interactions from
low-end devices to high-end applications have been prominently studied to overcome
the aforementioned challenges. On one hand, various emerging devices, e.g., the Ferroelectric
FET, have been proposed to either sustain the scaling trends or enable novel circuit
and architecture innovations. On the other hand, novel computing architectures/algorithms,
e.g., computing-in-memory (CiM), have been proposed to address the challenges faced
by conventional von Neumann architectures. Naturally, integrated approaches across
the emerging devices and computing architectures/algorithms for data-intensive applications
are of great interest. This paper uses the FeFET as a representative device and
discusses the challenges, opportunities and contributions of the emerging trend
of cross-layer co-design for CiM.
SESSION: 2D: Machine Learning Techniques for EDA in Analog/Mixed-Signal ICs
Automatic Surrogate Model Generation and Debugging of Analog/Mixed-Signal Designs
Via Collaborative Stimulus Generation and Machine Learning
In top-down analog and mixed-signal design, a key problem is to ensure that the netlist
or physical design does not contain unanticipated behaviors. Mismatches between netlist-level
circuit descriptions and high-level behavioral models need to be captured at
all stages of the design process for accuracy of system-level simulation as well as
fast convergence of the design. To support the above, we present a guided test generation
algorithm that explores the input stimulus space and generates new stimuli which are
likely to excite differences between the model and its netlist description. Subsequently,
a recurrent neural network (RNN) based learning model is used to learn divergent model
and netlist behaviors and absorb them into the model to minimize these differences.
The process is repeated iteratively, and in each iteration a Bayesian optimization
algorithm is used to find optimal RNN hyperparameters to maximize behavior learning.
The result is a circuit-accurate behavioral model that is also much faster to simulate
than a circuit simulator. In addition, another sub-goal is to perform design bug diagnosis
to track the source of observed behavioral anomalies down to individual modules or
fine levels of circuit detail. An optimization-based diagnosis approach using Volterra
learning kernels that is easily integrated into circuit simulators is proposed. Results
on representative circuits are presented.
A Robust Batch Bayesian Optimization for Analog Circuit Synthesis via Local Penalization
Bayesian optimization has recently been successfully introduced to analog circuit synthesis.
Since the evaluations of performance are computationally expensive, batch
Bayesian optimization has been proposed to run simulations in parallel. However, circuit
simulations may fail during the optimization due to improper design variables.
In such cases, Bayesian optimization methods may have poor performance. In this paper,
we propose a Robust Batch Bayesian Optimization (RBBO) approach for analog circuit
synthesis. Local penalization (LP) is used to capture the local repulsion between
query points in one batch, so that the diversity of the query points can be guaranteed.
The failed points and their neighborhoods can also be excluded by LP. Moreover, we
propose an Adaptive Local Penalization (ALP) strategy to adaptively scale the penalized
areas to improve the convergence of the proposed RBBO method. The proposed approach
is compared with state-of-the-art algorithms on several practical analog circuits.
The experimental results demonstrate the efficiency and robustness of the proposed
method.
Layout Symmetry Annotation for Analog Circuits with Graph Neural Networks
The performance of analog circuits is susceptible to various layout constraints, such
as symmetry, matching, etc. Modern analog placement and routing algorithms usually
need to take these constraints as input for high-quality solutions, while manually
annotating such constraints is tedious and requires design expertise. Thus, automatic
constraint annotation from circuit netlists is a critical step in analog layout automation.
In this work, we propose a graph learning based framework to learn general rules
for the annotation of symmetry constraints with path-based feature extraction and
label filtering techniques. Experimental results on open-source analog circuit
designs demonstrate that our framework achieves significantly higher accuracy
compared with the most recent works on symmetry constraint detection, which leverage graph
similarity and signal flow analysis techniques. The framework is general and can be
extended to other pairwise constraints as well.
Fast and Efficient Constraint Evaluation of Analog Layout Using Machine Learning Models
Placement algorithms for analog circuits explore numerous layout configurations in
their iterative search. To steer these engines towards layouts that meet the electrical
constraints on the design, this work develops a fast feasibility predictor to guide
the layout engine. The flow first discerns rough bounds on layout parasitics and prunes
the feature space. Next, a Latin hypercube sampling technique is used to sample the
reduced search space, and the labeled samples are classified by a linear support vector
machine (SVM). If necessary, a denser sample set is used for the SVM, or, if the constraints
are found to be nonlinear, a multilayer perceptron (MLP) is employed. The resulting
machine learning model is shown to rapidly evaluate candidate placements in a
placer and is used to build layouts for several analog blocks.
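To make the described flow concrete, here is a toy sketch of its software analog: Latin hypercube sampling of a pruned parasitic space, a hypothetical feasibility rule standing in for circuit evaluation, and a linear SVM that gives a placer a fast go/no-go prediction. The bounds, labels, and constants are invented for illustration; only the overall flow follows the abstract.

```python
# Toy flow: LHS-sample a pruned parasitic space, label feasibility with a
# made-up rule (stand-in for simulation), and fit a linear SVM to screen
# candidate placements quickly.

import numpy as np
from sklearn.svm import SVC

def latin_hypercube(n_samples, bounds, rng):
    """Simple Latin hypercube: one stratified, shuffled column per dimension."""
    cols = []
    for _ in bounds:
        strata = (rng.permutation(n_samples) + rng.random(n_samples)) / n_samples
        cols.append(strata)
    unit = np.column_stack(cols)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    return lo + unit * (hi - lo)

def feasible(parasitics):
    """Hypothetical constraint: weighted parasitic budget (not a real model)."""
    c_wire, r_wire = parasitics[:, 0], parasitics[:, 1]
    return (3.0 * c_wire + 0.8 * r_wire < 25.0).astype(int)

rng = np.random.default_rng(0)
bounds = [(0.0, 10.0), (0.0, 20.0)]          # pruned ranges for C (fF), R (ohm)
X = latin_hypercube(200, bounds, rng)
y = feasible(X)

clf = SVC(kernel="linear").fit(X, y)
candidates = np.array([[2.0, 5.0], [9.0, 18.0]])
print(clf.predict(candidates))               # fast go/no-go for each candidate
```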
SESSION: 2E: Innovating Ideas in VLSI Routing Optimization
TreeNet: Deep Point Cloud Embedding for Routing Tree Construction
In routing tree construction, both wirelength (WL) and path length (PL) are of
importance. Among all methods, PD-II and SALT are the two most prominent ones. However,
neither PD-II nor SALT always dominates the other in terms of both WL and PL for
all nets. In addition, estimating the best parameters for both algorithms is still
an open problem. In this paper, we model the pins of a net as a point cloud and formalize
a set of special properties of such point clouds. Considering these properties, we
propose a novel deep neural network architecture, TreeNet, to obtain an embedding of
the point cloud. Based on the obtained cloud embedding, an adaptive workflow is designed
for routing tree construction. Experimental results show that the proposed TreeNet
is superior to other mainstream point cloud models on classification tasks.
Moreover, the proposed adaptive workflow for routing tree construction outperforms
SALT and PD-II in terms of both efficiency and effectiveness.
A Unified Printed Circuit Board Routing Algorithm With Complicated Constraints and
Differential Pairs
The printed circuit board (PCB) routing problem has been studied extensively in recent
years. Due to continually growing net/pin counts, extremely high pin density, and
unique physical constraints, the manual routing of PCBs has become a time-consuming
task for reaching design closure. Previous works break down the problem into escape routing
and area routing and focus on these problems separately. However, there is always
a gap between these two problems, requiring a massive amount of human effort to fine-tune
the algorithms back and forth. Besides, previous works on area routing mainly focus
on routing between escape-routed ball-grid-array (BGA) packages. Nevertheless, in
practice, many components are not in the form of BGA packages, such as passive devices,
decoupling capacitors, and through-hole pin arrays. To mitigate the deficiencies of
previous works, we propose a full-board routing algorithm that can handle multiple
real-world complicated constraints to facilitate printed circuit board routing
and produce high-quality manufacturable layouts. Experimental results show that our
algorithm is effective and efficient. Specifically, for all given test cases, our
router achieves 100% routability without any design rule violation, while two other
state-of-the-art routers fail to complete the routing for some test cases and
incur design rule violations.
Multi-FPGA Co-optimization: Hybrid Routing and Competitive-based Time Division Multiplexing Assignment
In multi-FPGA systems, time-division multiplexing (TDM) is a widely used technique
to transfer signals between FPGAs. While TDM can greatly increase logic utilization,
the inter-FPGA delay also becomes longer. A good time-multiplexing scheme for
inter-FPGA signals is therefore very important for optimizing system performance. In this
work, we propose a fast algorithm to generate high-quality time-multiplexed routing
results for multi-FPGA systems. A hybrid routing algorithm is proposed to route
the nets between FPGAs, by maze routing and by a fast minimum terminal spanning tree
method. After obtaining a routing topology, a two-step method is applied to perform
TDM assignment to optimize timing, which includes an initial assignment and a competitive-based
refinement. Experiments show that our system-level routing and TDM assignment algorithm
outperforms both the top winner of the ICCAD 2019 Contest and the state-of-the-art
methods. Moreover, compared to the state-of-the-art works [17, 22], our approach improves
run time by more than 2x with better or comparable TDM performance.
Boosting Pin Accessibility Through Cell Layout Topology Diversification
As standard cell layouts become denser, accessing pins becomes much harder in
detailed routing. The conventional solutions to the pin access issue are
to attempt cell flipping, cell shifting, cell swapping, and/or cell dilating in the
placement optimization stage, expecting to acquire high pin accessibility. However,
those solutions do not guarantee close-to-100% pin accessibility to ensure safe manual
fixing afterward in the routing stage. Furthermore, there is as yet no easy and effective
methodology to fix the inaccessibility in the detailed routing stage. This
work addresses the problem of fixing inaccessibility in the detailed routing stage.
Precisely, (1) we produce, for each type of cell, multiple layouts with diverse pin
locations and access points by modifying the core engines, i.e., gate poly ordering
and middle-of-line dummy insertion, in the flow of design-technology co-optimization
based automatic cell layout generation. Then, (2) we propose a systematic method to
make use of those layouts to fix the routing failures caused by pin inaccessibility
in the ECO (Engineering Change Order) routing stage. Experimental results demonstrate
that our proposed cell layout diversification and replacement approach can fix
93.22% of metal-2 shorts in the ECO routing stage.
SESSION: 3A: ML-Driven Approximate Computing
Approximate Computing for ML: State-of-the-art, Challenges and Visions
In this paper, we present our state-of-the-art approximation techniques that cover the
main pillars of approximate computing research. Our analysis considers both static
and reconfigurable approximation techniques as well as operation-specific approximate
components (e.g., multipliers) and generalized approximate high-level synthesis approaches.
As our application target, we discuss the improvements that such techniques bring
to machine learning and neural networks. In addition to the conventionally analyzed
performance and energy gains, we also evaluate the improvements that approximate computing
brings to the operating temperature.
SESSION: 3B: Architecture-Level Exploration
Bridging the Frequency Gap in Heterogeneous 3D SoCs through Technology-Specific NoC
Router Architectures
In heterogeneous 3D System-on-Chips (SoCs), NoCs with uniform properties suffer one
major limitation: the clock frequency of routers varies due to different manufacturing
technologies. For example, digital nodes allow for a higher router clock frequency
than mixed-signal nodes. This large frequency gap is commonly tackled by complex and
expensive pseudo-mesochronous or asynchronous router architectures. Here, a more efficient
approach is chosen to bridge the frequency gap: we propose to use a heterogeneous
network architecture. We show that reducing the number of virtual channels (VCs) allows
bridging a frequency gap of up to 2x. We achieve a system-level latency improvement of
up to 47% for uniform random traffic and up to 59% for PARSEC benchmarks, a maximum
throughput increase of 50%, up to 68% reduced area, and 38% reduced power in an exemplary
setting combining 15-nm digital and 30-nm mixed-signal nodes, compared against a homogeneous
synchronous network architecture. Versus asynchronous and pseudo-mesochronous router
architectures, the proposed optimization consistently performs better in area and power,
and the average flit latency improvement can be larger than 51%.
Combining Memory Partitioning and Subtask Generation for Parallel Data Access on CGRAs
Coarse-Grained Reconfigurable Architectures (CGRAs) are attractive reconfigurable
platforms with the advantages of high performance and power efficiency. In a CGRA-based
computing system, computations are often mapped onto the CGRA with parallel
memory accesses. To fully exploit the on-chip memory bandwidth, memory partitioning
algorithms are widely used to reduce access conflicts. CGRAs have a fixed storage
fabric and limited-size memory due to severe area constraints. Previous memory
partitioning algorithms assumed that data could be completely transferred into the
target memory. However, in practice, we often encounter situations where on-chip storage
is insufficient to store the complete data. In order to perform the computation of
such applications on a memory-limited CGRA, we first develop a memory partitioning
strategy with continual placement, which also avoids data preprocessing, and then
divide the kernel into multiple subtasks that suit the size of the target memory.
Experimental results show that, compared to the state-of-the-art method, our approach
achieves a 43.2% reduction in data preparation time and an 18.5% improvement in overall
performance. If the subtask generation scheme is adopted, our approach can achieve
a 14.4% overall performance improvement while reducing memory requirements by 99.7%.
A Dynamic Link-latency Aware Cache Replacement Policy (DLRP)
Multiprocessor systems-on-chip (MPSoCs) in modern devices have mostly adopted the
non-uniform cache architecture (NUCA) [1], which features varied physical distances
from cores to data locations and, as a result, varied access latency. In the past,
researchers focused on minimizing the average access latency of the NUCA. We found
that dynamic latency is also a critical index of performance: without considering dynamic
latency, a cache access pattern with long dynamic latency will result in significant cache
performance degradation. We have also observed that a set of commonly
used neural network application kernels, including fully-connected
and convolutional layers, contains substantial access patterns with long dynamic
latency. This paper proposes a hardware-friendly dynamic latency identification mechanism
to detect such patterns and a dynamic link-latency aware replacement policy (DLRP)
to improve cache performance on the NUCA.
The proposed DLRP, on average, outperforms the least recently used (LRU) policy by
53% with little hardware overhead. Moreover, on average, our method achieves 45% and
24% more performance improvement than the not recently used (NRU) policy and the static
re-reference interval prediction (SRRIP) policy, respectively, normalized to LRU.
Prediction of Register Instance Usage and Time-sharing Register for Extended Register
Reuse Scheme
Register renaming is key to the performance of out-of-order processors. However,
the release mechanism of physical registers may cause waste in the time dimension.
The register reuse technique is the earliest solution for releasing a physical register
at the renaming stage; it takes advantage of register instances that are used only
once. However, the range of possible reuse mined by this scheme is not large,
and the physical register structure has to be modified. Aiming at these two
problems, we propose an extended register reuse scheme. Our work presents: 1) prediction
of the number of uses of a register instance, so as to reuse physical registers at
the end of their last use and expand the range of possible reuse; and 2) a time-sharing
register file design with little overhead, implemented with backup registers, which avoids
modifying the physical register structure. Compared with the original register reuse
technique, this work achieves an 8.5% performance improvement or, alternatively, a 9.6%
decrease in the number of physical registers with minor hardware overhead.
SESSION: 3C: Core Circuits for AI Accelerators
Residue-Net: Multiplication-free Neural Network by In-situ No-loss Migration to Residue Number
Systems
Deep neural networks are widely deployed on embedded devices to solve a wide range
of problems, from edge sensing to autonomous driving. The accuracy of these networks
is usually proportional to their complexity. Quantization of model parameters (i.e.,
weights) and/or activations to alleviate the complexity of these networks while preserving
accuracy is a popular and powerful technique. Nonetheless, previous studies have shown
that the achievable quantization level is limited, as the accuracy of the network decreases
beyond it. We propose Residue-Net, a multiplication-free accelerator for neural networks
that uses the Residue Number System (RNS) to achieve substantial energy reduction. RNS breaks
down the operations into several smaller operations that are simpler to implement. Moreover,
Residue-Net replaces the numerous costly multiplications with simple, energy-efficient
shift and add operations to further reduce the computational complexity of neural
networks. To evaluate the efficiency of our proposed accelerator, we compare the
performance of Residue-Net with a baseline FPGA implementation of four widely-used
networks, viz., LeNet, AlexNet, VGG16, and ResNet-50. When delivering the same performance
as the baseline, Residue-Net reduces the area and power (hence energy) by 36% and 23%,
respectively, on average, with no accuracy loss. Leveraging the saved area to accelerate
the quantized RNS network through parallelism, Residue-Net improves its throughput
by 2.8x and energy by 2.7x.
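The core arithmetic idea behind Residue-Net, the Residue Number System, can be illustrated with a short textbook example: a value is held as its residues modulo pairwise-coprime moduli, multiplication proceeds channel-wise on small residues, and the Chinese Remainder Theorem recovers the result. The modulus set below is an arbitrary assumption, not the accelerator's; the sketch shows the number system itself, not Residue-Net's shift-and-add datapath.

```python
# Textbook Residue Number System (RNS) illustration: encode integers as
# residues modulo pairwise-coprime moduli, operate channel-wise, and decode
# via the Chinese Remainder Theorem (CRT).

MODULI = (7, 15, 16)          # pairwise coprime; dynamic range = 7*15*16 = 1680
M = 1
for m in MODULI:
    M *= m

def encode(x):
    return tuple(x % m for m in MODULI)

def mul(a_res, b_res):
    # Multiplication decomposes into small, independent modular multiplies.
    return tuple((a * b) % m for a, b, m in zip(a_res, b_res, MODULI))

def decode(res):
    # CRT reconstruction (requires Python 3.8+ for modular inverse via pow).
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

if __name__ == "__main__":
    a, b = 37, 21
    prod = mul(encode(a), encode(b))
    print(prod, decode(prod), a * b)   # decoded value matches 777
```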
A Multiple-Precision Multiply and Accumulation Design with Multiply-Add Merged Strategy
for AI Accelerating
Multiply-and-accumulate (MAC) operations are fundamental for domain-specific accelerators
targeting AI applications ranging from filtering to convolutional neural networks (CNNs).
This paper proposes an energy-efficient MAC design supporting a wide range of bit-widths
for both signed and unsigned operands. Firstly, based on the classic Booth algorithm,
we propose a multiply-add merged strategy. The design
can not only support both signed and unsigned operations but also eliminate the delay,
area, and power overheads of the adder in traditional MAC units. Then, a multiply-add
merged design method for flexible bit-width adjustment is proposed using the fusion
strategy. In addition, treating the addend as a partial product makes the operation
easy to pipeline and balance. The comprehensive improvements in delay, area, and power
can meet various requirements from different applications and hardware designs. Using
the proposed method, we have synthesized MAC units for several operation modes
with a SMIC 40-nm library. Comparison with other MAC designs shows that the proposed
design method can achieve up to 24.1% and 28.2% PDP and ADP improvements for fixed-bit-width
MAC designs, and 28.43% ~ 38.16% for bit-width-adjustable ones. When pipelined,
the design decreases latency by more than 13%. The improvements in power and
area are up to 8.0% and 8.1%, respectively.
DeepOpt: Optimized Scheduling of CNN Workloads for ASIC-based Systolic Deep Learning Accelerators
Scheduling the computations in each layer of a convolutional neural network on a deep
learning (DL) accelerator involves a large number of choices, each of which involves
a different set of memory reuse and memory access patterns. Since memory transactions
are the primary bottleneck in DL acceleration, these choices can strongly impact the
energy and throughput of the accelerator. This work proposes DeepOpt, an optimization
framework for general ASIC-based systolic hardware accelerators that determines a
layer-specific and hardware-specific scheduling strategy for each layer of a CNN to
optimize energy and latency. Optimal hardware allocation significantly reduces execution
cost as compared to generic static hardware resource allocation, e.g., improvements of
up to 50x in the energy-delay product for VGG-16 and 41x for GoogleNet-v1.
Value-Aware Error Detection and Correction for SRAM Buffers in Low-Bitwidth, Floating-Point
CNN Accelerators
Low-power CNN accelerators are a key technique for enabling the future artificial intelligence
world. Dynamic voltage scaling is an essential low-power strategy, but it is bottlenecked
by on-chip SRAM. More specifically, SRAM can exhibit stuck-at (SA) faults at a rate
as high as 0.1% when the supply voltage is lowered to, e.g., 0.5 V. Although this
issue has been studied in CPU cache design, those solutions are tailored for
CPUs instead of CNN accelerators, so they inevitably incur unnecessary design complexity
and SRAM capacity overhead.
To address the above issue, we conduct simulations and analyses that enable us to propose
error detecting and correcting mechanisms tailored to our target low-bitwidth,
floating-point (LBFP) CNN accelerators. We analyze the impacts of SA faults at different
SRAM positions, as well as the impacts of the different SA types, i.e., stuck-at-one
(SA1) and stuck-at-zero (SA0). The analysis results lead us to error detecting
and correcting mechanisms that prioritize fixing SA1 faults appearing at SRAM positions
where the exponent bits of LBFP values are stored. The evaluation results show that our
proposed mechanisms help push the voltage scaling limit down to a voltage level with
0.1% SA faults (e.g., 0.5 V).
SESSION: 3D: Stochastic and Approximate Computing
MIPAC: Dynamic Input-Aware Accuracy Control for Dynamic Auto-Tuning of Iterative Approximate
Computing
For many applications that exhibit strong error resilience, such as machine learning
and signal processing, energy efficiency and performance can be dramatically improved
by allowing slight errors in intermediate computations. Iterative methods (IMs),
wherein the solution is improved over multiple executions of an approximation algorithm,
allow for an energy-quality trade-off at run time by adjusting the number of iterations
(NOI). However, in prior IM circuits, NOI adjustment has been made based on a pre-characterized
NOI-quality mapping, which is input-agnostic and thus results in an undesirably large
variation in output quality. In this paper, we propose a novel design framework that
incorporates a lightweight quality controller that makes input-dependent predictions
of the output quality and determines the optimal NOI at run time. The proposed quality
controller is composed of accurate yet low-overhead NOI predictors generated by a
novel logic reduction technique. We evaluate the proposed design framework on several
IM circuits and demonstrate significant improvements in energy-quality performance.
Normalized Stability: A Cross-Level Design Metric for Early Termination in Stochastic Computing
Stochastic computing is a statistical computing scheme that represents data as serial
bit streams to greatly reduce hardware complexity. The key trade-off is that processing
more bits in the streams yields higher computation accuracy at the cost of more latency
and energy consumption. To maximize efficiency, it is desirable to account for the
error tolerance of applications and terminate stochastic computations early when the
result is acceptably accurate. Currently, the stochastic computing community lacks
a standard means of measuring a circuit’s potential for early termination and predicting
at what cycle it would be safe to terminate. To fill this gap, we propose normalized
stability, a metric that measures how fast a bit stream converges under a given accuracy
budget. Our unit-level experiments show that normalized stability accurately reflects
and contrasts the early-termination capabilities of varying stochastic computing units.
Furthermore, our application-level experiments on low-density parity-check decoding,
machine learning and image processing show that normalized stability can reduce the
design space and predict the timing to terminate early.
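To make the latency/accuracy trade-off behind early termination concrete, the sketch below multiplies two values with a single AND gate in unipolar stochastic computing and watches the running estimate converge cycle by cycle. The stopping rule shown peeks at the exact product purely to visualize convergence; it is an illustrative stand-in, not the normalized stability metric proposed in the paper.

```python
# Unipolar stochastic computing illustration: a value in [0, 1] becomes the
# probability of a 1 in a bit stream; a single AND gate multiplies two
# independent streams. More processed bits give a better estimate, which is
# the trade-off that early termination exploits.

import random

def bitstream(value, length, rng):
    return [1 if rng.random() < value else 0 for _ in range(length)]

def sc_multiply_with_trace(a, b, length=1024, tol=0.01, window=64, seed=1):
    rng = random.Random(seed)
    sa, sb = bitstream(a, length, rng), bitstream(b, length, rng)
    ones = 0
    stable_for = 0
    for n, (x, y) in enumerate(zip(sa, sb), start=1):
        ones += x & y                      # AND gate = unipolar multiply
        estimate = ones / n
        # NOTE: compares against the known exact product only to visualize
        # convergence; a real early-termination scheme cannot do this.
        if abs(estimate - a * b) < tol:
            stable_for += 1
            if stable_for >= window:       # deemed "safe" to terminate early
                return estimate, n
        else:
            stable_for = 0
    return ones / length, length

if __name__ == "__main__":
    est, cycles = sc_multiply_with_trace(0.5, 0.6)
    print(f"estimate={est:.3f}, exact={0.5 * 0.6}, cycles used={cycles}/1024")
```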
Zero Correlation Error: A Metric for Finite-Length Bitstream Independence in Stochastic Computing
Stochastic computing (SC), with its probabilistic data representation format, has
sparked renewed interest due to its ability to use very simple circuits to implement
complex operations. However, unlike traditional binary computing, SC needs to carefully
handle correlations that exist across data values to avoid the risk of unacceptably
inaccurate results. With many SC circuits designed to operate under the assumption
that input values are independent, it is important to provide the ability to accurately
measure and characterize the independence of SC bitstreams. We propose zero correlation
error (ZCE), a metric that quantifies how independent two finite-length bitstreams
are, and show that it addresses fundamental limitations of the metrics currently used
by the SC community. Through evaluation at both the functional unit level and the application
level, we demonstrate how ZCE can be an effective tool for analyzing SC bitstreams,
simulating circuits, and exploring the design space.
An Efficient Approximate Node Merging with an Error Rate Guarantee
Approximate computing is an emerging design paradigm for error-tolerant applications,
e.g., signal processing and machine learning. In approximate computing, the area,
delay, or power consumption of an approximate circuit can be improved by trading off
its accuracy. In this paper, we propose an approximate logic synthesis approach based
on a node-merging technique with an error rate guarantee. The ideas of our approach
are to replace internal nodes with constant values and to merge two nodes in the
circuit that are similar in terms of functionality. We conduct experiments on a set of
IWLS 2005 and MCNC benchmarks. The experimental results show that our approach can reduce
area by up to 80%, and by 31% on average. Compared with the state-of-the-art method, our
approach achieves a speedup of 51 under the same 5% error rate constraint.
SESSION: 3E: Timing Analysis and Timing-Aware Design
An Adaptive Delay Model for Timing Yield Estimation under Wide-Voltage Range
Yield analysis for wide-voltage circuit design is a strongly nonlinear integration problem.
The most challenging task is how to accurately estimate the yield of long-tail distribution.
This paper proposes an adaptive delay model to substitute expensive transistor-level
simulation for timing yield estimation. We use the Low-Rank Tensor Approximation (LRTA)
to model the delay variation from a large number of process parameters. Moreover,
an adaptive nonlinear sampling algorithm is adopted to calibrate the model iteratively,
which can capture the larger variability of delay distribution for different voltage
regions. The proposed method is validated on benchmark circuits of TAU15 in 45nm free
PDK. The experimental results show that our method achieves a 20-100X speedup compared
to Monte Carlo simulation at the same accuracy level.
ATM: A High Accuracy Extracted Timing Model for Hierarchical Timing Analysis
As technology advances, the complexity and size of integrated circuits continue to
grow. Hierarchical design flow is a mainstream solution to speed up timing closure.
Static timing analysis is a pivotal step in the flow, but it can be time-consuming
on large flat designs. To reduce the long runtime, we introduce ATM, a high-accuracy
extracted timing model for hierarchical timing analysis. Interface logic model (ILM)
and extracted timing model (ETM) are the two popular paradigms for generating timing
macros. ILM is accurate but large in model size, and ETM is compact but less accurate.
Recent research has applied graph compression techniques to ILM to reduce model size
while maintaining high accuracy. However, the generated models are still very large
compared to ETM, and their efficiency of in-context usage may be limited. We base ATM
on the ETM paradigm and address its accuracy limitation. Experimental results on TAU
2017 benchmarks show that ATM reduces the maximum absolute error of ETM from 131 ps
to less than 1 ps. Compared to the ILM-based approach, our accuracy differs within
1 ps and the generated model can be up to 270x smaller.
Mode-wise Voltage-scalable Design with Activation-aware Slack Assignment for Energy
Minimization
This paper proposes a design optimization methodology that can achieve a mode-wise
voltage scalable (MWVS) design by applying activation-aware slack assignment
(ASA). Originally, ASA allocates the timing margin of critical paths with a stochastic
treatment of timing errors, which limits its application. Instead, this work employs
ASA while guaranteeing no timing errors. The MWVS design is formulated as an optimization
problem that minimizes the overall power consumption considering each mode duration,
achievable voltage reduction, and accompanying circuit overhead explicitly, and explores
the solution space with the downhill simplex algorithm, which does not require numerical
derivatives. For obtaining a solution, i.e., a design, in the optimization process,
we exploit the multi-corner multi-mode design flow in a commercial tool for performing
mode-wise ASA with sets of false paths dedicated to individual modes. Experimental
results based on a RISC-V design show that the proposed methodology saves 20% more power
than the conventional voltage-scaling approach and attains a 15% gain over the
single-mode ASA. Also, the cycle-by-cycle fine-grained false-path identification reduced
leakage power by 42%.
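As a hedged sketch of the optimization loop described above, the code below uses SciPy's Nelder-Mead (downhill simplex) routine to pick per-mode supply voltages that minimize a toy duration-weighted power model. The power model, voltage bounds, mode durations, and per-mode voltage requirements are invented placeholders, not the paper's actual formulation.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical mode durations (fraction of time spent in each operating mode).
durations = np.array([0.6, 0.3, 0.1])
VMIN, VMAX = 0.6, 1.1   # assumed feasible supply-voltage range (V)

def total_power(vdd):
    """Toy objective: duration-weighted dynamic power (~V^2) plus a penalty for
    voltages that violate an assumed per-mode timing requirement."""
    vdd = np.clip(vdd, VMIN, VMAX)
    dyn = durations * vdd ** 2
    vreq = np.array([0.9, 0.8, 0.7])                  # assumed per-mode minimum voltage
    penalty = np.sum(np.maximum(vreq - vdd, 0.0) ** 2) * 100.0
    return dyn.sum() + penalty

res = minimize(total_power, x0=np.full(3, 1.0), method="Nelder-Mead")
print("per-mode voltages:", np.clip(res.x, VMIN, VMAX))
```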
A Timing Prediction Framework for Wide Voltage Design with Data Augmentation Strategy
Wide voltage design has been widely used to achieve power reduction and energy efficiency
improvement. The consequent increasing number of PVT corners poses severe challenges
to timing analysis in terms of accuracy and efficiency. The data insufficiency issue
during path delay acquisition raises the difficulty for the training of machine learning
models, especially at low voltage corners due to tremendous library characterization
effort and/or simulation cost. In this paper, a learning-based timing prediction framework
is proposed to predict path delays across a wide voltage region by LightGBM (Light Gradient
Boosting Machine) with data augmentation strategies including CTGAN (Conditional Generative
Adversarial Networks) and SMOTER (Synthetic Minority Oversampling Technique for Regression),
which generate realistic synthetic data of circuit delays to improve prediction precision
and reduce data sampling effort. Experimental results demonstrate that with the proposed
framework, the path delays at low voltage can be predicted from their delays at high
voltage corners with an rRMSE of less than 5%, owing to the data augmentation strategies
which achieve significant prediction error reduction by up to 12x.
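A minimal sketch of the prediction step under stated assumptions: a LightGBM regressor is trained on synthetic data to map a path's high-voltage delay (plus a stand-in structural feature) to its low-voltage delay. The augmentation step (CTGAN/SMOTER) and any real library characterization are omitted; the data-generating model below is entirely invented.

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
n_paths = 2000
# Synthetic stand-ins: high-voltage delay and one structural path feature.
delay_hv = rng.uniform(0.2, 2.0, n_paths)            # ns at the nominal corner
n_stages = rng.integers(5, 40, n_paths)              # logic depth of the path
# Assumed "ground truth": low-voltage delay grows with both quantities.
delay_lv = delay_hv * (2.5 + 0.02 * n_stages) + rng.normal(0, 0.05, n_paths)

X = np.column_stack([delay_hv, n_stages])
model = LGBMRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X[:1500], delay_lv[:1500])

pred = model.predict(X[1500:])
rrmse = np.sqrt(np.mean((pred - delay_lv[1500:]) ** 2)) / delay_lv[1500:].mean()
print(f"relative RMSE on held-out paths: {rrmse:.3f}")
```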
SESSION: 4A: Technological Advancements inside the AI chips, and using the AI Chips
Energy-Efficient Deep Neural Networks with Mixed-Signal Neurons and Dense-Local and
Sparse-Global Connectivity
Neuromorphic Computing has become tremendously popular due to its ability to solve
certain classes of learning tasks better than traditional von-Neumann computers. Data-intensive
classification and pattern recognition problems have been of special interest to Neuromorphic
Engineers, as these problems present complex use-cases for Deep Neural Networks (DNNs)
which are motivated from the architecture of the human brain, and employ densely connected
neurons and synapses organized in a hierarchical manner. However, as these systems
become larger in order to handle an increasing amount of data and higher dimensionality
of features, the designs often become connectivity constrained. To solve this, the
computation is divided into multiple cores/islands, called processing engines (PEs).
Today, the communication among these PEs is carried out through a power-hungry network-on-chip
(NoC), and hence the optimal distribution of these islands along with energy-efficient
compute and communication strategies become extremely important in reducing the overall
energy of the neuromorphic computer, which is currently orders of magnitude higher
than the biological human brain. In this paper, we extensively analyze the choice
of the size of the islands based on mixed-signal neurons/synapses for 3-8 bit-resolution
within allowable ranges for system-level classification error, determined by the analog
non-idealities (noise and mismatch) in the neurons, and propose strategies involving
local and global communication for reduction of the system-level energy consumption.
AC-coupled mixed-signal neurons are shown to have 10X lower non-idealities than DC-coupled
ones, while the choice of the number of islands is shown to be a function of the network,
constrained by the analog-to-digital conversion (or vice versa) power at the interface
of the islands. The maximum number of layers in an island is analyzed and a global
bus-based sparse connectivity is proposed, which consumes orders of magnitude lower
power than the competing powerline communication techniques.
Merged Logic and Memory Fabrics for AI Workloads
As we approach the end of the silicon roadmap, we observe a steady increase in both
the research effort toward and quality of embedded non-volatile memories (eNVM). Integrated
in a dense array, eNVM such as resistive random access memory (RRAM), spin transfer
torque based random access memory, or phase change random access memory (PCRAM) can
perform compute in-memory (CIM) using the physical properties of the device. The combination
of eNVM and CIM seeks to minimize both data transport and leakage power while offering
density up to 10x that of traditional 6T SRAM. Despite these exciting new properties,
these devices introduce problems that were not faced by traditional CMOS and SRAM
based designs. While some of these problems will be solved by further research and
development, properties such as significant cell-to-cell variance and high write power
will persist due to the physical limitations of the devices. As a result, circuit
and system level designs must account for and mitigate the problems that arise. In
this work we introduce these problems from the system level and propose solutions
that improve performance while mitigating the impact of the non-ideal properties of
eNVM. Using statistics from the application and known properties of the eNVM, we can
configure a CIM accelerator to minimize error from cell-to-cell variance and maximize
throughput while minimizing write energy.
Vision Control Unit in Fully Self Driving Vehicles using Xilinx MPSoC and Opensource
Stack
Fully self-driving (FSD) vehicles have become increasingly popular over the last few
years, and companies are investing significantly in their research and development.
In recent years, FSD technology innovators like Tesla, Google etc. have been working
on proprietary autonomous driving stacks and have been able to successfully bring
the vehicle to the roads. On the other end, organizations like Autoware Foundation
and Baidu are fueling the growth of self-driving mobility using open source stacks.
These organizations firmly believe in enabling autonomous driving technology for everyone
and support developing software stacks through the open source community that is SoC
vendor agnostic. In this proposed solution we describe a vision control unit for a
fully self-driving vehicle developed on Xilinx MPSoC platform using open source software
components.
The vision control unit of an FSD vehicle is responsible for camera video capture,
image processing and rendering, AI algorithm processing, data and meta-data transfer
to the next stage of the FSD pipeline. In this proposed solution we have used many open
source stacks and frameworks for video and AI processing. The processing of the video
pipeline and algorithms takes full advantage of the pipelining and parallelism offered by
all the heterogeneous cores of the Xilinx MPSoC. In addition, we have developed an
extensible, scalable, adaptable and configurable AI backend framework, XTA, for acceleration
purposes that is derived from a popular, open source AI backend framework, TVM-VTA.
XTA uses all the MPSoC cores for its computation in a parallel and pipelined fashion.
XTA also adapts to the compute and memory parameters of the system and can scale to
achieve optimal performance for any given AI problem. The FSD system design is based
on a distributed system architecture and uses open source components like Autoware
for autonomous driving algorithms, ROS and Distributed Data Services as a messaging
middleware between the functional nodes and a real-time kernel to coordinate the actions.
The details of image capture, rendering and AI processing of the vision perception
pipeline will be presented along with the performance measurements of the vision pipeline.
In this proposed solution we will demonstrate some of the key use cases of vision
perception unit like surround vision and object detection. In addition, we will also
show the capability of Xilinx MPSoC technology to handle multiple channels of real-time
camera input and its integration with the Lidar/Radar point cloud data to feed into
the decision-making unit of the overall system. The system is also designed with the
capability to update the vision control unit through Over the Air Update (OTA). It
is also envisioned that the core AI engine will require regular updates with the latest
training values; hence a built-in platform level mechanism supporting such capability
is essential for real world deployment.
SESSION: 4B: System-Level Modeling, Simulation, and Exploration
Constrained Conservative State Symbolic Co-analysis for Ultra-low-power Embedded Systems
Symbolic simulation and symbolic execution techniques have long been used for verifying
designs and testing software. Recently, using symbolic hardware-software co-analysis
to characterize unused hardware resources across all possible executions of an application
running on a processor has been leveraged to enable application-specific analysis
and optimization techniques. Like other symbolic simulation techniques, symbolic hardware-software
co-analysis does not scale well to complex applications, due to an explosion in the
number of execution paths that must be analyzed to characterize all possible executions
of an application. To overcome this issue, prior work proposed a scalable approach
by maintaining conservative states of the system at previously-visited locations in
the application. However, this approach can be too pessimistic in determining the
exercisable subset of resources of a hardware design. In this paper, we propose a
technique for performing symbolic co-analysis of an application on a processor’s netlist
by identifying, propagating, and imposing constraints from the software level onto
the gate-level simulation. This produces a more precise, less pessimistic estimate
of the gates that an application can exercise when executing on a processor, while
guaranteeing coverage of all possible gates that the application can exercise. This
also significantly reduces the simulation time of the analysis by eliminating the
need to explore many simulation paths in the application. Compared to the state-of-the-art
analysis based on conservative states, our constrained approach reduces the number
of gates identified as exercisable by up to 34.98%, 11.52% on average, and analysis
runtime by up to 84.61%, 43.83% on average.
Arbitrary and Variable Precision Floating-Point Arithmetic Support in Dynamic Binary
Translation
Floating-point hardware support was more or less settled 35 years ago by the
adoption of the IEEE 754 standard. However, many scientific applications require higher
accuracy than what can be represented on 64 bits, and to that end make use of dedicated
arbitrary precision software libraries. To reach a good performance/accuracy trade-off,
developers use variable precision, requiring e.g. more accuracy as the computation
progresses. Hardware accelerators for this kind of computation do not exist yet,
and independently of the actual quality of the underlying arithmetic computations,
defining the right instruction set architecture, memory representations, etc., for
them is a challenging task. We investigate in this paper the support for arbitrary
and variable precision arithmetic in a dynamic binary translator, to help gain
insight into what such an accelerator could provide as an interface to compilers, and
thus programmers. We detail our design and present an implementation in QEMU using
the MPFR library for the RISC-V processor.
Optimizing Temporal Decoupling using Event Relevance
Over the last decades, HW/SW systems have grown ever more complex. System simulators,
so-called virtual platforms, have been an important tool for developing and testing
these systems. However, the rise in overall complexity has also impacted the simulators.
Complex platforms require fast simulation components and a sophisticated simulation
infrastructure to meet today’s performance demands. With the introduction of SystemC
TLM2.0, temporal decoupling has become a staple in the arsenal of simulation acceleration
techniques. Temporal decoupling yields a significant simulation performance increase
at the cost of diminished accuracy. The two prevalent approaches are called static
quantum and dynamic quantum. In this work both are analyzed using a state-of-the-art,
industrial virtual platform as a case study. While dynamic quantum offers an ideal
trade-off between simulation performance and accuracy in a single-core scenario, performance
reductions can be observed in multi-core platforms. To address this, a novel performance
optimization is proposed, achieving a 14.32% performance gain in our case study while
keeping near-perfect accuracy.
Design Space Exploration of Heterogeneous-Accelerator SoCs with Hyperparameter Optimization
Modern SoC systems consist of general-purpose processor cores augmented with large
numbers of specialized accelerators. Building such systems requires a design flow
allowing the design space to be explored at the system level with an appropriate strategy.
In this paper, we describe a methodology for exploring the power-performance design space of
heterogeneous SoCs by combining an architecture simulator (gem5-Aladdin) and a hyperparameter
optimization method (Hyperopt). This methodology allows different types of parallelism
with loop unrolling strategies and memory coherency interfaces to be swept. The flow
has been applied to a convolutional neural network algorithm. We show that the most
energy efficient architecture achieves a 2x to 4x improvement in energy-delay-product
compared to an architecture without parallelism. Furthermore, the obtained solution
is more efficient than commonly implemented architectures (Systolic, 2D-mapping, and
Tiling). We also applied the methodology to find the optimal architecture including
its coherency interface for a complex SoC made up of six accelerated-workloads. We
show that a hybrid interface appears to be the most efficient; it reaches 22% and
12% improvement in energy-delay product compared to using only non-coherent and
only LLC-coherent models, respectively.
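As a hedged illustration of coupling a hyperparameter optimizer with an architecture simulator, the sketch below uses Hyperopt's TPE search over a loop-unrolling factor and a coherency-interface choice, with a placeholder analytical cost standing in for a gem5-Aladdin run. In the real flow the objective would invoke the simulator and return the measured energy-delay product; the knob values and cost model here are invented.

```python
from hyperopt import fmin, tpe, hp, STATUS_OK

# Design-space knobs (illustrative values, not the paper's exact space).
space = {
    "unroll": hp.choice("unroll", [1, 2, 4, 8, 16]),
    "coherence": hp.choice("coherence", ["non-coherent", "llc-coherent", "hybrid"]),
}

def edp_cost(cfg):
    """Placeholder for a gem5-Aladdin simulation returning energy-delay product.
    A synthetic analytical model is used here instead of the simulator."""
    latency = 100.0 / cfg["unroll"] + {"non-coherent": 20, "llc-coherent": 12, "hybrid": 8}[cfg["coherence"]]
    energy = 5.0 * cfg["unroll"] + {"non-coherent": 10, "llc-coherent": 18, "hybrid": 14}[cfg["coherence"]]
    return {"loss": latency * energy, "status": STATUS_OK}

best = fmin(fn=edp_cost, space=space, algo=tpe.suggest, max_evals=50)
print("best configuration (indices into the choice lists):", best)
```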
SESSION: 4C: Neural Network Optimizations for Compact AI Inference
DNR: A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs
This paper presents a dynamic network rewiring (DNR) method to generate pruned deep
neural network (DNN) models that are robust against adversarial attacks yet maintain
high accuracy on clean images. In particular, the disclosed DNR method is based on
a unified constrained optimization formulation using a hybrid loss function that merges
ultra-high model compression with robust adversarial training. This training strategy
dynamically adjusts inter-layer connectivity based on per-layer normalized momentum
computed from the hybrid loss function. In contrast to existing robust pruning frameworks
that require multiple training iterations, the proposed learning strategy achieves
an overall target pruning ratio with only a single training iteration and can be tuned
to support both irregular and structured channel pruning. To evaluate the merits of
DNR, experiments were performed with two widely accepted models, namely VGG16 and
ResNet-18, on CIFAR-10, CIFAR-100 as well as with VGG16 on Tiny-ImageNet. Compared
to the baseline uncompressed models, DNR provides over 20x compression on all the
datasets with no significant drop in either clean or adversarial classification accuracy.
Moreover, our experiments show that DNR consistently finds compressed models with
better clean and adversarial image classification performance than what is achievable
through state-of-the-art alternatives. Our models and test codes are available at
https://github.com/ksouvik52/DNR_ASP_DAC2021.
Dynamic Programming Assisted Quantization Approaches for Compressing Normal and Robust
DNN Models
In this work, we present effective quantization approaches for compressing the deep
neural networks (DNNs). A key ingredient is a novel dynamic programming (DP) based
algorithm to obtain the optimal solution of scalar K-means clustering. Based on the
approaches with regularization and quantization function, two weight quantization
approaches called DPR and DPQ for compressing normal DNNs are proposed respectively.
Experiments show that they produce models with higher inference accuracy than recently
proposed counterparts while achieving the same or larger compression. They are also extended
for compressing robust DNNs, and the relevant experiments show 16X compression of
the robust ResNet-18 model with less than 3% accuracy drop on both natural and adversarial
examples.
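The abstract's key ingredient is optimal scalar K-means clustering via dynamic programming; a compact O(K·n²) reference implementation of that classical DP is sketched below for clustering a weight vector into K quantization levels. This is a generic textbook version of the algorithm, not the paper's optimized code.

```python
import numpy as np

def optimal_1d_kmeans(values, k):
    """Optimal scalar K-means clustering by dynamic programming.
    Returns the K cluster centers minimizing total squared error (O(K*n^2))."""
    w = np.sort(np.asarray(values, dtype=float))
    n = len(w)
    ps = np.concatenate([[0.0], np.cumsum(w)])        # prefix sums
    ps2 = np.concatenate([[0.0], np.cumsum(w * w)])   # prefix sums of squares

    def cost(i, j):  # SSE of w[i..j] (inclusive, 0-based) around its mean
        s, s2, m = ps[j + 1] - ps[i], ps2[j + 1] - ps2[i], j - i + 1
        return s2 - s * s / m

    INF = float("inf")
    dp = np.full((k + 1, n), INF)
    arg = np.zeros((k + 1, n), dtype=int)
    for j in range(n):
        dp[1][j] = cost(0, j)
    for c in range(2, k + 1):
        for j in range(c - 1, n):
            for i in range(c - 1, j + 1):             # first index of the last cluster
                cand = dp[c - 1][i - 1] + cost(i, j)
                if cand < dp[c][j]:
                    dp[c][j], arg[c][j] = cand, i
    # Backtrack cluster boundaries and return the cluster means (centers).
    centers, j = [], n - 1
    for c in range(k, 0, -1):
        i = arg[c][j] if c > 1 else 0
        centers.append(w[i:j + 1].mean())
        j = i - 1
    return sorted(centers)

weights = np.random.default_rng(0).normal(0, 1, 200)
print(optimal_1d_kmeans(weights, k=4))                # 4 quantization levels
```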
Accelerate Non-unit Stride Convolutions with Winograd Algorithms
While computer vision tasks target increasingly challenging scenarios, the need for
real-time processing of images rises as well, requiring more efficient methods to
accelerate convolutional neural networks. For unit stride convolutions, we use FFT-based
methods and Winograd algorithms to compute matrix convolutions, which effectively
lower the computing complexity by reducing the number of multiplications. For non-unit
stride convolutions, we usually cannot directly apply those algorithms to accelerate
the computations. In this work, we propose a novel universal approach to construct
the non-unit stride convolution algorithms for any given stride and filter sizes from
Winograd algorithms. Specifically, we first demonstrate the steps to decompose an
arbitrary convolutional kernel and apply the Winograd algorithms separately to compute
non-unit stride convolutions. We then present the derivation of this method and proof
by construction to confirm the validity of this approach. Finally, we discuss the
minimum number of multiplications and additions necessary for the non-unit stride
convolutions and evaluate the performance of the decomposed Winograd algorithms. From
our analysis of the computational complexity, the new approach can benefit from 1.5x
to 3x fewer multiplications. In our experiments in real DNN layers, we have acquired
around 1.3x speedup (T_old/T_new) of the Winograd algorithms against the conventional
convolution algorithm in various experiment settings.
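The core idea, decomposing a non-unit-stride convolution into unit-stride sub-convolutions to which Winograd can then be applied, can be checked in a few lines of NumPy for the 1-D stride-2 case. The Winograd kernels themselves are omitted here; plain unit-stride correlation stands in for the sub-convolutions.

```python
import numpy as np

def conv1d_stride(x, w, stride):
    """Direct 1-D correlation with a given stride (reference)."""
    out_len = (len(x) - len(w)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(w)], w) for i in range(out_len)])

def conv1d_stride2_decomposed(x, w):
    """Stride-2 correlation as the sum of two unit-stride correlations on the
    even/odd phases of the input and kernel. Each unit-stride sub-convolution
    could then be computed with a Winograd algorithm."""
    xe, xo = x[0::2], x[1::2]
    we, wo = w[0::2], w[1::2]
    ye = conv1d_stride(xe, we, 1)
    yo = conv1d_stride(xo, wo, 1)
    n = min(len(ye), len(yo))
    return ye[:n] + yo[:n]

rng = np.random.default_rng(0)
x = rng.normal(size=32)
w = rng.normal(size=5)
print(np.allclose(conv1d_stride(x, w, 2), conv1d_stride2_decomposed(x, w)))  # True
```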
Efficient Accuracy Recovery in Approximate Neural Networks by Systematic Error Modelling
Approximate Computing is a promising paradigm for mitigating the computational demands
of Deep Neural Networks (DNNs), by trading off DNN performance against area, throughput,
or power. The DNN accuracy, affected by such approximations, can then be effectively
improved through retraining. In this paper, we present a novel methodology for modelling
the approximation error introduced by approximate hardware in DNNs, which accelerates
retraining and achieves negligible accuracy loss. To this end, we implement the behavioral
simulation of several approximate multipliers and model the error generated by such
approximations on pre-trained DNNs for image classification on CIFAR10 and ImageNet.
Finally, we optimize the DNN parameters by applying our error model during DNN retraining,
to recover the accuracy lost due to approximations. Experimental results demonstrate
the efficiency of our proposed method for accelerated retraining (11x faster for
CIFAR10 and 8x faster for ImageNet) for full DNN approximation, which allows us to
deploy approximate multipliers with energy savings of up to 36% for 8-bit precision
DNNs with an accuracy loss lower than 1%.
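A minimal sketch of the general idea, under assumptions (a toy approximate multiplier that truncates operand LSBs, a single matrix product, NumPy only): the multiplier's error is characterized once, captured as an empirical additive error distribution, and then sampled and injected into the exact product so retraining can see approximation-like noise without bit-accurate simulation. The paper's actual error model and multipliers are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def approx_mult(a, b, drop_bits=3):
    """Toy approximate 8-bit multiplier: truncate operand LSBs before multiplying."""
    mask = ~((1 << drop_bits) - 1)
    return (a & mask) * (b & mask)

# 1) Characterize the multiplier's error over random 8-bit operands.
a = rng.integers(0, 256, 20000)
b = rng.integers(0, 256, 20000)
err_samples = approx_mult(a, b) - a * b          # empirical additive error distribution

# 2) Inject sampled errors into an exact matrix product (statistical error model).
def approx_matmul_with_error_model(x, w):
    exact = x @ w
    n_mac = x.shape[1]                           # one sampled error per multiply-accumulate
    noise = rng.choice(err_samples, size=exact.shape + (n_mac,)).sum(axis=-1)
    return exact + noise

x = rng.integers(0, 256, (4, 16))
w = rng.integers(0, 256, (16, 8))
y = approx_matmul_with_error_model(x, w)         # used in place of the exact layer during retraining
print(y.shape)
```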
SESSION: 4D: Brain-Inspired Computing
Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators
ReRAM-based accelerators have shown great potential for accelerating DNN inference
because ReRAM crossbars can perform analog matrix-vector multiplication operations
with low latency and energy consumption. However, these crossbars require the use
of ADCs which constitute a significant fraction of the cost of MVM operations. The
overhead of ADCs can be mitigated via partial sum quantization. However, prior quantization
flows for DNN inference accelerators do not consider partial sum quantization which
is not highly relevant to traditional digital architectures. To address this issue,
we propose a mixed precision quantization scheme for ReRAM-based DNN inference accelerators
where weight quantization, input quantization, and partial sum quantization are jointly
applied for each DNN layer. We also propose an automated quantization flow powered
by deep reinforcement learning to search for the best quantization configuration in
the large design space. Our evaluation shows that the proposed mixed precision quantization
scheme and quantization flow reduce inference latency and energy consumption by up
to 3.89x and 4.84x, respectively, while only losing 1.18% in DNN inference accuracy.
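To make the partial-sum quantization discussed above concrete, the NumPy sketch below splits a weight matrix into crossbar-sized row tiles, computes each tile's partial sums, quantizes them with a uniform quantizer of configurable bit width (standing in for the ADC), and accumulates the quantized partial sums. The tile size and bit widths are illustrative choices, not the paper's configuration.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Uniform symmetric quantizer, standing in for a `bits`-bit ADC."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def crossbar_mvm(weights, inputs, rows_per_xbar=128, psum_bits=6):
    """MVM split across row tiles with per-tile partial-sum quantization."""
    acc = np.zeros(weights.shape[1])
    for start in range(0, weights.shape[0], rows_per_xbar):
        tile = weights[start:start + rows_per_xbar]
        psum = inputs[start:start + rows_per_xbar] @ tile   # analog-style partial sum
        acc += uniform_quantize(psum, psum_bits)            # ADC with limited precision
    return acc

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 64))
x = rng.normal(size=512)
exact = x @ W
approx = crossbar_mvm(W, x, psum_bits=6)
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```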
A reduced-precision streaming SpMV architecture for Personalized PageRank on FPGA
Sparse matrix-vector multiplication is often employed in many data-analytic workloads
in which low latency and high throughput are more valuable than exact numerical convergence.
FPGAs provide quick execution times while offering precise control over the accuracy
of the results thanks to reduced-precision fixed-point arithmetic. In this work, we
propose a novel streaming implementation of Coordinate Format (COO) sparse matrix-vector
multiplication, and study its effectiveness when applied to the Personalized PageRank
algorithm, a common building block of recommender systems in e-commerce websites and
social networks. Our implementation achieves speedups up to 6x over a reference floating-point
FPGA architecture and a state-of-the-art multi-threaded CPU implementation on 8 different
data-sets, while preserving the numerical fidelity of the results and reaching up
to 42x higher energy efficiency compared to the CPU implementation.
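A small sketch, under assumptions, of the streaming COO product at the heart of the design: matrix entries are processed as (row, col, value) triples, values are held in reduced-precision fixed point, and results are accumulated per row. The fixed-point width is an arbitrary choice, and none of the hardware details (pipelining, memory streaming) are captured in Python.

```python
import numpy as np

FRAC_BITS = 12  # assumed fixed-point fractional width

def to_fixed(x):
    return np.round(np.asarray(x) * (1 << FRAC_BITS)).astype(np.int64)

def coo_spmv_fixed(rows, cols, vals_fx, x_fx, n_rows):
    """Streaming COO SpMV in fixed point: one multiply-accumulate per nonzero."""
    y = np.zeros(n_rows, dtype=np.int64)
    for r, c, v in zip(rows, cols, vals_fx):
        y[r] += (v * x_fx[c]) >> FRAC_BITS            # rescale product back to Q-format
    return y / (1 << FRAC_BITS)                       # convert to float for comparison

rng = np.random.default_rng(0)
n, nnz = 64, 400
rows = rng.integers(0, n, nnz)
cols = rng.integers(0, n, nnz)
vals = rng.random(nnz)
x = rng.random(n)

dense = np.zeros((n, n))
np.add.at(dense, (rows, cols), vals)                  # reference matrix (duplicates accumulate)
y_ref = dense @ x
y_fx = coo_spmv_fixed(rows, cols, to_fixed(vals), to_fixed(x), n)
print("max abs error vs float:", np.max(np.abs(y_fx - y_ref)))
```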
HyperRec: Efficient Recommender Systems with Hyperdimensional Computing
Recommender systems are important tools for many commercial applications such as online
shopping websites. There are several issues that make the recommendation task very
challenging in practice. The first is that an efficient and compact representation
is needed to represent users, items and relations. The second issue is that the online
markets are changing dynamically, it is thus important that the recommendation algorithm
is suitable for fast updates and hardware acceleration. In this paper, we propose
a new hardware-friendly recommendation algorithm based on Hyperdimensional Computing,
called HyperRec. Unlike existing solutions, which leverage floating-point numbers
for the data representation, in HyperRec users and items are modeled with binary
vectors in a high-dimensional space. The binary representation enables the reasoning
process of the proposed algorithm to be performed using only Boolean operations, which is efficient
on various computing platforms and suitable for hardware acceleration. In this work,
we show how to utilize GPU and FPGA to accelerate the proposed HyperRec. When compared
with the state-of-the-art methods for rating prediction, the CPU-based HyperRec implementation
is 13.75x faster and consumes 87% less memory, while decreasing the mean squared error
(MSE) for the prediction by as much as 31.84%. Our FPGA implementation is on average
67.0x faster and 6.9x more energy efficient than the CPU. Our GPU implementation
further achieves on average 3.1x speedup as compared to FPGA, while providing only
1.2x lower energy efficiency.
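The binary hyperdimensional encoding described above can be illustrated in a few lines of NumPy: users and items get random binary hypervectors, a rated pair is bound with XOR, a user profile is bundled by bitwise majority, and similarity is measured with Hamming distance. The dimensionality and scoring rule here are illustrative, not the paper's exact pipeline.

```python
import numpy as np

D = 4096                                   # hypervector dimensionality (illustrative)
rng = np.random.default_rng(0)
rand_hv = lambda: rng.integers(0, 2, D, dtype=np.uint8)

users = {u: rand_hv() for u in ["alice", "bob"]}
items = {i: rand_hv() for i in ["book", "phone", "lamp"]}

def bind(a, b):                            # XOR binding of two binary hypervectors
    return a ^ b

def bundle(hvs):                           # bitwise-majority bundling
    return (np.sum(hvs, axis=0) * 2 >= len(hvs)).astype(np.uint8)

def hamming_sim(a, b):                     # similarity = 1 - normalized Hamming distance
    return 1.0 - np.count_nonzero(a ^ b) / D

# Encode alice's history (items she interacted with) as a bundled profile.
alice_profile = bundle([bind(users["alice"], items["book"]),
                        bind(users["alice"], items["phone"])])
# Score a candidate item by how close its binding is to the profile.
for name, hv in items.items():
    print(name, round(hamming_sim(alice_profile, bind(users["alice"], hv)), 3))
```

Items already in the history score around 0.75, while an unseen item scores near the 0.5 chance level, which is the separation that the Boolean-only reasoning relies on.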
Efficient Techniques for Training the Memristor-based Spiking Neural Networks Targeting
Better Speed, Energy and Lifetime
Speed and energy consumption are two important metrics in designing spiking neural
networks (SNNs). The inference process of current SNNs is terminated after a preset
number of time steps for all images, which leads to a waste of time and spikes. We
can terminate the inference process after a proper number of time steps for each image.
Besides, the normalization method also influences the time and spikes of SNNs. In this
work, we first use a reinforcement learning algorithm to develop an efficient termination
strategy which can help find the right number of time steps for each image. Then we
propose a model tuning technique for memristor-based crossbar circuit to optimize
the weight and bias of a given SNN. Experimental results show that the proposed techniques
can reduce crossbar energy consumption by about 58.7% and time consumption by over 62.5%,
and double the drift lifetime of the memristor-based SNN.
SESSION: 4E: Cross-Layer Hardware Security
PCBench: Benchmarking of Board-Level Hardware Attacks and Trojans
Most modern electronic systems are hosted by printed circuit boards (PCBs), making
them a ubiquitous system component that can take many different shapes and forms.
In order to achieve a high level of economy of scale, the global supply chain of electronic
systems has evolved into disparate segments for the design, fabrication, assembly,
and testing of PCB boards and their various associated components. As a consequence,
the modern PCB supply chain exposes many vulnerabilities along its different stages,
allowing adversaries to introduce malicious alterations to facilitate board-level
attacks.
As an emerging hardware threat, the attack and defense techniques at the board level
have not yet been systemically explored and thus require a thorough and comprehensive
investigation. In the absence of a standard board-level attack benchmark, current research
on prospective countermeasures is likely to be evaluated on proprietary variants of
ad-hoc attacks, preventing credible and verifiable comparison among different techniques.
To address this need, in this paper we systematically define and categorize a broad
range of board-level attacks. For the first time, the attack vectors and construction
rules for board-level attacks are developed. A practical and reliable board-level
attack benchmark generation scheme is also developed, which can be used to produce
references for evaluating countermeasures. Finally, based on the proposed approach,
we have created a comprehensive set of board-level attack benchmarks for open-source
release.
Cache-Aware Dynamic Skewed Tree for Fast Memory Authentication
Memory integrity trees are widely-used to protect external memories in embedded systems
against bus attacks. However, existing methods often result in high performance overheads
incurred during memory authentication. To reduce memory accesses during authentication,
the tree nodes are cached on-chip. In this paper, we propose a cache-aware technique
to dynamically skew the integrity tree based on the application workloads in order
to reduce the performance overhead. The tree is initialized using the van Emde Boas (vEB)
organization to take advantage of locality of reference. At run time, the nodes of
the integrity tree are dynamically positioned based on their memory access patterns.
In particular, frequently accessed nodes are placed closer to the root to reduce the
memory access overheads. The proposed technique is compared with existing methods
on Multi2Sim using benchmarks from SPEC-CPU2006, SPLASH-2 and PARSEC to demonstrate
its performance benefits.
Automated Test Generation for Hardware Trojan Detection using Reinforcement Learning
Due to the globalized semiconductor supply chain, there is an increasing risk of exposing
System-on-Chip (SoC) designs to malicious implants, popularly known as hardware Trojans.
Unfortunately, traditional simulation-based validation using millions of test vectors
is unsuitable for detecting stealthy Trojans with extremely rare trigger conditions
due to exponential input space complexity of modern SoCs. There is a critical need
to develop efficient Trojan detection techniques to ensure trustworthy SoCs. While
there are promising test generation approaches, they have serious limitations in terms
of scalability and detection accuracy. In this paper, we propose a novel logic testing
approach for Trojan detection using an effective combination of testability analysis
and reinforcement learning. Specifically, this paper makes three important contributions.
1) Unlike existing approaches, we utilize both controllability and observability analysis
along with rareness of signals to significantly improve the trigger coverage. 2) Utilization
of reinforcement learning considerably reduces the test generation time without sacrificing
the test quality. 3) Experimental results demonstrate that our approach can drastically
improve both trigger coverage (14.5% on average) and test generation time (6.5 times
on average) compared to state-of-the-art techniques.
On the Impact of Aging on Power Analysis Attacks Targeting Power-Equalized Cryptographic
Circuits
Side-channel analysis attacks exploit the physical characteristics of cryptographic
chip implementations to extract their embedded secret keys. In particular, Power Analysis
(PA) attacks make use of the dependency of the power consumption on the data being
processed by the cryptographic devices. To tackle the vulnerability of cryptographic
circuits against PA attack, various countermeasures have been proposed in literature
and adopted by industry, among which a branch of hiding schemes opts to equalize
the power consumption of the chip regardless of the processed data. Although these
countermeasures are supposed to reduce the information leakage of cryptographic chips,
they fail to consider the impact of aging that occurs during the device lifetime. Due to
aging, the specifications of transistors, and in particular their threshold voltage,
deviate from their fabrication-time specification, leading to a change in the circuit's
delay and power consumption over time. In this paper, we show that the aging-induced
impacts result in imbalances in the equalized power consumption achieved by hiding
countermeasures. This makes such protected cryptographic chips vulnerable to PA attacks
when aged. The experimental results extracted through the aging simulation of the
PRESENT cipher protected by Sense Amplifier Based Logic (SABL), one of the well-known
hiding countermeasures, show that the achieved protection may not last during the
circuit lifetime.
SESSION: 5B: Embedded Operating Systems and Information Retrieval
Energy-Performance Co-Management of Mixed-Sensitivity Workloads on Heterogeneous Multi-core
Systems
Satisfying the performance requirements of complex workload scenarios while minimizing energy
consumption on Heterogeneous Multi-core Platforms (HMPs) is challenging, considering i) the
increasing variety of applications and ii) the large space of resource management
configurations. Existing run-time resource management approaches use online and offline
learning to handle such complexity. However, they focus on one type of application,
neglecting concurrent execution of mixed sensitivity workloads. In this work, we propose
an energy-performance co-management method which prioritizes mixed types of applications
at run-time, and searches in the configuration space to find the optimal configuration
for each application which satisfies the performance requirements while saving energy.
We evaluate our approach on a real Odroid XU3 platform over mixed-sensitivity embedded
workloads. Experimental results show our approach provides 54% lower performance violation
with 50% higher energy saving compared to the existing approaches.
Optimizing Inter-Core Data-Propagation Delays in Industrial Embedded Systems under
Partitioned Scheduling
This paper addresses the scheduling of industrial time-critical applications on multi-core
embedded systems. A novel scheduling technique under partitioned scheduling is proposed
that minimizes inter-core data-propagation delays between tasks that are activated
with different periods. The proposed technique is based on the read-execute-write
model for the execution of tasks to guarantee temporal isolation when accessing the
shared resources. A Constraint Programming formulation is presented to find the schedule
for each core. Evaluations are performed to assess the scalability as well as the
resulting schedulability ratio, which is still 18% for two cores that are both utilized
90%. Furthermore, an automotive industrial case study is performed to demonstrate
the applicability of the proposed technique to industrial systems. The case study
also presents a comparative evaluation of the schedules generated by (i) the proposed
technique and (ii) the Rubus-ICE industrial tool suite with respect to jitter, inter-core
data-propagation delays and their impact on data age of task chains that span multiple
cores.
LiteIndex: Memory-Efficient Schema-Agnostic Indexing for JSON documents in SQLite
SQLite with JSON (JavaScript Object Notation) format is widely adopted for local data
storage in mobile applications such as Twitter and Instagram. As more data is generated
and stored, it becomes vitally important to efficiently index and search JSON records
in SQLite. However, current methods in SQLite either require full text search (that
incurs big memory usage and long query latency) or indexing based on expression (that
needs to be manually created by specifying search keys). On the other hand, existing
JSON automatic indexing techniques, mainly focusing on big data and cloud environments,
depend on a colossal tree structure that cannot be applied in memory-constrained mobile
devices.
In this paper, we propose a novel schema-agnostic indexing technique called LiteIndex
that can automatically index JSON records by extracting keywords from long text and
maintaining user-preferred items within a given memory constraint. This is achieved
by memory-efficient index organization with light-weight keyword extraction from long
text and user-preference-aware reinforcement-learning-based index pruning mechanism.
LiteIndex has been implemented on an Android smartphone platform and evaluated with
a dataset of tweets. Experimental results show that LiteIndex can significantly reduce
the query latency by up to 18x with less memory usage compared with SQLite with FTS3/FTS4
extensions.
SESSION: 5C: Security Issues in AI and Their Impacts on Hardware Security
Micro-architectural Cache Side-Channel Attacks and Countermeasures
Central Processing Unit (CPU) is considered as the brain of a computer. If the CPU
has vulnerabilities, the security of software running on it is difficult to guarantee.
In recent years, various micro-architectural cache side-channel attacks on the CPU
such as Spectre and Meltdown have appeared. They exploit contention on internal components
of the processor to leak secret information between processes. This newly evolving
research area has aroused significant interest due to the broad application range
and harmfulness of these attacks. This article reviews recent research progress on
micro-architectural cache side-channel attacks and defenses. First, the various micro-architectural
cache side-channel attacks are classified and discussed. Then, the corresponding countermeasures
are summarized. Finally, the limitations and future development trends are discussed.
Security of Neural Networks from Hardware Perspective: A Survey and Beyond
Recent advances in neural networks (NNs) and their applications in deep learning techniques
have made the security aspects of NNs an important and timely topic for fundamental
research. In this paper, we survey the security challenges and opportunities in the
computing hardware used in implementing deep neural networks (DNN). First, we explore
the hardware attack surfaces for DNN. Then, we report the current state-of-the-art
hardware-based attacks on DNN with focus on hardware Trojan insertion, fault injection,
and side-channel analysis. Next, we discuss the recent development on detecting these
hardware-oriented attacks and the corresponding countermeasures. We also study the
application of secure enclaves for the trusted execution of NN-based algorithms. Finally,
we consider the emerging topic of intellectual property protection for deep learning
systems. Based on our study, we find ample opportunities for hardware-based research
to secure the next generation of DNN-based artificial intelligence and machine learning
platforms.
Learning Assisted Side Channel Delay Test for Detection of Recycled ICs
With the outsourcing of design flow, ensuring the security and trustworthiness of
integrated circuits has become more challenging. Among the security threats, IC counterfeiting
and recycled ICs have received a lot of attention due to their inferior quality, and
in turn, their negative impact on the reliability and security of the underlying devices.
Detecting recycled ICs is challenging due to the effect of process variations and
process drift occurring during the chip fabrication. Moreover, relying on a golden
chip as a basis for comparison is not always feasible. Accordingly, this paper presents
a recycled IC detection scheme based on delay side-channel testing. The proposed method
relies on the features extracted during the design flow and the sample delays extracted
from the target chip to build a Neural Network model with which the target chip can
be correctly identified as new or recycled. The proposed method classifies the timing
paths of the target chip into two groups based on their vulnerability to aging using
the information collected from the design and detects the recycled ICs based on the
deviation of the delay of these two sets from each other.
ML-augmented Methodology for Fast Thermal Side-channel Emission Analysis
Accurate side-channel attacks can non-invasively or semi-invasively extract secure
information from hardware devices using “side-channel” measurements. The thermal
profile of an IC is one class of side channel that can be used to exploit the security
weaknesses in a design. Measurement of junction temperature from an on-chip thermal
sensor or top metal layer temperature using an infrared thermal image of an IC with
the package being removed can disclose secret keys of a cryptographic design through
correlation power analysis. In order to identify the design vulnerabilities to thermal
side channel attacks, design time simulation tools are highly important. However,
simulation of thermal side-channel emission is highly complex and computationally
intensive due to the scale of simulation vectors required and the multi-physics simulation
models involved. Hence, in this paper, we have proposed a fast and comprehensive Machine
Learning (ML) augmented thermal simulation methodology for thermal Side-Channel emission
Analysis (SCeA). We have developed an innovative tile-based Delta-T Predictor using
a data-driven DNN-based thermal solver. The developed tile-based Delta-T Predictor
temperature is used to perform the thermal side-channel analysis which models the
scenario of thermal attacks with the measurement of junction temperature. This method
can be 100-1000x faster depending on the size of the chip compared to traditional
FEM-based thermal solvers with the same level of accuracy. Furthermore, this simulation
allows for the determination of location-dependent wire temperature on the top metal
layer to validate the scenario of thermal attack with top metal layer temperature.
We have demonstrated the leakage of the encryption key in a 128-bit AES chip using
both proposed tile-based temperature calculations and top metal wire temperature calculations,
quantified by simulation MTD (Measurements-to-Disclosure).
SESSION: 5D: Advances in Logic and High-level Synthesis
1st-Order to 2nd-Order Threshold Logic Gate Transformation with an Enhanced ILP-based
Identification Method
This paper introduces a method to enhance an integer linear programming (ILP)-based
method for transforming a 1st-order threshold logic gate (1-TLG) to a 2nd-order TLG
(2-TLG) with lower area cost. We observe that for a 2-TLG, most of the 2nd-order weights
(2-weights) are zero. That is, in the ILP formulation, most of the variables for the
2-weights could be set to zero. Thus, we first propose three sufficient conditions
for transforming a 1-TLG to a 2-TLG by extracting 2-weights. These extracted weights
are seen to be more likely non-zero. Then, we simplify the ILP formulation by eliminating
the non-extracted 2-weights to speed up the ILP solving. The experimental results
show that, to transform a set of 1-TLGs to 2-TLGs, the enhanced method saves an average
of 24% CPU time with only an average of 1.87% quality loss in terms of the area cost
reduction rate.
A Novel Technology Mapper for Complex Universal Gates
Complex universal logic gates, which may have higher density and flexibility than
basic logic gates and look-up tables (LUT), are useful for cost-effective or security-oriented
VLSI design requirements. However, most of the technology mapping algorithms aim to
optimize combinational logic with basic standard cells or LUT components. It is desirable
to investigate optimal technology mappers for complex universal gates in addition
to basic standard cells and LUT components. This paper proposes a novel technology
mapper for complex universal gates with a tight integration of the following techniques:
Boolean network simulation with permutation classification, supergate library construction,
dynamic-programming-based cut enumeration, and Boolean matching with optimal universal
cell covering. Experimental results show that the proposed method outperforms the
state-of-the-art technology mapper in ABC, in terms of both area and delay.
High-Level Synthesis of Transactional Memory
The rising popularity of high-level synthesis (HLS) is due to the complexity and amount
of background knowledge required to design hardware circuits. Despite significant
recent advances in HLS research, HLS-generated circuits may be of lower quality than
human-expert-designed circuits, from the performance, power, or area perspectives.
In this work, we aim to raise circuit performance by introducing a transactional memory
(TM) synchronization model to the open-source LegUp HLS tool [1]. LegUp HLS supports
the synthesis of multi-threaded software into parallel hardware [4], including support
for mutual-exclusion lock-based synchronization. With the introduction of transactional
memory-based synchronization, location-specific (i.e. finer grained) memory locks
are made possible, where instead of placing an access lock around an entire array,
one can place a lock around individual array elements. Significant circuit performance
improvements are observed through reduced stalls due to contention, and greater memory-access
parallelism. On a set of 5 parallel benchmarks, wall-clock time is improved by 2.0x,
on average, by the TM synchronization model vs. mutex-based locks.
SESSION: 5E: Hardware-Oriented Threats and Solutions in Neural Networks
VADER: Leveraging the Natural Variation of Hardware to Enhance Adversarial Attack
Adversarial attacks have been viewed as the primary threat to the security of neural
networks. Hence, extensive adversarial defense techniques have been proposed to protect
the neural networks from adversarial attacks, allowing for the application of neural
networks to the security-sensitive tasks. Recently, the emerging devices, e.g., Resistive
RAM (RRAM), attracted extensive attention for establishing the hardware platform for
neural networks to tackle the inadequate computing capability of the traditional computing
platform. Though the emerging devices exhibit intrinsic instability issues due
to the advanced manufacturing technology, including hardware variations and defects,
the error-resilience capability of neural networks enables the wide deployment of
neural networks on the emerging devices. In this work, we find that the natural instability
in emerging devices impairs the security of neural networks. Specifically, we design
an enhanced adversarial attack, Variation-oriented ADvERsarial (VADER) attack which
leverages the inherent hardware variations in RRAM chips to penetrate the protection
of adversarial defenses and mislead the prediction of neural networks. We evaluated
the effectiveness of VADER across various protected neural network models and the
result shows that VADER achieves a higher attack success rate than other adversarial
attacks.
Entropy-Based Modeling for Estimating Adversarial Bit-flip Attack Impact on Binarized
Neural Network
Over past years, the high demand to efficiently process deep learning (DL) models
has driven the market of chip design companies. However, the new Deep Chip architectures,
a common term for DL hardware accelerators, have paid little attention to
the security requirements of quantized neural networks (QNNs), while black/white-box
adversarial attacks can jeopardize the integrity of the inference accelerator.
Therefore in this paper, a comprehensive study of the resiliency of QNN topologies
to black-box attacks is examined. Herein, different attack scenarios are performed
on an FPGA-processor co-design, and the collected results are extensively analyzed
to give an estimation of the impact’s degree of different types of attacks on the
QNN topology. To be specific, we evaluated the sensitivity of the QNN accelerator
to a range of bit-flip attacks (BFAs) that might occur in the operational lifetime
of the device. The BFAs are injected at uniformly distributed times either across
the entire QNN or per individual layer during the image classification. The acquired
results are utilized to build an entropy-based model that can be leveraged to construct
resilient QNN architectures to bit-flip attacks.
A Low Cost Weight Obfuscation Scheme for Security Enhancement of ReRAM Based Neural
Network Accelerators
The resistive random-access memory (ReRAM) based accelerator can execute the large
scale neural network (NN) applications in an extremely energy efficient way. However,
the non-volatile feature of the ReRAM introduces some security vulnerabilities. The
weight parameters of a well-trained NN model deployed on the ReRAM based accelerator
persist even after the chip is powered off. The adversaries who have physical
access to the accelerator can hence launch the model stealing attack and extract these
weights by some micro-probing methods. Run-time encryption of the weights is an intuitive
way to protect the NN model but largely degrades execution performance and device endurance,
while obfuscation of the weight rows incurs tremendous hardware area overhead
to achieve high security. In view of the above problems, in this
paper we propose a low-cost weight obfuscation scheme to secure the NN model deployed
on the ReRAM based accelerators from the model stealing attack. We partition the crossbar
into many virtual operation units (VOUs) and perform full permutation on the weights
of the VOUs along the column dimension. Without the keys, the attacker cannot perform
the correct NN computations even if they have obtained the obfuscated model. Compared
with weight-row-based obfuscation, our scheme can achieve the same level of security
with an order of magnitude less hardware area and power overhead.
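As a hedged sketch of the obfuscation idea (not the paper's exact scheme), the code below partitions a crossbar weight matrix into fixed-size virtual operation units and applies a key-derived permutation to the weights of each VOU along the column dimension; the same key recovers the original layout.

```python
import numpy as np

def vou_permute(weights, key, vou_cols=4, inverse=False):
    """Permute (or un-permute) columns independently inside each virtual
    operation unit (VOU) of `vou_cols` columns, using a key-seeded RNG."""
    rng = np.random.default_rng(key)
    out = weights.copy()
    for start in range(0, weights.shape[1], vou_cols):
        cols = np.arange(start, min(start + vou_cols, weights.shape[1]))
        perm = rng.permutation(len(cols))
        if inverse:
            out[:, cols[perm]] = weights[:, cols]    # undo the permutation
        else:
            out[:, cols] = weights[:, cols[perm]]    # apply the permutation
    return out

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
key = 0xC0FFEE
W_obf = vou_permute(W, key)                          # layout stored on the crossbar
W_rec = vou_permute(W_obf, key, inverse=True)        # recovered with the correct key
print(np.allclose(W, W_rec))                         # True
```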
SESSION: 6B: Advanced Optimizations for Embedded Systems
Puncturing the memory wall: Joint optimization of network compression with approximate memory for ASR application
The automatic speech recognition (ASR) system is becoming increasingly irreplaceable
in smart speech interaction applications. Nonetheless, these applications confront
the memory wall when embedded in the energy and memory constrained Internet of Things
devices. Therefore, it is extremely challenging but imperative to design a memory-saving
and energy-saving ASR system. This paper proposes a joint-optimized scheme of network
compression with approximate memory for the economical ASR system. At the algorithm
level, this work presents block-based pruning and quantization with error model (BPQE),
an optimized compression framework including a novel pruning technique coordinated
with low-precision quantization and the approximate memory scheme. The BPQE compressed
recurrent neural network (RNN) model comes with an ultra-high compression rate and
fine-grained structured pattern that reduces the amount of memory access immensely.
At the hardware level, this work presents an ASR-adapted incremental retraining method
to further obtain optimal power saving. This retraining method stimulates the utility
of the approximate memory scheme, while maintaining considerable accuracy. According
to the experiment results, the proposed joint-optimized scheme achieves 58.6% power
saving and 40x memory saving with a phone error rate of 20%.
Canonical Huffman Decoder on Fine-grain Many-core Processor Arrays
Canonical Huffman codecs have been used in a wide variety of platforms ranging from
mobile devices to data centers which all demand high energy efficiency and high throughput.
This work presents bit-parallel canonical Huffman decoder implementations on a fine-grain
many-core array built using simple RISC-style programmable processors. We develop
multiple energy-efficient and area-efficient decoder implementations and the results
are compared with an Intel i7-4850HQ and a massively parallel GT 750M GPU executing
the corpus benchmarks: Calgary, Canterbury, Artificial, and Large. The many-core implementations
achieve a scaled throughput per chip area that is 324x and 2.7x greater on average
than the i7 and GT 750M respectively. In addition, the many-core implementations yield
a scaled energy efficiency (bytes decoded per energy) that is 24.1x and 4.6x greater
than the i7 and GT 750M respectively.
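For readers unfamiliar with canonical Huffman decoding, the sketch below shows the standard length-based decoding loop that such decoders accelerate: from a code-length table it derives the first code and first symbol index of each length, then decodes a bitstring one code at a time. It is a plain software reference, unrelated to the paper's many-core mapping.

```python
def build_canonical(lengths):
    """Given {symbol: code length}, return canonical decoding tables:
    first_code[l], first_index[l], count[l], and symbols in canonical order."""
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    max_len = max(lengths.values())
    count = [0] * (max_len + 1)
    for s in symbols:
        count[lengths[s]] += 1
    first_code, first_index = [0] * (max_len + 1), [0] * (max_len + 1)
    code = idx = 0
    for l in range(1, max_len + 1):
        code = (code + count[l - 1]) << 1
        first_code[l], first_index[l] = code, idx
        idx += count[l]
    return first_code, first_index, count, symbols

def decode(bits, lengths):
    first_code, first_index, count, symbols = build_canonical(lengths)
    out, code, l = [], 0, 0
    for b in bits:
        code = (code << 1) | int(b)
        l += 1
        if count[l] and code - first_code[l] < count[l]:   # a valid length-l code
            out.append(symbols[first_index[l] + code - first_code[l]])
            code, l = 0, 0
    return out

lengths = {"a": 1, "b": 2, "c": 3, "d": 3}   # example code lengths
# canonical codes: a=0, b=10, c=110, d=111
print(decode("0101100111", lengths))          # ['a', 'b', 'c', 'a', 'd']
```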
A Decomposition-Based Synthesis Algorithm for Sparse Matrix-Vector Multiplication
in Parallel Communication Structure
There is an obvious trend that hardware including many-core CPU, GPU and FPGA are
always made use of to conduct computationally intensive tasks of deep learning implementations,
a large proportion of which can be formulated as sparse matrix-vector
multiplication (SpMV). In contrast with dense matrix-vector multiplication (DMV), scheduling
solutions for SpMV targeting parallel processing turn out to be irregular, leading
to the dilemma that the optimum synthesis problems are time-consuming or even infeasible,
when the size of the involved matrix increases. In this paper, the minimum scheduling
problem of 4×4 SpMV on ring-connected architecture is first studied, with two concepts
named Multi-Input Vector and Multi-Output Vector introduced. The classification of
4×4 sparse matrices has been conducted, on account of which a decomposition-based
synthesis algorithm for larger matrices is put forward. As the proposed method is
guided by known sub-scheduling solutions, search space of the synthesis problem is
considerably reduced. Through comparison with an exhaustive search method and a brute
force-based parallel scheduling method, the proposed method is proved to be able to
offer scheduling solutions of high quality: on average, they utilize 65.27% of the sparseness
of the involved matrices and achieve 91.39% of the performance of the solutions generated
by exhaustive search, with a remarkable saving of compilation time and the best scalability
among the above-mentioned approaches.
SESSION: 6C: Design and Learning of Logic Circuits and Systems
Learning Boolean Circuits from Examples for Approximate Logic Synthesis
Many computing applications are inherently error resilient. Thus, it is possible to
decrease computing accuracy to achieve greater efficiency in area, performance, and/or
energy consumption. In recent years, a slew of automatic techniques for approximate
computing has been proposed; however, most of these techniques require full knowledge
of an exact, or ‘golden’ circuit description. In contrast, there has been significant
recent interest in synthesizing computation from examples, a form of supervised learning.
In this paper, we explore the relationship between supervised learning of Boolean
circuits and existing work on synthesizing incompletely-specified functions. We show
that when considered through a machine learning lens, the latter work provides a good
training accuracy but poor test accuracy. We contrast this with prior work from the
1990s which uses mutual information to steer the search process, aiming for good generalization.
By combining this early work with a recent approach to learning logic functions, we
are able to achieve a scalable and efficient machine learning approach for Boolean
circuits in terms of area/delay/test-error trade-off.
Read your Circuit: Leveraging Word Embedding to Guide Logic Optimization
To tackle the involved complexity, Electronic Design Automation (EDA) tools are broken
into well-defined steps, each operating at a different abstraction level. Higher levels
of abstraction shorten the flow run-time while sacrificing correlation with the physical
circuit implementation. Bridging this gap between Logic Synthesis tool and Physical
Design (PnR) tools is key to improving Quality of Results (QoR), while possibly shortening
the time-to-market. To address this problem, in this work, we formalize logic paths
as sentences, with the gates being a bag of words. Thus, we show how word embedding
can be leveraged to represent generic paths and predict if a given path is likely
to be critical post-PnR. We present the effectiveness of our approach, with accuracy
over than 90% for our test-cases. Finally, we give a step further and introduce an
intelligent and non-intrusive flow that uses this information to guide optimization.
Our flow achieves improvements of up to 15.53% in area-delay product (ADP) and 18.56%
in power-delay product (PDP) compared to a standard flow.
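As a rough illustration of treating gate sequences as sentences (a sketch with hypothetical cell names, assuming the gensim 4.x Word2Vec API; the paper's own embedding model and features are not reproduced):

```python
import numpy as np
from gensim.models import Word2Vec

# Each logic path is a "sentence" whose "words" are the gates it traverses
# (hypothetical cell names for illustration).
paths = [
    ["NAND2_X1", "INV_X2", "AOI21_X1", "DFF_X1"],
    ["NOR2_X1", "INV_X1", "NAND3_X2", "DFF_X1"],
    ["NAND2_X1", "XOR2_X1", "BUF_X4", "DFF_X2"],
]

# Learn a small embedding for every gate type from the path corpus.
model = Word2Vec(sentences=paths, vector_size=16, window=3, min_count=1, epochs=50)

# A path can then be represented, e.g., by averaging its gate vectors and fed
# to a classifier that predicts post-PnR criticality.
path_vec = np.mean([model.wv[g] for g in paths[0]], axis=0)
print(path_vec.shape)   # (16,)
```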
Exploiting HLS-Generated Multi-Version Kernels to Improve CPU-FPGA Cloud Systems
Cloud warehouses have been exploiting CPU-FPGA collaborative execution environments,
where multiple clients share the same infrastructure to maximize resource
utilization with the highest possible energy efficiency and scalability. However,
the resource provisioning is challenging in these environments, since kernels may
be dispatched to both CPU and FPGA concurrently in a highly variant scenario, in terms
of available resources and workload characteristics. In this work, we propose MultiVers,
a framework that leverages automatic HLS generation to enable further gains in such
CPU-FPGA collaborative systems. MultiVers exploits the automatic generation from HLS
to build libraries containing multiple versions of each incoming kernel request, greatly
enlarging the design space available for optimization by the cloud provider's allocation
strategies. MultiVers makes kernel multiversioning and the allocation strategy work
symbiotically, allowing fine-tuning in terms of resource usage, performance, energy,
or any combination of these parameters. We show the efficiency
of MultiVers by using real-world cloud request scenarios with a diversity of benchmarks,
achieving average improvements on makespan and energy of up to 4.62x and 19.04x, respectively,
over traditional allocation strategies executing non-optimized kernels.
SESSION: 6D: Hardware Locking and Obfuscation
Area Efficient Functional Locking through Coarse Grained Runtime Reconfigurable Architectures
The protection of Intellectual Property (IP) has emerged as one of the most important
issues in the hardware design industry. Most VLSI design companies are now fabless
and need to protect their IP from being illegally distributed. One of the main approaches
to address this has been logic locking. Logic locking prevents IPs from being reverse
engineered and the hardware circuit from being overbuilt by untrusted foundries.
One of the main problems with existing logic locking techniques is that the foundry
has full access to the entire design, including the logic locking mechanism. Because
of the importance of this topic, ever more robust locking mechanisms are continuously
proposed, and equally fast, new methods to break them appear. One alternative approach is to
lock a circuit through omission. The main idea is to selectively map a portion of
the IP onto an embedded FPGA (eFPGA). Because the foundry does not have access to
the bitstream, the circuit cannot be used until programmed by the legitimate user.
One of the main problems with this approach is the large overhead in terms of area
and power, as well as timing degradation. Area is especially a concern for price-sensitive
applications. To address this, in this work we present a method to map portions of
a design onto a Coarse Grained Runtime Reconfigurable Architecture (CGRRA) such that
multiple parts of a design can be hidden onto the CGRRA, substantially amortizing
the area overhead introduced by the CGRRA.
ObfusX: Routing Obfuscation with Explanatory Analysis of a Machine Learning Attack
This is the first work that incorporates recent advancements in “explainability” of
machine learning (ML) to build a routing obfuscator called ObfusX. We adopt a recent
metric—the SHAP value—which explains to what extent each layout feature can reveal
each unknown connection for a recent ML-based split manufacturing attack model. The
unique benefits of SHAP-based analysis include the ability to identify the best candidates
for obfuscation, together with the dominant layout features which make them vulnerable.
As a result, ObfusX achieves a much lower attack hit rate (97% lower) while perturbing significantly
fewer nets when obfuscating using a via perturbation scheme, compared to prior work.
When imposing the same wirelength limit using a wire lifting scheme, ObfusX performs
significantly better in performance metrics (e.g., 2.4 times more reduction on average
in percentage of netlist recovery).
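As a generic illustration of SHAP-based analysis of an attack model (a sketch assuming the open-source shap package and a toy tree-based stand-in model; the paper's actual attack model and layout features are not reproduced):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for an ML split-manufacturing attack model: X holds per-candidate
# layout features, y is the model's "connection likelihood" target.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))                 # e.g., pin distance, direction, layer, ...
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(float)

attack_model = RandomForestRegressor(n_estimators=50).fit(X, y)

# SHAP values explain, per candidate connection, how much each layout feature
# pushes the model toward revealing it.
explainer = shap.TreeExplainer(attack_model)
shap_values = explainer.shap_values(X)        # shape: (n_candidates, n_features)

# Candidates whose predictions lean most on a few dominant features are the
# best obfuscation targets (perturb the via / lift the wire accordingly).
impact = np.abs(shap_values).sum(axis=1)
print(np.argsort(impact)[-5:])                      # five most exposed candidates
print(np.argmax(np.abs(shap_values).mean(axis=0)))  # globally dominant feature
```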
Breaking Analog Biasing Locking Techniques via Re-Synthesis
We demonstrate an attack to break all analog circuit locking techniques that act upon
the biasing of the circuit. The attack is based on re-synthesizing the biasing circuits
and requires only the use of an optimization algorithm. It is generally applicable
to any analog circuit class. For the attacker the method requires no in-depth understanding
or analysis of the circuit. The attack is demonstrated on a bias-locked Low-Dropout
(LDO) regulator. As the underlying optimization algorithm we employ a Genetic Algorithm
(GA).
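A minimal sketch of the kind of optimization loop involved, i.e., a generic genetic algorithm over bias parameters with a hypothetical fitness function standing in for simulator feedback (not the authors' exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(bias):
    """Hypothetical measure of how well the re-synthesized biasing restores the
    circuit's nominal operating point (higher is better); a real attack would
    query a circuit simulator here."""
    target = np.array([0.6, 1.1, 0.45])          # assumed nominal bias voltages
    return -np.sum((bias - target) ** 2)

pop = rng.uniform(0.0, 1.8, size=(40, 3))         # 40 candidate bias settings
for generation in range(100):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-20:]]        # keep the fittest half
    # Crossover: mix two random parents; mutation: small Gaussian perturbation.
    idx = rng.integers(0, 20, size=(40, 2))
    mask = rng.random((40, 3)) < 0.5
    children = np.where(mask, parents[idx[:, 0]], parents[idx[:, 1]])
    pop = children + rng.normal(scale=0.02, size=children.shape)

print(pop[np.argmax([fitness(ind) for ind in pop])])   # best recovered biasing
```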
SESSION: 6E: Efficient Solutions for Emerging Technologies
Energy and QoS-Aware Dynamic Reliability Management of IoT Edge Computing Systems
Internet of Things (IoT) systems, like any electronic or mechanical system, are
prone to failures. Hard failures in hardware due to aging and degradation are particularly
important since they are irrecoverable, requiring maintenance for the replacement
of defective parts, at high costs. In this paper, we propose a novel dynamic reliability
management (DRM) technique for IoT edge computing systems to satisfy the Quality of
Service (QoS) and reliability requirements while maximizing the remaining energy of
the edge device batteries. We formulate a state-space optimal control problem with
a battery energy objective, QoS, and terminal reliability constraints. We decompose
the problem into low-overhead subproblems and solve it employing a hierarchical and
multi-timescale control approach, distributed over the edge devices and the gateway.
Our results, based on real measurements and trace-driven simulation, demonstrate that
the proposed scheme can achieve a similar battery lifetime compared to the state-of-the-art
approaches while satisfying reliability requirements, where other approaches fail
to do so.
Light: A Scalable and Efficient Wavelength-Routed Optical Networks-On-Chip Topology
Wavelength-routed optical networks-on-chip (WRONoCs) are known for delivering collision-
and arbitration-free on-chip communication in many-core systems. While appealing
for low latency and high predictability, WRONoCs are challenged by scalability concerns
for two reasons: (1) State-of-the-art WRONoC topologies use a large number of microring
resonators (MRRs), which results in considerable MRR tuning power and crosstalk noise. (2) The
positions of master and slave nodes in current topologies do not match realistic layout
constraints. Thus, many additional waveguide crossings will be introduced during physical
implementation, which degrades the network performance. In this work, we propose an
N x (N - 1) WRONoC topology, Light, with a 4 x 3 router, Hash, as the basic building
block, and a simple but efficient approach to configure the resonant wavelength for
each MRR. Experimental results show that Light outperforms state-of-the-art topologies
in terms of enhancing signal-to-noise ratio (SNR) and reducing insertion loss, especially
for large-scale networks. Furthermore, Light can be easily implemented onto a physical
plane without causing external waveguide crossings.
One-pass Synthesis for Field-coupled Nanocomputing Technologies
Field-coupled Nanocomputing (FCN) is a class of post-CMOS emerging technologies, which
promises to overcome certain physical limitations of conventional solutions such as
CMOS by allowing for high computational throughput with low power dissipation. Despite
their promises, the design of corresponding FCN circuits is still in its infancy.
In fact, state-of-the-art solutions still heavily rely on conventional synthesis approaches
that do not take the tight physical constraints of FCN circuits (particularly with
respect to routability and clocking) into account. Instead, physical design is conducted
in a second step in which a classical logic network is mapped onto an FCN layout.
Using this two-stage approach with a classical, FCN-oblivious logic network as
an intermediate result frequently leads to substantial quality loss or completely
impractical results. In this work, we propose a one-pass synthesis scheme for FCN
circuits, which conducts both steps, synthesis and physical design, in a single run.
For the first time, this allows generating exact, i.e., minimal, FCN circuits for
a given functionality.
SESSION: 7A: Platform-Specific Neural Network Acceleration
Real-Time Mobile Acceleration of DNNs: From Computer Vision to Medical Applications
With the growth of mobile vision applications, there is a growing need to break through
the current performance limitation of mobile platforms, especially for computationally
intensive applications, such as object detection, action recognition, and medical
diagnosis. To achieve this goal, we present our unified real-time mobile DNN inference
acceleration framework, seamlessly integrating hardware-friendly, structured model
compression with mobile-targeted compiler optimizations. We aim at an unprecedented,
real-time performance of such large-scale neural network inference on mobile devices.
A fine-grained block-based pruning scheme is proposed to be universally applicable
to all types of DNN layers, such as convolutional layers with different kernel sizes
and fully connected layers. Moreover, it is also successfully extended to 3D convolutions.
With the assistance of our compiler optimizations, the fine-grained block-based sparsity
is fully utilized to achieve high model accuracy and high hardware acceleration simultaneously.
To validate our framework, three representative fields of applications are implemented
and demonstrated: object detection, activity detection, and medical diagnosis. All
applications achieve real-time inference using an off-the-shelf smartphone, outperforming
the representative mobile DNN inference acceleration frameworks by up to 6.7x in speed.
The demonstrations of these applications can be found in the following link: https://bit.ly/39lWpYu.
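A rough sketch of what block-based weight pruning looks like in general (NumPy, with an assumed block size and pruning ratio; not the paper's exact scheme or compiler support):

```python
import numpy as np

def block_prune(weights, block=(4, 1), ratio=0.5):
    """Zero out the fraction `ratio` of weight blocks with the smallest L2 norm.

    Pruning whole blocks (rather than individual weights) keeps the sparsity
    pattern regular enough for compiler/hardware acceleration.
    """
    h, w = weights.shape
    bh, bw = block
    blocks = weights.reshape(h // bh, bh, w // bw, bw)
    norms = np.sqrt((blocks ** 2).sum(axis=(1, 3)))         # per-block L2 norm
    threshold = np.quantile(norms, ratio)
    mask = (norms >= threshold)[:, None, :, None]           # keep strong blocks
    return (blocks * mask).reshape(h, w)

layer = np.random.randn(8, 16).astype(np.float32)
pruned = block_prune(layer, block=(4, 1), ratio=0.5)
print(np.count_nonzero(pruned) / pruned.size)               # roughly 0.5 density
```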
Dynamic Neural Network to Enable Run-Time Trade-off between Accuracy and Latency
To deploy powerful deep neural network (DNN) into smart, but resource limited IoT
devices, many prior works have been proposed to compress DNN to reduce the network
size and computation complexity with negligible accuracy degradation, such as weight
quantization, network pruning, convolution decomposition, etc. However, by utilizing
conventional DNN compression methods, a smaller, but fixed, network is generated from
a relatively large background model to achieve hardware acceleration on resource-limited
devices. Such optimization, however, lacks the ability to adjust its structure in real
time to adapt to dynamic computing hardware resource allocation and workloads. In this
paper, we mainly review our two prior works [13, 15] to tackle this challenge, discussing
how to construct a dynamic DNN by means of either uniform or non-uniform sub-nets
generation methods. Moreover, to generate multiple nonuniform sub-nets, [15] needs
to fully retrain the background model for each sub-net individually, referred to as the
multi-path method. To reduce the training cost, in this work we further propose a single-path
sub-nets generation method that can sample multiple sub-nets in different epochs within
one training round. The constructed dynamic DNN, consisting of multiple sub-nets,
provides the ability to trade off inference accuracy and latency at run time according
to hardware resources and environment requirements. In the end, we study the dynamic
DNNs with different sub-net generation methods on both the CIFAR-10 and ImageNet datasets.
We also present the run-time tuning of accuracy and latency on both GPU and CPU.
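A minimal sketch of the single-path idea of sampling a different sub-net in each epoch of one training round (NumPy, with an assumed width-multiplier scheme; the paper's actual sub-net generation is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(64, 32))      # weights of one "background" layer
widths = [0.25, 0.5, 0.75, 1.0]               # assumed width multipliers

def forward(x, width):
    """Run the layer using only the first `width` fraction of its output channels."""
    k = int(W.shape[0] * width)
    return np.maximum(x @ W[:k].T, 0.0)        # sliced weights + ReLU

# Single-path training loop: each epoch samples one sub-net instead of retraining
# a separate model per width (the multi-path method).
for epoch in range(8):
    width = widths[epoch % len(widths)]        # or rng.choice(widths)
    x = rng.normal(size=(16, 32))              # a stand-in mini-batch
    y = forward(x, width)
    # ... compute the loss on y and update W here ...
    print(f"epoch {epoch}: trained sub-net with width {width}, out dim {y.shape[1]}")
```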
When Machine Learning Meets Quantum Computers: A Case Study
Along with the development of AI democratization, the machine learning approach, in
particular neural networks, has been applied to wide-range applications. In different
application scenarios, the neural network will be accelerated on the tailored computing
platform. The acceleration of neural networks on classical computing platforms, such
as CPU, GPU, FPGA, ASIC, has been widely studied; however, when the scale of the application
consistently grows up, the memory bottleneck becomes obvious, widely known as memory-wall.
In response to such a challenge, advanced quantum computing, which can represent 2^N
states with N quantum bits (qubits), is regarded as a promising solution. It is therefore
imperative to know how to design quantum circuits for accelerating neural networks. Most recently,
there are initial works studying how to map neural networks to actual quantum processors.
To better understand the state-of-the-art design and inspire new design methodology,
this paper carries out a case study to demonstrate an end-to-end implementation. On
the neural network side, we employ the multilayer perceptron to complete image classification
tasks using the standard and widely used MNIST dataset. On the quantum computing side,
we target IBM Quantum processors, which can be programmed and simulated by using IBM
Qiskit. This work targets the acceleration of the inference phase of a trained neural
network on the quantum processor. Along with the case study, we will demonstrate the
typical procedure for mapping neural networks to quantum circuits.
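To illustrate the exponential state space that motivates this direction, a generic NumPy sketch of amplitude encoding is shown below (independent of the Qiskit implementation discussed in the paper): a normalized vector of 2^N values is carried by the amplitudes of just N qubits.

```python
import numpy as np

# A 2^N-element data vector (e.g., a flattened 16x16 = 256-pixel image patch)
# can be amplitude-encoded into the state of only N = 8 qubits.
pixels = np.random.rand(256)
state = pixels / np.linalg.norm(pixels)        # amplitudes must be unit-norm
n_qubits = int(np.log2(state.size))
print(n_qubits)                                 # 8

# A layer of the quantum "MLP" then acts as a unitary on this state.
# Here we emulate it classically with a random unitary (QR decomposition).
q, _ = np.linalg.qr(np.random.randn(state.size, state.size))
output_state = q @ state
print(np.isclose(np.linalg.norm(output_state), 1.0))   # the norm is preserved
```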
Improving Efficiency in Neural Network Accelerator using Operands Hamming Distance
Optimization
Neural network accelerator is a key enabler for the on-device AI inference, for which
energy efficiency is an important metric. The datapath energy, including the computation
energy and the data movement energy among the arithmetic units, claims a significant
part of the total accelerator energy. By revisiting the basic physics of the arithmetic
logic circuits, we show that the datapath energy is highly correlated with the bit
flips when streaming the input operands into the arithmetic units, defined as the
Hamming distance (HD) of the input operand matrices. Based on this insight, we propose
a post-training optimization algorithm and an HD-aware training algorithm to co-design
and co-optimize the accelerator and the network synergistically. The experimental
results based on post-layout simulation with MobileNetV2 demonstrate on average 2.85x
datapath energy reduction and up to 8.51x datapath energy reduction for certain layers.
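For reference, the quantity being minimized is simply the number of bit flips between consecutive operands streamed on the datapath; a small sketch of measuring it, together with a purely illustrative greedy reordering heuristic:

```python
import numpy as np

def hamming_cost(stream):
    """Total bit flips seen on a bus when uint8 operands are streamed in order."""
    s = np.asarray(stream, dtype=np.uint8)
    flips = np.bitwise_xor(s[1:], s[:-1])                 # toggled bits per step
    return int(np.unpackbits(flips).sum())

operands = np.array([0b00001111, 0b11110000, 0b00001110, 0b11110001], dtype=np.uint8)
print(hamming_cost(operands))        # original streaming order

# Illustrative greedy reordering: always pick the next operand closest in
# Hamming distance to the previous one.
order = [0]
remaining = set(range(1, len(operands)))
while remaining:
    prev = int(operands[order[-1]])
    nxt = min(remaining, key=lambda i: bin(prev ^ int(operands[i])).count("1"))
    order.append(nxt)
    remaining.remove(nxt)
print(hamming_cost(operands[order]))  # the reordered stream toggles fewer bits
```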
Lightweight Run-Time Working Memory Compression for Deployment of Deep Neural Networks
on Resource-Constrained MCUs
This work aims to achieve intelligence on embedded devices by deploying deep neural
networks (DNNs) onto resource-constrained microcontroller units (MCUs). Apart from
the low frequency (e.g., 1-16 MHz) and limited storage (e.g., 16KB to 256KB ROM),
one of the largest challenges is the limited RAM (e.g., 2KB to 64KB), which is needed
to save the intermediate feature maps of a DNN. Most existing neural network compression
algorithms aim to reduce the model size of DNNs so that they can fit into limited
storage. However, they do not reduce the size of intermediate feature maps significantly,
which is referred to as working memory and might exceed the capacity of RAM. Therefore,
it is possible that DNNs cannot run in MCUs even after compression. To address this
problem, this work proposes a technique to dynamically prune the activation values
of the intermediate output feature maps at run time to ensure that they can fit
into limited RAM. The results of our experiments show that this method could significantly
reduce the working memory of DNNs to satisfy the hard constraint of RAM size, while
maintaining satisfactory accuracy with relatively low overhead on memory and run-time
latency.
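A rough sketch of run-time activation pruning under a hard RAM budget (NumPy, with an assumed top-k-by-magnitude rule and a hypothetical budget; the paper's actual pruning policy is not reproduced):

```python
import numpy as np

def prune_to_budget(feature_map, ram_budget_bytes, bytes_per_value=1):
    """Keep only the largest-magnitude activations so that the non-zero values
    (stored sparsely as value + index) fit within the MCU's RAM budget."""
    max_values = ram_budget_bytes // (2 * bytes_per_value)
    flat = feature_map.ravel()
    if flat.size <= max_values:
        return feature_map
    keep = np.argpartition(np.abs(flat), -max_values)[-max_values:]
    pruned = np.zeros_like(flat)
    pruned[keep] = flat[keep]
    return pruned.reshape(feature_map.shape)

fmap = np.random.randn(16, 14, 14).astype(np.float32)       # intermediate output
pruned = prune_to_budget(fmap, ram_budget_bytes=2048, bytes_per_value=1)
print(np.count_nonzero(pruned), "activations kept out of", fmap.size)
```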
SESSION: 7B: Toward Energy-Efficient Embedded Systems
EHDSktch: A Generic Low Power Architecture for Sketching in Energy Harvesting Devices
Energy harvesting devices (EHDs) are becoming extremely prevalent in remote and hazardous
environments. They sense the ambient parameters and compute some statistics on them,
which are then sent to a remote server. Due to the resource-constrained nature of
EHDs, it is challenging to perform exact computations on streaming data; however,
if we are willing to tolerate a slight amount of inaccuracy, we can leverage the power
of sketching algorithms to provide quick answers with significantly lower energy consumption.
In this paper, we propose a novel hardware architecture called EHDSktch — a set of
IP blocks that can be used to implement most of the popular sketching algorithms.
We demonstrate an energy savings of 4-10X and a speedup of more than 10X over state-of-the-art
software implementations. Leveraging the temporal locality further provides us a performance
gain of 3-20% in energy and time and reduces the on-chip memory requirement by at
least 50-75%.
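As background, sketching algorithms of the kind such IP blocks implement trade a small, bounded error for drastically lower memory; the count-min sketch below is one popular example (a generic software sketch, not the EHDSktch hardware):

```python
import numpy as np

class CountMinSketch:
    """Approximate frequency counts of a stream in O(width * depth) memory."""
    def __init__(self, width=64, depth=4, seed=0):
        self.table = np.zeros((depth, width), dtype=np.uint32)
        rng = np.random.default_rng(seed)
        self.salts = rng.integers(1, 2**31, size=depth)

    def _rows(self, item):
        for d, salt in enumerate(self.salts):
            yield d, hash((int(salt), item)) % self.table.shape[1]

    def add(self, item):
        for d, col in self._rows(item):
            self.table[d, col] += 1

    def estimate(self, item):
        # Never under-estimates; hash collisions can only inflate the count.
        return min(self.table[d, col] for d, col in self._rows(item))

cms = CountMinSketch()
for reading in [21, 21, 25, 21, 30, 25]:       # e.g., quantized sensor samples
    cms.add(reading)
print(cms.estimate(21), cms.estimate(30))       # approximately 3 and 1
```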
Energy-Aware Design Methodology for Myocardial Infarction Detection on Low-Power Wearable
Devices
Myocardial Infarction (MI) is a heart disease that damages the heart muscle and requires
immediate treatment. Its silent and recurrent nature necessitates real-time continuous
monitoring of patients. Nowadays, wearable devices are smart enough to perform on-device
processing of heartbeat segments and report any irregularities in them. However, the
small form factor of wearable devices imposes resource constraints and requires energy-efficient
solutions to satisfy them. In this paper, we propose a design methodology to automate
the design space exploration of neural network architectures for MI detection. This
methodology incorporates Neural Architecture Search (NAS) using Multi-Objective Bayesian
Optimization (MOBO) to render Pareto optimal architectural models. These models minimize
both detection error and energy consumption on the target device. The design space
is inspired by Binary Convolutional Neural Networks (BCNNs) suited for mobile health
applications with limited resources. The models’ performance is validated using the
PTB diagnostic ECG database from PhysioNet. Moreover, energy-related measurements
are directly obtained from the target device in a typical hardware-in-the-loop fashion.
Finally, we benchmark our models against other related works. One model exceeds state-of-the-art
accuracy on wearable devices (reaching 91.22%), whereas others trade off some accuracy
to reduce their energy consumption (by a factor reaching 8.26x).
Power-Efficient Layer Mapping for CNNs on Integrated CPU and GPU Platforms: A Case Study
Heterogeneous MPSoCs consisting of integrated CPUs and GPUs are suitable platforms
for embedded applications running on handheld devices such as smart phones. As the
handheld devices are mostly powered by battery, the integrated CPU and GPU MPSoC is
usually designed with an emphasis on low-power rather than performance. In this paper,
we are interested in exploring a power-efficient layer mapping of convolution neural
networks (CNNs) deployed on integrated CPU and GPU platforms. Specifically, we investigate
the impact of layer mapping of YoloV3-Tiny (i.e., a widely-used CNN in both industry
and academia) on system power consumption through numerous experiments on NVIDIA board
Jetson TX2. The experimental results indicate that 1) almost all of the convolution
layers are not suitable for mapping to CPU, 2) the pooling layer can be mapped to
CPU for reducing power consumption, but the mapping may lead to a decrease in inference
speed when the layer’s output tensor size is large, 3) the detection layer can be
mapped to CPU as long as its floating-point operation scale is not too large, and
4) the channel and upsampling layers are both suitable for mapping to CPU. These observations
obtained in this study can be further utilized to guide the design of power-efficient
layer mapping strategies for integrated CPU and GPU platforms.
A Write-friendly Arithmetic Coding Scheme for Achieving Energy-Efficient Non-Volatile
Memory Systems
In the era of the Internet of Things (IoT), wearable IoT devices have become popular and
closely integrated into our lives. Most of these devices are based on embedded systems
that have to operate on limited energy resources, such as batteries or energy harvesters.
Therefore, energy efficiency is one of the critical issues for these devices. To relieve
the energy consumption by reducing the total number of accesses to the memory and storage
layers, storage-class memory (SCM) and data compression techniques are applied to
eliminate data movements and squeeze the data size, respectively. However, the information
gap between them hinders cooperation between the two techniques in achieving further
optimization of energy consumption. This work proposes a write-friendly arithmetic
coding scheme that jointly manages both techniques to achieve
energy-efficient non-volatile memory (NVM) systems. In particular, the concept of
“ignorable bits” is introduced to further skip the write operations while storing
the compressed data into SCM devices. The proposed design was evaluated by a series
of intensive experiments, and the results are encouraging.
SESSION: 7C: Software and System Support for Nonvolatile Memory
DP-Sim: A Full-stack Simulation Infrastructure for Digital Processing In-Memory Architectures
Digital processing in-memory (DPIM) is a promising technology that significantly reduces
data movements while providing high parallelism. In this work, we design and implement
the first full-stack DPIM simulation infrastructure, DP-Sim, which evaluates a comprehensive
range of DPIM-specific design space concerning both software and hardware. DP-Sim
provides a C++ library to enable DPIM acceleration in general programs while supporting
several aspects of software-level exploration by a convenient interface. The DP-Sim
software front-end generates specialized instructions that can be processed by a hardware
simulator based on a new DPIM-enabled architecture model which is 10.3% faster than
conventional memory simulation models. We use DP-Sim to explore the DPIM-specific
design space of acceleration for various emerging applications. Our experiments show
that bank-level control is 11.3x faster than conventional channel-level control because
of higher computing parallelism. Furthermore, cost-aware memory allocation can provide
at least 2.2x speedup vs. heuristic methods, showing the importance of data layout
in DPIM acceleration.
SAC: A Stream Aware Write Cache Scheme for Multi-Streamed Solid State Drives
This work finds that state-of-the-art multi-streamed SSDs are used inefficiently
due to two issues. First, the write cache inside SSDs is not aware of data from different
streams, which induces conflicts among streams. Second, current stream identification
methods are not accurate and should be optimized inside SSDs. This work proposes
a novel write cache scheme to efficiently utilize and optimize the multiple streams.
First, an inter-stream-aware cache partitioning scheme is proposed to manage the data
from different streams. Second, an intra-stream-based active cache eviction scheme
is proposed to preferentially evict data to blocks with more invalid pages. Experimental
results show that the proposed scheme significantly reduces the write amplification
(WAF) of multi-streamed SSDs by up to 28% with negligible cost.
Providing Plug N’ Play for Processing-in-Memory Accelerators
Although Processing-in-Memory (PIM) emerged as a solution to avoid unnecessary and
expensive data movements to/from host and accelerators, their widespread usage is
still difficult, given that to effectively use a PIM device, huge and costly modifications
must be made on the host processor side to allow instruction offloading, cache coherence,
virtual memory management, and communication between different PIM instances. The
present work addresses these challenges by presenting non-invasive solutions for those
requirements. We demonstrate that, at compile-time, and without any host modifications
or programmer intervention, it is possible to exploit already available resources
to allow efficient host and PIM communication and task partitioning, without disturbing
either the host or the memory hierarchy. We present Plug&PIM, a plug n’ play strategy for
PIM adoption with minimal performance penalties.
Aging-Aware Request Scheduling for Non-Volatile Main Memory
Modern computing systems are embracing non-volatile memory (NVM) to implement high-capacity
and low-cost main memory. Elevated operating voltages of NVM accelerate the aging
of CMOS transistors in the peripheral circuitry of each memory bank. Aggressive device
scaling increases power density and temperature, which further accelerates aging,
challenging the reliable operation of NVM-based main memory. We propose HEBE, an architectural
technique to mitigate the circuit aging-related problems of NVM-based main memory.
HEBE is built on three contributions. First, we propose a new analytical model that
can dynamically track the aging in the peripheral circuitry of each memory bank based
on the bank’s utilization. Second, we develop an intelligent memory request scheduler
that exploits this aging model at run time to de-stress the peripheral circuitry of
a memory bank only when its aging exceeds a critical threshold. Third, we introduce
an isolation transistor to decouple parts of a peripheral circuit operating at different
voltages, allowing the decoupled logic blocks to undergo long-latency de-stress operations
independently and off the critical path of memory read and write accesses, improving
performance. We evaluate HEBE with workloads from the SPEC CPU2017 Benchmark suite.
Our results show that HEBE significantly improves both performance and lifetime of
NVM-based main memory.
SESSION: 7D: Learning-Driven VLSI Layout Automation Techniques
Placement for Wafer-Scale Deep Learning Accelerator
To meet the growing demand from deep learning applications for computing resources,
ASIC accelerators are necessary. A wafer-scale engine (WSE) has recently been proposed
[1], which is able to simultaneously accelerate multiple layers from a neural network
(NN). However, without a high-quality placement that properly maps NN layers onto
the WSE, the acceleration efficiency cannot be achieved. Here, the WSE placement resembles
the traditional ASIC floor plan problem of placing blocks onto a chip region, but
they are fundamentally different. Since the slowest layer determines the compute time
of the whole NN on WSE, a layer with a heavier workload needs more computing resources.
Besides, the locations of layers and the protocol adapter cost of internal I/O connections
will influence the inter-layer communication overhead. In this paper, we propose GigaPlacer
to handle this new challenge. A binary-search-based framework is developed to obtain
a minimum compute time of the NN. Two dynamic-programming-based algorithms with different
optimizing strategies are integrated to produce legal placement. The distance and
adapter cost between connected layers will be further minimized by some refinements.
Compared with the first place of the ISPD2020 Contest, GigaPlacer reduces the contest
metric by up to 6.89% and on average 2.09%, while running 7.23X faster.
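A minimal sketch of the binary-search framing described above, with a hypothetical feasibility check standing in for the dynamic-programming placer:

```python
def min_compute_time(layers, fits, lo=1, hi=10**6, eps=1):
    """Binary-search the smallest per-layer compute-time bound T such that every
    NN layer can be given enough WSE resources to finish within T.

    `fits(layers, T)` is a placeholder for the placement feasibility check
    (e.g., a dynamic-programming placer returning True when a legal placement
    exists under the bound T)."""
    while hi - lo > eps:
        mid = (lo + hi) // 2
        if fits(layers, mid):
            hi = mid           # feasible: try a tighter bound
        else:
            lo = mid           # infeasible: relax the bound
    return hi

# Toy usage: a layer with workload w needs area >= ceil(w / T); area is limited.
layers = [120, 300, 75, 510]                       # hypothetical workloads
total_area = 100
fits = lambda ls, T: sum(-(-w // T) for w in ls) <= total_area
print(min_compute_time(layers, fits))              # 11 for this toy instance
```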
Net2: A Graph Attention Network Method Customized for Pre-Placement Net Length Estimation
Net length is a key proxy metric for optimizing timing and power across various stages
of a standard digital design flow. However, the bulk of net length information is
not available until cell placement, and hence it is a significant challenge to explicitly
consider net length optimization in design stages prior to placement, such as logic
synthesis. This work addresses this challenge by proposing a graph attention network
method with customization, called Net2, to estimate individual net length before cell
placement. Its accuracy-oriented version Net2a achieves about 15% better accuracy
than several previous works in identifying both long nets and long critical paths.
Its fast version Net2f is more than 1000x faster than placement while still outperforming
previous works and other neural network techniques in terms of various accuracy metrics.
Machine Learning-based Structural Pre-route Insertability Prediction and Improvement
with Guided Backpropagation
With the development of semiconductor technology nodes, the sizes of standard cells
become smaller and the number of standard cells increases dramatically to bring more
functionality into integrated circuits (ICs). However, the shrinking of standard
cell sizes causes many problems for ICs, such as timing, power, and electromigration
(EM). To tackle these problems, a new style of structural pre-route (SPR) has been proposed.
This type of pre-route is composed of redundant parallel metals and vias so that the
low resistance and the redundant sub-structures can improve timing and yield. But
the large area overhead becomes the major problem of inserting such pre-routes all
over a design. In this paper, we propose a machine learning-based approach to predict
the insertability of SPRs for placed designs. In addition, we apply a pattern visualization
method using a guided backpropagation technique to look inside our model and
identify the problematic layout features causing SPR insertion failures. The experimental
results not only show the excellent performance of our model, but also show that avoiding
generating the identified critical features during legalization can improve SPR insertability
compared to a commercial SPR-aware placement tool.
Standard Cell Routing with Reinforcement Learning and Genetic Algorithm in Advanced
Technology Nodes
Standard cell layout in advanced technology nodes is done manually in the industry
today. Automating the standard cell layout process, in particular the routing step, is
challenging because of the enormous number of design rule constraints. In this paper we
propose a machine learning-based approach that applies a genetic algorithm to create
initial routing candidates and uses reinforcement learning (RL) to fix the design
rule violations incrementally. A design rule checker feeds the violations back to the
RL agent and the agent learns how to fix them based on the data. This approach is
also applicable to future technology nodes with unseen design rules. We demonstrate
the effectiveness of this approach on a number of standard cells. We have shown that
it can route a cell which is deemed unroutable manually, reducing the cell size by
11%.
SESSION: 7E: DNN-Based Physical Analysis and DNN Accelerator Design
Thermal and IR Drop Analysis Using Convolutional Encoder-Decoder Networks
Computationally expensive temperature and power grid analyses are required during
the design cycle to guide IC design. This paper employs encoder-decoder based generative
(EDGe) networks to map these analyses to fast and accurate image-to-image and sequence-to-sequence
translation tasks. The network takes a power map as input and outputs the temperature
or IR drop map. We propose two networks: (i) ThermEDGe: a static and dynamic full-chip
temperature estimator and (ii) IREDGe: a full-chip static IR drop predictor based
on input power, power grid distribution, and power pad distribution patterns. The
models are design-independent and must be trained just once for a particular technology
and packaging solution. ThermEDGe and IREDGe are demonstrated to rapidly predict on-chip
temperature and IR drop contours in milliseconds (in contrast with commercial tools
that require several hours or more) and provide an average error of 0.6% and 0.008%
respectively.
GRA-LPO: Graph Convolution Based Leakage Power Optimization
Static power consumption is a critical challenge for IC designs, particularly for
mobile and IoT applications. A final post-layout step in modern design flows involves
a leakage recovery step that is embedded in signoff static timing analysis tools.
The goal of such recovery is to make use of the positive slack (if any) and recover
the leakage power by performing cell swaps with footprint compatible variants. Though
such swaps result in unaltered routing, the hard constraint is not to introduce any
new timing violations. This process can require up to tens of hours of runtime, just
before the tapeout, when schedule and resource constraints are tightest. The physical
design teams can benefit greatly from a fast predictor of the leakage recovery step:
if the eventual recovery will be too small, the entire step can be skipped, and the
resources can be allocated elsewhere. If we represent the circuit netlist as a graph
with cells as vertices and nets connecting these cells as edges, the leakage recovery
step is an optimization step, on this graph. If we can learn these optimizations over
several graphs with various logic-cone structures, we can generalize the learning
to unseen graphs. Using graph convolution neural networks, we develop a learning-based
model, that predicts per-cell recoverable slack, and translate these slack values
to equivalent power savings. For designs up to 1.6M instances, our inference step
takes less than 12 seconds on a Tesla P100 GPU, with the additional feature extraction
and post-processing steps consuming 420 seconds. The model is accurate, with a relative
error under 6.2% in the design-specific context.
DEF: Differential Encoding of Featuremaps for Low Power Convolutional Neural Network Accelerators
As the need for the deployment of Deep Learning applications on edge-based devices
becomes ever increasingly prominent, power consumption starts to become a limiting
factor on the performance that can be achieved by the computational platforms. A significant
source of power consumption for these edge-based machine learning accelerators is
off-chip memory transactions. In the case of Convolutional Neural Network (CNN) workloads,
a predominant workload in deep learning applications, those memory transactions are
typically attributed to the store and recall of feature-maps. There is therefore a
need to explicitly reduce the power dissipation of these transactions whilst minimising
any overheads needed to do so. In this work, a Differential Encoding of Feature-maps
(DEF) scheme is proposed, which aims at minimising activity on the memory data bus,
specifically for CNN workloads. The coding scheme uses domain-specific knowledge,
exploiting statistics of feature-maps alongside knowledge of the data types commonly
used in machine learning accelerators as a means of reducing power consumption. DEF
is able to outperform recent state-of-the-art coding schemes, with significantly
less overhead, achieving up to 50% reduction of activity across a number of modern
CNNs.
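For intuition, differential encodings of this kind transmit the change between successive feature-map words so that smoothly varying or sparse values toggle few bus lines; a generic sketch of measuring the effect (not DEF's exact codec):

```python
import numpy as np

def bus_toggles(words):
    """Bit flips on the memory data bus when `words` (uint8) are sent in order."""
    w = np.asarray(words, dtype=np.uint8)
    return int(np.unpackbits(np.bitwise_xor(w[1:], w[:-1])).sum())

# A smoothly varying row of quantized activations (an illustrative stand-in for
# a feature-map: real maps are not ramps, but are often smooth and/or sparse).
row = np.arange(100, 164, dtype=np.uint8)

# Differential stream: first value as-is, then the change w.r.t. the previous
# value. Small changes keep the transmitted words, and hence the toggles, small.
diff = np.empty_like(row)
diff[0] = row[0]
diff[1:] = row[1:] - row[:-1]          # uint8 wrap-around acts as mod-256

print("raw stream toggles:         ", bus_toggles(row))
print("differential stream toggles:", bus_toggles(diff))
```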
Temperature-Aware Optimization of Monolithic 3D Deep Neural Network Accelerators
We propose an automated method to facilitate the design of energy-efficient Mono3D
DNN accelerators with safe on-chip temperatures for mobile systems. We introduce an
optimizer to investigate the effect of different aspect ratios and footprint specifications
of the chip, and select energy-efficient accelerators under user-specified thermal
and performance constraints. We also demonstrate that using our optimizer, we can
reduce energy consumption by 1.6x and area by 2x with a maximum of 9.5% increase in
latency compared to a Mono3D DNN accelerator optimized only for performance.
SESSION: 8B: Embedded Neural Networks and File Systems
Gravity: An Artificial Neural Network Compiler for Embedded Applications
This paper introduces the Gravity compiler. Gravity is an open source optimizing Artificial
Neural Network (ANN) to ANSI C compiler with two unique design features that make
it ideal for use in resource constrained embedded systems: (1) the generated ANSI
C code is self-contained and devoid of any library or platform dependencies and (2)
the generated ANSI C code is optimized for maximum performance and minimum memory
usage. Moreover, Gravity is constructed as a modern compiler consisting of an intuitive
input language, an expressive Intermediate Representation (IR), a mapping to a Fictitious
Instruction Set Machine (FISM) and a retargetable backend, making it an ideal research
tool for exploring high-performance embedded software strategies in AI and Deep-Learning
applications. We validate the efficacy of Gravity by solving MNIST handwritten digit
recognition on an embedded device. We measured a 300x reduction in memory, 2.5x
speedup in inference and 33% speedup in training compared to TensorFlow. We also outperformed
TVM, by over 2.4x in inference speed.
A Self-Test Framework for Detecting Fault-induced Accuracy Drop in Neural Network
Accelerators
Hardware accelerators built with SRAM or emerging memory devices are essential to
the accommodation of the ever-increasing Deep Neural Network (DNN) workloads on resource-constrained
devices. After deployment, however, the performance of these accelerators is threatened
by the faults in their on-chip and off-chip memories where millions of DNN weights
are held. Different types of faults may exist depending on the underlying memory technology,
degrading inference accuracy. To tackle this challenge, this paper proposes an online
self-test framework that monitors the accuracy of the accelerator with a small set
of test images selected from the test dataset. Upon detecting a noticeable level of
accuracy drop, the framework uses additional test images to identify the corresponding
fault type and predict the severity of the faults by analyzing the change in the ranking
of the test images. Experimental results show that our method can quickly detect the
fault status of a DNN accelerator and provide accurate fault type and fault severity
information, allowing for a subsequent recovery and self-healing process.
Facilitating the Efficiency of Secure File Data and Metadata Deletion on SMR-based
Ext4 File System
The efficiency of secure deletion is highly dependent on the data layout of underlying
storage devices. In particular, owing to the sequential-write constraint of the emerging
Shingled Magnetic Recording (SMR) technology, an improper data layout could lead to
serious write amplification and hinder the performance of secure deletion. The performance
degradation of secure deletion on SMR drives is further aggravated with the need to
securely erase the file system metadata of deleted files due to the small-size nature
of file system metadata. Such an observation motivates us to propose a secure-deletion
and SMR-aware space allocation (SSSA) strategy to facilitate the process of securely
erasing both the deleted files and their metadata simultaneously. The proposed strategy
is integrated within the widely-used extended file system 4 (ext4) and is evaluated
through a series of experiments to demonstrate the effectiveness of the proposed strategy.
The evaluation results show that the proposed strategy can reduce the secure deletion
latency by 91.3% on average when compared with the naive SMR-based ext4 file system.
SESSION: 8C: Design Automation for Future Autonomy
Efficient Computing Platform Design for Autonomous Driving Systems
Autonomous driving is becoming a hot topic in both academic and industrial communities.
Traditional algorithms can hardly handle the complex tasks or meet the high safety
criteria. Recent research on deep learning shows significant performance improvements
over traditional algorithms and is believed to be a strong candidate for autonomous
driving systems. Despite the attractive performance, deep learning does not solve the
problem entirely. The application scenario requires that an autonomous driving system
work in real time to ensure safety. But the high computational complexity of neural
network models, together with complicated pre-processing and post-processing, brings great
challenges. System designers need to do dedicated optimizations to make a practical
computing platform for autonomous driving. In this paper, we introduce our work on
efficient computing platform design for autonomous driving systems. At the software
level, we introduce neural network compression and hardware-aware architecture search
to reduce the workload. At the hardware level, we propose customized hardware accelerators
for the pre- and post-processing of deep learning algorithms. Finally, we introduce the hardware
platform design, NOVA-30, and our on-vehicle evaluation project.
On Designing Computing Systems for Autonomous Vehicles: a PerceptIn Case Study
PerceptIn develops and commercializes autonomous vehicles for micromobility around
the globe. This paper makes a holistic summary of PerceptIn’s development and operating
experiences. It provides the business tale behind our product, and presents the development
of the computing system for our vehicles. We illustrate the design decision made for
the computing system, and show the advantage of offloading localization workloads
onto an FPGA platform.
Runtime Software Selection for Adaptive Automotive Systems
As automotive systems become more intelligent than ever, they need to handle many
functional tasks, resulting in more and more software programs running in automotive
systems. However, whether a software program should be executed depends on the environmental
conditions (surrounding conditions). For example, a deraining algorithm supporting
object detection and image recognition should only be executed when it is raining.
Supported by the advance of over-the-air (OTA) updates and plug-and-play systems,
adaptive automotive systems, where the software programs are updated, activated, and
deactivated before driving and during driving, can be realized. In this paper, we
consider the upcoming environmental conditions of an automotive system and target
the corresponding software selection problem during runtime. We formulate the problem
as a set cover problem with timing constraints and then propose a heuristic approach
to solve the problem. The approach is efficient enough to be applied during
runtime, and it is a preliminary step towards the broad realization of adaptive automotive
systems.
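As a rough illustration of the set-cover formulation, the classic greedy heuristic on hypothetical conditions and programs is shown below (without the timing constraints of the actual problem):

```python
# Environmental conditions that must be covered before/while driving, and the
# software programs that handle them (hypothetical example).
conditions = {"rain", "night", "fog", "pedestrians"}
programs = {
    "derain_net":     {"rain"},
    "lowlight_net":   {"night", "fog"},
    "ped_detector":   {"pedestrians"},
    "allweather_net": {"rain", "fog"},
}

# Greedy set cover: repeatedly pick the program covering the most still-uncovered
# conditions (the real problem additionally enforces timing constraints).
uncovered, selected = set(conditions), []
while uncovered:
    best = max(programs, key=lambda p: len(programs[p] & uncovered))
    if not programs[best] & uncovered:
        break                       # remaining conditions cannot be covered
    selected.append(best)
    uncovered -= programs[best]

print(selected)                     # programs chosen to cover all conditions
```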
Safety-Assured Design and Adaptation of Learning-Enabled Autonomous Systems
Future autonomous systems will employ sophisticated machine learning techniques for
the sensing and perception of the surroundings and for making corresponding decisions
for planning, control, and other actions. They often operate in highly dynamic, uncertain
and challenging environments, and need to meet stringent timing, resource, and mission
requirements. In particular, it is critical and yet very challenging to ensure the
safety of these autonomous systems, given the uncertainties of the system inputs,
the constant disturbances on the system operations, and the lack of analyzability
for many machine learning methods (particularly those based on neural networks). In
this paper, we will discuss some of these challenges, and present our work in developing
automated, quantitative, and formalized methods and tools for ensuring the safety
of autonomous systems in their design and during their runtime adaptation. We argue
that it is essential to take a holistic approach in addressing system safety and other
safety-related properties, vertically across the functional, software, and hardware
layers, and horizontally across the autonomy pipeline of sensing, perception, planning,
and control modules. This approach could be further extended from a single autonomous
system to a multi-agent system where multiple autonomous agents perform tasks in a
collaborative manner. We will use connected and autonomous vehicles (CAVs) as the
main application domain to illustrate the importance of such a holistic approach and
show our initial efforts in this direction.
SESSION: 8D: Emerging Hardware Verification
System-Level Verification of Linear and Non-Linear Behaviors of RF Amplifiers using
Metamorphic Relations
System-on-Chips (SoC) have imposed new yet stringent design specifications on the
Radio Frequency (RF) subsystems. The Timed Data Flow (TDF) model of computation available
in SystemC-AMS offers a good trade-off between accuracy and simulation speed
at the system-level. However, one of the main challenges in system-level verification
is the availability of reference models traditionally used to verify the correctness
of the Design Under Verification (DUV). Recently, Metamorphic testing (MT) introduced
a new verification perspective in the software domain to alleviate this problem. MT
uncovers bugs just by using and relating test-cases.
In this paper, we present a novel MT-based verification approach to verify the linear
and non-linear behaviors of RF amplifiers at the system-level. The central element
of our MT-approach is a set of Metamorphic Relations (MRs) which describes the relation
of the inputs and outputs of consecutive DUV executions. For the class of Low Noise
Amplifiers (LNAs) we identify 12 high-quality MRs. We demonstrate the effectiveness
of our proposed MT-based verification approach in an extensive set of experiments
on an industrial system-level LNA model without the need for a reference model.
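As a generic illustration of what a metamorphic relation for an amplifier looks like, the sketch below checks a hypothetical linearity MR in NumPy on a toy behavioral model (the paper's twelve MRs for LNAs are not reproduced):

```python
import numpy as np

def lna_model(x, gain=20.0, a3=-0.3):
    """Toy behavioral amplifier: linear gain plus a weak cubic nonlinearity."""
    return gain * x + a3 * x**3

def mr_scaling_holds(stimulus, k=2.0, tol=0.01):
    """MR: in the linear operating region, scaling the input by k should scale
    the output by (approximately) k. Two executions are run and related to
    each other; no reference model is needed."""
    y1 = lna_model(stimulus)
    y2 = lna_model(k * stimulus)
    return np.allclose(y2, k * y1, rtol=tol, atol=1e-9)

t = np.linspace(0, 1e-6, 1000)
small_signal = 1e-3 * np.sin(2 * np.pi * 2.4e9 * t)     # linear region
large_signal = 0.5  * np.sin(2 * np.pi * 2.4e9 * t)     # compression region
print(mr_scaling_holds(small_signal))    # True: the relation holds
print(mr_scaling_holds(large_signal))    # False: nonlinearity violates the MR
```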
Random Stimuli Generation for the Verification of Quantum Circuits
Verification of quantum circuits is essential for guaranteeing correctness of quantum
algorithms and/or quantum descriptions across various levels of abstraction. In this
work, we show that there are promising ways to check the correctness of quantum circuits
using simulative verification and random stimuli. To this end, we investigate how
to properly generate stimuli for efficiently checking the correctness of a quantum
circuit. More precisely, we introduce, illustrate, and analyze three schemes for quantum
stimuli generation—offering a trade-off between the error detection rate (as well
as the required number of stimuli) and efficiency. In contrast to the verification
in the classical realm, we show (both, theoretically and empirically) that even if
only a few randomly-chosen stimuli (generated from the proposed schemes) are considered,
high error detection rates can be achieved for quantum circuits. The results of these
conceptual and theoretical considerations have also been empirically confirmed—with
a grand total of approximately 10^6 simulations conducted across 50 000 benchmark instances.
Exploiting Extended Krylov Subspace for the Reduction of Regular and Singular Circuit
Models
During the past decade, Model Order Reduction (MOR) has become a key enabler for the
efficient simulation of large circuit models. MOR techniques based on moment-matching
are well established due to their simplicity and computational performance in the
reduction process. However, moment-matching methods based on the ordinary Krylov subspace
are usually inadequate to accurately approximate the original circuit behaviour. In
this paper, we present a moment-matching method which is based on the extended Krylov
subspace and exploits the superposition property in order to deal with many terminals.
The proposed method can handle large-scale regular and singular circuits, and generate
accurate and efficient reduced-order models for circuit simulation. Experimental results
on industrial IBM power grid benchmarks demonstrate that our method achieves an error
reduction up to 83.69% over a standard Krylov subspace technique.
SESSION: 8E: Optimization and Mapping Methods for Quantum Technologies
Algebraic and Boolean Optimization Methods for AQFP Superconducting Circuits
Adiabatic quantum-flux-parametron (AQFP) circuits are a family of superconducting
electronic (SCE) circuits that have recently gained growing interest due to their
low energy consumption, and may serve as an alternative technology to overcome the down-scaling
limitations of CMOS. AQFP logic design differs from classic digital design because
logic cells are natively abstracted by the majority function, require data and clocking
in specific timing windows, and have fan-out limitations. We describe here a novel
majority-based logic synthesis flow addressing AQFP technology. In particular, we
present both algebraic and Boolean methods over majority-inverter graphs (MIGs) aiming
at optimizing size and depth of logic circuits. The technology limitations and constraints
of the AQFP technology (e.g., path balancing and maximum fanout) are considered during
optimization. The experimental results show that our flow reduces both size and depth
of MIGs, while meeting the constraints of the AQFP technology. Further, we show an
improvement for both area and delay when the MIGs are mapped into the AQFP technology.
Dynamical Decomposition and Mapping of MPMCT Gates to Nearest Neighbor Architectures
We usually use Mixed-Polarity Multiple-Control Toffoli (MPMCT) gates to realize large
control logic functions for quantum computation. A logic circuit consisting of MPMCT
gates needs to be mapped to a quantum computing device that has some physical limitation;
(1) we need to decompose MPMCT gates into one or two-qubit gates, and then (2) we
need to insert SWAP gates such that all the gates can be performed on Nearest Neighbor
Architectures (NNAs). Up to date, the above two processes have been independently
studied intensively. This paper points out that we can decrease the total number of
the gates in a circuit if the above two processes are considered dynamically as a
single step; we propose a method to inserts SWAP gates while decomposing MPMCT gates
unlike most of the existing methods. Our additional idea is to consider the effect
on the latter part of a circuit carefully by considering the qubit layout when composing
an MPMCT gate. We show some experimental results to confirm the effectiveness of our
method.
Exploiting Quantum Teleportation in Quantum Circuit Mapping
Quantum computers are constantly growing in their number of qubits, but continue to
suffer from restrictions such as the limited pairs of qubits that may interact with
each other. Thus far, this problem is addressed by mapping and moving qubits to suitable
positions for the interaction (known as quantum circuit mapping). However, this movement
requires additional gates to be incorporated into the circuit, whose number should
be kept as small as possible since each gate increases the likelihood of errors and
decoherence. State-of-the-art mapping methods utilize swapping and bridging to move
the qubits along the static paths of the coupling map—solving this problem without
exploiting all means the quantum domain has to offer. In this paper, we propose to
additionally exploit quantum teleportation as a possible complementary method. Quantum
teleportation conceptually allows moving the state of a qubit over arbitrarily long
distances with constant overhead—providing the potential of determining cheaper
mappings. The potential is demonstrated by a case study on the IBM Q Tokyo architecture
which already shows promising improvements. With the emergence of larger quantum computing
architectures, quantum teleportation will become more effective in generating cheaper
mappings.
SESSION: 9B: Emerging System Architectures for Edge-AI
Hardware-Aware NAS Framework with Layer Adaptive Scheduling on Embedded System
Neural Architecture Search (NAS) has been proven to be an effective solution for building
Deep Convolutional Neural Network (DCNN) models automatically. Subsequently, several
hardware-aware NAS frameworks incorporate hardware latency into the search objectives
to avoid the potential risk that the searched network cannot be deployed on target
platforms. However, the mismatch between NAS and hardware persists due to the absence
of rethinking the applicability of the searched network layer characteristics to the
hardware mapping. A convolutional neural network layer can be executed with different
dataflows of the hardware at different performance, since the characteristics of on-chip
data usage vary to fit the parallel structure. This mismatch also results in significant
performance degradation for some maladaptive layers obtained from NAS, which might
achieve a much better latency when the adopted dataflow changes. To address the issue
that the network latency is insufficient to evaluate the deployment efficiency, this
paper proposes a novel hardware-aware NAS framework in consideration of the adaptability
between layers and dataflow patterns. Besides, we develop an optimized layer-adaptive
data scheduling strategy as well as a coarse-grained reconfigurable computing architecture
so as to deploy the searched networks with high power-efficiency by selecting the
most appropriate dataflow pattern layer-by-layer under limited resources. Evaluation
results show that the proposed NAS framework can search for DCNNs with accuracy similar
to state-of-the-art ones as well as low inference latency, and the proposed
architecture provides both power-efficiency improvements and energy consumption savings.
Dataflow-Architecture Co-Design for 2.5D DNN Accelerators using Wireless Network-on-Package
Deep neural network (DNN) models continue to grow in size and complexity, demanding
higher computational power to enable real-time inference. To efficiently deliver such
computational demands, hardware accelerators are being developed and deployed across
scales. This naturally requires an efficient scale-out mechanism for increasing compute
density as required by the application. 2.5D integration over interposer has emerged
as a promising solution, but as we show in this work, the limited interposer bandwidth
and multiple hops in the Network-on-Package (NoP) can diminish the benefits of the
approach. To cope with this challenge, we propose WIENNA, a wireless NoP-based 2.5D
DNN accelerator. In WIENNA, the wireless NoP connects an array of DNN accelerator
chiplets to the global buffer chiplet, providing high-bandwidth multicasting capabilities.
Here, we also identify the dataflow style that most efficiently exploits the wireless
NoP’s high-bandwidth multicasting capability on each layer. With modest area and power
overheads, WIENNA achieves 2.2X-5.1X higher throughput and 38.2% lower energy than
an interposer-based NoP design.
Block-Circulant Neural Network Accelerator Featuring Fine-Grained Frequency-Domain
Quantization and Reconfigurable FFT Modules
Block-circulant based compression is a popular technique to accelerate neural network
inference. Though storage and computing costs can be reduced by transforming weights
into block-circulant matrices, this method incurs uneven data distribution in the
frequency domain and imbalanced workload. In this paper, we propose RAB: a Reconfigurable
Architecture Block-Circulant Neural Network Accelerator to solve the problems via
two techniques. First, a fine-grained frequency-domain quantization is proposed to
accelerate MAC operations. Second, a reconfigurable architecture is designed to transform
FFT/IFFT modules into MAC modules, which alleviates the imbalanced workload and further
improves efficiency. Experimental results show that RAB can achieve 1.9x/1.8x area/energy
efficiency improvement compared with the state-of-the-art block-circulant compression
based accelerator.
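For background, the reason block-circulant compression moves the computation into the frequency domain is that multiplying by a circulant matrix reduces to element-wise products of FFTs, which is what makes FFT/IFFT and MAC modules central to such accelerators; a minimal NumPy sketch:

```python
import numpy as np

def circulant_matvec(c, x):
    """y = C @ x where C is the circulant matrix whose first column is c.

    Via the convolution theorem this becomes an element-wise product in the
    frequency domain, turning O(n^2) MACs into O(n log n) FFT work.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

n = 8
c = np.random.randn(n)                      # one stored column per weight block
x = np.random.randn(n)

# Reference: build the dense circulant matrix explicitly and compare.
C = np.array([np.roll(c, i) for i in range(n)]).T
print(np.allclose(C @ x, circulant_matvec(c, x)))   # True
```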
BatchSizer: Power-Performance Trade-off for DNN Inference
GPU accelerators can deliver significant improvement for DNN processing; however,
their performance is limited by internal and external parameters. A well-known parameter
that restricts the performance of various computing platforms in real-world setups,
including GPU accelerators, is the power cap imposed usually by an external power
controller. A common approach to meet the power cap constraint is using the Dynamic
Voltage Frequency Scaling (DVFS) technique. However, the functionality of this technique
is limited and platform-dependent. To improve the performance of DNN inference on
GPU accelerators, we propose a new control knob, which is the size of input batches
fed to the GPU accelerator in DNN inference applications. After evaluating the impact
of this control knob on power consumption and performance of GPU accelerators and
DNN inference applications, we introduce the design and implementation of a fast and
lightweight runtime system, called BatchSizer. This runtime system leverages the new
control knob for managing the power consumption of GPU accelerators in the presence
of the power cap. Conducting several experiments using a modern GPU and several DNN
models and input datasets, we show that our BatchSizer can significantly surpass the
conventional DVFS technique regarding performance (up to 29%), while successfully
meeting the power cap.
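A minimal sketch of using the input batch size as a power-capping control knob (a hypothetical feedback loop with a stand-in power model; not the BatchSizer implementation):

```python
def measure_power(batch_size):
    """Stand-in for reading the GPU power sensor while running inference with
    the given batch size (hypothetical monotonic model)."""
    return 60.0 + 1.5 * batch_size           # watts

def control_batch_size(power_cap, batch=64, min_batch=1, max_batch=256):
    """Simple feedback loop: shrink the batch when over the cap, grow it
    (to recover throughput) while comfortably under the cap."""
    for _ in range(20):                       # control iterations
        power = measure_power(batch)
        if power > power_cap:
            batch = max(min_batch, batch // 2)
        elif power < 0.9 * power_cap:
            batch = min(max_batch, batch + 8)
    return batch

print(control_batch_size(power_cap=150.0))    # settles near the largest safe batch
```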
SESSION: 9C: Cutting-Edge EDA Techniques for Advanced Process Technologies
Deep Learning for Mask Synthesis and Verification: A Survey
Achieving lithography compliance is increasingly difficult in advanced technology
nodes. Due to complicated lithography modeling and long simulation cycles, verifying
and optimizing photomasks becomes extremely expensive. To speed up design closure,
deep learning techniques have been introduced to enable data-assisted optimization
and verification. Such approaches have demonstrated promising results with high solution
quality and efficiency. Recent research efforts show that learning-based techniques
can accomplish an increasing range of tasks, from classification and simulation to
optimization. In this paper, we survey successful attempts at advancing mask synthesis
and verification with deep learning and highlight the domain-specific learning techniques.
We hope this survey can shed light on the future development of learning-based design
automation methodologies.
Physical Synthesis for Advanced Neural Network Processors
The remarkable breakthroughs in deep learning have led to a dramatic thirst for computational
resources to tackle interesting real-world problems. Various neural network processors
have been proposed for the purpose, yet, far fewer discussions have been made on the
physical synthesis for such specialized processors, especially in advanced technology
nodes. In this paper, we review several physical synthesis techniques for advanced
neural network processors. We especially argue that datapath design is an essential
methodology in the above procedures due to the organized computational graph of neural
networks. As a case study, we investigate a wafer-scale deep learning accelerator
placement problem in detail.
Advancements and Challenges on Parasitic Extraction for Advanced Process Technologies
As feature sizes scale down, process technology becomes more complicated and design
margins shrink, so accurate parasitic extraction during IC design is in great demand.
In this invited paper, we survey recent advancements in parasitic extraction
techniques, especially those enhancing the floating random walk based capacitance
solver and incorporating machine learning methods. Work dealing with process variation
is also addressed. After that, we briefly discuss the challenges for capacitance
extraction under advanced process technologies, including manufacture-aware geometry
variations and middle-end-of-line (MEOL) parasitic extraction.
SESSION: 9D: Robust and Reliable Memory Centric Computing at Post-Moore
Reliability-Aware Training and Performance Modeling for Processing-In-Memory Systems
Memristor based Processing-In-Memory (PIM) systems offer alternative solutions to boost
the energy efficiency of Convolutional Neural Network (CNN) based algorithms.
However, Analog-to-Digital Converters’ (ADCs) high interface costs and the limited
size of the memristor crossbars make it challenging to map CNN models onto PIM systems
with both high accuracy and high energy efficiency. Besides, it takes a long time
to simulate the performance of large-scale PIM systems, resulting in unacceptable
development time for the PIM system. To address these problems, we propose a reliability-aware
training framework and a behavior-level modeling tool (MNSIM 2.0) for PIM accelerators.
The proposed reliability-aware training framework, containing network splitting/merging
analysis and a PIM-based non-uniform activation quantization scheme, can improve the
energy efficiency by reducing the ADC resolution requirements in memristor crossbars.
Moreover, MNSIM 2.0 provides a general modeling method for PIM architecture design
and computation data flow; it can evaluate both accuracy and hardware performance
within a short time. Experiments based on MNSIM 2.0 show that the reliability-aware
training framework can improve the energy efficiency of PIM accelerators by 3.4x with
little accuracy loss. The equivalent energy efficiency is 9.02 TOPS/W, nearly 2.6-4.2x
that of existing work. We also evaluate further case studies of MNSIM 2.0, which help
us balance the trade-off between accuracy and hardware performance.
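As an illustrative sketch of non-uniform activation quantization (not MNSIM 2.0's scheme;
the percentile-based level selection below is an assumption), activations can be snapped
to a small set of levels drawn from their observed distribution so that a lower-resolution
ADC covers the common values more finely:

import numpy as np

def nonuniform_quantize(acts, n_levels=16):
    # Pick levels at equally spaced percentiles of the activation distribution,
    # then map every activation to its nearest level.
    levels = np.percentile(acts, np.linspace(0, 100, n_levels))
    idx = np.abs(acts[..., None] - levels).argmin(axis=-1)
    return levels[idx]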
Robustness of Neuromorphic Computing with RRAM-based Crossbars and Optical Neural
Networks
RRAM-based crossbars and optical neural networks are attractive platforms to accelerate
neuromorphic computing. However, both accelerators suffer from hardware uncertainties
such as process variations. If these uncertainty issues are left unaddressed, the inference
accuracy of these computing platforms can degrade significantly. In this paper, a
statistical training method where weights under process variations and noise are modeled
as statistical random variables is presented. To incorporate these statistical weights
into training, the computations in neural networks are modified accordingly. For optical
neural networks, we modify the cost function during software training to reduce the
effects of process variations and thermal imbalance. In addition, the residual effects
of process variations are extracted and calibrated in hardware testing, and thermal variations
on devices are also compensated in advance. Simulation results demonstrate that the
inference accuracy can be improved significantly under hardware uncertainties for
both platforms.
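A minimal sketch of treating weights as random variables during training (illustrative;
the papers' statistical formulation and noise models are more involved, and the
multiplicative Gaussian perturbation and sigma value below are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Linear):
    # Linear layer whose weights are perturbed by device-style Gaussian noise in
    # every training forward pass, pushing the learned weights toward robustness.
    def __init__(self, in_features, out_features, sigma=0.05):
        super().__init__(in_features, out_features)
        self.sigma = sigma

    def forward(self, x):
        w = self.weight
        if self.training:
            w = w * (1.0 + self.sigma * torch.randn_like(w))
        return F.linear(x, w, self.bias)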
Uncertainty Modeling of Emerging Device based Computing-in-Memory Neural Accelerators
with Application to Neural Architecture Search
Emerging device based computing-in-memory (CiM) has proven to be a promising candidate
for energy-efficient deep neural network (DNN) computation. However, most emerging
devices suffer from uncertainty issues, resulting in a difference between the data
actually stored and the weight value the device is designed to hold. This leads to an
accuracy drop from trained models to actually deployed platforms. In this work, we offer a
thorough analysis on the effect of such uncertainties induced changes in DNN models.
To reduce the impact of device uncertainties, we propose UAE, an uncertainty-aware
Neural Architecture Search scheme to identify a DNN model that is both accurate and
robust against device uncertainties.
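A minimal sketch of how such a search can score candidates (hypothetical; UAE's actual
objective and search algorithm are not described here): evaluate each architecture several
times under freshly injected device noise and prefer a high mean accuracy with a low spread:

import statistics

def uncertainty_aware_score(evaluate_with_device_noise, trials=8, risk_weight=1.0):
    # evaluate_with_device_noise() returns validation accuracy with fresh random
    # device perturbations applied to the candidate's weights on each call.
    accs = [evaluate_with_device_noise() for _ in range(trials)]
    return statistics.mean(accs) - risk_weight * statistics.pstdev(accs)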
A Physical-Aware Framework for Memory Network Design Space Exploration
In the era of big data, there are growing demands for server memory capacity and
performance. Memory networks are a promising alternative that provides high bandwidth
and low latency through distributed memory nodes connected by high-speed interconnects.
However, most existing designs are developed at a purely logical level and ignore the
physical impact of network interconnect latency, processor placement, and the interplay
between processor and memory. In this work, we propose a Physical-Aware framework
for memory network design space exploration, which facilitates the design of an energy
efficient and physical-aware memory network system. Experimental results on various
workloads show that the proposed framework can help customize network topology with
significant improvements on various design metrics when compared to the other commonly
used topologies.
SESSION: 9E: Design for Manufacturing and Soft Error Tolerance
Manufacturing-Aware Power Staple Insertion Optimization by Enhanced Multi-Row Detailed
Placement Refinement
Power staple insertion is a new methodology for IR drop mitigation in advanced technology
nodes. Detailed placement refinement, which perturbs an initial placement slightly,
is an effective strategy to increase the success rate of power staple insertion. We
are the first to address the manufacturing-aware power staple insertion optimization
problem by triple-row placement refinement. We present a correct-by-construction approach
based on dynamic programming to maximize the total number of legal power staples inserted
subject to the design rule for 1D patterning. Instead of using a multidimensional
array which incurs huge space overhead, we show how to construct a directed acyclic
graph (DAG) on the fly efficiently to implement the dynamic program for multi-row
optimization. The memory usage can thus be reduced
by a few orders of magnitude in practice.
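A heavily simplified skeleton of a lazily memoized sweep DP (purely illustrative; the
paper's triple-row states, staple counting, and 1D-patterning rules are not reproduced,
and every name below is an assumption). Memoizing on (position, state) materializes only
reachable states, which plays the role of the DAG built on the fly instead of a full
multidimensional array:

from functools import lru_cache

def max_staples(num_sites, states_at, staples_gained, compatible):
    # states_at(i): hashable candidate placement states at site i;
    # staples_gained(i, s): staples enabled by choosing state s at site i;
    # compatible(s, t): legality check between adjacent sites.
    @lru_cache(maxsize=None)
    def best(i, state):
        gain = staples_gained(i, state)
        if i == num_sites - 1:
            return gain
        nxt = [best(i + 1, t) for t in states_at(i + 1) if compatible(state, t)]
        return gain + (max(nxt) if nxt else 0)
    return max(best(0, s) for s in states_at(0))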
A Hierarchical Assessment Strategy on Soft Error Propagation in Deep Learning Controller
Deep learning techniques have been introduced into the field of intelligent controller
design in recent years and become an effective alternative in complex control scenarios.
In addition to improving control robustness, deep learning controllers (DLCs) also provide
a potential fault tolerance to internal disturbances (such as soft errors) due to
the inherent redundant structure of deep neural networks (DNNs). In this paper, we
propose a hierarchical assessment to characterize the impact of soft errors on the
dependability of a PID controller and its DLC alternative. Single-bit-flip injections
in underlying hardware and time series data collection from multiple abstraction layers
(ALs) are performed on a virtual prototype system based on an ARM Cortex-A9 CPU, with
a PID controller and corresponding recurrent neural network (RNN) implemented DLC
deployed on it. We employ generative adversarial networks and Bayesian networks to
characterize the local and global dependencies caused by soft errors across the system.
By analyzing cross-AL fault propagation paths and component sensitivities, we discover
that the parallel data processing pipelines and regular feature size scaling mechanism
in the DLC can effectively prevent critical, failure-causing faults from propagating to
the control output.
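A minimal sketch of the kind of single-bit-flip injection used in such campaigns
(illustrative only; the paper injects faults at the hardware level of a virtual prototype,
not through a software helper like this):

import struct

def flip_float32_bit(value, bit):
    # Flip one bit (0-31) of the IEEE-754 single-precision encoding of `value`,
    # mimicking a single-event upset in a 32-bit datapath register.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped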
Attacking a CNN-based Layout Hotspot Detector Using Group Gradient Method
Deep neural networks are being used in disparate VLSI design automation tasks, including
layout printability estimation, mask optimization, and routing congestion analysis.
Preliminary results show the power of deep learning as an alternative solution in state-of-the-art
design and sign-off flows. However, deep learning is vulnerable to adversarial attacks.
In this paper, we examine the risk of state-of-the-art deep learning-based layout
hotspot detectors under practical attack scenarios. We show that legacy gradient-based
attacks do not adequately consider the design rule constraints. We present an innovative
adversarial attack formulation to attack the layout clips and propose a fast group
gradient method to solve it. Experiments show that the attack can deceive the deep
neural networks using small perturbations in clips which preserve layout functionality
while meeting the design rules. The source code is available at https://github.com/phdyang007/dlhsd/tree/dct_as_conv.
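As a rough sketch of a grouped gradient step (illustrative; the paper's group construction
and design-rule handling are more sophisticated, and the tensors below are assumptions):
each predefined group of layout pixels is perturbed together according to its aggregated
gradient, rather than pixel by pixel:

import torch

def group_gradient_step(clip, grad, group_masks, eps):
    # clip, grad: layout clip and loss gradient w.r.t. it (same shape);
    # group_masks: list of boolean masks, one per group of pixels moved together.
    perturbed = clip.clone()
    for mask in group_masks:
        direction = grad[mask].sum().sign()   # one shared direction per group
        perturbed[mask] += eps * direction
    return perturbed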
Bayesian Inference on Introduced General Region: An Efficient Parametric Yield Estimation Method for Integrated Circuits
In this paper, we propose an efficient parametric yield estimation method based on
Bayesian Inference. Observing that analog and mixed-signal circuits are nowadays designed
via a multi-stage flow, and that the circuit performance correlation between the early
and late stages is naturally symmetrical, we introduce a general region to capture
the common features of the early and late stage. Meanwhile, two private regions are
also incorporated to represent the unique features of these two stages respectively.
Afterwards, we introduce one classifier for each region to explicitly encode the
correlation information. Next, we set up a graphical model, and consequently adopt
Bayesian Inference to calculate the model parameters. Finally, based on the obtained
optimal model parameters, we can accurately and efficiently estimate the parametric
yield with a simple sampling method. Our numerical experiments demonstrate that compared
to the state-of-the-art algorithms, our proposed method can better estimate the yield
while significantly reducing the number of circuit simulations.
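For context, the brute-force baseline such methods aim to beat in simulation count is
plain Monte Carlo yield estimation (a generic sketch, not the paper's Bayesian model;
the callbacks are assumptions):

def monte_carlo_yield(sample_process_params, simulate, meets_spec, n_samples=1000):
    # Yield = fraction of sampled process corners whose simulated performance
    # meets the specification; each sample costs one full circuit simulation.
    passed = sum(
        1 for _ in range(n_samples)
        if meets_spec(simulate(sample_process_params()))
    )
    return passed / n_samples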
Analog IC Aging-induced Degradation Estimation via Heterogeneous Graph Convolutional
Networks
With continued scaling, transistor aging induced by Hot Carrier Injection and Bias
Temperature Instability causes a gradual failure of nanometer-scale integrated circuits
(ICs). In this paper, to characterize the multi-typed devices and connection ports,
a heterogeneous directed multigraph is adopted to efficiently represent analog IC
post-layout netlists. We investigate a heterogeneous graph convolutional network (H-GCN)
to quickly and accurately estimate aging-induced transistor degradation. In the proposed
H-GCN, an embedding generation algorithm with a latent space mapping method is developed
to aggregate information from the node itself and its multi-typed neighboring nodes
through multi-typed edges. Since our proposed H-GCN is independent of dynamic stress
conditions, it can replace static aging analysis. We conduct experiments on very advanced
5nm industrial designs. Compared to traditional machine learning and graph learning
methods, our proposed H-GCN can achieve more accurate estimations of aging-induced
transistor degradation. Compared to an industrial reliability tool, our proposed H-GCN
can achieve 24.623x speedup on average.
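A minimal sketch of a heterogeneous graph-convolution layer of the kind described
(illustrative; the paper's embedding generation and latent space mapping are not modeled,
and the layer below is an assumption): neighbor features are transformed by an
edge-type-specific weight and accumulated into each node:

import torch
import torch.nn as nn

class HeteroGraphConvLayer(nn.Module):
    def __init__(self, dim, edge_types):
        super().__init__()
        # One learnable transform per edge (connection-port) type.
        self.transforms = nn.ModuleDict({t: nn.Linear(dim, dim) for t in edge_types})

    def forward(self, node_feats, edges_by_type):
        # edges_by_type: {edge_type: (src_idx, dst_idx)} with LongTensor indices.
        out = torch.zeros_like(node_feats)
        for etype, (src, dst) in edges_by_type.items():
            out.index_add_(0, dst, self.transforms[etype](node_feats[src]))
        return torch.relu(out + node_feats)   # keep each node's own information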