FPGA 2020 TOC
FPGA ’20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
SESSION: Morning Tutorial Session
High-level synthesis tools, both commercial and academic, typically rely on static
scheduling to produce high-throughput pipelines. However, in applications with unpredictable
memory accesses or irregular control flow, these tools need to make pessimistic scheduling
assumptions. In contrast, dataflow circuits implement dynamic scheduling:
components communicate locally using a handshake mechanism and exchange data
as soon as all conditions for a transaction are satisfied. Due to their ability to
adapt the schedule at runtime, dataflow circuits are suitable for handling irregular
and control-dominated code. This paper describes Dynamatic, an open-source HLS framework
which generates synchronous dataflow circuits out of C/C++ code. The purpose of this
paper is to give an introductory overview of Dynamatic and demonstrate some of its
use cases, in order to enable others to use the tool and participate in its development.
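The handshake principle described above can be illustrated with a toy simulation: a dataflow "join" fires only when every input token is present and the output slot is free. This is a minimal sketch of the idea, not Dynamatic's actual circuit generation; the `Channel` and `join_add` names are invented for illustration.

```python
# Toy model of the handshake ("elastic") protocol in dataflow circuits:
# a join unit fires only when ALL input channels hold valid data and the
# output channel is free -- data moves as soon as the conditions are met.

class Channel:
    def __init__(self):
        self.data = None          # None means "no valid token"

    @property
    def valid(self):
        return self.data is not None

def join_add(a, b, out):
    """Fire an adder only when both inputs are valid and the output is free."""
    if a.valid and b.valid and not out.valid:
        out.data = a.data + b.data
        a.data = b.data = None    # tokens are consumed on the handshake
        return True               # a transaction happened this "cycle"
    return False

a, b, out = Channel(), Channel(), Channel()
a.data = 3
assert not join_add(a, b, out)    # b not valid yet: the adder stalls locally
b.data = 4
assert join_add(a, b, out)        # both inputs valid, output free: fires
assert out.data == 7
```

The local stall in the first call is exactly what lets the schedule adapt at runtime instead of being fixed pessimistically at compile time.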
Since FPGAs are now available in datacenters to accelerate applications, providing
FPGA hardware security is a high priority. FPGA security concerns are becoming more serious
with the transition to FPGA-as-a-Service, where users can upload their own bitstreams.
Full control over FPGA hardware through the bitstream enables attacks to weaken an
FPGA-based system. These include physically damaging the FPGA equipment and leaking
sensitive information such as the secret keys of crypto algorithms. While there
are no known attacks in commercial settings so far, it is not so much a question
of if, but of when. The tutorial will show concrete attacks applicable to datacenter
FPGAs. The goal of this tutorial is to prepare the FPGA community for impending security
issues in order to pave the way for proactive security. First, we will give a tour through
the FPGA hardware security jungle surveying practical attacks and potential threats.
We will reinforce this with live demos of denial of service attacks. Less than 10%
of the logic resources on an FPGA can draw enough dynamic power to crash a datacenter
FPGA card. In the second part of the tutorial, we will show different mitigations
that are either vendor supported or proposed by the academic community. In summary,
the tutorial will communicate that while FPGA hardware security is complicated to
bring about, there are acceptable solutions for known FPGA security problems.
SESSION: Invited Session: Security in FPGA Design and Application
In recent years, substantial attention has been drawn to vulnerabilities in the architectural
design of microelectronics, as well as the security of their global supply chains.
In reality, establishing trust in microelectronics requires broader considerations,
from verification of the software leveraged to implement hardware designs, to analyzing
third-party intellectual property cores, all the way to run-time design assurance
and periodic device screening post-deployment. These concerns are relevant to stakeholders
at all levels, from small independent design houses all the way to multi-national
strategic interests. One notable example of the latter is the U.S. Department of Defense’s
Trusted and Assured Microelectronics (T&AM) program, which seeks assured access to
state-of-the-art foundries through modern trust and assurance methods and demonstrations.
This talk will describe research efforts at the Georgia Tech Research Institute
centered around providing assurance of FPGAs. Current research thrusts include the
development of verification techniques at multiple stages of the design process, including
vendor design software execution, implementation of user designs, and even the operation
of the underlying physical device hardware itself. For example, to address trust in
synthesis and implementation of high-level user source code, we discuss the development
of canary circuits which are compiled alongside user design circuits and can be independently
inspected and verified to ensure adherence to user-defined implementation rules. Additionally,
we discuss one avenue for providing trust in vendor hardware devices through our development
of Independent Functional Test (IFT) suites.
Cloud FPGAs have been gaining interest in recent years due to the ability of users
to request FPGA resources quickly, flexibly, and on-demand. In addition to the existing
single-tenant deployments, where each user gets access to a whole FPGA, recent
academic proposals have looked at creating multi-tenant deployments, where multiple
users share a single FPGA. In both settings, there is a large amount of
infrastructure and physical resources that are shared among users. Sharing of the
physical resources in data centers and processors is well known to lead to potential
attacks. However, only recently have there been demonstrations of various
security attacks that our group and others have shown to be possible in the Cloud FPGA
setting.
This talk will discuss Cloud FPGA security from the perspective of side and covert
channel attacks that arise due to these shared resources. It will first cover our
recent work on thermal channels that can be used to create covert channels between
users renting the same FPGA over time. These channels can create a stealthy communication
medium for leaking small amounts of sensitive information, e.g., cryptographic keys.
As defense strategies, the talk will point out possible solutions at the system level
and at the hardware level. At the system level, adding delays between when different
users can access the same FPGA, or preventing users from being able to identify unique
FPGA instances can mitigate the threats, but do increase overhead. At the hardware
level, additional cooling to erase thermal information after a user's session on an FPGA,
or new sensors to monitor FPGAs and generate an alert when excessive heat is detected
are possible solutions that will be discussed.
The talk will also discuss recent work on voltage-based attacks that leverage custom
circuits instantiated inside the FPGAs to measure voltage changes. Voltage-based channels
can be used to leak sensitive information across FPGAs (in single-tenant or multi-tenant
settings), or can be combined with other existing attacks to perform cross-talk
leakage inside the FPGAs (in multi-tenant settings). These attacks highlight the
power of attackers when they are able to synthesize any circuit into a shared FPGA
environment. Furthermore, even with certain restrictions on the types of designs that
can be synthesized, this talk will show how attacks can be deployed. As defense strategies,
the talk will point out possible new design check rules that can be used by Cloud FPGA providers.
In light of the attacks and defenses, Cloud FPGA security remains a cat-and-mouse
game. There is then the foremost need to better understand the existing and potential
attacks — to design defenses and deploy them before malicious users try to launch
such attacks. Only with proper understanding of the possible FPGA attacks, can secure
Cloud FPGAs be created.
An emerging trend in the data center is the at-scale deployment of Field-Programmable
Gate Arrays (FPGAs) which combine multi-gigabit and ultra-low latency workload acceleration
with hardware-level reconfigurability. In particular, for applications such as Deep
Learning where techniques and algorithms are rapidly changing, the inherent flexibility
of FPGAs grants them an edge over hardened data processing units such as ASICs or
GPUs. Inevitably, Cloud Service Providers (CSPs) will seek to maximize resource utilization
for their FPGA investments as they currently do for general-purpose computing resources.
Current FPGA deployments in the data center tend to be single-tenant or support multiple
tenants through time multiplexing (temporal multi-tenancy) which can result in resource
underutilization. This approach does not scale and presents challenges for elastic
workloads whose properties are not fully known ahead of time. Instead, we expect that
closing the resource utilization gap will require efficient spatial allocation of
FPGA resources across multiple tenants while maintaining security and QoS guarantees.
In particular, new usage models such as FPGA-as-a-Service, where resources are exposed
directly to the cloud tenant, present unique challenges on the security and QoS side.
In this talk we review the threat landscape and trust models associated with FPGA
multi-tenancy, highlight future research challenges and examine the unique opportunities
that FPGA multi-tenancy enables given adequate guarantees on security and QoS.
Technology and cost are motivating more and more developers to put more and more of
their “secret sauce” in programmable logic. This is great for the consumer as it opens
the market to the smaller players. However, it also opens the market for IP theft;
after all, why spend years making something yourself if you can just pilfer it and
re-brand it as your own. A sad statement for sure, but it is the reality of the world
we live in. Worsening the situation is the fact that FPGAs and SoCs are starting
to become the anchor for the security of larger systems. This brings in another
set of bad guys: ones that are tech-savvy and armed with lab equipment. They are not
looking for your IP but are looking to break into the system your IP is protecting.
The number of adversaries is growing just as fast as the markets for the devices themselves
and is quickly becoming an arms race.
The first volley was fired when the adversaries started reverse engineering the programming
files (bitstreams) so we added bitstream encryption (3-DES at the time). However,
commercial computational power rendered that algorithm obsolete, so we moved to AES-256.
The adversary gave up attacking the algorithm and started going after the key itself,
so we added physical protections. Failing to break into the device they opted for
less physically invasive attacks such as Differential Power Analysis, so we added
authentication before decryption and key rolling. Frustrated with these and other
protections the adversary went back to physical attacks. This was aided by the ever-expanding
capabilities of Failure Analysis which needs the same equipment to understand device
failures as attackers need to break into devices. They started with circuit probing
and edits using the Focused Ion Beam (FIB). This allowed them to disable security features
or tap into the key space. To counter this, redundancy and circuit obfuscation techniques
were added. They then switched to less destructive imaging methods which forced us
to levy system requirements (detection and prevention techniques) on the customer;
costly but effective.
The security of FPGAs and SoCs has greatly evolved over the years but the next battlefield
in the arms race is on the horizon: FPGAs / SoC’s in the cloud. Most of the security
in modern day devices primarily considers an attacker on the outside of the device
with close physical access. As such, most of the mitigations make a similar assumption.
However, devices in the cloud create an entirely new form of device warfare: remote
attacks on FPGAs and SoCs. This is not an issue of hacking software; that has been
around since before the internet. It is about the number of devices in the cloud that
are programmable and, therefore, hackable. Side channel attacks such as row-hammer
(1) and CLKscrew (2) show that if there are secrets and some level of device access,
the attackers will find a way to exploit it. In this modern era, security engineers
are going to have to look for adversaries where they would never expect them: inside the device itself.
FPGAs have been deployed in datacenters worldwide and are now available for use
in both public and private clouds. Enormous focus has been given to optimizing machine
learning workloads for FPGAs, especially for deep neural networks (DNNs) in areas
like web search, image classification, and translation. However, major cloud applications
encompass a variety of areas that are not primarily machine learning workloads, including
databases, video encoding, text processing, gaming, bioinformatics, productivity and
collaboration, file hosting and storage, e-mail, and many more. While machine learning
can certainly play a role in each of these areas, is there more that can be done to
accelerate these more traditional workloads? Even more challenging than identifying
promising workloads is figuring out how developers can practically create and deploy
useful FPGA-powered applications to the cloud. While FPGA-as-a-Service offerings allow access
to FPGAs in the cloud, there is a huge gap between raw programmable hardware and a
customer paying money to use an application powered by that hardware. A wide variety
of FPGA IP exists for developers to use, but individual IP blocks are a long way from
being a fully functional cloud application. Building block IPs like Memcached, regex
matching, protocol parsing, and linear algebra are only a subset of the necessary
functionality for full cloud applications. Developing or acquiring IP and integrating
it into a full application that customers will pay for is a significant task. And
even when a customer pays, how should the money be distributed between IP vendors?
Should it be a one-time fee? By usage? By number of FPGAs deployed? Who should have
the burden for support if something goes wrong? In traditional cloud applications,
FPGA IP block functions are implemented in software libraries. However, few examples
of optimized software libraries are commercially successful, so is selling FPGA IP
even a viable commercial model for cloud applications? High-level synthesis (HLS)
tools promise to provide one path to enable software developers to make effective
use of FPGAs for computing tasks, but are any tools really capable of accelerating
cloud-scale applications? Many HLS tools require substantial microarchitectural guidance
in the form of pragmas or configuration files to achieve good results. Real
cloud applications also rarely have a single dominant function and have significant
data movement, so without proper partitioning and tuning, the acceleration gains from
the FPGA are quickly wiped out by data movement and Amdahl’s Law. This panel will
gather experts in using FPGAs for cloud application areas beyond machine learning
to discuss how those applications can be built and successfully deployed. We will cover topics
such as: -What are the most important cloud workloads for FPGAs to target besides
machine learning? -Are there specific changes to the FPGA architecture that would
benefit these cloud applications? -What are the economic models that will work for
IP developers, application developers, and cloud providers? -How can we make development
of FPGA applications easier for the Cloud? -Will open source IP make it impossible
for IP vendors to make commercially successful libraries? -What advances are necessary
for HLS tools to be practical in the Cloud? The panel is comprised of experts in applications,
IP development, and cloud deployment. Each will give a short presentation of what
they find as the most important applications and how they see FPGA development for
the cloud going forward, then we will open the floor to an interactive discussion
with the audience.
SESSION: Session: Keynote I
Spatial compute architectures, like Field Programmable Gate Arrays (FPGAs), constitute
a key architectural pillar in modern heterogeneous compute platforms. Spatial architectures
need a sophisticated Electronic Design Automation (EDA) compiler to optimally map
and fit a user’s workload/design onto the underlying spatial device. This EDA compiler
not only helps users to custom-configure the spatial device but is also critically
required for architectural exploration of new spatial architectures. The FPGA industry
has had a long history of innovation in this symbiotic relationship between EDA and
reconfigurable spatial architectures.
This talk will walk down the memory lane of multiple waves of such innovation, amplifying
how the complexity of EDA technology has not only scaled with Moore’s law scaling
of size and complexity of silicon hardware, but also how it has been pivotal in the
architectural design of modern FPGAs. A general overview of modern FPGA EDA flows
and key differences compared to Application-Specific Integrated Circuit (ASIC) EDA
flows will be discussed. State-of-the-art FPGAs, Stratix® 10 and AgileX™ from Intel
incorporate an advanced register-rich HyperFlex™ architecture that introduces disruptive
optimization opportunities in the EDA compiler. Such physical synthesis optimization
technologies like logic retiming, clock skew optimization, time borrowing, and their
synergies and challenges will be discussed. Solving these challenges enables FPGAs
to achieve non-linear performance improvements.
Logic retiming was first introduced as a powerful sequential design optimization technique
three decades ago, yet gained limited popularity in the ASIC industry, because of
the lack of scalable sequential verification techniques. This talk will highlight
the root causes of this issue and present innovations in retiming technology and constrained
random simulation that allow the successful verification of retimed circuits, thereby
enabling the use of logic retiming for FPGAs.
FPGAs have traditionally targeted Register-Transfer Level (RTL) designers. To enable
wider adoption of FPGAs, Intel has developed several High-Level Design (HLD) tools,
frameworks, libraries, and methodologies, raising the level of programming abstraction.
This talk will provide a glimpse into Intel’s HLD offerings that enable software developers
in the broader ecosystem to leverage FPGAs.
Academic researchers will also be provided with some key research vectors to help
propel the FPGA industry further.
SESSION: Session: High-Level Abstractions and Tools I
In cloud computing, software is transitioning from a monolithic to a microservices architecture
to improve the maintainability, upgradability, and flexibility of applications.
Clients are able to request a service with different implementations of the same functionality,
including hardware accelerators, depending on cost and performance. This model opens
up a new opportunity to integrate reconfigurable hardware, specifically, FPGA, in
the cloud to offer such services. There are many research works discussing solutions
for this problem, but they focus primarily on the high-level aspects of the resource manager,
hypervisor, or hardware architecture. The low-level physical design choices of the FPGA
that maximize the accelerator allocation success rate (called serviceability) are largely
untouched. In this paper, we propose a design space exploration algorithm to determine
the best configuration of partially reconfigurable regions (PRRs) to host the accelerators.
In addition, the algorithm is capable of estimating the actual resources occupied by the
PRRs on the FPGA even before floorplanning. We systematically study the effects of
having more PRRs on the system in various aspects, i.e., serviceability, waiting time
and resource wastage. The experiments show that at a certain number of PRRs, up to
91% serviceability can be achieved for 12 concurrent users, a significant improvement
over the 52% achieved without our approach. The average amount of time that each request has to
wait to be served is also reduced by 6.3X. Furthermore, the cumulative unused FPGA
resources are reduced by almost half.
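The serviceability metric above can be illustrated with a toy allocation model. This is a hypothetical best-fit policy with invented numbers, not the paper's actual design space exploration algorithm; it only shows why the PRR configuration drives the fraction of requests that can be served.

```python
# Hypothetical best-fit model of accelerator allocation to partially
# reconfigurable regions (PRRs): a request is serviceable if some free PRR
# is large enough. Serviceability = served requests / total requests.

def serviceability(prr_sizes, requests):
    free = sorted(prr_sizes)           # available PRR capacities (e.g., LUT counts)
    served = 0
    for demand in requests:
        for i, cap in enumerate(free):
            if cap >= demand:          # best-fit: smallest free PRR that fits
                free.pop(i)
                served += 1
                break
    return served / len(requests)

# Same total capacity, partitioned as few large PRRs vs. more smaller ones:
reqs = [20, 20, 30, 40, 20, 30]
assert serviceability([80, 80], reqs) == 2 / 6
assert serviceability([40, 40, 40, 40], reqs) == 4 / 6
```

Even this toy model reproduces the paper's qualitative finding: for the same aggregate resources, a finer PRR partitioning can serve more concurrent requests.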
Recent breakthroughs in Deep Neural Networks (DNNs) have fueled a growing demand for
domain-specific hardware accelerators (i.e., DNN chips). However, designing DNN chips
is non-trivial because: (1) mainstream DNNs have millions of parameters and billions
of operations; (2) the design space is large due to numerous design choices of dataflows,
processing elements, memory hierarchy, etc.; and (3) there is an algorithm/hardware
co-design need, since the same DNN functionality can have different decompositions that
require different hardware IPs and thus correspond to dramatically different
performance/energy/area tradeoffs. Therefore, DNN chips often take months to years
to design and require a large team of cross-disciplinary experts. To enable fast and
effective DNN chip design, we propose AutoDNNchip – a DNN chip generator that can
automatically produce both FPGA- and ASIC-based DNN chip implementation (i.e., synthesizable
RTL code with optimized algorithm-to-hardware mapping) from DNNs developed by machine
learning frameworks (e.g., PyTorch) for a designated application and dataset without
humans in the loop. Specifically, AutoDNNchip consists of 2 integrated enablers: (1)
a Chip Predictor, which can accurately and efficiently predict a DNN accelerator’s
energy, throughput, latency, and area based on the DNN model parameters, hardware
configurations, technology-based IPs, and platform constraints; and (2) a Chip Builder,
which can automatically explore the design space of DNN chips (including IP selections,
block configurations, resource balancing, etc.), optimize chip designs via the Chip
Predictor, and then generate synthesizable RTL code with optimized dataflows to achieve
the target design metrics. Experimental results show that our Chip Predictor's predicted
performance differs from real measurements by <10% when validated using 15 DNN models
and 4 platforms (edge-FPGA/TPU/GPU and ASIC). Furthermore, DNN accelerators generated
by our AutoDNNchip can achieve better (up to 3.86X improvement) performance than that
of expert-crafted state-of-the-art FPGA- and ASIC-based accelerators, showing the
effectiveness of AutoDNNchip. Our open-source code can be found at https://github.com/RICE-EIC/AutoDNNchip.git.
The domain-specific language (DSL) for image processing, Halide, has generated a lot
of interest because of its capability to decouple algorithms from schedules, which allows
programmers to search for optimized mappings targeting CPUs and GPUs. Unfortunately,
while the Halide community has been growing rapidly, there is currently no way to
easily map the vast number of Halide programs to efficient FPGA accelerators. To tackle
this challenge, we propose HeteroHalide, an end-to-end system for compiling Halide
programs to FPGA accelerators. This system makes use of both algorithm and scheduling
information specified in a Halide program. Compared to existing approaches, the flow
provided by HeteroHalide is significantly simplified, as it only requires moderate
modifications to the scheduling part of a Halide program to make it applicable to FPGAs.
For part of the compilation flow, and to act as the intermediate representation (IR)
of HeteroHalide, we choose HeteroCL, a heterogeneous programming infrastructure which
supports multiple implementation backends (such as systolic arrays and stencil implementations).
By using HeteroCL, HeteroHalide can generate efficient accelerators by choosing different
backends according to the application. The performance evaluation compares the accelerator
generated by HeteroHalide with multi-core CPU and an existing Halide-HLS compiler.
As a result, HeteroHalide achieves a 4.15× speedup on average over 28 CPU cores,
and a 2–4× throughput improvement compared with the existing Halide-HLS compiler.
In recent years, multiple public cloud FPGA providers have emerged, increasing interest
in FPGA acceleration of cryptographic, bioinformatic, financial, and machine learning
algorithms. To help understand the security of the cloud FPGA infrastructures, this
paper focuses on a fundamental question of understanding what an adversary can learn
about the cloud FPGA infrastructure itself, without attacking it or damaging it. In
particular, this work explores how unique features of FPGAs can be exploited to instantiate
Physical Unclonable Functions (PUFs) that can distinguish between otherwise-identical
FPGA boards. This paper specifically introduces the first method for identifying cloud
FPGA instances by extracting a unique and stable FPGA fingerprint based on PUFs measured
from the FPGA boards’ DRAM modules. Experiments conducted on the Amazon Web Services
(AWS) cloud reveal the probability of renting the same physical board more than once.
Moreover, the experimental results show that hardware is not shared among f1.2xlarge,
f1.4xlarge, and f1.16xlarge instance types. As the approach used does not violate
any restrictions currently placed by Amazon, this paper also presents a set of defense
mechanisms that can be added to existing countermeasures to mitigate users’ attempts
to fingerprint cloud FPGA infrastructures.
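The fingerprint-matching step described above can be illustrated with a toy model. The bit vectors and threshold below are invented for illustration; the paper's DRAM PUF extraction and its stability statistics are far more involved.

```python
# Sketch of instance fingerprinting: treat a DRAM-based PUF measurement as a
# bit vector; two rentals landed on the same physical board iff the fractional
# Hamming distance between their fingerprints is below a noise threshold.

def hamming_fraction(fp_a, fp_b):
    assert len(fp_a) == len(fp_b)
    return sum(a != b for a, b in zip(fp_a, fp_b)) / len(fp_a)

def same_board(fp_a, fp_b, threshold=0.2):
    # Intra-board measurement noise keeps the distance small;
    # fingerprints of different boards differ in roughly half the bits.
    return hamming_fraction(fp_a, fp_b) < threshold

board1_run1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
board1_run2 = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # one noisy bit flips between rentals
board2      = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
assert same_board(board1_run1, board1_run2)
assert not same_board(board1_run1, board2)
```

The whole identification problem reduces to choosing a threshold that separates the intra-board noise distribution from the inter-board distance distribution.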
SESSION: Session: Applications I
Combinatorial optimizations are widely adopted in scientific and engineering applications,
such as VLSI design, automated machine learning (AutoML), and compiler design. Combinatorial
optimization problems are notoriously challenging to exactly solve due to the NP-hardness.
Scientists have long discovered that numerically simulating classical nonlinear Hamiltonian
systems can effectively solve many well-known combinatorial optimization problems.
However, such physical simulation typically requires a massive amount of computation,
which even outstrips the logic capability of modern reconfigurable digital fabrics.
In this work, we propose an FPGA-based general combinatorial optimization problem
solver that achieves ultra-high performance and scalability. Specifically, we first
reformulate a broad range of combinatorial optimization problems with a general graph-based
data structure called the Ising model. Second, instead of utilizing classical simulated
annealing to find an approximate solution, we utilize a new heuristic algorithm,
simulated bifurcation, to search for solutions. Third, we design an efficient hardware
architecture to fully exploit FPGAs' potential to accelerate the algorithm, and propose
three hardware-software co-optimizations to further improve the performance. In experiments
on benchmarks, our solver outperforms the state-of-the-art simulated annealing
optimization solver by up to 10.91 times.
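The simulated-bifurcation heuristic mentioned above can be sketched as follows. This is a simplified "ballistic" update rule with toy parameters, loosely after Goto et al.; the paper's FPGA architecture and co-optimizations are not reflected here.

```python
import math, random

# Toy simulated-bifurcation sketch for an Ising model with energy
# E(s) = -1/2 * sum_ij J[i][j] * s_i * s_j,  s_i in {-1, +1}.
# Each spin gets a continuous "position" x_i and "momentum" y_i; a pump term
# ramps up until the x_i bifurcate toward +1 or -1.

def energy(J, s):
    n = len(s)
    return -0.5 * sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def simulated_bifurcation(J, steps=2000, dt=0.05, c0=0.5, seed=1):
    rng = random.Random(seed)
    n = len(J)
    x = [rng.uniform(-0.1, 0.1) for _ in range(n)]   # spin "positions"
    y = [0.0] * n                                    # conjugate "momenta"
    for t in range(steps):
        p = t / steps                                # pump ramps from 0 to 1
        for i in range(n):
            coupling = sum(J[i][j] * x[j] for j in range(n))
            y[i] += ((p - 1.0) * x[i] + c0 * coupling) * dt
            x[i] += y[i] * dt
            if abs(x[i]) > 1.0:                      # inelastic wall (ballistic SB)
                x[i] = math.copysign(1.0, x[i])
                y[i] = 0.0
    return [1 if xi >= 0 else -1 for xi in x]

# Ferromagnetic ring of 4 spins: ground states are all-up / all-down, E = -4.
J = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
s = simulated_bifurcation(J)
assert energy(J, s) == -4.0
```

The appeal for FPGAs is visible even in this sketch: every spin update is the same short arithmetic kernel, so all spins on one sweep can be evaluated in parallel hardware.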
We present a high-throughput FPGA design for supporting high-performance network switching.
FPGAs have recently been attracting attention for datacenter computing due to their
increasing transceiver count and capabilities, which also benefit the implementation
and refinement of network switches. Our solution replaces the crossbar in favour of
a novel, more pipeline-friendly approach, the “Combined parallel round-robin arbiter”.
It also removes the overhead of incorporating an often-iterative scheduling or matching
algorithm, which sometimes tries to fit too many steps in a single or a few FPGA cycles.
The result is a network switch implementation on FPGAs operating at a high frequency
and with a low port-to-port latency. It also uses buffer memory more efficiently
than traditional Virtual Output Queue (VOQ)-based switches and is able to maintain 100%
throughput for a wider range of traffic patterns using a fraction of the buffer memory
and shorter packets.
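The round-robin arbitration primitive underlying such schedulers can be sketched as follows. This is only the textbook policy; the paper's combined parallel, pipeline-friendly variant is not reproduced here.

```python
# Textbook round-robin arbiter: grant the first requesting port after the
# previously granted one, wrapping around, so every port is served fairly.

def round_robin_grant(requests, last_grant):
    """Return the index of the next port to grant, or None if none request."""
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last_grant + offset) % n
        if requests[port]:
            return port
    return None

grants, last = [], -1
for _ in range(4):
    g = round_robin_grant([True, False, True, True], last)
    grants.append(g)
    last = g
assert grants == [0, 2, 3, 0]   # ports 0, 2, 3 are served in turn
```

The priority-search loop is the part that is awkward to fit into one FPGA cycle at high port counts, which motivates pipeline-friendly reformulations like the one in the paper.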
Using OpenCL to Enable Software-like Development of an FPGA-Accelerated Biophotonic
Cancer Treatment Simulator
The simulation of light propagation through tissues is important for medical applications,
such as photodynamic therapy (PDT) for cancer treatment. To optimize PDT an inverse
problem, which works backwards from a desired distribution of light to the parameters
that caused it, must be solved. These problems have no closed-form solution and therefore
must be solved numerically using an iterative method. This involves running many forward
light propagation simulations which is time-consuming and computationally intensive.
Currently, the fastest general software solver for this problem is FullMonteSW. It
models complex 3D geometries with tetrahedral meshes and uses Monte Carlo techniques
to model photon interactions with tissues. This work presents FullMonteFPGACL: an
FPGA-accelerated version of FullMonteSW using an Intel Stratix 10 FPGA and the Intel
FPGA SDK for OpenCL. FullMonteFPGACL has been validated and benchmarked using several
models and achieves improvements in performance (4x) and energy-efficiency (11x) over
the optimized and multi-threaded FullMonteSW implementation. We discuss methods for
extending the design to improve the performance and energy-efficiency ratios to 16x
and 17x, respectively. We achieved these gains by developing in an agile fashion using
OpenCL to facilitate quick prototyping and hardware-software partitioning. However,
achieving competitive area and performance required careful design of the hardware
pipeline and expression of its structure in OpenCL. This led to a hybrid design style
that can improve productivity when developing complex applications on an FPGA.
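The Monte Carlo photon kernel at the heart of such simulators can be sketched in a few lines. The optical coefficients below are toy values, and FullMonte's tetrahedral-mesh geometry, scattering directions, and roulette scheme are omitted; only the classic exponential free-path sampling with weight-based absorption is shown.

```python
import math, random

# Core Monte Carlo photon-transport step: sample an exponentially distributed
# free path with attenuation coefficient mu_t, then deposit the absorbed
# fraction mu_a/mu_t of the packet weight at each interaction.

def propagate(mu_t, mu_a, n_photons, rng):
    """Track photon packets until their weight decays; return mean path length."""
    total_path = 0.0
    for _ in range(n_photons):
        weight, path = 1.0, 0.0
        while weight > 1e-4:
            step = -math.log(rng.random()) / mu_t   # sample free path length
            path += step
            weight *= 1.0 - mu_a / mu_t             # deposit absorbed fraction
        total_path += path
    return total_path / n_photons

rng = random.Random(42)
mean_path = propagate(mu_t=10.0, mu_a=1.0, n_photons=2000, rng=rng)
assert 8.5 < mean_path < 9.1   # analytic mean is about 8.8 for these coefficients
```

Each photon's random walk is independent, which is why the workload maps so naturally onto deep, replicated FPGA pipelines.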
360° panoramic video provides an immersive Virtual Reality experience. However, rendering
360° videos consumes excessive energy on client devices. FPGAs are an ideal offloading
target for improving energy efficiency. However, a naive implementation of the processing
algorithm would lead to an excessive memory footprint that offsets the energy benefit.
In this paper, we propose an algorithm-architecture co-designed system that dramatically
reduces the on-chip memory requirement of VR video processing to enable FPGA offloading.
Evaluation shows that our system is able to achieve significant energy reduction with
no loss of performance compared to today’s off-the-shelf VR video rendering system.
The real-time synthesis of 3D spatial audio has many applications, from virtual reality
to navigation for the visually-impaired. Head-related transfer functions (HRTF) can
be used to generate spatial audio based on a model of the user’s head. Previous studies
have focused on the creation and interpolation of these functions with little regard
for real-time performance. In this paper, we present an FPGA-based platform for real-time
synthesis of spatial audio using FIR filters created from head-related transfer functions.
For performance reasons, we run filtering, crossfading, and audio output on FPGA fabric,
while calculating audio source locations and storing audio files on the CPU. We use
a head-mounted 9-axis IMU to track the user’s head in real-time and adjust relative
spatial audio locations to create the perception that audio sources are fixed in space.
Our system, running on a Xilinx Zynq Z-7020, is able to support 4X more audio sources
than a comparable GPU and 8X more sources than a CPU while maintaining sub-millisecond
latency and comparable power consumption. Furthermore, we show how our system can
be leveraged to communicate the location of landmarks and obstacles to a visually-impaired
user during a sailing race or other navigation scenario. We test our system with multiple
users and show that, as a result of our reduced latency, a user is able to locate
a virtual audio source with an extremely high degree of accuracy and navigate toward it.
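The HRTF-based synthesis described above reduces, at its core, to a per-ear FIR convolution of the mono source with a left/right pair of head-related impulse responses (HRIRs). The 3-tap "HRIRs" below are invented placeholders for illustration; measured HRTFs run to hundreds of taps.

```python
# Minimal sketch of HRTF-based spatialization: convolve one mono source with
# a left-ear and a right-ear FIR filter derived from the HRTF.

def fir(signal, taps):
    """Direct-form FIR convolution (what the FPGA fabric computes per ear)."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

def spatialize(mono, hrir_left, hrir_right):
    return fir(mono, hrir_left), fir(mono, hrir_right)

# A source to the listener's left: louder and earlier in the left ear.
mono = [1.0, 0.0, 0.0, 0.0]
left, right = spatialize(mono, [0.9, 0.1, 0.0], [0.0, 0.4, 0.1])
assert left == [0.9, 0.1, 0.0, 0.0]
assert right == [0.0, 0.4, 0.1, 0.0]
```

Because each source needs two such convolutions at audio rate, the number of simultaneous sources scales directly with how many FIR pipelines the fabric can host, which is where the reported 4X/8X advantage over GPU/CPU comes from.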
SESSION: Session: Deep Learning I
The Multidimensional Long Short-Term Memory (MD-LSTM) neural network is an extension of
the one-dimensional LSTM to data with more than one dimension, which allows MD-LSTM to
achieve state-of-the-art results in various applications, including handwritten text recognition,
medical imaging, and many more. However, efficient implementations suffer from highly
sequential execution that tremendously slows down both training and inference compared
to other neural networks. This is the primary reason that has prevented intensive research
involving MD-LSTM in recent years, despite large progress in microelectronics
and architectures. The main goal of the current research is to provide acceleration
for MD-LSTM inference, so as to open the door for efficient training that can boost
applications of MD-LSTM. Through this research we advocate that FPGAs are an alternative platform
for deep learning that can offer a solution in cases when the massive parallelism of
GPUs does not provide the necessary performance required by the application. In this
paper, we present the first hardware architecture for MD-LSTM. We conduct a systematic
exploration of the precision vs. accuracy trade-off using a challenging dataset for historical
document image binarization from the DIBCO 2017 contest, and the well-known MNIST dataset
for handwritten digit recognition. Based on our new architecture, we implement an FPGA-based
accelerator that outperforms an NVIDIA K80 GPU implementation in terms of runtime by
up to 50x and energy efficiency by up to 746x. At the same time, our accelerator demonstrates
higher accuracy and comparable throughput in comparison with state-of-the-art FPGA-based
implementations of multilayer perceptron for MNIST dataset.
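The sequential bottleneck of MD-LSTM has a simple structure: the state of cell (i, j) needs the states of (i-1, j) and (i, 0j-1), so only cells on the same anti-diagonal are independent. A minimal sketch of this wavefront dependency pattern (with a toy update function standing in for the real LSTM gates):

```python
def wavefront(H, W, x, f):
    """Process a 2-D grid in anti-diagonal order: all cells with i + j == d
    are mutually independent, but the diagonals must run sequentially."""
    h = [[0.0] * W for _ in range(H)]
    for d in range(H + W - 1):
        for i in range(max(0, d - W + 1), min(H, d + 1)):
            j = d - i
            up = h[i - 1][j] if i > 0 else 0.0
            left = h[i][j - 1] if j > 0 else 0.0
            h[i][j] = f(x[i][j], up, left)
    return h

# Toy update in place of the real LSTM cell: combine input with both
# predecessor states.
grid = wavefront(3, 3, [[1.0] * 3 for _ in range(3)],
                 lambda v, up, left: v + up + left)
```

The available parallelism thus grows and shrinks with the diagonal length, which is exactly the scheduling problem a dedicated hardware architecture can exploit better than a GPU's uniform SIMD lanes.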
Lightweight convolutional neural networks (LW-CNNs) such as MobileNet, ShuffleNet,
SqueezeNet, etc., have emerged in the past few years for fast inference on embedded
and mobile systems. However, lightweight operations limit their acceleration potential on
GPUs due to their memory-bounded nature and parallel mechanisms that are not
SIMD-friendly. This calls for more specific accelerators. In this paper, we propose
an FPGA-based overlay processor with a corresponding compilation flow for general
LW-CNN accelerations, called Light-OPU. Software-hardware co-designed Light-OPU reformulates
and decomposes lightweight operations for efficient acceleration. Moreover, our instruction
architecture considers sharing of the major computation engine between LW operations and
conventional convolution operations. This improves the run-time resource efficiency
and overall power efficiency. Finally, Light-OPU is software programmable, since loading
the compiled code and kernel weights completes the switch to a target network without
FPGA reconfiguration. Our experiments on seven major LW-CNNs show that Light-OPU achieves
5.5x better latency and 3.0x higher power efficiency on average compared with edge
GPU NVIDIA Jetson TX2. Furthermore, Light-OPU has 1.3x to 8.4x better power efficiency
compared with previous customized FPGA accelerators. To the best of our knowledge,
Light-OPU is the first in-depth study of an FPGA-based general processor for LW-CNN
acceleration with high performance and power efficiency, evaluated using
all major LW-CNNs including the newly released MobileNetV3.
The irregularity of recent Convolutional Neural Network (CNN) models, such as reduced
data reuse and parallelism caused by extensive network pruning and simplification,
creates new challenges for FPGA acceleration. Furthermore, without proper optimization,
there could be significant overheads when integrating FPGAs into existing machine
learning frameworks like TensorFlow. Such a problem is mostly overlooked by previous
studies. However, our study shows that a naive FPGA integration into TensorFlow could
lead to up to 8.45x performance degradation. To address the challenges mentioned above,
we propose several SW/HW co-design approaches to perform the end-to-end optimization
of deep learning applications. We present a flexible and composable architecture called
FlexCNN. It can deliver high computation efficiency for different types of convolution
layers using techniques including dynamic tiling and data layout optimization. FlexCNN
is further integrated into the TensorFlow framework with a fully-pipelined software-hardware
integration flow. This alleviates the high overheads of the TensorFlow-FPGA handshake
and other non-CNN processing stages. We use OpenPose, a popular CNN-based application
for human pose recognition, as a case study. Experimental results show that with the
FlexCNN architecture optimizations, we can achieve 2.3x performance improvement. The
pipelined integration stack leads to a further 5x speedup. Overall, the SW/HW co-optimization
produces a speedup of 11.5x and results in an end-to-end performance of 23.8FPS for
OpenPose with floating-point precision, which is the highest performance reported
for this application on FPGA in the literature.
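One of the named techniques, dynamic tiling, can be pictured with a deliberately simplified model (our own toy heuristic, not FlexCNN's actual algorithm): per layer, pick the largest spatial tile whose input slab fits in on-chip memory, so shallow layers get big tiles and channel-heavy layers get small ones.

```python
def pick_tile(height, width, channels, on_chip_elems):
    """Largest square spatial tile whose input slab (tile * tile * channels
    elements) fits in the on-chip budget. Illustrative only."""
    t = min(height, width)
    while t > 1 and t * t * channels > on_chip_elems:
        t -= 1
    return t
```

A per-layer choice like this is what lets a single architecture stay efficient across the very different convolution shapes found in one network.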
SESSION: Session: FPGA Architecture
This paper describes architectural enhancements in Intel® Agilex™ FPGAs and SoCs.
Agilex devices are built on Intel’s 10nm process and feature next-generation programmable
fabric, tightly coupled with a quad-core ARM processor subsystem, a secure device
manager, IO and memory interfaces, and multiple companion transceiver tile choices.
The Agilex fabric features multiple logic block enhancements that significantly improve
propagation delays and integrate more effectively with the second-generation HyperFlex™
pipelined routing architecture. Routing connections are re-designed to be point-to-point,
dropping intermediate connections featured in prior FPGA generations and replacing
them with a wider variety of shorter wire types. Fine-grain programmable clock skew
and time-borrowing were introduced throughout the fabric to augment the slack-balancing
capabilities of HyperFlex registers. DSP capabilities are also extended to natively
support new INT9/BFLOAT16/FP16 formats. Together, along with process and circuit enhancements,
these changes support more than 40% performance improvement over the Stratix® 10 family.
Straight to the Point: Intra- and Intercluster LUT Connections to Mitigate the Delay of Programmable Routing
Technology scaling makes metal delay ever more problematic, but routing between Look-Up
Tables (LUTs) still passes through a series of transistors. It seems wise to avoid
the corresponding delay whenever possible. Direct connections between LUTs, both within
and across multiple clusters, can eschew the transistor delays of crossbars, connection
blocks, and switch blocks. In this paper we investigate the usefulness of enhancing
classical Field-Programmable Gate Array (FPGA) architectures with direct connections
between LUTs. We present an efficient algorithm for automatically searching for the most
interesting patterns of such direct connections. Despite our methods being fairly
conservative and relying on the use of unmodified standard CAD tools, we obtain a
2.77% improvement of the geometric mean critical path delay of a standard benchmark
set, with improvement ranging from -0.17% to 7.3% for individual circuits. As modest
as these results may seem at first glance, we believe that they position direct connections
between LUTs as a promising topic for future research. Extending this work with dedicated
CAD algorithms and exploiting the increased possibilities for optimal buffering, diagonal
routing, and pipelining could prove direct connections important to the continuation
of performance improvement into next generation FPGAs.
We propose two tiers of modifications to FPGA logic cell architecture to deliver a
variety of performance and utilization benefits with only minor area overheads. In
the first tier, we augment existing commercial logic cell datapaths with a 6-input
XOR gate in order to improve the expressiveness of each element, while maintaining
backward compatibility. This new architecture is vendor-agnostic, and we refer to
it as LUXOR. We also consider a secondary tier of vendor-specific modifications to
both Xilinx and Intel FPGAs, which we refer to as X-LUXOR+ and I-LUXOR+ respectively.
We demonstrate that compressor tree synthesis using generalized parallel counters
(GPCs) is further improved with the proposed modifications. Using both the Intel adaptive
logic module and the Xilinx slice at the 65nm technology node for a comparative study,
it is shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for
LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We demonstrate
that LUXOR can deliver an average reduction of 13-19% in logic utilization on micro-benchmarks
from a variety of domains. BNN benchmarks benefit the most with an average reduction
of 37-47% in logic utilization, which is due to the highly-efficient mapping of the
XnorPopcount operation on our proposed LUXOR+ logic cells.
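The XnorPopcount operation mentioned above is the binarized-network dot product: with weights and activations encoded as single bits (+1 as 1, -1 as 0), a dot product collapses to an XNOR followed by a population count. A software model of the arithmetic:

```python
def xnor_popcount(a, b, nbits):
    """Binarized dot product over nbits lanes: bit i encodes +1 (1) or -1 (0).
    Matching bits contribute +1, differing bits -1, so
    dot = 2 * popcount(xnor(a, b)) - nbits."""
    mask = (1 << nbits) - 1
    same = ~(a ^ b) & mask            # XNOR restricted to the low nbits
    return 2 * bin(same).count("1") - nbits
```

Because the whole operation is bitwise logic plus a counter, it maps very densely onto LUT fabric, which is why logic-cell tweaks such as LUXOR+ pay off most on BNN workloads.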
SESSION: Invited Panel
FPGAs will Never be the Same Again: How the Newest FPGA Architectures are Totally Disrupting the Entire FPGA Ecosystem
as We Know It
Since the inception of FPGAs over 2 decades ago, the micro-architectures and macro-architectures
of FPGAs across all FPGA vendors have been converging strongly to the point that comparable
FPGAs from the main FPGA vendors have virtually the same use models, and the same programming
models. User designs were getting easier to port from one vendor to the other with
every generation. Recent developments from different FPGA vendors targeting the
most advanced semiconductor technology nodes are an abrupt and disruptive break from
this trend, especially at the macro-architectural level.
SESSION: Session: Keynote II
FPGAs provide significant advantages in throughput, latency, and energy efficiency
for implementing low-latency, compute-intensive applications when compared to general-purpose
CPUs and GPUs. Over the last decade, FPGAs have evolved into highly configurable SoCs
with on-chip CPUs, domain-specific programmable accelerators, and flexible connectivity
options. Recently, Xilinx introduced a new heterogeneous compute architecture, the
Adaptive Compute Acceleration Platform (ACAP), with significantly more flexibility
and performance to address an evolving set of new applications such as machine learning.
This advancement on the device side is accompanied by similar advances in higher-level
programming approaches that make FPGAs and ACAPs significantly easier to use for a wide
range of applications. Xilinx Vitis Unified Software Platform is a comprehensive development
environment to build and seamlessly deploy accelerated applications on Xilinx platforms
including Alveo cards, FPGA-instances in the cloud, and embedded platforms. It addresses
the three major industry trends: the need for heterogeneous computing, applications
that span cloud to edge to end-point, and AI proliferation. Vitis supports application
programming using C, C++ and OpenCL, and it enables the development of large-scale
data processing and machine learning applications using familiar, higher-level frameworks
such as TensorFlow and Spark. To facilitate communication between the host application
and accelerators, Xilinx Runtime library (XRT) provides APIs for accelerator life-cycle
management, accelerator execution management, memory allocation, and data communication
between the host application and accelerators. In addition, a rich set of performance-optimized,
open-source libraries significantly ease the application development. Vitis AI, an
integral part of Vitis, enables AI inference acceleration on Xilinx platforms. It
supports the industry’s leading deep learning frameworks like TensorFlow and Caffe, and
offers a comprehensive suite of tools and APIs to prune, quantize, optimize, and compile
pre-trained models to achieve the highest AI inference performance on Xilinx platforms.
This talk provides an overview of Vitis and Vitis AI development environments.
SESSION: Session: High-Level Abstractions and Tools II
Debugging consumes a large portion of FPGA design time, and with the growing complexity
of traditional FPGA systems and the additional verification challenges posed by multiple
FPGAs interacting within data centers, debugging productivity is becoming even more
important. Current debugging flows either depend on simulation, which is extremely
slow but has full visibility, or on hardware execution, which is fast but provides
very limited control and visibility. In this paper, we present StateMover, a checkpointing-based
debugging framework for FPGAs, which can move design state back and forth between
an FPGA and a simulator in a seamless way. StateMover leverages the speed of hardware
execution and the full visibility and ease-of-use of a simulator. This enables a novel
debugging flow that has a software-like combination of speed with full observability
and controllability. StateMover adds minimal hardware to the design to safely stop
the design under test so that its state can be extracted or modified in an orderly
manner. The added hardware has no timing overhead and a very small area overhead.
StateMover currently supports Xilinx UltraScale devices, and its underlying techniques
and tools can be ported to other device families that support configuration readback.
Moving the state from/to an FPGA to/from a simulator can be performed in a few seconds
for large FPGAs, enabling a new debugging flow.
Commercial high-level synthesis tools typically produce statically scheduled circuits.
Yet, effective C-to-circuit conversion of arbitrary software applications calls for
dataflow circuits, as they can efficiently handle variable latencies (e.g., caches)
and unpredictable memory dependencies. Dataflow circuits exhibit an unconventional
property: registers (usually referred to as “buffers”) can be placed anywhere in the
circuit without changing its semantics, in strong contrast to what happens in traditional
datapaths. Yet, although functionally irrelevant, this placement has a significant
impact on the circuit’s timing and throughput. In this work, we show how to strategically
place buffers into a dataflow circuit to optimize its performance. Our approach extracts
a set of choice-free critical loops from arbitrary dataflow circuits and relies on
the theory of marked graphs to optimize the buffer placement and sizing. We demonstrate
the performance benefits of our approach on a set of dataflow circuits obtained from high-level synthesis.
This paper presents an extension to the PathFinder FPGA routing algorithm, which enables
it to deliver FPGA designs free from the risk of crosstalk attacks. Crosstalk side-channel
attacks are a real threat in large designs assembled from various IPs, where some
IPs are provided by trusted and some by untrusted sources. It suffices that a ring-oscillator
based sensor is conveniently routed next to a signal that carries secret information
(for instance, a cryptographic key), for this information to possibly get leaked.
To address this security concern, we apply several different strategies and evaluate
them on benchmark circuits from Verilog-to-Routing tool suite. Our experiments show
that, for a quite conservative scenario where 10-20% of all design nets are carrying
sensitive information, the crosstalk-attack-aware router ensures that no information
leaks, at a very small penalty: a 1.58-7.69% increase in minimum routing channel width
and 0.12-1.18% increase in critical path delay, on average. In comparison, in an AES-128
cryptographic core, less than 5% of nets carry the key or the intermediate state values
of interest to an attacker, making it highly likely that the overhead for obtaining
a secure design is, in practice, even smaller.
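The flavour of such a router extension can be sketched abstractly (a hypothetical cost model in the spirit of PathFinder's negotiated congestion, not necessarily the paper's exact strategy): routing resources adjacent to nets carrying sensitive signals are simply made prohibitively expensive for other nets.

```python
def node_cost(base_delay, present_use, history, adjacent_to_sensitive,
              crosstalk_penalty=1e9):
    """PathFinder-style negotiated cost -- delay scaled by present and
    historical congestion -- with an extra term that makes wires next to
    key-carrying nets effectively unusable. Illustrative sketch only."""
    cost = base_delay * (1.0 + present_use) * (1.0 + history)
    if adjacent_to_sensitive:
        cost += crosstalk_penalty
    return cost
```

With such a penalty in place, the negotiation that normally resolves congestion also steers untrusted nets away from sensitive wires, at the channel-width and delay costs quoted above.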
Embedded and cyber-physical systems are pervading all aspects of our lives, including
sensitive and critical ones. As a result, they are an alluring target for cyber attacks.
These systems, whose implementation is often based on reconfigurable hardware, are
typically deployed in places accessible to attackers. Therefore, they require protection
against tampering and side-channel attacks. However, a side-channel resistant implementation
of a security primitive is not sufficient, as it can be weakened by an adversary,
aging, or environmental factors. To detect this, legitimate users should be able to
evaluate the side-channel resistance of their systems not only when deploying them
for the first time, but also during their entire service life. The most widespread
and de facto standard methodology for measuring power side-channel leakage uses Welch’s
t-test. In practice, collecting the data for the t-test requires physical access to
the device, a device-specific test setup, and the equipment for measuring the power
consumption during device operation. Consequently, only a small number of cyber-physical
systems deployed in the field can be tested this way and the tests to reevaluate the
device resistance to side-channel attacks cannot be easily repeated. To address these
issues, we present a design and an FPGA implementation of a built-in test for self-evaluation
of the resistance to first-order power side-channel attacks. Once our test is triggered,
the FPGA measures its own internal power-supply voltage and computes the t-test statistic
in real time. Experimental results on two different implementations of the AES-128
algorithm demonstrate that the self-evaluation test is very reliable. We believe that
this work is an important step towards the development of security sensors for the
next generation of safe and robust cyber-physical systems.
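The t-test at the heart of this methodology is easy to state: for two sets of power measurements (e.g., traces for fixed vs. random plaintexts), leakage is flagged when the magnitude of Welch's t-statistic exceeds a threshold (conventionally 4.5 in TVLA-style testing). A direct software implementation of the statistic the FPGA computes in real time:

```python
import math

def welch_t(x, y):
    """Welch's t-statistic: (mean_x - mean_y) / sqrt(s2_x/n_x + s2_y/n_y),
    using unbiased sample variances. Large |t| indicates that the two
    trace populations are distinguishable, i.e., first-order leakage."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    s2x = sum((v - mx) ** 2 for v in x) / (nx - 1)
    s2y = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(s2x / nx + s2y / ny)
```

Since means and variances can be accumulated incrementally, the statistic suits a streaming hardware implementation fed by the on-chip voltage sensor.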
SESSION: Session: Applications II
FPGA emulation is a promising approach to accelerating Network-on-Chip (NoC) modeling,
which has traditionally relied on software simulators. In most early studies of FPGA-based
NoC emulators, only synthetic workloads like uniform and bit permutations were considered.
Although a set of carefully designed synthetic workloads can provide relatively thorough
coverage of the characteristics of the NoC under evaluation, they alone are insufficient,
especially when the NoC needs to be optimized for specific applications. In such cases,
trace-driven workloads are effective. However, there is a problem with conventional
trace-driven workloads that has been pointed out by some recent studies: the network
load and congestion may be distorted because dependencies between packets are not
considered. These studies also provide infrastructures for extending existing software
simulators to enforce dependencies between packets. Unfortunately, enforcing dependencies
between packets is not trivial in the FPGA emulation approach. Therefore, although
there are some recent FPGA-based NoC emulators supporting trace-driven workloads,
most of them ignore packet dependencies. In this paper, we first clarify the challenges
of supporting trace-driven workloads with dependencies between packets taken into
account in the FPGA emulation approach. We then propose efficient methods and architectures
to tackle these challenges and build an FPGA-based NoC emulator, which we call DNoC,
based on the proposals. Our evaluation results show that (1) on a VC707 FPGA board,
DNoC achieves an average speed of 10,753K cycles/s when emulating an 8×8 NoC with
trace data collected from full-system simulation of the PARSEC benchmark suite, which
is 274x higher than the speed reported in a recent related work on dependency-driven
trace-based NoC emulation on FPGAs; (2) Compared to BookSim, one of the most popular
NoC simulators, DNoC is 395x faster while providing the same results; (3) DNoC can
scale to a 4,096-node NoC on a VC707 board, and the size of the largest NoC depends
only on the on-chip memory capacity of the target FPGA.
Sorting is a fundamental operation in many applications such as databases, search,
and social networks. Although FPGAs have been shown very effective at sorting data
sizes that fit on chip, systems that sort larger data sets by shuffling data on and
off chip are bottlenecked by costly merge operations or data transfer time. We propose
a new technique for sorting large data sets, which uses a variant of the samplesort
algorithm on a server with a PCIe-connected FPGA. Samplesort avoids merging by randomly
sampling values to determine how to partition data into non-overlapping buckets that
can be independently sorted. The key to our design is a novel parallel multi-stage
hardware partitioner, which is a scalable high-throughput solution that greatly accelerates
the samplesort partitioning step. Using samplesort for FPGA-accelerated sorting provides
several advantages over mergesort, while also presenting a number of new challenges
that we address with cooperation between the FPGA and the software running on the
host CPU. We prototype our design using Amazon Web Services FPGA instances, which
pair a Xilinx Virtex UltraScale+ FPGA with a high-performance server. Our experiments
demonstrate that our prototype system sorts 2^30 key-value records with a speed of
7.2 GB/s, limited only by the on-board DRAM capacity and available PCIe bandwidth.
When sorting 2^30 records, our system exhibits a 37.4x speedup over the widely used
GNU parallel sort on an 8-thread state-of-the-art CPU.
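The software view of samplesort shows why no merge step is needed: splitters drawn from a random sample define non-overlapping value ranges, so independently sorted buckets simply concatenate. A sketch (bucket count and sample size are arbitrary here; the paper's contribution is the hardware partitioner that performs the bucketing step at line rate):

```python
import bisect
import random

def samplesort(data, buckets=4, seed=0):
    rng = random.Random(seed)
    # Sample values, sort the sample, and pick evenly spaced splitters.
    sample = sorted(rng.sample(data, min(len(data), 4 * buckets)))
    splitters = [sample[(i + 1) * len(sample) // buckets - 1]
                 for i in range(buckets - 1)]
    # Partition: bisect maps each value to its (non-overlapping) bucket.
    parts = [[] for _ in range(buckets)]
    for v in data:
        parts[bisect.bisect_left(splitters, v)].append(v)
    out = []
    for p in parts:              # each bucket is sorted independently --
        out.extend(sorted(p))    # in the paper's system, in parallel
    return out
```

Because the buckets never overlap, the final pass is a plain concatenation rather than the costly k-way merge that bottlenecks mergesort-based sorters.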
K-Means is a popular clustering algorithm widely used and extensively studied in the
literature. In this paper we explore the challenges and opportunities in using low
precision input in conjunction with a standard K-Means algorithm as a way to improve
the memory bandwidth utilization on hardware accelerators. Low precision input through
quantization has become a standard technique in machine learning to reduce computational
costs and memory traffic. When applied in FPGAs, several issues need to be addressed.
First and foremost is the overhead of storing the data at different precision levels
since, depending on the training objective, different levels of precision might be
needed. Second, the FPGA design needs to accommodate varying precision without requiring
reconfiguration. To address these concerns, we propose Bit-Serial K-Means (BiS-KM),
a combination of a hybrid memory layout supporting data retrieval at any level of
precision, a novel FPGA design based on bit-serial arithmetic, and a modified K-Means
algorithm tailored to FPGAs. We have tested BiS-KM with various data sets and compared
our design with a state-of-the-art FPGA accelerator. BiS-KM achieves an almost linear
speedup as precision decreases, providing a more effective way to perform K-Means clustering.
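The bit-serial idea can be modelled in software (a simplification of ours, not the BiS-KM datapath): after k serial cycles an MSB-first engine has seen only the k most significant bits of every value, so cluster assignments are computed on progressively refined approximations.

```python
def top_bits(x, total_bits, k):
    """Value reconstructed from the k most significant of total_bits bits --
    what a bit-serial, MSB-first engine knows after k cycles."""
    shift = total_bits - k
    return (x >> shift) << shift

def nearest_center(point, centers, total_bits=8, k=8):
    """1-D K-Means assignment step at reduced precision (sketch)."""
    p = top_bits(point, total_bits, k)
    return min(range(len(centers)),
               key=lambda i: abs(p - top_bits(centers[i], total_bits, k)))
```

When a few high-order bits already decide the assignment, the remaining bits need never be fetched, which is the source of the near-linear speedup as precision drops.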
Data movement is the dominating factor affecting performance and energy in modern
computing systems. Consequently, many algorithms have been developed to minimize the
number of I/O operations for common computing patterns. Matrix multiplication is no
exception, and lower bounds have been proven and implemented both for shared and distributed
memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing
algorithms, as they offer full control of memory accesses to the programmer. While
bounds developed in the context of fixed architectures still apply to these platforms,
the spatially distributed nature of their computational and memory resources requires
a decentralized approach to optimize algorithms for maximum hardware utilization.
We present a model to optimize matrix multiplication for FPGA platforms, simultaneously
targeting maximum performance and minimum off-chip data movement, within constraints
set by the hardware. We map the model to a concrete architecture using a high-level
synthesis tool, maintaining a high level of abstraction, supporting arbitrary
data types, and enabling maintainability and portability across FPGA devices. Kernels
generated from our architecture are shown to offer competitive performance in practice,
scaling with both compute and memory resources. We offer our design as an open source
project to encourage the open development of linear algebra and I/O minimizing algorithms
on reconfigurable hardware platforms.
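The off-chip traffic being minimized follows a well-known pattern: with square tiles of side t that fit on chip, an n-by-n multiplication performs (n/t)^3 tile products, each loading one t-by-t tile of A and one of B, for 2*n^3/t element loads in total, so I/O shrinks linearly as on-chip memory permits larger tiles. A toy accounting of this (our illustration, not the paper's full model):

```python
def offchip_loads(n, t):
    """Element loads from off-chip memory for an n x n matrix multiplication
    with t x t tiles (n divisible by t): (n/t)**3 tile products, each
    loading two t*t tiles, i.e. 2 * n**3 / t elements in total."""
    assert n % t == 0
    return 2 * (n // t) ** 3 * t * t
```

Doubling the tile size halves the off-chip traffic, which is why the model trades on-chip buffering against compute resources to find the best feasible tile.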
SESSION: Session: Deep Learning II
Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning
model for representation learning on graphs. It is challenging to accelerate training
of GCNs, due to (1) substantial and irregular data communication to propagate information
within the graph, and (2) intensive computation to propagate information along the
neural network layers. To address these challenges, we design a novel accelerator
for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture
co-optimizations. We first analyze the computation and communication characteristics
of various GCN training algorithms, and select a subgraph-based algorithm that is
well suited for hardware execution. To optimize the feature propagation within subgraphs,
we propose a light-weight pre-processing step based on a graph theoretic approach.
Such pre-processing performed on the CPU significantly reduces the memory access requirements
and the computation to be performed on the FPGA. To accelerate the weight update in
GCN layers, we propose a systolic array based design for efficient parallelization.
We integrate the above optimizations into a complete hardware pipeline, and analyze
its load-balance and resource utilization by accurate performance modeling. We evaluate
our design on a Xilinx Alveo U200 board hosted by a 40-core Xeon server. On three
large graphs, we achieve an order of magnitude training speedup with negligible accuracy
loss, compared with state-of-the-art implementation on a multi-core platform.
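A single GCN layer combines the two cost sources the abstract names: sparse, irregular feature propagation over the graph, and a dense weight update. A minimal pure-Python model (dense matrices here for brevity; in the real workload the normalized adjacency is sparse and the weight update is what maps onto the systolic array):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gcn_layer(adj_norm, H, W):
    """One GCN layer: H' = ReLU(adj_norm @ H @ W) -- neighbour aggregation
    followed by the dense per-node feature transformation."""
    Z = matmul(matmul(adj_norm, H), W)
    return [[max(0.0, v) for v in row] for row in Z]
```

Splitting the work as the paper does -- graph-dependent pre-processing on the CPU, the regular dense update on the FPGA -- matches each half of this computation to the hardware that handles it best.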
Spectral-domain CNNs have been shown to be more efficient than traditional spatial
CNNs in terms of reducing computation complexity. However, they come with a ‘kernel
explosion’ problem that, even after compression (pruning), imposes a high memory burden
and off-chip bandwidth requirement for kernel access. This creates a performance gap
between the potential acceleration offered by compression and actual FPGA implementation
performance, especially for low-latency CNN inference. In this paper, we develop a
principled approach to overcoming this performance gap and designing a low-latency,
low-bandwidth, spectral sparse CNN accelerator on FPGAs. First, we analyze the bandwidth-storage
tradeoff of sparse convolutional layers and locate communication bottlenecks. We then
develop a dataflow for flexibly optimizing data reuse in different layers to minimize
off-chip communication. Finally, we propose a novel scheduling algorithm to optimally
schedule the on-chip memory access of multiple sparse kernels and minimize read conflicts.
On a state-of-the-art FPGA platform, our design reduces data transfers by 42% with
DSP utilization up to 90%, and achieves an inference latency of 9 ms for VGG16, compared
to the baseline state-of-the-art latency of 68 ms.
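The spectral trick underlying such accelerators is the convolution theorem: pointwise products of transforms replace sliding-window multiply-accumulates. A self-contained check with a naive DFT (illustrative only; real designs use FFTs and operate on the paper's sparse kernels):

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (and its inverse)."""
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(s * 2j * cmath.pi * m * k / n)
               for k in range(n)) for m in range(n)]
    return [v / n for v in out] if inverse else out

def circular_conv(a, b):
    """Convolution theorem: IDFT(DFT(a) * DFT(b)) equals the circular
    convolution of a and b."""
    spec = [x * y for x, y in zip(dft(a), dft(b))]
    return [v.real for v in dft(spec, inverse=True)]
```

The 'kernel explosion' the abstract refers to arises because each spatial kernel becomes a full-size spectral kernel, which is what makes on-chip kernel storage and scheduling the binding constraints.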
SESSION: Session: High-Level Synthesis and Tools
All software ultimately relies on hardware functioning correctly. Hardware correctness
is becoming increasingly important due to the growing use of custom accelerators using
FPGAs to speed up applications on servers. Furthermore, the increasing complexity
of hardware also leads to ever more reliance on automation, meaning that the correctness
of synthesis tools is vital for the reliability of the hardware. This paper aims to
improve the quality of FPGA synthesis tools by introducing a method to test them automatically
using randomly generated, correct Verilog, and checking that the synthesised netlist
is always equivalent to the original design. The main contributions of this work are
twofold: firstly a method for generating random behavioural Verilog free of undefined
values, and secondly a Verilog test case reducer used to locate the cause of the bug
that was found. These are implemented in a tool called Verismith. This paper also
provides a qualitative and quantitative analysis of the bugs found in Yosys, Vivado,
XST and Quartus Prime. Every synthesis tool except Quartus Prime was found to introduce
discrepancies between the netlist and the design. In addition to that, Vivado and
a development version of Yosys were found to crash when given valid input. Using Verismith,
eleven bugs were reported to tool vendors, of which six have already been fixed.
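The generation half of this approach is easy to picture (our toy sketch, far simpler than Verismith itself): build random expressions from operators that are fully defined for all two-state inputs, wrap them in a module, and later simulate or formally compare the synthesised netlist against the original.

```python
import random

def rand_expr(rng, depth=3):
    """Random Verilog expression over two 8-bit inputs, using only
    operators that never produce x/z from defined inputs."""
    if depth == 0:
        return rng.choice(["a", "b", "8'd%d" % rng.randrange(256)])
    op = rng.choice(["+", "&", "|", "^"])
    return "(%s %s %s)" % (rand_expr(rng, depth - 1), op,
                           rand_expr(rng, depth - 1))

def rand_module(seed=0):
    """Wrap a random expression in a synthesizable Verilog module."""
    rng = random.Random(seed)
    return ("module top(input [7:0] a, input [7:0] b, output [7:0] y);\n"
            "  assign y = %s;\n"
            "endmodule\n" % rand_expr(rng))
```

Avoiding undefined values is what makes the equivalence check decisive: any mismatch between design and netlist is then a genuine synthesis bug rather than an artifact of x-propagation.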
A central task in high-level synthesis is scheduling: the allocation of operations
to clock cycles. The classic approach to scheduling is static, in which each operation
is mapped to a clock cycle at compile-time, but recent years have seen the emergence
of dynamic scheduling, in which an operation’s clock cycle is only determined at run-time.
Both approaches have their merits: static scheduling can lead to simpler circuitry
and more resource sharing, while dynamic scheduling can lead to faster hardware when
the computation has non-trivial control flow.
In this work, we seek a scheduling approach that combines the best of both worlds.
Our idea is to identify the parts of the input program where dynamic scheduling does
not bring any performance advantage and to use static scheduling on those parts. These
statically-scheduled parts are then treated as black boxes when creating a dataflow
circuit for the remainder of the program, which can benefit from the flexibility of dynamic scheduling.
An empirical evaluation on a range of applications suggests that by using this approach,
we can obtain 74% of the area savings that would be made by switching from dynamic
to static scheduling, and 135% of the performance benefits that would be made by switching
from static to dynamic scheduling.
Boyi: A Systematic Framework for Automatically Deciding the Right Execution Model of OpenCL
Applications on FPGAs
FPGA vendors provide OpenCL software development kits for easier programmability,
with the goal of replacing the time-consuming and error-prone register-transfer level
(RTL) programming. Many studies explore optimization methods (e.g., loop unrolling,
local memory) to accelerate OpenCL programs running on FPGAs. These programs typically
follow the default OpenCL execution model, where a kernel deploys multiple work-items
arranged into work-groups. However, the default execution model is not always a good
fit for an application mapped to the FPGA architecture, which is very different from
the multithreaded architecture of GPUs, for which OpenCL was originally designed.
In this work, we identify three other execution models that can better utilize the
FPGA resources for the OpenCL applications that do not fit well into the default execution
model. These three execution models are based on two OpenCL features devised for FPGA
programming (namely, single work-item kernel and OpenCL channel). We observe that
the selection of the right execution model determines the performance upper bound
of a particular application, which can vary by two orders of magnitude between the most
suitable execution model and the most unsuitable one. However, there is no way to
select the most suitable execution model other than empirically exploring the optimization
space for the four of them, which can be prohibitive. To help FPGA programmers identify
the right execution model, we propose Boyi, a systematic framework that makes automatic
decisions by analyzing OpenCL programming patterns in an application. After finding
the right execution model with the help of Boyi, programmers can apply other conventional
optimizations to reach the performance upper bound. Our experimental evaluation shows
that Boyi can 1) accurately determine the right execution model, and 2) greatly reduce
the exploration space of conventional optimization methods.
SESSION: Poster Session I
Programming abstractions decrease the cognitive gap between program idealization and
expression. In the software domain, this high-level expressive power is achieved through
layered abstractions – virtual machines, compilers, operating systems – which translate,
at design and runtime, programmer visible code into hardware-compatible code. While
this paradigm is ideal for static, i.e., unmodifiable, hardware, several of these
abstractions break down when programming configurable hardware. State of the art hardware/software
co-design techniques (e.g., High Level Synthesis (HLS), Intermediate Fabrics) are,
for the most part, ad hoc patches to the traditional abstraction stack, applicable
only to specific toolchains or software components. In this paper, we survey current
hardware design and hardware/software co-design abstractions, from the perspective
of the design language/toolchain. We perform a systematic analysis of different design
paradigms, including HLS, Domain Specific Languages (DSL), and new-generation Hardware
Description Languages (HDL). We analyze how these paradigms differ in expressiveness,
support for hardware/software interaction, hierarchy and modularity, HDL interoperability,
and interface with the outside world.
With the technical advance of quantum computers, which can solve problems intractable
for conventional computers, many of the currently used public-key cryptosystems become
vulnerable. Recently proposed post-quantum cryptography (PQC) is secure against both
classical and quantum computers, but existing embedded systems such as smart cards
cannot easily support the PQC algorithms due to their much larger key sizes and more
complex arithmetic. To accelerate the PQC algorithms, embedded systems have to embed
the PQC hardware blocks, which can lead to huge hardware design costs. Although High-Level
Synthesis (HLS) helps significantly reduce the design costs, current HLS frameworks
produce inefficient hardware design for the PQC algorithms in terms of area and performance.
This work analyzes common features of the PQC algorithms and proposes a new pipeline-aware
logic deduplication method in HLS. The proposed method shares commonly invoked logic
across hardware design while considering load balancing in pipeline and resolving
dynamic memory accesses. This work implements FPGA hardware design of seven PQC algorithms
in the round 2 candidates from the National Institute of Standards and Technology
(NIST) PQC standardization process. Compared to a commercial HLS framework, the proposed
method achieves a 34.5% reduction in area-delay product.
The use of parallelism has increased drastically in recent years. Parallel platforms
come in many forms: multi-core processors, embedded hybrid solutions such as multi-processor
system-on-chip with reconfigurable logic, and cloud datacenters with multi-core and
reconfigurable logic. These heterogeneous platforms can offer massive parallelism,
but it can be difficult to exploit, particularly when combining solutions constructed
with multiple architectures. To program a heterogeneous platform, a developer must
master different programming languages, tools, and APIs to program each aspect of
the platform separately, and must then find a means to connect them with communication
interfaces. The motivation of this work is to provide a single programming model and
framework for hardware-software stream programs on heterogeneous platforms. Our framework,
StreamBlocks, starts with a dataflow programming model for both embedded and datacenter
platforms. Dataflow programming is an alternative model of computation that captures
both data and task parallelism. We describe a compiler infrastructure for CAL dataflow
programs for hardware code generation. CAL is a dataflow programming language that
can express multiple dataflow models of computation. StreamBlocks is based on the
Tycho compiler infrastructure, which transforms each actor in a dataflow program to
an abstract machine model, called the Actor Machine. Actor Machines provide a unified
model for executing actors in both hardware and software and permit our compiler extension
and backend to generate efficient FPGA code. Unlike other systems, the programming
model and compiler directly support hardware-software systems in which an FPGA functions
as a coprocessor to a CPU. This permits easy integration with existing workflows.
Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency
compared to manual RTL designs. We study the timing issues in a diverse set of nine
realistic HLS designs and observe that in most cases the frequency degradation is
related to the signal broadcast structures. In this work, we classify the common broadcast
types in HLS designs, including the data signal broadcast and two types of control
signal broadcast: the pipeline control broadcast and the synchronization signal broadcast.
We further identify several common limitations of the current HLS tools, which lead
to improper handling of the broadcasts. First, the HLS delay model does not consider
the extra delay caused by broadcasts; thus, the scheduling results will be suboptimal.
To solve the issue, we implement a set of comprehensive synthetic designs and benchmark
the extra delay to calibrate the HLS delay model. Second, HLS tools adopt back-pressure
signals for pipeline control, which leads to large broadcasts. Instead, we propose
to use the skid-buffer-based pipeline control, where the back-pressure signal is removed,
and an extra skid-buffer is used for flow-control. We use dynamic programming to minimize
the area of the extra FIFO. Third, there exist redundant synchronizations among concurrent
modules that may lead to huge broadcasts. We propose methods to identify and prune
unnecessary synchronization signals. Our solutions boost the frequency of nine real-world
HLS benchmarks by 53% on average, with marginal area and latency overhead. In some
cases, the gain is more than 100 MHz.
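The skid-buffer flow control described above can be illustrated with a small cycle-level model. This is an illustrative Python sketch, not the paper's RTL; the class and simulation names are ours. The key idea it shows: a two-entry buffer lets a stage keep producing for one cycle after the downstream stalls, so the back-pressure signal need not be broadcast to every producer in the same cycle.

```python
class SkidBuffer:
    """Cycle-level sketch of skid-buffer flow control: the upstream sees a
    registered ready signal, and a second buffer slot absorbs the one item
    already in flight when a downstream stall is observed a cycle late."""

    def __init__(self):
        self.slots = []  # holds at most two buffered items

    def upstream_ready(self):
        # Advertise ready while a spare slot exists to catch the in-flight item.
        return len(self.slots) < 2

    def push(self, item):
        assert len(self.slots) < 2, "overflow: upstream pushed while not ready"
        self.slots.append(item)

    def pop(self, downstream_ready):
        # Downstream drains one item per cycle when it is ready.
        if downstream_ready and self.slots:
            return self.slots.pop(0)
        return None


def simulate(inputs, stall_cycles):
    """Feed `inputs` through the buffer; downstream stalls on `stall_cycles`."""
    buf, out, cycle = SkidBuffer(), [], 0
    pending = list(inputs)
    while pending or buf.slots:
        if pending and buf.upstream_ready():
            buf.push(pending.pop(0))
        item = buf.pop(cycle not in stall_cycles)
        if item is not None:
            out.append(item)
        cycle += 1
    return out
```

Running `simulate(list(range(5)), {1, 2})` returns all items in order with no overflow, even though the downstream stalls for two cycles: the skid slots absorb the slack that would otherwise require an immediate, widely broadcast stall signal.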
Current High-Level Synthesis frameworks provide a productive hardware development
methodology where hardware accelerators are generated directly from high-level languages
like C/C++ or OpenCL, allowing software developers to quickly accelerate their applications.
However, the hardware generated by these frameworks is sub-optimal compared to
hand-optimized RTL modules. A hybrid development approach would leverage the productive
software stack and hardware board support package that HLS provides but allow for
fine-grained optimization using RTL components. In this work, we introduce a new software-hardware
co-design framework that integrates OpenCL/OpenACC with RTL code enabling direct execution
on FPGAs as well as full emulation with a high-speed simulator to reduce the development time.
The P4 language and the PISA architecture have revolutionized the field of networking.
Thanks to P4 and PISA, new networking applications and protocols can be rapidly evaluated
on high-performance switches. While P4 allows the expression of a wide range of packet
processing algorithms, current programmable switch architectures limit the overall
processing flexibility. To address this shortcoming, recent works have proposed to implement
PISA on FPGAs. However, little effort has been devoted to analyzing whether FPGAs are
good candidates to implement PISA. In this work, we take a step back and evaluate
the micro-architecture efficiency of various PISA blocks. Using a theoretical analysis
and experiments, we demonstrate that current FPGA architectures drastically limit the
performance of a few PISA blocks. Thus, we explore two avenues to alleviate these
shortcomings. First, we identify some network applications that are well tailored
to current FPGAs. Second, to support a wider range of networking applications, we
propose modifications to the FPGA architecture which can also be of interest outside
the networking field.
The resource requirements of application-specific accelerators challenge embedded
system designers who have a tight area budget but must cover a range of possible software
kernels. We propose an early detection methodology (ReconfAST) to identify computationally
similar synthesizable kernels to build Shared Accelerators (SAs). SAs are specialized
hardware accelerators that execute very different software kernels but share the common
hardware functions between them. SAs increase the fraction of workloads covered by
specialized hardware by detecting similarities in dataflow and control flow between
seemingly very different workloads. Existing methods either use dynamic traces or
analyze register transfer level (RTL) implementations to find these similarities,
which requires deep knowledge of RTL and a time-consuming design process.
ReconfAST leverages abstract syntax trees (ASTs) generated from LLVM's Clang to discover
similar kernels among workloads. ASTs provide the right level of abstraction to detect
commonalities. ASTs are compact, unlike control and dataflow representations, but
contain extra syntax and variable node ordering that complicates workload comparison.
ReconfAST transforms ASTs into a new clustered-AST (CAST) representation, removes
unneeded nodes, and uses a regular expression to match common node configurations.
The approach is validated using MachSuite accelerator benchmarks.
On FPGAs, a good Shared Accelerator accelerates workloads by an average of 5x and
reduces the resources required for FPGA implementations by 37% in FFs, 16% in DSPs,
and 10% in LUTs on average over a dedicated accelerator implementation.
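As an illustration of AST-level similarity detection, the following sketch compares node-type profiles of two kernels. It is our own toy example using Python's `ast` module rather than Clang ASTs, and a simple multiset overlap rather than ReconfAST's CAST matching; the point is that identifiers and node ordering are ignored, so structurally similar kernels score high.

```python
import ast
from collections import Counter


def node_profile(src):
    """Multiset of AST node types; names and ordering are ignored, which is
    what makes structurally similar kernels directly comparable."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(src)))


def similarity(src_a, src_b):
    """Jaccard-style overlap of the two node-type multisets, in [0, 1]."""
    pa, pb = node_profile(src_a), node_profile(src_b)
    return sum((pa & pb).values()) / sum((pa | pb).values())


DOT = """
def dot(x, y):
    s = 0
    for i in range(len(x)):
        s += x[i] * y[i]
    return s
"""

AXPY = """
def axpy(a, x, y):
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y
"""

HELLO = """
def hello(name):
    return "hi " + name
"""
```

`dot` and `axpy` share the same loop/subscript/multiply structure and score far higher against each other than either does against the unrelated `hello`, despite having no names in common.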
Prevalent hardware description languages (HDLs), e.g., Verilog and VHDL, employ register-transfer
level (RTL) as their underlying programming model. One major downside of the RTL model
is that it tightly couples design functionality with timing and device constraints.
This coupling increases code complexity and yields code that is more verbose and less
portable. High-level synthesis (HLS) tools decouple functionality from timing and
design constraints by utilizing constructs from imperative programming languages.
These constructs and their sequential semantics, however, impede construction of inherently
parallel hardware and data scheduling, which is crucial in many design use-cases.
In our work we present a novel dataflow hardware description abstraction layer as
basis for hardware design and apply it to DFiant, a Scala-embedded HDL. DFiant leverages
dataflow semantics along with modern software language features (e.g., inheritance,
polymorphism) and classic HDL traits (e.g., bit-accuracy, input/output ports) to decouple
functionality from implementation constraints. Therefore, DFiant designs are timing-agnostic
and device-agnostic and can be automatically pipelined by the DFiant compiler to meet
target performance requirements. With DFiant we demonstrate how dataflow HDL code
can be substantially more portable and compact than its equivalent RTL code, yet without
compromising its target design performance.
With the advent of Machine Learning (ML), predictive EDA tools are becoming the next
hot research topic in the EDA community, and researchers are working on ML-based
tools to predict the performance of EDA tools. As designs become more complex, there
is a need to start the design at higher levels of abstraction, such as with High-Level
Synthesis (HLS) tools in FPGA and SoC design flows. Quick prediction of performance-related
parameters of the final design after the C-synthesis stage can help in rapid design
closure. Even though multiple papers exist in the domain of post-routing performance
prediction for HLS tools, there are no standard benchmarks available to compare the
performance and accuracy of the predictive models. In this paper, we present
MLSBench, a collection of around 5000 synthesizable designs written in C and C++.
We provide a methodology to generate multiple design variations from a single
design, which creates the potential for creating newer designs and enlarging the database
in the future. This is followed by an analysis validating that the generated designs
are indeed different. This allows designers to create generalized machine-learning-based
models that are not overfitted to a small dataset. We also perform a statistical analysis
of design diversity by synthesizing the designs using Xilinx Vivado HLS for the
Zynq 7000 device series.
Design methodologies for synthesizing FPGA fabrics presented in the literature typically
employ a bottom-up approach wherein individual tiles are synthesized in isolation
and later stitched together to generate the large FPGA fabric. However, using a bottom-up
methodology to ensure fabric-level performance targets is challenging due to the lack
of a global timing view across multiple tiles spanning the FPGA fabric. While previous
works address this problem with a combination of manual buffering and floorplanning,
these additional steps introduce significant deviations from standard push-button
ASIC flows. In this paper, a top-down synthesis methodology is proposed, which eliminates
the need for floorplanning and manual buffering by providing a global timing view
of the FPGA fabric. To evaluate the proposed design methodology, we developed an FPGA
fabric generator using the Chisel hardware construction language. The fabric generator
reads in the Verilog-to-Routing architecture file, describing the user-defined FPGA
fabric, and generates the Verilog netlist and timing exceptions required to automatically
place and route the FPGA fabric in any technology node with a standard cell library.
Post layout timing analysis of placed and routed FPGA fabrics on a 28nm industrial
CMOS process demonstrates that the top-down methodology can place and route fabrics
without the need for any manual buffering or floorplanning while providing ~20% average
improvement in performance across multiple benchmark designs.
Among neural network specialized hardware accelerators such as the Application-Specific Integrated Circuit (ASIC),
the FPGA accelerator stands out for its flexibility, short time-to-market, and energy
efficiency. However, when it comes to multitasking and high-speed requirements or
real-time and power-efficient scenarios (e.g., UAVs, self-driving cars, and IoT devices),
a single-board FPGA accelerator has difficulty achieving excellent performance.
Therefore, Cloud FPGAs (multi-FPGAs) will play a significant role in the high-performance
and energy-efficient computation of CNNs for both mobile and cloud computing domains.
In this work, we propose an adaptive neural network accelerator on Cloud FPGAs, using
multi-FPGA design to satisfy multitasking and high-speed requirements or real-time
and power-efficient scenarios. We adopt the roofline model to find the optimal
configuration of each CNN layer, and we propose a layer clustering algorithm and a layer
sequence detection method to transform CNN models into layer sequences for mapping
the CNN model layers efficiently to different FPGA boards. Then, we build an adaptive
CNN mapping method across multi-FPGA chips for CNN models. Preliminary results on the
Multi-FPGAs platform demonstrate that our accelerator can improve the performance
significantly due to the adaptive mapping method.
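The roofline step mentioned above can be sketched in a few lines. This is a simplified, generic roofline model with made-up peak and bandwidth numbers, not the paper's actual per-layer configuration search: attainable performance is the lesser of the compute roof and the memory roof, and the best layer configuration is simply the one with the highest attainable rate.

```python
def attainable_gflops(ops, bytes_moved, peak_gflops, bw_gbs):
    """Roofline model: performance is capped either by the compute peak or by
    memory bandwidth times operational intensity (ops per byte)."""
    intensity = ops / bytes_moved
    return min(peak_gflops, bw_gbs * intensity)


def best_config(configs, peak_gflops, bw_gbs):
    """Pick the (ops, bytes_moved) configuration with the highest attainable
    rate, as a roofline-guided per-layer tuning pass would."""
    return max(configs,
               key=lambda c: attainable_gflops(c[0], c[1], peak_gflops, bw_gbs))
```

For instance, with a hypothetical 500 GFLOP/s peak and 25 GB/s of bandwidth, a layer configuration moving 1e8 bytes for 1e9 operations (intensity 10) is memory-bound at 250 GFLOP/s, and beats a variant of the same layer that moves 4e8 bytes (intensity 2.5, capped at 62.5 GFLOP/s).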
The 2-D median filter, one of the oldest and most well-established image-filtering
techniques, still sees widespread use throughout computer vision. Despite its relative
algorithmic simplicity, accelerating the 2-D median filter via a hardware implementation
becomes increasingly challenging as the window size increases, since the resources
required grow quadratically with the window size. Previous works, in a non-FPGA context,
have shown that applying a sequence of multiple directional median filters to an image
yields results that are competitive with, and in some cases even better than, those
of a classic 2-D window median. Inspired by these approaches, we propose a novel way
of substituting a 2-D median filter on an FPGA with a sequence of directional median
filters, in our case in the pursuit of an FPGA implementation that achieves better
scalability and hardware efficiency without sacrificing accuracy. We empirically show
that the combination of three particular directional filters, in any order, achieves
this, whilst requiring quadratically fewer resources on the FPGA. Our approach allows
for much higher throughput and is easier to implement as a pipeline.
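The substitution idea can be made concrete with a small reference model. The three directions below (horizontal, vertical, one diagonal) and the length-3 windows are our illustrative choices; the paper identifies a particular combination empirically. Note the cost argument: a length-w 1-D median needs O(w) comparators per pass, versus the O(w^2) sorting network of a full 2-D window.

```python
def median3(a, b, c):
    """Median of three values with three comparisons -- far cheaper than the
    sorting network a full 2-D window median requires."""
    return max(min(a, b), min(c, max(a, b)))


def directional_median(img, dr, dc):
    """Length-3 median along direction (dr, dc), clamping at image borders."""
    rows, cols = len(img), len(img[0])
    clamp = lambda v, hi: max(0, min(v, hi - 1))
    return [[median3(img[clamp(r - dr, rows)][clamp(c - dc, cols)],
                     img[r][c],
                     img[clamp(r + dr, rows)][clamp(c + dc, cols)])
             for c in range(cols)] for r in range(rows)]


def sequential_median(img):
    """Sequence of three directional medians standing in for one 2-D median;
    the particular directions here are illustrative."""
    for dr, dc in ((0, 1), (1, 0), (1, 1)):
        img = directional_median(img, dr, dc)
    return img
```

On a small test image, a single impulse ("salt" pixel) is removed by the very first directional pass, while constant regions pass through unchanged, mirroring the noise-suppression behavior expected of the 2-D median it replaces.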
FeCaffe: FPGA-enabled Caffe with OpenCL for Deep Learning Training and Inference on
Intel Stratix 10
Deep learning has become increasingly popular in recent years, and there are
many popular frameworks on the market accordingly, such as Caffe, TensorFlow, and PyTorch.
All these frameworks natively support CPUs and GPGPUs. However, these frameworks still
do not comprehensively support FPGAs for deep learning development,
especially for the training phase. In this paper, we first propose FeCaffe,
i.e., FPGA-enabled Caffe, a hierarchical software and hardware design methodology based
on Caffe, to enable FPGAs to support CNN training features. Moreover, we provide
some benchmarks of popular CNN networks with FeCaffe, together with further detailed
analysis. Finally, we gradually propose several optimization directions, spanning FPGA
kernel design, system pipeline, network architecture, use-case application, and
heterogeneous platform levels. The results demonstrate that the proposed FeCaffe can
support almost all features for training and inference, with a high degree of design
flexibility, extensibility, and reusability for deep learning development. Compared
to prior studies, our architecture can support more networks and training settings,
and the current configuration can achieve 6.4x and 8.4x average execution time
improvements for the forward and backward passes, respectively, for LeNet.
SESSION: Poster Session II
High-Level Synthesis (HLS) can achieve significant performance improvements through
effective memory partitioning and meticulous data reuse. Many modern applications,
such as medical imaging and convolutional layers in a CNN, mostly contain kernels
whose iterations can be reordered freely without compromising correctness. In
this paper, we propose an optimal micro-architecture, implementable automatically
for simple iterative stencil computations, that utilizes only 2 banks to achieve
fully parallel conflict-free memory accesses for single-stage stencil kernels, while
requiring only reuse buffers of size proportional to the kernel size to achieve an II of
1, irrespective of the stencil geometry. We demonstrate the effectiveness of our
micro-architecture by implementing it with a Kintex 7 xc7k160tg676-1 Xilinx FPGA and
testing it with several stencil-based kernels found in real-world applications. On
average, when compared with the mainstream GMP and SRC architectures, our approach
achieves approximately a 30-70% reduction in hardware usage, while improving performance
by about 15%. Moreover, the number of independent memory banks required to accomplish
conflict-free data accesses has dropped by more than 30%, together with some increase
in power consumption due to higher clock frequencies.
Per-flow traffic measurement has emerged as a critical but challenging task in data
centers in recent years in the face of massive network traffic. Many approximate methods
have been proposed to resolve the existing resource-accuracy trade-off in per-flow
traffic measurement, one of which is the sketch-based method. However, sketches suffer
from high computational cost and low throughput; moreover, their measurement
accuracy is hard to guarantee under changing network bandwidth or
flow size distribution. Recently, FPGA platforms have been widely deployed in data
centers, as they demonstrate a good fit for high-speed network processing. In this
work, we propose a scalable pipelined architecture for high-throughput per-flow
traffic measurement on FPGAs. We adopt memory-friendly d-left hashing in our design,
which guarantees high space utilization and successfully addresses the challenge
of tracking high-speed data streams under the limited memory resources of an FPGA.
Comparisons with state-of-the-art sketch-based solutions show that our design
outperforms them in terms of throughput by over 80x.
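A minimal sketch of the d-left hashing idea follows (in Python, with hypothetical table sizes; the paper's FPGA pipeline is considerably more elaborate). Each key hashes into one candidate bucket per subtable; inserting into the least-loaded candidate, with ties broken to the leftmost subtable, keeps bucket occupancy very even, which is the property behind the high space utilization claimed above.

```python
import hashlib


class DLeftTable:
    """d-left hashing for per-flow counting: d subtables, each indexed by its
    own hash function. A new flow goes to the least-loaded candidate bucket,
    ties broken to the leftmost subtable."""

    def __init__(self, d=4, buckets=64):
        self.d, self.buckets = d, buckets
        self.tables = [[[] for _ in range(buckets)] for _ in range(d)]

    def _bucket(self, key, i):
        # Deterministic per-subtable hash (a stand-in for hardware hash units).
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.buckets

    def increment(self, key):
        """Bump the flow's packet counter, inserting the flow on a miss."""
        for i in range(self.d):
            for entry in self.tables[i][self._bucket(key, i)]:
                if entry[0] == key:
                    entry[1] += 1
                    return
        # Miss: insert into the least-loaded candidate bucket (leftmost wins ties).
        _, i = min((len(self.tables[j][self._bucket(key, j)]), j)
                   for j in range(self.d))
        self.tables[i][self._bucket(key, i)].append([key, 1])

    def count(self, key):
        for i in range(self.d):
            for entry in self.tables[i][self._bucket(key, i)]:
                if entry[0] == key:
                    return entry[1]
        return 0
```

Because every lookup probes exactly d buckets of bounded size, the structure maps naturally onto a fixed-latency FPGA pipeline with one memory port per subtable.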
Field-programmable gate arrays (FPGAs) have become a popular compute platform for
convolutional neural network (CNN) inference; however, the design of a CNN model and
its FPGA accelerator has been inherently sequential. A CNN is first prototyped with
little or no hardware awareness to attain high accuracy; subsequently, an FPGA accelerator
is tuned to that specific CNN to maximize its efficiency. Instead, we formulate a
neural architecture search (NAS) optimization problem that contains parameters from
both the CNN model and the FPGA accelerator, and we jointly search for the best CNN
model-accelerator pair that boosts accuracy and efficiency; we call this Codesign-NAS.
In this paper we focus on defining the Codesign-NAS multiobjective optimization problem,
demonstrating its effectiveness, and exploring different ways of navigating the codesign
search space. For Cifar-10 image classification, we enumerate close to 4 billion model-accelerator
pairs, and find the Pareto frontier within that large search space. Next we propose
accelerator innovations that improve the entire Pareto frontier. Finally, we compare
to ResNet on a highly-tuned accelerator, and show that using codesign, we can improve
on Cifar-100 classification accuracy by 1.8% while simultaneously increasing performance/area
by 41% in just 1000 GPU-hours of running Codesign-NAS, thus demonstrating that our
automated codesign approach is superior to sequential design of a CNN model and accelerator.
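Finding the Pareto frontier within an enumerated set of model-accelerator pairs reduces to a dominance filter. The sketch below uses toy points with both objectives (accuracy, performance/area) maximized; it is an illustration of the frontier computation, not the paper's search procedure.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other: q dominates p when q is
    at least as good in both objectives (accuracy, performance/area; both
    maximized) and differs from p, i.e., is strictly better in at least one."""
    def dominated(p):
        return any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
    return [p for p in points if not dominated(p)]
```

On the toy set `[(0.92, 40), (0.90, 55), (0.95, 20), (0.89, 30), (0.90, 50)]`, the pairs `(0.89, 30)` and `(0.90, 50)` are dominated and drop out, leaving a three-point frontier over which accelerator innovations can then be evaluated.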
Placement Aware Design and Automation of High Speed Architectures for Tree-Structured
Linear Cellular Automata on FPGAs with Scan Path Insertion
VLSI implementation of Cellular Automata (CAs) has gained importance owing to its
features which guarantee parallelism, locality and structural regularity. In this
work, we have addressed the design challenges pertaining to an implementation optimized
for speed, of tree-structured linear CA architectures on Field Programmable Gate Array
(FPGA) with built-in scan paths. Scan based design facilitates state initialization,
helps to escape from any graveyard state, or figure out faulty locations (if any)
on which the circuit is mapped. Our design automation platform generates synthesizable
circuit descriptions of tree-structured CA on FPGA, and appends scan functionality
without additional logic or speed overhead. Placement algorithms governing the mapping
of CA cell nodes onto FPGA slices have been proposed to ensure maximum physical
proximity among CA cells sharing neighborhood dependencies. This is done to exploit
the VLSI amenable features such as physical adjacency of the neighboring nodes participating
in the next state (NS) computation of each other. The ultimate implementation leads
to minimum spacing of linear order between CA neighbours. The NS logic of each CA
cell, inclusive of scan multiplexing, is realized using a single Look-Up Table owing
to the restricted neighborhood size. Our architectures outperform behavioral implementations
realized with higher levels of design style abstraction.
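For a linear CA, the next state of each cell is an XOR over its neighborhood; the classic example is rule 90 in Wolfram's numbering, sketched behaviorally below (our illustration, with a null boundary, not the authors' specific tree-structured CA). The small 3-cell neighborhood is exactly why the NS function, scan multiplexing included, fits in a single LUT.

```python
def next_state(cells):
    """One step of a linear CA under rule 90: each cell's next state is the
    XOR of its two neighbors (null boundary). With only a 3-cell neighborhood,
    this NS logic plus a scan mux fits comfortably in one 6-input LUT."""
    n = len(cells)
    left = lambda i: cells[i - 1] if i > 0 else 0
    right = lambda i: cells[i + 1] if i < n - 1 else 0
    return [left(i) ^ right(i) for i in range(n)]
```

Starting from a single seed cell, repeated application traces out the familiar Sierpinski-triangle state sequence, a quick behavioral check before committing a CA design to hardware.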
Multi-Robot Exploration (MR-Exploration), which provides locations and maps, is a basic
task for many multi-robot applications. Recent research introduces Convolutional
Neural Networks (CNNs) into critical components of MR-Exploration, like Feature-point
Extraction (FE) and Place Recognition (PR), to improve system performance. Such
CNN-based MR-Exploration requires running multiple CNN models simultaneously, together
with complex post-processing algorithms, which greatly challenges the hardware platforms,
usually embedded systems. Previous research has shown that the FPGA is a
good candidate for CNN processing on embedded platforms. But such accelerators usually
process different models sequentially, lacking the ability to schedule multiple tasks
at runtime. Furthermore, post-processing of CNNs in FE is also computation-intensive
and becomes the system bottleneck once the CNN models have been accelerated. To handle such
problems, we propose an INterruptible CNN Accelerator for Multi-Robot Exploration
(INCAME) framework for rapid deployment of robot applications on FPGA. In INCAME,
we propose a virtual-instruction-based interrupt method to support multi-task on CNN
accelerators. INCAME also includes hardware modules to accelerate the post-processing
of the CNN-based components. Experimental results show that INCAME enables multi-task
scheduling on the CNN accelerator with negligible performance degradation (0.3%).
With the help of multi-task supporting and post-processing acceleration, INCAME enables
embedded FPGA to execute MR-Exploration in real time (20 fps).
Low-bit quantization of neural networks is required on edge devices to achieve lower
power consumption and higher performance. 8-bit and binary networks either consume a
lot of resources or suffer accuracy degradation. Thus, a full-process, hardware-friendly
quantization solution of 4A4W (4-bit activations and 4-bit weights) is proposed to achieve
a better accuracy/resource trade-off. It contains no additional floating-point operations
and achieves accuracy comparable to full precision. We also implement a low-precision
accelerator for CNN (LPAC) on the Xilinx FPGA, which takes full advantage of its DSP
by efficiently mapping convolutional computations. Through on-chip resource management
and resource-saving analysis, high performance can be achieved on small chips. Our
4A4W solution achieves 1.8x higher performance than 8A8W and a 2.42x increase in power
efficiency under the same resources. On ImageNet classification, the Top-5 accuracy
gap to full precision is less than 1%. On human pose estimation, we achieve
261 frames per second on a ZU2EG, a 1.78x speedup over 8A8W, with only a 1.62% accuracy
gap to full precision, demonstrating the generality of our solution.
FPGAs have shown great potential in providing low-latency and energy-efficient solutions
for deep learning applications, especially for the deep neural network (DNN). Currently,
the majority of FPGA based DNN accelerators are designed for single-task and static-workload
applications, making it difficult to adapt to the multi-task and dynamic-workload
applications in the cloud. To meet these requirements, DNN accelerators need to support
multi-task concurrent execution and low-overhead runtime resource reconfiguration.
However, neither instruction set architecture (ISA) based nor template-based FPGA
accelerators can support both functions at the same time. In this paper, we introduce
a novel FPGA virtualization framework for ISA-based DNN accelerators in the cloud.
As for the design goals of supporting multi-task and runtime reconfiguration, we propose
a two-level instruction dispatch module and deep learning hardware resources pooling
technique at the hardware level. As for the software level, we propose a tiling-based
instruction frame package design and two-stage static-dynamic compilation. Furthermore,
we propose a history information aware scheduling algorithm for the proposed ISA-based
deep learning accelerators in the cloud scenario. According to our evaluation on Xilinx
VU9P FPGA, the proposed virtualization method achieves 1.88x to 2.20x higher throughput
and 1.36x to 1.77x lower latency against the static baseline design.
Evaluation of Optimized CNNs on FPGA and non-FPGA based Accelerators using a Novel Benchmarking Approach
Numerous algorithmic optimization techniques have been proposed to alleviate the computational
complexity of convolutional neural networks (CNNs). However, given the broad selection
of inference accelerators, it is not obvious which approach benefits from which optimization
and to what degree. In addition, the design space is further obscured by many deployment
settings such as power and operating modes, batch sizes, as well as ill-defined measurement
methodologies. In this paper, we systematically benchmark different types of CNNs,
leveraging both pruning and quantization as the most promising optimization techniques,
using a novel benchmarking approach. We evaluate a spectrum of FPGA implementations,
as well as GPU, TPU, and VLIW processors, for a selection of systematically pruned and quantized
neural networks (including ResNet50, GoogleNetv1, MobileNetv1, a VGG derivative, and
a multilayer perceptron) taking the full design space into account including batch
sizes, thread counts, stream sizes and operating modes, and considering power, latency,
and throughput at a specific accuracy as figures of merit. Our findings show that channel
pruning is effective across most hardware platforms, with resulting speedups directly
correlated to the reduction in compute load, while FPGAs benefit the most from quantization.
FPGAs outperform the other platforms in latency and latency variation for the majority of CNNs,
in particular with feed-forward dataflow implementations. Finally, pruning and quantization
are orthogonal techniques and yield the majority of all optimal design points when
combined. With this benchmarking approach, both in terms of methodology and measured
results, we aim to drive more clarity in the choice of CNN implementations and optimizations.
Recently, FPGA-accelerated cloud has emerged as a new computing environment. The inclusion
of FPGAs in the cloud has created new security risks, some of which are due to circuits
exercising excessive switching activity. These power-wasting tenants can cause timing
faults in the collocated circuits or a denial-of-service attack by resetting the host
FPGA. In this work, we present the idea of populating the FPGA with voltage sensors
based on ring oscillators, to continuously monitor the core voltage fluctuations across
the entire FPGA. To implement the sensors, we do not lock any FPGA resources; instead,
we infiltrate the sensors undercover, by taking advantage of the logic and the routing
resources unused by the tenants. Additionally, we infiltrate the sensors into the
FPGA circuits after their implementation, but before their deployment on the cloud;
the tenants are thus neither aware nor affected by our voltage monitoring system.
Finally, we devise a novel metric that takes the sensor measurements to quantify the
power wasting activity in the FPGA clock regions where the sensors are infiltrated.
We use VTR benchmarks and a Xilinx Virtex-7 FPGA to test the feasibility of our approach.
Experimental results demonstrate that, using the undercover voltage sensors and our
novel metric, one can accurately locate the source of the malicious power-wasting activity.
High Level Synthesis (HLS) tools, like the Intel FPGA SDK for OpenCL, improve hardware
design productivity and enable efficient design space exploration, by providing simple
program directives (pragmas) and/or API calls that allow hardware programmers to use
higher-level languages (like HLS-C or OpenCL). However, modern HLS tools sometimes
miss important optimizations that are necessary for high performance. In this poster,
we present a study of the tradeoffs in HLS optimizations, and the potential of a modern
HLS tool in automatically optimizing an application. We perform the study on a generic,
5-stage camera ISP pipeline using the Intel FPGA SDK for OpenCL and an Arria 10 FPGA
Dev Kit. We show that automatic optimizations in the HLS tool are valuable, achieving
up to 2.7x speedup over equivalent CPU execution. With further hand tuning, however,
we can achieve up to 36.5x speedup over CPU. We draw several specific lessons about
the effectiveness of automatic optimizations guided by simple directives and about
the nature of manual rewriting required for high performance. Finally, we conclude
that there is a gap in the current potential of HLS tools which needs to be filled
by next-gen research.
CANSEE: Customized Accelerator for Neural Signal Enhancement and Extraction from the
Calcium Image in Real Time
The miniaturized fluorescent calcium imaging miniscope has become a prominent technique
for monitoring the activity of a large population of neurons in vivo. However, existing
calcium image processing algorithms are developed for off-line analysis, and their
implementations on general-purpose processors struggle to meet the real-time
processing requirements under the constrained energy budgets of closed-loop applications.
In this paper, we propose CANSEE, a customized accelerator for neural signal enhancement
and extraction from calcium image in real time. The accelerator can perform the motion
correction, the calcium image enhancement, and the fluorescence tracing from up to
512 cells with less than 1-ms processing latency. We also design hardware that can
detect new cells based on long short-term memory (LSTM) inference. We implemented
the accelerator on a Xilinx Ultra96 FPGA. The implementation achieves 15.8x speedup
and over 2 orders of magnitude improvement in energy efficiency compared to the evaluation
on the multi-core CPU.
Low precision data representation is important to reduce storage size and memory access
for convolutional neural networks (CNNs). Yet, existing methods have two major limitations:
(1) they require re-training to maintain accuracy for deep CNNs, and (2) they need 16-bit
floating point or 8-bit fixed point for good accuracy.
In this paper, we propose a low precision (8-bit) floating point (LPFP) quantization
method for FPGA-based acceleration to overcome the above limitations. Without any
re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5
accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly
better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP
multiplication by one 4-bit multiply-adder (MAC) and one 3-bit adder, and therefore
implement four 8-bit LPFP multiplications using one DSP slice of Xilinx Kintex-7 family
(KC705 in this paper) while one DSP can implement only two 8-bit fixed point multiplications.
Experiments on six typical CNNs for inference show that on average, we improve throughput
by 64.5× over Intel i9 CPU and by 1.5× over existing FPGA accelerators. Particularly
for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput
by 3.5× and 27.5× and improve average throughput per DSP by 4.1× and 5×, respectively.
To the best of our knowledge, this is the first in-depth study to simplify one multiplication
for CNN inference to one 4-bit MAC and implement four multiplications within one DSP
while maintaining comparable accuracy without any re-training.
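The abstract does not spell out the LPFP encoding itself. As a rough, hypothetical illustration of what quantizing to an 8-bit floating-point format involves, the sketch below rounds values to 1 sign bit, a 4-bit exponent, and a 3-bit mantissa; this particular split and the function name are assumptions for illustration, not the paper's format:

```python
import numpy as np

def quantize_lpfp(x, exp_bits=4, man_bits=3):
    """Round values to a hypothetical 8-bit float: 1 sign bit, exp_bits of
    exponent, man_bits of mantissa. Illustrative only, not the paper's LPFP."""
    sign = np.sign(x)
    mag = np.abs(x).astype(np.float64)
    out = np.zeros_like(mag)
    nz = mag > 0
    exp = np.floor(np.log2(mag[nz]))                       # unbiased exponent
    bias = 2 ** (exp_bits - 1) - 1
    exp = np.clip(exp, -bias, bias)                        # clamp to range
    frac = mag[nz] / 2.0 ** exp                            # significand
    frac = np.round(frac * 2 ** man_bits) / 2 ** man_bits  # round mantissa
    out[nz] = frac * 2.0 ** exp
    return sign * out

w = np.array([0.3, -1.7, 0.004, 12.5])
print(quantize_lpfp(w))   # each value snapped to the nearest 8-bit float
```

Searching over such exponent/mantissa splits per layer, and keeping the one with the least accuracy loss, is the general shape of a post-training low-precision float quantization.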
The Field Programmable Gate Array (FPGA) platform has been a popular choice for deploying
Convolutional Neural Networks (CNNs) as a result of its high parallelism and low energy
consumption. Due to the limited on-chip resources of a single board, FPGA clusters
have become promising solutions for improving CNN throughput. In this paper, we first
put forward strategies to optimize resource allocation both within and across FPGA boards.
We then model the multi-board cluster problem and design algorithms based on the knapsack
problem and dynamic programming to compute the optimal topology of the FPGA clusters.
We also give a quantitative analysis of the inter-board data-transmission bandwidth
requirement. To make our design accommodate more situations, we provide solutions
for deploying fully connected layers and special convolution layers with large memory
requirements. Experimental results show that typical well-known CNNs with the proposed
FPGA-cluster topology obtain higher throughput per board than single-board
solutions and other multi-board solutions.
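The abstract does not give the exact formulation, but the dynamic-programming flavor of the problem can be loosely illustrated: split a chain of CNN layers, each with some cost, into contiguous groups (one per board) so that the most-loaded board is as light as possible. The cost model, function name, and interface below are invented for illustration only:

```python
# Hypothetical sketch: split a chain of CNN layers (with per-layer costs)
# into `boards` contiguous groups, minimizing the maximum per-board cost --
# a simplified stand-in for the paper's DP/knapsack formulation.
def partition_layers(costs, boards):
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    INF = float("inf")
    # dp[b][i]: best achievable max-load using b boards for the first i layers
    dp = [[INF] * (n + 1) for _ in range(boards + 1)]
    cut = [[0] * (n + 1) for _ in range(boards + 1)]
    dp[0][0] = 0
    for b in range(1, boards + 1):
        for i in range(1, n + 1):
            for j in range(b - 1, i):
                load = prefix[i] - prefix[j]        # cost of layers j..i-1
                cand = max(dp[b - 1][j], load)
                if cand < dp[b][i]:
                    dp[b][i], cut[b][i] = cand, j
    # recover the cut points
    splits, i = [], n
    for b in range(boards, 0, -1):
        splits.append((cut[b][i], i))
        i = cut[b][i]
    return dp[boards][n], list(reversed(splits))

best, groups = partition_layers([4, 2, 7, 1, 3, 5], boards=3)
print(best, groups)   # 8 [(0, 2), (2, 4), (4, 6)]
```

The real problem additionally weighs inter-board bandwidth and per-board resources, but the structure, optimal substructure over prefixes of the layer chain, is the same.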
Over the past years, feed-forward convolutional neural networks (CNNs) have evolved
from a simple feed-forward architecture to deep and residual (skip-connection) architectures,
demonstrating increasingly higher object categorization accuracy and increasingly
better explanatory power of both neural and behavioral responses. However, from the
neuroscientist's point of view, the correspondence between such deep architectures and
the ventral visual pathway remains incomplete. For example, current state-of-the-art CNNs
appear to be too complex (e.g., now over 100 layers for ResNet) compared with the
relatively shallow cortical hierarchy (4-8 layers). We introduce new CNNs with shallow
recurrent architectures and skip connections that require fewer parameters while achieving
higher classification accuracy. We propose an architecture for a recurrent residual convolutional
neural network (R2CNN) on FPGA, which efficiently utilizes on-chip memory bandwidth.
We propose an Output-Kernel-Input-Parallel (OKIP) convolution circuit for a recurrent
residual convolution stage. We implement the inference hardware on a Xilinx ZCU104
evaluation board with high-level synthesis. Our R2CNN accelerator achieves a top-5 accuracy
of 90.08% on the ImageNet benchmark, higher than conventional FPGA implementations.
Computational neuroscience uses models to study the brain. The Hodgkin-Huxley (HH)
model, and its extensions, is one of the most powerful, biophysically meaningful models
currently used. The high experimental value of the (extended) Hodgkin-Huxley (eHH)
models comes at the cost of steep computational requirements. Consequently, for larger
networks, neuroscientists either opt for simpler models, losing neuro-computational
features, or use high-performance computing systems. The eHH models can be efficiently
implemented as a dataflow application on an FPGA-based architecture, but state-of-the-art
FPGA-based implementations have proven time-consuming to deploy because of their long
synthesis times. We have developed flexHH, a flexible hardware library, compatible
with a widely used neuron-model description format, implementing five FPGA-accelerated
and parameterizable variants of eHH models (standard HH with optional extensions:
custom ion-gates, gap junctions, and/or multiple cell compartments). flexHH is therefore
a crucial step towards high-flexibility and high-performance FPGA-based simulations,
avoiding the penalty of re-engineering and re-synthesis and dismissing the need for
a hardware engineer. In terms of performance, flexHH achieves a speedup of 1,065x against
NEURON, the standard simulator in computational neuroscience, and speedups of
8x-20x against sequential C. Furthermore, flexHH is faster per simulation step than
other HPC technologies, provides 65% or better performance density (in FLOPS/LUT)
compared to related works, and shows only a marginal performance drop in real-time operation.
This poster presents a novel cross-layer-pipelined Convolutional Neural Network accelerator
architecture, and network compiler, that make use of precision minimization and parameter
pruning to fit ResNet-50 entirely into on-chip memory on a Stratix 10 2800 FPGA. By
statically partitioning the hardware across each of the layers in the network, our
architecture enables full DSP utilization and reduces the soft logic per DSP ratio
by roughly 4x over prior work on sparse CNN accelerators for FPGAs. This high DSP
utilization, a frequency of 420MHz, and skipping zero weights enable our architecture
to execute a sparse ResNet-50 model at a batch size of 1 at 3300 images/s, which is
nearly 3x higher throughput than NVIDIA’s fastest machine-learning-targeted GPU, the
V100. We also present a network compiler and a flexible hardware interface that make
it easy to add support for new types of neural networks, and to optimize these networks
for FPGAs with different on-chip resources.
Hardware acceleration of deep learning (DL) systems has been increasingly studied
to achieve desirable performance and energy efficiency. The FPGA strikes a balance
between high energy efficiency and a fast development cycle and is therefore widely
used as a DNN accelerator. However, there exists an architecture-layout mismatch in
the current designs, which introduces scalability and flexibility issues, leading
to irregular routing and resource imbalance problems. To address these limitations,
in this work, we propose FTDL, an FPGA-tailored architecture with a parameterized
and hierarchical hardware that is adaptive to different FPGA devices. FTDL has the
following novelties: (i) At the architecture level, FTDL consists of Tiled Processing
Elements (TPEs) and super blocks, achieving a near-theoretical digital signal processing
(DSP) operating frequency of 650 MHz. More importantly, FTDL is configurable and delivers
good scalability, i.e., timing remains stable even when the design is scaled up
to 100% resource utilization for different deep learning systems. (ii) For workload
compilation, FTDL provides a compiler that maps DL workloads onto the
architecture optimally. Experimental results show that for most benchmark
layers in MLPerf, FTDL achieves an over 80% hardware efficiency.
SESSION: Poster Session III
With Moore’s Law coming to an end, hardware specialization and systems on chips are
providing new opportunities for continuing performance scaling while reducing the
energy cost of computation. However, the current hardware design methodologies require
significant engineering efforts and domain expertise, making the design process unscalable.
More importantly, hardware specialization presents a unique challenge for a much tighter
software and hardware co-design environment to exploit domain-specific optimizations
and design efficiency. In this work, we introduce Cash, a single-source hardware-software
co-design framework for rapid SoC prototyping and accelerators research. Cash leverages
the unique efficiency and generative attributes of Modern C++ to provide a unified
development environment, aiming at closing the architecture research methodology gap.
The Cash framework introduces new co-design programming abstractions that enable seamless
integration with existing software, from architecture research simulators to high-level
synthesis tools.
Stream computing is a suitable approach to improve both performance and power efficiency
of numerical computations with FPGAs. To achieve further performance gains, temporal
and spatial parallelism are exploited: the former deepens the pipelines of streamed
computation cores, while the latter duplicates them. These two types of parallelism were previously
evaluated with Arria 10 FPGA. However, it has not been verified if they are also effective
for the latest FPGA, Stratix 10, which has a larger amount of logic elements (i.e.,
2.4X of Arria 10) and is equipped with a new feature to improve the maximum clock
frequency (i.e., HyperFlex architecture). To show the scalability for such state-of-the-art
FPGAs, in this paper, we firstly implemented a streamed fluid simulation accelerator
with both parallelism types for Stratix 10. We then thoroughly evaluated it by obtaining
computational performance (FLOPS), power efficiency (FLOPS/W), resource utilization,
and maximum clock frequency (Fmax). From the results, we found that this implementation
excessively used DSP blocks due to inefficient mapping of floating-point operations,
which reduced Fmax and the number of pipelined cores. To improve the scalability,
we optimized the implementation to reduce the DSP block usage by utilizing a Multiply-Add
function in a single DSP block. As a result, the optimized fluid simulation achieves
1.06 TFLOPS and 12.6 GFLOPS/W, which is 1.36X and 1.24X higher than the non-optimized
version, respectively. Moreover, we estimate that the fluid simulation on Stratix
10 could outperform a GPU-based implementation on a Tesla V100 with further
device-specific optimization.
Routing is one of the most time-consuming steps in the FPGA synthesis flow. Existing
works have described several ways to accelerate the routing process. Partitioning-based
parallel routing techniques that leverage the high-performance computing of multi-core
processors have recently been gaining popularity. Specifically, those parallel routers partition
nets into regions by the nets’ bounding boxes, followed by a parallel routing procedure.
Nets can be split up into source-sink connections that share wire segments as much
as possible. To exploit more parallelism at a finer granularity in both spatial
partitioning and routing, this work introduces a connection-aware routing bounding box model.
We first show in detail that connection-aware partitioning
using the new routing bounding boxes gives parallel routing better
runtime efficiency than the existing net-based partitioning, based on an analysis of the workloads
of parallel routers. It reduces the connections spanning more than one region and
exploits more parallelism. The large heterogeneous Titan23 designs and a detailed
representation of the Stratix IV FPGA are used for benchmarking. Experimental results
show that the parallel FPGA router is faster when using our connection-aware partitioning
than using the existing net-based partitioning, while achieving similar quality of
routing results in terms of wirelength and critical-path delay. The connection-aware
routing bounding box model can easily be embedded into other existing parallel routers
to make them faster.
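The intuition behind per-connection bounding boxes can be sketched in a few lines: a net's single box must span all of its terminals, while each source-to-sink connection gets a much smaller box, so fewer boxes straddle partition boundaries. The geometry below is invented for illustration; the paper's actual box construction may differ:

```python
# Sketch of the idea (details are assumptions): one bounding box per net
# spans all terminals; one box per source->sink connection is smaller.
def bbox(points):
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def area(b):
    x0, y0, x1, y1 = b
    return (x1 - x0) * (y1 - y0)

source = (0, 0)
sinks = [(1, 1), (9, 1), (2, 8)]
net_box = bbox([source] + sinks)                  # one box for the whole net
conn_boxes = [bbox([source, s]) for s in sinks]   # one box per connection
print(area(net_box), [area(b) for b in conn_boxes])   # 72 [1, 9, 16]
```

Assigning work to regions by these smaller per-connection boxes is what reduces cross-region connections and exposes more routing parallelism.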
With the advent of AI and machine learning as the highest-profile FPGA applications,
INT8 performance is currently one of the key benchmarking metrics. In current devices,
INT8 multipliers must be extracted from higher precision multipliers. Recently, we
reported the implementation of a mixed DSP Block and soft-logic design, with 22,400
INT8 multipliers and a system clock rate of 416 MHz, on the Intel Stratix 10 2800 device.
In this paper, we demonstrate alternative techniques for integer multiplier construction
to better balance the resource types on current FPGAs – logic, memory, and DSP – to
make a significant improvement in the multiplier, and therefore the dot product, density.
We further extend these techniques to 8 bit signed-magnitude (SM) 1.7 representation,
which can further improve arithmetic density by using the logic and memory resources
more flexibly. We describe variable composition dot product structures, which can
be assembled in a scalable 2D systolic array. In one example, we report a design containing
32,768 SM1.7 multipliers, with a clock rate of 432MHz, giving a system performance
of over 28 TOPs. Our INT8 densities are improved by up to 30% over the earlier work
– we show one design with 28,800 INT8 multipliers. In all cases, enough device resources
are left free and accessible to implement a full application level design.
With tremendous economic and technological ramifications, hardware security has become
an increasingly more critical design metric for FPGA-based logic design. In this work,
we focus on countermeasures against power side-channel attacks in any reconfigurable
computing system implemented with modern FPGA fabric. We design and implement a novel
countermeasure technique called Time-Fracturing (TF) to fend off side-channel-based
information leakage, which proves to be both hardware-efficient and minimally invasive.
To validate its effectiveness, we have applied our TF technique to an FPGA-based AES128
encryption core. Our experimental results show a more than 50-fold increase in attack
difficulty, measured by the number of traces required to extract the secret key,
compared to the unprotected baseline. Furthermore, our approach is
orthogonal to existing methods, thus having the potential to be integrated in the
future for a multi-variate defense mechanism.
Accurate timing characterization of FPGA routing resources, i.e. wires and switches,
is critical to achieving high quality of results from FPGA routing tools. Although
the composition and connectivity of the routing resources are easily extracted from
an FPGA’s architecture, post-layout timing characterization of the FPGA’s wires and
switches (NOT the design being mapped onto the FPGA) with EDA tools is a challenging
task due to the large quantity of combinational loops (cycles in the routing graph).
Likewise, the use of EDA tools is severely limited when constructing new FPGA architectures.
This work addresses the challenge by proposing an algorithm to construct cycle-free
FPGA routing graphs. A cycle-free FPGA routing graph is achieved by logically ordering
wires and intelligently removing or rearranging a small fraction of the switch block
connections in order to break cycles. The proposed approach enables constraining the
timing of all routing resources, which is otherwise impossible due to the combinational
loops. This technique can be applied to post-layout static timing analysis (STA) of
existing FPGAs, significantly reducing the complexity and improving the accuracy of
the analysis. In addition, this cycle-free approach can be adopted when designing
new FPGAs, transforming costly hand layout into an automated step compatible with
commercial ASIC EDA tools.
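The core idea, order the wires logically and break cycles by removing the few switch edges that point "backwards", can be illustrated with a toy routing graph. This is a rough sketch of the principle, not the paper's algorithm, which rearranges rather than only removes connections:

```python
# Illustrative sketch (not the paper's exact algorithm): impose a logical
# order on wires and drop switch edges that go "backwards" in that order,
# which guarantees the resulting routing graph is acyclic.
def break_cycles(wires, switches):
    order = {w: i for i, w in enumerate(wires)}   # logical wire ordering
    kept = [(a, b) for a, b in switches if order[a] < order[b]]
    removed = [e for e in switches if e not in kept]
    return kept, removed

wires = ["W0", "W1", "W2", "W3"]
switches = [("W0", "W1"), ("W1", "W2"), ("W2", "W0"), ("W2", "W3")]
kept, removed = break_cycles(wires, switches)
print(kept)      # only forward edges -> acyclic
print(removed)   # [('W2', 'W0')] closed the cycle W0->W1->W2->W0
```

With every remaining edge going strictly forward in the order, no combinational loop can exist, which is what makes conventional STA constraints applicable to all routing resources.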
Logic replication is often necessary to improve emulation speed for systems employing
field programmable gate arrays (FPGAs), since designs are frequently large enough to require
partitioning across multiple (boards of) FPGAs. In this paper, we propose
a polynomial time algorithm for combinational logic replication that ensures delay
optimality for directed acyclic graphs and reduces overhead due to look-up table (LUT)
and cut resources. The algorithm is further extended to consider combinational loops,
often yielding delay-optimal results. Experimental results on industrial designs show,
on average, 44%, 33%, and 33% reductions in cut overhead, LUT cost, and
runtime, respectively, compared to existing heuristics, demonstrating the efficiency
of the algorithm.
Q-Table based Reinforcement Learning (QRL) is a class of widely used algorithms in
AI that work by successively improving the estimates of Q values — quality of state-action
pairs, stored in a table. They significantly outperform Neural Network based techniques
when the state space is tractable. Fast learning for AI applications in several domains
(e.g. robotics), with tractable ‘mid-sized’ Q-tables, still necessitates performing
substantial rapid updates. State-of-the-art FPGA implementations of QRL do not scale
with the increasing Q-Table state space, thus are not efficient for such applications.
In this work, we develop a novel FPGA implementation of QRL, scalable to large state
spaces and facilitating a large class of AI applications. Our pipelined architecture
provides higher throughput while using significantly fewer on-chip resources, and
supports a variety of action-selection policies covering Q-Learning and variations
of bandit algorithms. Possible dependencies caused by consecutive Q-value updates
are handled, allowing the design to process one Q-sample every clock cycle. Additionally,
we provide the first known FPGA implementation of the SARSA (State-Action-Reward-State-Action)
algorithm. We evaluate our architecture for Q-Learning and SARSA algorithms and show
that our designs achieve a high throughput of up to 180 million Q samples per second.
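The updates being accelerated are the standard tabular ones; a minimal software sketch makes the off-policy/on-policy distinction between the two algorithms concrete (the FPGA pipeline computes updates of this form every clock cycle while resolving dependencies between consecutive updates, which this serial sketch does not model):

```python
# Minimal tabular Q-learning and SARSA updates (textbook forms).
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # off-policy: bootstrap from the greedy action in s_next
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # on-policy: bootstrap from the action actually taken in s_next
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

Q = [[0.0, 0.0] for _ in range(3)]   # 3 states x 2 actions
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
sarsa_update(Q, s=1, a=0, r=0.5, s_next=2, a_next=1)
print(Q)   # Q[0][1] = 0.1, Q[1][0] = 0.05
```

Because each update reads and writes a single Q-table entry, consecutive updates to the same entry create the read-after-write hazards that the pipelined design must handle to sustain one sample per cycle.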
Designing efficient hardware for accelerating machine learning (ML) applications is
a major challenge. Rapidly changing algorithms and network architectures in this field
make FPGA-based designs an attractive solution. But the generic building blocks available
in current FPGAs (ALMs/CLBs, DSP blocks) limit the acceleration that can be achieved.
We propose a modification to the current FPGA architecture that makes FPGAs specialized
for ML applications. Specifically, we propose adding hard matrix multiplier blocks
(matmuls) into the FPGA fabric. These matmuls are implemented using systolic arrays
of MACs (Multiply-And-Accumulate) and can be connected using programmable direct interconnect
between neighboring matmuls to make larger systolic matrix multipliers. We explore
various matmul sizes (4x4x4, 8x8x8, 16x16x16, 32x32x32) and various strategies to
place these blocks on the FPGA (clustered, surround, columnar). We recommend 4x4x4
matmul blocks with columnar placement after studying tradeoffs between area, frequency,
fragmentation and channel width. Experimental results and analytical evaluation reveal
that providing matmuls in an FPGA speeds up state-of-the-art neural networks (Resnet50,
GNMT, Transformer, Minigo) by ~2.5x on average, compared to a DSP-heavy FPGA with
equal number of MACs. Therefore, FPGAs with hard matrix multipliers can be used to
design faster, more area (and hence, power) efficient hardware accelerators for ML
applications, compared to current FPGAs, at the cost of reducing the flexibility of
the FPGA for other applications. A matmul-heavy FPGA fabric could be a part of bigger
FPGA, the rest of which can have general programmable logic, or fully ML-specific
FPGAs with matmuls could be created.
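Functionally, composing small hard matmul blocks into a larger multiplier is a tiled matrix multiply with accumulation across the inner dimension. The sketch below shows that tiling/accumulation structure only; it does not model the systolic dataflow or the direct inter-block interconnect, and the block size is illustrative:

```python
import numpy as np

# Sketch of the composition idea: tile a large matrix multiply out of
# fixed-size (here 4x4x4) matmul blocks, as neighboring hard blocks
# would be chained on the proposed fabric.
def matmul_4x4x4(a, b):
    """Stand-in for one hard block: multiplies one 4x4 tile by another."""
    assert a.shape == (4, 4) and b.shape == (4, 4)
    return a @ b

def tiled_matmul(A, B, t=4):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, t):
        for j in range(0, m, t):
            for p in range(0, k, t):   # accumulate partial products
                C[i:i+t, j:j+t] += matmul_4x4x4(A[i:i+t, p:p+t],
                                                B[p:p+t, j:j+t])
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

The fragmentation question studied in the paper falls out of this picture: matrices whose dimensions are not multiples of the block size leave part of each boundary block's MACs idle.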
FPGA platforms are widely used for application acceleration. Although a number of
high-level design frameworks exist, application and performance portability across
different platforms remain challenging. To address the above problem, we propose an
API design for high-level development tools to separate platform-dependent code from
the remaining application design. Additionally, we propose design guidelines to assist
with performance portability. To demonstrate our techniques, a large-scale application,
originally developed for an Intel Stratix-V FPGA is ported to several new Xilinx Virtex
UltraScale+ systems. The accelerated application, developed in a high-level framework,
is rapidly moved onto the new platforms with minimal changes. The original, unmodified
kernel code delivers a 1.74x speedup due to increased clock frequency on the new platform.
Subsequently, the application is further optimised to make use of the additional resources
available on the larger UltraScale+ FPGAs, guided by a simple analytical performance
model. This results in an additional performance increase of up to 7.4x. Using the
presented framework, we demonstrate rapid deployment of the same application across
a number of different platforms that leverage the same FPGA family but differ in their
low-level implementation details and the available peripherals. As a result, the same
application code supports five different platforms: Maxeler MAX5C DFE, Amazon EC2
F1, Xilinx Alveo U200, U250 and the original Intel Stratix-V accelerator card, with
performance close to what is theoretically achievable for each of these platforms.
Accuracy-Aware Memory Allocation to Mitigate BRAM Errors for Voltage Underscaling
on FPGA Overlay Accelerators
Approximate computing (AC) aims to achieve energy-efficiency in digital systems by
sacrificing the computational accuracy of an application. Memory-intensive applications,
in which a large amount of data is processed to reach a meaningful conclusion, are
the primary target. Systems for such applications consist of a large pool of compute
units and sizeable on-chip memory. The total energy consumption of such applications is
often dominated by the on-chip memory. We therefore focus on improving the energy
efficiency of the on-chip memory by appropriately scaling down its supply voltage.
In this paper, we propose a memory allocation technique for FPGA-based accelerators
to improve accuracy and energy consumption for such memory-intensive applications.
Unlike the state of the art, our technique focuses on the FPGA's BRAM. Since an
application contains both critical and non-critical data, which must be treated
accordingly to maintain good computational accuracy, we use the FPGA's LUTRAM
to realize reliable memory, while BRAM operating at a lower voltage serves as the
unreliable memory. First, we introduce a compiler pre-processor to
annotate the arrays of an application as critical or non-critical. Afterward,
we employ an exploration heuristic, based on pre-characterized memory power, to select
an optimal split between reliable and unreliable memories for the application without
incurring run-time or energy overhead. Experimental results on various
signal- and image-processing applications reveal that the proposed memory allocation
heuristic improves accuracy by 13.0% to 73.2% while reducing energy to 0.77x,
at the cost of a 1.12x increase in circuit area.
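The split between reliable and unreliable memory can be pictured with a toy allocator: critical arrays always go to reliable (LUTRAM-backed) memory, and the undervolted BRAM is filled with non-critical arrays up to its capacity. Everything here, names, greedy order, capacity, is a hypothetical illustration, not the paper's exploration heuristic:

```python
# Hypothetical sketch of the allocation idea: critical arrays -> reliable
# memory; largest non-critical arrays -> undervolted BRAM while it fits.
def allocate(arrays, bram_capacity):
    reliable, unreliable = [], []
    used = 0
    # place the largest non-critical arrays first, to maximize energy savings
    for name, size, critical in sorted(arrays, key=lambda a: -a[1]):
        if not critical and used + size <= bram_capacity:
            unreliable.append(name)
            used += size
        else:
            reliable.append(name)
    return reliable, unreliable

arrays = [("coeffs", 4, True), ("frame", 64, False),
          ("scratch", 32, False), ("state", 8, True)]
print(allocate(arrays, bram_capacity=80))
```

The paper's heuristic additionally weighs pre-characterized memory power and the accuracy impact of each placement rather than size alone.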
Phylogenetics study the evolutionary history of a collection of organisms based on
observed heritable molecular traits, finding practical application in a wide range
of domains, from conservation biology and epidemiology, to forensics and drug development.
A fundamental computational kernel to evaluate evolutionary histories, also referred
to as phylogenies, is the Phylogenetic Likelihood Function (PLF), which dominates
the total execution time (by up to 95%) of widely used maximum-likelihood phylogenetic
methods. Numerous efforts to boost PLF performance over the years mostly focused on
accelerating computation; since the PLF is a data-intensive, memory-bound operation,
performance remains limited by data movement. In this work, we employ near-memory
computation units (NMUs) within an FPGA-based computing environment with disaggregated
memory to alleviate the data movement problem and improve performance and energy efficiency
when inferring large-scale phylogenies. NMUs were deployed on a multi-FPGA emulation
platform for the IBM dReDBox disaggregated datacenter prototype. We find that performance
and power efficiency improve by an order of magnitude when NMUs compute on local
data that reside on the same server tray. This is achieved through an efficient data-allocation
scheme that minimizes inter-tray data transfers (remote-data movement) when computing
the PLF. More specifically, we observe up to 22x better FLOPS performance and 13x
higher power efficiency (FLOPS/Watt) over the more traditional, accelerator-as-a-coprocessor
model, which requires explicit remote-data transfers between disaggregated memory
modules and accelerator units.
FPGA circuits usually adopt a full-custom design methodology, which makes it difficult
to design and optimize an FPGA manually. We therefore present FPTLOPT (FPGA
Transistor-Level Optimization Tool), which supports a more complex FPGA architecture,
the general routing matrix (GRM) architecture, and achieves higher accuracy and
higher speed than COFFE. To handle the more complex architecture, we use a regular-matching
method to automatically extract circuit types and build circuit netlists.
For higher accuracy, we predict layout area with an area model we build, and then
precisely predict post-layout simulation delay with a load model we build.
For higher speed, we devise a variable-range greedy algorithm that expands the search
range automatically. We also provide workload-equalizing multi-thread acceleration
that adjusts the thread count to the current CPU hardware environment.
The experimental results illustrate that FPTLOPT supports the optimization of the GRM
architecture and builds the key sub-circuit netlists. Moreover, the area prediction is
up to 43% more precise, and the delay prediction 28% more precise, than those
of COFFE. FPTLOPT also quickly obtains optimal transistor-sizing results for different
optimization objectives; for the same circuit, the optimization is 19.96 times
faster than with COFFE.
CAD exploration is important for designing FPGA interconnect topologies. It includes
two steps: first, design a model with parameters that can express as much of the architecture
space as possible; second, use a CAD flow to analyze the described interconnect architecture. In
this paper, we present a new interconnect model, named INTB (Interconnect Block).
At each logical position, one INTB represents all related routing resources,
and hierarchical parameters are designed to simplify the description. Compared with the existing
CB-SB model, the INTB model can support more interconnect features of modern FPGAs, such
as various types of wire segments and complex connections. These features can improve
FPGA routability. To apply the INTB model, two modifications are made
to the CAD flow. One is the generation of the routing resource graph (RRG): a tile-based method
is proposed to generate the RRG from the parameters. The other is cost computation during the routing
process: two strategies are applied for cost estimation of short and
curved wire segments, respectively, which do not exist in the CB-SB model. The INTB model and the CAD improvements
are implemented in VTR 8.0. The experiments consist of two parts. First, the INTB model
is used to re-describe CB-SB architectures to verify its descriptive capacity;
after the CAD flow, the average differences in routing area and timing between the two models are
about 4% and 5%. Second, the INTB model is used to explore the architecture space with modern
FPGA features. Experimental results show an obvious performance enhancement, over 10%
on some benchmarks.
Long Short-Term Memory (LSTM) has been widely adopted in tasks with sequence data,
such as speech recognition and language modeling. LSTM brought significant accuracy
improvement by introducing additional parameters to Recurrent Neural Network (RNN).
However, the increasing number of parameters and computations has also led to inefficiency
in computing LSTMs on edge devices with limited on-chip memory and DRAM bandwidth.
In order to reduce the latency and energy of LSTM computations, there has been a pressing
need for model compression schemes and suitable hardware accelerators. In this paper,
we first propose the Fixed Nonzero-ratio Viterbi-based Pruning, which can reduce the
memory footprint of LSTM models by 96% with negligible accuracy loss. By applying
additional constraints on the distribution of surviving weights in Viterbi-based Pruning,
the proposed pruning scheme mitigates the load-imbalance problem and thereby increases
the processing engine utilization rate. Then, we propose the V-LSTM, an efficient
sparse LSTM accelerator based on the proposed pruning scheme. High compression ratio
of the proposed pruning scheme allows the proposed accelerator to achieve 24.9% lower
per-sample latency than that of state-of-the-art accelerators. The proposed accelerator
is implemented on Xilinx VC-709 FPGA evaluation board running at 200MHz for evaluation.
This paper presents a system-level co-simulation and co-verification workflow to ease
the transition from a software-only procedure, executed in a General Purpose processor,
to the integration of a custom hardware accelerator developed in a Hardware Description
Language. We propose a tool that uses Dynamic Binary Modification to decouple the development
of the hardware accelerator from the software-only application to be accelerated.
It provides support for rapid iterative exploration and functional verification of
hardware designs while keeping the unmodified software application as a reference.
DBHI is able to instrument an application and inject compiled hardware. It allows
progressive migration from application source code, to non-synthesizable HDL, and
to synthesizable HDL. At the same time, it preserves cycle-accurate/bit-accurate results,
and provides run-time visibility of the internal data buffers for debugging purposes.
Foreign architecture emulation overhead during development is avoided, and early integration
with peripherals in the target System-on-Chip is possible.
The proposed design flow was evaluated by executing hardware simulations on x86-64
and Arm. DBHI was developed from existing off-the-shelf tools and evaluated
on multiple architectures; the technique, however, is not tied to any specific architecture.