FPGA ’21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

SESSION: Session 1: FPGA Architecture

Top-down Physical Design of Soft Embedded FPGA Fabrics

  • Prashanth Mohan
  • Oguz Atli
  • Onur Kibar
  • Mohammed Zackriya
  • Larry Pileggi
  • Ken Mai

In recent years, IC reverse engineering and IC fabrication supply chain insecurity have
grown to become significant economic and security threats for designers, system integrators,
and end customers. Many of the existing logic locking and obfuscation techniques have
been shown to be vulnerable to attack once the attacker has access to the design netlist,
either through reverse engineering or through an untrusted fabrication facility. We
introduce soft embedded FPGA redaction, a hardware obfuscation approach that allows
the designer to substitute security-critical IP blocks within a design with a synthesizable
eFPGA fabric. This method fully conceals the logic and the routing of the critical
IP and is compatible with standard ASIC flows for easy integration and process portability.
To demonstrate eFPGA redaction, we obfuscate a RISC-V control path and a GPS P-code
generator. We also show that the modified netlists are resilient to SAT attacks with
moderate VLSI overheads. The secure RISC-V design has 1.89x area and 2.36x delay overhead
while the GPS design has 1.39x area and negligible delay overhead when implemented
on an industrial 22nm FinFET CMOS process.

NetCracker: A Peek into the Routing Architecture of Xilinx 7-Series FPGAs

  • Morten B. Petersen
  • Stefan Nikolić
  • Mirjana Stojilović

Novel applications have triggered significant changes at the system level of FPGA
architecture design, such as the introduction of embedded VLIW processor arrays and
hardened NoCs. However, the routing architecture of the soft logic fabric has largely
remained unchanged in recent years. Since hunger for acceleration of ever more varied
tasks with various power budgets—as well as complications related to technology
scaling—is likely to remain significant, it is foreseeable that the routing architecture
too will have to evolve. In this work, we do not try to suggest how routing architectures
of tomorrow should look. Instead, we analyze an existing architecture from a
popular commercial FPGA family, discussing the possible origins of various design
decisions and pointing out aspects that may merit future research. Moreover, we present
an open-source tool that greatly eases such analyses, relying only on data readily
available from the vendor CAD tools. Our hope is that this work will help the academic
research community catch up with current developments in industry and accelerate
its contributions to FPGA architectures of the future.
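
To give a flavor of the kind of analysis such a tool enables, the following sketch histograms the wire types used by a routed design from a dump of its programmable interconnect points (PIPs). The input format and the name-parsing heuristic are our assumptions for illustration, not NetCracker's actual interface.

    # Sketch: histogram wire types from a text dump of used PIPs, one
    # "src_node dst_node" pair per line (assumed format, not NetCracker's).
    from collections import Counter

    def wire_type(node_name: str) -> str:
        # e.g. "INT_R_X41Y61/EE2BEG0" -> "EE2" (direction and length prefix)
        wire = node_name.split("/")[-1]
        return wire.rstrip("0123456789").removesuffix("BEG").removesuffix("END")

    def histogram(pip_file: str) -> Counter:
        counts = Counter()
        with open(pip_file) as f:
            for line in f:
                src, dst = line.split()
                counts[wire_type(dst)] += 1
        return counts

    if __name__ == "__main__":
        for wtype, n in histogram("pips.txt").most_common():
            print(f"{wtype:>10}: {n}")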

Tensor Slices to the Rescue: Supercharging ML Acceleration on FPGAs

  • Aman Arora
  • Samidh Mehta
  • Vaughn Betz
  • Lizy K. John

FPGAs are well-suited for accelerating deep learning (DL) applications owing to the
rapidly changing algorithms, network architectures and computation requirements in
this field. However, the generic building blocks available on traditional FPGAs limit
the acceleration that can be achieved. Many modifications to FPGA architecture have
been proposed and deployed including adding specialized artificial intelligence (AI)
processing engines, adding support for IEEE half-precision (fp16) math in DSP slices,
adding hard matrix multiplier blocks, etc. In this paper, we describe replacing a
small percentage of the FPGA’s programmable logic area with Tensor Slices. At their
heart, these slices are arrays of processing elements that support multiple tensor
operations and multiple dynamically-selectable precisions, and that can be dynamically
fractured into individual adders, multipliers and MACs (multiply-and-accumulate).
These tiles have a local crossbar at the inputs that eases the routing pressure caused
by a large slice.
By spending ~3% of the FPGA’s area on Tensor Slices, we observe an average frequency increase
of 2.45x and average area reduction by 0.41x across several ML benchmarks, including
a TPU-like design, compared to an Intel Agilex-like baseline FPGA. We also study the
impact of spending area on Tensor slices on non-ML applications. We observe an average
reduction of 1% in frequency and an average increase of 1% in routing wirelength compared
to the baseline, across the non-ML benchmarks we studied. Adding these ML-specific
coarse-grained hard blocks makes the proposed FPGA a much more efficient hardware accelerator
for ML applications, while still keeping the vast majority of the FPGA’s real estate
programmable at a fine grain.

Global Is the New Local: FPGA Architecture at 5nm and Beyond

  • Stefan Nikolić
  • Francky Catthoor
  • Zsolt Tőkei
  • Paolo Ienne

It takes only high-school physics to appreciate that the resistance of a wire grows
with a diminishing cross section, and a quick look at any plot about Moore’s law immediately
suggests that such cross section must decrease over time. Clearly, everyone can easily
imagine that this trend must have a deep influence on FPGA architectures. What is
difficult to predict is whether and when well-established architectural ideas will
break—and what can replace them. Unfortunately, in architectural research, we often
use fairly simplistic models of the underlying technology nodes which limit our ability
to visualize the detailed impact of technology evolution. In this paper, we develop,
from the available industrial disclosures, a consistent electrical model of the metal
stacks of recent and current technologies, as well as future trends. We combine it
with a plausible layout strategy to get an accurate idea of how wire characteristics
play into today’s architectural decisions. To demonstrate our models, necessarily
speculative due to the paucity of reliable industrial information, we use them to
explore the evolution of a typical architectural family across technology nodes and
to reevaluate one of the most basic design parameters—namely, cluster size. We notice
effects which may in fact explain some recent changes in commercial architectures.
We also observe how conventional architectures may fail to take advantage of the performance
improvements of future nodes. Although conceptually straightforward, this study signals
how profoundly our understanding of FPGAs will be affected by technology while moving
towards the 3 nm node.
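
The high-school physics alluded to above can be made concrete with a back-of-the-envelope model (our notation and simplified Elmore-style constants, not the paper's full model). For an unbuffered wire of length L, width W, and thickness H:

    R_{\text{wire}} = \rho_{\text{eff}}\,\frac{L}{W\,H}, \qquad
    C_{\text{wire}} \approx c_{\text{pl}}\,L, \qquad
    \tau \approx 0.38\,R_{\text{wire}}\,C_{\text{wire}}
        \propto \frac{\rho_{\text{eff}}\,c_{\text{pl}}\,L^2}{W\,H}

Shrinking W and H grows the per-length resistance quadratically, and at very small dimensions the effective resistivity itself rises due to barrier layers and surface scattering, which is why a consistent electrical model of the metal stack matters for architectural conclusions.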

FABulous: An Embedded FPGA Framework

  • Dirk Koch
  • Nguyen Dao
  • Bea Healy
  • Jing Yu
  • Andrew Attwood

As CMOS scaling comes to an end, architecture design is gaining in
importance. Supporting this trend, customizable embedded FPGAs are an ingredient in
ASIC architectures that provides the advantages of reconfigurable hardware exactly where
and how it is most beneficial. To enable this, we introduce FABulous, an embedded
open-source FPGA framework. FABulous is designed to fulfill the objectives of ease
of use, maximum portability to different process nodes, good control for customization,
and delivering good area, power, and performance characteristics of the generated
FPGA fabrics. The framework provides templates for logic, arithmetic, memory, and
I/O blocks that can be easily stitched together, whilst enabling users to add their
own fully customized blocks and primitives. The FABulous ecosystem generates the embedded
FPGA fabric for chip fabrication, integrates Yosys, ABC, VPR, and nextpnr as FPGA CAD
tools, and handles bitstream generation and post-fabrication tests. Additionally,
we provide an emulation path for system development. FABulous was demonstrated on
an ASIC integrating a RISC-V core with an embedded FPGA fabric for custom instruction
set extensions, using a TSMC 180nm process and an open-source 45nm process node.

Stratix 10 NX Architecture and Applications

  • Martin Langhammer
  • Eriko Nurvitadhi
  • Bogdan Pasca
  • Sergey Gribok

The advent of AI has driven the adoption of high density low precision arithmetic
on FPGAs. This has resulted in new methods of mapping both arithmetic functions and
dataflows onto the fabric, as well as some changes to the embedded DSP Blocks.
Technologies outside of the FPGA realm have also evolved, such as the addition of
tensor structures for GPUs, and also the introduction of numerous AI ASSPs, all of
which have a higher claimed performance and efficiency than current FPGAs. In this
paper we introduce the Stratix 10 NX device (NX), an FPGA variant optimized specifically
for the AI application space. In addition to the computational
capabilities of the standard programmable soft logic fabric, a new type of DSP Block
provides the dense arrays of low precision multipliers typically used in AI implementations.
The architecture of the block is tuned for the common matrix-matrix or vector-matrix
multiplications in AI, with capabilities designed to work efficiently for both small
and large matrix sizes. The base precisions are INT8 and INT4, along with shared-exponent
support for block floating point FP16 and FP12 numerics. All additions/accumulations
can be done in INT32 or IEEE754 single precision floating point (FP32), and multiple
blocks can be cascaded together to support larger matrices. We will also describe
methods by which the smaller precision multipliers can be aggregated to create larger
multipliers that are more applicable to standard signal processing requirements. In
terms of overall compute throughput, Stratix 10 NX achieves 143 INT8/FP16 TOPs/TFLOPs,
or 286 INT4/FP12 TOPs/TFLOPs at 600MHz. Depending on the configuration, power efficiency
is in the range of 1-4 TOPs/W or TFLOPs/W.
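
As an illustration of the shared-exponent numerics mentioned above (our simplified model, not Intel's implementation), block floating point stores one exponent per block of values, so the mantissas become small integers and a dot product reduces to the integer MACs the tensor blocks provide:

    # Sketch of shared-exponent block floating point (BFP): one exponent per
    # block, signed integer mantissas, so the dot product uses INT8-style MACs
    # with an INT32-style integer accumulator.
    import math

    def to_bfp(values, mant_bits=8):
        exp = max((math.frexp(v)[1] for v in values if v), default=1)
        scale = 2.0 ** (exp - (mant_bits - 1))
        lo, hi = -(1 << (mant_bits - 1)), (1 << (mant_bits - 1)) - 1
        mants = [min(max(round(v / scale), lo), hi) for v in values]
        return mants, exp - (mant_bits - 1)

    def bfp_dot(a, b, mant_bits=8):
        ma, ea = to_bfp(a, mant_bits)
        mb, eb = to_bfp(b, mant_bits)
        acc = sum(x * y for x, y in zip(ma, mb))   # pure integer MACs
        return acc * 2.0 ** (ea + eb)

    a, b = [0.5, -1.25, 3.0], [2.0, 0.75, -0.125]
    print(bfp_dot(a, b), sum(x * y for x, y in zip(a, b)))  # -0.3125 -0.3125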

SESSION: Keynote 1

Scientific Applications of FPGAs at the LHC

  • Philip Harris

The next generation of high throughput data acquisition systems is capable of acquisition
at rates far exceeding our ability to save data. To process data in real time, specialized
computing systems with incredibly high throughput are needed so that data can be quickly
assessed to determine whether it is sufficiently interesting for further processing.
With a raw data rate exceeding 1 Petabit per second, particle detectors at the Large
Hadron Collider at the European Organization for Nuclear Research (CERN) contend with some
of the largest data rates ever encountered. With planned upgrades in the near future,
these rates will continue to grow, further complicating our ability to process data
effectively to continue to understand the fundamental properties of the universe.

In this talk, we present the current, FPGA-based, LHC data acquisition system, and
we discuss the plenitude of data challenges that are currently being addressed. Furthermore,
we discuss various aspects of the system, and we present deep-learning-based solutions
that are quickly being adopted by the LHC. We also discuss the lower-throughput,
computationally complex systems and show how FPGAs can augment them, leading
to enhanced physics performance. Throughout the talk, we discuss the scientific implications
possible with an improved system. Finally, we discuss related problems in other scientific
fields, including astrophysics and materials science. We present new challenges that,
if solved, can open paths to new avenues of fundamental scientific research.

SESSION: Session 2: Abstractions and Tools

ThunderGP: HLS-based Graph Processing Framework on FPGAs

  • Xinyu Chen
  • Hongshi Tan
  • Yao Chen
  • Bingsheng He
  • Weng-Fai Wong
  • Deming Chen

FPGAs have emerged as a computing infrastructure in datacenters, benefiting from
fine-grained parallelism, energy efficiency, and reconfigurability. Meanwhile,
graph processing has attracted tremendous interest in data analytics, and its performance
is in increasing demand with the rapid growth of data. Many works have been proposed
to tackle the challenges of designing efficient FPGA-based accelerators for graph
processing. However, programmability remains largely overlooked: building such accelerators
still requires hardware design expertise and sizable development effort.

In order to close the gap, we propose ThunderGP, an open-source HLS-based graph processing
framework on FPGAs, with which developers can enjoy the performance of FPGA-accelerated
graph processing by writing only a few high-level functions with no knowledge of the
hardware. ThunderGP adopts the Gather-Apply-Scatter (GAS) model as the abstraction
of various graph algorithms and realizes the model with a built-in, highly parallel,
and memory-efficient accelerator template. With high-level functions as inputs, ThunderGP
automatically explores the massive resources and memory bandwidth of multiple Super
Logic Regions (SLRs) on FPGAs to generate the accelerator, and then deploys it and
schedules its tasks. We evaluate ThunderGP with seven common graph
applications. The results show that accelerators on real hardware platforms deliver
2.9 times speedup over the state-of-the-art approach, running at 250MHz and achieving
throughput up to 6,400 MTEPS (Million Traversed Edges Per Second). We also conduct
a case study with ThunderGP, which delivers up to 419 times speedup over the CPU-based
design and requires significantly less development effort. This work is open-sourced
on GitHub at https://github.com/Xtra-Computing/ThunderGP.
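
The GAS abstraction can be summarized in a few lines of Python (our illustrative model; the function names are not ThunderGP's actual API). Users supply only the small per-edge and per-vertex hooks, here instantiated for PageRank:

    # Minimal software model of one Gather-Apply-Scatter iteration.
    def gas_iteration(num_v, edges, prop, gather, apply, init=0.0):
        acc = [init] * num_v
        for src, dst in edges:            # scatter/gather along every edge
            acc[dst] = gather(acc[dst], prop[src])
        return [apply(a) for a in acc]    # per-vertex apply phase

    edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
    deg, rank = [2, 1, 1], [1 / 3] * 3
    for _ in range(20):                   # PageRank as the user-level algorithm
        contrib = [r / d for r, d in zip(rank, deg)]
        rank = gas_iteration(3, edges, contrib,
                             gather=lambda a, m: a + m,
                             apply=lambda a: 0.15 / 3 + 0.85 * a)
    print([round(r, 3) for r in rank])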

AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design
on Multi-Die FPGAs

  • Licheng Guo
  • Yuze Chi
  • Jie Wang
  • Jason Lau
  • Weikang Qiao
  • Ecenur Ustun
  • Zhiru Zhang
  • Jason Cong

Despite an increasing adoption of high-level synthesis (HLS) for its design productivity
advantages, there remains a significant gap in the achievable clock frequency between
an HLS-generated design and a handcrafted RTL one. A key factor that limits the timing
quality of the HLS outputs is the difficulty in accurately estimating the interconnect
delay at the HLS level. Unfortunately, this problem becomes even worse when large
HLS designs are implemented on the latest multi-die FPGAs, where die-crossing interconnects
incur a high delay penalty.

To tackle this challenge, we propose AutoBridge, an automated framework that couples
a coarse-grained floorplanning step with pipelining during HLS compilation. First,
our approach provides HLS with a view of the global physical layout of the design,
allowing HLS to more easily identify and pipeline the long wires, especially those
crossing the die boundaries. Second, by exploiting the flexibility of HLS pipelining,
the floorplanner is able to distribute the design logic across multiple dies on the
FPGA device without degrading clock frequency. This prevents the placer from aggressively
packing the logic on a single die which often results in local routing congestion
that eventually degrades timing. Since pipelining may introduce additional latency,
we further present analysis and algorithms to ensure the added latency will not compromise
the overall throughput.

AutoBridge can be integrated into the existing CAD toolflow for Xilinx FPGAs. In our
experiments with a total of 43 design configurations, we improve the average frequency
from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible
change in resource utilization. Notably, in 16 experiments we make the originally
unroutable designs achieve 274 MHz on average. The tool is available at https://github.com/Licheng-Guo/AutoBridge.
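
A toy rendering of the core idea (our simplification, not AutoBridge's actual algorithm): once each module is assigned a die region by the coarse-grained floorplanner, every connection gets one pipeline stage per die boundary it crosses, which decouples placement freedom from clock frequency:

    # Given a module-to-die assignment, add one pipeline stage per crossing.
    floorplan = {"load": 0, "compute": 2, "store": 2}   # module -> SLR index
    edges = [("load", "compute"), ("compute", "store")]

    def pipeline_depths(floorplan, edges):
        return {(u, v): abs(floorplan[u] - floorplan[v]) for u, v in edges}

    print(pipeline_depths(floorplan, edges))
    # {('load', 'compute'): 2, ('compute', 'store'): 0}

The accompanying throughput analysis then ensures that this added latency does not compromise the overall throughput.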

AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA

  • Jie Wang
  • Licheng Guo
  • Jason Cong

While systolic array architectures have the potential to deliver tremendous performance,
it is notoriously challenging to customize an efficient systolic array processor for
a target application. Designing systolic arrays requires knowledge of both the high-level
characteristics of the application and low-level hardware details, making it
a demanding and inefficient process. To relieve users from the manual iterative trial-and-error
process, we present AutoSA, an end-to-end compilation framework for generating systolic
arrays on FPGA. AutoSA is based on the polyhedral framework, and further incorporates
a set of optimizations on different dimensions to boost performance. An efficient
and comprehensive design space exploration is performed to search for high-performance
designs. We have demonstrated AutoSA on a wide range of applications, on which AutoSA
achieves high performance within a short amount of time. As an example, for matrix
multiplication, AutoSA achieves 934 GFLOPs, 3.41 TOPs, and 6.95 TOPs for floating-point,
16-bit integer, and 8-bit integer data types, respectively, on a Xilinx Alveo U250.
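
For reference, the kind of architecture AutoSA targets can be modeled in a few lines. This is our cycle-by-cycle model of a generic output-stationary systolic array for matrix multiply, not AutoSA-generated code: operands enter skewed at the array edges, flow one PE per cycle, and partial sums stay in place:

    import numpy as np

    def systolic_matmul(A, B):
        n, k = A.shape
        _, m = B.shape
        C = np.zeros((n, m), dtype=A.dtype)
        a_reg = np.zeros((n, m), dtype=A.dtype)   # A values flowing right
        b_reg = np.zeros((n, m), dtype=A.dtype)   # B values flowing down
        for t in range(n + m + k - 2):            # cycles until array drains
            for i in reversed(range(n)):          # update in reverse so each PE
                for j in reversed(range(m)):      # reads last cycle's registers
                    a = a_reg[i, j - 1] if j else (A[i, t - i] if 0 <= t - i < k else 0)
                    b = b_reg[i - 1, j] if i else (B[t - j, j] if 0 <= t - j < k else 0)
                    C[i, j] += a * b
                    a_reg[i, j], b_reg[i, j] = a, b
        return C

    A, B = np.arange(6).reshape(2, 3), np.arange(12).reshape(3, 4)
    print(np.array_equal(systolic_matmul(A, B), A @ B))  # True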

Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers
through Microbenchmarking

  • Alec Lu
  • Zhenman Fang
  • Weihua Liu
  • Lesley Shannon

With the public availability of FPGAs from major cloud service providers like AWS,
Alibaba, and Nimbix, hardware and software developers can now easily access FPGA platforms.
However, it is nontrivial to develop efficient FPGA accelerators, especially for software
programmers who use high-level synthesis (HLS).

The major goal of this paper is to figure out how to efficiently access the memory
system of modern datacenter FPGAs in HLS-based accelerator designs. This is especially
important for memory-bound applications; for example, a naive accelerator design may
utilize less than 5% of the available off-chip memory bandwidth. To achieve our goal,
we first identify a comprehensive set of factors that affect the memory bandwidth,
including 1) the number of concurrent memory access ports, 2) the data width of each
port, 3) the maximum burst access length for each port, and 4) the size of consecutive
data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively
evaluate the performance of the Xilinx Alveo U200 and U280 FPGA memory systems as
those factors change, and provide insights into efficient memory access
in HLS-based accelerator designs. To demonstrate the usefulness of our insights, we
also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN)
and sparse matrix-vector multiplication (SpMV) algorithms. Compared to the baseline
designs, optimized designs leveraging our insights achieve speedups of about 3.5x and
8.5x for the KNN and SpMV accelerators, respectively.
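
The interplay of the four factors can be captured in a crude analytical model (entirely ours, with invented constants; the paper's numbers come from measurement): each port moves its data width per cycle while a burst is in flight but pays a fixed re-issue latency between bursts:

    def effective_bw_gbps(ports, width_bytes, burst_len, freq_mhz,
                          reissue_cycles=30, peak_gbps=77):
        efficiency = burst_len / (burst_len + reissue_cycles)
        bw = ports * width_bytes * freq_mhz * 1e6 * efficiency / 1e9
        return min(bw, peak_gbps)                 # capped by the memory system

    # A single narrow, short-burst port barely scratches the bandwidth,
    # echoing the <5% figure above; wide ports with long bursts saturate it.
    print(effective_bw_gbps(ports=1, width_bytes=4, burst_len=1, freq_mhz=300))
    print(effective_bw_gbps(ports=16, width_bytes=64, burst_len=256, freq_mhz=300))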

HBM Connect: High-Performance HLS Interconnect for FPGA HBM

  • Young-kyu Choi
  • Yuze Chi
  • Weikang Qiao
  • Nikola Samardzic
  • Jason Cong

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers
can now exploit unprecedented external memory bandwidth. This allows more memory-bounded
applications to benefit from FPGA acceleration. However, fully utilizing the available
bandwidth may not be an easy task. When an application requires multiple processing
elements to access multiple HBM channels, we observed a significant drop in the effective
bandwidth. The existing high-level synthesis (HLS) programming environment has limitations
in producing an efficient communication architecture. In order to solve this problem,
we propose HBM Connect, a high-performance customized interconnect for FPGA HBM boards.
Novel HLS-based optimization techniques are introduced to increase the throughput
of AXI bus masters and switching elements. We also present a high-performance customized
crossbar that may replace the built-in crossbar. The effectiveness of HBM Connect
is demonstrated using Xilinx’s Alveo U280 HBM board. Based on bucket sort and merge
sort case studies, we explore several design spaces and find the design point with
the best resource-performance trade-off. The results show that HBM Connect improves
the resource-performance metrics by 6.5X-211X.

PRGA: An Open-Source FPGA Research and Prototyping Framework

  • Ang Li
  • David Wentzlaff

Field Programmable Gate Arrays (FPGAs) are being used in a fast-growing range of scenarios,
and heterogeneous CPU-FPGA systems are being tapped as a possible way to mitigate
the challenges posed by the end of Moore’s Law. This growth in diverse use cases has
fueled the need to customize FPGA architectures for particular applications or application
domains. While high-level FPGA models can help explore the FPGA architecture space,
as FPGAs move to more advanced design nodes, there is an increased need for low-level
FPGA research and prototyping platforms that can be brought all the way to fabrication.

This paper presents Princeton Reconfigurable Gate Array (PRGA), a highly customizable, scalable, and complete open-source framework for building
custom FPGAs. The framework’s core functions include generating synthesizable Verilog
from user-specified FPGA architectures, and providing a complete, auto-generated,
open-source CAD toolchain for the custom FPGAs. Developed in Python, PRGA provides
a user-friendly API and supports use both as a standalone FPGA and as an embedded
FPGA. PRGA is a great platform for FPGA architecture research, FPGA configuration
memory research, FPGA CAD tool research, and heterogeneous systems research. It is
also a completely open-source framework for designers who need a free and customizable
FPGA IP core. An FPGA designed with PRGA is placed and routed using standard cell
libraries. The design is evaluated and compared to prior works, providing comparable
performance and increased configurability.

Interactive Debugging at IP Block Interfaces in FPGAs

  • Marco Antonio Merlini
  • Isamu Poy
  • Paul Chow

Recent developments have shown FPGAs to be effective for data centre applications,
but debugging support in that environment has not evolved correspondingly. This presents
an additional barrier to widespread adoption. This work proposes Debug Governors,
a new open-source debugger designed for controllability and interactive debugging
that can help to locate issues across multiple FPGAs.

A Debug Governor can pause, log, drop, and/or inject data into any streaming interface.
These operations enable single-stepping, unit testing, and interfacing with software.
Hundreds of Debug Governors can fit in a single FPGA and, because they are transparent
when inactive, can be left “dormant” in production designs.

We show how Debug Governors can be used to resolve functional problems on a real FPGA,
and how they can be extended to memory-mapped protocols.
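
In software terms, a Debug Governor behaves roughly like the following model (our sketch of the behavior described above, not the actual RTL): transparent pass-through when dormant, with pause, log, drop, and inject hooks when engaged:

    class DebugGovernor:
        def __init__(self):
            self.paused, self.drop, self.log, self.inject_q = False, False, [], []

        def transfer(self, beat):
            """Move one beat across the governed streaming interface."""
            if self.inject_q:                # injected data takes priority
                return self.inject_q.pop(0)
            if self.paused or beat is None:  # stall: nothing moves this cycle
                return None
            self.log.append(beat)            # observe without disturbing
            return None if self.drop else beat

    gov = DebugGovernor()
    print(gov.transfer(0xA))    # 0xA: dormant governor is transparent
    gov.paused = True
    print(gov.transfer(0xB))    # None: stream paused for single-stepping
    gov.inject_q.append(0xC)
    print(gov.transfer(None))   # 0xC: software-injected beat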

SESSION: Poster Session 1

Probabilistic Optimization for High-Level Synthesis

  • Jianyi Cheng
  • John Wickerson
  • George A. Constantinides

High-level synthesis (HLS) tools automatically transform a high-level program, for
example in C/C++, into a low-level hardware description. A key challenge in HLS tools
is scheduling, i.e., determining the start times of all the operations in the untimed
program. There are three approaches to scheduling: static, dynamic, and hybrid.

Static scheduling has been well studied; however, statically analysing dynamic hardware
behaviours is still challenging due to the unpredictability caused by run-time dependencies.
Existing approaches either assume the worst-case timing behaviour, which can cause
significant performance loss or area overhead, or use simulation, which takes significant
time to explore a sufficiently large number of program traces.

In this work, we introduce a novel probabilistic model allowing HLS tools to efficiently
estimate and optimize the cycle-level timing behaviour of HLS-generated hardware.
Our framework offers insights to assist both hardware engineers and HLS tools when
estimating and optimizing hardware performance.
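
The gain over simulation can be seen on a toy example (numbers and structure invented for illustration): given a branch probability, the expected latency of a data-dependent loop is available in closed form, where simulation needs many traces to converge:

    import random

    P_FAST, FAST, SLOW, N = 0.9, 1, 12, 1000        # cycles per iteration

    expected = N * (P_FAST * FAST + (1 - P_FAST) * SLOW)   # analytic estimate

    def one_trace():                                 # Monte-Carlo alternative
        return sum(FAST if random.random() < P_FAST else SLOW
                   for _ in range(N))

    traces = [one_trace() for _ in range(200)]
    print(expected, sum(traces) / len(traces))       # 2100 vs ~2100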

A Framework for Optimizing GCN Inference on FPGA

  • Bingyi Zhang
  • Rajgopal Kannan
  • Viktor Prasanna

Graph convolutional networks (GCNs) have revolutionized many big data applications.
However, accelerating GCN inference is still challenging due to (1) massive external
memory traffic and irregular memory access, (2) workload imbalance because of the
skewed degree distribution, and (3) intra-stage load imbalance between feature aggregation
and feature transformation steps. To address the above challenges, we propose a framework
to optimize GCN inference on FPGA. First, we propose a novel Partition-Centric Feature
Aggregation (PCFA) scheme to increase the data locality and reduce the number of random
memory accesses in the feature aggregation step. Second, we propose a novel hardware architecture
to enable pipelined execution of the two heterogeneous computation steps. Then, a
low-overhead task scheduling strategy is proposed to achieve stall-free execution
of the two computation steps. Third, we provide a complete GCN acceleration framework
on FPGA, and define key parameters for users to fine-tune the throughput. The model-specific
operators can be customized to support a wide range of GCN models. Using our framework,
we design accelerators on a state-of-the-art FPGA and evaluate our work using widely
used datasets. Experimental results show that the accelerators produced by our framework
achieve significant speedups compared with state-of-the-art implementations on CPU
(≈100x), GPU (≈30x), and FPGA (4.5-32x).

Clockwork: Resource-Efficient Static Scheduling for Multi-Rate Image Processing Applications
on FPGAs

  • Dillon Huff
  • Steve Dai
  • Pat Hanrahan

Image processing algorithms can benefit tremendously from hardware acceleration. However,
hardware accelerators for image processing algorithms look very different from the
programs that image processing algorithm designers are accustomed to writing. Many
image processing hardware compilers have been proposed to close this gap. Unfortunately,
all of them either exclude crucial access patterns, do not scale to realistically sized
applications, or rely on a compilation process in which each stage of the application
is an independently scheduled module that sends data to its consumers through FIFOs,
which adds resource and energy overhead while inhibiting synthesis optimizations.
In this work we present a new algorithm for compiling image processing applications
to hardware, Clockwork, that combines insights from polyhedral analysis and synchronous
dataflow to overcome these limitations. Clockwork achieves an average of 43% reduction
in LUTs, 22% reduction in flip-flops, and 17% reduction in BRAMs compared to a state-of-the-art
stencil compiler at the same throughput while handling a wider range of access patterns.
For an image processing application with dozens of stages, Clockwork achieves energy
efficiency 265x that of an 8 core CPU, 17x that of an NVIDIA K80 GPU, and 2.4x that
of an NVIDIA V100 GPU.

LEAP: A Deep Learning based Aging-Aware Architecture Exploration Framework for FPGAs

  • Behnam Ghavami
  • Seyed Milad Ebrahimi
  • Zhenman Fang
  • Lesley Shannon

Transistor aging raises a vital lifetime reliability challenge for FPGA devices in
advanced technology nodes. In this paper, we design a tool called LEAP to enable
aging-aware FPGA architecture exploration. The core idea of LEAP is to efficiently
model the aging-induced delay degradation at the coarse-grained FPGA basic block level
using deep neural networks (DNNs), while achieving almost the same accuracy as the
transistor-level simulation. For each type of FPGA basic block, such as LUT and
DSP, we first characterize its accurate delay degradation via transistor-level SPICE
simulation under a versatile set of aging factors from the FPGA fabric and in-field
operation. Then we train one DNN model for each block type to learn the relation between
its delay degradation and aging factors. Moreover, we integrate our DNN models into
the widely used Verilog-to-Routing (VTR 8) toolflow and generate the aging-aware FPGA
architecture file. Experimental results demonstrate that our proposed flow can predict
the delay degradation of FPGA blocks more than 10^4x to 10^7x faster than transistor-level SPICE simulation, with a maximum prediction error
of less than 0.7%. Therefore, FPGA architects can leverage LEAP to explore better
aging-aware FPGA architectures.

Modeling FPGA-Based Systems via Few-Shot Learning

  • Gagandeep Singh
  • Dionysios Diamantopolous
  • Juan Gómez-Luna
  • Sander Stuijk
  • Onur Mutlu
  • Henk Corporaal

Machine-learning-based models have recently gained traction as a way to overcome the
slow downstream implementation process of FPGAs by building models that provide fast
and accurate performance predictions. However, these models suffer from two main limitations:
(1) a model trained for a specific environment cannot predict for a new, unknown environment;
(2) training requires large amounts of data (features extracted from FPGA synthesis
and implementation reports), which is cost-inefficient because of the time-consuming
FPGA design cycle. In various systems (e.g., cloud systems), where getting access
to platforms is typically costly, error-prone, and sometimes infeasible, collecting
enough data is even more difficult. Our research aims to answer the following question:
for an FPGA-based system, can we leverage and transfer our ML-based performance models
trained on a low-end local system to a new, unknown, high-end FPGA-based system, thereby
avoiding the aforementioned two main limitations of traditional ML-based approaches?
To this end, we propose a transfer-learning-based approach for FPGA-based systems
that adapts an existing ML-based model to a new, unknown environment to provide fast
and accurate performance and resource utilization predictions.

APCNN: Explore Multi-Layer Cooperation for CNN Optimization and Acceleration on FPGA

  • Beilei Jiang
  • Xianwei Cheng
  • Sihai Tang
  • Xu Ma
  • Zhaochen Gu
  • Hui Zhao
  • Song Fu

In this paper, we introduce APCNN, which explores algorithm-hardware co-design and
provides a CNN acceleration framework with multi-layer cooperative optimization and
customized design on FPGA. In terms of the algorithm design, the pooling layer is
moved before the non-linear activation function and normalization in APCNN, which
we prove causes negligible accuracy loss; the pooling layer is then co-optimized with
the convolutional layer by means of redundant multiplication elimination, local addition
reuse, and global addition reuse. We further design a dedicated accelerator to take
full advantage of convolutional-pooling cross-layer optimization to not only accelerate
computation but also reduce on-off chip data communication on FPGA. We demonstrate
that our novel APCNN can achieve a 75% multiplication and a 75% addition reduction in
the best case. For on-off chip data communication, a max{Row, Col}/(Row × Col) percent
of the memory footprint can be eliminated, where Row and Col are the numbers of rows and
columns in the activation feature map, respectively. We have implemented a prototype
of APCNN and evaluated its performance on LeNet-5 and VGG16 using both an accelerator-level
cycle and energy model and an RTL implementation. Our experimental results show that
APCNN achieves a 2.5× speedup and 4.7× energy efficiency compared with the dense CNN.
(This research was supported in part by NSF grants CCF-1563750, OAC-2017564, and CNS-2037982.)
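
The legality of moving max-pooling before the activation is easy to see for monotonically non-decreasing activations such as ReLU, since f(max(x)) = max(f(x)) elementwise; a quick numerical check (ours):

    import numpy as np

    relu = lambda x: np.maximum(x, 0)
    pool = lambda t: t.reshape(4, 2, 4, 2).max(axis=(1, 3))  # 2x2 max-pool
    x = np.random.randn(8, 8)
    print(np.allclose(relu(pool(x)), pool(relu(x))))         # True

The paper's contribution is showing the reordering also causes negligible accuracy loss in the presence of normalization, and then exploiting it to eliminate the multiplications whose results pooling would discard.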

ScalaBFS: A Scalable BFS Accelerator on FPGA-HBM Platform

  • Chenhao Liu
  • Zhiyuan Shao
  • Kexin Li
  • Minkang Wu
  • Jiajie Chen
  • Ruoshi Li
  • Xiaofei Liao
  • Hai Jin

High Bandwidth Memory (HBM) provides massive aggregated memory bandwidth by exposing
multiple memory channels to the processing units. To achieve high performance, an
accelerator built on top of an FPGA configured with HBM (i.e., FPGA-HBM platform)
needs to scale its performance according to the available memory channels. In this
paper, we propose an accelerator for BFS (Breadth-First Search), named ScalaBFS,
which decouples memory accesses from processing to scale its performance with the available
HBM memory channels. Moreover, by configuring each HBM memory channel with multiple
processing elements, ScalaBFS fully exploits the memory bandwidth of HBM. We
implement the prototype system of ScalaBFS and conduct BFS on both real-world and
synthetic scale-free graphs using a Xilinx Alveo U280 Data Center Accelerator card (real
hardware). The experimental results show that ScalaBFS scales its performance almost
linearly according to the available memory pseudo channels (PCs) from the HBM2 subsystem
of U280. By fully using the 32 PCs and building 64 processing elements (PEs) on U280,
ScalaBFS achieves a performance of up to 19.7 GTEPS (Giga Traversed Edges Per Second).
When conducting BFS on sparse real-world graphs, ScalaBFS achieves GTEPS equivalent
to that of Gunrock running on a state-of-the-art Nvidia V100 GPU that features 64-PC HBM2
(twice the memory bandwidth of the U280).
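
For reference, the computation being scaled is plain level-synchronous BFS (software version below, ours); ScalaBFS's contribution is spreading this frontier expansion across many PEs, each fed by its own HBM pseudo channel:

    def bfs_levels(adj, root):
        level, frontier = {root: 0}, [root]
        while frontier:
            nxt = []
            for u in frontier:             # PEs would each take a slice of this
                for v in adj[u]:
                    if v not in level:
                        level[v] = level[u] + 1
                        nxt.append(v)
            frontier = nxt
        return level

    adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(bfs_levels(adj, 0))              # {0: 0, 1: 1, 2: 1, 3: 2}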

AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators

  • Atefeh Sohrabizadeh
  • Cody Hao Yu
  • Min Gao
  • Jason Cong

Adopting FPGAs as accelerators in datacenters is becoming mainstream for customized
computing, but the fact that FPGAs are hard to program creates a steep learning curve
for software programmers. Even with the help of high-level synthesis (HLS), accelerator
designers still must manually perform code reconstruction and cumbersome parameter
tuning to achieve optimal performance. While many learning models have been leveraged
by existing work to automate the design of efficient accelerators, the unpredictability
of modern HLS tools becomes a major obstacle to their maintaining high accuracy. We
address this problem with an automated DSE framework – AutoDSE – that
leverages a bottleneck-guided gradient optimizer to systematically find a better design
point. In each step, AutoDSE finds the bottleneck of the design and focuses on the
high-impact parameters to overcome it, much as an expert would. The
experimental results show that AutoDSE is able to find design points that achieve,
on the geometric mean, a 19.9x speedup over one CPU core for MachSuite and Rodinia benchmarks
and 1.04x over the manually designed HLS-accelerated vision kernels in the Xilinx Vitis
libraries, while requiring 26x fewer optimization pragmas.
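
The bottleneck-guided loop can be sketched as follows (our simplification with a toy cost model, not AutoDSE's implementation): each step evaluates the design, names the bottleneck, and pushes only on the parameter that targets it:

    def evaluate(cfg):                        # toy cost model, invented numbers
        compute = 1000 / cfg["unroll"]
        memory = 800 * 32 / cfg["bitwidth"]
        return max(compute, memory), ("compute" if compute >= memory else "memory")

    tunes = {"compute": "unroll", "memory": "bitwidth"}  # bottleneck -> pragma
    cfg = {"unroll": 1, "bitwidth": 32}

    for _ in range(6):
        cycles, bottleneck = evaluate(cfg)
        cfg[tunes[bottleneck]] *= 2           # push on the current bottleneck
        print(f"{cycles:7.1f} cycles, bottleneck: {bottleneck}, next: {cfg}")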

SWIFT: Small-World-based Structural Pruning to Accelerate DNN Inference on FPGA

  • Yufei Ma
  • Gokul Krishnan
  • Yu Cao
  • Le Ye
  • Ru Huang

State-of-the-art DNN pruning approaches achieve high sparsity. However, these methods
usually do not consider the intrinsic graph properties of DNNs, leading to irregular
pruned networks. Consequently, hardware accelerators cannot directly benefit from such
pruning, suffering additional costs in indexing, control, and data paths. Inspired by
the observation that the brain and real-world networks follow a Small-World model,
we propose a graph-based progressive structural pruning technique, SWIFT, that integrates
local clusters and global sparsity in DNNs to benefit the dataflow and workload balance
of the accelerators. In particular, we propose an output stationary FPGA architecture
to accelerate DNN inference and integrate it with the structural sparsity by SWIFT,
so that the communication and computation of clustered zero weights are eliminated.
In addition, a full mesh data router is designed to adaptively direct inputs into
the corresponding processing elements (PEs) for different layer configurations and to skip
zero operations. SWIFT is evaluated with multiple DNNs on different datasets.
It achieves sparsity ratios of up to 76% for CIFAR-10, 83% for CIFAR-100, and 76% for the
SVHN dataset. Moreover, our SWIFT FPGA accelerator achieves up to a 4.4× improvement
in throughput for different dense networks with a marginal hardware overhead.

Fuzzing High-Level Synthesis Tools

  • Zewei Du
  • Yann Herklotz
  • Nadesh Ramanathan
  • John Wickerson

High-level synthesis (HLS) is becoming an increasingly important part of the computing
landscape, even in safety-critical domains where correctness is key. As such, HLS
tools are increasingly relied upon. But are they trustworthy?

We have subjected three widely used HLS tools – LegUp, Xilinx Vivado HLS, and the
Intel HLS Compiler – to a rigorous fuzzing campaign using thousands of random, valid
C programs that we generated using a modified version of the Csmith tool. We compiled
each C program to a hardware design using the HLS tool under test and checked
whether that hardware design generates the same output as an executable generated
by the GCC compiler. When discrepancies arose between GCC and the HLS tool under test,
we reduced the C program to a minimal example in order to zero in on the potential
bug. Our testing campaign has revealed that all three HLS tools can be made either
to crash or to generate wrong code when given valid C programs, and thereby underlines
the need for these increasingly trusted tools to be more rigorously engineered. Out
of 6700 test cases, we found 272 programs that failed in at least one tool, out of
which we were able to discern at least 6 unique bugs.
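
The campaign has the shape of a classic differential-testing loop; the sketch below uses a hypothetical hls_and_simulate.sh wrapper in place of the real tool invocations, which differ per vendor:

    import subprocess

    def oracle_output(c_file):
        subprocess.run(["gcc", c_file, "-o", "ref"], check=True)
        return subprocess.run(["./ref"], capture_output=True).stdout

    def hls_output(c_file):
        # hypothetical wrapper: run HLS on c_file, simulate the generated
        # design, and print whatever the design outputs
        return subprocess.run(["./hls_and_simulate.sh", c_file],
                              capture_output=True).stdout

    def check(c_file):
        if oracle_output(c_file) != hls_output(c_file):
            print(f"{c_file}: GCC and HLS disagree; reduce to a minimal case")

    # for each Csmith-generated program: check(program)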

RIFL: A Reliable Link Layer Network Protocol for FPGA-to-FPGA Communication

  • Qianfeng (Clark) Shen
  • Jun Zheng
  • Paul Chow

More and more latency-sensitive applications are being introduced into the data center.
Performance of such applications can be limited by the high latency of the network
interconnect. Because the conventional network stack is designed not only for LAN,
but also for WAN, it carries a great amount of redundancy that is not required in
a data center network. This paper introduces the concept of a three-layer protocol
stack that can replace the conventional network stack and fulfill the exact demands
of data center network communications. The detailed design and implementation of the
first layer of the stack, which we call RIFL, is presented. A novel low latency in-band
hop-by-hop re-transmission protocol is proposed and adopted in RIFL, which guarantees
lossless transmission for links whose longest wire segment is no more than 150 meters.
Experimental results show that RIFL achieves a 218-nanosecond round-trip latency on
3-meter zero-hop links, at a throughput of 104.7 Gbps. RIFL is a multi-lane protocol
with scalable throughput from 500 Mbps to above 200 Gbps. It is portable to most
recent FPGAs. It can be the enabler of low-latency, high-throughput, flexible,
scalable, and lossless data center networks.

SESSION: Session 3: Machine Learning and Supporting Algorithms

GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing

  • Qinggang Wang
  • Long Zheng
  • Yu Huang
  • Pengcheng Yao
  • Chuangyi Gui
  • Xiaofei Liao
  • Hai Jin
  • Wenbin Jiang
  • Fubing Mao

Existing FPGA-based graph accelerators, typically designed for static graphs, rarely
handle dynamic graphs that often involve substantial graph updates (e.g., edge/node
insertion and deletion) over time. In this paper, we aim to fill this gap. The key
innovation of this work is to build an FPGA-based dynamic graph accelerator easily
from any off-the-shelf static graph accelerator with minimal hardware engineering
efforts (rather than from scratch). We observe spatial similarity of dynamic graph
updates in the sense that most graph updates involve only a small fraction
of vertices. We therefore propose an FPGA library, called GraSU, to exploit spatial
similarity for fast graph updates. GraSU uses differential data management, which
retains the high-value data (that will be frequently accessed) in the specialized
on-chip UltraRAM while the overwhelming majority of low-value ones reside in the off-chip
memory. Thus, GraSU can transform most of off-chip communications arising in dynamic
graph updates into fast on-chip memory accesses. Our experience shows that GraSU can
be easily integrated into existing state-of-the-art static graph accelerators with
only 11 lines of code modifications. Our implementation atop AccuGraph using a Xilinx
Alveo U250 board outperforms two state-of-the-art CPU-based dynamic graph
systems, Stinger and Aspen, by an average of 34.24× and 4.42× in terms of update throughput,
and further improves overall efficiency by 9.80× and 3.07× on average.

Folded Integer Multiplication for FPGAs

  • Martin Langhammer
  • Bogdan Pasca

Encryption – especially key exchange algorithms such as RSA – is an increasingly
common use model for FPGAs, driven by the adoption of the FPGA as a SmartNIC in the datacenter.
While bulk encryption such as AES maps well to generic FPGA features, the very large
multipliers required for RSA are a much more difficult problem. Although FPGAs contain
thousands of small integer multipliers in DSP Blocks, aggregating them into very large
multipliers is challenging because of the large amount of soft logic required
(especially in the form of long adders) and the high embedded multiplier count. In
this paper, we describe a large multiplier architecture that operates in a multi-cycle
format and which has a linear area/throughput ratio. We show results for a 2048-bit
multiplier that has a latency of 118 cycles, inputs data every 9th cycle and closes
timing at 377MHz in an Intel Arria 10 FPGA, and over 400MHz in a Stratix 10. The proposed
multiplier uses 1/9 of the DSP resources typically used in a 2048-bit Karatsuba implementation,
showing a perfectly linear throughput-to-DSP-count ratio. Our proposed solution outperforms
recently reported results in either arithmetic complexity (by making use of
Karatsuba techniques) or scheduling efficiency (embedded DSP resources are fully
utilized).
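
The folding idea can be modeled with big integers (our much-simplified sketch, not the paper's architecture): split one operand into limbs and accumulate one shifted partial product per "cycle", so the multiplier hardware is reused rather than replicated:

    import random

    W = 256                                    # limb width in bits

    def folded_mul(a, b, n_limbs=8):
        acc = 0
        for i in range(n_limbs):               # one partial product per cycle
            bi = (b >> (W * i)) & ((1 << W) - 1)
            acc += (a * bi) << (W * i)         # same narrow multiplier each time
        return acc

    a, b = (random.getrandbits(2048) for _ in range(2))
    assert folded_mul(a, b) == a * b

Throughput then scales with how many limbs are processed per cycle, which is the linear area/throughput trade-off described above.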

FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations

  • Yichi Zhang
  • Junhao Pan
  • Xinheng Liu
  • Hongzheng Chen
  • Deming Chen
  • Zhiru Zhang

Binary neural networks (BNNs) have 1-bit weights and activations. Such networks are
well suited for FPGAs, as their dominant computations are bitwise arithmetic and the
memory requirement is also significantly reduced. However, compared to state-of-the-art
compact convolutional neural network (CNN) models, BNNs tend to produce much lower
accuracy on realistic datasets such as ImageNet. In addition, the input layer of BNNs
has gradually become a major compute bottleneck, because it is conventionally excluded
from binarization to avoid a large accuracy loss.

This work proposes FracBNN, which exploits fractional activations to substantially
improve the accuracy of BNNs. Specifically, our approach employs a dual-precision
activation scheme to compute features with up to two bits, using an additional sparse
binary convolution. We further binarize the input layer using a novel thermometer
encoding. Overall, FracBNN preserves the key benefits of conventional BNNs, where
all convolutional layers are computed in pure binary MAC operations (BMACs). We design
an efficient FPGA-based accelerator for our novel BNN model that supports the fractional
activations. To evaluate the performance of FracBNN under a resource-constrained scenario,
we implement the entire optimized network architecture on an embedded FPGA (Xilinx
Ultra96 v2). Our experiments on ImageNet show that FracBNN achieves an accuracy comparable
to MobileNetV2, surpassing the best-known BNN design on FPGAs with an increase of
28.9% in top-1 accuracy and a 2.5x reduction in model size. FracBNN also outperforms
a recently introduced BNN model with an increase of 2.4% in top-1 accuracy while using
the same model size. On the embedded FPGA device, FracBNN demonstrates real-time
image classification.
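
Thermometer encoding, used here for the input layer, replaces each 8-bit pixel with a unary bit-vector so the first layer can also run as binary convolution; a sketch of the encoding (ours, with an assumed number of levels):

    import numpy as np

    def thermometer(pixels, levels=8):
        q = (pixels.astype(np.int32) * levels) // 256   # bucket index, 0..levels-1
        return (np.arange(levels) < q[..., None]).astype(np.uint8)

    px = np.array([0, 64, 128, 255], dtype=np.uint8)
    print(thermometer(px))
    # 128 -> [1 1 1 1 0 0 0 0]: magnitude becomes a run of leading ones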

DYNAMAP: Dynamic Algorithm Mapping Framework for Low Latency CNN Inference

  • Yuan Meng
  • Sanmukh Kuppannagari
  • Rajgopal Kannan
  • Viktor Prasanna

Most of the existing work on FPGA acceleration of Convolutional Neural Networks (CNNs)
focuses on employing a single strategy (algorithm, dataflow, etc.) across all the
layers. Such an approach does not achieve optimal latency on complex and deep CNNs.
Emerging CNNs have diverse per-layer computation characteristics including parallelism,
arithmetic intensity, locality, and memory footprint. Per-layer strategy selection
and fine-grained tuning are required to achieve low end-to-end latency. However, specialized
hardware modules dedicated to each layer limit the per-layer utilization and adversely
affect end-to-end latency. In this paper, we address these problems by an algorithm-architecture
co-optimization framework, DYNAMAP, consisting of (1) a unified hardware overlay that
can be reused across layers, supporting dynamic mapping of all three families of popular
convolution algorithms, and further allowing flexible dataflow switching to maximize
hardware utilization for each layer; (2) a novel software Design Space Exploration
(DSE) flow that customizes the hardware overlay and chooses optimal strategy mapping.
We show that the algorithm mapping space increases exponentially with network depth,
and while the optimal algorithm selection problem is NP-hard in general, by exploiting
the series-parallel structure of CNN models, we demonstrate a polynomial-time solution
for optimal algorithm mapping. DYNAMAP is optimized for any CNN, including those having
diverse computation and memory requirements across the layers. We demonstrate DYNAMAP
using two state-of-the-art CNNs – GoogLeNet and Inception-V4. The generated accelerators
achieve up to 2.8x and 1.4x speedups, respectively, in inference latency compared
with state-of-the-art FPGA implementations.

S2N2: An FPGA Accelerator for Streaming Spiking Neural Networks

  • Alireza Khodamoradi
  • Kristof Denolf
  • Ryan Kastner

Spiking Neural Networks (SNNs) are the next generation of Artificial Neural Networks
(ANNs) that utilize an event-based representation to perform more efficient computation.
Most SNN implementations have a systolic array-based architecture and, by assuming
high sparsity in spikes, significantly reduce computing in their designs. This work
shows that this assumption does not hold for applications with signals of large temporal
dimension. We develop a streaming SNN (S2N2) architecture that can support fixed per-layer
axonal and synaptic delays for its network. Our architecture is built upon FINN and
thus efficiently utilizes FPGA resources. We show how radio frequency processing matches
our S2N2 computational model. Because S2N2 does not perform tick-batching, it can
efficiently process a stream of RF samples, improving memory utilization by more than
three orders of magnitude.

CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs

  • Qijing Huang
  • Dequan Wang
  • Zhen Dong
  • Yizhao Gao
  • Yaohui Cai
  • Tian Li
  • Bichen Wu
  • Kurt Keutzer
  • John Wawrzynek

Deploying deep learning models on embedded systems for computer vision tasks has been
challenging due to limited compute resources and strict energy budgets. The majority
of existing work focuses on accelerating image classification, while other fundamental
vision problems, such as object detection, have not been adequately addressed. Compared
with image classification, detection problems are more sensitive to the spatial variance
of objects, and therefore, require specialized convolutions to aggregate spatial information.
To address this need, recent work introduces dynamic deformable convolution to augment
regular convolutions. Regular convolutions process a fixed grid of pixels across all
the spatial locations in an image, while dynamic deformable convolution may access
arbitrary pixels in the image with the access pattern being input-dependent and varying
with spatial location. These properties lead to inefficient memory accesses of inputs
with existing hardware.

In this work, we harness the flexibility of FPGAs to develop a novel object detection
pipeline with deformable convolutions. We show the speed-accuracy tradeoffs for a
set of algorithm modifications including irregular-access versus limited-range and
fixed-shape on a flexible hardware accelerator. We evaluate these algorithmic changes
with corresponding hardware optimizations and show 1.36x and 9.76x speedups, respectively,
for the full and depthwise deformable convolutions on hardware with minor accuracy
loss. We then co-design a network called CoDeNet with the modified deformable convolution
for object detection and quantize the network to 4-bit weights and 8-bit activations.
With our high-efficiency implementation, our solution reaches 26.9 frames per second
with a tiny model size of 0.76 MB while achieving 61.7 AP50 on the standard object
detection dataset, Pascal VOC. With our higher-accuracy implementation, our model
reaches 67.1 AP50 on Pascal VOC with only 2.9 MB of parameters, 20.9x smaller yet
10% more accurate than Tiny-YOLO.

Efficient FPGA Modular Multiplication Implementation

  • Martin Langhammer
  • Bogdan Pasca

Barrett’s algorithm is the most commonly known method of performing a modular multiplication,
which is the core of many modern encryption algorithms such as RSA. Barrett’s algorithm
requires an accurate quotient estimation which in turn requires accurate multiplications.
These multiplications operating on word sizes of thousands of bits are particularly
expensive to implement in FPGAs, requiring many hundreds or even thousands of embedded
DSP components along with large amounts of logic and routing. In this work we show
that approximate quotient estimates resulting from aggressive multiplier truncation
can significantly reduce implementation cost. The looser output of the modified Barrett's
algorithm, in [0, YM), is reduced to [0, M) using a shallow reduction technique based on table lookups
and wide additions, taking advantage of new techniques which have recently been introduced
for FPGAs. We first use these techniques to develop an improved standard Barrett's
implementation for 1024b modular multiplication, followed by our approximate method
which reduces logic cost in the LSB truncated multiplier by approximately 10%. The
effect is more pronounced for very large word sizes, where our relaxed error bounds
in the LSB truncated multiplication can reduce the number of operations by 20%.
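
For orientation, textbook Barrett reduction looks as follows (standard form with an untruncated quotient estimate; the paper's contribution is showing how aggressively this estimate can be truncated on FPGA):

    import random

    def barrett_setup(M, k):                  # precompute mu once per modulus
        assert M < (1 << k)
        return (1 << (2 * k)) // M

    def barrett_mulmod(a, b, M, k, mu):
        x = a * b                              # the expensive wide multiply
        q = (x * mu) >> (2 * k)                # quotient estimate, off by at most 2
        r = x - q * M                          # lands in [0, 3M)
        while r >= M:                          # at most two subtractions
            r -= M
        return r

    k = 1024
    M = random.getrandbits(k) | (1 << (k - 1)) | 1
    mu = barrett_setup(M, k)
    a, b = random.getrandbits(k) % M, random.getrandbits(k) % M
    assert barrett_mulmod(a, b, M, k, mu) == (a * b) % M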

SESSION: Keynote 2

Are We Alone? Searching for ET with FPGAs

  • Dan Werthimer

What is the possibility of other intelligent life in the universe? Can we detect radio,
infrared, or visible light signals from alien civilizations? Current and future projects
searching for such signals may provide an answer. Dan will describe SETI@home, the
new PANOSETI observatory, future searches, and show how FPGAs and new technologies
are revolutionizing the search for extra-terrestrial intelligence (SETI).

Dan will also describe the Collaboration for Astronomy Signal Processing and Electronics
Research (CASPER) open-source hardware, tools, and libraries for FPGA-based radio astronomy
instrumentation, which produced the first image of a black hole and discovered many
fast radio bursts, pulsars, and a planet made from solid diamond. Next-generation
radio telescopes will be composed of hundreds to thousands of smaller telescopes;
these large arrays require peta-ops per second of real-time processing to combine
telescope signals and generate spectral-images. Dan will describe these telescopes
and their real time signal processing systems.

Open source hardware, software, libraries, tools, reference designs and video training
are available at http://casper.berkeley.edu

SESSION: Poster Session 2

Stealing Neural Network Structure through Remote FPGA Side-channel Analysis

  • Yicheng Zhang
  • Rozhin Yasaei
  • Hao Chen
  • Zhou Li
  • Mohammad Abdullah Al Faruque

Deep Neural Network (DNN) models have been extensively developed by companies for
a wide range of applications. The development of a customized DNN model with great
performance requires costly investments, and its structure (layers and hyper-parameters)
is considered intellectual property and holds immense value. However, in this paper,
we found the model secret is vulnerable when a cloud-based FPGA accelerator executes
it. We demonstrate an end-to-end attack based on remote power side-channel analysis
and machine-learning-based secret inference against different DNN models. The evaluation
result shows that an attacker can reconstruct the layer and hyper-parameter sequence
at over 90% accuracy using our method, which can significantly reduce her model development
workload. We believe the threat presented by our attack is tangible, and new defense
mechanisms should be developed against this threat.

Exploring PGAS Communication for Heterogeneous Clusters with FPGAs

  • Varun Sharma
  • Paul Chow

This work presents a heterogeneous communication library for generic clusters of processors
and FPGAs. This library, Shoal, supports the Partitioned Global Address Space (PGAS)
memory model for applications. PGAS is a shared memory model for clusters that creates
a distinction between local and remote memory access. Through Shoal and its common
application programming interface for hardware and software, applications can be more
freely migrated to the optimal platform and deployed onto dynamic cluster topologies.

The library is tested using a thorough suite of microbenchmarks to establish latency
and throughput performance. We also show an implementation of the Jacobi iterative
method that demonstrates the ease with which applications can be moved between platforms
to yield faster run times.

Extending High-Level Synthesis for Task-Parallel Programs

  • Yuze Chi
  • Licheng Guo
  • Young-kyu Choi
  • Jie Wang
  • Jason Cong

C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable
gate array (FPGA) accelerators in many application domains in recent years, thanks
to its competitive quality of result (QoR) and short development cycle compared with
the traditional register-transfer level (RTL) design approach. Yet, limited by the
sequential C semantics, it remains challenging to adopt the same highly productive
high-level programming approach in many other application domains, where coarse-grained
tasks run in parallel and communicate with each other at a fine-grained level. While
current HLS tools support task-parallel programs, the productivity is greatly limited
in the code development, correctness verification, and QoR tuning cycles, due to the
poor programmability, restricted software simulation, and slow code generation, respectively.
Such limited productivity often defeats the purpose of HLS and hinders programmers
from adopting HLS for task-parallel FPGA accelerators.

In this paper, we extend the HLS C++ language and present a fully automated framework
with programmer-friendly interfaces, universal software simulation, and fast code
generation to overcome these limitations. Experimental results based on a wide range
of real-world task-parallel programs show that, on average, the lines of kernel and
host code are reduced by 22% and 51%, respectively, which considerably improves the
programmability. The correctness verification and iterative QoR tuning cycles
are accelerated by 3.2× and 6.8×, respectively.
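
What "coarse-grained tasks communicating at a fine-grained level" means can be seen in a small software model (our sketch in Python; the paper extends HLS C++): tasks as threads, channels as bounded FIFOs:

    import threading, queue

    def producer(ch):
        for i in range(8):
            ch.put(i)                  # fine-grained, per-token communication
        ch.put(None)                   # end-of-stream marker

    def consumer(ch, out):
        while (tok := ch.get()) is not None:
            out.append(tok * tok)

    ch, out = queue.Queue(maxsize=2), []   # small capacity, like an HLS FIFO
    ts = [threading.Thread(target=producer, args=(ch,)),
          threading.Thread(target=consumer, args=(ch, out))]
    for t in ts: t.start()
    for t in ts: t.join()
    print(out)                         # [0, 1, 4, 9, 16, 25, 36, 49]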

Simulating and Evaluating a Quaternary Logic FPGA Based on Floating-gate Memories
and Voltage Division

  • Ayokunle Fadamiro
  • Pouyan Rezaie
  • Spencer Millican
  • Christopher Harris

Technology scaling cannot meet consumer demands, especially for binary circuits. Previous
studies proposed addressing this with multi-valued logic (MVL) architectures, but
these architectures use non-standard fabrication techniques and optimistic performance
analysis. This study presents a new quaternary FPGA (QFPGA) architecture based on
floating-gate memories that can be fabricated with standard CMOS processes: programmed
floating gates implement a voltage divider, and the divided voltages represent one of
four distinct logic values. When simulated with open-source FinFET SPICE models, the proposed architecture
obtains competitive delay and power performance compared to equivalent binary and
QFPGA architectures from the literature. Results show that the proposed QFPGA basic logic element
(BLE) requires half the area and dissipates a third of the power density compared
to QFPGA architectures from the literature. When projecting BLE performance onto benchmark
circuits, implemented circuits require up to 55% less area and one-third the power,
and the proposed architecture can operate at clock speeds up to three times faster
than binary equivalents. Future studies will investigate accurate modeling of interconnects
to better account for their performance impacts and will explore efficient architectures
for programming MVL memories when they’re used in FPGAs.

Resource Sharing in Dataflow Circuits

  • Lana Josipović
  • Axel Marmet
  • Andrea Guerrieri
  • Paolo Ienne

To achieve resource-efficient hardware designs, high-level synthesis tools share functional
units among operations of the same type. This optimization is typically performed
in conjunction with operation scheduling to ensure the best possible unit usage at
each point in time. Dataflow circuits have emerged as an alternative HLS approach
to efficiently handle irregular and control-dominated code. However, these circuits
do not have a predetermined schedule; in its absence, it is challenging to determine
which operations can share a functional unit without a performance penalty. Additionally,
although sharing seems to imply only trivial circuitry, sharing units in dataflow
circuits may cause deadlock by blocking certain data transfers and preventing operations
from executing. We developed a complete methodology to implement resource sharing
in dataflow designs. Our approach automatically identifies performance-acceptable
resource sharing opportunities based on average unit utilization with data tokens.
Our sharing mechanism achieves functionally correct and deadlock-free circuits by
regulating the multiplexing of tokens at the inputs of the shared unit. On a set of
benchmarks obtained from C code, we show that our approach effectively implements
resource sharing and results in significant area savings compared to dataflow circuits
that do not support this feature. Our sharing mechanism is key to achieving different
area-performance tradeoffs in dataflow designs and to making them competitive in terms
of computational resources with circuits produced by standard HLS techniques.
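
A toy software model of the sharing idea follows: requests from several operations
are multiplexed at the input of one shared unit in FIFO order, and each result is
routed back to its originating operation so that no token is blocked indefinitely.
This illustrates the general concept only, not the paper's actual sharing mechanism.

    # Toy model of sharing one functional unit among several dataflow operations.
    # Requests are multiplexed at the unit's input; each result is routed back to
    # its originating operation, so every pending operation eventually progresses.
    from collections import deque

    class SharedUnit:
        def __init__(self, fn):
            self.fn = fn
            self.queue = deque()  # (requester_id, operand) pairs, FIFO order

        def request(self, requester_id, operand):
            self.queue.append((requester_id, operand))

        def step(self):
            # Serve one token per "cycle"; FIFO order regulates the multiplexing.
            if self.queue:
                rid, x = self.queue.popleft()
                return rid, self.fn(x)
            return None

    mul = SharedUnit(lambda x: x * x)
    mul.request("op_A", 3)
    mul.request("op_B", 4)
    while (res := mul.step()) is not None:
        print(res)  # ('op_A', 9) then ('op_B', 16)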

Triggered Scheduling: Efficient Detection of Dataflow Network Idleness on Heterogeneous
Systems

  • Mahyar Emami
  • Endri Bezati
  • Jörn W. Janneck
  • James Larus

Hardware-software codesign for FPGAs requires flexible and changeable boundaries between
hardware and software. Design space exploration is facilitated by expressing programs
in a language that can be compiled for both CPU and FPGA execution. Such an approach
requires efficient and general communication mechanisms between hardware and software.
We present a practical solution to this problem for heterogeneous programs expressed
in CAL, an actor-based language, running on a PCIe-based FPGA system where communication
between a processor and FPGA is relatively expensive. We show how a network of continuously
executing software and hardware actors with fine-grained communication can be expressed
as a coprocessor model that executes the network in discrete steps with efficient
coarse-grained transfers across the PCIe bus.

To this end, we present the Triggered Scheduling (TS) algorithm to detect idleness
(i.e., lack of forward progress) of a dynamic actor network with unpredictable consumption/production
rates. With TS, it is possible to treat a network of actors running on hardware as
a coprocessor that can be called by software. We show how TS can be used to build
a truly heterogeneous system on an HLS platform. Using four large benchmarks, we analyze
the performance and resource utilization of the Triggered Scheduling algorithm.
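
A simplified software analogue of idleness detection is sketched below: every actor
is repeatedly offered a chance to fire, and the network is declared idle once a full
sweep produces no firings, at which point control can return to the software caller.
The actor interface is hypothetical and greatly simplified relative to TS.

    # Sketch of idleness detection for a network of actors with unpredictable
    # consumption/production rates: sweep all actors, firing any that can make
    # progress; a sweep with no firings means the network is idle.
    def run_until_idle(actors):
        steps = 0
        while True:
            progressed = False
            for actor in actors:
                while actor.can_fire():  # fire as long as inputs/space allow
                    actor.fire()
                    progressed = True
                    steps += 1
            if not progressed:           # no forward progress => network idle
                return steps             # control returns to the software caller

    class Counter:
        """Trivial actor: produces tokens until a budget is exhausted."""
        def __init__(self, budget):
            self.budget = budget
        def can_fire(self):
            return self.budget > 0
        def fire(self):
            self.budget -= 1

    print(run_until_idle([Counter(3), Counter(2)]))  # -> 5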

Classifying Computations on Multi-Tenant FPGAs

  • Mustafa Gobulukoglu
  • Colin Drewes
  • Bill Hunter
  • Dustin Richmond
  • Ryan Kastner

Modern data centers leverage large FPGAs to provide low latency, high throughput,
and low energy computation. FPGA multi-tenancy is an attractive option to maximize
utilization, yet it opens the door to unique security threats. In this work, we develop
a remote classification pipeline that targets the confidentiality of multi-tenant
cloud FPGA environments. We design a unique Dual-Edged voltage fluctuation sensor
that measures subtle changes in the power distribution network caused by co-located
computations. The sensor measurements are given to a classification pipeline that
is able to deduce information about co-located applications including the type of
computation and its implementation. We study the importance of the trace length, signal
conditioning algorithms, and other aspects that affect classification accuracy. Our
results show that we can determine if another co-tenant is present with 96% accuracy.
We can classify with 98% accuracy whether a power waster circuit is operating. Furthermore,
we are able to determine whether a cryptographic operation is occurring, differentiate
between cryptographic algorithms (AES and PRESENT), and distinguish between microarchitectural
implementations (Microblaze, ORCA, and PicoRV32).
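
The following skeleton shows the general shape of such a classification pipeline:
window the raw sensor trace, extract summary features, and train a classifier. The
feature choices and the random-forest model are illustrative assumptions, not the
paper's exact pipeline.

    # Skeleton of a remote side-channel classification pipeline: window the raw
    # voltage-fluctuation trace, extract simple per-window features, train a
    # classifier. Features and model are illustrative, not the paper's pipeline.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def features(trace, win=256):
        # Split a 1-D trace into fixed windows; summarize each window with
        # mean, standard deviation, and peak-to-peak amplitude.
        wins = trace[: len(trace) // win * win].reshape(-1, win)
        return np.concatenate([wins.mean(1), wins.std(1), np.ptp(wins, axis=1)])

    rng = np.random.default_rng(0)
    X = np.stack([features(rng.normal(size=4096)) for _ in range(64)])
    y = rng.integers(0, 2, size=64)        # e.g., co-tenant present / absent
    clf = RandomForestClassifier(n_estimators=50).fit(X, y)
    print(clf.predict(X[:4]))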

NPE: An FPGA-based Overlay Processor for Natural Language Processing

  • Hamza Khan
  • Asma Khan
  • Zainab Khan
  • Lun Bin Huang
  • Kun Wang
  • Lei He

In recent years, transformer-based models have shown state-of-the-art results for
Natural Language Processing (NLP). In particular, the introduction of the BERT language
model brought with it breakthroughs in tasks such as question answering and natural
language inference, advancing applications that allow humans to interact naturally
with embedded devices. FPGA-based overlay processors have been shown to be effective
solutions for edge image and video processing applications, which mostly rely on low
precision linear matrix operations. In contrast, transformer-based NLP techniques
employ a variety of higher precision nonlinear operations with significantly higher
frequency. We present NPE, an FPGA-based overlay processor that can efficiently execute
a variety of NLP models. NPE offers software-like programmability to the end user
and, unlike FPGA designs that implement specialized accelerators for each nonlinear
function, can be upgraded for future NLP models without requiring reconfiguration.
NPE can meet real-time conversational AI latency targets for the BERT language model
with 4x lower power than CPUs and 6x lower power than GPUs. We also show NPE uses
3x fewer FPGA resources relative to comparable BERT network-specific accelerators
in the literature. NPE provides a cost-effective and power-efficient FPGA-based solution
for Natural Language Processing at the edge.

PyLog: An Algorithm-Centric Python-Based FPGA Programming and Synthesis Flow

  • Sitao Huang
  • Kun Wu
  • Hyunmin Jeong
  • Chengyue Wang
  • Deming Chen
  • Wen-mei Hwu

The exploding complexity and computation efficiency requirements of applications are
stimulating a strong demand for hardware acceleration with heterogeneous platforms
such as FPGAs. However, a high-quality FPGA design is very hard to create as it requires
FPGA expertise and a long design iteration time. In contrast, software applications
are typically developed in a short development cycle, in high-level languages like
Python, which is at a much higher level of abstraction than all existing hardware
design flows. To close this gap between hardware design flows and software applications,
and simplify FPGA programming, we create PyLog, a high-level, algorithm-centric Python-based
programming and synthesis flow for FPGAs. PyLog is powered by a set of compiler optimization
passes and a type inference system to generate high-quality designs. It abstracts away
the implementation details and allows designers to focus on algorithm specification.
PyLog captures more high-level computation patterns for better optimization than traditional
HLS systems. PyLog also has a runtime for running PyLog code directly on FPGA platforms
without any extra code development. Evaluation shows that PyLog significantly improves
FPGA design productivity and generates highly efficient FPGA designs that outperform
highly optimized CPU and FPGA versions by 3.17× and 1.24× on average.
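
As a rough illustration of an algorithm-centric flow, the sketch below uses a decorator
to mark a NumPy-style kernel for synthesis while the same function remains runnable
in plain Python. The decorator name and behavior here are hypothetical and do not
reproduce PyLog's documented interface.

    # Hypothetical sketch in the spirit of an algorithm-centric Python flow:
    # a decorator tags a kernel for hardware generation while the function
    # stays runnable in software. Not PyLog's actual API.
    import numpy as np

    def hls_kernel(fn):
        # In a real flow this would trigger type inference and code generation;
        # here it merely attaches a stub and leaves software execution intact.
        fn.synthesize = lambda: print(f"[stub] would generate HLS for {fn.__name__}")
        return fn

    @hls_kernel
    def vec_add(a, b):
        return a + b   # high-level pattern a compiler could map to hardware

    print(vec_add(np.arange(4), np.ones(4)))  # software execution
    vec_add.synthesize()                      # hardware path (stubbed)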

MLBlocks: FPGA Blocks for Machine Learning Applications

  • Seyedramin Rasoulinezhad
  • David Boland
  • Philip H.W. Leong

The underlying goal of FPGA architecture research is to devise flexible substrates
which implement a wide variety of circuits efficiently. Contemporary FPGA architectures
have been optimized to support networking, signal processing and image processing
applications through high precision digital signal processing (DSP) blocks. The recent
emergence of machine learning has created a new set of demands characterized by: 1)
higher computational density and 2) low precision arithmetic requirements. With the
goal of exploring this new design space in a methodical manner, we first propose a
problem formulation involving computing nested loops over multiply-accumulate (MAC)
operations, which covers many basic linear algebra primitives and standard deep neural
network (DNN) layers. A quantitative methodology for deriving efficient coarse-grained
compute block architectures from benchmarks is then proposed together with a family
of new compute units, called MLBlocks. These blocks are flexible mesh-based systolic
array units parameterized with different data movements, data reuse, and multi-precision
support. They utilize a columnar arrangement which is compatible with existing FPGA
architectures. Finally, using synthetic benchmarks, we demonstrate that MLBlocks offer
significantly improved performance over the commercial Xilinx DSP48E2, while maintaining
similar area and timing requirements to current DSPs.
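
The problem formulation reduces, in its simplest instance, to nested loops around a
multiply-accumulate core, as in the matrix product below; real DNN layers wrap
additional loop nests (batch, channels, spatial) around the same MAC kernel.

    # Simplest instance of the paper's formulation: nested loops over
    # multiply-accumulate (MAC) operations, here a matrix-matrix product.
    def matmul_mac(A, B):
        n, k = len(A), len(A[0])
        m = len(B[0])
        C = [[0.0] * m for _ in range(n)]
        for i in range(n):           # output rows
            for j in range(m):       # output columns
                for p in range(k):   # reduction loop: the MAC inner loop
                    C[i][j] += A[i][p] * B[p][j]   # multiply-accumulate
        return C

    print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]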

3M-AI: A Multi-task and Multi-core Virtualization Framework for Multi-FPGA AI Systems
in the Cloud

  • Shulin Zeng
  • Guohao Dai
  • Hanbo Sun
  • Jun Liu
  • Hongren Zheng
  • Yusong Wu
  • Fan Zhang
  • Xinhao Yang
  • Yi Cai
  • Yu Wang
  • Huazhong Yang

With the ever-growing demands for online Artificial Intelligence (AI), the hardware
virtualization support for deep learning accelerators is vital for providing AI capability
in the cloud. Three basic features, multi-task, dynamic workload, and remote access,
are fundamental for hardware virtualization. However, most deep learning accelerators
do not support concurrent execution of multiple tasks. Besides, state-of-the-art multi-DNN
scheduling algorithms for NN accelerators consider neither concurrent multi-task
execution nor resource allocation for multi-core DNN accelerators. Moreover,
existing GPU virtualization solutions can introduce a huge remote access latency overhead,
resulting in a severe system performance drop.

In order to tackle these challenges, we propose 3M-AI, a Multi-task and Multi-core
virtualization framework for Multi-FPGA AI systems in the cloud. 3M-AI enables model
parallelism on multi-FPGA by optimizing data synchronization and movement between
FPGAs. 3M-AI exploits a heuristic hardware resource allocation algorithm and an accurate
multi-core latency prediction model. 3M-AI significantly reduces the remote API access
overhead to nearly 1%, and achieves better NN inference latency at batch size 1 than
GPU virtualization solutions.

SESSION: Session 4: Applications

Reconfigurable Acceleration of Short Read Mapping with Biological Consideration

  • Ho-Cheung Ng
  • Izaak Coleman
  • Shuanglong Liu
  • Wayne Luk

Existing FPGA accelerators for short read mapping often sacrifice the complete
biological information in sequencing data in favor of simple hardware designs, leading
to missed or incorrect alignments. Furthermore, their performance may not be optimized
across hardware platforms. This paper proposes a novel alignment pipeline that considers
all information in sequencing data for biologically accurate acceleration of short
read mapping. To ensure that the performance of the proposed design is optimized across
different platforms, we accelerate the memory-bound operations which have been a bottleneck
in short read mapping. Specifically, we partition the FM-index into buckets. The length
of each bucket is equal to an optimal multiple of the memory burst size and is determined
through data-driven exploration. A tool has been developed to obtain the optimal parameters
of the design for different hardware platforms to enhance performance optimization.
Experimental results indicate that our design maximizes alignment accuracy compared
to the state-of-the-art software Bowtie while mapping reads 4.48x as fast. Compared to
a previous hardware aligner, our design achieves 97.7% accuracy, reporting 4.48M
more valid alignments at a similar speed.
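
The bucketing idea can be sketched in a few lines of Python: Occ counts are
checkpointed once per bucket, so an FM-index backward-search step touches a single
burst-aligned bucket instead of scanning the BWT. The bucket length and index layout
below are illustrative assumptions, not the paper's tuned parameters.

    # Sketch of bucketed FM-index occurrence counting: checkpoint Occ counts at
    # every bucket boundary (bucket ~= a multiple of the memory burst size), so
    # Occ(c, i) needs one checkpoint read plus a short scan inside one bucket.
    BUCKET = 64  # assumed bucket length, not the paper's tuned value

    def build_index(bwt, alphabet="ACGT"):
        checkpoints, counts = [], {c: 0 for c in alphabet}
        for i, ch in enumerate(bwt):
            if i % BUCKET == 0:
                checkpoints.append(dict(counts))  # counts over bwt[0 : i]
            counts[ch] += 1
        return checkpoints, bwt

    def occ(index, c, i):
        # One burst-aligned access instead of a full pass over the BWT.
        checkpoints, bwt = index
        base = checkpoints[i // BUCKET][c]
        return base + bwt[(i // BUCKET) * BUCKET : i].count(c)

    idx = build_index("ACGT" * 32)
    print(occ(idx, "G", 70))  # 'G' count in the first 70 BWT characters -> 17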

An FPGA-based 7-ENOB 600 MSample/s ADC without any External Components

  • Lukas Leuenberger
  • Dorian Amiet
  • Tao Wei
  • Paul Zbinden

Analog to digital converters (ADCs) are indispensable nowadays. Analog signals are
digitized earlier and earlier in the processing chain to reduce the need for complex
analog signal processing. For this reason, ADCs are often integrated directly into
field-programmable gate arrays (FPGAs) or microprocessors. However, such ADCs are designed
for a specific set of requirements with limited flexibility. In this paper, a new
structure of an FPGA-based ADC is proposed. The ADC is based on the slope ADC, where
a time-to-digital converter (TDC) measures the time from the beginning of a reference
slope until the slope reaches the voltage-to-be-measured. Only FPGA-internal elements
are used to build the ADC. It is fully reconfigurable and does not require any external
components. This innovation offers the flexibility to convert almost any digital input/output
(I/O) into an ADC. Considering the very high number of digital I/O ports available
in today’s FPGA systems, this enables the construction of a massive and powerful ADC
array directly on a standard FPGA. The proposed ADC has a resolution of 9.3 bit and
achieves an effective number of bits (ENOB) of 7 at a sample rate of 600 MSample/s.
The differential nonlinearity (DNL) ranges from -0.9 to 0.9 bit, and the integral
nonlinearity (INL) is in the range between -1.1 and 0.9 bit. An alternative version
of the ADC operates at 1.2 GSample/s and achieves an ENOB of 5.3.
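
An idealized model of the slope-ADC principle follows: a reference ramp starts from
zero and a TDC measures the time until it crosses the input voltage, so the digital
code is proportional to the input. The slope and TDC resolution are assumed values,
not the paper's measured parameters.

    # Idealized slope-ADC model: the TDC measures how long the reference ramp
    # takes to reach the input voltage, so code ~ Vin. Numbers are assumptions.
    SLOPE = 1.0e9    # ramp rate in V/s (assumed)
    T_LSB = 10e-12   # TDC time resolution in seconds (assumed)

    def slope_adc(vin):
        t_cross = vin / SLOPE           # time until the ramp reaches Vin
        return round(t_cross / T_LSB)   # TDC quantizes the crossing time

    # One TDC LSB corresponds to SLOPE * T_LSB = 10 mV of input voltage.
    print(slope_adc(0.25))  # -> 25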

A Framework for Customizable FPGA-based Image Registration Accelerators

  • Davide Conficconi
  • Eleonora D’Arnese
  • Emanuele Del Sozzo
  • Donatella Sciuto
  • Marco D. Santambrogio

Image Registration is a highly compute-intensive optimization procedure that determines
the geometric transformation to align a floating image to a reference one. Generally,
the registration targets are images taken from different time instances, acquisition
angles, and/or sensor types. Several methodologies are employed in the literature
to address the limiting factors of this class of algorithms, among which hardware
accelerators seem the most promising solution to boost performance. However, most
hardware implementations are either closed-source or tailored to a specific context,
limiting their application to different fields. For these reasons, we propose an open-source
hardware-software framework to generate a configurable architecture for the most compute-intensive
part of registration algorithms, namely the similarity metric computation. This metric
is Mutual Information, a well-known measure from information theory used in several
optimization procedures. Through different design parameter configurations, we explore
several design choices of our highly customizable architecture and validate it on
multiple FPGAs. We evaluated various architectures against an optimized Matlab
implementation on an Intel Xeon Gold, reaching a speedup of up to 2.86x, and achieved
remarkable performance and power efficiency compared with other state-of-the-art approaches.
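
For reference, Mutual Information can be computed from the joint histogram of the two
images, which is the computation the framework accelerates; a straightforward NumPy
version is sketched below for comparison purposes.

    # Mutual information between a reference and a floating image, computed
    # from their joint histogram. Plain NumPy reference implementation.
    import numpy as np

    def mutual_information(ref, flt, bins=256):
        joint, _, _ = np.histogram2d(ref.ravel(), flt.ravel(), bins=bins)
        pxy = joint / joint.sum()                  # joint probability P(x, y)
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals P(x), P(y)
        nz = pxy > 0                               # avoid log(0)
        return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(64, 64))
    print(mutual_information(img, img))  # MI of an image with itself = its entropy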

NASCENT: Near-Storage Acceleration of Database Sort on SmartSSD

  • Sahand Salamat
  • Armin Haj Aboutalebi
  • Behnam Khaleghi
  • Joo Hwan Lee
  • Yang Seok Ki
  • Tajana Rosing

As the amount of data generated every day grows dramatically, the computational bottleneck
of computer systems has shifted toward the storage devices. Thanks to recent
developments in storage devices, the interface between the storage and the computational
platforms has become the main limitation, as it provides limited bandwidth that does
not scale as the number of storage devices increases. Interconnect networks limit
system performance when independent operations execute on different storage devices,
since they do not provide simultaneous access to all the storage devices. Offloading
the computations to the storage devices eliminates the burden of data transfer from
the interconnects. Emerging as a nascent computing trend, near-storage computing
offloads a portion of computation to the storage devices to accelerate big data
applications. In this paper, we propose NASCENT, a near-storage accelerator for
database sort, which utilizes the Samsung SmartSSD, an NVMe flash drive with an
on-board FPGA chip that processes data in situ. We propose, to the best of our
knowledge, the first near-storage database sort based on bitonic sort, which considers
the specifications of the storage devices to increase the scalability of computer
systems as the number of storage devices increases. NASCENT improves both performance
and energy efficiency as the number of storage devices increases. With 12 SmartSSDs,
NASCENT is 7.6x (147.2x) faster and 5.6x (131.4x) more energy efficient than the FPGA
(CPU) baseline.
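
For reference, bitonic sort, the algorithm NASCENT builds on, is a fixed
compare-and-swap network whose data-independent structure maps naturally to FPGA
pipelines; a recursive software version is sketched below for clarity (input length
must be a power of two).

    # Reference bitonic sort: a compare-and-swap network whose structure does
    # not depend on the data, which is why it pipelines well in hardware.
    def bitonic_sort(a, ascending=True):
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        # Build a bitonic sequence: first half ascending, second descending.
        first = bitonic_sort(a[:mid], True)
        second = bitonic_sort(a[mid:], False)
        return bitonic_merge(first + second, ascending)

    def bitonic_merge(a, ascending):
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        # Compare-and-swap stage: element i versus element i + mid.
        for i in range(mid):
            if (a[i] > a[i + mid]) == ascending:
                a[i], a[i + mid] = a[i + mid], a[i]
        return (bitonic_merge(a[:mid], ascending) +
                bitonic_merge(a[mid:], ascending))

    print(bitonic_sort([7, 3, 0, 5, 6, 2, 4, 1]))  # [0, 1, 2, 3, 4, 5, 6, 7]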

MOCHA: Multinode Cost Optimization in Heterogeneous Clouds with Accelerators

  • Peipei Zhou
  • Jiayi Sheng
  • Cody Hao Yu
  • Peng Wei
  • Jie Wang
  • Di Wu
  • Jason Cong

FPGAs have been widely deployed in public clouds, e.g., Amazon Web Services (AWS)
and Huawei Cloud. However, simply offloading accelerated kernels from CPU hosts to
PCIe-based FPGAs does not guarantee out-of-pocket cost savings in a pay-as-you-go
public cloud. Taking Genome Analysis Toolkit (GATK) applications as case studies,
although the adoption of FPGAs reduces the overall execution time, it introduces 2.56×
extra cost, due to insufficient application-level speedup under Amdahl's law. To optimize
the out-of-pocket cost while keeping high speedup and throughput, we propose the Mocha
framework, a distributed runtime system that fully utilizes the accelerator resources
through accelerator sharing and CPU-FPGA partial task offloading. Evaluation results on
Haplotype Caller (HTC) and Mutect2 in GATK show that, compared with a straightforward
CPU-FPGA integration solution, Mocha reduces application cost by 2.82x for HTC and
1.06x for Mutect2 on AWS, and by 1.22x and 1.52x respectively on Huawei Cloud, with
less than 5.1% performance overhead.
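
The cost effect is easy to reproduce with Amdahl's law: application-level speedup
saturates while the FPGA instance bills at a higher hourly rate. The prices and
fractions in the sketch below are hypothetical, chosen only to illustrate the
phenomenon the paper quantifies.

    # Why FPGA offloading can raise out-of-pocket cost despite kernel speedup:
    # Amdahl's law caps the application-level speedup while the FPGA instance
    # bills more per hour. All numbers below are hypothetical.
    def app_speedup(p, s):
        """Amdahl's law: fraction p of runtime accelerated by factor s."""
        return 1.0 / ((1.0 - p) + p / s)

    cpu_rate, fpga_rate = 1.0, 3.0        # $/hour (assumed)
    runtime_cpu = 10.0                    # hours on the CPU-only instance
    speedup = app_speedup(p=0.6, s=20)    # kernel 20x faster, 60% of runtime
    runtime_fpga = runtime_cpu / speedup

    print(f"speedup {speedup:.2f}x")                     # ~2.33x
    print(f"CPU cost  ${cpu_rate * runtime_cpu:.2f}")    # $10.00
    print(f"FPGA cost ${fpga_rate * runtime_fpga:.2f}")  # ~$12.90, i.e. costlier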

Design Principles for Packet Deparsers on FPGAs

  • Thomas Luinaud
  • Jeferson Santiago da Silva
  • J.M. Pierre Langlois
  • Yvon Savaria

The P4 language has drastically changed the networking field, as it allows developers
to quickly describe and implement new networking applications. Although a large variety of applications
can be described with the P4 language, current programmable switch architectures impose
significant constraints on P4 programs. To address this shortcoming, FPGAs have been
explored as potential targets for P4 applications. P4 applications are described using
three abstractions: a packet parser, match-action tables, and a packet deparser, which
reassembles the output packet with the result of the match-action tables. While implementations
of packet parsers and match-action tables on FPGAs have been widely covered in the
literature, no general design principles have been presented for the packet deparser.
Indeed, implementing a high-speed and efficient deparser on FPGAs remains an open
issue because it requires a large number of interconnections and the architecture
must be tailored to a P4 program. As a result, in several works where a P4 application
is implemented on FPGAs, the deparser consumes a significant proportion of chip resources.
Hence, in this paper, we address this issue by presenting design principles for efficient
and high-speed deparsers on FPGAs. As an artifact, we introduce a tool that generates
an efficient vendor-agnostic deparser architecture from a P4 program. Our design has
been validated and simulated with a cocotb-based framework. The resulting architecture
is implemented on Xilinx Ultrascale+ FPGAs and supports a throughput of more than
200 Gbps while reducing resource usage by almost 10x compared to other solutions.
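
To fix ideas, the toy Python model below captures what a deparser does functionally:
re-emit valid headers in program order ahead of the payload. The fixed emit order
combined with per-packet header validity is what drives the interconnect cost
discussed above; the header layouts here are illustrative.

    # Toy functional model of a P4 deparser: concatenate the bytes of each
    # valid header, in program order, in front of the payload. Header sizes
    # and contents are illustrative assumptions.
    def deparse(headers, payload):
        # 'headers' is an ordered list of (valid, raw_bytes) pairs, mirroring
        # the sequence of emit calls in a P4 deparser block.
        out = bytearray()
        for valid, raw in headers:
            if valid:              # only valid headers are emitted
                out += raw
        return bytes(out) + payload

    eth  = (True,  b"\xff" * 14)              # Ethernet header, always valid
    vlan = (False, b"\x00" * 4)               # VLAN tag absent in this packet
    ipv4 = (True,  b"\x45" + b"\x00" * 19)    # minimal IPv4 header
    print(len(deparse([eth, vlan, ipv4], b"payload")))  # 14 + 20 + 7 = 41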