FPGA ’23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

Full Citation in the ACM Digital Library

SESSION: Keynote I

Compiler Support for Structured Data

Saman Amarasinghe

In 1957, the FORTRAN language and compiler introduced multidimensional dense arrays or dense tensors. Subsequent programming languages added a myriad of data structures from lists, sets, hash tables, trees, to graphs. Still, when dealing with extremely large data sets, dense tensors are the only simple and practical solution. However, modern data is anything but dense. Real world data, generated by sensors, produced by computation, or created by humans, often contain underlying structure, such as sparsity, runs of repeated values, or symmetry.

In this talk I will describe how programming languages and compilers can support large data sets with structure. I will introduce TACO, a compiler for sparse data computing. TACO is the first system to automatically generate kernels for any tensor algebra operation on tensors in any of the commonly used formats. It pioneered a new technique for compiling compound tensor expressions into efficient loops in a systematic way. TACO generated code has competitive performance to best-in-class hand-written codes for tensor and matrix operations. With TACO, I will show how to put sparse array programming on the same compiler transformation and code generation footing as dense array codes. Structured data has immense potential for hardware acceleration. However, instead of one-off single-operation compute engines, with compilers frameworks such as TACO, I believe that it is possible to create hardware for an entire class of sparse computations. With the help of the FPGA community, I am looking forward to such a future.

SESSION: Session: High-Level Abstraction and Tools

DONGLE: Direct FPGA-Orchestrated NVMe Storage for HLS

Linus Y. Wong
Jialiang Zhang
Jing (Jane) Li

Rapid growth in data size poses increasing computational and memory challenges to data processing. FPGA accelerators and near-storage processing are promising candidates for tackling computational and memory requirements, and many near-storage FPGA accelerators have been shown to be effective in processing large data. However, the current HLS development environment does not allow direct NVMe storage access from the HLS code. As such, users must frequently hand off between HLS and host code to access data in storage, and such a process requires tedious programming to ensure functional correctness. Moreover, since the HLS code uses radically different methods to access storage compared to DRAM, the HLS codebase targeting DRAM-based platforms cannot be easily ported to NVMe-based platforms, resulting in limited code portability and reusability. Furthermore, frequent suspension of HLS kernel and synchronization between CPU and FPGA introduce significant latency overhead and require sophisticated scheduling mechanisms to hide latency.

To address these challenges, we propose a new HLS storage interface named DONGLE that enables direct FPGA-orchestrated NVMe storage access. By providing a unified interface for storage and memory access, DONGLE allows a single-source HLS program to target multiple memory/storage devices, thus making the codebase cleaner, portable, and more efficient. We prototyped DONGLE with an AMD/Xilinx Alveo U200 FPGA and Solidigm DC-P4610 SSD and demonstrate a geomean speed-up of 2.3× and a reduction of lines-of-code by 2.4× on evaluated workloads over the state-of-the-art commercial platform.

FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs

Linfeng Du
Tingyuan Liang
Sharad Sinha
Zhiyao Xie
Wei Zhang

Multi-die FPGAs are widely adopted to deploy large-scale hardware accelerators. Two factors impede the performance optimization of high-level synthesis (HLS) designs implemented on multi-die FPGAs. On the one hand, the long net delay due to nets crossing die-boundaries results in an NP-hard problem to properly floorplan and pipeline an application. On the other hand, traditional automated searching flow for HLS directive optimizations targets single-die FPGAs, and hence, it cannot consider the resource constraints on each die and the timing issue incurred by the die-crossings. Further, it leads to an excessively long runtime to legalize the floorplanning of HLS designs generated under each group of configurations during directive optimization due to the large design scale.

To co-optimize the directives and floorplan of HLS designs on multi-die FPGAs, we propose the FADO framework, which formulates the directive-floorplan co-search problem based on the multi-choice multi-dimensional bin-packing and solves it using an iterative optimization flow. For each step of directive optimization, a latency-bottleneck-guided greedy algorithm searches for more efficient directive configurations. For floorplanning, instead of repetitively incurring global floorplanning algorithms, we implement a more efficient incremental floorplan legalization algorithm. It mainly applies the worst-fit strategy from the online bin-packing algorithm to balance the floorplan, together with an offline best-fit-decreasing re-packing step to compact the floorplan, followed by pipelining of the long wires crossing die-boundaries.

Through experiments on a set of HLS designs mixing dataflow and non-dataflow kernels, FADO not only well-automates the co-optimization and finishes within 693X~4925X shorter runtime, compared with DSE assisted by global floorplanning, but also yields an improvement of 1.16X~8.78X in overall workflow execution time after implementation on the Xilinx Alveo U250 FPGA.

Eliminating Excessive Dynamism of Dataflow Circuits Using Model Checking

Jiahui Xu
Emmet Murphy
Jordi Cortadella
Lana Josipovic

Recent HLS efforts explore the generation of dynamically scheduled, dataflow circuits from high-level code; their ability to adapt the schedule at runtime to particular data and control outcomes promises superior performance to standard, statically scheduled HLS solutions. However, dataflow circuits are notoriously resource-expensive: their distributed handshake mechanism brings performance benefits in some cases, but causes an unneeded resource overhead when general dynamism is not required. In this work, we present a verification framework based on model checking to systematically reduce the hardware complexity of dataflow circuits. We devise a series of formal proofs that identify the absence of particular behavioral scenarios and use this information to replace the generic dataflow logic with simpler and cheaper control structures. On a set of benchmarks obtained from high-level code, we demonstrate that our technique significantly reduces the resource requirements of dataflow circuits (i.e., it results in LUT and FF reductions of up to 51% and 53%, respectively), while still reaping all performance benefits of dynamic scheduling.

Straight to the Queue: Fast Load-Store Queue Allocation in Dataflow Circuits

Ayatallah Elakhras
Riya Sawhney
Andrea Guerrieri
Lana Josipovic
Paolo Ienne

Dynamically scheduled high-level synthesis can exploit high levels of parallelism in poorly-predictable control-dominated applications. Yet, dataflow circuits are often generated by literal conversion of basic blocks into circuits interconnected in such a way as to mimic the program’s sequential execution. Although correct and quite effective in many cases, this adherence to control flow still significantly limits exploitable parallelism. Recent research introduced techniques to deliver data tokens directly from producers to consumers and achieved tangible benefits both in circuit complexity and execution time. Unfortunately, while this successfully addressed ordinary data dependencies, the problem of potential dependencies through memory remains open: When no technique can statically disambiguate accesses, circuits must be built with load-store queues (LSQs) which, to reorder accesses safely, need memory accesses to be allocated in the queues in program order. Such in-order allocation still demands control circuitry emulating sequential execution, with its negative impact on parallelization. In this paper, we transform potential memory dependencies into virtual data dependencies and use the new direct token delivery strategy to allocate accesses sequentially into the LSQ. In other words, we exploit more parallelism by constructing control circuitry to emulate exclusively those parts of the control flow strictly necessary for in-order allocation. Our results show that we can achieve up to a 74% reduction in execution time compared to prior work, in some cases, at no area cost.

SESSION: Poster Session I

OMT: A Demand-Adaptive, Hardware-Targeted Bonsai Merkle Tree Framework for Embedded Heterogeneous Memory Platform

Rakin Muhammad Shadab
Yu Zou
Sanjay Gandham
Mingjie Lin

Novel flash-based, crash-tolerant, non-volatile memory (NVM) such as Intel’s Optane DC memory brings about new and exciting use-case scenarios for both traditional and embedded computing systems involving Field-Programmable Gate Arrays (FPGA). However, NVMs cannot be proper replacement for existing DDR memory modules due to low write endurance and are more well-suited for a hybrid NVM + Volatile memory system. They are also well-known to be vulnerable to different memory-based adversaries that demand the use of a robust authentication method such as Bonsai Merkle Tree. However, typical update process of a BMT (eager update) requires updating the entire update chain frequently, affecting run-time performance even for the data that is not persistence-critical. The latest intermittent BMT update techniques can help provide better real-time throughput, but they lack crash-consistency.

A heterogeneous memory-based system would, therefore, greatly benefit from an authentication mechanism that can change its update method on-the-fly. Hence we propose a modular, unified and adaptable hardware-based BMT framework called Opportunistic Merkle tree (OMT). OMT combines two BMT with different update methods and streamlines the BMT read with a common datapath to provide support for both recovery-critical and general data, eliminating the need for individual authentication subsystems for heterogeneous memory platforms. It also allows for a switch between the update methods based on the request type (persistent/intermittent) while considerably reducing the resource overhead compared to standalone BMT implementations. We test OMT on a heterogeneous embedded secure memory system and the setup provides 44% lower memory overhead & up to 22% faster execution in synthetic benchmarks compared to a baseline.

Cyclone-NTT: An NTT/FFT Architecture Using Quasi-Streaming of Large Datasets on DDR- and HBM-based FPGA Platforms

Kaveh Aasaraai
Emanuele Cesena
Rahul Maganti
Nicolas Stalder
Javier Varela
Kevin Bowers

Number-Theoretic-Transform (NTT) is a variation of Fast-Fourier-Transform (FFT) on finite fields. NTT is being increasingly used in blockchain and zero-knowledge proof applications. Although FFT and NTT are widely studied for FPGA implementation, we believe CycloneNTT is the first to solve this problem for large data sets (2^24, 64-bit numbers) that would not fit in the on-chip RAM. CycloneNTT uses a state-of-the-art butterfly network and maps the dataflow to hybrid FIFOs composed of on-chip SRAM and external memory. This manifests into a quasi-streaming data access pattern minimizing external memory access latency and maximizing throughput. We implement two variants of CycloneNTT optimized for DDR and HBM external memories. Although historically this problem has been shown to be memory-bound, CycloneNTT’s quasi-streaming access pattern is optimized to the point that when using HBM (Xilinx C1100), the architecture becomes compute-bound. On the DDR-based platform (AWS F1), the latency of the application is equal to the streaming of the entire dataset log(N) times to/from external memory. Moreover, exploiting HBM’s larger number of channels, and following a series of additional optimizations, CycloneNTT only requires log(N)/6 passes.

AoCStream: All-on-Chip CNN Accelerator With Stream-Based Line-Buffer Architecture

Hyeong-Ju Kang

Convolutional neural network (CNN) accelerators are being widely used for their efficiency, but they require a large amount of memory, leading to the use of slow and power consuming external memories. This paper exploits two schemes to reduce the required memory amount and ultimately to implement a CNN of reasonable performance only with on-chip memory of a practical device like a low-end FPGA. To reduce the memory amount of the intermediate data, a stream-based line-buffer architecture and a dataflow for the architecture are proposed instead of the conventional frame-based architecture, where the amount of the intermediate data memory is proportional to the square of the input image size. The architecture consists of layer-dedicated blocks operating in a pipelined way with the input and output streams. Each convolutional layer block has a line buffer storing just a few rows of input data. The sizes of the line buffers are proportional to the width of the input image, so the architecture requires less intermediate data storage, especially in the trend of getting larger input size in modern object detection CNNs. In addition, the weight memory is reduced by the accelerator-aware pruning. The experimental results show that a whole object detection CNN can be implemented even on a low-end FPGA without an external memory. Compared to previous accelerators with similar object detection accuracy, the proposed accelerator reaches higher throughput even with less FPGA resources of LUTs, registers, and DSPs, showing higher efficiency.

Fault Detection on Multi COTS FPGA Systems for Physics Experiments on the International Space Station

Tim Oberschulte
Jakob Marten
Holger Blume

Field-programmable gate arrays (FPGAs) in space applications come with the drawback of radiation effects, which inevitably will occur in devices of small process size. This also applies to the electronics of the Bose Einstein Condensate and Cold Atom Laboratory (BECCAL) apparatus, which is planned to operate on the International Space Station for several years. A total of more than 100 FPGAs distributed in the setup will be used for high-precision control of specialized sensors and actuators at nanosecond scale. Due to the large amount of devices in BECCAL, commercial off-the-shelf (COTS) FPGAs are used which are not radiation hardened. In this work, we detect and mitigate radiation effects in an application specific COTS-FPGA-based communication network. For that redundancy is integrated into the design while the firmware is optimized to stay within the FPGA’s resource constraints. A redundant integrity checker module is developed which can notify preceding network devices about data and configuration bit errors. The firmware is evaluated by injecting faults into data and configuration registers in simulation and real hardware. The FPGA resource usage of the firmware is cut down by more than half, enabling the use of double modular redundancy for the switching fabric. Together with the triple modular redundancy protected integrity checker, this combination fully prevents silent data corruptions in the design as shown in simulations and by injecting faults in hardware using the Intel Fault Injection FPGA IP Core while staying in the resource limitation of a COTS FPGA.

Nimblock: Scheduling for Fine-grained FPGA Sharing through Virtualization

Meghna Mandava
Deming Chen

As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time- and space-multiplex the virtualized FPGA by introducing Nimblock. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores both preemption and pipelining as a scheduling parameter to dynamically change resource allocations, and automatically allocates resources to enable suitable parallelism for an application without additional user input. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA. We evaluate our algorithm and validate its efficacy by measuring results from real workloads running on the board with different real-time constraints and priority levels. In our exploration, we compare our novel Nimblock algorithm against a no-sharing and no-virtualization baseline algorithm and three other algorithms which support sharing and virtualization. We achieve up to 5x lower average response time when compared to the baseline algorithm and up to 2.1x average response time improvement over competitive scheduling algorithms that support sharing within our virtualization environment. We additionally demonstrate up to 49% fewer deadline violations and up to 2.6x lower tail response times when compared to other high-performance algorithms.

Graph-OPU: An FPGA-Based Overlay Processor for Graph Neural Networks

Ruiqi Chen
Haoyang Zhang
Yuhanxiao Ma
Enhao Tang
Shun Li
Yanxiang Zhu
Jun Yu
Kun Wang

Graph Neural Networks (GNNs) have outstanding performance on graph-structured data and have been extensively accelerated by field-programmable gate array (FPGA) in various ways. However, existing accelerators significantly lack flexibility, especially in the following two aspects: 1) Many FPGA-based accelerators only support one GNN model. 2) The processes of re-synthesizing and bitstream re-generating are very time-consuming for new GNN models. To this end, we propose a highly integrated FPGA-based overlay processor for general GNN accelerations named Graph-OPU. Regarding the data structure and operation irregularity, we customize the instruction sets to support irregular operation patterns in the inference process of GNN models. Then, we customize our datapath and optimize the data format in the microarchitecture to take full advantage of high bandwidth memory (HBM). Moreover, we design the computation module to ensure a unified and fully-pipelined process of sparse matrix multiplication (SpMM) and general matrix multiplication (GEMM). Users can avoid the process of FPGA reconfiguration or RTL regeneration for the newly invented GNN models. We implement the hardware prototype on Xilinx Alveo U50 and test the mainstream GNN models with 9 datasets. Graph-OPU can achieve an average of 435× and 18× speedup, while 2013× and 109× better energy efficiency, compared with the Intel I7-12700KF processor and NVIDIA RTX3090 GPU, respectively. To the best of our knowledge, Graph-OPU is the first in-depth study on FPGA-based general processors for GNN acceleration with high speedup and energy efficiency.

HMLib: Efficient Data Transfer for HLS Using Host Memory

Michael Lo
Young-kyu Choi
Weikang Qiao
Mau-Chung Frank Chang
Jason Cong

Streaming applications compose an important portion of the workloads that FPGAs may accelerate but suffer from inefficient data movement. The inefficiency stems from copying data indirectly into the FPGA DRAM rather than directly into its on-chip memory, substantially diminishing the end-to-end speedup, especially for small workloads (hundreds of kilobytes). AMD Xilinx’s Host Memory IP (HMI) aims to address the data movement problem by exposing to the developer an High-Level Synthesis (HLS) interface that moves the data from the host directly to the FPGA’s on-chip memory. However, using HMI purely for its interface without additional code changes incurred a 3.3x slowdown in comparison with the current programming model. The slowdown mainly originates from OpenCL call overhead and the kernel control logic unnecessarily switching states. To overcome these issues, we propose Host Memory Library (HMLib), an efficient HLS-based library that facilitates data transfer on behalf of the user. HMLib not only optimizes the runtime stack for efficient data transfer, but also provides HLS compatible and user-friendly interfaces. We demonstrate HMLib’s effectiveness for streaming applications (Deflate compression and CRC32) with improvements of up to up to 36.2X over OpenCL-DDR and up to 79.5X over raw HMI for small-scale data while maintaining little-to-no performance loss for large scale inputs. We plan to open source our work in the future.

An Efficient High-Speed FFT Implementation

Ross Martin

This poster introduces the “BxBFFT” parallel-pipelined Fast Fourier Transform (FFT), which gives higher clock speeds (Fmax) than competitors with substantial savings in power and logic resources. In comparisons with the Xilinx SSR FFT, Spiral FFT, Astron FFT, and ZipCPU FFT, the BxBFFT had clock speeds above 650MHz in cases where all others were below 300MHz. The BxBFFT’s LUTs and power were lower by a factor of ~1.5. The BxBFFT had faster Vivado implementation and faster RTL simulation, for improved productivity in design, testing, and verification. BxBFFT simulations were over 10 times faster than the Xilinx SSR FFT. The BxBFFT supports more features than other FFTs, including real-to-complex FFTs, non-power-of-2 FFTs, and features for high reliability in adverse environments. The BxBFFT’s improved performance has been verified in real applications. One customer design had to operate with a reduced workload due to excessive current draw of the Xilinx SSR FFT. A quick replacement of the Xilinx SSR FFT with the BxBFFT lowered die temperature by 34.8 degree Celsius and allowed the design to operate under full load. The source of the BxBFFT’s performance is intensive optimization of well-known FFT algorithms, not new algorithms. The BxBFFT’s coding style gives better control over synthesis to avoid and resolve performance bottlenecks. Automated generation of top-level code supports 13 different choices for radix and 2 different choices for data flow at each stage, to make optimal choices for each BxBFFT size. This results in a highly efficient FFT.

Weave: Abstraction for Accelerator Integration of Generated Modules

Tuo Dai
Bizhao Shi
Guojie Luo

As domain-specific accelerators demand multiple functional components for complex applications in a domain, the conventional wisdom for effective development involves module decomposition, module implementation, and module integration. In the recent decade, the generator-based design methodology improves the productivity of module implementation. However, with the guidance of current abstractions, it is difficult to integrate modules implemented by generators because of implicit interface definition, non-unified performance modeling, and fragmented memory management. These disadvantages cause low productivity of the integration flow and low performance of the integrated accelerators.

To address these drawbacks, we propose Weave, an abstraction for the integration of generated modules to facilitate an agile design flow for domain-specific accelerators. Weave abstraction enables the formulation and automation of optimizing the unified performance model under resource constraints of all modules. And we design a hierarchical memory management method with corresponding interface to integrate modules under the guidance of modular abstraction. In the experiments, the accelerator developed by Weave achieves 2.17× higher performance in the deep learning domain compared with an open-source accelerator, and the integrated acce-lerator attains 88.9% peak performance of generated accelerators.

A Novel FPGA Simulator Accelerating Reinforcement Learning-Based Design of Power Converters

Zhenyu Xu
Miaoxiang Yu
Qing Yang
Yeonho Jeong
Tao Wei

High-efficiency energy conversion systems have become increasingly important due to their wide use in all electronic systems such as data centers, smart mobile devices, E-vehicles, medical instruments, and so forth. Complex and interdependent parameters make optimal designs of power converters challenging to get. Recent research has shown that reinforcement learning (RL) shows great promise in the design of such converter circuits. A trained RL agent can search for optimal design parameters for power conversion circuit topologies under targeted application requirements. Training an RL agent requires numerous circuit simulations. As a result, they may take days to complete, primarily because of the slow time-domain circuit simulation.

This abstract proposes a new FPGA architecture that accelerates the circuit simulation and hence substantially speeds up the RL-based design method for power converters. Our new architecture supports all power electronic circuit converters and their variations. It substantially improves the training speed of RL-based design methods. High-level synthesis (HLS) was used to build the accelerator on Amazon Web Service (AWS) F1 instance. An AWS virtual PC hosts the training algorithm. The host interacts with the FPGA accelerator by updating the circuit parameters, initiating simulation, and collecting the simulation results during training iterations. A script was created on the host side to facilitate this design method to convert a netlist containing circuit topology and parameters into core matrices in the FPGA accelerator. Experimental results showed 60x overall speedup of our RL-based design method in comparison with using a popular commercial simulator, PowerSim.

A Fractal Astronomical Correlator Based on FPGA Cluster with Scalability

Lin Shu
Long Xiao
Yafang Song
Qiuxiang Fan
Guitian Fang
Jie Hao

Correlation is a highly computationally intensive and data-intensive signal processing application that is used heavily in radio astronomy for imaging and other measurements. For example, the next generation radio telescope, Square Kilometer Array Low (SKA-L), needs a correlator that calculates up to 22 million cross products, which is a real-time system with continuous input data rates of 6 terabits per second and equivalent computation of 2 Peta-operations per second. Therefore, a flexible and scalable solution with high performance per watt is very urgent and meaningful. In this work, a flexible FX correlation architecture based on FPGA cluster is proposed, which can be fractal in subsystem level, engine level and calculation module level, simplifying the complexity of data distribution network to increase the system’s scalability. The interconnect network between processing engines is a new two-stage solution, using self-developed data redistribution hardware to decouple full bandwidth correlation into several independent sub-bands’ computation. And the most intensive calculations, cross-multiplications among all the antennas, are modularly designed under MATLAB Simulink and AMD Xilinx System Generator, which are parametrized to scale to arbitrary antenna numbers with optional parallel granularity to minimize development effort on different FPGA or for different applications. What’s more, a fully FPGA-based FX correlator for a large array with 202 antennas, consisting of 26 F Engines based on AMD Xilinx Kintex-7 325T FPGAs, 13 X Engines based on AMD Xilinx Kintex ultrascale KU115 FPGAs, has been deployed in 2022, which is the largest full FPGA-based astronomical correlator as we know.

Power Side-channel Countermeasures for ARX Ciphers using High-level Synthesis

Saya Inagaki
Mingyu Yang
Yang Li
Kazuo Sakiyama
Yuko Hara-Azumi

In the era of Internet of Things (IoT), edge devices are considerably diversified and are often designed using high-level synthesis (HLS) to improve design-productivity. A problem here is that HLS tools were originally developed in a security-unaware fashion, inducing vulnerabilities to power side-channel attacks (PSCA), which is a serious threat in IoT. Although PSCA vulnerabilities induced by HLS tools recently started to be discussed, the effects and applicability of existing methods for PSCA-resistant designs using HLS are limited so far. In this paper, we propose a novel HLS-based design method for PSCA-resistant ciphers in hardware. Particularly focusing on lightweight block ciphers composed of Addition-Rotation-XOR (ARX)-based permutations, we studied the effects of applying ”threshold implementation”, one of the provably secure countermeasures against PSCA, to behavioral descriptions of the ciphers. In addition, we tuned the scheduling optimization of HLS tools that might cause power side-channel leakage. In our experiment, using ARX-based ciphers (Chaskey, Simon, and Speck) as benchmarks, we implemented the unprotected and protected circuit on FPGA and evaluated the PSCA vulnerability using Welch’s t-test. The results demonstrated that our proposed method can successfully mitigate vulnerabilities to PSCA for all benchmarks. From these results, we provide further discussion on the direction of PSCA countermeasures based on HLS.

Single-Batch CNN Training using Block Minifloats on FPGAs

Chuliang Guo
Binglei Lou
Xueyuan Liu
David Boland
Philip H.W. Leong

Training convolutional neural networks remains a challenge on resource-limited edge devices due to its intensive computations, large storage requirements, and high bandwidth. Error back-propagation, gradient generation, and weight update usually require high precision to guarantee model accuracy, which places a further burden on computation and bandwidth. This paper presents the first parallel FPGA CNN training accelerator with block minifloat datatypes. We first propose a heuristic bit-width allocation technique to derive a unified 8-bit block minifloat format with a sign bit, 2 exponent bits, and 5 mantissa bits. In contrast to previous techniques, the same data format is used for weights, activations, errors, and gradients. Using this format, accuracy similar to 32-bit single precision floating point is achieved and thus simplifies the FPGA-based designs of computational units such as multiply-and-add. In addition, we propose a unified Conv block to deal with Conv and transposed Conv in the forward and backward paths respectively; and a dilated Conv block with a weight kernel partition scheme for gradient generation. Both Conv blocks support non-unit stride, this being crucial for the residual connections that appear in modern CNNs. For training of ResNet20 on the CIFAR-10 dataset with a batch size of 1, our accelerator on a Xilinx Ultrascale+ ZCU102 FPGA achieves state-of-the-art single-batch throughput of 144.64 and 192.68 GOPs with and without batch normalisation layers respectively.

SESSION: Session: Applications and Design Studies I

A Study of Early Aggregation in Database Query Processing on FPGAs

Mehdi Moghaddamfar
Norman May
Christian Färber
Wolfgang Lehner
Akash Kumar

In database query processing, aggregation is an operator by which data with a common property is grouped and expressed in a summary form. Early aggregation is a popular method for improving the performance of the aggregation operator. In this paper, we study early aggregation algorithms in the context of query processing acceleration in database systems on FPGAs. The comparative study leads us to set-associative caches with a low inter-reference recency set (LIRS) replacement policy. They show both great performance and modest implementation complexity compared to some of the most prominent early aggregation algorithms. We also present a novel application-specific architecture for implementing set-associative caches. Benchmarks of our implementation show speedups of up to 3x for end-to-end aggregation compared to a state-of-the-art FPGA-based query engine.

FNNG: A High-Performance FPGA-based Accelerator for K-Nearest Neighbor Graph Construction

Chaoqiang Liu
Haifeng Liu
Long Zheng
Yu Huang
Xiangyu Ye
Xiaofei Liao
Hai Jin

The k-nearest neighbor graph has emerged as the key data structure for many critical applications. However, it can be notoriously challenging to construct k-nearest neighbor graphs over large graph datasets, especially with a high-dimensional vector feature. Many solutions have been recently proposed to support the construction of k-nearest neighbor graphs. However, these solutions involve substantial memory access and computational overheads and an architecture-level solution is still absent. To address these issues, we architect FNNG, the first FPGA-based accelerator to support k-nearest neighbor graph construction. Specifically, FNNG is equipped with the block-based scheduling technique to exploit the inherent data locality between vertices. It divides the vertices that are close in space into blocks and process the vertices according to the granularity of the blocks during the construction process. FNNG also adopts the useless computation aborting technique to identify superfluous useless computations. It keeps the existing maximum similarity values of all vertices inside the computing unit. In addition, we propose an improved architecture in order to fully utilize both techniques. We implement FNNG on the Xilinx Alveo U280 FPGA card. The results show that FNNG achieves 190x and 2.1x speedups over the state-of-the-art CPU and GPU solutions, running on Intel Xeon Gold 5117 CPU and NVIDIA GeForce RTX 3090 GPU, respectively.

ACTS: A Near-Memory FPGA Graph Processing Framework

Wole Jaiyeoba
Nima Elyasi
Changho Choi
Kevin Skadron

Despite the high off-chip bandwidth and on-chip parallelism offered by today’s near-memory accelerators, software-based (CPU and GPU) graph processing frameworks still suffer performance degradation from under-utilization of available memory bandwidth because graph traversal often exhibits poor locality. Emerging FPGAbased graph accelerators tackle this challenge by designing specialized graph processing pipelines and application-specific memory subsystems to maximize bandwidth utilization and efficiently utilize high-speed on-chip memory. To use the limited on-chip (BRAM) memory effectively while handling larger graph sizes, several FPGAbased solutions resort to some form of graph slicing or partitioning during preprocessing to stage vertex property data into the BRAM. While this has demonstrated performance superiority for small graphs, this approach breaks down with larger graph sizes. For example, GraphLily [19], a recent high-performance FPGA-based graph accelerator, experiences up to 11X performance degradation between graphs having 3M vertices and 28M vertices. This makes prior FPGA approaches impractical for large graphs.

We propose ACTS, an HBM-enabled FPGA graph accelerator, to address this problem. Rather than partitioning the graph offline to improve spatial locality, we partition vertex-update messages (based on destination vertex IDs) generated online after active edges have been processed. This optimizes read bandwidth even as the graph size scales. We compare ACTS against Gunrock, a state-of-the-art graph processing accelerator for the GPU, and GraphLily, a recent FPGA-based graph accelerator also utilizing HBM memory. Our results show a geometric mean speedup of 1.5X, with a maximum speedup of 4.6X over Gunrock, and a geometric speedup of 3.6X, with a maximum speedup of 16.5X, over GraphLily. Our results also showed a geometric mean power reduction of 50% and a mean reduction of energy-delay product of 88% over Gunrock.

Exploring the Versal AI Engines for Accelerating Stencil-based Atmospheric Advection Simulation

Nick Brown

AMD Xilinx’s new Versal Adaptive Compute Acceleration Platform (ACAP) is an FPGA architecture combining reconfigurable fabric with other on-chip hardened compute resources. AI engines are one of these and, by operating in a highly vectorized manner, they provide significant raw compute that is potentially beneficial for a range of workloads including HPC simulation. However, this technology is still early-on, and as yet unproven for accelerating HPC codes, with a lack of benchmarking and best practice.

This paper presents an experience report, exploring porting of the Piacsek and Williams (PW) advection scheme onto the Versal ACAP, using the chip’s AI engines to accelerate the compute. A stencil-based algorithm, advection is commonplace in atmospheric modelling, including several Met Office codes who initially developed this scheme. Using this algorithm as a vehicle, we explore optimal approaches for structuring AI engine compute kernels and how best to interface the AI engines with programmable logic. Evaluating performance using a VCK5000 against non-AI engine FPGA configurations on the VCK5000 and Alveo U280, as well as a 24-core Xeon Platinum Cascade Lake CPU and Nvidia V100 GPU, we found that whilst the number of channels between the fabric and AI engines are a limitation, by leveraging the ACAP we can double performance compared to an Alveo U280.

SESSION: Session: Architecture, CAD, and Circuit Design

Regularity Matters: Designing Practical FPGA Switch-Blocks

Stefan Nikolic
Paolo Ienne

Several techniques have been proposed for automatically searching for FPGA switch-blocks which typically show some tangible advantage over the well-known academic architectures. However, the resulting switch-blocks usually exhibit high levels of irregularity, in contrast with what can be observed in a typical commercial architecture. One wonders whether the architectures produced by such search methods can actually be laid out in an efficient manner while retaining the perceived gains. In this work, we propose a new switch-block exploration method that enhances a recently published search algorithm by combining it with ILP, in order to guarantee that the obtained solution satisfies arbitrary regularity constraints. We measure the impact of regularity constraints commonly seen in industrial architectures (such as limiting the number of different multiplexer sizes or forced sharing of inputs between pairs of multiplexers) and observe benefits to the routability of complex circuits with only a limited reduction in performance.

Turn on, Tune in, Listen up: Maximizing Side-Channel Recovery in Time-to-Digital Converters

Colin Drewes
Olivia Weng
Keegan Ryan
Bill Hunter
Christopher McCarty
Ryan Kastner
Dustin Richmond

Voltage fluctuation sensors measure minute changes in an FPGA power distribution network, allowing attackers to extract information from concurrently executing computations. Previous voltage fluctuation sensors make assumptions about the co-tenant computation and require the attacker have a priori access or system knowledge to tune the sensor parameters statically. We present the open-source design of the Tunable Dual-Polarity Time-to-Digital Converter, which introduces three dynamically tunable parameters that optimize signal measurement, including the transition polarity, sample window, frequency, and phase. We show that a properly tuned sensor improves co-tenant classification accuracy by 2.5× over prior work and increases the ability to identify the co-tenant computation and its microarchitectural implementation. Across 13 varying applications, our techniques yield an 80% classification accuracy that generalizes beyond a single board. Finally, our sensor improves the ability of a correlation power analysis attack to rank correct subkey values by 2×.

Post-Radiation Fault Analysis of a High Reliability FPGA Linux SoC

Andrew Elbert Wilson
Nathan Baker
Ethan Campbell
Jackson Sahleen
Michael Wirthlin

FPGAs are increasingly being used in space and other harsh radiation environments. However, SRAM-based FPGAs are susceptible to radiation in these environments and experience upsets within the configuration memory (CRAM), causing design failure. The effects of CRAM upsets can be mitigated using triple-modular redundancy and configuration scrubbing. This work investigates the reliability of a soft RISC-V SoC system executing the Linux operating system mitigated by TMR and configuration scrubbing. In particular, this paper analyzes the failures of this triplicated system observed at a high-energy neutron radiation experiment. Using a bitstream fault analysis tool, the failures of this system caused by CRAM upsets are traced back to the affected FPGA resource and design logic. This fault analysis identifies the interconnect and I/O as the most vulnerable FPGA resources and the DDR controller logic as the design logic most likely to cause a failure. By identifying the FPGA resources and design logic causing failures in this TMR system, additional design enhancements are proposed to create a more reliable design for harsh radiation environments.

FPGA Technology Mapping with Adaptive Gate Decomposition

Longfei Fan
Chang Wu

Most existing technology mapping algorithms use graph covering approaches and suffer from the netlist structural bias problem. Chen and Cong proposed a simultaneous simple gate decomposition with technology mapping algorithm that encodes many gate decomposition choices into the netlist. However, their algorithm suffers from the long runtime problem due to a large set of choices. Later on, A. Mishchenko et al. proposed a mapping algorithm based on choice generation with the so-called lossless synthesis. Nevertheless, their algorithm cannot guarantee to find and keep all good choices a priori before mapping. In this paper, we propose a simultaneous mapping with gate decomposition algorithm named AGDMap. Our algorithm uses cut-enumeration based engine. Bin packing algorithm is used for simple gate decomposition during cut enumeration. Input sharing based cut cost computation is used during iterative cut selection for logic duplication reduction. Based on a set of EPFL benchmark suite and HLS generated designs, our algorithm produces results with significant area improvement. Compared with the lossless synthesis algorithm, for area optimization, our average improvement is 12.4%. For delay optimization, we get results with similar delay and 9.2% area reduction. In this paper, we propose a simultaneous mapping with gate decomposition algorithm named AGDMap. Our algorithm uses cut-enumeration based engine. Bin packing algorithm is used for simple gate decomposition during cut enumeration. Input sharing based cut cost computation is used during iterative cut selection for logic duplication reduction. Based on a set of EPFL benchmark suite and HLS generated designs, our algorithm produces results with significant area improvements. Compared with the state-of-the-art ABC lossless synthesis algorithm, for area optimization, our average improvement is 12.4%. For delay optimization, we get results with similar delay and 9.2% area reduction.

FPGA Mux Usage and Routability Estimates without Explicit Routing

Jonathan W. Greene

A new algorithm is proposed to rapidly evaluate an FPGA routing architecture without need of explicitly routing benchmark applications. The algorithm takes as input a probability distribution of nets to be accommodated and a description of an architecture. It produces an estimate for the usage of each type of mux in the FPGA (including intra-cluster muxes), valuable feedback to the architect. The estimates are shown to correlate with actual routed applications in both academic and commercial architectures. This is due in part to the algorithm’s novel ability to account for long and multi-fanout nets. Run time is reduced by applying periodic graphs to model FPGAs’ regular structure.

We then show how Percolation Theory (a branch of statistical physics) can be applied to elucidate the relationship between mux usage and routability. We show that any blockages when routing a net are most likely to occur in the neighborhood of its terminals, and demonstrate a quantitative relationship among the post-placement wirelength of an application, the percolation threshold of an architecture, and the channel width required to map the application to the architecture. Supporting experimental data is provided.

SESSION: Banquet and Panel

Open-source and FPGAs: Hardware, Software, Both or None?

Dana How
Tim Ansell
Vaughn Betz
Chris Lavin
Ted Speers
Pierre-Emmanuel Gaillardon

Following the footsteps of the open-source software movement that is at the foundation of many fundamental infrastructures today, e.g., Linux, the internet, etc., a growing amount of open-source hardware initiatives have been impacting our field, e.g., the RISC-V ISA, Open chiplet standards, etc.

SESSION: Keynote II

FPGAs and Their Evolving Role in Domain Specific Architectures: A Case Study of the AMD 400G Adaptive SmartNIC/DPU SoC

Jaideep Dastidar

Domain Specific Architectures (DSA) typically apply heterogeneous compute elements such as FPGAs, GPUs, AI Engines, TPUs, etc. towards solving domain-specific problems, and have their accompanying Domain Specific Software. FPGAs have played a prominent role in DSAs for AI, Video Transcoding, Network Acceleration etc. This talk will start by going over a brief historical survey of FPGAs in DSAs and an emerging trend in Domain Specific Accelerators, where the programmable logic element is paired with other heterogeneous compute or acceleration elements. The talk will then perform a case study of AMD’s 400G Adaptive SmartNIC/DPU SoC and the considerations that went into that DSA. The case study includes where, why, and how the programmable logic element was paired with other hardened offload accelerators and embedded processors with the goal of striking the right balance between Software Processing on the embedded cores, Fastpath ASIC-like processing on the Hardened Accelerators, and Adaptive and Composable processing on the integrated FPGA. The talk will describe the data movement between various network, storage and interface acceleration elements and their shared and private memory resources. Throughout the talk, we will focus on the tradeoffs between the FPGA element and the rest of the heterogeneous compute or acceleration elements as they apply to SmartNIC/DPU offload acceleration.

SESSION: Session: Deep Learning

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

Jinming Zhuang
Jason Lau
Hanchen Ye
Zhuoping Yang
Yubo Du
Jack Lo
Kristof Denolf
Stephen Neuendorffer
Alex Jones
Jingtong Hu
Deming Chen
Jason Cong
Peipei Zhou

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes?

We identify the biggest system throughput bottleneck resulting from the mismatch of massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently towards different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, MLP, respectively, which obtain 5.40x, 32.51x, 1.00x and 1.00x throughput gains compared to one monolithic accelerator.

Approximate Hybrid Binary-Unary Computing with Applications in BERT Language Model and Image Processing

Alireza Khataei
Gaurav Singh
Kia Bazargan

We propose a novel method for approximate hardware implementation of univariate math functions with significantly fewer hardware resources compared to previous approaches. Examples of such functions include exp(x) and the activation function GELU(x), both used in transformer networks, gamma(x), which is used in image processing, and other functions such as tanh(x), cosh(x), sq(x), and sqrt(x). The method builds on previous works on hybrid binary-unary computing. The novelty in our approach is that we break a function into a number of sub-functions such that implementing each sub-function becomes cheap, and converting the output of the sub-functions to binary becomes almost trivial. Our method also uses self-similarity in functions to further reduce the cost. We compare our method to the conventional binary, previous stochastic computing, and hybrid binary-unary methods on several functions at 8-, 12-, and 16-bit resolutions. While preserving high accuracy, our method outperforms previous works in terms of hardware cost, e.g., tolerating less than 0.01 mean absolute error, our method reduces the (area x latency) cost on average by 5, 7, and 2 orders of magnitude, compared to the conventional binary, stochastic computing, and hybrid binary-unary methods, respectively. Ultimately, we demonstrate the potential benefits of our method for natural language processing and image processing applications. We deploy our method to implement major blocks in an encoding layer of BERT language model, and also the Roberts Cross edge detection algorithm. Both include non-linear functions.

Accelerating Neural-ODE Inference on FPGAs with Two-Stage Structured Pruning and History-based Stepsize Search

Lei Cai
Jing Wang
Lianfeng Yu
Bonan Yan
Yaoyu Tao
Yuchao Yang

Neural ordinary differential equation (Neural-ODE) outperforms conventional deep neural networks (DNNs) in modeling continuous-time or dynamical systems by adopting numerical ODE integration onto a shallow embedded NN. However, Neural-ODE suffers from slow inference due to the costly iterative stepsize search in numerical integration, especially when using higher-order Runge-Kutta (RK) methods and smaller error tolerance for improved integration accuracy. In this work, we first present algorithmic techniques to speedup RK-based Neural-ODE inference: a two-stage coarse-grained/fine-grained structured pruning method based on top-K sparsification that reduces the overall computations by more than 60% in the embedded NN and a history-based stepsize search method based on past integration steps that reduces the latency for reaching accepted stepsize by up to 77% in RK methods. A reconfigurable hardware architecture is co-designed based on proposed speedup techniques, featuring three processing loops to support programmable embedded NN and a variety of higher-order RK methods. Sparse activation processor with multi-dimensional sorters is designed to exploit structured sparsity in activations. Implemented on a Xilinx Virtex-7 XC7VX690T FPGA and experimented on a variety of datasets, the prototype accelerator using a more complex 3rd-order RK method achieves more than 2.6x speedup compared to the latest Neural-ODE FPGA accelerator using the simplest Euler method. Compared to a software execution on Nvidia A100 GPU, the inference speedup can be up to 18x.

SESSION: Session: FPGA-Based Computing Engines

hAP: A Spatial-von Neumann Heterogeneous Automata Processor with Optimized Resource and IO Overhead on FPGA

Xuan Wang
Lei Gong
Jing Cao
Wenqi Lou
Weiya Wang
Chao Wang
Xuehai Zhou

Regular expression (REGEX) matching tasks drive much research on automata processors (AP). Among them, the von Neumann AP can efficiently utilize on-chip memory to process the Deterministic Finite Automata (DFA), but it is limited to small REGEX sets due to the DFA’s state explosion problem. For large REGEX sets, the spatial AP based on Nondeterministic Finite Automaton (NFA) is the mainstream choice. However, there are two problems with previous FPGA-based spatial AP. First, it cannot obtain a balanced FPGA resource usage (LUT and BRAM), which easily leads to resource shortage. Second, to compress the report output data of large REGEX sets, it uses dynamic report compression, which not only consumes a lot of FPGA resources but also limits performance.

This paper optimizes the resource and IO overhead of spatial AP. First, noticing the resource optimization ability of the von Neumann AP, we propose the flex-hybrid-FA algorithm to generate small hybrid-FAs (an NFA/DFA hybrid model) and further propose the Spatial-von Neumann Heterogeneous AP to deploy hybrid-FA. Under the constraints of the flex-hybrid-FA algorithm, we can obtain balanced and efficient FPGA resource usage. Second, we propose High-Efficient Automata Report Compression (HEARC) with a compression ratio of up to 5.5-47.6x, which can thoroughly release the performance from IO congestion, and consumes less FPGA resource compared to previous dynamic report compression approaches. As far as we know, this is the first work to deploy large REGEX sets on low-cost small-scale FPGAs (e.g. Xilinx XCZU3CG). The experimental results show that compared to the previous FPGA-based APs, we save 4.0-6.6x power consumption and improve 2.7-5.9x energy efficiency.

CSAIL2019 Crypto-Puzzle Solver Architecture

Sergey Gribok
Bogdan Pasca
Martin Langhammer

The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these types of puzzles requires large amounts of intrinsically sequential computations (i.e. computations which cannot be parallelized), with each iteration performing a very large (3072-bit in the case of CSAIL2019) modular multiplication operation. The complexity of each iteration is several times greater than known FPGA implementations, and the number of iterations has been increased by about 1000x compared to LCS35. Because of the high complexity of this new puzzle, a number of intermediate, or milestone versions of the puzzle have been specified.

In this paper, we present an FPGA architecture for the CSAIL2019 solver, which we implement on a medium-sized Intel Agilex device. We develop a new multi-cycle modular multiplication method, which is flexible and can fit on a wide variety of sizes of current FPGAs. We also demonstrate a new approach for improving the fitting and timing closure of large, chip-filling arithmetic designs. We used the solver to compute the first 21 out of the 28 milestone solutions of the puzzle, which are the first reported results for this problem.

ENCORE: Efficient Architecture Verification Framework with FPGA Acceleration

Kan Shi
Shuoxiang Xu
Yuhan Diao
David Boland
Yungang Bao

Verification typically consumes the majority of the time in the hardware development cycle. Primarily this is because multiple iterations to debug hardware using software simulation is extremely time-consuming. While FPGAs can be utilised to accelerate the simulation, existing methods either provide limited visibility of design details, or are expensive to check against a reference model dynamically at the system level.

In this paper, we present ENCORE, an FPGA-accelerated framework for processor architecture verification. The design-under-test (DUT) hardware and the corresponding software emulator run simultaneously on the same FPGA with hardened processors. ENCORE embodies hardware modules that dynamically monitor and compare key registers from both the DUT and reference model, pausing the execution if any mismatches are detected. In this case, ENCORE automatically creates snapshots of the current design status, and offloads this to software simulators for further debugging. We demonstrate the performance of ENCORE by running RISC-V processor designs and benchmarks. We show that ENCORE can achieve over 44000x speedup over a traditional software simulation-based approach, while maintaining full visibility and debugging capabilities.

BOBBER A Prototyping Platform for Batteryless Intermittent Accelerators

Vishak Narayanan
Rohit Sahu
Jidong Sun
Henry Duwe

Batteryless systems offer promising platforms to support pervasive, near-sensor intelligence in a sustainable manner. These systems solely rely on ambient energy sources that often provide limited power. One common approach to designing batteryless systems is using intermittent execution—a node banks energy into a capacitive store until a threshold voltage is met and the digital components turn on and consume the banked energy until the energy is depleted and they die. The limited amount of available energy demands the development of application- and domain-specific accelerators to achieve energy efficiency and timeliness. Given the extremely close relationship between volatile state and intermittent behavior, performing actual system prototyping has been critical for demonstrating feasibility of intermittent systems. However, no prototyping platform exists for intermittent accelerators. This paper introduces BOBBER, the first implementation of an intermittent FPGA-based accelerator prototyping platform. We demonstrate BOBBER in the optimization and evaluation of a neural network accelerator powered solely by RF energy harvesting.

SESSION: Poster Session II

Adapting Skip Connections for Resource-Efficient FPGA Inference

Olivia Weng
Gabriel Marcano
Vladimir Loncar
Alireza Khodamoradi
Nojan Sheybani
Farinaz Koushanfar
Kristof Denolf
Javier Mauricio Duarte
Ryan Kastner

Deep neural networks employ skip connections – identity functions that combine the outputs of different layers-to improve training convergence; however, these skip connections are costly to implement in hardware. In particular, for inference accelerators on resource-limited platforms, they require extra buffers, increasing not only on- and off-chip memory utilization but also memory bandwidth requirements. Thus, a network that has skip connections costs more to deploy in hardware than one that has none. We argue that, for certain classification tasks, a network’s skip connections are needed for the network to learn but not necessary for inference after convergence. We thus explore removing skip connections from a fully-trained network to mitigate their hardware cost. From this investigation, we introduce a fine-tuning/retraining method that adapts a network’s skip connections – by either removing or shortening them-to make them fit better in hardware with minimal to no loss in accuracy. With these changes, we decrease resource utilization by up to 34% for BRAMs, 7% for FFs, and 12% LUTs when implemented on an FPGA.

Multi-bit-width CNN Accelerator with Systolic-in-Systolic Dataflow and Single DSP Multiple Multiplication Scheme

Mingqiang Huang
Yucen Liu
Sixiao Huang
Kai Li
Qiuping Wu
Hao Yu

Multi-bit-width neural network enlightens a promising method for high performance yet energy efficient edge computing due to its balance between software algorithm accuracy and hardware efficiency. To date, FPGA has been one of the core hardware platforms for deploying various neural networks. However, it is still difficult to fully make use of the dedicated digital signal processing (DSP) blocks in FPGA for accelerating the multi-bit-width network. In this work, we develop state-of-the-art multi-bit-width convolutional neural network accelerator with novel systolic-in-systolic type of dataflow and single DSP multiple multiplication (SDMM) INT2/4/8 execution scheme. Multi-level optimizations have also been adopted to further improve the performance, including group-vector systolic array for maximizing the circuit efficiency as well as minimizing the systolic delay, and differential neural architecture search (NAS) method for the high accuracy multi-bit-width network generation. The proposed accelerator has been practically deployed on Xilinx ZCU102 with accelerating NAS optimized VGG16 and Resnet18 networks as case studies. Average performance on accelerating the convolutional layer in VGG16 and Resnet18 is 1289GOPs and 1155GOPs, respectively. Throughput for running the full multi-bit-width VGG16 network is 870.73 GOPS at 250MHz, which has exceeded all of previous CNN accelerators on the same platform.

Janus: An Experimental Reconfigurable SmartNIC with P4 Programmability and SDN Isolation

Bharat Sukhwani
Mohit Kapur
Alda Ohmacht
Liran Schour
Martin Ohmacht
Chris Ward
Chuck Haymes
Sameh Asaad

Disparate deployment models of cloud computing pose varying requirements on cloud infrastructure components such as networking, storage, provisioning, and security. Infrastructure providers need to study these and often create custom infrastructure components to satisfy these requirements. A major challenge in the research and development of these cloud infrastructure solutions, however, is the availability of customizable platforms for experimentation and trade-off analysis of the various hardware and software components. Most platforms are either general purpose or bespoke solutions created to assist a particular task, too rigid to allow meaningful customization. In this work, we present a 100G reconfigurable smartNIC prototyping platform called Janus that enables cloud infrastructure research and hardware-software co-design of infrastructure components such as hypervisor, secure boot, software defined networking and distributed storage. The platform provides a path to optimize the stack by offloading the functionalities from the host x86 to the embedded processor on the smartNIC and optimize performance by moving pieces to hardware using P4. Further, our platform provides hardware-enforced isolation of cloud network control plane, thereby securing the control plane from the tenants even for bare-metal deployments.

LAWS: Large-Scale Accelerated Wave Simulations on FPGAs

Dimitrios Gourounas
Bagus Hanindhito
Arash Fathi
Dimitar Trenev
Lizy John
Andreas Gerstlauer

Computing numerical solution to large-scale scientific computing problems described by partial differential equations is a common task in high-performance computing. Improving their performance and efficiency is critical to exa-scale computing. Application-specific hardware design is a well-known solution, but the wide range of kernels makes it infeasible to provision supercomputers with accelerators for all applications. This makes reconfigurable platforms a promising direction. In this work, we focus on wave simulations using discontinuous Galerkin solvers, as an important class of applications. Existing work using FPGAs is limited to accelerating specific kernels or small problems that fit into FPGA BRAM. We present LAWS, a generic and configurable architecture for large-scale accelerated wave simulation problems running on FPGAs out of DRAM. LAWS exploits fine- and coarse-grain parallelism using a scalable array of application-specific cores, and incorporates novel dataflow optimizations, including prefetching, kernel fusion, and memory layout optimizations to minimize data transfers and maximize DRAM bandwidth utilization. We further accompany LAWS with an analytical performance model that allows for scaling across technology trends and architecture configurations. We demonstrate LAWS on the simulation of elastic wave equations. Results show that a single FPGA core achieves 69% higher performance than 24 Xeon cores with 13.27x better energy efficiency, when given 1.94x less peak DRAM bandwidth. Scaling to the same peak DRAM bandwidth shows that an FPGA is 3.27x and 1.5x faster than 24 CPU cores and an Nvidia P100 GPU, with 22.3x and 4.53x better efficiency, respectively.

Mitigating the Last-Mile Bottleneck: A Two-Step Approach For Faster Commercial FPGA Routing

Shashwat Shrivastava
Stefan Nikolic
Chirag Ravishankar
Dinesh Gaitonde
Mirjana Stojilovic

We identified that in modern commercial FPGAs, routing signals from the general interconnect to the inputs of the CLB primitives through a very sparse input interconnect block (IIB) represents a significant runtime bottleneck. This is despite academic research often neglecting the runtime of last-mile routing through the IIB. We propose a two-step routing approach that allows resolving this bottleneck by leveraging massive parallelism of today’s compute infrastructure. The main premise that enables massive parallelization is that once the signals are legally routed in the general interconnect-only reaching the inputs of the IIB, but not the final targets-the remaining last-mile routing through the IIB can be completed independently for each FPGA tile.

We ran experiments using ISPD16 and industrial designs to demonstrate the dominant contribution of last-mile routing to the router’s runtime. We used an architectural model closely resembling Xilinx UltraScale FPGAs, which makes it highly representative of the current state of the art. For ISPD16 benchmarks, we observed that when the router is instructed to complete the entire routing, including its last-mile portion, the average number of heap pushes (a machine-agnostic measure of runtime) increases 4.1× compared to a simplified reference in which last-mile routing is neglected. On industrial designs, the number of heap pushes increased 4.4×. Last-mile routing was successfully completed using a SAT-based router in up to 83% of FPGA tiles. With a slight increase in density of IIB connectivity, we were able to bring the completion success rate up to 100%.

Towards a Machine Learning Approach to Predicting the Difficulty of FPGA Routing Problems

Andrew David Gunter
Steven Wilton

In this poster, we present a Machine Learning (ML) technique to predict the number of iterations needed for a Pathfinder-based FPGA router to complete a routing problem. Given a placed circuit, our technique uses features gathered on each routing iteration to predict if the circuit is routable and how many more iterations will be required to successfully route the circuit. This enables early exit for routing problems which are unlikely to be completed in a target number of iterations. Such early exit may help to achieve a successful route within tractable time by allowing the user to quickly retry the circuit compilation with a different random seed, a modified circuit design, or a different FPGA. We demonstrate our predictor in the VTR 8 framework; compared to VTR’s predictor, our ML predictor incurs lower prediction errors on the Koios Deep Learning benchmark suite. This corresponds with an approximate time saving of 48% from early rejection of unroutable FPGA designs while also successfully completing 5% more routable designs and having a 93% shorter early exit latency.

An FPGA-Based Weightless Neural Network for Edge Network Intrusion Detection

Zachary Susskind
Aman Arora
Alan T. L. Bacellar
Diego L. C. Dutra
Igor D. S. Miranda
Mauricio Breternitz
Priscila M. V. Lima
Felipe M. G. França
Lizy K. John

Algorithms for mobile networking are increasingly being moved from centralized servers towards the edge in order to decrease latency and improve the user experience. While much of this work is traditionally done using ASICs, 6G emphasizes the adaptability of algorithms for specific user scenarios, which motivates broader adoption of FPGAs. In this paper, we propose the FPGA-based Weightless Intrusion Warden (FWIW), a novel solution for detecting anomalous network traffic on edge devices. While prior work in this domain is based on conventional deep neural networks (DNNs), FWIW incorporates a weightless neural network (WNN), a table lookup-based model which learns sophisticated nonlinear behaviors. This allows FWIW to achieve accuracy far superior to prior FPGA-based work at a very small fraction of the model footprint, enabling deployment on small, low-cost devices. FWIW achieves a prediction accuracy of 98.5% on the UNSW-NB15 dataset with a total model parameter size of just 192 bytes, reducing error by 7.9x and model size by 262x vs. LogicNets, the best prior edge-optimized implementation. Implemented on a Xilinx Virtex UltraScale+ FPGA, FWIW demonstrates a 59x reduction in LUT usage with a 1.6x increase in throughput. The accuracy of FWIW comes within 0.6% of the best-reported result in literature (Edge-Detect), a model several orders of magnitude larger. Our results make it clear that WNNs are worth exploring in the emerging domain of edge networking, and suggest that FPGAs are capable of providing the extreme throughput needed.

A Flexible Toolflow for Mapping CNN Models to High Performance FPGA-based Accelerators

Yongzheng Chen
Gang Wu

There have been many studies on developing automatic tools for mapping CNN models onto FPGAs. However, challenges remain in designing an easy-to-use toolflow. First, the toolflow should be able to handle models exported from various deep learning frameworks and models with different topologies. Second, the hardware architecture should make better use of on-chip resources to achieve high performance. In this work, we build a toolflow upon Open Neural Network Exchange (ONNX) IR to support different DL frameworks. We also try to maximize the overall throughput via multiple hardware-level efforts. We propose to accelerate the convolution operation by applying parallelism not only at the input and output channel level, but also at the output feature map level. Several on-chip buffers and corresponding management algorithms are also designed to leverage abundant memory resources. Moreover, we employ a fully pipelined systolic array running at 400 MHz as the convolution engine, and develop a dedicated bus to implement the im2col algorithm and provide feature inputs to the systolic array. We generated 4 accelerators with different systolic array shapes and compiled 12 CNN models for each accelerator. Deployed on a Xilinx VCU118 evaluation board, the performance of convolutional layers can reach 3267.61 GOPS, which is 99.72% of the ideal throughput (3276.8 GOPS). We also achieve an overall throughput of up to 2424.73 GOPS. Compared with previous studies, our toolflow is more user-friendly. The end-to-end performance of the generated accelerators is also better than that of related work at the same DSP utilization.

Senju: A Framework for the Design of Highly Parallel FPGA-based Iterative Stencil Loop Accelerators

Emanuele Del Sozzo
Davide Conficconi
Marco D. Santambrogio
Kentaro Sano

Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving, seismic simulations, and financial option pricing, to name a few. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional system of points until it reaches convergence. However, due to their iterative and computationally intensive nature, these workloads are highly performance-hungry, demanding specialized solutions to boost performance and reduce power consumption. Here, FPGAs represent a valid architectural choice as their peculiar features enable the design of custom, parallel, and scalable ISL accelerators. Besides, the regular structure of ISLs makes them an ideal candidate for automatic optimization and generation flows. For these reasons, this paper introduces Senju, an automation framework for FPGA-based ISL accelerators. Starting from an input description, Senju builds highly parallel hardware modules and automatizes all their design phases. The experimental evaluation shows remarkable and scalable results, reaching significant performance and energy efficiency improvements compared to the other single-FPGA literature approaches.

FPGA Acceleration for Successive Interference Cancellation in Severe Multipath Acoustic Communication Channels

Jinfeng Li
Yahong Rosa Zheng

This paper proposes a hardware implementation of a Successive Interference Cancellation (SIC) scheme in a Turbo Equalizer for very long multipath fading channels where the Intersymbol-interference (ISI) channel length L is on the order of 100 taps. To reduce the computational complexity caused by large matrix arithmetic in the SIC, we explore the data dependencies and convolutional nature of the SIC algorithm and propose an FPGA acceleration architecture by taking advantage of the high degree of parallelism and the flexible data movements offered by FPGA. Instead of reconstructing interference in symbol-wise by matrix and vector multiplication directly for each symbol in a block, we propose a two-stage processing algorithm. The first stage is block-wise processing, where the convolution of the channel impulse response (CIR) vector and the vector consisting of the whole symbol block is computed. The second stage is symbol-wise processing but turns to the multiplication of one symbol and the CIR vector. The result shows that for a block of Nblk symbols and a channel of length L, the proposed architecture completes the SIC within 2Nblk+L clock cycles, while direct matrix multiplication requires L×Nblk clock cycles. Implemented on a Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit, the SIC and equalization in one turbo iteration based on this architecture is completed around 40 us for a 1024 symbol block and a channel length of L=100. This architecture achieves around 40× speed-up compared with the implementation on a powerful CPU platform.

FreezeTime: Towards System Emulation through Architectural Virtualization

Sergiu Mosanu
Joshua Fixelle
Kevin Skadron
Mircea Stan

High-end FPGAs enable architecture modeling through emulation with high speed and fidelity. However, the available reconfigurable logic and memory resources limit the size, complexity, and speed of the emulated target designs. The challenge is to map and model large and fast memory hierarchies, such as large caches and mixed main memory, various heterogeneous computation instances, such as CPUs, GPUs, AI/ML processing units and accelerator cores, and communication infrastructure, such as buses and networks. In addition to the spatial dimension, this work uses the temporal dimension, implemented with architectural multiplexing coupled with block-level synchronization, to model a complete system-on-chip architecture. Our approach presents mechanisms to abstract instance plurality while preserving timing in sync. With only a subset of the architecture on the FPGA, we freeze a whole emulated module’s activity and state during the additional time intervals necessary for the action on the virtualized modules to elapse. We demonstrate this technique by emulating a hypothetical system consisting of a processor and an SRAM memory too large to map on the FPGA. For this, we modify a LiteX-generated SoC consisting of a VexRISC-V processor and DDR memory, with the memory controller issuing stall signals that freeze the processor, effectively ”hiding” the memory latency. For Linux boot, we measure significant emulation vs. simulation speedup while matching RTL simulation accuracy. The work is open-sourced.

SESSION: Session: Applications and Design Studies II

A Framework for Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform via on-chip Dynamic Tree Management

Yuan Meng
Rajgopal Kannan
Viktor Prasanna

Monte Carlo Tree Search (MCTS) is a widely used search technique in Artificial Intelligence (AI) applications. MCTS manages a dynamically evolving decision tree (i.e., one whose depth and height evolve at run-time) to guide an AI agent toward an optimal policy. In-tree operations are memory-bound leading to a critical performance bottleneck for large-scale parallel MCTS on general-purpose processors. CPU-FPGA accelerators can alleviate the memory bottleneck of in-tree operations. However, a major challenge for existing FPGA accelerators is the lack of dynamic memory management due to which they cannot efficiently support dynamically evolving MCTS trees. In this work, we address this challenge by proposing an MCTS acceleration framework that (1) incorporates an algorithm-hardware co-optimized accelerator design that supports in-tree operations on dynamically evolving trees without expensive hardware reconfiguration; (2) adopts a hybrid parallel execution model to fully exploit the compute power in a CPU-FPGA heterogeneous system; (3) supports Python-based programming API for easy integration of the proposed accelerator with RL domain-specific bench-marking libraries at run-time. We show that by using our framework, we achieve up to 6.8× speedup and superior scalability of parallel workers than state-of-the-art parallel MCTS on multi-core systems.

Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

Linghao Song
Licheng Guo
Suhail Basalama
Yuze Chi
Robert F. Lucas
Jason Cong

The continued growth in the processing power of FPGAs coupled with high bandwidth memories (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient linear solver (CG). FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth and maintain double (FP64) precision accuracy. To tackle the three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed precision scheme to save bandwidth yet still achieve effective double precision quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reusing between on-chip modules to reduce unnecessary off-chip accesses and enable modules working in parallel for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to the Xilinx HPC product, the XcgSolver, Callipepla achieves a speedup of 3.94x, 3.36x higher throughput, and 2.94x better energy efficiency. Compared to an NVIDIA A100 GPU which has 4x the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34x higher energy efficiency. The code is available at https://github.com/UCLA-VAST/Callipepla.

Accelerating Sparse MTTKRP for Tensor Decomposition on FPGA

Sasindu Wijeratne
Ta-Yang Wang
Rajgopal Kannan
Viktor Prasanna

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the most computationally intensive kernel in sparse tensor decomposition. In this paper, we propose a hardware-algorithm co-design on FPGA to minimize the execution time of spMTTKRP along all modes of an input tensor. We introduce FLYCOO, a novel tensor format that eliminates the communication of intermediate values to the FPGA external memory during the computation of spMTTKRP along all the modes. Our remapping of the tensor using FLYCOO also balances the workload among multiple Processing Engines (PEs). We propose a parallel algorithm that can concurrently process multiple partitions of the input tensor independent of each other. The proposed algorithm also orders the tensor dynamically during runtime to increase the data locality of the external memory accesses. We develop a custom FPGA accelerator design with (1) PEs consisting of a collection of pipelines that can concurrently process multiple elements of the input tensor and (2) memory controllers to exploit the spatial and temporal locality of the external memory accesses of the computation. Our work achieves a geometric mean of 8.8X and 3.8X speedup in execution time compared with the state-of-the-art CPU and GPU implementations on widely-used real-world sparse tensor datasets.

FPGA’23 TOC