FPGA 2017 TOC
WORKSHOP SESSION: FPGA’17 Workshops
OLAF’17: Third International Workshop on Overlay Architectures for FPGAs
The Third International Workshop on Overlay Architectures for FPGAs (OLAF) is held
in Monterey, California, USA, on Feburary 22, 2017 and co-located with FPGA 2017:
The 25th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. The
main …
SESSION: Special Session: The Role of FPGAs in Deep Learning
Session details: Special Session: The Role of FPGAs in Deep Learning
The Role of FPGAs in Deep Learning
Deep learning has garnered significant visibility recently as an Artificial Intelligence
(AI) paradigm, with success in wide ranging applications such as image and speech
recognition, natural language understanding, self-driving cars, and game playing (…
Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?
Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily
on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular
parallelism, high TFLOP/s). Because of this, GPUs are widely used for …
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Convolutional neural networks (CNN) are the current stateof-the-art for many computer
vision tasks. CNNs outperform older methods in accuracy, but require vast amounts
of computation and memory. As a result, existing CNN applications are typically run
…
Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural
Network
OpenCL FPGA has recently gained great popularity with emerging needs for workload
acceleration such as Convolutional Neural Network (CNN), which is the most popular
deep learning architecture in the domain of computer vision. While OpenCL enhances
the …
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared
Memory System
We present a novel mechanism to accelerate state-of-art Convolutional Neural Networks
(CNNs) on CPU-FPGA platform with coherent shared memory. First, we exploit Fast Fourier
Transform (FFT) and Overlap-and-Add (OaA) to reduce the computational …
Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional
Neural Networks
As convolution layers contribute most operations in convolutional neural network (CNN)
algorithms, an effective convolution acceleration scheme significantly affects the
efficiency and performance of a hardware CNN accelerator. Convolution in CNNs …
SESSION: Machine Learning
Session details: Machine Learning
An OpenCL™ Deep Learning Accelerator on Arria 10
Convolutional neural nets (CNNs) have become a practical means to perform vision tasks,
particularly in the area of image classification. FPGAs are well known to be able
to perform convolutions efficiently, however, most recent efforts to run CNNs on …
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
Research has shown that convolutional neural networks contain significant redundancy,
and high classification accuracy can be obtained even when weights and activations
are reduced from floating point to binary values. In this paper, we present FINN,
a …
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA
Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve
higher prediction accuracy, machine learning scientists have built increasingly larger
models. Such large model is both computation intensive and memory intensive. …
SESSION: Interconnect and Routing
Session details: Interconnect and Routing
Quality-Time Tradeoffs in Component-Specific Mapping: How to Train Your Dynamically Reconfigurable Array of Gates with Outrageous Network-delays
How should we perform component-specific adaptation for FPGAs? Prior work has demonstrated
that the negative effects of variation can be largely mitigated using complete knowledge
of device characteristics and full per-FPGA CAD flow. However, the cost …
Synchronization Constraints for Interconnect Synthesis
Interconnect synthesis tools ease the burden on the designer by automatically generating
and optimizing communication hardware. In this paper we propose a novel capability
for FPGA interconnect synthesis tools that further simplifies the designer’s …
Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion
FPGAs are increasingly popular as application-specific accelerators because they lead
to a good balance between flexibility and energy efficiency, compared to CPUs and
ASICs. However, the long routing time imposes a barrier on FPGA computing, which …
SESSION: Architecture
Session details: Architecture
Don’t Forget the Memory: Automatic Block RAM Modelling, Optimization, and Architecture Exploration
While academic FPGA architecture exploration tools have become sufficiently advanced
to enable a wide variety of explorations and optimizations on soft fabric and outing,
support for Block RAM (BRAM) has been very limited. In this paper, we present …
Automatic Construction of Program-Optimized FPGA Memory Networks
Memory systems play a key role in the performance of FPGA applications. As FPGA deployments
move towards design entry points that are more serial, memory latency has become a
serious design consideration. For these applications, memory network …
NAND-NOR: A Compact, Fast, and Delay Balanced FPGA Logic Element
The And-Inverter Cone has been introduced as an alternative logic element to the look-up
table in FPGAs, since it improves their performance and resource utilization. However,
further analysis of the AIC design showed that it suffers from the delay …
120-core microAptiv MIPS Overlay for the Terasic DE5-NET FPGA board
We design a 120-core 94MHz MIPS processor FPGA over-lay interconnected with a lightweight
message-passing fabric that fits on a Stratix V GX FPGA (5SGXEA7N2F45C2). We use silicon-tested
RTL source code for the microAptiv MIPS processor made available …
SESSION: CAD Tools
Session details: CAD Tools
A Parallelized Iterative Improvement Approach to Area Optimization for LUT-Based Technology
Mapping
Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations
on the input logic network before carrying out technology mapping. While the “known
recipes” of logic transformations often lead to improved mapping results, …
A Parallel Bandit-Based Approach for Autotuning FPGA Compilation
Mainstream FPGA CAD tools provide an extensive collection of optimization options
that have a significant impact on the quality of the final design. These options together
create an enormous and complex design space that cannot effectively be explored …
PANEL SESSION: Panel: FPGAs in the Cloud
Session details: Panel: FPGAs in the Cloud
FPGAs in the Cloud
Ever greater amounts of computing and storage are happening remotely in the cloud,
and it is estimated that spending on public cloud services will grow by over 19%/year
to $140B in 2019. Besides commodity processors, network and storage infrastructure,
…
SESSION: High-Level Synthesis — Tools and Applications
Session details: High-Level Synthesis — Tools and Applications
Hardware Synthesis of Weakly Consistent C Concurrency
Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion
but via fine-grained atomic operations (‘atomics’), have been shown empirically to
be the fastest class of multi-threaded algorithms in the realm of conventional …
A New Approach to Automatic Memory Banking using Trace-Based Address Mining
Recent years have seen an increased deployment of FPGAs as programmable accelerators
for improving the performance and energy efficiency of compute-intensive applications.
A well-known “secret sauce” of achieving highly efficient FPGA acceleration is to
…
Dynamic Hazard Resolution for Pipelining Irregular Loops in High-Level Synthesis
Current pipelining approach in high-level synthesis (HLS) achieves high performance
for applications with regular and statically analyzable memory access patterns. However,
it cannot effectively handle infrequent data-dependent structural and data …
Accelerating Face Detection on Programmable SoC Using C-Based Synthesis
High-level synthesis (HLS) enables designing at a higher level of abstraction to effectively
cope with design complexity of emerging applications on modern programmable system-on-chip
(SoC). While HLS continues to evolve with a growing set of algorithms,…
Packet Matching on FPGAs Using HMC Memory: Towards One Million Rules
Packet processing systems increasingly need larger rulesets to satisfy the needs of
deep-network intrusion prevention and cluster computing. FPGA-based implementations
of packet processing systems have been proposed but their use of on-chip memory …
SESSION: Graph Processing Applications
Session details: Graph Processing Applications
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search
Large graph processing has gained great attention in recent years due to its broad
applicability from machine learning to social science. Large real-world graphs, however,
are inherently difficult to process efficiently, not only due to their large …
ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture
The performance of large-scale graph processing suffers from challenges including
poor locality, lack of scalability, random access pattern, and heavy data conflicts.
Some characteristics of FPGA make it a promising solution to accelerate various …
FPGA-Accelerated Transactional Execution of Graph Workloads
Many applications that operate on large graphs can be intuitively parallelized by
executing a large number of the graph operations concurrently and as transactions
to deal with potential conflicts. However, large numbers of operations occurring …
SESSION: Virtualization and Applications
Session details: Virtualization and Applications
Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center
We present a framework for creating network FPGA clusters in a heterogeneous cloud
data center. The FPGA clusters are created using a logical kernel description describing
how a group of FPGA kernels are to be connected (independent of which FPGA these …
Energy Efficient Scientific Computing on FPGAs using OpenCL
An indispensable part of our modern life is scientific computing which is used in
large-scale high-performance systems as well as in low-power smart cyber-physical
systems. Hence, accelerators for scientific computing need to be fast and energy …
Secure Function Evaluation Using an FPGA Overlay Architecture
Secure Function Evaluation (SFE) has received considerable attention recently due
to the massive collection and mining of personal data over the Internet, but large
computational costs still render it impractical. In this paper, we leverage hardware
…
SESSION: Applications
Session details: Applications
FPGA Acceleration for Computational Glass-Free Displays
The increasing computational power enables various new applications that are runtime
prohibitive before. FPGA is one of such computational power with both reconfigurability
and energy efficiency. In this paper, we demonstrate the feasibility of …
Hardware Acceleration of the Pair-HMM Algorithm for DNA Variant Calling
With the advent of several accurate and sophisticated statistical algorithms and pipelines
for DNA sequence analysis, it is becoming increasingly possible to translate raw sequencing
data into biologically meaningful information for further clinical …
POSTER SESSION: Poster Session 1
Measuring the Power-Constrained Performance and Energy Gap between FPGAs and Processors
(Abstract Only)
This work measures the performance and power consumption gap between the current generation
of low power FPGAs and low power microprocessors (microcontrollers) through an implementation
of the Canny edge detection algorithm. In particular, the algorithm …
A Mixed-Signal Data-Centric Reconfigurable Architecture enabled by RRAM Technology
(Abstract Only)
This poster presents a data-centric reconfigurable architecture, which is enabled
by emerging non-volatile memory, i.e., RRAM. Compared to the heterogeneous architecture
of commercial FPGAs, it is inherently a homogeneous architecture comprising of a …
A Framework for Iterative Stencil Algorithm Synthesis on FPGAs from OpenCL Programming
Model (Abstract Only)
Iterative stencil algorithms find applications in a wide range of domains. FPGAs have
long been adopted for computation acceleration due to its advantages of dedicated
hardware design. Hence, FPGAs are a compelling alternative for executing iterative
…
Scala Based FPGA Design Flow (Abstract Only)
With the rapid growth of data scale, data analysis applications start to meet the
performance bottleneck, and thus requiring the aid of hardware acceleration. At the
same time, Field Programmable Gate Arrays (FPGAs), known for their high customizability
…
Thermal Flattening in 3D FPGAs Using Embedded Cooling (Abstract Only)
Thermal management is one of the key concerns in modern high power density chips.
A variety of thermal cooling techniques that have been in use in industrial applications
are now also being applied to integrated circuits. In this work, we explore the …
A Machine Learning Framework for FPGA Placement (Abstract Only)
Many of the key stages in the traditional FPGA CAD flow require substantial amounts
of computational effort. Moreover, due to limited overlap among individual stages,
poor decisions made in earlier stages will often adversely affect the quality of …
Precise Coincidence Detection on FPGAs: Three Case Studies (Abstract Only)
In high-performance applications, such as quantum physics and positron emission tomography,
precise coincidence detection is of central importance: The quality of the reconstructed
images depends on the accuracy with which the underlying system detects …
Towards Efficient Design Space Exploration of FPGA-based Accelerators for Streaming
HPC Applications (Abstract Only)
Streaming HPC applications are data intensive and have widespread use in various fields
(e.g., Computational Fluid Dynamics and Bioinformatics). These applications consist
of different processing kernels where each kernel performs a specific computation
…
Accurate and Efficient Hyperbolic Tangent Activation Function on FPGA using the DCT
Interpolation Filter (Abstract Only)
Implementing an accurate and fast activation function with low cost is a crucial aspect
to the implementation of Deep Neural Networks (DNNs) on FPGAs. We propose a high accuracy
approximation approach for the hyperbolic tangent activation function of …
An FPGA Overlay Architecture for Cost Effective Regular Expression Search (Abstract
Only)
Snort and Bro are Deep Packet Inspection systems which express complex rules with
regular expressions. Before performing a regular expression search, these applications
apply a filter to select which regular expressions must be searched. One way to …
POSTER SESSION: Poster Session 2
Using Vivado-HLS for Structural Design: a NoC Case Study (Abstract Only)
There have been ample successful examples of applying Xilinx Vivado’s “function-to-module”
high-level synthesis (HLS) where the subject is algorithmic in nature. In this work,
we carried out a design study to assess the effectiveness of applying Vivado-…
Automatic Generation of Hardware Sandboxes for Trojan Mitigation in Systems on Chip
(Abstract Only)
Component based design is one of the preferred methods to tackle system complexity,
and reduce costs and time-to-market. Major parts of the system design and IC production
are outsourced to facilities distributed across the globe, thus opening the door …
Accelerating Financial Market Server through Hybrid List Design (Abstract Only)
The financial market server in exchanges aims to maintain the order books and provide
real time market data feeds to traders. Low-latency processing is in a great demand
in financial trading. Although software solutions provide the flexibility to …
Joint Modulo Scheduling and Memory Partitioning with Multi-Bank Memory for High-Level
Synthesis (Abstract Only)
High-Level Synthesis (HLS) has been widely recognized and accepted as an efficient
compilation process targeting FPGAs for algorithm evaluation and product prototyping.
However, the massively parallel memory access demands and the extremely expensive
…
A Batch Normalization Free Binarized Convolutional Deep Neural Network on an FPGA
(Abstract Only)
A pre-trained convolutional deep neural network (CNN) is a feed-forward computation
perspective, which is widely used for the embedded systems, requires high power-and-area
efficiency. This paper realizes a binarized CNN which treats only binary 2-…
A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural
Networks (Abstract Only)
FPGA-based hardware accelerator for convolutional neural networks (CNNs) has obtained
great attentions due to its higher energy efficiency than GPUs. However, it has been
a challenge for FPGA-based solutions to achieve a higher throughput than GPU …
CPU-FPGA Co-Optimization for Big Data Applications: A Case Study of In-Memory Samtool Sorting (Abstract Only)
To efficiently process a tremendous amount of data, today’s big data applications
tend to distribute the datasets into multiple partitions, such that each partition
can be fit into memory and be processed by a separate core/server in parallel. Meanwhile,…
Stochastic-Based Multi-stage Streaming Realization of a Deep Convolutional Neural
Network (Abstract Only)
Large-scale convolutional neural network (CNN), conceptually mimicking the operational
principle of visual perception in human brain, has been widely applied to tackle many
challenging computer vision and artificial intelligence applications. …
fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs (Abstract Only)
In recent years, Convolutional Neural Networks (ConvNets) have become the state-of-the-art
in several Artificial Intelligence tasks. Across the range of applications, the performance
needs vary significantly, from high-throughput image recognition to …
POSTER SESSION: Poster Session 3
FPGA-based Hardware Accelerator for Image Reconstruction in Magnetic Resonance Imaging
(Abstract Only)
Magnetic Resonance Imaging (MRI) is widely used in medical diagnostics. Sampling of
MRI data on Cartesian grids allows efficient computation of the Inverse Discrete Fourier
Transform for image reconstruction using the Inverse Fast Fourier Transform (…
Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network
Layers (Abstract Only)
Convolutional neural networks (CNNs) are used to solve many challenging machine learning
problems. These networks typically use convolutional layers for feature extraction
and fully-connected layers to perform classification using those features. …
ASAP: Accelerated Short Read Alignment on Programmable Hardware (Abstract Only)
The proliferation of high-throughput sequencing machines allows for the rapid generation
of billions of short nucleotide fragments in a short period. This massive amount of
sequence data can quickly overwhelm today’s storage and compute infrastructure. …
RxRE: Throughput Optimization for High-Level Synthesis using Resource-Aware Regularity Extraction
(Abstract Only)
Despite the considerable improvements in the quality of HLS tools, they still require
the designer’s manual optimizations and tweaks to generate efficient results, which
negates the HLS design productivity gains. Majority of designer interventions lead
…
GRT 2.0: An FPGA-based SDR Platform for Cognitive Radio Networks (Abstract Only)
Although there is explosive growth of theoretical research on cognitive radio, the
real-time platform for cognitive radio is progressing at a low pace. Researchers expect
fast prototyping their designs with appropriate wireless platforms to precisely …
FPGA Implementation of Non-Uniform DFT for Accelerating Wireless Channel Simulations
(Abstract Only)
FPGAs have been used as accelerators in a wide variety of domains such as learning,
search, genomics, signal processing, compression, analytics and so on. In recent years,
the availability of tools and flows such as high-level synthesis has made it even
…
Learning Convolutional Neural Networks for Data-Flow Graph Mapping on Spatial Programmable
Architectures (Abstract Only)
Data flow graph (DFG) mapping is critical for the compiling of spatial programmable
architecture, where compilation time is a key factor for both time-to-market requirement
and mapping successful rate. Inspired from the great progress made in tree …
Cache Timing Attacks from The SoCFPGA Coherency Port (Abstract Only)
In this presentation we show that side-channels arising from micro-architecture of
SoCFPGAs could be a security risk. We present a FPGA trojan based on OpenCL which
performs cache-timing attacks through the accelerator coherency port (ACP) of a SoCFPGA.
…
Dynamic Partitioning for Library based Placement on Heterogeneous FPGAs (Abstract
Only)
Library based design and IP reuses have been previously proposed to speed up the synthesis
of large-scale FPGA designs. However, existing methods result in large area wastage
due to the module size difference and the waste area inside each module. In …
An Energy-Efficient Design-Time Scheduler for FPGAs Leveraging Dynamic Frequency Scaling
Emulation (Abstract Only)
We present a design-time tool, EASTA, that combines the feature of reconfigurability
in FPGAs and Dynamic Frequency Scaling to realize an efficient multiprocessing scheduler
on a single-FPGA system. Multiple deadlines, reconvergent nodes, flow …