SESSION: Keynote 1
Session details: Keynote 1
Scalable System and Silicon Architectures to Handle the Workloads of the Post-Moore
Era
The end of Moore’s law has been proclaimed on many occasions and it’s probably safe
to say that we are now working in the post-Moore era. But no one is ready to slow
down just yet. We can view Gordon Moore’s observation on transistor densification
as just one aspect of a longer-term underlying technological trend – the Law of Accelerating
Returns articulated by Kurzweil. Arguably, companies became somewhat complacent in
the Moore era, happy to settle for the gains brought by each new process node. Although
we can expect scaling to continue, albeit at a slower pace, the end of Moore’s Law
delivers a stronger incentive to push other trends of technology progress harder.
Some exciting new technologies are now emerging, such as multi-chip 3D integration, storage-class memory, and silicon photonics. Moreover, we are also entering a golden age of computer architecture innovation.
One of the key drivers is the pursuit of domain-specific architectures as proclaimed
by Turing award winners John Hennessy and David Patterson. A good example is Xilinx's AI Engine, one of the key features of the Versal ACAP (adaptive compute acceleration platform) [1]. Today, the explosion of AI workloads is one of the most powerful drivers
shifting our attention toward finding faster ways of moving data into, across, and out of accelerators. Massively parallel processing elements, domain-specific accelerators, and dense interconnect between distributed on-chip memories and processing elements are examples of the ways chip makers are looking beyond scaling to achieve next-generation performance gains. Next, the growing demands of scaling out hyperscale datacenter applications drive much of the new architectural development. Given the high diversity of workloads that demand massive compute and data movement, datacenter architectures are moving away from rigid CPU-centric structures and instead prioritize adaptability and configurability to optimize resources such as memory and connectivity of accelerators assigned to individual workloads. There is no longer
a single figure of merit. It’s not all about Tera-OPS. Other metrics such as transfers-per-second
and latency come to the fore as demands become more real-time; autonomous vehicles
being an obvious and important example. Moreover, the transition to 5G will result
in solutions that operate across the traditional boundaries between the cloud, the edge, and embedded platforms, which are necessarily power-conscious and cost-sensitive. Future workloads will require agile software flows that accommodate the spread of functions across edge and cloud. Another industry megatrend that will drive technology requirements, especially in encryption, data storage, and communication, is Blockchain.
To some, it may already have a bad reputation, tarnished by association with the anarchy
of cryptocurrency, but it will be more widely relevant than many of us realize. Who
could have foreseen the development of today’s Internet when ARPANET first appeared
as a simple platform for distributed computing and sending email? Through projects
such as the open-source Hyperledger, Blockchain technology could be game-changing
as a platform for building trust in transactions executed over the Internet. We may
soon be talking in terms of the Trusted Internet. The predictability of Moore’s law
may have become rather too comfortable and slow. The future requires maximizing the
flexibility, agility, and efficiency of new technologies. With Moore’s Law now mostly
behind us, new adaptable and scalable architectures will allow us to continue delivering exponential returns from technology and to create a more adaptable and intelligent world.
SESSION: Session 1: Placement
Session details: Session 1: Placement
Placement Optimization with Deep Reinforcement Learning
Placement Optimization is an important problem in systems and chip design, which consists
of mapping the nodes of a graph onto a limited set of resources to optimize for an
objective, subject to constraints. In this paper, we start by motivating reinforcement
learning as a solution to the placement problem. We then give an overview of what
deep reinforcement learning is. We next formulate the placement problem as a reinforcement
learning problem, and show how this problem can be solved with policy gradient optimization.
Finally, we describe lessons we have learned from training deep reinforcement learning
policies across a variety of placement optimization problems.
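For intuition only, the toy sketch below mirrors that formulation on an invented four-node graph: a per-node softmax policy picks grid slots sequentially, and a REINFORCE-style policy-gradient update nudges the logits using negative wirelength as the reward. The graph, grid, learning rate, and baseline are all assumptions for illustration, not the authors' system.

```python
# Minimal REINFORCE-style placement sketch on an invented toy problem.
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # toy "netlist" graph (assumed)
n_nodes, n_slots = 4, 9                            # place 4 nodes on a 3x3 grid
coords = np.array([(s // 3, s % 3) for s in range(n_slots)], dtype=float)
theta = np.zeros((n_nodes, n_slots))               # per-node logits over slots (the policy)

def wirelength(assign):
    return sum(np.abs(coords[assign[u]] - coords[assign[v]]).sum() for u, v in edges)

baseline = 0.0
for step in range(2000):
    assign, grads, used = [], [], np.zeros(n_slots, dtype=bool)
    for node in range(n_nodes):                    # place the nodes sequentially
        logits = np.where(used, -1e9, theta[node])
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        slot = rng.choice(n_slots, p=probs)
        used[slot] = True
        assign.append(slot)
        g = -probs
        g[slot] += 1.0                             # d log pi(slot) / d logits
        grads.append(g)
    reward = -wirelength(assign)                   # objective: minimize wirelength
    baseline += 0.05 * (reward - baseline)         # running-average baseline
    for node, g in enumerate(grads):
        theta[node] += 0.05 * (reward - baseline) * g   # REINFORCE update

print("final sampled wirelength:", wirelength(assign))
```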
Hill Climbing with Trees: Detail Placement for Large Windows
Integrated circuit design encompasses a wide range of intractable optimization problems.
In this paper, we extend linear time hill climbing techniques from graph partitioning
to address detailed placement — this results in a new way to refine circuit designs,
dramatically expands the size of practical optimization windows, and enables wire
length reductions on a variety of benchmark problems. The approach is versatile and
straightforward to implement, allowing it to be applied to a wide range of problems within design automation and beyond.
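The paper's tree-based, linear-time machinery is not reproduced here; as a point of reference, the sketch below shows plain swap-based hill climbing over a placement window, accepting any swap that reduces half-perimeter wirelength (HPWL). The cells, nets, and sites are invented.

```python
# Plain swap-based hill climbing over a placement window (a much simpler
# baseline than the paper's tree-based technique).
import itertools

# Hypothetical data: cell -> (x, y) site, and nets as lists of cells.
placement = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0)}
nets = [["a", "c"], ["b", "d"], ["a", "d"]]

def hpwl(pl):
    total = 0
    for net in nets:
        xs = [pl[c][0] for c in net]
        ys = [pl[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def hill_climb(pl, window):
    improved = True
    while improved:
        improved = False
        for c1, c2 in itertools.combinations(window, 2):
            before = hpwl(pl)
            pl[c1], pl[c2] = pl[c2], pl[c1]        # try swapping two cells' sites
            if hpwl(pl) < before:
                improved = True                    # keep the improving swap
            else:
                pl[c1], pl[c2] = pl[c2], pl[c1]    # revert
    return pl

print(hpwl(placement))
hill_climb(placement, window=["a", "b", "c", "d"])
print(hpwl(placement))
```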
Via Pillar-aware Detailed Placement
With feature sizes shrinking to 7 nm and beyond, the impact of wire resistance is growing significantly, and the circuit delay incurred by metal wires is rising noticeably. To address this issue, a technique called via pillar insertion has been developed.
However, the poor success rate of the via pillar insertion process immediately becomes
an important problem. In this paper, we explore the causes of via pillar insertion
failures by experiments on the ISPD 2015 benchmarks, which are embedded with a real
industrial cell library. The results show that the low success rate may be due to track misalignment, overlap with power and ground stripes, and insufficient margin area. Therefore, we propose the first detailed placement flow which is aware
of via pillars to maximize the success rate of via pillar insertion. In the proposed
flow, we first filter out infeasible cell rows and then move the via pillar-inserting
cells to their eligible positions. Next, we adopt a two-stage legalization method
with high flexibility on cell ordering based on a dynamic programming-based detailed
placement algorithm. Finally, we improve congested rows with a global moving process.
Experiment results show that our algorithm improves the insertion rates by 54-58%,
and achieves over 99% insertion rate on average.
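As a rough illustration of the dynamic-programming ingredient mentioned above (with unit-width cells and a fixed cell order, unlike the paper's flexible ordering), the sketch below legalizes one row by assigning distinct sites that minimize total displacement.

```python
# Textbook row-legalization DP: not the paper's algorithm, just the basic idea.
import math

def legalize_row(targets, n_sites):
    """targets: desired site index of each unit-width cell, in left-to-right order."""
    n, INF = len(targets), math.inf
    # f[i][s] = min total displacement placing the first i cells within sites 0..s
    f = [[0.0 if i == 0 else INF for _ in range(n_sites)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for s in range(n_sites):
            skip = f[i][s - 1] if s > 0 else INF                  # leave site s empty
            prev = f[i - 1][s - 1] if s > 0 else (0.0 if i == 1 else INF)
            f[i][s] = min(skip, prev + abs(targets[i - 1] - s))   # or put cell i-1 at site s
    return f[n][n_sites - 1]

# two cells want site 2; the best legal solution shifts one of them by a single site
print(legalize_row(targets=[2, 2, 5], n_sites=8))   # -> 1.0
```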
Soft-Clustering Driven Flip-flop Placement Targeting Clock-induced OCV
On-Chip Variation (OCV) in advanced technology nodes introduces delay uncertainties
that may cause timing violations. This problem drastically affects the clock tree, which, on top of the growing design complexity, needs to be appropriately synthesized to tackle the increased variability. To reduce the magnitude of the clock-induced
OCV, we incrementally relocate the flip-flops and the clock gaters in a bottom-up
manner to implicitly guide the clock tree synthesis engine to produce clock trees
with increased common clock tree paths. The relocation of the clock elements is performed
using a soft clustering approach that is orthogonal to the clock tree synthesis method
used. The clock elements are repeatedly relocated and incrementally re-clustered,
thus gradually forming better clusters and settling to more appropriate positions
to increase the common paths of the clock tree. This behavior is verified by applying
the proposed method in industrial designs, resulting in clock trees which are more
resilient to process variations, while exhibiting improved overall timing.
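The paper does not spell out its soft-clustering formulation here, so the sketch below only conveys the "soft-cluster, relocate, repeat" loop using a fuzzy c-means-style membership; the cluster count, fuzziness, step size, and flip-flop locations are invented, and timing/CTS constraints are ignored.

```python
# Fuzzy-clustering flavored sketch of iterative re-clustering and relocation.
import numpy as np

rng = np.random.default_rng(1)
flops = rng.uniform(0, 100, size=(40, 2))         # flip-flop (x, y) locations (fake)
k, m, step = 4, 2.0, 0.2                          # clusters, fuzziness, relocation step
centers = flops[rng.choice(len(flops), k, replace=False)].copy()

for it in range(20):
    # soft membership of each flop in each cluster (fuzzy c-means style)
    d = np.linalg.norm(flops[:, None, :] - centers[None, :, :], axis=2) + 1e-9
    u = 1.0 / (d ** (2 / (m - 1)))
    u /= u.sum(axis=1, keepdims=True)
    # update cluster centers, then nudge each flop toward its soft centroid
    w = u ** m
    centers = (w.T @ flops) / w.sum(axis=0)[:, None]
    target = u @ centers                          # per-flop weighted centroid
    flops += step * (target - flops)              # incremental relocation

print("cluster centers:\n", centers.round(1))
```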
SESSION: Session 2: Breaking New Ground: From Carbon Nanotubes to Packaging
Session details: Session 2: Breaking New Ground: From Carbon Nanotubes to Packaging
Advances in Carbon Nanotube Technologies: From Transistors to a RISC-V Microprocessor
Carbon nanotube (CNT) field-effect transistors (CNFETs) promise to improve the energy
efficiency of very-large-scale integrated (VLSI) systems. However, multiple challenges
have prevented VLSI CNFET circuits from being realized, including inherent nano-scale
material defects, robust processing for yielding complementary CNFETs (i.e., CNT CMOS:
including both PMOS and NMOS CNFETs), and major CNT variations. Here, we summarize
techniques that we have recently developed to overcome these outstanding challenges,
enabling VLSI CNFET circuits to be experimentally realized today using standard VLSI
processing and design flows. Leveraging these techniques, we demonstrate the most
complex CNFET circuits and systems to-date, including a three-dimensional (3D) imaging
system comprising CNFETs fabricated directly on top of a silicon imager, CNT CMOS
analog and mixed-signal circuits, 1-kilobit CNFET static random-access memory (SRAM) arrays, and a 16-bit RISC-V microprocessor built entirely out of CNFETs.
Full-Chip Electro-Thermal Coupling Extraction and Analysis for Face-to-Face Bonded
3D ICs
Due to the short die-to-die distance and inferior heat dissipation capability, Face-to-Face
(F2F) bonded 3D ICs are often considered to be vulnerable to electrical and thermal
coupling. This study is the first to quantify the impacts of the electro-thermal coupling
on the full-chip timing, power, and performance. We first present an implementation
flow for realistic F2F 3D ICs including pad layers and power grids. Then, we propose
our signal integrity analysis, parasitic extraction, and thermal analysis flows. Next,
we investigate the impacts of the coupling on the delay, power, and noise of F2F 3D
ICs, and provide guidelines to mitigate these effects. Our experimental results show
that the inter-die electrical coupling causes up to 5.81% timing degradation and 4.00%
noise increase, while the thermal coupling leads to less than 0.41% timing degradation
and nearly no noise increase. The impact of the combined electro-thermal coupling
on delay and noise reaches 6.07% and 4.05%, respectively.
Pseudo-3D Approaches for Commercial-Grade RTL-to-GDS Tool Flow Targeting Monolithic
3D ICs
Despite the recent academic efforts to develop Electronic Design Automation (EDA)
algorithms for 3D ICs, the current market does not have commercial 3D computer-aided
design (CAD) tools. Instead, alternative pseudo-3D design flows have been devised, which utilize commercial 2D CAD engines with tricks that help them operate as fairly efficient 3D CAD tools. In this paper, we provide detailed discussions and fair power-performance-area
(PPA) comparisons of state-of-the-art pseudo-3D design flows. We also analyze the
limitations of each design flow and provide solutions with better PPA and various
design options. Our experiments using commercial PDK, GDS layouts, and sign-off simulations
demonstrate that we achieve up to 26% wirelength and 10% power consumption reduction
for pseudo-3D design flows. We also provide a partitioning-first scheme for the partitioning-last design flow, which increases design freedom with tolerable PPA degradation.
SESSION: Session 3: Machine Learning for Physical Design (part 1)
Session details: Session 3: Machine Learning for Physical Design (part 1)
Learning from Experience: Applying ML to Analog Circuit Design
The problem of analog design automation has vexed several generations of researchers
in electronic design automation. At its core, the difficulty of the problem is related
to the fact that machine-generated designs have been unable to match the quality of
the human designer. The human designer typically recognizes blocks from a netlist
and draws upon her/his experience to translate these blocks into a circuit that is
laid out in silicon. The ability to annotate blocks in a schematic or netlist-level
description of a circuit is key to this entire process, but it is a process fraught
with complexity due to the large number of variants of each circuit type. For example,
the number of topologies of operational transconductance amplifiers (OTAs) easily
numbers in the hundreds. A designer manages this complexity by dividing this large
set of variants into classes (e.g., OTAs may be telescopic, folded cascode, etc.).
Even so, the number of minor variations within each class is large. Early approaches
to analog design automation attempted to use rule-based methods to capture these variations,
but this database of rules required tender care: each new variant might require a
new rule. As machine learning (ML) based alternatives have become more viable, alternative
forms of solving this problem have begun to be explored.
Our effort is part of the ALIGN (Analog Layout, Intelligently Generated from Netlists)
project [2, 3], which is developing open-source software for analog/mixed-signal circuit
layout [1]. Our specific goal is to translate a netlist into a physical layout, with
24-hour turnaround and no human in the loop. The ALIGN flow inputs a netlist whose
topology and transistor sizes have already been chosen, a set of performance specifications,
and a process design kit (PDK) that defines the process technology. The output of
ALIGN is a layout in GDSII format.
Transforming Global Routing Report into DRC Violation Map with Convolutional Neural
Network
In this paper, we propose a machine-learning framework that predicts the DRC-violation map a design will exhibit after detailed routing, based on the congestion report from its global routing. The proposed framework uses a convolutional neural network as its core technique to train this prediction model. The training dataset is collected from 15 industrial designs using a leading commercial APR tool, and the total number of collected training samples exceeds 26M. A specialized under-sampling technique is proposed to select important training samples for learning, compensate for the inaccuracy caused by a highly imbalanced training dataset, and speed up the entire training process. Experimental results demonstrate that our trained model not only achieves significantly higher accuracy than previous related works but also produces DRC-violation maps that visually match the actual ones closely. The average runtime of using our learned model to generate a DRC-violation map is only 3% of that of global routing, and hence our proposed framework can be viewed as a simple add-on tool to a current commercial global router that can efficiently and effectively generate a more realistic DRC-violation map without actually applying detailed routing.
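The paper's network architecture and feature set are not given in this abstract, so the sketch below is only a generic fully-convolutional stand-in: congestion-style feature planes in, a per-tile DRC-violation logit map out, with a class weight standing in for the under-sampling used against label imbalance.

```python
# Illustrative (not the paper's) fully-convolutional congestion -> DRC-map model.
import torch
import torch.nn as nn

class CongestionToDRC(nn.Module):
    def __init__(self, in_channels=4):             # e.g. H/V overflow, pin density (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),        # one violation logit per layout tile
        )

    def forward(self, x):                           # x: (batch, C, H, W)
        return self.net(x)

model = CongestionToDRC()
features = torch.randn(8, 4, 64, 64)                # fake congestion maps
labels = torch.randint(0, 2, (8, 1, 64, 64)).float()
# class weighting stands in for the under-sampling the paper uses against imbalance
loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(20.0))(model(features), labels)
loss.backward()
print(float(loss))
```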
Lookahead Placement Optimization with Cell Library-based Pin Accessibility Prediction
via Active Learning
With the development of advanced semiconductor process nodes, pin access has become one of the major factors behind design rule violations (DRVs), owing to complex design rules and limited routing resources. Many state-of-the-art works address DRV prediction by adopting supervised machine learning approaches. However, those approaches obtain the labels of the training data by generating a great number of routed designs in advance, giving rise to a large effort in training data preparation. In addition, a pre-trained model can hardly predict unseen data and thus may not be applied to other designs containing cells that are not used in the training data. In this paper, we propose the first work on cell library-based pin accessibility prediction (PAP) using active learning techniques. A given set of standard cell libraries serves as the only input for model training. Unlike most existing studies, which aim at design-specific training, we propose a library-based model that can be applied to all designs referencing the same standard cell library set. Experimental results show that the proposed model
can be applied to predict two different designs with different reference library sets.
The numbers of remaining DRVs and M2 shorts in the designs optimized by the proposed model are also much lower than those achieved by design-specific models.
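Independently of the paper's specific features and labeling flow, the sketch below shows the general pool-based active learning loop it builds on: train on the labeled pool, query the most uncertain samples, and repeat. The features, oracle, classifier, and query budget are placeholders.

```python
# Generic pool-based active learning with uncertainty sampling (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 16))                         # placeholder features
y_pool = (X_pool[:, 0] + 0.5 * X_pool[:, 1] > 0).astype(int)   # stand-in labeling "oracle"

# seed the labeled set with a few samples from each class
labeled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])
for _ in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    uncertainty = np.abs(clf.predict_proba(X_pool)[:, 1] - 0.5)   # 0 = least certain
    queries = [i for i in np.argsort(uncertainty) if i not in labeled][:10]
    labeled.extend(queries)                                 # "label" the queried samples

print("accuracy on the full pool:", clf.score(X_pool, y_pool))
```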
SESSION: Keynote 2
Session details: Keynote 2
Physical Design for 3D Chiplets and System Integration
The convergence of 5G and Artificial Intelligence (AI) that covers the gamut from
cloud data centers through network routers to edge applications is poised to open
possibilities beyond our imagination and transform how we will go about our daily
lives. As the foundational technology supporting 5G and AI innovation, semiconductors
strive for greater system performance and broader bandwidth, while increasing functionality
and lowering cost. In response, device innovation is transitioning from SoCs to 3D
chiplets that combine advanced wafer-level system integration (WLSI) technologies
such as CoWoS® (Chip on Wafer on Substrate), Integrated Fan-Out (InFO), Wafer-on-Wafer
(WoW) and System-on-Integrated-Chips (SoIC), to enable system integration that meets
these demands. Designing 3D chiplets and housing various chips on wafer-level for
system integration creates a whole new set of challenges. These start with design
partitioning and include handling interfaces between or passing through chips, design
for testing (DFT), thermal dissipation, databases and tools integration for chip and
packaging design, new IO/ESD (electrostatic discharge), simulation run time and tool
capacity, among others. Considering current capabilities and constraints, divide-and-conquer
remains the most feasible approach for 3D chiplet design and packaging. Chiplet design
needs to integrate databases and tools with packaging environments for both verification
and optimization. Leveraging existing 2D physical design solutions and chip-level
abstraction can help meet 3D verification and optimization requirements. The IC industry
also needs more innovation in DFT and thermal dissipation, especially the latter. Thermal optimization is critical to 3D chiplets and system integration. The current thermal solution covers only thermal analysis and system-level thermal dissipation. It should instead start at the IP level and extend across the chip design process, i.e., thermal-aware 3D IC design covering IPs, macros, and transistors. This speech will address these and other challenges, then propose physical design solutions for 3D chiplets and system integration.
CCS CONCEPTS: VLSI design, 3D integrated circuits, VLSI system specification and constraints, and VLSI packaging
KEYWORDS: Physical design, 3D chiplets and system integration, thermal optimization
BIOGRAPHY: Dr. Cliff Hou was appointed Vice President of Research and Development at Taiwan Semiconductor Manufacturing Co. Ltd. (TSMC) in 2011. Since 1999, he has worked to establish node-specific reference flows from 0.13μm to today’s leading-edge 3nm at TSMC. Dr. Hou also led TSMC’s in-house IP development teams from 2008 to 2010. He is now spearheading TSMC’s efforts to build total platform solutions for the industry’s high-growth markets in Mobile, IoT, Automotive, and High-Performance Computing. Dr. Hou holds 44 U.S. patents and serves as a member of the Board of Directors of Global Unichip Corp. He received his B.S. degree in Control Engineering from Taiwan’s National Chiao-Tung University and his Ph.D. in Electrical and Computer Engineering from Syracuse University.
SESSION: Session 4: Circuit Design and Security
Session details: Session 4: Circuit Design and Security
Hardware Security For and Beyond CMOS Technology: An Overview on Fundamentals, Applications, and Challenges
As with most aspects of electronic systems and integrated circuits, hardware security
has traditionally evolved around the dominant CMOS technology. However, with the rise
of various emerging technologies, whose main purpose is to overcome the fundamental scaling and power-consumption limitations of CMOS technology, unique opportunities also arise to advance the notion of hardware security. In this paper, I first provide
an overview on hardware security in general. Next, I review selected emerging technologies,
namely (i) spintronics, (ii) memristors, (iii) carbon nanotubes and related transistors,
(iv) nanowires and related transistors, and (v) 3D and 2.5D integration. I then discuss
their application to advance hardware security and also outline related challenges.
Design Optimization by Fine-grained Interleaving of Local Netlist Transformations
in Lagrangian Relaxation
Design optimization modifies a netlist with the goal of satisfying the timing constraints
at the minimum area and leakage power, without violating any slew or load capacitance
constraints. Lagrangian relaxation (LR) based optimization has been established as
a viable approach for this. We extend LR-based optimization by interleaving in each
iteration techniques such as: gate and flip-flop sizing; buffering to fix late and
early timing violations; pin swapping; and useful clock skew. Locally optimal decisions
are made using LR-based cost functions, without the need for incremental timing updates.
Sub-steps are applied in a balanced manner, accounting for the expected savings and
any conflicting timing violations, maximizing the final quality of results under multiple
process/operating corners with a reasonable runtime. Experimental results show that
our approach achieves better timing as well as lower area and leakage power than the winner of the TAU 2019 contest on its benchmarks.
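As a hedged illustration of how an LR-based cost function drives a local decision, the toy below picks a gate size by minimizing area plus multiplier-weighted delay; the cell library, load model, and Lagrange multipliers are invented, not the paper's.

```python
# Toy Lagrangian-relaxation cost: minimize area + sum(lambda_arc * delay_arc).
sizes = {                       # candidate cells: area and delay = a + b * load (invented)
    "INV_X1": {"area": 1.0, "a": 20.0, "b": 8.0},
    "INV_X2": {"area": 2.0, "a": 15.0, "b": 4.0},
    "INV_X4": {"area": 4.0, "a": 12.0, "b": 2.0},
}

def lr_cost(cell, load, lambdas):
    delay = cell["a"] + cell["b"] * load
    # each Lagrange multiplier prices one timing arc through this gate
    return cell["area"] + sum(lam * delay for lam in lambdas)

def best_size(load, lambdas):
    return min(sizes, key=lambda name: lr_cost(sizes[name], load, lambdas))

print(best_size(load=3.0, lambdas=[0.01]))       # timing is cheap -> INV_X1 wins
print(best_size(load=3.0, lambdas=[0.5, 0.4]))   # critical arcs -> INV_X4 wins
```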
Selective Sensor Placement for Cost-Effective Online Aging Monitoring and Resilience
Aggressive technology scaling trends, such as thinner gate oxide without proportional
downscaling of supply voltage, aggravate the aging impact and thus necessitate an
aging-aware reliability verification and optimization framework during early design
stages. In this paper, we propose a novel in-situ sensing strategy based on deploying
transition detectors (TDs) for on-chip aging monitoring and resilience. The proposed TD/sensor placement problem is transformed into a set cover problem, formulated as maximum satisfiability, and then solved efficiently. Experimental results show
that, by introducing at most 2.2% area overhead (for TD/sensor placement), the aging
behavior of a target circuit can be effectively monitored, and the correctness of
its functionality can be perfectly guaranteed with an average of 77% aging resilience
achieved. In other words, with 2.2% area overhead, potential aging-induced timing
errors can be detected and then eliminated, while achieving 77% recovery from aging-induced
performance degradation.
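The paper formulates the covering problem as maximum satisfiability and solves it exactly; for intuition about the set-cover view, the sketch below runs the classic greedy approximation on invented sensor sites and aging-critical paths.

```python
# Greedy set cover: pick sensor sites until every aging-critical path is observed.
def greedy_cover(paths, site_covers):
    """site_covers: candidate sensor site -> set of path ids it observes."""
    uncovered = set(paths)
    chosen = []
    while uncovered:
        # pick the site covering the most still-uncovered paths
        site = max(site_covers, key=lambda s: len(site_covers[s] & uncovered))
        gained = site_covers[site] & uncovered
        if not gained:
            raise ValueError("remaining paths cannot be covered")
        chosen.append(site)
        uncovered -= gained
    return chosen

paths = {"p1", "p2", "p3", "p4"}                 # hypothetical aging-critical paths
site_covers = {
    "n17": {"p1", "p2"},
    "n42": {"p2", "p3", "p4"},
    "n63": {"p1", "p4"},
}
print(greedy_cover(paths, site_covers))          # e.g. ['n42', 'n17']
```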
SESSION: Session 5: Timing and Clocking
Session details: Session 5: Timing and Clocking
Synthesis of Clock Networks with a Mode Reconfigurable Topology and No Short Circuit
Current
Circuits deployed in the Internet of Things operate in low and high performance modes
to cater to variable frequency and power requirements. Consequently, the clock networks
for such circuits must be synthesized meeting drastically different timing constraints
under variations in the different modes. The overall power consumption and robustness
to variations of a clock network are determined by the topology. However, state-of-the-art clock networks use the same topology in every mode, even though the timing constraints in the low and high performance modes are very different. In this paper, we propose
a clock network with a mode reconfigurable topology (MRT) for circuits with positive-edge
triggered sequential elements. In high performance modes, the required robustness
to variations is provided by reconfiguring the MRT structure into a near-tree. In
low performance modes, the MRT structure is reconfigured into a tree to save power.
Non-tree (or near-tree) structures provide robustness to variations by appropriately
constructing multiple alternative paths from the clock source to the clock sinks,
which neutralizes the negative impact of variations. In MRT structures, OR-gates are
used to join multiple alternative paths into a single path. Consequently, the MRT
structures consume no short circuit power because there is only one gate driving each
net. Moreover, it is straightforward to reconfigure MRT structures into a tree by
gating the clock signal in part of the structure. Compared with state-of-the-art near-tree
structures, MRT structures have 8% lower power consumption and similar robustness
to variations in high performance modes. In low performance modes, the power consumption
is 16% smaller when reconfiguration is used.
Timing Driven Partition for Multi-FPGA Systems with TDM Awareness
Multi-FPGA systems are a popular approach to hardware acceleration, with the scalability to accommodate large designs. To overcome the connectivity constraint between each pair of FPGAs, time-division multiplexing (TDM) is adopted at the expense of additional delay, which dominates the performance of multi-FPGA-based emulators. To the best of our knowledge, there is no prior work on partitioning for multi-FPGA systems that considers the hardware configuration and the impact of TDM. This work proposes a partitioning methodology to improve timing performance for multi-FPGA systems. The delay introduced by TDM is estimated and optimized using a look-up table for better efficiency. Our experimental results show a 43% improvement in maximum delay when considering both the hardware configuration and the impact of TDM, compared with a cut-driven partitioning approach.
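The look-up-table delay estimation might be approximated as below: the TDM ratio needed to fit the cut signals on the available wires is looked up, and each cut net on a path pays the corresponding delay. The table values, wire count, and netlist are invented, not the paper's model.

```python
# Back-of-the-envelope TDM delay model for a 2-FPGA cut (illustrative only).
TDM_DELAY_NS = {1: 5.0, 2: 12.0, 4: 22.0, 8: 40.0}        # assumed: TDM ratio -> added delay

def tdm_ratio(n_cut_signals, n_physical_wires):
    """Smallest supported TDM ratio that fits the cut signals on the physical wires."""
    for ratio in sorted(TDM_DELAY_NS):
        if n_physical_wires * ratio >= n_cut_signals:
            return ratio
    raise ValueError("cut too large for the available wires")

def path_delay_ns(path_nets, is_cut, n_wires, internal_delay_ns=1.0):
    """Every net on the path pays an internal delay; cut nets also pay the TDM delay."""
    cut_nets = [n for n in path_nets if is_cut[n]]
    ratio = tdm_ratio(len(cut_nets), n_wires) if cut_nets else 1
    return internal_delay_ns * len(path_nets) + TDM_DELAY_NS[ratio] * len(cut_nets)

is_cut = {"n1": False, "n2": True, "n3": True, "n4": False}   # nets crossing the two FPGAs
print(path_delay_ns(["n1", "n2", "n3", "n4"], is_cut, n_wires=1))   # ratio 2 -> 28.0 ns
```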
SESSION: Session 6: Machine Learning for Physical Design (part 2)
Session details: Session 6: Machine Learning for Physical Design (part 2)
Understanding Graphs in EDA: From Shallow to Deep Learning
As the scale of integrated circuits keeps increasing, there has been a surge of research in electronic design automation (EDA) to keep technology node scaling on track. The graph is of great significance in this technology evolution, since it is one of the most natural abstractions for many fundamental objects in EDA, such as netlists and layouts, and hence many EDA problems are essentially graph problems. Traditional approaches for solving these problems are mostly based on analytical
solutions or heuristic algorithms, which require substantial efforts in designing
and tuning. With the emergence of learning techniques, dealing with graph problems
with machine learning or deep learning has become a potential way to further improve
the quality of solutions. In this paper, we discuss a set of key techniques for conducting
machine learning on graphs. Particularly, a few challenges in applying graph learning
to EDA applications are highlighted. Furthermore, two case studies are presented to
demonstrate the potential of graph learning on EDA applications.
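As a minimal concrete example of the graph learning techniques surveyed here, the snippet below applies one graph-convolution (message passing) layer to a tiny netlist-like graph using untrained random weights; the graph, features, and weights are placeholders.

```python
# One Kipf-and-Welling-style graph convolution layer: H' = ReLU(D^-1/2 Â D^-1/2 X W).
import numpy as np

A = np.array([[0, 1, 1, 0],                       # adjacency of 4 cells/nets (toy)
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)                                     # one-hot node features
A_hat = A + np.eye(4)                             # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
W = np.random.default_rng(0).normal(size=(4, 2))  # layer weights (untrained)

H = np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)   # aggregated node embeddings
print(H.round(3))                                 # 2-d embedding per node
```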
TEMPO: Fast Mask Topography Effect Modeling with Deep Learning
With the continuous shrinking of the semiconductor device dimensions, mask topography
effects stand out among the major factors influencing the lithography process. Including
these effects in the lithography optimization procedure has become necessary for advanced
technology nodes. However, conventional rigorous simulation for mask topography effects
is extremely computationally expensive for high accuracy. In this work, we propose
TEMPO as a novel generative learning-based framework for efficient and accurate 3D
aerial image prediction. At its core, TEMPO comprises a generative adversarial network
capable of predicting aerial image intensity at different resist heights. Compared
to the default approach of building a unique model for each desired height, TEMPO
takes as one of its inputs the desired height to produce the corresponding aerial
image. In this way, the global model in TEMPO can capture the shared behavior among
different heights, thus resulting in a smaller model size. Moreover, across-height information sharing results in better model accuracy and generalization capability. Our experimental
results demonstrate that TEMPO can obtain up to 1170x speedup compared with rigorous
simulation while achieving satisfactory accuracy.
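The sketch below captures only the height-conditioning idea described above, using an arbitrary small convolutional generator that takes the desired resist height as an extra input plane; it is not TEMPO's actual architecture or training setup.

```python
# Height-conditioned generator sketch (not TEMPO's real architecture).
import torch
import torch.nn as nn

class HeightConditionedGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),     # channel 0: mask, channel 1: height
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),                            # predicted aerial image intensity
        )

    def forward(self, mask, height):
        # broadcast the scalar resist height into a constant feature plane
        h_plane = height.view(-1, 1, 1, 1).expand(-1, 1, *mask.shape[-2:])
        return self.net(torch.cat([mask, h_plane], dim=1))

gen = HeightConditionedGenerator()
mask = torch.rand(2, 1, 64, 64)                  # fake mask clips
height = torch.tensor([0.25, 0.75])              # normalized resist heights (assumed)
print(gen(mask, height).shape)                   # torch.Size([2, 1, 64, 64])
```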
DRC Hotspot Prediction at Sub-10nm Process Nodes Using Customized Convolutional Network
As semiconductor process technology advances into the sub-10nm regime, cell pin accessibility, which is a complex joint effect of the pin shape and nearby blockages, becomes a main cause of DRC violations. Therefore, a machine learning model for DRC hotspot
prediction needs to consider both very high-resolution pin shape patterns and low-resolution
layout information as input features. A new convolutional neural network technique,
J-Net, is introduced for the prediction with mixed resolution features. This is a
customized architecture that is flexible for handling various input and output resolution
requirements. It can be applied at placement stage without using global routing information.
This technique is evaluated on 12 industrial designs at 7nm technology node. The results
show that it can improve true positive rate by 37%, 40% and 14% respectively, compared
to three recent works, with similar false positive rates.
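J-Net itself is not described in enough detail here to reproduce, so the sketch below only illustrates the mixed-resolution idea: a high-resolution pin-shape branch is downsampled to the coarse layout grid and fused with low-resolution features before predicting per-tile hotspot logits. All layer sizes are assumptions.

```python
# Two-branch mixed-resolution sketch (not the actual J-Net architecture).
import torch
import torch.nn as nn

class MixedResolutionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hi = nn.Sequential(                         # 8x downsample of pin-shape pixels
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.lo = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 1, 1)                  # hotspot logit per coarse tile

    def forward(self, pin_img, layout_feat):
        return self.head(torch.cat([self.hi(pin_img), self.lo(layout_feat)], dim=1))

net = MixedResolutionNet()
pin_img = torch.rand(1, 1, 256, 256)                     # fine pin-shape raster
layout_feat = torch.rand(1, 4, 32, 32)                   # coarse placement/density maps
print(net(pin_img, layout_feat).shape)                   # torch.Size([1, 1, 32, 32])
```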
SESSION: Keynote 3
Session details: Keynote 3
Physical Verification at Advanced Technology Nodes and the Road Ahead
In spite of “doomsday” expectations, Moore’s Law is alive and well. Semiconductor
manufacturing and design companies, as well as the Electronic Design Automation (EDA)
industry, have been pushing ahead to deliver more functionality while satisfying more aggressive
space/power/performance requirements.
Physical verification occupies a unique space in the ecosystem as one of the key bridges
between design and manufacturing. As such, the traditional space of design rule checking
(DRC) and layout versus schematic (LVS) has expanded into electrical verification
and yield enabling technologies such as optical proximity correction, critical area
analysis, multi-patterning decomposition and automated filling.
To achieve the expected accuracy and performance demanded by the design and manufacturing
community, it is necessary to consider the physical effects of the manufacturing processes
and electronic devices and to use the most advanced software engineering technology
and computational capabilities.
SESSION: Session 8: ISPD 2020 Contest Results and Poster Presentations
Session details: Session 8: ISPD 2020 Contest Results and Poster Presentations
ISPD 2020 Physical Mapping of Neural Networks on a Wafer-Scale Deep Learning Accelerator
This paper introduces a special case of the floorplanning problem for optimizing neural
networks to run on a wafer-scale computing engine. From a compute perspective, neural
networks can be represented by a deeply layered structure of compute kernels. During
the training of a neural network, gradient descent is used to determine the weight
factors. Each layer then uses a local weight tensor to transform “activations” and
“gradients” that are shared among connected kernels according to the topology of the
network. This process is computationally intensive and requires high memory and communication
bandwidth. Cerebras has developed a novel computer system designed for this work that
is powered by a 21.5cm by 21.5cm wafer-scale processor with 400,000 programmable compute
cores. It is structured as a regular array of 633 by 633 processing elements, each
with its own local high bandwidth SRAM memory and direct high bandwidth connection
to its neighboring cores. In addition to supporting traditional execution models for
neural network training and inference, this engine has a unique capability to compile
and compute every layer of a complete neural network simultaneously. Mapping a neural
network in this fashion onto Cerebras’ Wafer-Scale Engine (WSE) is reminiscent of
the traditional floorplanning problem in physical design. A kernel ends up as a rectangle
of x by y compute elements. These are the flexible blocks that need to be placed to
optimize performance. This paper describes an ISPD 2020 challenge to develop algorithms
and heuristics that produce compiled neural networks that achieve the highest possible
performance on the Cerebras WSE.
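For a feel of the floorplanning view (not the contest's actual evaluation or constraints), the toy below shelf-packs kernel rectangles onto the 633-by-633 fabric and scores a crude Manhattan communication cost between connected kernels; the kernel sizes and connectivity are invented.

```python
# Toy greedy "shelf" packing of kernel rectangles on the wafer-scale fabric.
FABRIC = 633                                       # processing elements per side

def shelf_pack(kernels):
    """kernels: name -> (w, h) in compute elements; returns name -> (x, y) origin."""
    pos, x, y, shelf_h = {}, 0, 0, 0
    for name, (w, h) in kernels.items():
        if x + w > FABRIC:                         # start a new shelf
            x, y, shelf_h = 0, y + shelf_h, 0
        if y + h > FABRIC:
            raise ValueError("does not fit on the wafer")
        pos[name] = (x, y)
        x, shelf_h = x + w, max(shelf_h, h)
    return pos

def comm_cost(pos, kernels, edges):
    """Manhattan distance between kernel centers, summed over connected kernels."""
    def center(n):
        (x, y), (w, h) = pos[n], kernels[n]
        return (x + w / 2, y + h / 2)
    return sum(abs(center(a)[0] - center(b)[0]) + abs(center(a)[1] - center(b)[1])
               for a, b in edges)

kernels = {"conv1": (200, 150), "conv2": (300, 200), "fc": (100, 100)}
edges = [("conv1", "conv2"), ("conv2", "fc")]      # layer connectivity (toy)
pos = shelf_pack(kernels)
print(pos, comm_cost(pos, kernels, edges))
```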