News

Remembering Arvind

From Friends, Colleagues, and Students of Arvind

It is with a heavy heart that we write to share the news that on June 17th, 2024, we lost our beloved colleague, mentor, and friend Arvind. A...


Call for SIGDA Newsletter Editor-in-Chief

ACM SIGDA announces the call for Editor-in-Chief for the SIGDA Newsletter, a monthly publication for news and event information in the design automation area. The Editor-in-Chief, along with the e...


IEEE/ACM A. Richard Newton Technical Impact Award in Electronic Design Automation

The IEEE/ACM A. Richard Newton Technical Impact Award in Electronic Design Automation was conferred at DAC 2023 upon Moshe Vardi and Pierre Wolper for their research work “An Automata-Theoretic Ap...


Highlights of CADAthlon Brazil 2023

The CADAthlon Brazil 2023 – 3rd Brazilian Programming Contest for Design Automation of Integrated Circuits (https://csbc.sbc.org.br/2023/cadathlon-brasil-en/) took place on August 8th in João Pess...


Prof. Ron Rohrer receives ACM SIGDA Pioneering Achievement Award @ DAC 2023

https://www.dac.com/About/Conference-Archive/60th-DAC-2023/Awards-2023

For the introduction and evolution of simulation and analysis techniques that have supported the design and test of inte...

Events

UDemo@DAC

UDDAC 2025: 35th ACM SIGDA / IEEE CEDA University Demonstration at Design Automation Conference

ACM SIGDA/IEEE CEDA University Demonstration (UD, previously University Booth) is an excellent op...


LIVE

SIGDA Live is a series of webinars, launched monthly or bi-monthly, on topics (either technical or non-technical) of general interest to the SIGDA community. The talks in general fall on the last ...


CADathlon@ICCAD

Sunday, Oct. 29, 2023  08:00 AM – 05:00 PM, In-Person

Gallery Ballroom, The Hyatt Regency San Francisco Downtown SoMa, San Francisco, CA, USA

Welcome to CADathlon@ICCAD

CADathlon...


HACK@DAC

HACK@DAC is a hardware security challenge contest, co-located with the Design Automation Conference (DAC), for finding and exploiting security-critical vulnerabilities in hardware and firmware...


SDC@DAC

System Design Contest at DAC 2023

The DAC System Design Contest focuses on object detection and classification on an embedded GPU or FPGA system. Contestants will receive a training dataset pro...


Awards

Pioneer

2025 ACM SIGDA Pioneering Achievement Award

Call for Nominations

Description: To honor a person for lifetime, outstanding contributions within the scope of electronic design automation, as evidenced by ideas pioneered in publications, industrial products, or other relevant contributions. The award is based on the impact of the contributions throughout the nominee’s lifetime.

Eligibility: Open to researchers in the field of electronic design automation who have made outstanding contributions to the field during their lifetime. Current members of the Board of ACM SIGDA and members of the Award Selection Committee are ineligible for the award. The awardee is usually invited to give a lecture at ICCAD.

Award Items: A plaque for the awardee, a citation, and a $1000 honorarium. The honorarium will be funded by the SIGDA annual budget.

Nominee Solicitation: The call for nominees will be published by email to members of SIGDA, on the website of ACM SIGDA, and in the SIGDA newsletter. The nomination should be proposed by someone other than the nominee. The nomination materials should be emailed to sigda.acm@gmail.com (Subject: ACM SIGDA Pioneering Achievement Award). Nominations for the award should include:

  • A nomination letter that gives: a 100-word description of the nominee’s contribution and its impact; a 750-word detailed description of up to 10 of the nominee’s major products (papers, patents, software, etc.), the contributions embodied in those products, and their impact; a list of at most 10 citations to the major products discussed in the description.
  • Up to three letters of recommendation (not including the nominator or nominee).
  • Contact information of the nominator.

In addition to the evidence of impact, the nomination package will include biographical information (including education and employment), professional activities, publications, and recognition. Up to three endorsements attesting to the impact of the work may be included.

Award Committee:

Wanli Chang (Chair)

Alberto Sangiovanni-Vincentelli (UC Berkeley)

Giovanni De Micheli (EPFL)

John Hayes (University of Michigan at Ann Arbor)

Jiang Hu (TAMU)

All standard conflict of interest regulations as stated in ACM policy will be applied. Any awards committee members will recuse themselves from consideration of any candidates where a conflict of interest may exist.

Schedule: The submission deadline for the 2025 Award is 31 July 2025.

Selection/Basis for Judging: This award honors an individual who has made an outstanding technical contribution in the scope of electronic design automation throughout his or her lifetime. The award is based on the impact of the contributions as indicated above. Nominees from universities, industry, and government worldwide are encouraged and will be considered. The award is not a best paper or initial original contribution award. Instead, it is intended for outstanding contributions within the scope of electronic design automation throughout the nominee’s lifetime.

Presentation: The award is planned to be presented annually at DAC as well as the SIGDA Annual Member Meeting and Dinner at ICCAD.

2024: John Darringer, IBM
2022: Ron Rohrer, SMU, CMU
For the introduction and evolution of simulation and analysis techniques that have supported the design and test of integrated circuits and systems for more than half a century.
2021: Prof. Rob Rutenbar, PITT
For his pioneering work and extraordinary leadership in analog design automation and general EDA education.
2020: Prof. Jacob A. Abraham, UT Austin
For pioneering and fundamental contributions to manufacturing testing and fault-tolerant operation of computing systems.
2019: Prof. Giovanni De Micheli, EPFL
For pioneering and fundamental contributions to synthesis and optimization of integrated circuits and networks-on-chip.
2018: Prof. Alberto Sangiovanni-Vincentelli, UC Berkeley
For pioneering and fundamental contributions to design automation research and industry, in system-level design, embedded systems, logic synthesis, physical design and circuit simulation.
2017: Prof. Mary Jane Irwin, Pennsylvania State University
For contributions to VLSI architectures, electronic design automation and community membership.
2016: Prof. Chung Laung (Dave) Liu, National Tsing Hua University, Taiwan (emeritus)
For the fundamental and seminal contributions to physical design and embedded systems.
2014: Prof. John P. Hayes, University of Michigan
2013: Prof. Donald E. Thomas, Carnegie Mellon University
For his pioneering work in making the Verilog Hardware Description Language more accessible for the design automation community and allowing for faster and easier pathways to simulation, high-level synthesis, and co-design of hardware-software systems.
2012: Dr. Louise Trevillyan, IBM
Recognizing her almost-40-year career in EDA and her groundbreaking research contributions in logic and physical synthesis, design verification, high-level synthesis, processor performance analysis, and compiler technology.
2011: Prof. Robert K. Brayton, UC Berkeley
For outstanding contributions to the field of Computer Aided Design of integrated systems over the last several decades.
2010: Prof. Scott Kirkpatrick, The Hebrew University of Jerusalem
On Solving Hard Problems by Analogy
Automated electronic design is not the only field in which surprising analogies from other fields of science have been used to deal with the challenges of very large problem sizes, requiring optimization across multiple scales, with constraints which eliminate any elegant solutions. Similar opportunities arise, for example, in logistics, in scheduling, in portfolio optimization and other classic problems. The common ingredient in all of these is that the problems are fundamentally frustrated, in that conflicting objectives must be traded off at all scales. This, plus the irregular structure in such real world problems eliminates any easy routes to the best solutions. Of course, in engineering, the real objective is not a global optimum, but a solution that is “good enough” and can be obtained “soon enough” to be useful. The model in materials science that gave rise by analogy to simulated annealing is the spin glass, which recently surfaced again in computer science as a vehicle whose inherent complexity might answer the long-vexing question of whether P can be proved not equal to NP.
2009: Prof. Martin Davis, NYU
For his fundamental contributions to algorithms for solving the Boolean Satisfiability problem, which heavily influenced modern tools for hardware and software verification, as well as logic circuit synthesis.
2008: Prof. Edward J. McCluskey, Stanford
For his outstanding contributions to the areas of CAD, test and reliable computing during the past half century.
2007: Dr. Gene M. Amdahl, Amdahl Corporation
Award citation: For his outstanding contributions to the computing industry on the occasion of the 40th anniversary of Amdahl’s Law.
Video of Dr. Amdahl’s dinner talk and a panel debate are available on the ACM digital library.

ONFA

SIGDA Outstanding New Faculty Award

Call for Applications

The ACM SIGDA Outstanding New Faculty Award (ONFA) recognizes a junior faculty member early in her or his academic career who demonstrates outstanding potential as an educator and/or researcher in the field of electronic design automation. While prior research and/or teaching accomplishments are important, the selection committee will especially consider the impact that the candidate has had on her or his department and on the EDA field during the initial years of their academic appointment. The 2025 award, consisting of a USD $1,000 cash prize to the faculty member along with a plaque and a citation, will be presented at ICCAD 2025.

Eligibility: Outstanding new faculty who are developing academic careers in areas related to electronic design automation are encouraged to apply for this award. Note that this award is not intended for senior or highly experienced investigators who have already established independent research careers, even if they are new to academia. Candidates must have recently completed at least one full academic year and no more than four and a half full academic years in a tenure-track position. Applications will also be considered from people whose appointments are continuing (non-visiting) positions with substantial educational responsibilities, regardless of whether they are tenure track. Persons holding research-only positions are not eligible. Exceptions to the timing requirements will be made for persons who have interrupted their academic careers for substantive reasons, such as family or medical leave. The presence of such reasons must be attested by the sponsoring institution, but no explanation is needed.

Deadline for the 2025 Award: 31 May 2025

Application: Candidates applying for the award must submit the following to the selection committee:

  1. a 2-page statement summarizing the candidate’s teaching and research accomplishments since beginning their current academic position, as well as an indication of plans for further development over the next five years;
  2. a copy of a current curriculum vitae;
  3. a letter from either the candidate’s department chair or dean endorsing the application.

The nomination materials should be emailed by the deadline to sigda.acm@gmail.com (Subject: ACM SIGDA Outstanding New Faculty Award). Endorsement letters may be sent separately.

Award Committee:

Ron Duncan (Synopsys)
Tsung-Yi Ho (CUHK)
Ambar Sarkar (Nvidia)
Chengmo Yang (Delaware)
Dirk Ziegenbein (Bosch)

All standard conflict of interest regulations as stated in ACM policy will be applied. Any awards committee members will recuse themselves from consideration of any candidates where a conflict of interest may exist.
 

Past Awardees

2024: Bonan Yan, Peking University
2023: Tsung-Wei Huang, University of Utah
2022: Yingyan (Celine) Lin, Rice University
2021: Zheng Zhang, UC Santa Barbara
2020: Pierre-Emmanuel Gaillardon, University of Utah
2019: Jeyavijayan (JV) Rajendran, Texas A&M University
2018: Shimeng Yu, Arizona State University
2017: Yier Jin, University of Florida
2016: Swaroop Ghosh, University of South Florida
2015: Muhammad Shafique, Karlsruhe Institute of Technology
2014: Yiran Chen, University of Pittsburgh
2013: Shobha Vasudevan, UIUC
2012: David Atienza, EPFL, Switzerland
2011: Farinaz Koushanfar, Rice University
2010: Puneet Gupta, UCLA
Deming Chen, UIUC
2009: Yu Cao, Arizona State University
2008: Subhasish Mitra, Stanford University
2007: Michael Orshansky, University of Texas, Austin
2006: David Pan, University of Texas, Austin
2004: Kaustav Banerjee, University of California, Santa Barbara
Igor Markov, University of Michigan, Ann Arbor
2003: Dennis Sylvester, University of Michigan, Ann Arbor
2002: Charlie Chung-Ping Chen, Univ. of Wisconsin, Madison
2000: Vijay Narayanan, Penn State University

OPDA

2025 ACM Outstanding Ph.D. Dissertation Award in Electronic Design Automation

Call for Nominations

Design automation has gained widespread acceptance by the VLSI circuits and systems design community. Advancement in computer-aided design (CAD) methodologies, algorithms, and tools has become increasingly important to cope with the rapidly growing design complexity, higher performance and low-power requirements, and shorter time-to-market demands. To encourage innovative, ground-breaking research in the area of electronic design automation, the ACM’s Special Interest Group on Design Automation (SIGDA) has established an ACM award to be given each year to an outstanding Ph.D. dissertation that makes the most substantial contribution to the theory and/or application in the field of electronic design automation.

The award consists of a plaque and an honorarium of USD $1,000. The 2025 Award will be presented at ICCAD 2025 in November 2025. The awardee is selected by a committee of experts from academia and industry in the field, appointed by ACM in consultation with the SIGDA Chair.

Deadline for the 2025 Award: 30 April 2025

Eligibility and nomination requirements: For the 2025 Award, the nominated dissertation should date between 1 July 2023 and 31 December 2024. Each nomination package should consist of:

  • The PDF file of the Ph.D. dissertation in the English language;
  • A statement (up to two pages) from the nominee explaining the significance and major contributions of the work;
  • A nomination letter from the nominee’s advisor or department chair or dean of the school endorsing the application;
  • Optionally, up to three letters of recommendation from experts in the field.

The nomination materials should be emailed to sigda.acm@gmail.com (Subject: ACM Outstanding Ph.D. Dissertation Award in EDA). Recommendation letters may be sent separately.

Award Committee:

Ismail Bustany (AMD)

Mustafa Badaroglu (Qualcomm)

Jingtong Hu (Pittsburgh)

Sharad Malik (Princeton)

Mark Ren (Nvidia)

Aviral Shrivastava (ASU)

Linghao Song (Yale)

Peh Li Shiuan (NUS)

Natarajan Viswanathan (Cadence)

Robert Wille (TUM)

All standard conflict of interest regulations as stated in ACM policy will be applied. Any award committee members will recuse themselves from consideration of any candidates where a conflict of interest may exist.
 

Past Awardees

2024: Lukas Burgholzer, for the dissertation “Design Automation Tools and Software for Quantum Computing,” Johannes Kepler University Linz. Advisors: Robert Wille and Jens Eisert.
2023: Zhiyao Xie, for the dissertation “Intelligent Circuit Design and Implementation with Machine Learning,” Duke University. Advisors: Yiran Chen and Hai Li.
2022: Ganapati Bhat, for the dissertation “Design, Optimization, and Applications of Wearable IoT Devices,” Arizona State University. Advisor: Umit Y. Ogras.
2021: Ahmedullah Aziz, for the dissertation “Device-Circuit Co-Design Employing Phase Transition Materials for Low power Electronics,” Purdue University. Advisor: Sumeet Gupta.
2020: Gengjie Chen, for the dissertation “VLSI Routing: Seeing Nano Tree in Giga Forest,” The Chinese University of Hong Kong. Advisor: Evangeline Young.
2019: Tsung-Wei Huang, for the dissertation “Distributed Timing Analysis,” University of Illinois, Urbana-Champaign. Advisor: Martin D. F. Wong.
2018: Xiaoqing Xu, for the dissertation “Standard Cell Optimization and Physical Design in Advanced Technology Nodes,” University of Texas at Austin. Advisor: David Z. Pan.
Pramod Subramanyan, for the dissertation “Deriving Abstractions to Address Hardware Platform Security Challenges,” Princeton University. Advisor: Sharad Malik.
2017: Jeyavijayan Rajendran, for the dissertation “Trustworthy Integrated Circuit Design,” New York University. Advisor: Ramesh Karri.
2016: Zheng Zhang, for the dissertation “Uncertainty Quantification for Integrated Circuits and Microelectromechanical Systems,” Massachusetts Institute of Technology. Advisor: Luca Daniel.
2015: Wenchao Li, for the dissertation “Specification Mining: New Formalisms, Algorithms and Applications,” University of California at Berkeley. Advisor: Sanjit Seshia.
2014: Wangyang Zhang, for the dissertation “IC Spatial Variation Modeling: Algorithms and Applications,” Carnegie Mellon University. Advisors: Xin Li and Rob Rutenbar.
2013: Duo Ding, for the dissertation “CAD for Nanolithography and Nanophotonics,” University of Texas at Austin. Advisor: David Z. Pan.
Guojie Luo, for the dissertation “Placement and Design Planning for 3D integrated Circuits,” UCLA. Advisor: Jason Cong.
2012: Tan Yan, for the dissertation “Algorithmic Studies on PCB Routing,” defended at the University of Illinois at Urbana-Champaign.
2011: Nishant Patil, for the dissertation “Design and Fabrication of Imperfection-Immune Carbon Nanotube Digital VLSI Circuits,” Stanford University.
2010: Himanshu Jain, for the dissertation “Verification using Satisfiability Checking, Predicate Abstraction, and Craig Interpolation,” Carnegie Mellon University.
2009: Kai-Hui Chang, for the dissertation “Functional Design Error Diagnosis, Correction and Layout Repair of Digital Circuits,” University of Michigan at Ann Arbor.
2008: (No award was given this year)
2007: (No award was given this year)
2006: Haifeng Qian of University of Minnesota, Minneapolis, Department of Electrical and Computer Engineering, for the thesis entitled “Stochastic and Hybrid Linear Equation Solvers and their Applications in VLSI Design Automation.”
2005: Shuvendu Lahiri of Carnegie Mellon University, Department of Electrical and Computer Engineering, for a thesis entitled “Unbounded System Verification using Decision Procedure and Predicate Abstraction.”
2004: Chao Wang of University of Colorado at Boulder, Department of Electrical Engineering, for a thesis entitled “Abstraction Refinement for Large Scale Model Checking.”
2003: Luca Daniel of University of California, Berkeley, Department of Electrical Engineering and Computer Science, for a thesis entitled “Simulation and modeling techniques for signal integrity and electromagnetic interference on high frequency electronic systems.”
Lintao Zhang of Princeton University, Department of Electrical Engineering, for a thesis entitled “Searching for truth: techniques for satisfiability of Boolean formulas.”
2002: (No award was given this year)
2001: Darko Kirovski from University of California, Los Angeles, Department of Computer Science, for a thesis entitled “Constraint Manipulation Techniques for Synthesis and Verification of Embedded Systems.” The runner-up who received an honorable mention in that year’s ceremony was Michael Beattie of Carnegie Mellon University, Department of Electrical and Computer Engineering, for a thesis entitled “Efficient Electromagnetic Modeling for Giga-scale IC Interconnect.”
2000: Robert Brent Jones of Stanford University, Department of Electrical Engineering, for a thesis entitled “Applications of Symbolic Simulation To the Formal Verification of Microprocessors.”

Newton

ACM/IEEE A. Richard Newton Technical Impact Award in Electronic Design Automation 2025

Call for Nominations

Description

To honor a person or persons for an outstanding technical contribution within the scope of electronic design automation, as evidenced by a paper published at least ten years before the presentation of the award (before July 2015).

Prize

USD 1500 to be shared amongst the authors and a plaque for each author.

Funding

Funded by the IEEE Council on Electronic Design Automation and ACM Special Interest Group on Design Automation.

Presentation

Presented annually at the Design Automation Conference.

Historical Background

A. Richard Newton, one of the foremost pioneers and leaders of the EDA field, passed away on 2 January 2007, of pancreatic cancer at the age of 55.

A. Richard Newton was professor and dean of the College of Engineering at the University of California, Berkeley. Newton was educated at the University of Melbourne and received his bachelor’s degree in 1973 and his master’s degree in 1975. In the early 1970s he began to work on SPICE, a simulation program initially developed by Larry Nagel and Donald Pederson to analyze and design complex electronic circuitry with speed and accuracy. In 1978, Newton earned his Ph.D. in electrical engineering and computer sciences from UC Berkeley.

For his research and entrepreneurial contributions to the electronic design automation industry, he was awarded the 2003 Phil Kaufman Award. In 2004, he was named a member of the National Academy of Engineering, and in 2006, of the American Academy of Arts and Sciences. He was a member of the Association for Computing Machinery and a fellow of the Institute of Electrical and Electronics Engineers.

Basis for Judging

The prime consideration will be the impact on technology, industry, and education, and on working designers and engineers in the field of EDA. Such impact might include a research result that inspires much innovative thinking, or that has been put into wide use in practice.

Eligibility

The paper must have passed through a peer-review process before publication, be an archived conference or journal publication available from or published by either ACM or IEEE, and be a seminal paper where an original idea was first described. Follow-up papers and extended descriptions of the work may be cited in the nomination, but the award is given for the initial original contribution.

Selection Committee

Chair: Wanli Chang

Vice-Chair: Deming Chen

Members to be announced

Nomination Deadline

21 March 2025

Nomination Package

Please send a one-page nomination letter explaining the impact of the nominated paper, evidence of the impact, biography of the nominator, at most three endorsements, and the nominated paper itself, all in one PDF file, to sigda.acm@gmail.com (Subject: 2025 A. Richard Newton Technical Impact Award in Electronic Design Automation).
 

Past Awardees

  • 2024: Mircea Stan and Wayne Burleson, “Bus-Invert Coding for Low-Power I/O”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 3, No. 1, pp. 49-58, March 1995.
  • 2023: Moshe Vardi and Pierre Wolper for their research work “An Automata-Theoretic Approach to Automatic Program Verification”, published in the proceedings of the 1st Symposium on Logic in Computer Science, 1986.
  • 2022: Ricardo Telichevesky, Kenneth S. Kundert, and Jacob K. White, “Efficient Steady-State Analysis based on Matrix-Free Krylov-Subspace Methods”, In Proc. of the 32nd Design Automation Conference, 1995.
  • 2021: John A. Waicukauski, Eric Lindbloom, Barry K. Rosen, and Vijay S. Iyengar, “Transition Fault Simulation,” IEEE Design & Test of Computers, Vol. 4, no. 2, April 1987
  • 2020: Luca Benini and Giovanni De Micheli, “Networks on Chips: A New SoC Paradigm,” IEEE Computer, pp. 70-78, January 2002.
  • 2019: E. B. Eichelberger and T. W. Williams, “A Logic Design Structure for LSI Testability,” In Proc. of the 14th Design Automation Conference, 1977.
  • 2018: Hans Eisenmann and Frank M. Johannes, “Generic Global Placement and Floorplanning,” In Proc. of the 35th Design Automation Conference, 1998.
  • 2017: Matthew W. Moskewicz, Conor F. Madigan, Ying Zhao, Lintao Zhang, and Sharad Malik, “Chaff: Engineering an Efficient SAT Solver,” In Proc. of the 38th Design Automation Conference, 2001.
  • 2016: Chandu Visweswariah, Kaushik Ravindran, Kerim Kalafala, Steven G. Walker, Sambasivan Narayan, “First-Order Incremental Block-Based Statistical Timing Analysis,” In Proc. of the 41st Design Automation Conference, 2004.
  • 2015: Blaise Gassend, Dwaine Clarke, Marten van Dijk, and Srinivas Devadas, “Silicon Physical Random Functions,” In Proceedings of the 9th ACM Conference on Computer and Communications Security (CCS), 2002.
  • 2014: Subhasish Mitra and Kee Sup Kim, “X-compact: an efficient response compaction technique for test cost reduction,” IEEE International Test Conference, 2002.
  • 2013: Keith Nabors and Jacob White, “FastCap: A multipole accelerated 3-D capacitance extraction program,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 10, Issue 11 (1991): 1447-1459.
  • 2012: Altan Odabasioglu, Mustafa Celik, Larry Pileggi, “PRIMA: Passive Reduced-Order Interconnect Macromodeling Algorithm,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Aug., 1998.
  • 2011: Jason Cong, Eugene Ding, “FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Jan., 1994.
  • 2010: Randal Bryant, “Graph-based algorithms for Boolean function manipulation,” IEEE Transactions on Computers, Aug., 1986.
  • 2009: Robert K. Brayton, Richard Rudell, Alberto Sangiovanni-Vincentelli, Albert R. Wang, “MIS: A Multiple-Level Logic Optimization System,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Nov., 1987.

Service

Service Awards

SIGDA has restructured its service awards, and will be giving two annual service awards.

  • Distinguished Service Award: The SIGDA Distinguished Service Award is given to individuals who have dedicated many years of their career in extraordinary services to promoting, leading, or creating ACM/SIGDA programs or events.
  • Meritorious Service Award: The SIGDA Meritorious Service Award is given to individuals who have performed professional services above and beyond traditional service to promoting, leading, or creating ACM/SIGDA programs or events.

In any given year, up to two Distinguished Service Awards and up to four Meritorious Service Awards will be given.

Nominations should consist of:

  • Award type being nominated.
  • Name, address, phone number and email of person making the nomination.
  • Name, affiliation, address, email, and telephone number of the nominee for whom the award is recommended.
  • A statement (between 200 and 500 words long) explaining why the nominee deserves the award. Note that the award is given for service that goes above and beyond traditional services.
  • Up to 2 additional letters of support. Include the name, affiliation, email address, and telephone number of the letter writer(s). Supporters of multiple candidates are strongly encouraged to compare the candidates in their letters.

Note that the nominator and references must be active SIGDA volunteers. The annual nomination deadline is March 15 (except in 2019, when it was May 5).

Please send all your nomination materials as one PDF file to SIGDA-Award@acm.org before the deadline.

Distinguished Service Awards

2023Tulika Mitra, National University of Singapore
For her leadership in major SIGDA conferences such asgeneral chair for ICCAD and ESWEEK“.
Patrick Groeneveld, Stanford University
For his multi-year significant contribution to the EDA community, such as DAC finance chair among many other“.
2022Vijay Narayanan, The Pennsylvania State University
For Extraordinary Dedication and Leadership to SIGDA“.
Harry Foster, Siemens EDA
For Extraordinary Dedication and Persistence in Leading DAC during Pandemic“.
2021Deming Chen, University of Illinois Urbana-Champaign
For distinguished contributions to the design automation and reconfigurable computing communities“.
Evangeline F. Y. Young, Chinese University of Hong Kong
For outstanding leadership in promoting diversity in the ACM/SIGDA community“.
2020Sri Parameswaran, University of New South Wales
“For leadership and distinguished service to the EDA community“.
2019Naehyuck Chang, Korea Advanced Institute of Science and Technology
For many years of impactful service to ACM/SIGDA in various leadership positions“.
Sudeep Pasricha, Colorado State University
For a decade of outstanding service to ACM/SIGDA in various volunteer positions
2018Chuck Alpert, Cadence Design Systems
For significant contributions to DAC”.
Jörg Henkel, Karlsruhe Institute of Technology
For leading SIGDA efforts in Europe and DATE”.
Michael ‘Mac’ McNamara, Adapt-IP
For sustained contributions to the design automation community and DAC”.
Michelle Clancy, Cayenne Communication
For sustained contributions to the community, especially DAC”.
2016Steven Levitan
In recognition of a lifetime of devoted service to ACM SIGDA and the Electronic Design Automation community.
2015Tatsuo Ohtsuki, Waseda University
Hiroto Yasuura, Kyushu University
Hidetoshi Onodera, Kyoto University
For their distinguished contributions to the Asia and South Pacific Design automation Conference (ASPDAC) as well as their many years of dedicated service on the conference’s steering committee
Massoud Pedram, University of Southern California
For his many years of service as Editor in Chief of the ACM Transactions on Design Automation of Electronic Systems (TODAES)
2014Peter Marwedel, Technical University of Dortmund
For his muiltiple years of service starting and maintaining the DATE PhD Forum
2012Joe Zamreno, Iowa State University
Baris Taskin, Drexel University
2011Peter Feldman, IBM
Radu Marculescu, CMU
Qinru Qiu, Syracuse University
Martin Wong, UIUC
Qing Wu, Air Force Rome Labs
2010Alex K. Jones
For dedicated service to ACM/SIGDA and the Design Automation Conference as director of the University Booth
Matt Guthaus
For dedicated service as director of SIGDA CADathlon at ICCAD program and Editor-in-Chief of the SIGDA E-Newsletter
Diana Marculescu
For dedicated service as SIGDA Chair, and contributions to SIGDA, DAC and the EDA Profession
2009Nikil Dutt
For contributions to ACM’s Special Interest Group on Design Automation during the past fifteen years as a SIGDA officer, coordinator of the University Booth in its early years, and most recently, as Editor-in-Chief of the ACM Transactions on Design Automation of Electronic Systems
2008SungKyu Lim
For his contributions to the DAC University Booth.”
2007Richard Goering
For his contributions as EE Times Editorial Director for Design Automation for more than two decades
Gary Smith
For his contributions as Chief EDA Analyst at Gartner Dataquest for almost two decades.”
Daniel Gajski
Mary Jane Irwin
Donald E. Thomas
Chuck Shaw

For outstanding contributions to the creation of the SIGDA/DAC University Booth, on the occasion of its 20th edition.”
Soha Hassoun
Steven P. Levitan

For outstanding contributions to the creation of the SIGDA Ph.D. Forum at DAC on the occasion of its 10th edition.”
Richard Auletta
For over a decade of service to SIGDA as University Booth Coordinator, Secretary/Treasurer, and Executive Committee Member-at-Large.
2006 Robert Walker
For dedicated service as SIGDA Chair (2001 – 2005), and over a decade of service to SIGDA, DAC and the EDA profession.
2005 Mary Jane Irwin
For dedicated service as Editor in Chief of ACM Journal, TODAES (1998 – 2004), and many years of service to SIGDA, DAC, and the EDA profession.
2004 James P. Cohoon
For exemplary service to SIGDA, to ACM, to DAC, and to the EDA profession as a whole
2003 James Plusquellic
For exemplary service to ACM/SIGDA and the Design Automation Conference as director of the University Booth program
2002 Steven P. Levitan
For over a decade of service to ACM/SIGDA and the EDA industry — as DAC University Booth Coordinator, Student Design Contest organizer, founder and promoter of SIGDA’s web server, and most recently, Chair of ACM/SIGDA from 1997 to 2001.
Cheng-Kok Koh
For exemplary service to ACM/SIGDA and the EDA industry — as Co-director of SIGDA’s CDROM Project, as SIGDA’s Travel Grants Coordinator, and as Editor of the SIGDA Newsletter.
2001 Robert Grafton
For contributions to the EDA profession through his many years as the Program Director of NSF’s Design, Tools, and Test Program of the Computer, Information Sciences & Engineering Directorate. In this position, he provided supervision, mentorship, and guidance to several generations of EDA tool designers and builders funded by grants from the National Science Foundation.
2000 Massoud Pedram
For his contributions in developing the SIGDA Multimedia Series and organizing the Young Student Support Program
Soha Hassoun
For developing the SIGDA Ph.D. Forum
1999 C.L. (Dave) Liu
For his work in founding our flagship journal ACM/TODAES

Meritorious Service Awards

2023 Robert Wille, Technical University of Munich
For his leading positions in major ACM SIGDA conferences, including Executive Committee of DATE, ICCAD and Chair of the PhD Forum at DAC and DATE.
Lei Jiang, Indiana University Bloomington
For his leadership and contribution to SIGDA student research forums (SRFs) at ASP-DAC.
Hui-Ru Jiang, National Taiwan University
For her continuous contribution to SIGDA PhD Forum at DAC and many other events.
Jeyavijayan (JV) Rajendran, Texas A&M University
For his leadership in co-founding and organizing Hack@DAC, the largest hardware security competition in the world.
2022 Jeff Goeders, Brigham Young University
For Chairing System Design Contest @ DAC for the Past 3 Years.
Cheng Zhuo, Zhejiang University
For the Leading Efforts to the Success of SRC@ICCAD and SDC@DAC as Chairs for the past 5 years, and the Sustained Contributions to the EDA Community in China.
Tsung-Wei Huang, University of Utah
For Chairing CADathlon and CAD Contests at ICCAD for Three Years. These Activities Have Engaged Hundreds of Students into CAD Research.
Yiyu Shi, University of Notre Dame
For Outstanding Services in Leading SIGDA Educational Efforts.
2021 Bei Yu, Chinese University of Hong Kong
For service as SIGDA Web Chair from 2016 to 2021, SIGDA Student Research Competition Chair in 2018 and 2019, and other SIGDA activities.
2020 Aida Todri-Sanial, LIRMM/University of Montpellier
For service as Co-Editor-in-Chief of SIGDA e-Newsletter from 2016 to 2019 and other SIGDA activities.
Yu Wang, Tsinghua University
For service as Co-Editor-in-Chief of SIGDA e-Newsletter from 2017 to 2019 and other SIGDA activities.
2019 Yinhe Han, Chinese Academy of Sciences
For outstanding effort in promoting EDA and SIGDA events in China
Jingtong Hu, University of Pittsburgh
For contribution to multiple SIGDA education and outreach activities
Xiaowei Xu, University of Notre Dame
For contribution to the 2018 System Design Contest at ACM/IEEE Design Automation Conference
2015 Laleh Behjat, University of Calgary
For service as chair of the SIGDA PhD forum at DAC
Soonhoi Ha, Seoul National University
Jeonghee Shin, Apple
For their service as co-chairs of the University Booth at DAC
1998 Jason Cong
Bryan Preas
Kathy Preas
Chong-Chian Koh
Cheng-Kok Koh

For contributions in producing SIGDA CD-ROMs: archiving the knowledge of the Design Automation Community
1997 Robert Walker
For his hard work as Secretary/Treasurer and University Booth Coordinator
1996 Debbie Hall
For serving as ACM Program Director for SIGDA for the past 6 years

The following awards are no longer given:

Technical Leadership Awards

2013 Jarrod Roy
Sudeep Pasricha
Sudarshan Banerjee
Srinivas Katkoori

For running CADathlon
2012 Cheng Zhuo
Steve Burns
Amin Chirayu
Andrey Ayupov
Gustavo Wilke
Mustafa Ozdal
2011 Raju Balasuramanian
Zhuo Li
Frank Liu
Natarajan Viswanathan
2010 Cliff Sze
For contributions to the ISPD Physical Design contest, and promoting research in physical design.
2008 Hai Zhou
For contributions to the SIGDA E-Newsletter (2005-2008)
Jing Yang
For contributions to the SIGDA Ph.D. Forum at DAC (2004-2008)
2007 Geert Janssen
For contributions to the SIGDA CADathlon
Tony Givargis
For contributions to the SIGDA Ph.D. Forum at DAC (2005-2007)
Gi-Joon Nam
For contributions to the Physical Design Contest at ISPD (2005-2007)
2006 Kartik Mohanram
For contributions to the SIGDA CADathlon at ICCAD (2004-2005)
Ray Hoare
For contributions to the SIGDA University Booth at DAC (2004-2006)
Radu Marculescu
For contributions to the SIGDA Ph.D. Forum at DAC (2004-2006)
Frank Liu
For contributions to the SIGDA Ph.D. Forum at DAC (2005-2006)
2005 Florian Krohm
For contributions to the SIGDA CADathlon at ICCAD
R. Iris Bahar
Igor Markov

For contributions to the SIGDA E-Newsletter
2004 Robert Jones
For contributions to the SIGDA Ph.D. Forum at DAC
2003 Diana Marculescu
For contributions to the SIGDA Ph.D. Forum at DAC
2002 Geert Janssen
For contributions to the SIGDA CADathlon at ICCAD
Pai Chou
Abe Elfadel
Olivier Coudert
Soha Hassoun

For contributions to the SIGDA Ph.D. Forum at DAC

1996 SIGDA Lifetime Achievement Award

Paul Weil
“For contributions to SIGDA for 13 years, most recently being in charge of Workshops.”

1996 SIGDA Leadership Award

Steve Levitan
“For Newsletter and Electronic Publishing”

1995 SIGDA Outstanding Leadership Award

Charlotte Acken
“In recognition of her efforts in managing the high school scholarship program”

1994 SIGDA Outstanding Leadership Award

Pat Hefferan
“As Editor of the SIGDA Newsletter”

Programs

ACM SIGDA Speaker Travel Grant Program

The SIGDA Speaker Series Travel Grant supports the travel of speakers who are invited to give lectures or talks at local events, universities, and companies, so as to disseminate the values and impact of SIGDA. These speakers can come from either academia or industry and should be good lecturers who can help reach audiences in the broad field of design automation. Once an application is approved, SIGDA will issue a partial grant to cover the speaker’s travel expenses, including travel and subsistence costs.

This grant helps promote the EDA community and its activities all over the world. It will provide travel support averaging $1,000 (USD) for approximately 6 eligible speakers per year to defray their costs of giving lectures or talks at local events, universities, and companies. Priority will be given to applicants from the local sections of SIGDA whose speakers present at events supported by those local sections. In addition, local EDA communities or individuals outside the local sections of SIGDA are also encouraged to apply for this grant. For the application or additional information, please contact SIGDA by sending an email exclusively to the Technical Activity Chair (https://www.sigda.org/about-us/officers/).

Review Process

The review committee will be formed by the current Technical Activity Chair and Education Chair of SIGDA. The reviews will be reported and discussed at SIGDA’s Executive Committee meeting. After the discussion, the Executive Committee members will vote on whether to grant the submitted applications.

Selection Criteria

The review takes both the applicants/events and the speakers into consideration.

  • Preference is given to the local sections of SIGDA for speakers invited to events, universities, and companies supported by those local sections. In addition, applicants from local EDA communities and individual applicants are also considered.
  • The invited speaker should be a good lecturer or researcher from either academia or industry with a good track record in the broad field of design automation.

Post Applications – Report and Reimbursement

  • For a speaker giving a talk at an ACM event, SIGDA can support the travel grant and process reimbursements to the speaker directly. At the end of the event, the speaker needs to complete the ACM reimbursement form and send it to the SIGDA or ACM representative along with copies of the receipts. The speaker will also need to abide by the reimbursement policies/standards found here: https://www.acm.org/special-interest-groups/volunteer-resources/conference-planning/conference-finances#speaker
  • For a speaker giving a talk at a non-ACM event, SIGDA will provide a lump-sum payment to the legal and financial sponsoring organization, which will offer the funds as travel grants and process reimbursements. The sponsoring organization needs to indicate on the event’s promotional materials that travel grants are being supported by SIGDA. At the end of the event, the sponsoring organization needs to provide (1) a one-page final report to SIGDA reflecting the success of their goals against the funds provided and indicating how the funds were spent, (2) an invoice for the approved amount, and (3) a tax form. Note that there is no specific format for the final report.

Application Form

Sponsor

Synopsys

Backup Bylaws

BYLAWS of the Special Interest Group on DESIGN AUTOMATION of the Association for Computing Machinery, Inc.

  • Adopted – 27 October 1979
  • Revised – 9 March 1994
  • Revised – 7 July 2004
  • Revised – 24 March 2005
  • Revised – 20 January 2009

Article 1. Name and Scope

  1. This organization is called the Special Interest Group on Design Automation (SIGDA) of the Association for Computing Machinery, Inc. (the “ACM”).
  2. The scope of SIGDA’s specialty is to enhance the utility of computers as engineering tools in the design, fabrication, and test of systems and structures.

Article 2. Purpose

  1. SIGDA is organized and operated exclusively for educational, scientific, and technical purposes in design automation.
  2. The purpose of SIGDA and its activities includes:
    1. Collecting and disseminating information in design automation through a newsletter and other publications;
    2. Organizing sessions at conferences of the ACM;
    3. Sponsoring conferences, symposia, and workshops;
    4. Organizing projects and working groups for education, research, and development;
    5. Serving as a source of technical information for the Council and subunits of the ACM; and
    6. Representing the opinions and expertise of the membership on matters of technical interest to SIGDA or ACM.

Article 3. Charter

SIGDA will exist until dissolved as provided in Bylaw 6 of the ACM.

Article 4. Officers

  1. SIGDA officers are the Chair and Chairs for Awards, Conferences, Technical Activities, Educational Activities, Communications, and Finance; one of the named Chairs will also be a Vice-Chair. The Past Chair is not an elected official and may fill one of the named Chair positions. The officers are elected for three-year terms beginning July 1 of 2009. No extension of terms shall be allowed.
  2. The Chair is the principal officer, being responsible for leading SIGDA and managing its activities. The duties of the Chair are:
    1. Calling and presiding at SIGDA Executive Committee and business meetings;
    2. Conducting all of SIGDA’s activities in accordance with the policies of the ACM; and
    3. Making all appointments as authorized herein.
  3. The duties of the Vice-Chair are:
    1. Assisting the Chair in leading and managing SIGDA; and
    2. Presiding at meetings when the Chair is absent.
  4. The duties of the Past Chair are:
    1. Filling one of the named chair positions below, or acting as a member of the Advisory Board; and
    2. Chairing the Nominating Committee for SIGDA officer elections.
  5. The duties of the Communications Chair are:
    1. Maintaining the records and correspondence of SIGDA;
    2. Keeping and distributing the minutes and action items of business and Executive Committee meetings.
  6. The duties of the Finance Chair are:
    1. Managing SIGDA’s finances according to the Financial Accountability Policy of the ACM. This includes preparing the annual budget, monitoring disbursements for adherence to the annual budget, and preparing financial reports as required.
    2. Managing the SIGDA Travel Grants program, if applicable.
  7. The duties of the Awards Chair are:
    1. Providing a single point of contact for all SIGDA sponsored awards;
    2. Coordinating the process of nominating ACM/SIGDA members for Fellow, Distinguished, and Senior grades.
  8. The duties of the Conference Chair are:
    1. Providing a single point of contact for all SIGDA sponsored, co-sponsored, in-coop events except events for which other SIGDA Advisory Board members have been specifically assigned;
    2. Coordinating the review and approval of all conference/symposia/workshop budgets.
  9. The duties of the Technical Activities Chair are:
    1. Providing a single point of contact for all SIGDA Technical Committees and other technical activities;
    2. Coordinating and reviewing SIGDA TC activities and other technical activities.
  10. The duties of the Educational Activities Chair are:
    1. Providing a single point of contact for all SIGDA educational activities;
    2. Coordinating and reviewing all SIGDA educational activities.

Article 5. The Executive Committee

  1. The Executive Committee comprises the officers.
  2. Specific duties of the Executive Committee include:
    1. Approval of bylaw amendments before submission to members;
    2. Approval of annual dues for SIGDA;
    3. Approval of the annual budget and review of all expenditures in excess of 1% of the fiscal year’s opening Fund Balance on a quarterly basis;
    4. Approval of conferences, symposia, workshops or sessions sponsored, co-sponsored or held in cooperation with SIGDA; and
    5. All the major management policy decisions of SIGDA must be approved by the Executive Committee.
  3. A quorum is a majority of the members of the Executive Committee and approval requires a majority vote of those present. Approval by mail ballot requires a majority vote.
  4. Only a member of the Executive Committee can make a motion for a vote by the Executive Committee.
  5. All members of, or candidates for, the Executive Committee must be voting Members of ACM and of SIGDA.
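
The quorum and approval rule in item 3 can be sketched as a small check. This is an illustration only: the function names are hypothetical, and reading "majority" as strictly more than half is an assumption rather than something the bylaws spell out.

```python
def has_quorum(committee_size: int, present: int) -> bool:
    """A quorum is a majority of the Executive Committee.

    Assumption: "majority" means strictly more than half.
    """
    return 2 * present > committee_size


def motion_passes(committee_size: int, present: int, votes_for: int) -> bool:
    """Approval needs a quorum plus a majority of the members present."""
    if not 0 <= votes_for <= present <= committee_size:
        raise ValueError("inconsistent counts")
    return has_quorum(committee_size, present) and 2 * votes_for > present
```

For a seven-member committee, `motion_passes(7, 4, 3)` holds, while three members present cannot form a quorum regardless of how they vote.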

Article 6. Vacancies and Appointments

  1. Should the Chair leave office before his term expires, the Vice-Chair will assume the duties of Chair. Should any other elected office (including Past Chair) become vacant, the Chair of the SIG Governing Board may, on nomination by the SIGDA Chair, and approval by majority vote of the Executive Committee, fill the vacancy. The Chair may fill vacancies in positions appointed by the Chair, according to the procedures for making the original appointments as provided herein.
  2. Should a vacancy be unfilled, either because of inadequacy of these bylaws or because of a dispute or for any other reason, the SIG Governing Board Chair may fill it.
  3. All appointments expire automatically when the Chair’s term of office expires.

Article 7. The Newsletter

  1. SIGDA will publish a newsletter at regular intervals as determined by the Executive Committee. The newsletter will be distributed to all members.
  2. The Chair will nominate an Editor of the Newsletter, to be approved by majority vote of the Executive Committee.

Article 8. The Advisory Board

  1. The Advisory Board includes the Executive Committee (officers). It also includes members-at-large who are nominated by the SIGDA Chair. The Chair normally nominates up to ten members-at-large to the Advisory Board for his or her term of office. Appointments to the Advisory Board must be approved by a majority vote of the Executive Committee.
  2. The purpose of the Advisory Board is to allow members outside the Executive Committee to participate in setting policy and direction for, and assist in the operation of, SIGDA. The Advisory Board members are typically the program managers or coordinators of SIGDA sponsored activities.
  3. The Advisory Board members are non-voting members of the SIGDA Board, and while the Advisory Board may participate in a vote, their votes are non-binding, and only the Executive Committee votes are binding.

Article 9. Membership, Dues, and Voting Privileges

  1. A person becomes a member only after enrolling and paying the required dues. The dues for SIGDA are determined by the SIGDA Executive Committee with the approval of the Chair of the SIG Governing Board.
  2. All members of SIGDA may vote in any ballot conducted by SIGDA. On any ballot, the votes cast by non-ACM members of SIGDA will, if necessary, be prorated downward so that their effective total cannot exceed 50% of the eligible votes.
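
The proration in item 2 can be illustrated with a short sketch. It rests on one possible reading, stated here as an assumption: the effective non-ACM total is capped at half of all effective votes counted, which is the same as capping it at the ACM members' total. The function name and the vote counts are hypothetical.

```python
def effective_tally(acm_votes: dict, non_acm_votes: dict) -> dict:
    """Combine votes per option, scaling non-ACM votes down if needed.

    Assumption: non-ACM votes may make up at most 50% of the effective
    total, i.e. their scaled sum may not exceed the ACM members' sum.
    """
    acm_total = sum(acm_votes.values())
    non_acm_total = sum(non_acm_votes.values())
    scale = 1.0
    if non_acm_total > acm_total:
        scale = acm_total / non_acm_total  # prorate downward only
    options = set(acm_votes) | set(non_acm_votes)
    return {
        option: acm_votes.get(option, 0) + scale * non_acm_votes.get(option, 0)
        for option in options
    }


# Hypothetical ballot: 40 ACM-member votes and 80 non-ACM votes.
tally = effective_tally({"yes": 30, "no": 10}, {"yes": 10, "no": 70})
```

Here the non-ACM votes are scaled by 0.5, so the effective tally is 35 for "yes" and 45 for "no".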

Article 10. Reports and Records

The SIGDA Chair is responsible for filing reports about SIGDA as required by the SIG Board. These include:

  1. An annual report on the activities during the previous year;
  2. All reports required by the Financial Accountability Policy of the ACM; and
  3. Closing reports on conferences and symposia.

The membership records of SIGDA will be maintained by ACM headquarters.

Article 11. Elections

  1. The Chair shall appoint a nominating committee in the autumn of each election year. This committee will nominate at least two candidates for the position of the chair and at least six other candidates for the members-at-large, who consent to serve on the Executive Committee and fill one of the named Chair positions if elected. The person winning the most votes among those nominated for the chair will be elected to that position. The six (or seven, if the Past Chair does not wish to fill a named Chair position) receiving the highest number of votes among members-at-large are elected to the Executive Committee. A report of the nominating committee must be presented to the SIGDA membership before an election can be held.
  2. All applicants for the chair should have significant service experience of at least 3 years in the design automation community and SIGDA, in particular. They should have served at least one term in the executive committee in roles other than the chair. Equivalent experience through service to SIGDA-approved sponsored conferences as deemed acceptable by the nominating committee is allowed.  
  3. A petition from at least ten voting members of SIGDA will place other consenting candidates on the ballot for any of the EC positions, subject to meeting the requirements of 11(b) for the chair position. Petitions must be received by the Past Chair no later than April 15 in the year of election or within one month after the nominating committee has announced the candidates selected by the committee, whichever is later.
  4. Elections must be announced by direct communication to the SIGDA Membership with sufficient time before the election such that the membership has an opportunity to petition to be placed on the ballot.
  5. The election will be conducted among eligible voters by ballot sent by the nominating committee or by ACM Headquarters, following the election procedures of the ACM. The SIG Board will resolve ties.
  6. All named Chair positions, except that of the Chair, are to be decided by the new Executive Committee by ballot, from those elected as members-at-large. The new Executive Committee votes for each position: Vice-Chair, Finance, Communications, Conferences, Technical Activities, Educational Activities, and Awards.

Article 12. Amendments

  1. These bylaws may be amended by a majority vote of the ACM Executive Committee, or by a vote of SIGDA’s members as provided below. With the approval of the SIGDA Executive Committee, and the Executive Committee of the ACM, 2/3 of all the members of the SIG Board may amend Article 1 of these bylaws without a referendum of the members.
  2. Amendments to these bylaws may be proposed by the SIGDA Executive Committee, the SIG Governing Board, or by a petition from 10 voting members of SIGDA. All proposed amendments must be approved, prior to being submitted for a vote of the membership, by the Chairperson of both the SIG Governing Board and the Constitution and Bylaws Committee of ACM after the Executive Director of ACM has provided his advice.
  3. The ballot on the proposed amendment(s) will be conducted among the eligible voters by ACM Headquarters following the procedures of the ACM for voting bylaw amendments, unless a different procedure has been approved by the SIG Board. The proposal is adopted only if at least 2/3 of the effective votes of returned ballots approve it, and only if at least 10% of the ballots are returned. The Secretary/Treasurer will send a clean copy of the amended bylaws to the Executive Director of ACM and to the Chair of the SIG Governing Board.
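
The adoption test in item 3 combines two thresholds. A minimal sketch follows, with a hypothetical function name and hypothetical counts; it assumes "effective votes" are the possibly prorated votes on returned ballots and uses integer cross-multiplication to avoid rounding at the exact 2/3 and 10% boundaries.

```python
def amendment_adopted(ballots_sent, ballots_returned, effective_yes, effective_total):
    """Return True if both adoption thresholds are met.

    - at least 10% of the ballots are returned
    - at least 2/3 of the effective votes on returned ballots approve
    """
    if ballots_sent <= 0 or effective_total <= 0:
        return False
    turnout_ok = 10 * ballots_returned >= ballots_sent
    approval_ok = 3 * effective_yes >= 2 * effective_total
    return turnout_ok and approval_ok
```

With 500 ballots sent and 60 returned, 40 approving votes out of 60 is exactly 2/3 and passes; 39 out of 60 does not.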

Article 13. Dissolution

Should SIGDA be dissolved, control of its assets will revert to the ACM.

Article 14. Meetings

SIGDA will conduct at least one business meeting each year, normally in conjunction with the annual Design Automation Conference. All meetings sponsored by SIGDA must be open to all members of the ACM. SIGDA may hold meetings only in places that are open to all classes of members of the ACM. The Executive Committee may meet in closed sessions during business meetings.

Article 15. Consistency

The Constitution, Bylaws, and policies of the ACM and of the SIG Governing Board take precedence over any conflicting provisions of these bylaws or internal policies of SIGDA.

Info for Organizers of SIGDA Sponsored Events

ACM and SIGDA are closely monitoring the COVID-19 (coronavirus, 2019-nCoV) situation and its potential impact on ACM conferences. We are following updates on the situation from the World Health Organization (WHO) and the Centers for Disease Control and Prevention (CDC). We encourage all Conference Leaders to keep informed on risks, precautions, and symptoms to make educated decisions for their community.

An ACM Presidential Task Force was formed to provide advice to conference organizers facing the need to move their conference online in light of the social distancing recommendations and global restrictions on travel due to the COVID-19 pandemic. Here is the link to What Conferences Can do to Replace Face-to-Face Meetings https://people.clarkson.edu/~jmatthew/acm/VirtualConferences_GuideToBestPractices_CURRENT.pdf, put together by ACM Presidential Task Force. 

Conference Leaders should contact the ACM SIGDA liaison, Sade Rodriguez, for guidance on any concerns related to the potential impact this may have on conference planning, and should review the ACM Conference Planning Guide, which gives an overview of the ACM support available. For a SIGDA sponsored conference, it is important that SIGDA leaders are included in all discussions regarding any changes to the conference.

ISPD 2020 TOC

SESSION: Keynote 1

Session details: Keynote 1

  • William Swartz

Scalable System and Silicon Architectures to Handle the Workloads of the Post-Moore Era

  • Ivo Bolsens

The end of Moore’s law has been proclaimed on many occasions and it’s probably safe
to say that we are now working in the post-Moore era. But no one is ready to slow
down just yet. We can view Gordon Moore’s observation on transistor densification
as just one aspect of a longer-term underlying technological trend – the Law of Accelerating
Returns articulated by Kurzweil. Arguably, companies became somewhat complacent in
the Moore era, happy to settle for the gains brought by each new process node. Although
we can expect scaling to continue, albeit at a slower pace, the end of Moore’s Law
delivers a stronger incentive to push other trends of technology progress harder.
Some exciting new technologies are now emerging such as multi-chip 3D integration
and the introduction of new technologies such as storage-class memory and silicon
photonics. Moreover, we are also entering a golden age of computer architecture innovation.
One of the key drivers is the pursuit of domain-specific architectures as proclaimed
by Turing award winners John Hennessy and David Patterson. A good example is Xilinx’s
AI Engine, one of the important features of the Versal ACAP (adaptive compute acceleration
platform) [1]. Today, the explosion of AI workloads is one of the most powerful drivers
shifting our attention to find faster ways of moving data into, across, and out of
accelerators. Features such as massive parallel processing elements, the use of domain
specific accelerators, the dense interconnect between distributed on-chip memories
and processing elements, are examples of the ways chip makers are looking beyond scaling
to achieve next-generation performance gains. Next, the growing demands of scaling-out
hyperscale datacenter applications drive much of the new architecture developments.
Given a high diversification of workloads that invoke massive compute and data movement,
datacenter architectures are moving away from rigid CPU-centric structures and instead
prioritize adaptability and configurability to optimize resources such as memory and
connectivity of accelerators assigned to individual workloads. There is no longer
a single figure of merit. It’s not all about Tera-OPS. Other metrics such as transfers-per-second
and latency come to the fore as demands become more real-time; autonomous vehicles
being an obvious and important example. Moreover, the transition to 5G will result
in solutions that operate across the traditional boundaries between the cloud and
edge and embedded platforms that are obviously power-conscious and cost-sensitive.
Future workloads will require agile software flows that accommodate the spread of
functions across edge and cloud. Another industry megatrend that will drive technology
requirements especially in encryption, data storage and communication, is Blockchain.
To some, it may already have a bad reputation, tarnished by association with the anarchy
of cryptocurrency, but it will be more widely relevant than many of us realize. Who
could have foreseen the development of today’s Internet when ARPANET first appeared
as a simple platform for distributed computing and sending email? Through projects
such as the open-source Hyperledger, Blockchain technology could be game-changing
as a platform for building trust in transactions executed over the Internet. We may
soon be talking in terms of the Trusted Internet. The predictability of Moore’s law
may have become rather too comfortable and slow. The future requires maximizing the
flexibility, agility, and efficiency of new technologies. With Moore’s Law now mostly
behind us, new adaptable and scalable architectures will allow us to further provide
exponential return from technology in order to create a more adaptable and intelligent
world.

SESSION: Session 1: Placement

Session details: Session 1: Placement

  • Stephen Yang

Placement Optimization with Deep Reinforcement Learning

  • Anna Goldie
  • Azalia Mirhoseini

Placement Optimization is an important problem in systems and chip design, which consists
of mapping the nodes of a graph onto a limited set of resources to optimize for an
objective, subject to constraints. In this paper, we start by motivating reinforcement
learning as a solution to the placement problem. We then give an overview of what
deep reinforcement learning is. We next formulate the placement problem as a reinforcement
learning problem, and show how this problem can be solved with policy gradient optimization.
Finally, we describe lessons we have learned from training deep reinforcement learning
policies across a variety of placement optimization problems.
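
As a rough illustration of policy gradient optimization applied to placement, the toy sketch below trains a softmax policy over three hand-written candidate placements, using the negative wirelength as the reward. It is not the method from the paper: there is no neural network or sequential decision process, the candidates and nets are made up, and the gradient of the expected reward is computed exactly rather than estimated from sampled placements.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def wirelength(placement, nets):
    """Sum of Manhattan distances over two-pin nets."""
    return sum(
        abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
        for a, b in nets
    )


def train_policy(rewards, steps=200, lr=0.5):
    """Gradient ascent on expected reward for a softmax policy.

    For logit a the exact gradient is pi(a) * (R(a) - E[R]).
    """
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [t + lr * p * (r - expected) for t, p, r in zip(logits, probs, rewards)]
    return softmax(logits)


# Hypothetical nets and candidate placements (node -> (x, y) slot).
nets = [("a", "b"), ("b", "c")]
candidates = [
    {"a": (0, 0), "b": (1, 0), "c": (2, 0)},
    {"a": (0, 0), "b": (2, 0), "c": (1, 0)},
    {"a": (0, 0), "b": (2, 2), "c": (0, 1)},
]
rewards = [-wirelength(p, nets) for p in candidates]
probs = train_policy(rewards)
best = max(range(len(probs)), key=probs.__getitem__)
```

The policy ends up concentrating its probability on the first candidate, which has the shortest wirelength. A practical setup would sample placements from the policy and use those samples to estimate the same gradient.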

Hill Climbing with Trees: Detail Placement for Large Windows

  • Mohammad Khasawneh
  • Patrick H. Madden

Integrated circuit design encompasses a wide range of intractable optimization problems.
In this paper, we extend linear time hill climbing techniques from graph partitioning
to address detailed placement — this results in a new way to refine circuit designs,
dramatically expands the size of practical optimization windows, and enables wire
length reductions on a variety of benchmark problems. The approach is versatile and
straight-forward to implement, allowing it to be applied to a wide range of problems
within design automation, and beyond.
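
To make the idea of hill climbing for detailed placement concrete, here is a minimal sketch that reorders cells in a single row by accepting only pairwise swaps that reduce wirelength. It is a plain greedy swap search, not the linear-time technique the paper adapts from graph partitioning, and the cell names and nets are made up.

```python
from itertools import combinations


def wirelength(order, nets):
    """Total span of each net when cells occupy consecutive row slots."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net) for net in nets)


def hill_climb(order, nets):
    """Repeat improving pairwise swaps until no swap lowers the cost."""
    order = list(order)
    best = wirelength(order, nets)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            order[i], order[j] = order[j], order[i]
            cost = wirelength(order, nets)
            if cost < best:
                best = cost
                improved = True
            else:
                order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best


# Hypothetical chain of two-pin nets and a poor starting order.
nets = [("a", "b"), ("b", "c"), ("c", "d")]
start = ["a", "c", "b", "d"]
final_order, final_cost = hill_climb(start, nets)
```

The loop terminates because every accepted swap strictly lowers an integer cost that cannot go below zero. Each pass re-evaluates the full wirelength for every pair, which is far more expensive than the incremental updates a production placer would use.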

Via Pillar-aware Detailed Placement

  • Yong Zhong
  • Tao-Chun Yu
  • Kai-Chuan Yang
  • Shao-Yun Fang

With the feature size shrinking down to 7 nm and beyond, the impact of wire resistance
is significantly growing, and the circuit delay incurred by metal wires is noticeably
rising. To address this issue, a new technique called via pillar insertion is developed.
However, the poor success rate of the via pillar insertion process immediately becomes
an important problem. In this paper, we explore the causes of via pillar insertion
failures by experiments on the ISPD 2015 benchmarks, which are embedded with a real
industrial cell library. The results show that the reasons for the low success rate
may be due to track misalignment, power and ground stripe overlapping, and insufficient
margin area. Therefore, we propose the first detailed placement flow which is aware
of via pillars to maximize the success rate of via pillar insertion. In the proposed
flow, we first filter out infeasible cell rows and then move the via pillar-inserting
cells to their eligible positions. Next, we adopt a two-stage legalization method
with high flexibility on cell ordering based on a dynamic programming-based detailed
placement algorithm. Finally, we improve congested rows with a global moving process.
Experimental results show that our algorithm improves the insertion rates by 54-58%,
and achieves over 99% insertion rate on average.

Soft-Clustering Driven Flip-flop Placement Targeting Clock-induced OCV

  • Dimitrios Mangiras
  • Pavlos Mattheakis
  • Pierre-Olivier Ribet
  • Giorgos Dimitrakopoulos

On-Chip Variation (OCV) in advanced technology nodes introduces delay uncertainties
that may cause timing violations. This problem drastically affects the clock tree
that, besides the growing design complexity, needs to be appropriately synthesized
to tackle the increased variability effects. To reduce the magnitude of the clock-induced
OCV, we incrementally relocate the flip-flops and the clock gaters in a bottom-up
manner to implicitly guide the clock tree synthesis engine to produce clock trees
with increased common clock tree paths. The relocation of the clock elements is performed
using a soft clustering approach that is orthogonal to the clock tree synthesis method
used. The clock elements are repeatedly relocated and incrementally re-clustered,
thus gradually forming better clusters and settling to more appropriate positions
to increase the common paths of the clock tree. This behavior is verified by applying
the proposed method in industrial designs, resulting in clock trees which are more
resilient to process variations, while exhibiting improved overall timing.

SESSION: Session 2: Breaking New Ground: From Carbon Nanotubes to Packaging

Session details: Session 2: Breaking New Ground: From Carbon Nanotubes to Packaging

  • Patrick H. Madden

Advances in Carbon Nanotube Technologies: From Transistors to a RISC-V Microprocessor

  • Gage Hills
  • Christian Lau
  • Tathagata Srimani
  • Mindy D. Bishop
  • Pritpal Kanhaiya
  • Rebecca Ho
  • Aya Amer
  • Max M. Shulaker

Carbon nanotube (CNT) field-effect transistors (CNFETs) promise to improve the energy
efficiency of very-large-scale integrated (VLSI) systems. However, multiple challenges
have prevented VLSI CNFET circuits from being realized, including inherent nano-scale
material defects, robust processing for yielding complementary CNFETs (i.e., CNT CMOS:
including both PMOS and NMOS CNFETs), and major CNT variations. Here, we summarize
techniques that we have recently developed to overcome these outstanding challenges,
enabling VLSI CNFET circuits to be experimentally realized today using standard VLSI
processing and design flows. Leveraging these techniques, we demonstrate the most
complex CNFET circuits and systems to-date, including a three-dimensional (3D) imaging
system comprising CNFETs fabricated directly on top of a silicon imager, CNT CMOS
analog and mixed-signal circuits, 1 kilobit CNFET static random-access memory (SRAM)
memory arrays, and a 16-bit RISC-V microprocessor built entirely out of CNFETs.

Full-Chip Electro-Thermal Coupling Extraction and Analysis for Face-to-Face Bonded
3D ICs

  • Lingjun Zhu
  • Kyungwook Chang
  • Dusan Petranovic
  • Saurabh Sinha
  • Yun Seop Yu
  • Sung Kyu Lim

Due to the short die-to-die distance and inferior heat dissipation capability, Face-to-Face
(F2F) bonded 3D ICs are often considered to be vulnerable to electrical and thermal
coupling. This study is the first to quantify the impacts of the electro-thermal coupling
on the full-chip timing, power, and performance. We first present an implementation
flow for realistic F2F 3D ICs including pad layers and power grids. Then, we propose
our signal integrity analysis, parasitic extraction, and thermal analysis flows. Next,
we investigate the impacts of the coupling on the delay, power, and noise of F2F 3D
ICs, and provide guidelines to mitigate these effects. Our experimental results show
that the inter-die electrical coupling causes up to 5.81% timing degradation and 4.00%
noise increase, while the thermal coupling leads to less than 0.41% timing degradation
and nearly no noise increase. The impact of the combined electro-thermal coupling
on delay and noise reaches 6.07% and 4.05%, respectively.

Pseudo-3D Approaches for Commercial-Grade RTL-to-GDS Tool Flow Targeting Monolithic
3D ICs

  • Heechun Park
  • Bon Woong Ku
  • Kyungwook Chang
  • Da Eun Shim
  • Sung Kyu Lim

Despite the recent academic efforts to develop Electronic Design Automation (EDA)
algorithms for 3D ICs, the current market does not have commercial 3D computer-aided
design (CAD) tools. Instead, pseudo-3D alternative design flows have been devised which
utilize commercial 2D CAD engines with tricks that help them operate as a fairly-efficient
3D CAD tool. In this paper we provide detailed discussions and fair power-performance-area
(PPA) comparisons of state-of-the-art pseudo-3D design flows. We also analyze the
limitations of each design flow and provide solutions with better PPA and various
design options. Our experiments using commercial PDK, GDS layouts, and sign-off simulations
demonstrate that we achieve up to 26% wirelength and 10% power consumption reduction
for pseudo-3D design flows. We also provide a partitioning-first scheme to partitioning-last
design flow which increases design freedom with tolerable PPA degradation.

SESSION: Session 3: Machine Learning for Physical Design (part 1)

Session details: Session 3: Machine Learning for Physical Design (part 1)

  • Patrick Groeneveld

Learning from Experience: Applying ML to Analog Circuit Design

  • Kishor Kunal
  • Tonmoy Dhar
  • Yaguang Li
  • Meghna Madhusudan
  • Jitesh Poojary
  • Arvind K. Sharma
  • Wenbin Xu
  • Steven M. Burns
  • Ramesh Harjani
  • Jiang Hu
  • Parijat Mukherjee
  • Sachin S. Sapatnekar

The problem of analog design automation has vexed several generations of researchers
in electronic design automation. At its core, the difficulty of the problem is related
to the fact that machine-generated designs have been unable to match the quality of
the human designer. The human designer typically recognizes blocks from a netlist
and draws upon her/his experience to translate these blocks into a circuit that is
laid out in silicon. The ability to annotate blocks in a schematic or netlist-level
description of a circuit is key to this entire process, but it is a process fraught
with complexity due to the large number of variants of each circuit type. For example,
the number of topologies of operational transconductance amplifiers (OTAs) easily
numbers in the hundreds. A designer manages this complexity by dividing this large
set of variants into classes (e.g., OTAs may be telescopic, folded cascode, etc.).
Even so, the number of minor variations within each class is large. Early approaches
to analog design automation attempted to use rule-based methods to capture these variations,
but this database of rules required tender care: each new variant might require a
new rule. As machine learning (ML) based alternatives have become more viable, alternative
forms of solving this problem have begun to be explored.

Our effort is part of the ALIGN (Analog Layout, Intelligently Generated from Netlists)
project [2, 3], which is developing open-source software for analog/mixed-signal circuit
layout [1]. Our specific goal is to translate a netlist into a physical layout, with
24-hour turnaround and no human in the loop. The ALIGN flow inputs a netlist whose
topology and transistor sizes have already been chosen, a set of performance specifications,
and a process design kit (PDK) that defines the process technology. The output of
ALIGN is a layout in GDSII format.

Transforming Global Routing Report into DRC Violation Map with Convolutional Neural
Network

  • Wei-Tse Hung
  • Jun-Yang Huang
  • Yih-Chih Chou
  • Cheng-Hong Tsai
  • Mango Chao

In this paper, we have proposed a machine-learning framework to predict the DRC-violation
map of a given design resulting from its detailed routing based on the congestion
report resulting from its global routing. The proposed framework utilizes a convolutional
neural network as its core technique to train this prediction model. The training
dataset is collected from 15 industrial designs using a leading commercial APR tool,
and the total number of collected training samples exceeds 26M. A specialized under-sampling
technique is proposed to select important training samples for learning, compensate
for the inaccuracy caused by a highly imbalanced training dataset, and speed up the
entire training process. The experimental result demonstrates that our trained model
can result in not only a significantly higher accuracy than previous related works
but also a DRC violation map visually matching the actual ones closely. The average
runtime of using our learned model to generate a DRC-violation map is only 3% of that
of global routing, and hence our proposed framework can be viewed as a simple add-on
tool to a current commercial global router that can efficiently and effectively generate
a more realistic DRC-violation map without really applying detailed routing.
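
To illustrate the general idea of under-sampling on an imbalanced dataset, here is a minimal Python sketch that keeps every violation sample and a random subset of the non-violation samples. It shows plain random under-sampling, not the specialized technique proposed in the paper, whose details the abstract does not give. The function name, the `ratio` parameter, and the toy data are assumptions made for illustration.

```python
import random


def undersample(samples, labels, ratio=1.0, seed=0):
    """Keep all positive samples and a random subset of the negatives.

    ratio is the number of negatives kept per positive (an illustrative knob,
    not a parameter taken from the paper).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than exist.
    keep_neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    keep = sorted(pos + keep_neg)
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 5 violation samples among 100 in total.
toy_labels = [1] * 5 + [0] * 95
toy_samples = list(range(100))
bal_x, bal_y = undersample(toy_samples, toy_labels, ratio=2.0)
```

With `ratio=2.0` the toy set shrinks from 100 samples to 15, all five positives included, which is the kind of rebalancing that also shortens training.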

Lookahead Placement Optimization with Cell Library-based Pin Accessibility Prediction
via Active Learning

  • Tao-Chun Yu
  • Shao-Yun Fang
  • Hsien-Shih Chiu
  • Kai-Shun Hu
  • Philip Hui-Yuh Tai
  • Cindy Chin-Fang Shen
  • Henry Sheng

With the development of advanced semiconductor process nodes, the problem of pin
access has become one of the major factors affecting the occurrence of design rule
violations (DRVs) due to complex design rules and limited routing resources. Many state-of-the-art
works address the problem of DRV prediction by adopting supervised machine learning
approaches. However, those supervised learning approaches extract the labels of training
data by generating a great number of routed designs in advance, giving rise to large
effort on training data preparation. In addition, the pre-trained model could hardly
predict unseen data and thus may not be applied to predict other designs containing
cells that are not used in the training data. In this paper, we propose the first
work of cell library-based pin accessibility prediction (PAP) by using active learning
techniques. A given set of standard cell libraries serves as the only input for
model training. Unlike most existing studies, which aim at design-specific training,
we propose a library-based model which can be applied to all designs referencing
the same standard cell library set. Experimental results show that the proposed model
can be applied to predict two different designs with different reference library sets.
The number of remaining DRVs and M2 shorts of the designs optimized by the proposed
model are also much fewer than those of design-specific models.

SESSION: Keynote 2

Session details: Keynote 2

  • Mark Po-Hung Lin

Physical Design for 3D Chiplets and System Integration

  • Cliff Hou

The convergence of 5G and Artificial Intelligence (AI) that covers the gamut from
cloud data centers through network routers to edge applications is poised to open
possibilities beyond our imagination and transform how we will go about our daily
lives. As the foundational technology supporting 5G and AI innovation, semiconductors
strive for greater system performance and broader bandwidth, while increasing functionality
and lowering cost. In response, device innovation is transitioning from SoCs to 3D
chiplets that combine advanced wafer-level system integration (WLSI) technologies
such as CoWoS® (Chip on Wafer on Substrate), Integrated Fan-Out (InFO), Wafer-on-Wafer
(WoW) and System-on-Integrated-Chips (SoIC), to enable system integration that meets
these demands. Designing 3D chiplets and housing various chips on wafer-level for
system integration creates a whole new set of challenges. These start with design
partitioning and include handling interfaces between or passing through chips, design
for testing (DFT), thermal dissipation, databases and tools integration for chip and
packaging design, new IO/ESD (electrostatic discharge), simulation run time and tool
capacity, among others. Considering current capabilities and constraints, divide-and-conquer
remains the most feasible approach for 3D chiplet design and packaging. Chiplet design
needs to integrate databases and tools with packaging environments for both verification
and optimization. Leveraging existing 2D physical design solutions and chip-level
abstraction can help meet 3D verification and optimization requirements. The IC industry
also needs more DFT and thermal dissipation innovation, especially the latter one.
Thermal optimization is critical to 3D chiplets and system integration. The current
thermal solution only covers thermal analysis + system-level thermal dissipation.
It should start at the IPs and across chip design process, i.e., thermal-aware 3D
IC design, to cover IP, macros, and transistors. This speech will address these and
other challenges, then propose physical design solutions for 3D chiplets and system
integration.

CCS CONCEPTS: VLSI design, 3D integrated circuits, VLSI system specification
and constraints, and VLSI packaging

KEYWORDS: Physical design, 3D chiplets and system integration, thermal optimization

BIOGRAPHY: Dr. Cliff Hou was appointed Vice President
of Research and Development at Taiwan Semiconductor Manufacturing Co. Ltd. (TSMC)
in 2011. Since 1999, he has worked to establish node-specific reference flows from
0.13μm to today’s leading-edge 3nm at TSMC. Dr. Hou also led TSMC’s in-house IP development
teams from 2008 to 2010. He is now spearheading TSMC’s efforts to build total platform
solutions for the industry’s high growth markets in Mobile, IoT, Automotive, and High-Performance
Computing. Dr. Hou holds 44 U.S. Patents and serves as a member of Board of Directors
in Global Unichip Corp. He received B.S. degree in Control Engineering from Taiwan’s
National Chiao-Tung University, and Ph.D. in Electrical and Computer Engineering from
Syracuse University.

SESSION: Session 4: Circuit Design and Security

Session details: Session 4: Circuit Design and Security

  • David Chinnery

Hardware Security For and Beyond CMOS Technology: An Overview on Fundamentals, Applications, and Challenges

  • Johann Knechtel

As with most aspects of electronic systems and integrated circuits, hardware security
has traditionally evolved around the dominant CMOS technology. However, with the rise
of various emerging technologies, whose main purpose is to overcome the fundamental
limitations for scaling and power consumption of CMOS technology, unique opportunities
arise also to advance the notion of hardware security. In this paper, I first provide
an overview on hardware security in general. Next, I review selected emerging technologies,
namely (i) spintronics, (ii) memristors, (iii) carbon nanotubes and related transistors,
(iv) nanowires and related transistors, and (v) 3D and 2.5D integration. I then discuss
their application to advance hardware security and also outline related challenges.

Design Optimization by Fine-grained Interleaving of Local Netlist Transformations
in Lagrangian Relaxation

  • Apostolos Stefanidis
  • Dimitrios Mangiras
  • Chrysostomos Nicopoulos
  • David Chinnery
  • Giorgos Dimitrakopoulos

Design optimization modifies a netlist with the goal of satisfying the timing constraints
at the minimum area and leakage power, without violating any slew or load capacitance
constraints. Lagrangian relaxation (LR) based optimization has been established as
a viable approach for this. We extend LR-based optimization by interleaving in each
iteration techniques such as: gate and flip-flop sizing; buffering to fix late and
early timing violations; pin swapping; and useful clock skew. Locally optimal decisions
are made using LR-based cost functions, without the need for incremental timing updates.
Sub-steps are applied in a balanced manner, accounting for the expected savings and
any conflicting timing violations, maximizing the final quality of results under multiple
process/operating corners with a reasonable runtime. Experimental results show that
our approach achieves better timing, and both lower area and leakage power than the
winner of the TAU 2019 contest, on those benchmarks.

Selective Sensor Placement for Cost-Effective Online Aging Monitoring and Resilience

  • Hao-Chun Chang
  • Li-An Huang
  • Kai-Chiang Wu
  • Yu-Guang Chen

Aggressive technology scaling trends, such as thinner gate oxide without proportional
downscaling of supply voltage, aggravate the aging impact and thus necessitate an
aging-aware reliability verification and optimization framework during early design
stages. In this paper, we propose a novel in-situ sensing strategy based on deploying
transition detectors (TDs), for on-chip aging monitoring and resilience. Transformed
into the set cover problem and then formulated into maximum satisfiability, the proposed
problem of TD/sensor placement can be solved efficiently. Experimental results show
that, by introducing at most 2.2% area overhead (for TD/sensor placement), the aging
behavior of a target circuit can be effectively monitored, and the correctness of
its functionality can be perfectly guaranteed with an average of 77% aging resilience
achieved. In other words, with 2.2% area overhead, potential aging-induced timing
errors can be detected and then eliminated, while achieving 77% recovery from aging-induced
performance degradation.
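
As a rough illustration of the set cover view mentioned above, the following Python sketch greedily picks sensor sites until every monitored path is covered. The greedy heuristic is a stand-in chosen for brevity: the abstract states that the authors formulate the problem as maximum satisfiability instead. The function name and the toy paths and sites are hypothetical.

```python
def greedy_sensor_cover(paths, candidates):
    """Pick candidate sites until every path is covered (greedy set cover).

    paths: set of path identifiers that must be monitored.
    candidates: dict mapping a candidate site to the set of paths it observes.
    """
    uncovered = set(paths)
    chosen = []
    while uncovered:
        # Take the site observing the most still-uncovered paths;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda s: len(candidates[s] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            raise ValueError("some paths cannot be covered by any candidate site")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical toy instance: five aging-critical paths, four candidate sites.
toy_paths = {"p1", "p2", "p3", "p4", "p5"}
toy_sites = {
    "s1": {"p1", "p2", "p3"},
    "s2": {"p2", "p4"},
    "s3": {"p3", "p4", "p5"},
    "s4": {"p5"},
}
selected = greedy_sensor_cover(toy_paths, toy_sites)
```

On this toy instance two sites suffice. Greedy selection gives no optimality guarantee in general, which is one reason an exact formulation such as the one named in the abstract can be preferable.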

SESSION: Session 5: Timing and Clocking

Session details: Session 5: Timing and Clocking

  • Evangeline Young

Synthesis of Clock Networks with a Mode Reconfigurable Topology and No Short Circuit
Current

  • Necati Uysal
  • Juan Ariel Cabrera
  • Rickard Ewetz

Circuits deployed in the Internet of Things operate in low and high performance modes
to cater to variable frequency and power requirements. Consequently, the clock networks
for such circuits must be synthesized meeting drastically different timing constraints
under variations in the different modes. The overall power consumption and robustness
to variations of a clock network are determined by the topology. However, state-of-the-art
clock networks use the same topology in every mode, even though the timing constraints
in the low and high performance modes are very different. In this paper, we propose
a clock network with a mode reconfigurable topology (MRT) for circuits with positive-edge
triggered sequential elements. In high performance modes, the required robustness
to variations is provided by reconfiguring the MRT structure into a near-tree. In
low performance modes, the MRT structure is reconfigured into a tree to save power.
Non-tree (or near-tree) structures provide robustness to variations by appropriately
constructing multiple alternative paths from the clock source to the clock sinks,
which neutralizes the negative impact of variations. In MRT structures, OR-gates are
used to join multiple alternative paths into a single path. Consequently, the MRT
structures consume no short circuit power because there is only one gate driving each
net. Moreover, it is straightforward to reconfigure MRT structures into a tree by
gating the clock signal in part of the structure. Compared with state-of-the-art near-tree
structures, MRT structures have 8% lower power consumption and similar robustness
to variations in high performance modes. In low performance modes, the power consumption
is 16% smaller when reconfiguration is used.

Timing Driven Partition for Multi-FPGA Systems with TDM Awareness

  • Sin-Hong Liou
  • Sean Liu
  • Richard Sun
  • Hung-Ming Chen

Multi-FPGA systems are a popular approach to achieving hardware acceleration with the
scalability to accommodate large designs. To overcome the connectivity constraint
between each pair of FPGAs, time-division multiplexing (TDM) is adopted at the expense
of additional delay that dominates the performance of multi-FPGA system based emulators.
To the best of our knowledge, there is no prior work on partitioning for multi-FPGA
systems that considers hardware configuration and the impact of TDM. This work proposes
a partitioning methodology to improve timing performance for multi-FPGA systems. The delay
introduced by TDM is estimated and optimized using a look-up table for better efficiency.
Our experimental results show a 43% improvement in maximum delay when considering both
hardware configuration and the impact of TDM, compared with a cut-driven partitioning approach.

SESSION: Session 6: Machine Learning for Physical Design (part 2)

Session details: Session 6: Machine Learning for Physical Design (part 2)

  • Ismail Bustany

Understanding Graphs in EDA: From Shallow to Deep Learning

  • Yuzhe Ma
  • Zhuolun He
  • Wei Li
  • Lu Zhang
  • Bei Yu

As the scale of integrated circuits keeps increasing, there has been
a surge in electronic design automation (EDA) research to make technology
node scaling happen. Graphs are of great significance in this evolution since
they are one of the most natural abstractions for many fundamental objects in EDA
problems, such as netlists and layouts, and hence many EDA problems are essentially graph
problems. Traditional approaches for solving these problems are mostly based on analytical
solutions or heuristic algorithms, which require substantial efforts in designing
and tuning. With the emergence of the learning techniques, dealing with graph problems
with machine learning or deep learning has become a potential way to further improve
the quality of solutions. In this paper, we discuss a set of key techniques for conducting
machine learning on graphs. Particularly, a few challenges in applying graph learning
to EDA applications are highlighted. Furthermore, two case studies are presented to
demonstrate the potential of graph learning on EDA applications.

TEMPO: Fast Mask Topography Effect Modeling with Deep Learning

  • Wei Ye
  • Mohamed Baker Alawieh
  • Yuki Watanabe
  • Shigeki Nojima
  • Yibo Lin
  • David Z. Pan

With the continuous shrinking of the semiconductor device dimensions, mask topography
effects stand out among the major factors influencing the lithography process. Including
these effects in the lithography optimization procedure has become necessary for advanced
technology nodes. However, conventional rigorous simulation for mask topography effects
is extremely computationally expensive for high accuracy. In this work, we propose
TEMPO as a novel generative learning-based framework for efficient and accurate 3D
aerial image prediction. At its core, TEMPO comprises a generative adversarial network
capable of predicting aerial image intensity at different resist heights. Compared
to the default approach of building a unique model for each desired height, TEMPO
takes as one of its inputs the desired height to produce the corresponding aerial
image. In this way, the global model in TEMPO can capture the shared behavior among
different heights, thus resulting in a smaller model size. Besides, across-height information
sharing results in better model accuracy and generalization capability. Our experimental
results demonstrate that TEMPO can obtain up to 1170x speedup compared with rigorous
simulation while achieving satisfactory accuracy.

DRC Hotspot Prediction at Sub-10nm Process Nodes Using Customized Convolutional Network

  • Rongjian Liang
  • Hua Xiang
  • Diwesh Pandey
  • Lakshmi Reddy
  • Shyam Ramji
  • Gi-Joon Nam
  • Jiang Hu

As the semiconductor process technology advances into sub-10nm regime, cell pin accessibility,
which is a complex joint effect from the pin shape and nearby blockages, becomes a
main cause for DRC violations. Therefore, a machine learning model for DRC hotspot
prediction needs to consider both very high-resolution pin shape patterns and low-resolution
layout information as input features. A new convolutional neural network technique,
J-Net, is introduced for the prediction with mixed resolution features. This is a
customized architecture that is flexible for handling various input and output resolution
requirements. It can be applied at placement stage without using global routing information.
This technique is evaluated on 12 industrial designs at 7nm technology node. The results
show that it can improve true positive rate by 37%, 40% and 14% respectively, compared
to three recent works, with similar false positive rates.

SESSION: Keynote 3

Session details: Keynote 3

  • Iris Hui-Ru Jiang

Physical Verification at Advanced Technology Nodes and the Road Ahead

  • Juan C. Rey

In spite of “doomsday” expectations, Moore’s Law is alive and well. Semiconductor
manufacturing and design companies, as well as the Electronic Design Automation (EDA)
industry have been pushing ahead to bring more functionality to satisfy more aggressive
space/power/performance requirements.

Physical verification occupies a unique space in the ecosystem as one of the key bridges
between design and manufacturing. As such, the traditional space of design rule checking
(DRC) and layout versus schematic (LVS) have expanded into electrical verification
and yield enabling technologies such as optical proximity correction, critical area
analysis, multi-patterning decomposition and automated filling.

To achieve the expected accuracy and performance demanded by the design and manufacturing
community, it is necessary to consider the physical effects of the manufacturing processes
and electronic devices and to use the most advanced software engineering technology
and computational capabilities.

SESSION: Session 8: ISPD 2020 Contest Results and Poster Presentations

Session details: Session 8: ISPD 2020 Contest Results and Poster Presentations

  • Marvin Tom

ISPD 2020 Physical Mapping of Neural Networks on a Wafer-Scale Deep Learning Accelerator

  • Michael James
  • Marvin Tom
  • Patrick Groeneveld
  • Vladimir Kibardin

This paper introduces a special case of the floorplanning problem for optimizing neural
networks to run on a wafer-scale computing engine. From a compute perspective, neural
networks can be represented by a deeply layered structure of compute kernels. During
the training of a neural network, gradient descent is used to determine the weight
factors. Each layer then uses a local weight tensor to transform “activations” and
“gradients” that are shared among connected kernels according to the topology of the
network. This process is computationally intensive and requires high memory and communication
bandwidth. Cerebras has developed a novel computer system designed for this work that
is powered by a 21.5cm by 21.5cm wafer-scale processor with 400,000 programmable compute
cores. It is structured as a regular array of 633 by 633 processing elements, each
with its own local high bandwidth SRAM memory and direct high bandwidth connection
to its neighboring cores. In addition to supporting traditional execution models for
neural network training and inference, this engine has a unique capability to compile
and compute every layer of a complete neural network simultaneously. Mapping a neural
network in this fashion onto Cerebras’ Wafer-Scale Engine (WSE) is reminiscent of
the traditional floorplanning problem in physical design. A kernel ends up as a rectangle
of x by y compute elements. These are the flexible blocks that need to be placed to
optimize performance. This paper describes an ISPD 2020 challenge to develop algorithms
and heuristics that produce compiled neural networks that achieve the highest possible
performance on the Cerebras WSE.
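
To make the floorplanning analogy concrete, the sketch below checks that a set of kernel rectangles fits on a 633 by 633 grid of processing elements without overlap (633 × 633 = 400,689, consistent with the roughly 400,000 cores quoted above). It covers only basic geometric legality. The contest's actual input format and performance objective are not described in this abstract, so the function names, the data layout, and the example kernels are all assumptions.

```python
from itertools import combinations

FABRIC_SIDE = 633  # processing elements per side, as stated in the abstract


def overlaps(a, b):
    """Return True if rectangles a and b, given as (x, y, width, height), share any element."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def is_legal(placement, side=FABRIC_SIDE):
    """Check that every kernel rectangle lies on the fabric and no two overlap."""
    rects = list(placement.values())
    inside = all(
        x >= 0 and y >= 0 and w > 0 and h > 0 and x + w <= side and y + h <= side
        for x, y, w, h in rects
    )
    return inside and not any(overlaps(a, b) for a, b in combinations(rects, 2))


# Hypothetical three-kernel placement: name -> (x, y, width, height).
example = {
    "conv1": (0, 0, 300, 633),
    "conv2": (300, 0, 333, 400),
    "fc": (300, 400, 333, 233),
}
legal = is_legal(example)
```

A real solution would also have to choose each kernel's x by y shape and score the result. This check is only the feasibility test such a search would call repeatedly.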

PhD Forum CFP

Call for Participation

ACM SIGDA/IEEE CEDA Ph.D. Forum at DAC 2021 https://www.sigda.org/sigda-events/phd-forum/

DAC Ph.D. Forum

Virtual | Dec 6

The Ph.D. Forum at the Design Automation Conference is a poster session hosted by ACM SIGDA for Ph.D. students to present and discuss their dissertation research with people in the EDA community. It has become one of the premier forums for Ph.D. students in design automation to get feedback on their research. It enables industry and academic attendees to see the latest top academic work and meet the best graduating students in one place. Participation in the forum is decided through a scientific evaluation by an expert committee drawn from academia and industry. The forum is open to all members of the design automation community and is free of charge. It is virtually co-located with DAC; DAC registration is not required in order to attend this event.

Program

Session I – December 6, 2021 10AM-11AM Pacific Time

1.1 | M.SALAH: Mechanism for Simulation-Assisted Layout PArtitioning and Analysis of Hotspots | Sherif Mousa
1.2 | Fully Automated High Power Amplifier Design: From Transistor Selection to Post-layout Generation | Lida Kouhalvandi
1.3 | Designing Data-Aware Network-on-Chip for Performance | Abhijit Das
1.4 | Pre and Post Silicon Verification Techniques for Analog and Mixed Signal Circuits | Sayandeep Sanyal
1.5 | Ultra-Fast Temperature Estimation Methods for Architecture-Level Thermal Modeling | Hameedah Sultan
1.6 | Hardware-Software Co-Design for Emerging Workloads | Diksha Moolchandani
1.7 | Leakage Aware Dynamic Thermal Management for 3D Memory Architectures | Lokesh Siddhu
1.8 | Architectural-Space Exploration of Energy-Efficient Approximate Arithmetic Units for Error-Tolerant Applications | Haroon Waris
1.9 | Novel Attack and Defense Strategies for Enhanced Logic Locking Security | Lilas Alrahis

Session II – December 6, 2021 11AM-Noon Pacific Time

2.1 | Proving Correctness of Industrial Multipliers using Symbolic Computer Algebra | Alireza Mahzoon
2.2 | Resilience and Energy-Efficiency for Deep Learning and Spiking Neural Networks for Embedded Systems | Rachmad Vidya Wicaksana Putra
2.3 | Robust and Energy-Efficient Deep Learning Systems | Muhammad Abdullah Hanif
2.4 | Personalized Deep Learning for Patient-Specific Physiological Monitoring in IoT | Zhenge Jia
2.5 | Network-on-Chip Performance Analysis and Optimization for Deep Learning Applications | Sumit Kumar Mandal
2.6 | Efficient, Mixed Precision In-Memory Deep learning at the Edge | Shamma Nasrin
2.7 | Design of ML-based and Open Source EDA for Power Delivery Network Synthesis and Analysis | Vidya A. Chhabria
2.8 | Cross-Layer Techniques for Energy-Efficiency and Resiliency of Advanced Machine Learning Architectures | Alberto Marchisio
2.9 | Machine Learning Algorithms in Electronics Design Automation | Zhiyao Xie
2.10 | Efficient Stochastic Computing Machine Learning Acceleration at the Edge | Wojciech Romaszkan

Session III – December 6, 2021 12PM-1PM Pacific Time

3.1 | Designing Obfuscated Systems for Enhanced Hardware-Oriented Security | Michael Zuzak
3.2 | Designing Approximate Accelerators, Automatically | Jorge Castro-Godínez
3.3 | Breaking the Energy Cage of Insect-scale Autonomous Drones: Interplay of Probabilistic Hardware and Co-designed Algorithms | Priyesh Shukla
3.4 | Modeling and Optimization of Next-Generation AI Accelerators under Uncertainties | Sanmitra Banerjee
3.5 | Hardware-Software Codesign of Silicon Photonic AI Accelerators | Febin Sunny
3.6 | High-performance Spectral Methods for Hypergraphs | Ali Aghdaei
3.7 | Low-Power Unary Computing Architecture | Di Wu
3.8 | Intrinsic Authentication at IoT Edge Nodes using Spatial and Temporal Signatures | Ahish Shylendra
3.9 | Secure and Usable Zero-interaction Pairing and Authentication Methods for the Internet-of-Things | Kyuin Lee
3.10 | Energy-Quality Scalable Hardware and Software Solutions for Energy-Efficient Approximate Computing | Setareh Behroozi

Eligibility

  • Dissertation topic must be relevant to the DAC community.
  • Students with at least one published or accepted conference, symposium or journal paper.
  • Students within 1-2 years of dissertation completion and students who have completed their dissertation during the 2020-2021 academic year. Students closer to graduation will have higher priority since the rest of the students can attend a future Ph.D. Forum with more mature results.
  • Students who have presented previously at the DATE and ASP-DAC Ph.D. forums are eligible, but will be less likely to receive travel assistance.
  • Previous DAC SIGDA Ph.D. forum presenters are not eligible.

Important Dates

Submission Requirements

  • A two-page PDF abstract of the dissertation (in two-column format, using 10-11 pt. fonts and single-spaced lines), including name, institution, advisor, contact information, estimated (or actual) graduation date, whether the work has been presented at ASP-DAC Ph.D. Forum or DATE Ph.D. Forum, as well as figures, and bibliography (if applicable). The two-page limit on the abstract will be strictly enforced: any material beyond the second page will be truncated before sending to the reviewers. Please include a description of the supporting paper, including the publication forum. A list of all papers authored or co-authored by the student, related to the dissertation topic and included in the two-page abstract, will strengthen the submission.
  • A published (or accepted) paper, in support of the submitted dissertation abstract. The paper must be related to the dissertation topic and the publication forum must have a valid ISBN number. It will be helpful, but is not required, to include your name and the publication forum on the first page of the paper. Papers on topics unrelated to the dissertation abstract or not yet accepted will not be considered during the review process.

Please Note:

  • The abstract is the key part of your submission. Write the abstract for someone familiar with your technical area, but entirely unfamiliar with your work. Clearly indicate the motivation of your Ph.D. dissertation topic, the uniqueness of your approach, as well as the potential impact your approach may have on the topic.
  • At the beginning of the abstract, please indicate the track to which your submission belongs.
  • Proper spelling, grammar, and coherent organization are critical: remember that the two pages may be the only information about yourself and your PhD research available to the reviewers.
  • All submissions must be made electronically.
  • Please include the supporting paper with the abstract in one PDF file and submit the single file. There are many free utilities available online which can merge multiple PDF files into a single file if necessary.

Topics of Interest (not limited to)

  1. System-level Design, Synthesis and Optimization (including network-on-chip, system-on-chip and multi/many-core, HW/SW co-design, embedded software issues, modeling and simulation)
  2. Internet of Things (IoT)
  3. Autonomous Systems
  4. High Level Synthesis, Logic Level Synthesis
  5. Power and Reliability Analysis and Optimization (including power management from system level to circuit level, thermal management, process variability management)
  6. Timing Analysis, Circuit and Interconnect Simulation
  7. Physical Design and Manufacturability
  8. Signal Integrity and Design Reliability
  9. Verification, Testing, Pre- and Post-Silicon Validation, Failure Analysis
  10. Reconfigurable and Adaptive Systems
  11. Analog/Mixed Signals and RF
  12. Hardware Security
  13. Machine learning/AI
  14. Printable and flexible hybrid electronics (FHE)
  15. Emerging Design, Technologies, and Computing Methods (carbon nanotubes, molecular electronics, MEMS, microfluidic system, biologically-inspired systems, quantum computing, etc.)

Contact Information

For questions not addressed on this page, please send e-mail to Dr. Topaloglu (rasit@us.ibm.com). Please include “DAC Ph.D Forum” in the subject line of your email.

Organizing Committee

Rasit Topaloglu, IBM (Chair)

Iris Hui-Ru Jiang, National Taiwan University

Robert Wille, Johannes Kepler University, Linz

Jingtong Hu (SIGDA Representative), University of Pittsburgh

Program Committee

Cristinel Ababei, Marquette University
Raid Ayoub, Intel
Ateet Bhalla, Independent Technology Consultant, India
Rajat Subhra Chakraborty, Associate Professor, Dept. of CSE, IIT Kharagpur
Xiaoming Chen, Institute of Computing Technology, Chinese Academy of Sciences
Li Du, University of California, Los Angeles
Shao-Yun Fang, National Taiwan University of Science and Technology
Hui-Ru Jiang, National Taiwan University
Jinwook Jung, IBM
Ryan Kim, Colorado State University
Myung-Chul Kim, Google
Younghyun Kim, University of Wisconsin-Madison
Bing Li, Technical University of Munich
Preeti Panda, Indian Institute of Technology Delhi
Sudeep Pasricha, Colorado State University
Rahul Rao, IBM
Emre Salman, Stony Brook University
Hassan Salmani, Howard University
Korkut Tokgoz, Tokyo Institute of Technology
Rasit Onur Topaloglu, IBM
Miroslav Velev, Aries Design Automation
Robert Wille, Johannes Kepler University Linz
Hua Xiang, IBM
Jiang Xu, Hong Kong University of Science and Technology
Cindy Yang Yi, Virginia Tech
Wei Zhang, The Hong Kong University of Science and Technology

FPGA 2020 TOC

FPGA ’20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

SESSION: Morning Tutorial Session

Invited Tutorial: Dynamatic: From C/C++ to Dynamically Scheduled Circuits

  • Lana Josipović

High-level synthesis tools, both commercial and academic, typically rely on static
scheduling to produce high-throughput pipelines. However, in applications with unpredictable
memory accesses or irregular control flow, these tools need to make pessimistic scheduling
assumptions. In contrast, dataflow circuits implement dynamically scheduled circuits,
in which components communicate locally using a handshake mechanism and exchange data
as soon as all conditions for a transaction are satisfied. Due to their ability to
adapt the schedule at runtime, dataflow circuits are suitable for handling irregular
and control-dominated code. This paper describes Dynamatic, an open-source HLS framework
which generates synchronous dataflow circuits out of C/C++ code. The purpose of this
paper is to give an introductory overview of Dynamatic and demonstrate some of its
use cases, in order to enable others to use the tool and participate in its development.
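
The handshake behaviour described in this abstract can be illustrated with a small cycle-level model. This is only a sketch under stated assumptions: the valid/ready signal names, the function name, and the token values are hypothetical and are not taken from Dynamatic itself.

```python
from collections import deque


def simulate_handshake(tokens, valid_pattern, ready_pattern):
    """Return (cycle, token) pairs for every cycle in which a transfer fires.

    Assumed protocol: a token moves only when the producer asserts valid
    AND the consumer asserts ready in the same cycle; otherwise it waits.
    """
    pending = deque(tokens)
    transfers = []
    for cycle, (valid, ready) in enumerate(zip(valid_pattern, ready_pattern)):
        if not pending:
            break  # nothing left to send
        if valid and ready:
            transfers.append((cycle, pending.popleft()))
    return transfers


# Hypothetical traffic: the consumer stalls in cycle 0, the producer in cycle 2.
transfers = simulate_handshake([10, 20, 30], [1, 1, 0, 1, 1], [0, 1, 1, 1, 1])
print(transfers)  # [(1, 10), (3, 20), (4, 30)]
```

No fixed schedule is computed ahead of time: each token is exchanged in the first cycle in which both sides are ready, which is the property that lets such circuits adapt to irregular control flow.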

Invited Tutorial: FPGA Hardware Security for Datacenters and Beyond

  • Kaspar Matas

Since FPGAs are now available in datacenters to accelerate applications, providing
FPGA hardware security is a high priority. FPGA security is becoming more serious
with the transition to FPGA-as-a-Service where users can upload their own bitstreams.
Full control over FPGA hardware through the bitstream enables attacks to weaken an
FPGA-based system. These include physically damaging the FPGA equipment and leaking
of sensitive information such as the secret keys of crypto algorithms. While there
are no known attacks in commercial settings so far, it is not so much a question
of if but of when. The tutorial will show concrete attacks applicable to datacenter
FPGAs. The goal of this tutorial is to prepare the FPGA community for impending security
issues in order to pave the way for proactive security. First, we will give a tour through
the FPGA hardware security jungle surveying practical attacks and potential threats.
We will reinforce this with live demos of denial of service attacks. Less than 10%
of the logic resources on an FPGA can draw enough dynamic power to crash a datacenter
FPGA card. In the second part of the tutorial, we will show different mitigations
that are either vendor supported or proposed by the academic community. In summary,
the tutorial will communicate that while FPGA hardware security is complicated to
bring about, there are acceptable solutions for known FPGA security problems.

SESSION: Invited Session: Security in FPGA Design and Application

Session details: Invited Session: Security in FPGA Design and Application

  • Ryan Kastner

Establishing Trust in Microelectronics

  • Lee W. Lerner

In recent years, substantial attention has been drawn to vulnerabilities in the architectural
design of microelectronics, as well as the security of their global supply chains.
In reality, establishing trust in microelectronics requires broader considerations,
from verification of the software leveraged to implement hardware designs, to analyzing
third-party intellectual property cores, all the way to run-time design assurance
and periodic device screening post-deployment. These concerns are relevant to stakeholders
at all levels, from small independent design houses all the way to multi-national
strategic interests. One notable example of the latter is the U.S. Department of Defense’s
Trusted and Assured Microelectronics (T&AM) program, which seeks assured access to
state-of-the-art foundries through modern trust and assurance methods and demonstrations
[1]. This talk will describe research efforts at the Georgia Tech Research Institute
centered around providing assurance of FPGAs. Current research thrusts include the
development of verification techniques at multiple stages of the design process, including
vendor design software execution, implementation of user designs, and even the operation
of the underlying physical device hardware itself. For example, to address trust in
synthesis and implementation of high-level user source code, we discuss the development
of canary circuits which are compiled alongside user design circuits and can be independently
inspected and verified to ensure adherence to user-defined implementation rules. Additionally,
we discuss one avenue for providing trust in vendor hardware devices through our development
of Independent Functional Test (IFT) suites.

Thermal and Voltage Side and Covert Channels and Attacks in Cloud FPGAs

  • Jakub Szefer

Cloud FPGAs have been gaining interest in recent years due to the ability of users
to request FPGA resources quickly, flexibly, and on-demand. In addition to the existing
single-tenant deployments, where each user gets access to a whole FPGA, recent
academic proposals have looked at creating multi-tenant deployments, where multiple
users share a single FPGA, e.g., [3]. In both settings, there is a large amount of
infrastructure and physical resources that are shared among users. Sharing of the
physical resources in data centers and processors is well known to lead to potential
attacks, e.g., [4]. However, only recently have there been demonstrations of various
security attacks that our group and others have shown to be possible in the Cloud FPGA
setting, e.g., [5].

This talk will discuss Cloud FPGA security from the perspective of side and covert
channel attacks that arise due to these shared resources. It will first cover our
recent work on thermal channels that can be used to create covert channels between
users renting the same FPGA over time [5]. These channels can create a stealthy communication
medium for leaking small amounts of sensitive information, e.g., cryptographic keys.
As defense strategies, the talk will point out possible solutions at the system level
and at the hardware level. At the system level, adding delays between when different
users can access the same FPGA, or preventing users from being able to identify unique
FPGA instances can mitigate the threats, but do increase overhead. At the hardware
level, additional cooling to erase thermal information after a user uses an FPGA,
or new sensors to monitor FPGAs and generate an alert when excessive heat is detected
are possible solutions that will be discussed.

The talk will also discuss recent work on voltage-based attacks that leverage custom
circuits instantiated inside the FPGAs to measure voltage changes. Voltage-based channels
can be used to leak sensitive information across FPGAs (in single-tenant or multi-tenant
settings) [2], or can be combined with other existing attacks to perform cross-talk
leakage inside the FPGAs (in multi-tenant settings) [1]. These attacks highlight the
power of attackers when they are able to synthesize any circuit into a shared FPGA
environment. Furthermore, even with certain restrictions on the types of designs that
can be synthesized, this talk will show how attacks can be deployed. As defense strategies,
the talk will point out possible new design check rules that can be used by Cloud
FPGA providers.

In light of the attacks and defenses, Cloud FPGA security remains a cat-and-mouse
game. There is then the foremost need to better understand the existing and potential
attacks — to design defenses and deploy them before malicious users try to launch
such attacks. Only with a proper understanding of the possible FPGA attacks can secure
Cloud FPGAs be created.

Multi-tenant FPGA Security: Challenges and Opportunities

  • Patrick Koeberl

An emerging trend in the data center is the at-scale deployment of Field-Programmable
Gate Arrays (FPGAs) which combine multi-gigabit and ultra-low latency workload acceleration
with hardware-level reconfigurability. In particular, for applications such as Deep
Learning where techniques and algorithms are rapidly changing, the inherent flexibility
of FPGAs grants them an edge over hardened data processing units such as ASICs or
GPUs. Inevitably, Cloud Service Providers (CSPs) will seek to maximize resource utilization
for their FPGA investments as they currently do for general-purpose computing resources.
Current FPGA deployments in the data center tend to be single-tenant or support multiple
tenants through time multiplexing (temporal multi-tenancy) which can result in resource
underutilization. This approach does not scale and presents challenges for elastic
workloads whose properties are not fully known ahead of time. Instead, we expect that
closing the resource utilization gap will require efficient spatial allocation of
FPGA resources across multiple tenants while maintaining security and QoS guarantees.
In particular, new usage models such as FPGA-as-a-Service, where resources are exposed
directly to the cloud tenant, present unique challenges on the security and QoS side.
In this talk we review the threat landscape and trust models associated with FPGA
multi-tenancy, highlight future research challenges and examine the unique opportunities
that FPGA multi-tenancy enables given adequate guarantees on security and QoS.

FPGA / SoC Security: Arms Race in the Cloud

  • Steven McNeil

Technology and cost are motivating more and more developers to put more and more of
their “secret sauce” in programmable logic. This is great for the consumer as it opens
the market to the smaller players. However, it also opens the market for IP theft;
after all, why spend years making something yourself if you can just pilfer it and
re-brand it as your own. A sad statement for sure but it is the reality of the world
we live in. Worsening the situation is the fact that FPGAs and SoCs are starting
to become the anchor for the security of larger systems. This now brings in another
set of bad guys; ones that are tech savvy and armed with lab equipment. They are not
looking for your IP but are looking to break into the system your IP is protecting.
The number of adversaries is growing just as fast as the markets for the devices themselves
and is quickly becoming an arms race.

The first volley was fired when the adversaries started reverse engineering the programming
files (bitstreams) so we added bitstream encryption (3-DES at the time). However,
commercial computational power rendered that algorithm obsolete, so we moved to AES-256.
The adversary gave up attacking the algorithm and started going after the key itself,
so we added physical protections. Failing to break into the device they opted for
less physically invasive attacks such as Differential Power Analysis, so we added
authentication before decryption and key rolling. Frustrated with these and other
protections the adversary went back to physical attacks. This was aided by the ever-expanding
capabilities of Failure Analysis which needs the same equipment to understand device
failures as attackers need to break into devices. They started with circuit probing
and edits using the Focused Ion Beam (FIB). This allowed them to disable security features
or tap into the key space. To counter this, redundancy and circuit obfuscation techniques
were added. They then switched to less destructive imaging methods which forced us
to levy system requirements (detection and prevention techniques) on the customer;
costly but effective.

The security of FPGAs and SoCs has greatly evolved over the years but the next battlefield
in the arms race is on the horizon: FPGAs / SoCs in the cloud. Most of the security
in modern day devices primarily considers an attacker on the outside of the device
with close physical access. As such, most of the mitigations make a similar assumption.
However, devices in the cloud create an entirely new form of device warfare: remote
attacks on FPGAs and SoCs. This is not an issue of hacking software; that has been
around since before the internet. It is about the number of devices in the cloud that
are programmable and, therefore, hackable. Side channel attacks such as row-hammer
(1) and CLKscrew (2) show that if there are secrets and some level of device access,
the attackers will find a way to exploit it. In this modern era, security engineers
are going to have to look for adversaries where they would never expect them; inside
the system.

SESSION: Panel

Session details: Panel

  • Andrew Putnam

What To Do With Datacenter FPGAs Besides Deep Learning

  • Andrew Putnam

FPGAs have been deployed in datacenters worldwide and are now available for use
in both public and private clouds. Enormous focus has been given to optimizing machine
learning workloads for FPGAs, especially for deep neural networks (DNNs) in areas
like web search, image classification, and translation. However, major cloud applications
encompass a variety of areas that aren’t primarily machine learning workloads, including
databases, video encoding, text processing, gaming, bioinformatics, productivity and
collaboration, file hosting and storage, e-mail, and many more. While machine learning
can certainly play a role in each of these areas, is there more that can be done to
accelerate these more traditional workloads? Even more challenging than identifying
promising workloads is figuring out how developers can practically create and deploy
useful applications using FPGAs to the cloud. While FPGAs-as-a-Service allow access
to FPGAs in the cloud, there is a huge gap between raw programmable hardware and a
customer paying money to use an application powered by that hardware. A wide variety
of FPGA IP exists for developers to use, but individual IP blocks are a long way from
being a fully functional cloud application. Building block IPs like Memcached, regex
matching, protocol parsing, and linear algebra are only a subset of the necessary
functionality for full cloud applications. Developing or acquiring IP and integrating
it into a full application that customers will pay for is a significant task. And
even when a customer pays, how should the money be distributed between IP vendors?
Should it be a one-time fee? By usage? By number of FPGAs deployed? Who should have
the burden for support if something goes wrong? In traditional cloud applications,
FPGA IP block functions are implemented in software libraries. However, few examples
of optimized software libraries are commercially successful, so is selling FPGA IP
even a viable commercial model for cloud applications? High-level synthesis (HLS)
tools promise to provide one path to enable software developers to make effective
use of FPGAs for computing tasks, but are any tools really capable of accelerating
cloud-scale applications? Many HLS tools require substantial microarchitectural guidance
in the form of pragmas or configuration files to come out with good results. Real
cloud applications also rarely have a single dominant function and have significant
data movement, so without proper partitioning and tuning, the acceleration gains from
the FPGA are quickly wiped out by data movement and Amdahl’s Law. This panel will
gather experts in using FPGAs for cloud application areas beyond machine learning
to discuss how those applications can be built and successfully deployed. We will cover topics
such as:

  • What are the most important cloud workloads for FPGAs to target besides machine learning?
  • Are there specific changes to the FPGA architecture that would benefit these cloud applications?
  • What are the economic models that will work for IP developers, application developers, and cloud providers?
  • How can we make development of FPGA applications easier for the Cloud?
  • Will open-source IP make it impossible for IP vendors to make commercially successful libraries?
  • What advances are necessary for HLS tools to be practical in the Cloud?

The panel is composed of experts in applications,
IP development, and cloud deployment. Each will give a short presentation of what
they find as the most important applications and how they see FPGA development for
the cloud going forward, then we will open the floor to an interactive discussion
with the audience.
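
The Amdahl's Law point raised in this abstract can be made concrete with a short calculation. The sketch below is illustrative only: the function name and the 60% / 20x figures are assumptions chosen for the example, not numbers from the panel.

```python
def amdahl_speedup(accelerated_fraction: float, accelerator_speedup: float) -> float:
    """Overall speedup when only a fraction of the runtime is accelerated.

    Amdahl's Law: S = 1 / ((1 - f) + f / s), where f is the fraction of the
    original runtime that is offloaded and s is the speedup of that part.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if accelerator_speedup <= 0.0:
        raise ValueError("accelerator_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / accelerator_speedup)


# Hypothetical workload: 60% of the runtime is offloaded and runs 20x faster.
overall = amdahl_speedup(0.6, 20.0)
print(f"{overall:.2f}x")  # 2.33x
```

Even with an arbitrarily fast accelerator, this hypothetical workload cannot exceed 1 / 0.4 = 2.5x, which is why applications without a single dominant function see limited gains unless more of the runtime is partitioned onto the FPGA.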

SESSION: Session: Keynote I

Session details: Session: Keynote I

  • Lesley Shannon

Symbiosis in Action: Reconfigurable Architectures and EDA

  • Mahesh A. Iyer

Spatial compute architectures, like Field Programmable Gate Arrays (FPGAs), constitute
a key architectural pillar in modern heterogeneous compute platforms. Spatial architectures
need a sophisticated Electronic Design Automation (EDA) compiler to optimally map
and fit a user’s workload/design onto the underlying spatial device. This EDA compiler
not only helps users to custom-configure the spatial device but is also critically
required for architectural exploration of new spatial architectures. The FPGA industry
has had a long history of innovation in this symbiotic relationship between EDA and
reconfigurable spatial architectures.

This talk will walk down the memory lane of multiple waves of such innovation, amplifying
how the complexity of EDA technology has not only scaled with Moore’s law scaling
of size and complexity of silicon hardware, but also how it has been pivotal in the
architectural design of modern FPGAs. A general overview of modern FPGA EDA flows
and key differences compared to Application-Specific Integrated Circuit (ASIC) EDA
flows will be discussed. State-of-the-art FPGAs, Stratix® 10 and AgileX™ from Intel
incorporate an advanced register-rich HyperFlex™ architecture that introduces disruptive
optimization opportunities in the EDA compiler. Such physical synthesis optimization
technologies like logic retiming, clock skew optimization, time borrowing, and their
synergies and challenges will be discussed. Solving these challenges enables FPGAs
to achieve non-linear performance improvements.

Logic retiming was first introduced as a powerful sequential design optimization technique
three decades ago, yet gained limited popularity in the ASIC industry, because of
the lack of scalable sequential verification techniques. This talk will highlight
the root causes of this issue and present innovations in retiming technology and constrained
random simulation that allow the successful verification of retimed circuits, thereby
enabling the use of logic retiming for FPGAs.

FPGAs have traditionally targeted Register-Transfer Level (RTL) designers. To enable
wider adoption of FPGAs, Intel has developed several High-Level Design (HLD) tools,
frameworks, libraries, and methodologies, raising the level of programming abstraction.
This talk will provide a glimpse into Intel’s HLD offerings that enable software developers
in the broader ecosystem to leverage FPGAs.

Academic researchers will also be provided with some key research vectors to help
propel the FPGA industry further.

SESSION: Session: High-Level Abstractions and Tools I

Session details: Session: High-Level Abstractions and Tools I

  • Caiwen Ding

Maximizing the Serviceability of Partially Reconfigurable FPGA Systems in Multi-tenant
Environment

  • Tuan D. A. Nguyen

In cloud computing, software is transitioning from monolithic to microservices architecture
to improve the maintainability, upgradability and the flexibility of the applications.
Applications are able to request a service with different implementations of the same functionality,
including hardware accelerators, depending on cost and performance. This model opens
up a new opportunity to integrate reconfigurable hardware, specifically, FPGA, in
the cloud to offer such services. There are many research works discussing solutions
for this problem but they focus primarily on the high-level aspects of resource manager,
hypervisor or hardware architecture. The low-level physical design choices of the FPGA
that maximize the accelerator allocation success rate (called serviceability) are largely
unexplored. In this paper, we propose a design space exploration algorithm to determine
the best configuration of partially reconfigurable regions (PRRs) to host the accelerators.
Besides, the algorithm is capable of estimating the actual resources occupied by the
PRRs on the FPGA even before floorplanning. We systematically study the effects of
having more PRRs on the system in various aspects, i.e., serviceability, waiting time
and resource wastage. The experiments show that at a certain number of PRRs, up to
91% serviceability can be achieved for 12 concurrent users. This is a significant improvement
over the 52% achieved without our approach. The average amount of time that each request has to
wait to be served is also reduced by 6.3X. Furthermore, the cumulative unused FPGA
resources are reduced by almost half.

AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs

  • Pengfei Xu

Recent breakthroughs in Deep Neural Networks (DNNs) have fueled a growing demand for
domain-specific hardware accelerators (i.e., DNN chips). However, designing DNN chips
is non-trivial because: (1) mainstream DNNs have millions of parameters and billions
of operations; (2) the design space is large due to numerous design choices of dataflows,
processing elements, memory hierarchy, etc.; and (3) there is an algorithm/hardware
co-design need for the same DNN functionality to have a different decomposition that
would require different hardware IPs and thus correspond to dramatically different
performance/energy/area tradeoffs. Therefore, DNN chips often take months to years
to design and require a large team of cross-disciplinary experts. To enable fast and
effective DNN chip design, we propose AutoDNNchip – a DNN chip generator that can
automatically produce both FPGA- and ASIC-based DNN chip implementation (i.e., synthesizable
RTL code with optimized algorithm-to-hardware mapping) from DNNs developed by machine
learning frameworks (e.g., PyTorch) for a designated application and dataset without
humans in the loop. Specifically, AutoDNNchip consists of 2 integrated enablers: (1)
a Chip Predictor, which can accurately and efficiently predict a DNN accelerator’s
energy, throughput, latency, and area based on the DNN model parameters, hardware
configurations, technology-based IPs, and platform constraints; and (2) a Chip Builder,
which can automatically explore the design space of DNN chips (including IP selections,
block configurations, resource balancing, etc.), optimize chip designs via the Chip
Predictor, and then generate synthesizable RTL code with optimized dataflows to achieve
the target design metrics. Experimental results show that our Chip Predictor’s predicted
performance differs from real-measured ones by <10% when validated using 15 DNN models
and 4 platforms (edge-FPGA/TPU/GPU and ASIC). Furthermore, DNN accelerators generated
by our AutoDNNchip can achieve better (up to 3.86X improvement) performance than that
of expert-crafted state-of-the-art FPGA- and ASIC-based accelerators, showing the
effectiveness of AutoDNNchip. Our open-source code can be found at https://github.com/RICE-EIC/AutoDNNchip.git.

HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration

  • Jiajie Li

The domain-specific language (DSL) for image processing, Halide, has generated a lot
of interest because of its capability of decoupling algorithms from schedules that
allow programmers to search for optimized mappings targeting CPU and GPU. Unfortunately,
while the Halide community has been growing rapidly, there is currently no way to
easily map the vast number of Halide programs to efficient FPGA accelerators. To tackle
this challenge, we propose HeteroHalide, an end-to-end system for compiling Halide
programs to FPGA accelerators. This system makes use of both algorithm and scheduling
information specified in a Halide program. Compared to the existing approaches, the flow
provided by HeteroHalide is significantly simplified, as it only requires moderate
modifications for Halide programs on the scheduling part to be applicable to FPGAs.
For part of the compilation flow, and to act as the intermediate representation (IR)
of HeteroHalide, we choose HeteroCL, a heterogeneous programming infrastructure which
supports multiple implementation backends (such as systolic arrays and stencil implementations).
By using HeteroCL, HeteroHalide can generate efficient accelerators by choosing different
backends according to the application. The performance evaluation compares the accelerator
generated by HeteroHalide with multi-core CPU and an existing Halide-HLS compiler.
As a result, HeteroHalide achieves 4.15× speedup on average over 28 CPU cores,
and 2~4× throughput improvement compared with the existing
Halide-HLS compiler.

Fingerprinting Cloud FPGA Infrastructures

  • Shanquan Tian

In recent years, multiple public cloud FPGA providers have emerged, increasing interest
in FPGA acceleration of cryptographic, bioinformatic, financial, and machine learning
algorithms. To help understand the security of the cloud FPGA infrastructures, this
paper focuses on a fundamental question of understanding what an adversary can learn
about the cloud FPGA infrastructure itself, without attacking it or damaging it. In
particular, this work explores how unique features of FPGAs can be exploited to instantiate
Physical Unclonable Functions (PUFs) that can distinguish between otherwise-identical
FPGA boards. This paper specifically introduces the first method for identifying cloud
FPGA instances by extracting a unique and stable FPGA fingerprint based on PUFs measured
from the FPGA boards’ DRAM modules. Experiments conducted on the Amazon Web Services
(AWS) cloud reveal the probability of renting the same physical board more than once.
Moreover, the experimental results show that hardware is not shared among f1.2xlarge,
f1.4xlarge, and f1.16xlarge instance types. As the approach used does not violate
any restrictions currently placed by Amazon, this paper also presents a set of defense
mechanisms that can be added to existing countermeasures to mitigate users’ attempts
to fingerprint cloud FPGA infrastructures.
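
To illustrate how a noisy PUF-style fingerprint might be matched against previously seen boards, here is a minimal sketch. It is not the paper's method: the byte-string representation, the fractional Hamming distance metric, the 0.1 threshold, and the board names are all assumptions made for the example.

```python
def hamming_fraction(a: bytes, b: bytes) -> float:
    """Fraction of bit positions in which two equal-length fingerprints differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return differing / (8 * len(a))


def match_board(fingerprint: bytes, known: dict, threshold: float = 0.1):
    """Return the id of the closest known board, or None if none is close enough.

    The threshold is a hypothetical tuning knob that absorbs measurement noise.
    """
    best_id, best_distance = None, float("inf")
    for board_id, reference in known.items():
        distance = hamming_fraction(fingerprint, reference)
        if distance < best_distance:
            best_id, best_distance = board_id, distance
    return best_id if best_distance <= threshold else None


# Hypothetical 16-bit fingerprints for two previously rented boards.
known_boards = {
    "board-A": bytes([0b10101010, 0xFF]),
    "board-B": bytes([0x0F, 0x00]),
}
noisy_reading = bytes([0b10101011, 0xFF])  # one flipped bit relative to board-A
print(match_board(noisy_reading, known_boards))  # board-A
```

A reading that is far from every stored fingerprint returns None, which in this sketch would be interpreted as a board that has not been rented before.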

SESSION: Session: Applications I

Session details: Session: Applications I

  • Miriam Leeser

Massively Simulating Adiabatic Bifurcations with FPGA to Solve Combinatorial Optimization

  • Yu Zou

Combinatorial optimizations are widely adopted in scientific and engineering applications,
such as VLSI design, automated machine learning (AutoML), and compiler design. Combinatorial
optimization problems are notoriously challenging to solve exactly due to their NP-hardness.
Scientists have long discovered that numerically simulating classical nonlinear Hamiltonian
systems can effectively solve many well-known combinatorial optimization problems.
However, such physical simulation typically requires a massive amount of computation,
which even outstrips the logic capability of modern reconfigurable digital fabrics.
In this work, we proposed an FPGA-based general combinatorial optimization problem
solver which achieved ultra-high performance and scalability. Specifically, we first
reformulated a broad range of combinatorial optimization problems with a general graph-based
data structure called the Ising model. Second, instead of utilizing classical simulated
annealing to find an approximate solution, we utilized a new heuristic algorithm,
simulated bifurcation, to search for solutions. Third, we designed an efficient hardware
architecture to fully exploit the FPGA’s potential to accelerate the algorithm, and proposed
three hardware-software co-optimizations to further improve the performance. By experimenting
on benchmarks, our proposal outperformed the state-of-the-art simulated annealing
optimization solver by up to 10.91 times.
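
The reformulation step mentioned in this abstract can be sketched on a toy problem. The example below maps Max-Cut to an Ising model and finds the ground state by exhaustive search; the brute-force search is a stand-in for illustration only and is not simulated bifurcation or the paper's solver. The function names and the four-node graph are assumptions.

```python
from itertools import product


def ising_energy(couplings, spins):
    """E(s) = -sum over edges (i, j) of J_ij * s_i * s_j, with spins in {-1, +1}."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())


def maxcut_to_ising(weighted_edges):
    """Max-Cut maps to an Ising model with J_ij = -w_ij (antiferromagnetic)."""
    return {edge: -weight for edge, weight in weighted_edges.items()}


def brute_force_ground_state(couplings, num_spins):
    """Exhaustively search all 2^n spin assignments (only viable for tiny n)."""
    best = min(
        product((-1, 1), repeat=num_spins),
        key=lambda spins: ising_energy(couplings, spins),
    )
    return best, ising_energy(couplings, best)


# Hypothetical instance: a 4-cycle with unit edge weights.
edges = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 3): 1}
couplings = maxcut_to_ising(edges)
spins, energy = brute_force_ground_state(couplings, 4)

# For this mapping, cut weight = (total edge weight - energy) / 2.
cut_weight = (sum(edges.values()) - energy) / 2
print(spins, energy, cut_weight)  # (-1, 1, -1, 1) -4 4.0
```

The two spin values correspond to the two sides of the cut. Exhaustive search scales as 2^n, which is why heuristic solvers such as simulated annealing or simulated bifurcation are used for realistic problem sizes.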

High-Performance FPGA Network Switch Architecture

  • Philippos Papaphilippou

We present a high-throughput FPGA design for supporting high-performance network switching.
FPGAs have recently been attracting attention for datacenter computing due to their
increasing transceiver count and capabilities, which also benefit the implementation
and refinement of network switches. Our solution replaces the crossbar with
a novel, more pipeline-friendly approach, the “Combined parallel round-robin arbiter”.
It also removes the overhead of incorporating an often-iterative scheduling or matching
algorithm, which sometimes tries to fit too many steps in a single or a few FPGA cycles.
The result is a network switch implementation on FPGAs operating at a high frequency
and with a low port-to-port latency. It also provides a wiser buffer memory utilisation
than traditional Virtual Output Queue (VOQ)-based switches and is able to keep 100%
throughput for a wider range of traffic patterns using a fraction of the buffer memory
and shorter packets.
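
For readers unfamiliar with round-robin arbitration, the sketch below models a textbook single round-robin arbiter in software. It is not the paper's "Combined parallel round-robin arbiter"; the class name, method name, and port count are assumptions chosen for illustration.

```python
class RoundRobinArbiter:
    """Textbook round-robin arbiter: one grant per call, rotating priority."""

    def __init__(self, num_ports: int):
        if num_ports <= 0:
            raise ValueError("num_ports must be positive")
        self.num_ports = num_ports
        self.next_priority = 0  # port that is checked first on the next call

    def grant(self, requests):
        """Return the index of the granted port, or None if nobody requests."""
        if len(requests) != self.num_ports:
            raise ValueError("one request flag per port is required")
        for offset in range(self.num_ports):
            port = (self.next_priority + offset) % self.num_ports
            if requests[port]:
                # The port after the winner gets highest priority next time.
                self.next_priority = (port + 1) % self.num_ports
                return port
        return None


# Hypothetical traffic: ports 0, 2 and 3 request continuously.
arbiter = RoundRobinArbiter(4)
grants = [arbiter.grant([True, False, True, True]) for _ in range(4)]
print(grants)  # [0, 2, 3, 0]
```

Rotating the priority pointer past the last winner is what gives every requesting port a bounded wait, in contrast to a fixed-priority scheme where a busy low-index port could starve the others.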

Using OpenCL to Enable Software-like Development of an FPGA-Accelerated Biophotonic
Cancer Treatment Simulator

  • Tanner Young-Schultz

The simulation of light propagation through tissues is important for medical applications,
such as photodynamic therapy (PDT) for cancer treatment. To optimize PDT, an inverse
problem, which works backwards from a desired distribution of light to the parameters
that caused it, must be solved. These problems have no closed-form solution and therefore
must be solved numerically using an iterative method. This involves running many forward
light propagation simulations which is time-consuming and computationally intensive.

Currently, the fastest general software solver for this problem is FullMonteSW. It
models complex 3D geometries with tetrahedral meshes and uses Monte Carlo techniques
to model photon interactions with tissues. This work presents FullMonteFPGACL: an
FPGA-accelerated version of FullMonteSW using an Intel Stratix 10 FPGA and the Intel
FPGA SDK for OpenCL. FullMonteFPGACL has been validated and benchmarked using several
models and achieves improvements in performance (4x) and energy-efficiency (11x) over
the optimized and multi-threaded FullMonteSW implementation. We discuss methods for
extending the design to improve the performance and energy-efficiency ratios to 16x
and 17x, respectively. We achieved these gains by developing in an agile fashion using
OpenCL to facilitate quick prototyping and hardware-software partitioning. However,
achieving competitive area and performance required careful design of the hardware
pipeline and expression of its structure in OpenCL. This led to a hybrid design style
that can improve productivity when developing complex applications on an FPGA.

Energy-Efficient 360-Degree Video Rendering on FPGA via Algorithm-Architecture Co-Design

  • Qiuyue Sun

360° panoramic video provides an immersive Virtual Reality experience. However, rendering
360° videos consumes excessive energy on client devices. FPGA is an ideal offloading
target to improve the energy-efficiency. However, a naive implementation of the processing
algorithm would lead to an excessive memory footprint that offsets the energy benefit.
In this paper, we propose an algorithm-architecture co-designed system that dramatically
reduces the on-chip memory requirement of VR video processing to enable FPGA offloading.
Evaluation shows that our system is able to achieve significant energy reduction with
no loss of performance compared to today’s off-the-shelf VR video rendering system.

Real-Time Spatial 3D Audio Synthesis on FPGAs for Blind Sailing

  • Anish Singhani

The real-time synthesis of 3D spatial audio has many applications, from virtual reality
to navigation for the visually-impaired. Head-related transfer functions (HRTF) can
be used to generate spatial audio based on a model of the user’s head. Previous studies
have focused on the creation and interpolation of these functions with little regard
for real-time performance. In this paper, we present an FPGA-based platform for real-time
synthesis of spatial audio using FIR filters created from head-related transfer functions.
For performance reasons, we run filtering, crossfading, and audio output on FPGA fabric,
while calculating audio source locations and storing audio files on the CPU. We use
a head-mounted 9-axis IMU to track the user’s head in real-time and adjust relative
spatial audio locations to create the perception that audio sources are fixed in space.
Our system, running on a Xilinx Zynq Z-7020, is able to support 4X more audio sources
than a comparable GPU and 8X more sources than a CPU while maintaining sub-millisecond
latency and comparable power consumption. Furthermore, we show how our system can
be leveraged to communicate the location of landmarks and obstacles to a visually-impaired
user during a sailing race or other navigation scenario. We test our system with multiple
users and show that, as a result of our reduced latency, a user is able to locate
a virtual audio source with an extremely high degree of accuracy and navigate toward
it.

SESSION: Session: Deep Learning I

Session details: Session: Deep Learning I

  • Bita Rouhani

When Massive GPU Parallelism Ain’t Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network

  • Vladimir Rybalkin

Multidimensional Long Short-Term Memory (MD-LSTM) neural network is an extension of
one-dimensional LSTM for data with more than one dimension that allows MD-LSTM to
show state-of-the-art results in various applications including handwritten text recognition,
medical imaging, and many more. However, efficient implementation suffers from very
sequential execution that tremendously slows down both training and inference compared
to other neural networks. This is the primary reason that prevents intensive research
involving MD-LSTM in recent years, despite large progress in microelectronics
and architectures. The main goal of the current research is to provide acceleration
for inference of MD-LSTM, so as to open a door for efficient training that can boost
application of MD-LSTM. By this research we advocate that FPGA is an alternative platform
for deep learning that can offer a solution in cases when a massive parallelism of
GPUs does not provide the necessary performance required by the application. In this
paper, we present the first hardware architecture for MD-LSTM. We conduct a systematic
exploration of precision vs. accuracy trade-off using a challenging dataset for historical
document image binarization from the DIBCO 2017 contest, and the well-known MNIST dataset
for handwritten digit recognition. Based on our new architecture, we implement an FPGA-based
accelerator that outperforms NVIDIA K80 GPU implementation in terms of runtime by
up to 50x and energy efficiency by up to 746x. At the same time, our accelerator demonstrates
higher accuracy and comparable throughput in comparison with state-of-the-art FPGA-based
implementations of multilayer perceptron for MNIST dataset.

Light-OPU: An FPGA-based Overlay Processor for Lightweight Convolutional Neural Networks

  • Yunxuan Yu

Lightweight convolutional neural networks (LW-CNNs) such as MobileNet, ShuffleNet,
SqueezeNet, etc., have emerged in the past few years for fast inference on embedded
and mobile systems. However, lightweight operations limit acceleration potential by
GPU due to their memory bounded nature and their parallel mechanisms that are not
friendly to SIMD. This calls for more specific accelerators. In this paper, we propose
an FPGA-based overlay processor with a corresponding compilation flow for general
LW-CNN accelerations, called Light-OPU. Software-hardware co-designed Light-OPU reformulates
and decomposes lightweight operations for efficient acceleration. Moreover, our instruction
architecture considers sharing of major computation engine between LW operations and
conventional convolution operations. This improves the run-time resource efficiency
and overall power efficiency. Finally, Light-OPU is software programmable, since loading
compiled code and kernel weights switches the targeted network without
FPGA reconfiguration. Our experiments on seven major LW-CNNs show that Light-OPU achieves
5.5x better latency and 3.0x higher power efficiency on average compared with edge
GPU NVIDIA Jetson TX2. Furthermore, Light-OPU has 1.3x to 8.4x better power efficiency
compared with previous customized FPGA accelerators. To the best of our knowledge,
Light-OPU is the first in-depth study on FPGA-based general processor for LW-CNNs
acceleration with high performance and power efficiency, which is evaluated using
all major LW-CNNs including the newly released MobileNetV3.

End-to-End Optimization of Deep Learning Applications

  • Atefeh Sohrabizadeh

The irregularity of recent Convolutional Neural Network (CNN) models such as less
data reuse and parallelism due to the extensive network pruning and simplification
creates new challenges for FPGA acceleration. Furthermore, without proper optimization,
there could be significant overheads when integrating FPGAs into existing machine
learning frameworks like TensorFlow. Such a problem is mostly overlooked by previous
studies. However, our study shows that a naive FPGA integration into TensorFlow could
lead to up to 8.45x performance degradation. To address the challenges mentioned above,
we propose several SW/HW co-design approaches to perform the end-to-end optimization
of deep learning applications. We present a flexible and composable architecture called
FlexCNN. It can deliver high computation efficiency for different types of convolution
layers using techniques including dynamic tiling and data layout optimization. FlexCNN
is further integrated into the TensorFlow framework with a fully-pipelined software-hardware
integration flow. This alleviates the high overheads of TensorFlow-FPGA handshake
and other non-CNN processing stages. We use OpenPose, a popular CNN-based application
for human pose recognition, as a case study. Experimental results show that with the
FlexCNN architecture optimizations, we can achieve 2.3x performance improvement. The
pipelined integration stack leads to a further 5x speedup. Overall, the SW/HW co-optimization
produces a speedup of 11.5x and results in an end-to-end performance of 23.8FPS for
OpenPose with floating-point precision, which is the highest performance reported
for this application on FPGA in the literature.

SESSION: Session: FPGA Architecture

Session details: Session: FPGA Architecture

  • Satwant Singh

Architectural Enhancements in Intel® Agilex™ FPGAs

  • Jeffrey Chromczak

This paper describes architectural enhancements in Intel® Agilex™ FPGAs and SoCs.
Agilex devices are built on Intel’s 10nm process and feature next-generation programmable
fabric, tightly coupled with a quad-core ARM processor subsystem, a secure device
manager, IO and memory interfaces, and multiple companion transceiver tile choices.
The Agilex fabric features multiple logic block enhancements that significantly improve
propagation delays and integrate more effectively with the second-generation HyperFlex™
pipelined routing architecture. Routing connections are re-designed to be point-to-point,
dropping intermediate connections featured in prior FPGA generations and replacing
them with a wider variety of shorter wire types. Fine-grain programmable clock skew
and time-borrowing were introduced throughout the fabric to augment the slack-balancing
capabilities of HyperFlex registers. DSP capabilities are also extended to natively
support new INT9/BFLOAT16/FP16 formats. Together, along with process and circuit enhancements,
these changes support more than 40% performance improvement over the Stratix® 10 family
of FPGAs.

Straight to the Point: Intra- and Intercluster LUT Connections to Mitigate the Delay of Programmable Routing

  • Stefan Nikolić

Technology scaling makes metal delay ever more problematic, but routing between Look-Up
Tables (LUTs) still passes through a series of transistors. It seems wise to avoid
the corresponding delay whenever possible. Direct connections between LUTs, both within
and across multiple clusters, can eschew the transistor delays of crossbars, connection
blocks, and switch blocks. In this paper we investigate the usefulness of enhancing
classical Field-Programmable Gate Array (FPGA) architectures with direct connections
between LUTs. We present an efficient algorithm for automatically finding the most
interesting patterns of such direct connections. Despite our methods being fairly
conservative and relying on the use of unmodified standard CAD tools, we obtain a
2.77% improvement of the geometric mean critical path delay of a standard benchmark
set, with improvement ranging from -0.17% to 7.3% for individual circuits. As modest
as these results may seem at first glance, we believe that they position direct connections
between LUTs as a promising topic for future research. Extending this work with dedicated
CAD algorithms and exploiting the increased possibilities for optimal buffering, diagonal
routing, and pipelining could prove direct connections important to the continuation
of performance improvement into next generation FPGAs.

LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations

  • Seyedramin Rasoulinezhad

We propose two tiers of modifications to FPGA logic cell architecture to deliver a
variety of performance and utilization benefits with only minor area overheads. In
the first tier, we augment existing commercial logic cell datapaths with a 6-input
XOR gate in order to improve the expressiveness of each element, while maintaining
backward compatibility. This new architecture is vendor-agnostic, and we refer to
it as LUXOR. We also consider a secondary tier of vendor-specific modifications to
both Xilinx and Intel FPGAs, which we refer to as X-LUXOR+ and I-LUXOR+ respectively.
We demonstrate that compressor tree synthesis using generalized parallel counters
(GPCs) is further improved with the proposed modifications. Using both the Intel adaptive
logic module and the Xilinx slice at the 65nm technology node for a comparative study,
it is shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for
LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We demonstrate
that LUXOR can deliver an average reduction of 13-19% in logic utilization on micro-benchmarks
from a variety of domains. BNN benchmarks benefit the most with an average reduction
of 37-47% in logic utilization, which is due to the highly-efficient mapping of the
XnorPopcount operation on our proposed LUXOR+ logic cells.
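
To make the two operations named above concrete, here is a small behavioral sketch of a
(6:3) generalized parallel counter and of XnorPopcount. It models function only, not the
LUXOR logic cell or its mapping. The function names and the bit-vector encoding are
assumptions.

```python
def gpc_6_3(bits):
    """(6:3) counter: count the ones among 6 input bits, return the 3-bit count LSB first."""
    assert len(bits) == 6
    count = sum(bits)
    return [(count >> k) & 1 for k in range(3)]


def xnor_popcount(activations, weights, width):
    """Binary dot product: number of bit positions where the two words agree."""
    mask = (1 << width) - 1
    return bin(~(activations ^ weights) & mask).count("1")


bits = [1, 1, 0, 1, 0, 1]
print(gpc_6_3(bits))                       # four ones, i.e. binary 100, LSB first
print(xnor_popcount(0b1011, 0b1001, 4))    # the words agree in three positions
```

The least-significant output of the counter is the XOR of all six inputs, which is the
function a 6-input XOR gate computes.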

SESSION: Invited Panel

Session details: Invited Panel

  • Raymond Nijssen

FPGAs will Never be the Same Again: How the Newest FPGA Architectures are Totally Disrupting the Entire FPGA Ecosystem
as We Know It

  • Raymond Nijssen

Since the inception of FPGAs over 2 decades ago, the micro-architectures and macro-architectures
of FPGAs across all FPGA vendors have been converging strongly to the point that comparable
FPGAs from the main FPGA vendors had virtually the same use models, and the same programming
models. User designs were getting easier to port from one vendor to the other with
every generation. Recent developments from different FPGA vendors targeting the
most advanced semiconductor technology nodes are an abrupt and disruptive break from
this trend, especially at the macro-architectural level.

SESSION: Session: Keynote II

Session details: Session: Keynote II

  • George Constantinides

Xilinx Vitis Unified Software Platform

  • Vinod Kathail

FPGAs provide significant advantages in throughput, latency, and energy efficiency
for implementing low-latency, compute-intensive applications when compared to general-purpose
CPUs and GPUs. Over the last decade, FPGAs have evolved into highly configurable SoCs
with on-chip CPUs, domain-specific programmable accelerators, and flexible connectivity
options. Recently, Xilinx introduced a new heterogeneous compute architecture, the
Adaptive Compute Acceleration Platform (ACAP), with significantly more flexibility
and performance to address an evolving set of new applications such as machine learning.
This advancement on the device side is accompanied by similar advances on higher-level
programming approaches to make FPGAs and ACAPs significantly easier to use for a wide
range of applications. Xilinx Vitis Unified Software Platform is a comprehensive development
environment to build and seamlessly deploy accelerated applications on Xilinx platforms
including Alveo cards, FPGA-instances in the cloud, and embedded platforms. It addresses
the three major industry trends: the need for heterogeneous computing, applications
that span cloud to edge to end-point, and AI proliferation. Vitis supports application
programming using C, C++ and OpenCL, and it enables the development of large-scale
data processing and machine learning applications using familiar, higher-level frameworks
such as TensorFlow and SPARK. To facilitate communication between the host application
and accelerators, Xilinx Runtime library (XRT) provides APIs for accelerator life-cycle
management, accelerator execution management, memory allocation, and data communication
between the host application and accelerators. In addition, a rich set of performance-optimized,
open-source libraries significantly ease the application development. Vitis AI, an
integral part of Vitis, enables AI inference acceleration on Xilinx platforms. It
supports the industry’s leading deep learning frameworks like TensorFlow and Caffe, and
offers a comprehensive suite of tools and APIs to prune, quantize, optimize, and compile
pre-trained models to achieve the highest AI inference performance on Xilinx platforms.
This talk provides an overview of Vitis and Vitis AI development environments.

SESSION: Session: High-Level Abstractions and Tools II

Session details: Session: High-Level Abstractions and Tools II

  • Ilya Ganusov

StateMover: Combining Simulation and Hardware Execution for Efficient FPGA Debugging

  • Sameh Attia

Debugging consumes a large portion of FPGA design time, and with the growing complexity
of traditional FPGA systems and the additional verification challenges posed by multiple
FPGAs interacting within data centers, debugging productivity is becoming even more
important. Current debugging flows either depend on simulation, which is extremely
slow but has full visibility, or on hardware execution, which is fast but provides
very limited control and visibility. In this paper, we present StateMover, a checkpointing-based
debugging framework for FPGAs, which can move design state back and forth between
an FPGA and a simulator in a seamless way. StateMover leverages the speed of hardware
execution and the full visibility and ease-of-use of a simulator. This enables a novel
debugging flow that has a software-like combination of speed with full observability
and controllability. StateMover adds minimal hardware to the design to safely stop
the design under test so that its state can be extracted or modified in an orderly
manner. The added hardware has no timing overhead and a very small area overhead.
StateMover currently supports Xilinx UltraScale devices, and its underlying techniques
and tools can be ported to other device families that support configuration readback.
Moving the state from/to an FPGA to/from a simulator can be performed in a few seconds
for large FPGAs, enabling a new debugging flow.

Buffer Placement and Sizing for High-Performance Dataflow Circuits

  • Lana Josipović

Commercial high-level synthesis tools typically produce statically scheduled circuits.
Yet, effective C-to-circuit conversion of arbitrary software applications calls for
dataflow circuits, as they can handle efficiently variable latencies (e.g., caches)
and unpredictable memory dependencies. Dataflow circuits exhibit an unconventional
property: registers (usually referred to as “buffers”) can be placed anywhere in the
circuit without changing its semantics, in strong contrast to what happens in traditional
datapaths. Yet, although functionally irrelevant, this placement has a significant
impact on the circuit’s timing and throughput. In this work, we show how to strategically
place buffers into a dataflow circuit to optimize its performance. Our approach extracts
a set of choice-free critical loops from arbitrary dataflow circuits and relies on
the theory of marked graphs to optimize the buffer placement and sizing. We demonstrate
the performance benefits of our approach on a set of dataflow circuits obtained from
imperative code.
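
The marked-graph view can be illustrated with a small sketch: each loop sustains at most
(tokens in the loop) / (total latency of the loop) results per cycle, and the slowest loop
bounds the whole circuit. This is a simplified reading of the theory, not the optimization
the paper formulates. The function names and the (tokens, latency) encoding are hypothetical.

```python
from fractions import Fraction


def loop_throughput(tokens, latency):
    """Throughput bound of one loop in tokens per cycle, capped at 1."""
    return min(Fraction(1), Fraction(tokens, latency))


def circuit_throughput(loops):
    """The slowest loop bounds the throughput of the whole circuit."""
    return min(loop_throughput(tokens, latency) for tokens, latency in loops)


# Two loops given as (tokens, latency in cycles). Adding a buffer that costs one cycle
# of latency to the first loop would turn (1, 2) into (1, 3) and lower the bound.
print(circuit_throughput([(1, 2), (2, 3)]))
print(circuit_throughput([(1, 3), (2, 3)]))
```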

Closing Leaks: Routing Against Crosstalk Side-Channel Attacks

  • Zeinab Seifoori

This paper presents an extension to the PathFinder FPGA routing algorithm, which enables
it to deliver FPGA designs free from risks of crosstalk attacks. Crosstalk side-channel
attacks are a real threat in large designs assembled from various IPs, where some
IPs are provided by trusted and some by untrusted sources. It suffices that a ring-oscillator
based sensor is conveniently routed next to a signal that carries secret information
(for instance, a cryptographic key), for this information to possibly get leaked.
To address this security concern, we apply several different strategies and evaluate
them on benchmark circuits from the Verilog-to-Routing tool suite. Our experiments show
that, for a quite conservative scenario where 10-20% of all design nets are carrying
sensitive information, the crosstalk-attack-aware router ensures that no information
leaks at a very small penalty: 1.58-7.69% increase in minimum routing channel width
and 0.12-1.18% increase in critical path delay, on average. In comparison, in an AES-128
cryptographic core, less than 5% of nets carry the key or the intermediate state values
of interest to an attacker, making it highly likely that the overhead for obtaining
a secure design is, in practice, even smaller.

Built-in Self-Evaluation of First-Order Power Side-Channel Leakage for FPGAs

  • Ognjen Glamočanin

Embedded and cyber-physical systems are pervading all aspects of our lives, including
sensitive and critical ones. As a result, they are an alluring target for cyber attacks.
These systems, whose implementation is often based on reconfigurable hardware, are
typically deployed in places accessible to attackers. Therefore, they require protection
against tampering and side-channel attacks. However, a side-channel resistant implementation
of a security primitive is not sufficient, as it can be weakened by an adversary,
aging, or environmental factors. To detect this, legitimate users should be able to
evaluate the side-channel resistance of their systems not only when deploying them
for the first time, but also during their entire service life. The most widespread
and de facto standard methodology for measuring power side-channel leakage uses Welch’s
t-test. In practice, collecting the data for the t-test requires physical access to
the device, a device-specific test setup, and the equipment for measuring the power
consumption during device operation. Consequently, only a small number of cyber-physical
systems deployed in the field can be tested this way and the tests to reevaluate the
device resistance to side-channel attacks cannot be easily repeated. To address these
issues, we present a design and an FPGA implementation of a built-in test for self-evaluation
of the resistance to first-order power side-channel attacks. Once our test is triggered,
the FPGA measures its own internal power-supply voltage and computes the t-test statistic
in real time. Experimental results on two different implementations of the AES-128
algorithm demonstrate that the self-evaluation test is very reliable. We believe that
this work is an important step towards the development of security sensors for the
next generation of safe and robust cyber-physical systems.
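
Computing the statistic in real time implies that samples are folded into running
accumulators rather than stored. The sketch below shows one common way to do that in
software, using Welford's online update for mean and variance followed by Welch's
t-statistic. It is not the paper's hardware design, and the class and function names are
hypothetical.

```python
import math


class RunningStats:
    """Online mean and variance (Welford's algorithm); no samples are stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1)   # unbiased sample variance, needs n >= 2


def welch_t(a, b):
    """Welch's t-statistic for two groups with possibly unequal variances."""
    return (a.mean - b.mean) / math.sqrt(a.variance / a.n + b.variance / b.n)


# Two groups of made-up measurements, for example traces for two input classes.
group_a, group_b = RunningStats(), RunningStats()
for sample in [1, 2, 3, 4, 5]:
    group_a.update(sample)
for sample in [2, 4, 6, 8, 10]:
    group_b.update(sample)
print(welch_t(group_a, group_b))
```

In leakage assessment the statistic is usually compared against a fixed threshold. A
value of |t| > 4.5 is a commonly used choice, stated here as an assumption rather than
something taken from the abstract.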

SESSION: Session: Applications II

Session details: Session: Applications II

  • Grace Zgheib

Dependency-Driven Trace-Based Network-on-Chip Emulation on FPGAs

  • Thiem Van Chu

FPGA emulation is a promising approach to accelerating Network-on-Chip (NoC) modeling
which has traditionally relied on software simulators. In most early studies of FPGA-based
NoC emulators, only synthetic workloads like uniform and bit permutations were considered.
Although a set of carefully designed synthetic workloads can reveal a relatively thorough
coverage of the characteristics of the NoC under evaluation, they alone are insufficient,
especially when the NoC needs to be optimized for specific applications. In such cases,
trace-driven workloads are effective. However, there is a problem with conventional
trace-driven workloads that has been pointed out by some recent studies: the network
load and congestion may be distorted because dependencies between packets are not
considered. These studies also provide infrastructures for extending existing software
simulators to enforce dependencies between packets. Unfortunately, enforcing dependencies
between packets is not trivial in the FPGA emulation approach. Therefore, although
there are some recent FPGA-based NoC emulators supporting trace-driven workloads,
most of them ignore packet dependencies. In this paper, we first clarify the challenges
of supporting trace-driven workloads with dependencies between packets taken into
account in the FPGA emulation approach. We then propose efficient methods and architectures
to tackle these challenges and build an FPGA-based NoC emulator, which we call DNoC,
based on the proposals. Our evaluation results show that (1) on a VC707 FPGA board,
DNoC achieves an average speed of 10,753K cycles/s when emulating an 8×8 NoC with
trace data collected from full-system simulation of the PARSEC benchmark suite, which
is 274x higher than the speed reported in a recent related work on dependency-driven
trace-based NoC emulation on FPGAs; (2) compared to BookSim, one of the most popular
NoC simulators, DNoC is 395x faster while providing the same results; (3) DNoC can
scale to a 4,096-node NoC on a VC707 board, and the size of the largest NoC depends
on only the on-chip memory capacity of the target FPGA.

FPGA-Accelerated Samplesort for Large Data Sets

  • Han Chen

Sorting is a fundamental operation in many applications such as databases, search,
and social networks. Although FPGAs have been shown to be very effective at sorting data
sizes that fit on chip, systems that sort larger data sets by shuffling data on and
off chip are bottlenecked by costly merge operations or data transfer time. We propose
a new technique for sorting large data sets, which uses a variant of the samplesort
algorithm on a server with a PCIe-connected FPGA. Samplesort avoids merging by randomly
sampling values to determine how to partition data into non-overlapping buckets that
can be independently sorted. The key to our design is a novel parallel multi-stage
hardware partitioner, which is a scalable high-throughput solution that greatly accelerates
the samplesort partitioning step. Using samplesort for FPGA-accelerated sorting provides
several advantages over mergesort, while also presenting a number of new challenges
that we address with cooperation between the FPGA and the software running on the
host CPU. We prototype our design using Amazon Web Services FPGA instances, which
pair a Xilinx Virtex UltraScale+ FPGA with a high-performance server. Our experiments
demonstrate that our prototype system sorts 2^30 key-value records with a speed of
7.2 GB/s, limited only by the on-board DRAM capacity and available PCIe bandwidth.
When sorting 2^30 records, our system exhibits a 37.4x speedup over the widely used
GNU parallel sort on an 8-thread state-of-the-art CPU.
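
The samplesort idea described above (sample, pick splitters, partition into non-overlapping
buckets, sort each bucket independently) can be sketched in software as follows. This is a
sequential illustration on plain keys, not the hardware partitioner or the key-value
pipeline of the paper. The function name and the num_buckets, oversample, and seed
parameters are assumptions.

```python
import bisect
import random


def samplesort(data, num_buckets=4, oversample=8, seed=0):
    """Sort `data` by partitioning it around randomly sampled splitters."""
    if len(data) <= num_buckets:
        return sorted(data)
    rng = random.Random(seed)
    # 1. Randomly sample the input and sort the (small) sample.
    sample = sorted(rng.sample(data, min(len(data), num_buckets * oversample)))
    # 2. Pick num_buckets - 1 evenly spaced splitters from the sample.
    step = len(sample) // num_buckets
    splitters = [sample[i * step] for i in range(1, num_buckets)]
    # 3. Partition: each value lands in exactly one bucket, and bucket ranges are disjoint.
    buckets = [[] for _ in range(num_buckets)]
    for value in data:
        buckets[bisect.bisect_right(splitters, value)].append(value)
    # 4. Sort buckets independently; concatenating them needs no merge step.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result


data = [(i * 7919) % 1000 for i in range(500)]
print(samplesort(data) == sorted(data))
```

Step 3 is the partitioning step that the abstract says is accelerated in hardware. Step 4
is embarrassingly parallel because no value in one bucket can precede a value in an
earlier bucket.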

BiS-KM: Enabling Any-Precision K-Means on FPGAs

  • Zhenhao He

K-Means is a popular clustering algorithm widely used and extensively studied in the
literature. In this paper we explore the challenges and opportunities in using low
precision input in conjunction with a standard K-Means algorithm as a way to improve
the memory bandwidth utilization on hardware accelerators. Low precision input through
quantization has become a standard technique in machine learning to reduce computational
costs and memory traffic. When applied in FPGAs, several issues need to be addressed.
First and foremost is the overhead of storing the data at different precision levels
since, depending on the training objective, different levels of precision might be
needed. Second, the FPGA design needs to accommodate varying precision without requiring
reconfiguration. To address these concerns, we propose Bit-Serial K-Means (BiS-KM),
a combination of a hybrid memory layout supporting data retrieval at any level of
precision, a novel FPGA design based on bit-serial arithmetic, and a modified K-Means
algorithm tailored to FPGAs. We have tested BiS-KM with various data sets and compared
our design with a state-of-the-art FPGA accelerator. BiS-KM achieves an almost linear
speedup as precision decreases, providing a more effective way to perform K-Means
on FPGAs.
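
One way to picture data retrieval at any level of precision is a bit-plane layout: the most
significant bits of all values are stored together, then the next bits, and so on, so a
reader that wants k-bit precision fetches only the first k planes. The sketch below is a
generic illustration of that idea. The abstract does not detail the hybrid layout BiS-KM
uses, and the function names are hypothetical.

```python
def to_bit_planes(values, bits=8):
    """planes[0] holds the MSB of every value, planes[bits - 1] holds the LSBs."""
    return [[(v >> (bits - 1 - p)) & 1 for v in values] for p in range(bits)]


def read_at_precision(planes, precision):
    """Rebuild values from the first `precision` planes, consuming one plane per step."""
    bits = len(planes)
    values = [0] * len(planes[0])
    for p in range(precision):
        for i, bit in enumerate(planes[p]):
            values[i] = (values[i] << 1) | bit
    # Shift back so truncated values stay on the original scale.
    return [v << (bits - precision) for v in values]


planes = to_bit_planes([200, 13, 255], bits=8)
print(read_at_precision(planes, 8))   # full precision
print(read_at_precision(planes, 4))   # top four bits only
```

Reading fewer planes moves proportionally less data, which is consistent with the
near-linear speedup the abstract reports as precision decreases.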

Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

  • Johannes de Fine Licht

Data movement is the dominating factor affecting performance and energy in modern
computing systems. Consequently, many algorithms have been developed to minimize the
number of I/O operations for common computing patterns. Matrix multiplication is no
exception, and lower bounds have been proven and implemented both for shared and distributed
memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing
algorithms, as they offer full control of memory accesses to the programmer. While
bounds developed in the context of fixed architectures still apply to these platforms,
the spatially distributed nature of their computational and memory resources requires
a decentralized approach to optimize algorithms for maximum hardware utilization.
We present a model to optimize matrix multiplication for FPGA platforms, simultaneously
targeting maximum performance and minimum off-chip data movement, within constraints
set by the hardware. We map the model to a concrete architecture using a high-level
synthesis tool, maintaining a high level of abstraction, which allows us to support arbitrary
data types and enables maintainability and portability across FPGA devices. Kernels
generated from our architecture are shown to offer competitive performance in practice,
scaling with both compute and memory resources. We offer our design as an open source
project to encourage the open development of linear algebra and I/O minimizing algorithms
on reconfigurable hardware platforms.
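
The link between on-chip buffering and off-chip traffic can be seen in a small software
sketch: a tile of the output matrix is kept "on chip" and updated by outer products, and
every element of A or B fetched from "off chip" is counted. This is a generic tiling
illustration, not the model or architecture of the paper. The function name and the
`tile` parameter are assumptions.

```python
def tiled_matmul(A, B, tile):
    """Multiply A (n x k) by B (k x m); return (C, number of off-chip element loads)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            # The C tile starting at (i0, j0) stays on chip for the whole p loop.
            for p in range(k):
                a_col = [A[i][p] for i in range(i0, min(i0 + tile, n))]
                b_row = [B[p][j] for j in range(j0, min(j0 + tile, m))]
                loads += len(a_col) + len(b_row)
                # Outer-product update of the on-chip tile.
                for di, a in enumerate(a_col):
                    for dj, b in enumerate(b_row):
                        C[i0 + di][j0 + dj] += a * b
    return C, loads


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_matmul(A, B, tile=1))
print(tiled_matmul(A, B, tile=2))
```

For square N x N matrices this scheme loads 2N^3 / tile elements, so doubling the tile
edge halves the off-chip traffic at the cost of four times the on-chip storage for the
output tile.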

SESSION: Session: Deep Learning II

Session details: Session: Deep Learning II

  • Lita Yang

GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms

  • Hanqing Zeng

Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning
model for representation learning on graphs. It is challenging to accelerate training
of GCNs, due to (1) substantial and irregular data communication to propagate information
within the graph, and (2) intensive computation to propagate information along the
neural network layers. To address these challenges, we design a novel accelerator
for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture
co-optimizations. We first analyze the computation and communication characteristics
of various GCN training algorithms, and select a subgraph-based algorithm that is
well suited for hardware execution. To optimize the feature propagation within subgraphs,
we propose a light-weight pre-processing step based on a graph theoretic approach.
Such pre-processing performed on the CPU significantly reduces the memory access requirements
and the computation to be performed on the FPGA. To accelerate the weight update in
GCN layers, we propose a systolic array based design for efficient parallelization.
We integrate the above optimizations into a complete hardware pipeline, and analyze
its load-balance and resource utilization by accurate performance modeling. We evaluate
our design on a Xilinx Alveo U200 board hosted by a 40-core Xeon server. On three
large graphs, we achieve an order of magnitude training speedup with negligible accuracy
loss, compared with state-of-the-art implementation on a multi-core platform.

Reuse Kernels or Activations?: A Flexible Dataflow for Low-latency Spectral CNN Acceleration

  • Yue Niu

Spectral-domain CNNs have been shown to be more efficient than traditional spatial
CNNs in terms of reducing computation complexity. However, they come with a ‘kernel
explosion’ problem that, even after compression (pruning), imposes a high memory burden
and off-chip bandwidth requirement for kernel access. This creates a performance gap
between the potential acceleration offered by compression and actual FPGA implementation
performance, especially for low-latency CNN inference. In this paper, we develop a
principled approach to overcoming this performance gap and designing a low-latency,
low-bandwidth, spectral sparse CNN accelerator on FPGAs. First, we analyze the bandwidth-storage
tradeoff of sparse convolutional layers and locate communication bottlenecks. We then
develop a dataflow for flexibly optimizing data reuse in different layers to minimize
off-chip communication. Finally, we propose a novel scheduling algorithm to optimally
schedule the on-chip memory access of multiple sparse kernels and minimize read conflicts.
On a state-of-the-art FPGA platform, our design reduces data transfers by 42% with
DSP utilization up to 90% and achieves inference latency of 9 ms for VGG16, compared
to the baseline state-of-the-art latency of 68 ms.

SESSION: Session: High-Level Synthesis and Tools

Session details: Session: High-Level Synthesis and Tools

  • Peter Cheung

Finding and Understanding Bugs in FPGA Synthesis Tools

  • Yann Herklotz

All software ultimately relies on hardware functioning correctly. Hardware correctness
is becoming increasingly important due to the growing use of custom accelerators using
FPGAs to speed up applications on servers. Furthermore, the increasing complexity
of hardware also leads to ever more reliance on automation, meaning that the correctness
of synthesis tools is vital for the reliability of the hardware. This paper aims to
improve the quality of FPGA synthesis tools by introducing a method to test them automatically
using randomly generated, correct Verilog, and checking that the synthesised netlist
is always equivalent to the original design. The main contributions of this work are
twofold: firstly a method for generating random behavioural Verilog free of undefined
values, and secondly a Verilog test case reducer used to locate the cause of the bug
that was found. These are implemented in a tool called Verismith. This paper also
provides a qualitative and quantitative analysis of the bugs found in Yosys, Vivado,
XST and Quartus Prime. Every synthesis tool except Quartus Prime was found to introduce
discrepancies between the netlist and the design. In addition to that, Vivado and
a development version of Yosys were found to crash when given valid input. Using Verismith,
eleven bugs were reported to tool vendors, of which six have already been fixed.

Combining Dynamic & Static Scheduling in High-level Synthesis

  • Jianyi Cheng

A central task in high-level synthesis is scheduling: the allocation of operations
to clock cycles. The classic approach to scheduling is static, in which each operation
is mapped to a clock cycle at compile-time, but recent years have seen the emergence
of dynamic scheduling, in which an operation’s clock cycle is only determined at run-time.
Both approaches have their merits: static scheduling can lead to simpler circuitry
and more resource sharing, while dynamic scheduling can lead to faster hardware when
the computation has non-trivial control flow.

In this work, we seek a scheduling approach that combines the best of both worlds.
Our idea is to identify the parts of the input program where dynamic scheduling does
not bring any performance advantage and to use static scheduling on those parts. These
statically-scheduled parts are then treated as black boxes when creating a dataflow
circuit for the remainder of the program which can benefit from the flexibility of
dynamic scheduling.

An empirical evaluation on a range of applications suggests that by using this approach,
we can obtain 74% of the area savings that would be made by switching from dynamic
to static scheduling, and 135% of the performance benefits that would be made by switching
from static to dynamic scheduling.
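
The trade-off between the two scheduling styles can be illustrated with a toy latency model. This is only a sketch and not the paper's method: it assumes a hypothetical, non-pipelined loop whose body takes either a cheap or an expensive path, and the cycle counts are made up for illustration.

```python
# Toy latency model; both cycle counts are illustrative assumptions.
FAST_PATH_CYCLES = 1   # hypothetical cost of the cheap branch
SLOW_PATH_CYCLES = 5   # hypothetical cost of the expensive branch


def static_cycles(takes_slow_path):
    """Static schedule: every iteration is budgeted for the worst-case path."""
    return len(takes_slow_path) * max(FAST_PATH_CYCLES, SLOW_PATH_CYCLES)


def dynamic_cycles(takes_slow_path):
    """Dynamic schedule: each iteration uses only the cycles its path needs."""
    return sum(SLOW_PATH_CYCLES if slow else FAST_PATH_CYCLES
               for slow in takes_slow_path)


# One slow iteration out of eight: control flow varies, so dynamic wins.
mixed = [False] * 7 + [True]
# Every iteration takes the slow path: dynamic scheduling gains nothing.
uniform = [True] * 8

print(static_cycles(mixed), dynamic_cycles(mixed))
print(static_cycles(uniform), dynamic_cycles(uniform))
```

In this model the second trace stands for the kind of program region where dynamic scheduling brings no performance advantage, which is where the abstract proposes falling back to static scheduling.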

Boyi: A Systematic Framework for Automatically Deciding the Right Execution Model of OpenCL
Applications on FPGAs

  • Jiantong Jiang

FPGA vendors provide OpenCL software development kits for easier programmability,
with the goal of replacing the time-consuming and error-prone register-transfer level
(RTL) programming. Many studies explore optimization methods (e.g., loop unrolling,
local memory) to accelerate OpenCL programs running on FPGAs. These programs typically
follow the default OpenCL execution model, where a kernel deploys multiple work-items
arranged into work-groups. However, the default execution model is not always a good
fit for an application mapped to the FPGA architecture, which is very different from
the multithreaded architecture of GPUs, for which OpenCL was originally designed.
In this work, we identify three other execution models that can better utilize the
FPGA resources for the OpenCL applications that do not fit well into the default execution
model. These three execution models are based on two OpenCL features devised for FPGA
programming (namely, single work-item kernel and OpenCL channel). We observe that
the selection of the right execution model determines the performance upper bound
of a particular application, which can vary by two orders of magnitude between the most
suitable execution model and the most unsuitable one. However, there is no way to
select the most suitable execution model other than empirically exploring the optimization
space for the four of them, which can be prohibitive. To help FPGA programmers identify
the right execution model, we propose Boyi, a systematic framework that makes automatic
decisions by analyzing OpenCL programming patterns in an application. After finding
the right execution model with the help of Boyi, programmers can apply other conventional
optimizations to reach the performance upper bound. Our experimental evaluation shows
that Boyi can 1) accurately determine the right execution model, and 2) greatly reduce
the exploration space of conventional optimization methods.

SESSION: Poster Session I

Session details: Poster Session I

  • Vaughn Betz

Programming Abstractions for Configurable Hardware: Survey and Research Directions

  • Samuel Dewan

Programming abstractions decrease the cognitive gap between program idealization and
expression. In the software domain, this high-level expressive power is achieved through
layered abstractions – virtual machines, compilers, operating systems – which translate,
at design and runtime, programmer visible code into hardware-compatible code. While
this paradigm is ideal for static, i.e., unmodifiable, hardware, several of these
abstractions break down when programming configurable hardware. State-of-the-art hardware/software
co-design techniques (e.g., High Level Synthesis (HLS), Intermediate Fabrics) are,
for the most part, ad hoc patches to the traditional abstraction stack, applicable
only to specific toolchains or software components. In this paper, we survey current
hardware design and hardware/software co-design abstractions, from the perspective
of the design language/toolchain. We perform a systematic analysis of different design
paradigms, including HLS, Domain Specific Languages (DSL), and new-generation Hardware
Description Languages (HDL). We analyze how these paradigms differ in expressiveness,
support for hardware/software interaction, hierarchy and modularity, HDL interoperability,
and interface with the outside world.

Pipeline-aware Logic Deduplication in High-Level Synthesis for Post-Quantum Cryptography
Algorithms

  • Changsu Kim

With the technical advance of quantum computers that can solve intractable problems
for conventional computers, many of the currently used public-key cryptosystems become
vulnerable. Recently proposed post-quantum cryptography (PQC) is secure against both
classical and quantum computers, but existing embedded systems such as smart cards
cannot easily support the PQC algorithms due to their much larger key sizes and more
complex arithmetic. To accelerate the PQC algorithms, embedded systems have to embed
the PQC hardware blocks, which can lead to huge hardware design costs. Although High-Level
Synthesis (HLS) helps significantly reduce the design costs, current HLS frameworks
produce inefficient hardware design for the PQC algorithms in terms of area and performance.
This work analyzes common features of the PQC algorithms and proposes a new pipeline-aware
logic deduplication method in HLS. The proposed method shares commonly invoked logic
across hardware design while considering load balancing in pipeline and resolving
dynamic memory accesses. This work implements FPGA hardware design of seven PQC algorithms
in the round 2 candidates from the National Institute of Standards and Technology
(NIST) PQC standardization process. Compared to a commercial HLS framework, the proposed
method achieves an area-delay-product reduction of 34.5%.

Advanced Dataflow Programming using Actor Machines for High-Level Synthesis

  • Endri Bezati

The use of parallelism has increased drastically in recent years. Parallel platforms
come in many forms: multi-core processors, embedded hybrid solutions such as multi-processor
system-on-chip with reconfigurable logic, and cloud datacenters with multi-core and
reconfigurable logic. These heterogeneous platforms can offer massive parallelism,
but it can be difficult to exploit, particularly when combining solutions constructed
with multiple architectures. To program a heterogeneous platform, a developer must
master different programming languages, tools, and APIs to program each aspect of
the platform separately and then must find a means to connect them with communication
interfaces. The motivation of this work is to provide a single programming model and
framework for hardware-software stream programs on heterogeneous platforms. Our framework,
StreamBlocks, starts with a dataflow programming model for both embedded and datacenter
platforms. Dataflow programming is an alternative model of computation that captures
both data and task parallelism. We describe a compiler infrastructure for CAL dataflow
programs for hardware code generation. CAL is a dataflow programming language that
can express multiple dataflow models of computation. StreamBlocks is based on the
Tycho compiler infrastructure, which transforms each actor in a dataflow program to
an abstract machine model, called Actor Machine. Actor Machines provide a unified
model for executing actors in both hardware and software and permit our compiler extension
and backend to generate efficient FPGA code. Unlike other systems, the programming
model and compiler directly support hardware-software systems in which an FPGA functions
as a coprocessor to a CPU. This permits easy integration with existing workflows.

Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum
Frequency

  • Licheng Guo

Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency
compared to manual RTL designs. We study the timing issues in a diverse set of nine
realistic HLS designs and observe that in most cases the frequency degradation is
related to the signal broadcast structures. In this work, we classify the common broadcast
types in HLS designs, including the data signal broadcast and two types of control
signal broadcast: the pipeline control broadcast and the synchronization signal broadcast.
We further identify several common limitations of the current HLS tools, which lead
to improper handling of the broadcasts. First, the HLS delay model does not consider
the extra delay caused by broadcasts, thus the scheduling results will be suboptimal.
To solve the issue, we implement a set of comprehensive synthetic designs and benchmark
the extra delay to calibrate the HLS delay model. Second, the HLS adopts back-pressure
signals for pipeline control, which will lead to large broadcasts. Instead, we propose
to use the skid-buffer-based pipeline control, where the back-pressure signal is removed,
and an extra skid-buffer is used for flow-control. We use dynamic programming to minimize
the area of the extra FIFO. Third, there exist redundant synchronizations among concurrent
modules that may lead to huge broadcasts. We propose methods to identify and prune
unnecessary synchronization signals. Our solutions boost the frequency of nine real-world
HLS benchmarks by 53% on average with marginal area and latency overhead. In some
cases, the gain is more than 100 MHz.

Productive Hardware Designs using Hybrid HLS-RTL Development

  • Blaise Tine

Current High-Level Synthesis frameworks provide a productive hardware development
methodology where hardware accelerators are generated directly from high-level languages
like C/C++ or OpenCL, allowing software developers to quickly accelerate their applications.
However, the hardware generated by these frameworks is sub-optimal compared to often
hand-optimized RTL modules. A hybrid development approach would leverage the productive
software stack and hardware board support package that HLS provides but allow for
fine-grained optimization using RTL components. In this work, we introduce a new software-hardware
co-design framework that integrates OpenCL/OpenACC with RTL code enabling direct execution
on FPGAs as well as full emulation with a high-speed simulator to reduce the development
time.

Unleashing the Power of FPGAs as Programmable Switches

  • Thomas Luinaud

The P4 language and the PISA architecture have revolutionized the field of networking.
Thanks to P4 and PISA, new networking applications and protocols can be rapidly evaluated
on high performance switches. While P4 allows the expression of a wide range of packet
processing algorithms, current programmable switch architectures limit the overall
processing flexibility. To address this shortcoming, recent works have proposed implementing
PISA on FPGAs. However, little effort has been devoted to analyzing whether FPGAs are
good candidates to implement PISA. In this work, we take a step back and evaluate
the micro-architecture efficiency of various PISA blocks. Using a theoretical analysis
and experiments, we demonstrate that current FPGA architectures drastically limit the
performance of a few PISA blocks. Thus, we explore two avenues to alleviate these
shortcomings. First, we identify some network applications that are well tailored
to current FPGAs. Second, to support a wider range of networking applications, we
propose modifications to the FPGA architecture which can also be of interest outside
the networking field.

Early-stage Automated Identification of Similar Hardware Implementations with Abstract-Syntax-Tree

  • Parnian Mokri

The resource requirements of application-specific accelerators challenge embedded
system designers who have a tight area budget but must cover a range of possible software
kernels. We propose an early detection methodology (ReconfAST) to identify computationally
similar synthesizable kernels to build Shared Accelerators (SAs). SAs are specialized
hardware accelerators that execute very different software kernels but share the common
hardware functions between them. SAs increase the fraction of workloads covered by
specialized hardware by detecting similarities in dataflow and control flow between
seemingly very different workloads. Existing methods either use dynamic traces or
analyze register transfer level (RTL) implementations to find these similarities, which
requires deep knowledge of RTL and a time-consuming design process.

ReconfAST leverages abstract-syntax-trees (ASTs) generated from LLVM’s clang to discover
similar kernels among workloads. ASTs provide the right level of abstraction to detect
commonalities. ASTs are compact, unlike control and dataflow representations, but
contain extra syntax and variable node ordering that complicates workload comparison.
ReconfAST transforms ASTs into a new clustered-ASTs (CASTs) representation, removes
unneeded nodes, and uses a regular expression to match common node configurations.
The approach is validated using MachSuite accelerator benchmarks.

On FPGAs, a good Shared Accelerator accelerates workloads by an average of 5x and
reduces the resources required for FPGA implementations by 37% for FFs, 16% for DSPs,
and 10% for LUTs on average over a dedicated accelerator implementation.

Hardware Description Beyond Register-Transfer Level Languages

  • Oron Port

Prevalent hardware description languages (HDLs), e.g., Verilog and VHDL, employ register-transfer
level (RTL) as their underlying programming model. One major downside of the RTL model
is that it tightly couples design functionality with timing and device constraints.
This coupling increases code complexity and yields code that is more verbose and less
portable. High-level synthesis (HLS) tools decouple functionality from timing and
design constraints by utilizing constructs from imperative programming languages.
These constructs and their sequential semantics, however, impede construction of inherently
parallel hardware and data scheduling, which is crucial in many design use-cases.

In our work we present a novel dataflow hardware description abstraction layer as
basis for hardware design and apply it to DFiant, a Scala-embedded HDL. DFiant leverages
dataflow semantics along with modern software language features (e.g., inheritance,
polymorphism) and classic HDL traits (e.g., bit-accuracy, input/output ports) to decouple
functionality from implementation constraints. Therefore, DFiant designs are timing-agnostic
and device-agnostic and can be automatically pipelined by the DFiant compiler to meet
target performance requirements. With DFiant we demonstrate how dataflow HDL code
can be substantially more portable and compact than its equivalent RTL code, yet without
compromising its target design performance.

MLSBench: A Synthesizable Dataset of HLS Designs to Support ML Based Design Flows

  • Pingakshya Goswami

With the advent of Machine Learning (ML), predictive EDA tools are becoming the next
hot topic of research in the EDA community, and researchers are working on ML-based
tools to predict the performance of the EDA tool. As the designs become complex, there
is a need to start the design using higher levels of abstraction, such as High-Level
Synthesis (HLS) tools in FPGA and SoC design flows. Quick prediction of performance-related
parameters of the final design after the C-synthesis stage can help in rapid design
closure. Even though multiple papers exist in the domain of post routing performance
prediction of HLS tools, there are no standard benchmarks available to compare the
performance and accuracy of the predictive models. In this paper, we present
MLSBench, a collection of around 5000 synthesizable designs written in C and C++.
We provide a methodology to generate many design variations from a single
design, which creates a potential for creating newer designs and enlarging the database
in the future. This is followed by an analysis validating that the generated designs
are indeed different. This allows designers to create generalized machine-learning-based
models that are not overfitted to a small dataset. We also perform statistical analysis
for measuring the design diversity by synthesizing them using Xilinx Vivado HLS for
Zynq 7000 device series.

A Top-Down Design Methodology for Synthesizing FPGA Fabrics Using Standard ASIC Flow

  • Prashanth Mohan

Design methodologies for synthesizing FPGA fabrics presented in the literature typically
employ a bottom-up approach wherein individual tiles are synthesized in isolation
and later stitched together to generate the large FPGA fabric. However, using a bottom-up
methodology to ensure fabric-level performance targets is challenging due to the lack
of a global timing view across multiple tiles spanning the FPGA fabric. While previous
works address this problem with a combination of manual buffering and floorplanning,
these additional steps introduce significant deviations from standard push-button
ASIC flows. In this paper, a top-down synthesis methodology is proposed, which eliminates
the need for floorplanning and manual buffering by providing a global timing view
of the FPGA fabric. To evaluate the proposed design methodology, we developed an FPGA
fabric generator using the Chisel hardware construction language. The fabric generator
reads in the Verilog-to-Routing architecture file, describing the user-defined FPGA
fabric, and generates the Verilog netlist and timing exceptions required to automatically
place and route the FPGA fabric in any technology node with a standard cell library.
Post layout timing analysis of placed and routed FPGA fabrics on a 28nm industrial
CMOS process demonstrates that the top-down methodology can place and route fabrics
without the need for any manual buffering or floorplanning while providing ~20% average
improvement in performance across multiple benchmark designs.

ConvCloud: An Adaptive Convolutional Neural Network Accelerator on Cloud FPGAs

  • Yang Yang

Among all the neural network specialized hardware accelerators like the Application-Specific Integrated Circuit (ASIC),
an FPGA accelerator stands out for its flexibility, short time-to-market, and energy
efficiency. However, when it comes to multitasking and high-speed requirements or
realtime and power-efficient scenarios (e.g., UAVs, self-driving cars, and IoT devices),
a single-board FPGA accelerator has difficulties in achieving excellent performance.
Therefore, Cloud FPGAs (Multi-FPGAs) will play a significant role in high-performance
and energy-efficient computation of CNNs for both mobile and cloud computing domains.
In this work, we propose an adaptive neural network accelerator on Cloud FPGAs, using
multi-FPGA design to satisfy multitasking and high-speed requirements or realtime
and power-efficient scenarios. We adopt the roofline model to figure out the optimal
configuration of each CNN layer. A layer clustering algorithm and a layer sequence
detection method are proposed to transform CNN models into layer sequences for mapping
the CNN model layers efficiently to different FPGA boards. Then, we build an adaptive
CNN mapping method of Multi-FPGA chips for CNN models. Preliminary results on the
Multi-FPGAs platform demonstrate that our accelerator can improve the performance
significantly due to the adaptive mapping method.
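
The roofline model mentioned above bounds attainable throughput by the smaller of the compute peak and the memory bandwidth multiplied by the operational intensity. The sketch below shows how such a bound could be used to rank per-layer configurations. The configuration fields, names, and all numbers are hypothetical and are not taken from the paper.

```python
def roofline_gops(peak_gops, bandwidth_gbps, ops_per_byte):
    """Attainable throughput is capped by compute or by memory traffic."""
    return min(peak_gops, bandwidth_gbps * ops_per_byte)


def best_config(configs, bandwidth_gbps):
    """Pick the candidate with the highest roofline-bounded throughput."""
    return max(
        configs,
        key=lambda c: roofline_gops(c["peak_gops"], bandwidth_gbps,
                                    c["ops_per_byte"]),
    )


# Hypothetical candidates for one layer: a wider compute array raises the
# peak, but its tiling reuses less data, lowering the operational intensity.
candidates = [
    {"name": "wide", "peak_gops": 200.0, "ops_per_byte": 5.0},
    {"name": "narrow", "peak_gops": 80.0, "ops_per_byte": 12.0},
]

choice = best_config(candidates, bandwidth_gbps=10.0)
print(choice["name"])
```

With the assumed 10 GB/s of bandwidth, the "wide" candidate is memory-bound at 50 GOP/s while the "narrow" one reaches its 80 GOP/s compute peak, so the narrower configuration is selected.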

Scalable FPGA Median Filtering using Multiple Efficient Passes

  • Oscar Rahnama

The 2-D median filter, one of the oldest and most well-established image-filtering
techniques, still sees widespread use throughout computer vision. Despite its relative
algorithmic simplicity, accelerating the 2-D median filter via a hardware implementation
becomes increasingly challenging as the window size increases, since the resources
required grow quadratically with the window size. Previous works, in a non-FPGA context,
have shown that applying a sequence of multiple directional median filters to an image
yields results that are competitive with, and in some cases even better than, those
of a classic 2-D window median. Inspired by these approaches, we propose a novel way
of substituting a 2-D median filter on an FPGA with a sequence of directional median
filters, in our case in the pursuit of an FPGA implementation that achieves better
scalability and hardware efficiency without sacrificing accuracy. We empirically show
that the combination of three particular directional filters, in any order, achieves
this, whilst requiring quadratically fewer resources on the FPGA. Our approach allows
for much higher throughput and is easier to implement as a pipeline.
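
To make the idea of replacing a 2-D window with directional passes concrete, here is a minimal software sketch. It assumes only a horizontal pass followed by a vertical pass with replicated borders; the abstract does not say which three directional filters the paper combines, so this is an illustration of the general technique rather than the proposed design.

```python
def median_1d(values, radius):
    """1-D median over a window of 2*radius+1 samples, borders replicated."""
    n = len(values)
    out = []
    for i in range(n):
        window = [values[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sorted(window)[radius])
    return out


def horizontal_pass(image, radius):
    """Apply the 1-D median along every row."""
    return [median_1d(row, radius) for row in image]


def vertical_pass(image, radius):
    """Apply the 1-D median along every column."""
    columns = [median_1d(list(col), radius) for col in zip(*image)]
    return [list(row) for row in zip(*columns)]


def directional_median(image, radius):
    """Horizontal pass followed by vertical pass (assumed ordering)."""
    return vertical_pass(horizontal_pass(image, radius), radius)


# A flat 5x5 image with a single impulse in the centre.
image = [[10] * 5 for _ in range(5)]
image[2][2] = 255
filtered = directional_median(image, radius=1)
print(filtered[2])
```

Each 1-D pass sorts 2*radius+1 samples per pixel, whereas a full 2-D window would need (2*radius+1)^2, which is where the resource saving comes from.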

FeCaffe: FPGA-enabled Caffe with OpenCL for Deep Learning Training and Inference on
Intel Stratix 10

  • Ke He

Deep learning has become increasingly popular in recent years, and there are
many popular frameworks on the market accordingly, such as Caffe, TensorFlow and PyTorch.
All these frameworks natively support CPUs and GPGPUs. However, FPGAs are still not
comprehensively supported by these frameworks for deep learning development,
especially for the training phase. In this paper, we first propose FeCaffe,
i.e., FPGA-enabled Caffe, a hierarchical software and hardware design methodology based
on Caffe, to enable FPGAs to support CNN training features. Moreover, we provide
some benchmarks of popular CNN networks with FeCaffe and analyze them in detail.
Finally, we propose some optimization directions at the FPGA kernel design, system
pipeline, network architecture, user case application and heterogeneous platform levels.
The results demonstrate that the proposed FeCaffe can support
almost full features for training and inference with a high degree of design
flexibility, expansibility and reusability for deep learning development. Compared
to prior studies, our architecture can support more network and training settings,
and the current configuration achieves 6.4x and 8.4x average execution time improvements
for the forward and backward passes, respectively, for LeNet.

SESSION: Poster Session II

Session details: Poster Session II

  • Mike Hutton

DOMIS: Dual-Bank Optimal Micro-Architecture for Iterative Stencils

  • Juan Escobedo

High-Level Synthesis (HLS) can achieve significant performance improvements through
effective memory partitioning and meticulous data reuse. Many modern applications,
such as medical imaging and convolutional layers in a CNN, mostly contain kernels
where iterations can be reordered freely without compromising correctness. In
this paper, we propose an optimal micro-architecture that can be automatically implemented
for simple and iterative stencil computations that utilizes only 2 banks to achieve
fully parallel conflict-free memory accesses from single stage stencil kernels, while only
requiring reuse buffers of size proportional to the kernel size to achieve an II of
1, irrespective of the stencil geometry. We demonstrate the effectiveness of our
micro-architecture by implementing it with a Kintex 7 xc7k160tg676-1 Xilinx FPGA and
testing it with several stencil-based kernels found in real-world applications. On
average, when compared with the mainstream GMP and SRC architectures our approach
achieves approximately 30-70% reduction in hardware usage, while improving performance
by about 15%. Moreover, the number of independent memory banks required to accomplish
conflict-free data accesses has dropped by more than 30% together with some increase
in power consumption due to higher clock frequencies.

Scalable FPGA-based Architecture for High-Performance Per-Flow Traffic Measurement

  • Junzhong Shen

Per-flow traffic measurement has emerged as a critical but challenging task in data
centers in recent years in the face of massive network traffic. Many approximate methods
have been proposed to resolve the existing resource-accuracy trade-off in per-flow
traffic measurement, one of which is the sketch-based method. However, sketches are
affected by their high computational cost and low throughput; moreover, their measurement
accuracy is hard to guarantee under the conditions of changing network bandwidth or
flow size distribution. Recently, FPGA platforms have been widely deployed in data
centers, as they demonstrate a good fit for high-speed network processing. In this
work, we propose a scalable pipelined architecture for high-throughput per-flow
traffic measurement on FPGA. We adopt memory-friendly D-left hashing in our design,
which guarantees high space utilization, successfully addressing the challenge
of tracking high-speed data streams under limited memory resources on FPGA. Comparisons
with state-of-the-art sketch-based solutions show that our design outperforms them
in terms of throughput by over 80x.
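
For readers unfamiliar with d-left hashing, the following software sketch shows the basic placement rule: each key has one candidate bucket per subtable, and a new key goes into the least-loaded candidate, with ties broken toward the leftmost subtable. The class name, table sizes, and the use of SHA-256 as the hash are assumptions made for illustration and say nothing about the paper's hardware design.

```python
import hashlib


class DLeftTable:
    """Per-flow packet counters stored with d-left hashing (software sketch)."""

    def __init__(self, d=2, buckets_per_subtable=8, slots_per_bucket=4):
        self.d = d
        self.buckets = buckets_per_subtable
        self.slots = slots_per_bucket
        # subtables[i][b] maps flow_id -> packet count for bucket b of subtable i
        self.subtables = [[{} for _ in range(self.buckets)] for _ in range(d)]

    def _bucket(self, subtable, flow_id):
        # One independent hash per subtable, derived by salting with its index.
        digest = hashlib.sha256(f"{subtable}:{flow_id}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.buckets

    def _candidates(self, flow_id):
        return [self.subtables[i][self._bucket(i, flow_id)]
                for i in range(self.d)]

    def update(self, flow_id, packets=1):
        """Add packets to a flow; returns False if no slot is available."""
        candidates = self._candidates(flow_id)
        for bucket in candidates:
            if flow_id in bucket:
                bucket[flow_id] += packets
                return True
        # min() returns the first minimum, so ties go to the leftmost subtable.
        target = min(candidates, key=len)
        if len(target) >= self.slots:
            return False  # every candidate bucket is full
        target[flow_id] = packets
        return True

    def count(self, flow_id):
        for bucket in self._candidates(flow_id):
            if flow_id in bucket:
                return bucket[flow_id]
        return 0


table = DLeftTable()
for _ in range(3):
    table.update("10.0.0.1:80")
table.update("10.0.0.2:443")
print(table.count("10.0.0.1:80"), table.count("10.0.0.2:443"))
```

Because each lookup touches exactly d fixed-size buckets, the access pattern is bounded and regular, which is one reason the scheme is considered memory-friendly.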

Codesign-NAS: Automatic FPGA/CNN Codesign Using Neural Architecture Search

  • Mohamed S. Abdelfattah

Field-programmable gate arrays (FPGAs) have become a popular compute platform for
convolutional neural network (CNN) inference; however, the design of a CNN model and
its FPGA accelerator has been inherently sequential. A CNN is first prototyped with
little or no hardware awareness to attain high accuracy; subsequently, an FPGA accelerator
is tuned to that specific CNN to maximize its efficiency. Instead, we formulate a
neural architecture search (NAS) optimization problem that contains parameters from
both the CNN model and the FPGA accelerator, and we jointly search for the best CNN
model-accelerator pair that boosts accuracy and efficiency; we call this Codesign-NAS.
In this paper we focus on defining the Codesign-NAS multiobjective optimization problem,
demonstrating its effectiveness, and exploring different ways of navigating the codesign
search space. For Cifar-10 image classification, we enumerate close to 4 billion model-accelerator
pairs, and find the Pareto frontier within that large search space. Next we propose
accelerator innovations that improve the entire Pareto frontier. Finally, we compare
to ResNet on a highly-tuned accelerator, and show that using codesign, we can improve
on Cifar-100 classification accuracy by 1.8% while simultaneously increasing performance/area
by 41% in just 1000 GPU-hours of running Codesign-NAS, thus demonstrating that our
automated codesign approach is superior to sequential design of a CNN model and accelerator.
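
The Pareto frontier referred to above is the set of model-accelerator pairs that no other pair beats on both objectives at once. A minimal sketch of extracting it from a list of (accuracy, efficiency) pairs follows. The sample points are invented for illustration, and the paper's actual search procedure is not shown here.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other (higher is better on both)."""
    frontier = []
    best_efficiency = float("-inf")
    # Visit points from most to least accurate; a point survives only if it
    # is more efficient than every more-accurate point seen so far.
    for accuracy, efficiency in sorted(points, key=lambda p: (-p[0], -p[1])):
        if efficiency > best_efficiency:
            frontier.append((accuracy, efficiency))
            best_efficiency = efficiency
    return frontier


# Hypothetical (accuracy, performance/area) pairs.
pairs = [(0.90, 10), (0.92, 8), (0.88, 12), (0.91, 7), (0.85, 11)]
front = pareto_frontier(pairs)
print(front)
```

Sorting first makes the scan linear after an O(n log n) sort, which matters when the candidate set is large.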

Placement Aware Design and Automation of High Speed Architectures for Tree-Structured
Linear Cellular Automata on FPGAs with Scan Path Insertion

  • Ayan Palchaudhuri

VLSI implementation of Cellular Automata (CAs) has gained importance owing to its
features which guarantee parallelism, locality and structural regularity. In this
work, we have addressed the design challenges pertaining to an implementation optimized
for speed, of tree-structured linear CA architectures on Field Programmable Gate Array
(FPGA) with built-in scan paths. Scan based design facilitates state initialization,
helps to escape from any graveyard state, or figure out faulty locations (if any)
on which the circuit is mapped. Our design automation platform generates synthesizable
circuit descriptions of tree-structured CA on FPGA, and appends scan functionality
without additional logic or speed overhead. Placement algorithms governing the mapping
of CA cell nodes on the FPGA slices have been proposed to ensure maximum physical
proximity among CA cells sharing neighborhood dependencies. This is done to exploit
the VLSI amenable features such as physical adjacency of the neighboring nodes participating
in the next state (NS) computation of each other. The ultimate implementation leads
to minimum spacing of linear order between CA neighbours. The NS logic of each CA
cell inclusive of scan multiplexing, owing to restricted neighborhood size, is realized
using a single Look-Up Table. Our architectures outperform behavioral implementations
realized with higher levels of design style abstraction.

INCAME: INterruptible CNN Accelerator for Multi-robot Exploration

  • Jincheng Yu

Multi-Robot Exploration (MR-Exploration) that provides the location and map is a basic
task for many multi-robot applications. Recent research introduces Convolutional
Neural Networks (CNNs) to critical components in MR-Exploration, like Feature-point
Extraction (FE) and Place Recognition (PR), to improve the system performance. Such
CNN-based MR-Exploration requires running multiple CNN models simultaneously, together
with complex post-processing algorithms, which greatly challenges the hardware platforms,
usually embedded systems. Previous research has shown that FPGAs are
good candidates for CNN processing on embedded platforms. But such accelerators usually
process different models sequentially, lacking the ability to schedule multiple tasks
at runtime. Furthermore, post-processing of CNNs in FE is also computationally expensive
and becomes the system bottleneck after accelerating the CNN models. To handle such
problems, we propose an INterruptible CNN Accelerator for Multi-Robot Exploration
(INCAME) framework for rapid deployment of robot applications on FPGA. In INCAME,
we propose a virtual-instruction-based interrupt method to support multi-tasking on CNN
accelerators. INCAME also includes hardware modules to accelerate the post-processing
of the CNN-based components. Experimental results show that INCAME enables multi-task
scheduling on the CNN accelerator with negligible performance degradation (0.3%).
With the help of multi-task support and post-processing acceleration, INCAME enables
embedded FPGAs to execute MR-Exploration in real time (20 fps).

LPAC: A Low-Precision Accelerator for CNN on FPGAs

  • Tianyu Zhang

Low-bit quantization of neural networks is required on edge devices to achieve lower
power consumption and higher performance. 8-bit networks consume a lot of resources,
while binary networks suffer accuracy degradation. Thus, a full-process hardware-friendly
quantization solution of 4A4W (4-bit activations and 4-bit weights) is proposed to achieve
a better accuracy/resource trade-off. It contains no additional floating-point operations
and achieves accuracy comparable to full precision. We also implement a low-precision
accelerator for CNN (LPAC) on a Xilinx FPGA, which takes full advantage of its DSPs
by efficiently mapping convolutional computations. Through on-chip reassign management
and resource-saving analysis, high performance can be achieved on small chips. Our
4A4W solution achieves 1.8x higher performance than 8A8W and a 2.42x increase in power
efficiency under the same resources. On ImageNet classification, the Top-5 accuracy is
within 1% of full precision. On human pose estimation, we achieve
261 frames per second on the ZU2EG, a 1.78x speedup over 8A8W, with an accuracy
gap of only 1.62% to full precision. This demonstrates the generality of our solution.
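
As a rough illustration of what 4A4W means numerically, the sketch below applies plain symmetric uniform quantization to a few weights and activations and evaluates a dot product in integers. It is a generic textbook scheme, not the quantization flow of the paper; the function names, the per-tensor scale choice, and the sample values are all assumptions made for illustration.

```python
def quantize_uniform(values, num_bits=4, signed=True):
    """Map floats to num_bits integer codes with one shared scale (assumed scheme)."""
    if signed:
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** num_bits - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into the representable range.
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def int_dot(a_codes, w_codes):
    """Dot product on integer codes only; this is the part hardware would compute."""
    return sum(a * w for a, w in zip(a_codes, w_codes))


# Hypothetical sample data: signed 4-bit weights, unsigned 4-bit activations.
weights = [0.7, -0.3, 0.1, -0.7]
activations = [0.0, 1.4, 3.0, 0.8]

w_q, w_scale = quantize_uniform(weights, num_bits=4, signed=True)
a_q, a_scale = quantize_uniform(activations, num_bits=4, signed=False)

# A single floating-point rescale recovers an approximation of the real result.
approx = int_dot(a_q, w_q) * a_scale * w_scale
exact = sum(a * w for a, w in zip(activations, weights))
```

The sample values were chosen to fall exactly on the quantization grid, so `approx` and `exact` agree here; with arbitrary data the two differ by the quantization error.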

Enable Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud

  • Shulin Zeng

FPGAs have shown great potential in providing low-latency and energy-efficient solutions
for deep learning applications, especially for the deep neural network (DNN). Currently,
the majority of FPGA based DNN accelerators are designed for single-task and static-workload
applications, making it difficult to adapt to the multi-task and dynamic-workload
applications in the cloud. To meet these requirements, DNN accelerators need to support
multi-task concurrent execution and low-overhead runtime resource reconfiguration.
However, neither instruction set architecture (ISA) based nor template-based FPGA
accelerators can support both functions at the same time. In this paper, we introduce
a novel FPGA virtualization framework for ISA-based DNN accelerators in the cloud.
To support multi-task execution and runtime reconfiguration, we propose
a two-level instruction dispatch module and a deep learning hardware resource pooling
technique at the hardware level. At the software level, we propose a tiling-based
instruction frame package design and two-stage static-dynamic compilation. Furthermore,
we propose a history information aware scheduling algorithm for the proposed ISA-based
deep learning accelerators in the cloud scenario. According to our evaluation on Xilinx
VU9P FPGA, the proposed virtualization method achieves 1.88x to 2.20x higher throughput
and 1.36x to 1.77x lower latency than the static baseline design.

Evaluation of Optimized CNNs on FPGA and non-FPGA based Accelerators using a Novel
Benchmarking Approach

  • Michaela Blott

Numerous algorithmic optimization techniques have been proposed to alleviate the computational
complexity of convolutional neural networks (CNNs). However, given the broad selection
of inference accelerators, it is not obvious which approach benefits from which optimization
and to what degree. In addition, the design space is further obscured by many deployment
settings such as power and operating modes, batch sizes, as well as ill-defined measurement
methodologies. In this paper, we systematically benchmark different types of CNNs,
applying both pruning and quantization as the most promising optimization techniques,
using a novel benchmarking approach. We evaluate a spectrum of FPGA implementations,
GPU, TPU and VLIW processor, for a selection of systematically pruned and quantized
neural networks (including ResNet50, GoogleNetv1, MobileNetv1, a VGG derivative, and
a multilayer perceptron) taking the full design space into account including batch
sizes, thread counts, stream sizes and operating modes, and considering power, latency,
and throughput at a specific accuracy as figures of merit. Our findings show that channel
pruning is effective across most hardware platforms, with resulting speedups directly
correlated to the reduction in compute load, while FPGAs benefit the most from quantization.
FPGAs outperform the other platforms in latency and latency variation for the majority of CNNs,
in particular with feed-forward dataflow implementations. Finally, pruning and quantization
are orthogonal techniques and yield the majority of all optimal design points when
combined. With this benchmarking approach, both in terms of methodology and measured
results, we aim to drive more clarity in the choice of CNN implementations and optimizations.

CloudMoles: Surveillance of Power-Wasting Activities by Infiltrating Undercover Sensors

  • Seyedeh Sharareh Mirzargar

Recently, FPGA-accelerated cloud has emerged as a new computing environment. The inclusion
of FPGAs in the cloud has created new security risks, some of which are due to circuits
exercising excessive switching activity. These power-wasting tenants can cause timing
faults in the collocated circuits or a denial-of-service attack by resetting the host
FPGA. In this work, we present the idea of populating the FPGA with voltage sensors
based on ring oscillators, to continuously monitor the core voltage fluctuations across
the entire FPGA. To implement the sensors, we do not lock any FPGA resources; instead,
we infiltrate the sensors undercover, by taking advantage of the logic and the routing
resources unused by the tenants. Additionally, we infiltrate the sensors into the
FPGA circuits after their implementation, but before their deployment on the cloud;
the tenants are thus neither aware of nor affected by our voltage monitoring system.
Finally, we devise a novel metric that takes the sensor measurements to quantify the
power wasting activity in the FPGA clock regions where the sensors are infiltrated.
We use VTR benchmarks and a Xilinx Virtex-7 FPGA to test the feasibility of our approach.
Experimental results demonstrate that, using the undercover voltage sensors and our
novel metric, one can accurately locate the source of the malicious power-wasting
activity.

Studying the Potential of Automatic Optimizations in the Intel FPGA SDK for OpenCL

  • Adel Ejjeh

High Level Synthesis (HLS) tools, like the Intel FPGA SDK for OpenCL, improve hardware
design productivity and enable efficient design space exploration, by providing simple
program directives (pragmas) and/or API calls that allow hardware programmers to use
higher-level languages (like HLS-C or OpenCL). However, modern HLS tools sometimes
miss important optimizations that are necessary for high performance. In this poster,
we present a study of the tradeoffs in HLS optimizations, and the potential of a modern
HLS tool in automatically optimizing an application. We perform the study on a generic,
5-stage camera ISP pipeline using the Intel FPGA SDK for OpenCL and an Arria 10 FPGA
Dev Kit. We show that automatic optimizations in the HLS tool are valuable, achieving
up to 2.7x speedup over equivalent CPU execution. With further hand tuning, however,
we can achieve up to 36.5x speedup over CPU. We draw several specific lessons about
the effectiveness of automatic optimizations guided by simple directives and about
the nature of manual rewriting required for high performance. Finally, we conclude
that there is a gap in the current potential of HLS tools which needs to be filled
by next-gen research.

CANSEE: Customized Accelerator for Neural Signal Enhancement and Extraction from the
Calcium Image in Real Time

  • Zhe Chen

The miniaturized fluorescent calcium imaging miniscope has become a prominent tool
for monitoring the activity of a large population of neurons in vivo. However, existing
calcium image processing algorithms are developed for off-line analysis, and their
implementations on general-purpose processors struggle to meet the real-time
processing requirement under a constrained energy budget for closed-loop applications.
In this paper, we propose CANSEE, a customized accelerator for neural signal enhancement
and extraction from calcium images in real time. The accelerator can perform the motion
correction, the calcium image enhancement, and the fluorescence tracing from up to
512 cells with less than 1-ms processing latency. We also designed hardware that
can detect new cells based on long short-term memory (LSTM) inference. We implemented
the accelerator on a Xilinx Ultra96 FPGA. The implementation achieves a 15.8x speedup
and over 2 orders of magnitude improvement in energy efficiency compared to evaluation
on a multi-core CPU.

Low Precision Floating Point Arithmetic for High Performance FPGA-based CNN Acceleration

  • Chen Wu

Low precision data representation is important to reduce storage size and memory access
for convolutional neural networks (CNNs). Yet, existing methods have two major limitations:
(1) requiring re-training to maintain accuracy for deep CNNs, and (2) needing 16-bit
floating point or 8-bit fixed point for a good accuracy.

In this paper, we propose a low precision (8-bit) floating point (LPFP) quantization
method for FPGA-based acceleration to overcome the above limitations. Without any
re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5
accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly
better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP
multiplication by one 4-bit multiply-adder (MAC) and one 3-bit adder, and therefore
implement four 8-bit LPFP multiplications using one DSP slice of Xilinx Kintex-7 family
(KC705 in this paper) while one DSP can implement only two 8-bit fixed point multiplications.
Experiments on six typical CNNs for inference show that on average, we improve throughput
by 64.5× over Intel i9 CPU and by 1.5× over existing FPGA accelerators. Particularly
for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput
by 3.5× and 27.5× and improve average throughput per DSP by 4.1× and 5×, respectively.
To the best of our knowledge, this is the first in-depth study to simplify one multiplication
for CNN inference to one 4-bit MAC and implement four multiplications within one DSP
while maintaining comparable accuracy without any re-training.

Maximizing CNN Throughput on FPGA Clusters

  • Ruihao Li

Field Programmable Gate Array (FPGA) platforms have been a popular choice for deploying
Convolutional Neural Networks (CNNs) as a result of their high parallelism and low energy
consumption. Due to the limited on-chip resources of a single board, FPGA clusters
have become a promising solution to improve the throughput of CNNs. In this paper, we first
put forward strategies to optimize resource allocation within and across FPGA boards.
Then we model the multi-board cluster problem and design algorithms based on the knapsack
problem and dynamic programming to calculate the optimal topology of the FPGA clusters.
We also give a quantitative analysis of the inter-board data transmission bandwidth
requirement. To make our design accommodate more situations, we provide solutions
for deploying fully connected layers and special convolution layers with large memory
requirement. Experimental results show that typical well-known CNNs with the proposed
topology of FPGA clusters could obtain a higher throughput per board than single-board
solutions and other multi-board solutions.
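
The abstract does not spell out how the knapsack formulation is set up, so the following is only a textbook 0/1 knapsack solved by dynamic programming, shown to make the underlying technique concrete. Treating each item as a candidate layer group with a resource cost and a throughput value, and the capacity as the resource budget of one board, is an assumption made for illustration, as are all names and numbers.

```python
def knapsack(costs, values, capacity):
    """Classic 0/1 knapsack: maximize total value with total cost <= capacity."""
    n = len(costs)
    # best[i][cap] = best value using only the first i items within budget cap.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, value = costs[i - 1], values[i - 1]
        for cap in range(capacity + 1):
            best[i][cap] = best[i - 1][cap]            # skip item i-1
            if cost <= cap:                            # or take it if it fits
                best[i][cap] = max(best[i][cap],
                                   best[i - 1][cap - cost] + value)
    # Walk the table backwards to recover which items were taken.
    chosen, cap = [], capacity
    for i in range(n, 0, -1):
        if best[i][cap] != best[i - 1][cap]:
            chosen.append(i - 1)
            cap -= costs[i - 1]
    return best[n][capacity], sorted(chosen)


# Hypothetical example: four candidate groups and a budget of 9 resource units.
group_costs = [3, 4, 5, 2]
group_values = [4, 5, 8, 3]
best_value, picked = knapsack(group_costs, group_values, capacity=9)
```

The table has (items + 1) × (capacity + 1) entries, so the run time grows with the product of the item count and the budget, which is why resource budgets are usually expressed in coarse units in formulations of this kind.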

R2CNN: Recurrent Residual Convolutional Neural Network on FPGA

  • Hiroki Nakahara

Over the past years, feed-forward convolutional neural networks (CNNs) have evolved
from a simple feed-forward architecture to deep and residual (skip-connection) architectures,
demonstrating increasingly higher object categorization accuracy and increasingly
better explanatory power of both neural and behavioral responses. However, from a
neuroscientist's point of view, the relationship between such deep architectures and
the ventral visual pathway is incomplete. For example, current state-of-the-art CNNs
appear to be too complex (e.g., now over 100 layers for ResNet) compared with the
relatively shallow cortical hierarchy (4-8 layers). We introduce new CNNs with shallow
recurrent architectures and skip connections requiring fewer parameters. With higher
accuracy for classification, we propose an architecture for recurrent residual convolutional
neural network (R2CNN) on FPGA, which efficiently utilizes on-chip memory bandwidth.
We propose an Output-Kernel-Input-Parallel (OKIP) convolution circuit for a recurrent
residual convolution stage. We implement the inference hardware on a Xilinx ZCU104
evaluation board with high-level synthesis. Our R2CNN accelerator achieves a top-5 accuracy
of 90.08% on the ImageNet benchmark, which is higher than that of conventional FPGA
implementations.

Synthesis-Free, Flexible and Fast Hardware Library for Biophysically Plausible Neurosimulations

  • Rene Miedema

Computational neuroscience uses models to study the brain. The Hodgkin-Huxley (HH)
model, and its extensions, is one of the most powerful, biophysically meaningful models
currently used. The high experimental value of the (extended) Hodgkin-Huxley (eHH)
models comes at the cost of steep computational requirements. Consequently, for larger
networks, neuroscientists either opt for simpler models, losing neuro-computational
features, or use high-performance computing systems. The eHH models can be efficiently
implemented as a dataflow application on an FPGA-based architecture. The state-of-the-art
FPGA-based implementations have proven to be time-consuming because of the long-duration
synthesis requirements. We have developed flexHH, a flexible hardware library, compatible
with a widely used neuron-model description format, implementing five FPGA-accelerated
and parameterizable variants of eHH models (standard HH with optional extensions:
custom ion-gates, gap junctions, and/or multiple cell compartments). Therefore, flexHH
is a crucial step towards high-flexibility and high-performance FPGA-based simulations,
eschewing the penalty of re-engineering and re-synthesis, dismissing the need for
an engineer. In terms of performance, flexHH achieves a speedup of 1,065x against
NEURON, the standard simulator in computational neuroscience, and speedups of
8x to 20x against sequential C. Furthermore, flexHH is faster per simulation step compared
to other HPC technologies, provides 65% or better performance density (in FLOPS/LUT)
compared to related works, and only shows a marginal performance drop in real-time
simulations.

HPIPE: Heterogeneous Layer-Pipelined and Sparse-Aware CNN Inference for FPGAs

  • Mathew Hall

This poster presents a novel cross-layer-pipelined Convolutional Neural Network accelerator
architecture, and network compiler, that make use of precision minimization and parameter
pruning to fit ResNet-50 entirely into on-chip memory on a Stratix 10 2800 FPGA. By
statically partitioning the hardware across each of the layers in the network, our
architecture enables full DSP utilization and reduces the soft logic per DSP ratio
by roughly 4x over prior work on sparse CNN accelerators for FPGAs. This high DSP
utilization, a frequency of 420MHz, and skipping zero weights enable our architecture
to execute a sparse ResNet-50 model at a batch size of 1 at 3300 images/s, which is
nearly 3x higher throughput than NVIDIA’s fastest machine learning targeted GPU, the
V100. We also present a network compiler and a flexible hardware interface that make
it easy to add support for new types of neural networks, and to optimize these networks
for FPGAs with different on-chip resources.

FTDL: An FPGA-tailored Architecture for Deep Learning Systems

  • Runbin Shi

Hardware acceleration of deep learning (DL) systems has been increasingly studied
to achieve desirable performance and energy efficiency. The FPGA strikes a balance
between high energy efficiency and fast development cycle and therefore is widely
used as a DNN accelerator. However, there exists an architecture-layout mismatch in
the current designs, which introduces scalability and flexibility issues, leading
to irregular routing and resource imbalance problems. To address these limitations,
in this work, we propose FTDL, an FPGA-tailored architecture with a parameterized
and hierarchical hardware that is adaptive to different FPGA devices. FTDL has the
following novelties: (i) At the architecture level, FTDL consists of Tiled Processing
Elements (TPE) and super blocks, to achieve a near-to-theoretical digital signal processing
(DSP) operating-frequency of 650 MHz. More importantly, FTDL is configurable and delivers
good scalability, i.e., the timing is stabilized even when the design is scaled-up
to 100% resource utilization for different deep learning systems. (ii) In workload
compilation, FTDL provides a compiler that manages to map the DL workloads to the
architecture level in an optimal manner. Experimental results show that for most benchmark
layers in MLPerf, FTDL achieves over 80% hardware efficiency.

SESSION: Poster Session III

Session details: Poster Session III

  • Kia Bazargan

Cash: A Single-Source Hardware-Software Codesign Framework for Rapid Prototyping

  • Blaise Tine

With Moore’s Law coming to an end, hardware specialization and systems on chips are
providing new opportunities for continuing performance scaling while reducing the
energy cost of computation. However, the current hardware design methodologies require
significant engineering efforts and domain expertise, making the design process unscalable.
More importantly, hardware specialization presents a unique challenge for a much tighter
software and hardware co-design environment to exploit domain-specific optimizations
and design efficiency. In this work, we introduce Cash, a single-source hardware-software
co-design framework for rapid SoC prototyping and accelerator research. Cash leverages
the unique efficiency and generative attributes of Modern C++ to provide a unified
development environment, aiming at closing the architecture research methodology gap.
The Cash framework introduces new co-design programming abstractions that enable seamless
integration with existing software from architecture research simulators to high-level
synthesis.

Performance Evaluation and Power Analysis of Teraflop-scale Fluid Simulation with
Stratix 10 FPGA

  • Atsushi Koshiba

Stream computing is a suitable approach to improve both performance and power efficiency
of numerical computations with FPGAs. To achieve further performance gain, temporal
and spatial parallelism were exploited: the former deepens and the latter duplicates
pipelines of streamed computation cores. These two types of parallelism were previously
evaluated with an Arria 10 FPGA. However, it has not been verified whether they are also effective
for the latest FPGA, Stratix 10, which has a larger number of logic elements (i.e.,
2.4X of Arria 10) and is equipped with a new feature to improve the maximum clock
frequency (i.e., HyperFlex architecture). To show the scalability for such state-of-the-art
FPGAs, in this paper, we first implemented a streamed fluid simulation accelerator
with both parallelism types for Stratix 10. We then thoroughly evaluated it by obtaining
computational performance (FLOPS), power efficiency (FLOPS/W), resource utilization,
and maximum clock frequency (Fmax). From the results, we found that this implementation
excessively used DSP blocks due to inefficient mapping of floating-point operations,
which reduced Fmax and the number of pipelined cores. To improve the scalability,
we optimized the implementation to reduce the DSP block usage by utilizing a Multiply-Add
function in a single DSP block. As a result, the optimized fluid simulation achieves
1.06 TFLOPS and 12.6 GFLOPS/W, which is 1.36X and 1.24X higher than the non-optimized
version, respectively. Moreover, we estimate that the fluid simulation with Stratix
10 could outperform GPU-based implementation with Tesla V100 by optimizing it for
HyperFlex architecture.

On the Exploration of Connection-aware Partitioning for Parallel FPGA Routing

  • Yun Zhou

Routing is one of the most time-consuming steps in the FPGA synthesis flow. Existing
works have described several ways to accelerate the routing process. The partitioning-based
parallel routing technique, which leverages the high-performance computing of multi-core
processors, has recently been gaining popularity. Specifically, such parallel routers partition
nets into regions by the nets’ bounding boxes, followed by a parallel routing procedure.
Nets can be split up into source-sink connections that share wire segments as much
as possible. In order to exploit more parallelism by a finer granularity in both spatial
partitioning and routing, a connection-aware routing bounding box model is introduced
in this work. We first show, through a detailed analysis of the workloads of parallel
routers, that connection-aware partitioning using the new routing bounding boxes gives
parallel routing better runtime efficiency than the existing net-based partitioning.
It reduces the number of connections spanning more than one region and
exploits more parallelism. The large heterogeneous Titan23 designs and a detailed
representation of the Stratix IV FPGA are used for benchmarking. Experimental results
show that the parallel FPGA router is faster when using our connection-aware partitioning
than using the existing net-based partitioning, while achieving similar quality of
routing results in terms of the wirelength and critical path delay. The connection-aware
routing bounding box model is easy to embed into other existing parallel routers
and can make them faster as well.
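
To see why finer granularity can reduce the work that spans regions, the sketch below compares one bounding box per net with one bounding box per source-sink connection against a single vertical partition cut. It is a simplified geometric illustration only: the paper's connection-aware routing bounding box model may be defined differently, and the function names, coordinates, and cut position are hypothetical.

```python
def bounding_box(points):
    """Axis-aligned box (xmin, ymin, xmax, ymax) around a list of (x, y) pins."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def crosses_cut(bbox, cut_x):
    """True if the box has area on both sides of the vertical line x = cut_x."""
    xmin, _, xmax, _ = bbox
    return xmin < cut_x <= xmax


def net_bbox(source, sinks):
    """Net-based view: one box covering the source and every sink."""
    return bounding_box([source] + list(sinks))


def connection_bboxes(source, sinks):
    """Connection-based view: one box per source-sink pair."""
    return [bounding_box([source, sink]) for sink in sinks]


# Hypothetical net: two nearby sinks and one far sink, with a cut at x = 5.
source = (1, 1)
sinks = [(2, 3), (3, 2), (9, 1)]
cut_x = 5

net_crosses = crosses_cut(net_bbox(source, sinks), cut_x)
conn_crosses = [crosses_cut(b, cut_x) for b in connection_bboxes(source, sinks)]
```

In this example the whole net spans the cut, but only the connection to the far sink does, so the two short connections could be assigned to the left region and routed in parallel with work in the right region.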

High Density Pipelined 8bit Multiplier Systolic Arrays for FPGA

  • Martin Langhammer

With the advent of AI and machine learning as the highest profile FPGA applications,
INT8 performance is currently one of the key benchmarking metrics. In current devices,
INT8 multipliers must be extracted from higher precision multipliers. Recently, we
reported the implementation of a mixed DSP Block and soft logic design, with 22,400
INT8 multipliers, and a system clock rate of 416MHz, on the Intel Stratix 10 2800
chip.

In this paper we demonstrate alternate techniques for integer multiplier construction
to better balance the resource types on current FPGAs – logic, memory, and DSP – to
make a significant improvement in the multiplier, and therefore the dot product, density.
We further extend these techniques to 8 bit signed-magnitude (SM) 1.7 representation,
which can further improve arithmetic density by using the logic and memory resources
more flexibly. We describe variable composition dot product structures, which can
be assembled in a scalable 2D systolic array. In one example, we report a design containing
32,768 SM1.7 multipliers, with a clock rate of 432MHz, giving a system performance
of over 28 TOPs. Our INT8 densities are improved by up to 30% over the earlier work
– we show one design with 28,800 INT8 multipliers. In all cases, enough device resources
are left free and accessible to implement a full application level design.
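
The quoted system performance can be checked with simple arithmetic from the multiplier count and clock rate. The sketch below assumes that each multiplier contributes two operations per cycle (one multiply and one add in the dot product); that convention is an assumption, since the abstract does not state how operations are counted, and the function name is hypothetical.

```python
def peak_tops(num_multipliers, clock_mhz, ops_per_multiplier=2):
    """Peak tera-operations per second, assuming every multiplier is busy each cycle."""
    ops_per_second = num_multipliers * ops_per_multiplier * clock_mhz * 1e6
    return ops_per_second / 1e12


# Figures taken from the abstract.
sm17_tops = peak_tops(32_768, 432)        # SM1.7 design
earlier_int8_tops = peak_tops(22_400, 416)  # previously reported INT8 design

# Ratio of INT8 multiplier counts between the new and the earlier design.
int8_count_ratio = 28_800 / 22_400
```

Under the two-operations assumption, 32,768 multipliers at 432MHz give roughly 28.3 TOPs, which is consistent with the "over 28 TOPs" stated above.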

Reactive Signal Obfuscation with Time-Fracturing to Counter Information Leakage in
FPGAs

  • Stephen M. Williams

With tremendous economic and technological ramifications, hardware security has become
an increasingly critical design metric for FPGA-based logic design. In this work,
we focus on countermeasures against power side-channel attacks in any reconfigurable
computing system implemented with modern FPGA fabric. We design and implement a novel
countermeasure technique called Time-Fracturing (TF) to fend off side-channel-based
information leakage, which proves to be both hardware-efficient and minimally invasive.
To validate its effectiveness, we have applied our TF technique to an FPGA-based AES128
encryption core. Our experimental results have shown an increase of more than 50 times,
when compared to its unprotected baseline, in its attack difficulty measured by the
number of traces required to extract the secret key. Furthermore, our approach is
orthogonal to existing methods, thus having the potential to be integrated in the
future for a multi-variate defense mechanism.

Cycle-Free FPGA Routing Graphs

  • Ang Li

Accurate timing characterization of FPGA routing resources, i.e. wires and switches,
is critical to achieving high quality of results from FPGA routing tools. Although
the composition and connectivity of the routing resources are easily extracted from
an FPGA’s architecture, post-layout timing characterization of the FPGA’s wires and
switches (NOT the design being mapped onto the FPGA) with EDA tools is a challenging
task due to the large quantity of combinational loops (cycles in the routing graph).
Likewise, the use of EDA tools is severely limited when constructing new FPGA architectures.
This work addresses the challenge by proposing an algorithm to construct cycle-free
FPGA routing graphs. A cycle-free FPGA routing graph is achieved by logically ordering
wires and intelligently removing or rearranging a small fraction of the switch block
connections in order to break cycles. The proposed approach enables constraining the
timing of all routing resources, which is otherwise impossible due to the combinational
loops. This technique can be applied to post-layout static timing analysis (STA) of
existing FPGAs, significantly reducing the complexity and improving the accuracy of
the analysis. In addition, this cycle-free approach can be adopted when designing
new FPGAs, transforming costly hand layout into an automated step compatible with
commercial ASIC EDA tools.
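
The idea of ordering wires and dropping the connections that violate the order can be sketched on a small directed graph. The code below is a naive illustration under stated assumptions: it takes a wire order as given and removes every switch that points backwards in that order, whereas the abstract describes choosing the order and removing or rearranging only a small fraction of connections, which is the hard part and is not shown. All names and the example graph are hypothetical.

```python
from collections import defaultdict, deque


def break_cycles(order, edges):
    """Keep only edges that go forward in the given wire order."""
    rank = {wire: i for i, wire in enumerate(order)}
    kept = [(u, v) for u, v in edges if rank[u] < rank[v]]
    removed = [(u, v) for u, v in edges if rank[u] >= rank[v]]
    return kept, removed


def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be peeled off."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return visited == len(nodes)


# Hypothetical routing graph: wires as nodes, programmable switches as edges.
wires = ["w0", "w1", "w2", "w3"]
switches = [("w0", "w1"), ("w1", "w2"), ("w2", "w0"), ("w2", "w3"), ("w3", "w1")]

kept, removed = break_cycles(wires, switches)
```

Because every kept edge goes from a lower rank to a higher rank, the wire order itself is a topological order of the result, so the kept graph cannot contain a combinational loop.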

An Algorithm for Delay Optimal Logic Replication for FPGAs Accounting for Combinational
Loops

  • Rupesh S. Shelar

Logic replication is often necessary to improve the speed of emulation for systems employing
field programmable gate arrays (FPGAs), since designs are large enough to require
partitioning across multiple FPGAs (or boards of FPGAs). In this paper, we propose
a polynomial time algorithm for combinational logic replication that ensures delay
optimality for directed acyclic graphs and reduces overhead due to look-up table (LUT)
and cut resources. The algorithm is further extended to consider combinational loops,
often yielding delay optimal results. Experimental results on industrial designs show,
on average, 44%, 33%, and 33% reduction in overhead due to cut, LUT costs, and
runtimes, respectively, compared to existing heuristics, thus demonstrating the efficiency
of the algorithm.

QTAccel: A Generic FPGA based Design for Q-Table based Reinforcement Learning Accelerators

  • Rachit Rajat

Q-Table based Reinforcement Learning (QRL) is a class of widely used algorithms in
AI that work by successively improving the estimates of Q values — quality of state-action
pairs, stored in a table. They significantly outperform Neural Network based techniques
when the state space is tractable. Fast learning for AI applications in several domains
(e.g. robotics), with tractable ‘mid-sized’ Q-tables, still necessitates performing
substantial rapid updates. State-of-the-art FPGA implementations of QRL do not scale
with the increasing Q-Table state space, thus are not efficient for such applications.
In this work, we develop a novel FPGA implementation of QRL, scalable to large state
spaces and facilitating a large class of AI applications. Our pipelined architecture
provides higher throughput while using significantly fewer on-chip resources and thereby
supports a variety of action selection policies that cover Q-Learning and variations
of bandit algorithms. Possible dependencies caused by consecutive Q value updates
are handled, allowing the design to process one Q-sample every clock cycle. Additionally,
we provide the first known FPGA implementation of the SARSA (State-Action-Reward-State-Action)
algorithm. We evaluate our architecture for Q-Learning and SARSA algorithms and show
that our designs achieve a high throughput of up to 180 million Q samples per second.
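
For readers unfamiliar with the two algorithms named above, the sketch below shows the standard tabular update rules in software. It illustrates only the arithmetic performed per Q-sample, not the pipelined hardware design; the learning rate `alpha`, the discount factor `gamma`, and the tiny example table are assumptions chosen for illustration.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


# Hypothetical Q-table with 2 states and 2 actions.
q_table_ql = [[0.0, 0.0], [1.0, 2.0]]
q_table_sarsa = [[0.0, 0.0], [1.0, 2.0]]

# Same transition (state 0, action 1, reward 1.0, next state 1) under both rules.
q_learning_update(q_table_ql, s=0, a=1, r=1.0, s_next=1)
sarsa_update(q_table_sarsa, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

The two rules differ only in the bootstrap term, and both read and write the same table. When back-to-back samples touch the same state-action entry, the second update must see the result of the first, which is the kind of dependency between consecutive Q value updates mentioned above.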

The Case for Hard Matrix Multiplier Blocks in an FPGA

  • Aman Arora

Designing efficient hardware for accelerating machine learning (ML) applications is
a major challenge. Rapidly changing algorithms and network architectures in this field
make FPGA based designs an attractive solution. But the generic building blocks available
in current FPGAs (ALMs/CLBs, DSP blocks) limit the acceleration that can be achieved.
We propose a modification to the current FPGA architecture that makes FPGAs specialized
for ML applications. Specifically, we propose adding hard matrix multiplier blocks
(matmuls) into the FPGA fabric. These matmuls are implemented using systolic arrays
of MACs (Multiply-And-Accumulate) and can be connected using programmable direct interconnect
between neighboring matmuls to make larger systolic matrix multipliers. We explore
various matmul sizes (4x4x4, 8x8x8, 16x16x16, 32x32x32) and various strategies to
place these blocks on the FPGA (clustered, surround, columnar). We recommend 4x4x4
matmul blocks with columnar placement after studying tradeoffs between area, frequency,
fragmentation and channel width. Experimental results and analytical evaluation reveal
that providing matmuls in an FPGA speeds up state-of-the-art neural networks (ResNet50,
GNMT, Transformer, Minigo) by ~2.5x on average, compared to a DSP-heavy FPGA with
an equal number of MACs. Therefore, FPGAs with hard matrix multipliers can be used to
design faster, more area (and hence, power) efficient hardware accelerators for ML
applications, compared to current FPGAs, at the cost of reducing the flexibility of
the FPGA for other applications. A matmul-heavy FPGA fabric could be a part of a bigger
FPGA, the rest of which can have general programmable logic, or fully ML-specific
FPGAs with matmuls could be created.
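
How small fixed-size blocks compose into a larger matrix multiplication can be sketched in software. The code below reads "4x4x4" as a block that multiplies a 4x4 tile by a 4x4 tile, which is an interpretation rather than something the abstract defines, and it iterates over tiles in time, whereas the proposed blocks are chained spatially through direct interconnect. Function names and the example matrices are hypothetical.

```python
TILE = 4  # assumed edge length of one hard matmul block


def matmul_block(a, b, c, i0, j0, k0):
    """Accumulate one TILE x TILE x TILE partial product into c."""
    for i in range(i0, i0 + TILE):
        for j in range(j0, j0 + TILE):
            acc = c[i][j]
            for k in range(k0, k0 + TILE):
                acc += a[i][k] * b[k][j]   # one multiply-accumulate (MAC)
            c[i][j] = acc


def tiled_matmul(a, b):
    """Multiply a (n x k) by b (k x m) using only TILE-sized block operations."""
    n, kdim, m = len(a), len(b), len(b[0])
    assert n % TILE == 0 and kdim % TILE == 0 and m % TILE == 0
    c = [[0] * m for _ in range(n)]
    blocks_used = 0
    for i0 in range(0, n, TILE):
        for j0 in range(0, m, TILE):
            for k0 in range(0, kdim, TILE):
                matmul_block(a, b, c, i0, j0, k0)
                blocks_used += 1
    return c, blocks_used


# Hypothetical 8x8 example: identity times a counting matrix.
size = 8
identity = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
counting = [[i * size + j for j in range(size)] for i in range(size)]
product, blocks_used = tiled_matmul(identity, counting)
```

An 8x8 by 8x8 product decomposes into 2 x 2 x 2 = 8 block operations. Matrix sizes that are not multiples of the block size leave part of a block idle, which is one way to picture the fragmentation trade-off mentioned above.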

Performance Portable FPGA Design

  • Nils Voss

FPGA platforms are widely used for application acceleration. Although a number of
high-level design frameworks exist, application and performance portability across
different platforms remain challenging. To address the above problem, we propose an
API design for high-level development tools to separate platform-dependent code from
the remaining application design. Additionally, we propose design guidelines to assist
with performance portability. To demonstrate our techniques, a large-scale application,
originally developed for an Intel Stratix-V FPGA is ported to several new Xilinx Virtex
UltraScale+ systems. The accelerated application, developed in a high-level framework,
is rapidly moved onto the new platforms with minimal changes. The original, unmodified
kernel code delivers a 1.74x speedup due to increased clock frequency on the new platform.
Subsequently, the application is further optimised to make use of the additional resources
available on the larger UltraScale+ FPGAs, guided by a simple analytical performance
model. This results in an additional performance increase of up to 7.4x. Using the
presented framework, we demonstrate rapid deployment of the same application across
a number of different platforms that leverage the same FPGA family but differ in their
low-level implementation details and the available peripherals. As a result, the same
application code supports five different platforms: Maxeler MAX5C DFE, Amazon EC2
F1, Xilinx Alveo U200, U250 and the original Intel Stratix-V accelerator card, with
performance close to what is theoretically achievable for each of these platforms.
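
The separation of platform-dependent code from application code described above can be illustrated with a minimal Python sketch. The names `Platform`, `LoopbackPlatform` and `run_application` are hypothetical and are not taken from the paper or from any vendor framework; the "platform" here is a software stand-in, not a real board.

```python
from abc import ABC, abstractmethod


class Platform(ABC):
    """Hypothetical platform layer: everything board-specific lives here."""

    @abstractmethod
    def write(self, data):
        """Move input data to the accelerator (e.g. over PCIe or DMA)."""

    @abstractmethod
    def read(self):
        """Fetch results back from the accelerator."""


class LoopbackPlatform(Platform):
    """Software stand-in used only for this sketch; no real board involved."""

    def __init__(self, name):
        self.name = name
        self._buffer = []

    def write(self, data):
        self._buffer = list(data)

    def read(self):
        return list(self._buffer)


def run_application(platform, samples):
    """Platform-independent application code: it only sees the Platform API."""
    platform.write(x * x for x in samples)   # stand-in for the real kernel
    return platform.read()


# The same application code runs unchanged on two (simulated) platforms.
boards = [LoopbackPlatform("board_a"), LoopbackPlatform("board_b")]
results = [run_application(board, [1, 2, 3]) for board in boards]
```

Porting to a new board then means adding one more `Platform` subclass while `run_application` stays untouched.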

Accuracy-Aware Memory Allocation to Mitigate BRAM Errors for Voltage Underscaling
on FPGA Overlay Accelerators

  • Tanvir Ahmed

Approximate computing (AC) aims to achieve energy-efficiency in digital systems by
sacrificing the computational accuracy of an application. Memory-intensive applications,
in which a large amount of data is processed to reach a meaningful conclusion, are
the primary target. Systems for such applications consist of a large pool of compute units
and sizeable on-chip memory. The total energy consumption for such applications is
often dominated by the on-chip memory. We, therefore, focus on improving the energy
efficiency of the on-chip memory by appropriately scaling down its supply voltage.

In this paper, we propose a memory allocation technique for FPGA-based accelerators
to improve accuracy and energy consumption for such memory-intensive applications.
Unlike the state of the art, our technique focuses on the BRAM of the FPGA. Since an
application consists of both critical and non-critical data, which must be treated
accordingly to maintain good computational accuracy, we use the LUTRAM of the
FPGA to realize the reliable memory, whereas BRAM operating at a lower voltage is
considered the unreliable one. First, we introduce a compiler pre-processor to
annotate the arrays of an application as critical or non-critical. Afterward,
we employ an exploration heuristic to select an optimal point of the reliable and
unreliable memories for the application without incurring run-time as well as energy
consumption based on pre-characterized memory power. Experimental results on various
signal and image processing applications reveal that the proposed memory allocation
heuristic improves the accuracy from 13.0% to 73.2% along with 0.77x energy savings
while incurring 1.12x circuit area.
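
The idea of splitting annotated arrays between reliable LUTRAM and voltage-underscaled BRAM can be sketched as follows. This is a simple greedy placement written for illustration only; it is not the exploration heuristic of the paper, and the function name, array names and capacity figure are all assumptions.

```python
def allocate_arrays(arrays, lutram_capacity):
    """Place critical arrays in reliable memory first, the rest in BRAM.

    arrays: list of (name, size_in_bits, is_critical) tuples.
    Returns a dict mapping each array name to "LUTRAM" or "BRAM".
    """
    placement = {}
    free = lutram_capacity
    # Visit the smallest arrays first so that as many critical arrays
    # as possible fit into the limited reliable memory.
    for name, size, critical in sorted(arrays, key=lambda a: a[1]):
        if critical and size <= free:
            placement[name] = "LUTRAM"
            free -= size
        else:
            placement[name] = "BRAM"
    return placement


# Hypothetical annotated arrays of a small image-processing kernel.
arrays = [
    ("coefficients", 512, True),
    ("loop_bounds", 128, True),
    ("pixels", 65536, False),
]
placement = allocate_arrays(arrays, lutram_capacity=1024)
```

A critical array that does not fit falls back to BRAM in this sketch; a real flow would have to account for the accuracy impact of that choice.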

Near-memory Acceleration for Scalable Phylogenetic Inference

  • Nikolaos Alachiotis

Phylogenetics studies the evolutionary history of a collection of organisms based on
observed heritable molecular traits, finding practical application in a wide range
of domains, from conservation biology and epidemiology, to forensics and drug development.
A fundamental computational kernel to evaluate evolutionary histories, also referred
to as phylogenies, is the Phylogenetic Likelihood Function (PLF), which dominates
the total execution time (by up to 95%) of widely used maximum-likelihood phylogenetic
methods. Numerous efforts to boost PLF performance over the years mostly focused on
accelerating computation; since the PLF is a data-intensive, memory-bound operation,
performance remains limited by data movement. In this work, we employ near-memory
computation units (NMUs) within an FPGA-based computing environment with disaggregated
memory to alleviate the data movement problem and improve performance and energy efficiency
when inferring large-scale phylogenies. NMUs were deployed on a multi-FPGA emulation
platform for the IBM dReDBox disaggregated datacenter prototype. We find that performance
and power efficiency improve by an order of magnitude when NMUs compute on local
data that reside on the same server tray. This is achieved through an efficient data-allocation
scheme that minimizes inter-tray data transfers (remote-data movement) when computing
the PLF. More specifically, we observe up to 22x better FLOPS performance and 13x
higher power efficiency (FLOPS/Watt) over the more traditional, accelerator-as-a-coprocessor
model, which requires explicit remote-data transfers between disaggregated memory
modules and accelerator units.

FPTLOPT: An Automatic Transistor-Level Optimization Tool for GRM FPGA

  • Yufan Zhang

FPGA circuit design usually adopts a full-custom design method, which makes it
difficult to design and optimize an FPGA manually. We therefore present FPTLOPT (FPGA
Transistor-Level Optimization Tool), which supports a more complex FPGA architecture
called the general routing matrix (GRM) architecture and offers higher accuracy and
higher speed than COFFE [1]. To fit a more complex FPGA architecture, we use the regular
matching method to automatically extract the circuit types and build the circuit
netlists. To achieve higher accuracy, we predict the layout area with an area model
that we build, and then precisely predict the post-layout simulation delay with a load
model that we build. To achieve higher speed, we devise a variable-range greedy algorithm
that expands the range automatically. We also provide equalization kernel multi-thread
acceleration that can change the thread number according to the current CPU hardware
environment. The experimental results illustrate that FPTLOPT supports the optimization
of the GRM architecture and builds the key sub-circuit netlists. In addition, the area
prediction is more precise by a maximum of 43%, and the delay obtained from delay
prediction is 28% more precise than those in COFFE. Besides, FPTLOPT quickly obtains the
optimal transistor sizing results for different optimization objectives. For the same
circuit, the optimization speed is 19.96 times faster than COFFE.

INTB: A New FPGA Interconnect Model for Architecture Exploration

  • Chengyu Hu

CAD exploration is important for designing FPGA interconnect topologies. It includes
two steps: first, design a parameterized model that can express as much of the
architecture space as possible; second, use a CAD flow to analyze the described
interconnect architecture. In this paper, we present a new interconnect model, named
INTB (Interconnect Block). At each logical position, one INTB represents all related
routing resources, and hierarchical parameters are designed to simplify the description.
Compared with the existing CB-SB model, the INTB model can support more interconnect
features of modern FPGAs, such as various types of wire segments and complex connections.
These features can improve FPGA routing ability. To apply the INTB model, two
modifications are made to the CAD flow. One is the generation of the routing resource
graph (RRG): a tile-based method is proposed to generate the RRG from parameters. The
other is cost computation during the routing process: two strategies are applied,
respectively, to cost estimation for short and curve wire segments, which do not exist
in the CB-SB model. The INTB model and the CAD improvements are implemented in VTR 8.0.
The experiments consist of two parts. First, the INTB model is adopted to re-describe
CB-SB architectures to verify its description capacity. After the CAD flow, the average
differences in routing area and timing between the two models are about 4% and 5%.
Second, the INTB model is used to explore the architecture space with modern
FPGA features. Experimental results show obvious performance enhancement, over 10%
in some benchmarks.

V-LSTM: An Efficient LSTM Accelerator Using Fixed Nonzero-Ratio Viterbi-Based Pruning

  • Taesu Kim

Long Short-Term Memory (LSTM) has been widely adopted in tasks with sequence data,
such as speech recognition and language modeling. LSTM brought significant accuracy
improvement by introducing additional parameters to Recurrent Neural Network (RNN).
However, the increasing number of parameters and computations also led to inefficiency
in computing LSTM on edge devices with limited on-chip memory size and DRAM bandwidth.
In order to reduce the latency and energy of LSTM computations, there has been a pressing
need for model compression schemes and suitable hardware accelerators. In this paper,
we first propose the Fixed Nonzero-ratio Viterbi-based Pruning, which can reduce the
memory footprint of LSTM models by 96% with negligible accuracy loss. By applying
additional constraints on the distribution of surviving weights in Viterbi-based Pruning,
the proposed pruning scheme mitigates the load-imbalance problem and thereby increases
the processing engine utilization rate. Then, we propose the V-LSTM, an efficient
sparse LSTM accelerator based on the proposed pruning scheme. The high compression ratio
of the proposed pruning scheme allows the proposed accelerator to achieve 24.9% lower
per-sample latency than that of state-of-the-art accelerators. The proposed accelerator
is implemented on a Xilinx VC-709 FPGA evaluation board running at 200 MHz for evaluation.

DBHI: A Tool for Decoupled Functional Hardware-Software Co-Design on SoCs

  • Unai Martinez-Corral

This paper presents a system-level co-simulation and co-verification workflow to ease
the transition from a software-only procedure, executed in a General Purpose processor,
to the integration of a custom hardware accelerator developed in a Hardware Description
Language (HDL).

We propose a tool which enables Dynamic Binary Modification to decouple the development
of the hardware accelerator from the software-only application to be accelerated.
It provides support for rapid iterative exploration and functional verification of
hardware designs while keeping the unmodified software application as a reference.
DBHI is able to instrument an application and inject compiled hardware. It allows
progressive migration from application source code, to non-synthesizable HDL, and
to synthesizable HDL. At the same time, it preserves cycle-accurate/bit-accurate results,
and provides run-time visibility of the internal data buffers for debugging purposes.
Foreign architecture emulation overhead during development is avoided, and early integration
with peripherals in the target System-on-Chip is possible.

The proposed design flow was evaluated on executions of hardware simulations on x86-64
and Arm. DBHI was developed from existing off-the-shelf tools, and we evaluated it
on multiple architectures; however, the technique is not tied to any specific architecture.

Distinguished Speaker

ACM Distinguished Speaker Program (DSP): Apply to become an ACM SIGDA Candidate

ACM SIGDA is pleased to offer the opportunity to apply to become an ACM Distinguished Speaker in the fields represented by ACM SIGDA.  Being part of the DSP is a way of giving back to the community, as well as inspiring the next generation of computing professionals.

Application

The nomination for a candidate or a self-nomination must include the following:

  • CV/Resume
  • Personal URL
  • Recent talks/short courses/presentations within the last three years, please include approximate audience size
  • URL to LinkedIn profile
  • URL to recent talk slides
  • URL to a lecture video
  • Nomination letter of Support
  • Optional: Reference letters

A prerequisite for becoming a Distinguished Speaker is a minimum of 5 years of experience (in academia, in industry, or in a combination of both).

Process

The SIGDA Executive Committee (EC) will review the application and if accepted, will forward the application to the ACM DSP Committee as a SIGDA nominee. Each DSP Committee member will review the application and decide on the acceptability of the SIGDA nominee.

Kindly note that the ACM Distinguished Speaker term is three years.

Deadlines

ACM SIGDA welcomes applications twice a year. Deadline dates: May 31 and December 15 of each year.

Application material should be sent to SIGDA-DSP@acm.org.

Further Information

For more information please contact SIGDA-DSP@acm.org or check the complete policies about the ACM Distinguished Speaker program at: https://speakers.acm.org/about/policies, which provides you with travel guidelines, financial guidelines and tips for speakers.

CADathlon 2019 Problem References

Problem 1: Circuit Design and Analysis
Contributed by Jianlei Yang, Beihang University
Overview: Solve Landau-Lifshitz-Gilbert (LLG) equation (in C++)
Reference: Iwasaki, Junichi, Masahito Mochizuki, and Naoto Nagaosa. “Current-induced skyrmion dynamics in constricted geometries,” Nature nanotechnology 8.10 (2013): 742.
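
For orientation, a single-spin integration of the LLG equation might look like the sketch below. The contest problem was set in C++; Python is used here only for brevity. The sketch assumes dimensionless units, a constant effective field and explicit Euler stepping with renormalization, and it omits the current-induced torque terms treated in the reference; the parameter values are arbitrary.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def llg_step(m, h, alpha, gamma, dt):
    """One explicit Euler step of the LLG equation in Landau-Lifshitz form:
    dm/dt = -gamma / (1 + alpha^2) * (m x H + alpha * m x (m x H))."""
    mxh = cross(m, h)
    mxmxh = cross(m, mxh)
    scale = -gamma / (1.0 + alpha * alpha)
    new_m = [m[i] + dt * scale * (mxh[i] + alpha * mxmxh[i]) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in new_m))
    return tuple(c / norm for c in new_m)   # renormalize to keep |m| = 1


m = (1.0, 0.0, 0.0)   # initial magnetization along x
h = (0.0, 0.0, 1.0)   # constant effective field along z
for _ in range(5000):
    m = llg_step(m, h, alpha=0.1, gamma=1.0, dt=0.01)
```

With damping present, the magnetization precesses around the field and gradually aligns with it. A higher-order integrator would be preferable for production use.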

Problem 2: Physical Design & Design for Manufacturability
Contributed by William Chow, Cadence
Overview: Tap assignment for gated clock network (in C++)
Reference: W-H Chen, C-K Wang, H-M Chen, Y-C Chou, and C-H Tsai, “A Comparative Study on Multisource Clock Network Synthesis,” The 22nd Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI), 2016

Problem 3: Logic & High-Level Synthesis
Overview: Boolean Function Manipulation by Quantification (in C++)
Reference: No specific reference is provided.
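
As a reminder of the underlying operation, quantifying a variable out of a Boolean function combines its two cofactors: existential quantification takes their OR, universal quantification their AND. The sketch below (in Python for brevity, whereas the contest used C++) represents functions as plain callables over an assignment dictionary; this representation and the helper names are assumptions chosen for illustration, and a practical tool would use BDDs or a similar structure.

```python
from itertools import product


def cofactor(f, var, value):
    """Restrict f by fixing `var` to `value`; f maps an assignment dict to a bool."""
    return lambda env: f({**env, var: value})


def exists(f, var):
    """Existential quantification: f[var=0] OR f[var=1]."""
    f0, f1 = cofactor(f, var, False), cofactor(f, var, True)
    return lambda env: f0(env) or f1(env)


def forall(f, var):
    """Universal quantification: f[var=0] AND f[var=1]."""
    f0, f1 = cofactor(f, var, False), cofactor(f, var, True)
    return lambda env: f0(env) and f1(env)


def truth_table(f, variables):
    """Evaluate f on all assignments, in binary counting order."""
    return [f(dict(zip(variables, bits)))
            for bits in product([False, True], repeat=len(variables))]


f = lambda env: (env["a"] and env["b"]) or env["c"]   # f = a*b + c
exists_a = truth_table(exists(f, "a"), ["b", "c"])    # same table as b + c
forall_a = truth_table(forall(f, "a"), ["b", "c"])    # same table as c
```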

Problem 4: System Design & Analysis
Contributed by Andy Yu-Guang Chen, National Central University
Overview: On-line Wake-up Scheduling for Multi-module design (in C++)
Reference 1: D. Brelaz, “New Methods to Color the Vertices of a Graph,” Communications of the ACM, Vol.22, Issue 4, Apr. 1979.
Reference 2: M.C. Lee, Y. Shi, Y.G. Chen, D. Marculescu, S.C. Chang, “Efficient On-Line Module-Level Wake-Up Scheduling for High Performance Multi-Module Designs,” Proc. on the International Symposium on Physical Design (ISPD), 2012, Page(s): 97-104.
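
Reference 1 is the source of the DSATUR graph-coloring heuristic, which repeatedly colors the uncolored vertex whose neighbours already use the most distinct colors. A compact sketch follows, in Python for brevity (the contest used C++). How the coloring maps onto the wake-up scheduling problem is not stated here, so the example simply colors a small graph.

```python
def dsatur(graph):
    """Greedy DSATUR coloring. graph: dict mapping vertex -> set of neighbours."""
    colors = {}

    def saturation(v):
        # Number of distinct colors already used by the neighbours of v.
        return len({colors[u] for u in graph[v] if u in colors})

    while len(colors) < len(graph):
        # Pick the uncolored vertex with the highest saturation;
        # break ties by the highest degree.
        v = max((u for u in graph if u not in colors),
                key=lambda u: (saturation(u), len(graph[u])))
        used = {colors[u] for u in graph[v] if u in colors}
        colors[v] = next(c for c in range(len(graph)) if c not in used)
    return colors


# A cycle of five vertices cannot be colored with fewer than three colors.
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
coloring = dsatur(cycle)
```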

Problem 5: Functional Verification & Testing
Contributed by Hao Zheng, University of South Florida
Overview: Cycle-based logic simulation (in C++)
Reference 1: S. Palnitkar and D. Parham, “Cycle Simulation Techniques,” IEEE International Verilog HDL Conference, 1995, Page(s) 2-8.
Reference 2: A. Biere, “The AIGER And-Inverter Graph (AIG) Format, Version 20070427,” Johannes Kepler University, 2006-2007
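
The principle of cycle-based simulation is to evaluate the combinational logic once per clock cycle in a fixed order and then update all state elements together. The sketch below (in Python for brevity, whereas the contest used C++) does this for a small and-inverter graph using the AIGER literal encoding, where a literal is twice the variable index plus one if negated. It does not parse AIGER files; the function names and the example circuit are assumptions for illustration.

```python
def lit_value(values, lit):
    """Value of a literal: variable index is lit >> 1, negated if lit is odd."""
    return values[lit >> 1] ^ (lit & 1)


def simulate(inputs, latches, ands, outputs, stimulus):
    """Cycle-based simulation of an and-inverter graph.

    inputs:   list of input literals (even numbers)
    latches:  list of (latch_literal, next_state_literal); initial state is 0
    ands:     list of (lhs, rhs0, rhs1), ordered so operands are defined first
    outputs:  list of output literals
    stimulus: one tuple of 0/1 input values per clock cycle
    """
    state = {lit >> 1: 0 for lit, _ in latches}
    trace = []
    for vector in stimulus:
        values = {0: 0}                      # variable 0 is constant false
        values.update(state)
        values.update({lit >> 1: bit for lit, bit in zip(inputs, vector)})
        for lhs, rhs0, rhs1 in ands:         # one pass over the gates per cycle
            values[lhs >> 1] = lit_value(values, rhs0) & lit_value(values, rhs1)
        trace.append(tuple(lit_value(values, o) for o in outputs))
        # All latches take their next-state values at the end of the cycle.
        state = {lit >> 1: lit_value(values, nxt) for lit, nxt in latches}
    return trace


# Toggle flip-flop with enable: next q = q XOR en, built from three AND gates.
# Variable 1 is en (literal 2), variable 2 is q (literal 4).
ands = [(6, 2, 5),     # v3 = en AND NOT q
        (8, 3, 4),     # v4 = NOT en AND q
        (10, 7, 9)]    # v5 = NOT v3 AND NOT v4, so literal 11 is en XOR q
trace = simulate(inputs=[2], latches=[(4, 11)], ands=ands, outputs=[4],
                 stimulus=[(1,), (1,), (0,), (1,)])
```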

Problem 6: Future technologies (Bio-EDA, Security, AI, etc.)
Contributed by Mimi Xie, The University of Texas at San Antonio and Caiwen Ding, University of Connecticut
Overview: Efficient Pruning for Neural Networks (in Python)
Reference: Han, Song, Jeff Pool, John Tran, and William Dally. “Learning both weights and connections for efficient neural network,” In Advances in neural information processing systems, pp. 1135-1143. 2015.
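
The reference prunes a trained network by removing low-magnitude connections and then retraining the remaining weights. The pruning step alone can be sketched in plain Python as below; the function name, the per-matrix threshold and the example weights are assumptions, and retraining is not shown.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly the fraction `sparsity` of weights with smallest magnitude.

    weights: list of rows (lists of floats). Returns (pruned_weights, mask),
    where mask holds 1 for kept weights and 0 for pruned ones.
    Ties at the threshold are all pruned, so slightly more may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights], [[1] * len(row) for row in weights]
    threshold = flat[n_prune - 1]
    mask = [[0 if abs(w) <= threshold else 1 for w in row] for row in weights]
    pruned = [[w if keep else 0.0 for w, keep in zip(row, mrow)]
              for row, mrow in zip(weights, mask)]
    return pruned, mask


weights = [[0.9, -0.05, 0.3],
           [0.01, -0.7, 0.2]]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
```

During retraining, the mask would typically be reapplied after each update so that pruned connections stay at zero.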

ICCAD 2019 TOC

SESSION: Keynote

Session details: Keynote

  • Bustany
    Ismail

Fusion: The Dawn of the Hyperconvergence Era in EDA

  • Krishnamoorthy
    Shankar

Hyperconvergence is a software-centric architecture which has disrupted the datacenter
industry in a dramatic way by bringing the disparate areas of compute, storage and
networking into a single system. A hyperconverged system allows the integrated …

SESSION: New Advances in Placement

Session details: New Advances in Placement

  • Yang
    Stephen

How Deep Learning Can Drive Physical Synthesis Towards More Predictable Legalization

  • Netto
    Renan

Machine learning has been used to improve the predictability of different physical
design problems, such as timing, clock tree synthesis and routing, but not for legalization.
Predicting the outcome of legalization can be helpful to guide incremental …

Graceful Register Clustering by Effective Mean Shift Algorithm for Power and Timing
Balancing

  • Chang
    Ya-Chu

With the wide adoption of FinFET technology in mass production, dynamic power becomes
the bottleneck to achieving low power. Therefore, clock power reduction is crucial
in modern IC design. Register clustering can effectively save clock power because
of …

Device Layer-Aware Analytical Placement for Analog Circuits

  • Xu
    Biying

The layouts of analog/mixed-signal (AMS) integrated circuits (ICs) are dramatically
different from their digital counterparts. AMS circuit layouts usually include a variety
of devices, including transistors, capacitors, resistors, and inductors. A …

Analytical Mixed-Cell-Height Legalization Considering Average and Maximum Movement
Minimization

  • Li
    Xingquan

Modern circuit designs often contain standard cells of different row heights to meet
various design requirements. Due to the higher interference among heterogeneous cell
structures, the legalization problem for mixed-cell-height standard cells becomes

SESSION: FPGA Special Session: Advances in Adaptable Heterogeneous Computing and Acceleration
for Big Data

Session details: FPGA Special Session: Advances in Adaptable Heterogeneous Computing
and Acceleration for Big Data

  • Iyer
    Mahesh

FPGA-based Computing in the Era of AI and Big Data

  • Nurvitadhi
    Eriko

The continued rapid growth of data, along with advances in Artificial Intelligence
(AI) to extract knowledge from such data, is reshaping the computing ecosystem landscape.
With AI becoming an essential part of almost every end-user application, our …

Advances in Adaptable Computing

  • Gupta
    Amit

Recent technical challenges have forced the industry to explore options beyond the
conventional “one size fits all” CPU scalar processing solution. Very large vector
processing (DSP, GPU) solves some problems, but it runs into traditional scaling …

Improving Programmability and Efficiency of Large-Scale Graph Analytics for FPGA Platforms

  • Ozdal
    Muhammet Mustafa

Large-scale graph analytics has gained importance due to emergence of new applications
in different contexts such as web, social networks, and computational biology. It
is known that typical CPU/GPU implementations for sparse graph applications cannot

SESSION: Routing in All Forms

Session details: Routing in All Forms

  • Madden
    Patrick

Pin Access-Driven Design Rule Clean and DFM Optimized Routing of Standard Cells under
Boolean Constraints

  • Ryzhenko
    Nikolay

In this paper, we propose a routing flow for nets within a standard cell that generates
layout of standard cells without any design rule violations. Design rules, density
rules for metal fill, and pin-access requirements are modeled via Boolean formulas

PSION: Combining Logical Topology and Physical Layout Optimization for Wavelength-Routed
ONoCs

  • Truppel
    Alexandre

Optical Networks-on-Chip (ONoCs) are a promising solution for high-performance multi-core
integration with better latency and bandwidth than traditional Electrical NoCs. Wavelength-routed
ONoCs (WRONoCs) offer yet additional performance guarantees. …

Construction of All Multilayer Monolithic Rectilinear Steiner Minimum Trees on the
3D Hanan Grid for Monolithic 3D IC Routing

  • Lin
    Sheng-En David

Monolithic three-dimensional (3D) integration enables stacking multiple ultra-thin
silicon tiers in a single package, thereby providing smaller footprint area, shorter
wirelength, higher performance, and lower power consumption than conventional planar

ROAD: Routability Analysis and Diagnosis Framework Based on SAT Techniques

  • Park
    Dongwon

Routability diagnosis has increasingly become the bottleneck in detailed routing for
sub-10nm technology due to the limited tracks, high density, and complex design rules. The
conventional ways to examine the routability of detailed routing are ILP- and …

SESSION: Keynote

Session details: Keynote

  • Menezes
    Noel

A Perspective on Security and Trust Requirements for the Future

  • Plaks
    Kenneth

As integrated circuit manufacturing becomes increasingly global and the availability
of domestically produced advanced transistor nodes shrinks, security vulnerabilities
within the supply chain become a significant issue for IC defense applications. In

SESSION: Patterning and Machine Learning

Session details: Patterning and Machine Learning

  • Young
    Evangeline

Declarative Language for Geometric Pattern Matching in VLSI Process Rule Modeling

  • Suto
    Gyuszi

This paper presents a formal (machine readable) declarative language developed for
the specific reason of modeling physical design process rules of any complexity. Case
studies are presented on synthetic as well as industry known design rules of simple

Electromigration-Aware Interconnect Design

  • Sapatnekar
    Sachin S.

Electromigration (EM) is seen as a growing problem in recent and upcoming technology
nodes, and affects a wider variety of wires (e.g., power grid, clock/signal nets),
circuits (e.g., digital, analog, mixed-signal), and systems (e.g., mobile, server,

Toward Intelligent Physical Design: Deep Learning and GPU Acceleration

  • Ren
    Haoxing

Deep learning (DL) has achieved tremendous success in computer vision, natural language
processing and gaming. Would DL help push physical design toward a more intelligent
paradigm to meet the post-Moore era design automation challenges? We will discuss

Multiple Patterning Layout Compliance with Minimizing Topology Disturbance and Polygon
Displacement

  • Chang
    Hua-Yu

Multiple patterning lithography (MPL) divides a layout into several masks and manufactures
them by a series of exposure and etching steps. As technology advances, MPL is still
indispensable because of its cost effectiveness and hybrid lithography …

SESSION: Cyber-Physical Systems

Session details: Cyber-Physical Systems

  • Groeneveld
    Patrick

From Electronic Design Automation to Automotive Design Automation

  • Lin
    Chung-Wei

Advanced driver assistance systems (ADAS), autonomous functions, and connected applications
bring a revolution to automotive systems, but they also make automotive design, especially
software and electronics, more complex than ever. The complexity …

Enterprise-wide AI-enabled Digital Transformation

  • Maasoumy
    Mehdi

Having solved the data integration problem, we discuss how convergence of 4 technology
vectors, namely Big Data, Artificial Intelligence, Cloud Computing, and Internet of
Things (IoT) has, for the first time, enabled us to solve a class of problems …

Secure and Trustworthy Cyber-Physical System Design: A Cross-Layer Perspective

  • Nuzzo
    Pierluigi

This talk discusses some of the design challenges posed by cyber-physical system security
at different abstraction layers, from algorithm design to the realization of trusted
hardware platforms. We introduce two design problems, namely, detecting sensor …

SESSION: Lifetime Achievement Award Tribute to Professor Alberto Sangiovanni-Vincentelli

Session details: Lifetime Achievement Award Tribute to Professor Alberto Sangiovanni-Vincentelli

  • Nuzzo
    Pierluigi

The Slow Start of Fast Spice: A Brief History of Timing

  • White
    Jacob K.

The list of Professor Alberto Sangiovanni-Vincentelli’s research contributions is
astounding in length and breadth, yet does not entirely capture what this author believes
is his true genius. In so many areas of computer-aided design, Sangiovanni-…

Basic and Advanced Researches in Logic Synthesis and their Industrial Contributions

  • Fujita
    Masahiro

We first present a historical view on the techniques for two-level and multi-level logic
optimizations, and discuss the practical issues with respect to them. Then the techniques
for sequential optimizations are briefly reviewed. Based on them, a new …

From Electronic Design Automation to Cyber-Physical System Design Automation: A Tale of Platforms and Contracts

  • Nuzzo
    Pierluigi

This paper reflects on the design challenges posed by cyber-physical systems, what
distinguishes cyber-physical system design from large-scale integrated circuit design,
and what could be the opportunities for the design automation community. The paper

My 50-Year Journey from Punched Cards to Swarm Systems

  • Sangiovanni Vincentelli
    Alberto

The article is a reflection on my journey during the development of the EDA field,
from its early days to its explosive growth and present maturity. The two special
issues of the Solid State Circuit Society Magazine “Corsi e Ricorsi: Alberto Sangiovanni

SESSION: Lifetime Achievement Award Dinner Banquet Keynote

Freedom From Choice and the Power of Models: in Honor of Alberto Sangiovanni-Vincentelli

  • Lee
    Edward A.

Discovery, invention, and design are all about models. When we say “Joseph Priestly
discovered oxygen in 1774,” we do not mean that Priestly dug up a canister of oxygen,
recognized it as something new, and released it, for the first time, into the air.

SESSION: Physical Design – Where are we going?

Session details: Physical Design – Where are we going?

  • Cheng
    C.K.

Analog Layout Synthesis: Are We There Yet?

  • Mangalagiri
    Prasanth

Over the past decade, spurred by advances in mobile computing, there has been a fundamental
shift in computing needs of consumer applications. There has been an industry-wide
transition from highly CPU-centric to a peripheral-centric, connectivity and …

Lagrangian Relaxation Based Gate Sizing With Clock Skew Scheduling – A Fast and Effective
Approach

  • Sharma
    Ankur

Recent work has established Lagrangian relaxation (LR) based gate sizing as state-of-the-art
providing the best power reduction with low run time. Gate sizing has limited potential
to reduce the power when the timing constraints are tight. By adjusting …

Adaptive Clustering and Sampling for High-Dimensional and Multi-Failure-Region SRAM
Yield Analysis

  • Shi
    Xiao

Statistical circuit simulation is exhibiting increasing importance for memory circuits
under process variation. It is challenging to accurately estimate the extremely low
failure probability as it becomes a high-dimensional and multi-failure-region …

SESSION: Detailed Routing Contest Results

Session details: Detailed Routing Contest Results

  • Chinnery
    David

ISPD 2019 Initial Detailed Routing Contest and Benchmark with Advanced Routing Rules

  • Liu
    Wen-Hao

Detailed routing becomes the most complicated and runtime consuming stage in the physical
design flow as technology nodes advance. Due to the inaccessibility of advanced routing
rules and industrial designs, it is hard to conduct detailed routing …

ICCAD 2016 TOC

Scope – quality retaining display rendering workload scaling based on user-smartphone
distance

  • Nixon
    Kent W.

Modern smartphone display systems come equipped with powerful GPUs capable of rendering
advanced 2D and 3D graphics. These GPUs make up a significant portion of the system
power profile due to the high resolution and framerate of smartphone display. …

NVSim-CAM: a circuit-level simulator for emerging nonvolatile memory based content-addressable
memory

  • Li
    Shuangchen

Ternary Content-Addressable Memory (TCAM) is widely used in networking routers, fully
associative caches, search engines, etc. While the conventional SRAM-based TCAM suffers
from the poor scalability, the emerging nonvolatile memories (NVM, i.e., MRAM, …

Design technology for fault-free and maximally-parallel wavelength-routed optical
networks-on-chip

  • Peano
    Andrea

The recent interest in emerging interconnect technologies is bringing the issue of
proper EDA support for them to the forefront, so as to tackle the design complexity.
A relevant case study is provided by wavelength-routed optical NoCs (WRONoCs), which

Fast generation of lexicographic satisfiable assignments: Enabling canonicity in SAT-based applications

  • Petkovska
    Ana

Lexicographic Boolean satisfiability (LEXSAT) is a variation of the Boolean satisfiability
problem (SAT). Given a variable order, LEXSAT finds a satisfying assignment whose
integer value under the given variable order is minimum (maximum) among all …

Analytic approaches to the collapse operation and equivalence verification of threshold
logic circuits

  • Lee
    Nian-Ze

Threshold logic circuits gain increasing attention due to their feasible realization
with emerging technologies and strong bind to neural network applications. In this
paper, for logic synthesis we formulate the fundamental operation of collapsing …

A flash-based digital circuit design flow

  • Abusultan
    Monther

Traditionally, floating gate (flash) transistors have been used exclusively to implement
non-volatile memory in its various forms. Recently, we showed that flash transistors
can be used to implement digital circuits as well. In this paper, we present …

MrDP: multiple-row detailed placement of heterogeneous-sized
cells for advanced nodes

  • Lin
    Yibo

As VLSI technology shrinks to fewer tracks per standard cell, e.g., from 10-track
to 7.5-track libraries (and lesser for 7nm), there has been a rapid increase in the
usage of multiple-row cells like two- and three-row flip-flops, buffers, etc., for

OWARU: free space-aware timing-driven incremental placement

  • Jung
    Jinwook

This paper proposes a powerful new technique called “OWARU” that re-places and re-sizes
multiple gates simultaneously to improve the most critical
paths of a design. In essence, it is an incremental timing-driven placement technique
integrated with …

Detailed placement for modern FPGAs using 2D dynamic programming

  • Dhar
    Shounak

In this paper, we propose a 2-dimensional dynamic programming (DP) based detailed
placement algorithm for modern FPGAs for wirelength and timing optimization. By tuning
a control parameter, our algorithm can perform fast heuristic or exact optimization.

Security and privacy threats to on-chip non-volatile memories and countermeasures

  • Ghosh
    Swaroop

Non-volatile memories (NVMs) such as Spin-Transfer Torque RAM (STTRAM) have drawn
significant attention due to complete elimination of bitcell leakage. In addition
to the plethora of benefits such as density, non-volatility, low-power and high speed,

Security engineering of nanostructures and nanomaterials

  • Shahrjerdi
    D.

Proliferation of electronics and their increasing connectivity pose formidable challenges
for information security. At the most fundamental level, nanostructures and nanomaterials
offer an unprecedented opportunity to introduce new approaches to …

Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks

  • Zhang
    Chen

With the recent advancement of multilayer convolutional neural networks (CNN), deep
learning has achieved amazing success in many areas, especially in visual content
understanding and classification. To improve the performance and energy-efficiency
of …

Re-architecting the on-chip memory sub-system of machine-learning accelerator for
embedded devices

  • Wang
    Ying

The rapid development of deep learning is enabling plenty of novel applications
such as image and speech recognition for embedded systems, robotics or smart wearable
devices. However, typical deep learning models like deep convolutional neural …

A data locality-aware design framework for reconfigurable sparse matrix-vector multiplication
kernel

  • Li
    Sicheng

Sparse matrix-vector multiplication (SpMV) is an important computational kernel in
many applications. For performance improvement, software libraries designated for
SpMV computation have been introduced, e.g., MKL library for CPUs and cuSPARSE library …

Compact oscillation neuron exploiting metal-insulator-transition for neuromorphic
computing

  • Chen
    Pai-Yu

The phenomenon of metal-insulator-transition (MIT) in strongly correlated oxides,
such as NbO2, has shown the oscillation behavior in recent experiments. In this work, the MIT
based two-terminal device is proposed as a compact oscillation neuron for …

A new tightly-coupled transient electro-thermal simulation method for power electronics

  • Chen
    Quan

This paper presents a new transient electro-thermal (ET) simulation method for fast
3D chip-level analysis of power electronics with field solver accuracy. The metallization
stacks are meshed and solved with 3D field solver using nonlinear temperature-…

A tensor-based volterra series black-box nonlinear system identification and simulation
framework

  • Batselier
    Kim

Tensors are a multi-linear generalization of matrices to their d-way counterparts, and are receiving intense interest recently due to their natural
representation of high-dimensional data and the availability of fast tensor decomposition
algorithms. …

Efficient statistical analysis for correlated rare failure events via asymptotic probability
approximation

  • Yu
    Handi

In this paper, a novel Asymptotic Probability Approximation (APA) method is proposed
to estimate the overall rare probability of correlated failure events for complex
circuits containing a large number of replicated cells (e.g., SRAM bit-cells). The
key …

Duplex: simultaneous parameter-performance exploration for optimizing analog circuits

  • Ahmadyan
    Seyed Nematollah

We present Duplex random tree search, an algorithm to optimize performance metrics
of analog and mixed signal circuits. Duplex determines the optimal design, the Pareto
set and the sensitivity of circuit’s performance metrics to its parameters. We …

Improved flop tray-based design implementation for power reduction

  • Kahng
    Andrew B.

Clock network power reduction is critical in modern SoC designs. Application of flop trays (i.e., multi-bit flip-flops) can significantly reduce the number of sinks in a clock
network, and thus reduce the number of clock buffers, clock wirelength, and …

RC-aware global routing

  • Scheifele
    Rudolf

We address the problem of incorporating RC delay constraints into global routing.
In contrast to the usual global routing approach that focuses on minimizing net length
while obeying constraints given by other tools such as layer assignments, our method

Scalable, high-quality, SAT-based multi-layer escape routing

  • Bayless
    Sam

Escape routing for Printed Circuit Boards (PCBs) is an important problem arising from
modern packaging with large numbers of densely spaced pins, such as BGAs. Single-layer
escape routing has been well-studied, but large, dense BGAs often require …

Redistribution layer routing for integrated fan-out wafer-level chip-scale packages

  • Lin
    Bo-Qiao

The integrated fan-out (InFO) wafer-level chip-scale package (WLCSP) is an emerging
packaging technology, which typically consists of multiple redistribution layers (RDLs)
for signal redistributions among multiple chips. There is still no published work

The architecture value engine: measuring and delivering sustainable SoC improvement

  • Carballo
    Juan-Antonio

The value of semiconductor-based systems continues to increase rapidly especially
when considering the cost associated with building them. As such, Moore’s Law has become
a law associated broadly with value growth instead of pure performance growth. While

Circuit valorization in the IC design ecosystem

  • de Gyvez
    José Pineda

Staying at the forefront of research, or in the top tier product market requires circuit
innovation as a key differentiation. We are entering an era where more than Moore
is becoming increasingly evident, not only because of the physical limitations of

Interconnect-aware device targeting from PPA perspective

  • Badaroglu
    Mustafa

CMOS scaling so far enabled simultaneous system throughput scaling by concurrent improvements
in delay, power, and area, thanks to Moore’s law. CMOS scaling becomes more difficult
with the limits of interconnect and increasing wafer cost. It is …

Measuring progress and value of IC implementation technology

  • Kahng
    Andrew B.

Over the past decade, “Moore’s Law” has become increasingly well-understood as being
a law of “value scaling”: success of new electronics- and semiconductor-based products
depends on improved cost-efficiency, utility, and value. Design Automation (DA) …

Provably secure camouflaging strategy for IC protection

  • Li
    Meng

The advancing of reverse engineering techniques has complicated the efforts in intellectual
property protection. Proactive methods have been developed recently, among which layout-level
IC camouflaging is the leading example. However, existing …

CamoPerturb: secure IC camouflaging for minterm protection

  • Yasin
    Muhammad

Integrated circuit (IC) camouflaging is a layout-level technique that thwarts reverse
engineering attacks on ICs by introducing camouflaged cells that look alike, but can
implement one of many possible Boolean functions. Existing camouflaging techniques

Chip editor: leveraging circuit edit for logic obfuscation and trusted fabrication

  • Shakya
    Bicky

The globalization of the semiconductor foundry business poses grave risks in terms
of intellectual property (IP) protection, especially for critical applications. Over
the past few years, several techniques have been proposed that allow manufacturing
of …

Arbitrary streaming permutations with minimum memory and latency

  • Koehn
    Thaddeus

Streaming architectures are a popular choice for data-intensive applications due to
their high throughput requirements. When assembling components for a streaming application,
it is often necessary to build translation blocks between them to match the …

Multibank memory optimization for parallel data access in multiple data arrays

  • Yin
    Shouyi

To realize high throughput out of a relatively low bandwidth, memory partitioning
algorithms have been proposed to separate data arrays into multiple memory banks,
from which multiple data can be accessed in parallel. However, previous partitioning

Allocation of multi-bit flip-flops in logic synthesis for power optimization

  • Yi
    Dongyoun

In this paper, a new approach to the problem of allocating multi-bit flip-flops for
data storage is presented. Previous approaches divide the allocation problem into
two separate steps: (i) placing single-bit flip-flops under circuit timing constraints

Model-based design of resource-efficient automotive control software

  • Chang
    Wanli

Automotive platforms today run hundreds of millions of lines of software code implementing
a large number of different control applications spanning across safety-critical functionality
to driver assistance and comfort-related functions. While such …

Testing automotive embedded systems under X-in-the-loop setups

  • Tibba
    Ghizlane

The development of automotive electronics and software systems is often associated
with high costs due to their multi-domain nature (including control engineering, electronics,
hydraulics, mechanics, etc). The involvement of these different disciplines …

Efficient statistical validation of machine learning systems for autonomous driving

  • Shi
    Weijing

Today’s automotive industry is making a bold move to equip vehicles with intelligent
driver assistance features. A modern automobile is now equipped with a powerful computing
platform to run multiple machine learning algorithms for environment …

CONVINCE: a cross-layer modeling, exploration and validation framework for next-generation connected
vehicles

  • Zheng
    Bowen

Next-generation autonomous and semi-autonomous vehicles will not only perceive the
environment with their own sensors, but also communicate with other vehicles and surrounding
infrastructures for vehicle safety and transportation efficiency. The design, …

Overview of the 2016 CAD contest at ICCAD

  • Huang
    Shih-Hsu

The CAD Contest at ICCAD is a challenging, multi-month competition, focusing on advanced,
real-world problems in the field of Electronic Design Automation (EDA). In its fifth
year, the 2016 CAD Contest at ICCAD attracted 135 teams from 11 regions/…

ICCAD-2016 CAD contest in large-scale identical fault search

  • Wei
    Tangent

Injecting faults into designs is a way to qualify a verification environment. To improve
the performance of a qualifying process, we need to remove identical faults. The problem
will provide some faulty design cases; the contestants must identify all …

ICCAD-2016 CAD contest in non-exact projective NPNP boolean matching and benchmark
suite

  • Wu
    Chi-An (Rocky)

Boolean Matching is significant to industry applications, such as library binding,
synthesis, engineer change order, and hardware Trojan detection. Instead of basic
Boolean matching, Non-exact Projective NPNP Boolean Matching allows to match two designs

ICCAD-2016 CAD contest in pattern classification for integrated circuit design space
analysis and benchmark suite

  • Topaloglu
    Rasit O.

Layout pattern classification has been utilized in recent years in integrated circuit
design towards various goals such as design space analysis, design rule generation,
and systematic yield optimization. There is a need for open source or academic …

OpenDesign flow database: the infrastructure for VLSI design and design automation
research

  • Jung
    Jinwook

Recently, there have been a slew of design automation contests and released benchmarks.
ISPD place & route contests, DAC placement contests, timing analysis contests at TAU
and CAD contests at ICCAD are good examples in the past and more of new contests …

Malicious LUT: a stealthy FPGA trojan injected and triggered by the design flow

  • Krieg
    Christian

We present a novel type of Trojan trigger targeted at the field-programmable gate
array (FPGA) design flow. Traditional triggers are based on rare events, such as rare values
or sequences. While in most cases these trigger circuits are able to hide a Trojan

On detecting delay anomalies introduced by hardware trojans

  • Ismari
    D.

A hardware Trojan (HT) detection method is presented that is based on measuring and
detecting small systematic changes in path delays introduced by capacitive loading
effects or series inserted gates of HTs. The path delays are measured using a high

An optimization-theoretic approach for attacking physical unclonable functions

  • Liu
    Yuntao

Physical unclonable functions (PUFs) utilize manufacturing variations of circuit elements
to produce unpredictable response to any challenge vector. The attack on PUF aims
to predict the PUF response to all challenge vectors while only a small number of

LRR-DPUF: learning resilient and reliable digital physical unclonable function

  • Miao
    Jin

Conventional silicon physical unclonable function (PUF) extracts fingerprints from
transistor’s analog attributes, which are vulnerable to environmental and operational
variations. Recently, digitalized PUF prototypes have emerged to overcome the …

Enabling online learning in lithography hotspot detection with information-theoretic
feature optimization

  • Zhang
    Hang

With the continuous shrinking of technology nodes, lithography hotspot detection and
elimination in the physical verification phase is of great value. Recently machine
learning and pattern matching based methods have been extensively studied to overcome

Incorporating cut redistribution with mask assignment to enable 1D gridded design

  • Kuang
    Jian

1D gridded design is one of the most promising solutions that can enable the scaling
to 10nm technology node and beyond. Line-end cuts are needed to fabricate 1D layouts, where
two techniques are available to resolve the conflicts between cuts: cut …

VCR: simultaneous via-template and cut-template-aware routing for directed self-assembly
technology

  • Su
    Yu-Hsuan

The directed self-assembly (DSA) technology for next-generation lithography has
shown great potential for fabricating highly dense via patterns and cut masks
in the sub-5 nm technology node and beyond. However, DSA via and cut optimizations

DSA-compliant routing for two-dimensional patterns using block copolymer lithography

  • Su
    Yu-Hsuan

Two-dimensional (2D) directed self-assembly (DSA) is an emerging lithography for the
5 nm process node and beyond that can substantially increase design flexibility in
critical routing layers and reduce the number of cuts for better yield. The state-of-…

The art of semi-formal bug hunting

  • Nalla
    Pradeep Kumar

Verification is a critical task in the development of correct computing systems. Simulation
remains the predominantly used technique to identify design flaws, due to its scalability.
However, simulation intrinsically suffers from low functional coverage,…

Compiled symbolic simulation for SystemC

  • Herdt
    Vladimir

Ensuring the correctness of SystemC virtual prototypes is indispensable. For such
models, existing symbolic simulation approaches are based on interpreting their behavior.
In this paper we propose a major enhancement called Compiled Symbolic Simulation (…

Exact diagnosis using boolean satisfiability

  • Riener
    Heinz

We propose an exact algorithm to model-free diagnosis with an application to fault
localization in digital circuits. We assume that a faulty circuit and a correctness
specification, e.g., in terms of an un-optimized reference circuit, are available.
Our …

Efficient and accurate analysis of single event transients propagation using SMT-based
techniques

  • Hamad
    Ghaith Bany

This paper presents a hierarchical framework to model, analyze, and estimate digital
design vulnerability to soft errors due to Single Event Transients (SETs). A new SET
propagation model is proposed. This model simultaneously includes the impact of …

Power delivery in 3D packages: current crowding effects, dynamic IR drop and compensation network using sensors (invited
paper)

  • Kannan
    Sukeshwar

In 3D packages top-die power delivery is not only limited by back-end of line (technology
scaling), but also by the TSV integration scheme, the stacking method and the microbump
current-carrying capability. The microbump structure and its …

Cost analysis and cost-driven IP reuse methodology for SoC design based on 2.5D/3D
integration

  • Stow
    Dylan

Due to the increasing fabrication and design complexity with new process nodes, the
cost per transistor trend originally identified in Moore’s Law is slowing when using
traditional integration methods. However, emerging die-level integration …

Energy-efficient and reliable 3D network-on-chip (NoC): architectures and optimization
algorithms

  • Das
    Sourav

The Network-on-Chip (NoC) paradigm has emerged as an enabler for integrating a large
number of embedded cores in a single die. Three-dimensional (3D) integration, a breakthrough
technology to achieve “More Moore and More Than Moore,” provides numerous …

The hype, myths, and realities of testing 3D integrated circuits

  • Wang
    Ran

Three-dimensional (3D) integration using through-silicon vias (TSVs) promises higher
integration levels in a single package, keeping pace with Moore’s law. Despite the
promise and benefits offered by 3D integration, testing remains a major obstacle that

TASA: toolchain-agnostic static software randomisation for critical real-time systems

  • Kosmidis
    Leonidas

Measurement-Based Probabilistic Timing Analysis (MBPTA) derives WCET estimates for
tasks running on processors comprising high-performance features such as caches. MBPTA’s
correct application requires the system to exhibit certain timing properties, …

Splitting functions in code management on scratchpad memories

  • Kim
    Youngbin

As the number of cores increases, cache-based memory hierarchy is becoming a major
problem in terms of the scalability and energy consumption. Software-managed scratchpad
memories (SPM) is a scalable alternative to caches, but the benefit comes at the …

Adaptive performance prediction for integrated GPUs

  • Gupta
    Ujjwal

Integrated GPUs have become an indispensable component of mobile processors due to
the increasing popularity of graphics applications. The GPU frequency is a key factor
both in application throughput and mobile processor power consumption under graphics

Energy-efficient fault tolerance approach for internet of things applications

  • Xu
    Teng

Fault tolerance (FT) is essential in many Internet of Things (IoT) applications, in
particular in the domains such as medical devices and automotive systems where a single
fault in the system can lead to serious consequences. Non-volatile memory (NVM), …

Critical path isolation for time-to-failure extension and lower voltage operation

  • Masuda
    Yutaka

Device miniaturization due to technology scaling has made manufacturing variability
and aging more significant, and lower supply voltage makes circuits sensitive to dynamic
environmental fluctuation. These may shorten the time to failure (TTF) of …

Control synthesis and delay sensor deployment for efficient ASV designs

  • Li
    Chaofan

Adaptive Supply Voltage (ASV) is a power-efficient approach to achieving resilience
against process variation and circuit aging. Fine-grained ASV offers further power-efficiency
gains, but entails relatively complex control circuit, which has not been …

Performance driven routing for modern FPGAs

  • Kannan
    Parivallal

FPGA routing is a well studied problem. Basic point-to-point routing of nets on FPGA
fabrics can be done optimally using well known shortest path algorithms like Dijkstra’s
and A-star. Practical rip-up and reroute algorithms like PathFinder have been …

UTPlaceF: a routability-driven FPGA placer with physical and congestion aware packing

  • Li
    Wuxi

FPGA packing and placement without routability consideration could lead to unroutable
results for high-utilization designs. Conventional FPGA packing and placement approaches
are shown to have severe difficulties to yield good routability. In this paper,…

RippleFPGA: a routability-driven placement for large-scale heterogeneous FPGAs

  • Pui
    Chak-Wa

As the complexity and scale of FPGA circuits grow, resolving routing congestion becomes
more important in FPGA placement. In this paper, we propose a routability-driven placement
algorithm for large-scale heterogeneous FPGAs. Our proposed algorithm …

GPlace: a congestion-aware placement tool for ultrascale FPGAs

  • Pattison
    Ryan

Traditional FPGA flows that wait until the routing stage to tackle congestion are
quickly becoming less effective. This is due to the increasing size and complexity
of FPGA architectures and the designs targeted for them. In this paper, we present
two …

Resiliency in dynamically power managed designs

  • Lai
    Liangzhen

Dynamic power management has become essential for low power designs and systems. Whether
intentionally or unintentionally, these power reduction techniques and corresponding
management schemes can impact the hardware reliability and system resiliency in …

Dynamic reliability management for near-threshold dark silicon processors

  • Kim
    Taeyoung

In this article, we propose a new dynamic reliability management (DRM) technique
at the system level for emerging low power dark silicon manycore microprocessors operating
in near-threshold region. We mainly consider the electromigration (EM) failures. …

A cross-layer approach for resiliency and energy efficiency in near threshold computing

  • Golanbari
    M. S.

Energy constrained systems become the cornerstone of emerging energy harvested or
battery-limited applications in Internet of Things (IoT) platforms. A promising approach
is to operate at near threshold voltage ranges, which can significantly reduce …

Design space exploration of drone infrastructure for large-scale delivery services

  • Park
    Sangyoung

Drones, also referred to as unmanned aerial vehicles (UAVs), are recently expanding
their field of usage beyond military surveillance and tactical applications. Commercial
drone delivery service is one of the promising applications in the near future, …

Multi-objective design optimization for flexible hybrid electronics

  • Bhat
    Ganapati

Flexible systems that can conform to any shape are desirable for wearable applications.
Over the past decade, there have been tremendous advances in the domain of flexible
electronics which enabled printing of devices, such as sensors on a flexible …

KCAD: kinetic cyber-attack detection method for cyber-physical additive manufacturing systems

  • Chhetri
    Sujit Rokka

Additive Manufacturing (AM) uses Cyber-Physical Systems (CPS) (e.g., 3D Printers)
that are vulnerable to kinetic cyber-attacks. Kinetic cyber-attacks cause physical
damage to the system from the cyber domain. In AM, kinetic cyber-attacks are realized
by …

Autonomous sensor-context learning in dynamic human-centered internet-of-things environments

  • Rokni
    Seyed Ali

Human-centered Internet-of-Things (IoT) applications utilize computational algorithms
such as machine learning and signal processing techniques to infer knowledge about
important events such as physical activities and medical complications. The …

Formulating customized specifications for resource allocation problem of distributed
embedded systems

  • Zhang
    Xinhai

There are plentiful attempts for increasing the efficiency, generality and optimality
of the Design Space Exploration (DSE) algorithms for resource allocation problems
of distributed embedded systems. Most contemporary approaches formulate DSE as an

A polyhedral model-based framework for dataflow implementation on FPGA devices of
iterative stencil loops

  • Natale
    Giuseppe

Iterative Stencil Loops (ISLs) are a specific class of algorithms of great importance
for their substantial presence in a lot of industrial and scientific computing applications,
such as in numerical methods for solving partial differential equation —

Efficient memory compression in deep neural networks using coarse-grain sparsification
for speech applications

  • Kadetotad
    Deepak

Recent breakthroughs in deep neural networks have led to the proliferation of their
use in image and speech applications. Conventional deep neural networks (DNNs) are
fully-connected multi-layer networks with hundreds or thousands of neurons in each

Parallel code-specific CPU simulation with dynamic phase convergence modeling for
HW/SW co-design

  • Kemmerer
    Warren

While SystemC models provide a promising solution to the complex problem of HW/SW
co-design within the system-on-chip paradigm, such requires a detailed annotation
of transaction level energy and performance data within the model. While this data
can be …

Architectural-space exploration of approximate multipliers

  • Rehman
    Semeen

This paper presents an architectural-space exploration methodology for designing approximate
multipliers. Unlike state-of-the-art, our methodology generates various design points
by adapting three key parameters: (1) different types of elementary …

Design of power-efficient approximate multipliers for approximate artificial neural
networks

  • Mrazek
    Vojtech

Artificial neural networks (NN) have shown a significant promise in difficult tasks
like image classification or speech recognition. Even well-optimized hardware implementations
of digital NNs show significant power consumption. It is mainly due to non-…

Automated error prediction for approximate sequential circuits

  • Kapare
    Amrut

Synthesis tools for approximate sequential circuits require the ability to quickly,
efficiently, and automatically characterize and bound the errors produced by the circuits.
Previous approaches to characterize errors in approximate sequential circuits …

Approximation-aware rewriting of AIGs for error tolerant applications

  • Chandrasekharan
    Arun

Approximation circuits offer superior performance (speed and area) compared to traditional
circuits at the cost of computational accuracy. The accuracy of the results in approximation
circuits is evaluated based on several error metrics such as worst-…

Properties first? a new design methodology for hardware, and its perspectives in safety
analysis

  • Urdahl
    Joakim

This paper discusses the possible role of formal verification techniques in system-level
design flows. It is argued that the role of formal verification techniques should
not be limited to “bug hunting” alone. Instead, formal technology should be …

Where formal verification can help in functional safety analysis

  • Bernardini
    Alessandro

Formal techniques seem to be a way to cope with the exploding complexity of functional
safety analysis. Here, the overall fault propagation probability to a certain safety-point
in the design must be analyzed. As a consequence, the careful verification …

Formal approaches to design of active cell balancing architectures in battery management
systems

  • Steinhorst
    Sebastian

Large battery packs composed of Lithium-Ion cells are continuously gaining in importance
due to their applications in Electric Vehicles (EVs) and smart energy grids. To ensure
maximum lifetime, safety and performance of the battery pack, complex …

How much cost reduction justifies the adoption of monolithic 3D ICs at 7nm node?

  • Ku
    Bon Woong

In this paper we study power, performance, and cost (PPC) tradeoffs for 2-tier, gate-level,
full-chip GDS monolithic 3D ICs (M3D) built using a foundry-grade 7nm bulk FinFET
technology. We first develop highly-accurate wafer and die cost models for 2D …

A novel unified dummy fill insertion framework with SQP-based optimization method

  • Tao
    Yudong

Dummy fill insertion is widely applied to significantly improve the planarity of topographic
patterns for chemical mechanical polishing process in VLSI manufacture. However, these
dummies will lead to additional parasitic capacitance and deteriorate the …

Efficient yield estimation through generalized importance sampling with application
to NBL-assisted SRAM bitcells

  • Ciampolini
    Lorenzo

We consider the general problem of the efficient and accurate determination of the
yield of an integrated circuit, through electrical circuit level simulation, under
variability constraints due to the manufacturing process. We demonstrate the …

Are proximity attacks a threat to the security of split manufacturing of integrated
circuits?

  • Magaña
    Jonathon

Split manufacturing is a technique that allows manufacturing the transistor-level
and lower metal layers of an IC at a high-end, untrusted foundry, while manufacturing
only the higher metal layers at a smaller, trusted foundry. Using split manufacturing

Making split-fabrication more secure

  • Yang
    Ping-Lin

Today many design houses must outsource their design fabrication to a third party
which is often an overseas foundry. Split-fabrication is proposed for combining the
FEOL capabilities of an advanced but untrusted foundry with the BEOL capabilities
of a …

A machine learning approach to fab-of-origin attestation

  • Ahmadi
    Ali

We introduce a machine learning approach for distinguishing between integrated circuits
fabricated in a ratified facility and circuits originating from an unknown or undesired
source based on parametric measurements. Unlike earlier approaches, which …

OpenRAM: an open-source memory compiler

  • Guthaus
    Matthew R.

Computer systems research is often inhibited by the availability of memory designs.
Existing Process Design Kits (PDKs) frequently lack memory compilers, while expensive
commercial solutions only provide memory models with immutable cells, limited …

A hardware-based technique for efficient implicit information flow tracking

  • Shin
    Jangseop

To access sensitive information, some recent advanced attacks have been successful
in exploiting implicit flows in a program in which sensitive data affects the control
path and in turn affects other data. To track the sensitive data through implicit

Imprecise security: quality and complexity tradeoffs for hardware information flow tracking

  • Hu
    Wei

Secure hardware design is a challenging task that goes far beyond ensuring functional
correctness. Important design properties such as non-interference cannot be verified
on functional circuit models due to the lack of essential information (e.g., …

Encasing block ciphers to foil key recovery attempts via side channel

  • Agosta
    Giovanni

Providing efficient protection against energy consumption based side channel attacks
(SCAs) for block ciphers is a relevant topic for the research community, as current
overheads are in the 100x range. Unprofiled SCAs exploit information leakage from

Security of neuromorphic computing: thwarting learning attacks using memristor’s obsolescence effect

  • Yang
    Chaofei

Neuromorphic architectures are widely used in many applications for advanced data
processing, and often implement proprietary algorithms. In this work, we prevent
an attacker with physical access from learning the proprietary algorithm implemented
by …

Generation and use of statistical timing macro-models considering slew and load variability

  • Sinha
    Debjit

Timing macro-modeling captures the timing characteristics of a circuit in a compact
form for use in a hierarchical timing environment. At the same time, statistical timing
provides coverage of the impact from variability sources with the goal of …

TinySPICE plus: scaling up statistical SPICE simulations on GPU leveraging shared-memory based sparse
matrix solution techniques

  • Han
    Lengfei

TinySPICE was a SPICE simulator on GPU developed to achieve dramatic speedups in statistical
simulations of small nonlinear circuits, such as standard cell designs and SRAMs.
While TinySPICE can perform circuit simulations much faster than traditional …

PieceTimer: a holistic timing analysis framework considering setup/hold time interdependency using
a piecewise model

  • Zhang
    Grace Li

In static timing analysis, clock-to-q delays of flip-flops are considered as constants.
Setup times and hold times are characterized separately and also used as constants.
The characterized delays, setup times and hold times, are applied in timing …

A fast layer elimination approach for power grid reduction

  • Yassine
    Abdul-Amir

Simulation and verification of the on-die power delivery network (PDN) is one of the
key steps in the design of integrated circuits (ICs). With the very large sizes of
modern grids, verification of PDNs has become very expensive and a host of techniques

A deterministic approach to stochastic computation

  • Jenson
    Devon

Stochastic logic performs computation on data represented by random bit streams. The
representation allows complex arithmetic to be performed with very simple logic, but
it suffers from high latency and poor precision. Furthermore, the results are …

Control-fluidic CoDesign for paper-based digital microfluidic biochips

  • Wang
    Qin

Paper-based digital microfluidic biochips (P-DMFBs) have recently emerged as a promising
low-cost and fast-responsive platform for biochemical assays. In P-DMFBs, electrodes
and control lines are printed on a piece of photo paper using inkjet printer …

Neural networks designing neural networks: multi-objective hyper-parameter optimization

  • Smithson
    Sean C.

Artificial neural networks have gone through a recent rise in popularity, achieving
state-of-the-art results in various fields, including image classification, speech
recognition, and automated control. Both the performance and computational complexity

Error recovery in a micro-electrode-dot-array digital microfluidic biochip

  • Li
    Zipeng

A digital microfluidic biochip (DMFB) is an attractive technology platform for automating
laboratory procedures in biochemistry. However, today’s DMFBs suffer from several
limitations: (i) constraints on droplet size and the inability to vary droplet …

Privacy protection via appliance scheduling in smart homes

  • Wu
    Jie

Smart grids, managed by intelligent devices, have demonstrated great potential to
help residential customers to optimally schedule and manage the appliances’ energy
consumption. Due to the fine-grained power consumption information collected by smart

Framework designs to enhance reliable and timely services of disaster management systems

  • Shih
    Chi-Sheng

Tolerating faults is a fundamental requirement in the design of many cyber-physical
systems. Devices or sensors might have different requirements on their levels of reliability
and/or timely services in the composition of a cyber-physical system. …

Analysis of production data manipulation attacks in petroleum cyber-physical systems

  • Chen
    Xiaodao

Petroleum Cyber-Physical System (CPS) marks the beginning of a new chapter of the
oil and gas industry. Combining vast computational power with intelligent Computer
Aided Design (CAD) algorithms, petroleum CPS is capable of precisely modeling the
flow …

Security challenges in smart surveillance systems and the solutions based on emerging
nano-devices

  • Yang
    Chaofei

Modern smart surveillance systems can not only record the monitored environment but
also identify the targeted objects and detect anomaly activities. These advanced functions
are often facilitated by deep neural networks, achieving very high accuracy …

Fast physics-based electromigration checking for on-die power grids

  • Chatterjee
    Sandeep

Due to technology scaling, electromigration (EM) signoff has become increasingly difficult,
mainly due to the use of inaccurate methods for EM assessment, such as the empirical
Black’s model. In this paper, we present a novel approach for EM checking …

Exploring aging deceleration in FinFET-based multi-core systems

  • Cai
    Ermao

Power and thermal issues are the main constraints for high-performance multi-core systems.
As the current technology of choice, FinFET is observed to have lower delay under
higher temperature in super-threshold voltage region, an effect called …

An efficient and accurate algorithm for computing RC current response with applications
to EM reliability evaluation

  • Guan
    Zhong

In this paper, we propose a current waveform estimation algorithm for signal lines
without the necessity of SPICE simulation. Unlike previous methods, we do not use
function fitting or compute the effective capacitance. Instead, the proposed algorithm

Voltage-based electromigration immortality check for general multi-branch interconnects

  • Sun
    Zeyu

As VLSI technology features are pushed to the limit with every generation and with
the introduction of new materials and increased current densities to satisfy the performance
demands, Electromigration (EM) is projected to be a key reliability issue for …

Exploiting randomness in sketching for efficient hardware implementation of machine
learning applications

  • Wang
    Ye

Energy-efficient processing of large matrices for big-data applications using hardware
acceleration is an intense area of research. Sketching of large matrices into their
lower-dimensional representations is an effective strategy. For the first time, …

Making neural encoding robust and energy efficient: an advanced analog temporal encoder for brain-inspired computing systems

  • Zhao
    Chenyuan

Neural encoder is one of the key components in neuromorphic computing systems, whereby
sensory information is transformed into spike coded trains. The design of temporal
encoder has attracted a widespread attention in the field of neuromorphic computing

Statistical methodology to identify optimal placement of on-chip process monitors
for predicting fmax

  • Mu
    Szu-Pang

In previous literatures, many approaches use ring oscillators or other process monitors
to correlate the chip’s maximum operating frequency (Fmax). But none of them focus
on the placement of these on-chip process monitors (OPMs) on a chip. The placement …

BugMD: automatic mismatch diagnosis for bug triaging

  • Mammo
    Biruk

System-level validation is the most challenging phase of design verification. A common
methodology in this context entails simulating the design under validation in lockstep
with a high-level golden model, while comparing the architectural state of the …

ODESY: a novel 3T-3MTJ cell design with optimized area DEnsity, scalability and latencY

  • Xue
    Linuo

The STT-RAM (Spin-Transfer Torque Magnetic RAM) technology is a promising candidate
for cache memory because of its high density, low standby-power, and non-volatility.
As technology scales, especially under 40nm technology node, the read disturbance

Delay-optimal technology mapping for in-memory computing using ReRAM devices

  • Bhattacharjee
    Debjyoti

Recent propositions of diverse In-Memory Computing platforms have shown a promising
alternative to classical Von Neumann computing models. Significant benefits, in terms
of energy-efficiency and performance, are reported for in-memory arithmetic …

Reconfigurable in-memory computing with resistive memory crossbar

  • Zha
    Yue

Driven by recent advances in resistive random-access memory (RRAM), there have been
growing interests in exploring alternative computing concept, i.e., in-memory processing,
to address the classical von Neumann bottlenecks. Despite of their great …

Exploiting ferroelectric FETs for low-power non-volatile logic-in-memory circuits

  • Yin
    Xunzhao

Numerous research efforts are targeting new devices that could continue performance
scaling trends associated with Moore’s Law and/or accomplish computational tasks with
less energy. One such device is the ferroelectric FET (FeFET), which offers the …

Approximation knob: power capping meets energy efficiency

  • Kanduri
    Anil

Power Capping techniques are used to restrict power consumption of computer systems
to a thermally safe limit. Current many-core systems employ dynamic voltage and frequency
scaling (DVFS), power gating (PG) and scheduling methods as actuators for power …

IC thermal analyzer for versatile 3-D structures using multigrid preconditioned Krylov
methods

  • Ladenheim
    Scott

Thermal analysis is crucial for determining the propagation of heat and tracking the
formation of hot spots in advanced integrated circuit technologies. At the core of
the thermal analysis for integrated circuits is the numerical solution of the heat

BoostNoC: power efficient network-on-chip architecture for near threshold computing

  • Rajamanikkam
    Chidhambaranathan

While near threshold design space provides a promising approach towards energy-efficient
computing, it is plagued by sub-optimal performance. Application characteristics and
hardware non-idealities of conventional architectures (optimized for the …

QScale: thermally-efficient QoS management on heterogeneous mobile platforms

  • Sahin
    Onur

Single-ISA heterogeneous mobile processors integrate low-power and power-hungry CPU
cores together to combine energy efficiency with high performance. While running computationally
demanding applications, current power management and scheduling …

Synthesis of statically analyzable accelerator networks from sequential programs

  • Cheng
    Shaoyi

This paper describes a general framework for transforming a sequential program into
a network of processes, which are then converted to hardware accelerators through
high level synthesis. Also proposed is a complementing technique for performing static

Joint loop mapping and data placement for coarse-grained reconfigurable architecture
with multi-bank memory

  • Yin
    Shouyi

Coarse-Grained Reconfigurable Architecture (CGRA) is a promising architecture with
high performance, high power-efficiency and attraction of flexibility. The compute-intensive
parts of an application (e.g. loops) are often mapped onto CGRA for …

Efficient synthesis of graph methods: a dynamically scheduled architecture

  • Minutoli
    Marco

RDF databases naturally map to a graph representation and employ languages, such as
SPARQL, that implements queries as graph pattern matching routines. Graph methods
exhibit an irregular behavior: they present unpredictable, fine-grained data accesses,

Tier partitioning strategy to mitigate BEOL degradation and cost issues in monolithic
3D ICs

  • Samal
    Sandeep Kumar

In this paper, we develop tier partitioning strategy to mitigate back-end-of-line
(BEOL) interconnect delay degradation and cost issues in monolithic 3D ICs (M3D).
First, we study the routing overhead and delay degradation caused by tungsten BEOL

Cascade2D: A design-aware partitioning approach to monolithic 3D IC with 2D commercial tools

  • Chang
    Kyungwook

Monolithic 3D IC (M3D) can continue to improve power, performance, area and cost beyond
traditional Moore’s law scaling limitations by leveraging the third-dimension and
fine-grained monolithic inter-tier vias (MIVs). Several recent studies present …

SAINT: handling module folding and alignment in fixed-outline floorplans for 3D ICs

  • Lin
    Jai-Ming

Three-dimensional integrated circuits (3D ICs) offer significant improvements over
two-dimensional circuits in several aspects. Classic 3D floorplanning algorithm places
each module at one single die. However, power consumption and wirelength of a 3D IC

From biochips to quantum circuits: computer-aided design for emerging technologies

  • Wille
    Robert

While previous decades have witnessed impressive accomplishments in the design and
realization of conventional computing devices, physical boundaries and cost restrictions
led to an increasing interest in alternative technologies (often referred to as

Multilevel design understanding: from specification to logic (invited paper)

  • Ray
    Sandip

We present an outline of the field of Multilevel Design Understanding by first defining
and motivating the related problems, and then describing the key issues which must
be addressed in future research.

FPGA 2017 TOC

WORKSHOP SESSION: FPGA’17 Workshops

OLAF’17: Third International Workshop on Overlay Architectures for FPGAs

  • So
    Hayden Kwok-Hay

The Third International Workshop on Overlay Architectures for FPGAs (OLAF) is held
in Monterey, California, USA, on February 22, 2017 and co-located with FPGA 2017:
The 25th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. The
main …

SESSION: Special Session: The Role of FPGAs in Deep Learning

Session details: Special Session: The Role of FPGAs in Deep Learning

  • Ling
    Andrew

The Role of FPGAs in Deep Learning

  • Ling
    Andrew

Deep learning has garnered significant visibility recently as an Artificial Intelligence
(AI) paradigm, with success in wide ranging applications such as image and speech
recognition, natural language understanding, self-driving cars, and game playing (…

Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?

  • Nurvitadhi
    Eriko

Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily
on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular
parallelism, high TFLOP/s). Because of this, GPUs are widely used for …

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

  • Zhao
    Ritchie

Convolutional neural networks (CNN) are the current state-of-the-art for many computer
vision tasks. CNNs outperform older methods in accuracy, but require vast amounts
of computation and memory. As a result, existing CNN applications are typically run

Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural
Network

  • Zhang
    Jialiang

OpenCL FPGA has recently gained great popularity with emerging needs for workload
acceleration such as Convolutional Neural Network (CNN), which is the most popular
deep learning architecture in the domain of computer vision. While OpenCL enhances
the …

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared
Memory System

  • Zhang
    Chi

We present a novel mechanism to accelerate state-of-art Convolutional Neural Networks
(CNNs) on CPU-FPGA platform with coherent shared memory. First, we exploit Fast Fourier
Transform (FFT) and Overlap-and-Add (OaA) to reduce the computational …

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional
Neural Networks

  • Ma
    Yufei

As convolution layers contribute most operations in convolutional neural network (CNN)
algorithms, an effective convolution acceleration scheme significantly affects the
efficiency and performance of a hardware CNN accelerator. Convolution in CNNs …

SESSION: Machine Learning

Session details: Machine Learning

  • Cong
    Jason

An OpenCL™ Deep Learning Accelerator on Arria 10

  • Aydonat
    Utku

Convolutional neural nets (CNNs) have become a practical means to perform vision tasks,
particularly in the area of image classification. FPGAs are well known to be able
to perform convolutions efficiently, however, most recent efforts to run CNNs on …

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

  • Umuroglu
    Yaman

Research has shown that convolutional neural networks contain significant redundancy,
and high classification accuracy can be obtained even when weights and activations
are reduced from floating point to binary values. In this paper, we present FINN,
a …

ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

  • Han
    Song

Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve
higher prediction accuracy, machine learning scientists have built increasingly larger
models. Such large model is both computation intensive and memory intensive. …

SESSION: Interconnect and Routing

Session details: Interconnect and Routing

  • Kaptanoglu
    Sinan

Quality-Time Tradeoffs in Component-Specific Mapping: How to Train Your Dynamically Reconfigurable Array of Gates with Outrageous Network-delays

  • Giesen
    Hans

How should we perform component-specific adaptation for FPGAs? Prior work has demonstrated
that the negative effects of variation can be largely mitigated using complete knowledge
of device characteristics and full per-FPGA CAD flow. However, the cost …

Synchronization Constraints for Interconnect Synthesis

  • Rodionov
    Alex

Interconnect synthesis tools ease the burden on the designer by automatically generating
and optimizing communication hardware. In this paper we propose a novel capability
for FPGA interconnect synthesis tools that further simplifies the designer’s …

Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion

  • Shen
    Minghua

FPGAs are increasingly popular as application-specific accelerators because they lead
to a good balance between flexibility and energy efficiency, compared to CPUs and
ASICs. However, the long routing time imposes a barrier on FPGA computing, which …

SESSION: Architecture

Session details: Architecture

  • Wilton
    Steve

Don’t Forget the Memory: Automatic Block RAM Modelling, Optimization, and Architecture Exploration

  • Yazdanshenas
    Sadegh

While academic FPGA architecture exploration tools have become sufficiently advanced
to enable a wide variety of explorations and optimizations on soft fabric and routing,
support for Block RAM (BRAM) has been very limited. In this paper, we present …

Automatic Construction of Program-Optimized FPGA Memory Networks

  • Yang
    Hsin-Jung

Memory systems play a key role in the performance of FPGA applications. As FPGA deployments
move towards design entry points that are more serial, memory latency has become a
serious design consideration. For these applications, memory network …

NAND-NOR: A Compact, Fast, and Delay Balanced FPGA Logic Element

  • Huang
    Zhihong

The And-Inverter Cone has been introduced as an alternative logic element to the look-up
table in FPGAs, since it improves their performance and resource utilization. However,
further analysis of the AIC design showed that it suffers from the delay …

120-core microAptiv MIPS Overlay for the Terasic DE5-NET FPGA board

  • Kumar H B
    Chethan

We design a 120-core 94MHz MIPS processor FPGA overlay interconnected with a lightweight
message-passing fabric that fits on a Stratix V GX FPGA (5SGXEA7N2F45C2). We use silicon-tested
RTL source code for the microAptiv MIPS processor made available …

SESSION: CAD Tools

Session details: CAD Tools

  • Shannon
    Lesley

A Parallelized Iterative Improvement Approach to Area Optimization for LUT-Based Technology
Mapping

  • Liu
    Gai

Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations
on the input logic network before carrying out technology mapping. While the “known
recipes” of logic transformations often lead to improved mapping results, …

A Parallel Bandit-Based Approach for Autotuning FPGA Compilation

  • Xu
    Chang

Mainstream FPGA CAD tools provide an extensive collection of optimization options
that have a significant impact on the quality of the final design. These options together
create an enormous and complex design space that cannot effectively be explored …

PANEL SESSION: Panel: FPGAs in the Cloud

Session details: Panel: FPGAs in the Cloud

  • Constantinides
    George

FPGAs in the Cloud

  • Constantinides
    George A.

Ever greater amounts of computing and storage are happening remotely in the cloud,
and it is estimated that spending on public cloud services will grow by over 19%/year
to $140B in 2019. Besides commodity processors, network and storage infrastructure,

SESSION: High-Level Synthesis — Tools and Applications

Session details: High-Level Synthesis — Tools and Applications

  • Neuendorffer
    Stephen

Hardware Synthesis of Weakly Consistent C Concurrency

  • Ramanathan
    Nadesh

Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion
but via fine-grained atomic operations (‘atomics’), have been shown empirically to
be the fastest class of multi-threaded algorithms in the realm of conventional …

A New Approach to Automatic Memory Banking using Trace-Based Address Mining

  • Zhou
    Yuan

Recent years have seen an increased deployment of FPGAs as programmable accelerators
for improving the performance and energy efficiency of compute-intensive applications.
A well-known “secret sauce” of achieving highly efficient FPGA acceleration is to

Dynamic Hazard Resolution for Pipelining Irregular Loops in High-Level Synthesis

  • Dai
    Steve

Current pipelining approach in high-level synthesis (HLS) achieves high performance
for applications with regular and statically analyzable memory access patterns. However,
it cannot effectively handle infrequent data-dependent structural and data …

Accelerating Face Detection on Programmable SoC Using C-Based Synthesis

  • Srivastava
    Nitish Kumar

High-level synthesis (HLS) enables designing at a higher level of abstraction to effectively
cope with design complexity of emerging applications on modern programmable system-on-chip
(SoC). While HLS continues to evolve with a growing set of algorithms,…

Packet Matching on FPGAs Using HMC Memory: Towards One Million Rules

  • Rozhko
    Daniel

Packet processing systems increasingly need larger rulesets to satisfy the needs of
deep-network intrusion prevention and cluster computing. FPGA-based implementations
of packet processing systems have been proposed but their use of on-chip memory …

SESSION: Graph Processing Applications

Session details: Graph Processing Applications

  • Kapre
    Nachiket

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search

  • Zhang
    Jialiang

Large graph processing has gained great attention in recent years due to its broad
applicability from machine learning to social science. Large real-world graphs, however,
are inherently difficult to process efficiently, not only due to their large …

ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture

  • Dai
    Guohao

The performance of large-scale graph processing suffers from challenges including
poor locality, lack of scalability, random access pattern, and heavy data conflicts.
Some characteristics of FPGA make it a promising solution to accelerate various …

FPGA-Accelerated Transactional Execution of Graph Workloads

  • Ma
    Xiaoyu

Many applications that operate on large graphs can be intuitively parallelized by
executing a large number of the graph operations concurrently and as transactions
to deal with potential conflicts. However, large numbers of operations occurring …

SESSION: Virtualization and Applications

Session details: Virtualization and Applications

  • Lockwood
    John

Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center

  • Tarafdar
    Naif

We present a framework for creating network FPGA clusters in a heterogeneous cloud
data center. The FPGA clusters are created using a logical kernel description describing
how a group of FPGA kernels are to be connected (independent of which FPGA these …

Energy Efficient Scientific Computing on FPGAs using OpenCL

  • Weller
    Dennis

An indispensable part of our modern life is scientific computing which is used in
large-scale high-performance systems as well as in low-power smart cyber-physical
systems. Hence, accelerators for scientific computing need to be fast and energy …

Secure Function Evaluation Using an FPGA Overlay Architecture

  • Fang
    Xin

Secure Function Evaluation (SFE) has received considerable attention recently due
to the massive collection and mining of personal data over the Internet, but large
computational costs still render it impractical. In this paper, we leverage hardware

SESSION: Applications

Session details: Applications

  • Leeser
    Miriam

FPGA Acceleration for Computational Glass-Free Displays

  • He
    Zhuolun

The increasing computational power enables various new applications that are runtime
prohibitive before. FPGA is one of such computational power with both reconfigurability
and energy efficiency. In this paper, we demonstrate the feasibility of …

Hardware Acceleration of the Pair-HMM Algorithm for DNA Variant Calling

  • Huang
    Sitao

With the advent of several accurate and sophisticated statistical algorithms and pipelines
for DNA sequence analysis, it is becoming increasingly possible to translate raw sequencing
data into biologically meaningful information for further clinical …

POSTER SESSION: Poster Session 1

Measuring the Power-Constrained Performance and Energy Gap between FPGAs and Processors
(Abstract Only)

  • Ye
    Andy Gean

This work measures the performance and power consumption gap between the current generation
of low power FPGAs and low power microprocessors (microcontrollers) through an implementation
of the Canny edge detection algorithm. In particular, the algorithm …

A Mixed-Signal Data-Centric Reconfigurable Architecture enabled by RRAM Technology
(Abstract Only)

  • Zha
    Yue

This poster presents a data-centric reconfigurable architecture, which is enabled
by emerging non-volatile memory, i.e., RRAM. Compared to the heterogeneous architecture
of commercial FPGAs, it is inherently a homogeneous architecture comprising of a …

A Framework for Iterative Stencil Algorithm Synthesis on FPGAs from OpenCL Programming
Model (Abstract Only)

  • Wang
    Shuo

Iterative stencil algorithms find applications in a wide range of domains. FPGAs have
long been adopted for computation acceleration due to its advantages of dedicated
hardware design. Hence, FPGAs are a compelling alternative for executing iterative

Scala Based FPGA Design Flow (Abstract Only)

  • Liu
    Yanqiang

With the rapid growth of data scale, data analysis applications start to meet the
performance bottleneck, and thus requiring the aid of hardware acceleration. At the
same time, Field Programmable Gate Arrays (FPGAs), known for their high customizability

Thermal Flattening in 3D FPGAs Using Embedded Cooling (Abstract Only)

  • Deshpande
    Girish

Thermal management is one of the key concerns in modern high power density chips.
A variety of thermal cooling techniques that have been in use in industrial applications
are now also being applied to integrated circuits. In this work, we explore the …

A Machine Learning Framework for FPGA Placement (Abstract Only)

  • Grewal
    Gary

Many of the key stages in the traditional FPGA CAD flow require substantial amounts
of computational effort. Moreover, due to limited overlap among individual stages,
poor decisions made in earlier stages will often adversely affect the quality of …

Precise Coincidence Detection on FPGAs: Three Case Studies (Abstract Only)

  • Salomon
    Ralf

In high-performance applications, such as quantum physics and positron emission tomography,
precise coincidence detection is of central importance: The quality of the reconstructed
images depends on the accuracy with which the underlying system detects …

Towards Efficient Design Space Exploration of FPGA-based Accelerators for Streaming
HPC Applications (Abstract Only)

  • Koraei
    Mostafa

Streaming HPC applications are data intensive and have widespread use in various fields
(e.g., Computational Fluid Dynamics and Bioinformatics). These applications consist
of different processing kernels where each kernel performs a specific computation

Accurate and Efficient Hyperbolic Tangent Activation Function on FPGA using the DCT
Interpolation Filter (Abstract Only)

  • Abdelsalam
    Ahmed M.

Implementing an accurate and fast activation function with low cost is a crucial aspect
to the implementation of Deep Neural Networks (DNNs) on FPGAs. We propose a high accuracy
approximation approach for the hyperbolic tangent activation function of …

An FPGA Overlay Architecture for Cost Effective Regular Expression Search (Abstract
Only)

  • Luinaud
    Thomas

Snort and Bro are Deep Packet Inspection systems which express complex rules with
regular expressions. Before performing a regular expression search, these applications
apply a filter to select which regular expressions must be searched. One way to …

POSTER SESSION: Poster Session 2

Using Vivado-HLS for Structural Design: a NoC Case Study (Abstract Only)

  • Zhao
    Zhipeng

There have been ample successful examples of applying Xilinx Vivado’s “function-to-module”
high-level synthesis (HLS) where the subject is algorithmic in nature. In this work,
we carried out a design study to assess the effectiveness of applying Vivado-…

Automatic Generation of Hardware Sandboxes for Trojan Mitigation in Systems on Chip
(Abstract Only)

  • Bobda
    Christophe

Component based design is one of the preferred methods to tackle system complexity,
and reduce costs and time-to-market. Major parts of the system design and IC production
are outsourced to facilities distributed across the globe, thus opening the door …

Accelerating Financial Market Server through Hybrid List Design (Abstract Only)

  • Fu
    Haohuan

The financial market server in exchanges aims to maintain the order books and provide
real time market data feeds to traders. Low-latency processing is in a great demand
in financial trading. Although software solutions provide the flexibility to …

Joint Modulo Scheduling and Memory Partitioning with Multi-Bank Memory for High-Level
Synthesis (Abstract Only)

  • Lu
    Tianyi

High-Level Synthesis (HLS) has been widely recognized and accepted as an efficient
compilation process targeting FPGAs for algorithm evaluation and product prototyping.
However, the massively parallel memory access demands and the extremely expensive

A Batch Normalization Free Binarized Convolutional Deep Neural Network on an FPGA
(Abstract Only)

  • Nakahara
    Hiroki

A pre-trained convolutional deep neural network (CNN) is a feed-forward computation
perspective, which is widely used for the embedded systems, requires high power-and-area
efficiency. This paper realizes a binarized CNN which treats only binary 2-…

A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural
Networks (Abstract Only)

  • Li
    Yixing

FPGA-based hardware accelerator for convolutional neural networks (CNNs) has obtained
great attentions due to its higher energy efficiency than GPUs. However, it has been
a challenge for FPGA-based solutions to achieve a higher throughput than GPU …

CPU-FPGA Co-Optimization for Big Data Applications: A Case Study of In-Memory Samtool Sorting (Abstract Only)

  • Cong
    Jason

To efficiently process a tremendous amount of data, today’s big data applications
tend to distribute the datasets into multiple partitions, such that each partition
can be fit into memory and be processed by a separate core/server in parallel. Meanwhile,…

Stochastic-Based Multi-stage Streaming Realization of a Deep Convolutional Neural
Network (Abstract Only)

  • Alawad
    Mohammed

Large-scale convolutional neural network (CNN), conceptually mimicking the operational
principle of visual perception in human brain, has been widely applied to tackle many
challenging computer vision and artificial intelligence applications. …

fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs (Abstract Only)

  • Venieris
    Stylianos I.

In recent years, Convolutional Neural Networks (ConvNets) have become the state-of-the-art
in several Artificial Intelligence tasks. Across the range of applications, the performance
needs vary significantly, from high-throughput image recognition to …

POSTER SESSION: Poster Session 3

FPGA-based Hardware Accelerator for Image Reconstruction in Magnetic Resonance Imaging
(Abstract Only)

  • Pezzotti
    Emanuele

Magnetic Resonance Imaging (MRI) is widely used in medical diagnostics. Sampling of
MRI data on Cartesian grids allows efficient computation of the Inverse Discrete Fourier
Transform for image reconstruction using the Inverse Fast Fourier Transform (…

Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network
Layers (Abstract Only)

  • Shen
    Yongming

Convolutional neural networks (CNNs) are used to solve many challenging machine learning
problems. These networks typically use convolutional layers for feature extraction
and fully-connected layers to perform classification using those features. …

ASAP: Accelerated Short Read Alignment on Programmable Hardware (Abstract Only)

  • Banerjee
    Subho S.

The proliferation of high-throughput sequencing machines allows for the rapid generation
of billions of short nucleotide fragments in a short period. This massive amount of
sequence data can quickly overwhelm today’s storage and compute infrastructure. …

RxRE: Throughput Optimization for High-Level Synthesis using Resource-Aware Regularity Extraction
(Abstract Only)

  • Lotfi
    Atieh

Despite the considerable improvements in the quality of HLS tools, they still require
the designer’s manual optimizations and tweaks to generate efficient results, which
negates the HLS design productivity gains. Majority of designer interventions lead

GRT 2.0: An FPGA-based SDR Platform for Cognitive Radio Networks (Abstract Only)

  • Wu
    Haoyang

Although there is explosive growth of theoretical research on cognitive radio, the
real-time platform for cognitive radio is progressing at a low pace. Researchers expect
fast prototyping their designs with appropriate wireless platforms to precisely …

FPGA Implementation of Non-Uniform DFT for Accelerating Wireless Channel Simulations
(Abstract Only)

  • Siripurapu
    Srinivas

FPGAs have been used as accelerators in a wide variety of domains such as learning,
search, genomics, signal processing, compression, analytics and so on. In recent years,
the availability of tools and flows such as high-level synthesis has made it even

Learning Convolutional Neural Networks for Data-Flow Graph Mapping on Spatial Programmable
Architectures (Abstract Only)

  • Yin
    Shouyi

Data flow graph (DFG) mapping is critical for the compiling of spatial programmable
architecture, where compilation time is a key factor for both time-to-market requirement
and mapping successful rate. Inspired from the great progress made in tree …

Cache Timing Attacks from The SoCFPGA Coherency Port (Abstract Only)

  • Chaudhuri
    Sumanta

In this presentation we show that side-channels arising from micro-architecture of
SoCFPGAs could be a security risk. We present an FPGA trojan based on OpenCL which
performs cache-timing attacks through the accelerator coherency port (ACP) of a SoCFPGA.

Dynamic Partitioning for Library based Placement on Heterogeneous FPGAs (Abstract
Only)

  • Mao
    Fubing

Library based design and IP reuses have been previously proposed to speed up the synthesis
of large-scale FPGA designs. However, existing methods result in large area wastage
due to the module size difference and the waste area inside each module. In …

An Energy-Efficient Design-Time Scheduler for FPGAs Leveraging Dynamic Frequency Scaling
Emulation (Abstract Only)

  • Loke
    Wei Ting

We present a design-time tool, EASTA, that combines the feature of reconfigurability
in FPGAs and Dynamic Frequency Scaling to realize an efficient multiprocessing scheduler
on a single-FPGA system. Multiple deadlines, reconvergent nodes, flow …