SLIP ’22: Proceedings of the 24th ACM/IEEE Workshop on System Level Interconnect Pathfinding
SESSION: Breaking the Interconnect Limits
Session details: Breaking the Interconnect Limits
- Ismail Bustany
Multi-Die Heterogeneous FPGAs: How Balanced Should Netlist Partitioning be?
- Raveena Raikar
- Dirk Stroobandt
High-capacity multi-die FPGA systems generally consist of multiple dies connected by external interposer lines. These external connections are limited in number. They also incur a higher delay than the internal network of a monolithic FPGA and should therefore be used sparingly. These architectural changes compel the placement & routing tools to minimize the number of signals at the die boundary. Incorporating a netlist partitioning step in the CAD flow can help to minimize the overall number of signals using the cross-die connections.
Conventional partitioning techniques focus on minimizing the cut edges at the cost of generating unequal-sized partitions. Such highly unbalanced partitions can affect the overall placement & routing quality by causing congestion on the denser die. Moreover, this can also negatively impact the overall runtime of the placement & routing tools as well as the FPGA resource utilization.
In previous studies, a low value of the unbalance was proposed to generate equal-sized partitions. In this work, we investigate the factors that influence the netlist partitioning quality for a multi-die FPGA system. A die-level partitioning step, performed using hMETIS, is incorporated into the flow before the packing step. Large heterogeneous circuits from the Koios benchmark suite are used to analyze the partitioning-packing results. We then examine how the output unbalance and the number of cut edges vary with the input value of unbalance. We propose an empirically determined optimal value of the unbalance factor for achieving the desired partitioning quality for the Koios benchmark suite.
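The two quantities examined in this abstract, cut edges and output unbalance, can be illustrated with a minimal sketch. It is not the authors' flow and does not call hMETIS; the netlist, the two partitions, and the function names are hypothetical. The bound check assumes the usual reading of an hMETIS-style unbalance factor b for a bisection, where each part holds between (50 - b)% and (50 + b)% of the cells.

```python
def cut_nets(nets, part):
    """Count nets (hyperedges) whose cells are spread over more than one die."""
    return sum(1 for net in nets if len({part[cell] for cell in net}) > 1)


def output_unbalance(part):
    """Deviation of die 0 from an even split, in percentage points (unit cell weights)."""
    on_die0 = sum(1 for die in part.values() if die == 0)
    return abs(on_die0 / len(part) - 0.5) * 100


def within_bound(part, ub_factor):
    """True if each die holds between (50 - ub_factor)% and (50 + ub_factor)% of cells."""
    return output_unbalance(part) <= ub_factor


# Hypothetical six-cell netlist; each tuple is one net.
nets = [("a", "b"), ("b", "c", "d"), ("c", "d"), ("d", "e"), ("e", "f")]

# Two hypothetical two-die assignments (cell -> die index).
balanced = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
skewed = {"a": 0, "b": 1, "c": 1, "d": 1, "e": 1, "f": 1}

for name, part in (("balanced", balanced), ("skewed", skewed)):
    print(name, cut_nets(nets, part), round(output_unbalance(part), 1),
          within_bound(part, 5))
```

In this toy case the skewed assignment cuts fewer nets than the balanced one but violates a 5-point unbalance bound, which is the trade-off the abstract describes.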
Limiting Interconnect Heating in Power-Driven Physical Synthesis
- Xiuyan Zhang
- Shantanu Dutt
Current technology trends in VLSI chips include sub-10 nm nodes and 3D ICs. Unfortunately, due to significantly increased Joule heating in these technologies, interconnect reliability has become a significant casualty. In this paper, we explore how interconnect power dissipation (CV^2/2 per logic transition), and thus heating, can be effectively constrained during a power-optimizing physical synthesis (PS) flow that applies three different PS transformations: cell sizing, Vth assignment and cell replication; the last is particularly useful for limiting interconnect heating. Other constraints considered are timing, slew and cell fanout load. To address this multi-constraint power-optimization problem effectively, we apply the three transforms simultaneously (as opposed to sequentially in some order), and simultaneously across all cells of the circuit, using a novel discrete optimization technique called discretized network flow (DNF). We applied our algorithm to the ISPD-13 benchmark circuits. The ISPD-13 competition targeted power optimization with cell-sizing and Vth-assignment transforms under timing, slew and cell fanout load constraints; to these we added the interconnect heating constraint and the cell replication transform, which is much harder to engineer in a simultaneous-consideration framework than the other two. Results show the significant efficacy of our techniques.
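The CV^2/2 term can be made concrete with a short worked sketch. It only evaluates the per-transition energy formula from the abstract and is not the DNF optimizer. The capacitance, supply voltage, transition rate, and per-net power limit are hypothetical values, and the replication case is idealised: it assumes the wire capacitance splits evenly between the two nets and ignores the extra load the replica places on its upstream driver.

```python
def interconnect_power(cap_farads, vdd_volts, transitions_per_sec):
    """Average power of one net: (C * V^2 / 2) per transition times the transition rate."""
    return 0.5 * cap_farads * vdd_volts ** 2 * transitions_per_sec


VDD = 0.8        # volts (hypothetical)
RATE = 1e9       # logic transitions per second on the net (hypothetical)
LIMIT_W = 5e-6   # hypothetical per-net power limit used as a heating proxy

# One driver switching a 20 fF net.
original = interconnect_power(20e-15, VDD, RATE)

# Idealised replication: the driver is duplicated and each copy drives
# half of the sinks, so each net carries roughly half the capacitance.
replicated = [interconnect_power(10e-15, VDD, RATE) for _ in range(2)]

print("original net exceeds limit:", original > LIMIT_W)
print("replicated nets within limit:", all(p <= LIMIT_W for p in replicated))
```

Under these assumptions the total dissipated power is unchanged, but the power carried by any single net is halved, which is why replication can help with a per-net heating constraint.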
SESSION: 2.5D/3D Extension for High-Performance Computing
Session details: 2.5D/3D Extension for High-Performance Computing
- Pascal Vivet
Opportunities of Chip Power Integrity and Performance Improvement through Wafer Backside (BS) Connection: Invited Paper
- Rongmei Chen
- Giuliano Sisto
- Odysseas Zografos
- Dragomir Milojevic
- Pieter Weckx
- Geert Van der Plas
- Eric Beyne
Technology node scaling is driven by the need to increase system performance, but it also leads to a significant power integrity bottleneck, due to the associated back-end-of-line (BEOL) scaling. Power integrity degradation induced by on-chip Power Delivery Network (PDN) IR drop results from increased power density and from the number and resistivity of the metal layers in the BEOL. Meanwhile, signal routing limits the SoC performance improvements due to increased routing congestion and delays. To overcome these issues, we introduce a disruptive technology: wafer backside (BS) connection to realize chip BS PDN (BSPDN) and BS signal routing. We first describe key wafer process features that were developed at imec to enable this technology. Further, we show the benefits of this technology by demonstrating a large improvement in chip power integrity and performance after applying it to BSPDN and BS routing with a sub-2nm technology node design rule. We also discuss the challenges and outlook of the BS technology before concluding.
SESSION: Compute-in-Memory and Design of Structured Compute Arrays
Session details: Compute-in-Memory and Design of Structured Compute Arrays
- Shantanu Dutt
An Automated Design Methodology for Computational SRAM Dedicated to Highly Data-Centric Applications: Invited Paper
- A. Philippe
- L. Ciampolini
- M. Gerbaud
- M. Ramirez-Corrales
- V. Egloff
- B. Giraud
- J.-P. Noel
To meet the performance requirements of highly data-centric applications (e.g. edge-AI or lattice-based cryptography), Computational SRAM (C-SRAM), a new type of computational memory, was designed as a key element of an emerging computing paradigm called near-memory computing. For this type of application, C-SRAM has been specialized to perform low-latency vector operations in order to limit energy-intensive data transfers with the processor or dedicated processing units. This paper presents a design methodology that aims at making the C-SRAM design flow as simple as possible by automating the configuration of the memory part (e.g. number of SRAM cuts and access ports) according to system constraints (e.g. instruction frequency or memory capacity) and off-the-shelf SRAM compilers. In order to fairly quantify the benefits of the proposed memory selector, it has been evaluated with three different CMOS process technologies from two different foundries. The results show that this memory selection methodology makes it possible to determine the best memory configuration regardless of the CMOS process technology and of the chosen trade-off between area and power consumption. Furthermore, we also show how this methodology could be used to efficiently assess the level of design optimization of available SRAM compilers in a targeted CMOS process technology.
A Machine Learning Approach for Accelerating SimPL-Based Global Placement for FPGA’s
- Tianyi Yu
- Nima Karimpour Darav
- Ismail Bustany
- Mehrdad Eslami Dehkordi
Many commercial FPGA placement tools are based on the SimPL framework, where the Lower Bound (LB) phase optimizes wire length and timing without considering cell overlaps and the Upper Bound (UB) phase spreads out cells while considering the target FPGA architecture. In the SimPL framework, the number of iterations depends on design complexity and the quality of the UB placement, which strongly impacts runtime. In this work, we propose a machine learning (ML) scheme where the anchor weights of cells are dynamically adjusted to make the process converge within a pre-determined budget for the number of iterations. In our approach, for a given FPGA architecture, an ML model constructs a trajectory guide function that is used for adjusting anchor weights during SimPL’s iterations. Our experimental results on industrial benchmarks show that we can achieve, on average, 28.01% and 4.7% reductions in the runtime of Global Placement and of the whole placer, respectively, while maintaining the quality of solutions within an acceptable range.
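The role of anchor weights can be sketched on a one-dimensional toy problem. This is not the authors' placer or their ML model: it assumes a simple quadratic wirelength objective with a quadratic anchor penalty, solved by Gauss-Seidel sweeps, and a fixed list of weights stands in for whatever the trajectory guide function would supply. All names and values are hypothetical.

```python
def lower_bound_step(x, edges, anchors, weight, sweeps=200):
    """Approximately minimise sum (x_i - x_j)^2 over edges
    plus weight * sum (x_i - anchors_i)^2, using Gauss-Seidel sweeps.
    Requires weight > 0 so every diagonal entry is positive."""
    x = list(x)
    nbrs = {i: [] for i in range(len(x))}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(sweeps):
        for i in range(len(x)):
            pull = sum(x[j] for j in nbrs[i]) + weight * anchors[i]
            x[i] = pull / (len(nbrs[i]) + weight)
    return x


def gap(x, anchors):
    """Squared distance between the LB solution and the UB (anchor) positions."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, anchors))


edges = [(0, 1), (1, 2), (2, 3)]      # four cells connected in a chain
anchors = [0.0, 1.0, 2.0, 3.0]        # hypothetical spread-out UB positions
start = [1.5, 1.5, 1.5, 1.5]          # clumped starting placement

# A fixed schedule standing in for ML-adjusted anchor weights.
gaps = [gap(lower_bound_step(start, edges, anchors, w), anchors)
        for w in (0.01, 1.0, 100.0)]
print([round(g, 3) for g in gaps])
```

Larger anchor weights pull the LB solution toward the UB positions and shrink the gap between the two, so the weight schedule controls how quickly the iterations converge.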
SESSION: Interconnect Performance Estimation Techniques
Session details: Interconnect Performance Estimation Techniques
- Rasit Topaloglu
Neural Network Model for Detour Net Prediction
- Jaehoon Ahn
- Taewhan Kim
Identifying the nets in a placement that are very likely to be detoured during routing is useful for two reasons: (1) in conjunction with routing congestion, path timing, or design rule violation (DRV) prediction, detour net prediction can serve as a complementary means of characterizing the outcome of those predictions in more depth, and (2) the nets predicted to detour can be given more importance when optimizing timing and routing resources in the early stage of placement, since those nets consume more timing budget as well as metal/via resources. In this context, this work proposes a neural network based detour net prediction model. Our proposed model consists of two parts: CNN based and ANN based. The CNN based model processes the features describing various physical proximity maps or states, while the ANN based model processes the features of individual nets in the form of vector descriptions, concatenated to the CNN outputs. Through experiments, we assess the accuracy of our prediction model in terms of F1 score, as well as its complementary role in timing prediction and optimization. More specifically, it is shown that our proposed model improves the prediction accuracy by 9.9% on average in comparison with that produced by the conventional (vanilla ANN based) detour net prediction model. Furthermore, linking our prediction model to the state-of-the-art timing optimization of a commercial tool reduces the worst negative slack by 18.4%, the total negative slack by 40.8%, and the number of timing violation paths by 30.9% on average.
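For readers unfamiliar with the metric, the F1 score used to assess the detour predictions is the harmonic mean of precision and recall. The sketch below computes it for binary labels; the label vectors are hypothetical and are not taken from the paper's experiments.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for binary labels (1 = detour net)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and predictions for eight nets.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(f1_score(y_true, y_pred))
```

F1 is a natural choice when detoured nets are a minority of all nets, because plain accuracy would reward a model that predicts "no detour" everywhere.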
Machine-Learning Based Delay Prediction for FPGA Technology Mapping
- Hailiang Hu
- Jiang Hu
- Fan Zhang
- Bing Tian
- Ismail Bustany
Accurate delay prediction is important in the early stages of logic and high-level synthesis. In technology mapping for field programmable gate arrays (FPGAs), a gate-level circuit is transcribed into a lookup table (LUT)-level circuit. Quick timing analysis is necessary on a pre-mapped circuit to guide optimizations downstream. However, before technology mapping, a static timing analyzer is too slow due to its complexity and, like faster empirical heuristics, highly inaccurate. In this work, we present a machine learning based framework for accurately and efficiently estimating the delay of a gate-level circuit by predicting the depth of the corresponding LUT logic after technology mapping. Our experimental results show that the proposed method achieves a 56x accuracy improvement compared to the existing delay estimation heuristic. Compared with running the mapper to obtain the ground truth, our delay estimator saves 87.5% of the runtime with negligible error.