SESSION: Keynote 1
Session details: Keynote 1
Scalable System and Silicon Architectures to Handle the Workloads of the Post-Moore
Era
The end of Moore’s law has been proclaimed on many occasions and it’s probably safe
to say that we are now working in the post-Moore era. But no one is ready to slow
down just yet. We can view Gordon Moore’s observation on transistor densification
as just one aspect of a longer-term underlying technological trend – the Law of Accelerating
Returns articulated by Kurzweil. Arguably, companies became somewhat complacent in
the Moore era, happy to settle for the gains brought by each new process node. Although
we can expect scaling to continue, albeit at a slower pace, the end of Moore’s Law
delivers a stronger incentive to push other trends of technology progress harder.
Some exciting new technologies are now emerging, such as multi-chip 3D integration, storage-class memory, and silicon photonics. Moreover, we are also entering a golden age of computer architecture innovation.
One of the key drivers is the pursuit of domain-specific architectures as proclaimed
by Turing award winners John Hennessy and David Patterson. A good example is Xilinx's AI Engine, one of the key features of the Versal ACAP (adaptive compute acceleration platform) [1]. Today, the explosion of AI workloads is one of the most powerful drivers
shifting our attention toward finding faster ways of moving data into, across, and out of accelerators. Massively parallel processing elements, domain-specific accelerators, and dense interconnect between distributed on-chip memories and processing elements are examples of the ways chip makers are looking beyond scaling to achieve next-generation performance gains. Next, the growing demands of scaling out hyperscale datacenter applications drive much of the new architectural development. Given the high diversity of workloads that demand massive compute and data movement, datacenter architectures are moving away from rigid CPU-centric structures and instead prioritize adaptability and configurability to optimize resources such as memory and connectivity of accelerators assigned to individual workloads. There is no longer
a single figure of merit. It’s not all about Tera-OPS. Other metrics such as transfers-per-second
and latency come to the fore as demands become more real-time; autonomous vehicles
being an obvious and important example. Moreover, the transition to 5G will result
in solutions that operate across the traditional boundaries between the cloud, the edge, and embedded platforms, which are necessarily power-conscious and cost-sensitive. Future workloads will require agile software flows that accommodate the spread of functions across edge and cloud. Another industry megatrend that will drive technology requirements, especially in encryption, data storage, and communication, is Blockchain.
To some, it may already have a bad reputation, tarnished by association with the anarchy
of cryptocurrency, but it will be more widely relevant than many of us realize. Who
could have foreseen the development of today’s Internet when ARPANET first appeared
as a simple platform for distributed computing and sending email? Through projects
such as the open-source Hyperledger, Blockchain technology could be game-changing
as a platform for building trust in transactions executed over the Internet. We may
soon be talking in terms of the Trusted Internet. The predictability of Moore’s law
may have become rather too comfortable and slow. The future requires maximizing the
flexibility, agility, and efficiency of new technologies. With Moore’s Law now mostly
behind us, new adaptable and scalable architectures will allow us to continue delivering exponential returns from technology and to create a more adaptable and intelligent world.
SESSION: Session 1: Placement
Session details: Session 1: Placement
Placement Optimization with Deep Reinforcement Learning
Placement Optimization is an important problem in systems and chip design, which consists
of mapping the nodes of a graph onto a limited set of resources to optimize for an
objective, subject to constraints. In this paper, we start by motivating reinforcement
learning as a solution to the placement problem. We then give an overview of what
deep reinforcement learning is. We next formulate the placement problem as a reinforcement
learning problem, and show how this problem can be solved with policy gradient optimization.
Finally, we describe lessons we have learned from training deep reinforcement learning
policies across a variety of placement optimization problems.
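For intuition only, the toy sketch below mirrors that formulation on an invented four-node graph: a per-node softmax policy picks grid slots sequentially, and a REINFORCE-style policy-gradient update nudges the logits using negative wirelength as the reward. The graph, grid, learning rate, and baseline are all assumptions for illustration, not the authors' system.

```python
# Minimal REINFORCE-style placement sketch on an invented toy problem.
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # toy "netlist" graph (assumed)
n_nodes, n_slots = 4, 9                            # place 4 nodes on a 3x3 grid
coords = np.array([(s // 3, s % 3) for s in range(n_slots)], dtype=float)
theta = np.zeros((n_nodes, n_slots))               # per-node logits over slots (the policy)

def wirelength(assign):
    return sum(np.abs(coords[assign[u]] - coords[assign[v]]).sum() for u, v in edges)

baseline = 0.0
for step in range(2000):
    assign, grads, used = [], [], np.zeros(n_slots, dtype=bool)
    for node in range(n_nodes):                    # place the nodes sequentially
        logits = np.where(used, -1e9, theta[node])
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        slot = rng.choice(n_slots, p=probs)
        used[slot] = True
        assign.append(slot)
        g = -probs
        g[slot] += 1.0                             # d log pi(slot) / d logits
        grads.append(g)
    reward = -wirelength(assign)                   # objective: minimize wirelength
    baseline += 0.05 * (reward - baseline)         # running-average baseline
    for node, g in enumerate(grads):
        theta[node] += 0.05 * (reward - baseline) * g   # REINFORCE update

print("final sampled wirelength:", wirelength(assign))
```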
Hill Climbing with Trees: Detail Placement for Large Windows
Integrated circuit design encompasses a wide range of intractable optimization problems.
In this paper, we extend linear time hill climbing techniques from graph partitioning
to address detailed placement — this results in a new way to refine circuit designs,
dramatically expands the size of practical optimization windows, and enables wire
length reductions on a variety of benchmark problems. The approach is versatile and
straightforward to implement, allowing it to be applied to a wide range of problems within design automation and beyond.
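The paper's tree-based, linear-time machinery is not reproduced here; as a point of reference, the sketch below shows plain swap-based hill climbing over a placement window, accepting any swap that reduces half-perimeter wirelength (HPWL). The cells, nets, and sites are invented.

```python
# Plain swap-based hill climbing over a placement window (a much simpler
# baseline than the paper's tree-based technique).
import itertools

# Hypothetical data: cell -> (x, y) site, and nets as lists of cells.
placement = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0)}
nets = [["a", "c"], ["b", "d"], ["a", "d"]]

def hpwl(pl):
    total = 0
    for net in nets:
        xs = [pl[c][0] for c in net]
        ys = [pl[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def hill_climb(pl, window):
    improved = True
    while improved:
        improved = False
        for c1, c2 in itertools.combinations(window, 2):
            before = hpwl(pl)
            pl[c1], pl[c2] = pl[c2], pl[c1]        # try swapping two cells' sites
            if hpwl(pl) < before:
                improved = True                    # keep the improving swap
            else:
                pl[c1], pl[c2] = pl[c2], pl[c1]    # revert
    return pl

print(hpwl(placement))
hill_climb(placement, window=["a", "b", "c", "d"])
print(hpwl(placement))
```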
Via Pillar-aware Detailed Placement
With feature sizes shrinking to 7 nm and beyond, the impact of wire resistance is growing significantly, and the circuit delay incurred by metal wires is rising noticeably. To address this issue, a technique called via pillar insertion has been developed.
However, the poor success rate of the via pillar insertion process immediately becomes
an important problem. In this paper, we explore the causes of via pillar insertion
failures by experiments on the ISPD 2015 benchmarks, which are embedded with a real
industrial cell library. The results show that the low success rate may be due to track misalignment, overlap with power and ground stripes, and insufficient margin area. Therefore, we propose the first detailed placement flow which is aware
of via pillars to maximize the success rate of via pillar insertion. In the proposed
flow, we first filter out infeasible cell rows and then move the via pillar-inserting
cells to their eligible positions. Next, we adopt a two-stage legalization method
with high flexibility on cell ordering based on a dynamic programming-based detailed
placement algorithm. Finally, we improve congested rows with a global moving process.
Experiment results show that our algorithm improves the insertion rates by 54-58%,
and achieves over 99% insertion rate on average.
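As a rough illustration of the dynamic-programming ingredient mentioned above (with unit-width cells and a fixed cell order, unlike the paper's flexible ordering), the sketch below legalizes one row by assigning distinct sites that minimize total displacement.

```python
# Textbook row-legalization DP: not the paper's algorithm, just the basic idea.
import math

def legalize_row(targets, n_sites):
    """targets: desired site index of each unit-width cell, in left-to-right order."""
    n, INF = len(targets), math.inf
    # f[i][s] = min total displacement placing the first i cells within sites 0..s
    f = [[0.0 if i == 0 else INF for _ in range(n_sites)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for s in range(n_sites):
            skip = f[i][s - 1] if s > 0 else INF                  # leave site s empty
            prev = f[i - 1][s - 1] if s > 0 else (0.0 if i == 1 else INF)
            f[i][s] = min(skip, prev + abs(targets[i - 1] - s))   # or put cell i-1 at site s
    return f[n][n_sites - 1]

# two cells want site 2; the best legal solution shifts one of them by a single site
print(legalize_row(targets=[2, 2, 5], n_sites=8))   # -> 1.0
```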
Soft-Clustering Driven Flip-flop Placement Targeting Clock-induced OCV
On-Chip Variation (OCV) in advanced technology nodes introduces delay uncertainties
that may cause timing violations. This problem drastically affects the clock tree, which, on top of the growing design complexity, needs to be appropriately synthesized to tackle the increased variability. To reduce the magnitude of the clock-induced
OCV, we incrementally relocate the flip-flops and the clock gaters in a bottom-up
manner to implicitly guide the clock tree synthesis engine to produce clock trees
with increased common clock tree paths. The relocation of the clock elements is performed
using a soft clustering approach that is orthogonal to the clock tree synthesis method
used. The clock elements are repeatedly relocated and incrementally re-clustered,
thus gradually forming better clusters and settling to more appropriate positions
to increase the common paths of the clock tree. This behavior is verified by applying
the proposed method in industrial designs, resulting in clock trees which are more
resilient to process variations, while exhibiting improved overall timing.
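The paper does not spell out its soft-clustering formulation here, so the sketch below only conveys the "soft-cluster, relocate, repeat" loop using a fuzzy c-means-style membership; the cluster count, fuzziness, step size, and flip-flop locations are invented, and timing/CTS constraints are ignored.

```python
# Fuzzy-clustering flavored sketch of iterative re-clustering and relocation.
import numpy as np

rng = np.random.default_rng(1)
flops = rng.uniform(0, 100, size=(40, 2))         # flip-flop (x, y) locations (fake)
k, m, step = 4, 2.0, 0.2                          # clusters, fuzziness, relocation step
centers = flops[rng.choice(len(flops), k, replace=False)].copy()

for it in range(20):
    # soft membership of each flop in each cluster (fuzzy c-means style)
    d = np.linalg.norm(flops[:, None, :] - centers[None, :, :], axis=2) + 1e-9
    u = 1.0 / (d ** (2 / (m - 1)))
    u /= u.sum(axis=1, keepdims=True)
    # update cluster centers, then nudge each flop toward its soft centroid
    w = u ** m
    centers = (w.T @ flops) / w.sum(axis=0)[:, None]
    target = u @ centers                          # per-flop weighted centroid
    flops += step * (target - flops)              # incremental relocation

print("cluster centers:\n", centers.round(1))
```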
SESSION: Session 2: Breaking New Ground: From Carbon Nanotubes to Packaging
Session details: Session 2: Breaking New Ground: From Carbon Nanotubes to Packaging
Advances in Carbon Nanotube Technologies: From Transistors to a RISC-V Microprocessor
Carbon nanotube (CNT) field-effect transistors (CNFETs) promise to improve the energy
efficiency of very-large-scale integrated (VLSI) systems. However, multiple challenges
have prevented VLSI CNFET circuits from being realized, including inherent nano-scale
material defects, robust processing for yielding complementary CNFETs (i.e., CNT CMOS:
including both PMOS and NMOS CNFETs), and major CNT variations. Here, we summarize
techniques that we have recently developed to overcome these outstanding challenges,
enabling VLSI CNFET circuits to be experimentally realized today using standard VLSI
processing and design flows. Leveraging these techniques, we demonstrate the most
complex CNFET circuits and systems to-date, including a three-dimensional (3D) imaging
system comprising CNFETs fabricated directly on top of a silicon imager, CNT CMOS
analog and mixed-signal circuits, 1-kilobit CNFET static random-access memory (SRAM) arrays, and a 16-bit RISC-V microprocessor built entirely out of CNFETs.
Full-Chip Electro-Thermal Coupling Extraction and Analysis for Face-to-Face Bonded
3D ICs
Due to the short die-to-die distance and inferior heat dissipation capability, Face-to-Face
(F2F) bonded 3D ICs are often considered to be vulnerable to electrical and thermal
coupling. This study is the first to quantify the impacts of the electro-thermal coupling
on the full-chip timing, power, and performance. We first present an implementation
flow for realistic F2F 3D ICs including pad layers and power grids. Then, we propose
our signal integrity analysis, parasitic extraction, and thermal analysis flows. Next,
we investigate the impacts of the coupling on the delay, power, and noise of F2F 3D
ICs, and provide guidelines to mitigate these effects. Our experimental results show
that the inter-die electrical coupling causes up to 5.81% timing degradation and 4.00%
noise increase, while the thermal coupling leads to less than 0.41% timing degradation
and nearly no noise increase. The impact of the combined electro-thermal coupling
on delay and noise reaches 6.07% and 4.05%, respectively.
Pseudo-3D Approaches for Commercial-Grade RTL-to-GDS Tool Flow Targeting Monolithic
3D ICs
Despite the recent academic efforts to develop Electronic Design Automation (EDA)
algorithms for 3D ICs, the current market does not have commercial 3D computer-aided
design (CAD) tools. Instead, alternative pseudo-3D design flows have been devised, which utilize commercial 2D CAD engines with tricks that help them operate as fairly efficient 3D CAD tools. In this paper, we provide detailed discussions and fair power-performance-area
(PPA) comparisons of state-of-the-art pseudo-3D design flows. We also analyze the
limitations of each design flow and provide solutions with better PPA and various
design options. Our experiments using commercial PDK, GDS layouts, and sign-off simulations
demonstrate that we achieve up to 26% wirelength and 10% power consumption reduction
for pseudo-3D design flows. We also provide a partitioning-first scheme for the partitioning-last design flow, which increases design freedom with tolerable PPA degradation.
SESSION: Session 3: Machine Learning for Physical Design (part 1)
Session details: Session 3: Machine Learning for Physical Design (part 1)
Learning from Experience: Applying ML to Analog Circuit Design
The problem of analog design automation has vexed several generations of researchers
in electronic design automation. At its core, the difficulty of the problem is related
to the fact that machine-generated designs have been unable to match the quality of
the human designer. The human designer typically recognizes blocks from a netlist
and draws upon her/his experience to translate these blocks into a circuit that is
laid out in silicon. The ability to annotate blocks in a schematic or netlist-level
description of a circuit is key to this entire process, but it is a process fraught
with complexity due to the large number of variants of each circuit type. For example,
the number of topologies of operational transconductance amplifiers (OTAs) easily
numbers in the hundreds. A designer manages this complexity by dividing this large
set of variants into classes (e.g., OTAs may be telescopic, folded cascode, etc.).
Even so, the number of minor variations within each class is large. Early approaches
to analog design automation attempted to use rule-based methods to capture these variations,
but this database of rules required tender care: each new variant might require a
new rule. As machine learning (ML) based alternatives have become more viable, alternative
forms of solving this problem have begun to be explored.
Our effort is part of the ALIGN (Analog Layout, Intelligently Generated from Netlists)
project [2, 3], which is developing open-source software for analog/mixed-signal circuit
layout [1]. Our specific goal is to translate a netlist into a physical layout, with
24-hour turnaround and no human in the loop. The ALIGN flow inputs a netlist whose
topology and transistor sizes have already been chosen, a set of performance specifications,
and a process design kit (PDK) that defines the process technology. The output of
ALIGN is a layout in GDSII format.
Transforming Global Routing Report into DRC Violation Map with Convolutional Neural
Network
In this paper, we propose a machine-learning framework that predicts the DRC-violation map a design will exhibit after detailed routing, based on the congestion report from its global routing. The proposed framework uses a convolutional neural network as its core technique to train this prediction model. The training dataset is collected from 15 industrial designs using a leading commercial APR tool, and the total number of collected training samples exceeds 26M. A specialized under-sampling technique is proposed to select important training samples for learning, compensate for the inaccuracy caused by a highly imbalanced training dataset, and speed up the entire training process. Experimental results demonstrate that our trained model not only achieves significantly higher accuracy than previous related works but also produces DRC-violation maps that visually match the actual ones closely. The average runtime of using our learned model to generate a DRC-violation map is only 3% of that of global routing, and hence our proposed framework can be viewed as a simple add-on tool to a current commercial global router that can efficiently and effectively generate a more realistic DRC-violation map without actually applying detailed routing.
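The paper's network architecture and feature set are not given in this abstract, so the sketch below is only a generic fully-convolutional stand-in: congestion-style feature planes in, a per-tile DRC-violation logit map out, with a class weight standing in for the under-sampling used against label imbalance.

```python
# Illustrative (not the paper's) fully-convolutional congestion -> DRC-map model.
import torch
import torch.nn as nn

class CongestionToDRC(nn.Module):
    def __init__(self, in_channels=4):             # e.g. H/V overflow, pin density (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),        # one violation logit per layout tile
        )

    def forward(self, x):                           # x: (batch, C, H, W)
        return self.net(x)

model = CongestionToDRC()
features = torch.randn(8, 4, 64, 64)                # fake congestion maps
labels = torch.randint(0, 2, (8, 1, 64, 64)).float()
# class weighting stands in for the under-sampling the paper uses against imbalance
loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(20.0))(model(features), labels)
loss.backward()
print(float(loss))
```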
Lookahead Placement Optimization with Cell Library-based Pin Accessibility Prediction
via Active Learning
With the development of advanced semiconductor process nodes, pin access has become one of the major factors behind design rule violations (DRVs), owing to complex design rules and limited routing resources. Many state-of-the-art works address DRV prediction by adopting supervised machine learning approaches. However, those approaches obtain the labels of the training data by generating a great number of routed designs in advance, giving rise to a large effort in training data preparation. In addition, a pre-trained model can hardly predict unseen data and thus may not be applied to other designs containing cells that are not used in the training data. In this paper, we propose the first work on cell library-based pin accessibility prediction (PAP) using active learning techniques. A given set of standard cell libraries serves as the only input for model training. Unlike most existing studies, which aim at design-specific training, we propose a library-based model that can be applied to all designs referencing the same standard cell library set. Experimental results show that the proposed model
can be applied to predict two different designs with different reference library sets.
The numbers of remaining DRVs and M2 shorts in the designs optimized by the proposed model are also much lower than those achieved by design-specific models.
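Independently of the paper's specific features and labeling flow, the sketch below shows the general pool-based active learning loop it builds on: train on the labeled pool, query the most uncertain samples, and repeat. The features, oracle, classifier, and query budget are placeholders.

```python
# Generic pool-based active learning with uncertainty sampling (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 16))                         # placeholder features
y_pool = (X_pool[:, 0] + 0.5 * X_pool[:, 1] > 0).astype(int)   # stand-in labeling "oracle"

# seed the labeled set with a few samples from each class
labeled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])
for _ in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    uncertainty = np.abs(clf.predict_proba(X_pool)[:, 1] - 0.5)   # 0 = least certain
    queries = [i for i in np.argsort(uncertainty) if i not in labeled][:10]
    labeled.extend(queries)                                 # "label" the queried samples

print("accuracy on the full pool:", clf.score(X_pool, y_pool))
```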
SESSION: Keynote 2
Session details: Keynote 2
Physical Design for 3D Chiplets and System Integration
The convergence of 5G and Artificial Intelligence (AI) that covers the gamut from
cloud data centers through network routers to edge applications is poised to open
possibilities beyond our imagination and transform how we will go about our daily
lives. As the foundational technology supporting 5G and AI innovation, semiconductors
strive for greater system performance and broader bandwidth, while increasing functionality
and lowering cost. In response, device innovation is transitioning from SoCs to 3D
chiplets that combine advanced wafer-level system integration (WLSI) technologies
such as CoWoS® (Chip on Wafer on Substrate), Integrated Fan-Out (InFO), Wafer-on-Wafer
(WoW) and System-on-Integrated-Chips (SoIC), to enable system integration that meets
these demands. Designing 3D chiplets and housing various chips on wafer-level for
system integration creates a whole new set of challenges. These start with design
partitioning and include handling interfaces between or passing through chips, design
for testing (DFT), thermal dissipation, databases and tools integration for chip and
packaging design, new IO/ESD (electrostatic discharge), simulation run time and tool
capacity, among others. Considering current capabilities and constraints, divide-and-conquer
remains the most feasible approach for 3D chiplet design and packaging. Chiplet design
needs to integrate databases and tools with packaging environments for both verification
and optimization. Leveraging existing 2D physical design solutions and chip-level
abstraction can help meet 3D verification and optimization requirements. The IC industry
also needs more innovation in DFT and thermal dissipation, especially the latter. Thermal optimization is critical to 3D chiplets and system integration. The current thermal solution covers only thermal analysis and system-level thermal dissipation. It should instead start at the IP level and extend across the chip design process, i.e., thermal-aware 3D IC design covering IPs, macros, and transistors. This speech will address these and other challenges, then propose physical design solutions for 3D chiplets and system integration.
CCS CONCEPTS: VLSI design, 3D integrated circuits, VLSI system specification and constraints, and VLSI packaging
KEYWORDS: Physical design, 3D chiplets and system integration, thermal optimization
BIOGRAPHY: Dr. Cliff Hou was appointed Vice President of Research and Development at Taiwan Semiconductor Manufacturing Co. Ltd. (TSMC) in 2011. Since 1999, he has worked to establish node-specific reference flows from 0.13μm to today’s leading-edge 3nm at TSMC. Dr. Hou also led TSMC’s in-house IP development teams from 2008 to 2010. He is now spearheading TSMC’s efforts to build total platform solutions for the industry’s high-growth markets in Mobile, IoT, Automotive, and High-Performance Computing. Dr. Hou holds 44 U.S. patents and serves as a member of the Board of Directors of Global Unichip Corp. He received his B.S. degree in Control Engineering from Taiwan’s National Chiao-Tung University and his Ph.D. in Electrical and Computer Engineering from Syracuse University.
SESSION: Session 4: Circuit Design and Security
Session details: Session 4: Circuit Design and Security
Hardware Security For and Beyond CMOS Technology: An Overview on Fundamentals, Applications, and Challenges
As with most aspects of electronic systems and integrated circuits, hardware security
has traditionally evolved around the dominant CMOS technology. However, with the rise
of various emerging technologies, whose main purpose is to overcome the fundamental scaling and power-consumption limitations of CMOS technology, unique opportunities also arise to advance the notion of hardware security. In this paper, I first provide
an overview on hardware security in general. Next, I review selected emerging technologies,
namely (i) spintronics, (ii) memristors, (iii) carbon nanotubes and related transistors,
(iv) nanowires and related transistors, and (v) 3D and 2.5D integration. I then discuss
their application to advance hardware security and also outline related challenges.
Design Optimization by Fine-grained Interleaving of Local Netlist Transformations
in Lagrangian Relaxation
Design optimization modifies a netlist with the goal of satisfying the timing constraints
at the minimum area and leakage power, without violating any slew or load capacitance
constraints. Lagrangian relaxation (LR) based optimization has been established as
a viable approach for this. We extend LR-based optimization by interleaving in each
iteration techniques such as: gate and flip-flop sizing; buffering to fix late and
early timing violations; pin swapping; and useful clock skew. Locally optimal decisions
are made using LR-based cost functions, without the need for incremental timing updates.
Sub-steps are applied in a balanced manner, accounting for the expected savings and
any conflicting timing violations, maximizing the final quality of results under multiple
process/operating corners with a reasonable runtime. Experimental results show that
our approach achieves better timing as well as lower area and leakage power than the winner of the TAU 2019 contest on its benchmarks.
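As a hedged illustration of how an LR-based cost function drives a local decision, the toy below picks a gate size by minimizing area plus multiplier-weighted delay; the cell library, load model, and Lagrange multipliers are invented, not the paper's.

```python
# Toy Lagrangian-relaxation cost: minimize area + sum(lambda_arc * delay_arc).
sizes = {                       # candidate cells: area and delay = a + b * load (invented)
    "INV_X1": {"area": 1.0, "a": 20.0, "b": 8.0},
    "INV_X2": {"area": 2.0, "a": 15.0, "b": 4.0},
    "INV_X4": {"area": 4.0, "a": 12.0, "b": 2.0},
}

def lr_cost(cell, load, lambdas):
    delay = cell["a"] + cell["b"] * load
    # each Lagrange multiplier prices one timing arc through this gate
    return cell["area"] + sum(lam * delay for lam in lambdas)

def best_size(load, lambdas):
    return min(sizes, key=lambda name: lr_cost(sizes[name], load, lambdas))

print(best_size(load=3.0, lambdas=[0.01]))       # timing is cheap -> INV_X1 wins
print(best_size(load=3.0, lambdas=[0.5, 0.4]))   # critical arcs -> INV_X4 wins
```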
Selective Sensor Placement for Cost-Effective Online Aging Monitoring and Resilience
Aggressive technology scaling trends, such as thinner gate oxide without proportional
downscaling of supply voltage, aggravate the aging impact and thus necessitate an
aging-aware reliability verification and optimization framework during early design
stages. In this paper, we propose a novel in-situ sensing strategy based on deploying
transition detectors (TDs) for on-chip aging monitoring and resilience. The proposed TD/sensor placement problem is transformed into a set cover problem, formulated as maximum satisfiability, and then solved efficiently. Experimental results show
that, by introducing at most 2.2% area overhead (for TD/sensor placement), the aging
behavior of a target circuit can be effectively monitored, and the correctness of
its functionality can be perfectly guaranteed with an average of 77% aging resilience
achieved. In other words, with 2.2% area overhead, potential aging-induced timing
errors can be detected and then eliminated, while achieving 77% recovery from aging-induced
performance degradation.
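The paper formulates the covering problem as maximum satisfiability and solves it exactly; for intuition about the set-cover view, the sketch below runs the classic greedy approximation on invented sensor sites and aging-critical paths.

```python
# Greedy set cover: pick sensor sites until every aging-critical path is observed.
def greedy_cover(paths, site_covers):
    """site_covers: candidate sensor site -> set of path ids it observes."""
    uncovered = set(paths)
    chosen = []
    while uncovered:
        # pick the site covering the most still-uncovered paths
        site = max(site_covers, key=lambda s: len(site_covers[s] & uncovered))
        gained = site_covers[site] & uncovered
        if not gained:
            raise ValueError("remaining paths cannot be covered")
        chosen.append(site)
        uncovered -= gained
    return chosen

paths = {"p1", "p2", "p3", "p4"}                 # hypothetical aging-critical paths
site_covers = {
    "n17": {"p1", "p2"},
    "n42": {"p2", "p3", "p4"},
    "n63": {"p1", "p4"},
}
print(greedy_cover(paths, site_covers))          # e.g. ['n42', 'n17']
```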
SESSION: Session 5: Timing and Clocking
Session details: Session 5: Timing and Clocking
Synthesis of Clock Networks with a Mode Reconfigurable Topology and No Short Circuit
Current
Circuits deployed in the Internet of Things operate in low and high performance modes
to cater to variable frequency and power requirements. Consequently, the clock networks
for such circuits must be synthesized meeting drastically different timing constraints
under variations in the different modes. The overall power consumption and robustness
to variations of a clock network are determined by the topology. However, state-of-the-art clock networks use the same topology in every mode, even though the timing constraints in the low and high performance modes are very different. In this paper, we propose
a clock network with a mode reconfigurable topology (MRT) for circuits with positive-edge
triggered sequential elements. In high performance modes, the required robustness
to variations is provided by reconfiguring the MRT structure into a near-tree. In
low performance modes, the MRT structure is reconfigured into a tree to save power.
Non-tree (or near-tree) structures provide robustness to variations by appropriately
constructing multiple alternative paths from the clock source to the clock sinks,
which neutralizes the negative impact of variations. In MRT structures, OR-gates are
used to join multiple alternative paths into a single path. Consequently, the MRT
structures consume no short circuit power because there is only one gate driving each
net. Moreover, it is straightforward to reconfigure MRT structures into a tree by
gating the clock signal in part of the structure. Compared with state-of-the-art near-tree
structures, MRT structures have 8% lower power consumption and similar robustness
to variations in high performance modes. In low performance modes, the power consumption
is 16% smaller when reconfiguration is used.
Timing Driven Partition for Multi-FPGA Systems with TDM Awareness
Multi-FPGA systems are a popular approach to hardware acceleration, with the scalability to accommodate large designs. To overcome the connectivity constraint between each pair of FPGAs, time-division multiplexing (TDM) is adopted at the expense of additional delay, which dominates the performance of multi-FPGA-based emulators. To the best of our knowledge, there is no prior work on partitioning for multi-FPGA systems that considers the hardware configuration and the impact of TDM. This work proposes a partitioning methodology to improve timing performance for multi-FPGA systems. The delay introduced by TDM is estimated and optimized using a look-up table for better efficiency. Our experimental results show a 43% improvement in maximum delay when considering both the hardware configuration and the impact of TDM, compared with a cut-driven partitioning approach.
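The look-up-table delay estimation might be approximated as below: the TDM ratio needed to fit the cut signals on the available wires is looked up, and each cut net on a path pays the corresponding delay. The table values, wire count, and netlist are invented, not the paper's model.

```python
# Back-of-the-envelope TDM delay model for a 2-FPGA cut (illustrative only).
TDM_DELAY_NS = {1: 5.0, 2: 12.0, 4: 22.0, 8: 40.0}        # assumed: TDM ratio -> added delay

def tdm_ratio(n_cut_signals, n_physical_wires):
    """Smallest supported TDM ratio that fits the cut signals on the physical wires."""
    for ratio in sorted(TDM_DELAY_NS):
        if n_physical_wires * ratio >= n_cut_signals:
            return ratio
    raise ValueError("cut too large for the available wires")

def path_delay_ns(path_nets, is_cut, n_wires, internal_delay_ns=1.0):
    """Every net on the path pays an internal delay; cut nets also pay the TDM delay."""
    cut_nets = [n for n in path_nets if is_cut[n]]
    ratio = tdm_ratio(len(cut_nets), n_wires) if cut_nets else 1
    return internal_delay_ns * len(path_nets) + TDM_DELAY_NS[ratio] * len(cut_nets)

is_cut = {"n1": False, "n2": True, "n3": True, "n4": False}   # nets crossing the two FPGAs
print(path_delay_ns(["n1", "n2", "n3", "n4"], is_cut, n_wires=1))   # ratio 2 -> 28.0 ns
```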
SESSION: Session 6: Machine Learning for Physical Design (part 2)
Session details: Session 6: Machine Learning for Physical Design (part 2)
Understanding Graphs in EDA: From Shallow to Deep Learning
As the scale of integrated circuits keeps increasing, there has been a surge of research in electronic design automation (EDA) to keep technology node scaling on track. The graph is of great significance in this technology evolution, since it is one of the most natural abstractions for many fundamental objects in EDA, such as netlists and layouts, and hence many EDA problems are essentially graph problems. Traditional approaches for solving these problems are mostly based on analytical
solutions or heuristic algorithms, which require substantial efforts in designing
and tuning. With the emergence of learning techniques, dealing with graph problems
with machine learning or deep learning has become a potential way to further improve
the quality of solutions. In this paper, we discuss a set of key techniques for conducting
machine learning on graphs. Particularly, a few challenges in applying graph learning
to EDA applications are highlighted. Furthermore, two case studies are presented to
demonstrate the potential of graph learning on EDA applications.
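As a minimal concrete example of the graph learning techniques surveyed here, the snippet below applies one graph-convolution (message passing) layer to a tiny netlist-like graph using untrained random weights; the graph, features, and weights are placeholders.

```python
# One Kipf-and-Welling-style graph convolution layer: H' = ReLU(D^-1/2 Â D^-1/2 X W).
import numpy as np

A = np.array([[0, 1, 1, 0],                       # adjacency of 4 cells/nets (toy)
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)                                     # one-hot node features
A_hat = A + np.eye(4)                             # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
W = np.random.default_rng(0).normal(size=(4, 2))  # layer weights (untrained)

H = np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)   # aggregated node embeddings
print(H.round(3))                                 # 2-d embedding per node
```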
TEMPO: Fast Mask Topography Effect Modeling with Deep Learning
With the continuous shrinking of the semiconductor device dimensions, mask topography
effects stand out among the major factors influencing the lithography process. Including
these effects in the lithography optimization procedure has become necessary for advanced
technology nodes. However, conventional rigorous simulation for mask topography effects
is extremely computationally expensive for high accuracy. In this work, we propose
TEMPO as a novel generative learning-based framework for efficient and accurate 3D
aerial image prediction. At its core, TEMPO comprises a generative adversarial network
capable of predicting aerial image intensity at different resist heights. Compared
to the default approach of building a unique model for each desired height, TEMPO
takes as one of its inputs the desired height to produce the corresponding aerial
image. In this way, the global model in TEMPO can capture the shared behavior among
different heights, thus resulting in a smaller model size. Moreover, across-height information sharing results in better model accuracy and generalization capability. Our experimental
results demonstrate that TEMPO can obtain up to 1170x speedup compared with rigorous
simulation while achieving satisfactory accuracy.
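The sketch below captures only the height-conditioning idea described above, using an arbitrary small convolutional generator that takes the desired resist height as an extra input plane; it is not TEMPO's actual architecture or training setup.

```python
# Height-conditioned generator sketch (not TEMPO's real architecture).
import torch
import torch.nn as nn

class HeightConditionedGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),     # channel 0: mask, channel 1: height
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),                            # predicted aerial image intensity
        )

    def forward(self, mask, height):
        # broadcast the scalar resist height into a constant feature plane
        h_plane = height.view(-1, 1, 1, 1).expand(-1, 1, *mask.shape[-2:])
        return self.net(torch.cat([mask, h_plane], dim=1))

gen = HeightConditionedGenerator()
mask = torch.rand(2, 1, 64, 64)                  # fake mask clips
height = torch.tensor([0.25, 0.75])              # normalized resist heights (assumed)
print(gen(mask, height).shape)                   # torch.Size([2, 1, 64, 64])
```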
DRC Hotspot Prediction at Sub-10nm Process Nodes Using Customized Convolutional Network
As semiconductor process technology advances into the sub-10nm regime, cell pin accessibility, which is a complex joint effect of the pin shape and nearby blockages, becomes a main cause of DRC violations. Therefore, a machine learning model for DRC hotspot
prediction needs to consider both very high-resolution pin shape patterns and low-resolution
layout information as input features. A new convolutional neural network technique,
J-Net, is introduced for the prediction with mixed resolution features. This is a
customized architecture that is flexible for handling various input and output resolution
requirements. It can be applied at placement stage without using global routing information.
This technique is evaluated on 12 industrial designs at 7nm technology node. The results
show that it can improve true positive rate by 37%, 40% and 14% respectively, compared
to three recent works, with similar false positive rates.
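J-Net itself is not described in enough detail here to reproduce, so the sketch below only illustrates the mixed-resolution idea: a high-resolution pin-shape branch is downsampled to the coarse layout grid and fused with low-resolution features before predicting per-tile hotspot logits. All layer sizes are assumptions.

```python
# Two-branch mixed-resolution sketch (not the actual J-Net architecture).
import torch
import torch.nn as nn

class MixedResolutionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hi = nn.Sequential(                         # 8x downsample of pin-shape pixels
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.lo = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 1, 1)                  # hotspot logit per coarse tile

    def forward(self, pin_img, layout_feat):
        return self.head(torch.cat([self.hi(pin_img), self.lo(layout_feat)], dim=1))

net = MixedResolutionNet()
pin_img = torch.rand(1, 1, 256, 256)                     # fine pin-shape raster
layout_feat = torch.rand(1, 4, 32, 32)                   # coarse placement/density maps
print(net(pin_img, layout_feat).shape)                   # torch.Size([1, 1, 32, 32])
```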
SESSION: Keynote 3
Session details: Keynote 3
Physical Verification at Advanced Technology Nodes and the Road Ahead
In spite of “doomsday” expectations, Moore’s Law is alive and well. Semiconductor
manufacturing and design companies, as well as the Electronic Design Automation (EDA)
industry, have been pushing ahead to deliver more functionality while satisfying more aggressive
space/power/performance requirements.
Physical verification occupies a unique space in the ecosystem as one of the key bridges
between design and manufacturing. As such, the traditional space of design rule checking
(DRC) and layout versus schematic (LVS) has expanded into electrical verification
and yield enabling technologies such as optical proximity correction, critical area
analysis, multi-patterning decomposition and automated filling.
To achieve the expected accuracy and performance demanded by the design and manufacturing
community, it is necessary to consider the physical effects of the manufacturing processes
and electronic devices and to use the most advanced software engineering technology
and computational capabilities.
SESSION: Session 8: ISPD 2020 Contest Results and Poster Presentations
Session details: Session 8: ISPD 2020 Contest Results and Poster Presentations
ISPD 2020 Physical Mapping of Neural Networks on a Wafer-Scale Deep Learning Accelerator
This paper introduces a special case of the floorplanning problem for optimizing neural
networks to run on a wafer-scale computing engine. From a compute perspective, neural
networks can be represented by a deeply layered structure of compute kernels. During
the training of a neural network, gradient descent is used to determine the weight
factors. Each layer then uses a local weight tensor to transform “activations” and
“gradients” that are shared among connected kernels according to the topology of the
network. This process is computationally intensive and requires high memory and communication
bandwidth. Cerebras has developed a novel computer system designed for this work that
is powered by a 21.5cm by 21.5cm wafer-scale processor with 400,000 programmable compute
cores. It is structured as a regular array of 633 by 633 processing elements, each
with its own local high bandwidth SRAM memory and direct high bandwidth connection
to its neighboring cores. In addition to supporting traditional execution models for
neural network training and inference, this engine has a unique capability to compile
and compute every layer of a complete neural network simultaneously. Mapping a neural
network in this fashion onto Cerebras’ Wafer-Scale Engine (WSE) is reminiscent of
the traditional floorplanning problem in physical design. A kernel ends up as a rectangle
of x by y compute elements. These are the flexible blocks that need to be placed to
optimize performance. This paper describes an ISPD 2020 challenge to develop algorithms
and heuristics that produce compiled neural networks that achieve the highest possible
performance on the Cerebras WSE.
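For a feel of the floorplanning view (not the contest's actual evaluation or constraints), the toy below shelf-packs kernel rectangles onto the 633-by-633 fabric and scores a crude Manhattan communication cost between connected kernels; the kernel sizes and connectivity are invented.

```python
# Toy greedy "shelf" packing of kernel rectangles on the wafer-scale fabric.
FABRIC = 633                                       # processing elements per side

def shelf_pack(kernels):
    """kernels: name -> (w, h) in compute elements; returns name -> (x, y) origin."""
    pos, x, y, shelf_h = {}, 0, 0, 0
    for name, (w, h) in kernels.items():
        if x + w > FABRIC:                         # start a new shelf
            x, y, shelf_h = 0, y + shelf_h, 0
        if y + h > FABRIC:
            raise ValueError("does not fit on the wafer")
        pos[name] = (x, y)
        x, shelf_h = x + w, max(shelf_h, h)
    return pos

def comm_cost(pos, kernels, edges):
    """Manhattan distance between kernel centers, summed over connected kernels."""
    def center(n):
        (x, y), (w, h) = pos[n], kernels[n]
        return (x + w / 2, y + h / 2)
    return sum(abs(center(a)[0] - center(b)[0]) + abs(center(a)[1] - center(b)[1])
               for a, b in edges)

kernels = {"conv1": (200, 150), "conv2": (300, 200), "fc": (100, 100)}
edges = [("conv1", "conv2"), ("conv2", "fc")]      # layer connectivity (toy)
pos = shelf_pack(kernels)
print(pos, comm_cost(pos, kernels, edges))
```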