NOCS 2019 TOC
NOCS ’19: Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip
SESSION: NoC and router design
UBERNoC: unified buffer power-efficient router for network-on-chip
Networks-on-Chip (NoCs) address many shortcomings of traditional interconnects. However,
they consume a considerable portion of a chip’s total power – particularly when the
utilization is low. As transistor size continues to shrink, we expect NoCs to contribute
even more, especially static power. A wide range of prior-art focuses on reducing
the contribution of NoC power consumption. These can be categorized into two main
groups: (1) power-gating, and (2) simplified router microarchitectures. Maintaining
the performance and the flexibility of the network are key challenges that have not
yet been addressed by these two groups of low-power architectures. In this paper,
we propose UBERNoC, a simplified router microarchitecture, which reduces underutilized
buffer space by leveraging an observation that for most switch traversals, only a
single packet is present. We use a unified buffer with multiple virtual channels shared
amongst the input ports to reduce both power and area. The empirical results demonstrate
that compared to a conventional router, UBERNoC achieves 58% and 69% reduction in
power and area respectively, with negligible latency overhead.
Ghost routers: energy-efficient asymmetric multicore processors with symmetric NoCs
Asymmetric multicore architectures have been proposed to exploit the benefits of heterogeneous
cores. However, asymmetric cores present challenge to network-on-chip (NoC) designers
since the floorplan is not necessarily regular with “nodes” being different size.
In contrast, most of the previously proposed NoC topologies commonly assume a regular
or symmetric floorplan with equal size nodes. In this work, we first describe how
asymmetric floorplan leads to asymmetric topology and can limit overall performance.
To overcome the asymmetric floorplan, we present Ghost Routers – extra “dummy” routers that are added to the NoC to create a symmetric NoC architecture
for asymmetric multicore architectures. Ghost router provides higher network path
diversity and provides higher network performance that leads to higher system performance.
Ghost routers also enable simpler routing algorithms because of the symmetric NoC
architecture. While ghost routers is a simplistic modification to the NoC architecture,
it does increase NoC cost. However, ghost routers exploit the observations that in
realistic systems, the cost of NoC is not a significant fraction of overall system
cost. Our evaluations show that ghost routers can improve performance by up to 21%
while improving overall energy-efficiency of the system by up to 26%.
BINDU: deadlock-freedom with one bubble in the network
Every interconnection network must ensure, for its functional correctness, that it
is deadlock free. A routing deadlock occurs when there is a cyclic dependency of packets
when acquiring the buffers of the routers. Prior solutions have provisioned an extra
set of escape buffers to resolve deadlocks, or restrict the path that a packet can take in the
network by disallowing certain turns. This either pays higher power/area overhead
or impacts performance. In this work, we demonstrate that (i) keeping one virtual-channel
in the entire network (called ‘Bindu’) empty, and (ii) forcing it to move through
all input ports of every router in the network via a pre-defined path, can guarantee
deadlock-freedom. We show that our scheme (a) is topology agnostic (we evaluate it
on multiple topologies, both regular and irregular), (b) does not impose any turn
restrictions on packets, (c) does not require an extra set of escape buffers, and
(d) is free from the complex circuitry for detecting and recovering from deadlocks.
We report 15% average improvement in throughput for synthetic traffic and 7% average
reduction in runtime for real applications over state-of-the-art deadlock freedom
schemes.
SESSION: Best paper nominees
NoC-enabled software/hardware co-design framework for accelerating k-mer counting
Counting k-mers (substrings of fixed length k) in DNA and protein sequences generate non-uniform and irregular memory access patterns.
Processing-in-Memory (PIM) architectures have the potential to significantly reduce
the overheads associated with such frequent and irregular memory accesses. However,
existing k-mer counting algorithms are not designed to exploit the advantages of PIM architectures.
Furthermore, owing to thermal constraints, the allowable power budget is limited in
conventional PIM designs. Moreover, k-mer counting generates unbalanced and long-range traffic patterns that need to be handled
by an efficient Network-on-Chip (NoC). In this paper, we present an NoC-enabled software/hardware
co-design framework to implement high-performance k-mer counting. The proposed architecture enables more computational power, efficient communication
between cores/memory – all without creating a thermal bottleneck; while the software
component exposes more in-memory opportunities to exploit the PIM and aids in the
NoC design. Experimental results show that the proposed architecture outperforms a
state-of-the-art software implementation of k-mer counting utilizing Hybrid Memory Cube (HMC), by up to 7.14X, while allowing significantly
higher power budgets.
SMART++: reducing cost and improving efficiency of multi-hop bypass in NoC routers
Low latency and low implementation cost are two key requirements in NoCs. SMART routers
implement multi-hop bypass, obtaining latency values close to an ideal point-to-point
interconnect. However, it requires a significant amount of resources such as Virtual
Channels (VCs), which are not used as efficiently as possible, preventing bypass in
certain scenarios. This translates into increased area and delay, compared to an ideal
implementation.
In this paper, we introduce SMART++, an efficient multi-hop bypass mechanism which
combines four key ideas: SMART bypass, multi-packet buffers, Non-Empty Buffer Bypass and Per-packet allocation. SMART++ relies on a more aggressive VC reallocation policy
and supports bypass of buffers even when they are not completely free. With these
desirable characteristics, SMART++ requires limited resources and exhibits high performance.
SMART++ is evaluated using functional simulation and HDL synthesis tools. SMART++
without VCs and with a reduced amount of buffer slots outperforms the original SMART
using 8 VCs, while reducing the amount of logic and dynamic power in an FPGA by 5.5x
and 5.0x respectively. Additionally, it allows for up to 2.1x frequency; this might
translate into more than 31.9% base latency reduction and 42.2% throughput increase.
APEC: improved acknowledgement prioritization through erasure coding in bufferless NoCs
Bufferless NoCs have been proposed as they come with a decreased silicon area footprint
and a reduced power consumption, when compared to buffered NoCs. However, while known
for their inherent simplicity, they suffer from early saturation and depend on additional
measures to ensure reliable packet delivery, such as control protocols based on ACKs
or NACKs. In this paper, we propose APEC, a novel concept for bufferless NoCs that
allows to prioritize ACKs and NACKs over single payload flits of colliding packets
by discarding the latter. Lightweight heuristic erasure codes are used to compensate
for discarded payload flits. By trading off the erasure code overhead for packet retransmissions,
a more efficient network operation is achieved. For ACK-based networks, APEC saturates
at 2.1x and 2.875x higher generation rates than a conventional ACK-based bufferless
NoC for packets between 5 and 17 flits. For NACK-based networks, APEC does not require
concepts such as deflection routing or circuit-switched overlay NACK-networks, as
prior work does. Therefore, it can simplify the network implementation compared to
prior work while achieving similar performance.
SESSION: NoC potpourri
ClusCross: a new topology for silicon interposer-based network-on-chip
The increasing number of cores challenges the scalability of chip multiprocessors.
Recent studies proposed the idea of disintegration by partitioning a large chip into
multiple smaller chips and using silicon interposer-based integration (2.5D) to connect
these smaller chips. This method can improve yield, but as the number of small chips
increases, the chip-to-chip communication becomes a performance bottleneck.
This paper proposes a new network topology, ClusCross, to improve network performance
for multicore interconnection networks on silicon interposer-based systems. The key
idea is to treat each small chip as a cluster and use cross-cluster long links to
increase bisection width and decrease average hop count without increasing the number
of ports in the routers. Synthetic traffic patterns and real applications are simulated
on a cycle-accurate simulator. Network latency reduction and saturation throughput
improvement are demonstrated as compared to previously proposed topologies. Two versions
of the ClusCross topology are evaluated. One version of ClusCross has a 10% average
latency reduction for coherence traffic as compared to the state-of-the-art network-on-interposer
topology, the misaligned ButterDonut. The other version of ClusCross has a 7% and
a 10% reduction in power consumption as compared to the FoldedTorus and the ButterDonut
topologies, respectively.
Distributed SDN architecture for NoC-based many-core SoCs
In the Software-Defined Networking (SDN) paradigm, routers are generic and programmable
forwarding units that transmit packets according to a given policy defined by a software
controller. Recent research has shown the potential of such a communication concept
for NoC management, resulting in hardware complexity reduction, management flexibility,
real-time guarantees, and self-adaptation. However, a centralized SDN controller is
a bottleneck for large-scale systems.
Assuming an NoC with multiple physical subnets, this work proposes a distributed SDN
architecture (D-SDN), with each controller managing one cluster of routers. Controllers
work in parallel for local (intra-cluster) paths. For global (inter-cluster) paths,
the controllers execute a synchronization protocol inspired by VLSI routing, with
global and detailed routing phases. This work also proposes a short path establishment
heuristic for global paths that explores the controllers” parallelism.
D-SDN outperforms a centralized approach (C-SDN) for larger networks without loss
of success rate. Evaluations up to 2,304 cores and 6 subnets shows that: (i) D-SDN outperforms C-SDN in path establishment latency up to 69.7% for 1 subnet above
32 cores, and 51% for 6 subnets above 1,024 cores; (ii) D-SDN achieves a smaller latency then C-SDN (on average 54%) for scenarios with
more than 70% of local paths; (iii) the path success rate, for all scenarios, is similar in both approaches, with an
average difference of 1.7%; (iv) the data storage for the C-SDN controller increases with the system size, while
it remains constant for D-SDN.
Approximate nanophotonic interconnects
The energy consumption of manycore is dominated by data movement, which calls for
energy-efficient and high-bandwidth interconnects. Integrated optics is promising
technology to overcome the bandwidth limitations of electrical interconnects. However,
it suffers from high power overhead related to low efficiency lasers, which calls
for the use of approximate communications for error tolerant applications. In this
context, this paper investigates the design of an Optical NoC supporting the transmission
of approximate data. For this purpose, the least significant bits of floating point
numbers are transmitted with low power optical signals. A transmission model allows
estimating the laser power according to the targeted BER and a micro-architecture
allows configuring, at run-time, the number of approximated bits and the laser output
powers. Simulations results show that, compared to an interconnect involving only
robust communications, approximations in the optical transmission lead to up to 42%
laser power reduction for image processing application with a limited degradation
at the application level.
Direct-modulated optical networks for interposer systems
We present a new interposer-level optical network based on direct-modulated lasers
such as vertical-cavity surface-emitting lasers (VCSELs) or transistor lasers (TLs).
Our key observation is that, the physics of these lasers is such that they must transmit
significantly more power (21x) than is needed by the receiver. We take advantage of
this excess optical power to create a new network architecture called Rome, which splits optical signals using passive splitters to allow flexible bandwidth
allocation among different transmitter and receiver pairs while imposing minimal power
and design costs. Using multi-chip module GPUs (MCM-GPUs) as a case study, we thoroughly
evaluate network power and performance, and show that (1) Rome is capable of efficiently
scaling up MCM-GPUs with up to 1024 streaming multiprocessors, and (2) Rome outperforms
various competing designs in terms of energy efficiency (by up to 4x) and performance
(by up to 143%).
SESSION: Interconnection networks for deep neural networks
NoC-based DNN accelerator: a future design paradigm
Deep Neural Networks (DNN) have shown significant advantages in many domains such
as pattern recognition, prediction, and control optimization. The edge computing demand
in the Internet-of-Things era has motivated many kinds of computing platforms to accelerate
the DNN operations. The most common platforms are CPU, GPU, ASIC, and FPGA. However,
these platforms suffer from low performance (i.e., CPU and GPU), large power consumption (i.e., CPU, GPU, ASIC, and FPGA), or low computational flexibility at runtime (i.e., FPGA and ASIC). In this paper, we suggest the NoC-based DNN platform as a new accelerator
design paradigm. The NoC-based designs can reduce the off-chip memory accesses through
a flexible interconnect that facilitates data exchange between processing elements
on the chip. We first comprehensively investigate conventional platforms and methodologies
used in DNN computing. Then we study and analyze different design parameters to implement
the NoC-based DNN accelerator. The presented accelerator is based on mesh topology,
neuron clustering, random mapping, and XY-routing. The experimental results on LeNet,
MobileNet, and VGG-16 models show the benefits of the NoC-based DNN accelerator in
reducing off-chip memory accesses and improving runtime computational flexibility.
Energy-efficient and high-performance NoC architecture and mapping solution for deep
neural networks
With the advancement and miniaturization of transistor technology, hundreds of cores
can be integrated on a single chip. Network-on-Chips (NoCs) are the de facto on-chip communication fabrics for multi/many core systems because of their benefits
over the traditional bus in terms of scalability, parallelism, and power efficiency
[20]. Because of these properties of NoC, communication architecture for different
layers of a deep neural network can be developed using NoC. However, traditional NoC
architectures and strategies may not be suitable for running deep neural networks
because of the different types of communication patterns (e.g. one-to-many and many-to-one
communication between layers and zero communication within a single layer) in neural
networks. Furthermore, because of the different communication patterns, computations
of the different layers of a neural network need to be mapped in a way that reduces
communication bottleneck in NoC. Therefore, we explore different NoC architectures
and mapping solutions for deep neural networks, and then propose an efficient concentrated
mesh NoC architecture and a load-balanced mapping solution (including mathematical
model) for accelerating deep neural networks. We also present preliminary results
to show the effectiveness of our proposed approaches to accelerate deep neural networks
while achieving energy-efficient and high-performance NoC.
Flow mapping and data distribution on mesh-based deep learning accelerator
Convolutional neural networks have been proposed as an approach for classifying data
corresponding to labeled and unlabeled datasets. The fast-growing data empowers deep
learning algorithms to achieve higher accuracy. Numerous trained models have been
proposed, which involve complex algorithms and increasing network depth. The main
challenges of implementing deep convolutional neural networks are high energy consumption,
high on-chip and off-chip bandwidth requirements, and large memory footprint. Different
types of on-chip communication networks and traffic distribution methods have been
proposed to reduce memory access latency and energy consumption of data movement.
This paper proposes a new traffic distribution mechanism on a mesh topology using
distributer nodes by considering memory access mechanism in the AlexNet, VggNet, and
GoogleNet trained models. We also propose a flow mapping method (FMM) based on dataflow
stationary which reduces energy consumption by 8%.
SESSION: Heterogeneous integration and interconnect fabrics
3D NoCs with active interposer for multi-die systems
Advances in interconnect technologies for system-in-package manufacturing have re-introduced
multi-chip module (MCM) architectures as an alternative to the current monolithic
approach. MCMs or multi-die systems implement multiple smaller chiplets in a single
package. These MCMs are connected through various package interconnect technologies,
such as current industry solutions in AMD’s Infinity Fabric, Intel’s Foveros active
interposer, and Marvell’s Mochi Interconnect. Although MCMs improve manufacturing
yields and are cost-effective, additional challenges on the Network-on-Chip (NoC)
within a single chiplet and across multiple chiplets need to be addressed. These challenges
include routing, scalability performance, and resource allocation. This work introduces
a scalable MCM 3D interconnect infrastructure called “MCM-3D-NoC” with multiple 3D
chiplets connected through an active interposer. System-level simulations of MCM-3D-NoC
are performed to validate the proposed architecture and provide performance evaluation
of network latency, throughput, and EDP.
Global and semi-global communication on Si-IF
On-chip scaling continues to pose significant technological and design challenges.
Nonetheless, the key obstacle in on-chip scaling is the high fabrication cost of the
state-of-the-art technology nodes. An opportunity exists however, to continue scaling
at the system level. Silicon interconnect fabric (Si-IF) is a platform that aims to
replace both the package and printed circuit board to enable heterogeneous integration
and high inter-chip performance. Bare dies are attached directly to the Si-IF at fine
vertical interconnect pitch (2 to 10 μm) and small inter-die spacing (≤ 100 μm). The
Si-IF is a single-hierarchy integration construct that supports dies of any process,
technology, and dimensions. In addition to development of the fabrication and integration
processes, system-level challenges need to be addressed to enable integration of heterogeneous
systems on the Si-IF. Communication is a fundamental challenge on large Si-IF platforms
(up to 300 mm diameter wafers). Different technological and design approaches for
global and semi-global communication are discussed in this paper. The area overhead
associated with global communication on the Si-IF is determined.
A 7.5-mW 10-Gb/s 16-QAM wireline transceiver with carrier synchronization and threshold
calibration for mobile inter-chip communications in 16-nm FinFET
A compact energy-efficient 16-QAM wireline transceiver with carrier synchronization
and threshold calibration is proposed to leverage high-density fine-pitch interconnects.
Utilizing frequency-division multiplexing, the transceiver transfers four-bit data
through one RF band to reduce intersymbol interferences. A forwarded clock is also
transmitted through the same interconnect with the data simultaneously to enable low-power
PVT-insensitive symbol clock recovery. A carrier synchronization algorithm is proposed
to overcome nontrivial current and phase mismatches by including DC offset calibration
and dedicated I/Q phase adjustments. Along with this carrier synchronization, a threshold
calibration process is used for the transceiver to tolerate channel and circuit variations.
The transceiver implemented in 16-nm FinFET occupies only 0.006-mm2 and achieves 10 Gb/s with 0.75-pJ/bit efficiency and <2.5-ns latency.
SESSION: Work in progress posters
Reinforcement learning based interconnection routing for adaptive traffic optimization
Applying Machine Learning (ML) techniques to design and optimize computer architectures
is a promising research direction. Optimizing the runtime performance of a Network-on-Chip
(NoC) necessitates a continuous learning framework. In this work, we demonstrate the
promise of applying reinforcement learning (RL) to optimize NoC runtime performance.
We present three RL-based methods for learning optimal routing algorithms. The experimental
results show the algorithms can successfully learn a near-optimal solution across
different environment states.
Power efficient photonic network-on-chip for a scalable GPU
In this paper, we propose an energy efficient and scalable optical interconnect for
GPUs. We intelligently divide the components in a GPU into different types of clusters
and enable these clusters to communicate optically with each other. In order to reduce
the network delay, we use separate networks for coherence and non-coherence traffic.
Moreover, to reduce the static power consumption in optical interconnects, we modulate
the off-chip light source by proposing a novel GPU specific prediction scheme for
on-chip network traffic. Using our design, we were able to increase the performance
by 17% and achieve a 65% reduction in ED2 as compared to a state-of-the-art optical topology.
CDMA-based multiple multicast communications on WiNOC for efficient parallel computing
In this work, we introduce an hybrid WiNoC, which judicially uses the wired and wireless
interconnects for broadcasting/multicasting of packets. A code division multiple access
(CDMA) method is used to support multiple broadcast operations originating from multiple
applications executed on the multiprocessor platform. The CDMA-based WiNoC is compared
in terms of network latency and power consumption with wired-broadcast/multicast NoC.
Channel mapping strategies for effective protection switching in fail-operational
hard real-time NoCs
With Multi Processor System-on-Chips (MPSoC) scaling up to thousands of processing
elements, bus-based solutions have been dropped in favor of Network-on-Chips (NoC)
as proposed in [2]. However, MPSoCs are yet hesitantly adopted in safety-critical
fields, mainly due to the difficulty of ensuring strict isolation between different
applications running on a single MPSoC as well as providing communication with Guaranteed
Service (GS) to critical applications. This is particularly difficult in the NoC as
it constitutes a network of shared resources. Moreover, safety-critical applications
require some degree of Fault-Tolerance (FT) to guarantee safe operation at all times.
Multi-carrier spread-spectrum transceiver for WiNoC
In this paper, we propose a low-power, high-speed, multi-carrier reconfigurable transceiver
based on Frequency Division Multiplexing to ensure data transfer in future Wireless
NoCs. The proposed transceiver supports a medium access control method to sustain
unicast, broadcast and multicast communication patterns, providing dynamic data exchange
among wireless nodes. The proposed transceiver designed using a 28-nm FDSOI technology
consumes only 2.37 mW and 4.82 mW in unicast/broadcast and multicast modes, respectively,
with an area footprint of 0.0138 mm2.
Detection and prevention protocol for black hole attack in network-on-chip
Network-on-Chip (NoC) has become exposed to security threats. It can be infected with
a Hardware Trojan (HT) to degrade the system performance and apply a denial of service
attack. In this paper, we proposed a new HT-based threat model, known as Black Hole
Router (BHR), where it deliberately drops packets from the NoC. We proposed a detection
and prevention protocol to such BHR attack with reasonably low overhead. The results
show 10.83%, 27.78%, and 21.31% overhead in area, power, and performance, respectively.
However, our proposed protocol not only detects the BHR attack but also avoids it
and assures packet-delivery.
Analyzing networks-on-chip based deep neural networks
One of the most promising architectures for performing deep neural network inferences
on resource-constrained embedded devices is based on massive parallel and specialized
cores interconnected by means of a Network-on-Chip (NoC). In this paper, we extensively
evaluate NoC-based deep neural network accelerators by exploring the design space
spanned by several architectural parameters. We show how latency is mainly dominated
by the on-chip communication whereas energy consumption is mainly accounted by memory
(both on-chip and off-chip).