The challenges of nextgen multicore networks-on-chip systems: Part 1
Luca Benini and Giovanni De Micheli
2/6/2007 12:30 PM EST
Although the microprocessor took monstrous efforts to complete, it now appears to us as a simple object. Indeed, the microprocessor involved connecting a computational engine to a layered memory system, and this was achieved using busses. In the last decade, the frontiers of integrated circuit design opened widely. On one side, complex application-specific integrated circuits (ASICs) were designed to address specific applications, for example mobile telephony.

These systems require multiple functional units, and hence efficient on-chip communication. On the other side, multiprocessing platforms were developed to address high-performance computation, such as image rendering. Examples are Sony's Emotion Engine [25] and IBM's Cell chip [26], where on-chip communication efficiency is key to overall system performance.

At the same time, the shrinking of process technology into the deep submicron (DSM) domain exacerbated the imbalance between gate delays and wire delays on chip. Accurate physical design became the bottleneck for design closure, the jargon term for the ability to complete a tape-out successfully. As a result, the on-chip interconnect is now the dominant factor in determining performance, and architecting it at a higher level of abstraction is key to system design.
We should understand the introduction of networks-on-chip (NoCs) into systems-on-chip (SoCs) design as a gradual process, namely as an evolution of bus interconnect technology. For example, there is no strict distinction between multi-layer busses and crossbar NoCs. We should also credit C. Seitz and W. Dally [9] for stressing the need for network interconnect in high-performance multiprocessing, and for realizing the first prototypes of networked integrated multiprocessors.

Overall, however, NoCs became a broad topic of research and development in the new millennium, when designers were confronted with technological limitations, rising hardware design costs and ever-increasing system complexity.
Figure 1.1. Traffic pattern in a large-scale system. Limited parallelism is often a cause of congestion.
Why on-chip networking?
Systems on silicon have a complexity comparable to skyscrapers or aircraft carriers, when measured in terms of number of basic elements. Unlike other complex systems, they can be cloned in a straightforward way, but they have to be designed correctly, as repairs are nearly impossible. SoCs require design methodologies that have commonalities with other types of large-scale system design (Figure 1.1 above). In particular, when looking at on-chip interconnect design methods, it is useful to compare the on-chip interconnect to the worldwide interconnect provided by the Internet.
The latter is capable of taming system complexity and of providing reliable service in the presence of local malfunctions. Thus, networking technology has been able to provide quality of service (QoS) despite the heterogeneity and variability of Internet nodes and links. It is then natural to expect that networking technology can be instrumental in improving very-large-scale integration (VLSI) circuit/system design technology.
On the other hand, the challenge in marrying network and VLSI technologies lies in leveraging those features of networking that are crucial to obtaining fast and reliable on-chip communication. Some novices think that on-chip networking equates to porting the Transmission Control Protocol/Internet Protocol (TCP/IP) to silicon, or to achieving an on-chip Internet.

This is not feasible, due to the high latency caused by the complexity of TCP/IP. On-chip communication must be fast, and thus networking techniques must be simple and effective. Bandwidth, latency and energy consumption for communication must be traded off in the search for the best solution.
On the bright side, VLSI chips offer an abundance of wires on many layers, which can be used to carry data and control information. Wide data busses realize the parallel transport of information. Moreover, data and control do not need to be transported by the same means, as they are in networked computers (Figure 1.2, below). The proximity of computational and storage units on chip makes transport extremely fast. Overall, the wire-oriented nature of VLSI chips makes on-chip networking both an opportunity and a challenge.
Figure 1.2. Distributed systems communicate via a limited number of cables (a). Conversely, VLSI chips use up to 10 levels of wires for communicating (b).
In summary, the main motivation for using on-chip networking is to achieve performance by taking a system perspective on communication. This reason is corroborated by the fact that simple on-chip communication solutions do not scale up when the number of processing and storage arrays on chip increases. For example, on-chip busses can serve only a limited number of units; beyond that, performance degrades due to bus parasitic capacitance and the complexity of arbitration.
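The scaling argument can be illustrated with a deliberately crude model. The sketch below assumes that each unit attached to a shared bus adds a fixed load capacitance, that the bus clock is limited by an RC settling time, and that arbitration divides the resulting bandwidth evenly among units. All names and parameter values (`c_wire_f`, `c_per_unit_f`, `drive_resistance_ohm`, the 64-bit width) are hypothetical and chosen only to show the trend, not to describe any real bus.

```python
# Hypothetical first-order model of a shared on-chip bus. Parameter values
# are illustrative assumptions, not data from the article.

def bus_max_frequency_hz(n_units: int,
                         c_wire_f: float = 2e-12,
                         c_per_unit_f: float = 0.2e-12,
                         drive_resistance_ohm: float = 500.0) -> float:
    """Crude RC limit: one clock period must cover about 2.2 RC (the 10-90%
    rise time of a driven RC load) of the total bus capacitance."""
    total_c_f = c_wire_f + n_units * c_per_unit_f   # every unit adds load
    return 1.0 / (2.2 * drive_resistance_ohm * total_c_f)

def per_unit_bandwidth_bps(n_units: int, width_bits: int = 64) -> float:
    """Aggregate bus bandwidth shared evenly among the attached units."""
    aggregate_bps = width_bits * bus_max_frequency_hz(n_units)
    return aggregate_bps / n_units

if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32, 64):
        print(f"{n:3d} units: {per_unit_bandwidth_bps(n) / 1e9:7.2f} Gb/s per unit")
```

Both effects named in the text show up: the bus slows down as units are added, and each unit receives a shrinking slice of what remains. Arbitration latency, which the model omits, would only make the picture worse.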
Technology trends
In current projections [37] of future silicon technologies, operating frequency and transistor density will continue to grow, making energy dissipation and heat extraction a major concern.
At the same time, on-chip supply voltages will continue to decrease, with adverse impact on signal integrity. The voltage reduction, even though beneficial, will not suffice to mitigate the energy consumption problem, to which leakage is a major contributor. Thus, SoCs will incorporate dynamic power management (DPM) techniques in various forms to satisfy energy consumption bounds [4].
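Why voltage reduction alone does not solve the problem can be seen with the textbook first-order CMOS power model, P = alpha C Vdd^2 f + Vdd I_leak. The sketch below is an illustration under stated assumptions: the activity factor, switched capacitance, frequency and leakage current are hypothetical, and leakage current is held constant even though in scaled devices it tends to grow as threshold voltages are lowered.

```python
# First-order CMOS power model (textbook approximation, for illustration):
#   P = alpha * C * Vdd**2 * f  +  Vdd * I_leak
# I_leak is held constant here for simplicity (an assumption).

def dynamic_power_w(alpha: float, c_f: float, vdd_v: float, f_hz: float) -> float:
    """Switching power: activity factor * switched capacitance * Vdd^2 * f."""
    return alpha * c_f * vdd_v ** 2 * f_hz

def leakage_power_w(vdd_v: float, i_leak_a: float) -> float:
    """Static power drawn by the leakage current at supply voltage Vdd."""
    return vdd_v * i_leak_a

def leakage_share(alpha: float, c_f: float, vdd_v: float, f_hz: float,
                  i_leak_a: float) -> float:
    """Fraction of total power that is due to leakage."""
    dyn = dynamic_power_w(alpha, c_f, vdd_v, f_hz)
    leak = leakage_power_w(vdd_v, i_leak_a)
    return leak / (dyn + leak)

if __name__ == "__main__":
    # Hypothetical chip: alpha = 0.1, 50 nF switched capacitance, 1 GHz, 2 A leakage.
    for vdd in (1.2, 1.0, 0.9):
        share = leakage_share(0.1, 50e-9, vdd, 1e9, 2.0)
        print(f"Vdd = {vdd:.1f} V -> leakage share {share:.0%}")
```

With these made-up numbers the dynamic term falls quadratically while leakage falls only linearly, so the leakage share rises from 25% at 1.2 V to about 31% at 0.9 V: lowering Vdd helps, but leaves a growing residue that DPM must address.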
Global wires, connecting different functional units, are likely to have propagation delays largely exceeding the clock period [18]. While signal pipelining on interconnect will become common practice, correct design will require knowing the signal delay with reasonable accuracy. Indeed, a negative side effect of technology downscaling will be the growing spread of physical parameters (e.g., the variance of wire delay per unit length) and its increasing importance relative to timing reference signals (e.g., the clock period).
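The link between delay accuracy and pipelining can be made concrete: the number of cycles a pipelined global wire needs depends on the slow-corner delay, so uncertainty in the delay directly costs latency. The sketch below assumes a simple relative spread on the nominal delay and ignores register setup and clock-to-output overhead; `wire_latency_cycles` and its parameters are hypothetical names.

```python
import math

def wire_latency_cycles(wire_delay_ps: float, clock_period_ps: float,
                        delay_spread: float = 0.0) -> int:
    """Clock cycles needed to cross a pipelined global wire.

    delay_spread is a hypothetical relative uncertainty on the nominal delay
    (0.2 means the wire may be up to 20% slower). The pipeline is sized for
    that slow corner so every segment fits in one clock period; register
    overheads are ignored. Registers to insert = result - 1.
    """
    worst_delay_ps = wire_delay_ps * (1.0 + delay_spread)
    return math.ceil(worst_delay_ps / clock_period_ps)

if __name__ == "__main__":
    # Hypothetical 1.5 ns global wire on a 2 GHz (500 ps) clock.
    print(wire_latency_cycles(1500.0, 500.0))        # 3
    print(wire_latency_cycles(1500.0, 500.0, 0.2))   # 4
```

A 20% uncertainty in this example adds a full cycle of latency, which is why the spread of physical parameters, not just their mean, matters.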
The spreading of physical parameters will make it harder to achieve high-performing chips that safely meet all timing constraints. Worst-case timing methodologies, which require a clock period larger than the worst-case propagation delay, may underuse the potential of the technology, especially when worst-case propagation delays are rare events.
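The cost of worst-case clocking can be estimated with a simple average-case model, assuming the system clocks at an aggressive period and pays a fixed recovery penalty whenever a rare slow transfer misses it. The function names and the numbers below are hypothetical; the model only captures the trade-off, not any specific recovery mechanism.

```python
def effective_period(t_fast: float, p_slow: float, t_penalty: float) -> float:
    """Average time per transfer when clocking at the aggressive period t_fast
    and paying an extra t_penalty on the fraction p_slow of transfers whose
    delay exceeds t_fast (they must be detected and recovered)."""
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be a probability")
    return t_fast + p_slow * t_penalty

def break_even_probability(t_worst: float, t_fast: float, t_penalty: float) -> float:
    """Slow-transfer rate at which aggressive clocking stops paying off."""
    return (t_worst - t_fast) / t_penalty

if __name__ == "__main__":
    # Hypothetical numbers (ns): worst-case path 1.0, typical paths fit in 0.7,
    # and a late transfer costs one extra worst-case period to recover.
    t_worst, t_fast, t_penalty = 1.0, 0.7, 1.0
    for p in (0.001, 0.01, 0.1):
        speedup = t_worst / effective_period(t_fast, p, t_penalty)
        print(f"p_slow={p}: {speedup:.2f}x vs worst-case clocking")
    print("break-even p_slow:", break_even_probability(t_worst, t_fast, t_penalty))
```

When slow patterns are rare, the average period approaches the aggressive one; the break-even rate shows how rare they must be for the scheme to win.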
Moreover, it is likely that varying on-chip temperature profiles (due to varying loads and DPM) will increase the spread of wiring delays [2]. Thus, it will be mandatory to go beyond worst-case design methodology, and use fault-tolerant schemes that can recover from timing errors [11, 30, 35].
Most large SoCs are designed using different voltage islands [23], which are regions with specific supply voltages and operating frequencies; these in turn may depend on the workload and on dynamic voltage and frequency scaling. Synchronization among these islands may become extremely hard to achieve due to timing skews and spreads. Global wires will span multiple clock domains, and synchronization failures in communicating between different clock domains will be rare but unavoidable events [12].
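Why synchronization failures are rare but unavoidable is captured by the widely used first-order synchronizer model, MTBF = exp(t_r / tau) / (T_w f_clk f_data): extra resolution time raises the MTBF exponentially, but it never becomes infinite. The sketch below evaluates that formula; the device constants (`tau`, the aperture `window`) and clock rates are hypothetical, and treating a second flip-flop as one extra clock period of resolution time is a simplifying assumption.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def synchronizer_mtbf_s(t_resolve_s: float, tau_s: float, t_window_s: float,
                        f_clk_hz: float, f_data_hz: float) -> float:
    """First-order synchronizer model:
        MTBF = exp(t_resolve / tau) / (t_window * f_clk * f_data)
    tau: metastability time constant of the flip-flop
    t_window: aperture in which a data edge can cause metastability
    t_resolve: time allowed for a metastable state to resolve."""
    return math.exp(t_resolve_s / tau_s) / (t_window_s * f_clk_hz * f_data_hz)

if __name__ == "__main__":
    # Hypothetical device and clock parameters, for illustration only.
    tau, window = 20e-12, 20e-12
    f_clk, f_data = 1e9, 100e6
    one_stage = synchronizer_mtbf_s(0.9e-9, tau, window, f_clk, f_data)
    two_stage = synchronizer_mtbf_s(1.9e-9, tau, window, f_clk, f_data)
    print(f"one flip-flop : {one_stage / SECONDS_PER_YEAR:.3g} years")
    print(f"two flip-flops: {two_stage / SECONDS_PER_YEAR:.3g} years")
```

The MTBF can be made astronomically large, but a chip with many clock-domain crossings multiplies the failure rate by the number of synchronizers, and the probability of failure is never exactly zero.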
Signal Integrity
With forthcoming technologies, it will be harder to guarantee error-free information transfer (at the electrical level) on wires, for several reasons:
- Reduced signal swings with a corresponding reduction of voltage noise margins.
- Crosstalk is bound to increase, and identifying all potential on-chip noise sources is so complex that crosstalk avoidance is unlikely to succeed fully.
- Electromagnetic interference (EMI) by external sources will become more of a threat because of the smaller voltage swings and smaller dynamic storage capacitances.
- The probability of occasional synchronization failures and/or metastability will rise. These erroneous conditions are possible during system operation because of transmission speed changes, local clock frequency changes, timing noise (jitter), etc.
- Soft errors are caused by collisions of thermal neutrons (produced by the decay of cosmic ray showers) and/or alpha particles (emitted by impurities in the package). They can create spurious pulses, which can affect signals on chip and/or discharge dynamic storage capacitances.
Moreover, SoCs may be deliberately operated in error-prone conditions because of the need to extend battery lifetime by lowering energy consumption via aggressive supply voltage reduction. Thus, specific run-time policies may trade off signal integrity for energy consumption reduction, exacerbating the problems due to the fabrication technology.
Reliability
System-level reliability is the probability that the system operates correctly up to time t, expressed as a function of t. The mean time to failure (MTTF) is the expected time until the first failure, which equals the integral of the reliability function over time.
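The relation between the reliability function R(t) and the MTTF (MTTF is the integral of R(t) from zero to infinity) can be checked numerically. The sketch below assumes a constant failure rate, so R(t) = exp(-lambda t) and the closed form is MTTF = 1/lambda; the failure rate and function names are hypothetical.

```python
import math

def reliability_exponential(t: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t), assuming a constant failure rate lambda."""
    return math.exp(-failure_rate * t)

def mttf_numeric(reliability, t_max: float, steps: int = 200_000) -> float:
    """Approximate MTTF = integral of R(t) dt over [0, infinity) with the
    trapezoidal rule on [0, t_max]; t_max must be large enough that
    R(t_max) is negligible."""
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for k in range(1, steps):
        total += reliability(k * dt)
    return total * dt

if __name__ == "__main__":
    lam = 1e-5  # hypothetical: on average one failure per 100,000 hours
    estimate = mttf_numeric(lambda t: reliability_exponential(t, lam), t_max=2e6)
    print(f"numeric MTTF {estimate:.1f} h, closed form 1/lambda = {1 / lam:.1f} h")
```

The numeric integral and the closed form agree, which makes the integral definition a convenient way to compute MTTF for reliability functions that have no simple closed form.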
Increasing MTTF well beyond the expected useful life of a product is an important design criterion. Highly reliable systems have been the object of study for many years. Beyond traditional applications, such as aircraft control, defense applications and reliable computing, many new fields require highly reliable SoCs, ranging from medical applications to automotive control and, more generally, to embedded systems that are critical for human operation and life.
Figure 1.3. Failure on a wire due to electromigration.
The increased demand for highly reliable SoCs is counterbalanced by the increased failure rates of devices and interconnects. Due to technology downscaling, interconnect failures due to electromigration are more likely to happen (Figure 1.3, above). Similarly, device failure due to dielectric breakdown is more likely because of higher electric fields and carrier speeds (Figure 1.4, below). Temperature cycles on chip induce mechanical stress, which has detrimental effects [28].
For these reasons, SoCs need to be designed with specific resilience to hard (i.e., permanent) and soft (i.e., transient) malfunctions. System-level solutions for hard errors involve redundancy, and thus require the on-line connection of a stand-by unit and the disconnection of the faulty unit. Solutions for soft errors include design techniques for error containment, error detection and correction via encoding.
Moreover, when soft errors induce timing errors, systems based on double-latch clocking can be used for detection and correction. NoCs can provide resilient solutions to hard errors (by supporting seamless connection/disconnection of units) and to soft errors (by layered error correction).
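The benefit of a stand-by unit for hard errors can be quantified with the classic cold-standby model. The sketch below assumes identical units with a constant failure rate, a spare that cannot fail while idle, and perfect fault detection and switch-over, which is the idealized version of the seamless connection/disconnection a NoC can support. Under these assumptions R(t) = exp(-lambda t)(1 + lambda t) and the MTTF doubles; the failure rate used is hypothetical.

```python
import math

def reliability_single(t: float, lam: float) -> float:
    """One unit with constant failure rate lam: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def reliability_cold_standby(t: float, lam: float) -> float:
    """One active unit plus one unpowered spare of the same type, assuming
    ideal fault detection and switch-over and no failures while idle:
    R(t) = exp(-lam * t) * (1 + lam * t)."""
    return math.exp(-lam * t) * (1.0 + lam * t)

# Under the same assumptions the MTTF doubles.
def mttf_single(lam: float) -> float:
    return 1.0 / lam

def mttf_cold_standby(lam: float) -> float:
    return 2.0 / lam

if __name__ == "__main__":
    lam = 1e-5  # hypothetical failures per hour
    for hours in (1e4, 5e4, 1e5):
        print(f"t={hours:>8.0f} h  single={reliability_single(hours, lam):.3f}  "
              f"standby={reliability_cold_standby(hours, lam):.3f}")
```

Real switch-over logic can itself fail, so the ideal doubling is an upper bound on what the redundancy buys.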
Figure 1.4. Failure on a transistor due to oxide breakdown.
Non-determinism in SoC Modeling and Design
As SoC complexity scales, it will become more difficult, if not impossible, to capture SoC functionality with fully deterministic models of operation. In other words, system models may have multiple implementations. Property abstraction, which is key to managing complexity in modeling and design, will hide implementation details, and designers will have to relinquish control of such details.
While abstract modeling and automated synthesis enable complex system design, such an approach increases the variability of physical and electrical parameters. In summary, to ensure correct and safe realizations, the system architecture and design style have to be resilient against errors generated by various sources, including:
- process technology (parameter spreading, defect density, failure rates);
- environment (temperature variation, EMI, radiation);
- operation mode (very-low-voltage operation);
- design style (abstraction and synthesis from non-deterministic models).
Variability, Design Methodologies and NoCs
Dealing with variability is an important matter affecting many aspects of SoC design. We consider here a few aspects related to on-chip communication design.
The first important issue deals with malfunction containment. Traditionally, malfunctions have been avoided by imposing stringent rules on physical design and by applying stringent signal integrity tests before tape-out. The rules are such that variations of process parameters can be tolerated, and integrity analysis can detect potential problems such as crosstalk. This approach is conservative in nature and leads to perfecting the physical layout of circuits.
On the other hand, the downscaling of technologies has unveiled many potential problems, and as a result physical design tools have grown in complexity and cost, as has the time needed to achieve design closure. At some point, correct-by-construction design at the physical level will no longer be possible. Similarly, the ever-larger number of connections on chip will make it unlikely that signal integrity analysis detects all potential crosstalk errors.
Future trends will soften requirements at the physical and electrical level, and require higher-level mechanisms for error correction. Thus, electrical errors will be considered inevitable. Nevertheless, their effect can be contained by techniques that correct them at the logic and functional levels. In other words, the error detection/correction paradigm applied to networking will become a standard tool in on-chip communication design.
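As one concrete instance of correcting electrical errors at the logic level, the sketch below implements the classic Hamming(7,4) single-error-correcting code. The article does not prescribe a particular code; Hamming(7,4) is chosen here only because it is small enough to read in full. Detection-only codes combined with retransmission, as in the scheme of Figure 1.5, are an alternative. The function names are illustrative.

```python
# Hamming(7,4): 4 data bits, 3 parity bits, corrects any single-bit error.
# Codeword layout (positions 1..7): [p1, p2, d1, p4, d2, d3, d4].

def hamming74_encode(d):
    """d: list of 4 data bits [d1, d2, d3, d4] -> 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    """Return (data_bits, error_position). error_position is 0 when the
    syndrome is zero, else the 1-based position that was flipped back."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # binary index of the erroneous bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome

if __name__ == "__main__":
    word = [1, 0, 1, 1]
    codeword = hamming74_encode(word)
    codeword[5] ^= 1                    # flip position 6 in transit
    print(hamming74_decode(codeword))   # ([1, 0, 1, 1], 6)
```

The code costs three extra wires per four data bits; whether that overhead beats detection plus retransmission depends on the error rate and on the latency budget of the link.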
Timing errors are an important side effect of variability. They can be caused by a wide variety of sources, including but not limited to: incorrect wiring delay estimates, overaggressive clocking, crosstalk and soft (radiation-induced) errors. Timing errors can be detected by double latches, gated by different clocking signals, and by comparing the latched data. When the data differs, it most likely means that the signal settled after the first latch was gated, that is, that a timing error was on the verge of being propagated. (Unfortunately, errors can also happen in the latches themselves.)
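The double-latch check can be modeled behaviorally in a few lines. The sketch below uses hypothetical names (`Transfer`, `sample`, `double_latch`) and idealized latches: a latch captures the new value only if the wire settled before its clock edge. It reproduces both the detection mechanism and its limits, since a transition that misses both edges goes unnoticed, and errors inside the latches are not modeled at all.

```python
from dataclasses import dataclass

@dataclass
class Transfer:
    old_value: int     # value held on the wire before this cycle
    new_value: int     # value the sender drives in this cycle
    arrival_ps: float  # time after the launch edge at which the wire settles

def sample(t: Transfer, edge_ps: float) -> int:
    """Idealized latch: captures new_value only if the wire settled by the edge."""
    return t.new_value if t.arrival_ps <= edge_ps else t.old_value

def double_latch(t: Transfer, fast_edge_ps: float, safe_edge_ps: float):
    """Sample the same wire at an aggressive edge and at a later, safe edge.

    Returns (value, timing_error). A mismatch means the signal settled after
    the fast edge, so the safe sample replaces the fast one. A transition
    that also misses the safe edge goes undetected, and errors inside the
    latches themselves (e.g. metastability) are not modeled.
    """
    fast = sample(t, fast_edge_ps)
    safe = sample(t, safe_edge_ps)
    timing_error = fast != safe
    return (safe if timing_error else fast), timing_error

if __name__ == "__main__":
    for arrival in (400.0, 550.0, 800.0):
        print(arrival, double_latch(Transfer(0, 1, arrival), 500.0, 700.0))
```

With edges at 500 ps and 700 ps, an arrival at 400 ps passes cleanly, one at 550 ps is caught and corrected, and one at 800 ps slips through unnoticed, which is why the safe edge must still respect a conservative bound.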
Figure 1.5. The voltage swing on communication busses is reduced, even though signal integrity is partially compromised [35]. Encoding techniques are used to detect corrupted data, which is retransmitted. The retransmission rate is an input to a closed-loop dynamic voltage scaling (DVS) control scheme, which sets the voltage swing at a trade-off point between energy saving and latency penalty (due to data retransmission).
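The closed-loop scheme described in the Figure 1.5 caption can be sketched as a simple dead-band controller. This is not the controller of [35]: the link model `toy_error_rate`, the retransmission thresholds, the 20 mV step and the voltage limits are all hypothetical, chosen only to show how a measured retransmission rate can steer the swing toward an energy/latency trade-off point.

```python
def toy_error_rate(swing_v: float) -> float:
    """Made-up link model: the error (and hence retransmission) rate
    drops tenfold for every 50 mV of swing above 0.2 V."""
    return min(1.0, 10.0 ** (-(swing_v - 0.2) * 20.0))

def next_swing(swing_v: float, retx_rate: float,
               target_low: float = 1e-3, target_high: float = 1e-2,
               step_v: float = 0.02, v_min: float = 0.2, v_max: float = 1.0) -> float:
    """One controller step: raise the swing when retransmissions are too
    frequent (latency penalty), lower it when they are rare (energy left on
    the table), otherwise hold. The result is clamped to [v_min, v_max]."""
    if retx_rate > target_high:
        swing_v += step_v
    elif retx_rate < target_low:
        swing_v -= step_v
    return min(v_max, max(v_min, swing_v))

def settle(swing_v: float = 1.0, iterations: int = 100) -> float:
    """Run the loop against the toy link model and return the final swing."""
    for _ in range(iterations):
        swing_v = next_swing(swing_v, toy_error_rate(swing_v))
    return swing_v

if __name__ == "__main__":
    s = settle()
    print(f"settled swing {s:.2f} V, retransmission rate {toy_error_rate(s):.1e}")
```

Starting from 1.0 V, the loop walks the swing down until the toy retransmission rate enters the band, settling near 0.34 V. The dead band between the two thresholds keeps the controller from toggling every step; the thresholds themselves encode the trade-off between energy saving and retransmission latency.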
Asynchronous design methodologies can make circuits resilient to delay variations. For example, speed-independent and delay-insensitive circuit families can operate correctly in the presence of delay variations in gates and interconnects. Unfortunately, design complexity often makes the application of a fully asynchronous design methodology impractical. A viable compromise is the use of globally asynchronous locally synchronous (GALS) circuits, which use asynchronous handshaking protocols to link synchronous domains possibly clocked at different frequencies.
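A GALS link typically relies on a handshake such as the four-phase (return-to-zero) request/acknowledge protocol. The sketch below models a sender and a receiver as Python generators that a random scheduler advances at arbitrary relative speeds, standing in for unrelated clock domains; the point is that words arrive intact and in order regardless of the interleaving. It is a software model of the protocol under an assumed bundled-data convention (data is stable before `req` rises), not a circuit description, and all names are hypothetical.

```python
import random

def sender(words, ch):
    for w in words:
        ch["data"] = w
        ch["req"] = 1                 # phase 1: data valid, raise request
        while ch["ack"] == 0:         # wait until the receiver has latched
            yield
        ch["req"] = 0                 # phase 3: release the request
        while ch["ack"] == 1:         # wait for return to zero
            yield

def receiver(n, ch, out):
    for _ in range(n):
        while ch["req"] == 0:         # wait for a request
            yield
        out.append(ch["data"])        # phase 2: latch data, acknowledge
        ch["ack"] = 1
        while ch["req"] == 1:         # wait for the request to drop
            yield
        ch["ack"] = 0                 # phase 4: channel idle again

def run(words, seed=0):
    """Interleave both sides at random to mimic unrelated clock domains."""
    rng = random.Random(seed)
    ch = {"req": 0, "ack": 0, "data": None}
    out = []
    tasks = [sender(words, ch), receiver(len(words), ch, out)]
    while tasks:
        task = rng.choice(tasks)      # arbitrary relative speeds
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)
    return out

if __name__ == "__main__":
    print(run([10, 20, 30], seed=1))
```

Each side only ever waits on the other side's signal, never on a shared clock, which is what lets the two synchronous domains run at unrelated frequencies.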
Figure 1.6. Razor [11] is another realization of self-calibrating circuits, where a processor's supply voltage is lowered until errors occur. Correct operation of the processor is preserved by an error detection and pipeline adjustment technique. As a result, the processor settles on-line to an operating voltage that minimizes energy consumption even in the presence of variations in technological parameters.
NoCs are well poised to deal with variability because networking technology is layered and error detection, containment and correction can be done at various layers, according to the nature of the possible malfunction. There are several paradigms that deal with variability for NoCs. Self-calibrating circuits are circuits that adapt on-line to the operating conditions. There are several embodiments of self-calibrating circuits, as shown in Figure 1.5 and Figure 1.6 above and Figure 1.7 below.
Figure 1.7. T-error is a timing methodology for NoCs where data is pipelined through double latches, the former using an aggressive clock period and the latter a safe one. For most patterns, T-error forwards data from the first latch. When the slowest patterns, which miss the deadline at the first latch, are transmitted, correct but slower operation is performed by the second latch [30].
Next in Part 2: System on chip objectives and network on chip needs
Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on material from "Networks On Chips: Technology and Tools," by Luca Benini and Giovanni De Micheli.
Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is professor and director of the Integrated Systems Center at EPF in Lausanne, Switzerland.
References
[1] A. Adriahantenaina, H. Charlery, A. Greiner, L. Mortiez and C. Zeferino, "SPIN: A Scalable, Packet Switched, On-Chip Micro-network," DATE - Design, Automation and Test in Europe Conference and Exhibition, 2003, pp. 70-73.
[2] A.H. Ajami, K. Banerjee and M. Pedram, "Modeling and Analysis of Nonuniform Substrate Temperature Effects on Global ULSI Interconnects," IEEE Transactions on CAD, Vol. 24, No. 6, June 2005, pp. 849-861.
[3] H. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, Upper Saddle River, NJ, 1990.
[4] L. Benini, A. Bogliolo and G. De Micheli, "A Survey of Design Techniques for System-Level Dynamic Power Management," IEEE Transactions on Very Large-Scale Integration Systems, Vol. 8, No. 3, June 2000, pp. 299-316.
[5] W.O. Cesario, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, L. Gauthier, M. Diaz-Nava and A.A. Jerraya, "Multiprocessor SoC Platforms: A Component-Based Design Approach," IEEE Design and Test of Computers, Vol. 19, No. 6, November-December 2002, pp. 52-63.
[6] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, San Francisco, CA, 2004.
[7] W. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," Proceedings of the 38th Design Automation Conference, 2001.
[8] W.J. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels," IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 4, April 1993, pp. 466-475.
[9] W. Dally and C. Seitz, "The Torus Routing Chip," Distributed Computing, Vol. 1, 1986, pp. 187-196.
[10] M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi and L. Benini, "Xpipes: A Latency Insensitive Parameterized Network-on-Chip Architecture for Multiprocessor SoCs," International Conference on Computer Design, 2003, pp. 536-539.
[11] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N. S. Kim and K. Flautner, "Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation," IEEE Micro, Vol. 24, No. 6, November-December 2004, pp. 10-20.
[12] W. Dally and J. Poulton, Digital Systems Engineering, Cambridge University Press, Cambridge, MA, 1998.
[13] J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, San Francisco, CA, 2003.
[14] T. Dumitraş, S. Kerner and R. Marculescu, "Towards On-Chip Fault-Tolerant Communication," ASPDAC - Proceedings of the Asian-South Pacific Design Automation Conference, 2003, pp. 225-232.
[15] S. Goel, K. Chiu, E. Marinissen, T. Nguyen and S. Oostdijk, "Test Infrastructure Design for the Nexperia Home Platform PNX8550 System Chip," DATE - Proceedings of the Design Automation and Test in Europe Conference, 2004.
[16] K. Goossens, J. van Meerbergen, A. Peeters and P. Wielage, "Networks on Silicon: Combining Best Efforts and Guaranteed Services," Design Automation and Test in Europe Conference, 2002, pp. 423-427.
[17] R. Hegde and N. Shanbhag, "Toward Achieving Energy Efficiency in Presence of Deep Submicron Noise," IEEE Transactions on VLSI Systems, Vol. 8, No. 4, August 2000, pp. 379-391.
[18] R. Ho, K. Mai and M. Horowitz, "The Future of Wires," Proceedings of the IEEE, January 2001.
[19] J. Hu and R. Marculescu, "Energy-Aware Mapping for Tile-Based NoC Architectures Under Performance Constraints," Asian-Pacific Design Automation Conference, 2003.
[20] F. Karim, A. Nguyen and S. Dey, "On-Chip Communication Architecture for OC-768 Network Processors," Proceedings of the 38th Design Automation Conference, 2001.
[21] B. Khailany, et al., "Imagine: Media Processing with Streams," IEEE Micro, Vol. 21, No. 2, 2001, pp. 35-46.
[22] S. Kumar, et al., "A Network on Chip Architecture and Design Methodology," IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2002.
[23] D. Lackey, P. Zuchowski, T. Bednar, D. Stout, S. Gould and J. Cohn, "Managing Power and Performance for Systems on Chip Design Using Voltage Islands," ICCAD - International Conference on Computer Aided Design, 2002, pp. 195-202.
[24] P. Lieverse, P. van der Wolf, K. Vissers and E. Deprettere, "A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems," Journal of VLSI Signal Processing for Signal, Image and Video Technology, Vol. 29, No. 3, 2001, pp. 197-207.
[25] M. Oka and M. Suzuoki, "Designing and Programming the Emotion Engine," IEEE Micro, Vol. 19, No. 6, November-December 1999, pp. 20-28.
[26] D. Pham, et al., "Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor," IEEE Journal of Solid-State Circuits, Vol. 41, No. 1, January 2006, pp. 179-196.
[27] A. Pinto, L. Carloni and A. Sangiovanni-Vincentelli, "Constraint-Driven Communication Synthesis," Design Automation Conference, 2002, pp. 195-202.
[28] K. Skadron, et al., "Temperature-Aware Computer Systems: Opportunities and Challenges," IEEE Micro, Vol. 23, No. 6, November-December 2003, pp. 52-61.
[29] D. Sylvester and K. Keutzer, "A Global Wiring Paradigm for Deep Submicron Design," IEEE Transactions on CAD/ICAS, Vol. 19, No. 2, February 2000, pp. 242-252.
[30] R. Tamhankar, S. Murali and G. De Micheli, "Performance Driven Reliable Link for Networks on Chip," ASPDAC - Proceedings of the Asian Pacific Conference on Design Automation, Shanghai, 2005, pp. 749-754.
[31] T. Theis, "The Future of Interconnection Technology," IBM Journal of Research and Development, Vol. 44, No. 3, May 2000, pp. 379-390.
[32] E. Waingold, et al., "Baring It All to Software: Raw Machines," IEEE Computer, Vol. 30, No. 9, September 1997, pp. 86-93.
[33] J. Walrand and P. Varaiya, High-Performance Communication Networks, Morgan Kaufmann, San Francisco, CA, 2000.
[34] M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, Upper Saddle River, NJ, 1995.
[35] F. Worm, P. Ienne, P. Thiran and G. De Micheli, "An Adaptive Low-Power Transmission Scheme for On-Chip Networks," ISSS - Proceedings of the International Symposium on System Synthesis, Kyoto, October 2002, pp. 92-100.
[36] H. Zhang, V. George and J. Rabaey, "Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness," IEEE Transactions on VLSI Systems, Vol. 8, No. 3, June 2000, pp. 264-272.
[37] http://public.itrs.net/









