Design Article

Single-stage, single-image NPU model simplifies design

3/10/2003 9:15 AM EST

Keith Morris, Director of Product Strategy, Robin Melnick, Director of Software Product Management, Switching and Network Processing Division, Applied Micro Circuits Corp. (AMCC), Sunnyvale, Calif.

At the increasingly diverse network edge, the ongoing operational cost of inventorying and swapping out a variety of line-card types just to adjust the mix of services can be a major drain on service provider profitability. Meanwhile, the development cost of so many line-card variations is a significant burden on equipment vendors.

By deploying multiple services in the smallest possible number of line-card variations, equipment vendors and service providers alike can reduce their cost of doing business in this increasingly heterogeneous environment. More importantly, providers need the flexibility to mix and match different services and line interfaces in order to maximize today's service opportunities while also providing a smooth forward migration path to meet tomorrow's requirements.

Accomplishing the goal of "any port, any speed, any service" requires a high degree of software programmability; however, such flexibility cannot come at the cost of performance. Nor can performance and flexibility be allowed to drive up system or line-card costs, which would defeat the original motivation for consolidation. The overall goal must then be to provide "universal port services at dedicated port economics."

Several steps are required at the implementation level to support this mix of cell- and packet-based protocols, a range of line speeds, deep channelization to support a high density of low-speed interfaces, and a range of subscriber service applications, all at minimum cost and in a minimum number of hardware configurations.

The designer must effectively merge programmable network processing and fine-granularity traffic management at the lowest possible level. This can best be accomplished through a highly integrated chip-level approach that combines a high-performance, multi-core network processor (NPU) and a hardware-based traffic manager (TM) on the same device, along with other specialized co-processor resources and integrated line interfaces that support both legacy and emerging standards. Embedded networking software developers will also need effective techniques for rapidly developing and deploying the wide range of protocols and applications involved in such multi-service environments.

A single-chip design, implemented using 0.13-micron fabrication, can combine the network processing functions, TM functions and co-processors within much less space than required for multiple special-function devices. This not only significantly eases the cost and power constraints of line-card design, it also improves simplicity and reliability by eliminating external interconnections between devices.

A single-chip architecture consisting of three NPU cores, each capable of supporting 24 independent tasks, has a theoretical combined capacity to process 72 separate packets or cells simultaneously. However, the real key for the embedded software developer lies in the use of a single-stage, single-image programming model and efficient single-instruction access to all of the on-chip hardware co-processors.

The single-stage, single-image model enables programmers to implement an entire data flow algorithm as a single complete unified program, which can be executed identically by each task on each core in the cluster, thereby dramatically simplifying the programming task and streamlining the run-time environment. Each packet runs to completion on a single task/thread on a single NPU core.

Programmers thus treat the multi-core cluster as if it were a single powerful processor, eliminating the hassle of segmenting code into a chain of separate serial blocks, a sequential methodology that not all algorithms fit easily. More importantly, the model also eliminates the significant burdens of writing inter-segment hand-off code and then load-balancing, debugging, and performance-tuning the mix of these blocks, challenges that can degrade performance in typical multi-stage pipeline models.
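As a rough illustration of the single-image, run-to-completion idea, the Python sketch below has every task slot execute the same `process_packet` function from start to finish. It is a conceptual model only, not NPU code: the `Packet` fields, the function names and the protocol labels are all hypothetical, and the figure of 24 tasks per core is an assumption chosen because 3 x 24 matches the 72 packets cited above.

```python
from dataclasses import dataclass

NUM_CORES = 3          # from the article
TASKS_PER_CORE = 24    # assumed: 3 x 24 matches the article's 72 in flight


@dataclass
class Packet:
    proto: str       # hypothetical field: "ip", "atm", ...
    payload: bytes


def process_packet(pkt: Packet) -> str:
    """The single program image: every task runs this same function,
    start to finish, for whichever packet it is handed."""
    if pkt.proto == "ip":
        return f"ip:forwarded:{len(pkt.payload)}"
    if pkt.proto == "atm":
        return f"atm:switched:{len(pkt.payload)}"
    return "dropped"


def run_to_completion(packets):
    """Hand each packet to the next task slot; there is no inter-stage
    hand-off code because no packet ever changes task."""
    slots = [(core, task)
             for core in range(NUM_CORES)
             for task in range(TASKS_PER_CORE)]
    results = []
    for i, pkt in enumerate(packets):
        core, task = slots[i % len(slots)]
        results.append((core, task, process_packet(pkt)))
    return results
```

Note that the branching on protocol type lives inside one function rather than being split across pipeline stages, which is the property the article attributes to the model.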

No bottlenecks

Even in decision-rich processing situations like those encountered in multi-service implementations, the single-image, single-stage model means that case statements and conditional jumps will not result in pipeline breakage or cause bottlenecks in the processing flow. Regardless of any variations in the elapsed time needed to process a particular packet, no other cores are impacted, left idle or under-utilized. Because the single-thread execution pipeline is not sensitive to latency, it can deliver deterministic performance under virtually any set of conditions.
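The bottleneck argument can be made concrete with a toy timing model. The sketch below compares a three-stage pipeline (one stage per core) against three cores that each run packets to completion. It is a simplification under stated assumptions: all packets are available at time zero, per-stage costs are arbitrary made-up cycle counts, buffering between stages is unlimited, and memory latency and hardware task switching are ignored. The function names are hypothetical.

```python
import heapq


def pipeline_makespan(costs):
    """Total time when stage s always runs on core s.
    costs[i][s] is the (made-up) cycle cost of packet i in stage s."""
    num_stages = len(costs[0])
    stage_free = [0] * num_stages   # time at which each stage becomes free
    for pkt in costs:
        t = 0                       # time the packet leaves the previous stage
        for s in range(num_stages):
            t = max(t, stage_free[s]) + pkt[s]
            stage_free[s] = t
    return stage_free[-1]


def run_to_completion_makespan(costs, num_cores):
    """Total time when each packet is fully processed by whichever
    core becomes free first."""
    core_free = [0] * num_cores
    heapq.heapify(core_free)
    finish = 0
    for pkt in costs:
        start = heapq.heappop(core_free)
        end = start + sum(pkt)
        heapq.heappush(core_free, end)
        finish = max(finish, end)
    return finish


# Six packets whose middle stage is four times heavier than the others.
costs = [[1, 4, 1]] * 6
print(pipeline_makespan(costs))              # 26: the middle stage gates everything
print(run_to_completion_makespan(costs, 3))  # 12: work is spread evenly
```

In this toy case the pipeline is gated by its slowest stage while the other two cores sit mostly idle, which is the imbalance that the run-to-completion model is meant to avoid.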

The tight integration of on-chip co-processors allows any core in the NPU cluster to directly access dedicated hardware resources for performing specialized functions. In many cases, the core processors do not even need to expend instruction cycles in order to access these resources because the core's act of loading a sequence into a register will automatically trigger the co-processor's actions.

By using dual-access registers in shared memory space, the chip-level architecture provides zero-cycle task switching and zero-cycle branching for most critical network processing operations. Several complex processing activities such as queuing, scheduling, policy-based flow classifications, searches, packet transformations, metering, and policing can be accomplished with a handful of instructions from the NPU programmer's perspective.

When it comes to handling high volumes of diverse traffic flows, the on-chip integration of a hardware-based TM is a critical factor. Software-centric traffic management invariably must make compromises between throughput and capacity, thus sacrificing either performance or granularity of traffic flows.

Within the single-chip architecture described, the integrated TM offers wire-speed queuing and scheduling of up to 128K individual ingress/egress flows. The hardware TM handles all per-flow queuing and scheduling operations with its own direct access to payload memory for up to 2 million cells of storage so that it can run at full speed, completely independent of the NPU core processors.

Every individual flow represents a unidirectional stream of traffic through the system from an ingress port to an egress port. A flow can consist of either a single connection or a group of connections with the same traffic characteristics.

The on-chip TM provides the flexibility to manage the 128K flows individually or to group them within as many as 4000 pipes and/or 512 sub-ports. This enables dynamic provisioning of a variety of service alternatives with virtually any level of granularity, ranging from T1/E1 or OC-1 up through OC-48.
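One way to picture the grouping is as a small lookup hierarchy. The sketch below is a hypothetical software model, not the TM's actual data structures: it assumes that a flow may optionally belong to a pipe and that a pipe may optionally belong to a sub-port, and it assumes "128K" means 128 x 1024. The class and method names are invented for illustration.

```python
MAX_FLOWS = 128 * 1024   # "128K" flows from the article, assuming K = 1024
MAX_PIPES = 4000         # from the article
MAX_SUBPORTS = 512       # from the article


class FlowHierarchy:
    """Hypothetical bookkeeping for flow -> pipe -> sub-port grouping."""

    def __init__(self):
        self.pipe_to_subport = {}   # pipe_id -> subport_id or None
        self.flow_to_pipe = {}      # flow_id -> pipe_id or None

    def add_pipe(self, pipe_id, subport_id=None):
        if not 0 <= pipe_id < MAX_PIPES:
            raise ValueError("pipe_id out of range")
        if subport_id is not None and not 0 <= subport_id < MAX_SUBPORTS:
            raise ValueError("subport_id out of range")
        self.pipe_to_subport[pipe_id] = subport_id

    def add_flow(self, flow_id, pipe_id=None):
        """Register a flow, either managed individually (no pipe)
        or grouped into an existing pipe."""
        if not 0 <= flow_id < MAX_FLOWS:
            raise ValueError("flow_id out of range")
        if pipe_id is not None and pipe_id not in self.pipe_to_subport:
            raise KeyError("unknown pipe")
        self.flow_to_pipe[flow_id] = pipe_id

    def resolve(self, flow_id):
        """Return (flow, pipe, sub-port) for scheduling decisions."""
        pipe_id = self.flow_to_pipe[flow_id]
        subport_id = (self.pipe_to_subport[pipe_id]
                      if pipe_id is not None else None)
        return flow_id, pipe_id, subport_id
```

For example, a flow added with `add_flow(100, pipe_id=7)` after `add_pipe(7, subport_id=3)` resolves to `(100, 7, 3)`, while a flow added with no pipe resolves to `(101, None, None)` and would be scheduled on its own.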

In addition to the space, cost, power and simplicity advantages of combining the TM on the same chip with the NPU cores, the integrated design actually increases functionality through tighter on-chip integration. For example, the NPU core processors can have direct software control over the TM's hardware queues, which would be difficult or impossible to implement across separate multiple-chip designs.

Direct software control allows the TM to act as a memory manager for the core processor, enabling the NPU to use "programmable queuing" functions such as re-ordering or assembling packets together within a flow and then linking them back to the TM. This can be particularly useful for streamlining the implementation of advanced features such as Segmentation and Reassembly (SAR), IP fragmentation or application layer protocol termination functions.
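To show the kind of work SAR involves, here is a deliberately simplified sketch that splits a packet into 48-byte ATM cell payloads and joins them again. It is not AAL5-compliant: a real implementation carries the length and a CRC in a trailer, whereas this version passes the original length separately. The function names are hypothetical, and the code says nothing about how the TM links cells in hardware.

```python
CELL_PAYLOAD = 48  # payload bytes carried by one ATM cell


def segment(packet: bytes):
    """Split a packet into fixed-size cell payloads, zero-padding the last."""
    cells = []
    for off in range(0, len(packet), CELL_PAYLOAD):
        chunk = packet[off:off + CELL_PAYLOAD]
        cells.append(chunk.ljust(CELL_PAYLOAD, b"\x00"))
    return cells


def reassemble(cells, length: int) -> bytes:
    """Concatenate in-order cell payloads and strip the padding.
    `length` is the original packet length (simplification: a real
    adaptation layer would recover it from a trailer)."""
    return b"".join(cells)[:length]
```

A 100-byte packet becomes three cells here, the last of which carries 4 bytes of data and 44 bytes of padding.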

Combining the NPU with an appropriate framer to offer deeply channelized interfaces, plus the ability to support both ATM cell-based connections and IP packet-based traffic on the same chip-level platform, opens up an attractive range of affordable mixed-service applications for system designers. Service providers can deploy lower-cost bandwidth creation services such as Inverse Multiplexing over ATM (IMA), Multi-Link / Multi-Class PPP (ML/MC-PPP), and Multi-Link Frame Relay (ML-FR), as well as the interworking among them, all via software libraries accompanying such a chipset.

While a network processor is typically selected for "fast-path" data flow processing, another architectural dilemma for the system designer is where to perform additional in-depth "slow-path" processing for advanced services.

Exception flow

One approach is to embed an additional control CPU into an NPU, but this not only significantly increases size and power, it means more complexity for the programmer, with yet another development and debugging process, tool chain, and so on. A newer alternative allows both "fast-path" and "exception-path" processing to be handled by the same existing set of efficient NPU cores by buffering up packets or cells for special handling without having any impact on the flow of fast-path traffic.

The exception flow utilizes an on-chip channel buffer manager and an associated channel service memory to temporarily store or accumulate packets or cells under specified conditions, without the need for intervention by a general-purpose controller.

By flagging the packets or cells for special processing and moving them into this exception-flow data path, the system can handle special cases without impeding the deterministic traffic flows through the primary path.
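The flag-and-divert step might look like the following sketch. The criteria used to flag a packet (`fragment` and `control` keys on a dictionary) are invented for illustration; the article does not say which conditions trigger the exception flow, and on the device the buffering is done by the channel buffer manager rather than by a software queue.

```python
from collections import deque


def needs_exception(pkt: dict) -> bool:
    """Hypothetical test: divert fragments and control traffic."""
    return bool(pkt.get("fragment") or pkt.get("control"))


def dispatch(packets):
    """Split traffic into fast-path output and an exception queue.
    Fast-path packets are never held up behind flagged ones."""
    fast_out = []
    exception_q = deque()
    for pkt in packets:
        if needs_exception(pkt):
            exception_q.append(pkt)   # set aside for special handling
        else:
            fast_out.append(pkt)      # forwarded immediately
    return fast_out, exception_q
```

The point of the split is ordering: a flagged packet that needs lengthy handling waits in its own queue, so the packets behind it on the primary path are forwarded without delay.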

For example, in TCP termination functions, the exception path's store-and-forward capabilities can be used to accumulate all of the packets from a particular flow regardless of their arrival sequence, thus allowing the entire flow to be processed together. Similarly, pre-assembling the fragments of a jumbo frame before subsequent operations can enhance overall processing efficiency.
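Accumulating a flow regardless of arrival order can be sketched as a per-flow buffer keyed by sequence number. This is a minimal model under assumptions: each piece is assumed to carry a zero-based index `seq` and the total piece count `total`, which is a simplification of how TCP segments or IP fragments are actually identified. The class name and method are hypothetical.

```python
class ReassemblyBuffer:
    """Hold pieces of each flow until all have arrived, in any order."""

    def __init__(self):
        self._pending = {}   # flow_id -> {seq: payload}

    def add(self, flow_id, seq: int, total: int, payload: bytes):
        """Store one piece. Return the assembled payload once the flow
        is complete, otherwise None."""
        if not 0 <= seq < total:
            raise ValueError("seq out of range")
        parts = self._pending.setdefault(flow_id, {})
        parts[seq] = payload
        if len(parts) < total:
            return None
        del self._pending[flow_id]
        return b"".join(parts[i] for i in range(total))
```

Pieces arriving as 2, 0, 1 for a three-piece flow return `None` twice and then the payload in its original order, while other flows are buffered and completed independently.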

Another useful application is the streamlining of system control functions: the control CPU can send large packets, such as router table or software updates, which are transmitted and assembled in the background via the exception flow without impacting any ongoing processes in the primary data path.




