Design Article
How to scale network processors
Rob Munoz, Product Marketing Manager, Agere Systems, Allentown, Pa.
9/25/2001 12:35 PM EDT
Network equipment vendors typically opt for a network processor instead of a hardwired ASIC to improve time-to-market, reduce development costs and increase time-in-market. Although the industry generally agrees on the desired benefits of using a network processor approach, network processor suppliers disagree on several fronts, including: What is the best way to program network processors? How much required functionality should they supply vs. how much should the equipment vendor supply? How is a network processor solution architected?
The tactic a network processor supplier chooses in each of these areas can significantly affect its ability to scale a network processor product to OC-192c performance (and beyond) and simultaneously provide adequate processing power per packet. This power is required to support the desired set of applications and traffic mixes.
Many network processors require low-level microcode/picocode/assembly coding and tedious optimization of data-path functionality in order to deliver adequate performance. Although the software community has long understood the disadvantages of low-level programming, low-level network processor programming is an order of magnitude more complex than general-purpose processor programming because many architectures use parallel and/or pipelined engines. Some solutions require application software to manage and schedule parallel execution of tasks split across multiple processing resources and to manage the associated sharing of state information.
A more insidious problem with network processors that require low-level programming is discovering how best to scale up performance while maintaining compatibility with previously written software. Low-level software typically must know the microarchitecture (the number of processing engines, the pipeline structure and the like) of the network processor for which it is written. Scaling up performance by methods other than simple improvements to clock frequency invariably requires changes to the underlying microarchitecture. These changes, in turn, mean that software written for the previous microarchitecture probably will not work effectively on the new one. Even data-path application software written in C often depends on a network processor's underlying organization of support engines, which usually changes from generation to generation.
A preferable approach is to provide high-level programmability using application-oriented languages or models that do not require application software to explicitly manage parallel execution or sharing of state information across multiple processor resources. This strategy optimizes lifetime software development costs and intervals. It can, however, also lead to poor performance, because it is difficult to map the abstractions provided by the software model to the facilities provided by the underlying hardware. The challenge for the network processor supplier who follows this strategy is to create a high-performance, economical architecture that efficiently supports this mapping.
Most equipment vendors who use a network processor solution expect to obtain compatible traffic management functionality. Scaling a network processor product that lacks traffic management is easier than scaling one that offers it, but such an approach merely pushes the problem back to the equipment vendor, who is then forced to either develop a homegrown solution or integrate functionality from another supplier. Either of these approaches increases the equipment vendor's cost and risk, especially if the integrated configuration has not already undergone extensive interoperability testing. The same is true for other functionalities that are missing from some offerings, such as policing, statistics, and segmentation and reassembly.
Ideally, the network processor is part of a full, preintegrated fiber-to-fabric offering. This allows the equipment vendor to focus development and integration efforts on only those areas that add unique differentiating value.
At OC-192c, back-to-back 40-byte TCP/IP packets can arrive approximately every 39 ns. At OC-768c, packets arrive four times faster than this. At these rates, the external memory system becomes a significant bottleneck. Network traffic has little temporal or spatial locality (because packet arrivals are random), so caching is not nearly as effective as it is in traditional computing applications.
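The arrival interval quoted above follows from simple arithmetic on packet size and link rate. The sketch below is illustrative only: the article does not say which rate or framing overhead lies behind its ~39 ns figure, so the payload rate and the 7 bytes of per-packet framing used here are assumptions, chosen as one plausible combination. The function name is hypothetical.

```python
# Illustrative arithmetic only; the rate and overhead values are assumptions,
# not figures taken from the article.

OC192C_PAYLOAD_BPS = 9.58464e9  # assumed: payload capacity of an OC-192c link
ASSUMED_FRAMING_BYTES = 7       # assumed: per-packet link-layer framing overhead


def arrival_interval_ns(packet_bytes, rate_bps, framing_bytes=0):
    """Time between back-to-back packets of the given size, in nanoseconds."""
    bits_on_wire = (packet_bytes + framing_bytes) * 8
    return bits_on_wire / rate_bps * 1e9


oc192_ns = arrival_interval_ns(40, OC192C_PAYLOAD_BPS, ASSUMED_FRAMING_BYTES)
# The article states that OC-768c packets arrive four times faster.
oc768_ns = oc192_ns / 4

print(f"OC-192c: {oc192_ns:.1f} ns between 40-byte packets")
print(f"OC-768c: {oc768_ns:.1f} ns between 40-byte packets")
```

Under these assumed values the OC-192c interval comes out close to 39 ns and the OC-768c interval just under 10 ns. Different framing assumptions shift the result by a few nanoseconds.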
Support for full-sized routing tables and full-sized packet buffers almost invariably requires external memory storage. Given the large amount of external storage required, it is preferable to use DRAM as much as possible due to its huge advantages in cost, power and space compared to CAM and SRAM types of memory. Consider, for example, the cost/power/space implications of storing 1M+ IPv4 routes in CAMs or using 64+ MB of SRAM-based packet buffer memory in each direction! Typically, line cards need to be in a power envelope of 150 W or less in order to be economically deployed.
Memory bandwidth is often cited as the most serious constraint on network processor scalability. While bandwidth is certainly an issue (especially at OC-768c rates), memory bandwidth can nonetheless be scaled by:
Memory latency is sometimes considered a constraint, although suitable pipelining can effectively hide latency. However, the random read/write cycle time of DRAM (typically ~65-75 ns for most DRAM types) is potentially significant.
For example, in the case of the buffer management function at OC-192c speeds, arriving 40-byte packets must be deposited into buffer memory every ~39 ns, and departing packets must be retrieved from buffer memory every ~39 ns. Thus, the buffer memory subsystem must support a write and a read every ~39 ns when processing a stream of back-to-back 40-byte packets. A simple buffer memory implementation that uses a DDR SDRAM interface could only support either a write or a read every ~65-75 ns, which is a significant gap from the required level of performance.
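The size of that gap can be put in numbers. The sketch below uses the article's figures (a ~39 ns packet interval and a 65-75 ns DRAM random cycle time) and adds one assumption of its own: that accesses could be spread perfectly evenly across independently cycling memory banks. That idealization gives only a lower bound on the parallelism needed. It is not a description of how any particular network processor closes the gap, and the function names are hypothetical.

```python
import math

PACKET_INTERVAL_NS = 39.0         # from the article: 40-byte packets at OC-192c
ACCESSES_PER_PACKET = 2           # from the article: one write plus one read
DRAM_CYCLE_RANGE_NS = (65.0, 75.0)  # from the article: random read/write cycle time


def required_access_interval_ns(packet_interval_ns, accesses_per_packet):
    """Longest time the memory may take per access and still keep up."""
    return packet_interval_ns / accesses_per_packet


def min_ideal_banks(random_cycle_ns, packet_interval_ns, accesses_per_packet):
    """Lower bound on independent banks, assuming perfectly even spreading.

    Assumption for illustration: random traffic can hit the same bank
    repeatedly, so a real design needs more than this bound suggests.
    """
    return math.ceil(random_cycle_ns * accesses_per_packet / packet_interval_ns)


budget_ns = required_access_interval_ns(PACKET_INTERVAL_NS, ACCESSES_PER_PACKET)
banks = [min_ideal_banks(c, PACKET_INTERVAL_NS, ACCESSES_PER_PACKET)
         for c in DRAM_CYCLE_RANGE_NS]

print(f"Access budget: {budget_ns} ns per access")
print(f"Idealized minimum banks at 65 ns and 75 ns cycle times: {banks}")
```

With one access allowed every 19.5 ns against a 65-75 ns cycle time, a single DRAM bank falls short by a factor of roughly 3.3 to 3.8, so at least four ideally interleaved banks would be needed under this simplified model.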
This scenario also illustrates that, for a given line speed, it is easier to support channelized configurations, such as 4 x OC-48c, than concatenated configurations, such as 1 x OC-192c. For example, in a 4 x OC-48c configuration, while packets might arrive every ~39 ns in aggregate, packets from any single one of those OC-48c interfaces will arrive at only one-fourth of that rate. Packets arriving from separate interfaces can be assigned to separate processing and buffer memory resources without packet reordering becoming a problem. Likewise, these separate processing resources will probably not need to extensively share state information when processing separate streams of traffic.
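A short calculation shows why the channelized case is more forgiving. The sketch below takes the article's ~39 ns aggregate interval and 65-75 ns DRAM random cycle time. It assumes, for illustration only, that each interface gets its own buffer memory and that every packet costs one write and one read. The function name is hypothetical.

```python
AGGREGATE_INTERVAL_NS = 39.0   # from the article: back-to-back 40-byte packets
ACCESSES_PER_PACKET = 2        # assumed: one buffer write plus one buffer read
WORST_DRAM_CYCLE_NS = 75.0     # from the article: upper end of the 65-75 ns range


def per_memory_access_budget_ns(aggregate_interval_ns, interfaces, accesses_per_packet):
    """Time available per access when each interface has its own buffer memory."""
    per_interface_interval_ns = aggregate_interval_ns * interfaces
    return per_interface_interval_ns / accesses_per_packet


concatenated_ns = per_memory_access_budget_ns(AGGREGATE_INTERVAL_NS, 1, ACCESSES_PER_PACKET)
channelized_ns = per_memory_access_budget_ns(AGGREGATE_INTERVAL_NS, 4, ACCESSES_PER_PACKET)

for name, budget in (("1 x OC-192c", concatenated_ns), ("4 x OC-48c", channelized_ns)):
    verdict = "keeps up" if budget >= WORST_DRAM_CYCLE_NS else "falls short"
    print(f"{name}: {budget} ns per access, a {WORST_DRAM_CYCLE_NS} ns DRAM {verdict}")
```

Under these assumptions each OC-48c interface leaves about 78 ns per access, just enough for a DRAM with a 75 ns random cycle time, while the concatenated OC-192c stream leaves only 19.5 ns.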
How multicast packet processing is implemented can also significantly affect functionality and performance. Because of the large number of memory I/O pins needed in OC-192c and above configurations, most network processor suppliers have placed classification and traffic management in separate chips. However, opinions differ as to where the packet modification function should be placed. Some network processor architectures place it with the classifier chip; others with the traffic-management chip. Placing the packet modification function in the traffic manager offers some important scalability and functionality advantages.
These include the following:
Given these issues, even if a particular generation of a network processor can handle wire-speed traffic, it is important to ask questions such as:
What kinds of traffic workloads are supported? Can the network processor support full line rate with any packet size with concatenated interfaces (including back-to-back 40-byte packets)? How is the random cycle time problem solved (for 10-Gbit and higher line speeds) to allow this in the face of random packet arrivals?
An example of these scalability principles at work is Agere Systems' PayloadPlus family of software-compatible network processor solutions (2.5-Gbps and 10-Gbps versions have been announced, with higher-performance versions in development). PayloadPlus network processors are programmed in high-level application-oriented languages. The resulting software is focused almost entirely on performing the application and does not contain underlying details of the microarchitecture.
These languages are:
PayloadPlus network processors provide full carrier-class packet processing functionality, including classification, policing, statistics, queuing, scheduling/shaping, buffer management, data modification and fabric/framer interfacing.



