Flexible NPU achieves wire speed
Pradeep Shenoy, Director, Terapacket Product Marketing, Teradiant Networks Inc., San Jose, Calif.
11/1/2002 1:17 PM EST
In the evolving marketplace for network-processing semiconductors, the conflicting demands that designers must reconcile have never been more evident. Networking-equipment OEMs are looking for superior performance, flexibility and scalability, all delivered with favorable power consumption, cost and board real estate.
In addition, time-to-deployment pressure looms as OEMs strive to achieve the next quantum advance in router and switch technology. The OEMs' service provider customers define success in terms of reduced capital and operating expenditures in their drive to integrate diverse packet and cell networks.
In addressing the speed-vs.-flexibility question in the network processor design debate, our design engineers set out to achieve wire-speed performance in a network-processing engine without sacrificing the flexibility that is the hallmark of programmable network processors.
Programmable NPU architectures do provide the flexibility needed to meet the exact requirements of the target application, but they generally cannot guarantee deterministic wire-speed performance once the chip's advanced functions are enabled. Consequently, a router or switch designer may need to deploy multiple chips and multiple line cards to achieve wire-speed performance.
Our designers took a different approach. They decided to develop a fully configurable, pipelined architecture that would deliver guaranteed deterministic wire-speed performance across all packet sizes and quality-of-service conditions. This means that a 40-Gbit/s chip, for example, would deliver 40-Gbit/s throughput under any network traffic conditions and with all chip functions enabled. We think the configurable architecture we developed strikes a fine balance between the programmability of a traditional network processor and the performance of an ASIC.
While most of the standard protocol operations are hardwired, our NPU chip set maintains flexibility in each of its individual features via register bits. In addition, rich programmability is available for functions that require extremely high degrees of flexibility. For example, the packet engine has multiple lookup keys that are fully programmable and can pick arbitrary bits from the packet headers.
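To make the idea of a lookup key built from arbitrary header bits concrete, here is a minimal software sketch. It is only an illustration, not the chip's actual mechanism: the function names, the (bit offset, bit length) field format and the sample IPv4 header are all assumptions made for this example.

```python
def extract_bits(header: bytes, bit_offset: int, bit_length: int) -> int:
    """Return bit_length bits starting bit_offset bits into header (MSB first)."""
    total_bits = len(header) * 8
    if bit_offset < 0 or bit_length <= 0 or bit_offset + bit_length > total_bits:
        raise ValueError("bit field lies outside the header")
    as_int = int.from_bytes(header, "big")
    shift = total_bits - bit_offset - bit_length
    return (as_int >> shift) & ((1 << bit_length) - 1)


def build_lookup_key(header: bytes, fields) -> int:
    """Concatenate the selected bit fields, in order, into one integer key.

    fields is a list of (bit_offset, bit_length) pairs; this format is a
    hypothetical stand-in for whatever the real key configuration looks like.
    """
    key = 0
    for bit_offset, bit_length in fields:
        key = (key << bit_length) | extract_bits(header, bit_offset, bit_length)
    return key


# Sample 20-byte IPv4 header (made up): protocol 6 (TCP), destination 10.0.0.2.
HEADER = bytes.fromhex("450000280000400040060000c0a800010a000002")

# Key = protocol field (byte 9) followed by destination address (bytes 16-19).
KEY_FIELDS = [(72, 8), (128, 32)]
key = build_lookup_key(HEADER, KEY_FIELDS)
print(hex(key))  # 0x60a000002
```

Changing the key then means changing the list of fields rather than writing new code, which is the kind of flexibility the paragraph above describes.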
To ensure deterministic performance, we designed each stage of the pipeline to handle worst-case scenarios, thus ensuring that Internet Protocol packets of a minimum of 40 bytes could be handled at wire speed.
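A quick back-of-the-envelope calculation shows what the 40-byte worst case implies for each pipeline stage. The sketch below assumes back-to-back minimum-size packets and ignores link-layer framing overhead, so the figures are approximate; the function names are ours.

```python
def packets_per_second(line_rate_bps: float, packet_bytes: int) -> float:
    """Packet rate when the line carries back-to-back packets of one size."""
    return line_rate_bps / (packet_bytes * 8)


def time_budget_ns(line_rate_bps: float, packet_bytes: int) -> float:
    """Time available per packet, in nanoseconds, at that packet rate."""
    return 1e9 / packets_per_second(line_rate_bps, packet_bytes)


# Worst case from the text: 40-byte IP packets on a 40-Gbit/s line.
print(packets_per_second(40e9, 40))  # 125000000.0 packets per second
print(time_budget_ns(40e9, 40))      # 8.0 ns per packet
```

Under these assumptions, every pipeline stage has to accept a new packet roughly every 8 ns to keep up with the line.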
We also added sufficient headroom to account for internal inefficiencies. Memory throughput is a key issue in sustaining performance under worst-case scenarios in network processors. For example, a 40-Gbit/s simplex chip needs memory throughput of 80 Gbits/s to support concurrent reads and writes, since each packet is both written to and read from memory. Factor in internal efficiency losses, and the maximum throughput requirement approaches 160 Gbits/s. Through a combination of careful design and new techniques for bank conflict avoidance and block placement, we achieved a high memory throughput with a low pin count.
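The memory arithmetic can be sketched in a few lines. The text gives the 80-Gbit/s and 160-Gbit/s figures but not the efficiency behind them; the 50 percent value used below is our assumption, chosen only because it reproduces the 160-Gbit/s number.

```python
def required_memory_throughput_bps(line_rate_bps: float, efficiency: float) -> float:
    """Peak memory bandwidth needed to sustain a simplex line rate.

    Each packet is written to memory once and read back once, so the raw
    requirement is twice the line rate. Dividing by the fraction of peak
    bandwidth that is actually usable gives the peak requirement.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    raw_bps = 2 * line_rate_bps  # one write plus one read per packet
    return raw_bps / efficiency


print(required_memory_throughput_bps(40e9, 1.0) / 1e9)  # 80.0 Gbit/s, ideal case
print(required_memory_throughput_bps(40e9, 0.5) / 1e9)  # 160.0 Gbit/s, assumed 50% efficiency
```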
Deployment issues
Time-to-deployment is another troublesome issue with regard to programmable network processors. Because these chips can incorporate 40 or more microprocessor cores, microcoding can add months of development time to a project, and synchronization of the cores can be extremely challenging.
Our team eliminated the need for microcoding altogether in our pipelined architecture by incorporating a rich set of configurable parameters into tables and registers on the chips. Using these options, the designer selects values for each configurable parameter, according to the requirements of the application.
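As a rough illustration of configuring by register value rather than by microcode, the sketch below sets individual fields in a register word. The register layout, field positions and field names are entirely hypothetical; they are not taken from the chip set's documentation.

```python
def set_field(reg_value: int, lsb: int, width: int, field_value: int) -> int:
    """Return reg_value with the width-bit field at bit position lsb replaced."""
    if field_value < 0 or field_value >> width:
        raise ValueError("field value does not fit in the field")
    mask = ((1 << width) - 1) << lsb
    return (reg_value & ~mask) | (field_value << lsb)


# Hypothetical layout, for illustration only:
#   bit 0     : enable a protocol feature
#   bits 4..6 : select one of eight queueing profiles
FEATURE_ENABLE = (0, 1)   # (lsb, width)
QUEUE_PROFILE = (4, 3)

reg = 0
reg = set_field(reg, *FEATURE_ENABLE, 1)
reg = set_field(reg, *QUEUE_PROFILE, 0b010)
print(hex(reg))  # 0x21
```

The point of the sketch is that the designer's work reduces to choosing values for a fixed set of parameters; no instruction stream has to be written, debugged or synchronized across cores.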
By providing these configuration options, the design team addressed its goal of enabling system designers to achieve superior system flexibility without an upfront investment in programming, debugging and verification, and without the need for continued software maintenance.
Developers can use any third-party vendor's stack or the OEM's own stack, and further flexibility is available through high-level application programming interfaces and drivers that the team incorporated into the design.
Our design team also set out to address the three factors that together have the greatest impact on the capital costs of equipment OEMs, and both the capital and operating costs of service providers: power consumption, cost and real estate.
A "metric of goodness" can be used to calculate and express the value of a network processor for a specified level of performance. This metric is defined as: (power) x (cost) x (real estate) / gigabit, or PCR/G. The formula becomes a tool for providing objective assessment of different network processor-based solutions.
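The metric is simple enough to compute directly. The sketch below is one possible way to do so; the units (watts, dollars, square centimeters) and the sample values are placeholders invented for the example, not measurements of any real product. Because power, cost and real estate are all quantities to be minimized, a lower PCR/G is presumably better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """One candidate network-processor solution (all values hypothetical)."""
    name: str
    power_w: float          # power consumption, watts (assumed unit)
    cost_usd: float         # cost, dollars (assumed unit)
    area_cm2: float         # board real estate, cm^2 (assumed unit)
    throughput_gbps: float  # delivered throughput, Gbit/s

    def pcr_per_gigabit(self) -> float:
        """(power) x (cost) x (real estate) / gigabit, as defined in the text."""
        if self.throughput_gbps <= 0:
            raise ValueError("throughput must be positive")
        return self.power_w * self.cost_usd * self.area_cm2 / self.throughput_gbps


# Placeholder numbers, chosen only to exercise the formula.
candidates = [
    Solution("design A", power_w=10, cost_usd=100, area_cm2=5, throughput_gbps=40),
    Solution("design B", power_w=8, cost_usd=60, area_cm2=4, throughput_gbps=10),
]
for s in sorted(candidates, key=Solution.pcr_per_gigabit):
    print(s.name, s.pcr_per_gigabit())
```

Comparisons are only meaningful when every candidate is measured in the same units and at the same feature level.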
To be competitive, Teradiant needed to make it possible to build four OC-192 lines into a single card. Conventionally, each 10-Gbit/s line card has its own memory for the forwarding table. With Teradiant's design, up to four 10-Gbit/s ports can be built into a single line card and share a common forwarding table. This sharing reduces cost and lowers power consumption. To further reduce cost, the design team incorporated into the chip set an integral TCAM for multiple classification lookups and an innovative SRAM-based scheme for longest-prefix lookup.


See related chart
