
Network programming can be tamed

Robin Melnick and Eric Cowden

8/18/2003 9:47 AM EDT

Network processing was, until recently, the domain of ASICs. Software-programmable network processors (NPUs) now give designers an alternative that offers lower development costs, faster time-to-market and the future-proofing flexibility of programmability. However, ever-increasing "wire-speed" performance and scalability requirements in these environments usually call for multiple-processor architectures, which can significantly increase programming complexity.

Encapsulated networking protocols typically demand a sequential programming approach, and sequential programming within a multiple-processor environment can involve challenges such as complex algorithm partitioning and load balancing.

An ideal solution would place these challenges on the programming environment, not the programmer. Achieving this goal requires the blending of a number of key factors.

First, such a solution should let designers use multiple processor cores for performance while keeping the more straightforward programming model of a single CPU: a single-stage, single-image programming model. In essence, the programmer sees this kind of multiple-processor device as if it were a single, sequentially programmed entity, even though many packets or cells are simultaneously "in flight" across multiple tasks and multiple cores. Next, the underlying operating system or kernel software itself should handle all of the required interprocessor coordination.

Finally, common networking functions that require a large number of instructions to implement (such as traffic management, queuing/scheduling, packet transformations, classification, search/lookup and statistics collection) should be offloaded to specialized, on-chip coprocessor engines. At the same time, the software architecture should provide simple, single-instruction access to each of these functions via an application programming interface. Taken together, these characteristics let designers embed high-performance network-processing capabilities while simplifying the code and minimizing the number of lines to be written.

Reducing lines of code

To deliver maximum performance while reducing program size and complexity, an underlying hardware architecture can divide packet processing between RISC-based NPU cores and special-function, on-chip coprocessors. From the programmer's perspective, each coprocessor implements a single-instruction operation: program tasks post requests to, and receive data from, the appropriate coprocessing elements. This combination creates a "network-optimized instruction-set computing," or NISC, device. Pairing RISC cores with coprocessing engines lets programmers offload common yet complex tasks while preserving software-programmable flexibility in the structure and flow of packet processing.
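The division of labor between core software and coprocessors can be illustrated with a minimal sketch. Everything named here (the Coprocessors class, its lookup, count and enqueue methods, and the forward function) is hypothetical and stands in for whatever single-instruction interface a real NPU exposes; the point is only that the program keeps control of the packet flow while each heavy operation is one call.

```python
from dataclasses import dataclass, field


@dataclass
class Coprocessors:
    """Hypothetical stand-ins for on-chip engines (not a real NPU API).

    Each method models one single-instruction request to a coprocessor.
    """
    routes: dict = field(default_factory=dict)    # search/lookup engine table
    counters: dict = field(default_factory=dict)  # statistics engine
    queues: dict = field(default_factory=dict)    # traffic-manager queues

    def lookup(self, dst):
        return self.routes.get(dst)

    def count(self, name):
        self.counters[name] = self.counters.get(name, 0) + 1

    def enqueue(self, port, packet):
        self.queues.setdefault(port, []).append(packet)


def forward(packet, cp):
    """Software decides the flow; each heavy step is a single coprocessor call."""
    port = cp.lookup(packet["dst"])
    if port is None:
        cp.count("dropped")
        return None
    cp.count("forwarded")
    cp.enqueue(port, packet)
    return port
```

Changing the order of operations, or adding a new branch for a new protocol, touches only the few lines in forward, which is the flexibility argument made above.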

A NISC architecture yields several benefits: it balances performance and flexibility; it can deliver comparable application performance with smaller, lower-power hardware; and, most importantly, it offers simpler software with fewer lines of code.

In contrast to NISC, less-flexible NPU approaches achieve performance by partially hardwiring the flow order of certain functions, leaving them potentially incapable of deploying unforeseen algorithm designs or future protocols. NISC also contrasts with architectures that achieve performance by ganging together a larger number of more general-purpose RISC cores. The latter offer flexibility but typically increase complexity by requiring many more functions to be implemented purely in software. Comparisons have shown that such architectures require an order of magnitude more lines of code to be written, debugged and performance-tuned to implement similar applications.

A 'single' model

In the approach described here, a multiple-core architecture can be deployed using a simple, single-stage programming model, in which a single core handles a given packet or cell in its entirety. Performance scales in parallel, with multiple tasks on each of several cores managing many packets or cells simultaneously. This is the logical inverse of multistage models, in which each core executes one stage of the processing algorithm.

In a multistage model, any given cell or packet must serially pass through multiple cores as it is processed.

In the parallel, single-stage model, an arriving packet or cell is automatically assigned to an available (idle state) processing task and is then processed in its entirety, or "runs to completion," on this single task on a single core. It's possible to create the entire data-flow algorithm as a single, complete program, just as it would be on a single-processor CPU, but with the performance-scaling advantage of using multiple cores.

This approach is also "single-image" in that each task on each core executes the same code. Once again, engineers can approach the algorithm design in the same straightforward manner as for a single-processor CPU. Different frames may, of course, exercise different branches of the code, but programming is greatly simplified, regardless of the number of tasks and cores, because there is no code to be written or processor cycles spent on handoffs from task to task or core to core as a packet or cell is processed. More importantly, the programmer does not need to subdivide an algorithm into multiple serial stages, as is required in some NPU models, nor is any developer time spent load balancing multiple stages or tuning the performance of multiple stages to avoid one stage overrunning another, possibly creating system bottlenecks.
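The single-stage, single-image model can be sketched in a few lines. This is an illustration only: a thread pool stands in for the NPU's tasks and cores, and the names process_packet and run, as well as the packet fields, are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_packet(packet):
    """The single program image: every task runs this same function.

    A packet is handled start to finish here ("run to completion"), with
    no handoff to another task or core. Different packets may take
    different branches.
    """
    if packet["ttl"] <= 1:
        return {"id": packet["id"], "action": "drop"}
    return {"id": packet["id"], "action": "forward", "ttl": packet["ttl"] - 1}


def run(packets, num_tasks=4):
    # The pool assigns each arriving packet to an idle worker. Adding
    # workers changes throughput, not the program.
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        return list(pool.map(process_packet, packets))
```

Calling run with num_tasks=1 or num_tasks=8 produces the same results from the same code, which is the property that lets the algorithm be written as if for a single-processor CPU.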

This parallel approach also provides smoother performance scalability over the course of multiple generations. Because the same software runs on all tasks on all processor cores, there is no need to re-subdivide or load-balance algorithms again as more tasks and cores are added over time to increase line rates or application performance. An additional benefit of the single-stage, parallel-execution model is that its performance is not sensitive to latency or bottlenecks in the way a multistage pipeline's is. Even in decision-rich network-processing environments, case statements and conditional jumps do not produce pipeline breakages. Regardless of the elapsed time needed to process a particular packet, no other cores sit idle awaiting completion of a prior stage.
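A back-of-the-envelope model makes the bottleneck argument concrete. It is a deliberately idealized sketch: it assumes one core per pipeline stage, ignores memory and coprocessor contention, and uses made-up stage times in arbitrary units. The function names are illustrative.

```python
def pipeline_throughput(stage_times):
    """Packets per time unit for a multistage pipeline, one core per stage.

    In steady state the slowest stage sets the pace for every other core.
    """
    return 1.0 / max(stage_times)


def parallel_throughput(stage_times, num_cores):
    """Packets per time unit when each core runs the whole algorithm.

    Each packet costs the sum of all stage times on one core, and the
    cores work on different packets independently.
    """
    return num_cores / sum(stage_times)
```

With four perfectly balanced stages of 2 units each, both models yield 0.5 packets per unit on four cores. If the same total work is split unevenly as 1, 1, 1 and 5 units, the pipeline drops to 0.2 packets per unit while the parallel model stays at 0.5, because no core waits on another stage.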

Unifying kernel

The "kernel" software consists of NPU-internal operating system layers that give programmers a unified mechanism for managing all resources within the NPU, while also providing the control-plane CPU with API visibility into NPU operations.

Because the kernel provides all elements of common application infrastructure, it greatly reduces the number of new lines of code that system developers must create. The kernel itself comprises more than 80 percent of the NPU-resident portion of code needed for most applications.

The embedded NPU kernel also offers extensive macro libraries, providing thousands of lines of preoptimized code that programmers would otherwise have to write from scratch. A programmer simply calls a macro with the appropriate syntax and parameters, and the software build tools expand the macro to create all the required run-time code structures.
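The macro mechanism can be pictured as simple template expansion at build time. The sketch below is hypothetical throughout: the macro name COUNT_AND_FORWARD, its parameters and the generated lines are invented for illustration and do not reflect any actual macro library.

```python
from string import Template

# Hypothetical macro library: one macro name maps to a template of
# preoptimized run-time code (shown here as plain text lines).
MACROS = {
    "COUNT_AND_FORWARD": Template(
        "stat_increment($counter)\n"
        "port = lookup($table, dst)\n"
        "enqueue(port, packet)"
    ),
}


def expand(name, **params):
    """Expand one macro call into the lines of code it stands for.

    A missing parameter raises KeyError, much as a build tool would
    reject a macro call with the wrong arguments.
    """
    return MACROS[name].substitute(**params).splitlines()
```

A single call such as expand("COUNT_AND_FORWARD", counter="rx_ipv4", table="routes") yields three lines of generated code, so the programmer writes one line and the build produces the rest.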

In addition, because embedded communications applications operate in a networked environment, the software should provide multilayer, communications-aware debugging facilities that enable remote diagnosis and repair of problems in the field. The kernel provides built-in facilities to support applets (downloadable executables) that can be used for diagnostic work, system updates and so on. For example, the embedded control CPU could implement protocol-layer, over-the-wire access to these facilities, which could be used for flushing or resetting the system, updating route tables or in-the-field debugging. Because the applet functions as an "intelligent interrupt" that carries with it all of the required executable code, these periodic or ad hoc routines needn't remain resident in the NPU, conserving program memory for ongoing run-time operations.
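The applet idea can be sketched as a routine that arrives with its own code, runs once and leaves nothing behind in program memory. The Npu class, its fields and the update_routes_applet helper are hypothetical names chosen for this example.

```python
class Npu:
    """Hypothetical model of an NPU's state (not a real device interface)."""

    def __init__(self):
        self.route_table = {}
        self.resident = {"forward"}  # names of code kept in program memory

    def run_applet(self, applet):
        """Execute a downloaded routine once; it is never made resident."""
        return applet(self)


def update_routes_applet(new_routes):
    """Build an applet that carries its own code and data with it."""
    def applet(npu):
        npu.route_table.update(new_routes)
        return len(npu.route_table)
    return applet
```

After npu.run_applet(update_routes_applet({"10.0.0.0/8": 1})), the route table has changed but npu.resident is untouched, mirroring the memory-saving point made above.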

Putting it all together

Embedding network processor functionality for high-performance systems is not trivial. It requires a carefully structured blend of programming model, facilities and a unifying software infrastructure that enables both performance and flexibility. It must support fast and easy code development, with a high degree of code reuse from one generation to the next.

By combining the application-creation and debug facilities of a NISC architecture with a development-simplifying programming model for parallel multiprocessors and an API-based embedded kernel to speed application development, industry-leading NPU architectures give system designers the best of both worlds: high performance, specialized embedded functionality and the ability to efficiently access it.

Robin Melnick is director of software product management and Eric Cowden is a software marketing engineer at Applied Micro Circuits Corp. (Sunnyvale, Calif.).

See related chart