News & Analysis
Simplicity reigns in NPU programming
Arnon Mordoh, Vice President, Software Tools, Wintegra Inc., Austin, Texas
11/1/2002 1:26 PM EST
Frequent changes to communication standards and the growing demand for quality-of-service support have created the need for programmable networking devices that can replace traditional field-programmable gate arrays in implementing high bit-rate data-path networking functionality.
However, in order to reach the performance levels required to support these high bit rates, most of these networking devices (also known as network processors) are built using some sort of a multiprocessing architecture that operates on multiple data units in parallel.
Trying to map standard communication protocols and complex classification algorithms onto the new multiprocessing network processing architectures requires implementing interprocess communication and resource management functions at a level of complexity found only in advanced distributed computing applications.
In addition, the lack of mature software development tools for network processing architectures produces long development cycles, difficult software and hardware integration phases, and software that is neither portable nor easy to maintain for future system generations.
The networking processor architecture has a major effect on the ability to easily develop and integrate a system using a programmable networking platform.
Therefore, we knew when we started designing our architecture that we needed to take a new approach in order to avoid the distributed application development complexity. The goal was to define an architecture that comprises multiple networking processing engines, in order to satisfy the high bit-rate data-path processing requirements, but at the same time maintains the programming model of a single multithreaded processor.
Our solution turned out to be a design based on the concepts of a balanced symmetric multiprocessing multithreaded architecture (BSMM). A BSMM architecture combines a striping architecture, in which each processing engine handles a data unit from start to finish, with a pipelined architecture, in which processing of a data unit is divided into smaller tasks each executing on a different processing engine.
In a BSMM architecture, a number of processing engines can operate on an arbitrary number of threads executing in parallel in the system. Each thread is assigned a thread context (TC) memory scratch pad that holds all the local thread parameters and data.
When a processing engine becomes available, a "ready-to-execute" thread is selected and it starts execution. The selection of a thread to execute is done by hardware based on the dynamic list of threads that require service at that point in time.
Each thread can execute on any processing engine in the system. Once a thread requires data from an external resource (like external memory or external hardware accelerator), it is scheduled out of the processing engine into a pending queue and a different thread starts execution (context switch).
When the external resource data becomes available, the thread becomes ready-to-execute again and will be picked by the next processing engine that becomes free. In order for this scheme to work efficiently, the context switch process (the process of terminating the work on one thread and starting to work on a different thread) should be very efficient, with little or no overhead.
This scheme provides many advantages. First, it maintains a single processor programming model. Software developed to execute on top of this architecture need not be aware that multiple processing engines comprise the architecture. Since the thread distribution is done dynamically, at run-time, by hardware, the same code will execute on machines with one, two, four or any arbitrary number of processing engines.
Second, it provides an automatic even distribution of the system load across the different processing engines. There is no need to manually level the system load at the system integration phase.
Third, this approach imposes no dependency between the total number of threads that can execute simultaneously in the system and the number of processing engines. The total number of processing engines and threads is determined by the application performance requirements.
Finally, such a symmetrical multiprocessing scheme allows easy scalability to different performance levels by simply adding more processing engines or more thread context scratchpad memories, while still maintaining code compatibility.
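The scheduling scheme described above can be modeled in a few lines of C. The sketch below is a software toy, not Wintegra's hardware scheduler: the names (`simulate`, `thread_ctx`, `sim_result`), the cycle-based timing, the lowest-index selection rule and the zero-cost context switch are all assumptions made for illustration. Each thread alternates between a compute burst and a wait on an external resource, and any idle engine picks up any ready thread.

```c
#include <assert.h>

#define MAX_THREADS 64
#define MAX_ENGINES 16

enum tc_state { TC_READY, TC_RUNNING, TC_PENDING, TC_DONE };

/* One thread context (TC): the per-thread state kept in scratch-pad memory. */
struct thread_ctx {
    enum tc_state state;
    int bursts_left;   /* compute bursts still to run */
    int burst_cycles;  /* cycles left in the current burst */
    int wait_cycles;   /* cycles left until external data arrives */
};

struct sim_result {
    int cycles;        /* cycles until every thread has finished */
    int work_cycles;   /* total compute cycles executed by all engines */
};

/* Hypothetical model: n_threads identical threads, each running `bursts`
 * compute bursts of `burst_len` cycles, separated by external accesses
 * that take `latency` cycles. */
static struct sim_result simulate(int n_engines, int n_threads,
                                  int bursts, int burst_len, int latency)
{
    struct thread_ctx tc[MAX_THREADS];
    int engine[MAX_ENGINES];   /* thread index on each engine, -1 if idle */
    struct sim_result r = { 0, 0 };
    int done = 0;
    int t, e;

    assert(n_engines > 0 && n_engines <= MAX_ENGINES);
    assert(n_threads > 0 && n_threads <= MAX_THREADS);
    assert(bursts > 0 && burst_len > 0 && latency >= 0);

    for (t = 0; t < n_threads; t++) {
        tc[t].state = TC_READY;
        tc[t].bursts_left = bursts;
        tc[t].burst_cycles = burst_len;
        tc[t].wait_cycles = 0;
    }
    for (e = 0; e < n_engines; e++)
        engine[e] = -1;

    while (done < n_threads) {
        /* 1. External data arrives: pending threads become ready again. */
        for (t = 0; t < n_threads; t++) {
            if (tc[t].state != TC_PENDING)
                continue;
            if (tc[t].wait_cycles > 0)
                tc[t].wait_cycles--;
            else
                tc[t].state = TC_READY;
        }
        /* 2. Each idle engine takes the first ready thread it finds. */
        for (e = 0; e < n_engines; e++) {
            if (engine[e] != -1)
                continue;
            for (t = 0; t < n_threads; t++) {
                if (tc[t].state == TC_READY) {
                    tc[t].state = TC_RUNNING;
                    engine[e] = t;
                    break;
                }
            }
        }
        /* 3. Busy engines run one cycle; a finished burst frees the engine. */
        for (e = 0; e < n_engines; e++) {
            t = engine[e];
            if (t < 0)
                continue;
            r.work_cycles++;
            if (--tc[t].burst_cycles > 0)
                continue;
            engine[e] = -1;                /* context switch, modeled as free */
            if (--tc[t].bursts_left == 0) {
                tc[t].state = TC_DONE;
                done++;
            } else {
                tc[t].state = TC_PENDING;  /* waiting on external resource */
                tc[t].wait_cycles = latency;
                tc[t].burst_cycles = burst_len;
            }
        }
        r.cycles++;
    }
    return r;
}
```

Calling `simulate(1, 8, 3, 4, 10)` and `simulate(4, 8, 3, 4, 10)` runs the same thread workload on one and on four engines. In this model the total work performed is identical and only the elapsed cycle count changes, which mirrors the claim that the code need not know how many engines exist.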
Multithread complexity
The Wintegra team also addressed how a programmer would manage the complexities of a multithreaded architecture, since such a system is able to process many threads simultaneously. In such an environment, however, processing of certain data units may require a certain order to achieve the desired result (like transmitting an ATM frame with the cells in the original order).
Data units may also be processed on different processing engines, and the processing may be finished on a particular data unit in any order that is dictated by restrictions, the processing load or different data-unit processing flows. These factors call for the use of a mechanism to control and limit processing and resource access that can ensure that packets will be processed in the proper order.
In order to simplify application development, ordering and coherency hardware mechanisms have been built into the processing engines ensuring that packet processing takes place correctly. Solving this programming complexity problem also required the development of a rich, mature set of software development tools at the heart of which lies a specialized data-path language that simplifies code development in such an environment.
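The article does not describe how the ordering hardware works internally. As a rough software analogy only, the sketch below shows one common way to restore order when units finish out of sequence: each data unit carries a sequence number, and a completed unit is released only once every earlier unit has also completed. The names (`reorder_window`, `rw_init`, `rw_complete`) and the window size are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 8  /* hypothetical limit on data units in flight */

struct reorder_window {
    bool     done[WINDOW];  /* completion flag per in-flight slot */
    uint32_t next_release;  /* sequence number that must leave next */
};

static void rw_init(struct reorder_window *w)
{
    for (int i = 0; i < WINDOW; i++)
        w->done[i] = false;
    w->next_release = 0;
}

/* Mark unit `seq` as finished, then release every unit that is now in
 * order. Released sequence numbers are written to out[] (which must hold
 * WINDOW entries); the return value is how many were released. */
static int rw_complete(struct reorder_window *w, uint32_t seq, uint32_t *out)
{
    int n = 0;

    /* The unit must be in flight: not yet released, not beyond the window. */
    assert(seq - w->next_release < WINDOW);
    assert(!w->done[seq % WINDOW]);

    w->done[seq % WINDOW] = true;
    while (w->done[w->next_release % WINDOW]) {
        w->done[w->next_release % WINDOW] = false;
        out[n++] = w->next_release++;
    }
    return n;
}
```

If units 2 and 1 finish before unit 0, the first two calls release nothing, and the call for unit 0 releases 0, 1 and 2 together in their original order.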
Developing sophisticated networking algorithms (for example, packet-classification processing) requires a rich programming language that is able to support complex data structures like rule definition and data parsing. In addition, it should be fluid enough to support the rapid changes of communication standards and operator requirements. At the same time, adherence to embedded processing development de facto standards, like C programming, is desired.
To meet these requirements the Data Path Language (DPL) was defined. DPL is a subset of C and follows the standard C preprocessor and language syntax. We defined DPL to support standard high-level programming for our processing engine architecture, by removing standard C features that do not map efficiently onto the architecture (for example, floating-point support). The C subset selected as DPL provides an excellent blend of high-level programming, with all its benefits, and good overall system performance.
Since the processing engines in our symmetric multiprocessing NPU design are optimized for networking applications, a few processing primitives that the architecture supports cannot be mapped to standard C semantic operations.
For example, an operation that finds the first bit set in a 32-bit register is supported as a single-cycle operation. However, supporting such primitives in the DPL compiler is essential to obtaining good system performance.
In order to add these operations to the DPL semantics, and yet maintain the C standard syntax rules, the concept of intrinsic functions was introduced. This concept, gaining popularity in C compilers for digital signal processors as well, allows extending the standard C language by additional operators that are expressed in the language as function calls.
For example, the compiler recognizes the expression Val=_ffs(key) as a special operation that performs the find first bit set operation. In this case, the compiler does not generate code that implements a function call, but rather inserts the special primitive that implements this functionality.
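To make the intrinsic concrete, here is what a plain C compiler would need in order to provide the same operation without hardware support. This is a sketch, not DPL source: the function name `ffs32_portable` is hypothetical, and the bit-numbering convention (1-based position of the least significant set bit, 0 when no bit is set, as in POSIX `ffs()`) is an assumption, since the article does not say how `_ffs` numbers bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-in for a find-first-set intrinsic.
 * Returns the 1-based position of the least significant set bit,
 * or 0 when key is zero (assumed convention, borrowed from POSIX ffs()). */
static int ffs32_portable(uint32_t key)
{
    int pos = 1;

    if (key == 0)
        return 0;
    while ((key & 1u) == 0) {  /* shift until the lowest set bit reaches bit 0 */
        key >>= 1;
        pos++;
    }
    return pos;
}

/* Small self-check that documents the assumed convention. */
static void ffs32_self_check(void)
{
    assert(ffs32_portable(0x00000000u) == 0);
    assert(ffs32_portable(0x00000001u) == 1);
    assert(ffs32_portable(0x00000008u) == 4);
    assert(ffs32_portable(0x80000000u) == 32);
}
```

The loop can take up to 31 iterations, where the article says the hardware primitive takes a single cycle. That gap is the reason for exposing the primitive as an intrinsic rather than leaving it to a library routine.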
All data-path code is written in DPL, not assembly code, which provides substantial time-to-market advantages. One indicator of the power and flexibility of this language is that we had more than 18 protocols available at first product announcement and numerous additions are under way.
An architecture does not survive on the basis of an appropriate programming language alone. Simulation and debugging tools are also necessary, because debugging and integration phases are crucial components in the overall system-development schedules. Robust, mature tools that include fast and accurate simulation and easy-to-use hardware debugging tools can simplify the system-integration effort and produce a faster system time-to-market.
By using the most recent simulator tool technology available, our data-path developers were able to write most of the first available software protocols long before the actual hardware was available. This tool simulates all the architecture blocks and events and is used for protocol verification, plus system and functional modeling. It supports a rich scripting and command language with both a command line and GUI interface. Source level DPL debugging is supported. High-level system modeling is facilitated with the use of host code running on workstations. Input and output traffic flows can be managed using files.


See related chart
