Design Article
Data flow architecture must match the network to the application
Gary Lidington, Vice President, Marketing, Xelerated, Inc., Burlington, Mass.
5/9/2003 11:09 AM EDT
Network processor design represents a complex trade-off between cost, performance and flexibility, and achieving all three simultaneously has been an elusive goal. Until recently, the design alternatives for the processing elements have all been based on derivatives of the venerable RISC processor. The design space has centered on the trade-offs between processor interconnect topologies (parallel, pipelined and matrix) and the degree of hardwired support for specialized functions (either via specialized instructions or specialized coprocessors).
The resulting solutions span a wide range, from high-performance, cost-effective hardwired designs that are very inflexible to very flexible designs that are either very costly or low in performance. The problem is that arguably none has achieved high performance, low cost and a high degree of flexibility together. However, by shifting to a more data flow oriented design approach that trades off some breadth of application support, it is possible to achieve a much higher level of efficiency (cost/performance) while maintaining flexibility.
Traditional approaches to network processor design have viewed one dimension of flexibility to be the breadth of applications supported. An attempt is made to support not only all applications within a particular level of the ISO model, but all levels of the ISO model as well. This tremendous breadth of application support requires a generality of processor design that severely impacts efficiency. A close look at the attributes of the applications in this space, though, reveals a clear dividing line in the middle of layer 4 of the ISO model.
On either side of this line, the attributes of the applications are very different, making it hard to come up with a processor architecture that is optimal for both. In fact, applications below the line have attributes similar to data path (or data plane) applications, while applications above the line have attributes similar to control plane applications.
It was this difference in attributes that led to the design of switches that used ASICs to off-load data plane functions from general purpose load/store processors in the first place. Here, rather than resorting to hardwired approaches, we attempt to select a programmable architecture that is a better match to the application attributes than traditional RISC architectures.
Data flow architectures have an intuitive appeal as a starting point for L2-L4 applications because, unlike RISC architectures, they are very efficient at moving data, and the amount of parallelism that can be extracted from the application is limited only by the data dependencies. Given these attributes, the challenges that remain are to scale the processing capability in a manner that allows easy programming while introducing the notion of fully deterministic execution to meet hard real-time (wire speed) requirements. Because L2-L4 applications have no data dependencies between packets, but have some dependencies between instructions, it is more efficient to use the available transistors on a chip to extract parallelism from the data stream rather than the instruction stream. This means that employing a large number of small, simple processors will allow more efficient performance scaling than a small number of large, complex processors.
A simple data flow processor consists of a traditional register set, a set of registers for holding a portion of the packet, and four simple execution units. The execution units are organized as a 4-way VLIW machine, allowing the parallel execution of up to four instructions.
No execution units are replicated within the processor, and scheduling of instructions is done in software. This minimizes the hardware overhead, thereby reducing the overall size and allowing maximal replication of the entire processor. While operating, data consisting of the program context and a portion of the packet flows into the register set, causing the instruction pointed to by the instruction pointer (RIP) to be fetched and executed.
The parallel execution units read their operands from the register set and write their results to the register set of the next processor in a flow through manner. Unused data flows around the execution units into the register set of the next processor. The entire packet flows through the pipeline.
The program context is associated with a portion of the packet (called a fragment) that the program is currently operating on. It can be shifted to other packet fragments under program control. Since only one instruction executes on a processor and all instructions and data are local, there can be no resource conflicts or stalls, so instruction execution is fully deterministic. Also, since one instruction is executed per processor, the processor boundaries are masked by the instruction boundaries, making any number of processors appear as a uni-processor to the programmer.
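The flow-through behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the actual hardware or its instruction set: the register names (ttl, dscp, queue, drop, out_port), the three-instruction program and the helper names run_stage and run_pipeline are all hypothetical. It models each processor as executing exactly one VLIW instruction of up to four operations, reading operands from its own register set and writing results into the register set of the next processor, with untouched registers passing through unchanged.

```python
from typing import Callable, Dict, List, Tuple

Registers = Dict[str, int]
# One operation: (destination register, function of the incoming registers).
Op = Tuple[str, Callable[[Registers], int]]


def run_stage(ops: List[Op], regs_in: Registers) -> Registers:
    """Execute one VLIW instruction (at most four ops) on one processor."""
    if len(ops) > 4:
        raise ValueError("a stage issues at most four operations")
    regs_out = dict(regs_in)          # unused data flows around the units
    for dest, fn in ops:
        regs_out[dest] = fn(regs_in)  # every op reads the incoming registers
    return regs_out                   # becomes the next processor's registers


def run_pipeline(program: List[List[Op]], regs: Registers) -> Registers:
    """Processor i executes instruction i, so latency is len(program) steps."""
    for ops in program:
        regs = run_stage(ops, regs)
    return regs


# Hypothetical program: decrement TTL, derive a queue, flag expiry, pick a port.
program = [
    [("ttl", lambda r: r["ttl"] - 1), ("queue", lambda r: r["dscp"] >> 3)],
    [("drop", lambda r: int(r["ttl"] <= 0))],
    [("out_port", lambda r: 0 if r["drop"] else r["queue"] + 1)],
]

result = run_pipeline(program, {"ttl": 64, "dscp": 46})
```

Because every packet context passes through the same number of stages and no stage waits on shared state, the execution time in this model depends only on the program length, which is the sense in which the text calls execution deterministic.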
Efficiency means more
In contrast to multiprocessor RISC architectures, the load/store instructions are eliminated, processor size is dramatically reduced, processor to processor interconnect is simplified, central data memory is eliminated and there is no instruction replication required to achieve a uni-processor programming model. This increased efficiency allows hundreds rather than tens of processors to be placed on a single 0.13 micron die while still leaving room for many specialized co-processors. This increases processing power significantly while reducing both power and cost.
With RISC processors, complex multithreading is required to reduce the effects of stalling while waiting for I/O operations to complete, and complex drivers are required to interface to I/O devices. With a data flow architecture, I/O can be transparently offloaded to I/O processors inserted into the processing pipeline.
These processors behave as a seamless extension to the processing pipeline. The program context and portion of the packet flow directly from a block of data flow processors into an I/O processor. This processor is configured at boot time by downloading simple drivers that extract the data from the specified registers and send it to the specified I/O device, returning any results to a specified register.
While the I/O operation is being executed, the program context and portion of the packet flow through a synchronization FIFO and are merged with the returned data. The new context and portion of the packet then flow into the next block of processors.
The I/O processors isolate the processing blocks from non-deterministic, high-latency I/O operations and from the overhead of storing and executing kernel services, greatly simplifying the programming environment. Since the program counter can be specified as the destination to which results are returned, multi-way branching can also be offloaded from the processing blocks.
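The role of the synchronization FIFO can be sketched in software as follows. This is a simplified model under assumptions, not the device's actual driver interface: the class name IOProcessor, the register names proto, len and pc, and the lookup table are hypothetical, and the external device is modeled as a plain function whose reply is queued in order rather than arriving asynchronously. The example returns the lookup result into a register named pc to mirror the multi-way branch case mentioned above.

```python
from collections import deque
from typing import Callable, Deque, Dict

Context = Dict[str, int]


class IOProcessor:
    """Reads one register, queries a device, merges the reply into another."""

    def __init__(self, src: str, dest: str, device: Callable[[int], int]):
        self.src, self.dest, self.device = src, dest, device
        self.fifo: Deque[Context] = deque()  # synchronization FIFO
        self.replies: Deque[int] = deque()   # results returned by the device

    def push(self, ctx: Context) -> None:
        """Context enters: issue the request and park the context in the FIFO."""
        self.fifo.append(ctx)
        self.replies.append(self.device(ctx[self.src]))

    def pop(self) -> Context:
        """Context leaves: merge the oldest context with its returned data."""
        ctx = dict(self.fifo.popleft())
        ctx[self.dest] = self.replies.popleft()
        return ctx


# Hypothetical table: protocol number -> instruction address of its handler.
handlers = {6: 100, 17: 200}
io = IOProcessor(src="proto", dest="pc", device=lambda p: handlers.get(p, 900))

for ctx in ({"proto": 6, "len": 40}, {"proto": 17, "len": 28}, {"proto": 1, "len": 84}):
    io.push(ctx)
merged = [io.pop() for _ in range(3)]
```

The processing blocks on either side never see the lookup latency in this model: they only hand a context in and receive a merged context out, in the same order.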
To design complete L2-L7 solutions around data flow processors, multiple ports and port switching are employed. Packets enter the data flow processor on one port, where they are classified and L2-L4 ingress processing is performed.
If the packet requires additional L4-L7 processing, it is switched to a port to which one or more general purpose RISC processors are attached and then passed back through the data flow processor for modification and forwarding. If the packet does not require L4-L7 processing, it is switched to a port that connects to the rest of the system.

For applications with high L4-L7 content, there is a one-to-many relationship between the data flow processor and the RISC processors, so the data flow processor performs load balancing in addition to L2-L4 offload, increasing both scalability and efficiency. For applications with low L4-L7 content, the data flow processor handles the majority of the packets, improving efficiency.
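The port-switching decision can be sketched as a small function. The article does not say how load is balanced across the RISC processors, so the CRC-based flow hash used here is purely an assumption chosen for illustration, as are the function name select_port, the port labels and the example addresses. Hashing on the flow key is one common way to keep all packets of a flow on the same processor.

```python
import zlib
from typing import Tuple

FlowKey = Tuple[str, str, int, int]  # src addr, dst addr, src port, dst port


def select_port(flow: FlowKey, needs_l4_l7: bool, risc_ports: int) -> str:
    """Pick the egress port after L2-L4 classification (hypothetical labels)."""
    if not needs_l4_l7:
        return "system"                       # forward to the rest of the system
    digest = zlib.crc32(repr(flow).encode())  # assumed hash, not from the article
    return f"risc{digest % risc_ports}"       # same flow -> same RISC processor


flow = ("10.0.0.1", "10.0.0.2", 40000, 80)
fast_path = select_port(flow, needs_l4_l7=False, risc_ports=4)
slow_path = select_port(flow, needs_l4_l7=True, risc_ports=4)
```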
By dividing the application space into two pieces centered around layer 4 of the ISO model, and employing architectures that match the attributes of the applications, more efficient system solutions can be developed.
At layers 2-4, data flow processors can be used to gain significant increases in efficiency by leveraging application attributes such as program length and data access patterns to eliminate load/store instructions, central data memory, redundant instruction storage and complex interconnects. This is accompanied by a fully deterministic, uni-processor programming model.
Throughputs of up to 40 Gbit/sec at 6.5 watts of power dissipation have been achieved to date in 0.13 micron technology. For layer 2-7 applications, data flow processors can be combined with RISC-based architectures to increase performance while improving efficiency by offloading the L2-L4 component of the application and providing load balancing functionality for the L4-L7 component.
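To put the quoted figures in perspective, the short calculation below derives the energy per bit and, under an added assumption, the time available per packet. The 40 Gbit/sec and 6.5 watt values come from the text. The packet size does not: the sketch assumes minimum-size Ethernet frames (64 bytes plus 8 bytes of preamble and a 12-byte interframe gap), and the article does not state the packet size at which the throughput was measured.

```python
POWER_W = 6.5                       # from the article
THROUGHPUT_BPS = 40e9               # from the article
MIN_FRAME_BITS = (64 + 8 + 12) * 8  # assumed: minimum Ethernet frame on the wire

energy_per_bit_pj = POWER_W / THROUGHPUT_BPS * 1e12      # picojoules per bit
packet_time_ns = MIN_FRAME_BITS / THROUGHPUT_BPS * 1e9   # nanoseconds per packet
packets_per_second = THROUGHPUT_BPS / MIN_FRAME_BITS     # wire-speed packet rate
```

This works out to roughly 162.5 picojoules per bit. Under the minimum-frame assumption, a new packet arrives about every 16.8 nanoseconds, or close to 60 million packets per second, which is the budget a wire-speed design would have to meet.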



