Design Article

Performance 'tuning' at the network edge

Ken Hines, vice president and chief scientist, Ross Ortega, vice president and chief technology officer, Consystant Design Technology, Inc., Kirkland, Wash.

3/10/2003 9:05 AM EST


Network processors are specialized for forwarding plane applications, which require the same functionality to be applied to large numbers of packets in small amounts of time. To deal with this, network processors must process several packets simultaneously, and are built to handle high degrees of parallelism. Applications must be tuned to use this parallelism in a number of ways to achieve aggressive throughput requirements.

Debugging and tuning the application once it is deployed to the processing resources are among the most time-consuming tasks in network processing application development. A graphical view of software component interactions can greatly reduce the effort needed to achieve correctness and high performance in a network processing environment.

Network processing software is typically divided into two classes, control plane and forwarding plane, that must interact well with each other. Forwarding plane software must handle packets at line rate: 10 Gbit/sec for OC-192 at the network core and 2.5 Gbit/sec for OC-48 and below at the network edge. Whatever the data rate, the forwarding plane is responsible for the most common cases (e.g., routing packets according to an existing routing table, counting packets of particular types, etc.). These cases are particularly complex at the network edge, where there is a diversity of services, protocols and network media.
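To give a feel for what "line rate" implies, the per-packet time budget can be estimated with a little arithmetic. The sketch below is illustrative only: the 64-byte packet size and the 1.4 GHz clock are assumptions chosen for the example, not figures from this article, and the calculation ignores framing overhead.

```python
def packet_budget(line_rate_bps, packet_bytes, clock_hz):
    """Return (seconds per packet, clock cycles per packet) at line rate."""
    bits_per_packet = packet_bytes * 8
    packets_per_second = line_rate_bps / bits_per_packet
    seconds_per_packet = 1.0 / packets_per_second
    cycles_per_packet = clock_hz * seconds_per_packet
    return seconds_per_packet, cycles_per_packet


# Assumed values for illustration: 64-byte packets, 1.4 GHz engine clock.
for name, rate in (("OC-192", 10e9), ("OC-48", 2.5e9)):
    secs, cycles = packet_budget(rate, 64, 1.4e9)
    print(f"{name}: {secs * 1e9:.1f} ns per packet, about {cycles:.0f} cycles")
```

Under these assumptions a single engine would have only a few dozen cycles per packet at OC-192, which is why the work has to be spread across many engines and threads.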

Control plane software typically configures the forwarding plane software and handles exceptional cases, for example, changing the entries in a routing table because of network conditions. Forwarding plane software must be designed for extremely high throughput, often with data rates well beyond processor clock rates. This calls for special, powerful processors that can support throughput requirements using high degrees of parallelism and that have additional hardware support for common forwarding plane functionality.

The most common solution is to build the forwarding plane processor (NPU) from a number of multi-threaded packet processing engines (PPEs).

For example, processors in Intel's IXP line of network processors, which cover the range from network edge to core applications, consist of eight to sixteen PPEs called "microengines" (eight for the 2400, sixteen for the 2800). Each microengine contains eight hardware threads, special queues for receiving and transmitting packets, and hardware state machines that read from and write to the transmit and receive queues, among other hardware elements.

To fully harness the power of these forwarding plane processors, programmers must understand not only the instruction sets of the processors, but also how to rapidly synchronize concurrent threads and make full use of the on-chip hardware resources, as well as the corner cases of the specific application being built.

Many forwarding plane software developers are adopting component models to simplify the overall software design, to designate reasonable divisions in functionality so that it can be distributed among several processors, and to facilitate reuse of common functionality across a number of designs. An example of this is Intel's microblock model for the IXP line, which is targeted at many of the diverse applications at the network edge.

Forwarding plane software writers face a large number of choices when implementing functionality and each of these choices can have a significant impact on performance, especially at the network edge where it is necessary to carefully balance flexibility versus performance.

Often the performance impact isn't entirely obvious. While most NPU vendors supply cycle-accurate simulation environments to help designers tune and debug their applications before loading them onto the hardware, these tend to be very low-level, focusing on assembly instructions.

While these can demonstrate that the performance of an application is poor, they do little to help designers narrow down the cause of this poor performance. The environment for Intel's IXP 2800 is called the Transactor. It allows designers to simulate all of the forwarding plane software's threads, up to 128, at once. Other vendors' simulators may simulate all threads in their respective NPU, or the minimum necessary to show full system functionality.

Intel's Transactor presents a thread-based visualization of system execution, but it does not distinguish between components, nor does it illustrate packet flow. As such, it is difficult to extract design-level information, such as when microblocks are sending packets to other microblocks and how a packet is processed throughout the entire application. Without this information, it is difficult to isolate functional failures and performance bottlenecks.

At the design level

There is a strong need for design-level visualization tools for system executions. These tools must show design-level components; component-level interactions, including packet flow; system-level performance; and component-level performance.

The most appropriate visualization tool methodology for this kind of distributed, loosely coupled multiprocessor environment is one that is based on a coordination-centric methodology. It eliminates the need for developers to work at a lower abstraction layer, deciphering and mapping out the complex interactions between software modules that are hardwired to lower level system resources.

A coordination-centric visualization methodology allows developers to separate functional behavior from the coordination of separate software components, which simplifies design and debugging, speeds integration with hardware, enables reusable code and allows easy retargeting of embedded designs to different hardware architectures. Building on this layer of abstraction, such a methodology directly facilitates embedded networking systems design automation by providing a conceptual framework for graphical design entry, simulation, system-level debugging, platform targeting and code synthesis.

Without tools based on such methodologies, designers can spend days identifying problems that are visible in minutes using a high-level visualization. In a network processor design, such a visualization allows the developer to visualize, track and handle at least six major elements: a trace for each component; subdivision of traces to show how components are deployed over threads; significant events (e.g., packet received, packet transmitted, state changed); evolution of control state; interactions between components (data transferred); complete packet flows through all components; and execution durations.
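One way to picture the elements listed above is as a simple event log from which per-component traces and per-packet flows are derived. The following is a minimal sketch, not the data model of any actual tool: the `Event` record, its field names, the event kinds ("rx", "tx") and the component names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    time: int                         # simulation cycle of the event
    component: str                    # design-level component name (hypothetical)
    thread: int                       # hardware thread the component ran on
    kind: str                         # e.g. "rx", "tx", "state"
    packet_id: Optional[int] = None   # None for events not tied to a packet


def traces(events):
    """Group events into {component: {thread: [events in time order]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for ev in sorted(events, key=lambda e: e.time):
        grouped[ev.component][ev.thread].append(ev)
    return grouped


def packet_flow(events, packet_id):
    """Return the time-ordered path of one packet through the components."""
    hits = [e for e in events if e.packet_id == packet_id]
    return [(e.time, e.component, e.kind) for e in sorted(hits, key=lambda e: e.time)]


# Made-up log for illustration only.
log = [
    Event(10, "receive", 0, "rx", 1),
    Event(14, "receive", 0, "tx", 1),
    Event(15, "forward", 3, "rx", 1),
    Event(12, "receive", 1, "rx", 2),
    Event(40, "forward", 3, "tx", 1),
]
print(packet_flow(log, 1))
```

Grouping first by component and then by thread mirrors the idea of a trace per component that is subdivided to show how the component is deployed over threads.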

Suppose it is necessary to examine a system execution in which the vast majority of the packets are being dropped between the first and second components. With a coordination-based visualizer optimized for such an environment, the problem would be immediately apparent: of the many packets received by the second component, only a few correspond to transmitted packets. From this it is straightforward to focus on the code in this component to determine why so many of the packets are being dropped.
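The check described above, comparing what a component receives with what it transmits, can also be expressed over a plain event log. This is a hedged sketch under assumed conventions: events are hypothetical `(component, kind, packet_id)` tuples, and the component names and data are invented for the example.

```python
def dropped_packets(events):
    """Map each component to the packet ids it received but never transmitted."""
    received, transmitted = {}, {}
    for component, kind, packet_id in events:
        if kind == "rx":
            received.setdefault(component, set()).add(packet_id)
        elif kind == "tx":
            transmitted.setdefault(component, set()).add(packet_id)
    return {c: ids - transmitted.get(c, set()) for c, ids in received.items()}


# Made-up log: the first component passes all five packets on,
# the second component transmits only packet 0.
log = (
    [("receive", "rx", p) for p in range(5)]
    + [("receive", "tx", p) for p in range(5)]
    + [("forward", "rx", p) for p in range(5)]
    + [("forward", "tx", 0)]
)
report = dropped_packets(log)
for component, missing in report.items():
    print(component, sorted(missing))
```

A visualizer presents the same information graphically, so the mismatch stands out without anyone writing a query for it.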

Or suppose it is necessary to look at the execution of the POS_RX IPV4 Forward C6_TX system in a design in which each component is mapped to a single thread in a single microengine. An appropriately designed visualizer would make it immediately apparent that the forwarding component is the performance bottleneck for the entire system.
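When each component runs on a single thread, the component with the longest execution duration per packet limits the throughput of the whole chain. A minimal sketch of that comparison follows; the `(component, start, end)` span format, the component names and the cycle counts are assumptions made up for illustration, not measurements.

```python
from collections import defaultdict


def mean_durations(spans):
    """Average execution duration per component from (component, start, end) spans."""
    totals = defaultdict(lambda: [0, 0])   # component -> [total cycles, span count]
    for component, start, end in spans:
        totals[component][0] += end - start
        totals[component][1] += 1
    return {c: total / count for c, (total, count) in totals.items()}


def bottleneck(spans):
    """Return the component with the largest mean duration."""
    means = mean_durations(spans)
    return max(means, key=means.get)


# Made-up execution spans, in cycles, for two packets.
spans = [
    ("receive", 0, 40), ("forward", 40, 240), ("transmit", 240, 290),
    ("receive", 40, 80), ("forward", 240, 440), ("transmit", 440, 490),
]
print(mean_durations(spans))
print("bottleneck:", bottleneck(spans))
```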

Depending on how much of this time is spent accessing memory, how memory accesses line up, and how large the critical sections of this component are, this problem could be solved either by breaking the forwarding component down so that it can be mapped across a number of microengines, or simply by increasing the number of threads on which it executes in the same microengine. In the latter case, each thread is assumed to be executing the same functionality on separate packets.

Counterintuitive as it may seem, increasing the number of threads can in some cases decrease performance. In this case, however, deploying the IP Forwarder over eight threads effectively removes it as the bottleneck. The visualizations help both in identifying the performance bottleneck and in evaluating the quality of an attempt at eliminating it.
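The trade-off between memory wait time, critical sections and thread count can be explored with a rough analytic model. The sketch below is a deliberate simplification, not a description of how any particular NPU behaves: it assumes one thread executes at a time on an engine, that other threads can run while one waits on memory, and that a hypothetical `contention` parameter adds busy cycles per packet for each additional thread. All numbers are invented.

```python
def throughput(threads, compute, memory_wait, critical, contention=0.0):
    """Packets per cycle for one engine under a simple analytic model.

    compute     -- busy cycles per packet (only one thread executes at a time)
    memory_wait -- cycles a thread is stalled on memory per packet
    critical    -- cycles per packet spent holding a lock shared by all threads
    contention  -- assumed extra busy cycles per packet for each additional thread
    """
    busy = compute + contention * (threads - 1)
    latency = busy + memory_wait
    limits = [threads / latency,   # threads overlapping each other's memory waits
              1.0 / busy]          # the engine executes one thread at a time
    if critical > 0:
        limits.append(1.0 / critical)   # the critical section is serialized
    return min(limits)


# Made-up workload: 50 busy cycles, 350 cycles of memory wait, 20 in a critical section.
for n in range(1, 9):
    ideal = throughput(n, 50, 350, 20)
    contended = throughput(n, 50, 350, 20, contention=5)
    print(f"{n} threads: ideal {ideal:.4f}, with contention {contended:.4f}")
```

With no contention, this model keeps improving up to eight threads for the workload shown. With the assumed contention penalty, throughput peaks before eight threads and then falls, which is one way that adding threads can hurt.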




