Design Article

Packets challenge next-gen nets

Bill Carlson, Field Applications Engineer, Network Processors, Intel Corp., Tempe, Ariz.

8/5/2002 7:35 AM EDT

As line rates have steadily increased along with the requirements for rich packet processing, traditional computing problems that were not effectively addressed with first-generation network processors are now being fully exposed. At higher, OC-48 to OC-192 line rates, these problems can dramatically reduce network processor performance if they are not effectively resolved.

The first generation of network processors exploited the unrelated nature of networking data and the weaknesses of traditional cached microprocessors by using parallel processing, multithreading and pipelining techniques.

In general, these network processors overestimated the unrelated nature of the packets. They either ignored or minimized the significant amount of interpacket or interprocess dependencies for some of the most complex processing tasks. As a result, performance levels did not meet the expectations of many customers, whether the network processor was fully software-programmable or used in a more fixed-function design. At OC-12 and Gigabit rates, such performance degradation is manageable. At the newer and higher OC-48 and OC-192 rates, it can become completely unacceptable.

For the next generation of network processors to be successful, they must address challenges associated with the fact that packets are both independent and dependent. These challenges can be summarized as serial stream processing problems. The challenge of dependence occurs when interdependent, related streams of packets are associated with common data structures; this is typically known as a lock contention problem. The challenge of independence occurs when independent, unrelated packets are associated with common data structures.

Although the basis of network processing assumes that packets are unrelated and can be processed in parallel, this is not always the case. If multiple processes must access common data structures, whether the processes are dependent on or independent of one another, certain important and time-consuming operations can cause processing to become serialized, thereby removing the benefits of parallel processing. A few examples of these operations include:

  • Asynchronous transfer mode (ATM) and IP flows: When many thousands of independent ATM and IP flows are aggregated, they will access a single common data structure, such as a transmit queue. Even though the packets or cells are processed independently, they may share common queues.
  • Buffer management: When a stream of packets, whether independent of or dependent on each other, is received from the network, the packets have to be buffered. To do this, the receive process requests a buffer from a common free buffer pool. This common buffer pool is accessed from different processes, an example of the "independent problem."
  • Classification, metering, policing and congestion management: At the core of the dependent and independent problems is a classical read-modify-write (RMW) problem, where one process must access a data structure from memory, modify it and then write it back to memory.

Because another subsequent process may try to access and modify that same data structure, it has to be locked so that mutual exclusivity can be given to the first process. This is accomplished via a message that is associated with that data structure. Other processes can read the message to determine if the data structure is currently being modified. This message is typically called a mutual exclusion, or mutex.

The entire process of locking down a data structure and modifying it is called a critical section. When a process enters a critical section, the following six general steps are taken:

  1. Check mutex for locked status.
  2. If locked, go to Step 1, or else set mutex to a locked status.
  3. Read data.
  4. Modify data.
  5. Write back data.
  6. Release lock by setting mutex to unlocked status.
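The six steps above can be sketched in portable C. This is only an illustration under stated assumptions, not the code used on any network processor: the `flow_state` structure, its `byte_count` field and the function names are hypothetical, and the mutex is modeled with a C11 `atomic_flag`. Note that steps 1 and 2 are folded into one atomic test-and-set, because a separate check followed by a separate set would let two processes both see "unlocked" and both enter the critical section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-flow state guarded by a mutex (names are assumptions). */
typedef struct {
    atomic_flag mutex;      /* clear = unlocked, set = locked */
    uint32_t    byte_count; /* shared data updated per packet */
} flow_state;

static void flow_init(flow_state *f)
{
    atomic_flag_clear(&f->mutex);
    f->byte_count = 0;
}

/* Steps 1-2: check the mutex and lock it in one atomic operation.
 * While another process holds the lock, keep retrying (go to step 1). */
static void flow_lock(flow_state *f)
{
    while (atomic_flag_test_and_set_explicit(&f->mutex, memory_order_acquire)) {
        /* still locked: spin */
    }
}

/* Step 6: release the lock by setting the mutex to unlocked. */
static void flow_unlock(flow_state *f)
{
    atomic_flag_clear_explicit(&f->mutex, memory_order_release);
}

/* The whole critical section: lock, read-modify-write, unlock. */
static uint32_t flow_add_bytes(flow_state *f, uint32_t packet_len)
{
    assert(f != 0);
    flow_lock(f);
    uint32_t count = f->byte_count; /* step 3: read data */
    count += packet_len;            /* step 4: modify data */
    f->byte_count = count;          /* step 5: write back data */
    flow_unlock(f);
    return count;
}
```

Every process that updates the same flow calls `flow_add_bytes`, so updates to that flow are serialized no matter how many processes run in parallel. When the mutex and the data sit in external memory, each of these steps also pays a memory access.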

These steps apply in both the dependent and independent cases. Often, the data to be modified is state information concerning the connection or flow and can be accessed quickly because it is internal to the processor.

On the other hand, when thousands of flows or connections are involved, the data structure is most likely in external memory and requires more time to access. The worst-case situation is when the data is stored in external memory in a linked list. For the linked list to be traversed, several dependent memory-read and memory-write operations are necessary. Each memory access must finish before the next one starts, resulting in very long latencies.

The RMW and locking process is dependent on memory latencies, the principal source of high-speed packet problems. As packet rates continue to climb, the effects of memory latency grow. At the OC-192 line rate, such latencies can exceed the time between packet arrivals.

When memory latencies are long relative to the packet interarrival time, serial-stream processing problems can negatively affect pipelined architectures. This type of architecture applies a series of tasks to an incoming packet stream and is tuned so that each task completes in a fixed time period. Typically, this period is the interarrival time of the incoming packets. If the processing time, defined as instruction processing time plus memory latency time, exceeds that period, the pipeline is broken and packets will be dropped.

If the OC-192 cell time is 35 ns, DRAM latency is 55 to 70 ns and SRAM latency is 10 ns, it is readily seen that memory latency occupies much of the pipeline period: a single DRAM access already exceeds it. Processes operating on linked lists can break a pipeline because of the multiple dependent memory operations required.
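The arithmetic behind that observation can be worked through with a few lines of C. The sketch uses only the figures quoted above (35 ns per cell, 10 ns SRAM, 55 to 70 ns DRAM). The helper names are hypothetical, and the model is deliberately simple: it ignores instruction time and assumes each access must finish before the next begins.

```c
#include <assert.h>

/* Figures quoted in the text, in nanoseconds. */
enum {
    CELL_TIME_NS        = 35,
    SRAM_LATENCY_NS     = 10,
    DRAM_LATENCY_MIN_NS = 55,
    DRAM_LATENCY_MAX_NS = 70
};

/* How many back-to-back dependent accesses complete within one period. */
static int dependent_accesses_per_period(int period_ns, int latency_ns)
{
    assert(period_ns > 0 && latency_ns > 0);
    return period_ns / latency_ns;
}

/* How many pipeline periods a chain of dependent accesses occupies
 * (rounded up to whole periods). */
static int periods_needed(int n_accesses, int latency_ns, int period_ns)
{
    assert(n_accesses >= 0 && latency_ns > 0 && period_ns > 0);
    int total_ns = n_accesses * latency_ns;
    return (total_ns + period_ns - 1) / period_ns;
}
```

With these numbers, `dependent_accesses_per_period(CELL_TIME_NS, SRAM_LATENCY_NS)` is 3 and the same call with `DRAM_LATENCY_MIN_NS` is 0. A four-step linked-list walk in SRAM takes 40 ns, so `periods_needed(4, SRAM_LATENCY_NS, CELL_TIME_NS)` is 2: the walk alone overruns one cell time before any instructions execute.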

To solve the memory latency challenge and reduce the effects of serialization, we incorporated special hardware features in our next-generation IXP network processors; specifically, distributed cache and SRAM Q_Array. These features are designed to directly attack the dependence and independence challenges associated with memory latency.

Essentially, distributed cache is the mechanism by which we deal with the dependence problem or lock contention problem. Traditional methods keep the mutex in SRAM to store the state of critical data. The critical data itself is also stored in external memory. The time taken to lock, unlock and check this mutex, along with accessing the critical data itself from external memory, becomes prohibitive because of the long memory latencies. It has a particularly negative impact for dependent packets at high line rates.

In the approach we use in the next-generation IXP network processors, each network processor incorporates a locking mechanism and the means by which to store actual critical data internal to the processing elements (microengines). This is the basis of the distributed cache. By localizing the dependency check in the microengine, the check can be done in hardware very quickly.

Meanwhile, the same is true of the critical data. By having the resource stored locally, the RMW can be performed at the speed of the microengine itself.

This dependency checking is performed with a 16-entry content-addressable memory (CAM) in the microengine. The CAM is loaded with a critical data identifier that identifies a particular flow, connection, queue number and so on. When a thread has to check if particular data is being worked on, it will check the CAM with an appropriate identifier.
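A software model may help make the CAM check concrete. The C below is a sketch only: the hardware compares all 16 entries at once, whereas this model loops over them, and the type names, function names and round-robin replacement policy are assumptions made for illustration, not the microengine's actual interface or policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAM_ENTRIES 16 /* entry count taken from the text */

/* Hypothetical software model of the CAM. */
typedef struct {
    uint32_t tag[CAM_ENTRIES];   /* critical data identifier: flow, queue number... */
    bool     valid[CAM_ENTRIES];
    unsigned next_victim;        /* round-robin replacement: an assumption */
} cam16;

typedef struct {
    bool     hit;   /* true if the identifier is already in the CAM */
    unsigned entry; /* matching entry on a hit, entry to reuse on a miss */
} cam_result;

static void cam_clear(cam16 *c)
{
    for (unsigned i = 0; i < CAM_ENTRIES; i++) {
        c->tag[i] = 0;
        c->valid[i] = false;
    }
    c->next_victim = 0;
}

/* Check whether data for this identifier is already held locally. */
static cam_result cam_lookup(const cam16 *c, uint32_t id)
{
    for (unsigned i = 0; i < CAM_ENTRIES; i++) {
        if (c->valid[i] && c->tag[i] == id) {
            cam_result hit = { true, i };
            return hit;
        }
    }
    cam_result miss = { false, c->next_victim };
    return miss;
}

/* Load an identifier into an entry, typically after a miss. */
static void cam_write(cam16 *c, unsigned entry, uint32_t id)
{
    assert(entry < CAM_ENTRIES);
    c->tag[entry] = id;
    c->valid[entry] = true;
    if (entry == c->next_victim)
        c->next_victim = (entry + 1) % CAM_ENTRIES;
}
```

In this model a thread calls `cam_lookup` with its flow identifier. On a hit, the state for that flow is already held locally and can be modified without a round trip to external memory. On a miss, the thread fetches the state, then records the identifier with `cam_write` so that a back-to-back packet of the same flow hits.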

Multiflow checkup
This CAM provides single-cycle access to the process requesting the check. Many, if not most, applications must support thousands of flows. For those applications, the CAM is especially helpful since it helps determine if back-to-back interdependent packets are being processed. Software pipelining is used to ensure that only one microengine is operating in a critical section at a time.

The independence problem is addressed by an integrated SRAM Q_Array that accelerates linked-list and ring buffer operations. This is especially critical for independent serial-stream processing, where unrelated packets or cells have to access a common data structure. Data is added to and removed from linked lists, as well as circular buffers, using several sequential memory references. Additionally, these operations have to be protected in critical sections.

Essentially, the Q_Array automates this portion of the process by providing atomic operation for these tasks. The SRAM Q_Array can support any combination of 64 linked lists or ring buffers. Most applications will have to support thousands of linked lists, so the Q_Array can cache the most recently used queue descriptors. The microengine CAM can then be used to quickly determine which Q_descriptors are locally cached in the SRAM unit.
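To show which memory references such hardware takes over, here is a minimal C sketch of a linked-list queue driven by a queue descriptor. It is not the Q_Array interface: the `queue_desc` layout, the `next_link` array standing in for link words in external SRAM, and the function names are all assumptions. In software, each function body would have to run inside a critical section; the text describes the Q_Array as performing the equivalent sequence atomically.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BUFFERS 8
#define NIL 0xFFFFu /* "no buffer" marker (assumption) */

/* Hypothetical queue descriptor: the small record a cache would hold. */
typedef struct {
    uint16_t head;
    uint16_t tail;
    uint16_t count;
} queue_desc;

/* Stand-in for per-buffer link words kept in external memory. */
static uint16_t next_link[NUM_BUFFERS];

static void queue_init(queue_desc *q)
{
    q->head = NIL;
    q->tail = NIL;
    q->count = 0;
}

/* Append a buffer: the link write depends on first reading the tail. */
static void enqueue(queue_desc *q, uint16_t buf)
{
    assert(buf < NUM_BUFFERS);
    next_link[buf] = NIL;          /* terminate the new element */
    if (q->count == 0)
        q->head = buf;             /* empty queue: new element is the head */
    else
        next_link[q->tail] = buf;  /* link the old tail to the new element */
    q->tail = buf;
    q->count++;
}

/* Remove the head: the new head depends on reading the old head's link. */
static uint16_t dequeue(queue_desc *q)
{
    if (q->count == 0)
        return NIL;
    uint16_t buf = q->head;
    q->head = next_link[buf];      /* dependent read */
    if (--q->count == 0)
        q->tail = NIL;
    return buf;
}
```

Buffers come back in the order they were queued. The point of the sketch is that `enqueue` and `dequeue` each chain several reads and writes that cannot overlap, which is the pattern that breaks a pipeline when every step is an external memory access.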
