Design Article

Packet-buffer memory bandwidth causes NPU performance bottlenecks

Michael Ching, Product Marketing Manager, Rambus, Inc., Los Altos, Calif.

5/9/2003 11:17 AM EDT

Spiraling network line rates, presently hitting up to 10 Gbit/second, are approaching the performance limits of the packet buffers in today's datacom line cards. For its part, buffer-memory performance depends largely on the memory chip's I/O signaling interface, core architecture, and address and command protocols. Other performance factors depend on the NPU design, and, in particular, on its ability to exploit different memory architectures and features.

Already, the performance of general-purpose network processing units (NPUs) needed to handle emerging 10 Gbit/sec line rates has outstripped the I/O bandwidth capability of most conventional high-volume DRAMs. In response, memory makers have developed several high-performance DRAM offshoots, as well as specialty chips.

In such an I/O- and data-movement-intensive environment, and with still higher network line rates in the wings, future-generation memory chips will be pressed for yet higher performance. Toward this end, one approach being advanced to boost memory performance centers on a novel signaling technique that pushes I/O data rates to 3.2 GHz and beyond.

Network line-card designers strive to create products that they can bring to market quickly and cost-effectively. To help them, NPUs offer an appealing alternative to expensive, design-intensive — but performance-tuned — custom ASICs for network processing. To keep down prices, NPU vendors are spinning flexible designs that span a range of application segments, thereby achieving economies of scale. At the same time, NPUs are approaching the performance level of custom ASIC processors; performance so high that it precludes building the processors with field-programmable gate arrays.

To complement these NPUs, a variety of memory chips sporting different architectures is also available. It is not a stretch to say that the design team's choice of memory type is a critical factor in achieving high performance and market success. Among the range of options are specialty memories like fast cycle RAMs (FCRAM) and reduced-latency DRAMs (RLDRAM), and more general-purpose memory chips that use double data rate (DDR) and Rambus Signaling Levels (RSL) I/O interfaces to speed up conventional DRAM cores. These different approaches, each with its own tradeoffs, all aim at the typical 4 Gbytes/sec bandwidth target needed to process OC-192 (10 Gbit/sec) packets.

A large and very fast packet buffer is needed because packets undergo several memory operations that must all be completed at the line rate, and without dropping packets, to preserve data integrity. Typical operations include storing, prioritizing ("classifying"), and forwarding packets; maintaining quality-of-service (QoS) and error checking and correction functions; and value-added services such as encrypting data for virtual private networks.

Another challenge facing memory performance is the data traffic's random nature and variable packet size, which typically ranges from 40 to 1,500 bytes. Specifically, the random arrival and out-of-order forwarding of data, in particular small packets, tends to lower a memory's sustained performance. Yet sustained high performance is precisely what packet buffers must deliver.

At the same time, line-card designers must work within ever tighter space and power constraints. The many line cards (up to 64 in a large multi-rack network-router chassis), the growing number of lines and chips per card, and the quadrupling of line rates with each successive generation all contribute to the designer's space and power challenges.

For example, in next-generation NPUs for OC-768 or multiple OC-192 line rates, memory subsystems will require multiple parallel memory chips, posing a significant challenge to ASIC pin count with even today's most advanced flip-chip ball-grid arrays (BGAs). NPU designers must also carefully manage packet-buffer power, which is typically limited to a 10W maximum for all line-card memory components, including control-store SRAMs.

Speed by design

Still, performance is among the most important considerations for selecting a packet buffer. Network processors must be paired with a memory system that delivers sufficient bandwidth to sustain the network line rate and level of application services offered.

The key consideration in choosing packet buffer memory is the I/O signaling and frequency, which together largely determine memory component count, number of NPU pins used, and overall power. Among the most common signaling levels, mainstream series-stub-terminated logic (SSTL) signals can achieve up to 400 megatransfers/sec; high-speed transceiver logic (HSTL), 600 megatransfers/sec; and RSL, 1,200 megatransfers/sec. These maximum rates, multiplied by the memory bus width, give the memory subsystem's peak bandwidth.
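The relationship between signaling rate, bus width, and peak bandwidth can be sketched in a few lines of Python. The signaling rates are the ones quoted above; the 32-bit bus width and the function name are assumptions chosen only to make the comparison concrete, and do not describe any particular memory part.

```python
# Peak bandwidth = maximum transfer rate x bus width.
# Sketch only: the bus width used below is an assumed, illustrative value.

# Maximum signaling rates quoted in the article, in megatransfers/sec.
MAX_RATE_MT_S = {"SSTL": 400, "HSTL": 600, "RSL": 1200}


def peak_bandwidth_gbytes(rate_mt_s: float, bus_width_bits: int) -> float:
    """Return the peak bandwidth of one memory bus in Gbytes/sec."""
    bytes_per_transfer = bus_width_bits / 8
    return rate_mt_s * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    ASSUMED_BUS_WIDTH_BITS = 32  # hypothetical width, for comparison only
    for name, rate in MAX_RATE_MT_S.items():
        gbytes = peak_bandwidth_gbytes(rate, ASSUMED_BUS_WIDTH_BITS)
        print(f"{name}: {gbytes:.1f} Gbytes/sec peak")
```

These are peak figures only; the sustained bandwidth a packet buffer actually delivers also depends on the core architecture and protocol.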

Because they use SSTL and HSTL I/O signaling interfaces, some high-speed memories, like FCRAMs and RLDRAMs, can only deliver their optimal performance when connected point-to-point to the NPU. (Compared to other interfaces, SSTL signaling also uses more power.) In contrast, RSL signaling allows multiple memory chips to be connected on a bus (called multidrop), thereby paving a convenient way to increase a line card's total memory capacity, even in the field. Multidrop bus support also gives designers the flexibility to target one basic line-card design for a range of applications and market segments.

The demands of constant network traffic mean that a packet-buffer memory must continuously maintain its high performance. While I/O signaling determines the data transfer rate between the memory and the NPU, performance efficiency (the degree to which a chip can sustain its peak bandwidth) depends on the memory chip's core architecture and address and command protocols.

For example, a 40-byte packet at the OC-192 line rate occupies the line for only 32 nanoseconds (320 bits at 10 Gbit/sec), so the memory chip's core architecture must access random data in 32 nsec or less. One way to do this, given a standard DRAM's 60 nsec row cycle time (tRC), is to use memories with specialty cores, like FCRAMs and RLDRAMs, which typically have a 25 nsec tRC. Another way is to interleave or pipeline transactions across a large number of nonconflicting DRAM banks, preferably banks within the same chip.

What's more, for a line card to handle two OC-192 streams, the tRC requirement is slashed to 16 nsec. And around the corner awaits OC-768, which will call for an 8 nsec tRC. Such performance is beyond the reach of even today's specialty DRAM cores, and capacity needs and power constraints preclude going with SRAMs as an alternative memory.
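The timing figures above follow from simple arithmetic, sketched here in Python. The sketch uses nominal line rates (10, 20, and 40 Gbit/sec) and an idealized interleaving model in which successive accesses always land on different, nonconflicting banks; both are simplifying assumptions, and the function names are illustrative.

```python
import math


def access_budget_ns(packet_bytes: int, line_rate_gbit_s: float) -> float:
    """Time one packet occupies the line, in nsec (bits / Gbit/sec = nsec)."""
    return packet_bytes * 8 / line_rate_gbit_s


def banks_to_interleave(trc_ns: float, budget_ns: float) -> int:
    """Minimum number of nonconflicting banks needed so that a new access
    can start every budget_ns while each bank is busy for trc_ns.
    Idealized: assumes accesses never collide on the same bank."""
    return math.ceil(trc_ns / budget_ns)


if __name__ == "__main__":
    MIN_PACKET_BYTES = 40
    # Nominal line rates in Gbit/sec (assumed round numbers).
    line_rates = {"OC-192": 10, "2 x OC-192": 20, "OC-768": 40}
    # tRC values quoted in the article, in nsec.
    cores = {"standard DRAM": 60, "FCRAM/RLDRAM": 25}

    for label, rate in line_rates.items():
        budget = access_budget_ns(MIN_PACKET_BYTES, rate)
        print(f"{label}: {budget:.0f} nsec per 40-byte packet")
        for core, trc in cores.items():
            banks = banks_to_interleave(trc, budget)
            print(f"  {core} (tRC {trc} nsec): at least {banks} bank(s)")
```

Real traffic does produce bank conflicts, so a practical design needs more banks than this lower bound suggests.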

Short burst access

Along with core architecture, a memory's address and command protocols also determine performance efficiency. For handling small packets, designers should look for memories that can access short bursts of data. Conventional multiplexed row and column (RAS/CAS) addressing used by DDR and FCRAM chips sustains minimum burst lengths of 32 bytes. Therefore, 64 bytes (two bursts) must be transferred even if only 40 bytes are needed, resulting in 62.5 percent efficiency. Other approaches include RLDRAM's single-cycle SRAM-like protocol and the RDRAM high-frequency packet protocol. Both of these designs sustain minimum burst lengths of 16 bytes, allowing 48-byte transfers (three bursts) and achieving an 83.3 percent efficiency.
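The efficiency figures above come from rounding each packet up to a whole number of bursts. A minimal Python sketch of that calculation follows; the function name is illustrative, and the model ignores any protocol overhead beyond burst granularity.

```python
import math


def burst_efficiency(packet_bytes: int, min_burst_bytes: int) -> tuple[int, float]:
    """Return (bytes actually transferred, fraction of them that is useful)
    when a packet must be moved in whole bursts of min_burst_bytes."""
    bursts = math.ceil(packet_bytes / min_burst_bytes)
    transferred = bursts * min_burst_bytes
    return transferred, packet_bytes / transferred


if __name__ == "__main__":
    # Minimum burst lengths quoted in the article, in bytes.
    for label, burst in (("DDR/FCRAM", 32), ("RLDRAM/RDRAM", 16)):
        moved, eff = burst_efficiency(40, burst)
        print(f"{label}: {moved} bytes moved for a 40-byte packet, "
              f"{eff:.1%} efficient")
```

The penalty is largest for packets just over a burst boundary and shrinks as packets grow toward the 1,500-byte end of the range.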

In addition to performance, line card designers should consider the ability to put multiple memory chips on a multidrop bus as a way to increase storage capacity, improve bank interleaving, and impart application flexibility. Among DRAM memory types, DDR and RDRAM components allow multidrop configurations. In addition, the RDRAM protocol can concurrently access subsequent packets before finishing with the present one, allowing for up to five pipelined operations.

Other considerations include memory cost, component count, and power. As recently presented at industry conferences, specialty memory chips, like RLDRAMs, typically cost twice as much per bit as conventional DRAMs. In sharp contrast, DDR and RDRAM memories enjoy wide use, assuring a high-volume cost structure and proven reliability and performance. For their part, DDR memories, being slower than the specialty memories, carry the lowest cost but require more pins and chips per board, thus pressing space and power constraints. Also, with only four banks per chip, DDR memories tend toward low performance efficiency.

RDRAM memories, by contrast, combine wide use with high performance by putting 32 banks in one chip. Thus, while RDRAM components accommodate multidrop signaling, they are less likely to need multiple chips to achieve high sustained bandwidth. Instead, they achieve their performance by switching among the many on-chip banks. For these reasons, plus a greater than 1 GHz I/O frequency compared to DDR's 400 MHz rate, RDRAM components access data about four times faster than conventional DRAMs.

These performance benefits, plus a relatively low cost, account for the popularity of RDRAM memories in general-purpose designs that can accommodate different packet sizes and network standards as well as span a range of capacity and performance options.

As designers take aim at OC-768 rates, they face buffer memory performance requirements that approach 20 Gbytes/second, pushing the numbers of NPU pins for just memory alone to the limits of flip-chip packages. Fortunately, an ultrafast, next-generation memory technology, code named Yellowstone, has been announced and is expected to run at I/O frequencies starting at 3.2 GHz with a roadmap to 6.4 GHz and beyond, pushing memory bandwidths as high as 100 Gbytes/sec.
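To see why pin count becomes the limit, the Python sketch below estimates how many data signals are needed to carry 20 Gbytes/sec at each signaling rate mentioned in this article. It rests on several assumptions: each signal carries one bit per transfer, the 3.2 GHz figure is treated as 3,200 megatransfers/sec per signal, the bus runs at 100 percent efficiency, and address, command, clock, power, and ground pins (as well as any second pin needed for differential signaling) are ignored. The real pin budget would therefore be larger.

```python
import math

# Per-signal rates in megatransfers/sec. The first three are quoted in the
# article; the 3.2 GHz entry assumes one transfer per signal per cycle.
RATE_MT_S = {"SSTL": 400, "HSTL": 600, "RSL": 1200, "3.2 GHz signaling": 3200}


def data_signals_needed(target_gbytes_s: float, rate_mt_s: float) -> int:
    """Lower bound on data signals for a target bandwidth, assuming one bit
    per signal per transfer and no protocol or efficiency losses."""
    target_mbit_s = target_gbytes_s * 8 * 1000
    return math.ceil(target_mbit_s / rate_mt_s)


if __name__ == "__main__":
    TARGET_GBYTES_S = 20  # OC-768 buffer bandwidth figure from the article
    for name, rate in RATE_MT_S.items():
        signals = data_signals_needed(TARGET_GBYTES_S, rate)
        print(f"{name}: at least {signals} data signals")
```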




