News & Analysis

Implementing Parallel Packet Buffering: Part 1

Sailesh Kumar, Raja Venkatesh, Joji Philip, and Sunil Shukla, Paxonet Communications

4/22/2002 5:14 AM EDT

Implementing Parallel Packet Buffering: Part 1
Modern switches and routers often use dynamic RAM (DRAM) in order to provide large buffer storage space. However, the effective bandwidth of DRAM is frequently a limiting factor in the design of high-speed switches and routers.

To increase effective bandwidth, designers can now turn to a new buffering architecture, called the parallel packet buffering (PPB), which increases the effective memory bandwidth significantly. This two-part series details the PPB architecture. In part 1, we'll deliver an overview of the PPB architecture, how it works, and the advantages it brings to a system architecture. In the second half, we'll look at some challenges associated with the PPB approach and some methods for correcting those challenges.

During the discussion, this article will demonstrate that for most of the traffic profiles, the PPB architecture utilizes the total buffering space much more efficiently than the single memory-based architectures. In addition, simulation results will show that the PPB architecture achieves much higher bandwidth than a single memory with same aggregate data-width for any traffic.

Why PPB is Needed
Any high-speed networking design built around a store-and-forward architecture needs to buffer the incoming traffic at the line rate.3,11 The buffered data then has to be processed and forwarded to the appropriate port at the line rate. All aggregation, scheduling, packet classification, and most of the layer 2 and 3 devices belong to this category.

A layer 2 or 3 device also needs to perform the header lookup from a common lookup memory table for all the incoming packets. For aggregation and scheduling4, the device should be capable of buffering a very large amount of data arriving at the line rate in a logically common memory pool for further processing.

Typically, every node in a network should be capable of buffering an amount of data equal to (Rtt x line rate)5, where Rtt is the round trip time of the network. Therefore, as the line rate grows, the buffering space requirement of any node also increases proportionally. This means that the device must have a large, logically single shared memory to buffer all the incoming data on all ports. The bandwidth capacity of this memory should match the line rate of the device.

The problem, however, is that memory bandwidth is not keeping up with the exponential increases in line rate. Memory access time for the fastest available commercial DRAM is about 30 ns, and has made very slight development in recent years. This slow operation speed is causing a severe bottleneck in the scalability of the networking devices. Simply put, network switches and routers are running out of memory bandwidth.

To answer the memory bandwidth problem, a host of new packet-buffering architectures have been developed.2, 8,9,10,11,12 These architectures generally use multiple memory modules to emulate the single buffer of large size and bandwidth. Different memories are accessed simultaneously in parallel for read/write operations thereby increasing the effective data-width of the buffer.

In general, the new architectures, often called logically single memory, are accessed simultaneously in parallel just to increase the data-width. However, the logical data bus width of any memory buffer has a limit irrespective of the memory technology or buffering architecture. This bus width shouldn't exceed the minimum packet size arriving at the ingress or the data bus will remain underutilized for each write operation of such packets.

For example if the minimum packet size to be handled by a device is 40 bytes (320 bits), then there is no point in exceeding the data width of the memory buffer beyond 320 bits. This itself imposes the limitation on the bandwidth scalability of some new memory buffers. With the current memory technology, the fastest DRAMs with 320-bit wide data bus can only support OC-192 rates.

Another problem for newer buffer memory architectures is that performance consistently degrades for a random traffic pattern comprising of variable packet sizes. Under many of the new memory approaches, aggregate bandwidth and single read/write latency doesn't remain constant and deterministic. Also, the effective utilization of the total buffering space decreases considerably.

For example, a buffering architecture optimized for buffering 64-byte packets will work perfectly fine for incoming traffic comprising of 64-byte packets. However, the same architecture will impose a severe bandwidth bottleneck for traffic comprising of 65-byte packets. Thus, such traffic generally results in significant wastage of memory space due to lack of a finer granularity in partitioning of the memory.

With these problems in mind, a new packet buffering architecture, called PPB, has been developed to increase the overall buffering capability significantly both in terms of bandwidth and size. This architecture guarantees efficient utilization of the total buffering space for all kinds of traffic pattern and variable packet sizes. The memory modules used are DRAMs with minimal individual bandwidth requirement that results in lower cost and higher capacity in terms of size. The architecture is scalable enough to support any bandwidth capacity as high as OC-3072 and above.

The PPB Approach
When building a DRAM-based packet buffering architecture, the objective is to include the following features:

  1. Scalability to achieve extremely high bandwidth capacity with the existing memory technology, not achievable by any logically single memory based architectures.
  2. Comparatively higher bandwidth capacity than logically single memory-based architectures with similar memory and aggregate data width.
  3. Finer granular accesses to memory modules for reading and writing packets. This ensures that even for a large variation in the packet sizes, bandwidth of the buffer doesn't deteriorate.
  4. Efficient use of the total buffering space available. This is to ensure that the total available memory is utilized completely and efficiently for packets of any random size.
  5. Fast and deterministic memory module arbitration logic and efficient read/write logic without any contention for each individual memory access. This ensures that the device performance doesn't deteriorate for the fluctuating traffic pattern and random read/write sequence.

PPB incorporates n individual DRAM memory modules to emulate the common memory buffer. Each module has same size, data width, and access time.

The value of n can be chosen depending on the bandwidth and latency requirements of the packet buffer. All the n memory modules are accessed simultaneously and independently. The total amount of data that can be buffered is the aggregate buffering capacity of all the n memory modules since the incoming packets are buffered in each of the memories. However, a complete incoming packet is always written into one memory module only.

The read and write operations are performed in a pipelined manner in each individual memory module. Thus, while a packet is getting written into one memory module, a newly arrived packet can be written in some other memory module, which is not being accessed currently. This pipelined and simultaneous access to the individual memory modules boosts the aggregate bandwidth while reducing the load on each individual memory.

The aggregate bandwidth capacity of all the n memories must equal the line rate of the device. For a given aggregate bandwidth, as the value of n increases, bandwidth requirement of the individual memory modules goes down. Therefore, the data width of the individual memories is inversely proportional to n.

Thus, for higher values of n, write and read operations to each individual memory can be performed with a finer granularity. Each write access is of the same size as the data width of each individual memory. This makes the PPB architecture immune to varying packet sizes. The simulation results that follow will show that as n increases, the effective utilization of buffering space and overall bandwidth increases significantly for all traffic profiles.

Reading and Writing and PPB
Under the PPB approach, a memory module in the PPB is idle when read or write operations are not executing on the module. When read and write operations are executing, the module is called busy.

Click here for Figure 1

Figure 1: Top-level schematic diagram of the PPB architecture.

A packet is written into one of the idle memory modules only. There is an arbiter that selects one out of all the idle memory modules for each of the incoming packets. For each incoming packet, the PPB arbiter (Figure 1) selects a memory module and starts writing the packet into that module. One complete packet is always written in one memory module only. For each incoming packet, PPB performs the following sequence of operations:

  1. Select an idle memory module, where the complete packet will be written.
  2. Update the status of the chosen module as busy and start writing the packet into it and look for the new packets arriving.
  3. When a packet has to be read out, select the appropriate memory module where the packet is residing, update the status of the module as busy and start reading out the packet.
  4. As soon as the read/write operation on any memory module is over, update the status of the module as idle.

Memory Module Selection
The PPB approach includes a fast, efficient, and deterministic policy for selecting which memory module to write a packet to. The PPB arbiter maintains the status and occupancies of each of the n memory modules. For any incoming packet, PPB selects an idle module with the least occupancy level. And if none of the modules are found idle at that time, the PPB waits for one of the module to become idle (Figure 2).

Click here for Figure 2

Figure 2 Packet queuing in an n memory based PPB.

Since the line rate of a system is equal to the aggregate bandwidth capacity of the n modules, the capacity of an individual module is less than the line rate by a factor of n. Therefore, it is possible for a packet to arrive before an existing packet gets written completely to a module. Note: This will happen more often for higher values of n.

Some will say that receiving a new packet while still processing an existing packet could cause problems in the system architecture. However, the PPB arbiter selection policy can always find an idle memory module for each arriving packet as long as the ingress data rate doesn't exceed the aggregate capacity of the PPB memory.

Let's look at an example to prove this point. Since there are n memories (all makes up the common memory buffer) in PPB, aggregate bandwidth capacity has to be at least twice the line rate R. This happens because data is simultaneously written into and read out of the memory modules. Thus, at any point of time, on average, reads will be performed on at most n/2 memories if the egress rate doesn't violate the line rate. So, we can say that at any point of time, there will exist at least n /2 memories where read operation is not taking place. Thus we get our Lemma 1, which states that any time read operation will be performed only on n/2 memories.

Now, we have to prove that incoming data packets can be written into the n/2 idle memories, provided ingress rate doesn't violates the line rate. If the line rate equals R and the average packet size equals S, then average packet arrival time is S/R. In the PBB approach, the bandwidth capacity of each module is 2R/n. The time to write a packet of size S equals (S)/(2R/n) which equals Sn/2R.

Now if we consider S/R as one time slot, the average packet arrival time equals 1 time slot while the time to write packet of size S in a module equals n/2 time slots. This combined with Lemma 1 from above proves that if we have n/2 modules, packets can be buffered in the memory with the selection policy of selecting the memories where reads are not getting performed.

PPB read/write operations are performed simultaneously in a pipelined manner over each individual memory modules. If the packet arrival interval is assumed to be of one time slot, each individual read/write operation in the PPB will take at least n/2 time slots. Thus during each complete read/write operation, n/2 new packets will arrive at the PPB and all of them have to be buffered.

The complete operation of PPB for n=4 is depicted in Figure 3. The read/write operation pipeline length will be 2 (N/2 = 4/2 = 2). Therefore, a single read/write operation on any memory module lasts for two time slots.

Click here for Figure 3

Figure 3: Operation of PPB with n = 4.

Advantages of PPB
Ok, so we've laid out how PPB works. Now, let's examine the benefits that PPB architecture bring to networking equipment designs.

1. Scalable to support higher line rates:
Traditionally, the bandwidth of the memory buffer increases proportionally with an increase in its data width. However, when the bus width reaches 320 bits, we hit a limit. Most IP switches today are engineered to process packets with a minimum size of 40 byes (320 bits). But, that is a minimum total. In general these systems must handle much larger packets. The problem, however, is that if they process packets larger than 40 bytes, memory bus bandwidth does not increase while memory space wastage rises.

The PPB architecture can solve this problem. Designers using the PPB architectures can increase memory bandwidth by choosing higher values of n and appropriate data width D. For example, under the PPB approach, a designer can implement 8 memory modules each featuring a 160-b bandwidth. While this bandwidth is less than the 320 bits provided by a single memory above, overall the 8 memory modules will achieve an aggregate bandwidth of 1280 bits, which is 4X larger than single memory architectures. Thus, the PPB approach allows designers to more easily handle the larger packets entering the system.

Theoretically, it seems a logically single memory based architectures with wider data bus width same as nxD (until nxD is less than 320) will give same bandwidth as the PPB architecture. However, this ideology doesn't hold true for all kind of traffic profiles. Actually for any practical traffic profile, we will prove below that logically single memory-based architectures will not give the deterministic and constant bandwidth as compared to the PPB.

2. Higher bandwidth with the same data width:
The data width of any memory buffer is generally chosen to be an integral factor of the average packet size in order to maximize the bandwidth and reduce the buffer wastage. For example, if the device is optimized to process minimum 40-byte packet, then the data width can be chosen to be 320-bit or 320/i, where i is a positive integer.

For high-speed routers and switches, to achieve very high bandwidth, generally the data width of the memory has to be made very wide. In these designs, i is generally chosen to be equal to 1 or the data bus width is made the maximum. Thus, one 40-byte packet will be written in just one time slot in the memory so the read/write operation can execute at the fastest possible rate.

However, in any practical scenario, the arriving packet size varies over wide range. If a 41-byte packet arrives, the single memory architecture will need 2 cycles to write the packet--40 bytes in one slot and 1 byte in the next. Thus, write bandwidth has gone down by a factor of [41/(40x2)] 51%. Now if such odd size packets keep coming, the buffer will only support 51% of the peak bandwidth capacity.

PPB eradicates such a situation by delivering a finer granularity to memory accesses. For example if we choose n= 8 and D = 320/8 = 40, the peak bandwidth capacity will match the single memory-based architecture described above. Now, if a 40-byte packet is received, it will take 8 time slots to get written completely and the bandwidth utilization will be 100%. The 41-B packet, on the other hand, will require 9 time slots.

In first 8 time slots, effective bandwidth utilization is 100%. In the last slot, however, 1 byte has been written in the same time that 5 bytes (40 bits) could have been written. So the bandwidth utilization in last time slot is 1/5=20%.

On the surface, this looks like a more inefficient way to handle traffic. However, the real benefit of this approach is seen in overall bandwidth utilization. In this example, effective bandwidth utilization equals (8x100% + 1x20%) / (8 + 1) = 91.1%. Thus, even in this demanding scenario, in which the potential performance of a single memory-based system could significantly deteriorate, PPB consistently gives 91% of the ideal bandwidth. And this figure can be extrapolated further by increasing the value of n.

The following equation demonstrates the bandwidth advantage that PPB delivers over equivalent single memory based buffering architecture:

Where, L=Rate at which the bandwidth of single memory based buffer architecture degrades for a given traffic profile; D=Aggregate data width of PPB or single memory; and k =Constant depending on L, D and the traffic profile.

As soon as n becomes equal to D, the PPB achieves the maximum possible ideal bandwidth irrespective of the traffic profile and the packet size. This happens because the granularity of memory access reaches the minimum limit of 1 bit and for any packet size, bandwidth will not be wasted.

To illustrate the bandwidth capabilities provided by PPB, we generated a traffic pattern with sustained data rate (100% load) and the packet size varying as per the Bernoulli distribution function. The value of p is chosen to be 0.75 and mean packet size 40 bytes. But the packets of size less than 40-bytes were filtered out to keep the minimum packet size at 40 bytes.

Figure 4 shows the effective bandwidth achieved from the PPB as a function of n with the constraint that the aggregate data-width (nxD) is equal to 320. This is made to match the PPB bandwidth with logically single memory based architectures.

Click here for Figure 4

Figure 4: Bandwidth of PPB as a function of n.

The graph in Figure 4 shows that as the value of n increases, the effective bandwidth of PPB will also increase. As n approaches 320 bits, the D becomes 1 and effective bandwidth of PPB equals the ideal maximum possible bandwidth irrespective of the varying packet sizes.

3. More available memory (efficient utilization of buffering space):
In addition to delivering higher bandwidth, the PPB structure also delivers better utilization of bandwidth. Specifically, as the value of n increases, the effective memory utilization of a PPB structure also increases. By that we mean that if a logically single memory of same data-width accommodates M packets, PPB will accommodate (M + Δ) packets, where Δ will depend on n and can be consistently very large for a real traffic pattern.

Let's take the same example of logically single memory based buffering architecture from above (see point 2). A 41-byte packet has arrived at such memory buffer with a 320-bit data bus width. Thus in the first address space, we will be storing full 320 bits representing the 100% utilization of the memory space. In second address space, we will be storing 8 bits where ideally 320 bits of data could be stored. Thus in the buffering space of 640 bits, only 328 bits has been stored and the effective memory space utilization is 51% again.

In a PPB architecture with n = 8 and D = 40, the same 41-byte packet will be stored in 9 locations or address spaces. The first 8 locations will be utilized completely while last location will be 20% utilized resulting in the effective buffering space utilization as high as 91%. Thus, we can say that if the classical memory implementation will accommodate 1 million such packets, PPB will accommodate approximately 1.5 million such packets with the same total size of memory.

We have again generated the same Bernoulli traffic profile with mean data packet size 40 bytes and p to be 0.75. The packets having size lower than 40-bytes has been filtered out. The effective buffering space utilization or the maximum number of packets that could have been buffered will have similar pattern as depicted in Figure 4. It can be seen that as the value of n goes up, the effective buffering space utilization also goes up because of finer partitioning of buffers in the memory.

We have also generated constant sized packets with minimum limited to 40 bytes and buffered in a PPB architecture and an equivalent single memory based architecture. In this example, n=8 and D=40 for PPB while the data-width of logically single memory was made 320. Figure 5 shows the percentage utilization of the memory buffering space as a function of the packet size. It can be seen that as the packet size crosses the integral multiplicity of the data width, the space utilization changes considerably. But this is very less in the PPB again because of finer partitioning of the memory buffers.

Click here for Figure 5

Figure 5: Buffering space utilization as a function of packet size.

Based on the results shown in Figure 5, it can be seen that that any buffering architecture employing wider data bus works at higher buffering granularity results in non-predictive performance varying considerably as the packet size varies. Also, these buffer architectures waster a lot of available memory space depending on the traffic pattern and packet size variations.

In PPB, data width of each of the memory have been reduced by a factor of n while increasing the number of memory modules. Thus, the buffer granularity has been reduced while keeping the effective bandwidth constant. This results in a number of advantages inherent to the lower granularity processing. It can handle a wide range packet sizes without any deterioration in the effective performance for a random traffic with wide variation in packet size.

Setting the Stage for Part 2
That wraps up part 1 in our two-part set on parallel packet buffering (PPB). In part 2, which will appear online next week, we'll dive into some of the challenges with PPB and some solutions for solving these challenges. We'll also lay out some requirements for choosing the right amount of memory modules for a networking design.

References

  1. Obara, H., "Optimum architecture for input queueing ATM switches," IEE Electronics Letters, pp. 555-557, March 28, 1991.
  2. Marwan Krunz, Herman Hughes, and Parviz Yegani, "Design and analysis of a buffer management scheme for multimedia traffic with loss and delay priorities," Proceedings of the IEEE GLOBECOM '94 Conference, pp. 1560-1564, San Francisco, November 1994.
  3. J. Liebeherr and N. Christin, "JoBS: Joint Buffer Management and Scheduling for Differentiated Services", Proceedings of IEEE/IFIP International Workshop on Quality of Service (IWQoS '2001). Karlsruhe, Germany, June 2001
  4. H. Fu and E. Knightly, "Aggregation and Scalable QoS: A Performance Study", Proceedings of IWQoS '01, Karlsruhe, Germany, June 2001.
  5. M. Shor, K. Li, and J. Walpole, "Application of Control Theory to Modeling and Analysis of Computer Systems," Proceedings of Japan-USA-Vietnam Workshop on Research and Education in Systems, 2000.
  6. Eric M. Mantion, "Tomorrow.s Internet Broadband Capillaries, Need Terabit Arteries", Cahners IN-STAT Group, August 2001.
  7. Gideon Kaempfer, "The Challenges of Building a Terabit Router", Charlotte's Web Networks.
  8. Youngmi Joo, and Nick McKeown, "Doubling Memory Bandwidths for Network Buffers", IEEE Infocom, April 1998, Vol 2, pp. 808-815, San Francisco.
  9. Sundar Iyer, Ramana Rao Kompella, and Nick McKeown, "Analysis of a Memory Architecture for Fast Packet Buffers", IEEE Workshop on High Performance Switching and Routing, Dallas, Texas, May 2001.
  10. D. K. Hunter, M. C. Chia, and I. Andonovic, "Buffering in optical packet switches," J. Lightwave Technol., vol. 16, no. 12, pp.
  11. M. I. Irland, "Buffer Management in a Packet Switch," IEEE Trans. Commun., vol. 26, pp. 328-- 337, Mar. 1978.
  12. D.K. Hunter, B. Qiu, K. M. Guild, M.C. Chia, A. C. Bryce, M. H. M. Nizam, Y. Qian, I. Andonovic, J. S. Aitchison, J. Marsh, M. J. O'Mahony: "Buffering Strategies for Optical Packet Switches", OSA Topical Meeting on Photonics in Switching, 21-23 July 1999, Santa Barbara, CA
  13. Jin Cao and Kavita Ramanan, "A Poisson Limit for Buffer Overflow Probabilities" Proceedings from INFOCOMM, 2002.
  14. C. Courcoubetis and R. Weber, "Buffer overflow asymptotics for a buffer handling many traffic sources," Journal of Applied Probability, vol. 33, pp. 886--903, 1996.
  15. Michel Mandjes and Sem Borst, "Overflow behavior in queues with any long-tailed inputs," Advances in Applied Probability, vol. 32, pp. 1150-- 1167, 2001.
  16. Marco Ajmone Marsan, Andrea Bianco, Paolo Giaccone, Emilio Leonardi, Fabio Neri, "Packet Scheduling in Input-Queued Cell-Based Switches", IEEE INFOCOM'01, Ankorage, Alaska, USA, April 2001.
  17. N.McKeown, "Scheduling algorithms for input-queued cell switches", Ph.D. Thesis, Un. of California at Berkeley, 1995.
  18. A. B. DRAGUT, M. Dragut, "Upper Bound Decision Process for a Time-Constrained Radical New Product Development Process (TR-NPD)-A Finite Horizon Discrete-Time Markov Nonstationary Approach", Proceedings of the X-th International Symposium on Applied Stochastic Models and Data Analysis, June 12-15, 2001, Compigne, France, pp.392-397.
  19. B.D.Choi, Y.C.Kim, D.I.Choi, D.K.Sung, "An Analysis of M, MMPP/G/1 Queues with QLT Scheduling Policy and Bernoulli Schedule", IEICE Transactions on Communication, Vol.E81-B, pp.1-10, No.1, 1998.

About the Authors
Joji Philip, Raja Venkatesh, Sailesh Kumar, and Sunil Shukla are design engineers at Paxonet Communications. Joji can be reached at joji@pune.paxonet.com; Raja can be reached at raja@pune.paxonet.com; Sailesh can be reached at sailesh@pune.paxonet.com; and Sunil can be reached at sunil@pune.paxonet.com.





Please sign in to post comment

Navigate to related information

EE Buzz DesignCon

Datasheets.com Parts Search

185 million searchable parts
(please enter a part number or hit search to begin)

Feedback Form