Design Article

Per-packet load balancing ups IPSec processor performance

Dan Eakins, Senior Product Line Manager, Security Line of Business, Broadcom Corp., Irvine, Calif.

4/22/2002 8:25 AM EDT

The majority of computer and iAppliance users connect to their networks simply to exchange e-mail messages and small files such as Word documents or PowerPoint presentations. These applications are not likely to tax the current single-CPU-core approach to security processing, and when numerous users are sending low-volume traffic, the limitations of traditional session-based balancing schemes are not very noticeable. However, even a single high-bandwidth application can expose the weakness in this design, because the cores simply can't keep up with the fastest traffic.

But this problem is going to become more apparent in the near future. Security systems are becoming more common, and the result will be more data streams that need to be encrypted and decrypted. New applications that require higher-bandwidth data streams will also become more common. Videoconferencing is mentioned frequently, but downloading files from remote databases could be just as taxing to a network's security engines.

Complicating the issue is the fact that networks today are much faster than they were just a few years ago. Individual machines are connected to corporate LANs with Fast Ethernet cards running at 100 Mbit/sec or more, networked servers sit on switched gigabit links, and broadband is increasingly common in the home. High-end routers often must process data streams running at 10 Gbit/sec, and the optical core systems are much faster than that. A few years ago these speeds seemed impossible; a few years from now those rates may seem antiquated and slow.

Adding security protection to these multi-gigabit data flows may seem like an impossible task, but users are asking for it now, and will demand it very soon. However, the security chips available today for use in routers and switches and in the VPN hardware and software resident on a server can hardly keep up with these rates.

Until recently, the fastest parts on the market could process data at about 200 Mbit/sec, an impressive rate but hardly enough to keep up with gigabit-class routers. Upcoming designs are moving security processors into the gigabit realm, and the market is eagerly waiting for secure communications at gigabit data rates.

Security processors have an endless task. Unlike computing-based systems, where the processors can work through some problems and put other work aside in a buffer to pick up later, communications systems are almost always running. Any buffer of incomplete tasks would quickly fill up, and the processor would never have enough downtime to complete these earlier jobs. As a result, communications systems require both the network processors and the encryption engines to run at wire speed, processing every single packet as it comes in. The complexity of the computations needed to ensure data integrity and privacy on a per-packet basis at gigabit rates poses unique problems for manufacturers of security components.

One of the most common techniques for increasing performance in encryption chips is to integrate multiple processing cores onto a single die, because it is not feasible to build single engines capable of multi-gigabit data rates. This offers many advantages because two or four cores can do more work than one powerful processing core. However, it is often difficult to keep every core processing data at full efficiency while also presenting the functional equivalent of a single, multi-gigabit encryption and authentication engine. Optimizing the processing capabilities of many engines is one of the most difficult, and most important, tasks in designing a multi-core security processor.

In the IP security market, corporate virtual private network (VPN) systems must host thousands of connections from remote users simultaneously. One of the most common techniques for balancing the load on each core is to assign each individual connection session to one specific processor. This has the advantage of not requiring the processor to route every single packet, but simply to direct each session stream to the same core.

However, this approach has a critical flaw: if there are two simultaneous sessions, with one user sending only a few e-mail messages while the other is hosting a videoconference, these sessions are far from equal. Under this session-based load-balancing approach, one processor would quickly encrypt and decrypt the e-mail traffic and then look for new work. At the same time, the core that is handling the videoconference would be overtaxed encrypting and authenticating its data-intensive application.

The first core, now idle, cannot assist in the processing because the videoconference stream can only be directed to the original core. When this simple example of two connections is expanded to several thousand connections, each with unique processing requirements, session-based load balancing quickly becomes a less-than-ideal solution for optimizing security processing tasks. With session-based load balancing, the performance available to any one session is capped by the single core that serves it, so the overall performance of the system is limited by the efficiency of each individual core.
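The imbalance in this two-session example can be illustrated with a short Python sketch. The pinning rule (session ID modulo the number of cores) and the packet counts and sizes are assumptions chosen for illustration, not details of any particular chip.

```python
def session_based_load(packets, num_cores):
    """Return the bytes handled by each core when sessions are pinned to cores.

    packets: iterable of (session_id, size_in_bytes) tuples.
    Assumption: a session is pinned to core (session_id % num_cores).
    """
    load = [0] * num_cores
    for session_id, size in packets:
        load[session_id % num_cores] += size
    return load


# Hypothetical traffic: session 0 sends a few small e-mail packets,
# session 1 is a steady videoconference stream of large packets.
email = [(0, 200)] * 5
video = [(1, 1500)] * 1000

print(session_based_load(email + video, num_cores=2))
```

In this made-up workload, one core ends up with 1,000 bytes of work and the other with 1,500,000 bytes, and nothing in the pinning rule lets the idle core help.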

In the design of multi-gigabit security processors, we decided to take a slightly different approach. In our design, we use parallel processing to increase performance into the multi-gigabit range. But instead of load balancing on a per-session basis, we implemented load balancing on a per-packet basis. Instead of assigning each session to a single core, the chip assigns each individual packet to a core for processing.

This technique goes a long way toward eliminating the inefficiencies and overall throughput limitations of the session-based approach. Load balancing by packets offers several advantages. Because security systems must process data at wire speed without disrupting traffic patterns, packets must exit the box in the correct order. In systems that assign a single stream to the same core, the core must work sequentially, finishing each packet before moving to the next. But if the system can assign packets from the same session to different cores, it can speed up the overall processing task by allowing one core to work on a large packet while smaller packets are processed by different cores and then buffered. Once the larger packet has been fully processed, the chip takes the waiting data out of the buffer, correctly orders the data stream, and sends it on to its destination.
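The buffering-and-reordering step can be sketched in Python as a small reorder buffer. This is a minimal software illustration, not the chip's actual mechanism; the `ReorderBuffer` class and the idea of tagging each packet with an arrival sequence number are assumptions made for the example.

```python
class ReorderBuffer:
    """Hold finished packets until every earlier packet has been released."""

    def __init__(self):
        self.next_seq = 0   # sequence number that must leave the box next
        self.pending = {}   # seq -> processed packet waiting for its turn

    def complete(self, seq, packet):
        """Record that a core finished `packet`.

        Returns the list of packets that can now be sent, in arrival order.
        """
        self.pending[seq] = packet
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


buf = ReorderBuffer()
# The two small packets (seq 1 and 2) finish before the large one (seq 0).
print(buf.complete(1, "small-1"))   # nothing can leave yet
print(buf.complete(2, "small-2"))   # still waiting on seq 0
print(buf.complete(0, "large-0"))   # everything leaves, in order
```

The small packets wait in the buffer only until the large packet ahead of them is done, so the cores that processed them are already free for new work.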

This approach allows our design not only to load-balance on a per-packet basis, but also to assign packets to specific processing cores based on packet size. The design is equally able to process small, 64-byte packets or large, 1,536-byte packets without performance limitations, because it measures the size of each packet and then assigns it to the core that is ready for that job. Session-based approaches cannot do this; instead they assign new sessions to cores randomly and are unable to determine the length of the incoming data stream or the size of individual packets.
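To show why measuring packet size matters, the following Python sketch compares a size-blind dispatcher with a size-aware one. Both policies are assumptions for illustration: the size-blind one is plain round-robin, and the size-aware one sends each packet to the core with the fewest outstanding bytes. The article does not specify the actual assignment rule used in the chip.

```python
def round_robin(packet_sizes, num_cores):
    """Size-blind dispatch: packet i goes to core i % num_cores."""
    load = [0] * num_cores
    for i, size in enumerate(packet_sizes):
        load[i % num_cores] += size
    return load


def least_loaded(packet_sizes, num_cores):
    """Size-aware dispatch: each packet goes to the core with the fewest
    outstanding bytes (ties go to the lowest-numbered core)."""
    load = [0] * num_cores
    for size in packet_sizes:
        core = load.index(min(load))
        load[core] += size
    return load


# Hypothetical stream alternating large and small packets.
sizes = [1536, 64] * 4

print(round_robin(sizes, 2))    # [6144, 256]
print(least_loaded(sizes, 2))   # [3200, 3200]
```

With this alternating pattern, the size-blind policy happens to send every large packet to the same core, while the size-aware policy splits the 6,400 bytes evenly.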

Furthermore, session-based designs must allow for significant overhead in their capabilities. Since the system does not determine the size of each packet, it must be prepared to process any packet, no matter the size. When the smallest packets arrive for processing, the penalty can reach 30 percent: processing power that was allocated for a large packet is instead wasted. Load balancing on a per-packet basis eliminates this waste.

So, does this design really improve performance? A common test for such devices is to connect a single gigabit data stream to the engine and watch it go to work. By definition, designs that assign each session to one core will allow only a single core to process that stream. A chip that has four cores and advertises itself as capable of 1 Gbit/sec throughput would not be able to keep up with a 1 Gbit/sec data stream, because that stream of packets would be assigned to only one core, which is one-fourth of the chip's entire processing capability. The likely result is that the fastest data stream the chip could handle would be 250 Mbit/sec.
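The arithmetic behind the 250 Mbit/sec figure can be written out as a small Python sketch. It assumes that all cores are identical and that the chip's advertised rate is simply the sum of the per-core rates; the function name is hypothetical.

```python
def single_stream_limit_mbps(chip_mbps, num_cores, per_packet):
    """Fastest single data stream the chip can sustain, in Mbit/sec.

    Assumptions: identical cores, and an advertised chip rate equal to
    the sum of the per-core rates.
    """
    per_core_mbps = chip_mbps / num_cores
    # Per-session balancing pins the stream to one core;
    # per-packet balancing can spread it across all of them.
    return chip_mbps if per_packet else per_core_mbps


print(single_stream_limit_mbps(1000, 4, per_packet=False))  # 250.0
print(single_stream_limit_mbps(1000, 4, per_packet=True))   # 1000
```

Under these assumptions, adding cores to a per-session design raises the advertised aggregate rate but lowers the share of it that any one stream can use.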

By varying packet sizes from 1,500 bytes down to 64 bytes, and including different authentication (SHA-1, MD5) and encryption (3DES and AES) types, one can also judge the overall efficiency of the security engine in processing a variety of packet types in a worst-case network application. In contrast to the session-based design, a chip that spreads the packets of an individual session across multiple processing cores can harness all the horsepower of the chip to keep up with even the fastest data streams, no matter the packet size or the combination of ciphers and authentication types.

Using this testing approach, we found that our design demonstrated processing speeds of up to 2 Gbit/sec, even on continuous streams consisting of only small packets. Future versions with more encryption cores will be able to scale this performance. Competing designs that must balance tasks on a per-session basis will not be able to scale their performance upward by implementing more cores, because even with more cores they still can't take on faster data streams if each stream must be assigned to just one engine.




