Design Article
A new approach to framing for 10Gbit/s TCP offload engines
Uri Elzur, Strategic Marketing Manager, Broadcom Corp., Irvine, Calif.
4/14/2003 12:27 PM EDT
Problems in TCP framing over Ethernet networks must be addressed in next-generation TCP Offload Engines (TOEs) to allow a server and its I/O subsystem to scale to 10 Gbit/second data rates for storage and other applications. Marker-based Protocol Data Unit alignment (MPA) solves these problems by providing data placement services for multiple applications.
The framing problem crops up in storage networks that use the iSCSI protocol layered on top of TCP and processed on a TOE chip. When a TOE/iSCSI receiver obtains a TCP segment and tries to place it, its first challenge is to find the beginning of the current iSCSI header. It does so by taking the position of the preceding header within the TCP byte stream and adding the length of the previous Protocol Data Unit (PDU).
When TCP segments arrive out of order, the receiver may not have received the TCP segment with the iSCSI header. Therefore the receiver has no way to determine where to place the data received in the current TCP segment. This is the framing problem.
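The dependency can be illustrated with a minimal Python sketch. It assumes a simplified, hypothetical PDU format with a 4-byte length header (the real iSCSI header is larger and carries more fields), and the names make_pdu, walk_headers and receive are illustrative only. The receiver can find each header only by walking the chain from the previous one, so a single missing segment hides every header behind it.

```python
import struct

HEADER_SIZE = 4  # assumption: toy 4-byte length header, not the real iSCSI header


def make_pdu(payload):
    """Prefix a payload with its length to form a toy PDU."""
    return struct.pack("!I", len(payload)) + payload


def walk_headers(buf, have):
    """Return the stream offsets of every header the receiver can locate.

    buf  -- receive buffer the size of the stream (missing bytes are zero)
    have -- per-byte flags, nonzero where the byte has actually arrived
    """
    headers = []
    pos = 0
    while pos + HEADER_SIZE <= len(buf) and all(have[pos:pos + HEADER_SIZE]):
        headers.append(pos)
        (length,) = struct.unpack_from("!I", buf, pos)
        # Next header = position of this header + length of this PDU.
        pos += HEADER_SIZE + length
    return headers


def receive(buf, have, stream, start, end):
    """Model the arrival of the TCP segment covering stream[start:end]."""
    buf[start:end] = stream[start:end]
    have[start:end] = b"\x01" * (end - start)


stream = make_pdu(b"a" * 10) + make_pdu(b"b" * 20) + make_pdu(b"c" * 30)
buf, have = bytearray(len(stream)), bytearray(len(stream))

receive(buf, have, stream, 0, 24)    # first segment arrives
receive(buf, have, stream, 48, 72)   # third segment arrives early
before_gap_filled = walk_headers(buf, have)

receive(buf, have, stream, 24, 48)   # the delayed segment finally arrives
after_gap_filled = walk_headers(buf, have)
```

In this toy stream the third header sits at offset 38, inside the delayed segment. Until that segment arrives, the walk stops after the headers at offsets 0 and 14, and the bytes of the early segment cannot be attributed to any PDU.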
Because TCP is a byte-stream protocol with no knowledge of the upper-layer protocol, it has no way to alert the receiver to the boundaries of the upper-layer PDUs, iSCSI PDUs in this case. A receiver may drop out-of-order segments, but that hurts performance on a high-speed link. All of the dropped segments would have to be retransmitted, and TCP could reduce the allowed bandwidth on the link, resulting in a long recovery time.
Therefore, today's Gbit iSCSI/TOE adapter cards feature a large buffer memory pool for TCP segment reassembly. Since the required buffer size is the product of the TCP connection bandwidth and the end-to-end delay, the buffer scales with the network's speed, which increases cost and deters potential users.
As it turns out, the bandwidth required for that memory is at least twice the wire speed, which calls for a very challenging high-speed memory design. Thus, the TOE must have more pins and additional logic in order to interface to a 64-bit or 128-bit wide memory, further increasing the cost of the TOE.
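As a rough illustration of both sizing arguments, the Python sketch below computes the bandwidth-delay product and the minimum memory transfer rate. The 1 ms end-to-end delay is an assumed figure chosen only for the example, the function names are hypothetical, and the factor of two reflects the assumption that every received byte is written to the reassembly memory once and read back once.

```python
def reassembly_buffer_bytes(link_bits_per_s, delay_us):
    """Bandwidth-delay product for one connection, in bytes."""
    return link_bits_per_s * delay_us // 8_000_000


def min_memory_transfers_mhz(link_gbits_per_s, bus_width_bits):
    """Minimum transfer rate (million transfers/s) for a memory bus that
    must sustain twice the wire speed (one write plus one read per byte)."""
    return 2 * link_gbits_per_s * 1000 / bus_width_bits


DELAY_US = 1000  # assumption: 1 ms end-to-end delay

for gbits in (1, 10):
    size = reassembly_buffer_bytes(gbits * 10**9, DELAY_US)
    print(f"{gbits} Gbit/s: {size} bytes of buffer per connection")

for width in (64, 128):
    rate = min_memory_transfers_mhz(10, width)
    print(f"10 Gbit/s, {width}-bit memory: at least {rate} million transfers/s")
```

Under these assumptions the buffer grows from 125,000 bytes at 1 Gbit/s to 1,250,000 bytes at 10 Gbit/s for a single connection, and a 64-bit memory would need at least 312.5 million transfers per second before any protocol or refresh overhead is counted.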
Flow-through TOE
The optimal solution to this framing problem is to build a flow-through TOE. All of the data received by such a TOE, whether in order or out of order, can flow immediately from the TOE to host memory, eliminating the cost and complexity associated with the TOE's TCP reassembly buffer. To achieve this immediate flow-through, the TOE needs to know the placement information of each TCP segment it receives.
MPA is an approach to solving the framing problem adopted by the RDMA Consortium (www.rdmaconsortium.org), an industry group developing Remote Direct Memory Access technology to reduce data copies and protocol stack processing in TCP/IP networks. With MPA, every TCP segment is self-describing and carries all of the information required to place its payload data in the host buffer provided by the application.
MPA allows the receiver to use a much smaller buffer that can be integrated on-chip, saving the cost and complexity of a dedicated memory subsystem. It also removes the buffer size's dependency on the network's speed and on the number of connections supported.
Under MPA, the sender encapsulates the placement information immediately following the TCP header, saving the receiver the need to search for it inside the TCP segment. This is known as PDU alignment. It significantly reduces the receiver's buffer size and algorithmic complexity and allows for a flow-through design.
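The effect of self-describing segments can be sketched in Python. The placement header below (a 2-byte buffer identifier and a 4-byte offset) is a hypothetical layout invented for this illustration, not the format defined by the RDMA Consortium, and build_segment and place_segment are illustrative names. Because each segment says where its payload belongs, the receiver can copy it straight into the application buffer in whatever order segments arrive.

```python
import struct

# Assumption: toy placement header = 2-byte buffer id + 4-byte byte offset.
PLACEMENT_HDR = struct.Struct("!HI")


def build_segment(buffer_id, offset, data):
    """Sender side: put the placement information first in the TCP payload."""
    return PLACEMENT_HDR.pack(buffer_id, offset) + data


def place_segment(host_buffers, segment):
    """Receiver side: copy the payload directly into the target host buffer."""
    buffer_id, offset = PLACEMENT_HDR.unpack_from(segment, 0)
    data = segment[PLACEMENT_HDR.size:]
    target = host_buffers[buffer_id]
    if offset + len(data) > len(target):
        raise ValueError("segment would overrun the host buffer")
    target[offset:offset + len(data)] = data
    return offset, len(data)


message = bytes(range(256)) * 4                      # 1024 bytes of test data
segments = [build_segment(7, off, message[off:off + 256])
            for off in range(0, len(message), 256)]

host_buffers = {7: bytearray(len(message))}
for index in (2, 0, 3, 1):                           # deliver out of order
    place_segment(host_buffers, segments[index])

assert bytes(host_buffers[7]) == message             # no reassembly buffer used
```

The sketch leaves out everything a real TOE must also handle, such as TCP checksums, acknowledgments and buffer registration. It is meant only to show why placement needs no per-connection reassembly memory once the information travels with each segment.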
MPA is based on a marker placed inside the TCP byte stream at intervals of 512 bytes, such that it is present in large TCP segments and the receiver needs no further help to locate it. However, instead of pointing to the next PDU, it points to the beginning of the current TCP segment, where the placement information is expected to be.
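A minimal Python sketch of the marker mechanism follows. It assumes a 4-byte marker holding the distance back to the start of the PDU that contains it, and it uses plain stream offsets rather than real TCP sequence numbers. Both choices, along with the names insert_markers and find_pdu_start, are simplifications for illustration and not the MPA wire format. When the sender aligns PDUs with TCP segments, the position recovered from the marker is also the start of the segment.

```python
import struct

MARKER_INTERVAL = 512  # from the article: one marker every 512 bytes
MARKER_SIZE = 4        # assumption: 4-byte big-endian back pointer


def insert_markers(pdus):
    """Sender side: concatenate PDUs into one byte stream, inserting a marker
    at every stream position that is a multiple of MARKER_INTERVAL."""
    stream = bytearray()
    for pdu in pdus:
        pdu_start = len(stream)
        for byte in pdu:
            if len(stream) % MARKER_INTERVAL == 0:
                # The marker records how far back the current PDU began.
                stream += struct.pack("!I", len(stream) - pdu_start)
            stream.append(byte)
    return bytes(stream)


def find_pdu_start(segment_offset, payload):
    """Receiver side: use the first marker in a segment to find the stream
    offset where the enclosing PDU starts, or None if no marker is present.

    segment_offset -- stream offset of the first payload byte
    """
    first_marker = (-segment_offset) % MARKER_INTERVAL
    if first_marker + MARKER_SIZE > len(payload):
        return None
    (back,) = struct.unpack_from("!I", payload, first_marker)
    return segment_offset + first_marker - back


stream = insert_markers([b"A" * 700, b"B" * 900])

# An aligned segment that starts exactly where the second PDU starts.
aligned = find_pdu_start(708, stream[708:])
# A segment that starts in the middle of the second PDU.
mid_pdu = find_pdu_start(900, stream[900:])
# A short segment that contains no marker at all.
no_marker = find_pdu_start(4, stream[4:100])
```

In this sketch the receiver needs no state from earlier segments: it computes where the next marker must fall from the segment's own position and reads the pointer. A recovered position that differs from the segment's own start, as in the mid_pdu case, indicates that the segment does not begin with a PDU.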
It is also critical to guarantee that data is not accidentally interpreted as placement information, which was a potential shortcoming of statistical approaches to locating the placement information inside the TCP segment. Therefore, MPA also allows the receiver to detect whether an intermediate box in the path between the sender and receiver, such as a firewall, is re-segmenting the TCP byte stream and thereby changing the position of the placement information inside the TCP segment.
MPA-based 10-Gbit/s TOE hardware can now become application-agnostic and can provide data placement services to any application layered on top of the RDMA protocol stack. This saves engineers the effort of building specific hardware that is limited to just one application. It also allows the TOE to add functionality to I/O processing in a more standard way, enabling greater convergence within the data center.



