Design Article
What will be the next generation TOE design challenges?
Shridhar Mukund, Director of Engineering, Adaptec Inc., Milpitas, Calif.
4/14/2003 12:24 PM EDT
The design of first-generation TCP Offload Engines (TOEs) leaves ample room for improvement. While general-purpose processor solutions provided a reasonable start in this area, an on-chip multi-processor based ASIC offers better scalability in functionality, throughput, footprint, power and cost. However, engineers must prepare to face complex design and programming issues on these TOEs.
The TOE is essentially another processor connected to the end host memory, but one specialized to handle network transport functions. It relieves the load on the host CPU and memory bus by handling copies and data transformations. The TOE also amortizes user-kernel transitions and context switches over large data transfers to and from the host memory.
A next-generation TOE, or transport processor, essentially will turn a TCP/IP network into a memory-to-memory fabric. It can manage a socket byte stream for TCP, files for NAS, blocks for iSCSI, message queues for inter-process communications, and virtual memory segments for iWARP. Off-loaded functions may include Denial-of-Service filters at the IP level, in-line IPsec algorithms, a built-in zero-copy data path for iSCSI, NAS data paths, and a generalized zero-copy data path for iWARP.
TOE vendors may choose to implement control path work on the host with or without the cooperation of the host operating system. Designers can also choose whether to let the host establish and terminate TCP connections.
There are a number of approaches to designing next-generation TOEs that vary based on factors such as target application, throughput, price point, level of integration, and vendor core skill set. Overall, a general-purpose processor approach can speed time-to-market, while an ASIC scales to fatter pipes, enables vertical features, and reduces cost.
While general-purpose programming works well for lower throughputs and for the initial phase before price erosion sets in, this approach makes it difficult to scale to higher throughputs, smaller footprints and lower power, and to contain overall cost. Although general-purpose processors feature hierarchical caching, they use a single memory view, which is an obstacle to this scalability. General-purpose processors may be improved by the integration of interface logic and several ad-hoc point accelerators, such as checksum and CRC computation blocks.
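As an illustration of the kind of work a checksum point accelerator takes off the processor, here is a minimal software sketch of the 16-bit ones' complement Internet checksum described in RFC 1071. The function name is hypothetical and the code is not drawn from any particular TOE; it only shows the arithmetic such a block would perform in hardware.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Treat each byte pair as one big-endian 16-bit word.
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold any carry out of the low 16 bits back into the sum.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

A receiver can run the same routine over the data with the transmitted checksum appended; a result of zero indicates that the checksum is consistent with the data.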
The ASIC approach involves partitioning the problem into data and control planes. The data plane is implemented on an ASIC using programmable pipeline processing elements, while the control plane is implemented on either an embedded general-purpose processor subsystem or the host processor itself.
The data plane is mapped to an on-chip micro-network of programmable pipeline processors so that, for the most part, the data flow does not meander in and out of the chip. The ASIC should consist of tens of parameterized processors with simple point-to-point channels, organized to solve the transport data plane class of problems efficiently. This increases the effective memory bandwidth and processing power by over an order of magnitude.
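The dataflow idea, in which each processing element does one job and hands its result to the next over a point-to-point channel, can be sketched in software with chained generators. This is only a sequential model of the data movement, not of the concurrency or the hardware; the stage names, the frame layout (a 4-byte sequence number followed by payload) and the byte-sum digest are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

def parse_stage(frames: Iterable[bytes]) -> Iterator[Tuple[int, bytes]]:
    """Split each frame into a 4-byte sequence number and its payload."""
    for frame in frames:
        yield int.from_bytes(frame[:4], "big"), frame[4:]

def filter_stage(segments: Iterable[Tuple[int, bytes]]) -> Iterator[Tuple[int, bytes]]:
    """Drop empty segments, standing in for a header-validation step."""
    for seq, payload in segments:
        if payload:
            yield seq, payload

def digest_stage(segments: Iterable[Tuple[int, bytes]]) -> Iterator[Tuple[int, bytes, int]]:
    """Attach a toy 16-bit byte-sum digest to each segment."""
    for seq, payload in segments:
        yield seq, payload, sum(payload) & 0xFFFF

def run_pipeline(frames: Iterable[bytes]) -> List[Tuple[int, bytes, int]]:
    """Connect the stages; each one pulls only from its upstream neighbor."""
    return list(digest_stage(filter_stage(parse_stage(frames))))
```

Because each stage sees only the output of the stage before it, the intermediate data never has to pass through a shared memory, which is the property the on-chip micro-network is meant to exploit.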
Transport processing for tasks such as encryption, authentication, data digest, checksum, context lookup, and manipulating payload bounds in out-of-order segments is very memory intensive. There may still be an off-chip memory, but not in the performance path. With an ASIC, the memory throughput requirement is small enough to enable a single-chip LAN-on-motherboard implementation, where the host memory doubles as the off-chip memory.
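To make "manipulating payload bounds in out-of-order segments" concrete, the following sketch tracks which byte ranges of a stream have arrived and reports how far the in-order data extends. It is a simplified model under stated assumptions: ranges are half-open byte offsets, sequence-number wraparound is ignored, and the function names are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)

def add_segment(ranges: List[Range], start: int, end: int) -> List[Range]:
    """Record a received range, merging it with overlapping or adjacent ones."""
    merged = []
    for s, e in ranges:
        if e < start or s > end:
            merged.append((s, e))  # disjoint and not touching: keep as is
        else:
            # Overlapping or adjacent: absorb into the range being inserted.
            start, end = min(s, start), max(e, end)
    merged.append((start, end))
    merged.sort()
    return merged

def in_order_bound(ranges: List[Range], base: int) -> int:
    """Return the offset up to which data is contiguous starting at base."""
    for s, e in ranges:
        if s <= base <= e:
            return e
    return base
```

Every arriving segment triggers a lookup and an update of this per-connection state, which suggests why such bookkeeping is memory intensive when multiplied across many connections.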
Getting memory fast
Using tens of simple processors that are largely parameterized in memory dimensions is an effective way to address a TOE's need for fast memory access. Each processor complex is then designed as a modular sub-chip that is small enough not to fall into deep sub-micron traps. The channels that traverse these sub-chips need to be designed carefully to circumvent long-wire issues.
Such an ASIC brings an interesting set of design automation challenges that won't be well addressed by applying the traditional RTL-over-the-wall approach. However, a handful of technology developments have opened up new methods for design-for-layout, design-for-timing, design-for-skew, design-for-repair, and so on. The key is to have automated methods in place to rapidly traverse between architectural adjustments and near-layout, so a what-if analysis can be made efficiently to prevent costly rework and to speed the design of derivatives.
Programming and validation are by far the most important aspects of TOE design. With the ASIC approach, the control plane is responsible for the bulk of the programming complexity. Over 80 percent of code lines tend to be in the control plane. Therefore, it is imperative that the control plane be programmed in C under a well-known OS environment.
Programming tens of processors on the data plane ASIC, however, can be exponentially more complex than programming a single control plane general-purpose processor. The solution is to plan the architecture and tools up front, so as to provide the programmer with a single-processor view of tens of concurrent processes.
Thus, the data plane programming needs to be at a fairly high level. As long as data objects are not visualized as bits and bytes, moving the language control structure closer to assembly programming works well. The key is to have an environment for co-design, where there is rapid feedback between coding and debugging its effect in a near-real system. It helps to have a bus functional model that runs several orders of magnitude faster than RTL, followed by a systematic method to verify that model against the RTL.
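One way to picture checking a fast model against a slower reference is a differential test that feeds both the same inputs and reports the first disagreement. In the sketch below, a bit-serial CRC-32 routine stands in for the slow, RTL-like reference and Python's zlib.crc32 stands in for the fast functional model. Both stand-ins, the function names, and the choice of CRC-32 are assumptions made for illustration, not a description of an actual verification flow.

```python
import random
import zlib
from typing import Callable, Optional

def crc32_bitwise(data: bytes) -> int:
    """Bit-at-a-time CRC-32 (reflected polynomial 0xEDB88320)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; apply the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def find_mismatch(fast: Callable[[bytes], int],
                  reference: Callable[[bytes], int],
                  trials: int = 200,
                  seed: int = 0) -> Optional[bytes]:
    """Return the first random input on which the two models disagree."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        if fast(data) != reference(data):
            return data
    return None
```

Calling find_mismatch(zlib.crc32, crc32_bitwise) returns None when the two agree on every generated input; a returned byte string is a reproducible counterexample to debug.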



