Design Article

Infiniband jumps design hurdles

Steven J. Sears, Member of Technical Staff, Network Appliance Inc., Sunnyvale, Calif.

4/14/2003 12:07 PM EDT

Infiniband is the right technology to bust through bottlenecks that many designers are hitting in high-speed networked system design. Features such as Infiniband's remote direct memory access (RDMA) capability, coupled with the direct-access transport (DAT) application programming interface, will provide world-class file access performance for Network Appliance's 900 series Filers.

Infiniband solves problems we have hit on our way to fast CPUs and networks. When we benchmarked our 1-Gbit Ethernet networks with the new 1-GHz CPU workstations, we were underwhelmed. In fact, results were about 40 percent of the expected throughput. And the CPU was running at 100 percent utilization, consumed by running the networking stack at full speed.
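As a rough sanity check on those benchmark figures, the sketch below estimates how many CPU cycles were spent per byte moved. It assumes that "expected throughput" means the 1-Gbit/s line rate and that the whole 1-GHz CPU was consumed by the networking stack; both are assumptions made for illustration, and the function name is hypothetical.

```python
def cpu_cycles_per_byte(cpu_hz, line_rate_bps, achieved_fraction, cpu_utilization=1.0):
    """Estimate CPU cycles spent per payload byte actually delivered."""
    achieved_bytes_per_s = line_rate_bps * achieved_fraction / 8
    return cpu_hz * cpu_utilization / achieved_bytes_per_s


# Assumed inputs: 1-GHz CPU at 100 percent, 1-Gbit/s link, 40 percent of line rate.
cycles = cpu_cycles_per_byte(cpu_hz=1e9, line_rate_bps=1e9, achieved_fraction=0.40)
print(f"about {cycles:.0f} cycles per byte")
```

Under these assumptions the stack costs on the order of 20 cycles per byte, which is why a faster wire alone does not help: the same per-byte cost simply saturates the CPU at a smaller fraction of the link.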

Given that the CPU defines the speed wall, using multiple network-interface cards only exacerbates the problem. Using multiple servers to make up the bandwidth difference is costly, complicated and error-prone.

Our workstations hit the CPU/network wall without too much trouble using Gigabit Ethernet. And 10-Gbit Ethernet isn't going to provide a performance improvement if we continue with the current TCP/IP architecture.

TCP/IP was designed with a model of relatively slow, lossy networks where the wire times so dominated the transfer of data that anything done to reduce network traffic was considered a good thing. Architects didn't anticipate networking speeds approaching bus speeds. Today, networked data that needs to be processed, inspected or in any way touched by the CPU will generally run into a speed wall.

The pain points of the TCP/IP stack that designers need to address are data copies, user/kernel boundary crossings and data touches such as checksums and verification mechanisms.

Data copying for speed

RDMA addresses problems of data copying in today's high-speed Ethernet systems. In the TCP/IP receive path, data is copied between the network interface and the kernel, and again between the kernel and the application buffers: two copies and at least three bus crossings before the data is placed in its final destination. An RDMA implementation will cross the bus exactly once and place the data precisely where it is needed, with zero copies.
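The copy and bus-crossing counts above translate directly into memory-bus traffic. The following is a minimal sketch of that bookkeeping, using the article's counts (three crossings for TCP/IP, taken as the lower bound, and one for RDMA). The class and constant names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReceivePath:
    name: str
    copies: int         # copies made before the data reaches its final buffer
    bus_crossings: int  # times each payload byte crosses the memory bus

    def bus_bytes(self, payload_bytes: int) -> int:
        """Total bytes moved across the bus to deliver one payload."""
        return payload_bytes * self.bus_crossings


# Counts taken from the text; the TCP/IP figure is "at least" three crossings.
TCP_IP = ReceivePath("TCP/IP", copies=2, bus_crossings=3)
RDMA = ReceivePath("RDMA", copies=0, bus_crossings=1)

payload = 1_000_000  # bytes received from the network
for path in (TCP_IP, RDMA):
    print(f"{path.name}: {path.copies} copies, "
          f"{path.bus_bytes(payload)} bytes across the bus")
```

For the same payload, the TCP/IP path moves at least three times as many bytes across the bus as the RDMA path, before counting any checksum or verification passes over the data.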

RDMA works exactly like DMA between a local memory and a device: It copies a memory region without CPU intervention. But it does so across a network, between memories located on different machines. RDMA introduces keys that securely open regions of memory to write or read operations from a remote machine. One machine grants keys to another for specific operations and can revoke keys at any time. Since a key is specific to a region of memory, an application is unable to access random memory on the remote machine or to scribble where it isn't allowed.
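The key mechanism can be illustrated with a toy, single-process model. This is not the Infiniband Verbs or DAT API; every class and method name below is hypothetical, and the real checks are performed by the adapter hardware rather than by Python code. The sketch only shows the rule described above: a key opens one region for specific operations and can be revoked at any time.

```python
import secrets


class AccessDenied(Exception):
    """Raised when a key does not authorize the requested access."""


class RemoteMemory:
    """Toy stand-in for memory on the target machine (illustration only)."""

    def __init__(self, size):
        self._mem = bytearray(size)
        self._grants = {}  # key -> (region start, region length, allowed ops)

    def grant(self, start, length, ops):
        """Open one region for the given operations and return its key."""
        if start < 0 or length < 0 or start + length > len(self._mem):
            raise ValueError("region lies outside memory")
        key = secrets.token_hex(8)
        self._grants[key] = (start, length, frozenset(ops))
        return key

    def revoke(self, key):
        """Withdraw a key; later accesses with it are refused."""
        self._grants.pop(key, None)

    def _check(self, key, op, offset, length):
        grant = self._grants.get(key)
        if grant is None:
            raise AccessDenied("unknown or revoked key")
        start, region_len, ops = grant
        if op not in ops:
            raise AccessDenied(f"key does not permit {op}")
        if offset < 0 or length < 0 or offset + length > region_len:
            raise AccessDenied("access outside the granted region")
        return start + offset

    def rdma_write(self, key, offset, data):
        addr = self._check(key, "write", offset, len(data))
        self._mem[addr:addr + len(data)] = data

    def rdma_read(self, key, offset, length):
        addr = self._check(key, "read", offset, length)
        return bytes(self._mem[addr:addr + length])


target = RemoteMemory(1024)
write_key = target.grant(start=128, length=64, ops={"write"})
target.rdma_write(write_key, 0, b"hello")  # allowed: inside the region
target.revoke(write_key)                   # the key is now useless
```

Offsets are relative to the granted region, so a key holder has no way to name memory outside it, which is the property that prevents scribbling on the remote machine.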

No one would design a computer today without DMA for devices. We believe that RDMA can do for networking what DMA does for local data transfer. Our experience leads us to believe that any network operating at 10 Gbits/s is going to need RDMA.

Network Appliance contributed to the development of the Direct Access File System (DAFS) protocol, an NFS-like network file system that has been tuned for RDMA networks. DAFS takes full advantage of RDMA for file requests.

RDMA was one of the keys to making file-based I/O backed by RAID outperform raw disk. We teamed with Fujitsu Siemens and Sybase to achieve record-breaking TPC-C benchmark numbers using the Emulex VI/IP Gigabit Ethernet adapter and DAFS. Indeed, a Fujitsu Primepower 850 server with 16 CPUs and six Emulex VI/IP adapters connected to NetApp storage systems using the DAFS file system protocol set records in March 2002 for best price/performance of any Unix system on the TPC-C benchmark (see www.tpc.org/results/FDR/TPCC/NTAP_fdr_fj850.pdf).

One of the early issues with Infiniband was the lack of an application programming interface (API). The standards body for Infiniband, the Infiniband Trade Association, carefully specified Verbs, which are functional definitions that describe the behavior of Infiniband devices, as legal operations and wire formats, but it didn't specify a common API for implementing them. Every vendor offered a different version of Verbs, sometimes significantly different. Early on it was clear that this was a barrier to Infiniband acceptance.

Fortunately, an industry group called the DAT Collaborative defined an API for accessing RDMA networks. It was designed with Infiniband in mind, but is general enough to encompass most networking protocols with RDMA capabilities. It is creatively known as the direct-access transport, or DAT, spec (see www.datcollaborative.org).

The DAT API is implemented in a direct-access programming library (DAPL), and is currently supported by most of the Infiniband vendors. It is also being discussed for use in upcoming 10-Gbit Ethernet networks. An open-source reference implementation of DAPL on Infiniband appears at http://sourceforge.net/projects/dapl/.

DAPL is designed to be a thin layer on top of a Verbs implementation or to replace a Verbs implementation altogether. DAPL is specified as a user-mode library, and performance-critical data transfer operations are initiated entirely in user mode without trapping to the kernel. Performance of early implementations is good (see http://osdn.dl.sourceforge.net/sourceforge/dapl/udapl_san_paper.pdf ).

Middleware presented another big obstacle to getting Infiniband off the ground. It appears that most of the Infiniband vendors built chips with the hope that someone else would build software. But it didn't happen that way.

Missing middle link

Initially, there was a choice of silicon and, interestingly enough, there were soon choices for Infiniband network-management tools. But missing from the picture was the stuff in the middle, the software needed by applications. Some of the management companies tried to fill the void, but the task often diverted them from revenue-producing products. That effort may have contributed to the demise of some of the Infiniband companies because it was a significant undertaking without much hope of remuneration.

Today, the core of Infiniband software is available from the vendors supplying the host channel adapters. Development work continues and products are maturing quickly. High-quality implementations are readily available.

But Infiniband still faces a tough market question: What will motivate end users to purchase the infrastructure necessary to run this new network? The answer: latency and bandwidth. On the Mellanox A0 silicon, Ohio State University reports Infiniband one-way latency of 7.5 microseconds and unidirectional bandwidth up to 822 Mbytes/s using commonly available machines (see http://nowlab.cis.ohio-state.edu/projects/mpi-iba/).
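To see how those two figures interact, a simple first-order model treats the time to deliver a message as the one-way latency plus the message size divided by the peak bandwidth. This model is an assumption for illustration, not something reported by the Ohio State measurements. The function names are hypothetical, and 1 Mbyte is taken as 10**6 bytes.

```python
def transfer_time_s(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Delivery time under a latency-plus-streaming model (an assumption)."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def effective_bandwidth(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually achieved for one message of this size."""
    return message_bytes / transfer_time_s(
        message_bytes, latency_s, bandwidth_bytes_per_s)


def half_bandwidth_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which effective bandwidth is half of the peak."""
    return latency_s * bandwidth_bytes_per_s


LATENCY_S = 7.5e-6        # one-way latency quoted in the text
PEAK_BYTES_PER_S = 822e6  # peak bandwidth, assuming 1 Mbyte = 10**6 bytes

for size in (64, 4096, 65536, 1048576):
    bw = effective_bandwidth(size, LATENCY_S, PEAK_BYTES_PER_S)
    print(f"{size:>8} bytes: {bw / 1e6:7.1f} Mbytes/s")
```

Under this model, messages of roughly 6 Kbytes already reach half of the peak bandwidth. Low latency therefore matters as much as raw bandwidth for the small and medium-size requests typical of file access.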

Despite market difficulties, we are now pushing ahead with Infiniband, a 10-Gbit/s network with RDMA capabilities. Anyone with even a passing familiarity with Infiniband will know that vendors have exited the market, and the demise of the technology is predicted regularly. Still, for all of the politics, the market pressures and the difficulty of introducing new products to a down market, Infiniband appears to deliver.

Thus, we believe Infiniband is coming into its own and will appeal to certain customers needing high bandwidth. No high-speed network will be useful for storage or networking unless it includes RDMA capability, because the CPU becomes the bottleneck. And it is clear to us that the DAT API will provide a significant benefit to applications using RDMA networks.

http://www.eet.com




