InfiniBand Networks for Enterprise Computing
Casimer DeCusatis
12/3/2002 12:00 AM EST
ABOUT THE AUTHOR
Dr. Casimer DeCusatis is a senior technical staff member with IBM Corporation, Poughkeepsie, New York, responsible for eServer technology roadmaps, and a co-author of the initial InfiniBand Architecture Specification (release 1.0). He is the recipient of several industry awards, including the 2002 IEEE Kiyo Tomiyasu Award for innovative applications of fiber-optic technology to computer systems, the 2000 EDN Innovator of the Year Award, and the 1999 Outstanding Young Electrical Engineer award from Eta Kappa Nu, the national electrical engineering honor society. Dr. DeCusatis is co-inventor of 38 issued patents, co-author of over 75 technical papers, and editor of the Handbook of Fiber Optic Data Communication (Academic Press; 2nd edition, 2002).


InfiniBand (IB) has received increasing attention as a high-speed interconnect technology that will compete with switched Ethernet for interchassis communication environments, and also as a compelling technology for server clustering. Recently published estimates of the server market opportunity underscore this trend.
In 2005, about 1.3 million servers are expected to ship in clustered environments worldwide; about a third of these servers will end up in IB clusters. If we include an equivalent number of server-to-storage IB nodes, and allow for up to 5 IB switch ports per node and a small additional number of IB-to-Fibre Channel adapters, the total market for IB interfaces becomes a significant fraction of the future clustered-computing environment.
The market need for IB is well established. While processor speed and storage capacity have continued to advance, driven by the economies of scale to be had from Moore's Law, the shared I/O bus remains a major bottleneck for high-performance servers, such as clustered multi-Pentium servers and workstations that aspire to enterprise-class performance, trying to run next-generation e-business applications. In this article, we describe features of the InfiniBand architecture that apply to enterprise-computing clusters and describe testbed results based on an IB fabric.
The desire to leverage mainframe-class technology to break this I/O bottleneck on commodity servers led two earlier industry consortiums, Next Generation I/O (NGIO) and Future I/O, to merge in August of 1999 and form the InfiniBand Trade Association (IBTA). The core group of companies that form its steering committee and current Board of Directors includes IBM, Intel, Hewlett Packard, Compaq, Dell, Microsoft, and Sun. These companies were joined by many others representing virtually every aspect of the computer and networking industry, including Sponsoring Members Agilent, Cisco, Brocade, 3Com, Adaptec, EMC, Fujitsu-Siemens, Lucent, Nortel, Hitachi, and NEC. At the end of 2001, there were 226 member companies participating in the IBTA.
The initial InfiniBand Architecture Specification was released to the industry on October 24, 2000, at the second annual InfiniBand Developers Conference. Since that time, there have been over 100 product announcements for silicon chipsets, adapter cards, switches, software, cables, and test equipment.
The value proposition of InfiniBand lies in a combination of performance improvement and better reliability (at least for entry level and midrange systems) in a scalable, low-cost architecture. The technology holds the potential to offer features and functions traditionally reserved for enterprise servers, combined with the potential for high-volume economies of scale and reduced total cost of ownership associated with industry standardization.
IB is a bottom-up design for high-performance interconnects, suitable for many different server platforms and applications ranging from storage area networks (SANs) to server clustering. Originally conceived as a replacement for PCI-type data buses, IB has broadened its focus to encompass client-server communications, server clustering, and storage as well. While microprocessor and memory subsystems have advanced at a rapid pace during the last decade, high-performance I/O has traditionally been reserved for large mainframes or enterprise servers, which implemented direct channel attachment for I/O devices. One objective of the IBTA was to introduce an I/O subsystem with significantly higher performance than conventional shared-bus architectures, with characteristics inspired more by the enterprise server environment (Table 1). Reductions in CPU utilization overhead and higher-bandwidth, lower-latency communications are important attributes of IB. The IB hardware reduces CPU overhead by off-loading much of the I/O communications workload from the CPU, a significant change from existing protocols such as TCP/IP. This provides features such as zero-copy data transfers with no kernel involvement.
| Bandwidth | | |
| Broadcast | | |
| Communications | | |
| Number of Devices Supported | (Max. 4 per Bus) | (Thousands) |
| Electrical Characteristics | Short Distance | Long Distance |
| Application Availability | Fault Domains | |
| Stack Overhead | | |
| Quality of Service | | |
Table 1: Comparison of shared I/O bus and InfiniBand fabric features
The topology of IB, shown in Figure 1, is a switched fabric that provides direct-channel connections via switches among all of the nodes in the network or subnet. IB also employs a message-passing fabric: the CPU is allowed to pass or drop off data intended for an I/O device into memory and move on to other tasks, rather than wait for the device to respond. Coupled with higher-speed links, this allows more efficient utilization of the high-speed microprocessors commonly used in today's servers. While existing enterprise servers likely won't see as much benefit from this approach, a fabric-centric, message-based interconnect also offers important advantages for server clustering.
This approach brings with it improvements in reliability, error detection, and fault isolation. IB provides fault zones in its message-passing fabric, such that a failure on an IB-attached I/O device will not compromise the rest of the server, as in a shared-bus architecture. You can configure redundant IB fabrics and transparent multi-pathing for high-availability applications. Although data delivery is still on a best-effort basis, similar to IP or Internet messaging, it also offers selectable Quality of Service (QoS) features which can be provisioned by specified service levels, automatic failover, and virtual lanes. It is anticipated that IB nodes will offer full concurrent maintenance, including dynamic add/remove capability for nodes and links. The shared I/O architecture of IB allows multiple servers to be clustered so that they can fail over to each other; this feature has the potential to provide increased reliability from commodity components.
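As a rough illustration of how service levels might be provisioned onto virtual lanes, consider the minimal Python sketch below. It assumes 16 service levels per packet and between 1 and 15 data virtual lanes per port, with lane 15 set aside for management traffic. The function name `build_sl_to_vl` and the simple modulo policy are hypothetical; in a real fabric this mapping would be provisioned by management software according to the QoS policy in force.

```python
# Illustrative sketch only: names and the modulo policy are assumptions,
# not taken from the InfiniBand specification or any vendor API.

SERVICE_LEVELS = 16   # assumed number of service levels a packet can carry
MANAGEMENT_VL = 15    # assumed lane reserved for management traffic


def build_sl_to_vl(data_vls: int) -> dict[int, int]:
    """Spread the service levels evenly across the available data lanes."""
    if not 1 <= data_vls <= 15:
        raise ValueError("expected between 1 and 15 data virtual lanes")
    # Service level n is carried on data lane n modulo the lane count.
    return {sl: sl % data_vls for sl in range(SERVICE_LEVELS)}


table = build_sl_to_vl(4)
for sl, vl in table.items():
    print(f"service level {sl:2d} -> virtual lane {vl}")
```

With four data lanes, each lane carries four service levels, so traffic classes that must not block one another can be placed on different lanes.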
IB performance is also more scalable than that of its predecessors; it can readily accommodate faster, higher-performance I/O devices, and performance of the IB subsystem is not reduced by the addition of extra storage or networking capacity. IB is designed to reduce high TCP/IP stack latencies between transaction servers, database servers, and load-balanced Web servers, as well as between servers and storage in future Internet data centers. Duplex link speeds can scale from 500 Mbytes/second to 6 Gbytes/second per link (by contrast, the Peripheral Component Interconnect standard, PCI-X, offers a peak burst bandwidth of 1 Gbyte/second half-duplex at 64 bits, with next-generation plans for 2-4 Gbytes/second still in the definition process).
There can be thousands of subnets in an IB fabric, each in turn serving thousands of nodes (servers, storage, switches, routers, network analyzers, and other devices). Since it is a layered architecture, IB allows the rate of technology evolution to drive changes and improvements in each layer. In this way, IB solutions have the potential to offer competitive cost/performance tradeoffs for every tier in the server market. IB is scalable in other ways as well; its channel architecture provides linkages between chips or across backplanes, as well as through network interface adapter cards. With node addressing based on the IPv6 standard, an IB switched fabric can effectively extend into the metropolitan or wide area network (WAN) environments. Thus, IB encompasses a system-area network which could find applications in remote backup systems for disaster recovery, including data mirroring, multi-site remote-server clustering, Linux clusters, and even grid computers. IB defines a communication and management infrastructure which supports both intra- and inter-processor communications, ranging from small single-processor server I/O to massively parallel supercomputers.
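The IPv6-based addressing can be illustrated with a short Python sketch. It assumes that a 128-bit IB global identifier is formed from a 64-bit subnet prefix followed by a 64-bit port identifier, and that the link-local prefix `fe80::/64` serves as a default. The function name `make_gid` and the port identifier value are hypothetical, chosen only for illustration.

```python
import ipaddress

# Assumption: a global identifier is a 64-bit subnet prefix followed by a
# 64-bit port identifier, so it can be written like an IPv6 address.
DEFAULT_SUBNET_PREFIX = 0xFE80_0000_0000_0000  # assumed link-local default


def make_gid(subnet_prefix: int, port_guid: int) -> ipaddress.IPv6Address:
    """Combine a subnet prefix and a port identifier into one 128-bit value."""
    if not 0 <= subnet_prefix < 2**64:
        raise ValueError("subnet prefix must fit in 64 bits")
    if not 0 <= port_guid < 2**64:
        raise ValueError("port identifier must fit in 64 bits")
    return ipaddress.IPv6Address((subnet_prefix << 64) | port_guid)


# Hypothetical port identifier, used only to show the resulting format.
gid = make_gid(DEFAULT_SUBNET_PREFIX, 0x0002_C903_0001_2345)
print(gid)  # fe80::2:c903:1:2345
```

Because the result has the same width and layout as an IPv6 address, a router at the edge of the fabric can carry it in an IPv6 header without a separate translation table.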
The IB specification defines three basic building blocks for switched fabrics: a host channel adapter (HCA), target channel adapter (TCA), and fabric switch (Figure 2). An optional fourth element is a network router for IB over the WAN, mapping the IB global identifier (GID) into an IPv6 frame header. When the IB channel connects with a router or other external network, it uses a similar type of channel adapter called an xCA. The HCA resides in the server node and provides the connection between system memory and the IB network. The HCA includes a programmable direct memory access (DMA) processor with address-translation and protection features that allow DMA operations to be initiated either locally or remotely (permitting a source to read or write directly to its target's memory address space). The TCA resides in the storage or I/O device network (such as Ethernet or Fibre Channel) and provides the connection to the IB network.
Though it is similar to the HCA and may implement the physical, link, and transport layers of the IB protocol, the TCA can be simplified according to the requirements of the attached devices. For example, you can implement the TCA as a very simple interface, replacing the SCSI attachment to a single hard disk. Direct connection between an HCA and TCA does not require processor intervention. Fabric switches provide the ability to interconnect up to several thousand nodes into a single network. Although the switches are intelligent enough to provide inter-subnet routing, management, topology discovery, and differentiated quality of service, they remain transparent to the fabric, neither generating nor consuming data packets, but simply passing them along based on the destination address in the packet's route header.
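The forwarding behavior described above, in which a switch neither generates nor consumes data packets, might be sketched in Python as follows. The class and field names (`Packet`, `FabricSwitch`, `forwarding_table`) are hypothetical, and the sketch assumes a simple table lookup keyed on the destination address in the route header; a real switch would populate such a table through fabric management.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    destination: int   # destination address from the packet's route header
    payload: bytes


@dataclass
class FabricSwitch:
    # Maps a destination address to an output port (hypothetical layout).
    forwarding_table: dict[int, int] = field(default_factory=dict)

    def add_route(self, destination: int, out_port: int) -> None:
        self.forwarding_table[destination] = out_port

    def forward(self, packet: Packet) -> tuple[int, Packet]:
        """Pick the output port; the packet itself passes through unchanged."""
        out_port = self.forwarding_table[packet.destination]  # KeyError if no route
        return out_port, packet


switch = FabricSwitch()
switch.add_route(destination=0x0012, out_port=3)
port, forwarded = switch.forward(Packet(destination=0x0012, payload=b"data"))
print(port, forwarded.payload)
```

The switch returns the same packet object it received, which mirrors the point that it stays transparent to the endpoints.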
Connections between these three components are provided by IB links, which are based on a 2.5 Gbit/second bidirectional (full duplex) serial connection with 8B/10B encoded data. This could be implemented, for example, using the same optical transceiver hardware as a telecom serial OC-48 physical layer, or as a GBIC which could easily swap between copper and optical interfaces. These connections can be striped in parallel to form four-lane wide (eight wire) or 12-lane wide (24 wire) connections as depicted in Table 2; IB supports optical links using both short-wave (SX) and long-wave (LX) transmitters as well as copper links.
Table 2: InfiniBand link types and signalling rates
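The relationship between lane count, signaling rate, and usable bandwidth can be worked through in a few lines of Python. The sketch assumes only the figures given in the text: 2.5 Gbit/second signaling per lane, 8B/10B encoding (8 data bits for every 10 bits on the wire), and widths of 1, 4, and 12 lanes. The function name `link_rates` is hypothetical.

```python
LANE_SIGNALING_MBPS = 2500   # 2.5 Gbit/second per lane, from the text
DATA_BITS, LINE_BITS = 8, 10  # 8B/10B encoding: 8 data bits per 10 line bits


def link_rates(lanes: int) -> dict[str, int]:
    """Return signaling and data rates for a link of the given lane width."""
    signaling_mbps = lanes * LANE_SIGNALING_MBPS
    data_mbps = signaling_mbps * DATA_BITS // LINE_BITS   # remove encoding overhead
    one_way_mbytes = data_mbps // 8                        # bits to bytes
    return {
        "signaling_mbps": signaling_mbps,
        "data_mbps": data_mbps,
        "one_way_mbytes_per_s": one_way_mbytes,
        "duplex_mbytes_per_s": 2 * one_way_mbytes,         # both directions at once
    }


for lanes in (1, 4, 12):
    rates = link_rates(lanes)
    print(f"{lanes:2d}X: {rates['signaling_mbps']} Mbit/s signaling, "
          f"{rates['duplex_mbytes_per_s']} Mbytes/s duplex")
```

Under these assumptions a single-lane link carries 500 Mbytes/second duplex and a 12-lane link carries 6 Gbytes/second duplex, which matches the range of link speeds quoted earlier.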
Optical links specify the LC duplex connector for 1X links, the MTP/MPO for 4X links, and dual MTP/MPO for 12X links. All of these optical connectors have been accepted as ad hoc industry standards for other optical data-communication protocols as well. Copper cable connectors are from two different families. The 1X connector is a derivative of the Fibre Channel connector currently in use, known as HSSDC2, designed by Tyco Electronics and licensed by Molex. The 4X and 12X copper connectors are from the microgiga CN family developed by Fujitsu. The four-lane implementation has also been adopted by 10 Gigabit Ethernet (IEEE 802.3ae) and 10 Gigabit Fibre Channel. Prior to data transmission, the links are trained, width negotiated, and de-skewed; the auto-negotiation allows connection of two protocol-aware devices with different lane widths, using the lowest common denominator lane width from Table 2. The IB data packets, including headers, comma characters for clock synchronization, start of packet, and end of packet delimiters, can be as long as 4608 bytes, but the data payload is only 4096 bytes (4 kbytes) long. Table 3 shows link distances supported by both copper and fiber optic IB links.
125 meters, 62.5 micron fiber
75 meters, 62.5 micron fiber
75 meters, 62.5 micron fiber
20 inches FR4 PCB, with pre-emphasis drivers
Table 3: InfiniBand link distances (Note: very short reach (VSR) optical links use SX transmitters; minimum fiber bandwidth for 50-micron fiber is 500 MHz-km, and for 62.5-micron fiber is 200 MHz-km; long reach links use LX optical transmitters).
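The width negotiation and packet sizing described before Table 3 can be sketched in Python. The sketch interprets the "lowest common denominator" as the widest lane width that both ports support, which is an assumption about the intended meaning, and it uses the maximum packet and payload sizes quoted in the text. The function name `negotiate_width` is hypothetical.

```python
MAX_PACKET_BYTES = 4608    # maximum packet length, from the text
MAX_PAYLOAD_BYTES = 4096   # maximum data payload, from the text


def negotiate_width(local: set[int], remote: set[int]) -> int:
    """Pick the widest lane width supported by both ends of a link."""
    common = local & remote
    if not common:
        raise ValueError("the two ports share no lane width")
    return max(common)


# A 12X-capable port attached to a port that supports only 1X and 4X.
width = negotiate_width({1, 4, 12}, {1, 4})
print(f"negotiated width: {width}X")

# Share of a maximum-size packet that is data payload.
payload_fraction = MAX_PAYLOAD_BYTES / MAX_PACKET_BYTES
print(f"payload share of a full packet: {payload_fraction:.1%}")
```

For a full-size packet, the figures in the text imply that eight ninths of the bytes are payload, with the remainder taken by headers and delimiters.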
A testbed has been demonstrated using IB clustered servers running DB2 Universal Database Enterprise Extended Edition. The testbed is shown in Figure 3; of particular interest is the transparent nature of how DB2 exploits InfiniBand.
This demo was assembled from an IB-attached cluster of 12 IBM x series model 342 e-business servers (each equipped with dual Intel Pentium 4 processors, 2 Gbytes RAM, and an Intel InfiniBand HCA). Using a three-tiered architecture, one server in the presentation tier ran SAP Business Information Warehouse 2.1c, hosted on a standard Linux kernel, and acted as the MySAP.com application server. Of the remaining 11 servers in the compute tier, one node ran a subnet manager under Microsoft Windows 2000 to coordinate the fabric, and the clustered database was distributed over the remaining 10 nodes running Linux. All of the servers were clustered through a 32-port 1X switch from InfiniSwitch and a 16-port 1X switch from QLogic, with physical connections provided by 1X copper GBICs. Optical GBICs could easily be substituted to increase distance without impacting performance.
This was the first public showing of an IB-to-Gigabit Ethernet TCA running Intel's Virtual NIC wire protocol. A Network Appliance F800 filer cabinet in the data tier was indirectly connected to the fabric via an Omegaband OmegaGEM TCA to Gigabit Ethernet conversion; the filer served as a staging platform for additional resources in the data warehouse.
This testbed also supported the Virtual Interface Architecture (VIA), which has emerged as a de facto standard for high-speed interconnect on many server platforms. VIA was first demonstrated on DB2, and is more closely tied to the underlying IB hardware architecture, since it is based on remote direct memory access and the direct-attach file system, rather than on large software stacks like TCP/IP. There are various IBTA members committed to supporting VI stacks for IB, with early adopters including IBM, Intel, and Mellanox. This version of InfiniBand LAN emulation provides legacy device connectivity at InfiniBand speeds by tunnelling standard Ethernet TCP/IPv4 frames across the subnet using transport services in the IB architecture.
Subsequently, there have been a number of other successful InfiniBand testbed demonstrations, including a multi-server platform demo with two subnets. This demo featured one subnet consisting of an IBM p-series 170 eServer running a Linux operating system and hosting DB2; another p-series server emulating a z-series S/390 mainframe running AIX 4.3; two x series 342 eServers running Windows 2000; and a Thinkpad 770 client machine, all interconnected by an InfiniSwitch and an Omegaband IB-IP gateway. The second subnet included a p-series B50 running Linux; an x series 4500R hosting the Vieo subnet manager; an x series 342 running Windows 2000; and a Thinkpad 770 client. This illustrates the hardware transparency of an InfiniBand fabric for database storage; recently, direct-access file system (DAFS) storage running over IB has also been demonstrated.
Ongoing development of IB products includes software development, management-infrastructure issues, and open-standard application programming interfaces (APIs) for the kernel to allow operating-system portability. The IBTA has not addressed development of APIs; this task is being driven by a consortium under The Open Group, most notably through founding members IBM, Intel, Hewlett Packard, Compaq, and Sun. So far, the Group has defined three APIs (sockets extensions as well as both native and managed APIs). There is also some confusion in the marketplace between InfiniBand and the PCI Express standard proposed recently by Intel. Clearly, IB will be one option among several, and developers will remain free to make their own decisions on the best combination of technologies for their application.



