Design Article

Optimizing Network Interface Cards for Operation in a Standard High Volume Server

8/14/1998 12:00 AM EDT


The Standard High Volume (SHV) Server is based on products offered by Intel and Microsoft. This class of server has several key architectural features (Figure 1):

  • Supports 1 through 8 concurrently running processors.
  • Has a high-speed host bus (i.e., the processor bus), supporting 64-bit operation at 100 MHz.
  • Has one or more PCI buses for I/O devices.
  • Supports multiple memory banks using SDRAM or RDRAM with memory size in excess of 4 Gbytes.
  • Has a single host bus connection for I/O and processor data exchange and cache coherency.

Figure 1:  Typical SHV Server bus architecture

The combination of these features yields a low cost server capable of high-performance computing and future scalability. SHV Server performance is based on the characteristics of the buses present in the architecture.

The PCI buses in SHV Servers are typically 32 bits wide, although future versions will also incorporate 64-bit widths. PCI bus operation can scale from 33 MHz to 66 MHz. This gives a theoretical maximum throughput of approximately 1 Gbps for a 32-bit wide 33 MHz bus, and approximately 4 Gbps for a 64-bit wide 66 MHz bus. In actual operation, however, the realized throughput is only a percentage of the theoretical maximum. The realized bandwidth a PCI bus can achieve under normal traffic conditions depends on the number of devices in operation on the bus, the latency introduced by a bridge or device while data is fetched and transferred, and the burst capability of the bus mastering devices.
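The theoretical limits quoted above follow from one data transfer of the full bus width on every clock. A minimal sketch of that arithmetic, in which the function name is illustrative only:

```python
def pci_peak_gbps(width_bits, clock_mhz):
    """Peak PCI throughput in Gbps: one transfer of width_bits per clock."""
    return width_bits * clock_mhz * 1e6 / 1e9

# Bus configurations mentioned in the text, plus the intermediate 64-bit/33 MHz case.
for width_bits, clock_mhz in [(32, 33), (64, 33), (64, 66)]:
    gbps = pci_peak_gbps(width_bits, clock_mhz)
    print(f"{width_bits}-bit @ {clock_mhz} MHz: {gbps:.3f} Gbps")
```

The 32-bit, 33 MHz case comes out slightly above 1 Gbps and the 64-bit, 66 MHz case slightly above 4 Gbps, which is where the rounded figures come from.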


Number of Devices Sharing a Bus

The number of devices concurrently operating on a PCI bus decreases throughput due to arbitration latencies and master latency timeouts. The arbitration latency is the time the bus protocol consumes in transferring ownership of the PCI bus from one bus master to another and in allowing the second bus master to begin its cycle. For SHV Servers, this is a minimum of two PCI clock cycles, and typically four or five PCI clock cycles. In order to maintain fair access for all I/O devices, the master latency timeout forces a bus master device to terminate its current burst after a predetermined number of data transfers. This artificially limits the burst length on the bus and therefore limits how much data can be transferred in a single transaction. The master latency timer is programmable from 0 to 255. Typically, the timer is set to allow 64 data transfers.
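To illustrate how the master latency timeout fragments a long transfer, the sketch below splits a transfer into bursts of at most 64 data transfers. It follows the article's simplification of expressing the limit in data transfers. The function name and the 1518-byte example on a 32-bit bus are assumptions chosen for illustration.

```python
def split_into_bursts(total_transfers, max_burst=64):
    """Split a transfer into bursts capped by the master latency limit."""
    bursts = []
    remaining = total_transfers
    while remaining > 0:
        burst = min(remaining, max_burst)
        bursts.append(burst)
        remaining -= burst
    return bursts

# A 1518-byte frame needs 380 data transfers on a 32-bit (4-byte) bus.
bursts = split_into_bursts(380)
print(len(bursts), bursts)
```

Each additional burst pays the arbitration and address overhead again, which is why a larger number of shorter bursts lowers the realized bandwidth.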


Bridge Latency

Data stalls by the transaction target or transaction initiator decrease throughput by creating forced idle conditions during transfers. When transferring the address for the target of a transaction, the PCI bus protocol forces the initiator to insert at least one clock of idle bus time. This is increased to two clocks for addresses above 4 Gbytes. Stalls by the target or initiator after the address phase generally exist because a device must fetch data before transferring. For example, a PCI bridge device must often issue a transaction on another bus to fetch the data for the current transaction. While the second transaction completes, the bridge stalls the first transaction. The PCI bus limits the data stalls a device may insert to 12 clocks. After this limit, the device must disconnect the transfer. For read transactions, a device typically inserts 10 clocks of stall on the initial data phase. For a write transaction, there are usually only three clocks of data stall. After the initial data transfer, devices typically have enough buffering to allow bursts without data stall.


Burst Capability

Limited burst capability in the initiator decreases bandwidth when a transaction terminates before the entire burst request completes. This is a result of either insufficient buffer capacity in the initiator, or incorrect command usage by the initiator. If the initiator has insufficient buffer capacity, the transaction terminates even if the master latency timer would allow the transaction to continue. If the initiator does not follow the PCI protocol usage guidelines for the advanced commands, Memory Read Line (MRL), Memory Read Multiple (MRM), and Memory Write and Invalidate (MWI), the target of the transaction may not reserve sufficient prefetch buffer for the data transaction and may disconnect the transaction before the master latency timer expires.
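As a rough illustration of command usage, the sketch below picks a read command from the span of a transfer relative to cache line boundaries. It is a simplified reading of the PCI usage guidelines, not a restatement of them. The 32-byte cache line size, the function name, and the prefetchable flag are all assumptions.

```python
def choose_read_command(start_addr, length_bytes, cache_line_bytes=32,
                        prefetchable=True):
    """Pick a PCI read command (simplified; assumes length_bytes >= 1)."""
    if not prefetchable:
        # Non-prefetchable space: plain Memory Read avoids speculative fetches.
        return "Memory Read"
    first_line = start_addr // cache_line_bytes
    last_line = (start_addr + length_bytes - 1) // cache_line_bytes
    if first_line == last_line:
        # Transfer stays inside one cache line.
        return "Memory Read Line"
    # Transfer crosses at least one cache line boundary.
    return "Memory Read Multiple"

print(choose_read_command(0x1000, 16))    # short fetch within one line
print(choose_read_command(0x1000, 256))   # long burst across many lines
```

The point of the selection is to tell the target how much data to prefetch, so that a long burst is not disconnected for lack of buffered data.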

Figure 2:  Illustration of PCI data and data stalls

An analysis of these conditions yields a typical profile for realized bandwidth on a PCI bus. The realized bandwidth is computed by dividing the number of data transfers by the total number of clock cycles in the transaction. For example, in a system with a typical master latency setting of 64 data transfers, 10 clocks of initial data stall, bursts without further stalls, large bus master buffer capability, and correct command usage, the realized bandwidth is 64 data transfers / (64 data transfers + five clock cycles for arbitration + one clock cycle for address + 10 clock cycles for initial latency) = 64/80, or 80% of theoretical. This number is further reduced when devices do not have sufficient buffer capability to use the full 64-data-transfer latency time. The most notable example of a bus master with limited burst capability is the CPU, which can typically only transfer a single data phase during I/O cycles.
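The realized-bandwidth calculation can be sketched as a small function. The default overhead values are the typical figures from the text. Applying the same overhead to a single data phase CPU cycle is an assumption made here for comparison only, since the text does not give separate numbers for that case.

```python
def pci_efficiency(data_transfers, arbitration=5, address=1, initial_stall=10):
    """Fraction of clocks in a transaction that actually move data."""
    overhead = arbitration + address + initial_stall
    return data_transfers / (data_transfers + overhead)

full_burst = pci_efficiency(64)   # burst limited by the master latency timer
single_phase = pci_efficiency(1)  # one data phase, same overhead assumed

print(f"64-transfer burst: {full_burst:.0%}")
print(f"single data phase: {single_phase:.1%}")
```

Under these assumptions a single data phase transaction uses only a small fraction of the bus clocks it occupies, which is why short transactions are so costly.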


Additional Performance Considerations

Although the PCI bus is the obvious bus to examine for I/O performance, the other system buses also affect I/O capability by limiting the response of other agents in the system. For example, the Front Side Bus (FSB) is a 64-bit bus operating at 66 MHz or higher. The FSB typically also employs a degree of pipelining to allow transaction requests to issue while previous transactions are still completing. In this way the FSB can maintain burst data rates without data stalls for new address requests. This bus, acting alone, is more than capable of providing data to a 32-bit PCI bus at 33 MHz. However, the FSB becomes taxed when multiple 64-bit PCI buses are connected to the single FSB along with multiple processors.

The memory bus from the system memory controller may also limit system throughput. In a typical SDRAM-based memory subsystem, the memory bus is 64 bits wide, operating at 66 MHz or higher. This is sufficient to supply data to the host bus. The design of SDRAM, however, only allows this speed during burst data transfer. An SDRAM can only burst data that has been loaded into its row latches. The data from an entire row is loaded into the latches when an address in the row is requested. This data is referred to as a page. The size of a page depends on the size of the SDRAM. A 16-Mbit SDRAM has a 4-Kbit page. A 4-Mbit SDRAM has a 2-Kbit page. SDRAM and other burst-oriented DRAMs perform best when requests are linear in nature. When data is requested outside of the SDRAM's open page, the SDRAM must close that page and open a new page. If the PCI initiator consistently issues transactions to closed pages, the SDRAM cannot provide data at the burst rate. The end result is lower throughput capability for those devices that issue multiple small transactions that are not sequential in address.
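The effect of access order on page behavior can be sketched with a deliberately simplified model: one bank, one open page at a time, and a 512-byte (4-Kbit) page. Real memory subsystems combine several devices and banks, so the page size, the access patterns, and the function name here are assumptions for illustration.

```python
def count_page_misses(addresses, page_bytes=512):
    """Count accesses that fall outside the currently open page."""
    open_page = None
    misses = 0
    for addr in addresses:
        page = addr // page_bytes
        if page != open_page:
            misses += 1          # close the old page, open the new one
            open_page = page
    return misses

# 256 accesses of 8 bytes each in linear order.
sequential = list(range(0, 2048, 8))
# 256 accesses that jump 4 Kbytes each time, wrapping within 64 Kbytes.
scattered = [(i * 4096) % 65536 for i in range(256)]

print(count_page_misses(sequential), count_page_misses(scattered))
```

In this model the linear pattern opens a new page only once per 512 bytes, while the scattered pattern pays the page-open penalty on every access.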


Typical PCI Handling of Ethernet Traffic

Each I/O device in an SHV Server interfaces to structures that create different load types on the PCI bus. In particular, Ethernet networks operating with TCP/IP protocols interface to a structure that creates multiple small data transfers. Ethernet transfers have a maximum size of 1518 bytes and a minimum of 64 bytes. The TCP/IP protocol creates a request and acknowledgment system that sends requests for data, then blocks of data, and then acknowledgments of that data. The data transfers are generally large, close to the 1518-byte limit. The acknowledgments and requests are generally small, close to the 64-byte minimum. The average size of a TCP/IP Ethernet packet is approximately 256 bytes. This packet size corresponds to 64 data transfers on a 32-bit PCI bus and 32 data transfers on a 64-bit PCI bus.
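The conversion from packet size to PCI data transfers is a ceiling division by the bus width in bytes. A short sketch, with an illustrative function name, applied to the packet sizes mentioned above:

```python
def data_transfers(packet_bytes, bus_width_bits):
    """Number of PCI data phases needed to move packet_bytes."""
    bytes_per_transfer = bus_width_bits // 8
    # Ceiling division: a partial final word still costs one data phase.
    return -(-packet_bytes // bytes_per_transfer)

for size in (64, 256, 1518):
    print(size, data_transfers(size, 32), data_transfers(size, 64))
```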

Ethernet network interface cards (NICs) typically employ a scatter-gather structure for moving data from the network to the operating system and back. In this structure, the CPU sets up a command structure in memory. This command structure is typically small, on the order of 16 bytes. The structure contains the location of packet data and the length of the segments of the packet data. The data for a single packet is not usually within a contiguous memory space. This is the result of multiple CPU processes setting up the packet data. For example, TCP/IP packet data generally has three segments: the application data, the TCP/IP header, and the MAC header. Each of these segments requires a separate PCI transaction to fetch.

Figure 3:  Typical 16-byte scatter-gather command structure for a TCP/IP packet
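Since the figure is not reproduced here, the sketch below shows one way a 16-byte scatter-gather command structure could be laid out. The field layout, the flag value, the one-structure-per-segment arrangement, and all addresses are hypothetical and are not taken from any particular NIC. Only the 16-byte size and the three segment types come from the text.

```python
import struct

# Hypothetical 16-byte layout (little-endian, no padding):
#   buffer address (4), segment length (2), flags (2),
#   next-structure address (4), status (4)
DESCRIPTOR = struct.Struct("<IHHII")
FLAG_LAST_SEGMENT = 0x0001  # hypothetical flag: final segment of a packet

def build_descriptors(segments, table_base):
    """Pack one command structure per (address, length) segment."""
    table = bytearray()
    for index, (address, length) in enumerate(segments):
        last = index == len(segments) - 1
        flags = FLAG_LAST_SEGMENT if last else 0
        next_desc = 0 if last else table_base + (index + 1) * DESCRIPTOR.size
        table += DESCRIPTOR.pack(address, length, flags, next_desc, 0)
    return bytes(table)

# Made-up, non-contiguous buffer addresses for a 256-byte packet.
segments = [
    (0x00100000, 16),    # MAC header
    (0x00204000, 40),    # TCP/IP header
    (0x00358000, 200),   # application data
]
table = build_descriptors(segments, table_base=0x00010000)
print(DESCRIPTOR.size, len(table))
```

With this arrangement the NIC must fetch each structure and then each segment it points to, so one packet turns into several short PCI transactions.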

The combination of the scatter-gather structure and TCP/IP packet segmentation decreases an Ethernet NIC's capability to burst on the PCI bus. Each fetch of a single scatter-gather structure is a four-clock data burst on the PCI bus. The MAC header is 16 bytes, requiring only a four-clock data burst. The TCP/IP header is 40 bytes without TCP header options, requiring a 10-clock data burst. Thus, even if there are large packets to transfer, the segmentation of the network structures and the scatter-gather command structure forces bus masters to burst for short periods on the PCI bus. Using the average TCP Ethernet frame size of 256 bytes, the scatter-gather structure will cause burst cycles of only 25 data transfers. This burst size reduces the realized PCI throughput to approximately 60% using the typical PCI latency and arbitration numbers from above.
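The cost of short bursts can be checked with the same realized-bandwidth formula used earlier, taking the typical overhead of 16 clocks (five for arbitration, one for address, 10 for initial latency). The burst lengths below are the ones given in the text. The function name and the dictionary labels are illustrative.

```python
def pci_efficiency(data_transfers, overhead_clocks=16):
    """Data transfers divided by total clocks in the transaction."""
    return data_transfers / (data_transfers + overhead_clocks)

bursts = {
    "scatter-gather structure": 4,
    "MAC header": 4,
    "TCP/IP header": 10,
    "average burst from the text": 25,
    "full 64-transfer burst": 64,
}
for name, transfers in bursts.items():
    print(f"{name}: {transfers} transfers, {pci_efficiency(transfers):.0%}")
```

The 25-transfer case works out to 25/41, which the text rounds to 60%.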

Beyond the packet data movement, Ethernet traffic creates other system performance limitations. When a typical Ethernet NIC receives a new packet from the network, the NIC fetches an available scatter-gather structure and transfers the packet data to memory. It then informs the CPU of the new packet by asserting an interrupt to the CPU. For a 256-byte average packet size, a 100 Mbps Ethernet link generates more than 97,000 interrupts per second. This equates to roughly one interrupt every 350 PCI clocks at 33 MHz PCI. A 1000 Mbps Ethernet link creates roughly one interrupt every 35 PCI clocks.
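The interrupt figures can be approximated from the link rate and the average packet size. One direction of a 100 Mbps link carries about 48,800 packets of 256 bytes per second. The figure of more than 97,000 is consistent with counting both directions of a full-duplex link, which is an assumption made in this sketch rather than something the text states. Framing overhead such as preamble and inter-frame gap is ignored.

```python
def packets_per_second(link_mbps, packet_bytes=256):
    """Packets per second in one direction at full link utilization."""
    return link_mbps * 1e6 / (packet_bytes * 8)

def pci_clocks_per_interrupt(interrupts_per_second, pci_mhz=33):
    """PCI clocks available between consecutive interrupts."""
    return pci_mhz * 1e6 / interrupts_per_second

for link_mbps in (100, 1000):
    one_way = packets_per_second(link_mbps)
    both_ways = 2 * one_way  # assumed: one interrupt per packet, both directions
    clocks = pci_clocks_per_interrupt(both_ways)
    print(f"{link_mbps} Mbps: {both_ways:,.0f} interrupts/s, "
          f"one every {clocks:.0f} PCI clocks")
```

The results land close to the rounded values of 350 and 35 PCI clocks quoted in the text.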

Interrupts undermine the pipelining and parallel execution that advanced CPUs depend on. If the pipeline or cache of the processor is stalled or flushed, the CPU must wait for the pipeline or cache to refill before it can operate at full capacity. Each interrupt also generates several I/O reads and writes on the PCI bus by the CPU. These I/O reads and writes are costly in terms of PCI bus bandwidth because they are single data phase transactions. I/O reads and writes are also costly in terms of CPU processing capability, because they cancel some of the internal pipelining of the CPU. Thus, in the case of one interrupt per packet, the CPU's advanced features are not available for packet processing, and CPU bandwidth is taken away from application processing.


Optimizing Ethernet Traffic in the SHV

As described, SHV Servers are optimized for I/O traffic that generates long bursts on the PCI bus, creates few interrupts to the CPU, and minimizes the I/O reads and writes by the CPU. Unfortunately, Ethernet LAN traffic does not create large packets. Furthermore, the standard linked list data movement structure uses many small messages and creates many interrupts and I/O accesses. Correcting these deficiencies requires a new model for network data movement. This model should combine Ethernet LAN packets into large bursts on the PCI bus when possible, reduce or eliminate the need for CPU interrupts, and operate with infrequent CPU accesses to the I/O device. In addition, the Ethernet device should optimize the PCI bus transactions by making use of the advanced commands, embedding large data buffers for bursting, and eliminating idle cycles.

One example of this type of interface is utilized in network accelerator chips developed by Jato Technologies. This interface, known as Propulsion™, allows the NIC to combine small Ethernet packets into a single, large PCI burst, minimizes the number of CPU I/O accesses to the NIC, and operates with either reduced or zero CPU interrupts. Propulsion also eliminates the costly virtual to physical address translation necessary for a linked list data structure. Jato has applied for a patent for its Propulsion techniques.

Propulsion accomplishes these feats by creating large blocks of data transfer space in system memory. These blocks, known as Packet Descriptor Command (PDC) buffers, are used for transferring both packet information and packet data between the card and the CPU. PDC buffers are extremely large, up to a maximum of 64 Kbytes. This size allows the transfer of multiple Ethernet packets in sequence to the same buffer. Packet and control information is contained within the sequences of data within the PDC buffer, so CPU I/O accesses to the NIC are not needed for packet information. Propulsion also greatly reduces the interrupt count to the CPU by concatenating multiple packets into the same interrupt and allowing interrupt-free transfer.
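The general idea of carrying several packets, together with their control information, in one large buffer can be sketched as follows. The actual PDC buffer format is not described here, so the 4-byte per-packet header (length and status) and the function names are purely hypothetical. Only the 64-Kbyte maximum comes from the text.

```python
import struct

PDC_BUFFER_BYTES = 64 * 1024    # maximum buffer size given in the text
HEADER = struct.Struct("<HH")   # hypothetical header: length, status

def pack_packets(packets, capacity=PDC_BUFFER_BYTES):
    """Concatenate packets, each behind a small header, until the buffer is full."""
    buf = bytearray()
    count = 0
    for pkt in packets:
        if len(buf) + HEADER.size + len(pkt) > capacity:
            break
        buf += HEADER.pack(len(pkt), 0)
        buf += pkt
        count += 1
    return bytes(buf), count

def unpack_packets(buf):
    """Walk the buffer and recover each packet from its in-line header."""
    packets = []
    offset = 0
    while offset < len(buf):
        length, _status = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        packets.append(buf[offset:offset + length])
        offset += length
    return packets

packets = [bytes([i % 256]) * 256 for i in range(300)]
buf, count = pack_packets(packets)
print(count, len(buf))
```

In this sketch one buffer holds a few hundred average-size packets, so a single long burst and a single notification can stand in for what would otherwise be hundreds of short transactions and interrupts.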

The PCI interface within Jato's network accelerator chips is an example of an optimized bus master. This PCI interface contains large FIFO blocks. These FIFOs allow transfer of an entire PDC buffer in a single PCI burst transaction, if the latency timer allows. The PCI bus master fully utilizes the MRL, MRM, and MWI PCI commands for the large data transfers of the PDC buffers. The PCI interface also allows no-wait-state operation at speeds up to 66 MHz for full burst rate support.


Conclusion

The SHV Server architecture is designed to allow high performance at a reduced price. However, traditional Ethernet packet handling does not utilize the performance architecture of the SHV Server. An optimized PCI interface and a new data transfer model are required to take advantage of the SHV Server performance. These components utilize the burst capability of PCI, reduce interrupt and I/O transfer between the CPU and the NIC, and optimize the movement of large data blocks. The result is a model that enables the transfer of high speed Ethernet traffic within the SHV Server.




