Distributed CPU blades enable loosely coupled application servers

Greg Whelan, Director of Product Marketing, StarGen, Inc., Marlborough, Mass.

5/20/2002 9:01 AM EDT

Loosely coupled application servers are a natural application for first-generation blade servers, which are essentially a repackaging of the widely deployed single-board 1U server. However, to move beyond basic Web serving and other stand-alone applications, blade servers need to evolve into a more flexible and powerful model, based on distributed blade computing.

With distributed blade computing, a system is truly disaggregated: it consists of CPU/memory, storage, and I/O elements, each on a separate, optimized, and independently upgradeable blade. Efficient communication channels then connect the elements. With this approach, many more interesting applications can be implemented. A few examples include compute clusters, multi-processor servers, Beowulf clusters, and wide-range distributed processing systems.

What is needed to make this a reality is an inter-blade connectivity standard. This interconnect technology must provide both a highly efficient, high-performance channel for processor-to-processor communication and, at the same time, efficient processor-to-I/O device communication.

To move the market forward, this standard must leverage existing standards and interfaces while, at the same time, enabling advanced features and capabilities. The two most prevalent standards that need to be carried forward are PCI, for processor-to-device communication, and Sockets, for inter-processor communication. The advanced features required are new levels of scalability, new levels of availability, and multiple classes of traffic (QoS). All of this must be made available in a cost-effective and low-power fashion.

PCI is an extremely prevalent interface and an ideal interconnect for processor-to-device communication, but its shared parallel bus nature makes it unacceptable as a distributed blade computing interconnect.

So, what is required is an interconnect that is completely interoperable with PCI devices but provides distributed blade computing functionality. A basic tenet of distributed blade computing is the ability to have multiple processor types work together to solve a problem. Depending on the application, one could envision processor blades with DSPs or Network Signal Processors (NSPs) interconnected with a traditional CPU blade. Even if every processor were identical, they need not all run the same operating system.

As first-generation blade servers evolve to take on more mission-critical roles, system availability becomes paramount. The goal becomes not how many "nines" a system can achieve, but how much degradation is incurred during an upgrade or failure. Solutions must be designed with high availability built in from the outset and not as an afterthought. Distributed blade computing applications will be able to self-modify and direct traffic around failed nodes of any sort.

As systems scale and take on a wider range of applications, it becomes essential to carry multiple classes of traffic through a system at different priorities. In some applications, instructions between processing nodes might use a higher-priority class of service, while data between the subsystems is carried in a lower-priority class. Whatever the system requirements are, some traffic will be more time-critical and will need to move through the system as fast as possible. In some cases, the ability to pre-allocate and guarantee some level of performance for a specific class of service is highly beneficial. Lastly, it is important to support multicast groups, where data or messages can be sent to any given number of nodes simultaneously.
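The idea of multiple traffic classes can be illustrated with a small queueing sketch. This is not how StarFabric arbitrates traffic, which the article does not describe; it assumes a simple strict-priority scheme, and the class names `CONTROL` and `BULK` and the `ClassOfServiceQueue` type are hypothetical.

```python
import heapq
import itertools


class ClassOfServiceQueue:
    """Strict-priority queue: lower class number is served first (assumed policy)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # preserves FIFO order within a class

    def enqueue(self, service_class, packet):
        heapq.heappush(self._heap, (service_class, next(self._seq), packet))

    def dequeue(self):
        _service_class, _seq, packet = heapq.heappop(self._heap)
        return packet


# Hypothetical classes: instructions between nodes outrank bulk data.
CONTROL, BULK = 0, 1

queue = ClassOfServiceQueue()
queue.enqueue(BULK, "data block A")
queue.enqueue(CONTROL, "instruction: start")
queue.enqueue(BULK, "data block B")

order = [queue.dequeue() for _ in range(3)]
print(order)
```

The control message leaves the queue first even though it arrived second, while the two bulk blocks keep their arrival order. A guaranteed-bandwidth class would need a different policy, such as weighted scheduling, rather than strict priority.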

New standards from the PCI Industrial Computer Manufacturers Group (PICMG), called PICMG 2.17 and PICMG 3.3, based on StarFabric, are solutions for distributed blade computing. PICMG 3.3 provides for StarFabric implementation in the AdvancedTCA architecture that PICMG is currently developing in the PICMG 3.0 base specifications. The objective of these specifications is to provide an evolutionary path from today's bus-based CompactPCI architecture to one based on a multi-gigabit switched interconnect fabric.

PICMG 2.17 specifies both centralized and distributed switching configurations. A centralized configuration uses a dedicated switch blade, or two for redundancy. Each of the other blades then contains a StarFabric interface and has a dedicated 5 Gbit/second serial link to the switch blade. A distributed switching system has a switch device on each line card.

StarFabric uses a load-store model instead of stack processing to move data. This creates a significantly more efficient interconnect than one based on LAN protocols. As blade servers take on more demanding application roles, the efficiency of communication between processors becomes critical. Message passing between processing nodes is possible using the 32 bytes of scratchpad registers and 32 "doorbells" in the StarFabric bridge. A remote processor can write a message to a scratchpad register and then write to a doorbell. This results in an interrupt being asserted on the PCI bus of the destination processor. The destination processor then reads the contents of the scratchpad and gets the message from the source processor.
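The scratchpad-and-doorbell sequence can be modeled in a few lines. The following is a toy simulation only: it does not reflect the real StarFabric register map or any driver API. The `BridgeSim` class, its method names, and the callback standing in for the PCI interrupt are all hypothetical; only the sizes (32 scratchpad bytes, 32 doorbells) come from the text.

```python
class BridgeSim:
    """Toy model of a bridge with 32 scratchpad bytes and 32 doorbells."""

    SCRATCHPAD_BYTES = 32
    DOORBELLS = 32

    def __init__(self, on_interrupt):
        self.scratchpad = bytearray(self.SCRATCHPAD_BYTES)
        self.on_interrupt = on_interrupt  # stands in for the PCI interrupt line

    def write_scratchpad(self, offset, data):
        if offset < 0 or offset + len(data) > self.SCRATCHPAD_BYTES:
            raise ValueError("write falls outside the scratchpad")
        self.scratchpad[offset:offset + len(data)] = data

    def ring_doorbell(self, index):
        if not 0 <= index < self.DOORBELLS:
            raise ValueError("no such doorbell")
        self.on_interrupt(self, index)  # "asserts" the interrupt


received = []


def destination_isr(bridge, doorbell):
    # Destination processor: on interrupt, read the scratchpad contents.
    message = bytes(bridge.scratchpad).rstrip(b"\x00")
    received.append((doorbell, message))


destination_bridge = BridgeSim(destination_isr)

# Source processor: write the message first, then ring a doorbell.
destination_bridge.write_scratchpad(0, b"start job 7")
destination_bridge.ring_doorbell(3)

print(received)
```

The ordering matters: the message must be in the scratchpad before the doorbell write, because the doorbell is what tells the destination that the contents are ready to read.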

Efficient processor-to-device communication is important as well. StarFabric utilizes enhancements to PCI, based on event management. Events will typically default to one host node in a system. However, each bridge implements event path tables that can be programmed to distribute signal or chip events to alternate destinations in the fabric. The sharing of interrupt pins between devices can also be problematic in large-scale systems.
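A programmable event path table can be pictured as a lookup with a default. The sketch below is illustrative only and assumes nothing about the actual table format in a StarFabric bridge; the `EventPathTable` class, the event names, and the node names are hypothetical.

```python
class EventPathTable:
    """Maps an event to a destination node, falling back to a default host."""

    def __init__(self, default_host):
        self.default_host = default_host
        self.routes = {}

    def program(self, event, destination):
        # Redirect one event to an alternate destination in the fabric.
        self.routes[event] = destination

    def destination_for(self, event):
        return self.routes.get(event, self.default_host)


table = EventPathTable(default_host="host-0")
table.program("capture-done", "dsp-blade-2")  # hypothetical event and node

print(table.destination_for("capture-done"))  # routed to the alternate node
print(table.destination_for("link-error"))    # unprogrammed: goes to the host
```

Events that have not been programmed still reach the single default host, which matches the default behavior described above.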

With first-generation blade servers, the I/O is the network interface. With distributed blade computing, additional I/O devices can be added to the system. In the case of an imaging system, the data capture device could be attached via a specific image capture blade.

Once the blades are inserted into the system, the management system can then configure the requisite blades to accomplish the specific mission. For example, a group of CPUs, DSPs, storage, and I/O blades could be configured as one system and another group could be configured as another. These two systems could then be interconnected into a larger system. Apply this to multiple chassis and racks, add the fact that chassis can be up to 40 feet apart, and the tremendous flexibility of distributed blade computing is apparent.

PICMG 2.16, a switched Ethernet architecture, was created a year before PICMG 2.17 and is starting to see real-world applications, including first-generation blade servers. In server applications, where processor-to-processor communication is infrequent, 100BASE-T is sufficient. Applications that require more than 100 Mbit/second of inter-processor communication face a challenge when moving to Gigabit Ethernet.

Gigabit Ethernet requires close to 1 GHz of processing power to run the protocol stack and to handle the tens of thousands of interrupts every second. In fact, one network message takes 10,000 instructions plus an additional 10 instructions per byte of transferred data. PICMG 2.17, on the other hand, can accomplish this using less than 5% of a 1 GHz processor and uses 20 instructions per network message. The processor burden of PICMG 2.16 operating at 1 Gbit/second results in higher-cost blades, higher power dissipation, and less available board real estate.
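The instruction counts quoted above can be turned into a quick back-of-the-envelope comparison. The sketch below uses only the article's figures (10,000 instructions per message plus 10 per byte for the Ethernet stack, 20 per message for PICMG 2.17). The 1,000-byte payload, the assumption that the fabric path has no per-byte CPU cost, and the assumption of one instruction per clock cycle are illustrative choices that the article does not state.

```python
def ethernet_instructions(payload_bytes):
    """Per-message CPU cost of the protocol stack, using the article's figures."""
    return 10_000 + 10 * payload_bytes


def fabric_instructions(payload_bytes):
    """Per-message cost quoted for PICMG 2.17.

    Assumption: no per-byte CPU cost, since the article gives only the
    per-message figure.
    """
    return 20


payload = 1_000  # bytes; an assumed message size
ethernet_cost = ethernet_instructions(payload)
fabric_cost = fabric_instructions(payload)
ratio = ethernet_cost / fabric_cost

# Assumption: a 1 GHz processor retiring one instruction per cycle.
instructions_per_second = 1_000_000_000
ethernet_messages_per_second = instructions_per_second // ethernet_cost

print(ethernet_cost, fabric_cost, ratio, ethernet_messages_per_second)
```

Under these assumptions a 1,000-byte message costs 20,000 instructions over the Ethernet stack against 20 over the fabric, and a fully dedicated 1 GHz processor would top out at about 50,000 such Ethernet messages per second.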

In some applications, PICMG 2.16 and 2.17 can be used together. PICMG 2.16, at 100 Mbit/second, is sufficient for many applications, yet becomes problematic at 1 Gbit/second speeds. One solution would be to use the lower-speed 2.16 as the control plane only. PICMG 2.17 would then be the high-performance data plane for low-latency inter-processor communications and high-performance I/O devices.

See related chart




