
Getting higher reliability out of Windows 2000 servers

Simon Graham, ftServer chief architect, Stratus Technologies Inc., Maynard, Mass.

1/9/2002 8:03 AM EST

Most major vendors offer clusters as their highest-availability solutions, although clusters are known to be difficult to install, operate and maintain. In the absence of viable alternatives at reasonable cost, however, the 97 to 99.9 percent availability that clusters could deliver had to suffice.

Hardware-based fault tolerance is entirely different from cluster technology in its approach to availability, architecture and uptime performance. This is true for both traditional fault-tolerant systems and the new WinTel-based fault-tolerant servers. In fact, WinTel-based fault-tolerant servers can achieve even higher availability than the traditional fault-tolerant systems that have been running mission-critical applications for decades: up to 99.9999 percent, or roughly half a minute of potential unplanned downtime per year. Compared with clusters, a fault-tolerant server is also very simple to deploy, manage and maintain.
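To put the availability percentages in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name is ours, and it assumes a 365-day year and treats every second outside the availability figure as unplanned downtime.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected downtime per year, in seconds, for a given availability level."""
    return (1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR


# Compare the cluster range quoted above with a six-nines fault-tolerant server.
for pct in (97.0, 99.9, 99.9999):
    secs = downtime_seconds_per_year(pct)
    print(f"{pct}% available: {secs:,.1f} s/year ({secs / 3600:,.2f} hours)")
```

Under these assumptions, 97 percent corresponds to roughly 263 hours of downtime a year, 99.9 percent to just under 9 hours, and 99.9999 percent to about 31.5 seconds, close to the figure quoted above.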

Fault-tolerant servers are designed from the outset not to fail. Clusters, on the other hand, deal with recovery after a failure occurs. The key to designing a fault-tolerant hardware/software system, be it a standalone server or a cluster, lies in balancing and matching the capabilities of the underlying hardware with a properly hardened operating system.

For example, Stratus Technologies' ftServer architecture is based on what is called a dual modular redundancy model. Such a system is built with replicated, fault-tolerant hardware, including CPU/memory, disks, PCI cards, fans and power supplies, to eliminate single points of failure. Duplicate CPU/memory components operate in lockstep, processing the same instruction at the same time against a single operating system image. In other words, two physical computers operate as one logical computer (a server with triple modular redundancy uses three CPU/memory modules and functions as one logical computer; it's the best protection there is against downtime).

If a component in the system should break, the partner component simply continues to operate as normal and the system and the application continue to run. Users experience no interruption to processing, no data loss and no performance degradation.

That sounds easy, but the magic comes in "lockstepping" the processors so that they do exactly the same thing at the same time and in the fault-detection and isolation technology that protects the system from misbehaving parts.

When proprietary onboard error-detection circuitry identifies a faulty component, the motherboard carrying that component is immediately isolated from the system and removed from service. A second level of error detection compares the outputs of the CPU/memory units on every I/O operation. In a dual-modular-redundancy system, if a comparison error occurs with no onboard error indication (a very uncommon event), software algorithms use historical data to determine which board to remove from service. In a triple-redundancy system, "odd-man-out" voting logic identifies and isolates the fault. In either case, processing continues on the remaining motherboard or boards without interruption or performance degradation. The entire error-detection and isolation process takes just milliseconds.
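The isolation decision described here can be sketched in Python. This is a simplified model, not the actual ftServer logic: the function names are hypothetical, board outputs are modeled as plain values, and the tie-break for the dual-redundancy case (remove the board with more recorded past errors) is our assumption, since the specific algorithm that uses historical data is not spelled out.

```python
from collections import Counter
from typing import Hashable, Mapping, Optional


def isolate_tmr(outputs: Mapping[str, Hashable]) -> Optional[str]:
    """Odd-man-out vote across three boards; return the dissenting board, if any."""
    if len(outputs) != 3:
        raise ValueError("triple modular redundancy expects exactly three boards")
    majority_value, votes = Counter(outputs.values()).most_common(1)[0]
    if votes == 3:
        return None  # all boards agree, nothing to isolate
    if votes < 2:
        raise RuntimeError("all three boards disagree; no majority to vote with")
    return next(board for board, value in outputs.items() if value != majority_value)


def isolate_dmr(
    outputs: Mapping[str, Hashable],
    onboard_error: Mapping[str, bool],
    error_history: Mapping[str, int],
) -> Optional[str]:
    """Pick the board to remove in a two-board system, or None if both are healthy."""
    if len(outputs) != 2:
        raise ValueError("dual modular redundancy expects exactly two boards")
    # First level: a board whose onboard circuitry reports a fault is removed at once.
    for board in outputs:
        if onboard_error.get(board, False):
            return board
    # Second level: compare the two outputs on the I/O operation.
    first, second = outputs
    if outputs[first] == outputs[second]:
        return None
    # Miscompare with no onboard indication: fall back on history.
    # ASSUMPTION: the board with more past errors is the likelier culprit.
    return max(outputs, key=lambda board: error_history.get(board, 0))
```

For example, `isolate_tmr({"A": 7, "B": 7, "C": 9})` singles out board `"C"`, while `isolate_dmr` with mismatched outputs and no onboard error flag removes whichever board has the worse record in `error_history`.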

In our experience, a WinTel-based fault-tolerant system can extend the protection of redundant hardware throughout the architecture in a way that is completely transparent to the Windows 2000 operating system, middleware and applications. These components include duplicated, hot-swappable CPU/memory units, I/O boards, PCI cards, storage devices, power supplies and fan units. With hardware redundancy, there is no reliance on cluster-aware software scripting or configuration control to ensure availability. The redundant component simply continues to operate in the event of error or failure.

True availability, however, must embrace the entire solution, including the hardware, operating system software and application. Any system promising continuous availability needs to account for potential points of failure at all three levels.

Beyond the Win2000 operating system, a WinTel-based fault-tolerant system can complement and enhance the dependability of the server platform if it includes a number of software enhancements that address known sources of system and application failures and minimize downtime during repair and maintenance.

These enhancements include hardened device drivers, a hardened hardware abstraction layer (HAL) and I/O Virtual Addressing protection. They can be implemented without affecting the Windows 2000 core operating system code. As a result, the system maintains complete Application Binary Interface compatibility and all value-added software features are available to standard Windows 2000 applications and middleware.

Device drivers provide servers with interfaces to system peripherals and communications lines. They can be a major source of operating system instability. Furthermore, problems caused by improperly functioning device drivers tend to be difficult to isolate and diagnose.

Hardened device drivers, which are available for supported PCI adapters on the system, prevent failures and assure system uptime using multiple methods that are completely transparent to Windows 2000 applications. The drivers detect and stop adapter card writes beyond allocated physical memory, a condition that typically results in a system crash when left unchecked.
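The idea of stopping adapter writes that fall outside allocated memory can be illustrated with a simple bounds check. The Python sketch below is a conceptual model only: real enforcement happens in driver and hardware code, and the `Region` type and `write_is_allowed` function are hypothetical names introduced for illustration.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start address, size in bytes) of an allocated buffer


def write_is_allowed(allocated: List[Region], address: int, length: int) -> bool:
    """Return True only if the whole write fits inside one allocated region."""
    if length <= 0:
        return False
    end = address + length  # first byte past the end of the write
    return any(start <= address and end <= start + size for start, size in allocated)


# Hypothetical buffers handed to an adapter for its transfers.
buffers = [(0x1000, 0x200), (0x4000, 0x100)]
print(write_is_allowed(buffers, 0x1080, 0x100))  # inside the first buffer
print(write_is_allowed(buffers, 0x11F0, 0x20))   # runs past the end of the first buffer
```

A write that straddles the end of its buffer is rejected rather than allowed to corrupt neighboring memory, which is the condition the hardened drivers are described as catching.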

Equally important, the driver software should allow protected PCI adapters to take advantage of two other capabilities that significantly contribute to server robustness. One of these is a self-monitoring ability that continuously checks for intermittent correctable errors on each supported PCI adapter. Should the error rate exceed an allowable threshold, the adapter is automatically removed from service. Because value-added software enables PCI adapters to be paired with redundant components, the duplicate card continues to function so that users and applications are not affected when a PCI adapter is taken offline.

The second is the partnering of each adapter with a duplicate component. A malfunctioning PCI adapter is replaced by hot-plugging a new adapter into the system while the server continues to operate normally. In contrast to other server designs that support hot-plug mechanisms, no software intervention is required to disable the PCI card before the device is removed. This difference reduces the probability of human error and a resulting system crash.
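These two capabilities, threshold-based self-monitoring and partnering with a duplicate card, can be modeled together in a short Python sketch. The class names, the sliding-window error count and the threshold values are all assumptions made for illustration; the actual thresholds and driver interfaces are not described here.

```python
from collections import deque
from typing import Deque, Optional


class MonitoredAdapter:
    """Tracks correctable errors and takes itself out of service past a threshold."""

    def __init__(self, name: str, max_errors: int, window_seconds: float) -> None:
        self.name = name
        self.max_errors = max_errors          # allowable errors within the window
        self.window_seconds = window_seconds  # length of the sliding window
        self.in_service = True
        self._errors: Deque[float] = deque()  # timestamps of recent errors

    def record_correctable_error(self, now: float) -> None:
        self._errors.append(now)
        # Forget errors that have aged out of the sliding window.
        while self._errors and now - self._errors[0] > self.window_seconds:
            self._errors.popleft()
        if len(self._errors) > self.max_errors:
            self.in_service = False  # error rate exceeded: remove from service


class AdapterPair:
    """Two partnered adapters; traffic uses whichever one is still in service."""

    def __init__(self, primary: MonitoredAdapter, partner: MonitoredAdapter) -> None:
        self.primary = primary
        self.partner = partner

    def active(self) -> Optional[MonitoredAdapter]:
        for adapter in (self.primary, self.partner):
            if adapter.in_service:
                return adapter
        return None


pair = AdapterPair(MonitoredAdapter("nic0", 2, 10.0), MonitoredAdapter("nic1", 2, 10.0))
for timestamp in (0.0, 1.0, 2.0):  # three errors within ten seconds exceed the limit of two
    pair.primary.record_correctable_error(timestamp)
print(pair.active().name)  # the partner card carries on
```

In this model, the failed card can be swapped for a fresh `MonitoredAdapter` while `active()` keeps returning the partner, mirroring the hot-plug replacement described above.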

Windows 2000 provides a number of tools to help create appropriately fault-hardened device drivers. One is the Windows Management Instrumentation included with the operating system. It defines core capabilities employed in the development of these drivers: the ability to read the device status, board status and revision level; the ability to add and remove device adapter cards; the ability to run diagnostics while the adapter is out of service; and the ability to update adapter firmware online.

In addition, a device driver kit is available as a design reference model to allow third parties to develop fully hardened drivers. It includes instructions, design guides and a library of code to assist developers implementing robust drivers that meet the requirements of driver certification tests offered by Microsoft for Windows 2000.

In a WinTel-based fault-tolerant server, modifications must also be made to the hardware abstraction layer. These include a reset affecting both clocks; the capability to deal with the replicated CMOS and real-time clock devices; PCI interrupt connections defined for other boards; special initializations for the Accelerated Graphics Port, which is run as a standard 64-bit PCI bus; and a machine check handler to flush L1 and L2 caches and remove the CPU from service.

By carefully modifying the hardware abstraction layer of the operating system, it is possible to implement the necessary changes in both initialization and error handling to accommodate redundant devices, and to do so without affecting the Windows 2000 kernel or requiring any modifications that would not be part of the off-the-shelf release version regularly sold by Microsoft.

See related chart




