Design Article
Get High Availability Using Effective Fault Management
Michael Christofferson
7/1/2002 4:49 AM EDT
The demand for carrier-class service, formerly the domain of the telecommunications systems suppliers, has spread to encompass virtually the entire sector of communications equipment manufacturers. However, carrier-class service requires the building of a product with an extraordinarily high level of reliability, one that is tolerant of faults and that can be maintained and enhanced without being taken out of service.
But how does one achieve this high availability? To describe the process, we must resort to a logical sequence of steps, as follows:
- Define high availability and its ramifications;
- Understand the concepts of faults, errors and failures;
- Outline the role of error detection;
- Cite the need for centralized fault management;
- Define recovery and repair models for high availability, as well as the concept of recovery domains;
- Define hardware and software design principles and requirements based on this analysis.
High availability
High availability (HA) is the classification generally used in the telecommunications industry for carrier-class systems. It means that the system features 99.999 percent uptime (five nines), or roughly 315 seconds of downtime per year (less than 1 second per day).
The availability of a system is defined by the equation:
Availability = MTTF / (MTTF + MTTR)
where MTTF = mean time to failure and MTTR = mean time to repair.
This relationship has some interesting properties. Mathematically, to improve availability, we can either increase MTTF or decrease MTTR. Increasing MTTF by a factor of N has the same effect as dividing MTTR by N (that is, scaling it by 1/N), because availability depends only on the ratio MTTR/MTTF. But if we look at the equation additively, we find that decreasing MTTR by 50 percent (for instance, MTTR -> 0.5 MTTR) is better than increasing MTTF by 50 percent (MTTF -> 1.5 MTTF): the first halves the ratio MTTR/MTTF, while the second reduces it only to about two-thirds. But there are other, more important system properties of each of these quantities. Consider an example of a system, using some simplified assumptions:
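To make the comparison concrete, here is a minimal Python sketch of the availability formula. The baseline figures (an MTTF of 1,000 hours and an MTTR of 1 hour) and the function names are assumed purely for illustration; they do not come from any measured system.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; leap years ignored


def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), both in the same time unit."""
    return mttf / (mttf + mttr)


def downtime_seconds_per_year(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * SECONDS_PER_YEAR


# Assumed baseline: MTTF = 1000 h, MTTR = 1 h.
base = availability(1000.0, 1.0)
halved_mttr = availability(1000.0, 0.5)  # MTTR -> 0.5 MTTR
longer_mttf = availability(1500.0, 1.0)  # MTTF -> 1.5 MTTF
```

With these assumed numbers, halving MTTR gives a higher availability than lengthening MTTF by half, while multiplying MTTF by N and dividing MTTR by N give the same result.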
A system is composed of N components, each with the same MTTF (call it MTTF_comp), wherein component failures are independent of each other and memoryless (not dependent on previous failures), and each has the same MTTR. Then the MTTF of the system is:
MTTF_system = MTTF_comp / N
Assume 100 distinct components. Then the system availability is:
Availability_system = (MTTF_comp / 100) / ((MTTF_comp / 100) + MTTR)
If each component is five-nines qualified, then from the general equation:
MTTF_comp = 99,999 * MTTR
Thus: Availability_system = 0.999001, or three-nines availability.
So, for the system to achieve HA status or five-nines, each component must have seven-nines-level availability. This is a simplified illustration, but the point is that we must consider MTTF as a function of each component in the system, and that the system MTTF is roughly inversely proportional to the number of independent components of a system. Improving MTTF_comp for a component has little effect on the overall MTTF_system as the number of components becomes larger, as we see in Figure 1.
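The arithmetic in this example can be checked with a short Python sketch. It uses the same simplified assumptions as the text (identical, independent components and a single shared MTTR); the function names are illustrative only, and MTTR is taken as the unit of time.

```python
def component_mttf_for(avail: float, mttr: float) -> float:
    """Invert A = MTTF / (MTTF + MTTR) to get the MTTF for a target availability."""
    return avail / (1.0 - avail) * mttr


def system_availability(mttf_comp: float, mttr: float, n: int) -> float:
    """Availability of n identical components, using MTTF_system = MTTF_comp / n."""
    mttf_system = mttf_comp / n
    return mttf_system / (mttf_system + mttr)


def required_component_availability(system_avail: float, n: int) -> float:
    """Component availability needed for the system to reach system_avail."""
    mttf_comp = n * component_mttf_for(system_avail, 1.0)  # MTTR = 1 time unit
    return mttf_comp / (mttf_comp + 1.0)


# Five-nines components, 100 of them: the system drops to about 0.999001.
five_nines_mttf = component_mttf_for(0.99999, 1.0)  # about 99,999 * MTTR
three_nines = system_availability(five_nines_mttf, 1.0, 100)
```

Calling required_component_availability(0.99999, 100) returns a value within rounding of 0.9999999, which is the seven-nines figure quoted above.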
MTTR might generally increase with system complexity (for example, the number of components). But with good design, MTTR need not be directly proportional to the number of components (Figure 2). It can be more a function of the worst-case repair time of the set of components in the system, if we can treat repair of each component more or less independently. So this is really the goal: to find methods of design that exhibit this model of MTTR so we can achieve higher system availability.
We can conceive of a smaller number of repair operations (each with an MTTR) that can fit most component failure scenarios, such that the overall problem can become manageable.
Furthermore, MTTF is typically a difficult number to calculate by any sort of analytic means. The quality of software can almost always be improved by spending more time and money, and by improving tools. But usually any significant improvement in quality (for instance, MTTF) comes at a very steep cost. MTTR on the other hand is much more deterministically computed, and often up to an order-of-magnitude improvement can be achieved at much lower relative and quantifiable cost.
To accomplish this, most HA system designers define a small class of repair operations that will fit all component failures. As we shall see, reducing MTTR by designing redundant components into the system is one of the most powerful ways to mitigate decreasing MTTF as a function of multiple components in a system. But we don't mean to imply here that MTTF considerations are unimportant. There are also software design techniques for raising the level of MTTF in systems.
Errors, faults and failures
Before we can address the concept of repair, we must clearly understand the set of phenomena that cause us to think of repair in the first place. An error is an observable condition wherein a value or response departs from the true or correct value. A fault is defined as a failure of an interacting system or component that causes an error. And a failure is defined as a condition wherein the observed behavior of the system or interacting component deviates from the specified behavior.
A fault causes an error, which may lead to a failure. Or the error may be correctable within the context of the detection of the error condition. Further, a fault may cause another fault, which causes another error, and so on. But the important distinction is that errors are observable; faults are not. We can try to "repair" the fault by correcting the error if possible, but in most cases, we cannot repair by error correction. To understand this, we need to look at the classes of errors that can occur in a system.
In general, there are two such classes. The first class is a well-defined type of error, for which the cause (fault) is well-understood, where the consequences are well-known, deterministic and bounded in scope. These often tend to be the correctable errors that don't lead to failures with proper handling. The second class of error is transient in nature, occurs infrequently, is often not repeatable or reproducible, and is typically related to timing or overload effects. Studies have shown that in mature systems, the latter class of transient error outnumbers the more deterministic error type by 100 to 1. For this class of error, we often cannot repair by correction, and thus cannot prevent failure that way. This is because we most often cannot identify the exact fault that caused such errors. Sometimes we can infer the fault from the error, but far more often than not, we have no clue. So rather, we try to repair by recovery from the failure that is (or will be) induced by the error. More often, we say we shall recover from the source of the failure, which is the fault, and fault tolerance is the extent to which fault recovery is effective.
As seen above, most faults cannot be corrected, so fault recovery operations, rather than error correction, will tend to be the primary focus for HA design. But errors are the observable, so we must first examine the role that error detection can play.
Most designers are familiar with a large variety of error-detection techniques: hardware, software, implicit, explicit, background sanity checks and so on. Any and all techniques should be exploited as much as possible in any robust system. But the real point here is that there are a large number of different types of errors that must be detected, each with its own context-sensitive ramifications.
Further, several errors may occur nearly simultaneously as a result of the same or related faults, and some can, if not properly treated, cause other errors or faults. And recall that we too often cannot determine the exact nature of the fault. This means that in most cases, error handling cannot be performed locally by the software that detected the error, because error-handling software may itself be corrupted. In HA, we need to minimize risk, and so error events must be propagated to the highest level possible if the system is to have any chance of locating the fault or dealing with appropriate repair operations. Therefore, there must be a common agent to which detected errors are forwarded for handling: a "fault manager."
Fault management
For some types of errors, we can effectively isolate or determine the source of the fault, but as we have seen, for most types of errors this is difficult or impossible. So in these cases we try to contain the fault, localizing its effect and thus preventing it from doing further damage by inducing more faults and failures. Further, even if we have determined how contained the fault may be, we still need to determine the exact collection of software or hardware units that must be repaired. To do a coherent job of all this, a unified approach to fault handling or management is required for successful HA design. The general requirements for fault management in an HA system are as follows:
- Isolation: Isolate the source of the fault to a specific hardware or software unit. Note that this is not always possible. Often it is relatively easy to isolate hardware faults vs. software faults. However, many studies have shown that in most mature systems, software faults outnumber hardware faults by a factor of six to one. So software faults are by far the most prevalent, and by far the most difficult to isolate and deal with.
- Containment: Prevent the fault from doing further damage.
- Localization: Determine the extent of the fault, such as which hardware/software components are affected.
- Identification: Determine the type or class of the fault. This determines the repair operations that need to be performed, either correction, if possible, or recovery.
Fault management's role is to provide a comprehensive framework so that implementation of repair policies may be effectively managed. So it is time to look at options for repair.
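As a rough illustration of this framework, the following Python sketch shows detected errors being forwarded to a single fault manager, which identifies the class of the error and chooses between correction and recovery. All names here (ErrorReport, FaultManager, the error-class labels) are hypothetical; a real fault manager would also perform containment and localization, which this sketch reduces to recording the reporting domain.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Repair(Enum):
    CORRECT = auto()  # well-understood, bounded error: fix in place
    RECOVER = auto()  # fault unknown: fail the domain and recover it


@dataclass(frozen=True)
class ErrorReport:
    detector: str     # component that observed the error
    domain: str       # recovery domain the detector belongs to
    error_class: str  # hypothetical label, e.g. "crc_mismatch"


class FaultManager:
    """Central agent to which every detected error is forwarded."""

    def __init__(self, correctable):
        self._correctable = set(correctable)
        self.log = []  # (report, decision) pairs, kept for later analysis

    def report(self, err: ErrorReport) -> Repair:
        # Identification: only known, bounded error classes are corrected;
        # anything else is treated as possible corruption and recovered.
        if err.error_class in self._correctable:
            decision = Repair.CORRECT
        else:
            decision = Repair.RECOVER
        self.log.append((err, decision))
        return decision
```

The design choice sketched here is that the detecting code only reports; the repair policy lives in one place, so it stays consistent across the system.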
Fault-repair design requires a reference frame that outlines the necessary responses to different types of faults. There are two main categories of repair operation: avoidance and correction.
Fault avoidance means the repair of a fault before it causes a failure. Some examples of this technique are: 1) preventive maintenance of data structures, such as background auditors or salvagers of damaged data structures, and 2) system-state consistency checks, such as repairing inconsistent states, again with auditors or salvagers. While very worthwhile because they can improve MTTF, such techniques are often based on a heuristic approach, and are difficult to do well or appropriately.
Fault correction
Fault correction means repairing the error after it has caused a failure. This class of repair operations defines the MTTR of the system. Of the correction techniques, there are two categories: masking and recovery.
With masking, the fault produces a correctable error, so that the effects of the failure can be masked from the system at large. The procedure uses some redundant or alternate information store, or it uses some self-contained correction procedures, like error-correcting codes and such. N-plex redundant hardware architectures support this form of correction.
With recovery, the error is not correctable or maskable. Instead, the failure manifests itself and we attempt to recover from its effects. There are two forms of recovery: forward and backward.
With forward recovery, the fault has produced a recoverable error, where we can recover from a failure by constructing a new "correct" state, with only minor interruption to the component or system as a whole. An example of forward recovery is the simple technique of resending a message if a negative acknowledgment or time-out on its response has occurred.
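The resend example can be sketched in a few lines of Python. The send callable and its return convention (True for an acknowledgment, False for a negative acknowledgment, None for a time-out) are assumptions made for this illustration, not part of any particular protocol stack.

```python
from typing import Callable, Optional


def send_with_retry(
    send: Callable[[bytes], Optional[bool]],
    msg: bytes,
    max_attempts: int = 3,
) -> bool:
    """Forward recovery by resending.

    `send` is assumed to return True on ACK, False on NAK and None on
    time-out. Returns True once any attempt is acknowledged, False if
    every attempt fails.
    """
    for _ in range(max_attempts):
        if send(msg) is True:
            return True
    return False
```

When every attempt fails, the error is no longer recoverable in this forward sense, and a real system would then hand the problem to its fault-management layer.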
Backward recovery means essentially that there is no appropriate way to correct the error and thus failure presents itself. Therefore, the recovery procedure is to return to a previous known-correct state, and restart operations from there.
The most prevalent form of fault repair in an HA system tends to be backward recovery. As we have seen from the previous analysis, most faults that occur in systems cannot be exactly identified, and as such the scope of damage cannot be ascertained by any analytic means, so it is not clear what needs to be fixed. And even if the exact fault could be determined, masking or forward-repair operations are often just not practical or feasible. The safest way to deal with most errors, then, is to assume that hopeless corruption has occurred, purposely fail the component(s) affected by the fault and then restore the service provided by the failed components in the most recently known error-free or correct state.
There are two basic models for backward recovery: restart, and replication or redundancy. Restart means that the failed component is restarted. Restart may also imply a reload of the component software. Replication or redundancy means that copies of the failed component exist in the system, so that upon failure, we switch over to the redundant component. The term "component" may mean either a software subsystem within a processing node or the processing node itself, or even multiple nodes. The time required to perform these operations varies with the definition of "failed component" and is reflected in the MTTR. Roughly, the MTTR for each type of recovery can be ranked as follows. For simplicity, we will not consider the multiple-component case:
1) Processing node restart (or reboot): highest MTTR;
2) Software subsystem/component restart: next-highest MTTR (can be close to or lower than 3 in some cases);
3) Processing-node switchover: next-highest MTTR (can be close to or higher than 2 in some cases);
4) Software subsystem/component switchover: lowest MTTR.
Note that in some cases, a switchover operation may not strictly be considered a backward-recovery operation, but rather an error-correction operation, by masking the fault. For example, failure of a server component and switching to an alternate server may not involve restoration of a previous state. However, this assumes that the replicate server and the active (but now failed) server were in exact state synchronization. This is rarely the case for most systems, but it does occur.
Restart is generally a very heavyweight operation, as it involves killing/deleting the failed component and completely re-initializing the component to a previous known state. Sometimes it is not possible if a fatal hardware fault has occurred. And if the failed component is software, it may be corrupted, requiring a complete reload, which is often even more time-consuming. However, sometimes restart can be an effective technique if designed properly. But replication or redundancy greatly reduces the possibility of the catastrophic single point of failure that a restart may be prone to.
So in general, switchover operations using redundant components are a better recovery method. The key to successful switchover is a reliable scheme for "checkpointing" each consistent state of a component, so that the failed component can be completely recovered if possible, or the replicated component can be restored to this last known-consistent state. So the failed component still needs to be restarted if possible, but the time that this takes is not critical.
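A minimal Python sketch of checkpointing with switchover follows. The Counter component, the CheckpointStore class and the switchover helper are toy constructs invented for this illustration; a real store would live outside the failed component's recovery domain and would itself be replicated.

```python
import copy


class CheckpointStore:
    """Holds the last known-consistent state of each component."""

    def __init__(self):
        self._states = {}

    def save(self, name, state):
        # Deep copy so later corruption of the live state cannot leak in.
        self._states[name] = copy.deepcopy(state)

    def load(self, name):
        return copy.deepcopy(self._states[name])


class Counter:
    """Toy component whose entire state is one dictionary."""

    def __init__(self, state=None):
        self.state = state if state is not None else {"count": 0}

    def handle(self):
        self.state["count"] += 1


def switchover(store, name):
    """Abandon the failed instance and start a replica from the checkpoint."""
    return Counter(store.load(name))
```

Any work done after the last checkpoint is lost on switchover, which is why the text stresses checkpointing each consistent state.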
But this raises another question: When talking about a component failure in general, how can we be sure that the fault did not corrupt other components?
Recovery domains
The concept of recovery domains is crucial to any HA design (Figure 3). The section on fault management above alluded to this, in that two of the responsibilities of fault management are to contain and localize the fault. In all cases in an HA design, the net extent of the fault, as well as exactly what needs to be recovered, must be determined. And for appropriate recovery operations, all the information that represents the last known-consistent state of the component must be known in advance. The set of components, hardware and software, that represents this is called the recovery domain of the fault. So, the recovery domain is what defines the containment and localization properties of a fault.
Very careful design and implementation consideration must be given to the concept of recovery domains. Fortunately, there are some guidelines for this in software design, and these will be discussed shortly below. Another important consideration concerning recovery domains involves repair time. In general, the smaller the recovery domain for a given fault, the lower the MTTR for recovery from the fault, regardless of the recovery technique employed. This then becomes an important design goal.
Fault-tolerant architectures
Time and space do not permit an extensive treatment of fault-tolerant architectures. But some concepts need to be mentioned. In general, there are three processor-system architectures that are widely used in HA designs: redundant pairs, clusters and N-plexing.
Redundant pairs are two processors with the same software load, but that are not running in parallel. One is currently active and the other is in standby, waiting to take over in case of active failure. Given this, usually the backward-recovery operation, with switchover, is the repair mechanism used, as the standby may often not be in complete sync with the active when it fails. It may have to be updated, or be brought to a state consistent with the rest of the system when a switchover operation occurs. Clearly, when a redundant-pair switchover occurs, the recovery domain of the fault was the processor domain. Note that redundant-pair switchover is often done as a result of software errors, not hardware or processor failure. Recall that this is because rebooting or restarting the node often takes too much time. So redundant processors are not just for hardware failure. This type of replication for recovery is the most widely used, and conceptually the easiest to deal with.
Clusters are groups of processors that are loosely coupled, directly connected, but electrically isolated. Not all processors execute the same software. Rather, most if not all the software subsystems defined for the system are replicated somewhere in the cluster. This architecture can best exploit the switchover recovery mechanism at the software subsystem level; the recovery domain here is a subsystem. If done well, this architecture can exploit the typically lower MTTR of subsystem switchover operations, and deliver very high-availability solutions. If services are partitioned well, and replicated a sufficient number of times across the cluster, this architecture can also support processor domain failures, sometimes more elegantly than redundant pairs, but at the expense of greater complexity.
Extreme-case reliability
N-plexing refers to an arrangement of N processors, where N is ≥2. For completeness, we mention it here. Each processor is executing the same code in parallel, and may share the same interface or communications devices.
The concept is that for all system outputs, there is an arbitration action to decide which output is used if there is disagreement. If N is ≥3, then a voting mechanism is used. In essence, this is a form of fault correction by masking the fault, not recovering. If N = 2, then typically, the N-plex pair is stopped and then some recovery operation takes place; perhaps switching over to a backup N-plex pair, for a more sophisticated form of the redundant-pairs architecture. Due to cost considerations, N-plexing is mostly used in systems with more safety-critical requirements like transportation and large factory process-control systems where failure can be catastrophic.
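The arbitration step can be illustrated with a small Python sketch of a strict-majority voter. The function name and the convention of returning None when no majority exists are assumptions for this example; real N-plex arbitration is usually done in hardware or at a much lower level.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output produced by a strict majority of processors.

    Returns None when no strict majority exists, for example when a
    2-plex pair disagrees; the caller is then expected to stop the
    processors and start a recovery operation.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None
```

With three processors, one faulty output is simply outvoted, which is the masking behavior described above; with two, a disagreement can only be detected, not resolved.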
The requirements of fault management are satisfied both by specific units of software, and by specific design practices and principles. So far, we have focused on a framework to help us understand the issues and general techniques involved in fault tolerance for HA. And we have looked at some high-level aspects of hardware fault tolerance. Now it is time to draw some more practical conclusions concerning software HA design. The following is a list of such design principles and requirements.
One of the best ways to improve the overall reliability, and thus MTTF, of a software system is to honor the concept of simplicity. It has long been known that decomposing a system into many small modules yields better, more reliable results than having large modules. Complexity increases exponentially with size. The same argument applies to the programming patterns or models used, as expressed through the application programming interfaces (APIs) used. Keeping the number of patterns and APIs to a minimum will yield more reliable code. This also applies to operating-system APIs and the programming patterns they support.
Software separation
We have seen that two of the major goals of fault management in HA design are containment and localization; that is, containing the effect or spread of a fault and localizing the exact pieces of software that are affected and need to be repaired. To accomplish this, designers should organize all software subsystems into separate recovery domains that are both physically and logically separated from each other: physically, by memory protection (MMU), and logically, by avoiding shared data. The finer the decomposition into recovery domain subsystems, the better to take advantage of lower MTTR for recovery operations.
All communications between such subsystems should use clean transparent interfaces so that the state of each subsystem is completely self-contained. It is also extremely desirable that each subsystem have its own private resources. Minimize or avoid global or globally shared resources. Expediency and performance concerns should not override the separation principle if an HA-qualified system is desired.
Though backward recovery is the most prevalent and successful approach, it imposes some specific requirements on our design and operating environment (including the operating system). These are as follows:
- Checkpointing, or saving system/subsystem state, so that in recovery operations, the state may be restored. There are many design models and patterns for this technique, and we don't have time to survey them here. The point is that system-state data must exist outside the recovery domain to ensure that it has not been corrupted by a fault. Further, this repository of data must itself be replicated or redundant to protect against a single point of failure; and depending on the architecture and method of recovery, it must be conveniently accessible from other processing units in the system.
- Resource reclamation: Backward recovery requires that the offending recovery domain be "killed." This means that the processes or tasks within the domain are to be terminated. It is additionally necessary and extremely important to return, reclaim or clean up all other resources that are owned or have been allocated by the offending subsystem. Otherwise, the domain may not be able to be properly restarted, or the system will suffer resource leakages and cause other failures over time. Often the operating system provides some help with this, and this in particular should be looked for as a beneficial real-time operating system (RTOS) feature.
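Resource reclamation can be pictured as a ledger that records, per recovery domain, how to release everything that domain owns. The Python sketch below is a toy model with hypothetical names (ResourceLedger, acquire, reclaim); in practice this bookkeeping is typically done by the operating system rather than by application code.

```python
class ResourceLedger:
    """Tracks release actions for the resources each recovery domain owns."""

    def __init__(self):
        self._owned = {}  # domain name -> list of release callbacks

    def acquire(self, domain, release):
        """Record that `domain` owns a resource freed by calling `release`."""
        self._owned.setdefault(domain, []).append(release)

    def reclaim(self, domain):
        """Release everything the domain owns, newest first.

        Returns the number of resources released, so that a second call
        for the same domain is harmless and returns 0.
        """
        callbacks = self._owned.pop(domain, [])
        for release in reversed(callbacks):
            release()
        return len(callbacks)
```

Keeping the ledger outside the domain means the cleanup still works when the domain itself is assumed to be corrupted.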
Health/event monitoring
The need for supervision or health/event monitoring was stated in the section on error detection. But it can be derived from the fact that with faults, and backward-error recovery, software components of the system may cease providing service at any time. In other words, we need to look at the flip side of faults and failures; we need to look at those components that are interacting with the now-failed component. Applications in communication with failed components must have reliable notification when this occurs. For example, a server application needs to know when a client dies, to free any resources that it has allocated from the server, and also to know when to switch over if necessary (Figure 4). The application should not have to actively poll to do this. Rather, the operating environment or the operating system itself should offer a service to provide these notifications automatically. This requirement also applies across a network.
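The notification requirement can be sketched as a small publish/subscribe service in Python. The Supervisor class and its method names are hypothetical; the sketch assumes the operating environment calls component_died when it detects a death, so that interested applications never have to poll.

```python
class Supervisor:
    """Delivers a one-time notification when a watched component dies."""

    def __init__(self):
        self._watchers = {}  # watched component -> list of callbacks

    def attach(self, watched, callback):
        """Ask to be notified when `watched` stops providing service."""
        self._watchers.setdefault(watched, []).append(callback)

    def component_died(self, watched):
        """Assumed to be invoked by the operating environment, not by pollers."""
        for callback in self._watchers.pop(watched, []):
            callback(watched)
```

A server would attach to each client it serves and free that client's resources in the callback; a client would attach to its server and trigger a switchover instead.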
Communications
A result of backward recovery using the switchover method at the subsystem level is that a scheme for communications transparency is required. This is especially true in a cluster system architecture. By transparency we mean that components that need to switch over communications with a replicated component (of the failed one) must not be required to "know" exactly where the replicated component is. It may be on the same processing unit or another in the network, but the communications scheme employed must be identical.
Furthermore, the route addresses to all replicated services must be known in advance. Again, this requirement usually needs to be addressed by the operating environment or operating system.
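One way to picture communications transparency is a directory that maps a service name to the routes of its replicas, all registered in advance. The Python sketch below uses invented names (ServiceDirectory, fail_over) and placeholder addresses; it is not the interface of any particular operating system or middleware.

```python
class ServiceDirectory:
    """Maps a service name to its replicas; callers never see locations."""

    def __init__(self):
        self._routes = {}  # service name -> addresses, active replica first

    def register(self, service, address):
        """Register a replica's route ahead of time."""
        self._routes.setdefault(service, []).append(address)

    def resolve(self, service):
        """Return the route of the currently active replica."""
        return self._routes[service][0]

    def fail_over(self, service):
        """Drop the active replica; the next registered one becomes active."""
        routes = self._routes[service]
        if len(routes) < 2:
            raise RuntimeError(f"no replica left for {service!r}")
        routes.pop(0)
        return routes[0]
```

Clients keep addressing the service by name before and after a switchover, whether the replica is on the same processing unit or elsewhere in the network.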
Due to the wide variety of faults, and their differing nature, error handling (or, more precisely, fault management) must not be handled at the lowest level, such as in the code that detects the error. Processing error returns from system entities (like RTOSes) in application code space is generally a bad idea. Rather, it is better to "throw" the error to some centralized fault management entity, to help contain the fault and produce consistent repair policies. It is good practice to associate an error handler or fault manager component with the system, as well as with each defined recovery domain (Figure 5).
In the case of backward recovery, it is especially important that this component not be part of the recovery domain itself, as we must assume that the whole recovery domain is corrupted. Another benefit of this approach is that separation of error- or fault-handling code from the application into a centralized place makes for simpler code modules, and as we have seen, simplicity itself is a virtue in HA design. In essence this component is part of the operating environment, and any help that the operating system itself can provide in this regard is of great benefit.
Another derived requirement from backward recovery at the subsystem component level is that of dynamic configurability of these components. The failed component must ultimately be restarted, and as such, must be able to dynamically bind or configure itself to the run-time environment, so it can re-establish communications and operation with the rest of the system. All or most static configurations for a software component should be avoided. This is especially true if the component is required to be dynamically loaded. Dynamic loading of individual components, while not strictly required for a fault-tolerant system, is often a very useful recovery aid, and is employed as such in many systems. It is especially useful in field upgrade or maintenance situations. We have not specifically addressed field upgradability and maintenance, but the same rules apply. To upgrade or maintain a fielded system, some piece or pieces of it may be taken out of service and upgraded, or repaired. This should be minimized to avoid affecting the MTTR.
Dynamic configurability of individual software components needs direct support from the operating environment as it includes communication issues.
In trying to understand what is required by HA, we have looked at a rather high-level framework involving the definition of HA, the nature of faults, fault detection issues and their effects, as well as what repair means, and the implications of various repair methods and hardware architectures. All of this helps clearly define which approaches are important and why. We have also derived some useful design principles and requirements from this framework for implementing HA systems. Most of these requirements relate directly to the services and features of the operating environment or operating system, which we haven't discussed in any detail (this is another topic). But designers would be well-rewarded by carefully examining these requirements with respect to selection of commercial operating systems and middleware solutions. Many such solutions exist, and not all are equivalent to each other.
Michael Christofferson is a product-marketing manager for OSE Systems Inc. He holds an MS in physics from the University of Michigan and has spent 15 years programming, designing and specifying systems for the defense communications, data communications and telecommunications industries. Christofferson has spent the last six years with a variety of leading RTOS technology providers. Michael can be reached at mikec@enea.com.



