Design Article
Pay attention to design patterns in embedded high availability net systems
David Kalinsky, Director of Customer Education, OSE Systems, Inc. San Jose, Calif.
4/18/2003 9:37 AM EDT
Uninterrupted service is expected from many of the connected embedded systems that surround us every day: the phone system, automated banking and credit card verification.
Designers must implement embedded networking infrastructure systems that run reliably enough to be in service 99.999% of the time: "five-nines" availability, equivalent to less than 1 second of downtime per day.
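The arithmetic behind the "five-nines" figure can be checked with a short sketch. The function name and the 365-day year are assumptions made for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def downtime_seconds(availability: float, period_seconds: float) -> float:
    """Return the downtime budget, in seconds, for a given availability."""
    return (1.0 - availability) * period_seconds

# Five-nines availability over one day and over one (365-day) year.
per_day = downtime_seconds(0.99999, SECONDS_PER_DAY)
per_year = downtime_seconds(0.99999, 365 * SECONDS_PER_DAY)

print(f"{per_day:.3f} s/day, {per_year / 60:.2f} min/year")
# prints "0.864 s/day, 5.26 min/year"
```

In other words, the whole downtime budget for a year is a little over five minutes.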
Designers who want to build a high availability system need to focus a major proportion of their design effort on issues of failures and faults. A "failure" can be defined as a situation in which the service provided by a system does not meet its specification. A "fault", on the other hand, is a failure of an interacting system. It might be a failure of a sub-system within a large system, or a component failure, or a failure of an external system, or a programming mistake. Faults can trigger additional faults. Or they can sometimes trigger failures.
Faults may be transient, permanent or intermittent. When they are activated, they may cause errors in the state of a system or sub-system. And it is these errors that can trigger failures of your system. There are a number of approaches to dealing with faults, the most popular of which is "fault tolerance". It is implemented by using redundant, perhaps diverse implementations of a system or its subsystems to avoid the effects of faults.
The main design concept for fault tolerance is redundancy. It's based on the idea that multiple independent faults will not strike your system together. A fault tolerant system should be designed to avoid single points-of-failure. Redundancy can be implemented in a variety of dimensions, including hardware redundancy, software redundancy, time redundancy, and information redundancy.
Examples of hardware redundancy include self-checking logic circuits and multiple flight computers in a single airplane. Software redundancy might use two different algorithms to calculate the same result. Time redundancy may be done by communication re-transmissions. And information redundancy can be done using backups, checksums and error correction codes.
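As a rough illustration of two of these dimensions working together, the sketch below protects a message with a CRC-32 checksum (information redundancy) and re-sends it when the check fails (time redundancy). The framing format, the function names and the simulated flaky channel are all assumptions made for this example, not part of any particular protocol.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload (information redundancy)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def unframe(data: bytes):
    """Return the payload if its checksum matches, otherwise None."""
    payload, received = data[:-4], int.from_bytes(data[-4:], "big")
    return payload if zlib.crc32(payload) == received else None

def send_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Re-transmit until the receiver's check passes (time redundancy)."""
    for attempt in range(1, max_attempts + 1):
        result = unframe(channel(frame(payload)))
        if result is not None:
            return result, attempt
    raise IOError("all transmission attempts failed")

def make_flaky_channel(bad_transmissions: int):
    """Hypothetical channel that flips one bit in its first N transmissions."""
    state = {"remaining": bad_transmissions}
    def channel(data: bytes) -> bytes:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data
    return channel

result, attempts = send_with_retry(b"hello", make_flaky_channel(1))
print(result, attempts)  # prints "b'hello' 2"
```

The first transmission is corrupted and rejected by the checksum; the second arrives intact.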
Redundancy may be either dynamic or static. Both use replicated system elements. In static redundancy, all replicas are active at the same time: If one replica "throws a fault", the other replicas can be used immediately to allow the system to continue correct operation. In dynamic redundancy, one replica is "active" and others are "passive": If the "active" replica "throws a fault", a previously "passive" replica is activated and takes over critical operations.
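The difference between the two styles can be sketched in a few lines. This is a simplified model under stated assumptions: replicas are plain functions, a faulty replica in the static case returns a wrong value, and a faulty replica in the dynamic case raises an exception that stands in for fault detection. The function names are hypothetical.

```python
from collections import Counter

def static_redundancy(replicas, x):
    """All replicas run every time; a majority vote masks a faulty one."""
    results = [replica(x) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value

def dynamic_redundancy(replicas, x):
    """Only the active replica runs; on a fault, a passive one takes over."""
    for replica in replicas:
        try:
            return replica(x)
        except Exception:
            continue  # activate the next passive replica
    raise RuntimeError("all replicas failed")

def good(x):
    return 2 * x

def wrong(x):
    return 2 * x + 1  # simulated faulty replica: wrong answer

def crash(x):
    raise ValueError("simulated detected fault")

print(static_redundancy([good, wrong, good], 21))  # prints 42
print(dynamic_redundancy([crash, good], 21))       # prints 42
```

Static redundancy pays the cost of running every replica all the time but continues without interruption; dynamic redundancy runs one replica at a time but depends on the fault being detected.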
Most hardware faults are random faults resulting from physical defects. Software faults, on the other hand, are not physical. Software does not wear out. Instead, software faults result from the invocation of software paths that contain defects in the software design or implementation that were always there. Since software is typically much more complex than hardware, it can be expected to have many such built-in defects. Software fault tolerance is thus much more expensive to design than hardware fault tolerance.
"N-version programming" is a veteran design pattern for software fault tolerance. It is based on "static" redundancy, and is the software analogy of hardware N-Plexing. But it's not as simple as the hardware replication of N-Plexing. If N copies of the same software were running they'd simply contain the same software faults and produce the same software errors N times. So if N units of some software functionality need to run in parallel, they need to be N disparate implementations of that functionality - independently implemented by N separate development teams.
Back in the 1970s, N-version programming was the state of the art in software fault tolerance. But since then, experience has exposed a number of problems with this design pattern. Software development costs skyrocket when you use it. And if you try to skimp on some of the costs, you'll run into what's called the "average IQ" problem: less expensive development teams contain less-qualified software engineers who will produce lower-quality code. So, you may end up with N diverse programs that are all riddled with faults, created in N different ways.
Overcoming faults
Another downfall of N-version programming is the issue of what to provide as input to the N independent development teams. In general, a single specification is photocopied and provided to all N development teams. But if this specification is flawed, you'll get N independently developed versions of similarly flawed software that'll all do the wrong thing. Nowadays, the money that N-version programming would cost is usually thought to be better spent on asking one top-notch software development team to develop one high-quality software version using the best available infrastructure, software development tools, techniques and testing.
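The core idea of N-version programming can be sketched as follows. This is an illustrative toy, not a production voter: the three integer square root "versions" stand in for independently developed implementations, and the third one contains a deliberately injected rounding defect. All names are assumptions made for the example.

```python
import math
from collections import Counter

def isqrt_library(n: int) -> int:
    """Version A: relies on the standard library."""
    return math.isqrt(n)

def isqrt_bisect(n: int) -> int:
    """Version B: binary search, keeping lo*lo <= n < hi*hi."""
    lo, hi = 0, n + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo

def isqrt_defective(n: int) -> int:
    """Version C: injected defect, rounds to nearest instead of flooring."""
    return round(n ** 0.5)

def vote(results):
    """Return the majority result, or fail if there is no majority."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("versions disagree with no majority")
    return value

diverse = [isqrt_library, isqrt_bisect, isqrt_defective]
print(vote([f(8) for f in diverse]))   # prints 2: the defect is outvoted

copies = [isqrt_defective] * 3
print(vote([f(8) for f in copies]))    # prints 3: same fault, N times
```

The second call shows why replicating one implementation N times does not help: all copies agree on the same wrong answer. Likewise, if all versions were built from the same flawed specification, the vote would be unanimous and wrong.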
In contrast to the static redundancy of N-version programming, there are a number of software fault tolerance design patterns based on dynamic redundancy. These often take a "fail stop" approach when software errors are detected. But it is often unclear how to "un-do" the effects of faulty software behavior surrounding an error.
One very helpful tool in this regard is the concept of a transaction. A transaction is a collection of operations on the state of an application, such that the beginning of a transaction and the end of a transaction are points at which the application is in a consistent state.
If we want to use the concept of transactions for fault tolerance, our system must be able to save its state at the beginning of a transaction. This is called "checkpointing". It involves taking a "snapshot" of the software's situation just as it is about to begin the first step of the new transaction.
The snapshot is only taken if the previous transaction was completed in an error-free state. The basic recovery strategy here is re-execution: When an error is detected during a transaction, the transaction is "failstopped" and then the system is reloaded back to the latest saved checkpoint. Then service is continued from this checkpoint, allowing new transactions to build upon its consistent state. This kind of error recovery is referred to as "backward error recovery", since the software state is rolled back to a past error-free point.
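A minimal in-memory sketch of checkpointing with backward error recovery might look like this. It assumes the application state is a plain Python dictionary and that any exception raised by an operation counts as a detected error; the class and method names are hypothetical.

```python
import copy

class CheckpointedState:
    """Application state with snapshot-and-rollback around transactions."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def run_transaction(self, operations) -> bool:
        # Snapshot the consistent state just before the first step.
        self._checkpoint = copy.deepcopy(self.state)
        try:
            for operation in operations:
                operation(self.state)
            return True
        except Exception:
            # Fail-stop the transaction and roll back to the checkpoint.
            self.state = copy.deepcopy(self._checkpoint)
            return False

def debit(s):
    s["a"] -= 30

def credit(s):
    s["b"] += 30

def faulty(s):
    raise RuntimeError("simulated error in mid-transaction")

accounts = CheckpointedState({"a": 100, "b": 0})
first = accounts.run_transaction([debit, credit])
second = accounts.run_transaction([debit, faulty, credit])
print(first, second, accounts.state)
# prints "True False {'a': 70, 'b': 30}"
```

The second transaction debits one account and then fails before the matching credit; the rollback discards the half-finished debit, so the state stays consistent.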
Simple checkpointing has its own dangerous single point-of-failure: There might be a fault during the taking of the "snapshot" of the checkpoint itself. But there's a solution to this, sometimes called "checkpoint-rollback".
In a checkpoint-rollback configuration, a software client on some sort of remote system or device and a software server communicate with one another by sending messages through queues. During a transaction, data are being modified within the server. At the end of a transaction, a consistent set of data should be recorded on each of two persistent mass-storage devices. Together with the data, a transaction sequence number should be recorded. If an error is detected sometime later and the server is fail-stopped, it may be restarted or a replica server may be started. As part of the startup recovery procedure, the transaction sequence numbers on the two mass-storage devices are checked. Recovery of server data will be done from the device containing the highest sequence number.
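The recovery rule can be sketched with two files standing in for the two mass-storage devices. Several details here are assumptions added for the example and are not taken from the article: the JSON record layout, the file names, and in particular the CRC-32 used to recognize a snapshot that was interrupted partway through writing.

```python
import json
import os
import tempfile
import zlib

def write_checkpoint(path, seq, data):
    """Record data plus its transaction sequence number on one device."""
    body = json.dumps({"seq": seq, "data": data}, sort_keys=True)
    record = {"crc": zlib.crc32(body.encode()), "body": body}
    with open(path, "w") as f:
        json.dump(record, f)

def read_checkpoint(path):
    """Return (seq, data), or None if the record is missing or damaged."""
    try:
        with open(path) as f:
            record = json.load(f)
        body = record["body"]
        if zlib.crc32(body.encode()) != record["crc"]:
            return None
        parsed = json.loads(body)
        return parsed["seq"], parsed["data"]
    except (OSError, ValueError, KeyError, TypeError):
        return None

def recover(paths):
    """Recover from the valid checkpoint with the highest sequence number."""
    valid = [c for c in map(read_checkpoint, paths) if c is not None]
    if not valid:
        raise RuntimeError("no usable checkpoint on any device")
    return max(valid, key=lambda c: c[0])

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        devices = [os.path.join(tmp, "device_a.json"),
                   os.path.join(tmp, "device_b.json")]
        for seq, balance in [(1, 100), (2, 70)]:
            for path in devices:
                write_checkpoint(path, seq, {"balance": balance})
        # Simulated fault while taking snapshot 3: device A is left
        # holding a torn, unreadable record.
        with open(devices[0], "w") as f:
            f.write('{"crc": 12')
        return recover(devices)

print(demo())  # prints "(2, {'balance': 70})"
```

Because the devices are written one after the other, a fault during a snapshot can damage at most one of them, and the other still holds the last complete checkpoint.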
A limitation of this checkpointing design pattern is that recovery after a fault may take quite a long time. What could speed this up is a "hot backup" server working directly with persistent mass storage device(s) of its own. This design pattern is called "Process Pairs."
In this pattern the "primary" server works very much as in the previous checkpointing scenario. Whenever the primary server successfully completes an entire transaction, it passes information about its new consistent state to a "backup" server (on the right). Both the primary and backup servers record these data in their persistent mass-storage devices. While the primary server is "up" and available to clients, it sends regular "heartbeat" messages to the backup server. If the backup server detects that the stream of heartbeat messages has stopped, it understands that the primary server is stopped or unavailable, and it will very quickly take over as a new primary server.
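The heartbeat-and-takeover logic can be sketched as a single-process simulation. This is only a model under stated assumptions: a logical clock replaces real timers, direct method calls replace message passing, persistent storage is left out, and the class names and timeout value are invented for the example.

```python
class Backup:
    """Hot backup: mirrors committed state and watches for heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = 0
        self.state = {}
        self.is_primary = False

    def on_checkpoint(self, state):
        self.state = dict(state)  # copy of the new consistent state

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def check(self, now):
        """Take over if heartbeats have been silent for too long."""
        if not self.is_primary and now - self.last_heartbeat > self.timeout:
            self.is_primary = True
        return self.is_primary

class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.state = {}
        self.alive = True

    def commit(self, key, value):
        """Complete a transaction, then pass the new state to the backup."""
        self.state[key] = value
        self.backup.on_checkpoint(self.state)

    def tick(self, now):
        if self.alive:
            self.backup.on_heartbeat(now)

backup = Backup(timeout=3)
primary = Primary(backup)
primary.commit("balance", 70)

takeover_time = None
for now in range(1, 11):
    if now == 5:
        primary.alive = False  # simulated failure of the primary
    primary.tick(now)
    if backup.check(now):
        takeover_time = now
        break

print(takeover_time, backup.state)  # prints "8 {'balance': 70}"
```

The last heartbeat arrives at time 4, so with a timeout of 3 the backup promotes itself at time 8 and already holds the last committed state. The timeout is a trade-off: a shorter one gives faster takeover but raises the risk of a false takeover when heartbeats are merely delayed.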
It is only design patterns such as "process pairs" that will allow everyday commercial-quality hardware and software to be used as building blocks for truly high-availability systems: systems that can, without human intervention, achieve "five-nines" (99.999%) or greater availability, equivalent to less than 1 second of downtime per day, over years of operation.
This article was excerpted from ESC paper 349, titled "Some Design Patterns for High Availability Embedded Systems."


See related chart
