Picking the right system design methodology for your embedded apps: Part 3
Wayne Wolf
3/17/2010 12:05 AM EDT
A product can be of low quality for several reasons: it was shoddily manufactured, its components were improperly designed, its architecture was poorly conceived, or its requirements were poorly understood.
Quality must be designed in. You can't test out enough bugs to deliver a high-quality product. The quality assurance (QA) process is vital for the delivery of a satisfactory system. In this last part of the series, we will concentrate on portions of the methodology particularly aimed at improving the quality of the resulting system.
The software testing techniques described earlier in this series constitute one component of quality assurance, but the pursuit of quality extends throughout the design flow. For example, settling on the proper requirements and specification is an important determinant of quality. If the system is too difficult to design, it will probably also be difficult to keep working properly.
Customers may desire features that sound nice but in fact don't add much to the overall usefulness of the system. In many cases, having too many features only makes the design more complicated and the final device more prone to breakage.
To help us understand the importance of QA, the Application Example below describes serious safety problems in one computer-controlled medical system. Medical equipment, like aviation electronics, is a safety-critical application; unfortunately, this medical equipment caused deaths before its design errors were properly understood.
This example also allows us to use specification techniques to understand software design problems. In the rest of the section, we look at several ways of improving quality: design reviews, measurement-based QA, and techniques for debugging large systems.
Application Example. The Therac-25 radiation therapy machine. The Therac-25 radiation therapy machine caused what Leveson and Turner called "the most serious computer-related accidents to date (at least nonmilitary and admitted)."
In the course of six known accidents, these machines delivered massive radiation overdoses, causing deaths and serious injuries. Leveson and Turner analyzed the Therac-25 system and the causes for these accidents.
The Therac-25 was controlled by a PDP-11 minicomputer. The computer was responsible for controlling a radiation gun that delivered a dose of radiation to the patient. It also ran a terminal that presented the main user interface. The machine's software was developed by a single programmer in PDP-11 assembly language over several years. The software included four major components: stored data, a scheduler, a set of tasks, and interrupt services. The three major critical tasks in the system were as follows:
1) A treatment monitor controls and monitors the setup and delivery of the treatment in eight phases.
2) A servo task controls the radiation gun, machine motions, and so on.
3) A housekeeper task takes care of system status interlocks and limit checks. (A limit check determines whether some system parameter has gone beyond preset limits.)
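The limit check mentioned in the housekeeper task can be sketched in a few lines. This is only an illustration in Python, not the Therac-25's PDP-11 assembly code; the parameter names and bounds (`magnet_current_a`, `dose_rate`, and their ranges) are hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Preset bounds for one monitored parameter (hypothetical names and values)."""
    name: str
    low: float
    high: float

    def check(self, value: float) -> bool:
        """Return True when the value lies within the preset limits."""
        return self.low <= value <= self.high


def out_of_limits(readings: dict, limits: list) -> list:
    """Return the names of parameters that are missing or outside their limits."""
    faults = []
    for limit in limits:
        value = readings.get(limit.name)
        # A missing reading is reported as a fault rather than silently ignored.
        if value is None or not limit.check(value):
            faults.append(limit.name)
    return faults


# Hypothetical limits, for illustration only.
LIMITS = [Limit("magnet_current_a", 0.0, 50.0), Limit("dose_rate", 0.0, 300.0)]

print(out_of_limits({"magnet_current_a": 12.5, "dose_rate": 450.0}, LIMITS))
# prints ['dose_rate']
```

Treating a missing reading as a fault is a design choice of this sketch: a check that passes when it has no data gives false confidence.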
The code was relatively crude: the software allowed several processes access to shared memory, there was no synchronization mechanism aside from shared variables, and test-and-set operations on shared variables were not indivisible.
Let's examine the software problems responsible for one series of accidents. Leveson and Turner reverse-engineered a specification for the relevant software, which is described below.
Treat is the treatment monitor task, divided into eight subroutines (Reset, Datent, and so on). Tphase is a variable that controls which of these subroutines is currently executing. Treat reschedules itself after the execution of each subroutine. The Datent subroutine communicates with the keyboard entry task via the data entry completion flag, which is a shared variable.
Datent looks at this flag to determine when it should leave the data entry mode and go to the Setup test mode. The Mode/energy offset variable is a shared variable: the high-order byte holds offset parameters used by the Datent subroutine, and the low-order byte holds the mode and energy offset used by the Hand task.
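To make the layout of such a shared variable concrete, here is a minimal sketch of one 16-bit word whose two bytes are read by different consumers. It is written in Python for illustration, not in the original assembly language, and the helper names (`pack`, `high_byte`, `low_byte`, `set_low_byte`) and the byte values are hypothetical.

```python
BYTE_MASK = 0xFF  # one 8-bit field; two of them fill a 16-bit word


def pack(high: int, low: int) -> int:
    """Combine two one-byte fields into a single 16-bit word."""
    if not (0 <= high <= BYTE_MASK and 0 <= low <= BYTE_MASK):
        raise ValueError("each field must fit in one byte")
    return (high << 8) | low


def high_byte(word: int) -> int:
    """Field read by one consumer (Datent, in the description above)."""
    return (word >> 8) & BYTE_MASK


def low_byte(word: int) -> int:
    """Field read by the other consumer (the Hand task)."""
    return word & BYTE_MASK


def set_low_byte(word: int, low: int) -> int:
    """Return the word with only its low-order byte replaced."""
    return (word & 0xFF00) | (low & BYTE_MASK)


word = pack(0x12, 0x34)                # arbitrary example values
edited = set_low_byte(word, 0x56)      # only the low-order byte is edited
print(hex(high_byte(edited)), hex(low_byte(edited)))
# prints 0x12 0x56
```

The point of the sketch is that the two fields live in one variable but are consumed independently, so nothing in the data layout itself forces the two readers to agree on when the value is final.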
When the machine is run, the operator is forced to enter the mode and energy (there is one mode in which the energy is set to a default), but the operator can later edit the mode and energy separately.
The software's behavior is timing dependent. If the keyboard handler sets the completion variable before the operator changes the Mode/energy data, the Datent subroutine will not detect the change: once Treat leaves Datent, it will not enter that subroutine again during the treatment. However, the Hand task, which runs concurrently, will see the new Mode/energy information. Apparently, the software included no checks to detect the incompatible data.
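The ordering problem described above can be reproduced with a small deterministic model. This is a heavily simplified sketch in Python, not a reconstruction of the actual code: the class and function names, the phase labels, and the values "A" and "B" are all hypothetical, and real concurrency is replaced by an explicit sequence of calls so that both orderings can be compared.

```python
class SharedState:
    """Variables shared between tasks (simplified, hypothetical model)."""

    def __init__(self):
        self.mode_energy = None        # latest value typed by the operator
        self.entry_complete = False    # data entry completion flag
        self.tphase = "DATENT"         # selects the Treat subroutine to run


def keyboard_enter(state, value):
    """Keyboard handler: store the first entry and set the completion flag."""
    state.mode_energy = value
    state.entry_complete = True


def keyboard_edit(state, value):
    """Later edit by the operator; the completion flag is already set."""
    state.mode_energy = value


class Treat:
    """Treatment monitor: runs one subroutine per step, chosen by tphase."""

    def __init__(self, state):
        self.state = state
        self.datent_value = None       # what Datent last read

    def step(self):
        if self.state.tphase == "DATENT":
            self.datent_value = self.state.mode_energy
            if self.state.entry_complete:
                # Datent is not entered again once this phase is left.
                self.state.tphase = "SETUP_TEST"
        # Later phases never re-read mode_energy in this sketch.


def hand_view(state):
    """Hand task: always reads the current shared value."""
    return state.mode_energy


def scenario(edit_after_datent_exits):
    """Return (value seen by Datent, value seen by Hand, whether they agree)."""
    state = SharedState()
    treat = Treat(state)
    keyboard_enter(state, "A")
    if not edit_after_datent_exits:
        keyboard_edit(state, "B")
    treat.step()
    if edit_after_datent_exits:
        keyboard_edit(state, "B")
    treat.step()
    seen_by_datent = treat.datent_value
    seen_by_hand = hand_view(state)
    return seen_by_datent, seen_by_hand, seen_by_datent == seen_by_hand


print(scenario(False))  # ('B', 'B', True)
print(scenario(True))   # ('A', 'B', False)
```

The third element of each result is the kind of consistency check that, according to the text, the software apparently lacked: comparing the two views before proceeding would expose the mismatch in the second ordering.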
After the Mode/energy data are set, the software sends parameters to a digital/analog converter and then calls a Magnet subroutine to set the bending magnets. Setting the magnets takes about 8 seconds, and a subroutine called Ptime is used to introduce the time delay.
Due to the way that Datent, Magnet, and Ptime are written, it is possible that changes to the parameters made by the user can be shown on the screen but will not be sensed by Datent. One accident occurred when the operator initially entered Mode/energy, went to the command line, changed Mode/energy, and returned to the command line within 8 seconds.
The error therefore depended on the typing speed of the operator. Since operators become faster and more skillful with the machine over time, this error was more likely to occur with experienced operators.
Leveson and Turner emphasize that the following poor design methodologies and flawed architectures were at the root of the particular bugs that led to the accidents:
1) The designers performed a very limited safety analysis. For example, low probabilities were assigned to certain errors with no apparent justification.
2) Mechanical backups were not used to check the operation of the machine (such as testing beam energy), even though such backups were employed in earlier models of the machine.
3) Programmers created overly complex programs based on unreliable coding styles.
In summary, the designers of the Therac-25 relied on system testing with insufficient module testing or formal analysis.


