Gauging RTOSes' real-world response times
David N. Kleidermacher
1/12/2004 10:47 AM EST
Embedded systems are deemed real-time if they will suffer a failure when a critical system event cannot be serviced in the required time, and for many embedded systems this time limit is very short. For example, a semiconductor equipment device has a fire-control processor, which must respond within a mere handful of microseconds to prevent the semiconductor from being damaged or destroyed. The real-time operating system (RTOS) must not only support a fast response time, but the worst-case response time must be known and guaranteed not to be exceeded. Most operating systems suffer from an inability either to meet tight response time requirements or to prove (and thus guarantee) a worst-case response time.
The importance of understanding the worst-case interrupt-disabling sequence must not be understated. The reality, however, is that operating system vendors generally publish an average or best-case interrupt latency, measured in a lab environment. Is it possible to compute the worst-case disabling region statically? A team of researchers recently attempted to answer that question for one commercial RTOS in use today.
The case study employed some advanced methods of program flow analysis in an attempt to determine the location and structure of all the interrupt-disabling regions. The researchers used cycle-accurate models to determine execution counts of the selected regions. The case study took five months.
The results are not encouraging. Because the instruction employed to disable and enable interrupts uses a register value as its source, it was often impossible to determine statically whether a given instruction enabled or disabled interrupts. Other problems of program flow, such as nested disabling/enabling sequences, hampered the study.
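The register-operand problem can be illustrated with a deliberately simplified sketch. The `Insn` type and `classify` function below are hypothetical and are not the researchers' tooling; a real analyzer would also try to track register values along each path, which is where nested sequences and complex control flow cause trouble.

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    """Hypothetical instruction that writes the interrupt-enable bit."""
    source_kind: str       # "imm" for an immediate operand, "reg" for a register
    value: Optional[int]   # known only when the operand is an immediate


def classify(insn: Insn) -> str:
    """Classify an instruction by inspection alone, without running the program."""
    if insn.source_kind == "imm":
        return "enable" if insn.value else "disable"
    # The register's contents depend on run-time state this analyzer cannot see.
    return "unknown"


program = [Insn("imm", 0), Insn("reg", None), Insn("imm", 1), Insn("reg", None)]
labels = [classify(insn) for insn in program]
unresolved = labels.count("unknown")
```

Every instruction labeled `unknown` is a potential boundary of an interrupt-disabling region that the analysis cannot place.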
In the end, the researchers estimated that only half, or just 612, of the disabling regions could be positively identified. In other words, another 600 regions, with an unknown impact on the worst-case response time, lurked in the system.
Finally, the researchers estimated the execution time of the identified regions. Some of the regions had calls to out-of-line functions; a few regions even had triply nested loops. And some loops found in critical regions were of variable bound. The cycle count estimate for one of the nested loop regions was 26,729. On a 100-MHz microprocessor, that would translate into roughly 267 microseconds. Rest assured that no real-time operating-system vendor would claim an interrupt latency measurement of that magnitude.
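The conversion from a cycle count to a time is a single division. A minimal sketch, assuming one cycle per clock tick and ignoring cache and bus effects (the function name is illustrative):

```python
def cycles_to_microseconds(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_hz * 1e6


# Figures quoted in the article: 26,729 cycles on a 100-MHz processor.
latency_us = cycles_to_microseconds(26_729, 100e6)
```

For these figures the result is about 267 microseconds, during which no interrupt can be serviced.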
The reality is that the interrupt latency numbers claimed for an operating system that disables interrupts regularly cannot be trusted. Developers must ask the vendor how the value was obtained. If it was obtained empirically, ask how the vendor guaranteed that every possible interrupt-disabling sequence was exercised. If it was obtained by static analysis, ask to see the source code and the corresponding list of all interrupt-disabling sequences, and make sure you can understand the run-time behavior of every sequence. You may be surprised to find that the claims of real-time behavior cannot be proved. Real-time systems and the people who depend on them cannot afford the employment of an OS that has unproven response time.
Legacy operating systems disable interrupts so that the periodic scheduler timer interrupt cannot fire, potentially causing a thread switch, while the kernel is manipulating critical data structures. In effect, this sacrifices the latency of the highest-priority interrupt to guard against the low-priority scheduler interrupt. A better solution, implemented in the Integrity real-time operating system, is never to disable interrupts in kernel service calls; instead, postpone the handling of a scheduler interrupt until the kernel service call completes.
This strategy requires every kernel service call to be short or able to be checkpointed so that scheduling events can be permitted before the service call is completed. Therefore, the time to get to the scheduler may vary by a few instructions (insignificant for a typical 60-Hz scheduler) but will always be short and bounded. It is far more difficult to engineer a kernel in this manner, which might explain why most kernels do not do it. But the result of this design is that the highest-priority interrupt is always handled with the absolute minimum and consistent latency.
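The postponement idea can be sketched as a small simulation. This is a toy model under stated assumptions, not Integrity's implementation: the `DeferringKernel` class, its flags, and the step-based service call are all hypothetical.

```python
class DeferringKernel:
    """Toy kernel that never masks interrupts; it defers rescheduling instead."""

    def __init__(self):
        self.in_service_call = False
        self.reschedule_pending = False
        self.log = []

    def schedule(self):
        self.log.append("schedule")

    def timer_interrupt(self):
        # The interrupt is always taken immediately. Only the scheduling
        # decision is postponed if kernel data structures are mid-update.
        if self.in_service_call:
            self.reschedule_pending = True
            self.log.append("timer: deferred")
        else:
            self.schedule()

    def service_call(self, steps, interrupt_at=None):
        """Run a short, bounded service call; optionally inject a timer tick."""
        self.in_service_call = True
        for i in range(steps):
            if i == interrupt_at:
                self.timer_interrupt()
            self.log.append(f"step {i}")
        self.in_service_call = False
        # Kernel state is consistent again, so honor any postponed request.
        if self.reschedule_pending:
            self.reschedule_pending = False
            self.schedule()


kernel = DeferringKernel()
kernel.service_call(steps=3, interrupt_at=1)
```

In this model the delay before `schedule` runs is bounded by the remaining steps of the service call, which is why each call must be short or checkpointed.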
In this interrupt model, kernel calls (such as releasing a semaphore to wake up a thread) are permitted in ISRs by the use of efficient callbacks that are executed when the kernel is in a consistent state just before scheduling. As with any RTOS, designers should always limit the work performed in an ISR to limit latency.
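A minimal sketch of the callback mechanism follows, again as a toy model. The `CallbackKernel` class, the `worker` thread name, and the `pre_schedule` hook are assumptions made for illustration and do not reflect any actual kernel API.

```python
from collections import deque


class CallbackKernel:
    """Toy model: ISRs queue kernel calls instead of performing them directly."""

    def __init__(self):
        self.pending_callbacks = deque()
        self.waiting_on_semaphore = ["worker"]  # threads blocked on the semaphore
        self.ready = []                         # threads ready to run

    def isr(self):
        # Keep the ISR short: record the request and return.
        self.pending_callbacks.append(self.release_semaphore)

    def release_semaphore(self):
        # Touches kernel data structures, so it must run in a consistent state.
        if self.waiting_on_semaphore:
            self.ready.append(self.waiting_on_semaphore.pop(0))

    def pre_schedule(self):
        """Run queued callbacks just before scheduling, then pick a thread."""
        while self.pending_callbacks:
            callback = self.pending_callbacks.popleft()
            callback()
        return self.ready[0] if self.ready else None


kernel = CallbackKernel()
kernel.isr()
ready_before = list(kernel.ready)   # the ISR alone has not woken the thread
next_thread = kernel.pre_schedule()
```

The ISR itself never modifies the ready list; the wake-up takes effect only at the point where the kernel is known to be consistent.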
Context switch time is an important component of thread response time. It is critical that an RTOS minimize context switch time, and most try to do so.
But the interaction between the thread responding to the high-priority interrupt and other, lower-priority interrupts yields a problem. Since interrupts are enabled while the high-priority thread is executing, an unbounded number of low-priority interrupts can fire, increasing the thread response time as each interrupt service routine is executed. Once again, the empirical worst-case thread response time does not match up with the theoretical worst-case response time, depending on the kinds and frequency of other interrupt sources in the system.
The real-time kernel should provide a method of preventing this kind of priority inversion. The solution implemented in Integrity is to enable developers to prioritize certain interrupts below critical interrupt-handling threads. When a high-priority ISR is executed, the kernel disables interrupts that are assigned a lower priority than the thread that must be awakened to handle the event. When the high-priority thread has finished handling the event and is descheduled, the kernel automatically re-enables the lower-priority interrupts. It is a simple yet effective solution.
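The effect of masking lower-priority interrupts on thread response time can be shown with a simple additive model. The function and all timing figures below are illustrative assumptions, not measurements from any system.

```python
def thread_response_us(base_us, low_isr_us, low_irq_count, mask_low_priority):
    """Time from the high-priority interrupt until its handler thread is done.

    base_us: ISR, context switch, and thread work with no interference.
    low_isr_us: cost of servicing one low-priority interrupt.
    low_irq_count: low-priority interrupts arriving while the thread runs.
    """
    # When masked, low-priority interrupts stay pending until the thread
    # is descheduled, so they add nothing to its response time.
    interference = 0.0 if mask_low_priority else low_irq_count * low_isr_us
    return base_us + interference


counts = [0, 10, 100]
unmasked = [thread_response_us(20.0, 5.0, n, mask_low_priority=False) for n in counts]
masked = [thread_response_us(20.0, 5.0, n, mask_low_priority=True) for n in counts]
```

Without masking, the response time grows with the number of low-priority interrupts and has no fixed bound; with masking, it stays at the base figure regardless of interrupt load.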
Typically, RTOSes provide fixed, priority-based scheduling because it must be possible to guarantee that the most critical threads in the system can run immediately in response to an event. It is forbidden to use heuristics or any other constructs in the kernel that might make this response nondeterministic.
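A fixed-priority selection rule can be sketched in a few lines. The thread names and priority numbers are hypothetical, and a higher number is assumed to mean a higher priority.

```python
def pick_next(ready):
    """Return the name of the highest-priority ready thread, or None.

    ready: list of (priority, name) tuples; higher number = higher priority.
    """
    if not ready:
        return None
    # The choice depends only on priority, never on run history or heuristics.
    return max(ready, key=lambda thread: thread[0])[1]


ready_threads = [(1, "logger"), (9, "motor_control"), (5, "ui")]
chosen = pick_next(ready_threads)
```

Because the decision depends on nothing but the assigned priorities, the same set of ready threads always yields the same choice, which is what makes the response analyzable.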
Some operating systems, such as Linux, employ a fairness-based heuristic scheduler. This of course comes from Linux's Unix heritage as a time-sharing, interactive operating system. Thus, it is not possible for the designer to specify an absolute "highest" priority thread. When an interrupt handler makes a thread ready to run in order to process the event, the Linux scheduler is quite likely to choose some other thread to run first. It simply isn't possible to determine the worst-case thread response time.
David N. Kleidermacher is vice president of engineering at Green Hills Software Inc. (Santa Barbara, Calif.).


See related chart
