Design Article
How to meet real-time requirements in embedded distributed systems
Nigel Tracey, Director of Product Management, LiveDevices, Ltd., York, England
4/18/2003 9:24 AM EDT
When viewed as a whole, many embedded systems, particularly in automobiles and aircraft are by their nature distributed: they are connected to sensors and actuators in various different parts of the vehicle. Several factors drive us to distribute the "intelligence" of the system, for example:
- Keep control systems (ECUs) close to the I/O that they interact with, minimizing "peripheral wiring."
- Partition a complex problem into sub-problems.
- Provide fault-tolerance, degraded "limp home" modes are possible if some parts of the system malfunction.
- Allow different subsystem suppliers to implement different ECUs.
Once the system is distributed, it becomes attractive to pass information between the ECUs via (one or a small number of) shared multiplex buses, rather than numerous point-to-point connections. Minimizing wiring in this way serves several goals: it reduces cost (including measures such as mass), increases reliability (by reducing the number of connections) and allows the network to be "sized" to meet the needs of the system. Using shared multiplex buses also allows new integrated functionality to be added economically.
Many of these have to meet hard real-time requirements: if a deadline is not met, the system has failed. One example might be that a car's brake lights must illuminate no more than 200 ms after the brakes are applied, for safety reasons. Often, these end-to-end deadlines traverse the network: one ECU, for example, the ABS unit, must recognize brake application and propagate this to another ECU -perhaps the rear electronics module- which actually turns on the lights.
Thus, moving to multiplexed buses introduces a new problem: ensuring that the various signals which share the communications medium can all meet their various deadlines.
In considering how we might analyze a distributed hard real-time system, we will first examine the different properties that the analysis must take into account. Essentially, the system functionality in a distributed system is provided via a collection of nodes communicating using a network.
The nodes co-operate using the network to provide the end-to-end functionality. In a real-time distributed system this end-to-end functionality, from event through to response, must be provided within a specified deadline.
Timing analysis is applied to a resource shared between multiple activities: for a node, several software functions are executed on a single CPU, while a multiplexed network carries a number of messages sent by various nodes.
Thus, we partition the analysis into components which can be individually analyzed: the end-to-end response time depends on the response times of each component involved in meeting the deadline. While it is sufficient that the sum of the worst-case response times for the components to meet the deadline, additional control is achieved by setting deadlines for each component.
These deadlines are typically arrived at by some negotiation between the node and network implementers, and the agreed deadlines place contractual requirements on each component. When all contracts are satisfied, the overall deadline is guaranteed.
This approach has several advantages over simply analyzing the collection of components post-implementation. First, each component has a specified deadline that can be used in the design, implementation and verification process. In addition, system integration is easier because overall deadlines are guaranteed by the component specifications. Also, subsequent modification is easier, because an updated component must meet its agreed deadline. There is no effect on other components or the overall system if the worst-case response time varies within the limit of the deadline.
Most importantly, a complex problem has been partitioned. Each node and network will typically need to meet a number of deadlines - for example, the network carries many unrelated messages each with their own deadlines - and these per-component scheduling problems have been decoupled by the deadline contracts. Otherwise, we would have a global optimization problem in which changing the priority of a signal on the network affects that signal's response time, the acceptable response times for the producing and consuming nodes, and the response times of other signals on the network and so the acceptable response times for unrelated nodes.
Meeting deadlines
In general, there may be multiple networks with gateways used to build the system. The network deadlines then become "network - gateway - network", in effect breaking down the overall network contract into sub-deadlines and sub-contracts.
Efficient use of network bandwidth also calls for several signals to be carried in the same transmission unit on a network, for instance, several output values from the same ECU are likely to be packed into a single CAN message. The network deadline for the frame can be derived from the shortest network deadline of the contained signals. Similar treatment applies when a signal has multiple consumers with different deadlines.
Consider the source node in a distributed transaction. The node will begin processing an event typically in response to an interrupt generated in the environment. There may be other processing taking place on the node, and consequently the node's processing resource has to be shared in some fashion among these tasks. Once the node has completed its processing of the event, it will generate a message for transmission on the network. At some, possibly later, point the message will be ready to be transmitted on the network.
To determine if the node is meeting its contract it is necessary to calculate the node's worst-case response time. This is the time from the occurrence of the event to the point when the message is ready for transmission on the network. This needs to include the following:
- time between the event occurring and the start of processing
- time between the processing beginning and the message being generated
- time between the message being generated and it being ready for transmission
Now, consider the destination node in a distributed transaction. In response to the receipt of a message, the node will begin processing its contents. Again, there may be other processing taking place on the node, consequently the processing resource has to be shared in some fashion amongst these tasks. Once the node has completed its processing of the message it will generate a response (typically controlling some external hardware through actuators). At some possibly later point the response will be observable in the environment.
As for the source node, we can determine if the destination node is meeting its contract by calculating its worst-case response time. This is the time from the messaging becoming ready for consumption to the point when the response occurs. This needs to include the following:
- time between the message becoming ready for consumption and the start of processing
- time between the processing beginning and the response being generated
- time between the response being generated and the response occurring in the environment
The network component of the distributed system consists of both the physical network connection and the network software layer running within the nodes. Once a message is made ready for transmission by a node, the network software will put it forward into the network for transmission. The message will then spend some time queued, competing for the network with the other messages waiting to be transmitted. After some amount of time the message will gain access to the network and be physically transmitted along the bus to the destination node. The network software in the destination node receives the message and makes it ready for consumption by the application.
To determine if the network is meeting its contract, we need to calculate the worst-case transmission time for each message to be transmitted across the network. This needs to include the following:
- time between the message being made ready for transmission by the application software and when the network software enters it into the queue
- queuing time, i.e. the time taken for the message to gain access on the network
- time taken to physically transmit the message across the network
- time between transmission completion and the message being ready for consumption in the application software
In order to successfully build a distributed hard real-time system we need to use techniques that allow all of these different times to be calculated and combined: static, or cyclic scheduling, dynamic preemptive scheduling, and non-pre-emptive scheduling.
A common implementation for embedded software is to use a cyclic system. This involves the static creation of the schedule that typically consists of a number of minor cycles running one after another to form the overall schedule (known as the major cycle). Each minor cycle runs tasks one after the other in order, checking for events and generating responses.
Using the CPU
A disadvantage of any cyclic system is that it does not make very effective use of the CPU, even though the scheduler itself can be implemented efficiently. In a cyclic system, all activities must run at harmonic processing rates, and the processing time requirements for sporadic events are determined by their response time requirements rather than their arrival pattern. This dramatically increases the processing load and can ultimately mean that faster or more powerful microprocessors have to be used.
Pre-emptive scheduling is a dynamic approach in which a run time kernel (or RTOS) ensures that the highest priority task (or interrupt) is always executed. Consequently, when a high-priority task is released, for example, because its period means that it should now run again, it will pre-empt any lower priority task. The lower priority task will resume execution when all higher priority activities have completed.
While preemption does offer significant efficiency and development advantages when compared with the cyclic approach, the timing behavior of the system is much more complex. However, scheduling mathematics address this additional complexity.
For the cyclic approaches, both for node and network implementation, this is relatively easy. In a static environment the tasks run in a fixed pattern. The response time of a specific task is determined by the task's activation pattern (which introduces a polling delay) and the task's computation time. The worst-case polling delay occurs when an event occurs just after its handling task begins execution. In this case the event will not be processed until the next activation of the task as determined by the static task activation pattern. The static nature of the implementation environment makes it straightforward to calculate the response time for all tasks and interrupts in the system.
In a priority-based environment, it is more difficult to determine the worst-case response times and hence to determine if the contract has been satisfied. This is where schedulability mathematics comes in. Scheduling mathematics called deadline monotonic analysis (DMA) can analyze the interactions in a priority-based environment and determine the worst-case response times. Initially, we introduce DMA in the context of the priority-based scheduling of the application software running on the nodes.
In a pre-emptive environment blocking will normally occur during the use of a shared resource which is protected through the use of semaphores. However, in a non-preemptive environment once a task begins to run it cannot be preempted even by higher priority tasks. Consequently, this additional blocking needs to be taken account of in the analysis. This can be analyzed as if all tasks lock a single common semaphore for their entire execution. This allows DMA to be applied to a non-preemptive environment such as CAN.
In a priority-based non-preemptive network environment such as CAN, the worst-case response times can be determined using a modified form of DMA, similar to node analysis where the response time was made up from computation time, interference and blocking. A message on the network is analogous to a task/interrupt on the node and its response time also consists of computation time and its queuing time. The queuing time is the length of time a message waits before it is transmitted on the network. The queuing time is made up of both interference and blocking. In the context of the network the computation time is the physical transmission time of the message. Interference is the length of time the message takes to win arbitration on the network. Blocking is the length of time the message waits for lower priority messages (which are non-pre-emptable) to complete transmission on the network and for arbitration to restart.
This article was excerpted from ESC paper 502, titled "Meeting Real-time Requirements in Distributed Systems."
- time between the message being made ready for transmission by the application software and when the network software enters it into the queue


See related chart
