Multicore SoC software requires partitioning
Paul Kimelman, Development Systems Technical Architect, ARM Ltd., Cambridge, United Kingdom
3/7/2002 7:50 PM EST
In a system-on-chip multiprocessor design, the software must be partitioned carefully. Too much interaction or dependence between processors can erase any advantages an embedded multiprocessor design offers.
But how an application is partitioned depends largely on what it will do and on what infrastructure exists to support it. A disk-drive controller application needs to respond first to servo feedback, second to the request channel (such as where to move), and last to any other inputs. A rotational processing system, such as an automobile transmission, antilock brakes, or engine control, must begin processing at a fixed starting point in each cycle, often referred to as top dead center, after the notch in a shaft pointing straight up. All calculations and sensor readings originate from that same point to ensure consistency. After processing the set of data, the algorithm must wait for the next starting point.
For stream processing, the goal is to keep the input and output channels full at all times. The output usually is the gating item, so processing must be fast enough to keep it full. Otherwise, output suffers (dropouts, glitches, etc.). For packet processing, in such applications as network routers and switches, processing starts each time a new packet or frame arrives. The important measures are the turnaround and transfer times. In this case, response time is critical.
When designing a multicore application, the partitioning is usually guided by the needs of the application. Typically, there are external inputs/outputs that need fast response time and/or high bandwidth and those that need far less of one or both. In many applications, response time must factor in the worst case as well as the average. If a phone has an average latency from input to output of 20 microseconds, but a worst case of 20 milliseconds, the user will perceive very poor quality because the sensitivity to such audio delays is quite high. For bandwidth, usually only the average matters, as long as dropout is not a factor. So, buffering capacity must be considered.
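The buffering point can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name and the 8-kHz, 16-bit voice stream are assumptions, and the 20-microsecond and 20-millisecond figures from the phone example are reused here as stall durations. It estimates how many bytes pile up while the processor is unable to service the stream.

```python
def min_buffer_bytes(rate_bytes_per_s, stall_us):
    """Bytes that accumulate while the stream goes unserviced for stall_us.

    Integer math with ceiling division avoids floating-point rounding.
    """
    return -(-(rate_bytes_per_s * stall_us) // 1_000_000)


# Hypothetical stream: 8,000 samples per second, 2 bytes per sample.
RATE = 8000 * 2

typical = min_buffer_bytes(RATE, 20)         # 20-microsecond stall
worst_case = min_buffer_bytes(RATE, 20_000)  # 20-millisecond stall
print(typical, worst_case)
```

Sizing the buffer for the average case alone would leave it hundreds of times too small for the worst case, which is where dropout comes from.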
It is notable that the design elements of partitioning an application onto two or more processors are very similar to partitioning an application onto a real-time operating system (RTOS). In the case of an RTOS, priorities of threads are used to try to balance response time, load, and communications.
Similarly, it becomes necessary to have your application components guard against the effects of debugging. For instance, it is important to be able to deal with one processor stopping while the other is running. If the other processor faults due to this situation, debugging of the system becomes much more difficult. The use of multicore/multiprocessor designs and SoCs only makes debugging harder. There are a number of techniques available to aid debugging in such designs. They include JTAG (Joint Test Action Group), cross triggering, trace, and back-channel communications.
True in-circuit emulators are rare, given the high clock speeds, use of cache, and internal memory of most designs. The most typical debug mechanism is JTAG. JTAG works using four or five pins to access the chip. The original model of JTAG simply allowed access to the signals at the boundary between the cells and the outer packaging. The debug use of JTAG includes a far more sophisticated JTAG test access port (TAP), which provides control and access using more specific mechanisms than boundary chains. The typical JTAG TAP for debugging includes techniques for feeding instructions into the processor core for execution at slow speeds, as well as access to debugging control registers for breakpoints, stepping, and trace. The instructions are used to move data into and out of the core registers as well as into and out of memory. Using instructions for this takes advantage of the processor, which already can perform these actions, and so simplifies the TAP's tasks.
In a multicore/multiprocessor design, one only has to daisy chain the TAPs together. Daisy chaining involves tying the output of one processor (TDO) into the input of the next (TDI). The emulator connects to the input of the first processor and the output of the last. The other two or three pins are tied to all processors in parallel. This allows a multicore chip to have only four or five pins coming out of it for debug.
Even when the debugger does not support debug of multiple cores/processors at once, most emulators will still allow daisy chaining. This is accomplished by identifying only the processor to be accessed and defining the rest to be bypassed. Bypass is a JTAG concept: feeding the instruction register (IR) a specific pattern (all ones) selects a data register (DR) of length 1. The Bypass DR has no meaning, but because it is one bit long, it simply passes along its input one clock later. So, the emulator only has to emit one extra bit for each device in Bypass and can then debug the targeted processor.
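The bookkeeping an emulator does for bypassed devices can be modeled in a few lines. This is a simplified sketch, not any emulator's actual code: it treats the chain's selected data registers as one flat shift register, and the three-device chain, the 4-bit target DR, and the function names are all hypothetical.

```python
def shift_chain(regs, tdi_bits):
    """Clock tdi_bits through one flat shift register.

    regs[0] sits next to TDI and regs[-1] drives TDO.
    Returns (tdo_bits, new_regs).
    """
    regs = list(regs)
    tdo_bits = []
    for bit in tdi_bits:
        tdo_bits.append(regs[-1])   # bit leaving the chain on this clock
        regs = [bit] + regs[:-1]    # every stage moves one step toward TDO
    return tdo_bits, regs


def scan_target(regs, data_bits, before, after):
    """Scan data_bits into the one device that is not in Bypass.

    before = bypassed devices between TDI and the target,
    after  = bypassed devices between the target and TDO.
    Each bypassed device costs exactly one padding bit.
    """
    tdi_bits = [0] * after + list(data_bits) + [0] * before
    tdo_bits, regs = shift_chain(regs, tdi_bits)
    captured = tdo_bits[after:after + len(data_bits)]
    return captured, regs


# Hypothetical chain: bypassed core, 4-bit target DR, bypassed core.
# Bypass registers capture 0. The target DR is assumed to hold 1,0,1,1,
# listed from its TDI end to its TDO end.
chain = [0] + [1, 0, 1, 1] + [0]
read_back, chain = scan_target(chain, [1, 1, 0, 0], before=1, after=1)
print(read_back, chain[1:5])
```

The first bit sent ends up nearest TDO, which is why the target's new contents appear reversed relative to the order in which the data bits were sent.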
One of the most difficult problems with multiprocessor designs is tracking down what happened when things go wrong. If one processor detects bad data from the interprocessor communications, it is critical to be able to stop the other processor right away. If stopped immediately, it is possible to understand its state and how the corrupted data was sent. If it is allowed to run for thousands of instructions or more, there will not likely be any context to examine.
The only way to get fast cross triggering is with hardware support. Most processors do not have a specific design for cross triggering, but usually do have the signals needed to make it happen. The two signals/pins needed are debug acknowledge (a.k.a. breakout) and debug request input (a.k.a. abort). Using some glue logic, it is possible to cross-connect these. The cross-connect needs to be optional, of course, ideally enabled and disabled via a memory-mapped register. Some modern debuggers allow the connections to be enabled and disabled through the GUI. Any debugger would allow access to the memory-mapped register.
The design of the connection needs to be set up according to how the chips work. For example, in many cases, simply cross-wiring these signals would mean the processors could never be restarted. This is because both debug acknowledge lines would be held while in debug, and any attempt to start one processor would be stopped immediately. So, logic is needed to ensure correct handling of edges vs. levels and to account for when the pulse is valid. Some processor vendors do describe this in detail.
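The restart problem is easy to reproduce in a toy cycle-by-cycle model. The sketch below is an assumption-laden illustration, not a description of any vendor's logic: two cores named A and B, a debug acknowledge that stays high while a core is halted, a debug request that overrides a resume, and a scripted run in which A hits a breakpoint and the debugger later restarts both cores.

```python
def simulate(edge_sensitive, enabled=True, cycles=6):
    """Return the (A halted, B halted) state after each cycle."""
    halted = {"A": False, "B": False}
    prev_ack = dict(halted)
    other = {"A": "B", "B": "A"}
    # Hypothetical script: A hits a breakpoint in cycle 1 and the
    # debugger tries to restart both cores in cycle 4.
    script = {1: [("A", "halt")], 4: [("A", "resume"), ("B", "resume")]}
    history = []
    for cycle in range(cycles):
        ack = dict(halted)  # debug acknowledge is a level: high while halted
        request = {}
        for core in halted:
            src = other[core]
            if edge_sensitive:
                fired = ack[src] and not prev_ack[src]  # rising edge only
            else:
                fired = ack[src]                        # raw level
            request[core] = enabled and fired           # enable register gate
        for core, action in script.get(cycle, []):
            halted[core] = (action == "halt")
        for core in halted:
            if request[core]:
                halted[core] = True  # a debug request overrides a resume
        prev_ack = ack
        history.append((halted["A"], halted["B"]))
    return history


print("level:", simulate(edge_sensitive=False)[-1])
print("edge: ", simulate(edge_sensitive=True)[-1])
```

With raw levels, each core's acknowledge keeps the other's request asserted, so the restart in cycle 4 is undone immediately. With edge detection, B still stops one cycle after A, but both cores restart cleanly.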
Cross-triggering may be used for more than stopping. It is possible to use this kind of mechanism to cause other interactions, some of which may be safer. For example, in a storage application, the stop of one could trigger a specific nonmaskable interrupt (NMI) to allow the servo controller to gracefully stop the head.
Like in-circuit emulators, trace is no longer as valid as it once was. The traditional approach is to attach a logic analyzer to the chip bus (address + data + control). The logic analyzers traditionally had internal setups to allow naming the collection of pins to make triggering and analysis easier. With the increasing use of on-chip memory and cache, the logic analyzer approach does not work well. If an attempt is made to force all accesses to show up externally and disable cache, the behavior is usually very different from what happens in a real run.
Some processors include a form of on-chip trace. This involves the use of logic to control how/what is collected. It may also include on-chip trace memory (a buffer to store results) and/or dedicated pins to send the trace off-chip.
Most digital signal processors (DSPs) include a simple control-flow trace buffer. These collect the last few control-flow changes (branches, calls, returns, and possibly interrupts), which can be examined only when the processor stops. The purpose is to provide some context in the debugger for how the code got to the place where it stopped. Coupled with well-placed breakpoints, this can be used to track down complex problems that cannot be analyzed by looking only at the current memory contents.
A few processors provide far more sophisticated on-chip trace, often as an optional add-on because it involves a whole macrocell of additional logic. For example, the ARM series of processors provides the ETM trace, LSI Logic has E-JTAG trace, and Motorola Mcore had the Nexus trace. When the trace buffer is external, these use few pins with compressed output. The small number of pins, driven at slower rates, makes this an attractive option. But for very fast processors, on-chip buffers are usually required to keep up with the rate, at the sacrifice of trace depth. On-chip trace units can provide very sophisticated control.
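In software terms, such a buffer behaves like a small ring buffer that silently overwrites its oldest entry. The following is a minimal model only; the class name, the depth of four, and the addresses are hypothetical and not taken from any particular DSP.

```python
from collections import deque


class ControlFlowTrace:
    """Keeps only the most recent control-flow changes."""

    def __init__(self, depth=4):
        self._entries = deque(maxlen=depth)  # oldest entry drops off first

    def record(self, kind, source, target):
        self._entries.append((kind, source, target))

    def dump(self):
        """What a debugger would read out once the processor has stopped."""
        return list(self._entries)


trace = ControlFlowTrace(depth=4)
trace.record("branch", 0x1000, 0x1040)
trace.record("call", 0x1044, 0x2000)
trace.record("branch", 0x2010, 0x2030)
trace.record("return", 0x2034, 0x1048)
trace.record("interrupt", 0x104C, 0x0018)
trace.record("return", 0x0030, 0x104C)

for kind, source, target in trace.dump():
    print(f"{kind:9s} {source:#06x} -> {target:#06x}")
```

Only the last four changes survive, which is why the placement of breakpoints matters: the processor has to stop before the interesting history is overwritten.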
Currently none of the trace units provide trace beyond the processor itself. As more designs move to multicore, it will be important to trace shared memory and bus accesses (data trace).
Traditionally, print debugging, also known as logging, was done using a serial channel and a terminal or PC. Some processors now contain a back-channel mechanism for debugging. These back channels usually consist of registers visible to both the processor and the JTAG TAP. The processor and debugger can send and receive data across these registers while the processor is running. So, the application, a thread, or a monitor can write data across this channel as fast as it will accept data, and the debugger or other client on the host can pull it off. Data can be pushed onto the processor from the host as well.
The advantage of back-channel schemes is that no additional devices, pins, or resources are needed to facilitate this. The main drawback: They are not deterministic in terms of rate since the host may take longer to get the data off in one run vs. another.
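The rate problem shows up even in a very small model. The sketch below assumes a one-word mailbox register with a full flag and a host that polls at a fixed interval; the class, the function, and the polling intervals are all illustrative assumptions rather than a real processor's back-channel interface.

```python
class BackChannel:
    """One-word mailbox visible to both the target and the host."""

    def __init__(self):
        self.data = None
        self.full = False

    def target_write(self, word):
        if self.full:
            return False  # host has not drained the last word yet
        self.data, self.full = word, True
        return True

    def host_read(self):
        if not self.full:
            return None
        self.full = False
        return self.data


def run(words, host_poll_interval):
    """Return (received words, cycles taken) for a given host polling rate."""
    chan = BackChannel()
    pending = list(words)
    received = []
    cycle = 0
    while len(received) < len(words):
        if pending and chan.target_write(pending[0]):
            pending.pop(0)            # word accepted by the channel
        if cycle % host_poll_interval == 0:
            word = chan.host_read()
            if word is not None:
                received.append(word)
        cycle += 1
    return received, cycle


fast = run([1, 2, 3], host_poll_interval=1)
slow = run([1, 2, 3], host_poll_interval=5)
print(fast, slow)
```

Both runs deliver the same data, but the slow host stretches the transfer from 3 cycles to 11, and the target spends the difference waiting on a full register.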
This article is based on excerpts taken from ESC class # 553, Developing embedded software in multi-core SoCs.



