Design Article
Multi-core and Multi-threaded SoCs Present New Debugging Challenges
Earl Mitchell
1/7/2004 12:00 AM EST
Experienced developers know that good tools support is critical for successful implementation, debugging, and maintenance of embedded products. SoC designs have introduced a new set of problems that make good tools support even more critical. Some problems can be attributed to the inherent decrease in accessibility and visibility for components used in SoC designs; others can be attributed to the increasing complexity of hardware and software. In particular, the increasing use of concurrency in SoC designs will introduce problems that the current generation of tools are not well suited to help debug. None of these concurrency techniques are new but in most cases their use in embedded designs has traditionally been limited compared to usage in desktop and enterprise (supercomputing) products.
Developers must be aware of these issues as early in the design process as possible in order to ensure they will have the tools necessary to deal with these challenges later in the product development cycle. SoC-based designs are heavily dependent on on-chip debug support. Most processor core vendors do not develop tools themselves. Nevertheless, they must still provide on-chip debug support required to facilitate the development of such debugging tools.
A process is a task running on an operating system that uses an MMU to prevent other tasks from accessing its memory regions. Threads running on such operating systems are called lightweight processes and they share virtual address spaces. Lightweight processes relax memory protection constraints and behave more like tasks. For example, schedulers can take advantage of the shared program space to speed up inter-process-communication by passing pointers to messages instead of copying the entire message. Also, the shared program space uses memory more efficiently and improves the cache hit rate. The process model is used by "full featured" OSes like Linux, which has become a popular choice for embedded use. Some platforms like Java provide their own application level thread schedulers but these simply map their application threads into an OS level task or thread.
Concurrency techniques used in hardware design increase performance by using multiple instruction pipelines, reducing idle cycles for shared resources (for example, pipelines, buses), and by distributing the processing load across multiple cores. In general, it is easier to gain processing power by adding more cores than to increase the clock speed. Multi-core SoCs consist of some combination of general purpose CPUs, DSPs, and application specific cores. Some cores provide multithreading support to increase CPU utilization and to boost context-switch performance.
Core vendors are also experimenting with two types of multithreading microprocessor architectures. The first provides extra logic to present OSes with an appearance of multiple virtual processor units instead of one physical processor. The idea is to allow idle units to be used by a second thread while the current one is running. The second type uses internal scheduling algorithms to reduce idle pipeline cycles by issuing instructions from a different thread when the current one stalls. The problem here is what criteria should be use to choose the new thread to switch to? To avoid QoS problems a priority based scheme would be required, and it would have to be controllable by the OS scheduler.
|
The objective in any debugging scenario is to figure out what is going on. To accomplish this in a multitasking, multithreading, multi-core system the debuggers must evolve to be task-aware and thread-aware in order to help the developers isolate which threads of execution are causing problems. They should also provide concurrent execution control capabilities and more informative trace information. All of these features would require on-chip support to implement.
It would also be beneficial to incorporate more task- and thread-awareness into instruction trace logs (Figure 2). Trace logs only show which instructions were executed at some point in time. There is no task id or thread id associated with the trace frames. Again if the OS scheduler notified the core which task or thread was now active for a context-switch then the trace block could potentially access that information and record it in the trace log. Currently the only alternative is to use instrumentation techniques which are intrusive and do not provide assembly level trace information.
|
As a result, the most popular solutions for attacking these types of problems are "using printfs" or event logging based techniques which instrument the software. Tool vendors claim their instrumentation techniques typically add about 2-5% overhead. In most cases these techniques are sufficient and the overhead is acceptable. But in highly optimized architectures code instrumentation techniques introduce unacceptable changes in pipeline and cache behavior as well as adding more non-deterministic side effects. Unfortunately, this case will become more common as SoC designs continue to scale up in processing speed and degree of concurrency. The alternative is to use non-intrusive real-time trace debugging techniques. Again it would be very useful to add context-switch events into the trace to determine which task or thread was active at the time.
Since the goal of using concurrency techniques is to increase resource utilization (for example, reduce idle cycles), it would be desireable to have various counters and statistics (for example, cache hits/misses, CPU/pipeline stalls) provided by the core to measure the effectiveness of these techniques.
Multiple cores also present new challenges for trace support. Ideally you would like to have an on-chip trace buffer for each core. Unfortunately, this may be too expensive and if the cores are heterogeneous and/or running at different clock frequencies then correlating the trace dumps would be difficult. Having correlated trace dumps would be very useful in debugging an SMP-like architecture where one core can be stalled by another waiting for a shared resource. But correlation requires a common timebase or point-of-reference. A common timebase could be implemented by having one core periodically send signals to other cores to reset (sync) an internal counter. Alternatively, a single on-chip trace buffer could be shared by the cores or a single off-chip trace buffer within the probe could be used (Figure 3). In the latter case the probe could provide the timebase for time-stamping.
|



