Design Article
Embedded DSP Software Design on a Multicore SoC Architecture: Part 1
Robert Oshana, Texas Instruments
11/21/2007 12:15 AM EST
Designing and building embedded
systems is a difficult task, given the inherent scarcity of resources
in embedded systems (processing power, memory, throughput, battery
life, and cost). Various trade-offs are made between these resources
when designing an embedded system.
Modern embedded systems increasingly use devices with
multiple processing units manufactured on a single chip. This
sort of multicore system-on-a-chip (SoC) can increase the processing
power and throughput of the system while at the same time increasing
battery life and reducing the overall cost.
One example of a DSP-based SoC is shown in Figure 11.1 below. Multicore approaches keep hardware design in the low frequency range (each individual processor can run at a lower speed, which reduces overall power consumption as well as heat generation), offering significant price, performance, and flexibility (in software design and partitioning) advantages over higher-speed single-core designs.
Figure 11.1. Block diagram of a DSP SoC
There are several characteristics of SoC that we will discuss [1]. I will use an example processor to demonstrate these characteristics and how they are deployed in an existing SoC.
1. Customized to the application - Like embedded systems in general, SoCs are
customized to an application space. As an example, I will reference the
video application space. A block diagram showing the flow of
an embedded video application is shown in Figure 11.2 below.
This system consists of input capture, real-time
signal processing, and output display components. Building a flexible
system of this kind involves multiple technologies,
including analog formats, video converters, digital formats, and
digital processing. An SoC processor will incorporate a system of
components (processing elements, peripherals, memories, I/O, and so
forth) to implement a system such as that shown in Figure 11.2 below.
Figure 11.2. Digital video system application model (courtesy of Texas Instruments)
An example of an SoC processor that implements a digital video system is shown in Figure 11.3 below. This processor consists of various components to input, process, and output digital video information. More about the details of this in a moment.
2. SoCs improve power/performance ratio - Large
processors running at high frequencies consume more power and are more
expensive to cool. Several smaller processors running at a lower
frequency can perform the same amount of work while consuming less
energy and power.
In Figure 11.1, the ARM processor, the two DSPs, and the hardware accelerators can run a large signal processing application efficiently by properly partitioning the application across these four different processing elements.
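The power argument can be made concrete with the standard first-order model of CMOS dynamic power, P = C x V^2 x f. The sketch below compares a single core at full clock rate with several cores running at a lower clock and supply voltage. The function names are illustrative, and the model deliberately ignores leakage power and the overhead of partitioning work across cores.

```c
#include <assert.h>

/* First-order CMOS dynamic power model: P = C * V^2 * f.
 * c_farads: switched capacitance, volts: supply voltage, hz: clock rate. */
double dynamic_power_watts(double c_farads, double volts, double hz)
{
    assert(c_farads >= 0.0 && hz >= 0.0);
    return c_farads * volts * volts * hz;
}

/* Total dynamic power of n identical cores at the same voltage and clock. */
double multicore_power_watts(int n_cores, double c_farads,
                             double volts, double hz)
{
    assert(n_cores > 0);
    return n_cores * dynamic_power_watts(c_farads, volts, hz);
}
```

With hypothetical values of 1 nF switched capacitance per core, `multicore_power_watts(1, 1e-9, 1.2, 600e6)` gives about 0.864 W for one core at 600 MHz and 1.2 V, while `multicore_power_watts(2, 1e-9, 1.0, 300e6)` gives about 0.6 W for two cores at 300 MHz and 1.0 V. Both configurations supply the same aggregate number of cycles per second; the saving comes from the lower voltage that the lower clock rate permits.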
3. Many apps require programmability - SoCs contain multiple programmable processing elements. These are required for a number of reasons:
New technology - Programmability makes upgrades and changes easier than with nonprogrammable devices. For example, as new video codec technology is developed, the algorithms to support these new standards can be implemented easily on a programmable processing element. New features are also easier to add.
Support for multiple standards and algorithms - Some digital video applications require support for multiple video standards, resolutions, and quality levels. It's easier to implement these on a programmable system.
Full algorithm control - A programmable system lets the designer customize and/or optimize a specific algorithm as necessary, which gives the application developer more control over differentiation of the application.
Software reuse in future systems - By developing digital video software as components, these can be reused/repackaged as building blocks for future systems as necessary.
4. Constraints such as real-time, power, cost - There are many constraints in real-time embedded systems. Many of
these constraints are met by customizing the SoC to the
application.
Figure 11.3. A SoC processor customized for Digital Video Systems (courtesy of Texas Instruments)
5. Special instructions - SoCs have
special CPU instructions to speed up the application. As an example,
the SoC in Figure 11.3 above
contains special instructions on the DSP to accelerate operations such
as:
32-bit multiply instructions for extended precision computation
Expanded arithmetic functions to support FFT and DCT algorithms
Improved complex multiplications
Double dot product instructions for improving throughput of FIR loops
Parallel packing instructions
Enhanced Galois Field Multiply
Each of these instructions accelerates the processing of certain digital video algorithms. Of course, compiler support is necessary to schedule these instructions, so the tools become an important part of the entire system as well.
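As an illustration of why a double dot product instruction helps an FIR loop, the sketch below computes one FIR output sample in portable C, handling two taps per loop iteration. The two multiplies and the add inside the loop body are the work that such an instruction would perform as a single operation. The function name is hypothetical and no vendor intrinsics are used; this is only a model of the loop shape a compiler would try to map onto the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One FIR output sample: y = sum over k of h[k] * x[k].
 * x points at `taps` input samples already lined up with h.
 * Assumptions: taps is even (pad h with a zero otherwise), and the
 * caller has scaled the data so the 32-bit accumulator cannot overflow. */
int32_t fir_sample_dotp2(const int16_t *x, const int16_t *h, size_t taps)
{
    assert(taps % 2 == 0);
    int32_t acc = 0;
    for (size_t k = 0; k < taps; k += 2) {
        /* Two 16x16 multiplies plus their sum: one "dot product of pairs". */
        int32_t pair = (int32_t)x[k] * h[k] + (int32_t)x[k + 1] * h[k + 1];
        acc += pair;
    }
    return acc;
}
```

For example, with `x = {1, 2, 3, 4}` and `h = {4, 3, 2, 1}` the function returns 20, using two loop iterations instead of four.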
6. Extensible - Many SoCs are extensible in ways such as word size and cache size. Special tooling is also made available to analyze systems as these system parameters are changed.
7. Hardware acceleration - There are several benefits to using hardware acceleration in an SoC. The primary reason is a better cost/performance ratio. Fast processors are costly. By partitioning into several smaller processing elements, cost can be reduced in the overall system. Smaller processing elements also consume less power and can actually be better at implementing real-time systems, as the dedicated units can respond more efficiently to external events.
Hardware accelerators are useful in applications that
have algorithmic functions that do not map to a CPU architecture well.
For example, algorithms that require a lot of bit manipulation require
a lot of registers. A traditional CPU register model may not be suited
to efficiently execute these algorithms.
A specialized hardware accelerator can be built that
performs bit manipulation efficiently. It sits beside the CPU and
is used by the CPU for bit manipulation operations. Highly responsive I/O
operations are another area where a dedicated accelerator with an
attached I/O peripheral will perform better.
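Galois field multiplication, mentioned earlier among the special instructions, is a convenient example of a bit-manipulation-heavy operation. The sketch below multiplies two elements of GF(2^8) in plain C with the usual shift-and-XOR method. The function name and the choice of passing the reduction polynomial as a parameter are assumptions made for illustration; the point is that a general-purpose CPU spends eight loop iterations of tests, shifts, and XORs on what dedicated logic can produce in a single step.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two elements of GF(2^8) by shift-and-XOR.
 * reduce_poly holds the low 8 bits of the field's reduction polynomial
 * (for example 0x1B for x^8 + x^4 + x^3 + x + 1). */
uint8_t gf256_mul(uint8_t a, uint8_t b, uint8_t reduce_poly)
{
    uint8_t product = 0;
    for (int bit = 0; bit < 8; bit++) {
        if (b & 1)
            product ^= a;              /* add (XOR) the current multiple of a */
        uint8_t carry = a & 0x80;      /* will the shift exceed degree 7? */
        a = (uint8_t)(a << 1);
        if (carry)
            a ^= reduce_poly;          /* reduce modulo the field polynomial */
        b >>= 1;
    }
    return product;
}
```

Using the polynomial 0x1B, `gf256_mul(0x57, 0x83, 0x1B)` returns 0xC1.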
Finally, applications that are required to process
streams of data, such as many wireless and multimedia applications, do
not map well to the traditional CPU architecture, especially those that
implement caching systems.
Figure 11.4. Block diagram of the video processing subsystem acceleration module of the SoC in Figure 11.3 (courtesy of Texas Instruments)
Since each streaming data element may have a limited lifetime, processing will require the constant thrashing of cache for new data elements. A specialized hardware accelerator with special fetch logic can be implemented to provide dedicated support to these data streams.
Hardware acceleration is used on SoCs as a way to
efficiently execute classes of algorithms. We mentioned in the chapter
on power optimization how the use of accelerators, where possible, can
lower overall system power, since these accelerators are customized to
the class of processing and therefore perform these calculations very
efficiently.
The SoC in Figure 11.3 has hardware acceleration
support. In particular, the video processing sub-system (VPSS) as well
as the Video Acceleration block within the DSP subsystem are examples
of hardware acceleration blocks used to efficiently process video
algorithms.
Figure 11.4 above shows a block diagram of the VPSS. This hardware accelerator contains:
A front end module containing:
CCDC (charge-coupled device controller)
Previewer
Resizer (accepts data from the previewer or from external memory and resizes from ¼x to 4x)
And a back end module containing:
Color space conversion
DACs
Digital output
On-screen display
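To give a feel for the per-pixel work that a resizer block takes off the CPU, here is a minimal software sketch that resizes one line of 8-bit pixels and rejects ratios outside the ¼x to 4x range quoted above. Nearest-neighbor sampling is chosen only for brevity; the filtering method used by the actual hardware is not described here, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if out_len / in_len lies within the 1/4x..4x range. */
int resize_ratio_supported(size_t in_len, size_t out_len)
{
    if (in_len == 0 || out_len == 0)
        return 0;
    return out_len * 4 >= in_len && out_len <= in_len * 4;
}

/* Nearest-neighbor resize of one line of 8-bit pixels.
 * Returns 0 on success, -1 if the ratio is outside the supported range. */
int resize_line_nearest(const uint8_t *in, size_t in_len,
                        uint8_t *out, size_t out_len)
{
    assert(in != NULL && out != NULL);
    if (!resize_ratio_supported(in_len, out_len))
        return -1;
    for (size_t i = 0; i < out_len; i++) {
        size_t src = i * in_len / out_len;   /* always less than in_len */
        out[i] = in[src];
    }
    return 0;
}
```

Even in this simplified form, every output pixel costs a multiply, a divide, and a memory access, repeated for every line of every frame, which is the kind of regular, high-volume work that suits a dedicated block.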
This VPSS processing element eases the overall
DSP/ARM loading through hardware acceleration. An example application
using the VPSS is shown in Figure
11.5 below.
Figure 11.5. A video phone example using the VPSS acceleration module (courtesy of Texas Instruments)
8. Heterogeneous memory systems - Many SoC devices contain separate memories for the different processing elements. This provides a performance boost because of lower latencies on memory accesses, as well as lower power from reduced bus arbitration and switching.
As discussed previously, accelerators are used to perform computations that do not map efficiently to a CPU. The accelerator in Figure 11.6 below is an example that gains its efficiency from parallel computation.
This programmable coprocessor is optimized for imaging and video applications. Specifically, it is optimized to perform operations such as filtering, scaling, matrix multiplication, addition, subtraction, summing absolute differences, and other related computations.
Much of the computation is specified in the form of commands that operate on arrays of streaming data. A simple set of APIs can be used to make processing calls into this accelerator, so a single command can drive hundreds or thousands of cycles.
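The command-driven interface described above might look something like the following sketch, in which one command descriptor names an operation and the arrays it applies to. Everything here is hypothetical: the `accel_cmd` structure, the `accel_op` values, and `accel_run` are illustrative names rather than a vendor API, and the function is a software stand-in that runs on the CPU instead of dispatching to a coprocessor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command set; names are illustrative only. */
typedef enum { CMD_ADD, CMD_SUB, CMD_SAD } accel_op;

typedef struct {
    accel_op op;         /* operation to apply across the arrays */
    const int16_t *a;    /* first input array */
    const int16_t *b;    /* second input array */
    int16_t *out;        /* element-wise result (unused for CMD_SAD) */
    size_t len;          /* number of elements to process */
} accel_cmd;

/* Software stand-in for the coprocessor: one call processes a whole array.
 * Returns the sum of absolute differences for CMD_SAD, otherwise 0. */
int32_t accel_run(const accel_cmd *cmd)
{
    assert(cmd != NULL && cmd->a != NULL && cmd->b != NULL);
    int32_t sad = 0;
    for (size_t i = 0; i < cmd->len; i++) {
        int32_t diff;
        switch (cmd->op) {
        case CMD_ADD: cmd->out[i] = (int16_t)(cmd->a[i] + cmd->b[i]); break;
        case CMD_SUB: cmd->out[i] = (int16_t)(cmd->a[i] - cmd->b[i]); break;
        case CMD_SAD:
            diff = cmd->a[i] - cmd->b[i];
            sad += diff < 0 ? -diff : diff;
            break;
        }
    }
    return sad;
}
```

The design point is that the caller issues one command for an entire array, so the cost of setting up the call is spread over every element processed.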
Figure 11.6. A hardware accelerator example: video and imaging coprocessor (courtesy of Texas Instruments)
This accelerator has an 8-parallel multiply accumulate (MAC) engine, which significantly accelerates classes of signal processing algorithms that require this type of parallel computation. Examples include:
JPEG encode and decode
MPEG-1/2/4 encode and decode
H.263 encode and decode
WMV9 decode
H.264 baseline profile decode
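The effect of an 8-parallel MAC engine can be modeled in software by organizing a dot product into eight independent accumulator lanes. In the sketch below, each pass of the outer loop stands for one engine step in which all eight lanes would multiply and accumulate at once; here the lanes simply run one after another. The names and the lane-per-accumulator arrangement are assumptions for illustration, not a description of the actual engine's internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_LANES 8

/* Number of engine steps needed for n_products multiply-accumulates. */
size_t mac_engine_steps(size_t n_products)
{
    return (n_products + MAC_LANES - 1) / MAC_LANES;
}

/* Dot product arranged as 8 independent accumulator lanes.
 * Assumption: len is a multiple of MAC_LANES. */
int32_t dot_product_8lane(const int16_t *a, const int16_t *b, size_t len)
{
    assert(len % MAC_LANES == 0);
    int32_t lane_acc[MAC_LANES] = {0};
    for (size_t i = 0; i < len; i += MAC_LANES)          /* one engine step */
        for (size_t lane = 0; lane < MAC_LANES; lane++)  /* parallel in hardware */
            lane_acc[lane] += (int32_t)a[i + lane] * b[i + lane];

    int32_t total = 0;
    for (size_t lane = 0; lane < MAC_LANES; lane++)
        total += lane_acc[lane];                         /* final reduction */
    return total;
}
```

Under this model the 64 products of an 8x8 block take `mac_engine_steps(64)`, that is 8 steps, rather than 64 sequential multiply-accumulates.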
The variable length code/decode (VLCD) module in this accelerator supports the following fundamental operations very efficiently:
Quantization and inverse quantization (Q/IQ)
Variable length coding and decoding (VLC/VLD)
Huffman tables
Zigzag scan flexibility
The design of this block is such that it operates on one macroblock of data at a time (a maximum of six 8x8 blocks in 4:2:0 format). Before starting to encode or decode a bitstream, the application software must first initialize the proper registers and memory in the VLCD module.
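The macroblock arithmetic and two of the operations listed above can be sketched in software as follows. A 4:2:0 macroblock holds four luma blocks plus one block for each chroma component, which gives the six 8x8 blocks (384 coefficients) mentioned in the text. The code quantizes each block and emits it in the conventional zigzag order used by JPEG and MPEG. The single uniform step size, the truncating division, and the function names are simplifying assumptions; a real codec uses per-coefficient quantization tables and standard-specific rounding.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 64        /* 8x8 coefficients */
#define MB_BLOCKS_420 6      /* 4 luma + 1 Cb + 1 Cr */

/* zigzag[i] is the raster index of the i-th coefficient in scan order. */
static const uint8_t zigzag[BLOCK_SIZE] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Quantize one 8x8 block with a uniform step, output in zigzag order. */
void quantize_zigzag(const int16_t *block, int16_t step, int16_t *out)
{
    assert(step > 0);
    for (int i = 0; i < BLOCK_SIZE; i++)
        out[i] = (int16_t)(block[zigzag[i]] / step);
}

/* Process all six blocks of a 4:2:0 macroblock stored back to back. */
void quantize_macroblock_420(const int16_t *mb, int16_t step, int16_t *out)
{
    for (int blk = 0; blk < MB_BLOCKS_420; blk++)
        quantize_zigzag(mb + blk * BLOCK_SIZE, step, out + blk * BLOCK_SIZE);
}
```

The zigzag ordering places low-frequency coefficients first, so the zeros produced by quantizing the high-frequency coefficients tend to cluster at the end of the sequence, where variable length coding can represent them compactly.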
This hardware accelerator also contains a block
called a sequencer, which is really just a 16-bit microprocessor
targeted for simple control, address calculation, and loop control
functions. This simple processing element offloads the sequential
operations from the DSP.
The application developer can program this sequencer to coordinate the operations among the other accelerator elements, including the iMX, VLCD, System DMA, and the DSP. The sequencer code is compiled using a simple macro with the support tools, and is linked with the DSP code to be loaded later by the CPU at run time.
One of the other driving factors for the development
of SoC technology is the fact that there is an increasing demand for
programmable performance. For many applications, performance
requirements are increasing faster than the ability of a single CPU to
keep pace.
The allocation of performance, and thus response
time, for complex real-time systems is often easier with multiple CPUs.
In addition, dedicated CPUs in peripherals or special accelerators can offload
low-level functionality from a main CPU, allowing it to focus on
higher-level functions.
Robert Oshana is an engineering manager in the Software Development Organization of Texas Instruments DSP Systems business. He is responsible for the development of hardware and software debug technology for many of TI's programmable devices. He has 25 years of real-time embedded development experience.
Used with the permission of the publisher, Newnes/Elsevier, this series of two articles is based on material from DSP Software Development Techniques for Embedded and Real Time Systems, by Robert Oshana.
References
1. Multiprocessor systems-on-chips, by Ahmed Jerraya, Hannu Tenhunen and Wayne Wolf, page 36, IEEE Computer, July 2005.
2. Embedded Software in Real-Time Signal Processing Systems: Design Technologies, Proceedings of the IEEE, vol. 85, no. 3, March 1997.
3. A Software/Hardware Co-design Methodology for Embedded Microprocessor Core Design, IEEE 1999.
4. Component-Based Design Approach for Multicore SoCs, Copyright 2002, ACM.
5. A Customizable Embedded SoC Platform Architecture, IEEE IWSOC'04 (International Workshop on System-on-Chip for Real-Time Applications).
6. How virtual prototypes aid SoC hardware design, Hellestrand, Graham. EEdesign.com May 2004.
7. Panel Weighs Hardware, Software Design Options, Edwards, Chris. EETUK.com Jun 2000.
8. Back to the Basics: Programmable SoCs. Zeidman, Bob. Embedded.com July 2005.









