Emerging DSP architectures bring DVD-quality to PDAs
Steve Wilson, Vice President, Marketing, ChipWrights, Newton, Mass.
2/21/2002 12:52 PM EST
As we enter the realm of mobile devices that carry wireless streaming video, it is useful to consider the types of DSP architectures that will be available for such products. We can glean tomorrow's architectures from the solutions that are being developed today and extrapolate from there.
Mobile devices are characterized by low-power, low-cost and small-form-factor requirements. To meet these objectives, today's products, from digital cameras to MP3 players to PDAs, typically run on complex, highly integrated system-on-chip devices that include both the processing blocks, such as DSPs, and the I/O blocks specific to the application.
Solutions using a general-purpose processor provide very flexible platforms, but are challenged when facing the demands of high-performance applications. Hardwired, fixed-function solutions support high-performance and low-cost objectives, but at the expense of flexibility. They also raise development cost and time-to-market risk, among other things. These trade-offs are fairly well understood by system designers today.
Between these two extremes, there are other approaches to be considered and new technologies on the horizon. For example, to increase performance, some solutions offer a general-purpose processor and a DSP on the same die. Others use multiple processors or microcoded processing engines. In some cases, the most demanding applications involving image processing use a combination of all these techniques.
Lesser-used but possible future approaches include customizable processing engines, dynamically reconfigurable logic and reconfigurable processors. Application-specific processors have emerged as a viable way of developing products that benefit from the advantages of having a fully programmable development environment, but for which general-purpose compute engines are inadequate or unsuitable.
The most compelling applications for future mobile devices will be imaging based, whether capturing video for e-mail or video phones, or receiving an audio/video stream from the Internet or another caller. The advances in networking and peer-to-peer communications brought about by 802.11, Bluetooth and cellular radio will make visual communication as simple as making a phone call is today. However, at this point user expectations are largely left unfulfilled by the low-resolution, low-frame-rate video that current approaches struggle to support.
Video algorithms must process sequences of images that are rapidly generated by an image sensor. For example, a VGA sensor (640 x 480 pixels) running at 30 frames per second generates an input data stream of about 9 Mbytes/s. In a cell phone, the input data rate of a typical voice sample is only 16 kbytes/s, and in a CD-quality audio recording it is still only about 176 kbytes/s. Image processing, like audio and voice processing, must be done in real time, since even digital still cameras offer MPEG-4 video-capture modes and burst-shooting capability. The image-processing algorithms must be executed on all pixels, and the result must then be compressed according to the appropriate compression standard (JPEG, JPEG2000, MPEG-4, MPEG-2, H.263, H.26L).
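The gap between video and voice data rates can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes one byte per pixel of raw sensor output and 16-bit mono voice samples at 8 kHz, neither of which is stated in the text.

```python
def video_rate(width, height, fps, bytes_per_pixel=1):
    """Raw sensor output in bytes per second (assumes 1 byte per pixel)."""
    return width * height * fps * bytes_per_pixel


def audio_rate(sample_rate_hz, bytes_per_sample, channels=1):
    """Uncompressed audio input in bytes per second."""
    return sample_rate_hz * bytes_per_sample * channels


vga = video_rate(640, 480, 30)   # VGA sensor at 30 frames per second
voice = audio_rate(8000, 2)      # assumed: 8 kHz, 16-bit, mono

print(f"VGA video: {vga / 1e6:.1f} Mbytes/s")
print(f"Voice:     {voice / 1e3:.0f} kbytes/s")
print(f"Video carries {vga // voice}x the data of voice")
```

Under these assumptions the raw video stream is several hundred times larger than the voice stream, which is the scale of the problem the rest of this article addresses.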
Digital image processing is computationally intensive because the data volumes are very large and the operations involve a lot of manipulation on non-32-bit boundaries. General-purpose DSPs and RISC engines are not well-equipped to handle this efficiently. To boost performance, special instructions can be implemented. For example, the sum-of-absolute-differences instruction improves the performance of motion-estimation algorithms by subtracting the corresponding bytes of two packed 32-bit words, taking the absolute value of each difference, then adding the results. This single operation would take tens of instructions on a general-purpose processor with no sum-of-absolute-differences instruction. Other characteristics of image data must also be leveraged to develop an efficient processing engine.
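To make the cost concrete, here is a minimal sketch of what a sum-of-absolute-differences operation computes on two packed 32-bit words, written out the long way as a processor without the instruction would have to do it. The function names are hypothetical, and the code models the arithmetic only, not any particular instruction set.

```python
def sad_packed32(a, b):
    """Sum of absolute differences of the four bytes packed in a and b."""
    total = 0
    for shift in (0, 8, 16, 24):
        pa = (a >> shift) & 0xFF   # extract one 8-bit pixel from each word
        pb = (b >> shift) & 0xFF
        total += abs(pa - pb)      # accumulate the absolute difference
    return total


def block_sad(current_words, reference_words):
    """SAD over two equally sized pixel blocks stored as packed 32-bit words."""
    return sum(sad_packed32(c, r) for c, r in zip(current_words, reference_words))


print(sad_packed32(0x10203040, 0x11223344))
```

Each word costs four extractions per operand, four subtractions, four absolute values and four additions here. A motion-estimation search repeats this for every candidate block position, which is why collapsing it into one instruction pays off.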
Image-processing algorithms often require only 8- and 16-bit formats, not the 32-bit and higher formats upon which modern processors and compilers have focused. Encoding a bit stream during the image-compression process often requires packing bits together. This is typically accomplished with several mask and shift instructions. Operation on byte and word boundaries is often addressed with single-instruction, multiple-data (SIMD) structures that are supported on top of the general-purpose architecture. Most SIMD implementations take this form, but this approach often requires a lot of data reorganization and precision compromises due to the underlying architecture.
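The mask-and-shift work involved in packing a bit stream can be sketched as follows. This is an illustrative bit packer, not the routine of any particular codec: the class name `BitPacker` and its methods are hypothetical, and the most-significant-bit-first ordering with zero padding is an assumption.

```python
class BitPacker:
    """Packs variable-width codes into bytes, most significant bit first."""

    def __init__(self):
        self.out = bytearray()
        self.acc = 0     # bit accumulator
        self.nbits = 0   # number of valid bits currently held in acc

    def put(self, value, width):
        value &= (1 << width) - 1               # mask to the code width
        self.acc = (self.acc << width) | value  # shift in the new bits
        self.nbits += width
        while self.nbits >= 8:                  # emit every completed byte
            self.nbits -= 8
            self.out.append((self.acc >> self.nbits) & 0xFF)
        self.acc &= (1 << self.nbits) - 1       # keep only the leftover bits

    def flush(self):
        if self.nbits:                          # pad the final byte with zeros
            self.out.append((self.acc << (8 - self.nbits)) & 0xFF)
            self.acc = 0
            self.nbits = 0
        return bytes(self.out)


packer = BitPacker()
for code, width in [(0b101, 3), (0b11, 2), (0b0001, 4)]:
    packer.put(code, width)
packed = packer.flush()
print(packed.hex())
```

Every code written costs a mask, a shift, an OR and a loop test, which is the per-symbol overhead that dedicated bit-field hardware is meant to remove.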
A compute engine optimized to support these types of operations well would significantly accelerate image-processing algorithms. Ultimately, processing solutions tailored to the specific computational needs of image processing will bring DVD-like quality to mobile devices.
Consider, for example, that while video processing has always been one of the most demanding applications, it is now commonplace for high-end processors to run MPEG compression algorithms. In the early 1990s, it took multiple ASICs to build an MPEG-2 encoder. Today, high-end processors can run MPEG-2 encode in real time and many of them can decode multiple video streams in parallel. While even MPEG-2 decode is difficult for a programmable mobile device today, it is certainly not unreasonable to expect such capabilities to be available in the near future.
Fully programmable mobile solutions, however, will require a keen focus on low power and low cost. This means the whole system-level solution needs to be taken into consideration. For example, embedded DRAM can be used to reduce overall system power since the memory I/O signals do not go off-chip. However, the processor core itself needs to be efficient in its design, since more transistors mean higher cost and higher power. Consideration needs to be given to cache complexity or to whether a cache is needed at all.
For imaging operations, a data memory implemented in SRAM can be more efficient than a data cache, since accesses to imaging data are not random. Modern very long instruction word (VLIW) architectures store and route wide instruction words and require many instruction decoders, but imaging applications benefit much more from data-level parallelism than from instruction-level parallelism. Architectures that focus on this aspect of acceleration can be far more efficient, since they require fewer transistors to implement.
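Data-level parallelism means one operation is applied to several pixels at once. The sketch below emulates a four-lane, 8-bit add inside an ordinary 32-bit word, a well-known trick for getting SIMD-like behavior out of a general-purpose datapath. The function name is hypothetical, and the wraparound (modulo 256) behavior in each lane is a property of this particular sketch.

```python
LOW7 = 0x7F7F7F7F   # low seven bits of every byte lane
HIGH = 0x80808080   # top bit of every byte lane


def add_bytes_packed32(a, b):
    """Add four 8-bit lanes at once; each lane wraps modulo 256."""
    # Add the low 7 bits of each lane; a carry can reach bit 7 of its own
    # lane but can never spill into the neighboring lane.
    partial = (a & LOW7) + (b & LOW7)
    # Fold the top bits back in with XOR, which adds them without a carry out.
    return (partial ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF


result = add_bytes_packed32(0x01FF7F80, 0x01017F80)
print(f"{result:08X}")
```

The masking steps are the kind of overhead a general-purpose base architecture imposes, and the silent wraparound in each lane is a precision compromise. An engine with native byte lanes would need neither.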
There are several new approaches to solving these types of application-specific problems. ARC Cores, for example, offers a customizable processing engine that makes it possible to add user-defined instructions. Improv Systems offers a fully customizable processor core that allows the user to select the quantity and type of execution units desired. These approaches let developers differentiate their products and balance the overall processor performance with the device application needs to achieve higher levels of performance at acceptable price and power points. Those considering this approach need to weigh the additional development complexity and risk against both the tactical and long-term advantages they expect to gain.
Further out on the horizon, devices with dynamically reconfigurable logic (that is, logic that can be reconfigured in real time) offer the hope of reducing the silicon required to implement a host of mobile functions. Also known as adaptive-computing machines, these revolutionary devices could reduce the total cost and power of a mobile device because the same logic (presumably less than a separate implementation of the functions would entail) could be used to perform many functions, for example MP3 or MPEG-4 decoding, or voice processing. QuickSilver Technology is a startup taking this approach.
Development often moves in evolutionary steps, however. The proposed solutions for third-generation cell phones are a good example. Such platforms are likely to feature two processor subsystems, one to handle the modem functionality and call processing, and one to tackle new applications such as PDA functionality, MP3 playback and image processing.
However, the applications processor of an entertainment-centric device would revolve around imaging and video, and would likely be representative of a new class of application-specific DSPs tailored to accelerate image processing. Such a device might be optimized to work on bit and byte boundaries and implement aggressive data-level parallelism, for example, while still supporting a standard development environment and general-purpose computing needs.
Future mobile devices will become very video-centric, and with the multitude of compression technologies currently deployed and the premium value of higher-quality, low-bit-rate coding, we can expect they will require flexible processing solutions. The most successful mobile devices will support high-quality streaming video, both capturing and viewing, since that is what consumers have come to expect. Designers will be able to select from various approaches to meet their computing needs, from RISC processors and RISC plus DSP to processor plus fixed-function logic and reconfigurable and application-specific processors.
In five years, the performance curve for mobile applications will rise, largely driven by image- and video-processing requirements. Successful design for this performance level will require a new approach to the problem and the development of a new signal-processing architecture. For example, ChipWrights has recently announced an application-specific processor architecture that strikes a new balance between general-purpose compute power and video-processing compute power.
The CWvx combines a general-purpose RISC processor with an advanced vector array of DSP execution units that are specifically designed to accelerate image processing. The machine operates in SIMD fashion and can be scaled to support between two and 16 DSP execution units. Each 32-bit, fixed-point execution unit can support six operations per cycle and four 8-bit multiply-accumulates per cycle with full precision. Its extract and insert stages support operations on values of less than 32 bits without the penalty of additional instructions.
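The scaling range quoted above translates into per-cycle peaks as sketched below. This is back-of-the-envelope arithmetic rather than a vendor specification: the function name is hypothetical, and treating the two per-unit figures as independent peaks that scale linearly with the number of units, while ignoring the RISC processor, is an assumption.

```python
OPS_PER_UNIT = 6    # operations per cycle per execution unit (from the text)
MACS_PER_UNIT = 4   # 8-bit multiply-accumulates per cycle per unit (from the text)


def peak_per_cycle(units):
    """Peak per-cycle figures for a vector array of `units` execution units."""
    if not 2 <= units <= 16:
        raise ValueError("the text describes configurations of 2 to 16 units")
    return {"ops": units * OPS_PER_UNIT, "macs_8bit": units * MACS_PER_UNIT}


smallest = peak_per_cycle(2)
largest = peak_per_cycle(16)
print("2 units: ", smallest)
print("16 units:", largest)
```

Sustained throughput would depend on clock rate and memory bandwidth, neither of which is given here.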
This type of architecture provides a highly efficient compute engine for mobile video encoding and decoding. By leveraging the characteristics of image processing and incorporating a general-purpose processor, it makes DVD-quality video attainable in a mobile device such as a PDA. Furthermore, the platform is fully programmable, so future codecs can be quickly implemented.


See related chart
