
New DSP Architectures Go "Post-Harvard" for Higher Performance and Flexibility

Richard A. Quinnell

5/1/2002 12:00 AM EDT

After remaining unchanged for more than a decade, DSP architectures have started to evolve. They are even trying to encompass control operations.

When first introduced in the 1980s, digital signal processors (DSPs) were rare and unusual, dedicated to arcane mathematical manipulation beyond the understanding of most design engineers. Since then they have become a fundamental building block in communications and consumer electronics. Yet their architecture remained essentially unchanged for most of their history. Now, new DSP architectures are springing up as the technology adapts to its multitude of uses.

Conventional DSP architectures, which emerged in the 1980s, follow fundamentally the same pattern, shown in Figure 1. Their memory access typically uses a Harvard-style architecture, with separate data and instruction buses. Their main processing elements are a multiplier, an arithmetic logic unit (ALU), and an accumulation register, allowing creation of a multiply-accumulate (MAC) unit that accepts two operands. Depending on the processor, the operands may be 16-, 24-, 32-, or 48-bit words in either fixed-point or floating-point format. Whatever the word width, these conventional DSPs offer fixed-width instructions, executing one instruction per clock cycle.

Figure 1:  The conventional DSP architecture, introduced in the 1980s, uses separate data and instruction buses and features fixed-width instructions, executing one instruction per clock cycle.

The instructions themselves can be fairly complex. A single instruction may embody two data moves, a MAC operation, and two pointer updates. These complex instructions help the conventional DSP offer a high degree of code density when performing repeated mathematical operations on arrays of numbers. As control devices, however, they leave something to be desired. The fixed-width instructions are inefficient when tasked with performing simple counter increments as part of a control loop, for instance. Even if the counter is only going as high as 10, the processor needs to use the full word width for the values. Conventional DSPs are also weak at bit-level data manipulation beyond bit shifting.
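To make the "two data moves, a MAC operation, and two pointer updates" pattern concrete, here is a minimal Python sketch of one output of an FIR filter. The function name `fir_sample` and the explicit pointer variables are illustrative assumptions, not taken from any particular DSP. Each loop iteration corresponds roughly to the work a conventional DSP can fold into a single instruction.

```python
def fir_sample(coeffs, delay_line):
    """Compute one FIR output as a chain of multiply-accumulate (MAC) steps.

    Illustrative only: plain Python integers stand in for the DSP's
    fixed-width operands and accumulator.
    """
    acc = 0      # accumulation register
    c_ptr = 0    # pointer into the coefficient array
    d_ptr = 0    # pointer into the sample (delay-line) array
    for _ in range(len(coeffs)):
        c = coeffs[c_ptr]        # data move 1
        x = delay_line[d_ptr]    # data move 2
        acc += c * x             # multiply-accumulate
        c_ptr += 1               # pointer update 1
        d_ptr += 1               # pointer update 2
    return acc
```

On a general-purpose processor each commented step would typically be its own instruction, which is where the code-density advantage for array arithmetic comes from.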

Still, because of their number-crunching proficiency, conventional DSPs soon gained popularity in communications and media applications. The communications devices, including modem and telephony processors, needed the computational power for echo canceling, voice coding, and filtering. Media applications, including digital audio, video, and imaging, needed computational power for compression and filtering along with program flexibility to track evolving standards. DSPs also found a home in disk-drive and other servo-motor-control applications.


Enhanced DSPs Emerge

As semiconductor process technology evolved, conventional DSPs began to acquire a number of on-chip peripherals such as local memory, I/O ports, timers, and DMA controllers. Their basic architecture, however, didn't change for more than a decade. Eventually, though, the relative weakness in bit-level manipulation began to catch up with conventional DSPs, as did the incessant demand for greater performance. By the mid-1990s, enhanced DSPs began to emerge.

One common feature of these enhanced DSPs is the presence of a second MAC, which allows for some parallelism in computation. In many cases, this parallelism extends to other elements in the DSP, allowing the device to perform single-instruction, multiple-data (SIMD) operations. Often this is accomplished with data packing, which allows registers, data paths, and the like to handle two half-word operands each clock cycle. Along with data packing, many enhanced DSPs allow the instructions themselves to use fractional word widths, which allows multiple instructions to launch simultaneously.
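Data packing can be sketched in a few lines. The example below assumes a 32-bit word holding two unsigned 16-bit lanes and a lane-wise add that wraps within each lane. The helper names (`pack16`, `unpack16`, `packed_add16`) are hypothetical, and real devices differ in lane widths, signedness, and overflow behavior.

```python
MASK16 = 0xFFFF

def pack16(hi, lo):
    """Pack two 16-bit values into one 32-bit word (hi lane, lo lane)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def unpack16(word):
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16

def packed_add16(a, b):
    """Add two packed words lane by lane; a carry never crosses into the other lane."""
    a_hi, a_lo = unpack16(a)
    b_hi, b_lo = unpack16(b)
    return pack16(a_hi + b_hi, a_lo + b_lo)
```

For instance, `packed_add16(pack16(1, 0xFFFF), pack16(2, 1))` leaves 3 in the high lane and wraps the low lane to 0, so one operation has done the work of two half-word additions.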

The enhanced DSPs also tend to incorporate features that speed execution of algorithms in a specific application space, and they add special-purpose peripherals and memory. The exact nature of the specialization varies with the application an enhanced DSP targets, which makes direct comparisons difficult. Many include hardware accelerators for frequently used operations and provide specialized addressing modes and augmented instruction sets that target the application space. The augmented instruction sets may include both special DSP instructions and RISC-like instructions for improved control operation.

Consider, for instance, the Blackfin DSP family from Analog Devices. This family targets voice, video, and data communications signal processing along with control operations. Figure 2 shows a block diagram of the Blackfin core, which is the same in all family members. The core includes dual 16-bit MACs, dual 40-bit arithmetic logic units (ALUs), a 40-bit barrel shifter, and quad 8-bit ALUs for video operations. Because the architecture allows data packing, the 40-bit ALUs can handle two 40-bit numbers or four 16-bit numbers. In addition, a control unit handles sequencing of instructions so that a mix of 16-bit control and 32-bit DSP instructions can pack for simultaneous execution. Data can be in 8-, 16-, or 32-bit format.

Figure 2:  Analog Devices' Blackfin DSP architecture handles multi-width data words and can simultaneously execute 16-bit control and 32-bit DSP instructions.

The core also includes two data address generators (DAGs) to simplify both DSP and control operations. DSP addressing operations include circular buffering, for matrix operations, and bit-reversal, for unscrambling FFT results. Control operations include auto-increment, auto-decrement, and base-plus-immediate-offset addressing modes not found in conventional DSPs.
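The two DSP addressing modes mentioned above are easy to model in software, which also shows why doing them in hardware saves cycles. The following Python sketch is generic and does not reproduce the Blackfin's address-generator behavior; the function names are assumptions for illustration, and `bit_reversed_order` assumes the length is a power of two.

```python
def circular_next(index, step, length):
    """Advance a buffer index by step, wrapping around a fixed-length buffer."""
    return (index + step) % length

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (e.g. 0b001 -> 0b100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def bit_reversed_order(n):
    """Return the bit-reversed index sequence for an n-point FFT (n a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]
```

For an 8-point FFT, `bit_reversed_order(8)` gives `[0, 4, 2, 6, 1, 5, 3, 7]`, the order in which results must be read to unscramble them. An address generator that produces these indices directly removes the modulo and bit-shuffling work from the inner loop.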


Instruction Sets Target Applications

The instruction set of the Blackfin core includes both general DSP instructions and RISC-like control instructions. In addition, the core has complex instructions geared toward the needs of the intended applications. For Huffman coding, used in communications algorithms, there is a "Field Deposit/Extract" command. For the Discrete Cosine Transform, used in imaging and video, an IEEE 1180 rounding operation is available. Video compression algorithms can take advantage of the "Sum Absolute Difference" instruction.
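To give a feel for what such instructions replace, the sketch below shows software equivalents of a sum of absolute differences and of bit-field extract and deposit. These are generic definitions written for illustration. The function names, argument order, and field conventions are assumptions and do not reproduce the exact semantics of the Blackfin instructions.

```python
def sum_absolute_difference(block_a, block_b):
    """Sum of |a - b| over two equal-length pixel blocks, as used in block matching."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def field_extract(word, position, width):
    """Return the `width`-bit field of word that starts at bit `position`."""
    return (word >> position) & ((1 << width) - 1)

def field_deposit(word, value, position, width):
    """Return word with the `width`-bit field at `position` replaced by value."""
    mask = ((1 << width) - 1) << position
    return (word & ~mask) | ((value << position) & mask)
```

Each of these takes several shifts, masks, or compare-and-subtract steps per element in software, so collapsing them into single instructions pays off in the inner loops of video compression and variable-length coding.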

These specialty instructions are one way that the Blackfin family targets applications. The other way is the peripheral mix each family member offers. The ADSP-21532, for example, aims at low-cost consumer multimedia applications by including peripherals supporting surround-sound and video-specific operating modes. The ADSP-21535 goes after high-performance communications applications with USB and PCI interfaces as well as substantial amounts of on-chip SRAM.

The range and variety of variations within the Blackfin family as well as the nature of its specialized instructions mirror the diversity of enhanced conventional DSPs, available from companies such as Cirrus Logic, Motorola, and Texas Instruments. But for all the enhancements, these DSPs follow basically the same programming model as the conventional device.

Other DSP architectures have emerged that follow a different programming model. In search of the highest performance levels, these architectures allow the DSP to launch multiple instructions at the same time for parallel execution. While these approaches result in greater code execution speed, they also make software more difficult to optimize. They require careful instruction ordering to avoid needing simultaneous access to the same data. They also need to avoid attempting simultaneous execution of instructions where one instruction depends on the results of the other for its operands. Not all DSP application software has a structure suitable for multiple-launch execution, but when it does, these DSPs offer the highest performance.
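The ordering constraint can be expressed as a simple check. The sketch below models an instruction as a destination register plus a tuple of source registers and tests whether two instructions could launch together. The `Instr` type and the two rules it checks (no read of the other's result, no shared destination) are simplifying assumptions. Real devices also have to consider memory ports, execution-unit availability, and pipeline timing.

```python
from collections import namedtuple

# Hypothetical instruction model: one destination register, several sources.
Instr = namedtuple("Instr", ["dest", "sources"])

def can_issue_together(first, second):
    """Return True if `second` may launch in the same cycle as `first`."""
    if first.dest in second.sources:
        return False  # second needs a result that first has not produced yet
    if first.dest == second.dest:
        return False  # both would write the same register at once
    return True

mac = Instr("r2", ("r0", "r1"))
uses_mac_result = Instr("r3", ("r2", "r4"))
independent = Instr("r5", ("r0", "r4"))
```

Here `can_issue_together(mac, uses_mac_result)` is false while `can_issue_together(mac, independent)` is true, so a programmer or compiler would try to place the independent instruction next to the MAC.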


Parallelism Arises

Two different forms of multiple-launch DSPs have arisen: very long instruction word (VLIW) and superscalar architectures. Both have multiple execution units configured to operate in parallel and use RISC-like instruction sets. The instructions of a VLIW architecture are explicitly parallel, being composed of several sub-instructions that control different resources. The superscalar architectures, on the other hand, load instructions in bulk, then use hardware run-time scheduling to identify instructions that can run in parallel and map them to the execution units.

Of the multi-launch architectures, VLIW designs are the most common. Devices from Adelante Technologies, Equator Technologies, Siroyan, and Texas Instruments fall into this category, although they vary considerably with the type and number of parallel execution units they offer. The TI TMS320C64xx processors, for instance, have eight execution units that can handle both 8- and 16-bit SIMD operations. The Siroyan OneDSP, on the other hand, is scalable from two to 32 clusters, each with several execution units.

The Adelante Saturn DSP core, shown in Figure 3, demonstrates the essence of the VLIW approach. It uses multiple data buses in a dual-Harvard configuration to deliver data and 96-bit wide instructions to an array of execution units simultaneously. These units include two multipliers (MPY), four 16-bit ALUs that can combine to form two 20-bit ALUs, a barrel shifter with saturation logic (SST/BRS), program (PCU) and loop (LCU) controllers, and address controllers (ACU). Design teams can also add application-specific execution units (AXU) to speed processing.

Figure 3:  Adelante's Saturn DSP core handles VLIW instructions that can comprise several sub-instructions that control different resources. The core also handles application-specific execution units (AXUs) to accelerate processing.

The Saturn core uses a unique approach to get around one of the problems the wide word widths of VLIW architectures cause. Accessing external memory is a challenge for these DSPs, because of their need to work with buses that can be as wide as 128 bits. The Saturn core uses 16-bit program memory, which it maps into the 96-bit instruction word it uses internally. Adelante developed this mapping after analyzing millions of lines of code for common applications. However, the core also allows developers to create their own application-specific instructions that map into the VLIW.
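The article does not describe how Adelante implements this mapping, so the following is only a generic table-lookup sketch of the idea of expanding a compact 16-bit code into a 96-bit control word. The table contents, the six-field layout, and the names `EXPANSION_TABLE` and `expand` are all hypothetical.

```python
# Hypothetical table: compact 16-bit code -> six 16-bit control fields (96 bits).
EXPANSION_TABLE = {
    0x0001: (0x1111, 0x0000, 0x2222, 0x0000, 0x0000, 0x0001),
    0x0002: (0x0000, 0x3333, 0x0000, 0x4444, 0x0000, 0x0002),
}
NOP_FIELDS = (0, 0, 0, 0, 0, 0)  # assumed encoding for "no operation"

def expand(compact_word):
    """Expand a 16-bit program-memory word into a 96-bit internal word."""
    fields = EXPANSION_TABLE.get(compact_word, NOP_FIELDS)
    wide = 0
    for field in fields:
        wide = (wide << 16) | (field & 0xFFFF)
    return wide
```

The point of the sketch is the trade-off: program memory stores only the short codes, while the wide word that drives all execution units exists only inside the core.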


Superscalar DSPs

While the 16-bit external instruction width of the Saturn processor is unusual for VLIW architectures, it is typical for superscalar architectures. These devices pull in several instructions at a time and dynamically map them to the execution units. Internally the effect is much the same as a VLIW architecture in that execution units are operating in parallel. But from the software development viewpoint the approach reduces programming complexity. With hardware handling the sequencing and arranging of instructions, the developer is free to work with the more manageable short instructions.
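A toy model helps show what "dynamically map them to the execution units" means. The Python sketch below walks a short instruction stream in program order and starts a new issue group whenever an instruction depends on one already in the group or no unit of the needed type is free. The `Op` type, the unit counts, and the in-order grouping policy are assumptions made for illustration and do not describe any specific device.

```python
from collections import namedtuple

# Hypothetical instruction model: unit type, destination, source registers.
Op = namedtuple("Op", ["unit", "dest", "sources"])

def issue_groups(program, units):
    """Group ops, in program order, into sets that can execute in parallel."""
    groups, current, free = [], [], dict(units)
    for op in program:
        if units.get(op.unit, 0) < 1:
            raise ValueError("no execution unit of type " + op.unit)
        written = {o.dest for o in current}
        depends = op.dest in written or any(s in written for s in op.sources)
        if depends or free[op.unit] == 0:
            groups.append(current)            # close the current group
            current, free = [], dict(units)   # all units free again
        current.append(op)
        free[op.unit] -= 1
    if current:
        groups.append(current)
    return groups

EXAMPLE_UNITS = {"mac": 2, "alu": 1}
EXAMPLE_PROGRAM = [
    Op("mac", "a0", ("r0", "r1")),
    Op("mac", "a1", ("r2", "r3")),
    Op("alu", "r4", ("a0", "a1")),  # needs both MAC results
    Op("alu", "r5", ("r6", "r7")),  # independent, but the one ALU is busy
]
```

With these example values the two MACs issue together and each ALU operation issues alone. The developer writes the four short instructions in sequence and the grouping logic, done by hardware in a real superscalar part, finds the parallelism.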

The structure of a sample superscalar DSP, the LSI Logic ZSP600, appears in Figure 4. Because it is a core, its memory interface isn't constrained, making it look like a VLIW architecture. But the presence of the instruction-sequencing unit (ISU) and the pipeline control unit betrays its superscalar nature. The ZSP600 fetches eight instructions at a time and can execute as many as six, using its four MAC and two ALU execution units simultaneously. Data packing allows the units to perform 16- or 32-bit operations. The architecture also allows for the addition of coprocessors to speed specific DSP functions.

Figure 4:  Superscalar DSPs, such as LSI Logic's ZSP600, fetch several instructions at a time and dynamically map these instructions to the execution units. Since hardware handles the sequencing and arranging of instructions, the software developer's task is simplified compared to working with a VLIW architecture.

This ability to add coprocessors is becoming a common feature of high-performance DSP cores. In many cases the core's creators have also created coprocessors for functions such as DES (data encryption standard) and Viterbi coding. If a pre-designed coprocessor isn't available, however, creating your own can be a major design challenge.

A recently introduced DSP architecture, the PulseDSP from Systolix, might make the task easier. Similar to an FPGA, the PulseDSP offers a massively parallel, repetitive structure, as shown in Figure 5. It is designed as a systolic array, which means that all data transfers occur synchronously on a clock edge. Each processing element in the array has selectable I/O paths, local data memory, and an ALU. Both the I/O and the ALU are programmable, and the array has a programming bus running through it. The combination makes the array reprogrammable, either statically or dynamically. The array structure is intended to handle low-complexity but high-speed processing tasks using 16- to 64-bit arithmetic, which makes it suitable as a coprocessor.

Figure 5:  Systolix's PulseDSP is a systolic array that can run as a coprocessor or as a standalone unit for applications such as filters and FFTs. The array is programmable, with each processing element having its own selectable I/O paths, local data memory, and an ALU.

The array can also be used as a stand-alone processor for some types of algorithms, such as filters and FFTs. One of the commercial implementations of the array, in fact, is to provide filtering in an Analog Devices data acquisition part, the AD7725. The device combines the PulseDSP with a sigma-delta A/D converter to provide post-processing of the acquired data. The DSP array implements various filter algorithms.
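To illustrate how a systolic array computes a filter, the sketch below simulates a textbook systolic FIR structure in Python. Each processing element holds one coefficient, two delay registers for the sample stream, and one partial-sum register, and every register updates on the same simulated clock edge using only its neighbor's previous values. This is a generic design chosen for illustration; it is not a description of the PulseDSP's internal organization, and the name `systolic_fir` is an assumption.

```python
def systolic_fir(coeffs, samples):
    """Simulate a systolic FIR filter; returns one output per input sample."""
    n = len(coeffs)
    x1 = [0] * n   # first sample-delay register in each element
    x2 = [0] * n   # second sample-delay register in each element
    y = [0] * n    # partial-sum register in each element
    outputs = []
    # Extra zeros flush the pipeline so the last results emerge.
    for sample in list(samples) + [0] * (n - 1):
        # Inputs seen this cycle come from the left neighbor's old registers.
        x_in = [sample] + x2[:-1]
        y_in = [0] + y[:-1]
        # Clock edge: every element updates at once from those inputs.
        y = [y_in[k] + coeffs[k] * x_in[k] for k in range(n)]
        x2 = x1
        x1 = x_in
        outputs.append(y[-1])
    # The first n - 1 outputs are pipeline latency, not filter results.
    return outputs[n - 1:]
```

Samples travel through the array at half the speed of the partial sums, so each sum meets the right sample at each element. The only global signal is the clock, which is what lets such an array scale to many elements at high speed.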

Innovations such as the PulseDSP as well as the proliferation within the other DSP architectures are a strong indicator of how important these once-arcane processors have become. In many applications, especially communications, they share the spotlight with the RISC processor. The DSP handles the data and the RISC handles the protocols. There are problems with the two-processor approach, of course, including increased cost and software development complexity. One reason many DSPs are adding RISC-like instructions to their set is to be able to edge out the other processor in such applications.

The same thing is happening with some RISC processors. Extensible cores, such as the Tensilica Xtensa and the ARC International ARCtangent, are offering DSP enhancements so that communications applications need only one processor. These enhancements follow the architecture of the conventional DSP, but merge the DSP functions into the instruction set of the RISC core.

The ARCtangent, shown in Figure 6, demonstrates how the two get blended. The DSP instruction decode and processing elements both connect with the rest of the core, allowing them to use the core's resources as well as their own. The extensions have full access to registers and operate in the same instruction stream as the RISC core. ARC's DSP offerings include MACs in varying widths, saturation arithmetic, and X-Y memory for DSP data. The extensions also support DSP addressing modes such as bit-reversal.

Figure 6:  The ARCtangent core from ARC International blends DSP functionality into a RISC processor. Both DSP instruction-decode and processing elements connect with the rest of the core, allowing these elements to use the core's resources as well as their own.
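Saturation arithmetic is one of the DSP behaviors a RISC core does not usually provide on its own. The sketch below contrasts a saturating 16-bit signed add with the wraparound result that ordinary two's-complement addition would give. The 16-bit width and the function names are assumptions for illustration and are not tied to the ARCtangent extensions.

```python
INT16_MAX = 32767
INT16_MIN = -32768

def saturate16(value):
    """Clamp a value to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))

def sat_add16(a, b):
    """Add two signed 16-bit values, clamping on overflow."""
    return saturate16(a + b)

def wrap_add16(a, b):
    """Add two signed 16-bit values with two's-complement wraparound, for contrast."""
    total = (a + b) & 0xFFFF
    return total - 0x10000 if total & 0x8000 else total
```

Adding 30000 and 10000 saturates to 32767, whereas the wrapping add yields -25536. In an audio or control signal that sign flip is a large glitch, while clipping is a mild distortion, which is why DSP extensions build saturation into the arithmetic.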

These extended RISC processors, enhanced conventional DSPs, and high-performance architectures have all proliferated in the last few years, a sure sign of the importance DSPs have acquired. Furthermore, that proliferation is likely to continue. With process technology allowing integration of multiple peripherals with DSP cores and instruction sets extending to match application needs, DSPs are headed the way of the microcontroller. From obscure, specialized parts, they are evolving to become a fundamental building block for virtually any system.




