Vector architecture optimized for DSP imaging/wireless apps

John Redford, Bret Bersack, Matt Moniz, Design Engineers, ChipWrights Inc., Newton, Mass.

5/13/2002 7:31 AM EDT

A novel DSP architecture that exploits the parallelism and narrow data paths typical of image processing will be presented at the Custom Integrated Circuits Conference. The following article contains excerpts from the paper titled, "A Vector DSP for Imaging," reprinted with permission from IEEE 2002 CICC.

Consumer image-processing applications such as digital copiers, cameras and camcorders need high computing performance at low cost and low power. Until now, the only route to that combination was fixed-function hardware. We have designed a system-on-chip containing a DSP with a novel vector architecture that can provide similar performance, cost and power in a fully programmable way, permitting new classes of applications, shorter time-to-market and easy product differentiation.

The CW4011 exploits the parallelism and narrow data typical of image processing to gain high performance at low cost and low power. It contains eight 32-bit data paths, all driven by a single instruction, and can perform sixteen 16-bit MACs per cycle, as well as four 32-bit memory accesses per cycle to 128 kbytes of on-chip memory. It contains a serial data path for handling low-performance code and OS functions, and it includes memory, video and I/O interfaces on an industry-standard bus. It is built in 0.18-micron technology, measures 7.8 x 6.8 millimeters, runs at 200 MHz (worst-case) and draws less than 500 milliwatts of power.
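As a back-of-the-envelope check, the per-cycle figures above can be turned into peak rates. The sketch below only multiplies the quoted numbers together; the constant and function names are illustrative, and sustained rates in real code would be lower.

```python
# Peak rates implied by the figures quoted in the text (illustrative names).
CLOCK_HZ = 200_000_000        # worst-case clock rate
MACS_PER_CYCLE = 16           # 16-bit multiply-accumulates per cycle
DATA_PATHS = 8                # parallel 32-bit data paths
MEM_ACCESSES_PER_CYCLE = 4    # 32-bit accesses to on-chip memory
BYTES_PER_ACCESS = 4


def peak_macs_per_second():
    """Peak 16-bit MAC rate if every cycle issues a MAC instruction."""
    return CLOCK_HZ * MACS_PER_CYCLE


def macs_per_data_path():
    """16-bit MACs each data path contributes per cycle."""
    return MACS_PER_CYCLE // DATA_PATHS


def peak_memory_bytes_per_second():
    """Peak on-chip memory bandwidth if all four accesses happen every cycle."""
    return CLOCK_HZ * MEM_ACCESSES_PER_CYCLE * BYTES_PER_ACCESS


print(peak_macs_per_second())          # 3200000000 (3.2 GMAC/s)
print(macs_per_data_path())            # 2
print(peak_memory_bytes_per_second())  # 3200000000 (3.2 Gbytes/s)
```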

We believe the approach provides the best cost/performance of any processor on the market for imaging applications. The high performance is achieved by exploiting several features of imaging applications: easily found parallelism, narrow data and regular memory access patterns. The low cost comes from having little overhead in instruction fetch and dispatch; from the integration of peripherals; and from using low-cost processes, design styles (i.e., no full-custom) and packages. Low power results from low overhead on instruction handling and from extensive clock control.

The architecture spends as many gates and SRAM bits as possible on data handling rather than instruction handling by using a single-instruction, multiple-data (SIMD) architecture as opposed to a superscalar or VLIW approach. It gets around some of the traditional problems with programming SIMD machines, such as Intel's MMX ISA or Oak Technology's iDSP family, via some novel architectural features.

Essentially, the DSP is a variant of a vector architecture with eight parallel units, a central serial data path and a four-bank on-chip memory interleaved on a 32-bit basis. A single 32-bit instruction is fetched from the Icache each cycle and is checked for hazards. From there, it is distributed to the serial and then the parallel data paths.
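To make the interleaving concrete, here is a minimal sketch of how a byte address might map to a bank. It assumes the two lowest word-address bits select the bank, which is the usual reading of "interleaved on a 32-bit basis" but is not stated in the article; the function names are hypothetical.

```python
NUM_BANKS = 4    # four on-chip memory banks
WORD_BYTES = 4   # interleaved on a 32-bit (4-byte) basis


def bank_of(byte_addr):
    """Bank holding a byte address, assuming consecutive words rotate across banks."""
    return (byte_addr // WORD_BYTES) % NUM_BANKS


def banks_touched(base, stride_bytes, count):
    """Banks hit by a strided access of `count` words starting at `base`."""
    return [bank_of(base + i * stride_bytes) for i in range(count)]


# Unit-stride words land in four different banks, so four accesses can proceed together.
print(banks_touched(0, 4, 4))    # [0, 1, 2, 3]
# A 16-byte stride hits the same bank every time under this mapping.
print(banks_touched(0, 16, 4))   # [0, 0, 0, 0]
```

Under this assumed mapping, strides that are a multiple of 16 bytes would serialize on one bank, which is one reason regular, contiguous access patterns suit the memory system.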

The serial data path is a typical 32-bit RISC with 32 registers and three operands. It is an easy porting target for low-performance application code and the operating system. It also supplies data types and address bases to the parallel data paths, and handles loop counts.

The parallel data paths handle the vector operations. Like a traditional vector architecture, they all perform the same operation on different registers and do "strided" or scatter-gather accesses to memory. All operations are conditioned by an enable bit that can be set by comparison instructions.

The actual number of data paths is hidden from the software, permitting the machine to be scaled up or down. But unlike a traditional vector architecture, each data path has its own register file and can be envisioned as operating by itself. Programmers and compilers do not have to think in parallel to use the machine. Instead, they can think about one data path working on one part of an array, and the hardware takes care of spreading the other data paths around the other segments of the array.

This "hiding of parallelism" is due to several factors. For example, the number of active data paths is controlled by a register and can vary at run-time as well as among implementations. There are looping instructions that know the number of active data paths. If more are active, fewer loop passes are done. If a run-time check finds that no parallelism is possible, the number of active data paths can be set to one. The loop instructions also know when the end of an array has been reached and can disable any data paths that are beyond it. Arrays do not have to be a multiple of the number of data paths in length.
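The loop behavior can be modeled in a few lines. The sketch below is a software simulation, not the CW4011 instruction set: `vector_loop` and its parameters are hypothetical, and it assumes consecutive array elements are dealt out to consecutive data paths on each pass, which the article does not specify.

```python
def vector_loop(data, op, active_paths=8):
    """Apply `op` to every element, `active_paths` elements per loop pass.

    Returns the results and the number of passes taken.
    """
    n = len(data)
    out = [None] * n
    passes = 0
    base = 0
    while base < n:
        # Data paths whose element index falls past the end of the array are disabled.
        enables = [base + lane < n for lane in range(active_paths)]
        for lane, enabled in enumerate(enables):
            if enabled:
                out[base + lane] = op(data[base + lane])
        base += active_paths
        passes += 1
    return out, passes


pixels = list(range(10))                       # length is not a multiple of 8
doubled, passes = vector_loop(pixels, lambda x: 2 * x, active_paths=8)
print(passes)                                  # 2 (the second pass uses only 2 paths)
_, serial_passes = vector_loop(pixels, lambda x: 2 * x, active_paths=1)
print(serial_passes)                           # 10
```

The same loop body runs unchanged whether one or eight data paths are active; only the pass count changes.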

In addition, branch instructions control the enable bits. They can save the old state of the enables, set the new state and branch if all the data paths happen to be inactive. In some cases, some data paths may need to execute the "then" clause of a branch and others the "else" clause. The enables are set differently for each. Being able to easily save the state of the enables means that it's possible to nest if-then-elses arbitrarily deeply. In a standard vector machine, much extra work would be needed for the save and restore of enables.
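A small model shows why a save-and-restore stack of enables makes nesting straightforward. Everything here is an illustrative assumption: the `LaneMask` class, its method names and the clamp example are not taken from the CW4011 instruction set.

```python
class LaneMask:
    """Per-data-path enable bits with a stack of saved states."""

    def __init__(self, num_paths):
        self.enables = [True] * num_paths
        self.saved = []

    def begin_if(self, cond):
        """Save the enables, keep only lanes where cond holds.

        Returns True when no lane is left active, so the clause can be branched over.
        """
        self.saved.append(list(self.enables))
        self.enables = [e and c for e, c in zip(self.enables, cond)]
        return not any(self.enables)

    def begin_else(self):
        """Switch to the lanes that were active at the 'if' but skipped the 'then'."""
        outer = self.saved[-1]
        self.enables = [o and not e for o, e in zip(outer, self.enables)]
        return not any(self.enables)

    def end_if(self):
        """Restore the enables saved by the matching begin_if."""
        self.enables = self.saved.pop()


def clamp_lanes(xs):
    """Clamp each lane to 0..255 using a nested if-then-else under lane masks."""
    mask = LaneMask(len(xs))
    ys = [None] * len(xs)

    def store(fn):
        for i, on in enumerate(mask.enables):
            if on:
                ys[i] = fn(xs[i])

    if not mask.begin_if([x < 0 for x in xs]):
        store(lambda x: 0)
    if not mask.begin_else():
        if not mask.begin_if([x > 255 for x in xs]):
            store(lambda x: 255)
        if not mask.begin_else():
            store(lambda x: x)
        mask.end_if()
    mask.end_if()
    return ys


print(clamp_lanes([-5, 10, 300, 255]))   # [0, 10, 255, 255]
```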

Also, there are scatter-gather instructions that let each data path have its own private data structures, free from interference by the others. In a typical scatter operation, there is no assurance that two vector elements won't be written to the same address. In this machine, the scatter op can insert the data path index into the LSBs of the address so that each data path's address is unique. The scatter ops also allow a programmer to have deterministic access time to tables by using multiple copies of vectors.
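One way to picture the private-address scheme is a histogram in which every data path keeps its own copy of each bin. The sketch assumes word addressing and three index bits for eight data paths; `private_address` and `lane_histograms` are hypothetical names, and the real address format may differ.

```python
def private_address(base, index, lane, lane_bits=3):
    """Table address with the data path index placed in the low-order bits."""
    return base + ((index << lane_bits) | lane)


def lane_histograms(pixels, num_bins, num_paths=8):
    """Histogram where each data path updates its own copy of every bin."""
    lane_bits = (num_paths - 1).bit_length()
    table = [0] * (num_bins << lane_bits)
    for base in range(0, len(pixels), num_paths):
        # Lanes in one pass may see equal pixel values, yet never share an address.
        for lane, value in enumerate(pixels[base:base + num_paths]):
            table[private_address(0, value, lane, lane_bits)] += 1
    # Merge the private copies into one count per bin.
    return [sum(table[b << lane_bits:(b << lane_bits) + num_paths])
            for b in range(num_bins)]


print(lane_histograms([1, 1, 1, 2, 0, 1], num_bins=4))   # [1, 4, 1, 0]
```

Without the lane index in the address, the three lanes that all see the value 1 in the first pass would write the same location and updates could be lost.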

The parallel data paths also have a number of features to handle data of narrow bit-width. The instructions can combine the extraction or insertion of bit fields with other operations, so that handling narrow fields adds no overhead. For example, one instruction can extract a byte or word from a 32-bit register, do a multiply-accumulate, perform a shift right to remove fractional bits, run a saturation check on the result, and then do an insert into a byte or word of a 32-bit output register. The extraction and saturation check can be signed or unsigned.
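The sequence of steps folded into one such instruction can be sketched as a single function. This is a behavioral model under stated assumptions: the function names, argument order and field numbering (field 0 in the least-significant bits) are invented for illustration, and the accumulator is modeled as an unbounded integer rather than a fixed-width register.

```python
def extract(reg, pos, width, signed):
    """Pull field number `pos` (0 = least significant) of `width` bits from a 32-bit register."""
    field = (reg >> (pos * width)) & ((1 << width) - 1)
    if signed and field >= 1 << (width - 1):
        field -= 1 << width            # sign-extend
    return field


def saturate(value, width, signed):
    """Clamp `value` to the range representable in `width` bits."""
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    return max(lo, min(hi, value))


def insert(reg, pos, width, value):
    """Write the low `width` bits of `value` into field `pos` of a 32-bit register."""
    mask = ((1 << width) - 1) << (pos * width)
    return (reg & ~mask & 0xFFFFFFFF) | ((value << (pos * width)) & mask)


def fused_mac(a_reg, a_pos, b_reg, b_pos, acc, shift, out_reg, out_pos,
              width=8, signed=False):
    """Extract, multiply-accumulate, shift right, saturate and insert in one step."""
    a = extract(a_reg, a_pos, width, signed)
    b = extract(b_reg, b_pos, width, signed)
    acc += a * b
    result = saturate(acc >> shift, width, signed)
    return insert(out_reg, out_pos, width, result), acc


# 200 * 128 = 25600; shifting right by 7 removes the fractional bits, leaving 200 (0xC8).
out, acc = fused_mac(0x00C80000, 2, 0x00000080, 0, 0, 7, 0xFFFFFFFF, 1)
print(hex(out), acc)   # 0xffffc8ff 25600
```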

Also, the extract or insert position can be incremented or decremented as part of the instruction for sweeping through registers. And registers can be treated as four 8-bit items or two 16-bit items. The accumulator is reformatted to match and thus can be treated as four 24-bit accumulators for packed bytes, two 40-bit accumulators for packed words or a single 40-bit accumulator for long words. That provides extra precision for avoiding overflows.
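The headroom those wider accumulators provide can be estimated with simple arithmetic. The sketch below assumes unsigned items and worst-case products; `safe_accumulations` is an illustrative helper, not something from the paper.

```python
def safe_accumulations(item_bits, acc_bits):
    """Worst-case number of unsigned products that fit before the accumulator overflows."""
    max_product = ((1 << item_bits) - 1) ** 2
    max_acc = (1 << acc_bits) - 1
    return max_acc // max_product


print(safe_accumulations(8, 24))    # 258: packed bytes in a 24-bit accumulator
print(safe_accumulations(16, 40))   # 256: packed words in a 40-bit accumulator
```

In both packed formats, a filter or dot product of a couple of hundred taps can run to completion in the worst case before any overflow handling is needed.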

Each item is ordinarily paired with a matching item in the other operand, but the machine can also spread one item among many (a "splat" operation) and can do dot-products and sum-of-absolute-differences.
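For packed bytes, those three operations behave roughly as follows. This is a plain software model with hypothetical function names; it treats the items as unsigned and numbers them from the least-significant byte.

```python
def unpack_bytes(reg):
    """Split a 32-bit register into four unsigned bytes, least significant first."""
    return [(reg >> (8 * i)) & 0xFF for i in range(4)]


def pack_bytes(items):
    """Pack four bytes (least significant first) into a 32-bit register."""
    reg = 0
    for i, item in enumerate(items):
        reg |= (item & 0xFF) << (8 * i)
    return reg


def splat(reg, pos):
    """Spread the byte at position `pos` across all four byte positions."""
    item = unpack_bytes(reg)[pos]
    return pack_bytes([item] * 4)


def dot_product(a_reg, b_reg):
    """Sum of products of matching bytes."""
    return sum(x * y for x, y in zip(unpack_bytes(a_reg), unpack_bytes(b_reg)))


def sum_abs_diff(a_reg, b_reg):
    """Sum of absolute differences of matching bytes, as used in block matching."""
    return sum(abs(x - y) for x, y in zip(unpack_bytes(a_reg), unpack_bytes(b_reg)))


print(hex(splat(0x11223344, 1)))              # 0x33333333
print(dot_product(0x01020304, 0x01010101))    # 10
print(sum_abs_diff(0x10203040, 0x40302010))   # 128
```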

Each instruction can have up to seven input operands: an A operand register, the A data type (8 bits, 16 bits, 32 bits, packed 8 bits, packed 16 bits and signed/unsigned), a B operand register and the B data type, a 9-bit literal, the accumulator and the output data type. These complex instructions permit many operations to be specified by a single 32-bit instruction. This can be called a very dense instruction word (VDIW) and permits high performance with a single-issue instruction dispatcher and a small instruction cache, saving area and power.

The DSP and primary memory are combined with a set of I/O blocks, all linked by an on-chip system bus: a 32-bit AHB running at half the DSP clock rate.

The I/O blocks are:

  • A 32/16-bit SDRAM interface. It includes a 16-deep write FIFO and four 8-deep read prefetch buffers to handle multiple streams of contiguous data such as DMA and Icache fills.

  • A video interface with 16/8-bit input and output ports, each backed by a 256-byte FIFO. Streams of data from sensors or to printers can be moved through these. The ports can handle standard HSYNC/VSYNC signaling as well as req/ack control.

  • A three-channel DMA controller: one channel for video input to SDRAM, one for SDRAM to video output, and one for SDRAM to/from internal memory. Each channel moves two-dimensional patches of data instead of just blocks.

  • A UART, synchronous serial port (for sensor control) and 16 general-purpose I/O pins. Unused video pins can also be GPIOs.

  • A JTAG debug port that can read and write anything on the AHB bus. A breakpoint can be put on any PC range, and a watch point can be put on any address range or masked data value.

  • A 16-bit host interface for connecting to an outside host processor or controlling low-speed outside peripherals.

  • A PLL for clock multiplication. The clock can be dynamically controlled for clock management, and clocks to individual blocks can be disabled. The chip can be put to sleep or awakened by a GPIO or timer interrupt. Clock gating is used throughout all the blocks.
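The two-dimensional transfers handled by the DMA channels in the list above amount to one contiguous burst per image row. The sketch below generates those bursts for a rectangular patch; the parameter names (`pitch` for bytes per full image row, and so on) are assumptions for illustration rather than the controller's actual register set.

```python
def patch_bursts(base, pitch, x, y, width, height, bytes_per_pixel=1):
    """Yield (start_address, byte_count) for each row of a rectangular patch.

    base  : address of the image's top-left pixel
    pitch : bytes from the start of one image row to the start of the next
    x, y  : top-left corner of the patch, in pixels
    """
    for row in range(height):
        start = base + (y + row) * pitch + x * bytes_per_pixel
        yield start, width * bytes_per_pixel


# An 8x3 patch at (16, 2) inside a 640-byte-wide image stored at 0x1000.
for start, count in patch_bursts(0x1000, 640, 16, 2, 8, 3):
    print(hex(start), count)
# 0x1510 8
# 0x1790 8
# 0x1a10 8
```

A block-only DMA would need one descriptor per row for the same job; a patch-aware channel needs only the corner, size and pitch.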

See related chart




