Design Article
Massively parallel processors for DSP, part 1
BDTI
6/18/2007 3:00 AM EDT
[In Part 2, BDTI looks at the innovative new tools for massively parallel processors.]
In the last few years a number of start-up companies have announced massively parallel processors for embedded DSP applications. With their arrays of processing elements, these processors target high-end digital video, software-defined radio and other computationally demanding applications for which traditional DSP processors lack sufficient horsepower and ASICs are too inflexible or too costly to design. In some cases, massively parallel architectures are employed to reduce power consumption; if the chip has many parallel resources, it can potentially accomplish the same work at a lower clock speed and burn less power.
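The power argument can be made concrete with the standard first-order CMOS dynamic-power approximation, P = C * V^2 * f. The sketch below is a back-of-the-envelope illustration only: the capacitance, voltages, and element count are hypothetical values, not figures for any chip discussed here, and the model ignores leakage power, interconnect overhead, and imperfect parallelization.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order CMOS dynamic power estimate: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

# Hypothetical values, chosen only for illustration.
C_UNIT = 1e-9   # switched capacitance per processing element, in farads
N = 4           # number of parallel processing elements

# One element running at 1 GHz from a 1.2 V supply.
single = dynamic_power(C_UNIT, voltage=1.2, frequency=1.0e9)

# N elements, each at 1/N of the clock rate: the same aggregate throughput,
# assuming the workload splits perfectly across the elements.
# Assumption: the slower clock lets the supply drop from 1.2 V to 0.9 V.
parallel = N * dynamic_power(C_UNIT, voltage=0.9, frequency=1.0e9 / N)

print(f"single:   {single:.2f} W")
print(f"parallel: {parallel:.2f} W")
print(f"ratio:    {parallel / single:.2f}")
```

Under this model the saving comes entirely from the lower supply voltage. At an unchanged voltage, N elements at 1/N of the clock rate would draw the same dynamic power as one fast element, which is why the benefit is described above as potential rather than guaranteed.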
Chips that have a few processors on them have been widely deployed for many years, but what's new is the growing number of chips that contain tens—or even hundreds—of processing elements. There are significant differences among massively parallel chips, but because the technology is relatively new, there isn't yet a clear taxonomy. Without one, it can be difficult to figure out how to compare these chips to each other and understand potential strengths and weaknesses.
In this article, which is Part 1 of a two-part series, we'll explain the key technology differentiators among the latest massively parallel chips. (See our earlier article for a discussion of mainstream DSP processors.) In Part 2 we'll take a look at new development tools and methodologies that vendors are using to try to make their chips easier to use.
Four Dimensions of Differentiation
For the purposes of this article, we'll define four key dimensions of differentiation for massively parallel processor architectures: the granularity of the processing elements; whether the processing elements are homogeneous or heterogeneous; the method used to control the processing elements; and the method used to partition and distribute tasks across processing elements. Understanding where a given architecture fits in these dimensions of differentiation provides a framework for comparing the widely disparate massively parallel architectures available today.
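One way to keep the four dimensions straight when comparing chips is to record them in a small data structure. The following is a minimal sketch, not an established taxonomy: the class names, enum values, and the two example chips ("Chip A" and "Chip B") are hypothetical and exist only to show the shape of the comparison.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Granularity(Enum):
    """Coarseness of a processing element, ordered from fine to coarse."""
    GATE = 1
    FUNCTIONAL_UNIT = 2
    SIMPLE_CORE = 3
    COMPLEX_CORE = 4


class Mix(Enum):
    HOMOGENEOUS = "homogeneous"
    HETEROGENEOUS = "heterogeneous"


@dataclass
class ArchitectureProfile:
    """One chip's position along the four dimensions of differentiation."""
    name: str
    granularity: Granularity
    mix: Optional[Mix] = None            # None means "not yet classified"
    control: Optional[str] = None        # how the elements are controlled
    partitioning: Optional[str] = None   # how tasks are split and distributed

    def unclassified(self):
        """Return the names of the dimensions that are still unfilled."""
        fields = ("mix", "control", "partitioning")
        return [f for f in fields if getattr(self, f) is None]


# Hypothetical entries, for illustration only.
chips = [
    ArchitectureProfile("Chip A", Granularity.COMPLEX_CORE, Mix.HETEROGENEOUS),
    ArchitectureProfile("Chip B", Granularity.FUNCTIONAL_UNIT),
]

# Order the chips from finest-grained to coarsest-grained.
by_granularity = sorted(chips, key=lambda c: c.granularity.value)
print([c.name for c in by_granularity])
print(chips[1].unclassified())
```

Leaving a dimension as `None` rather than guessing makes it explicit which parts of a comparison still need to be filled in from vendor documentation.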
Granularity
The first differentiator we'll discuss is the granularity of the processing elements. Some chips contain arrays of complete processor cores, while others have lower-level elements, like ALUs. Of the chips that use arrays of processors, some are based on complex, possibly VLIW-based CPUs, while others use very simple processor cores. Finer-grained processing elements are generally more flexible, but may require a more hardware-oriented programming approach (e.g., use of an HDL rather than a high-level programming language).
Though they aren't covered in this article, FPGAs can be thought of as the most fine-grained multiprocessors, with gate-level programmability. A step up in granularity is seen in MathStar's MOA2400D chip, which is based on the company's "FPOA" (or "field-programmable object array") architecture. This chip contains an array of 400 functional units, which include ALUs, multiply-accumulate ("MAC") units, and register files. MathStar claims that its medium-grained approach provides much of the flexibility of FPGAs while offering a simpler programming approach and higher clock speeds (up to 1 GHz, vs. approximately 300-500 MHz for high-performance DSP-oriented FPGAs). Each functional unit is individually configured using SystemC code; functional units exchange data via a synchronous interconnect. Unlike in FPGAs, the clock speed of the chip doesn't depend on the functionality being implemented.
At the other end of the granularity spectrum lies IBM's Cell processor, which incorporates eight "synergistic processing elements" ("SPEs"), each of which is a complex, 32-bit superscalar processor with a high level of parallelism for accelerating DSP algorithms. These processors are controlled by the "POWER Processing Element" ("PPE"), a separate 64-bit superscalar CPU hooked up to a cache. The PPE is responsible for running the operating system and coordinating the activities of the SPEs, which essentially act as co-processors. Cell was designed to function as a high-performance programmable processor for gaming applications; its top clock speed is about 3 GHz.
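The coordinator-plus-co-processors arrangement described above can be pictured with a loose host-side analogy. The sketch below is ordinary Python using a thread pool, not Cell SDK code; the function names and the dot-product kernel are hypothetical stand-ins, and the only detail taken from the text is the count of eight workers.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8  # mirrors the eight SPEs described above


def dot(block):
    """Stand-in DSP kernel: dot product of a (coefficients, samples) pair."""
    coeffs, samples = block
    return sum(c * s for c, s in zip(coeffs, samples))


def coordinator(blocks):
    """Plays the controlling role: hands each block to a worker and
    collects the results in the order the blocks were submitted."""
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return list(pool.map(dot, blocks))


# One small block of work per worker; block k holds the samples [k, k, k].
blocks = [([1, 2, 3], [k, k, k]) for k in range(NUM_WORKERS)]
results = coordinator(blocks)
print(results)
```

The point of the analogy is the division of labor: the coordinator does no signal processing itself, and each worker sees only the block it was handed.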
Massively parallel processors are always complex. The choice between a large number of simple processing elements and a small number of complex processors is, in effect, a choice between different types of complexity. Simple processing elements (like those found in MathStar's FPOA) have a limited repertoire of capabilities, and so tend to be straightforward to use—at least, on a per-element basis. But it takes many of them working together to achieve high performance, and that's where the complexity arises.
With fewer, more-complex processing elements, such as those found in the IBM Cell, partitioning the workload and coordinating the activities of the processing elements is less daunting (though it can still be quite challenging), but getting the most out of each processing element can be harder. Processor-based chips tend to use software development tools that are similar to those used for single processors; finer-granularity chips (such as FPGAs) often use very different toolchains and development paradigms.
Heterogeneous vs. Homogeneous Processing Elements
Arrays of processing elements can be either homogeneous (all elements are the same) or heterogeneous (a mixture of two or more different types of elements). Homogeneous arrays contain processing elements that are interchangeable, which can reduce complexity and make it easier to partition an application. If the processing elements are not all the same, it can be challenging to determine which type of processing element should be used for a given task. Heterogeneous arrays may also be more complex to understand and use. However, there are significant benefits to the heterogeneous approach: elements can be specialized for particular tasks (or classes of tasks) and thus be more efficient. The net effect on the user (in terms of ease of use) is likely to depend as much on the development tools as on the specific mixture of processing elements.
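The partitioning difference can be sketched in a few lines. This is a simplified illustration under stated assumptions, not any vendor's mapping tool: the task names, element names, and the "kind" labels ("mac", "bit_ops") are hypothetical, and real tools would also weigh load balance and communication cost.

```python
from itertools import cycle


def assign_homogeneous(tasks, num_elements):
    """Elements are interchangeable, so a plain round-robin is enough."""
    elements = cycle(range(num_elements))
    return {task: next(elements) for task, _kind in tasks}


def assign_heterogeneous(tasks, elements_by_kind):
    """Each task must land on an element type that supports its kind."""
    pools = {kind: cycle(ids) for kind, ids in elements_by_kind.items()}
    placement = {}
    for task, kind in tasks:
        if kind not in pools:
            raise ValueError(f"no element type can run {task!r} ({kind})")
        placement[task] = next(pools[kind])
    return placement


# Hypothetical workload: (task name, kind of element it needs).
tasks = [("fir_filter", "mac"), ("viterbi", "bit_ops"), ("fft", "mac")]

homo = assign_homogeneous(tasks, num_elements=2)
hetero = assign_heterogeneous(
    tasks, {"mac": ["mac0", "mac1"], "bit_ops": ["bit0"]}
)
print(homo)
print(hetero)
```

In the homogeneous case the task's kind is ignored entirely. In the heterogeneous case the mapping step has to know which element types exist and what each can run, which is the extra decision the paragraph above describes.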



