Multi-processing SBC bus options eye signal, image apps
Richard Jaenicke, Director of Product Marketing, Mercury Computer Systems Inc., Chelmsford, Mass.
1/17/2002 12:46 PM EST
A common multiprocessing architecture for a single-board computer (SBC), especially in the signal- and image-processing realms, is one that places multiple high-performance vector-processing engines on the same bus.
In signal and image processing, in particular, the AltiVec vector-processing engine in the PowerPC G4 processor continues to provide high-end performance for both fixed-point and floating-point applications. Currently operating at 500 MHz, the 128-bit vector engine can execute floating-point calculations at a rate of up to 4 Gflops and can process byte data (e.g., 8-bit pixels) at up to 16 Gops (16 x 10^9 arithmetic operations per second) by performing two operations per cycle on each of 16 bytes of data. Applications requiring only a couple of processors can be addressed by an SBC with multiple PowerPC G4 processors.
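The peak figures above follow from simple arithmetic, which the sketch below reproduces. The function name and parameters are illustrative, not taken from any vendor library, and the assumption that floating-point data also gets two operations per element per cycle (as in a multiply-add) is inferred from the quoted 4-Gflops figure rather than stated in the text.

```python
def peak_ops_per_second(clock_hz, vector_bits, element_bits, ops_per_element=2):
    """Peak rate = clock * (elements per vector) * (operations per element per cycle).

    ops_per_element=2 is stated in the text for byte data; for 32-bit floats
    it is an assumption inferred from the 4-Gflops figure.
    """
    lanes = vector_bits // element_bits  # elements processed in parallel
    return clock_hz * lanes * ops_per_element

CLOCK_HZ = 500e6    # 500 MHz, from the text
VECTOR_BITS = 128   # vector width, from the text

gflops = peak_ops_per_second(CLOCK_HZ, VECTOR_BITS, 32) / 1e9  # 4 float lanes
gops = peak_ops_per_second(CLOCK_HZ, VECTOR_BITS, 8) / 1e9     # 16 byte lanes

print(f"32-bit float peak: {gflops:.0f} Gflops")
print(f"8-bit pixel peak:  {gops:.0f} Gops")
```

These are peak rates only; sustained throughput depends on how quickly data can be delivered to the processor.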
But the big question is, how do you feed those engines enough data to keep them busy and avoid unproductive processor time?
A number of suppliers in the market have an answer to this question. Most of them use a single host bridge to load and unload four processors connected on a shared PowerPC bus. Typically, this host bridge gets data from a single Peripheral Component Interconnect (PCI) bus segment, which in turn comes through a single bridge from another shared PCI or VMEbus. The throughput of the data in and out of the board is limited by the slowest part of the path, which in this case is the PCI bus, typically operating at a peak of 266 Mbytes/second.
Using this four-processor bus-based model, real-time data is fed to one of the processors at up to 266 Mbytes/s, and then the results are read from a processor at up to 266 Mbytes/s. This must be repeated three more times to get data through each of the four processors.
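A rough model of this serialization is sketched below. It assumes the bus runs at its 266-Mbytes/s peak with no arbitration or bridge overhead, and the block sizes in the example are hypothetical values chosen only to make the arithmetic easy to follow.

```python
PCI_PEAK_MBYTES_PER_S = 266.0  # peak PCI rate quoted in the text

def shared_bus_seconds(n_processors, mbytes_in, mbytes_out,
                       bus_rate=PCI_PEAK_MBYTES_PER_S):
    """Time to load one input block and unload one result block for every
    processor when all transfers take turns on a single shared bus."""
    per_processor = (mbytes_in + mbytes_out) / bus_rate
    return n_processors * per_processor

def per_processor_share(n_processors, bus_rate=PCI_PEAK_MBYTES_PER_S):
    """Average bus bandwidth left for each processor (input plus output)."""
    return bus_rate / n_processors

# Hypothetical workload: 133 Mbytes in and 133 Mbytes out per processor.
total = shared_bus_seconds(4, 133.0, 133.0)
share = per_processor_share(4)
print(f"Bus time for four processors: {total:.1f} s")
print(f"Average share per processor:  {share:.1f} Mbytes/s")
```

In this idealized case each processor is served for only a quarter of the time, so its engine has to run on an average of 66.5 Mbytes/s of combined input and output.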
Clearly, the shared-bus segment emerges as a choke point in this architecture. A new solution is required that can provide the data communications bandwidth demanded by these processors.
The PCI community has attempted to respond to these limitations by bolstering the PCI specification. The clock rate has increased to 66 MHz and then, with PCI-X, to 133 MHz. Although the PCI-X specification does more than simply increase the PCI bus data rate, the higher clock rate bumps up against physical limitations that result in a smaller number of devices per bus segment. Ultimately, this solution turns out to be no solution at all for multiprocessor architectures: the problem it tries to address is how to feed multiple processors faster, so the answer cannot be to feed fewer processors.
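For reference, peak bus bandwidth is just clock rate times bus width, as the sketch below shows. The bus widths (32-bit for conventional PCI, 64-bit for PCI-X) and the nominal clock values are assumptions supplied for illustration, and the device count passed to the helper is left to the caller because the text gives no specific per-segment limits.

```python
def peak_mbytes_per_s(clock_mhz, width_bits):
    """Peak rate assuming one full-width transfer on every clock cycle."""
    return clock_mhz * width_bits / 8

def share_per_device(clock_mhz, width_bits, n_devices):
    """Average peak bandwidth per device when n_devices share one segment."""
    return peak_mbytes_per_s(clock_mhz, width_bits) / n_devices

# Nominal clock (MHz) and assumed bus width (bits) for each generation.
for clock_mhz, width_bits in ((33.33, 32), (66.66, 32), (133.33, 64)):
    rate = peak_mbytes_per_s(clock_mhz, width_bits)
    print(f"{clock_mhz:6.2f} MHz x {width_bits}-bit: {rate:7.1f} Mbytes/s peak")

# Four processors sharing a 66-MHz, 32-bit segment, as in the model above.
print(f"{share_per_device(66.66, 32, 4):.1f} Mbytes/s per device")
```

A faster segment raises the peak figure, but if it supports fewer devices, the remaining processors still have to be reached through additional bridges and segments.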
Another approach is to incorporate a switch-fabric interconnect that provides both an interface to each processor and the ability for multiple simultaneous data transfers. In such a model, each processor gets a dedicated interface that runs at least as fast as a PCI connection, thereby eliminating the dilutive effects of sharing the data interface. The switches replace the multidrop bus, enabling many transactions to occur simultaneously throughout the fabric. Advanced features such as adaptive routing around network hot spots are also possible.
PCI can be combined with switch fabrics in a number of different ways to create scalable processor systems without losing the benefits of PCI.
One way to extend PCI with switch-fabric communication is to add a high-speed auxiliary communication network independent of the PCI bus. Current switch-fabric solutions, such as the RACE++ interconnect, run each link at about the same speed as a 66-MHz, 32-bit PCI bus. Each device gets a dedicated interface to the switch fabric so the bandwidth is not shared with other processors or I/O devices.
Such a design uses the PCI bus for basic control information and low-bandwidth I/O. High-bandwidth communication passes through the switch fabric with both high speed and low latency. As processing requirements of the application grow, more processors are added that interface to both the PCI bus and the switch fabric. The switch fabric adds another point-to-point connection for each additional processing node, thereby scaling bandwidth with processing.
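The scaling argument can be illustrated with a small comparison. This is an idealized sketch: it assumes every fabric link runs at roughly the 266-Mbytes/s rate of a 66-MHz, 32-bit PCI bus and that the switches never contend, which a real fabric would only approximate.

```python
LINK_MBYTES_PER_S = 266.0  # assumed per-link rate, roughly 66-MHz 32-bit PCI

def aggregate_bandwidth(n_nodes, link_rate=LINK_MBYTES_PER_S, shared=False):
    """Total peak bandwidth available to n_nodes.

    shared=True  -> one multidrop bus: the total stays at link_rate.
    shared=False -> switch fabric: one dedicated link per node, so the
                    total grows with the node count (no contention modeled).
    """
    return link_rate if shared else n_nodes * link_rate

for n in (1, 2, 4, 8):
    bus = aggregate_bandwidth(n, shared=True)
    fabric = aggregate_bandwidth(n)
    print(f"{n} nodes: shared bus {bus:6.0f} Mbytes/s, fabric {fabric:6.0f} Mbytes/s")
```

Under these assumptions the per-node bandwidth on the fabric stays constant as nodes are added, while on the shared bus it falls in proportion to the node count.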
With processor clock rates expected to double over the next 18 months, even the current switch-fabric technologies eventually will be stressed to keep up with the increased thirst for data.
Newer, high-speed embedded fabrics like RapidIO are coming online to fill those high-end needs. The RapidIO specification defines a high-performance interconnect architecture designed for passing data and control information between microprocessors, digital signal processors, communications and network processors, system memory and peripheral devices within a system.
The initial RapidIO specification defines technology suitable for chip-to-chip and board-to-board communications across standard printed-circuit board technology. Such communication utilizes low-voltage differential signaling technology and exceeds throughputs of 10 Gbits per second.
Beyond the broad adoption of switch fabrics, the trend to integrate the standard functionality of an SBC into just a handful of chips will have a significant impact on future system designs. Apart from the processing power for a modest application, a typical SBC today provides mostly network and storage interfaces, such as Ethernet and SCSI, flexible expansion in the form of PMC sites and the I/O processing to control them.
Oftentimes the functionality added on the PMC sites is one of the primary functions of the board. In this manner, an SBC can be thought of as a highly intelligent PMC carrier board.
Typically, these interfaces on an SBC are connected via a PCI bus, with a host bridge chip connecting the main processor and its memory to the PCI bus. It is already possible to select processors that have an integrated PCI bus. Although not as powerful as a high-end microprocessor, some integrated host processors have a few additional I/O interfaces implemented directly on the chip.
As processing power increases, these integrated host processors will have enough performance to carry out many of the tasks assigned to the CPU on a single-board computer. One example is the Motorola G5 PowerPC, the MPC8540. It operates at GHz speeds and incorporates a large, on-chip L2 cache. With a full complement of I/O interfaces, including dual Gigabit Ethernet, PCI-X, and a RapidIO port implemented directly on the microprocessor, the 8540 is pretty close to "an SBC on a chip."
With that level of integration, a simple SBC will be a commodity item with little value added. The remedies are to pull in some functionality from other boards in the system, such as the graphics board, or to migrate the SBC functionality onto any other board or multiple boards in the system.
Multiprocessor boards for signal and image processing, for example, could find the room for a separate "SBC" processor if it were sufficiently integrated. Even using today's moderately integrated technology, such a board is possible if only a moderate amount of memory is required and the switch-fabric interfaces and switches themselves are each only a single chip.
In an example architecture of such a board, the integrated I/O processor could control a variety of I/O options, including serial ports, Ethernet connections, or Fibre Channel links. It would also function as an intelligent partner with the processors dedicated to signal processing functions, maintaining the crucial balance between I/O and processing.



