Architecture drives SBC performance

Rodger H. Hosking, Vice President, Pentek Inc., Upper Saddle River, N. J.

1/17/2002 12:35 PM EST

Striving to deliver maximum overall system performance, embedded board vendors now offer multiprocessor products featuring the latest generation of digital signal processing and RISC processors. These processors combine stunning computational speeds with high I/O data transfer rates, so as the number of processors on a board increases, it becomes more difficult to provide adequate memory, interprocessor communication and I/O data channel bandwidth.

The external data resources available on a number of recently introduced embedded processor chips illustrate pervasive problems. For example, Texas Instruments' TMS320C6415, Motorola's MPC7410 AltiVec G4 PowerPC and Analog Devices' ADSP-TS101S Tiger Sharc all offer buses capable of peak I/O transfer rates of more than 1 Gbyte/second.

But while this rate may appear to be more than adequate, the new devices support multiple parallel operations during each processor clock cycle by virtue of enhancements to on-chip processing architectures. In particular, such enhancements include the very long instruction word (VLIW) engine on the C6000, the AltiVec vector coprocessor on the 7410 and the superscalar engine on the TS101S. As a result, at peak processing rates of 4.8 billion fixed-point operations per second for 32-bit integers on the C6415, 1 billion 32-bit IEEE floating-point operations per second on the TS101S and 2 billion on the 7410, these devices can clearly challenge the I/O capabilities of any processor.
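
To put those peak figures side by side, a rough back-of-the-envelope calculation helps. The sketch below is illustrative arithmetic only: it takes the C6415's 4.8 billion operations per second from the text and assumes, as a round lower bound, an external bus delivering exactly 1 Gbyte/second. The function and constant names are invented for this example.

```python
# Illustrative arithmetic only. The peak figures come from the article; the
# 1 Gbyte/s bus rate is an assumed round lower bound ("more than 1 Gbyte/s").
PEAK_OPS_PER_S = 4.8e9     # C6415 peak 32-bit fixed-point operations per second
BUS_BYTES_PER_S = 1.0e9    # assumed external bus peak rate
BYTES_PER_OPERAND = 4      # one 32-bit integer


def ops_per_external_operand(ops_per_s, bus_bytes_per_s, bytes_per_operand):
    """Operations the core can issue in the time the bus moves one operand."""
    operands_per_s = bus_bytes_per_s / bytes_per_operand
    return ops_per_s / operands_per_s


ratio = ops_per_external_operand(PEAK_OPS_PER_S, BUS_BYTES_PER_S, BYTES_PER_OPERAND)
print(f"{ratio:.1f} operations per 32-bit operand moved over the bus")
```

Under these assumptions the core can issue roughly 19 operations for every 32-bit word that crosses the external bus, so any algorithm that reuses data less than that will be limited by I/O rather than by the compute engine.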

Since the cycle time for external devices is restricted by board layout, memory speed and the electrical characteristics of I/O interface drivers, other strategies have been used to improve data transfers to peripherals. These include moving to wider buses of 64 bits or more, and using multiple buses, like the three found on the C6415.

Further, since code fetches to external memory can seriously impact the availability of the external bus for I/O transfers, most of these new processors incorporate L1 and/or L2 cache memory within the chip, and make these resources as large as possible. The 7410 employs a separate 64-bit bus to support an external 2-Mbyte L2 cache, while the C6415 embeds its 1-Mbyte L2 cache right on the chip.

These optimized compute engines with their enhanced peripheral interfaces can demonstrate some incredible benchmarks, but only when the code and data are sitting in just the right place. However, unless they can be adequately coupled to the real-time environment at the board and system level, actual performance of these new processors can be quickly sacrificed due to data path bottlenecks to and from external memory and peripheral devices.

In trying to provide the best solution for the widest range of applications, commercial off-the-shelf (COTS) board vendors face conflicting trade-offs in features and cost when defining new board architectures. For these reasons, system designers must choose carefully when selecting a COTS board for their application.

A basic configuration of a generic four-processor COTS board usually features a single global data bus, which connects the processors to shared memory, mezzanine I/O, backplane I/O and to each other. To conduct any one of these activities, each processor must wait for the other three processors to relinquish this single global bus. Even if the global bus is fast and wide, the aggregate demands for I/O for the four processors can quickly result in a serious performance penalty for the board.
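
The cost of sharing one global bus can be seen in a toy cycle-by-cycle model. The sketch below is a deliberately simplified assumption, not a model of any particular board: each processor alternates between a fixed compute burst and a fixed bus transfer, and a round-robin arbiter grants the bus to one requester at a time. All names and parameters are hypothetical.

```python
def simulate_global_bus(num_procs, compute_cycles, transfer_cycles, total_cycles):
    """Return, per processor, the fraction of cycles spent stalled waiting for the bus.

    Toy model (an assumption for illustration): every processor computes for
    `compute_cycles`, then needs the single bus for `transfer_cycles`, forever.
    """
    compute_left = [compute_cycles] * num_procs   # cycles of compute remaining
    transfer_left = [0] * num_procs               # cycles of bus transfer remaining
    stalled = [0] * num_procs
    owner = None       # processor currently holding the bus, if any
    next_grant = 0     # round-robin starting point for the next grant

    for _ in range(total_cycles):
        # Arbitration: if the bus is free, grant it to the next waiting processor.
        if owner is None:
            for k in range(num_procs):
                p = (next_grant + k) % num_procs
                if compute_left[p] == 0:
                    owner = p
                    next_grant = (p + 1) % num_procs
                    break

        for p in range(num_procs):
            if compute_left[p] > 0:
                # Computing out of local resources; no bus needed this cycle.
                compute_left[p] -= 1
                if compute_left[p] == 0:
                    transfer_left[p] = transfer_cycles
            elif p == owner:
                # Holding the bus and moving data.
                transfer_left[p] -= 1
                if transfer_left[p] == 0:
                    compute_left[p] = compute_cycles
                    owner = None
            else:
                # Ready to transfer, but another processor owns the bus.
                stalled[p] += 1

    return [s / total_cycles for s in stalled]


one_proc = simulate_global_bus(1, compute_cycles=6, transfer_cycles=4, total_cycles=10_000)
four_procs = simulate_global_bus(4, compute_cycles=6, transfer_cycles=4, total_cycles=10_000)
print("stall fraction, 1 processor: ", one_proc)
print("stall fraction, 4 processors:", four_procs)
```

In this model a lone processor never waits, while four processors with the same workload each spend a substantial share of their cycles idle, because their combined transfer demand exceeds what one bus can carry.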

Equally important is the enormous difficulty in writing software to optimize program execution to mesh with available time slots for bus access across all four processors. When poorly coordinated, the processors may sit idle, waiting for bus access and wasting precious processing capacity.

In a high-performance, real-time system like a network processor, for example, the I/O streams are often quite unpredictable, resulting in unacceptably wide variations in performance at the system level.

One of the first obvious, but effective, improvements offered by many board vendors was the addition of local memory to each processor node. Beyond the cache memory that virtually all processors, such as the 7410, require, each node often needs additional local memory. This memory is usually much larger than the cache and is often implemented as cost-effective SDRAM. In this arrangement, larger blocks of code and data can be accessed during program execution without arbitration for the global shared memory. Although this improves operation, it still leaves the processor nodes vying for I/O over the other shared resources.

Depending on the application, communication paths for efficiently moving large blocks of data between processors can significantly boost performance, especially for pipelined-processing applications. At least three strategies have emerged to tackle this requirement.

The first is a dedicated interprocessor link directly connecting two processors, and permitting high-speed data transfers that are completely independent of traffic on the global bus. Some processor chips include built-in interprocessor links, like the Tiger Sharc with four 8-bit link ports, each supporting bidirectional data transfers of up to 180 Mbytes/s.

Unfortunately, the C6000 and PowerPC processors have no such internal links to support multiprocessing and must rely instead upon board-level hardware connected to one of the external buses. One popular implementation of these board-level paths is the linking of a pair of processors with bidirectional FIFOs, thus allowing either processor to read or write data to its FIFO port without having to wait for the other processor to be free.
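
The behavior of such a bidirectional FIFO pair can be sketched in software. The class below is a minimal illustration, not a description of any vendor's hardware: the port names "A" and "B", the depth parameter and the method names are all assumptions made for the example.

```python
from collections import deque


class BidirectionalFifo:
    """Two bounded queues, one per direction, between hypothetical ports 'A' and 'B'."""

    def __init__(self, depth):
        self._depth = depth
        # Each queue is keyed by the port that will *read* from it.
        self._queues = {"A": deque(), "B": deque()}

    def write(self, port, word):
        """Queue a word for the opposite port. Returns False if that FIFO is full."""
        dest = "B" if port == "A" else "A"
        queue = self._queues[dest]
        if len(queue) >= self._depth:
            return False
        queue.append(word)
        return True

    def read(self, port):
        """Return the oldest word waiting for this port, or None if empty."""
        queue = self._queues[port]
        return queue.popleft() if queue else None


fifo = BidirectionalFifo(depth=2)
fifo.write("A", 0x10)      # processor A posts data without waiting for B
fifo.write("B", 0x20)      # processor B posts data in the other direction
print(fifo.read("B"))      # B later collects A's word
print(fifo.read("A"))      # A later collects B's word
```

The point of the arrangement is decoupling: a writer is held up only when the FIFO in its direction is full, not whenever the other processor happens to be busy.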

A second approach is an on-board crossbar switch or switch fabric joining the processors. Unlike the dedicated links using point-to-point connectivity, a switched fabric can reallocate signal paths as required to meet changing I/O requirements. An excellent example of such a board-level switch fabric is Raceway.
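
The difference from a shared bus is that a crossbar can carry several transfers at once, as long as they use different ports. The sketch below is a generic, simplified crossbar model and makes no attempt to reproduce the Raceway protocol; the class and method names are hypothetical.

```python
class Crossbar:
    """Generic crossbar model: any two idle ports can be connected at any time."""

    def __init__(self, num_ports):
        self._peer = [None] * num_ports   # peer port for each port, or None if idle

    def connect(self, a, b):
        """Set up a path between ports a and b. Returns False if either is busy."""
        if a == b or self._peer[a] is not None or self._peer[b] is not None:
            return False
        self._peer[a] = b
        self._peer[b] = a
        return True

    def disconnect(self, a):
        """Tear down whatever path port a is part of, freeing both ends."""
        b = self._peer[a]
        if b is not None:
            self._peer[a] = None
            self._peer[b] = None


xbar = Crossbar(num_ports=4)
print(xbar.connect(0, 1))   # processors 0 and 1 get a path
print(xbar.connect(2, 3))   # processors 2 and 3 transfer at the same time
print(xbar.connect(0, 2))   # refused: both ports are already in use
xbar.disconnect(0)
xbar.disconnect(2)
print(xbar.connect(0, 2))   # paths can be reallocated once ports are free
```

Two independent transfers proceed concurrently here, which a single global bus cannot do, and paths are reassigned as traffic patterns change rather than being fixed at design time.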

A third strategy involves adding an auxiliary bus dedicated to moving data between processors while independent transfers continue on the global bus. A new PowerPC node controller from Galileo Technology provides dual 64-bit, 66-MHz Peripheral Component Interconnect (PCI) buses, one for global transfers and the second for interprocessor transfers, each capable of peak rates of 528 Mbytes/s.
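
The 528-Mbytes/s figure follows directly from the bus width and clock rate. The short calculation below assumes one full-width transfer on every clock with no arbitration or protocol overhead, which is what a peak rate implies; the function name is invented for this example.

```python
def peak_bus_rate_mbytes(width_bits, clock_mhz):
    """Peak rate in Mbytes/s, assuming one full-width transfer per clock cycle."""
    return width_bits / 8 * clock_mhz


pci_peak = peak_bus_rate_mbytes(width_bits=64, clock_mhz=66)
print(pci_peak)        # 528.0 Mbytes/s per bus
print(2 * pci_peak)    # combined peak when both buses are active
```

Sustained rates on a real bus will be lower, since address phases, wait states and arbitration all consume clock cycles.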

For data transfers between boards in a system, the backplane has usually been the first choice. However, as interboard communication rates increase, auxiliary backplane buses and switched-backplane fabrics can be extremely effective alternatives.

One strategy for improving the connectivity between processor nodes and backplane interfaces is to use direct links to these resources. Like the interprocessor links discussed above, some of these links are built into the processors. Examples include C40 communications ports, Sharc link ports, Raceway ports, PCI ports and others. Emerging standards like Infiniband and RapidIO will soon be appearing as standard interfaces on next-generation processors.




