UltraSparc III ready for next generation networks

Dale Greenley, Engineering Manager, Sparc Processors, Marc Tremblay, Distinguished Engineer, Chief Architect Sparc Processors, Sun Microsystems Inc., Mountain View, Calif.

1/9/2002 7:55 AM EST

The new generation of net-centric processors must balance data processing with data flow and movement. They must also incorporate features that enable scalable multiprocessing in large systems, control large amounts of memory, and provide high degrees of reliability, availability and serviceability for enterprise network computing environments, where rebooting is not an option in the event of a system problem.

For net-centric computing, the essential requirements are all associated with the ability to maximize both the movement and processing of data. Fiber and Gigabit Ethernet and their associated transmission subsystems have massive data-throughput potential. So to keep the processor from becoming the network bottleneck, the design of a net-centric CPU must pay special attention to concerns like the actual speed of data processing, protocol stack performance and I/O management. Ultimately, the extent to which a high-speed network can realize its transmission potential will depend on the ability of the CPU to access, process and direct the movement of data through the network.

Microprocessors designed for heavy data manipulation typically feature buses with long latencies that can pump data at extremely high bandwidths once those latencies are satisfied. But the performance of long-latency buses degrades considerably when they are required to switch between small, rapidly changing and multisized data sets. Conversely, buses designed to optimize switching speed suffer from low bandwidth. Processors with complex caching schemes may be optimized for certain data sets; however, a complex cache hierarchy means that memory latencies rise whenever the caches have to be flushed and refilled on a miss, which occurs frequently with the wide range of data-set sizes encountered in network computing.
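The trade-off between latency and bandwidth can be made concrete with a simple transfer-time model: the time to move a block is the bus latency plus the block size divided by peak bandwidth. The sketch below is illustrative only; the two bus profiles and their numbers are hypothetical and do not describe any real processor.

```python
def effective_bandwidth(size_bytes: float, latency_s: float, peak_bytes_per_s: float) -> float:
    """Bytes per second actually achieved for one transfer of size_bytes.

    Model: transfer time = fixed latency + size / peak bandwidth.
    """
    return size_bytes / (latency_s + size_bytes / peak_bytes_per_s)


# Hypothetical bus profiles (assumed values, for illustration only).
LONG_LATENCY_BUS = {"latency_s": 200e-9, "peak_bytes_per_s": 10e9}  # slow to start, very wide
FAST_SWITCH_BUS = {"latency_s": 20e-9, "peak_bytes_per_s": 1e9}     # quick to start, narrow

for size in (64, 4096, 1_000_000):
    wide = effective_bandwidth(size, **LONG_LATENCY_BUS)
    fast = effective_bandwidth(size, **FAST_SWITCH_BUS)
    print(f"{size:>9} bytes: long-latency {wide / 1e6:8.1f} Mbytes/s, "
          f"fast-switch {fast / 1e6:8.1f} Mbytes/s")
```

Under these assumed numbers the fast-switching bus wins on small transfers and the long-latency bus wins on large ones, which is the tension a net-centric design has to balance.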

The best processors for network computing are not optimized for any single workload, such as real-time interrupt processing, fast-switched buses or huge database support. Rather, network computing demands versatile processors that perform well in all net-centric application environments, which are typified by ever-changing data processing requirements. This means that a well-designed net-centric processor should function efficiently across net-centric computing environments, from workstations and blade and rack-mount servers all the way up the spectrum of processing needs to very high-end, mainframe-like hundred-or-more-way servers.

The primary focus of network computing is to access, process and distribute huge amounts of data. The UltraSparc III processor marks Sun Microsystems' third-generation 64-bit design based on the Sparc instruction set. The 64-bit design allows data access far above the 4-Gbyte limit imposed by 32-bit address spaces, and 64-bit processing offers the potential for managing memory spaces as large as 16 exabytes, enough space to process in system memory not only the largest databases that now exist, but also the most massive databases foreseeable over the next half-century.
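The 4-Gbyte and 16-exabyte figures follow directly from the pointer width, assuming byte addressing and binary units (1 Gbyte = 2^30 bytes, 1 Ebyte = 2^60 bytes). A quick check, with helper names chosen here for illustration:

```python
GBYTE = 2 ** 30  # binary gigabyte
EBYTE = 2 ** 60  # binary exabyte


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with a pointer of this width."""
    return 2 ** address_bits


print(addressable_bytes(32) // GBYTE, "Gbytes with 32-bit addresses")
print(addressable_bytes(64) // EBYTE, "Ebytes with 64-bit addresses")
print(addressable_bytes(64) // addressable_bytes(32), "times larger")
```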

Operating at 1,050 MHz, the UltraSparc III chip dissipates 75 W of peak sustained power, achieving an industry-leading SPECint/W ratio. Variants of the processor design are suitable for a broad range of net-centric computing products, including embedded cPCI applications, blade computing, desktop workstations, low-end servers and midrange servers, all the way up to the most powerful servers for the most demanding tasks required of enterprise network computing.

The design team incorporated two wide data buses into the new architecture: a 256-bit wide bus to its external L2 cache with a bandwidth of 13.6 Gbytes/second (at 1,050 MHz), and a 128-bit wide, 150-MHz bus used for memory and I/O traffic. For maximum memory bandwidth, we built a 512-bit interface to external DRAM into the UltraSparc III.
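Peak bus bandwidth is simply width times transfer rate. The sketch below applies that arithmetic to the figures quoted above, assuming one transfer per bus clock and decimal Gbytes (10^9 bytes). The 2.4 Gbytes/second result for the 128-bit bus and the transfer rate implied for the L2 bus are derived here from the quoted numbers, not stated in the article.

```python
def peak_bandwidth_gbytes(width_bits: int, transfers_mhz: float) -> float:
    """Peak bandwidth in Gbytes/s, assuming one transfer per clock."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * transfers_mhz * 1e6 / 1e9


def implied_transfer_rate_mhz(bandwidth_gbytes: float, width_bits: int) -> float:
    """Transfer rate (millions per second) needed to reach a given bandwidth."""
    bytes_per_transfer = width_bits / 8
    return bandwidth_gbytes * 1e9 / bytes_per_transfer / 1e6


# 128-bit, 150-MHz memory and I/O bus from the article.
print(peak_bandwidth_gbytes(128, 150), "Gbytes/s on the 128-bit bus")

# 256-bit L2 bus quoted at 13.6 Gbytes/s: what transfer rate does that imply?
print(implied_transfer_rate_mhz(13.6, 256), "million transfers/s on the L2 bus")
```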

We also designed the UltraSparc III processor with an onboard memory controller able to access up to 16 Gbytes of SDRAM. In an N-way UltraSparc III processor design, each processor can control a portion of main memory up to this limit, enabling system memory to scale linearly with the number of processors. Within a multiprocessing (MP) system, memory can be managed as an aggregate or divided into various system domains.
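The linear scaling is easy to tabulate. The following sketch assumes every processor is populated to the full 16-Gbyte limit and that each CPU belongs to exactly one domain; the domain names and the function names are hypothetical, chosen only for illustration.

```python
PER_CPU_LIMIT_GBYTES = 16  # per-processor SDRAM limit quoted in the article


def max_system_memory_gbytes(num_cpus: int) -> int:
    """Upper bound on memory when each CPU controls its full local limit."""
    return num_cpus * PER_CPU_LIMIT_GBYTES


def domain_memory_gbytes(domains: dict[str, list[int]]) -> dict[str, int]:
    """Memory ceiling per domain, given a mapping of domain name to CPU ids."""
    seen: set[int] = set()
    totals: dict[str, int] = {}
    for name, cpus in domains.items():
        for cpu in cpus:
            if cpu in seen:
                raise ValueError(f"CPU {cpu} assigned to more than one domain")
            seen.add(cpu)
        totals[name] = max_system_memory_gbytes(len(cpus))
    return totals


for n in (1, 4, 24, 106):
    print(f"{n:>3}-way system: up to {max_system_memory_gbytes(n)} Gbytes")

# An 8-way system split into two hypothetical domains.
print(domain_memory_gbytes({"database": [0, 1, 2, 3, 4, 5], "web": [6, 7]}))
```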

To help minimize memory access latencies, we designed UltraSparc III with four level one (L1) caches on chip. In addition to a 32-kbyte instruction cache and a 64-kbyte data cache, we included a 2-kbyte prefetch cache and a 2-kbyte write cache. The prefetch cache operates in parallel with the L1 data cache and can be used by speculatively executed load instructions to store floating-point data before it's needed.

We included a 2-kbyte write cache to allow multiple word stores to coalesce into a single larger write, which takes better advantage of the wide 256-bit data bus to the level two (L2) cache. The result is a 90 percent reduction in the write-through traffic to the L2 cache generated by the L1 caches. This facilitates the distribution of processed network data by releasing the bandwidth thus saved for full cache throughput.
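The idea of write coalescing can be shown with a toy model. The sketch below is not the UltraSparc III design: the 64-byte line size, the least-recently-used eviction policy, and the class and method names are all assumptions made for illustration. Only the 2-kbyte capacity comes from the text.

```python
from collections import OrderedDict


class CoalescingWriteCache:
    """Toy write cache: stores to the same line merge into one L2 write."""

    def __init__(self, capacity_bytes: int = 2048, line_bytes: int = 64):
        self.line_bytes = line_bytes                   # assumed line size
        self.max_lines = capacity_bytes // line_bytes  # 32 lines for 2 kbytes
        self.lines: OrderedDict[int, set[int]] = OrderedDict()
        self.l2_writes = 0

    def store(self, address: int, size: int = 8) -> None:
        """Record an aligned store that does not cross a line boundary."""
        line = address // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)       # mark as most recently used
        else:
            if len(self.lines) >= self.max_lines:
                self.lines.popitem(last=False)  # evict LRU line to L2
                self.l2_writes += 1
            self.lines[line] = set()
        offset = address % self.line_bytes
        self.lines[line].update(range(offset, offset + size))

    def flush(self) -> None:
        """Write every remaining dirty line to L2."""
        self.l2_writes += len(self.lines)
        self.lines.clear()


# 64 sequential 8-byte stores: 64 L2 writes if written through one at a time.
stores = [i * 8 for i in range(64)]
cache = CoalescingWriteCache()
for addr in stores:
    cache.store(addr)
cache.flush()
print(f"write-through: {len(stores)} L2 writes, coalesced: {cache.l2_writes}")
```

How much traffic is saved depends entirely on the store pattern; sequential stores coalesce well, while stores scattered across many lines coalesce poorly.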

Since aggregate data access latencies can add up to huge delays in processing data, we implemented the UltraSparc III processor with on-chip address tags for its large off-chip L2 cache. Compared to keeping the L2 tags off-chip in SRAM (with the L2 cache data itself), keeping these tags on-chip provides a 3X speedup in both latency and overall bandwidth when accessing them. This is an important factor in speeding cache coherence snoop operations in MP configurations.

Current implementations of the UltraSparc III architecture do not support a third level (L3) of cache hierarchy. While an L3 cache may optimize processor throughput in certain applications, current fabrication technology (cost and clock rates) and the aggregate latencies involved in serially filling three sets of cache lines on a combined miss militate against this sort of design. In particular, it would adversely affect the processor's ability to simultaneously handle the widely varying data sets associated with network computing.

The ability of the UltraSparc III processor to efficiently manipulate data is a function of its 14-stage pipeline combined with its ability to issue four instructions per cycle into six parallel processing units. This combination of a deep pipeline with four-way superscalar instruction issue would normally require advanced branch management to avoid performance-sapping stalls at every code branch. To avoid this as much as possible, we incorporated a sophisticated, dynamic, history-based branch prediction algorithm that predicts taken or not taken branches with 95 percent accuracy.
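The article does not describe the prediction algorithm in detail. As a generic illustration of dynamic, history-based prediction, the sketch below implements a textbook gshare-style predictor: a table of 2-bit saturating counters indexed by the branch address XORed with a global history of recent outcomes. The table size, history length and all names are assumptions and should not be read as the UltraSparc III implementation.

```python
class GsharePredictor:
    """Textbook gshare: 2-bit counters indexed by PC XOR global history."""

    def __init__(self, index_bits: int = 10, history_bits: int = 4):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.counters = [1] * (1 << index_bits)  # start weakly not-taken
        self.history = 0                          # recent outcomes, newest in bit 0

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.index_mask

    def predict(self, pc: int) -> bool:
        """True means 'predict taken'."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Train the counter for this branch, then shift in the outcome."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def measure_accuracy(predictor: GsharePredictor, trace: list[tuple[int, bool]]) -> float:
    """Fraction of branches in the trace that were predicted correctly."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Hypothetical trace: one loop branch at address 0x40, taken 200 times in a row.
trace = [(0x40, True)] * 200
print(f"accuracy: {measure_accuracy(GsharePredictor(), trace):.3f}")
```

Real workloads mix many branches with different behavior, so accuracy measured on a synthetic trace like this says nothing about the accuracy of any real processor.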

Keeping the pipeline filled with data and instructions is an important consideration in maintaining high system performance in data-intensive networks. To do this we designed the UltraSparc III processor with a nonstalling pipeline. This allows for the predicated completion of instructions that normally would have been stalled due to the existence of a prior condition. The processor will complete these instructions, discard results and reload previous instructions, all without stalling the pipeline.

In MP configurations, since each UltraSparc III CPU controls its own local memory, correctable errors detected by one processor in its locally controlled memory are fixed before they have a chance to propagate to other processors. The CPU also marks any uncorrectable copy-back errors to prevent them from spreading to any other CPUs.

Other contributors to this article include David Davidian, senior research analyst, Sparc processors, and Harlan McGhan, strategic marketing manager, Sparc processors, Sun Microsystems Inc.




