ISSCC Analysis: Memory bottleneck continues to haunt designers
Ronald Wilson
2/19/1999 4:57 PM EST
SAN FRANCISCO - Concern over processor-to-memory bandwidth was a common thread running through the papers on CPUs, media processors and memory devices presented at the International Solid-State Circuits Conference here this week.
Despite the enthusiasm over nearly-GHz processor clocks and ever-more-powerful arrays of processing elements, there was a pervasive worry at the conference that the new, deep-submicron engines may starve for want of memory access. There was no consensus on how to solve the problem. Rather, ISSCC papers showed how varied the approaches are that architects are investigating to open the memory bottleneck.
To a great extent, the problem has been created by the continuing success of process engineers and processor designers. With the ability to dispatch multiple instructions per clock and with clock frequencies approaching 1 GHz, the leading CPUs are easily able to crush the fragile pipes connecting them to their sources of instructions. Even specialized media processors, in which instruction streams are predictable and may be contained within a chip, can now process data faster than existing main-memory systems can deliver it.
This has created something of an architectural crisis. A number of designs, particularly in the media-processing area, have already become so bandwidth-constrained that they are unable to reach their design performance on meaningful codes. But both architects and memory-design teams are battling back with a range of ideas.
If one thing was clear this week, it was that there are no silver bullets in anyone's gun belt. There is no single new memory interface or memory hierarchy that can promise relief to all sorts of systems.
The furor over Rambus perhaps illustrates the futility of any search for a panacea. While Intel continues to invest massive amounts of cash in back-end process equipment for DRAM vendors, the DRAM vendors themselves are increasingly saying that Direct Rambus will not be a major issue in the market before the year 2000. Intel is reportedly rescheduling its rollout for the technology, and the DRAM vendors are still struggling with costs and technical problems.
Not least among those problems is scalability, as one senior Hitachi Ltd. researcher remarked this week. "A serious issue with Rambus is that it requires voltages in the I/O ring that are higher than the voltages we use in advanced DRAM cores," the researcher said.
With Rambus slowing, the window appears open for another generation of double-data-rate (DDR) synchronous DRAMs to open up the pipe to main memory.
Papers at ISSCC reflected designers' continuing optimism about DDR. One-Gbit DRAM papers from IBM and Siemens, from Samsung, and from NEC presented designs with synchronous data rates at the I/O pins of 500, 333 and 250 Mbits per second, respectively, all using low-swing SSTL-type pin electronics.
While these speeds do not challenge Rambus' 800 MHz, the interfaces can be made significantly wider (32 bits, in the case of the IBM design), allowing single-chip throughput to equal that of a Rambus channel.
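As a rough check on that comparison, peak interface bandwidth is just the per-pin data rate multiplied by the interface width. The sketch below uses the 500-Mbit/s rate and 32-bit width reported for the IBM design. The 16-bit data width for a Direct Rambus channel is not given in this article and is an assumption here, as is the helper name.

```python
def peak_bandwidth_mbytes_per_s(pin_rate_mbits: float, width_bits: int) -> float:
    """Peak transfer rate in Mbytes/s: per-pin rate (Mbit/s) x width, at 8 bits per byte."""
    return pin_rate_mbits * width_bits / 8


# Figures from the ISSCC paper cited above: 500 Mbit/s per pin, 32 bits wide.
ddr_ibm = peak_bandwidth_mbytes_per_s(500, 32)

# Assumption (not stated in the article): one Direct Rambus channel carries
# 16 data bits at 800 Mbit/s per pin.
rambus_channel = peak_bandwidth_mbytes_per_s(800, 16)

print(f"32-bit DDR at 500 Mbit/s per pin:    {ddr_ibm:.0f} Mbytes/s")
print(f"16-bit Rambus at 800 Mbit/s per pin: {rambus_channel:.0f} Mbytes/s")
```

These are peak figures that ignore protocol overhead and bank conflicts. Under the stated assumption, the wide DDR part at least matches a single Rambus channel.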
With designs from disparate vendors informally converging on SSTL-2 and 250-MHz clocking, it seems clear that another generation of DDR SDRAMs is in the works. In private conversation, Hitachi designers confirmed that at least one more generation seemed likely.
But interface bandwidth alone won't solve the DRAM designers' problems. Before they can exploit a wider pipe, the memory architects need to create cores that can supply these burst-oriented interfaces with huge amounts of data at very low latency.
To this end, the traditional DRAM core with its enormous banks of cells and long bit lines is disappearing. Virtually every DRAM paper emphasized some variation on a new approach to array organization. The common theme is to build the array out of a matrix of small blocks of cells. The overhead of this approach, first tried by MoSys Inc., means a larger die. But the combination of short bit lines, dedicated sense amps and clever interleaving can mean reducing DRAM access latencies to near SRAM levels. NEC, for example, used such approaches to create a DRAM macro, intended for embedded use, with a row-access delay under 7 ns.
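The benefit of interleaving across many small blocks can be seen in a toy timing model. The sketch below is not taken from any of the ISSCC papers: the address-to-block mapping, the block count and the busy time are all illustrative assumptions. Each block is unavailable for a fixed number of cycles after it starts an access, and consecutive addresses are spread across blocks.

```python
def cycles_for_stream(addresses, num_blocks, busy_cycles):
    """Cycles needed to issue every access in order, at most one per cycle.

    Toy model (all parameters hypothetical): an address maps to block
    address % num_blocks, and a block cannot start a new access until
    busy_cycles after its previous access started.
    """
    ready_at = [0] * num_blocks  # earliest cycle each block can accept an access
    now = 0
    for addr in addresses:
        block = addr % num_blocks
        start = max(now, ready_at[block])  # wait if the block is still busy
        ready_at[block] = start + busy_cycles
        now = start + 1  # the next access issues no earlier than the next cycle
    return now


stream = range(16)  # 16 sequential addresses
one_big_bank = cycles_for_stream(stream, num_blocks=1, busy_cycles=4)
eight_small_blocks = cycles_for_stream(stream, num_blocks=8, busy_cycles=4)
print(one_big_bank, eight_small_blocks)
```

With a single bank every access waits out the full busy time. With eight blocks the busy time of one block is hidden behind accesses to the others, so the stream issues at one access per cycle.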
Hedging on DRAM
Even with such speeds on the horizon, virtually no CPU designer is betting the farm on DRAM. All are wrestling to increase bandwidth and shrink latency in their cache hierarchies as well.
The brute approach is to move L2 cache on-chip and make it enormous. This avoids the problems of an off-chip interface, which falls far behind the CPU in speed as clock frequencies get above 250 MHz. The outstanding example continues to be Hewlett-Packard, which this year described a 1.5-Mbyte L2 on its latest PA-RISC chip. Samsung used a smaller-scale version of the same approach in the 96-kbyte secondary cache of its 600-MHz SOI Alpha CPU.
Some designers are continuing to go for larger L1 caches and external L2, however. External SRAM is unquestionably cheaper, and to some degree the CPU can be made immune to the added latency of an external cache by careful attention to non-blocking cache design. Fast external caches are featured on Advanced Micro Devices' K7, Motorola's G4 and a fascinating multiprocessing design for an IBM S/390.
One example from the K7 is instructive. The chip's primary caches are dual 64-kbyte two-way units. The control logic for the L2 is on the CPU die, and the 64-bit-wide L2 interface has been tuned to operate at up to perhaps 350 MHz. This, combined with the extensive thought AMD has given to keeping the CPU busy while L2 references are pending, is the company's approach to avoiding the penalties of a chip crossing.
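The payoff from keeping the CPU busy during pending L2 references can be sketched with simple arithmetic. The model and every number below are hypothetical and are not AMD's figures: it assumes each miss exposes only the part of the L2 latency that the CPU cannot cover with independent work.

```python
def stall_cycles(num_misses, l2_latency, independent_work):
    """Cycles the CPU sits idle waiting on the L2 across num_misses L1 misses.

    independent_work is the number of cycles of useful work the CPU can find
    while each miss is outstanding; 0 models a blocking cache.
    All values are illustrative assumptions.
    """
    exposed = max(0, l2_latency - independent_work)  # latency not hidden by other work
    return num_misses * exposed


blocking = stall_cycles(num_misses=1000, l2_latency=12, independent_work=0)
non_blocking = stall_cycles(num_misses=1000, l2_latency=12, independent_work=9)
print(blocking, non_blocking)
```

In this sketch the external cache is no faster in the non-blocking case. The CPU simply sees less of its latency, which is the effect the non-blocking designs described above are after.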
Another alternative, not used in any of the CPU papers, was presented by NEC. The main problem with huge on-chip L2 caches, NEC researchers observed, is not architectural; it is the size and power consumption of SRAM. If an adequately fast cache could be made from DRAM, on-chip L2 would be even more attractive.
As a result, NEC presented a paper on a 12-ns L2 cache built from DRAM cells and embedded on a MIPS R10000 CPU die.
Like the previously mentioned macro, the cache uses an array of small (in this case 16-kbit) subarrays with dedicated sense amps and a number of other novel features. A rather complex interface manages the arrays and sense amps to give the memory the appearance of an SRAM cache. Even refresh cycles have been cleverly concealed from the CPU.
As even these techniques begin to fall behind, it may be up to architects to pick up the baton from memory designers. Indications from the Sony media-processor paper suggest the direction of thought about the next generation of CPUs.
The Sony chip is a cluster of specialized media processors around a general-purpose MIPS core. In it, each individual processor has its own specialized memory: instruction and data caches and local SRAM for the MIPS core; microcontrol store and data buffers for the vector processors; and a local video buffer for an MPEG-2 decoder.
All of these local stores are backed by an external DRAM main memory via a 128-bit internal memory bus and, reportedly, a multichannel Direct Rambus external interface. It remains to be seen whether the size of the internal memory blocks (the largest of which are only 16 kbytes) and the latency of the Rambus interface will suffice.
Beyond allocating local scratchpads, designers are looking ahead to more aggressive methods. CPUs are increasingly leaning on data prefetch instructions and even speculative loads to keep their SIMD execution units from overrunning their data caches.



