ISSCC: Embedded DRAM headed for architectural overhaul
Anthony Cataldo
2/15/1999 10:10 AM EST
SAN FRANCISCO - As embedded DRAM becomes a key part of chip process technology, companies are looking toward new approaches to the architecture itself. In separate papers that Hitachi and NEC will present at the International Solid-State Circuits Conference here this week, researchers will outline new, yet very different, approaches to improving both the initial access time and the overall bandwidth of on-chip DRAM data transfers.
The efforts are not unlike the changes on the motherboard, where high-speed DRAMs such as Direct Rambus will dramatically boost bandwidth of PC main memory so as not to bog down CPU performance. Similarly, bunching together more logic macros on the same chip is starting to strain embedded-DRAM performance, said researchers.
"In the future, we can integrate CPU, DSP and other logic macros with DRAM," said Takao Watanabe, senior researcher at Hitachi's Central Research Laboratory (Tokyo). "These 'DRAM masters' will all have access to DRAM, so access conflicts will create a wall."
A second barrier is on-chip CPU access to embedded DRAM after a cache miss as the number of I/Os grows. In some respects this is counterintuitive, because using more I/O lines yields better bandwidth. But DRAMs, and the embedded kind is no exception, suffer from an inherently high latency on initial access. As I/Os increase, the initial access latency becomes a bigger problem.
"If the number of I/Os is high, then the cache-fill latency time from the DRAM to the CPU can be high," Watanabe said.
Hideo Toyoshima, NEC's research manager for the System ULSI Research Laboratories (Kanagawa, Japan), said the problem stems from the fact that DRAM macros were never really optimized for system-on-chip devices. Discrete DRAMs have long been designed to pack in as many bits as possible without giving a second thought to performance.
"We use a long word line to increase density and cut down on peripherals, but it causes high initial latency," he said. "The embedded-DRAM macro concept is similar to conventional DRAM but in the future, systems-on-chips will require less than 10-ns first access."

Hitachi and NEC have similar goals in mind (cut initial access times and boost overall bandwidth) but different ways of meeting them. Hitachi is touting what it calls an "access optimizer," a piece of logic that reduces cache-fill latency and access conflicts by minimizing the number of cache misses. It also acts as a traffic cop among different logic macros that draw data from the DRAM.
NEC, meanwhile, will revamp the DRAM macro for more granularity, minimize sense-amp delays and improve timing distribution.
Hitachi's access optimizer consists of three control mechanisms: one for self-prefetching, another for address alignment and a third for arbitrating accesses from different logic macros that draw from embedded DRAM. The access optimizer takes advantage of the multi-bank DRAM architecture the company now offers and takes it a step further by using the sense amplifiers as cache, Watanabe said.
"Today, data is stored in a memory cell and to read we have to activate the word line and data of the word line, then transfer to the sense amp," he said. "With the access optimizer, we can keep this data for a while; we don't need to activate the word line again. We just read the sense amp so we can use it as a cache memory."
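The idea Watanabe describes, keeping the last-read row in the sense amplifiers so a repeat access skips the word-line activation, can be illustrated with a small model. This is only a sketch under assumed conditions: the `Bank` class, the `hit_rate` helper and the one-open-row-per-bank policy are illustrative choices, not details taken from Hitachi's paper.

```python
class Bank:
    """One DRAM bank whose sense amplifiers hold the most recently activated row."""

    def __init__(self):
        self.open_row = None  # row currently latched in the sense amps (assumed policy)

    def access(self, row):
        """Return True on a sense-amp hit, False if the word line must be activated."""
        if self.open_row == row:
            return True
        self.open_row = row  # activate the word line and latch the new row
        return False


def hit_rate(accesses, num_banks):
    """accesses: iterable of (bank, row) pairs. Returns the fraction served from sense amps."""
    banks = [Bank() for _ in range(num_banks)]
    hits = 0
    total = 0
    for bank, row in accesses:
        hits += banks[bank].access(row)
        total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Repeated reads of one row hit the sense amps after the first activation.
    print(hit_rate([(0, 1), (0, 1), (0, 1), (1, 5)], num_banks=2))
    # Alternating rows in a single bank never hit.
    print(hit_rate([(0, 1), (0, 2), (0, 1), (0, 2)], num_banks=1))
```

In this model, more banks mean more rows can stay latched at once, which is why the approach builds on a multi-bank architecture.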
Reducing on-chip CPU bottlenecks is done in two ways. First, the address from the CPU is reorganized from simple groupings of row, bank, macro and column calls, in accordance with the structure of the L1 cache and DRAM. The realignment is intended to avoid activating different word lines in the same bank, which causes cache misses. Second, generating a self-prefetching address can prevent successive cache misses that tend to repeat every eight cycles.
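The article does not spell out Hitachi's realignment, so the following sketch swaps in a generic, well-known trick to show the goal: XOR the bank bits with the low row bits so that addresses which would open different word lines in the same bank land in different banks instead. The field widths and the function names `simple_map` and `realigned_map` are assumptions made for illustration, and the self-prefetching half is not modeled.

```python
COL_BITS = 8    # assumed: 256 columns per row
BANK_BITS = 2   # assumed: 4 banks
ROW_BITS = 10   # assumed: 1,024 rows per bank


def simple_map(addr):
    """Split a flat address into (bank, row, col) as plain [row | bank | col] fields."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return bank, row, col


def realigned_map(addr):
    """Same fields, but the bank index is XORed with the low row bits (illustrative only)."""
    bank, row, col = simple_map(addr)
    bank ^= row & ((1 << BANK_BITS) - 1)
    return bank, row, col


if __name__ == "__main__":
    stride = 1 << (COL_BITS + BANK_BITS)  # two addresses one full row-stride apart
    print(simple_map(0), simple_map(stride))        # same bank, different rows
    print(realigned_map(0), realigned_map(stride))  # different banks
```

Because the XOR only permutes banks within a given row, every address still maps to a unique location; only the conflict pattern changes.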
The access optimizer's other big job is to arbitrate DRAM accesses from the several "masters." It changes the sequence of data coming from a DRAM macro that serves more than one master using different I/Os. This minimizes the influence of access misses and avoids conflicts.
Hitachi claims the access optimizer speeds cache fills by 30 percent and yields hit rates in the 90 percent range. Overall chip performance increases by 39 percent. Hitachi has fabricated a 0.18-micron, 8-Mbit prototype that occupies 1.5 mm² and dissipates 26 mW. Watanabe said the company will likely implement access optimizers in devices starting at the 0.15-micron generation.
"It's just logic so the technical barriers are not so large," he said.
NEC is also taking a swipe at access latency with its newest embedded 64-Mbit DRAM macro. Built on a 0.25-micron, five-metal-layer merged DRAM and logic process, the architecture is designed to slash access times from 50 ns to less than 10 ns.
The most significant reduction comes from trimming the size of the memory-cell array: NEC divided the macro into small (8-kbit) microcell units.
A new sense-amp operation sequence also helps. In the conventional design, the order of the sense control changes when instructions switch from read to write, causing sensing delays. NEC's scheme allows for simultaneous reads and writes, resulting in a 2.1-ns gain in access time.
A two-dimensional signal distribution shaves another 1.1 ns by inserting a latch circuit in the cell array. "We always use the clock edge of the slow signal by inserting the latch circuit," Toyoshima said. All told, the initial access time drops to 6.8 ns.
NEC also made the DRAM module more flexible for ASIC designers by dividing it into eight 8-Mbit blocks. Each block in turn is made up of eight 1-Mbit modules, each consisting of an 8 x 16 array of 8-kbit microcells. One microcell can read and write 32-bit data simultaneously by sharing I/Os with microcells to either side. A 4-Mbit configuration has a maximum of 128-bit I/O, while a 64-Mbit macro is capable of 1,024-bit I/O. "Access and cycle times stay constant no matter what the size is," Toyoshima said.
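The hierarchy described above can be checked with a few lines of arithmetic. The constant names are made up for this sketch, and it assumes binary units (1 kbit = 1,024 bits), which the article does not state.

```python
KBIT = 1024          # assumed binary units
MBIT = 1024 * KBIT

MICROCELL_BITS = 8 * KBIT        # one 8-kbit microcell
MICROCELLS_PER_MODULE = 8 * 16   # 8 x 16 array of microcells
MODULES_PER_BLOCK = 8
BLOCKS_PER_MACRO = 8

module_bits = MICROCELL_BITS * MICROCELLS_PER_MODULE
block_bits = module_bits * MODULES_PER_BLOCK
macro_bits = block_bits * BLOCKS_PER_MACRO

if __name__ == "__main__":
    # Sizes in Mbit for module, block and full macro.
    print(module_bits // MBIT, block_bits // MBIT, macro_bits // MBIT)  # 1 8 64
```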
The result is a DRAM macro with a 9.1-ns access cycle and 7.7-ns random write time. Maximum bandwidth using a 1,024-bit I/O running at 100 MHz is 12.8 Gbytes/second.
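The bandwidth figure follows directly from the I/O width and clock rate, as this short check shows. The helper names are hypothetical; the second function simply converts the quoted 9.1-ns cycle into the highest clock rate that cycle would permit, assuming one transfer per cycle.

```python
def peak_bandwidth_gbytes(io_bits, clock_hz):
    """Peak transfer rate in Gbytes/second for an I/O bus of io_bits at clock_hz."""
    return io_bits * clock_hz / 8 / 1e9


def max_clock_mhz(cycle_ns):
    """Highest clock rate (MHz) allowed by a given cycle time, one access per cycle."""
    return 1e3 / cycle_ns


if __name__ == "__main__":
    print(peak_bandwidth_gbytes(1024, 100e6))  # 12.8
    print(round(max_clock_mhz(9.1), 1))        # 109.9
```

A 9.1-ns cycle leaves a little headroom over the 100-MHz clock used for the quoted bandwidth.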



