Design Article
Conquering the memory bottleneck
James Mac Hale, Sonics, Inc.
9/13/2010 7:52 PM EDT
Editor's Note: I asked James Mac Hale for this feature as a preview to Sonics' EETimes webinar that was held on September 23, 2010. You can check out the webinar's details and listen to the playback here.
The evolution of high-bandwidth, consumer system-on-chip (SoC) devices is driving new design requirements as developers look for innovative ways to conquer on-chip bandwidth and efficiency issues. Today’s most popular home entertainment and mobile devices, such as smartphones, pad computers, high-definition TVs and personal media players, require an ever-increasing number of processors that all share the same DRAM pipe. This has created a substantial efficiency bottleneck for SoC designers and system architects.
Advanced SoCs now require multiple processors, including special-purpose processors, that demand simultaneous memory access. Designers want to alleviate memory congestion and make sure memory efficiency and bandwidth are fully optimized in each design. The real challenge, however, is to recover additional raw bandwidth, improve on-chip efficiency and optimize DRAM access while meeting market pressures and staying on budget, all without adding system cost.
The memory bottleneck emerged because DRAM architectures have not evolved in response to the memory requirements of SoCs. Instead, they have been driven by the needs of the PC market and by the economics of supply and commoditized pricing for a standardized memory product. For example, the DDR3 memory interface reaches higher interface speeds and higher bandwidth by drawing from more banks of DRAM internally, but the drawback is a longer minimum burst length. This approach boosts absolute bandwidth and performance, but overall system efficiency drops whenever memory accesses are shorter than the minimum burst length, which is common in SoCs.
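The cost of a longer minimum burst can be estimated with simple arithmetic. The sketch below is only an illustration: the function name, the 32-bit (4-byte) channel width and the request sizes are assumptions chosen for the example, and requests are assumed to be aligned to burst boundaries.

```python
def burst_efficiency(request_bytes, bus_width_bytes, min_burst_length):
    """Fraction of the transferred bytes that the requester actually asked for."""
    # Smallest amount of data the DRAM moves in one access.
    min_transfer = bus_width_bytes * min_burst_length
    # Round up: a request always occupies a whole number of bursts.
    bursts = -(-request_bytes // min_transfer)
    return request_bytes / (bursts * min_transfer)


# Hypothetical 32-bit (4-byte) channel, compared at burst lengths 4 and 8.
for burst_length in (4, 8):
    for size in (16, 32, 64):
        eff = burst_efficiency(size, 4, burst_length)
        print(f"BL{burst_length}, {size}-byte request: {eff:.0%} useful")
```

Under these assumptions, a 16-byte access fills a burst of four completely but uses only half of a burst of eight, so half of the channel time spent on that access carries data nobody requested.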
The Challenges
The architect’s challenge, given the bandwidth bottleneck to external DRAM, is to reconcile three conflicting goals, analogous to the three legs of a stool: the latency of latency-sensitive traffic, the required bandwidth of bandwidth-sensitive traffic, and the overall efficiency of the DRAM channel. The solution lies in scheduling the traffic to achieve the optimal tradeoff, where “optimal” can be defined by the system architect based on the target application.
Various scheduling schemes are used today in the design of memory controllers. A common, basic approach is to provide multiple ports into the memory controller, queue the requests up outside each port, switch between ports, and process requests from different initiators. The algorithm for switching between ports is often very carefully crafted to yield the optimal result for a statically defined scenario. When the system behavior changes (due to different traffic patterns or a change in specification), the scheduling algorithm needs to be changed radically. Additionally, although scheduling between CPU and video traffic may be optimized to achieve target latencies and bandwidths, typically the efficiency of the memory channel drops significantly as a result (hence, the third leg of the stool suffers). Effects such as read-write turnaround, “bank busy” and chip-switching reduce the efficiency.
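A toy model can make the efficiency loss from port switching concrete. The sketch below is not any particular controller's algorithm: the two policies, the function names and the cycle counts (`burst_cycles`, `turnaround_cycles`) are hypothetical, and only read-write turnaround is modeled, not bank or chip-switching effects.

```python
from collections import deque


def round_robin(ports):
    """Issue one request per port in turn until every queue is empty."""
    queues = [deque(p) for p in ports]
    order = []
    while any(queues):
        for q in queues:
            if q:
                order.append(q.popleft())
    return order


def drain_in_turn(ports):
    """Empty each port's queue completely before moving to the next port."""
    return [req for p in ports for req in p]


def count_turnarounds(order):
    """Count read<->write direction changes in the issue order."""
    return sum(1 for a, b in zip(order, order[1:]) if a != b)


def channel_efficiency(order, burst_cycles=4, turnaround_cycles=2):
    """Data cycles divided by data cycles plus turnaround penalty (assumed values)."""
    data = len(order) * burst_cycles
    overhead = count_turnarounds(order) * turnaround_cycles
    return data / (data + overhead)


# Port 0 issues only reads, port 1 only writes ("R"/"W" are placeholders).
ports = [["R"] * 4, ["W"] * 4]
for policy in (round_robin, drain_in_turn):
    order = policy(ports)
    print(policy.__name__, "".join(order),
          count_turnarounds(order), round(channel_efficiency(order), 3))
```

In this example, strict alternation between the ports changes bus direction seven times, while draining one port at a time changes it once. The second policy keeps the channel busier but makes the first write wait behind all four reads, which is the latency-versus-efficiency tension described above.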
Neo1
9/14/2010 11:47 PM EDT
This concept is quite nice, but I have a lingering doubt: would the AXI be optimally utilized using this component, or does it only care about optimizing the requests and demands of the various masters?
James.MacHale
9/15/2010 4:37 PM EDT
Thanks, Neo. Since no two systems are exactly the same, the final choices on optimizing the three factors (the latency and bandwidth requirements of the initiators, and the efficiency) can be programmed by the architect. So there is a clear overall gain in efficiency, but there are still many tradeoffs for the architect to optimize.