Cache vs. DMA: tradeoffs for programmers
David Katz and Rick Gentile
8/18/2003 9:56 AM EDT
Now that there are embedded media processors available that can handle both MCU and DSP tasks, C programmers who are very familiar with the MCU model of application development are transitioning into a new realm, where intelligent management of code and data flow can significantly improve system performance. Careful consideration needs to be given to the high-performance direct-memory access (DMA) capabilities of the media processor. Recognizing the tradeoffs between using cache and DMA in these applications can lead to a better understanding of programming for system optimization.
Today's media processors have hierarchical memory architectures that strive to balance several levels of memory with differing sizes and performance levels. Typically, the memory closest to the core processor (known as "Level 1," or "L1," memory) operates at the full clock rate and usually supports instruction execution in a single cycle.
A quick survey of the embedded media processor market reveals core processor speeds at 600 MHz and beyond. While this performance can open the door to many new applications, the maximum speed is only realized when code runs from internal L1 memory. Of course, the ideal embedded processor would have an unlimited amount of L1 memory, but this is not practical. Therefore, programmers must consider several alternatives to take advantage of the L1 memory that exists in the processor, while optimizing memory and data flows for their particular system. Let's examine some of these scenarios.
The first, and most straightforward, situation is when target application code fits entirely into L1 instruction memory. For this case, there are no special actions required, other than for the programmer to map the application code directly to this memory space. Fitting a whole application into a limited amount of L1 memory depends on compact code, which is why media processors that provide both MCU and DSP functionality must excel in code density at the architectural level.
In the second scenario, a caching mechanism is used to allow programmers access to larger, less expensive external memories. The cache serves as a way to automatically bring code into L1 instruction memory as it is needed. The key advantage of this process is that the programmer does not have to manage the movement of code into and out of the cache. This method works best when execution proceeds largely sequentially, so that each cache line is fully used before it is replaced. For nonlinear code, which jumps frequently between distant routines, cache lines may be replaced too often to allow any real performance improvement.
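The difference between linear and nonlinear code can be illustrated with a toy model of a direct-mapped instruction cache. This is only a sketch: the line size, the number of lines, the 4-byte instruction width and all function names are assumptions chosen for illustration, not the parameters of any particular processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32u   /* assumed cache-line size */
#define NUM_LINES  64u   /* assumed line count (2 KB cache) */

typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    unsigned hits, misses;
} icache_t;

static void icache_reset(icache_t *c)
{
    for (size_t i = 0; i < NUM_LINES; i++)
        c->valid[i] = 0;
    c->hits = c->misses = 0;
}

/* Model one instruction fetch at byte address addr; returns 1 on a hit. */
static int icache_fetch(icache_t *c, uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t idx  = line % NUM_LINES;   /* which cache line it maps to */
    uint32_t tag  = line / NUM_LINES;   /* which memory line is stored */

    if (c->valid[idx] && c->tag[idx] == tag) {
        c->hits++;
        return 1;
    }
    c->valid[idx] = 1;                  /* line fill replaces old contents */
    c->tag[idx] = tag;
    c->misses++;
    return 0;
}

/* Straight-line code: n consecutive 4-byte instructions from start. */
static void run_linear(icache_t *c, uint32_t start, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        icache_fetch(c, start + 4u * i);
}

/* Nonlinear code: alternate between two addresses that share a cache line. */
static void run_conflicting(icache_t *c, uint32_t a, uint32_t b, unsigned n)
{
    assert((a / LINE_BYTES) % NUM_LINES == (b / LINE_BYTES) % NUM_LINES);
    for (unsigned i = 0; i < n; i++)
        icache_fetch(c, (i & 1u) ? b : a);
}
```

In this model, 1,024 sequential fetches starting at address 0 miss only once per line (128 misses, 896 hits), while 1,024 fetches that alternate between address 0 and address LINE_BYTES * NUM_LINES miss every time, because each fetch evicts the line the next one needs.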
Most strict real-time programmers tend not to trust the cache to obtain the best system performance. Their argument is that if a set of instructions is not in cache when needed for execution, there will be a performance hit. Taking advantage of cache-locking mechanisms can offset this issue. Once the critical instructions are loaded into cache, the cache lines can be locked, and thus not replaced. This gives programmers the ability to keep what they need in cache and to let the caching mechanism manage less critical instructions.
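The effect of locking can be sketched with a small two-way set-associative model in which each line carries a lock bit and the replacement logic skips locked lines. The geometry, the round-robin replacement policy and the function names are assumptions for illustration; real processors expose locking through their own control registers and may lock whole ways rather than single lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u   /* assumed line size */
#define NUM_SETS   32u   /* assumed number of sets */
#define NUM_WAYS   2u    /* assumed associativity */

typedef struct {
    uint32_t tag;
    uint8_t  valid;
    uint8_t  locked;
} line_t;

typedef struct {
    line_t   way[NUM_SETS][NUM_WAYS];
    uint8_t  next_victim[NUM_SETS];   /* round-robin replacement pointer */
    unsigned hits, misses;
} lcache_t;

static void lcache_reset(lcache_t *c)
{
    memset(c, 0, sizeof *c);
}

/* Fetch addr; if lock is nonzero the line is locked once resident.
 * Returns 1 on a hit, 0 on a miss. */
static int lcache_fetch(lcache_t *c, uint32_t addr, int lock)
{
    assert(c != NULL);
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_SETS;
    uint32_t tag  = line / NUM_SETS;

    for (unsigned w = 0; w < NUM_WAYS; w++) {
        line_t *l = &c->way[set][w];
        if (l->valid && l->tag == tag) {
            if (lock)
                l->locked = 1;
            c->hits++;
            return 1;
        }
    }

    c->misses++;
    /* Choose a victim, skipping any locked way. */
    for (unsigned k = 0; k < NUM_WAYS; k++) {
        unsigned w = (c->next_victim[set] + k) % NUM_WAYS;
        line_t *l = &c->way[set][w];
        if (!l->locked) {
            l->valid  = 1;
            l->tag    = tag;
            l->locked = lock ? 1 : 0;
            c->next_victim[set] = (uint8_t)((w + 1) % NUM_WAYS);
            return 0;
        }
    }
    return 0;   /* every way locked: the fetch is served uncached */
}
```

If a critical routine is fetched with the lock flag set and two other routines that map to the same set are fetched afterward, the critical routine still hits on its next fetch. Without the lock, the third fetch evicts it and the next fetch of the critical routine misses.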
In a final scenario, code can be moved in and out of L1 memory using a DMA channel that is independent of the processor core. While the core is operating on one section of memory, the DMA is bringing in the next section to be executed. This scheme is commonly referred to as an overlay technique.
While overlaying code into L1 instruction memory via DMA provides more determinism than caching it, the tradeoff comes in the form of increased programmer involvement. In other words, the programmer needs to map out an overlay strategy and configure the DMA channels appropriately. Still, the performance payoff for a well-planned approach can be well worth the extra involvement.
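The scheduling behind an overlay scheme can be sketched as a two-slot "ping-pong" loop. The sketch below runs on a host and makes several assumptions: dma_start() and dma_wait() are hypothetical stand-ins for programming a memory DMA channel (the copy is performed inside dma_wait() to imitate a transfer that finishes in the background), and "executing" a section is reduced to summing its bytes so that the result can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_BYTES 256u   /* assumed size of one overlay section */

typedef struct {
    const uint8_t *src;
    uint8_t       *dst;
    size_t         len;
    int            busy;
} dma_chan_t;

/* Hypothetical API: queue a transfer from external memory to an L1 slot. */
static void dma_start(dma_chan_t *ch, uint8_t *dst, const uint8_t *src,
                      size_t len)
{
    assert(!ch->busy && len <= SLOT_BYTES);
    ch->dst = dst;
    ch->src = src;
    ch->len = len;
    ch->busy = 1;
}

/* Hypothetical API: block until the queued transfer has completed. */
static void dma_wait(dma_chan_t *ch)
{
    if (ch->busy) {
        memcpy(ch->dst, ch->src, ch->len);   /* host-side stand-in for DMA */
        ch->busy = 0;
    }
}

/* Stand-in for running the code held in an L1 slot. */
static uint32_t execute_slot(const uint8_t *slot, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += slot[i];
    return sum;
}

/* Run n_sections overlays stored back to back in external memory. */
static uint32_t run_overlays(const uint8_t *ext, size_t n_sections)
{
    static uint8_t l1_slot[2][SLOT_BYTES];   /* two L1 overlay slots */
    dma_chan_t ch = {0};
    uint32_t result = 0;

    if (n_sections == 0)
        return 0;

    dma_start(&ch, l1_slot[0], ext, SLOT_BYTES);      /* prime slot 0 */
    for (size_t i = 0; i < n_sections; i++) {
        dma_wait(&ch);                                /* section i resident */
        if (i + 1 < n_sections)                       /* prefetch section i+1 */
            dma_start(&ch, l1_slot[(i + 1) & 1u],
                      ext + (i + 1) * SLOT_BYTES, SLOT_BYTES);
        result += execute_slot(l1_slot[i & 1u], SLOT_BYTES);
    }
    return result;
}
```

The key design point is the ordering inside the loop: the transfer for section i+1 is started before section i is executed, and it always targets the slot the core is not using. On real hardware, the programmer would also have to fix the section boundaries at link time so that each overlay is built to run from its L1 slot address.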
Data memory management
Because there are often multiple data transfers taking place at any one time in a multimedia application, the bus structure must support both core and DMA accesses to all areas of internal and external memory.
To effectively use DMA in a multimedia system, there must be enough DMA channels to support the processor's peripheral set fully, with more than one pair of Memory DMA streams. This is an important point, because it recognizes that there are bound to be raw media streams incoming to external memory (via high-speed peripherals), while at the same time data blocks will be moving back and forth between external memory and L1 memory for core processing. What's more, DMA engines that allow direct data transfer between peripherals and external memory, rather than requiring a "stopover" in L1 memory, can save extra data passes in numerically intensive algorithms.
As data rates and performance demands increase, it becomes critical for designers to have "system performance tuning" controls at their disposal. For example, the DMA controller might be optimized to transfer a data word on every clock cycle. When there are multiple transfers ongoing in the same direction (e.g., all from internal memory to external memory), this is usually the most efficient way to operate the controller because it prevents idle time on the DMA bus.
But in cases involving multiple bidirectional video and audio streams, "traffic control" becomes mandatory in order to prevent one stream from usurping the bus entirely. For instance, if the DMA controller always granted the DMA bus to any peripheral that was ready to transfer a data word, overall throughput would be degraded when the bus is connected to a device such as an SDRAM. In situations where data transfers switch direction on nearly every cycle, the latency associated with turn-around time on the SDRAM bus will lower throughput significantly. As a result, DMA controllers that have a channel-programmable burst size hold a clear advantage over those with a fixed transfer size. Because each DMA channel can connect a peripheral to either internal or external memory, it is also important to be able to automatically service a peripheral that may issue an urgent request for the bus.
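A back-of-the-envelope model shows why burst size matters. Assume two streams moving in opposite directions, each transferring the same number of words, with the bus granted alternately to each stream for one burst at a time, and a fixed penalty each time the transfer direction reverses. The function names and the penalty value used in the note below are illustrative assumptions, not figures from an SDRAM data sheet.

```c
#include <assert.h>

/* Total bus cycles for two opposite-direction streams of words_per_stream
 * words each, interleaved in bursts of `burst` words, with `turnaround`
 * idle cycles charged at every change of direction. */
static unsigned long bus_cycles(unsigned long words_per_stream,
                                unsigned long burst,
                                unsigned long turnaround)
{
    assert(burst > 0);
    unsigned long bursts = (words_per_stream + burst - 1) / burst;
    if (bursts == 0)
        return 0;
    unsigned long switches = 2 * bursts - 1;   /* direction changes */
    return 2 * words_per_stream + switches * turnaround;
}

/* Percentage of bus cycles that actually carry data (rounded down). */
static unsigned long bus_efficiency_pct(unsigned long words_per_stream,
                                        unsigned long burst,
                                        unsigned long turnaround)
{
    unsigned long cycles = bus_cycles(words_per_stream, burst, turnaround);
    return cycles ? (200ul * words_per_stream) / cycles : 0;
}
```

With 1,024 words per stream and an assumed 4-cycle turnaround, granting the bus one word at a time costs 10,236 cycles (about 20% efficiency), while 16-word bursts cost 2,556 cycles (about 80%). The model ignores refresh, page misses and arbitration overhead, so it only indicates the trend.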
The flexibility of today's DMA controllers is a double-edged sword. When a large C/C++ application is ported between processors, the programmer is sometimes hesitant to integrate DMA functionality into already working code. This is where data cache can be very useful. Typically, the data cache can be used to bring in data to L1 memory for the fastest processing. The data cache is attractive because it acts like a mini-DMA, but with minimal interaction on the programmer's part.
Because of the nature of typical cache-line fills, data cache is most useful when the processor is operating on consecutive data locations in external memory. This is because the cache doesn't just store the immediate data currently being processed; instead, it pre-fetches data in a region contiguous to the current data. In other words, the cache mechanism assumes that there's a good chance that the current data word is part of a block of neighboring data about to be processed. For multimedia image, audio and video streams, this is a reasonable conjecture.
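The benefit of consecutive accesses can be estimated by counting line fills for two traversal orders of the same image. The sketch assumes a direct-mapped data cache with 32-byte lines and 128 lines (4 KB), an image stored row by row with one byte per pixel, and a hypothetical function name; none of these values come from a specific device.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 32u    /* assumed cache-line size */
#define NUM_LINES  128u   /* assumed line count (4 KB cache) */

/* Count data-cache misses when every pixel of a width x height image
 * (1 byte per pixel, stored row by row) is read once, either along rows
 * (column_major == 0) or down columns (column_major != 0). */
static unsigned count_misses(unsigned width, unsigned height, int column_major)
{
    assert(width > 0 && height > 0);
    uint32_t tag[NUM_LINES]   = {0};
    uint8_t  valid[NUM_LINES] = {0};
    unsigned misses = 0;
    unsigned outer = column_major ? width : height;
    unsigned inner = column_major ? height : width;

    for (unsigned o = 0; o < outer; o++) {
        for (unsigned i = 0; i < inner; i++) {
            unsigned row = column_major ? i : o;
            unsigned col = column_major ? o : i;
            uint32_t addr = (uint32_t)row * width + col;
            uint32_t line = addr / LINE_BYTES;
            uint32_t idx  = line % NUM_LINES;
            uint32_t t    = line / NUM_LINES;

            if (!valid[idx] || tag[idx] != t) {   /* miss: fill the line */
                valid[idx] = 1;
                tag[idx] = t;
                misses++;
            }
        }
    }
    return misses;
}
```

For a 256 x 256 image under these assumptions, reading along rows costs 2,048 misses (one per 32-byte line), whereas reading down columns costs 65,536 misses, one for every pixel, because each line is evicted before its neighboring bytes are used.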
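Because data buffers usually originate from external peripherals, operating with data cache is not always as easy as with instruction cache. The reason is that coherency must be managed manually in non-"snooping" caches, which do not notice when a DMA transfer changes memory behind them. For these caches, the cache lines that cover the data buffer must be invalidated before making any attempt to access the new data; otherwise the core may read stale values left in the cache from an earlier access.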
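The stale-data hazard and its remedy can be reproduced with a toy model of a non-snooping data cache. Everything here is an assumption made for illustration: the cache geometry, the memory size, and the function names cached_read(), dma_write() and dcache_invalidate_range(). A real processor provides its own invalidate instructions or library calls.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u     /* assumed line size */
#define NUM_LINES  8u      /* assumed line count */
#define MEM_BYTES  1024u   /* size of the modeled external memory */

static uint8_t ext_mem[MEM_BYTES];

static struct {
    uint32_t tag;
    uint8_t  valid;
    uint8_t  data[LINE_BYTES];
} dcache[NUM_LINES];

/* Core read through the data cache (fills a whole line on a miss). */
static uint8_t cached_read(uint32_t addr)
{
    assert(addr < MEM_BYTES);
    uint32_t line = addr / LINE_BYTES;
    uint32_t idx  = line % NUM_LINES;
    uint32_t tag  = line / NUM_LINES;

    if (!dcache[idx].valid || dcache[idx].tag != tag) {
        memcpy(dcache[idx].data, &ext_mem[line * LINE_BYTES], LINE_BYTES);
        dcache[idx].tag = tag;
        dcache[idx].valid = 1;
    }
    return dcache[idx].data[addr % LINE_BYTES];
}

/* Peripheral DMA writes straight to memory; the cache does not see it. */
static void dma_write(uint32_t addr, const uint8_t *src, uint32_t len)
{
    assert(addr + len <= MEM_BYTES);
    memcpy(&ext_mem[addr], src, len);
}

/* Discard any cached lines that overlap [addr, addr + len). */
static void dcache_invalidate_range(uint32_t addr, uint32_t len)
{
    if (len == 0)
        return;
    uint32_t first = addr / LINE_BYTES;
    uint32_t last  = (addr + len - 1) / LINE_BYTES;

    for (uint32_t line = first; line <= last; line++) {
        uint32_t idx = line % NUM_LINES;
        if (dcache[idx].valid && dcache[idx].tag == line / NUM_LINES)
            dcache[idx].valid = 0;
    }
}
```

In this model, if the core reads a buffer, a second dma_write() then refreshes that buffer, and the core reads it again, the second read still returns the old contents. Calling dcache_invalidate_range() over the buffer after the DMA completes and before the core touches it forces fresh line fills from memory.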
In short, there is no single answer as to whether cache or DMA should be the mechanism of choice for code and data movement in a given multimedia system. However, once developers are aware of the tradeoffs involved, they could well be lured into the "middle ground," the perfect optimization point for their system.
David Katz and Rick Gentile, Senior DSP Applications Engineers, Blackfin Applications Group, Analog Devices, Inc., Norwood, MA


