Design Article

Optimizing for cache performance, part 2

Jackie Brenner, Senior Member of Technical Staff, TI

7/10/2006 11:00 AM EDT

In part 1, we explained the rationale for using caches and showed how caches work. This week we explain how to minimize cache misses, giving some practical examples. As noted in part 1, cache misses can be divided into one of these three classes:
  • Compulsory misses: these misses occur during the first access to a cache line. This miss occurs because there was no prior opportunity for the data to be loaded or allocated in the cache. These are sometimes referred to as "first-reference misses."
  • Capacity misses: these misses occur when the cache does not have sufficient room to hold all the data during the execution of a program.
  • Conflict misses: these misses occur because multiple blocks of data or program code are competing for the same cache line.

This article looks at a variety of techniques to minimize each type of cache miss. Note that this article assumes an understanding of the C64x cache architecture, which was presented in part 1.

Optimizing system software for cache
Optimizing system software for cache efficiency is not much different from optimizing a flat memory model system for effective code and data usage. For example, we want to group functions that work frequently together close together in memory. We do this in a flat memory system so that the DMA can effectively be used to copy these functions from external memory to internal memory. We group functions together in a cache-based system to maximize both temporal and spatial locality.

Here are some guidelines for optimizing your code system. We touch on some of these briefly as we examine the benchmarks for specific code examples.

  • Allow each function to do as much processing on each piece of data as possible to increase data reuse.
  • Organize data and program to increase cache hits.
  • Partition algorithms to utilize program cache and data cache evenly.
  • Group functions that work together on the same data together in memory.

Data reuse
As previously mentioned, block filter algorithms work well in a cache environment. The block FIR filter code in Figure 1 is benchmarked with different filter lengths. As the filter length is increased, the amount of work done on each particular sample increases. The convolution window is larger so there is more data reuse as well. A data set of 1024 points of 16-bit data was used with filter lengths from 16 to 64 real taps (T).

Figure 1 shows the efficiency of the cache as a function of filter length. Notice that the efficiency of the cache increases as we increase the filter length.


1. Effect of filter length on cache efficiency.

Initially, the circular data buffer used by this algorithm is empty. This buffer, located in L1D, needs to be filled with data from the L2 memory. The filling process can be made faster by using a function that pre-fetches the data that is needed. This function takes advantage of the data cache miss pipelining mechanism discussed in part 1. The data pre-fetch function is similar to an initialization routine in a flat memory system that copies data from off-chip memory to on-chip memory. By utilizing this function, the cache effectiveness can be augmented even further, as shown in Figure 2.


2. Effect of data pre-fetch on cache efficiency.

Data/program organization
In the FIR filter example above, input data was accessed primarily in consecutive order within the circular buffer located in the L1D cache. Some functions do not naturally access consecutive values, and may actually have a large data reach requirement. One example of this is an interleaver which uses a lookup table to reshuffle data. The input data is read randomly and the output data is written sequentially. These random reads from the lookup table can cause thrashing in the L1D, as the input data may have a reach greater than 16K bytes. The output data does not need to be stored in L1D. Note that, when there is an L1D write miss, the data is not allocated in L1D but written directly into L2. Therefore, the lookup table can be reordered such that the input data gets read consecutively and then gets written out randomly into a larger address reach. The output data does not affect L1D, as this data goes into L2, which has a much larger reach of 1024K bytes. This can boost the cache efficiency from 60% to 85%. For details on this technique, see the TMS320C6000 DSP Cache User's Guide.

Avoiding program cache (L1P) conflict misses
In the previous section we saw how re-ordering data could reduce cache misses. Similarly, inappropriate memory placement of code can cause conflict misses during program execution. Most of these misses can be avoided by altering the order in which functions are linked in memory. In general, code that is accessed within a short time window should be allocated contiguously in memory. Consider the example in Figure 3.


3. An example of functions accessed within a short time window.

Assume that function_1 and function_2 overlap in L2, as shown in Figure 4. When function_1 is called the first time, it is allocated in L1P (1). A subsequent call to function_2 causes its code to be allocated in L1P (2). This also thrashes parts of function_1, since these lines overlap in L1P (3). When function_1 is called again in the next iteration, these lines have to be brought back into L1P, only to be thrashed again by function_2. The end results is a total of four misses per loop iteration. These misses can be completely avoided by placing the code of the two functions contiguously in memory (4).


4. An inappropriate program memory layout can cause misses.
(Click this image to view a larger version)




Please sign in to post comment

Navigate to related information

Datasheets.com Parts Search

185 million searchable parts
(please enter a part number or hit search to begin)

Feedback Form