Design Article
Common multicore programming problems: Part 4 - Memory, cache issues and consistency
Shameem Akhter and Jason Roberts, Intel Corp.
11/13/2007 12:24 AM EST
In recent decades, microprocessors have grown much faster in speed than in memory. A single microprocessor core can execute hundreds of operations in the time it takes to read or write a value in main memory.
Programs now are often limited by the memory bottleneck, not processor speed. Multi-core processors can exacerbate the problem unless care is taken to conserve memory bandwidth and avoid memory contention.
Bandwidth
To conserve bandwidth, pack data more tightly, or move it less
frequently between cores. Packing the data tighter is usually
straightforward, and benefits sequential execution as well. For
example, pack Boolean arrays into one Boolean value per bit, not one
value per byte.
Use the shortest integer type that can hold values in the required range. When declaring structures in C/C++, declare fields in order of descending size. This strategy tends to minimize the extra padding that the compiler must insert to maintain alignment requirements, as exemplified in Figure 7.15 below.
![]() |
| Figure 7.15. Order Fields by Decreasing Size to Reduce Padding |
Some compilers also support "#pragma pack" directives that pack structures even more tightly, possibly by removing all padding. Such very tight packing may be counterproductive, however, because it causes misaligned loads and stores that may be significantly slower than aligned loads and stores.
Working in the Cache
Moving data less frequently is a more subtle exercise than packing,
because mainstream programming languages do not have explicit commands
to move data between a core and memory. Data movement arises from the
way the cores read and write memory. There are two categories of
interactions to consider: those between cores and memory, and those
between cores.
Data movement between a core and memory also occurs in single-core processors, so minimizing data movement benefits sequential programs as well. There exist numerous techniques.
For example, a technique called cache-oblivious blocking recursively divides a problem into smaller and smaller subproblems. Eventually the subproblems become so small that they each fit in cache. The Fastest Fourier Transform in the West (Frigo 1997) uses this approach and indeed lives up to its name.
![]() |
| Figure 7.16. Cache-Unfriendly Sieve of Eratosthenes |
Another technique for reducing the cache footprint is to reorder steps in the code. Sometimes this is as simple as interchanging loops. Other times it requires more significant restructuring.
The Sieve of Eratosthenes is an elementary programming exercise that demonstrates such restructuring and its benefits. Figure 7.16 above presents the Sieve of Eratosthenes for enumerating prime numbers up to n.
This version has two nested loops: the outer loop finds primes, and the inner loop, inside function Strike, strikes out composite numbers. This version is unfriendly to cache, because the inner loop is over the full length of array composite, which might be much larger than what fits in cache.
Figure 7.17 below shows how the sieve can be restructured to be cache friendly. Instead of directly representing the conceptual sieve as one big array, it represents it as a small window into the conceptual sieve. The window size is approximately the square root of n bytes.
![]() |
| Figure 7.17. Cache-Friendly Sieve of Eratosthenes |
The restructuring requires that the original inner loop be stopped when it reaches the end of a window, and restarted when processing the next window. The array striker stores the indices of these suspended loops, and has an element for each prime up to the square root of n.
The data structures grow much more slowly than n, and so fit in a 106 byte cache even when n approaches values as large as 1011. Of course, allocating array composite to hold 1011 bytes is impractical on most machines. The later discussion of multi-threading the sieve describes how to reduce composite to the square root of n bytes instead of n bytes.
The restructuring introduces extra complexity and bookkeeping operations. But because processor speed so greatly outstrips memory speed, the extra bookkeeping pays off dramatically.
Figure 7.18 below shows this performance difference. On this log plot, the cache friendly code has a fairly straight performance plot, while the cache unfriendly version's running time steps up from one straight line to another when n reaches approximately 106.
![]() |
| Figure 7.18. Performance Difference between Figure 7.16 and Figure 7.17 |
The step is characteristic of algorithms that transition from
running in cache to running out of cache as the problem size increases.
The restructured version is five times faster than the original version
when n significantly exceeds the cache size, despite the extra
processor operations required by the restructuring.







