Design Article
Code techniques for processor pipeline optimization: Part 3
Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
7/29/2008 12:00 PM EDT
The optimizing control-oriented operation has three key components: branch overhead reduction, conditional execution, and addressing techniques.
Reduce Branch Cost
In a deeply pipelined device, branches can be expensive since, during a
branch, many stages of a pipeline can potentially be flushed, thus
causing stalls of five cycles. XScale has branch prediction logic and
branch target buffers to ensure that branch penalties are as small as
possible.
From the application developer's point of view, avoiding unnecessary branches and ensuring that the branches are predictable helps improve performance. The following set of examples demonstrates how to eliminate potential branches, eliminate unnecessary loop overhead, and make control flow efficient.
The branch prediction and branch target buffers can reduce branch-related penalties a great deal. The operating system and platform level software must ensure that the branch target buffer (BTB) is enabled.
The BTB has only 32 entries. The size of the branch target buffer limits the number of correctly predictable branches. Because the total number of branches executed in a program is relatively large compared to the size of the branch target buffer, it is often beneficial to minimize the number of branches in a program. Consider the following C code segment:
int
foo(int a) {
if (a > 10)
return 0;
else
return 1;
}
The code generated for the if...else portion of this code segment using branches is:
cmp r0,#10
ble
L1
mov
r0,#0
b
L2
L1:
mov
r0,#1
L2:
This code takes three cycles to execute the else statement and four cycles for the if statement assuming best-case conditions and no branch misprediction penalties. In the case of the XScale, a branch misprediction incurs a penalty of four cycles.
If the branch is mispredicted 50 percent of the time and if both the if statement and the else statement are equally likely to be taken, on an average the code above takes 5.5 cycles to execute.
Using the XScale to execute instructions conditionally, the code generated for the preceding if...else statement is:
cmp
r0,#10
movgt r0,#0
movle r0,#1
The preceding code segment would not incur any branch misprediction penalties and would take three cycles to execute assuming best-case conditions. Using conditional instructions helps to speed up execution significantly.
Use Conditional Instruction
Many embedded applications, such as parsing of a packet or looking for
a peak in a spectrum, are loop intensive. Branch density - the number
of instructions per branch - can be high, which may mean that the
number of instructions in a loop or number of instructions between two
branches is low. Each loop has management overhead - counting the loop
pointer and comparing for the exit condition - associated with it.
The XScale provides the ability to execute instructions conditionally based on a set of conditional flags. This conditional execution feature, combined with the ability of the instructions to modify the condition codes, makes a wide array of optimizations possible.



