Design Article
Code techniques for processor pipeline optimization: Part 2
Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
7/29/2008 1:00 AM EDT
Fast Multiply Operations
In the architecture under discussion, there are two sets of
multiplication units, one in the Intel XScale microarchitecture and the
other in the Intel Wireless MMX instructions. These two sets of
multipliers support different levels of precision of data-processing
capability.
The XScale microarchitecture supports half-word and word multiplication with results of word and double-word width. Selecting the correct precision for the algorithm under implementation helps reduce the execution time; for example, SMULxy has a latency of one cycle whereas SMULL has a latency of two cycles.
Multiply instructions can cause pipeline stalls due to resource conflicts or result latencies. The following code segment incurs a stall of zero to three cycles depending on the values in registers r1, r2, r4, and r5 due to resource conflicts:
mul r0, r1, r2
mul r3, r4, r5 @ stalls 0-3 cycles
The second multiply stalls for the full three cycles if neither r1 nor r2 holds a trivial value and the S bit is set. Just as issue latency depends on the values of the operands, the result latency can vary between one and three cycles. In the following example, the mov instruction incurs the result penalty:
mul r0, r1, r2
mov r4, r0 @ stalls until the previous mul completes
However, an arithmetic operation that follows the multiply does not stall as long as it has no register dependency on the multiply. Multiply instructions should be separated from each other by the worst-case latency, especially when the data values are not known in advance.
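The effect of separating a multiply from the instruction that consumes its result can be sketched with a toy timing model. This is a deliberately simplified single-issue, in-order model, not a description of the actual XScale pipeline: the `Instr` type, the `count_stalls` helper, and the three-cycle worst-case result latency assigned to mul are all assumptions made for illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    dst: str
    srcs: Tuple[str, ...]
    result_latency: int  # assumed cycles from issue until dst can be read


def count_stalls(program):
    """Count stall cycles in a toy single-issue, in-order pipeline.

    An instruction issues only when all of its source registers are ready.
    """
    ready_at = {}  # register -> first cycle at which its value is readable
    cycle = 0
    stalls = 0
    for ins in program:
        earliest = max([ready_at.get(r, 0) for r in ins.srcs] + [cycle])
        stalls += earliest - cycle
        cycle = earliest
        ready_at[ins.dst] = cycle + ins.result_latency
        cycle += 1
    return stalls


# mov reads r0 immediately after the mul (assumed worst-case latency of 3).
dependent = [
    Instr("mul", "r0", ("r1", "r2"), 3),
    Instr("mov", "r4", ("r0",), 1),
]

# Two independent instructions fill the gap before r0 is consumed.
scheduled = [
    Instr("mul", "r0", ("r1", "r2"), 3),
    Instr("add", "r3", ("r3",), 1),
    Instr("sub", "r5", ("r5",), 1),
    Instr("mov", "r4", ("r0",), 1),
]

stalls_dependent = count_stalls(dependent)
stalls_scheduled = count_stalls(scheduled)
```

In this model the dependent sequence loses two cycles waiting for r0, while the scheduled sequence loses none because the independent add and sub occupy the cycles that would otherwise be stalls.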
ARM instructions can set conditional flags so that following instructions can execute conditionally based on the flags. A multiply instruction that sets the condition codes blocks the multiply and arithmetic pipeline. Blocking stalls any subsequent instructions. For instance, in the following example, the add instruction waits three to four cycles for the muls instruction to finish.
muls r0, r1, r2 @ mul that updates flags
add r3, r3, #1 @ stalls until the muls finishes
sub r4, r4, #1
sub r5, r5, #1
Thus, it is not efficient to use the multiplication operation to update the flags. The modified code is as follows:
mul r0, r1, r2
add r3, r3, #1
sub r4, r4, #1
sub r5, r5, #1
cmp r0, #0
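The rewrite relies on the compare against zero producing the same negative (N) and zero (Z) flags that the flag-setting multiply would have produced. A minimal Python sketch of that equivalence follows; the helper names `muls_flags` and `cmp_zero_flags` are hypothetical, and only N and Z are modeled.

```python
MASK32 = 0xFFFFFFFF


def muls_flags(a, b):
    """Model the 32-bit result and the N/Z flags of a flag-setting multiply."""
    result = (a * b) & MASK32  # keep the low 32 bits, as mul does
    return result, {"N": result >> 31, "Z": int(result == 0)}


def cmp_zero_flags(value):
    """Model the N/Z flags of comparing a 32-bit register against #0."""
    value &= MASK32
    return {"N": value >> 31, "Z": int(value == 0)}


# For each operand pair, the flags from the multiply match the flags from
# a plain multiply followed by a compare with zero.
samples = [(3, 4), (0, 123), (-2, 5), (0x10000, 0x10000)]
flags_match = all(
    muls_flags(a, b)[1] == cmp_zero_flags((a * b) & MASK32) for a, b in samples
)
```

One caveat on the design choice: cmp also writes the carry and overflow flags, so the substitution is safe only when the following code tests N or Z and does not depend on C or V surviving across the multiply.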
The issue latency of the WMUL and WMADD instructions is one cycle; the result and resource latencies are two cycles. The second WMUL instruction in the following example stalls for one cycle due to the two-cycle resource latency.
WMULUM wR0, wR1, wR2
WMULSL wR3, wR4, wR5 @ one-cycle stall
Hence, two WMUL instructions should be separated by one instruction. The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency.
wmulm wR0, wR1, wR2
waddhus wR1, wR0, wR2 @ stalls on wR0 (two-cycle result latency)
Thus, any instruction waiting on the result should be separated by two other instructions. However, if the latter instruction is another SIMD-multiplication instruction, then the stall is one cycle despite data dependency.
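The resource-latency rule for back-to-back SIMD multiplies can be illustrated with a small occupancy model. This is a sketch under stated assumptions, not a cycle-accurate simulator: the `issue_cycles` helper is hypothetical, it tracks only the multiplier being busy for two cycles, and it ignores register dependencies entirely.

```python
def issue_cycles(program, resource_latency=2):
    """Return (mnemonic, issue cycle) pairs for a toy in-order pipeline.

    program is a list of (mnemonic, uses_multiplier) pairs. An instruction
    that needs the multiplier waits until the previous multiply releases it.
    """
    cycle = 0
    multiplier_free_at = 0
    issued = []
    for mnemonic, uses_multiplier in program:
        if uses_multiplier:
            cycle = max(cycle, multiplier_free_at)
            multiplier_free_at = cycle + resource_latency
        issued.append((mnemonic, cycle))
        cycle += 1
    return issued


# Two multiplies in a row: the second waits one cycle for the multiplier.
back_to_back = issue_cycles([("wmulum", True), ("wmulsl", True)])

# One unrelated instruction in between hides the wait.
separated = issue_cycles(
    [("wmulum", True), ("waddhus", False), ("wmulsl", True)]
)
```

In both sequences the second multiply issues in cycle 2, but in the separated sequence cycle 1 does useful work instead of being lost to a stall.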
Fast Multiply and Accumulation
For DSP and multimedia applications, multiply and accumulate (MAC) is
the most commonly used operation. In addition to multipliers, Intel
Wireless MMX technology offers accumulation capabilities. In the SIMD
coprocessor, any of the registers can be used as an accumulator.
Performing MAC Operations on Registers in Intel XScale Core
A MAC operation can be performed with the TMIA (32-bit) and TMIAPH (16-bit) instructions. Both take two Intel XScale core registers as operands and accumulate the product into any of the coprocessor registers.
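As a rough illustration of what these two instructions compute, the following Python sketch models their arithmetic as we understand it: TMIA adds the signed 32-bit product of the two core registers to a 64-bit accumulator, and TMIAPH adds the two signed 16-bit products of the matching halfwords. The function names and the wraparound at 64 bits are assumptions of this sketch; the instruction set reference remains the authority on exact behavior.

```python
MASK64 = (1 << 64) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def tmia(acc, rm, rs):
    """Accumulate one signed 32x32-bit product into a 64-bit accumulator."""
    product = to_signed(rm, 32) * to_signed(rs, 32)
    return (acc + product) & MASK64


def tmiaph(acc, rm, rs):
    """Accumulate two signed 16x16-bit products (low and high halfwords)."""
    low = to_signed(rm, 16) * to_signed(rs, 16)
    high = to_signed(rm >> 16, 16) * to_signed(rs >> 16, 16)
    return (acc + low + high) & MASK64


# 3 * (-2) accumulated into zero, viewed as a signed 64-bit value.
acc_word = to_signed(tmia(0, 3, 0xFFFFFFFE), 64)

# Packed halfwords: 10 + (3 * 5) + (2 * 4).
acc_packed = tmiaph(10, 0x00020003, 0x00040005)
```

The packed form is what makes TMIAPH attractive for 16-bit dot products: each instruction consumes two sample pairs held in a single pair of core registers.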
The issue latency of the TMIA instruction is one cycle; the result and resource latency are two cycles. The second TMIA instruction in the following example stalls for one cycle due to the two-cycle resource latency.
tmia wR0, r2, r3
tmia wR1, r4, r5 @ stalls 1 cycle
The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency.
tmia wR0, r2, r3
waddhus wR1, wR0, wR2



