Subject: GCC optimization suggestion
To: None <port-arm32@netbsd.org>
From: None <kim@pvv.ntnu.no>
List: port-arm32
Date: 12/12/1998 02:08:16
I am currently writing some integer math code, somewhat like
MPEG, DSP or 3D geometry work. The compiled code has this form:
ldr r3, [r6, #-1016]
add r4, r4, r3
ldr r3, [r6, #-1024]
add r4, r4, r3
ldr r3, [r6, #4]
add r4, r4, r3
ldr r3, [r6, #-2044]
add r4, r4, r3
ldr r3, [r6, #-2048]
add r4, r4, r3
It loads a value, (waits for it,) adds it, and so on.
However, this is not the best way to do it on a StrongARM.
Accessing memory takes time, and the StrongARM pipelines its loads,
so time can be saved by executing another instruction while waiting
for the memory. If the code is changed thus:
ldr r3, [r6, #-1016]
ldr r2, [r6, #-1024]
add r4, r4, r3
ldr r3, [r6, #4]
add r4, r4, r2
ldr r2, [r6, #-2044]
add r4, r4, r3
ldr r3, [r6, #-2048]
add r4, r4, r2
add r4, r4, r3
This uses 2 registers for the memory accesses, and each add uses the
register that was loaded earlier. It is a sort of double buffering which puts
up to 2 instructions between a memory access and its subsequent use.
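The same idea can be sketched at the C level. This is only an illustration under assumptions: the names sum_naive and sum_pipelined and the flat int array are made up for the example and are not my actual code, and the compiler is free to reorder the loads again, so the C ordering does not guarantee the instruction ordering.

```c
#include <stddef.h>

/* Straightforward form: every add uses the value loaded just before it. */
int sum_naive(const int *p, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}

/* Double-buffered form: the next load is written before the add of the
   previously loaded value, alternating between two temporaries
   (a and b play the roles of r3 and r2 above). */
int sum_pipelined(const int *p, size_t n)
{
    int acc = 0;
    int a, b;
    size_t i;

    if (n == 0)
        return 0;

    a = p[0];                     /* first load, nothing to add yet */
    for (i = 1; i + 1 < n; i += 2) {
        b = p[i];                 /* load into the second temporary */
        acc += a;                 /* add the value loaded earlier */
        a = p[i + 1];             /* load into the first temporary */
        acc += b;
    }
    if (i < n) {                  /* one element left besides a */
        b = p[i];
        acc += a;
        acc += b;
    } else {                      /* only a is still pending */
        acc += a;
    }
    return acc;
}
```

Both functions return the same sum; only the order in which loads and adds are written differs.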
Unfortunately GCC does not do this optimization, but perhaps one of the
experts could make it do so? After all, GCC already handles similar
cases for other processors, such as delayed branches and multiplies.
Kim0