Subject: GCC optimization suggestion
To: None <port-arm32@netbsd.org>
From: None <kim@pvv.ntnu.no>
List: port-arm32
Date: 12/12/1998 02:08:16
I am currently writing some integer math code, somewhat like
MPEG, DSP, or 3D geometry work. The compiled code has this form:
	ldr	r3, [r6, #-1016]
	add	r4, r4, r3
	ldr	r3, [r6, #-1024]
	add	r4, r4, r3
	ldr	r3, [r6, #4]
	add	r4, r4, r3
	ldr	r3, [r6, #-2044]
	add	r4, r4, r3
	ldr	r3, [r6, #-2048]
	add	r4, r4, r3

It loads a value, (waits for it,) adds it, and so on.
However, this is not the best way to do it on a StrongARM.
Memory accesses take time, but the StrongARM pipelines them, so
time can be saved by executing another instruction while waiting
for the load to complete. The code can be changed thus:
	ldr	r3, [r6, #-1016]
	ldr	r2, [r6, #-1024]
	add	r4, r4, r3
	ldr	r3, [r6, #4]
	add	r4, r4, r2
	ldr	r2, [r6, #-2044]
	add	r4, r4, r3
	ldr	r3, [r6, #-2048]
	add	r4, r4, r2
	add	r4, r4, r3

This uses two registers for the loads, and each add uses the register
that was loaded earlier. It is a sort of double buffering, which puts two
instructions between most loads and their subsequent use.
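
As a rough source-level illustration of the same idea, here is a minimal C
sketch. The function names and the TAP constants are made up for this
example; the constants are just the byte offsets above divided by 4,
assuming 4-byte words. A compiler is free to reorder these statements, so
the second function only shows the intended ordering and does not guarantee
that the emitted assembly follows it.

```c
#include <stdint.h>

/* Hypothetical word offsets, assuming 4-byte words:
   byte offsets -1016, -1024, 4, -2044, -2048 divided by 4. */
enum { TAP0 = -254, TAP1 = -256, TAP2 = 1, TAP3 = -511, TAP4 = -512 };

/* Straightforward form: every add needs the value loaded just before it. */
int32_t sum_taps_naive(const int32_t *p, int32_t acc)
{
    acc += p[TAP0];
    acc += p[TAP1];
    acc += p[TAP2];
    acc += p[TAP3];
    acc += p[TAP4];
    return acc;
}

/* Double-buffered form: two temporaries alternate, so each add
   uses a value that was loaded one or two statements earlier. */
int32_t sum_taps_pipelined(const int32_t *p, int32_t acc)
{
    int32_t a, b;

    a = p[TAP0];
    b = p[TAP1];
    acc += a;
    a = p[TAP2];
    acc += b;
    b = p[TAP3];
    acc += a;
    a = p[TAP4];
    acc += b;
    acc += a;
    return acc;
}
```

Both functions return the same sum; only the order of loads and adds
differs. The pointer p must have at least 512 valid words before it and
2 words starting at it for the offsets to be in range.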

Unfortunately GCC does not do this optimization, but perhaps one of the
experts could make it do so? After all, GCC already handles similar
things for other processors, such as delayed branches and multiplies.

Kim0