Subject: copy performance
To: None <port-arm@netbsd.org>
From: David Laight <david@l8s.co.uk>
List: port-arm
Date: 03/20/2002 23:04:40
After failing to write a faster copyin/out, I've just been doing
some performance measurements on my 200MHz SA1100 system.

The limiting factor for reasonable length copies is clearly
the DRAM access.

The fun really starts with misaligned copies (16k for these tests).
The body of the loop is typically:
10:	ldrb    r4, [r0], #1
	strb    r4, [r1], #1
	subs    r2, r2, #1
	bne     10b
This took 660 for data that isn't already in the cache.

However adding an extra memory read:
10:	ldrb    r4, [r1]
	ldrb    r4, [r0], #1
	strb    r4, [r1], #1
	subs    r2, r2, #1
	bne     10b
puts the destination into the data cache and speeds the copy up to 470.

Reordering the instructions to fill the delay slots:
	ldrb    r4, [r0], #1
10:	subs    r2, r2, #1
	strb    r4, [r1], #1
	ldrneb  r4, [r1]
	ldrneb  r4, [r0], #1
	bne     10b
and it takes 340 - almost twice as fast as it was originally.

If you can read beyond the end of the buffer, then:
	ldrb    r4, [r0], #1
10:	subs    r2, r2, #1
	ldrb    r5, [r1,#20]
	strb    r4, [r1], #1
	ldrneb  r4, [r0], #1
	bne     10b
brings it down to around 280.

Note that the instruction reorder has little effect without the
extra read cycle.

For aligned copies using ldmia/stmia loops, forcing a read
doesn't help large copies.  However short copies (i.e. ones
where the source and destination stay in the cache) speed
up by a factor of 4 if the destination is in the data cache.
(The source was always cached during this test.)
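
For reference, the aligned-copy inner loop I'm describing is the
usual ldmia/stmia sequence, something like the following (a sketch
only - register choice and 16-byte block size are illustrative, not
the actual kernel memcpy):

```asm
10:	ldmia   r0!, {r4-r7}    @ load 16 bytes from source, advance r0
	stmia   r1!, {r4-r7}    @ store 16 bytes to destination, advance r1
	subs    r2, r2, #16     @ decrement byte count
	bne     10b
```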

For kernel copyin() of relatively short buffers the data will
probably be read almost immediately - so reading the destination
cache line in before doing the copy could well be beneficial.
This is probably even more true on ARM v5 - where you
can use the pld instruction to preload a cache line.
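
The preload instruction (pld, from ARMv5TE) doesn't cost a register
or fault on bad addresses, so the extra-read trick above might become
something like this (a sketch - the #32 preload offset is a guess and
would need tuning on real hardware):

```asm
10:	pld     [r1, #32]       @ hint: pull the destination line ahead
	ldmia   r0!, {r4-r7}    @ copy 16 bytes per iteration
	stmia   r1!, {r4-r7}
	subs    r2, r2, #16
	bne     10b
```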

	David

-- 
David Laight: david@l8s.co.uk