Subject: copy performance
To: None <port-arm@netbsd.org>
From: David Laight <david@l8s.co.uk>
List: port-arm
Date: 03/20/2002 23:04:40
After failing to write a faster copyin/out, I've just been doing
some performance measurements on my 200MHz SA1100 system.
The limiting factor for reasonable length copies is clearly
the DRAM access.
The fun really starts with misaligned copies (16k for these tests).
The body of the loop is typically:
10: ldrb r4, [r0], #1
strb r4, [r1], #1
subs r2, r2, #1
bne 10b
This took 660 for data that won't be in the cache.
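In C terms, that inner loop is just a byte-at-a-time copy. A minimal sketch (the function name is mine, not from the original post):

```c
#include <stddef.h>

/* Byte-at-a-time copy, equivalent to the ldrb/strb/subs/bne loop:
 * load a byte from src, store it to dst, decrement the count,
 * branch back while nonzero. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```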
However, adding an extra memory read of the destination:
10: ldrb r4, [r1]
ldrb r4, [r0], #1
strb r4, [r1], #1
subs r2, r2, #1
bne 10b
pulls the destination line into the data cache and speeds the copy up to 470.
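The same destination-touch trick can be sketched in C. The volatile cast is my addition: without it a compiler is free to discard the dummy read, whereas the extra ldrb in the asm above is kept unconditionally.

```c
#include <stddef.h>

/* Byte copy with a dummy read of the destination, mirroring the extra
 * "ldrb r4, [r1]" above: the read pulls the destination's cache line
 * into the data cache, so the store that follows hits the cache
 * instead of stalling on a DRAM write. */
void copy_bytes_touch_dst(unsigned char *dst, const unsigned char *src,
                          size_t n)
{
    while (n--) {
        (void)*(volatile unsigned char *)dst; /* touch dst line */
        *dst++ = *src++;
    }
}
```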
Reordering the instructions to fill the load delay slots:
ldrb r4, [r0], #1
10: subs r2, r2, #1
strb r4, [r1], #1
ldrneb r4, [r1]
ldrneb r4, [r0], #1
bne 10b
and it takes 340 - almost twice as fast as it was originally.
If you can read beyond the end of the buffer, then:
ldrb r4, [r0], #1
10: subs r2, r2, #1
ldrb r5, [r1,#20]
strb r4, [r1], #1
ldrneb r4, [r0], #1
bne 10b
Brings it down to around 280.
Note that the instruction reorder has little effect without the
extra read cycle.
For aligned copies using ldmia/stmia loops, forcing a read
doesn't help large copies. However, short copies (i.e. ones
where the source and destination stay in the cache) speed
up by a factor of 4 if the destination is already in the data cache.
(The source was always cached during this test.)
For a kernel copyin() of relatively short buffers the data will
probably be read almost immediately - so reading the
cache line in before doing the copy could well be beneficial.
This is probably even more true on ARM v5, where you
can use the pld instruction to preload a cache line.
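In portable C the pld idea can be approximated with GCC's __builtin_prefetch, which emits pld where the target supports it and is a no-op elsewhere. The prefetch distance of 8 words (32 bytes, one SA-1100/ARMv5 cache line) is my assumption, not a measured value from the tests above:

```c
#include <stddef.h>

/* Aligned word copy that hints the destination line ahead of the
 * store stream.  __builtin_prefetch(addr, 1) requests the line be
 * fetched in anticipation of a write; it is only a hint and is
 * ignored on targets without a preload instruction. */
void copy_words_prefetch(unsigned int *dst, const unsigned int *src,
                         size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        __builtin_prefetch(&dst[i + 8], 1); /* ~one cache line ahead */
        dst[i] = src[i];
    }
}
```

Note that the prefetch only computes an address; it never dereferences past the end of the buffer, so unlike the read-beyond-the-end loop above it is safe on the last iterations.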
David
--
David Laight: david@l8s.co.uk