port-arm32: Re: bcopy optimisation

Subject: Re: bcopy optimisation
To: None <port-arm32@NetBSD.ORG>
From: Olly Betts <olly@MANTIS.CO.UK>
List: port-arm32
Date: 07/09/1996 00:02:46

"Mark Brinicombe" writes:
>[fast bcopy needed]
>In addition to making it fast typically using the LDM and STM instructions
>consideration needs to be given to the sizes being copied. Logging statistics
>for the bcopy routine shows that it is regularly called for certain sizes
>of copy far more frequently than others.
>The most common sizes are 12, 8, 128, 6, 4, 16, 2 in that order.
>This may mean that the best performance will be gained if these sizes are
>spotted and specially coded.
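
As a rough illustration of "spotting" the common sizes, here is a minimal C
sketch. The names bcopy_by_size and COPY_FIXED are hypothetical and not part
of any existing kernel source; which sizes to special-case is taken from
Mark's list, and 128 is left to the general path. Whether this actually wins
would need measuring.

```c
#include <stddef.h>
#include <string.h>

/* Copy a compile-time-constant number of bytes through a temporary.
 * Reading everything before writing keeps the copy overlap-safe, and a
 * constant size may let the compiler expand the memcpy calls inline. */
#define COPY_FIXED(d, s, n)            \
    do {                               \
        unsigned char tmp_[n];         \
        memcpy(tmp_, (s), (n));        \
        memcpy((d), tmp_, (n));        \
    } while (0)

/* Hypothetical name; argument order follows bcopy (source first). */
static void bcopy_by_size(const void *src, void *dst, size_t len)
{
    switch (len) {
    case 2:  COPY_FIXED(dst, src, 2);  break;
    case 4:  COPY_FIXED(dst, src, 4);  break;
    case 6:  COPY_FIXED(dst, src, 6);  break;
    case 8:  COPY_FIXED(dst, src, 8);  break;
    case 12: COPY_FIXED(dst, src, 12); break;
    case 16: COPY_FIXED(dst, src, 16); break;
    default: memmove(dst, src, len);   break;  /* general path, includes 128 */
    }
}
```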

OK, here's a first attempt.  I've gone for the "source and destination 4-byte
aligned, size multiple of 4 bytes" case, which probably covers most of the
common ones Mark lists.  This doesn't handle overlapping blocks (i.e. it's
memcpy, not memmove).  Mark asked for an "overlapping memcpy" -- does this
mean memmove is actually required?
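
To make the memcpy/memmove distinction concrete, here is a minimal C sketch
of an overlap-tolerant byte copy. The name copy_overlap_safe is hypothetical;
the only idea it shows is choosing the copy direction from the relative
positions of the two blocks, which is what a memmove-style routine has to do
and what the code below does not.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy len bytes from src to dst, tolerating overlap. */
static void copy_overlap_safe(const unsigned char *src, unsigned char *dst,
                              size_t len)
{
    if ((uintptr_t)dst < (uintptr_t)src) {
        /* Destination starts below the source: copy upwards. */
        for (size_t i = 0; i < len; i++)
            dst[i] = src[i];
    } else if ((uintptr_t)dst > (uintptr_t)src) {
        /* Destination starts above the source: copy downwards, so each
         * source byte is read before it can be overwritten. */
        for (size_t i = len; i > 0; i--)
            dst[i - 1] = src[i - 1];
    }
    /* dst == src: nothing to do. */
}
```

A plain upwards copy in the second case would overwrite source bytes before
reading them whenever dst lies inside the source block.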

fast_memcpy
; In: R0 -> src, R1 -> dest, R2 = length
;Out: R0 preserved (R1,R2,R3,ip corrupted as APCS allows)
; Are src and dest word-aligned, and are we copying a multiple of 4 bytes?
 ORR     R3,R0,R1
 ORR     R3,R3,R2
 TST     R3,#3
 BNE     memcpy ; whatever is currently used as memcpy
 ;
; OK, we're ready to rock'n'roll...
; Use ip as R0 needs to be unchanged on exit
 MOV     ip,R0
|_alignedwordcpy|
|_alignedwordcpylp3|
 SUBS    R2,R2,#4
 LDRGE   R3,[ip],#4
 STRGE   R3,[R1],#4
; to unroll this loop, repeat these 3 instructions
 SUBGES  R2,R2,#4
 LDRGE   R3,[ip],#4
 STRGE   R3,[R1],#4
; loop while R2 > 0 (bytes still to copy)
 BGT     |_alignedwordcpylp3|
 MOVS    PC,R14
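
For anyone who doesn't read ARM assembler, roughly the same logic in C. This
is only a sketch: the name fast_copy_sketch is hypothetical, the fallback is
the standard memcpy rather than whatever the kernel currently uses, and it
assumes the buffers may legitimately be accessed as 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C rendering; argument order follows bcopy (source first). */
static void fast_copy_sketch(const void *src, void *dst, size_t len)
{
    /* OR the two addresses and the length together: if either of the
     * low two bits is set in any of them, take the general path. */
    if ((((uintptr_t)src | (uintptr_t)dst | (uintptr_t)len) & 3u) != 0) {
        memcpy(dst, src, len);
        return;
    }

    /* Everything is word-aligned: copy one 32-bit word per iteration. */
    const uint32_t *s = src;
    uint32_t *d = dst;
    for (size_t words = len / 4; words > 0; words--)
        *d++ = *s++;
}
```

The single OR-and-test is the point of the exercise: one comparison covers
both pointers and the length before committing to the word loop.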

I've tested this under RISC OS on an ARM610 Risc PC, and it is 25% faster than
the Shared C Library on a selection of small aligned blocks with sizes which
are multiples of 4.  I haven't had time to install RiscBSD yet :(

BTW, a quick play at unrolling suggested the code as I've given it is
a good trade-off.

Olly