Subject: bcopy optimisations (?).
To: None <port-alpha@netbsd.org>
From: Simon Burge <simonb@netbsd.org>
List: port-alpha
Date: 04/07/2000 14:47:11
Folks,
I've had a play with the alpha bcopy.S in an attempt to try some
unrolled loops to see if it's any faster. I tried loops using 2 longs
per loop (l16), 4 longs per loop (l32), 8 longs per loop (l64) and
8 longs per loop with a few of the stq_u's mixed in a few instructions
after the relevant ldq_u's (l64i). I've also included tests for the
libc bcopy.S, a C version that uses an unrolled loop of 4 longs
(c-l32), and a version of the Linux Alpha memcpy.c that is C with some
embedded __asm__()'s. The Linux memcpy was the only other
implementation I could find on the 'Net with a short web search.
Tests were done on a 500MHz AlphaPC164 with a single bank of memory
using the lmbench "bw_mem bcopy" test.
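For anyone who hasn't seen this style of loop, a minimal C sketch of a
4-longs-per-iteration copy in the spirit of the c-l32 variant would
look something like the following (the function name and the tail
handling are illustrative, not the actual test code):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of a c-l32-style copy: 4 longs (32 bytes) per iteration,
 * with a byte loop for the tail.  Assumes dst and src are 8-byte
 * aligned and the regions do not overlap.
 */
static void
copy_l32(void *dst, const void *src, size_t len)
{
	uint64_t *d = dst;
	const uint64_t *s = src;

	while (len >= 32) {		/* unrolled: 4 quads per trip */
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d[3] = s[3];
		d += 4;
		s += 4;
		len -= 32;
	}

	/* finish the sub-32-byte tail a byte at a time */
	{
		unsigned char *db = (unsigned char *)d;
		const unsigned char *sb = (const unsigned char *)s;
		while (len-- > 0)
			*db++ = *sb++;
	}
}
```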
Here's some results (in MB/sec):
 size     libc   c-l32     l16     l32     l64    l64i   linux
 512B  1217.24  883.80  867.23  870.04  873.59  868.83 1219.94
  1kB  1270.61  893.86  894.15  896.80  897.01  897.11 1270.62
  2kB  1298.66  906.33  908.93  905.19  912.28  905.63 1298.33
  4kB  1305.03  902.91  900.76  890.87  903.71  891.57 1312.34
  8kB  1307.83  889.89  884.12  884.03  880.87  889.16 1295.76
 16kB   844.19  836.74  874.94  145.72  723.19  791.71  868.89
 32kB   818.84  768.98  521.95  808.43  639.36  607.29  198.13
 64kB   503.88  306.25  243.40  227.19  243.61  487.26  197.58
128kB   130.43  147.41  274.16  227.36  173.51  136.76  180.83
256kB   143.39  133.74  109.33  110.42  105.76  119.94  105.52
512kB    82.95   79.67   80.86   75.81   74.17   79.37   82.87
  1MB    65.11   63.70   64.52   61.05   62.17   65.64   63.33
  8MB    55.29   55.96   55.82   55.82   55.73   55.78   55.24
So it appears that with any sort of loop unrolling, performance drops
off pretty quickly, except for the largest copy sizes and a good
result at 128kB.
Note - I've never programmed in Alpha assembler before! I've included
my l64 implementation below. Some Alpha asm gurus might be able to
point out any glaring performance issues. Hmm, it just occurred to
me that the code should be checking for > 64 bytes to copy and not a
multiple of 64 bytes, and then using bcopy_small_left to finish off...
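In C terms, that corrected control flow would look roughly like this
(a sketch only; copy_small here is a stand-in for the assembler's
bcopy_small_left path, and the names are mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for bcopy_small_left: finish whatever is left, bytewise. */
static void
copy_small(unsigned char *d, const unsigned char *s, size_t len)
{
	while (len-- > 0)
		*d++ = *s++;
}

/*
 * Take 64-byte (8-quad) chunks only while at least 64 bytes remain,
 * rather than requiring the total to be a multiple of 64, then hand
 * the remainder to the small-copy path.  Assumes 8-byte-aligned,
 * non-overlapping regions.
 */
static void
copy_l64(void *dst, const void *src, size_t len)
{
	uint64_t *d = dst;
	const uint64_t *s = src;

	while (len >= 64) {		/* unrolled: 8 quads per trip */
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d[3] = s[3];
		d[4] = s[4];
		d[5] = s[5];
		d[6] = s[6];
		d[7] = s[7];
		d += 8;
		s += 8;
		len -= 64;
	}
	copy_small((unsigned char *)d, (const unsigned char *)s, len);
}
```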
Simon.
--
Index: bcopy.S
===================================================================
RCS file: /cvsroot/basesrc/lib/libc/arch/alpha/string/bcopy.S,v
retrieving revision 1.3
diff -p -u -r1.3 bcopy.S
--- bcopy.S 1996/10/17 03:08:11 1.3
+++ bcopy.S 2000/04/07 01:10:38
@@ -104,6 +104,40 @@ bcopy_all_aligned:
bic t0,7,t0
beq t0,bcopy_samealign_lp_end
+ /* Check for multiple of 64 bytes */
+ and SIZEREG,63,t12
+ bne t12,bcopy_samealign_lp
+
+bcopy_samealign_lp64:
+ stq_u t2,0(DSTREG)
+ addq DSTREG,64,DSTREG
+
+ ldq_u t3,8(SRCREG)
+ ldq_u t6,16(SRCREG)
+ ldq_u t7,24(SRCREG)
+ ldq_u t8,32(SRCREG)
+ ldq_u t9,40(SRCREG)
+ ldq_u t10,48(SRCREG)
+ ldq_u t11,56(SRCREG)
+ ldq_u t2,64(SRCREG)
+
+ stq_u t3,-56(DSTREG)
+ stq_u t6,-48(DSTREG)
+ stq_u t7,-40(DSTREG)
+ stq_u t8,-32(DSTREG)
+ stq_u t9,-24(DSTREG)
+ stq_u t10,-16(DSTREG)
+ stq_u t11,-8(DSTREG)
+
+ subq t0,64,t0
+ addq SRCREG,64,SRCREG
+ bne t0,bcopy_samealign_lp64
+
+ /* If bytes remain, finish up; otherwise store the last quad and exit */
+ bne SIZEREG,bcopy_small_left
+ stq_u t2,0(DSTREG)
+ RET
+
bcopy_samealign_lp:
stq_u t2,0(DSTREG)
addq DSTREG,8,DSTREG
-