Subject: bcopy optimisations (?).
To: None <port-alpha@netbsd.org>
From: Simon Burge <simonb@netbsd.org>
List: port-alpha
Date: 04/07/2000 14:47:11
Folks,

I've had a play with the alpha bcopy.S in an attempt to try some
unrolled loops to see if it's any faster.  I tried loops using 4 longs
per loop (l32), 8 longs per loop (l64) and 8 longs per loop with a few
of the stq_u's mixed in a few instructions after the relevant ldq_u's
(l64i).  I've also included tests for the libc bcopy.S, a C version that
uses an unrolled loop of 4 longs (c-l32), and a version of the Linux
Alpha memcpy.c that is C with some embedded __asm__()s.  The Linux memcpy
was the only other implementation I could find on the 'Net with a short
web search.  Tests were done on a 500MHz AlphaPC164 with a single bank
of memory using the lmbench "bw_mem bcopy" test.
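For reference, a C loop unrolled to 4 longs per iteration looks
something like the sketch below.  This is a hypothetical reconstruction
in the spirit of the c-l32 variant, not the actual test source; it
assumes both pointers are 8-byte aligned and the length is a multiple
of 32 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch of a copy loop unrolled to four 8-byte longwords
 * (32 bytes) per iteration, roughly what the c-l32 variant tests.
 * Assumes dst/src are 8-byte aligned and len is a multiple of 32.
 */
static void
copy_unrolled32(uint64_t *dst, const uint64_t *src, size_t len)
{
	size_t n = len / 32;		/* 32 bytes == 4 longs per pass */

	while (n-- > 0) {
		dst[0] = src[0];
		dst[1] = src[1];
		dst[2] = src[2];
		dst[3] = src[3];
		dst += 4;
		src += 4;
	}
}
```

The alignment and size assumptions are what the aligned fast path in
bcopy.S has already established by the time its main loop runs.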

Here's some results (in MB/sec):

	size	libc	c-l32	l16	l32	l64	l64i	linux

	512B	1217.24	 883.80	 867.23	 870.04	 873.59	 868.83	1219.94
	1kB	1270.61	 893.86	 894.15	 896.80	 897.01	 897.11	1270.62
	2kB	1298.66	 906.33	 908.93	 905.19	 912.28	 905.63	1298.33
	4kB	1305.03	 902.91	 900.76	 890.87	 903.71	 891.57	1312.34
	8kB	1307.83	 889.89	 884.12	 884.03	 880.87	 889.16	1295.76
	16kB	 844.19	 836.74	 874.94	 145.72	 723.19	 791.71	 868.89
	32kB	 818.84	 768.98	 521.95	 808.43	 639.36	 607.29	 198.13
	64kB	 503.88	 306.25	 243.40	 227.19	 243.61	 487.26	 197.58
	128kB	 130.43	 147.41	 274.16	 227.36	 173.51	 136.76	 180.83
	256kB	 143.39	 133.74	 109.33	 110.42	 105.76	 119.94	 105.52
	512kB	  82.95	  79.67	  80.86	  75.81	  74.17	  79.37	  82.87
	1MB	  65.11	  63.70	  64.52	  61.05	  62.17	  65.64	  63.33
	8MB	  55.29	  55.96	  55.82	  55.82	  55.73	  55.78	  55.24

So it appears that with any sort of loop unrolling, performance drops
off pretty quickly, except at the largest copy sizes and a good result
at 128kB.

Note - I've never programmed in Alpha assembler before!  I've included
my l64 implementation below.  Some Alpha asm gurus might be able to
point out any glaring performance issues.  Hmm, it just occurred to
me that the code should be checking for > 64 bytes to copy and not a
multiple of 64 bytes, and then using bcopy_small_left to finish off...
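In C terms, that fix amounts to the control flow below: run the 64-byte
loop while at least 64 bytes remain, then hand the remainder to the
small-copy path.  This is only an illustrative sketch; the byte loop
stands in for the real bcopy_small_left tail code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the suggested fix: loop while >= 64 bytes remain (rather
 * than requiring the total to be a multiple of 64), then finish the
 * tail byte-by-byte, standing in for bcopy_small_left.
 */
static void
copy64_then_tail(unsigned char *dst, const unsigned char *src, size_t len)
{
	while (len >= 64) {		/* >= 64, not "multiple of 64" */
		memcpy(dst, src, 64);	/* stands in for the unrolled body */
		dst += 64;
		src += 64;
		len -= 64;
	}
	while (len-- > 0)		/* tail, a la bcopy_small_left */
		*dst++ = *src++;
}
```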

Simon.
--
Index: bcopy.S
===================================================================
RCS file: /cvsroot/basesrc/lib/libc/arch/alpha/string/bcopy.S,v
retrieving revision 1.3
diff -p -u -r1.3 bcopy.S
--- bcopy.S	1996/10/17 03:08:11	1.3
+++ bcopy.S	2000/04/07 01:10:38
@@ -104,6 +104,40 @@ bcopy_all_aligned:
 	bic	t0,7,t0
 	beq	t0,bcopy_samealign_lp_end
 
+	/* Check for multiple of 64 bytes */
+	and	SIZEREG,63,t12
+	bne	t12,bcopy_samealign_lp
+
+bcopy_samealign_lp64:
+	stq_u	t2,0(DSTREG)
+	addq	DSTREG,64,DSTREG
+
+	ldq_u	t3,8(SRCREG)
+	ldq_u	t6,16(SRCREG)
+	ldq_u	t7,24(SRCREG)
+	ldq_u	t8,32(SRCREG)
+	ldq_u	t9,40(SRCREG)
+	ldq_u	t10,48(SRCREG)
+	ldq_u	t11,56(SRCREG)
+	ldq_u	t2,64(SRCREG)
+
+	stq_u	t3,-56(DSTREG)
+	stq_u	t6,-48(DSTREG)
+	stq_u	t7,-40(DSTREG)
+	stq_u	t8,-32(DSTREG)
+	stq_u	t9,-24(DSTREG)
+	stq_u	t10,-16(DSTREG)
+	stq_u	t11,-8(DSTREG)
+
+	subq	t0,64,t0
+	addq	SRCREG,64,SRCREG
+	bne	t0,bcopy_samealign_lp64
+
+	/* If we're done, exit */
+	bne	SIZEREG,bcopy_small_left
+	stq_u	t2,0(DSTREG)
+	RET
+
 bcopy_samealign_lp:
 	stq_u	t2,0(DSTREG)
 	addq	DSTREG,8,DSTREG
-