Subject: speeding up bzero
To: None <port-i386@netbsd.org>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 04/11/2003 21:43:00
The patch below speeds up bzero on both the P2 and older Athlons.
I suspect it is also a gain on the P3, and it is almost certainly
a win on the P4.

My Athlon 700 gains about 1.5% on aligned 8k calls.
For 20 byte aligned transfers the gain is 38%;
for 20 byte misaligned transfers it is 29%.

The changes are twofold:
1) avoid jumps in the aligned path
2) avoid stos for small counts
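
In rough C terms the patched flow looks like the sketch below.
This is an illustration only (the function name is my own, and the
real implementation is the assembly in the patch at the end):

	#include <stddef.h>
	#include <stdint.h>

	/* Sketch of the patched control flow; not the generated code. */
	static void
	bzero_sketch(void *dst, size_t len)
	{
		unsigned char *p = dst;
		uint32_t *w;
		size_t words;

		if (len >= 16) {
			/* change 2: a byte loop, not rep stosb, writes
			 * the few bytes up to the word boundary */
			while ((uintptr_t)p & 3) {
				*p++ = 0;
				len--;
			}
			/* change 1: an already aligned pointer falls
			 * straight through to here, taking no jumps */
			w = (uint32_t *)p;
			for (words = len >> 2; words != 0; words--)
				*w++ = 0;	/* rep stosl in the patch */
			p = (unsigned char *)w;
			len &= 3;
		}
		/* the 1-3 trailing bytes use the byte loop too; counts
		 * below 16 still go through rep stosb in the patch */
		while (len != 0) {
			*p++ = 0;
			len--;
		}
	}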

16 bytes is somewhere near the break-even point for byte transfers
(it depends on the alignment and trailing bytes).

I haven't done any tests to find out at what point 'rep stosb'
wins over the byte loop.
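
If anyone wants to measure that, and the 16 byte break-even above,
a throwaway userland harness along the lines below would do.  It is
my own sketch, not part of the patch, and it times whatever bzero()
libc provides rather than the kernel copy:

	#include <stdint.h>
	#include <stdio.h>
	#include <strings.h>

	#define ITERS	100000

	/* read the i386 cycle counter */
	static inline uint64_t
	rdtsc(void)
	{
		uint32_t lo, hi;

		__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32) | lo;
	}

	int
	main(void)
	{
		static unsigned char buf[64 + 4];
		unsigned int align, len;
		uint64_t t;
		int i;

		for (align = 0; align < 4; align++) {
			for (len = 4; len <= 32; len += 4) {
				t = rdtsc();
				for (i = 0; i < ITERS; i++)
					bzero(buf + align, len);
				t = rdtsc() - t;
				/* rough: includes the loop overhead */
				printf("align %u, %2u bytes: ~%u cycles\n",
				    align, len, (unsigned int)(t / ITERS));
			}
		}
		return 0;
	}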

	David

Index: bzero.S
===================================================================
RCS file: /cvsroot/src/sys/lib/libkern/arch/i386/bzero.S,v
retrieving revision 1.6
diff -u -p -r1.6 bzero.S
--- bzero.S	1998/02/22 08:14:57	1.6
+++ bzero.S	2003/04/11 20:34:21
@@ -12,34 +12,48 @@
 ENTRY(bzero)
 	pushl	%edi
 	movl	8(%esp),%edi
-	movl	12(%esp),%edx
+	movl	12(%esp),%ecx
 
 	cld				/* set fill direction forward */
 	xorl	%eax,%eax		/* set fill data to 0 */
 
 	/*
 	 * if the string is too short, it's really not worth the overhead
 	 * of aligning to word boundries, etc.  So we jump to a plain
 	 * unaligned set.
 	 */
-	cmpl	$16,%edx
+	cmpl	$16,%ecx
 	jb	L1
 
-	movl	%edi,%ecx		/* compute misalignment */
-	negl	%ecx
-	andl	$3,%ecx
-	subl	%ecx,%edx
-	rep				/* zero until word aligned */
-	stosb
-
-	movl	%edx,%ecx		/* zero by words */
+	movl	%edi,%edx		/* detect misalignment */
+	andl	$3,%edx
+	jnz	align
+aligned:
+	movl	%ecx,%edx		/* zero by words */
 	shrl	$2,%ecx
 	andl	$3,%edx
 	rep
 	stosl
+	jnz	do_remainder		/* flags from andl survive rep stosl */
+	popl	%edi
+	ret
 
-L1:	movl	%edx,%ecx		/* zero remainder by bytes */
-	rep
+align:
+	negl	%edx			/* compute byte count until... */
+	andl	$3,%edx			/* ...word aligned (1 to 3) */
+	subl	%edx,%ecx		/* remove from main count */
+do_remainder:
+1:	movb	%al,(%edi)		/* copying byte by byte is... */
+	incl	%edi			/* ...faster than rep stosb */
+	decl	%edx
+	jnz	1b
+	testl	%ecx,%ecx		/* %ecx is zero when doing the remainder */
+	jnz	aligned
+	popl	%edi
+	ret
+
+L1:	rep
 	stosb
 
 	popl	%edi
 	ret
-- 
David Laight: david@l8s.co.uk