Subject: port-i386/25263: much faster implementation of i386 str* functions
To: None <gnats-bugs@gnats.netbsd.org>
From: J.T. Conklin <jtc@orac.acorntoolworks.com>
List: netbsd-bugs
Date: 04/20/2004 10:13:01
>Number:         25263
>Category:       port-i386
>Synopsis:       much faster implementation of i386 str* functions
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    port-i386-maintainer
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Tue Apr 20 17:14:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator:     J.T. Conklin
>Release:        NetBSD 2.0C
>Organization:
J.T. Conklin
>Environment:
System: NetBSD orac 1.6T NetBSD 1.6T (GENERIC) #0: Fri Jun 6 22:31:11 PDT 2003 jtc@orac:/home/jtc/netbsd/NetBSD-current/obj/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:
For the last couple of years, my day job has been working on a L4-L7
network test and measurement device.  Profiles showed one of the hot
spots was the use of the str* family of functions in HTTP header
processing, particularly strchr().

Although the version of strchr() that came with the RTOS was slightly
worse than the one that I wrote for NetBSD/i386 many, many, years ago.
I thought to myself that as soon as I get some a chance, I'd revisit
NetBSD's str*() functions.  I recently got that chance.

The current str* implementations process strings a byte at a time with
some loop unrolling and a little effort towards the scheduling for the
i586 (I'm not sure how effective the latter was, as I was using a i486
at the time).  The new implementations I'm providing with this PR use
a more RISC-like approach, where strings are processed a machine word
at a time by using an arithmetic sequence that detects whether a byte
in a word is zero (or some other value).  

I didn't know about this technique when I wrote the original versions.
Thanks to such books as the "PowerPC Compiler Writers Guide" and Henry
S. Warren Jr.'s "Hacker's Delight" it is no longer secret.

With this PR, I provide memchr, strcat, strchr/index, strcmp, strcpy,
strlen, and strrchr/rindex.  On my Athlon MP machine, the new
implementations range from about 2.5 to 3.5 times faster than the
current ones (even memchr, which is essentially a single x86 insn).
I've also tested them on a PentiumPro and a Pentium VI with similar
results.  Unfortunately, I have not been able to test them machines
older than the PPro.  However, I have no reason to believe that they
won't be much better than before.

There is a tiny bit more overhead as the new functions must ensure the
string(s) are aligned, etc. but my tests show that the faster inner
loop overcomes this quickly.  In my plots of string length over clock
cycles, the two lines intersect at around 3 bytes.  So for very small
strings, the cost is negligible; for longer strings it's a clear win.

I see that fvdl has adapted my original i386 code for the use in
NetBSD/amd64.  This new code may be even more beneficial in that
context, as it will be able to process 8 bytes per loop instead of 4.
The benefit may be in the 6-7x range.  I don't have an AMD64 machine
of my own (yet), so someone else may wish to experiment.

I've enclosed the changes as complete files.  Diffs were larger and
more confusing.

I've also made one small change to the the way that index/strchr (and
rindex/strrchr) are built.  In -current, strchr.S defines STRCHR and
includes index.S.  In the new version index.S defines INDEX and
includes strchr.S.  This is so strchr.S can be copied from
libc/arch/string/strchr.S to sys/lib/libkern/arch/i386/strchr.S
without modification.

	--jtc

>How-To-Repeat:
	
>Fix:
# This is a shell archive.  Save it in a file, remove anything before
# this line, and then unpack it by entering "sh file".  Note, it may
# create directories; files and directories will be owned by you and
# have default permissions.
#
# This archive contains:
#
#	index.S
#	memchr.S
#	rindex.S
#	strcat.S
#	strchr.S
#	strcmp.S
#	strlen.S
#	strrchr.S
#
echo x - index.S
sed 's/^X//' >index.S << 'END-of-index.S'
X/*	$NetBSD: $	*/
X
X#define INDEX
X#include "strchr.S"
END-of-index.S
echo x - memchr.S
sed 's/^X//' >memchr.S << 'END-of-memchr.S'
X/*
X * Written by J.T. Conklin <jtc@acorntoolworks.com>
X * Public domain.
X */
X
X#include <machine/asm.h>
X	
X#if defined(LIBC_SCCS)
X	RCSID("$NetBSD: $")
X#endif
X
XENTRY(memchr)
X	pushl	%esi
X	movl	 8(%esp),%eax
X	movl	12(%esp),%ecx
X	movl	16(%esp),%esi
X
X	/*
X	 * Align to word boundry
X	 * Consider unrolling loop?
X	 */
X	testl	%esi,%esi			/* nbytes == 0? */
X	je	.Lzero
X.Lalign:		
X	testb	$3,%al
X	je	.Lword_aligned
X	cmpb	(%eax),%cl
X	je	.Ldone
X	incl	%eax
X	decl	%esi
X	jnz	.Lalign
X	
X.Lword_aligned:
X	/* copy char to all bytes in word */
X	movb	%cl,%ch
X	movl	%ecx,%edx
X	sall	$16,%ecx
X	orl	%edx,%ecx
X
X	_ALIGN_TEXT	
X.Lloop:
X	cmpl	$3,%esi				/* nbytes > 4 */
X	jbe	.Lbyte
X	movl	(%eax),%edx
X	addl	$4,%eax
X	xorl	%ecx,%edx
X	subl	$4,%esi
X	subl	$0x01010101,%edx
X	testl	$0x80808080,%edx
X	je	.Lloop
X
X	/*
X	 * In rare cases, the above loop may exit prematurely. We must
X	 * return to the loop if none of the bytes in the word are
X	 * equal to ch.
X	 */
X
X	/*
X	 * High load-use latency on the Athlon leads to significant
X	 * stalls, so we preload the next char as soon as possible
X	 * instead of using cmp mem8, reg8.
X	 *
X	 * Alignment here avoids a stall on the Athlon, even though
X	 * it's not a branch target.
X	 */
X	_ALIGN_TEXT
X	cmpb	-4(%eax),%cl			/* 1st byte == ch? */
X	movb	-3(%eax),%dl
X	jne	1f
X	subl	$4,%eax
X	jmp	.Ldone
X
X	_ALIGN_TEXT
X1:	cmpb	%dl,%cl				/* 2nd byte == ch? */
X	movb	-2(%eax),%dl
X	jne	1f
X	subl	$3,%eax	
X	jmp	.Ldone
X
X	_ALIGN_TEXT
X1:	cmpb	%dl,%cl				/* 3rd byte == ch? */
X	movb	-1(%eax),%dl
X	jne	1f
X	subl	$2,%eax
X	jmp	.Ldone
X
X	_ALIGN_TEXT	
X1:	cmpb	%dl,%cl				/* 4th byte == ch? */
X	jne	.Lloop
X	decl	%eax
X	jmp	.Ldone
X	
X.Lbyte:		
X	testl	%esi,%esi
X	je	.Lzero
X.Lbyte_loop:
X	cmpb	(%eax),%cl
X	je	.Ldone
X	incl	%eax
X	decl	%esi
X	jnz	.Lbyte_loop
X	
X.Lzero:
X	xorl	%eax,%eax
X	
X.Ldone:
X	popl	%esi
X	ret
END-of-memchr.S
echo x - rindex.S
sed 's/^X//' >rindex.S << 'END-of-rindex.S'
X/*	$NetBSD: $	*/
X
X#define RINDEX
X#include "strrchr.S"
END-of-rindex.S
echo x - strcat.S
sed 's/^X//' >strcat.S << 'END-of-strcat.S'
X/*
X * Written by J.T. Conklin <jtc@acorntoolworks.com>
X * Public domain.
X */
X
X#include <machine/asm.h>
X
X#if defined(LIBC_SCCS)
X	RCSID("$NetBSD: $")
X#endif
X
XENTRY(strcat)
X	pushl	%ebx
X	movl	8(%esp),%ecx
X	movl	12(%esp),%eax
X
X	/*
X	 * Align destination to word boundary.
X	 * Consider unrolling loop?
X	 */
X.Lscan:	
X.Lscan_align:
X	testb	$3,%cl
X	je	.Lscan_aligned
X	cmpb	$0,(%ecx)
X	je	.Lcopy
X	incl	%ecx
X	jmp	.Lscan_align
X
X	_ALIGN_TEXT
X.Lscan_aligned:
X.Lscan_loop:
X	movl	(%ecx),%ebx
X	addl	$4,%ecx	
X	leal	-0x01010101(%ebx),%edx
X	testl	$0x80808080,%edx
X	je	.Lscan_loop
X
X	/*
X	 * In rare cases, the above loop may exit prematurely. We must
X	 * return to the loop if none of the bytes in the word equal 0.
X	 */
X
X	/*
X	 * The optimal code for determining whether each byte is zero
X	 * differs by processor.  This space-optimized code should be
X	 * acceptable on all, especially since we don't expect it to
X	 * be run frequently,
X	 */
X
X	testb	%bl,%bl					/* 1st byte == 0? */
X	jne	1f
X	subl	$4,%ecx
X	jmp	.Lcopy
X
X1:	testb	%bh,%bh					/* 2nd byte == 0? */
X	jne	1f
X	subl	$3,%ecx
X	jmp	.Lcopy
X
X1:	shrl	$16,%ebx
X	testb	%bl,%bl					/* 3rd byte == 0? */
X	jne	1f
X	subl	$2,%ecx
X	jmp	.Lcopy
X
X1:	testb	%bh,%bh					/* 4th byte == 0? */
X	jne	.Lscan_loop
X	subl	$1,%ecx
X
X	/*
X	 * Align source to a word boundary.
X	 * Consider unrolling loop?
X	 */
X.Lcopy:
X.Lcopy_align:	
X	testl	$3,%eax
X	je	.Lcopy_aligned
X	movb	(%eax),%bl
X	incl	%eax
X	movb	%bl,(%ecx)
X	incl	%ecx
X	testb	%bl,%bl
X	jne	.Lcopy_align
X	jmp	.Ldone
X	
X	_ALIGN_TEXT
X.Lcopy_loop:
X	movl	%ebx,(%ecx)
X	addl	$4,%ecx
X.Lcopy_aligned:
X	movl	(%eax),%ebx
X	addl	$4,%eax
X	leal	-0x01010101(%ebx),%edx
X	testl	$0x80808080,%edx
X	je	.Lcopy_loop
X	
X	/*
X	 * In rare cases, the above loop may exit prematurely. We must
X	 * return to the loop if none of the bytes in the word equal 0.
X	 */
X
X	movb	%bl,(%ecx)
X	incl	%ecx
X	testb	%bl,%bl
X	je	.Ldone
X
X	movb	%bh,(%ecx)
X	incl	%ecx
X	testb	%bh,%bh
X	je	.Ldone
X
X	shrl	$16,%ebx
X	movb	%bl,(%ecx)
X	incl	%ecx
X	testb	%bl,%bl
X	je	.Ldone
X
X	movb	%bh,(%ecx)
X	incl	%ecx
X	testb	%bh,%bh
X	jne	.Lcopy_aligned
X
X.Ldone:
X	movl	8(%esp),%eax
X	popl	%ebx
X	ret
END-of-strcat.S
echo x - strchr.S
sed 's/^X//' >strchr.S << 'END-of-strchr.S'
X/*
X * Written by J.T. Conklin <jtc@acorntoolworks.com>
X * Public domain.
X */
X
X#include <machine/asm.h>
X	
X#if defined(LIBC_SCCS)
X	RCSID("$NetBSD: $")
X#endif
X
X#ifdef INDEX
XENTRY(index)
X#else
XENTRY(strchr)
X#endif
X	pushl	%esi
X	pushl	%ebx
X	movl	12(%esp),%eax
X	movl	16(%esp),%ecx
X
X	/*
X	 * Align to word boundary.
X	 * Consider unrolling loop? 
X         */
X.Lalign:
X	testb	$3,%al
X	je	.Lword_aligned
X	movb	(%eax),%bl
X	cmpb	%cl,%bl
X	je	.Ldone
X	testb	%bl,%bl
X	je	.Lzero
X	incl	%eax
X	jmp	.Lalign
X	
X.Lword_aligned:
X	/* copy char to all bytes in word */
X	movb	%cl,%ch
X	movl	%ecx,%edx
X	sall	$16,%ecx
X	orl	%edx,%ecx
X	
X	/* Check whether any byte in the word is equal to ch or 0. */
X	_ALIGN_TEXT
X.Lloop:
X	movl	(%eax),%ebx
X	addl	$4,%eax
X	movl	%ebx,%esi
X	leal	-0x01010101(%ebx),%edx
X	xorl	%ecx,%esi
X	subl	$0x01010101,%esi
X	orl	%esi,%edx
X	testl	$0x80808080,%edx
X	je	.Lloop
X
X	/*
X	 * In rare cases, the above loop may exit prematurely. We must
X	 * return to the loop if none of the bytes in the word match
X	 * ch or are equal to 0.
X	 */
X
X	/*
X	 * Alignment here avoids a stall on the Athlon, even though
X	 * it's not a branch target.
X	 */
X	
X	_ALIGN_TEXT
X	cmpb	%cl,%bl				/* 1st byte == ch? */
X	jne	1f
X	subl	$4,%eax
X	jmp	.Ldone
X1:	testb	%bl,%bl				/* 1st byte == 0? */
X	je	.Lzero
X
X	cmpb	%cl,%bh				/* 2nd byte == ch? */
X	jne	1f
X	subl	$3,%eax
X	jmp	.Ldone
X1:	testb	%bh,%bh				/* 2nd byte == 0? */
X	je	.Lzero
X
X	shrl	$16,%ebx
X	cmpb	%cl,%bl				/* 3rd byte == ch? */
X	jne	1f
X	subl	$2,%eax
X	jmp	.Ldone
X1:	testb	%bl,%bl				/* 3rd byte == 0? */
X	je	.Lzero
X
X	cmpb	%cl,%bh				/* 4th byte == ch? */
X	jne	1f
X	decl	%eax
X	jmp	.Ldone
X1:	testb	%bh,%bh				/* 4th byte == 0? */
X	jne	.Lloop
X						
X.Lzero:
X	/* If a ch wasn't found, return 0. */
X	xorl	%eax,%eax
X
X.Ldone:
X	popl	%ebx
X	popl	%esi
X	ret
END-of-strchr.S
echo x - strcmp.S
sed 's/^X//' >strcmp.S << 'END-of-strcmp.S'
X/*
X * Written by J.T. Conklin <jtc@acorntoolworks.com>
X * Public domain.
X */
X
X#include <machine/asm.h>
X
X#if defined(LIBC_SCCS)
X	RCSID("$NetBSD: $")
X#endif
X
XENTRY(strcmp)
X	pushl	%esi
X	pushl	%ebx
X	movl	12(%esp),%ebx
X	movl	16(%esp),%esi
X
X	/*
X	 * Align s1 to word boundary.
X	 * Consider unrolling loop?
X	 */
X.Ls1align:
X	testb	$3,%bl
X	je	.Ls1aligned
X	movb	(%ebx),%al
X	incl	%ebx
X	movb	(%esi),%dl
X	incl	%esi
X	testb	%al,%al
X	je	.Ldone
X	cmpb	%al,%dl
X	je	.Ls1align
X	jmp	.Ldone
X
X	/*
X	 * Check whether s2 is aligned to a word boundry.  If it is, we 
X	 * can compare by words.  Otherwise we have to compare by bytes.
X	 */
X.Ls1aligned:
X	testl	$3,%esi
X	jne	.Lbyte_loop
X
X	subl	$4,%ebx
X	subl	$4,%esi
X
X	_ALIGN_TEXT
X.Lword_loop:
X	movl	4(%ebx),%eax
X	addl	$4,%ebx
X	movl	4(%esi),%edx
X	addl	$4,%esi
X	cmpl	%eax,%edx
X	jne	.Lbyte_loop
X	subl	$0x01010101,%edx
X	notl	%eax
X	andl	%eax,%edx
X	testl	$0x80808080,%edx
X	je	.Lword_loop
X
X	_ALIGN_TEXT
X.Lbyte_loop:
X	movb	(%ebx),%al
X	incl	%ebx
X	movb	(%esi),%dl
X	incl	%esi
X	testb	%al,%al
X	je	.Ldone
X	cmpb	%al,%dl
X	je	.Lbyte_loop
X
X.Ldone:
X	movzbl	%al,%eax
X	movzbl	%dl,%edx
X	subl	%edx,%eax
X	popl	%ebx
X	popl	%esi
X	ret
END-of-strcmp.S
echo x - strlen.S
sed 's/^X//' >strlen.S << 'END-of-strlen.S'
X/*
X * Written by J.T. Conklin <jtc@acorntoolworks.com>
X * Public domain.
X */
X
X#include <machine/asm.h>
X
X#if defined(LIBC_SCCS)
X	RCSID("$NetBSD: $")
X#endif
X
XENTRY(strlen)
X	movl	4(%esp),%eax
X
X.Lalign:
X	/* Consider unrolling loop? */
X	testb	$3,%al
X	je	.Lword_aligned
X	cmpb	$0,(%eax)
X	je	.Ldone
X	incl	%eax
X	jmp	.Lalign
X
X	/*
X	 * There are many well known branch-free sequences which are used
X	 * for determining whether a zero-byte is contained within a word.
X	 * These sequences are generally much more efficent than loading
X	 * and comparing each byte individually.
X	 *
X	 * The expression [1,2]:
X	 *
X	 * (1)	~(((x & 0x7f7f7f7f) + 0x7f7f7f7f) | (x | 0x7f7f7f7f))
X	 *
X	 * evaluates to a non-zero value if any of the bytes in the
X	 * original word is zero.
X	 *
X	 * It also has the useful property that bytes in the result word
X	 * that coorespond to non-zero bytes in the original word have
X	 * the value 0x00, while bytes cooresponding to zero bytes have
X	 * the value 0x80. This allows calculation of the first (and
X	 * last) occurance of a zero byte within the word (useful for C's
X	 * str* primitives) by counting the number of leading (or
X	 * trailing) zeros and dividing the result by 8.  On machines
X	 * without (or with slow) clz() / ctz() instructions, testing
X	 * each byte in the result word for zero is necessary.
X	 *
X	 * This typically takes 4 instructions (5 on machines without
X	 * "not-or") not including those needed to load the constant.
X	 *
X	 *
X	 * The expression:
X	 *
X	 * (2)	((x - 0x01010101) & ~x & 0x80808080)
X	 *
X	 * evaluates to a non-zero value if any of the bytes in the
X	 * original word is zero.
X	 *
X	 * On little endian machines, the first byte in the result word
X	 * that cooresponds to a zero byte in the original byte is 0x80,
X	 * so clz() can be used as above.  On big endian machines, and
X	 * little endian machines without (or with a slow) clz() insn,
X	 * testing each byte in the original for zero is necessary
X	 * 
X	 * This typically takes 3 instructions (4 on machines without
X	 * "and with complement") not including those needed to load
X	 * constants.
X	 *
X	 *
X	 * The expression:
X	 *
X	 * (3)	((x - 0x01010101) & 0x80808080)
X	 *
X	 * always evaluates to a non-zero value if any of the bytes in
X	 * the original word is zero.  However, in rare cases, it also
X	 * evaluates to a non-zero value when none of the bytes in the
X	 * original word is zero.	
X	 * 
X	 * To account for possible false positives, each byte of the
X	 * original word must be checked when the expression evaluates to
X	 * a non-zero value.  However, because it is simpler than those
X	 * presented above, code that uses it will be faster as long as
X	 * the rate of false positives is low.
X	 * 
X	 * This is likely, because the the false positive can only occur
X	 * if the most siginificant bit of a byte within the word is set.
X	 * The expression will never fail for typical 7-bit ASCII strings.
X	 *
X	 * This typically takes 2 instructions not including those needed
X	 * to load constants.
X	 *
X	 *
X	 * [1] Henry S. Warren Jr., "Hacker's Delight", Addison-Westley 2003
X	 *
X	 * [2] International Business Machines, "The PowerPC Compiler Writer's
X	 *     Guide", Warthman Associates, 1996
X	 */
X
X	_ALIGN_TEXT
X.Lword_aligned:
X.Lloop:
X	movl	(%eax),%ecx
X	addl	$4,%eax	
X	leal	-0x01010101(%ecx),%edx
X	testl	$0x80808080,%edx
X	je	.Lloop
X
X	/*
X	 * In rare cases, the above loop may exit prematurely. We must
X	 * return to the loop if none of the bytes in the word equal 0.
X	 */
X
X	/*
X	 * The optimal code for determining whether each byte is zero
X	 * differs by processor.  This space-optimized code should be
X	 * acceptable on all, especially since we don't expect it to
X	 * be run frequently,
X	 */
X
X	testb	%cl,%cl					/* 1st byte == 0? */
X	jne	1f
X	subl	$4,%eax
X	jmp	.Ldone
X
X1:	testb	%ch,%ch					/* 2nd byte == 0? */
X	jne	1f
X	subl	$3,%eax
X	jmp	.Ldone
X
X1:	shrl	$16,%ecx
X	testb	%cl,%cl					/* 3rd byte == 0? */
X	jne	1f
X	subl	$2,%eax
X	jmp	.Ldone
X
X1:	testb	%ch,%ch					/* 4th byte == 0? */
X	jne	.Lloop
X	decl	%eax
X
X.Ldone:
X	subl	4(%esp),%eax
X	ret
END-of-strlen.S
echo x - strrchr.S
sed 's/^X//' >strrchr.S << 'END-of-strrchr.S'
X/*
X * Written by J.T. Conklin <jtc@acorntoolworks.com>
X * Public domain.
X */
X
X#include <machine/asm.h>
X
X#if defined(LIBC_SCCS)
X	RCSID("$NetBSD: $")
X#endif
X
X#ifdef RINDEX
XENTRY(rindex)
X#else
XENTRY(strrchr)
X#endif
X	pushl	%esi
X	pushl	%edi
X	pushl	%ebx
X	movl	16(%esp),%edx
X	movl	20(%esp),%ecx
X
X	/* zero return value */
X	xorl	%eax,%eax
X
X	/*
X	 * Align to word boundary.
X	 * Consider unrolling loop?
X	 */
X.Lalign:
X	testb	$3,%dl
X	je	.Lword_aligned
X	movb	(%edx),%bl
X	cmpb	%cl,%bl
X	jne	1f
X	movl	%edx,%eax
X1:	testb	%bl,%bl
X	je	.Ldone
X	incl	%edx
X	jmp	.Lalign
X	
X.Lword_aligned:
X	/* copy char to all bytes in word */
X	movb	%cl,%ch
X	movl	%ecx,%edi
X	sall	$16,%ecx
X	orl	%edi,%ecx
X
X	/* Check whether any byte in the word is equal to ch or 0.  */
X	_ALIGN_TEXT
X.Lloop:
X	movl	(%edx),%ebx
X	addl	$4,%edx
X	movl	%ebx,%esi
X	leal	-0x01010101(%ebx),%edi
X	xorl	%ecx,%esi
X	subl	$0x01010101,%esi
X	orl	%esi,%edi
X	testl	$0x80808080,%edi
X	je	.Lloop
X
X	/*
X	 * In rare cases, the above loop may exit prematurely. We must
X	 * return to the loop if none of the bytes in the word match
X	 * ch or are equal to 0.
X	 */
X
X	_ALIGN_TEXT
X	cmpb	%cl,%bl				/* 1st byte == ch? */
X	jne	1f
X	leal	-4(%edx),%eax
X1:	testb	%bl,%bl				/* 1st byte == 0? */
X	je	.Ldone
X
X	cmpb	%cl,%bh				/* 2nd byte == ch? */
X	jne	1f
X	leal	-3(%edx),%eax
X1:	testb	%bh,%bh				/* 2nd byte == 0? */
X	je	.Ldone
X
X	shrl	$16,%ebx
X	cmpb	%cl,%bl				/* 3rd byte == ch? */
X	jne	1f
X	leal	-2(%edx),%eax
X1:	testb	%bl,%bl				/* 3rd byte == 0? */
X	je	.Ldone
X
X	cmpb	%cl,%bh				/* 4th byte == ch? */
X	jne	1f
X	leal	-1(%edx),%eax
X1:	testb	%bh,%bh				/* 4th byte == 0? */
X	jne	.Lloop
X						
X.Ldone:
X	popl	%ebx
X	popl	%edi
X	popl	%esi
X	ret
END-of-strrchr.S
exit
>Release-Note:
>Audit-Trail:
>Unformatted: