Subject: Re: 25%+ improvement in in_cksum speed!
To: None <port-i386@netbsd.org>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 09/22/2002 21:37:09

On Wed, Sep 18, 2002 at 12:28:58AM +0100, David Laight wrote:
> > It would be interesting to know what PIII, P4 and athlon XP get.
>
> The P4 figures are interesting!
> The best is a rolled up C loop! - even then it is still only a
> quarter of the speed of a similar athlon.

I've managed to write an SSE2 version for the P4; this gives:

           32 bit C sum f807 took     4185 usecs 0.498891 nsec/byte
      32 bit C pair sum f807 took     8373 usecs 0.998139 nsec/byte
          sse2 test sum f807 took     2205 usecs 0.262856 nsec/byte
(I think this is a 1.8GHz P4 - thanks to Greg Oster for testing
this for me.)
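
The C loops being timed are not shown in this message. As a rough idea
of what a 32-bit C sum can look like, here is a minimal sketch. The
names cksum32 and fold_to_16 are made up for illustration, the length
is assumed to be a multiple of 4, and this is not the code that
produced the figures above. It returns the folded sum without the
final inversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a wide sum to 16 bits, wrapping carries round. */
static uint16_t fold_to_16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Hypothetical "32 bit C sum": add 32-bit words into a 64-bit
 * accumulator so no carries are lost, then fold at the end.
 */
static uint16_t cksum32(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;

	assert(len % 4 == 0);		/* assumption: whole 32-bit words only */
	for (size_t i = 0; i < len; i += 4) {
		uint32_t w;

		memcpy(&w, buf + i, sizeof w);	/* safe for any alignment */
		sum += w;
	}
	return fold_to_16(sum);
}
```

Because 2^16 is congruent to 1 modulo 0xffff, summing 32-bit words and
folding gives the same 16-bit result as summing 16-bit words directly.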

I suspect that minor instruction re-ordering will give additional
benefit (I'd start with an empty loop and add the instructions
one by one in different places to see which order is best!)
Unrolling (to 64-byte blocks) is also a probable winner.

If I get bored tomorrow I might include the routines in the actual
checksum code...

sse2_mask:				# keeps words 0-2 of each 64-bit half
	.word	0xffff,0xffff,0xffff,0
	.word	0xffff,0xffff,0xffff,0
ENTRY(sum_sse2)
	movl	4(%esp),%edx		# buffer, 16-byte aligned (movdqa)
	movl	8(%esp),%ecx		# length, non-zero multiple of 32
	pushl	%ebx
	pushl	%esi
	pushl	%edi

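	pxor	%xmm0,%xmm0		# 64-bit sums for the first 16 bytes
	pxor	%xmm2,%xmm2		# 64-bit sums for the second 16 bytes
	movdqu	sse2_mask,%xmm7
	xorl	%eax,%eax		# sum of the masked-out word 3s
	xorl	%ebx,%ebx		# sum of the masked-out word 7s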
1:
	movdqa	(%edx),%xmm1
	movdqa	16(%edx),%xmm3
	pextrw	$3,%xmm1,%esi		# top word of the low half
	pextrw	$7,%xmm1,%edi		# top word of the high half
	pand	%xmm7,%xmm1		# clear them: 48 bits left per half
	addl	%esi,%eax
	pextrw	$3,%xmm3,%esi
	addl	%edi,%ebx
	pextrw	$7,%xmm3,%edi
	paddq	%xmm1,%xmm0		# 48-bit values leave 16 bits of headroom
	pand	%xmm7,%xmm3
	addl	%esi,%eax
	addl	%edi,%ebx
	paddq	%xmm3,%xmm2
	addl	$32,%edx
	subl	$32,%ecx
	jnz	1b

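	paddq	%xmm2,%xmm0		# combine the two vector accumulators
	addl	%ebx,%eax

	pshufd	$0xee,%xmm0,%xmm1	# abcd -> abab
	paddq	%xmm1,%xmm0		# xx(ab+cd)
	movd	%xmm0,%ebx		# bits 0-31 of the 64-bit sum
	pextrw	$2,%xmm0,%esi		# bits 32-47
	pextrw	$3,%xmm0,%edi		# bits 48-63
	addl	%esi,%edi
	addl	%ebx,%eax
	adcl	%edi,%eax
	adcl	$0,%eax			# wrap the last carry; 32-bit sum in %eax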

	popl	%edi
	popl	%esi
	popl	%ebx
	ret
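
In case the masking trick is hard to follow in assembler, the C sketch
below models the same idea: the top 16-bit word of every 64-bit lane is
summed separately, and the remaining 48 bits are added into a 64-bit
accumulator that therefore has 16 bits of headroom for carries. The
names (sum_masked_lanes, sum_words16, fold16) are hypothetical, it
works on 16-byte chunks rather than 32, and it folds all the way to
16 bits for easy comparison, whereas the assembler returns a 32-bit
partial sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a wide sum to 16 bits, wrapping carries round. */
static uint16_t fold16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Plain 16-bit word sum, used only as a cross-check. */
static uint16_t sum_words16(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;

	assert(len % 2 == 0);
	for (size_t i = 0; i < len; i += 2) {
		uint16_t w;

		memcpy(&w, buf + i, sizeof w);
		sum += w;
	}
	return fold16(sum);
}

/* Model of the masked-lane scheme: one 16-byte chunk per iteration. */
static uint16_t sum_masked_lanes(const unsigned char *buf, size_t len)
{
	const uint64_t mask = 0x0000ffffffffffffULL;	/* like sse2_mask */
	uint64_t lane[2] = { 0, 0 };	/* the two 64-bit halves of %xmm0 */
	uint32_t top = 0;		/* plays the role of %eax and %ebx */

	assert(len != 0 && len % 16 == 0);
	for (size_t i = 0; i < len; i += 16) {
		uint64_t q[2];

		memcpy(q, buf + i, sizeof q);
		for (int j = 0; j < 2; j++) {
			top += (uint32_t)(q[j] >> 48);	/* like pextrw $3 / $7 */
			lane[j] += q[j] & mask;		/* like pand + paddq */
		}
	}
	return fold16(lane[0] + lane[1] + top);
}
```

The 16 bits of headroom mean each lane can absorb 65536 additions
before a carry could be lost, so a model like this (and, by the same
argument, the loop above at 32 bytes per pass) is only safe up to
about 2MB per call.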



	David

--
David Laight: david@l8s.co.uk