Subject: 25%+ improvement in in_cksum speed!
To: None <port-i386@netbsd.org, tech-perform@l8s.co.uk>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 09/17/2002 17:04:25
I read the pentium xxx performance guide the other week
(looking for something else) and it struck me that the IP
checksum routine ought to use some of the principles.
Looking at the code it doesn't (last optimised in 1996).

On my (old) athlon 700, the following loop is fastest,
at 0.270 ns/byte against 0.370 ns/byte for the standard i386 code:

1:      
	addl    (%edx),%eax
	adcl    4(%edx),%eax
	adcl    8(%edx),%eax 
	adcl    12(%edx),%eax
	adcl    16(%edx),%eax
	adcl    20(%edx),%eax
	adcl    24(%edx),%eax
	adcl    28(%edx),%eax
	adcl    $0,%eax
	addl    32(%edx),%ebx
	adcl    36(%edx),%ebx
	adcl    40(%edx),%ebx
	adcl    44(%edx),%ebx
	adcl    48(%edx),%ebx
	adcl    52(%edx),%ebx
	adcl    56(%edx),%ebx
	adcl    60(%edx),%ebx
	adcl    $0,%ebx
	leal    64(%edx),%edx
	subl    $64,%ecx
	jnz     1b
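
For readers who prefer C, the idea behind the loop (two independent
accumulators, each added with end-around carry, combined and folded at the
end) can be sketched roughly as below. This is not the code from sum.S: the
names cksum_pair, cksum_ref16, add_carry32 and fold16 are made up for
illustration, the sketch alternates one word per accumulator instead of
eight, and it assumes the length is a multiple of 8 bytes. Words are loaded
in native byte order, as the assembler loop does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry: roughly what addl/adcl ... adcl $0 achieves. */
static uint32_t add_carry32(uint32_t sum, uint32_t word)
{
    sum += word;
    return sum + (sum < word);  /* (sum < word) is 1 exactly when the add wrapped */
}

/* Fold a 32-bit ones-complement sum down to 16 bits. */
static uint16_t fold16(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Hypothetical name. Two accumulators, so consecutive adds do not all
 * wait on the same carry chain. len must be a multiple of 8. */
uint16_t cksum_pair(const unsigned char *buf, size_t len)
{
    uint32_t a = 0, b = 0, w0, w1;

    for (size_t i = 0; i < len; i += 8) {
        memcpy(&w0, buf + i, 4);        /* memcpy avoids alignment problems */
        memcpy(&w1, buf + i + 4, 4);
        a = add_carry32(a, w0);
        b = add_carry32(b, w1);
    }
    return fold16(add_carry32(a, b));   /* combine the two partial sums */
}

/* Reference for cross-checking: 16 bits at a time.
 * len must be even and below 128 KiB so the 32-bit sum cannot overflow. */
uint16_t cksum_ref16(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    uint16_t w;

    for (size_t i = 0; i < len; i += 2) {
        memcpy(&w, buf + i, 2);
        sum += w;
    }
    return fold16(sum);
}
```

Both functions should return the same value for the same buffer, which is
the property the "sum f807" column in a comparison run relies on.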

I wrote a test program to compare the functions. With 8k buffers
it gives:

          in_cksum.s sum f807 took     3125 usecs 0.372529 nsec/byte
           asm adc 1 sum f807 took    13754 usecs 1.639605 nsec/byte
          asm adc 1a sum f807 took     8952 usecs 1.067162 nsec/byte
          asm adc 1b sum f807 took     6982 usecs 0.832319 nsec/byte
          asm adc 1c sum f807 took     8994 usecs 1.072168 nsec/byte
          asm adc 1d sum f807 took     8951 usecs 1.067042 nsec/byte
           asm adc 2 sum f807 took     6541 usecs 0.779748 nsec/byte
           asm adc 4 sum f807 took     4496 usecs 0.535965 nsec/byte
          asm adc 4b sum f807 took     3794 usecs 0.452280 nsec/byte
          asm adc 8b sum f807 took     3782 usecs 0.450850 nsec/byte
          asm pair 2 sum f807 took     4997 usecs 0.595689 nsec/byte
          asm pair 4 sum f807 took     3461 usecs 0.412583 nsec/byte
          asm pair 8 sum f807 took     2827 usecs 0.337005 nsec/byte
         asm pair 16 sum f807 took     2460 usecs 0.293255 nsec/byte
        asm pair 16a sum f807 took     2311 usecs 0.275493 nsec/byte
         asm pair 32 sum f807 took     2278 usecs 0.271559 nsec/byte
          asm quad 8 sum f807 took     3020 usecs 0.360012 nsec/byte
            16 bit C sum f807 took    20865 usecs 2.487302 nsec/byte
            32 bit C sum f807 took     8978 usecs 1.070261 nsec/byte
       32 bit C pair sum f807 took    11632 usecs 1.386642 nsec/byte
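
The two timing columns are related by simple arithmetic. The figures are
consistent with 1024 passes over an 8192-byte buffer (3125 usecs * 1000 /
(8192 * 1024) = 0.372529); the pass count is inferred from the numbers, not
stated anywhere in this mail. A minimal sketch of that conversion, plus a
naive timing loop, follows. The names nsec_per_byte, time_usecs and cksum_fn
are hypothetical and this is not the code in sumtest.c.

```c
#include <stddef.h>
#include <time.h>

/* Convert an elapsed time in microseconds to nanoseconds per byte. */
double nsec_per_byte(double usecs, size_t buf_len, size_t passes)
{
    return usecs * 1000.0 / ((double)buf_len * (double)passes);
}

/* Hypothetical signature for a checksum routine under test. */
typedef unsigned (*cksum_fn)(const unsigned char *buf, size_t len);

/* Run fn over the buffer `passes` times and return elapsed CPU time in
 * microseconds. The volatile sink stops the compiler discarding the
 * results; clock() resolution is coarse, so use plenty of passes. */
double time_usecs(cksum_fn fn, const unsigned char *buf, size_t len,
                  size_t passes)
{
    volatile unsigned sink = 0;
    clock_t start = clock();

    for (size_t i = 0; i < passes; i++)
        sink = fn(buf, len);
    (void)sink;

    return (double)(clock() - start) * 1e6 / (double)CLOCKS_PER_SEC;
}
```

For example, nsec_per_byte(2278, 8192, 1024) gives about 0.271559, matching
the "asm pair 32" row.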

It would be interesting to know what a PIII, P4 and athlon XP get.
(Note that in_cksum.s is at a slight disadvantage because it also
handles mis-sized buffers, but with an 8k transfer the cost is minimal.)
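
To show what handling a mis-sized buffer involves, here is a rough C sketch
that accepts any length: whole 4-byte units go through the main loop, then a
leftover 16-bit word and a final odd byte (padded with a zero byte) are added
separately. This is not how in_cksum.s does it; cksum_any_len is a
hypothetical name, and for portability the sketch builds words in network
byte order and uses a 64-bit accumulator instead of carry handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Ones-complement sum of len bytes, any length,
 * with words taken in network (big-endian) byte order. */
uint16_t cksum_any_len(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;   /* wide enough that carries can be folded at the end */
    size_t i = 0;

    /* Main loop: whole 4-byte units, added as two 16-bit words. */
    for (; i + 4 <= len; i += 4) {
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
        sum += ((uint32_t)buf[i + 2] << 8) | buf[i + 3];
    }

    /* Tail: one remaining 16-bit word, if any. */
    if (len - i >= 2) {
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
        i += 2;
    }

    /* Tail: a final odd byte is treated as the high byte of a word
     * whose low byte is zero. */
    if (i < len)
        sum += (uint32_t)buf[i] << 8;

    /* Fold the carries back in until only 16 bits remain. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}
```

The tail tests run once per call, so over an 8k buffer they are a tiny
fraction of the work, but they matter more for short packets.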

Code available from:
http://www.btinternet.com/~david.laight/netbsd/in_cksum/

sumtest.c is the test harness
sum.S contains a hacked version of the code from in_cksum.s and
a series of other checksum functions.



	David

-- 
David Laight: david@l8s.co.uk