Subject: 25%+ improvement in in_cksum speed!
To: None <port-i386@netbsd.org, tech-perform@l8s.co.uk>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 09/17/2002 17:04:25
I read the pentium xxx performance guide the other week
(looking for something else) and it struck me that the IP
checksum routine ought to use some of the principles.
Looking at the code it doesn't (last optimised in 1996).
On my (old) athlon 700, the following loop is fastest,
at 0.270 ns/byte instead of 0.370 ns/byte for the standard i386 code:
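1:
	addl	(%edx),%eax		# first accumulator: start a carry chain
	adcl	4(%edx),%eax
	adcl	8(%edx),%eax
	adcl	12(%edx),%eax
	adcl	16(%edx),%eax
	adcl	20(%edx),%eax
	adcl	24(%edx),%eax
	adcl	28(%edx),%eax
	adcl	$0,%eax			# fold the last carry back in
	addl	32(%edx),%ebx		# second accumulator: independent chain
	adcl	36(%edx),%ebx
	adcl	40(%edx),%ebx
	adcl	44(%edx),%ebx
	adcl	48(%edx),%ebx
	adcl	52(%edx),%ebx
	adcl	56(%edx),%ebx
	adcl	60(%edx),%ebx
	adcl	$0,%ebx			# fold the last carry back in
	leal	64(%edx),%edx		# advance the pointer without touching flags
	subl	$64,%ecx		# 64 bytes consumed per iteration
	jnz	1b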
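For anyone who wants to play with the idea without reading assembler, here is a rough C sketch of the same "pair" structure: two independent accumulators that are only combined at the end. It is not the code from sum.S. C has no carry flag, so this version uses 64-bit accumulators in place of the adcl chains, and the names fold16, sum16_simple and sum16_pair are made up for the illustration. It assumes the length is a multiple of 8 bytes and returns the folded sum, not its complement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a wide accumulator down to a 16-bit ones-complement sum. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Reference version: add the buffer one native-endian 16-bit word at a time. */
static uint16_t sum16_simple(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;

    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2) {
        uint16_t w;
        memcpy(&w, buf + i, sizeof w);   /* avoids alignment problems */
        sum += w;
    }
    return fold16(sum);
}

/*
 * Paired version (hypothetical helper): alternate 32-bit words go to
 * two separate accumulators, so the two additions in each iteration
 * do not depend on each other.
 */
static uint16_t sum16_pair(const unsigned char *buf, size_t len)
{
    uint64_t a = 0, b = 0;

    assert(len % 8 == 0);
    for (size_t i = 0; i < len; i += 8) {
        uint32_t w0, w1;
        memcpy(&w0, buf + i, sizeof w0);
        memcpy(&w1, buf + i + 4, sizeof w1);
        a += w0;
        b += w1;
    }
    /* Combine the two partial sums only once, at the end. */
    return fold16((uint64_t)fold16(a) + fold16(b));
}
```

Both functions should return the same value for the same buffer, since adding 32-bit words and folding is equivalent to adding 16-bit words in ones-complement arithmetic. Whether a compiler turns the paired loop into anything as fast as the hand-written loop is a separate question.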
I wrote a test program to compare the functions. With 8k buffers
it gives:
in_cksum.s     sum f807 took  3125 usecs 0.372529 nsec/byte
asm adc 1      sum f807 took 13754 usecs 1.639605 nsec/byte
asm adc 1a     sum f807 took  8952 usecs 1.067162 nsec/byte
asm adc 1b     sum f807 took  6982 usecs 0.832319 nsec/byte
asm adc 1c     sum f807 took  8994 usecs 1.072168 nsec/byte
asm adc 1d     sum f807 took  8951 usecs 1.067042 nsec/byte
asm adc 2      sum f807 took  6541 usecs 0.779748 nsec/byte
asm adc 4      sum f807 took  4496 usecs 0.535965 nsec/byte
asm adc 4b     sum f807 took  3794 usecs 0.452280 nsec/byte
asm adc 8b     sum f807 took  3782 usecs 0.450850 nsec/byte
asm pair 2     sum f807 took  4997 usecs 0.595689 nsec/byte
asm pair 4     sum f807 took  3461 usecs 0.412583 nsec/byte
asm pair 8     sum f807 took  2827 usecs 0.337005 nsec/byte
asm pair 16    sum f807 took  2460 usecs 0.293255 nsec/byte
asm pair 16a   sum f807 took  2311 usecs 0.275493 nsec/byte
asm pair 32    sum f807 took  2278 usecs 0.271559 nsec/byte
asm quad 8     sum f807 took  3020 usecs 0.360012 nsec/byte
16 bit C       sum f807 took 20865 usecs 2.487302 nsec/byte
32 bit C       sum f807 took  8978 usecs 1.070261 nsec/byte
32 bit C pair  sum f807 took 11632 usecs 1.386642 nsec/byte
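To show how the two numeric columns relate, here is a small C sketch of the arithmetic. The total byte count is not printed in the output above; the figures are consistent with 8 MiB being summed per function (for instance 1024 passes over an 8k buffer), but that is an inference from the numbers, not something read out of sumtest.c. The helper names nsec_per_byte and percent_faster are invented for the example.

```c
#include <assert.h>

/* Convert an elapsed time in microseconds to nanoseconds per byte. */
static double nsec_per_byte(double usecs, double total_bytes)
{
    assert(total_bytes > 0);
    return usecs * 1000.0 / total_bytes;
}

/* Percentage reduction in time per byte going from old_ns to new_ns. */
static double percent_faster(double old_ns, double new_ns)
{
    assert(old_ns > 0);
    return (old_ns - new_ns) / old_ns * 100.0;
}
```

With an assumed total of 8 * 1024 * 1024 bytes, nsec_per_byte(3125, total) comes out at about 0.3725 and nsec_per_byte(2278, total) at about 0.2716, matching the in_cksum.s and "asm pair 32" lines, and percent_faster on those two values gives roughly 27%.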
It would be interesting to know what PIII, P4 and athlon XP get.
(Note: in_cksum.s is at a slight disadvantage because it deals
with mis-sized buffers, but with an 8k transfer this is minimal.)
Code available from:
http://www.btinternet.com/~david.laight/netbsd/in_cksum/
sumtest.c is the test harness
sum.S contains a hacked version of the code from in_cksum.s and
a series of other checksum functions.
David
--
David Laight: david@l8s.co.uk