Subject: Re: 25%+ improvement in in_cksum speed!
To: None <port-i386@netbsd.org, tech-perform@l8s.co.uk>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 09/18/2002 00:28:58
> It would be interesting to know what PIII, P4 and athlon XP get.
The P4 figures are interesting!
The best is a rolled-up C loop ('32 bit C'), and even then it is still
only a quarter of the speed of a similar Athlon.
1.8GHz P4
in_cksum.s sum f807 took 9292 usecs 1.107693 nsec/byte
asm adc 1 sum f807 took 11814 usecs 1.408339 nsec/byte
asm adc 1a sum f807 took 11789 usecs 1.405358 nsec/byte
asm adc 1b sum f807 took 8420 usecs 1.003742 nsec/byte
asm adc 1c sum f807 took 8879 usecs 1.058459 nsec/byte
asm adc 1d sum f807 took 8919 usecs 1.063228 nsec/byte
asm adc 2 sum f807 took 10587 usecs 1.262069 nsec/byte
asm adc 4 sum f807 took 9979 usecs 1.189590 nsec/byte
asm adc 4b sum f807 took 8422 usecs 1.003981 nsec/byte
asm adc 8b sum f807 took 9072 usecs 1.081467 nsec/byte
asm pair 2 sum f807 took 4861 usecs 0.579476 nsec/byte
asm pair 4 sum f807 took 5916 usecs 0.705242 nsec/byte
asm pair 8 sum f807 took 7679 usecs 0.915408 nsec/byte
asm pair 16 sum f807 took 8531 usecs 1.016974 nsec/byte
asm pair 16a sum f807 took 8533 usecs 1.017213 nsec/byte
asm pair 32 sum f807 took 8965 usecs 1.068711 nsec/byte
asm quad 8 sum f807 took 5916 usecs 0.705242 nsec/byte
16 bit C sum f807 took 4810 usecs 0.573397 nsec/byte
32 bit C sum f807 took 3733 usecs 0.445008 nsec/byte
32 bit C pair sum f807 took 8337 usecs 0.993848 nsec/byte
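For readers without the test code to hand, the kind of loop behind a row like '32 bit C' can be sketched roughly as below. This is not the code from sumtest, only a minimal illustration of a rolled-up 32-bit ones-complement sum. The name cksum32_sketch, the 64-bit accumulator, and the memcpy loads are assumptions made for the example. It returns the folded sum in native byte order, without the final inversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from sumtest): ones-complement sum of
 * a buffer, read 32 bits at a time with no unrolling, folded to 16 bits.
 * The result is in native byte order and is NOT inverted.
 */
uint16_t cksum32_sketch(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t sum = 0;   /* wide accumulator, so carries are never lost */

    assert(data != NULL || len == 0);

    /* Main loop: one 32-bit add per iteration. */
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* alignment-safe load */
        sum += w;
        p += 4;
        len -= 4;
    }

    /* Trailing 16-bit word, if any. */
    if (len >= 2) {
        uint16_t h;
        memcpy(&h, p, sizeof h);
        sum += h;
        p += 2;
        len -= 2;
    }

    /* Trailing odd byte, padded with a zero byte after it. */
    if (len != 0) {
        uint16_t h = 0;
        memcpy(&h, p, 1);
        sum += h;
    }

    /* Fold the carries back in until the sum fits in 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}
```

The assembler variants in the table do the same job with adc carrying between adds; the sketch gets the same effect by accumulating into a wider integer and folding once at the end.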
A Pentium 233 MMX shows little benefit; 'asm pair 32' is the only
slight gain...
> Code available from:
> http://www.btinternet.com/~david.laight/netbsd/in_cksum/
I had some thoughts over a few beers...
Changing the leal to addl gives a slight improvement in most tests
on my system; it pulls 'asm pair 16a' down by 9% or so.
I've also added another test 'asm tree' which the P4 might do
better at :-)
But run 'sumtest -l6144' so the same block size is used throughout.
David
--
David Laight: david@l8s.co.uk