Subject: Re: 25%+ improvement in in_cksum speed!
To: None <port-i386@netbsd.org, tech-perform@l8s.co.uk>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 09/18/2002 00:28:58
> It would be interesting to know what PIII, P4 and athlon XP get.

The P4 figures are interesting!
The fastest is a rolled-up C loop ('32 bit C'), and even that runs at
only a quarter of the speed of a similar Athlon.

1.8GHz P4                                                                       
          in_cksum.s sum f807 took     9292 usecs 1.107693 nsec/byte            
           asm adc 1 sum f807 took    11814 usecs 1.408339 nsec/byte            
          asm adc 1a sum f807 took    11789 usecs 1.405358 nsec/byte            
          asm adc 1b sum f807 took     8420 usecs 1.003742 nsec/byte            
          asm adc 1c sum f807 took     8879 usecs 1.058459 nsec/byte            
          asm adc 1d sum f807 took     8919 usecs 1.063228 nsec/byte            
           asm adc 2 sum f807 took    10587 usecs 1.262069 nsec/byte            
           asm adc 4 sum f807 took     9979 usecs 1.189590 nsec/byte            
          asm adc 4b sum f807 took     8422 usecs 1.003981 nsec/byte            
          asm adc 8b sum f807 took     9072 usecs 1.081467 nsec/byte            
          asm pair 2 sum f807 took     4861 usecs 0.579476 nsec/byte            
          asm pair 4 sum f807 took     5916 usecs 0.705242 nsec/byte            
          asm pair 8 sum f807 took     7679 usecs 0.915408 nsec/byte            
         asm pair 16 sum f807 took     8531 usecs 1.016974 nsec/byte            
        asm pair 16a sum f807 took     8533 usecs 1.017213 nsec/byte            
         asm pair 32 sum f807 took     8965 usecs 1.068711 nsec/byte            
          asm quad 8 sum f807 took     5916 usecs 0.705242 nsec/byte            
            16 bit C sum f807 took     4810 usecs 0.573397 nsec/byte            
            32 bit C sum f807 took     3733 usecs 0.445008 nsec/byte            
       32 bit C pair sum f807 took     8337 usecs 0.993848 nsec/byte            
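
For anyone without the source to hand, a rolled-up 32-bit C loop of the
kind the '32 bit C' line refers to might look roughly like the sketch
below. It is an illustration under assumptions, not the code from sumtest:
the name cksum32, the 64-bit accumulator and the memcpy loads (used to
stay alignment-safe) are all hypothetical choices. It returns the folded
ones-complement sum without the final complement.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration; not taken from the benchmark. */
static uint16_t cksum32(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;   /* wide accumulator: carries collect in the top half */

    /* Rolled-up main loop: one 32-bit word per iteration, no unrolling. */
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, p, sizeof w);    /* alignment-safe native-order load */
        sum += w;
        p += 4;
        len -= 4;
    }

    /* Trailing 16-bit word, if any. */
    if (len >= 2) {
        uint16_t h;
        memcpy(&h, p, sizeof h);
        sum += h;
        p += 2;
        len -= 2;
    }

    /* Trailing odd byte, padded with a zero byte in memory order. */
    if (len) {
        uint16_t h = 0;
        memcpy(&h, p, 1);
        sum += h;
    }

    /* Fold 64 -> 32 -> 16 bits, adding the carries back in each time. */
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}
```

Because 2^16 is congruent to 1 modulo 65535, adding whole 32-bit words and
folding at the end gives the same result as adding 16-bit words one at a
time, which is why the loop can stay this simple.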

A Pentium 233 MMX shows little benefit; 'asm pair 32' is the only
slight gain...

> Code available from:
> http://www.btinternet.com/~david.laight/netbsd/in_cksum/

I had some thoughts over a few beers...
Changing the leal for addl gives a slight improvement to most tests
on my system - pulls the 'asm pair 16a' down by 9% or so.

I've also added another test 'asm tree' which the P4 might do
better at :-)

But run 'sumtest -l6144' so that the same block size is used throughout.

	David

-- 
David Laight: david@l8s.co.uk