Subject: Re: C runqueue
To: None <port-i386@netbsd.org>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 10/23/2002 17:37:26
>  I note that if you're going strictly by
> cycle count, it would be even faster to put the mask values in a table
> and eliminate the ROL, doing just MOV REG,MEM/AND MEM,REG instead.

Except that the table is unlikely to be in the data cache (in real
life) so you end up with a memory read.  This is slower than any
number of instructions.....
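
For anyone who wants to compare the two approaches from C, here is a
rough sketch.  It assumes the mask is being used to clear one bit in
a 32-bit run-queue bitmap; the names (clear_mask_tab, rol32,
clear_bit_table, clear_bit_rotate) are made up for illustration and
are not taken from the kernel source.

```c
#include <stdint.h>

/* Hypothetical table of "all ones except bit n" masks. */
#define M(n) (~(UINT32_C(1) << (n)))
static const uint32_t clear_mask_tab[32] = {
	M(0),  M(1),  M(2),  M(3),  M(4),  M(5),  M(6),  M(7),
	M(8),  M(9),  M(10), M(11), M(12), M(13), M(14), M(15),
	M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23),
	M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31)
};
#undef M

/* Rotate left; compilers commonly turn this idiom into a single ROL. */
static inline uint32_t
rol32(uint32_t x, unsigned n)
{
	n &= 31;
	return (x << n) | (x >> (-n & 31));
}

/* Table version: one data load (possibly a cache miss) plus the AND. */
static inline void
clear_bit_table(uint32_t *bits, unsigned n)
{
	*bits &= clear_mask_tab[n & 31];
}

/* Computed version: no table access, the mask is built in a register. */
static inline void
clear_bit_rotate(uint32_t *bits, unsigned n)
{
	*bits &= rol32(UINT32_C(0xfffffffe), n);
}
```

Both routines give the same result; which one wins depends on whether
the table happens to be in the data cache when the code runs.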

I suspect that any loops used for timing these sequences should
invalidate the I-cache on every iteration, unless, of course,
the code is actually used in a loop.

This could make a massive difference to the 'best' sequence.  The
P4 is likely to come off worse - because it will have to do the
x86 code -> uops (or whatever it calls them) conversion on every
pass, so any 'complex' instructions will have a much larger cost.

> This makes the assembler versions substantially faster than the C
> versions on all x86 processors.

Except you could (probably) code those implementations in C.
If leaf routines are compiled without a stack frame, then the
C is likely to be compiled to relatively good code.
If there is an obvious missing optimisation from the compiler
generated code, the gcc hackers can (probably) fix the compiler
- getting the benefit elsewhere.

FWIW the fastest (non SSE(2)) ipcksum routine for P4 was
from compiling:
	uint64_t sum = 0;
	uint32_t *buffer;
	while(...)
		sum += *buffer++;
(OTOH a 1.8GHz P4 was still a lot slower than my 700MHz Athlon)
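
A self-contained version of that loop might look like the sketch
below.  The function name, the word-count argument and the final fold
to 16 bits are my additions for illustration; it assumes the buffer
is 32-bit aligned and a whole number of words long, and it returns
the checksum in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: add 32-bit words into a 64-bit accumulator so
 * the carries collect in the top half, then fold down to the 16-bit
 * ones' complement sum.
 */
static uint16_t
cksum_words(const uint32_t *buffer, size_t nwords)
{
	uint64_t sum = 0;

	while (nwords-- > 0)
		sum += *buffer++;

	/* Fold 64 -> 32 bits; the second pass absorbs the carry. */
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);

	/* Fold 32 -> 16 bits the same way. */
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

The point of the 64-bit accumulator is that the inner loop needs no
add-with-carry handling at all; the carries are dealt with once, after
the loop.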

	David

-- 
David Laight: david@l8s.co.uk