Subject: Re: C runqueue
To: None <port-i386@netbsd.org>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 10/23/2002 09:35:13
> The main difference is the btrl instruction.  Last week I went
> back and counted instruction cycles.  (I must have been bored.)
> It seems that gcc was right.  So I'm sure there are some
> micro-optimisation people out there who will enjoy this:
> 
> 			pentium	i486	i386
> 	-----------------------------------------
> 	btrl mem,reg	13	13	13 cycles
> 	-----------------------------------------
> 	mov reg,imm	1	1	2
> 	rol reg,cl	4	3	3
> 	and mem,reg	3	3	7
> 			8	7	12 cycles
> 	-----------------------------------------

Where did you get the 'pentium' cycle counts from?
They aren't documented for anything recent - mainly because
you can't count cycles any more.
IIRC the P4 doesn't have an (easily accessible) barrel shifter,
so shifts and rotates by cl, or by a constant other than 1, are
slower than on other processors.

I'm also not sure that your alternate sequence is actually
equivalent to btr.
OTOH btr is one of those instructions that probably
executes more slowly than a sequence of simple instructions
(unless the bit offset is likely to be large and/or you
are applying the lock prefix).
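
To make the equivalence question concrete, here is a rough C model of
the two forms. It is only a sketch: the function names are made up for
illustration, it assumes the mov is loading ~1 as the mask, and it
ignores negative bit offsets (which the real btr accepts with a
register offset). The rol/and sequence only ever touches the one
32-bit word it is given, because the rotate count is taken modulo 32,
whereas btr with a memory operand and a register offset indexes into
the following words and also returns the old bit in the carry flag.

```c
#include <stdint.h>

/* Rotate left by n (mod 32), as rol %cl does for a 32-bit operand. */
static uint32_t rol32(uint32_t v, unsigned n)
{
    n &= 31;
    return (v << n) | (v >> ((32 - n) & 31));
}

/*
 * Model of the three-instruction sequence, assuming the immediate
 * is ~1:  mov $~1,reg ; rol %cl,reg ; and reg,mem
 * Clears bit (n mod 32) of the single word at 'word'.
 */
static void clear_bit_rol(uint32_t *word, unsigned n)
{
    *word &= rol32(~(uint32_t)1, n);
}

/*
 * Model of btr with a memory operand and a register bit offset
 * (non-negative offsets only in this sketch).  The offset selects
 * the word as well as the bit, and the old bit value is returned
 * (the real instruction leaves it in CF).
 */
static int clear_bit_btr(uint32_t *base, unsigned n)
{
    uint32_t *word = base + (n >> 5);
    uint32_t mask = (uint32_t)1 << (n & 31);
    int old = (*word & mask) != 0;

    *word &= ~mask;
    return old;
}
```

For offsets 0..31 the two leave memory in the same state; for an
offset of 35, say, the first clears bit 3 of the word it was given and
the second clears bit 3 of the next word. If the runqueue bitmap is a
single 32-bit word and nobody looks at the carry flag, the difference
doesn't matter.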
> 
> So, we save a few cycles and gain a stackframe with C.  I'd
> say it's worth moving i386 over too.

Have you included the cost of setting up the stack frame then?

	David

-- 
David Laight: david@l8s.co.uk