Subject: Re: improving ssh performance on sun4m systems
To: shannon@widomaker.com, tv@wasabisystems.com
From: eeh@netbsd.org
List: port-sparc
Date: 03/15/2002 18:00:47
| On Fri, 15 Mar 2002, Charles Shannon Hendrix wrote:
|
| : > I can pretty much assume that 80%+ of the speed increase is due to using the
| : > mul/div v8 builtin.  This speed can be achieved without a completely
| : > separate snapshot by implementing the libc v8 extension I mentioned earlier.
| :
| : I'm not sure that's all of it, because programs not using that at all are
| : getting boost, though that might just be from instruction reordering.
|
| Oh, I'm sure there's boost, but I'd be curious to see an analysis.  The
| speed difference between mul/div on v7 and v8 is *huge*.

I doubt this has much effect.  The multiply-step routine takes a maximum of
33 cycles.  Most code already takes that cost into account, and the compiler
tries to avoid those operations as much as possible, so multiplies should be
rare in practice.

The integer multiply instruction on UltraSPARC CPUs retires 2 bits/cycle,
so roughly 16 cycles for a 32-bit multiply.  And .mul already special-cases
small values with a 13-cycle version.  So on average the instruction is only
about 2x as fast as using multiply step.  I haven't been able to find timings
for 32-bit processors, but I doubt they have much faster integer multipliers.

However, this would be trivial to test.  Replace the current mul.S with:

FUNC(.mul)
	smul	%o0, %o1, %o0
	retl
	 rd	%y, %o1

Replace umul.S with:

FUNC(.umul)
	umul	%o0, %o1, %o0
	retl
	 rd	%y, %o1

Then rebuild libc.so and see how big a difference it makes.

Or profile and see how much time is spent in .mul and divrem.
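
For a quick before/after number without a full profile, something like the
following C sketch could be timed against the old and the new libc.so.  It
is only an illustration: the names mul_checksum, time_mul_checksum and
MULBENCH_MAIN are made up here, and it assumes the test program is built
for v7 and dynamically linked, so that each 32-bit multiply becomes a call
into the libc multiply routine rather than an inline instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/*
 * Sum of i * (i + 1) for i in [0, n), modulo 2^32.
 * The operands are volatile so the compiler cannot fold the multiply
 * away or replace it with shifts and adds; every iteration performs a
 * real 32-bit multiply.  (Assumption: on a v7 build that multiply is a
 * call into the libc routine being replaced.)
 */
uint32_t mul_checksum(uint32_t n)
{
    volatile uint32_t a, b;
    uint32_t sum = 0;
    uint32_t i;

    for (i = 0; i < n; i++) {
        a = i;
        b = i + 1;
        sum += a * b;
    }
    return sum;
}

/* CPU seconds spent in mul_checksum(n), measured with clock(). */
double time_mul_checksum(uint32_t n, uint32_t *result)
{
    clock_t start = clock();
    uint32_t r = mul_checksum(n);
    clock_t end = clock();

    if (result != NULL)
        *result = r;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

#ifdef MULBENCH_MAIN
/* Optional standalone driver: compile with -DMULBENCH_MAIN. */
int main(void)
{
    uint32_t sum;
    double secs;

    assert(mul_checksum(4) == 20);  /* 0*1 + 1*2 + 2*3 + 3*4 */
    secs = time_mul_checksum(50000000u, &sum);
    printf("checksum %lu, %.2f s\n", (unsigned long)sum, secs);
    return 0;
}
#endif
```

Run it once with each libc.so.  The checksum should be identical in both
runs, and the difference in CPU time gives an upper bound on what the
multiply change alone can buy, since real programs multiply far less often
than this loop does.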

I think that the scheduling is much more likely to have a performance impact
than changing multiply and/or divide.

Eduardo