Subject: Re: Minor performance tweak to bcopy_page.S
To: Ben Harris <bjh21@netbsd.org>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm32
Date: 03/08/2001 16:33:02
> One thing I've noticed is that two of your changes have #ifdefs to
> turn them off again.  Do you think these are likely to be necessary?
> It seems to me that if the changes are likely to have an adverse
> effect on some systems, we should either detect those systems at
> run-time, or not apply the changes at all.  Having random undocumented
> #ifdefs all over the place (without even a defopt, usually) is one of
> the things that makes modifying the arm32 sources painful.

A fair point (re documentation), which I fully agree with.  Most of the 
changes were to comment out unnecessary unrolling of the copy code.  Using 
larger blocks would be of (marginal) benefit on an uncached core (since 
then we have to re-fetch the instructions each time).  Uncached cores are 
also less likely to have write buffers, so in that case the branch 
overheads won't be hidden by the write buffer draining (we can safely 
assume that one of the 2 main overheads in the code will be the cost of 
waiting for the write buffer to drain).  As to whether it's worth keeping 
the old code, I'm not sure.

Are we likely to see NetBSD running on uncached ARMs in the future?  
Unlikely in the desktop arena, but in the embedded or palm-top arena it's 
harder to be sure.

In practice using larger blocks would be a win on ARM2 (or on other cores 
if the I$ were off); I see little other need for a larger loop -- it just 
wastes space in the I$.

If we do leave the code in, my feeling is that it should be left as a 
build-time option.  It is, after all, a performance tweak and isn't 
required for correct operation.  Anyone configuring a kernel for best 
performance on a particular machine should be able to tweak this if 
necessary, but I don't think we really need to worry about run-time 
detection (there won't be many kernels that need building for both ARM2 
& StrongARM! ;-)
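
To make the build-time idea concrete, here is a rough C sketch (not the 
real assembler).  The option name BCOPY_PAGE_BIG_UNROLL, the function name 
and the 4K page size are all assumptions for illustration; none of them 
come from the tree.  The only point is that one documented option picks 
the block size, and the small block is the default.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u        /* assumed page size, for illustration */

/*
 * Hypothetical build-time option: a big block for uncached cores
 * (e.g. ARM2), a small one otherwise so the loop stays compact in
 * the I$.
 */
#ifdef BCOPY_PAGE_BIG_UNROLL
#define WORDS_PER_ITER 32u
#else
#define WORDS_PER_ITER 8u
#endif

/* Copy one page, WORDS_PER_ITER words per outer iteration. */
static void copy_page_sketch(uint32_t *dst, const uint32_t *src)
{
    size_t nwords = PAGE_BYTES / sizeof(uint32_t);

    for (size_t i = 0; i < nwords; i += WORDS_PER_ITER) {
        /* Constant trip count, so the compiler is free to unroll this. */
        for (size_t j = 0; j < WORDS_PER_ITER; j++)
            dst[i + j] = src[i + j];
    }
}
```

Both block sizes divide the page size evenly, so the result is the same 
either way; only the loop overhead and the code size change.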

> >This patch is a minor tweak to arm/arm32/bcopy_page.S.
> 
> Out of curiosity (I've not looked at the code) is there a good reason
> for not using memcpy() here?

bcopy_page (and the other routine in that file -- bzero_page, IIRC) are 
critical to the pmap performance; they are used for providing COW 
semantics and zeroed memory.  I feel this is probably one case where the 
overhead of duplication is worth it.

R.