Subject: Re: optimizations [for non-debugging] amd64 kernels
To: Andrew Doran <ad@netbsd.org>
From: David Laight <david@l8s.co.uk>
List: port-amd64
Date: 09/15/2007 23:44:26
On Tue, Sep 11, 2007 at 12:25:49PM +0100, Andrew Doran wrote:
> On Tue, Sep 11, 2007 at 07:09:31AM -0400, Blair Sadewitz wrote:
> 
> > Also, at:
> > 
> > http://bahar.aydogan.net/~blair/amd64-string.diff
> > 
> > is an enhancement for x86_64 memcpy/bzero/bcopy functions in
> > common/libc.  This is authored by fuyuki@hadaly.org and is a slight
> > modification of the latest version (<see
> > http://www.hadaly.org/fuyuki>) of what was originally posted in a PR
> > back around Jan/Feb.
> ...
> > I'd appreciate it if someone who actually knew x86_64 assembly would
> > take a look at this and/or if others would test it so we could get it
> > in the tree at some point.
> 
> The setup and teardown for stos/movs/cmps are really expensive and for small
> strings (like under 512 bytes) you're better off with really simple loops
> using the arithmetic instructions.

Worse still, 'rep movsd' falls foul of the Athlon 'store-load' optimiser
when the source and destination addresses are separated by a multiple of
some (relatively small) power of 2 - as they would be for kernel COW.
The code must do 'load, load, store, store' to avoid this.
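A minimal C sketch of that 'load, load, store, store' pattern (the names and structure here are illustrative, not code from the NetBSD tree): by issuing both loads before either store, the copy avoids having a store to dst sit between a pair of src loads, which is what trips the store-to-load aliasing logic when (dst - src) is a multiple of a small power of two, as with page-aligned COW copies.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical word-copy loop: two loads, then two stores, per
 * iteration, so no load from src immediately follows a store to dst
 * at an address that aliases it modulo a small power of two.
 */
static void
copy_ll_ss(uint64_t *dst, const uint64_t *src, size_t nwords)
{
	size_t i;

	for (i = 0; i + 1 < nwords; i += 2) {
		uint64_t a = src[i];		/* load  */
		uint64_t b = src[i + 1];	/* load  */
		dst[i] = a;			/* store */
		dst[i + 1] = b;			/* store */
	}
	if (i < nwords)				/* odd trailing word */
		dst[i] = src[i];
}
```

Whether a compiler keeps this exact instruction order is of course not guaranteed; the real fix would be written in assembly, as in the patch under discussion.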

OTOH ISTR the latest Intel CPU has an optimiser for 'rep movsl' that
performs suitably aligned copies as cache-line reads and writes.
It might also have fast setup for them ....

I'm also worried about the number of unpredictable branches I suspect
those routines have - especially when they aren't being benchmarked
by repeated calls with the same parameters!

	David

-- 
David Laight: david@l8s.co.uk