Subject: Re: lib/35535: memcpy() is very slow if not aligned
To: None <port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: David Laight <david@l8s.co.uk>
List: netbsd-bugs
Date: 02/03/2007 11:20:02
The following reply was made to PR port-amd64/35535; it has been noted by GNATS.

From: David Laight <david@l8s.co.uk>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 11:15:56 +0000

 On Sat, Feb 03, 2007 at 10:30:02AM +0000, Kimura Fuyuki wrote:
 > The following reply was made to PR port-amd64/35535; it has been noted by GNATS.
 > 
 > From: Kimura Fuyuki <fuyuki@hadaly.org>
 > To: gnats-bugs@netbsd.org
 > Cc: 
 > Subject: Re: lib/35535: memcpy() is very slow if not aligned
 > Date: Sat, 3 Feb 2007 19:29:17 +0900
 > 
 >  Sorry for misfiling...
 >  
 >  If you want further improvement, here's also an SSEd version which scales to 
 >  megs and preserves cached data.
 >  
 >  http://www.hadaly.org/fuyuki/bcopy-sse.patch
 
 1) I'm not sure that optimisations for > 128k copies are necessarily
    worthwhile.  Code ought to be passing such data by reference!
    In the kernel, the only common large copy is (ought to be) the
    copy-on-write of shared pages.
 
 2) You want to look at the costs for short copies.  They are much more
    common than you think.
    I've not done any timings for 'rep movsx', but I did do some for
    'rep stosx' a couple of years ago.  The instruction setup cost on
    modern cpus is significant, so they shouldn't be used for small loops.
    A common non-optimisation is the use of a 'rep movsb' instruction to
    move the remaining bytes - the count of which is likely to be zero [1].
    One option is to copy the last 4/8 bytes first!
    I also discovered that the pentium IV needs the target address to be
    8 byte aligned!
 
 3) (2) may well apply to the use of movsb to align copies.
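
 As a rough illustration of the 'copy the last 4/8 bytes first' idea
 from (2), here is a minimal C sketch.  The name copy_tail_first is
 hypothetical (it is not from the PR or the patch), it assumes the
 buffers do not overlap, and it relies on the compiler turning the
 fixed-size 8-byte memcpy() calls into single loads and stores, which
 should be checked in the generated code.  Short copies take a plain
 byte loop, and no 'rep movsb' is needed for the leftover bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: forward copy for NON-overlapping buffers that
 * deals with the tail first, so no trailing byte copy is needed.
 */
static void *copy_tail_first(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t w;

    if (len < 8) {
        /* Short copy: simple byte loop, no 'rep' setup cost. */
        while (len--)
            *d++ = *s++;
        return dst;
    }

    /* Copy the last 8 bytes first (unaligned-safe via memcpy). */
    memcpy(&w, s + len - 8, 8);
    memcpy(d + len - 8, &w, 8);

    /*
     * Copy 8-byte words from the front until the tail is reached.
     * The final word may overlap the tail already written; it stores
     * the same bytes again, which is harmless.
     */
    for (size_t i = 0; i < len - 8; i += 8) {
        memcpy(&w, s + i, 8);
        memcpy(d + i, &w, 8);
    }
    return dst;
}
```

 This does not align the target address; whether that matters would
 have to be measured on the cpu in question.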
 
 	David
 
 [1] Certain compilers convert:
 	while (a < b)
 	    *a++ = ' ';
     into the inlined version of memset(), including 2 'expensive to setup'
     'rep stosx' instructions, when I explicitly wrote the loop because the
     loop count is short....
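
 For the case in [1], one possible arrangement is to keep the explicit
 loop for short counts and hand only long fills to memset().  This is a
 sketch: fill_spaces and the SMALL_FILL_MAX threshold of 16 are
 hypothetical, and the threshold would need measuring.  A compiler may
 still recognise the loop and replace it; with GCC I believe
 -fno-tree-loop-distribute-patterns stops the conversion to a memset()
 call, but check the generated code for your compiler.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off; the right value has to be measured. */
#define SMALL_FILL_MAX 16

/* Hypothetical helper: write n spaces starting at a. */
static void fill_spaces(char *a, size_t n)
{
    if (n >= SMALL_FILL_MAX) {
        /* Long fill: the library routine's setup cost is amortised. */
        memset(a, ' ', n);
        return;
    }

    /* Short fill: explicit loop, written out on purpose. */
    for (char *b = a + n; a < b; )
        *a++ = ' ';
}
```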
 
 -- 
 David Laight: david@l8s.co.uk