Subject: Re: lib/35535: memcpy() is very slow if not aligned
To: None <port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: Kimura Fuyuki <fuyuki@hadaly.org>
List: netbsd-bugs
Date: 02/03/2007 14:25:02
The following reply was made to PR port-amd64/35535; it has been noted by GNATS.

From: Kimura Fuyuki <fuyuki@hadaly.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 23:24:24 +0900

 On Saturday 03 February 2007, David Laight wrote:
 >
 >  1) I'm not sure that optimisations for > 128k copies are necessarily
 >     worthwhile.  Code ought to be passing such data by reference!
 >     In the kernel, the only common large copy is (ought to be) the
 >     copy-on-write of shared pages.
 
 In kernel use, it's true that the code for >128k copies is not that
 useful.  I added it only because these library sources are shared between
 the kernel and userland.  If you think the optimization for larger buffers
 is a bad idea there, it could be removed or #ifdef'ed out for the kernel,
 along the lines of the sketch below.
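 
 Just to illustrate what I mean (a minimal C sketch, not the actual
 bcopy.S patch, and sketch_memcpy is a made-up name): the shared string
 sources are built with _KERNEL defined for the kernel, so the large-copy
 path could be guarded like this while staying enabled in libc.
 
 	#include <stddef.h>
 
 	void *
 	sketch_memcpy(void *dst, const void *src, size_t len)
 	{
 		unsigned char *d = dst;
 		const unsigned char *s = src;
 
 	#if !defined(_KERNEL)
 		if (len > 128 * 1024) {
 			/* big-buffer strategy (e.g. non-temporal stores)
 			   would go here; plain byte loop as a placeholder */
 			while (len--)
 				*d++ = *s++;
 			return dst;
 		}
 	#endif
 		/* ordinary small/medium copy path */
 		while (len--)
 			*d++ = *s++;
 		return dst;
 	}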
 
 >  2) You want to look at the costs for short copies.  They are much more
 >     common than you think.
 >     I've not done any timings for 'rep movsx', but I did do some for
 >     'rep stosx' a couple of years ago.  The instruction setup costs on
 >     modern cpus is significant, so they shouldn't be used for small loops.
 >     A common non-optimisation is the use of a 'rep movsb' instruction to
 >     move the remaining bytes - which is likely to be zero [1].
 >     One option is to copy the last 4/8 bytes first!
 >     I also discovered that the pentium IV needs the target address to be
 >     8 byte aligned!
 
 Fact 1: I had misunderstood gcc's optimization policy a bit; I thought
 memcpy()s were inlined or unrolled into mov ops more aggressively than
 they actually are.  So yes, short copies are important.  But
 constant-sized copies *are* properly inlined in many cases, as the
 example after the gcc(1) excerpt shows.
 
 from gcc(1):
        -mmemcpy
        -mno-memcpy
            Force (do not force) the use of "memcpy()" for non-trivial block
            moves.  The default is -mno-memcpy, which allows GCC to inline most
            constant-sized copies.
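 
 For example (a small illustration of mine, not from the PR; copy_hdr and
 struct pkt are made-up names), gcc at -O2 normally expands a copy like
 this inline into a few mov instructions instead of calling libc memcpy():
 
 	#include <string.h>
 
 	struct pkt {
 		unsigned char hdr[16];
 	};
 
 	void
 	copy_hdr(struct pkt *dst, const struct pkt *src)
 	{
 		/* small, constant size: inlined under the default -mno-memcpy */
 		memcpy(dst->hdr, src->hdr, sizeof(dst->hdr));
 	}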
 
 Fact 2: I don't think one extra branch is much of a burden for modern
 CPUs.  Real numbers follow.  (OK, it can be a slight burden...)
 
 plain:
 $ time ./memcpy_bench 64 100000000 0 0
 dst:0x502080 src:0x5020c0 len:64
 ./memcpy_bench 64 100000000 0 0  3.36s user 0.00s system 99% cpu 3.390 total
 
 patched:
 $ time ./memcpy_bench 64 100000000 0 0
 dst:0x502080 src:0x5020c0 len:64
 ./memcpy_bench 64 100000000 0 0  3.49s user 0.00s system 99% cpu 3.517 total
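 
 (The memcpy_bench source isn't attached to this mail; roughly, a harness
 of that shape looks like the sketch below -- length, iteration count and
 dst/src misalignment offsets on the command line, timed externally with
 time(1).  This is my reconstruction, not the actual program.)
 
 	#include <stdio.h>
 	#include <stdlib.h>
 	#include <string.h>
 
 	int
 	main(int argc, char **argv)
 	{
 		size_t len, iters, doff, soff, i;
 		char *dbuf, *sbuf, *dst, *src;
 
 		if (argc != 5) {
 			fprintf(stderr, "usage: %s len iters dstoff srcoff\n",
 			    argv[0]);
 			return 1;
 		}
 		len = strtoul(argv[1], NULL, 0);
 		iters = strtoul(argv[2], NULL, 0);
 		doff = strtoul(argv[3], NULL, 0);
 		soff = strtoul(argv[4], NULL, 0);
 
 		dbuf = malloc(len + doff + 64);
 		sbuf = malloc(len + soff + 64);
 		if (dbuf == NULL || sbuf == NULL)
 			return 1;
 		dst = dbuf + doff;
 		src = sbuf + soff;
 		memset(sbuf, 'x', len + soff + 64);
 
 		printf("dst:%p src:%p len:%zu\n", (void *)dst, (void *)src, len);
 		for (i = 0; i < iters; i++)
 			memcpy(dst, src, len);
 
 		/* touch the result so the copies aren't optimized away */
 		printf("%d\n", dst[0]);
 		free(dbuf);
 		free(sbuf);
 		return 0;
 	}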
 
 Fact 3: I didn't touch the rep part of the code; I kept the patch as
 small as I could.  I agree that the rep prefix should be used carefully.
 
 >  3) (2) may well apply to the use to movsb to align copies.
 
 Actually, I tried three versions of the alignment code, including a
 movsb-less one, and took the one that was simpler and faster.  In any
 case, there's no big difference between the three.  Note also that
 memcpy's destination address is very likely to be aligned already.
 Roughly, the idea is the one in the C sketch below (the real code is
 amd64 assembly).
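 
 (A C sketch of the alignment idea only, not the actual assembly;
 aligned_copy is a made-up name: copy bytes until the destination is
 8-byte aligned, then move the bulk one word at a time.)
 
 	#include <stddef.h>
 	#include <stdint.h>
 	#include <string.h>
 
 	void *
 	aligned_copy(void *dst, const void *src, size_t len)
 	{
 		unsigned char *d = dst;
 		const unsigned char *s = src;
 
 		/* head: at most 7 bytes, to align the destination */
 		while (len > 0 && ((uintptr_t)d & 7) != 0) {
 			*d++ = *s++;
 			len--;
 		}
 		/* bulk: 8 bytes at a time (dst aligned, src maybe not) */
 		while (len >= 8) {
 			memcpy(d, s, 8);	/* one 8-byte load/store */
 			d += 8;
 			s += 8;
 			len -= 8;
 		}
 		/* tail */
 		while (len--)
 			*d++ = *s++;
 		return dst;
 	}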
 
 
 The real (what's "real", anyway?) latencies for the rep instructions can
 be seen here, in section 8.3:
 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
 
 
 Thanks for your comment.
 
 
 >  [1] Certain compilers convert:
 >  	while (a < b)
 >  	    *a++ = ' ';
 >      into the inlined version of memset(), including 2 'expensive to setup'
 >      'rep stosx' instructions, when I explicitly wrote the loop because the
 >      loop count is short....
 
 gcc 4 is a little bit smarter than that, I think. :)