Subject: Re: Performance of various memcpy()'s
To: None <tech-perform@netbsd.org>
From: Bang Jun-Young <junyoung@mogua.com>
List: tech-perform
Date: 10/23/2002 23:54:42
--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Wed, Oct 16, 2002 at 04:18:30AM +0900, Bang Jun-Young wrote:
> Hi,
>
> About 14 months ago, I had some discussion on memcpy performance on
> i386 platform here. Months later, I took a look into it again, and
> now am coming with (not-so-)new benchmark results (attached). The
> tests were performed on Athlon XP 1800 and DDR 256MB.
>
> From the results, it's obvious that memcpy() using MMX insns is the
> best for in-cache sized data, typically 50-100% faster than plain old
> memcpy for data <= 32 KB.
This time I got results with out-of-cache data. To eliminate cache
effects, I used 1 MB of source data and 1 MB of destination data, and
repeated memcpy*()'s 1 MB / datasize times in the inner loop, and 1024
times in the outer loop. The total data size was the same 1 GB as before,
but the results were quite different from those with in-cache data.
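
The loop arithmetic can be checked with a small C sketch. The names
BUF_BYTES, OUTER_LOOPS, inner_loops() and total_bytes() are made up for
illustration and do not appear in memcpy_bench.c; the constants simply
follow the description above.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_BYTES   (1024UL * 1024UL)  /* 1 MB source and destination buffers */
#define OUTER_LOOPS 1024UL             /* outer repetitions over the whole buffer */

/* Copies per pass over the buffer: 1 MB / datasize. */
unsigned long inner_loops(size_t block)
{
	assert(block > 0 && block <= BUF_BYTES);
	return BUF_BYTES / block;
}

/* Total bytes copied by one benchmark run for a given block size. */
unsigned long long total_bytes(size_t block)
{
	return (unsigned long long) block * inner_loops(block) * OUTER_LOOPS;
}
```

For example, inner_loops(64) is 16384 and inner_loops(512 * 1024) is 2,
which matches the loop counts printed in the results, and total_bytes()
comes to 1073741824 (1 GB) for every block size that divides 1 MB evenly.
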
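In this test, the non-temporal movntq instruction was clearly a big win.
Since it doesn't pollute cache lines, you can get close to 2x performance
(roughly 1.7x to 1.9x over libc memcpy on aligned blocks in the attached
results) when copying data that is not in cache.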
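
As a rough illustration of what "non-temporal" looks like in code, here is
a minimal sketch that uses the SSE2 intrinsic _mm_stream_si128() instead of
the MMX movntq used in the attached benchmark. copy_nontemporal() is a
hypothetical name, not one of the benchmarked routines. The sketch assumes
a compiler that provides <emmintrin.h>, and it falls back to plain memcpy()
when SSE2 is unavailable or the destination is not 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy len bytes, using streaming stores if possible. */
void *copy_nontemporal(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
	/* Streaming stores require a 16-byte aligned destination. */
	if (((uintptr_t) d & 15) == 0) {
		while (len >= 16) {
			/* Unaligned load, then a store that bypasses the cache. */
			__m128i v = _mm_loadu_si128((const __m128i *) s);
			_mm_stream_si128((__m128i *) d, v);
			s += 16;
			d += 16;
			len -= 16;
		}
		/* Make the streamed data globally visible before returning. */
		_mm_sfence();
	}
#endif
	/* Tail bytes, or the whole copy when the fast path was skipped. */
	if (len > 0)
		memcpy(d, s, len);
	return dst;
}
```

Whether this beats an ordinary copy depends on the data being larger than
the cache, as in the test above; for data that is reused soon afterwards,
bypassing the cache can hurt instead.
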
Also, I found that my MMX-optimized i686_copyin() is faster than plain
old memcpy for data larger than 2-3 KB. It seems that saving the FP state
to the stack and restoring it afterwards is quite expensive for small
copies (it needs 108 bytes of memcpying from processor to memory plus some
overhead).
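
One way to act on this is to pick the copy routine by size. The sketch
below is only an illustration: MMX_THRESHOLD, choose_path() and
copy_dispatch() are hypothetical names, the 2048-byte cutoff is a
placeholder taken from the low end of the 2-3 KB range above and would
need tuning per CPU, and copy_mmx_stub() simply calls memcpy() as a
stand-in for a real MMX routine such as i686_copyin().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MMX_THRESHOLD 2048  /* assumed cutoff in bytes; tune by measurement */

enum copy_path { PATH_PLAIN, PATH_MMX };

/* Small copies cannot amortize the FP state save/restore, so keep them plain. */
enum copy_path choose_path(size_t len)
{
	return len > MMX_THRESHOLD ? PATH_MMX : PATH_PLAIN;
}

/* Stand-in for an MMX copy that saves and restores FP state around its loop. */
void *copy_mmx_stub(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

/* Route each request to the plain or the MMX path depending on its size. */
void *copy_dispatch(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	if (choose_path(len) == PATH_MMX)
		return copy_mmx_stub(dst, src, len);
	return memcpy(dst, src, len);
}
```

With this cutoff, a 64-byte copy takes the plain path and a 4096-byte copy
takes the MMX path.
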
I'll come up with a finalized i686_copyin/out() soon.
Jun-Young
--
Bang Jun-Young <junyoung@mogua.com>
--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memcpy_bench.uncached.txt"
addr1=0x804c000 addr2=0x814c000
memcpy 64B -- 16384 loops
aligned blocks
libc memcpy 2.893993 s
rep movsw 2.859771 s
asm loop 2.669005 s
i686_copyin 2.910439 s
i686_copyin2 2.885610 s
MMX memcpy using MOVQ 2.675665 s
with mingo's MOVUSB (prefetch, non-temporal) 1.949940 s
with simple MOVUSB (no prefetch) 2.719580 s
arjanv's MOVQ (with prefetch) 2.938366 s
arjanv's MOVNTQ (with prefetch, for Athlon) 1.552954 s
arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA 1.545507 s
+0/+4 moderately unaligned blocks
libc memcpy 2.723010 s
MMX memcpy using MOVQ 2.893861 s
with mingo's MOVUSB (prefetch, non-temporal) 2.093558 s
with simple MOVUSB (no prefetch) 2.973506 s
+10/+13 cruelly unaligned blocks
libc memcpy 3.125790 s
MMX memcpy using MOVQ 2.661766 s
with mingo's MOVUSB (prefetch, non-temporal) 2.740727 s
with simple MOVUSB (no prefetch) 2.715262 s
addr1=0x804c000 addr2=0x814c000
memcpy 1024B -- 1024 loops
aligned blocks
libc memcpy 2.761827 s
rep movsw 2.764354 s
asm loop 2.820187 s
i686_copyin 2.647857 s
i686_copyin2 2.647648 s
MMX memcpy using MOVQ 2.574933 s
with mingo's MOVUSB (prefetch, non-temporal) 1.870815 s
with simple MOVUSB (no prefetch) 2.684049 s
arjanv's MOVQ (with prefetch) 2.518789 s
arjanv's MOVNTQ (with prefetch, for Athlon) 1.588186 s
arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA 1.698439 s
+0/+4 moderately unaligned blocks
libc memcpy 2.800100 s
MMX memcpy using MOVQ 2.588999 s
with mingo's MOVUSB (prefetch, non-temporal) 1.852392 s
with simple MOVUSB (no prefetch) 2.723908 s
+10/+13 cruelly unaligned blocks
libc memcpy 2.749374 s
MMX memcpy using MOVQ 2.683349 s
with mingo's MOVUSB (prefetch, non-temporal) 2.203756 s
with simple MOVUSB (no prefetch) 2.750306 s
addr1=0x804c000 addr2=0x814c000
memcpy 4kB -- 256 loops
aligned blocks
libc memcpy 2.758545 s
rep movsw 2.759825 s
asm loop 2.818919 s
i686_copyin 2.633134 s
i686_copyin2 2.641534 s
MMX memcpy using MOVQ 2.571201 s
with mingo's MOVUSB (prefetch, non-temporal) 1.795929 s
with simple MOVUSB (no prefetch) 2.681924 s
arjanv's MOVQ (with prefetch) 2.512153 s
arjanv's MOVNTQ (with prefetch, for Athlon) 1.577637 s
arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA 1.688840 s
+0/+4 moderately unaligned blocks
libc memcpy 2.828267 s
MMX memcpy using MOVQ 2.584795 s
with mingo's MOVUSB (prefetch, non-temporal) 1.773777 s
with simple MOVUSB (no prefetch) 2.691957 s
+10/+13 cruelly unaligned blocks
libc memcpy 2.711029 s
MMX memcpy using MOVQ 2.690554 s
with mingo's MOVUSB (prefetch, non-temporal) 2.047554 s
with simple MOVUSB (no prefetch) 2.782641 s
addr1=0x804c000 addr2=0x814c000
memcpy 64kB -- 16 loops
aligned blocks
libc memcpy 2.764299 s
rep movsw 2.767497 s
asm loop 2.826478 s
i686_copyin 2.626365 s
i686_copyin2 2.625997 s
MMX memcpy using MOVQ 2.570352 s
with mingo's MOVUSB (prefetch, non-temporal) 1.767928 s
with simple MOVUSB (no prefetch) 2.685339 s
arjanv's MOVQ (with prefetch) 2.521904 s
arjanv's MOVNTQ (with prefetch, for Athlon) 1.575878 s
arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA 1.682403 s
+0/+4 moderately unaligned blocks
libc memcpy 2.823552 s
MMX memcpy using MOVQ 2.580810 s
with mingo's MOVUSB (prefetch, non-temporal) 1.767096 s
with simple MOVUSB (no prefetch) 2.707592 s
+10/+13 cruelly unaligned blocks
libc memcpy 2.713003 s
MMX memcpy using MOVQ 2.668149 s
with mingo's MOVUSB (prefetch, non-temporal) 1.975933 s
with simple MOVUSB (no prefetch) 2.779886 s
addr1=0x804c000 addr2=0x814c000
memcpy 128kB -- 8 loops
aligned blocks
libc memcpy 2.766495 s
rep movsw 2.767812 s
asm loop 2.827207 s
i686_copyin 2.626962 s
i686_copyin2 2.618238 s
MMX memcpy using MOVQ 2.570613 s
with mingo's MOVUSB (prefetch, non-temporal) 1.775084 s
with simple MOVUSB (no prefetch) 2.684980 s
arjanv's MOVQ (with prefetch) 2.521927 s
arjanv's MOVNTQ (with prefetch, for Athlon) 1.575982 s
arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA 1.682593 s
+0/+4 moderately unaligned blocks
libc memcpy 2.817080 s
MMX memcpy using MOVQ 2.588906 s
with mingo's MOVUSB (prefetch, non-temporal) 1.766316 s
with simple MOVUSB (no prefetch) 2.706869 s
+10/+13 cruelly unaligned blocks
libc memcpy 2.711935 s
MMX memcpy using MOVQ 2.674179 s
with mingo's MOVUSB (prefetch, non-temporal) 1.963451 s
with simple MOVUSB (no prefetch) 2.780192 s
addr1=0x804c000 addr2=0x814c000
memcpy 256kB -- 4 loops
aligned blocks
libc memcpy 2.766599 s
rep movsw 2.767784 s
asm loop 2.828783 s
i686_copyin 2.619552 s
i686_copyin2 2.627876 s
MMX memcpy using MOVQ 2.571837 s
with mingo's MOVUSB (prefetch, non-temporal) 1.776927 s
with simple MOVUSB (no prefetch) 2.686435 s
arjanv's MOVQ (with prefetch) 2.523016 s
arjanv's MOVNTQ (with prefetch, for Athlon) 1.577187 s
arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA 1.675317 s
+0/+4 moderately unaligned blocks
libc memcpy 2.827427 s
MMX memcpy using MOVQ 2.590171 s
with mingo's MOVUSB (prefetch, non-temporal) 1.769825 s
with simple MOVUSB (no prefetch) 2.708104 s
+10/+13 cruelly unaligned blocks
libc memcpy 2.710984 s
MMX memcpy using MOVQ 2.674800 s
with mingo's MOVUSB (prefetch, non-temporal) 1.972209 s
with simple MOVUSB (no prefetch) 2.787717 s
addr1=0x804c000 addr2=0x814c000
memcpy 512kB -- 2 loops
aligned blocks
libc memcpy 2.766847 s
rep movsw 2.767707 s
asm loop 2.811354 s
i686_copyin 2.626655 s
i686_copyin2 2.626876 s
MMX memcpy using MOVQ 2.571146 s
with mingo's MOVUSB (prefetch, non-temporal) 1.775052 s
with simple MOVUSB (no prefetch) 2.684812 s
arjanv's MOVQ (with prefetch) 2.513970 s
arjanv's MOVNTQ (with prefetch, for Athlon) 1.576279 s
arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA 1.683077 s
+0/+4 moderately unaligned blocks
libc memcpy 2.827907 s
MMX memcpy using MOVQ 2.589284 s
with mingo's MOVUSB (prefetch, non-temporal) 1.767601 s
with simple MOVUSB (no prefetch) 2.706929 s
+10/+13 cruelly unaligned blocks
libc memcpy 2.702820 s
MMX memcpy using MOVQ 2.675799 s
with mingo's MOVUSB (prefetch, non-temporal) 1.969484 s
with simple MOVUSB (no prefetch) 2.785175 s
--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memcpy_bench.c"
/* -*- c-file-style: "linux" -*- */
/* memcpy speed benchmark using different i86-specific routines.
*
* Framework (C) 2001 by Martin Pool <mbp@samba.org>, based on speed.c
* by tridge.
*
* Routines lifted from all kinds of places.
*
* You must not use floating-point code anywhere in this application
* because it scribbles on the FP state and does not reset it. */
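#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>	/* memcpy(), memset() */
#include <sys/time.h>
/* Defined elsewhere; these routines are not part of this attachment.
 * They return void * so that they match the function pointer type
 * taken by wrap() below. */
void *memcpy_rep_movsl(void *to, const void *from, size_t len);
void *memcpy_words(void *to, const void *from, size_t len);
void *i686_copyin(void *to, const void *from, size_t len);
void *i686_copyin2(void *to, const void *from, size_t len);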
#define MAX(a,b) ((a)>(b)?(a):(b))
#define MIN(a,b) ((a)<(b)?(a):(b))
#include <sys/resource.h>
struct rusage tp1,tp2;
static void start_timer()
{
getrusage(RUSAGE_SELF,&tp1);
}
static long end_timer()
{
getrusage(RUSAGE_SELF,&tp2);
#if 0
printf ("tp1 = %ld.%06ld, tp2 = %ld.%06ld\n",
(long) tp1.ru_utime.tv_sec, (long) tp1.ru_utime.tv_usec,
(long) tp2.ru_utime.tv_sec, (long) tp2.ru_utime.tv_usec);
#endif
return ((tp2.ru_utime.tv_sec - tp1.ru_utime.tv_sec) * 1000000 +
(tp2.ru_utime.tv_usec - tp1.ru_utime.tv_usec));
}
/*
* By Ingo Molnar and Doug Ledford; hacked up to remove
* kernel-specific stuff like saving/restoring float registers.
*
* http://people.redhat.com/mingo/mmx-patches/mmx-2.3.99-A0 */
void *
memcpy_movusb (void *to, const void *from, size_t n)
{
size_t size;
#define STEP 0x20
#define ALIGN 0x10
if ((unsigned long)to & (ALIGN-1)) {
size = ALIGN - ((unsigned long)to & (ALIGN-1));
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
"movups %%xmm0,(%1)\n\t"
:
: "r" (from),
"r" (to));
n -= size;
from += size;
to += size;
}
/*
* If the copy would have tailings, take care of them
* now instead of later
*/
if (n & (ALIGN-1)) {
size = n - ALIGN;
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
"movups %%xmm0,(%1)\n\t"
:
: "r" (from + size),
"r" (to + size));
n &= ~(ALIGN-1);
}
/*
* Prefetch the first two cachelines now.
*/
__asm__ __volatile__("prefetchnta 0x00(%0)\n\t"
"prefetchnta 0x20(%0)\n\t"
:
: "r" (from));
while (n >= STEP) {
__asm__ __volatile__(
"movups 0x00(%0),%%xmm0\n\t"
"movups 0x10(%0),%%xmm1\n\t"
"movntps %%xmm0,0x00(%1)\n\t"
"movntps %%xmm1,0x10(%1)\n\t"
:
: "r" (from), "r" (to)
: "memory");
from += STEP;
/*
* Note: Intermixing the prefetch at *exactly* this point
* in time has been shown to be the fastest possible.
* Timing these prefetch instructions is a complete black
* art with nothing but trial and error showing the way.
* To that extent, this optimum version was found by using
* a userland version of this routine that we clocked for
* lots of runs. We then fiddled with ordering until we
* settled on our highest speed routines. So, the long
* and short of this is, don't mess with instruction ordering
* here or suffer performance penalties you will.
*/
__asm__ __volatile__(
"prefetchnta 0x20(%0)\n\t"
:
: "r" (from));
to += STEP;
n -= STEP;
}
return to;
}
void *
memcpy_simple_movusb (void *to, const void *from, size_t n)
{
size_t size;
#define STEP 0x20
#define ALIGN 0x10
if ((unsigned long)to & (ALIGN-1)) {
size = ALIGN - ((unsigned long)to & (ALIGN-1));
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
"movups %%xmm0,(%1)\n\t"
:
: "r" (from),
"r" (to));
n -= size;
from += size;
to += size;
}
/*
* If the copy would have tailings, take care of them
* now instead of later
*/
if (n & (ALIGN-1)) {
size = n - ALIGN;
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
"movups %%xmm0,(%1)\n\t"
:
: "r" (from + size),
"r" (to + size));
n &= ~(ALIGN-1);
}
while (n >= STEP) {
__asm__ __volatile__(
"movups 0x00(%0),%%xmm0\n\t"
"movups 0x10(%0),%%xmm1\n\t"
"movups %%xmm0,0x00(%1)\n\t"
"movups %%xmm1,0x10(%1)\n\t"
:
: "r" (from), "r" (to)
: "memory");
from += STEP;
to += STEP;
n -= STEP;
}
return to;
}
/* From Linux 2.4.8. I think this must be aligned. */
void *
memcpy_mmx (void *to, const void *from, size_t len)
{
int i;
for(i = 0; i < len / 64; i++) {
__asm__ __volatile__ (
"movq (%0), %%mm0\n"
"\tmovq 8(%0), %%mm1\n"
"\tmovq 16(%0), %%mm2\n"
"\tmovq 24(%0), %%mm3\n"
"\tmovq %%mm0, (%1)\n"
"\tmovq %%mm1, 8(%1)\n"
"\tmovq %%mm2, 16(%1)\n"
"\tmovq %%mm3, 24(%1)\n"
"\tmovq 32(%0), %%mm0\n"
"\tmovq 40(%0), %%mm1\n"
"\tmovq 48(%0), %%mm2\n"
"\tmovq 56(%0), %%mm3\n"
"\tmovq %%mm0, 32(%1)\n"
"\tmovq %%mm1, 40(%1)\n"
"\tmovq %%mm2, 48(%1)\n"
"\tmovq %%mm3, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from += 64;
to += 64;
}
if (len & 63)
memcpy(to, from, len & 63);
return to;
}
static void print_time (char const *msg,
long long loops,
long t)
{
printf(" %-50s %ld.%06ld s\n", msg, t/1000000,
t % 1000000);
}
void *
memcpy_arjanv (void *to, const void *from, size_t len)
{
int i;
__asm__ __volatile__ (
"1: prefetchnta (%0)\n"
" prefetchnta 64(%0)\n"
" prefetchnta 128(%0)\n"
" prefetchnta 192(%0)\n"
" prefetchnta 256(%0)\n"
: : "r" (from) );
for(i=0; i<len/64; i++) {
__asm__ __volatile__ (
"1: prefetchnta 320(%0)\n"
"2: movq (%0), %%mm0\n"
" movq 8(%0), %%mm1\n"
" movq 16(%0), %%mm2\n"
" movq 24(%0), %%mm3\n"
" movq %%mm0, (%1)\n"
" movq %%mm1, 8(%1)\n"
" movq %%mm2, 16(%1)\n"
" movq %%mm3, 24(%1)\n"
" movq 32(%0), %%mm0\n"
" movq 40(%0), %%mm1\n"
" movq 48(%0), %%mm2\n"
" movq 56(%0), %%mm3\n"
" movq %%mm0, 32(%1)\n"
" movq %%mm1, 40(%1)\n"
" movq %%mm2, 48(%1)\n"
" movq %%mm3, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from+=64;
to+=64;
}
/*
*Now do the tail of the block
*/
if (len&63)
memcpy(to, from, len&63);
return to;
}
void *
memcpy_arjanv_movntq (void *to, const void *from, size_t len)
{
int i;
__asm__ __volatile__ (
"1: prefetchnta (%0)\n"
" prefetchnta 64(%0)\n"
" prefetchnta 128(%0)\n"
" prefetchnta 192(%0)\n"
: : "r" (from) );
for(i=0; i<len/64; i++) {
__asm__ __volatile__ (
" prefetchnta 200(%0)\n"
" movq (%0), %%mm0\n"
" movq 8(%0), %%mm1\n"
" movq 16(%0), %%mm2\n"
" movq 24(%0), %%mm3\n"
" movq 32(%0), %%mm4\n"
" movq 40(%0), %%mm5\n"
" movq 48(%0), %%mm6\n"
" movq 56(%0), %%mm7\n"
" movntq %%mm0, (%1)\n"
" movntq %%mm1, 8(%1)\n"
" movntq %%mm2, 16(%1)\n"
" movntq %%mm3, 24(%1)\n"
" movntq %%mm4, 32(%1)\n"
" movntq %%mm5, 40(%1)\n"
" movntq %%mm6, 48(%1)\n"
" movntq %%mm7, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from+=64;
to+=64;
}
/*
*Now do the tail of the block
*/
if (len&63)
memcpy(to, from, len&63);
return to;
}
void *
memcpy_arjanv_interleave (void *to, const void *from, size_t len)
{
int i;
__asm__ __volatile__ (
"1: prefetchnta (%0)\n"
" prefetchnta 64(%0)\n"
" prefetchnta 128(%0)\n"
" prefetchnta 192(%0)\n"
: : "r" (from) );
for(i=0; i<len/64; i++) {
__asm__ __volatile__ (
" prefetchnta 168(%0)\n"
" movq (%0), %%mm0\n"
" movntq %%mm0, (%1)\n"
" movq 8(%0), %%mm1\n"
" movntq %%mm1, 8(%1)\n"
" movq 16(%0), %%mm2\n"
" movntq %%mm2, 16(%1)\n"
" movq 24(%0), %%mm3\n"
" movntq %%mm3, 24(%1)\n"
" movq 32(%0), %%mm4\n"
" movntq %%mm4, 32(%1)\n"
" movq 40(%0), %%mm5\n"
" movntq %%mm5, 40(%1)\n"
" movq 48(%0), %%mm6\n"
" movntq %%mm6, 48(%1)\n"
" movq 56(%0), %%mm7\n"
" movntq %%mm7, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from+=64;
to+=64;
}
/*
*Now do the tail of the block
*/
if (len&63)
memcpy(to, from, len&63);
return to;
}
static void wrap (char *p1,
char *p2,
size_t size,
long loops,
void *(*bfn) (void *, const void *, size_t),
const char *msg)
{
long t;
int i, j;
char *tmp1, *tmp2;
memset(p2,42,size);
tmp1 = p1;
tmp2 = p2;
start_timer();
for (j = 0; j < 1024; j++) {
for (i=0; i<loops; i++) {
bfn (tmp1, tmp2, size);
tmp1 += size;
tmp2 += size;
}
tmp1 = p1;
tmp2 = p2;
}
t = end_timer();
print_time (msg, loops, t);
}
static void memcpy_test(size_t size)
{
long loops = 1024*1024 / size;
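/* We need to make sure the blocks are *VERY* aligned, because
MMX is potentially pretty fussy. The extra 64 bytes keep the
+4 and +10/+13 unaligned runs below from running past the end
of either buffer. */
char *p1 = (char *) malloc (1024 * 1024 + 64);
char *p2 = (char *) malloc (1024 * 1024 + 64);
if (p1 == NULL || p2 == NULL) {
perror("malloc");
exit(1);
}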
printf("addr1=%p addr2=%p\n", p1, p2);
if (size > 2048)
printf ("memcpy %lukB -- %ld loops\n", (unsigned long) (size>>10), loops);
else
printf ("memcpy %luB -- %ld loops\n", (unsigned long) size, loops);
printf (" aligned blocks\n");
wrap (p1, p2, size, loops, memcpy, "libc memcpy");
/* The label says "rep movsw", but the routine under test is memcpy_rep_movsl. */
wrap (p1, p2, size, loops, memcpy_rep_movsl, "rep movsw");
wrap (p1, p2, size, loops, memcpy_words, "asm loop");
wrap (p1, p2, size, loops, i686_copyin, "i686_copyin");
wrap (p1, p2, size, loops, i686_copyin2, "i686_copyin2");
wrap (p1, p2, size, loops, memcpy_mmx,
"MMX memcpy using MOVQ");
wrap(p1, p2, size, loops, memcpy_movusb,
"with mingo's MOVUSB (prefetch, non-temporal)");
wrap (p1, p2, size, loops, memcpy_simple_movusb,
"with simple MOVUSB (no prefetch)");
wrap (p1, p2, size, loops, memcpy_arjanv,
"arjanv's MOVQ (with prefetch)");
wrap (p1, p2, size, loops, memcpy_arjanv_movntq,
"arjanv's MOVNTQ (with prefetch, for Athlon)");
wrap (p1, p2, size, loops, memcpy_arjanv_interleave,
"arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA");
printf (" +0/+4 moderately unaligned blocks\n");
wrap (p1, p2+4, size, loops, memcpy, "libc memcpy");
wrap (p1, p2+4, size, loops, memcpy_mmx,
"MMX memcpy using MOVQ");
wrap(p1, p2+4, size, loops, memcpy_movusb,
"with mingo's MOVUSB (prefetch, non-temporal)");
wrap (p1, p2+4, size, loops, memcpy_simple_movusb,
"with simple MOVUSB (no prefetch)");
printf (" +10/+13 cruelly unaligned blocks\n");
wrap (p1+10, p2+13, size, loops, memcpy, "libc memcpy");
wrap (p1+10, p2+13, size, loops, memcpy_mmx,
"MMX memcpy using MOVQ");
wrap(p1+10, p2+13, size, loops, memcpy_movusb,
"with mingo's MOVUSB (prefetch, non-temporal)");
wrap (p1+10, p2+13, size, loops, memcpy_simple_movusb,
"with simple MOVUSB (no prefetch)");
puts("");
free(p1); free(p2);
}
int main (void)
{
memcpy_test(64);
#if 0
memcpy_test(1<<7);
memcpy_test(1<<8);
memcpy_test(1<<9);
#endif
memcpy_test(1024);
#if 0
memcpy_test(1<<11);
#endif
memcpy_test(4096);
#if 0
memcpy_test(1<<13);
memcpy_test(1<<14);
memcpy_test(1<<15);
#endif
memcpy_test(1<<16);
memcpy_test(1<<17);
memcpy_test(1<<18);
memcpy_test(1<<19);
#if 0
memcpy_test(1<<20);
#endif
return 0;
}
--EeQfGwPcQSOJBaQU--