Subject: Re: Performance of various memcpy()'s
To: None <tech-perform@netbsd.org>
From: Bang Jun-Young <junyoung@mogua.com>
List: tech-perform
Date: 10/23/2002 23:54:42
--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Oct 16, 2002 at 04:18:30AM +0900, Bang Jun-Young wrote:
> Hi,
> 
> About 14 months ago, I had some discussion on memcpy performance on
> the i386 platform here. Months later, I took a look into it again,
> and now I'm coming back with (not-so-)new benchmark results
> (attached). The tests were performed on an Athlon XP 1800 with 256MB
> of DDR memory.
> 
> From the results, it's obvious that a memcpy() using MMX insns is
> the best for in-cache sized data, typically 50-100% faster than the
> plain old memcpy for data <= 32 KB.

This time I got results with out-of-cache data. To eliminate cache
effects, I used 1MB of source data and 1MB of destination data, and
repeated the memcpy*()'s 1MB / datasize times in the inner loop, and
1024 times in the outer loop. The total amount of data copied was the
same 1GB as before, but the results were quite different from those
with in-cache data.

In this test, the non-temporal MOVNTQ instruction was obviously a big
win. Since it doesn't pollute cache lines, you can get nearly 2x the
performance when copying data that is not in the cache.
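
The core of such a non-temporal copy loop is just the following (a
minimal sketch, assuming src and dst are char pointers and n is a
multiple of 8; the full routine with prefetching is
memcpy_arjanv_movntq() in the attached source):

	while (n >= 8) {
		__asm__ __volatile__(
			"movq   (%0), %%mm0\n\t"  /* load 8 bytes (cached) */
			"movntq %%mm0, (%1)\n\t"  /* store 8 bytes, bypassing cache */
			: : "r" (src), "r" (dst) : "memory");
		src += 8; dst += 8; n -= 8;
	}
	__asm__ __volatile__("sfence");	/* drain write-combining buffers */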

Also, I found that my MMX-optimized i686_copyin() is faster than the
plain old memcpy only for data > 2-3 KB. It seems that saving/restoring
the FP state to/from the stack is quite expensive for small copies (it
takes 108 bytes of memcpying between the processor and memory, plus
some overhead).
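
(The 108 bytes are the FNSAVE/FRSTOR state image in 32-bit mode; the
save/restore around the MMX copy loop is essentially:)

	char fpstate[108];	/* x87/MMX state image: 108 bytes */

	__asm__ __volatile__("fnsave %0" : "=m" (fpstate));
	/* ... MMX copy loop ... */
	__asm__ __volatile__("frstor %0" : : "m" (fpstate));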

I'll come up with finalized i686_copyin/out() soon.

Jun-Young

-- 
Bang Jun-Young <junyoung@mogua.com>

--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memcpy_bench.uncached.txt"

addr1=0x804c000 addr2=0x814c000
memcpy 64B -- 16384 loops
  aligned blocks
      libc memcpy                                        2.893993 s
      rep movsw                                          2.859771 s
      asm loop                                           2.669005 s
      i686_copyin                                        2.910439 s
      i686_copyin2                                       2.885610 s
      MMX memcpy using MOVQ                              2.675665 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.949940 s
      with simple MOVUSB (no prefetch)                   2.719580 s
      arjanv's MOVQ (with prefetch)                      2.938366 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.552954 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.545507 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.723010 s
      MMX memcpy using MOVQ                              2.893861 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.093558 s
      with simple MOVUSB (no prefetch)                   2.973506 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        3.125790 s
      MMX memcpy using MOVQ                              2.661766 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.740727 s
      with simple MOVUSB (no prefetch)                   2.715262 s

addr1=0x804c000 addr2=0x814c000
memcpy 1024B -- 1024 loops
  aligned blocks
      libc memcpy                                        2.761827 s
      rep movsw                                          2.764354 s
      asm loop                                           2.820187 s
      i686_copyin                                        2.647857 s
      i686_copyin2                                       2.647648 s
      MMX memcpy using MOVQ                              2.574933 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.870815 s
      with simple MOVUSB (no prefetch)                   2.684049 s
      arjanv's MOVQ (with prefetch)                      2.518789 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.588186 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.698439 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.800100 s
      MMX memcpy using MOVQ                              2.588999 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.852392 s
      with simple MOVUSB (no prefetch)                   2.723908 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.749374 s
      MMX memcpy using MOVQ                              2.683349 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.203756 s
      with simple MOVUSB (no prefetch)                   2.750306 s

addr1=0x804c000 addr2=0x814c000
memcpy 4kB -- 256 loops
  aligned blocks
      libc memcpy                                        2.758545 s
      rep movsw                                          2.759825 s
      asm loop                                           2.818919 s
      i686_copyin                                        2.633134 s
      i686_copyin2                                       2.641534 s
      MMX memcpy using MOVQ                              2.571201 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.795929 s
      with simple MOVUSB (no prefetch)                   2.681924 s
      arjanv's MOVQ (with prefetch)                      2.512153 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.577637 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.688840 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.828267 s
      MMX memcpy using MOVQ                              2.584795 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.773777 s
      with simple MOVUSB (no prefetch)                   2.691957 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.711029 s
      MMX memcpy using MOVQ                              2.690554 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.047554 s
      with simple MOVUSB (no prefetch)                   2.782641 s

addr1=0x804c000 addr2=0x814c000
memcpy 64kB -- 16 loops
  aligned blocks
      libc memcpy                                        2.764299 s
      rep movsw                                          2.767497 s
      asm loop                                           2.826478 s
      i686_copyin                                        2.626365 s
      i686_copyin2                                       2.625997 s
      MMX memcpy using MOVQ                              2.570352 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.767928 s
      with simple MOVUSB (no prefetch)                   2.685339 s
      arjanv's MOVQ (with prefetch)                      2.521904 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.575878 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.682403 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.823552 s
      MMX memcpy using MOVQ                              2.580810 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.767096 s
      with simple MOVUSB (no prefetch)                   2.707592 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.713003 s
      MMX memcpy using MOVQ                              2.668149 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.975933 s
      with simple MOVUSB (no prefetch)                   2.779886 s

addr1=0x804c000 addr2=0x814c000
memcpy 128kB -- 8 loops
  aligned blocks
      libc memcpy                                        2.766495 s
      rep movsw                                          2.767812 s
      asm loop                                           2.827207 s
      i686_copyin                                        2.626962 s
      i686_copyin2                                       2.618238 s
      MMX memcpy using MOVQ                              2.570613 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.775084 s
      with simple MOVUSB (no prefetch)                   2.684980 s
      arjanv's MOVQ (with prefetch)                      2.521927 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.575982 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.682593 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.817080 s
      MMX memcpy using MOVQ                              2.588906 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.766316 s
      with simple MOVUSB (no prefetch)                   2.706869 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.711935 s
      MMX memcpy using MOVQ                              2.674179 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.963451 s
      with simple MOVUSB (no prefetch)                   2.780192 s

addr1=0x804c000 addr2=0x814c000
memcpy 256kB -- 4 loops
  aligned blocks
      libc memcpy                                        2.766599 s
      rep movsw                                          2.767784 s
      asm loop                                           2.828783 s
      i686_copyin                                        2.619552 s
      i686_copyin2                                       2.627876 s
      MMX memcpy using MOVQ                              2.571837 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.776927 s
      with simple MOVUSB (no prefetch)                   2.686435 s
      arjanv's MOVQ (with prefetch)                      2.523016 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.577187 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.675317 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.827427 s
      MMX memcpy using MOVQ                              2.590171 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.769825 s
      with simple MOVUSB (no prefetch)                   2.708104 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.710984 s
      MMX memcpy using MOVQ                              2.674800 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.972209 s
      with simple MOVUSB (no prefetch)                   2.787717 s

addr1=0x804c000 addr2=0x814c000
memcpy 512kB -- 2 loops
  aligned blocks
      libc memcpy                                        2.766847 s
      rep movsw                                          2.767707 s
      asm loop                                           2.811354 s
      i686_copyin                                        2.626655 s
      i686_copyin2                                       2.626876 s
      MMX memcpy using MOVQ                              2.571146 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.775052 s
      with simple MOVUSB (no prefetch)                   2.684812 s
      arjanv's MOVQ (with prefetch)                      2.513970 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.576279 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.683077 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.827907 s
      MMX memcpy using MOVQ                              2.589284 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.767601 s
      with simple MOVUSB (no prefetch)                   2.706929 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.702820 s
      MMX memcpy using MOVQ                              2.675799 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.969484 s
      with simple MOVUSB (no prefetch)                   2.785175 s


--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memcpy_bench.c"

/* -*- c-file-style: "linux" -*- */

/* memcpy speed benchmark using different i386-specific routines. 
 *
 * Framework (C) 2001 by Martin Pool <mbp@samba.org>, based on speed.c
 * by tridge.
 *
 * Routines lifted from all kinds of places.
 *
 * You must not use floating-point code anywhere in this application
 * because it scribbles on the FP state and does not reset it.  */


#include <stdio.h>
#include <stdlib.h>
#include <string.h>		/* memcpy(), memset() */
#include <sys/time.h>

/* Copy routines in assembler, defined elsewhere; they must match
 * the function pointer type that wrap() takes. */
void *memcpy_rep_movsl(void *to, const void *from, size_t len);
void *memcpy_words(void *to, const void *from, size_t len);
void *i686_copyin(void *to, const void *from, size_t len);
void *i686_copyin2(void *to, const void *from, size_t len);

#define MAX(a,b) ((a)>(b)?(a):(b))
#define MIN(a,b) ((a)<(b)?(a):(b))

#include <sys/resource.h>
struct rusage tp1,tp2;

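/* Timing via getrusage(): user CPU time only, in microseconds. */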
static void start_timer(void)
{
	getrusage(RUSAGE_SELF,&tp1);
}


static long end_timer(void)
{
	getrusage(RUSAGE_SELF,&tp2);
#if 0
	printf ("tp1 = %ld.%05ld, tp2 = %ld.%05ld\n", 
		(long) tp1.ru_utime.tv_sec, (long) tp1.ru_utime.tv_usec, 
		(long) tp2.ru_utime.tv_sec, (long) tp2.ru_utime.tv_usec);
#endif

	return ((tp2.ru_utime.tv_sec - tp1.ru_utime.tv_sec) * 1000000 + 
		(tp2.ru_utime.tv_usec - tp1.ru_utime.tv_usec));
}




/*
 * By Ingo Molnar and Doug Ledford; hacked up to remove
 * kernel-specific stuff like saving/restoring float registers.
 *
 * http://people.redhat.com/mingo/mmx-patches/mmx-2.3.99-A0 */
void *
memcpy_movusb (void *to, const void *from, size_t n)
{
	size_t size;

#define STEP 0x20
#define ALIGN 0x10
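	/*
	 * Head fixup: copy 16 unaligned bytes, then advance only far
	 * enough to align the destination; the overlapping bytes are
	 * simply stored twice.  (Assumes n >= 16.)
	 */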
	if ((unsigned long)to & (ALIGN-1)) {
		size = ALIGN - ((unsigned long)to & (ALIGN-1));
		__asm__ __volatile__("movups (%0),%%xmm0\n\t"
				     "movups %%xmm0,(%1)\n\t"
				     :
				     : "r" (from),
				     "r" (to));
		n -= size;
		from += size;
		to += size;
	}
/*
 * If the copy would have tailings, take care of them
 * now instead of later
 */
	if (n & (ALIGN-1)) {
		size = n - ALIGN;
		__asm__ __volatile__("movups (%0),%%xmm0\n\t"
				     "movups %%xmm0,(%1)\n\t"
				     :
				     : "r" (from + size),
				     "r" (to + size));
		n &= ~(ALIGN-1);
	}
/*
 * Prefetch the first two cachelines now.
 */
	__asm__ __volatile__("prefetchnta 0x00(%0)\n\t"
			     "prefetchnta 0x20(%0)\n\t"
			     :
			     : "r" (from));
	  
	while (n >= STEP) {
		__asm__ __volatile__(
			"movups 0x00(%0),%%xmm0\n\t"
			"movups 0x10(%0),%%xmm1\n\t"
			"movntps %%xmm0,0x00(%1)\n\t"
			"movntps %%xmm1,0x10(%1)\n\t"
			: 
			: "r" (from), "r" (to)
			: "memory");
		from += STEP;
		/*
		 * Note: Intermixing the prefetch at *exactly* this point
		 * in time has been shown to be the fastest possible.
		 * Timing these prefetch instructions is a complete black
		 * art with nothing but trial and error showing the way.
		 * To that extent, this optimum version was found by using
		 * a userland version of this routine that we clocked for
		 * lots of runs.  We then fiddled with ordering until we
		 * settled on our highest speed routines.  So, the long
		 * and short of this is, don't mess with instruction ordering
		 * here or suffer performance penalties you will.
		 */
		__asm__ __volatile__(
			"prefetchnta 0x20(%0)\n\t"
			: 
			: "r" (from));
		to += STEP;
		n -= STEP;
	}
	
	return to;
}

void *
memcpy_simple_movusb (void *to, const void *from, size_t n)
{
	size_t size;

#define STEP 0x20
#define ALIGN 0x10
	if ((unsigned long)to & (ALIGN-1)) {
		size = ALIGN - ((unsigned long)to & (ALIGN-1));
		__asm__ __volatile__("movups (%0),%%xmm0\n\t"
				     "movups %%xmm0,(%1)\n\t"
				     :
				     : "r" (from),
				     "r" (to));
		n -= size;
		from += size;
		to += size;
	}
/*
 * If the copy would have tailings, take care of them
 * now instead of later
 */
	if (n & (ALIGN-1)) {
		size = n - ALIGN;
		__asm__ __volatile__("movups (%0),%%xmm0\n\t"
				     "movups %%xmm0,(%1)\n\t"
				     :
				     : "r" (from + size),
				     "r" (to + size));
		n &= ~(ALIGN-1);
	}

	while (n >= STEP) {
		__asm__ __volatile__(
			"movups 0x00(%0),%%xmm0\n\t"
			"movups 0x10(%0),%%xmm1\n\t"
			"movups %%xmm0,0x00(%1)\n\t"
			"movups %%xmm1,0x10(%1)\n\t"
			: 
			: "r" (from), "r" (to)
			: "memory");
		from += STEP;
		to += STEP;
		n -= STEP;
	}
	
	return to;
}


/* From Linux 2.4.8.  I think this must be aligned. */
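/* Note: like the other MMX routines here, this leaves the MMX/FP
 * state dirty (no emms) -- see the warning at the top of this file. */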
void *
memcpy_mmx (void *to, const void *from, size_t len)
{
	int i;

	for(i = 0; i < len / 64; i++) {
      		__asm__ __volatile__ (
		   "movq (%0), %%mm0\n"
		   "\tmovq 8(%0), %%mm1\n"
		   "\tmovq 16(%0), %%mm2\n"
		   "\tmovq 24(%0), %%mm3\n"
		   "\tmovq %%mm0, (%1)\n"
		   "\tmovq %%mm1, 8(%1)\n"
		   "\tmovq %%mm2, 16(%1)\n"
		   "\tmovq %%mm3, 24(%1)\n"
		   "\tmovq 32(%0), %%mm0\n"
		   "\tmovq 40(%0), %%mm1\n"
		   "\tmovq 48(%0), %%mm2\n"
		   "\tmovq 56(%0), %%mm3\n"
		   "\tmovq %%mm0, 32(%1)\n"
		   "\tmovq %%mm1, 40(%1)\n"
		   "\tmovq %%mm2, 48(%1)\n"
		   "\tmovq %%mm3, 56(%1)\n"
		   : : "r" (from), "r" (to) : "memory");
		from += 64;
		to += 64;
	}

	if (len & 63)
		memcpy(to, from, len & 63);

	return to;
}

static void print_time (char const *msg, 
			long long loops,
			long t)
{
	printf("      %-50s %ld.%06ld s\n", msg, t/1000000,
	       t % 1000000);
}

void *
memcpy_arjanv (void *to, const void *from, size_t len)
{
	int i;

	__asm__ __volatile__ (
		"1: prefetchnta (%0)\n"
		"   prefetchnta 64(%0)\n"
		"   prefetchnta 128(%0)\n"
		"   prefetchnta 192(%0)\n"
		"   prefetchnta 256(%0)\n"
		: : "r" (from) );

	for(i=0; i<len/64; i++) {
		__asm__ __volatile__ (
			"1: prefetchnta 320(%0)\n"
			"2: movq (%0), %%mm0\n"
			"   movq 8(%0), %%mm1\n"
			"   movq 16(%0), %%mm2\n"
			"   movq 24(%0), %%mm3\n"
			"   movq %%mm0, (%1)\n"
			"   movq %%mm1, 8(%1)\n"
			"   movq %%mm2, 16(%1)\n"
			"   movq %%mm3, 24(%1)\n"
			"   movq 32(%0), %%mm0\n"
			"   movq 40(%0), %%mm1\n"
			"   movq 48(%0), %%mm2\n"
			"   movq 56(%0), %%mm3\n"
			"   movq %%mm0, 32(%1)\n"
			"   movq %%mm1, 40(%1)\n"
			"   movq %%mm2, 48(%1)\n"
			"   movq %%mm3, 56(%1)\n"
			: : "r" (from), "r" (to) : "memory");
		from+=64;
		to+=64;
	}

	/*
	 * Now do the tail of the block
	 */
	if (len&63)
		memcpy(to, from, len&63);

	return to;
}

void *
memcpy_arjanv_movntq (void *to, const void *from, size_t len)
{
	int i;

	__asm__ __volatile__ (
		"1: prefetchnta (%0)\n"
		"   prefetchnta 64(%0)\n"
		"   prefetchnta 128(%0)\n"
		"   prefetchnta 192(%0)\n"
		: : "r" (from) );

	for(i=0; i<len/64; i++) {
		__asm__ __volatile__ (
			"   prefetchnta 200(%0)\n"
			"   movq (%0), %%mm0\n"
			"   movq 8(%0), %%mm1\n"
			"   movq 16(%0), %%mm2\n"
			"   movq 24(%0), %%mm3\n"
			"   movq 32(%0), %%mm4\n"
			"   movq 40(%0), %%mm5\n"
			"   movq 48(%0), %%mm6\n"
			"   movq 56(%0), %%mm7\n"
			"   movntq %%mm0, (%1)\n"
			"   movntq %%mm1, 8(%1)\n"
			"   movntq %%mm2, 16(%1)\n"
			"   movntq %%mm3, 24(%1)\n"
			"   movntq %%mm4, 32(%1)\n"
			"   movntq %%mm5, 40(%1)\n"
			"   movntq %%mm6, 48(%1)\n"
			"   movntq %%mm7, 56(%1)\n"
			: : "r" (from), "r" (to) : "memory");
		from+=64;
		to+=64;
	}
	/*
	 * Now do the tail of the block
	 */
	if (len&63)
		memcpy(to, from, len&63);
	
	return to;
}

void *
memcpy_arjanv_interleave (void *to, const void *from, size_t len)
{
	int i;

	__asm__ __volatile__ (
		"1: prefetchnta (%0)\n"
		"   prefetchnta 64(%0)\n"
		"   prefetchnta 128(%0)\n"
		"   prefetchnta 192(%0)\n"
		: : "r" (from) );


	for(i=0; i<len/64; i++) {
		__asm__ __volatile__ (
			"   prefetchnta 168(%0)\n"
			"   movq (%0), %%mm0\n"
			"   movntq %%mm0, (%1)\n"
			"   movq 8(%0), %%mm1\n"
			"   movntq %%mm1, 8(%1)\n"
			"   movq 16(%0), %%mm2\n"
			"   movntq %%mm2, 16(%1)\n"
			"   movq 24(%0), %%mm3\n"
			"   movntq %%mm3, 24(%1)\n"
			"   movq 32(%0), %%mm4\n"
			"   movntq %%mm4, 32(%1)\n"
			"   movq 40(%0), %%mm5\n"
			"   movntq %%mm5, 40(%1)\n"
			"   movq 48(%0), %%mm6\n"
			"   movntq %%mm6, 48(%1)\n"
			"   movq 56(%0), %%mm7\n"
			"   movntq %%mm7, 56(%1)\n"
			: : "r" (from), "r" (to) : "memory");
		from+=64;
		to+=64;
	}
	/*
	 * Now do the tail of the block
	 */
	if (len&63)
		memcpy(to, from, len&63);
	
	return to;
}

static void wrap (char *p1, 
		  char *p2,
		  size_t size,
		  long loops,
		  void *(*bfn) (void *, const void *, size_t),
		  const char *msg)
{
	long t;
	int i, j;
	char *tmp1, *tmp2;
	
	
	memset(p2, 42, loops * size);	/* initialize the whole source area */

	tmp1 = p1;
	tmp2 = p2;

	start_timer();

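	/* 1024 outer iterations x (1MB / size) inner copies = 1GB total */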
	for (j = 0; j < 1024; j++) {
		for (i=0; i<loops; i++) {
			bfn (tmp1, tmp2, size);
			tmp1 += size;
			tmp2 += size;
		}
		tmp1 = p1;
		tmp2 = p2;
	}

	t = end_timer();

	print_time (msg, loops, t);
}

static void memcpy_test(size_t size)
{
	long loops = 1024*1024 / size;

	/* We need to make sure the blocks are *VERY* aligned, because
	   MMX is potentially pretty fussy. */

	char *p1 = (char *) malloc (1024 * 1024 + 64);	/* +64: slack for */
	char *p2 = (char *) malloc (1024 * 1024 + 64);	/* unaligned tests */
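	/* (In practice these 1MB allocations come back page-aligned;
	   the addresses are printed below so this can be verified.) */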

	printf("addr1=%p addr2=%p\n", p1, p2);

	if (size > 2048)
		printf ("memcpy %dkB -- %ld loops\n", (int)(size >> 10), loops);
	else
		printf ("memcpy %dB -- %ld loops\n", (int)size, loops);


	printf ("  aligned blocks\n");

	wrap (p1, p2, size, loops, memcpy, "libc memcpy");
	wrap (p1, p2, size, loops, memcpy_rep_movsl, "rep movsw");
	wrap (p1, p2, size, loops, memcpy_words, "asm loop");
	wrap (p1, p2, size, loops, i686_copyin, "i686_copyin");
	wrap (p1, p2, size, loops, i686_copyin2, "i686_copyin2");
	wrap (p1, p2, size, loops, memcpy_mmx,
		"MMX memcpy using MOVQ");
	wrap(p1, p2, size, loops, memcpy_movusb,
		"with mingo's MOVUSB (prefetch, non-temporal)");
	wrap (p1, p2, size, loops, memcpy_simple_movusb,
	      "with simple MOVUSB (no prefetch)");
	wrap (p1, p2, size, loops, memcpy_arjanv,
	      "arjanv's MOVQ (with prefetch)");
	wrap (p1, p2, size, loops, memcpy_arjanv_movntq,
	      "arjanv's MOVNTQ (with prefetch, for Athlon)");
	wrap (p1, p2, size, loops, memcpy_arjanv_interleave,
	      "arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA");

	printf ("  +0/+4 moderately unaligned blocks\n");

	wrap (p1, p2+4, size, loops, memcpy, "libc memcpy");
	wrap (p1, p2+4, size, loops, memcpy_mmx,
		"MMX memcpy using MOVQ");
	wrap(p1, p2+4, size, loops, memcpy_movusb,
		"with mingo's MOVUSB (prefetch, non-temporal)");
	wrap (p1, p2+4, size, loops, memcpy_simple_movusb,
	      "with simple MOVUSB (no prefetch)");

	printf ("  +10/+13 cruelly unaligned blocks\n");

	wrap (p1+10, p2+13, size, loops, memcpy, "libc memcpy");
	wrap (p1+10, p2+13, size, loops, memcpy_mmx,
		"MMX memcpy using MOVQ");
	wrap(p1+10, p2+13, size, loops, memcpy_movusb,
		"with mingo's MOVUSB (prefetch, non-temporal)");
	wrap (p1+10, p2+13, size, loops, memcpy_simple_movusb,
	      "with simple MOVUSB (no prefetch)");

	puts("");

	free(p1); free(p2);
}


int main (void)
{
	memcpy_test(64);
#if 0
	memcpy_test(1<<7);
	memcpy_test(1<<8);
	memcpy_test(1<<9);
#endif
	memcpy_test(1024);
#if 0
	memcpy_test(1<<11);
#endif
	memcpy_test(4096);
#if 0
	memcpy_test(1<<13);
	memcpy_test(1<<14);
	memcpy_test(1<<15);
#endif
	memcpy_test(1<<16);
	memcpy_test(1<<17);
	memcpy_test(1<<18);
	memcpy_test(1<<19);
#if 0
	memcpy_test(1<<20);
#endif
	return 0;
}

--EeQfGwPcQSOJBaQU--