Subject: Re: bzero
To: Martin J. Laubach <mjl@emsi.priv.at>
From: Simon Burge <simonb@wasabisystems.com>
List: port-powerpc
Date: 11/04/2001 15:56:25
"Martin J. Laubach" wrote:

>   So I've been entertaining myself by writing several bzero
> implementations in assembler. Before I place this into libc
> and libkern, could you please give it a good beating? I have
> placed it for download at
> 
>   http://rhubarb.emsi.priv.at/download/bzero.tar.gz
> 
>   Please unpack the archive, then run ./compile and tell me
> the results. On my G4, this looks like
> 
> 	celery:198 [/tmp] % ./compile
> 	Compiling
> 	Running regression tests
> 	ok algorithm 0 (Original C)
> 	ok algorithm 1 (Simple byte)
> 	ok algorithm 2 (Simple word)
> 	ok algorithm 3 (Cache block)
> 	ok algorithm 4 (Cache block 2)
> 	Running speed tests
> 	Running algorithm 0 (Original C): run time: 13979 msec
> 	Running algorithm 1 (Simple byte): run time: 5687 msec
> 	Running algorithm 2 (Simple word): run time: 1491 msec
> 	Running algorithm 3 (Cache block): run time: 832 msec
> 	Running algorithm 4 (Cache block 2): run time: 822 msec
> 	Running algorithm 5 (libc): run time: 852 msec

So, on a 200MHz 405gp (in a Walnut), I see:

	Running algorithm 0 (Original C): run time: 57174 msec
	Running algorithm 1 (Simple byte): run time: 32536 msec
	Running algorithm 2 (Simple word): run time: 8465 msec
	Running algorithm 3 (Cache block): run time: 2552 msec
	Running algorithm 4 (Cache block 2): run time: 2586 msec
	Running algorithm 5 (libc): run time: 11198 msec
	Running algorithm 6 (IBM memset): run time: 5761 msec

The "IBM memset" is memset.o extracted out of one of the libraries that
came with the Walnut eval CD-ROM.

I also rearranged the testing code to run with different sizes (in
bytes) showing the results for each size.  Note that these results use
an equal weighting for each start alignment between 0 and 255.

                  Orig C     byte     word    C-blk   C-blk2     libc      IBM
Size =      16:  5501 ms  3024 ms  2789 ms  3170 ms  2851 ms  3055 ms  2519 ms
Size =      32:  3633 ms  2032 ms  1519 ms  1612 ms  1551 ms  1696 ms  1344 ms
Size =      64:  2699 ms  1512 ms   886 ms   829 ms   864 ms  1016 ms   756 ms
Size =     128:  2232 ms  1260 ms   569 ms   446 ms   463 ms   676 ms   462 ms
Size =     256:  1999 ms  1133 ms   410 ms   254 ms   263 ms   506 ms   315 ms
Size =    1024:  1823 ms  1056 ms   292 ms   111 ms   113 ms   378 ms   205 ms
Size =    4096:  1779 ms  1021 ms   262 ms    75 ms    76 ms   347 ms   177 ms
Size =   16384:  1805 ms  1031 ms   433 ms   178 ms   177 ms   440 ms   427 ms
Size =   65536:  1808 ms  1025 ms   466 ms   173 ms   173 ms   468 ms   462 ms
Size =  524288:  1814 ms  1024 ms   484 ms   172 ms   172 ms   485 ms   484 ms
Size = 1048576:  1818 ms  1028 ms   474 ms   176 ms   175 ms   481 ms   469 ms

The same test always using an alignment of 0 doesn't show much overall
difference for most cases.

                  Orig C     byte     word    C-blk   C-blk2     libc      IBM
Size =      16:  5502 ms  2879 ms  2393 ms  3039 ms  2464 ms  2624 ms  2331 ms
Size =      32:  3633 ms  1942 ms  1323 ms  1438 ms  1354 ms  1480 ms  1249 ms
Size =      64:  2699 ms  1474 ms   787 ms   745 ms   782 ms   908 ms   708 ms
Size =     128:  2248 ms  1246 ms   520 ms   404 ms   422 ms   622 ms   439 ms
Size =     256:  1998 ms  1125 ms   386 ms   234 ms   242 ms   479 ms   304 ms
Size =    1024:  1823 ms  1037 ms   286 ms   105 ms   108 ms   372 ms   202 ms
Size =    4096:  1780 ms  1015 ms   261 ms    73 ms    75 ms   345 ms   176 ms
Size =   16384:  1805 ms  1030 ms   438 ms   177 ms   177 ms   442 ms   432 ms
Size =   65536:  1807 ms  1026 ms   464 ms   173 ms   173 ms   467 ms   462 ms
Size =  524288:  1809 ms  1025 ms   475 ms   172 ms   173 ms   491 ms   474 ms
Size = 1048576:  1815 ms  1035 ms   466 ms   176 ms   176 ms   475 ms   465 ms

The number of bytes zero'd for each size is constant, showing the
setup overhead for the smaller sizes.  Diffs for this style output are
included below my .sig.

Note that we use a page size of 16kB on the walnut too, but the 405gp
has an 8kB data cache.  You can see the cross-over above quite clearly.

I _think_ "Cache block" is abzero and "Cache block 2" is cbzero.
Separating these out to their own .S files and we see that cbzero
("Cache block 2") is 2/3rd the size:

	text    data    bss     dec     hex     filename
	304     0       0       304     130     abzero.o
	192     0       0       192     c0      cbzero.o

I'd be tempted to say "use the smaller one" so the Icache isn't as
affected as much by a call to bzero.

>   Also, PPC savvy people are invited to brick me because of
> the assembler style, feel free.

Looks OK to me.  The amount of comments is quite refreshing to see!


As David Edelsohn mentions, not all PPC chips are made equal.  It does
look like machdep.cachelinesize sysctl is available on all PPC ports.
Maybe we can look at some loader tricks so the different bzero routines
are loaded up depending on cache line size.  Using the libm example for
ld.so.conf (but untested!) we could have something like:

	libc.so.12	 machdep.fpu_present	16:libppc_16.so.0,libc.so.12
	libc.so.12	 machdep.fpu_present	32:libppc_32.so.0,libc.so.12

and have all the 16byte cacheline routines in libppc_16.so.0, all the
32byte routines in libppc_32.so.0 and so on.  Ugly, but you get the
picture...


Also, we tend to prefer memset() now (woohoo - standards!), so that
might be an interesting next project after bzero, and of course you
should be able to reuse much of this work.


Anyways, enough rambling from me for now.
Simon.
--
Simon Burge                            <simonb@wasabisystems.com>
NetBSD CDs, Support and Service:    http://www.wasabisystems.com/



--- bz.c.orig	Sun Nov  4 13:36:23 2001
+++ bz.c	Sun Nov  4 15:15:57 2001
@@ -2,63 +2,76 @@
 #include <strings.h>
 #include <sys/time.h>
 
-char a1[8192];
+char a1[1024 * 1024 + 256];
 
 void orig_bzero(void *, size_t);
 
-char *aname[] = { "Original C", "Simple byte", "Simple word", "Cache block", "Cache block 2", "libc"  };
+char *aname[] = { "Orig C", "byte", "word", "C-blk", "C-blk2", "libc", "IBM" };
 
-int main(int argc, char **argv)
-	{
-	int i, j, alg;
+int sizes[] = { 16, 32, 64, 128, 256, 1024, 4 * 1024, 16 * 1024, 64 * 1024, 512 * 1024, 1024 * 1024 , 0 };
+
+int
+main(int argc, char **argv)
+{
+	int j, alg, *size;
 	struct timeval start, end;
 	int startmsec, endmsec;
 
-	for(alg = 0; alg <= 5; alg++)
-		{
+	memset(a1, 0, 1024 * 1024);	/* prime a1 */
+
+	/* Header */
+	printf("%*s", 15, "");
+	for (alg = 0; alg <= 6; alg++)
+		printf("%9s", aname[alg]);
+	printf("\n");
+
+	for (size = sizes; *size != 0; size++) {
 
-		gettimeofday(&start, NULL);
+		printf("Size = %7d:", *size);
 
-		for(i = 0; i < 4096; i++)
-			{	
-			for(j = 0; j < 256; j++)
-				{
-				switch(alg)
-					{
+		for (alg = 0; alg <= 6; alg++) {
+
+			gettimeofday(&start, NULL);
+
+			for (j = 0; j < 1024 * 1024 * 64 / *size; j++) {
+				switch(alg) {
 					case 0:
-						orig_bzero(a1 + j, i);
+						orig_bzero(a1 + (j & 255), *size);
 						break;
 					case 1:
-						bbzero(a1 + j, i);
+						bbzero(a1 + (j & 255), *size);
 						break;
 					case 2:
-						wbzero(a1 + j, i);
+						wbzero(a1 + (j & 255), *size);
 						break;
 					case 3:
-						abzero(a1 + j, i);
+						abzero(a1 + (j & 255), *size);
 						break;
 					case 4:
-						cbzero(a1 + j, i);
+						cbzero(a1 + (j & 255), *size);
 						break;
 					case 5:
-						bzero(a1 + j, i);
+						bzero(a1 + (j & 255), *size);
+						break;
+					case 6:
+						memset(a1 + (j & 255), 0, *size);
 						break;
 					default:
 						errx(1, "huh?");
 					}
 				}
-			}
 
-		gettimeofday(&end, NULL);
+			gettimeofday(&end, NULL);
 
-		end.tv_sec -= start.tv_sec; 
-		startmsec = start.tv_usec / 1000;
-		endmsec   = end.tv_sec * 1000 + end.tv_usec / 1000;
+			end.tv_sec -= start.tv_sec; 
+			startmsec = start.tv_usec / 1000;
+			endmsec   = end.tv_sec * 1000 + end.tv_usec / 1000;
 
-		printf("Running algorithm %d (%s): ", alg, aname[alg]);
-		printf("run time: %d msec\n", endmsec - startmsec);
+			printf("%6d ms", endmsec - startmsec);
 		}
+		printf("\n");
 	}
+}