Subject: Re: bzero
To: Martin J. Laubach <mjl@emsi.priv.at>
From: Simon Burge <simonb@wasabisystems.com>
List: port-powerpc
Date: 11/04/2001 15:56:25
"Martin J. Laubach" wrote:
> So I've been entertaining myself by writing several bzero
> implementations in assembler. Before I place this into libc
> and libkern, could you please give it a good beating? I have
> placed it for download at
>
> http://rhubarb.emsi.priv.at/download/bzero.tar.gz
>
> Please unpack the archive, then run ./compile and tell me
> the results. On my G4, this looks like
>
> celery:198 [/tmp] % ./compile
> Compiling
> Running regression tests
> ok algorithm 0 (Original C)
> ok algorithm 1 (Simple byte)
> ok algorithm 2 (Simple word)
> ok algorithm 3 (Cache block)
> ok algorithm 4 (Cache block 2)
> Running speed tests
> Running algorithm 0 (Original C): run time: 13979 msec
> Running algorithm 1 (Simple byte): run time: 5687 msec
> Running algorithm 2 (Simple word): run time: 1491 msec
> Running algorithm 3 (Cache block): run time: 832 msec
> Running algorithm 4 (Cache block 2): run time: 822 msec
> Running algorithm 5 (libc): run time: 852 msec
So, on a 200MHz 405gp (in a Walnut), I see:
Running algorithm 0 (Original C): run time: 57174 msec
Running algorithm 1 (Simple byte): run time: 32536 msec
Running algorithm 2 (Simple word): run time: 8465 msec
Running algorithm 3 (Cache block): run time: 2552 msec
Running algorithm 4 (Cache block 2): run time: 2586 msec
Running algorithm 5 (libc): run time: 11198 msec
Running algorithm 6 (IBM memset): run time: 5761 msec
The "IBM memset" is memset.o extracted out of one of the libraries that
came with the Walnut eval CD-ROM.
I also rearranged the testing code to run with different sizes (in
bytes) showing the results for each size. Note that these results use
an equal weighting for each start alignment between 0 and 255.
Orig C byte word C-blk C-blk2 libc IBM
Size = 16: 5501 ms 3024 ms 2789 ms 3170 ms 2851 ms 3055 ms 2519 ms
Size = 32: 3633 ms 2032 ms 1519 ms 1612 ms 1551 ms 1696 ms 1344 ms
Size = 64: 2699 ms 1512 ms 886 ms 829 ms 864 ms 1016 ms 756 ms
Size = 128: 2232 ms 1260 ms 569 ms 446 ms 463 ms 676 ms 462 ms
Size = 256: 1999 ms 1133 ms 410 ms 254 ms 263 ms 506 ms 315 ms
Size = 1024: 1823 ms 1056 ms 292 ms 111 ms 113 ms 378 ms 205 ms
Size = 4096: 1779 ms 1021 ms 262 ms 75 ms 76 ms 347 ms 177 ms
Size = 16384: 1805 ms 1031 ms 433 ms 178 ms 177 ms 440 ms 427 ms
Size = 65536: 1808 ms 1025 ms 466 ms 173 ms 173 ms 468 ms 462 ms
Size = 524288: 1814 ms 1024 ms 484 ms 172 ms 172 ms 485 ms 484 ms
Size = 1048576: 1818 ms 1028 ms 474 ms 176 ms 175 ms 481 ms 469 ms
The same test always using an alignment of 0 doesn't show much overall
difference for most cases.
Orig C byte word C-blk C-blk2 libc IBM
Size = 16: 5502 ms 2879 ms 2393 ms 3039 ms 2464 ms 2624 ms 2331 ms
Size = 32: 3633 ms 1942 ms 1323 ms 1438 ms 1354 ms 1480 ms 1249 ms
Size = 64: 2699 ms 1474 ms 787 ms 745 ms 782 ms 908 ms 708 ms
Size = 128: 2248 ms 1246 ms 520 ms 404 ms 422 ms 622 ms 439 ms
Size = 256: 1998 ms 1125 ms 386 ms 234 ms 242 ms 479 ms 304 ms
Size = 1024: 1823 ms 1037 ms 286 ms 105 ms 108 ms 372 ms 202 ms
Size = 4096: 1780 ms 1015 ms 261 ms 73 ms 75 ms 345 ms 176 ms
Size = 16384: 1805 ms 1030 ms 438 ms 177 ms 177 ms 442 ms 432 ms
Size = 65536: 1807 ms 1026 ms 464 ms 173 ms 173 ms 467 ms 462 ms
Size = 524288: 1809 ms 1025 ms 475 ms 172 ms 173 ms 491 ms 474 ms
Size = 1048576: 1815 ms 1035 ms 466 ms 176 ms 176 ms 475 ms 465 ms
The number of bytes zero'd for each size is constant, showing the
setup overhead for the smaller sizes. Diffs for this style output are
included below my .sig.
Note that we use a page size of 16kB on the walnut too, but the 405gp
has an 8kB data cache. You can see the cross-over above quite clearly.
I _think_ "Cache block" is abzero and "Cache block 2" is cbzero.
Separating these out to their own .S files and we see that cbzero
("Cache block 2") is 2/3rd the size:
text data bss dec hex filename
304 0 0 304 130 abzero.o
192 0 0 192 c0 cbzero.o
I'd be tempted to say "use the smaller one" so the Icache isn't as
affected as much by a call to bzero.
> Also, PPC savvy people are invited to brick me because of
> the assembler style, feel free.
Looks OK to me. The amount of comments is quite refreshing to see!
As David Edelsohn mentions, not all PPC chips are made equal. It does
look like machdep.cachelinesize sysctl is available on all PPC ports.
Maybe we can look at some loader tricks so the different bzero routines
are loaded up depending on cache line size. Using the libm example for
ld.so.conf (but untested!) we could have something like:
libc.so.12 machdep.fpu_present 16:libppc_16.so.0,libc.so.12
libc.so.12 machdep.fpu_present 32:libppc_32.so.0,libc.so.12
and have all the 16byte cacheline routines in libppc_16.so.0, all the
32byte routines in libppc_32.so.0 and so on. Ugly, but you get the
picture...
Also, we tend to prefer memset() now (woohoo - standards!), so that
might be an interesting next project after bzero, and of course you
should be able to reuse much of this work.
Anyways, enough rambling from me for now.
Simon.
--
Simon Burge <simonb@wasabisystems.com>
NetBSD CDs, Support and Service: http://www.wasabisystems.com/
--- bz.c.orig Sun Nov 4 13:36:23 2001
+++ bz.c Sun Nov 4 15:15:57 2001
@@ -2,63 +2,76 @@
#include <strings.h>
#include <sys/time.h>
-char a1[8192];
+char a1[1024 * 1024 + 256];
void orig_bzero(void *, size_t);
-char *aname[] = { "Original C", "Simple byte", "Simple word", "Cache block", "Cache block 2", "libc" };
+char *aname[] = { "Orig C", "byte", "word", "C-blk", "C-blk2", "libc", "IBM" };
-int main(int argc, char **argv)
- {
- int i, j, alg;
+int sizes[] = { 16, 32, 64, 128, 256, 1024, 4 * 1024, 16 * 1024, 64 * 1024, 512 * 1024, 1024 * 1024 , 0 };
+
+int
+main(int argc, char **argv)
+{
+ int j, alg, *size;
struct timeval start, end;
int startmsec, endmsec;
- for(alg = 0; alg <= 5; alg++)
- {
+ memset(a1, 0, 1024 * 1024); /* prime a1 */
+
+ /* Header */
+ printf("%*s", 15, "");
+ for (alg = 0; alg <= 6; alg++)
+ printf("%9s", aname[alg]);
+ printf("\n");
+
+ for (size = sizes; *size != 0; size++) {
- gettimeofday(&start, NULL);
+ printf("Size = %7d:", *size);
- for(i = 0; i < 4096; i++)
- {
- for(j = 0; j < 256; j++)
- {
- switch(alg)
- {
+ for (alg = 0; alg <= 6; alg++) {
+
+ gettimeofday(&start, NULL);
+
+ for (j = 0; j < 1024 * 1024 * 64 / *size; j++) {
+ switch(alg) {
case 0:
- orig_bzero(a1 + j, i);
+ orig_bzero(a1 + (j & 255), *size);
break;
case 1:
- bbzero(a1 + j, i);
+ bbzero(a1 + (j & 255), *size);
break;
case 2:
- wbzero(a1 + j, i);
+ wbzero(a1 + (j & 255), *size);
break;
case 3:
- abzero(a1 + j, i);
+ abzero(a1 + (j & 255), *size);
break;
case 4:
- cbzero(a1 + j, i);
+ cbzero(a1 + (j & 255), *size);
break;
case 5:
- bzero(a1 + j, i);
+ bzero(a1 + (j & 255), *size);
+ break;
+ case 6:
+ memset(a1 + (j & 255), 0, *size);
break;
default:
errx(1, "huh?");
}
}
- }
- gettimeofday(&end, NULL);
+ gettimeofday(&end, NULL);
- end.tv_sec -= start.tv_sec;
- startmsec = start.tv_usec / 1000;
- endmsec = end.tv_sec * 1000 + end.tv_usec / 1000;
+ end.tv_sec -= start.tv_sec;
+ startmsec = start.tv_usec / 1000;
+ endmsec = end.tv_sec * 1000 + end.tv_usec / 1000;
- printf("Running algorithm %d (%s): ", alg, aname[alg]);
- printf("run time: %d msec\n", endmsec - startmsec);
+ printf("%6d ms", endmsec - startmsec);
}
+ printf("\n");
}
+}