Subject: Re: tuning IP checksumming code...
To: Charles M. Hannum <mycroft@mit.edu>
From: Thorsten Lockert <tholo@SigmaSoft.COM>
List: port-i386
Date: 07/17/1996 19:55:07
> Jonathan Stone <jonathan@DSG.Stanford.EDU> writes:
>
> > Experimentally, using the 1.2 in_cksum.c or the tuned in_cksum.s seems
> > to make no significant performance difference on the P120s here; the
> > tuned code may be marginally slower.
They certainly perform differently on my P5/133... See below.
> > Is there something about 4.4bsd, NetBSD, or x86 pipelines that
> > invalidate this conventional wisdom (IIRC, Kay and Pasquale, which
> > dates back some years and was done on a 4.2bsd-ish system without
> > pkthdrs in mbufs. They advised unrolling loops to MLEN bytes' worth.)
>
> There's something about modern *caches* that invalidates that
> `wisdom'. The large loops are highly optimized for 486 and Pentium
> cache loading behaviour, so that you only get one stall per cache line
> (unlike, for example, the OpenBSD version which stalls at least twice
> per cache line).
That may be so, but the OpenBSD version still performs on average 15-20%
better than your new assembly version on a P5/133 system, as shown below.
The fields are: mbuf chain length; total bytes in the chain; the checksum
generated by the original C version, by your new assembly version, and by
the OpenBSD version; the time each took over 20,000 iterations; how much
faster or slower your assembly is than the C version; how much faster or
slower the OpenBSD one is than the C version; and, in the last column, how
much faster or slower the OpenBSD version is than your assembly version.
The results are summarized at the end.
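
The percentage columns appear to be the baseline time divided by the
candidate time, minus one. That reading is an inference from the numbers,
not something stated in the message, but it reproduces the first row
(244564, 173023 and 157536 give 41.35%, 55.24% and 9.83%). A minimal
sketch, with speedup_pct() as a hypothetical helper name:

```c
#include <assert.h>

/*
 * Percentage by which `candidate` is faster than `baseline`, given the
 * elapsed time of each for the same amount of work. A negative result
 * means the candidate is slower. (Hypothetical helper; the formula is
 * inferred from the table, not taken from the original test program.)
 */
double speedup_pct(unsigned long baseline, unsigned long candidate)
{
    assert(candidate > 0);  /* avoid dividing by zero */
    return ((double)baseline / (double)candidate - 1.0) * 100.0;
}
```

Under this definition a positive number in the last column means the
OpenBSD version finished sooner than the new assembly version, and a
negative one means it took longer.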
With the way the computer market looks nowadays, is it not better to
optimize for the P5 and P6 series, which are what people who want or need
high performance are buying? It is already hard to get 386-based systems,
and it will soon be just as hard to get 486-based ones.
Thorsten
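mbufs bytes C-cksum asm-cksum OpenBSD-cksum C-time asm-time OpenBSD-time asm-vs-C OpenBSD-vs-C OpenBSD-vs-asm
2 78 0x3373 0x3373 0x3373 244564 173023 157536 41.35% 55.24% 9.83%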
12 734 0x25dc 0x25dc 0x25dc 2205226 1746517 1443965 26.26% 52.72% 20.95%
3 196 0x2bb4 0x2bb4 0x2bb4 559716 441021 412371 26.91% 35.73% 6.95%
12 653 0x5257 0x5257 0x5257 2005957 1578730 1343478 27.06% 49.31% 17.51%
11 570 0x54ba 0x54ba 0x54ba 1804443 1429900 1256966 26.19% 43.56% 13.76%
14 534 0x6117 0x6117 0x6117 2726217 1594393 1398903 70.99% 94.88% 13.97%
5 161 0x80a3 0x80a3 0x80a3 587718 448553 423384 31.03% 38.81% 5.94%
5 190 0x8ade 0x8ade 0x8ade 583706 466954 531421 25.00% 9.84% -12.13%
10 534 0xe813 0xe813 0xe813 1812623 1370666 1117792 32.24% 62.16% 22.62%
9 458 0x3ed0 0x3ed0 0x3ed0 1595664 1246872 997209 27.97% 60.01% 25.04%
1 55 0x82d9 0x82d9 0x82d9 186371 141738 136207 31.49% 36.83% 4.06%
4 144 0x7f40 0x7f40 0x7f40 646887 481750 470610 34.28% 37.46% 2.37%
10 512 0x8398 0x8398 0x8398 1685054 1333794 1099159 26.34% 53.30% 21.35%
13 780 0x2ca0 0x2ca0 0x2ca0 2300206 1818646 1502008 26.48% 53.14% 21.08%
3 162 0x909d 0x909d 0x909d 507044 372542 386108 36.10% 31.32% -3.51%
7 429 0xa7c7 0xa7c7 0xa7c7 1266856 989309 853154 28.05% 48.49% 15.96%
2 146 0xb8e1 0xb8e1 0xb8e1 415089 322021 302999 28.90% 36.99% 6.28%
9 446 0xe1eb 0xe1eb 0xe1eb 1515701 1214197 1028383 24.83% 47.39% 18.07%
4 256 0xdaa6 0xdaa6 0xdaa6 825621 657556 515643 25.56% 60.11% 27.52%
3 254 0xff2d 0xff2d 0xff2d 649217 529872 454226 22.52% 42.93% 16.65%
3 150 0xbe81 0xbe81 0xbe81 527245 383319 348581 37.55% 51.25% 9.97%
4 168 0xa8c0 0xa8c0 0xa8c0 592882 448359 454478 32.23% 30.45% -1.35%
16 902 0x5910 0x5910 0x5910 2643255 2118910 1774380 24.75% 48.97% 19.42%
13 719 0x9e24 0x9e24 0x9e24 2134861 1639988 1445035 30.18% 47.74% 13.49%
11 510 0x7c6c 0x7c6c 0x7c6c 1609529 1281136 1123965 25.63% 43.20% 13.98%
12 581 0x15d7 0x15d7 0x15d7 1969478 1487857 1274424 32.37% 54.54% 16.75%
6 319 0x47b6 0x47b6 0x47b6 1041361 788613 671611 32.05% 55.05% 17.42%
4 272 0x6a75 0x6a75 0x6a75 825357 622701 550016 32.54% 50.06% 13.22%
10 503 0xe573 0xe573 0xe573 1696054 1310003 1068548 29.47% 58.73% 22.60%
15 726 0x0720 0x0720 0x0720 2403522 1888571 1614392 27.27% 48.88% 16.98%
2 69 0x8135 0x8135 0x8135 306282 227290 244319 34.75% 25.36% -6.97%
1 76 0x6dff 0x6dff 0x6dff 243790 174689 157008 39.56% 55.27% 11.26%
4 255 0x3966 0x3966 0x3966 764455 631945 524659 20.97% 45.71% 20.45%
12 511 0x178e 0x178e 0x178e 1875286 1462139 1249136 28.26% 50.13% 17.05%
8 557 0x4aaa 0x4aaa 0x4aaa 1494125 1158870 1023601 28.93% 45.97% 13.22%
10 412 0x9095 0x9095 0x9095 1682905 1288716 1015784 30.59% 65.68% 26.87%
7 488 0x5789 0x5789 0x5789 1389239 1100008 899319 26.29% 54.48% 22.32%
13 648 0xcbd1 0xcbd1 0xcbd1 2214856 1723311 1423193 28.52% 55.63% 21.09%
16 948 0x5981 0x5981 0x5981 2762244 2157510 1804974 28.03% 53.04% 19.53%
9 498 0x522a 0x522a 0x522a 1604795 1246529 1053561 28.74% 52.32% 18.32%
16 941 0xbcd0 0xbcd0 0xbcd0 2803730 2188428 1813009 28.12% 54.65% 20.71%
6 325 0x37c9 0x37c9 0x37c9 1167166 860893 693476 35.58% 68.31% 24.14%
5 313 0x1ab5 0x1ab5 0x1ab5 675585 587487 586574 15.00% 15.17% 0.16%
7 330 0x84df 0x84df 0x84df 1273284 944959 778833 34.74% 63.49% 21.33%
7 368 0x7d19 0x7d19 0x7d19 1183491 957223 818064 23.64% 44.67% 17.01%
Average over 45 samples:
Hannum's assembly vs. Hannum's C : 29.70%
Dave's assembly vs. Hannum's C : 51.60%
Dave's assembly vs. Hannum's assembly: 16.89%
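
For reference, the operation being timed is the 16-bit one's-complement
Internet checksum taken across the data of every mbuf in a chain. The
following is a portable, deliberately unoptimized sketch of that
computation, not the NetBSD or OpenBSD code: struct seg and chain_cksum()
are hypothetical names standing in for the kernel's mbuf chain and
in_cksum(), and the result is returned complemented in host order, which
may not match how the values in the table were printed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf: one segment of a buffer chain. */
struct seg {
    const uint8_t *data;
    size_t len;
    const struct seg *next;
};

/*
 * One's-complement sum over all bytes in the chain, treating the byte
 * stream as big-endian 16-bit words. `odd` survives from one segment to
 * the next, so a segment of odd length does not misalign the words of
 * the segment that follows it.
 */
uint16_t chain_cksum(const struct seg *s)
{
    uint32_t sum = 0;
    int odd = 0;  /* nonzero when the next byte is the low half of a word */

    for (; s != NULL; s = s->next) {
        for (size_t i = 0; i < s->len; i++) {
            sum += odd ? (uint32_t)s->data[i] : (uint32_t)s->data[i] << 8;
            odd = !odd;
            if (sum >> 31)  /* fold early so the accumulator cannot overflow */
                sum = (sum & 0xffff) + (sum >> 16);
        }
    }
    while (sum >> 16)  /* fold the remaining carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Working a byte at a time keeps the odd-length case easy to see. The tuned
versions compared above get their speed from summing whole words with
add-with-carry in unrolled loops, which this sketch makes no attempt to
imitate.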
--
Thorsten Lockert | postmaster@sigmasoft.com | Universe, n.:
1238B Page Street | hostmaster@sigmasoft.com | The problem.
San Francisco, CA 94117 | tholo@sigmasoft.com |