Subject: Re: tuning IP checksumming code...
To: Charles M. Hannum <mycroft@mit.edu>
From: Thorsten Lockert <tholo@SigmaSoft.COM>
List: port-i386
Date: 07/17/1996 19:55:07
> Jonathan Stone <jonathan@DSG.Stanford.EDU> writes:
> 
> > Experimentally, using the 1.2 in_cksum.c or the tuned in_cksum.s seems
> > to make no significant performance difference on the P120s here; the
> > tuned code may be marginally slower.

They certainly perform differently on my P5/133...  See below.

> > Is there something about 4.4bsd, NetBSD, or x86 pipelines that
> > invalidate this conventional wisdom (IIRC, Kay and Pasquale, which
> > dates back some years and was done on a 4.2bsd-ish system without
> > pkthdrs in mbufs. They advised unrolling loops to MLEN bytes' worth.)
> 
> There's something about modern *caches* that invalidates that
> `wisdom'.  The large loops are highly optimized for 486 and Pentium
> cache loading behaviour, so that you only get one stall per cache line
> (unlike, for example, the OpenBSD version which stalls at least twice
> per cache line).
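
As background for the unrolling discussion above, here is a minimal sketch
of a ones'-complement checksum (in the style of RFC 1071) with an unrolled
inner loop.  It is not the NetBSD in_cksum.c, the tuned in_cksum.s, or the
OpenBSD routine.  The name cksum_sketch, the 8-byte unroll factor, the
explicit big-endian word assembly and the flat-buffer interface (no mbuf
chain) are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from any of the kernels discussed:
 * ones'-complement checksum of a flat byte buffer.
 * The 32-bit accumulator is only safe for buffers shorter than 128 KB. */
uint16_t cksum_sketch(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    /* Unrolled body: four 16-bit words (8 bytes) per iteration. */
    while (len >= 8) {
        sum += (uint32_t)(p[0] << 8 | p[1]);
        sum += (uint32_t)(p[2] << 8 | p[3]);
        sum += (uint32_t)(p[4] << 8 | p[5]);
        sum += (uint32_t)(p[6] << 8 | p[7]);
        p += 8;
        len -= 8;
    }

    /* Remaining whole 16-bit words. */
    while (len >= 2) {
        sum += (uint32_t)(p[0] << 8 | p[1]);
        p += 2;
        len -= 2;
    }

    /* Trailing odd byte is treated as the high byte of a final word. */
    if (len)
        sum += (uint32_t)p[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A real kernel routine also has to walk the mbuf chain and cope with mbufs
that end on an odd byte, which this sketch leaves out.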

That may be so, but the OpenBSD version still performs on average 15-20%
better than your new assembly version on a P5/133 system, as shown below.

The fields are: mbuf chain length, total bytes in the chain, the checksum
generated by the original C version, by your new assembly version and by the
OpenBSD version, the time each took over 20,000 iterations, how much
faster/slower your assembly is than the C version, how much faster/slower
the OpenBSD one is than the C version and, in the last column, how much
faster/slower the OpenBSD version is than your assembly version.  The
results are further summarized at the end.
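
The percentage columns are consistent with (t_base / t_other - 1) * 100,
where t_base and t_other are two of the timing columns.  The sketch below
shows that calculation.  The helper names speedup_pct and mean_pct are
hypothetical and not from the test harness used here, and using a plain
arithmetic mean for the summary figures is an assumption.

```c
#include <stddef.h>

/* Hypothetical helper: how much faster, in percent, a routine taking
 * t_other is than one taking t_base.  Negative means it is slower. */
double speedup_pct(double t_base, double t_other)
{
    return (t_base / t_other - 1.0) * 100.0;
}

/* Hypothetical helper: arithmetic mean of n per-row percentages,
 * assumed to be how the summary at the end was obtained. */
double mean_pct(const double *pct, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += pct[i];
    return n ? sum / (double)n : 0.0;
}
```

For the first row, speedup_pct(244564, 173023) comes out at roughly 41.35,
matching the first percentage column.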

With how the computer market looks nowadays, is it not better to
optimize for the P5 and P6 series, which are what people who want or need
high performance are buying?  It is already hard to get 386-based systems,
and it is going to be the same for 486-based systems real soon.

Thorsten



mb  len  cksum  cksum  cksum    time    time    time     asm    OBSD    OBSD
             C    asm   OBSD       C     asm    OBSD    vs C    vs C  vs asm
 2   78 0x3373 0x3373 0x3373  244564  173023  157536  41.35%  55.24%   9.83%
12  734 0x25dc 0x25dc 0x25dc 2205226 1746517 1443965  26.26%  52.72%  20.95%
 3  196 0x2bb4 0x2bb4 0x2bb4  559716  441021  412371  26.91%  35.73%   6.95%
12  653 0x5257 0x5257 0x5257 2005957 1578730 1343478  27.06%  49.31%  17.51%
11  570 0x54ba 0x54ba 0x54ba 1804443 1429900 1256966  26.19%  43.56%  13.76%
14  534 0x6117 0x6117 0x6117 2726217 1594393 1398903  70.99%  94.88%  13.97%
 5  161 0x80a3 0x80a3 0x80a3  587718  448553  423384  31.03%  38.81%   5.94%
 5  190 0x8ade 0x8ade 0x8ade  583706  466954  531421  25.00%   9.84% -12.13%
10  534 0xe813 0xe813 0xe813 1812623 1370666 1117792  32.24%  62.16%  22.62%
 9  458 0x3ed0 0x3ed0 0x3ed0 1595664 1246872  997209  27.97%  60.01%  25.04%
 1   55 0x82d9 0x82d9 0x82d9  186371  141738  136207  31.49%  36.83%   4.06%
 4  144 0x7f40 0x7f40 0x7f40  646887  481750  470610  34.28%  37.46%   2.37%
10  512 0x8398 0x8398 0x8398 1685054 1333794 1099159  26.34%  53.30%  21.35%
13  780 0x2ca0 0x2ca0 0x2ca0 2300206 1818646 1502008  26.48%  53.14%  21.08%
 3  162 0x909d 0x909d 0x909d  507044  372542  386108  36.10%  31.32%  -3.51%
 7  429 0xa7c7 0xa7c7 0xa7c7 1266856  989309  853154  28.05%  48.49%  15.96%
 2  146 0xb8e1 0xb8e1 0xb8e1  415089  322021  302999  28.90%  36.99%   6.28%
 9  446 0xe1eb 0xe1eb 0xe1eb 1515701 1214197 1028383  24.83%  47.39%  18.07%
 4  256 0xdaa6 0xdaa6 0xdaa6  825621  657556  515643  25.56%  60.11%  27.52%
 3  254 0xff2d 0xff2d 0xff2d  649217  529872  454226  22.52%  42.93%  16.65%
 3  150 0xbe81 0xbe81 0xbe81  527245  383319  348581  37.55%  51.25%   9.97%
 4  168 0xa8c0 0xa8c0 0xa8c0  592882  448359  454478  32.23%  30.45%  -1.35%
16  902 0x5910 0x5910 0x5910 2643255 2118910 1774380  24.75%  48.97%  19.42%
13  719 0x9e24 0x9e24 0x9e24 2134861 1639988 1445035  30.18%  47.74%  13.49%
11  510 0x7c6c 0x7c6c 0x7c6c 1609529 1281136 1123965  25.63%  43.20%  13.98%
12  581 0x15d7 0x15d7 0x15d7 1969478 1487857 1274424  32.37%  54.54%  16.75%
 6  319 0x47b6 0x47b6 0x47b6 1041361  788613  671611  32.05%  55.05%  17.42%
 4  272 0x6a75 0x6a75 0x6a75  825357  622701  550016  32.54%  50.06%  13.22%
10  503 0xe573 0xe573 0xe573 1696054 1310003 1068548  29.47%  58.73%  22.60%
15  726 0x0720 0x0720 0x0720 2403522 1888571 1614392  27.27%  48.88%  16.98%
 2   69 0x8135 0x8135 0x8135  306282  227290  244319  34.75%  25.36%  -6.97%
 1   76 0x6dff 0x6dff 0x6dff  243790  174689  157008  39.56%  55.27%  11.26%
 4  255 0x3966 0x3966 0x3966  764455  631945  524659  20.97%  45.71%  20.45%
12  511 0x178e 0x178e 0x178e 1875286 1462139 1249136  28.26%  50.13%  17.05%
 8  557 0x4aaa 0x4aaa 0x4aaa 1494125 1158870 1023601  28.93%  45.97%  13.22%
10  412 0x9095 0x9095 0x9095 1682905 1288716 1015784  30.59%  65.68%  26.87%
 7  488 0x5789 0x5789 0x5789 1389239 1100008  899319  26.29%  54.48%  22.32%
13  648 0xcbd1 0xcbd1 0xcbd1 2214856 1723311 1423193  28.52%  55.63%  21.09%
16  948 0x5981 0x5981 0x5981 2762244 2157510 1804974  28.03%  53.04%  19.53%
 9  498 0x522a 0x522a 0x522a 1604795 1246529 1053561  28.74%  52.32%  18.32%
16  941 0xbcd0 0xbcd0 0xbcd0 2803730 2188428 1813009  28.12%  54.65%  20.71%
 6  325 0x37c9 0x37c9 0x37c9 1167166  860893  693476  35.58%  68.31%  24.14%
 5  313 0x1ab5 0x1ab5 0x1ab5  675585  587487  586574  15.00%  15.17%   0.16%
 7  330 0x84df 0x84df 0x84df 1273284  944959  778833  34.74%  63.49%  21.33%
 7  368 0x7d19 0x7d19 0x7d19 1183491  957223  818064  23.64%  44.67%  17.01%

Average over 45 samples:
Hannum's assembly vs. Hannum's C     :  29.70%
Dave's assembly vs. Hannum's C       :  51.60%
Dave's assembly vs. Hannum's assembly:  16.89%
--
Thorsten Lockert        | postmaster@sigmasoft.com | Universe, n.:
1238B Page Street       | hostmaster@sigmasoft.com |         The problem.
San Francisco, CA 94117 | tholo@sigmasoft.com      |