Subject: Re: tuning IP checksumming code...
To: Thorsten Lockert <tholo@SigmaSoft.COM>
From: Charles M. Hannum <mycroft@mit.edu>
List: tech-net
Date: 07/18/1996 03:30:16
Thorsten Lockert <tholo@SigmaSoft.COM> writes:

> 
> The fields are mbuf chain length, total bytes in chain, the checksum
> generated by the original C version, your new assembly version and the
> OpenBSD version, the time each took over 20.000 iterations and how much
> [...]
> 
> 
>  2   78 0x3373 0x3373 0x3373  244564  173023  157536  41.35%  55.24%   9.83%
> 12  734 0x25dc 0x25dc 0x25dc 2205226 1746517 1443965  26.26%  52.72%  20.95%
>  3  196 0x2bb4 0x2bb4 0x2bb4  559716  441021  412371  26.91%  35.73%   6.95%
> 12  653 0x5257 0x5257 0x5257 2005957 1578730 1343478  27.06%  49.31%  17.51%
> [...]

It may well be that the OpenBSD version is faster in these tests, but:

1) The conditions you're simulating are outright bizarre, and will
never, ever happen in practice.

2) From the packet sizes, it's clear that you're not at all taking the
cache into account.


I try to simulate real-life packet sizes -- primarily the sort of
arrangements you'd expect to get from IP headers, TCP segments, and
reassembled NFS packets.  My results are appended below.

Legend:
* `orig' is the original C version from 386BSD.
* `bde1' was Bruce Evans's new version right before my previous C
version.
* `dr1' is the OpenBSD version from Dave Richards.
* `myc1' is my new version.
* `myc2' is my new version, plus the Pentium-specific tweak that I've
mentioned before.


./t 8	1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
align=2,4,4,1,e,2,6,7,6,a,0,0,e,f,7,9,e,7,5,1,f,9,2,7,b,f,c,f,0,4,f,d
	sum:  orig=5f89 bde1=5f89 dr1=5f89 myc1=5f89 myc2=5f89
	time: orig= 189 bde1= 354 dr1= 118 myc1= 131 myc2= 130
align=d,7,d,7,4,6,8,6,6,3,2,5,9,2,3,0,5,8,2,6,2,d,a,7,b,a,7,7,d,c,6,8
	sum:  orig=07ad bde1=07ad dr1=07ad myc1=07ad myc2=07ad
	time: orig= 188 bde1= 374 dr1= 117 myc1= 131 myc2= 131
align=8,e,8,3,c,2,e,f,d,9,e,8,0,5,c,9,b,e,1,2,8,9,8,2,1,2,b,d,5,3,a,6
	sum:  orig=7019 bde1=7019 dr1=7019 myc1=7019 myc2=7019
	time: orig= 187 bde1= 355 dr1= 126 myc1= 131 myc2= 132
align=1,a,3,6,8,7,a,3,b,a,3,9,1,9,2,4,0,7,2,3,1,e,a,6,f,6,8,1,8,a,a,6
	sum:  orig=87c8 bde1=87c8 dr1=87c8 myc1=87c8 myc2=87c8
	time: orig= 192 bde1= 360 dr1= 117 myc1= 131 myc2= 131
align=a,a,0,e,7,3,b,0,1,8,1,8,e,c,6,1,4,5,3,b,d,a,2,5,4,6,f,3,6,2,8,9
	sum:  orig=53c3 bde1=53c3 dr1=53c3 myc1=53c3 myc2=53c3
	time: orig= 189 bde1= 356 dr1= 121 myc1= 136 myc2= 132
align=2,e,e,d,9,8,1,8,d,2,8,5,5,0,6,0,7,7,5,8,d,f,f,e,0,1,f,3,f,a,3,d
	sum:  orig=d5fd bde1=d5fd dr1=d5fd myc1=d5fd myc2=d5fd
	time: orig= 192 bde1= 368 dr1= 118 myc1= 131 myc2= 131
align=9,6,d,2,e,4,d,a,1,8,8,1,8,4,4,2,a,d,8,c,f,c,0,1,3,a,7,2,4,2,b,3
	sum:  orig=1182 bde1=1182 dr1=1182 myc1=1182 myc2=1182
	time: orig= 191 bde1= 366 dr1= 118 myc1= 131 myc2= 131
align=f,2,d,d,6,9,d,6,c,a,1,a,6,8,e,5,b,7,d,6,5,1,7,e,e,e,9,e,3,a,1,c
	sum:  orig=044a bde1=044a dr1=044a myc1=044a myc2=044a
	time: orig= 190 bde1= 377 dr1= 117 myc1= 132 myc2= 131
./t 8	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
align=5,6,4,3,f,1,e,f,7,c,a,3,4,5,b,a,f,1,b,b,4,e,e,7,5,0,1,6,a,0,f,8
	sum:  orig=246a bde1=246a dr1=246a myc1=246a myc2=246a
	time: orig= 475 bde1= 467 dr1= 233 myc1= 216 myc2= 214
align=8,7,b,e,f,0,0,b,b,3,3,6,3,9,1,7,2,9,6,c,e,2,a,c,4,d,1,0,7,3,a,a
	sum:  orig=92bb bde1=92bb dr1=92bb myc1=92bb myc2=92bb
	time: orig= 484 bde1= 461 dr1= 226 myc1= 220 myc2= 218
align=4,d,6,d,d,2,b,a,1,c,8,e,b,0,8,c,5,b,2,7,e,d,c,b,1,d,d,3,9,6,7,4
	sum:  orig=b70c bde1=b70c dr1=b70c myc1=b70c myc2=b70c
	time: orig= 505 bde1= 471 dr1= 234 myc1= 221 myc2= 216
align=e,b,7,7,1,5,1,0,7,1,0,b,d,e,b,3,d,9,f,5,d,c,7,6,8,a,b,d,3,0,3,2
	sum:  orig=f819 bde1=f819 dr1=f819 myc1=f819 myc2=f819
	time: orig= 516 bde1= 468 dr1= 231 myc1= 219 myc2= 221
align=d,2,c,2,6,7,6,1,b,c,3,b,a,9,6,8,f,5,c,c,5,c,f,1,7,d,4,f,4,6,7,1
	sum:  orig=e066 bde1=e066 dr1=e066 myc1=e066 myc2=e066
	time: orig= 494 bde1= 477 dr1= 232 myc1= 215 myc2= 214
align=5,4,5,8,4,5,e,f,9,5,8,e,5,7,3,6,2,0,9,5,0,c,8,0,c,f,0,8,0,e,0,a
	sum:  orig=cfa2 bde1=cfa2 dr1=cfa2 myc1=cfa2 myc2=cfa2
	time: orig= 478 bde1= 459 dr1= 231 myc1= 206 myc2= 205
align=c,1,2,e,6,d,c,f,0,7,7,4,e,d,f,a,b,c,5,6,6,7,5,5,4,b,5,7,7,c,a,b
	sum:  orig=4779 bde1=4779 dr1=4779 myc1=4779 myc2=4779
	time: orig= 476 bde1= 435 dr1= 229 myc1= 218 myc2= 217
align=8,b,2,d,4,b,4,4,d,b,8,d,7,1,5,e,f,a,0,8,2,c,9,6,d,9,b,c,a,7,f,e
	sum:  orig=a27f bde1=a27f dr1=a27f myc1=a27f myc2=a27f
	time: orig= 469 bde1= 451 dr1= 237 myc1= 213 myc2= 216
./t 8	1 3 1 3 1 3 1
align=b,b,d,5,c,2,1
	sum:  orig=cd17 bde1=cd17 dr1=cd17 myc1=cd17 myc2=cd17
	time: orig=  48 bde1=  94 dr1=  34 myc1=  33 myc2=  37
align=3,3,b,9,5,b,0
	sum:  orig=6c41 bde1=6c41 dr1=6c41 myc1=6c41 myc2=6c41
	time: orig=  48 bde1=  87 dr1=  34 myc1=  35 myc2=  36
align=2,9,c,e,0,4,c
	sum:  orig=b95c bde1=b95c dr1=b95c myc1=b95c myc2=b95c
	time: orig=  52 bde1=  87 dr1=  35 myc1=  33 myc2=  36
align=6,5,a,f,4,b,2
	sum:  orig=dbc2 bde1=dbc2 dr1=dbc2 myc1=dbc2 myc2=dbc2
	time: orig=  47 bde1=  80 dr1=  34 myc1=  38 myc2=  35
align=f,0,2,d,3,4,c
	sum:  orig=30b2 bde1=30b2 dr1=30b2 myc1=30b2 myc2=30b2
	time: orig=  52 bde1=  92 dr1=  33 myc1=  33 myc2=  35
align=c,e,0,b,2,5,3
	sum:  orig=920f bde1=920f dr1=920f myc1=920f myc2=920f
	time: orig=  50 bde1=  89 dr1=  38 myc1=  35 myc2=  36
align=2,8,5,a,5,1,e
	sum:  orig=cd1d bde1=cd1d dr1=cd1d myc1=cd1d myc2=cd1d
	time: orig=  50 bde1=  92 dr1=  33 myc1=  34 myc2=  34
align=b,6,4,f,6,b,d
	sum:  orig=6f6b bde1=6f6b dr1=6f6b myc1=6f6b myc2=6f6b
	time: orig=  48 bde1=  91 dr1=  34 myc1=  36 myc2=  37
./t 8	40
align=7
	sum:  orig=cf13 bde1=cf13 dr1=cf13 myc1=cf13 myc2=cf13
	time: orig=  65 bde1=  22 dr1=  14 myc1=  13 myc2=  13
align=7
	sum:  orig=5679 bde1=5679 dr1=5679 myc1=5679 myc2=5679
	time: orig=  68 bde1=  22 dr1=  12 myc1=  13 myc2=  13
align=a
	sum:  orig=bb0b bde1=bb0b dr1=bb0b myc1=bb0b myc2=bb0b
	time: orig=  43 bde1=  17 dr1=  12 myc1=  12 myc2=  12
align=1
	sum:  orig=5621 bde1=5621 dr1=5621 myc1=5621 myc2=5621
	time: orig=  62 bde1=  19 dr1=  12 myc1=  11 myc2=  11
align=4
	sum:  orig=ec40 bde1=ec40 dr1=ec40 myc1=ec40 myc2=ec40
	time: orig=  45 bde1=  17 dr1=  11 myc1=  10 myc2=  10
align=3
	sum:  orig=f0bb bde1=f0bb dr1=f0bb myc1=f0bb myc2=f0bb
	time: orig=  67 bde1=  19 dr1=  12 myc1=  13 myc2=  12
align=a
	sum:  orig=e220 bde1=e220 dr1=e220 myc1=e220 myc2=e220
	time: orig=  39 bde1=  17 dr1=  11 myc1=  13 myc2=  11
align=6
	sum:  orig=75ec bde1=75ec dr1=75ec myc1=75ec myc2=75ec
	time: orig=  46 bde1=  19 dr1=  11 myc1=  12 myc2=  12
./t 8	2048
align=1
	sum:  orig=de06 bde1=de06 dr1=de06 myc1=de06 myc2=de06
	time: orig=2775 bde1= 145 dr1= 142 myc1= 128 myc2=  74
align=a
	sum:  orig=2516 bde1=2516 dr1=2516 myc1=2516 myc2=2516
	time: orig=1758 bde1= 145 dr1= 140 myc1= 128 myc2=  74
align=3
	sum:  orig=46a6 bde1=46a6 dr1=46a6 myc1=46a6 myc2=46a6
	time: orig=2760 bde1= 145 dr1= 141 myc1= 129 myc2=  74
align=1
	sum:  orig=d2fc bde1=d2fc dr1=d2fc myc1=d2fc myc2=d2fc
	time: orig=2778 bde1= 145 dr1= 141 myc1= 129 myc2=  74
align=3
	sum:  orig=75cf bde1=75cf dr1=75cf myc1=75cf myc2=75cf
	time: orig=2773 bde1= 145 dr1= 142 myc1= 128 myc2=  75
align=6
	sum:  orig=969a bde1=969a dr1=969a myc1=969a myc2=969a
	time: orig=1753 bde1= 145 dr1= 140 myc1= 128 myc2=  74
align=e
	sum:  orig=fdfc bde1=fdfc dr1=fdfc myc1=fdfc myc2=fdfc
	time: orig=1733 bde1= 145 dr1= 141 myc1= 128 myc2=  73
align=3
	sum:  orig=bdc4 bde1=bdc4 dr1=bdc4 myc1=bdc4 myc2=bdc4
	time: orig=2754 bde1= 146 dr1= 142 myc1= 129 myc2=  74
./t 8	1536
align=9
	sum:  orig=ed33 bde1=ed33 dr1=ed33 myc1=ed33 myc2=ed33
	time: orig=2091 bde1= 123 dr1= 109 myc1= 100 myc2=  59
align=6
	sum:  orig=1e63 bde1=1e63 dr1=1e63 myc1=1e63 myc2=1e63
	time: orig=1303 bde1= 113 dr1= 107 myc1=  99 myc2=  59
align=4
	sum:  orig=0131 bde1=0131 dr1=0131 myc1=0131 myc2=0131
	time: orig=1308 bde1= 111 dr1= 108 myc1=  98 myc2=  58
align=b
	sum:  orig=64fd bde1=64fd dr1=64fd myc1=64fd myc2=64fd
	time: orig=2078 bde1= 113 dr1= 113 myc1=  99 myc2=  59
align=a
	sum:  orig=bb47 bde1=bb47 dr1=bb47 myc1=bb47 myc2=bb47
	time: orig=1304 bde1= 112 dr1= 108 myc1= 100 myc2=  57
align=1
	sum:  orig=c319 bde1=c319 dr1=c319 myc1=c319 myc2=c319
	time: orig=2091 bde1= 114 dr1= 109 myc1=  99 myc2=  58
align=8
	sum:  orig=f7f7 bde1=f7f7 dr1=f7f7 myc1=f7f7 myc2=f7f7
	time: orig=1318 bde1= 108 dr1= 109 myc1= 100 myc2=  58
align=9
	sum:  orig=c89a bde1=c89a dr1=c89a myc1=c89a myc2=c89a
	time: orig=2075 bde1= 113 dr1= 108 myc1= 100 myc2=  59
./t 8	576
align=e
	sum:  orig=2138 bde1=2138 dr1=2138 myc1=2138 myc2=2138
	time: orig= 508 bde1=  53 dr1=  46 myc1=  43 myc2=  28
align=a
	sum:  orig=a29f bde1=a29f dr1=a29f myc1=a29f myc2=a29f
	time: orig= 495 bde1=  51 dr1=  47 myc1=  43 myc2=  28
align=a
	sum:  orig=cc89 bde1=cc89 dr1=cc89 myc1=cc89 myc2=cc89
	time: orig= 499 bde1=  51 dr1=  46 myc1=  44 myc2=  28
align=3
	sum:  orig=9080 bde1=9080 dr1=9080 myc1=9080 myc2=9080
	time: orig= 794 bde1=  54 dr1=  47 myc1=  44 myc2=  29
align=0
	sum:  orig=1910 bde1=1910 dr1=1910 myc1=1910 myc2=1910
	time: orig= 499 bde1=  48 dr1=  46 myc1=  42 myc2=  27
align=1
	sum:  orig=6bed bde1=6bed dr1=6bed myc1=6bed myc2=6bed
	time: orig= 802 bde1=  54 dr1=  48 myc1=  43 myc2=  29
align=6
	sum:  orig=9695 bde1=9695 dr1=9695 myc1=9695 myc2=9695
	time: orig= 512 bde1=  53 dr1=  47 myc1=  44 myc2=  28
align=f
	sum:  orig=75e3 bde1=75e3 dr1=75e3 myc1=75e3 myc2=75e3
	time: orig= 791 bde1=  54 dr1=  49 myc1=  44 myc2=  29
./t 8	1536 1536 1536 1536 1536 640
align=3,4,b,5,d,9
	sum:  orig=3927 bde1=3927 dr1=3927 myc1=3927 myc2=3927
	time: orig=10571 bde1= 834 dr1= 900 myc1= 767 myc2= 599
align=9,8,c,6,9,0
	sum:  orig=3be3 bde1=3be3 dr1=3be3 myc1=3be3 myc2=3be3
	time: orig=8768 bde1= 822 dr1= 907 myc1= 761 myc2= 597
align=a,6,d,8,a,4
	sum:  orig=f539 bde1=f539 dr1=f539 myc1=f539 myc2=f539
	time: orig=7958 bde1= 822 dr1= 905 myc1= 764 myc2= 601
align=0,f,3,0,0,e
	sum:  orig=9b4c bde1=9b4c dr1=9b4c myc1=9b4c myc2=9b4c
	time: orig=8821 bde1= 837 dr1= 906 myc1= 756 myc2= 577
align=8,1,1,f,e,b
	sum:  orig=4969 bde1=4969 dr1=4969 myc1=4969 myc2=4969
	time: orig=9793 bde1= 829 dr1= 916 myc1= 768 myc2= 599
align=c,a,0,9,a,5
	sum:  orig=1dd5 bde1=1dd5 dr1=1dd5 myc1=1dd5 myc2=1dd5
	time: orig=8298 bde1= 830 dr1= 913 myc1= 762 myc2= 595
align=2,7,8,4,e,7
	sum:  orig=01d1 bde1=01d1 dr1=01d1 myc1=01d1 myc2=01d1
	time: orig=8311 bde1= 822 dr1= 894 myc1= 763 myc2= 594
align=a,0,2,7,2,4
	sum:  orig=55a5 bde1=55a5 dr1=55a5 myc1=55a5 myc2=55a5
	time: orig=7978 bde1= 827 dr1= 902 myc1= 762 myc2= 587
./t 8	1536 1536 1536 1536 1536 1536 1536 1536 1536 1536 1152
align=0,f,7,e,a,1,5,d,b,d,2
	sum:  orig=4aca bde1=4aca dr1=4aca myc1=4aca myc2=4aca
	time: orig=19687 bde1=1636 dr1=1782 myc1=1503 myc2=1172
align=5,7,8,2,8,9,5,1,f,9,6
	sum:  orig=0554 bde1=0554 dr1=0554 myc1=0554 myc2=0554
	time: orig=19657 bde1=1623 dr1=1773 myc1=1502 myc2=1173
align=d,f,b,6,8,b,e,a,5,5,c
	sum:  orig=97a2 bde1=97a2 dr1=97a2 myc1=97a2 myc2=97a2
	time: orig=18927 bde1=1628 dr1=1777 myc1=1504 myc2=1183
align=d,7,d,4,d,5,0,9,8,b,9
	sum:  orig=eab2 bde1=eab2 dr1=eab2 myc1=eab2 myc2=eab2
	time: orig=20242 bde1=1627 dr1=1779 myc1=1499 myc2=1169
align=0,6,d,3,c,b,b,8,a,0,3
	sum:  orig=8979 bde1=8979 dr1=8979 myc1=8979 myc2=8979
	time: orig=17994 bde1=1628 dr1=1791 myc1=1496 myc2=1161
align=4,d,1,f,6,2,7,8,0,c,3
	sum:  orig=88d1 bde1=88d1 dr1=88d1 myc1=88d1 myc2=88d1
	time: orig=17950 bde1=1614 dr1=1770 myc1=1498 myc2=1158
align=2,6,1,4,3,8,3,c,e,c,8
	sum:  orig=30c7 bde1=30c7 dr1=30c7 myc1=30c7 myc2=30c7
	time: orig=16588 bde1=1607 dr1=1781 myc1=1497 myc2=1155
align=4,6,1,3,f,0,5,0,f,5,5
	sum:  orig=0292 bde1=0292 dr1=0292 myc1=0292 myc2=0292
	time: orig=19492 bde1=1617 dr1=1762 myc1=1502 myc2=1162


Analysis:

* The first 3 cases are obviously degenerate; they are intended to
quickly test the odd-alignment and odd-size handling.  This is the
first time I've actually had to sacrifice a little performance on them
to optimize the other cases.  What they show is that there is a higher
per-mbuf cost in my version (primarily from the alignment-twiddling
code).

* The 4th case is a typical IP header.  This happens a lot, so you
want it to be fast.  As you can see, the timings converge around here.
It may be worthwhile to have a separate sum routine optimized
specifically for these short packets.

* The 5th case isn't really real-world; it's just to get an idea how
the curve goes as the mbuf size increases.  What this and the previous
case show is that as the mbuf size increases, my cache handling starts
to make a noticable difference.  (It's much more noticable on a 486,
but we're comparing Pentiums.)

* The 6th and 7th cases are typical TCP segments, like you'd expect
with FTP or HTTP.

* The last two cases are typical NFS packets, having been reassembled
from fragments.  These should be large enough to overflow the L1 cache
and thus make the cache considerations more visible.

Note that there was some load on this machine at the time, so there is
some variance in the figures.

As a final note, it's true that the alignment conditions I simulate
aren't really realistic, but all implementations since my previous C
version have (I presume due to my optimizations) been nearly immune to
alignment considerations, so I don't consider this an issue.