Subject: Re: DIAGNOSTIC in -current
To: Thor Lancelot Simon <tls@rek.tjls.com>
From: Chuck Silvers <chuq@chuq.com>
List: current-users
Date: 09/30/2006 16:34:24
On Thu, Sep 28, 2006 at 04:26:56PM -0400, Thor Lancelot Simon wrote:
> I think what was used last time was either lmbench or hbench-OS.

ok.  I ran lmbench with both a vanilla -current GENERIC kernel
("macro") and one that has all internal details of malloc() moved into
kern_malloc.c, with MALLOC() and FREE() defined to just call malloc()
and free() respectively ("func").  comparing the two runs, some tests
had the same result, some were faster with one configuration and some
with the other.  where the results differed, the vanilla kernel was
faster more often.  here are the first few results from
lmbench:

macro
Simple syscall: 0.3796 microseconds
Simple read: 1.4517 microseconds
Simple write: 1.7143 microseconds
Simple stat: 6.0924 microseconds
Simple fstat: 0.8437 microseconds

func
Simple syscall: 0.3796 microseconds
Simple read: 1.4891 microseconds
Simple write: 1.5419 microseconds
Simple stat: 6.4692 microseconds
Simple fstat: 0.8447 microseconds


I tried replicating the result for "stat" with my own test program
that just stat()s a file in a loop a given number of times.  this test
shows the same result as lmbench:

macro
2 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.119u 2.905s 0:03.05 98.6%     0+0k 4+0io 0pf+0w
3 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.129u 2.898s 0:03.02 99.6%     0+0k 0+1io 0pf+0w
4 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.079u 2.947s 0:03.02 99.6%     0+0k 0+0io 0pf+0w
5 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.059u 2.969s 0:03.02 99.6%     0+0k 0+1io 0pf+0w
6 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.060u 2.963s 0:03.02 100.0%    0+0k 0+0io 0pf+0w

func
2 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.080u 3.144s 0:03.25 99.0%     0+0k 4+1io 0pf+0w
3 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.090u 3.133s 0:03.22 100.0%    0+0k 0+1io 0pf+0w
4 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.100u 3.122s 0:03.22 100.0%    0+0k 0+0io 0pf+0w
5 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.060u 3.160s 0:03.22 100.0%    0+0k 0+0io 0pf+0w
6 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.149u 3.086s 0:03.23 99.6%     0+0k 0+1io 0pf+0w


after looking at the code, I was pretty sure that stat() doesn't
actually call MALLOC() anywhere, and I confirmed that by running a
profiling version of the "func" kernel.  so it's clearly not the
instructions being executed that are making the difference.  I guessed
that the difference was just the code layout in memory causing
different CPU cache hit rates.  to see if the inline vs. function
change had any effect at all, I made a kernel where a global variable
controlled whether the macro called the function or did the inline
thing.  here are the results:

switchable - macro
2 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.100u 2.971s 0:03.10 99.0%     0+0k 4+0io 0pf+0w
3 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.200u 2.861s 0:03.06 100.0%    0+0k 0+1io 0pf+0w
4 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.110u 2.962s 0:03.07 100.0%    0+0k 0+0io 0pf+0w
5 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.139u 2.925s 0:03.06 99.6%     0+0k 0+0io 0pf+0w
6 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.159u 2.898s 0:03.05 99.6%     0+0k 0+1io 0pf+0w

switchable - func
7 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.160u 2.911s 0:03.07 100.0%    0+0k 0+0io 0pf+0w
8 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.139u 2.919s 0:03.05 99.6%     0+0k 0+0io 0pf+0w
9 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.109u 2.958s 0:03.06 99.6%     0+0k 0+0io 0pf+0w
10 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.209u 2.859s 0:03.06 99.6%     0+0k 0+0io 0pf+0w
11 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.139u 2.929s 0:03.06 99.6%     0+0k 0+0io 0pf+0w

so indeed, the macro vs. function change itself makes no difference;
the variation is all due to the extra code in the expanded macro
perturbing the CPU cache usage, which has a significant effect in
these microbenchmarks.


another interesting bit that I noticed during all these experiments
was that the results for some configurations weren't consistent from
one boot cycle to the next:

2 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.089u 3.194s 0:03.31 98.7%     0+0k 4+0io 0pf+0w
3 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.160u 3.125s 0:03.28 100.0%    0+0k 0+1io 0pf+0w
4 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.099u 3.178s 0:03.27 99.6%     0+0k 0+0io 0pf+0w
5 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.149u 3.132s 0:03.28 99.6%     0+0k 0+1io 0pf+0w
6 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.110u 3.162s 0:03.27 100.0%    0+0k 0+0io 0pf+0w

... then same kernel after a reboot
2 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.160u 3.151s 0:03.34 99.1%     0+0k 4+0io 0pf+0w
3 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.100u 3.211s 0:03.31 100.0%    0+0k 0+0io 0pf+0w
4 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.110u 3.192s 0:03.30 100.0%    0+0k 0+0io 0pf+0w
5 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.099u 3.209s 0:03.30 99.6%     0+0k 0+0io 0pf+0w
6 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.200u 3.113s 0:03.31 100.0%    0+0k 0+1io 0pf+0w

... and after another reboot
2 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.209u 3.113s 0:03.35 98.8%     0+0k 4+1io 0pf+0w
3 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.160u 3.168s 0:03.32 100.0%    0+0k 0+0io 0pf+0w
4 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.100u 3.228s 0:03.32 100.0%    0+0k 0+1io 0pf+0w
5 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.150u 3.180s 0:03.33 100.0%    0+0k 0+0io 0pf+0w
6 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.119u 3.209s 0:03.32 99.6%     0+0k 0+0io 0pf+0w

I would guess that different physical memory pages are being used for
different purposes in the various boot cycles, which again perturbs
the usage of the CPU caches.


just for fun, I tried making a kernel where MALLOC() and FREE()
contained a call to the corresponding function version followed by
enough nop instructions to make the kernel have the same code layout
as with the normal macro.  I calculated that this would be 107 nops
for MALLOC() and 183 nops for FREE().  I didn't get it quite right,
but I was pretty close:

vanilla GENERIC
-rwxr-xr-x  1 chs  chs  9004708 Sep 29 04:29 netbsd

func
-rwxr-xr-x  1 chs  chs  8944919 Sep 29 05:18 netbsd

func + nops
-rwxr-xr-x  1 chs  chs  9001881 Sep 29 07:38 netbsd


and here's the result:

2 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.070u 2.810s 0:02.91 98.9%     0+0k 4+0io 0pf+0w
3 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.109u 2.775s 0:02.88 99.6%     0+0k 0+0io 0pf+0w
4 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.069u 2.817s 0:02.88 99.6%     0+0k 0+1io 0pf+0w
5 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.119u 2.776s 0:02.89 99.6%     0+0k 0+0io 0pf+0w
6 p3:~/perf # time ./mstat /build/tmp/lmbench 500000
0.159u 2.737s 0:02.89 99.6%     0+0k 0+1io 0pf+0w

so clearly, if we want the best performance for stat(), we need to
insert tens of thousands of nops in various places around the kernel.  :-)


I ran lmbench again with the runtime-switchable kernel and the results
were the same with both settings.  I'll go ahead and change MALLOC()
and FREE() to always just call their function versions.

-Chuck