So, I think I know a wee bit of what might be throwing me off on my
measurements.
I finally decided that including a context switch and a system call in
my "benchmark" was a bit too premature. I first need to make sure I am
measuring user time properly.
My first test, substituting random() calls for the time() calls shows
clearly something is wrong with the <sys/syscall_stats.h> hooks.
There should be no measurable system time per call here where I use
binuptime() in the <sys/syscall_stats.h> hooks:
$ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 100000000 iterations....
random() = 0.0014 us/c user, 0.0194 us/c sys, 0.0011 us/c wait,
0.0219 us/c wall
nulltime() = 0.0002 us/c user, 0.0228 us/c sys, 0.0013 us/c wait,
0.0243 us/c wall
countbits_sparse() = 0.0010 us/c user, 0.0764 us/c sys, 0.0033 us/c wait,
0.0807 us/c wall
countbits_dense() = 0.0012 us/c user, 0.0830 us/c sys, 0.0037 us/c wait,
0.0879 us/c wall
COUNT_BITS() = 0.0028 us/c user, 0.0248 us/c sys, 0.0016 us/c wait,
0.0292 us/c wall
count_bits() = 0.0012 us/c user, 0.0312 us/c sys, 0.0016 us/c wait,
0.0340 us/c wall
count_ul_bits() = 0.0005 us/c user, 0.1101 us/c sys, 0.0035 us/c wait,
0.1141 us/c wall
39.23 real 0.83 user 37.29 sys
21428 maximum resident set size
8 average shared memory size
483 average unshared data size
8 average unshared stack size
59 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
-147 voluntary context switches
207 involuntary context switches
binuptime() of course isn't free or fast, so thinking that it alone
might be skewing some system time into the measurements due to where it
is called in context switching I tried using hardclock_ticks in the
<sys/syscall_stats.h> hooks and it's clearly no better:
$ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 100000000 iterations....
random() = 0.0005 us/c user, 0.0214 us/c sys, 0.0005 us/c wait,
0.0224 us/c wall
nulltime() = 0.0019 us/c user, 0.0226 us/c sys, 0.0004 us/c wait,
0.0248 us/c wall
countbits_sparse() = 0.0023 us/c user, 0.0770 us/c sys, 0.0006 us/c wait,
0.0799 us/c wall
countbits_dense() = 0.0013 us/c user, 0.0851 us/c sys, 0.0012 us/c wait,
0.0877 us/c wall
COUNT_BITS() = 0.0013 us/c user, 0.0282 us/c sys, 0.0006 us/c wait,
0.0302 us/c wall
count_bits() = 0.0003 us/c user, 0.0330 us/c sys, 0.0005 us/c wait,
0.0338 us/c wall
count_ul_bits() = 0.0019 us/c user, 0.1107 us/c sys, 0.0016 us/c wait,
0.1142 us/c wall
39.37 real 0.95 user 38.22 sys
20216 maximum resident set size
8 average shared memory size
458 average unshared data size
8 average unshared stack size
50 page reclaims
33 page faults
0 swaps
4 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
-112 voluntary context switches
215 involuntary context switches
Just for giggles here is a run with the CPU TSC as the timer in the
<sys/syscall_stats.h> hooks showing the same gibberish:
$ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 100000000 iterations....
random() = 0.0013 us/c user, 0.0196 us/c sys, 0.0003 us/c wait,
0.0213 us/c wall
nulltime() = 0.0023 us/c user, 0.0211 us/c sys, 0.0003 us/c wait,
0.0238 us/c wall
countbits_sparse() = 0.0011 us/c user, 0.0787 us/c sys, 0.0003 us/c wait,
0.0801 us/c wall
countbits_dense() = 0.0008 us/c user, 0.0860 us/c sys, 0.0005 us/c wait,
0.0873 us/c wall
COUNT_BITS() = 0.0025 us/c user, 0.0264 us/c sys, 0.0003 us/c wait,
0.0292 us/c wall
count_bits() = 0.0005 us/c user, 0.0328 us/c sys, 0.0003 us/c wait,
0.0336 us/c wall
count_ul_bits() = 0.0024 us/c user, 0.1164 us/c sys, 0.0004 us/c wait,
0.1191 us/c wall
39.50 real 1.09 user 38.35 sys
20336 maximum resident set size
11 average shared memory size
671 average unshared data size
11 average unshared stack size
50 page reclaims
33 page faults
0 swaps
4 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
-187 voluntary context switches
291 involuntary context switches
Finally here are pretty much the results using the old statclock()
sampling showing exactly the expected results with only a very tiny and
perhaps insignificant aberration on four of the test runs:
$ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 100000000 iterations....
random() = 0.0201 us/c user, 0.0000 us/c sys, 0.0002 us/c wait,
0.0204 us/c wall
nulltime() = 0.0229 us/c user, 0.0000 us/c sys, 0.0006 us/c wait,
0.0235 us/c wall
countbits_sparse() = 0.0869 us/c user, 0.0001 us/c sys, 0.0003 us/c wait,
0.0873 us/c wall
countbits_dense() = 0.0971 us/c user, 0.0001 us/c sys, 0.0003 us/c wait,
0.0975 us/c wall
COUNT_BITS() = 0.0300 us/c user, 0.0001 us/c sys, 0.0002 us/c wait,
0.0303 us/c wall
count_bits() = 0.0339 us/c user, 0.0000 us/c sys, 0.0003 us/c wait,
0.0343 us/c wall
count_ul_bits() = 0.1173 us/c user, 0.0002 us/c sys, 0.0004 us/c wait,
0.1178 us/c wall
41.16 real 41.00 user 0.09 sys
20272 maximum resident set size
7 average shared memory size
424 average unshared data size
7 average unshared stack size
50 page reclaims
33 page faults
0 swaps
4 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
-105 voluntary context switches
207 involuntary context switches
Here's the DIAGNOSTIC print from calcru() during the run for reference,
and we can see that there are multiple hits, as many as ten, accounted
to system time despite there being no system calls whatsoever during the
test runs. These hits do not seem to be due to the getrusage() calls
themselves though since the longer the runs take, the more hits there
are, so statclock() is apparently catching the process in the transition
between CPU slices:
[[ .... this pair represent the before & after getrusage() calls around
the random() test .... ]]
getrusage: tcountbits[333]: 7366 us (u=1227 us [1], s=6138 us [5], i=0 us [0])
getrusage: tcountbits[333]: 2024291 us (u=2015054 us [1309], s=9236 us [6], i=0
us [0])
[[ .... this pair represent the before & after getrusage() calls around
the nulltime() test .... ]]
getrusage: tcountbits[333]: 2049048 us (u=2030434 us [1309], s=18613 us [12],
i=0 us [0])
getrusage: tcountbits[333]: 4338527 us (u=4316865 us [2790], s=21661 us [14],
i=0 us [0])
[[ .... this pair represent the before & after getrusage() calls around
the countbits_sparse() test .... ]]
getrusage: tcountbits[333]: 4392946 us (u=4367897 us [2790], s=25048 us [16],
i=0 us [0])
getrusage: tcountbits[333]: 13089803 us (u=13057370 us [8052], s=32432 us [20],
i=0 us [0])
[[ .... this pair represent the before & after getrusage() calls around
the countbits_dense() test .... ]]
getrusage: tcountbits[333]: 13121774 us (u=13082779 us [8052], s=38994 us [24],
i=0 us [0])
getrusage: tcountbits[333]: 22841278 us (u=22793855 us [13939], s=47422 us
[29], i=0 us [0])
[[ .... this pair represent the before & after getrusage() calls around
the COUNT_BITS() test .... ]]
getrusage: tcountbits[333]: 22863786 us (u=22811417 us [13939], s=52368 us
[32], i=0 us [0])
getrusage: tcountbits[333]: 25871319 us (u=25812594 us [15824], s=58724 us
[36], i=0 us [0])
[[ .... this pair represent the before & after getrusage() calls around
the count_bits() test .... ]]
getrusage: tcountbits[333]: 25892678 us (u=25829019 us [15824], s=63658 us
[39], i=0 us [0])
getrusage: tcountbits[333]: 29288928 us (u=29220579 us [17956], s=68348 us
[42], i=0 us [0])
[[ .... this pair represent the before & after getrusage() calls around
the count_ul_bits() test .... ]]
getrusage: tcountbits[333]: 29317905 us (u=29246239 us [17956], s=71665 us
[44], i=0 us [0])
getrusage: tcountbits[333]: 41065530 us (u=40976615 us [25347], s=87297 us
[54], i=1616 us [1])
exit|tty: tcountbits[333]: 41100603 us (u=41006769 us [25347], s=92215 us [57],
i=1617 us [1])
So, while statclock() tick counting seems more or less good enough for
benchmarking entirely CPU bound user-mode code, it's too wobbly for code
that spends a lot of time using low-cost no-I/O system calls such as my
little test program using time() again:
$ /usr/bin/time -l /root/tmp/tcountbits -t -i 1
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 1000000 iterations....
time() = 0.0547 us/c user, 9.8257 us/c sys, 0.0058 us/c wait,
9.8861 us/c wall
nulltime() = 0.0605 us/c user, 9.7801 us/c sys, 0.0253 us/c wait,
9.8659 us/c wall
countbits_sparse() = 0.0605 us/c user, 9.7468 us/c sys, 0.0366 us/c wait,
9.8438 us/c wall
countbits_dense() = 0.0456 us/c user, 9.7778 us/c sys, 0.0141 us/c wait,
9.8375 us/c wall
COUNT_BITS() = 0.0423 us/c user, 9.6685 us/c sys, 0.0290 us/c wait,
9.7397 us/c wall
count_bits() = 0.0515 us/c user, 9.6292 us/c sys, 0.0128 us/c wait,
9.6935 us/c wall
count_ul_bits() = 0.0561 us/c user, 9.8984 us/c sys, 0.0280 us/c wait,
9.9826 us/c wall
68.87 real 0.37 user 68.47 sys
20356 maximum resident set size
7 average shared memory size
408 average unshared data size
7 average unshared stack size
59 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
45 involuntary context switches
Would anyone who bothered to read this far, and who knows a thing or two
about context switching and such care to try to help see if the hooks in
<sys/syscall_stats.h> can be better placed or otherwise improved?
FYI, with random() my test program is still showing wonky wobbly results
on my iMac, and though they are not terrible, they're not really usable
at this resolution either:
$ /usr/bin/time -l ./tcountbits -t -r -i 100
tcountbits: now running each algorithm for 100000000 iterations....
random() = 0.0056 us/c user, 0.0000 us/c sys, 0.0002 us/c wait,
0.0058 us/c wall
nulltime() = 0.0113 us/c user, 0.0000 us/c sys, 0.0006 us/c wait,
0.0119 us/c wall
countbits_sparse() = 0.0616 us/c user, 0.0001 us/c sys, 0.0028 us/c wait,
0.0645 us/c wall
countbits_dense() = 0.1372 us/c user, 0.0004 us/c sys, 0.0102 us/c wait,
0.1479 us/c wall
COUNT_BITS() = 0.0152 us/c user, 0.0000 us/c sys, 0.0006 us/c wait,
0.0158 us/c wall
count_bits() = 0.0205 us/c user, 0.0000 us/c sys, 0.0007 us/c wait,
0.0212 us/c wall
count_ul_bits() = 0.0949 us/c user, 0.0003 us/c sys, 0.0044 us/c wait,
0.0996 us/c wall
36.71 real 34.62 user 0.09 sys
368640 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
110 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
25945 involuntary context switches
--
Greg A. Woods
Planix, Inc.
<woods%planix.com@localhost> +1 250 762-7675 http://www.planix.com/
Attachment:
pgpxi9VoBSTBs.pgp
Description: PGP signature