So, I think I know a wee bit of what might be throwing me off on my measurements. I finally decided that including a context switch and a system call in my "benchmark" was a bit too premature. I first need to make sure I am measuring user time properly. My first test, substituting random() calls for the time() calls shows clearly something is wrong with the <sys/syscall_stats.h> hooks. There should be no measurable system time per call here where I use binuptime() in the <sys/syscall_stats.h> hooks: $ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100 tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns tcountbits: now running each algorithm for 100000000 iterations.... random() = 0.0014 us/c user, 0.0194 us/c sys, 0.0011 us/c wait, 0.0219 us/c wall nulltime() = 0.0002 us/c user, 0.0228 us/c sys, 0.0013 us/c wait, 0.0243 us/c wall countbits_sparse() = 0.0010 us/c user, 0.0764 us/c sys, 0.0033 us/c wait, 0.0807 us/c wall countbits_dense() = 0.0012 us/c user, 0.0830 us/c sys, 0.0037 us/c wait, 0.0879 us/c wall COUNT_BITS() = 0.0028 us/c user, 0.0248 us/c sys, 0.0016 us/c wait, 0.0292 us/c wall count_bits() = 0.0012 us/c user, 0.0312 us/c sys, 0.0016 us/c wait, 0.0340 us/c wall count_ul_bits() = 0.0005 us/c user, 0.1101 us/c sys, 0.0035 us/c wait, 0.1141 us/c wall 39.23 real 0.83 user 37.29 sys 21428 maximum resident set size 8 average shared memory size 483 average unshared data size 8 average unshared stack size 59 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received -147 voluntary context switches 207 involuntary context switches binuptime() of course isn't free or fast, so thinking that it alone might be skewing some system time into the measurements due to where it is called in context switching I tried using hardclock_ticks in the <sys/syscall_stats.h> hooks and it's clearly no better: $ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100 tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns tcountbits: now running each algorithm for 100000000 iterations.... random() = 0.0005 us/c user, 0.0214 us/c sys, 0.0005 us/c wait, 0.0224 us/c wall nulltime() = 0.0019 us/c user, 0.0226 us/c sys, 0.0004 us/c wait, 0.0248 us/c wall countbits_sparse() = 0.0023 us/c user, 0.0770 us/c sys, 0.0006 us/c wait, 0.0799 us/c wall countbits_dense() = 0.0013 us/c user, 0.0851 us/c sys, 0.0012 us/c wait, 0.0877 us/c wall COUNT_BITS() = 0.0013 us/c user, 0.0282 us/c sys, 0.0006 us/c wait, 0.0302 us/c wall count_bits() = 0.0003 us/c user, 0.0330 us/c sys, 0.0005 us/c wait, 0.0338 us/c wall count_ul_bits() = 0.0019 us/c user, 0.1107 us/c sys, 0.0016 us/c wait, 0.1142 us/c wall 39.37 real 0.95 user 38.22 sys 20216 maximum resident set size 8 average shared memory size 458 average unshared data size 8 average unshared stack size 50 page reclaims 33 page faults 0 swaps 4 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received -112 voluntary context switches 215 involuntary context switches Just for giggles here is a run with the CPU TSC as the timer in the <sys/syscall_stats.h> hooks showing the same gibberish: $ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100 tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns tcountbits: now running each algorithm for 100000000 iterations.... random() = 0.0013 us/c user, 0.0196 us/c sys, 0.0003 us/c wait, 0.0213 us/c wall nulltime() = 0.0023 us/c user, 0.0211 us/c sys, 0.0003 us/c wait, 0.0238 us/c wall countbits_sparse() = 0.0011 us/c user, 0.0787 us/c sys, 0.0003 us/c wait, 0.0801 us/c wall countbits_dense() = 0.0008 us/c user, 0.0860 us/c sys, 0.0005 us/c wait, 0.0873 us/c wall COUNT_BITS() = 0.0025 us/c user, 0.0264 us/c sys, 0.0003 us/c wait, 0.0292 us/c wall count_bits() = 0.0005 us/c user, 0.0328 us/c sys, 0.0003 us/c wait, 0.0336 us/c wall count_ul_bits() = 0.0024 us/c user, 0.1164 us/c sys, 0.0004 us/c wait, 0.1191 us/c wall 39.50 real 1.09 user 38.35 sys 20336 maximum resident set size 11 average shared memory size 671 average unshared data size 11 average unshared stack size 50 page reclaims 33 page faults 0 swaps 4 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received -187 voluntary context switches 291 involuntary context switches Finally here are pretty much the results using the old statclock() sampling showing exactly the expected results with only a very tiny and perhaps insignificant aberration on four of the test runs: $ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100 tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns tcountbits: now running each algorithm for 100000000 iterations.... random() = 0.0201 us/c user, 0.0000 us/c sys, 0.0002 us/c wait, 0.0204 us/c wall nulltime() = 0.0229 us/c user, 0.0000 us/c sys, 0.0006 us/c wait, 0.0235 us/c wall countbits_sparse() = 0.0869 us/c user, 0.0001 us/c sys, 0.0003 us/c wait, 0.0873 us/c wall countbits_dense() = 0.0971 us/c user, 0.0001 us/c sys, 0.0003 us/c wait, 0.0975 us/c wall COUNT_BITS() = 0.0300 us/c user, 0.0001 us/c sys, 0.0002 us/c wait, 0.0303 us/c wall count_bits() = 0.0339 us/c user, 0.0000 us/c sys, 0.0003 us/c wait, 0.0343 us/c wall count_ul_bits() = 0.1173 us/c user, 0.0002 us/c sys, 0.0004 us/c wait, 0.1178 us/c wall 41.16 real 41.00 user 0.09 sys 20272 maximum resident set size 7 average shared memory size 424 average unshared data size 7 average unshared stack size 50 page reclaims 33 page faults 0 swaps 4 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received -105 voluntary context switches 207 involuntary context switches Here's the DIAGNOSTIC print from calcru() during the run for reference, and we can see that there are multiple hits, as many as ten, accounted to system time despite there being no system calls whatsoever during the test runs. These hits do not seem to be due to the getrusage() calls themselves though since the longer the runs take, the more hits there are, so statclock() is apparently catching the process in the transition between CPU slices: [[ .... this pair represent the before & after getrusage() calls around the random() test .... ]] getrusage: tcountbits[333]: 7366 us (u=1227 us [1], s=6138 us [5], i=0 us [0]) getrusage: tcountbits[333]: 2024291 us (u=2015054 us [1309], s=9236 us [6], i=0 us [0]) [[ .... this pair represent the before & after getrusage() calls around the nulltime() test .... ]] getrusage: tcountbits[333]: 2049048 us (u=2030434 us [1309], s=18613 us [12], i=0 us [0]) getrusage: tcountbits[333]: 4338527 us (u=4316865 us [2790], s=21661 us [14], i=0 us [0]) [[ .... this pair represent the before & after getrusage() calls around the countbits_sparse() test .... ]] getrusage: tcountbits[333]: 4392946 us (u=4367897 us [2790], s=25048 us [16], i=0 us [0]) getrusage: tcountbits[333]: 13089803 us (u=13057370 us [8052], s=32432 us [20], i=0 us [0]) [[ .... this pair represent the before & after getrusage() calls around the countbits_dense() test .... ]] getrusage: tcountbits[333]: 13121774 us (u=13082779 us [8052], s=38994 us [24], i=0 us [0]) getrusage: tcountbits[333]: 22841278 us (u=22793855 us [13939], s=47422 us [29], i=0 us [0]) [[ .... this pair represent the before & after getrusage() calls around the COUNT_BITS() test .... ]] getrusage: tcountbits[333]: 22863786 us (u=22811417 us [13939], s=52368 us [32], i=0 us [0]) getrusage: tcountbits[333]: 25871319 us (u=25812594 us [15824], s=58724 us [36], i=0 us [0]) [[ .... this pair represent the before & after getrusage() calls around the count_bits() test .... ]] getrusage: tcountbits[333]: 25892678 us (u=25829019 us [15824], s=63658 us [39], i=0 us [0]) getrusage: tcountbits[333]: 29288928 us (u=29220579 us [17956], s=68348 us [42], i=0 us [0]) [[ .... this pair represent the before & after getrusage() calls around the count_ul_bits() test .... ]] getrusage: tcountbits[333]: 29317905 us (u=29246239 us [17956], s=71665 us [44], i=0 us [0]) getrusage: tcountbits[333]: 41065530 us (u=40976615 us [25347], s=87297 us [54], i=1616 us [1]) exit|tty: tcountbits[333]: 41100603 us (u=41006769 us [25347], s=92215 us [57], i=1617 us [1]) So, while statclock() tick counting seems more or less good enough for benchmarking entirely CPU bound user-mode code, it's too wobbly for code that spends a lot of time using low-cost no-I/O system calls such as my little test program using time() again: $ /usr/bin/time -l /root/tmp/tcountbits -t -i 1 tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns tcountbits: now running each algorithm for 1000000 iterations.... time() = 0.0547 us/c user, 9.8257 us/c sys, 0.0058 us/c wait, 9.8861 us/c wall nulltime() = 0.0605 us/c user, 9.7801 us/c sys, 0.0253 us/c wait, 9.8659 us/c wall countbits_sparse() = 0.0605 us/c user, 9.7468 us/c sys, 0.0366 us/c wait, 9.8438 us/c wall countbits_dense() = 0.0456 us/c user, 9.7778 us/c sys, 0.0141 us/c wait, 9.8375 us/c wall COUNT_BITS() = 0.0423 us/c user, 9.6685 us/c sys, 0.0290 us/c wait, 9.7397 us/c wall count_bits() = 0.0515 us/c user, 9.6292 us/c sys, 0.0128 us/c wait, 9.6935 us/c wall count_ul_bits() = 0.0561 us/c user, 9.8984 us/c sys, 0.0280 us/c wait, 9.9826 us/c wall 68.87 real 0.37 user 68.47 sys 20356 maximum resident set size 7 average shared memory size 408 average unshared data size 7 average unshared stack size 59 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 45 involuntary context switches Would anyone who bothered to read this far, and who knows a thing or two about context switching and such care to try to help see if the hooks in <sys/syscall_stats.h> can be better placed or otherwise improved? FYI, with random() my test program is still showing wonky wobbly results on my iMac, and though they are not terrible, they're not really usable at this resolution either: $ /usr/bin/time -l ./tcountbits -t -r -i 100 tcountbits: now running each algorithm for 100000000 iterations.... random() = 0.0056 us/c user, 0.0000 us/c sys, 0.0002 us/c wait, 0.0058 us/c wall nulltime() = 0.0113 us/c user, 0.0000 us/c sys, 0.0006 us/c wait, 0.0119 us/c wall countbits_sparse() = 0.0616 us/c user, 0.0001 us/c sys, 0.0028 us/c wait, 0.0645 us/c wall countbits_dense() = 0.1372 us/c user, 0.0004 us/c sys, 0.0102 us/c wait, 0.1479 us/c wall COUNT_BITS() = 0.0152 us/c user, 0.0000 us/c sys, 0.0006 us/c wait, 0.0158 us/c wall count_bits() = 0.0205 us/c user, 0.0000 us/c sys, 0.0007 us/c wait, 0.0212 us/c wall count_ul_bits() = 0.0949 us/c user, 0.0003 us/c sys, 0.0044 us/c wait, 0.0996 us/c wall 36.71 real 34.62 user 0.09 sys 368640 maximum resident set size 0 average shared memory size 0 average unshared data size 0 average unshared stack size 110 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 25945 involuntary context switches -- Greg A. Woods Planix, Inc. <woods%planix.com@localhost> +1 250 762-7675 http://www.planix.com/
Attachment:
pgpxi9VoBSTBs.pgp
Description: PGP signature