tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

clues to getrusage() problems with user vs. system time reporting



So, I think I know a wee bit of what might be throwing me off on my
measurements.

I finally decided that including a context switch and a system call in
my "benchmark" was a bit too premature.  I first need to make sure I am
measuring user time properly.

My first test, substituting random() calls for the time() calls shows
clearly something is wrong with the <sys/syscall_stats.h> hooks.

There should be no measurable system time per call here where I use
binuptime() in the <sys/syscall_stats.h> hooks:

$ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 100000000 iterations....
          random() = 0.0014 us/c user,  0.0194 us/c sys, 0.0011 us/c wait,  
0.0219 us/c wall
        nulltime() = 0.0002 us/c user,  0.0228 us/c sys, 0.0013 us/c wait,  
0.0243 us/c wall
countbits_sparse() = 0.0010 us/c user,  0.0764 us/c sys, 0.0033 us/c wait,  
0.0807 us/c wall
 countbits_dense() = 0.0012 us/c user,  0.0830 us/c sys, 0.0037 us/c wait,  
0.0879 us/c wall
      COUNT_BITS() = 0.0028 us/c user,  0.0248 us/c sys, 0.0016 us/c wait,  
0.0292 us/c wall
      count_bits() = 0.0012 us/c user,  0.0312 us/c sys, 0.0016 us/c wait,  
0.0340 us/c wall
   count_ul_bits() = 0.0005 us/c user,  0.1101 us/c sys, 0.0035 us/c wait,  
0.1141 us/c wall
       39.23 real         0.83 user        37.29 sys
     21428  maximum resident set size
         8  average shared memory size
       483  average unshared data size
         8  average unshared stack size
        59  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
      -147  voluntary context switches
       207  involuntary context switches


binuptime() of course isn't free or fast, so thinking that it alone
might be skewing some system time into the measurements due to where it
is called in context switching I tried using hardclock_ticks in the
<sys/syscall_stats.h> hooks and it's clearly no better:

$ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 100000000 iterations....
          random() = 0.0005 us/c user,  0.0214 us/c sys, 0.0005 us/c wait,  
0.0224 us/c wall
        nulltime() = 0.0019 us/c user,  0.0226 us/c sys, 0.0004 us/c wait,  
0.0248 us/c wall
countbits_sparse() = 0.0023 us/c user,  0.0770 us/c sys, 0.0006 us/c wait,  
0.0799 us/c wall
 countbits_dense() = 0.0013 us/c user,  0.0851 us/c sys, 0.0012 us/c wait,  
0.0877 us/c wall
      COUNT_BITS() = 0.0013 us/c user,  0.0282 us/c sys, 0.0006 us/c wait,  
0.0302 us/c wall
      count_bits() = 0.0003 us/c user,  0.0330 us/c sys, 0.0005 us/c wait,  
0.0338 us/c wall
   count_ul_bits() = 0.0019 us/c user,  0.1107 us/c sys, 0.0016 us/c wait,  
0.1142 us/c wall
       39.37 real         0.95 user        38.22 sys
     20216  maximum resident set size
         8  average shared memory size
       458  average unshared data size
         8  average unshared stack size
        50  page reclaims
        33  page faults
         0  swaps
         4  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
      -112  voluntary context switches
       215  involuntary context switches


Just for giggles here is a run with the CPU TSC as the timer in the
<sys/syscall_stats.h> hooks showing the same gibberish:

$ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 100000000 iterations....
          random() = 0.0013 us/c user,  0.0196 us/c sys, 0.0003 us/c wait,  
0.0213 us/c wall
        nulltime() = 0.0023 us/c user,  0.0211 us/c sys, 0.0003 us/c wait,  
0.0238 us/c wall
countbits_sparse() = 0.0011 us/c user,  0.0787 us/c sys, 0.0003 us/c wait,  
0.0801 us/c wall
 countbits_dense() = 0.0008 us/c user,  0.0860 us/c sys, 0.0005 us/c wait,  
0.0873 us/c wall
      COUNT_BITS() = 0.0025 us/c user,  0.0264 us/c sys, 0.0003 us/c wait,  
0.0292 us/c wall
      count_bits() = 0.0005 us/c user,  0.0328 us/c sys, 0.0003 us/c wait,  
0.0336 us/c wall
   count_ul_bits() = 0.0024 us/c user,  0.1164 us/c sys, 0.0004 us/c wait,  
0.1191 us/c wall
       39.50 real         1.09 user        38.35 sys
     20336  maximum resident set size
        11  average shared memory size
       671  average unshared data size
        11  average unshared stack size
        50  page reclaims
        33  page faults
         0  swaps
         4  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
      -187  voluntary context switches
       291  involuntary context switches


Finally here are pretty much the results using the old statclock()
sampling showing exactly the expected results with only a very tiny and
perhaps insignificant aberration on four of the test runs:

$ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 100000000 iterations....
          random() = 0.0201 us/c user,  0.0000 us/c sys, 0.0002 us/c wait,  
0.0204 us/c wall
        nulltime() = 0.0229 us/c user,  0.0000 us/c sys, 0.0006 us/c wait,  
0.0235 us/c wall
countbits_sparse() = 0.0869 us/c user,  0.0001 us/c sys, 0.0003 us/c wait,  
0.0873 us/c wall
 countbits_dense() = 0.0971 us/c user,  0.0001 us/c sys, 0.0003 us/c wait,  
0.0975 us/c wall
      COUNT_BITS() = 0.0300 us/c user,  0.0001 us/c sys, 0.0002 us/c wait,  
0.0303 us/c wall
      count_bits() = 0.0339 us/c user,  0.0000 us/c sys, 0.0003 us/c wait,  
0.0343 us/c wall
   count_ul_bits() = 0.1173 us/c user,  0.0002 us/c sys, 0.0004 us/c wait,  
0.1178 us/c wall
       41.16 real        41.00 user         0.09 sys
     20272  maximum resident set size
         7  average shared memory size
       424  average unshared data size
         7  average unshared stack size
        50  page reclaims
        33  page faults
         0  swaps
         4  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
      -105  voluntary context switches
       207  involuntary context switches


Here's the DIAGNOSTIC print from calcru() during the run for reference,
and we can see that there are multiple hits, as many as ten, accounted
to system time despite there being no system calls whatsoever during the
test runs.  These hits do not seem to be due to the getrusage() calls
themselves though since the longer the runs take, the more hits there
are, so statclock() is apparently catching the process in the transition
between CPU slices:

[[ .... this pair represent the before & after getrusage() calls around
the random() test .... ]]
getrusage: tcountbits[333]: 7366 us (u=1227 us [1], s=6138 us [5], i=0 us [0])
getrusage: tcountbits[333]: 2024291 us (u=2015054 us [1309], s=9236 us [6], i=0 
us [0])

[[ .... this pair represent the before & after getrusage() calls around
the nulltime() test .... ]]
getrusage: tcountbits[333]: 2049048 us (u=2030434 us [1309], s=18613 us [12], 
i=0 us [0])
getrusage: tcountbits[333]: 4338527 us (u=4316865 us [2790], s=21661 us [14], 
i=0 us [0])

[[ .... this pair represent the before & after getrusage() calls around
the countbits_sparse() test .... ]]
getrusage: tcountbits[333]: 4392946 us (u=4367897 us [2790], s=25048 us [16], 
i=0 us [0])
getrusage: tcountbits[333]: 13089803 us (u=13057370 us [8052], s=32432 us [20], 
i=0 us [0])

[[ .... this pair represent the before & after getrusage() calls around
the countbits_dense() test .... ]]
getrusage: tcountbits[333]: 13121774 us (u=13082779 us [8052], s=38994 us [24], 
i=0 us [0])
getrusage: tcountbits[333]: 22841278 us (u=22793855 us [13939], s=47422 us 
[29], i=0 us [0])

[[ .... this pair represent the before & after getrusage() calls around
the COUNT_BITS() test .... ]]
getrusage: tcountbits[333]: 22863786 us (u=22811417 us [13939], s=52368 us 
[32], i=0 us [0])
getrusage: tcountbits[333]: 25871319 us (u=25812594 us [15824], s=58724 us 
[36], i=0 us [0])

[[ .... this pair represent the before & after getrusage() calls around
the count_bits() test .... ]]
getrusage: tcountbits[333]: 25892678 us (u=25829019 us [15824], s=63658 us 
[39], i=0 us [0])
getrusage: tcountbits[333]: 29288928 us (u=29220579 us [17956], s=68348 us 
[42], i=0 us [0])

[[ .... this pair represent the before & after getrusage() calls around
the count_ul_bits() test .... ]]
getrusage: tcountbits[333]: 29317905 us (u=29246239 us [17956], s=71665 us 
[44], i=0 us [0])
getrusage: tcountbits[333]: 41065530 us (u=40976615 us [25347], s=87297 us 
[54], i=1616 us [1])

exit|tty: tcountbits[333]: 41100603 us (u=41006769 us [25347], s=92215 us [57], 
i=1617 us [1])


So, while statclock() tick counting seems more or less good enough for
benchmarking entirely CPU bound user-mode code, it's too wobbly for code
that spends a lot of time using low-cost no-I/O system calls such as my
little test program using time() again:

$ /usr/bin/time -l /root/tmp/tcountbits -t -i 1
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 1000000 iterations....
            time() = 0.0547 us/c user,  9.8257 us/c sys, 0.0058 us/c wait,  
9.8861 us/c wall
        nulltime() = 0.0605 us/c user,  9.7801 us/c sys, 0.0253 us/c wait,  
9.8659 us/c wall
countbits_sparse() = 0.0605 us/c user,  9.7468 us/c sys, 0.0366 us/c wait,  
9.8438 us/c wall
 countbits_dense() = 0.0456 us/c user,  9.7778 us/c sys, 0.0141 us/c wait,  
9.8375 us/c wall
      COUNT_BITS() = 0.0423 us/c user,  9.6685 us/c sys, 0.0290 us/c wait,  
9.7397 us/c wall
      count_bits() = 0.0515 us/c user,  9.6292 us/c sys, 0.0128 us/c wait,  
9.6935 us/c wall
   count_ul_bits() = 0.0561 us/c user,  9.8984 us/c sys, 0.0280 us/c wait,  
9.9826 us/c wall
       68.87 real         0.37 user        68.47 sys
     20356  maximum resident set size
         7  average shared memory size
       408  average unshared data size
         7  average unshared stack size
        59  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
        45  involuntary context switches


Would anyone who bothered to read this far, and who knows a thing or two
about context switching and such care to try to help see if the hooks in
<sys/syscall_stats.h> can be better placed or otherwise improved?



FYI, with random() my test program is still showing wonky wobbly results
on my iMac, and though they are not terrible, they're not really usable
at this resolution either:

$ /usr/bin/time -l ./tcountbits -t -r -i 100                                    
                
tcountbits: now running each algorithm for 100000000 iterations....
          random() = 0.0056 us/c user,  0.0000 us/c sys, 0.0002 us/c wait,  
0.0058 us/c wall
        nulltime() = 0.0113 us/c user,  0.0000 us/c sys, 0.0006 us/c wait,  
0.0119 us/c wall
countbits_sparse() = 0.0616 us/c user,  0.0001 us/c sys, 0.0028 us/c wait,  
0.0645 us/c wall
 countbits_dense() = 0.1372 us/c user,  0.0004 us/c sys, 0.0102 us/c wait,  
0.1479 us/c wall
      COUNT_BITS() = 0.0152 us/c user,  0.0000 us/c sys, 0.0006 us/c wait,  
0.0158 us/c wall
      count_bits() = 0.0205 us/c user,  0.0000 us/c sys, 0.0007 us/c wait,  
0.0212 us/c wall
   count_ul_bits() = 0.0949 us/c user,  0.0003 us/c sys, 0.0044 us/c wait,  
0.0996 us/c wall
       36.71 real        34.62 user         0.09 sys
    368640  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       110  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     25945  involuntary context switches



-- 
                                                Greg A. Woods
                                                Planix, Inc.

<woods%planix.com@localhost>       +1 250 762-7675        http://www.planix.com/

Attachment: pgpxi9VoBSTBs.pgp
Description: PGP signature



Home | Main Index | Thread Index | Old Index