tech-kern archive


questions about clocks, and statclock() in particular....



I'm trying to learn a bit more about clocks and ticks on NetBSD as part
of my effort to try to improve measurement of CPU usage, etc.

What follows is a set of changes I've made to kern_clock.c from
netbsd-5.  These are mostly new or moved comments to try to better
explain things to myself....

The only functional change (other than the p_stats vm change from my
recent PR), I think, is that I've moved the setstatclockrate() calls
from inside statclock() out to the {start,stop}profclock() functions,
where they make more sense and would seem to belong.  OpenBSD did this
too, but they also left them inside statclock().  The main reason I
think they should be in the start/stop routines is so that the change
in clock rate takes effect immediately, not after up to another whole
stathz interval has passed; besides, if they only need to be called
once, then they should be called from the interface requesting the
change.
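
To put a rough number on "up to another whole stathz interval", here's a
throwaway user-space calculation (the rates below are made up for
illustration, since the real values are machine-dependent):

#include <stdio.h>

int
main(void)
{
        /* hypothetical rates, purely for the arithmetic; real values are MD */
        int stathz = 128;       /* statistics clock rate, in Hz */
        int profhz = 1024;      /* profiling clock rate, in Hz */

        /*
         * If the rate change is only applied from inside statclock(),
         * it cannot take effect until the next stat tick fires, i.e. up
         * to one full stathz interval after startprofclock() returned.
         */
        printf("psratio = %d\n", profhz / stathz);
        printf("worst-case rate-change delay = %d us\n", 1000000 / stathz);

        return 0;
}

With those example rates that's nearly 8ms during which statclock() would
still be firing at the old rate.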

Am I right in thinking that setstatclockrate() need only be called once,
by one processor, to affect the global stathz rate?  Or is OpenBSD right
in leaving these calls in both places?  I.e., is there only one real
interrupt happening here, with its ISR being triggered to execute on all
CPUs?  (At the moment on i386 I finally see that stathz is zero, so
hardclock() is calling statclock() all the time, and hardclock() is
itself called from clockintr(), so what sets things up to call
clockintr() on each CPU?)
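
FWIW, my (possibly mistaken) reading of the stathz == 0 fallback is
roughly the following paraphrase (a sketch only, not a literal copy of
kern_clock.c; the function name is invented just to keep it stand-alone):

extern int      stathz;

struct clockframe;
void    statclock(struct clockframe *);

/*
 * Sketch: in the real code this lives inside hardclock() itself.  With
 * no separate statistics clock, statclock() is simply driven from the
 * hz interrupt, on whichever CPU took that clockintr().
 */
void
hardclock_stats_sketch(struct clockframe *frame)
{
        /*
         * If no separate statistics clock is available, run the
         * statistics code from here, i.e. at hz, per CPU.
         */
        if (stathz == 0)
                statclock(frame);
}

...but the machine-dependent part, i.e. what arranges for each CPU to
take its own clockintr(), is exactly the bit I'm unsure about.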

Anyone who knows this code well, please also comment on the correctness
of my new/moved comments.

Perhaps someone might also commit any of these changes as desired.  (Or
should I file a PR or two with the changes moved to -current?)


--- kern_clock.c        12 Apr 2010 10:31:45 -0700      1.126
+++ kern_clock.c        02 Nov 2011 15:20:12 -0700      
@@ -73,6 +73,7 @@
 
 #include "opt_ntp.h"
 #include "opt_perfctrs.h"
+#include "opt_syscall_stats.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -177,17 +178,19 @@
        if (profhz == 0)
                profhz = i;
        psratio = profhz / i;
+
        if (schedhz == 0) {
                /* 16Hz is best */
                hardscheddiv = hz / 16;
                if (hardscheddiv <= 0)
                        panic("hardscheddiv");
        }
-
 }
 
 /*
  * The real-time timer, interrupting hz times per second.
+ *
+ * Called by every CPU.
  */
 void
 hardclock(struct clockframe *frame)
@@ -244,11 +247,19 @@
        if ((p->p_stflag & PST_PROFIL) == 0) {
                p->p_stflag |= PST_PROFIL;
                /*
-                * This is only necessary if using the clock as the
-                * profiling source.
+                * Increment the number of processes being profiled, and if
+                * this is the first one and stathz is on a separate clock,
+                * then bump the statclock rate up to profhz and set the
+                * statclock profile/stats divisor to psratio.
+                *
+                * This is only necessary if using the statistics clock as the
+                * profiling source (as opposed to pmc(9)).
                 */
-               if (++profprocs == 1 && stathz != 0)
+               if (++profprocs == 1 && stathz != 0) {
+
                        psdiv = psratio;
+                       setstatclockrate(profhz);
+               }
        }
 }
 
@@ -264,11 +275,19 @@
        if (p->p_stflag & PST_PROFIL) {
                p->p_stflag &= ~PST_PROFIL;
                /*
-                * This is only necessary if using the clock as the
-                * profiling source.
+                * Decrement the number of processes being profiled, and if
+                * no more processes are being profiled and stathz is on a
+                * separate clock, then reset the statclock rate back down to
+                * stathz and reset the profile/stats divisor to unity.
+                *
+                * This is only necessary if using the statistics clock as the
+                * profiling source (as opposed to pmc(9)).
                 */
-               if (--profprocs == 0 && stathz != 0)
+               if (--profprocs == 0 && stathz != 0) {
+
                        psdiv = 1;
+                       setstatclockrate(stathz);
+               }
        }
 }
 
@@ -336,8 +355,25 @@
 }
 
 /*
- * Statistics clock.  Grab profile sample, and if divider reaches 0,
- * do process and kernel statistics.
+ * Statistics (and profiling) clock.
+ *
+ * Called by every CPU.
+ *
+ * XXX profclock() really should be a separate function, as in FreeBSD; that
+ * would avoid all this bother about psdiv and psratio and eliminate the need
+ * for extra state variables in the per-CPU scheduler state.
+ *
+ * As per initclocks() and startprofclock(), statclock() may be called at the
+ * faster profhz rate while profiling is in effect, in which case statistics
+ * gathering must only be done once every psratio number of calls.
+ *
+ * Note that since statclock() is called by every CPU, the countdown dividing
+ * profhz calls into stathz intervals must be done in per-CPU storage.
+ *
+ * So, first grab a profile sample, and then if by decrementing the
+ * profile/stats counter we reach 0, also do process and kernel statistics and
+ * reset the counter up to psdiv again.  (psdiv will be 1 when statclockrate ==
+ * stathz)
  */
 void
 statclock(struct clockframe *frame)
@@ -352,22 +388,19 @@
        struct lwp *l;
 
        /*
-        * Notice changes in divisor frequency, and adjust clock
-        * frequency accordingly.
+        * Notice any changes in divisor frequency and reset the state
+        * variables for the current CPU.
         */
        if (spc->spc_psdiv != psdiv) {
                spc->spc_psdiv = psdiv;
                spc->spc_pscnt = psdiv;
-               if (psdiv == 1) {
-                       setstatclockrate(stathz);
-               } else {
-                       setstatclockrate(profhz);
-               }
        }
+
        l = ci->ci_data.cpu_onproc;
+
        if ((l->l_flag & LW_IDLE) != 0) {
                /*
-                * don't account idle lwps as swapper.
+                * don't account idle LWPs such as swapper. (???)
                 */
                p = NULL;
        } else {
@@ -376,23 +409,42 @@
        }
 
        if (CLKF_USERMODE(frame)) {
+               /*
+                * Interrupt came while in user mode; CPU was in user state.
+                */
+
+               /* grab profiling stats if profiling this process */
                if ((p->p_stflag & PST_PROFIL) && profsrc == PROFSRC_CLOCK)
                        addupc_intr(l, CLKF_PC(frame));
+
+               /*
+                * If profiling is enabled then statclock() is actually called
+                * at the profhz rate and we decrement a counter to determine
+                * whether this call is also at a stathz interval or not.  If
+                * not then we only do profiling, and so return now.
+                */
                if (--spc->spc_pscnt > 0) {
                        mutex_spin_exit(&p->p_stmutex);
                        return;
                }
 
+#ifndef SYSCALL_PROCTIMES
                /*
-                * Came from user mode; CPU was in user state.
-                * If this process is being profiled record the tick.
+                * Unfortunately, since this effectively allocates a whole
+                * stathz interval to the process as user time, it can be
+                * highly inaccurate: it assumes the process has run (or will
+                * continue to run) for a whole stathz interval in user mode.
                 */
                p->p_uticks++;
+#endif
                if (p->p_nice > NZERO)
                        spc->spc_cp_time[CP_NICE]++;
                else
                        spc->spc_cp_time[CP_USER]++;
        } else {
+               /*
+                * Interrupt came while in kernel mode.
+                */
 #ifdef GPROF
                /*
                 * Kernel statistics are just like addupc_intr, only easier.
@@ -412,18 +464,27 @@
                        addupc_intr(l, LWP_PC(l));
                }
 #endif
+               /*
+                * If profiling is enabled then statclock() is actually called
+                * at the profhz rate and we decrement a counter to determine
+                * whether this call is also at a stathz interval or not.  If
+                * not then we only do profiling, and so return now.
+                */
                if (--spc->spc_pscnt > 0) {
                        if (p != NULL)
                                mutex_spin_exit(&p->p_stmutex);
                        return;
                }
                /*
-                * Came from kernel mode, so we were:
+                * The current CPU may have been busy:
+                *
                 * - handling an interrupt,
                 * - doing syscall or trap work on behalf of the current
                 *   user process, or
                 * - spinning in the idle loop.
+                *
                 * Whichever it is, charge the time as appropriate.
+                *
                 * Note that we charge interrupts to the current process,
                 * regardless of whether they are ``for'' that process,
                 * so that we know how much of its real time was spent
@@ -431,20 +492,87 @@
                 */
                if (CLKF_INTR(frame) || (curlwp->l_pflag & LP_INTR) != 0) {
                        if (p != NULL) {
+#if 1 /* ndef SYSCALL_PROCTIMES */
+                               /*
+                                * Unfortunately, since this allocates a whole
+                                * stathz interval to the process as interrupt
+                                * time, it can be highly inaccurate,
+                                * especially for interrupt service routines
+                                * which may routinely run for much less than
+                                * a stathz interval.
+                                *
+                                * Statistically, though, the likelihood of a
+                                * statclock() call coming while another ISR
+                                * is already executing is relatively low,
+                                * unless perhaps the interrupt rate is really
+                                * high, in which case taking a whole stathz
+                                * tick ``hit'' for ISRs might, on average,
+                                * be OK.
+                                */
                                p->p_iticks++;
+#endif
                        }
                        spc->spc_cp_time[CP_INTR]++;
                } else if (p != NULL) {
+#ifndef SYSCALL_PROCTIMES
+                       /*
+                        * Unfortunately, since this allocates a whole stathz
+                        * interval to the process as system time, it can be
+                        * highly inaccurate, especially for some system calls
+                        * which may take far less than a stathz interval to
+                        * finish.
+                        *
+                        * Unlike interrupts though, some system calls may
+                        * actually run fairly long.  Does that make this fair?
+                        */
                        p->p_sticks++;
+#endif
                        spc->spc_cp_time[CP_SYS]++;
                } else {
                        spc->spc_cp_time[CP_IDLE]++;
                }
        }
-       spc->spc_pscnt = psdiv;
 
        if (p != NULL) {
+               struct vmspace *vm = p->p_vmspace;
+               long rss;
+
+               /*
+                * If the CPU is currently scheduled to a non-idle process,
+                * then charge that process with the appropriate VM resource
+                * utilization for a tick.
+                *
+                * Assume that the current process has been running the entire
+                * last tick, and account for VM use regardless of whether in
+                * user mode or system mode (XXX or interrupt mode?).
+                *
+                * rusage VM stats are expressed in kilobytes *
+                * ticks-of-execution.
+                */
+               /* based on code from 4.3BSD kern_clock.c and from FreeBSD ... */
+
+# define pg2kb(n)      (((n) * PAGE_SIZE) / 1024)
+
+               p->p_stats->p_ru.ru_idrss += pg2kb(vm->vm_dsize); /* unshared data */
+               p->p_stats->p_ru.ru_isrss += pg2kb(vm->vm_ssize); /* unshared stack */
+               p->p_stats->p_ru.ru_ixrss += pg2kb(vm->vm_tsize); /* "shared" text? */
+
+               rss = pg2kb(vm_resident_count(vm));
+               if (rss > p->p_stats->p_ru.ru_maxrss)
+                       p->p_stats->p_ru.ru_maxrss = rss;
+
+               /* finally account overall for one stathz tick */
                atomic_inc_uint(&l->l_cpticks);
+
+               /* we're done with mucking with proc stats fields.... */
                mutex_spin_exit(&p->p_stmutex);
        }
+
+       /*
+        * reset the profhz divisor counter
+        *
+        * Note we must use the global variable here, not spc->spc_psdiv, since
+        * the statclock rate may already have been lowered by another CPU....
+        */
+       spc->spc_pscnt = psdiv;
 }
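
Not part of the patch, just an illustration: in case the psdiv/pscnt
dance described in the big statclock() comment above isn't clear, here
is a tiny user-space simulation of the per-CPU countdown.  The rates are
made up, and "spc_pscnt" simply stands in for the real spc->spc_pscnt:

#include <stdio.h>

int
main(void)
{
        /* made-up rates, purely for illustration */
        int stathz = 128, profhz = 1024;
        int psratio = profhz / stathz;  /* 8 */
        int psdiv = psratio;            /* as if profiling were active */
        int spc_pscnt = psdiv;          /* the per-CPU countdown */
        int calls, stats = 0;

        /* pretend statclock() fires at profhz for one second on one CPU */
        for (calls = 0; calls < profhz; calls++) {
                if (--spc_pscnt > 0)
                        continue;       /* a profile-only tick */
                stats++;                /* statistics gathered this tick */
                spc_pscnt = psdiv;      /* reset the countdown */
        }

        printf("%d calls at profhz -> %d stats ticks (expect %d)\n",
            calls, stats, stathz);
        return 0;
}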
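
Also not part of the patch: as a sanity check on the units in the new
rusage VM accounting (kilobytes * ticks-of-execution), a trivial
stand-alone illustration; PAGE_SIZE and the sizes here are just made-up
numbers:

#include <stdio.h>

#define PAGE_SIZE       4096            /* assumed here; really MD */
#define pg2kb(n)        (((n) * PAGE_SIZE) / 1024)

int
main(void)
{
        long dsize = 300;       /* hypothetical unshared data size, in pages */
        long ticks = 128;       /* hypothetical number of stathz ticks charged */
        long idrss = 0;         /* accumulates KB * ticks-of-execution */
        long i;

        for (i = 0; i < ticks; i++)
                idrss += pg2kb(dsize);

        /* the accumulated integral divided by ticks gives the average in KB */
        printf("ru_idrss = %ld KB-ticks, average = %ld KB\n",
            idrss, idrss / ticks);
        return 0;
}

i.e., as I understand the traditional BSD rusage fields, those numbers
only mean anything once they're divided back down by the total number of
ticks charged to the process.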

-- 
                                                Greg A. Woods
                                                Planix, Inc.

<woods%planix.com@localhost>       +1 250 762-7675        http://www.planix.com/
