need help in verifying correctness of SYSCALL_* hook placement



Further experiments and a wee bit more hacking at <sys/syscall_stats.h>
have revealed that the SYSCALL_TIME_SLEEP(l) macro was always
incompletely defined and was accounting excessive portions of time to
system time.  The attached diffs, made against the netbsd-5 branch (and
which I think show everything I've changed with respect to rusage CPU
time accounting), show the new version.
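
For quick reference, here is the corrected macro as it appears in the
diff below; it now splits the elapsed interval according to the LWP's
current state, instead of unconditionally charging all of it to the
per-syscall counter and to system time:

# define SYSCALL_TIME_SLEEP(l) do {                            \
       uint32_t now = __syscall_time();                        \
       uint32_t elapsed = now - (l)->l_syscall_time;           \
       if ((l)->l_syscall_tp == &syscall_time_user)            \
               SYSCALL_TIME_UPDATE_PROC(l, u, elapsed);        \
       else {                                                  \
               SYSCALL_TIME_UPDATE(l, elapsed);                \
               SYSCALL_TIME_UPDATE_PROC(l, s, elapsed);        \
       }                                                       \
       (l)->l_syscall_time = now;                              \
} while (0)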

These changes also include one or two fixes to the placement of these
macros, though those fixes do nothing major enough to show any real
improvement in my test results.

I think my next experiment will be to try binuptime() again, but this
time modify the macros to take advantage of the existing binuptime()
calls that are already made in kern_synch.c, and thus I'll only have to
pay for one additional call on each of system call entry and exit.  I'll
also store the raw struct bintime values and sums if possible (instead
of converting them to nanoseconds).  Doing this should avoid accounting
for any of the context switching and scheduler code that runs even when
there are no system calls and will also therefore exactly match the
accounting of p_rtime.
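
A rough sketch of the idea (the l_syscall_bt and p_sbtime field names
are hypothetical -- nothing like them exists in the diffs below):

        /* on system call entry: remember the raw start time */
        binuptime(&l->l_syscall_bt);

        /* on system call exit: accumulate the raw delta, no conversion */
        struct bintime bt;

        binuptime(&bt);
        bintime_sub(&bt, &l->l_syscall_bt);     /* bt = now - entry time */
        bintime_add(&l->l_proc->p_sbtime, &bt); /* raw system time sum */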

Basically what I've done is made sure the appropriate SYSCALL_* hook is
in place wherever updatertime() is called or l_stime is reset.
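
Concretely, the pairing now looks like this (condensed from the
kern_synch.c diff below):

        void
        updatertime(lwp_t *l, const struct bintime *btnow)
        {
                if ((l->l_flag & LW_IDLE) != 0)
                        return;

                SYSCALL_TIME_SLEEP(l);  /* close the outgoing lwp's interval */

                /* rtime += now - stime */
                bintime_add(&l->l_rtime, btnow);
                bintime_sub(&l->l_rtime, &l->l_stime);
        }

and correspondingly in mi_switch():

                /* Update the new LWP's start time. */
                SYSCALL_TIME_WAKEUP(newl);      /* open the incoming lwp's interval */
                newl->l_stime = bt;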

New results from my test program, using the CPU TSC variant of syscall
timing, show some significant improvement for CPU-bound processes (now
very much matching the old statclock-style results), but they both still
look rather wonky when each iteration of the test loop includes a call
to time() (and thus a system call to gettimeofday()).

So, in other words, I've now got the SYSCALL_PROCTIMES accounting (using
the CPU TSC) working at least as reliably as, if not better than, the old
statclock-style sampling.  However both forms are still very unreliable
and unusable when the test code includes significant numbers of system
calls, especially the likes of gettimeofday().

Note that I would think there should be no measurable system time per
call in the first run, and in theory the user times in the second run
should be slightly less than the user times in the first run since
random() is exchanged for the lower-cost(?) time() function.

On the other hand, perhaps the system time in the first run is (almost)
correctly being accounted to the context switching and scheduler
code.... (???)

Can anyone see any other obvious problems with the placement of the
SYSCALL_* hooks, especially places where they may be missing?  Do the
new versions, especially SYSCALL_TIME_SLEEP(), look correct?

BTW, I think these tests were run with kern.sched.timesoftints=1, though
setting that sysctl either way doesn't seem to make any real difference
for this benchmark.  I didn't think it would, but I was unsure.
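
(For the record, toggling it between runs is just the following, assuming
the usual sysctl(8) "old -> new" output format:)

# sysctl -w kern.sched.timesoftints=1
kern.sched.timesoftints: 0 -> 1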

$ /usr/bin/time -l /root/tmp/tcountbits -tr -i 100
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 100000000 iterations....
          random() = 0.0200 us/c user,  0.0002 us/c sys, 0.0012 us/c wait,  0.0215 us/c wall
        nulltime() = 0.0218 us/c user,  0.0004 us/c sys, 0.0020 us/c wait,  0.0241 us/c wall
countbits_sparse() = 0.0809 us/c user,  0.0004 us/c sys, 0.0013 us/c wait,  0.0826 us/c wall
 countbits_dense() = 0.0857 us/c user,  0.0003 us/c sys, 0.0019 us/c wait,  0.0879 us/c wall
      COUNT_BITS() = 0.0290 us/c user,  0.0006 us/c sys, 0.0006 us/c wait,  0.0302 us/c wall
      count_bits() = 0.0428 us/c user,  0.0002 us/c sys, 0.0006 us/c wait,  0.0436 us/c wall
   count_ul_bits() = 0.1128 us/c user, -0.0000 us/c sys, 0.0019 us/c wait,  0.1146 us/c wall
       40.55 real        39.30 user         1.00 sys
     20384  maximum resident set size
         8  average shared memory size
       465  average unshared data size
         8  average unshared stack size
        50  page reclaims
        33  page faults
         0  swaps
         4  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
      -126  voluntary context switches
       191  involuntary context switches

$ /usr/bin/time -l /root/tmp/tcountbits -t -i 1
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 1000000 iterations....
            time() = 0.5752 us/c user,  9.3730 us/c sys, 0.0212 us/c wait,  9.9695 us/c wall
        nulltime() = 0.5780 us/c user,  9.4783 us/c sys, 0.0174 us/c wait, 10.0737 us/c wall
countbits_sparse() = 0.5537 us/c user,  9.9170 us/c sys, 0.0233 us/c wait, 10.4940 us/c wall
 countbits_dense() = 2.1380 us/c user,  8.7800 us/c sys, 0.0547 us/c wait, 10.9726 us/c wall
      COUNT_BITS() = 1.5277 us/c user,  9.4275 us/c sys, 0.0362 us/c wait, 10.9914 us/c wall
      count_bits() = 0.6911 us/c user,  9.3221 us/c sys, 0.0274 us/c wait, 10.0407 us/c wall
   count_ul_bits() = 1.8210 us/c user,  8.2809 us/c sys, 0.0211 us/c wait, 10.1231 us/c wall
       72.70 real         7.89 user        64.73 sys
     20396  maximum resident set size
        11  average shared memory size
       620  average unshared data size
        11  average unshared stack size
        59  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
        -6  voluntary context switches
       182  involuntary context switches


-- 
                                                Greg A. Woods
                                                Planix, Inc.

<woods%planix.com@localhost>       +1 250 762-7675        http://www.planix.com/

Index: sys/sys/syscall_stats.h
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/sys/syscall_stats.h,v
retrieving revision 1.3
diff -u -r1.3 syscall_stats.h
--- sys/sys/syscall_stats.h     29 Apr 2008 06:53:03 -0000      1.3
+++ sys/sys/syscall_stats.h     9 Nov 2011 22:27:32 -0000
@@ -32,116 +32,251 @@
 #ifndef _SYS_SYCALL_STAT_H_
 #define        _SYS_SYCALL_STAT_H_
 
-#ifndef _LKM
 #include "opt_syscall_stats.h"
-#endif
 
-#ifdef SYSCALL_STATS
-#include <sys/syscall.h>
+/*
+ * keep track of how many times each system call is called for the whole system
+ */
+#ifdef SYSCALL_COUNTS
+
+# include <sys/syscall.h>
 
 extern uint64_t syscall_counts[SYS_NSYSENT];
-extern uint64_t syscall_count_user, syscall_count_system, syscall_count_interrupt;
-#define        SYSCALL_COUNT(table, code) ((table)[code]++)
 
-#if defined(SYSCALL_TIMES) && defined(__HAVE_CPU_COUNTER)
+/*
+ * used in syscall() entry to accumulate counts for each system call
+ *
+ * "table" must always be "syscall_counts",
+ * "code" is the system call number
+ */
+# define SYSCALL_COUNT(table, code)    ((table)[code]++)
+
+#else
+
+# define SYSCALL_COUNT(table, code)    /* nothing */
+
+#endif /* SYSCALL_COUNTS */
+
+/*
+ * keep track of overall CPU time used per system call for the whole system
+ */
+#if defined(SYSCALL_TIMES)
+
+# include <machine/cpu_counter.h>
 
-#include <machine/cpu_counter.h>
 extern uint64_t syscall_times[SYS_NSYSENT];
 
-#ifdef SYSCALL_TIMES_HASCOUNTER
-/* Force use of cycle counter - needed for Soekris systems */
-#define SYSCALL_TIME() (cpu_counter32())
-#else
-#define SYSCALL_TIME() (cpu_hascounter() ? cpu_counter32() : 0u)
-#endif
+# define SYSCALL_TIME_SYS_SET(l, table, code)          \
+       (l)->l_syscall_tp = (table) + (code)
 
-#ifdef SYSCALL_TIMES_PROCTIMES
-#define SYSCALL_TIME_UPDATE_PROC(l, fld, delta) \
-       (l)->l_proc->p_##fld##ticks += (delta)
-#else
-#define SYSCALL_TIME_UPDATE_PROC(l, fld, delta)
-#endif
+# define SYSCALL_TIME_UPDATE(l, delta)                 \
+       *(l)->l_syscall_tp += delta
 
-/* lwp creation */
-#define SYSCALL_TIME_LWP_INIT(l) do { \
-    (l)->l_syscall_counter = &syscall_count_system; \
-    SYSCALL_TIME_WAKEUP(l); \
-} while (0)
-
-/* System call entry hook */
-#define SYSCALL_TIME_SYS_ENTRY(l, table, code) do { \
-       uint32_t now = SYSCALL_TIME(); \
-       SYSCALL_TIME_UPDATE_PROC(l, u, elapsed = now - (l)->l_syscall_time); \
-       (l)->l_syscall_counter = (table) + (code); \
-       (l)->l_syscall_time = now; \
-} while (0)
-
-/* System call - process sleep */
-#define SYSCALL_TIME_SLEEP(l) do { \
-       uint32_t now = SYSCALL_TIME(); \
-       uint32_t elapsed = now - (l)->l_syscall_time; \
-       (l)->l_syscall_time = now; \
-       *(l)->l_syscall_counter += elapsed; \
-       SYSCALL_TIME_UPDATE_PROC(l, s, elapsed); \
-} while (0)
-
-/* Process wakeup */
-#define SYSCALL_TIME_WAKEUP(l) \
-       (l)->l_syscall_time = SYSCALL_TIME()
-
-/* System call exit */
-#define SYSCALL_TIME_SYS_EXIT(l) do { \
-       uint32_t now = SYSCALL_TIME(); \
-       uint32_t elapsed = now - (l)->l_syscall_time; \
-       (l)->l_syscall_time = now; \
-       *(l)->l_syscall_counter += elapsed; \
-       (l)->l_syscall_counter = &syscall_count_user; \
-       SYSCALL_TIME_UPDATE_PROC(l, s, elapsed); \
-} while (0)
-
-#ifdef _notyet
-/* Interrupt entry hook */
-#define SYSCALL_TIME_ISR_ENTRY(l, old) do { \
-       uint32_t now = SYSCALL_TIME(); \
-       uint32_t elapsed = now - (l)->l_syscall_time; \
-       (l)->l_syscall_time = now; \
-       old = (l)->l_syscall_counter; \
-       if ((l)->l_syscall_counter != &syscall_count_interrupt) \
-               if ((l)->l_syscall_counter == &syscall_count_user) \
-                       SYSCALL_TIME_UPDATE_PROC(l, u, elapsed); \
-               else { \
-                       *(l)->l_syscall_counter += elapsed; \
-                       SYSCALL_TIME_UPDATE_PROC(l, s, elapsed); \
-               } \
-               (l)->l_syscall_counter = &syscall_count_interrupt; \
-       } \
-} while (0)
-
-/* Interrupt exit hook */
-#define SYSCALL_TIME_ISR_EXIT(l, saved) do { \
-       uint32_t now = SYSCALL_TIME(); \
-       SYSCALL_TIME_UPDATE_PROC(l, i, now - (l)->l_syscall_time); \
-       (l)->l_syscall_time = now; \
-       (l)->l_syscall_counter = saved; \
-} while (0)
+#else /* !SYSCALL_TIMES */
+
+/*
+ * Avoid some unnecessary 64-bit additions if we're not using SYSCALL_TIMES
+ */
+# define SYSCALL_TIME_SYS_SET(l, table, code)  /* nothing */
+# define SYSCALL_TIME_UPDATE(l, delta)         /* nothing */
+
+#endif /* SYSCALL_TIMES */
+
+#if defined(SYSCALL_TIMES) || defined(SYSCALL_PROCTIMES)
+
+/*
+ * These are effectively state enumerators, but are also stand-ins for a
+ * location in syscall_times in lwp->l_syscall_tp so must be of the same type.
+ */
+extern uint64_t syscall_time_user, syscall_time_system, syscall_time_interrupt;
+
+#if !defined(SYSCALL_STATS_USE_CPUTSC) &&              \
+    !defined(SYSCALL_STATS_USE_HARDCLOCK) &&           \
+    !defined(SYSCALL_STATS_USE_UPTIME)
+# define SYSCALL_STATS_USE_CPUTSC      /* defined */
 #endif
 
+#ifdef SYSCALL_STATS_USE_UPTIME
+/*
+ * ideally we would just store the "raw" struct bintime, but for now since we
+ * have some relevant u_quad_t slots in struct proc to store the accumulated
+ * times in, we'll convert to nanoseconds so we can use them
+ *
+ * Note, binuptime() is too expensive, but getbinuptime() is too "sloppy"
+ */
+static __inline__ __attribute__((__always_inline__)) uint32_t
+__syscall_time(void)
+{
+       uint64_t ns;
+       struct bintime bt;
+
+       binuptime(&bt);
+       ns = (bt.sec * 1000000000LL) +
+            (uint64_t) (((uint64_t) 1000000000LL * (uint32_t) (bt.frac >> 32)) >> 32);
+
+       return (ns & UINT32_MAX);       /* XXX mask unnecessary? */
+}
 #endif
+
+
+#ifdef SYSCALL_STATS_USE_HARDCLOCK
+# include <sys/kernel.h>
+# define __syscall_time()      hardclock_ticks
 #endif
 
-#ifndef SYSCALL_TIME_SYS_ENTRY
-#define SYSCALL_TIME_LWP_INIT(l)
-#define SYSCALL_TIME_SYS_ENTRY(l,table,code)
-#define SYSCALL_TIME_SLEEP(l)
-#define SYSCALL_TIME_WAKEUP(l)
-#define SYSCALL_TIME_SYS_EXIT(l)
-#define SYSCALL_TIME_ISR_ENTRY(l,old)
-#define SYSCALL_TIME_ISR_EXIT(l,saved)
-#undef SYSCALL_TIMES
+
+#ifdef SYSCALL_STATS_USE_CPUTSC
+# ifndef __HAVE_CPU_COUNTER
+#  error "Use of CPU timestamp counter for SYSCALL_STATS invalid: no cpu_counter32()"
+# endif
+# include <machine/cpu_counter.h>
+# ifdef SYSCALL_STATS_HASCOUNTER
+/* Force use of cycle counter - needed for Soekris systems */
+#  define __syscall_time()     (cpu_counter32())
+# else
+#  define __syscall_time()     (cpu_hascounter() ? cpu_counter32() : 0u)
+# endif
 #endif
 
-#ifndef SYSCALL_COUNT
-#define SYSCALL_COUNT(table, code)
+# ifdef SYSCALL_PROCTIMES
+/*
+ * keep track of process user and system time accurately
+ */
+#  define SYSCALL_TIME_UPDATE_PROC(l, fld, delta)      \
+       (l)->l_proc->p_##fld##ticks += (delta)
+# else
+#  define SYSCALL_TIME_UPDATE_PROC(l, fld, delta)
+# endif
+
+/*
+ * Process wakeup
+ *
+ * Used in mi_switch() as new lwp is about to run -- mark the start time of its
+ * next time slice
+ */
+# define SYSCALL_TIME_WAKEUP(l)                                \
+       (l)->l_syscall_time = __syscall_time()
+
+/*
+ * lwp creation
+ *
+ * XXX is "syscall_time_system" the correct "state" here? 
+ */
+# define SYSCALL_TIME_LWP_INIT(l) do {                 \
+       (l)->l_syscall_tp = &syscall_time_system;       \
+       SYSCALL_TIME_WAKEUP(l);                         \
+} while (0)
+
+/*
+ * process yielding while in kernel mode
+ *
+ * used in mi_switch() if a new lwp will execute, as well as in
+ * lwp_exit_switchaway()
+ *
+ * All time spent between the last time l->l_syscall_time was set and now was
+ * spent executing in kernel mode on behalf of l's process.
+ *
+ * Now yielding to another lwp.
+ *
+ * XXX Do we have to update l_syscall_time here?
+ */
+# define SYSCALL_TIME_SLEEP(l) do {                            \
+       uint32_t now = __syscall_time();                        \
+       uint32_t elapsed = now - (l)->l_syscall_time;           \
+       if ((l)->l_syscall_tp == &syscall_time_user)            \
+               SYSCALL_TIME_UPDATE_PROC(l, u, elapsed);        \
+       else {                                                  \
+               SYSCALL_TIME_UPDATE(l, elapsed);                \
+               SYSCALL_TIME_UPDATE_PROC(l, s, elapsed);        \
+       }                                                       \
+       (l)->l_syscall_time = now;                              \
+} while (0)
+
+/*
+ * System call entry hook
+ * 
+ * used at beginning of syscall()
+ *
+ * "table" must always be "syscall_times",
+ * "code" is the system call number
+ *
+ * All time spent between the last time l->l_syscall_time was set and now was
+ * spent executing user code (except for when ISRs were running).
+ */
+# define SYSCALL_TIME_SYS_ENTRY(l, table, code) do {   \
+       uint32_t now = __syscall_time();                \
+       uint32_t elapsed = now - (l)->l_syscall_time;   \
+       SYSCALL_TIME_UPDATE_PROC(l, u, elapsed);        \
+       SYSCALL_TIME_SYS_SET(l, table, code);           \
+       (l)->l_syscall_time = now;                      \
+} while (0)
+
+/*
+ * System call completion
+ *
+ * Used at the end of syscall() just before returning to user mode. 
+ *
+ * All time spent between the last time l->l_syscall_time was set and now was
+ * spent executing in kernel mode on behalf of l's process.
+ */
+# define SYSCALL_TIME_SYS_EXIT(l) do {                 \
+       uint32_t now = __syscall_time();                \
+       uint32_t elapsed = now - (l)->l_syscall_time;   \
+       SYSCALL_TIME_UPDATE_PROC(l, s, elapsed);        \
+       SYSCALL_TIME_UPDATE(l, elapsed);                \
+       (l)->l_syscall_time = now;                      \
+       (l)->l_syscall_tp = &syscall_time_user;         \
+} while (0)
+
+
+/*
+ * Interrupt entry hook
+ *
+ * "old" is storage for a "uint64_t *" to be passed as "saved" to
+ * SYSCALL_TIME_ISR_EXIT()
+ */
+# define SYSCALL_TIME_ISR_ENTRY(l, old) do {                           \
+       uint32_t now = __syscall_time();                                \
+       uint32_t elapsed = now - (l)->l_syscall_time;                   \
+       (l)->l_syscall_time = now;                                      \
+       old = (l)->l_syscall_tp;                                        \
+       if ((l)->l_syscall_tp != &syscall_time_interrupt) {             \
+               if ((l)->l_syscall_tp == &syscall_time_user)            \
+                       SYSCALL_TIME_UPDATE_PROC(l, u, elapsed);        \
+               else {                                                  \
+                       SYSCALL_TIME_UPDATE(l, elapsed);                \
+                       SYSCALL_TIME_UPDATE_PROC(l, s, elapsed);        \
+               }                                                       \
+               (l)->l_syscall_tp = &syscall_time_interrupt;            \
+       }                                                               \
+} while (0)
+
+/*
+ * Interrupt exit hook
+ *
+ * "saved" is the "unit64_t *" pointer passed as "old" to
+ * SYSCALL_TIME_ISR_ENTRY()
+ */
+# define SYSCALL_TIME_ISR_EXIT(l, saved) do {                          \
+       uint32_t now = __syscall_time();                                \
+       uint32_t elapsed = now - (l)->l_syscall_time;                   \
+       SYSCALL_TIME_UPDATE_PROC(l, i, elapsed);                        \
+       (l)->l_syscall_time = now;                                      \
+       (l)->l_syscall_tp = saved;                                      \
+} while (0)
+
+#endif /* SYSCALL_TIMES || SYSCALL_PROCTIMES */
+
+#ifndef SYSCALL_TIME_LWP_INIT
+# define SYSCALL_TIME_LWP_INIT(l)
+# define SYSCALL_TIME_SYS_ENTRY(l, table, code)
+# define SYSCALL_TIME_SLEEP(l)
+# define SYSCALL_TIME_WAKEUP(l)
+# define SYSCALL_TIME_SYS_EXIT(l)
+# define SYSCALL_TIME_ISR_ENTRY(l, old)
+# define SYSCALL_TIME_ISR_EXIT(l, saved)
+# undef SYSCALL_TIME_SYS_SET
+# undef SYSCALL_TIME_UPDATE
 #endif
 
 #endif /* !_SYS_SYCALL_STAT_H_ */
Index: sys/sys/lwp.h
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/sys/lwp.h,v
retrieving revision 1.114.4.1
diff -u -r1.114.4.1 lwp.h
--- sys/sys/lwp.h       6 Feb 2009 01:54:09 -0000       1.114.4.1
+++ sys/sys/lwp.h       2 Nov 2011 01:58:11 -0000
@@ -83,7 +83,7 @@
        struct mdlwp    l_md;           /* l: machine-dependent fields. */
        int             l_flag;         /* l: misc flag values */
        int             l_stat;         /* l: overall LWP status */
-       struct bintime  l_rtime;        /* l: real time */
+       struct bintime  l_rtime;        /* l: real time running (ONPROC) */
        struct bintime  l_stime;        /* l: start time (while ONPROC) */
        u_int           l_swtime;       /* l: time swapped in or out */
        u_int           l_holdcnt;      /* l: if non-zero, don't swap */
@@ -100,7 +100,7 @@
        SLIST_HEAD(, turnstile) l_pi_lenders; /* l: ts lending us priority */
        uint64_t        l_ncsw;         /* l: total context switches */
        uint64_t        l_nivcsw;       /* l: involuntary context switches */
-       u_int           l_cpticks;      /* (: Ticks of CPU time */
+       u_int           l_cpticks;      /* (: Ticks of CPU during l_swtime */
        fixpt_t         l_pctcpu;       /* p: %cpu during l_swtime */
        fixpt_t         l_estcpu;       /* l: cpu time for SCHED_4BSD */
        psetid_t        l_psid;         /* l: assigned processor-set ID */
@@ -178,8 +178,8 @@
        _TAILQ_HEAD(,struct lockdebug,volatile) l_ld_locks;/* !: locks held by LWP */
 
        /* These are only used by 'options SYSCALL_TIMES' */
-       uint32_t        l_syscall_time; /* !: time epoch for current syscall */
-       uint64_t        *l_syscall_counter; /* !: counter for current process */
+       uint64_t        l_syscall_time; /* !: time epoch for current syscall */
+       uint64_t        *l_syscall_tp;  /* !: ptr to current process timer */
 };
 
 #if !defined(USER_TO_UAREA)
Index: sys/kern/sys_syscall.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/kern/sys_syscall.c,v
retrieving revision 1.9
diff -u -r1.9 sys_syscall.c
--- sys/kern/sys_syscall.c      29 Apr 2008 06:53:03 -0000      1.9
+++ sys/kern/sys_syscall.c      9 Nov 2011 20:04:22 -0000
@@ -68,6 +68,7 @@
 
        code = SCARG(uap, code) & (SYS_NSYSENT - 1);
        SYSCALL_COUNT(syscall_counts, code);
+       SYSCALL_TIME_SYS_ENTRY(l, syscall_times, code);
        callp += code;
 
        if (__predict_false(callp->sy_flags & SYCALL_INDIRECT))
@@ -87,6 +88,7 @@
                return error;
        error = callp->sy_call(l, &uap->args, rval);
        trace_exit(code, rval, error);
+       SYSCALL_TIME_SYS_EXIT(l);
        return error;
 
        #undef TRACE_ARGS
Index: sys/kern/subr_syscall_stats.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/kern/subr_syscall_stats.c,v
retrieving revision 1.2
diff -u -r1.2 subr_syscall_stats.c
--- sys/kern/subr_syscall_stats.c       29 Apr 2008 06:53:03 -0000      1.2
+++ sys/kern/subr_syscall_stats.c       3 Nov 2011 00:39:13 -0000
@@ -40,14 +40,19 @@
 
 #include <sys/syscall_stats.h>
 
-#ifdef SYSCALL_STATS
+#ifdef SYSCALL_COUNTS
 uint64_t syscall_counts[SYS_NSYSENT];
-uint64_t syscall_count_user, syscall_count_system, syscall_count_interrupt;
+#endif
 
 #ifdef SYSCALL_TIMES
 uint64_t syscall_times[SYS_NSYSENT];
 #endif
 
+#if defined(SYSCALL_TIMES) || defined(SYSCALL_PROCTIMES)
+uint64_t syscall_time_user, syscall_time_system, syscall_time_interrupt;
+#endif
+
+#if defined(SYSCALL_COUNTS) || defined(SYSCALL_TIMES)
 SYSCTL_SETUP(sysctl_syscall_setup, "sysctl system call stats")
 {
        const struct sysctlnode *cnode;
@@ -65,19 +70,21 @@
                        NULL, 0, NULL, 0,
                        CTL_KERN, CTL_CREATE);
        kern_syscalls = cnode->sysctl_num;
+# ifdef SYSCALL_COUNTS
         sysctl_createv(clog, 0, NULL, NULL,
                        CTLFLAG_PERMANENT,  
                        CTLTYPE_STRUCT, "counts",
                        SYSCTL_DESCR("per syscall counts"),
                        NULL, 0, syscall_counts, sizeof syscall_counts,
                        CTL_KERN, kern_syscalls, CTL_CREATE);
-#ifdef SYSCALL_TIMES
+# endif
+# ifdef SYSCALL_TIMES
         sysctl_createv(clog, 0, NULL, NULL,
                        CTLFLAG_PERMANENT,  
                        CTLTYPE_STRUCT, "times",
                        SYSCTL_DESCR("per syscall times"),
                        NULL, 0, syscall_times, sizeof syscall_times,
                        CTL_KERN, kern_syscalls, CTL_CREATE);
-#endif
+# endif
 }
 #endif
Index: sys/kern/kern_synch.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/kern/kern_synch.c,v
retrieving revision 1.254.2.6
diff -u -r1.254.2.6 kern_synch.c
--- sys/kern/kern_synch.c       23 Apr 2009 17:47:13 -0000      1.254.2.6
+++ sys/kern/kern_synch.c       9 Nov 2011 21:52:24 -0000
@@ -97,6 +97,7 @@
 #include <sys/lwpctl.h>
 #include <sys/atomic.h>
 #include <sys/simplelock.h>
+#include <sys/syslog.h>
 
 #include <uvm/uvm_extern.h>
 
@@ -515,14 +516,16 @@
  */
 
 void
-updatertime(lwp_t *l, const struct bintime *now)
+updatertime(lwp_t *l, const struct bintime *btnow)
 {
 
        if ((l->l_flag & LW_IDLE) != 0)
                return;
 
+       SYSCALL_TIME_SLEEP(l);
+
        /* rtime += now - stime */
-       bintime_add(&l->l_rtime, now);
+       bintime_add(&l->l_rtime, btnow);
        bintime_sub(&l->l_rtime, &l->l_stime);
 }
 
@@ -625,7 +628,6 @@
 
        /* Count time spent in current system call */
        if (!returning) {
-               SYSCALL_TIME_SLEEP(l);
 
                /*
                 * XXXSMP If we are using h/w performance counters,
@@ -835,7 +837,6 @@
 #endif
 
        /* Count time spent in current system call */
-       SYSCALL_TIME_SLEEP(l);
        binuptime(&bt);
        updatertime(l, &bt);
 
@@ -863,6 +864,7 @@
        }
 
        /* Update the new LWP's start time. */
+       SYSCALL_TIME_WAKEUP(newl);
        newl->l_stime = bt;
        l->l_pflag &= ~LP_RUNNING;
 
@@ -1202,9 +1204,11 @@
                rlim = &p->p_rlimit[RLIMIT_CPU];
                sig = 0;
                if (__predict_false(runtm >= rlim->rlim_cur)) {
-                       if (runtm >= rlim->rlim_max)
+                       if (runtm >= rlim->rlim_max) {
                                sig = SIGKILL;
-                       else {
+                               log(LOG_ERR, "pid %d is killed: %s\n", p->p_pid, "exceeded RLIMIT_CPU");
+                               uprintf("sorry, pid %d, command %s, is killed: %s\n", p->p_pid, p->p_comm, "exceeded RLIMIT_CPU");
+                       } else {
                                sig = SIGXCPU;
                                if (rlim->rlim_cur < rlim->rlim_max)
                                        rlim->rlim_cur += 5;
@@ -1214,8 +1218,8 @@
                if (__predict_false(runtm < 0)) {
                        if (!backwards) {
                                backwards = true;
-                               printf("WARNING: negative runtime; "
-                                   "monotonic clock has gone backwards\n");
+                               log(LOG_WARNING, "WARNING: negative runtime; "
+                                   "monotonic clock has gone backwards\n"); /* or/i.e. overflowed??? */
                        }
                } else if (__predict_false(sig)) {
                        KASSERT((p->p_flag & PK_SYSTEM) == 0);
Index: sys/kern/kern_resource.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/kern/kern_resource.c,v
retrieving revision 1.147.4.2
diff -u -r1.147.4.2 kern_resource.c
--- sys/kern/kern_resource.c    14 Aug 2009 21:15:16 -0000      1.147.4.2
+++ sys/kern/kern_resource.c    8 Nov 2011 06:17:42 -0000
@@ -39,6 +39,8 @@
 #include <sys/cdefs.h>
 __KERNEL_RCSID(0, "$NetBSD: kern_resource.c,v 1.147.4.2 2009/08/14 21:15:16 snj Exp $");
 
+#include "opt_syscall_stats.h"
+
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
@@ -403,6 +405,28 @@
        return copyout(&rl, SCARG(uap, rlp), sizeof(rl));
 }
 
+#if defined(SYSCALL_PROCTIMES) &&                                      \
+    (defined(SYSCALL_STATS_USE_HARDCLOCK) || defined(SYSCALL_STATS_USE_CPUTSC))
+/*
+ * find the most significant bit set in an integer
+ *
+ * aka the "log base 2" of an integer 
+ */
+static unsigned int msb_uint64(uint64_t);
+
+static unsigned int
+msb_uint64(uint64_t v)
+{
+        unsigned int mb = 0;
+
+       while (v >>= 1) { /* unroll for more speed... */
+               mb++;
+       }
+
+        return mb;
+}
+#endif
+
 /*
  * Transform the running time and tick information in proc p into user,
  * system, and interrupt time usage.
@@ -413,22 +437,28 @@
 calcru(struct proc *p, struct timeval *up, struct timeval *sp,
     struct timeval *ip, struct timeval *rp)
 {
-       uint64_t u, st, ut, it, tot;
+       uint64_t st, ut, it, tot;       /* "ticks" */
+#if defined(SYSCALL_PROCTIMES) &&                                      \
+    (defined(SYSCALL_STATS_USE_HARDCLOCK) || defined(SYSCALL_STATS_USE_CPUTSC))
+       uint64_t stn, utn, itn;         /* normalized "ticks" */
+       unsigned int hcMSBit;
+#endif
+       uint64_t rus, sus, uus, ius;    /* microseconds */
        struct lwp *l;
-       struct bintime tm;
-       struct timeval tv;
+       struct bintime rbtm;
+       struct timeval rtv;
 
        mutex_spin_enter(&p->p_stmutex);
-       st = p->p_sticks;
-       ut = p->p_uticks;
-       it = p->p_iticks;
+       st = p->p_sticks;               /* statclock hits in system mode */
+       ut = p->p_uticks;               /* statclock hits in user mode */
+       it = p->p_iticks;               /* statclock hits during interrupt(s) */
        mutex_spin_exit(&p->p_stmutex);
 
-       tm = p->p_rtime;
+       rbtm = p->p_rtime;              /* hardclock real time for any exited lwps??? */
 
        LIST_FOREACH(l, &p->p_lwps, l_sibling) {
                lwp_lock(l);
-               bintime_add(&tm, &l->l_rtime);
+               bintime_add(&rbtm, &l->l_rtime);
                if ((l->l_pflag & LP_RUNNING) != 0) {
                        struct bintime diff;
                        /*
@@ -438,40 +468,174 @@
                         * which is much greater than the sampling
                         * error.
                         */
-                       binuptime(&diff);
-                       bintime_sub(&diff, &l->l_stime);
-                       bintime_add(&tm, &diff);
+                       binuptime(&diff);                       /* "uptime" */
+                       bintime_sub(&diff, &l->l_stime);        /* - "switchtime" */
+                       bintime_add(&rbtm, &diff);
                }
                lwp_unlock(l);
        }
 
+       bintime2timeval(&rbtm, &rtv);
+       rus = (uint64_t) rtv.tv_sec * 1000000ul + rtv.tv_usec; /* tot real tm usec */
+
+       /* not likely necessary with cpu_counter32() tickers! */
+       if (ut == 0)
+               ut++;                   /* 0.5 or less would be more fair!  :-) */
+       /* else */ if (st == 0)
+               st++;                   /* 0.5 or less would be more fair!  :-) */
+
+
+#ifndef SYSCALL_PROCTIMES
+       /*
+        * distribute total real microseconds of runtime accumulated by this
+        * process by each allocation of statclock "ticks" accounted against it
+        *
+        * XXX using statclock ticks, this can go bad!
+        *
+        * Total real time for all the lwps that are part of this "process" are
+        * calculated with true elapsed time in context switching, while
+        * statclock() only samples which lwp was executing and in what mode at
+        * stathz intervals.  If statclock() has not yet accounted the "next
+        * tick" to this process yet the amount of time apportioned to each of
+        * user, system, and interrupt time will be larger than it should be,
+        * and the next time we do something to calculate the split two of
+        * those times may appear to have gone backwards.
+        *
+        * FreeBSD saves the results of the calculation in the proc structure
+        * and then clamps subsequent results to be greater or equal to the
+        * previous results.
+        *
+        * Perhaps NetBSD could do the same, using p_stats->p_ru?
+        *
+        * A better solution would be to measure time spent in system calls
+        * using the same clock as is used for l_rtime.  That would only leave
+        * interrupt time as the wild-card being estimated by statclock ticks.
+        * Maybe interrupt time could also be measured by the interrupt
+        * dispatcher and then accounted to the thread which it interrupted?
+        *
+        * Hmmm... in <sys/syscall_stats.h> we have SYSCALL_PROCTIMES
+        *
+        * it records deltas of cpu_counter32() values on some platforms
+        */
+
        tot = st + ut + it;
-       bintime2timeval(&tm, &tv);
-       u = (uint64_t)tv.tv_sec * 1000000ul + tv.tv_usec;
 
-       if (tot == 0) {
-               /* No ticks, so can't use to share time out, split 50-50 */
-               st = ut = u / 2;
-       } else {
-               st = (u * st) / tot;
-               ut = (u * ut) / tot;
-       }
+       sus = (rus * st) / tot;
+       uus = (rus * ut) / tot;
+       ius = (rus * it) / tot;
+
+#  ifdef DIAGNOSTIC
+       printf("%s: %s[%ld]: %llu us (u=%llu us [%llu], s=%llu us [%llu], i=%llu us [%llu])\n",
+              (ip && rp) ? "acct_process" :
+              (ip && !rp) ? "getrusage" :
+              (!ip && rp) ? "sysctl" : "exit|tty",
+              p->p_comm, (long) p->p_pid,
+              rus, uus, ut, sus, st, ius, it);
+#  endif
+
+#else
+
+# if defined(SYSCALL_STATS_USE_HARDCLOCK) || defined(SYSCALL_STATS_USE_CPUTSC)
+       /*
+        * distribute total real microseconds of runtime accumulated by this
+        * process by each allocation of CPU TSC or hardclock "ticks" accounted
+        * against it
+        *
+        * first "normalize" the tick counters to some reasonable magnitude,
+        * (especially for CPU TSC, but useful for long-running procs?)
+        *
+        * Pick the highest common MSBit of each, then preserve the top-most
+        * "used" 24 bits by shifting each number right by (hcMSBit-24).
+        *
+        * XXX note with CPUTSC the still-statclock-based p_iticks value will
+        * normally be way too tiny to consider in this algorithm (i.e. it
+        * would always have the lowest MSB).
+        */
+#  if 0 /* def SYSCALL_STATS_USE_HARDCLOCK */
+       if (stathz != 0 && stathz < hz)
+               it *= hz / stathz;      /* adjust "it" from stathz to hz */
+#  endif
+       utn = ut;
+       stn = st;
+       itn = it;
+       hcMSBit = max(max(msb_uint64(ut), msb_uint64(st)), msb_uint64(it));
+       if (hcMSBit > 24) {
+               utn = ut >> (hcMSBit - 24);
+               if (utn == 0)
+                       utn++;
+               stn = st >> (hcMSBit - 24);
+               if (stn == 0)
+                       stn++;
+               itn = it >> (hcMSBit - 24);
+               if (itn == 0)
+                       itn++;
+       }
+       tot = stn + utn + itn;          /* total "ticks" */
+
+       sus = (rus * stn) / tot;
+       uus = (rus * utn) / tot;
+       ius = (rus * itn) / tot;
+
+#  ifdef DIAGNOSTIC
+       printf("%s: %s[%ld]: %llu us {msb=%u} (u=%llu us [%llu:%llu], s=%llu us [%llu:%llu], i=%llu us [%llu:%llu])\n",
+              (ip && rp) ? "acct_process" :
+              (ip && !rp) ? "getrusage" :
+              (!ip && rp) ? "sysctl" : "exit|tty",
+              p->p_comm, (long) p->p_pid,
+              rus, hcMSBit, uus, ut, utn, sus, st, stn, ius, it, itn);
+#  endif
        if (sp != NULL) {
-               sp->tv_sec = st / 1000000;
-               sp->tv_usec = st % 1000000;
+               sp->tv_sec = sus / 1000000;
+               sp->tv_usec = sus % 1000000;
        }
        if (up != NULL) {
-               up->tv_sec = ut / 1000000;
-               up->tv_usec = ut % 1000000;
+               up->tv_sec = uus / 1000000;
+               up->tv_usec = uus % 1000000;
        }
        if (ip != NULL) {
-               if (it != 0)
-                       it = (u * it) / tot;
-               ip->tv_sec = it / 1000000;
-               ip->tv_usec = it % 1000000;
+               ip->tv_sec = ius / 1000000;
+               ip->tv_usec = ius % 1000000;
+       }
+
+# elif defined(SYSCALL_STATS_USE_UPTIME)
+       /*
+        * We have _measured_ user, system (XXX and maybe interrupt) times!
+        *
+        * These are not stathz ticks, but rather nanoseconds (at binuptime()
+        * resolution)
+        *
+        * XXX except for p_iticks, which is still stathz ticks.... 
+        */
+       sus = st / 1000;
+       uus = ut / 1000;
+       ius = it * stathz;
+       tot = sus + uus;
+#  ifdef DIAGNOSTIC
+       printf("%s: %s[%ld]: rt=%llu us, u+s=%llu us (u=%lld ns / s=%lld ns), it=%llu ticks\n",
+              (ip && rp) ? "acct_process" :
+              (ip && !rp) ? "getrusage" :
+              (!ip && rp) ? "sysctl" : "exit|tty",
+              p->p_comm, (long) p->p_pid,
+              rus, tot, uus, sus, it);
+#  endif
+# else
+#  error "only one of: SYSCALL_STATS_USE_UPTIME, SYSCALL_STATS_USE_HARDCLOCK, or SYSCALL_STATS_USE_CPUTSC; must be defined"
+# endif
+#endif
+       if (sp != NULL) {
+               sp->tv_sec = sus / 1000000;
+               sp->tv_usec = sus % 1000000;
+       }
+       if (up != NULL) {
+               up->tv_sec = uus / 1000000;
+               up->tv_usec = uus % 1000000;
+       }
+       if (ip != NULL) {
+               ip->tv_sec = ius / 1000000;
+               ip->tv_usec = ius % 1000000;
        }
        if (rp != NULL) {
-               *rp = tv;
+               *rp = rtv;
        }
 }
 
@@ -486,12 +650,13 @@
        } */
        struct rusage ru;
        struct proc *p = l->l_proc;
+       struct timeval dummy;
 
        switch (SCARG(uap, who)) {
        case RUSAGE_SELF:
                mutex_enter(p->p_lock);
                memcpy(&ru, &p->p_stats->p_ru, sizeof(ru));
-               calcru(p, &ru.ru_utime, &ru.ru_stime, NULL, NULL);
+               calcru(p, &ru.ru_utime, &ru.ru_stime, &dummy, NULL);
                rulwps(p, &ru);
                mutex_exit(p->p_lock);
                break;
@@ -533,7 +698,7 @@
 
        LIST_FOREACH(l, &p->p_lwps, l_sibling) {
                ruadd(ru, &l->l_ru);
-               ru->ru_nvcsw += (l->l_ncsw - l->l_nivcsw);
+               ru->ru_nvcsw += (l->l_ncsw - l->l_nivcsw); /* XXX ru_nvcsw can go negative! */
                ru->ru_nivcsw += l->l_nivcsw;
        }
 }
Index: sys/kern/kern_clock.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/kern/kern_clock.c,v
retrieving revision 1.126
diff -u -r1.126 kern_clock.c
--- sys/kern/kern_clock.c       5 Oct 2008 21:57:20 -0000       1.126
+++ sys/kern/kern_clock.c       2 Nov 2011 22:20:12 -0000
@@ -73,6 +73,7 @@
 
 #include "opt_ntp.h"
 #include "opt_perfctrs.h"
+#include "opt_syscall_stats.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -177,17 +178,19 @@
        if (profhz == 0)
                profhz = i;
        psratio = profhz / i;
+
        if (schedhz == 0) {
                /* 16Hz is best */
                hardscheddiv = hz / 16;
                if (hardscheddiv <= 0)
                        panic("hardscheddiv");
        }
-
 }
 
 /*
  * The real-time timer, interrupting hz times per second.
+ *
+ * called by every CPU 
  */
 void
 hardclock(struct clockframe *frame)
@@ -244,11 +247,19 @@
        if ((p->p_stflag & PST_PROFIL) == 0) {
                p->p_stflag |= PST_PROFIL;
                /*
-                * This is only necessary if using the clock as the
-                * profiling source.
+                * Increment the number of processes being profiled, and iff
+                * stathz is on a separate clock then set the stathz rate to be
+                * at profhz, and set the statclock profile/stats divisor to
+                * psratio, if this is the first process being profiled.
+                * 
+                * This is only necessary if using the statistics clock as the
+                * profiling source (as opposed to pmc(9)).
                 */
-               if (++profprocs == 1 && stathz != 0)
+               if (++profprocs == 1 && stathz != 0) {
+
                        psdiv = psratio;
+                       setstatclockrate(profhz);
+               }
        }
 }
 
@@ -264,11 +275,19 @@
        if (p->p_stflag & PST_PROFIL) {
                p->p_stflag &= ~PST_PROFIL;
                /*
-                * This is only necessary if using the clock as the
-                * profiling source.
+                * Decrement the number of processes being profiled, and iff
+                * stathz is on a separate clock then reset the stathz rate to
+                * normal, and reset the profile/stats divisor to unity, if no
+                * more processes are being profiled.
+                * 
+                * This is only necessary if using the statistics clock as the
+                * profiling source (as opposed to pmc(9)).
                 */
-               if (--profprocs == 0 && stathz != 0)
+               if (--profprocs == 0 && stathz != 0) {
+
                        psdiv = 1;
+                       setstatclockrate(stathz);
+               }
        }
 }
 
@@ -336,8 +355,25 @@
 }
 
 /*
- * Statistics clock.  Grab profile sample, and if divider reaches 0,
- * do process and kernel statistics.
+ * Statistics (and profiling) clock.
+ *
+ * Called by every CPU.
+ *
+ * XXX profclock() really should always be a separate function, as per FreeBSD;
+ * that should avoid all this bother about psdiv and psratio and eliminate the
+ * need for extra state variables in the per-cpu scheduler state
+ *
+ * As per initclocks() and startprofclock(), statclock() may be called at the
+ * faster profhz rate while profiling is in effect, in which case statistics
+ * gathering must only be done once every psratio number of calls.
+ *
+ * Note since statclock() is called by every CPU, the counting and dividing of
+ * its calls to separate profhz and stathz must be done in per-CPU storage.
+ *
+ * So, first grab a profile sample, and then if by decrementing the
+ * profile/stats counter we reach 0, also do process and kernel statistics and
+ * reset the counter up to psdiv again.  (psdiv will be 1 when statclockrate ==
+ * stathz)
  */
 void
 statclock(struct clockframe *frame)
@@ -352,22 +388,19 @@
        struct lwp *l;
 
        /*
-        * Notice changes in divisor frequency, and adjust clock
-        * frequency accordingly.
+        * Notice any changes in divisor frequency and reset the state
+        * variables for the current CPU.
         */
        if (spc->spc_psdiv != psdiv) {
                spc->spc_psdiv = psdiv;
                spc->spc_pscnt = psdiv;
-               if (psdiv == 1) {
-                       setstatclockrate(stathz);
-               } else {
-                       setstatclockrate(profhz);
-               }
        }
+
        l = ci->ci_data.cpu_onproc;
+
        if ((l->l_flag & LW_IDLE) != 0) {
                /*
-                * don't account idle lwps as swapper.
+                * don't account idle LWPs such as swapper. (???)
                 */
                p = NULL;
        } else {
@@ -376,23 +409,42 @@
        }
 
        if (CLKF_USERMODE(frame)) {
+               /*
+                * Interrupt came while in user mode; CPU was in user state.
+                */
+
+               /* grab profiling stats if profiling this process */
                if ((p->p_stflag & PST_PROFIL) && profsrc == PROFSRC_CLOCK)
                        addupc_intr(l, CLKF_PC(frame));
+
+               /*
+                * If profiling is enabled then statclock() is actually called
+                * at the profhz rate and we decrement a counter to determine
+                * whether this call is also at a stathz interval or not.  If
+                * not then we only do profiling, and so return now.
+                */
                if (--spc->spc_pscnt > 0) {
                        mutex_spin_exit(&p->p_stmutex);
                        return;
                }
 
+#ifndef SYSCALL_PROCTIMES
                /*
-                * Came from user mode; CPU was in user state.
-                * If this process is being profiled record the tick.
+                * Unfortunately since this effectively allocates a whole
+                * stathz interval to the process as user time it can be highly
+                * inaccurate since it assumes the process has (or will
+                * continue to) run for a whole stathz interval in user-mode.
                 */
                p->p_uticks++;
+#endif
                if (p->p_nice > NZERO)
                        spc->spc_cp_time[CP_NICE]++;
                else
                        spc->spc_cp_time[CP_USER]++;
        } else {
+               /*
+                * Interrupt came while in kernel mode.
+                */
 #ifdef GPROF
                /*
                 * Kernel statistics are just like addupc_intr, only easier.
@@ -412,18 +464,27 @@
                        addupc_intr(l, LWP_PC(l));
                }
 #endif
+               /*
+                * If profiling is enabled then statclock() is actually called
+                * at the profhz rate and we decrement a counter to determine
+                * whether this call is also at a stathz interval or not.  If
+                * not then we only do profiling, and so return now.
+                */
                if (--spc->spc_pscnt > 0) {
                        if (p != NULL)
                                mutex_spin_exit(&p->p_stmutex);
                        return;
                }
                /*
-                * Came from kernel mode, so we were:
+                * The current CPU may have been busy:
+                *
                 * - handling an interrupt,
                 * - doing syscall or trap work on behalf of the current
                 *   user process, or
                 * - spinning in the idle loop.
+                *
                 * Whichever it is, charge the time as appropriate.
+                *
                 * Note that we charge interrupts to the current process,
                 * regardless of whether they are ``for'' that process,
                 * so that we know how much of its real time was spent
@@ -431,20 +492,87 @@
                 */
                if (CLKF_INTR(frame) || (curlwp->l_pflag & LP_INTR) != 0) {
                        if (p != NULL) {
+#if 1 /* ndef SYSCALL_PROCTIMES */
+                               /*
+                                * Unfortunately since this allocates a whole
+                                * stathz interval to the process as interrupt
+                                * time it can be highly inaccurate, especially
+                                * for interrupt service routines which may
+                                * routinely be much shorter-running than
+                                * stathz.
+                                *
+                                * Statistically though the likelihood of a
+                                * statclock() call coming while another ISR
+                                * is already executing is therefore relatively
+                                * low, unless perhaps the interrupt rate is
+                                * really high, in which case taking a whole
+                                * stathz tick ``hit'' for ISRs might, on
+                                * average, be OK.
+                                */
                                p->p_iticks++;
+#endif
                        }
                        spc->spc_cp_time[CP_INTR]++;
                } else if (p != NULL) {
+#ifndef SYSCALL_PROCTIMES
+                       /*
+                        * Unfortunately since this allocates a whole stathz
+                        * interval to the process as system time it can be
+                        * highly inaccurate, especially for some system calls
+                        * which may take far less than a stathz interval to
+                        * finish.
+                        *
+                        * Unlike interrupts though, some system calls may
+                        * actually run fairly long.  Does that make this fair?
+                        */
                        p->p_sticks++;
+#endif
                        spc->spc_cp_time[CP_SYS]++;
                } else {
                        spc->spc_cp_time[CP_IDLE]++;
                }
        }
-       spc->spc_pscnt = psdiv;
 
        if (p != NULL) {
+               struct vmspace *vm = p->p_vmspace;
+               long rss;
+
+               /*
+                * If the CPU is currently scheduled to a non-idle process,
+                * then charge that process with the appropriate VM resource
+                * utilization for a tick.
+                *
+                * Assume that the current process has been running the entire
+                * last tick, and account for VM use regardless of whether in
+                * user mode or system mode (XXX or interrupt mode?).
+                *
+                * rusage VM stats are expressed in kilobytes *
+                * ticks-of-execution.
+                */
+               /* based on code from 4.3BSD kern_clock.c and from FreeBSD ... */
+
+# define pg2kb(n)      (((n) * PAGE_SIZE) / 1024)
+
+               p->p_stats->p_ru.ru_idrss += pg2kb(vm->vm_dsize); /* unshared data */
+               p->p_stats->p_ru.ru_isrss += pg2kb(vm->vm_ssize); /* unshared stack */
+               p->p_stats->p_ru.ru_ixrss += pg2kb(vm->vm_tsize); /* "shared" text? */
+
+               rss = pg2kb(vm_resident_count(vm));
+               if (rss > p->p_stats->p_ru.ru_maxrss)
+                       p->p_stats->p_ru.ru_maxrss = rss;
+
+               /* finally account overall for one stathz tick */
                atomic_inc_uint(&l->l_cpticks);
+
+               /* we're done with mucking with proc stats fields.... */
                mutex_spin_exit(&p->p_stmutex);
        }
+
+       /*
+        * reset the profhz divisor counter
+        *
+        * Note we must use the global variable here, not spc->spc_psdiv, since
+        * the statclock rate may already have been lowered by another CPU....
+        */
+       spc->spc_pscnt = psdiv;
 }
Index: sys/kern/kern_acct.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/kern/kern_acct.c,v
retrieving revision 1.86.12.1
diff -u -r1.86.12.1 kern_acct.c
--- sys/kern/kern_acct.c        20 May 2011 19:28:56 -0000      1.86.12.1
+++ sys/kern/kern_acct.c        2 Nov 2011 23:44:07 -0000
@@ -392,7 +392,7 @@
 acct_process(struct lwp *l)
 {
        struct acct acct;
-       struct timeval ut, st, tmp;
+       struct timeval ut, st, it, rt, tmp;
        struct rusage *r;
        int t, error = 0;
        struct rlimit orlim;
@@ -428,7 +428,7 @@
 
        /* (2) The amount of user and system time that was used */
        mutex_enter(p->p_lock);
-       calcru(p, &ut, &st, NULL, NULL);
+       calcru(p, &ut, &st, &it, &rt);
        mutex_exit(p->p_lock);
        acct.ac_utime = encode_comp_t(ut.tv_sec, ut.tv_usec);
        acct.ac_stime = encode_comp_t(st.tv_sec, st.tv_usec);
@@ -441,8 +441,13 @@
 
        /* (4) The average amount of memory used */
        r = &p->p_stats->p_ru;
+#if 0
        timeradd(&ut, &st, &tmp);
        t = tmp.tv_sec * hz + tmp.tv_usec / tick;
+#else
+       /* use p_rtime from calcru()! */
+       t = rt.tv_sec * hz + rt.tv_usec / tick;
+#endif
        if (t)
                acct.ac_mem = (r->ru_ixrss + r->ru_idrss + r->ru_isrss) / t;
        else
Index: sys/kern/init_sysctl.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/kern/init_sysctl.c,v
retrieving revision 1.149.4.8
diff -u -r1.149.4.8 init_sysctl.c
--- sys/kern/init_sysctl.c      7 Mar 2011 17:08:28 -0000       1.149.4.8
+++ sys/kern/init_sysctl.c      2 Nov 2011 22:10:14 -0000
@@ -1256,7 +1256,7 @@
        clkinfo.tickadj = tickadj;
        clkinfo.hz = hz;
        clkinfo.profhz = profhz;
-       clkinfo.stathz = stathz ? stathz : hz;
+       clkinfo.stathz = stathz;
 
        node = *rnode;
        node.sysctl_data = &clkinfo;
@@ -3185,8 +3185,8 @@
                        ep->e_flag |= EPROC_SLEADER;
                strncpy(ep->e_login, ep->e_sess->s_login, MAXLOGNAME);
        }
-       ep->e_xsize = ep->e_xrssize = 0;
-       ep->e_xccount = ep->e_xswrss = 0;
+       ep->e_xsize = ep->e_xrssize = 0; /* XXX now see e_vm */
+       ep->e_xccount = ep->e_xswrss = 0; /* XXX now see e_vm */
 }
 
 u_int

#include <sys/cdefs.h>

#include <sys/types.h>

#include <sys/resource.h>
#include <sys/time.h>
#include <err.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sysexits.h>
#include <unistd.h>

/*
 * WARNING:  time can appear to have gone backwards with getrusage(2)!
 *
 * See NetBSD Problem Report #30115 (and PR#10201).
 * See FreeBSD Problem Report #975 (and PR#10402).
 *
 * Problem has existed in all *BSDs since 4.4BSD if not earlier.
 *
 * Only FreeBSD has implemented a "fix" (as of rev.1.45 (svn r44725) of
 * kern_resource.c (etc.) on April 13, 1999)
 *
 * But maybe it is even worse than that -- distribution of time between user
 * and system doesn't seem to match reality!
 *
 * See the GNU MP Library (GMP)'s tune/time.c code for better timing?
 */
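
/*
 * A sketch of the FreeBSD-style "fix" mentioned above, but done in
 * userland instead (illustrative only -- not used by this program):
 * clamp each new getrusage(2) sample against the previous one so the
 * reported times never appear to go backwards between samples.
 */
static void clamp_tval(struct timeval *, struct timeval *);
static void
clamp_tval(struct timeval *prev, struct timeval *cur)
{
        if (timercmp(cur, prev, <))
                *cur = *prev;           /* time went "backwards": clamp it */
        else
                *prev = *cur;           /* remember the new high-water mark */
}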

#if defined(__APPLE__)
# define MILLIONS       10              /* my mac is way faster!  :-) */
#else
# define MILLIONS       1
#endif

static unsigned int iter = MILLIONS * 1000000UL;

char *argv0 = "progname";


/*
 * for info about the worker algorithms used here see:
 * <URL:http://graphics.stanford.edu/~seander/bithacks.html>
 */

/*
 * do nothing much, but make sure you do it!
 */
unsigned int nullfunc(unsigned long);
unsigned int
nullfunc(unsigned long v)
{
        volatile unsigned int bc = (unsigned int) v;

        return bc;
}

/*
 * return the number of bits set to one in a value
 *
 * old-fashioned bit-by-bit bit-twiddling....  very slow!
 */
unsigned int count_ul_bits(unsigned long);
unsigned int
count_ul_bits(unsigned long v)
{
        unsigned int c;

        c = 0;
        /*
         * we optimize away any high-order zero'd bits...
         */
        while (v) {
                c += (v & 1);
                v >>= 1;
        }

        return c;
}

/*
 * return the number of bits set to one in a value
 *
 * Subtraction of 1 from a number toggles all the bits (from right to left) up
 * to and including the rightmost set bit.
 *
 * So, if we decrement a number by 1 and do a bitwise and (&) with itself
 * ((n-1) & n), we will clear the rightmost set bit in the number.
 *
 * Therefore if we do this in a loop and count the number of iterations then we
 * get the count of set bits.
 *
 * Executes in O(n) operations where n is the number of bits set to one in a
 * given integer value.
 */
unsigned int countbits_sparse(unsigned long);
unsigned int
countbits_sparse(unsigned long v)
{
        volatile unsigned int bc = 0;

        while (v) {
                v &= v - 1;     /* clear the least significant bit set to one */
                bc++;
        }

        return bc;
}

unsigned int countbits_dense(unsigned long);
unsigned int
countbits_dense(unsigned long v)
{
        volatile unsigned int bc = sizeof(v) * CHAR_BIT;

        v ^= (unsigned long) -1;
        while (v) {
                v &= v - 1;     /* clear the least significant bit set to one */
                bc--;
        }

        return bc;
}

/*
 * most efficient non-lookup variant from the URL above....
 */
#define COUNT_BITS(T, x, c)                                             \
        {                                                               \
                T n = (x);                                              \
                                                                        \
                n = n - ((n >> 1) & (T)~(T)0/3);                        \
                n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);   \
                n = (n + (n >> 4)) & (T)~(T)0/255*15;                   \
                c = (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * CHAR_BIT; \
        }

unsigned int count_bits(unsigned long);
unsigned int
count_bits(unsigned long v)
{
        volatile unsigned int c;

        COUNT_BITS(unsigned long, v, c)

        return c;
}



#define MAX_STRLEN_OCTAL(t)     ((int) ((sizeof(t) * CHAR_BIT / 3) + 2))


/* XXX see also timevalsub() */
suseconds_t difftval(struct timeval, struct timeval);
suseconds_t
difftval(struct timeval tstart, struct timeval tend)
{
        tend.tv_sec -= tstart.tv_sec;
        tend.tv_usec -= tstart.tv_usec;

        while (tend.tv_usec < 0) {
                tend.tv_sec--;
                tend.tv_usec += 1000000;
        }
        while (tend.tv_usec >= 1000000) {
                tend.tv_sec++;
                tend.tv_usec -= 1000000;
        }

        return (suseconds_t) ((tend.tv_sec * 1000000) + tend.tv_usec);
}
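
/*
 * Typical use with getrusage(2) would presumably be something like:
 *
 *      struct rusage rs, re;
 *
 *      getrusage(RUSAGE_SELF, &rs);
 *      ... run the worker loop ...
 *      getrusage(RUSAGE_SELF, &re);
 *      us_u = difftval(rs.ru_utime, re.ru_utime);
 *      us_s = difftval(rs.ru_stime, re.ru_stime);
 */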

suseconds_t microtime(void);

/*
 * microtime() - return number of microseconds since some epoch
 *
 * the particular epoch is irrelevant -- we just use the difference between two
 * of these samples taken sufficiently far apart that the resolution is also
 * relatively unimportant, though better than 1 second is expected....
 */
/*
 * Timing anomalies
 *
 * time(1) uses gettimeofday() to show the "real" time, by which it means the
 * wall-clock time it took to run the process, including the time to do the
 * vfork() and execvp(), ignore some signals, and call wait4().
 *
 * However currently on NetBSD we can see getrusage() report a total of system
 * plus user time of as much as 0.06 seconds longer than gettimeofday() says it
 * took for the whole thing!  E.g.:
 *
 * $ /usr/bin/time -p false       
 * real         0.00
 * user         0.03
 * sys          0.03
 *
 * Furthermore gettimeofday() can wander, e.g. due to NTP, or worse.
 *
 * Use clock_gettime(CLOCK_MONOTONIC, tspec) instead if possible!
 */

#ifdef CLOCK_MONOTONIC

suseconds_t
microtime(void)
{
        struct timespec tsnow;

        (void) clock_gettime(CLOCK_MONOTONIC, &tsnow);

        return (suseconds_t) ((tsnow.tv_sec * 1000000) + (tsnow.tv_nsec / 1000));
}

#else /* !CLOCK_MONOTONIC */

/*
 * XXX this is currently for Darwin/Mac OS X, which does not implement the
 * POSIX (IEEE Std 1003.1b-1993) clock_gettime() API
 *
 * Note that on OS X the gettimeofday() function is implemented in libc as a
 * wrapper to either the _commpage_gettimeofday() function, if available, or
 * the normal system call.  If using the COMMPAGE helper then gettimeofday()
 * simply returns the value stored in the COMMPAGE and thus can execute
 * without entering the kernel at all.
 */

suseconds_t
microtime(void)
{
        struct timeval tvnow;

        (void) gettimeofday(&tvnow, (void *) NULL);

        return (suseconds_t) ((tvnow.tv_sec * 1000000) + tvnow.tv_usec);
}

#endif /* CLOCK_MONOTONIC */

void show_time(char *, suseconds_t, suseconds_t, suseconds_t);
void
show_time(char *fname, suseconds_t us_u, suseconds_t us_s, suseconds_t us_c)
{
        suseconds_t us_w = (us_c - (us_s + us_u));
        double pc_u = (double) us_u / (double) iter ;
        double pc_s = (double) us_s / (double) iter ;
        double pc_w = (double) us_w / (double) iter ;
        double pc_c = (double) us_c / (double) iter ;

        /*
         * note in the calculation of us_w above that wall clock elapsed time
         * (us_c) is expected to be longer than the sum of user (us_u) and
         * system (us_s) time, and we will display the difference as "wait"
         * time, suggesting the amount of time the process was waiting for the
         * CPU (shown here per call)
         */

        printf("%18s = %5.4f us/c user, %7.4f us/c sys, %5.4f us/c wait, %7.4f 
us/c wall\n",
               fname,
               pc_u,
               pc_s,
               pc_w,
               pc_c);
}

void usage(void)        __dead;
void
usage()
{
        fprintf(stderr, "Usage: %s [-tr] [-s seed] [-i 
millions_of_iterations]\n", argv0);
        fprintf(stderr, "-r: use random() instead of time()");
        fprintf(stderr, "-s seed: specify value for srandom() instead of using 
time()");
        fprintf(stderr, "-t: don't run the verbose proof of concept test -- 
just do timing runs");
        fprintf(stderr, "(default iterations: %lu * 10^6)\n", iter / 1000000UL);
        exit(EX_USAGE);
}

extern char *optarg;
extern int optind;
extern int optopt;
extern int opterr;
extern int optreset;

long int e_opt_strtol(void);

long int
e_opt_strtol()
{
        long int ltmp;
        char *ep;

        /*
         * extremely pedantic parameter evaluation
         */
        errno = 0;
        ltmp = strtol(optarg, &ep, 0);
        if (ep == optarg) {
                errx(EX_USAGE,
                    "-%c param of '%s' is not a valid number",
                    optopt,
                    optarg);
        }
        if (*ep) {
                errx(EX_USAGE,
                    "-%c param of '%s' has unsupported trailing unit specification characters",
                    optopt,
                    optarg);
        }
        if (errno != 0) {
                errx(EX_USAGE,
                    "-%c param of '%s' is not convertible: %s",
                    optopt,
                    optarg,
                    strerror(errno));
        }
        if (ltmp > INT_MAX) {
                errx(EX_USAGE,
                    "-%c param of %ld is too large (must be <= %d)",
                    optopt,
                    ltmp,
                    INT_MAX);
        }
        if (ltmp < 1) {
                errx(EX_USAGE,
                    "-%c param of %ld is too small (must be > 0)",
                    optopt,
                    ltmp);
        }

        return ltmp;
}

long long int e_opt_strtoll(void);

long long int
e_opt_strtoll()
{
        long long int lltmp;
        char *ep;

        errno = 0;
        lltmp = strtoll(optarg, &ep, 0);
        if (ep == optarg) {
                errx(EX_USAGE,
                    "-%c param of '%s' is not a valid number",
                    optopt,
                    optarg);
        }
        if (*ep) {
                errx(EX_USAGE,
                    "-%c param of '%s' has unsupported trailing unit specification characters",
                    optopt,
                    optarg);
        }
        if (errno != 0) {
                errx(EX_USAGE,
                    "-%c param of '%s' is not convertible: %s",
                    optopt,
                    optarg,
                    strerror(errno));
        }
        if (lltmp > LONG_MAX) {
                errx(EX_USAGE,
                    "-%c param of %lld is too large (must be <= %ld)",
                    optopt,
                    lltmp,
                    LONG_MAX);
        }
        if (lltmp < 1) {
                errx(EX_USAGE,
                    "-%c param of %lld is too small (must be > 0)",
                    optopt,
                    lltmp);
        }

        return lltmp;
}

int main(int, char *[]);

int
main(int argc, char *argv[])
{
        bool dotest = true;
        bool use_random = false;
        int ch;
        unsigned long seed = (unsigned long) time((time_t *) NULL);
        size_t i;
        struct rusage rus;
        struct rusage ruf;
#ifdef CLOCK_MONOTONIC
        struct timespec res;
#endif
        suseconds_t nulltime_u;
        suseconds_t nulltime_s;
        suseconds_t nulltime_w;
        suseconds_t timetime_u;
        suseconds_t timetime_s;
        suseconds_t timetime_w;
        suseconds_t totaltime_u;
        suseconds_t totaltime_s;
        suseconds_t totaltime_w;

        argv0 = (argv0 = strrchr(argv[0], '/')) ? argv0 + 1 : argv[0];

        optind = 1;             /* Start options parsing */
        opterr = 0;             /* I'll print my own errors! */
        while ((ch = getopt(argc, argv, ":hi:rs:t")) != -1) {
                switch (ch) {
                case 'i':
                        iter = (unsigned int) e_opt_strtol() * 1000000UL;
                        break;

                case 'r':
                        use_random = true;
                        break;

                case 's':
                        seed = (unsigned long) e_opt_strtoll();
                        break;

                case 't':
                        dotest = false;
                        break;

                case 'h':
                        usage();

                case '?':
                        warnx("unknown option -- '%c'", optopt);
                        usage();

                case ':':
                        /*
                         * NOTE: a leading ':' in optstring causes getopt() to
                         * return a ':' when an option is missing its parameter.
                         */
                        warnx("missing parameter for -%c", optopt);
                        usage();

                default:
                        warnx("programming error, unhandled flag: %c", ch);
                        abort();
                }
        }
        argc -= optind;
        argv += optind;

        if (argc) {
                usage();
        }
        /* show that they all work.... */
        for (i = (dotest ? 1 : INT_MAX); i < (sizeof(unsigned long) * CHAR_BIT); i++) {
                unsigned long v = 1UL << i;
                unsigned int c;

                COUNT_BITS(unsigned long, v/2+1, c)
                printf("%#-*lo (v/2+1)   = %d, %d, %d, %d\n", MAX_STRLEN_OCTAL(typeof(v)),
                       v/2+1, countbits_sparse(v/2+1), countbits_dense(v/2+1), count_ul_bits(v/2+1), c);
                COUNT_BITS(unsigned long, v-2, c)
                printf("%#-*lo (v-2)     = %d, %d, %d, %d\n", MAX_STRLEN_OCTAL(typeof(v)),
                       v-2, countbits_sparse(v-2), countbits_dense(v-2), count_ul_bits(v-2), c);
                COUNT_BITS(unsigned long, v-1, c)
                printf("%#-*lo (v-1)     = %d, %d, %d, %d\n", MAX_STRLEN_OCTAL(typeof(v)),
                       v-1, countbits_sparse(v-1), countbits_dense(v-1), count_ul_bits(v-1), c);
                COUNT_BITS(unsigned long, v, c)
                printf("%#-*lo (%2d bits) = %d, %d, %d, %d\n", MAX_STRLEN_OCTAL(typeof(v)),
                       v, (int) i, countbits_sparse(v), countbits_dense(v), count_ul_bits(v), c);
                puts("--------");
        }

#ifdef CLOCK_MONOTONIC
        /* XXX "#ifdef CLOCK_PROCESS_CPUTIME_ID"??? */
        if (clock_getres(CLOCK_MONOTONIC, &res) == -1) {
                err(EXIT_FAILURE, "clock_getres(CLOCK_MONOTONIC)");
        }
        warnx("using CLOCK_MONOTONIC timer with resolution: %ld s, %ld ns", 
res.tv_sec, res.tv_nsec);
#endif
                
        warnx("now running each algorithm for %u iterations....", iter);

        /*
         * We will see from below that on NetBSD (and others, except maybe
         * FreeBSD) getrusage() can report _RADICALLY_ different amounts of
         * both _user_ time and system time for the exact same code to run!
         *
         * This is over and above the occasional appearance of one or the
         * other of these times apparently going backwards, since they are
         * both calculated by dividing the total run time between them based
         * on the number of statclock tick "hits" which occurred during that
         * run time, and on whether each hit was taken while in kernel mode,
         * user mode, or interrupt mode.
         */
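
        /*
         * (A sketch of the usual 4.4BSD-style calcru() apportionment, not
         * the literal kernel code:
         *
         *      user time   ~= total run time * user_ticks / total_ticks
         *      system time ~= total run time * sys_ticks  / total_ticks
         *
         * so with only a handful of tick hits per run the split between the
         * two can swing wildly from run to run even though the total run
         * time itself is stable.)
         */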

#define START_CLOCKS(v) do {                                            \
        v##_w = microtime();                                            \
        getrusage(RUSAGE_SELF, &rus);                                   \
} while (0)

#define STOP_CLOCKS(v)  do {                                            \
        getrusage(RUSAGE_SELF, &ruf);                                   \
        v##_w = microtime() - v##_w;                                    \
        v##_u = difftval(rus.ru_utime, ruf.ru_utime);                   \
        v##_s = difftval(rus.ru_stime, ruf.ru_stime);                   \
} while (0)
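
/*
 * For illustration, START_CLOCKS(timetime) expands to:
 *
 *      timetime_w = microtime();
 *      getrusage(RUSAGE_SELF, &rus);
 *
 * and STOP_CLOCKS(timetime) then leaves the elapsed wall, user, and system
 * microseconds in timetime_w, timetime_u, and timetime_s respectively.
 */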

        if (!use_random) {
                /* time time() */
                START_CLOCKS(timetime);
                for (i = 0; i < iter; i++) {
                        (void) time((time_t *) NULL);
                }
                STOP_CLOCKS(timetime);
                show_time("time()", timetime_u, timetime_s, timetime_w);
        } else {
                srandom(seed);
                /* time random() */
                START_CLOCKS(timetime);
                for (i = 0; i < iter; i++) {
                        (void) random();
                }
                STOP_CLOCKS(timetime);
                show_time("random()", timetime_u, timetime_s, timetime_w);
        }
        /* time nullfunc() */
        START_CLOCKS(nulltime);
        for (i = 0; i < iter; i++) {
                nullfunc((unsigned long) (use_random ? random() : time((time_t *) NULL)));
        }
        STOP_CLOCKS(nulltime);
        show_time("nulltime()", nulltime_u, nulltime_s, nulltime_w);

        /* note: leave nulltime_* as the sum of nullfunc() time plus time()/random() time */

        /* time countbits_sparse() */
        START_CLOCKS(totaltime);
        for (i = 0; i < iter; i++) {
                countbits_sparse((unsigned long) (use_random ? random() : time((time_t *) NULL)));
        }
        STOP_CLOCKS(totaltime);
        show_time("countbits_sparse()", totaltime_u, totaltime_s, totaltime_w);

        /* time countbits_dense() */
        START_CLOCKS(totaltime);
        for (i = 0; i < iter; i++) {
                countbits_dense((unsigned long) (use_random ? random() : time((time_t *) NULL)));
        }
        STOP_CLOCKS(totaltime);
        show_time("countbits_dense()", totaltime_u, totaltime_s, totaltime_w);

        /* time COUNT_BITS() */
        START_CLOCKS(totaltime);
        for (i = 0; i < iter; i++) {
                unsigned int c;
                COUNT_BITS(unsigned long, (unsigned long) (use_random ? random() : time((time_t *) NULL)), c)
        }
        STOP_CLOCKS(totaltime);
        show_time("COUNT_BITS()", totaltime_u, totaltime_s, totaltime_w);

        /* time count_bits() */
        START_CLOCKS(totaltime);
        for (i = 0; i < iter; i++) {
                count_bits((unsigned long) (use_random ? random() : time((time_t *) NULL)));
        }
        STOP_CLOCKS(totaltime);
        show_time("count_bits()", totaltime_u, totaltime_s, totaltime_w);

        /* time count_ul_bits() */
        START_CLOCKS(totaltime);
        for (i = 0; i < iter; i++) {
                count_ul_bits((unsigned long) (use_random ? random() : time((time_t *) NULL)));
        }
        STOP_CLOCKS(totaltime);
        show_time("count_ul_bits()", totaltime_u, totaltime_s, totaltime_w);

        exit(0);
}
