Subject: Re: Some kernel profiling on a 5000/133 with -current
To: Chuck Silvers <chuq@chuq.com>
From: Ken Wellsch <kwellsch@tampabay.rr.com>
List: port-pmax
Date: 08/30/2001 17:02:48
Hi Chuck,

I generally pay no attention to profile data beyond the flat call graph
at the top of the gprof output.  But I did look at the call graph to see
who was using the function.  True enough, I was influenced by the sorting
provided by gprof.

So is it safe to say the first table is reasonably accurate? e.g. that
memcpy is by far the dominant function consuming kernel time in my test
(ignoring the profile overhead)? Is the total call count accurate?  Is
the assigned call ratio in the call graph accurate? e.g. in my table below
is memcpy being called nearly 50% of all calls, by microtime?  Or
are the called-by assignment counts also suspect?

I confess I was less concerned about consumed CPU time as I was with the
large number of calls that I didn't think were necessary.  I guess if
function call overhead on a MIPS system is low it is more of a waste of
time to suggest changing anything.

For what it is worth, I did find, doing long "bonnie" runs with 500Mb
file-size, that the performance of the assembler memcpy() and the
portable C version were about the same.

It is hardly sufficient testing, but I did make one 500Mb bonnie
run with my change below and it appeared to drop over a minute from
the ~52 minute run time (under a profiling kernel).

I also optimized out the call to platform.clkread for a MIPS1 only
3MIN kernel because as the code exists, it results in an empty call
on non MIPS3 3MIN boxes (the dominant 3MIN box out there).

diff -u /sys/arch/pmax/pmax/machdep.c /tmp/machdep.c
--- /sys/arch/pmax/pmax/machdep.c       Sun Aug 26 09:05:06 2001
+++ /tmp/machdep.c      Thu Aug 30 16:49:44 2001
@@ -772,10 +772,14 @@
        int s = splclock();
        static struct timeval lasttime;
 
-       *tvp = time;
-#if (DEC_3MIN + DEC_MAXINE + DEC_3MAXPLUS) > 0
+       tvp->tv_sec  = time.tv_sec;
+       tvp->tv_usec = time.tv_usec;
+
+#if defined(DEC_MAXINE) || defined(DEC_3MAXPLUS) || \
+    ( defined(DEC_3MIN) && defined(MIPS3) )
        tvp->tv_usec += (*platform.clkread)();
 #endif
+
        if (tvp->tv_usec >= 1000000) {
                tvp->tv_usec -= 1000000;
                tvp->tv_sec++;

Thanks for your reply!

-- Ken


On Thu, Aug 30, 2001 at 01:31:46PM -0700, Chuck Silvers wrote:
> hi,
> 
> sure it's possible to make that change, but it probably won't help all
> that much.  but go ahead and try it and see how the total cpu time taken
> by memcpy() changes.  it will probably be pretty much the same
> (within the margin of error for your test).
> 
> the reason for this is that the profiler can't tell how how much of the
> time used by a given function is related to each of the callers, so it
> just assigns cpu time to the callers proportional to the number of calls.
> 
> for example, if we have this program:
> 
> 
> char buf[100000000];
> 
> void
> c(int v)
> {
> 	int i;
> 	char *cp;
> 
> 	if (v) {
> 		return;
> 	}
> 
> 	for (i = 0; i < sizeof(buf); i++) {
> 		buf[i] = 1;
> 	}
> }
> 
> void
> a()
> {
> 	c(1);
> }
> 
> void
> b()
> {
> 	c(0);
> }
> 
> int
> main()
> {
> 	int i;
> 
> 	for (i = 0; i < 1000000; i++) {
> 		a();
> 	}
> 	b();
> }
> 
> 
> 
> we get this gprof output:
> 
> index % time    self  children    called     name
>                                                  <spontaneous>
> [1]    100.0    0.00    1.03                 main [1]
>                 0.01    1.02 1000000/1000000     a [2]
>                 0.00    0.00       1/1           b [4]
> -----------------------------------------------
>                 0.01    1.02 1000000/1000000     main [1]
> [2]    100.0    0.01    1.02 1000000         a [2]
>                 1.02    0.00 1000000/1000001     c [3]
> -----------------------------------------------
>                 0.00    0.00       1/1000001     b [4]
>                 1.02    0.00 1000000/1000001     a [2]
> [3]     99.0    1.02    0.00 1000001         c [3]
> -----------------------------------------------
>                 0.00    0.00       1/1           main [1]   
> [4]      0.0    0.00    0.00       1         b [4]
>                 0.00    0.00       1/1000001     c [3]
> 
> 
> according to gprof, all the time is taken by the 1000000 calls to
> c() made by a(), but looking at the code it's obvious that
> all the time is really taken by the one call to c() made by b().
> 
> in your case, all the time used by memcpy() is really due to things like
> copyin() and copyout(), even though there are a lot of calls from microtime().
> 
> -Chuck
> 
> 
> On Thu, Aug 30, 2001 at 01:04:02PM -0400, Ken Wellsch wrote:
> > On Fri, Aug 24, 2001 at 03:39:50PM -0400, Ken Wellsch wrote:
> > >
> > > index % time    self  children    called     name
> > > -----------------------------------------------------------------
> > > [10]    19.7  249.70    0.00 2023108         memcpy [10]
> > > -----------------------------------------------
> > > [16]     7.8    0.36   98.55  796813         microtime <cycle 8> [16]
> > >                98.35    0.00  796813/2023108     memcpy [10]
> > > -----------------------------------------------
> > >	...
> > 
> > I kept looking at this wondering, where the heck does microtime()
> > ever call memcpy()?
> > 
> > Well, I pondered this for a while and in looking at the code in
> > pmax/machdep.c I notice two "structure copies."
> > 
> > Sure enough, when I look at the assembler produced by "gcc" I see:
> > 
> > 	 ...
> >         move    $4,$16
> >         move    $17,$2
> >         la      $5,time
> >         .set    noreorder
> >         .set    nomacro
> >         jal     memcpy
> >         li      $6,8                    # 0x8
> >         .set    macro
> >         .set    reorder
> > 	 ...
> > 
> > which matches up with the line "*tvp = time;" below:
> > 
> >   microtime(tvp)
> >         struct timeval *tvp;
> >   { 
> >         int s = splclock();
> >         static struct timeval lasttime; 
> >  
> >         *tvp = time;
> > 
> > 	...
> > 
> > Apparently "time" is treated as volatile, while the closing
> > structure copy "lasttime = *tvp;" does not involve memcpy().
> > 
> > So given that this code is executing with splclock(), is it
> > possible to replace
> > 
> >         *tvp = time;
> > 
> > by
> > 
> >         tvp->tv_sec  = time.tv_sec;
> >         tvp->tv_usec = time.tv_usec;
> > 
> > and thus avoid this expensive and ever so frequent call?
> > 
> > Cheers,
> > 
> > -- Ken