Subject: Re: Odd top behavior with current, smp sparc 20
To: None <port-sparc@netbsd.org>
From: Christos Zoulas <christos@astron.com>
List: port-sparc
Date: 11/10/2005 16:19:25
In article <20051109160433.GA29422@SDF.LONESTAR.ORG>,
Bruce O'Neel <edoneel@sdf.lonestar.org> wrote:
>Hi,
>
>I have a sparc 20 with 4 cpus (2 55mhz, 2 100mhz, all hypersparcs) and I've
>noticed an odd response from top recently.
>
>I start 4 jobs, all of them the same, and all 4 take up one cpu.  
>They move from cpu to cpu as expected and each take in the high
>90% of the cpu, also, as expected.
>
>What's not expected is that frequently one, and less frequently more than 
>one while they show taking 95% of the CPU, they don't increment the
>cpu time field of top.  Top does report the same numbers as ps.
>
>Then, after a while the cpu time will jump from say 10 secs to 45 secs
>or what ever one would expect it to be.  A time command on these 4 jobs
>all show them taking about the right amount of time, ie, just about 150
>to 200 secs so the total count is correct.  It's just odd to watch top 
>run and all but one or two jobs not accumulate CPU...
>
>Anyway, odd.  I don't see it on another sparc 20 with 2 cpus,
>1 180mhz and 1 133mhz, but, I only run two jobs there so it might
>not be so obvious.
>
>I don't believe I have /proc mounted on either system for what it's
>worth.  They run an identical build of current, some time from
>the latter part of October, probably built on 22 Oct or so.

There is this glorious piece of code in m_netbsd15.c:

	...
	long cputime;
	...
#if 0
	/* This does not produce the correct results */
	cputime = pp->p_uticks + pp->p_sticks + pp->p_iticks;
#else
	cputime = pp->p_rtime_sec;	/* This does not count interrupts */
#endif

Well, pp->p_rtime_sec is supposed to work but somehow in your case it doesn't.
I cannot really explain why it does not work, since this is a copy of:

	p->p_rtime.tv_sec

and p_rtime is used by the scheduler to do process time accounting.
Now the "This does not produce the correct results" statement is
correct, because the pp->p_?ticks fields are 64 bit numbers and their
sum can overflow the long cputime. The code is also wrong because it
does not divide by stathz... Anyway after fixing those two issues and
enabling the ticks code with this diff...

Index: m_netbsd15.c
===================================================================
RCS file: /cvsroot/src/usr.bin/top/machine/m_netbsd15.c,v
retrieving revision 1.24
diff -u -u -r1.24 m_netbsd15.c
--- m_netbsd15.c	3 Oct 2005 05:34:51 -0000	1.24
+++ m_netbsd15.c	10 Nov 2005 16:16:54 -0000
@@ -544,7 +544,7 @@
 	char *(*get_userid) __P((int));
 {
 	struct kinfo_proc2 *pp;
-	long cputime;
+	uint64_t cputime;
 	double pct;
 	struct handle *hp;
 	const char *statep;
@@ -582,9 +582,10 @@
 		comm[COMSIZ - 1] = '\0';
 	}
 
-#if 0
+#if 1
 	/* This does not produce the correct results */
 	cputime = pp->p_uticks + pp->p_sticks + pp->p_iticks;
+	cputime /= hz;
 #else
 	cputime = pp->p_rtime_sec;	/* This does not count interrupts */
 #endif

... I see inconsistent timing results compared to the rtime calculation. Some
processes show more time with the old version of top, others show more time
with the new version. The numbers are in the right ballpark. I don't
have the time to look at it in more detail right now, although I would be
curious whether this change produces better results on your system. I would also
file a PR, because this could be a sign of a more fundamental MP issue.
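
For reference, the tick arithmetic that the diff enables can be sketched
outside of top roughly as follows. This is only an illustration, not the
code in m_netbsd15.c: struct proc_ticks and ticks_to_seconds() are
made-up names standing in for the three kinfo_proc2 counters, and the
clock rate is passed in as a parameter because the text above mentions
stathz while the diff divides by hz.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the tick counters in struct kinfo_proc2. */
struct proc_ticks {
	uint64_t uticks;	/* ticks charged in user mode */
	uint64_t sticks;	/* ticks charged in system mode */
	uint64_t iticks;	/* ticks charged while handling interrupts */
};

/*
 * Add the three counters in 64 bits, so the sum cannot be truncated
 * the way it can be in a 32 bit long, then convert ticks to whole
 * seconds.  "rate" is the number of ticks per second (assumed to be
 * hz or stathz, whichever the kernel used to charge the ticks).
 */
static uint64_t
ticks_to_seconds(const struct proc_ticks *t, uint64_t rate)
{
	uint64_t total;

	assert(rate != 0);
	total = t->uticks + t->sticks + t->iticks;
	return total / rate;
}
```

With 3000 user, 1200 system and 300 interrupt ticks at a rate of 100
ticks per second this gives 45 seconds. The division truncates, so
fractions of a second are dropped.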

christos