Subject: Race condition in kernel! (was ping times)
To: None <port-macppc@netbsd.org>
From: Donald Lee <MacPPC@caution.icompute.com>
List: port-macppc
Date: 07/17/2004 17:22:29
In our last episode.......

I just tried booting a CD on a Quicksilver G4/867.  It has the same problem.
(ping times to the local network are quantized to 10 ms)

Could anyone else try a ping from a *fast* MacPPC and see if the ping
times are all multiples of 10 ms?  I've now seen the problem on two
machines - an 867 MHz Quicksilver, and a G4/AGP with a
1 GHz CPU (both are Moto 7455 CPUs).
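
To make "multiples of 10 ms" concrete, here is a rough sketch of the
check I have in mind.  It is not taken from any existing tool: the name
all_quantized, the microsecond units, and the tolerance parameter are
all my own assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if every round-trip time in rtt_us[] lies within tol_us of a
 * multiple of quantum_us, 0 otherwise.  Times are in microseconds.
 * Hypothetical helper: names and units are assumptions, not an existing API.
 * Note that 0 counts as a multiple, so keep tol_us well below a normal
 * LAN round-trip time or healthy samples will look quantized too.
 */
int
all_quantized(const long *rtt_us, size_t n, long quantum_us, long tol_us)
{
	size_t i;

	assert(quantum_us > 0 && tol_us >= 0);
	for (i = 0; i < n; i++) {
		long rem, dist;

		assert(rtt_us[i] >= 0);
		/* Distance from this sample to the nearest multiple. */
		rem = rtt_us[i] % quantum_us;
		dist = rem < quantum_us - rem ? rem : quantum_us - rem;
		if (dist > tol_us)
			return 0;
	}
	return 1;
}
```

With the ping times converted to microseconds, a call such as
all_quantized(samples, n, 10000, 100) would report whether a machine
shows the 10 ms pattern.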

This Quicksilver does not have the L3 cache, so that tends to eliminate
the L3 cache theory.  I'm now convinced it's a race - probably
somewhere in the context switch machinery in the kernel.

I'm now betting that the interrupts are happening
a little "too fast", and that some of the locking in the kernel is
not quite right, allowing the faster response to cause events to
get lost (i.e. the interrupting event completes before the requestor is
quite finished setting up to handle it, maybe?)
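
The kind of race I am guessing at can be sketched as a toy userland
model.  This is NOT kernel code and nothing in it is confirmed about
NetBSD: the struct, the function names, and above all the idea that a
lost wakeup is only noticed at the next 10 ms clock tick are my
assumptions.

```c
#include <assert.h>

#define TICK_US 10000L	/* assumed 100 Hz clock tick, in microseconds */

/* Hypothetical request state, for illustration only. */
struct req {
	int done;	/* set by the "interrupt" when the I/O completes */
	int sleeping;	/* set once the requestor is really asleep */
	int woken;	/* set when a wakeup actually reaches a sleeper */
};

/* Completion handler: a wakeup delivered with no sleeper is simply lost. */
static void
intr_complete(struct req *r)
{
	r->done = 1;
	if (r->sleeping) {
		r->sleeping = 0;
		r->woken = 1;
	}
}

/*
 * Racy requestor: it has already tested r->done (false) and needs
 * setup_us more before it is asleep.  The interrupt fires at intr_us.
 * Returns the time at which the requestor notices the completion.
 */
long
racy_latency_us(long intr_us, long setup_us)
{
	struct req r = { 0, 0, 0 };

	if (intr_us < setup_us) {
		intr_complete(&r);	/* fires first: no sleeper, lost */
		r.sleeping = 1;		/* requestor sleeps on stale check */
	} else {
		r.sleeping = 1;		/* requestor is asleep in time */
		intr_complete(&r);	/* wakeup is delivered */
	}
	/* Model assumption: a lost wakeup is noticed at the next tick. */
	return r.woken ? intr_us : TICK_US;
}

/*
 * Same requestor with interrupts blocked across the check and the
 * sleep: an early interrupt is held until the requestor is asleep.
 */
long
safe_latency_us(long intr_us, long setup_us)
{
	struct req r = { 0, 0, 0 };
	long fire_us = intr_us < setup_us ? setup_us : intr_us;

	r.sleeping = 1;
	intr_complete(&r);
	return r.woken ? fire_us : TICK_US;
}
```

In this model a completion that arrives 50 us into a 200 us setup costs
a full 10 ms in the racy version and only 200 us in the protected one.
That would match the quantized ping times, but it is only a model.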

Anyone have suggestions?  I'm motivated to dig in and puzzle this
out, but it will be a hard slog for me unless I get a little guidance.

Another symptom that I had not connected to this one was that
my dumps (dump(8)) on the 1 GHz CPU were *really* slow.  I found that
running a CPU-intensive program at the same time the dumps were active
actually made the dumps run about 15 times faster (yes, 15x).  I looked
at the dump source, and dump forks off several copies of itself and
"passes the torch" between the pids to do the I/O.  It looked to me
at the time like the synchronization in dump was busted, but that behavior
could also be explained by the kernel dropping events.

The other thing that comes to mind is the problems with the ATA
card.  That also drops interrupts, which could be explained by
this sort of race.

It would be really nice if this turned out to be one long-standing bug
behind a bunch of obscure problems.

I plan to send-pr this.

Ideas, anyone?  How does one get a "trace" of events in a NetBSD kernel
with extremely fine time granularity so I can see the sequence of
events through context switches?

Thanks in advance,

-dgl-