Subject: Re: reproducible kernel panic w/ 2.0RC4MP
To: None <port-macppc@netbsd.org>
From: Tim Kelly <hockey@dialectronics.com>
List: port-macppc
Date: 11/11/2004 16:54:37
John Klos has pointed out that I missed posting some information to this
thread. I am reposting for clarity, with relevant editing:

3) The MP kernel panic appears to be related to how much memory is
physically present. (...) After I installed an additional 128M (for 256M
total), I was able to compile userland after 21 1/2 hours. I removed the
128M and tried again, and a kernel panic occurred within thirty
minutes. After I reinstalled the stick, I had been building userland for
four hours or so before the kernel panicked due to an (unrelated) issue.

(...) While reading posts about the kernel panic, I noticed that make
and nbmake were almost always running at the time of the panic. I
watched top very closely, and it routinely showed active memory of 140M
or more while building userland, and likely during other builds as well
(but possibly not while building a kernel). I believe the kernel panics
occur when active memory exceeds the physically installed memory.

I believe this is because some memory associated with the IPI
(Interprocessor Interrupt) is being paged out of memory, or back into
memory, without being marked dirty. As a result, one CPU thinks it has
sent an IPI to the other CPU when the write hasn't yet gotten out of its
own cache, or the other CPU thinks it has responded when its response
hasn't gotten out of its cache either.
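To make the ordering half of that hypothesis concrete, here is a minimal
user-space sketch in C11 of a one-slot IPI-style mailbox between two
threads. It is my own illustration, not NetBSD's IPI code; the struct and
function names (ipi_mailbox, ipi_send, run_ipi_demo) are hypothetical.
The sender publishes a message with a release store and the receiver
picks it up with an acquire load; on PowerPC, compilers typically
implement these orderings with lwsync/sync barriers. Without that
pairing, the receiver could see "pending" set before the payload is
visible, which is the "hasn't gotten out of its own cache" window. It
does not model the paging part of the theory.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-slot mailbox; NOT the NetBSD kernel's IPI structure. */
struct ipi_mailbox {
    atomic_uint pending;   /* 1 = message waiting, set by sender */
    atomic_uint acked;     /* number of messages the receiver handled */
    atomic_bool stop;      /* tells the receiver thread to exit */
    unsigned payload;      /* plain (non-atomic) data published by sender */
    unsigned long sum;     /* touched only by the receiver until join */
};

/* Sender: write the payload, then publish it with release ordering so the
 * payload store is visible before 'pending' reads as 1. */
static void ipi_send(struct ipi_mailbox *mb, unsigned value)
{
    mb->payload = value;
    atomic_store_explicit(&mb->pending, 1, memory_order_release);
}

/* Receiver: the acquire load pairs with the sender's release store. */
static void *ipi_receiver(void *arg)
{
    struct ipi_mailbox *mb = arg;

    for (;;) {
        if (atomic_load_explicit(&mb->pending, memory_order_acquire)) {
            mb->sum += mb->payload;   /* guaranteed to see the sent value */
            atomic_store_explicit(&mb->pending, 0, memory_order_relaxed);
            /* Release: our payload read happens before the sender reuses it. */
            atomic_fetch_add_explicit(&mb->acked, 1, memory_order_release);
        } else if (atomic_load_explicit(&mb->stop, memory_order_acquire)) {
            return NULL;
        } else {
            sched_yield();
        }
    }
}

/* Sends n messages, waiting for each acknowledgement (like an IPI ack
 * wait). Returns the number acknowledged, or -1 if the thread failed. */
int run_ipi_demo(unsigned n, unsigned long *sum_out)
{
    struct ipi_mailbox mb;
    pthread_t rx;

    atomic_init(&mb.pending, 0);
    atomic_init(&mb.acked, 0);
    atomic_init(&mb.stop, false);
    mb.payload = 0;
    mb.sum = 0;

    if (pthread_create(&rx, NULL, ipi_receiver, &mb) != 0)
        return -1;

    for (unsigned i = 1; i <= n; i++) {
        ipi_send(&mb, i);
        /* Spin until the receiver has acknowledged message i. */
        while (atomic_load_explicit(&mb.acked, memory_order_acquire) < i)
            sched_yield();
    }

    atomic_store_explicit(&mb.stop, true, memory_order_release);
    pthread_join(rx, NULL);   /* join makes mb.sum safe to read */
    *sum_out = mb.sum;
    return (int)atomic_load(&mb.acked);
}
```

Calling run_ipi_demo(100, &sum) from a main should return 100 with sum
equal to 5050. Weakening the release/acquire pair to memory_order_relaxed
makes the payload access a data race under the C11 model, the software
analogue of a missing sync; whether that actually misbehaves on a given
MP G4 is not something this sketch can show.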

During testing I applied a patch to arch/powerpc/powerpc/pio_subr.S
that overrode the DBGSYNC #define for (MULTIPROCESSOR), so that every
memory access forces a sync instruction. This did not affect the
problem, as it turns out that DBGSYNC is already defined somewhere for
almost all of the MP kernel. I say almost all because the patch resulted
in a kernel that was 64 bytes larger, i.e. 16 instructions at 4 bytes per
PowerPC instruction, and that (possibly erroneously) leads me to believe
there are 16 additional syncs.

tim