Subject: Re: reproducible kernel panic w/ 2.0RC4MP
To: Tim Kelly <hockey@dialectronics.com>
From: Bill Studenmund <wrstuden@netbsd.org>
List: port-macppc
Date: 11/11/2004 17:28:26

On Thu, Nov 11, 2004 at 06:50:13PM -0500, Tim Kelly wrote:
> Hi Bill,
>
> > How about telling us what the other CPU is doing? Ok, the one CPU is
> > waiting for the other one. So what is it doing?
>
> I'm not sure. I'm basing my theory, and it is a theory, on the

That's really what we need. :-)

> incredible difference adding 128M of RAM to the existing 128M of RAM
> made to the stability of -current MP. I looked at the code in cpu.c and

I think all you're doing is changing race conditions.

> machdep.c and it appears to me that the kernel panic is forced after the
> count/wait exceeds a certain level. Since the wait conditional is for
> memory to be filled in asynchronously, if either CPU thinks it has the
> most current version of that memory in its (L1?) cache, the
> miscommunication occurs. The CPU's cache reflects either no message
> received or no response received. That's why my first attempt at a patch
> involved ensuring a sync operation after each memory access.

My understanding is that cache coherency is handled by the hardware. SMP
won't work otherwise.
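
For reference, the pattern being described (one CPU polls a shared word
until the other CPU updates it, and gives up after a fixed number of tries)
can be sketched in portable C11 as below. This is only an illustration under
stated assumptions: every name in it (ipi_ack, ack_request, wait_for_ack) is
hypothetical, it is not the code in cpu.c or machdep.c, and it uses C11
atomics rather than whatever primitives the kernel actually uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared word: the responding CPU sets it to 1 once it has
 * handled the request. Not taken from the NetBSD sources. */
static atomic_int ipi_ack;

/* Responder side: publish the acknowledgement. Release ordering makes any
 * earlier stores by this CPU visible before the flag itself. */
static void ack_request(void)
{
    atomic_store_explicit(&ipi_ack, 1, memory_order_release);
}

/* Requester side: poll the flag at most `limit` times.
 * Returns true if the acknowledgement was seen. A false return is the
 * point where a kernel would typically give up and panic. */
static bool wait_for_ack(unsigned long limit)
{
    for (unsigned long i = 0; i < limit; i++) {
        if (atomic_load_explicit(&ipi_ack, memory_order_acquire) != 0)
            return true;
    }
    return false;
}
```

If wait_for_ack() returns false here, the flag was never written at all
within the limit, which is a different failure from a write that was made
but not observed. That is why what the other CPU was doing matters.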

> Now, it seems to me that this shouldn't be affected by the need to page
> memory, so my hope had been to reproduce this on as many systems as
> possible so that it could be determined whether the panic consistently
> occurs when active memory requirements exceed the physical memory present.
>
> I'm fairly handy with Macsbug, the Motorola debugger for Macs, so if you
> have some specific commands that can do bt's on the other CPU, please
> pass them on. I can reproduce this kernel panic in less than an hour.
> Also, so that I can identify potential avenues quicker, what _should_
> the other CPU be doing?

I thought we had support in DDB for querying the other CPUs. That's what
we really need to do, to see why the other CPU didn't answer the IPI.

Take care,

Bill
