port-sparc: Re: pv_unlink0 panics on 1.3.2

Subject: Re: pv_unlink0 panics on 1.3.2
To: Erik Bertelsen <erik@mediator.uni-c.dk>
From: Erik E. Fair <fair@clock.org>
List: port-sparc
Date: 10/02/1998 06:18:55

I upgraded my production system (96MB SPARC LX), the one where I had this
problem, from 1.2G to 1.3.2 almost three weeks ago, and it has been stable:

 5:56AM  up 19 days,  9:31, 18 users, load averages: 0.16, 0.21, 0.17

Please note that the patch which initially worked for me (zeroing out pv_va
when it was released) was strictly a "data cleanliness" shot-in-the-dark; I
still don't understand the pmap code or the sun4m MMU (not that I've been
putting in any effort to do so since my problem went away). About 30 days
after PK committed the official fix (and removed my patch) I converted over
to that "-current" kernel, and stayed there for nearly a year without
incident. On the flip side, my system is not heavily exercised - I am a
great believer in having excess capacity...

There has been one other report of a pv_unlink0 panic on port-sparc within
the last six months, but I've forgotten who reported it.

This problem smells to me like a race condition of some sort - perhaps an
interrupt lockout that should happen but doesn't. Unfortunately, it is very
difficult to solve this sort of problem without a deep understanding of the
code and what it should or should not be doing; you could try putting in
trace statements, but if it is a timing problem, that could alter the
timing (Heisenberg strikes again!). Yuck.
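
One way to keep trace statements from disturbing the timing too much is to
record events into a small in-memory ring buffer and only read it back
after the fact, rather than printing from the suspect code path. What
follows is only a rough sketch under my own assumptions: the names
(trace_event, trace_dump, TRACE_SLOTS) are hypothetical and not taken from
the NetBSD sources, and in a real kernel the recording step would itself
need to be protected against interrupts.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 256u  /* power of two so the index can be masked */

/* One trace record; every field here is a hypothetical choice. */
struct trace_rec {
    uint32_t  seq;  /* global sequence number of the event */
    uint16_t  tag;  /* event code chosen by the caller */
    uintptr_t arg;  /* one word of context, e.g. an address */
};

static struct trace_rec trace_buf[TRACE_SLOTS];
static uint32_t trace_next;  /* total number of events recorded so far */

/*
 * Record one event: a few stores and no I/O, so the cost is small and
 * roughly constant. NOT interrupt-safe as written.
 */
static void trace_event(uint16_t tag, uintptr_t arg)
{
    struct trace_rec *r = &trace_buf[trace_next & (TRACE_SLOTS - 1u)];

    r->seq = trace_next++;
    r->tag = tag;
    r->arg = arg;
}

/* Copy the newest records, oldest first, into out; return how many. */
static size_t trace_dump(struct trace_rec *out, size_t max)
{
    uint32_t have = trace_next < TRACE_SLOTS ? trace_next : TRACE_SLOTS;
    size_t n = have < max ? have : max;

    for (size_t i = 0; i < n; i++)
        out[i] = trace_buf[(trace_next - n + i) & (TRACE_SLOTS - 1u)];
    return n;
}
```

The idea would be to call trace_event() at the points of interest and look
at the last few records after a panic. Even this changes the timing a
little, so it reduces the Heisenberg effect rather than removing it.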

Just out of curiosity, are you using many SunOS 4.1.x binaries in
emulation? The kernel stack trace showed that my panics were almost always
triggered by a getdents() call out of the emulation layer...

	Erik <fair@clock.org>