Subject: Re: pv_unlink (not pv_unlink0) panics?
To: Dave McGuire <mcguire@neurotica.com>
From: Erik E. Fair <fair@clock.org>
List: port-sparc
Date: 06/22/1997 15:19:10

No, you're not on drugs.

My first experience was with a stripped kernel (no unnecessary
drivers), and no DEBUG or DIAGNOSTIC on. It panic'd toward the end
of /etc/rc.local, just after I'd spent two days doing the conversion
of files and such from SunOS 4.1.4 to NetBSD current. I was doing
the first full multiuser boot, and seeing that panic was a real
bummer (all I could think at that moment was, "I'm screwed - I have
to back out to SunOS!").

However, in my darkest hour, I took heart from the words of Obi-Wan
Kenobi ("Use The Source, Luke!"), and decided to go on a bug
hunt. It took another 36 hours or so of experimentation (removing
RAM, new kernels, etc.) before I found a stable combination of
options. Then, I went to work on the E-mail system (I run zmailer,
not sendmail, for a variety of hysterical raisins) which took
another week to get going (that code had never been compiled "-Wall"
and so I had a lot of clean up work to do). My housemate recompiled
and installed Apache, and then I went a-hunting for the cause of
the pv_unlink0 panic.

I am something of a traditionalist systems programmer. I believe
that garbage collection is an excuse for lazy programming; that OOP is
a lot of fancy terminology for what we used to call "modular
programming", "good function decomposition", and "good data structure
design"; and I view debuggers with suspicion (I've been caught by
debugger induced Heisen-bugs just once too often; my favorite
debugging tools are "printf" and code reading).

Since I lacked even the most rudimentary documentation (other than
the code itself) of the sun4m MMU, I read the pmap code with an
eye toward things that didn't make sense structurally or as "good C."
Uninitialized variables, calling sequence problems, function
argument mismatches, etc. I found a few trivial things wrong, but
nothing dangerously so.

So, next step was to make sure that when things were initialized,
they really got all initialized, and when released, really cleaned
out. That led to the patch. First attempt made the panics go away
(good), but stomped some state that needed to stay (oops). Second
attempt made the panics go away. By doing this, I was hoping that
the problem would move closer to its proximate cause by making the
system state more clean; if you're going to leave previous state
lying around, the code has to be consistent about when it's valid,
and when it's not. I was trying to put invalid state in when the
state was supposed to be invalid, which would hopefully cause bad
references to it to die more quickly and more spectacularly.
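
For anyone who hasn't used the technique, here is a minimal sketch
of the general idea in C. It is not the patch and not pmap code:
struct entry, ENTRY_MAGIC, POISON_BYTE and the three functions are
hypothetical names invented for illustration. Initialization sets
every field, release overwrites the whole structure with an
obviously bogus pattern, and a validity check lets callers trip
over stale references early.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of this is taken from the pmap code. */
#define ENTRY_MAGIC  0x600DF00Du  /* marks an entry as live */
#define POISON_BYTE  0xA5         /* fill pattern for released entries */

struct entry {
    unsigned int  magic;   /* ENTRY_MAGIC while the entry is valid */
    struct entry *next;    /* link to the next entry in a list */
    unsigned long va;      /* stand-in payload, e.g. an address */
};

/* Initialize every field so no stale state survives from a previous use. */
void entry_init(struct entry *e, unsigned long va)
{
    memset(e, 0, sizeof(*e));
    e->magic = ENTRY_MAGIC;
    e->next = NULL;
    e->va = va;
}

/* Overwrite a released entry with an obviously invalid pattern. */
void entry_release(struct entry *e)
{
    memset(e, POISON_BYTE, sizeof(*e));
}

/* Returns 1 if the entry is live, 0 if it has been poisoned. */
int entry_is_valid(const struct entry *e)
{
    return e->magic == ENTRY_MAGIC;
}
```

A caller would put something like assert(entry_is_valid(e)) at the
top of any routine that walks these entries, so a reference to a
released entry dies at the point of use rather than some time
later.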

Instead, my guaranteed panic cases failed to panic the system any
more. Bug fixed? Well, not really, because I still can't tell you
the actual cause - only that it's not happening to me any more.

I haven't tested my "guaranteed panic" cases against the official
fix yet. I need to bring the production system up to -current from
end of April, and that will take a little doing.

	Erik E. Fair	fair@clock.org