Subject: odd lossage on 3.99.9/i386
To: None <current-users@netbsd.org>
From: Sean Davis <dive-nb@endersgame.net>
List: current-users
Date: 10/03/2005 01:27:44
(CC'd to port-i386 since, so far as I know, this is i386-specific.)

For background, my system is configured as follows:

ASUS A7V880 motherboard
Athlon XP 2700+ CPU
1GB (2x512MB) DDR400 (running at DDR333) RAM
onboard vt8237 sata: 2 x 80GB WD800JD
onboard pata: channel 0: 120GB ATA100, channel 1: CDRW (master), DVD-ROM
              (slave)
pci promise (pdcide): 15GB ATA100
pci satalink (satalink): 250GB SATA150

root is on the first 80GB SATA on viaide.


I just ran into a very interesting problem. I had just finished updating all
my packages, and everything was fine. So I built a new -current userland and
kernel (with no funky options; same kernel I've been using since the 2.99.x
days), and tried to boot it. During boot, the kernel panicked right before
it would have tried to mount root. Neither a crash dump nor a traceback
could be obtained.

Now for the odd part: after rebooting, the machine deadlocked in the NetBSD
loader (which had not been updated yet) after I typed a few characters. I was
able to reproduce this several times: for example, typing "boot netbsd -s"
would die around "boot n", "boot -s" would die after the first "b", and so
on. The machine was locking hard; no ctrl-alt-delete response.

I changed everything in the BIOS to 'safe' settings (no overclocking
whatsoever, even reset everything to defaults) and it made no difference.

Then I powered off the machine for about 10 minutes, and turned it back on,
and tried booting the new kernel. Exact same thing happened. Powered it off
again, left it off for another 10 minutes or so, had it boot the old kernel,
and everything is running perfectly fine.

I've run memtest86 and the BIOS's (rudimentary) RAM test (i.e. with fast
boot turned off)... no issues there.

I am very hesitant to name this a hardware issue, as everything worked just
fine up to the moment I tried to boot 3.99.9, and after being powered off,
worked just fine with 3.99.8. My suspicion is that something lingered in the
RAM (the RAM doesn't seem to get cleared on reboot; I often have three or
four boots' worth of dmesg output in dmesg at any time) and whatever that
something was, it was causing major lossage elsewhere.

According to the BIOS, temperature is nominal. I work with Athlon CPUs at
higher temperatures with no problems.

Does anyone have any ideas for a course of action to take to fix this, or
any similar experiences, or anything that might point me in the right
direction so that I can update to -current?

TIA,
-Sean