Subject: re: wd.c crashes/hard errors
To: Douglas Crosher <dtc@stan.xx.swin.oz.au>
From: Dirk Steinberg <steinber@machtnix.ert.rwth-aachen.de>
List: current-users
Date: 02/10/1994 14:16:56
>>>>> "Douglas" == Douglas Crosher <dtc@stan.xx.swin.oz.au> writes:
>> Date: Wed, 9 Feb 94 13:50:33 +0100 From:
>> steinber@machtnix.ert.rwth-aachen.de (Dirk Steinberg)
>> Message-Id: <9402091250.AA06233@machtnix.ert.rwth-aachen.de>
>> Hi,
>>
>> yesterday it happened again: I was running a current-940207
>> system and kernel. I have a Quantum LPS 240 AT hard disk, and
>> since it had problems with the -current wd.c, I doubled the
>> WDCNDELAY value (all this is from memory, for reasons that will
>> become apparent soon), as was suggested some time ago on this
>> list. Then, during normal operation, the kernel suddenly hung
>> with repeated messages like this:
>>
>> wdc0: busy too long, resetting
>> wdc0: busy too long, resetting ...
Douglas> I run NetBSD 0.9 and had a Quantum LPS 240 AT connected
Douglas> as the second disk, with a WD 340 as the first. The
Douglas> machine ran for weeks without a problem, then suddenly I
Douglas> started getting the above problem. The root partition
Douglas> (which was on the WD340) was trashed; the other
Douglas> partitions were OK. I do not get this problem running
Douglas> either of the drives alone, so I am now using only the
Douglas> WD340.
>> As a side note, I am observing extra interrupts every so
>> often. I always get one directly after (or maybe during?) the
>> autoconfig phase:
>>
>> wdc0: extra interrupt
Douglas> Yes I get these also.
>> I have had this type of crash before, and every time a
>> filesystem was damaged so badly that fsck couldn't repair
>> it. This time it was the root filesystem...
>>
>> Even worse, when checking the fs after reboot, fsck hangs the
>> system after:
>>
>> wd0a: hard error reading fsbn 10720 of 10720-10723
Douglas> Yes, I got this too. I find this very strange, as an IDE
Douglas> drive should not give hard errors. The damage to the
Douglas> root partition was typically so bad that I could not reboot!
It seems that you were seeing almost *exactly* the same symptoms as I
was. Glad I'm not alone; at least I know that I was not dreaming!
Although this doesn't help me very much :-(.
>> This error is persistent across reboots, power-offs, etc. Now,
>> since I have an IDE disk, I shouldn't get hard errors. I never
>> had any hard errors before, and my Linux partition still works
>> fine. So my NetBSD installation is hosed for now. I sure hope
>> this error goes away when I reinstall/re-mkfs. Is it actually
>> possible that the faulty wd.c caused damage to my disk, or that
>> it at least screwed up the low-level format on some track? If
>> so, how could I reformat a single track without reformatting
>> the entire disk? And how does one low-level format an IDE disk
>> in the first place? I know how it works for MFM/RLL/ESDI and
>> SCSI disks and have done this many times before. But IDE disks?
Douglas> I was able to restore my system by doing a disklabel,
Douglas> putting a clean fs on the root partition, and then
Douglas> reinstalling all the files that were on that partition.
Douglas> The disk errors did not reappear until a few days later,
Douglas> when the f..ken thing crashed again. This time I removed
Douglas> the second drive, and things have been fine.
This makes me hope that my drive is not physically damaged (or
low-level unformatted). The error message is really weird, though.
As I said, this is the third time this has happened to me, and the
Quantum is my only drive! So your workaround won't work for me...
I also wonder why the crashes are so bad that even fsck in manual mode
cannot repair them. On every other Unix system using BSD ufs/fsck that
I've seen, you lose at most a few files after a crash. The kernel must
be doing something really horrible when it crashes; merely failing to
sync all buffers cannot be the cause.
I remember that someone also reported serious corruption
problems/overwrites with non-SCSI disks. Is anyone else seeing this?
Douglas> Regards Douglas Crosher
I am getting somewhat tired of reinstalling the entire system once a
week. Since I don't have access to an internet host that has send-pr
installed, could someone please file this message as a GNATS problem
report with *critical* seriousness? (OK, disregard this for now; I
tried to send this with GNATS myself from a Sun, but I'm not sure it
worked.)
Dirk
-----------------------------------------------------------------------------
Dirk W. Steinberg - RWTH Aachen - Internet email: steinber@ert.rwth-aachen.de
Aachen University of Technology / IS2-Integrated Systems in Signal Processing
Rhein.Westf.Tech.Hochsch. Aachen / Integrierte Systeme der Signalverarbeitung
Templergraben 55 / D-52056 Aachen / phone:+49 241 807879 / fax:+49 241 807631
Home address: Kleikstr. 63, D-52134 Herzogenrath,Germany/phone: +49 2406 7225
------------------------------------------------------------------------------