current-users: re: wd.c crashes/hard errors

Subject: re: wd.c crashes/hard errors
To: Douglas Crosher <dtc@stan.xx.swin.oz.au>
From: Dirk Steinberg <steinber@machtnix.ert.rwth-aachen.de>
List: current-users
Date: 02/10/1994 14:16:56
>>>>> "Douglas" == Douglas Crosher <dtc@stan.xx.swin.oz.au> writes:

    >> Date: Wed, 9 Feb 94 13:50:33 +0100 From:
    >> steinber@machtnix.ert.rwth-aachen.de (Dirk Steinberg)
    >> Message-Id: <9402091250.AA06233@machtnix.ert.rwth-aachen.de>

    >> Hi,
    >> 
    >> Yesterday it happened again: I was running a current-940207
    >> system & kernel. I have a Quantum LPS 240 AT hard disk, and
    >> since this one had problems with the -current wd.c, I doubled
    >> the WDCNDELAY value, as was suggested some time ago on this
    >> list (all this is from memory, for reasons that will become
    >> apparent soon). During normal operation, the kernel suddenly
    >> hung with repeated messages like this:
    >> 
    >> wdc0: busy too long, resetting
    >> wdc0: busy too long, resetting ...
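
A "busy too long" report is the kind of message a driver prints when a
bounded polling loop on the controller's busy bit gives up. The sketch
below illustrates that idea in C; it is not the actual wd.c code. It
assumes that WDCNDELAY acts as an upper bound on the number of status
polls. The names wait_not_busy, read_status_fn, fake_drive and
fake_read_status are hypothetical and exist only for this illustration.
Only the BSY bit (0x80 in the ATA status register) is standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATUS_BUSY 0x80  /* BSY bit of the ATA status register */

/* Hypothetical callback standing in for a read of the status port. */
typedef uint8_t (*read_status_fn)(void *ctx);

/*
 * Poll until BSY clears or max_polls reads have been made.
 * Returns the number of reads used, or -1 on timeout (the point at
 * which a driver might log "busy too long" and reset the controller).
 */
static long wait_not_busy(read_status_fn read_status, void *ctx,
                          long max_polls)
{
    assert(read_status != NULL);
    for (long i = 0; i < max_polls; i++) {
        if ((read_status(ctx) & STATUS_BUSY) == 0)
            return i + 1;
    }
    return -1;
}

/* Simulated drive (assumption): busy for a fixed number of reads. */
struct fake_drive {
    long busy_reads_left;
};

static uint8_t fake_read_status(void *ctx)
{
    struct fake_drive *d = ctx;
    if (d->busy_reads_left > 0) {
        d->busy_reads_left--;
        return STATUS_BUSY;
    }
    return 0;
}
```

With a simulated drive that stays busy for 150 reads, a limit of 100
times out, while a doubled limit of 200 succeeds. Raising the limit can
therefore help a drive that is merely slow, but it does nothing for a
drive that never clears BSY.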

    Douglas> I run NetBSD 0.9 and had a Quantum LPS 240 AT connected
    Douglas> as the second disk with a WD340 as the first.  The
    Douglas> machine ran for weeks without a problem, then suddenly I
    Douglas> started getting the above problem.  The root partition
    Douglas> (which was on the WD340) was trashed; the other
    Douglas> partitions were OK.  I do not get this problem running
    Douglas> either of the drives alone, so I am now just using the
    Douglas> WD340.

    >>  As a side note, I am observing extra interrupts every so
    >> often. I always get one directly after (or maybe during?) the
    >> autoconfig phase:
    >> 
    >> wdc0: extra interrupt

    Douglas> Yes, I get these also.

    >>  I already had these types of crashes before, and every time a
    >> filesystem was damaged so badly that fsck couldn't repair
    >> it. This time it was the root filesystem...
    >> 
    >> Even worse, when checking the fs after reboot, fsck hangs the
    >> system after:
    >> 
    >> wd0a: hard error reading fsbn 10720 of 10720-10723

    Douglas> Yes, I got this too. I find this very strange, as an IDE
    Douglas> drive should not give hard errors. Typically the damage
    Douglas> to the root partition was so bad that I could not reboot!

Seems that you were seeing almost *exactly* the same symptoms as I.
Glad I'm not alone; at least I know that I was not dreaming! Although
this doesn't help me very much :-(.

    >>  This error is persistent across reboots, power off, etc. Now
    >> since I have an IDE disk I shouldn't get hard errors. I never had
    >> any hard errors before, and my Linux partition still works
    >> fine. So my NetBSD installation is hosed for now. I sure hope
    >> this error goes away when I reinstall/re-mkfs. Is it actually
    >> possible that the faulty wd.c caused damage to my disk, or that
    >> it at least screwed up the low-level format on some track? If
    >> so, how could I reformat a single track without reformatting
    >> the entire disk? And how does one low-level format an IDE disk
    >> in the first place? I know how it works for MFM/RLL/ESDI and SCSI
    >> disks and have done this many times before. But IDE disks?

    Douglas> I was able to restore my system by doing a disklabel,
    Douglas> and putting a clean fs on the root partition, then
    Douglas> reinstalling all the files that were on that partition.
    Douglas> The disk errors did not re-appear till a few days later
    Douglas> when the f..ken thing crashed again.  This time I removed
    Douglas> the second drive and things have been fine.

This makes me hope that my drive is not physically damaged (or
low-level un-formatted). The error message is really weird, though.
As I said, this is the third time this has happened to me, and the
Quantum is my only drive! So your workaround won't work for me...

I also wonder why the crashes are so bad that even fsck in manual mode
cannot repair them. On any other Unix system that uses BSD ufs/fsck
I've seen, you lose at most a few files after a crash. The kernel must
be doing something really horrible when it crashes; merely failing to
sync all buffers cannot be the cause.

I remember that someone also reported serious corruption
problems/overwrites with non-SCSI disks. Is anyone else seeing this?

    Douglas> Regards Douglas Crosher

I am getting somewhat tired of reinstalling the entire system once a
week. Since I don't have access to an internet host that has send-pr
installed, could someone please file this message as a GNATS problem
report with *critical* severity? (OK, disregard this for now; I
tried to send this with GNATS myself from a Sun, but I am not sure
if it worked.)

	Dirk

-----------------------------------------------------------------------------
Dirk W. Steinberg - RWTH Aachen - Internet email: steinber@ert.rwth-aachen.de
Aachen University of Technology / IS2-Integrated Systems in Signal Processing
Rhein.Westf.Tech.Hochsch. Aachen / Integrierte Systeme der Signalverarbeitung
Templergraben 55 / D-52056 Aachen / phone:+49 241 807879 / fax:+49 241 807631
Home address: Kleikstr. 63, D-52134 Herzogenrath,Germany/phone: +49 2406 7225
