Subject: port-i386/118: non-SCSI (AT/IDE style) hard disks get badly corrupted at times.
To: None <gnats-admin>
From: Dirk Steinberg <steinber@ert.rwth-aachen.de>
List: netbsd-bugs
Date: 02/10/1994 05:20:11
>Number: 118
>Category: port-i386
>Synopsis: non-SCSI (AT/IDE style) hard disks get badly corrupted at times.
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: gnats-admin (GNATS administrator)
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Feb 10 05:20:04 1994
>Originator:
>Organization:
RWTH Aachen / Lehrstuhl fuer Elektrische Regelungstechnik
>Release: current-940207
>Environment:
386/40, i486 mainboard with 386 CPU, no 387, 256 K cache,
16 MB RAM, 234 MB IDE disk Quantum LPS 240 AT,
Kernel and binaries: current-940207.
System: SunOS machtnix 4.1.2 2 sun4c
Architecture: sun4
>Description:
To: dtc@stan.xx.swin.oz.au (Douglas Crosher)
Cc: current-users@sun-lamp.cs.berkeley.edu
Subject: re: wd.c crashes/hard errors
In-Reply-To: <9402100153.AA06037@stan.xx.swin.OZ.AU>
References: <9402100153.AA06037@stan.xx.swin.OZ.AU>
>>>>> "Douglas" == Douglas Crosher <dtc@stan.xx.swin.oz.au> writes:
>> Date: Wed, 9 Feb 94 13:50:33 +0100 From:
>> steinber@machtnix.ert.rwth-aachen.de (Dirk Steinberg)
>> Message-Id: <9402091250.AA06233@machtnix.ert.rwth-aachen.de>
>> Hi,
>>
>> yesterday it happened again: I was running a current-940207
>> system & kernel. I have a Quantum LPS 240 AT hard disk and
>> since this one had problems with the -current wd.c, I doubled
>> the WDCNDELAY (all this is from memory, for reasons that will
>> become apparent soon) value (this was suggested some time ago
>> on this list). So during normal operation, suddenly the kernel
>> hung with repeated messages like this:
>>
>> wdc0: busy too long, resetting
>> wdc0: busy too long, resetting ...
Douglas> I run NetBSD0.9 and had a Quantum LPS 240 AT connected
Douglas> as the second disk with a WD 340 as the first. The
Douglas> machine ran for weeks without a problem then suddenly I
Douglas> started getting the above problem. The root partition
Douglas> was trashed (which was on the WD340), the other
Douglas> partitions were OK. I do not get this problem running
Douglas> either of the drives alone, so am now just using the
Douglas> WD340.
>> As a side note, I am observing extra interrupts every so
>> often. I always get one directly after (or maybe during?) the
>> autoconfig phase:
>>
>> wdc0: extra interrupt
Douglas> Yes I get these also.
>> I already had these types of crashes before, and every time a
>> filesystem was damaged so badly that fsck couldn't repair
>> it. This time it was the root filesystem...
>>
>> Even worse, when checking the fs after reboot, fsck hangs the
>> system after:
>>
>> wd0a: hard error reading fsbn 10720 of 10720-10723
Douglas> Yes, I got this too. I find this very strange as the IDE
Douglas> drive should not give hard errors? Typically the damage
Douglas> to the root partition was so bad that I could not reboot!
Seems that you were seeing almost *exactly* the same symptoms as I.
Glad I'm not alone; at least I know that I was not dreaming! Although
this doesn't help me very much :-(.
>> This error is persistent across reboots, power off, etc. Now
>> since I have an IDE disk I shouldn't get hard errors. I never had
>> any hard errors before, and my Linux partition still works
>> fine. So my NetBSD installation is hosed for now. I sure hope
>> this error goes away when I reinstall/re-mkfs. Is it actually
>> possible that the faulty wd.c caused damage to my disk, or that
>> it at least screwed up the low-level format on some track? If
>> so, how could I reformat a single track without reformatting
>> the entire disk? And how to format (low-level) an IDE disk in
>> the first place? I know how it works for MFM/RLL/ESDI and SCSI
>> disks and have done this many times before. But IDE disks?
Douglas> I was able to restore my system by doing a disklabel,
Douglas> and putting a clean fs on the root partition, then
Douglas> reinstalling all the files that were on that partition.
Douglas> The disk errors did not re-appear till a few days later
Douglas> when the f..ken thing crashed again. This time I removed
Douglas> the second drive and things have been fine.
This makes me hope that my drive is not physically damaged (or
low-level un-formatted). The error message is really weird, though.
As I said, this is the third time this has happened to me, and the
Quantum is my only drive! So your workaround won't work for me...
I also wonder why the crashes are so bad that even fsck in manual mode
cannot repair them. On any other Unix system that uses BSD ufs/fsck
I've seen, you lose at most a few files after a crash. The kernel must
be doing something really horrible when it crashes; just not syncing
all buffers cannot be the cause.
I remember that someone also reported serious corruption
problems/overwrites with non-SCSI disks. Is anyone else seeing this?
Douglas> Regards Douglas Crosher
I am getting somewhat tired of reinstalling the entire system once a
week. Since I don't have access to an internet host that has send-pr
installed, could someone please file this message as a GNATS problem
report with *critical* seriousness? (OK - Disregard this for now; I
tried to send this with GNATS myself from a Sun, not sure if it
worked)
Dirk
-----------------------------------------------------------------------------
Dirk W. Steinberg - RWTH Aachen - Internet email: steinber@ert.rwth-aachen.de
Aachen University of Technology / IS2-Integrated Systems in Signal Processing
Rhein.Westf.Tech.Hochsch. Aachen / Integrierte Systeme der Signalverarbeitung
Templergraben 55 / D-52056 Aachen / phone:+49 241 807879 / fax:+49 241 807631
Home address: Kleikstr. 63, D-52134 Herzogenrath,Germany/phone: +49 2406 7225
>How-To-Repeat:
Work with my machine for at most one week :-).
>Fix:
Workaround: reinstall NetBSD :-(.
>Audit-Trail:
>Unformatted:
------------------------------------------------------------------------------