Subject: Re: Soft error on disk write corrupted drive
To: Stuart Brooks <stuartb@cat.co.za>
From: Giles Lean <giles.lean@pobox.com>
List: port-i386
Date: 08/30/2007 23:02:32
Stuart Brooks <stuartb@cat.co.za> wrote:
> A disk write which was directed to the rwd0g partition reported the
> "error writing fsbn" with "id not found" a few times before succeeding
> (we believed) with "soft error (corrected)". However the write
> actually ended up taking place to sector 0 on *wd0d*, trashing the
> disk. The data never made its way onto the wd0g partition.
> 1. A problem with the rewrite attempt in NetBSD
> 2. A corruption on the PCI transfer
> 3. An error on the drive
>    - an incorrect sector write
>    - a failed reallocation
What follows is speculation, but I'd bet on a disk error
first, a NetBSD error second, and a PCI corruption third.
I would expect (but I've been wrong before ...) that a PCI
error would show up more often and you'd have to be unlucky to
hit it precisely at the same time as you had a disk error.
For the other two causes I suppose it's a toss-up: neither the
disk firmware's error handling code nor NetBSD's error
handling is as well exercised as the normal working write
cases.
My experience with other Unix-like operating systems and disks
is that such problems are most often disk problems, which is
why I choose to suspect the disk firmware ahead of NetBSD.
(Possible bias disclosure: I used to work for an OS vendor,
not a disk vendor. :-)
I reiterate that I'm just guessing. The most similar error I
have seen on NetBSD was a "freeing free fragment" panic after
"recovered" disk write errors, but there were differences to
your case:
a) the problem disk was from a different manufacturer, and was
   several years old
b) disk timeouts and "recovered" errors immediately prior to
   the panic were a strong hint that the disk was on the way
   out
c) "freeing free fragment" panics are usually hardware
   problems in my experience
d) I was and am running NetBSD 4.0_BETA2 and not 3.x on the
   system that panicked, and it's been stable(*) since I
   replaced the problem disk.
(*) OK, the system had a power supply fail a week or two
later. It's conceivable that the two failures were related
(I've seen odder combinations) but it's unlikely.
Were I you I would:
1. replace the wd0 disk ASAP if you haven't already!
2. carefully watch any disks of similar model/vintage that you
   have (e.g. with a SMART utility -- disks often fail without
   warning, but warnings are worth paying attention to)
3. see if there are disk firmware updates(+) available
4. (if you're really keen and have resources) review the
   NetBSD code in the write path between your application and
   the disk to see if you can see a problem.
Even if the cause is software I'd not be optimistic that
anyone will be able to see it without a reproducible test
case, but you might be lucky.
(+) Disk vendors are exceedingly reticent about what problems
are fixed in new firmware: even if there is new firmware, it
may not give you any idea what was changed. :-(
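For item 2, a rough sketch of the sort of check I mean, assuming
your SMART utility is smartctl from smartmontools and that you have
saved the attribute table from "smartctl -A" as text. The function
names, the choice of watched attributes and the sample table are
all made up for illustration; which attributes a drive reports,
and what the raw values mean, is vendor-specific.

```python
import re
from typing import Dict, List

# Attributes often associated with failing sectors. This choice is an
# assumption for illustration; drives differ in what they report.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(text: str) -> Dict[str, int]:
    """Map attribute name -> raw value from `smartctl -A`-style table text."""
    attrs: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric attribute ID and have ten columns;
        # the header row and any other text are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            match = re.match(r"\d+", fields[9])
            if match:
                attrs[fields[1]] = int(match.group())
    return attrs


def disk_warnings(attrs: Dict[str, int]) -> List[str]:
    """Return one message per watched attribute with a non-zero raw value."""
    return [
        f"{name} raw value is {attrs[name]}"
        for name in WATCHED
        if attrs.get(name, 0) > 0
    ]


# Invented sample table, not output from any real disk.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4821
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for message in disk_warnings(parse_smart_attributes(SAMPLE)):
        print(message)
```

A clean report from something like this proves nothing (see the
caveat in item 2), but a raw count that keeps climbing between runs
is the kind of warning worth acting on.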
Good luck?
Regards,
Giles