port-i386: Re: Soft error on disk write corrupted drive

Subject: Re: Soft error on disk write corrupted drive
To: Stuart Brooks <stuartb@cat.co.za>
From: Giles Lean <giles.lean@pobox.com>
List: port-i386
Date: 08/30/2007 23:02:32

Stuart Brooks <stuartb@cat.co.za> wrote:

> A disk write which was directed to the rwd0g partition reported the
> "error writing fsbn" with "id not found" a few times before succeeding
> (we believed) with "soft error (corrected)". However the write
> actually ended up taking place to sector 0 on *wd0d*, trashing the
> disk. The data never made its way onto the wd0g partition.

> 1. A problem with the rewrite attempt in NetBSD
> 2. A corruption on the PCI transfer
> 3. An error on the drive
>    - an incorrect sector write
>    - a failed reallocation

What follows is speculation, but I'd bet on a disk error
first, a NetBSD error second, and a PCI corruption third.

I would expect (but I've been wrong before ...) that a PCI
error would show up more often and you'd have to be unlucky to
hit it precisely at the same time as you had a disk error.

For the other two causes I suppose it's a toss-up: neither the
disk firmware's error handling code nor NetBSD's error
handling is as well exercised as the normal working write
cases.

My experience with other Unix-like operating systems and disks
is that such problems are most often disk problems, which is
why I choose to suspect the disk firmware ahead of NetBSD.
(Possible bias disclosure: I used to work for an OS vendor,
not a disk vendor. :-)

I reiterate that I'm just guessing.  The most similar error I
have seen on NetBSD was a "freeing free fragment" panic after
"recovered" disk write errors, but there were differences from
your case:

a) the problem disk was from a different manufacturer, and was
   several years old

b) disk timeouts and "recovered" errors immediately prior to
   the panic were a strong hint that the disk was on the way
   out

c) "freeing free fragment" panics are usually hardware
   problems in my experience

d) I was and am running NetBSD 4.0_BETA2 and not 3.x on the
   system that panicked, and it's been stable(*) since I
   replaced the problem disk.

   (*) OK, the system had a power supply fail a week or two
   later.  It's conceivable that the two failures were related
   (I've seen odder combinations) but it's unlikely.

Were I you I would:

1. replace the wd0 disk ASAP if you haven't already!

2. keep a careful watch on any disks of similar model/vintage
   that you have (e.g. with a SMART utility -- disks often fail
   without warning, but warnings are worth paying attention to)

3. see if there are disk firmware updates(+) available

4. (if you're really keen and have resources) review the
   NetBSD code in the write path between your application and
   the disk to see if you can see a problem.

   Even if the cause is software I'd not be optimistic that
   anyone will be able to see it without a reproducible test
   case, but you might be lucky.
  
(+) Disk vendors are exceedingly reticent about what problems
are fixed in new firmware: even if there is new firmware, it
may not give you any idea what was changed. :-(
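
For point 2, here is a rough sketch of the kind of check I mean, not
a tested monitoring tool.  It assumes smartmontools is installed and
that "smartctl -A" prints its usual attribute table with the raw
value in the tenth column.  The function names and the choice of
attributes to watch are my own, and which attributes a given drive
reports (and what counts as worrying) varies by vendor.

```python
import subprocess

# Attributes that commonly hint at media trouble (an assumed selection;
# not every drive reports all of them, or under exactly these names).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes whose
    raw value is a non-zero integer in `smartctl -A` style output."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


def check_disk(device):
    """Run `smartctl -A` on `device` and report suspicious attributes.
    The device path is whatever your system uses for the whole disk."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
    )
    return suspicious_attributes(result.stdout)
```

Calling check_disk() with the whole-disk device (which is probably
/dev/rwd0d on your i386 box, but that is an assumption) from cron and
mailing yourself any non-empty result would be enough to catch a
growing reallocation count.  An empty result does not prove the disk
is healthy.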

Good luck?

Regards,

Giles