port-i386: Re: Soft error on disk write corrupted drive

Subject: Re: Soft error on disk write corrupted drive
To: Giles Lean <giles.lean@pobox.com>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: port-i386
Date: 08/30/2007 08:42:32
	Hello.  Having said all that, I'm inclined to agree with Giles that
the most likely culprit is the disk itself.  I've seen errors following
this code path in NetBSD for a number of years, and in a variety of
situations, and if the error was corrected, the data always got to the
right sectors.  The NetBSD code, like that of most other OSes I've worked
with, simply requeues the write request and tries again, possibly after a
hardware reset command.  In this case, it sounds like the drive is accepting
the retried write, reporting success, and not actually doing what it
promised.
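
To make the requeue-and-retry behaviour concrete, here is a minimal
simulation in Python.  It is only a sketch under stated assumptions, not the
NetBSD wd driver: FakeDisk, TransientWriteError, write_with_retry and the
misdirect_on_retry flag are all hypothetical names invented for illustration.
It shows that a retry loop can only trust the status the drive reports, so a
drive that claims success but writes elsewhere goes unnoticed.

```python
class TransientWriteError(Exception):
    """Stands in for a recoverable error such as "id not found"."""


class FakeDisk:
    """Toy in-memory disk (hypothetical; not a model of any real drive)."""

    def __init__(self, failures_before_success=2, misdirect_on_retry=False):
        self.sectors = {}
        self.failures_left = failures_before_success
        self.misdirect_on_retry = misdirect_on_retry
        self.resets = 0

    def reset(self):
        # Placeholder for a hardware reset command.
        self.resets += 1

    def write(self, sector, data):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TransientWriteError(f"id not found, sector {sector}")
        # Assumption: a faulty drive reports success yet stores the data
        # at the wrong place (sector 0 here, as in the reported incident).
        target = 0 if self.misdirect_on_retry else sector
        self.sectors[target] = data


def write_with_retry(disk, sector, data, max_retries=5):
    """Requeue the write until the disk reports success.

    Returns the number of retries that were needed; a value above zero
    corresponds to a "soft error (corrected)".
    """
    for attempt in range(max_retries + 1):
        try:
            disk.write(sector, data)
            return attempt
        except TransientWriteError:
            disk.reset()  # optional reset before trying again
    raise IOError(f"hard error writing sector {sector}")


# A well-behaved disk: the data lands where it was asked to go.
good = FakeDisk()
retries = write_with_retry(good, 1000, b"payload")
print(retries, good.sectors.get(1000))

# A misbehaving disk: the retry "succeeds", but sector 0 gets the data.
bad = FakeDisk(misdirect_on_retry=True)
write_with_retry(bad, 1000, b"payload")
print(bad.sectors.get(1000), bad.sectors.get(0))
```

In the second case the caller sees exactly the same result as in the first,
which is why the OS has no way to notice the problem from the retry path
alone.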
-Brian
On Aug 30, 11:02pm, Giles Lean wrote:
} Subject: Re: Soft error on disk write corrupted drive
} 
} Stuart Brooks <stuartb@cat.co.za> wrote:
} 
} > A disk write which was directed to the rwd0g partition reported the
} > "error writing fsbn" with "id not found" a few times before succeeding
} > (we believed) with "soft error (corrected)". However the write
} > actually ended up taking place to sector 0 on *wd0d*, trashing the
} > disk. The data never made its way onto the wd0g partition.
} 
} > 1. A problem with the rewrite attempt in NetBSD
} > 2. A corruption on the PCI transfer
} > 3. An error on the drive
} >    - an incorrect sector write
} >    - a failed reallocation
} 
} What follows is speculation, but I'd bet on a disk error
} first, a NetBSD error second, and a PCI corruption third.
} 
} I would expect (but I've been wrong before ...) that a PCI
} error would show up more often and you'd have to be unlucky to
} hit it precisely at the same time as you had a disk error.
} 
} For the other two causes I suppose it's a toss up: neither the
} disk firmware's error handling code nor NetBSD's error
} handling are as well exercised as the normal working write
} cases.
} 
} My experience with other Unix-like operating systems and disks
} is that such problems are most often disk problems, which is
} why I choose to suspect the disk firmware ahead of NetBSD.
} (Possible bias disclosure: I used to work for an OS vendor,
} not a disk vendor. :-)
} 
} I reiterate that I'm just guessing.  The most similar error I
} have seen on NetBSD was a "freeing free fragment" panic after
} "recovered" disk write errors, but there were differences to
} your case:
} 
} a) the problem disk was from a different manufacturer, and was
}    several years old
} 
} b) disk timeouts and "recovered" errors immediately prior to
}    the panic were a strong hint that the disk was on the way
}    out
} 
} c) "freeing free fragment" panics are usually hardware
}    problems in my experience
} 
} d) I was and am running NetBSD 4.0_BETA2 and not 3.x on the
}    system that panicked, and it's been stable(*) once I
}    replaced the problem disk.
} 
}    (*) OK, the system had a power supply fail a week or two
}    later.  It's conceivable that the two failures were related
}    (I've seen odder combinations) but it's unlikely.
} 
} Were I you I would:
} 
} 1. replace the wd0 disk ASAP if you haven't already!
} 
} 2. carefully watch any disks of similar model/vintage that you
}    have (e.g. with a SMART utility -- disks often fail without
}    warning, but warnings are worth paying attention to)
} 
} 3. see if there are disk firmware updates(+) available
} 
} 4. (if you're really keen and have resources) review the
}    NetBSD code in the write path between your application and
}    the disk to see if you can see a problem.
} 
}    Even if the cause is software I'd not be optimistic that
}    anyone will be able to see it without a reproducible test
}    case, but you might be lucky.
}   
} (+) Disk vendors are exceedingly reticent about what problems
} are fixed in new firmware: even if there is new firmware, it
} may not give you any idea what was changed. :-(
} 
} Good luck?
} 
} Regards,
} 
} Giles
>-- End of excerpt from Giles Lean