Subject: Re: Soft error on disk write corrupted drive
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Steven M. Bellovin <smb@cs.columbia.edu>
List: port-i386
Date: 08/30/2007 11:53:19
On Thu, 30 Aug 2007 08:42:32 -0700
buhrow@lothlorien.nfbcal.org (Brian Buhrow) wrote:

> 	Hello.  Having said all that, I'm inclined to agree with
> Giles that the most likely culprit is the disk itself.  I've seen
> errors following this code path in NetBSD for a number of years, and
> in a variety of situations, and if the error was corrected, the data
> always got to the right sectors. The NetBSD code, and most other OS's
> that I've worked with, simply requeues the write request, and tries
> again, possibly after a hardware reset command.  In this case, it
> sounds like the drive is taking the second pass, reporting success,
> and actually not doing what it promised.  

What are the block numbers where the data was actually written?  Is
there any obvious relationship between the relative block number on the
intended partition and the relative block number on some other
partition?  For that matter, the error message reported the problem on
the first block of a multi-block write, with the low-order 2 bits of
the block number equal to zero.  Does that hold true for the actual
block written, relative to some other partition?  (It is *not* true for
the disk itself, per the error message.)
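
A rough sketch of the check I have in mind, in Python.  The partition
names, start offsets, sizes, and block numbers below are all made up for
illustration; the real ones would come from the disklabel and from
wherever the stray data turned up.  It also assumes the partitions do
not overlap, so the whole-disk partition should be left out of the table.

```python
# Hypothetical values throughout: substitute the real partition table
# and the real intended/actual absolute block numbers.

def find_partition(abs_block, partitions):
    """Return (name, start) of the partition holding abs_block, or None."""
    for name, (start, size) in sorted(partitions.items()):
        if start <= abs_block < start + size:
            return name, start
    return None

def is_aligned(block, bits=2):
    """True if the low-order `bits` bits of block are all zero."""
    return block & ((1 << bits) - 1) == 0

def compare_offsets(intended_abs, actual_abs, partitions):
    """Compare partition-relative block numbers of the two locations."""
    intended = find_partition(intended_abs, partitions)
    actual = find_partition(actual_abs, partitions)
    if intended is None or actual is None:
        return None
    intended_rel = intended_abs - intended[1]
    actual_rel = actual_abs - actual[1]
    return {
        "intended": (intended[0], intended_rel),
        "actual": (actual[0], actual_rel),
        "same_relative_block": intended_rel == actual_rel,
        "actual_aligned_in_partition": is_aligned(actual_rel),
        "actual_aligned_on_disk": is_aligned(actual_abs),
    }

# name -> (start sector, size in sectors), invented for the example
partitions = {"a": (63, 1000), "e": (1063, 5000)}
report = compare_offsets(463, 1463, partitions)
```

If "same_relative_block" comes back true for some pair of partitions,
that would suggest the right relative offset was applied to the wrong
partition base, rather than the drive scribbling somewhere arbitrary.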



		--Steve Bellovin, http://www.cs.columbia.edu/~smb