Subject: Re: Disk drivers and data errors
To: Stephen Borrill <netbsd@precedence.co.uk>
From: Greg Troxel <gdt@ir.bbn.com>
List: netbsd-help
Date: 07/26/2005 08:24:32

  wd0e: error reading fsbn 332551232 of 332551232-332551295  (wd0 bn 336682016; cn 33 4009 tn 14 sn 62), retrying
  wd0: (uncorrectable data error)

That looks like a genuine error, where the disk reported an
uncorrectable ECC error on read.  I'd be astounded if a buggy
controller synthesized such an error.  Usually controller or cable
problems look something like "CRC error - downgrading to mode X" or
command timeouts.

  server smartd[295]: Device: /dev/wd1d, 110 Currently unreadable (pending) sectors
  server smartd[295]: Device: /dev/wd1d, 110 Offline uncorrectable sectors

This is the drive's own firmware reporting, which again I don't
believe points to a controller issue.  This drive appears to simply be
developing bad sectors.

I'd do

  dd if=/dev/rwd1d of=/dev/null bs=256k

note the errors, and then take the drive and put it in another machine
(one where you aren't suspicious of the controller) and do the dd
again.  I suspect you'll see the same bad blocks.
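
To make the two runs easy to compare, something like the following
could pull the absolute block numbers out of the kernel messages.
This is only a sketch: it assumes the messages look like the wd0e line
quoted above, and the function name bad_blocks is made up.

```shell
#!/bin/sh
# Hypothetical helper: read kernel messages on stdin and print the
# absolute block number from each "(wdN bn NNN; ...)" read-error line,
# sorted numerically with duplicates removed.
bad_blocks() {
    sed -n 's/.*(wd[0-9]* bn \([0-9][0-9]*\);.*/\1/p' | sort -n | uniq
}
```

For example, "dmesg | bad_blocks > blocks.before" on the first machine
and the same into another file on the second, then diff the two.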

The bad blocks should get remapped, but the drive only does this on
write, so you don't silently lose data.  What you need to do is find
and rewrite just the bad blocks, or, if the disk is a RAIDframe
component, perhaps the whole disk.
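
Rewriting one block could look roughly like this.  It is a sketch
under assumptions: 512-byte sectors, and a block number that is
relative to the whole-disk raw device (the "wd0 bn" value in the
kernel message, not the fsbn).  The function name is made up, and it
only prints the command so you can check it first; running the command
destroys whatever was in that sector.

```shell
#!/bin/sh
# Hypothetical helper: print (do NOT run) a dd command that overwrites
# a single sector with zeros so the drive gets a chance to remap it.
# Assumes 512-byte sectors and a block number relative to the raw
# whole-disk device given as the first argument.
rewrite_cmd() {
    dev=$1
    bn=$2
    echo "dd if=/dev/zero of=$dev bs=512 seek=$bn count=1"
}
```

So "rewrite_cmd /dev/rwd1d 336682016" prints the dd line for that
block; read it over before pasting it into a root shell.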

  raid0: initiating in-place reconstruction on column 0
  raid0: Recon write failed!
  panic: raidframe error at line 880 file
  /usr/work/netmanager/netbsd/usr/src/sys/arch/i386/compile/NETMANRAID/../../../../dev/raidframe/rf_reconstruct.c

This is arguably a weakness in raidframe; failure to write a block
should cause reconstruction to fail but I don't see a need to panic.

But, if writes fail, the disk is really broken, and needs replacing.
I'd look for the kernel messages just before 'Recon write failed!'.
You could try dd'ing /dev/zero onto the disk before reconstructing
(avoiding the component labels or putting them back - this is of
course nontrivial).
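
A sketch of the "avoid the component label" variant follows.  The big
assumption is that RAIDframe keeps its label within the first 64
sectors of the component; I have not checked that against your source
tree, so verify it before trusting the default.  The function name and
the /dev/rwd1e component in the usage note are made up, and it only
prints the command.

```shell
#!/bin/sh
# Hypothetical helper: print (do NOT run) a dd command that zeroes a
# component partition while leaving its first $skip sectors untouched.
# skip defaults to 64, which is an ASSUMPTION about the area RAIDframe
# reserves for the component label; 512-byte sectors are also assumed.
# Using one block of skip*512 bytes and seek=1 skips exactly that area.
zero_cmd() {
    comp=$1
    skip=${2:-64}
    echo "dd if=/dev/zero of=$comp bs=$((skip * 512)) seek=1"
}
```

"zero_cmd /dev/rwd1e" would print a dd that starts writing 32k into
the partition.  Any write error dd reports while it runs is the thing
to look at.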

  These errors aren't at random addresses (i.e. they are consistent per
  machine), but they differ from machine to machine (i.e. it's not some
  odd address-related fault). We've also seen address mark not found
  errors.

Again, this really points to disks going bad.  IMHO disk drive
reliability has been going downhill over the last 5-10 years.  We had
a bad batch of 40G Maxtor disks in late-2000 Dell desktops.  Pretty
much all of them have failed, and we've been replacing the rest even
if they haven't failed yet, because failures are too painful.  Since
then, I've been buying only Seagate drives and doing ok.


-- 
        Greg Troxel <gdt@ir.bbn.com>