Subject: raidframe diagnosis - how to recover from read error w/o disk replacement
To: None <netbsd-users@netbsd.org>
From: Greg Troxel <gdt@ir.bbn.com>
List: netbsd-users
Date: 05/05/2005 22:01:54
I have a pretty vanilla i386 box with two identical IDE disks, running
1.6.2-stable.  I have 8 raid sets (raidframe RAID 1) configured on
them for various filesystems.  I didn't notice until today, but the
logs show that a while ago (umm, April of 2004) there was lossage:

Back in 2004:

Apr 14 15:37:02 watson /netbsd: RAIDFRAME: Configure (RAID Level 1): total number of sectors is 47626880 (23255 MB)
Apr 14 15:37:02 watson /netbsd: RAIDFRAME(RAID Level 1): Using 6 floating recon bufs with no head sep limit
Apr 14 15:37:02 watson /netbsd: boot device: raid0
Apr 14 15:37:02 watson /netbsd: root on raid0a dumps on raid0b

[system runs ok for a while]

Apr 17 03:19:33 watson /netbsd: wd1m: error reading fsbn 39609074 of 39609074-39609075 (wd1 bn 147324962; cn 146155 tn 11 sn 29), retrying
Apr 17 03:19:36 watson /netbsd: wd1: (uncorrectable data error)
Apr 17 03:19:36 watson /netbsd: wd1: soft error (corrected)
Apr 17 03:19:52 watson /netbsd: wd1m: error reading fsbn 39611356 of 39611356-39611357 (wd1 bn 147327244; cn 146157 tn 15 sn 43), retrying
Apr 17 03:20:15 watson /netbsd: wd1: (uncorrectable data error)
Apr 17 03:20:16 watson /netbsd: wd1m: error reading fsbn 39611356 of 39611356-39611357 (wd1 bn 147327244; cn 146157 tn 15 sn 43), retrying
Apr 17 03:20:16 watson /netbsd: wd1: (uncorrectable data error)
Apr 17 03:20:16 watson /netbsd: wd1: transfer error, downgrading to Ultra-DMA mode 2
Apr 17 03:20:16 watson /netbsd: wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
Apr 17 03:20:16 watson /netbsd: wd1(pciide0:0:1): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
Apr 17 03:20:16 watson /netbsd: wd1m: error reading fsbn 39611356 of 39611356-39611357 (wd1 bn 147327244; cn 146157 tn 15 sn 43), retrying
Apr 17 03:20:16 watson /netbsd: wd1: (uncorrectable data error)
Apr 17 03:20:16 watson /netbsd: wd1: transfer error, downgrading to Ultra-DMA mode 1
Apr 17 03:20:16 watson /netbsd: wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA data transfers)
Apr 17 03:20:16 watson /netbsd: wd1(pciide0:0:1): using PIO mode 4, Ultra-DMA mode 1 (using DMA data transfers)
Apr 17 03:20:16 watson /netbsd: wd1m: error reading fsbn 39611356 of 39611356-39611357 (wd1 bn 147327244; cn 146157 tn 15 sn 43), retrying
Apr 17 03:20:16 watson /netbsd: wd1: (uncorrectable data error)
Apr 17 03:20:16 watson /netbsd: wd1m: error reading fsbn 39611356 of 39611356-39611357 (wd1 bn 147327244; cn 146157 tn 15 sn 43), retrying
Apr 17 03:20:16 watson /netbsd: wd1: (uncorrectable data error)
Apr 17 03:20:16 watson /netbsd: wd1m: error reading fsbn 39611356 of 39611356-39611357 (wd1 bn 147327244; cn 146157 tn 15 sn 43)wd1: (uncorrectable data error)
Apr 17 03:20:16 watson /netbsd:
Apr 17 03:20:16 watson /netbsd: raid7: IO Error.  Marking /dev/wd1m as failed.
Apr 17 03:20:16 watson /netbsd: raid7: node (Rmir) returned fail, rolling backward
Apr 17 03:20:16 watson /netbsd: raid7: DAG failure: r addr 0x25c6b9c (39611292) nblk 0x2 (2) buf 0xde010000
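
The sector numbers in the log above can be cross-checked with a little arithmetic. The sketch below is only a sanity check on the logged values. The variable and function names are made up for illustration. Reading the constant 64-sector gap as the space RAIDframe reserves at the front of each component for its label is an assumption, not something the logs state.

```shell
#!/bin/sh
# All numbers are copied from the log lines above; names are illustrative.

# Difference between an absolute disk sector ("wd1 bn") and the
# partition-relative sector ("fsbn") gives the start of wd1m on the disk.
offset() {
    echo $(( $1 - $2 ))
}

first_offset=$(offset 147324962 39609074)    # 03:19:33 error
second_offset=$(offset 147327244 39611356)   # 03:19:52 and later errors

# "r addr 0x25c6b9c" from the DAG failure line, converted to decimal.
raid_addr=$((0x25c6b9c))

# Gap between the partition-relative sector and the RAID address.
# Assumption: this is the area reserved for the component label.
raid_gap=$((39611356 - raid_addr))

# 2048 sectors of 512 bytes make 1 MB; dividing by 2048 avoids the
# overflow that multiplying by 512 could hit with 32-bit arithmetic.
size_mb=$((47626880 / 2048))

echo "wd1m starts at disk sector $first_offset (first error)"
echo "wd1m starts at disk sector $second_offset (second error)"
echo "raid address $raid_addr is $raid_gap sectors below the fsbn"
echo "raid7 size: $size_mb MB"
```

Both errors give the same partition offset, and the RAID address in the DAG failure line corresponds to the same two sectors that wd1 could not read.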

Then on next boot:

Jun 24 15:37:01 watson /netbsd: RAID autoconfigure
Jun 24 15:37:01 watson /netbsd: Configuring raid7:
Jun 24 15:37:01 watson /netbsd: RAIDFRAME: Configure (RAID Level 1): total number of sectors is 47626880 (23255 MB)
Jun 24 15:37:01 watson /netbsd: RAIDFRAME(RAID Level 1): Using 6 floating recon bufs with no head sep limit
Jun 24 15:37:01 watson /netbsd: boot device: raid0
Jun 24 15:37:01 watson /netbsd: root on raid0a dumps on raid0b
[other normal stuff]
Jun 24 15:37:48 watson /netbsd: raid7: Error re-writing parity!

And after that:

Oct 28 11:03:09 watson /netbsd: Configuring raid7:
Oct 28 11:03:09 watson /netbsd: RAIDFRAME: Configure (RAID Level 1): total number of sectors is 47626880 (23255 MB)
Oct 28 11:03:09 watson /netbsd: RAIDFRAME(RAID Level 1): Using 6 floating recon bufs with no head sep limit
Oct 28 11:03:09 watson /netbsd: boot device: raid0
Oct 28 11:03:09 watson /netbsd: root on raid0a dumps on raid0b
[other normal stuff]
Oct 28 11:31:58 watson /netbsd: raid7: Error re-writing parity!


So, it seems that wd1m is marked failed due to long-ago hardware
issues, and if it were not part of a RAID 1 set I should junk/replace
it.  I did dd all of wd1m without trouble, so I am wondering about
rebuilding onto it in place.  I think I would need to:

raidctl -R /dev/wd1m raid7

to cause wd0m to be copied to wd1m, with a new label.
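
A minimal sketch of how that rebuild might be wrapped with a status check before and a progress check after. It is a dry run by default and only prints the commands. The `run` and `rebuild_in_place` helpers and the `RUN` variable are hypothetical names for this example. The `-s` (component status) and `-S` (reconstruction progress) flags are taken from raidctl(8), so check the man page on 1.6.2 before relying on them.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# Set RUN=1 in the environment to execute for real (needs root on the
# affected host). Helper names here are illustrative only.
RAID=raid7
COMPONENT=/dev/wd1m

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

rebuild_in_place() {
    run raidctl -s "$RAID"                # show component status first
    run raidctl -R "$COMPONENT" "$RAID"   # rebuild onto the same component
    run raidctl -S "$RAID"                # report reconstruction progress
}

rebuild_in_place
```

Looking at the status first confirms that /dev/wd1m really is the component marked failed before anything is overwritten.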

[Really I count this as a raid success story, since my server did not fail.]