Subject: Re: raidframe diagnosis - how to recover from read error w/o disk replacement
To: Greg Troxel <gdt@ir.bbn.com>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-users
Date: 05/05/2005 21:29:24
Greg Troxel writes:
> I have a pretty vanilla i386 box with two identical IDE disks, running
> 1.6.2-stable.  I have 8 raid sets (raidframe RAID 1) configured on
> them for various filesystems.  I didn't notice until today, but the
> logs show that a while ago (umm, April of 2004) there was lossage:
[snip]
> Oct 28 11:03:09 watson /netbsd: Configuring raid7:
> Oct 28 11:03:09 watson /netbsd: RAIDFRAME: Configure (RAID Level 1): total number of sectors is 47626880 (23255 MB)
> Oct 28 11:03:09 watson /netbsd: RAIDFRAME(RAID Level 1): Using 6 floating recon bufs with no head sep limit
> Oct 28 11:03:09 watson /netbsd: boot device: raid0
> Oct 28 11:03:09 watson /netbsd: root on raid0a dumps on raid0b
> [other normal stuff]
> Oct 28 11:31:58 watson /netbsd: raid7: Error re-writing parity!
> 
> 
> So, it seems that wd1m is marked failed due to long ago hardware
> issues, and if it were not part of a raid 1 set I should junk/replace
> it.  I did dd all of wd1m without trouble, so I am wondering about
> reconfiguring it.  I think I would need to:
> 
> raidctl -R /dev/wd1m raid7
> 
> to cause wd0m to be copied to wd1m, with a new label.

Correct.  

And assuming there is really nothing wrong with /dev/wd1m right now, 
if this succeeds, all will be well.  However: I'm pretty sure 1.6.2 
will be quite unhappy in the face of an error during reconstruction... 
I think both 1.6.2 and 2.0 will end up panicking if they hit a write 
error while doing the above reconstruction :(
(That problem has been fixed for NetBSD 3.0.)

So in this particular case, I'd recommend doing the "raidctl -R" from 
single-user mode, where a panic would be less of an issue.
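
For reference, something along these lines is roughly what I'd expect 
the whole procedure to look like (the -s, -S, -p, and -P steps are 
just optional status and parity checks around the reconstruct):

   # check current component status; wd1m should show up as "failed"
   raidctl -s raid7

   # fail wd1m in place and reconstruct onto it from the good component
   raidctl -R /dev/wd1m raid7

   # watch reconstruction progress
   raidctl -S raid7

   # once reconstruction finishes, check (and if need be rewrite) parity
   raidctl -p raid7
   raidctl -P raid7

If the drive really is healthy, the reconstruction should run to 
completion and wd1m will come back as "optimal".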

> [Really I count this as a raid success story, since my server did not fail.]

Ya.. it's just annoying when it works so well that you don't notice 
the problem :-}  (been there, done that...)  I believe the 'daily' 
scripts (at least for 3.0) now check for failed RAID devices, so the 
operator gets at least some indication that something has gone 
wrong...
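
On an older system, something crude along these lines (untested, and 
it just greps the raidctl status output for the word "failed") run 
from cron would at least get the failure in front of you:

   #!/bin/sh
   # warn about any configured RAID set with a failed component
   for r in raid0 raid1 raid2 raid3 raid4 raid5 raid6 raid7; do
           raidctl -s ${r} 2>/dev/null | grep -i failed >/dev/null && \
               echo "WARNING: ${r} has a failed component"
   done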

Later...

Greg Oster