netbsd-help: Re: RaidFrame - Failed Partition on one disk

Subject: Re: RaidFrame - Failed Partition on one disk
To: Chris Cameron <chris@onemind.com>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-help
Date: 07/23/2003 15:29:08

"Chris Cameron" writes:
> Hi,
> 
> I have a Raid1 setup on NetBSD 1.6 which reported to me that I had a failed
> component today.
> 
> I don't think that the disk has failed though, as another partition on that
> same disk is still functioning fine. Is there a way to rebuild the bad
> component on the failed raid partition?

You can do that with:
 
 raidctl -R /dev/wd1a raid0
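
Before starting the rebuild it is worth confirming which component is 
actually marked as failed.  A rough sketch, assuming the status layout 
shown in the output quoted below (the helper name failed_components 
is made up for illustration and is not part of raidctl):

```shell
# Print every component that "raidctl -s" reports as failed.
# Reads the status text on stdin, so it can be tried on saved output first.
failed_components() {
    awk '$1 ~ /^\/dev\// && $2 == "failed" { sub(/:$/, "", $1); print $1 }'
}

# Hypothetical usage (needs root and a configured raid0):
#   comp=$(raidctl -s raid0 | failed_components)
#   raidctl -R "$comp" raid0    # rebuild in place onto the failed component
```

The usage lines assume exactly one component has failed; with more 
than one, pick the component by hand.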

> (I will be verifying that the disk
> is in good condition, but I need to do that outside of office hours).
> 
> Below are the results from raidctl -s raid0 and raid2 (the 2 raid partitions
> I have)
> 
> server# raidctl -s raid0
> Components:
>            /dev/wd0a: optimal
>            /dev/wd1a: failed
> No spares.
> Component label for /dev/wd0a:
>    Row: 0, Column: 0, Num Rows: 1, Num Columns: 2
>    Version: 2, Serial Number: 20021100, Mod Counter: 341
>    Clean: No, Status: 0
>    sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
>    Queue size: 100, blocksize: 512, numBlocks: 1088512
>    RAID Level: 1
>    Autoconfig: Yes
>    Root partition: Yes
>    Last configured as: raid0
> /dev/wd1a status is: failed.  Skipping label.
> Parity status: DIRTY
> Reconstruction is 100% complete.
> Parity Re-write is 100% complete.
> Copyback is 100% complete.
> 
> server# raidctl -s raid2
> Components:
>            /dev/wd0e: optimal
>            /dev/wd1e: optimal
> No spares.
> Component label for /dev/wd0e:
>    Row: 0, Column: 0, Num Rows: 1, Num Columns: 2
>    Version: 2, Serial Number: 20021102, Mod Counter: 202
>    Clean: No, Status: 0
>    sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
>    Queue size: 100, blocksize: 512, numBlocks: 74970112
>    RAID Level: 1
>    Autoconfig: Yes
>    Root partition: No
>    Last configured as: raid2
> Component label for /dev/wd1e:
>    Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
>    Version: 2, Serial Number: 20021102, Mod Counter: 202
>    Clean: No, Status: 0
>    sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
>    Queue size: 100, blocksize: 512, numBlocks: 74970112
>    RAID Level: 1
>    Autoconfig: Yes
>    Root partition: No
>    Last configured as: raid2
> Parity status: clean
> Reconstruction is 100% complete.
> Parity Re-write is 100% complete.
> Copyback is 100% complete.
> 
> Am I correct in thinking that perhaps /dev/wd1a has been corrupted in some
> manner and just needs to be rebuilt, since /dev/wd1e is still in optimal
> state? 

My guess is that wd1 has a physical error that is showing up in the 
'a' partition of the disk.  Partition 'e' doesn't have any errors 
(yet), but I wouldn't use that as a good reason for thinking that 'a' 
will be fine after a rebuild.  Your first step is to check 
/var/log/messages and try to find out what actually happened to 
wd1 to cause the error.
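
One way to pull the relevant lines out of the log is sketched here.  
The function name disk_log_lines is hypothetical, and the pattern 
only assumes that the kernel messages mention the disk or one of its 
partitions by name (wd1, wd1a, ...); the exact message wording will 
vary:

```shell
# Show log lines that mention a given disk (e.g. wd1) or its partitions.
# The log path defaults to /var/log/messages; pass another file to override.
disk_log_lines() {
    disk=$1
    log=${2:-/var/log/messages}
    # Match "wd1" or "wd1a".."wd1p" as a whole word, so wd10 is not matched.
    grep -E "(^|[^[:alnum:]])${disk}[a-p]?([^[:alnum:]]|\$)" "$log"
}

# Hypothetical usage:
#   disk_log_lines wd1 | tail -n 50
```

Older entries may have been rotated out of the live log file, so the 
rotated copies are worth a look too if nothing turns up.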

> If so, how would I do such a thing?

See above... If, in fact, wd1a got marked as "failed" because of a 
read error, you may be able to successfully do a reconstruct to the 
same disk, provided the drive re-maps around the bad block(s).  But 
you'll want to make sure you know why the disk failed, and then 
decide whether you need a replacement.
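
Once a reconstruct has been started, "raidctl -S raid0" reports its 
progress, and "raidctl -s raid0" should afterwards list both 
components as optimal.  A minimal sketch of that last check follows; 
raid_is_healthy is a made-up helper name, and it assumes the status 
layout shown in the output quoted above:

```shell
# Succeed only if "raidctl -s" output (read from stdin) lists every
# component as optimal and reports clean parity.
raid_is_healthy() {
    awk '
        $1 ~ /^\/dev\// && $1 ~ /:$/ && $2 != "optimal" { bad = 1 }
        /^Parity status:/ && $3 != "clean"              { bad = 1 }
        /^Parity status:/                               { seen = 1 }
        END { exit (bad || !seen) ? 1 : 0 }
    '
}

# Hypothetical usage (needs root):
#   raidctl -S raid0      # progress of the reconstruction
#   raidctl -s raid0 | raid_is_healthy && echo "raid0 looks healthy"
```

If the parity is still reported as DIRTY after the rebuild, 
"raidctl -P raid0" will check it and re-write it if needed.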

Later...

Greg Oster