Subject: Re: Finding lost components in a raid set?
To: Greg Oster <oster@cs.usask.ca>
From: Johan Ihren <johani@autonomica.se>
List: current-users
Date: 02/07/2002 22:28:21
Greg Oster <oster@cs.usask.ca> writes:
Hi Greg,
> > Basically it seems that everything has worked very nicely and I'm
> > really pleased.
Unfortunately I spoke too soon.
My reconstruction of the "failed" component with "raidctl -R
/dev/wd2g raid1" just finished, and raidctl -s now reports happiness.
But the disklabel for raid1 is zeroed.
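To be concrete, what I did and saw boils down to this (paraphrased
from memory):

bash# raidctl -R /dev/wd2g raid1
[... reconstruction runs to completion ...]
bash# raidctl -s raid1
[... all components reported optimal ...]
bash# disklabel raid1
[... the label comes back zeroed ...]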
How can that happen? I had two of the three components intact at all
times for a 2+1 RAID5 device, so I see no reason the label should
have been lost.
I have to admit that I did *not* keep a copy of that label in a safe
place, which in retrospect seems rather stupid. But I regarded the
underlying device as "safe", in the sense that any event that managed
to wipe out the label would have wiped out the file system data as
well.
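(For the record, the cheap insurance would presumably have been
something along these lines, assuming I have the disklabel(8)
restore syntax right:

bash# disklabel raid1 > /somewhere/safe/raid1.label
...
bash# disklabel -R raid1 /somewhere/safe/raid1.label

i.e. save the label as a proto file while things are healthy, and
feed it back with -R if it ever gets lost.)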
I also have to admit that I am less happy than 30 minutes ago ;-(
Losing the disklabel after what should be a routine replacement of a
failed component is not encouraging. But I really don't see where I
did anything wrong.
Sigh.
> > But, since I have no indication whatsoever that the disks should have
> > failed in any way, I cannot help being a bit curious about *both* raid
> > sets losing a component on the *same* disk without the disk being bad.
>
> I don't suppose you have a copy of the disklabel for that disk, do
> you? The autoconfig should have picked it up wd2a if it had a
> disklabel type of FS_RAID (unless the component label got
> corrupted). 'raidctl -c' should have found a valid component label
> for wd2g -- the fact that it's marked as failed indicates that it
> likely didn't.
I have the label for wd2 (attached), but I don't have the label for
raid1, as I said above.
> > I assume that the autoconfigured raid devices keep their configs in
> > the component labels
>
> The configuration info is in the component labels, yes.
>
> > or thereabouts and I don't understand how both of
> > them can have been trashed at the same time.
>
> I don't get that either... Was wd2 on the controller card that
> failed? Maybe it managed to zero the component labels on you??
> Seems unlikely, but that's the only explanation I can think of so
> far...
Yes, wd2 was on the controller that failed.
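If the controller really did scribble zeroes somewhere, I suppose I
could have looked for the damage directly before reconstructing,
e.g. (assuming the component label lives near the front of the
partition, which I'm not actually sure of, see my question below):

bash# dd if=/dev/rwd2g bs=512 count=32 2>/dev/null | hexdump -C | head

If that region had been all zeroes on wd2g but not on the
corresponding partitions of wd0 and wd1, that would have pointed the
finger at the controller.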
> > And if the on-disk
> > configs were "bad", why didn't a "raidctl -c" fix that, if the media
> > is ok?
>
> "raidctl -c" uses the component labels to verify that you have the
> components listed in the right order... If a component label is
> missing (or badly out-of-sync with the remaining components), then
> that component will be marked as failed. If you managed to get
> components in the wrong order, the RAID set shouldn't even
> configure. (It tries quite hard to make sure you don't mess up the
> ordering, but it needs valid component labels in order to do
> that. :) )
I assume that this "component label" is stored on the component it
describes, i.e. that the component label for /dev/wd2g is located at
the head of that physical partition?
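If so, I suppose I can inspect it directly. If I'm reading
raidctl(8) correctly, something like:

bash# raidctl -g /dev/wd2g raid1

should print the component label for that component, which would at
least show whether it is sane now that the reconstruction has
finished.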
Here's the disklabel for wd2. It is definitely intact, since it is
exactly the same as the labels on wd0 and wd1 (the output below is
from wd0, but all three are identical):
bash# disklabel wd0
# /dev/rwd0d:
type: ESDI
disk: WDC WD1000BB-00C
label: fictitious
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 16
sectors/cylinder: 1008
cylinders: 16383
total sectors: 195371568
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0 # microseconds
track-to-track seek: 0 # microseconds
drivedata: 0
8 partitions:
#        size    offset     fstype  [fsize bsize cpg/sgs]
 a:    524288        63       RAID                        # (Cyl.      0*-    520*)
 b:    524288    524351       swap                        # (Cyl.    520*-   1040*)
 c: 195371505        63     unused      0     0           # (Cyl.      0*- 193820)
 d: 195371568         0     unused      0     0           # (Cyl.      0 - 193820)
 e:   8388608   1048639       RAID                        # (Cyl.   1040*-   9362*)
 f:    524288   9437247       RAID                        # (Cyl.   9362*-   9882*)
 g: 100000000   9961535       RAID                        # (Cyl.   9882*- 109088*)
 h:  85410033 109961535       RAID                        # (Cyl. 109088*- 193820)
Johan