Subject: Re: Possible problem with raidframe, possible problem with pilot :)
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 12/27/2004 16:51:15
Brian Buhrow writes:
> 	hello.  I've been running raidframe on various versions of NetBSD for
> some time, and I currently have a number of raid1 systems running under
> NetBSD-2.0 branch code.
> 	My typical configuration looks like:
> 
> Raid level: 1
> Components: /dev/wd0e /dev/wd1e
> I also have auto-configure and root-partition turned on, so that the root
> of the system is on the raid set.
> 
> 	Last week, I rebooted a system running this configuration, and it came
> up fine. That night, a script which I use to check the health of the raid
> sets on the systems mailed me that there was trouble.  When I logged in, I
> discovered that, due to circumstances beyond the scope of this problem, the
> primary disk, /dev/wd0, in the original component raid set, wasn't
> recognized, and the disk which had been wd1 was now wd0 on the system.  The
> raidctl -s command showed that component1 was failed, and that /dev/wd0e
> was configured.  So far, so good, or so I thought.
> 	When I restored the original disk (the outage was not a disk failure,
> but an unrelated software failure), the auto-configure mechanism restored
> the raid set, but used the data from the disk which had been off-line for
> 24 hours to populate the / filesystem. 

?????  Hmmmmmm... Do you have dmesg's from all of these reboots?

At this point the original /dev/wd0e should have been marked as 'failed' 
in this set, and /dev/wd1e should have been optimal.  Why it wasn't 
'component0' that was failed in the first place, I have no idea.
What channel a drive is on is completely irrelevant. 

The set should *not* have been in a good state....
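
If the set is still in that state, it would also be worth grabbing the 
component labels directly.  Something like the following (raid0 and the 
wd0e/wd1e names are just taken from your description) will show the 
Row/Column and Mod Counter each disk actually carries:

    raidctl -s raid0
    raidctl -g /dev/wd0e raid0
    raidctl -g /dev/wd1e raid0

The column and mod counter in those labels are what the autoconfiguration 
code actually goes by, not the wd0/wd1 device names.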

> No problem, I thought, since I was in
> single user mode.  I manually failed the /dev/wd0e component, rebooted, and
> figured I'd be fine. 

I don't understand why/how both wd0e and wd1e could have been "optimal", 
or why the system would even let you fail wd0e!

> The system came up, configured its raid set, failing
> the /dev/wd0e component, as I expected.  However, running fsck  on the /
> filesystem, which should have been the contents of /dev/wd1e at this point,
> produced hundreds of inconsistencies, and when all was done, I found that
> the data on the filesystem was not as current as the last time the machine
> had been cleanly shut down.  It wasn't exactly as old as the time that the
> original disk had gone off-line, but somewhere between the time the
> original disk had disappeared, and the time I failed the original disk
> manually.
> 	As I thought about this scenario more, it occurred to me that I should
> have noticed a problem when I received my original notification of trouble.
> When I checked on things at that point in time, the raid showed that
> /dev/wd0e was optimal and that component1 was failed.  Shouldn't it have
> shown component0 as failed, and /dev/wd0e as optimal, assuming it used the
> device name assigned by the kernel at boot time, rather than the device name
> which was assigned when the raid set was originally configured?  

Yes.  Device name at configuration time is irrelevant to the auto 
configuration stuff.  Maybe the component label for wd0e got stuffed onto 
wd1e at some point, and then wd1e became component0 instead of component1???
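
To illustrate the "device name is irrelevant" part (a sketch, not 
necessarily your exact parameters): in a config file for a mirror like 
yours, the device names only appear in the 'START disks' section that is 
read at initial configuration time:

    START array
    # numRow numCol numSpare
    1 2 0

    START disks
    /dev/wd0e
    /dev/wd1e

    START layout
    # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
    128 1 1 1

    START queue
    fifo 100

Once 'raidctl -A root raid0' (or '-A yes') has written the component 
labels, it's the row/column stored in each label, not whatever wdN name 
the disk happens to have at boot, that decides which slot it configures 
into.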

> Also, when
> the disk came back on line, complete with its original component label,
> shouldn't raidframe have ignored it because its modification time was less
> than that of the other component of the raid1 system?

It will recognize that it's part of the set because everything other 
than the modification count would be the same.  Now: if disk A fails, 
then disk B's mod counter will end up greater than A's.  The only way 
you can get B's counter higher than A's is if you a) change it by hand, 
or b) boot with B connected but not A. 
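
To make that concrete with made-up numbers: say both labels sit at mod 
counter 100 while the set is healthy.  Then:

    t0: wd0e label = 100, wd1e label = 100   (both optimal)
    t1: wd0 disappears; the set keeps running degraded on the survivor
    t2: the survivor's label gets bumped as it is updated, say to 130
    t3: the old disk returns, still carrying 100

At the next autoconfigure the 130 copy should win and the 100 copy should 
come up as failed.  What you saw suggests that comparison either went the 
other way or was applied to the wrong labels.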

> 	It looks as though raidframe might not be paying attention to column
> numbers and component numbers when raid1 is configured on a raid set.

It's certainly supposed to!!!  In fact, that's the primary thing it 
cares about!!

>  Do
> we have the classic "quorum" problem with mirrored disks, in that
> raidframe sees two disks, each with a different modification number,
> but doesn't know which one is the most current because there's not a
> majority of disks which agree on the mod time?

No.  In this case, it goes with the disk with the higher mod counter.

> 	Is my problem that the device moved from /dev/wd1e to /dev/wd0e, or is
> my problem that raidframe didn't know which modification time was the most
> recent? 

I'm not sure... I'd like to see some dmesg output from all those 
various reboots.... 
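
(dmesg captured right after each of those boots, plus the RAID-related 
lines from the logs, e.g. 'grep -i raid /var/log/messages*', should be 
enough to reconstruct the sequence.)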

> Also, why did component1 show as being failed when it was really
> component0 which had failed?  

This was after a reboot too, right?  The component label says what 
position a given component is in, and that's the position it gets 
placed in when the set is configured.

> Are components numbered from 0 or 1?

They are numbered from 0.  
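
That's also why the status looked the way it did: when an autoconfigured 
component is missing at configuration time there's no device name to 
print, so the status output falls back to the generic slot name.  
Roughly (from memory of the format):

    Components:
               /dev/wd0e: optimal
              component1: failed
    No spares.

The '1' is the column number out of the label, so "component1" means 
"whatever disk the labels say belongs in column 1".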

> 	I'm not sure if this was some sort of pilot error on my part, or if
> raidframe didn't do what I expected.  I've used raidframe with raid5 sets
> for years, including many failed disks, without problems, so I'm fairly
> certain I performed reasonable restoration steps, but perhaps raid1 is a
> special case which I'm not aware of?

RAID 1 is a bit trickier, but the auto configuration bits are 
supposed to be able to deal with these sorts of things.... 
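
FWIW, once the stale disk is back and marked as failed, the usual way to 
fold it back into the mirror is a rebuild-in-place, something along these 
lines (names again taken from your mail):

    raidctl -R /dev/wd0e raid0    # reconstruct onto it from the good disk
    raidctl -S raid0              # watch the reconstruction progress
    raidctl -s raid0              # both components should end up optimal

That way the copy with the higher mod counter is the source, and the 
returning component gets a fresh label written to it when the rebuild 
finishes.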
 
> Any light would be extremely helpful.

/var/log/messages* may yield some answers.  It smells like wd1 became 
wd0, and didn't tell anyone about it, and RAIDframe maybe wrote the 
wrong label to the wrong place????  But that doesn't make any sense 
either, because if two components are configured, and both have the 
same "position" in the array, only the one with the higher 
modification count will be configured.... 

This is really bizarre... 

Later...

Greg Oster