Subject: Possible problem with raidframe, possible problem with pilot :)
To: None <current-users@netbsd.org>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: current-users
Date: 12/26/2004 23:18:51
	hello.  I've been running raidframe on various versions of NetBSD for
some time, and I currently have a number of raid1 systems running under
NetBSD-2.0 branch code.
	My typical configuration looks like:

Raid level: 1
Components: /dev/wd0e /dev/wd1e
I also have auto-configure and root-partition turned on, so that the root
of the system is on the raid set.
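
In raid0.conf terms, that's roughly the following (the layout numbers here
are from memory and may differ slightly from what I actually use;
auto-configure and root are then enabled with "raidctl -A root raid0"):

	START array
	# numRow numCol numSpare
	1 2 0

	START disks
	/dev/wd0e
	/dev/wd1e

	START layout
	# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
	128 1 1 1

	START queue
	fifo 100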

	Last week, I rebooted a system running this configuration, and it came
up fine. That night, a script which I use to check the health of the raid
sets on the systems mailed me that there was trouble.  When I logged in, I
discovered that, due to circumstances beyond the scope of this problem, the
primary disk, /dev/wd0, in the original component raid set, wasn't
recognized, and the disk which had been wd1 was now wd0 on the system.  The
raidctl -s command showed that component1 was failed, and that /dev/wd0e
was configured.  So far, so good, or so I thought.
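	To illustrate, the status output at that point looked something like
this (reconstructed from memory, not a verbatim capture):

	raid0 Components:
	           /dev/wd0e: optimal
	          component1: failed
	No spares.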
	When I restored the original disk (the outage had been caused by an
unrelated software failure, not a disk failure), the auto-configure
mechanism restored the raid set, but used the data from the disk which had
been off-line for 24 hours to populate the / filesystem.  No problem, I
thought, since I was in single user mode.  I manually failed the /dev/wd0e
component, rebooted, and
figured I'd be fine.  The system came up and configured its raid set,
failing the /dev/wd0e component as I expected.  However, running fsck on
the / filesystem, which at this point should have held the contents of
/dev/wd1e, produced hundreds of inconsistencies, and when all was done, I
found that the data on the filesystem was not as current as it had been
the last time the machine was cleanly shut down.  It wasn't exactly as old
as the point when the original disk had gone off-line, either; it dated
from somewhere between the time the original disk disappeared and the time
I failed it manually.
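	(For the record, by "manually failed" I mean the usual raidctl
invocation, roughly this, with raid0 being the autoconfigured set:

	raidctl -f /dev/wd0e raid0	# mark the component as failed
	raidctl -s raid0		# verify the component status

followed by the reboot.)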
	As I thought about this scenario more, it occurred to me that I should
have noticed a problem when I received my original notification of trouble.
When I checked on things at that point, the raid status showed that
/dev/wd0e was optimal and that component1 was failed.  Shouldn't it have
shown component0 as failed, and /dev/wd0e as optimal, assuming it used the
device name assigned by the kernel at boot time, rather than the device name
which was assigned when the raid set was originally configured?  Also, when
the disk came back on line, complete with its original component label,
shouldn't raidframe have ignored it because its modification time was less
than that of the other component of the raid1 system?
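	(As I understand it, the component labels, including the mod counter
raidframe compares at autoconfigure time, can be examined with raidctl -g:

	raidctl -g /dev/wd0e raid0	# print the component label for /dev/wd0e
	raidctl -g /dev/wd1e raid0	# and for /dev/wd1e

so it should be possible to see which counter is newer.)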
	It looks as though raidframe might not be paying attention to column
numbers and component numbers when raid1 is configured on a raid set.  Do
we have the classic "quorum" problem with mirrored disks, inasmuch as
raidframe sees two disks, each with a different modification counter, but
doesn't know which one is the most current because there's no majority of
disks agreeing on the mod time?
	Is my problem that the device moved from /dev/wd1e to /dev/wd0e, or is
my problem that raidframe didn't know which modification time was the most
recent? Also, why did component1 show as being failed when it was really
component0 which had failed?  Are components numbered from 0 or 1?
	I'm not sure if this was some sort of pilot error on my part, or if
raidframe didn't do what I expected.  I've used raidframe with raid5 sets
for years, through many failed disks, without problems, so I'm fairly
certain I performed reasonable restoration steps, but perhaps raid1 is a
special case which I'm not aware of?

Any light anyone can shed would be extremely helpful.
-thanks
-Brian