current-users: raid: failed device used after reboot

Subject: raid: failed device used after reboot
To: None <current-users@netbsd.org>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: current-users
Date: 05/22/2000 19:16:17

Hi,
I'm playing with an array of disks and raidframe, experimenting with various
failure types. Here's what I've just run into:
I have a raid1 spread across drives in different enclosures, in such a way that
I can power down one enclosure without losing the raid.
I got into trouble with the following scenario:
- start writing to the filesystem: dd if=/dev/zero of=file bs=64k
- power down one of the enclosures. raidframe marks the corresponding devices
  as failed and continues running. dd doesn't stop.
- power the enclosure back on.
- reboot.

When the machine reboots, raidframe finds all disks with status 'optimal' and
parity 'dirty', so it starts rewriting parity. Unfortunately some of the failed
disks were masters, so data is read from them instead of from the slaves, which
have the accurate data. For me this resulted in an unclean filesystem,
which had to be fixed with 'fsck -y' (there was an unallocated inode in a
directory).
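
To make the failure mode concrete, here is a toy model of a two-component
mirror. It is only a sketch and not raidframe code: the Mirror class, its
method names, and the assumption that the parity rewrite copies master to
slave are all hypothetical, chosen to mirror the behaviour described above.

```python
# Toy model of a two-component mirror (hypothetical, not raidframe code).
class Mirror:
    def __init__(self, nblocks):
        self.master = [0] * nblocks
        self.slave = [0] * nblocks
        self.failed = set()  # kept in memory only, so it is lost on reboot

    def write(self, block, value):
        # Writes only reach components that are not marked as failed.
        if "master" not in self.failed:
            self.master[block] = value
        if "slave" not in self.failed:
            self.slave[block] = value

    def reboot(self):
        # The failure was never recorded on disk, so it is forgotten here.
        self.failed = set()

    def rewrite_parity(self):
        # Assumption: the master is treated as the source of truth.
        self.slave = list(self.master)


m = Mirror(4)
m.write(0, 1)              # both components are up to date
m.failed.add("master")     # enclosure holding the master is powered down
m.write(1, 2)              # these writes only reach the slave
m.write(2, 3)
m.reboot()                 # enclosure is back, all components look optimal
m.rewrite_parity()         # stale master overwrites the good slave
print(m.slave)             # prints [1, 0, 0, 0]
```

The writes to blocks 1 and 2 are lost because the only copy that held them
is overwritten from the stale component.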

In this scenario raidframe should record somewhere that the disks have failed,
and not reuse them. Maybe something based on the mod ref counter would work (or
did I miss what the mod ref counter is for?).
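
A rough sketch of the counter idea, under stated assumptions: each component
carries a label with a modification counter, the counter is bumped on the
surviving components when another component fails, and the failed component
keeps its old value because it can no longer be written. The Label type, its
fields, and stale_components() are hypothetical names for illustration, not
raidframe's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Label:
    name: str          # hypothetical component name, e.g. a partition
    mod_counter: int   # assumed to be bumped only on reachable components


def stale_components(labels):
    """Return names of components whose counter lags behind the newest one."""
    newest = max(label.mod_counter for label in labels)
    return [label.name for label in labels if label.mod_counter < newest]


# The master missed the counter bump that happened when it was marked failed.
labels = [Label("master", 7), Label("slave", 8)]
print(stale_components(labels))  # prints ['master']
```

At configuration time, components reported as stale would be kept in the
failed state and rebuilt from the others instead of being read from.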

--
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
--