current-users: Re: raid: failed device used after reboot

Subject: Re: raid: failed device used after reboot
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 05/22/2000 18:44:36

Manuel Bouyer writes:
> Hi,
> I'm playing with an array of disks and raidframe, experimenting with various
> failure types. Here's what I've just run into:
> I have a raid1 spread across drives in different enclosures, in such a way
> that I can power down one enclosure without losing the raid.
> I got into trouble with the following scenario:
> - start writing to the filesystem: dd if=/dev/zero of=file bs=64k
> - power down one of the enclosures. raidframe marks the corresponding devices
>   as failed and continues running. dd doesn't stop.
> - power the enclosure back up.
> - reboot.
> 
> When the machine reboots, raidframe finds all disks with status 'optimal' and
> parity 'dirty' so it starts rewriting parity. Unfortunately some of the failed
> disks were masters, so data is read from them instead of from the slave, which
> has the accurate data. This resulted for me in an unclean filesystem,
> which had to be fixed with a 'fsck -y' (there was an unallocated inode in a
> directory).

What vintage of NetBSD are you running (1.4.2, -current, what date?)?

> In this scenario raidframe should record elsewhere which disks have failed,
> and not reuse them. Maybe something based on the mod ref counter would work
> (or did I miss what the mod ref counter is for?).

The modification counters are supposed to handle this, but there were some 
problems with this a while back...  If you're running a recent -current, 
you should see the master marked as 'failed' even after it comes back up.
Can you send me the relevant chunk of 'dmesg' or /var/log/messages for when it 
boots, and/or does device autodetection?  There could be a bug in there, but I 
don't have enough info yet...
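
To illustrate the idea behind the modification counters, here is a rough 
sketch. It is not the RAIDframe code: the names Component, mod_counter and 
stale_components are made up for the example, and it assumes the counter is 
only advanced on components that are still part of the running array. Under 
that assumption, a component that was powered off ends up with a lower 
counter than its mirror and can be treated as failed at configuration time:

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str         # hypothetical device name
    mod_counter: int  # hypothetical stand-in for the label's modification counter


def stale_components(components):
    """Return the names of components whose counter lags behind the newest one."""
    newest = max(c.mod_counter for c in components)
    return [c.name for c in components if c.mod_counter < newest]


# Assumed situation: the master missed updates while its enclosure was off,
# so its counter stayed behind the slave's.
mirror = [Component("master", 41), Component("slave", 57)]
print(stale_components(mirror))
```

With counters like these, only the slave would be trusted as a data source 
after the reboot, and the master would need a reconstruct before being used 
again.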

Later...

Greg Oster