Subject: Re: raid: failed device used after reboot
To: None <current-users@netbsd.org>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: current-users
Date: 05/22/2000 19:35:07
On Mon, May 22, 2000 at 07:16:17PM +0200, Manuel Bouyer wrote:
> Hi,
> I'm playing with an array of disks and raidframe, experimenting with various
> failure types. Here's what I've just run into:
> I have a raid1 spread across drives in different enclosures, in such a way
> that I can power down one enclosure without losing the raid.
> I got into trouble with the following scenario:
> - start writing to the filesystem: dd if=/dev/zero of=file bs=64k
> - power down one of the enclosures. raidframe marks the corresponding devices
>   as failed and continues running. dd doesn't stop.
> - power the enclosure back on.
> - reboot.
> 
> When the machine reboots, raidframe finds all disks with status 'optimal' and
> parity 'dirty', so it starts rewriting parity. Unfortunately some of the failed
> disks were masters, so data is read from them instead of from the slaves, which
> have the accurate data. For me this resulted in an unclean filesystem,
> which had to be fixed with 'fsck -y' (there was an unallocated inode in a
> directory).
> 
> In this scenario raidframe should record somewhere which disks have failed,
> and not reuse them. Maybe something based on the mod ref counter would work
> (or did I miss what the mod ref counter is for?).
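
The mod counter idea above can be sketched as follows. This is only an
illustration in Python, not RAIDframe code. The label fields mirror what the
kernel prints (serial number, mod counter, clean flag); the names
ComponentLabel and pick_stale_components, the "highest mod counter wins" rule,
and the counter value given to /dev/sd10a are all assumptions made for the
example. It also assumes the counter on the surviving components advances
after a component has been marked failed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentLabel:
    """Hypothetical stand-in for the fields of a component label."""
    device: str
    serial_number: int
    mod_counter: int
    clean: bool


def pick_stale_components(labels):
    """Return devices whose mod counter lags behind the newest label.

    Assumption: a component that dropped out of the set stops having its
    counter updated, so a lower counter means stale data.
    """
    if not labels:
        return []
    newest = max(label.mod_counter for label in labels)
    return [label.device for label in labels if label.mod_counter < newest]


# Illustrative values: sd10a was powered off for a while, so its
# (made-up) counter is behind the one reported for sd30a.
labels = [
    ComponentLabel("/dev/sd10a", 2000051902, 1237272260, False),
    ComponentLabel("/dev/sd30a", 2000051902, 1237272265, False),
]
stale = pick_stale_components(labels)
```

With such a check, the components listed in stale would be treated as failed
and rebuilt from their mirrors instead of being used as a source for the
parity rewrite.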

Another problem I noticed: if I reboot with the enclosure down, the raid
fails to configure, although it should be able to:
raidlookup on device: /dev/sd10a failed!
raidlookup on device: /dev/sd11a failed!
raidlookup on device: /dev/sd12a failed!
raidlookup on device: /dev/sd13a failed!
raidlookup on device: /dev/sd14a failed!
raid2: Component /dev/sd10a being configured at row: 0 col: 0
         Row: 0 Column: 0 Num Rows: 0 Num Columns: 0
         Version: 0 Serial Number: 0 Mod Counter: 0
         Clean: No Status: 0
Number of rows do not match for: /dev/sd10a
Number of columns do not match for: /dev/sd10a
/dev/sd10a is not clean!
raid2: Component /dev/sd30a being configured at row: 0 col: 1
         Row: 0 Column: 1 Num Rows: 1 Num Columns: 20
         Version: 2 Serial Number: 2000051902 Mod Counter: 1237272265
         Clean: No Status: 0
/dev/sd30a has a different serial number: 0 2000051902
/dev/sd30a has a different modfication count: 0 1237272265
/dev/sd30a is not clean!
[...]

(sd10->14 are all in the same enclosure, sd30->34 in the other).
/dev/sd10a should just have been marked failed, shouldn't it?
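
A minimal sketch of the behaviour expected here, in Python and not taken from
RAIDframe: a component whose lookup fails is marked failed, and the set still
configures as long as every mirror pair keeps one readable member. The
function name plan_mirror_config is hypothetical, the pairing of columns
(0,1), (2,3), ... is an assumption about the layout, and the column order of
sd11a and sd31a is invented for the example (the log above only shows
columns 0 and 1).

```python
def plan_mirror_config(columns):
    """Decide whether a mirrored set can be configured.

    columns: list of (device, readable) tuples in column order, where
    columns 2k and 2k+1 are assumed to mirror each other.
    Returns (configurable, failed_devices).
    """
    if len(columns) % 2 != 0:
        raise ValueError("a mirrored set needs an even number of columns")
    # Every component that could not be looked up is marked failed.
    failed = [device for device, readable in columns if not readable]
    # The set is usable if each pair still has at least one readable member.
    configurable = all(
        columns[i][1] or columns[i + 1][1]
        for i in range(0, len(columns), 2)
    )
    return configurable, failed


# One enclosure (sd1x) is down, the other (sd3x) is up.
columns = [
    ("/dev/sd10a", False), ("/dev/sd30a", True),
    ("/dev/sd11a", False), ("/dev/sd31a", True),
]
ok, failed = plan_mirror_config(columns)
```

In this example the set stays configurable with sd10a and sd11a marked
failed; it would only be refused if both members of some pair were missing.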

--
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
--