Subject: kern/10160: raidframe behaves poorly with failed disk
To: None <gnats-bugs@gnats.netbsd.org>
From: None <nemo@red-bean.com>
List: netbsd-bugs
Date: 05/20/2000 11:56:13
>Number:         10160
>Category:       kern
>Synopsis:       raidframe behaves poorly with failed disk
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat May 20 11:57:01 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator:     Joel N. Weber II
>Release:        1.4.2, with slightly newer 1.4 branch kernel
>Organization:
Gratuitous Organization for Gratuitous Software Enhancement
>Environment:
System: NetBSD xanthine 1.4.2A NetBSD 1.4.2A (XANTHINE) #0: Sat Apr 15 12:53:55 EDT 2000 nemo@xanthine:/usr/src/syssrc/sys/arch/i386/compile/XANTHINE i386


>Description:

xanthine has two disks, /dev/sd0 and /dev/sd1.  About two weeks ago,
sd0 appears to have lost some blocks; attempts to read those blocks
would wedge the disk, but it remained intact enough that the machine
could still read the disklabel, the kernel, and quite a bit more off
of it.

xanthine crashed when sd0 got wedged the first time.  Unfortunately I
neglected to save any output at that point.  It is entirely possible
that it wedged because the RAID level 0 device failed, in which case I
can't even assert that there is a bug in raidframe.  (xanthine was
configured with a /dev/raid0 and a /dev/raid1, each of which had a
component on each of sd0 and sd1; raid0 was at RAID level 0, and raid1
at RAID level 1.  I think that when I reconstruct things I will make
them both RAID level 1, as I believe improved reliability is probably
worth more than conserving disk space, given that I have gratuitously
large disks for the amount of data I have.)
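
For reference, a raidctl(8) configuration file for a RAID 1 set
mirrored across sd0 and sd1 would look roughly like the following; the
partition letters and layout numbers here are illustrative guesses,
not the actual configuration that was on xanthine:

START array
# numRow numCol numSpare
1 2 0

START disks
/dev/sd0e
/dev/sd1e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
128 1 1 1

START queue
fifo 100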

When I tried rebooting xanthine, with both sd0 and sd1 still in the
system, I got the following repeated at least a half dozen times (and
I suspect it would have gone on forever had I not turned the machine
off):

[0] node (R  ) returned fail, rolling backward
[0] DAG failure: r addr 0xa18810 (10586128) nblk 0x2 (2) buf 0xc73f000
DEAD DISK BOGUSLY DETECTED!!

I'm getting the distinct impression that NetBSD reacts to a partially
dead disk that is being excessively slow by getting into a loop
printing the above text and being hosed, rather than deciding that the
disk is dead and moving on.  Perhaps it should react to an inability
to read a block on one disk by asking the other disk for it.

After seeing this lossage, I decided to run with sd1 as the only disk
in my system; I'm currently waiting for a warranty replacement of sd0,
after which I will have an sd0 again.

sd1 had been set up with a proper disklabel, but didn't have
appropriate boot code or an appropriate partition table installed.  I
went through about a half dozen boots trying to get this right, and at
one point obliterated the disklabel on sd1 and had to temporarily
reinstall sd0 so that I could copy its disklabel to sd1.  The whole
process of setting up a RAID array so that you can boot off the disk
needs to be either better documented or better automated.
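
For what it's worth, the disklabel-copying step amounted to roughly
the following; the partition letters are illustrative, and the
installboot invocation is approximate for 1.4/i386:

# copy the disklabel from sd0 to sd1
disklabel sd0 > /tmp/sd0.label
disklabel -R -r sd1 /tmp/sd0.label

# fix up the MBR partition table on sd1 (interactive update)
fdisk -u sd1

# install the i386 bootstrap on sd1's a partition (path/syntax approximate)
/usr/mdec/installboot -v /usr/mdec/biosboot.sym /dev/rsd1a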

It also appears to be the case that if a component has evaporated,
raidframe decides that the serial number of the evaporated component
is zero, and complains of a serial number mismatch.

I believe that dmesg also reports that sd0 and sd1 are both hosed even
when sd1 is OK, but raidctl's status output reports that sd0 is failed
and sd1 is optimal.  I'm not sure why this is.
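
The status I mean is the per-component state raidctl prints when asked
for status on each array, roughly (assuming -s is the status flag on
this version of raidctl):

raidctl -s raid0
raidctl -s raid1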

>How-To-Repeat:

>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: