Subject: Re: problems with raidframe on FC disks
To: Ben Rosengart <br@panix.com>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: current-users
Date: 02/27/2002 16:42:50
On Wed, Feb 27, 2002 at 03:45:33PM -0500, Ben Rosengart wrote:
> I've been using RAIDframe on NetBSD 1.5.2 to stripe eight fibre
> channel disks.  This has been working OK, modulo some soft errors
> which didn't seem to actually cause any problems.
> 
> Yesterday, I switched to a NetBSD-1.5ZA kernel, using yesterday's
> sources, in order to solve another problem which I believe to be
> unrelated.  Overnight, the machine became unresponsive, with this
> in syslog:
> 
> Feb 27 03:27:27 reader1 /netbsd: raid1: IO Error.  Marking /dev/sd4a as failed.
> Feb 27 03:27:27 reader1 /netbsd: raid1: node (R  ) returned fail,
>    rolling backward
> Feb 27 03:27:27 reader1 /netbsd: raid1: DAG failure: r addr 0x153c580
>    (22267264) nblk 0x20 (32) buf 0xc93c1000
> Feb 27 03:31:43 reader1 /netbsd: raid1: IO Error.  Marking /dev/sd3a as failed.
> Feb 27 03:31:43 reader1 /netbsd: raid1: node (W  ) returned fail, rolling
>    forward
> Feb 27 03:31:43 reader1 /netbsd: raid1: IO Error.  Marking /dev/sd6a as failed.
> Feb 27 03:31:43 reader1 /netbsd: raid1: node (R  ) returned fail,
>    rolling backward
> Feb 27 03:31:43 reader1 /netbsd: raid1: IO Error.  Marking /dev/sd7a as failed.
> Feb 27 03:31:43 reader1 /netbsd: raid1: node (R  ) returned fail,
>    rolling backward
> Feb 27 03:32:47 reader1 /netbsd: raid1: node (R  ) returned fail,
>    rolling backward
> Feb 27 03:33:51 reader1 last message repeated 3 times
> Feb 27 03:47:25 reader1 /netbsd: Failed to write RAID component info!
> Feb 27 03:53:22 reader1 /netbsd: sd1(isp0:0:57:0): adapter resource shortage
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I think that's your disk error right there.  I'm not sure why it isn't
printed until *after* the RAIDframe errors, but if you have an "adapter
resource shortage", it's reasonable to assume that the transfers RAIDframe
thinks failed really did fail.

Another question is whether this should be a *retryable* error.
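
To make "retryable" concrete, here is a rough userland sketch of the idea,
not the actual sd, isp, or RAIDframe code.  The result codes, MAX_RETRIES,
xfer_with_retry() and flaky_xfer() are all names made up for illustration.
A transient shortage is retried a bounded number of times; anything else,
or a shortage that persists, is handed back to the caller, which could
then mark the component as failed.

```c
#include <stddef.h>

/* Hypothetical result codes; not the real NetBSD SCSI or RAIDframe ones. */
enum xfer_result { XFER_OK, XFER_RESOURCE_SHORTAGE, XFER_MEDIA_ERROR };

/* Assumed retry budget: one initial attempt plus up to three retries. */
#define MAX_RETRIES 3

typedef enum xfer_result (*xfer_fn)(void *arg);

/*
 * Retry only while the transfer reports a transient shortage.  A hard
 * error, or a shortage that outlasts MAX_RETRIES, is returned unchanged
 * so the caller can decide to fail the component.
 */
static enum xfer_result
xfer_with_retry(xfer_fn fn, void *arg, int *attempts)
{
	enum xfer_result r;
	int tries = 0;

	do {
		r = fn(arg);
		tries++;
	} while (r == XFER_RESOURCE_SHORTAGE && tries <= MAX_RETRIES);

	if (attempts != NULL)
		*attempts = tries;
	return r;
}

/* Simulated transfer: reports a shortage until *arg counts down to zero. */
static enum xfer_result
flaky_xfer(void *arg)
{
	int *shortages_left = arg;

	if (*shortages_left > 0) {
		(*shortages_left)--;
		return XFER_RESOURCE_SHORTAGE;
	}
	return XFER_OK;
}
```

With two simulated shortages the third attempt succeeds; with a shortage
that never clears, the caller sees the error after four attempts.  A real
driver would presumably also want a delay between attempts, which this
sketch leaves out.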

Thor