current-users: Re: raidframe and pciide list interrupts

Subject: Re: raidframe and pciide list interrupts
To: Simon Burge <simonb@wasabisystems.com>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 11/21/2000 11:22:38

Simon Burge writes:
> Hi,
> 
> One half of my raidframe mirror across wd0 and wd1 (a pair of IBM 46GB
> disks) on my Alpha PC164 running 1.5_BETA2 just died with:
> 
[snip]
> This continued for about 10 minutes with lots of pciide and wd0 errors
> interspersed with the following raidframe errors:
> 
> Nov 22 03:10:30 thoreau /netbsd: raid0: IO Error.  Marking /dev/wd0a as failed.
> Nov 22 03:10:30 thoreau /netbsd: raid0: node (Rmir) returned fail, rolling backward
> Nov 22 03:10:30 thoreau /netbsd: raid0: DAG failure: r addr 0x508040 (5275712) nblk 0x10 (16) buf 0xfffffe000307c000
> Nov 22 03:11:22 thoreau /netbsd: raid0: node (Wpd) returned fail, rolling forward
> Nov 22 03:12:14 thoreau /netbsd: raid0: node (Wpd) returned fail, rolling forward
> Nov 22 03:12:14 thoreau /netbsd: raid0: node (Wpd) returned fail, rolling forward
> Nov 22 03:15:41 thoreau /netbsd: raid0: node (Wpd) returned fail, rolling forward
> Nov 22 03:20:01 thoreau /netbsd: raid0: node (Wpd) returned fail, rolling forward
> 
> and now seems to be ignoring wd0 altogether.
> 
> So, a couple of questions:
> 
>  1) Shouldn't raidframe have stopped accessing wd0 after the first
>     "Marking /dev/wd0a as failed"?

It should have... much/all of the RAIDframe printout should be from IO that
was already queued and sent to wd0 before the first error occurred.  Depending
on how much IO was in flight (and how long it took to get those error
messages out), it might take a little while for all of that to get flushed through...

>  2) Is the disk hosed?  Sleep time now - I'll reboot in the morning and
>     see what happens.

Dunno... given the "lost interrupt", the media might still be fine... 
(but, of course, it's all messed up as far as RAIDframe is concerned :) )

Later...

Greg Oster