NetBSD-Bugs archive
Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
The following reply was made to PR kern/40569; it has been noted by GNATS.
From: Greg Oster <oster%cs.usask.ca@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc:
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Mon, 09 Feb 2009 17:01:45 -0600
Matthias Scheler writes:
> The following reply was made to PR kern/40569; it has been noted by GNATS.
>
> From: Matthias Scheler <tron%zhadum.org.uk@localhost>
> To: NetBSD GNATS <gnats-bugs%NetBSD.org@localhost>
> Cc: Greg Oster <oster%cs.usask.ca@localhost>
> Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
> Date: Mon, 9 Feb 2009 21:07:13 +0000
>
> On Mon, Feb 09, 2009 at 07:50:03PM +0000, Matthias Scheler wrote:
> > This time it actually seems to fail because of the LBA48 bug. I now
> > remember that reconstruction was done in the opposite direction in
> > the past. I'll try to add the hard disk to the quirk table.
>
> 1.) The system wedged again during shutdown:
>
> ahcisata0 port 2: device present, speed: 1.5Gb/s
> Feb 9 19:56:43 colwyn su: tron to root on /dev/ttyp1
> Feb 9 19:57:56 colwyn shutdown: reboot by tron: Kernel bug fix
> Feb 9 19:58:11 colwyn syslogd: Exiting on signal 15
> syncing disks... 5 done
> unmounting file systems...raid1: Waiting for reconstruction to stop...
So it is very likely sleeping in the reconstruct code and waiting for
a write that is never going to happen...
AHHHH... I think I see the bug: there is at least one missing
num_writes++;
in rf_reconstruct:rf_ContinueReconstructFailedDisk() (it might be
that two are missing... I need to do more analysis). Basically,
writes with errors are still writes that need to be accounted for,
and that's not happening properly... I'll see about getting this
fixed for 5.0. (Testing may prove to be a pain... I may have to
resurrect an old testing box so that I have some disks with
real write errors... and I'm not sure those will even be sufficient
to replicate this :-/ )
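To illustrate the accounting problem described above, here is a minimal, hypothetical sketch. The names struct recon_state, recon_write_done(), recon_write_done_buggy() and simulate_pass() are invented for illustration and are not RAIDframe's actual code; the real driver sleeps in the kernel rather than checking a counter in a loop. The point it shows: if a write that completes with an error is never counted, the "all writes accounted for" condition can never become true, so whatever waits on it (such as the "Waiting for reconstruction to stop..." step at shutdown) blocks forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one reconstruction pass. */
struct recon_state {
	int num_issued;  /* writes handed to the disk driver */
	int num_writes;  /* writes that have completed (success OR error) */
	int num_failed;  /* completed writes that reported an error */
};

/* Completion handler, called once per issued write.  A failed write
 * is still a completed write, so num_writes is bumped on both paths. */
static void
recon_write_done(struct recon_state *rs, int error)
{
	if (error != 0)
		rs->num_failed++;
	rs->num_writes++;
	assert(rs->num_writes <= rs->num_issued);
}

/* The suspected bug, for contrast: an errored write returns early
 * and is never counted as completed. */
static void
recon_write_done_buggy(struct recon_state *rs, int error)
{
	if (error != 0) {
		rs->num_failed++;
		return;  /* BUG: missing num_writes++ */
	}
	rs->num_writes++;
}

/* The condition a "wait for reconstruction to stop" would sleep on. */
static bool
recon_all_writes_accounted(const struct recon_state *rs)
{
	return rs->num_writes == rs->num_issued;
}

/* Simulate a pass of n writes where the first nfail return EIO.
 * Returns true if the waiter would be released. */
static bool
simulate_pass(int n, int nfail, void (*done)(struct recon_state *, int))
{
	struct recon_state rs = { .num_issued = n, .num_writes = 0,
	    .num_failed = 0 };

	for (int i = 0; i < n; i++)
		done(&rs, i < nfail ? EIO : 0);
	return recon_all_writes_accounted(&rs);
}
```

With no write errors both handlers release the waiter; as soon as one write fails, only the version that counts every completion does, which matches the hang seen only on the drives with real write errors.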
> 2.) The kernel with the two hard disks in the quirk table has managed
> to reconstruct the RAID 1.
>
> So it seems this is not a RAIDframe bug after all but rather a problem
> with the drives (and possibly error handling in ahcisata(4)).
There's a RAIDframe bug in there too...
Later...
Greg Oster