NetBSD-Bugs archive
Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown
The following reply was made to PR kern/40569; it has been noted by GNATS.
From: Greg Oster <oster%cs.usask.ca@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc:
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system
shutdown
Date: Sat, 07 Feb 2009 12:54:54 -0600
Matthias Scheler writes:
> On Sat, Feb 07, 2009 at 12:50:03AM +0000, Greg Oster wrote:
> > > I retried the parity rewrite but it was rejected by "raidctl" because of
> > > an invalid I/O control.
> >
> > Do you have a bit more info on exactly what you tried here and what
> > the error was?
>
> Not really but I tried another rebuild after powercycling the system (to
> check the cabling) and it failed again:
>
> raid1: initiating in-place reconstruction on column 0
> wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn 268435455;
> cn 266305 tn 0 sn 15), retrying
> wd3: soft error (corrected)
> wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455;
> cn 266305 tn 0 sn 15), retrying
[snip]
> wd2: (id not found)
> raid1: Recon write failed!
> raid1: reconstruction failed.
> ahcisata0 port 2: device present, speed: 1.5Gb/s
> raid1: Error re-writing parity!
I don't understand where this last line is coming from... Unless it
finished rebuilding parity for raid0, and it's just coincidence that
it finished at exactly this spot?
> If you tell me what kind of debugging you would like me to do I can try
> to reproduce the problem by attempting another rebuild.
Hmmmmm.... Around line 857 of src/sys/dev/raidframe/rf_reconstruct.c
there is a:
return (1);
You could try adding a printf() just before that line, and see if
that gets printed.... I bet it doesn't...
I *think* you're getting hung up in the:
    if (!write_error) {
        /* wait for writes to complete */
        while (raidPtr->reconControl->pending_writes > 0) {
part of rf_ContinueReconstructFailedDisk().
It seems that you've had a (corrected) read error on wd3e.. but I'm
wondering if that's contributing to the problem here.. The issue, I
think, is that there are still pending writes, or that the code
thinks there are pending writes... I know this code was tested on a
disk that had real failing writes, but it's unlikely that they were
exactly the same as what you're seeing, and so there's room here for
bugs...
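
To make the suspected failure mode concrete, here is a minimal user-space sketch of a wait loop on a pending-write counter. It is not the RAIDframe code: struct recon_sim and all of the sim_* functions are hypothetical, and only the name pending_writes is taken from the fragment quoted above. It assumes, purely for illustration, that a failed write might skip the decrement; whether anything like that happens in rf_ContinueReconstructFailedDisk() is exactly what remains to be checked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reconstruction control state. Only the
 * field name pending_writes mirrors the quoted kernel fragment. */
struct recon_sim {
    int pending_writes;
};

/* Queueing a write bumps the counter. */
static void sim_issue_write(struct recon_sim *rc)
{
    rc->pending_writes++;
}

/* Completion handler. With decrement_on_error == false, a failed write
 * leaves the counter untouched (the assumed accounting bug). */
static void sim_complete_write(struct recon_sim *rc, bool failed,
                               bool decrement_on_error)
{
    if (!failed || decrement_on_error)
        rc->pending_writes--;
}

/* Bounded version of "while (pending_writes > 0) wait". Returns true if
 * the counter has drained, false if it is still non-zero after
 * max_polls checks (which would be a hang in an unbounded loop). */
static bool sim_wait_for_writes(const struct recon_sim *rc, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (rc->pending_writes <= 0)
            return true;
        /* a kernel would sleep here until a completion wakes it */
    }
    return false;
}

/* Issue nwrites writes, complete them all with the last one failing,
 * then report whether the wait loop would ever finish. */
static bool sim_recon_drains(int nwrites, bool decrement_on_error)
{
    struct recon_sim rc = { 0 };

    assert(nwrites > 0);
    for (int i = 0; i < nwrites; i++)
        sim_issue_write(&rc);
    for (int i = 0; i < nwrites; i++)
        sim_complete_write(&rc, i == nwrites - 1, decrement_on_error);
    return sim_wait_for_writes(&rc, 1000);
}
```

Calling sim_recon_drains(4, true) models correct accounting and the wait finishes; sim_recon_drains(4, false) models a lost decrement and the wait never sees the counter reach zero.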
Oh... It just hit me:
wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519 (wd3 bn
268435455; cn 266305 tn 0 sn 15), retrying
wd3: soft error (corrected)
wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455;
cn 266305 tn 0 sn 15), retrying
wd2: (id not found)
What type of disks are these, and do they have the 'LBA48-quirk' entry
to change addressing modes or whatever for block 268435455? (just hunt
for that block number in Google for more info...) There are other
PRs (like 38376) which describe these same sort of symptoms...
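
The significance of that block number can be checked with a little arithmetic: 268435455 is 2^28 - 1 (0x0FFFFFFF), the last sector addressable with a 28-bit LBA. The sketch below is stand-alone C, not driver code. The helper names are hypothetical, and the partition offset of 63 is inferred from the log above (wd3 bn 268435455 minus fsbn 268435392).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A 28-bit LBA can address sectors 0 .. 2^28 - 1. */
#define LBA28_SECTORS (UINT64_C(1) << 28)   /* 268435456 sectors */
#define LBA28_LAST    (LBA28_SECTORS - 1)   /* 268435455 == 0x0FFFFFFF */

/* Hypothetical helper: partition-relative block to absolute disk block. */
static uint64_t abs_block(uint64_t part_offset, uint64_t fsbn)
{
    return part_offset + fsbn;
}

/* Hypothetical helper: true if the transfer [start, start + count)
 * includes the last LBA28 sector or anything beyond it. */
static bool touches_lba28_limit(uint64_t start, uint32_t count)
{
    assert(count > 0);
    return start + count - 1 >= LBA28_LAST;
}
```

With the values from the log, abs_block(63, 268435392) gives 268435455, and the 128-sector transfer (fsbn 268435392-268435519) starts on the last LBA28 sector and runs past it, so touches_lba28_limit(268435455, 128) is true.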
Later...
Greg Oster