NetBSD-Bugs archive


Re: kern/40569: Faild RAIDframe parity rewrite prevents system shutdown



The following reply was made to PR kern/40569; it has been noted by GNATS.

From: Greg Oster <oster%cs.usask.ca@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: 
Subject: Re: kern/40569: Faild RAIDframe parity rewrite prevents system 
shutdown 
Date: Sat, 07 Feb 2009 12:54:54 -0600

 Matthias Scheler writes:
 > On Sat, Feb 07, 2009 at 12:50:03AM +0000, Greg Oster wrote:
 > >  > I retried the parity rewrite but it was rejected by "raidctl" because of
 > >  > an invalid I/O control. 
 > >  
 > >  Do you have a bit more info on exactly what you tried here and what 
 > >  the error was?
 > 
 > Not really but I tried another rebuild after powercycling the system (to
 > check the cabling) and it failed again:
 > 
 > raid1: initiating in-place reconstruction on column 0
 > wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519
 > (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
 > wd3: soft error (corrected)
 > wd2e: error writing fsbn 268435392 of 268435392-268435519 (wd2 bn 268435455; 
 > cn 266305 tn 0 sn 15), retrying
 [snip]
 > wd2: (id not found)
 > raid1: Recon write failed!
 > raid1: reconstruction failed.
 > ahcisata0 port 2: device present, speed: 1.5Gb/s
 > raid1: Error re-writing parity!
 
 I don't understand where this last line is coming from... Unless it 
 finished rebuilding parity for raid0, and it's just coincidence that 
 it finished at exactly this spot?
 
 > If you tell me what kind of debugging you would like me to do I can try
 > to reproduce the problem by attempting another rebuild.
 
 Hmmmmm....  Around line 857 of src/sys/dev/raidframe/rf_reconstruct.c
 there is a:
 
   return (1);
 
 You could try adding a printf() just before that line, and see if 
 that gets printed....  I bet it doesn't... 
 
 I *think* you're getting hung up in the:
 
                if (!write_error) {
                        /* wait for writes to complete */
                        while (raidPtr->reconControl->pending_writes > 0) {
 
 part of rf_ContinueReconstructFailedDisk().
 
 It seems that you've had a (corrected) read error on wd3e.. but I'm 
 wondering if that's contributing to the problem here..  The issue, I 
 think, is that there are still pending writes, or that the code 
 thinks there are pending writes...  I know this code was tested on a 
 disk that had real failing writes, but it's unlikely that they were 
 exactly the same as what you're seeing, and so there's room here for 
 bugs... 
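
 To make the suspicion concrete, here is a toy model of that wait loop. It
 is a sketch under stated assumptions, not the RAIDframe code: only the
 pending_writes name comes from the snippet above, while struct recon_ctl,
 write_done() and drain() are hypothetical names made up for illustration.
 It assumes the counter is only decremented when a write completion is
 accounted for, and shows that a single failed write that skips the
 decrement leaves the count above zero, so an unbounded
 "while (pending_writes > 0)" would never exit.

```c
#include <assert.h>

/* Toy stand-in for the reconstruction control state. Only the
 * pending_writes field name is taken from the quoted kernel snippet. */
struct recon_ctl {
    int pending_writes;
};

/* Hypothetical completion handler (an assumption, not kernel code).
 * If a failed write skips the decrement (decrement_on_error == 0),
 * that write is never accounted for. */
static void write_done(struct recon_ctl *rc, int failed, int decrement_on_error)
{
    if (!failed || decrement_on_error)
        rc->pending_writes--;
}

/* Bounded replacement for the wait loop: deliver one completion per
 * outstanding write, the first nfail of them as failures, then report
 * whether the counter drained. Returns 1 if it reached zero, 0 if an
 * unbounded "while (pending_writes > 0)" would have spun forever. */
static int drain(struct recon_ctl *rc, int nfail, int decrement_on_error)
{
    assert(rc->pending_writes >= 0);
    int completions = rc->pending_writes;

    for (int i = 0; i < completions; i++)
        write_done(rc, i < nfail, decrement_on_error);

    return rc->pending_writes == 0;
}
```

 With four writes outstanding and one failure, drain(&rc, 1, 0) returns 0
 and leaves pending_writes at 1, while drain(&rc, 1, 1) returns 1. Whether
 the real error path behaves like the first case is exactly what the
 printf() test above would help confirm.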
 
 Oh... It just hit me:
 
   wd3e: LBA48 bug reading fsbn 268435392 of 268435392-268435519
   (wd3 bn 268435455; cn 266305 tn 0 sn 15), retrying
   wd3: soft error (corrected)
   wd2e: error writing fsbn 268435392 of 268435392-268435519
   (wd2 bn 268435455; cn 266305 tn 0 sn 15), retrying
   wd2: (id not found)
 
 What type of disks are these, and do they have the 'LBA48-quirk' entry 
 to change addressing modes or whatever for block 268435455?  (just hunt 
 for that block number in Google for more info...)  There are other 
 PRs (like 38376) which describe these same sort of symptoms...
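
 The reason that block number stands out is arithmetic: 268435455 is
 2^28 - 1, the last sector reachable with 28-bit LBA addressing. The
 sketch below only illustrates that boundary check; the macro and function
 names are made up for the example and are not the wd(4) driver's own. It
 also assumes that the "bn" in the log is the absolute starting sector of
 the 128-sector transfer (fsbn 268435392 through 268435519).

```c
#include <assert.h>
#include <stdint.h>

/* 28-bit LBA can address sectors 0 .. 2^28 - 1. */
#define LBA28_SECTORS (UINT64_C(1) << 28)   /* 268435456 */
#define LBA28_LAST    (LBA28_SECTORS - 1)   /* 268435455 */

/* Returns 1 if any sector of [start, start + count) lies beyond the
 * 28-bit range, i.e. the transfer cannot be issued as a pure LBA28
 * command. Hypothetical helper, not a driver function. */
static int needs_lba48(uint64_t start, uint32_t count)
{
    assert(count > 0);
    return start + count - 1 > LBA28_LAST;
}

/* Returns 1 if the transfer includes the last LBA28 sector itself,
 * the block number that shows up in the log. */
static int touches_lba28_last(uint64_t start, uint32_t count)
{
    assert(count > 0);
    return start <= LBA28_LAST && start + count - 1 >= LBA28_LAST;
}
```

 Under that assumption, needs_lba48(268435455, 128) and
 touches_lba28_last(268435455, 128) are both true: the transfer starts on
 the very last LBA28 sector and runs past it, which is the case a quirk
 entry would have to handle.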
 
 Later...
 
 Greg Oster
 
 

