NetBSD-Bugs archive


Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown



The following reply was made to PR kern/40569; it has been noted by GNATS.

From: Greg Oster <oster%cs.usask.ca@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: 
Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
Date: Mon, 09 Feb 2009 17:01:45 -0600

 Matthias Scheler writes:
 > The following reply was made to PR kern/40569; it has been noted by GNATS.
 > 
 > From: Matthias Scheler <tron%zhadum.org.uk@localhost>
 > To: NetBSD GNATS <gnats-bugs%NetBSD.org@localhost>
 > Cc: Greg Oster <oster%cs.usask.ca@localhost>
 > Subject: Re: kern/40569: Failed RAIDframe parity rewrite prevents system shutdown
 > Date: Mon, 9 Feb 2009 21:07:13 +0000
 > 
 >  On Mon, Feb 09, 2009 at 07:50:03PM +0000, Matthias Scheler wrote:
 >  >  This time it actually seems to fail because of the LBA48 bug. I now
 >  >  remember that re-construct was done in the opposite direction in
 >  >  the past. I'll try to add the hard disk to the quirk table.
 >  
 >  1.) The system wedged again during shutdown:
 >  
 >  ahcisata0 port 2: device present, speed: 1.5Gb/s
 >  Feb  9 19:56:43 colwyn su: tron to root on /dev/ttyp1
 >  Feb  9 19:57:56 colwyn shutdown: reboot by tron: Kernel bug fix 
 >  Feb  9 19:58:11 colwyn syslogd: Exiting on signal 15
 >  syncing disks... 5 done
 >  unmounting file systems...raid1: Waiting for reconstruction to stop...
 
 So it is very likely sleeping in the reconstruct code and waiting for 
 a write that is never going to happen...
 
 AHHHH... I think I see the bug. There is at least one missing:
 
  num_writes++;
 
 in rf_reconstruct:rf_ContinueReconstructFailedDisk().  (It might be 
 that two are missing... I need to do more analysis...)  Basically, 
 writes with errors are still writes that need to be accounted for, 
 and that's not happening properly...  I'll see about getting this 
 fixed for 5.0.  (Testing may prove to be a pain... I may have to 
 resurrect an old testing box so that I have some disks with 
 real write errors... and I'm not sure those will even be sufficient 
 to replicate this :-/ )
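 
 A minimal sketch of that accounting problem, written as a standalone C 
 model rather than RAIDframe code.  Every name here except num_writes 
 (struct recon_state, handle_write_done(), recon_can_stop(), simulate(), 
 and the count_failed switch) is a hypothetical stand-in.  The assumption 
 that the waiter may only proceed once num_writes equals the number of 
 writes issued is inferred from the description above, not taken from 
 the RAIDframe source:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one reconstruction run (not RAIDframe's). */
struct recon_state {
    int writes_expected;  /* number of writes that were issued */
    int num_writes;       /* write completions accounted for so far */
    int num_errors;       /* how many of those completions were errors */
};

/*
 * Handle one write-completion event.  When count_failed is false, the
 * error path forgets to bump num_writes, which models the suspected bug.
 */
static void
handle_write_done(struct recon_state *rs, bool failed, bool count_failed)
{
    if (failed) {
        rs->num_errors++;
        if (count_failed)
            rs->num_writes++;  /* a failed write is still a finished write */
        return;
    }
    rs->num_writes++;
}

/* The shutdown path may stop waiting only when every write is accounted for. */
static bool
recon_can_stop(const struct recon_state *rs)
{
    return rs->num_writes == rs->writes_expected;
}

/*
 * Feed n completion events through the handler; failed[i] is true when
 * write i completed with an error.  Returns true if a waiter would wake up.
 */
bool
simulate(const bool *failed, size_t n, bool count_failed)
{
    struct recon_state rs = { (int)n, 0, 0 };

    for (size_t i = 0; i < n; i++)
        handle_write_done(&rs, failed[i], count_failed);

    assert(rs.num_writes <= rs.writes_expected);
    return recon_can_stop(&rs);
}
```

 With the outcomes {ok, failed, ok}, simulate(outcomes, 3, false) returns 
 false, i.e. the waiter would sleep forever, while 
 simulate(outcomes, 3, true) returns true.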
 
 >  2.) The kernel with the two hard disks in the quick table has managed
 >      to re-construct the RAID 1.
 >  
 >  So it seems this is not a RAIDframe bug after all but rather a problem
 >  with the drives (and possibly error handling in ahcisata(4)).
 
 There's a RAIDframe bug in there too... 
 
 Later...
 
 Greg Oster
 
 

