NetBSD-Bugs archive


Re: kern/39993: lockup on i386 SMP (raidframe related ?)



On Fri, Nov 21, 2008 at 08:29:18PM +0100, Manuel Bouyer wrote:
> >  Were it only that simple, I'd be happy...  Unfortunately, I've got a 
> >  couple of different boxes w/ 5.0_BETA+SMP+RAIDframe+heavy IO and I 
> >  haven't seen this problem at all.. :( 
> >  
> >  What happens if you do:
> >  
> >    dd if=/dev/rsd0e of=/dev/null bs=1m &
> >    dd if=/dev/rsd1e of=/dev/null bs=1m &
> >  
> >  where rsd0e and rsd1e are the (raw) components of your RAID set?
> 
> works fine (I tried with the different components of the RAID).
> 
> I also tried to reproduce it on an Athlon X2 with 2 SATA drives, no luck.
> 
> other factors that may be relevant:
> - when this happens a background parity rewrite is running
> - there may be hardware issues with drives (like command timeouts),
>   so there may be aborted transfers/I/O errors reported to the raidframe 
>   layer.
> 
> What I could do is try letting it rebuild parity on the UP kernel, and reboot
> to the SMP kernel afterwards.

What I did:
- reboot in UP mode, let parity rebuild complete.
- reboot in SMP mode, multiuser
and now the system has been up for about 17 hours with its usual load. So it
looks like the issue is related to the parity rebuild.

From the traces I gathered from ddb and gdb, it looks like CPU 1 is
trying to acquire a simple_lock while CPU 0 is halted with this lock held.
Could it be the RF_LOCK_QUEUE_MUTEX(queue, "DiskIOComplete"); in
rf_DiskIOComplete?
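
To make the suspected sequence concrete, here is a minimal user-space sketch
of it. This is not the NetBSD simple_lock code and not RAIDframe code: the
names sketch_lock, sketch_lock_bounded, sketch_unlock and sketch_replay are
hypothetical, and the spin is bounded only so that the example terminates. A
real spin lock would loop forever in the same situation, which would match
the lockup seen on CPU 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel spin lock (NOT the NetBSD simple_lock). */
struct sketch_lock {
	atomic_flag held;
};

/*
 * Try to take the lock, giving up after max_spins failed attempts.
 * The bound exists only for this illustration; an unbounded spin would
 * never return if the holder never releases the lock.
 */
static bool
sketch_lock_bounded(struct sketch_lock *l, unsigned long max_spins)
{
	for (unsigned long i = 0; i < max_spins; i++) {
		if (!atomic_flag_test_and_set_explicit(&l->held,
		    memory_order_acquire))
			return true;	/* flag was clear: lock acquired */
	}
	return false;			/* holder never let go */
}

static void
sketch_unlock(struct sketch_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Replay the suspected sequence on a single thread: "CPU 0" takes the
 * lock and then either releases it or stops with it held, after which
 * "CPU 1" tries to take the same lock. Returns whether CPU 1 got it.
 */
static bool
sketch_replay(bool cpu0_releases)
{
	struct sketch_lock queue_lock = { ATOMIC_FLAG_INIT };
	bool cpu0_got_it = sketch_lock_bounded(&queue_lock, 1);

	assert(cpu0_got_it);		/* the lock starts out free */
	(void)cpu0_got_it;

	if (cpu0_releases)
		sketch_unlock(&queue_lock);

	return sketch_lock_bounded(&queue_lock, 1000);
}
```

Calling sketch_replay(true) lets the second acquire succeed, while
sketch_replay(false) models CPU 0 stopping with the lock held, so the second
acquire gives up after its spin budget.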

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference
--

