Re: kern/39993: lockup on i386 SMP (raidframe related ?)

To: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/39993: lockup on i386 SMP (raidframe related ?)
From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Date: Sat, 22 Nov 2008 15:11:06 +0100

On Fri, Nov 21, 2008 at 08:29:18PM +0100, Manuel Bouyer wrote:
> >  Were it only that simple, I'd be happy...  Unfortunately, I've got a 
> >  couple of different boxes w/ 5.0_BETA+SMP+RAIDframe+heavy IO and I 
> >  haven't seen this problem at all.. :(
> >  
> >  What happens if you do:
> >  
> >    dd if=/dev/rsd0e of=/dev/null bs=1m &
> >    dd if=/dev/rsd1e of=/dev/null bs=1m &
> >  
> >  where rsd0e and rsd1e are the (raw) components of your RAID set?
> 
> works fine (I tried with the different components of the RAID).
> 
> I also tried to reproduce it on an Athlon X2 with 2 SATA drives, no luck.
> 
> other factors that may be relevant:
> - when this happens a background parity rewrite is running
> - there may be hardware issues with drives (like command timeouts),
>   so there may be aborted transfers/I/O errors reported to the raidframe 
>   layer.
> 
> What I could do is try letting it rebuild parity on the UP kernel, and reboot
> to the SMP kernel afterwards.

What I did:
- rebooted in UP mode and let the parity rebuild complete.
- rebooted in SMP mode, multiuser.
The system has now been up for about 17 hours with its usual load, so it looks
like the issue is related to the parity rebuild.

From the traces I gathered from ddb and gdb, it looks like CPU 1 is
trying to acquire a simple_lock (could it be in rf_DiskIOComplete, the
RF_LOCK_QUEUE_MUTEX(queue, "DiskIOComplete"); ?) while CPU 0 is halted with
this lock held.
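
To illustrate the suspected failure mode, here is a minimal user-space sketch
in C. It is not RAIDframe or kernel code: demo_lock and its functions are
hypothetical names, and the spin bound exists only so the demonstration
terminates. A real spin lock has no such bound, so if the holder never runs
again (as a halted CPU 0 would not), the waiter spins forever.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel spin lock: 0 = free, 1 = held. */
struct demo_lock {
    atomic_int held;
};

static void demo_lock_init(struct demo_lock *l)
{
    atomic_init(&l->held, 0);
}

/*
 * Try to take the lock, giving up after max_spins attempts.
 * The bound is an assumption made for this demo; an unbounded
 * spin would simply never return while the lock stays held.
 */
static bool demo_lock_try(struct demo_lock *l, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        int expected = 0;
        /* Atomically change 0 -> 1; succeeds only if the lock was free. */
        if (atomic_compare_exchange_strong(&l->held, &expected, 1))
            return true;
    }
    return false; /* holder never released the lock */
}

static void demo_lock_release(struct demo_lock *l)
{
    assert(atomic_load(&l->held) == 1); /* must be held to be released */
    atomic_store(&l->held, 0);
}
```

In this sketch, a first demo_lock_try() on a fresh lock succeeds, and a second
one fails for any spin count until demo_lock_release() is called, which
corresponds to what CPU 1 would experience if CPU 0 stopped while holding
the lock.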

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference