Subject: Re: i386 IPI panic
To: None <tech-smp@NetBSD.org>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-smp
Date: 11/27/2004 17:29:07
On Sat, Nov 27, 2004 at 01:24:14PM +0100, Manuel Bouyer wrote:
> On Thu, Nov 25, 2004 at 05:38:47PM +0100, Manuel Bouyer wrote:
> > Hi,
> > I got a panic on a box which got a SMP kernel yesterday evening (it worked
> > fine with a UP kernel for months):
> > panic: TLB IPI rendezvous failed (mask 1)
> > 
> > Here is the stack trace (although I suspect it's not useful):
> > pmap_tlb_shootdown
> > pmap_do_remove
> > pmap_remove
> > ubc_alloc
> > ffs_write
> > vn_write
> > dofilewrite
> > sys_write
> > 
> > I have similar boxes running SMP without problems, including one with
> > a similar workload. The only special thing this one has is that it has
> > some serial port activity (apcupsd running on one of the motherboard ports,
> > 8 serial consoles connected to an eight-port PUC device, some of them getting
> > verbose logs).
> > I've seen in the config file that com needs a special option on i386 SMP,
> > could it be that the serial driver blocks *all* interrupts (including IPI
> > ones) for too long?
> 
> 
> The box is dead again, this time it should sit at the debugger prompt.
> I may go to work this afternoon to restart it, is there anything special
> I can look at from ddb?

Ok, this time I had ddb.onpanic=1, and got a stack trace for both CPUs:

CPU 1 (the one that panicked):
panic
pmap_tlb_shootdown
pmap_kremove
pipe_direct_write
pipe_write
dofilewrite
sys_write
syscall_plain

CPU 0:
acquire
spinlock_acquire_count
mi_switch
ltsleep
sbwait
soreceive
soo_read
dofileread
sys_read
syscall_plain

So it looks like a deadlock on kernel_lock. Another peculiarity of this box is
that it's an amanda client, and indeed both times it panicked while amanda was
running.
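
To make the suspected sequence concrete, here is a toy single-threaded model
in C. It is only a sketch, not kernel code: every name in it is made up, and
it assumes that the CPU spinning for kernel_lock cannot take IPIs while it
spins, which the traces above suggest but do not prove. The sender keeps the
lock for the whole rendezvous, so the waiter can only acknowledge the
shootdown if it services IPIs from inside its spin loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names are NetBSD kernel symbols. */
struct toy_cpu {
    bool ipi_pending;   /* another CPU asked us to flush our TLB */
    bool ipis_blocked;  /* assumed: true while spinning for the big lock */
};

/* One iteration of the spin loop on the CPU waiting for the big lock. */
static void toy_spin_once(struct toy_cpu *cpu)
{
    /* The IPI is only serviced if it is not blocked while spinning. */
    if (cpu->ipi_pending && !cpu->ipis_blocked)
        cpu->ipi_pending = false;   /* shootdown acknowledged */
}

/*
 * The sender holds the big lock during the whole rendezvous, so the
 * waiter never leaves its spin loop on its own.  Returns true if the
 * waiter acknowledged within max_spins iterations, false in the case
 * where a real kernel would give up and panic.
 */
static bool toy_shootdown(bool waiter_blocks_ipis, int max_spins)
{
    struct toy_cpu waiter = {
        .ipi_pending = true,
        .ipis_blocked = waiter_blocks_ipis,
    };

    assert(max_spins > 0);
    for (int i = 0; i < max_spins; i++) {
        toy_spin_once(&waiter);
        if (!waiter.ipi_pending)
            return true;    /* rendezvous completed */
    }
    return false;           /* rendezvous failed */
}
```

In this model toy_shootdown(true, 1000) never completes, while
toy_shootdown(false, 1000) completes on the first spin, which is the kind of
difference a fix would have to make for the real spin loop.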

I don't know enough about this part of the kernel to propose a fix for this,
but I'm willing to test code :)

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--