Subject: kern/28541: mi_switch() can deadlock on biglock
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Manuel Bouyer <Manuel.Bouyer@lip6.fr>
List: netbsd-bugs
Date: 12/05/2004 20:00:01
>Number:         28541
>Category:       kern
>Synopsis:       mi_switch() can deadlock on biglock
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Dec 05 20:00:00 +0000 2004
>Originator:     Manuel Bouyer
>Release:        NetBSD 2.0_RC5
>Organization:
	ASIM/LIP6 http://www-asim.lip6.fr/
>Environment:
System: NetBSD 2.0_RC5 (RAI.MP) #0: Wed Nov 24 17:41:46 CET 2004 bouyer@pop.lip6.fr:/local/pop1/bouyer/netbsd-2-0/src/sys/arch/i386/compile/RAI.MP
Architecture: i386
Machine: i386
>Description:
	[initially posted on tech-smp and tech-kern]
	This SMP box reliably panics while running an amanda backup, with:
	panic: TLB IPI rendezvous failed (mask 1)
	I have another SMP box (same hardware) with a similar workload, which
	is working fine. The difference between the two is that this one has
	two 8-port puc devices for serial consoles (some of which get a lot
	of activity) and it is an amanda client.
	The stack traces show:
CPU 1 (the one that panicked):
panic
pmap_tlb_shootnow
pmap_kremove
pipe_direct_write
pipe_write
dofilewrite
sys_write
syscall_plain

CPU 0:
acquire
spinlock_acquire_count
mi_switch
ltsleep
sbwait
soreceive
soo_read
dofileread
sys_read
syscall_plain
	
	CPU0 is trying to reacquire kernel_lock in mi_switch(), while CPU1
	holds it and is trying to send an IPI to CPU0.
	But I don't know how this would prevent CPU0 from receiving the IPI.
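
	The suspected interaction can be sketched as a small user-space
	model. This is only an illustration, not kernel code: the name
	model_rendezvous() and its parameters are hypothetical, and the
	key assumption (that CPU0 spins on the lock with interrupts
	blocked, so it never services the IPI) is a guess that the traces
	above do not confirm.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the two CPUs in the traces above.
 * CPU1 holds the big lock and has posted an IPI to CPU0; it will not
 * release the lock until CPU0 acknowledges. CPU0 spins on the lock.
 *
 * ASSUMPTION: CPU0 services the IPI only if interrupts are enabled
 * while it spins. The report does not establish whether they are.
 *
 * Returns true if the rendezvous completes within max_spins passes,
 * false if it times out (where the kernel would panic).
 */
bool
model_rendezvous(bool cpu0_intr_enabled, int max_spins)
{
	bool ipi_pending = true;	/* CPU1 has posted the IPI */
	bool ipi_acked = false;		/* CPU0 has not serviced it yet */

	assert(max_spins > 0);

	for (int spin = 0; spin < max_spins; spin++) {
		/* CPU0: one pass of the spin loop on the big lock. */
		if (cpu0_intr_enabled && ipi_pending) {
			ipi_pending = false;
			ipi_acked = true;
		}
		/* CPU0 cannot get the lock here: CPU1 still owns it. */

		/* CPU1: poll for the acknowledgement while holding the lock. */
		if (ipi_acked)
			return true;	/* CPU1 can go on and release the lock */
	}
	return false;			/* rendezvous failed */
}
```

	In this model, model_rendezvous(true, 1000) completes on the first
	pass, while model_rendezvous(false, 1000) never completes: CPU0
	waits for the lock, CPU1 waits for the acknowledgement, and neither
	can make progress.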

>How-To-Repeat:
	Run several mrtg instances, and an amanda client on a dual-CPU box.
>Fix:
	unknown.