Subject: Re: kern/25285: i386 MP panic: TLB IPI rendezvous failed (mask 1)
To: None <dokas@cs.umn.edu>
From: Erik E. Fair <fair@netbsd.org>
List: current-users
Date: 06/10/2004 10:46:34
We need to know:

1. Which interrupts had each CPU masked off while it was spinning, 
waiting for a lock? (Offhand, one or more of them would appear to 
have masked off the TLB IPI...)

2. What data structures was each of those CPUs attempting to 
manipulate behind each lock request? It might be possible to protect 
them with separate locks so that there isn't contention for the same 
big lock.

This looks like a deadlock caused by an interaction between 
interrupt masking and our mutex subsystem. At least one of those 
spinning CPUs has masked off the TLB IPI, then attempted to acquire 
the kernel biglock, and is spinning. Another CPU attempts a TLB 
shootdown (probably while holding the kernel biglock), and the 
rendezvous fails because the spinning CPU cannot take the TLB IPI 
while it is masked off, and so never acknowledges it.
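
To make the suspected sequence concrete, here is a toy single-threaded 
model in C. It is only a sketch of the hypothesis above, not NetBSD 
code: struct cpu, spin_once(), tlb_shootdown(), simulate() and the 
spin limit are all names and values I made up for illustration, and 
the real kernel paths will differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; none of these names come from the kernel. */
struct cpu {
	bool ipi_masked;   /* TLB IPI blocked at the current interrupt level */
	bool ipi_pending;  /* a shootdown request has been posted to this CPU */
	bool ipi_acked;    /* this CPU has handled and acknowledged the request */
};

/*
 * One iteration of a CPU spinning for the big lock.  The pending IPI is
 * only serviced if it is not masked; otherwise the CPU just keeps spinning.
 */
static void
spin_once(struct cpu *c)
{
	if (c->ipi_pending && !c->ipi_masked) {
		c->ipi_pending = false;
		c->ipi_acked = true;
	}
}

/*
 * The initiating CPU (assumed to hold the big lock) posts the IPI and
 * waits a bounded number of iterations for the acknowledgement.
 * Returns true if the rendezvous completed, false if it timed out.
 */
static bool
tlb_shootdown(struct cpu *target, int max_spins)
{
	assert(max_spins > 0);

	target->ipi_pending = true;
	target->ipi_acked = false;

	for (int i = 0; i < max_spins; i++) {
		spin_once(target);  /* target is still waiting for the lock */
		if (target->ipi_acked)
			return true;
	}
	return false;               /* the "rendezvous failed" case */
}

/* Run the scenario with the TLB IPI masked or unmasked on the spinner. */
bool
simulate(bool masked)
{
	struct cpu spinner = { .ipi_masked = masked };

	return tlb_shootdown(&spinner, 1000);
}
```

In this model simulate(true) returns false (the rendezvous times out) 
and simulate(false) returns true, which is the behaviour I would 
expect if a CPU spins for the biglock with the TLB IPI masked.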

At least the system didn't silently hang.

It's also instructive that all the CPUs other than the one 
attempting the shootdown are waiting for the kernel lock. We need 
finer-grained locking than this to prevent this level of contention.

	Erik <fair@netbsd.org>