tech-kern: Re: Getting "TLB IPI rendezvous failed..."

Subject: Re: Getting "TLB IPI rendezvous failed..."
To: Frank van der Linden <fvdl@netbsd.org>
From: Frederick Bruckman <fredb@immanent.net>
List: tech-kern
Date: 12/23/2004 19:53:47

On Thu, 23 Dec 2004, Frank van der Linden wrote:

> On Thu, Dec 23, 2004 at 12:56:26AM -0600, Frederick Bruckman wrote:
>> 2) The general pattern seems to be that one cpu is at splipi(), waiting
>> for a lock, while the other cpu insists on doing something to the first
>> cpu, and has no way to back off? I wonder why it's only i386.
>
> That's the general deadlock pattern: one CPU is at a very high spl
> (splipi, which is the highest possible), waiting to acquire a lock. Another
> CPU holds the lock, and has to do something which involves sending an IPI
> and waiting for the other CPUs to receive it. But, the first CPU never
> gets it.

...because it's at splipi()?  That sounds too obvious.  I wish I
understood how that was ever supposed to work. I mean, if there are 
places where you can block all interrupts waiting for a lock, how is 
it that the other processor can ever assume it can send you an IPI, 
and that you are guaranteed to receive it?
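
To make the pattern concrete, here is a toy model of it in C. This is
not NetBSD code: struct cpu, cpu_spin_step(), ipi_rendezvous() and the
IPL_* values are all made up for illustration, and the bounded wait is
only an assumption about how the sending CPU eventually gives up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical priority levels; only their ordering matters here. */
enum { IPL_NONE = 0, IPL_IPI = 15 };

/* Toy model of one CPU. Not a NetBSD structure. */
struct cpu {
	int  spl;         /* current interrupt priority level */
	bool ipi_pending; /* an IPI was sent but has not been handled yet */
	bool ipi_acked;   /* the IPI handler ran and acknowledged */
};

/*
 * One simulated time step on the target CPU while it spins on a lock.
 * A pending IPI is only taken if the CPU is below IPL_IPI; at IPL_IPI
 * it stays pending and the spin loop just goes round again.
 */
static void
cpu_spin_step(struct cpu *c)
{
	if (c->ipi_pending && c->spl < IPL_IPI) {
		c->ipi_pending = false;
		c->ipi_acked = true;
	}
}

/*
 * The lock holder sends an IPI to the target and waits, for at most
 * max_steps steps, for the acknowledgement. Returns true if the
 * rendezvous completed, false if the wait ran out.
 */
static bool
ipi_rendezvous(struct cpu *target, int max_steps)
{
	assert(max_steps > 0);
	target->ipi_pending = true;
	target->ipi_acked = false;
	for (int i = 0; i < max_steps; i++) {
		cpu_spin_step(target);
		if (target->ipi_acked)
			return true;
	}
	return false;
}
```

With a target at spl = IPL_IPI, ipi_rendezvous() returns false no
matter how large max_steps is, since nothing inside the spin loop ever
lowers the spl. With a target at IPL_NONE it returns true on the first
step.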

> I don't know why this problem has resurfaced recently for some people.
>
> Manuel is right, collecting the traces is the most important thing, it
> will show where the CPUs get stuck.

Did it again. Same place on CPU 6 in uvm_glue.c as last time, but on 
CPU 0, not in mi_switch(), this time...

acquire
lockmgr
x86_intlock
Xintr_ioapic_level10
--- interrupt ---
Xspllower
mpidle
preempt
trap
--- trap (number 3) ---

Has anyone seen these in current? I don't get it in current, but that 
could be only because it's one of those weird things.

Frederick