Subject: Re: Getting "TLB IPI rendezvous failed..."
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: Stephan Uphoff <ups@tree.com>
List: tech-kern
Date: 02/08/2005 14:46:06
On Thu, 2005-01-27 at 10:47, Manuel Bouyer wrote:
> On Thu, Jan 27, 2005 at 04:31:48PM +0100, Manuel Bouyer wrote:
> > Another one:
> > pmap_tlb_shootnow: CPU 0 interrupt level 0xc pending 0x400 depth 1 ci_ipis 16
> 
> And some more:
> npxsave_lwp: CPU 0 interrupt level 0x6 pending 0x10000400 depth 1 ci_ipis 8
> pmap_tlb_shootnow: CPU 0 interrupt level 0xd pending 0x400 depth 0 ci_ipis 16
> pmap_tlb_shootnow: CPU 0 interrupt level 0xd pending 0x400 depth 0 ci_ipis 16
> pmap_tlb_shootnow: CPU 0 interrupt level 0xe pending 0x400 depth 1 ci_ipis 0
> 
> In the last one, the processor is at splipi() already. Maybe it is
> already processing our IPI, and we didn't wait long enough?
> We're doing 10M loops here. How long will that take on a 1 GHz CPU?
> Could the other CPU be blocked long enough on a register I/O?
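
As a rough feel for the numbers: the time budget is the iteration count
times the cycles per iteration, divided by the clock rate. I have not
measured the cost per iteration of the wait loop, so the cycle counts
used here are assumptions, and spin_budget_us() is a made-up helper for
illustration, not kernel code.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the kernel): microseconds spent in a
 * spin loop of `iterations` passes, assuming each pass costs
 * `cycles_per_iter` cycles on a CPU clocked at `hz`.
 * The intermediate product must fit in 64 bits.
 */
static uint64_t
spin_budget_us(uint64_t iterations, uint64_t cycles_per_iter, uint64_t hz)
{
	return (iterations * cycles_per_iter * 1000000ULL) / hz;
}
```

At one cycle per iteration, 10M loops at 1 GHz come to about 10 ms; at
an assumed ten cycles per iteration it would be about 100 ms.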

OK - after going through all the code yet again, I see only three major
(im)probable causes.

1) The APIC IPI mechanism does not work as advertised and CPU 0 never
received an interrupt. However, the "Intel Specification Updates" for
your CPU do not indicate any such problem. The assembler interrupt stubs
may generate some extra or out-of-order "End of Interrupt" writes to the
local APIC, but this should not cause your problem either.

2) The IPI interrupt level pending bit gets lost.
   However, the only way I can imagine this happening is if the C
compiler uses aliased memory in the inline assembly sequence that sets
the pending bits for "soft interrupts".
I looked at the assembly code and it seems fine (besides having an
unneeded LOCK prefix).
However, defining the interrupt level pending field as volatile may be a
good idea.

3) You fabricated the debug messages just to drive me crazy ;-)
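
To make cause 2 concrete, here is a minimal sketch of setting a pending
bit with a single OR to memory. This is not the NetBSD code: struct
cpu_sketch, ipending and set_pending() are names made up for
illustration. It assumes GCC-style inline assembly on x86, with a plain
C fallback elsewhere, and it assumes only the owning CPU ever writes the
field. Under that assumption a single OR instruction cannot be split by
an interrupt on that CPU, so no LOCK prefix is used.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU state; not the real structure. */
struct cpu_sketch {
	/* volatile: the compiler must reload the field on every access. */
	volatile uint32_t ipending;
};

/* Mark interrupt level `bit` (0..31) as pending on this CPU. */
static inline void
set_pending(struct cpu_sketch *ci, unsigned int bit)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/*
	 * "+m" tells the compiler that the asm both reads and writes
	 * ci->ipending, so it cannot work on a stale copy of the field.
	 */
	__asm__ __volatile__("orl %1, %0"
	    : "+m" (ci->ipending)
	    : "ir" ((uint32_t)1 << bit)
	    : "cc");
#else
	/* Portable fallback: not a single instruction on every machine. */
	ci->ipending |= (uint32_t)1 << bit;
#endif
}
```

If the memory operand were passed as an input only, the compiler would
be free to assume the field is unchanged across the asm statement,
which is the kind of aliasing problem I have in mind.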

I have some ideas on how to narrow down where to look for your problem.
Hopefully I will get some time in the next few days to write and test
some patches.

Stephan