tech-kern: Re: Getting "TLB IPI rendezvous failed..."

Subject: Re: Getting "TLB IPI rendezvous failed..."
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: Stephan Uphoff <ups@tree.com>
List: tech-kern
Date: 01/25/2005 22:24:50

On Tue, 2005-01-25 at 11:19, Manuel Bouyer wrote: 
> On Tue, Jan 25, 2005 at 04:52:38PM +0100, Manuel Bouyer wrote:
> > On Sat, Jan 22, 2005 at 09:21:59AM -0500, Stephan Uphoff wrote:
> > > Could you try the attached patch?
> > > Please make sure that all your com devices show up.
> > 
> > OK, I finally got around to try it and have interesting results.
> > First, I tried it yesterday, but the RAID parity was not clean, so raidframe
> > didn't use both disks. I couldn't get the box to panic.
> > I tried again today with a clean parity, I got the panic as expected.
> > 
> > I added additional debug printfs, see the attached patch.
> here is the patch in question

Just to make sure - did you also modify sys/arch/x86/include/intrdefs.h?
-#define IDT_INTR_HIGH  0xef
+#define IDT_INTR_HIGH  0xdf

OK - I guess it is time for some experiments since I am out of ideas
where to look in the source code.

Could you add ci_ipis to your CPU printout?
Maybe there is no IPI pending for CPU 0 and we can stop looking at the
interrupt delivery.
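
A minimal sketch of such a printout, written as plain user-space C so it
can be tried on its own. The struct cpu_dbg and format_cpu_state() names
are made up for illustration; only the ci_ipis and ci_tlb_ipi_mask field
names come from this thread, and the real fields live in the kernel's
per-CPU structure.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-CPU data we want to see. */
struct cpu_dbg {
	int      id;            /* CPU number */
	uint32_t ipis;          /* models ci_ipis: pending IPI bits */
	uint32_t tlb_ipi_mask;  /* models ci_tlb_ipi_mask */
};

/*
 * Format one CPU's state into buf. Returns what snprintf returns.
 * If ipis prints as zero for CPU 0, no IPI is pending there.
 */
int
format_cpu_state(char *buf, size_t len, const struct cpu_dbg *ci)
{
	return snprintf(buf, len,
	    "cpu%d: ci_ipis=0x%08x ci_tlb_ipi_mask=0x%08x",
	    ci->id, (unsigned)ci->ipis, (unsigned)ci->tlb_ipi_mask);
}
```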

Replacing the panic in pmap_tlb_shootnow with some printf statements and
a goto to the start of the function may show us whether this is a
deadlock or a race condition. (Maybe panic after 5 retries?)
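
Roughly what I have in mind, as a user-space sketch rather than a kernel
patch. All names here (shoot_with_retries, rendezvous_fn, fail_n_times,
SHOOT_MAX_RETRIES) are made up for illustration; the attempt callback
stands in for one pass through the rendezvous code.

```c
#include <stdbool.h>
#include <stdio.h>

#define SHOOT_MAX_RETRIES 5	/* hypothetical limit: give up after 5 retries */

/* Hypothetical stand-in for one rendezvous attempt; true means success. */
typedef bool (*rendezvous_fn)(void *arg);

/*
 * Run the attempt, retrying on failure instead of panicking at once.
 * Returns the number of failed attempts before success, or -1 if the
 * first attempt and all SHOOT_MAX_RETRIES retries failed (the caller
 * would panic at that point).
 */
int
shoot_with_retries(rendezvous_fn attempt, void *arg)
{
	int failures = 0;

	for (;;) {
		if (attempt(arg))
			return failures;
		failures++;
		printf("rendezvous attempt %d failed\n", failures);
		if (failures > SHOOT_MAX_RETRIES)
			return -1;
	}
}

/* Example attempt: fails until the counter that arg points to hits zero. */
bool
fail_n_times(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

If a retry succeeds, that points towards a race; if every attempt fails,
it looks more like a deadlock.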

There are also some nice Intel "Specification Updates" that warn of some
order violations when mixing atomic and non-atomic operations.
Just to make sure that this is not our problem, can you replace the line
	self->ci_tlb_ipi_mask = cpumask;
in pmap_tlb_shootnow with 
	x86_atomic_setbits(&self->ci_tlb_ipi_mask,cpumask);
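
To illustrate the difference outside the kernel, here is a small sketch
using C11 <stdatomic.h>. The struct cpu_sketch, mask_setbits() and
mask_clearbits() names are made up; this is not the kernel's
x86_atomic_setbits implementation, only the same idea of an atomic
read-modify-write, which on x86 compilers typically emit as a
lock-prefixed instruction. Note that setting bits only matches the plain
assignment if the mask is zero beforehand, which I am assuming here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU structure; only the mask is modeled. */
struct cpu_sketch {
	_Atomic uint32_t tlb_ipi_mask;
};

/* OR bits into the mask with an atomic read-modify-write. */
static inline void
mask_setbits(struct cpu_sketch *ci, uint32_t bits)
{
	atomic_fetch_or(&ci->tlb_ipi_mask, bits);
}

/* Clear bits the same way, as a remote CPU would when acknowledging. */
static inline void
mask_clearbits(struct cpu_sketch *ci, uint32_t bits)
{
	atomic_fetch_and(&ci->tlb_ipi_mask, ~bits);
}
```

With every access to the mask going through atomic operations, we no
longer mix atomic and non-atomic writes to the same word.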

I don't think that this is a cli() problem since the debugger seems to
work just fine and it uses a special IPI.

Thanks
Stephan