Subject: TLB IPI rendez-vous failed
To: None <firstname.lastname@example.org, email@example.com>
From: Matthias Drochner <M.Drochner@fz-juelich.de>
Date: 03/11/2004 20:20:41
Under some load patterns involving simultaneous floating-point,
I/O and process creation activity, my dual-Opteron box
panics with the message in the subject line within minutes or hours.
The stack traceback always looks similar: CPU 1 (the secondary,
non-interrupt-handling CPU) like:
and the boot CPU like:
So the secondary CPU has grabbed the kernel lock, and the boot
CPU spins waiting for it. IPIs should get through anyway.
The bit corresponding to TLB flush is set in cpu0->ci_ipis,
but the processor evidently either didn't get the interrupt or
didn't see that bit in ci_ipis.
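
For readers unfamiliar with the scheme, here is a minimal C sketch of a
pending-bit IPI protocol of the kind described above. It is a toy model,
not the NetBSD code: struct toy_cpu, the toy_* functions and the IPI
numbers are hypothetical, and only the field name ci_ipis is taken from
this message.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical IPI numbers, chosen for illustration only. */
enum { IPI_FPU_SYNC = 0, IPI_TLB_FLUSH = 1, IPI_COUNT = 2 };

/* Toy stand-in for the per-CPU structure; only the pending mask matters. */
struct toy_cpu {
    _Atomic uint32_t ci_ipis;      /* one pending bit per IPI type */
    unsigned handled[IPI_COUNT];   /* how often each IPI type was serviced */
};

/* Sender side: mark the request pending. A real kernel would then
 * raise the interrupt on the target CPU through the local APIC. */
static void toy_send_ipi(struct toy_cpu *ci, int ipi)
{
    atomic_fetch_or(&ci->ci_ipis, UINT32_C(1) << ipi);
}

/* Receiver side: atomically take all pending bits, then service each
 * one. Returns the mask that was serviced. */
static uint32_t toy_ipi_handler(struct toy_cpu *ci)
{
    uint32_t pending = atomic_exchange(&ci->ci_ipis, 0);

    for (int bit = 0; bit < IPI_COUNT; bit++)
        if (pending & (UINT32_C(1) << bit))
            ci->handled[bit]++;
    return pending;
}
```

With a scheme like this, a bit that stays set in the mask means the
handler never ran, or ran and sampled the mask before the bit was set.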
My impression is that this results from IPIs getting lost when
they arrive in rapid succession. FPU sync operations are relatively
expensive, so it might happen that an FPU sync IPI is still in
progress when a TLB flush IPI is issued.
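
To make "rapid succession" concrete, here is a small C sketch of how a
local APIC latches requests for a single vector: one request can be in
service and one more can be held pending, and further arrivals are
merged into the pending one. This is a deliberately simplified model
with hypothetical names (struct toy_vector, toy_arrive, toy_eoi); it
ignores interrupt priorities and the CPU's interrupt flag, and it does
not claim to reproduce the actual race.

```c
#include <stdbool.h>

/* Toy state of one interrupt vector in a local APIC. */
struct toy_vector {
    bool irr;            /* request latched, not yet delivered */
    bool isr;            /* handler running, EOI not yet written */
    unsigned delivered;  /* number of handler invocations started */
    unsigned merged;     /* arrivals that found irr already set */
};

/* Deliver a latched request if nothing is in service. */
static void toy_dispatch(struct toy_vector *v)
{
    if (v->irr && !v->isr) {
        v->irr = false;
        v->isr = true;
        v->delivered++;
    }
}

/* A new request arrives for this vector. */
static void toy_arrive(struct toy_vector *v)
{
    if (v->irr)
        v->merged++;     /* already pending: no extra delivery */
    else
        v->irr = true;
    toy_dispatch(v);
}

/* The handler writes EOI: end of service, deliver what is pending. */
static void toy_eoi(struct toy_vector *v)
{
    v->isr = false;
    toy_dispatch(v);
}
```

In this model, three requests arriving before the first EOI produce only
two handler invocations, so a handler has to pick up every pending
request from the software mask rather than count interrupts.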
The appended patch makes my dual-amd64 run stable.
I don't quite understand what the assumed race condition looks
like exactly; perhaps someone has some more imagination :-)
The i386 code is apparently identical, so this might be an
issue there too.
Content-Type: text/plain ; name="ipihdl.txt"; charset=us-ascii
Content-Disposition: attachment; filename="ipihdl.txt"
--- vector.S.~1.3.~ Fri Feb 27 13:13:44 2004
+++ vector.S Thu Mar 11 18:45:53 2004
@@ -305,7 +305,7 @@ IDTVEC(intr_lapic_ipi)
- movl $0,_C_LABEL(local_apic)+LAPIC_EOI
+# movl $0,_C_LABEL(local_apic)+LAPIC_EOI
@@ -315,6 +315,7 @@ IDTVEC(intr_lapic_ipi)
+ movl $0,_C_LABEL(local_apic)+LAPIC_EOI
orl $(1 << LIR_IPI),CPUVAR(IPENDING)