Subject: Re: SMP stability issues
To: Chris Rendle-Short <jim@tty1.rr.nu>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-smp
Date: 11/12/2006 11:45:13
--k+w/mQv8wyuph6w0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Sun, Nov 12, 2006 at 01:23:34PM +1100, Chris Rendle-Short wrote:
> Well, I just tried running GENERIC.MPACPI like some of the others suggested,
> however it is still locking up. Here is the dmesg from GENERIC.MPACPI
> (although it looks like I might need to check my ACPI configuration in the
> BIOS).

It looks like it's using ACPI.

> I will also try a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG enabled
> as you suggested. Is it likely to matter whether or not ACPI is enabled in
> the test kernel?

No, these checks are independent of ACPI vs MPBIOS.

> pchb0 at pci0 dev 0 function 0
> pchb0: VIA Technologies VT82C691 (Apollo Pro) Host-PCI (rev. 0xc4)

OK, this is the same motherboard as I have here (I have several of these). I
also have issues with them; I guess the debug options
will show you that the CPU is missing IPI interrupts on occasion.
If so, the attached patch should help (my boxes are rock solid with this
patch). Note that it's only active if you have
options DIAGNOSTIC
in your kernel config.
Actually I suspect this is a bug in the chipset; I have Intel-based dual-PIII
motherboards which don't have this issue, nor do P4 SMP systems.
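
To illustrate the retry-before-panic idea in the attached patch, here is a
minimal user-space sketch in C. It is not kernel code: struct fake_cpu,
send_ipi(), rendezvous(), SPIN_LIMIT and RETRY_LIMIT are hypothetical names
made up for this example, and the "lost IPI" is simulated with a simple
counter. The retry limit of 5 mirrors the patch. The sketch restarts its
spin counter on every attempt, which is a choice made here and not something
taken from the patch.

```c
#include <stdbool.h>

/* All names below are hypothetical; none of this is NetBSD kernel API. */
#define SPIN_LIMIT  1000  /* the patch spins 10000000 times; kept small here */
#define RETRY_LIMIT 5     /* same bound as "ipi_retry++ < 5" in the patch */

struct fake_cpu {
	unsigned pending_mask;  /* stands in for ci_tlb_ipi_mask */
	int sends_until_ack;    /* simulation: the Nth send is the one that lands */
	int sends_seen;         /* how many times the IPI has been sent so far */
};

/* Simulated IPI: earlier sends are "lost", the Nth one clears the mask. */
static void
send_ipi(struct fake_cpu *c)
{
	c->sends_seen++;
	if (c->sends_seen >= c->sends_until_ack)
		c->pending_mask = 0;
}

/*
 * Send the IPI and spin until it is acknowledged. If the spin limit is
 * exceeded, resend up to RETRY_LIMIT times. Returns true once the mask
 * clears, false at the point where the kernel code would panic.
 */
static bool
rendezvous(struct fake_cpu *c)
{
	int retry = 0;

	for (;;) {
		int spins = 0;

		send_ipi(c);
		while (c->pending_mask != 0) {
			if (spins++ > SPIN_LIMIT)
				break;  /* waited long enough, assume the IPI was lost */
		}
		if (c->pending_mask == 0)
			return true;
		if (retry++ >= RETRY_LIMIT)
			return false;  /* retries exhausted */
	}
}
```

With RETRY_LIMIT set to 5 the IPI is sent at most six times (one initial
send plus five retries) before the function gives up.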

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--

--k+w/mQv8wyuph6w0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="diff.via"

Index: i386/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/i386/pmap.c,v
retrieving revision 1.181.2.2
diff -u -r1.181.2.2 pmap.c
--- i386/pmap.c	26 Sep 2005 20:24:52 -0000	1.181.2.2
+++ i386/pmap.c	12 Nov 2006 10:42:15 -0000
@@ -3652,6 +3652,7 @@
 	int s;
 #ifdef DIAGNOSTIC
 	int count = 0;
+	int ipi_retry = 0;
 #endif
 #endif
 
@@ -3672,6 +3673,9 @@
 	/*
 	 * Send the TLB IPI to other CPUs pending shootdowns.
 	 */
+#ifdef DIAGNOSTIC
+ipi_again:
+#endif
 	for (CPU_INFO_FOREACH(cii, ci)) {
 		if (ci == self)
 			continue;
@@ -3683,9 +3687,20 @@
 
 	while (self->ci_tlb_ipi_mask != 0) {
 #ifdef DIAGNOSTIC
-		if (count++ > 10000000)
+		if (count++ > 10000000) {
+			for (CPU_INFO_FOREACH(cii, ci)) {
+				if (ci == self)
+					continue;
+				printf("CPU %ld interrupt level 0x%x pending "
+				    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+				    ci->ci_ilevel, ci->ci_ipending,
+				    ci->ci_idepth, ci->ci_ipis);
+			}
+			if (ipi_retry++ < 5)
+				goto ipi_again;
 			panic("TLB IPI rendezvous failed (mask %x)",
 			    self->ci_tlb_ipi_mask);
+		}
 #endif
 		x86_pause();
 	}
Index: isa/npx.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/isa/npx.c,v
retrieving revision 1.107.4.1
diff -u -r1.107.4.1 npx.c
--- isa/npx.c	12 May 2006 15:41:46 -0000	1.107.4.1
+++ isa/npx.c	12 Nov 2006 10:42:16 -0000
@@ -752,6 +752,8 @@
 	} else {
 #ifdef DIAGNOSTIC
 		int spincount;
+		int ipi_retry = 0;
+ipi_again:
 #endif
 
 		IPRINTF(("%s: fp ipi to %s %s lwp %p\n",
@@ -770,6 +772,16 @@
 #ifdef DIAGNOSTIC
 			spincount++;
 			if (spincount > 10000000) {
+				printf("CPU %ld interrupt level 0x%x pending "
+				    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+				    ci->ci_ilevel, ci->ci_ipending,
+				    ci->ci_idepth, ci->ci_ipis);
+				printf("CPU %ld interrupt level 0x%x pending "
+				    "0x%x depth %d ci_ipis %d\n", oci->ci_cpuid,
+				    oci->ci_ilevel, oci->ci_ipending,
+				    oci->ci_idepth, oci->ci_ipis);
+				if (ipi_retry++ < 5)
+					goto ipi_again;
 				panic("fp_save ipi didn't");
 			}
 #endif

--k+w/mQv8wyuph6w0--