Subject: Re: kern/32757: TLB IPI rendezvous fails sometimes
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-bugs
Date: 02/06/2006 22:10:03
The following reply was made to PR kern/32757; it has been noted by GNATS.

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: kern/32757: TLB IPI rendezvous fails sometimes
Date: Mon, 6 Feb 2006 23:06:26 +0100

 --SUOF0GtieIMvvwua
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline
 
 On Mon, Feb 06, 2006 at 09:50:01AM +0000, seebs wrote:
 > Machine: i386
 > >Description:
 > 	On at least some motherboards, NetBSD 2.1 occasionally fails with TLB
 > 	IPI rendezvous failed.  The patch (from pmap.c 1.184) is verified
 > 	present.
 > >How-To-Repeat:
 > 	Run under load.
 > 
 > 	Someone else on the NetBSD lists reports the same behavior with a
 > 	Pentium 3 system, suggesting that this isn't just a specific
 
 I'm the one who reported the problem. The hardware is a 1 GHz PIII on an
 MSI 694D-Pro 2 motherboard:
 mainbus0 (root)
 mainbus0: Intel MP Specification (Version 1.4) (OEM00000 PROD00000000)
 cpu0 at mainbus0: apid 0 (boot processor)
 cpu0: Intel Pentium III (686-class), 1002.37 MHz, id 0x68a
 cpu0: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
 cpu0: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
 cpu0: features 387fbff<FXSR,SSE>
 cpu0: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
 cpu0: L2 cache 256 KB 32B/line 8-way
 cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
 cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
 cpu0: serial number 0000-068A-0001-DDD6-4ED7-4704
 cpu0: calibrating local timer
 cpu0: apic clock running at 133 MHz
 cpu0: 8 page colors
 cpu1 at mainbus0: apid 1 (application processor)
 cpu1: starting
 cpu1: Intel Pentium III (686-class), 1002.28 MHz, id 0x68a
 cpu1: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
 cpu1: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
 cpu1: features 387fbff<FXSR,SSE>
 cpu1: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
 cpu1: L2 cache 256 KB 32B/line 8-way
 cpu1: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
 cpu1: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
 cpu1: serial number 0000-068A-0003-ADAB-C15A-1E54
 pchb0 at pci0 dev 0 function 0
 pchb0: VIA Technologies VT82C691 (Apollo Pro) Host-PCI (rev. 0xc4)
 pcib0 at pci0 dev 7 function 0
 pcib0: VIA Technologies VT82C686A PCI-ISA Bridge (rev. 0x40)
 viaide0 at pci0 dev 7 function 1
 viaide0: VIA Technologies VT82C686A (Apollo KX133) ATA100 controller
 
 I still see it with NetBSD 3.0, for both TLB IPIs and FPU IPIs.
 I'm running with the attached patch, and all my systems are stable with
 it. I have several systems based on the same hardware running SMP with
 different workloads. All of them show the problem, anywhere from once a
 day to once every several weeks, depending on the workload.
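
The bounded resend in the attached patch can be pictured with a small userspace sketch. Everything in it is hypothetical scaffolding for illustration (struct fake_cpu, fake_send_ipi, rendezvous, MAX_RETRIES) and none of it is kernel code. One deliberate difference, which is an assumption and not something the patch does in the hunks shown: the sketch resets the spin counter before each attempt, so that every resend gets a full wait.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_RETRIES 5	/* same retry bound as the patch */

/* Hypothetical stand-in for a remote CPU; not a kernel structure. */
struct fake_cpu {
	unsigned int mask;	/* pending acks, playing the role of ci_tlb_ipi_mask */
	int sends_needed;	/* IPIs that must be sent before the ack arrives */
	int sends_seen;		/* IPIs sent so far */
};

/* Simulated IPI send: the target acknowledges once enough sends arrived. */
static void
fake_send_ipi(struct fake_cpu *c)
{
	if (++c->sends_seen >= c->sends_needed)
		c->mask = 0;
}

/*
 * Send, spin up to spin_limit iterations, and resend on timeout.
 * Returns the number of resends used, or -1 where the kernel would panic.
 */
static int
rendezvous(struct fake_cpu *c, long spin_limit)
{
	int ipi_retry = 0;
	long count;

	assert(spin_limit > 0);
ipi_again:
	fake_send_ipi(c);
	count = 0;	/* assumption: fresh spin budget for every attempt */
	while (c->mask != 0) {
		if (count++ > spin_limit) {
			/* Mirror the patch: report state before retrying. */
			printf("timeout, mask 0x%x, retry %d\n",
			    c->mask, ipi_retry);
			if (ipi_retry++ < MAX_RETRIES)
				goto ipi_again;
			return -1;
		}
	}
	return ipi_retry;
}
```

With a fake CPU that only acknowledges the third IPI, rendezvous() returns 2 (two resends). A CPU that never acknowledges receives the initial send plus five resends before the function gives up, which is where the kernel code would panic.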
 
 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 years of experience will always make the difference
 --
 
 --SUOF0GtieIMvvwua
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff
 
 Index: i386/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/i386/i386/pmap.c,v
 retrieving revision 1.181.2.2
 diff -u -r1.181.2.2 pmap.c
 --- i386/pmap.c	26 Sep 2005 20:24:52 -0000	1.181.2.2
 +++ i386/pmap.c	6 Feb 2006 19:37:12 -0000
 @@ -3652,6 +3652,7 @@
  	int s;
  #ifdef DIAGNOSTIC
  	int count = 0;
 +	int ipi_retry = 0;
  #endif
  #endif
  
 @@ -3672,6 +3673,9 @@
  	/*
  	 * Send the TLB IPI to other CPUs pending shootdowns.
  	 */
 +#ifdef DIAGNOSTIC
 +ipi_again:
 +#endif
  	for (CPU_INFO_FOREACH(cii, ci)) {
  		if (ci == self)
  			continue;
 @@ -3683,9 +3687,20 @@
  
  	while (self->ci_tlb_ipi_mask != 0) {
  #ifdef DIAGNOSTIC
 -		if (count++ > 10000000)
 +		if (count++ > 10000000) {
 +			for (CPU_INFO_FOREACH(cii, ci)) {
 +				if (ci == self)
 +					continue;
 +				printf("CPU %ld interrupt level 0x%x pending "
 +				    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
 +				    ci->ci_ilevel, ci->ci_ipending,
 +				    ci->ci_idepth, ci->ci_ipis);
 +			}
 +			if (ipi_retry++ < 5)
 +				goto ipi_again;
  			panic("TLB IPI rendezvous failed (mask %x)",
  			    self->ci_tlb_ipi_mask);
 +		}
  #endif
  		x86_pause();
  	}
 Index: isa/npx.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/i386/isa/npx.c,v
 retrieving revision 1.107
 diff -u -r1.107 npx.c
 --- isa/npx.c	3 Feb 2005 21:08:58 -0000	1.107
 +++ isa/npx.c	6 Feb 2006 19:37:12 -0000
 @@ -732,6 +732,8 @@
  	} else {
  #ifdef DIAGNOSTIC
  		int spincount;
 +		int ipi_retry = 0;
 +ipi_again:
  #endif
  
  		IPRINTF(("%s: fp ipi to %s %s lwp %p\n",
 @@ -750,6 +752,16 @@
  #ifdef DIAGNOSTIC
  			spincount++;
  			if (spincount > 10000000) {
 +				printf("CPU %ld interrupt level 0x%x pending "
 +				    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
 +				    ci->ci_ilevel, ci->ci_ipending,
 +				    ci->ci_idepth, ci->ci_ipis);
 +				printf("CPU %ld interrupt level 0x%x pending "
 +				    "0x%x depth %d ci_ipis %d\n", oci->ci_cpuid,
 +				    oci->ci_ilevel, oci->ci_ipending,
 +				    oci->ci_idepth, oci->ci_ipis);
 +				if (ipi_retry++ < 5)
 +					goto ipi_again;
  				panic("fp_save ipi didn't");
  			}
  #endif
 
 --SUOF0GtieIMvvwua--