Subject: Re: reproducible kernel panic w/ 2.0RC4MP
To: None <port-macppc@netbsd.org>
From: Tim Kelly <hockey@dialectronics.com>
List: port-macppc
Date: 12/03/2004 12:40:31
Ok, I've made significant progress on this and I have a patch that I
believe will at least reduce the problem or fix it outright. I say this
because I was able to identify the problem to the point that I could
reproduce it at will, and I've also conclusively confirmed one of the two
original premises as to why CPU0 was not acknowledging the IPI: its PSL_EE
was turned off, so it was not receiving external interrupts.

The problem occurred when the network would get backed up on CPU0, such as
during an ssh session with a lot of output to the remote computer over a
small pipe or with limited processing on the other end, and the kernel
would apparently need to write to disk for buffering. The other CPU, CPU1,
would be in the process of doing a vfork, which I think would tie up the
disk. I couldn't pinpoint the exact race conditions, only the symptoms.
After both adding some more diagnostic code and grasping how to make it
panic at will, I definitively determined that CPU0 was trying to acquire a
simple lock for a deferred network interrupt. Michael's sync mesh patch
made the kernel panic frequency increase tenfold. I also noticed that the
race condition vanished during the reboot process immediately after the
vfs subsystem was shut down, which led me to conclude that the simple lock
being acquired was on disk resources. Meanwhile, CPU1 was cycling while
waiting for CPU0 to acknowledge the IPI, and CPU0 wasn't acknowledging the
IPI while cycling over the simple lock within do_pending_int.
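
To make the shape of the hang concrete, here is a loose userland sketch of
those symptoms (NOT kernel code; every name in it is invented for
illustration). "CPU1" holds the disk resource and spins waiting for an IPI
acknowledgment, while "CPU0" spins on that resource with external
interrupts off, so it can never see the IPI. Built as-is it hangs the same
way; build with -DEE_ON to simulate PSL_EE being enabled, and it completes:

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int simple_lock = 1;	/* disk resource, held by "CPU1" */
	static atomic_int ipi_acked = 0;	/* what "CPU1" is waiting on */

	static void *cpu0(void *arg)		/* spinning in do_pending_int */
	{
		(void)arg;
		while (atomic_load(&simple_lock)) {
	#ifdef EE_ON
			/* with PSL_EE on, the IPI would be taken and acked */
			atomic_store(&ipi_acked, 1);
	#endif
			/* with PSL_EE off, just keep spinning on the lock */
		}
		return NULL;
	}

	static void *cpu1(void *arg)		/* in vfork -> mp_save_fpu_lwp */
	{
		(void)arg;
		/* macppc_send_ipi(MACPPC_IPI_FLUSH_FPU) would happen here */
		while (!atomic_load(&ipi_acked))
			;			/* lock never released: deadlock */
		atomic_store(&simple_lock, 0);
		return NULL;
	}

	int main(void)
	{
		pthread_t t0, t1;

		pthread_create(&t0, NULL, cpu0, NULL);
		pthread_create(&t1, NULL, cpu1, NULL);
		pthread_join(t0, NULL);		/* without EE_ON, never returns */
		pthread_join(t1, NULL);
		printf("IPI acknowledged, lock released\n");
		return 0;
	}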

After I went over the code extensively, and after not once seeing a lock
panic (LOCKDEBUG) or a re-entry into do_pending_int after setting the
panic semaphore, I had to conclude the problem was not a lock or race and
was elsewhere. Along the way, however, I changed some code in extintr.c to
bring it in line with the simple lock requirements, namely that the irq
should not be re-enabled until after the simple lock is acquired and
released. For the non-openpic Macs, the original code was re-enabling the
irq before acquiring the lock. I also added another layer of IPL to the
MULTIPROCESSOR case, to bring it in line with the kern_lock.c notes. This
places an IPL_IPI in between IPL_HIGH and IPL_SERIAL. Putting IPL_IPI at
the same level as IPL_SERIAL led to some problems with the serial console.
Putting IPL_IPI above IPL_HIGH ensures that even during a simple lock any
deferred IPI interrupts get processed, but nothing else will.
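
Schematically (the actual imask changes are in the extintr.c hunk below),
the mask nesting in the MULTIPROCESSOR case becomes:

	imask[IPL_CLOCK] <= imask[IPL_HIGH] <= imask[IPL_IPI] <= imask[IPL_SERIAL]

where "<=" means "is a subset of": each level blocks everything the level
below it blocks, plus more, so a simple lock held at splhigh no longer
masks the IPI (or the serial interrupts).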

It's not likely that there will be deferred IPIs, though, as the code
handles external interrupts for the MP case immediately, so this is more
for consistency and CYA than actual problem solving. Since the problem was
not in the do_pending_int or kernel lock code, it had to be somewhere
else, and that left the external interrupts themselves. I worked backward
through the backtrace from CPU0:

0x00697ef0: at panic+0x19c
0x00697f80: at do_pending_int+0x2c8
0x00697fc0: at spllower+0x50
0x00697fe0: at lcsplx+0xc
0x00697ff0: at Idle+0x18

Prior to handling pending interrupts, the code turns off PSL_EE, but
restores the original msr setting before trying to acquire a kernel lock.
This should allow the CPU to receive external interrupts. Going all the way
back up the backtrace, I find in arch/powerpc/powerpc/locore_subr.c:

/*
 * No lwps are runnable, so loop waiting for one.
 * Separate label here for accounting purposes.
 * When we get here, interrupts are off (MSR[EE]=0), sched_lock is held,
 * spl is IPL_SCHED, and %r31 is a scratch register.
 */
ASENTRY(Idle)
        lis     %r8,_C_LABEL(sched_whichqs)@ha
        ldint   %r9,_C_LABEL(sched_whichqs)@l(%r8)

        or.     %r9,%r9,%r9
        bne+    .Lsw1                   /* at least one queue non-empty */

#if defined(MULTIPROCESSOR) || defined(LOCKDEBUG)
        bl      _C_LABEL(sched_unlock_idle)
#endif

        li      %r3,0                   /* go IPL_SCHED to IPL_NONE */
        bl      _C_LABEL(lcsplx)
        mr      %r31,%r3                /* save returned IPL mask */

#if defined(PPC_IBM4XX)
        wrteei  1                       /* reenable ints again */
#else /* PPC_OEA */
        mfmsr   %r4
        ori     %r4,%r4,PSL_EE@l        /* reenable ints again */
        mtmsr   %r4
        isync
#endif

So in fact when CPU0 comes down this path, its PSL_EE is not enabled and it
is not receiving external interrupts, and therefore it is not acknowledging
IPIs. I have confirmed this under the panic conditions described above.
Additionally, looking at the do_pending_int code, every time it goes to
take the kernel lock it calls mtmsr(emsr), apparently with the expectation
that this will enable external interrupts. But since PSL_EE was off when
do_pending_int was entered, mtmsr(emsr) merely restores it to off.
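
In outline, the relevant path through the existing do_pending_int looks
like this (condensed from extintr.c):

	emsr = mfmsr();		/* entered from Idle: PSL_EE is already 0 */
	dmsr = emsr & ~PSL_EE;
	mtmsr(dmsr);		/* externals off while pendings are run */
	...
	mtmsr(emsr);		/* meant to re-enable externals, but emsr
				 * has PSL_EE clear, so they stay off and
				 * the IPI is never seen */
	KERNEL_LOCK(LK_CANRECURSE|LK_EXCLUSIVE);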

In order to overcome this, I added code to do_pending_int that, in the MP
case, checks whether PSL_EE is off on CPU0 (us) and, if so, enables it but
raises the spl level to IPL_IPI. I also recover what the interrupt mask
was before raising to IPL_IPI, so that existing pending interrupts will
still get processed, but all new interrupts will be masked as pending.
This may introduce some other race condition, but my thinking was that
we'd like to handle the existing interrupts without being interrupted by
anything less than an IPI. If this patch tests out, I may allow the
pending interrupt to handle the spl in the future, unless someone can
explain what using IPL_IPI could prevent from occurring. One concern is
that a lower-priority interrupt could require the services of a
higher-priority one, and IPL_IPI could block this. Since the focus for
solving this bug has been to identify why CPU0 is not responding to IPIs,
I wanted to ensure IPIs get handled.

On exit from do_pending_int I restore PSL_EE to what it was; in Idle() the
first thing done after returning from lcsplx is to turn PSL_EE back on.
Under other circumstances it may be preferable for PSL_EE to be off, so
restoring it to the original setting is better than unconditionally
enabling it.
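
Condensed, with the diagnostic code stripped out, the extintr.c change
amounts to this (the full hunks are in the patch below):

	tmsr = emsr;			/* remember the entry MSR */
	if (ci->ci_cpuid == 0) {
		if (!(emsr & PSL_EE)) {
			emsr |= PSL_EE;	/* let the IPI in... */
			/* ...but mask everything below IPL_IPI;
			 * interrupts already pending against the
			 * earlier pcpl still get processed */
			splraise(imask[IPL_IPI]);
		}
		...
	}
	...
	/* on exit, put PSL_EE back the way we found it */
	if (ci->ci_cpuid == 0)
		mtmsr(tmsr);
	else
		mtmsr(emsr);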

I also added code to mp_save_fpu_lwp in machdep.c that calls delay()
instead of hard looping over a counter. This should help minimize race
conditions. I'd like to lower the spl, but the limiting factor is what
calls mp_save_fpu_lwp. In this case, it is vfork, which may not appreciate
some lower interrupts running.

With the above changes I have not been able to duplicate the conditions
leading to the panic, and since the entire problem now has an explanation
and a logical fix, I'm willing to post the patches for wider testing. I'm
currently retrying a userland build that had only succeeded once before,
this time with 256M of memory instead of the 128M I had been using.
Hopefully that race condition and the one I could reproduce at will are
identical. The build takes about 21 hours with 256M, but it seems that
with less memory the process takes longer, so I can not predict when it
will be done. I will report one way or the other.

In the meantime, it would be helpful if people with MP systems could test
the code with an MP kernel. The patch below still has a lot of diagnostic
messages; please apply those as well, because if the problem is still
present, or another set of circumstances can lead to the same race
condition, I need to know what was going on leading up to the panic. I
also changed the wait for the FPU to flush from a hard loop to a
delay(100), to allow other non-masked interrupts to execute. I would like
to improve this even further and determine how low an spl I can set while
waiting there, to further reduce race conditions. I think that would end
up depending on how vfork handles it, as vfork is where I always found
CPU1 when it was waiting for the FPU to flush. Note that the patches are
for both macppc and powerpc files.

While the patch includes the virq change to extintr.c, it does not include
any of the patches for the grackle/PCI card attachment or the improper
attachment of mac-io on multiple I/O controllers. I hope to clean up those
previously posted patches and post them to my web site next week. The
patch does contain one component of the fix for the improper
canonicalization of the bootpath, but it won't affect operation. I do hope
the patches apply cleanly; if they do not, please let me know.

I also plan to review the virq code again, as I can now see that the
original intention was to share irqs as long as they were of different
(signal?) types. However, I can not see where this differentiation is made
when determining the real irq in order to execute its handler. I need to
research extintr.c to see if that code was removed. For now, as long as
there are fewer than 28 hardware interrupts, there should be no problem.

My thanks to Pavel Cahyna for the links to posts in tech-kern, as well as
for pointing out that spllock and splsched are defined to be splhigh,
which led me to insert the IPL_IPI spl; and to Michael (macallan18) for
asking me to clearly explain how the NetBSD/macppc interrupt process
works so he could compare it to his experiences with other OSes and
architectures. Explaining it made me realize there wasn't a thing wrong
with the do_pending_int code or the kernel lock, and that the answer was
somewhere else.

tim

Index: extintr.c
===================================================================
RCS file: /cvsroot/src/sys/arch/macppc/macppc/extintr.c,v
retrieving revision 1.44
diff -d -u -r1.44 extintr.c
--- extintr.c	20 Jun 2004 20:50:13 -0000	1.44
+++ extintr.c	3 Dec 2004 17:16:39 -0000
@@ -159,14 +159,18 @@

 	if (irq < 0 || irq >= ICU_LEN)
 		panic("invalid irq %d", irq);
+#if 0
 	if (virq[irq])
 		return virq[irq];
+#endif

 	virq_max++;
 	v = virq_max;
 	if (v > HWIRQ_MAX)
 		panic("virq overflow");

+	printf(" mapping irq %d to virq %d\n", irq, v);
+
 	intrsources[v].is_hwirq = irq;
 	virq[irq] = v;

@@ -332,12 +336,23 @@
 	 */
 	imask[IPL_HIGH] |= imask[IPL_CLOCK];

+#ifdef MULTIPROCESSOR
+	/*
+	 * We need IPIs to get processed
+	 */
+
+	imask[IPL_IPI] |= imask[IPL_HIGH];
+
 	/*
 	 * We need serial drivers to run at the absolute highest priority to
 	 * avoid overruns, so serial > high.
 	 */
+	imask[IPL_SERIAL] |= imask[IPL_IPI];
+#else
 	imask[IPL_SERIAL] |= imask[IPL_HIGH];

+#endif
+
 	/* And eventually calculate the complete masks. */
 	for (irq = 0, is = intrsources; irq < NIRQ; irq++, is++) {
 		register int irqs = 1 << irq;
@@ -541,11 +556,23 @@

 	int_state = gc_read_irq();
 #ifdef MULTIPROCESSOR
+{
+int mk;
+
 	r_imen = 1 << virq[GC_IPI_IRQ];
+	mk = int_state & r_imen;
+	if (ci->ci_other_rq_panic) {
+		/* we're in a lock/race - handle panic */
+		printf("ext_intr sees other CPU requested a panic...");
+		printf("int_state: %u, r_imen: %x, int_state & r_imen: %x\n",
+			(unsigned int)int_state, r_imen, mk);
+		ci->ci_other_rq_panic = 0;
+	}
 	if (int_state & r_imen) {
 		int_state &= ~r_imen;
 		cpuintr(NULL);
 	}
+}
 #endif
 	if (int_state == 0)
 		return;
@@ -679,31 +706,76 @@
 	int hwpend;
 	int emsr, dmsr;

+#ifdef MULTIPROCESSOR
+	int tmsr;
+#endif
+
+#ifdef MULTIPROCESSOR
+	if (ci->ci_other_rq_panic)
+		/* we're in a loop - handle panic */
+		panic("do_pending_intr(1) taking too long!");
+#endif
+
 	if (ci->ci_iactive)
 		return;

 	ci->ci_iactive = 1;
+
+	/* recover this before raising */
+	pcpl = ci->ci_cpl;
+
 	emsr = mfmsr();
-	dmsr = emsr & ~PSL_EE;
+
+	dmsr = emsr & ~PSL_EE;
 	mtmsr(dmsr);

-	pcpl = ci->ci_cpl;
 again:

 #ifdef MULTIPROCESSOR
+	tmsr = emsr;
 	if (ci->ci_cpuid == 0) {
+
+		/* EE was already off */
+		/* we may have an IPI pending */
+		/* for SP, PSL_EE off is by design */
+		if (!(emsr & PSL_EE)) {
+			emsr |= PSL_EE;
+			/* already pending interrupts get processed */
+			/* because the mask is against earlier pcpl */
+			/* new interrupts get marked pending */
+			splraise(imask[IPL_IPI]);
+		}
+
+
 #endif
 	/* Do now unmasked pendings */
 	while ((hwpend = (ci->ci_ipending & ~pcpl & HWIRQ_MASK)) != 0) {
+#ifdef MULTIPROCESSOR
+		if (ci->ci_other_rq_panic)
+			/* been here too long */
+			panic("do_pending_intr(2) taking too long!");
+#endif
 		irq = 31 - cntlzw(hwpend);
 		is = &intrsources[irq];
-		if (!have_openpic)
-			gc_enable_irq(is->is_hwirq);
-
 		ci->ci_ipending &= ~(1 << irq);
 		splraise(is->is_mask);
 		mtmsr(emsr);
 		KERNEL_LOCK(LK_CANRECURSE|LK_EXCLUSIVE);
+
+#ifdef MULTIPROCESSOR
+{
+	int i;
+	if (ci->ci_other_rq_panic) {
+		for (i = 0; i < HWIRQ_MAX && virq[i] != irq; i++)
+			;
+		printf("about to panic @ 3, virq: 0x%x, hwirq: 0x%x, is_mask: %x\n",
+			irq, i, is->is_mask);
+		printf("PSL_EE: %x\n", (emsr & PSL_EE));
+		/* been here too long */
+		panic("do_pending_intr(3) taking too long!");
+	}
+}
+#endif
 		ih = is->is_hand;
 		while (ih) {
 			(*ih->ih_fun)(ih->ih_arg);
@@ -714,7 +786,10 @@
 		ci->ci_cpl = pcpl;

 		is->is_ev.ev_count++;
-		if (have_openpic)
+
+		if (!have_openpic)
+			gc_enable_irq(is->is_hwirq);
+		else
 			openpic_enable_irq(is->is_hwirq, is->is_type);
 	}
 #ifdef MULTIPROCESSOR
@@ -785,7 +860,16 @@
 #endif
 	ci->ci_cpl = pcpl;	/* Don't use splx... we are here already! */
 	ci->ci_iactive = 0;
+
+	/* restore original condition of PSL_EE */
+#ifndef MULTIPROCESSOR
 	mtmsr(emsr);
+#else
+	if (ci->ci_cpuid == 0)
+		mtmsr(tmsr);
+	else
+		mtmsr(emsr);
+#endif
 }

 int
Index: cpu.c
===================================================================
RCS file: /cvsroot/src/sys/arch/macppc/macppc/cpu.c,v
retrieving revision 1.38
diff -d -u -r1.38 cpu.c
--- cpu.c	13 Feb 2004 11:36:15 -0000	1.38
+++ cpu.c	3 Dec 2004 17:17:47 -0000
@@ -317,7 +317,7 @@

 	if (!openpic_base) {
 		/* Register IPI. */
-		intr_establish(GC_IPI_IRQ, IST_LEVEL, IPL_HIGH, cpuintr, NULL);
+		intr_establish(GC_IPI_IRQ, IST_LEVEL, IPL_IPI, cpuintr, NULL);
 	}

 	return 0;
Index: machdep.c
===================================================================
RCS file: /cvsroot/src/sys/arch/macppc/macppc/machdep.c,v
retrieving revision 1.133
diff -d -u -r1.133 machdep.c
--- machdep.c	3 Jul 2004 16:24:57 -0000	1.133
+++ machdep.c	3 Dec 2004 17:18:12 -0000
@@ -61,6 +61,8 @@
 #include <sys/boot_flag.h>
 #include <sys/ksyms.h>

+#include <sys/unistd.h>
+
 #include <uvm/uvm_extern.h>

 #include <net/netisr.h>
@@ -681,7 +683,9 @@
 {
 	struct pcb *pcb = &l->l_addr->u_pcb;
 	struct cpu_info *fpcpu;
-	int i;
+	int i, j, pcpl;
+
+	pcpl = 0;

 	/*
 	 * Send an IPI to the other CPU with the data and wait for that CPU
@@ -694,6 +698,9 @@
 	if (fpcpu == NULL) {
 		return;
 	}
+	j = 0;
+	fpcpu->ci_other_rq_panic = 0;
+
 	macppc_send_ipi(fpcpu, MACPPC_IPI_FLUSH_FPU);

 	/* Wait for flush. */
@@ -701,14 +708,39 @@
 	while (pcb->pcb_fpcpu)
 		;
 #else
-	for (i = 0; i < 0x3fffffff; i++) {
-		if (pcb->pcb_fpcpu == NULL)
+	for (i = 0; i < 0x10000; i++) {
+		delay(100);
+		if (pcb->pcb_fpcpu == NULL) {
+/*
+			printf("cpu %x responded to cpu %x IPI\n",
+				fpcpu->ci_cpuid, cpu_number());
+*/
 			return;
+		}
+		else {
+			if (fpcpu->ci_iactive)
+				j++;
+		}
 	}
-	printf("mp_save_fpu_proc{%d} pid = %d.%d, fpcpu->ci_cpuid = %d\n",
+	printf("mp_save_fpu_lwp{%d} pid = %d.%d, fpcpu->ci_cpuid = %d\n",
 	    cpu_number(), l->l_proc->p_pid, l->l_lid, fpcpu->ci_cpuid);
-	panic("mp_save_fpu_proc");
+	printf("number of delay(100) other CPU has iactive: 0x%x\n", j);
+	fpcpu->ci_other_rq_panic = 1;
+	asm volatile ("sync; eieio"); /* for good measure */
+	 /* sit and spin for a bit */
+	for (i = 0; i < 0x10000; i++) {
+		delay(100);
+		if (!pcb->pcb_fpcpu) {
+			fpcpu->ci_other_rq_panic = 0;
+			printf("pcb_fpcpu cleared\n");
+			return;
+		}
+	}
+	if (pcb->pcb_fpcpu)
+		panic("mp_save_fpu_proc");
+
 #endif
+
 }

 #ifdef ALTIVEC


<*** NOTE -- these are for arch/powerpc/ files ***>

Index: include/cpu.h
===================================================================
RCS file: /cvsroot/src/sys/arch/powerpc/include/cpu.h,v
retrieving revision 1.42
diff -d -u -r1.42 cpu.h
--- include/cpu.h	22 Sep 2004 11:32:03 -0000	1.42
+++ include/cpu.h	3 Dec 2004 17:20:57 -0000
@@ -113,6 +113,9 @@
 	struct evcnt ci_ev_vec;		/* Altivec traps */
 	struct evcnt ci_ev_vecsw;	/* Altivec context switches */
 	struct evcnt ci_ev_umchk;	/* user MCHK events */
+#ifdef MULTIPROCESSOR
+	int          ci_other_rq_panic; /* the other CPU needs us to panic */
+#endif
 };

 #ifdef MULTIPROCESSOR
Index: powerpc/openfirm.c
===================================================================
RCS file: /cvsroot/src/sys/arch/powerpc/powerpc/openfirm.c,v
retrieving revision 1.14
diff -d -u -r1.14 openfirm.c
--- powerpc/openfirm.c	27 Sep 2003 04:44:42 -0000	1.14
+++ powerpc/openfirm.c	3 Dec 2004 17:21:28 -0000
@@ -245,6 +245,31 @@
 }

 int
+OF_canon(char *alias, char *device, int maxlen)
+{
+	static struct {
+		char *name;
+		int nargs;
+		int nreturns;
+		char *alias;
+		char *device;
+		int maxlen;
+		int fullLen;
+	} args = {
+		"canon",
+		3,
+		1,
+	};
+
+	args.alias = alias;
+	args.device = device;
+	args.maxlen = maxlen;
+	if (openfirmware(&args) == -1)
+		return -1;
+	return args.fullLen;
+}
+
+int
 OF_instance_to_path(int ihandle, char *buf, int buflen)
 {
 	static struct {
Index: powerpc/pio_subr.S
===================================================================
RCS file: /cvsroot/src/sys/arch/powerpc/powerpc/pio_subr.S,v
retrieving revision 1.8
diff -d -u -r1.8 pio_subr.S
--- powerpc/pio_subr.S	29 Mar 2004 01:15:14 -0000	1.8
+++ powerpc/pio_subr.S	3 Dec 2004 17:21:58 -0000
@@ -34,11 +34,16 @@
  *	in the low bits but if that happens then the offset was too large
  *	to being with.
  */
-#ifdef DEBUG
-#define	DBGSYNC	sync
+#ifdef MULTIPROCESSOR
+ #define DBGSYNC sync
 #else
-#define	DBGSYNC	/* nothing */
+ #ifdef DEBUG
+  #define	DBGSYNC	sync
+ #else
+  #define	DBGSYNC	/* nothing */
+ #endif
 #endif
+
 /* LINTSTUB: include <sys/param.h> */
 /* LINTSTUB: include <sys/types.h> */
 /* LINTSTUB: include <machine/bus.h> */