Subject: Re: reproducible kernel panic w/ 2.0RC4MP
To: None <port-macppc@netbsd.org>
From: Tim Kelly <hockey@dialectronics.com>
List: port-macppc
Date: 11/13/2004 08:04:38
On Thu, 11 Nov 2004 15:34:26 -0800
Bill Studenmund <wrstuden@netbsd.org> wrote:

> How about telling us what the other CPU is doing? Ok, the one CPU is 
> waiting for the other one. So what is it doing?

Ok, it is impossible to reconcile my theory with Dave Huang's report of
getting the panic with a very large amount of RAM. So I've dropped that,
and I am looking at three avenues: the other CPU is off, the other CPU
is not listening, or the other CPU can't honor the IPI.

Yes, I'm going to be verbose about this because that's the best way I
can make sure everyone sees how I got where I got.

The biggest obstacle, as I see it, is that without support for MP in the
macppc ddb, it is very difficult to get critical stuff like a stack
trace and the registers from the hung CPU. The CPU calling for a panic
isn't the problem CPU. One of the most critical registers on the other
CPU is the Machine State Register (MSR). However, because we already
know the other CPU is incommunicado, it's not like we can just ask it
for that value. Instead, we have to do some indirect examination.

A (somewhat thorough) review of the code shows that once both CPUs are
spun up, the only time one CPU tells another CPU to turn off is at
rebooting. Since we know that both CPUs are spun up at booting, this
eliminates the possibility that the CPU has been turned off. In the MSR
are two bits that we'd really like to know - External Interrupts Enabled
(PSL_EE) and Power Conservation (POW). In macppc/macppc/cpu.c, cpuintr
shows that halting a CPU turns off PSL_EE and turns on POW. It'd be nice
to know these values when the CPU is hung, to absolutely rule out that the
CPU has been halted, but since there are no debugger messages indicating a
halt, I think this can be ruled out.
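
For reference, the halt path boils down to a couple of MSR operations along
these lines (an illustrative sketch of the bit manipulation, using the
standard PSL_EE/PSL_POW defines, not a verbatim quote of cpu.c):

	unsigned int msr;

	__asm volatile ("mfmsr %0" : "=r"(msr));
	msr &= ~PSL_EE;		/* stop taking external interrupts */
	msr |= PSL_POW;		/* request power-save (POW) mode   */
	__asm volatile ("sync; mtmsr %0; isync" : : "r"(msr));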

An examination of the code shows that POW gets enabled only when the CPU
is halted. From here we need to examine where PSL_EE gets turned off
(something like msr & ~PSL_EE). It isn't immediately obvious, but this gets
done in clock.c, during the microtime() routine as part of the
decrementer algorithms. For now we'll ignore this.

Another culprit is do_pending_intr() in extintr.c. When this is entered,
the code turns off EE and sets ci->ci_iactive (interrupt active in the
cpu_info instance for this CPU). By inference, then, a good check would
be to see how often the other CPU's ci_iactive field is set while waiting
for the FPU state to flush. It's not a perfect correlation, but it's a start.
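
Schematically, the entry of do_pending_intr() looks something like this
(paraphrased from memory, not a verbatim copy of extintr.c):

	int emsr, dmsr;

	if (ci->ci_iactive)		/* already in here? just return */
		return;
	ci->ci_iactive = 1;		/* mark interrupt handling active */
	__asm volatile ("mfmsr %0" : "=r"(emsr));
	dmsr = emsr & ~PSL_EE;		/* run with external interrupts off */
	__asm volatile ("mtmsr %0" : : "r"(dmsr));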

Therefore, in machdep.c, I added a couple of lines of code (I'm
paraphrasing for brevity; full diffs are at the end). To the counting
timer in mp_save_fpu_proc() (macppc/macppc/machdep.c), I added

	int j = 0;
	for (i = 0; i < 0x3fffffff; i++) {
		if (pcb->pcb_fpcpu == NULL)
			return;
		else {
			if (fpcpu->ci_iactive)
				j++;
		}
	}
	printf("mp_save_fpu_proc{%d} pid = %d.%d, fpcpu->ci_cpuid = %d\n",
	    cpu_number(), l->l_proc->p_pid, l->l_lid, fpcpu->ci_cpuid);

	printf("number of times other CPU has iactive: 0x%x\n", j);


Each cycle that the FPU state isn't flushed, I check whether the other CPU
was in its interrupt handler, do_pending_intr(). I then ran the modified
kernel in circumstances that always cause a panic. At the panic, I got
this:

mp_save_fpu_proc{1} pid = 13726.1, fpcpu->ci_cpuid = 0
number of times other CPU has iactive: 0x3fffffff
panic: mp_save_fpu_proc

So every cycle the FPU wasn't flushed, the other CPU had ci_iactive turned
on, which means that for at least part of that time the other CPU's PSL_EE
bit was turned off and it wasn't listening to incoming requests. However,
there are several steps in which PSL_EE is re-enabled, to handle
KERNEL_LOCK(flags). So for at least part of do_pending_int, the CPU should
be able to receive an IPI to flush the FPU. However, the actual flushing
occurs in cpuintr() in cpu.c, and that does not get called from
do_pending_int.

So what is causing the other CPU to stay in its interrupt handler? For
that we need the stack trace of the other CPU. Unfortunately, there
again doesn't seem to be a way to get this directly. However, what if,
instead of the sending CPU panicking, we get the hung CPU to panic?

As it turns out, there are two loops that could potentially never
terminate in do_pending_intr(). Identifying them should be possible by
inserting a flag into cpu_info that indicates the other CPU has
requested a panic. In order to do this, I added

int ci_other_rq_panic;

to the cpu_info struct in powerpc/include/cpu.h

and

#ifdef MULTIPROCESSOR
                if (ci->ci_other_rq_panic)
                        /* been here too long */
                        panic("do_pending_intr(n) taking too long!");
#endif

to the two places in do_pending_intr that could get stuck looping, and one
right at the entry of do_pending_intr, so that a nested call can still
panic (instead of simply returning) when the CPU is already inside
do_pending_intr in an endless loop.

To tie it together, rather than panicking immediately in mp_save_fpu_proc() I set

fpcpu->ci_other_rq_panic = 0;

before I requested an FPU flush (to ensure it's clear) and then

fpcpu->ci_other_rq_panic = 1;
for (i = 0; i < 0x3fffffff; i++);

when the timer goes off. This lets the other CPU have some time to catch
the panic request before the originating CPU panics itself.
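
Condensed, the sequence in mp_save_fpu_proc() now looks like this (the full
diff is at the end):

	fpcpu->ci_other_rq_panic = 0;	/* make sure the flag starts out clear */
	macppc_send_ipi(fpcpu, MACPPC_IPI_FLUSH_FPU);
	for (i = 0; i < 0x3fffffff; i++)
		if (pcb->pcb_fpcpu == NULL)	/* flushed - all is well */
			return;
	fpcpu->ci_other_rq_panic = 1;	/* timed out - ask the other CPU to panic */
	for (i = 0; i < 0x3fffffff; i++)
		;			/* give it a chance to notice */
	panic("mp_save_fpu_proc");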

Since the kernel panics only at random times, I had time to sit and ponder
what would keep PSL_EE disabled while ci_iactive stayed true. There aren't
too many places, and none of them have endless loops. That points towards
the CPU being stuck in do_pending_int and unable to honor the IPI. An
examination of the macro KERNEL_LOCK shows an interesting possibility. The
macro is only defined for MULTIPROCESSOR, and resolves to _kernel_lock in
src/sys/kern/kern_lock.c. Here we see the first possible hint of a
deadlock.

_kernel_lock has

        s = splbiglock();
        while (!simple_lock_try(&kernel_lock)) {
                splx(s);
                SPINLOCK_SPIN_HOOK;
                s = splbiglock();
        }


and simple_lock_try has

        /*
         * MULTIPROCESSOR case: This is `safe' since if it's not us, we
         * don't take any action.
         */

#if defined(MULTIPROCESSOR) /* { */
        if ((rv = __cpu_simple_lock_try(&alp->lock_data)) == 0) {
                if (alp->lock_holder == cpu_id)
                        SLOCK_WHERE("simple_lock_try: locking against myself\n",
                            alp, id, l);
                goto out;
        }
#else

out:
        splx(s);
        return (rv);

and __cpu_simple_lock_try has

(a bunch of assembly language code that I'm not going to hurt my brain
any more trying to figure out)

Let's suppose __cpu_simple_lock_try returns 0. Then rv == 0 is true, the
code jumps to out:, hits splx(), and returns rv (0). The while() loop in
_kernel_lock then calls splx() again and loops again.

Hmmm...see, I had a real hard time figuring out how do_pending_int() gets
called at all. It turns out splx() is one of only a couple of places that
call it, and splx() gets called all over the place. However, the first
check in do_pending_int() is whether we are already in do_pending_int()
(if (ci->ci_iactive) return;). So if we're stuck in a while() loop
somewhere that keeps calling splx(), we stay in do_pending_int() and
ci_iactive stays true. The IPI never gets honored because cpuintr() never
gets called.
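
To make the suspected livelock concrete, here is an illustrative sketch
(not actual source) of what the stuck CPU would be doing:

	/* inside do_pending_int(): ci->ci_iactive is already set */
	s = splbiglock();
	while (!simple_lock_try(&kernel_lock)) {	/* lock never becomes free */
		splx(s);	/* splx() calls do_pending_int(), which returns */
				/* immediately because ci_iactive is still set */
		SPINLOCK_SPIN_HOOK;
		s = splbiglock();
	}
	/* cpuintr() never runs, so the FPU-flush IPI is never serviced */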

Ok, now that I've got a panic eight hours later, here is the bt:

mp_save_fpu_proc{1} pid = 4623.1, fpcpu->ci_cpuid = 0
number of times other CPU has iactive: 0x3fffffff
panic: mp_save_fpu_proc
Stopped in pid 4623.1 (nbmake) at       netbsd:cpu_Debugger+0x10:       lwz     r0, r1, 0x14

At this point, I thought perhaps my semaphore didn't work, so I rebooted:

db{1}> reboot
syncing disks... panic: do_pending_intr(3) taking too long!
Stopped at      netbsd:cpu_Debugger+0x10:       lwz     r0, r1, 0x14
db{0}> bt
0x00697ef0: at panic+0x19c
0x00697f80: at do_pending_int+0x2c8
0x00697fc0: at spllower+0x50
0x00697fe0: at lcsplx+0xc
0x00697ff0: at Idle+0x18

What do you know, it _did_ work. It seems out of order, but maybe with
fine tuning it'll nail it. Except, look at the debugger message that was
printed out:

panic: do_pending_intr(3) taking too long!

This wasn't the first semaphore, or even the second. It was the _third_
semaphore, which means the other CPU caught the panic request while it was
inside do_pending_int, as theorized. In fact, this panic comes
immediately after KERNEL_LOCK, which looked promising as explained
earlier. However, there were no do_pending_intr(1) messages, so there
was no reentry into do_pending_int, and although the (3) is in a while()
loop, it could also get hit the first time. After some more examining,
I'm not able to pinpoint the problem, but I find the call to 

s = splbiglock()

interesting. This is a macro for splclock(), which is a macro for
splraise(imask[IPL_CLOCK]). Is it possible that something related to
the clock on the primary CPU is getting turned off, and that thread just
stops cold? The thread hit the panic only _after_ the secondary CPU, the
one requesting the panic, rebooted.

The trail goes cold at that point...

(and to all who suggested make -j 2, I am deeply grateful for how much
time you've saved me in these builds)

tim


db{0}> show all procs
 PID    PPID   PGRP UID S   FLAGS LWPS  COMMAND  WAIT
 4623   24595  98    0 2  0x4002    1   nbmake
 24595  23629  98    0 2  0x4002    1   sh       wait
 23629   7418  98    0 2  0x4002    1   nbmake   wait
 7418   11122  98    0 2  0x4002    1   sh       wait
 11122  26927  98    0 2  0x4002    1   nbmake   wait
 26927  15045  98    0 2  0x4002    1   sh       wait
 15045  26110  98    0 2  0x4002    1   nbmake   wait
 26110   1321  98    0 2  0x4002    1   sh       wait
 25435    440    25435    0 2  0x4002    1     pico  select
 3572   14680     3572    0 2  0x4002    1     pico  select
 14680   14575    14680    0 2  0x4002    1     tcsh   pause
 14575   14732    14575    0 2  0x4002    1      csh   pause
 14732   13541    14732   1000 2  0x4002    1     tcsh   pause
 13541   10536    10536   1000 2   0x100    1     sshd  select
 10536     329    10536     0 2     0x1    1     sshd   netio
 1321    98   98     0 2  0x4002    1   nbmake    wait
 98     404   98     0 2  0x5002    1   sh    wait
 440     72   440     0 2  0x4002    1     tcsh   pause
 72     429   72     0 2  0x4002    1      csh   pause
 429    414   429   1000 2  0x4002    1     tcsh   pause
 414    400   400   1000 2   0x100    1     sshd  select
 400    329   400     0 2     0x1    1     sshd   netio
 404    377   404     0 2  0x4002    1     tcsh   pause
 377    398   377     0 2  0x4002    1   csh   pause
 398    381   398   1000 2  0x4002    1     tcsh   pause
 381    379   379   1000 2   0x100    1     sshd  select
 379    329   379     0 2     0x1    1     sshd   netio
 380    375   380     0 2  0x4002    1   top    poll
 375    364   375     0 2  0x4002    1   csh   pause
 364    306   364   1000 2  0x4002    1     tcsh   pause
 306    373   373   1000 2   0x100    1     sshd  select
 373    329   373     0 2     0x1    1     sshd   netio
 363   1   363     0 2  0x4002    1    getty   ttyin
 314   1   1     0 2  0x4000    1    getty nanosle
 362   1   362     0 2   0    1     cron nanosle
 337   1   337     0 2   0    1    inetd  kqread
 335   1   335     0 2   0x100    1    sendmail  select
 329   1   329     0 2   0    1     sshd  select
 191   1   191     0 2   0    1     syslogd
 7   0   0     0 2 0x20200    1    aiodoned aiodone
 6   0   0     0 2 0x20200    1     ioflush  syncer
 5   0   0     0 2 0x20200    1   pagedaemon pgdaemo
 4   0   0     0 2 0x20200    1   lfs_writer lfswrit
 3   0   0     0 2 0x20200    1    scsibus1  sccomp
 2   0   0     0 2 0x20200    1    scsibus0  sccomp
 1   0   1     0 2  0x4000    1     init    wait
 0   -1   0     0 2 0x20200    1     swapper schedul
db{0}> show registers
r0          0x311d9c    panic+0x1a0
r1          0x697ee0    ADBDevTable+0x7a680
r2          0x61aac8    cpu_info
r3          0x61aac8    cpu_info
r4          0xd
r5          0x1
r6          0
r7          0x5d0000    ddbstk+0x1958
r8          0
r9          0x5d0000    ddbstk+0x1958
r10         0
r11         0x63061
r12         0x28002048
r13         0x185cb98   ADBDevTable+0x123f338
r14         0x1850000   ADBDevTable+0x12327a0
r15         0x5
r16         0x5
r17         0x1850000   ADBDevTable+0x12327a0
r18         0x1850000   ADBDevTable+0x12327a0
r19         0x18
r20         0x1
r21         0x5d0000    ddbstk+0x1958
r22         0x1
r23         0x605820    intrsources
r24         0
r25         0xffffffff
r26         0x5d0000    ddbstk+0x1958
r27         0x1032
r28         0x1032
r29         0x104
r30         0x5d0000    ddbstk+0x1958
r31         0x5095a4    linux_socketcall+0x890
iar         0x3bdb4c    cpu_Debugger+0x10
msr         0x1032
lr          0x3bdb4c    cpu_Debugger+0x10
ctr         0x3dde30    cnflush
cr          0x48002048
xer         0
mq          0x1
netbsd:cpu_Debugger+0x10:       lwz     r0, r1, 0x14
db{0}> reboot
halt{1}
rebooting

Patches:

Index: cpu.h
===================================================================
RCS file: /cvsroot/src/sys/arch/powerpc/include/cpu.h,v
retrieving revision 1.42
diff -d -u -r1.42 cpu.h
--- cpu.h	22 Sep 2004 11:32:03 -0000	1.42
+++ cpu.h	12 Nov 2004 21:10:30 -0000
@@ -113,6 +113,9 @@
 	struct evcnt ci_ev_vec;		/* Altivec traps */
 	struct evcnt ci_ev_vecsw;	/* Altivec context switches */
 	struct evcnt ci_ev_umchk;	/* user MCHK events */
+#ifdef MULTIPROCESSOR
+	int          ci_other_rq_panic; /* the other CPU needs us to panic */ 
+#endif
 };
 
 #ifdef MULTIPROCESSOR
Index: machdep.c
===================================================================
RCS file: /cvsroot/src/sys/arch/macppc/macppc/machdep.c,v
retrieving revision 1.133
diff -d -u -r1.133 machdep.c
--- machdep.c	3 Jul 2004 16:24:57 -0000	1.133
+++ machdep.c	13 Nov 2004 01:48:13 -0000
@@ -681,7 +681,7 @@
 {
 	struct pcb *pcb = &l->l_addr->u_pcb;
 	struct cpu_info *fpcpu;
-	int i;
+	int i, j;
 
 	/*
 	 * Send an IPI to the other CPU with the data and wait for that CPU
@@ -694,6 +694,9 @@
 	if (fpcpu == NULL) {
 		return;
 	}
+	j = 0;
+	fpcpu->ci_other_rq_panic = 0;
+
 	macppc_send_ipi(fpcpu, MACPPC_IPI_FLUSH_FPU);
 
 	/* Wait for flush. */
@@ -704,9 +707,17 @@
 	for (i = 0; i < 0x3fffffff; i++) {
 		if (pcb->pcb_fpcpu == NULL)
 			return;
+		else {
+			if (fpcpu->ci_iactive)
+				j++;
+		}
 	}
 	printf("mp_save_fpu_proc{%d} pid = %d.%d, fpcpu->ci_cpuid =
%d\n", 	    cpu_number(), l->l_proc->p_pid, l->l_lid, fpcpu->ci_cpuid);
+	printf("number of times other CPU has iactive: 0x%x\n", j);
+	fpcpu->ci_other_rq_panic = 1;
+	asm ("sync"); /* for good measure */
+	for (i = 0; i < 0x3fffffff; i++); /* sit and spin for a bit */
 	panic("mp_save_fpu_proc");
 #endif
 }


Index: extintr.c
===================================================================
RCS file: /cvsroot/src/sys/arch/macppc/macppc/extintr.c,v
retrieving revision 1.44
diff -d -u -r1.44 extintr.c
--- extintr.c	20 Jun 2004 20:50:13 -0000	1.44
+++ extintr.c	13 Nov 2004 01:43:14 -0000
@@ -679,6 +679,12 @@
 	int hwpend;
 	int emsr, dmsr;
 
+#ifdef MULTIPROCESSOR
+	if (ci->ci_other_rq_panic)
+		/* we're in a loop - handle panic */
+		panic("do_pending_intr(1) taking too long!");
+#endif
+
 	if (ci->ci_iactive)
 		return;
 
@@ -689,12 +695,16 @@
 
 	pcpl = ci->ci_cpl;
 again:
-
 #ifdef MULTIPROCESSOR
 	if (ci->ci_cpuid == 0) {
 #endif
 	/* Do now unmasked pendings */
 	while ((hwpend = (ci->ci_ipending & ~pcpl & HWIRQ_MASK)) != 0) {
+#ifdef MULTIPROCESSOR
+		if (ci->ci_other_rq_panic)
+			/* been here too long */
+			panic("do_pending_intr(2) taking too long!");
+#endif
 		irq = 31 - cntlzw(hwpend);
 		is = &intrsources[irq];
 		if (!have_openpic)
@@ -706,6 +716,11 @@
 		KERNEL_LOCK(LK_CANRECURSE|LK_EXCLUSIVE);
 		ih = is->is_hand;
 		while (ih) {
+#ifdef MULTIPROCESSOR
+		if (ci->ci_other_rq_panic)
+			/* been here too long */
+			panic("do_pending_intr(3) taking too long!");
+#endif
 			(*ih->ih_fun)(ih->ih_arg);
 			ih = ih->ih_next;
 		}