Subject: Re: reproducible kernel panic w/ 2.0RC4MP
To: None <port-macppc@netbsd.org>
From: Tim Kelly <hockey@dialectronics.com>
List: port-macppc
Date: 11/13/2004 08:04:38
On Thu, 11 Nov 2004 15:34:26 -0800
Bill Studenmund <wrstuden@netbsd.org> wrote:
> How about telling us what the other CPU is doing? Ok, the one CPU is
> waiting for the other one. So what is it doing?
Ok, it is impossible to reconcile my theory with Dave Huang's report of
getting the panic with a very large amount of RAM. So I've dropped that,
and I am looking at three avenues: the other CPU is off, the other CPU
is not listening, or the other CPU can't honor the IPI.
Yes, I'm going to be verbose about this because that's the best way I
can make sure everyone sees how I got where I got.
The biggest obstacle, as I see it, is that without support for MP in the
macppc ddb, it is very difficult to get critical stuff like a stack
trace and the registers from the hung CPU. The CPU calling for a panic
isn't the problem CPU. One of the most critical registers on the other
CPU is the Machine State Register (MSR). However, because we already
know the other CPU is incommunicado, it's not like we can just ask it
for that value. Instead, we have to do some indirect examination.
A (somewhat thorough) review of the code shows that once both CPUs are
spun up, the only time one CPU tells another CPU to turn off is at
rebooting. Since we know that both CPUs are spun up at booting, this
eliminates the possibility that the CPU has been turned off. In the MSR
are two bits that we'd really like to know - External Interrupts Enabled
(PSL_EE) and Power Conservation (POW). In macppc/macppc/cpu.c, cpuintr
shows that halting a CPU turns off PSL_EE and turns on POW. It'd be nice
to know these values when the CPU is hung, to absolutely rule out that
the CPU has been halted, but since there are no debugger messages
indicating the CPU has been halted, I think this can be ruled out.
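As an aside, if we could run code on the hung CPU, checking those two
bits would be trivial. A minimal sketch of what such a check might look
like (report_msr_bits() is a hypothetical helper, not anything in the
tree; PSL_EE and PSL_POW are the bit definitions from powerpc's psl.h):

#include <sys/types.h>
#include <sys/systm.h>		/* printf */
#include <machine/psl.h>	/* PSL_EE, PSL_POW */

static void
report_msr_bits(void)
{
	register_t msr;

	/* read the MSR of the CPU executing this code */
	__asm volatile ("mfmsr %0" : "=r"(msr));
	printf("MSR 0x%lx: EE %s, POW %s\n", (unsigned long)msr,
	    (msr & PSL_EE) ? "on" : "off",
	    (msr & PSL_POW) ? "on" : "off");
}

Of course, this only reports the MSR of the CPU that runs it, which is
exactly why it's no help with a CPU that won't respond; hence the
indirect approach below.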
An examination of the code shows that POW gets enabled only when the CPU
is halted. From here we need to examine if PSL_EE gets turned off,
something like &~PSL_EE. It isn't immediately obvious, but this gets
done in clock.c, during the microtime() routine as part of the
decrementer algorithms. For now we'll ignore this.
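For the record, what I was grepping for is the usual "disable external
interrupts around a critical section" idiom, roughly like this (a
generic sketch of the pattern, not the actual clock.c code):

	register_t omsr, msr;

	__asm volatile ("mfmsr %0" : "=r"(omsr));
	msr = omsr & ~PSL_EE;
	__asm volatile ("mtmsr %0" : : "r"(msr));	/* EE off */
	/* ... read the decrementer / time base ... */
	__asm volatile ("mtmsr %0" : : "r"(omsr));	/* restore EE */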
Another culprit is do_pending_int() in extintr.c. When this is entered,
the code turns off EE and sets ci->ci_iactive (interrupt active in the
cpu_info instance for this CPU). By inference, then, a good check would
be to see how often the other CPU's ci_iactive field is set when waiting
on the FPUs to flush. It's not a perfect correlation, but it's a start.
Therefore, in machdep.c, I added a couple of lines of code (I'm
paraphrasing for brevity; full diffs are at the end). To the counting
timer in mp_save_fpu_proc() (macppc/macppc/machdep.c), I added
	int j = 0;

	for (i = 0; i < 0x3fffffff; i++) {
		if (pcb->pcb_fpcpu == NULL)
			return;
		else {
			if (fpcpu->ci_iactive)
				j++;
		}
	}
	printf("mp_save_fpu_proc{%d} pid = %d.%d, fpcpu->ci_cpuid = %d\n",
	    cpu_number(), l->l_proc->p_pid, l->l_lid, fpcpu->ci_cpuid);
	printf("number of times other CPU has iactive: 0x%x\n", j);
Each cycle that the FPUs aren't flushed, I check to see if the other CPU
was in its interrupt handler, do_pending_int(). I then ran the modified
kernel in circumstances that always cause a panic. At the panic, I got
this:
mp_save_fpu_proc{1} pid = 13726.1, fpcpu->ci_cpuid = 0
number of times other CPU has iactive: 0x3fffffff
panic: mp_save_fpu_proc
So on every cycle in which the FPU wasn't flushed, the other CPU had
ci_iactive turned on, which means that for at least part of the timer
loop the other CPU's PSL_EE bit was turned off and it wasn't listening
to incoming requests. However, there are several steps in which PSL_EE
is re-enabled, to handle KERNEL_LOCK(flags). So for at least part of
do_pending_int() the IPI asking for the FPU flush should be received.
However, the actual flushing occurs in cpuintr() in cpu.c, and that does
not get called from do_pending_int().
So what is causing the other CPU to stay in its interrupt handler? For
that we need the stack trace of the other CPU. Unfortunately, there
again doesn't seem to be a way to get this directly. However, what if
instead of the sending CPU panicking, we get the hung CPU to panic?
As it turns out, there are two loops that could potentially never
terminate in do_pending_int(). Identifying them should be possible by
inserting a flag into cpu_info that indicates the other CPU has
requested a panic. In order to do this, I added
int ci_other_rq_panic;
to the cpu_info struct in powerpc/include/cpu.h
and
#ifdef MULTIPROCESSOR
	if (ci->ci_other_rq_panic)
		/* been here too long */
		panic("do_pending_intr(n) taking too long!");
#endif
to the two places in do_pending_int() that could get stuck looping, and
one right at the top of do_pending_int(), so that a CPU that keeps
re-entering via splx() (and returning immediately because it is already
in do_pending_int(), i.e. stuck in an endless loop) still sees the panic
request.
To tie it together, instead of panicking right away in
mp_save_fpu_proc(), I set
	fpcpu->ci_other_rq_panic = 0;
before requesting the FPU flush (to make sure the flag starts out clear)
and then
	fpcpu->ci_other_rq_panic = 1;
	for (i = 0; i < 0x3fffffff; i++);
when the timer goes off. This lets the other CPU have some time to catch
the panic request before the originating CPU panics itself.
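Putting the two halves side by side, the handshake looks roughly like
this (paraphrased from the diffs at the end; surrounding code omitted):

	/* requesting CPU, in mp_save_fpu_proc(): */
	fpcpu->ci_other_rq_panic = 0;		/* start with the flag clear */
	macppc_send_ipi(fpcpu, MACPPC_IPI_FLUSH_FPU);
	for (i = 0; i < 0x3fffffff; i++)
		if (pcb->pcb_fpcpu == NULL)	/* the flush happened */
			return;
	fpcpu->ci_other_rq_panic = 1;		/* ask the hung CPU to panic */
	asm ("sync");				/* for good measure */
	for (i = 0; i < 0x3fffffff; i++);	/* give it time to notice */
	panic("mp_save_fpu_proc");

	/* hung CPU, at each checkpoint in do_pending_int(): */
	if (ci->ci_other_rq_panic)
		panic("do_pending_intr(n) taking too long!");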
Since the kernel panics at random times, I had time to sit and ponder
what would keep the PSL_EE disabled while ci_iactive stayed true.
There aren't too many places, and none of them have endless loops. That
makes it lean towards a problem with being stuck in do_pending_int and
not being able to honor the IPI. An examination of the macro KERNEL_LOCK
shows an interesting possibility. The macro is only defined for
MULTIPROCESSOR, and resolves to _kern_lock in src/sys/kern/kern_lock.c.
Here we see the first possible hint of a deadlock.
_kern_lock has
	s = splbiglock();
	while (!simple_lock_try(&kernel_lock)) {
		splx(s);
		SPINLOCK_SPIN_HOOK;
		s = splbiglock();
	}
and simple_lock_try has
	/*
	 * MULTIPROCESSOR case: This is `safe' since if it's not us,
	 * we don't take any action.
	 */
#if defined(MULTIPROCESSOR) /* { */
	if ((rv = __cpu_simple_lock_try(&alp->lock_data)) == 0) {
		if (alp->lock_holder == cpu_id)
			SLOCK_WHERE("simple_lock_try: locking against myself\n",
			    alp, id, l);
		goto out;
	}
#else
...
 out:
	splx(s);
	return (rv);
and __cpu_simple_lock_try has
(a bunch of assembly language code that I'm not going to hurt my brain
any more trying to figure out)
Let's suppose __cpu_simple_lock_try() returns 0. Then rv == 0 is true,
the code kicks to out:, hits splx(), and returns rv (0). The while()
loop in _kern_lock then hits splx() again and loops again.
Hmmm... see, I had a real hard time figuring out how do_pending_int()
gets called in the first place. It turns out splx() is one of only a
couple of places it gets called from, and splx() gets called all over
the place. However, the first check in do_pending_int() is whether we
are already in do_pending_int() (if (ci->ci_iactive) return;). So if
we're stuck in a while() loop somewhere that keeps calling splx(), we
never leave the original do_pending_int() invocation, ci_iactive stays
true, and every re-entry via splx() returns immediately. The IPI never
gets honored because cpuintr() never gets called.
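Written out, the shape of the suspected hang is (a sketch based on my
reading above, not code from the tree):

	/* the hung CPU, somewhere under the outer do_pending_int(),
	 * e.g. spinning in KERNEL_LOCK; ci->ci_iactive is already set */
	s = splbiglock();
	while (!simple_lock_try(&kernel_lock)) {	/* never succeeds */
		splx(s);	/* re-enters do_pending_int(), which returns
				 * immediately because ci_iactive is set */
		SPINLOCK_SPIN_HOOK;
		s = splbiglock();
	}
	/* cpuintr() never runs, so the MACPPC_IPI_FLUSH_FPU request from
	 * the other CPU is never serviced, and that CPU eventually gives
	 * up and panics in mp_save_fpu_proc(). */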
Ok, now that I've got a panic eight hours later, here is the bt:
mp_save_fpu_proc{1} pid = 4623.1, fpcpu->ci_cpuid = 0
number of times other CPU has iactive: 0x3fffffff
panic: mp_save_fpu_proc
Stopped in pid 4623.1 (nbmake) at netbsd:cpu_Debugger+0x10:
lwz r0, r1, 0x14
At this point, I thought perhaps my semaphore didn't work, so I rebooted:
db{1}> reboot
syncing disks... panic: do_pending_intr(3) taking too long!
Stopped at netbsd:cpu_Debugger+0x10: lwz r0, r1, 0x14
db{0}> bt
0x00697ef0: at panic+0x19c
0x00697f80: at do_pending_int+0x2c8
0x00697fc0: at spllower+0x50
0x00697fe0: at lcsplx+0xc
0x00697ff0: at Idle+0x18
What do you know, it _did_ work. It seems out of order, but maybe with
fine tuning it'll nail it. Except, look at the debugger message that was
printed out:
panic: do_pending_intr(3) taking too long!
This wasn't the first semaphore, or even the second. It was the _third_
semaphore, which means that the request for the other CPU to panic
occurred while in do_pending_int(), as theorized. In fact, this panic comes
immediately after KERNEL_LOCK, which looked promising as explained
earlier. However, there were no do_pending_intr(1) messages, so there
was no reentry into do_pending_int, and although the (3) is in a while()
loop, it could also get hit the first time. After some more examining,
I'm not able to pinpoint the problem, but I find the call to
s = splbiglock()
interesting. This is a macro for splclock(), which is a macro for
splraise(imask[IPL_CLOCK]). Is it possible that something related to
the clock on the primary CPU is getting turned off, and that thread just
stops cold? The thread hit the panic only _after_ the secondary CPU, the
one requesting the panic, rebooted.
The trail goes cold at that point...
(and to all who suggested make -j 2, I am deeply grateful for how much
time you've saved me in these builds)
tim
db{0}> show all procs
PID PPID PGRP UID S FLAGS LWPS COMMAND WAIT
4623 24595 98 0 2 0x4002 1 nbmake
24595 23629 98 0 2 0x4002 1 sh wait
23629 7418 98 0 2 0x4002 1 nbmake wait
7418 11122 98 0 2 0x4002 1 sh wait
11122 26927 98 0 2 0x4002 1 nbmake wait
26927 15045 98 0 2 0x4002 1 sh wait
15045 26110 98 0 2 0x4002 1 nbmake wait
26110 1321 98 0 2 0x4002 1 sh wait
25435 440 25435 0 2 0x4002 1 pico select
3572 14680 3572 0 2 0x4002 1 pico select
14680 14575 14680 0 2 0x4002 1 tcsh pause
14575 14732 14575 0 2 0x4002 1 csh pause
14732 13541 14732 1000 2 0x4002 1 tcsh pause
13541 10536 10536 1000 2 0x100 1 sshd select
10536 329 10536 0 2 0x1 1 sshd netio
1321 98 98 0 2 0x4002 1 nbmake wait
98 404 98 0 2 0x5002 1 sh wait
440 72 440 0 2 0x4002 1 tcsh pause
72 429 72 0 2 0x4002 1 csh pause
429 414 429 1000 2 0x4002 1 tcsh pause
414 400 400 1000 2 0x100 1 sshd select
400 329 400 0 2 0x1 1 sshd netio
404 377 404 0 2 0x4002 1 tcsh pause
377 398 377 0 2 0x4002 1 csh pause
398 381 398 1000 2 0x4002 1 tcsh pause
381 379 379 1000 2 0x100 1 sshd select
379 329 379 0 2 0x1 1 sshd netio
380 375 380 0 2 0x4002 1 top poll
375 364 375 0 2 0x4002 1 csh pause
364 306 364 1000 2 0x4002 1 tcsh pause
306 373 373 1000 2 0x100 1 sshd select
373 329 373 0 2 0x1 1 sshd netio
363 1 363 0 2 0x4002 1 getty ttyin
314 1 1 0 2 0x4000 1 getty nanosle
362 1 362 0 2 0 1 cron nanosle
337 1 337 0 2 0 1 inetd kqread
335 1 335 0 2 0x100 1 sendmail select
329 1 329 0 2 0 1 sshd select
191 1 191 0 2 0 1 syslogd
7 0 0 0 2 0x20200 1 aiodoned aiodone
6 0 0 0 2 0x20200 1 ioflush syncer
5 0 0 0 2 0x20200 1 pagedaemon pgdaemo
4 0 0 0 2 0x20200 1 lfs_writer lfswrit
3 0 0 0 2 0x20200 1 scsibus1 sccomp
2 0 0 0 2 0x20200 1 scsibus0 sccomp
1 0 1 0 2 0x4000 1 init wait
0 -1 0 0 2 0x20200 1 swapper schedul
db{0}> show registers
r0 0x311d9c panic+0x1a0
r1 0x697ee0 ADBDevTable+0x7a680
r2 0x61aac8 cpu_info
r3 0x61aac8 cpu_info
r4 0xd
r5 0x1
r6 0
r7 0x5d0000 ddbstk+0x1958
r8 0
r9 0x5d0000 ddbstk+0x1958
r10 0
r11 0x63061
r12 0x28002048
r13 0x185cb98 ADBDevTable+0x123f338
r14 0x1850000 ADBDevTable+0x12327a0
r15 0x5
r16 0x5
r17 0x1850000 ADBDevTable+0x12327a0
r18 0x1850000 ADBDevTable+0x12327a0
r19 0x18
r20 0x1
r21 0x5d0000 ddbstk+0x1958
r22 0x1
r23 0x605820 intrsources
r24 0
r25 0xffffffff
r26 0x5d0000 ddbstk+0x1958
r27 0x1032
r28 0x1032
r29 0x104
r30 0x5d0000 ddbstk+0x1958
r31 0x5095a4 linux_socketcall+0x890
iar 0x3bdb4c cpu_Debugger+0x10
msr 0x1032
lr 0x3bdb4c cpu_Debugger+0x10
ctr 0x3dde30 cnflush
cr 0x48002048
xer 0
mq 0x1
netbsd:cpu_Debugger+0x10: lwz r0, r1, 0x14
db{0}> reboot
halt{1}
rebooting
Patches:
Index: cpu.h
===================================================================
RCS file: /cvsroot/src/sys/arch/powerpc/include/cpu.h,v
retrieving revision 1.42
diff -d -u -r1.42 cpu.h
--- cpu.h 22 Sep 2004 11:32:03 -0000 1.42
+++ cpu.h 12 Nov 2004 21:10:30 -0000
@@ -113,6 +113,9 @@
struct evcnt ci_ev_vec; /* Altivec traps */
struct evcnt ci_ev_vecsw; /* Altivec context switches */
struct evcnt ci_ev_umchk; /* user MCHK events */
+#ifdef MULTIPROCESSOR
+ int ci_other_rq_panic; /* the other CPU needs us to panic */
+#endif
};
#ifdef MULTIPROCESSOR
Index: machdep.c
===================================================================
RCS file: /cvsroot/src/sys/arch/macppc/macppc/machdep.c,v
retrieving revision 1.133
diff -d -u -r1.133 machdep.c
--- machdep.c 3 Jul 2004 16:24:57 -0000 1.133
+++ machdep.c 13 Nov 2004 01:48:13 -0000
@@ -681,7 +681,7 @@
{
struct pcb *pcb = &l->l_addr->u_pcb;
struct cpu_info *fpcpu;
- int i;
+ int i, j;
/*
* Send an IPI to the other CPU with the data and wait for that CPU
@@ -694,6 +694,9 @@
if (fpcpu == NULL) {
return;
}
+ j = 0;
+ fpcpu->ci_other_rq_panic = 0;
+
macppc_send_ipi(fpcpu, MACPPC_IPI_FLUSH_FPU);
/* Wait for flush. */
@@ -704,9 +707,17 @@
for (i = 0; i < 0x3fffffff; i++) {
if (pcb->pcb_fpcpu == NULL)
return;
+ else {
+ if (fpcpu->ci_iactive)
+ j++;
+ }
}
printf("mp_save_fpu_proc{%d} pid = %d.%d, fpcpu->ci_cpuid =
%d\n", cpu_number(), l->l_proc->p_pid, l->l_lid, fpcpu->ci_cpuid);
+ printf("number of times other CPU has iactive: 0x%x\n", j);
+ fpcpu->ci_other_rq_panic = 1;
+ asm ("sync"); /* for good measure */
+ for (i = 0; i < 0x3fffffff; i++); /* sit and spin for a bit */
panic("mp_save_fpu_proc");
#endif
}
Index: extintr.c
===================================================================
RCS file: /cvsroot/src/sys/arch/macppc/macppc/extintr.c,v
retrieving revision 1.44
diff -d -u -r1.44 extintr.c
--- extintr.c 20 Jun 2004 20:50:13 -0000 1.44
+++ extintr.c 13 Nov 2004 01:43:14 -0000
@@ -679,6 +679,12 @@
int hwpend;
int emsr, dmsr;
+#ifdef MULTIPROCESSOR
+ if (ci->ci_other_rq_panic)
+ /* we're in a loop - handle panic */
+ panic("do_pending_intr(1) taking too long!");
+#endif
+
if (ci->ci_iactive)
return;
@@ -689,12 +695,16 @@
pcpl = ci->ci_cpl;
again:
-
#ifdef MULTIPROCESSOR
if (ci->ci_cpuid == 0) {
#endif
/* Do now unmasked pendings */
while ((hwpend = (ci->ci_ipending & ~pcpl & HWIRQ_MASK)) != 0) {
+#ifdef MULTIPROCESSOR
+ if (ci->ci_other_rq_panic)
+ /* been here too long */
+ panic("do_pending_intr(2) taking too long!");
+#endif
irq = 31 - cntlzw(hwpend);
is = &intrsources[irq];
if (!have_openpic)
@@ -706,6 +716,11 @@
KERNEL_LOCK(LK_CANRECURSE|LK_EXCLUSIVE);
ih = is->is_hand;
while (ih) {
+#ifdef MULTIPROCESSOR
+ if (ci->ci_other_rq_panic)
+ /* been here too long */
+ panic("do_pending_intr(3) taking too long!");
+#endif
(*ih->ih_fun)(ih->ih_arg);
ih = ih->ih_next;
}