port-macppc: Re: reproducible kernel panic w/ 2.0RC4MP

Subject: Re: reproducible kernel panic w/ 2.0RC4MP
To: Bill Studenmund <wrstuden@netbsd.org>
From: Tim Kelly <hockey@dialectronics.com>
List: port-macppc
Date: 12/10/2004 15:50:14
Hi Bill,

>I'm coming into this thread late, I realize. Some of what I'm about to say
>may already have been covered. Sorry.
>
>First off, thank you for digging into this.

My pleasure. I happen to have a dual 604e card in my 7300, so I would like
this to work.

>I agree with Pavel and think that we need to be careful about having
>IPL_IPI above IPI_HIGH. Well, specifically above SPL_LOCK. And actually I
>really think we shouldn't. I realize that you may have changed your mind
>on this (the last patch I've seen is just turning interrupts on a bit
>earlier) and thus the comment is just accademic. :-) I think it would be
>difficult in the long run to ensure that IPIs never need to grab locks
>(especially if they can on other platforms), and so we can get into a
>mess.

I was thinking in terms of race conditions, where one processor needed the
other processor to flush the FPUs in order to release a resource, but the
other processor was already trying to acquire a simple lock. In
sys/kern/kern_lock.c, there is this comment:

 * Some platforms may have interrupts of higher priority than splsched(),
 * including hard serial interrupts, inter-processor interrupts, and
 * kernel debugger traps.

As Pavel pointed out, in arch/macppc/include/intr.h

/*
 * Miscellaneous
 */
#define splvm()         splraise(imask[IPL_VM])
#define splhigh()       splraise(imask[IPL_HIGH])
#define splsched()      splhigh()
#define spllock()       splhigh()
#define spl0()          spllower(0)

So splsched is splhigh, and by inference with the comment above, there
could or should be a spl higher than splhigh for interprocessor interrupts.
If we raise to spllock, the IPI would still get through if it was a higher
spl. I admit that my grasp of spl is not at the level of yours or Pavel's,
and so I was just trying to follow what documentation I could find.

>
>>  #ifdef MULTIPROCESSOR
>> +	tmsr = emsr;
>>  	if (ci->ci_cpuid == 0) {
>
>I also don't like that line of the patch. We're supposed to support SMP,
>which is _Symmetric_. Hard-coding cases for one CPU or another strikes me
>as wrong. I realize we need to special-case startup, but other than that I
>think we should be able to deal with things on any processor. I realize we
>may route certain interrupts to certain CPUs, but we should look at things
>in terms of "the CPU X is bound to" not a magic number.

The if (ci->ci_cpuid == 0) { is not my patch. extintr.c already restricts
external and pending non-soft interrupts to being handled on CPU0. I just
found that if I enabled PSL_EE for CPU1 when reaching do_pending_int, I
wouldn't even get past the attachment phase. The change to locore_subr.S's
Idle() basically does that, and on my MP it refuses to come even close to
the level of functionality it had before. If I only enable PSL_EE on CPU0,
I have been unable to reproduce the kernel panic I could before at will.

Michael and Allen have made great progress towards revamping the interrupt
handling, to the point where the remaining problems may be specific to my
CPU card (Daystar). Dave Huang reported that Matt's patch did fix his
kernel panic, and as PSL_EE should have been enabled before lowering spl,
perhaps Apple MP hardware will work properly now.

tim