Subject: kernel stack overflow due to deep interrupt nesting
To: port-mips@netbsd.org, port-sgimips@netbsd.org
From: Rafal Boni <rafal@attbi.com>
List: port-sgimips
Date: 04/05/2002 12:37:18
Folks:
I've finally had a few minutes of quiet to chase down a problem I've
been staring at on-and-off for a while: with a lot of output going
to the serial console & the console running at high speeds (38.4kbps
in this case), my sgi kernels would generally fall over in several
rather brutal ways (usually cache error panics or something else
really non-intuitive).
I finally tracked down the problem to a kernel stack overflow due to
too deep interrupt nesting... Here's a backtrace (the panic is a
check I added to make tracking this down a bit easier):
panic: cpu_intr: max_intr_depth too high: 16
Stopped at 0x8815ee64: jr ra
bdslot: nop
db> tr
cpu_Debugger+4 (8ffff000,d,0,0) ra 88099290 sz 0
panic+124 (881ba894,36,0,6) ra 88193390 sz 40
cpu_intr+84 (881ba894,36,0,6) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,20) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,20) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,20) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,33) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,33) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,33) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,34) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,34) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,34) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,3a) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,3a) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,3a) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,39) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,39) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,39) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,30) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,30) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,30) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,20) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,20) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,20) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,32) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,32) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,32) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,32) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,32) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,32) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,20) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,20) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,20) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,62) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,62) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,62) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,65) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,65) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,65) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,46) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,46) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,46) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,20) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,20) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,20) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,1,bfa00000,4) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,1,bfa00000,4) ra 881933b0 sz 64
cpu_intr+a4 (fc01,1,bfa00000,4) ra 8815d79c sz 32
mips3_KernIntr+84 (ca554000,0,97ba,881f97a0) ra 8806928c sz 128
cpu_switch+64 (ca554000,0,97ba,881f97a0) ra 8808dabc sz 24
mi_switch+278 (ca554000,0,97ba,881f97a0) ra 8808d088 sz 48
ltsleep+244 (ca554000,0,97ba,881f97a0) ra 880d2320 sz 48
sched_sync+24c (ca554000,0,97ba,881f97a0) ra 8815de30 sz 104
mips3_proc_trampoline+8 (ca554000,0,97ba,881f97a0) ra 0 sz 0
User-level: curproc NULL
db> reboot 8
syncing disks... trap: TLB miss (load or instr. fetch) in kernel mode
status=0xff02, cause=0x8408, epc=0x880caa48, vaddr=0x0
curproc == NULL ksp=0xca554aa8
Stopped at 0x880caa48: lw v1,264(a0)
The problem is (and this can probably also happen on any other
MIPS port that uses a platform-specific IO interrupt handler,
since many do the same thing) that interrupts are generally
re-enabled inside the platform-specific IO interrupt handler.
The handler can then itself be interrupted to service new
interrupts, and so on, with every level of nesting eating more
kernel stack.
Note the second and subsequent mips3_KernIntr invocations all
happen to come from the same address; that address (`0x8818f7d8')
is the next instruction after the call to:
_splset((status & ~cause & MIPS_HARD_INT_MASK) | MIPS_SR_INT_IE);
at the end of ip22_intr() in sgimips/sgimips/ip22.c.
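
To make the nesting concrete, here is a small user-space model of it, not
kernel code. Everything in it is hypothetical except two things taken from
the trace above: the depth limit of 16 from the panic message, and the frame
sizes in the "sz" column (128 + 64 + 32 bytes per nesting level). The sizes
only count those three frames, so the real per-level cost is presumably
higher once the device handlers are included.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INTR_DEPTH  16              /* limit seen in the panic message */
#define BYTES_PER_LEVEL (128 + 64 + 32) /* mips3_KernIntr + ip22_intr + cpu_intr */

/* Hypothetical simulation state; none of these names exist in the kernel. */
struct nest_sim {
    int pending;    /* interrupts that will arrive back to back */
    int depth;      /* current nesting depth */
    int max_depth;  /* deepest nesting observed */
    int overflowed; /* set where the added check would panic */
};

/*
 * Stand-in for a handler that re-enables interrupts before it returns:
 * if another interrupt is pending, it is taken immediately, on top of
 * the frames already on the stack.
 */
static void nest_sim_intr(struct nest_sim *s)
{
    s->depth++;
    if (s->depth > s->max_depth)
        s->max_depth = s->depth;

    if (s->depth >= MAX_INTR_DEPTH) {
        s->overflowed = 1;      /* the real check panics here */
    } else if (s->pending > 0) {
        s->pending--;
        nest_sim_intr(s);       /* nested interrupt, no return in between */
    }
    s->depth--;
}

/* Deliver one interrupt followed by `extra` more; return the peak depth. */
static int nest_sim_peak_depth(int extra, int *overflowed)
{
    struct nest_sim s = { extra, 0, 0, 0 };

    nest_sim_intr(&s);
    assert(s.depth == 0);       /* every level unwound again */
    if (overflowed != NULL)
        *overflowed = s.overflowed;
    return s.max_depth;
}

/* Lower bound on stack consumed at the peak, from the three traced frames. */
static size_t nest_sim_peak_stack(int extra)
{
    return (size_t)nest_sim_peak_depth(extra, NULL) * BYTES_PER_LEVEL;
}
```

In this model the peak depth grows with the number of back-to-back
interrupts until it hits the limit, which matches the shape of the trace:
a steady stream of console interrupts keeps stacking frames instead of
unwinding them.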
I can think of several possible solutions, but none seem to be very
good to me, so I thought I'd toss this out here and see if people
have any better ideas.
Potential `solutions' that I'm not too happy with include:
* Enlarging the size of the kernel stack in hopes of avoiding
this. Not sure how deeply nested we can get, though, so I
don't know how many more pages we'd need to set up.
* Making the interrupt routine non-reentrant by not frobbing the
interrupt masks internally and hoping it gets taken care of by
the return-from-interrupt restoring the SR and interrupt masks.
This seems a little draconian.
* Frobbing the mips-generic interrupt code to loop while there are
pending interrupts to avoid taking additional exceptions and then
only restoring interrupt masks after exiting from the loop. This
is probably the least repulsive to me, but it needs to touch generic
code, which I'm loath to do this close to the 1.6 release being
branched 8-/
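
The third option can be sketched in the same user-space style. This is only
a model of the idea under assumed names, not a patch against the
mips-generic code: the handler keeps interrupts masked, polls for further
pending work in a loop, and would restore the masks once after the loop.

```c
#include <assert.h>

/* Hypothetical simulation state; none of these names exist in the kernel. */
struct loop_sim {
    int pending;    /* interrupts that become pending while we run */
    int depth;      /* current nesting depth */
    int max_depth;  /* deepest nesting observed */
    int handled;    /* interrupts serviced so far */
};

/*
 * Stand-in for a handler that never re-enables interrupts itself.
 * Further pending interrupts are picked up by the loop, so no new
 * exception frame is pushed for them.
 */
static void loop_sim_intr(struct loop_sim *s)
{
    s->depth++;
    if (s->depth > s->max_depth)
        s->max_depth = s->depth;

    for (;;) {
        s->handled++;           /* service the current interrupt */
        if (s->pending == 0)
            break;              /* nothing else waiting */
        s->pending--;           /* next one found pending, handled in place */
    }
    /* The interrupt masks would be restored here, once, after the loop. */

    s->depth--;
}

/* Deliver one interrupt followed by `extra` more; return the final state. */
static struct loop_sim loop_sim_run(int extra)
{
    struct loop_sim s = { extra, 0, 0, 0 };

    loop_sim_intr(&s);
    assert(s.depth == 0);
    return s;
}
```

However many interrupts arrive back to back, the depth in this model stays
at one. What the model does not show is the cost: while the loop is busy,
nothing else runs at that level, so a sustained stream of interrupts could
presumably delay other work for as long as it lasts.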
Any thoughts, ideas, etc. appreciated,
--rafal
----
Rafal Boni rafal@attbi.com
We are all worms. But I do believe I am a glowworm. -- Winston Churchill