IP32 debugging kernel intr handler (crime/mace)

To: port-sgimips%netbsd.org@localhost
Subject: IP32 debugging kernel intr handler (crime/mace)
From: roop <roop%tamasi.org@localhost>
Date: Tue, 19 Mar 2013 10:11:59 +0000

Hi,

I'm digging into 47640 on sgimips (SGI O2 IP32) on netbsd-current
(6.99.17).

http://gnats.NetBSD.org/cgi-bin/query-pr-single.pl?number=47640

Background
-----------------
I'm basically starting out on kernel tinkering.

As explained in the gnats this SGI IP32 will hard lock at random point.
Otherwise it operates very nice and quickly. That was 6.0.1, same deal
on current. 
Also 47639, a random keyboard flash co-incides with a interrupt, may be
related.

What I want to do, but don't know if i can:
- get a debugger triggered at the right point, which means finding the
  right spot
- adding more debug around some other bits of kernel code, but not so to
  saturate the console
- breaking the kernel (linux magic sysreq style) and just getting it to
  dump it's stack at a keypress

The hunch was a glitch in the interrupt handling, so I've added
aprint_debug() liberally throughout dev/crime.c and mace/mace.c, as seen
in
https://gist.github.com/rooprob/5191131

The pattern I've discovered seems always "hang" after a complete
crime_intr()/mace_intr() loop, or immediately before the next one.
I'm completely open to the fact that thereofre this means the hang is
somewhere else.
There'll be hundreds of successful debug messages, occasionally a NULL
function pointer I've trapped with my code but it always end the same
way.

crime_intr()
  -> mace_intr()
     -> finish mace_intr
<- finish crime_intr

Some examples of my debug output:

XXX2013-RPF crime_intr(2111091192, 536936211, 1024)!
XXX2013-RPF <- !crime_ipending 0
XXX2013-RPF crime_intr(2152175152, 536936195, 1024)!
XXX2013-RPF <- !crime_ipending 0

XXX2013-RPF crime_intr(2110433472, 536936211, 1024)!
XXX2013-RPF -> mace_intr(256)
XXX2013-RPF -> mace_inst(256): processing non-IRQ 4
M0:256 M1:256 M2:256 M3:256 M4:256 M5:256 M6:256 M7:->(0x80074124)
M8:256 M9:256 M10:256 M11:256 M12:256 M13:256 M14:256 M15:256 M16:256
M17:256 M18:256 M19:256 M20:256 M21:256 M22:256 M23:256 M24:256 M25:256
M26:256 M27:256 M28:256 M29:256 M30:256 M31:256 done
XXX2013-RPF <- !crime_ipending 0

This last one, MACE idx 7 in maceintrtab fired.

Mar 18 23:22:03 mapleleaf /netbsd: XXX2013-RPF crime_intr(4255804, 536936211, 
1024)!
Mar 18 23:22:03 mapleleaf /netbsd: XXX2013-RPF -> mace_intr(16)
Mar 18 23:22:03 mapleleaf /netbsd: XXX2013-RPF -> IRQ4 mace_intr(16) is IRQ 
4!!! 
Mar 18 23:22:03 mapleleaf /netbsd: XXX2013-RPF -> IRQ4 mace_intr(16) completed 
mips3_ld() 
Mar 18 23:22:03 mapleleaf /netbsd: M4:0:16 M4:1:->(0x800d4348) M4:2:16 M4:3:16 
M4:4:16 M4:5:16 M4:6:16 M4:7:16 M4:8:16 M4:9:16 M4:10:16 M4:11:16 M4:12:16 
M4:13:16 M4:14:16 M4:15:16 M4:16:16 M4:17:16 M4:18:16 M4:19:16 M4:20:16 
M4:21:16 M4:22:16 M4:23:16 M4:24:16 M4:25:16 M4:26:16 M4:27:16 M4:28:16 
M4:29:16 M4:30:16 M4:31:16 done
Mar 18 23:22:03 mapleleaf /netbsd: XXX2013-RPF -> mace_inst(0): processing 
non-IRQ 4
Mar 18 23:22:03 mapleleaf /netbsd: M0:0 M1:0 M2:0 M3:0 M4:0 M5:0 M6:0 M7:0 M8:0 
M9:0 M10:0 M11:0 M12:0 M13:0 M14:0 M15:0 M16:0 M17:0 M18:0 M19:0 M20:0 M21:0 
M22:0 M23:0 M24:0 M25:0 M26:0 M27:0 M28:0 M29:0 M30:0 M31:0 done
Mar 18 23:22:03 mapleleaf /netbsd: XXX2013-RPF <- !crime_ipending 0


Causing a crash:
----------------------
With all my new debug spewing to the screen I can often get through a
good amount of atf-run before it will abruptly stop. It dies at
different places each time. If it's a subtle timing issue, it could be
masked by the new slowness of having to output to serial line, but I'm
not so sure about that.

Example output from rtf-run,
fs/ffs/t_setquota (84/498): 40 test cases
    set_be_1_both: [3.021925s] Passed.
... <snip>...
    set_default_le_1_both: [2.517445s] Passed.
    set_default_le_1_group: <died here>

Copying a new kernel from an NFS remote build server is my usual sure
fire way of getting it to hang quickly. This is why I'm thinking
interrupts. What doesn't gel with this idea is how long it can go - it's
interrupts and something else, perhaps.
What's interesting is that the intr handler stands up well against a
deluge of serial debug message output.

Where next?
----------------
Now I know how to make it crash, and now I've exposed the intr handling,
what should I look at next?

hunch#2: a missed interrupt ? kernel waiting forever?
hunch#3: a corrupted function (I tried to add sanity around that in my
patch where function !=NULL and managed to log a few, but to no avail)
hunch#4: something which happens outside intr handler space

I'm going to have to get the output-deluge down as it's too much for
syslog to capture, nor a serial line at 115200baud. The gfx console
happily keeps up, but I cannot copy and paste that to record the event. 

thanks for reading this far.

rob

Prev by Date: Re: sgi O2 VICE and Nintendo64
Next by Date: Can't boot SGI O2
Previous by Thread: sgi O2 VICE and Nintendo64
Next by Thread: Can't boot SGI O2
Indexes:

Home | Main Index | Thread Index | Old Index