Subject: more system hang debugging
To: None <tech-kern@netbsd.org, port-mips@netbsd.org>
From: Emmanuel Dreyfus <manu@netbsd.org>
List: port-mips
Date: 03/31/2002 23:16:23
Hello

I'm still working on the system hang that occurs when Photoshop runs
under COMPAT_IRIX on sgimips. I have discovered that the hang is caused
by the same exception being handled forever.

It's a page fault. mips3_UserGenException (from
sys/arch/mips/mips/mipsX_subr.S) is invoked and calls trap(), which in
turn calls uvm_fault(). In uvm_fault(), uvmfault_lookup() fails because
there is no mapping at the requested address (0xc). We return to trap()
with EFAULT.

Here is the output of the diagnostic log in trap():

uvm_fault(0x886f8320 (pmap 0x886f9080), 0 (0xc), 0, 1) -> 14 at pc
0x41445d8

As I understand it, the user process tried to access address 0xc. There
is no valid mapping there, hence the kernel should send a SIGSEGV to the
process.
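
For anyone reading the log line above: the "-> 14" is the value
uvm_fault() returned. A minimal sketch of the check, assuming the
system's <errno.h> assigns 14 to EFAULT (NetBSD does); the helper name
is hypothetical and not part of the kernel:

```c
#include <errno.h>

/*
 * Hypothetical helper, for illustration only: returns 1 when a
 * uvm_fault()-style return value is EFAULT ("Bad address").
 * Assumption: EFAULT is 14 on the system compiling this.
 */
static int
fault_rv_is_efault(int rv)
{
	return rv == EFAULT;
}
```

With the value from the log, fault_rv_is_efault(14) is true, which
matches the EFAULT I see coming back into trap().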

By adding more printf calls, I can see that trap() then calls
trapsignal(), which in turn calls psignal1(). psignal1() marks the user
process as runnable and schedules the delivery of a SIGSEGV, if it is
not masked. Everything seems all right here.

What is strange is that the same exception is handled again immediately
after we leave trap(). If I add printfs at the beginning and the end of
trap(), I can see that we re-enter trap() with exactly the same
exception just after we have left it. The diagnostic log shows the same
information forever.

I believe we never go back to user mode, because
1) the machine is hung
2) in the diagnostic log in trap(), we always see the same value for pc;
it never moves forward.
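
The kind of check I am doing by eye on the log could be sketched as
follows. This is only an illustration: struct fault_record and
fault_record_note() are hypothetical names, not existing NetBSD code,
and a real version would have to live in trap() and keep one record per
process.

```c
#include <stdint.h>

/* Hypothetical record of the last fault seen; not a NetBSD structure. */
struct fault_record {
	uint32_t pc;		/* faulting program counter */
	uint32_t vaddr;		/* faulting virtual address */
	unsigned repeats;	/* consecutive identical faults, 0 = none yet */
};

/*
 * Note a fault and return how many times in a row the same
 * (pc, vaddr) pair has been seen, counting this one.
 */
static unsigned
fault_record_note(struct fault_record *r, uint32_t pc, uint32_t vaddr)
{
	if (r->repeats != 0 && r->pc == pc && r->vaddr == vaddr) {
		r->repeats++;
	} else {
		r->pc = pc;
		r->vaddr = vaddr;
		r->repeats = 1;
	}
	return r->repeats;
}
```

Fed with pc 0x41445d8 and address 0xc on every entry to trap(), the
counter would grow without bound, which is what the log shows.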

Now the question is: how can this happen? I read the code for
UserGenException, and although I'm not a MIPS assembly expert, I do not
see any place where we could jump back into trap(). And I do not expect
the code in UserGenException to cause a new fault, because
1) trap() is always called with a fault and never a kernelfault (hence I
believe it's not caused by kernel code)
2) I would expect the kernel stack to explode after some time
3) the diagnostic log shows me a pc value at 0x41445d8, this is not in
kernel memory.
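
The reasoning in point 3 can be written down as a small check. This is
a sketch that assumes the 32-bit MIPS address layout, where user space
(kuseg) is everything below 0x80000000 and kernel segments start at
kseg0; the macro and function names are mine, not the kernel's.

```c
#include <stdint.h>

/* Assumption: 32-bit MIPS layout, kseg0 begins at 0x80000000. */
#define SKETCH_MIPS_KSEG0_START	0x80000000u

/* Returns 1 if addr lies in kuseg (user space), 0 otherwise. */
static int
sketch_mips32_is_user_addr(uint32_t addr)
{
	return addr < SKETCH_MIPS_KSEG0_START;
}
```

The pc from the log, 0x41445d8, falls in user space under this test,
while the kernel pointers in the same log line (0x886f8320, 0x886f9080)
do not.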

Another question: I also have a stack trace when I drop into ddb:

uvm_fault+c8 (886f8960,0,0,1) ra 881615cc sz 296
trap+4d0 (ff13,0,0,1) ra 8815a484 sz 64
mips3_UserGenException+cc (ff13,0,0,41445d8) ra 0 sz 0

I wonder whether a return address of 0 for mips3_UserGenException is
reasonable. Could this explain the problem?

-- 
Emmanuel Dreyfus.
"The 80x86 isn't all that complex - it just doesn't make sense"
(Mike Johnson, head of x86 design at AMD)
manu@netbsd.org