Subject: signal delivery debugging
To: None <tech-kern@netbsd.org>
From: Emmanuel Dreyfus <manu@netbsd.org>
List: tech-kern
Date: 03/29/2002 11:37:04
Hi!

I've hit a strange problem while attempting to run Photoshop on
NetBSD/sgimips with COMPAT_IRIX. I'm looking for ideas about how to
debug this. In short, I get a machine hang, but no code from
COMPAT_IRIX seems to be invoked at hang time.

More details:
For some reason, the program gets a SIGSEGV, the signal handler is
invoked, but it seems to never call sigreturn. Here is the trace:

   242 AdobePhotoshop3. CALL  access(0x1101b298,0)
   242 AdobePhotoshop3. NAMI  "/emul/irix/var/tmp/photoAAAa0003m"
   242 AdobePhotoshop3. NAMI  "/var/tmp/photoAAAa0003m"
   242 AdobePhotoshop3. RET   access -1 errno 2 No such file or directory
   242 AdobePhotoshop3. PSIG  SIGSEGV caught handler=0x41444c4 mask=(13,20) code=0x200e00
   242 AdobePhotoshop3. CALL  getpid
   242 AdobePhotoshop3. RET   getpid 242/0xf2
   242 AdobePhotoshop3. CALL  write(0x2,0x45e010c,0xe)
   242 AdobePhotoshop3. GIO   fd 2 wrote 14 bytes
       "Caught signal "
   242 AdobePhotoshop3. RET   write 14/0xe
   242 AdobePhotoshop3. CALL  getpid
   242 AdobePhotoshop3. RET   getpid 242/0xf2
   242 AdobePhotoshop3. CALL  write(0x2,0xfb53c60,0x8)
   242 AdobePhotoshop3. GIO   fd 2 wrote 8 bytes
       "at PC 0x"
   242 AdobePhotoshop3. RET   write 8
   242 AdobePhotoshop3. CALL  write(0x2,0x7fffb7dd,0x1)
   242 AdobePhotoshop3. GIO   fd 2 wrote 1 bytes
       "1"
   242 AdobePhotoshop3. RET   write 1
   242 AdobePhotoshop3. CALL  write(0x2,0xfb53c60,0x2)
   242 AdobePhotoshop3. GIO   fd 2 wrote 2 bytes
       ": "
   242 AdobePhotoshop3. RET   write 2
   242 AdobePhotoshop3. CALL  getpid
   242 AdobePhotoshop3. RET   getpid 242/0xf2
   242 AdobePhotoshop3. CALL  write(0x2,0x45e0128,0x1e)
   242 AdobePhotoshop3. GIO   fd 2 wrote 30 bytes
       "with no information provided.
       "
   242 AdobePhotoshop3. RET   write 30/0x1e

At this point, the machine hangs. It no longer handles network
connections, and getty on the console does not respond to keystrokes.

I suspected that the program had called sigreturn, and that a bug in
irix_sys_sigreturn crashed the kernel before ktrace could record the
system call. I was wrong: if I add some printf calls at the very
beginning of the syscall handlers in sys/arch/mips/mips/syscall.c to
display the pid and system call number on the console, I get this (I
killed syslogd first, otherwise the output is flooded with thousands of
syscalls from syslogd):

returning from irix_sendsig()
242, fancy: 1020
242, fancy: 1004
174 plain: 3
174 plain: 93
242, fancy: 1020
242, fancy: 1004
174 plain: 20
174 plain: 20
174 plain: 93
174 plain: 3
174 plain: 4
174 plain: 93
242, fancy: 1004
174 plain: 3
174 plain: 93
242, fancy: 1004
174 plain: 3
174 plain: 93
242, fancy: 1020
242, fancy: 1004
174 plain: 20
174 plain: 20
174 plain: 93
174 plain: 3
174 plain: 4
174 plain: 93

pid 174 is sshd, pid 242 is Photoshop. We can see the getpid (sc 1020)
and write (sc 1004) calls that were in the kernel trace. Photoshop does
not call sigreturn (sc 1088), or if it does, we hang before reaching the
system call handlers in sys/arch/mips/mips/syscall.c.
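
To check a longer console capture for the missing sigreturn, a small
userland filter along these lines could help. This is only a sketch: the
function names parse_line and count_syscall are made up for
illustration, the line format is assumed to be exactly the
"242, fancy: 1020" / "174 plain: 3" form shown above, and only the three
syscall numbers quoted in this message are given names.

```c
#include <stdio.h>
#include <string.h>

/* Syscall numbers quoted in the message; nothing else is assumed. */
#define SC_IRIX_WRITE     1004
#define SC_IRIX_GETPID    1020
#define SC_IRIX_SIGRETURN 1088

/*
 * Parse one console line such as "242, fancy: 1020" or "174 plain: 3".
 * Returns 1 and fills *pid and *sc on success, 0 for any other line
 * (for example "returning from irix_sendsig()").
 */
static int
parse_line(const char *line, int *pid, int *sc)
{
	char kind[16];

	/* %*[, ] skips the optional comma and the blank after the pid. */
	return sscanf(line, "%d%*[, ]%15[a-z]: %d", pid, kind, sc) == 3;
}

/* Count how often process `pid` issued syscall `sc` in an in-memory log. */
static int
count_syscall(const char *log, int pid, int sc)
{
	char line[128];
	int n = 0;

	while (*log != '\0') {
		size_t len = strcspn(log, "\n");
		size_t copy = len < sizeof(line) - 1 ? len : sizeof(line) - 1;
		int p, s;

		memcpy(line, log, copy);
		line[copy] = '\0';
		if (parse_line(line, &p, &s) && p == pid && s == sc)
			n++;
		log += len;
		if (*log == '\n')
			log++;
	}
	return n;
}
```

Calling count_syscall(log, 242, SC_IRIX_SIGRETURN) on the captured
output and getting 0 would confirm that sc 1088 never reached the
printf, which is what the excerpt above suggests.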

I can un-hang the system by dropping into ddb and sending a signal to
Photoshop. If Photoshop catches it, the machine comes back to life for a
few system calls, then hangs again in the same way. If I send SIGTERM,
Photoshop dies, and everything goes back to normal.

This situation is a bit weird to me: I believe that a process cannot
hang the machine while in user mode. Is this always true? If it is, then
Photoshop must have made a system call, and the system hung in kernel
mode before reaching the system call handler. This would mean that
Photoshop does some black magic with system calls that the code in
locore does not handle properly. How can I check for this, and how can I
debug it?

Or it could be that sigreturn was invoked, but something corrupted the
kernel badly enough that the printf output got lost. Is that possible?

Is there anything else I should check? It's worth noting that IRIX
signal delivery has been working properly for a long time now; this is
the first time I have experienced this kind of problem. Since the signal
handler is invoked, and if sigreturn is never invoked, I'd say that the
problem is not in signal delivery but in the system call handling code.
Does this seem right, or did I miss something?

I also thought about locks: is it possible that a missing unlock
somewhere caused this?

-- 
Emmanuel Dreyfus.  
With Windows 3.1 they were at the edge of the abyss...
With Windows 95 they took a great leap forward.
manu@netbsd.org