Subject: Re: Unkillable process, stalled socket write()
To: Christos Zoulas <christos@tac.gw.com>
From: Jorgen Lundman <lundman@lundman.net>
List: netbsd-users
Date: 03/10/2005 19:17:54
Sorry for the delay; this does not happen very often.

pkill -INT lundftpd

ps -axl
    0  4021  6423   0 -22  0    0     0 -        ZW   ??   0:00.00 (lundftpd)
    0  6423     1   0 -18  0 2520  2684 sokva    Ds   ??   0:16.06 ./lundftpd
    0  9550  6423   0 -22  0    0     0 -        ZW   ??   0:00.00 (lundftpd)

The middle process is the stuck one. The other two are its children, zombies
just waiting to be wait4()ed.

gdb ./lundftpd

#0  0x48187dc3 in write () from /usr/lib/libc.so.12
#1  0x080cbc76 in BIO_sock_non_fatal_error ()
(hmm? a clue? called via SSL_write if that makes any difference)

I cannot get write() to return.



kill -9 6423

    0  6423     1   0 -18  0 2520  2680 sokva    Ds   ??   0:16.06 ./lundftpd

0x48187dc3 in write () from /usr/lib/libc.so.12

The socket is supposed to be nonblocking, but even if it wasn't, should this
happen? (I have seen this with dead disks before, but this is a network socket.)
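
For reference, a minimal sketch of the behaviour I would expect from a
nonblocking descriptor. This is not the actual lundftpd code (which goes
through SSL_write); the helper names set_nonblocking, is_nonblocking and
try_write are made up for illustration, and it uses plain write() on the fd.
The point is only that with O_NONBLOCK set, a full send buffer should produce
EAGAIN rather than a sleep inside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put fd into nonblocking mode. Returns 0, or -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: 1 if O_NONBLOCK is set on fd, 0 if not, -1 on error. */
static int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) != 0;
}

/*
 * Hypothetical helper: a single write attempt on a nonblocking fd.
 * Returns the number of bytes written (> 0), 0 if the write would
 * block (EAGAIN/EWOULDBLOCK), or -1 on any other error.
 * len must be non-zero so that 0 unambiguously means "would block".
 */
static ssize_t try_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    for (;;) {
        ssize_t n = write(fd, buf, len);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;               /* buffer full: caller should poll and retry */
        return -1;                  /* real error, errno is set */
    }
}
```

Calling is_nonblocking() on the descriptor just before the write is a cheap way
to confirm the flag has not been lost somewhere along the way.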

(gdb) disassemble
Dump of assembler code for function write:
0x48187dbc <write>:     mov    $0x4,%eax
0x48187dc1 <write+5>:   int    $0x80
0x48187dc3 <write+7>:   jb     0x48187da4 <writev+12>
0x48187dc5 <write+9>:   ret
0x48187dc6 <write+10>:  nop
0x48187dc7 <write+11>:  nop
0x48187dc8 <write+12>:  push   %ebx
0x48187dc9 <write+13>:  call   0x48187dce <write+18>
0x48187dce <write+18>:  pop    %ebx
0x48187dcf <write+19>:  add    $0x861fa,%ebx
0x48187dd5 <write+25>:  mov    0xc24(%ebx),%ecx
0x48187ddb <write+31>:  pop    %ebx
0x48187ddc <write+32>:  jmp    *%ecx
0x48187dde <write+34>:  mov    %esi,%esi
End of assembler dump.

Rebooting with reboot -d this time, to get a dump.


Lund



Christos Zoulas wrote:
> In article <422BAC46.90001@lundman.net>,
> Jorgen Lundman  <lundman@lundman.net> wrote:
> 
>>I can now confirm this happens in NetBSD-2.0 as well. Nonblocking listen(2) 
>>socket, that stalls forever. kill -9 does not terminate process. If I want that 
>>port back, I have to reboot. (Unless there is some way I can hack/claim it back 
>>with gdb+kernel?). The rest of the machine is just fine.
>>
>>Deadlocked due to I/O I understand when it's a disk, but the same from the
>>network?
>>
>>Still only happens every 40 days or so, so it is hard to track down. I can not 
>>run ktrace that long since it logs all the data as well. (Can one tell ktrace
>>to only record the call, not the data?)
>>
>>Could I force a dump at this time, since the machine itself is working fine, to 
>>further investigate the trouble?
> 
> 
> What wait channel is the process stuck on (ps -axl)?
> 
> christos
> 
> 

-- 
Jorgen Lundman       | <lundman@lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)