netbsd-users: Re: Unkillable process, stalled socket write()

Subject: Re: Unkillable process, stalled socket write()
To: NetBSD Users <netbsd-users@NetBSD.org>
From: Jorgen Lundman <lundman@lundman.net>
List: netbsd-users
Date: 03/07/2005 10:20:06
I can now confirm this happens in NetBSD-2.0 as well: a nonblocking listen(2) 
socket stalls forever, and kill -9 does not terminate the process. If I want that 
port back, I have to reboot (unless there is some way I can hack/claim it back 
with gdb+kernel?). The rest of the machine is just fine.

A process deadlocked on I/O I understand when it is a disk, but the same on the network?
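
For reference, this is the behaviour I would expect from a nonblocking socket: once the send buffer is full, write(2) fails with EAGAIN instead of sleeping. A minimal sketch in C; fill_nonblocking() is a hypothetical helper for illustration (not code from lundftpd), and the 4072-byte chunk only mirrors the transfer size in the report quoted below.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Put fd into nonblocking mode and keep writing until write() fails or
 * `limit` bytes have been accepted. Returns the errno of the first
 * failed call, or 0 if the limit was reached without a failure.
 * Hypothetical helper for illustration only.
 */
int fill_nonblocking(int fd, size_t limit)
{
    char buf[4072];     /* chunk size taken from the report, otherwise arbitrary */
    size_t total = 0;
    int flags;

    memset(buf, 0, sizeof(buf));

    flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return errno;

    while (total < limit) {
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n == -1)
            return errno;   /* expected: EAGAIN once the send buffer is full */
        total += (size_t)n;
    }
    return 0;
}
```

Called on one end of a socketpair(2) whose peer never reads, this should return EAGAIN (or EWOULDBLOCK) once the buffer fills, rather than hang in the kernel.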

It still only happens every 40 days or so, so it is hard to track down. I cannot 
run ktrace that long since it logs all the data as well. (Can one tell ktrace to 
only record the call, not the data?)

Could I force a dump at this time, since the machine itself is working fine, to 
further investigate the trouble?

Lund


Jorgen Lundman wrote:
> 
> NetBSD mirror 1.6ZF NetBSD 1.6ZF (mirror) #6: Fri Apr  2 04:06:49 CEST 
> 2004 root@mirror:/usr/src/sys-current/src/sys/arch/i386/compile/mirror i386
> 
> Most likely it is something I have done in my software, but it is behaving 
> unusually.
> The FTPD process I have is now hung. kill -9 does nothing to it, and 
> naturally I cannot release it.
> 
> When I see this, it is usually due to a disk or tape going bad, so that 
> the kernel blocks forever. But what is unusual is that this time the 
> blocked fd is a socket that the FTPd is sending to.
> 
> However, gdb tells me:
> 
> #0  0x481d36d7 in write () from /usr/lib/libc.so.12
> #1  0x808bb2b in sockets_write (fd=275, [cut]
> 
> Inspecting my structures I can confirm that fd 275 is a socket, we 
> have already read 4072 bytes from the file on disk, and are now trying 
> to send them.
> 
> 
> 0x481d36d7 in write () from /usr/lib/libc.so.12
> (gdb) disass
> Dump of assembler code for function write:
> 0x481d36d0 <write>:     mov    $0x4,%eax
> 0x481d36d5 <write+5>:   int    $0x80
> 0x481d36d7 <write+7>:   jb     0x481d36b8 <getpid+8>
> 0x481d36d9 <write+9>:   ret
> 
> int 0x80 at a guess is just a syscall, and 0x4 would be sys_write().
> 
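
A minimal sketch of what that stub amounts to, written in C rather than assembler. It assumes the platform's <sys/syscall.h> defines SYS_write (the 0x4 loaded into %eax above on NetBSD/i386; other systems use other numbers). raw_write() is a hypothetical name for illustration, not a libc function.

```c
#define _DEFAULT_SOURCE     /* makes glibc declare syscall(); harmless elsewhere */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Issue the write system call directly through syscall(2), bypassing the
 * libc write() wrapper. Returns the byte count, or -1 with errno set.
 * Hypothetical helper for illustration only.
 */
ssize_t raw_write(int fd, const void *buf, size_t len)
{
    return (ssize_t)syscall(SYS_write, fd, buf, len);
}
```

A process stopped at the instruction after the trap, as in the backtrace, is therefore still inside the kernel's write path and has not returned to userland.
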
> fd 275 is also in nonblocking mode, so even if the kernel were out of 
> mbufs or memory, should write() not always return, even with a failure?
> 
> Memory: 278M Act, 39M Inact, 608K Wired, 10M Exec, 291M File, 476M Free 
> Swap: 10G Total, 10G Free
> 
> USER   PID %CPU %MEM  VSZ   RSS TT STAT STARTED      TIME COMMAND
> root 15529  0.0  0.0 8216     4 ?? DXs  11:03AM  67:08.89 ./lundftpd
> 
> sysstat mbufs
>           /0   /5   /10  /15  /20  /25  /30  /35  /40  /45  /50  /55  /60
> data      XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 
> 44023
> headers   XXXXXXXXXXXXXXXXXXXXXXXXXX
> 
> Alas, netstat and vmstat don't run since userland is 1.6.2 and the kernel is 
> -current (to support the NIC), sigh.
> 
> 
> Lund
> 
> 

-- 
Jorgen Lundman       | <lundman@lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)