Subject: Re: Unkillable process, stalled socket write()
To: NetBSD Users <netbsd-users@NetBSD.org>
From: Jorgen Lundman <email@example.com>
Date: 03/07/2005 10:20:06
I can now confirm this happens in NetBSD-2.0 as well: a nonblocking listen(2)
socket that stalls forever. kill -9 does not terminate the process. If I want
that port back, I have to reboot (unless there is some way I can hack/claim it
back with gdb+kernel?). The rest of the machine is just fine.
Deadlocked due to I/O when it's a disk I understand, but the same based on the network?
It still only happens every 40 days or so, so it is hard to track down. I cannot
run ktrace that long, since it logs all the data as well. (Can one tell ktrace
to only record the call, not the data?)
Could I force a dump at this time, since the machine itself is working fine, to
further investigate the trouble?
Jorgen Lundman wrote:
> NetBSD mirror 1.6ZF NetBSD 1.6ZF (mirror) #6: Fri Apr 2 04:06:49 CEST
> 2004 root@mirror:/usr/src/sys-current/src/sys/arch/i386/compile/mirror i386
> Most likely it is something I have done in my software, but it is behaving
> FTPD process I have is now hung. Kill -9 does nothing to it, and
> naturally I can not release it.
> When I see this, it is usually due to a disk or tape going bad, and
> the kernel will block forever. But what is unusual is that this time the
> blocked fd is a socket that the FTPd is sending to.
> However, gdb tells me:
> #0 0x481d36d7 in write () from /usr/lib/libc.so.12
> #1 0x808bb2b in sockets_write (fd=275, [cut]
> Inspecting my structures, I can confirm that fd 275 is a socket; we
> have already read 4072 bytes from the file on disk and are now trying
> to send them.
> 0x481d36d7 in write () from /usr/lib/libc.so.12
> (gdb) disass
> Dump of assembler code for function write:
> 0x481d36d0 <write>: mov $0x4,%eax
> 0x481d36d5 <write+5>: int $0x80
> 0x481d36d7 <write+7>: jb 0x481d36b8 <getpid+8>
> 0x481d36d9 <write+9>: ret
> int 0x80 at a guess is just a syscall, and 0x4 would be sys_write().
> fd 275 is also in nonblocking mode, so even if the kernel were out of
> mbufs or memory, should write() not always return, even with a failure?
> Memory: 278M Act, 39M Inact, 608K Wired, 10M Exec, 291M File, 476M Free
> Swap: 10G Total, 10G Free
> USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
> root 15529 0.0 0.0 8216 4 ?? DXs 11:03AM 67:08.89 ./lundftpd
> sysstat mbufs
> /0 /5 /10 /15 /20 /25 /30 /35 /40 /45 /50 /55 /60
> data XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> headers XXXXXXXXXXXXXXXXXXXXXXXXXX
> Alas, netstat and vmstat don't run, since userland is 1.6.2 and the kernel
> is -current (to support the NIC). Sigh.
Jorgen Lundman | <firstname.lastname@example.org>
Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
Japan | +81 (0)3 -3375-1767 (home)