tech-net: Re: Why do some network connections get stuck forever?

Subject: Re: Why do some network connections get stuck forever?
To: None <tech-net@NetBSD.org>
From: Greg A. Woods <woods@weird.com>
List: tech-net
Date: 10/25/2005 16:52:50

At Mon, 24 Oct 2005 19:46:33 -0400,
Steven M. Bellovin wrote:
> 
> In message <djjlgd$iu8$1@serpens.de>, Michael van Elst writes:
> >woods@weird.com ("Greg A. Woods") writes:
> >
> >>Why do some network connections get stuck forever?
> >
> >I would guess that this is a feature of TCP.
>
> In particular, if you don't have keep-alives enabled that behavior is 
> in fact correct.

Indeed.

I'm not sure that these connections do have keep-alives enabled, and now
I'm guessing that they don't, especially since I can't find any use of
SO_KEEPALIVE in the source.  :-)

But how does one discover the current state of a socket's options in
order to determine whether keep-alives are enabled?  Are they included
in the apparently invisible "socket flags field" that the manual
mentions but fstat(8) does not show?  (see attached)
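
For reference, a process can at least query the option on its own
descriptors with getsockopt(2).  The following is only a minimal sketch:
the helper names keepalive_enabled() and enable_keepalive() are made up
for illustration and come from neither Cyrus nor fstat(8), and it does
not answer the question of inspecting another process's socket from
outside.  Some systems return the raw flag bit rather than 1, so the
sketch only tests for non-zero.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: report whether SO_KEEPALIVE is set on fd.
 * Returns 1 if set, 0 if clear, -1 on error (errno is set).
 */
int keepalive_enabled(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, &len) == -1)
        return -1;
    /* Only test for non-zero; the exact value is system-dependent. */
    return on != 0;
}

/*
 * Hypothetical helper: turn keep-alives on for fd.
 * Returns 0 on success, -1 on error (errno is set).
 */
int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}
```

A server would presumably call something like enable_keepalive() on
each descriptor returned by accept(); how long a dead peer then takes
to be noticed depends on the system's keep-alive timers.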

However, the Cyrus imapd does have a timeout mechanism to kill inactive
connections, yet there's one process that's been there for over a month
now.  The client it was connected to has been started and stopped
repeatedly, and even its host has been rebooted, so the connection is
definitely inactive:


# ps -lp 26648                          
UID   PID PPID CPU PRI NI  VSZ  RSS WCHAN STAT TT    TIME COMMAND
120 26648  235   0   2  4 3840 4176 netio IN   ?? 0:00.10 imapd: imapd: wonder.planix.com [204.29.161.37]   

How do I find the outstanding alarm for a process?  (I knew how to do it
on SysV with the good old "crash" command! ;-))

For Cyrus in particular it looks like the idle connection timeout is
tested by using a timeout value for the select() call.

How do I find the select() timeout values for a socket, if any?

Since the process is in "netio", maybe it's stuck in a read() call.
Unfortunately I don't see an alarm() being set before the read(), but if
the select() returned there should have been something to read, right?
So why would read() ever block after select() indicated there was
something to read?
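
One general possibility, offered as a guess and not as a diagnosis of
Cyrus: select() only promises that the next single read() will not
block.  A caller that reads again without re-selecting, or that reads
through a buffering layer which issues several read() calls, can still
block on a descriptor left in blocking mode.  The sketch below combines
a select() idle timeout with O_NONBLOCK so that a read() can never
sleep.  The helper names and the -2/-3 return codes are invented for
illustration.

```c
#include <sys/types.h>
#include <sys/select.h>
#include <sys/time.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: wait up to idle_secs for data, then read once.
 * Returns the byte count (0 means EOF), -1 on error, -2 if the idle
 * timeout expired, or -3 if select() reported the descriptor readable
 * but a non-blocking read() found nothing.
 */
ssize_t read_with_idle_timeout(int fd, void *buf, size_t len, long idle_secs)
{
    fd_set rfds;
    struct timeval tv;
    ssize_t got;
    int n;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = idle_secs;
    tv.tv_usec = 0;

    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (n == -1)
        return -1;
    if (n == 0)
        return -2;              /* nothing arrived within idle_secs */

    got = read(fd, buf, len);
    if (got == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -3;              /* would have blocked without O_NONBLOCK */
    return got;
}
```

With the descriptor in non-blocking mode, every wait goes through
select() and is therefore always bounded by the timeout.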


Additionally, these server processes often seem to get stuck only when
the ethernet interface is disconnected for some extended period of
time.  They don't seem to get stuck if the client crashes or the client
host crashes, or if there's loss of connectivity due to routing
problems or other problems on intermediate links between the client and
server.

-- 
						Greg A. Woods

H:+1 416 218-0098  W:+1 416 489-5852 x122  VE3TCP  RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>



Index: fstat.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/usr.bin/fstat/fstat.c,v
retrieving revision 1.72
diff -u -r1.72 fstat.c
--- fstat.c	17 Jul 2005 07:36:26 -0000	1.72
+++ fstat.c	25 Oct 2005 20:50:33 -0000
@@ -805,6 +805,7 @@
 	switch(dom.dom_family) {
 	case AF_INET:
 		getinetproto(proto.pr_protocol);
+		/* XXX the manual says "socket flags" are printed in the next field before the address */
 		if (proto.pr_protocol == IPPROTO_TCP) {
 			if (so.so_pcb == NULL)
 				break;
@@ -846,6 +847,7 @@
 #ifdef INET6
 	case AF_INET6:
 		getinetproto(proto.pr_protocol);
+		/* XXX the manual says "socket flags" are printed in the next field before the address */
 		if (proto.pr_protocol == IPPROTO_TCP) {
 			if (so.so_pcb == NULL)
 				break;
@@ -968,6 +970,8 @@
 /*
  * getinetproto --
  *	print name of protocol number
+ *
+ * XXX why isn't this just getprotobynumber(3)?
  */
 static void
 getinetproto(int number)
