Subject: Lasting connections
To: Matthias Scheler <tron@lyssa.owl.de>
From: Stephen Ma <Stephen.Ma@jtec.com.au>
List: tech-net
Date: 12/12/1997 13:33:56
[shifting this thread to tech-net]

>>>>> "Matthias" == Matthias Scheler <tron@lyssa.owl.de> writes:

Matthias> 	Hi, I've got a NetBSD 1.3_ALPHA system (971110
Matthias> sources) here with an uptime of 27 days. Actually there are
Matthias> about half a dozen TCP connections which are in state
Matthias> "CLOSING" for a long time, one of them even for weeks. There
Matthias> are definitely no processes belonging to these connections any
Matthias> more.

This is probably because of a "known problem" in the BSD stack.
Richard Stevens posted a suggested fix for this problem, which I've
reproduced below.

- S

-- 

Newsgroups: comp.protocols.tcp-ip
From: rstevens@noao.edu (W. Richard Stevens)
Date: 1996/03/13
Message-ID: <4i5dce$267@noao.edu>
Organization: National Optical Astronomy Observatories, Tucson, AZ, USA
Subject: fix for sockets stuck in the CLOSING state
Keywords: CLOSING

For years there has been a reported problem with sockets (actually TCP
endpoints, since the bug affects sockets and TLI) stuck in the CLOSING
state.  Numerous people have mentioned this bug to me, and it's in all
versions of the BSD network source code; it just takes a busy
Web server and a braindead PC client to trigger it.  One busy Web server
that I know of has encountered this, with multiple sockets per day stuck
in this state, all with multiple mbufs queued for sending.  They have to
reboot every few days to clear them all.
   
The normal scenario is as follows:
   
1. Web client establishes connection and sends HTTP request.
   
2. Web server reads request and starts sending reply.
   
3. At some point the client sends a FIN with an advertised window of 0.
   Sometimes the window decreases slowly to 0, and sometimes the window
   of 0 appears after a much bigger window (i.e., the window has effectively
   been shrunk by the client).
 
4. The server responds to the FIN with an immediate ACK.
 
5. There are no more packets sent on the connection.
 
At this point the socket is normally in the CLOSING state: Web servers
often do a big write in step 2, followed immediately by a close, which
moves to FIN_WAIT_1; the receipt of the FIN in step 3 moves to CLOSING.
But there is data on the send queue for the socket, and *both* the
retransmit timer and the persist timer are 0.  (I can verify this with
a version of netstat that I have.)  It appears the client disappears  
from the net after sending its FIN, so the client never sends an ACK  
advertising a nonzero window.  And since the server's persist timer is 
not set, the server never probes the 0 window.
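
The state changes and the stuck condition described above can be
sketched in user-space C.  This is only an illustrative model, not
kernel code: the enum, struct, and function names are hypothetical,
and only the two transitions mentioned here are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the kernel's definitions. */
enum tcp_state { ESTABLISHED, FIN_WAIT_1, CLOSING };
enum tcp_event { APP_CLOSE, RECV_FIN };

/* Only the two transitions described in the text are modelled. */
static enum tcp_state next_state(enum tcp_state s, enum tcp_event e)
{
	if (s == ESTABLISHED && e == APP_CLOSE)
		return FIN_WAIT_1;	/* server's close() after the big write */
	if (s == FIN_WAIT_1 && e == RECV_FIN)
		return CLOSING;		/* peer's FIN arrives before our FIN is ACKed */
	return s;			/* everything else is ignored in this sketch */
}

/* Minimal view of one endpoint; a timer value of 0 means "not running". */
struct endpoint {
	enum tcp_state state;
	int send_queue_bytes;	/* data still waiting to be sent */
	int peer_window;	/* last window advertised by the peer */
	int rexmt_timer;
	int persist_timer;
};

/*
 * The stuck case: data is queued, the peer's window is closed, and
 * neither timer is running, so nothing will ever prompt another send.
 */
static bool is_stuck(const struct endpoint *ep)
{
	assert(ep != 0);
	return ep->state == CLOSING &&
	    ep->send_queue_bytes > 0 &&
	    ep->peer_window == 0 &&
	    ep->rexmt_timer == 0 &&
	    ep->persist_timer == 0;
}
```

Feeding APP_CLOSE and then RECV_FIN to next_state() starting from
ESTABLISHED ends in CLOSING; an endpoint in that state with queued
data, a zero window, and both timers at 0 is reported as stuck, while
starting either timer clears the condition.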
 
The only timer going at this point is the keepalive timer, but in normal
BSD-derived code, the keepalive timer does not work in the CLOSING state.
Apparently SGI has changed this, and Web servers normally do set the
keepalive timer, so on an SGI box this connection would normally disappear
after 2 hours (assuming the keepalive interval has not been set lower).
 
But the real fix is to prevent tcp_output from sending the ACK in step 4
above without making certain that either the persist timer or the
retransmit timer is set.  That's what the patch below does.  What you'll
see after putting in this patch is the socket move to the CLOSING state
and then persist probes will start (after 5 seconds).  The fairly new  
persist-probe timeout code will then drop the connection, often around
15 minutes later, assuming the client has really disappeared.  (This fix
gets rid of these stuck sockets faster than the keepalive hack mentioned
above.)  NOTE: If your kernel does not have the persist timeout code
(from 4.4BSD-Lite2, also shown and described on pp. 196-200 of "TCP/IP
Illustrated, Volume 3"), you must put that patch in also, or you'll
make things worse.
 
Many thanks to Dave Borman for simplifying the original fix that I had
for this problem.
 
        Rich Stevens 
 
-------------------------------------------------------------------   
*** tcp_output.c	Tue Feb 27 19:19:25 1996
--- tcp_output.c.bis	Tue Mar 12 09:29:31 1996
***************
*** 144,158 ****
		 * but we haven't been called to retransmit,
		 * len will be -1.  Otherwise, window shrank 
		 * after we sent into it.  If window shrank to 0,
!		 * cancel pending retransmit and pull snd_nxt
!		 * back to (closed) window.  We will enter persist
!		 * state below.	 If the window didn't close completely, 
!		 * just wait for an ACK. 
		 */ 
		len = 0;
		if (win == 0) {
			tp->t_timer[TCPT_REXMT] = 0;
			tp->snd_nxt = tp->snd_una;
		} 
	}
	if (len > tp->t_maxseg) {
--- 144,161 ----
		 * but we haven't been called to retransmit,
		 * len will be -1.  Otherwise, window shrank
		 * after we sent into it.  If window shrank to 0,
!		 * cancel pending retransmit, pull snd_nxt back
!		 * to (closed) window, and set the persist timer
!		 * if it isn't already going.  If the window didn't
!		 * close completely, just wait for an ACK.
		 */
		len = 0;
		if (win == 0) { 
			tp->t_timer[TCPT_REXMT] = 0;
+			tp->t_rxtshift = 0;
			tp->snd_nxt = tp->snd_una;
+			if (tp->t_timer[TCPT_PERSIST] == 0)  
+				tcp_setpersist(tp);
		}
	}	 
	if (len > tp->t_maxseg) {
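
For anyone who wants to see the effect of the patched branch outside
the kernel, here is a small user-space sketch.  It is a model under
stated assumptions, not the kernel source: struct tcpcb_model,
model_setpersist(), zero_window_branch(), and PERSIST_TICKS are
hypothetical stand-ins, and the real tcp_setpersist() computes its
interval from the RTT estimate and backoff rather than using a fixed
value.

```c
#include <assert.h>

/* Hypothetical user-space stand-ins for the kernel's definitions. */
#define TCPT_REXMT	0
#define TCPT_PERSIST	1
#define TCPT_NTIMERS	2
#define PERSIST_TICKS	10	/* assumed: 5 seconds at 2 timer ticks per second */

struct tcpcb_model {
	int t_timer[TCPT_NTIMERS];	/* 0 means the timer is not running */
	int t_rxtshift;			/* retransmit backoff exponent */
	unsigned long snd_una;		/* oldest unacknowledged sequence number */
	unsigned long snd_nxt;		/* next sequence number to send */
};

/* Simplified stand-in for tcp_setpersist(): fixed interval, no backoff. */
static void model_setpersist(struct tcpcb_model *tp)
{
	/* The retransmit and persist timers must never run together. */
	assert(tp->t_timer[TCPT_REXMT] == 0);
	tp->t_timer[TCPT_PERSIST] = PERSIST_TICKS;
}

/* Mirrors the patched "if (win == 0)" branch from the diff. */
static void zero_window_branch(struct tcpcb_model *tp, long win)
{
	if (win == 0) {
		tp->t_timer[TCPT_REXMT] = 0;	/* cancel pending retransmit */
		tp->t_rxtshift = 0;		/* restart the backoff from zero */
		tp->snd_nxt = tp->snd_una;	/* pull back to the closed window */
		if (tp->t_timer[TCPT_PERSIST] == 0)
			model_setpersist(tp);	/* guarantee a timer is running */
	}
}
```

Calling zero_window_branch() with win == 0 on a control block whose
retransmit timer is running leaves the retransmit timer cleared and
the persist timer set, so the zero window will be probed.  With a
nonzero win the control block is left untouched.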