Subject: 32/64 sign-extension bug in TCP
To: None <tech-net@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-net
Date: 05/03/2004 06:57:43
--Qxx1br4bt0+wmkIi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

over the weekend I noticed that my sparc64 box (an ultra2) was showing
some strange pauses in transferring data over TCP.  tcpdump revealed
that the ultra2 was sending lots of duplicate ACKs, sometimes over
20 in a row.  the problem turns out to be that when the application on
the ultra2 (cvs in this case) is slow at reading the data out of the
kernel such that the sender waits for a zero window long enough that
it times out and tries sending another byte, the ultra2 responds with
a window of 65535 instead of (the correct value) 0.  the other end then
dumps a whole bunch more data on the poor ultra2 and the TCP code goes
into all kinds of slow recovery paths.  here's a snippet from tcpdump:

03:31:37.093601 ultra2.65533 > spathi.cvspserver: . ack 2131167167 win 912 (DF)
03:31:42.086607 spathi.cvspserver > ultra2.65533: . 2131167167:2131168079(912) ack 3432123194 win 33580 (DF)
03:31:42.284721 ultra2.65533 > spathi.cvspserver: . ack 2131168079 win 0 (DF)
03:31:47.276666 spathi.cvspserver > ultra2.65533: . 2131168079:2131168080(1) ack 3432123194 win 33580 (DF)
03:31:47.475007 ultra2.65533 > spathi.cvspserver: . ack 2131168080 win 65535 (DF)
03:31:47.475060 spathi.cvspserver > ultra2.65533: . 2131168080:2131169540(1460) ack 3432123194 win 33580 (DF)
03:31:47.475072 spathi.cvspserver > ultra2.65533: . 2131169540:2131171000(1460) ack 3432123194 win 33580 (DF)
03:31:47.475082 spathi.cvspserver > ultra2.65533: . 2131171000:2131172460(1460) ack 3432123194 win 33580 (DF)


the TCP code uses a mixture of "int" and "long" signed and unsigned
variables, and in an LP64 kernel it's not surprising to find that
there are some sign-extension bugs.  the code that's at fault here is:

	long win;

...
	if (win < (long)(tp->rcv_adv - tp->rcv_nxt))
		win = (long)(tp->rcv_adv - tp->rcv_nxt);


and rcv_adv and rcv_nxt are of type "tcp_seq", which is "u_int32_t".
the expression "(long)(tp->rcv_adv - tp->rcv_nxt)" was intended to
evaluate to -1 for the packets shown above, but since long is now
64 bits, it evaluates to 4294967295 instead.  this causes window to
be returned as 65535 as we see.

the attached diff fixes this for me, by using replacing "long" with "int"
in tcp_output.c and removing several casts that become unnecessary.
does anyone see any problem with this change?  post 2.0, it would be good
to eliminate "long" from the rest of the TCP code, but I'd rather stick
with a smaller change for now.

-Chuck

--Qxx1br4bt0+wmkIi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="diff.tcp"

Index: netinet/tcp_output.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_output.c,v
retrieving revision 1.111
diff -u -p -r1.111 tcp_output.c
--- netinet/tcp_output.c	26 Apr 2004 03:54:29 -0000	1.111
+++ netinet/tcp_output.c	3 May 2004 13:34:21 -0000
@@ -419,7 +419,7 @@ __inline
 #endif
 int
 tcp_build_datapkt(struct tcpcb *tp, struct socket *so, int off,
-    long len, int hdrlen, struct mbuf **mp)
+    int len, int hdrlen, struct mbuf **mp)
 {
 	struct mbuf *m, *m0;
 
@@ -434,7 +434,7 @@ tcp_build_datapkt(struct tcpcb *tp, stru
 	}
 #ifdef notyet
 	if ((m = m_copypack(so->so_snd.sb_mb, off,
-	    (int)len, max_linkhdr + hdrlen)) == 0)
+	    len, max_linkhdr + hdrlen)) == 0)
 		return (ENOBUFS);
 	/*
 	 * m_copypack left space for our hdr; use it.
@@ -502,11 +502,11 @@ tcp_build_datapkt(struct tcpcb *tp, stru
 	off = tp->t_inoff;
 
 	if (len <= M_TRAILINGSPACE(m)) {
-		m_copydata(m0, off, (int) len, mtod(m, caddr_t) + hdrlen);
+		m_copydata(m0, off, len, mtod(m, caddr_t) + hdrlen);
 		m->m_len += len;
 		TCP_OUTPUT_COUNTER_INCR(&tcp_output_copysmall);
 	} else {
-		m->m_next = m_copy(m0, off, (int) len);
+		m->m_next = m_copy(m0, off, len);
 		if (m->m_next == NULL) {
 			m_freem(m);
 			return (ENOBUFS);
@@ -533,7 +533,7 @@ tcp_output(tp)
 {
 	struct socket *so;
 	struct route *ro;
-	long len, win;
+	int len, win;
 	int off, flags, error;
 	struct mbuf *m;
 	struct ip *ip;
@@ -651,7 +651,6 @@ again:
 	sendalot = 0;
 	off = tp->snd_nxt - tp->snd_una;
 	win = min(tp->snd_wnd, tp->snd_cwnd);
-
 	flags = tcp_outflags[tp->t_state];
 	/*
 	 * If in persist timeout with window of 0, send 1 byte.
@@ -769,12 +768,12 @@ again:
 		 * taking into account that we are limited by
 		 * TCP_MAXWIN << tp->rcv_scale.
 		 */
-		long adv = min(win, (long)TCP_MAXWIN << tp->rcv_scale) -
+		int adv = min(win, TCP_MAXWIN << tp->rcv_scale) -
 			(tp->rcv_adv - tp->rcv_nxt);
 
-		if (adv >= (long) (2 * rxsegsize))
+		if (adv >= 2 * rxsegsize)
 			goto send;
-		if (2 * adv >= (long) so->so_rcv.sb_hiwat)
+		if (2 * adv >= so->so_rcv.sb_hiwat)
 			goto send;
 	}
 
@@ -1039,12 +1038,12 @@ send:
 	 * Calculate receive window.  Don't shrink window,
 	 * but avoid silly window syndrome.
 	 */
-	if (win < (long)(so->so_rcv.sb_hiwat / 4) && win < (long)rxsegsize)
+	if (win < (int)(so->so_rcv.sb_hiwat / 4) && win < rxsegsize)
 		win = 0;
-	if (win > (long)TCP_MAXWIN << tp->rcv_scale)
-		win = (long)TCP_MAXWIN << tp->rcv_scale;
-	if (win < (long)(tp->rcv_adv - tp->rcv_nxt))
-		win = (long)(tp->rcv_adv - tp->rcv_nxt);
+	if (win > TCP_MAXWIN << tp->rcv_scale)
+		win = TCP_MAXWIN << tp->rcv_scale;
+	if (win < (int)(tp->rcv_adv - tp->rcv_nxt))
+		win = (int)(tp->rcv_adv - tp->rcv_nxt);
 	th->th_win = htons((u_int16_t) (win>>tp->rcv_scale));
 	if (SEQ_GT(tp->snd_up, tp->snd_nxt)) {
 		u_int32_t urp = tp->snd_up - tp->snd_nxt;

--Qxx1br4bt0+wmkIi--