Subject: kern/14799: Bug in TCP MSS usage - patch included
To: None <gnats-bugs@gnats.netbsd.org>
From: None <rb-netbsd@bigscarychildren.net>
List: netbsd-bugs
Date: 12/01/2001 01:17:24
>Number:         14799
>Category:       kern
>Synopsis:       TCP MSS isn't calculated properly
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Nov 30 22:18:00 PST 2001
>Closed-Date:
>Last-Modified:
>Originator:     Rick Byers
>Release:        NetBSD 1.5Y and others
>Organization:
	
>Environment:
	
	
System: NetBSD Rick.Apenheul.BigScaryChildren.net 1.5Y NetBSD 1.5Y (RICK) #0: Fri Nov 30 23:44:58 EST 2001 rick@Rick.Apenheul.BigScaryChildren.net:/usr/src/sys/arch/i386/compile/RICK i386
Architecture: i386
Machine: i386

Incorrect behaviour also detected from ftp.netbsd.org and www.netbsd.org (1.5-release?)
>Description:
	The TCP MSS option sent in a TCP SYN packet is supposed to reflect
	the maximum TCP segment size the sender can receive NOT including
	any options.  i.e. it's the maximum IP packet size (usually 1500 
	on ethernet) minus 40 (the minimum length of the ip and tcp headers).

	When NetBSD receives a tcp packet with an MSS that is smaller than
	it's own mss, it doesn't properly take the options into account and
	(if any tcp options are used) will end up generating TCP packets
	that are bigger than the receiver has said it can receive.  This
	typically results in fragmentation, or in the worst case - packet
	loss.
>How-To-Repeat:
	On machine A (any OS), lower the default MSS to, for example, 1360 
	(i.e. set the ethernet MTU to 1400).  Make sure the NetBSD machine
	(call it "B") will use some kind of TCP options (i.e. 
	net.inet.tcp.timestamps is enabled, which is the default).

	Initiate a TCP connection from A to B, and cause B to send a large
	amount of data over the connection (large enough to fill a packet 
	atleast).  Now, using an ethernet sniffer, analyze the traffic.
	
	You will notice that even though A sent a mss of 1360, B sent
	TCP packets with up to 1360 bytes of payload REGARDLESS of the
	options used in the TCP header.  For example, with timestamps
	enabled, there is an extra 12 bytes in the TCP header making the
	total IP packet size 1412. 

	In a real-world situation (i.e. a PPPoE interface with an MTU of
	1492) this causes fragmentation wich decreases thruput.  Not to 
	mention this is in clear violation of the TCP Spec and RFC 1122.
>Fix:
	When the SYN packet is received, the MSS is stored in t_peermss
	of the tcp control block (tcp_mss_from_peer()).  When packets
	are sent, tcp_segsize() only takes the ip/tcp options into account
	when determining the local mss.  It then uses the minimum of the
	local mss and the advertised peer mss.  

	Instead, tcp_segsize needs to subtract the option length from
	the advertised peer mss as well.  Here is a patch that appears
	to do the right thing (although someone should sanity check this).  
	Infact, someone should probably review all the MSS code because there
	are other areas that look a little fishy.  Unfortunantly, this
	code probably doesn't get tested very well because most hosts on the
	net have an MTU of 1500. 

Index: tcp_output.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/tcp_output.c,v
retrieving revision 1.75
diff -c -r1.75 tcp_output.c
*** tcp_output.c	2001/11/13 00:32:41	1.75
--- tcp_output.c	2001/12/01 05:50:47
***************
*** 190,195 ****
--- 190,196 ----
  	struct ifnet *ifp;
  	int size;
  	int iphlen;
+ 	int optlen;
  
  #ifdef DIAGNOSTIC
  	if (tp->t_inpcb && tp->t_in6pcb)
***************
*** 258,264 ****
  		}
  	}
  #endif
! 	size -= tcp_optlen(tp);
  	/*
  	 * XXX tp->t_ourmss should have the right size, but without this code
  	 * fragmentation will occur... need more investigation
--- 259,270 ----
  		}
  	}
  #endif
! 
! out:
! 	/* Now we must make room whatever extra TCP/IP options will be
! 	 * in the packet.
! 	 */
! 	optlen = tcp_optlen(tp);
  	/*
  	 * XXX tp->t_ourmss should have the right size, but without this code
  	 * fragmentation will occur... need more investigation
***************
*** 266,294 ****
  #ifdef INET
  	if (inp) {
  #ifdef IPSEC
! 		size -= ipsec4_hdrsiz_tcp(tp);
  #endif
! 		size -= ip_optlen(inp);
  	}
  #endif
  #ifdef INET6
  #ifdef INET
  	if (in6p && tp->t_family == AF_INET) {
  #ifdef IPSEC
! 		size -= ipsec4_hdrsiz_tcp(tp);
  #endif
  		/* XXX size -= ip_optlen(in6p); */
  	} else
  #endif
  	if (in6p && tp->t_family == AF_INET6) {
  #ifdef IPSEC
! 		size -= ipsec6_hdrsiz_tcp(tp);
  #endif
! 		size -= ip6_optlen(in6p);
  	}
  #endif
  
-  out:
  	/*
  	 * *rxsegsizep holds *estimated* inbound segment size (estimation
  	 * assumes that path MTU is the same for both ways).  this is only
--- 272,300 ----
  #ifdef INET
  	if (inp) {
  #ifdef IPSEC
! 		optlen += ipsec4_hdrsiz_tcp(tp);
  #endif
! 		optlen += ip_optlen(inp);
  	}
  #endif
  #ifdef INET6
  #ifdef INET
  	if (in6p && tp->t_family == AF_INET) {
  #ifdef IPSEC
! 		optlen += ipsec4_hdrsiz_tcp(tp);
  #endif
  		/* XXX size -= ip_optlen(in6p); */
  	} else
  #endif
  	if (in6p && tp->t_family == AF_INET6) {
  #ifdef IPSEC
! 		optlen += ipsec6_hdrsiz_tcp(tp);
  #endif
! 		optlen += ip6_optlen(in6p);
  	}
  #endif
+ 	size -= optlen;
  
  	/*
  	 * *rxsegsizep holds *estimated* inbound segment size (estimation
  	 * assumes that path MTU is the same for both ways).  this is only
***************
*** 297,304 ****
  	 * ipseclen is subtracted from both sides, this may not be right.
  	 * I'm not quite sure about this (could someone comment).
  	 */
! 	*txsegsizep = min(tp->t_peermss, size);
! 	*rxsegsizep = min(tp->t_ourmss, size);
  
  	if (*txsegsizep != tp->t_segsz) {
  		/*
--- 303,310 ----
  	 * ipseclen is subtracted from both sides, this may not be right.
  	 * I'm not quite sure about this (could someone comment).
  	 */
! 	*txsegsizep = min(tp->t_peermss - optlen, size);
! 	*rxsegsizep = min(tp->t_ourmss - optlen, size);
  
  	if (*txsegsizep != tp->t_segsz) {
  		/*
>Release-Note:
>Audit-Trail:
>Unformatted: