Subject: kern/2274: NetBSD's MTU/MSS handling is rather broken
To: None <gnats-bugs@NetBSD.ORG>
From: John Hawkinson <jhawk@mit.edu>
List: netbsd-bugs
Date: 03/30/1996 11:26:02
>Number:         2274
>Category:       kern
>Synopsis:       NetBSD's MTU/MSS handling is rather broken
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 30 11:50:01 1996
>Last-Modified:
>Originator:     John Hawkinson
>Organization:
MIT SIPB
>Release:        1.1
>Environment:
System: NetBSD lola-granola 1.1A NetBSD 1.1A (LOLA) #2: Sun Mar 10 08:01:40 EST 1996 mycroft@zygorthian-space-raiders:/afs/sipb.mit.edu/project/netbsd/dev/current-source/build/i386_nbsd1/sys/arch/i386/compile/LOLA i386


>Description:

	NetBSD's TCP maximum-segment-size (MSS) handling is broken. In my
environment, it uses an MSS of 512 to non-local destinations and 1460
to local destinations.  This is, of course, suboptimal, since the
local network is far more reliable and congestion-free than the
Internet, and the congested environment is exactly where you want to
minimize the number of packets sent.
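
To put rough numbers on the packet-count argument, here is a minimal
sketch. It assumes 40 bytes of IP+TCP header per segment (no options)
and a hypothetical 1,000,000-byte transfer; the function names are
illustrative only and are not taken from any kernel source.

```python
IP_TCP_HEADER_BYTES = 40  # assumption: 20-byte IPv4 + 20-byte TCP header, no options


def segments_needed(payload_bytes, mss):
    """Number of TCP segments needed to carry payload_bytes at a given MSS."""
    return -(-payload_bytes // mss)  # integer ceiling division


def header_overhead(payload_bytes, mss):
    """Total IP+TCP header bytes spent carrying payload_bytes."""
    return segments_needed(payload_bytes, mss) * IP_TCP_HEADER_BYTES


if __name__ == "__main__":
    payload = 1_000_000  # hypothetical transfer size, in bytes
    for mss in (512, 1460):
        print(mss, segments_needed(payload, mss), header_overhead(payload, mss))
```

Under these assumptions, the transfer takes 1954 segments at an MSS of
512 but only 685 at an MSS of 1460, i.e. nearly three times as many
packets cross the congested path.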

>How-To-Repeat:

	With tcp_mssdflt set to 512 (the default), and with my routing
table as follows:

[lola-granola!jhawk] ~> netstat -rn
Routing tables

Internet:
Destination      Gateway            Flags     Refs     Use    Mtu  Interface
default          18.70.0.1          UGS        21  2451568      -  fe0
18.70            link#2             UC          0        0      -  fe0
18.70.0.1        0:0:c:5:a2:33      UHL         1     1348      -  fe0
18.70.0.6        8:0:20:74:0:98     UHL         0       10      -  fe0
18.70.0.26       127.0.0.1          UGHS        1     2206      -  lo0
18.70.0.36       0:40:95:4:fd:c8    UHL         0        2      -  fe0
18.70.0.54       2:60:8c:a9:f7:ae   UHL         2       94      -  lo0 =>
18.70.0.54       link#1             UC          0        0      -  ed0
18.70.0.56       8:0:20:22:22:70    UHL         0     9639      -  fe0
18.70.0.61       0:0:c0:b5:a8:d     UHL         2     3179      -  fe0
18.70.0.158      link#1             UCS         0        0      -  ed0
18.70.0.160      8:0:2b:2b:eb:3b    UHL         1       17      -  fe0
18.70.0.161      8:0:20:22:cf:21    UHL         0       21      -  fe0
18.70.0.215      8:0:20:1f:49:df    UHL         0    17484      -  fe0
18.70.0.216      link#1             UCS         0        0      -  ed0
18.70.0.218      8:0:20:75:3c:eb    UHL         0    19408      -  fe0
18.70.0.224      8:0:2b:e:f8:4      UHL         1      124      -  fe0
18.70.0.252      8:0:69:8:96:6f     UHL         1     2014      -  fe0
18.70.2.1        0:80:d3:a0:27:5f   UHL         1        4      -  fe0
127.0.0.1        127.0.0.1          UH          5   298684      -  lo0

(Note that the machine's IP address is 18.70.0.26, with a subnet mask
of 255.255.0.0.)

If I attempt to connect to a machine on the local ethernet
(18.70.0.252), an MSS of 1460 is used:

11:12:38.651510 LOLA-GRANOLA.MIT.EDU.1744 > OPUS.MIT.EDU.www: S 1714944000:1714944000(0) win 16384 <mss 1460,nop,wscale 0,nop,nop,timestamp 2630065 1985830193>

[lola-granola!jhawk] ~> route get opus
   route to: OPUS.MIT.EDU
destination: OPUS.MIT.EDU
  interface: fe0
      flags: <UP,HOST,DONE,LLINFO>
 recvpipe  sendpipe  ssthresh  rtt,msec    rttvar  hopcount      mtu     expire
       0         0         0         0         0         0         0      1185 

If I attempt to connect to a machine on another subnet of network 18
(18.177.0.64), an MSS of 1460 is also used:

11:13:55.532712 LOLA-GRANOLA.MIT.EDU.1745 > PACKET-DROP.MIT.EDU.www: S 1724864000:1724864000(0) win 16384 <mss 1460,nop,wscale 0,nop,nop,timestamp 2621440 3277675825>

[lola-granola!jhawk] ~> route get packet-drop
   route to: PACKET-DROP.MIT.EDU
destination: default
       mask: default
    gateway: NW12A-RTR-W20-ETHER.MIT.EDU
  interface: fe0
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh  rtt,msec    rttvar  hopcount      mtu     expire
       0         0         0         0         0         0         0         0 

However, if I attempt to connect to a machine outside of network 18
(199.94.220.184), an MSS of 512 is used:

11:15:57.250897 LOLA-GRANOLA.MIT.EDU.1746 > all-purpose-gunk.near.net.www: S 1740480000:1740480000(0) win 16384 <mss 512,nop,wscale 0,nop,nop,timestamp 2651311 1029594417>

[lola-granola!jhawk] ~> route get ap-gunk.near.net
   route to: all-purpose-gunk.near.net
destination: default
       mask: default
    gateway: NW12A-RTR-W20-ETHER.MIT.EDU
  interface: fe0
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh  rtt,msec    rttvar  hopcount      mtu     expire
       0         0         0         0         0         0         0         0 

This seems awfully inconsistent. There does not seem to be any good
reason why connections to PACKET-DROP.MIT.EDU and
ALL-PURPOSE-GUNK.NEAR.NET should not use the same MSS, since both go
through the same default route. I'm not very familiar with the
internals of the BSD TCP code, so looking there hasn't been helpful,
but I theorize that something is treating the network route to
18.70.0.0 255.255.0.0 as a route to 18.0.0.0 255.0.0.0, at least for
purposes of MSS computation (i.e. the stated mask of the route is
being ignored and the classful mask is being assumed). This seems
horribly wrong and broken, but is operationally slightly better than
having _all_ connections off the local network use an MSS of 512.
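
The theory above can be checked against the three traces with a small
sketch. This is only a model of the hypothesis, not the kernel's
actual logic: the helper names (classful_prefix_len, is_local,
chosen_mss) are made up for illustration, and the rule "local gets
1460, non-local gets 512" is assumed from the observed behavior.

```python
import ipaddress

LOCAL = "18.70.0.26"                                      # this machine
PEERS = ["18.70.0.252", "18.177.0.64", "199.94.220.184"]  # the three destinations tested


def classful_prefix_len(addr):
    """Prefix length of the historical class A/B/C network containing addr."""
    first_octet = int(addr) >> 24
    if first_octet < 128:
        return 8    # class A
    if first_octet < 192:
        return 16   # class B
    return 24       # class C (classes D and E are not handled in this sketch)


def is_local(local_ip, remote_ip, prefix_len):
    """True if remote_ip lies in the same prefix_len-bit network as local_ip."""
    net = ipaddress.ip_network(f"{local_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(remote_ip) in net


def chosen_mss(local_ip, remote_ip, prefix_len, link_mss=1460, default_mss=512):
    """Assumed rule: the link MSS for 'local' peers, the default MSS otherwise."""
    return link_mss if is_local(local_ip, remote_ip, prefix_len) else default_mss


if __name__ == "__main__":
    classful = classful_prefix_len(ipaddress.ip_address(LOCAL))
    for peer in PEERS:
        print(peer, chosen_mss(LOCAL, peer, classful), chosen_mss(LOCAL, peer, 16))
```

With the classful /8 mask, the model yields 1460, 1460 and 512 for the
three destinations, which matches the traces. With the configured /16
mask it yields 1460, 512 and 512, which is what one would expect if
the stated mask were honored.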

As an aside, connections to localhost use an MSS of 30720. One would
think this could be improved substantially (but perhaps not?).

>Fix:

	1. Implement path MTU discovery. FreeBSD has it, so we really
	should get it at some point. I suppose this is unlikely to
	happen soon.

	2. Fix the aforementioned masking problem. Unfortunately, this
	seems somewhat counterproductive if nothing else is done,
	since connections to other subnets of network 18 would then
	drop to an MSS of 512 as well.

	3. Change tcp_mssdflt from 512 to 1460. This is the easy way
	out. Unfortunately, I'm not quite sure what effect it will
	have when there are links in the middle of the path that
	cannot carry a 1460-byte MSS (i.e. links with an MTU less
	than 1500). I suppose it is likely to cause fragmentation on
	those links, but given the structure of the modern Internet,
	anyone who has a link with an MTU less than 1500 isn't really
	concerned about performance anyway (i.e. they're on a dialup
	link), so perhaps we don't care if they fragment (this is a
	rationalization that seems reasonably plausible).
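
To illustrate the fragmentation cost of option 3, here is a rough
sketch. It assumes IPv4 and TCP headers of 20 bytes each with no
options, and the 576-byte and 1006-byte link MTUs are hypothetical
examples, not measurements from any real path. The function names are
likewise illustrative.

```python
IP_HEADER = 20   # assumption: IPv4 header without options
TCP_HEADER = 20  # assumption: TCP header without options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented datagram of size mtu."""
    return mtu - IP_HEADER - TCP_HEADER


def fragments_for_segment(mss, link_mtu):
    """Number of IPv4 fragments a full-sized segment becomes on a link."""
    datagram = mss + IP_HEADER + TCP_HEADER
    if datagram <= link_mtu:
        return 1  # fits as-is, no fragmentation
    ip_payload = datagram - IP_HEADER
    # Every fragment but the last must carry a multiple of 8 payload bytes.
    per_fragment = (link_mtu - IP_HEADER) // 8 * 8
    return -(-ip_payload // per_fragment)  # integer ceiling division


if __name__ == "__main__":
    for link_mtu in (1500, 1006, 576):  # 1006 and 576 are hypothetical
        print(link_mtu,
              fragments_for_segment(1460, link_mtu),
              fragments_for_segment(512, link_mtu))
```

Under these assumptions, a full 1460-byte segment crosses a 1500-byte
link intact, becomes 2 fragments on a 1006-byte link, and becomes 3
fragments on a 576-byte link, while a 512-byte segment is never
fragmented on any of the three.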
>Audit-Trail:
>Unformatted: