Subject: Re: TCP_NODELAY and full links (was Re: sup problems?)
To: None <feico@pasta.cs.uit.no>
From: Sean Doran <smd@ebone.net>
List: current-users
Date: 09/29/1999 04:01:13
Feico Dillema writes:

| Am I right to conclude that it is safe or at least reasonable to have
| PMTU disc. switched on for a server not directly behind such a
| blackhole? How about clients behind such a blackhole? Will they be
| blocked?

It's reasonable, yes, but there are risks.

Below I use client and server although really it's TCP sender and
TCP receiver in each case; TCP does not care about clients and servers
per se.

Case #1: no filtering out of ICMP
	-- server sends big packet with "Don't Fragment" (DF) bit set
           towards client
	-- router somewhere tries to forward big packet onto small-packet-only
           interface, and cannot
	-- router sends "Needs Fragmentation" ICMP message back to server,
           hopefully with the MTU of the small-packet-only interface embedded
           in it
	-- server tries a smaller packet (small-packet-only interface 
           MTU sized if known from "NF" message content) and hopefully 
           that works (otherwise you see "Needs Fragmentation" again)
	-- at intervals, server re-probes the path MTU by repeating the
           steps above

	Result: we reasonably rapidly converge on the MTU of the path
        between server and client, and for bulk transfers end up using
        the biggest possible packets, thus improving network efficiency.
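
	A toy model of case #1, as a minimal sketch only.  The function
	name, the list-of-link-MTUs representation, and the assumption
	that every "NF" message carries the next-hop MTU are all
	illustrative, not taken from any real TCP stack.

```python
# Toy model of case #1.  All names here are hypothetical.

def discover_path_mtu(link_mtus, first_hop_mtu):
    """Return (path_mtu, attempts) for a sender that sets DF and honours
    ICMP "Needs Fragmentation" replies that carry the next-hop MTU."""
    size = first_hop_mtu
    attempts = 0
    while True:
        attempts += 1
        # The first link too small for the packet is the one whose
        # router sends "Needs Fragmentation" back to the sender.
        blocker = next((mtu for mtu in link_mtus if mtu < size), None)
        if blocker is None:
            return size, attempts   # packet crossed the whole path
        size = blocker              # retry at the MTU named in the reply


print(discover_path_mtu([1500, 1006, 4470, 576], 1500))  # (576, 3)
```

	Each bottleneck costs one extra attempt, which is why convergence
	is quick when the ICMP messages get through.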

Case #2: ICMP Needs Fragmentation messages are filtered out in the path
         between the router with the small-packet-only interface and the
         server itself

	-- big TCP segment is sent with "Don't Fragment" bit set in the IP header
	-- no "Needs Fragmentation" ICMP message is seen, as it has
           been eaten by a filter or otherwise lost in transit
	-- TCP RTO (retransmission timeout) or three duplicate ACKs occur,
           with the client ACKing the segment before the "big segment"
	-- segment is resent as is, thereby triggering "NF" which again never
           arrives.  Loop to previous step until conversation times out.

	Result: conversation killed by ICMP black hole.
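
	The loop in case #2 can be sketched the same way.  This is an
	assumption-laden toy: the retransmit limit of 12 is a made-up
	default, and real stacks back off their timers between attempts.

```python
# Toy model of case #2.  The names and the retry limit are hypothetical.

def send_through_black_hole(segment_size, path_mtu, max_retransmits=12):
    """Sender keeps DF set and never hears "Needs Fragmentation"."""
    for attempt in range(1, max_retransmits + 1):
        if segment_size <= path_mtu:
            return ("delivered", attempt)
        # The ICMP reply was filtered, so all the sender sees is a
        # timeout, and it resends the very same oversized segment.
    return ("timed out", max_retransmits)


print(send_through_black_hole(1500, 576))  # ('timed out', 12)
```

	Nothing in the loop ever changes segment_size, so an oversized
	segment can never succeed.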

Case #3: as per case #2 but server can detect and avoid PMTU blackholes

	-- TCP RTO (retransmission timeout) or three duplicate ACKs occur;
           a smaller segment is resent
	-- when ACK is seen, use the smaller segment size as basis for new
           path MTU, otherwise loop to previous step
	-- probe path MTU at intervals

	Result: we stall each time we send a packet that is bigger
        than the path MTU.  We do converge on the maximum supported
        packet size and thereby improve network efficiency, but 
        we send at a slower rate due to packet losses.

        If we're lucky we see DUPACKs and do a fast-retransmit/fast-recovery,
        and the stall is not so bad, and possibly even beneficial (e.g., in
        Greg Wood's congested FIFO case, the halving of cwnd reduces 
        unnecessary buffer occupation).

        If we are unlucky, we see RTOs, and have to slow-start.
	(This drains Greg's queue to zero, which might be good for other
        things trying to use his link, but not so good for the speed
        at which the affected TCP bulk transfer finishes).
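
	A sketch of the step-down in case #3.  How the "smaller segment"
	gets chosen is not specified above; the ladder of candidate sizes
	below is my assumption (loosely modelled on the plateau table in
	RFC 1191), and the names are hypothetical.

```python
# Toy model of case #3.  The ladder of sizes is an assumed policy.

CANDIDATE_SIZES = [1500, 1492, 1006, 508, 296, 68]

def probe_down(path_mtu, sizes=CANDIDATE_SIZES):
    """Return (working_size, stalls).  Each stall is one silently
    dropped packet, costing an RTO or a fast retransmit."""
    stalls = 0
    for size in sizes:
        if size <= path_mtu:
            return size, stalls   # ACK seen: adopt as path MTU estimate
        stalls += 1               # no ICMP, no ACK: try the next size down
    raise RuntimeError("no candidate size fits the path")


print(probe_down(576))  # (508, 3)
```

	Note that with a fixed ladder the sender settles on the largest
	candidate that fits (508 here), which may be below the true path
	MTU (576); getting closer takes further upward probes.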

Case #4: subtle optimization of #3

	-- TCP RTO (retransmission timeout) occurs, resend packet as-is but
           without the "Don't Fragment" bit
	-- if ACK is seen, then next RTT try a big packet with DF again,
           and if we see an RTO, we have probably stumbled into an ICMP
           black hole, so use the smaller MTU

	Result: more clues about whether we are seeing packets
        being lost because of path MTU blackhole instead of congestion.
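
	The inference in case #4 boils down to a small decision table.
	A sketch follows; only the "probable ICMP black hole" outcome is
	stated above, and the other two labels are my own reading.

```python
# Toy decision table for case #4.  The function name and the labels
# other than the black-hole one are assumptions.

def classify_loss(acked_without_df, acked_big_df_retry):
    """Interpret the two observations made after an RTO on a big packet."""
    if not acked_without_df:
        # Even a fragmentable copy was lost, so size is not the culprit.
        return "congestion or other loss"
    if acked_big_df_retry:
        # Big packets with DF do get through, so the first loss was a one-off.
        return "transient loss"
    # Fragmentable copy arrives, DF copy does not, and no ICMP was seen.
    return "probable ICMP black hole"


print(classify_loss(True, False))  # probable ICMP black hole
```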

We want to reprobe the path MTU because it can change over time,
and we would like to exploit any *increase* in path MTU, so
every so often we send a larger packet if we aren't already at the
MTU of the medium to which the server is directly connected.
"Every so often" can be as aggressive as once per RTT, and being
aggressive is great if we haven't seen a "Needs Fragmentation"
message or an RTO yet.  Once we've seen an "NF", or once we guess
that the RTOs are the result of blackholing rather than congestion
(i.e., we're losing the big packets and not the small ones, so
we are probably behind an ICMP blackhole), being aggressive is more
likely wasteful than useful.
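
The reprobe policy in the paragraph above might be sketched like this.
The function name and the 600-second backoff are assumptions picked
for illustration; only the "once per RTT until we have evidence of a
limit" idea comes from the text.

```python
# Sketch of a reprobe timer.  The backoff value is hypothetical.

def next_probe_delay(rtt, saw_needs_frag, suspect_black_hole,
                     at_local_mtu, backoff=600.0):
    """Seconds to wait before trying a larger packet, or None if
    there is nothing larger to try."""
    if at_local_mtu:
        return None        # already at the MTU of the attached medium
    if saw_needs_frag or suspect_black_hole:
        return backoff     # the path has told us (or hinted at) a limit
    return rtt             # no evidence of a limit yet: be aggressive


print(next_probe_delay(0.2, False, False, False))  # 0.2
print(next_probe_delay(0.2, True, False, False))   # 600.0
```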

There are longwinded arguments here and there about how fast one
should react to decreases in path MTU.   One school is very definitely
in favour of setting DF on *every* packet, so that any path MTU change
can be dealt with immediately.  Another school favours probing only
once per RTT.  It all depends on how deep your feelings are about the
risks of IP fragmentation.  (Mine aren't very deep; I often turn
off path MTU discovery and send segments (without DF set!) that
are the minimum of the attached-network MTU and the receiver's
advertised MSS.  If it's a busy box being talked to by a lot of
random people, I also cap them at the standard ethernet MTU.)
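
A sketch of that no-PMTUD choice of segment size.  The 40 bytes of
header overhead (IPv4 plus TCP, no options) and the function and
parameter names are assumptions added to put MTU and MSS in the same
units; 1500 is the standard ethernet MTU.

```python
# Sketch of picking a segment size with path MTU discovery off.
# Names are hypothetical; header size assumes IPv4 + TCP, no options.

IP_TCP_HEADERS = 40
ETHERNET_MTU = 1500

def segment_size(attached_mtu, peer_mss, busy_public_server=False):
    """Largest TCP payload to send, with DF left clear."""
    mtu = attached_mtu
    if busy_public_server:
        # Many unknown peers: assume nothing bigger than ethernet.
        mtu = min(mtu, ETHERNET_MTU)
    return min(mtu - IP_TCP_HEADERS, peer_mss)


print(segment_size(4470, 8960))                           # 4430
print(segment_size(4470, 8960, busy_public_server=True))  # 1460
```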

	Sean.