ssh's "bad packet length" vs. SACK and IPsec

To: tech-net%netbsd.org@localhost
Subject: ssh's "bad packet length" vs. SACK and IPsec
From: Quentin Garnier <cube%cubidou.net@localhost>
Date: Wed, 20 Feb 2008 16:53:52 +0100
Hi,

How the three are linked is an interesting story...  I'll make it short.
There are several bugs involved.

If you wish to reproduce the issue, here's how I did it.  The bug annoys
me at work, so I built a sketch of that network using Xen DomUs.  Here's
the setup:

+---+ Eth +---+ Eth +----------+ Eth +---+ Eth +---+
| C +=====+ A +=====+ Internet +=====+ B +=====+ D +
+---+     +-+-+     +----------+     +-+-+     +---+
            |  gif tunnel with         |
            +--------------------------+
                   IPsec transport

The so-called "Internet" is actually the Dom0.  The four Ethernet
segments are done with bridges on the Dom0.

The A-Dom0 network is 10.0.0/24, the B-Dom0 network is 10.0.1/24, the
A-C network is 10.0.2/24 and the B-D network is 10.0.3/24.  The gif
tunnel connects addresses starting with 10.1.0.

On each of its interfaces, the last component of the IP address is the
hex number of the host.

How IPsec is configured is completely irrelevant at this point:  as a
matter of fact you don't have to configure any policy, merely compiling
a kernel with support for IPsec is enough.

I expect the "Bad packet length" issue to be common for people using a
similar setup, or an environment prone to MTU sensitiveness.  You
experiment it seeing scp, or even sometimes ssh, aborting with a message
about a SSH packet length that indeed seems completely bogus.

I've only been able to pinpoint the issue with ssh/scp.  Any other
application I could try (be it ftp, or a simple test I coded) are simply
faced with the issue Edgar Fuß describes in a thread these days:  the
connection stalls.  Note that it can happen with ssh/scp too, it's just
that sometimes ssh/scp will go further for some timing reason.  So many
bugs to pick from...

I have a trace of a failing scp session at the following address:

http://taryn.cubidou.net/~cube/onA

As its name suggests, it's a capture done on machine A.  The scp was
trying to send a file from C to D.

The IPsec bug
=============

First thing's first.  We'll see how fixing that bug hides the SACK one
later.

Look at the "Need Fragment" ICMP messages in the trace.  Can you see
what is wrong?  Here I thank Michael Van Elst for analysing this bug.

In ip_output.c:ip_forward(), we get EMSGSIZE returned because of the
MTU of the gif interface (1280 in a default setup).  There is a mention
here, "XXX quickhack!!!" for good reason:  the code doesn't make sure
the packet was actually going to be handled by an IPsec policy.  In this
case, it's not (at this point it hasn't been forwarded to the gif
interface).

The problem is ipsec4_getpolicybyaddr() will always return something,
the default policy if no existing policy matched.  The "req" field of
that policy is NULL, hence "destmtu" doesn't get set and stays at 0,
which we send back in the ICMP message.

This is a bug that needs to be fixed.  Michael suggested the following
workaround for now:

--- ip_input.c  16 Sep 2007 15:34:59 -0000      1.236.2.1
+++ ip_input.c  20 Feb 2008 15:11:20 -0000
@@ -1996,7 +1996,7 @@
                            IPSEC_DIR_OUTBOUND, IP_FORWARDING,
                            &ipsecerror);
 
-                       if (sp == NULL)
+                       if (sp == NULL || sp->req == NULL)
                                destmtu = ipforward_rt.ro_rt->rt_ifp->if_mtu;
                        else {
                                /* count IPsec header size */

The SACK bug (and other TCP issues)
===================================

That one took a lot longer to track down, because I had to learn how TCP
works in NetBSD, which, well, took time.

I find it interesting that TCP doesn't act immediately upon receiving an
ICMP "Need Fragment" message.  What the code does is record it and wait
to use it in the handler of the retransmission timeout handler.  My
guess here is that not all packets that have been sent might get
blocked, and if any of them went through, the peer will send a duplicate
ACK, possibly with SACK information and fast retransmit will roll
forward.

As you can see in the trace though, it's not the case here.  There was
enough data in the socket buffer to send 4 segments in a burst (the
maximum allowed), and all of them got an ICMP reply.  Therefore we wait
for a whole 1.5 second for the timeout to happen.

I really think we should get the route created and adjust the segment
size immediately.  Maybe there are other reasons for not doing that.

Anyway, what does the code do with that "MTU 0" thing?  It goes all the
way down to ip_icmp.c:icp_mtudisc(), which then uses a hard coded table
of values to try (which I find rather smart).  The first possible one is
1492, so that's what the TCP code will use.

At this point SACK is not yet involved of course, having received no
ACK.  Plain retransmit is the way to go, using 1440 as the segment size
instead of 1448 before.

I'm not entirely sure why there are two retransmission here.  But the
second one does send 4 packets, and the congestion window has shrunk
enough so that the last packet is small enough to go through the tunnel.

We get a duplicate ACK in return with SACK information, so we initiate
SACK recovery.  While ICMP Need Fragment messages have been received,
they have not been acted upon, so TCP still thinks it's ok to send 1440
bytes of payload.  Tough luck.

This is where it gets specific to ssh/scp that doesn't always run into
the issue my test app and Edgar Fuß (I believe) are facing:  the rexmt
timer was armed a second time, so we get to do the Path MTU Discovery
dance a second time.

But SACK has recorded in its scoreboard that it has retransmitted some
data.  So tcp_output() will start where the previous packet left off.
The congestion window make it so it's only possible to send a 468-byte
long packet, so we do that.

Now look at the next packet, frame #100 on the wire.  Compare the data
with frame #74.  Can you see the bug?  Now compare the date in #100 and
in #102.  What I know is that what's in #102 is correct, so NetBSD sent
data from the buffer with the wrong sequence number.

Looking back, I should have asked the following question at this point:
why it is even sending this packet (and the subsequent ones)?  We've
filled the congestion window already, since the previous packet was not
a full segment.

A lot of debug allowed my to understand the issue:  tcp_mtudisc
invalidates all previous send and sets snd_nxt to the value of snd_una
so that tcp_output will start from the beginning.  Except SACK entered
the picture, and it will try to send what's after what it already sent.

After sending #99, as sendalot is 1, we goto again.  "off" gets to point
after the data sent in #99, but cwin is 0 because the congestion window
is full.  So sack_rxmit stays at 0.  Things should stop there, except
they don't, because when tcp_mtudisc reset snd_nxt, it didn't change
sack_newdata, which means now snd_nxt is smaller than sack_newdata, so
the second "cwin" gets set to a very large value, allowing the sending
of full segemnts.

But at that point, "off" is set in stone and doesn't agree with snd_nxt.

Oops.

My take at a patch:

Index: tcp_subr.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/tcp_subr.c,v
retrieving revision 1.208.2.1
diff -u -p -r1.208.2.1 tcp_subr.c
--- tcp_subr.c  24 May 2007 19:13:13 -0000      1.208.2.1
+++ tcp_subr.c  20 Feb 2008 15:47:55 -0000
@@ -1729,7 +1729,7 @@ tcp_mtudisc(struct inpcb *inp, int errno
                /*
                 * Resend unacknowledged packets.
                 */
-               tp->snd_nxt = tp->snd_una;
+               tp->snd_nxt = tp->sack_newdata = tp->snd_una;
                tcp_output(tp);
        }
 }

tcp6_mtudisc() would need the same treatment.  I looked at the other
places where snd_nxt is reset in a similar way, but they seem to be
safe:  SACK either hasn't happened yet, or is completely out of the
picture.  A second look wouldn't be too much though.

While those two patches makes the whole thing happy, I think we should
re-visit the path MTU discovery code to be more efficient.  In the case
of blackholes for instance, we should make use of icmp_mtudisc's clever
table.  And I still think we should act on ICMP Need Fragment messages
immediately.

I'd appreciate if someone with TCP and possibly SACK knowledge would
confirm my analysis and the correctness of the patch I suggest.
Otherwise I'll commit sometime later...

The reason I ask is that when we change the MTU, we should probably
invalidate all SACK retransmits, like we do for non-SACK retransmits.
The patch I suggest doesn't do that, it only give a chance to the code
not to send bogus data.

-- 
Quentin Garnier - cube%cubidou.net@localhost - cube%NetBSD.org@localhost
"See the look on my face from staying too long in one place
[...] every time the morning breaks I know I'm closer to falling"
KT Tunstall, Saving My Face, Drastic Fantastic, 2007.
Attachment: pgpXmG92z7EyV.pgp
Description: PGP signature
Follow-Ups:
- Re: ssh's "bad packet length" vs. SACK and IPsec
  - From: YAMAMOTO Takashi
Prev by Date: Re: Mobile IPv6 patch for NetBSD-current (revised)
Next by Date: Re: Mobile IPv6 patch for NetBSD-current (revised)
Previous by Thread: asm source
Next by Thread: Re: ssh's "bad packet length" vs. SACK and IPsec
Indexes:
Home | Main Index | Thread Index | Old Index