FAST_IPSEC fragmentation problem

To: tech-net%NetBSD.org@localhost
Subject: FAST_IPSEC fragmentation problem
From: Beverly Schwartz <bschwart%bbn.com@localhost>
Date: Tue, 16 Oct 2012 16:05:11 -0400

I have detected a problem in which FAST_IPSEC does not return a fragmentation 
needed icmp message.

Here's the setup:

A=======B--------C

Between A & B is an IPsec esp tunnel.  B & C are in the clear.

A initiates a wget from C.  After initial TCP handshaking, C sends a packet
with TCP data to fill the mtu between B & C.

B tries to forward the packet to A after adding a new IP header, ESP header,
and padding, but the resultant packet is too large.

B does *not* send a fragmentation needed packet back to C.  Instead, C times
out and black hole detection comes into play, C starts sending minimally sized
packets, and then the data gets through.  This delays the (small) transfer by
about 5 seconds.  For large transfers, this has a signifcant impact on
throughput because of the small-sized packets.

If you look at netipsec/ipsec_output.c, it looks like fragmentation is being
taken care of, but it is not.  This is what happens.

In function ipsec4_process_packet, based on the policy set for setting the DF
bit on the outer packet (ip4_ipsec_dfbit) and the setting of the bit in the
inner packet is whether or not the DF bit gets set in the outer packet.  This
code does work as expected.

Continuing further down ipsec4_process_packet, we see a call to do the output
transform:
error = (*sav->tdb_xform->xf_output)(m, isr, NULL, i, off);

In my case, we are doing crypto, and this function maps to esp_output.

Looking in netipsec/xform_esp.c, we can see crypto_dispatch called.
crypto_dispatch will queue the packet to have crypto done, returning
ENOENT.  

We can see this is netinet/ip_output.c after the call to
ipsec4_process_packet.  In ip_output, error is changed to 0 because no
error has occurred - we're just waiting to be called back when opencrypto
is done.  Control jumps to done, and ip_forward sees no error.  ip_forward
returns.

When crypto has done it's thing, it puts the packet on another queue, and
calls esp_output_cb.  esp_output_cb calls ip_output.  On this run through
ip_output, the DF bit is checked, and we do have an MTU error.  However,
we're no longer returning to ip_forward which has all the context we need
to call icmp_error.  Even if esp_output_cb could detect the failure is an
mtu problem, all it has is an encrypted packet, and has no way to figure
out where to send an icmp error message to.

So no fragmentation needed message is sent.

I have been trying different ways to deal with this.  What seems easiest
to me, is in the case where the DF bit is set, add a call before the call
to ipsec4_process_packet in ip_output which determines how many bytes
IPsec will add.  Then check current packet length plus the extra bytes
against mtu.  If this fails, return EMSGSIZE.  Here's the code, placed
right before the call to ipsec4_process_packet:

                /*
                 * Check that MTU is sufficient.
                 */
                if (ntohs(ip->ip_off) & IP_DF) {
                        size_t ipsec_hdrlen = ipsec_hdrsiz(sp);
                        if (ip_len + ipsec_hdrlen > mtu) {
                                if (flags & IP_RETURNMTU)
                                        *mtu_p = mtu - ipsec_hdrlen;
                                error = EMSGSIZE;
                                IP_STATINC(IP_STAT_CANTFRAG);
                                splx(s);
                                goto bad;
                        }
                }

ipsec_hdrsiz is declared a static function, but this is easy to change.
I prefer calling ipsec_hdrsiz directly rather than using ipsec4_hdrsiz
because we already have the policy so don't need to get it again.

This does not fully solve our problem, however.  In ip_forward, we
call ipsec4_hdrsiz in order to determine the mtu size to send
in the icmp error message.  Well the calculation doesn't work, and
we end up with an mtu of 1500.  No help there.

Fortunately, this problem is easy to fix.  Because we set *mtu_p,
ip_forward already has the new mtu.  So in case EMSGSIZE in ip_forward,
after setting type and code, we need to add:
                if (destmtu != 0)
                        break;

The only problem I'm still struggling with is that ipsec_hdrsiz returns
a strange (meaning odd number) header length.  I think this should be
decreased until we have a multiple of 4.  But that's just my opinion.

BTW, IPv6 doesn't quite run into this because it just applies source
fragmentation to the new packet.  IPv6 should not fragment midstream,
so this is probably not desired behavior.  However, one could argue
that the encapsulated packet is a new packet, therefore fragmentation
is allowed.  In my opinion, this doesn't ring true to the spirit of the
IPv6 spec.

-Bev

Follow-Ups:
- IPv4 fragmentation and IPsec details
  - From: Beverly Schwartz
- Re: FAST_IPSEC fragmentation problem
  - From: Gert Doering

Prev by Date: Re: SMSC95XX [was Re: Raspberry PI]
Next by Date: Re: FAST_IPSEC fragmentation problem
Previous by Thread: Re: SMSC95XX [was Re: Raspberry PI]
Next by Thread: Re: FAST_IPSEC fragmentation problem
Indexes:

Home | Main Index | Thread Index | Old Index