IPv4 fragmentation and IPsec details

To: "tech-net%NetBSD.org@localhost" <tech-net%NetBSD.org@localhost>
Subject: IPv4 fragmentation and IPsec details
From: Beverly Schwartz <bschwart%bbn.com@localhost>
Date: Thu, 25 Oct 2012 15:10:56 -0400

This is a continuation of the email I posted on 10/16/12 about FAST_IPSEC not 
sending "fragmentation needed" messages.  I have included that email at the end 
of this email.

As part of my fix, I now have ip_forward use the mtu returned from ip_output 
via the mtu_p pointer.  The problem is that breaks KAME_IPSEC.  KAME_IPSEC 
doesn't update the mtu value, so with my fix outlined below, KAME_IPSEC will 
now send a "fragmentation needed" message with a value of 1500 (or whatever the 
mtu of the interface is).  So I have modified the KAME_IPSEC block in 
ip_output.c to parallel what I do in the FAST_IPSEC block.  (See below.)

This all works, but there are a couple of troubling details.

1. ipsec_hdrsiz returns the maximum possible number of bytes IPsec may add, 
rather than the actual number of bytes that *will* be added to the packet.  
With my fix, some packets that could actually fit in the MTU will be bounced 
because ipsec_hdrsiz reports a number of bytes that is greater than what is 
actually added.

2. ipsec_hdrsiz returns an odd number, so the "fragmentation needed" packet 
contains an mtu that is an odd number.  (This was true before any of my fixes, 
so is a different but related problem.  See esp_hdrsiz in netipsec/xform_esp.c 
for FAST_IPSEC or netinet6/esp_output.c for KAME_IPSEC.)

For the short term, I am proposing that we use ipsec_hdrsiz as is, but in 
ip_forward, reduce destmtu until it is on a 4-byte boundary.

For a longer term fix, I propose having two functions, one that calculates the 
actual number of bytes added to a packet, and another that calculates the 
maximum possible number of bytes that can be added based on IPsec policy.  The 
calculation should push the number up to a 4-byte, 8-byte or whatever boundary, 
based on what boundary the IPsec policy uses for padding.  

Thoughts?  Comments?

-Bev

Begin forwarded message:

> From: Beverly Schwartz <bschwart%bbn.com@localhost>
> Subject: FAST_IPSEC fragmentation problem
> Date: October 16, 2012 4:05:11 PM EDT
> To: tech-net%NetBSD.org@localhost
> Cc: Beverly Schwartz <bschwart%bbn.com@localhost>
> 
> I have detected a problem in which FAST_IPSEC does not return a fragmentation 
> needed icmp message.
> 
> Here's the setup:
> 
> A=======B--------C
> 
> Between A & B is an IPsec esp tunnel.  B & C are in the clear.
> 
> A initiates a wget from C.  After initial TCP handshaking, C sends a packet
> with TCP data to fill the mtu between B & C.
> 
> B tries to forward the packet to A after adding a new IP header, ESP header,
> and padding, but the resultant packet is too large.
> 
> B does *not* send a fragmentation needed packet back to C.  Instead, C times
> out and black hole detection comes into play, C starts sending minimally sized
> packets, and then the data gets through.  This delays the (small) transfer by
> about 5 seconds.  For large transfers, this has a signifcant impact on
> throughput because of the small-sized packets.
> 
> If you look at netipsec/ipsec_output.c, it looks like fragmentation is being
> taken care of, but it is not.  This is what happens.
> 
> In function ipsec4_process_packet, based on the policy set for setting the DF
> bit on the outer packet (ip4_ipsec_dfbit) and the setting of the bit in the
> inner packet is whether or not the DF bit gets set in the outer packet.  This
> code does work as expected.
> 
> Continuing further down ipsec4_process_packet, we see a call to do the output
> transform:
> error = (*sav->tdb_xform->xf_output)(m, isr, NULL, i, off);
> 
> In my case, we are doing crypto, and this function maps to esp_output.
> 
> Looking in netipsec/xform_esp.c, we can see crypto_dispatch called.
> crypto_dispatch will queue the packet to have crypto done, returning
> ENOENT.  
> 
> We can see this is netinet/ip_output.c after the call to
> ipsec4_process_packet.  In ip_output, error is changed to 0 because no
> error has occurred - we're just waiting to be called back when opencrypto
> is done.  Control jumps to done, and ip_forward sees no error.  ip_forward
> returns.
> 
> When crypto has done it's thing, it puts the packet on another queue, and
> calls esp_output_cb.  esp_output_cb calls ip_output.  On this run through
> ip_output, the DF bit is checked, and we do have an MTU error.  However,
> we're no longer returning to ip_forward which has all the context we need
> to call icmp_error.  Even if esp_output_cb could detect the failure is an
> mtu problem, all it has is an encrypted packet, and has no way to figure
> out where to send an icmp error message to.
> 
> So no fragmentation needed message is sent.
> 
> I have been trying different ways to deal with this.  What seems easiest
> to me, is in the case where the DF bit is set, add a call before the call
> to ipsec4_process_packet in ip_output which determines how many bytes
> IPsec will add.  Then check current packet length plus the extra bytes
> against mtu.  If this fails, return EMSGSIZE.  Here's the code, placed
> right before the call to ipsec4_process_packet:
> 
>                /*
>                 * Check that MTU is sufficient.
>                 */
>                if (ntohs(ip->ip_off) & IP_DF) {
>                        size_t ipsec_hdrlen = ipsec_hdrsiz(sp);
>                        if (ip_len + ipsec_hdrlen > mtu) {
>                                if (flags & IP_RETURNMTU)
>                                        *mtu_p = mtu - ipsec_hdrlen;
>                                error = EMSGSIZE;
>                                IP_STATINC(IP_STAT_CANTFRAG);
>                                splx(s);
>                                goto bad;
>                        }
>                }
> 
> ipsec_hdrsiz is declared a static function, but this is easy to change.
> I prefer calling ipsec_hdrsiz directly rather than using ipsec4_hdrsiz
> because we already have the policy so don't need to get it again.
> 
> This does not fully solve our problem, however.  In ip_forward, we
> call ipsec4_hdrsiz in order to determine the mtu size to send
> in the icmp error message.  Well the calculation doesn't work, and
> we end up with an mtu of 1500.  No help there.
> 
> Fortunately, this problem is easy to fix.  Because we set *mtu_p,
> ip_forward already has the new mtu.  So in case EMSGSIZE in ip_forward,
> after setting type and code, we need to add:
>               if (destmtu != 0)
>                       break;
> 
> The only problem I'm still struggling with is that ipsec_hdrsiz returns
> a strange (meaning odd number) header length.  I think this should be
> decreased until we have a multiple of 4.  But that's just my opinion.
> 
> BTW, IPv6 doesn't quite run into this because it just applies source
> fragmentation to the new packet.  IPv6 should not fragment midstream,
> so this is probably not desired behavior.  However, one could argue
> that the encapsulated packet is a new packet, therefore fragmentation
> is allowed.  In my opinion, this doesn't ring true to the spirit of the
> IPv6 spec.
> 
> -Bev
> 
>

References:
- FAST_IPSEC fragmentation problem
  - From: Beverly Schwartz

Prev by Date: Re: IPv6 ICMPv6 neighbor solicitation - ignored between different prefixes
Next by Date: Re: Introducing NPF in NetBSD 6.0
Previous by Thread: Re: FAST_IPSEC fragmentation problem
Next by Thread: IPv6 tentative address and RTM_NEWADDR (again) (was: Re: error from ntpd upon IPv6 address receipt/bind)
Indexes:

Home | Main Index | Thread Index | Old Index