Subject: packet loss? w/ 1.6[A-D] & IPSEC policy
To: None <current-users@netbsd.org>
From: Arto Selonen <arto@selonen.org>
List: current-users
Date: 07/20/2002 14:30:19
Hi!

I have had IPSEC policies in place since late May 2001, and things
worked as expected up to 1.5ZC. Since upgrading to 1.6A, and on
through the current 1.6B/1.6D client/server pair, I have had
problems. SSH seems to work without noticeable ill effects, but
e.g. web surfing from client to server breaks, with connections
eventually timing out.

If my memory serves me right, this started happening as soon as I
upgraded the client (10.1.1.1) from 1.5ZC to 1.6A, even though the
problems *seem* to be at the server end (which first stayed at 1.5ZC
and was then upgraded to 1.6A, 1.6B and 1.6D, none of which helped).

It would seem that as soon as I turn the IPSEC policy on for the
client/server pair, I start losing packets (from the server end).
Why did this surface after the (client?) upgrade to 1.6A (and beyond)?
Have I overlooked a required change somewhere along the way?

What can (should) I do to get this working again? send-pr? Any help in
debugging this would be appreciated. Naturally, I am happy to
provide any additional details that might be relevant to this issue.


The client (10.1.1.1) runs 1.6B and the server (10.2.2.2) runs 1.6D.
There is the Big Bad Internet between the hosts.

Here is what I'm seeing without the policies on:

---------------------------------------
user@10.1.1.1% telnet www.example.com 80
Trying 10.2.2.2...
Connected to www.example.com.
Escape character is '^]'.
GET http://www.example.com/ HTTP/1.0<enter>
<enter>
HTTP/1.1 200 OK
[rest of headers + page content follows]
Connection closed by foreign host.
user@10.1.1.1%
---------------------------------------

After I add the policies the dialog becomes:

---------------------------------------
user@10.1.1.1% telnet www.example.com 80
Trying 10.2.2.2...
Connected to www.example.com.
Escape character is '^]'.
GET http://www.example.com/ HTTP/1.0<enter>
<enter>
<enter>
Connection closed by foreign host.
user@10.1.1.1%
---------------------------------------

In other words, there is no output, and when I press Enter for the third
time the connection is closed. This can be repeated at will, and similar
effects happen when using web browsers. Sometimes I might even get a part
of the page before the connection closes (using lynx). Unsuccessful attempts
do not show up in the web server logs (Apache 1.3.26).

The amount of data that the HTTP reply should contain is a bit over 2KB.


Here are some details:

/etc/ipsec.conf @ 10.2.2.2 (modified IP/spi/keys/whitespace):
-----------------------------------------------------------------
add 10.1.1.1 10.2.2.2 esp 0 -E rijndael-cbc 0x0000000000000000000000000000000000000000000000000000000000000000;
add 10.1.1.1 10.2.2.2 ah  1 -A hmac-sha1    0x0000000000000000000000000000000000000000;
add 10.2.2.2 10.1.1.1 esp 2 -E rijndael-cbc 0x0000000000000000000000000000000000000000000000000000000000000000;
add 10.2.2.2 10.1.1.1 ah  3 -A hmac-sha1    0x0000000000000000000000000000000000000000;
spdadd 10.2.2.2 10.1.1.1 any -P out ipsec esp/transport//require ah/transport//require;
spdadd 10.1.1.1 10.2.2.2 any -P in  ipsec esp/transport//require ah/transport//require;
-----------------------------------------------------------------

The "same" policy is used at 10.1.1.1. Neither racoon nor any other key
management daemon is in use; the keys are set manually as shown above
(so key negotiation should not be an issue here).


Here are the tcpdump outputs at 10.2.2.2 (web server) for the above trials:

tcpdump -n -i ep0 host 10.1.1.1 (no policy, modified timestamp,IP,port,seq#)
------------------------------------------------------------------------------
27.66736 10.1.1.1.23 > 10.2.2.2.80: S 123:123(0) win 16384 <mss 1460,nop,wscale 0,nop,nop,timestamp 0 0> (DF)
27.66780 10.2.2.2.80 > 10.1.1.1.23: S 987:987(0) ack 124 win 16384 <mss 1460,nop,wscale 0,nop,nop,timestamp 0 0> (DF)
27.67405 10.1.1.1.23 > 10.2.2.2.80: . ack 1 win 17520 <nop,nop,timestamp 0 0> (DF)
37.48140 10.1.1.1.23 > 10.2.2.2.80: P 1:39(38) ack 1 win 17520 <nop,nop,timestamp 19 0> (DF)
37.67593 10.2.2.2.80 > 10.1.1.1.23: . ack 39 win 17520 <nop,nop,timestamp 20 19> (DF)
38.76712 10.1.1.1.23 > 10.2.2.2.80: P 39:41(2) ack 1 win 17520 <nop,nop,timestamp 22 0> (DF)
38.76925 10.2.2.2.80 > 10.1.1.1.23: . 1:1449(1448) ack 41 win 17520 <nop,nop,timestamp 22 22> (DF)
38.77014 10.2.2.2.80 > 10.1.1.1.23: P 1449:2246(797) ack 41 win 17520 <nop,nop,timestamp 22 22> (DF)
38.77086 10.2.2.2.80 > 10.1.1.1.23: F 2246:2246(0) ack 41 win 17520 <nop,nop,timestamp 22 22> (DF)
38.78816 10.1.1.1.23 > 10.2.2.2.80: . ack 2246 win 16723 <nop,nop,timestamp 22 22> (DF)
38.78911 10.1.1.1.23 > 10.2.2.2.80: . ack 2247 win 16723 <nop,nop,timestamp 22 22> (DF)
38.79304 10.1.1.1.23 > 10.2.2.2.80: F 41:41(0) ack 2247 win 17520 <nop,nop,timestamp 22 22> (DF)
38.79321 10.2.2.2.80 > 10.1.1.1.23: . ack 42 win 17520 <nop,nop,timestamp 22 22> (DF)
------------------------------------------------------------------------------
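
As a cross-check on the "a bit over 2KB" figure, the payload lengths in the
cleartext trace above can be totalled per direction. This is only a minimal
sketch in Python: the helper name payload_bytes is made up for illustration,
and the trace is a copy of the lines above with the TCP options trimmed off.

```python
import re

# Cleartext trace from above, with everything after "win N" trimmed.
CLEAR_TRACE = """\
27.66736 10.1.1.1.23 > 10.2.2.2.80: S 123:123(0) win 16384
27.66780 10.2.2.2.80 > 10.1.1.1.23: S 987:987(0) ack 124 win 16384
27.67405 10.1.1.1.23 > 10.2.2.2.80: . ack 1 win 17520
37.48140 10.1.1.1.23 > 10.2.2.2.80: P 1:39(38) ack 1 win 17520
37.67593 10.2.2.2.80 > 10.1.1.1.23: . ack 39 win 17520
38.76712 10.1.1.1.23 > 10.2.2.2.80: P 39:41(2) ack 1 win 17520
38.76925 10.2.2.2.80 > 10.1.1.1.23: . 1:1449(1448) ack 41 win 17520
38.77014 10.2.2.2.80 > 10.1.1.1.23: P 1449:2246(797) ack 41 win 17520
38.77086 10.2.2.2.80 > 10.1.1.1.23: F 2246:2246(0) ack 41 win 17520
38.78816 10.1.1.1.23 > 10.2.2.2.80: . ack 2246 win 16723
38.78911 10.1.1.1.23 > 10.2.2.2.80: . ack 2247 win 16723
38.79304 10.1.1.1.23 > 10.2.2.2.80: F 41:41(0) ack 2247 win 17520
38.79321 10.2.2.2.80 > 10.1.1.1.23: . ack 42 win 17520
"""

# tcpdump prints data segments as first:last(length).
LEN_RE = re.compile(r"\d+:\d+\((\d+)\)")


def payload_bytes(trace, src):
    """Sum the TCP payload lengths of all segments sent by src (addr.port)."""
    total = 0
    for line in trace.splitlines():
        fields = line.split()
        if len(fields) > 1 and fields[1] == src:
            match = LEN_RE.search(line)
            if match:  # pure ACKs carry no first:last(length) field
                total += int(match.group(1))
    return total


if __name__ == "__main__":
    print("server -> client:", payload_bytes(CLEAR_TRACE, "10.2.2.2.80"))
    print("client -> server:", payload_bytes(CLEAR_TRACE, "10.1.1.1.23"))
```

The server-side total covers the two data segments (1448 and 797 bytes), i.e.
the whole reply fits in one full-sized segment plus one partial one.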

tcpdump -n -i ep0 host 10.1.1.1 (with IPSEC, modified timestamp,IP,spi)
------------------------------------------------------------------------------
03.13645 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x1): ESP(spi=0x0,seq=0x1) (DF)
03.13767 10.2.2.2 > 10.1.1.1: AH(spi=0x3,seq=0x1): ESP(spi=0x2,seq=0x1) (DF)
03.14633 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x2): ESP(spi=0x0,seq=0x2) (DF)
23.09855 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x3): ESP(spi=0x0,seq=0x3) (DF)
23.28964 10.2.2.2 > 10.1.1.1: AH(spi=0x3,seq=0x2): ESP(spi=0x2,seq=0x2) (DF)
24.43422 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x4): ESP(spi=0x0,seq=0x4) (DF)
24.43878 10.2.2.2 > 10.1.1.1: AH(spi=0x3,seq=0x4): ESP(spi=0x2,seq=0x4) (DF)
24.45133 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x5): ESP(spi=0x0,seq=0x5) (DF)
47.05617 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x6): ESP(spi=0x0,seq=0x6) (DF)
47.05724 10.2.2.2 > 10.1.1.1: AH(spi=0x3,seq=0x7): ESP(spi=0x2,seq=0x7)
------------------------------------------------------------------------------

Assuming that the packet exchange should be very similar to the clear text
case, I'm guessing there is the same three-way handshake, then my
intentionally slow 'GET' followed by a one second pause for the second
Enter to complete the HTTP request. After that the server should be
sending the reply, but there is only one packet and no output at the
client end, followed by my 20 second wait, and then the connection closes.

I've verified that both 10.1.1.1 and 10.2.2.2 give the "same" output when
running tcpdump during a failed telnet session. I have no idea why
the server's outgoing stream skips several sequence numbers (in this case
0x3, 0x5 and 0x6). Those missing packets would certainly explain why the
client doesn't get a proper reply to the HTTP request.
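
The gaps can also be picked out mechanically from the IPsec tcpdump above.
The following is a rough sketch in Python, not a polished tool: the function
name missing_ah_seqs and the regular expression are illustrative, and it
assumes the tcpdump line format is exactly as shown in this message.

```python
import re
from collections import defaultdict

# Lines copied from the "with IPSEC" tcpdump output above.
IPSEC_TRACE = """\
03.13645 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x1): ESP(spi=0x0,seq=0x1) (DF)
03.13767 10.2.2.2 > 10.1.1.1: AH(spi=0x3,seq=0x1): ESP(spi=0x2,seq=0x1) (DF)
03.14633 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x2): ESP(spi=0x0,seq=0x2) (DF)
23.09855 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x3): ESP(spi=0x0,seq=0x3) (DF)
23.28964 10.2.2.2 > 10.1.1.1: AH(spi=0x3,seq=0x2): ESP(spi=0x2,seq=0x2) (DF)
24.43422 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x4): ESP(spi=0x0,seq=0x4) (DF)
24.43878 10.2.2.2 > 10.1.1.1: AH(spi=0x3,seq=0x4): ESP(spi=0x2,seq=0x4) (DF)
24.45133 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x5): ESP(spi=0x0,seq=0x5) (DF)
47.05617 10.1.1.1 > 10.2.2.2: AH(spi=0x1,seq=0x6): ESP(spi=0x0,seq=0x6) (DF)
47.05724 10.2.2.2 > 10.1.1.1: AH(spi=0x3,seq=0x7): ESP(spi=0x2,seq=0x7)
"""

# timestamp, source, destination, then the outer AH spi/seq in hex.
AH_RE = re.compile(
    r"^\S+ (\S+) > (\S+): AH\(spi=0x([0-9a-f]+),seq=0x([0-9a-f]+)\)"
)


def missing_ah_seqs(trace):
    """Map each (src, dst, spi) flow to the AH sequence numbers not seen."""
    seen = defaultdict(list)
    for line in trace.splitlines():
        match = AH_RE.match(line)
        if match:
            src, dst, spi, seq = match.groups()
            seen[(src, dst, int(spi, 16))].append(int(seq, 16))
    gaps = {}
    for flow, seqs in seen.items():
        present = set(seqs)
        # Only gaps between the lowest and highest observed numbers count.
        gaps[flow] = [
            n for n in range(min(seqs), max(seqs) + 1) if n not in present
        ]
    return gaps


if __name__ == "__main__":
    for flow, gap in missing_ah_seqs(IPSEC_TRACE).items():
        print(flow, [hex(n) for n in gap])
```

Run against the trace above, the client-to-server flow (spi 0x1) shows no
gaps, while the server-to-client flow (spi 0x3) is missing 0x3, 0x5 and 0x6.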

Running ping from client to server with the policy enabled looks ok:
120 packets transmitted, 120 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 9.511/17.585/20.115/2.529 ms


This is what the ep0 interface looks like on server (modified MAC,inet,inet6):
-------------------------------------------------------------------
ep0: flags=8963<UP,BROADCAST,NOTRAILERS,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        address: 00:00:00:00:00:00
        media: Ethernet 10baseT
        inet 10.2.2.2 netmask 0xfffffff8 broadcast 10.2.2.7
        inet alias 10.2.2.3 netmask 0xfffffff8 broadcast 10.2.2.7
        inet alias 10.2.2.4 netmask 0xfffffff8 broadcast 10.2.2.7
        inet6 fe80::200:00ff:fe00:0000%ep0 prefixlen 64 scopeid 0x2
-------------------------------------------------------------------

Anything else that might be useful to check? I don't (yet) have
DEBUG/DIAGNOSTIC options in the kernel(s).


Sincerely,
	Arto Selonen

#######======------  http://www.selonen.org/arto/  --------========########
Everstinkuja 5 B 35                               Don't mind doing it.
FIN-02600 Espoo        arto@selonen.org         Don't mind not doing it.
Finland              tel +358 50 560 4826     Don't know anything about it.