tech-net: Re: PF and TCP Window Scaling in NetBSD 3.0

Subject: Re: PF and TCP Window Scaling in NetBSD 3.0
To: Joerg Roedel <joro-bsd@zlug.org>
From: Daniel Hartmeier <daniel@benzedrine.cx>
List: tech-net
Date: 07/12/2006 10:08:45

On Tue, Jul 11, 2006 at 08:07:13PM +0200, Joerg Roedel wrote:

> Why did it work at all with the Linux Kernel 2.6.16. I think
> such a mistake in the ruleset should make all TCP connections stop
> working (but it stops working only with a few sites, maybe depending on
> the TCP scale factor they offer).
> With the Linux Kernel 2.6.17.1 (which offers 3 for the window scaling,
> 2.6.16 offered 2, as I examined the only difference) TCP stops working
> with all sites (in detail: the handshake succeeds, data is sent to the
> peer, but the answer packets are dropped by the firewall).

If one of the peers does not support window scaling (or has it
disabled), there should be no stalling, no matter what your endpoint
does. But most peers nowadays support it, so when you tried "with all
sites", you might have only talked to window scaling enabled peers, by
chance.

What Linux recently changed, afaik, is that the window scaling factor
is chosen dynamically depending on read buffer size. If a process uses
a large read buffer (either by global settings or setsockopt(2)), it'll
negotiate a larger scaling factor.
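
As a rough illustration of "larger buffer, larger factor", here is a
minimal Python sketch of one plausible rule: pick the smallest shift
that lets the 16-bit window field cover the read buffer. This is an
assumption for illustration only, not the actual Linux code, and
pick_wscale_shift is a made-up name.

```python
MAX_WINDOW_FIELD = 65535  # largest value the 16-bit th_win field can hold
MAX_WSCALE_SHIFT = 14     # upper limit on the window scale shift count


def pick_wscale_shift(read_buffer_bytes):
    """Smallest shift so that (65535 << shift) covers the read buffer.

    Hypothetical selection rule; real stacks apply their own heuristics.
    """
    shift = 0
    while (MAX_WINDOW_FIELD << shift) < read_buffer_bytes and shift < MAX_WSCALE_SHIFT:
        shift += 1
    return shift


for size in (16384, 65536, 1 << 20):
    print(size, pick_wscale_shift(size))
```

Under this rule a buffer that fits in 16 bits needs no scaling at all,
and every doubling beyond that adds one to the shift.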

If your side is, for instance, a web browser downloading a file over
HTTP from a web server, pf will stall incoming segments as soon as the
window advertised by the client (through its most recent ACK) is
violated. For example, the most recent ACK might be

   th_ack 1000000, th_win 16384

Let's say the client has a large read buffer and negotiated a window
scaling factor of 2^7 == 128. That means it advertised with the ACK that
it is ready to receive up to 16384 * 128 == 2097152 bytes.

But if pf has missed the window scaling negotiation, it assumes no
scaling is taking place, allowing literally only 16384 bytes.
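
To make the arithmetic concrete, here is a small Python sketch of the
two interpretations of the same ACK. The function name effective_window
is made up for illustration; it is not how pf itself is written.

```python
def effective_window(th_win, wscale_shift):
    """Receive window in bytes: the raw th_win field shifted left by
    the negotiated window scale shift (0 means no scaling)."""
    return th_win << wscale_shift


th_win = 16384

# What the client means: shift 7, i.e. a factor of 2^7 == 128.
client_view = effective_window(th_win, 7)

# What pf assumes if it never saw the negotiation: no scaling.
pf_view = effective_window(th_win, 0)

print(client_view, pf_view)
```

The same th_win value therefore stands for 2097152 bytes on the client
and only 16384 bytes in pf's state entry.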

Now the server starts to send data, maybe in 1448 byte chunks. The
first 11 chunks are passed by pf, because they fit even in the unscaled
window of 16384.

Now it depends on how quickly the client ACKs, and how high it can ACK
(whether there was any packet loss). If the client ACKs 1015928 before
the server sends its 12th packet, the window in pf advances and the 12th
packet can pass. Otherwise the 12th packet is wrongly blocked by pf
because it appears to violate the advertised window (wrong because pf
assumes there is no window scaling). The smaller the scaling factor, the
less likely the stalling, probably. With a scaling factor of 2^0, which
is not uncommon, there is no stalling, either.
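
The sequence above can be replayed with a short Python sketch. It is a
deliberately simplified model (a plain upper bound of th_ack plus
window, no further ACKs unless stated), not pf's real state tracking
code, and the function names are invented for this example.

```python
def segment_fits(seq, seg_len, th_ack, th_win, wscale_shift=0):
    """True if the segment ends at or below th_ack + (th_win << shift)."""
    return seq + seg_len <= th_ack + (th_win << wscale_shift)


def first_blocked(th_ack, th_win, seg_len, wscale_shift=0):
    """1-based index of the first back-to-back segment that no longer
    fits, assuming the receiver sends no further ACKs meanwhile."""
    n, seq = 1, th_ack
    while segment_fits(seq, seg_len, th_ack, th_win, wscale_shift):
        seq += seg_len
        n += 1
    return n


# pf missed the negotiation and assumes shift 0.
print(first_blocked(1000000, 16384, 1448))       # 12

# Same 12th segment after the client has ACKed 1015928: it passes.
print(segment_fits(1015928, 1448, 1015928, 16384))  # True

# With the real shift of 7 the limit is far away.
print(first_blocked(1000000, 16384, 1448, 7))    # 1449
```

The first call reproduces the "11 chunks pass, the 12th is blocked"
case; the second shows why a timely ACK hides the problem.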

The HTTP case is simpler because usually only the client's scale factor
matters: the only thing sent from client to server is the request, which
is usually short enough not to make use of the server's scaling factor.
It's also unlikely that a web server uses a very large read buffer
(like the 2MB in the example above), so it will mostly be picking scale
factor 2^0 == 1 for its side. You can use pfctl -vss to see what wscale
factors pf is honouring for each side.

If you want to investigate further, you can enable pf's debug logging
with pfctl -xm and watch /var/log/messages for "BAD state:" lines from
pf. Those show the precise values of the state entry windows and of the
packet that was blocked.

Daniel