tech-net archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: BSD doesn't track window size changes correctly.



On 3 Apr, 2012, at 19:43 , Darren Reed wrote:

> Dennis Ferguson wrote:
>> ...
>> I think you really want to look at what the FreeBSD code does when
>> it starts off knowing that its neighbor's window is closed and then
>> gets a packet that does those two things.
> 
> Indeed, the discussion over this issue started with a thread
> on the FreeBSD networking list:
> http://lists.freebsd.org/pipermail/freebsd-net/2012-April/031894.html
> 
> The latest in that thread is here, with a proposed patch:
> http://lists.freebsd.org/pipermail/freebsd-net/2012-April/031934.html
> 
> Note that much of the logic changes here are in code that looks
> the same on NetBSD so this patch may be relevant for NetBSD too.

Yes, NetBSD has the problem too and that change probably works.  I must
be getting slow, though, it still took me a while to figure out
how that change fixed it since I couldn't see how the problem occurred
in the first place and couldn't find it in the discussion above.  The
root cause seems to be an underflow of an unsigned variable, tp->snd_wnd.

For whomever patches this, I think the problem goes as follows.  The
last two packets in the trace prior to the bad behavior are these:

09:03:38.416891 IP NetBSD.ssh > FreeBSD.35421: Flags [.], ack 239664, win 0, 
options [nop,nop,TS val 249 ecr 1076479], length 0
09:03:38.437943 IP NetBSD.ssh > FreeBSD.35421: Flags [.], seq 4436931:4438371, 
ack 239672, win 0, options [nop,nop,TS val 249 ecr 1076502], length 1440

That second packet is unusual because (1) it is carrying old, retransmitted 
data,
and (2) it ack's 8 bytes of data outside the window advertised in the previous
packet.  The code in tcp_input() which updates the window from information in a
packet is this, as Darren pointed out:

        /*      
         * Update window information.
         * Don't look at window if no ACK: TAC's send garbage on first SYN.
         */             
        if ((tiflags & TH_ACK) && (SEQ_LT(tp->snd_wl1, th->th_seq) ||
            (tp->snd_wl1 == th->th_seq && (SEQ_LT(tp->snd_wl2, th->th_ack) ||
            (tp->snd_wl2 == th->th_ack && tiwin > tp->snd_wnd))))) {
                /* keep track of pure window updates */
                if (tlen == 0 &&
                    tp->snd_wl2 == th->th_ack && tiwin > tp->snd_wnd)
                        TCP_STATINC(TCP_STAT_RCVWINUPD);
---->           tp->snd_wnd = tiwin;
                tp->snd_wl1 = th->th_seq;
                tp->snd_wl2 = th->th_ack;
                if (tp->snd_wnd > tp->max_sndwnd)
                        tp->max_sndwnd = tp->snd_wnd; 
                needoutput = 1;         
        }

If the condition in the if() is satisfied this copies the send window
from the packet.  The condition in the if() can only true, however,
if the packet is carrying a current sequence number like the first
packet above; if the packet is retransmitting old data, like the
second packet, it ignores the window in the packet and retains the
previous value of the window.

Note, however, that if the packet carrying an old sequence number is also
carrying an ack for previously un-acked data, tp->snd_wnd must be updated
regardless of the sequence number.  In the existing code this is handled
before the if() above, with code which looks like

                acked = th->th_ack - tp->snd_una;
                . . .
                if (acked > so->so_snd.sb_cc) {
                        tp->snd_wnd -= so->so_snd.sb_cc;
                        sbdrop(&so->so_snd, (int)so->so_snd.sb_cc);
                        ourfinisacked = 1;
                } else {
                        if (acked > (tp->t_lastoff - tp->t_inoff))
                                tp->t_lastm = NULL;
                        sbdrop(&so->so_snd, acked);
                        tp->t_lastoff -= acked;
---->                   tp->snd_wnd -= acked;
                        ourfinisacked = 0;
                }


So the algorithm seems to be:

- If the packet ack's some data, adjust the current tp->snd_wnd to account
  for this.  That is, keep the window advertised in a previous packet, but
  update it to account for what's been ack'd since then.

- If it then decides to believe the window in the new packet overwrite 
tp->snd_wnd
  with the new packet's window. If it decides not to believe it, it continues
  to use the window it got previously, appropriately adjusted.

Given this, the first packet above is carrying a window the if() would decide to
believe so after it is processed tp->snd_win will be 0, copied from the packet.

The odd things about the second packet cause the problem as follows.  Odd thing
(2), the fact that the packet ack's 8 bytes outside the window advertised in the
previous packet, causes it to subtract 8 from the zero-valued tp->snd_win.  
Since
tp->snd_win is a u_long it ends up being a really big number.  Odd thing (1), 
the
fact that the packet is carrying old data, makes the if() evaluate false so the
previous error is not overwritten with the window from the packet, as it might
normally be.  tcp_output() gets called with a really big tp->snd_win, which 
prompts
it to emit a whole congestion window of new packets.

The patch above for FreeBSD changes the behavior of the current code by making 
it
update tp->snd_win from any packet which ack's new data.  That makes the broken
adjustment code go away and otherwise seems like a reasonable thing to do to me,
though I don't know for sure since I don't know why the old code didn't do this.
An alternative patch, which would fix the bug but retain the behavior of the old
code, would be to just avoid letting tp->snd_wnd get decremented below zero.

Dennis Ferguson


Home | Main Index | Thread Index | Old Index