tech-net archive

Re: 5.2: something wrong with TCP retransmits?

>> [...strange TCP lockup...]
> I'm suspecting I'll have to reproduce the original setup to make it
> misbehave again.

Fortunately (or perhaps not :/), this turned out to be unnecessary.  It
locked up on me just recently, and proved to be cooperative enough to
do it again with tcpdumps running everywhere.

The setup is mostly straightforward, though there is one subtlety.

There are three machines involved: the two endpoints and one gateway in
between them.  They are (exact) and (chip), which
are the endpoints, and stone, in between them.  The subtlety is that
between stone and chip, there are, in IP terms, two links, one of them
being a tunnel layered atop the other; packet routing between stone and
chip is asymmetric, with packets taking the underlying DSL link in one
direction and the tunnel in the other.  (This is a historical artifact;
while it may be part of what provokes the problem, I believe the
problem it's provoking is real.)

I snooped on exact (bge0) and chip (fxp1 and tun0).  If you can tell
anything from packet payloads, ssh isn't doing its job properly, so I
don't mind putting the raw pcap files up for FTP; alongside them are
text files generated from them with tcpdump -S -tt -n -s 2000.

This is all on, in /mouse/misc/tcp-lockup/,
as {pcap,txt}.{bge0,fxp1,tun0}.  The FTP server should be willing to
bzip2 or gzip files on the fly if you ask for (e.g.) txt.bge0.bz2.
(There's not much point in trying to compress the pcap files, since the
bulk data in them is encrypted.  But the text files compress by better
than ten to one with bzip2.)

The asymmetric routing is a bit odd, but unless it leads to complete
communication failure, TCP should be able to recover, even if not at
full possible speed.  But, instead, I'm seeing data trucking along at
tens of KB a second, then suddenly locking up completely, apparently
because exact decides to stop retransmitting.  The captures end at
05:05:54 UTC; I waited until at least 05:17 before killing them and
assembling the results.  So, regardless of what goes on with stone and
chip, it seems to me exact should be doing _some_ retransmitting, and
as far as I can tell it isn't.  At all.  (When the failure happens.)
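
For anyone who wants to check the "no retransmits" claim against the
text dumps, something along these lines might help.  It is only a
sketch, not what was actually run: the names stall_summary and
capture_end are made up for illustration, the two line shapes it parses
are assumptions (tcpdump's text output differs between versions), it
expects lines from a single connection, and it ignores sequence-number
wraparound.

```python
import re

# Assumed line shapes (hypothetical examples, not taken from the captures):
#   older: "1.5 IP a.22 > b.5000: P 100:200(100) ack 1 win 8"
#   newer: "1.5 IP a.22 > b.5000: Flags [P.], seq 100:200, ack 1, win 8, length 100"
HEAD = re.compile(r"^(\d+\.\d+) (?:IP )?(\S+) > (\S+): (.*)$")
OPTS = re.compile(r"<[^>]*>|options \[[^\]]*\]")  # TCP options, e.g. SACK blocks
SEQ = re.compile(r"(?:seq )?(\d+):(\d+)")
ACK = re.compile(r"\back (\d+)")


def stall_summary(lines, sender, capture_end=None):
    """Summarise unacknowledged data sent by `sender` ("addr.port").

    Expects absolute sequence numbers (tcpdump -S) and epoch timestamps
    (tcpdump -tt).  `capture_end` is the time the capture was stopped;
    it defaults to the timestamp of the last parsed packet.
    """
    highest_sent = highest_acked = None
    last_data_time = last_time = None
    for line in lines:
        m = HEAD.match(line.strip())
        if not m:
            continue
        when, src, dst = float(m.group(1)), m.group(2), m.group(3)
        rest = OPTS.sub("", m.group(4))  # keep SACK ranges from looking like data
        last_time = when
        if src == sender:
            s = SEQ.search(rest)
            if s and int(s.group(2)) > int(s.group(1)):  # segment carries data
                end = int(s.group(2))
                if highest_sent is None or end > highest_sent:
                    highest_sent = end
                last_data_time = when
        elif dst == sender:
            a = ACK.search(rest)
            if a:
                ack = int(a.group(1))
                if highest_acked is None or ack > highest_acked:
                    highest_acked = ack
    if capture_end is None:
        capture_end = last_time
    unacked = None
    if highest_sent is not None and highest_acked is not None:
        unacked = max(0, highest_sent - highest_acked)
    silent_for = None
    if last_data_time is not None and capture_end is not None:
        silent_for = capture_end - last_data_time
    return {"unacked": unacked, "silent_for": silent_for,
            "highest_sent": highest_sent, "highest_acked": highest_acked}
```

A nonzero "unacked" together with a "silent_for" of many minutes would
be consistent with the sender having stopped retransmitting; it would
not by itself say why.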

One other thing I just noticed: this case occurred with a 5.1 kernel,
not the 5.2 I initially had trouble with.

I took your suggestion and checked netstat -s.  Specifically, I ran

sleep 10; netstat -s > z.1; sleep 60; netstat -s > z.2

on exact and chip, carefully avoiding anything that should provoke
traffic (the initial sleep is to allow me to switch the active window
away from either machine).  Chip had a lot of changes, which is not
surprising because it's a relatively busy machine, being and also in the path between my house network
(including my NTP pool member) and the world.  But exact should have
been doing nothing, and, indeed, none of the stats increased by more
than two, those probably because of broadcast NTP on that subnet.  The
stats that changed:

        total packets received +2
        packets for this host +2
        packets sent from this host +1
        datagrams received +2
        delivered +2
        PCB has misses +1
        datagrams output +1
        packets received +1
                valid request packets +1
                broadcast/multicast packets +1

In particular, none of the tcp numbers changed at all.
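
The comparison of the two snapshots can be done mechanically.  The
following is a rough sketch rather than the method actually used here:
parse_netstat_s and deltas are illustrative names, and it assumes each
counter line starts with whitespace and a number, with unindented lines
(such as "tcp:") naming the protocol section.  Since netstat pluralises
some descriptions depending on the count, a counter whose wording
changes between snapshots will be missed.

```python
import re
import sys

# A counter line: leading whitespace, an integer, then the description.
COUNTER = re.compile(r"^\s+(\d+)\s+(.*\S)\s*$")


def parse_netstat_s(text):
    """Map (section, description) to the leading counter on that line."""
    counters = {}
    section = ""
    for line in text.splitlines():
        if line and not line[0].isspace():
            # Unindented line: assumed to be a section header like "tcp:".
            section = line.strip().rstrip(":")
            continue
        m = COUNTER.match(line)
        if m:
            counters[(section, m.group(2))] = int(m.group(1))
    return counters


def deltas(before_text, after_text):
    """Return only the counters whose value differs between snapshots."""
    before = parse_netstat_s(before_text)
    after = parse_netstat_s(after_text)
    return {key: after[key] - before[key]
            for key in after
            if key in before and after[key] != before[key]}


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python3 thisfile.py z.1 z.2
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        changed = deltas(f1.read(), f2.read())
    for (section, desc), diff in sorted(changed.items()):
        print("%s: %s %+d" % (section, desc, diff))
```

If no line under the tcp section shows up in the output, that matches
the observation that none of the tcp numbers changed.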

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML      
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
