tech-net archive


TCP PMTU/SACK/timer problem



I think I'm getting a clue what might be happening here.
Could someone in the know please save my weekend and comment on this?

If tcp_timer_rexmt() gets called during PMTU discovery, TCPT_REXMT will have
fired, but will not be disarmed. This means that tcp_output() will never
arm the retransmit timer again because it thinks it's armed already.

Now, if we have a SACK hole, retransmit the hole, and lose the segment again
(or the ACK), tcp_output() will exit at just_return without sending anything.
There's even a comment above that this is possible with SACK, so there's code
to re-arm the retransmit timer without having sent a segment. Only that will
not happen, because the timer is already considered armed (i.e.
TCP_TIMER_ISARMED(tp, TCPT_REXMT) will be true), even though it will never fire.
So we hang.
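
To make the suspected sequence concrete, here is a toy model in C. It is
not the kernel code: struct toy_timer and every toy_* function are
hypothetical stand-ins, and the early_return flag stands for my assumption
that the PMTU path leaves tcp_timer_rexmt() without disarming or re-arming
the timer. It only shows how an "armed" flag can go stale.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for one TCP timer slot; not the kernel's callout structure. */
struct toy_timer {
    bool armed;    /* what an ISARMED-style check would report */
    bool pending;  /* whether an expiry is actually scheduled */
};

/* Arm the timer: mark it armed and schedule an expiry. */
void toy_arm(struct toy_timer *t)
{
    t->armed = true;
    t->pending = true;
}

/* Disarm the timer: clear the flag and cancel any scheduled expiry. */
void toy_disarm(struct toy_timer *t)
{
    t->armed = false;
    t->pending = false;
}

/*
 * Expiry handler. The scheduled expiry is consumed on entry.
 * early_return models the ASSUMED PMTU path: the handler leaves
 * without disarming or re-arming, so "armed" stays set although
 * nothing is scheduled any more.
 */
void toy_rexmt_fired(struct toy_timer *t, bool early_return)
{
    assert(t->pending);      /* can only fire if an expiry was scheduled */
    t->pending = false;
    if (early_return)
        return;              /* suspected path: stale armed flag */
    toy_disarm(t);           /* normal path: retransmit ... */
    toy_arm(t);              /* ... and restart the timer */
}

/*
 * Output path that sends nothing (the just_return case). It starts
 * the timer only if the timer is not already considered armed.
 */
void toy_output_nothing_sent(struct toy_timer *t)
{
    if (!t->armed)
        toy_arm(t);
}

/* Stuck means: reported as armed, but no expiry will ever happen. */
bool toy_is_stuck(const struct toy_timer *t)
{
    return t->armed && !t->pending;
}

/* Run the whole sequence once and report whether the timer ends up stuck. */
bool toy_scenario(bool early_return)
{
    struct toy_timer t = { false, false };

    toy_arm(&t);                        /* data in flight, timer running */
    toy_rexmt_fired(&t, early_return);  /* timer expires */
    toy_output_nothing_sent(&t);        /* SACK hole lost again, nothing sent */
    return toy_is_stuck(&t);
}
```

In this model toy_scenario(true) ends up stuck and toy_scenario(false) does
not, which is the behaviour I suspect in the real code. Whether the real
handler actually leaves the timer in that state is exactly what I'd like
someone to confirm.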

The problem is I can't get the NFS connection to hang right now and although
I did print out the entire PCB with gdb, I didn't notice there was a SACK
hole so I don't have that hole's retransmission state. So I can't be entirely
sure this analysis is correct until I manage to make the connection hang again
and print out the missing information.

The second problem is I can't just go ahead and install a patched kernel on
that file server unless I'm really confident (i.e. someone in the know
concurs) I found the problem, because, well, that's the file server.

I must admit I don't fully understand how the code is supposed to behave if
it's still in SACK recovery (but hasn't timed out on the lost segment) and
new data comes in to be sent.

So, please, could someone comment whether this sounds plausible or whether
I'm missing the point?

