tech-net: Recent nbftp panic analysis

Subject: Recent nbftp panic analysis
To: None <tech-net@netbsd.org>
From: Charles M. Hannum <abuse@spamalicious.com>
List: tech-net
Date: 02/14/2005 09:01:33

The system died with "panic: tcp_output REXMT".  I believe this is caused by a 
race condition between the callout handler and tcp_input().  Specifically, if 
we've just started to run tcp_persist_timer(), the flags for the persist 
timer are CALLOUT_FIRED|CALLOUT_INVOKING; therefore, if we also happen to get 
interrupted and receive a packet for the same connection, these tests:

                                else if (TCP_TIMER_ISARMED(tp,
                                    TCPT_PERSIST) == 0)
                                        TCP_TIMER_ARM(tp, TCPT_REXMT,
                                            tp->t_rxtcur);

                } else if (TCP_TIMER_ISARMED(tp, TCPT_PERSIST) == 0)
                        TCP_TIMER_ARM(tp, TCPT_REXMT, tp->t_rxtcur);

will fail to notice that we're using the persist timer (because we have a 0 
window) and will arm the rexmt timer.  When we return to tcp_persist_timer(), 
it will try to *also* reset the persist timer, and we panic.

The simplest thing to do here is to change TCP_TIMER_ISARMED; to wit:

 #define        TCP_TIMER_ISARMED(tp, timer)\
-       callout_pending(&(tp)->t_timer[(timer)])
+       (callout_pending(&(tp)->t_timer[(timer)]) ||                    \
+        callout_invoking(&(tp)->t_timer[(timer)]))

However, I've thought for a long time that it might make more sense to have a 
separate state variable/bit to keep track of which of the two timers we're 
using at a given time (really, whether we have a 0 window).  The logic would 
be much less crufty that way.