Subject: Re: pppoe reconnection
To: Simon Burge <simonb@wasabisystems.com>
From: Martin Husemann <martin@duskware.de>
List: tech-net
Date: 08/19/2003 19:54:36
On Tue, Aug 19, 2003 at 09:52:59AM +1000, Simon Burge wrote:
> This morning, the link still wasn't up so I tried a manual ifconfig
> down/up again and it sprang into life.

I have the feeling I've seen this too, but I've also seen multiple days
of outage and the link coming up as soon as the modem resync'd.

So, in theory the code tries to initiate a PPPoE session every minute,
no matter how long it has been failing. The single exception: if
authentication has failed more often than the configured maximum. I've seen
this bite me - apparently the ISP rebooted the RADIUS server and the DSLAM,
but the DSLAM came up faster. So the PPPoE session got established, but
authentication failed. The router retried five times and then gave up. Running

 pppoectl -d pppoe0

will show this (and I think it logs the fact that it's giving up to syslog).
If your log does not contain notes about this, there must be something else
wrong. I can only think of two problems:

 - the 1 minute callout does not get serviced (maybe due to the callwheel
   corruption Havard noticed - but this is just a shot in the dark and
   pretty unlikely)
 - the session-establishing packet does not make it to the wire - for some
   unknown reason that is, however, apparently cured by ifconfig down/up

Unfortunately this is hard to debug, since you have to get into this strange
situation first and then only have one attempt. All my attempts to reproduce
it at will have failed (i.e. in my local test setup, when I disconnect the
network cable of the PPPoE server and plug it back in after hours, the session
comes up immediately, always).

Maybe, as a first step to debug this, we should check the return value of
pppoe_output in pppoe_send_padi and log some diagnostic if sending the
packet is not successful.
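
Roughly what I have in mind, as a userland model rather than the real driver
code. Everything named sketch_* is a hypothetical stand-in (sketch_output for
pppoe_output, sketch_send_padi for pppoe_send_padi), ENOBUFS is only an
assumed example of an output error, and fprintf stands in for whatever
logging the kernel code would use:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the driver's softc; not the real struct. */
struct sketch_softc {
	const char *ifname;	/* interface name, e.g. "pppoe0" */
	int queue_full;		/* non-zero simulates a full output queue */
	int send_errors;	/* count of failed PADI transmissions */
};

/* Stand-in for pppoe_output(): 0 on success, an errno value otherwise. */
static int
sketch_output(struct sketch_softc *sc)
{
	return sc->queue_full ? ENOBUFS : 0;
}

/* Stand-in for pppoe_send_padi() with the proposed diagnostic added. */
static int
sketch_send_padi(struct sketch_softc *sc)
{
	int error;

	assert(sc != NULL);
	error = sketch_output(sc);
	if (error != 0) {
		/* Count and report the failure instead of dropping it silently. */
		sc->send_errors++;
		fprintf(stderr, "%s: failed to send PADI, error=%d (%s)\n",
		    sc->ifname, error, strerror(error));
	}
	return error;
}
```

The error is still returned unchanged, so the caller's behaviour stays the
same; we would just get a log line to look at the next time this happens.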

And I guess here we have our bug: inside the callout handler (pppoe_timeout)
there is this code:

                if (pppoe_send_padi(sc) == 0)
                        callout_reset(&sc->sc_timeout, retry_wait,
                            pppoe_timeout, sc);
                else
                        pppoe_abort_connect(sc);

Now imagine the connection is down, the outgoing interface has its queue
filled and we try the first reconnect - pppoe_send_padi fails (queue full, or
some other output error), and we abort the connection, without rescheduling
another timeout. Am I missing something?
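
To make the suspected failure mode concrete, here is a small userland model,
not the driver itself. All model_* names are hypothetical, the timer_armed
flag merely models callout_reset() having been called, and the model assumes
that pppoe_abort_connect really leaves no timer armed. The second handler
shows one possible change (always re-arm the timer), not a tested fix:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the driver state; not the real pppoe_softc. */
struct model_softc {
	int fail_sends;		/* number of upcoming sends that will fail */
	bool timer_armed;	/* models callout_reset() having been called */
	bool aborted;		/* models pppoe_abort_connect() having run */
	int padi_sent;		/* PADI packets that made it out */
};

/* Models pppoe_send_padi(): fails while fail_sends is positive. */
static int
model_send_padi(struct model_softc *sc)
{
	assert(sc != NULL);
	if (sc->fail_sends > 0) {
		sc->fail_sends--;
		return ENOBUFS;		/* assumed error: output queue full */
	}
	sc->padi_sent++;
	return 0;
}

/* Behaviour as quoted: a failed send aborts and leaves no timer armed. */
static void
model_timeout_current(struct model_softc *sc)
{
	sc->timer_armed = false;	/* the callout has just fired */
	if (model_send_padi(sc) == 0)
		sc->timer_armed = true;
	else
		sc->aborted = true;
}

/* Possible change: re-arm the timer regardless of the send result. */
static void
model_timeout_retry(struct model_softc *sc)
{
	sc->timer_armed = false;	/* the callout has just fired */
	(void)model_send_padi(sc);
	sc->timer_armed = true;		/* try again on the next tick */
}
```

With a single failing send, model_timeout_current ends up aborted with no
timer pending, which would match a link that stays down until someone does
ifconfig down/up. model_timeout_retry simply sends the PADI on the next tick.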

Of course this never happens when testing ;-)

Martin