Subject: Re: telnet loop while trying to flush revoked tty FD
To: john heasley <heas@shrubbery.net>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-net
Date: 06/02/2003 14:32:08

On Sat, 31 May 2003, john heasley wrote:

> I ran this patch by Matthias and Martin.  They did not notice anything
> wrong with the patch described below, but were concerned that it might
> be masking a bug that is the root cause and since I am not certain how
> telnet reached the described state, they thought it best to solicit
> further comment here.

Well, how about we figure out more what's going on then? :-)

> so, here goes....
>
> I am not positive how telnet gets into this situation; undoubtedly it
> is expect or me doing something dumb with expect.  However, it appears
> that while telnet is trying to exit, it can get stuck in its final
> attempt to flush its tty when the tty is already dead.
>
> In the instance that I caught, lsof indicated that the file descriptor
> had been revoked, like so:
>
> telnet  22519 heas    0u  VBAD                            (revoked)
> telnet  22519 heas    1u  VBAD                            (revoked)
> telnet  22519 heas    2u  VBAD                            (revoked)
>
> causing it to loop in SetForExit->EmptyTerminal.
>
> Besides updating this comment that I stumbled upon, I changed
> EmptyTerminal to return when ttyflush returns "permanent failure"

What exactly is EmptyTerminal trying to do? What is the loop you're
escaping out of waiting for?

> (PR/18984) and ttyflush to mark the ring buffer as empty so that
> anything which tested for a non-empty ring would think it was empty,
> and not try again.

Sounds ok.
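
To make sure we are talking about the same shape of fix, here is a minimal
sketch of it. This is not the telnet code: the ring structure, the sketch_*
names, and the choice to treat every errno other than EAGAIN/EINTR as
permanent are all assumptions for illustration. Only the -2 "permanent
failure" convention comes from the description above.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical output ring; telnet's real ring buffer is different. */
struct sketch_ring {
    char   buf[256];
    size_t head;    /* next byte to send */
    size_t tail;    /* one past the last queued byte */
};

static int sketch_ring_empty(const struct sketch_ring *r)
{
    return r->head == r->tail;
}

/*
 * Try to push queued bytes to fd.
 * Returns bytes written (> 0), 0 if nothing was queued,
 * -1 on a transient error, -2 on a permanent one.
 */
static int sketch_flush(struct sketch_ring *r, int fd)
{
    ssize_t n;

    if (sketch_ring_empty(r))
        return 0;
    n = write(fd, r->buf + r->head, r->tail - r->head);
    if (n < 0) {
        if (errno == EAGAIN || errno == EINTR)
            return -1;              /* worth retrying */
        /* Assumed permanent: mark the ring empty so nobody retries. */
        r->head = r->tail = 0;
        return -2;
    }
    r->head += (size_t)n;
    if (r->head == r->tail)
        r->head = r->tail = 0;
    return (int)n;
}

/* Drain loop that gives up on permanent failure instead of spinning. */
static int sketch_drain(struct sketch_ring *r, int fd)
{
    while (!sketch_ring_empty(r)) {
        if (sketch_flush(r, fd) == -2)
            return -2;
    }
    return 0;
}
```

Without the -2 check and the ring reset, a dead descriptor leaves the ring
non-empty forever and the drain loop never terminates. The sketch simply
retries on transient errors; a real loop would wait for writability first.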

> This seems to work fine.  Since I do not know how it got into this
> state, I can't reproduce it.  I wrote a quick program to revoke()
> the ttys of a stuck telnet (stuck for other legitimate reasons),
> but I think there was more to this scenario than I know because
> telnet never made a call to EmptyTerminal() before exiting.

Ouch. Reproducibility would be nice.

Two things I can think of: 1) try printing (to stderr or a log) the
error returned in your -2 case, so you know why you're erroring.

2) try revoking a non-stuck tty. The problem (as I cursorily understand it)
is that there are data in telnet waiting to write to the kernel when the
revoke happens. If the telnet is stalled, chances are that telnet has been
able to forward its buffers to the kernel before the revoke. Another
option would be to have "sent" enough data that the kernel's flow control
on the pty kicks in & not everything telnet wants to send gets into the
pty's (in-kernel) buffer.
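
For suggestion 1), something along these lines would do. It is only a
sketch: write_logged() is a made-up helper, not anything in the telnet
sources, and where the message should go (stderr, syslog, a file) is your
call.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical wrapper around write(2): on failure, report errno to the
 * given stream before returning, so the "permanent failure" case says
 * why it failed. errno is preserved for the caller.
 */
static ssize_t write_logged(int fd, const void *buf, size_t len, FILE *log)
{
    ssize_t n = write(fd, buf, len);

    if (n < 0) {
        int saved = errno;  /* fprintf may clobber errno */

        fprintf(log, "write(fd=%d) failed: %s (errno %d)\n",
            fd, strerror(saved), saved);
        errno = saved;
    }
    return n;
}
```

Calling it on a descriptor that is not open, e.g.
write_logged(-1, "x", 1, stderr), returns -1 with errno set to EBADF and
logs the reason. Which errno a revoked tty actually hands back is exactly
what this would tell us.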

If I understand the problem right, one solution might be to add a special
signal handler. Pick an otherwise unused signal, like SIGUSR2 maybe, and
add a handler. Add a global variable, declared volatile. Add code to the
write routine (whatever calls write(2)) so that if this new variable is
non-zero, it busy-waits. A la "while (g_new_variable) ;". Then make your
signal handler flip the state of this variable. On a 0 -> 1 transition, in
addition to setting the variable, trigger the write of some data (tricky
from a signal handler). On a 1 -> 0 transition, just update
g_new_variable.

So you then fire up telnet. Send it a SIGUSR2 (or whatever you choose).
You now hopefully have telnet busy-waiting to send data. Then trigger a
revoke on the pty. Then SIGUSR2 again, and you should have your error
case, if I understand it right.
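
Roughly, and untested against telnet itself, the hook could look like the
sketch below. All the names (g_hold_writes, toggle_hold, gated_write) are
invented here, and it leaves out the "trigger a write on the 0 -> 1
transition" part, since doing that safely from a handler is the tricky bit.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical gate: non-zero means "hold all writes". */
static volatile sig_atomic_t g_hold_writes = 0;

/* SIGUSR2 handler: flip the gate. It only touches a sig_atomic_t. */
static void toggle_hold(int signo)
{
    (void)signo;
    g_hold_writes = !g_hold_writes;
}

/* Install the handler; returns 0 on success, -1 on error. */
static int install_hold_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = toggle_hold;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Stand-in for whatever calls write(2): spin while the gate is set. */
static ssize_t gated_write(int fd, const void *buf, size_t len)
{
    while (g_hold_writes)
        ;   /* busy-wait until the next SIGUSR2 clears the gate */
    return write(fd, buf, len);
}
```

The test sequence is then: "kill -USR2 <pid>" to close the gate, get telnet
to queue some output, revoke the pty, "kill -USR2 <pid>" again to release
the write onto the dead descriptor.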

Take care,

Bill