Subject: Re: why must exiting processes wait forever for events that never occur?!?!?!
To: None <tech-kern@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 10/17/2000 23:39:53
[ On October 17, 2000 at 15:51:45 (-0000), Charles M. Hannum wrote: ]
> Subject: Re: why must exiting processes wait forever for events that never occur?!?!?!
>
> Actually, what makes Mike's solution sillier is that there's already
> another way to do it.  `stty -f /dev/tty... -crtscts -cdtrcts -mdmbuf'
> should unstick the tty and cause any output to drain (unless the line
> has been disconnected, in which case output is flushed anyway).  I
> *know* I fixed that in com and zs years ago; if there are drivers in
> which it does not work, they are by definition broken.  [Note: I
> haven't checked whether software flow control unblocks correctly.]

All those flags were off already.  I'm not sure what state software flow
control was in, though.  I suppose it's possible that when I disconnected
my telnet session to the terminal server it somehow sent a ^S down the
wire, though I was very careful to make sure that the terminal server
never tried to do any flow control either.  I physically disconnected
the console port several times too.  However, since DDB functioned fine,
and since I sent many ^Q's down the wire afterwards to no effect, I
doubt it was a software flow control problem.

The other "problem" in this particular case was of course that it was
the console itself that was hung, thus preventing me from logging in as
root there, or from using "su" from anywhere else.

Which reminds me -- I've been thinking since then of going through my
source tree and nuking all uses of LOG_CONS (or simply removing its
support from openlog() and thus avoiding the problem silently).  I've had
too many similar experiences where someone pauses output on the console
and the whole system comes to an excruciating halt.  This even happened
on a big and important Solaris 7 server at a client site last week.
With modern systems where the console has effectively become just a
device where boot progress and panic messages are displayed, and perhaps
where secure access as root can be gained, I don't think there's really
much need to have syslog(3) write directly to the console.  It's bad
enough that the kernel can tie itself in a knot trying to write silly
messages to a paused console, never mind having every important daemon
do the same!

> Of course, accepting a signal wouldn't actually help in this case,
> because all the signal would try to do is make the process exit again.
> There's also no way for the signal to even take effect unless the
> process tries to go back to user mode, which can't happen during exit.

This is what needs fixing (besides adding timeouts in exit() which I
think everyone agrees are necessary too, and which could possibly be
implemented with the same repaired mechanism).

What needs to happen, I think, is that system calls must be made
interruptible once the process has called exit(), and of course all
signal handlers must be returned to their default state.  That way,
when an additional signal is sent to an exiting process, it should
immediately exit, so long as an interrupted close() in exit() now
cleans up instead of just returning -1 with EINTR.  Perhaps that is the
trick: if inside exit() a close() fails with EINTR, then an internal
force_close() could be called to throw away the data that was blocking
the close and clean up, etc.

> In the Hurd, for example, all this cleanup is handled in their
> mega-libc, and the equivalent of SIGKILL can terminate the
> process/task at any time no matter what I/O is outstanding.

That seems like a nasty violation of layering, though clearly it does
solve the problem....

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>