tech-kern: why must exiting processes wait forever for events that never occur?!?!?!

Subject: why must exiting processes wait forever for events that never occur?!?!?!
To: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 10/17/2000 03:15:52

Speaking of exiting processes.....

Events on one of my systems today incite me to raise this issue again.
For some reason, possibly due to random noise on the serial lines, the
serial console on this machine locked up.  I had forgotten to set the
"local" flag in /etc/ttys and as a result of this particular set of
events getty got stuck exiting and waiting for some output to drain.
This caused every subsequent process that tried to write to the console
to similarly get stuck, including syslog and of course anything that
called openlog(... LOG_CONS, ...), such as "su".  In this particular
case no amount of toggling RS-232 control lines woke up the process.

A 'kill -9' of course did nothing useful either since the process was
already exiting....

In fact if I'm not mistaken a process stuck in exit() on a tty cannot
always be killed, even with 'kill -9'!

Grrr...  These damn 4.2BSD "reliable" signals aren't so reliable after
all it seems!!!

Of course sending a break down the line popped up the DDB prompt and all
was happy while in the debugger, so the UART itself was not stuck.

After playing with DDB for a while, trying to "call wakeup(...)" and
having no luck at anything at all, I finally rebooted the damn machine
the hard way.  I couldn't quite figure out an easy way to whack the
signal flags for the process to make it interruptible, and I wasn't sure
that would do any good anyway.

I don't ever remember having this trouble on non-BSD machines unless I
encountered some kind of driver or hardware bug.  An exiting process
never got stuck forever waiting for output to drain unless there was a
bug in the driver, and it was certainly never un-killable!  No bugs, no
problem, no matter what state the hardware got into.  Every time I ever
encountered a situation caused by a driver that forced me to reboot the
machine I filed a most urgent bug report with the vendor!  Now of course
part of this had to do with the default of system calls being
interruptible in traditional Unix, but even then I don't remember
exiting processes getting stuck on properly functioning systems.

So why is it that an exiting process must wait forever for output to
drain in the first place?  Yes a normal close(2) should wait (though it
should of course be interruptible if desired, and any process attempting
to reliably close() a tty should strongly consider calling
siginterrupt() first!), but a close forced by exit() should *always* be
interruptible and it should eventually time out on its own, and probably
in a very few seconds too -- just long enough for all possible output to
actually drain at the current baud rate, after turning off and then
ignoring all flow control I think.  I'd rather lose a bit of output on a
forced close() than have a process hang forever, even if the hardware is
screwed up beyond the ability of open() to reset it.  At least let me
kill that process so that I might have a chance of trying to re-open a
device that had been held for exclusive access by the exiting process!
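
To put a number on "just long enough for all possible output to actually
drain at the current baud rate", here is a minimal sketch of the
arithmetic.  The function name drain_timeout_ms() and its parameters are
hypothetical, not an existing kernel or libc interface, and it assumes
the caller knows how many bytes are queued and how many bits each
character occupies on the wire (10 for 8N1).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not an existing kernel or libc function.
 * Upper bound, in milliseconds, on the time needed to transmit
 * queued_bytes at baud bits per second, assuming bits_per_char bits
 * on the wire per byte (10 for 8N1: start + 8 data + stop).
 */
uint64_t
drain_timeout_ms(uint64_t queued_bytes, unsigned baud, unsigned bits_per_char)
{
	uint64_t bits;

	assert(baud > 0);
	bits = queued_bytes * bits_per_char;
	/* Round up so the bound is never shorter than the real time. */
	return (bits * 1000 + baud - 1) / baud;
}
```

With these assumptions a full 1024-byte output queue at 9600 baud needs
about 1067 ms, which fits the "very few seconds" estimate.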

So, how do we best fix this in NetBSD?  I think a timeout should be set
around the forced close() in exit(), though I'm not sure how best to do
this.  One would effectively have to allow the close() call to be
interrupted and then maybe use a callout function to send a signal after
the timeout?
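
The same shape can be sketched in userland, which is only an analogy for
what exit() would have to do inside the kernel: arm a timer, make the
blocking call interruptible, and give up when the timer fires.  All of
the names below (with_deadline(), drain_or_discard_then_close(),
block_on_read()) are hypothetical; only the POSIX calls they use are
real.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

/* Empty handler: its only job is to make the blocked call return EINTR. */
static void
on_alarm(int sig)
{
	(void)sig;
}

/*
 * Hypothetical helper: run op(fd), but give up after secs seconds.
 * Returns op's result; if the alarm fires while op is blocked in an
 * interruptible system call, that call fails with EINTR.
 */
int
with_deadline(unsigned secs, int (*op)(int), int fd)
{
	struct sigaction sa, old;
	int ret, saved_errno;

	assert(op != NULL);
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;	/* no SA_RESTART, like siginterrupt(SIGALRM, 1) */
	if (sigaction(SIGALRM, &sa, &old) == -1)
		return -1;

	alarm(secs);
	ret = op(fd);
	saved_errno = errno;
	alarm(0);			/* cancel the alarm if still pending */
	sigaction(SIGALRM, &old, NULL);	/* restore the previous disposition */
	errno = saved_errno;
	return ret;
}

/* Wait for queued tty output; on timeout drop it, then close. */
int
drain_or_discard_then_close(int fd)
{
	if (tcdrain(fd) == -1 && errno == EINTR)
		(void)tcflush(fd, TCOFLUSH);	/* timed out: lose the output */
	return close(fd);
}

/* Stand-in for any blocking call, handy for trying this without a tty. */
int
block_on_read(int fd)
{
	char c;

	return (int)read(fd, &c, 1);
}
```

Calling with_deadline(5, drain_or_discard_then_close, fd) on a wedged
tty would, under these assumptions, lose whatever output is still
queued after five seconds rather than hang.  Whether the close itself
can still block afterwards depends on the driver.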

I also think kill(SIGKILL) should always wake up a process and force it
to die right away, no matter what it might be waiting on, no matter
whether it is permitting system calls to be interrupted or not.  Even if
it's already exiting it must be re-awoken and immediately terminated
with prejudice!  SIGKILL must always override ~SA_RESTART!

Indeed any signal should probably forcibly wake up any process that's
exiting, though of course signal handlers should NOT be called -- just
the default action should be honoured.  I think I could probably
implement this part if I had a bit of time to sort through how to do it and
find the right hooks.  I suspect that doing the equivalent of calling
siginterrupt() [for all signals] before calling close() [inside exit()
inside the kernel, of course] would do the trick.  There may be
something needed to hammer the hardware driver "closed" in this
scenario, but maybe not if it does things in the "right" way (whatever
that may be!).

Note that if a process hangs on a close() of a socket then the normal
connection timeouts usually seem to prevent it from hanging forever, and
indeed signals to such processes seem to be honoured.  It's long past
time that *BSD did the same for ttys and other devices that can
block waiting for I/O completion!

Spoonerism for the day:  "BSD has drain bamage"!  :-)

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>