tech-kern: Re: why must exiting processes wait forever for events that never occur?!?!?!

Subject: Re: why must exiting processes wait forever for events that never occur?!?!?!
To: NetBSD Kernel Technical Discussion List <tech-kern@netbsd.org>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 10/17/2000 16:02:54

[ On Wednesday, October 18, 2000 at 01:28:18 (+1100), Robert Elz wrote: ]
> Subject: Re: why must exiting processes wait forever for events that never occur?!?!?! 
>
> For what it is worth, this behaviour certainly wasn't new to BSD, it
> was VERY common to have things like this happen (much more common than
> in current systems) in 6th edition unix (and others of that vintage).
> Rebooting to clear devices hung in close was almost a daily occurrence.

Yes, older versions of Unix did seem to have this problem -- the fixes
for it were among the features of SysV that made it useful in
environments which required higher degrees of reliability.

> And yes, there should be timeouts in close routines, this just gets a
> little difficult, as the close routine tends to simply use generic output
> routines, which are also used from code that should never time out.

Yup, that's why I was thinking that the exit() code in the kernel could
turn off SA_RESTART.  Then maybe it's as simple as scheduling a SIGALRM
and letting the normal signal handling terminate any hung close() call
(cleaning up properly as noted below), allowing the exit() to continue
to completion.

> The world as we know it would end if it were possible to interrupt a
> process that is already exiting though (which Greg suggested) so I don't
> think that's ever likely to happen...

Note that when I've said wakeup() I mean the kernel wakeup(), which will
in effect just interrupt the blocked system call and allow delivery of
the pending signal -- it won't put the process back into user code or
anything silly like that since the default signal handlers will be used,
not any that the process may have installed prior to exiting.

I'm pretty sure some Unix systems already do this and clearly the world
hasn't quite ended yet.

The more I think about it the more I'm sure this part of my proposal is
simply regurgitated from some paper, release notes, or a talk I once
heard.  In order for "kill -9" to truly kill any process it must
interrupt the hung write() or close(), etc.  Once it does that of course
the process will exit because there's a SIGKILL pending, and it doesn't
really matter if it was already exiting or not.  If it wasn't exiting
then maybe a second kill will be needed to wake it up again and
interrupt another hung close().

The only trick here is to ensure that any interrupted close() during
exit() does actually clean up properly (e.g. throwing away pending data
if necessary and releasing the file descriptor) so that the file is
actually closed and not left open as it would be if a normal pre-exit
close(2) were interrupted.  As you say, this does mean having the lower
level close routines take note of the "exiting" flag and "do the right
thing".

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>