sh(1) wait builtin command and stopped jobs

To: tech-userlevel%netbsd.org@localhost
Subject: sh(1) wait builtin command and stopped jobs
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Thu, 03 Mar 2022 03:31:44 +0700

It turns out that sh(1) has a bug when the wait builtin command is
applied to a job that stops (maybe one I created when I
reconstructed the wait builtin command, or perhaps it was
there before, and I just rearranged it; it isn't worth the
effort to trace the history).

It must be a bug, as the results are inconsistent, depending
upon whether the job stopped before or after the wait command
is issued - if it stops while the wait builtin is running,
wait returns, with a status indicating that the job "exited"
with status of the signal that caused the job (or just process
perhaps) to stop.  If the job was already stopped, the wait
builtin says it doesn't exist.

Neither of those is really correct - POSIX requires the wait
builtin to wait until the process has terminated, and then
complete with the appropriate status (with caveats relating to
signals interrupting the wait while it is waiting, which are
not relevant right now).
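
For reference, the ordinary (non-stopping) case that POSIX describes looks like the sketch below: wait blocks until the named process has terminated and then returns that process's exit status. The exit value 3 and the variable names are arbitrary choices for illustration.

```shell
# Start a background job that exits with a known status.
sh -c 'sleep 1; exit 3' &
pid=$!

# wait blocks until that process has terminated, then returns its status.
wait "$pid"
status=$?
echo "status: $status"
```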

This came to light when Harald van Dijk sent the following
small test case to the Austin Group (POSIX maintainers) mailing
list:

   sleep 10 &
   fg
   <Ctrl-Z>
   wait $!
   kill -l $?

and reported on the results from a bunch of shells (not including
ours or the FreeBSD sh).   There was little consistency.
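
The last line of that test case, kill -l $?, turns a wait status back into a signal name. A minimal sketch of the decoding, assuming the common convention that a status above 128 means "128 + signal number" (describe_status is a hypothetical helper, not part of any shell, and SIGTERM is used here only because it terminates rather than stops the child):

```shell
# describe_status STATUS
# Hypothetical helper: report whether a wait status encodes a signal.
# Assumes the usual 128+N encoding for "terminated by signal N".
describe_status() {
    if [ "$1" -gt 128 ]; then
        # kill -l with an exit status prints the signal name, without "SIG".
        printf 'signal %s\n' "$(kill -l "$1")"
    else
        printf 'exit %s\n' "$1"
    fi
}

# A child that kills itself with SIGTERM.
sh -c 'kill -TERM $$' &
wait $!
describe_status $?
```

In Harald's test case the interesting question is what $? holds after wait is applied to a stopped job, which is exactly where the shells disagree.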

A simple reading of POSIX would require the wait command there to
wait forever (or until some external agent sent SIGCONT to the
sleep process).   None of the shells he tested did that.   The
FreeBSD shell does, though if a SIGCONT is sent it ends up reporting
the status from the completed job twice - once via wait, and then
again as a background job completing.   That is wrong - whichever of
those happens first should remove the job from the jobs table, making
it unavailable for the other ... here that should be the wait.

The "best" of the shells (for this anyway) recognise that the wait will
hang forever (usually), and effectively turn it into a "fg" command,
resuming the stopped job that is to be awaited, waiting, and then
exiting (only ksh93 got the correct status from the sleep however).

Since I clearly need to fix the inconsistency in the NetBSD shell
(which of itself is not hard - it just forgot the possibility that
jobs might be stopped when looking to see if the process is ready)
I thought it might be a good idea to fix all of this properly.

Note that none of this makes much practical difference - regular people
just don't issue commands like Harald did - if you have just stopped
a job with a ^Z you don't usually immediately issue a wait command for it!
And in a script, usually the script and its background processes are
all in the same process group, and most often, jobs stop due to signals
sent to the process group (^Z (SIGTSTP), or SIGTTOU or SIGTTIN), which
result in the shell, and whatever process(es) are running all stopping,
and usually, resuming, together.   (This is not guaranteed: a script can
turn on -m, which runs background processes in their own process groups,
and the shell's children can be stopped by one of those signals (or SIGSTOP)
being sent to them via the kill system call, but in practice neither of those
normally happens.)

The question for now is what our behaviour ought to be in these odd cases.

My inclination is to make wait behave as POSIX specifies, and only return
(normally) when the process named (or all children, if there are no args,
or any of them with our -n option) has exited.   Then add an option to
wait (probably -s) to indicate that wait should complete if the (or a, or
all, depending upon its usage) process enters stopped state (and return
as status the standard shell wait encoded status for "exited with signal N",
except it would be interpreted as "stopped by signal N").

I would rather go that way than have the default wait complete
when a (selected) job stops, with a possible option to avoid that, as I
have seen almost no scripts which use wait and are capable of dealing
with stopping children.  Not to say that none exist, just that they're by
far the more unusual case.

In addition, I'm inclined to copy ksh93 (and zsh) and have wait resume a
stopped job whose completion it is to await (though that could be enabled
by an option, or have an option to inhibit it) -- a wait that is waiting
for a job to stop would obviously not do that.
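
The "resume, then wait" behaviour can be approximated from a script today. This is only a sketch of the idea under stated assumptions, not how ksh93, zsh, or any future NetBSD sh implements it: resume_and_wait is a hypothetical helper, and the job is stopped with SIGSTOP from the script itself rather than with ^Z.

```shell
# resume_and_wait PID
# Hypothetical helper: send SIGCONT in case the process is stopped,
# then wait for it and return its exit status.
resume_and_wait() {
    kill -CONT "$1" 2>/dev/null   # harmless if the process is not stopped
    wait "$1"
}

# Start a job, stop it, then await it.
sh -c 'sleep 1; exit 7' &
pid=$!
kill -STOP "$pid"

resume_and_wait "$pid"
status=$?
echo "job exited with status $status"
```

Without the SIGCONT, a wait that follows the simple reading of POSIX would block here until something else resumed the job.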

But before I make any of that happen, I'd like to read opinions of others
about how all of this should work (just don't bother with "change nothing"
as what we have now is clearly wrong).

kre

