RFC: Enhancements/changes to sh doc and a possible sh extension

To: tech-userlevel%netbsd.org@localhost
Subject: RFC: Enhancements/changes to sh doc and a possible sh extension
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Wed, 18 Oct 2017 17:49:31 +0700
A while ago it was suggested (perhaps in some private mail) that perhaps
all of the builtin sh commands should have man pages of their own, rather
than only the ones which are also implemented as external commands.

As things are growing (including the sh man page) and partly inspired by
the possible sh enhancement described below, I have been wondering more if
perhaps this would not be an improvement.

We wouldn't necessarily need to document every sh builtin command in a
separate page (I'd not do most of the special builtins, as one exclusion,
and perhaps not a few more that are too simple to need it, like say inputrc,
jobid and jobs ... just as examples) but I suspect that

	cd fc getopts hash read ulimit

at least, and (from below) wait could all usefully have man pages of their own
with a consequent reduction to the size of sh(1) - just the getopts section
of sh(1) is kind of long ...

I would expect each of these man pages would contain a specific warning
that the command defined is a built-in command from the shell, and that
details might differ from shell to shell.  It would also refer readers
to their own shell's man page, and explain that what is documented in
the page in question should only be used if the relevant shell's man
page references it.

The man pages could contain sections for each shell we want to document
in the NetBSD base (there are sh ksh and csh) if that seems useful, and if
the shell in question implements the relevant built-in, to document the
specific variations that a particular shell implements.

First question is does this sound reasonable?   (For this, you can assume
that I will be doing the work, at least initially, for sh(1) references,
so by agreeing it is a good idea you would not be volunteering!)

Second, assuming this happens (even if you say "no" to the first question,
please answer this one as if the answer to the previous is affirmative)
what should the man pages be called?   They could just be cd.1 getopts.1 (etc)
or we could invent a new suffix, perhaps 1S or 1sh and have cd.1S (etc) to
mark these pages as distinct from the normal xxx(1) which generally documents
something that can be found via a $PATH search.   Or something different.
[For the 1S approach, someone (else) would need to fix the man (etc)
commands/config to make it work.]


Second issue, you might have guessed from the hint above that the
proposed extension is to wait (the sh command) - which would require its
section of sh(1) to get bigger...   The following (or something like
it) is what I am thinking of (discussion of why follows the text).

Note: this is yet to be spell checked, grammar checked, ... so you
can just ignore (or not) that kind of issue - on the other hand, if
none of what is here makes any sense at all (ie: you cannot work out
what it is all about) then do say so ... I know what I mean it to say,
so when I read it, it clearly says that!

    wait [-n] [-p var] [job ...]

            Wait for the specified jobs to complete and return the exit status
            of the last job to exit, or 127 if none of the jobs are a current
            child of the shell.

            If no jobs argument is given, wait for all jobs to complete and
            then return an exit status of zero (including when there were no
            jobs, and so nothing exited.)

            With the -n option, wait instead for any one of the given jobs, or
            if none are given, any job, to complete, and return the exit
            status of that job.  If none of the given job arguments is a
            current child of the shell, or if no job arguments are given and
            the shell has no unwaited for children, then the exit status will
            be 127.

            The -p var option allows the process (or job) identifier of the
            job for which the exit status is returned to be obtained.  The
            variable named (which must not be readonly) will be unset
            initially, and then set to the identifier from the arg list (if
            given) of the job that exited, or the process identifier of the
            job to exit when used with -n and no job arguments.  Note that -p
            with neither -n nor job arguments is useless, as in that case no
            job status is returned, the variable named is simply unset.

            If the wait is interrupted by a signal, its exit status will be
            greater than 128.

            Once waited upon, by specific process number or job-id, or by a
            wait with no arguments, knowledge of the child is removed from the
            system, and it cannot be waited upon again.

In that, the first 2 paragraphs are intended to describe the status quo,
the next two are the proposed extension, and the final 2 paragraphs also
document the current wait command.


The sh(1) wait command dates from the very early days of unix (not sure how
far back, before my time, and that's saying something, but certainly early to
mid 70's) and has changed very little since.

The only enhancement since then has been the addition of the job (or pid)
args so it is possible to wait until (one or more) specific process(es) have
finished, rather than just everything, which was all that was possible
before.   [An aside: our current man page says it allows just "wait [job]"
as if only one job arg is permitted - that isn't posix conformant, and isn't
what is implemented either, so the doc needs fixing in any case.]

In the meantime the (now) wait(2) family of syscalls has been one of the
most extended of all, with wait3() followed by wait4() and waitpid(), and
more recently waitid() and wait6().   All of the new ones have an options
arg, with flag bits, which have also been extended (new options invented)
over time.

It is (way beyond) time for wait(1) (the sh built-in) to do some catching up.

This is also inspired by (at least one) of my scripts that wants to run
processes, then when one finishes (any one) start a replacement, so I have
the need to wait for any process to exit, not any specific process, and
certainly not all of them (but sometimes perhaps, one of a subset of those
running.)

For this, bash already has "wait -n" though I am not aware of any other
shell that has copied it yet.  Bash's -n is (I believe) currently an alternative
to the list of processes (or jobs) as an arg, but I see no reason that
the two should be exclusive, "wait -n" simply waits for any child to complete,
"wait -n p1 p2 p3" can wait for any of the listed processes to exit.

Bash is lacking a mechanism to discover which process exited however, which
makes "wait -n" a little less useful (not useless, there are always ways, like
sending kill -0 at all known children and finding out which one now returns
ESRCH) but it seemed to me that since the shell internally knows which process
terminated, it should provide some means for the script to find out more
easily.   Hence the "-p var" (-p for "pid").   In the current implementation
(and yes, all of this is implemented already) the result returned in var is
actually the arg string that was passed as the operand when "job..."
operands are given.   That is both generally easier to do, and also seems
to be more consistent: if you say "wait -n -p job %1 %2 %3", after it
returns (without error) $job will be one of "%1" "%2" or "%3".   This could
be changed to always be the decimal pid (value that was available in $! when
the background job started) if that seems better (that was what I coded 
initially) or we could have an option to choose (I'd kind of prefer not that.)
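
As a rough sketch of how a script might use the two options together
(assuming a shell that implements -n and -p as proposed above; the variable
names and sleep durations are made up for illustration).  The operands here
are pids rather than %n job ids, so the value stored in the variable is the
pid whichever of the two result conventions ends up being chosen:

```sh
# Two background children: one quick with a distinctive status, one slow.
( sleep 1; exit 7 ) & quick=$!
sleep 30 & slow=$!

# Wait for whichever of the two finishes first, and record which it was.
wait -n -p finished "$slow" "$quick"
status=$?

echo "child $finished exited with status $status"

# Clean up the child that is still running.
kill "$slow" 2>/dev/null
wait "$slow" 2>/dev/null
```

In a shell without these options the wait command will simply fail with a
usage error, so a script that needs to be portable would have to test for
them first.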

I have discussed this (briefly, and a while ago) with Chet Ramey, and he
has it on his "things to consider for bash later when there is time" list,
and -p was (as I recall) agreed as a reasonable option name.

I'd appreciate opinions on this, is it a reasonable thing to do?
Is it being done in the right way if it is?

I also notice now (and will fix soon in my uncommitted copy) that the text:

        Wait for the specified jobs to complete and return the exit status
        of the last job to exit,

from above is not actually what should happen (nor what does happen),
sh(1) waits for each job to exit, in the order given, and then returns
the status from the last of them (the last on the command line.)   That's
what posix specifies, and that is what we do (while each job will produce
a 127 return code if the child named does not exist - either never did as
a child of this shell, or has already been waited for - or in an interactive
shell, has had its status reported as "Done" - or anything else that indicates
it no longer exists, that 127 is only observable when it happens from the
final job listed.)
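
The ordering rule is easy to check with a small script.  This is only a
sketch (the names slow and quick, and the exit values, are arbitrary) and
it uses nothing beyond what posix specifies:

```sh
# The slow child is listed first, the quick child is listed last.
( sleep 1; exit 3 ) & slow=$!
( exit 5 ) & quick=$!

# wait handles the operands in the order given and returns the status
# of the last operand, not of whichever child happened to exit last.
wait "$slow" "$quick"
status=$?

echo "wait returned $status"
```

Here the slow child is the last to exit, but the status reported should be
that of the quick child (5) because it is the last one on the command line.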

To perhaps avoid one kind of obvious question, while the wait command and
the wait system call are obviously related, the command does not necessarily
perform the system call, nor is the system call only used (wrt background
jobs) when the command is invoked - sh(1) cleans up child processes (zombies)
whenever it notices that they have finished, but then remembers the status
to report when the script does a wait command.  If all jobs listed on the
command line have already been detected as finished, then no wait system
call will be performed.  The "does not exist" in the previous paragraph
relates to what the shell has in its data struct about the process, not
what is in the kernel process table.
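
A minimal illustration of the remembered status (again just a sketch, with
an arbitrary exit value, and a sleep used only to give the child time to
finish before the wait command runs):

```sh
# Start a child that exits immediately with a recognisable status.
( exit 9 ) & child=$!

# By the time this returns the child has finished, and the shell may
# already have collected it from the kernel.
sleep 1

# The shell still reports the saved status for the child.
wait "$child"
status=$?

echo "status of $child was $status"
```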

So, on all of this, your thoughts please.

In the meantime, the code for this needs more extensive testing before it
is committed (if there is reasonable agreement that doing so is a good thing),
and the doc (obviously) needs improvements, so I'll be doing that...

kre

ps: while I have been typing this, I can also see uses for a -u (maybe -w)
option, kind of equivalent to the WNOWAIT option to the wait*(2) functions.
That is, to not clean up after doing this wait, so it can be repeated
later - that might be useful for use in SIGCHLD trap handlers, for example.
I think I'll add that and see how well it works...  I kind of doubt that
options to act like WNOHANG or WUNTRACED would be useful to scripts, but
please feel free to disagree.   (If a WNOHANG kind of thing seems useful,
perhaps better would be a -t timeout instead, with -t 0 working of course.)
Anyway, all of this potential feeds back into the questions that started
this e-mail, all so long ago...