Re: killing subshells in /bin/sh scripts

To: "Johnny C. Lam" <jlam%NetBSD.org@localhost>
Subject: Re: killing subshells in /bin/sh scripts
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Fri, 23 Jun 2017 13:42:11 +0700

    Date:        Fri, 23 Jun 2017 01:23:22 +0000
    From:        "Johnny C. Lam" <jlam%NetBSD.org@localhost>
    Message-ID:  <20170623012322.GA27106%homeworld.netbsd.org@localhost>

  | Why does the output still appear after the script has ended?

It relates to which process $! actually identifies.

In /bin/sh (currently) $! is a forked copy of sh which is waiting for the
	( sleep 3 && echo "3s have elapsed" )
subshell to terminate.  The subshell gets its own copy of sh to run the
commands in the ().   Killing the first process ($!) does nothing to its
children.  Our /bin/sh makes way too many forked copies of itself for fairly
innocuous things.

You can observe this in a shell (on a terminal) where you have
nothing else running (ie: start a new "xterm -e sh" for this test
if you can, and by "xterm" I mean whatever you use that creates a new shell
with a new pty - a new sub-window in tmux or screen might work too).

In that, run ps T in as many different scenarios as you can imagine.
One fun one is

	( ( ( ( ( ps T ) ) ) ) )

and count all the sh processes that are running.    All that is really
needed is the one that you're typing in (and the ps command of course),
the rest are just making the implementation easier to get correct.

[Aside: The spaces after the '(' are because '((' is technically an unspecified
operator - as some shells use (( as an equivalent of $(( - in /bin/sh omitting
the spaces would not hurt.   The spaces in the ) ) ... were just for symmetry.]

(The new terminal, with a new "sh" is just so there is no possible confusion
with other processes running on the same terminal from before the "sh" being
tested started.)

You can also see exactly what is happening from your script, if you make
it be:

  tty=$(tty)
  echo "start script: $$ on $tty"
  ( ps -lt"$tty"; sleep 3 && echo "3s have elapsed" ) &
  pid=$!
  echo "$pid started in background"
  kill_job() {
	echo "killing $pid"
	kill "$pid"
  }
  trap "kill_job" 1 2 3 15
  wait
  echo "end script"

When I run this in an "xterm -e sh" (doesn't matter which vintage NetBSD sh
you use for this, nothing has changed in this area, yet) I see ...

$ sh /tmp/sc
start script: 18745 on /dev/pts/45
17253 started in background
UID   PID  PPID CPU PRI NI   VSZ  RSS WCHAN STAT TTY       TIME COMMAND
200   488 17253   0  85  0 13300  988 wait  S+   pts/45 0:00.01 sh /tmp/sc 
200  9436   488   0  43  0 13124 1252 -     O+   pts/45 0:00.01 ps -lt/dev/pts/
200 17253 18745   0  85  0 13300  404 wait  S+   pts/45 0:00.00 sh /tmp/sc 
200 18745 27273   0  85  0 13300 1420 wait  S+   pts/45 0:00.01 sh /tmp/sc 
200 27273 24756   0  85  0 13300 1472 wait  Ss   pts/45 0:00.04 sh 
^Ckilling 17253
end script
$ 3s have elapsed

The last process from ps, 27273 is clearly the child of the xterm, the
shell I typed into.

Process 18745 (2nd last line, conveniently) is the sh that is running the
script (that one needs to exist, obviously).

As the script reports, 17253 (3rd line from the bottom - this is just a
coincidence, the process ID would not usually work out so conveniently) is
the $! process started for the background sub-shell.

Next look at the ps command (4th from bottom - this really was a fluke,
it just worked out this way...) (pid 9436).  Its parent is 488 (the one
remaining line, 5th from bottom) which is a child of 17253.  The sleep
(the next command to be run after the ps) will have the same parent.
So will the echo that follows (or would, if the shell forked and exec'd
echo, instead it is builtin, but that is irrelevant here).  They are all
being run by shell pid 488.

When the script kills 17253 it kills that process, but 488 is still there,
waiting for the sleep to finish, so that it can do the echo that follows.

What is happening here is that the & forks a sub-shell to run whatever
commands need to be run in the background, then ( ) forks a subshell,
as that is what it is expected to do.   So this is all trivially correct,
just a little unexpected, and unnecessary.
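
If you would rather not read ps output, something like the following
rough sketch can report whether $! is the process that actually runs the
commands inside the ( ).  It relies on a child "sh -c" printing its $PPID,
which is the PID of whichever shell ran it.  The names tmp, bgpid, runner
and verdict are just made up for this illustration, and it assumes mktemp
is available.  On a shell that forks twice, as described above, I would
expect the two PIDs to differ; on shells that avoid the extra fork they
should match.

```shell
# Illustrative sketch only; variable names here are hypothetical.
tmp=$(mktemp) || exit 1

# The child "sh -c" prints its parent PID, that is, the PID of the shell
# that is really executing the commands inside the ( ).
# The trailing ":" keeps "sh -c" from being the last command in the
# subshell, so a shell that execs the final command cannot skip the fork.
( sh -c 'echo "$PPID"' >"$tmp"; : ) &
bgpid=$!
wait "$bgpid"

runner=$(cat "$tmp")
rm -f "$tmp"

if [ "$runner" = "$bgpid" ]; then
	verdict="no extra fork"
else
	verdict="extra fork"
fi
echo "\$! = $bgpid, commands run by $runner: $verdict"
```

When it says "extra fork", killing $! leaves the shell that runs the
commands alive, which is the behaviour seen in the script above.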

Note that if instead the script was...

   tty=$(tty)
   echo "start script: $$ on $tty"
   { ps -lt"$tty"; sleep 3 && echo "3s have elapsed" ;} &
   pid=$!
   echo "$pid started in background"
   kill_job() {
	echo "killing $pid"
	kill "$pid"
   }
   trap "kill_job" 1 2 3 15
   wait
   echo "end script"

Then what happens (this is running in a new xterm, which happened, not
particularly surprisingly, to be assigned the same pty as the previous one) is:

$ sh /tmp/sc
start script: 18381 on /dev/pts/45
23854 started in background
UID   PID  PPID  CPU PRI NI   VSZ  RSS WCHAN STAT TTY       TIME COMMAND
200  9688 27740    0  85  0 13300 1472 wait  Ss   pts/45 0:00.02 sh 
200 16544 23854 1024  43  0 13124 1252 -     O+   pts/45 0:00.02 ps -lt/dev/pts
200 18381  9688    0  85  0 13300 1420 wait  S+   pts/45 0:00.01 sh /tmp/sc 
200 23854 18381    0  85  0 13300  992 wait  S+   pts/45 0:00.00 sh /tmp/sc 
^Ckilling 23854
end script
$ 

No () there, so no extra forked shell, and $! (23854) is the parent of the
ps command, and so will be the parent of the sleep, and if it was not killed
before, also the parent of the echo (well, it would do the echo itself).
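
A cut-down, non-interactive version of the same check might look like the
sketch below.  It is only an illustration: the names out, pid and result
are invented here, the delays are shortened, mktemp is assumed to exist,
and the echo writes to a temporary file so that its (non-)appearance can
be tested afterwards instead of watched on the terminal.

```shell
# Illustrative sketch only; variable names here are hypothetical.
out=$(mktemp) || exit 1

# Brace group: the & forks one shell, and that shell runs the commands.
{ sleep 1 && echo "1s have elapsed" >"$out"; } &
pid=$!

# Kill the shell that $! names, then reap it (its status is non-zero).
kill "$pid"
wait "$pid" 2>/dev/null || true

# Give the (now orphaned) sleep time to finish; nothing is left to run
# the echo after it, so the file should remain empty.
sleep 2

if [ -s "$out" ]; then
	result="echo still ran"
else
	result="echo suppressed"
fi
echo "$result"
rm -f "$out"
```

With the { } form the shell killed is the one that would have run the echo,
so the file stays empty.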

  | If I run this with /bin/ksh, I don't get any output after the
  | end of the script.

ksh is better at avoiding unneeded forking than /bin/sh is (currently).
So are almost all other shells...    Note though that in this case it is
not really as clear that they are posix conformant (I have not investigated
this in detail yet) as & is supposed to run its commands in a sub-shell, and
so is ( ), so it might be that technically, the two sub-shells are actually
required.   But since just about everyone else does it the ksh way, if that
is currently not conformant, the most likely thing that would happen is that
posix will be updated to, at least, allow it, if not actually require it.

Dealing with this is on my list of things to handle (it mostly means copying
code from FreeBSD, as they have improved this already.)   I say improved
rather than fixed here, as it is not really a bug, just surprising, and
a bit wasteful.

kre

ps: in the above, it would be neater, and easy, to use a suitable -o arg
to ps, and get only the columns wanted, rather than just '-l' but I was
too lazy...  Even just replacing the "l" with -Oppid would be cleaner.

