[src/trunk]: src/bin/sh PR bin/57053 is related (peripherally) here.

To: source-changes-hg%NetBSD.org@localhost
Subject: [src/trunk]: src/bin/sh PR bin/57053 is related (peripherally) here.
From: kre <kre%NetBSD.org@localhost>
Date: Sun, 30 Oct 2022 04:53:39 +0000

details:   https://anonhg.NetBSD.org/src/rev/bf5223fa799a
branches:  trunk
changeset: 372193:bf5223fa799a
user:      kre <kre%NetBSD.org@localhost>
date:      Sun Oct 30 01:46:16 2022 +0000

description:
PR bin/57053 is related (peripherally) here.

sh has been remembering the process group of a job for a while now, but
using that for almost nothing.

The old way to resume a job, was to try each pid in the job with a
SIGCONT (using it as the process group identifier via killpg()) until
one worked (or none did, in which case resuming would be impossible,
but that never actually happened).   This wasn't as bad as it seems,
as in practice the first process attempted was *always* the correct
one.  Why the loop was considered necessary I am not sure.  Nothing
but the first could possibly work.

This worked until a fix for an obscure possible bug was added a
while ago - now a process which has already finished, and had its
zombie collected via wait*() is no longer ever considered to have
a pid which is a candidate for use in any system call.  That's
because the kernel might have reassigned that pid for some newly
created process (we have no idea how much time might have passed
since the pid was returned to the kernel for reuse, it might have
happened weeks ago).

This is where the example in bin/57053 revealed a problem.

That PR is really about a quite different problem in zsh (from pksrc)
and should be pkg/57053, but as the test case also hit the problem
here, it was assumed (by some) they were the same issue.

The example is (in a small directory)
        ls | less
which is then suspended (^Z), and resumed (fg).   Since the directory
is small, ls will be finished, and reaped by sh - so the code would
now refuse to use its pid for the killpg() call to send the SIGCONT.
The (useless) loop would attempt to use less's pid for this purpose
(it is still alive at this point) but that would fail, as that pid
is not a process group identifier, of anything.   Hence the job
could not be resumed.

Before the PR (or preceding mailing list discussion) the change here
had already been made (part of a much bigger set of changes, some of
which might follow - sometime).   We now actually use the job's
remembered process group identifier when we want the process group
identifier, instead of trying to guess which pid it happens to be
(which actually never took any guessing, it was, and is always the
pid of the first process created for the job).   A couple of minor
fixes to how the pgrp is obtained, and used, accompany the changes
to use it when appropriate.

diffstat:

 bin/sh/jobs.c |  42 +++++++++++++++++++++++++-----------------
 1 files changed, 25 insertions(+), 17 deletions(-)

diffs (105 lines):

diff -r d3606241a6f5 -r bf5223fa799a bin/sh/jobs.c
--- a/bin/sh/jobs.c     Sun Oct 30 01:19:08 2022 +0000
+++ b/bin/sh/jobs.c     Sun Oct 30 01:46:16 2022 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: jobs.c,v 1.116 2022/04/18 06:02:27 kre Exp $   */
+/*     $NetBSD: jobs.c,v 1.117 2022/10/30 01:46:16 kre Exp $   */
 
 /*-
  * Copyright (c) 1991, 1993
@@ -37,7 +37,7 @@
 #if 0
 static char sccsid[] = "@(#)jobs.c     8.5 (Berkeley) 5/4/95";
 #else
-__RCSID("$NetBSD: jobs.c,v 1.116 2022/04/18 06:02:27 kre Exp $");
+__RCSID("$NetBSD: jobs.c,v 1.117 2022/10/30 01:46:16 kre Exp $");
 #endif
 #endif /* not lint */
 
@@ -267,11 +267,7 @@
        out1c('\n');
        flushall();
 
-       for (i = 0; i < jp->nprocs; i++)
-           if (tcsetpgrp(ttyfd, jp->ps[i].pid) != -1)
-                   break;
-
-       if (i >= jp->nprocs) {
+       if (tcsetpgrp(ttyfd, jp->pgrp) == -1) {
                error("Cannot set tty process group (%s) at %d",
                    strerror(errno), __LINE__);
        }
@@ -376,6 +372,9 @@
 
        if (jp->state == JOBDONE)
                return;
+       if (jp->pgrp == 0)
+               error("Job [%d] does not have a process group", JNUM(jp));
+
        INTOFF;
        for (e = i = 0; i < jp->nprocs; i++) {
                /*
@@ -390,13 +389,16 @@
                 * Otherwise tell it to continue, if it worked, we're done
                 * (we signal the whole process group)
                 */
-               if (killpg(jp->ps[i].pid, SIGCONT) != -1)
+               if (killpg(jp->pgrp, SIGCONT) != -1)
                        break;
-               if (e == 0 && errno != ESRCH)
-                       e = errno;
+               e = errno;
+               break;          /* no point trying again */
        }
-       if (i >= jp->nprocs)
-               error("Cannot continue job (%s)", strerror(e ? e : ESRCH));
+
+       if (e != 0)
+               error("Cannot continue job (%s)", strerror(e));
+       else if (i >= jp->nprocs)
+               error("Job [%d] has no stopped processes", JNUM(jp));
 
        /*
         * Now change state of all stopped processes in the job to running
@@ -436,8 +438,8 @@
 
 #if JOBS
        if (mode & SHOW_PGID) {
-               /* just output process (group) id of pipeline */
-               outfmt(out, "%ld\n", (long)jp->ps->pid);
+               /* output only the process group ID (lead process ID) */
+               outfmt(out, "%ld\n", (long)jp->pgrp);
                return;
        }
 #endif
@@ -1206,14 +1208,19 @@
 int
 forkparent(struct job *jp, union node *n, int mode, pid_t pid)
 {
-       int pgrp;
+       int pgrp = 0;
 
        if (rootshell && mode != FORK_NOJOB && mflag) {
+               /*
+                * The process group ID must always be that of the
+                * first process created for the job.   If this proc
+                * is the first, that's us, otherwise the pgrp has
+                * already been determined.
+                */
                if (jp == NULL || jp->nprocs == 0)
                        pgrp = pid;
                else
-                       pgrp = jp->ps[0].pid;
-               jp->pgrp = pgrp;
+                       pgrp = jp->pgrp;
                /* This can fail because we are doing it in the child also */
                (void)setpgid(pid, pgrp);
        }
@@ -1224,6 +1231,7 @@
                ps->pid = pid;
                ps->status = -1;
                ps->cmd[0] = 0;
+               jp->pgrp = pgrp;        /* 0 if !mflag */
                if (/* iflag && rootshell && */ n)
                        commandtext(ps, n);
        }

Prev by Date: [src/trunk]: src/bin/sh Note in the description of "jobs -p" that the process...
Next by Date: [src/trunk]: src/usr.sbin/makemandb makemandb.c: fix grammar in a comment
Previous by Thread: [src/trunk]: src/bin/sh Note in the description of "jobs -p" that the process...
Next by Thread: [src/trunk]: src/usr.sbin/makemandb makemandb.c: fix grammar in a comment
Indexes:

Home | Main Index | Thread Index | Old Index