Re: pg_jobc going negative?

To: Robert Elz <kre%munnari.OZ.AU@localhost>
Subject: Re: pg_jobc going negative?
From: David Holland <dholland-tech%netbsd.org@localhost>
Date: Sat, 18 Jul 2020 09:08:09 +0000

On Fri, Jul 10, 2020 at 08:41:40PM +0700, Robert Elz wrote:
 > I have spent a little time looking at this now, and I think
 > it is just all a mess.

Indeed...

 > As best I can work out, and someone correct me if I'm wrong,
 > the whole purpose of pg_jobc is so that orphanpg() can be called
 > when a process group is orphaned (no longer has a session leader).

Yes.

A process group is orphaned if there's no longer a way to do job
control on it directly -- the most common case is when you get hung up
on and your login shell/session leader goes away, but it can also
happen if you make chains of forks and some of them exit. For example,
if you start an interactive subshell, run some stuff in the
background, then exit it, the background jobs are now orphaned.

Detecting orphaned process groups is vexing because it's a graph
connectivity problem and you don't want to/can't (for locking reasons)
go chugging around the graph explicitly.

 > I see 3 ways forward...   simply drop the KASSERT the way that FreeBSD
 > have done, and let things return to the semi-broken but rarely bothersome
 > code that was there before.   That's not really a good idea, as the
 > sanitizers apparently find problems with the code the way it was (not
 > too surprising, deleting the KASSERT won't fix the bugs, it just stops
 > noticing them explicitly).

Indeed.

 > Or, we could properly define what pg_jobc is counting, and then make sure
 > that it counts whatever that is properly - is incremented in all the
 > appropriate places, and decremented properly as well.   Currently
 > the comment about it in proc.h is:
 >         /*
 >          * Number of processes qualifying
 >          * pgrp for job control 
 >          */
 > which makes it clear that it is a reference counter (not necessarily
 > counting the number of something which exists, so that something can be 
 > deleted, but it is counting references to processes).   Unfortunately
 > I have no idea what "qualifying pgrp for job control" is supposed to mean.

A processes is job-controlled by its parent process, and a process
group is job-controlled by any parent of processes in it. So a process
group can be job-controlled if some process in it has a parent in a
different process group that can also be job-controlled, going back to
the session leader. (Provided that the session has a tty, anyway.)

This becomes complicated because although normal process group usage
is completely hierarchical, it's possible to place processes in
process groups such that the graph of parent relationships ceases to
be a tree.

The code is trying to count how many processes in the process group
have a parent in a different process group that is not itself
orphaned. If this count reaches zero, the process group has become
orphaned.

I can't remember if it's possible to set up process groups where the
parent graph is cyclic. I think it might be; or at least, I don't
remember any reason why not that wouldn't also force the graph to be a
tree. But it's been a while with all this stuff and I may be
forgetting something.

Anyway, the method should work as long as the graph remains acyclic.
The implementation is probably missing cases where it should be
adjusting the count.

 > That could be done, but it seems like a lot of work to me, and not easy.

Unfortunately, I think it's the only viable way forard.

 > Another (more radical) approach would be to simply drop orphanpg()
 > completely, and thus no longer need pg_jobc at all.   The system
 > wouldn't be bothered by this at all - all orphanpg() does is locate
 > stopped members of the process group, and send then SIGCONT (to restart)
 > and SIGHUP (to probably kill them - or at least inform them they their
 > environment has changed).

Unfortunately, removing it violates POSIX, which prescribes all this
stuff.

 > Third, and the option I'd suggest, is to revert to more basic principles,
 > remove the pg_jobc attempt to detect when a session leader has exited,
 > or changed to a different process group, and instead at candidate events
 > (any time a process leaves a process group, for any reason) check if that
 > process was the session leader, and if it is, clean up any now orphaned
 > stopped processes.   This is likely to be slower than the current attempt,
 > but is much easier to make correct (and much less likely to cause problems,
 > other than dangling orphaned stopped processes, if incorrect).

I don't think that's sufficient, because it doesn't take into account
the cases where jobs become disconnected from the session leader
without the session leader exiting. Which is easy to arrange: su, then
vi, ^Z, exit. :-/

-- 
David A. Holland
dholland%netbsd.org@localhost

References:
- Re: pg_jobc going negative?
  - From: Michael van Elst
- pg_jobc going negative?
  - From: Kamil Rytarowski
- Re: pg_jobc going negative?
  - From: Maxime Villard
- Re: pg_jobc going negative?
  - From: Robert Elz

Prev by Date: Re: style change: explicitly permit braces for single statements
Next by Date: notes from running will-it-scale
Previous by Thread: Re: pg_jobc going negative?
Next by Thread: TSC improvement
Indexes:

Home | Main Index | Thread Index | Old Index