Subject: job-control shell trouble
To: None <tech-userlevel@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-userlevel
Date: 12/25/2004 19:44:01

I'm trying to build a new shell...and I've run into a race condition
that appears to be impossible to eliminate.

The basic problem arises as soon as a job contains a construct such as
(sh syntax) "foo; bar" or "while cmd1; do cmd2; done" that calls for
running one process and then, later, another.

When the first process for a job is forked, its PID is used for the
job's process group (_some_ process's PID needs to be).  In most cases
(such as a pipeline) there is no problem getting all the processes into
the same process group without races - just make sure all processes
block on some common event until they're all forked and have had their
process groups set.  (I create a pipe and have them all read from it,
closing the write end when I want them to go.)
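
A minimal sketch of that pipe barrier, for illustration only.  It is in
Python rather than C for brevity; the os calls map directly onto pipe(2),
fork(2), setpgid(2) and read(2).  The name spawn_job and its return
values are hypothetical, and a real shell would exec the command where
the sketch simply exits.

```python
import os

def spawn_job(n_procs):
    """Fork n_procs children into one process group, held at a pipe barrier.

    Returns (pgid, pids, release_fd); the children stay blocked until the
    caller closes release_fd.
    """
    gate_r, gate_w = os.pipe()
    pgid = 0          # 0 means "use the child's own PID" for the first child
    pids = []
    for _ in range(n_procs):
        pid = os.fork()
        if pid == 0:
            status = 1
            try:
                # Child: drop the write end, or EOF would never arrive.
                os.close(gate_w)
                os.setpgid(0, pgid)
                # Blocks until every copy of the write end is closed.
                os.read(gate_r, 1)
                status = 0      # a real shell would exec the command here
            finally:
                os._exit(status)
        if pgid == 0:
            pgid = pid          # first child's PID becomes the group ID
        # The parent sets the group too, so it does not matter who runs first.
        os.setpgid(pid, pgid)
        pids.append(pid)
    os.close(gate_r)
    return pgid, pids, gate_w
```

Because every child is still blocked in read() while the others are
forked, the group cannot disappear before the last setpgid() call; the
parent releases the whole job with os.close(release_fd).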

But when some processes aren't needed until others are done, I have a
problem: I can't tell whether the process group still exists!  Of
course, I can simply try to put the new process into the old process
group.  This will fail if all the old processes have died and the
process group is gone, and that error can be trapped - but what I can't
trap is the case where all the processes have died and some other
process in the same session has gone off and reused the same process
group ID.  Then my attempt to put the new process into the old process
group will succeed, but not because the process group still exists;
rather, because it once again exists, and isn't mine any longer.
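
To make the two cases concrete, here is a sketch of the join attempt,
again in Python standing in for setpgid(2).  The helper name
join_job_group is hypothetical.  It assumes the POSIX behaviour that
setpgid() fails with EPERM when no process group with the requested ID
exists in the caller's session.

```python
import errno
import os

def join_job_group(pid, pgid):
    """Try to move child `pid` into the existing process group `pgid`.

    Returns False if the kernel reports EPERM (which includes "no such
    group in this session"), True if the call succeeded.
    """
    try:
        os.setpgid(pid, pgid)
    except OSError as e:
        if e.errno == errno.EPERM:
            # Trappable case: the group is gone (or otherwise unusable).
            return False
        raise
    # Untrappable case hides here: success only proves that *some* group
    # with this ID exists in the session right now, not that it is still
    # the group this job created.
    return True
```

Nothing in the return value distinguishes "the job's group is still
there" from "the ID has been reused", which is exactly the race.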

It is admittedly _unlikely_ that all the old processes will die and
some other process in the same session will recreate the process group,
without my having noticed that all the old processes have died and thus
realizing that the process group is gone.  But "unlikely" really isn't
good enough.

I thought of changing the kernel so that a zombie keeps its process
group alive, so that the group remains until the shell _knows_ it's
gone - except that
this doesn't fix errors in the other direction: it is still possible
for all the shell's children in that group to be dead and yet the group
survives because some of _their_ children remain.  Besides, this would
mean the shell would work right only with tweaked kernels, which I'd
rather avoid.

The only real fix I've come up with is to have the shell fork an extra
process whose sole purpose is to hang onto the process group, as long
as there will be (or might be) another process forked for the job.
This is pretty seriously ugly, though - what's the Right fix?  (The
shell itself can stay in that process group if the job is a foreground
job, but if it's not, that won't do.)
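
For concreteness, a rough sketch of that extra process, in Python
standing in for the C calls.  The names start_group_holder and
release_group_holder are hypothetical, and the sketch leaves signal
handling at its defaults, which a real shell could not do, since
signals sent to the whole group would reach the holder too.

```python
import os

def start_group_holder():
    """Fork a placeholder that leads a new process group and then blocks.

    Returns (pgid, release_fd).  The group exists for as long as the
    placeholder does, whatever happens to the job's other processes.
    """
    hold_r, hold_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            os.close(hold_w)
            os.setpgid(0, 0)        # become leader of a fresh group
            os.read(hold_r, 1)      # block until the shell lets go
            status = 0
        finally:
            os._exit(status)
    os.close(hold_r)
    os.setpgid(pid, pid)            # same call from the parent, to avoid a race
    return pid, hold_w

def release_group_holder(pgid, release_fd):
    """Let the placeholder exit once no more processes will join the job."""
    os.close(release_fd)
    os.waitpid(pgid, 0)
```

The shell would call start_group_holder() before forking the job's
first command, put each later command into the returned group with
setpgid(), and call release_group_holder() only when it knows nothing
more will be forked for the job.  Job processes that exec lose their
copy of release_fd, since Python creates pipe descriptors
close-on-exec; a C shell would have to arrange that itself.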

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B