Subject: Re: fork, SIGCHLD, wait, waitpid, Debian Linux vs. NetBSD ...
To: Jeremy C. Reed <reed@reedmedia.net>
From: Matthias Buelow <mkb@mukappabeta.de>
List: netbsd-users
Date: 04/11/2001 17:02:20
"Jeremy C. Reed" <reed@reedmedia.net> writes:

>I have a daemon that forks ten other processes. Under Debian Linux when a
>child is terminated, another process is forked (so there are always 11
>processes total). Under NetBSD when each child is terminated another is
>not started unless it is the last child left (so there are always at least
>2 processes). (I want ten children![1])
>
>Here is an example of the code[2]:
>
>void
>signal_handler (int signal)
>{
>  if (signal == SIGCHLD) {
>    int stat;
>    while (waitpid (-1, &stat, WNOHANG) > 0);
>  }
>  return;
>}
>
>The main part:
>
>  signal(SIGCHLD, signal_handler);
>  children = 0; maxchildren = 10;
>
>  while (1) {
>    if (children < maxchildren) {
>      if (!fork()) {
>        mainloop;
>        exit(OK);
>      } else {
>        children++;
>      }
>    } else {
>      wait(NULL);
>      children--;
>    }
>  }
>
>I am guessing that under NetBSD, the signal_handler does the waitpid and
>so the second wait() just hangs forever(?).

The problem is that signal(3) behaves differently on Linux[1] and
NetBSD.  On Linux, signal_handler() gets called exactly once in this
example (the first time; see [1]), while on NetBSD it gets called
roughly once for every child that has exited (modulo race conditions
with the wait() below).  So on BSD, the handler's waitpid() reaps the
child that has just exited, and wait() keeps blocking, because there
is no exited child left for it to collect.  Since signal_handler()
does not decrement children, after, say, maxchildren calls to
signal_handler(), children is still at maxchildren although no child
is actually left.  Through a funny race condition, a wait() may
occasionally get to an exited child before the handler's waitpid()
does (while the signal is blocked), so children does get decremented
and a new process is forked.
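
One way around the double reaping, sketched here as an assumption and
not as the actual vm-pop3d fix: make the SIGCHLD handler the only place
that reaps children, let it count them, and have the main loop sleep in
sigsuspend() with SIGCHLD blocked the rest of the time.  run_prefork(),
child_mainloop() (standing in for mainloop) and the reaped counter are
hypothetical names.  The sketch assumes the process has no other
children, since waitpid(-1, ...) reaps any of them.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Children reaped so far. Only the SIGCHLD handler changes it, and while
   run_prefork() is active the handler can only run inside sigsuspend(). */
static volatile sig_atomic_t reaped = 0;

static void sigchld_handler(int sig)
{
    int saved_errno = errno;  /* waitpid() may overwrite errno */

    (void)sig;
    /* One SIGCHLD may stand for several exited children, so loop. */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;
    errno = saved_errno;
}

/* Hypothetical stand-in for the daemon's mainloop. */
static void child_mainloop(void)
{
}

/* Keep up to maxchildren children alive until total children have been
   started, then wait for the remaining ones. Returns the number of
   children reaped, or -1 on error. */
static int run_prefork(int maxchildren, int total)
{
    struct sigaction sa;
    sigset_t chld, oldmask, waitmask;
    int started = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    if (sigaction(SIGCHLD, &sa, NULL) == -1)
        return -1;

    /* Block SIGCHLD so that testing 'reaped' and going to sleep cannot
       race with the handler. */
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &chld, &oldmask) == -1)
        return -1;
    waitmask = oldmask;
    sigdelset(&waitmask, SIGCHLD);

    reaped = 0;
    while (reaped < total) {
        if (started < total && started - reaped < maxchildren) {
            pid_t pid = fork();

            if (pid == -1) {
                /* A real daemon would log this and retry later. */
                sigprocmask(SIG_SETMASK, &oldmask, NULL);
                return -1;
            }
            if (pid == 0) {
                child_mainloop();
                _exit(0);
            }
            started++;
        } else {
            /* Unblock SIGCHLD and sleep until the handler has run. */
            sigsuspend(&waitmask);
        }
    }
    sigprocmask(SIG_SETMASK, &oldmask, NULL);
    return reaped;
}
```

Because the handler loops with WNOHANG and a pending SIGCHLD is delivered
as soon as sigsuspend() unblocks it, several children exiting at once do
not leave the counter behind.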

In general, you'd probably want to use sigaction(); it is also
standardized and gives you finer-grained control over how your
signal handlers behave.
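
A minimal sketch of such a helper, assuming POSIX sigaction();
install_handler() and on_sigchld() are hypothetical names.  Spelling out
the mask and flags removes any dependence on what a particular signal()
implementation does.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>

/* Hypothetical SIGCHLD handler; a real one would reap children here. */
static void on_sigchld(int sig)
{
    (void)sig;
}

/* Install handler for sig with explicit semantics. Unless the caller
   passes SA_RESETHAND or SA_NODEFER in flags, the handler stays installed
   and sig is blocked while it runs. extra_mask (may be NULL) lists further
   signals to block during the handler. Returns 0 on success, or -1 with
   errno set. */
static int install_handler(int sig, void (*handler)(int),
                           const sigset_t *extra_mask, int flags)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    if (extra_mask != NULL)
        sa.sa_mask = *extra_mask;
    else
        sigemptyset(&sa.sa_mask);
    sa.sa_flags = flags;
    return sigaction(sig, &sa, NULL);
}
```

For the daemon one might call, e.g.,
install_handler(SIGCHLD, on_sigchld, NULL, SA_RESTART | SA_NOCLDSTOP):
SA_NOCLDSTOP suppresses SIGCHLD for children that are merely stopped, and
SA_RESTART asks for interrupted calls such as wait() to be restarted.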

>2) I know some people complain about not having complete working code or
>perfect K&R or ANSI code for examples. But I feel that code snippets are
>fine. This code is from vm-pop3d and also gnu-pop3d. It is available via
>http://www.reedmedia.net/software/virtualmail-pop3d/

Yeah, as long as you check for errors from fork(), wait(), waitpid()
etc. and call _exit() in the child, not exit(), of course. :)
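
A sketch of those checks, with hypothetical helper names: spawn_worker()
forks and checks the result, the child leaves with _exit() so that it
neither flushes stdio buffers inherited from the parent nor runs the
parent's atexit() handlers, and wait_worker() retries waitpid() when a
signal handler interrupts it.  example_work() is a hypothetical workload.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical child workload; its return value becomes the exit status. */
static int example_work(void)
{
    return 7;
}

/* Fork a child that runs work() and leaves with _exit(). Returns the
   child's pid, or -1 if fork() failed. */
static pid_t spawn_worker(int (*work)(void))
{
    pid_t pid = fork();

    if (pid == -1) {
        perror("fork");
        return -1;
    }
    if (pid == 0)
        _exit(work() & 0xff);
    return pid;
}

/* waitpid() that retries when interrupted by a signal handler.
   Returns 0 and stores the status on success, -1 on error. */
static int wait_worker(pid_t pid, int *status)
{
    for (;;) {
        if (waitpid(pid, status, 0) == pid)
            return 0;
        if (errno != EINTR) {
            perror("waitpid");
            return -1;
        }
    }
}
```

If a SIGCHLD handler like the one quoted above reaps the child first,
waitpid() here fails with ECHILD, which is the same double-reaping
problem in another form.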

mkb

[1] From the manpage, I get the impression that Linux follows
System V semantics with signal(3): it says the kernel resets the
handler to SIG_DFL when the handler is called, as in System V, but
also that "the glibc2 library follows the BSD behaviour" (keeping
the handler installed but blocking the signal while the handler
runs). Now what?!
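
The two behaviours described in the manpage can be requested explicitly
with sigaction() flags, whatever signal() does on a given system.  In
this sketch (count_sigchld_calls() and count_handler() are hypothetical
names), SA_RESETHAND | SA_NODEFER approximates the System V one-shot
handler, and leaving those flags out gives the BSD-style handler that
stays installed and has the signal blocked while it runs.  SIGCHLD is
used because its default action is to ignore the signal, so raising it
again after the reset is harmless.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t calls = 0;

static void count_handler(int sig)
{
    (void)sig;
    calls++;
}

/* Install count_handler for SIGCHLD, raise SIGCHLD twice, and return how
   many times the handler ran (or -1 on error). sysv != 0 selects the
   one-shot System V style; otherwise the handler stays installed, as
   with BSD signal(). */
static int count_sigchld_calls(int sysv)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = count_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = sysv ? (SA_RESETHAND | SA_NODEFER) : SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) == -1)
        return -1;

    calls = 0;
    raise(SIGCHLD);  /* handler runs before raise() returns */
    raise(SIGCHLD);  /* SysV style: disposition is SIG_DFL by now */
    return calls;
}
```

With the one-shot flags the handler runs once, which matches the
"called exactly once" behaviour described above; with the BSD-style
flags it runs for every signal.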