Current-Users archive


Re: Killing a zombie process?



On Sun, 4 Oct 2015, Robert Elz wrote:

   Date:        Fri, 2 Oct 2015 15:26:42 +0800 (PHT)
   From:        Paul Goyette <paul%vps1.whooppee.com@localhost>
   Message-ID:  <Pine.NEB.4.64.1510021516240.2764%vps1.whooppee.com@localhost>

 | 1. Is it correct for init's p_nstopchild to be zero when it has several
 |     children whose p_state is SSTOP?

Depends whether those children have previously been waited for or not.
Stopped children don't go away when they're waited for, so there needs
to be something to prevent wait() returning the same stopped child
over and over again.   That's p_waited ... so you need to check that
value for the stopped children; if it is 0, then something is broken.
If it is 1 (for all of them) then they're irrelevant, and matter not
at all.
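
As a rough illustration of the check described above, here is a toy user-space model in C. It is only a sketch: struct toy_child, the TOY_* states and both helper functions are made up for this example and are not the kernel's struct proc or any kernel API. It also assumes only that the parent's counter must be at least the number of stopped children with p_waited == 0, since the counter may well include children in other states too.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fields named in the thread. */
enum toy_stat { TOY_SACTIVE, TOY_SSTOP, TOY_SZOMB };

struct toy_child {
	enum toy_stat p_stat;	/* toy equivalent of p_stat */
	int p_waited;		/* 1 once wait() has reported this stop */
};

/* Count the stopped children that wait() has not yet reported. */
static int
count_unwaited_stopped(const struct toy_child *kids, size_t n)
{
	int count = 0;

	assert(kids != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (kids[i].p_stat == TOY_SSTOP && kids[i].p_waited == 0)
			count++;
	}
	return count;
}

/*
 * Return 1 if the parent's counter is at least large enough to cover
 * every unwaited stopped child, 0 if "something is broken".
 * Assumption: the counter may also cover other states, so only a
 * lower bound is checked here.
 */
static int
nstopchild_plausible(int p_nstopchild, const struct toy_child *kids, size_t n)
{
	return p_nstopchild >= count_unwaited_stopped(kids, n);
}
```

With this model, a parent whose counter is zero while any child is stopped with p_waited == 0 fails the check, which is the situation being asked about.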

All of those head-of-sibling-list processes had p_stat == SSTOP and p_waited == 0, and none of them has PSL_TRACED set in p_slflag. And, since init(8) is calling waitpid(..., ..., 0), the value of options is zero, so the following code (immediately before the previously-quoted code, at src/sys/kern/kern_exit.c:780) doesn't trigger:

			if (child->p_stat == SSTOP &&
			    child->p_waited == 0 &&
			    (child->p_slflag & PSL_TRACED ||
			    options & WUNTRACED)) {
				if ((options & WNOWAIT) == 0) {
					child->p_waited = 1;
					parent->p_nstopchild--;
				}
				break;
			}

So "something is broken" ?  :)
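
The effect of options being zero can also be observed from user space. The following is a minimal sketch, not anything taken from init or the kernel: struct probe_result and probe_stopped_child() are names invented for this example, and it assumes a system that provides POSIX waitid() with WSTOPPED and WNOWAIT. It stops a child, then compares what waitpid() reports with and without WUNTRACED.

```c
#define _XOPEN_SOURCE 700	/* request POSIX.1-2008 declarations */
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result record for this sketch. */
struct probe_result {
	pid_t plain;	/* waitpid(..., WNOHANG) while the child is stopped */
	pid_t untraced;	/* first waitpid(..., WUNTRACED | WNOHANG) */
	int stopped;	/* WIFSTOPPED() of the status from that call */
	pid_t again;	/* second waitpid(..., WUNTRACED | WNOHANG) */
};

/* Returns 0 on success, -1 if fork() or waitid() failed. */
static int
probe_stopped_child(struct probe_result *r)
{
	siginfo_t info;
	int status = 0;
	pid_t pid;

	assert(r != NULL);
	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		raise(SIGSTOP);		/* child stops itself */
		_exit(0);
	}

	/* Block until the child has stopped; WNOWAIT leaves it reportable. */
	if (waitid(P_PID, (id_t)pid, &info, WSTOPPED | WNOWAIT) == -1) {
		kill(pid, SIGKILL);
		waitpid(pid, &status, 0);
		return -1;
	}

	/* Without WUNTRACED the stopped child is not reported. */
	r->plain = waitpid(pid, &status, WNOHANG);

	/* With WUNTRACED it is reported, but only once. */
	r->untraced = waitpid(pid, &status, WUNTRACED | WNOHANG);
	r->stopped = WIFSTOPPED(status);
	r->again = waitpid(pid, &status, WUNTRACED | WNOHANG);

	/* Clean up: kill the stopped child and reap it. */
	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);
	return 0;
}
```

If the quoted kernel code behaves as read above, plain and again should both come back as 0, and only the first WUNTRACED call should return the child's pid with WIFSTOPPED() true.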

<Sidebar>
The waitpid() call in init is at src/sbin/init/init.c:1506. Since my Zombie does finally die during a transition from multi-user back down to single-user, I'm guessing that one of the other calls to waitpid() is clearing out the SSTOPed processes at the head of the p_sibling list, perhaps the call in single_user() at line 773?

	...
	requested_transition = 0;
	do {
		if ((wpid = waitpid(-1, &status, WUNTRACED)) != -1)
			collect_child(wpid, status);
	...

</Sidebar>

 | 2. Is the above code in init correct?  Should we really be leaving the
 |     loop when there are more children to examine?

It is an optimisation, and should be correct.

However, it does depend upon p_nstopchild being maintained correctly.

You didn't say whether your zombie process is actually to be found
(somewhere) on the parent's (ie: init's) list of children.

Yes, the zombie was the eighth entry on init's p_sibling list.

Several of the front-of-list processes appeared to be related to some system daemons. (One was related to consolekit, one to dbus.) And the very first child of init seems to have been another copy of init (based on its p_comm[] field)!


I have no idea how one would discover this (at this point, or given
how long you need to wait for it to happen, perhaps ever) but it would
also be interesting to know whether the zombie was reparented to init
before or after it died.

The common case is for a parent to exit, leaving running children, which
are reparented to init, complete, exit, and init cleans them up.

But it is also possible for a child to die, be ignored by its parent,
which later exits itself, leaving the zombie to be reparented to init.
That's more unusual - does not happen very often, but if that is what
happened here, it is possible that there's some bug in the processing
of that case.

Hmmm, probably not possible to differentiate at this point.



+------------------+--------------------------+-------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+------------------+--------------------------+-------------------------+

