Current-Users archive


Re: Killing a zombie process?



On Sun, 4 Oct 2015, Robert Elz wrote:

   Date:        Sun, 4 Oct 2015 17:25:21 +0800 (PHT)
   From:        Paul Goyette <paul%vps1.whooppee.com@localhost>
   Message-ID:  <Pine.NEB.4.64.1510041715370.15041%vps1.whooppee.com@localhost>

 | I'm pretty much convinced that the p_nstopchild accounting is screwed up
 | somewhere.

I think I agree.

 | I'm planning on adding the following code in "optimization"
 | in kern_exit so I can catch it as soon as it happens.

Sooner, but unfortunately, most probably not soon enough.

It is most likely some locking/race condition with multiple processes
dying at the same time (approximately) that is causing some of the
increments to be lost.   Making them all use atomic ops, instead of just ++
might fix the problem, at the cost of never discovering where the issue
actually occurs - there should be locks around all manipulations of
this stuff, possibly one of them is missing or misplaced.
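
For illustration only, the lost-increment scenario described above can be replayed by hand. This is a sketch, not kernel code: the function names are hypothetical, the plain `int` merely stands in for a counter such as p_nstopchild, and the "CPUs" are simulated sequentially so that the unlucky interleaving is deterministic. The second function uses C11 `atomic_fetch_add` as one example of an atomic op; the kernel's own primitives would differ.

```c
#include <stdatomic.h>

/*
 * Replay the interleaving in which two unlocked "n++" operations
 * overlap.  Each "++" is really load, add, store; if both loads
 * happen before either store, one increment is lost.
 */
int lost_update_demo(void)
{
    int n = 0;         /* hypothetical stand-in for a shared counter */
    int cpu0 = n;      /* CPU 0 loads 0 */
    int cpu1 = n;      /* CPU 1 loads 0 before CPU 0 has stored */

    n = cpu0 + 1;      /* CPU 0 stores 1 */
    n = cpu1 + 1;      /* CPU 1 stores 1, overwriting CPU 0's update */
    return n;          /* two increments were attempted */
}

/*
 * The same two increments done as atomic read-modify-write
 * operations: neither can be split, so neither can be lost.
 */
int atomic_update_demo(void)
{
    atomic_int n = 0;

    atomic_fetch_add(&n, 1);
    atomic_fetch_add(&n, 1);
    return atomic_load(&n);
}
```

Here `lost_update_demo()` returns 1 and `atomic_update_demo()` returns 2. As noted above, switching to atomics would hide the symptom without showing which lock is missing or misplaced.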

Yeah, I think that there's a basic accounting problem somewhere, and under extreme load it is more likely for the SSTOPed process to get inserted in the p_children/p_sibling list before the SZOMB process can get reaped. Once the SSTOPed process gets to the front of the line (with the parent's p_nstopchild count at zero), the SZOMB process won't ever get processed. My patch will simply validate this theory.
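
A minimal sketch of the theory above, under stated assumptions: the wait scan walks the child list from the front, returns the first zombie it meets, and gives up early when the parent's count is zero. All `demo_` names are hypothetical and the structures are heavily simplified; this is not the actual kern_exit code, only a model of the suspected failure.

```c
#include <stddef.h>

/* Illustrative states; the names echo SSTOP/SZOMB, the values do not. */
enum demo_state { DEMO_SSTOP, DEMO_SZOMB };

struct demo_child {
    int pid;
    enum demo_state stat;
    struct demo_child *sibling;   /* next entry in the parent's child list */
};

struct demo_parent {
    struct demo_child *children;  /* head of the child list */
    int nstopchild;               /* stand-in for p_nstopchild */
};

/* Return the pid of the first zombie the scan reaches, or -1 if none. */
int demo_find_zombie(const struct demo_parent *pp)
{
    const struct demo_child *c;

    for (c = pp->children; c != NULL; c = c->sibling) {
        if (c->stat == DEMO_SZOMB)
            return c->pid;
        /* Assumed shortcut: a zero count means nothing is left to find. */
        if (pp->nstopchild == 0)
            break;
    }
    return -1;
}

/* Build a two-entry list (pid 10 stopped, pid 20 zombie) and scan it. */
int demo_scan(int stopped_first, int nstopchild)
{
    struct demo_child stopped = { 10, DEMO_SSTOP, NULL };
    struct demo_child zombie  = { 20, DEMO_SZOMB, NULL };
    struct demo_parent parent;

    if (stopped_first) {
        stopped.sibling = &zombie;
        parent.children = &stopped;
    } else {
        zombie.sibling = &stopped;
        parent.children = &zombie;
    }
    parent.nstopchild = nstopchild;
    return demo_find_zombie(&parent);
}
```

With the stopped child at the head and a count that has wrongly dropped to zero, `demo_scan(1, 0)` returns -1 and the zombie is never seen. With a correct count, `demo_scan(1, 2)` returns 20, and with the zombie at the head, `demo_scan(0, 0)` also returns 20.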

(BTW, the patch is actually wrong, as it would also panic in the case where the wait was for a specific pid. I've modified it in my new kernel - not yet tested.)

It is unlikely to be in the wait processing (at least not this one) as
there's just one process doing the waiting, there would be no contention
for the accesses here (it could be a combination of the two though,
wait() happening at the same instant a process is dying).

See above.

I'm also puzzled by your observations of forked init processes having
exited - after rc is finished, init generally only forks when one of the
console/terminal sessions ends, and a new getty needs to be started.
On most modern systems, that's a very rare event - though if you use
the console (ctl-alt-Fn or whatever it is) switching, and login and out
of those (virtual) terminals, it would happen.  Is there anything like
that in your environment?

I do occasionally switch to another wsdisplay screen (away from the X one), but not frequently. I definitely do a switch before I use Ctrl/Alt/Esc to get into ddb.

I'm wondering if some (most? all?) of the SSTOPed processes I see are a result of entering ddb and/or triggering the reboot. Doesn't ddb need to stop whatever is running on "the other CPU cores"?



+------------------+--------------------------+-------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+------------------+--------------------------+-------------------------+

