Current-Users archive


Re: Killing a zombie process?



On Fri, 2 Oct 2015, Paul Goyette wrote:

For now, I took a quick look into the zombie's struct proc.

	p_exitsig = 0x14   = SIGCHLD
	p_flag    = 0x0
	p_sflag   = 0x2000 = PS_WEXIT
	p_slflag  = 0x0
	p_lflag   = 0x2    = PL_CONTROLT
	p_stflag  = 0x0
	p_stat    = 0x5    = SZOMB

	p_trace_enabled = 0x0
	p_pid     = 0x5280 = 21120 (the same value shown by ps)

I don't see anything unusual here.
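
For anyone who wants to re-check the decoding by hand, here is a minimal sketch. The constants are simply the values quoted above (they are assumed, not copied from this kernel's headers), and describe_zombie() and the X_ names are hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Values as quoted in this message; assumed to match this kernel's headers. */
#define X_SIGCHLD     20      /* p_exitsig, 0x14 */
#define X_PS_WEXIT    0x2000  /* bit in p_sflag */
#define X_PL_CONTROLT 0x2     /* bit in p_lflag */
#define X_SZOMB       5       /* p_stat */

/*
 * Hypothetical helper: writes the symbolic names that apply to the
 * given field values into buf and returns buf.
 */
static const char *
describe_zombie(int exitsig, int sflag, int lflag, int stat,
    char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "%s%s%s%s",
	    stat == X_SZOMB ? "SZOMB " : "",
	    (sflag & X_PS_WEXIT) ? "PS_WEXIT " : "",
	    (lflag & X_PL_CONTROLT) ? "PL_CONTROLT " : "",
	    exitsig == X_SIGCHLD ? "SIGCHLD" : "");
	return buf;
}
```

Calling describe_zombie(0x14, 0x2000, 0x2, 0x5, buf, sizeof(buf)) with the values from the dump names all four symbols, which matches the annotations in the list.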

I have attached the hex-dump in case anyone wants to look a little bit closer.

OK, I forced a system crash (using ddb's sync command), and here's what gdb says about the zombie's struct proc. (I manually inserted line breaks for readability and added some flag value annotations.)

(gdb) print (struct proc *) 0xfffffe81f578ba70
$1 = (struct proc *) 0xfffffe81f578ba70
(gdb) print *(struct proc *) 0xfffffe81f578ba70
$2 = {
 p_list = {le_next = 0x0, le_prev = 0xffffffff806be700 <zombproc>},
 p_auxlock = {u = {mtxa_owner = 0}},
 p_lock = 0xfffffe81fbb7a840,
 p_stmutex = {u = {mtxa_owner = 2049}},
 p_reflock = {rw_owner = 0},
 p_waitcv = {cv_opaque = {0x0, 0xfffffe81f578baa0, 0xffffffff804d542e}},
 p_lwpcv = {cv_opaque = {0x0, 0xfffffe81f578bab8, 0xffffffff804e7f9a}},
 p_cred = 0xfffffe81ef0106c0,
 p_fd = 0xfffffe810f46f680,
 p_cwdi = 0x0,
 p_stats = 0xfffffe81e00b5700,
 p_limit = 0xfffffe8155fe8de8,
 p_vmspace = 0xffffffff80722de0 <vmspace0>,
 p_sigacts = 0xfffffe803be9b258,
 p_aio = 0x0,
 p_mqueue_cnt = 0,
 p_specdataref = {
   specdataref_container = 0x0,
   specdataref_lock = {u = {mtxa_owner = 18446744073709551600}}},
 p_exitsig = 20,
 p_flag = 0,
 p_sflag = 8192 <PS_WEXIT>,
 p_slflag = 0,
 p_lflag = 2 <PL_CONTROLT>,
 p_stflag = 0,
 p_stat = 5 '\005' <SZOMB>,
 p_trace_enabled = 0 '\000',
 p_pad1 = "\203",
 p_pid = 21120,
 p_pglist = {
   le_next = 0x0,
   le_prev = 0xfffffe81eab655b0},
 p_pptr = 0xfffffe810f45ecd0,
 p_sibling = {
   le_next = 0xfffffe81f7618d20, le_prev = 0xfffffe81fc805108},
 p_children = {lh_first = 0x0},
 p_lwps = {lh_first = 0xfffffe8021ccb560},
 p_raslist = 0x0,
 p_nlwps = 1,
 p_nzlwps = 1,
 p_nrlwps = 0,
 p_nlwpwait = 0,
 p_ndlwps = 0,
 p_nlwpid = 1,
 p_nstopchild = 0,
 p_waited = 0,
 p_zomblwp = 0x0,
 p_vforklwp = 0x0,
 p_sched_info = 0x0,
 p_estcpu = 0,
 p_estcpu_inherited = 36864,
 p_forktime = 17842,
 p_pctcpu = 0,
 p_opptr = 0x0,
 p_timers = 0x0,
 p_rtime = {sec = 0, frac = 0},
 p_uticks = 0,
 p_sticks = 0,
 p_iticks = 0,
 p_traceflag = 0,
 p_tracep = 0x0,
 p_textvp = 0xfffffe81e6023190,
 p_emul = 0xffffffff806b6300 <emul_netbsd>,
 p_emuldata = 0x0,
 p_execsw = 0xffffffff808be0e0,
 p_klist = { slh_first = 0x0},
 p_sigwaiters = {lh_first = 0x0},
 p_sigpend = {
   sp_info = {tqh_first = 0x0, tqh_last = 0xfffffe81f578bc48},
   sp_set = {__bits = {0, 0, 0, 0}}},
 p_lwpctl = 0x0,
 p_ppid = 1,
 p_fpid = 0,
 p_sigctx = {
   ps_signo = 0, ps_code = 0, ps_lwp = 0, ps_sigcode = 0x0,
   ps_sigignore = {__bits = {4294967295, 4294967295, 4294967295, 4294967295}},
   ps_sigcatch = {__bits = {0, 0, 0, 0}}},
 p_nice = 20 '\024',
 p_comm = "sh\000ke", '\000' <repeats 11 times>,
 p_pgrp = 0xfffffe81eab655b0,
 p_psstrp = 140187732541408,
 p_pax = 0,
 p_xstat = 0,
 p_acflag = 1,
 p_md = {md_flags = 0, md_syscall = 0xffffffff8012f010 <syscall>},
 p_stackbase = 140187732541440,
 p_dtrace = 0x7f7ff683b8e6}

As far as I can tell, everything looks normal. Yet the process never gets reaped by init.

The one thing that surprises me here is that the zombie still has a non-NULL p_textvp, which would point to /bin/sh _within_ the chroot() sandbox (consistent with the p_comm = "sh" entry). I'm guessing that this reference is what's preventing me from unmounting this nullfs mount. (I had previously expected the inability to unmount to be the result of a reference from the zombie's cwd, but p_cwdi is already NULL.)

Still investigating, but I think I may have found something...

Using the p_pptr value 0xfffffe810f45ecd0 from the zombie's struct proc, I examined the struct proc for init. I followed the code in the find_stopped_child() routine in src/sys/kern/kern_exit.c and walked through the loop for each of init's children. The first several processes all have p_stat = 4 (SSTOP), yet init's p_nstopchild count is zero!

This seems to cause the loop in find_stopped_child() to exit early (at line 790):

                 if (parent->p_nstopchild == 0 || child->p_pid == pid) {
                         child = NULL;
                         break;
                 }

(Here, parent points to init's struct proc, child is the struct proc obtained from walking the p_children list, and pid is the argument passed to the wait4() syscall. init passes the value WAIT_ANY, i.e. -1.)
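
To make the suspected early exit concrete, here is a toy model of it, as a sketch only. It is not the kernel code: struct toy_proc, toy_find_child(), and the X_ names are hypothetical, and the zombie check placed ahead of the quoted test is an assumption about what the rest of the loop does.

```c
#include <assert.h>
#include <stddef.h>

/* p_stat values as quoted in this thread. */
enum { X_SSTOP = 4, X_SZOMB = 5 };
#define X_WAIT_ANY (-1)

/* Hypothetical stand-in for the few struct proc fields that matter here. */
struct toy_proc {
	int pid;
	int stat;
	int nstopchild;             /* only meaningful for the parent */
	struct toy_proc *sibling;   /* next entry in the parent's child list */
	struct toy_proc *children;  /* head of this process's child list */
};

/* Returns the first zombie child reached, or NULL if the walk gives up. */
static struct toy_proc *
toy_find_child(struct toy_proc *parent, int pid)
{
	struct toy_proc *child;

	assert(parent != NULL);
	for (child = parent->children; child != NULL; child = child->sibling) {
		/* Assumed: a zombie would be picked up when the walk reaches it. */
		if (child->stat == X_SZOMB)
			return child;
		/* The test quoted above: stops the walk at the first child. */
		if (parent->nstopchild == 0 || child->pid == pid)
			return NULL;
	}
	return NULL;
}
```

With a child list of two stopped processes followed by the zombie, and the parent's nstopchild at 0, toy_find_child(&parent, X_WAIT_ANY) returns NULL without ever reaching the zombie. Setting nstopchild to a non-zero value lets the walk continue and find it.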

Questions:

1. Is it correct for init's p_nstopchild to be zero when it has several
   children whose p_stat is SSTOP?

2. Is the above code in find_stopped_child() correct?  Should we really
   be leaving the loop when there are more children to examine?





+------------------+--------------------------+-------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+------------------+--------------------------+-------------------------+

