NetBSD-Bugs archive
kern/46402: LWPs created after exit_lwp() is called can hang the process....
>Number: 46402
>Category: kern
>Synopsis: LWPs created after exit_lwp() is called can hang the process...
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed May 02 21:30:00 +0000 2012
>Originator: Greg Oster
>Release: NetBSD 6.0_BETA
>Organization:
>Environment:
System: NetBSD mickey 6.0_BETA NetBSD 6.0_BETA (QUAD) #0: Wed Apr 25 11:03:11 CST 2012 oster@quad:/u1/builds/build211/src/obj/amd64/u1/builds/build211/src/sys/arch/amd64/compile/QUAD amd64
Architecture: x86_64
Machine: amd64
>Description:
While running an extended series of atf tests I noticed that the tests
were occasionally hanging. It turns out they were hanging at the same
place, and that the hang is repeatable. Running the following:
    cd /usr/tests/lib/libpthread
    foreach i (`jot 10000`)
        ./t_cond cond_timedwait_race
        echo $i
    end
on a NetBSD-6.0_BETA amd64 Xen DOMU (XEN3_DOMU kernel) I observed
that, on occasion, the t_cond process would hang. As viewed from ddb,
the lwps associated with the t_cond process look like:
    PID  LID  S  CPU  FLAGS     STRUCT LWP *      NAME    WAIT
    402  11   8  2    0         ffffa00004ca2b40  t_cond
    402  9    3  2    10000000  ffffa00004ca5740  t_cond  lwpwait
After some extensive debugging, it seems that:
1) lid 11 above was created *after* exit_lwp() had been called for
the first time for pid 402.
2) lid 11 was created in LSSUSPENDED state.
3) in lwp_wait1() the second lwp (lid 9 above) goes to sleep,
expecting that the first lwp (lid 11) will eventually wake it up.
4) the first lwp (lid 11) is never woken up.
5) the two lwps remain as above until the system is rebooted.
6) the second lwp is jammed in the

    if (exiting) {
        KASSERT(p->p_nlwps > 1);
        cv_wait(&p->p_lwpcv, p->p_lock);
        continue;
    }

portion of lwp_wait1().
7) The t_cond process cannot be killed.
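
The shape of the hang described in points 3) to 6) can be sketched in
user space. This is only an analogy built on pthreads, not kernel code
and not a diagnosis: struct model_proc, model_lwp_exit() and
model_waiter_hangs() are hypothetical names invented for this sketch,
and a timed wait stands in for cv_wait() so that the demonstration
terminates instead of blocking forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the few struct proc fields involved. */
struct model_proc {
    pthread_mutex_t lock;    /* plays the role of p->p_lock */
    pthread_cond_t  lwpcv;   /* plays the role of p->p_lwpcv */
    int             nlwps;   /* plays the role of p->p_nlwps */
    bool            exiting; /* process-wide exit in progress */
};

/* What the other LWP would do if it ever ran: leave, then wake waiters. */
static void
model_lwp_exit(struct model_proc *p)
{
    pthread_mutex_lock(&p->lock);
    p->nlwps--;
    pthread_cond_broadcast(&p->lwpcv);
    pthread_mutex_unlock(&p->lock);
}

/*
 * Model of the waiter (lid 9 in the report): sleep on lwpcv while the
 * process is exiting and another LWP remains.  Returns true if the
 * timeout expired without the other LWP leaving, which corresponds to
 * the indefinite sleep seen in the report.
 */
static bool
model_waiter_hangs(struct model_proc *p, int timeout_ms)
{
    struct timespec deadline;
    bool hung = false;

    assert(timeout_ms > 0);

    /* Build an absolute deadline; the default condvar clock is realtime. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&p->lock);
    while (p->exiting && p->nlwps > 1) {
        if (pthread_cond_timedwait(&p->lwpcv, &p->lock,
            &deadline) == ETIMEDOUT) {
            hung = true;    /* nobody left to deliver the wakeup */
            break;
        }
    }
    pthread_mutex_unlock(&p->lock);
    return hung;
}
```

With nlwps set to 2 and exiting set to true, model_waiter_hangs()
reports a hang as long as nothing calls model_lwp_exit(), which is the
position a suspended lid 11 would leave lid 9 in. Once model_lwp_exit()
has been called, the waiter returns immediately.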
The issue is also seen on real hardware (an i7-2600 box), also running netbsd-6.
The issue is very much a race condition of some sort, as sometimes
the above testing loop will not get past the first test case, and
other times it can get through up to a few hundred (or even thousand)
test cases without failing.
Additional details are available upon request.
>How-To-Repeat:
    cd /usr/tests/lib/libpthread
    foreach i (`jot 10000`)
        ./t_cond cond_timedwait_race
        echo $i
    end
    /* wait for t_cond to hang and be unkillable */
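
Because the loop above simply stops producing output when t_cond hangs,
a small wrapper that flags a run exceeding a time limit may make
unattended testing easier. The following is a minimal sketch under
stated assumptions: run_with_timeout() is a hypothetical helper, not
part of atf or the NetBSD test suite, and the time limit is an
arbitrary choice left to the caller. A process that is still running at
the limit is deliberately left alone so that it can be inspected from
ddb.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Run argv[0] with the given arguments and poll for its exit.
 * Returns 0 if the child exited within timeout_sec seconds,
 * 1 if it was still running at the limit (suspected hang),
 * and -1 if fork() or waitpid() failed.
 */
static int
run_with_timeout(char *const argv[], int timeout_sec)
{
    const struct timespec tick = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);    /* exec failed */
    }

    for (int waited_ms = 0; waited_ms < timeout_sec * 1000;
        waited_ms += 100) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);

        if (r == pid)
            return 0;    /* child finished in time */
        if (r == -1)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Still running: leave the process in place for inspection. */
    return 1;
}
```

A driver could call run_with_timeout() in a loop with an argv of
{ "./t_cond", "cond_timedwait_race", NULL } and stop at the first
return value of 1.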
>Fix:
Unknown.