NetBSD-Bugs archive


kern/46402: LWPs created after exit_lwp() is called can hang the process...



>Number:         46402
>Category:       kern
>Synopsis:       LWPs created after exit_lwp() is called can hang the process...
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed May 02 21:30:00 +0000 2012
>Originator:     Greg Oster
>Release:        NetBSD 6.0_BETA
>Organization:
>Environment:
System: NetBSD mickey 6.0_BETA NetBSD 6.0_BETA (QUAD) #0: Wed Apr 25 11:03:11 CST 2012 oster@quad:/u1/builds/build211/src/obj/amd64/u1/builds/build211/src/sys/arch/amd64/compile/QUAD amd64
Architecture: x86_64
Machine: amd64
>Description:


While running an extended series of ATF tests I noticed that the tests were
occasionally hanging.  It turns out they were hanging at the same place,
and that the hang is reproducible.  Running the following:

 cd /usr/tests/lib/libpthread
 foreach i (`jot 10000`)
 ./t_cond cond_timedwait_race 
 echo $i
 end

on a NetBSD-6.0_BETA amd64 Xen DOMU (XEN3_DOMU kernel) I observed
that, on occasion, the t_cond process would hang.  As viewed from ddb,
the lwps associated with the t_cond process look like:  

 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 402     11 8   2         0   ffffa00004ca2b40             t_cond
 402      9 3   2  10000000   ffffa00004ca5740             t_cond lwpwait

After some extensive debugging, it seems that:
 1) lid 11 above was created *after* exit_lwp() had been called for
 the first time for pid 402.
 2) lid 11 was created in LSSUSPENDED state.
 3) in lwp_wait1() the second lwp (lid 9 above) goes to sleep,
 expecting that the first lwp (lid 11 above) will eventually wake it up.
 4) the first lwp is never woken up.
 5) the two lwps remain as above until the system is rebooted.
 6) the second lwp is stuck in the following portion of lwp_wait1():

         if (exiting) {
                 KASSERT(p->p_nlwps > 1);
                 cv_wait(&p->p_lwpcv, p->p_lock);
                 continue;
         }

 7) The t_cond process cannot be killed.

 The issue is also seen on real hardware (an i7-2600 box), likewise
 running netbsd-6.

 The issue is very much a race condition of some sort, as sometimes
 the above testing loop will not get past the first test case, and
 other times it can get through up to a few hundred (or even thousand)
 test cases without failing.

 Additional details are available upon request.

>How-To-Repeat:

 cd /usr/tests/lib/libpthread
 foreach i (`jot 10000`)
 ./t_cond cond_timedwait_race 
 echo $i
 end

 /* wait for t_cond to hang and become unkillable */
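
 The loop above has to be watched by hand.  A minimal sketch of an
 unattended variant follows, written in POSIX sh rather than the csh
 syntax used above.  The function names run_with_deadline and
 repeat_until_hang, the one-second polling interval and the 124 return
 value are assumptions made for illustration; none of them are part of
 the test suite.  Because a hung t_cond cannot be killed, the sketch only
 reports the stuck pid and stops.  It does not try to kill the process or
 wait for it.

```shell
#!/bin/sh
# Sketch only: helper names, the 124 status and the polling interval are
# illustrative assumptions, not part of the NetBSD test suite.

# run_with_deadline SECONDS CMD [ARGS...]
# Runs CMD in the background and polls it once per second.  Returns CMD's
# exit status, or 124 if CMD is still running after SECONDS seconds.
run_with_deadline() {
    rd_limit=$1
    shift
    "$@" &
    rd_pid=$!
    rd_elapsed=0
    # kill -0 sends no signal; it only checks that the pid still exists.
    while kill -0 "$rd_pid" 2>/dev/null; do
        if [ "$rd_elapsed" -ge "$rd_limit" ]; then
            echo "pid $rd_pid still running after ${rd_limit}s" >&2
            return 124
        fi
        sleep 1
        rd_elapsed=$((rd_elapsed + 1))
    done
    # The child has already exited, so this only collects its status.
    wait "$rd_pid"
}

# repeat_until_hang COUNT SECONDS CMD [ARGS...]
# Runs CMD up to COUNT times, printing the iteration number after each run
# that exits.  Stops with status 124 at the first run that does not exit
# within SECONDS seconds.  Other non-zero exit statuses are ignored here.
repeat_until_hang() {
    rh_count=$1
    rh_limit=$2
    shift 2
    rh_i=1
    while [ "$rh_i" -le "$rh_count" ]; do
        rh_rc=0
        run_with_deadline "$rh_limit" "$@" || rh_rc=$?
        if [ "$rh_rc" -eq 124 ]; then
            echo "iteration $rh_i: possible hang, stopping" >&2
            return 124
        fi
        echo "$rh_i"
        rh_i=$((rh_i + 1))
    done
    return 0
}
```

 From /usr/tests/lib/libpthread, an invocation such as
 "repeat_until_hang 10000 60 ./t_cond cond_timedwait_race" (the 60-second
 limit is a guess) would then stop at the first iteration that fails to
 exit, leaving the stuck process in place for inspection from ddb.  A
 command that itself exits with status 124 would be misreported as a
 hang, which is a known limitation of this sketch.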

>Fix:

 Unknown. 




