NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable



On Sat, Mar 10, 2012 at 04:50:00PM +0000, Manuel.Bouyer%lip6.fr@localhost wrote:
>       /usr/tests/lib/libpthread/t_cond:cond_timedwait_race sometimes hang,
>       with an unkillable t_cond process:


Some more details from another instance:
db{0}> ps
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
29181   21 8   3   1000000   fffffe815c841000             t_cond
29181   10 3   2  11000000   fffffe815d1c1000             t_cond lwpwait

0x01000000 is LW_PENDSIG, 0x10000000 is LW_UNPARKED.
LW_WEXIT (0x00100000) is NOT set.

db{0}> tr/a fffffe815c841000
trace: pid 29181 lid 21 at 0x0
db{0}> tr/a fffffe815d1c1000
trace: pid 29181 lid 10 at 0xfffffe810ccfca10
sleepq_block() at netbsd:sleepq_block+0xa4
cv_wait() at netbsd:cv_wait+0x101
lwp_wait1() at netbsd:lwp_wait1+0x274
exit_lwps() at netbsd:exit_lwps+0x108
exit1() at netbsd:exit1+0x5c
sys_exit() at netbsd:sys_exit+0x3e
syscall() at netbsd:syscall+0xac

db{0}> x/Lx 0xfffffe815c8411a8
fffffe815c8411a8:       fffffe81810a8aa0
db{0}> x/Lx fffffe815d1c11a8
fffffe815d1c11a8:       fffffe81810a8aa0

struct proc is at 0xfffffe81810a8aa0
db{0}> x/Lx 0xfffffe81810a8ba8
fffffe81810a8ba8:       fffffe815c841000

This is p_lwps, pointing to 29181.21
db{0}> x/Lx 0xfffffe815c8411b0
fffffe815c8411b0:       fffffe815d1c1000
this is l_sibling from 29181.21, pointing to 29181.10
db{0}> x/Lx 0xfffffe815d1c11b0
fffffe815d1c11b0:       0
the l_sibling list ends here

db{0}> x/x 0xfffffe815c8411c0  
fffffe815c8411c0:       0 #l_waiter
db{0}> x/x 0xfffffe815c8411c4
fffffe815c8411c4:       0 #l_waitingfor
db{0}> x/x 0xfffffe815c8411cc
fffffe815c8411cc:       1 #l_refcnt
db{0}> x/x 0xfffffe815c8411d0
fffffe815c8411d0:       15 #l_lid

db{0}> x/x 0xfffffe815d1c11c0  
fffffe815d1c11c0:       0 #l_waiter
db{0}> x/x 0xfffffe815d1c11c4
fffffe815d1c11c4:       0 #l_waitingfor
db{0}> x/x 0xfffffe815d1c11cc
fffffe815d1c11cc:       1 #l_refcnt
db{0}> x/x 0xfffffe815d1c11d0
fffffe815d1c11d0:       a #l_lid

db{0}> x/x 0xfffffe81810a8bb8
fffffe81810a8bb8:       2 #p_nlwps
db{0}> x/x 0xfffffe81810a8bbc
fffffe81810a8bbc:       0 #p_nzlwps
db{0}> x/x 0xfffffe81810a8bc0
fffffe81810a8bc0:       1 #p_nrlwps
db{0}> x/x 0xfffffe81810a8bc4
fffffe81810a8bc4:       1 #p_nlwpwait
db{0}> x/x 0xfffffe81810a8bc8
fffffe81810a8bc8:       2 #p_ndlwps

I think that the missing LW_WEXIT for 29181.21 is the cause of the problem.
lwp_wait1() can release and reaquire p_lock, even when returning 0.
I guess a new lwp can be created while p_lock is released, and so the list
of lwps always needs to be checked again.
The attached patch does this: it moves setting LW_EXIT inside the
while (p->p_nlwps > 1) {} loop; so that if a new lwp is added, it can be 
properly flagged before waiting for it.

I've done 2 runs of atf-run in /usr/tests, no tests are failing after
this change (at last none that was not failing before), and I couldn't
make t_cond hang again.

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 ans d'experience feront toujours la difference
--
Index: kern/kern_exit.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.237
diff -u -p -u -r1.237 kern_exit.c
--- kern/kern_exit.c    19 Feb 2012 21:06:50 -0000      1.237
+++ kern/kern_exit.c    14 Mar 2012 16:22:47 -0000
@@ -598,44 +598,42 @@ exit_lwps(struct lwp *l)
 
        p = l->l_proc;
        KASSERT(mutex_owned(p->p_lock));
-retry:
-       /*
-        * Interrupt LWPs in interruptable sleep, unsuspend suspended
-        * LWPs and then wait for everyone else to finish.
-        */
-       LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
-               if (l2 == l)
-                       continue;
-               lwp_lock(l2);
-               l2->l_flag |= LW_WEXIT;
-               if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) ||
-                   l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
-                       /* setrunnable() will release the lock. */
-                       setrunnable(l2);
-                       DPRINTF(("exit_lwps: Made %d.%d runnable\n",
-                           p->p_pid, l2->l_lid));
-                       continue;
-               }
-               lwp_unlock(l2);
-       }
        while (p->p_nlwps > 1) {
+               /*
+                * Interrupt LWPs in interruptable sleep, unsuspend suspended
+                * LWPs and then wait for everyone else to finish.
+                */
+               LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
+                       if (l2 == l)
+                               continue;
+                       lwp_lock(l2);
+                       l2->l_flag |= LW_WEXIT;
+                       if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) 
||
+                           l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
+                               /* setrunnable() will release the lock. */
+                               setrunnable(l2);
+                               DPRINTF(("exit_lwps: Made %d.%d runnable\n",
+                                   p->p_pid, l2->l_lid));
+                               continue;
+                       }
+                       lwp_unlock(l2);
+               }
                DPRINTF(("exit_lwps: waiting for %d LWPs (%d zombies)\n",
                    p->p_nlwps, p->p_nzlwps));
                error = lwp_wait1(l, 0, &waited, LWPWAIT_EXITCONTROL);
                if (p->p_nlwps == 1)
                        break;
-               if (error == EDEADLK) {
-                       /*
-                        * LWPs can get suspended/slept behind us.
-                        * (eg. sa_setwoken)
-                        * kick them again and retry.
-                        */
-                       goto retry;
-               }
-               if (error)
+               if (error == 0)
+                       DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n",
+                           waited));
+               else if (error != EDEADLK)
                        panic("exit_lwps: lwp_wait1 failed with error %d",
                            error);
-               DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n", waited));
+               /*
+                * LWPs can get suspended/slept behind us.
+                * (eg. sa_setwoken)
+                * kick them again and retry.
+                */
        }
 
        KERNEL_LOCK(nlocks, l);


Home | Main Index | Thread Index | Old Index