NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable
The following reply was made to PR kern/46168; it has been noted by GNATS.
From: David Holland <dholland-bugs%netbsd.org@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable
Date: Sat, 17 Mar 2012 13:12:49 +0000
(sent to gnats-admin instead of gnats-bugs)
------
From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
To: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost,
netbsd-bugs%NetBSD.org@localhost
Subject: Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable
Date: Wed, 14 Mar 2012 17:35:32 +0100
On Sat, Mar 10, 2012 at 04:50:00PM +0000, Manuel.Bouyer%lip6.fr@localhost
wrote:
> /usr/tests/lib/libpthread/t_cond:cond_timedwait_race sometimes hang,
> with an unkillable t_cond process:
Some more details from another instance:
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
29181 21 8 3 1000000 fffffe815c841000 t_cond
29181 10 3 2 11000000 fffffe815d1c1000 t_cond lwpwait
0x01000000 is LW_PENDSIG, 0x10000000 is LW_UNPARKED.
LW_WEXIT (0x00100000) is NOT set.
db{0}> tr/a fffffe815c841000
trace: pid 29181 lid 21 at 0x0
db{0}> tr/a fffffe815d1c1000
trace: pid 29181 lid 10 at 0xfffffe810ccfca10
sleepq_block() at netbsd:sleepq_block+0xa4
cv_wait() at netbsd:cv_wait+0x101
lwp_wait1() at netbsd:lwp_wait1+0x274
exit_lwps() at netbsd:exit_lwps+0x108
exit1() at netbsd:exit1+0x5c
sys_exit() at netbsd:sys_exit+0x3e
syscall() at netbsd:syscall+0xac
db{0}> x/Lx 0xfffffe815c8411a8
fffffe815c8411a8: fffffe81810a8aa0
db{0}> x/Lx fffffe815d1c11a8
fffffe815d1c11a8: fffffe81810a8aa0
struct proc is at 0xfffffe81810a8aa0
db{0}> x/Lx 0xfffffe81810a8ba8
fffffe81810a8ba8: fffffe815c841000
This is p_lwps, pointing to 29181.21
db{0}> x/Lx 0xfffffe815c8411b0
fffffe815c8411b0: fffffe815d1c1000
this is l_sibling from 29181.21, pointing to 29181.10
db{0}> x/Lx 0xfffffe815d1c11b0
fffffe815d1c11b0: 0
the l_sibling list ends here
db{0}> x/x 0xfffffe815c8411c0
fffffe815c8411c0: 0 #l_waiter
db{0}> x/x 0xfffffe815c8411c4
fffffe815c8411c4: 0 #l_waitingfor
db{0}> x/x 0xfffffe815c8411cc
fffffe815c8411cc: 1 #l_refcnt
db{0}> x/x 0xfffffe815c8411d0
fffffe815c8411d0: 15 #l_lid
db{0}> x/x 0xfffffe815d1c11c0
fffffe815d1c11c0: 0 #l_waiter
db{0}> x/x 0xfffffe815d1c11c4
fffffe815d1c11c4: 0 #l_waitingfor
db{0}> x/x 0xfffffe815d1c11cc
fffffe815d1c11cc: 1 #l_refcnt
db{0}> x/x 0xfffffe815d1c11d0
fffffe815d1c11d0: a #l_lid
db{0}> x/x 0xfffffe81810a8bb8
fffffe81810a8bb8: 2 #p_nlwps
db{0}> x/x 0xfffffe81810a8bbc
fffffe81810a8bbc: 0 #p_nzlwps
db{0}> x/x 0xfffffe81810a8bc0
fffffe81810a8bc0: 1 #p_nrlwps
db{0}> x/x 0xfffffe81810a8bc4
fffffe81810a8bc4: 1 #p_nlwpwait
db{0}> x/x 0xfffffe81810a8bc8
fffffe81810a8bc8: 2 #p_ndlwps
I think that the missing LW_WEXIT for 29181.21 is the cause of the problem.
lwp_wait1() can release and reaquire p_lock, even when returning 0.
I guess a new lwp can be created while p_lock is released, and so the list
of lwps always needs to be checked again.
The attached patch does this: it moves setting LW_EXIT inside the
while (p->p_nlwps > 1) {} loop; so that if a new lwp is added, it can be
properly flagged before waiting for it.
I've done 2 runs of atf-run in /usr/tests, no tests are failing after
this change (at last none that was not failing before), and I couldn't
make t_cond hang again.
--
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
NetBSD: 26 ans d'experience feront toujours la difference
--
Index: kern/kern_exit.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.237
diff -u -p -u -r1.237 kern_exit.c
--- kern/kern_exit.c 19 Feb 2012 21:06:50 -0000 1.237
+++ kern/kern_exit.c 14 Mar 2012 16:22:47 -0000
@@ -598,44 +598,42 @@ exit_lwps(struct lwp *l)
p = l->l_proc;
KASSERT(mutex_owned(p->p_lock));
-retry:
- /*
- * Interrupt LWPs in interruptable sleep, unsuspend suspended
- * LWPs and then wait for everyone else to finish.
- */
- LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
- if (l2 == l)
- continue;
- lwp_lock(l2);
- l2->l_flag |= LW_WEXIT;
- if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) ||
- l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
- /* setrunnable() will release the lock. */
- setrunnable(l2);
- DPRINTF(("exit_lwps: Made %d.%d runnable\n",
- p->p_pid, l2->l_lid));
- continue;
- }
- lwp_unlock(l2);
- }
while (p->p_nlwps > 1) {
+ /*
+ * Interrupt LWPs in interruptable sleep, unsuspend suspended
+ * LWPs and then wait for everyone else to finish.
+ */
+ LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
+ if (l2 == l)
+ continue;
+ lwp_lock(l2);
+ l2->l_flag |= LW_WEXIT;
+ if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR))
||
+ l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
+ /* setrunnable() will release the lock. */
+ setrunnable(l2);
+ DPRINTF(("exit_lwps: Made %d.%d runnable\n",
+ p->p_pid, l2->l_lid));
+ continue;
+ }
+ lwp_unlock(l2);
+ }
DPRINTF(("exit_lwps: waiting for %d LWPs (%d zombies)\n",
p->p_nlwps, p->p_nzlwps));
error = lwp_wait1(l, 0, &waited, LWPWAIT_EXITCONTROL);
if (p->p_nlwps == 1)
break;
- if (error == EDEADLK) {
- /*
- * LWPs can get suspended/slept behind us.
- * (eg. sa_setwoken)
- * kick them again and retry.
- */
- goto retry;
- }
- if (error)
+ if (error == 0)
+ DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n",
+ waited));
+ else if (error != EDEADLK)
panic("exit_lwps: lwp_wait1 failed with error %d",
error);
- DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n", waited));
+ /*
+ * LWPs can get suspended/slept behind us.
+ * (eg. sa_setwoken)
+ * kick them again and retry.
+ */
}
KERNEL_LOCK(nlocks, l);
Home |
Main Index |
Thread Index |
Old Index