NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable



The following reply was made to PR kern/46168; it has been noted by GNATS.

From: David Holland <dholland-bugs%netbsd.org@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable
Date: Sat, 17 Mar 2012 13:12:49 +0000

 (sent to gnats-admin instead of gnats-bugs)
 
    ------
 
 From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
 To: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost, 
netbsd-bugs%NetBSD.org@localhost
 Subject: Re: kern/46168: /usr/tests/lib/libpthread/t_cond somtimes unkillable
 Date: Wed, 14 Mar 2012 17:35:32 +0100
 
 On Sat, Mar 10, 2012 at 04:50:00PM +0000, Manuel.Bouyer%lip6.fr@localhost 
wrote:
 >      /usr/tests/lib/libpthread/t_cond:cond_timedwait_race sometimes hang,
 >      with an unkillable t_cond process:
 
 
 Some more details from another instance:
 db{0}> ps
 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 29181   21 8   3   1000000   fffffe815c841000             t_cond
 29181   10 3   2  11000000   fffffe815d1c1000             t_cond lwpwait
 
 0x01000000 is LW_PENDSIG, 0x10000000 is LW_UNPARKED.
 LW_WEXIT (0x00100000) is NOT set.
 
 db{0}> tr/a fffffe815c841000
 trace: pid 29181 lid 21 at 0x0
 db{0}> tr/a fffffe815d1c1000
 trace: pid 29181 lid 10 at 0xfffffe810ccfca10
 sleepq_block() at netbsd:sleepq_block+0xa4
 cv_wait() at netbsd:cv_wait+0x101
 lwp_wait1() at netbsd:lwp_wait1+0x274
 exit_lwps() at netbsd:exit_lwps+0x108
 exit1() at netbsd:exit1+0x5c
 sys_exit() at netbsd:sys_exit+0x3e
 syscall() at netbsd:syscall+0xac
 
 db{0}> x/Lx 0xfffffe815c8411a8
 fffffe815c8411a8:       fffffe81810a8aa0
 db{0}> x/Lx fffffe815d1c11a8
 fffffe815d1c11a8:       fffffe81810a8aa0
 
 struct proc is at 0xfffffe81810a8aa0
 db{0}> x/Lx 0xfffffe81810a8ba8
 fffffe81810a8ba8:       fffffe815c841000
 
 This is p_lwps, pointing to 29181.21
 db{0}> x/Lx 0xfffffe815c8411b0
 fffffe815c8411b0:       fffffe815d1c1000
 this is l_sibling from 29181.21, pointing to 29181.10
 db{0}> x/Lx 0xfffffe815d1c11b0
 fffffe815d1c11b0:       0
 the l_sibling list ends here
 
 db{0}> x/x 0xfffffe815c8411c0  
 fffffe815c8411c0:       0 #l_waiter
 db{0}> x/x 0xfffffe815c8411c4
 fffffe815c8411c4:       0 #l_waitingfor
 db{0}> x/x 0xfffffe815c8411cc
 fffffe815c8411cc:       1 #l_refcnt
 db{0}> x/x 0xfffffe815c8411d0
 fffffe815c8411d0:       15 #l_lid
 
 db{0}> x/x 0xfffffe815d1c11c0  
 fffffe815d1c11c0:       0 #l_waiter
 db{0}> x/x 0xfffffe815d1c11c4
 fffffe815d1c11c4:       0 #l_waitingfor
 db{0}> x/x 0xfffffe815d1c11cc
 fffffe815d1c11cc:       1 #l_refcnt
 db{0}> x/x 0xfffffe815d1c11d0
 fffffe815d1c11d0:       a #l_lid
 
 db{0}> x/x 0xfffffe81810a8bb8
 fffffe81810a8bb8:       2 #p_nlwps
 db{0}> x/x 0xfffffe81810a8bbc
 fffffe81810a8bbc:       0 #p_nzlwps
 db{0}> x/x 0xfffffe81810a8bc0
 fffffe81810a8bc0:       1 #p_nrlwps
 db{0}> x/x 0xfffffe81810a8bc4
 fffffe81810a8bc4:       1 #p_nlwpwait
 db{0}> x/x 0xfffffe81810a8bc8
 fffffe81810a8bc8:       2 #p_ndlwps
 
 I think that the missing LW_WEXIT for 29181.21 is the cause of the problem.
 lwp_wait1() can release and reaquire p_lock, even when returning 0.
 I guess a new lwp can be created while p_lock is released, and so the list
 of lwps always needs to be checked again.
 The attached patch does this: it moves setting LW_EXIT inside the
 while (p->p_nlwps > 1) {} loop; so that if a new lwp is added, it can be 
 properly flagged before waiting for it.
 
 I've done 2 runs of atf-run in /usr/tests, no tests are failing after
 this change (at last none that was not failing before), and I couldn't
 make t_cond hang again.
 
 -- 
 Manuel Bouyer <bouyer%antioche.eu.org@localhost>
      NetBSD: 26 ans d'experience feront toujours la difference
 --
 
 Index: kern/kern_exit.c
 ===================================================================
 RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
 retrieving revision 1.237
 diff -u -p -u -r1.237 kern_exit.c
 --- kern/kern_exit.c   19 Feb 2012 21:06:50 -0000      1.237
 +++ kern/kern_exit.c   14 Mar 2012 16:22:47 -0000
 @@ -598,44 +598,42 @@ exit_lwps(struct lwp *l)
  
        p = l->l_proc;
        KASSERT(mutex_owned(p->p_lock));
 -retry:
 -      /*
 -       * Interrupt LWPs in interruptable sleep, unsuspend suspended
 -       * LWPs and then wait for everyone else to finish.
 -       */
 -      LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 -              if (l2 == l)
 -                      continue;
 -              lwp_lock(l2);
 -              l2->l_flag |= LW_WEXIT;
 -              if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) ||
 -                  l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
 -                      /* setrunnable() will release the lock. */
 -                      setrunnable(l2);
 -                      DPRINTF(("exit_lwps: Made %d.%d runnable\n",
 -                          p->p_pid, l2->l_lid));
 -                      continue;
 -              }
 -              lwp_unlock(l2);
 -      }
        while (p->p_nlwps > 1) {
 +              /*
 +               * Interrupt LWPs in interruptable sleep, unsuspend suspended
 +               * LWPs and then wait for everyone else to finish.
 +               */
 +              LIST_FOREACH(l2, &p->p_lwps, l_sibling) {
 +                      if (l2 == l)
 +                              continue;
 +                      lwp_lock(l2);
 +                      l2->l_flag |= LW_WEXIT;
 +                      if ((l2->l_stat == LSSLEEP && (l2->l_flag & LW_SINTR)) 
||
 +                          l2->l_stat == LSSUSPENDED || l2->l_stat == LSSTOP) {
 +                              /* setrunnable() will release the lock. */
 +                              setrunnable(l2);
 +                              DPRINTF(("exit_lwps: Made %d.%d runnable\n",
 +                                  p->p_pid, l2->l_lid));
 +                              continue;
 +                      }
 +                      lwp_unlock(l2);
 +              }
                DPRINTF(("exit_lwps: waiting for %d LWPs (%d zombies)\n",
                    p->p_nlwps, p->p_nzlwps));
                error = lwp_wait1(l, 0, &waited, LWPWAIT_EXITCONTROL);
                if (p->p_nlwps == 1)
                        break;
 -              if (error == EDEADLK) {
 -                      /*
 -                       * LWPs can get suspended/slept behind us.
 -                       * (eg. sa_setwoken)
 -                       * kick them again and retry.
 -                       */
 -                      goto retry;
 -              }
 -              if (error)
 +              if (error == 0)
 +                      DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n",
 +                          waited));
 +              else if (error != EDEADLK)
                        panic("exit_lwps: lwp_wait1 failed with error %d",
                            error);
 -              DPRINTF(("exit_lwps: Got LWP %d from lwp_wait1()\n", waited));
 +              /*
 +               * LWPs can get suspended/slept behind us.
 +               * (eg. sa_setwoken)
 +               * kick them again and retry.
 +               */
        }
  
        KERNEL_LOCK(nlocks, l);
 


Home | Main Index | Thread Index | Old Index