current-users: Re: infinite loop in pthread__mutex

Subject: Re: infinite loop in pthread__mutex_spin after call to exit
To: Andrew Doran <ad@netbsd.org>
From: Sverre Froyen <sverre@viewmark.com>
List: current-users
Date: 12/28/2007 18:02:04
On Friday 28 December 2007, Andrew Doran wrote:
> On Fri, Dec 28, 2007 at 05:18:12PM -0700, Sverre Froyen wrote:
> > I've been attempting to debug why the program /usr/pkg/qt4/bin/qdbus
> > hangs after its call to exit and I've ended up with a case where the C
> > code, the
>
> Which library call is it making to exit?
>
> A ktrace of the last few ms of the app would be really useful. :-)
 
Not sure how far back you need.  The write of "org.freedesktop.DBus" happens 
shortly before the exit.

 16996      1 qdbus    CALL  write(1,0xbb65c000,6)
 16996      1 qdbus    GIO   fd 1 wrote 6 bytes
       ":1.13\n"
 16996      1 qdbus    RET   write 6
 16996      1 qdbus    CALL  open(0xbfbfad98,0,0xbfbfad00)
 16996      1 qdbus    NAMI  "/usr/share/i18n/iconv/iconv.alias.db"
 16996      1 qdbus    RET   open -1 errno 2 No such file or directory
 16996      1 qdbus    CALL  open(0xbfbfc658,0,0xbfbfad68)
 16996      1 qdbus    NAMI  "/usr/share/i18n/iconv/iconv.alias"
 16996      1 qdbus    RET   open -1 errno 2 No such file or directory
 16996      1 qdbus    CALL  open(0xbfbfad98,0,0xbfbfad00)
 16996      1 qdbus    NAMI  "/usr/share/i18n/iconv/iconv.alias.db"
 16996      1 qdbus    RET   open -1 errno 2 No such file or directory
 16996      1 qdbus    CALL  open(0xbfbfc658,0,0xbfbfad68)
 16996      1 qdbus    NAMI  "/usr/share/i18n/iconv/iconv.alias"
 16996      1 qdbus    RET   open -1 errno 2 No such file or directory
 16996      1 qdbus    CALL  write(1,0xbb65c000,0x15)
 16996      1 qdbus    GIO   fd 1 wrote 21 bytes
       "org.freedesktop.DBus\n"
 16996      1 qdbus    RET   write 21/0x15
 16996      1 qdbus    CALL  write(6,0xbbae2b92,1)
 16996      1 qdbus    GIO   fd 6 wrote 1 bytes
       "@"
 16996      1 qdbus    RET   write 1
 16996      1 qdbus    CALL  close(6)
 16996      1 qdbus    RET   close 0
 16996      1 qdbus    CALL  close(5)
 16996      1 qdbus    RET   close 0
 16996      1 qdbus    PSIG  SIGINT SIG_DFL

> > assembler, and the variable values all look OK, yet the program hangs,
> > looping in the method pthread__mutex_spin in
> > src/lib/libpthread/pthread_mutex2.c (1.17).
> >
> > The code that appears to fail is the test
> >
> >                 if (thread->pt_lwpctl->lc_curcpu == LWPCTL_CPU_NONE ||
> >                     thread->pt_blocking)
> >                         break;
> >
> > which if true would have exited the loop.
> >
> > LWPCTL_CPU_NONE is defined as (-1) and gdb shows that
> > thread->pt_lwpctl->lc_curcpu is also -1 at each iteration through the
> > loop but the test (apppears to) evaluate to false.
> >
> > The assembly code for the pertinent part of the test reads (when compiled
> > without optimization):
> >
> > 0xbbb55535 <pthread__mutex_spin+21>:    mov    0xc(%esp),%eax
> > 0xbbb55539 <pthread__mutex_spin+25>:    mov    0x184(%eax),%eax
> > 0xbbb5553f <pthread__mutex_spin+31>:    mov    (%eax),%eax
> > 0xbbb55541 <pthread__mutex_spin+33>:    cmp    $0xffffffff,%eax
> >
> > The first line gets the address of "thread".  The second line gets the
> > address of "thread->pt_lwpctl->lc_curcpu".  The third line gets the value
> > of "thread->pt_lwpctl->lc_curcpu".  And the last line compares that value
> > to -1.
> >
> > Here's what I observe:
> >
> > gdb shows:
> > (gdb) print thread
> > $1 = (pthread_t) 0xbfa00000
> > (gdb) print thread->pt_lwpctl
> > $2 = (struct lwpctl *) 0xbb782000
> > (gdb) print thread->pt_lwpctl->lc_curcpu
> > $3 = -1
> >
> > After the first line, eax contains 0xbfa00000, as expected.
> > After the second line, eax contains 0xbb782000 which in turn contains
> > 0xffffffff, again as expected (lc_curcpu is the first member of the
> > lwpctl structure).
> > After the third line, eax contains 0x0!!!!!  I would have expected
> > 0xffffffff. Because eax now contains zero, the test in line 4 is false
> > and the (subsequent) jump never happens.
> >
> > So, that leaves me with two questions:
> >
> > 1) What am I missing in the above scenario?  Could there be some type of
> > cache issue?
>
> The kernel updates the value. If you look at it from a core dump or with
> the debugger then you've forced the thread off the CPU, so it will appear
> as -1 (or iirc -2 if the thread has exited).

That makes sense.  I guess I could copy it to another variable to see its 
value.

> > 2) Should a process really be able to hang like that after calling exit
> > (using 100% user time)?
>
> Given what you've mentioned it seems to me like a bug in the application.
> If you have this situation..
>
> thr1	pthread_mutex_lock(&foo)
> thr1	pthread_exit()
> thr2	pthread_mutex_lock(&foo)
>
> ... then thr2 will deadlock, no matter whether it's spinning or sleeping.
> It would be nicer for thr2 to sleep, which might let the application appear
> to work!

How can I check for this?  If it helps, here is a backtrace from 
pthread__mutex_spin:

(gdb) bt
#0  pthread__mutex_spin (ptm=0xbb86dba0, owner=<value optimized out>)
    at /usr/src/lib/libpthread/pthread_mutex2.c:206
#1  0xbbb56a63 in pthread__mutex_lock_slow (ptm=0xbb86dba0)
    at /usr/src/lib/libpthread/pthread_mutex2.c:248
#2  0xbb83fb8c in __cxa_atexit () from /usr/lib/libc.so.12
#3  0xbba8f5ee in guardHash () at kernel/qobject.cpp:420
#4  0xbba92896 in QObjectPrivate::clearGuards (object=0xbb602430)
    at kernel/qobject.cpp:487
#5  0xbba94be5 in ~QObject (this=0xbb602430) at kernel/qobject.cpp:799
#6  0xbb9ec75b in ~QThread (this=0xbb602430) at thread/qthread.cpp:354
#7  0xbba70485 in ~QProcessManager (this=0xbb602430)
    at io/qprocess_unix.cpp:270
#8  0xbba6f0d0 in __tcf_0 ()
    at ../../include/QtCore/../../src/corelib/global/qglobal.h:1434
#9  0xbb83fab6 in __cxa_finalize () from /usr/lib/libc.so.12
#10 0xbb83f962 in exit () from /usr/lib/libc.so.12
#11 0x08051356 in main (argc=-1080033280, argv=0x11110001) at qdbus.cpp:391

Thanks,

Sverre