Subject: Re: infinite loop in pthread__mutex_spin after call to exit
To: Sverre Froyen <sverre@viewmark.com>
From: Andrew Doran <ad@netbsd.org>
List: current-users
Date: 12/29/2007 00:35:44
On Fri, Dec 28, 2007 at 05:18:12PM -0700, Sverre Froyen wrote:
> I've been attempting to debug why the program /usr/pkg/qt4/bin/qdbus hangs
> after its call to exit and I've ended up with a case where the C code, the
Which library call is it making to exit?
A ktrace of the last few ms of the app would be really useful. :-)
> assembler, and the variable values all look OK, yet the program hangs,
> looping in the method pthread__mutex_spin in
> src/lib/libpthread/pthread_mutex2.c (1.17).
>
> The code that appears to fail is the test
>
> if (thread->pt_lwpctl->lc_curcpu == LWPCTL_CPU_NONE ||
> thread->pt_blocking)
> break;
>
> which if true would have exited the loop.
>
> LWPCTL_CPU_NONE is defined as (-1) and gdb shows that
> thread->pt_lwpctl->lc_curcpu is also -1 at each iteration through the loop
> but the test (appears to) evaluate to false.
>
> The assembly code for the pertinent part of the test reads (when compiled
> without optimization):
>
> 0xbbb55535 <pthread__mutex_spin+21>: mov 0xc(%esp),%eax
> 0xbbb55539 <pthread__mutex_spin+25>: mov 0x184(%eax),%eax
> 0xbbb5553f <pthread__mutex_spin+31>: mov (%eax),%eax
> 0xbbb55541 <pthread__mutex_spin+33>: cmp $0xffffffff,%eax
>
> The first line gets the address of "thread". The second line gets the address
> of "thread->pt_lwpctl->lc_curcpu". The third line gets the value
> of "thread->pt_lwpctl->lc_curcpu". And the last line compares that value
> to -1.
>
> Here's what I observe:
>
> gdb shows:
> (gdb) print thread
> $1 = (pthread_t) 0xbfa00000
> (gdb) print thread->pt_lwpctl
> $2 = (struct lwpctl *) 0xbb782000
> (gdb) print thread->pt_lwpctl->lc_curcpu
> $3 = -1
>
> After the first line, eax contains 0xbfa00000, as expected.
> After the second line, eax contains 0xbb782000 which in turn contains
> 0xffffffff, again as expected (lc_curcpu is the first member of the lwpctl
> structure).
> After the third line, eax contains 0x0!!!!! I would have expected 0xffffffff.
> Because eax now contains zero, the test in line 4 is false and the
> (subsequent) jump never happens.
>
> So, that leaves me with two questions:
>
> 1) What am I missing in the above scenario? Could there be some type of cache
> issue?
The kernel updates the value. If you look at it from a core dump or with the
debugger then you've forced the thread off the CPU, so it will appear as -1
(or iirc -2 if the thread has exited).
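To make the timing point concrete, here is a minimal model of the quoted test. It is only a sketch: struct model_lwpctl, struct model_thread, spin_should_stop and owner_stops_spin are hypothetical stand-ins, not the libpthread definitions. The only things taken from the thread are the field names, the value LWPCTL_CPU_NONE == -1, and the shape of the test. While the owner is running, the field holds a CPU number (0 in the register dump above), so the test is false and the spin continues. Once the owner is forced off the CPU, a reader sees -1.

```c
#include <stdbool.h>

#define LWPCTL_CPU_NONE (-1)    /* value quoted in the original report */

/* Hypothetical stand-ins for the real structures; only the fields
 * used by the quoted test are modelled. */
struct model_lwpctl {
    volatile int lc_curcpu;     /* in the real system the kernel updates this */
};

struct model_thread {
    struct model_lwpctl *pt_lwpctl;
    int pt_blocking;
};

/* Same condition as the quoted test: stop spinning if the lock owner
 * is not on a CPU or is blocking. */
bool spin_should_stop(const struct model_thread *thread)
{
    return thread->pt_lwpctl->lc_curcpu == LWPCTL_CPU_NONE ||
        thread->pt_blocking;
}

/* Evaluate the test for one snapshot of the owner's state. */
int owner_stops_spin(int curcpu, int blocking)
{
    struct model_lwpctl ctl = { curcpu };
    struct model_thread owner = { &ctl, blocking };

    return spin_should_stop(&owner) ? 1 : 0;
}
```

With curcpu set to 0 (owner running on CPU 0) the function returns 0, and with curcpu set to -1 (owner off the CPU, as a debugger would see it) it returns 1.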
> 2) Should a process really be able to hang like that after calling exit (using
> 100% user time)?
Given what you've mentioned it seems to me like a bug in the application. If
you have this situation..
thr1 pthread_mutex_lock(&foo)
thr1 pthread_exit()
thr2 pthread_mutex_lock(&foo)
... then thr2 will deadlock, no matter whether it's spinning or sleeping. It
would be nicer for thr2 to sleep, which might let the application appear to
work!
Andrew