Subject: Re: infinite loop in pthread__mutex_spin after call to exit
To: Sverre Froyen <firstname.lastname@example.org>
From: Andrew Doran <email@example.com>
Date: 12/29/2007 00:35:44
On Fri, Dec 28, 2007 at 05:18:12PM -0700, Sverre Froyen wrote:
> I've been attempting to debug why the program /usr/pkg/qt4/bin/qdbus hangs
> after its call to exit and I've ended up with a case where the C code, the
Which library call is it making to exit?
A ktrace of the last few ms of the app would be really useful. :-)
> assembler, and the variable values all look OK, yet the program hangs,
> looping in the method pthread__mutex_spin in
> src/lib/libpthread/pthread_mutex2.c (1.17).
> The code that appears to fail is the test
> if (thread->pt_lwpctl->lc_curcpu == LWPCTL_CPU_NONE ||
> which if true would have exited the loop.
> LWPCTL_CPU_NONE is defined as (-1) and gdb shows that
> thread->pt_lwpctl->lc_curcpu is also -1 at each iteration through the loop
> but the test (appears to) evaluate to false.
> The assembly code for the pertinent part of the test reads (when compiled
> without optimization):
> 0xbbb55535 <pthread__mutex_spin+21>: mov 0xc(%esp),%eax
> 0xbbb55539 <pthread__mutex_spin+25>: mov 0x184(%eax),%eax
> 0xbbb5553f <pthread__mutex_spin+31>: mov (%eax),%eax
> 0xbbb55541 <pthread__mutex_spin+33>: cmp $0xffffffff,%eax
> The first line gets the address of "thread". The second line gets the address
> of "thread->pt_lwpctl->lc_curcpu". The third line gets the value
> of "thread->pt_lwpctl->lc_curcpu". And the last line compares that value
> to -1.
> Here's what I observe:
> gdb shows:
> (gdb) print thread
> $1 = (pthread_t) 0xbfa00000
> (gdb) print thread->pt_lwpctl
> $2 = (struct lwpctl *) 0xbb782000
> (gdb) print thread->pt_lwpctl->lc_curcpu
> $3 = -1
> After the first line, eax contains 0xbfa00000, as expected.
> After the second line, eax contains 0xbb782000 which in turn contains
> 0xffffffff, again as expected (lc_curcpu is the first member of the lwpctl
> structure).
> After the third line, eax contains 0x0!!!!! I would have expected 0xffffffff.
> Because eax now contains zero, the test in line 4 is false and the
> (subsequent) jump never happens.
> So, that leaves me with two questions:
> 1) What am I missing in the above scenario? Could there be some type of cache
The kernel updates the value. If you look at it from a core dump or with the
debugger then you've forced the thread off the CPU, so it will appear as -1
(or, IIRC, -2 if the thread has exited), even though the spinning code saw a
different value while everything was running.
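To illustrate why the spinner and the debugger can disagree, here is a minimal
single-threaded sketch. It is not the libpthread code: fake_lwpctl,
FAKE_CPU_NONE, owner_on_cpu() and spin_on_owner() are hypothetical names made
up for this example, and the "kernel" update is simulated by a plain
assignment.

```c
#include <stddef.h>

/* Hypothetical stand-in for LWPCTL_CPU_NONE: owner is not on any CPU. */
#define FAKE_CPU_NONE (-1)

/* Hypothetical stand-in for the shared per-LWP control block. */
struct fake_lwpctl {
    volatile int curcpu; /* written by "the kernel", read by the spinner */
};

/* Returns nonzero while the lock owner appears to be running on a CPU. */
int owner_on_cpu(const struct fake_lwpctl *ctl)
{
    return ctl->curcpu != FAKE_CPU_NONE;
}

/*
 * Spin for at most max_spins iterations while the owner is on a CPU.
 * Returns the number of iterations actually spent spinning. The bound
 * exists only so the sketch terminates; it is an assumption of this
 * example, not something taken from pthread__mutex_spin.
 */
unsigned spin_on_owner(const struct fake_lwpctl *ctl, unsigned max_spins)
{
    unsigned n = 0;

    while (n < max_spins && owner_on_cpu(ctl))
        n++;
    return n;
}
```

With curcpu set to 0 (owner "running on CPU 0"), spin_on_owner(&ctl, 100)
uses all 100 iterations. Once curcpu is set to FAKE_CPU_NONE, which is roughly
what stopping the process under a debugger does to the real field, the same
call returns 0. So a value of -1 seen in gdb does not tell you what the loop
read while it was running.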
> 2) Should a process really be able to hang like that after calling exit (using
> 100% user time)?
Given what you've mentioned it seems to me like a bug in the application. If
you have this situation..
... then thr2 will deadlock, no matter whether it's spinning or sleeping. It
would be nicer for thr2 to sleep, which might let the application appear to