current-users: Re: infinite loop in pthread__mutex

Subject: Re: infinite loop in pthread__mutex_spin after call to exit
To: Sverre Froyen <sverre@viewmark.com>
From: Andrew Doran <ad@netbsd.org>
List: current-users
Date: 12/29/2007 00:35:44
On Fri, Dec 28, 2007 at 05:18:12PM -0700, Sverre Froyen wrote:

> I've been attempting to debug why the program /usr/pkg/qt4/bin/qdbus hangs 
> after its call to exit and I've ended up with a case where the C code, the 

Which library call is it making to exit?

A ktrace of the last few ms of the app would be really useful. :-)

> assembler, and the variable values all look OK, yet the program hangs, 
> looping in the method pthread__mutex_spin in 
> src/lib/libpthread/pthread_mutex2.c (1.17).
> 
> The code that appears to fail is the test
> 
>                 if (thread->pt_lwpctl->lc_curcpu == LWPCTL_CPU_NONE ||
>                     thread->pt_blocking)
>                         break;
> 
> which if true would have exited the loop.
> 
> LWPCTL_CPU_NONE is defined as (-1) and gdb shows that 
> thread->pt_lwpctl->lc_curcpu is also -1 at each iteration through the loop 
> but the test (appears to) evaluate to false.
> 
> The assembly code for the pertinent part of the test reads (when compiled 
> without optimization):
> 
> 0xbbb55535 <pthread__mutex_spin+21>:    mov    0xc(%esp),%eax
> 0xbbb55539 <pthread__mutex_spin+25>:    mov    0x184(%eax),%eax
> 0xbbb5553f <pthread__mutex_spin+31>:    mov    (%eax),%eax
> 0xbbb55541 <pthread__mutex_spin+33>:    cmp    $0xffffffff,%eax
> 
> The first line gets the address of "thread".  The second line gets the address 
> of "thread->pt_lwpctl->lc_curcpu".  The third line gets the value 
> of "thread->pt_lwpctl->lc_curcpu".  And the last line compares that value 
> to -1.
> 
> Here's what I observe:
> 
> gdb shows:
> (gdb) print thread
> $1 = (pthread_t) 0xbfa00000
> (gdb) print thread->pt_lwpctl
> $2 = (struct lwpctl *) 0xbb782000
> (gdb) print thread->pt_lwpctl->lc_curcpu
> $3 = -1
> 
> After the first line, eax contains 0xbfa00000, as expected.
> After the second line, eax contains 0xbb782000 which in turn contains 
> 0xffffffff, again as expected (lc_curcpu is the first member of the lwpctl 
> structure).
> After the third line, eax contains 0x0!!!!!  I would have expected 0xffffffff.
> Because eax now contains zero, the test in line 4 is false and the 
> (subsequent) jump never happens.
> 
> So, that leaves me with two questions:
> 
> 1) What am I missing in the above scenario?  Could there be some type of cache 
> issue?

The kernel updates the value as the thread moves on and off a CPU. If you
look at it from a core dump or with the debugger then you've forced the
thread off the CPU, so it will appear as -1 (or IIRC -2 if the thread has
exited).
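
To illustrate why the value seen in gdb can differ from the value the loop
reads, here is a minimal sketch of the spin test. It is not the libpthread
source: fake_lwpctl, fake_thread, CPU_NONE and spin_while_owner_runs are
hypothetical stand-ins for struct lwpctl, the pthread structure,
LWPCTL_CPU_NONE and pthread__mutex_spin. While the owner is on a CPU the
field holds a CPU number (0, say) and the loop keeps going; once the process
is stopped the field reads -1 and the same test would break out.

```c
#include <stddef.h>

#define CPU_NONE (-1)   /* stand-in for LWPCTL_CPU_NONE */

/* Hypothetical stand-ins for the real structures. */
struct fake_lwpctl {
	volatile int curcpu;    /* written by "the kernel" */
};

struct fake_thread {
	struct fake_lwpctl *lwpctl;
	volatile int blocking;
};

/* The exit condition quoted above: owner off CPU, or owner blocking. */
static int
owner_off_cpu(const struct fake_thread *t)
{
	return t->lwpctl->curcpu == CPU_NONE || t->blocking;
}

/*
 * Spin for at most max iterations while the owner appears to be running.
 * Returns the number of iterations spent spinning.
 */
static int
spin_while_owner_runs(const struct fake_thread *owner, int max)
{
	int count;

	for (count = 0; count < max; count++) {
		if (owner_off_cpu(owner))
			break;
	}
	return count;
}
```

With curcpu set to 0 the function spins for the full max iterations; with
curcpu set to CPU_NONE, or blocking set, it returns 0 straight away.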
 
> 2) Should a process really be able to hang like that after calling exit (using 
> 100% user time)?

Given what you've mentioned, it seems to me like a bug in the application.
If you have this situation...

thr1	pthread_mutex_lock(&foo)
thr1	pthread_exit()
thr2	pthread_mutex_lock(&foo)

... then thr2 will deadlock, no matter whether it's spinning or sleeping. It
would be nicer for thr2 to sleep, which might let the application appear to
work!
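
A small sketch of that sequence, for anyone who wants to reproduce it. The
names foo, thr1_main and try_lock_after_owner_exit are made up for the
example, and it uses pthread_mutex_trylock() in place of the second
pthread_mutex_lock() so that it reports the problem instead of hanging. It
assumes a default (non-robust) mutex, where I'd expect the lock to stay held
after the owner exits; the standard does not promise that result.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t foo = PTHREAD_MUTEX_INITIALIZER;

/* thr1: take the lock, then exit without ever releasing it. */
static void *
thr1_main(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&foo);
	pthread_exit(NULL);
}

/*
 * thr2's point of view.  Returns the result of pthread_mutex_trylock():
 * 0 if the lock was obtained, EBUSY if it is still held by the exited
 * thread.  Returns -1 if thr1 could not be created or joined.
 */
static int
try_lock_after_owner_exit(void)
{
	pthread_t thr1;

	if (pthread_create(&thr1, NULL, thr1_main, NULL) != 0)
		return -1;
	if (pthread_join(thr1, NULL) != 0)
		return -1;

	/* A plain pthread_mutex_lock() here would never return. */
	return pthread_mutex_trylock(&foo);
}
```

If the call returns EBUSY, a blocking pthread_mutex_lock() at the same point
would wait forever, whether by spinning or by sleeping.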

Andrew