current-users: infinite loop in pthread__mutex

Subject: infinite loop in pthread__mutex_spin after call to exit
To: None <current-users@netbsd.org>
From: Sverre Froyen <sverre@viewmark.com>
List: current-users
Date: 12/28/2007 17:18:12

Hi,

I've been trying to debug why the program /usr/pkg/qt4/bin/qdbus hangs 
after its call to exit, and I've ended up with a case where the C code, the 
assembler, and the variable values all look OK, yet the program hangs, 
looping in the function pthread__mutex_spin in 
src/lib/libpthread/pthread_mutex2.c (1.17).

The code that appears to fail is the test

                if (thread->pt_lwpctl->lc_curcpu == LWPCTL_CPU_NONE ||
                    thread->pt_blocking)
                        break;

which, if true, would exit the loop.
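
For readers without the source at hand, the shape of such a loop can be 
sketched roughly as below.  This is a reduced stand-in, not the libpthread 
code: the struct layouts, the helper names spin_should_stop and spin_sketch, 
the volatile qualifiers, and the iteration limit are all assumptions made for 
illustration.  Only the field names and the value -1 for LWPCTL_CPU_NONE are 
taken from this message.

```c
#include <stdbool.h>

#define LWPCTL_CPU_NONE (-1)    /* value as reported in this message */

/* Hypothetical, reduced stand-ins for the real structures. */
struct lwpctl_sketch {
    volatile int lc_curcpu;     /* assumed int-sized, updated by someone else */
};

struct thread_sketch {
    struct lwpctl_sketch *pt_lwpctl;
    volatile int pt_blocking;
};

/* The exit test quoted above, isolated into a helper (hypothetical name). */
static bool
spin_should_stop(const struct thread_sketch *thread)
{
    return thread->pt_lwpctl->lc_curcpu == LWPCTL_CPU_NONE ||
        thread->pt_blocking;
}

/*
 * Spin while the test is false.  The limit exists only so that this
 * sketch terminates; it returns the number of iterations spent spinning.
 */
static int
spin_sketch(const struct thread_sketch *owner, int limit)
{
    int count;

    for (count = 0; count < limit; count++) {
        if (spin_should_stop(owner))
            break;
    }
    return count;
}
```

With lc_curcpu equal to -1 the helper returns true and the loop exits on its 
first pass, which is the behaviour I expected to see.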

LWPCTL_CPU_NONE is defined as (-1) and gdb shows that 
thread->pt_lwpctl->lc_curcpu is also -1 at each iteration through the loop, 
but the test (appears to) evaluate to false.

The assembly code for the pertinent part of the test reads (when compiled 
without optimization):

0xbbb55535 <pthread__mutex_spin+21>:    mov    0xc(%esp),%eax
0xbbb55539 <pthread__mutex_spin+25>:    mov    0x184(%eax),%eax
0xbbb5553f <pthread__mutex_spin+31>:    mov    (%eax),%eax
0xbbb55541 <pthread__mutex_spin+33>:    cmp    $0xffffffff,%eax

The first line loads the pointer "thread" from the stack.  The second line 
loads "thread->pt_lwpctl", which is also the address 
of "thread->pt_lwpctl->lc_curcpu".  The third line loads the value 
of "thread->pt_lwpctl->lc_curcpu".  And the last line compares that value 
to -1.
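
To make the mapping between the four instructions and the C expression 
explicit, here is a small sketch that performs the same loads one step at a 
time.  The struct layouts are hypothetical (the placeholder member pt_other 
and the resulting offsets are not those of the real pthread structure), and 
the helper names load_curcpu_stepwise and matches_cpu_none are made up for 
this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced layouts; real offsets differ. */
struct lwpctl_view {
    int lc_curcpu;              /* first member, so it sits at offset 0 */
};

struct thread_view {
    int pt_other;               /* placeholder for the preceding members */
    struct lwpctl_view *pt_lwpctl;
};

/* Mirror instructions 1 to 3: load thread, then pt_lwpctl, then lc_curcpu. */
static int
load_curcpu_stepwise(const struct thread_view *thread)
{
    /* 1: the pointer "thread" itself (mov 0xc(%esp),%eax) */
    const unsigned char *base = (const unsigned char *)thread;
    struct lwpctl_view *ctl;
    int value;

    /* 2: load the pointer stored at thread + offsetof(pt_lwpctl) */
    memcpy(&ctl, base + offsetof(struct thread_view, pt_lwpctl), sizeof ctl);

    /* 3: load the int stored at ctl + offsetof(lc_curcpu), i.e. at ctl + 0 */
    memcpy(&value,
        (const unsigned char *)ctl + offsetof(struct lwpctl_view, lc_curcpu),
        sizeof value);
    return value;
}

/* Mirror instruction 4: comparing with 0xffffffff is comparing with -1. */
static bool
matches_cpu_none(int value)
{
    return (uint32_t)value == UINT32_C(0xffffffff);
}
```

Because lc_curcpu is at offset 0 in this sketch, the pointer loaded in step 2 
doubles as the address of lc_curcpu, which is why the third instruction can 
dereference it directly.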

Here's what I observe:

gdb shows:
(gdb) print thread
$1 = (pthread_t) 0xbfa00000
(gdb) print thread->pt_lwpctl
$2 = (struct lwpctl *) 0xbb782000
(gdb) print thread->pt_lwpctl->lc_curcpu
$3 = -1

After the first line, eax contains 0xbfa00000, as expected.
After the second line, eax contains 0xbb782000 which in turn contains 
0xffffffff, again as expected (lc_curcpu is the first member of the lwpctl 
structure).
After the third line, eax contains 0x0!!!!!  I would have expected 0xffffffff.
Because eax now contains zero, the test in line 4 is false and the 
(subsequent) jump never happens.

So, that leaves me with two questions:

1) What am I missing in the above scenario?  Could there be some type of cache 
issue?

2) Should a process really be able to hang like that after calling exit (using 
100% user time)?

Thanks,

Sverre

PS: This is on i386 -current.

PPS I'm sending this to current-users because this feels like an issue with 
current.  Feel free to redirect to netbsd-help or pkgsrc-users if that seems 
more appropriate.