Subject: infinite loop in pthread__mutex_spin after call to exit
To: None <current-users@netbsd.org>
From: Sverre Froyen <sverre@viewmark.com>
List: current-users
Date: 12/28/2007 17:18:12
Hi,
I've been attempting to debug why the program /usr/pkg/qt4/bin/qdbus hangs
after its call to exit and I've ended up with a case where the C code, the
assembler, and the variable values all look OK, yet the program hangs,
looping in the function pthread__mutex_spin in
src/lib/libpthread/pthread_mutex2.c (1.17).
The code that appears to fail is the test
	if (thread->pt_lwpctl->lc_curcpu == LWPCTL_CPU_NONE ||
	    thread->pt_blocking)
		break;
which, if true, would exit the loop.
LWPCTL_CPU_NONE is defined as (-1) and gdb shows that
thread->pt_lwpctl->lc_curcpu is also -1 at each iteration through the loop
but the test (apparently) evaluates to false.
The assembly code for the pertinent part of the test reads (when compiled
without optimization):
0xbbb55535 <pthread__mutex_spin+21>: mov 0xc(%esp),%eax
0xbbb55539 <pthread__mutex_spin+25>: mov 0x184(%eax),%eax
0xbbb5553f <pthread__mutex_spin+31>: mov (%eax),%eax
0xbbb55541 <pthread__mutex_spin+33>: cmp $0xffffffff,%eax
The first line gets the address of "thread". The second line gets the address
of "thread->pt_lwpctl->lc_curcpu". The third line gets the value
of "thread->pt_lwpctl->lc_curcpu". And the last line compares that value
to -1.
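To spell out that reading of the disassembly, here is a minimal C sketch of the same steps. The structures are stand-ins, not the libpthread definitions: fake_pthread, fake_lwpctl, FAKE_LWPCTL_CPU_NONE and curcpu_is_none are hypothetical names, lc_curcpu is assumed to be a volatile int, and the stand-in layout does not reproduce the real 0x184 offset.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for LWPCTL_CPU_NONE, which is defined as (-1). */
#define FAKE_LWPCTL_CPU_NONE	(-1)

/* Hypothetical stand-in for struct lwpctl; the field type is an assumption. */
struct fake_lwpctl {
	volatile int lc_curcpu;		/* first member, so at offset 0 */
};

/* Hypothetical stand-in for the thread structure. */
struct fake_pthread {
	struct fake_lwpctl *pt_lwpctl;
	int pt_blocking;
};

/* Returns 1 if the lc_curcpu half of the loop test would be true. */
static int
curcpu_is_none(struct fake_pthread *thread)
{
	struct fake_lwpctl *ctl;
	int curcpu;

	assert(thread != NULL && thread->pt_lwpctl != NULL);
	ctl = thread->pt_lwpctl;	/* like: mov 0x184(%eax),%eax */
	curcpu = ctl->lc_curcpu;	/* like: mov (%eax),%eax */
	return curcpu == FAKE_LWPCTL_CPU_NONE;	/* like: cmp $0xffffffff,%eax */
}
```

With lc_curcpu set to -1 the function returns 1, which is the result the loop test needs in order to break.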
Here's what I observe:
gdb shows:
(gdb) print thread
$1 = (pthread_t) 0xbfa00000
(gdb) print thread->pt_lwpctl
$2 = (struct lwpctl *) 0xbb782000
(gdb) print thread->pt_lwpctl->lc_curcpu
$3 = -1
After the first line, eax contains 0xbfa00000, as expected.
After the second line, eax contains 0xbb782000 which in turn contains
0xffffffff, again as expected (lc_curcpu is the first member of the lwpctl
structure).
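The "first member" reasoning can be checked in isolation: in C, a pointer to a structure and a pointer to its first member refer to the same address, so dereferencing the structure pointer as an int reads that member. A small sketch, using a hypothetical demo_lwpctl structure (the second field and the field types are assumptions for illustration only):

```c
#include <stddef.h>

/* Hypothetical stand-in; only the position of lc_curcpu matters here. */
struct demo_lwpctl {
	volatile int lc_curcpu;		/* first member */
	volatile int other_field;	/* placeholder for whatever follows */
};

/* Returns 1 if the structure and its first member share an address. */
static int
first_member_shares_address(struct demo_lwpctl *ctl)
{
	return (const volatile void *)ctl ==
	    (const volatile void *)&ctl->lc_curcpu;
}
```

So with the pointer 0xbb782000 in eax, a load from (%eax) should fetch lc_curcpu itself.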
After the third line, eax contains 0x0!!!!! I would have expected 0xffffffff.
Because eax now contains zero, the comparison in the fourth line fails and
the (subsequent) jump is never taken.
So, that leaves me with two questions:
1) What am I missing in the above scenario? Could there be some type of cache
issue?
2) Should a process really be able to hang like that after calling exit (using
100% user time)?
Thanks,
Sverre
PS: This is on i386 -current.
PPS I'm sending this to current-users because this feels like an issue with
current. Feel free to redirect to netbsd-help or pkgsrc-users if that seems
more appropriate.