Subject: Re: After newlock2 merge: Different pthread
To: Andrew Doran <ad@NetBSD.org>
From: Matthias Drochner <M.Drochner@fz-juelich.de>
List: current-users
Date: 04/12/2007 20:14:22
ad@NetBSD.org said:
> I've seen a similar trace recently from a FUSE app (pthread_spinlock),
> I'll have look in the next few days. Apparently it's not hard to
> reproduce the problem.
I was hit again with today's kernel, with both CPUs enabled and
setiathome not running. As I said, I've never seen these problems
when using just one CPU, or when I keep both CPUs busy.
xfce-mcs-manager died at the same point - the assertion after a
pthread cancel check. I didn't find a call to pthread_cancel
in the glib sources, so I suspect that the check firing is
already an indication of corruption.
Program terminated with signal 6, Aborted.
#0 0xbb31819f in kill () from /usr/lib/libc.so.12
(gdb) where
#0 0xbb31819f in kill () from /usr/lib/libc.so.12
#1 0xbb3e01f7 in pthread__assertfunc () from /usr/lib/libpthread.so.0
#2 0xbb3dedba in pthread_spinlock () from /usr/lib/libpthread.so.0
#3 0xbb3e103d in pthread_exit () from /usr/lib/libpthread.so.0
#4 0xbb3de804 in poll () from /usr/lib/libpthread.so.0
#5 0xbb416caf in g_main_context_check () from /usr/pkg/lib/libglib-2.0.so.0
(gdb) x/100i poll
[...]
0xbb3de7d3 <poll+31>: mov 0x1c(%esi),%eax
0xbb3de7d6 <poll+34>: test %eax,%eax
0xbb3de7d8 <poll+36>: jne 0xbb3de7fa <poll+70>
0xbb3de7da <poll+38>: push %eax
0xbb3de7db <poll+39>: pushl 0x10(%ebp)
0xbb3de7de <poll+42>: pushl 0xc(%ebp)
0xbb3de7e1 <poll+45>: pushl 0x8(%ebp)
0xbb3de7e4 <poll+48>: call 0xbb3dbcc0 <_sys_poll@plt>
0xbb3de7e9 <poll+53>: add $0x10,%esp
0xbb3de7ec <poll+56>: mov 0x1c(%esi),%esi
0xbb3de7ef <poll+59>: test %esi,%esi
0xbb3de7f1 <poll+61>: jne 0xbb3de7fa <poll+70>
0xbb3de7f3 <poll+63>: lea 0xfffffff8(%ebp),%esp
0xbb3de7f6 <poll+66>: pop %ebx
0xbb3de7f7 <poll+67>: pop %esi
0xbb3de7f8 <poll+68>: leave
0xbb3de7f9 <poll+69>: ret
0xbb3de7fa <poll+70>: sub $0xc,%esp
0xbb3de7fd <poll+73>: push $0x1
0xbb3de7ff <poll+75>: call 0xbb3dbae0 <pthread_exit@plt>
0xbb3de804 <open>: push %ebp
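For readers who do not want to trace the listing by hand, here is a rough C reading of it. It is only a sketch: struct thread_sketch, poll_sketch and the two callbacks are hypothetical names, and the only thing taken from the disassembly is that a word at offset 0x1c from %esi is tested before and after the _sys_poll call, with both tests branching to the same pthread_exit call at poll+70. The pushed argument 1 would match PTHREAD_CANCELED if that is defined as (void *)1, which is an assumption here. The syscall stand-in receives the address of the simulated %esi so that a clobbered register can be modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for libpthread's private thread descriptor.
 * Only the offset 0x1c comes from the disassembly; the rest is assumed. */
struct thread_sketch {
    char pad[0x1c];     /* fields not visible in the listing */
    int  cancel_flag;   /* the word loaded by "mov 0x1c(%esi),..." */
};

enum outcome { RETURNED, EXIT_BEFORE_SYSCALL, EXIT_AFTER_SYSCALL };

/* Stand-in for the _sys_poll call at poll+48. It gets the address of the
 * simulated %esi so a caller can model the register being clobbered. */
typedef int (*syscall_fn)(struct thread_sketch **esi);

static enum outcome
poll_sketch(struct thread_sketch *self, syscall_fn sys_poll, int *retval)
{
    struct thread_sketch *esi = self;   /* %esi holds the thread pointer */

    if (esi->cancel_flag != 0)          /* poll+31 .. poll+36 */
        return EXIT_BEFORE_SYSCALL;     /* poll+70: pthread_exit((void *)1) */
    *retval = sys_poll(&esi);           /* poll+48 */
    if (esi->cancel_flag != 0)          /* poll+56 .. poll+61 */
        return EXIT_AFTER_SYSCALL;      /* same exit path at poll+70 */
    return RETURNED;                    /* poll+63 .. poll+69 */
}

/* Some unrelated memory whose word at offset 0x1c happens to be nonzero. */
static struct thread_sketch stray = { .cancel_flag = 1 };

/* A syscall stand-in that leaves the simulated %esi alone. */
static int well_behaved(struct thread_sketch **esi) { (void)esi; return 0; }

/* A syscall stand-in after which the simulated %esi points elsewhere. */
static int clobbers_esi(struct thread_sketch **esi) { *esi = &stray; return 0; }

/* Check that the sketch really places the flag at offset 0x1c. */
static void check_layout(void)
{
    assert(offsetof(struct thread_sketch, cancel_flag) == 0x1c);
}
```

With clobbers_esi the second check fires even though the real descriptor's flag is still zero, which is the scenario suspected above. Because both checks jump to poll+70, the backtrace alone does not show which of the two fired.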
When I tried to rebuild userland, /bin/sh died unexpectedly in
a way which looks impossible:
Program terminated with signal 11, Segmentation fault.
#0 0x0805aadc in setvar ()
(gdb) where
#0 0x0805aadc in setvar ()
#1 0x08055d51 in readcmd ()
#2 0x0804c594 in evalcommand ()
#3 0x0804ba6c in evaltree ()
#4 0x0804cfe5 in evalloop ()
#5 0x0804bae8 in evaltree ()
#6 0x0804cc19 in evalpipe ()
#7 0x0804ba5a in evaltree ()
#8 0x0804ba1d in evaltree ()
#9 0x0804d0ba in evalstring ()
#10 0x08054f26 in main ()
(gdb) x/i setvar
[...]
0x805aad9 <setvar+57>: lea 0x1(%esi),%ecx
(gdb)
0x805aadc <setvar+60>: mov (%ecx),%dl
(gdb) info reg
eax 0x0 0
ecx 0x806c000 134660096
edx 0x8069e00 134651392
ebx 0xbbbb3c00 -1145357312
esp 0xbfbfdd20 0xbfbfdd20
ebp 0xbfbfdd38 0xbfbfdd38
esi 0x8069ec4 134651588
edi 0x1 1
eip 0x805aadc 0x805aadc <setvar+60>
eflags 0x10216 [ PF AF IF RF ]
cs 0x17 23
ss 0x1f 31
ds 0x1f 31
es 0x1f 31
fs 0x1f 31
gs 0x1f 31
(gdb) x/x 0x8069ec4
0x8069ec4: 0x69667a74
(gdb) x/x 0x806c000
0x806c000: Cannot access memory at address 0x806c000
As you can see, either esi or ecx must be wrong here.
It might be a strange coincidence that the xfce crash can
be explained by a corruption of esi...
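The register argument can be checked with a few lines of C. This is a sketch under one explicit assumption: that the mov at setvar+60 is reached only by falling through from the lea at setvar+57. The listing does not show whether setvar+60 is also a branch target, so that part is not confirmed. The helper names are made up for illustration; the constants are copied from the "info reg" output.

```c
#include <assert.h>
#include <stdint.h>

/* Register values copied from the "info reg" output. */
#define ESI_OBSERVED 0x08069ec4u
#define ECX_OBSERVED 0x0806c000u

/* Value %ecx should hold right after "lea 0x1(%esi),%ecx". */
static uint32_t lea_plus_one(uint32_t esi)
{
    return esi + 1u;
}

/* Nonzero if addr is the first byte of a 4 KiB page (the i386 page size). */
static int is_page_start(uint32_t addr)
{
    return (addr & 0xfffu) == 0;
}

/* Distance in bytes between the observed %ecx and the value the lea implies. */
static uint32_t ecx_mismatch(void)
{
    return ECX_OBSERVED - lea_plus_one(ESI_OBSERVED);
}

/* Assert the relationships discussed in the text. */
static void check_registers(void)
{
    assert(lea_plus_one(ESI_OBSERVED) == 0x08069ec5u); /* what the lea implies */
    assert(lea_plus_one(ESI_OBSERVED) != ECX_OBSERVED); /* the two disagree */
    assert(is_page_start(ECX_OBSERVED));                /* fault is on a page boundary */
}
```

Under that assumption ecx should be 0x8069ec5, but it is 0x806c000, which is 0x213b bytes further on and exactly at the start of a page that gdb cannot read.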
I've kept the coredumps and binaries, in case someone
wants to analyze them further.
best regards
Matthias