current-users: Re: After newlock2 merge: Different pthread

Subject: Re: After newlock2 merge: Different pthread
To: Andrew Doran <ad@NetBSD.org>
From: Matthias Drochner <M.Drochner@fz-juelich.de>
List: current-users
Date: 04/12/2007 20:14:22
ad@NetBSD.org said:
> I've seen a similar trace recently from a FUSE app (pthread_spinlock),
> I'll have a look in the next few days.  Apparently it's not hard to
> reproduce the problem. 

I was hit again with today's kernel, with both CPUs enabled and
setiathome not running. As I said before, I've never seen these
problems when using just one CPU, or when I keep both CPUs busy.

xfce-mcs-manager died at the same point: the assertion in
pthread_spinlock, reached through pthread_exit after the pthread
cancel check in poll. I didn't find a call to pthread_cancel
in the glib sources, so I suspect that the check firing is
already an indication of corruption.

Program terminated with signal 6, Aborted.
#0  0xbb31819f in kill () from /usr/lib/libc.so.12
(gdb) where
#0  0xbb31819f in kill () from /usr/lib/libc.so.12
#1  0xbb3e01f7 in pthread__assertfunc () from /usr/lib/libpthread.so.0
#2  0xbb3dedba in pthread_spinlock () from /usr/lib/libpthread.so.0
#3  0xbb3e103d in pthread_exit () from /usr/lib/libpthread.so.0
#4  0xbb3de804 in poll () from /usr/lib/libpthread.so.0
#5  0xbb416caf in g_main_context_check () from /usr/pkg/lib/libglib-2.0.so.0
(gdb) x/100i poll
[...]
0xbb3de7d3 <poll+31>:   mov    0x1c(%esi),%eax
0xbb3de7d6 <poll+34>:   test   %eax,%eax
0xbb3de7d8 <poll+36>:   jne    0xbb3de7fa <poll+70>
0xbb3de7da <poll+38>:   push   %eax
0xbb3de7db <poll+39>:   pushl  0x10(%ebp)
0xbb3de7de <poll+42>:   pushl  0xc(%ebp)
0xbb3de7e1 <poll+45>:   pushl  0x8(%ebp)
0xbb3de7e4 <poll+48>:   call   0xbb3dbcc0 <_sys_poll@plt>
0xbb3de7e9 <poll+53>:   add    $0x10,%esp
0xbb3de7ec <poll+56>:   mov    0x1c(%esi),%esi
0xbb3de7ef <poll+59>:   test   %esi,%esi
0xbb3de7f1 <poll+61>:   jne    0xbb3de7fa <poll+70>
0xbb3de7f3 <poll+63>:   lea    0xfffffff8(%ebp),%esp
0xbb3de7f6 <poll+66>:   pop    %ebx
0xbb3de7f7 <poll+67>:   pop    %esi
0xbb3de7f8 <poll+68>:   leave  
0xbb3de7f9 <poll+69>:   ret    
0xbb3de7fa <poll+70>:   sub    $0xc,%esp
0xbb3de7fd <poll+73>:   push   $0x1
0xbb3de7ff <poll+75>:   call   0xbb3dbae0 <pthread_exit@plt>
0xbb3de804 <open>:      push   %ebp
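
To make the control flow in the disassembly easier to follow, here is a
minimal sketch of it as a toy model in Python. It is not the libpthread
source. The names (poll_wrapper, ThreadExit, load32, SELF, GARBAGE) and
the dict standing in for memory are all hypothetical; only the 0x1c
offset and the check/call/check order are taken from the listing above.
It shows how a wrong base pointer in esi could make the check fire even
though nobody requested a cancel.

```python
# Toy model of the poll() wrapper disassembled above. Everything here is
# illustrative: the names are hypothetical and memory is a plain dict.
CANCEL_OFFSET = 0x1c  # offset read via 0x1c(%esi) in the disassembly


class ThreadExit(Exception):
    """Stands in for the call to pthread_exit at poll+75."""


def load32(memory, address):
    # Words that were never written read as zero in this toy model.
    return memory.get(address, 0)


def poll_wrapper(memory, esi, sys_poll):
    # poll+31..36: test the flag before entering the system call.
    if load32(memory, esi + CANCEL_OFFSET) != 0:
        raise ThreadExit("flag set before _sys_poll")
    result = sys_poll()
    # poll+56..61: test the same flag again after the call returns.
    if load32(memory, esi + CANCEL_OFFSET) != 0:
        raise ThreadExit("flag set after _sys_poll")
    return result


SELF = 0x1000     # hypothetical address of the real thread structure
GARBAGE = 0x2000  # hypothetical wrong value that ends up in esi
memory = {GARBAGE + CANCEL_OFFSET: 0x69667a74}  # arbitrary nonzero data


def exits(esi):
    """Return True if the wrapper takes the pthread_exit path."""
    try:
        poll_wrapper(memory, esi, lambda: 0)
    except ThreadExit:
        return True
    return False
```

With the intact pointer, exits(SELF) is False. With the wrong pointer,
exits(GARBAGE) is True, because whatever nonzero word happens to sit at
offset 0x1c is read as the flag.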


When I tried to rebuild userland, /bin/sh died unexpectedly in
a way which looks impossible:

Program terminated with signal 11, Segmentation fault.
#0  0x0805aadc in setvar ()
(gdb) where
#0  0x0805aadc in setvar ()
#1  0x08055d51 in readcmd ()
#2  0x0804c594 in evalcommand ()
#3  0x0804ba6c in evaltree ()
#4  0x0804cfe5 in evalloop ()
#5  0x0804bae8 in evaltree ()
#6  0x0804cc19 in evalpipe ()
#7  0x0804ba5a in evaltree ()
#8  0x0804ba1d in evaltree ()
#9  0x0804d0ba in evalstring ()
#10 0x08054f26 in main ()
(gdb) x/i setvar
[...]
0x805aad9 <setvar+57>:  lea    0x1(%esi),%ecx
(gdb) 
0x805aadc <setvar+60>:  mov    (%ecx),%dl
(gdb) info reg
eax            0x0      0
ecx            0x806c000        134660096
edx            0x8069e00        134651392
ebx            0xbbbb3c00       -1145357312
esp            0xbfbfdd20       0xbfbfdd20
ebp            0xbfbfdd38       0xbfbfdd38
esi            0x8069ec4        134651588
edi            0x1      1
eip            0x805aadc        0x805aadc <setvar+60>
eflags         0x10216  [ PF AF IF RF ]
cs             0x17     23
ss             0x1f     31
ds             0x1f     31
es             0x1f     31
fs             0x1f     31
gs             0x1f     31
(gdb) x/x 0x8069ec4
0x8069ec4:      0x69667a74
(gdb) x/x 0x806c000
0x806c000:      Cannot access memory at address 0x806c000


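As you can see, either esi or ecx must be wrong here: the lea at
setvar+57 should leave ecx = esi + 1 = 0x8069ec5, but ecx is 0x806c000.
It might be a strange coincidence, but the xfce crash can also be
explained by a corruption of esi, since the cancel check in poll
reads its flag through 0x1c(%esi).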
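
The register arithmetic behind that observation can be checked with a
few lines of Python. This is only a sketch over the values printed by
gdb above. The 4 KiB page size is an assumption about this i386 system,
and the variable names are made up for illustration. It does not say
which register is the wrong one. It only records how far apart the two
values are, where ecx sits, and what the word at esi contains.

```python
# Register values copied from the "info reg" output above.
esi = 0x8069ec4
ecx = 0x806c000
word_at_esi = 0x69667a74      # result of "x/x 0x8069ec4"

PAGE_SIZE = 4096              # assumption: 4 KiB pages on i386

expected_ecx = esi + 1                    # what "lea 0x1(%esi),%ecx" computes
distance = ecx - expected_ecx             # how far ecx is from that value
page_aligned = ecx % PAGE_SIZE == 0       # does ecx sit on a page boundary?
# i386 is little-endian, so the lowest byte of the word comes first.
text_at_esi = word_at_esi.to_bytes(4, "little").decode("ascii")

print(hex(expected_ecx))   # 0x8069ec5
print(distance)            # 8507
print(page_aligned)        # True
print(repr(text_at_esi))   # 'tzfi'
```

So ecx is 8507 bytes past the value the lea would produce, it lies
exactly on a page boundary (under the page size assumption), and the
four bytes at esi are printable ASCII.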

I've kept the coredumps and binaries, in case someone
wants to analyze them further.

best regards
Matthias