Subject: Re: race condition in pthread write()?
To: Nathan J. Williams <nathanw@wasabisystems.com>
From: Karl Janmar <karlj@mdstud.chalmers.se>
List: tech-kern
Date: 02/11/2003 18:54:22

On 11 Feb 2003, Nathan J. Williams wrote:

> Karl Janmar <karlj@mdstud.chalmers.se> writes:
>
> > I think there is a race condition in write() in libc or libpthread. I don't
> > know exactly where to look, but here is my case:
> > I run xmms with pthreads; it uses one thread for writing to esd and some
> > others for encoding the mp3 stream. After a while (could be up to 1-2 hours)
> > it freezes. When I do ps axs, it shows that all threads (10 or so) are
> > stuck in sastacks (kern_sa.c:sa_upcall_userret() waiting for a stack to
> > run?) and ONE is in wait (it says wait in the WCHAN column of ps).
>
> Okay, the problem is that the process isn't recycling enough stacks
> for the kernel. A few other people have seen this; I'm not entirely
> sure what the solution is.
>
> > When I did a backtrace on the running xmms I found this:
> > #0  0x4856fffb in write () from /usr/lib/libc.so.12
> > #1  0x482e5f65 in write () from /usr/lib/libpthread.so.0
> > #2  0x4863b8ef in get_oplugin_info ()
> >    from /usr/pkg/lib/xmms/Output/libesdout.so
> > #3  0x4863bbed in get_oplugin_info ()
> >    from /usr/pkg/lib/xmms/Output/libesdout.so
> > #4  0x1c340 in ?? ()
>
> I believe this is an artifact of GDB making a poor choice of an
> initial thread to show you. If you get into this state, can you
> additionally run "info thread" and "thread ex all" in GDB?
>
>         - Nathan
>

Here is some more data:
(gdb) info thread
  5 Thread 0  0x485700f7 in poll () from /usr/lib/libc.so.12
  4 Thread 21  0x4857008b in select () from /usr/lib/libc.so.12
  3 Thread 22  0x4857013f in nanosleep () from /usr/lib/libc.so.12
  2 Thread 69  0x4857013f in nanosleep () from /usr/lib/libc.so.12
* 1 Thread 70  0x482e5f65 in write () from /usr/lib/libpthread.so.0
(gdb) thread ex all
0x48fc0000: thread   70 in kernel
0x48f80000: thread   69 in kernel
0x48f40000: thread   22 in kernel
0x48b40000: thread   21 in kernel
0xbfbc0000: thread    0 in kernel

Maybe this will help a little more.
If I understand correctly, sa_nstacks is running out because the stacks
aren't being recycled properly, so somewhere in kern_sa.c stacks are leaking?

- Karl