Subject: Re: xsrc/15357: stack trashing bug crashing the sparc Xservers
To: NetBSD GNATS submissions and followups <gnats-bugs@gnats.netbsd.org>
From: Greg A. Woods <woods@weird.com>
List: netbsd-bugs
Date: 01/30/2002 23:30:07
[ On Friday, January 25, 2002 at 12:00:37 (+0000), David Brownlee wrote: ]
> Subject: Re: xsrc/15357: stack trashing bug crashing the sparc Xservers
>
> 	I believe someone quite a while back had a malloc library which
> 	you could preload and avoided the Xserver crashes. You might want
> 	to take a look back through the mail-archive. It could give a
> 	better idea of what is going wrong.

That's a very good suggestion, but I'm afraid it would only help me find
other bugs unrelated to the one I'm suffering.  The stack is being
rather badly trashed, and from what I know so far it's unlikely any
heap-related allocation problems would cause this.  Besides, the Xserver
has a relatively stable heap (no major memory leaks, I mean).

I have captured a couple of core dumps that appear to have complete
stack backtraces, but at least on first glance they all seem to suffer
from a few "you can't get there from here" breaks.  I.e. when you look
at the code, the call made there is totally different from what the next
frame indicates.  I've saved them off if anyone wants to poke at them.

This bug's really been getting my goat today.  (It has already happened
again three times while I've been typing this message!)  Sometimes the
crashes are within minutes, once only after several hours, and this time
I've been running OK for over 20 minutes....

I had for some silly reason gotten the impression that it happened only
when I was using xfs (i.e. instead of the fonts directories via NFS).
As if to spite me the server crashed many times today when using only
the fonts directories directly (well, via NFS of course).  Now I'm back
using xfs, but without much better luck.

Today I added 24MB of RAM (-8, +32, for a new total of 40MB).  Now it
rarely pages, which is nice.  Unfortunately the crashes are now MUCH
more frequent (and yes, the RAM tests just fine and it all has its
parity chip!).  I'm beginning to worry more about the kernel and not the
Xserver, since now context switches seem to happen far more frequently
and of course more quickly without the need for any paging activity.  I
may have to revert to 1.3.2 just to get any work done (this is my main
and favourite workstation).  The only thing that keeps me focused on the
Xserver is that it's the only thing crashing even though I've got
several local clients (the window manager (ctwm), xload, xclock, xbiff,
xterms, swisswatch, xconsole), ntpd, snmpd, inetd, rwhod,
syslogd, etc., etc., etc., all actively consuming CPU.

If it is the Xserver, what I think I need is some effective way of
recording the call tree separately from the stack, though probably not
in any way that requires external I/O (since in the X11 server that
would incur far too much overhead and pain, I think).  If I had the call
tree I might be able to at least find approximately where things start
to go wrong.  Maybe if I profile it then the profiling records will
leave sufficient clues, though of course there's more overhead there....
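
To make the idea concrete, here is a rough sketch of one way such a
recorder might look.  It assumes the server can be rebuilt with a
compiler that supports GCC's -finstrument-functions option, which calls
the two __cyg_profile_func_* hooks on every function entry and exit.
The buffer size and the names trace_rec, trace_buf, trace_next and
trace_record are hypothetical, made up for illustration.  The records go
into a statically allocated ring buffer, so nothing is written to the
stack beyond the hook's own frame and no I/O is done:

```c
/* Sketch only: in-memory call trace via GCC's -finstrument-functions. */

#define TRACE_SLOTS 4096u       /* hypothetical size; must be a power of two */

struct trace_rec {
    void *fn;                   /* function being entered or left */
    void *call_site;            /* address it was called from */
    int   entering;             /* 1 = entry, 0 = exit */
};

/* Static storage lives in the data segment, not on the stack. */
static struct trace_rec trace_buf[TRACE_SLOTS];
static unsigned long trace_next;    /* total records written so far */

/* The hooks must not be instrumented themselves, or they would recurse. */
void __cyg_profile_func_enter(void *fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *fn, void *call_site)
    __attribute__((no_instrument_function));
static void trace_record(void *fn, void *call_site, int entering)
    __attribute__((no_instrument_function));

/* Store one record in the next slot, overwriting the oldest when full. */
static void
trace_record(void *fn, void *call_site, int entering)
{
    struct trace_rec *r = &trace_buf[trace_next & (TRACE_SLOTS - 1u)];

    r->fn = fn;
    r->call_site = call_site;
    r->entering = entering;
    trace_next++;
}

/* Called by instrumented code on entry to every function. */
void
__cyg_profile_func_enter(void *fn, void *call_site)
{
    trace_record(fn, call_site, 1);
}

/* Called by instrumented code just before every function returns. */
void
__cyg_profile_func_exit(void *fn, void *call_site)
{
    trace_record(fn, call_site, 0);
}
```

After a crash the last TRACE_SLOTS entries and exits should be
recoverable from the core file by examining trace_buf and trace_next in
the debugger and mapping the addresses back to symbols.  This is not
signal-safe, it adds a function call to every call and return, and a
sufficiently wild pointer could of course trash the buffer too, so it
is a starting point at best.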

If anyone has any thoughts or suggestions I'm all ears!

-- 
								Greg A. Woods

+1 416 218-0098;  <gwoods@acm.org>;  <g.a.woods@ieee.org>;  <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>