Subject: Weird hang with -current
To: None <port-vax@NetBSD.ORG>
From: Tom I Helbekkmo <tih@Hamartun.Priv.NO>
List: port-vax
Date: 02/07/1998 12:16:44
I'm observing a semi-catatonic state with an almost current NetBSD/vax
on a KA650.  I'm running kernel and userland built from the tar-balls
from one week ago, January 31st, but with Ragge's patches to fix the
swap problem on the KA650 added.  (Ragge: this means the tar archive
of four modified files, which you made available for ftp when I asked
you about them last weekend.)

Everything seems just fine, and as stable as I could wish, until the
system gets loaded down a bit, seemingly (but this is just a guess,
really) when two processes start competing for memory.  For instance,
while my "make build" was running, if I compiled something else at the
same time, and the two compilation runs hit big, complicated source
files at the same time (or one of them tried to compile a C++ file),
the problem would surface immediately.  Also, while 'ld' is linking a
very large executable I have to be very, very careful with what other
things I do on the system.  (I build PostgreSQL often these days, and
that will push it to this edge every time.)  I can provoke the hang
easily: all it takes is starting several large processes at the same
time, and the system will immediately get stuck.
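
For anyone who wants to try this without a full "make build", the load
can be approximated with a small program.  What follows is only a
sketch, not the workload described above: it assumes the trigger really
is several processes touching a lot of memory at once, which is a guess.
The names touch_pages and spawn_hogs, and the sizes used, are made up
for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allocate `bytes` of memory and write one byte per page, so that every
 * page must actually be backed by RAM or swap.  Returns the number of
 * pages touched, or 0 if the allocation failed. */
static size_t touch_pages(size_t bytes, size_t page_size)
{
    volatile char *buf;
    size_t off, touched = 0;

    if (page_size == 0)
        return 0;
    buf = malloc(bytes);
    if (buf == NULL)
        return 0;
    for (off = 0; off < bytes; off += page_size) {
        buf[off] = 1;           /* force the page to be instantiated */
        touched++;
    }
    free((void *)buf);
    return touched;
}

/* Fork `nprocs` children that each touch `bytes` of memory at the same
 * time.  Returns how many children exited with status 0. */
static int spawn_hogs(int nprocs, size_t bytes)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    int i, started = 0, ok = 0, status;

    for (i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid == 0)           /* child: hog memory, then exit */
            _exit(touch_pages(bytes, page_size) > 0 ? 0 : 1);
        if (pid > 0)
            started++;
    }
    /* parent: reap every child that was actually started */
    for (i = 0; i < started; i++)
        if (wait(&status) > 0 && WIFEXITED(status) &&
            WEXITSTATUS(status) == 0)
            ok++;
    return ok;
}
```

Called from a small main() as, say, spawn_hogs(4, 16UL << 20), this
starts four processes of 16 MB each; the count and size are arbitrary
and would need tuning to the machine's RAM and swap.  If this does not
provoke the hang, that in itself suggests the memory-competition guess
is incomplete.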

The effect is that almost everything seemingly stops: I cannot get a
login prompt on the console; I cannot get echo of characters there or
in active xterm windows; xterm windows don't even receive focus change
notifications any more (or at least don't show that they've received
them by changing their cursor colors) -- but the RUN light stays on,
and the machine still answers ICMP ECHO, so it's not completely dead.
However, the only thing I can do to it is hit BREAK and reboot.

Is there anything I can do to find out more about this?  I've got DDB
compiled into the kernel, so if I could get in there, I might be able
to see something interesting (except that the kernel always reports
"[netbsd symbol table invalid]" at boot, where on my Sun Sparc I see
something like "[nnnnnn bytes of netbsd symbols preserved]").  Failing
that, if I could just get it to dump core, I reckon that would be of
interest.  Any hints are gratefully accepted -- and I am of course
ready to test any modification that people might believe to be
relevant.  I know that I can reproduce this problem on demand at any
time, so it's very, very easy to test cures for!  :-)

-tih
-- 
Popularity is the hallmark of mediocrity.  --Niles Crane, "Frasier"