Subject: Qube 2/NetBSD 1.6 regular instability
To: None <port-cobalt@netbsd.org>
From: Ian Spray <cobalt@minimal.cx>
List: port-cobalt
Date: 10/02/2003 14:10:10
Hi all,

I've been having problems with my Qube 2 running a custom 1.6 kernel for
some time and could do with some advice on how to go about troubleshooting
it.  Using the original Linux 2.0 kernel that came with the system I got
over 60 days uptime, but as you can see from my live stats
(http://minimal.cx/uptime.php) I'm now getting between 23 and 25 days.  The big
Linux stats were sadly lost in a NetBSD crash...

The main change from the Linux days is the hard drive upgrade from 10GB to
120GB and the RAM increase from 96MB to 192MB.  The RAM was bought from
Crucial as approved Cobalt Qube 2 RAM, and I spent ages checking the power
consumption figures of hard drives to ensure that the 120GB model was within
10% of the values for the original 10GB one (it only exceeds the 10GB
figures at startup - operating currents are actually lower).

The system logs typically show no useful information - the system simply
stops and so far I've not had it hooked up to a serial terminal to try to
get any sensible kernel debug output (this is obviously step one !).  The
last dmesg does have a lot of pmap_unwire errors and also an IDE DMA
problem, but the most interesting thing is that the system dies in the
middle of writing out the pmap_unwire error:

pmap_unwire: wiring for pmap 0x810fb2c0 va 0x7fffc000 didn't change!
pmap_unwire: wiring for pmap 0x810fbd00 va 0x7fffa000 didn't change!
pmap_unwire: wiring for pmap 0x810fb5e0 va 0x7fffa000 didn't change!
pmap_unwire: wiring for pmap 0x810fbe60 va 0x7fffa000 didn't change!
wd0a: DMA error reading fsbn 82942512 of 82942512-82942639 (wd0 bn 83515823; cn 82852 tn 15 sn 62), retrying
wd0: soft error (corrected)
pmap_unwire: wiring for pmap 0x810fbde0 va 0x10012000 didn't change!
pmap_unwire: wiring for pmap 0x810fb960 va 0x7fffc000 didn't change!
pmap_unwire: wiring for pmap 0x810fb7e0 va 0x1001a000 didn't change!
pmap_unwire: wiring for pmap 0x810fbdc0 va 0x7fffa000 didn't change!
pmap_unwire: wiring for pmap 0x810fb5e0 va 0x7fffa000 didn't change!
pmap_unwire: wiring for pmap 0x810fbce0 va 0x7fffa000 didn't change!
pmap_unwire: wiring for pmap 0x810fb9a0 va 0x7fffc000 didn't change!
pmap_unwire: wiring for pmap 0x810fb300 va 0x7fffa000 didn't change!
pmap_unwire: wiring for pmap 0x810fb300 va 0x7fffa000 didn't change!
pmap_unwire: wiring for pmap 0x810fb4e0 va 0x10012000 didn't change!
pmap_unwire: wiring for pmap 0x810fb9a0 va 0x7fffa000 didn't change!
pmap_unwire: wiring for pmap 0x810fbb40 va 0x7fffa000 didn't change!
pmap_unwire: wiring for pmap 0x810fb660 va 0x7fffa000 didn't change!
pmap_unwire: wiring fo\^C\^PTap 0x810fb860 va 0x7fffa000 didn't change!
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.
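
To see whether the warnings cluster on particular addresses, output like
the above can be tallied with a short script.  This is only a sketch in
Python: the function name tally_unwire is made up for illustration, and it
assumes the message format is exactly as shown in the excerpt.  Lines that
start with "pmap_unwire:" but do not match (such as the corrupted last one)
are counted separately rather than guessed at.

```python
import re
from collections import Counter

# Matches lines such as:
#   pmap_unwire: wiring for pmap 0x810fb2c0 va 0x7fffc000 didn't change!
# The format is taken from the dmesg excerpt; other kernels may differ.
PMAP_RE = re.compile(
    r"^pmap_unwire: wiring for pmap (0x[0-9a-f]+) va (0x[0-9a-f]+) didn't change!$"
)


def tally_unwire(lines):
    """Count pmap_unwire warnings per virtual address and per pmap.

    Returns (by_va, by_pmap, skipped), where skipped is the number of
    pmap_unwire lines that could not be parsed.
    """
    by_va = Counter()
    by_pmap = Counter()
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line.startswith("pmap_unwire:"):
            continue  # ignore wd0 errors, copyright banner, etc.
        m = PMAP_RE.match(line)
        if m is None:
            skipped += 1  # truncated or corrupted message
            continue
        pmap, va = m.groups()
        by_pmap[pmap] += 1
        by_va[va] += 1
    return by_va, by_pmap, skipped
```

It takes any iterable of lines, so an open file holding a saved dmesg works
as well as a list; by_va.most_common() then lists the addresses that show up
most often.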


I am also running with soft dependencies (softdep) on, but would have expected problems to
show up long before the 22 day mark - the system runs BIND, Apache, sshd,
mrtg, samba, spamd, exim and serves up quite a few large files from PHP
enabled web sites hosted on it.  It can have a load peak of 14.60 (maybe
higher but that's all I've observed) due to some unfriendly perl jobs but it
tends to average no more than 4 in a typical day.

I haven't seen anything mentioned in the CVS logs for
sys/arch/mips/mips/pmap.c that might indicate that the unwire message is
fixed in the MAIN branch, and I've also not seen any evidence to say that
pmap is even a problem.  Does anyone else have a loaded Qube 2 with similar
problems ?  I would assume not, or there would have been more emails like
this !

I also experience really slow network I/O (discussed in the archives) and am wondering
if I've simply got hardware that isn't perfect.  I'm currently open to even
the wildest suggestions, as the only option open to me at the moment is to
schedule a reboot every 14 days, which makes me little better than a
Windows admin :(  I also need to have an alternative server in place before
I can do really tough testing/proper kernel+serial debug (it's too
important to do without), so I'm hoping to collect ideas whilst I'm putting
one together.

About the only other thing I've changed is to increase kern.maxvnodes to
40000, as the command line response of the system was appalling with the
default 11000 (and something) value.  The system is running behind an APC
BackUPS 600 and so the input power should be clean and stable.

Thanks in advance,
-- 
ian.
PGP Fingerprint: D170 35A3 C858 6E85 9B5B  1557 4CD5 6F6F E176 2D0A