port-cobalt: Re: Qube 2/NetBSD 1.6 regular instability

Subject: Re: Qube 2/NetBSD 1.6 regular instability
To: Ian Spray <cobalt@minimal.cx>
From: Rodrigo Fernandez-Vizarra <Rodrigo.Fdz-Vizarra@infonegocio.com>
List: port-cobalt
Date: 10/04/2003 00:57:24
I'm having instability problems with my Qube2 too.

My Qube2 has 128 MB of RAM: 64 MB from the original setup and 64 MB more 
from another Qube2. I've replaced the original 13 GB hard drive with a 
40 GB Seagate Barracuda (if I remember correctly).

I used to get many pmap_unwire messages with NetBSD 1.6. Now, with NetBSD 
1.6.1, I don't see any in my logs, but the system still hangs from time to 
time. I have noticed that when the system has high network activity it 
tends to hang sooner. Did you notice something similar? High 
network load -> hang (it takes one or two days to hang)

I have similar network performance problems: it's supposed to be 10/100 
Ethernet, but I never get more than 900Kb/s, even with a crossover cable. 
I don't know if the problem is with the hardware or with the software 
(the driver).
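
To put the throughput figure in perspective, here is a rough calculation of how far it falls below the nominal line rate. It is only a sketch under two assumptions that the message does not confirm: that "900Kb/s" means about 900 kilobytes per second, and that the link is expected to run at 100 Mbit/s. The function names are made up for illustration.

```python
# Assumptions (not stated in the message): the figure is ~900 kilobytes/s
# and the nominal link speeds to compare against are 100 and 10 Mbit/s.

def line_rate_bytes_per_s(mbit_per_s: float) -> float:
    """Convert a nominal link speed in Mbit/s to bytes per second."""
    return mbit_per_s * 1_000_000 / 8


def utilisation(observed_kbytes_per_s: float, link_mbit_per_s: float) -> float:
    """Fraction of the nominal line rate actually achieved."""
    return observed_kbytes_per_s * 1000 / line_rate_bytes_per_s(link_mbit_per_s)


if __name__ == "__main__":
    for link in (100, 10):
        print(f"{link} Mbit/s link: {utilisation(900, link):.1%} of line rate")
```

Under those assumptions the observed rate is a small fraction of a 100 Mbit/s link but a large fraction of a 10 Mbit/s one, so it may be worth checking which speed and duplex the interface actually negotiated before blaming the driver. That is only a guess from the numbers, not a diagnosis.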

The worst thing is that I don't have a clue about how to trace the problem.
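
One possible first step, short of a serial console, is to summarise whatever kernel messages did reach the logs. The sketch below tallies pmap_unwire messages by virtual address and collects wd* disk messages. The regular expression is modelled only on the lines in the dmesg output quoted below, the sample lines are copied from it, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the pmap_unwire lines in the quoted dmesg output.
PMAP_RE = re.compile(r"pmap_unwire: wiring for pmap (0x[0-9a-f]+) va (0x[0-9a-f]+)")
# Disk driver messages such as "wd0:" or "wd0a:".
DISK_RE = re.compile(r"wd\d+[a-z]?:")


def summarise(lines):
    """Return (message count per virtual address, list of wd* disk messages)."""
    by_va = Counter()
    disk = []
    for line in lines:
        line = line.lstrip("> ").rstrip()  # tolerate mail quoting
        m = PMAP_RE.search(line)
        if m:
            by_va[m.group(2)] += 1
        elif DISK_RE.match(line):
            disk.append(line)
    return by_va, disk


# Sample lines copied from the quoted dmesg output.
SAMPLE = """\
pmap_unwire: wiring for pmap 0x810fb2c0 va 0x7fffc000 didn't change!
pmap_unwire: wiring for pmap 0x810fbd00 va 0x7fffa000 didn't change!
wd0: soft error (corrected)
pmap_unwire: wiring for pmap 0x810fbde0 va 0x10012000 didn't change!
pmap_unwire: wiring for pmap 0x810fb5e0 va 0x7fffa000 didn't change!
""".splitlines()

if __name__ == "__main__":
    by_va, disk = summarise(SAMPLE)
    for va, n in by_va.most_common():
        print(f"{va}: {n}")
    print(f"disk messages: {len(disk)}")
```

Assuming kernel messages are being written to a file such as /var/log/messages, the same function could be fed that file's lines instead of the sample. It will not explain the hang, but it shows whether the messages cluster on a few addresses or coincide with disk errors.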

Regards,
Rodrigo

Ian Spray wrote:

>Hi all,
>
>I've been having problems with my Qube 2 running a custom 1.6 kernel for
>some time and could do with some advice on how to go about troubleshooting
>it.  Using the original Linux 2.0 kernel that came with the system I got
>over 60 days uptime, but as you can see from my live stats
>(http://minimal.cx/uptime.php) I'm getting between 23 and 25 days.  The big
>Linux stats were sadly lost in a NetBSD crash...
>
>The main change from the Linux days is the hard drive upgrade from 10GB to
>120GB and the RAM increase from 96MB to 192MB.  The RAM was bought from
>Crucial as approved Cobalt Qube 2 RAM, and I spent ages checking the power
>consumption figures of hard drives to ensure that the 120GB model was within
>10% of the values for the original 10GB one (it only exceeds the 10GB
>figures at startup - operating currents are actually lower).
>
>The system logs typically show no useful information - the system simply
>stops and so far I've not had it hooked up to a serial terminal to try to
>get any sensible kernel debug output (this is obviously step one !).  The
>last dmesg does have a lot of pmap_unwire errors and also an IDE DMA
>problem, but the most interesting thing is that the system dies in the
>middle of writing out the pmap_unwire error:
>
>pmap_unwire: wiring for pmap 0x810fb2c0 va 0x7fffc000 didn't change!
>pmap_unwire: wiring for pmap 0x810fbd00 va 0x7fffa000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb5e0 va 0x7fffa000 didn't change!
>pmap_unwire: wiring for pmap 0x810fbe60 va 0x7fffa000 didn't change!
>wd0a: DMA error reading fsbn 82942512 of 82942512-82942639 (wd0 bn 83515823; cn 82852 tn 15 sn 62), retrying
>wd0: soft error (corrected)
>pmap_unwire: wiring for pmap 0x810fbde0 va 0x10012000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb960 va 0x7fffc000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb7e0 va 0x1001a000 didn't change!
>pmap_unwire: wiring for pmap 0x810fbdc0 va 0x7fffa000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb5e0 va 0x7fffa000 didn't change!
>pmap_unwire: wiring for pmap 0x810fbce0 va 0x7fffa000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb9a0 va 0x7fffc000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb300 va 0x7fffa000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb300 va 0x7fffa000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb4e0 va 0x10012000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb9a0 va 0x7fffa000 didn't change!
>pmap_unwire: wiring for pmap 0x810fbb40 va 0x7fffa000 didn't change!
>pmap_unwire: wiring for pmap 0x810fb660 va 0x7fffa000 didn't change!
>pmap_unwire: wiring fo\^C\^PTap 0x810fb860 va 0x7fffa000 didn't change!
>Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002
>    The NetBSD Foundation, Inc.  All rights reserved.
>Copyright (c) 1982, 1986, 1989, 1991, 1993
>    The Regents of the University of California.  All rights reserved.
>
>
>I am also running with soft_deps on, but would have expected problems to
>show up long before the 22 day mark - the system runs BIND, Apache, sshd,
>mrtg, samba, spamd, exim and serves up quite a few large files from PHP
>enabled web sites hosted on it.  It can have a load peak of 14.60 (maybe
>higher but that's all I've observed) due to some unfriendly perl jobs but it
>tends to average no more than 4 in a typical day.
>
>I haven't seen anything mentioned in the CVS logs for
>sys/arch/mips/mips/pmap.c that might indicate that the unwire message is
>fixed in the MAIN branch, and I've also not seen any evidence to say that
>pmap is even a problem.  Does anyone else have a loaded Qube 2 with similar
>problems ?  I would assume not, or there would have been more emails like
>this !
>
>I also experience really slow network I/O (in the archives) and am wondering
>if I've simply got hardware that isn't perfect.  I'm currently open to even
>the wildest suggestions, as the only option open to me at the moment is to
>schedule a reboot every 14 days, which makes me little better than a
>Windows admin :(  I also need to have an alternative server in place before
>I can do really tough testing/proper kernel+serial debug (it's too
>important to do without), so I'm hoping to collect ideas whilst I'm putting
>one together.
>
>About the only other thing I've changed is to increase kern.maxvnodes to
>40000 as the command line response of the system was appalling with the
>default 11000 (and something) value.  The system is running behind an APC
>BackUPS 600 and so the input power should be clean and stable.
>
>Thanks in advance,
>  
>