Subject: server went down
To: None <netbsd-help@netbsd.org>
From: admin@datazap.net <admin@datazap.net>
List: netbsd-help
Date: 09/27/2004 14:23:24
Hi,

I have a server that went down a couple of days ago, and I was wondering
if I could find out why. It is a 500 MHz Dell running NetBSD
1.5.3.

I have been having problems on another box (port-amiga), where various
things build but then hit an ld.so error. Since the Dell does less
critical work, I tried to reproduce the same error there. It failed to
build the same packages, and always exited with an ld error.

For example:
/usr/pkgsrc/lang/gcc/work/objdir/gcc/xgcc
-B/usr/pkgsrc/lang/gcc/work/objdir/gcc/
-B/usr/local/gcc-2.95.3/i386--netbsdelf/bin/
-O2 -I/usr/local/include -fno-implicit-templates  -shared -o
libstdc++.so.4.0 `cat piclist`
/usr/bin/ld: cannot open -lgcc_pic: No such file or directory
collect2: ld returned 1 exit status
gmake[1]: *** [libstdc++.so.4.0] Error 1
gmake[1]: Leaving directory
`/usr/pkgsrc/lang/gcc/work/objdir/i386--netbsdelf/libstdc++'
gmake: *** [all-target-libstdc++] Error 2
*** Error code 2

Stop.
*** Error code 1


Since the problem seemed similar, and since I had tarred up and copied
the source from my other box to save bandwidth, I decided to update the
source. This is what I did:

setenv CVS_RSH ssh
setenv CVSROOT anoncvs@anoncvs.NetBSD.org:/cvsroot
cd /usr/src
cvs update -d -P -r netbsd-1-5
make build

I have always been told to update the kernel first after an update, but
since I was updating a 1.5.3 system with 1.5.3 code, I didn't do this.
In fact, I was not even going to build a new kernel, because this was kind
of a quick fix (just to get things working), and I had plans to update to
2.0 once I get enough time.

I was able to run "make build" 3 times in a row without a problem. I then
built all the packages that I had not been able to build before, again
without a problem, so I thought I had everything fixed and could start
using it. Because this server is supposed to be a backup for everything
else, I started transferring files to and from a few other boxes, and I
had it doing other things to get it ready. I also decided to build the
world again just to test it (I had never had a box build userland in such
a short time, and I wanted to see what would happen if I did it while the
box was under load). About 6 hours into the build it went down. I could
not ping it or anything.

This box is colocated, so I couldn't read what it said on the screen. They
said it showed a lot of "0's", and that it would respond if they hit the
return key. So I am thinking that it must have been in the debugger
(although I am not sure).

Here is what I got from the messages log:

Sep 25 15:35:06 cobalt named[121]: ns_forw: sendto([198.32.64.12].53): No
route to host
Sep 25 15:43:24 cobalt last message repeated 10 times
Sep 25 15:50:05 cobalt named[121]: ns_forw: sendto([198.32.64.12].53): No
route to host
Sep 25 15:53:25 cobalt last message repeated 10 times
Sep 25 15:59:25 cobalt last message repeated 6 times
Sep 25 16:00:00 cobalt syslogd: restart
Sep 25 16:00:25 cobalt /netbsd: WARNING: mclpool limit reached; increase
NMBCLUSTERS
Sep 25 16:01:25 cobalt /netbsd: WARNING: mclpool limit reached; increase
NMBCLUSTERS


There are tens of thousands of named messages like the ones above (if not
hundreds of thousands), and a bunch of NMBCLUSTERS warnings. So I am
wondering whether it simply ran out of network buffers. But without a
core dump, is there any way for me to know what really happened?
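
To put rough numbers on those messages, something like the following
could be used. This is only a sketch in plain sh: the function name
"tally" and the log path /var/log/messages are my assumptions, and it
assumes each syslog entry sits on a single line in the file (they are
only wrapped in this mail). It also adds in the counts from any "last
message repeated N times" lines that follow a matching entry.

```shell
#!/bin/sh
# Sketch: count syslog entries matching a pattern, including the
# "last message repeated N times" lines that follow a match.
# "tally" is a made-up helper name, not a standard tool.

tally() {
    # $1 = awk regular expression, $2 = log file to scan
    awk -v pat="$1" '
        /last message repeated [0-9]+ times/ {
            # Only credit the repeats if the previous real entry matched.
            if (prev_matched) {
                for (i = 1; i <= NF; i++)
                    if ($i == "repeated") total += $(i + 1)
            }
            next
        }
        {
            prev_matched = ($0 ~ pat)
            if (prev_matched) total++
        }
        END { print total + 0 }
    ' "$2"
}

# Assumed log location; override with LOG=/path/to/file if it differs.
LOG=${LOG:-/var/log/messages}
if [ -r "$LOG" ]; then
    echo "ns_forw failures: $(tally 'ns_forw: sendto' "$LOG")"
    echo "mclpool warnings: $(tally 'mclpool limit reached' "$LOG")"
fi
```

Comparing the two totals, and the times at which the mclpool warnings
start, might at least show whether the named failures came before the
buffer warnings or only after them.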

Also, I would like to upgrade the install of NetBSD 1.5.3 on my other box
the same way that I did above. Is that safe? Or is the update totally
unrelated to what took this server down?

Any help would be greatly appreciated!

Thanks,
Al