Subject: 1.6.2 weirdness
To: NetBSD/Sparc64 Mailing-list <port-sparc64@NetBSD.org>
From: ali (Anders Lindgren) <dat94ali@ludat.lth.se>
List: port-sparc64
Date: 08/19/2004 01:17:05
Anyone familiar with the "sleep sleeps forever" bug? Does this sound like
it?

Situation: I have a remotely administered (serial console) Sun Ultra1 Sbus
143MHz with 128MB RAM and NetBSD 1.6.2 running on it. A few weeks ago it
mysteriously went into a weird half-dead limbo state; it was running ftpd,
sshd, named9, apache 1.3.29, exim4, and pretty much nothing else when sshd
just died. No connection at all, not even in the TCP sense. Just timeout.
Ftpd gave a connection, but went dead too as soon as the connection was
established in the TCP sense. The rest of the servers just timed out like
sshd -- apart from named9, which was working flawlessly! The only thing I
can think of is that named9 is the only server among those that (afaik)
never tries to fork(2)... However, load on that box is practically zero;
it has never run out of RAM either -- it rarely, if ever, uses swap at
all. It's basically idle. At the time, the box providing the serial
console wasn't available, and everyone with a key to the room was on
vacation, naturally... ;( When I finally got in several days later, the
box was power-cycled and came up fine, only to reveal total silence in
all logs during its zombie days. I found no obvious signs of the box
having been cracked, and to my knowledge it ran no software with any
vulnerabilities known at the time. The box has been working fine since
the reboot.
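
The three states described above (no TCP connection at all, a connection
that then goes dead, and a server that actually answers) could be told
apart from a remote host with something like the sketch below. This is
only an illustration, not the tool used for this report: probe_port and
its timeout values are hypothetical, and it only makes sense for services
that send a greeting first (sshd, ftpd, SMTP), not for HTTP.

```python
import socket


def probe_port(host, port, connect_timeout=5.0, banner_timeout=5.0):
    """Classify a TCP service as 'no-connect', 'connect-only' or 'responsive'.

    Hypothetical helper for illustration; the timeouts are arbitrary.
    """
    try:
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:
        # Timeout or refusal: the TCP handshake never completed.
        return "no-connect"
    with sock:
        sock.settimeout(banner_timeout)
        try:
            data = sock.recv(256)
        except OSError:
            # The kernel accepted the connection, but the daemon never
            # sent its greeting within banner_timeout.
            return "connect-only"
    # An empty read means the peer closed without sending anything.
    return "responsive" if data else "connect-only"
```

A "connect-only" result means the kernel is still completing handshakes
while the userland daemon never gets around to talking, which is what
ftpd appeared to be doing here.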

Tonight I had the serial console available when it started acting weird
again: I was logged in via ssh but couldn't run su -- I never got a
password prompt. Logged in fine as root on the serial console and tried to
run top, but it never repainted the screen after displaying the processes
so I couldn't see the load. It did however respond to the q key, and I
could see it had lots (78 out of ~111MB) of free physical RAM. Sshd
responsiveness seemed to degrade until I finally BREAKed it on the serial
console, and dropped into ddb. After typing reboot, I got:

db> reboot
syncing disks... hme0: status=30001<GOTFRAME,RXTOHOST,NORXD>
hme0: status=20001<GOTFRAME,NORXD>

...but no reboot! At this point the serial console is stone dead. The damn
thing doesn't even respond to BREAKs anymore, and certainly won't drop me
into the OFW prompt or back into ddb or anything. The box is dead.
Almost. Imagine my surprise as I notice that it STILL responds to ping,
and gives me (in a TCP sense) a connection on the ssh port! I'm not sure
how this is possible after a reboot command in the kernel debugger, but
there you go.

Going to power-cycle the box tomorrow, save my configs and wipe it clean
and net-install whatever is the latest autobuild of the 2.0 branch unless
someone advises against it. Is 2.0_BETA now more stable than 1.6.2?

(Sorry if the mail got long:ish...)

TIA

-- 
/ali
:wq