Subject: Re: 1.6.2 weirdness
To: NetBSD/Sparc64 Mailing-list <port-sparc64@NetBSD.org>
From: Richard Braun <email@example.com>
Date: 08/20/2004 22:23:46
Content-Type: text/plain; charset=us-ascii
On Thu, Aug 19, 2004 at 01:17:05AM +0200, ali (Anders Lindgren) wrote:
> Anyone familiar with the sleep sleeps forever bug? Does this sound like
> Situation: I have a remotely administered (serial console) Sun Ultra1 Sbus
> 143MHz with 128MB RAM and NetBSD 1.6.2 running on it. A few weeks ago it
> mysteriously went into a weird half-dead limbo state; it was running ftpd,
> sshd, named9, apache 1.3.29, exim4, and pretty much nothing else when sshd
> just died. No connection at all, not even in the TCP sense. Just timeout.
> Ftpd gave a connection, but went dead too as soon as the connection was
> established in the TCP sense. The rest of the servers just timed out like
> sshd -- apart from named9, which was working flawlessly! The only thing I
> can think of, is that named9 is the only server among those that (afaik)
> never tries to fork(2)... However, load on that box is practically zero;
> it has never run out of RAM either -- it rarely, if ever, uses swap at
> all. It's basically idle. At the time the box providing serial console
> wasn't available, so after several days (everyone with a key to the room
> was on vacation, naturally... ;( ) when I finally got in the box was
> power-cycled, and came up fine only to reveal a total silence in all logs
> during its zombie days. Found no obvious signs of the box having been
> cracked, and to my knowledge it ran no software with any vulnerabilities
> known at the time. The box has been working fine since the reboot.
> Tonight I had the serial console available when it started acting weird
> again: I was logged in via ssh but couldn't run su -- I never got a
> password prompt. Logged in fine as root on the serial console and tried to
> run top, but it never repainted the screen after displaying the processes
> so I couldn't see the load. It did however respond to the q key, and I
> could see it had lots (78 out of ~111MB) of free physical RAM. Sshd
> responsiveness seemed to degrade until I finally BREAKed it on the serial
> console, and dropped into ddb. After typing reboot, I got:
> db> reboot
> syncing disks... hme0: status=30001<GOTFRAME,RXTOHOST,NORXD>
> hme0: status=20001<GOTFRAME,NORXD>
> ..but no reboot! At this point the serial console is stone dead. The damn
> thing doesn't even respond to BREAKs anymore, and certainly won't drop me
> into the OFW prompt or back into ddb or anything. The box is dead.
> Almost. Imagine my surprise as I notice that it STILL responds to ping,
> and gives me (in a TCP sense) a connection on the ssh port! I'm not sure
> how this is possible after a reboot command in the kernel debugger, but
> there you go.
> Going to power-cycle the box tomorrow, save my configs and wipe it clean
> and net-install whatever is the latest autobuild of the 2.0 branch unless
> someone advises against it. Is 2.0_BETA now more stable than 1.6.2?
> (Sorry if the mail got long:ish...)
I had problems with the esp0 controller on my Ultra1, causing similar
effects (host responding to ICMP echo requests but nothing else working
because disk activity was frozen). This was the reason for the "no
logs" problem too :-).
I used the box as an NFS server though, so the load was a bit higher than
idle. The box froze after 12-16 days. Now that I've moved the NFS server
to another host, I don't have any more problems with it. It has been
running for more than 50 days without trouble with NetBSD/sparc64 1.6.2.