Subject: Re: 1.6.2 weirdness
To: ali (Anders Lindgren) <dat94ali@ludat.lth.se>
From: Andrey Petrov <petrov@netbsd.org>
List: port-sparc64
Date: 08/26/2004 10:07:10
On Thu, Aug 19, 2004 at 01:17:05AM +0200, ali (Anders Lindgren) wrote:
> Anyone familiar with the sleep sleeps forever bug? Does this sound like
> it?
> 
> Tonight I had the serial console available when it started acting weird
> again: I was logged in via ssh but couldn't run su -- I never got a
> password prompt. Logged in fine as root on the serial console and tried to
> run top, but it never repainted the screen after displaying the processes
> so I couldn't see the load. It did however respond to the q key, and I
> could see it had lots (78 out of ~111MB) of free physical RAM. Sshd
> responsiveness seemed to degrade until I finally BREAKed it on the serial
> console, and dropped into ddb. After typing reboot, I got:
> 
> db> reboot
> syncing disks... hme0: status=30001<GOTFRAME,RXTOHOST,NORXD>
> hme0: status=20001<GOTFRAME,NORXD>
> 
> ..but no reboot! At this point the serial console is stone dead. The damn
> thing doesn't even respond to BREAKs anymore, and certainly won't drop me

I'd say it looks very much like a locking problem: reboot calls sync,
and if file system locks are already held it will loop there. I would
still expect it to be able to break into ddb, though, which is strange.

It would be interesting to see a stack trace (well, it's always
interesting) the next time something happens. Use the t or bt commands
in ddb.

> into the OFW prompt or back into ddb or anything. The box is dead.
> Almost. Imagine my surprise as I notice that it STILL responds to ping,
> and gives me (in a TCP sense) a connection on the ssh port! I'm not sure
> how this is possible after a reboot command in the kernel debugger, but
> there you go.
> 

That's one more argument that the kernel was locked up but not crashed;
what you saw was the interrupt handlers still running. If you can
experiment on that machine, I'd suggest trying the LOCKDEBUG kernel
option.
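
As a rough sketch, assuming you build a custom kernel from a copy of
GENERIC (the file name MYKERNEL below is only a placeholder, not
something from your setup), the relevant config line would look
something like this:

```
# sys/arch/sparc64/conf/MYKERNEL  (hypothetical name, copied from GENERIC)
options 	LOCKDEBUG	# expensive locking checks/support
options 	DDB		# in-kernel debugger, which you already have
```

LOCKDEBUG slows the machine down noticeably, so it is meant for
debugging runs like this one rather than for normal use.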

	Andrey