Subject: Re: suddenly my sparcs are crashing left, right, and centre!
To: NetBSD/sparc Discussion List <port-sparc@netbsd.org>
From: Greg A. Woods <woods@weird.com>
List: port-sparc
Date: 04/26/2000 00:44:55
[ On Tuesday, April 25, 2000 at 15:57:08 (-0700), Eduardo Horvath wrote: ]
> Subject: Re: suddenly my sparcs are crashing left, right, and centre!
>
> Looks like you had a double fault panic here, which also explains the lack
> of core dump.  Your first kernel trap seems to be either in
> vm_map_lookup_done() or lockmgr(), don't know exactly.  For that I would
> need to see the trapframe (the locals in addition to the ins).

I guess we don't get that far unless I get a core dump, or a live crash
into ddb....
 
> Then inside cpu_reboot() or dumpsys() you got another trap.  The PC is
> 0xf010e450 which is probably inside dumpsys() which starts at
> 0xf010c6d4.  I suppose if you disassembled all of dumpsys() we might be
> able to figure out the panic in there, but that won't get you to the root
> cause.  But it might get you a crashdump.

I don't see dumpsys() there; gdb puts that PC in mmu_pagein() instead:

	0xf010e450 <mmu_pagein+120>:    ld  [ %o1 + %o0 ], %i2
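
For anyone following along, the address arithmetic can be sketched as
below.  This is only a rough illustration in Python using the three
numbers quoted above; the helper name offset_from() is made up for the
example, and the 4-byte instruction size is the usual fixed SPARC
instruction width, not something read off this kernel.

```python
# Numbers quoted in this thread; nothing here was read from the kernel itself.
TRAP_PC = 0xf010e450        # PC of the second trap
DUMPSYS_START = 0xf010c6d4  # start of dumpsys(), as quoted above
GDB_OFFSET = 120            # gdb printed <mmu_pagein+120> (decimal bytes)
INSN_BYTES = 4              # assumption: fixed 4-byte SPARC instructions


def offset_from(symbol_start, pc):
    """Hypothetical helper: byte distance from a symbol's start to pc."""
    return pc - symbol_start


# How far past the start of dumpsys() the PC would have to be.
dumpsys_offset = offset_from(DUMPSYS_START, TRAP_PC)

# Where mmu_pagein() must start for gdb to print "+120" at this PC.
mmu_pagein_start = TRAP_PC - GDB_OFFSET

print("offset from dumpsys():", hex(dumpsys_offset),
      "=", dumpsys_offset // INSN_BYTES, "instructions")
print("implied start of mmu_pagein():", hex(mmu_pagein_start),
      "=", GDB_OFFSET // INSN_BYTES, "instructions before the PC")
```

So the PC is a long way past the start of dumpsys(), but only a few
dozen instructions into mmu_pagein(), which fits with gdb resolving the
address to the latter symbol.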

> I don't know why you're not breaking into DDB.  You might want to try
> breaking into DDB from the console to be sure it's enabled.  

I was afraid to try it earlier  :-(

And my fears were well founded, as it doesn't seem to work; the break
drops me to the PROM monitor rather than into DDB:

	login: ^]
	telnet> send brk
	stopping on keyboard abort
	Type  'go' to resume
	Type  help  for more information
	ok go

Oddly enough, though, db_onpanic is set to '1' in my kernel, and I
double-checked that I did compile it with "options DDB":

	00:14 [57] $ gdb /netbsd 
	. . . .
	(no debugging symbols found)...
	(gdb) print db_onpanic
	$1 = 1
	(gdb)

I'm guessing now that a 'make depend' wasn't quite enough to get
everything in place -- I should probably rebuild the whole thing from
scratch.

The question though is whether this is worth the effort or not.  I'm
planning on upgrading this machine to 1.4.2 ASAP anyway, and I have a
client's sparc-20 as well as my old ss-1 server on-site that I can soon
use to test to see if the bug is present in 1.4.2 or not (assuming the
smail binary that triggered it will work on 1.4.2 and still tickle the
problem there too).  As has just been reported, there may still be
problems with 1.4.x, so perhaps I should just forget 1.3.2 and push
forward to where at least a fix would still be relevant to a larger
number of people....

Assuming I can reproduce the bug on the ss20 with 1.4.2 I can punch a
hole through my firewall to let anyone interested in debugging it get at
the console in the crashed state via my console terminal server.

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>