port-sparc: SPARC crashes

Subject: SPARC crashes
To: None <current-users@netbsd.org, port-sparc@netbsd.org>
From: der Mouse <mouse@Collatz.McRCIM.McGill.EDU>
List: port-sparc
Date: 09/13/1994 18:58:17

Our SPARC running NetBSD has been crashing, reliably, every night.  The
stack traces vary, but they always have one thing in common: the
deepest few frames on the stack are always

#0  0xf8006a50 in snapshot ()
#1  0xf8006a58 in snapshot ()
#2  0xf80b8478 in boot ()
#3  0xf801fdc0 in panic ()
#4  0xf80bf12c in mem_access_fault ()
#5  0xf8005358 in trapbase ()
#6  0xf8039560 in vgone ()
#7  0xf80383bc in getnewvnode ()

I don't have all the relevant crashes at hand to examine, but I believe
the next four frames are always as follows:

#8  0xf8085c38 in ffs_vget ()
#9  0xf808fcd8 in ufs_lookup ()
#10 0xf80377c8 in lookup ()
#11 0xf8037254 in namei ()

Beyond this it varies; the syscall that provokes the crash is not
constant.  In this particular example it's stat(); in some it's been
open(), in one case I think it was readlink().

#12 0xf803bb40 in stat ()
#13 0xf80bf554 in syscall ()
#14 0xf8005668 in trapbase ()
#15 0x33c8 in ?? ()
#16 0x2ccc in ?? ()
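
For anyone comparing several such traces, here is a minimal sketch of
how the shared innermost frames could be picked out mechanically.  It
is an illustration only, not something that was run on these crashes:
frame_names() and common_innermost() are made-up helper names, the
first sample is the trace quoted above, and the second sample is
HYPOTHETICAL (the same frames with open() swapped in for stat() and a
placeholder address), standing in for one of the other crashes.

```python
import re

# Matches gdb backtrace lines such as "#4  0xf80bf12c in mem_access_fault ()".
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (\S+) \(\)")

# Trace quoted in the message (frames #0 through #13).
TRACE_STAT = """\
#0  0xf8006a50 in snapshot ()
#1  0xf8006a58 in snapshot ()
#2  0xf80b8478 in boot ()
#3  0xf801fdc0 in panic ()
#4  0xf80bf12c in mem_access_fault ()
#5  0xf8005358 in trapbase ()
#6  0xf8039560 in vgone ()
#7  0xf80383bc in getnewvnode ()
#8  0xf8085c38 in ffs_vget ()
#9  0xf808fcd8 in ufs_lookup ()
#10 0xf80377c8 in lookup ()
#11 0xf8037254 in namei ()
#12 0xf803bb40 in stat ()
#13 0xf80bf554 in syscall ()
"""

# HYPOTHETICAL second trace: identical except that frame #12 is open(),
# with a placeholder address.  Not taken from a real crash dump.
TRACE_OPEN = TRACE_STAT.replace("0xf803bb40 in stat ()", "0x0 in open ()")


def frame_names(trace_text):
    """Return the function names in a gdb backtrace, innermost frame first."""
    names = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            names.append(m.group(2))
    return names


def common_innermost(traces):
    """Return the run of innermost frames that every trace shares."""
    lists = [frame_names(t) for t in traces]
    if not lists:
        return []
    shared = []
    for group in zip(*lists):
        # Stop at the first frame depth where the traces disagree.
        if len(set(group)) != 1:
            break
        shared.append(group[0])
    return shared


if __name__ == "__main__":
    for depth, name in enumerate(common_innermost([TRACE_STAT, TRACE_OPEN])):
        print("#%d %s" % (depth, name))
```

With these two samples the shared run stops at namei(), which is where
the traces described here start to differ.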

Now, I've poked around enough that I know how to pick out the path name
being looked up in such cases, and it is not consistent.  In one case
it was /dev/null, in another I think it was /etc/localtime, and at
least once it was a plain file somewhere under /var.  (The missing
stack frame between vgone() and the trap that provokes the panic is a
call to vclean(), at least in the instance I looked that closely at.)
ps axk output, er, sorry, ps ax -M ... -N ... output is not consistent
either with regard to what's running.  Once it was "calendar -a", once
it appeared to be sendmail doing a queue run, other times it's been
other things.

It never does this under other circumstances, even on occasions when
I'm exercising the filesystem relatively heavily (as when doing a
kernel build, or updating my /usr/src tree to match the latest sup).
Only in the small hours of the morning, and it's crashed at every such
opportunity for the past week or so.  (I tried rebuilding a kernel from
the latest sup, which was as of a week or two ago now because of
sun-lamp's troubles; it didn't help visibly.)

/etc/fstab reads:

	/dev/sd0a /        ufs    rw       1 1
	/dev/sd0b none     swap   sw       0 0
	/dev/sd0d /usr     ufs    rw       1 2
	/dev/sd0e /sources ufs    rw       1 2
	/dev/sd0h /local   ufs    rw       1 2
	/dev      /wdev    null   rw       0 0
	fdesc     /dev     fdesc  rw,union 0 0
	procfs    /proc    procfs rw       0 0
	kernfs    /kern    kernfs rw       0 0

/dev is actually a symlink to /var/dev, as part of my
diskless-compatibility setup; output from mount is

	/dev/sd0a on / type ufs (local)
	/dev/sd0d on /usr type ufs (local)
	/dev/sd0e on /sources type ufs (local)
	/dev/sd0h on /local type ufs (local)
	/var/dev on /wdev type ufs (local)
	fdesc on /var/dev type fdesc (union)
	procfs on /proc type procfs (local)
	kernfs on /kern type kernfs (local)

(this is after a df; before, the last four lines should be modified as
if by "sed -e 's/type .* (/type  (/'" - see PR kern/467 for why).  I
suspect the fifth mount (the one of type null) is related, largely
because the trouble started about the same time I added it.  But since
the paths being looked up do not always have anything to do with /dev
or /wdev, it's not clear how.  In any case, though, it's a bug.  (In
passing, why does nullfs lie about the mount type?  It copies the type
of the underlying filesystem, rather than using "null".  This seems (a)
like a lie and (b) of marginal utility anyway since more than one
filesystem type can be underneath it.)
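
To make the "before a df" form of the mount output concrete, here is a
small sketch that applies the same substitution as the sed expression
above to the last four lines of the listing.  It only re-creates the
text transformation; it says nothing about why the kernel behaves that
way (see the PR for that).  before_df() is a made-up name, and limiting
the change to the last four lines follows the description above.

```python
import re

# mount output as quoted in the message (captured after a df).
AFTER_DF = """\
/dev/sd0a on / type ufs (local)
/dev/sd0d on /usr type ufs (local)
/dev/sd0e on /sources type ufs (local)
/dev/sd0h on /local type ufs (local)
/var/dev on /wdev type ufs (local)
fdesc on /var/dev type fdesc (union)
procfs on /proc type procfs (local)
kernfs on /kern type kernfs (local)
""".splitlines()


def before_df(lines, affected=4):
    """Blank the type field on the last `affected` lines.

    Equivalent to running sed -e 's/type .* (/type  (/' on those lines;
    the parenthesis is escaped here because Python's re treats a bare
    "(" as a group, whereas sed's basic regexps treat it as a literal.
    """
    head, tail = lines[:-affected], lines[-affected:]
    return head + [re.sub(r"type .* \(", "type  (", line) for line in tail]


if __name__ == "__main__":
    print("\n".join(before_df(AFTER_DF)))
```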

If anyone has any clues where I should start looking, I'd love to hear
about them.  I'm almost ready to add code to vclean so that between
midnight and 06:00 it checks all pointers it dereferences extremely
carefully....

					der Mouse

			    mouse@collatz.mcrcim.mcgill.edu