port-sparc: hangs during vnd activity; they're BAAACK!

Subject: hangs during vnd activity; they're BAAACK!
To: None <port-sparc@netbsd.org>
From: Todd Whitesel <toddpw@best.com>
List: port-sparc
Date: 08/11/2000 05:12:01

Some of y'all may remember me reporting a weird sun4c vnd-related
hang/panic (while building sparc boot floppies) back in late February:

http://mail-index.netbsd.org/port-sparc/2000/02/27/0002.html

At the time, I was able to use my sun4m Sparcbook running the TADPOLE3GX
kernel to successfully finish the sparc boot floppies for 1.4.2.

I recently verified the 1.4.2 instance of the problem on a second sparc IPX
with only the hard drive moved over; that pretty much confirms it to be a
software issue. I hadn't gotten around to reporting it yet because I wanted
to find out if 1.5 improved things.

Nope, 1.5 makes things worse. I'm now trying to build ramdisk.sysinst for 1.5
and I get a different, but likewise occasionally fatal, error on the Sparcbook:

[... making all in distrib/sparc/ramdisk.sysinst ...]
COPY dist/xbase_obsolete dist/xbase_obsolete
COPY dist/xserver_obsolete dist/xserver_obsolete
COPY    ${OBJDIR}/dot.profile                   .profile
SPECIAL sh ${CURDIR}/../../sets/makeobsolete -b -s ${CURDIR}/../../sets -t ./dist

/mnt: write failed, file system is full
error 28:fflush

/mnt: write failed, file system is full
error 28:fflush
[panic]

I note that the 20000620-1.5 snapshot has no bootfs.sysinst; in fact, I ran
into this while trying to build a ramdisk.sysinst that might fit better.
(Previously it had belched "error 28:fflush" twice, but with no
file-system-is-full message. Does anyone recognize that?)
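
For reference, errno 28 is ENOSPC ("No space left on device") on BSD-derived
systems, so the "error 28:fflush" lines are most likely the same out-of-space
condition reported by a different layer. Here is a minimal sketch for decoding
the number, assuming a Python interpreter is available on some machine (it is
not part of the build, and the variable names are just for illustration):

```python
import errno
import os

# The number printed in the "error 28:fflush" message from the build log.
code = 28

# Map the numeric errno to its symbolic name and the system's message text.
name = errno.errorcode.get(code, "unknown")
message = os.strerror(code)

print(code, name, message)
```

This only identifies the error code; it says nothing about why the vnd-backed
filesystem fills up or why the kernel panics afterwards.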

Here is the backtrace from the crash dump (thank god we got one this time!):

t37-112:106# gdb netbsd.0 --readnow
GNU gdb 4.17
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc--netbsd"...(no debugging symbols found)...
(gdb) target kcore netbsd.0.core
panic: alignment fault
#0  0xf002c7e8 in mi_switch ()
(gdb) bt
#0  0xf002c7e8 in mi_switch ()
#1  0xf002bff8 in ltsleep ()
#2  0xf010e6b0 in uvm_scheduler ()
#3  0xf001c314 in check_console ()
#4  0xf0007218 in cpu_hatch ()
can not access 0x3082ec, invalid address (3082ec)
can not access 0x3082ec, invalid address (3082ec)
can not access 0x3082ec, invalid address (3082ec)
can not access 0x3082ec, invalid address (3082ec)
can not access 0xefffffd8, invalid address (efffffd8)
can not access 0xefffffd8, invalid address (efffffd8)
Cannot access memory at address 0xefffffd8.
(gdb) 

The gdb session is still open on my Sparcbook, if anyone wants me to try
things and report the output.

This bug has only ever hit me while the sparc was filling a vnd for one
of the ramdisk images, and always near the end. My suspicion is that some
race condition is triggered when the FFS code starts packing stuff into
fragments while the vnd layer is in the way.

Cranking up the size of the ramdisk so there's plenty of free space lets
it complete, but that kernel definitely won't fit on one floppy either,
so it's not a very useful workaround. Hmm. ustarfs, anyone?
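
Since the trouble only shows up when the image is nearly full, one way to see
how close a run gets is to check the free space on the mounted image between
copy steps. This is a rough sketch, not something the distrib makefiles do;
the helper names are hypothetical, and the /mnt mount point in the usage
comment is taken from the log above:

```python
import os

def free_bytes(path):
    """Bytes available to unprivileged writers on the filesystem holding path."""
    st = os.statvfs(path)
    # f_bavail counts free units of f_frsize bytes (the fragment size).
    return st.f_bavail * st.f_frsize

def has_headroom(path, needed_bytes):
    """True if at least needed_bytes are still free on path's filesystem."""
    return free_bytes(path) >= needed_bytes

# Hypothetical usage: report whether 64 KB is still free on the mounted image.
# print(has_headroom("/mnt", 64 * 1024))
```

It would only show how much slack is left when things go wrong; it does not
work around the hang itself.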

Todd Whitesel
toddpw @ best.com