Subject: -current kernel broken
To: None <port-sparc@NetBSD.ORG>
From: der Mouse <mouse@Collatz.McRCIM.McGill.EDU>
List: port-sparc
Date: 06/12/1996 17:05:09
Well, I finally brought "my" NetBSD/sparc machine up to -current.  But
the new kernel croaked.  Here's a ten-finger copy:

/dev/rsd0f: 75206 files, 727590 used, 156911 free (19511 frags, 17175 blocks, 2.2% fragmentation)
trap type 0x83: pc=f804fff4 npc=f804fff8 psr=8000c6<S,PS>
panic: flush windows
Stopped at [...]
db> trace
_trap(83, 8000c6, f804fff4, f992abc8, f861c400, 0) at _trap+0x220
slowtrap(10, 30, f85f28c0, f85f28f0, f8622700, 3ff) at slowtrap+0x124
_null_node_alloc(f8604760, f804febc, f8623700, f85e6034, f9929000, f8624c00) at _null_node_alloc+0x244
_null_node_create(0, f8614000, f992ad70, 1, f8609200, f8618c00) at _null_node_create+0xa4
_nullfs_mount(0, f7fffd2a, f7fff898, f992ae00, f861c400, f804f630) at _nullfs_mount+0x9c
_sys_mount(0, f992af28, f992af20, f8047adc, 800084, f992afb0) at _sys_mount+0x43c
_syscall(15, f992afb0, 0, 3, 3fc, 0) at _syscall+0x1f0
syscall(2b68, f7fffd2a, 0, f7fff898, 400086, f992afb0) at syscall+0x120
db>

This proved to be repeatable.  I removed the nullfs mount from
/etc/fstab and then could bring the machine up.

Further investigation reveals the true cause.  0xf804fff4 is
_null_lock+0x138, and upon disassembling the code, I find

0xf804ffd0 <null_lock+276>:	st  %o4, [ %o0 + 0x10 ]
0xf804ffd4 <null_lock+280>:	b  0xf804fec8 <null_lock+12>
0xf804ffd8 <null_lock+284>:	ld  [ %l0 ], %o0
0xf804ffdc <null_lock+288>:	ld  [ %o0 + 0x134 ], %o0
0xf804ffe0 <null_lock+292>:	cmp  %o0, 0
0xf804ffe4 <null_lock+296>:	be,a   0xf804fff0 <null_lock+308>
0xf804ffe8 <null_lock+300>:	mov  -1, %o0
0xf804ffec <null_lock+304>:	ld  [ %o0 + 0x30 ], %o0
0xf804fff0 <null_lock+308>:	st  %o0, [ %i0 + 0x14 ]
0xf804fff4 <null_lock+312>:	ta  3
0xf804fff8 <null_lock+316>:	st  %i7, [ %i0 + 0x18 ]
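
ddb prints symbol offsets in hex while the disassembler prints them in decimal, so the two listings are easy to misread against each other. The following is only a sanity check of that arithmetic, using the pc and offsets quoted above; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TRAP_PC    0xf804fff4u  /* pc from the panic message */
#define DDB_OFFSET 0x138u       /* offset ddb gave for _null_lock, in hex */

/* Start address of a symbol, given an address inside it and its offset. */
uint32_t symbol_base(uint32_t pc, uint32_t offset)
{
    return pc - offset;
}

/* Cross-check the hex offset against the decimal disassembly labels. */
void check_offsets(void)
{
    uint32_t base = symbol_base(TRAP_PC, DDB_OFFSET);

    assert(DDB_OFFSET == 312u);        /* "ta 3" is at null_lock+312 */
    assert(base == 0xf804febcu);       /* start of null_lock */
    assert(base + 12 == 0xf804fec8u);  /* branch target null_lock+12 */
}
```

Calling check_offsets() from a small main() should pass silently: the trapping pc is indeed the "ta 3" at null_lock+312.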

Correlating the disassembly with the source, this has to be from the
segment

#ifdef DIAGNOSTIC
	if (curproc)
		nn->null_pid = curproc->p_pid;
	else
		nn->null_pid = -1;
	nn->null_lockpc = RETURN_PC(0);
	nn->null_lockpc2 = RETURN_PC(1);
#endif

and in null.h, I find that the relevant definition for RETURN_PC is

#define RETURN_PC(frameno) __builtin_return_address(frameno)

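which gcc is turning into something involving a "ta 3", the
flush-windows software trap (software trap 3 is hardware trap type
0x80+3 = 0x83, matching the panic message above).  But the trap
handler isn't prepared to deal with flush-windows traps from within the
kernel.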
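
For anyone unfamiliar with the builtin, here is a minimal userland sketch of what the RETURN_PC macro does. It is not the nullfs code: struct lock_record, record_lock() and demo() are hypothetical names, and only frame 0 is used because nonzero frame numbers are not safe on every target. Fetching an outer frame's return address means reading a caller's saved registers from the stack, which on SPARC requires the register windows to have been flushed to memory first; that is presumably why gcc emits the trap.

```c
#include <assert.h>
#include <stddef.h>

/* Same definition as in null.h. */
#define RETURN_PC(frameno) __builtin_return_address(frameno)

/* Hypothetical stand-in for the diagnostic fields of a null_node. */
struct lock_record {
    void *lockpc;   /* address the locking routine will return to */
};

/* noinline so the function has a real caller frame to report. */
__attribute__((noinline))
void record_lock(struct lock_record *rec)
{
    rec->lockpc = RETURN_PC(0);
}

/* Record a caller address and check that something was stored. */
void demo(void)
{
    struct lock_record rec = { NULL };

    record_lock(&rec);
    assert(rec.lockpc != NULL);
}
```

In userland a flush-windows trap is handled normally, so a program like this would not show the problem; the panic above arises only because the trap is taken in supervisor mode.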

The source tree was supped on the morning of June 10th; examining the
sources from this morning's sup (i.e., June 12th), I see no indication
that any of the pieces leading to this panic have changed: gcc still
appears prepared to generate a flush-windows trap when
__builtin_return_address is used, null.h still defines RETURN_PC to use
it, null_lock() still uses RETURN_PC under #ifdef DIAGNOSTIC, and the
trap code still looks prepared to panic if a flush-windows trap strikes
from within the kernel.

On an unrelated note, I have RASTERCONSOLE, RASTERCONS_SMALLFONT, and
RASTERCONS_FULLSCREEN, and my screen showed up 80x34.  I'm still
looking into this one; I mention it in case it reminds anyone of
anything.

					der Mouse

			    mouse@collatz.mcrcim.mcgill.edu