Subject: port-mac68k/6411: Possible race condition between disk ioctl's and kernel pool allocator
To: None <gnats-bugs@gnats.netbsd.org>
From: None <fb@enteract.com>
List: netbsd-bugs
Date: 11/08/1998 06:33:15
>Number:         6411
>Category:       port-mac68k
>Synopsis:       Experiencing random crashes after any access to non-boot disk
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    gnats-admin (GNATS administrator)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Nov  8 04:35:00 1998
>Last-Modified:
>Originator:     Frederick Bruckman
>Organization:

>Release:        Nov 7, 1998
>Environment:
System: NetBSD fb.sa.enteract.com 1.3H NetBSD 1.3H (FB) #138: Sun Nov 8 00:22:18 CST 1998 
fredb@fb.sa.enteract.com:/usr/src/sys/arch/mac68k/compile/FB mac68k


>Description:
	A number of people have reported random crashes with current mac68k kernels. The last
	kernel that does not exhibit this problem, for me, was built on August 30 against that
	day's sup. My Quadra 630/36M (w/3 external disks) would not boot into multi-user w/o
	panicking until I'd trimmed my fstab and moved everything onto one physical disk. The
	panic message is consistent, if not informative, when an exact sequence of commands is
	followed after booting single-user. Entering the command "disklabel sdN" after the
	system has been up for some time works occasionally, but repeating it will eventually
	cause some kind of crash. I say "some kind of crash" because I have experienced a
	baffling variety of symptoms over the past month or two, from "hard freezes" that
	required a reset and console lockups that did permit entry into the debugger, to a variety
	of panics (mostly in uvm_fault and pool_get). Some of these messages have evolved with
	recent changes to the uvm and kernel pool allocator code, but the easily reproducible 
	symptom described below has _not_ changed since Sep 5.

	Stock GENERIC kernels, and minor changes to my custom config, make no difference. It also
	doesn't matter which physical disk has the system; access to any other disk but the
	boot disk is likely to fail. By "boot disk", I mean the disk that has the root file
	system. The Booter has the option of booting a kernel from one disk, and then choosing
	another for the root and swap; this works, too, until you try to access the disk the 
	kernel is on from within NetBSD.
>How-To-Repeat:
	Get a Quadra 630 with several external SCSI disks. Set up root/swap/user filesystems
	on one or more of them, and populate with current binaries and a kernel. Boot from,
	say, sd0 in single user mode:

# disklabel sd0

	/* normal disklabel */

# disklabel sd1
panic fstat

Stopped in disklabel at _Debugger+0x6: unlk a6
db>t
_Debugger
_panic
_sys___fstat13
_syscall(117) + e4
_trap0() + e

	As I said, the preceding sequence is remarkably consistent.

	The results are more interesting if you can "finesse" disklabel to complete without
	panicing. Shutting down from X is sometimes sufficient; "ls -lR" sometimes helps, too.

	Try "disklabel sdN ; vmstat -m" on all disks repeatedly. Notice that the scsxpl pool
	counts requests for 8 additional items after hitting any disk but sd0. (I've looked at
	this only after booting from sd0.) The request count is never incremented after
	"disklabel sd0". In any case, it will eventually crash.
>Fix:
	The only solution I've found is to construct a system wholly on one physical disk, and
	then trim the fstab accordingly. It's possible to "finesse" additional swaps and
	partitions manually after the system has been up for a while; the chief danger seems
	to occur when they are actually added, or soon thereafter. I've managed to shut down
	to single-user from X, add a swap on sd2b, and then exit back into multi-user. It was 
	then possible to hit both swaps heavily by running a bunch of X apps. It only crashed 
	when I tempted fate by trying "disklabel sdN", repeatedly, from an xterm.
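
	For anyone trying to reproduce this, the repeated "disklabel sdN ; vmstat -m" sequence
	from How-To-Repeat could be scripted roughly as below. This is only a sketch: the
	function name probe_disks and the DISKS, ROUNDS, DISKLABEL and VMSTAT variables are
	illustrative and not part of the original report, and nothing is assumed about the
	vmstat -m column layout beyond the scsxpl pool name appearing on its line.

```shell
#!/bin/sh
# Sketch: access each disk with disklabel, then show the scsxpl pool line
# from "vmstat -m" so the request count can be compared between disks.
# All variable names here are hypothetical knobs, not from the report.
DISKS="${DISKS:-sd0 sd1 sd2}"        # disks to probe; adjust to the machine
ROUNDS="${ROUNDS:-5}"                # how many passes over all disks
DISKLABEL="${DISKLABEL:-disklabel}"  # overridable so the loop can be dry-run
VMSTAT="${VMSTAT:-vmstat}"

probe_disks() {
	round=1
	while [ "$round" -le "$ROUNDS" ]; do
		for disk in $DISKS; do
			# The label itself is discarded; only the side effect on the
			# kernel pool is of interest here.
			$DISKLABEL "$disk" >/dev/null 2>&1
			# Print the scsxpl line verbatim, tagged with round and disk.
			$VMSTAT -m | grep scsxpl | sed "s/^/round $round $disk: /"
		done
		round=$((round + 1))
	done
}
```

	Calling probe_disks as root after booting single-user prints one tagged scsxpl line per
	disk per round; given the behaviour described above, expect the machine to crash
	before the loop finishes, so run it from the console with the debugger enabled.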

>Audit-Trail:
>Unformatted: