netbsd-bugs: port-i386/9841: Large-memory i386 (?) machines may trigger file system corruption

Subject: port-i386/9841: Large-memory i386 (?) machines may trigger file system corruption
To: None <gnats-bugs@gnats.netbsd.org>
From: Havard Eidnes <he@runit.no>
List: netbsd-bugs
Date: 04/08/2000 10:26:09

>Number:         9841
>Category:       port-i386
>Synopsis:       Large-memory i386 (?) machines may trigger file system corruption
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-i386-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Apr 08 10:27:00 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator:     Havard Eidnes
>Release:        NetBSD-current 5 Apr 2000
>Organization:
	RUNIT AS
>Environment:
NetBSD iana.sunet.se 1.4X NetBSD 1.4X (GENERIC) #0: Wed Apr  5 15:42:37 MEST 2000     he@iana.sunet.se:/local/nbsrc/src/sys/arch/i386/compile/GENERIC i386

>Description:
	This machine, a PPro200 with 1GB of RAM, can be crashed
	reproducibly by running "bonnie" on one of its file
	systems.

	Another way is to wait for the nightly jobs to run; these
	often cause a crash as well.

	However, doing a "make build" does not appear to provoke the
	problem.


	When this problem has been triggered by "bonnie", parts of
	the data from the bonnie run have been scattered over parts
	of the disk where they definitely should *not* be, causing
	directory and file corruption and a mess to clean up
	afterwards.

	The typical panic is triggered by the internal error
	detection in the ffs code, with crashes of the type

panic: ffs_valloc: dup alloc

	or more frequently

First bad, reclen=33aa, DIRSIZ=28, namlen=19, flags=1000 entryoffsetinblock=0
/var: bad dir ino 129344 at offset 0: mangled entry
panic: bad dir

	This is sometimes (but *not* always) preceded by

bha2: mbi not in round-robin order

	The disk subsystem on this machine is

bha2 at pci0 dev 13 function 0: BusLogic 9xxC SCSI
bha2: interrupting at irq 11
bha2: model BT-958, firmware 5.06I
bha2: 192 H/W CCBs, sync, parity, tagged queueing
scsibus0 at bha2: 16 targets, 8 luns per target
...
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 0 lun 0: <SEAGATE, ST34572W, 0784> SCSI2 0/direct fixed
sd0: 4340 MB, 6300 cyl, 8 head, 176 sec, 512 bytes/sect x 8888924 sectors
sd1 at scsibus0 target 1 lun 0: <SEAGATE, ST39175LW, 0001> SCSI2 0/direct fixed
sd1: 8683 MB, 11721 cyl, 5 head, 303 sec, 512 bytes/sect x 17783240 sectors
...

	This hardware has been running 1.3.3 for a *long* time without
	experiencing any similar instability.

	This hardware also ran 1.4.2 for a brief while in order to get
	new boot code installed, but I did not see the problem with
	that system either.

>How-To-Repeat:
	Equip machine as above.
	Run "bonnie -s 200".
	Watch it crash after a minute or two.

>Fix:
	Sorry, this is over my head.

	Hints on how to debug this further would, however, be
	gratefully accepted.

	The currently running kernel has both DIAGNOSTIC and DEBUG
	enabled (as can be seen from the extra info in the second of
	the two panic messages above).
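
	For anyone trying to read the "First bad" line, here is a
	rough sketch of the kind of sanity checks a BSD-style
	directory lookup applies to an entry.  It is not the kernel
	code.  The constants (512-byte directory blocks, 8-byte entry
	header, 255-byte maximum name) and the helper names are
	assumptions for illustration, as is the reading of reclen as
	hexadecimal and of the other fields as decimal.

```python
# Hypothetical sketch, NOT the kernel source: models the usual sanity
# checks on a BSD FFS directory entry. All constants are assumptions.
DIRBLKSIZ = 512   # assumed directory block size in bytes
MAXNAMLEN = 255   # assumed maximum file name length
HEADER = 8        # assumed fixed header: ino(4) + reclen(2) + type(1) + namlen(1)


def dirsiz(namlen):
    """Smallest record that can hold a name of namlen bytes:
    the header plus the NUL-terminated name, padded to 4 bytes."""
    return HEADER + ((namlen + 1 + 3) & ~3)


def entry_problems(reclen, namlen, entryoffsetinblock):
    """Return a list of reasons why an entry looks mangled."""
    problems = []
    if reclen & 3:
        problems.append("reclen is not a multiple of 4")
    room = DIRBLKSIZ - (entryoffsetinblock & (DIRBLKSIZ - 1))
    if reclen > room:
        problems.append("reclen runs past the end of the directory block")
    if reclen < dirsiz(namlen):
        problems.append("reclen is too small for the name")
    if namlen > MAXNAMLEN:
        problems.append("namlen exceeds the maximum name length")
    return problems


# Values from the panic message, with reclen assumed to be hexadecimal.
reclen = int("33aa", 16)
print("DIRSIZ for namlen 19:", dirsiz(19))
for reason in entry_problems(reclen, namlen=19, entryoffsetinblock=0):
    print("bad entry:", reason)
```

	Under these assumptions the reported DIRSIZ=28 agrees with
	namlen=19, so only reclen looks implausible: it is both
	unaligned and far larger than a directory block.  That fits
	the description above of foreign data having been written
	over the directory.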
>Release-Note:
>Audit-Trail:
>Unformatted: