Subject: port-i386/9841: Large-memory i386 (?) machines may trigger file system corruption
To: None <gnats-bugs@gnats.netbsd.org>
From: Havard Eidnes <he@runit.no>
List: netbsd-bugs
Date: 04/08/2000 10:26:09
>Number: 9841
>Category: port-i386
>Synopsis: Large-memory i386 (?) machines may trigger file system corruption
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: port-i386-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Apr 08 10:27:00 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator: Havard Eidnes
>Release: NetBSD-current 5 Apr 2000
>Organization:
RUNIT AS
>Environment:
NetBSD iana.sunet.se 1.4X NetBSD 1.4X (GENERIC) #0: Wed Apr 5 15:42:37 MEST 2000 he@iana.sunet.se:/local/nbsrc/src/sys/arch/i386/compile/GENERIC i386
>Description:
This machine, which is a PPro200 with 1GB of RAM, can
reproducibly be crashed by running "bonnie" on one of its file
systems.
Another way is to wait for the nightly jobs to run; those
often cause a crash as well.
However, doing a "make build" does not appear to provoke the
problem.
When this problem has been triggered by "bonnie", parts of
the data from the bonnie run have been scattered over parts
of the disk where they definitely should *not* be, causing
directory and file corruption and a mess to clean up
afterwards.
The typical panic is triggered by the internal error
detection in the ffs code, with crashes of the type
panic: ffs_valloc: dup alloc
or more frequently
First bad, reclen=33aa, DIRSIZ=28, namlen=19, flags=1000 entryoffsetinblock=0
/var: bad dir ino 129344 at offset 0: mangled entry
panic: bad dir
This is sometimes (but *not* always) preceded by
bha2: mbi not in round-robin order
The disk subsystem on this machine is
bha2 at pci0 dev 13 function 0: BusLogic 9xxC SCSI
bha2: interrupting at irq 11
bha2: model BT-958, firmware 5.06I
bha2: 192 H/W CCBs, sync, parity, tagged queueing
scsibus0 at bha2: 16 targets, 8 luns per target
...
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 0 lun 0: <SEAGATE, ST34572W, 0784> SCSI2 0/direct fixed
sd0: 4340 MB, 6300 cyl, 8 head, 176 sec, 512 bytes/sect x 8888924 sectors
sd1 at scsibus0 target 1 lun 0: <SEAGATE, ST39175LW, 0001> SCSI2 0/direct fixed
sd1: 8683 MB, 11721 cyl, 5 head, 303 sec, 512 bytes/sect x 17783240 sectors
...
This hardware has been running 1.3.3 for a *long* time without
experiencing any similar instability.
This hardware also ran 1.4.2 for a brief while in order to get
new boot code installed, but I did not see the problem with
that system either.
>How-To-Repeat:
Equip machine as above.
Run "bonnie -s 200".
Watch it crash after a minute or two.
>Fix:
Sorry, this is over my head.
Hints on how to debug this further are, however, gratefully
accepted.
The currently running kernel has both DIAGNOSTIC and DEBUG
enabled (as can be seen from the extra info in the second of
the two panic messages above).
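For anyone reading the "First bad" line, the numbers can be checked by
hand. The following is only a rough sketch, not the kernel code: it
assumes the traditional FFS directory entry layout (an 8-byte header
followed by the NUL-terminated name, padded to 4 bytes), a 512-byte
directory block, and that reclen is printed in hex. The function names
are made up for illustration.

```python
# Assumed values from the traditional BSD FFS directory layout;
# these are not taken from the report itself.
DIRBLKSIZ = 512   # assumed directory block size
MAXNAMLEN = 255   # assumed maximum name length
HEADER = 8        # assumed: inode (4) + reclen (2) + type (1) + namlen (1)


def dirsiz(namlen):
    """Minimum record size: header + name + NUL, rounded up to 4 bytes."""
    return (HEADER + namlen + 1 + 3) & ~3


def why_bad(reclen, namlen, entryoffsetinblock):
    """Return the sanity checks a directory entry would fail (hypothetical helper)."""
    problems = []
    room = DIRBLKSIZ - (entryoffsetinblock & (DIRBLKSIZ - 1))
    if reclen & 3:
        problems.append("reclen is not a multiple of 4")
    if reclen > room:
        problems.append("reclen runs past the end of the directory block")
    if reclen < dirsiz(namlen):
        problems.append("reclen is too small to hold the name")
    if namlen > MAXNAMLEN:
        problems.append("namlen exceeds MAXNAMLEN")
    return problems


# Values from the report: reclen=33aa (hex), namlen=19, entryoffsetinblock=0
for problem in why_bad(0x33aa, 19, 0):
    print(problem)
```

Under these assumptions, a 19-character name gives the DIRSIZ of 28
shown in the message, so namlen looks sane, while a reclen of 0x33aa is
far larger than a directory block and not 4-byte aligned. That would be
consistent with foreign data having been written over the start of the
directory block, but it does not say where that data came from.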
>Release-Note:
>Audit-Trail:
>Unformatted: