Subject: kern/4646: aha timeout and subsequent filesystem corruption
To: None <gnats-bugs@gnats.netbsd.org>
From: None <jfw@funhouse.com>
List: netbsd-bugs
Date: 12/06/1997 15:26:29
>Number:         4646
>Category:       kern
>Synopsis:       aha timeout and subsequent filesystem corruption
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Dec  6 12:35:01 1997
>Last-Modified:
>Originator:     John F. Woods
>Organization:
Misanthropes-R-Us
>Release:        NetBSD 1.3_ALPHA Nov 20, 1997 (roughly)
>Environment:
	
System: NetBSD jfwhome.funhouse.com 1.3_ALPHA NetBSD 1.3_ALPHA (JFW) #28: Sun Nov 23 09:11:27 EST 1997 jfw@jfwhome.funhouse.com:/usr/src/sys/arch/i386/compile/JFW i386


>Description:
	Unfortunately, I don't have a _precise_ description of the problem.
But approximately what happened was this:

I have a new SyJet 1.5GB disk drive.  I had a cartridge mounted, and accidentally
pressed the eject-cartridge button before unmounting it.  Fortunately, NetBSD
appears to use the reservation feature while filesystems are mounted to prevent
ejection (and I've taken advantage of this before with my old Syquest EZ135).
Unfortunately, when I unmounted the cartridge, something went wrong:  while
the cartridge was spinning down, the umount command did not complete; in
fact, a couple of "aha0: timed out" messages appeared (not the "AGAIN"
message), and then my system entered the debugger after printing
"exiting ccb not allocated!".
I then (foolishly) continued, and tried to shut my system down, but after a
couple more timeout messages (on a different disk drive) it crashed (I forget
which panic it was).

When I tried to reboot, it couldn't exec /sbin/init for errno 13.  When I
booted from my backup root (thank goodness), I found that my main root was
seriously corrupted; after fsck finished the corruption, it appeared that
many files had had their inodes overwritten with garbage; /sbin/init was
one of the victims.  Unfortunately, I had no recent backup tape, so I got to
spend Thanksgiving rebuilding from scratch and trying to remember what my
old /etc/rc.local looked like (and a few other curdled configuration files
in /etc).

I'm sorry this bug report is too vague to provide sound directions for
investigation.  I surmise that what must have happened is that there were
still blocks to be written out to the cartridge after the mount code released
it; the timeout processing must have recycled the CCB even though the adapter
still had it allocated and was going to complete it; and that once CCBs were
multiply allocated like that, disk blocks were being written to completely
random places.  I think this argues for two bugs, although perhaps once I
entered the Debugger I should have rebooted instead of continuing.  This is,
I guess, something to be kept in mind by the next hardy soul who takes a
hard look at the code in question...
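
To make the surmise above concrete, here is a minimal sketch in C of the
suspected failure mode.  It is not the aha driver's code: the structure, the
state names, and every function below are hypothetical, invented only to
illustrate how a timeout handler that frees a CCB the adapter still owns can
lead to the same CCB being handed out twice, and how keeping it out of the
pool until the adapter completes it would avoid that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CCB states; these are NOT the aha driver's real flags. */
enum ccb_state { CCB_FREE, CCB_ACTIVE, CCB_ABORTING };

struct ccb {
    enum ccb_state state;
    int owned_by_adapter;   /* adapter may still complete this CCB */
};

#define NCCB 2              /* tiny pool, for illustration only */

/* Hand out the first free CCB, or NULL if none is free. */
struct ccb *ccb_alloc(struct ccb *pool)
{
    assert(pool != NULL);
    for (size_t i = 0; i < NCCB; i++) {
        if (pool[i].state == CCB_FREE) {
            pool[i].state = CCB_ACTIVE;
            pool[i].owned_by_adapter = 1;
            return &pool[i];
        }
    }
    return NULL;
}

/* Suspected failure mode: the timeout frees the CCB immediately,
 * even though the adapter still owns it. */
void ccb_timeout_unsafe(struct ccb *c)
{
    c->state = CCB_FREE;    /* adapter ownership is ignored */
}

/* Safer variant: mark the CCB as aborting so ccb_alloc() skips it. */
void ccb_timeout_safe(struct ccb *c)
{
    c->state = CCB_ABORTING;
}

/* Adapter completion: only now may the CCB return to the pool.
 * Returns -1 if the CCB was already free when it completed. */
int ccb_done(struct ccb *c)
{
    int was_free = (c->state == CCB_FREE);
    c->owned_by_adapter = 0;
    c->state = CCB_FREE;
    return was_free ? -1 : 0;
}
```

With ccb_timeout_unsafe(), the next ccb_alloc() returns the very CCB the
adapter is still working on, so two requests share one control block.  With
ccb_timeout_safe(), the allocator passes over it until ccb_done() runs.  The
-1 return from ccb_done() is only a loose analogue of the "exiting ccb not
allocated!" check; I have not verified how the real driver arrives at that
message.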

>How-To-Repeat:
	Alas, it is not reproducible:  simply mounting the cartridge,
hitting the eject button, and then umounting works precisely as it should
(and I am, shall we say, quite unexcited about trying to reproduce this problem
under load).
	If anyone looks at this bug and devises theories to test, I will be
happy to attempt to reproduce the bug (as long as my tape drive still works),
though it might take a while to find an idle time to do it.

>Fix:
	I am avoiding the eject button while a cartridge is mounted, and also
remembering the ancient Ritual of the Thrice-Repeated Sync before unmounting.
>Audit-Trail:
>Unformatted: