Subject: why does one hung FS lock all other FS' from being unmounted?
To: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 09/10/2003 18:13:19
I had a bad experience with NetBSD-1.6 STABLE (i386) today.

Yesterday I managed to hang one SCSI bus on an siop adapter which
currently hosts my tape drive and some disks for RAIDframe testing on my
development server.  I had accidentally made the bus too long when
trying to add more drives a couple of reboots ago.  I then tried to copy
a big file onto the RAIDframe partition on the original disks on that
bus (the first busy time for that bus since I added the new drives) and
I managed to hang the bus (siop tried to reset the bus a bunch of times
but must have eventually given up).  I didn't realize it was stuck until
after I had accidentally managed to type "ls /test" in my home directory
thus ending up with a process stuck in vnlock on the hung FS and with a
CWD on /home.

After messing around with cables and terminators and after removing the
additional drives I'd tried to add, and running "scsictl scsibus1 reset"
a bunch of times trying to restart the bus, a "scsictl scsibus1 scan all
all" command got hung after finding only the tape drive, so I gave up
and decided to try as clean a reboot as possible.

Unfortunately I was unable to unmount any filesystems, least of all
/home of course.  Not even NFS filesystems would unmount with '-f' --
all the umount commands got stuck in vnlock.  Even a "sync" command got
stuck in vnlock.  "df" also got stuck in vnlock of course.  All of these
stuck processes were of course unkillable too (that really irks me! --
stat() and statfs() shouldn't ever hang in an unkillable state!  They're
only trying to read data!).

Finally I got tired of not having any luck and I tried a reboot from
DDB.  That hung at "syncing disks...." of course, so I was left with
the Big Red Switch (or little black button as it is on this machine).

Unfortunately fsck managed to find over 3,300 files unreferenced on
another filesystem (one on an external hardware RAID subsystem).  They
were almost all owned by the UID I use to rsync and "cvs update" local
copies of the NetBSD source tree, and they were almost all copies of
CVS/Entries or other CVS temporary files.  Something is also broken in
the repo copy that rsync didn't seem to fix.  That filesystem was totally
quiescent and had no open files at the time of the crash.  The rsync and
CVS processes had finished many hours before the crash.  However that
filesystem was mounted with softdeps....

So, why is it that every filesystem (superblock?) was locked just
because one filesystem got stuck?

Also, if the softdep code can't flush buffers to still-functioning
filesystems when one other filesystem gets stuck then softdep certainly
is quite dangerous to use on production systems with multiple disks and
buses and such.  I really like the performance boost, especially for
things like CVS, but I'm more leery of softdep now than ever before.

On traditional UNIX systems I distinctly remember being able to unmount
every other filesystem that didn't have open files in similar
circumstances.  I'm pretty sure I've even been able to reduce damage on
SunOS-4 by unmounting all non-busy filesystems when one gets stuck like
this.

The really weird thing was that the stuck filesystem on the RAIDframe
partition came back without a scratch (after fsck and a parity
reconstruct of course).  The file I had been writing to (i.e. the file
which was still open at the time of the crash) had over 70MB of data in
it and seemed unscathed.

(I have console logs with a little DDB evidence if anyone wants to
pore through them...)

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>