Subject: race conditions in dounmount...
To: None <tech-kern@NetBSD.ORG>
From: Bill Sommerfeld <sommerfeld@orchard.medford.ma.us>
List: tech-kern
Date: 06/18/1996 14:32:58
-----BEGIN PGP SIGNED MESSAGE-----

I observed some nasty races between unmount(2) and vfs_unmountall()
(which called at shutdown time) on a filesystem where unmount() was
very slow (a modified version of AFS).

If an unmount was in progress when vfs_unmountall() was entered, the
following race occurred (more or less):

 A0) unmount called.
 A1) MP marked "busy" with vfs_busy()

 B0) reboot called.
 B1) vfs_unmountall() calls dounmount on same FS; winds up blocked in
vfs_busy() waiting for the MP point to become unbusy.

 A2) unmount completes; MP marked unbusy with vfs_unbusy(); B marked runnable
 A3) file system freed by free().
 A4) unmount returns; process exits.

 B2) vfs_unbusy wakes up, dounmount continues operating on freed MP.
 B3) "shit happens":
	In particular, we crash dereferencing NULL attempting to fetch
	mp->mnt_op->vfs_sync.

[Because this kernel is compiled with DIAGNOSTIC, and because of
Murphy's Law,
	offsetof(struct mount, mnt_op) == offsetof(struct freelist, next);

The "freelist.next" pointer gets set to NULL by free(), and then
VFS_SYNC() crashes attempting to dereference a NULL mnt_op vector.]

- ---

I also observed the following bit of well-meaning, but useless code:

After blocking until the mount point becomes unbusy, vfs_busy contains
code to fail and return 1 if MNT_UNMOUNT is set in the MP.  However,
dounmount's code is roughly:

	if (vfs_busy(mp))
		return EBUSY;
	mp->mnt_flag |= MNT_UNMOUNT;

	... call underlying VFS to do unmount ...

	mp->mnt_flag &= ~MNT_UNMOUNT;
	vfs_unbusy(mp);

so vfs_busy should never see MNT_UNMOUNT set, and thus it always
returns zero.

- ----

In short, it looks like dounmount is one big race condition; the
attempts to avoid it in vfs_busy/vfs_unbusy fail miserably because the
mp gets freed while other processes are still blocked in vfs_busy()...

Suggested *rough* fix [there are probably some bugs here, ...]:

0) don't clear MNT_UNMOUNT; leave it set until the mp is freed..

1) add a mp->mnt_waiters field

2) change vfs_busy to be roughly:

	mp->mnt_waiters++;
	while (mp->mnt_flag & <lock bit>) {
		mp->mnt_flag |= <want bit>;
		tsleep(&mp->mnt_flag, ...)
	}
	mp->mnt_waiters--;
	if (mp->mnt_flag & MNT_UNMOUNT) {
		wakeup(&mp->mnt_waiters, ...)
		return 1;
	}

add add a vfs_unbusy_final() called by dounmount, which is roughly:
	
	vfs_unbusy(mp);
	while (mp->mnt_waiters > 0) {
		tsleep(&mp->mnt_waiters, ...);
	}

Two questions for those of you with more experience in this chunk of code:
 1) is my interpretation of this correct (if so, I'll send-pr this..)
 2) is my idea for how to fix this correct, or am I headed in the
wrong direction?

						- Bill

-----BEGIN PGP SIGNATURE-----
Version: 2.6.1

iQCVAwUBMca+F7T+rHlVUGpxAQEAWAP/RTyhaSpTnPGazoBu2GrCxKL1zI2Wz7CL
5gT06dIdbJSKUmCTAyvThHiGCsSW0sht0uU1cN8O0mPk+zyB7GbcdcvWrk8OwJ6a
TFn8h0/vu4EvY/MB6xuUPjm2hMiy2ko9k57rEnW/xqCw6uCOvw2RspyAS2T+avE5
FS0Pqj9hQf0=
=a09w
-----END PGP SIGNATURE-----