Subject: Re: Proposal: File system suspension - prerequisite for snapshots
To: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 08/13/2003 12:02:01
On Tue, 12 Aug 2003, Juergen Hannken-Illjes wrote:

> I propose adopting file system suspension support from FreeBSD.
>
> The (quite simple) API:
>
> 	int
> 	vfs_write_suspend(struct mount *mp)
>
> 	Request a mounted file system to suspend write operations
> 	and leave it in a clean on-disk state. All operations are
> 	complete on exit.
>
> 	void
> 	vfs_write_resume(struct mount *mp)
>
> 	Request a suspended file system to resume write operations.
>
> This is a prerequisite for file system snapshots.  It may also
> help with system suspension.  File system snapshots would give us at
> least safe dumps of running systems and background fsck (for
> softdep-enabled file systems).
>
> The implementation would gate most file system syscalls like this:
>
> 	if ((error = vn_start_write(vp, &mp, V_WAIT | PCATCH)) != 0)
> 		return (error);
> 	do_the_write_operation
> 	vn_finished_write(mp);
>
> or
>
> restart:
> 	prepare_a_write_operation
> 	if (vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) {
> 		abort_current_preparation
> 		if ((error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH)) != 0)
> 			return (error);
> 		goto restart;
> 	}
> 	do_the_write_operation
> 	vn_finished_write(mp);
>
> Doing it this way guarantees that no operation sleeps with locked vnodes.
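If you want to experiment with the gating protocol outside the kernel, here is a minimal userspace sketch using POSIX threads. Everything in it is hypothetical. struct model_mount, the model_* functions and the M_WAIT/M_NOWAIT/M_XSLEEP flags only mimic the roles that vfs_write_suspend(), vfs_write_resume(), vn_start_write() and vn_finished_write() play above. A pthread mutex stands in for a vnode lock, and PCATCH and the on-disk sync are not modeled. M_XSLEEP is assumed to only wait and take no write reference, which is what lets the restart loop call the gate again.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical flags mimicking V_WAIT, V_NOWAIT and V_XSLEEP. */
enum { M_WAIT = 1, M_NOWAIT = 2, M_XSLEEP = 4 };

/* Hypothetical stand-in for the per-mount suspension state. */
struct model_mount {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    bool suspended;   /* set by suspend, cleared by resume */
    int  writers;     /* write operations currently in progress */
};

#define MODEL_MOUNT_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0 }

/*
 * Gate a write. With M_NOWAIT, fail with EWOULDBLOCK while suspended;
 * otherwise sleep until resumed. M_XSLEEP only waits and takes no
 * write reference; every other successful call takes one.
 */
static int model_start_write(struct model_mount *mp, int flags)
{
    int error = 0;

    pthread_mutex_lock(&mp->lock);
    if (mp->suspended && (flags & M_NOWAIT)) {
        error = EWOULDBLOCK;
    } else {
        while (mp->suspended)
            pthread_cond_wait(&mp->cv, &mp->lock);
        if (!(flags & M_XSLEEP))
            mp->writers++;
    }
    pthread_mutex_unlock(&mp->lock);
    return error;
}

/* Drop the reference taken by model_start_write(). */
static void model_finished_write(struct model_mount *mp)
{
    pthread_mutex_lock(&mp->lock);
    assert(mp->writers > 0);
    if (--mp->writers == 0)
        pthread_cond_broadcast(&mp->cv);   /* wake a waiting suspender */
    pthread_mutex_unlock(&mp->lock);
}

/* Block new writes, then wait for in-progress writes to drain. */
static void model_write_suspend(struct model_mount *mp)
{
    pthread_mutex_lock(&mp->lock);
    mp->suspended = true;
    while (mp->writers > 0)
        pthread_cond_wait(&mp->cv, &mp->lock);
    /* A real file system would also flush to a clean on-disk state here. */
    pthread_mutex_unlock(&mp->lock);
}

/* Let writes proceed again and wake everyone sleeping on the gate. */
static void model_write_resume(struct model_mount *mp)
{
    pthread_mutex_lock(&mp->lock);
    mp->suspended = false;
    pthread_cond_broadcast(&mp->cv);
    pthread_mutex_unlock(&mp->lock);
}

/* The "restart" pattern, with a mutex standing in for a vnode lock. */
static int model_locked_write(struct model_mount *mp, pthread_mutex_t *vnode,
                              int *restarts)
{
restart:
    pthread_mutex_lock(vnode);              /* prepare the operation */
    if (model_start_write(mp, M_NOWAIT) != 0) {
        pthread_mutex_unlock(vnode);        /* abort the preparation */
        (*restarts)++;
        model_start_write(mp, M_XSLEEP);    /* sleep with no vnode locked */
        goto restart;
    }
    /* ... do the write operation ... */
    pthread_mutex_unlock(vnode);
    model_finished_write(mp);
    return 0;
}
```

The ordering in model_locked_write() is the point of the second pattern. The only sleep on the suspension happens after the vnode mutex has been released, so a thread blocked by the suspension never holds a vnode lock that the suspending code (for example, the sync that brings the file system to a clean state) might need.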

Note: don't you end up calling vn_start_write() _after_ you've been told
it's OK to start? :-) If you _want_ that to be part of the interface, we
need to document it. For one thing, it rules out a simple reference-counting
mechanism for determining whether writes are in progress.
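To make the reference-counting concern concrete, here is a hypothetical single-threaded sketch. struct gate and the gate_* functions are illustrative names, not an existing NetBSD or FreeBSD API. Every successful start takes exactly one reference and every finish drops one, so "writers > 0" means a write really is in progress. Finding out whether writes may proceed goes through a separate call that takes no reference. Overloading the start call for that purpose is the case that would have to be documented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded illustration; not NetBSD or FreeBSD code. */
struct gate {
    bool suspended;
    int  writers;      /* successful starts not yet finished */
};

/* Start a write: takes one reference, but only when it succeeds. */
static bool gate_start(struct gate *g)
{
    if (g->suspended)
        return false;  /* caller backs off; nothing to release */
    g->writers++;
    return true;
}

/* Finish a write: drops exactly the reference gate_start() took. */
static void gate_finish(struct gate *g)
{
    assert(g->writers > 0);  /* an unmatched finish is a caller bug */
    g->writers--;
}

/* Stand-in for a wait-only call: reports whether writes may proceed
 * but takes no reference, so it must never be paired with gate_finish(). */
static bool gate_may_write(const struct gate *g)
{
    return !g->suspended;
}

/* The simple check a suspender would rely on. */
static bool gate_writes_in_progress(const struct gate *g)
{
    return g->writers > 0;
}
```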

> It is not possible to put this gating into the VFS_ calls, as they are often
> called with locked vnodes and the suspend request may deadlock.
> For the same reason, this gating cannot reside below the VFS_ level.

I don't think I like it, but I believe you're right that there are issues
with doing it at the VOP level because of the locked vnodes. That is, this
approach may well be the best in the long run, even if I don't like it. :-)

Note: you _can_ do it at the VOP level; it would just mean having a
routine (or routines) that unlocks the node, sleeps, then re-locks it and
moves on. But it's probably cleaner to do as you suggest and just make sure
it's OK to do the write before starting it.
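A VOP-level version of the unlock/sleep/re-lock routine might look roughly like the following userspace sketch. Pthread mutexes stand in for the vnode and mount locks, and struct susp_mount, vop_gate_write() and vop_finish_write() are hypothetical names. The routine reports whether it had to drop the vnode lock, because anything the caller checked under that lock may have changed while it slept.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; mutexes stand in for kernel locks. */
struct susp_mount {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    bool suspended;
    int  writers;
};

/*
 * Called with *vnode_lock held; returns with it held and one write
 * reference taken. While the mount is suspended, drop the vnode lock,
 * sleep until resumed, re-lock and check again. Locks are always taken
 * vnode first, then mount. Returns true if the vnode lock was dropped,
 * so the caller knows to revalidate state it examined under that lock.
 */
static bool vop_gate_write(struct susp_mount *mp, pthread_mutex_t *vnode_lock)
{
    bool dropped = false;

    pthread_mutex_lock(&mp->lock);
    while (mp->suspended) {
        pthread_mutex_unlock(vnode_lock);      /* unlock the node */
        while (mp->suspended)                  /* sleep until resumed */
            pthread_cond_wait(&mp->cv, &mp->lock);
        pthread_mutex_unlock(&mp->lock);
        pthread_mutex_lock(vnode_lock);        /* re-lock, vnode first */
        pthread_mutex_lock(&mp->lock);         /* then the mount */
        dropped = true;                        /* loop re-checks suspension */
    }
    mp->writers++;
    pthread_mutex_unlock(&mp->lock);
    return dropped;
}

/* Drop the write reference and wake a suspender waiting for drain. */
static void vop_finish_write(struct susp_mount *mp)
{
    pthread_mutex_lock(&mp->lock);
    assert(mp->writers > 0);
    if (--mp->writers == 0)
        pthread_cond_broadcast(&mp->cv);
    pthread_mutex_unlock(&mp->lock);
}
```

Every caller of such a routine would have to handle the "lock was dropped" case, which is one plausible reason that checking before the operation starts comes out cleaner.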

Take care,

Bill