Subject: Re: Redoing file system suspension API (update)
To: None <tech-kern@netbsd.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 06/21/2006 13:41:52
--FL5UXtIhxfXey3p5
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Jun 17, 2006 at 03:58:32PM +0200, Juergen Hannken-Illjes wrote:
> On Tue, Jun 13, 2006 at 04:18:09PM +0200, Juergen Hannken-Illjes wrote:
> > We have an API for file system suspension which consists of the functio=
ns
> [snap]
>=20
> This one differs from the initial proposal by two items:
>=20
> - The per-thread state is initialized on thread creation instead of
>   doing it on the fly (see vngate_initialize()).
> - Instead of tracking possible long sleeps I put the suspend/resume into
>   ltsleep().
>=20
>=20
> We have an API for file system suspension which consists of the functions
> vfs_write_suspend, vfs_write_resume, vn_start_write and vn_finished_write.
>=20
> ** This implementation of file system suspension has some serious problem=
s:
>=20
> 1) Its definition (Prepare to start a file system write operation) is vag=
ue
>    and file system dependent.  It is impossible to determine which VOP's
>    will change file system data or metadata without knowing details of the
>    underlying file system.  It does not allow recursion.

Ok, so we work on the definition. I agree the name needs changing.

I'm not sure I agree about not knowing what will change metadata w/o=20
knowing the file system.

How is recursion an issue? We can either adjust the upper level call=20
layers to cope (pass in a state parameter indicating if we've locked or=20
not, be clear about entry points that are called w/ the lock), or have=20
code cope with a "You already have the lock" error code.

> 2) Its implementation adds a layer above file systems AND a layer inside
>    file systems.

Given our current vnode locking protocol, I'm not sure that this is really=
=20
an issue. While it'd be nice to not need upper-level calls (and for=20
some things like VOP_WRITE() calls, we probably should just have the fs do=
=20
it), there are places where upper-level code wants to perform an atomic=20
sequence of actions. Having that upper level need to explicitly lock seems=
=20
fine to me.

Snapshotting really needs the whole OS's help to get right. Applications=20
and everything on down have to help, so that you get a good snapshot. So=20
having things above and below the VFS/VOP layer help out seems quite=20
appropriate.

As an aside, if you have a journaling file system and want to make=20
sure there is space in the journal before you start an operation, you need=
=20
checkpoint calls in exactly the same places where you have the=20
vn_start_write() and vn_stop_write() calls. So the idea of a call or calls=
=20
that say, "I'm about to start a sequence of operations, please make sure=20
everything's ok" is a general one, and something we need.

> 3) It may take forever to suspend a file system if there is a high load
>    on other file systems.  Even a high "read load" on the file system we
>    are suspending makes the suspension take a long time.  If softdep file
>    systems are involved the suspension may not succeed at all.

Ok, then we definitely need either a redesign or re-work.

> ** The approach described here resolves these issues.  It replaces the
> "write gates" by "access file system gates".  Goal is to make every system
> call atomic with regards to file system suspension.  It will do all opera=
tions
> on a file system either before or after a suspension but will never do on=
e part
> before and another part after the suspension.  Allowing recursion makes it
> easier to place the gates.  So the advantages are:

I dislike the idea of "gates". They are locks, let's call them such.

I feel strongly about this as you're inventing a whole new locking=20
protocolon the fly. You may be right, but it's not clear to me that you=20
are. And one thing I've learned (painfully) is that even if something I=20
make would work fine, if others don't understand it, it really won't work=
=20
fine.

> - It is semantically well defined.
>=20
> - It is possible to add a DEBUG option to check it.  Every VOP called on
>   a suspending or suspended file system is an error (there are minor
>   exceptions: syncer and part of pagedaemon).
>=20
> - No modifications of file system internals are needed to implement it.
>=20
> - A suspended file system is really quiet.  No vnodes are locked during
>   suspension.

You still have locks, though. I think we should keep calling them locks.=20
:-)

Also, we already have spin locks and lockmgr locks. Do we really want to=20
add another type of lock?? I'm not saying you must use lockmgr locks, but=
=20
we really should look at the overall complexity of our locking mechanisms.

> - It solves 1), 2) and the "read part" of 3).
>=20
> To solve the rest of 3) it adds a throttling on the first gate not involv=
ed
> in a suspending file system.
>=20
>=20
> ** The new API is:
>=20
> Using explicit enter()/leave() pairs adds much complexity so I took anoth=
er
> approach. I use two types of gates.  Normal gates need a "leave" operatio=
n.
> Permanent gates are valid until the thread returns to user mode.

I really really dislike the idea of adding locks that are automatically
released upon return to user mode.

> void vngate_initialize(struct lwp *l)
>=20
>   Initialize the per-thread state.  Will be freed on vngate_leave_all(...=
, 1).
>=20
> int vngate_enter(struct mount *mp, int flags)
>=20
>   Enter a vnode gate for the file system "mp".  "flags" is a combination =
of:
>   V_WAIT	Thread must sleep until a suspension is over.
>   V_NOWAIT	Returns an error if a suspension is active.
>   V_NOERROR	Panic on error.  No need for the caller to check the result.
>   V_PERMANENT	Enter a permanent gate.
>=20
> void vngate_leave(struct mount *mp)
>=20
>   Leave a (non-permanent) gate for the file system "mp".
>=20
> void vngate_leave_all(struct mount *mp, int destroy)
>=20
>   Leave all permanent gates of this thread.  Assumes all normal gates are
>   already closed.  If "mp" is set, only leave the gates for this file sys=
tem.
>   "destroy" is set if all state has to be freed because this thread will
>   terminate.
>=20
> void vngate_sleep(struct mount *mp)
>=20
>   Sleep until a suspension on this file system is over.
>=20
> void vngate_suspend(void)
>=20
>   Suspend all gates of this thread.  Called during suspension and just be=
fore
>   we go to long (interruptible) sleep.  Further vngate_(enter|leave) call=
s are
>   forbidden.
>=20
> void vngate_resume(void)
>=20
>   Resume all gates of this thread to the state before the last vngate_sus=
pend().
>   May wait if a suspension is in progress.
>=20
> int vfs_suspend(struct mount *mp, int wait)
>=20
>   Suspend the file system "mp".  If "wait" is set, wait for a current sus=
pension
>   otherwise return an error.
>=20
> int vfs_resume(struct mount *mp)
>=20
>   Resume the file system "mp".
>=20
>=20
> option VNODE_GATEDEBUG
>=20
>   Add code to check
>   - No VOP's are called without a taken gate.
>   - Internal integrity checks.
>=20
>=20
> ** Attached is an implementation on a recent -current.
>=20
> Adding permanent gates to lookup(), (new) FILE_USE_GATED() and nfsrv_fhto=
vp()
> covers most system calls. The various compat parts are not ready.  A scan=
 over
> all "FILE_USE()" should be sufficient here.  The implementation renames
> "vfs_write_XXX" to "vfs_XXX" and adds the state pointer to "struct lwp".
> All other changes depend on "option NEWVNGATE" to make it easy to test the
> new API.

I dislike the idea of adding automatic gate calls. Well, something where=20
returning with the "gate" locked is a side-effect.

I agree with Yamamoto-san that a file system should do as much of this=20
inside itself that it can. I disagree that it has to do it all, but I=20
agree with the spirit of things that it should do as much as it can.

I'm not 100% on the details, but I think that you've already found the=20
places above the file system that need to know about snapshot locking.

Take care,

Bill

--FL5UXtIhxfXey3p5
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (NetBSD)

iD8DBQFEma8PWz+3JHUci9cRAnwNAKCFdXMyqoEgdS4yR5SK0041mRfaJgCfSw8U
ecU2xcDu7L9Ip7Fux1b7sns=
=i4cS
-----END PGP SIGNATURE-----

--FL5UXtIhxfXey3p5--