Subject: Re: Redoing file system suspension API (update)
To: None <tech-kern@netbsd.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 06/21/2006 13:41:52
--FL5UXtIhxfXey3p5
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
On Sat, Jun 17, 2006 at 03:58:32PM +0200, Juergen Hannken-Illjes wrote:
> On Tue, Jun 13, 2006 at 04:18:09PM +0200, Juergen Hannken-Illjes wrote:
> > We have an API for file system suspension which consists of the functio=
ns
> [snap]
>=20
> This one differs from the initial proposal by two items:
>=20
> - The per-thread state is initialized on thread creation instead of
> doing it on the fly (see vngate_initialize()).
> - Instead of tracking possible long sleeps I put the suspend/resume into
> ltsleep().
>=20
>=20
> We have an API for file system suspension which consists of the functions
> vfs_write_suspend, vfs_write_resume, vn_start_write and vn_finished_write.
>=20
> ** This implementation of file system suspension has some serious problem=
s:
>=20
> 1) Its definition (Prepare to start a file system write operation) is vag=
ue
> and file system dependent. It is impossible to determine which VOP's
> will change file system data or metadata without knowing details of the
> underlying file system. It does not allow recursion.
Ok, so we work on the definition. I agree the name needs changing.
I'm not sure I agree about not knowing what will change metadata w/o=20
knowing the file system.
How is recursion an issue? We can either adjust the upper level call=20
layers to cope (pass in a state parameter indicating if we've locked or=20
not, be clear about entry points that are called w/ the lock), or have=20
code cope with a "You already have the lock" error code.
> 2) Its implementation adds a layer above file systems AND a layer inside
> file systems.
Given our current vnode locking protocol, I'm not sure that this is really=
=20
an issue. While it'd be nice to not need upper-level calls (and for=20
some things like VOP_WRITE() calls, we probably should just have the fs do=
=20
it), there are places where upper-level code wants to perform an atomic=20
sequence of actions. Having that upper level need to explicitly lock seems=
=20
fine to me.
Snapshotting really needs the whole OS's help to get right. Applications=20
and everything on down have to help, so that you get a good snapshot. So=20
having things above and below the VFS/VOP layer help out seems quite=20
appropriate.
As an aside, if you have a journaling file system and want to make=20
sure there is space in the journal before you start an operation, you need=
=20
checkpoint calls in exactly the same places where you have the=20
vn_start_write() and vn_stop_write() calls. So the idea of a call or calls=
=20
that say, "I'm about to start a sequence of operations, please make sure=20
everything's ok" is a general one, and something we need.
> 3) It may take forever to suspend a file system if there is a high load
> on other file systems. Even a high "read load" on the file system we
> are suspending makes the suspension take a long time. If softdep file
> systems are involved the suspension may not succeed at all.
Ok, then we definitely need either a redesign or re-work.
> ** The approach described here resolves these issues. It replaces the
> "write gates" by "access file system gates". Goal is to make every system
> call atomic with regards to file system suspension. It will do all opera=
tions
> on a file system either before or after a suspension but will never do on=
e part
> before and another part after the suspension. Allowing recursion makes it
> easier to place the gates. So the advantages are:
I dislike the idea of "gates". They are locks, let's call them such.
I feel strongly about this as you're inventing a whole new locking=20
protocolon the fly. You may be right, but it's not clear to me that you=20
are. And one thing I've learned (painfully) is that even if something I=20
make would work fine, if others don't understand it, it really won't work=
=20
fine.
> - It is semantically well defined.
>=20
> - It is possible to add a DEBUG option to check it. Every VOP called on
> a suspending or suspended file system is an error (there are minor
> exceptions: syncer and part of pagedaemon).
>=20
> - No modifications of file system internals are needed to implement it.
>=20
> - A suspended file system is really quiet. No vnodes are locked during
> suspension.
You still have locks, though. I think we should keep calling them locks.=20
:-)
Also, we already have spin locks and lockmgr locks. Do we really want to=20
add another type of lock?? I'm not saying you must use lockmgr locks, but=
=20
we really should look at the overall complexity of our locking mechanisms.
> - It solves 1), 2) and the "read part" of 3).
>=20
> To solve the rest of 3) it adds a throttling on the first gate not involv=
ed
> in a suspending file system.
>=20
>=20
> ** The new API is:
>=20
> Using explicit enter()/leave() pairs adds much complexity so I took anoth=
er
> approach. I use two types of gates. Normal gates need a "leave" operatio=
n.
> Permanent gates are valid until the thread returns to user mode.
I really really dislike the idea of adding locks that are automatically
released upon return to user mode.
> void vngate_initialize(struct lwp *l)
>=20
> Initialize the per-thread state. Will be freed on vngate_leave_all(...=
, 1).
>=20
> int vngate_enter(struct mount *mp, int flags)
>=20
> Enter a vnode gate for the file system "mp". "flags" is a combination =
of:
> V_WAIT Thread must sleep until a suspension is over.
> V_NOWAIT Returns an error if a suspension is active.
> V_NOERROR Panic on error. No need for the caller to check the result.
> V_PERMANENT Enter a permanent gate.
>=20
> void vngate_leave(struct mount *mp)
>=20
> Leave a (non-permanent) gate for the file system "mp".
>=20
> void vngate_leave_all(struct mount *mp, int destroy)
>=20
> Leave all permanent gates of this thread. Assumes all normal gates are
> already closed. If "mp" is set, only leave the gates for this file sys=
tem.
> "destroy" is set if all state has to be freed because this thread will
> terminate.
>=20
> void vngate_sleep(struct mount *mp)
>=20
> Sleep until a suspension on this file system is over.
>=20
> void vngate_suspend(void)
>=20
> Suspend all gates of this thread. Called during suspension and just be=
fore
> we go to long (interruptible) sleep. Further vngate_(enter|leave) call=
s are
> forbidden.
>=20
> void vngate_resume(void)
>=20
> Resume all gates of this thread to the state before the last vngate_sus=
pend().
> May wait if a suspension is in progress.
>=20
> int vfs_suspend(struct mount *mp, int wait)
>=20
> Suspend the file system "mp". If "wait" is set, wait for a current sus=
pension
> otherwise return an error.
>=20
> int vfs_resume(struct mount *mp)
>=20
> Resume the file system "mp".
>=20
>=20
> option VNODE_GATEDEBUG
>=20
> Add code to check
> - No VOP's are called without a taken gate.
> - Internal integrity checks.
>=20
>=20
> ** Attached is an implementation on a recent -current.
>=20
> Adding permanent gates to lookup(), (new) FILE_USE_GATED() and nfsrv_fhto=
vp()
> covers most system calls. The various compat parts are not ready. A scan=
over
> all "FILE_USE()" should be sufficient here. The implementation renames
> "vfs_write_XXX" to "vfs_XXX" and adds the state pointer to "struct lwp".
> All other changes depend on "option NEWVNGATE" to make it easy to test the
> new API.
I dislike the idea of adding automatic gate calls. Well, something where=20
returning with the "gate" locked is a side-effect.
I agree with Yamamoto-san that a file system should do as much of this=20
inside itself that it can. I disagree that it has to do it all, but I=20
agree with the spirit of things that it should do as much as it can.
I'm not 100% on the details, but I think that you've already found the=20
places above the file system that need to know about snapshot locking.
Take care,
Bill
--FL5UXtIhxfXey3p5
Content-Type: application/pgp-signature
Content-Disposition: inline
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (NetBSD)
iD8DBQFEma8PWz+3JHUci9cRAnwNAKCFdXMyqoEgdS4yR5SK0041mRfaJgCfSw8U
ecU2xcDu7L9Ip7Fux1b7sns=
=i4cS
-----END PGP SIGNATURE-----
--FL5UXtIhxfXey3p5--