Subject: Re: Redoing file system suspension API (update)
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 06/21/2006 14:24:53

On Wed, Jun 21, 2006 at 01:34:38PM +0200, Juergen Hannken-Illjes wrote:
> On Wed, Jun 21, 2006 at 08:02:56PM +0900, YAMAMOTO Takashi wrote:
> > > > isn't it almost the same as VOPs?  (with some exceptions, of course)
> > >
> > > And how would you explain this to a programmer/user?
> > >   A suspended file system is in a state where VOPs are the atomic operations.
> > >   Look at the kernel source what this might mean for your application.

For the case where there's one real VOP call per syscall, there is no
difference.

As noted below, the other cases would get special handling.

> > > I think it is a much cleaner way to use system calls as atomic operations.
> > >
> > > Doing it inside file systems you may also lose the "no locked vnodes" property.
> >
> > we should turn (most part of) vnode lock into filesystem internal as well. :)
> >
> > well, i think neither syscalls or individual VOPs are appropriate
> > for your purpose.  what you need is the intermediate.  ie. a set of VOPs.

Yeah, this is what I'm thinking we should do.

> > for example,
> >
> > int
> > vn_remove(const char *path)
> > {
> >
> > 	lookup_parent(..., &dvp, ...);
> >
> > 	vngate_enter(dvp->v_mount);
> > 	lock(dvp);
> > 	lookup_lastcomponent(dvp, &vp, ..);
> > 	VOP_REMOVE(dvp, vp, ...);
> > 	vngate_leave(dvp->v_mount);
> > }
>
> Why do you think "lookup_parent()" does not change file system data/metadata?

It might. If it does, then the fs has to make sure no snapshot is in
progress while it's changing data.

The point is that it doesn't matter if it has to wait for a snapshot. You
could take 20 snapshots during the course of one lookup_parent() call.
Yeah, that's unlikely and a bit crazy, but snapshots there don't matter.

The important point is that a snapshot doesn't see us half-way through the
lookup_lastcomponent() call and the VOP_REMOVE().
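
As a rough illustration of that ordering, here is a small user-space model
of a per-mount gate. It is only a sketch under stated assumptions: the
gate_* names and the try-style return values are invented for the example
and are not the proposed kernel API, and a real gate would put the caller
to sleep where this model returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a per-mount gate; not the NetBSD interface. */
struct gate {
	int  active;     /* operations currently inside the gate */
	bool suspended;  /* true while a snapshot has the fs suspended */
};

/* Enter the gate. Fails while suspended (a kernel would sleep here). */
static bool
gate_try_enter(struct gate *g)
{
	if (g->suspended)
		return false;
	g->active++;
	return true;
}

/* Leave the gate after the set of VOPs is complete. */
static void
gate_leave(struct gate *g)
{
	assert(g->active > 0);
	g->active--;
}

/* Suspend for a snapshot. Fails while any operation is inside the gate. */
static bool
gate_try_suspend(struct gate *g)
{
	if (g->suspended || g->active > 0)
		return false;
	g->suspended = true;
	return true;
}

/* Resume normal operation once the snapshot is taken. */
static void
gate_resume(struct gate *g)
{
	assert(g->suspended);
	g->suspended = false;
}
```

In this model a suspend attempt made before gate_try_enter() (that is,
during the lookup_parent() phase) succeeds, while one made between
gate_try_enter() and gate_leave() is refused, so the snapshot never lands
between the last-component lookup and the remove.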

> What if we make lookup() gate-aware?
>
> 	- add struct mount *ni_gate, *ni_dgate to struct nameidata
> 	- add an option KEEPGATES to namei() so namei() either leaves
> 	  the gates on return or keeps them if KEEPGATES is given.
>
> and this becomes
>
> 	NDINIT(..., KEEPGATES, ...)
> 	namei(&nd);
> 	VOP_LEASE(...)
> 	...
> 	VOP_REMOVE(nd.ni_dvp, nd.ni_vp, ...);
> 	vngate_leave(nd.ni_dvp->v_mount);
> 	vngate_leave(nd.ni_vp->v_mount);

I don't see what this gains us. It's more complex, and feels more awkward.

> > > > - is it true even if filesystems are backed by different disks?
> > >
> > > Yes.  My test machine has root on sd0 test1..4 on sd1.  It is true for
> > > the case where the load is on root and the suspension is on test1.  With
> > > softdep of course.  Main problem is the softdep code is not per-mount.
> > >
> > > > - why does it need the special care?
> > >
> > > It solves a real problem now that may go away with updates to the softdep code
> > > or the introduction of a real i/o scheduler.
> >
> > it isn't clear to me why the suspension on filesystem A has a priority over
> > activities on unrelated filesystem B.
>
> Try it for yourself (on one disk if you need real problems)....

Yeah, that sounds like a mess, and we should do something about it.

> > i thought
> >
> > 	vngate_enter(PERMANENT)
> > 	some_operations();
> >
> > 	long_sleep(); /* with suspend/resume */
> >
> > 	other_operations();
> > 	vngate_leave_all();
> >
> > could be
> >
> > 	vngate_enter()
> > 	some_operations();
> > 	vngate_leave()
> >
> > 	long_sleep(); /* without suspend/resume */
> >
> > 	vngate_enter()
> > 	other_operations();
> > 	vngate_leave()
>
> At least for specfs/fifofs this looks ok.

I like that too.

Take care,

Bill
