Subject: Re: changing VOP_REMOVE(), was Re: ufs-ism in lookup(9)
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 03/30/2004 23:24:20
--oyUTqETQ0mS9luUI
Content-Type: text/plain; charset=unknown-8bit
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Mar 31, 2004 at 01:16:01PM +0900, YAMAMOTO Takashi wrote:
> > I certainly don't want to get bitten by it. :-) I'm concerned it'd be a=
=20
> > new way to make the kernel fall over.. So in that way, I want to "suppo=
rt"=20
> > not dying.
> >=20
> > Also, when exactly would we be doing this getcwd-like thing to get the=
=20
> > mountpoint path? How will we know when to (and not to) do the parent=20
> > directory lookup?
>=20
> always when getfsstat is called?

We could do that. And we may want to be able to do something, since after=
=20
all a directory farther up the chain could have been renamed or moved.=20
However I'm not sure getfsstat() is the right call. I don't see the=20
corelation between getfsstat() calls and directory re-arrangements. I=20
think if we want a fix-up call for the paths, we should make some sort of=
=20
explicit operation for it, and let the admin invoke it.

> > My point is that we are talking about (with the (dvp, component) cache)
> > doing mount-point detection using something that is fragile for a remote
> > file system, unless we can put some sort of watch (or lock or guard) on
> > each path component.  Otherwise either the server or another client can
> > come in and do a rename that will invalidate the usefullness of our cac=
he.
> > And we wouldn't even know it.
>=20
> anyway it's fragile to mount filesystem on the remote vnode. :-)

Agreed!

> i've been almost concerned that the (dvp, component) cache was not
> a good idea.  maybe it's better to do the mountpoint test in
> each dirop VOPs as you said.  i'm not sure if it's the best way, though.
>=20
> > > > Sounds like the best thing to do for now is leave VOP_RENAME() alon=
e.
> > >=20
> > > the current "VOP_LOOKUP for dirop" method is wasteful even for ufs.
> > > ie. in-core inode is bloated unnecessarily to store a result of VOP_L=
OOKUP.
> > > i guess it's worse for more complicated directory implementations.
> >=20
> > [sorry I mentioned VOP_RENAME(); we're talking about VOP_REMOVE()]
> >=20
> > Huh?
> >=20
> > For ufs, AFAICT we add two int32_t's to hold slot information. We also=
=20
> > use these values for file creation. So it doesn't seem like a waste to =
me.
>=20
> it's a waste because the members are only used during dirops.
> you can implement an internal version of ufs_lookup, which stores slot in=
fo
> into kernel stack or somewhere instead of the in-core inode, and
> call it from dirop VOPs.

I really think you're adding a lot of complexity for two ints.

> > Also, how exactly do we eliminate the lookup for NFS? Yes, we could do=
=20
> > something like NOLEAF to the lookup, and we could push the parent and=
=20
> > component into VOP_REMOVE(). But fundamentally, as part of the remove=
=20
> > process, we have to know if we have an in-kernel vnode. How do we find=
=20
> > that vnode without doing a lookup? "Lookup" after all is the way we map=
 a=20
> > name to an inode.
>=20
> it needs lookup, but doesn't need VOP_LOOKUP, which always loads the vnode
> as you said below.
>=20
> > And since we have to do the lookup to find the vnode, what's wrong with
> > what we're doing now, other than that we will always load the vnode into
> > core even if it's not in-core?
>=20
> isn't it enough?

How much code are we talking about? Is it really worth making ourselves=20
incompatable with all the other BSDs for this?

One other option is to keep the calling code pattern the same, and add a=20
NOLOAD flag. Its semantics are we do the lookup, and we=A0return the vnode=
=20
if it is in core, but we don't load a vnode in if there isn't one now. My=
=20
idea is that since the parent directory will be locked the whole time, no=
=20
one can come in and create an entry behind our back. Thus this would be a=
=20
case where it's valid to return NULL with no error from namei(). This=20
change would mean continuing to cache slot info, but I think that'll be=20
fine.

Take care,

Bill

--oyUTqETQ0mS9luUI
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (NetBSD)

iD8DBQFAanIkWz+3JHUci9cRAnXPAJ455KdvxJlOcj3Ce+6ooVbVAEiyaACfZTVz
2MCWF75YQTb2cP2Ivi6sNSU=
=8Kfn
-----END PGP SIGNATURE-----

--oyUTqETQ0mS9luUI--