Subject: Re: removing VOPs
To: YAMAMOTO Takashi <>
From: Bill Studenmund <>
List: tech-kern
Date: 10/03/2005 22:40:14
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Oct 04, 2005 at 01:18:12PM +0900, YAMAMOTO Takashi wrote:
> hi,
> > > > The only thing that would need changing would be an interface to ad=
d new
> > > > operations in addition to those in vfs_op_descs. It'd be a touch tr=
icky as
> > > > we would need to add a slot in all the existing op tables for the n=
ew one,
> > > > but it could be done.
> >=20
> > > i don't think it's worth to do, given locking overhead and complexity.
> >=20
> > What locking overhead? Also, while I see adding a new op would be a bit=
> > irritating to code, I don't think it's that complex. Note: I am of cour=
> > assuming the case of only adding new operations and never removing them=
> > even if the only fs that implements them gets unloaded from a kernel.
> how do you add a slot in the existing v_op without locking?

You add it at the end, you take advantage of the fact that nothing (AFAIK)=
caches the v_op parameter for long, and you take advantage of the fact=20
that pointer reads and writes are atomic on all the architectures on which=
we support SMP.

Specifically, for each file system (each vop table in a file system),
create a new v_op table. Remember the old one. As we have a list of all
the defined vfs's, we can find all the ops tables. Also look at the=20
"special" ones mentioned in vfs_init(). Update all the pointers to the new=

Then sweep through the set of allocated vnodes and update all the v_ops to
the new values. Anything actually performing operations will either use=20
the old or new table. As you add the op before anything actually will use=
the op, all the old and new values are fine; they point to the same=20
fs-specific routine.

Hmm... As I write this I find a few issues with what I suggest.

1) VNODE_OPS_COUNT is a constant.

2) vnode initialization would might some locking. The trick is to make
sure that the vnode sweep above looks at a vnode when it has a valid ops
vector and after an init routine has written said ops vector. Well, to=20
make sure an initialization routine doesn't overwrite v_op with the=20
pointer to an old table.

2b) unionfs does some weird stuff with looking at v_op contents and=20
comparing with expected values.

3) Routines that make multiple VOP_ calls could have the compiler=20
automatically hang onto vp->v_op in a register. If code between two calls=
blocks, the second use could conceivably happen after the old ops table=20
got g/c'd.

3 could be gotten around by adding a bit of "volatile" magic to VOCALL().=
I have no idea what that would do to performance, but I doubt much. I=20
don't think we have many fast-path routines that issue a sequence of=20
VOP_s. Specifically, I am not sure that we have many where the compiler=20
would hang onto vp->v_op and where we wouldn't still have vp->v_op sitting=
in cache.

3 could also be gotten around by never freeing the old tables.

2 would work best with some sort of read locking. Many cases of the code=20
could be gotten around by clever pointer passing, but that could be ugly.=
The idea being that we reduce the danger window to a few instruction copy,=
but I think I still dislike it. So some sort of read/write locking would=20
be needed at init. And I am not sure what I think about that until we have=
a working new-lock implementation.

2b would definitely need locking or re-writing. I have no clue why unionfs=
cares where the ops vectors point.

Hmmm.... So that has issues. The answer probably is that you don't do this=
without any locking unless you want to play some races (like if you wait 5=
seconds between updating all the fs-level v_op pointers before sweeping=20
the vnodes, chances are you won't catch code in 2, and if you wait another=
5 seconds, between the sweep and freeing, you won't catch code in 3). 2b=20
would still need work.

The only other idea I can come up with is to add a kernel option to leave=
some open. Like how we have LKM character and block entries, and lkm=20
syscall entries. But that's still lame. Though since we're talking about=20
pointers, in an array, allocating an extra 6 or so won't matter much.

> > > > Well, you've listened to me and I've listened to you. We will just=
> > > > disagree then and I'll look forward to seeing your proposed patch.
> > >=20
> > > i'll create a branch so you will be able to see the changes before
> > > they hit the trunk.
> >=20
> > Ok, if that's easier for you. I thought a diff would be more than=20
> > sufficient.
> honestly, i'm not sure why you want to see a patch.
> it should be merely mechanical changes.

To be honest, I agree with you. My initial comment (top of the text above) =
was more of a, "I'll hush until you have this done and look forward to
when that happens,"  comment rather than a, "I think I need to review what
you're doing here,"  comment. I am sorry if it came across differently.

The only reason I would like to see the change is so that I can continue=20
to follow exactly how we handle this. Truth be told, however, I can go=20
stare at the results of the commit message.

Take care,


Content-Type: application/pgp-signature
Content-Disposition: inline

Version: GnuPG v1.2.3 (NetBSD)