Subject: Re: reboot problems unmounting root
To: Antti Kantee <pooka@cs.hut.fi>
From: Bill Stouder-Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 07/05/2007 13:14:54

On Thu, Jul 05, 2007 at 09:20:11PM +0300, Antti Kantee wrote:
> On Thu Jul 05 2007 at 10:43:56 -0700, Bill Stouder-Studenmund wrote:
> > We have VOP_REVOKE() so that when someone logs out of a tty, any
> > existing process w/ that device (that instance of the vnode) open gets
> > connected to a black hole. That way your cat > special_secrets &
> > process can't read my passwords, and I can't accidentally see the
> > output spew of the employee performance analysis tool you have running
> > the "layoffs" criteria set.
> >
> > So VOP_REVOKE() is supposed to blow away the node in that it's now a
> > deadfs vnode. However the vnode is still supposed to be usable, to the
> > extent that all the callers get errors then eventually release
> > references, and the vnode dies.
> >
> > From what we're saying, that won't work. That's bad.
>
> Yes.  But the only reason it explodes seems to be the lock pointer
> exported from the vnode.  I, however, do maintain that you should revoke
> all vnodes down from the topmost you revoked.  You just need to make the
> one above know that it's a deadfs situation.  Actually, that's what the
> code does now.  It just doesn't revoke any layerfs vnodes.

Why? What's the functional difference (assuming we fix this issue) between
a layer node accessing a dead vnode and a vnode that was a layer vnode
that now is a dead vnode? You get errors for all operations anyway.

I don't see one. And since we have to be able to handle a stack where
there are multiple mounts above one, we have to deal w/ "unexpected"
revoking of the underlying vnode. To do that means handling the case of a
live vnode on top of a dead one, and "handling" means returning buckets of
errors and not crashing the kernel.
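
To make that concrete, here's roughly the behavior I mean as a toy
sketch (the struct and all the names are made up for illustration;
this is not our actual layerfs code):

	#include <errno.h>
	#include <stddef.h>

	/*
	 * Toy model of a vnode stack.  v_dead is set when a vnode is
	 * revoked; v_lower points at the vnode below us (NULL at the
	 * leaf).
	 */
	struct toy_vnode {
		int v_dead;
		struct toy_vnode *v_lower;
	};

	/*
	 * A layer op that finds its lower vnode revoked out from
	 * under it just returns an error; it must not touch the dead
	 * vnode's fs-private state, and it must not panic.
	 */
	static int
	toy_layer_read(struct toy_vnode *vp)
	{
		if (vp->v_lower == NULL || vp->v_lower->v_dead)
			return EBADF;	/* bucket of errors, no crash */
		/* ... otherwise pass the read down to vp->v_lower ... */
		return 0;
	}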

> > > How do I not kill the lock?  The vnode is reclaimed.  It won't be
> > > re-reclaimed after this.  If it's the lower layer, it doesn't even
> > > know about the upper one, right?  If vnode locks were separate and
> > > had separate reference counts, then maybe, but ...
> >
> > I think we need to look at our reclaim processing.
> >
> > The lower layer DOES know something's going on. It knows the vnode has
> > a use count that's not zero.
>
> Ah, right.
>
> > So there are two cases where we go through reclaim processing. In one,
> > we have a vnode that's on the "free" list, meaning it's got no active
> > references. As an aside, it's not really free as it's still a live
> > vnode for whatever it initially was, it's just up for grabs. In this
> > case, we need to scrub the old usage of the vnode away, and make it
> > ready for a new user. Implicitly we know that it was at the head of
> > the free list (well, the first one up for grabs).
>
> Well, except that the vnode is really free effectively, as it has the
> irrevocable VXLOCK bit set and nobody can grab hold of that particular
> instance any more.  Not that that makes any particular difference in
> this discussion.

Huh? I was talking about it being in the free list. vnodes on the free
list do NOT have VXLOCK set. :-)

Also, we don't leave VXLOCK there forever. :-)

> > The other case is where we're revoking a node. For a leaf file system,
> > if the use count is greater than one, there are other users and so we
> > can't just wipe the vnode now. Unhook it from the fs, yes. Turn it
> > into a deadfs node, yes. Make it so future uses will explode, no.
> > Likewise, if the revoke came to us via a layered file system, we
> > shouldn't do the super-destructive reclaim.
> >
> > I haven't looked at the code in a while, but assuming the code will
> > correctly handle the "someone still has the tty open" case, we just
> > need to get the layer case to trigger the same processing. Adding an
> > explicit parameter to VOP_REVOKE() which sys_revoke() sets to 0 and
> > layer_revoke sets to 1 would do it.
>
> I don't see how the forced unmount situation is any different from a
> revoke and therefore why revoke would need a special flag.

A forced unmount shouldn't be doing the same thing as a revoke. It should
just revoke the at-level vnode. The difference is that force-unmounting a
layer shouldn't blast away the underlying vnodes.

Revoke is different from unmount _if_ the leaf fs does things differently
depending on whether someone's still using the vnode or not. If it doesn't
differentiate, then no biggie and no parameter. If, however, it does
differentiate, the reason we need a flag is to tell the leaf fs that its
use count lies.

The reason for the flag is to help the leaf fs. How exactly does the leaf
fs know if there are users other than the revoker on the vnode? It looks
at the use count and compares it to one. However if the VOP_REVOKE() came
through a layered file system, that test isn't good. The one reference
could well be the one held by a single upper layer, which gives you no
indication of how many things hold the upper layer's vnode.

That's why I thought it'd matter. If, however, whatever the leaf file
systems do doesn't care (works right either way), then we don't need said
flag.
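
For concreteness, the flag I have in mind would look something like this
(just a sketch; REVOKE_FROMLAYER and the helper are invented names):

	#define REVOKE_FROMLAYER	0x01	/* revoke came via a layer */

	/*
	 * Toy sketch of the "am I the last user?" test a leaf fs
	 * might do in its revoke path.  When the revoke came through
	 * a layer, a use count of 1 may just be the layer's own
	 * reference, so we can't conclude anything from it.
	 */
	static int
	toy_last_user(int usecount, int flags)
	{
		if (flags & REVOKE_FROMLAYER)
			return 0;	/* use count lies; be conservative */
		return usecount == 1;
	}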

> The call sequence is approximately the following:
>
> VOP_REVOKE
> VOP_UNLOCK (vclean() doesn't call inactive because the vnode is still active)
> VOP_RECLAIM
> v_op set to deadfs
> upper layer decides this one is not worth holding on to
> vrele()
> VOP_INACTIVE (for deadfs)
> new start for the vnode

It's not clear here what's calling what.

Also, unless we short-circuit it, there should be a reclaim when we
recycle the vnode.
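
Roughly (this is from memory, so take the exact call chain with a grain
of salt):

	getnewvnode()
	    -> vgonel()                 /* vnode needs cleaning first */
	        -> vclean(vp, DOCLOSE)
	            -> VOP_RECLAIM      /* deadfs reclaim this time around */
	vnode handed to its new file system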

> Where do you want to free the lock structure?  We will never get another
> VOP_RECLAIM for the original file system method.  Do we want to do it in
> deadfs VOP_INACTIVE?  That's ugly, but it's the only place I see where to
> do it.  It also needs an accompanying bit to say "hey, we already nuked
> the lock in the file system reclaim op" in case the vnode is NOT revoked.

The lock structure is part of struct vnode, so we never have to "free" it.
We decommission (LK_DRAIN) it as part of reclaiming, but we can un-drain
it, I believe. If we can't, we should figure out a way to.
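
I.e. something like this around reclaim (again from memory; I haven't
checked the exact lockmgr()/lockinit() calls against -current, so treat
it as a sketch):

	/* decommission the vnode lock while we tear things down */
	lockmgr(&vp->v_lock, LK_DRAIN, NULL);

	/* ... scrub fs-private state ... */

	/*
	 * "un-draining" may amount to no more than re-initializing
	 * the lock before the vnode goes back into service:
	 */
	lockinit(&vp->v_lock, PVFS, "vnlock", 0, 0);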

Something for deadfs VOP_INACTIVE would be good, namely to do something to
indicate "recycle me soon." Hmm, wait, that's a feature we haven't stolen
from Darwin yet. :-)

> I'd prefer another approach where the whole layering lock export system
> was redone instead of redoing reclaim to fix the layer lock exporting.
> But I don't know how to fix layerfs locking ...

Do we care about the interlock? The advantage of the current locking
behavior is that, since each layer has access to the lockmgr lock, we can
pass in the right interlock. If we either don't care, or we're fine with a
lockmgr call on vnode D releasing the interlock on vnode A (because
someone tried to lock A with its interlock held, and A is above B, which
is above C, which is above D), then we can just turn it all into recursive
VOP_LOCK() calls on the underlying vnodes.
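
In code, the recursive version is about this simple (toy sketch again,
ignoring the interlock entirely; names invented for illustration):

	/*
	 * With no exported lock, locking a layer vnode just recurses
	 * into the vnode below it; the leaf fs supplies the real lock.
	 */
	struct toy_lvnode {
		struct toy_lvnode *v_lower;	/* NULL at the leaf */
		int (*v_lockop)(struct toy_lvnode *, int);
	};

	static int
	toy_layer_lock(struct toy_lvnode *vp, int flags)
	{
		/* recursive VOP_LOCK(): just pass the request down */
		return vp->v_lower->v_lockop(vp->v_lower, flags);
	}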

Have you read Heidemann's dissertation? The locking export is an attempt
at the shared-state stack stuff he did.

We may just need to give up on the interlock, as there's not really any
way we can guarantee it for something like unionfs.

> > If we don't correctly handle the case of a device being open when
> > someone else does the revoke, well, we need to fix that issue
> > first. :-)
>
> That's not a problem.  It just reverts to deadfs.  And with one layer
> there are no exported locks.

Ok.

Take care,

Bill
