Subject: Re: reboot problems unmounting root
To: Antti Kantee <pooka@cs.hut.fi>
From: Bill Stouder-Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 07/05/2007 17:27:54

On Thu, Jul 05, 2007 at 11:50:38PM +0300, Antti Kantee wrote:
> On Thu Jul 05 2007 at 13:14:54 -0700, Bill Stouder-Studenmund wrote:
>
> Hmm, I thought I had a very good reason for that, but I think I lost it.
> Maybe I misreasoned.
>
> Anyway, the CURRENT state is that ONLY the lower vnode is being revoked
> because of layer_bypass().  The upper is kind of implicitly getting
> revoked.  Maybe that should be changed to revoke only the upper one.

For sys_revoke() processing, we want to revoke the lower one. That's the
only way we can destroy all instances of the device and all access paths.
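
Schematically, that is also what falls out of the bypass today; a rough
sketch (NOT the literal layer_bypass() code, which maps all the vnode
arguments and worries about locking and reference counts):

#include <sys/param.h>
#include <sys/vnode.h>
#include <miscfs/genfs/layer.h>

/*
 * Sketch only: a revoke issued on an upper-layer vnode ends up being
 * re-issued on the vnode one layer down, so it is the leaf/spec vnode
 * that actually gets blasted.
 */
static int
layer_revoke_via_bypass(struct vnode *uppervp, int flags)
{
        struct vnode *lowervp = LAYERVPTOLOWERVP(uppervp);

        /* same operation, one layer down */
        return VOP_REVOKE(lowervp, flags);
}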

> > Huh? I was talking about it being in the free list. vnodes on the free
> > list do NOT have VXLOCK set. :-)
> >
> > Also, we don't leave VXLOCK there forever. :-)
>
> I was talking about VOP_RECLAIM.  That's where we want to free the lock
> if at all.  But n/m that part ;)

Ok.

I guess part of my point is that while we do need to drain the lock, we
don't need to leave it drained.
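
(If lockmgr won't simply let go of a drained lock, re-initializing it
after VOP_RECLAIM would amount to un-draining it. Hypothetical, untested
sketch; once LK_DRAIN has succeeded nobody else holds or sleeps on the
lock, so this should be safe:)

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/vnode.h>

/* Hypothetical helper, not code in the tree. */
static void
vnode_undrain_lock(struct vnode *vp)
{
        /* put the (drained) lock back into service for the dead vnode */
        lockinit(&vp->v_lock, PVFS, "vnlock", 0, 0);
}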

> > > I don't see how the forced unmount situation is any different from a
> > > revoke and therefore why revoke would need a special flag.
> >
> > A forced unmount shouldn't be doing the same thing as a revoke. It should
> > just revoke the at-level vnode. The difference being force-unmounting a
> > layer shouldn't blast away the underlying vnodes.
>=20
> Well, revoke is "check for device aliases + reclaim regardless of
> use count".  A forced unmount is "reclaim regardless of use count".
> I was just talking from the perspective of reclaim, once again.

It's different w.r.t. layering. unmount wants to do the top of a stack,
sys_revoke() wants to do the root.

> > Revoke is different from unmount _if_ the leaf fs does things differently
> > depending on if someone's still using the vnode or not. If it doesn't
> > differentiate, then no biggie and no parameter. If however it does
> > differentiate, the reason we need a flag is to tell the leaf fs that its
> > use count lies.
> >
> > The reason for the flag is to help the leaf fs. How exactly does the leaf
> > fs know if there are users other than the revoker on the vnode? It looks
> > at the use count and compares it to one. However if the VOP_REVOKE() came
> > through a layered file system, that test isn't good. The one reference
> > could well be the one held by a single upper layer, which gives you no
> > indication of how many things hold the upper layer's vnode.
> >
> > That's why I thought it'd matter. If however whatever leaf file systems do
> > doesn't care (works right either way), then we don't need said flag.
>
> Ah, ic.
>
> But now remind me why the revoke should be coming down at all?

Good question. Because we have to revoke ALL access to a given device. A
layer stack can have fan-out (I think I use the word differently from
Heidemann, I mean one leaf fs w/ multiple different layers on top). So
there can be multiple nodes on top of one. Thus to get them all, we have
to blast the bottom one.

Also, if we revoke a device, we have to revoke anyone who accesses a
vnode that accesses that driver, not just the vnode we started with. So
anyone who opens the device in a chroot or opens the device from another
file system, they all have to go away. That's why we do the aliasing
stuff. If revoke didn't go all the way down, it wouldn't happen.
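
Conceptually it's something like this (NOT the real spec/genfs revoke
code; next_spec_alias() is a made-up stand-in for walking the device
alias chain):

#include <sys/param.h>
#include <sys/vnode.h>

/* hypothetical: hand back another vnode aliasing the same device, or NULL */
static struct vnode *next_spec_alias(struct vnode *);

/*
 * Sketch: revoking a device means killing every alias of it -- every
 * open of the same dev_t, in any chroot, on any file system -- not
 * just the vnode the revoker happened to start from.
 */
static void
revoke_all_access(struct vnode *vp)
{
        struct vnode *vq;

        while ((vq = next_spec_alias(vp)) != NULL)
                vgone(vq);
        vgone(vp);
}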

> > > The call sequence is approximately the following:
> > >
> > > VOP_REVOKE
> > > VOP_UNLOCK (vclean() doesn't call inactive because the vnode is still active)
> > > VOP_RECLAIM
> > > v_op set to deadfs
> > > upper layer decides this one is not worth holding on to
> > > vrele()
> > > VOP_INACTIVE (for deadfs)
> > > new start for the vnode
> >
> > It's not clear here what's calling what.
>
> VOP_REVOKE (generally) -> vgonel -> vclean -> (VOP_UNLOCK, VOP_RECLAIM,
> sets v_op to deadfs)
>
> another party: vrele() -> (VOP_INACTIVE(), put vnode on freelist)

Ok. Note it could be the same party if no one else had the vnode open. ;-)

> > Also, unless we short-circuit it, there should be a reclaim when we
> > recycle the vnode.
>
> No, there won't (technically).  The vnode now uses deadops and
> dead_reclaim is nada.
>
> > The lock structure is part of struct vnode, so we never have to "free" it.
> > We decommission (LK_DRAIN) it as part of reclaiming, but we can un-drain it
> > I believe. If we can't, we should figure out a way to.
>=20
> free, drain, conceptually the same

Ok. In terms of locking, mostly the same.

> > Something for deadfs VOP_INACTIVE would be good, namely to do something to
> > indicate "recycle me soon." Hmm, wait, that's a feature we haven't stolen
> > from Darwin yet. :-)
>
> But dead_inactive is called only for vnodes which were forcibly reclaimed.
> Others have VOP_INACTIVE called already when they are still attached to
> their real file system.  If all inactive methods would need to be patched
> to cope with this, that would add yet more difficult-to-comprehend side
> effects to file system writing.

I'm not sure we need dead_inactive to do anything.

However no file systems will need patching. By definition. If the file
system needs to do something specific with a dead vnode, it needs its own
set of foo_dead vectors and it wouldn't use deadfs's vector set.
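
To make "its own set of foo_dead vectors" concrete, with a made-up fs
"foo" (only the registration pattern is real; the methods are
hypothetical):

#include <sys/param.h>
#include <sys/vnode.h>

static int foo_dead_inactive(void *);   /* e.g. mark "recycle me soon" */
static int foo_dead_reclaim(void *);

int (**foo_dead_vnodeop_p)(void *);

const struct vnodeopv_entry_desc foo_dead_vnodeop_entries[] = {
        { &vop_default_desc, vn_default_error },
        { &vop_inactive_desc, foo_dead_inactive },
        { &vop_reclaim_desc, foo_dead_reclaim },
        { NULL, NULL }
};

const struct vnodeopv_desc foo_dead_vnodeop_opv_desc =
        { &foo_dead_vnodeop_p, foo_dead_vnodeop_entries };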

> > Do we care about the interlock? The advantage of the current locking
> > behavior is that, since each layer has access to the lockmgr lock, we can
> > pass in the right interlock. If we either don't care, or we're fine with a
> > lockmgr call on vnode D releasing the interlock on vnode A (because
> > someone tried to lock A with its interlock held, and A is above B, which
> > is above C, which is above D), then we can just turn it all into recursive
> > VOP_LOCK() calls on the underlying vnodes.
> >
> > Have you read Heidemann's dissertation? The locking export is an attempt at
> > the shared-state stack stuff he did.
>=20
> No.  I guess that's saying "I should" ;)

It would be a good read. Just stick to the "thick" layering stuff. We
didn't do the "thin layering" and I think it'd make too many heads hurt.

> > We may just need to give up on the interlock, as there's not really any
> > way we can guarantee it for something like unionfs.
>
> I actually had a similar idea and did a two-layer hack (it doesn't
> recurse to the bottom of the stack, but that's easy).  It seems to fix
> the problems.  I can't see it being any worse than the current state of
> the art.

What's wrong with un-draining the lock? That fixes the problem too, if we
support it. :-)
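
Just to have something concrete, the "recursive VOP_LOCK() on the
underlying vnodes" version of that idea would look roughly like the
sketch below; interlock handling is ignored, and the exact VOP_LOCK /
lockmgr argument details may well be off:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/vnode.h>
#include <miscfs/genfs/layer.h>

static int
layer_lock_recurse(struct vnode *vp, int flags)
{
        struct vnode *lowervp = LAYERVPTOLOWERVP(vp);
        int error;

        /* take this layer's own lock first ... */
        error = lockmgr(&vp->v_lock, flags & ~LK_INTERLOCK, NULL);
        if (error != 0)
                return error;

        /* ... then ask the layer below to lock itself, and so on down */
        error = VOP_LOCK(lowervp, flags & ~LK_INTERLOCK);
        if (error != 0)
                lockmgr(&vp->v_lock, LK_RELEASE, NULL);
        return error;
}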

The problem I see is that there's no easy way (that I can see) of making
this extend to all the layering we can have now. What about a stack like:

        A    B
         \ /
          C   D
           \ /
            L

Where A, B, C, and D are different layer mounts, and L is the leaf file
system under it all.

Say D processes the revoke, or say it happens directly on L. C and D can
notice that something changed underneath, but A and B can't easily notice
a change to L, since they'd only see it if C changed somehow.

For now, let's just undrain the lock, then wait for everything above to
get torn down.

Take care,

Bill
