tech-kern: Re: reboot problems unmounting root

Subject: Re: reboot problems unmounting root
To: Bill Stouder-Studenmund <wrstuden@netbsd.org>
From: Antti Kantee <pooka@cs.hut.fi>
List: tech-kern
Date: 07/06/2007 15:02:00

On Thu Jul 05 2007 at 17:27:54 -0700, Bill Stouder-Studenmund wrote:
> On Thu, Jul 05, 2007 at 11:50:38PM +0300, Antti Kantee wrote:
> > On Thu Jul 05 2007 at 13:14:54 -0700, Bill Stouder-Studenmund wrote:
> > 
> > Hmm, I thought I had very good reasoning for that, but I think I lost it.
> > Maybe I misreasoned.
> > 
> > Anyway, the CURRENT state is that ONLY the lower vnode is being revoked
> > because of layer_bypass().  The upper is kind of implicitly getting
> > revoked.  Maybe that should be changed to revoke only the upper one.
> 
> For sys_revoke() processing, we want to revoke the lower one. That's the 
> only way we can destroy all instances of the device and all access paths.

Right.  That was my reason.

Although, the description on the manual page (revoke(2)) is a bit wrong:

    The revoke function invalidates all current open file descriptors in
    the system for the file named by path.

It doesn't take aliasing into account.

> > > > I don't see how the forced unmount situation is any different from a
> > > > revoke and therefore why revoke would need a special flag.
> > > 
> > > A forced unmount shouldn't be doing the same thing as a revoke. It should 
> > > just revoke the at-level vnode. The difference being force-unmounting a 
> > > layer shouldn't blast away the underlying vnodes.
> > 
> > Well, revoke is "check for device aliases + reclaim regardless of
> > use count".  A forced unmount is "reclaim regardless of use count".
> > I was just talking from the perspective of reclaim, once again.
> 
> It's different w.r.t. layering. unmount wants to do the top of a stack, 
> sys_revoke() wants to do the root.

No, it's not.  Forcibly unmount the root layer.  I am still talking
about *reclaim*, not revoke.  I am talking about reclaim because the
reclaim introduced by revoke is the one causing problems.  If you just
give revoke special treatment, the unmount -f problem remains.
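
To make the distinction concrete, here is a toy model in C.  It is only
a sketch: the struct, its fields and the rv_* functions are made up for
illustration and are not the kernel's vnode code.  It shows that both
paths end in the same "reclaim regardless of use count" step and differ
only in how they pick their victims.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a vnode; every name here is hypothetical. */
struct rv_vnode {
    int usecount;   /* active references, ignored on purpose below */
    int reclaimed;  /* set once the node has been torn down */
    int dev;        /* device number, or -1 if not a device node */
    int mount_id;   /* which mount the node lives on */
};

/* The shared step: reclaim regardless of use count. */
static void rv_reclaim(struct rv_vnode *vp)
{
    vp->reclaimed = 1;
}

/* Revoke: find device aliases, then reclaim each one.
 * Returns how many nodes were newly reclaimed. */
static int rv_revoke(struct rv_vnode *vp, struct rv_vnode *all, size_t n)
{
    int count = 0;

    assert(vp != NULL && all != NULL);
    for (size_t i = 0; i < n; i++) {
        struct rv_vnode *vq = &all[i];
        int alias = (vp->dev != -1 && vq->dev == vp->dev);

        if ((vq == vp || alias) && !vq->reclaimed) {
            rv_reclaim(vq);
            count++;
        }
    }
    return count;
}

/* Forced unmount: no alias check, just reclaim everything on the mount. */
static int rv_force_unmount(int mount_id, struct rv_vnode *all, size_t n)
{
    int count = 0;

    assert(all != NULL);
    for (size_t i = 0; i < n; i++) {
        if (all[i].mount_id == mount_id && !all[i].reclaimed) {
            rv_reclaim(&all[i]);
            count++;
        }
    }
    return count;
}
```

Neither function looks at usecount, which is the point: whatever goes
wrong after a reclaim of an in-use node can be reached from either one.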

> > But now remind me why the revoke should be coming down at all?
> 
> Good question. Because we have to revoke ALL access to a given device. A 
> layer stack can have fan-out (I think I use the word differently from 
> Heidemann, I mean one leaf fs w/ multiple different layers on top). So 
> there can be multiple nodes on top of one. Thus to get them all, we have 
> to blast the bottom one.
> 
> Also, if we revoke a device, we have to revoke anyone who accesses a 
> vnode that accesses that driver, not just the vnode we started with. So 
> anyone who opens the device in a chroot or opens the device from another 
> file system, they all have to go away. That's why we do the aliasing 
> stuff. If revoke didn't go all the way down, it wouldn't happen.

Right.  Earlier I thought you said we should only nuke the top one and
I was confused.  Good that we agree now.
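
The fan-out argument can be sketched with a small C model.  This is an
illustration under assumptions, not layerfs code: fan_node and
fan_usable() are hypothetical, and "usable" simply means that nothing
on the path down to the leaf has been revoked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layer node: points at the node below it, NULL at the leaf. */
struct fan_node {
    struct fan_node *lower;
    int revoked;
};

/* A node gives access only if no node on the way down has been revoked. */
static int fan_usable(const struct fan_node *np)
{
    assert(np != NULL);
    for (; np != NULL; np = np->lower) {
        if (np->revoked)
            return 0;
    }
    return 1;
}
```

With two layer nodes stacked on the same leaf, marking one upper node
revoked leaves the other one usable, while marking the leaf revoked
cuts off both.  That is why the revoke has to reach the bottom.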

> > > > The call sequence is approximately the following:
> > > > 
> > > > VOP_REVOKE
> > > > VOP_UNLOCK (vclean() doesn't call inactive because the vnode is still active)
> > > > VOP_RECLAIM
> > > > v_op set to deadfs
> > > > upper layer decides this one is not worth holding on to
> > > > vrele()
> > > > VOP_INACTIVE (for deadfs)
> > > > new start for the vnode
> > > 
> > > It's not clear here what's calling what.
> > 
> > VOP_REVOKE (generally) -> vgonel -> vclean -> (VOP_UNLOCK, VOP_RECLAIM,
> > sets v_op to deadfs)
> > 
> > another party: vrele() -> (VOP_INACTIVE(), put vnode on freelist)
> 
> Ok. Note it could be the same party if no one else had the vnode open. ;-)

No, it can't.  If usecount is 0, nobody will call vrele().  vclean() will
call VOP_INACTIVE directly.  Hence you either get a call to fs_inactive()
or dead_inactive(), not both.
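
A toy model of that claim in C, in case it helps.  It follows the
behaviour as described in this mail and has not been checked against
the actual vclean() source; all ia_* names and the counters are
hypothetical.

```c
#include <assert.h>

enum ia_ops { IA_FS_OPS, IA_DEAD_OPS };

/* Hypothetical vnode with counters instead of real inactive routines. */
struct ia_vnode {
    int usecount;
    enum ia_ops ops;          /* which op vector is installed */
    int fs_inactive_calls;    /* times the fs inactive routine ran */
    int dead_inactive_calls;  /* times the deadfs inactive routine ran */
    int on_freelist;
};

/* Dispatch through whatever op vector the node has right now. */
static void ia_inactive(struct ia_vnode *vp)
{
    if (vp->ops == IA_FS_OPS)
        vp->fs_inactive_calls++;
    else
        vp->dead_inactive_calls++;
}

/* Clean the node: inactive is called here only if nobody holds it. */
static void ia_vclean(struct ia_vnode *vp)
{
    if (vp->usecount == 0)
        ia_inactive(vp);      /* still the fs op vector at this point */
    /* the reclaim step would run here */
    vp->ops = IA_DEAD_OPS;    /* from now on the node belongs to deadfs */
}

/* Drop a reference; the last one triggers inactive and the freelist. */
static void ia_vrele(struct ia_vnode *vp)
{
    assert(vp->usecount > 0); /* nobody calls this with usecount 0 */
    if (--vp->usecount == 0) {
        ia_inactive(vp);
        vp->on_freelist = 1;
    }
}
```

Cleaning a node with usecount 0 bumps only the fs counter.  Cleaning a
node with usecount 1 and then releasing it bumps only the dead counter.
There is no sequence in this model that bumps both.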

> The problem I see is that there's no easy way (that I can see) of making 
> this extend to all the layering we can have now. What about a stack like:
> 
>         A   B
>          \ /
>           C   D
>            \ /
>             L
> 
> Where A, B, C, and D are different layer mounts, and L is the leaf file 
> system under it all.
> 
> Say D processes the revoke, or say it happens directly on L. C and D can 
> notice that something changed underneath, but A and B can't easily notice 
> a change to L, since they'd only see it if C changed somehow.

My head just exploded.  Call me silly, but I'd be happy if a stack like
this worked for starters:

         A
         |
         B

Seriously though, that's what I was talking about when I said recursing
to the bottom.  Instead of caching a lock pointer in each layer node,
traverse to the bottom or until a defunct lock is found and act as if
a lock wasn't exported.
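
Roughly what I have in mind, as a C sketch.  The types and the
lk_find() helper are hypothetical and much simpler than the real
layer node and lock structures; a NULL return stands for "behave as if
no lock was exported".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lock: only records whether it has been torn down. */
struct lk_lock {
    int defunct;
};

/* Hypothetical layer node: no cached lock pointer from below. */
struct lk_node {
    struct lk_node *lower;   /* NULL at the bottom of the stack */
    struct lk_lock *lock;    /* lock exported at this level, may be NULL */
};

/* Walk down on every use instead of trusting a cached pointer.
 * Stop early if a defunct lock shows up on the way. */
static struct lk_lock *lk_find(struct lk_node *np)
{
    assert(np != NULL);
    for (;;) {
        if (np->lock != NULL && np->lock->defunct)
            return NULL;          /* act as if no lock was exported */
        if (np->lower == NULL)
            return np->lock;      /* bottom reached; may be NULL too */
        np = np->lower;
    }
}
```

The cost is a walk per lock operation, but a change at the bottom is
seen by every layer above it without anyone having to notify them.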

> For now, let's just undrain the lock, then wait for everything above to 
> get torn down.

If it can be made to work, let's.

-- 
Antti Kantee <pooka@iki.fi>                     Of course he runs NetBSD
http://www.iki.fi/pooka/                          http://www.NetBSD.org/
    "la qualité la plus indispensable du cuisinier est l'exactitude"