tech-kern archive


Re: crash in tmpfs_itimes since vmlocking2 merge



On Thu, Jan 24, 2008 at 12:32:59AM +0000, Andrew Doran wrote:
> On Wed, Jan 23, 2008 at 01:24:39PM -0800, Bill Stouder-Studenmund wrote:
> > On Wed, Jan 23, 2008 at 10:55:13PM +0200, Antti Kantee wrote:
> > > On Wed Jan 23 2008 at 12:49:09 -0800, Bill Stouder-Studenmund wrote:
> > > > > This is the problem of forcibly unmounting a file system with device
> > > > > vnodes on it.  There was a small discussion about splitting specfs
> > > > > into two layers: one for the "host" file system (tmpfs in this case)
> > > > > and one for the actual device info (specfs).  Then you could rip the
> > > > > "host" part out without affecting the device part before the device 
> > > > > goes
> > > > > inactive and can be safely put to rest.
> > > > 
> > > > I'd actually say this is a bug in deadfs or tmpfs. Since we know the
> > > > vnode's still in use, we should leave one around that doesn't cause the 
> > > > kernel to crash.
> > > 
> > > We don't make device vnodes deadfs so that they would still work.  And I
> > 
> > I think then we should. :-)
> > 
> > > don't think teaching every file system to operate without its backing data
> > > structures is very feasible.  Hence, split it into two pieces so we don't
> > > have to worry about it: the "frontend" (e.g. tmpfs) is deadfs'd while the
> > > "backend" (specfs) continues operation until refcount drops to 0.
> 
> An alternative for device nodes only is to implement a devfs and make
> devices meaningless when they appear on other types of file system. I
> hear that FreeBSD did that.

We have a developer working on devfs for a school project. We hope to have 
it this spring. When it happens, I expect we'll make devices on other file 
systems meaningless. Note: there will be time for review before it gets 
committed, so we don't need to hash it out now. :-)

> > Haven't we already done that split w/ specfs now? All we need to do is add 
> > a deadfs operation vector (and routines, maybe) for devices, and use it 
> > when we deaden a device.
> > 
> > "deadfs" means your file system is dead. We don't necessarily ned it to 
> > mean that your vnode is dead. And for devices which are mounted, chances 
> > are we don't want that meaning. :-)
> > 
> > The one thing I see that might be tricky is to make things work right w/ 
> > revoke(2). If you revoke something, you need a dead device vnode that does 
> > nothing. If however we unmount /dev, you want a deadfs device vnode that 
> > still does i/o. Character devices always want a dead node, but we need 
> > some differentiation between "dead" and "orphaned" block devices.
> 
> The major problem I see is that changing v_op on a live vnode is nuts,
> particularly with the file system now being fully multithreaded. It could
> probably be solved by adding an additional layer of gates into VOP access.
> Runtime cost aside that would mean understanding a bunch of complex
> interactions.

I disagree. Granted, we can do it wrong and it'll explode! But I think we
can do it right. I'll be honest: I've not caught up on all the details of
the vmlocking work you did, so I'll probably get some details wrong. :-)

Changing v_op on a live vnode will partition the world (well, all other
threads that have live refs to the node) into three sets. The easiest set
is the one that isn't actively using the vnode. These users will use the
new v_op table and just get errors when they go to use the revoked node.

The hardest set is one that I'd hope our locking could keep from existing.
These would be concurrent accesses on other CPUs that don't take the locks
that our revoking is holding. I'm not sure how to make such things truly
safe. However, operations like these are taking a shortcut which we don't
_need_ to take. So I think it's reasonable for the file system, which does
both the revoke and these operations, to make them work.

The intermediate-difficulty set is the set of threads that are actively
trying to get a lock on the vnode we're revoking. They have already
dereferenced v_op and are into some operation that's taking a lock. They
are blocked (spinning or sleeping) waiting for the revoke to finish.
LK_DRAIN in the lock manager was designed to handle them - you mark the
lock as not accepting new lockers and make all the current waiters return
an error. When you got back from LK_DRAIN, you knew there were no pending
waiters. We can still make this work as long as revoke doesn't free the
structures needed to take the lock and notice the node is dead.
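
To put it concretely, the sequence I'm imagining looks something like the
sketch below. The drain helpers, the prototypes and the flag handling are
placeholders rather than real interfaces, so read it as hand-waving in C,
not a patch:

#include <sys/param.h>
#include <sys/mutex.h>
#include <sys/vnode.h>

void    vn_lock_drain(struct vnode *);          /* placeholder */
void    vn_lock_drain_done(struct vnode *);     /* placeholder */

extern int (**dead_vnodeop_p)(void *);          /* deadfs's op vector */

/*
 * Sketch only: swap a live vnode's op vector to deadfs once the
 * "intermediate" waiters have been drained.
 */
void
revoke_to_deadfs(struct vnode *vp)
{

        /*
         * Refuse new lockers and kick every thread currently sleeping
         * on the vnode lock back out with an error, so that nobody is
         * in the middle of a VOP when we switch vectors.
         */
        vn_lock_drain(vp);

        mutex_enter(&vp->v_interlock);
        vp->v_op = dead_vnodeop_p;      /* new callers now hit deadfs */
        /* ...plus whatever flag marks the node as revoked... */
        mutex_exit(&vp->v_interlock);

        /*
         * The lock structure must stay valid beyond this point, so the
         * threads we just woke can still take and release it, notice
         * the node is dead, and return an error.
         */
        vn_lock_drain_done(vp);
}

The key point is only the ordering: no new entries through the old vector
after the swap, and nothing freed that a drained waiter still needs in
order to notice the node is dead.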

Hmmm... As I think about this, I see some things that I hadn't thought of.  
I think that changing v_op on a live vnode isn't that crazy. However, I see
that it needs a few things in place to work well. The big reason I see for
changing v_op is so that you can hand the vnode off to deadfs - you can
transfer it to a "genfs for dead vnodes." For this to really work,
deadfs_reclaim() needs to know how to fully clean up the vnode. While
deadfs itself won't add anything that needs cleaning, the original file
system needs to either deaden an otherwise-clean vnode (i.e. reclaim() it
itself), or it has to leave instructions so that deadfs_reclaim() can clean
it all up: zap this/these locks, put this structure in that pool, etc.
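
Concretely, the "instructions" could be as small as a destructor hook that
the file system stashes on the vnode before the hand-off. A sketch, with
every name below invented for illustration:

#include <sys/param.h>
#include <sys/kmem.h>
#include <sys/vnode.h>

struct tmpfs_node;                      /* from fs/tmpfs/tmpfs.h */

void    tmpfs_dead_fini(void *);        /* knows which locks, which pool */

/* The "instructions": how to finish cleaning up the fs-private data. */
struct dead_cleanup {
        void    (*dc_fini)(void *);
        void    *dc_data;               /* e.g. the tmpfs node */
};

/* The file system fills this in before handing the vnode to deadfs... */
void
tmpfs_prepare_handoff(struct vnode *vp, struct tmpfs_node *node)
{
        struct dead_cleanup *dc;

        dc = kmem_alloc(sizeof(*dc), KM_SLEEP);
        dc->dc_fini = tmpfs_dead_fini;
        dc->dc_data = node;
        vp->v_data = dc;                /* deadfs owns v_data from here */
}

/* ...and deadfs_reclaim() just runs the instructions. */
int
deadfs_reclaim(void *v)
{
        struct vop_reclaim_args *ap = v;
        struct vnode *vp = ap->a_vp;
        struct dead_cleanup *dc = vp->v_data;

        if (dc != NULL) {
                (*dc->dc_fini)(dc->dc_data);
                kmem_free(dc, sizeof(*dc));
                vp->v_data = NULL;
        }
        return 0;
}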

If we can drain both the "intermediate" and "hard" sets above, we can do
our own revoke and hand the vnode off to deadfs.

The question is, do we want to? I think we do. One advantage is that we can
avoid code in each file system to deal with zombie vnodes.

The big reason, though, is for modules. If we can hand the vnodes off to
deadfs by the criteria above, then we can make sure there's a point when
no vnode still refers to data we have to free and that no other threads
are in the file system. So we can exit and unload the file system.
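
Put another way, the module's detach path gets a well-defined "safe to
unload" test. A sketch, with made-up bookkeeping:

#include <sys/param.h>
#include <sys/errno.h>

/* Invented per-fs bookkeeping, just to show the shape of the test. */
struct examplefs_global {
        int     eg_live_vnodes;         /* vnodes not yet handed to deadfs */
        int     eg_threads_inside;      /* threads currently in our VOPs */
};

int
examplefs_fini(struct examplefs_global *eg)
{

        /*
         * After the hand-off, no vnode's v_op or v_data may point
         * into this module any more...
         */
        if (eg->eg_live_vnodes != 0)
                return EBUSY;

        /* ...and no thread may still be executing our code. */
        if (eg->eg_threads_inside != 0)
                return EBUSY;

        /* Safe to free our data and let the module's text go away. */
        return 0;
}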

> I would prefer to see locking pushed down to the inode level, with the file
> system dealing with revoked vnodes. It would mean maintaining a zombie
> in-core inode after a vnode has been forcibly revoked, with the inode
> structure having no references to it other than from the vnode that has been
> revoked.  Your average file system VOP not wanting to do something fancy
> would call genfs_lock() *. The genfs_node would contain a rwlock replacing
> the current vnode lock, and would have a flag to record if the vnode has
> been revoked. If it's gone then genfs_lock() would return EBADF. The zombie
> inode would only go away when the reference count on the vnode drops to
> zero.

I don't think pushing locking around matters here; I think pushing locking 
into the fs is good. :-) However as I babbled above, I think it's better 
to have the ability to hand a vnode off to non-fs code.
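
Just so we're talking about the same thing, here's roughly how I read the
genfs_lock() idea. It's only a sketch; the struct layout, the accessor and
the field names are my invention:

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <sys/vnode.h>

/* Invented layout: a per-node rwlock plus a "revoked" flag. */
struct genfs_node {
        krwlock_t       g_lock;         /* replaces the vnode lock */
        bool            g_revoked;      /* set when the vnode is revoked */
        /* ...the existing genfs_node members... */
};

#define VP_TO_GENFS(vp) ((struct genfs_node *)(vp)->v_data)    /* invented */

int
genfs_lock(struct vnode *vp, int flags)
{
        struct genfs_node *gp = VP_TO_GENFS(vp);

        rw_enter(&gp->g_lock,
            (flags & LK_EXCLUSIVE) ? RW_WRITER : RW_READER);

        if (gp->g_revoked) {
                /* Revoked before, or while, we waited for the lock. */
                rw_exit(&gp->g_lock);
                return EBADF;
        }
        return 0;
}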

Take care,

Bill
