
Re: Please review: lookup changes



On Sun, Mar 08, 2020 at 09:22:28PM +0000, Taylor R Campbell wrote:

> > Date: Thu, 5 Mar 2020 22:48:25 +0000
> > From: Andrew Doran <ad%netbsd.org@localhost>
> > 
> > I'd like to merge the changes on the ad-namecache branch, and would
> > appreciate any code review.  The motivation for these changes is, as
> > you might guess, performance.  I put an overview below.  I won't be
> > merging this for a couple of weeks in any case.
> 
> Thanks for working on this.  Lookups are a big point of contention, so
> it's worth putting effort into fixing that.  At the same time, I want
> to make sure we don't regress into the vfs locking insanity we had ten
> years ago; dholland@ and especially hannken@ have put a lot of work
> into cleaning it up and making it understandable and maintainable.
> Please make sure to wait until hannken@ has had an opportunity to
> weigh in on it.

Absolutely.  Some philosophical ramblings:

I see the namecache as a thing that lets you avoid re-doing expensive work
you've already done once before - whether that's disc I/O and metadata
crawls (the original application) or multiprocessor locking (the application
here).

Were it 1993 I might well be up for ripping out vnode locking and going back
to inode locking, but we're way too far down the chosen path now, and there
has been major effort invested by many people over the years to work the
vnode locking into shape, as you note.  It works very well now, and I don't
think we should mess with it too much.

With that in mind I have tried to do this in a way that requires as few
changes to the vnode locking and file systems as possible: use the namecache
to do things fast whenever possible, but preserve the strongly locked path.

> > 	If the file system exports IMNT_NCLOOKUP, this now tries to "fast
> > 	forward" through as many component names as possible by looking
> > 	directly in the namecache, without taking vnode locks nor vnode
> > 	refs, and stops when the leaf is reached or there is something that
> > 	can't be handled using the namecache.  In the most optimistic case a
> > 	vnode reference is taken only at the root and the leaf. 
> > 	IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.
> 
> What file systems can't support IMNT_NCLOOKUP and why?

Good question.  Other than layered file systems I don't know for sure, but I
think it can be done for most FSes.  NFS might have its own complications; I
haven't looked yet.  IMNT_NCLOOKUP is more of a feature gate than anything
else.  The file system hooks are unintrusive and few in number, but adding
them takes careful, time-consuming code inspection to make sure nothing is
missed.
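
The opt-in itself is tiny; something like this at mount time (sketch,
assuming the flag sits in mnt_iflag alongside the other IMNT_* bits;
"examplefs" is made up):

    static void
    examplefs_opt_in(struct mount *mp)
    {

        /*
         * Tell namei() it may fast-forward lookups through the
         * namecache on this mount.  Only set once the file system's
         * lookup/rename/remove paths have been audited for the hooks.
         */
        mp->mnt_iflag |= IMNT_NCLOOKUP;
    }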

> > 	Layered file systems can't work that way (names are cached below),
> 
> It seems there's a tradeoff here:
> 
> 1. We could cache layered vnodes, at the cost of doubling the size of
>    the namecache.
> 
> 2. We could _not_ cache layered vnodes, at the cost of having to redo
>    layer_node_create and vcache_get for every lookup -- a process
>    which scales worse than namecache lookups (global vcache_lock).
> 
> Am I missing anything else about this?  Is this tradeoff the only
> reason that it's still up to the vnode operations to call cache_lookup
> and cache_enter, rather than something namei does directly -- so that
> layer_lookup can skip them?

As far as I recall layer vnodes persist unless recycled like any other
vnode, so there should be nothing special happening there around contention
due to lifecycle.  I think you could cache or transpose the cached names
above, maybe using some kind of VOP.  I definitely think it's doable, but
beyond that there are, at least for me, lots of unknowns, so I have avoided
it.

> > 	and the permissions check won't work for anything unusual like ACLs,
> > 	so there is also IMNT_SHRLOOKUP.  If the file system exports
> > 	IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
> > 	RENAME & friends) is tried first with an LK_SHARED lock on the
> > 	directory.  For an FS that can do the whole lookup this way (e.g. 
> > 	tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
> > 	VOP_LOOKUP() if something tricky comes up, and the lookup gets
> > 	retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
> > 	metadata).
> 
> Except for access to the create/rename/remove hints (which we should
> do differently anyway [*]), in what circumstances does ufs lookup need
> an exclusive lock?  Why does scanning directory metadata require that?
>
> [*] Specifically: We should push the last component lookup into
> directory operations, and leave it to *fs_rename/remove/&c. to invoke
> an internal lookup routine that does the right thing with locks and
> returns the hint to the caller rather than stashing it in the
> directory's struct inode.  This has been intended for several years,
> but dholland and I have had a shortage of round tuits.

I was thinking of the sequential scan optimisation from 4.3BSD ("2passes" in
the namecache stats).  Probably not hard to solve, but it is there; it's
related to the create/remove/rename hints but has a different usage pattern.

I also expect that if ufs_lookup() were opened up to allow a shared lock
along the full path then pressure would be diverted to buffer locks
previously shielded by the vnode lock.  I would be unsurprised to see that
performing worse under contention than the single exclusive lock with that
code as-is.  There's also the fact that modifications and uncached lookups*
are often a tiny proportion of activity on the system in comparison to
cached lookups, so without a specific application in mind I'm not sure
there's a great deal to be won there.
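
For clarity, the shared-then-exclusive dance looks roughly like this from
the caller's side (sketch only, error handling and the rest of the namei
protocol omitted; "wantshared" stands for "read-only lookup on an
IMNT_SHRLOOKUP file system", and dvp is left locked on return as in the real
lookup protocol):

    static int
    lookup_once_sketch(struct vnode *dvp, struct vnode **vpp,
        struct componentname *cnp, bool wantshared)
    {
        int error;

        vn_lock(dvp, (wantshared ? LK_SHARED : LK_EXCLUSIVE) | LK_RETRY);
        error = VOP_LOOKUP(dvp, vpp, cnp);
        if (error == ENOLCK && wantshared) {
            /*
             * The FS hit something it won't do under a shared lock
             * (e.g. ffs scanning/updating directory metadata):
             * retry with the traditional exclusive lock.
             */
            VOP_UNLOCK(dvp);
            vn_lock(dvp, LK_EXCLUSIVE | LK_RETRY);
            error = VOP_LOOKUP(dvp, vpp, cnp);
        }
        return error;
    }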

Looking up names directly in the namecache bypasses these problems to some
extent, because the directory vnode lock needs to be touched far less often,
and when it is touched it's because there's real work to be done.

(* skewed by the VOP_INACTIVE() problem which I want to solve here; it
triggers lots of needless uncached lookups by booting callers out of
cache_lookup().)

> > 	I also added a LOCKSHARED flag to go with LOCKLEAF (get an LK_SHARED
> > 	lock on the leaf instead of LK_EXCLUSIVE).  This matches FreeBSD.
> 
> I worry that after we spent several years getting rid of unnecessary
> complexity in the incomprehensible namei flags that rendered the
> subsystem unmaintainable, we may be sliding back toward that.  Could
> we do this with a separate operation that happens after namei instead?
> 
> 	NDINIT(&nd, ...);
> 	error = namei(&nd);
> 	if (error)
> 		goto out;
> 	namei_lockleaf(&nd);
> 
> (namei_lockleaf would obviously have to be careful not to lock when
> it's a `.' lookup and the parent is already locked, &c.  Perhaps this
> won't work -- I haven't thought it through -- but it would be nice to
> decrease rather than increase the complexity of namei itself.)

LOCKLEAF is something very old and LOCKSHARED is a minor tweak to it (a
conditional picking either LK_SHARED or LK_EXCLUSIVE on the way out of
namei()).  I'd prefer that namei() didn't lock any vnodes to begin with, but
that's the scheme we have (see my comment re: 1993 :-).  It's also what
FreeBSD does here, and in my experience these little differences are
troublesome WRT code sharing, so I decided to match their flag.
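
In other words the entire effect of LOCKSHARED is roughly this on the way
out of namei() (sketch):

    if ((ndp->ni_cnd.cn_flags & LOCKLEAF) != 0 && ndp->ni_vp != NULL) {
        /* LOCKSHARED only changes which lock type the leaf gets. */
        vn_lock(ndp->ni_vp,
            ((ndp->ni_cnd.cn_flags & LOCKSHARED) != 0 ?
            LK_SHARED : LK_EXCLUSIVE) | LK_RETRY);
    }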

> On a related note: Why do VFS_VGET, VFS_ROOT, and VFS_FHTOVP need a
> lktype argument?  Why can't they just return the vnode unlocked, and
> leave it to the caller to just call vn_lock?

Not trying to be glib, but pretty much exactly as I wrote for LOCKLEAF.
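
To spell out the two shapes being compared (fragment, argument order from
memory):

    /* With the lktype argument, as on the branch: */
    error = VFS_ROOT(mp, LK_SHARED, &vp);

    /* Versus returning it unlocked and locking in the caller: */
    error = VFS_ROOT(mp, &vp);
    if (error == 0)
        vn_lock(vp, LK_SHARED | LK_RETRY);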
 
> > vfs_syscalls.c:
> > 
> > 	Corrected vnode lock types: LK_SHARED ok for VOP_ACCESS(), but
> > 	LK_EXCLUSIVE still needed for VOP_GETATTR() due to itimes handling.
> > 	That's a shame and it would be nice to use a seqlock or something
> > 	there eventually.
> 
> I don't understand this itimes handling.  Why do we ever _set_ any
> itimes in VOP_GETATTR?
> 
> My understanding is that we try to defer issuing disk writes to update
> inode times as long as possible (except in wapbl where it can be done
> asynchronously and write-combined -- if nevertheless serially across
> the whole file system -- via the log).
> 
> That explains why, e.g., ufs has patterns like
> 
> 	ip->i_flag |= IN_CHANGE;
> 	UFS_WAPBL_UPDATE(vp, NULL, NULL, 0);
> 
> to indicate that a change just happened and to defer a disk write
> either to the next wapbl commit or to the next time we need to
> synchronously UFS_UPDATE.
> 
> But I don't understand why, e.g., ufs_getattr does UFS_ITIMES.  All
> that that does is check our watch and store the time in memory, still
> without issuing any disk writes -- an approach which seems to have two
> side effects:
> 
> 1. VOP_GETATTR requires an exclusive lock, not a shared lock.
> 2. We don't actually ever check our watch until someone tries to
>    observe the itimes!
> 
> Why don't we check our watch immediately when we set IN_CHANGE &c.,
> and defer the disk write to update the inode like UFS_WAPBL_UPDATE
> does -- and just skip UFS_ITIMES in VOP_GETATTR?

This looks to go back at least as far as Unix 32V (& possibly further) in
one form or another, from what I can see.  Whatever the original intent, I
think it might have morphed into an attempt to avoid querying the system
clock at some point, which is about the only point I can see for it now.
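
For reference, the change being suggested amounts to something like this
(sketch only; the timestamp field is hypothetical and not how our struct
inode actually stores it):

    /*
     * Check the watch when the flag is set, instead of lazily in
     * VOP_GETATTR()/UFS_ITIMES(); the disk write stays deferred.
     */
    static void
    examplefs_mark_change(struct inode *ip)
    {
        struct timespec now;

        vfs_timestamp(&now);
        ip->i_flag |= IN_CHANGE;
        ip->i_ctime = now;      /* hypothetical field: stamp in memory now */
    }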

> > vfs_vnode.c:
> > 
> > 	Namecache related changes, and the change to not block new vnode
> > 	references across VOP_INACTIVE() mentioned on tech-kern.  I don't
> > 	think this blocking is needed at least for well behaved FSes like
> > 	ffs & tmpfs.
> 
> hannken@ cautions me that not blocking until VOP_INACTIVE completes
> may be dangerous.
> 
> The last major deadlock I found in lfs is that, in order to decide
> what goes in the segment, the segment writer iterates over all vnodes
> with vfs_vnode_iterator_next, which waits until VOP_INACTIVE
> completes.  But VOP_INACTIVE may truncate the inode, which has to wait
> until the segment writer has completed in order to avoid competing
> over the inode's data block pointers.
> 
> So I drafted a variant of vfs_vnode_iterator_next that allows access
> to vnodes undergoing VOP_INACTIVE as long as the file system promises
> to be very careful -- in the case of lfs, we have to filter out inodes
> with ip->i_nlink, and I'm not totally sure the logic I drafted isn't
> racy.

Is this something that should be done in VOP_RECLAIM() instead?  Then any
concerns about VOP_INACTIVE() don't apply since the inode is well and truly
dead.

> > 	The rbtree lookup can be assumed to be slightly slower than a hash
> > 	table lookup and microbenchmarking shows this sometimes in the
> > 	single CPU case, but as part of namei() it ends up being ameliorated
> > 	somewhat by the reduction in locking costs, the absence of hash
> > 	bucket collisions with rbtrees, and better L2/L3 cache use on hot
> > 	directories.  I don't have numbers but I know joerg@ saw significant
> > 	speedups on pbulk builds with an earlier version of this code with
> > 	little changed from HEAD other than hash -> rbtree.  Anyway rbtree
> > 	isn't set in stone as the data structure, it's just what we have in
> > 	tree that's most suitable right now.  It can be changed.
> 
> Maybe we could use a critbit tree?  It will have cache utilization
> characteristics similar to a red/black tree (fanout is 2 either way),
> but the lookup can be pserialized and each step may be a little
> cheaper computationally.
> 
> https://mumble.net/~campbell/hg/critbit/
>
> Alternatively, there is rmind's thmap already in tree.

Interesting, I will take a look.  Skiplists and Robin Hood hashes also seem
like fun data structures to consider here, but to be honest I haven't
thought about that in detail yet.

Thanks,
Andrew

