tech-kern archive

Re: Please review: lookup changes



> Date: Thu, 5 Mar 2020 22:48:25 +0000
> From: Andrew Doran <ad%netbsd.org@localhost>
> 
> I'd like to merge the changes on the ad-namecache branch, and would
> appreciate any code review.  The motivation for these changes is, as
> you might guess, performance.  I put an overview below.  I won't be
> merging this for a couple of weeks in any case.

Thanks for working on this.  Lookups are a big point of contention, so
it's worth putting effort into fixing that.  At the same time, I want
to make sure we don't regress into the vfs locking insanity we had ten
years ago; dholland@ and especially hannken@ have put a lot of work
into cleaning it up and making it understandable and maintainable.
Please make sure to wait until hannken@ has had an opportunity to
weigh in on it.

> 	If the file system exports IMNT_NCLOOKUP, this now tries to "fast
> 	forward" through as many component names as possible by looking
> 	directly in the namecache, without taking vnode locks nor vnode
> 	refs, and stops when the leaf is reached or there is something that
> 	can't be handled using the namecache.  In the most optimistic case a
> 	vnode reference is taken only at the root and the leaf. 
> 	IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.

What file systems can't support IMNT_NCLOOKUP and why?

> 	Layered file systems can't work that way (names are cached below),

It seems there's a tradeoff here:

1. We could cache layered vnodes, at the cost of doubling the size of
   the namecache.

2. We could _not_ cache layered vnodes, at the cost of having to redo
   layer_node_create and vcache_get for every lookup -- a process
   which scales worse than namecache lookups (global vcache_lock).

Am I missing anything else about this?  Is this tradeoff the only
reason that it's still up to the vnode operations to call cache_lookup
and cache_enter, rather than something namei does directly -- so that
layer_lookup can skip them?
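
For concreteness, here is roughly how I picture option 2's per-lookup
cost -- an illustrative sketch only, not the real layer_lookup (the
bypass and locking are elided), and it assumes layer_node_create() is
still implemented in terms of vcache_get():

	/*
	 * Illustrative: the shape of a layered lookup when layer vnodes
	 * are not cached.  Every component that resolves through the
	 * layer pays for layer_node_create() -> vcache_get(), i.e. the
	 * global vcache_lock, even when the lower file system satisfied
	 * the name from its own namecache.
	 */
	static int
	example_layer_lookup(struct vnode *dvp, struct vnode **vpp,
	    struct componentname *cnp)
	{
		struct vnode *lowervp = NULL;
		int error;

		/* The name is cached (if at all) below, in the lower fs. */
		error = VOP_LOOKUP(LAYERVPTOLOWERVP(dvp), &lowervp, cnp);
		if (error != 0)
			return error;

		/* Wrap the lower vnode: a vcache_get() on every lookup. */
		return layer_node_create(dvp->v_mount, lowervp, vpp);
	}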

> 	and the permissions check won't work for anything unusual like ACLs,
> 	so there is also IMNT_SHRLOOKUP.  If the file system exports
> 	IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
> 	RENAME & friends) is tried first with an LK_SHARED lock on the
> 	directory.  For an FS that can do the whole lookup this way (e.g. 
> 	tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
> 	VOP_LOOKUP() if something tricky comes up, and the lookup gets
> 	retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
> 	metadata).
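
If I'm reading this right, the dance at each directory is roughly the
following -- a sketch of my understanding, not the branch's actual
code:

	vn_lock(dvp, LK_SHARED | LK_RETRY);
	error = VOP_LOOKUP(dvp, &vp, cnp);
	if (error == ENOLCK) {
		/* FS needs to modify something; retry exclusively. */
		VOP_UNLOCK(dvp);
		vn_lock(dvp, LK_EXCLUSIVE | LK_RETRY);
		error = VOP_LOOKUP(dvp, &vp, cnp);
	}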

Except for access to the create/rename/remove hints (which we should
do differently anyway [*]), in what circumstances does ufs lookup need
an exclusive lock?  Why does scanning directory metadata require that?

[*] Specifically: We should push the last component lookup into
directory operations, and leave it to *fs_rename/remove/&c. to invoke
an internal lookup routine that does the right thing with locks and
returns the hint to the caller rather than stashing it in the
directory's struct inode.  This has been intended for several years,
but dholland and I have had a shortage of round tuits.
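
The shape I have in mind is roughly the following -- hypothetical, the
name is invented for the example; ufs already has struct
ufs_lookup_results for the hint, it just stashes it in the inode today:

	/*
	 * Hypothetical: the operation does its own last-component
	 * lookup and gets the directory slot hint back directly,
	 * instead of it being stashed in the directory's struct inode.
	 */
	int
	ufs_lookup_last(struct vnode *dvp, struct componentname *cnp,
	    struct vnode **vpp, struct ufs_lookup_results *ulr);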

> 	I also added a LOCKSHARED flag to go with LOCKLEAF (get an LK_SHARED
> 	lock on the leaf instead of LK_EXCLUSIVE).  This matches FreeBSD.

I worry that after we spent several years getting rid of unnecessary
complexity in the incomprehensible namei flags that rendered the
subsystem unmaintainable, we may be sliding back toward that.  Could
we do this with a separate operation that happens after namei instead?

	NDINIT(&nd, ...);
	error = namei(&nd);
	if (error)
		goto out;
	namei_lockleaf(&nd);

(namei_lockleaf would obviously have to be careful not to lock when
it's a `.' lookup and the parent is already locked, &c.  Perhaps this
won't work -- I haven't thought it through -- but it would be nice to
decrease rather than increase the complexity of namei itself.)
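
To be explicit about what I mean, a minimal sketch -- namei_lockleaf()
is hypothetical, takes the lock type explicitly rather than via a flag,
and the `.'/parent-already-locked case needs real thought:

	void
	namei_lockleaf(struct nameidata *ndp, int lktype)
	{
		struct vnode *vp = ndp->ni_vp;

		if (vp == NULL)
			return;
		/*
		 * For a `.' lookup the leaf is the parent; if we already
		 * hold it exclusively, don't deadlock against ourselves.
		 */
		if (vp == ndp->ni_dvp && VOP_ISLOCKED(vp) == LK_EXCLUSIVE)
			return;
		vn_lock(vp, lktype | LK_RETRY);
	}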

On a related note: Why do VFS_VGET, VFS_ROOT, and VFS_FHTOVP need a
lktype argument?  Why can't they just return the vnode unlocked, and
leave it to the caller to call vn_lock?
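
That is, something like this at the call sites -- a sketch, assuming a
hypothetical variant of VFS_ROOT that returns the vnode unlocked:

	error = VFS_ROOT(mp, &vp);	/* hypothetical: vp comes back unlocked */
	if (error)
		goto out;
	vn_lock(vp, LK_SHARED | LK_RETRY);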

> vfs_syscalls.c:
> 
> 	Corrected vnode lock types: LK_SHARED ok for VOP_ACCESS(), but
> 	LK_EXCLUSIVE still needed for VOP_GETATTR() due to itimes handling.
> 	That's a shame and it would be nice to use a seqlock or something
> 	there eventually.

I don't understand this itimes handling.  Why do we ever _set_ any
itimes in VOP_GETATTR?

My understanding is that we try to defer issuing disk writes to update
inode times as long as possible (except in wapbl where it can be done
asynchronously and write-combined -- if nevertheless serially across
the whole file system -- via the log).

That explains why, e.g., ufs has patterns like

	ip->i_flag |= IN_CHANGE;
	UFS_WAPBL_UPDATE(vp, NULL, NULL, 0);

to indicate that a change just happened and to defer a disk write
either to the next wapbl commit or to the next time we need to
synchronously UFS_UPDATE.

But I don't understand why, e.g., ufs_getattr does UFS_ITIMES.  All
that that does is check our watch and store the time in memory, still
without issuing any disk writes -- an approach which seems to have two
side effects:

1. VOP_GETATTR requires an exclusive lock, not a shared lock.
2. We don't actually ever check our watch until someone tries to
   observe the itimes!

Why don't we check our watch immediately when we set IN_CHANGE &c.,
and defer the disk write to update the inode like UFS_WAPBL_UPDATE
does -- and just skip UFS_ITIMES in VOP_GETATTR?
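
Concretely, I'm imagining something like this at the point of change --
a sketch only, reusing the existing ufs macros, with the disk write
still deferred exactly as today:

	struct timespec ts;

	/*
	 * Check our watch now, so the timestamp is already in the
	 * in-memory inode and VOP_GETATTR can be a pure read under a
	 * shared lock.
	 */
	vfs_timestamp(&ts);
	DIP_ASSIGN(ip, ctime, ts.tv_sec);
	DIP_ASSIGN(ip, ctimensec, ts.tv_nsec);
	ip->i_flag |= IN_MODIFIED;
	UFS_WAPBL_UPDATE(vp, NULL, NULL, 0);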

> vfs_vnode.c:
> 
> 	Namecache related changes, and the change to not block new vnode
> 	references across VOP_INACTIVE() mentioned on tech-kern.  I don't
> 	think this blocking is needed at least for well behaved FSes like
> 	ffs & tmpfs.

hannken@ cautions me that not blocking until VOP_INACTIVE completes
may be dangerous.

The last major deadlock I found in lfs is that, in order to decide
what goes in the segment, the segment writer iterates over all vnodes
with vfs_vnode_iterator_next, which waits until VOP_INACTIVE
completes.  But VOP_INACTIVE may truncate the inode, which has to wait
until the segment writer has completed in order to avoid competing
over the inode's data block pointers.

So I drafted a variant of vfs_vnode_iterator_next that allows access
to vnodes undergoing VOP_INACTIVE as long as the file system promises
to be very careful -- in the case of lfs, we have to filter out inodes
with ip->i_nlink, and I'm not totally sure the logic I drafted isn't
racy.
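
To give the flavour, the filter is something like this -- illustrative
only, simplified, and the name is made up for the example:

	/*
	 * Skip inodes being freed: their truncation in VOP_INACTIVE is
	 * what can collide with the segment writer over the block
	 * pointers.
	 */
	static bool
	lfs_segwrite_selector(void *cl, struct vnode *vp)
	{
		struct inode *ip = VTOI(vp);

		return ip != NULL && ip->i_nlink > 0;
	}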

> 	The rbtree lookup can be assumed to be slightly slower than a hash
> 	table lookup and microbenchmarking shows this sometimes in the
> 	single CPU case, but as part of namei() it ends up being ameliorated
> 	somewhat by the reduction in locking costs, the absence of hash
> 	bucket collisions with rbtrees, and better L2/L3 cache use on hot
> 	directories.  I don't have numbers but I know joerg@ saw significant
> 	speedups on pbulk builds with an earlier version of this code with
> 	little changed from HEAD other than hash -> rbtree.  Anyway rbtree
> 	isn't set in stone as the data structure, it's just what we have in
> 	tree that's most suitable right now.  It can be changed.

Maybe we could use a critbit tree?  It will have cache utilization
characteristics similar to a red/black tree (fanout is 2 either way),
but the lookup can be pserialized and each step may be a little
cheaper computationally.

https://mumble.net/~campbell/hg/critbit/

Alternatively, there is rmind's thmap already in tree.
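
A rough idea of how that might look, using the thmap(9) interface as I
understand it -- the keying and ownership here are purely illustrative:

	#include <sys/thmap.h>

	/* e.g. one map per directory, keyed by component name */
	thmap_t *names = thmap_create(0, NULL, 0);

	/* cache_enter: publish the name -> vnode translation */
	thmap_put(names, name, namelen, vp);

	/* cache_lookup: lock-free on the reader side */
	struct vnode *found = thmap_get(names, name, namelen);

	/*
	 * cache_remove: deleted entries need thmap_stage_gc()/thmap_gc()
	 * before their memory can be reclaimed.
	 */
	thmap_del(names, name, namelen);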

