tech-kern archive


Please review: lookup changes



Hi,

I'd like to merge the changes on the ad-namecache branch, and would
appreciate any code review.  The motivation for these changes is, as
you might guess, performance.  I put an overview below.  I won't be
merging this for a couple of weeks in any case.

Thanks,
Andrew


Changes here, with a couple of whole files for ease of reading:

	http://www.netbsd.org/~ad/2020/namecache.diff
	http://www.netbsd.org/~ad/2020/vfs_lookup.c
	http://www.netbsd.org/~ad/2020/vfs_cache.c

vfs_cache.c:

	I changed the namecache to operate using a per-directory index of
	names rather than a global index, and changed the reclamation
	mechanism for cache entries to ape the VM system's CLOCK algorithm
	rather than using LRU / pseudo-LRU.  This made concurrency easier to
	manage, and the nasty locking scheme from 5.0 is gone.

	I also changed the namecache to cache permission information for
	directories, so that it can be used to do single component lookups
	fully in-cache for file systems that can work that way.

	This is described in the comments in the file.
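
	To illustrate the idea (a hedged sketch with made-up structure and
	helper names, not the actual vfs_cache.c code): each directory gets
	its own index of entries plus a cached copy of the directory's
	owner/group/mode, so a hit can answer both "which vnode" and "may
	this cred search here" without taking the vnode lock:

		/* Sketch only: hypothetical per-directory namecache state. */
		struct ncdir {
			krwlock_t	ncd_lock;	/* per-directory rwlock */
			rb_tree_t	ncd_tree;	/* entries keyed by name */
			mode_t		ncd_mode;	/* cached permission bits */
			uid_t		ncd_uid;	/* cached owner */
			gid_t		ncd_gid;	/* cached group */
		};

		struct ncentry {
			struct rb_node	nce_node;	/* node in ncd_tree */
			struct vnode	*nce_vp;	/* vnode the name maps to */
			char		*nce_name;	/* component name (the key) */
		};

		/* One component, fully in-cache: permission check from the
		 * cached bits, then an rbtree lookup, all under ncd_lock. */
		static int
		nc_lookup(struct ncdir *ncd, const char *name,
		    kauth_cred_t cred, struct vnode **vpp)
		{
			struct ncentry *nce;
			int error = 0;

			rw_enter(&ncd->ncd_lock, RW_READER);
			/* nc_access() is a stand-in for checking VEXEC
			 * against the cached mode/uid/gid. */
			if (nc_access(cred, ncd->ncd_mode, ncd->ncd_uid,
			    ncd->ncd_gid) != 0)
				error = EACCES;
			else if ((nce = rb_tree_find_node(&ncd->ncd_tree,
			    name)) == NULL)
				error = ENOENT;
			else
				*vpp = nce->nce_vp;
			rw_exit(&ncd->ncd_lock);
			return error;
		}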

vfs_getcwd.c:

	The design pattern here was to hold vnode locks as long as possible
	and clearly delineate the handoff from one lock to another.  It's a
	nice way of working, but not good for MP performance, and not
	needed for correctness.  I pushed the locks back as far as possible
	and now try to do the lookup in the namecache without taking vnode
	locks (they aren't needed just to look at the namecache).
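
	As a rough sketch of the direction (hypothetical helper names, not
	the code in the diff), the walk towards the root now stays in the
	namecache and only falls back to locking a vnode and asking the
	file system when a reverse mapping isn't cached:

		/* Sketch only: build the path for vp by walking to the root,
		 * prepending one component name per iteration. */
		while (vp != rootvnode) {
			struct vnode *dvp;
			int error;

			/* nc_revlookup() stands in for a namecache reverse
			 * lookup (child -> parent + name); it takes only
			 * namecache locks, no vnode locks. */
			error = nc_revlookup(vp, &dvp, &bp, bufp);
			if (error != 0) {
				/* Cache miss: only now lock vp and scan the
				 * parent directory in the file system. */
				error = slow_revlookup(vp, &dvp, &bp, bufp);
				if (error != 0)
					return error;
			}
			vp = dvp;
		}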

vfs_lookup.c:

	Like vfs_getcwd.c, the vnode locking is pushed back as far as possible.

	If the file system exports IMNT_NCLOOKUP, this now tries to "fast
	forward" through as many component names as possible by looking
	directly in the namecache, without taking vnode locks or vnode
	refs, and stops when the leaf is reached or when it hits something
	that can't be handled using the namecache.  In the most optimistic
	case a vnode reference is taken only at the root and the leaf.
	IMNT_NCLOOKUP is implemented right now only for tmpfs and ffs.
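
	Roughly, the fast forward loop is shaped like this (a sketch with
	hypothetical helper names, not the actual vfs_lookup.c code):

		/* Sketch only: consume as many components as possible using
		 * only the namecache; no vnode locks, no vnode refs. */
		for (;;) {
			/* cache_ff_lookup() stands in for an in-cache lookup
			 * of one component under dvp (including the cached
			 * permission check). */
			if (cache_ff_lookup(dvp, cnp, &vp) != 0)
				break;		/* miss: drop to VOP_LOOKUP() */
			if (cnp->cn_flags & ISLASTCN)
				break;		/* reached the leaf */
			if (vp->v_type != VDIR || vp->v_mountedhere != NULL)
				break;		/* can't descend in-cache */
			dvp = vp;
			next_component(cnp);	/* hypothetical: advance cnp */
		}
		/* Only now take references/locks on dvp and the leaf. */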

	Layered file systems can't work that way (names are cached below),
	and the permissions check won't work for anything unusual like ACLs,
	so there is also IMNT_SHRLOOKUP.  If the file system exports
	IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
	RENAME & friends) is tried first with an LK_SHARED lock on the
	directory.  For an FS that can do the whole lookup this way (e.g. 
	tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
	VOP_LOOKUP() if something tricky comes up, and the lookup gets
	retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
	metadata).
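
	The shared-then-exclusive dance is essentially (a sketch of the
	pattern, not the diff itself):

		/* "Read only" lookup: try with the directory locked shared. */
		vn_lock(dvp, LK_SHARED | LK_RETRY);
		error = VOP_LOOKUP(dvp, &vp, cnp);
		if (error == ENOLCK) {
			/* The FS hit something it can't do shared (e.g. ffs
			 * needing to scan directory metadata), so retry with
			 * the directory locked exclusive. */
			VOP_UNLOCK(dvp);
			vn_lock(dvp, LK_EXCLUSIVE | LK_RETRY);
			error = VOP_LOOKUP(dvp, &vp, cnp);
		}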

	I also added a LOCKSHARED flag to go with LOCKLEAF (get an LK_SHARED
	lock on the leaf instead of LK_EXCLUSIVE).  This matches FreeBSD.
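
	Usage-wise that just means a caller that only needs to read the
	leaf can ask namei() for a shared lock, e.g. (sketch):

		struct nameidata nd;
		int error;

		/* pb is a struct pathbuf built from the caller's path. */
		NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | LOCKSHARED, pb);
		error = namei(&nd);
		if (error == 0) {
			/* nd.ni_vp comes back locked LK_SHARED rather than
			 * LK_EXCLUSIVE; enough for e.g. VOP_ACCESS(). */
			vput(nd.ni_vp);
		}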

vfs_syscalls.c:

	Corrected vnode lock types: LK_SHARED ok for VOP_ACCESS(), but
	LK_EXCLUSIVE still needed for VOP_GETATTR() due to itimes handling.
	That's a shame and it would be nice to use a seqlock or something
	there eventually.
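
	In other words (a sketch of the pattern, not the diff itself):

		/* access()-style check: a shared vnode lock is enough. */
		vn_lock(vp, LK_SHARED | LK_RETRY);
		error = VOP_ACCESS(vp, VREAD, cred);
		VOP_UNLOCK(vp);

		/* stat()-style path: VOP_GETATTR() may still update the
		 * inode timestamps (itimes), so for now the lock stays
		 * exclusive here. */
		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
		error = VOP_GETATTR(vp, &va, cred);
		VOP_UNLOCK(vp);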

vfs_vnode.c:

	Namecache-related changes, and the change to not block new vnode
	references across VOP_INACTIVE() mentioned on tech-kern.  I don't
	think this blocking is needed, at least for well-behaved FSes like
	ffs & tmpfs.

Performance:

	During a kernel build on my test system it yields about a 15% drop
	in system time:

	HEAD            79.16s real  1602.97s user   419.54s system
	changes         77.26s real  1564.38s user   368.41s system

	An rbtree lookup can be expected to be slightly slower than a hash
	table lookup, and microbenchmarking sometimes shows this in the
	single CPU case, but as part of namei() it ends up being
	ameliorated somewhat by the reduction in locking costs, the absence
	of hash bucket collisions with rbtrees, and better L2/L3 cache use
	on hot directories.  I don't have numbers, but I know joerg@ saw
	significant speedups on pbulk builds with an earlier version of
	this code that changed little from HEAD other than hash -> rbtree.
	Anyway, rbtree isn't set in stone as the data structure; it's just
	what we have in tree that's most suitable right now.  It can be
	changed.

	That aside, here's the result of a little benchmark, with 1-12 cores
	in a single CPU package all doing access() on a single file name in
	a kernel compile directory:

		http://www.netbsd.org/~ad/graph.png

	The gap between the green and blue lines is the result of the
	changes: basically better cache usage, and the number of atomic ops
	per path name component dropping to 2, from 7 or 10 depending on
	the architecture.

	Red is what would happen if we were to use some kind of seqlock /
	RCU type thing instead of RW locks in the namecache (very doable I
	think but not as far as I want to go right now).

	The gap between red and yellow is, I reckon, mostly the vnode
	references taken at the root and the leaf, and the vnode lock taken
	on the leaf to cover VOP_ACCESS().  Oh, and v_interlock on the leaf
	vnode to cover vcache_tryvget() (that can probably be turned into a
	single atomic op eventually, but further thought is required).

