Re: Please review: lookup changes



On 3/5/20, Andrew Doran <ad%netbsd.org@localhost> wrote:
> vfs_cache.c:
>
>       I changed the namecache to operate using a per-directory index of
>       names rather than a global index, and changed the reclamation
>       mechanism for cache entries to ape the VM system's CLOCK algorithm
>       rather than using LRU / pseudo-LRU.  This made concurrency easier to
>       manage, and the nasty locking scheme from 5.0 is gone.
>

I don't think rb trees (or in fact anything non-hash) are a good idea for
this purpose. See way below.

I doubt aging namecache entries is worth the complexity. Note that said
entries can get removed as a side-effect of vnode recycling and it's
very likely this code does not really win anything in practice.

>       I also changed the namecache to cache permission information for
>       directories, so that it can be used to do single component lookups
>       fully in-cache for file systems that can work that way.
>

Low priority, but I think this comes with a bug. kauth() has support for
"listeners", and it so happens that, in the base system at least, nobody is
hooked up for vnode lookup. The key point is that normally the vnode is
passed in shared-locked, but fast path lookup avoids this, and consequently
if there is a real-world user of the feature somewhere, they may be
silently broken.

I had a similar problem with the MAC framework and devised a simple flag
which can be checked upfront to disable fast path lookup.
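
To illustrate the idea (all names made up, this is only a sketch, not the
actual MAC code): the fast path bails out to the locked slow path whenever
the flag is set, so registering a listener only has to set the flag and
correctness is preserved.

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag, set whenever somebody registers a listener/policy
 * that wants to see locked vnode lookups. */
static atomic_bool lookup_fastpath_disabled;

void
lookup_disable_fastpath(void)
{
	atomic_store_explicit(&lookup_fastpath_disabled, true,
	    memory_order_release);
}

/* Sketch: attempt the namecache-only walk only when nobody asked to
 * intercept lookups; otherwise the caller takes the locked slow path. */
bool
lookup_try_fastpath(void)
{
	if (atomic_load_explicit(&lookup_fastpath_disabled,
	    memory_order_acquire))
		return false;
	/* ... namecache-only component walk ... */
	return true;
}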

So this would have the side effect of removing the call in the first place.
genfs_can_access is rather pessimal for checking VEXEC. I wrote a dedicated
routine which only checks for VEXEC and starts with an upfront check for
(mode & 0111) == 0111, which happens to almost always be true.
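
A minimal sketch of what I mean (the routine name and the plain uid/gid
arguments are placeholders; the real thing would take credentials and
handle the full group list and the root override):

#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>

/*
 * VEXEC-only permission check.  On the vast majority of directories
 * (mode & 0111) == 0111, so the owner/group matching is skipped
 * entirely.  Root override and the group list are omitted here.
 */
int
dir_can_exec(mode_t mode, uid_t file_uid, gid_t file_gid,
    uid_t cr_uid, gid_t cr_gid)
{
	/* Fast path: everyone has search permission. */
	if ((mode & 0111) == 0111)
		return 0;

	if (cr_uid == file_uid)
		return (mode & S_IXUSR) ? 0 : EACCES;
	if (cr_gid == file_gid)
		return (mode & S_IXGRP) ? 0 : EACCES;
	return (mode & S_IXOTH) ? 0 : EACCES;
}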

>       This is described in the comments in the file.
>
> vfs_getcwd.c:
>
>       The design pattern here was to hold vnode locks as long as possible
>       and clearly delineate the handoff from one lock to another.  It's a
>       nice way of working but not good for MP performance, and not needed
>       for correctness.  I pushed the locks back as far as possible and try
>       to do the lookup in the namecache without taking vnode locks (not
>       needed to look at namecache).
>
> vfs_lookup.c:
>
>       Like vfs_getcwd.c the vnode locking is pushed back far as possible.
>
>       If the file system exports IMNT_NCLOOKUP, this now tries to "fast
>       forward" through as many component names as possible by looking
>       directly in the namecache, without taking vnode locks nor vnode
>       refs, and stops when the leaf is reached or there is something that
>       can't be handled using the namecache.  In the most optimistic case a
>       vnode reference is taken only at the root and the leaf.
>       IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.
>
>       Layered file systems can't work that way (names are cached below),
>       and the permissions check won't work for anything unusual like ACLs,
>       so there is also IMNT_SHRLOOKUP.  If the file system exports
>       IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
>       RENAME & friends) is tried first with an LK_SHARED lock on the
>       directory.  For an FS that can do the whole lookup this way (e.g.
>       tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
>       VOP_LOOKUP() if something tricky comes up, and the lookup gets
>       retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
>       metadata).
>
>       I also added a LOCKSHARED flag to go with LOCKLEAF (get an LK_SHARED
>       lock on the leaf instead of LK_EXCLUSIVE).  This matches FreeBSD.
>
> vfs_syscalls.c:
>
>       Corrected vnode lock types: LK_SHARED ok for VOP_ACCESS(), but
>       LK_EXCLUSIVE still needed for VOP_GETATTR() due to itimes handling.
>       That's a shame and it would be nice to use a seqlock or something
>       there eventually.
>

I'm considering doing something like this myself. A lookup doing this could
avoid even referencing the vnode in the first place in the common case.

The itimes situation is very broken in my opinion. Filesystems mark the
vnode as modified/accessed/whatever and the timestamp gets updated with a
nanosecond-precision timer several seconds later. That timer is expensive
to read and for most consumers probably does not even make any difference.
But rationalizing this is probably way beyond the scope of this work.

I believe it is always legal to just upgrade the lock in getattr if
necessary. Since itimes updates are typically not needed, this would mean
the common case gets away with shared locking. This would run into issues
in vput, for which see below.

Assuming this is fixed, fstat could also use a shared lock. I think this is
a simple change to make and most certainly worth doing.
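
Roughly what I have in mind, as a userspace sketch (the node type and the
itimes_dirty flag are stand-ins): take the shared lock, and only when a
timestamp write-back is actually pending drop it and relock exclusively.

#include <pthread.h>
#include <stdbool.h>

/* Stand-in node: an rwlock plus an "itimes pending" flag. */
struct xnode {
	pthread_rwlock_t lock;
	bool itimes_dirty;	/* marked accessed/modified, not yet flushed */
};

/*
 * getattr under a shared lock, upgrading (by drop + relock exclusive)
 * only when timestamps have to be written back.  In the common case
 * nothing is pending and the shared lock suffices.
 */
void
xnode_getattr(struct xnode *xp)
{
	pthread_rwlock_rdlock(&xp->lock);
	if (xp->itimes_dirty) {
		/* Rare path: need exclusive access to flush timestamps. */
		pthread_rwlock_unlock(&xp->lock);
		pthread_rwlock_wrlock(&xp->lock);
		if (xp->itimes_dirty) {		/* re-check after relock */
			/* ... flush itimes ... */
			xp->itimes_dirty = false;
		}
		/* ... copy attributes out ... */
		pthread_rwlock_unlock(&xp->lock);
		return;
	}
	/* Common case: read-only snapshot of the attributes. */
	/* ... copy attributes out ... */
	pthread_rwlock_unlock(&xp->lock);
}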

> vfs_vnode.c:
>
>       Namecache related changes, and the change to not block new vnode
>       references across VOP_INACTIVE() mentioned on tech-kern.  I don't
>       think this blocking is needed at least for well behaved FSes like
>       ffs & tmpfs.
>

I think this is very hairy and the entire thing is avoidable.

I believe I mentioned some other time that FreeBSD got VOP_NEED_INACTIVE.
You should be able to trivially implement an equivalent which would be
called if the vnode is only shared locked. Supporting an unlocked case
comes with extra woes which may not be worth it at this stage.

Since inactive processing is almost never needed, this would avoid a lot
of lock contention.

With all this in place you would also not run into the problematic state
in the fast path in the common case.
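
The shape of it, as a sketch (types and names are mine; it glosses over the
revival race between dropping the last reference and running inactive,
which is exactly the hairy part under discussion):

#include <stdatomic.h>
#include <stdbool.h>

struct xvnode;

struct xvnode_ops {
	/* Cheap question: does this vnode need inactive processing?
	 * ffs/tmpfs would answer true only for e.g. unlinked-but-open
	 * files or pending truncation. */
	bool (*need_inactive)(struct xvnode *);
	void (*inactive)(struct xvnode *);	/* wants the exclusive lock */
};

struct xvnode {
	atomic_uint usecount;
	const struct xvnode_ops *ops;
};

/* Sketch of the release path. */
void
xvrele(struct xvnode *vp)
{
	if (atomic_fetch_sub(&vp->usecount, 1) != 1)
		return;				/* not the last reference */

	if (!vp->ops->need_inactive(vp))
		return;				/* common case: nothing to do */

	/* Rare case: take the exclusive lock and run inactive. */
	/* ... lock, re-check usecount, ... */
	vp->ops->inactive(vp);
	/* ... unlock ... */
}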

This is a cheap cop-out; the real solution would encode the need for
inactive processing in the usecount as a flag, but that's a lot of work.
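
For completeness, that "real solution" is roughly this shape (sketch only):
steal a bit of the usecount word so a single atomic op tells the release
path both that it dropped the last reference and whether inactive
processing is required.

#include <stdatomic.h>
#include <stdbool.h>

#define VUC_NEED_INACTIVE	0x80000000u
#define VUC_COUNT_MASK		0x7fffffffu

struct xvnode2 {
	_Atomic unsigned int usecount;	/* flag | count */
};

/* Set by the FS when the vnode acquires state that requires inactive,
 * e.g. the last link went away while the file is still open. */
void
xvnode_mark_need_inactive(struct xvnode2 *vp)
{
	atomic_fetch_or(&vp->usecount, VUC_NEED_INACTIVE);
}

/* Returns true only when the caller dropped the last reference AND
 * inactive processing is needed; only then is the exclusive lock taken. */
bool
xvrele2(struct xvnode2 *vp)
{
	unsigned int old = atomic_fetch_sub(&vp->usecount, 1);

	return (old & VUC_COUNT_MASK) == 1 &&
	    (old & VUC_NEED_INACTIVE) != 0;
}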

> Performance:
>
>       During a kernel build on my test system it yields about a 15% drop
>       in system time:
>
>       HEAD            79.16s real  1602.97s user   419.54s system
>       changes         77.26s real  1564.38s user   368.41s system
>
>       The rbtree lookup can be assumed to be slightly slower than a hash
>       table lookup and microbenchmarking shows this sometimes in the
>       single CPU case, but as part of namei() it ends up being ameliorated
>       somewhat by the reduction in locking costs, the absence of hash
>       bucket collisions with rbtrees, and better L2/L3 cache use on hot
>       directories.  I don't have numbers but I know joerg@ saw significant
>       speedups on pbulk builds with an earlier version of this code with
>       little changed from HEAD other than hash -> rbtree.  Anyway rbtree
>       isn't set in stone as the data structure, it's just what we have in
>       tree that's most suitable right now.  It can be changed.
>

I strongly suspect the win observed with pbulk had to do with per-cpu
locks and not hashing itself. Note that at the time the vnode interlock and
the vm object lock were the same thing, and in this workload that lock is
under a lot of contention. Should the lock owner get preempted, anyone
doing a lookup of the affected vnode (e.g., libc) will be holding
the relevant per-cpu lock and will block on a turnstile. Whoever ends
up running on the affected cpu is likely to do a lookup of their own,
but the relevant per-cpu lock is taken, so they go off cpu as well. The
same thing happening on more than one cpu at a time could easily result
in a cascading failure, which I strongly suspect is precisely what happened.

That is, the win does not stem from rb trees but from finer-grained
locking which does not block other threads looking up something else on
the same cpu.

As mentioned earlier I think rb trees instead of a hash are pessimal here.

First, a little step back. The lookup starts with securing vnodes from
cwdinfo. This represents a de facto global serialisation point (times two,
since the work has to be reverted later). In FreeBSD I implemented an
equivalent with copy-on-write semantics. The easy take on it is that I take
an rwlock-equivalent and then grab a reference on the found struct.
This provides me with an implicit reference on the root and current working
directory vnodes. If the struct is unshared on fork, the aforementioned
serialisation point becomes localized to the process.

With safe memory reclamation for the object this can avoid locks in lookup
and instead do a cmpxchg-if-not-zero, which would close the difference for
single-threaded lookups.

I don't know what shape pserialize is in, but it may be good enough to
handle the real-world rate of reallocs of a struct like this.
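
A sketch of the scheme (names are mine, not the actual FreeBSD ones): the
path walk grabs a reference on an immutable snapshot struct, which
transitively holds references on the root and cwd vnodes, and with safe
memory reclamation that reference can be taken with a cmpxchg-if-not-zero
instead of a lock.

#include <stdatomic.h>
#include <stdbool.h>

struct xvnode;				/* opaque here */

/* Immutable snapshot of a process's root and current directory.  A
 * chdir/chroot publishes a brand new snapshot instead of mutating it. */
struct pwddesc {
	_Atomic unsigned int refcnt;
	struct xvnode *rootdir;
	struct xvnode *cwd;
};

/*
 * Lock-free reference acquire, assuming the pointer we loaded stays
 * valid for the duration (pserialize/epoch-style safe memory
 * reclamation).  Fails only when racing with the final free, in which
 * case the caller falls back to the locked path.
 */
bool
pwddesc_tryhold(struct pwddesc *pd)
{
	unsigned int old = atomic_load(&pd->refcnt);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&pd->refcnt, &old, old + 1))
			return true;	/* implicit ref on rootdir and cwd */
	}
	return false;
}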

Back to rb trees vs hash:

With the above problem out of the way: with rb trees the following 2
lookups bounce the rw lock tied to the root vnode:
"/bin/sh"
"/lib/libc.so"

In contrast, with a hash, and assuming they land in different buckets
(which they should) the 2 lookups can be completely non-contending.

Moreover, should a form of delayed memory reclamation get implemented for
this, hashes with collisions handled with linked lists are a very easy use
case.

In my tests even with lookups which share most path components, the
last one tends to be different. Using a hash means this typically
results in grabbing different locks for that case and consequently
fewer cache-line ping pongs.
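
To make that concrete, a toy per-chain-locked hash (purely illustrative,
not the current vfs_cache.c layout): hashing the (parent, name) pair means
"/bin/sh" and "/lib/libc.so" almost certainly take different bucket locks
even though both start at the root directory, whereas a per-directory tree
funnels both through the root's lock.

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NC_NBUCKETS	4096		/* power of two */

struct ncentry {
	struct ncentry *next;		/* hash chain */
	const void *parent;		/* directory vnode identity */
	char name[64];
	void *vnode;
};

/* One lock per bucket: lookups hashing to different chains never touch
 * the same lock, even when they share the parent directory. */
static struct {
	pthread_mutex_t lock;
	struct ncentry *chain;
} ncbuckets[NC_NBUCKETS];

static void
ncinit(void)				/* call once at startup */
{
	for (int i = 0; i < NC_NBUCKETS; i++)
		pthread_mutex_init(&ncbuckets[i].lock, NULL);
}

static uint32_t
nchash(const void *parent, const char *name)
{
	/* FNV-1a over the parent pointer bytes and the name. */
	uint32_t h = 2166136261u;
	const unsigned char *p = (const unsigned char *)&parent;

	for (size_t i = 0; i < sizeof(parent); i++)
		h = (h ^ p[i]) * 16777619u;
	for (; *name != '\0'; name++)
		h = (h ^ (unsigned char)*name) * 16777619u;
	return h & (NC_NBUCKETS - 1);
}

void *
nclookup(const void *parent, const char *name)
{
	uint32_t b = nchash(parent, name);
	void *result = NULL;

	pthread_mutex_lock(&ncbuckets[b].lock);
	for (struct ncentry *e = ncbuckets[b].chain; e != NULL; e = e->next) {
		if (e->parent == parent && strcmp(e->name, name) == 0) {
			result = e->vnode;
			break;
		}
	}
	pthread_mutex_unlock(&ncbuckets[b].lock);
	return result;
}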

>       That aside, here's the result of a little benchmark, with 1-12 cores
>       in a single CPU package all doing access() on a single file name in
>       a kernel compile directory:
>
>               http://www.netbsd.org/~ad/graph.png
>
>       The gap between green and blue lines is the result of the changes;
>       basically better cache usage and the number of atomic ops for each
>       path name component down to 2, from 7 or 10 depending on arch.
>
>       Red is what would happen if we were to use some kind of seqlock /
>       RCU type thing instead of RW locks in the namecache (very doable I
>       think but not as far as I want to go right now).
>

Was this generated by just not taking rw locks?

>       The gap between red and yellow is I reckon mostly vnode references
>       being taken at the root and the leaf, and a vnode lock taken on the
>       leaf to cover VOP_ACCESS().  Oh and v_interlock on the leaf vnode
>       to cover vcache_tryvget() (that can probably be turned into a single
>       atomic op eventually, but further thought required).
>

So in FreeBSD vnodes are still handled with an incarnation of lockmgr
(not to be confused with whatever you consider the original lockmgr) and
that still performs cmpxchg-based loops to read lock/unlock.

This happens to have significantly worse degradation under load than locks
which xadd. Conversion from said loops to xadd is not necessarily trivial
(there are nasty decisions to make in terms of what to do about finding
out whether the lock owner is running), but the point is that for this case
there is performance left on the table.
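
To spell out the cmpxchg-loop vs xadd distinction with a toy read-lock
acquire (writers, waiters and fairness all ignored): under heavy read
traffic the cmpxchg version keeps retrying because the value it observed
goes stale, while the xadd version always completes in one atomic op.

#include <stdatomic.h>

/* Toy reader counter; writer handling is omitted entirely. */
struct toy_rwlock {
	_Atomic long readers;
};

/* cmpxchg-style read lock: under contention the loaded value keeps
 * going stale and the loop retries, so throughput degrades. */
void
read_lock_cmpxchg(struct toy_rwlock *l)
{
	long old = atomic_load(&l->readers);

	while (!atomic_compare_exchange_weak(&l->readers, &old, old + 1))
		;	/* old was refreshed by the failed CAS; retry */
}

/* xadd-style read lock: one unconditional fetch-add, no retry loop.
 * (A real lock then has to cope with having bumped the count while a
 * writer holds it, which is where the nasty decisions come in.) */
void
read_lock_xadd(struct toy_rwlock *l)
{
	atomic_fetch_add(&l->readers, 1);
}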

Your rw locks suffer from the same problem. On top of that, because of the
inactive issue, chances are the result got artificially worsened by waiting
on upgrades. It would be interesting to see the outcome if the file was
kept open by someone just to keep usecount > 0 for the duration.

So all in all, I think this is a great step forward, but:
- adding VOP_NEED_INACTIVE or an equivalent would enable more shared
locking (and most notably would reduce lock upgrades). It's not hard to
do and will likely come with an immediate win.
- I would seriously consider sticking to hash tables.

-- 
Mateusz Guzik <mjguzik gmail.com>


