tech-kern archive


Re: Please review: lookup changes



On Sat, Mar 07, 2020 at 12:14:05AM +0100, Mateusz Guzik wrote:

> On 3/5/20, Andrew Doran <ad%netbsd.org@localhost> wrote:
> > vfs_cache.c:
> >
> >       I changed the namecache to operate using a per-directory index of
> >       names rather than a global index, and changed the reclamation
> >       mechanism for cache entries to ape the VM system's CLOCK algorithm
> >       rather than using LRU / pseudo-LRU.  This made concurrency easier to
> >       manage, and the nasty locking scheme from 5.0 is gone.
> >
> 
> I don't think rb trees (or in fact anything non-hash) is a good idea for
> this purpose. See way below.

Right, I've admitted to the fact that rbtree is not ideal here.  But it is
"good enough", and I'm happy for someone to replace it with some other data
structure later on.
 
> I doubt aging namecache entries is worth the complexity.  Very likely this
> code does not really win anything in practice.

If there were a 1:1 relationship I would agree, but due to links (including
dot and dotdot) it's N:M, and I don't fancy changing that behaviour too much
right now.  Definitely something to consider in the future.
 
> >       I also changed the namecache to cache permission information for
> >       directories, so that it can be used to do single component lookups
> >       fully in-cache for file systems that can work that way.
> >
> 
> Low priority, but I think this comes with a bug. kauth() has this support
> for "listeners" and it so happens that, in the base system at least, there
> is nobody hooked for vnode lookup. Key point being that normally the vnode
> is passed in shared-locked, but fast path lookup avoids this and
> consequently, if there is a real-world user of the feature somewhere, they
> may be silently broken.

Good point.  Yes that aspect of kauth is a bit of a mess.  Plus it adds a
lot of CPU overhead.  It could do with its own solution for concurrency
that's straightforward, easy to maintain and doesn't have too many tentacles
that depend on concurrency rules in the various subsystems.

> I had a similar problem with the MAC framework and devised a simple flag
> which can be checked upfront to disable fast path lookup.
> 
> So this would have a side effect of removing the call in the first place.
> genfs_can_access is rather pessimal for checking VEXEC. I wrote a dedicated
> routine which only checks for VEXEC and starts with a check for
> (mode & 0111) == 0111 upfront. It happens to almost always be true.
> 
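
For illustration, a minimal sketch of such a VEXEC-only check (hypothetical
helper, not the routine referred to above; superuser and ACL handling
deliberately omitted):

#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>

/*
 * Hypothetical VEXEC-only permission check.  The common case is that all
 * three execute bits are set, so the owner/group/other walk can be
 * skipped entirely.
 */
static int
check_exec_fast(mode_t mode, uid_t file_uid, gid_t file_gid,
    uid_t cr_uid, const gid_t *cr_groups, int cr_ngroups)
{
	int i;

	if ((mode & 0111) == 0111)
		return 0;			/* everyone may search/exec */

	if (cr_uid == file_uid)
		return (mode & S_IXUSR) != 0 ? 0 : EACCES;

	for (i = 0; i < cr_ngroups; i++) {
		if (cr_groups[i] == file_gid)
			return (mode & S_IXGRP) != 0 ? 0 : EACCES;
	}

	return (mode & S_IXOTH) != 0 ? 0 : EACCES;
}
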
> >       This is described in the comments in the file.
> >
> > vfs_getcwd.c:
> >
> >       The design pattern here was to hold vnode locks as long as possible
> >       and clearly delineate the handoff from one lock to another.  It's a
> >       nice way of working but not good for MP performance, and not needed
> >       for correctness.  I pushed the locks back as far as possible and try
> >       to do the lookup in the namecache without taking vnode locks (not
> >       needed to look at namecache).
> >
> > vfs_lookup.c:
> >
> >       Like vfs_getcwd.c, the vnode locking is pushed back as far as possible.
> >
> >       If the file system exports IMNT_NCLOOKUP, this now tries to "fast
> >       forward" through as many component names as possible by looking
> >       directly in the namecache, without taking vnode locks nor vnode
> >       refs, and stops when the leaf is reached or there is something that
> >       can't be handled using the namecache.  In the most optimistic case a
> >       vnode reference is taken only at the root and the leaf.
> >       IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.
> >
> >       Layered file systems can't work that way (names are cached below),
> >       and the permissions check won't work for anything unusual like ACLs,
> >       so there is also IMNT_SHRLOOKUP.  If the file system exports
> >       IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
> >       RENAME & friends) is tried first with an LK_SHARED lock on the
> >       directory.  For an FS that can do the whole lookup this way (e.g.
> >       tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
> >       VOP_LOOKUP() if something tricky comes up, and the lookup gets
> >       retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
> >       metadata).
> >
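
To make the retry protocol concrete, a structural sketch of how a caller
might drive an IMNT_SHRLOOKUP file system (kernel context assumed; reference
counting and exact signatures glossed over - this is not the actual namei
code):

#include <sys/param.h>
#include <sys/mount.h>
#include <sys/namei.h>
#include <sys/vnode.h>

/*
 * Sketch only: try the lookup shared if the FS advertises IMNT_SHRLOOKUP
 * and the lookup is "read only"; fall back to an exclusive lock if the FS
 * returns ENOLCK (e.g. ffs needing to scan directory metadata).
 */
static int
lookup_once_sketch(struct vnode *dvp, struct vnode **vpp,
    struct componentname *cnp, bool read_only_lookup)
{
	int lktype, error;

	lktype = LK_EXCLUSIVE;
	if ((dvp->v_mount->mnt_iflag & IMNT_SHRLOOKUP) != 0 &&
	    read_only_lookup)
		lktype = LK_SHARED;

	vn_lock(dvp, lktype | LK_RETRY);
	error = VOP_LOOKUP(dvp, vpp, cnp);
	if (error == ENOLCK && lktype == LK_SHARED) {
		/* Something tricky came up: retry with LK_EXCLUSIVE. */
		VOP_UNLOCK(dvp);
		vn_lock(dvp, LK_EXCLUSIVE | LK_RETRY);
		error = VOP_LOOKUP(dvp, vpp, cnp);
	}
	VOP_UNLOCK(dvp);
	return error;
}
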
> >       I also added a LOCKSHARED flag to go with LOCKLEAF (get an LK_SHARED
> >       lock on the leaf instead of LK_EXCLUSIVE).  This matches FreeBSD.
> >
> > vfs_syscalls.c:
> >
> >       Corrected vnode lock types: LK_SHARED ok for VOP_ACCESS(), but
> >       LK_EXCLUSIVE still needed for VOP_GETATTR() due to itimes handling.
> >       That's a shame and it would be nice to use a seqlock or something
> >       there eventually.
> >
> 
> I'm considering doing something like this myself. Lookup doing this could
> avoid even referencing the vnode in the first place in the common case.
> 
> The itimes situation is very broken in my opinion. Filesystems mark the
> vnode as modified/accessed/whatever and the timestamp gets updated using a
> nanosecond-precision timer several seconds later. This is expensive to
> acquire and for most consumers probably does not even make any difference.
> But rationalizing this is probably way beyond the scope of this work.

Agree, it could definitely use reconsideration.  As far as I understand, the
deferral is there to avoid the cost of microtime(), not the inode updates
themselves.
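
As an aside, the kind of seqlock that could eventually let timestamps be read
without the exclusive vnode lock might look roughly like this (generic C11
sketch, not tied to any existing NetBSD primitive; the writer is assumed to
be serialized by some other lock):

#include <stdatomic.h>
#include <stdint.h>

struct ts_seq {
	atomic_uint	  seq;		/* odd while an update is in flight */
	_Atomic uint64_t  sec;
	_Atomic uint32_t  nsec;
};

static void
ts_write(struct ts_seq *t, uint64_t sec, uint32_t nsec)
{
	unsigned s = atomic_load_explicit(&t->seq, memory_order_relaxed);

	atomic_store_explicit(&t->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&t->sec, sec, memory_order_relaxed);
	atomic_store_explicit(&t->nsec, nsec, memory_order_relaxed);
	atomic_store_explicit(&t->seq, s + 2, memory_order_release);
}

static void
ts_read(struct ts_seq *t, uint64_t *secp, uint32_t *nsecp)
{
	unsigned s1, s2;

	do {
		s1 = atomic_load_explicit(&t->seq, memory_order_acquire);
		*secp = atomic_load_explicit(&t->sec, memory_order_relaxed);
		*nsecp = atomic_load_explicit(&t->nsec, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&t->seq, memory_order_relaxed);
	} while (s1 != s2 || (s1 & 1) != 0);
}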

> I believe it is always legal to just upgrade the lock in getattr if
> necessary. Since itimes updates are typically not needed, this would mean
> common case would get away with shared locking. This would run into issues
> in vput, for which see below.

Hmm.  With NetBSD upgrading a lock in the middle of a VOP could have
consequences for fstrans.  I'm not sure.
 
> Assuming this is fixed, fstat could also shared lock. I think this is a
> simple change to make and most certainly worth doing.
> 
> > vfs_vnode.c:
> >
> >       Namecache related changes, and the change to not block new vnode
> >       references across VOP_INACTIVE() mentioned on tech-kern.  I don't
> >       think this blocking is needed at least for well behaved FSes like
> >       ffs & tmpfs.
> >
> 
> I think this is very hairy and the entire thing is avoidable.
> 
> I believe I mentioned some other time that FreeBSD got VOP_NEED_INACTIVE.
> You should be able to trivially implement an equivalent which would be
> called if the vnode is only shared locked. Supporting an unlocked case
> comes with extra woes which may not be worth it at this stage.
> 
> Since inactive processing is almost never needed, this would avoid a lot
> of lock contention.
> 
> With all this in place you would also not run into the problematic state
> in the fast path in the common case.
> 
> This is a cheap cop-out; the real solution would be to encode the need for
> inactive in the usecount as a flag, but that's a lot of work.

Two separate issues - I agree on avoiding VOP_INACTIVE(), but the issue here
is peculiar to NetBSD I think.  NetBSD sets the equivalent of the 4.4BSD
VXLOCK flag to block vget() across VOP_INACTIVE().  It should only be needed
for revoke/clean.  At one point we did have VXLOCK embedded in v_usecount,
which allowed for vget() with one atomic op, but that was undone later; I
don't know why.
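
For what it's worth, the single-atomic-op style of vget() being referred to
can be sketched with C11 atomics along these lines (hypothetical bit layout,
not the historical NetBSD code):

#include <stdatomic.h>
#include <stdbool.h>

#define VC_XLOCK	0x80000000u	/* vnode being revoked/cleaned */
#define VC_MASK		0x7fffffffu	/* reference count proper */

/*
 * Try to take a new reference with one CAS.  Fails if the count is zero
 * (the vnode may be headed for VOP_INACTIVE/reclaim) or if the XLOCK bit
 * is set; the caller then falls back to the locked slow path.
 */
static bool
vtryget_sketch(atomic_uint *usecount)
{
	unsigned old = atomic_load_explicit(usecount, memory_order_relaxed);

	for (;;) {
		if ((old & VC_XLOCK) != 0 || (old & VC_MASK) == 0)
			return false;
		if (atomic_compare_exchange_weak_explicit(usecount, &old,
		    old + 1, memory_order_acquire, memory_order_relaxed))
			return true;
		/* 'old' was refreshed by the failed CAS; loop and retry. */
	}
}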

> > Performance:
> >
> >       During a kernel build on my test system it yields about a 15% drop
> >       in system time:
> >
> >       HEAD            79.16s real  1602.97s user   419.54s system
> >       changes         77.26s real  1564.38s user   368.41s system
> >
> >       The rbtree lookup can be assumed to be slightly slower than a hash
> >       table lookup and microbenchmarking shows this sometimes in the
> >       single CPU case, but as part of namei() it ends up being ameliorated
> >       somewhat by the reduction in locking costs, the absence of hash
> >       bucket collisions with rbtrees, and better L2/L3 cache use on hot
> >       directories.  I don't have numbers but I know joerg@ saw significant
> >       speedups on pbulk builds with an earlier version of this code with
> >       little changed from HEAD other than hash -> rbtree.  Anyway rbtree
> >       isn't set in stone as the data structure, it's just what we have in
> >       tree that's most suitable right now.  It can be changed.
> >
> 
> I strongly suspect the win observed with pbulk had to do with per-cpu
> locks and not hashing itself. Note that at the time vnode interlock and
> vm object lock were the same thing and in this workload the lock is
> under a lot of contention. Should the lock owner get preempted, anyone
> doing a lookup to the affected vnode (e.g., libc) will be holding
> the relevant per-cpu lock and will block on a turnstile. Whoever ends
> up running on the affected cpu is likely to do a lookup on their own,
> but the relevant per-cpu lock is taken, so they go off cpu as well. The
> same thing happening on more than one cpu at a time could easily result in
> a cascading failure, which I strongly suspect is precisely what happened.
> 
> That is, the win does not stem from rb trees but finer-grained locking
> which does not block other threads which look up something else on
> the same cpu.

Not on NetBSD.  Kernel preemption is possible and allowed (mainly for real
time applications), but happens infrequently during normal operation.  There
are a number of pieces of code that take advantage of that fact and are
"optimistically per-CPU", and they work very well as preemption is rarely
observed.  Further, the blocking case on v_interlock in cache_lookup() is
rare.  That's not to say it doesn't happen - it does - but I don't think it's
enough to explain the performance differences.

> As mentioned earlier I think rb trees instead of a hash are pessimal here.
> 
> First, a little step back. The lookup starts with securing vnodes from
> cwdinfo. This represents a de facto global serialisation point (times two
> since the work has to be reverted later). In FreeBSD I implemented an
> equivalent with copy-on-write semantics. Easy take on it is that I take
> an rwlock-equivalent and then grab a reference on the found struct.
> This provides me with an implicit reference on root and current working
> directory vnodes. If the struct is unshared on fork, aforementioned
> serialisation point becomes localized to the process.

Interesting.  Sounds somewhat like what both NetBSD and FreeBSD do for
process credentials.
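
If it helps to picture it, that scheme might be shaped roughly like this (all
names hypothetical; the unshare-on-fork path and any pserialize/SMR
integration omitted):

#include <sys/atomic.h>
#include <sys/rwlock.h>
#include <sys/vnode.h>

/*
 * Hypothetical copy-on-write "cwd snapshot": one reference on the struct
 * stands in for references on both the root and cwd vnodes, so a lookup
 * never touches their usecounts directly.
 */
struct cwdsnap {
	struct vnode	*cs_root;	/* held vnode reference */
	struct vnode	*cs_cwd;	/* held vnode reference */
	unsigned int	 cs_refcnt;
};

struct cwdsource {
	krwlock_t	 cws_lock;	/* writers replace cws_cur */
	struct cwdsnap	*cws_cur;
};

static struct cwdsnap *
cwdsnap_acquire(struct cwdsource *cws)
{
	struct cwdsnap *cs;

	rw_enter(&cws->cws_lock, RW_READER);
	cs = cws->cws_cur;
	atomic_inc_uint(&cs->cs_refcnt);
	rw_exit(&cws->cws_lock);
	return cs;		/* implicit refs on cs_root and cs_cwd */
}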
 
> With safe memory reclamation for the object, this can avoid locks in lookup
> and instead do a cmpxchg-if-not-zero, which would close the difference for
> single-threaded lookups.
> 
> I don't know what shape pserialize is in, but it may be good enough to
> handle the real-world rate of reallocs of a struct like this.
> 
> Back to rb trees vs hash:
> 
> With the above problem out of the way, with rb trees the following 2
> lookups bounce the rw lock tied to root vnode:
> "/bin/sh"
> "/lib/libc.so"
> 
> In contrast, with a hash, and assuming they land in different buckets
> (which they should) the 2 lookups can be completely non-contending.
> 
> Moreover, should a form of delayed memory reclamation get implemented for
> this, hashes with collisions handled with linked lists are a very easy use
> case.

Agree.

> In my tests even with lookups which share most path components, the
> last one tends to be different. Using a hash means this typically
> results in grabbing different locks for that case and consequently
> fewer cache-line ping pongs.

My experience has been different.  What I've observed is that shared hash
tables usually generate huge cache pressure unless they're small and rarely
updated.  If the hash were small, matching the access pattern (e.g.
per-dir), then I think it would have the opportunity to make maximum use of
the cache.  That could be a great win and certainly better than rbtree.

> >       That aside, here's the result of a little benchmark, with 1-12 cores
> >       in a single CPU package all doing access() on a single file name in
> >       a kernel compile directory:
> >
> >               http://www.netbsd.org/~ad/graph.png
> >
> >       The gap between green and blue lines is the result of the changes;
> >       basically better cache usage and the number of atomic ops for each
> >       path name component down to 2, from 7 or 10 depending on arch.
> >
> >       Red is what would happen if we were to use some kind of seqlock /
> >       RCU type thing instead of RW locks in the namecache (very doable I
> >       think but not as far as I want to go right now).
> >
> 
> Was this generated by just not taking rw locks?

Yes I think it was just a case of disabling those per-directory locks.
 
> >       The gap between red and yellow is I reckon mostly vnode references
> >       being taken at the root and the leaf, and a vnode lock taken on the
> >       leaf to cover VOP_ACCESS().  Oh and v_interlock on the leaf vnode
> >       to cover vcache_tryvget() (that can probably be turned into a single
> >       atomic op eventually, but further thought required).
> >
> 
> So in FreeBSD vnodes are still handled with an incarnation of lockmgr
> (not to be confused with whatever you consider the original lockmgr) and
> that still performs cmpxchg-based loops to read lock/unlock.
> 
> This happens to have significantly worse degradation under load than locks
> which xadd. Conversion from said loops to xadd is not necessarily trivial
> (there are nasty decisions to make in terms of what to do with finding
> out whether the lock owner is running) but the point is that for this case
> there is performance left on the table.

Interesting, I had not heard that before.  Will research.
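
For context, the contrast being described is roughly the following (generic
C11 sketch, not the FreeBSD or NetBSD lock code).  The CAS loop may spin
retrying under contention, while the xadd variant completes every attempt in
a single atomic op:

#include <stdatomic.h>
#include <stdbool.h>

#define WRITER_BIT	0x80000000u

/* cmpxchg-loop shared acquire: contenders re-read and retry the CAS. */
static bool
shared_lock_cmpxchg(atomic_uint *lk)
{
	unsigned old = atomic_load_explicit(lk, memory_order_relaxed);

	for (;;) {
		if ((old & WRITER_BIT) != 0)
			return false;		/* fall back to slow path */
		if (atomic_compare_exchange_weak_explicit(lk, &old, old + 1,
		    memory_order_acquire, memory_order_relaxed))
			return true;
	}
}

/*
 * xadd-style shared acquire: one unconditional fetch_add, backed out if a
 * writer was present.
 */
static bool
shared_lock_xadd(atomic_uint *lk)
{
	unsigned old = atomic_fetch_add_explicit(lk, 1, memory_order_acquire);

	if ((old & WRITER_BIT) != 0) {
		atomic_fetch_sub_explicit(lk, 1, memory_order_relaxed);
		return false;			/* fall back to slow path */
	}
	return true;
}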

Thanks,
Andrew
 
> Your rw locks suffer the same problem. On top of that, because of the
> inactive issue, chances are the result got artificially worsened waiting
> on upgrades. Would be interesting to see the outcome if the file was kept
> open by someone just to have usecount > 0 for the duration.
> 
> So all in all, I think this is a great step forward, but:
> - adding VOP_NEED_INACTIVE or an equivalent would enable more shared
> locking (and most notably would reduce lock upgrades). It's not hard to
> do and will likely come with an immediate win.
> - I would seriously consider sticking to hash tables.
> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>

