tech-kern archive


Re: Please review: lookup changes



Something ate a huge chunk of my e-mail, resending

On 3/8/20, Andrew Doran <ad%netbsd.org@localhost> wrote:
> On Sat, Mar 07, 2020 at 12:14:05AM +0100, Mateusz Guzik wrote:
>> I believe it is always legal to just upgrade the lock in getattr if
>> necessary. Since itimes updates are typically not needed, this would mean
>> common case would get away with shared locking. This would run into
>> issues
>> in vput, for which see below.
>
> Hmm.  With NetBSD upgrading a lock in the middle of a VOP could have
> consequences for fstrans.  I'm not sure.
>

I don't know how you do upgrades specifically. "Mere" upgrades can of
course fail in the presence of additional readers. In FreeBSD there is an
option to release the lock entirely and relock from scratch, in particular
meaning you can undo the fstrans enter (if necessary). While I don't know
the specifics of your VFS layer, the v_usecount which is held should
postpone freeing of the vnode. Should it fall victim to a forced unmount or
some other state change, you can just return an error, pretending getattr
never got this far.

>> Assuming this is fixed fstat could also shared lock. I think this is a
>> simple change to make and most certainly worth doing.
>>
>> > vfs_vnode.c:
>> >
>> >       Namecache related changes, and the change to not block new vnode
>> >       references across VOP_INACTIVE() mentioned on tech-kern.  I don't
>> >       think this blocking is needed at least for well behaved FSes like
>> >       ffs & tmpfs.
>> >
>>
>> I think this is very hairy and the entire thing is avoidable.
>>
>> I believe I mentioned some other time that FreeBSD got VOP_NEED_INACTIVE.
>> You should be able to trivially implement an equivalent which would be
>> called if the vnode is only shared locked. Supporting an unlocked case
>> comes with extra woes which may not be worth it at this stage.
>>
>> Since inactive processing is almost never needed this would avoid a lot
>> of lock contention.
>>
>> With all this in place you would also not run into the problematic state
>> in the fast path in the common case.
>>
>> This is a cheap cop out, the real solution would encode need for inactive
>> in usecount as a flag but that's a lot of work.
>
> Two separate issues - I agree on avoiding VOP_INACTIVE(), but the issue
> here is peculiar to NetBSD I think.  NetBSD sets the equivalent of the 4.4BSD
> VXLOCK flag to block vget() across VOP_INACTIVE().  Should only be needed
> for revoke/clean.  At one point we did have VXLOCK embedded in v_usecount
> which allowed for vget() with one atomic op, but that was undone at some
> point, I don't know why.
>

If you have a dedicated spot which is guaranteed to be executed for
exclusion against vget, then things become easier in terms of the effort
required. A dedicated vget variant for the namecache can xadd into the
counter and, if the bit is set, backpedal from it. New arrivals should be
eliminated by purging nc entries beforehand. Everyone else can
cmpxchg-loop with the interlock like right now.
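A minimal userspace sketch of such a vget variant, assuming the flag lives
in the high bit of the use count (VN_GATED and all names here are
hypothetical, not existing NetBSD or FreeBSD identifiers):

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/*
 * Toy model of encoding a VXLOCK-style "no new references" flag in
 * the use count: the namecache vget variant optimistically gains a
 * reference with a single atomic op and backpedals if the flag bit
 * turned out to be set.
 */
#define VN_GATED	0x80000000u	/* reclaim/inactive in progress */

struct toy_vnode {
	atomic_uint usecount;
};

static int
toy_vget_namecache(struct toy_vnode *vp)
{
	unsigned old;

	/* One xadd in the common case. */
	old = atomic_fetch_add(&vp->usecount, 1);
	if (old & VN_GATED) {
		/* The flag was set: undo the reference and bail. */
		atomic_fetch_sub(&vp->usecount, 1);
		return ENOENT;
	}
	return 0;
}
```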

>> > Performance:
>> >
>> >       During a kernel build on my test system it yields about a 15%
>> > drop
>> >       in system time:
>> >
>> >       HEAD            79.16s real  1602.97s user   419.54s system
>> >       changes         77.26s real  1564.38s user   368.41s system
>> >
>> >       The rbtree lookup can be assumed to be slightly slower than a
>> > hash
>> >       table lookup and microbenchmarking shows this sometimes in the
>> >       single CPU case, but as part of namei() it ends up being
>> > ameliorated
>> >       somewhat by the reduction in locking costs, the absence of hash
>> >       bucket collisions with rbtrees, and better L2/L3 cache use on hot
>> >       directories.  I don't have numbers but I know joerg@ saw
>> > significant
>> >       speedups on pbulk builds with an earlier version of this code
>> > with
>> >       little changed from HEAD other than hash -> rbtree.  Anyway
>> > rbtree
>> >       isn't set in stone as the data structure, it's just what we have
>> > in
>> >       tree that's most suitable right now.  It can be changed.
>> >
>>
>> I strongly suspect the win observed with pbulk had to do with per-cpu
>> locks and not hashing itself. Note that at the time the vnode interlock
>> and vm object lock were the same thing and in this workload the lock is
>> under a lot of contention. Should the lock owner get preempted, anyone
>> doing a lookup on the affected vnode (e.g., libc) will be holding
>> the relevant per-cpu lock and will block on a turnstile. Whoever ends
>> up running on the affected cpu is likely to do a lookup of their own,
>> but the relevant per-cpu lock is taken, so they go off cpu as well. The
>> same thing happening on more than one cpu at a time could easily result
>> in a cascading failure, which I strongly suspect is precisely what
>> happened.
>>
>> That is, the win does not stem from rb trees but finer-grained locking
>> which does not block other threads which look up something else on
>> the same cpu.
>
> Not on NetBSD.  Kernel preemption is possible and allowed (mainly for real
> time applications), but happens infrequently during normal operation.  There
> are a number of pieces of code that take advantage of that fact and are
> "optimistically per-CPU", and they work very well as preemption is rarely
> observed.  Further the blocking case on v_interlock in cache_lookup() is
> rare.  That's not to say it doesn't happen, it does, but I don't think it's
> enough to explain the performance differences.
>

I noted that I suspected preemption was occurring because of contention on
the vm side. It should only take one preemption to start the cascade.

>> First, a little step back. The lookup starts with securing vnodes from
>> cwdinfo. This represents a de facto global serialisation point (times two
>> since the work has to be reverted later). In FreeBSD I implemented an
>> equivalent with copy-on-write semantics. Easy take on it is that I take
>> an rwlock-equivalent and then grab a reference on the found struct.
>> This provides me with an implicit reference on root and current working
>> directory vnodes. If the struct is unshared on fork, aforementioned
>> serialisation point becomes localized to the process.
>
> Interesting.  Sounds somewhat like both NetBSD and FreeBSD do for process
> credentials.
>

I was thinking about doing precisely that, but I found it iffy to have
permanently stored references per-thread. With the proposal they get
"gained" around the actual lookup; otherwise this is very similar to
what mountcheckdirs is dealing with right now.
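A rough userspace sketch of the copy-on-write scheme described above, with
made-up names; the refs field on toy_cwdinfo stands in for the implicit
references on the root and working directory vnodes:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/*
 * Toy model of copy-on-write cwdinfo: a lookup takes one reference
 * on the current snapshot, which implicitly covers the root and cwd
 * vnodes for the duration of the lookup.  If the struct is unshared
 * on fork, the lock below becomes effectively per-process.
 */
struct toy_cwdinfo {
	atomic_uint refs;
	void *cdir;		/* stand-in for the cwd vnode */
	void *rdir;		/* stand-in for the root vnode */
};

struct toy_proc {
	pthread_mutex_t lock;	/* rwlock-equivalent in the real thing */
	struct toy_cwdinfo *cwdi;
};

/* Gain a reference around the actual lookup. */
static struct toy_cwdinfo *
toy_cwdinfo_get(struct toy_proc *p)
{
	struct toy_cwdinfo *ci;

	pthread_mutex_lock(&p->lock);
	ci = p->cwdi;
	atomic_fetch_add(&ci->refs, 1);
	pthread_mutex_unlock(&p->lock);
	return ci;
}

static void
toy_cwdinfo_put(struct toy_cwdinfo *ci)
{
	/* Last reference would also drop the vnode references. */
	if (atomic_fetch_sub(&ci->refs, 1) == 1)
		free(ci);
}
```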

>> In my tests even with lookups which share most path components, the
>> last one tends to be different. Using a hash means this typically
>> results in grabbing different locks for that case and consequently
>> fewer cache-line ping pongs.
>
> My experience has been different.  What I've observed is that shared hash
> tables usually generate huge cache pressure unless they're small and rarely
> updated.  If the hash were small, matching the access pattern (e.g.
> per-dir) then I think it would have the opportunity to make maximum use of
> the cache.  That could be a great win and certainly better than rbtree.
>

Well, in my tests this is all heavily dominated by SMP effects, which I
expect to be exacerbated by having just one lock.

A side note: I had a look at your vput. The pre-read + VOP_UNLOCK + actual
loop to drop the ref definitely slows things down, if only a little, as
this can force a shared cacheline transition from under someone
cmpxchg-ing.
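For illustration, the two reference-drop shapes look roughly like this in a
userspace toy (stdatomic stand-ins, not the actual vput; the real code also
deals with the vnode lock and interlock):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Toy illustration of two ways to drop a reference.  The
 * read-then-cmpxchg loop first pulls the cache line in shared state
 * with a plain load, then forces it exclusive for the cmpxchg; a
 * single fetch_sub takes the line exclusive once.  Semantics are
 * the same, only the cache traffic differs.
 */
static bool
drop_ref_cas(atomic_uint *cnt)
{
	unsigned old = atomic_load(cnt);	/* shared transition */

	for (;;) {
		/* Exclusive transition; 'old' is refreshed on failure. */
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return old == 1;	/* true on last reference */
	}
}

static bool
drop_ref_sub(atomic_uint *cnt)
{
	/* One atomic op, one exclusive transition. */
	return atomic_fetch_sub(cnt, 1) == 1;
}
```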

That said, can you generate a flamegraph from a fully patched kernel?
Curious where the time is spent now, my bet is spinning on vnode locks.

-- 
Mateusz Guzik <mjguzik gmail.com>


