tech-kern archive


Re: Adding pool_cache_invalidate_local() to pool_cache(9) API



On Fri, 9 Oct 2009 00:28:56 -0700, Jason Thorpe 
<thorpej%shagadelic.org@localhost>
wrote:
> On Oct 8, 2009, at 11:52 PM, Jean-Yves Migeon wrote:
> Consider the case of an mbuf.  It is allocated from a per-CPU pcg on
> CPU-A, and subsequently bounces around the networking stack, through
> socket buffers, etc. and is finally freed.  But the processor that
> frees it could be CPU-B (for any number of legitimate reasons), thus it
> goes into one of CPU-B's per-CPU pcgs.

In this case, the mbuf is used by CPU-B. You will not find it inside
CPU-A's pool cache, so pool_cache_invalidate_local() invoked on CPU-A
will not touch it: CPU-A's per-CPU cache does not hold any reference to
it.

The same situation arises if you pool_cache_destroy() the pool cache
from CPU-B. Except that, in this case, CPU-B _will_ destroy the objects
cached on behalf of CPU-A, while pool_cache_invalidate_local() won't.
It is the caller's responsibility to ensure that such a situation never
occurs.
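
To make the intended semantics concrete, here is a rough sketch of what
I have in mind. It is not the actual patch: the cc_current/cc_previous
naming and the internal pool_cache_invalidate_groups() helper follow
subr_pool.c, but the empty-group sentinel handling of the real fast
path is deliberately omitted.

#include <sys/param.h>
#include <sys/pool.h>
#include <sys/cpu.h>

/*
 * Sketch only: detach the calling CPU's cached groups and run every
 * object in them through the destructor.  Must execute on the CPU
 * whose cache is being invalidated.
 */
void
pool_cache_invalidate_local(pool_cache_t pc)
{
	pool_cache_cpu_t *cc;
	pcg_t *cur, *prev;

	kpreempt_disable();
	cc = pc->pc_cpus[curcpu()->ci_index];
	cur = cc->cc_current;
	prev = cc->cc_previous;
	cc->cc_current = NULL;	/* real code would install a sentinel */
	cc->cc_previous = NULL;
	kpreempt_enable();

	/* Destruct and free every object in the detached groups. */
	if (cur != NULL)
		pool_cache_invalidate_groups(pc, cur);
	if (prev != NULL)
		pool_cache_invalidate_groups(pc, prev);
}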

>> I need a way to invalidate all pool caches, even those that are
>> CPU-bound. pool_cache_invalidate() only does that for the global cache,
>> as it cannot invalidate CPU caches for CPUs other than its own.
> 
> CPU-bound is the wrong term here.  Nothing is "bound" to the CPU except
> for a set of pcgs that cache constructed objects.  The objects are
> merely kept in CPU-local caches in order to provide a lockless
> allocation path in the common case.  The objects themselves are not
> bound to any one CPU.

That depends on how you look at it. When an object is put back into a
CPU-local cache, it is bound to that CPU: no other CPU can allocate it
(pool_cache will not allow CPU-A to allocate objects cached in CPU-B's
pc_cpus[]).

I am not interested in who owns an object while it is in use, but in
the objects that have been released yet not freed, i.e. those still
cached in the pool_cache.

>> Before invalidate_local(), the only way would be to destroy the
>> pool_cache entirely, just to release the objects found in pc_cpus[].
>> This would cripple the entire VM system, as it cannot work without
>> the L2 shadow page pool.
> 
> What do you mean "destroy the pool_cache entirely"?

Invoke pool_cache_destroy() on it.

> So you're just working around a bug in pool_cache_invalidate()?  If
> pool_cache_invalidate() is not also zapping the per-CPU pcgs for that
> pool_cache, then that is a bug.

It cannot do so safely. pool_cache_invalidate() can be called whenever you
want, as the global cache is protected by a mutex.

Per-CPU caches are not, because their allocation path is lockless. If
CPU-A invalidates the per-CPU caches of CPU-B while CPU-B is
concurrently inside pool_cache_get(), the result is undefined behavior.
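
To illustrate: the fast path reads and modifies the CPU-local group
with nothing more than preemption disabled. A simplified sketch, not
the exact subr_pool.c code:

#include <sys/param.h>
#include <sys/pool.h>
#include <sys/cpu.h>

/*
 * Simplified fast path: safe only because the owning CPU is the sole
 * accessor of its pool_cache_cpu_t while preemption is disabled.
 */
static void *
pcg_get_fast(pool_cache_t pc)
{
	pool_cache_cpu_t *cc;
	pcg_t *pcg;
	void *obj = NULL;

	kpreempt_disable();
	cc = pc->pc_cpus[curcpu()->ci_index];
	pcg = cc->cc_current;
	if (pcg != NULL && pcg->pcg_avail > 0) {
		/*
		 * If a remote CPU freed or drained this group under our
		 * feet, the load and decrement below become a
		 * use-after-free and a lost update.
		 */
		obj = pcg->pcg_objects[--pcg->pcg_avail].pcgo_va;
	}
	kpreempt_enable();
	return obj;	/* NULL => take the locked slow path */
}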

> The main intent of pool_cache_invalidate() is to nuke any cached
> constructed copies of an object if the constructed form were to change
> for some reason.  If per-CPU cached copies are not included in that,
> then a subsequent allocation could return an incorrectly-constructed
> object, leading to errant behavior.

pool_cache(9) does not document the semantics you describe; quite the
opposite:

"""
Destruct and release all objects in the global cache.  Per-CPU caches will
not be invalidated by this call, meaning that it is still possible to
allocate "stale" items from the cache.  If relevant, the user must check
for this condition when allocating items.
"""

>  There are subsystems in the kernel that depend on the
> pool_cache_invalidate() semantics I describe; see
> arch/alpha/alpha/pmap.c:pmap_growkernel() for an example.  The L1 PTPs
> are cached in constructed form (with the L1 PTEs for the kernel portion
> of the address space already initialized).  If PTPs are added to the
> kernel pmap in such a way as to require an additional L1 PTE to link
> them up, then already-constructed-but-free L1 PTPs need to be discarded
> since they will be missing part of the kernel's address space.

In its current form, that is correct: pool_cache_invalidate() does not
release _all_ objects, only those in the global cache. Stale objects
sitting in the per-CPU pcgs are neither freed nor updated. In Alpha's
pmap_growkernel() case, yes, you end up with desynchronized PTPs.

This is wrong, and needs to be fixed, by invoking
pool_cache_invalidate_local() on each CPU (or an equivalent way to do
it; I am open to proposals).
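
One such equivalent way would be to broadcast the local invalidation
with xcall(9) and wait for every CPU to run it. A minimal sketch; the
pool_cache_invalidate_all() wrapper name is hypothetical:

#include <sys/pool.h>
#include <sys/xcall.h>

static void
invalidate_xcall(void *arg1, void *arg2)
{

	pool_cache_invalidate_local((pool_cache_t)arg1);
}

/*
 * Hypothetical wrapper: drain the global cache, then broadcast the
 * local invalidation so every CPU drains its own pcgs.
 */
void
pool_cache_invalidate_all(pool_cache_t pc)
{
	uint64_t where;

	pool_cache_invalidate(pc);
	where = xc_broadcast(0, invalidate_xcall, pc, NULL);
	xc_wait(where);		/* every CPU has run the invalidation */
}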

> Sigh, really the per-CPU pcgs should not be "lockless", but rather should
> be mutex-protected but "never contended" ... just in case you DO need to
> manipulate them from another CPU (this is how Solaris's kmem_cache works,
> at least it did at one point).

This is more intrusive in pool_cache(9). IMHO, protecting the per-CPU
pcgs with mutexes that will almost never be contended (the system is
expected to spend far more time allocating from these pcgs than asking
for their invalidation from another CPU...) is overkill.

If there is a lockless (or "never contended mutex") way to do it, why
not. But this would alter a performance-critical code path, and should
be thought out very carefully.
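
For reference, here is roughly what the "never contended" variant would
look like; cc_lock is a hypothetical field, and the price is an
uncontended mutex_enter()/mutex_exit() pair on every get/put:

#include <sys/param.h>
#include <sys/pool.h>
#include <sys/mutex.h>
#include <sys/cpu.h>

/*
 * Sketch: each per-CPU cache carries its own mutex.  The owner takes
 * it on every get/put, normally uncontended and therefore cheap; a
 * remote CPU takes it only to invalidate.
 */
static void *
pcg_get_locked(pool_cache_t pc)
{
	pool_cache_cpu_t *cc;
	pcg_t *pcg;
	void *obj = NULL;

	kpreempt_disable();
	cc = pc->pc_cpus[curcpu()->ci_index];
	mutex_enter(&cc->cc_lock);	/* hypothetical per-CPU lock */
	pcg = cc->cc_current;
	if (pcg != NULL && pcg->pcg_avail > 0)
		obj = pcg->pcg_objects[--pcg->pcg_avail].pcgo_va;
	mutex_exit(&cc->cc_lock);
	kpreempt_enable();
	return obj;
}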

>> In Xen's case, pool_cache_invalidate_local() would be used by each CPU
>> during its detach routine (just before its shutdown), to clear all its
>> associated L2 pages that Xen cannot handle.
> 
> So, we've added a huge abstraction violation to the pool API to handle
> this special case of Xen shutting down?
> 
> I'm sorry, but a different solution needs to be found.  Exposing this
> internal implementation detail of pools is just wrong.

What do you propose, then? Note that in the current state, Alpha's
pmap_growkernel() does not work as described, and neither does Xen's
save/migrate with PAE.

Yes, pool_cache_invalidate_local() works around a limitation of
pool_cache_invalidate() that _cannot_ be fixed given the way
pool_cache(9) is currently designed. If the pool_cache(9) pc_cpus[]
caches need to support safe cross-CPU invalidation, I would like to
hear from others about that first, especially Andy.

Cheers,

-- 
Jean-Yves Migeon
jeanyves.migeon%free.fr@localhost
