
Re: Deadlock on fragmented memory?



[Cc'ing yamt@ and para@, in case they're not reading tech-kern@ right
now, since they know far more about allocators than I do.]

> Date: Sun, 22 Oct 2017 22:32:40 +0200
> From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> 
> With a pullup of kern_exec.c 1.448-1.449, to netbsd-6, we're still seeing
> hangs on vmem.

Welp.  At least it's not an execargs hang!

I hypothesize that this may be an instance of a general problem with
chaining sleeping allocators:  To allocate a foo, first allocate a
block of foos; then allocate a foo within the block.  (Repeat
recursively for a few iterations: 1KB foos, 4KB pages of foos, 128KB
blocks of pages, &c.)
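
Schematically, with entirely made-up names -- none of this is code from
the tree, it's just the shape of the layering:

void *foo_page_get(void);   /* grab a free foo from an existing page, or NULL */
void foo_page_alloc(void);  /* grow by one page; may sleep for KVA below */

void *
foo_alloc(void)
{
        void *f;

        /* cf. pool_cache_get -> pool_get -> pool_grow */
        while ((f = foo_page_get()) == NULL)
                foo_page_alloc();   /* may end up in cv_wait far below */
        return f;
}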

- Suppose thread A tries to allocate a foo, and every foo in every
  block allocated so far is currently in use.  Thread A will proceed
  to try to allocate a block of foos.  If there's not enough KVA to
  allocate a block of foos, thread A will sleep until there is.

- Suppose thread B comes along and frees a foo.  That doesn't wake
  thread A, because there's still not enough KVA to allocate a block
  of foos.  So thread A continues to hang -- forever, if KVA is too
  fragmented.
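
In condvar terms -- the real sleep is on kmem_arena's vm_cv inside
vmem_xalloc; the names below are invented and the single lock is a
simplification -- the two threads never meet:

static kmutex_t alloc_lock;
static kcondvar_t kva_cv;       /* "a 128KB KVA region came free" */
static kcondvar_t foo_cv;       /* "an individual foo came free" */

bool kva_region_available(void);
void foo_freelist_put(void *);

void
thread_A(void)                  /* wants a foo, has to grow a block */
{
        mutex_enter(&alloc_lock);
        while (!kva_region_available())
                cv_wait(&kva_cv, &alloc_lock);  /* sleeps here... */
        mutex_exit(&alloc_lock);
}

void
thread_B(void *f)               /* frees a single foo */
{
        mutex_enter(&alloc_lock);
        foo_freelist_put(f);
        cv_broadcast(&foo_cv);  /* ...but this wakes only foo_cv waiters, */
        mutex_exit(&alloc_lock);        /* so thread A stays asleep */
}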

Even if thread A eventually makes progress, every time this happens,
it will allocate a new block of foos instead of reusing a foo from an
existing block.

And if there's no bound on the number of threads waiting to allocate a
block of foos (as is the case, I think, with pools), then under bursts
of heavy load there may be lots of nearly empty foo blocks allocated
simultaneously, which makes fragmentation even worse.

Thread A _should_ make progress if a foo is freed up, but it doesn't:
we have no mechanism by which multiple different signals can cause a
thread to wake, short of sharing the condition variables for them and
restarting every cascade of blocking allocations from the top.
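
Concretely, that escape hatch would look something like the sketch
below (invented names again): every free path, whether of a foo or of
KVA, broadcasts one shared condvar, and the sleeper retries the whole
cascade from the top rather than resuming its half-finished block
allocation:

static kmutex_t retry_lock;
static kcondvar_t any_free_cv;  /* broadcast on foo free AND on KVA free */

void *foo_freelist_get(void);           /* reuse a freed foo, or NULL */
void *foo_block_alloc_nowait(void);     /* grow by a new block and take a
                                           foo from it, or NULL if no KVA */

void *
foo_alloc_retry(void)
{
        void *f;

        mutex_enter(&retry_lock);
        for (;;) {
                if ((f = foo_freelist_get()) != NULL)
                        break;          /* a freed foo is enough */
                if ((f = foo_block_alloc_nowait()) != NULL)
                        break;          /* or a whole new block */
                cv_wait(&any_free_cv, &retry_lock);
        }
        mutex_exit(&retry_lock);
        return f;
}

That is the "restarting every cascade from the top" part: every waiter
re-walks the whole allocation path on each wakeup instead of sleeping
at exactly the level it was blocked on.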

This won't always happen:

- In the case of execargs buffers, this _won't_ happen (now) because
  each execargs buffer is uvm_km_allocated one at a time, not in
  blocks, so as long as the page daemon runs and there is an unused
  execargs buffer, shrinking exec_pool will free enough KVA in
  exec_map to allow a blocked uvm_km_alloc to continue and thereby
  allow a blocked pool_get to continue.

- But in the case of pathbufs, they're 1024 bytes apiece, allocated in
  4KB pages from kmem_va_arena; those 4KB allocations are satisfied by
  kmem_va_arena's qcache, whose 128KB blocks come from kmem_va_arena
  proper and ultimately from kmem_arena.  And there are no 128KB
  regions left in kmem_arena, according to your `show vmem', which
  (weakly) supports this hypothesis.

  To really test this hypothesis, you also need to check either

  (a) for pages of pathbufs with free pathbufs in pnbuf_cache, or
  (b) for blocks with free pages in kmem_va_arena's qcache.
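
To put numbers on (b) -- hypothetical macro names, and assuming the
usual 4 items per 4KB pool page (modulo where the pool header lives):

#define PATHBUF_SIZE    1024            /* one pathbuf */
#define POOLPAGE_SIZE   4096            /* one pnbuf_cache pool page */
#define QCBLOCK_SIZE    0x20000         /* one 128KB qcache block */

#define PATHBUFS_PER_PAGE (POOLPAGE_SIZE / PATHBUF_SIZE)  /* 4 */
#define PAGES_PER_BLOCK   (QCBLOCK_SIZE / POOLPAGE_SIZE)  /* 32 */

/*
 * A pool page can go back to kmem_va_arena only once all 4 of its
 * pathbufs are free (and the pool is drained), and a 128KB block can
 * go back only once all 32 of its pages are free -- so plenty of
 * pathbufs, and even whole pages, can sit idle without a single
 * 0x20000 region reappearing in kmem_arena for the blocked
 * vmem_xalloc.
 */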

I'm a little puzzled about the call stack.  By code inspection, it
seems the call stack should look like:

cv_wait(&kmem_arena->vm_cv, &kmem_arena->vm_lock)
vmem_xalloc(kmem_arena, #x20000, ...)
vmem_alloc(kmem_arena, #x20000, ...)
vmem_xalloc(kmem_va_arena, #x20000, ...)
vmem_alloc(kmem_va_arena, #x20000, ...)
qc_poolpage_alloc(...qc...)
pool_grow(...qc...)
* pool_get(...qc...)
* pool_cache_get_paddr(...qc...)
* vmem_alloc(kmem_va_arena, #x1000, ...)
* uvm_km_kmem_alloc(kmem_va_arena, #x1000, ...)
* pool_page_alloc(&pnbuf_cache->pc_pool, ...)
* pool_allocator_alloc(&pnbuf_cache->pc_pool, ...)
* pool_grow(&pnbuf_cache->pc_pool, ...)
pool_get(&pnbuf_cache->pc_pool, ...)
pool_cache_get_slow(pnbuf_cache->pc_cpus[curcpu()->ci_index], ...)
pool_cache_get_paddr(pnbuf_cache, ...)
pathbuf_create_raw

The starred lines do not seem to appear in your stack trace.  Note
that immediately above pool_get in your stack trace, which presumably
passes &pnbuf_cache->pc_pool, is a call to pool_grow for a _different_
pool, presumably the one inside kmem_va_arena's qcache.

