tech-kern: Re: fdexpand() memory shortage check (Re: kern/14721)

Subject: Re: fdexpand() memory shortage check (Re: kern/14721)
To: None <jaromir.dolecek@artisys.cz>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 12/14/2001 23:32:38

hi,

the trouble I have with this M_CANFAIL flag is exactly the point that
you mention below:  even if we can recover from the file-descriptor
allocation failure, kmem_map is still almost entirely full and so the
next malloc() call is likely to panic the machine anyway.  the real
problem here is that the kernel has been misconfigured to allow
applications to request more of a resource (kernel malloc() space)
than is available, and when we run out of that resource, there's
not much to do but panic.  recovering from exhaustion of this particular
resource isn't really possible, so what we should do is to try to
make sure that we don't run out in the first place.

we use malloc() to allocate file-descriptor arrays because these are
variable in size, and malloc() is the only kernel memory allocator that
really deals with that.  but malloc() has another property, which is
that it can be used from interrupt context and the memory it allocates
is safe to access from interrupt context.  to implement these extra
properties we need to preallocate resources (PTPs, etc), and it's this
which leads to the scarcity of malloc() virtual space:  we don't want
to preallocate more resources than we really need to, so we limit the
amount of virtual space managed by malloc() by creating a separate
kernel submap (kmem_map) from which malloc() gets its virtual space.

however, file-descriptor tables don't need to be accessed from interrupt
context, so there's no need for them to be allocated from kmem_map, except
that malloc() can only allocate from kmem_map.  so what we'd really like
is a way to allocate variably-sized chunks of memory from kernel_map instead
of kmem_map.  we can do that with uvm_km_alloc() for chunks that are a
multiple of the page size, so another approach would be to have fdexpand()
switch to using uvm_km_alloc() instead of malloc() once the size of the
allocation becomes larger than a page.  this would greatly reduce the
usage of kmem_map space for this application which really doesn't need it.
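
a minimal user-space sketch of the size-based switch described above.
this is not NetBSD code:  the 4096-byte page size and every sketch_*
name are assumptions made for illustration.  it only shows the decision
(malloc() up to one page, page-granular allocation above that) and the
rounding to a multiple of the page size that a uvm_km_alloc()-style
allocator would need.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, not from the source */

/* which allocator would serve the request (hypothetical names) */
enum sketch_source {
	SRC_MALLOC,	/* stand-in for malloc(), backed by kmem_map */
	SRC_PAGES	/* stand-in for uvm_km_alloc(), backed by kernel_map */
};

/* round a byte count up to a whole number of pages */
static size_t
sketch_round_page(size_t nbytes)
{
	return (nbytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE *
	    SKETCH_PAGE_SIZE;
}

/* pick the allocator for a descriptor array of nbytes bytes */
static enum sketch_source
sketch_pick_source(size_t nbytes)
{
	assert(nbytes > 0);
	/* switch only once the array is larger than a page */
	return nbytes > SKETCH_PAGE_SIZE ? SRC_PAGES : SRC_MALLOC;
}
```

in this sketch a table that has just grown past one page is rounded up
to two pages, so some slack is traded for keeping it out of kmem_map.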

the other usage of malloc() that I can think of which occurs in direct
response to application requests is for amaps, but this case is a bit
trickier since one process can create lots of tiny amaps and we wouldn't
want to use a whole page for each one.  here we would want something
a bit fancier that would let us allocate sub-page-sized chunks
without using kmem_map.  to provide this, we could create a pool for
each power-of-2 size between say 16 bytes and 1 page, and allocate
from the smallest pool that will hold the allocation.
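
the pool selection could look roughly like the following user-space
sketch.  it assumes a 4096-byte page, which gives nine pools (16, 32,
... 4096 bytes);  the sketch_* names are hypothetical and nothing here
uses the real pool allocator interface.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MIN_SHIFT	4	/* smallest pool: 16 bytes */
#define SKETCH_PAGE_SHIFT	12	/* assumed 4096-byte page */
#define SKETCH_NPOOLS	(SKETCH_PAGE_SHIFT - SKETCH_MIN_SHIFT + 1)

/*
 * return the index of the smallest power-of-2 pool that can hold
 * size bytes, or -1 if the request is larger than a page.
 */
static int
sketch_pool_index(size_t size)
{
	int idx = 0;
	size_t cap = (size_t)1 << SKETCH_MIN_SHIFT;

	assert(size > 0);
	while (cap < size) {
		if (idx == SKETCH_NPOOLS - 1)
			return -1;	/* too big for any pool */
		cap <<= 1;
		idx++;
	}
	return idx;
}

/* item size served by pool idx */
static size_t
sketch_pool_item_size(int idx)
{
	assert(idx >= 0 && idx < SKETCH_NPOOLS);
	return (size_t)1 << (SKETCH_MIN_SHIFT + idx);
}
```

a 17-byte request lands in the 32-byte pool here, so the worst-case
internal waste is just under half of the item size.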

to make this even neater, we could still do all of this with the malloc()
interface, with a different new flag, M_NOINTR or something like that.
if this flag is passed to malloc(), the allocation would come from one
of the power-of-2 pools or from uvm_km_alloc(), instead of from kmem_map.
free() could detect which way the memory was allocated (and thus how it
needs to free it) by checking whether the virtual address of the memory
being freed is within kmem_map or not.
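
the address test that free() would perform can be sketched as a simple
range check.  this is a user-space illustration only:  the struct, the
enum and the function names are hypothetical, and the bounds of
kmem_map are passed in as a plain half-open range rather than read from
any real kernel map structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* half-open virtual address range [start, end), hypothetical */
struct sketch_range {
	uintptr_t start;
	uintptr_t end;
};

/* how the memory would have to be released (hypothetical names) */
enum sketch_free_path {
	FREE_KMEM,	/* address lies inside kmem_map */
	FREE_NOINTR	/* address came from a pool or page allocation */
};

/* true if addr falls inside the range */
static bool
sketch_in_range(const struct sketch_range *r, const void *addr)
{
	uintptr_t va = (uintptr_t)addr;

	assert(r->start <= r->end);
	return va >= r->start && va < r->end;
}

/* choose the release path purely from the address being freed */
static enum sketch_free_path
sketch_pick_free_path(const struct sketch_range *kmem, const void *addr)
{
	return sketch_in_range(kmem, addr) ? FREE_KMEM : FREE_NOINTR;
}
```

the caller of free() does not have to remember which flag it passed to
malloc(), which is what lets the existing interface stay unchanged.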

by using this new style of allocating memory for non-interrupt contexts,
we can probably avoid ever running out of kmem_map space.

once this is implemented, the next problem would be running out of
kernel_map space (ie. filling up all of the kernel virtual address space)
or running out of physical memory because we allocated it all to things
like file descriptors.  these problems we can't really fix other than
by limiting the kernel memory that applications can allocate (eg. don't
let even root raise the limit on file-descriptors to something larger
than the kernel can actually allocate), but doing this would limit
flexibility.  since only root can raise the limits on file descriptors
to such high values anyway, I'd say that this is just one more way that
root can crash the machine, and we don't need to do anything about it.
(actually this argument could be made about this entire problem, but
the changes I've described above seem like a good idea in general
so we should probably make them anyway.)

so in conclusion, I don't think M_CANFAIL is particularly useful,
and I think we should get rid of it.  M_NOINTR would be a much better way
to increase robustness in the face of application behaviour such as
allocating 100,000 file descriptors.

-Chuck


On Fri, Dec 14, 2001 at 11:08:03AM +0100, jaromir.dolecek@artisys.cz wrote:
> Hi,
> the problem in kern/14721 happens due to fdexpand() eventually
> trying to allocate bigger chunk of kernel memory for the process than
> possible with kmem_map (which holds normally like 1/4 of physical
> memory).
> A fix here is for the fdexpand() function to fail gracefully
> and for the caller to handle this appropriately. This is
> implemented in the appended patch.
> 
> One possible issue I've been discussing with Soda privately is
> whether or not it's suitable to panic the system in such a situation.
> When several such programs as the one in kern/14721 run
> and don't finish, they can make the kernel allocate almost
> all its memory to file descriptor arrays, and thus no more
> KVA might be available for other operations. I'm not sure
> how big a problem a nearly-full kmem_map is. Would that cause
> any problems severe enough to justify a panic in such a situation?
> 
> Jaromir
> -- 
> Jaromir Dolecek <jaromir.dolecek@artisys.cz>
> ARTISYS, s.r.o., Stursova 71, 61600 Brno, Czech Republic
> phone: +420-5-41224836 / fax: +420-5-41224870 / http://www.artisys.cz