Subject: Re: memory (re-)allocation woes
To: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.ORG>
From: theo borm <theo@borm.org>
List: tech-kern
Date: 11/29/2004 01:15:43
Greg A. Woods wrote:

>[ On Saturday, November 27, 2004 at 21:55:55 (+0100), theo borm wrote: ]
>  
>
>>Subject: Re: memory (re-)allocation woes
>>
>>    
>>
>>>14:29 [702] $ ./trealloc 1048576 1048576 
>>>[[ .... ]]
>>>step 342: reallocating 359661568 bytes... trealloc: realloc() failed: Cannot allocate memory
>>> 
>>>
>>>      
>>>
>>I note that the program could not (re-) allocate 342 Mbyte, which
>>is about a third of the memory available to user processes:
>>
>>data(kbytes)         1048576
>>    
>>
>
>Remember that it's trying to realloc() -- i.e. it already has a 341 MB
>block allocated and now it's trying to allocate an additional 342 MB
>block into which the first block will be copied.  That's a total of
>683MB.  Now in the worst case a simplistic realloc() algorithm is
>"stuck" with leaving the previous 340 MB block in place and trying to
>extend the heap by the entire 342 MB because the previous 340 MB chunk
>is at the start and cannot be extended.
>
>	+--------------------------------------------+
>	| 340 MB chunk | 341 MB chunk | 342 MB attempt -- oops! too far
>	+--------------------------------------------+
>
>In theory the realloc() algorithm could be fixed to first copy the 341MB
>chunk down to the start of the heap then it wouldn't have to extend the
>heap at all until the next realloc(343MB), but apparently it doesn't do
>that.
>  
>
Well, I haven't thought about it that way; I just assumed that malloc
(as that gets called from realloc) asks the OS for some pages, after
which free (when that gets called from realloc) returns the old ones to
the OS again.... As I said, I only /glanced/ at realloc's
implementation.

/If/ the heap has to be contiguous, allocation would proceed as follows
(one + is one allocated MB, one - is one orphaned MB):

+
-++
---+++
------++++
+++++
-----++++++
-----------+++++++
++++++++
--------+++++++++
-----------------++++++++++

....

[338 * '+']
[338 * '-'][339 * '+']
[338 * '-'][339 * '-'][340 * '+']
[341 * '+']

etc... This is not what I see when I log the size at every step (...),
and it is incompatible with the observation that programs on a system
without swap can sometimes grow to half the size of memory, depending
on the size of the increase...
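
For reference, the test loop I'm logging looks roughly like the sketch
below; this is not the actual trealloc source, just an approximation
from memory, and the err() reporting and the memset that touches every
page are my own details:

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char *argv[])
{
        size_t size = (argc > 1) ? (size_t)atol(argv[1]) : 1024 * 1024;
        size_t step = (argc > 2) ? (size_t)atol(argv[2]) : size;
        char *buf = NULL, *nbuf;
        int i;

        for (i = 0; ; i++, size += step) {
                printf("step %d: reallocating %lu bytes... ", i,
                    (unsigned long)size);
                fflush(stdout);
                if ((nbuf = realloc(buf, size)) == NULL)
                        err(1, "realloc() failed");
                buf = nbuf;
                memset(buf, 0, size);   /* touch every page */
                printf("ok\n");
        }
        /* NOTREACHED */
}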

>So in fact the process has grown to the full 1GB limit and is now trying
>to grow past that limit.  Your "only 1/3 of memory" statement threw me
>off but now that I think about it again that's exactly right and it fits
>in perfectly with my observation in "top" of the size going up and down
>as the third block would be freed every other time and the nice
>allocator was giving back memory to the system when that happened!  :-)
>  
>
What strikes me as odd is that if I switch off swap altogether, the
program only aborts (is killed with an out-of-swap message) after
(re-)allocating about half the amount of memory, so in principle,
having less than half the amount of physical memory in swap even
aggravates the problem....

>  
>
>>You still had 512MB of physical memory to play with.
>>    
>>
>
>Indeed -- which is why I suggested the 1GB default RLIMIT_DATA was just
>way too big for your environment.  :-)
>  
>
But please keep in mind that with whatever rlimit you specify, there is
always a way to make the system *kill* some of the programs doing
reallocations by simply running enough of them.

>Any time your system's operation, and especially that of some CPU hungry
>process that regularly touches most of the pages in its heap, depends on
>swap space in order to grow big enough then your system is simply too
>small.  VM is great for lazy programmers who want to grab great big gobs
>of memory just to make their algorithms simpler (maybe they only
>reference most of it just once, or maybe just twice in some neat order
>that doesn't fight against the pager, and maybe there are big holes that
>are never referenced), but as soon as you try to hammer away on all that
>primary storage back and forth constantly you either need to have enough
>real RAM for the vast majority of it, or else you need to be prepared to
>wait perhaps many _thousands_ of times longer as it'll all have to be
>moved on and off the secondary swap storage many times over and over
>again.
>  
>
I agree with the bits about lazy programmers; they abound, and
unfortunately I have to live with lots of their products (being one
myself ;-) ). There are however things that must be said in favour of
using the OS' facilities for managing memory: it allows OS control,
simplifies programs, avoids (buggy) re-implementations of memory
management routines, allows performance gains because OS memory
management is (probably) hardware assisted, *and* it leads to fewer
points of failure.

The program that originally sparked my curiosity is one that
pedantically tries to conserve memory while calculating maps of
genomes. The first step of this is calculating distances between
markers, which is an O(n^2) problem. With ~100000 datapoints, and each
datapoint costing about 200 bytes, a lazy programmer would need ~2
TByte just to store the distance matrix. Fortunately most distances
are "too large to measure", and hence a sparse matrix suffices. After
calculating this distance matrix, the datapoints must be ordered so
that the sum of distances is minimized, a bit like the traveling
salesman problem with 100000 cities.... At this stage memory is
constantly being reused at a staggering rate, using the tools provided
for this purpose by the OS: mallocs, frees and reallocs, and it is
here that the program is sometimes "just killed" by NetBSD.

The whole dataset will fit into 200 MByte with most options; sometimes
however the dataset will grow to (the top I've seen under Linux) 390
MByte. The cluster nodes don't have to run multiple instantiations at
the same time, so if the program would *malloc* a safe 400 MByte to
start with, the system would actually work without swap (112 MByte
left for the kernel, some buffers, nfs etc.; I have servers running
happily with less memory). Unfortunately the programmer "did the right
thing", not asking for more resources than are actually needed at a
certain point in time, and a realloc implementation that is
ridiculously wasteful of resources punishes /me/ for that. :-(

As a side note: I put considerable effort into getting this program working
under NetBSD, and have repaired some minor flaws in data handling that
do not become apparent under either Linux or Solaris, and have resolved
some other issues that are related solely to our particular dataset. In my
personal opinion this program was not well written; it was written by
someone with (I guess) a primary interest in biology and particularly
molecular genetics, and has evolved and expanded over a long time, across
multiple OSes and multiple processor architectures. As far as I can tell
(my personal experience is limited to Linux and Solaris), on none of these
systems do these problems occur. I do have strong incentives for keeping
NetBSD, and though time is getting short I'm not quite willing to throw
in the towel yet....

>
>  
>
>>In my case "sluggish"
>>was not the word. I waited for more than 8 hours for the system to come back
>>up, and it didn't. I've also had sudden reboots because of this.
>>    
>>
>
>I still wonder sometimes if maybe there's a deadlock of some kind in the
>pager when things get that tight.  I'm assuming your disk was constantly
>busy during that time....  I know I've seen NetBSD systems thrash
>  
>
After the initial swapping, all disk (and network) activity ceased.

>themselves silly with no apparent progress ever being made.  This isn't
>necessarily the over-commit of resources either -- it seems to be some
>kind of "big loop" deadlock....
>
>In theory with our new VM it should be easier to tune it so that more
>pages of the right kinds can be kept in memory so that this kind of
>deadlock doesn't happen so easily, but doing the experiments to find
>that tuning could be very tedious and time consuming.
>  
>
hmm...

>
>  
>
>>well, setting the resource limits lower may help, but will not prevent
>>processes naively allocating memory from being killed.
>>    
>>
>
>The process wasn't killed was it?  realloc() returned NULL and it exited
>of its own accord by design of its code.
>  
>
It *was* killed. Running a single process eating memory is not a
problem; this will run into one third of the /set/ resource limits
/before/ being killed. Running *multiple* processes eating memory,
with the sum of their resource limits exceeding one third of the
available memory, *will* result in random killing (and probably, in
some pathological cases, in the system thrashing itself).

>AIX has/had a SIGDANGER which will kill any large process that hasn't
>caught the signal (e.g. Xserver catches it! ;-) so that when big
>processes allocate lots of memory and then eventually do reference all
>those pages (as this realloc() experiment will), then if there's not
>enough swap space available the kernel will start sending SIGDANGERs
>until enough things die and give up enough resources for the system to
>continue on working as a whole.
>  
>
The realloc only references all those pages because of its
implementation, and as its implementation is so silly (that it
references all those pages), the OS might just as well have refused
reallocation in the first place.

I get the impression that there are three fundamental issues here:
first, the OS overcommits memory, probably assuming that most programs
allocate memory that they never intend to use; second, the
implementation of realloc is such that all memory allocated /is/
actually used; and third, the implementation of free is such that
freed memory is not released to the OS immediately, but apparently
held back somewhere. Right now (IMHO) this makes NetBSD qualify for
the label "only suitable for small jobs".

The result of these /choices/ is that in order not to run the risk of
being suddenly killed on a realloc, memory hungry programs should
either *malloc* more memory than they will ever use (and touch all the
pages to make sure they are really theirs) OR implement their own
memory management routines, which (unfortunately) will have to make do
without the "convenience" and performance of proper (hardware) VM
support.
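
For the first option, what I have in mind is something like the sketch
below (the function name and the one-byte-per-page touch are just my
illustration; the 400 MByte figure is specific to our dataset):

#include <err.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Grab the worst-case amount of memory once and touch one byte per
 * page, so the pages are really backed now instead of at some random
 * later moment.
 */
static void *
grab_worst_case(size_t bytes)
{
        size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
        size_t off;
        char *p;

        if ((p = malloc(bytes)) == NULL)
                err(1, "malloc");
        for (off = 0; off < bytes; off += pagesz)
                p[off] = 0;
        return p;
}

/* e.g.: void *pool = grab_worst_case(400UL * 1024 * 1024); */

Of course the program then has to manage that block itself, which gets
uncomfortably close to the second option.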

Overcommitting resources /because/ they will never be used bites its own
tail....  The more I think about it, the less I like it.

Well... I'll start thinking about how to make reallocs less wasteful
of resources; in view of its current implementation, performance
(memcpy, eventually leading to excessive swapping) can hardly be
considered an issue..... I guess I have some serious reading up to do.
Can anyone give me a hint as to whether there is a portable
(architecture-wise) way to remap pages from a user program?
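
One direction I can think of (just a sketch, and I have not checked
how NetBSD accounts PROT_NONE reservations against the data size limit
or against overcommit) would be to reserve a large contiguous chunk of
address space once with mmap() and only commit pages with mprotect()
as the buffer grows, so that nothing ever needs to be copied:

#include <sys/mman.h>
#include <err.h>
#include <stddef.h>
#include <unistd.h>

#define RESERVE (512UL * 1024 * 1024)   /* assumed worst-case size */

static char *base;
static size_t committed;

/* reserve address space only; no pages are committed yet */
static void
reserve_region(void)
{
        base = mmap(NULL, RESERVE, PROT_NONE, MAP_ANON | MAP_PRIVATE,
            -1, 0);
        if (base == MAP_FAILED)
                err(1, "mmap");
}

/* make the first 'want' bytes of the region usable, growing in place */
static void
grow_region(size_t want)
{
        size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);

        want = (want + pagesz - 1) & ~(pagesz - 1);
        if (want > RESERVE)
                errx(1, "region exhausted");
        if (want > committed) {
                if (mprotect(base + committed, want - committed,
                    PROT_READ | PROT_WRITE) == -1)
                        err(1, "mprotect");
                committed = want;
        }
}

Whether that actually sidesteps the overcommit/killing problem or just
moves it around is exactly the sort of thing I need to read up on.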

With kind regards,

Theo.