Subject: Re: memory (re-)allocation woes
To: None <tech-kern@netbsd.org>
From: theo borm <theo4490@borm.org>
List: tech-kern
Date: 11/28/2004 10:58:06
Steven M. Bellovin wrote:

>In message <Pine.BSF.4.51.0411271804420.80481@vegeta.city-net.com>,
>Matthew Orgass writes:
>
>>On 2004-11-27 theo4490@borm.org wrote:
>>
>>>You still had 512MB of physical memory to play with. In my case "sluggish"
>>>was not the word. I waited for more than 8 hours for the system to come back
>>>up, and it didn't. I've also had sudden reboots because of this.
>>>      
>>>
>> This sounds like bad memory or overheating.  Try a large build with GCC
>>(a NetBSD build works well for this).  I'll bet you will encounter random
>>failures.
>>
>
>Or run memtest (from pkgsrc)
>
>		--Steve Bellovin, http://www.research.att.com/~smb
>
Well, I will try that, but I'm not sure it will help.

I have run my little test on a variety of hardware (i.e. different diskless
cluster nodes with /identical/ hardware, and four other machines with
different hardware, alas all i386), and have managed to crash all diskless
nodes (somehow swap over NFS seems to be quite sensitive to long delays in
the pagedaemon) and two of the four other machines. One of these I used some
time ago to build all of pkgsrc *). To me it seems that the problem is not
hardware related, and that it is only a matter of finding the correct
parameters to reboot the remaining two as well :-(

On /all/ machines in /all/ configurations (diskless or with local disk,
with and without swap) the program gets killed if there are not enough
resources available to honour its request. Realloc /only/ returns NULL
/if/ the program hits about one third of the "set" resource limits /before/
the kernel thinks there are (globally) not enough resources available.
Otherwise it gets killed.
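
For illustration, a minimal sketch of a test of this kind. This is not the
program used for the results above; the names limit_memory and
grow_until_null are made up for the example. It assumes that lowering the
soft RLIMIT_DATA limit (and RLIMIT_AS where the platform defines it) is what
"setting" the resource limit means, and it touches every new block so the
memory is really used rather than just reserved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Lower the soft limit of one resource to `bytes`, clamped to the hard
 * limit.  Returns 0 on success, -1 on error. */
static int lower_soft_limit(int resource, rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;
    rl.rlim_cur = bytes;
    return setrlimit(resource, &rl) == 0 ? 0 : -1;
}

/* Apply the limit to the data segment and, where available, to the
 * whole address space.  Returns 0 on success, -1 on error. */
int limit_memory(rlim_t bytes)
{
    if (lower_soft_limit(RLIMIT_DATA, bytes) != 0)
        return -1;
#ifdef RLIMIT_AS
    if (lower_soft_limit(RLIMIT_AS, bytes) != 0)
        return -1;
#endif
    return 0;
}

/* Grow one buffer in `step`-byte increments until realloc returns NULL
 * or the next step would exceed `ceiling`.  Returns the size reached.
 * If the kernel kills the process first, this never returns at all. */
size_t grow_until_null(size_t step, size_t ceiling)
{
    char *buf = NULL;
    size_t size = 0;

    assert(step > 0);
    while (size + step <= ceiling) {
        char *tmp = realloc(buf, size + step);
        if (tmp == NULL)
            break;                       /* buf is still valid here */
        buf = tmp;
        memset(buf + size, 0xa5, step);  /* touch the new pages */
        size += step;
    }
    free(buf);
    return size;
}
```

Called from main as limit_memory(limit) followed by
grow_until_null(step, ceiling), a return value below the ceiling means
realloc reported the failure itself; a process that simply disappears means
it was killed before realloc got the chance.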

This means that the only way to guarantee that a program gets the
*correct* NULL value returned on a failed realloc, giving it a chance
to write some data to disk and possibly free some memory, is to make
the system single tasking (......!) with a suitably low resource limit,
forcing you to "never" use >35% of the machine's actual memory
resources. Right now I don't even get a core dump.
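
To make concrete what a program could do with that NULL, here is a minimal
sketch in C. The struct and the function names (buffer_resize,
buffer_grow_or_flush) are hypothetical, chosen for this example only. It
helps only in the case where realloc really does return NULL; it can do
nothing if the process is killed first.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct buffer {
    char  *data;   /* heap storage, or NULL while empty */
    size_t cap;    /* bytes allocated */
    size_t used;   /* bytes holding unsaved data */
};

/* Resize without losing the old block: realloc leaves the original
 * allocation untouched when it fails, so the pointer is only
 * overwritten after success.  Returns 0 on success, -1 on failure. */
int buffer_resize(struct buffer *b, size_t new_cap)
{
    char *tmp;

    assert(b != NULL && new_cap > 0);
    tmp = realloc(b->data, new_cap);
    if (tmp == NULL)
        return -1;
    b->data = tmp;
    b->cap = new_cap;
    return 0;
}

/* Try to grow; if that fails, write the unsaved bytes to `out` so the
 * existing buffer can be reused.  Returns 0 if grown, 1 if the data
 * was flushed instead, -1 if saving failed as well. */
int buffer_grow_or_flush(struct buffer *b, size_t new_cap, FILE *out)
{
    if (buffer_resize(b, new_cap) == 0)
        return 0;
    if (b->used > 0 && fwrite(b->data, 1, b->used, out) != b->used)
        return -1;
    if (fflush(out) != 0)
        return -1;
    b->used = 0;   /* everything is on disk; the old block is free to reuse */
    return 1;
}
```

The temporary pointer is the important part: assigning the result of realloc
straight back to b->data would leak the old block, and the data in it, on
failure.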

Come to think of it, this killing-instead-of-returning-NULL thing may
well explain some other "sudden deaths" in (for instance) the Gimp I
have been seeing. Perhaps the issue is more serious than I initially
thought. I think I'll play around with the malloc configuration a bit
to see what comes up...

with kind regards,

Theo Borm

*) The only problems I saw were pkgsrc problems such as missing dependencies
and being unable to fetch files. One of the missing dependencies was quite
"funny": someone had apparently reconfigured a webserver to return a search
page instead of a 404 status code for a page not found, which resulted in
the search page being saved as a .tar.gz file, which could not be unpacked,
which in turn resulted in a failed dependency. (Maybe detecting an invalid
.gz file should also result in trying the next server in the list.)
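
A minimal sketch of such a check, in C. The function name looks_like_gzip is
hypothetical, and this is not how pkgsrc validates its downloads; it only
tests the fixed header bytes that RFC 1952 prescribes for a gzip stream, so
it would catch an HTML page saved as .tar.gz but not a truncated or
otherwise damaged archive.

```c
#include <stddef.h>

/* A gzip stream starts with the magic bytes 0x1f 0x8b, followed by the
 * compression method, where 8 means deflate (RFC 1952).
 * Returns 1 if the first bytes of `buf` match, 0 otherwise. */
int looks_like_gzip(const unsigned char *buf, size_t len)
{
    return len >= 3 && buf[0] == 0x1f && buf[1] == 0x8b && buf[2] == 8;
}
```

A fetch step could read the first few bytes of the downloaded file, and move
on to the next server in the list when this returns 0.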

P.S. (& O.T.) Is anyone reading his/her tech-kern mail using a badly
patched MS product? I've not posted anything here for quite some time,
but now that I did I immediately received some viruses on this mail
address... (In particular, people with an "interbusiness.it" or an
"online.nsk.su" IP address might want to run a virus check.)