Subject: memory (re-)allocation woes
To: None <tech-kern@netbsd.org>
From: theo borm <theo4490@borm.org>
List: tech-kern
Date: 11/27/2004 16:05:29
Dear list members,

As a follow-up to my previous post, I checked if the same
problem (user program memory allocation freezing/rebooting
a 1.6.2 system) does occur in simpler set-ups. I ended up
adding local root and swap disks to some of the cluster
nodes and did fresh installs of 1.6.2 on those.

I am sad to say that it does.

The small program below will, depending on the presence
and amount of swap and on the parameters given to it, do
one of the following (I am monitoring it using top):

1) cause a spontaneous reboot (rarely, more often in diskless
   set-ups)
2) make the system "inaccessible" *) with top still working and
   showing 100% idle
3) make the system "inaccessible" *) with top still working and
   showing 99.02% pagedaemon activity for a prolonged time
4) make the system "inaccessible" *) with top frozen
5) be killed after an "out of swap" kernel message
6) be killed after a series of "resource shortage,
   1 page of swap lost" kernel messages
7) exit normally (realloc returning NULL at some point)

*) "inaccessible": wscons still working, but the system is
   not accepting any new logins or new commands in existing
   logins. All network activity ceases, but the system keeps
   responding to ping requests. This "inaccessible"
   state lasted for more than 8 hours, after which I
   pressed the reset button.

The parameter values at which problems occur /seem/ to be
an initial allocation of slightly less than a third of the
(swap+system) memory, followed by two reallocations that
take the total SIZE (as reported by top) of the program
beyond the available memory.

There seems to be a considerable dependency on the increment
parameter: using an increment of 4096 bytes, the program
was killed ("out of swap" kernel message) after allocating
about 50% of the available memory, whereas the same starting
allocation size with an increment of 1 MByte caused 4) after
allocating about 33% of the available memory, and specifying
an increment of 32 MByte resulted in a graceful exit.

How do I go about debugging this?

My first guess is to find out what is causing SIZE (as
reported by top) to grow to about three times the (re-)
allocation size. Incidentally, this only happens when swap
is enabled; otherwise it tops out at 2x the allocated amount
of memory, which is compatible with the hypothesis of a
rather simple memory reallocation algorithm (malloc(new),
copy old->new, free(old)).
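
To make the "simple reallocation" hypothesis concrete, here is
a minimal sketch of what such an algorithm would look like.
This is an assumption for illustration only and is not taken
from the NetBSD 1.6.2 libc; naive_realloc and
naive_realloc_peak are hypothetical names, and the old size is
passed in explicitly because the sketch has no allocator
metadata to look it up.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical naive reallocation: malloc(new), copy old->new,
 * free(old). On failure the old block is left untouched, as
 * with the real realloc().
 */
void *naive_realloc(void *old, size_t old_size, size_t new_size)
{
    void *fresh = malloc(new_size);

    if (fresh == NULL)
        return NULL;
    if (old != NULL) {
        /* copy only what fits in the new block */
        memcpy(fresh, old, old_size < new_size ? old_size : new_size);
        free(old);
    }
    return fresh;
}

/*
 * Bytes held at the moment both blocks exist, i.e. just before
 * free(old). For a small increment this is roughly 2x the size.
 */
size_t naive_realloc_peak(size_t old_size, size_t new_size)
{
    return old_size + new_size;
}
```

Under this model, growing a 350 MByte block by a small
increment briefly needs about 700 MByte. That would account
for the 2x SIZE seen without swap, but not for the 3x seen
with swap enabled.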

I am also not quite sure where the problem lies; because
of the kernel messages, the kernel killing the processes,
the spontaneous reboots and realloc not returning NULL I
would guess that it is a kernel problem...

Any hints?

As is, the only quick workaround I can think of that will
stop local users from accidentally thrashing my cluster
nodes is limiting DSIZE to less than a third of the
available (physical+swap) memory, and this means that I
will have a very hard time explaining why a 350 MByte model
will not fit in a 1 GByte machine :-(
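
For reference, the same kind of limit can also be applied per
process rather than system-wide. The sketch below uses the
standard getrlimit()/setrlimit() calls with RLIMIT_DATA; it
assumes that the allocator takes its memory from the data
segment, so that this limit actually bounds malloc/realloc.
limit_data_size is a hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Lower the soft data-segment limit of the calling process to
 * at most `bytes`. Returns 0 on success, -1 on failure (errno
 * is set by getrlimit/setrlimit).
 */
int limit_data_size(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_DATA, &rl) != 0)
        return -1;

    /* the soft limit may never exceed the hard limit */
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;

    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_DATA, &rl);
}
```

A small wrapper could call limit_data_size() with a third of
the (physical+swap) memory and then exec the user's program,
since resource limits are inherited across fork and exec.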

with kind regards, Theo Borm


------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>   /* malloc, realloc, free, exit */
int main(int argc,char * argv[])
{
   char * mem1;
   char * mem2;
   int i=0;
   int size;
   int increment;
   if (argc!=3)
   {
      printf("usage: realloctest <initial size> <increment>\n");
      fflush(stdout);
      exit(0);
   }
   if ((sscanf(argv[1],"%d",&size)!=1)||
      (sscanf(argv[2],"%d",&increment)!=1))
   {
      printf("usage: realloctest <initial size> <increment>\n");
      fflush(stdout);
      exit(0);
   }
   printf("first allocation (%d bytes)",size);
   if ((mem1=(char *)malloc(size))==NULL)
   {
      printf(" failed and returned NULL.\n");
      fflush(stdout);
      exit(0);
   }
   printf(" succeeded and returned a valid pointer.\n");
   fflush(stdout);
  
   /* grow the block by <increment> per step until realloc fails */
   while (mem1!=NULL)
   {
      i++;
      size+=increment;
      printf("step %d: reallocating %d bytes",i,size);
      fflush(stdout);
      if ((mem2=(char *)realloc(mem1,size))==NULL)
      {
         printf(" failed and returned NULL.\n");
         fflush(stdout);
         free(mem1);
         exit(0);
      }
      else
      {
         printf(" succeeded and returned a valid pointer.\n");
         fflush(stdout);
      }
      mem1=mem2;
   }
   printf("\nallocation should have failed (somewhere)\n");
   fflush(stdout);
   return 0;
}