Subject: Re: RAIDFrame and RAID-5
To: None <thomas@hz.se>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 09/10/2003 07:50:43
"Thomas Hertz" writes:
> > What sorts of problems are you seeing?  Panics?  Freezes?  Can we 
> > please get 'dmesg' output, 'raidctl -s' output, raid config 
> > files, etc, etc?
> 
> I haven't been able to get a kernel core dump, since the system just
> freezes. I have noticed that moments before the system freezes it is
> no longer possible to start new processes. The processes that are
> already running continue normally for a minute or so longer. Most of
> the time the console prints a few "cannot allocate Tx mbuf" messages
> for the various network interfaces just before the final freeze.

Hmm...  I wonder which one of the "out of kernel memory" problems 
applies to this case...

> It seems (obviously) to be a kernel memory problem. I have experimented
> a little with chunk sizes, and the system will stay up a little longer
> with smaller chunks (it crashes within seconds with any chunk size of
> 256k or larger). Also, cranking up vm.nkmempages (with options
> NKMEMPAGES) to 64k will keep the system running even longer.

It could be that you really are just plain running out of kernel 
memory, especially with all the NICs you have in that box!
You might want to add KMEMSTATS (or whatever the option is called) to 
the kernel config, and then run "vmstat -m" repeatedly while driving 
the machine toward the crash.  That should indicate whether you're 
actually out of kernel memory or not....

I haven't had time to look at this in quite a while, but here's my 
take on what else might be going on:

 1) The kernel is trying to "do something" and runs out of "free 
pages".
 2) The pager then runs through a number of pages, and either frees 
them outright or schedules a paging operation on them (e.g. schedules 
a write of a page that contains the most recent directory contents or 
something).
 3) The number of pages "freed outright" (i.e. pages already marked 
PG_CLEAN) is very low (or zero).
 4) The device being paged to doesn't have a "malloc free" codepath, 
and it ends up waiting for free pages to do its thing -- DEADLOCK.
 5) The pagedaemon doesn't go looking for any more pages to free up, 
since it figures that between what it has already freed and what it 
has scheduled to be freed, it should have enough to be above its 
low-water mark.  That's why the freeze: the pagedaemon thinks it'll 
be getting more pages soon, and RAIDframe is unable to provide a path 
to get those dirty pages paged out.  (There's a rough sketch of this 
circular wait below.)
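
To make 4) and 5) concrete, here's a little userland sketch of that 
circular wait (all the names are invented -- this is obviously not 
kernel code).  The "driver" thread can't finish its write until pages 
get freed, and the "pagedaemon" thread won't free pages until that 
write finishes, so running it just hangs... which is the point:

/*
 * Userland sketch of the circular wait -- invented names, not kernel code.
 * Build with: cc -o hang hang.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static int free_pages  = 0;   /* step 1: no free pages left */
static int writes_done = 0;   /* pageouts the pagedaemon is waiting on */

/* Step 4: the write path needs memory (with "wait until available"
 * semantics) before it can issue the I/O. */
static void *driver_write(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (free_pages == 0)
        pthread_cond_wait(&cv, &lock);   /* never wakes up */
    writes_done++;                       /* would complete the pageout */
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Steps 3 and 5: nothing was freed outright, so the pagedaemon just
 * waits for the scheduled write to give the pages back. */
static void *pagedaemon(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (writes_done == 0)
        pthread_cond_wait(&cv, &lock);   /* never wakes up either */
    free_pages += 32;                    /* would let the driver proceed */
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, driver_write, NULL);
    pthread_create(&b, NULL, pagedaemon, NULL);
    pthread_join(a, NULL);               /* hangs: classic circular wait */
    pthread_join(b, NULL);
    printf("never reached\n");
    return 0;
}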

The "fix" to this (at least for RAIDframe!) is to pre-allocate the 
storage needed to do the IOs, and then change all 200+ "RF_Malloc()" 
places to request the appropriate memory chunks.  This is right on 
the top of my RAIDframe TODO list, but that, unfortunatly, is lower 
in priority right now than putting flooring in my basement :-/  
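
Roughly, what I have in mind looks something like this (again just a 
userland sketch with invented names and sizes -- not the actual 
RAIDframe code, and with locking left out):

/* All the malloc'ing happens once, at configure time; the I/O path only
 * ever takes chunks off a free list, so it never has to wait for the
 * pagedaemon. */
#include <stdlib.h>

#define RF_POOL_SIZE   128    /* enough descriptors for the deepest queue */
#define RF_CHUNK_BYTES 4096

struct rf_chunk {
    struct rf_chunk *next;
    char             data[RF_CHUNK_BYTES];
};

static struct rf_chunk *rf_free_list;

/* Called once when the RAID set is configured, while malloc is still safe. */
int rf_pool_init(void)
{
    for (int i = 0; i < RF_POOL_SIZE; i++) {
        struct rf_chunk *c = malloc(sizeof(*c));
        if (c == NULL)
            return -1;
        c->next = rf_free_list;
        rf_free_list = c;
    }
    return 0;
}

/* The I/O path takes and returns chunks without touching malloc at all. */
struct rf_chunk *rf_pool_get(void)
{
    struct rf_chunk *c = rf_free_list;
    if (c != NULL)
        rf_free_list = c->next;
    return c;    /* NULL means "queue the request and retry later" */
}

void rf_pool_put(struct rf_chunk *c)
{
    c->next = rf_free_list;
    rf_free_list = c;
}

int main(void)
{
    if (rf_pool_init() != 0)
        return 1;
    struct rf_chunk *c = rf_pool_get();    /* no malloc on the I/O path */
    /* ... build the write using c->data, issue it, wait for it ... */
    rf_pool_put(c);
    return 0;
}

The point is that rf_pool_get() never sleeps waiting for memory, so 
the write path no longer depends on the pagedaemon.  The ugly part is 
teaching all of those RF_Malloc() call sites to ask the pool for the 
right-sized chunk instead.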

In theory, however, this problem can happen for any underlying device 
that doesn't have a "malloc free" path... and I believe that some of 
the softdep codepaths aren't "malloc free" either.  I've been talking 
with another guy about a "general solution" to the problem (basically, 
that the page cleaner needs to make sure it frees a certain number (or 
percentage) of PG_CLEAN pages), and that *should* reduce or even 
eliminate the problem in all code paths...  I need more evidence that 
I'm barking up the right tree though :)  (and I haven't had time to 
dig far enough :( )
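
The shape of that idea, in another made-up userland sketch (nothing 
like the real UVM code, just an illustration): the cleaner refuses to 
stop at its target until some minimum number of pages have actually 
been freed outright, instead of counting scheduled pageouts as good 
as freed:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NPAGES             64
#define MIN_FREED_OUTRIGHT 16    /* the new guarantee */

struct page { bool in_use; bool clean; };

static struct page pages[NPAGES];

static void free_page(struct page *pg)        { pg->in_use = false; }
static void schedule_pageout(struct page *pg) { (void)pg; /* hand to device */ }

static void page_cleaner_pass(size_t target)
{
    size_t freed_outright = 0, scheduled = 0;

    for (size_t i = 0; i < NPAGES; i++) {
        /* Only stop early once the *outright* minimum is met, even if
         * freed + scheduled already looks big enough on paper. */
        if (freed_outright >= MIN_FREED_OUTRIGHT &&
            freed_outright + scheduled >= target)
            break;
        if (!pages[i].in_use)
            continue;
        if (pages[i].clean) {            /* PG_CLEAN: reclaim it right now */
            free_page(&pages[i]);
            freed_outright++;
        } else {                         /* dirty: a pageout is only a promise */
            schedule_pageout(&pages[i]);
            scheduled++;
        }
    }
    printf("freed outright: %zu, scheduled: %zu\n", freed_outright, scheduled);
}

int main(void)
{
    for (size_t i = 0; i < NPAGES; i++) {
        pages[i].in_use = true;
        pages[i].clean = (i % 4 == 0);   /* mostly dirty, like the bad case */
    }
    page_cleaner_pass(32);
    return 0;
}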

I'm also not sure why I've never run into this problem on either my 
test boxes or my production boxes... I'm guessing that I've never 
filled up memory with 99% dirty pages....

> > Both of these are on 1.6 boxes... (Hmm.. I wonder if something has 
> > changed since 1.6 that is causing problems in low-kernel-memory 
> > conditions...)
> 
> 
> I have tried running kernels 1.6, 1.6.1 and now 1.6-current (1.6W and
> 1.6Z). They have all given the exact same behaviour!

:(  'vmstat -m' on my main box says that RAIDframe has used at most 
957K (3 RAID 1 sets, and 1 RAID 5 set).  And I've been building source 
trees, making ISOs, dropping 40GB disk images on the set, and generally 
not worrying about abusing it, all without a single problem....

Later...

Greg Oster