Subject: Re: userid partitioned swap spaces.
To: NetBSD Kernel Technical Discussion List <tech-kern@netbsd.org>
From: Ian Dall <Ian.Dall@dsto.defence.gov.au>
List: tech-kern
Date: 12/18/1998 11:49:24
Ignatios Souvatzis <is@jocelyn.rhein.de> writes:

  > All this talk about "overcommitting memory" confuses, at least in the text,
  > "memory" with "address space".

  > The 4.4BSD VM (and some others, if I read the thread right) allows to separate
  > the two.

  > To the very least, please don't mention sbrk() again. It's just a special
  > case of an unnamed mmap(), for our VM systems..

I think people have the view that if they call malloc() and check for
NULL, then they have "done the right thing" and their process should
be safe.
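
By way of illustration, this is the pattern I mean (a minimal sketch,
nothing NetBSD-specific about it):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main(void)
    {
        char *buf = malloc(1 << 20);    /* ask for 1MB */
        if (buf == NULL) {              /* the "right thing" */
            fprintf(stderr, "out of memory\n");
            return 1;
        }
        /*
         * With overcommit, failure may not show up until the pages
         * are actually touched, long after the check above passed.
         */
        memset(buf, 0, 1 << 20);
        free(buf);
        return 0;
    }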

Unfortunately it is not as simple as that. Stack space, for example,
can always grow, and there is no mechanism for a process to detect
and handle a failure to allocate stack space. Yes, you could always
allocate enough swap to cover the maximum possible stack size, but
that gets horribly conservative. The single-user box I am currently
typing this on has 37 processes and an 8MB per-process stack limit.
Maybe I am stingy, but the idea of nearly 300MB of swap just for the
stack allocations seems excessive to me! It gets worse when you allow
for pages which may be shared many times copy-on-write (COW), and for
userland threads, each of which needs its own stack. Variable-length
arrays in C9x are likely to exacerbate the problem, as sketched below.
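
To make that concrete, here is a minimal sketch (the function and its
name are made up purely for illustration): a C9x variable-length array
grows the stack frame, and there is simply no return value to check.

    #include <string.h>

    /*
     * Illustrative only.  If the stack cannot grow to hold the VLA,
     * the process just takes a fault; it never gets the chance to
     * "handle" the allocation failure the way it can with malloc().
     */
    static void
    frobnicate(size_t n)
    {
        int scratch[n];
        memset(scratch, 0, n * sizeof(scratch[0]));
    }

    int
    main(void)
    {
        frobnicate(4096);
        return 0;
    }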

Finally, it seems to me the goal is to prevent accidental or
deliberate DoS by consuming swap. Merely preventing overcommit does
mean that processes don't get killed arbitrarily, but it doesn't
prevent DoS by swap exhaustion: if nothing new can run, not even a
root login or top, ps or kill, then the system is pretty much
irretrievably wedged anyway.

I don't much like the SIGDANGER-followed-by-SIGKILL approach, at
least if it is implemented in the kernel, because the policy about
who gets killed is pretty arbitrary.  Relying on per-user swap is not
such a good idea either, because you can't take advantage of the law
of averages; i.e. you end up allocating much more swap (in total)
than you really need.  That's the problem with quotas as well. On
most systems with disk quotas there is, in practice, "over-allocation"
of disk space to users; they can't all use 100% of their quotas.

My high water mark idea has some advantages. No one has to be killed
for anything; processes are just put to sleep while root processes,
which are allowed to use the reserve swap, fix the problem. It puts
policy decisions like who to kill first (if anyone) in userland and
not in the kernel (which seems good to me). No code changes are
needed to trap new signals. Finally, there is precedent in the FFS
implementation, which keeps a free-space reserve that only root may
use. It provides better protection against DoS than simply preventing
overcommit. A rough sketch of what I mean follows.
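
By way of illustration only, the check in the swap reservation path
might look roughly like this. Every name here (swap_used, swap_hiwat,
swap_reserve() and the wait channels) is made up for the sketch; none
of it is existing NetBSD code.

    /*
     * Hypothetical high water mark check, in the spirit of the FFS
     * free-space reserve.  Pages above the high water mark are the
     * "reserve", and only root may allocate from them.
     */
    int
    swap_reserve(struct proc *p, long npages)
    {
        while (swap_used + npages > swap_hiwat) {
            if (p->p_ucred->cr_uid == 0)
                break;                  /* root may dip into the reserve */
            wakeup(&swap_hiwat);        /* poke the userland policy daemon */
            tsleep(&swap_used, PVM, "swapwait", 0);
        }
        swap_used += npages;
        return (0);
    }

The matching release path would decrement swap_used and do a
wakeup(&swap_used) once usage drops back below the mark, letting the
sleeping processes continue.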

I didn't specify it, but my scheme needs some mechanism for firing
off a process, or waking up an existing one, when swap usage reaches
the high water mark; something like the daemon sketched below.
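
The userland side might be as simple as this. The swap_wait() call is
purely hypothetical; it stands in for whatever notification interface
(a syscall, a device, a blocking sysctl, ...) we settle on.

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Hypothetical: blocks until swap usage crosses the high water
     * mark.  No such interface exists today.
     */
    extern int swap_wait(void);

    int
    main(void)
    {
        for (;;) {
            if (swap_wait() == -1) {
                perror("swap_wait");
                exit(1);
            }
            /*
             * Policy lives here, in userland: mail the admin, add
             * more swap, nice/stop/kill the worst offender, etc.
             */
            system("logger 'swap high water mark reached'");
        }
    }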

Ian