Subject: Re: userid partitioned swap spaces.
To: None <tech-kern@netbsd.org>
From: Greg A. Woods <woods@most.weird.com>
List: tech-kern
Date: 12/17/1998 15:28:18
[ On Thu, December 17, 1998 at 18:25:09 (+0100), Christoph Badura wrote: ]
> Subject: Re: userid partitioned swap spaces.
>
> IIRC, when we got 2.2 on the m68k boxen that included "demand paging"
> as opposed to swapping in the whole process on startup and after that.

That's what I thought....  (since of course "demand paging" implies
virtual memory -- i.e. processes can be bigger than physical RAM -- at
least in the context of AT&T Unix)

> This is somewhat inaccurate.  First, there wasn't a single source tree at
> AT&T^WUSL for SVR4.  Pretty much all the i386/PC versions of SVR4.0 that
> the various vendors produced (Dell, ESIX, Generics, Onsite) were derived
> from the same USL reference sources, which was a completely separate source
> tree from the 3B2 sources (different config mechanism, different source
> layout).

Well, yeah....  I was talking more about what came before SysVr4.  From
what I know of the inside workings of AT&T-USG/USL, the i386 sources
were a derivative of the 3B2 sources and primary development continued
first on the 3B2 and was then ported to the other platforms.  Of course
I learned that from 3B2 developers, so it may be a somewhat "slanted"
interpretation of the facts....  ;-)

>  Second, SunOS5 can't really be described as being SVR4.  They
> joined forces initially with USL but by the time 4.0.2 became available
> they had split again and were going their own way.

There's still a huge amount of SunOS-5 that's very similar to SysVr4,
especially at the user interface, API, and ABI levels.

> >I'm pretty sure all of this is covered clearly and in some detail in
> >Bach's "The Design of the Unix Operating System" (did I get that
> >right?).
> 
> Since Bach's book only covers SVR2.2 it is pretty unlikely that it documents
> any later developments.

Well, IIRC, Bach describes demand paging and COW and swap allocation
under those schemes, so that's all that's really relevant to this.

(Bach also describes some non-AT&T Unix implementations, though
sometimes not in as much detail.)

> I think that is the wrong approach.  The system shouldn't overcommit VM
> except when the application *explicitly* requests it.

Yup, that's a much better idea!  However it still means that the system
*can* over-commit VM, which means we still need some mechanism to
recover from that situation.  If the process(es) that ask for over-
commitment are true to their word then killing them won't necessarily
help recover (unless even with a small percentage of pages used they're
still one of the biggest users of VM).  Or will it?  I think it would
only help if you killed (with SIGKILL) *all* processes that had asked
for over-commitment (or at least until enough VM was recovered to reach
a sensible low-water-mark).  Once all "over-commit" requestors were dead
then the remaining processes would still be able to fill swap, but they
would
only get allocation and fork failures as per "normal" at that point.

Is a catchable SIGDANGER still the best way to do reasonably robust
recovery in such situations?  Perhaps so, because if you need to do
something drastic to recover then you should give processes a chance to
say "Hey!  Not me!  I'm *important*!".  The problem with SIGDANGER is
that using it requires additional programming, whereas the "never
over-commit" policy simply requires buying enough disk!  ;-)

Maybe the best compromise would be if the default is for SIGDANGER to be
ignored (and harmless) unless a process asks for over-commitment....  Of
course the name would have to be changed so as to not confuse portable
programs....  Perhaps it could be called SIGOOVM (Out-Of-VM).  It even
sounds painful! ;-)  By catching it you could clean up and exit
gracefully in order to help the system get out of danger (games could do
that!  ;-).

The combination of just catch-to-save-yourself SIGDANGER and a mechanism
to ask for VM over-commitment leads to a DoS crevasse, of course, but
maybe that crevasse can be paved over: if sending SIGDANGERs (and/or
SIGOOVMs) to everyone doesn't result in sufficient recovery, then
SIGKILL is sent to processes that asked for VM over-commitment until the
system has recovered....

Is killing the largest user of VM that doesn't trap SIGDANGER the best
"first attempt" recovery algorithm?  I would think so because it should
yield the quickest results with the least amount of extra work -- one
need only keep accounting for how many VM pages a process has actually
"touched".  This is also the surest way I can see to inflict the least
amount of damage on function of the system.  Any other more complex
policy, such as LRU or random walk, isn't likely to be more successful
and is likely to inflict more damage on the system (eg. require killing
of more processes).

Of course a per-user limit on allocation of swap with, or without,
permitting over-commitment may also be a good idea, but perhaps is a
separate issue as well.

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>