Subject: kern/7714: Complete system lockup on exemplary fork/vm bomb
To: None <gnats-bugs@gnats.netbsd.org>
From: Matthias Buelow <mkb@altair.mayn.de>
List: netbsd-bugs
Date: 06/05/1999 14:50:51
>Number:         7714
>Category:       kern
>Synopsis:       Complete system lockup on exemplary fork/vm bomb
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jun  5 14:50:00 1999
>Last-Modified:
>Originator:     Matthias Buelow
>Organization:
	nil
>Release:        1.4
>Environment:
System: NetBSD altair.mayn.de 1.4 NetBSD 1.4 (ALTAIR) #9: Sun May 16 20:38:20 CEST 1999 mkb@altair.mayn.de:/usr/src/sys/arch/i386/compile/ALTAIR i386


>Description:

Oh well, my favourite fork/eatswap-bomb shell script just downed
my 1.4 system in a test run.

The following sh loop:

while :; do
	grep asdf /dev/zero &
done

locks up my system completely in under one second
(K6/233, 64M RAM + 256M swap).  As soon as swapping starts, the
system is dead, and it stayed dead for well over 15 minutes after
the disk thrashing had stopped, at which point I power-cycled the
machine.

I am well aware of user limits.  I know that by setting limits
sensibly, an administrator can prevent users from crashing the
system like this.  However, it is my opinion that the VM system
should be able to recover from a situation like this.
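
For reference, this is roughly what such a limit looks like from the
shell.  It is a minimal sketch that assumes a Bourne-style shell.  The
helper name limited_run and the 65536 KB value are made up for
illustration and are not a recommendation:

```sh
#!/bin/sh
# Hypothetical helper: run a command in a subshell whose data segment
# size is capped.  The limit applies only to that subshell and its
# children, not to the calling shell.
# ulimit -d takes kilobytes and exists in both NetBSD sh and bash.
# The process-count limit uses a shell-specific flag (bash calls it -u),
# so check your shell's ulimit documentation for it.
limited_run() {
    (
        ulimit -d 65536 || exit 1   # 64 MB, arbitrary example value
        "$@"
    )
}
```

For example, `limited_run sh -c 'ulimit -d'` prints 65536.  This
bounds each process's data segment on systems where malloc grows it.
It does not bound the number of processes, which is why the process
limit is needed as well.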

I had hoped that with the shiny new UVM such a thing would not
happen; I must admit I was quite disappointed when the system fell
into the pit so easily.

A situation like the above must not happen, imho, if NetBSD wants
to live up to its reputation for being rock-stable.  How stable is
a system that an unprivileged user can crash with an ordinary
shell command?

>How-To-Repeat:
Remove any user limits on maximum memory and number of processes
(or try on an out-of-the-box NetBSD system) and type the following
into sh:

while :; do
	grep asdf /dev/zero &
done

>Fix:
Setting user limits is a workaround, not a fix.

In an out-of-virtual-memory condition, kill off the largest processes.
If the problem persists, also kill the parents of those large
processes that were spawned recently or grew quickly (except for
init, of course).  This is brutal but should be fairly effective.
Preferably the removal should be a short-circuited kill, so that no
part of the process has to be paged back in and the process does not
have to be scheduled just to handle the signal.  I thought such
measures were already applied when swap runs out, but obviously not
very effectively.
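
A rough userland illustration of the victim-selection part of this
policy follows.  It is a sketch only, because the real thing would have
to live in the kernel.  The helper names pick_victim and parent_of are
hypothetical.  Both read "pid ppid rss" lines and never select init
(PID 1):

```sh
#!/bin/sh
# Hypothetical illustration of the proposed policy, not kernel code.
# Input: one "pid ppid rss" line per process (rss in kilobytes).

# pick_victim: print the PID with the largest RSS, skipping init.
pick_victim() {
    awk 'BEGIN { max = -1 }
         $1 != 1 && $3 > max { max = $3; pid = $1 }
         END { if (max >= 0) print pid }'
}

# parent_of PID: print the parent of PID, unless that parent is init.
parent_of() {
    awk -v p="$1" '$1 == p && $2 != 1 { print $2 }'
}
```

On a ps that supports the POSIX -A and -o options,
`ps -A -o pid= -o ppid= -o rss= | pick_victim` prints the current
candidate.  The "spawned recently or grew quickly" criterion is left
out, since it would need samples taken over time.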
>Audit-Trail:
>Unformatted: