current-users: Swapping problems (was: Re: 1.2 features, again)

Subject: Swapping problems (was: Re: 1.2 features, again)
To: None <current-users@NetBSD.ORG>
From: Laine Stump <laine@morningstar.com>
List: current-users
Date: 06/27/1996 06:27:32
>> Is the VM leak fixed in 1.2?  This one's really starting to annoy me.

>You can "solve" this problem by getting more RAM and never using any swap.
>Of course, then you may be hit by the bounce-buffer-lack problem, which
>hasn't been fixed, either...  But perhaps you're lucky enough not to be
>bitten by the serial port problem, which won't be fixed by 1.2... ;-) :-(

I just had to do that today, although for the "other swap problem" - we
have 3 P166s with dual 4GB F-W SCSI drives, and are using them with
"customs" from pmake (Parallel make) along with an associated gmake
patch to allow load-sharing a make (about 750 source files in our case)
across multiple machines. In the past (on a single machine with a single
IDE drive) we had seen problems with the machine dying for about 20
seconds when memory usage went over the limit of real memory, and hoped
that putting swap interleaved on two fast/wide SCSIs would help. It
didn't.

What we found was that, as soon as memory usage on any machine hit 64MB
(the amount of RAM we had installed in each), that machine would lock up
(disk whacking away, but idle time display in systat going down to 0%)
for just about the same amount of time, during which the customs daemons
(whose job is to keep the machines apprised of each other's load
average status and accept requests to perform tasks) on the other
machines would decide that the machine was dead, and stop trying to send
it jobs. After this, the machine would not be sent any jobs for at least
a minute, until the customses were able to coordinate again. This
affects the time for the make enough that most of the parallel
processing potential is lost, and also makes the system extremely
fragile to what else is running (i.e., if someone has a 13MB emacs and
you've set all the load factors not accounting for that, you're
screwed).

After installing 128MB in 2 of the 3 machines, and turning the
acceptable load way down on the third, we were able to get some
acceptable numbers. Of course, it cost us an extra $1000.

For those interested in the numbers: this build on a single P166 took
about 900 seconds; on 3 machines that had to swap, it took about 680. On
3 machines crammed to the max with memory, it took 392 seconds.
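
To put those timings in perspective, here is a quick sketch that turns
them into speedup and per-machine efficiency. It assumes the 900-second
single-machine build is the baseline; the function and variable names
are made up for illustration, and the timings are the approximate ones
quoted above.

```python
# Approximate build times in seconds, taken from the figures above.
SINGLE_MACHINE = 900
THREE_SWAPPING = 680
THREE_FULL_RAM = 392
MACHINES = 3


def speedup(baseline, parallel):
    """How many times faster the parallel build is than the baseline."""
    return baseline / parallel


def efficiency(baseline, parallel, machines):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup(baseline, parallel) / machines


if __name__ == "__main__":
    cases = (
        ("3 machines, swapping", THREE_SWAPPING),
        ("3 machines, enough RAM", THREE_FULL_RAM),
    )
    for label, seconds in cases:
        s = speedup(SINGLE_MACHINE, seconds)
        e = efficiency(SINGLE_MACHINE, seconds, MACHINES)
        print(f"{label}: {s:.2f}x speedup, {e:.0%} efficiency")
```

By this rough measure, the swapping cluster gets about 1.3x out of three
machines, while the same cluster with enough memory gets about 2.3x.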

Having swap that kicked in gracefully, rather than with a brick-smash on
the forehead, would be *really* nice (and would make NetBSD appear to be
more stable under load). What can someone who has no time and very
little knowledge about the VM code do to help get this fixed? Pretty
please? I can live with the leak, by the way, but this dying does me
absolutely no good when trying to defend myself against the inhouse
Linux-heads and SunOS-worshippers (this is an experimental system we're
trying to get approved for real use).

(By the way, one of these machines was running stock NetBSD 1.1; the
other two were running the latest O---BSD (dare I admit it?), because
that was the simplest way to get a snapshot of approximately NetBSD
1.2A, which we needed for the improved 2940 driver and some changes in
ccd. However, I believe the VM code in O---BSD and NetBSD current
is equivalent, right?)

Laine Stump
Ascend Communications