Subject: Re: remote job execution with make
To: Frederick Bruckman <fb@enteract.com>
From: Laine Stump <lainestump@rcn.com>
List: current-users
Date: 01/02/2000 18:23:39
At 08:36 AM 1/2/00 -0600, Frederick Bruckman wrote:
>What would be really neat would be to export jobs with "rsh"
>or "ssh".

I haven't tried using rsh, but my sense is that it would add a lot of
overhead when each job is fairly short (such as compiling a single .c
file). export (the part of pmake that handles this) uses RPC, and that
seemed to be one of the bottlenecks when we put more than 7 machines in
the cluster. Something that opened a single socket to each remote machine
at the beginning of the make and reused it throughout the build would
probably be more efficient.
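
As a rough illustration of the "one socket per machine, reused for the
whole build" idea, here is a minimal Python sketch. It is not how
pmake/customs or export works: the HostConnections class, the
newline-terminated command/reply protocol, and the port number are all
assumptions made up for the example.

```python
import socket


class HostConnections:
    """Keep one open connection per remote host and reuse it for every job.

    Hypothetical sketch: the line-based protocol here is invented for
    illustration and is unrelated to the RPC used by customs.
    """

    def __init__(self, connect=socket.create_connection):
        self._connect = connect   # callable taking a (host, port) tuple
        self._conns = {}          # (host, port) -> connected socket
        self.opened = 0           # number of connections actually opened

    def get(self, host, port):
        """Return the cached socket for this host, opening it on first use."""
        key = (host, port)
        if key not in self._conns:
            self._conns[key] = self._connect(key)
            self.opened += 1
        return self._conns[key]

    def run_job(self, host, port, command):
        """Send one newline-terminated command, read one newline-terminated reply."""
        sock = self.get(host, port)
        sock.sendall(command.encode() + b"\n")
        reply = b""
        while not reply.endswith(b"\n"):
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("remote side closed the connection")
            reply += chunk
        return reply.decode().rstrip("\n")

    def close(self):
        """Close every cached connection at the end of the build."""
        for sock in self._conns.values():
            sock.close()
        self._conns.clear()
```

With something like this, a build that sends hundreds of short compile
jobs to the same host pays the connection setup cost once per host rather
than once per job, e.g. pool.run_job("hostA", 9000, "cc -c foo.c") called
repeatedly on a single pool object.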

The real trick is balancing the load across all the machines, making
sure none of them is underused and none of them is swamped. This is what
pmake/customs is worst at. The default algorithm always picked the first
machine in the list that had fewer than "x" remote jobs currently running,
which meant that one machine would fill up completely before the next got
any work, so the last of "n" machines would never get any jobs unless
make's -j was set to at least (x * (n - 1)) + 1. We modified the algorithm
to distribute the jobs round robin (or maybe it was randomly, I forget
right now) and that worked much better, but we were still starved for jobs
when we had more than a few machines: the make process itself (running on
a machine that also had compile commands running) didn't have high enough
priority to keep all the machines consistently busy. Nice'ing all the
commands in the makefiles (so that make got higher priority) helped
somewhat, but %idle on the machines still bounced around a lot (possibly
due to the delay in reporting CPU usage back to the customs daemon on the
machine running make).

In the end, though, we were able to push our 7 machines hard enough that
we were limited by a semaphore locking bug in NFS, which would eventually
make the work directory inaccessible to some remote machines until we
rebooted. (I filed a PR on this 1 1/2 years ago which is still open; see
http://www.NetBSD.org/cgi-bin/query-pr-single.pl?number=5681.
Unfortunately, I no longer have any way of testing whether the problem
still exists...)
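
To make the load-balancing point concrete, here is a small Python sketch
comparing the "first machine with fewer than x jobs" rule against round
robin. It is only a toy model, not the customs code: it places a fixed
number of simultaneous jobs (the -j value) and ignores jobs finishing, and
the function names and the choice of x = 3 are assumptions for
illustration.

```python
def first_fit(running, limit):
    """Index of the first machine with fewer than `limit` jobs, or None."""
    for i, jobs in enumerate(running):
        if jobs < limit:
            return i
    return None


def assign(n_machines, limit, n_jobs, policy):
    """Place n_jobs simultaneous jobs and return the per-machine job counts.

    policy is "first_fit" (the default behaviour described above) or
    "round_robin" (start searching after the last machine used).
    """
    running = [0] * n_machines
    cursor = 0  # next machine to try under round robin
    for _ in range(n_jobs):
        if policy == "first_fit":
            i = first_fit(running, limit)
        else:
            i = None
            for step in range(n_machines):
                cand = (cursor + step) % n_machines
                if running[cand] < limit:
                    i = cand
                    cursor = cand + 1
                    break
        if i is None:
            break  # every machine is already at its limit
        running[i] += 1
    return running


n, x = 7, 3  # 7 machines; x = 3 jobs per machine is an assumed value
print(assign(n, x, x * (n - 1), "first_fit"))      # [3, 3, 3, 3, 3, 3, 0]
print(assign(n, x, x * (n - 1) + 1, "first_fit"))  # [3, 3, 3, 3, 3, 3, 1]
print(assign(n, x, x * (n - 1), "round_robin"))    # [3, 3, 3, 3, 2, 2, 2]
```

With 7 machines and x = 3, first fit leaves the last machine idle until -j
reaches (3 * 6) + 1 = 19, while round robin gives every machine work at
the same -j 18.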