current-users: Re: Multiprocessor builds anyone?

Subject: Re: Multiprocessor builds anyone?
To: Andrew Gillham <gillham@vaultron.com>
From: Laine Stump <lainestump@rcn.com>
List: current-users
Date: 08/07/2000 13:56:08

At 09:13 PM 8/5/00 -0400, Andrew Gillham wrote:
>Have you looked at 'pkgsrc/parallel/clusterit' at all?  It allows you to
>do something like this using the normal build process.  If you create a
>small shell script, I call it 'dcc', to invoke clusterit, you can end up
>doing something like this:
># cd /usr/src
># make CC='dcc' -j 4 build
>
>This will end up distributing load out across your "cluster" by spawning
>jobs off to run parallel whenever possible.  Even though make thinks they
>are all running locally it will end up working ok as long as each node
>mounts the exact same /usr/src tree.
>
>What I have been using is a single "master" NFS server that exports its
>/usr to each node.  This assures that each node has the same toolchain,
>utilities, and source tree.

It would be great if some people started doing large parallel builds with 
this, if for no other reason than to shake out NFS problems. From sometime 
in 1996 until Nov. 1998, I was working in a shop where we did cross 
compiles for embedded systems using a cross gcc on a cluster of 8 PPro and 
PII systems running NetBSD 1.3.1 (and later 1.3.3) with a common work 
directory NFS mounted on all the machines. We used pmake "customs" to farm 
the jobs out to the machines, and found that if we allowed too many 
processes at once (I think we got into trouble at around 40 or 50 
processes), the NFS server code would end up in a deadlock related to the 
inode of the work directory: nobody on any of the machines could access 
anything in that directory until we rebooted the server. For more details, 
see:

    http://www.NetBSD.org/cgi-bin/query-pr-single.pl?number=5681

This PR is still open, but I haven't had access to a hardware setup that 
allows me to test it in nearly 2 years, so I can't say if the problem still 
exists. If someone with lots of machines started doing parallel builds of 
NetBSD, that would be *exactly* the scenario needed to reproduce the bug.
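
For anyone without a full build farm handy, the kind of load described above 
(several dozen processes hitting one shared work directory at once) can be 
roughly approximated with a small script. This is only a sketch under 
assumptions: WORKDIR, NPROCS and ROUNDS are illustrative names that appear 
nowhere in the PR, the default directory is a local one under /tmp, and 
nothing here is known to trigger the deadlock. To resemble the original 
setup, WORKDIR would have to point at the NFS-mounted directory and the 
script would have to run on several clients at the same time.

```shell
#!/bin/sh
# Hypothetical stress sketch: NPROCS concurrent workers create and remove
# files in a single shared directory. All names here are illustrative.
WORKDIR=${WORKDIR:-/tmp/nfs-stress.$$}   # assumption: set to the NFS mount
NPROCS=${NPROCS:-50}                     # roughly where trouble was seen
ROUNDS=${ROUNDS:-20}                     # create/remove cycles per worker

mkdir -p "$WORKDIR" || exit 1

worker() {
    id=$1
    i=0
    while [ "$i" -lt "$ROUNDS" ]; do
        f="$WORKDIR/w$id.$i"
        # Each cycle modifies the directory itself, not just file contents.
        echo "worker $id round $i" > "$f" && rm -f "$f"
        i=$((i + 1))
    done
    # Leave a marker so completed workers can be counted afterwards.
    echo done > "$WORKDIR/done.$id"
}

n=0
while [ "$n" -lt "$NPROCS" ]; do
    worker "$n" &        # each worker runs in its own background subshell
    n=$((n + 1))
done
wait                     # returns only if no worker is stuck

finished=$(ls "$WORKDIR" | grep -c '^done\.')
echo "$finished of $NPROCS workers finished"
```

If the server wedges the way it did for us, the "wait" would presumably never 
return and no count would be printed; a count lower than NPROCS would mean 
some workers failed before writing their marker file.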