tech-cluster: cluster install (was Re: Welcome to tech-cluster)

Subject: cluster install (was Re: Welcome to tech-cluster)
To: None <tech-cluster@NetBSD.org>
From: MLH <mlh@goathill.org>
List: tech-cluster
Date: 10/21/2003 11:46:36
Jan Schaumann wrote:

> "Aaron J. Grier" <agrier@poofygoof.com> wrote:
>> On Mon, Oct 20, 2003 at 11:07:36PM -0400, Jan Schaumann wrote:
>> 
>> > (For starters, I'll toss in the cluster I'm administering at work --
>> > http://guinness.cs.stevens-tech.edu/~jschauma/hpcf/ -- if you have
>> > questions regarding this setup, feel free to post here.)
>> 
>> how do you replicate the setup on multiple nodes?  a master HD image and
>> g4u or something similar?
> 
> Actually, we're using rsync -- the nodes run rsyncd on the internal
> interface.  The main file server has the image installed in
> /usr/local/node, so we can easily upgrade by building into this
> location.  For each node, there are two files that differ (/etc/rc.conf
> and /etc/inetd.conf as they contain the IP address), which are rsynced
> in a subsequent pass.
> 
> Since the drives on the nodes are mounted read-only, the rsync script we
> use has the following steps:
> 
> - run any initial commands on the remote host, taken from a regular file
>   if it exists
> - re-mount all partitions read-write
> - rsync everything
> - rsync special etc files
> - run any post-commands on the remote host, taken from a regular file if
>   it exists
> - re-mount all partitions read-only
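
For anyone following along, here is a rough sketch of how I read the
steps above. It is not Jan's script. It only builds the ordered
command list for one node, so it can be reviewed as a dry run. The use
of ssh for the remote commands, the rsyncd module name "root", the
per-node special directory, and the exact mount flags are all my
assumptions.

```python
import os
import shlex

# The two per-node files named above; they are skipped in the bulk pass.
SPECIAL_FILES = ("/etc/rc.conf", "/etc/inetd.conf")


def read_commands(path):
    """Return the commands listed in a regular file, or [] if it is absent."""
    if not os.path.isfile(path):
        return []
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def build_sync_plan(node, image_dir="/usr/local/node", module="root",
                    special_dir=None, partitions=("/",),
                    pre_cmds=(), post_cmds=()):
    """Build the ordered command list (argv lists) for updating one node."""
    def remote(cmd):
        # Assumption: remote commands are run over ssh.
        return ["ssh", node, cmd]

    dest = f"rsync://{node}/{module}/"  # hypothetical rsyncd module name
    plan = [remote(c) for c in pre_cmds]
    # Re-mount read-write (assumed flags: -u update, -w read-write).
    plan += [remote("mount -u -w " + shlex.quote(p)) for p in partitions]
    # Bulk pass: everything except the per-node files.
    excludes = [f"--exclude={f}" for f in SPECIAL_FILES]
    plan.append(["rsync", "-a", "--delete", *excludes,
                 image_dir.rstrip("/") + "/", dest])
    # Second pass: the per-node etc files, if a directory was given.
    if special_dir:
        plan.append(["rsync", "-a", special_dir.rstrip("/") + "/",
                     dest + "etc/"])
    plan += [remote(c) for c in post_cmds]
    # Re-mount read-only in reverse order, so "/" goes last.
    plan += [remote("mount -u -r " + shlex.quote(p))
             for p in reversed(partitions)]
    return plan
```

Printing the plan for a node before handing each entry to
subprocess.run() would make it easy to check the ordering first.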

What might be the chances of letting us see/work on what you have?

I've mentioned this a couple of times in the past, but we have a
fairly large and growing AMD MP cluster (over 600 dual Athlons)
that we are running Solaris x86 on, for some fairly simplistic and
historical reasons along with one more problematic reason: NIS+.

This is one of the clusters set up at Southwest Foundation for
Biomedical Research. It is primarily used for fine mapping of QTNs
(Quantitative Trait Nucleotides) - genetic linkage analysis between
genotypes and phenotypes. For example, the latest run used 270+ CPUs
for 1.7 million models, scheduled over 33,000 long-running parallel
jobs, and took 51 hours (we write all of the analysis software
in-house).

We want to see NetBSD boxes included in this cluster, but I want
to do this in an organized fashion. Right now we get the boxes from
M&A with Solaris installed, so we 'just' have to power up and configure
them. Pretty painful when you have 100+ boxes at a time to set up.

>> I have visions of making a NetBSD equivalent to kickstart via a mix of
>> netbooting, auto-install scripts, and cfengine, but am not sure where to
>> start...
> 
> That would be interesting.  I never used kickstart, but I guess you'd
> start by booting a kernel from the network via dhcp/tftp, nfs-mount the
> root filesystem (or extract the sets from wherever if the client has a
> disk).  Or something.

What I'd really like to see would be a boot floppy that we could
insert which had the (easily configurable) machine name, IP,
gateway, master host and other configurable info on it, so we could
just pop the floppy in, without hooking up keyboard and monitor,
and have it come up, install NetBSD, and configure itself. It would
be very nice to be able to set the config info on the floppy from
a master machine, and to be able to take a handful of floppies, insert
them in the appropriate boxes, and have them come up. Then edit the
hostname/IP on each of the floppies and do another batch, or
something like that. DHCP won't work for us unless we could determine
a way to sort out a number of technical and non-technical issues.
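
To make the floppy idea a bit more concrete, here is a minimal sketch
of the master-machine side: it renders one small key=value file per
node, which would then be copied onto that node's floppy. The file
name install.conf, the field names, and the one-directory-per-host
layout are all made up for illustration. Nothing here exists yet, and
the installer that would read the file is not shown.

```python
import ipaddress
import os
from dataclasses import dataclass, asdict


@dataclass
class NodeConfig:
    """Per-node settings that would live on the boot floppy (hypothetical)."""
    hostname: str
    ip: str
    netmask: str
    gateway: str
    master: str  # master host name or address; not validated here


def render_config(cfg):
    """Render one node's settings as key=value lines, checking the addresses."""
    for field in ("ip", "netmask", "gateway"):
        # Raises ValueError if the value is not a valid IP address.
        ipaddress.ip_address(getattr(cfg, field))
    return "".join(f"{key}={value}\n" for key, value in asdict(cfg).items())


def parse_config(text):
    """Parse key=value lines back into a dict (what an installer might do)."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def write_batch(nodes, outdir):
    """Write <outdir>/<hostname>/install.conf for each node; return the paths."""
    paths = []
    for cfg in nodes:
        node_dir = os.path.join(outdir, cfg.hostname)
        os.makedirs(node_dir, exist_ok=True)
        path = os.path.join(node_dir, "install.conf")
        with open(path, "w") as fh:
            fh.write(render_config(cfg))
        paths.append(path)
    return paths
```

Doing "another batch" would then just mean calling write_batch() with
the next list of hostnames and addresses and re-copying the files.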

I also have a start on porting SGE (Grid Engine) to NetBSD.  We've
likely gone about as far as our abilities allow and would appreciate
some help if possible.

Beyond that, I'd really like to see some automated maintenance
tools and some help in tuning the cluster.

If we could locate some assistance in an organized fashion, I
believe we are in a good position to obtain some grant funding to
work on this, but we don't have the technical expertise to design
this. Anyone interested and in a position to work with me on writing
a grant proposal? All the work would be publicly available (we
are edu).

I am also attempting to arrange a position to get some help in
this area, but this would only be a small part of the job that
might be made available. Workflow/database development would
provide the major funding for such a position. If you are interested,
let me know.