Subject: Re: cluster install (was Re: Welcome to tech-cluster)
To: MLH <mlh@goathill.org>
From: John P. Campbell <jpc@xzuberant.com>
List: tech-cluster
Date: 10/21/2003 21:00:24
On Tue, Oct 21, 2003 at 08:34:14PM -0500, MLH wrote:
> > I'm very interested in following this list to see what people are
> > doing.  I write cluster management software for a living and am always
> > looking to see how things can be improved.  We currently only support
> > Linux, but I'm hoping we can get some customers to press us into the
> > *BSD markets as well as that is what I am mostly familiar with.
> 
> Any chance you might take the time to delineate typical problems
> that you see and how you typically solve them without jeopardizing
> your business plan?

No problem.  This is info we give potential customers every day.
I don't want this to sound like a sales pitch.  I'm a developer, not a
salesman.  My only motivation in this group is to learn and share.

Typical issues we see from customers and potential customers:

1) What do they mean when they say, "I want a cluster?"

You may laugh, but it's a valid point.  "Cluster" is so many things to
so many people.  Is it a bunch of web servers behind a load balancer?
Is it Oracle9i scaled across multiple nodes?  Is it an HPC compute
cluster running a DRM (distributed resource manager) like PBS/LSF/SGE?
Is MPI involved?  There are so many things out there.  Many people
know exactly what they want and why, many don't.  Just ask 10 cluster
admins/users/developers what Beowulf means.  You'll get a few answers
for sure.

2) How to build/configure the cluster.

We deal with a lot of bio-scientists who know a little about Unix.
They may have a part-time admin to help them out, but they really
don't want to spend a lot of cycles installing Linux, editing the LSF
config files, creating users, setting up NFS, etc.  They are paid to
research genetics and want to focus on that.

We maintain a single configuration on a management node and keep
each node's configuration synchronized via an agent that runs on each
node.  The architecture is smart enough that when a node starts up it
receives its configuration and takes on the role of master or compute
node, with the same configuration as everyone else.
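To sketch the synchronization idea (the URL, file path, and function names
here are made up for illustration, not our actual product):

```python
import hashlib
import urllib.request

# Hypothetical endpoint and path -- illustrative only.
MGMT_URL = "http://mgmt-node/config"
LOCAL_CONF = "/etc/cluster/node.conf"

def local_digest(path):
    """Hash the node's current config so we only rewrite it on change."""
    try:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    except FileNotFoundError:
        return None

def sync_config():
    """Pull the master copy and apply it if it differs from ours."""
    master = urllib.request.urlopen(MGMT_URL).read()
    if hashlib.sha256(master).hexdigest() != local_digest(LOCAL_CONF):
        with open(LOCAL_CONF, "wb") as f:
            f.write(master)
        return True   # changed -- the agent can now restart services
    return False
```

The agent just runs something like this in a loop; comparing digests keeps
the common case (nothing changed) cheap.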

We also abstract the required components to build a cluster into a
simple web form.  The user fills in a name, interconnect, os image(s)
(tgz files), nodes, etc.  Then when "done" is clicked the system goes
out and images all the nodes of the cluster (if needed) and as they
boot, they receive the proper cluster configuration.
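Roughly, the web form just captures a cluster spec that the system expands
into imaging actions.  Something like this (field names and the helper are
hypothetical, for illustration):

```python
# Hypothetical shape of what the form captures -- values are made up.
cluster_spec = {
    "name": "bio1",
    "interconnect": "gigE",
    "images": {"compute": "rh9-compute.tgz", "master": "rh9-master.tgz"},
    "nodes": [{"host": f"node{i:03d}", "role": "compute"}
              for i in range(1, 17)],
}

def provisioning_plan(spec):
    """Expand the spec into per-node (host, image) imaging actions."""
    return [(n["host"], spec["images"][n["role"]]) for n in spec["nodes"]]
```

Clicking "done" amounts to walking a plan like this, imaging each node, and
handing it its configuration as it boots.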

3) How is my cluster doing? 

Note that this isn't "How are the 1024 nodes in my cluster doing?".
There is a subtle but important difference here.  Scientists don't
want to scroll through a Big Brother-like matrix of statuses.  They
want aggregated, at-a-glance cluster monitoring.  If they need to
drill down to detail, they can.
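The aggregation itself is simple; the point is what you show first.  A toy
rollup (sample data and names are invented):

```python
# Hypothetical per-node samples; in practice these come from node agents.
samples = [
    {"node": "node001", "state": "up", "load": 0.9},
    {"node": "node002", "state": "up", "load": 3.2},
    {"node": "node003", "state": "down", "load": None},
]

def cluster_summary(samples):
    """Roll per-node status up into one at-a-glance line."""
    up = [s for s in samples if s["state"] == "up"]
    avg = sum(s["load"] for s in up) / len(up) if up else 0.0
    return {"nodes_up": len(up),
            "nodes_total": len(samples),
            "avg_load": round(avg, 2)}
```

The scientist sees "2/3 up, avg load 2.05" and only clicks through to the
per-node matrix when something looks wrong.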

4) User/group creation and maintenance.

An obvious problem that requires tools like LDAP, NIS, or really
efficient scripts to solve.
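For the "really efficient scripts" route, the core of it is just turning a
roster into per-node commands (roster format and UIDs here are invented):

```python
# Hypothetical roster; in real life this might live in LDAP, NIS,
# or a flat file on the management node.
roster = [("alice", 1001, "genomics"),
          ("bob",   1002, "genomics")]

def useradd_commands(roster):
    """Emit the shell commands an agent would run on every node."""
    return [f"useradd -u {uid} -g {grp} -m {name}"
            for name, uid, grp in roster]
```

Pinning the UID/GID explicitly is the important part -- it keeps ownership
consistent across NFS-mounted home directories on every node.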

5) How big should my cluster be? 

Honestly, it's rarely a function of what they need, but of how much
money they have to spend.  People tend to spend what they have in
their budget/grant and haven't really spent a lot of effort estimating
optimal cluster size.

These are just a few.  I didn't even touch on job submission, application
support, and high availability.  

Hope this helps with future discussions!

jpc