john.hearns at clustervision.com
Wed Feb 25 04:55:22 EST 2004
On Tue, 24 Feb 2004 raiders at phreaker.net wrote:
> We are on a project as described below:
> - IA32 linux cluster for general parallel programming
> - five head nodes, each head node will have about 15 compute nodes and
> dedicated storage
> - groups of cluster-users will be restricted to their own clusters normally
> (some exclusions may apply)
> - SGE/PBS, GbE etc are standard choices
> But the people in power want one single software or admin console (cluster
> toolkit?) to manage the entire cluster from one adm station (which may or may
> not be one of the head nodes).
Thinking about this, the way I would architect things is to stop thinking
of subclusters - yet of course give the users their allocation of
So, pick whichever cluster install method you prefer.
Have one admin/master node and install all 75 nodes.
Have 5 public facing machines, and have logins go through a load-balancer
or round robin. When a user logs in they get directed to the least loaded machine.
Why? If one machine goes down (fault or upgrade) the users still have four
machines. They don't "see" this as you have entries in the DNS for e.g.
necromancy.hogwarts defence-darkarts.hogwarts potions.hogwarts
all pointing the same way.
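To make the "least loaded" idea concrete, here is a rough sketch in Python. It is not taken from any real load balancer: the node names (login1..login5), the load figures and the function pick_login_node are all made up for illustration. It only shows the selection rule, and that users still land somewhere when one machine is out.

```python
# Hypothetical sketch: choose the login node for a new session.
# Node names and load figures are invented for illustration.

def pick_login_node(loads, down=()):
    """Return the name of the least loaded node that is still up.

    loads: dict mapping node name -> current load average
    down:  names of nodes that are out of service (fault or upgrade)
    """
    candidates = {name: load for name, load in loads.items() if name not in down}
    if not candidates:
        raise RuntimeError("no login nodes available")
    # Order by load first; break ties by name so the result is deterministic.
    return min(candidates, key=lambda name: (candidates[name], name))


loads = {"login1": 2.5, "login2": 0.4, "login3": 1.1, "login4": 3.0, "login5": 0.9}

first_choice = pick_login_node(loads)                    # all five machines up
fallback = pick_login_node(loads, down={"login2"})       # login2 taken out
print(first_choice, fallback)
```

Plain round-robin DNS would skip the load check altogether and just rotate through the five addresses.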
It would be better to have 5 separate storage nodes, but the login
machines in your scenario will have to do that job also. Just allocate
storage per group.
The 75 compute nodes are installed within the cluster.
Now, at a first pass you want to 'saw things up' into 15 node lumps.
This can be done easily - just put a queue or queues on each and allow
only certain groups access.
But I will contend this is a bad idea. Batch queueing systems have
facilities to look after fair shares of resources between groups.
Say you have the 5 separate groups scenario.
Say today Professor Snape isn't doing any potions work.
The 15 potions machines will lie idle, while there are plenty of jobs in
necromancy just dying to run.
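A quick back-of-envelope version of that scenario, as a Python sketch. The assumptions are mine: every job needs exactly one node, the queue lengths are invented, and "group4" and "group5" are placeholder names for the two groups not named above.

```python
# Hypothetical comparison: five fixed 15-node partitions vs one 75-node pool.
# Assumes one node per job; queue lengths are invented for illustration.

NODES_PER_GROUP = 15

def run_partitioned(pending):
    """Each group may only use its own 15 nodes. Returns (busy, idle) node counts."""
    total = NODES_PER_GROUP * len(pending)
    busy = sum(min(jobs, NODES_PER_GROUP) for jobs in pending.values())
    return busy, total - busy

def run_pooled(pending):
    """Any job may run on any node. Returns (busy, idle) node counts."""
    total = NODES_PER_GROUP * len(pending)
    busy = min(sum(pending.values()), total)
    return busy, total - busy

pending = {
    "potions": 0,            # Professor Snape is away today
    "necromancy": 40,        # plenty of jobs waiting
    "defence-darkarts": 15,
    "group4": 10,            # placeholder group
    "group5": 5,             # placeholder group
}

print("partitioned (busy, idle):", run_partitioned(pending))
print("pooled      (busy, idle):", run_pooled(pending))
```

With these made-up numbers the partitioned layout leaves 30 of the 75 nodes idle while 25 necromancy jobs wait; the pooled layout runs all 70 jobs and leaves 5 nodes idle.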
Use the fairshare in SGE or LSF.
Each group will get their allocated share of CPU.
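For a feel of what a fair share policy does, here is a deliberately simplified sketch in Python. It is not how SGE or LSF implement it (their policies are more elaborate); the function next_group and all the numbers are assumptions for illustration. The rule: when a slot frees up, give it to the group with queued jobs that has used the least relative to its share.

```python
# Hypothetical, simplified fair share rule. Not the SGE or LSF algorithm.

def next_group(shares, usage, pending):
    """Pick the group that should get the next free slot.

    shares:  group -> entitled fraction of the cluster (e.g. 0.2)
    usage:   group -> CPU-hours consumed recently
    pending: group -> number of queued jobs
    Returns None when nobody has jobs waiting.
    """
    waiting = [g for g in shares if pending.get(g, 0) > 0]
    if not waiting:
        return None
    # Lowest usage-to-share ratio wins; ties broken by name for determinism.
    return min(waiting, key=lambda g: (usage.get(g, 0.0) / shares[g], g))


shares = {"potions": 0.2, "necromancy": 0.2, "defence-darkarts": 0.2}
usage = {"potions": 0.0, "necromancy": 300.0, "defence-darkarts": 100.0}

# Both necromancy and defence-darkarts have work queued: the lighter user goes first.
busy_day = next_group(shares, usage, {"necromancy": 25, "defence-darkarts": 3})

# Only necromancy has work queued: it gets the slot despite its heavy usage.
quiet_day = next_group(shares, usage, {"necromancy": 25})
print(busy_day, quiet_day)
```

The point is that an idle group's share is not wasted: necromancy can soak up spare nodes today, and the usage it runs up lowers its priority once the other groups come back with jobs.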
You'll also have redundancy - so that you can take machines out for
maintenance/repairs without impacting any one group, i.e. the load is
shared across 75 machines not 5.
Beowulf mailing list, Beowulf at beowulf.org