[Beowulf] Recommendations for new cluster

Rayson Ho raysonlogin at yahoo.com
Sat Jan 10 09:40:14 EST 2004


> -What batch system do you recommend for our purposes (Condor, PBSPro,
> ScalablePBS, SGE)?

My take: if you or your users already have experience with PBS, go for
SPBS or PBSPro. If you use SPBS, also take a look at Maui, since SPBS's
built-in scheduling algorithm is limited.

If you are starting from scratch, use SGE.

Condor is also good, but the developers haven't released the source
(they say they will), which is not a big problem. *However*, Condor
collects your cluster information and sends it back to UW "for research
purposes only". I personally like Condor a lot (especially the
checkpointing feature), but the phoning-home part makes them look like
M$.

> -Fail over of the head node is also important to us, but I haven't
> researched enough on this topic to start asking questions.  (So far,
> I've only found stuff for SGE, and a rumor that PBSPro will add this
> feature soon)

I've requested this feature on the SPBS mailing list, and I assume the
SPBS developers are looking into it.

> First, we're looking at a cluster of around 400 CPUs, which is
> getting dangerously close to the limits of OpenPBS.  Considering
> that this number could easily grow down the road, pushing that
> limit as far back as possible is needed.

The problem is not the number of CPUs you have, it is the number of
nodes you have.

The OpenPBS scheduler contacts each node in every scheduling cycle to
get its load information. If one of the nodes is dead, the scheduler
hangs.

The hack they used was to set an alarm, wait for the reply from the
node, time out after 2-3 minutes, and restart. --> like M$... exit all
windows and restart :)
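
To make the failure mode concrete, here is a minimal sketch (Python,
with a made-up query string, port, and node list -- not the real PBS
wire protocol) of a scheduler that polls every node once per cycle.
Without the timeout, one dead node blocks the whole cycle; with it, the
dead node just gets marked down and the cycle continues:

    import socket

    NODES = ["node%03d" % i for i in range(300)]  # hypothetical hostnames
    MOM_PORT = 15002                              # assumed mom port

    def poll_loads(timeout_secs=5.0):
        """Ask every node for its load, one at a time, the way the
        OpenPBS scheduler does.  A blocking read with no timeout
        would hang the entire scheduling cycle on one dead node."""
        loads = {}
        for host in NODES:
            try:
                s = socket.create_connection((host, MOM_PORT),
                                             timeout=timeout_secs)
                s.settimeout(timeout_secs)  # bound the read, not just the connect
                s.sendall(b"GETLOAD\n")     # made-up query, illustration only
                loads[host] = s.recv(64).decode().strip()
                s.close()
            except (socket.timeout, OSError):
                loads[host] = None          # mark the node down, keep going
        return loads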

OpenPBS works OK with fewer than 300 nodes, but as more and more nodes
are added to the cluster, the likelihood that at least one node is down
at any given moment increases.

SPBS fixed this problem two months ago; the developers said that they
changed the server <-> mom communication protocol.

SGE does not have this problem, AFAIK.

As long as the server <-> node communication protocol is designed
correctly, you can in theory add as many nodes to the cluster as you
want.

In the extreme case (tens of thousands of nodes?), so many nodes are
reporting load information to the master that the master does little
real work and just keeps accepting load reports. At that point you need
to decrease the load-report frequency.
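
A back-of-the-envelope sketch of that trade-off (the per-report cost
here is purely an assumption of mine, not a measured number): the
report interval bounds how many nodes a single master can absorb.

    def max_nodes(report_interval_s, cost_per_report_s=0.001):
        # The master saturates when processing one interval's worth
        # of reports takes the whole interval: N * cost = interval.
        return int(report_interval_s / cost_per_report_s)

    for interval in (5, 30, 120):
        print("%4ds interval -> ~%d nodes before the master saturates"
              % (interval, max_nodes(interval)))

With an assumed 1 ms per report, a 5-second interval saturates around
5,000 nodes, while stretching the interval to 120 seconds pushes that
to about 120,000.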

The bottom line: as long as the communication protocol is OK (SPBS,
PBSPro, SGE, LSF), you don't need to worry. But OpenPBS is really
something to avoid for mid-size and larger clusters.


> So far, PBS looks like the most widely used and known system out
> there, so some variant of that seems ideal for finding prospective
> employees with cluster experience, although it looks like a lot of
> you deal with SGE as well.

SGE has gone by tons of names, which is a bit confusing.

It started as a research project, "DQS", was then commercialized (under
the names "Codine" and "GRD", by a company called Gridware), and after
Sun bought Gridware it was renamed "SGE" -- but some people call it
"Sun Grid", while others call it "Gridengine".

PBS, by contrast, has been called PBS from the very beginning.


>  PBSPro and ScalablePBS advertise they get the scalability limit
> much higher than what we need, but the first is much more expensive.

Yes, the PBSPro people don't fix the known problems and well-known
limitations in OpenPBS; they ask you to buy PBSPro instead -- good
marketing strategy :)


> I have yet to find any documentation or rumor of SGE's upper bound.

Again, as long as you are not using OpenPBS, don't worry... 


> One of the problems I have noticed is that some batch managers bog
> down with a large amount of jobs.  Looking at the structure of our
> jobs, we can easily wind up with 20,000-40,000 jobs in a day with a
> large amount of them being submitted to the batch around the same
> time (when everybody leaves work).

Should be fine.

The scheduler is CPU-bound (a Linux PC can handle that amount of
scheduling work), while the job holder (qmaster in SGE) is I/O-bound.

If the spool directory is on an NFS-mounted partition, you will find
job submission slow.
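
You can get a feel for the I/O-bound part with a rough sketch like the
one below (the spool path and record format are made up; run it once on
local disk and once on an NFS mount). The point is that each submitted
job has to be written out durably before submission can return, so
fsync latency dominates:

    import os, time

    SPOOL = "/tmp/qmaster_spool_test"  # hypothetical; retry on an NFS path
    if not os.path.isdir(SPOOL):
        os.mkdir(SPOOL)

    start = time.time()
    for jobid in range(1000):          # simulate 1000 job submissions
        path = os.path.join(SPOOL, "job.%d" % jobid)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        os.write(fd, b"fake job record\n")
        os.fsync(fd)                   # job must hit disk before "qsub" returns
        os.close(fd)
    print("spooled 1000 jobs in %.1fs" % (time.time() - start))

Over NFS every fsync is a round trip to the file server, which is why
the same loop that takes a second or two locally can take minutes.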


> 
> Any thoughts what might be the best solution for us?
> 
> Thanks for any recommendations you can give me, and feel free to ask
> me any more details you want.

BTW, subscribe to the SPBS and SGE mailing lists for further info.

Rayson


> 
> Rich

