[Beowulf] Recommendations for new cluster

Laurence Liew laurenceliew at yahoo.com.sg
Fri Jan 9 18:46:00 EST 2004


We use SGE with much success and few problems.

As a cluster builder we support many customers, and installing SGE for
customers since late 2001 has meant a lot less hassle for us in terms
of support, reliability and robustness. With the scheduler we used
previously, we had to visit our customers quite a bit.

SGE also supports shadow/secondary master nodes for failover of the SGE
master itself, which may be important to you.
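
If you want to keep an eye on which host actually holds the qmaster
role (e.g. to notice when a shadow master has taken over), one
quick-and-dirty way is to poll the act_qmaster file in SGE's common
directory, which sge_shadowd rewrites on failover. A rough Python
sketch, assuming a default install path of /opt/sge and a cell named
"default" (adjust for your site):

    import os
    import time

    # Assumed install location and cell name -- adjust for your site.
    sge_root = os.environ.get("SGE_ROOT", "/opt/sge")
    act_qmaster = os.path.join(sge_root, "default", "common", "act_qmaster")

    def current_master():
        # act_qmaster holds the hostname of the machine currently running
        # sge_qmaster; sge_shadowd rewrites it when a shadow takes over.
        with open(act_qmaster) as f:
            return f.read().strip()

    last = current_master()
    print("qmaster is currently", last)
    while True:
        time.sleep(30)
        host = current_master()
        if host != last:
            print("failover: qmaster moved from", last, "to", host)
            last = host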

Condor is also a good scheduler.

Ganglia for monitoring.
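
To get a quick "which nodes have gone quiet" report out of Ganglia,
you can read the XML that gmond serves on its TCP port (8649 by
default) and check when each host last reported in. A rough sketch,
assuming the head node runs gmond; the host name and staleness
threshold below are just placeholders:

    import socket
    import time
    import xml.etree.ElementTree as ET

    GMOND_HOST = "head-node"   # placeholder -- wherever gmond is reachable
    GMOND_PORT = 8649          # gmond's default XML port
    STALE_AFTER = 120          # seconds without a report before we flag a node

    def fetch_cluster_xml():
        # gmond dumps the full cluster state as XML, then closes the socket.
        sock = socket.create_connection((GMOND_HOST, GMOND_PORT), timeout=10)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b"".join(chunks)

    def stale_hosts(xml_bytes):
        # Each HOST element carries a REPORTED attribute (unix timestamp
        # of its last heartbeat).
        now = time.time()
        root = ET.fromstring(xml_bytes)
        return [host.get("NAME")
                for host in root.iter("HOST")
                if now - int(host.get("REPORTED", "0")) > STALE_AFTER]

    for name in stale_hosts(fetch_cluster_xml()):
        print("no recent report from", name)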

We have used Rocks (www.rocksclusters.org) with SGE to build a smallish
(64-CPU) renderfarm incorporating RenderMan in under a day.
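
For what it's worth, one way to push a whole frame range at SGE
without thousands of individual submissions is a single array job.
A rough sketch using the DRMAA Python bindings (a separate install);
the render_frame.sh wrapper script and the scene file are made-up
placeholders:

    import drmaa

    # One array job covers the whole frame range; SGE hands each task its
    # frame number via the SGE_TASK_ID environment variable.
    session = drmaa.Session()
    session.initialize()

    jt = session.createJobTemplate()
    jt.remoteCommand = "/shared/render_frame.sh"  # assumed wrapper around the renderer
    jt.args = ["shot042.rib"]                     # assumed scene file
    jt.joinFiles = True                           # merge stdout and stderr

    # Frames 1..240, one task per frame.
    job_ids = session.runBulkJobs(jt, 1, 240, 1)
    print("submitted", len(job_ids), "tasks")

    session.deleteJobTemplate(jt)
    session.exit()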

If checkpoint and restart are important to you at the head and compute
node level, you may wish to take a look at www.meiosys.com. These guys
have an interesting virtualisation technology which allows transparent
checkpointing and restart. This means you do not need specialised
binaries; most applications should run on top of their technology.

Have fun!

Cheers!
laurence 

On Sat, 2004-01-10 at 05:59, Rich Pickler wrote:
> We're looking a few months into the future and are beginning to design
> our next renderfarm for production at our studio.  Instead of going with
> a render system tailored specifically for the renderer we're using, like
> Alfred for Pixar's RenderMan http://www.pixar.com/, we're toying with
> the idea of a more generalized cluster system which will open up a ton
> of possibilities for what we can do with the farm.  However, having only
> tinkered with generalized cluster systems during college, I am a bit new
> to some of the details and what is right for us.  After exhausting
> Google for all I can get out of it, I have some questions about what
> exactly is right for us, and I would rather hear real-world experience
> with these tools than have someone try to sell me something.
> 
> In a nutshell, the questions we have:
> 
> -What batch system do you recommend for our purposes (Condor, PBSPro,
> ScalablePBS, SGE)?
> 
> -What monitoring tools and set up tools have you found and used that
> reduce down time on faulty nodes (ganglia, management tools, etc.)?
> 
> -Failover of the head node is also important to us, but I haven't
> researched this topic enough to start asking questions.  (So far, I've
> only found stuff for SGE, and a rumor that PBSPro will add this feature
> soon.)
> 
> A bit of detail on our requirements:
> 
> First, we're looking at a cluster of around 400 CPUs, which is getting
> dangerously close to the limits of OpenPBS.  Since that number could
> easily grow down the road, we need to push that limit as far back as
> possible.  So far, PBS looks like the most widely used and best-known
> system out there, so some variant of it seems ideal for finding
> prospective employees with cluster experience, although it looks like a
> lot of you deal with SGE as well.  PBSPro and ScalablePBS both advertise
> scalability limits much higher than what we need, but the former is much
> more expensive.  I have yet to find any documentation, or even a rumor,
> of SGE's upper bound.
> 
> The majority of jobs in our queue will be small and single-node (but
> definitely multithreaded to take advantage of dual-CPU nodes).  This
> means intercommunication between nodes is almost nil.  However, there
> will be cases where we need processes to communicate with workstations
> outside the cluster, but that isn't terribly difficult to design around.
> 
> One of the problems I have noticed is that some batch managers bog down
> under a large number of jobs.  Looking at the structure of our jobs, we
> can easily wind up with 20,000-40,000 jobs in a day, with a large
> fraction of them submitted to the queue around the same time (when
> everybody leaves work).
> 
> Any thoughts what might be the best solution for us?
> 
> Thanks for any recommendations you can give me, and feel free to ask me
> any more details you want.
> 
> Rich
> 
> 
> 

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


