top500 list (was: opteron VS Itanium 2)

Bill Broadley bill at math.ucdavis.edu
Mon Nov 17 16:08:40 EST 2003


> To put it another way, it is bloody silly to take an N year budget and
> spend it all in year one on computing hardware, because compute capacity
> that can be purchased at constant cost grows exponentially while compute
> capacity that has been purchased AT fixed cost depreciates exponentially
> and has a rather high baseline operating cost.  It also means that you
> >>really<< pay for a design error.  If this enormous 1100 node cluster,
> designed and purchased all at once, has any design flaw with a repair
> cost that scales like the number of nodes, it would be ruinous.  If one
> had only bought (say) 1/4 of the nodes in year one, 1/4 more in year
> two, 1/4 more in year three, and 1/4 more in year 4, one would get
> roughly:

Having just sat through a Production Clusters talk at SC2003, I figured
it would be worth mentioning the downside of yearly upgrades.

Heterogeneous clusters are a nightmare: at least linear scaling in
support costs, and if you're running large codes you can get zero
scaling.  I.e., buying 250 nodes a year, at the end of 4 years you can
run 250 fast nodes, or 1000 nodes at the speed of the 1st year's.
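
To put rough numbers on that, here is a minimal sketch.  It assumes a
tightly coupled code where every node waits for the slowest one, and a
per-node speedup of 1.5x per year; that growth figure and all of the
function names are my own illustration, not anything from the talk.

```python
def batch_speeds(years, growth):
    """Relative per-node speed of each year's purchase (year 1 = 1.0).

    `growth` is an assumed year-over-year speedup, not a measured value.
    """
    return [growth ** y for y in range(years)]


def lockstep_throughput(nodes_per_year, speeds, batches_used):
    """Throughput of one tightly coupled job on the newest `batches_used`
    batches, assuming every node runs at the pace of the slowest node used."""
    if not 1 <= batches_used <= len(speeds):
        raise ValueError("batches_used must be between 1 and len(speeds)")
    used = speeds[-batches_used:]
    return nodes_per_year * batches_used * min(used)


def ideal_throughput(nodes_per_year, speeds):
    """Aggregate throughput if every node could run flat out (for example,
    many independent jobs, each confined to a single batch)."""
    return nodes_per_year * sum(speeds)


if __name__ == "__main__":
    speeds = batch_speeds(years=4, growth=1.5)  # assumed growth rate
    print("newest 250 nodes only:", lockstep_throughput(250, speeds, 1))
    print("all 1000 in lockstep: ", lockstep_throughput(250, speeds, 4))
    print("ideal aggregate:      ", ideal_throughput(250, speeds))
```

Under those assumptions the full 1000 nodes in lockstep deliver barely
more than the newest 250 alone, and less than half of what the hardware
could do in aggregate.  A different growth rate moves the numbers but
not the shape of the problem.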

The opinion of the 4 speakers giving the talk was to buy a cluster
large enough to keep until it is replaced.  This dramatically
decreases support costs, keeps things simple for the end users,
keeps the batch queue simpler, and stops silly things like a
BIOS upgrade for some of the nodes taking down the entire cluster.

Certifying a large body of applications, user tools, quota monitoring,
sensor monitoring, etc. for a particular configuration is a lot of
work.

Numerous nightmares were reported even for "identical" nodes that
ended up coming from different factories.  

Large site installations spend a lot of sweat and tears becoming
intimately familiar with their hardware: analyzing failure rates,
how to read various temp sensors, monitoring of various types, etc.

Building a cluster 1 year at a time can work of course, especially
if your jobs are never bigger than a single year's purchase, but
it's not free.  In many cases, when your support staff is limited
(seems very common), you might be better off buying a cluster every
couple of years.


-- 
Bill Broadley
Mathematics
UC Davis
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf