CPUs for a Beowulf

Robert G. Brown rgb at phy.duke.edu
Thu Sep 11 08:11:35 EDT 2003


On Wed, 10 Sep 2003, Josip Loncaric wrote:

> I'd only like to add that administrative costs for a functional cluster 
> are *way* lower than for the same number of office PCs.  There are many 
> reasons for this so I do not want to enumerate them here.
> 
> You probably have a better feel for this, but my guess is that a capable 
> Linux cluster administrator can manage several hundred nodes.  In other 
> words, although support costs for an N-processor cluster scale with N, 
> the scaling constant is fairly reasonable.

Agreed.  The scaling constant itself can even get to where it is
determined primarily by hardware installation and maintenance alone,
although this is also an important determinant in any LAN.  With
PXE/diskless operation, or PXE/kickstart installation, yum or various
clone methods for keeping software up to date, a single person can "take
care of" (in the sense of install and maintain OS's on) a staggering
number of systems, if they never physically break.  In at least some
cluster environments, there aren't many users and the users themselves
are pretty Unix-capable and need little "support" (often a major, if not
the major, office LAN cost).  You can't get out of paying for power and
cooling, but once a suitable space is constructed power and cooling
become a fairly predictable expense, roughly $1 per watt per year.
Hardware remains the joker in the deck.
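
To put a number on the power and cooling rule of thumb above, here is a
minimal sketch.  The $1 per watt per year figure is the one quoted in the
text; the 512-node count and the 150 W per node draw are made-up inputs
for illustration, not measurements.

```python
# Rule of thumb from the text: roughly $1 per watt per year
# for power and cooling combined.
DOLLARS_PER_WATT_YEAR = 1.0

def annual_power_cooling_cost(n_nodes, watts_per_node):
    """Estimated yearly power + cooling cost in dollars."""
    return n_nodes * watts_per_node * DOLLARS_PER_WATT_YEAR

# Hypothetical cluster: 512 nodes at an assumed 150 W each.
cost = annual_power_cooling_cost(512, 150)
print(f"${cost:,.0f} per year")
```

Plug in your own measured wall-plug draw per node; the point is only that
this line item scales linearly and can be budgeted in advance.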

Hardware support costs can be hideously variable.  Bad electricity or
cooling or both can cause nodes to break (done that one).  Inadequate
cooling fans or other node design flaws can cause nodes to break (done
that one).  Poor quality node components (e.g. the newer 1 yr. warranty
IDE drives, a "bad batch" of nearly any component) can have a failure
probability that is low (per day) for any single system but high (per
day) for 512 systems taken together (to a certain extent unavoidable, so
sure, done that).  A "new" motherboard with a great clock and slick
features can turn out to have a piece of s**t BIOS that requires two or
three reflashes before it finally settles down to function, or it can
just plain never stabilize and ultimately have to be replaced (times N,
mind you -- and yes, <sigh> done both of these).  And then even really
good, gold-plated name brand COTS hardware breaks with fixed
probabilities and a roughly Poissonian distribution (with those lovely
little clusters of events), so it isn't like one hard disk fails every
two weeks; it's more like no disks fail for a couple of months and then
four fail within a couple of days of one another.
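
The two statistical points above (low per-node odds turning into high
per-cluster odds, and Poisson events arriving in clumps) can be sketched
in a few lines.  This assumes independent node failures and exponential
gaps between events; the per-node daily failure probability of 0.001 is
an invented number, and the function names are just for illustration.

```python
import math
import random

def p_any_failure(p_node_per_day, n_nodes):
    """Chance that at least one of n independent nodes fails on a given day."""
    return 1.0 - (1.0 - p_node_per_day) ** n_nodes

def p_gap_within(mean_gap_days, t_days):
    """Poisson process: chance the next failure arrives within t_days."""
    return 1.0 - math.exp(-t_days / mean_gap_days)

def simulate_failure_days(mean_gap_days, n_days, seed=0):
    """Draw failure times (in days) with exponential gaps between them."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mean_gap_days)
        if t >= n_days:
            return times
        times.append(t)

p = 0.001  # assumed per-node, per-day failure probability
print(f"1 node:    {p_any_failure(p, 1):.3f}")
print(f"512 nodes: {p_any_failure(p, 512):.3f}")

# Disks failing "every two weeks" on average:
print(f"next failure within 2 days: {p_gap_within(14, 2):.3f}")
print(f"quiet for 28+ days:         {1 - p_gap_within(14, 28):.3f}")
print([round(t) for t in simulate_failure_days(14, 365)])
```

With these assumed inputs the 512-node figure comes out near 0.4 per day,
and a gap shorter than two days is about as likely as a quiet stretch of
four weeks or more (both roughly 13%), which is where the clumps and the
long silences come from.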

A hardware failure also can have nonlinear costs (beyond mere downtime,
human time involved in fixing it, and maybe the cost of a new component)
in productivity, if the computation that is proceeding at the time of
failure isn't checkpointed and is tightly coupled, so a node failure
brings down the whole run.  At least THIS one doesn't bother me -- my
own computations tend to be EP (embarrassingly parallel). :-)
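
A rough sketch of why the tightly coupled, uncheckpointed case hurts so
much more than the EP case.  It assumes independent nodes with
exponentially distributed lifetimes; the 1000-day node MTBF and the
7-day run length are invented for illustration.

```python
import math

def p_run_survives(n_nodes, node_mtbf_days, run_days):
    """Chance that no node fails during the run (independent, exponential)."""
    return math.exp(-n_nodes * run_days / node_mtbf_days)

def node_days_lost(n_nodes, days_elapsed, tightly_coupled):
    """Work discarded when one node dies days_elapsed into an
    uncheckpointed run: every node's work if coupled, one node's if EP."""
    return (n_nodes if tightly_coupled else 1) * days_elapsed

MTBF = 1000.0  # assumed mean time between failures per node, in days
RUN = 7.0      # assumed run length in days

for n in (16, 128, 512):
    print(f"{n:4d} nodes: P(7-day run finishes) = "
          f"{p_run_survives(n, MTBF, RUN):.3f}")

# One node dies 3 days into a 512-node run:
print("coupled:", node_days_lost(512, 3, tightly_coupled=True), "node-days")
print("EP:     ", node_days_lost(512, 3, tightly_coupled=False), "node-days")
```

The failure rate grows linearly with the node count, but the odds of a
long coupled run finishing fall off exponentially, and each failure
throws away work on every node rather than just one.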

The moral of the story being -- COTS hardware, sure, but for larger
clusters especially (event frequency scaling linearly with the number of
nodes, disaster severity scaling NONlinearly with the number of nodes)
you should PROTOTYPE and GET HIGH QUALITY NODES, and meditate carefully
upon the issue of possibly onsite support contracts.  Onsite support
reduces the amount of time YOU (the systems managers) have to screw
around with broken hardware, although a lot of people do get by with
e.g. overnight or second day replacement, possibly buffering the most
likely parts to fail with spares, if they have the local staff to handle
it.  You can trade off local human effort (opportunity cost allocation of
existing staff time) with prepaid remote human effort (service contracts
and so forth) depending on which is easier to budget and how much
"disposable" local FTE you wish to dispose of in this way.  One way or
another, node (un)reliability equals time equals money, and this
"hidden" cost has to be balanced against the cost of the nodes
themselves as raw hardware when computing (hands on your wallets,
please:-) "Total Cost of OwnerShip" (TCOS).

[As a parenthetical insertion of the sort I'm doubtless (in)famous for,
has anybody noticed how TCOS is an acronym-anagram for COTS?  Well,
really it isn't but since I added the S to TCO it is.  Spooky, huh....]

I think that this is a lesson most large scale cluster managers
(including those that have previously managed large scale LANs) have
already learned the hard way, to the point where they are bored with the
enumeration above.  Cluster/LAN newbies, though, especially those who
may have played with or be preparing to engineer a "toy" cluster with
just a few nodes, may not have thought about all of this, so it is worth
it to hammer on the point just a wee bit.  After all, many of us on the
list learned these lessons the hard way (go on, raise those hands, don't
be shy or ashamed -- look, my hand is WAY up:-) and the least we can do
is try to spare those newbies our pain.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu



_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


