top500 list (was: opteron VS Itanium 2)

Robert G. Brown rgb at phy.duke.edu
Mon Nov 17 14:15:12 EST 2003


On Mon, 17 Nov 2003, Jeff Layton wrote:

> Let's exclude the floor space, windows, pizzas, chillers, etc.
> and figure out the total:
> 
> Total = $5.388 million
> 
> I guess I'm not too far off. Personally I think the big unknown
> is the rack cost. That could be very expensive since it's specialized
> (although 92 racks in a single sale might be considered a commodity).
> Also, the Cisco costs could be high as well (Cisco never does anything
> that can't make money off of).

With 1100 dual CPU nodes drawing perhaps 250 Watts apiece, the room
needs some 275 kW of capacity, maybe 180 20 amp circuits (assuming one
can drive roughly six nodes per circuit).  As a ballpark estimate, this
costs $275,000/year just to feed and cool the nodes, more than the racks
themselves.  The capital cost of the circuits, transformers, space
renovation, and the chillers required to run this cluster would likely
add another seven digit number to your estimate and is a lot less
ignorable than the cost of the racks or network ;-)
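
For anyone who wants to redo the power arithmetic with their own
numbers, here is a minimal sketch.  The node count, wattage and nodes
per circuit are the figures above; the $1 per Watt per year rate for
power plus cooling is an assumption, inferred from $275,000/year for
275 kW, and the function and constant names are made up for
illustration.

```python
import math

NODES = 1100                  # dual CPU nodes (from the estimate above)
WATTS_PER_NODE = 250          # "perhaps 250 Watts apiece"
NODES_PER_CIRCUIT = 6         # roughly six nodes per 20 amp circuit
DOLLARS_PER_WATT_YEAR = 1.0   # ASSUMPTION: implied by $275,000/yr for 275 kW


def power_budget(nodes=NODES,
                 watts_per_node=WATTS_PER_NODE,
                 nodes_per_circuit=NODES_PER_CIRCUIT,
                 dollars_per_watt_year=DOLLARS_PER_WATT_YEAR):
    """Return (total watts, circuits needed, yearly power+cooling cost)."""
    total_watts = nodes * watts_per_node
    # Round up: a partly loaded circuit is still a circuit.
    circuits = math.ceil(nodes / nodes_per_circuit)
    annual_cost = total_watts * dollars_per_watt_year
    return total_watts, circuits, annual_cost


total_watts, circuits, annual_cost = power_budget()
print(f"load:     {total_watts / 1000:.0f} kW")
print(f"circuits: {circuits}")
print(f"cost:     ${annual_cost:,.0f}/year")
```

Rounding up gives a few more circuits than the "maybe 180" above, which
is well within the noise of a ballpark estimate.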

Small nuclear power plant optional...

Now the pizza cost, that can be ignored.

However, the human cost is another "interesting" question.  With 1100
systems running 24x7 under load, I would expect system failures nearly
every day once the cluster is roughly a year old.  Even if operating
system installation and administration scale nearly perfectly (which
with Linux is not impossible, although for a cluster this size
PXE-automated installs, for example, are absolutely essential), one's
ability to manage the cluster is likely limited by user support (which
is beyond prediction, as it depends on the task mix and the expertise of
the user base) and by hardware maintenance capacity.  The cluster also
needs proactive administration -- hot and cold running help for
emergencies, given the large productivity cost when the cluster is
down.  I'm going to guess that they have 5-6 full time people just to
care for and feed the cluster and to sacrifice the odd chicken here and
there.  Maybe another $300K in salaries and benefits.

So I'd go to over $6 million (maybe even over $7 million) total
including infrastructure, with perhaps a $600-750K/year operating
budget.

> This was just for laughs. I still think there is a sugar daddy
> somewhere in there. Be it Cisco, Apple, IBM, etc., there are some
> costs not being mentioned.

It >>does<< seem to be a lot of money for a cluster, doesn't it?  Not
exactly pocket change, or University startup money.  DoD, DOE, NIH,
perhaps; it seems a lot for NSF unless, as you suggest, there are
corporate sponsors contributing.

The other thing that always amuses me about clusters like this is the
Moore's Law effect.  They buy it this year, after spending a year
(easily) preparing the site and building the requisite infrastructure.
They operate it for three years (spending $2.25 million, say).  In the
meantime, node power at constant cost has increased by a factor of 4.
If they instead invested their capital in bonds for those three years
(including the operating budget) and then bought that 4x faster node
hardware, they would BREAK EVEN on the amount of work done by the end
of year four, and would have saved three years' operating expenses plus
interest, in addition to the interest on the entire capital amount for
three years -- an easy $3+ million.
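
The break-even claim can be checked with a few lines.  This is only a
sketch: it assumes an 18 month doubling time, measures work in units of
"one year of the cluster bought today", and ignores the interest and
operating costs entirely.  The function names are hypothetical.

```python
DOUBLING_MONTHS = 18  # assumed Moore's Law doubling time


def speedup(months, doubling_months=DOUBLING_MONTHS):
    """Capacity per dollar after `months`, relative to today."""
    return 2.0 ** (months / doubling_months)


def work_buy_now(years):
    """Cumulative work from a cluster bought today (1 unit per year)."""
    return float(years)


def work_buy_later(years, wait_years=3):
    """Cumulative work if the same money is spent after `wait_years`."""
    running_years = max(0, years - wait_years)
    return running_years * speedup(12 * wait_years)


for year in range(1, 6):
    print(f"end of year {year}: buy now {work_buy_now(year):.1f}, "
          f"wait three years {work_buy_later(year):.1f}")
```

With these assumptions the two curves cross at the end of year four,
and the late buyer pulls ahead from then on.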

To put it another way, it is bloody silly to take an N year budget and
spend it all in year one on computing hardware, because compute capacity
that can be purchased at constant cost grows exponentially while compute
capacity that has been purchased AT fixed cost depreciates exponentially
and has a rather high baseline operating cost.  It also means that you
>>really<< pay for a design error.  If this enormous 1100 node cluster,
designed and purchased all at once, has any design flaw with a repair
cost that scales like the number of nodes, it would be ruinous.  If one
had instead bought (say) 1/4 of the nodes in year one, 1/4 more in year
two, 1/4 more in year three, and 1/4 more in year four, one would get
roughly:

   4 years @ 0.25 capacity
  +3 years @ 0.40 capacity
  +2 years @ 0.63 capacity
  +1 year  @ 1.00 capacity
==========================
             4.46 capacity-years

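(assuming an 18 month Moore's Law doubling time), versus 4.00
capacity-years over the same four years if everything is bought in year
one.  One would also have numerous opportunities to repair design flaws
at minimal cost and to exploit special deals and opportunities that
exceed this "average" performance.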
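
A minimal sketch of the staged-purchase sum, for anyone who wants to
try other doubling times or numbers of stages.  It assumes an equal
share of the budget is spent at the start of each year, that capacity
per dollar doubles every 18 months, and that capacity is measured
relative to the full cluster bought in year one.  The function name is
hypothetical.

```python
def staged_capacity_years(stages=4, doubling_months=18):
    """Return (rows, total) for equal purchases at the start of each year.

    Each row is (years in service, capacity of that purchase), with
    capacity relative to spending the whole budget in year one.
    """
    rows = []
    total = 0.0
    for i in range(stages):
        # Purchase i happens 12*i months in, so each dollar buys more.
        capacity = (1.0 / stages) * 2.0 ** (12 * i / doubling_months)
        years_in_service = stages - i
        rows.append((years_in_service, capacity))
        total += years_in_service * capacity
    return rows, total


rows, total = staged_capacity_years()
for years_in_service, capacity in rows:
    print(f"{years_in_service} years @ {capacity:.2f} capacity")
print(f"total: {total:.2f} capacity-years (4.00 if bought all at once)")
```

Without rounding each row first, the total comes out a hair under the
4.46 above, but still comfortably more than 4.00.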

  Sigh,

    rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


