back to the issue of cooling

Robert G. Brown rgb at phy.duke.edu
Wed Apr 23 17:15:18 EDT 2003


On Wed, 23 Apr 2003, jbassett wrote:

> Transmeta quotes a TDP for their 1-GHz Crusoe as 7.5 watts
> 
> An Athlon XP at around twice the clock speed is around 10x that at 75 watts
> 
> but at $0.05/kWh I agree that it is unlikely that you could ever find an 
> operating cost that would be able to offset the greater cost and slower 
> performance of the Crusoe.  But the density at which you could pack them 
> would be incredible.  If you were running so much cooler that less of a 
> cooling system investment were required, that might change the equation.

(People bored with blades can skip the following:-)

Right, this is really their niche at the moment, especially in
environments where installing them in high densities saves you from
REALLY costly infrastructure investments or where space itself is just
plain tight.

Be careful about comparing raw clocks, though -- they are different
architectures, with the Transmeta, according to Feng's own paper (Feng is
the Green Destiny cluster guy), delivering only 1/3 to 1/2 of an Athlon's
performance at equivalent clock.  I don't think he came CLOSE to
systematically exploring system performance to get those numbers, but
that's just me -- maybe I'm misreading.  I'd like to see systematic
measurements -- SPECmarks, lmbench, stream, netpipe, and more, not just
sqrt's -- and less emphasis on "fraction of peak" and more on wall-clock
completion times.  

To put it another way, I don't think Feng's paper is a sound basis for
would-be cluster engineers trying to guesstimate the performance of a
bladed system on a given problem.  This makes it very difficult to
compare "theoretically" with competing designs (but of course that won't
stop me below -- just take it with a grain of salt:-).

Also be careful about comparing CPU-only numbers for e.g. power.  The
CPU is mounted on a card with memory, a hard disk (or two), a network
interface, and a bus/backplane interface.  All of these draw power.  The
power itself comes from a chassis with a power supply that gets hot
while operating. What I looked for, but failed to find in Feng's paper,
is the actual wall-plug power load of a 24-blade chassis running code
flat out.  If the chassis power supply capacity is any indication, it is
probably more like 20 watts per blade (and maybe more, as some fraction
of the "blade load" goes to running chassis electronics and heat
dissipated by the chassis power supply).  The only good way to find out
is to stick a kill-a-watt between a blade chassis and the wall and read
out its draw under a mix of loads.  I don't think Feng did that (hard
to tell from the paper, at any rate).  I suspect he used published
numbers for the CPU draw or the blade draw instead of measuring it
himself, but if he said WHAT he did, I missed it.
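
(For anybody who wants to do it themselves, the arithmetic for backing a
per-blade figure out of a wall-plug reading is trivial -- a little Python
sketch, where every number is invented for illustration, NOT a measurement:

    # Hypothetical kill-a-watt readings -- replace with real ones.
    chassis_idle_watts   = 180.0   # chassis, fans, PSU overhead, blades idle
    chassis_loaded_watts = 620.0   # whole chassis, all blades running flat out
    n_blades             = 24

    # Naive per-blade figure (chassis overhead averaged in):
    print("naive:    %.1f W/blade" % (chassis_loaded_watts / n_blades))
    # Marginal per-blade figure (charging chassis overhead separately):
    print("marginal: %.1f W/blade"
          % ((chassis_loaded_watts - chassis_idle_watts) / n_blades))

The point is just that the number you want is the one read at the wall, not
the one printed on the CPU's data sheet.)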

If we assume 20 W per blade and compare it to the 85 W (or so) of
full-chassis load burned per CPU in a loaded dual Athlon at 1.6 GHz, then
the Transmeta gets 0.3 to 0.5 "Athlon GHz" (AGHz) -- 1/5 to 1/3 of the
performance -- for 1/4 of the power draw.  Hmmm.  Where is the big win here?
Even if my power numbers are off by a factor of two and a fully loaded
blade burns only 10 W -- a number I'd doubt since NICs alone tend to
burn 5 W and a blade has a NIC -- I'm not impressed, given the cost
differential, because we have NOT YET CONSIDERED the scaling laws
associated with parallelizing tasks themselves, which often strongly
favor faster processors (i.e. faster processors on systems with faster
busses can often be used to make clusters that scale near-linearly to
far higher total performance and to far more processors).  
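
To put numbers on the raw AGHz-per-watt comparison above (using my guessed
20 W per blade, the 85 W per CPU figure for the loaded dual Athlon, and
Feng's 1/3-1/2 per-clock range, so the output deserves exactly the same
grain of salt):

    # All inputs are the guesses from the paragraphs above, not measurements.
    blade_watts  = 20.0          # full per-blade draw, my estimate
    athlon_watts = 85.0          # per CPU, loaded dual Athlon chassis
    athlon_aghz  = 1.6           # a 1.6 GHz Athlon, by definition of the unit

    for crusoe_aghz in (0.3, 0.5):     # Feng's range for a 1 GHz Crusoe
        print("Crusoe blade: %.3f AGHz/W" % (crusoe_aghz / blade_watts))
    print("Athlon:       %.3f AGHz/W" % (athlon_aghz / athlon_watts))

which lands the blade at 0.015-0.025 AGHz/W against the Athlon's ~0.019 --
a wash, more or less, before parallel scaling even enters the picture.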

Nor have we considered micro-determinants of performance.  How expensive
is a context switch?  How well does it manage cache and dataflow?  How
smoothly does it process interrupts so it can USE the NIC or disk?  Is
there an all-things-equal network latency hit of 3x or more relative to
an Athlon?  There might be (or not)...but barring a published
measurement we won't know.

I think a far more sophisticated analysis is called for to determine
what the real performance/power scaling is PER NODE since the crux of
the argument is whether more slower cooler processors are going to
perform as well as fewer faster hotter processors.  This is a
problem-dependent question, one that has been a focus of this list forever,
and might well be TRUE for one problem and FALSE for another.
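
A toy model makes the problem dependence obvious.  Every input below is
either one of my guesses from above or simply invented; the only point is
that the answer flips with the workload, not that the numbers mean anything:

    # T(N) = serial part + parallel part / N + per-node overhead * N,
    # comparing two farms at the same ~1.4 kW: 16 x 85 W Athlons (2 AGHz each)
    # versus 68 x 20 W blades (0.5 AGHz each, the generous end of Feng's range).
    def wall_clock(work, serial_frac, per_cpu_aghz, n_cpus, per_cpu_overhead):
        serial   = work * serial_frac / per_cpu_aghz
        parallel = work * (1.0 - serial_frac) / (per_cpu_aghz * n_cpus)
        return serial + parallel + per_cpu_overhead * n_cpus

    work = 1000.0   # arbitrary "AGHz-seconds" of work
    for name, sf, oh in [("embarrassingly parallel", 0.00, 0.00),
                         ("2% serial, some per-node cost", 0.02, 0.05)]:
        fast = wall_clock(work, sf, 2.0, 16, oh)
        slow = wall_clock(work, sf, 0.5, 68, oh)
        print("%-30s  16 fast: %5.1f   68 slow: %5.1f" % (name, fast, slow))

The first case comes out marginally in favor of the blades, the second
strongly in favor of the Athlons -- same hardware, different problem.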

I was REALLY unimpressed by Feng's TCO argument, and especially by his
analysis of the processor scaling laws that are limiting processor
speedup and leading to an increase in power draw as Moore's law cranks
along.  First of all, those things are well known -- on chip or off
chip, parallelism is a way to get better usage of chip real estate, as
Ian Foster points out in a lot more convincing detail in his book on
parallel program design.  Second of all, Feng's proposed "solution":
"quit using the `increasing frequency = increasing performance'
marketing ploy" -- isn't a solution at all, it isn't even an argument --
it is raw polemic.

What marketing ploy, and what does a marketing ploy have to do with chip
scaling laws?  Increasing frequency DOES, visibly and obviously,
increase performance on CPU bound problems, including mine, in a
marvelously linear fashion.  On the transmeta too, at a guess, just as
it has for generations of in-family CPUs.  Quantum jumps (relative to
clock) occur when the chip is rearchitectured with more parallelism and
finer scale, e.g. changing from 8- to 16- to 32-bit architectures, or from
no pipelines to several pipelines.  These are the realities of CPU design,
not marketing ploys.
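
For a CPU-bound code the arithmetic behind that linearity is as dumb as it
gets -- wall-clock time is just cycles divided by frequency (toy numbers,
obviously):

    cycles = 3.0e12                      # made-up cycle count for some job
    for ghz in (1.0, 1.4, 2.0):
        print("%.1f GHz: %4.0f s" % (ghz, cycles / (ghz * 1.0e9)))

No marketing department required.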

Third of all, he offers no argument at all, convincing or otherwise,
for how using lots of cooler slower chips is going to actually beat the
scaling laws he himself introduces (and the ones he omits).  Foster
does, in the explicit context of parallel task execution (so I'm
familiar with them in a fair bit of detail) but Feng doesn't.  A good
argument would require him to account for various kinds of overhead,
account for parallel scaling on tasks (where his argument OBVIOUSLY
fails for a task that will run fast from memory on a single CPU with
no IPCs but requires lots of slow IPCs to run in parallel on two or
more Transmetas) and would inevitably restrict the classes of task that
can be distributed cost-efficiently on the bladed architecture.  It is
NOT a "substitute" for the single CPU at ever-increasing clock; it is
something different for solving different problems.
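
The degenerate case in parentheses above is worth spelling out.  Take a job
that fits in memory on one fast CPU, and split it across two slow CPUs that
have to exchange a small message every step over cheap ethernet (all numbers
invented for illustration):

    steps         = 10000
    work_per_step = 1.0e-3      # "AGHz-seconds" of arithmetic per step
    msg_latency   = 100.0e-6    # ~100 us per small message over 100BT, say

    t_single = steps * work_per_step / 2.0                        # one 2 AGHz CPU
    t_pair   = steps * (work_per_step / (2 * 0.5) + msg_latency)  # two 0.5 AGHz CPUs

    print("one fast CPU:  %.1f s" % t_single)    # 5.0 s
    print("two slow CPUs: %.1f s" % t_pair)      # 11.0 s

Two CPUs instead of one, and more than double the wall-clock time, before
power or price even enter the discussion.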

And finally, there is the good old tanstaafl, which makes me suspicious
of the whole line of argument from the beginning. Chip designers at
Intel and AMD (and at Transmeta, for that matter) are not idiots.  They
are REALLY familiar with the chip real estate, clock speed, and parallelism
scaling laws and have introduced a LOT of on-chip parallelism in part
because of them.  They are real experts on this, they don't do stupid
things, and they are all working with the same microscopic "components" on
their VLSI layouts, trying to optimize a highly nonlinear cost-benefit
function in truly creative ways.  Their chip designs are genius, not
dumb -- expensive genius at that (with a foundry costing on the order of a
$billion per CPU generation at this point?).

RISC itself is something of a response to these laws, and Transmeta's
architecture seems almost like "super-RISC" with a lot of code
translation and pre-processing to conserve chip real estate.
Ultimately, winning the performance war requires either finding a really
fundamentally different design that has different scaling laws or
finding a niche market where the design you have (which may be a
different emphasis or design focus of existing designs) can be
successful.

So far I don't see it, although I've seen some intriguing ideas kicked
around.  I'd be at least intrigued, for example, by an "8 processor
motherboard" where 8 transmeta's were slotted right up on a very fast
memory bus with a standard peripheral (PCI-x) interconnect.  That would
give you e.g. "8 transmeta GHz" on a system that drew roughly the same
power as a P4 or Athlon in the 2 GHz range.  Multiply by 0.4 (say) and
perhaps it is competitive, and gets you out to decent performance
without an ethernet interconnect, giving you better parallelism for
certain classes of task.  Transmetas in PDAs are also very intriguing
-- building a handheld device that can run for hours at high speed on a
small battery is very cool indeed.

THAT kind of (SMP) design would require a new kernel and a fundamentally
parallel approach to programming to make it "happen" in a mainstream,
mass-market delivery.  It might not make it -- lots of stuff on
PCs is single-threaded and CPU or memory I/O bound, and lots of CPUs
competing for memory or trying to deliver a threaded task are a known
headache.  It would be interesting, though, especially if the design were
modular and could be scaled up to 24, 48, or even 1024 processors.

> Or if there was ever a need for a highly mobile cluster system. You could 
> pack a great number into a single box and carry it about, and perhaps, 
> because in theory 10 Crusoes would dissipate the heat of a single Athlon, 
> you could easily cool many of them. Joseph Bassett

Well, yes, unless you needed the single-threaded PERFORMANCE of a single
Athlon.  And remember, until that 10-way SMP motherboard for the Crusoe
comes along, you're feeding each CPU, its own memory, its own disk, and a
network interface (ten of each), and suddenly the power ratio isn't anything
like ten to one in favor of the Crusoes -- more like four to one or even
five to one -- and when you multiply by the speed differential per clock,
you're back dangerously close to where you started in BOTH FLOPS/Watt AND in
absolute FLOPS, with now 10 CPUs to care for, feed, network, and
program.  The single AMD will run ANY application over the counter, no
parallel programming required.  Lower TCO?  I think that's obvious.
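
Rough whole-node accounting for the ten-for-one idea, with per-component
wattages that are guesses rather than measurements:

    crusoe_node = 7.5 + 5.0 + 5.0 + 3.0    # CPU + disk + NIC + memory/glue
    athlon_node = 75.0 + 5.0 + 5.0 + 5.0   # same, for a 2 GHz-class Athlon node

    print("node power ratio: %.1f to 1" % (athlon_node / crusoe_node))  # ~4.4
    # Ten Crusoe nodes, credited the generous 0.5 AGHz/TGHz, versus one Athlon:
    print("10 Crusoe nodes: ~%.0f W for ~5 AGHz (assuming perfect scaling)"
          % (10 * crusoe_node))
    print("1 Athlon node:   ~%.0f W for ~2 AGHz" % athlon_node)

That works out to roughly 41 W per AGHz versus 45 W per AGHz -- "dangerously
close", as I said, and the Crusoe side of it only holds up if the task
parallelizes perfectly across ten nodes.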

I'm not down on blades -- I think they have their niche and
power/cooling/space-starved environments are it.  I don't think that
they are even close to a cost/benefit win in most other environments,
and not because of "marketing hype".  I'm not selling anything; if
anything I'm buying.  Should I spend my (say) $15K on Crusoes in a blade
configuration or on dual Athlons?  Hmmm, I can afford just about 8 dual
Athlon 2400+'s or just about 12 Crusoes (presuming $1K each by the time
a chassis and so forth is thrown in).  16 Athlon CPUs buy me some 32
"Athlon GHz" (and costs me about $1300 a year in utility bills).  The
alternative gives me 12 "Transmeta GHz", where a TGHz is "worth" perhaps
0.5 AGHz in FLOPS, according to Feng's incomplete measurements.

So it buys me roughly 6 AGHz, about a fifth as much, and costs me (heck,
I'll GIVE you 10x less power) $150 a year to run.  I'd still need to spend
$75,000 on Transmetas, assuming my application scaled linearly to 60
Transmetas at all, to equal the power of my 16 Athlon CPUs in 8 dual boxes
(assuming I'm still scaling linearly there, as well).  My three-year
power bill would be maybe $2000 less, but my overall bill would be
$53,000 more for the Crusoes.  In a lot of environments, I could buy
brand new wiring, a dedicated air conditioner, and STILL get back
enough change to travel business class to Australia going with the
Athlons, especially if my goal is to feed 8 whole dual Athlons (ballpark
of 170 W each, 1400 to perhaps 2000 watts total consumption under
load, one or two 20 Amp circuits, installable in most locations that
have a bit of surplus capacity at the box for maybe $1000 tops
even if they have to pull wire).
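
Spelling the $15K comparison out (all inputs are the rough street prices and
guessed wattages from this message, with Feng's generous 0.5 AGHz per
Transmeta GHz -- none of it is a quote or a measurement):

    budget      = 15000.0
    athlon_farm = 16 * 2.0               # 8 dual 2400+ boxes, ~32 "Athlon GHz"
    blade_cost  = budget / 12            # ~$1250/blade with chassis amortized
    blade_farm  = 12 * 1.0 * 0.5         # 12 blades x 1 TGHz x 0.5 = 6 AGHz

    print("$%.0f buys ~%.0f AGHz of Athlons or ~%.0f AGHz of blades"
          % (budget, athlon_farm, blade_farm))

    # Blades needed to match the Athlon farm, assuming perfect linear scaling
    # (about the ~60 blades / ~$75K ballpark above, modulo rounding):
    blades_needed = athlon_farm / 0.5
    print("matching 32 AGHz takes ~%.0f blades, roughly $%.0f"
          % (blades_needed, blades_needed * blade_cost))

However you shuffle the power bill, the capital difference is what dominates
the three-year TCO.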

If there is something wrong with this analysis, I'd be interested in
hearing it.  At $200/blade, blades would be a great deal from a TCO or
cost/benefit perspective.  At $400/blade they would be "interesting" and
often competitive.  At $1000/blade, they are a niche-market-only item,
as I see it -- people who have $100K in renovation required otherwise to
build a cluster, people who have an uncooled broom closet available as a
"cluster room" and who inexplicably can STILL afford a Transmeta
cluster in the first place.

  rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


