Serious Cluster Design and Prototyping | Cluster Newbie

Basic and (advanced) wisdom for those that venture down the cluster hardware path

Last time we started to review just how to go about building a serious cluster computer, beginning with a fairly complete discussion of the space, power, and cooling requirements for a cluster.  If you read that column, you will recall that a lot of design decisions for the cluster's future space depend (unsurprisingly) on the kind of hardware you plan to put into it.  Shelved tower units, rack mount single, dual, and even quad processor systems, and bladed computers all require very different amounts of physical space, electrical power per square foot of floor space, and cooling per cubic foot of room volume. 
 Of course the reverse is true as well; if your room is predetermined to be the old broom closet down the hall and you only got that by threatening to hit someone with a blunt instrument, then your hardware design decisions will be heavily influenced by your available space and desired computational capacity (which for all true cluster computing fiends is always "as much as I can get, and then some").  There are dependency loops in the process of design that have to be iterated to convergence (which is a mix of computer-speak and math-speak and a metaphor besides, gadzooks) and we had to start somewhere, so I picked physical space. 

This month we're going to start looking at hardware, focusing on nodes a bit, but not exclusively, as cluster performance is highly multivariate. We'll probably come back to hardware again and again in months ahead, because after all, would you be reading this at all if you weren't deep down inside a secret computer hardware junkie, looking for the biggest fix of your life?  Besides, there is so much hardware out there to choose from, the selections completely change every six to twelve months, and the costs and benefits of various design decisions vary wildly as they do.  This sort of thing is to me an -- is "obsession" too strong a word?  I don't think so.  Yet even I have a very hard time keeping up with all the hardware options that are available, what runs on what, what the current options are for hooking it all together, and above all how much it costs (which can vary by as much as a factor of two over any six to twelve month period).  I blink and the world changes. 

This situation is great for me as a columnist as it means I will never run out of column topics.  It is great for my editor as he will never run out of readers trying to keep up with an ever-changing landscape driven by Moore's Law in parallel overdrive.  It is not so great for either you or me as cluster engineers, as it means designing a cluster is sort of like trying to create great art using a palette that keeps changing, with mauve turning into puke green and bold blue fading into charcoal gray before the paint is even dry.  In these circumstances one has to develop principles of sound art, and of course one of them has to be figuring out what the palette du jour is for this six to twelve month time frame.  So let's settle in for a discussion on How To Select Cluster Hardware. 

Guiding Principles 

Before we actually start talking dirty and using words like gigaflops, bandwidth, latency and more (in future columns) let's see if we can put down a few pieces of Basic Wisdom on cluster design.  These are in a sense more fundamental than all the details, as without them you cannot even tell which details are important.  With them, you can probably hold your own at one of the big cluster or supercomputing expos, and the details will start to make sense as we cover them. 

Most of these are wisdom gleaned from years of participation on the beowulf list.  If you ask a question on the list such as "what kind of network should I get for my cluster" the answer can almost certainly be analyzed in terms of the principles below. 

Cost-Benefit is Everything.  What?  The most fundamental principle of cluster design is, is, is economics?  Absolutely!  If money is no object, why build a cluster? Go pay somebody a few zillion pieces of gold for a Big Iron supercomputer (that is most likely itself a "cluster", but one with custom interconnects and a raft of salesmen and service engineers with families to support on the merest scrapings of the margins of your purchase). 

Every single step involved in cluster design involves getting the most work done for the least money subject to your design constraints, and even those constraints are fundamentally economic -- the reason you may sometimes need to live with the broom closet down the hall to house your cluster is that new buildings with gaudy cluster rooms are expensive. 

The Answer is "It Depends".  One of the most common answers you'll get on the beowulf list to almost any "should I buy" hardware question is "it depends".  What, you might ask, does "it" depend on?  Fundamentally it depends on cost-benefit, of course, but to figure out cost-benefit you first have to figure out both comparative costs and comparative benefits.  Costs depend on things like vendor, warranty, speed/capacity and more; benefits depend on what kind of work you are trying to do and how differently it performs on the various kinds of hardware. 

That sounded good.  Somehow, those first two points almost make sense.  Now let's look at a couple of related principles. 

Your Mileage May Vary.  YMMV is a standard disclaimer attached to nearly anything associated with hardware and cluster design.  I am often asked by people I've never met about some critical component in their prospective cluster design.  For example, they ask me what kind of network card they should buy.  I answer that I've done fabulously well using snails to carry little stick-it notes between my nodes (my task is embarrassingly parallel) but your mileage may vary.  This point is virtually a corollary of the "it depends" answer above -- maybe my job involves no internode communication at all so snails work just fine.  Maybe I got a real deal on snails.  Maybe your job is like mine, maybe not.  It depends.  YMMV. 

The Only Good Benchmark is Your Own Code.  This point follows directly from the propositions above.  To compare hardware it isn't enough to just compare the vendors' claimed MIPS (millions of instructions per second) or GFLOPS (billions of floating point operations per second) or CPU clock or bandwidth or latency or whatever-else-you-like numbers; you need to know how that piece of hardware performs on your code.  Your code is very likely different from the benchmark codes that were used to make the measurements.  Then again, the vendor may have lied about their benchmarks. 

If you keep the points above in mind while reading the remainder of this column (and other articles, and future columns, and while shopping for shoes) then you'll find that although there are no firm "rules" for cluster engineering, the terrain is navigable and you can put together a decent cluster project on your very first try. 

Cluster Design Parameters 

It is so easy to get lost in all the glitter and glitz of shopping for sexy and sleek cluster hardware, especially if you fall into the hands of a slick salesperson, that we will begin learning about how to do a sane cost-benefit comparison of engineering alternatives with the benefit part, not the cost part -- the work you want to do with the cluster.  Really we began this process with the very first column in the very first issue of this magazine -- our efforts were directed at understanding a bit about parallel task organization and scaling on a quantitative basis with an eye on the future, in fact on this very day. 

Understand this:  the benefit of your cluster is the work you want to do with it, along with any constraints on that work.  That is all there is to it.  You will maximize this benefit when you can do this work the fastest, with the least amount of personal effort and invested time and money (all of which can go into either cost or benefit columns depending on how they are valued and weighted).  The "ideal" would be for you to submit a piece of work to be done to your cluster and have it complete as fast as your finger lifts off of the Enter key, even if the job to be run is predicting the future time evolution of global weather for three years in advance. 

We know from earlier columns that in all probability, this ideal cannot be realized.  Parallel speedup is limited by Amdahl's Law and its more accurate generalizations.  Even if your task is embarrassingly parallel and suitable for running on a grid style cluster of clusters, there are still limits on its parallel scaling imposed by the bottlenecks of its controlling resource servers. 
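
As a reminder of what that limit looks like, here is a minimal sketch of the simplest form of Amdahl's Law.  The function name amdahl_speedup and the 90% parallel fraction are assumptions chosen for illustration; the formula ignores communication overhead, which is exactly what the more accurate generalizations add back in.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Ideal speedup on `nodes` processors when only `parallel_fraction`
    of the single-node run time can be parallelized (Amdahl's Law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    # Hypothetical task: 90% of the work parallelizes perfectly.
    for n in (1, 2, 8, 64, 1024):
        print(f"{n:5d} nodes -> speedup {amdahl_speedup(0.9, n):6.2f}")
```

However many nodes you add in this example, the speedup stays below 1/(1 - 0.9) = 10, because the serial 10% never gets any faster.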

But what are the bottlenecks for your particular task, or task mix? The amount of work you can get done on any collection of hardware is very strongly dependent on the task itself.  Some cluster designs will get more work done than others as a function of (for example) the number of nodes.  Of course different cluster designs have different costs as well, and the most common situation one faces is designing a cluster with either a fixed budget or at least an upper bound on the amount of money you can hope to obtain for the cluster. 

The following is a partial list of the more important variables in cluster design that can be rate determining for various kinds of code. 

CPU Architecture.  Different architectures get different amounts of different kinds of work done per clock cycle. 

CPU Clock.  This is the "easy" one -- more seems like it would always be better.  On a fixed budget, it isn't. 

Memory Architecture and Speed.  If your job involves large vector or matrix operations, memory speed is very likely to be the rate limiting feature instead of CPU. 

Disk Architecture and Speed.  Sometimes this is completely irrelevant. Other times it is very important.  It depends on how much and how often your task accesses disk, which tends to be very slow relative to everything else. 

Network Architecture, Bandwidth and Latency.  One day this will be a suitable topic of a whole article.  For real parallel tasks with interprocessor communications distributed over the network, it is (as we have learned in earlier columns) the critical bottleneck, likely to be as important as, or more important than, all of the features above. 

Miscellany.  Operating system, compiler, libraries, administrative interface, parallel environment, task organization, and much more all can have a significant impact on task speed. 

Each of the headings above can be further fractionated.  Integer instructions versus floating point instructions, 32 bit floating point instructions versus 64 bit floating point instructions, floating point broken down further as multiplication, addition, multiplication AND addition, division, transcendental function support, instruction pipelining, the size and speed and number of layers of cache memory -- a particular task might well be bottlenecked by how efficiently one particular mix of these instructions is executed on a system, which itself might vary by a factor of three or more depending on the compiler and whether optimized libraries are used. 

If the above makes you feel dizzy and a bit inadequate, don't worry, it should.  How are you supposed to figure out which arrangement will work best for your code without a degree of some sort in advanced computer science?  Especially when even people who have a degree in computer science cannot always predict performance except in broad terms?  Still, not to worry.  As you might expect, there is a way to get by. 

Optimizing Node Architecture 

The best way to proceed to optimize your node architecture and cluster design is by studying your task and prototyping the task on as many distinct architectures as you can manage. 

Studying your task is pretty obvious.  A lot of computational tasks spend a lot of time in one "central" loop doing one particular series of chores over and over again.  It is easiest to show you how to proceed by going over an example.  Let me tell you in very rough terms about my code (a Monte Carlo simulation studying the critical behavior of O(3) symmetric magnets), so you can get a feel for what sort of thing to look for when studying or profiling your task. 

In my code, the core loop contains a lot of general floating point instructions and a few transcendental calls (exponential, natural log, trig functions).  Transcendental calls tend to be library functions and very slow on many architectures, but have microcode optimizations and associated CPU instructions on Intel and AMD processors, inherited from the days when the 8088 CPU on the original IBM PC was accompanied by the 8087 numerical co-processor for people who wanted a bit of floating point speed. 

My code accesses memory in an irregular (not terribly vectorizable) pattern and hence is not particularly dependent on memory speed or architecture, although I did notice a bump moving from 32 bit memory buses to 64 bit buses because it is all double precision code.  My code does a fair amount of integer operations -- indexing arithmetic and actual integer variable manipulation.  My code doesn't do much disk I/O and is embarrassingly parallel the way I tend to run it and so it doesn't stress the network at all.  From the above, one might guess that my code will be bottlenecked primarily by CPU clock (true), maybe by CPU architecture (less true), and a bit by memory (true, but in a way that doesn't gain much advantage from memory optimizing architectures). 

YMMV!  Your code will likely be different.   
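
If you do not already know where your own task spends its time, a profiler will usually tell you.  The sketch below is a toy, not my actual simulation: update_spin and run_simulation are hypothetical stand-ins for a core loop with floating point work and a few transcendental calls, written in Python so that the standard cProfile module can be used.  For compiled code the same idea applies with a native profiler such as gprof.

```python
import cProfile
import io
import math
import pstats
import random


def update_spin(rng: random.Random) -> float:
    """Stand-in for the work done in a core loop: some floating point
    arithmetic plus a few transcendental calls."""
    x = rng.random()
    return math.exp(-x) * math.cos(2.0 * math.pi * x) + math.log(1.0 + x)


def run_simulation(steps: int, seed: int = 1) -> float:
    """Toy 'central loop' that calls update_spin over and over."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        total += update_spin(rng)
    return total


def profile_report(steps: int = 20000, top: int = 5) -> str:
    """Profile run_simulation and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    run_simulation(steps)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report())
```

The functions near the top of the report are the ones worth studying: what mix of instructions they execute, how they touch memory, and whether they wait on disk or network.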

For me the "best" cluster tends to be the one where I can buy the most distributed CPU clock for the least amount of money -- a relatively simple design with one really important "knob" that controls the amount of work I can do.  For your task, the network might be more important, or the memory bus might be more important than CPU clock per se.  How, then, can we determine which architectures do better on the critical parameters once we know roughly what they are? 
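
For a clock-bound, embarrassingly parallel task like mine, the comparison can be reduced to a few lines of arithmetic.  The sketch below is only that: the NodeOption class, the node names, and every price and clock value are invented for illustration, and it assumes that work done scales with total CPU clock, which will not hold for tasks limited by memory or network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeOption:
    name: str
    clock_ghz: float   # CPU clock of one node
    price: float       # price of one node, same currency as the budget


def aggregate_clock(option: NodeOption, budget: float) -> float:
    """Total GHz purchasable within `budget` using whole nodes only."""
    node_count = int(budget // option.price)
    return node_count * option.clock_ghz


def best_option(options, budget):
    """Return the option with the most aggregate clock for the budget."""
    return max(options, key=lambda o: aggregate_clock(o, budget))


if __name__ == "__main__":
    # Hypothetical prices, invented for illustration only.
    options = [
        NodeOption("top-bin CPU", clock_ghz=3.0, price=2000.0),
        NodeOption("mid-bin CPU", clock_ghz=2.4, price=1200.0),
    ]
    budget = 24000.0
    for o in options:
        print(f"{o.name}: {aggregate_clock(o, budget):.1f} GHz total")
    print("best:", best_option(options, budget).name)
```

With these made-up numbers the slower node wins, since twenty of them fit in the budget against twelve of the faster one.  This is the sense in which more clock per node is not always better on a fixed budget.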

I have found (like many before me) that standard benchmarks are relatively useless -- LINPACK (a common benchmark that returns "MFLOPS" as a measure of floating point speed) is a relatively poor predictor of performance of my job, because it mixes CPU speed and memory access speed in a particular pattern, and my job doesn't use that pattern. STREAM (another benchmark that focuses on memory bus speed and certain common operations) is also less useful, as it doesn't measure floating point division rates and my code (alas) has a fair amount of unavoidable division in it.  SPEC rates are better, though not the overall rates, which average over too much. 

In the card game of bridge there is an old adage:  "A peek is worth a thousand finesses".  Don't guess -- measure.  Use your own code as a benchmark! This is the only benchmark that really, truly matters. Even so, it is often useful to run a bunch of "standard" benchmarks on the system(s) you plan to test, because with time and experience you may be able to learn a pattern that is moderately predictive of task completion times, and you will then be miraculously enabled in your search for the best possible deal. 

If possible, benchmark and measure without actually having to buy the test systems.  Most reputable cluster component vendors will loan you either a test system or access via the Internet to a test system. Compile your job on the test system and see how fast it runs in an environment as similar as possible to the one you are thinking of building.  Vary the task parameters and as much of the environment as you can, and see how the completion time varies. 
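
Measuring completion time does not require anything fancy.  Here is a minimal timing harness, offered as a sketch: the function names time_command and summarize are my own, and the command being timed is a placeholder that you would replace with whatever runs your job on the test system.

```python
import statistics
import subprocess
import sys
import time


def time_command(command, repeats=5):
    """Run `command` (a list of arguments) `repeats` times and return
    the wall-clock time of each run in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (fastest, median) of a list of timings."""
    return min(timings), statistics.median(timings)


if __name__ == "__main__":
    # Placeholder workload: replace with the command that runs your own job.
    command = [sys.executable, "-c", "sum(i * i for i in range(10**6))"]
    fastest, median = summarize(time_command(command))
    print(f"fastest {fastest:.3f} s, median {median:.3f} s")
```

Repeating the run and reporting both the fastest and the median time gives some protection against a single measurement being disturbed by other activity on a loaned or shared machine.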

In a fairly short period of time you should have a good idea of how your task runs on at least the major commodity architectures, and how its performance varies when you vary some of the standard purchase-decision knobs -- CPU clock, size and kind of memory and so forth.  The more experience you gain, the more measurements you perform, the greater your ability to engineer a cost-effective cluster with confidence. 
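
One simple way to read such measurements is to compare the speedup you observe against the change in the knob you turned.  The sketch below does this for CPU clock; the function name clock_scaling_efficiency and all of the numbers are hypothetical, and it assumes the two test systems differ only in clock.

```python
def clock_scaling_efficiency(slow, fast):
    """Compare two measurements, each a (clock_ghz, runtime_seconds) pair.

    Returns observed speedup divided by the clock ratio: close to 1.0
    suggests the task is bound by CPU clock, well below 1.0 suggests
    some other bottleneck (memory, disk, network) dominates."""
    slow_clock, slow_time = slow
    fast_clock, fast_time = fast
    if min(slow_clock, fast_clock, slow_time, fast_time) <= 0:
        raise ValueError("clocks and runtimes must be positive")
    observed_speedup = slow_time / fast_time
    clock_ratio = fast_clock / slow_clock
    return observed_speedup / clock_ratio


if __name__ == "__main__":
    # Hypothetical measurements, invented for illustration only.
    print(clock_scaling_efficiency((2.0, 120.0), (3.0, 80.0)))   # scales with clock
    print(clock_scaling_efficiency((2.0, 120.0), (3.0, 110.0)))  # something else limits
```

A task that scores near 1.0 rewards spending on clock; one that scores well below 1.0 is telling you to look at memory, disk, or network before paying for faster CPUs.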

While you are prototyping nodes, it is a great idea to also prototype networks, compilers, and possibly parallel libraries, especially if your task is a real parallel task that will use the network intensively. However, space and a philistine editor prevent me from going into any detail on this subject in this issue -- check back in next month and we'll talk a bit about networking before coming back to the cost part of the cost-benefit analysis of your alternative cluster designs and that all-important part of the process, shopping. 

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D., has written extensively about Linux clusters.  You can find his work and much more on his home page.