Basic (and advanced) wisdom for those who venture down the cluster hardware path
Last time we started to review just how to go about building a serious cluster computer, beginning with a fairly complete discussion of the space, power, and cooling requirements for a cluster. If you read that column, you will recall that a lot of design decisions for the cluster's future space depend (unsurprisingly) on the kind of hardware you plan to put into it. Shelved tower units, rack-mount single, dual, and even quad processor systems, and bladed computers all require very different amounts of physical space, electrical power per square foot of floor space, and cooling per cubic foot of room volume.
Of course the reverse is true as well; if your room is predetermined to be the old broom closet down the hall and you only got that by threatening to hit someone with a blunt instrument, then your hardware design decisions will be heavily influenced by your available space and desired computational capacity (which for all true cluster computing fiends is always "as much as I can get, and then some"). There are dependency loops in the process of design that have to be iterated to convergence (which is a mix of computer-speak and math-speak and a metaphor besides, gadzooks) and we had to start somewhere, so I picked physical space.
This month we're going to start looking at hardware, focusing a bit on nodes, but not exclusively, as cluster performance is highly multivariate. We'll probably come back to hardware again and again in months ahead, because after all, would you be reading this at all if you weren't deep down inside a secret computer hardware junkie, looking for the biggest fix of your life? Besides, there is so much hardware out there to choose from, the selections completely change every six to twelve months, and the costs and benefits of various design decisions vary wildly as they do. This sort of thing is, to me, an obsession (is that too strong a word? I don't think so). Yet even I have a very hard time keeping up with all the hardware options that are available, what runs on what, what the current options are for hooking it all together, and above all how much it costs (which can vary by as much as a factor of two over any six to twelve month period). I blink and the world changes.
This situation is great for me as a columnist as it means I will never run out of column topics. It is great for my editor as he will never run out of readers trying to keep up with an ever-changing landscape driven by Moore's Law in parallel overdrive. It is not so great for either you or me as cluster engineers, as it means designing a cluster is sort of like trying to create great art using a palette that keeps changing, with mauve turning into puke green and bold blue fading into charcoal gray before the paint is even dry. In these circumstances one has to develop principles of sound art, and of course one of them has to be figuring out what the palette du jour is for this six to twelve month time frame. So let's settle in for a discussion on How To Select Cluster Hardware.
Guiding Principles
Before we actually start talking dirty and using words like gigaflops, bandwidth, latency and more (in future columns), let's see if we can't put down a few pieces of Basic Wisdom on cluster design. These are in a sense more fundamental than all the details, as without them you cannot even tell which details are important. With them, you can probably hold your own at one of the big cluster or supercomputing expos, and the details will start to make sense as we cover them. Most of these principles are wisdom gleaned from years of participation on the beowulf list. If you ask a question on the list such as "what kind of network should I get for my cluster," the answer can almost certainly be analyzed in terms of the principles below.
Cost-Benefit is Everything What? The most fundamental principle of cluster design is... economics? Absolutely! If money is no object, why build a cluster? Go pay somebody a few zillion pieces of gold for a Big Iron supercomputer (which is most likely itself a "cluster", but one with custom interconnects and a raft of salesmen and service engineers with families to support out of the merest scrapings of the margins of your purchase).
Every single step involved in cluster design involves getting the most work done for the least money subject to your design constraints, and even those constraints are fundamentally economic -- the reason you may sometimes need to live with the broom closet down the hall to house your cluster is that new buildings with gaudy cluster rooms are expensive.
The Answer is "It Depends" One of the most common answers you'll get on the beowulf list to almost any "should I buy" hardware question is "it depends". What, you might ask, does "it" depend on? Fundamentally it depends on cost-benefit, of course, but to figure out cost-benefit you have to first figure out both comparative costs and comparative benefits, and these depend on a variety of things: for costs, things like vendor, warranty, and speed/capacity; for benefits, things like what kind of work you are trying to do and its differential performance on the various kinds of hardware.
That sounded good. Somehow, those first two points almost make sense. Now let's look at a couple of related principles.
Your Mileage May Vary YMMV is a standard disclaimer attached to nearly anything associated with hardware and cluster design. I tend to get asked, a lot, by people I've never met, about some critical component in their prospective cluster design. For example, they ask me what kind of network card they should buy. I answer that I've done fabulously well using snails to carry little stick-it notes between my nodes (my task is embarrassingly parallel), but your mileage may vary. This point is virtually a corollary of the "it depends" answer above -- maybe my job involves no internode communication at all, so snails work just fine. Maybe I got a real deal on snails. Maybe your job is like mine, maybe not. It depends. YMMV.
The Only Good Benchmark is Your Own Code This point follows directly from the propositions above. To compare hardware it isn't enough to just compare the vendors' claimed MIPS (millions of instructions per second) or GFLOPS (billions of floating point operations per second) or CPU clock or bandwidth or latency or whatever-else-you-like numbers; you need to know how that piece of hardware performs on your code. Your code is very likely different from the benchmark codes that were used to make the measurements. Then again, the vendor may have lied about their benchmarks.
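In practice, "benchmark your own code" can be as simple as timing a representative kernel of your application on each candidate machine. Here is a minimal sketch of the idea in Python; my_kernel is a hypothetical stand-in that you would replace with a real slice of your own application:

```python
import time

def my_kernel(n=2_000_000):
    # Hypothetical stand-in: replace this with a representative
    # slice of *your* actual application's inner loop.
    s = 0.0
    for i in range(1, n):
        s += 1.0 / (i * i)
    return s

# Repeat the run a few times and keep the best wall-clock time;
# taking the minimum filters out scheduling noise on a busy machine.
times = []
for _ in range(5):
    start = time.perf_counter()
    my_kernel()
    times.append(time.perf_counter() - start)

print(f"best of 5 runs: {min(times):.3f} seconds")
```

Run the same script on every machine you are considering and compare the numbers; that comparison is worth more than any glossy vendor benchmark sheet.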
If you keep the points above in mind while reading the remainder of this column (and other articles, and future columns, and while shopping for shoes), you'll find that although there are no firm "rules" for cluster engineering, the terrain is navigable and you can put together a decent cluster project on your very first try.
Cluster Design Parameters
It is so easy to get lost in all the glitter and glitz of shopping for sexy and sleek cluster hardware, especially if you fall into the hands of a slick salesperson, that we will begin learning how to do a sane cost-benefit comparison of engineering alternatives with the benefit part, not the cost part -- the work you want to do with the cluster. Really we began this process with the very first column in the very first issue of this magazine -- our efforts were directed at understanding a bit about parallel task organization and scaling on a quantitative basis with an eye on the future, in fact on this very day.

Understand this: the benefit of your cluster is the work you want to do with it, along with any constraints on that work. That is all there is to it. You will maximize this benefit when you can do this work the fastest, with the least amount of personal effort and invested time and money (all of which can go into either the cost or the benefit column depending on how they are valued and weighted). The "ideal" would be for you to submit a piece of work to your cluster and have it complete as fast as your finger lifts off of the Enter key, even if the job to be run is predicting the time evolution of global weather three years in advance.
We know from earlier columns that in all probability, this ideal cannot be realized. Parallel speedup is limited by Amdahl's Law and its more accurate generalizations. Even if your task is embarrassingly parallel and suitable for running on a grid style cluster of clusters, there are still limits on its parallel scaling imposed by the bottlenecks of its controlling resource servers.
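For readers who like to see the formula in front of them, the simplest form of Amdahl's Law is worth writing down (the law itself is standard; the notation here is mine). If a fraction $p$ of the work parallelizes perfectly across $N$ nodes and the remaining $1-p$ is irreducibly serial, the best speedup you can hope for is

$$
S(N) \;=\; \frac{1}{(1-p) + p/N}, \qquad \lim_{N \to \infty} S(N) \;=\; \frac{1}{1-p} .
$$

So a task that is 95% parallelizable ($p = 0.95$) can never run more than $1/0.05 = 20$ times faster than it does on a single node, no matter how many nodes you buy.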
But what are the bottlenecks for your particular task, or task mix? The amount of work you can get done on any collection of hardware is very strongly dependent on the task itself. Some cluster designs will get more work done than others as a function of (for example) the number of nodes. Of course different cluster designs have different costs as well, and the most common situation one faces is designing a cluster with either a fixed budget or at least an upper bound on the amount of money you can hope to obtain for the cluster.
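To make the fixed-budget trade-off concrete, here is a toy back-of-the-envelope comparison in Python of the kind this paragraph describes. Everything in it is invented for illustration: the budget, the node prices, the relative single-node speeds, and the parallel fraction are made-up numbers, and Amdahl's Law stands in for whatever scaling behavior your real task actually exhibits. Only measurements of your own code give a table like this any meaning.

```python
# Toy fixed-budget comparison of two hypothetical cluster designs,
# assuming simple Amdahl-style scaling. All numbers are invented
# for illustration, not vendor quotes.

BUDGET = 100_000.0   # dollars available for nodes (made up)
P = 0.95             # parallel fraction of YOUR task -- measure it!

def speedup(n, p=P):
    """Amdahl's Law estimate of speedup on n nodes."""
    return 1.0 / ((1.0 - p) + p / n)

# (design name, price per node, relative single-node speed on YOUR code)
designs = [
    ("cheap towers",    1_500.0, 1.0),
    ("rackmount duals", 4_000.0, 1.6),
]

for name, price, node_speed in designs:
    nodes = int(BUDGET // price)             # nodes that fit the budget
    work_rate = node_speed * speedup(nodes)  # relative work per unit time
    print(f"{name:16s}: {nodes:3d} nodes, relative work rate {work_rate:5.1f}")
```

With these made-up numbers the fewer, faster, more expensive nodes win (they suffer less from the serial fraction), but change P to 0.999 and the cheap towers pull ahead. That reversal is exactly why "it depends" is the right answer.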