Matching Cluster Hardware To Your Application

Article Index

For this pattern, 100 Mbps Ethernet just isn't the answer, and we have to look carefully at the pattern and design of our data distribution mechanism as well. The computation would spend far longer communicating than computing, and most nodes would actually sit idle most of the time. It is well worth our while to have fewer nodes, but a faster network and better organized data store.

For example, a higher bandwidth network like 1 Gbps Ethernet would reduce the time required to move 100 MB between nodes to around a second. The bottleneck associated with many nodes accessing the master server can be at least partly alleviated by copying the entire data store to scratch space on the nodes (using a maximally efficient pattern and quite possibly from multiple servers) before beginning the computation. The time for this task cannot be avoided, but the time wasted while the central server thrashes while blocking node computations can be. These changes might reduce the communications time to be closer to 1-2% of the actual computation time, permitting one to obtain a speedup with whatever number of nodes we can still afford.

As a final example, the computation may involve each node communicating with many other nodes (possibly all the other nodes) in order to advance the computation a step. For each 0.01 second (say) of computation, it may have to exchange a few tens of bytes of data with 100 systems (the other compute nodes plus the master node) in each step.

Now the other performance dimension we learned about, network latency, is the rate limiting factor. We observed last time that TCP latency on a fairly typical 100 BT Ethernet was measured at around 130 microseconds per packet. To send and receive 100 small packets costs at least 200x130 = 26000 microseconds, or around 0.026 seconds total. As always, this is the best possible result -- real performance would almost certainly be worse, unless one could synchronize the computation's communication pattern so precisely that nodes never "collided" (where two nodes try to talk to the same node at the same time).

This situation is very embarrassing. It takes nearly three times longer to communicate for each step than the computational work associated with the step requires on each node. If we use more nodes, things get worse, not better. The computation might well be taking longer to run on 100 nodes than it would on ten. Here the solution is to use a low latency network, one with communications latencies smaller than 5 microseconds. Many of these high speed networks are designed for cluster computing and bypass the TCP stack (a major source of latency delay) altogether. A network that supports DMA and unattended completion of message transmission and reception further reduces the burden on the node CPUs and makes things run faster still. 200x5 = 1,000 microseconds (or even less), and communications is once again 0.001 seconds total for every 0.01 seconds of computation, suggesting that we will get decent parallel scaling and task speedup on our cluster of 100 nodes.

These three cases by no means exhaust the possibilities for the communications needs of a numerical task being parallelized. A task can have many steps, some of which speed up nicely on any network, some of which are bottlenecked by the network bandwidth, some of which are bottlenecked by latency. The task organization can be as important as the hardware in determining where bottlenecks occur, where changing algorithms can make a bigger difference than changing networks. Untangling all the possibilities is not guaranteed to be easy. Still, these examples are generic enough to allow us to return, at long last, to the question of cost-benefit optimization.

Cost-Benefit is Everything

Recall that back in July we noted that "Cost-Benefit is Everything." Designing a cluster is nearly always about getting the most work done for a more or less fixed budget. At this very moment, I'm going through the process of trying to design an optimal cluster given a (very) fixed budget in a competitive grant proposal. We have to project our computer hardware budgets out for at least three years, and show how what we get will permit the cluster to get the most work done for that fixed budget.

As usual, we'd like to have the most simple compute power, since aggregate speed (in Hz, FLOPs, MIPs, Specint 2000's) stands out as a relatively simple measure that even non-cluster experts can understand when reviewing a proposal. However, in the actual data flow associated with the tasks we expect, the cluster will run involve accessing massive amounts of storage and relatively little inter-node communication. It is a "Grid" design, but with large, fast storage requirements and hence a need for high bandwidth connections between the compute nodes and the head nodes and central store.

To accomplish this, we find ourselves trading off nodes to get more storage, to get higher bandwidth networks (so that networks aren't bottlenecked to storage), and investigating smart storage designs so that nodes don't have to contend for the use of a single data channel in order to write or read from the central store. Our design goal is perfect balance -- the nodes working flat out should generate data that will just fill the storage we have by the time we have the next year's grant money to get more of everything, while in operation the network and disk access speeds are irrelevant (1-2% total effect) bottlenecks on the computation.

Year one is relatively easy to outline in this way; years two and three are much more difficult, as new technologies are likely to emerge over this much time. We will use Moore's Law, which predicts that compute capacity at constant cost will roughly double every twelve to eighteen months, to guesstimate them. This law unfortunately doesn't work so well for networks -- they speed up much more slowly than CPU, memory capacity, or disk capacity, for example -- but we can safely assume that they will get cheaper at any given speed over the next three years.

The above discussion outlines the general process by example. In your cluster design, you may want to get as many nodes as possible. However, if your computation requires a low latency network and a big, fast disk store, you may be finding yourself spending anywhere from a minimum of $3,000/TB to a maximum approaching $30,000/TB for disk, and spending as much on the network(s) connecting your nodes together and to the disk store as you do on the nodes themselves. Suddenly instead of getting 100 nodes at roughly $3,000 each for $300,000, you are spending $4,500 per node (including the network) and have to fund a $100,000 set of head nodes and a connected disk store. Your 100 nodes has now shrunk to 44 nodes. This design, at first glance, isn't as "sexy" to potential grant reviewers. However, the bottom line is that your 44 nodes will get more work done than the 56 nodes you "sold" from your 100 node initial design in order to get the memory, the network, and the storage your task requires to run optimally fast. Now your task will exhibit nearly linear speedup on all 44 nodes (where the 100 default Ethernet-based nodes would be hopelessly bottlenecked) and they have someplace to put all the terabytes of lovely results that they will shortly begin to generate.

Conclusion

This column concludes the appearance of this column as "Cluster Kickstart". Cluster World is now one year old (hooray) and in the new year my column will take on a new name and will have a new focus, one that permits me to address pretty much whatever I think would be of interest or benefit to people interested in cluster computing. I expect that it will still spend a lot of time focusing on teaching basic principles of cluster computing and cluster design, since this is what I "do" on the Beowulf list and as part of my job here at Duke University. However, I can now branch out into new territory and relate anecdotes and tell funny stories with less need to remain strictly focused on beginning cluster design. {mosgoogle right}

It has been a fabulous year. We've moved from building our first cluster, through an understanding of Amdahl's Law and its generalizations (fundamental to cluster design) and the process of writing some actual parallel applications to the basic principles that must be mastered to design really advanced clusters -- finding the right infrastructure, prototyping and testing, understanding the network itself (at least in its simplest form) and now putting it all together into a coherent and deliberate process of task analysis and matching of design to data flow.

There are still plenty of things we haven't touched on or have only touched on briefly -- monitoring a cluster, cluster installation (software) tricks and tools, benchmarking, analyzing your task with profiling tools and debuggers, and writing parallel programs with many kinds of basic algorithms. Fortunately, this column does not stand alone, and most of those topics have been examined in some detail elsewhere on ClusterMonkey.

I look forward to covering many of them myself in months and years to come. I hope to see you there!

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D, is has written extensively about Linux clusters. You can find his work and much more on his home page

    Search

    Login And Newsletter

    Create an account to access exclusive content, comment on articles, and receive our newsletters.

    Feedburner

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.