Matching Cluster Hardware To Your Application

Engineering Clusters, Revisited

Many months ago we started learning how to design serious clusters, starting with prototyping and benchmarking hardware. The CPU part appeared to be pretty easy -- just get the fastest processor you can afford, with enough memory to run your application. Actually it isn't QUITE that simple, as there are some real differences between processors and computer architectures to choose from (and hence real choices to make) but the idea was simple enough.

After the CPU itself, the second most important determinant of cluster performance is (as we learned in our early experiments with parallel scaling) the network, so we spent several happy months learning All About Networking. We learned about the ISO/OSI model for a network, the more economical TCP/IP model, latency and bandwidth, Ethernet, IP and datagrams, TCP, and a whole slew of networking tools and true facts. Finally, and possibly most important, we learned about three core networking benchmark tools: netpipe, lmbench, and netperf.

With this wealth of knowledge in hand, we are ready to go back and face the issue of engineering a cluster that is adequate to accomplish a given task. Our knowledge will still be a little thin in places. We know enough about Ethernet to be able to predict fairly accurately what we can expect in the way of latency and bandwidth for the various kinds of Ethernet adapters we are likely to have to choose between when designing our cluster, but we don't know (yet) as much about newer, faster, more cluster-specific alternatives such as Infiniband, Myrinet, the Scalable Coherent Interface (SCI), and Fibre Channel (to name a few).

Fortunately, we know what the terms that describe them mean now, so we can have a serious talk with a salesperson while convincing them to loan us units to test and learn the rest of the way, and we know how to benchmark the units once we get our hands on them with objective tools and compare their performance to the networks with which we are more familiar. All that remains is to think about how that information will benefit us in trying to get the most work done for the least money with the cluster we're designing.

The answer to this in general terms should now be easy to understand. Begin by looking over your application(s). Determine (by measuring if possible, or by estimating from the code if not) how much data each node has to read in via a network connection (from other nodes, servers, or remote data stores) or write out via a network connection (to other nodes, servers, or remote data stores) as the computation advances. This task may not be easy. There are many possible structures for a parallel computation, although (as we've discussed in previous columns) they fit into three general categories: unbounded by communications, bounded (bottlenecked) by communications bandwidth, and bounded by communications latency.

Task Organization Examples

Imagine the simplest sort of node task, running on an imaginary 100-node cluster. It requires a remotely issued command to initiate it (on the order of 1 KB of data from a head-node server). It then runs a long time (say four weeks) and writes a single line of text data into a file on a mounted NFS directory (on the order of 1 KB of data to a network store). Every node may do this independently of all the others. This behavior is the archetypal "embarrassingly parallel" task, and is the kind of task that Grid computing clusters are designed to accomplish efficiently.

Observe that the network is "busy" twice a month, each time for a tiny fraction of a second, compared to literally millions of seconds of computing. You will need more network bandwidth to monitor the progress of your application from time to time than the application itself consumes. Clearly any speed of network is adequate for the task, so choosing the cheapest one (likely 100 BT Ethernet with a stack of inexpensive switches) is cost-beneficial, as it frees money for more, better, faster CPUs (whichever produces the biggest reduction in time per dollar spent) or for other project requirements such as disk, visualization, infrastructure, and management.
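A quick back-of-envelope check makes the point vivid. The sketch below uses the figures from the text (a four-week run, roughly 1 KB in and 1 KB out per node) and assumes 100 BT Ethernet's theoretical peak of 12.5 MB/s:

```python
# What fraction of a run does the network actually carry traffic for
# the embarrassingly parallel task above? Figures from the text; the
# 12.5 MB/s transfer rate is the 100 BT theoretical peak.

RUN_SECONDS = 4 * 7 * 24 * 3600   # four weeks of computing per node
BYTES_MOVED = 2 * 1024            # ~1 KB in (command) + ~1 KB out (result)
WIRE_RATE = 12.5e6                # 100 Mbps Ethernet, in bytes per second

comm_seconds = BYTES_MOVED / WIRE_RATE
comm_fraction = comm_seconds / RUN_SECONDS

print(f"compute time:  {RUN_SECONDS:,} s")
print(f"network time:  {comm_seconds:.2e} s")
print(f"network share: {comm_fraction:.2e}")
```

The network is in use for well under a millisecond out of roughly 2.4 million seconds of computing, a share on the order of one part in ten billion.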

Next, consider a job (still running on 100 nodes) that still requires a command to initiate it (the same 1 KB of start-up data). Once it is begun, its work flow is very data intensive: in a "run" it must read in a total of 10 GB of data from a central (NFS) store, in blocks of 100 MB at a time. Each of these blocks is transformed by the node and the resulting 100 MB of new data is sent on to the next node downstream in a chain of nodes for further processing. Since each node is itself in that chain, it also has to receive 100 MB from its upstream node when that node finishes its block, before continuing with its next read from disk.

We can imagine that the processing of each block of 100 MB of data takes about one minute. For every minute of actual computing, each node has to read 100 MB of data from a central store, read 100 MB of data from an upstream node, and write 100 MB of data to a downstream node before it can continue to do the next minute of computing. Now the network not only matters, it can be critical. Let's do the math.

100 BT Ethernet can carry at most 100 million bits per second (Mbps) although nearly all Ethernet adapters and switches today are full duplex and can carry this much traffic each way (in and out) at the same time. Forget for the moment the full duplex part and let's figure out just how long it will take to read/write a 100 MB block from/to another node. If we divide by 8 (turning bits into bytes) we see that the theoretical peak is 12.5 million bytes per second (MBps), and that it will take 8 seconds (minimum) to send 100 MB.
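That arithmetic is worth writing down, if only because bits-versus-bytes slips are so common:

```python
# Minimum time to move one 100 MB block over 100 BT Ethernet at the
# theoretical peak rate, ignoring headers and protocol overhead.

LINK_BPS = 100e6        # 100 megabits per second
BLOCK_BYTES = 100e6     # one 100 MB block (decimal megabytes, as in the text)

peak_Bps = LINK_BPS / 8           # divide by 8: bits -> bytes = 12.5 MB/s
seconds = BLOCK_BYTES / peak_Bps  # 8 seconds minimum per block

print(f"peak rate: {peak_Bps / 1e6} MB/s, transfer time: {seconds} s")
```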

However, that neglects the headers -- headers actually count as part of the theoretical peak speed of the physical medium, but all that matters to us is the data (or content of the packets, not the wrappers and headers intended to get it there safely and reliably and possibly even securely). Subtracting away the headers, we find that data is at best 95% or so of a packet.
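Where does that 95% figure come from? Here is one way to derive it, assuming standard full-size Ethernet frames (1500-byte MTU) and basic TCP and IP headers with no options; with TCP options such as timestamps the efficiency drops a bit further:

```python
# Payload efficiency of a full-size TCP/IP frame on Ethernet: how much
# of the wire time carries OUR data? Assumes a 1500-byte MTU and no
# TCP or IP options.

MTU = 1500               # bytes of IP packet per frame
IP_HDR, TCP_HDR = 20, 20 # basic header sizes, no options
ETH_OVERHEAD = 14 + 4    # Ethernet header + frame check sequence
PREAMBLE_GAP = 8 + 12    # preamble + mandatory interframe gap

payload = MTU - IP_HDR - TCP_HDR                 # 1460 data bytes per frame
wire_bytes = MTU + ETH_OVERHEAD + PREAMBLE_GAP   # 1538 bytes of wire time

efficiency = payload / wire_bytes
print(f"{payload} data bytes per {wire_bytes} wire bytes: {efficiency:.1%}")
```

So even before any software overhead, about 5% of the wire is spent on wrappers rather than content.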

In fact, we almost never can count on getting all 95% of the theoretical maximum. The processing of the TCP and IP parts of those headers uses time, as does the physical process of moving the data onto and off of the network adapters from main memory. For many adapters, this data has to pass through the CPU (although more expensive ones may use direct memory access (DMA) to put the data directly into memory without using the CPU) and further delays will be introduced every time the CPU does ANYTHING in a multitasking environment (a minimum of over a hundred times a second, even if the system is otherwise idle). Then, the task of writing the data or reading the data from the network socket may introduce delays of its own, perhaps processing or formatting the data before it goes out or similarly acting on it as it comes in. The two ends of the communication channel may well have a handshake that has to occur as sub-blocks of data are successfully sent.

All of this will often reduce the average throughput of data from 95% of wire speed to 90%, 80% or even worse (although lower numbers than 90% often signal relatively poor networking hardware or code design, or a loaded system that is "busy" when a lot of networking transactions are trying to take place). So instead of taking 8 seconds to move 100 MB of data, we wouldn't be surprised if it took 10 seconds or even longer, on average.

Time-wise, this means that our application computes for 60 seconds and then has a minimum of 10 seconds of communication between nodes PLUS a 100 MB read from the data store to complete before it can proceed. Lots of things will make the reality worse than this estimate -- two nodes trying to talk to the same node at once, a node trying to send data to another at the same time that node is already reading from the data store, difficulties managing full duplex communications. Some of these things might be avoidable by carefully designing the application, but clearly any delay experienced by a node is passed on to all nodes downstream that await the data produced by the delayed node.
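Putting the pieces together gives a best-case duty cycle. This sketch assumes full duplex lets the upstream read and the downstream write overlap, and that the server is uncontended (both optimistic, as the text goes on to show):

```python
# Best-case duty cycle for the pipelined task: 60 s of computing,
# then one 10 s node-to-node exchange (read and write overlapped via
# full duplex) plus a 10 s read from the store. Both communication
# figures assume ~10 MB/s effective throughput.

COMPUTE = 60.0      # seconds of real work per 100 MB block
NODE_XFER = 10.0    # 100 MB exchanged with neighboring nodes
STORE_READ = 10.0   # 100 MB from the NFS store, best case

cycle = COMPUTE + NODE_XFER + STORE_READ
efficiency = COMPUTE / cycle

print(f"cycle: {cycle} s, compute efficiency: {efficiency:.0%}")
```

Even in this rosy scenario the cluster spends only 75% of its time computing, and every real-world delay pushes that number down.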

We still aren't done, though, because while we can imagine trying to arrange the data communication between nodes so it is done in parallel, all the nodes have to get their 100 MB from the server to proceed, and it can only provide that data (with a 100 BT connection shared between all nodes) to one node at a time.

This is the killer task. First of all, reading data from a file server typically reduces bandwidth relative to straight host-to-host communications, because the server is doing many things and because server-to-disk latency and bandwidth come into play. Our 10 second read can become 12 seconds or even 20 seconds even when things are orderly and quiet.

But things aren't quiet. Every minute we do our computation, write to a node, read from a node, and then wait on the server. The nodes literally fight to get data from the server: some 100 nodes are all requesting 100 MB of data at more or less the same time. The server is constantly seeking on the disk, a process that adds a major latency hit. It is constantly changing contexts to spit out another string of packets to the hosts waiting on data in its task queue. After the first minute of computation, nearly all of the nodes will be waiting on data from the server. The whole cluster will sit idle, except for the nodes that have just finished their server read and the nodes just downstream of them in the data flow.
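The scale of the server bottleneck is easy to quantify. Assuming the server can sustain about 10 MB/s on its single 100 BT link (optimistic, since seeks and context switches make it worse), serving all 100 nodes each step takes far longer than the step itself:

```python
# The server-side bottleneck: 100 nodes each want a 100 MB read per
# 60 s compute step, but a single 100 BT link serializes those reads.
# Assumes ~10 MB/s effective server throughput -- optimistic.

NODES = 100
READ_S = 10.0       # seconds of server wire time per 100 MB read
COMPUTE_S = 60.0    # compute time that one read "buys" a node

demand = NODES * READ_S            # total serialized read time per step
nodes_fed = COMPUTE_S / READ_S     # nodes the server can keep busy

print(f"server needs {demand:.0f} s of reads per {COMPUTE_S:.0f} s step; "
      f"it can keep only ~{nodes_fed:.0f} of {NODES} nodes computing")
```

In other words, a single 100 BT server can keep only about six of the hundred nodes usefully busy; the rest wait in line.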

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.