That Free Lunch You Wanted...

Clustering seems almost too good to be true. If you have work that needs to be done in a hurry, buy ten systems and get done in a tenth of the time. If only it worked with kids and the dishes. Alas, whether it's kids and dishes or cluster nodes and tasks, linear speedup on a divvied-up task is too good to be true, according to Amdahl's Law, which strictly limits the speedup your cluster can hope to achieve.

In the first two columns we explored parallel speedup on a simple NOW (network of workstations) style cluster using the provided task and taskmaster programs. In the last column, we observed some fairly compelling relations between the amount of work we were doing in parallel on the nodes, the amount of communications overhead associated with starting the jobs up and receiving their results, and the speedup of the job as we split it up across more and more nodes.

In this column we're going to explore just how all this works. Be warned -- we will work out Genuine Algebraic Formulas that are simple enough, but not for the faint of heart. Nevertheless, the basic ideas underlying these formulas are critical for any would-be cluster engineer to understand if they hope for success. As you should see by now, building a compute cluster can be as simple as looking at a pile of networked workstations running Linux, waving your hands about a bit (chanting a bit and wearing strange robes helps to impress the feeble-minded), and going "Poof! You're a Cluster!" Or it can be a lot more complicated, and these formulas are the reason why.

Before you begin, I personally recommend putting your right brain and left brain into a state of ideal balance by whatever means you find most suitable, be it a few minutes of a soothing mantra or a few swigs of an equally soothing beverage made from malted grains and hops. Feel free to don strange robes if it helps.

Amdahl's Law

Parallel computing is actually not a terribly new idea. Some of the very earliest "computers" to be applied to a serious problem (cryptography in World War II) were heavily parallel, with lots of units trying to break a message at the same time. Big corporate players, such as IBM, have been studying parallel computation for decades.

During the 1960's, a man named Gene Amdahl was director of IBM's Advanced Computing Systems Laboratory (he later left to form his own company). In a 1967 paper entitled "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", he formulated what has come to be known as Amdahl's Law, one of the fundamental laws of parallel computer/program design.

This law is very simple to understand. The "speed" (S) of a program that does a certain amount of work (W) can be viewed as the work it does divided by the time it takes to complete:

(1) S1 = W / T1

where the subscript "1" refers to the number of systems running the program. Amdahl observed that many programs could be at least partially parallelized, that is, part of the work that they did could be split up and done on more than one system at a time. The rest of the work, on the other hand, had to be done serially, one instruction after another.

In our taskmaster/task examples of the last two months, the "setup" instructions in taskmaster and the starting and stopping of the timer have to be done serially whether the task is run on one machine or one million. Even if we precalculated all the answers returned from the nodes (reducing the time required to compute them to zero), the program's runtime would not be reduced to zero. Let's turn this observation into algebra.

If we let the time that must be spent doing serial work on a single machine be Ts, and the single-machine time spent doing work that could be parallelized be Tp, so that:

(2) T1 = Ts + Tp

then splitting the parallel part of the task up on N processors can at best reduce the time required to complete the parallelized work by a factor of N. Thus:

(3) TN = Ts + Tp/N

and we obtain the speedup:

(4) SN = W / TN = W / (Ts + Tp/N) < N (W / T1) = N S1

This last result is Amdahl's Law. It says that linear speedup, where performing a block of work in parallel on N processors speeds it up by a factor of N, is an ideal limit that is, in practice, never achieved!
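
If you want to see what this limit looks like with actual numbers, a throwaway Perl snippet (in the spirit of taskmaster, though not part of it) will do. The times below are pure inventions for illustration: one second of unavoidable serial work and ninety-nine seconds of parallelizable work.

#!/usr/bin/perl
# Amdahl's Law, equation (4), with invented times: 1 second of serial
# work and 99 seconds of parallelizable work on a single machine.
use strict;
use warnings;

my $Ts = 1;       # serial time (assumed)
my $Tp = 99;      # parallelizable time on one machine (assumed)
my $W  = 100;     # "work", chosen so that S1 = W/(Ts + Tp) = 1

foreach my $N (1, 2, 4, 8, 16, 32, 64, 128, 1024) {
    my $TN = $Ts + $Tp / $N;    # equation (3)
    my $SN = $W / $TN;          # equation (4)
    printf "N = %4d   TN = %7.3f   SN = %6.2f   (linear would be %d)\n",
        $N, $TN, $SN, $N;
}

No matter how many nodes you throw at it, SN creeps toward W/Ts = 100 and never reaches N for any N greater than one.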

Some features of our taskmaster experiments are now understandable. As we increased the total amount of work being done (the number of random numbers generated and the delay that simulated the amount of "work" being done to generate a random number) without similarly increasing the serial work that had to be done to manage the computation, we made Ts relatively small compared to Tp, getting a very good (nearly linear) speedup as we split Tp up. Eventually, however, Tp/N was smaller than Ts and we stopped speeding up.

Amdahl's Law is very important as it makes it clear that when engineering a cluster the best you can hope for is getting your work done N times as fast on an N CPU (Central Processing Unit) cluster. However, it is too optimistic. On a real cluster, a job can actually run more slowly in parallel than it did on a single CPU!

This situation needs to be thoroughly understood by any would-be cluster engineer. Otherwise you could end up spending a whole lot of money buying and building a cluster only to discover at the end that your work actually takes longer to complete when you use it! And you would hate that.

Improved Estimates of the Speedup

This section will be a bit sketchy. That's because in the real world, parallel speedup is not necessarily simple. It can vary with algorithm, with network topology, with program organization. Computers themselves have certain intrinsic nonlinearities even in their single-CPU operation that can also interact with the speedup process to confound even Amdahl's Law (as we'll explore in the next section). So bear in mind that this is a simplified picture that is good enough to get you kickstarted on serious cluster engineering, but that if you have a very complex problem you're likely to need to learn a lot more.

When we made up times to describe parallelizing our idealized task above, we did not account for the fact that in the parallelized task the program organization might well be different than in the idealized serial task. For example, if we really wrote a Perl script to execute a task and collect the results on a single machine, we wouldn't bother with threads. We'd just run the task serially, which would be faster. Parallelizing it also added the time required to ssh-connect to each node and initiate the task, and the time required to get the results back across the network (instead of their being written directly into machine memory).

Some of the changes we have to make add to the serial time, making a new serial time Ts' > Ts. Other things add a serial time Tsp to each component of the parallelized task (e.g. the time required to make the ssh connection and get back the data over the network), so that the time spent doing the parallel work turns into:

(5) Tp/N + Tsp * N

The total is:

(6) TN' = Ts' + Tp/N + Tsp * N

Our speedup then becomes:

(7) SN' = W / (Ts' + Tp/N + Tsp * N)

This result is actually a pretty good estimate for many parallelized tasks. Note that this function has a peak as N is increased and that the speedup approaches zero in the limit that N goes to infinity. This trend exhibits the "poisoning" effect of a relatively big Tsp, but it is still (remember) a simplified description. For example, if we generated enough random numbers we'd soon discover that Tsp actually contains a piece that scales with the amount of work done (and not just the number of hosts): those numbers have to be transmitted back to the master over the network, and once more than about a kilobyte's worth of them is sent back (roughly the size of a network packet), the transmission time becomes proportional to the work.
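
To watch the peak for yourself, plug equation (7) into the same sort of throwaway Perl snippet. Every number here (W, Ts', Tp, and especially Tsp) is an assumption chosen simply to make the curve visible, not a measurement:

#!/usr/bin/perl
# Equation (7) with an assumed per-node overhead Tsp: the speedup now
# peaks and then falls as N grows. All times are invented.
use strict;
use warnings;

my $W       = 100;
my $Tsprime = 1;      # Ts', serial time of the parallelized program (assumed)
my $Tp      = 99;     # parallelizable time on one machine (assumed)
my $Tsp     = 0.05;   # per-node setup/communication overhead (assumed)

foreach my $N (1, 2, 4, 8, 16, 32, 44, 64, 128, 256) {
    my $TN = $Tsprime + $Tp / $N + $Tsp * $N;   # equation (6)
    printf "N = %4d   SN' = %6.2f\n", $N, $W / $TN;
}

With these particular made-up numbers the speedup tops out near N = sqrt(Tp/Tsp), about 44 hosts, and then slides back down as the Tsp * N term takes over.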

This last bit is why nobody would actually design a cluster to generate random deviates in master/slave fashion for use in a computation that required a lot of them. The time required to transmit them back to the master is larger than the time the master would require to generate them locally! This can easily be demonstrated with taskmaster and task, by the way. Simply increase the number of rands to be generated to, say, several million (you'll likely want to run with the quiet flag set) and set the delay to 0. Then ramp up the number of hosts on your toy cluster one at a time. Depending on the speed of your network you should start to see the task actually run SLOWER in parallel.
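
If you'd rather do the arithmetic than the experiment, a back-of-the-envelope sketch makes the same point. The rates below are rough assumptions (a 100 Mbit/s network and a master that can generate some twenty million deviates per second), not measurements from taskmaster:

#!/usr/bin/perl
# Shipping doubles back to the master over the network versus just
# generating them locally. All rates are assumed round numbers.
use strict;
use warnings;

my $count      = 10_000_000;   # random deviates wanted
my $bytes_each = 8;            # one double apiece
my $net_Bps    = 100e6 / 8;    # ~100 Mbit/s ethernet, in bytes per second
my $gen_per_s  = 20e6;         # assumed local generation rate (deviates/sec)

my $t_ship  = ($count * $bytes_each) / $net_Bps;
my $t_local = $count / $gen_per_s;

printf "ship over the wire: %6.2f seconds\n", $t_ship;
printf "generate locally:   %6.2f seconds\n", $t_local;

With anything like these rates, the wire alone costs an order of magnitude more time than simply generating the deviates on the master, before we even count the overhead of setting up the remote tasks.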

At this point you are pretty close to knowing enough of the fundamentals of parallel computing to hold your own in a discussion over a beer at a geek supercomputer meeting such as Supercomputing. Just nod knowingly when somebody mentions Amdahl's Law, mumble something about it usually not being pessimistic enough to describe real task scaling, and take another swig. And wipe that foam off of your chin. But what if the person you're talking to is a real supercomputing geek, and suddenly puts you on the spot? Suppose they start to ask you about tree distribution schemes, scaling relations, parallel algorithms? Do you have to just blush and confess to being a clustering/parallel computing newbie? Not at all. Hit them with our final topic of the month, which is guaranteed to firmly establish your credentials as being a bit of a real geek yourself.

Superlinear Speedup

As suggested by the subtitle, Amdahl's Law can be thought of as a variation of the Free Lunch Theorem (the one that says "there ain't no such thing"). Except where there is, of course. Amazingly, computers are such complex things that under just the right circumstances, there is a bit of a free lunch, at least in the sense that under those circumstances a job of work can run more than N times faster on an N processor cluster. How can this be, you ask? After all, the previous examples were devoted to showing you how superlinear speedup couldn't happen and that even Amdahl's Law was too optimistic. To show the reverse, I'll need to pull some aces from my hat and rabbits (or possibly penguins) out of my sleeve.

The CPU has a very small, very fast internal "register" memory. Those registers are typically loaded and unloaded through a special, slower but still very fast, larger but still small memory called a cache. In fact, a modern CPU likely has two or even three layers of cache memory, each larger but slower than the previous layer, between the CPU and the actual dynamic RAM (Random Access Memory) that is the memory that you buy and stick into little slots on the motherboard. Finally, the network (if any) and reading from or writing to a system's disk are much slower than dynamic RAM.

In very crude rule-of-thumb terms, there is perhaps an order of magnitude difference, or a bit less, between successive layers of memory from the CPU out. However, any network is several orders of magnitude slower than the slowest RAM. Disk can be very slow as well, orders of magnitude slower even than the network for some things.

These differences in speed are the basis of superlinear speedup. Suppose you have a job with parallelizable instructions and a data area ten times larger than your cache -- otherwise a "good" candidate for speedup, with lots of work to do. Run naively on a single processor, the task is constantly accessing data that isn't in cache, forcing the CPU to wait until that data is pulled from main memory into cache and on into the CPU. The task is memory bound -- memory access speed across the cache boundary is the rate-determining bottleneck for the task.

Now imagine parallelizing the task on 20 machines. Suddenly the subtasks fit into cache. Cache memory is much faster than regular memory. Each processor is doing 1/20th of the work at the same time, but they are all doing that work maybe five times faster than a single processor would have. Tp is reduced by a factor of 100, not 20, and your code can actually run more than 20 times faster, beating the limit imposed by Amdahl's Law!
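
The arithmetic behind that claim is worth making explicit. Both numbers below are invented for illustration: a memory-bound single-CPU run, and an assumed factor of five gained by fitting each slice of the data into cache.

#!/usr/bin/perl
# Toy arithmetic for the cache argument above. All numbers are invented.
use strict;
use warnings;

my $Tp_single  = 2000;   # parallelizable time on 1 CPU, out of cache (assumed)
my $N          = 20;     # nodes; each 1/20th slice now fits in cache
my $cache_gain = 5;      # assumed speedup from running inside cache

my $Tp_node = $Tp_single / $N / $cache_gain;   # time per node, in cache
printf "per-node time %g, overall speedup %g (more than N = %d)\n",
    $Tp_node, $Tp_single / $Tp_node, $N;

That's 2000 time units down to 20, a factor of 100 -- "superlinear" relative to the naive single-CPU run, but only because that run was memory bound in the first place.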

In actuality, I cheated a bit on this example, as you might have guessed when I used the word "naively". If I could partition the task to run on 20 processors independently and have each subtask fit into cache, I could partition the task the same way on 1 processor and have each subtask fit into cache just as easily. This partitioning would already have the presumed factor of five even on a single processor system, and this is a fairer starting point to use when computing parallel speedup. If you like, tuning your algorithm to your CPU architecture can sometimes yield a big benefit all its own, and relative to this tuning Amdahl's Law is still satisfied.

However, superlinear speedup is real and not just a matter of program arrangement. Consider what happens if the program is too large to fit into memory all at the same time. Linux, Unix, and Windows all permit this to happen -- they create a "virtual memory space" that is bigger than the installed memory by using disk space as a temporary holding ground for parts of the computation while others are being executed. This "swap" space acts just like an extension of memory, except that it is orders of magnitude slower than dynamic memory.

Swapping is very bad for performance. A heavily swapping task can be vastly slower than the same task run on a system that has enough memory -- so slow that it simply won't complete in any time you are willing to wait. If the problem can be partitioned to run on a cluster of N nodes such that the parallelized components no longer swap, one can see very spectacular speedup indeed.

Conclusions

Hopefully you've learned a lot about how tasks with a parallelizable component can often be distributed on a cluster and considerably sped up. You might even have learned some philosophy along the way -- operational parallelism pervades life itself, captured in aphorisms such as "many hands make light work" -- until "too many cooks spoil the broth," at least. It's not so hard.

At this point the first three columns should have taken you from being a rank beginner at clustering to where you could confidently design a simple NOW-style cluster intended to speed up at least the simpler classes of parallel task and even have a pretty good idea of what scaling behavior you might reasonably expect to see as you split the task up among more and more of the nodes.

You can even begin to see where the real complexity in cluster design begins -- all of those rates, and times, and bottlenecks that contribute to what we see as the computer's functional "speed" at completing some task, the matching of algorithm and communication pattern to cluster design, and the network that links all the computers together into a cluster.

In my next few columns we will therefore start to get more serious and gradually "kickstart" more complex cluster designs. So far, we haven't spent a lot of time thinking about communication, but our studies above indicate that we should, as communication seems likely to be very important, even critical, to the speedup scaling of some kinds of task. We'll begin to study communication somewhat indirectly by rewriting our basic task/taskmaster combination using a real parallel programming library such as PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). This change will give us the control over the communication process we need to facilitate study of the network, in addition to (hopefully) getting you started with these valuable tools that are literally the sine qua non of compute clusters, tools that gave birth to the Beowulf project and much more.

Then we'll tackle the very heart of any parallel computer -- the interprocessor communications system, which, in the context of commodity, off-the-shelf parallel compute clusters, without exception means the network. Hope to see you there.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.