
Superlinear Speedup

As suggested by the subtitle, Amdahl's Law can be thought of as a variation of the Free Lunch Theorem (the one that says "there ain't no such thing"). Except where there is, of course. Amazingly, computers are such complex things that under just the right circumstances, there is a bit of a free lunch, at least in the sense that under those circumstances a job of work can run more than N times faster on an N processor cluster. How can this be, you ask? After all, the previous examples were devoted to showing you how superlinear speedup couldn't happen and that even Amdahl's Law was too optimistic. To show the reverse, I'll need to pull some aces from my hat and rabbits (or possibly penguins) out of my sleeve.

The CPU has a very small, very fast internal "register" memory. Those registers are typically loaded and unloaded through a special, slower but still very fast, larger but still small memory called a cache. In fact, a modern CPU likely has two or even three layers of cache memory, each larger but slower than the previous layer, between the CPU and the actual dynamic RAM (Random Access Memory) that is the memory that you buy and stick into little slots on the motherboard. Finally, the network (if any) and reading from or writing to a system's disk are much slower than dynamic RAM.

In very crude rule-of-thumb terms, there is perhaps an order of magnitude difference, or a bit less, between successive layers of memory from the CPU out. However, any network is several orders of magnitude slower than the slowest RAM. Disk can be very slow as well, orders of magnitude slower even than the network for some things.
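
To make those numbers concrete, here is a minimal sketch (plain C, my own illustration rather than anything from the original column) that does the same amount of arithmetic over working sets of increasing size. The per-element time typically jumps each time the working set spills out of another level of cache. The array size and the clock_gettime() timer are assumptions -- adjust for your own machine and compile with something like "gcc -O2 cachewalk.c -o cachewalk".

/* cachewalk.c -- rough sketch for probing the memory hierarchy.
 * Repeatedly sums working sets of increasing size; the total work is
 * the same for every row, so any rise in ns/element reflects cache
 * misses as the working set outgrows L1, L2, ... cache.
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1.0e-9;
}

int main(void)
{
    const long max = 1L << 24;                /* 16M doubles = 128 MB (assumed to fit in RAM) */
    double *a = malloc(max * sizeof(double));
    if (!a) return 1;
    for (long i = 0; i < max; i++) a[i] = 1.0;

    /* Working sets from 4K doubles (32 KB) up to 16M doubles (128 MB). */
    for (long n = 1L << 12; n <= max; n <<= 2) {
        long passes = max / n;                /* same total work for every size */
        double t0 = seconds(), sum = 0.0;
        for (long p = 0; p < passes; p++)
            for (long i = 0; i < n; i++)
                sum += a[i];
        double dt = seconds() - t0;
        printf("%10ld doubles: %6.2f ns/element (sum=%g)\n",
               n, 1.0e9 * dt / (double)max, sum);
    }
    free(a);
    return 0;
}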

These differences in speed are the basis of superlinear speedup. Suppose you have a job with plenty of parallelizable instructions and a data area ten times larger than your cache -- otherwise a "good" candidate for speedup, with lots of work to do. Run naively on a single processor, the task is constantly accessing data that isn't in cache, forcing the CPU to wait while that data is pulled from main memory into cache and on to the CPU. The task is memory bound -- memory access speeds across the cache boundary are the rate-determining bottleneck for the task.

Now imagine parallelizing the task on 20 machines. Suddenly the subtasks fit into cache. Cache memory is much faster than regular memory. Each processor is doing 1/20th of the work at the same time, but they are all doing that work maybe five times faster than a single processor would have. Tp is reduced by a factor of 100, not 20, and your code can actually run more than 20 times faster, beating the limit imposed by Amdahl's Law!
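
The arithmetic behind that claim is worth making explicit. The toy calculation below (my own sketch, with made-up numbers: 20 nodes and a factor-of-five cache boost, for a job that is essentially all parallelizable work) compares what Amdahl's Law alone predicts with the cache-assisted result.

/* speedup.c -- toy arithmetic for the superlinear speedup example.
 * Assumes an (unrealistically) purely parallel job: serial time Tp is
 * split over N nodes, and each node's slice runs c times faster
 * because it now fits in cache.  N = 20 and c = 5 are made up.
 */
#include <stdio.h>

int main(void)
{
    double Tp = 100.0;   /* serial time of the parallelizable work, arbitrary units */
    double N  = 20.0;    /* number of nodes                                         */
    double c  = 5.0;     /* speed boost once a node's slice fits in cache           */

    double t_amdahl = Tp / N;        /* naive expectation: divide the work by N     */
    double t_cache  = Tp / (N * c);  /* each slice also runs c times faster         */

    printf("naive parallel time : %6.2f  (speedup %5.1f)\n", t_amdahl, Tp / t_amdahl);
    printf("cache-boosted time  : %6.2f  (speedup %5.1f)\n", t_cache,  Tp / t_cache);
    return 0;
}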

In actuality, I cheated a bit on this example, as you might have guessed when I used the word "naively". If I could partition the task to run on 20 processors independently and have each subtask fit into cache, I could partition the task the same way on one processor and have each subtask fit into cache just as easily. This partitioned version would already gain the presumed factor of five even on a single-processor system, and it is the fairer starting point to use when computing parallel speedup. If you like, tuning your algorithm to your CPU architecture can sometimes yield a big benefit all its own, and relative to that tuned code Amdahl's Law is still satisfied.
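
That sort of tuning is easy to illustrate. The sketch below (again mine, with made-up sizes) applies several identical update passes to a large array in two ways: streaming the whole array on every pass, and doing all of the passes on one cache-sized block before moving on. Both routines compute the same result, but the blocked version fetches each element from main memory only once per set of passes, which is where the presumed factor of five would come from on a single processor.

/* blocking.c -- sketch of cache blocking on a single processor.
 * The sizes are assumptions: N is chosen to be much larger than any
 * cache, BLOCK to fit comfortably inside one.  Time naive() and
 * blocked() separately with your favorite timer to see the difference.
 */
#include <stdio.h>
#include <stdlib.h>

#define N     (1L << 24)   /* 16M doubles = 128 MB, far larger than cache */
#define NPASS 8
#define BLOCK (1L << 15)   /* 32K doubles = 256 KB, assumed cache-sized   */

static void naive(double *a)
{
    for (int p = 0; p < NPASS; p++)          /* every pass streams all of a[] */
        for (long i = 0; i < N; i++)
            a[i] = 0.5 * a[i] + 1.0;
}

static void blocked(double *a)
{
    for (long b = 0; b < N; b += BLOCK)      /* finish one chunk completely   */
        for (int p = 0; p < NPASS; p++)      /* ... before touching the next  */
            for (long i = b; i < b + BLOCK; i++)
                a[i] = 0.5 * a[i] + 1.0;
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;
    naive(a);
    blocked(a);
    printf("a[0] = %g\n", a[0]);   /* keep the compiler from discarding the work */
    free(a);
    return 0;
}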

However, superlinear speedup is real and not just a matter of program arrangement. Consider what happens if the program is too large to fit into memory all at the same time. Linux, Unix, and Windows all permit this to happen -- they create a "virtual memory space" that is bigger than the installed memory by using disk space as a temporary holding ground for parts of the computation while others are being executed. This "swap" space acts just like an extension of memory, except that it is orders of magnitude slower than dynamic memory.

Swapping is very bad for performance. A heavily swapping task can be vastly slower than the same task run on a system that has enough memory -- so slow that it simply won't complete in any time you are willing to wait. If the problem can be partitioned to run on a cluster of N nodes such that the parallelized components no longer swap, one can see very spectacular speedup indeed.
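
If you suspect a job is swapping, you don't have to guess. The little sketch below (mine, not from the column) uses the standard getrusage() call to report the process's major page faults -- the ones that had to wait for disk. A large and growing count during the compute phase is a strong hint that the working set doesn't fit in RAM; watching vmstat in another window, where the si and so columns show pages moving to and from swap, tells the same story system-wide.

/* majflt.c -- sketch: count this process's major page faults, i.e.
 * memory accesses that required disk I/O to satisfy.  The "work" here
 * is just touching a large array; the 512 MB size is an assumption,
 * chosen to be big but not necessarily big enough to force swapping
 * on your machine.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
    const long n = 1L << 26;                  /* 64M doubles = 512 MB */
    double *a = malloc(n * sizeof(double));
    if (!a) return 1;
    for (long i = 0; i < n; i++) a[i] = (double)i;

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("major page faults (needed disk I/O): %ld\n", ru.ru_majflt);
    printf("minor page faults (no disk I/O)    : %ld\n", ru.ru_minflt);
    free(a);
    return 0;
}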

Conclusions

Hopefully you've learned a lot about how tasks with a parallelizable component can often be distributed on a cluster and considerably sped up. You might even have learned some philosophy along the way -- operational parallelism pervades life itself, captured in aphorisms such as "many hands make light work" -- until "too many cooks spoil the broth," at least. It's not so hard.

At this point the first three columns should have taken you from being a rank beginner at clustering to the point where you can confidently design a simple NOW-style cluster intended to speed up at least the simpler classes of parallel task. You should even have a pretty good idea of what scaling behavior you might reasonably expect to see as you split the task up among more and more of the nodes.

You can also begin to see where the real complexity in cluster design lies -- all of those rates, times, and bottlenecks that contribute to what we see as the computer's functional "speed" at completing some task, the matching of algorithm and communication pattern to cluster design, and the network that links all the computers together into a cluster.

In my next few columns we will therefore start to get more serious and gradually "kickstart" more complex cluster designs. So far we haven't spent a lot of time thinking about communication, but our studies above indicate that we should, as communication seems likely to be very important, even critical, to the speedup scaling of some kinds of task. We'll begin to study communication somewhat indirectly by rewriting our basic task/taskmaster combination using a real parallel programming library such as PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). This change will give us the control over the communication process that we need to facilitate study of the network, in addition to (hopefully) getting you started with these valuable tools, which are the sine qua non of compute clusters, tools that gave birth to the Beowulf project and much more.
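
As a tiny preview of where we're headed (not something developed in this column), the canonical first MPI program looks like the sketch below. Each process learns its rank (who it is) and the size of the communicator (how many processes there are), which is exactly the information a task/taskmaster scheme needs to carve work into slices. Build and launch details vary by MPI implementation; with most, something like "mpicc hello_mpi.c -o hello_mpi" followed by "mpirun -np 4 ./hello_mpi" will do.

/* hello_mpi.c -- the canonical MPI starting point, shown as a preview. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */

    printf("hello from process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut the runtime down cleanly */
    return 0;
}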

Then we'll tackle the very heart of any parallel computer -- the interprocessor communications system, which, in the context of commodity off-the-shelf parallel compute clusters, without exception means the network. Hope to see you there.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.
