Hits: 64331
The simplest parallel cluster (one that you might already have)!


Getting started with what appears to be a very powerful, very complex idea (in computing, at least) is often a daunting proposition, and cluster computing is no exception. There is so much to learn! So many things can go wrong! It might require new, specialized, expensive hardware and software! Looking over some of the articles on this website can easily reinforce this view, as some of them describe very sophsticated tools and concepts.

Anybody who has ever tackled a major new computing project is aware of how costs can spiral, energy and time fritter away, and benefits seem unreachably distant in the middle of the mess. This classic Fear, Uncertainty and Doubt (FUD) has propelled many a corporate or government project manager to spend huge amounts of money to get shrink-wrapped parallel supercomputers and full service contracts because at least that way they can put a firm upper bound on costs and have a fairly predictable benefit, even when their task is actually trivially parallel and could be run at a fraction of the cost on commodity hardware.

In Universities and many government labs, though, there exist large numbers of researchers who simply don't have the deep pockets that this kind of throw-money-at-it response requires but who do have an unbounded appetite for compute cycles. For well over a decade, this group of individuals (I'm one of them) have been working on ways of getting cycles in a predictable way, at modest cost (using commodity components instead of expensive "big iron" supercomputers), without having to be an absolute computing genius to make it all work. After all, some of us want to foist all the real work of cluster programming and management off on graduate students eventually, right?

This column will present a distillation of all of this experience and effort by all of these bright people that hopefully will convince you that hey -- cluster computing with commodity systems isn't really all that difficult or expensive, and that you can see a clear path to significant benefits after reading as little as a single article (for example, this one)!

In this column today we will review the simplest of designs for a cluster you may already have -- two or more (linux/unix) workstations on a completely generic local area network. This sort of "cluster" is easy to build and actually free in many environments (or rather, for the economically picky, is available at zero marginal cost, significantly improving the productivity of your existing resources).

To demonstrate its potential utility, we will learn to actually run an absolutely trivial "demo" parallel task -- one that does no real work but contains a very clear "placeholder" where real work can be inserted. This parallel task will be used in various incarnations in future columns to demonstrate things like parallel scaling laws and parallel libraries.

A final note before proceeding. This column is for beginners at cluster computing, but is not really intended for beginners at computing altogether. It does presume a passing acquaintance with Unix (preferrably Linux), TCP/IP networking, a programming language or three at the single computer level.

This column will primarily use C, perl, and /bin/sh (as well as numerous other tools in a standard workstation installation of Red Hat Linux) in the discussion and examples, although for the most part they are really Unix-agnostic. If you are a real beginner at computers altogether, you will likely need to do a bit of work and learn to install and manage and program at the single computer level. Fortunately, bookstores, the internet, and other columns and articles in this and other linux magazines (like Linux Magazine) make this quite easy these days. The software you need is at the end of the article (cut and paste) or you can download a tar file.

Sidebar One: NOW Cluster Recipe
The following are the (very generic) ingredients for a small "getting started" NOW-style compute cluster:

  • 2-8 PC's (Intel or AMD will be easiest). Their (linux-supported) hardware configuration can be quite flexible, but for "workstation" nodes should include adequate memory (minimum 128 MB), room on their HD for a comfy linux workstation installation (4+ GB), and a 100BT ethernet interface, preferrable a "good" one.
  • A 100 BT switch with at least as many ports as you have computers, and enough cabling and wiring to connect all the workstations to the switch.
  • Linux (or perhaps FreeBSD), installed on each node. I personally am currently using Red Hat Linux version 9 for my own NOW as it permits PXE/Kickstart/Yum installation and maintenance (to be described in future columns) but just about any reasonably modern Linux will suffice.

All the workstation nodes should have e.g. Openssh (server and client), perl version>5.8.0 (with thread support), gcc, make in order to run the examples, although a good programmer ought to be able to rewrite the perl "taskmaster" in C using posix threads quite easily. I include the perl example because (frankly) it is really cool that perl now has the ability to manage threads and it is a bit easier to hack around with a script than with C sources.

To "configure" the cluster for the example, one system should probably be set up as an NFS server for /home and all the systems should share account space. ssh should be configured (according to its man pages) so that one can transparently execute remote commands without entering a password, e.g.:

rgb@lilith|T:101>ssh lucifer date
Sun Sep  7 08:27:08 EDT 2003

Is There a Supercomputer in the House?

A parallel supercomputer requires only two basic hardware components: a bunch of processors and a way for those processors to talk to one another and other private and shared resources. In the case of a NOW those components are realized as workstations and servers (each typically with one or at most two processors) and an ordinary TCP/IP network.

In almost any modern Unix/Linux based business, such a network can be found -- workstations and servers serving staff, programmers, and nowadays, even executives as Open Office, KDE, Gnome and their associated office tools mature. Universities almost invariably have workstation clusters of this sort to serve students and faculty alike. There exists a network of Linux workstations in my house, serving my wife and kids (as well as my own professional needs) and in fact it was used to test and debug things like the parallel programs that accompany this article (as I write it sitting at my laptop in my downstair den, connected via wireless to my house Local Area Network (LAN)).

If you have such a NOW, great! Check the sidebar entitled "NOW Cluster Recipe" and make sure that all the workstations have the setup described there. If not, don't despair! The same sidebar gives you a basic set of ingredients and an outline of installation instructions. These installation instructions are necessarily terse and are more of a configuration specification, as it simply isn't possible to tell you how to install linux or freebsd or a commercial Unix itself for every possible hardware or distribution permutation. This is well-documented elsewhere, however, and shouldn't be a terrible obstacle.

So go ahead, prepare your LAN, get to where you can enter "ssh target date" and see the date setting on the "target" workstation (without a password), and then return and start the next section.

Putting the NOW to Work: A Simple Parallel Task

At this point you should be sitting at a Linux/Unix workstation on a LAN set up to permit remote logins and command execution on at least one other system. These systems (as far as this article is concerned) can be Internet accessible or even on a Wide Area Network (WAN) as we're going to use ssh only as a remote connection protocol, although network performance is likely to be more consistent on a LAN.

Perform the following tasks:

Don't worry about load balancing or the network or whether the computers in question are all the same speed -- task only does simulated work with a fixed delay, so any computers used will result in a fairly predictable task scaling (more on that later). Also, task uses almost no memory or CPU, so you also don't have to worry about annoying the console users of any workstations you are borrowing -- they won't even know the task is there.

Here's what task and taskmaster do. task seeds a system random number generator with the seed you give it. It then loops nwork times, sleeping for delay seconds and then printing a uniform deviate in the range [0.0,1.0) on stdout. If delay is "big" it simulates a lot of work for a little communication; if delay is "small" it simulates a relatively little work for a lot of communication.

taskmaster is responsible for distributing an instance of task to some or all of the hosts in a hostfile passed to it as an argument. You can control how many with another argument (we don't really check if this is more than the available host pool, so be careful). Finally, you tell taskmaster how many uniform deviates it should try to make in parallel, with what delay for each. taskmaster then partitions the task, creates separate threads (one per host), executes task on the remote host via ssh in each thread. It then collects the returns from those jobs (reading their stdouts and putting the uniform deviates in an array) and prints them all to the screen, more so you can see that actual work was done on the remote hosts than for any practical purpose.

Finally, taskmaster returns the timing, which is very important and will be used in future columns to explore the mysteries of parallel scaling laws. By varying nhosts, nrands, and delay we can learn lots of things about parallel scaling even in this simple environment, and even experienced cluster programmers may find the threaded taskmaster script a useful base from which to write a more careful task distribution mechanism. One with actual error checking, for example...

You are now ready to go! Here are some example of taskmaster usage on my own NOW cluster (named "eden", if anybody cares), configured with seven available hosts.

Usage Examples

The following are simple examples of taskmaster/task usage. Don't worry about the fact that parallel speedup seems modest. Next month we'll explore parallel speedup with taskmaster and show how even this very simple example, with the crudest possible task communication mechanism, can still yield excellent parallel scaling. Or you can play with arguments on your own for "homework" in the meantime and see if you can discover this yourself!

rgb@lilith|T:114>./taskmaster hostfile 1 10 1

Spawning host threads

Host lucifer thread running.

rand[0] = 0.840188
rand[1] = 0.394383
rand[2] = 0.783099
rand[3] = 0.798440
rand[4] = 0.911647
rand[5] = 0.197551
rand[6] = 0.335223
rand[7] = 0.768230
rand[8] = 0.277775
rand[9] = 0.553970

  nhosts   nrands    delay     time
     1        10        1       13

Note that one host takes thirteen seconds to do ten seconds worth of work! Not too good!

rgb@lilith|T:121>./taskmaster hostfile 5 10 1

Spawning host threads

Host lucifer thread running.
Host caine thread running.
Host uriel thread running.
Host abel thread running.
Host archangel thread running.

rand[0] = 0.840188
rand[1] = 0.394383
rand[2] = 0.700976
rand[3] = 0.809676
rand[4] = 0.561380
rand[5] = 0.224983
rand[6] = 0.916458
rand[7] = 0.133982
rand[8] = 0.274746
rand[9] = 0.046468

  nhosts   nrands    delay     time
     5        10        1        5

Better! Five hosts now take less than 10 seconds to do 10 seconds worth of work. However a lot of computers for only a factor of two speedup! One more try:

rgb@lilith|T:122>./taskmaster hostfile 5 10 10

Spawning host threads

Host lucifer thread running.
Host caine thread running.
Host uriel thread running.
Host abel thread running.
Host archangel thread running.

rand[0] = 0.840188
rand[1] = 0.394383
rand[2] = 0.700976
rand[3] = 0.809676
rand[4] = 0.561380
rand[5] = 0.224983
rand[6] = 0.916458
rand[7] = 0.133982
rand[8] = 0.274746
rand[9] = 0.046468

  nhosts   nrands    delay     time
     5        10       10       30

Much better. Now five hosts take 30 seconds to do 100 seconds worth of work. This might turn out to be worthwhile after all!


Building a cluster (or discovering that your existing LAN is a cluster) is apparently pretty easy, really. Using a fairly simple "master" script and a "worker" application we can clearly do work in parallel and can already see a significant speedup (a factor of three using five hosts) which should be quite reproducible on just about any LAN. Between now and next month, you can play with taskmaster and task and see if you can discover settings that yield really good parallel speedup (where running on 5 hosts completes in close to 1/5 the time of one host).

In future columns we'll explore themes like Amdahl's Law and parallel speedup, parallel libraries, "the standard linux cluster design", and more, using this basic cluster (and taskmaster/task) as a starting point. We'll also learn better ways of installing and managing a cluster, how (and when) to go beyond the simple NOW-style cluster and into a GRID or a Beowulf, how to compare shelved towers, rackmounts, and different processor types and networks. Our goal will be to achieve a sufficient level of experience that you are ready to be "handed off" to my brother columnists, whose dedicated columns will take you from being a neophyte to a perfect master in the specific areas of clustering that benefit you most.

Hope to see you there.

(source code is on the next page)

Sidebar Two: task.c
 * task.c

unsigned long int seed;
/* set to 1 to debug, but not for parallel production */
int verbose = 0;
int nwork;

int main(int argc, char *argv[])

 int i,delay;

 /* Not much error checking */
 if(argc < 3){
   fprintf(stderr,"Usage: task seed nwork delay\n");

 /* Parse command line */
 seed = strtol(argv[1], (char **)NULL, 10);
 nwork = strtol(argv[2], (char **)NULL, 10);
 delay = strtol(argv[3], (char **)NULL, 10);
 if(verbose) {
   printf("The seed for this task is %u, and it will do %d work units.\n",seed,nwork);

 /* Here we do some work.  The amount is controlled by nwork. */
 for(i=0; i < work; i++){
     printf("Doing simulated work unit %d\n",i);
   /* Simulated work is "generate and return a uniform deviate"... */
   printf("%f\n",(double) random()/RAND_MAX);
   /* ...badly (should use gsl mt19937, not random():-) */



Sidebar Three: Makefile for task.c
# $Id: first_cluster.txt,v 1.4 2003/09/07 14:45:43 rgb Exp $

# Defines/Macros
PROGRAM = task
CC = gcc
LIBS = -lm

# Targets
all: $(PROGRAM)

        $(CC) $(CFLAGS) $(LDFLAGS) -o $@ $(LIBS) $(OBJECT)

        rm -f core $(PROGRAM) $(OBJECT)

# Rules
%.o:%.c $(INCLUDES)
        $(CC) -c $(CFLAGS) $<

Sidebar Four: taskmaster (perl) listing
# $Id: first_cluster.txt,v 1.4 2003/09/07 14:45:43 rgb Exp $

 use Config;
 use threads;

 # Set the path to the task.  You may prefer to use a path in your
 # home directory on a shared filesystem.
 my $taskpath = "/tmp/task";

 $Config{useithreads} or die "Upgrade to perl >= 5.8.0, compiled with threads";
 # Get required arguments (2) from command line
 $verbose = 1;
 if($ARGC != 4){
   Usage("Incorrect number or type of arguments");
 $hostfile = $ARGV[0];
 $nhosts = $ARGV[1];
 $nrands = $ARGV[2];
 $delay = $ARGV[3];

 # Get list of host names
 open(FD,$hostfile) || die "$0: can't open $hostfile";
 $i = 0;
 while() {
   $hosts[$i] = $_;

 # Split up nrands precisely and lazily (outside timer).
 # This balances our "load".
 $nr = 0;
 $i = 0;
 while($nr < $nrands){
   $i %= $nhosts;

 # Start timer and spawn remote host task threads.
 $tstart = time;
 print "\nSpawning host threads\n\n";
 for($i = 0; $i < $nhosts; $i++){
   $seed = $i + 1;
   $hostthread[$i] = threads->new(\&runtask,$taskpath,$hosts[$i],$seed,$nw[$i],$delay);
     print "Host $hosts[$i] thread running.\n";
 print "\n";

 # Accumulate returns from each host task thread in @rands.
 # This will block until the last host completes.
 @rands = ();
 foreach $hostt (@hostthread){
   @rands = (@rands,split /\n/,$hostt->join);
 $tstop = time;

 # Print out results and timing.  Don't time the printout.
 for($i = 0;$i < $nrands;$i++){
   print "rand[$i] = $rands[$i]\n";
 print "\n";
 $ttime = $tstop - $tstart;
 printf("%8s %8s %8s %8s\n","nhosts","nrands","delay","time");
 printf(" %5d  %8d %8d %8d\n",$nhosts,$nrands,$delay,$ttime);


sub runtask {

 my $taskpath = shift;
 my $host = shift;
 my $seed = shift;
 my $nwork = shift;
 my $delay = shift;
 my $task = "/usr/bin/ssh -x $host $taskpath $seed $nwork $delay";
 $rand = `$task`;
 return $rand;


sub Usage {

 my $message = shift;
 if($message) {print STDERR "Error: $message\n";}
 print STDERR

taskmaster hostfile nhosts nrands delay

 hostfile is a file that contains hostnames, one per line

 nhosts is the number of these hosts you wish to use

 nrands is the number of random numbers you wish to generate in parallel.

 delay is the number of seconds the worker task will sleep 
 (simulating  work) between each random number it generates.


This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Robert Brown, Ph.D, is has written extensively about Linux clusters. You can find his work and much more on his home page at