
Putting your cluster to work on parallel tasks.

In our previous installment, we started out by learning how to use pretty much any Linux LAN as the simplest sort of parallel compute cluster. In this column we continue our hands-on approach to learning about clusters and play with our archetypical parallel task on our starter cluster to learn when it runs efficiently and, just as important, when it runs inefficiently.

If you've been following along, in the last column we introduced cluster computing for the utter neophyte by "discovering" that nearly any Linux LAN can function as a cluster and do work in parallel. Following a few very general instructions, you were hopefully able to assemble (or realize that you already owned) a NOW (Network of Workstations) cluster, which is little more than a collection of unixoid workstations on a switched local area network (LAN). Using this cluster you ran a Genuine Parallel Task (tm).

If you missed the last issue, to participate fully in this month's column you should try to find or assemble a Linux/UNIX LAN (though I will always assume Linux in this column, as it is what I use at home and work) so that you have account space on all the workstations (a.k.a. "nodes" in the context of cluster computing). Each node should have ssh set up so that you can log in to it from your personal workstation (the "head node") using ssh without a password (following instructions in the man pages and other on-line documentation).

Beyond a fairly generic Linux workstation installation on each node, you'll need to make sure that the head node has a recent version of perl (one that supports threads) as well as gcc and make, and you'll need to download last month's source code for "taskmaster" (a threaded "master" application that initiates a task on many hosts in parallel and collects the results) and "task.c" (the task). This code has a tiny change (in taskmaster) relative to last month's source that makes the output less verbose when all you are doing is using it to do timings.

The task itself is a work assignment for a node: it loops over a sleep for a delay passed to it by the taskmaster, then generates a random number (a uniform deviate in the range 0.0-1.0) and returns it to the originating taskmaster thread, until nwork random numbers have been returned. The value of delay represents the amount of simulated work that has to be done by the node to return a number. This will turn out to be an important number, and this column is largely about exploring the various limits of behavior that one can see for different ratios of work done (delay) to the presumed fixed time required by each node to send its number back to taskmaster for assembly and presentation.
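For concreteness, here is a minimal sketch of the loop just described. This is not the actual task.c from last month's column (use that source for the experiments); the argument handling and the use of whole seconds for delay are assumptions made purely for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

int main(int argc, char *argv[])
{
  if (argc != 3) {
    fprintf(stderr, "Usage: %s <delay (seconds)> <nwork>\n", argv[0]);
    return 1;
  }
  int delay = atoi(argv[1]);     /* simulated work per random number */
  int nwork = atoi(argv[2]);     /* how many numbers to return */
  srand48((long) time(NULL) ^ (long) getpid());
  for (int i = 0; i < nwork; i++) {
    sleep(delay);                /* "do" delay seconds of work */
    printf("%f\n", drand48());   /* a uniform deviate in [0.0,1.0) */
    fflush(stdout);              /* so the master sees each number as it is produced */
  }
  return 0;
}

The essential feature is simply that every number returned costs delay worth of computation plus whatever it costs to ship the number back over the network.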

To get set up to play, perform the following tasks:

  • Put the code for task.c and its Makefile in a subdirectory on your personal or primary system(s) and build it (just entering "make" should do it, if gcc and make are installed and correctly set up).
  • Put the resulting "task" binary on a consistent path on all the systems that will participate (with e.g. cp or scp). This might be someplace like $HOME/bin/task or /tmp/task. I'm assuming the latter in the taskmaster script below, or you can edit taskmaster and put it wherever you like. You can execute task without arguments to see its argument list, and should execute it by hand once or twice with arguments to see what it returns.
  • Put the taskmaster script in a convenient place -- probably your working subdirectory where you have the task.c source. If you execute the taskmaster script without arguments (be sure to set its executable bits) it will also tell you its argument list. If it complains about not knowing about threads, you will need to update your version of perl to one that includes thread support.
  • Create a file named "hostfile" containing the names or IP numbers of the host(s) you wish to use in your parallel computation in the directory with taskmaster. I'd recommend having at least four participating hosts (one of which can be the "master" host that you are working from) but it would be great if you had as many as eight hosts in your NOW that you can put to work.
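A hostfile of this sort is nothing fancy -- presumably just one hostname or IP number per line (a quick look at the taskmaster source will confirm the exact format it expects). With four hypothetical nodes named node01 through node04 (substitute the names of your own machines), it would contain nothing more than:

node01
node02
node03
node04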

Task First, Cluster Later

Some of you who are reading this column to learn how to build clusters may be surprised (or even annoyed) that the focus thus far is on parallel programming and not on what to do with wires and racks and super-fast dual-CPU computers in custom cases and so forth. The hardware is what is interesting and cool -- racks of nodes and fancy networks and words like "supercomputer" and "teraflops". So why waste time on software?

The reason is simple: engineering a successful cluster begins with understanding the nature of parallel software, not the hardware. The hardware is the easy part, in a sense; so easy that the NOW you have assembled (or discovered you already had without lifting a finger) is remarkably powerful and will likely yield good scaling behavior for many parallel tasks.

Ummm, about here you might ask (if you were a true neophyte): Just what is "good scaling behavior"? And while we are asking, which parallel tasks will yield it for this kind of cluster? Are there tasks that cannot be sensibly parallelized?

Actually, you'd ask these things only if you were really a bit advanced already. A true neophyte would be asking things like "can I read my mail faster on my 10 GFLOP NOW cluster?" (You laugh, but this question has appeared, with a variety of non-parallelizable applications taking the place of mail, over and over again on the Beowulf list, beowulf (at) beowulf (dot) org. Or possibly you don't laugh, as this might be the very question that was in your mind! In case it isn't clear, the answer is no.)

Only when we understand the answers to these questions will we be able to address the much more difficult task of how to engineer a cluster that yields good scaling for tasks that would perform poorly on an ordinary 100BT NOW. To get there, we have to study parallel tasks in general. The experiments below are intended to illuminate the critical properties of parallel task decomposition, properties that are shared to a great extent by all parallelized tasks.

Exploring Parallel Task Granularity

If you are getting (or thinking about getting) into the parallel/cluster computing game, at some point you are likely going to want to read a book on actual parallel program design. One such book that I can recommend is Ian Foster's Designing and Building Parallel Programs. One reason I can recommend it is that you can read it for free on-line until you decide that you'd like to own a paper copy (as I do) for easier reading and reference.

In Foster's book, the stages of parallel program design are identified as partitioning, communication, agglomeration and mapping. In this article, we are exploring some of the consequences of various ways of partitioning a task deliberately designed to be completely parallelizable and running the resulting trivially parallelized program on a small cluster with fixed communications costs and a predetermined mapping of work to processors.

The term granularity is one you will often hear in a discussion of parallel computing. It is (in rough terms) the amount of work done in a single "chunk" resulting from a given task partitioning. If taskmaster's goal is to generate a list of random numbers, the finest grained partitioning of the task is just generating a random number (or set of random numbers) as a chunk of work, which isn't a lot of work at all, and so the task is very fine grained in that limit. If instead taskmaster's goal is to simulate a long involved process whose final result is a number or set of numbers, the same task partitioning could be much coarser grained.
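To make the work-to-communication ratio concrete, here is a deliberately oversimplified timing model -- an illustration built on assumptions, not a measurement taken from taskmaster. Let t_comm be the (presumed fixed) cost of handing one result back to taskmaster, and let P be the number of nodes in the hostfile. Then, very roughly:

  T(1) ≈ nwork * (delay + t_comm)
  T(P) ≈ (nwork/P) * (delay + t_comm)           (best case: communication overlaps computation)
  T(P) ≈ (nwork/P) * delay + nwork * t_comm     (worst case: results are collected serially)

Either way, the speedup S(P) = T(1)/T(P) approaches P only when delay is large compared to t_comm; as delay shrinks toward (or below) t_comm, the communication term dominates and adding nodes buys you little or nothing. That, in a formula or two, is the coarse grained versus fine grained distinction that the timing experiments explore.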

Task partitioning is a major parameter in the design of parallel programs; understanding it is equally important in the design of a parallel computer to run that program. Indeed, with a wide range of cluster and beowulf designs to draw upon, the parallel program design process can encompass both the program itself and the hardware to run the program on at the same time. This is the ideal towards which you should strive as you kickstart your cluster program, and it begins with the task.

