Serious Parallel Computing 2: Launching PVM | Cluster Newbie

Ready for Real Parallel Computation, as if there were any other

In the last column we introduced the Parallel Virtual Machine (PVM) subroutine library, the original toolset that permitted users to convert a variety of computers on a single network into a "virtual supercomputer". We reviewed its history and discussed how it works, then turned our attention to what you might have to do to install it and make it work on your own cluster playground (which might well be a very simple Network of Workstations -- NOW cluster -- that are ordinary workstations on an ordinary local area network).

New readers are advised to consult the previous columns to get up to date. To play along, you'll need a few Linux-based computers on a network, with account access on all of them and ideally a shared home directory. Now it is time to set up PVM so that it can be used (in the next installment) in a Real Parallel Computation.

There are several steps involved in getting PVM to where one can use it in a simple calculation.

1. Set up a remote shell such as rsh or ssh so that you can log in from your "head node" (the workstation on which you actually run PVM and your parallelized tasks) to all the compute nodes in your cluster without a password. We learned in the last column how to install and set up ssh (secure shell) to accomplish this preliminary but essential step.
2. Install PVM itself, ideally in a packaged form (.rpm or .deb), on all the nodes.
3. Perform some fairly routine systems administration tasks: arrange for a common file space and user access on all the nodes, if this is not already done on your LAN or cluster.
4. Set some environment variables. In a packaged version of PVM these are likely set for you when you start PVM, but it doesn't hurt to set them up permanently.
5. Start PVM either from the xpvm (graphical) or the pvm (tty) console and configure your nodes into a virtual supercomputer.
6. Run a PVM task.

In this column we will explore how to get your computer to where PVM is installed and working, so that you can create a virtual supercomputer. This accomplishment will set the stage for redoing our original "generate random numbers" problem, but using PVM as a base instead of a perl script and a binary, in next month's column.

Installing and Running PVM

For some years now PVM has been part of the regular Red Hat distribution and can be installed like any other RPM package. It is also available on a similar basis for most other RPM-based distributions and for Debian. The easiest way to proceed is therefore to get the packages from your distribution's repository or CD set, copy them to a shared directory, and enter the following as root on each node (your revision numbers may differ):

#rpm -Uvh pvm-3.4.4-12.i386.rpm 
#rpm -Uvh pvm-gui-3.4.4-12.i386.rpm

Note that we're also installing XPVM, PVM's nifty graphical front end, as it will really help you visualize and debug the virtual computer while getting started.

Alternatively, you can visit the PVM home page. There you can follow instructions to download a tarball of the pvm sources and build it locally. This method has some advantages, but for beginners they are outweighed by the disadvantages (such as figuring out the correct paths and PVM's awesomely complex "Artificial Intelligence Make" utility, aimk), so prebuilt RPMs are the better choice.

We are almost done. We have to set up the environment to make PVM function "automatically" for us. It actually would almost work out of the RPM box as the "executable" installed by the rpm is actually a shell script wrapper that sets most of what you need, but we have to tell PVM to use ssh instead of rsh so we might as well set them all. If your default shell is bash, add the following to your .bashrc on all nodes (likely only one addition, assuming it is NFS shared):

# PVM environment variables
PVM_ROOT=/usr/share/pvm3
PVM_RSH=/usr/bin/ssh
XPVM_ROOT=/usr/share/pvm3/xpvm
export PVM_ROOT 
export PVM_RSH 
export XPVM_ROOT

If your default shell is csh or tcsh, add the following to your .cshrc or .tcshrc:

#PVM environment variables 
setenv PVM_ROOT /usr/share/pvm3 
setenv PVM_RSH /usr/bin/ssh 
setenv XPVM_ROOT /usr/share/pvm3/xpvm

Now log out and log in again so that your current shell session has these variables correctly set.
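After logging back in, it is worth confirming that the variables really made it into your shell. The following is only a minimal sketch: the function name check_pvm_env is made up for this example, and it checks nothing beyond whether the three variables set above are non-empty.

```shell
#!/bin/sh
# check_pvm_env: report whether the PVM variables from the snippets above
# are set in the current shell. (Hypothetical helper, not part of PVM.)
check_pvm_env() {
    missing=0
    for name in PVM_ROOT PVM_RSH XPVM_ROOT; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            missing=1
        else
            echo "$name=$value"
        fi
    done
    # Exit status is 0 only if all three variables were set.
    return "$missing"
}
```

Paste the function into your shell (or source it from a file) and run check_pvm_env; any line ending in "is not set" means the corresponding line in your startup file did not take effect.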

It's time to test the installation by starting pvm on our master node and adding a compute node on a remote system. The sidebar shows this procedure. Before attempting this, recall that you must be able to log in remotely to the compute node without a password using ssh, as discussed in last month's article. If you missed this column, don't panic -- a few minutes with Google and the web should find you online HOWTO resources on how to set up ssh so you can log in without a password -- it is frequently discussed on a number of archived lists and at least one web document is devoted to this alone.
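One way to check the passwordless login requirement before involving PVM at all is sketched below. The function names and the SSH_CMD override are inventions for this example; the only real tool assumed is OpenSSH's ssh, whose BatchMode option makes it fail outright instead of prompting for a password.

```shell
#!/bin/sh
# can_login HOST: succeed only if ssh can run a trivial command on HOST
# without prompting. SSH_CMD is a hypothetical override for the client.
can_login() {
    "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$1" true \
        >/dev/null 2>&1
}

# check_nodes HOST...: print one status line per host; the exit status
# is non-zero if any host refused a passwordless login.
check_nodes() {
    failed=0
    for host in "$@"; do
        if can_login "$host"; then
            echo "$host: ok"
        else
            echo "$host: passwordless login failed"
            failed=1
        fi
    done
    return "$failed"
}
```

With the host name used in this column, check_nodes lilith should report "lilith: ok" before you try to add lilith from the pvm console.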

Sidebar One: Example PVM Start-Up

Start up an xterm or other terminal window and enter (changing the names to match those of your network):

$pvm
pvm> add lilith
add lilith
1 successful
       HOST     DTID
     lilith    80000
pvm> conf
conf
2 hosts, 1 data format
            HOST   DTID   ARCH     SPEED     DSIG
         lucifer  40000 LINUXI386  1000  0x00408841
          lilith  80000 LINUXI386  1000  0x00408841 
pvm>

If you were able to reproduce something similar to the sidebar, you have a virtual supercomputer running with two nodes, lucifer (the head node) and lilith! We could add more nodes this way (and you should feel free to do so and otherwise experiment). If you read the pvm documentation (available online at the URLs given in the Resources sidebar as well as in the pvm man pages that accompanied your distribution) you can easily learn to add a whole list of hosts at once by putting their names in a hostfile and running pvm hostfile. There are still other, and better, ways to add a lot of nodes all at once, but this is enough to get us started.
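As a rough illustration of the hostfile approach, the sketch below writes one host name per line, which is the simplest form a PVM hostfile can take (lines starting with # are comments). The function name write_hostfile and the file name pvm_hosts are made up for this example; the host names are the ones from the sidebar.

```shell
#!/bin/sh
# write_hostfile FILE HOST...: write a minimal PVM hostfile to FILE,
# one host name per line, preceded by a single comment line.
# (Hypothetical helper, not part of PVM.)
write_hostfile() {
    file=$1
    shift
    {
        echo "# PVM hostfile: one host per line"
        for host in "$@"; do
            echo "$host"
        done
    } > "$file"
}

# Example (the file name is arbitrary):
#   write_hostfile "$HOME/pvm_hosts" lucifer lilith
#   pvm "$HOME/pvm_hosts"
```

Hostfiles can also carry per-host options; the man pages describe those, and none of them are needed to get a small cluster going.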

Before we move on, we should learn one more thing about pvm: how to quit. There are actually two ways to exit the console. The "quit" command exits the console but leaves pvm running on all the nodes. In this way, you can start pvm, build a cluster, and exit the console monitor, run tasks, play games, logout and go home, come back the next day and crank up the pvm console, and there is your cluster, still configured or still working. Try it -- quit from pvm and then start it up again. When you type conf at the prompt, you should see your cluster still there.

On the other hand, the "halt" command stops pvm and all the pvmds on all the nodes! It destroys your virtual cluster completely. This type of exit is important to be able to do as well. PVM creates lock files on all the nodes in a cluster that prevent their use by any other people running PVM, including you (you can't start pvm twice, or start pvm on a node and see or start a different cluster). The halt command SHOULD remove all of those lock files and trace files and restore a node to the state where a new PVM cluster can be built, possibly by another user.
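If halt ever fails to clean up after itself, it helps to know where to look. The sketch below rests on an assumption you should verify on your own system: that the per-user daemon file is /tmp/pvmd.<uid> and the daemon log is /tmp/pvml.<uid>. The function name list_pvm_files is made up for this example.

```shell
#!/bin/sh
# list_pvm_files [DIR] [UID]: print any leftover per-user PVM files.
# Assumption: the daemon file is DIR/pvmd.<uid> and the daemon log is
# DIR/pvml.<uid>, with DIR defaulting to /tmp and UID to the caller's.
list_pvm_files() {
    dir=${1:-/tmp}
    uid=${2:-$(id -u)}
    for f in "$dir/pvmd.$uid" "$dir/pvml.$uid"; do
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

# To halt from a script rather than the console prompt, piping the
# command into the console is a common idiom (hedged: check it locally):
#   echo halt | pvm
```

If list_pvm_files still prints something on a node after a halt, and no pvmd process is running there, removing the listed files by hand should let a new cluster start.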

Troubleshooting PVM

For some of you, even though you followed my instructions patiently, the process above will have led to disaster. Nothing happened, or PVM spat out some sort of Evil Message about not being able to add a node when you asked it so politely to do so. At one point in time (one I remember very well, unfortunately) all that was left for you to try was black magic, or really reading all its documentation to figure out what was wrong -- a thing I think we'll all agree is a fate worse than death. Before you go shopping for a chicken to sacrifice on the keyboard while chanting arcane lines from the PVM manual, let me point out one fairly impressive improvement that has been made in PVM in the years since.

PVM (like ssh) is now to a certain extent automatically self-debugging. To demonstrate this, I disabled pvm on lilith and attempted the previous example. The results are shown in the sidebar.

Sidebar Two: PVM Diagnostic Messages

$pvm
pvmd already running.
pvm> conf
conf
1 host, 1 data format
            HOST   DTID   ARCH     SPEED     DSIG
         lucifer  40000 LINUXI386  1000  0x00408841
pvm> add lilith
add lilith
0 successful
               HOST     DTID
             lilith Can't start pvmd

Auto-Diagnosing Failed Hosts...
lilith...
Verifying Local Path to "rsh"...
Rsh found in /usr/bin/ssh - O.K.
Testing Rsh/Rhosts Access to Host "lilith"...
Rsh/Rhosts Access is O.K.
Checking O.S. Type (Unix test) on Host "lilith"...
Host lilith is Unix-based.
Checking $PVM_ROOT on Host "lilith"...
$PVM_ROOT on lilith Appears O.K. ("/usr/share/pvm3")
Verifying Location of PVM Daemon Script on Host "lilith"...

PVM Daemon Script "/usr/share/pvm3/lib/pvmd"
Was Not Found on lilith
Please check the setting of $PVM_ROOT...

As you can see, pvm is pretty smart and provides you with systematic progress messages to show you where it is failing. It can't know or figure out everything -- in this case the problem isn't that $PVM_ROOT is set incorrectly, it is that I completely removed pvm from lilith for the purpose of demonstration. However, the messages should give you a pretty good idea of where to look for a solution to the problems you might encounter.
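The last check in that diagnostic run is easy to repeat by hand. The sketch below only tests for the daemon script at the path PVM reported ($PVM_ROOT/lib/pvmd); the function names pvmd_path and has_pvmd are made up for this example, and the default root is the one used throughout this column.

```shell
#!/bin/sh
# pvmd_path [ROOT]: print where the daemon script is expected under ROOT,
# following the path shown in the diagnostic output above.
pvmd_path() {
    echo "${1:-/usr/share/pvm3}/lib/pvmd"
}

# has_pvmd [ROOT]: succeed if the daemon script exists under ROOT.
has_pvmd() {
    [ -e "$(pvmd_path "$1")" ]
}

# Example against a remote node (assumes passwordless ssh to lilith):
#   ssh lilith test -e /usr/share/pvm3/lib/pvmd && echo "pvmd present"
```

On a node where PVM has been removed, as on lilith in the sidebar, the remote test prints nothing, which points at a missing installation rather than a wrong $PVM_ROOT.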

This doesn't always work. In the process of preparing this column, for example, I discovered the hard way that PVM simply will not run from my laptop over a wireless connection. No explanation -- it simply fails. Even running the far more verbose daemon debugging mode of PVM yields no clues as to why. I can even add the laptop (lilith) as a node in a cluster centered on a regular Linux system such as lucifer, but it cannot be a master node. Still, this sort of mysterious failure is by far the exception rather than the rule for PVM.

If you're still stuck at this point and the messages and documentation totally confuse you, try asking me directly via email or join the Beowulf list and ask for help there. Plenty of people use PVM, and getting help is pretty easy. Beats using a chicken, which tends to leave a mess and is undeniably hard on chickens.

PVM's Snazzy Graphical Interface

If you installed xpvm (in the "pvm-gui" RPM package), it is worth taking it for a brief test spin. The following shows an example of how the interface starts.

Sidebar Three: Starting XPVM

$xpvm
New PVMD started... XPVM 1.2.5 connected as TID=0x40001.
No Default Hostfile "/home/rgb/.xpvm_hosts" Found.
[globs.tcl][procs.tcl][util.tcl]
Initializing XPVM............................... done.

Warning: Missing Architecture Icon for LINUXI386

%

Once the GUI is displayed, try to add nodes using the "Hosts..." button. You should get something like Figure One, showing a small PVM cluster configured using xpvm. xpvm isn't as verbose or useful when nodes fail, but it is very useful for seeing how a cluster computation proceeds. The Space-Time box at the bottom is actually a real-time trace of jobs run by this particular cluster. When a PVM job is started you'll see lots of little green and yellow lines darting around in this box, visually representing the flow of information -- all those little messages one uses PVM to send between the tasks running on all the nodes!

Figure One: The XPVM user interface

When you're done playing with this interface, use the File menu to Halt or Quit. (As a parenthetical aside, doesn't it seem silly to put Halt and Quit under a menu button named File when there are no operations that have anything to do with files there? Sigh.)

That's all for this column. As always, reading the man pages of the various commands illustrated in this article (man pvm, man pvm_intro, man pvmd) is a Really Good Idea, whether or not they work out for you. Last month's column also cited a book on PVM from MIT Press which you can Google for and buy from your favorite bookseller. Finally, I regularly lurk on the Beowulf list and would be happy to help you out there if you try things and get nothing but failure.

See you next time, when we put PVM (at last) to work!

Sidebar Four: PVM Resources

PVM Home Page

PVM Users Guide: "PVM: A User's Guide and Tutorial for Networked Parallel Computing", Geist, Beguelin, Dongarra, Jiang, Manchek and Sunderam (MIT Press)

PVM Users Guide Online

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.