Troubleshooting PVM
For some of you, even though you followed my instructions patiently, the process above led to disaster. Nothing happened, or PVM spit out some sort of Evil Message about not being able to add a node when you asked it so politely to do so. At one point in time (one I remember very well, unfortunately) all that was left for you to try was black magic, or really reading all its documentation to figure out what was wrong -- a thing I think we'll all agree is a fate worse than death. Before you go shopping for a chicken to sacrifice on the keyboard while chanting arcane lines from the PVM manual, let me point out one fairly impressive improvement that has been made in PVM in the years since.
PVM (like ssh) is now to a certain extent automatically self-debugging. To demonstrate this, I disabled pvm on lilith and attempted the previous example. The results are shown in the sidebar.
Sidebar Two: PVM Diagnostic Messages
$pvm
pvmd already running.
pvm> conf
conf
1 host, 1 data format
HOST DTID ARCH SPEED DSIG
lucifer 40000 LINUXI386 1000 0x00408841
pvm> add lilith
add lilith
0 successful
HOST DTID
lilith Can't start pvmd
Auto-Diagnosing Failed Hosts...
lilith...
Verifying Local Path to "rsh"...
Rsh found in /usr/bin/ssh - O.K.
Testing Rsh/Rhosts Access to Host "lilith"...
Rsh/Rhosts Access is O.K.
Checking O.S. Type (Unix test) on Host "lilith"...
Host lilith is Unix-based.
Checking $PVM_ROOT on Host "lilith"...
$PVM_ROOT on lilith Appears O.K. ("/usr/share/pvm3")
Verifying Location of PVM Daemon Script on Host "lilith"...
PVM Daemon Script "/usr/share/pvm3/lib/pvmd" Was Not Found on lilith
Please check the setting of $PVM_ROOT...
As you can see, pvm is pretty smart and provides you with systematic progress messages to show you where it is failing. It can't know or figure out everything -- in this case the problem isn't that $PVM_ROOT is set incorrectly, it is that I completely removed pvm from lilith for the purpose of demonstration. However, the messages should give you a pretty good idea of where to look for a solution to the problems you might encounter.
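If you would rather check a node by hand before asking PVM to add it, the last two steps of the auto-diagnosis are easy to imitate. What follows is only a rough sketch, not PVM's own logic: the function name check_pvm_install is made up for illustration, and the only facts it relies on are the ones visible in the sidebar (that PVM consults $PVM_ROOT and looks for the daemon script at lib/pvmd beneath it). Run it on the node that refuses to join.

```python
import os
from pathlib import Path


def check_pvm_install(pvm_root=None):
    """Return (description, passed) pairs for a local PVM install.

    Hypothetical helper: it imitates the last two checks in the
    sidebar and is not part of PVM itself.
    """
    if pvm_root is None:
        # Fall back to the environment variable PVM itself reports on.
        pvm_root = os.environ.get("PVM_ROOT")

    results = []

    # Check 1: $PVM_ROOT is set and points at a real directory.
    root_ok = bool(pvm_root) and Path(pvm_root).is_dir()
    results.append(("$PVM_ROOT is set and is a directory", root_ok))

    # Check 2: the daemon script the sidebar complains about,
    # $PVM_ROOT/lib/pvmd, is actually present.
    script_ok = root_ok and (Path(pvm_root) / "lib" / "pvmd").is_file()
    results.append(("daemon script lib/pvmd exists", script_ok))

    return results


if __name__ == "__main__":
    for description, passed in check_pvm_install():
        print(f"{'O.K.  ' if passed else 'FAILED'} {description}")
```

On a node in the state lilith was in for the demonstration, the first check could still pass if the directory was left behind, while the second would fail, which points at a missing installation rather than a bad $PVM_ROOT.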
This doesn't always work. In the process of preparing this column, for example, I discovered the hard way that PVM simply will not run from my laptop over a wireless connection. No explanation -- it simply fails. Even running the far more verbose daemon debugging mode of PVM yields no clues as to why. I can even add the laptop (lilith) as a node in a cluster centered on a regular Linux system such as lucifer, but it cannot be a master node. Still, this sort of mysterious failure is by far the exception rather than the rule for PVM.
If you're still stuck at this point and the messages and documentation totally confuse you, try asking me directly via email or join the Beowulf list and ask for help there. Plenty of people use PVM, and getting help is pretty easy. Beats using a chicken, which tends to leave a mess and is undeniably hard on chickens.
PVM's Snazzy Graphical Interface
If you installed xpvm (in the "pvm-gui" RPM package), it is worth taking it for a brief test spin. The following shows an example of how the interface starts.
Sidebar Three: Starting XPVM
$xpvm
New PVMD started... XPVM 1.2.5 connected as TID=0x40001.
No Default Hostfile "/home/rgb/.xpvm_hosts" Found.
[globs.tcl][procs.tcl][util.tcl]
Initializing XPVM............................... done.
Warning: Missing Architecture Icon for LINUXI386
%
Once the GUI is displayed, try to add nodes using the "Hosts..." button. You should get something like Figure One, showing a small PVM cluster configured using xpvm. xpvm isn't as verbose or useful when nodes fail, but it is very useful for seeing how a cluster computation proceeds. The Space-Time box at the bottom is actually a real-time trace of jobs run by this particular cluster. When a PVM job is started you'll see lots of little green and yellow lines darting around in this box, visually representing the flow of information -- all those little messages one uses PVM to send between the tasks running on all the nodes!
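The startup messages in the sidebar mention that no default hostfile was found. If typing host names into the GUI gets tedious, a hostfile may save some effort. The sketch below only builds the text of such a file. It assumes xpvm accepts the ordinary PVM hostfile layout (one host name per line, "#" for comments, and a leading "&" for hosts that are listed but not added at startup); check man pvmd and the xpvm documentation before relying on that. The function name format_hostfile is hypothetical.

```python
def format_hostfile(hosts, deferred=()):
    """Build hostfile text, one host per line.

    Assumption: standard PVM hostfile conventions apply, where a
    leading '&' marks a host that is listed but not added at startup.
    """
    lines = ["# PVM hostfile: one host per line"]
    # Hosts to add when PVM starts.
    lines += list(hosts)
    # Hosts to list without adding them right away.
    lines += [f"&{name}" for name in deferred]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Host names taken from the examples in this column.
    print(format_hostfile(["lucifer"], deferred=["lilith"]), end="")
```

Redirecting the output to the path named in the startup message (.xpvm_hosts in your home directory) should then let xpvm find it the next time it starts, if the assumption about the format holds.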

When you're done playing with this interface, use the File menu to Halt or Quit. (As a parenthetical aside, doesn't it seem silly to put Halt and Quit under a menu button named File when there are no operations that have anything to do with files there? Sigh.)
That's all for this column. As always, reading the man pages of the various commands illustrated in this article (man pvm, man pvm_intro, man pvmd) is a Really Good Idea, whether or not they work out for you. Last month's column also cited a book on PVM from MIT Press, which you can Google for and buy from your favorite bookseller. Finally, I regularly lurk on the Beowulf list and would be happy to help you out there if you try things and get nothing but failure.
See you next time, when we put PVM (at last) to work!
Sidebar Four: PVM Resources
PVM Home Page
PVM Users Guide
PVM: A User's Guide and Tutorial for Networked Parallel Computing, Geist, Beguelin, Dongarra, Jiang, Manchek and Sunderam (MIT Press)
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.
Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.