Finally, once you've finished all your directives, your script needs to actually run your job. You can simply place in the script the commands that you would normally use to run the job from the command line. You can also run any other sequence of shell commands, and the output will be captured in your standard output file. Using the echo command to place some labels in your file is typically helpful here. In this case, line 14 of Listing One and line 12 of Listing Two place the directory the command was run from in the file (note the different environment variables for PBS vs. LSF).
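As a rough illustration of that command section, the sketch below labels the output with echo and records the submission directory. PBS_O_WORKDIR and LS_SUBCWD are the submission-directory variables set by PBS and LSF respectively. The submit_dir helper, the fallback to the current directory, and the date command standing in for the real workload are assumptions made here so the fragment also runs outside a scheduler; they are not taken from Listing One or Listing Two.

```shell
#!/bin/sh
# Sketch of the command section of a job script; scheduler directives
# would appear above this point in a real script.

# PBS sets PBS_O_WORKDIR and LSF sets LS_SUBCWD to the directory the job
# was submitted from. Falling back to $PWD is an assumption made here so
# the sketch also runs outside a scheduler.
submit_dir() {
    echo "${PBS_O_WORKDIR:-${LS_SUBCWD:-$PWD}}"
}

echo "=== Submitted from: $(submit_dir) ==="
echo "=== Running on: $(uname -n) ==="

# Stand-in for the real workload: replace with the commands you would
# normally type at the prompt.
date

echo "=== Job finished ==="
```

Labels like these make it much easier to find the start and end of each run when several jobs append to similar output files.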
Submitting Scripts

Once you have completed your job script, you are ready to submit it to the resource manager. This task is done in PBS with the qsub command or, for LSF, with the bsub command.
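The two submission forms typically look like the sketch below. The script names are placeholders: pbs_sample_script.sh is the name used later in this article, while lsf_sample_script.sh is a hypothetical name. Note that qsub takes the job script as an argument, whereas bsub reads it on standard input so that the embedded #BSUB directives are processed. The submit_pbs and submit_lsf helpers, and their fallback of printing the command when the scheduler client is not installed, are assumptions added so the sketch can be tried on any machine.

```shell
#!/bin/sh
# Submit a job script to PBS or LSF, or print the command if the
# scheduler client is not installed (the dry-run fallback is an assumption).

submit_pbs() {
    # qsub takes the job script as an argument.
    if command -v qsub >/dev/null 2>&1; then
        qsub "$1"
    else
        echo "would run: qsub $1"
    fi
}

submit_lsf() {
    # bsub reads the job script on standard input so that the
    # embedded #BSUB directives are processed.
    if command -v bsub >/dev/null 2>&1; then
        bsub < "$1"
    else
        echo "would run: bsub < $1"
    fi
}

submit_pbs pbs_sample_script.sh
submit_lsf lsf_sample_script.sh   # hypothetical script name
```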
All of the options used in the job scripts can also be passed directly on the command line to either the qsub or bsub command. This approach shouldn't be used as a substitute for creating job scripts, but it can be useful in certain cases. For instance, if you wanted to vary the number of nodes you ran a job on to measure its performance, and you didn't want to change your script for each run, you could simply remove the directive about nodes from the script, and submit commands to the queue such as:
$ qsub -l nodes=4 pbs_sample_script.sh
$ qsub -l nodes=8 pbs_sample_script.sh
You could even place these commands inside another script (and probably should).
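One way to wrap those submissions in another script is sketched below. It loops over a list of node counts and passes each one to qsub with -l nodes=N, as in the commands above. The node counts, the sweep function, the QSUB override variable, and the default of printing rather than submitting are all assumptions made for illustration.

```shell
#!/bin/sh
# Submit the same PBS job script at several node counts.
# Assumes the nodes directive has been removed from the script itself.

script="pbs_sample_script.sh"
node_counts="4 8"          # extend as needed, e.g. "4 8 16 32"

# QSUB is a hypothetical override: it defaults to "echo qsub" so the
# loop only prints the commands. Set QSUB=qsub to really submit.
QSUB="${QSUB:-echo qsub}"

sweep() {
    for n in $node_counts; do
        $QSUB -l nodes="$n" "$script"
    done
}

sweep
```

Keeping the sweep in a script gives you a record of exactly which runs were submitted, which is useful when you come back to compare timings.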
Parting Thoughts

With the distribution of a few sample scripts, you can save your users a lot of time and effort. The scripts here provide a starting point, but you should probably provide a sequence of steadily more sophisticated scripts. The next step would be to add directives that define dependencies; for instance, submitting jobs that won't start until other jobs finish. This feature is particularly useful if you have jobs producing files that are input to other jobs. There are plenty more options, but we're out of space for this month. Don't worry about mastering them all; the simple set provided here will get you through a lot of jobs. Happy batching!
Finally, an astute reader pointed out that we missed a resource manager in the last issue. SLURM is a production resource manager used and developed at Lawrence Livermore National Laboratory. It is now more widely available under the GNU General Public License. Like PBS and LSF, it allows for integration with Maui and other schedulers. One of the strengths of SLURM is its ability to tolerate node failures and continue functioning. SLURM is already in use on clusters of 1,000 nodes.
Thanks to Karl Schulz at the Texas Advanced Computing Center for access to scripts from their production LSF environment.
Sidebar One: Resources

Portable Batch System (PBS/Torque)
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.
Dan Stanzione is currently the Director of High Performance Computing for the Ira A. Fulton School of Engineering at Arizona State University. He previously held appointments in the Parallel Architecture Research Lab at Clemson University and at the National Science Foundation.