
We will even throw in a torque wrench for free

Over the last couple of columns, we've done a broad survey of the scope and history of resource management. In this column, we're going to dive a little deeper into two of the leaders: PBS (also known as Torque) and LSF.

PBS, LSF, and most of the other resource management packages described here over the last couple of months require jobs to be submitted via a shell script (and even for the few that don't require it, you are probably better off using one). This requirement can be daunting for users, particularly those who are not Linux savvy. However, it doesn't have to be.

While both PBS and LSF have a large and powerful set of options, most users and most jobs do not require this capability. In a typical scenario, a user probably only cares about a few characteristics of a job: the name of the program to run, how many processors it will run on, where to get its input data, where to put its output data, and how to find out when it's done. A job script will also frequently require some information the user doesn't typically want to think about: limits on the resources the job can consume (most often, its anticipated running time).

Since the typical job only needs to convey these few things to the resource manager, a good practice is to prepare a few template scripts for users that cover the common cases, and provide your expertise on the more complicated ones, rather than trying to instruct all users on all cases. That's a hopeless task; your average user will simply not get as fired up about the queuing system as you will. (If they did, they wouldn't need you to administer the cluster for them in the first place.)
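For example, a PBS template for the common single-queue case might look like the following minimal sketch; every CHANGE_ME value is a placeholder for the user to fill in, and the queue name workq is an assumption:

#!/bin/csh
### Template job script: replace each CHANGE_ME value before submitting
#PBS -N CHANGE_ME_jobname
#PBS -o CHANGE_ME_jobname.out
#PBS -e CHANGE_ME_jobname.err
#PBS -q workq
#PBS -l nodes=CHANGE_ME_nodecount,walltime=CHANGE_ME_hh:mm:ss
### The program to run, with its arguments
CHANGE_ME_program CHANGE_ME_arguments
exit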

The Scripts

The listings below are some bare-bones scripts for a very simple case, assuming you are using only a single queue. Each script handles the same job: one for PBS, and one for LSF. Note that although these are fairly different packages to administer, the user's view of them is fairly similar.

In Listing One, you will find a PBS sample job script. Listing Two is an equivalent script, but one that will run in the LSF environment. Both scripts start out as normal UNIX shell scripts, meaning they must begin with the characters #! followed by the path to the shell that should execute them. In this case, I've chosen /bin/csh for both scripts, but you may use the shell of your choice.

In typical shell scripts, lines beginning with a # are considered comments, with the exception of the #! first line. In both PBS and LSF, there is an additional exception: lines beginning with #PBS (for PBS) or #BSUB (for LSF) are directives that describe to the resource manager how the job is to be handled (for PBS, the directive string is actually programmable within the script). In both systems, these directives can also be passed directly on the command line when the job is submitted to a queue. Any remaining lines that begin with # are still treated as comments.
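For example, the directives shown in the listings could instead be supplied at submission time; the script name myscript.csh here is hypothetical:

# PBS: options given to qsub take precedence over #PBS lines in the script
qsub -N samplejob -q workq myscript.csh

# LSF: bsub accepts the same kind of options; note that it reads
# the job script on its standard input
bsub -J samplejob -q workq < myscript.csh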

Though it isn't a requirement, it's good practice to give every job a name. This is done with a directive in both cases. Line 3 in Listing One shows how to do this in PBS, using a directive with the -N option, followed by the name. Line 3 in Listing Two does the same job in LSF, using a directive with the -J option. A requirement for jobs in both systems is to tell the resource manager where to stick output and errors. In a batch system, things that would normally be printed to the screen (if you had run the job interactively) must be captured in files. Lines 5 and 6 in Listing One show how to redirect the standard output and error from your job into files of your choosing, in this case, the files samplejob.out and samplejob.err. Lines 4 and 5 in Listing Two do the same thing, but make use of an LSF feature that substitutes the job's numeric ID for the special string %J in the file names, so each run of the job produces uniquely named output files.

 1  #!/bin/csh
 2  ### Job name
 3  #PBS -N samplejob
 4  ### Output
 5  #PBS -o samplejob.out
 6  #PBS -e samplejob.err
 7  ### Queue name
 8  #PBS -q workq
 9  #PBS -M user@example.com
10  #PBS -l nodes=32,walltime=0:15:00
11  #PBS -m be
12  ### Script commands
13  echo "Job Starting..."
14  echo "Submitted from Directory: $PBS_O_WORKDIR"
15  my_job args
16  exit

Line 8 of Listing One selects the queue to which you wish to submit your job, in this case one called workq (the default queue if you use OSCAR). If you are running a cluster with only a few users, you may only want a single queue. Larger sites typically have multiple queues, to separate large jobs from small ones or users of different priorities. The LSF and PBS syntax is the same here (see line 6 of Listing Two): a directive with the -q option, followed by the name of the queue. You can determine which queues exist on your system using the qstat (PBS) or bqueues (LSF) commands, as shown below.
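From any machine where the client commands are installed, a quick check looks like this (the output varies by site):

# PBS: list the queues, along with their limits and states
qstat -Q

# LSF: list the configured queues, their priorities, and job counts
bqueues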

Next, you need to specify the resources you want your job to use. Either queuing system can limit your resources in a number of ways, such as by number of processors, by total wall clock time, or by memory usage. In our example, we'll limit our job in two ways: to 32 processors, and to 15 minutes of wall clock time. LSF has a separate directive for each type of limit. In Listing Two, line 2 handles the processor limit, and line 7 handles the wall clock limit. In PBS, all limits are set with the -l option and are placed in a comma-separated list, as shown in line 10 of Listing One.
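Putting the Listing Two pieces together, a minimal sketch of the LSF script might look like the following; the line numbers match the references above, while the workq queue name and the %J-based file names are assumptions (bsub's -W limit is expressed in minutes, and %J expands to the job ID):

 1  #!/bin/csh
 2  #BSUB -n 32
 3  #BSUB -J samplejob
 4  #BSUB -o samplejob.%J.out
 5  #BSUB -e samplejob.%J.err
 6  #BSUB -q workq
 7  #BSUB -W 15
 8  ### Script commands
 9  echo "Job Starting..."
10  my_job args
11  exit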
