Cluster Queuing and Scheduling Packages

Published on Tuesday, 17 January 2006 14:00
Written by Dan Stanzione

A brief look at some current options

In our last column, we took a general look at the problem of resource management. This time around, we're going to take a quick stroll through the options for resource management on your cluster. There are a variety of possibilities available, which vary in capability, price, license, and the platforms on which they work. Many of the available resource management packages have forked into multiple versions or changed names, leading to a somewhat confusing marketplace for the cluster administrator trying to decide what to do. In this column, we'll trace the origins and history of the popular players to give you some insight into how we got to the state of things today, as the philosophical differences that led to the splits may have a big impact on the features you'll get.

While the total number of choices is large, a few factors will greatly simplify your decision. A couple of packages dominate the world of production solutions. Some of the other options may be just as good, but they only work with certain packages and distributions, which you may or may not be using. By the end of this column, you'll have a rough picture of what's out there and what might work on your cluster. You would be well advised to check the project and vendor web sites for further information, as this topic is quite vast and individual needs vary greatly.

The Long and Storied History of PBS

The most well-known and widely used package in the world of clusters is the Portable Batch System, or PBS. Of course, it can't be that simple; PBS is a whole family of products. The original PBS was developed by a group that would eventually become Veridian Corporation for the NASA Advanced Supercomputing Division at Ames Research Center. PBS was written to replace the aging NQS (Network Queuing System) software, which still exists in various incarnations. The last surviving direct descendant is Generic NQS, which is still in use at a number of sites, though active development seems to have stopped.

PBS itself has split into several versions. After its initial deployment at NASA, PBS, like all generally useful and freely available software packages, began to spread around the community. The team at Veridian concluded there was a market for this type of thing, so they decided to continue developing PBS into a commercial product, which became PBS Professional. In order not to immediately orphan the existing PBS, they created the OpenPBS project, which would keep the source open and maintained for the version of PBS the community was using, though it wouldn't contain the new features being developed for PBS Pro. Both of these versions live on today. In March of 2003, Veridian sold PBS Professional to Altair Engineering.

While releasing a version of OpenPBS was initially a reasonable step for Veridian to take, the open source cluster community soon grew tired of its limitations. In addition, Altair Engineering still controls the OpenPBS source, and is understandably not eager to spend time and effort maintaining community-contributed features that are available in the PBS Professional product. From the Altair perspective, OpenPBS is merely the gateway to PBS Professional. So, while OpenPBS is a solid, albeit limited, product that works well on clusters, it lacks increasingly important features for modern large clusters, such as scalable performance past a few tens of nodes and sophisticated scheduling algorithms (though it still retains some desirable qualities, like a lack of Windows support).

So, the open source community began developing capabilities for OpenPBS that went well beyond its initial capacity, but Veridian and then Altair had no incentive to incorporate them into the main OpenPBS code. Meanwhile, PBS had become widely adopted, so going in an entirely different direction didn't seem feasible. Thus a new project was born, initially known as Scalable PBS (due to trademark issues, it became known as Storm, and now Torque, the Tera-scale Open-source Resource and QUEue manager). Torque was to be the all-open-source descendant of PBS and, as the original name indicated, was to address issues of scalability, as well as performance, fault tolerance, more sophisticated scheduling and scheduling interfaces, and the incorporation of the many patches the community had developed for OpenPBS. While still (and probably perpetually) under active development, Torque is ready for use, reasonably stable, and is developing a fairly wide following.

All three major forks of PBS are still active and in use; PBS Professional as a commercial product making advances in fault tolerance and scalability, OpenPBS as the solid standby in wide use and the default in packages like OSCAR and ROCKS, and Torque as the open source community development platform of choice, and used by a large group of do-it-yourselfers.

PBS (in all incarnations) consists of several components: a server, a scheduler, and the process that runs on all the compute nodes, known as a MOM. The server runs only on the head node, and is the process that actually accepts submission of jobs, maintains the queue of running jobs, and reports when jobs are completed. The MOM, or Machine-Oriented-Miniserver process is fairly lightweight, which is a good thing as a copy of the MOM process must run on every node in your cluster. The MOM interacts with the server to actually run each of your queued tasks on the compute nodes. The scheduler makes decisions about the order of jobs in the queue; most significantly, which job will run next. One of the common features of all versions of PBS is that the scheduler can be replaced with external schedulers containing different scheduling algorithms. The default scheduler in OpenPBS simply employs a first-come, first-served scheduling algorithm. PBS Pro, of course, uses a substantially more sophisticated policy.
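To make the submission side of this architecture concrete, here is a minimal sketch of a PBS job script. The #PBS directives are standard, but the job name and resource values below are placeholders for illustration, and a real job would end with an actual application launch:

```shell
#!/bin/sh
# Minimal PBS job script (illustrative values; submit with: qsub jobscript.sh)
#PBS -N example-job           # job name shown in the queue
#PBS -l nodes=4:ppn=2         # request 4 nodes with 2 processors per node
#PBS -l walltime=01:00:00     # terminate the job after one hour
#PBS -j oe                    # merge stdout and stderr into one output file

# PBS sets PBS_O_WORKDIR to the directory qsub was invoked from;
# fall back to the current directory when run outside PBS.
cd "${PBS_O_WORKDIR:-.}"

# PBS_NODEFILE lists one line per allocated processor; count them
# (the count is zero when run outside PBS, since the file is absent).
NODES=$(wc -l < "${PBS_NODEFILE:-/dev/null}")
echo "starting job with ${NODES} allocated processors"
```

The server queues the script when it is submitted, the scheduler decides when it runs, and the MOM processes on the allocated nodes actually execute the script body.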

Because of the plug-in scheduler feature, the most common way to run PBS is to replace the built-in scheduler with the Maui scheduler. This arrangement is the standard setup in the OSCAR system described last month, for instance. Maui is worthy of a column of its own, but basically Maui is a high-powered, open source, standalone scheduler. Maui focuses on scheduling functionality, and leaves the problems of launching jobs and dealing with users to resource managers like PBS. Maui achieves many of the scheduling goals described in a previous column through the use of a planning scheduling algorithm that supports reservations for particular jobs, and a backfill mechanism that looks for available space in the planned schedule to squeeze in more jobs. While PBS alone (particularly OpenPBS) is not much of a scheduler, it is a very solid resource manager, and the addition of Maui makes for a truly powerful combination.

SGE

Sun's Grid Engine (SGE) package is another leader in the resource management marketplace, and also has a fairly broad family. SGE is part of the Sun Source open software project. It should be mentioned that while the name "Grid Engine" implies "Grid," SGE is not a grid tool, although it can be used as part of a grid.

SGE is the descendant of a long line of queuing systems, dating back to an academic project known as DQS (Distributed Queuing System). DQS became a commercial product known as CODINE, which became popular particularly in Europe, and still has some active users there. DQS itself lived on for quite a while as well, and a Google search implies it is also still in use. SGE, and particularly its core (non-grid) resource management functionality, is the direct child of CODINE.

SGE now exists in both open source and commercial versions. Both versions 5.3 and 6.0 are available for download, including the SGE "Enterprise Edition" functionality, which is fully integrated into SGE 6.0. Advocates of SGE point to its ease of installation, documentation, graphical user interface, active support list, and its failover server process capability (a feature that has been in SGE for a while).
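For comparison with PBS, a minimal SGE submission script looks quite similar, but uses the "#$" directive prefix. The parallel environment name "mpi" below is a site-defined assumption, and the other values are placeholders:

```shell
#!/bin/sh
# Minimal SGE submission script (illustrative values; submit with: qsub jobscript.sh)
#$ -N example-job         # job name
#$ -cwd                   # run the job from the submission directory
#$ -pe mpi 8              # request 8 slots in a site-defined "mpi" parallel environment
#$ -l h_rt=01:00:00       # hard run-time limit of one hour

# SGE exports NSLOTS with the number of slots actually granted;
# default to 1 when the script is run outside SGE.
SLOTS="${NSLOTS:-1}"
echo "running with ${SLOTS} slots"
```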

LSF

Platform Computing's Load Sharing Facility (LSF) is one of the most venerated entries in the resource management software marketplace. LSF has been around a long time, dating back to the days before Beowulf clustering, when it competed with NQS, DQS, and their predecessors. While not originally designed for parallel processing, LSF was, to my knowledge, one of the first software packages that actually tried to schedule jobs across multiple computer systems (though many schedulers were written dating back to the '60s that scheduled across processors). My first experiences with LSF were with a small group of first-generation Alpha processors from DEC (back when it was DEC, before it was Compaq, then HP) in 1993. Since then, LSF has been ported to many platforms and all kinds of parallel computers. In fact, Platform's strategy for a long time seemed to focus on the "big iron" computing world, with a serious push into Beowulf and Linux clusters coming only relatively recently.

However, once LSF entered the Beowulf cluster world, it entered in a big way, with a rich and mature feature set, a solid interface that is familiar to PBS users, and powerful scheduling algorithms rivaled only by Maui in the open source world. Though LSF has a history as a closed-source commercial product, Platform has begun a push into the open source world with the release of the Community Scheduler Framework. Platform's commercial versions offer significant add-ons, like support for multiple clusters and standard grid services. Like PBS and SGE, Platform also supports integration with the Maui scheduler. Look for a more detailed review of Platform's LSF in a future column.
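LSF's interface will indeed look familiar to PBS users. A minimal submission script uses #BSUB directives, which bsub reads when the script is submitted on standard input. The values below are placeholders for illustration:

```shell
#!/bin/sh
# Minimal LSF submission script (illustrative values; submit with: bsub < jobscript.sh)
#BSUB -J example-job      # job name
#BSUB -n 8                # request 8 processors
#BSUB -W 01:00            # one-hour wall-clock limit
#BSUB -o output.%J        # stdout file; %J expands to the job id

# LSF exports LSB_DJOB_NUMPROC with the number of processors granted;
# default to 1 when the script is run outside LSF.
PROCS="${LSB_DJOB_NUMPROC:-1}"
echo "running on ${PROCS} processors"
```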

The Boutique Solutions

There are as many resource management solutions as there are Linux distributions, and every cluster solution has its own. LoadLeveler is the IBM entry into the scheduler fray, and though it doesn't run on Linux yet, it would be no surprise if it arrived soon. The openMOSIX cluster OS provides its own scheduler, which is good news, as none of the other schedulers mentioned here work on it yet, though the claim is that it would be easy to write a module to do so.

Sidebar: Resources

PBS Professional

OpenPBS

Torque

Sun Grid Engine (SGE)

SGE Maui Integration

Load Sharing Facility (LSF)

Maui

Clubmask

Generic NQS

If you use a bproc-based cluster distribution, you have still more options. Like MOSIX, Scyld has its own resource management package (the Beowulf Batch Scheduler, bbq), and a bproc job scheduler, bjs, is also available. Both bjs and bbq are fairly basic but functional schedulers, useful out of the box only in very simple environments. Both are easy to extend, but this requires a burning desire to code your own scheduling algorithms. Fortunately, the bproc scheduling API has gained some momentum, and PBS Pro, LSF, and SGE will all work with it. I'd strongly consider one of these options if your cluster has more than a couple of users.

Clubmask is a full, bproc-based cluster distribution, so although it belongs in the bproc section, it's not something you can install on top of your existing bproc cluster. It aims to be the first "physicist-proof" cluster distribution, although development seems to have slowed. While Clubmask does use some standard pieces of cluster software (such as Ganglia, Supermon, and of course bproc), most importantly for this column, it includes its own resource management system. It has a fairly simple scripting interface, support for batch and interactive jobs, and a scheduling interface sophisticated enough to support the Maui scheduler.

Onward

Today, a wide variety of commercial and free resource managers exist, with stable versions available and lots of development still going on. While the choice of packages isn't easy, hopefully you now see how the evolution of these projects has led to this wide variety, with each project filling its own niche. A quick comparison of the packages described is in the table below.

The next step in selecting a resource management package is to take a look at the interface. Almost all of the packages described here require jobs to be submitted in the form of scripts, though some provide GUI tools to create the scripts for the user. The script languages are rich, powerful, and of course vary from package to package. Most jobs will use only a small subset of the capabilities available, so a good practice as an administrator is to provide some template scripts for the common cases. In our next installment, we'll compare interfaces by delving into sample scripts for the leading resource managers.

A quick summary of each package is given below. Please check with the project/vendor sites, as information may have changed.

Package           Open Source   Professional   Maui         Commercial   Bproc        Active
                                Support        Compatible   Package      Compatible   Development
PBS Professional  No            Yes            Yes          Yes          Yes          Yes
OpenPBS           Yes           No             Yes          No           No           Not really
Torque            Yes           No             Yes          No           Yes          Yes
SGE               Yes           Yes            Yes          Yes          Yes          Yes
LSF               Some          Yes            Yes          Yes          Yes          Yes
Generic NQS       Yes           No             No           No           No           No

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Dan Stanzione is currently the Director of High Performance Computing for the Ira A. Fulton School of Engineering at Arizona State University. He previously held appointments in the Parallel Architecture Research Lab at Clemson University and at the National Science Foundation.
