Cluster Queuing and Scheduling Packages

SGE

Sun's Grid Engine (SGE) package is another leader in the resource management marketplace, and it also has a fairly broad family. SGE is part of the Sun Source open software project. It should be mentioned that, while the name "Grid Engine" implies "Grid," SGE is not a Grid tool, although it can be used as part of a Grid.

SGE is the descendant of a long line of queuing systems, dating back to an academic project known as DQS (Distributed Queuing System). DQS became a commercial product known as CODINE, which became popular, particularly in Europe, and still has some active users there. DQS itself lived on for quite a while as well, and a Google search implies it is also still in use. SGE, and particularly its core (non-grid) resource management functionality, is the direct child of CODINE.

SGE now exists in both open source and commercial versions. Both versions 5.3 and 6.0 are available for download, including the SGE "Enterprise Edition," which is fully integrated into SGE 6.0. Advocates of SGE point to its ease of installation, documentation, graphical user interface, active support list, and its failover server process capability (a feature that has been in SGE for a while).

LSF

Platform Computing's Load Sharing Facility (LSF) is one of the most venerable entries in the resource management software marketplace. LSF has been around a long time, dating back to the days before Beowulf clustering, when it competed with NQS, DQS, and their predecessors. While not originally designed for parallel processing, LSF was, to my knowledge, one of the first software packages that actually tried to schedule jobs across multiple computer systems (though many schedulers dating back to the '60s scheduled across processors). My first experiences with LSF were with a small group of DEC's (back when it was DEC, before it was Compaq, then HP) first-generation Alpha processors back in 1993. Since then, LSF has been ported to many platforms and all kinds of parallel computers. In fact, Platform's strategy for a long time seemed to focus on the "big iron" computing world, with a serious push into Beowulf and Linux clusters coming only relatively recently.

However, once LSF entered the Beowulf cluster world, it entered in a big way, with a rich and mature feature set, a solid interface that is familiar to PBS users, and powerful scheduling algorithms rivaled only by Maui in the open source world. Though LSF has a history as a closed-source commercial product, Platform has begun a push into the open source world with the release of the Community Scheduler Framework. Platform's commercial versions offer significant add-ons, like support for multiple clusters and standard grid services. Like PBS and SGE, Platform also supports integration with the Maui scheduler. Look for a more detailed review of Platform's LSF in a future column.

The Boutique Solutions

There are as many resource management solutions as there are Linux distributions, and every cluster solution has its own. LoadLeveler is IBM's entry into the scheduler fray, and though it doesn't run on Linux yet, it would be no surprise if it arrived soon. The openMOSIX cluster OS provides its own scheduler, which is good news, as none of the other schedulers mentioned here work on it yet, though the claim is that it would be easy to write a module to do so.

Sidebar: Resources

PBS Professional

OpenPBS

Torque

Sun Grid Engine (SGE)

SGE Maui Integration

Load Sharing Facility (LSF)

Maui

Clubmask

Generic NQS

If you use a bproc-based cluster distribution, you have still more options. Like MOSIX, Scyld has its own resource management package (the Beowulf Batch Scheduler, bbq). Both of these are fairly basic but functional schedulers, useful out of the box only in very simple environments. Both bjs and bbq are easy to extend, but doing so requires a burning desire to code your own scheduling algorithms. Fortunately, the bproc scheduling API has gained some momentum, and PBS Pro, LSF, and SGE will all work with it. I'd strongly consider one of those options if your cluster has more than a couple of users.

Clubmask is a full, bproc-based cluster distribution, so although it belongs in the bproc section, it's not something you can install on top of your existing bproc cluster. It aims to be the first "physicist-proof" cluster distribution, although development seems to have slowed. Clubmask does use some standard pieces of cluster software (such as Ganglia, Supermon, and of course bproc), but most importantly for this column, it includes its own resource management system. It has a fairly simple scripting interface and support for batch and interactive jobs, yet a scheduling interface sophisticated enough to support the Maui scheduler.

Onward

Today, a wide variety of commercial and free resource managers exist, with stable versions available and plenty of development still going on. While the choice of packages isn't easy, hopefully you now see how the evolution of these projects has led to this wide variety, with each project filling its own niche. A quick comparison of the packages described is in the table below.

The next step in selecting a resource management package is to take a look at the interface. Almost all of the packages described here require jobs to be submitted in the form of scripts, though some provide GUI tools to create the scripts for the user. The script languages are rich, powerful, and of course vary from package to package. Most jobs will use only a small subset of the capabilities available, so a good practice as an administrator is to provide some template scripts for the common cases. In our next installment, we'll compare interfaces by delving into sample scripts for the leading resource managers.
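As a sketch of that template-script practice, the following hypothetical administrator script writes out a minimal OpenPBS/Torque-style job script for users. The #PBS directives shown are standard options, but the job name, resource requests, and the mpirun command line are placeholders to adapt to your own site:

```shell
#!/bin/sh
# Write a template batch job script for users to copy and edit.
# The directives below use OpenPBS/Torque syntax; SGE and LSF have
# analogous (but different) directive keywords.
cat > template-job.sh <<'EOF'
#!/bin/sh
# Name the job and merge stdout/stderr into one output file
#PBS -N myjob
#PBS -j oe
# Request 2 nodes with 2 processors each, for at most 1 hour
#PBS -l nodes=2:ppn=2
#PBS -l walltime=01:00:00
# Batch jobs start in $HOME; move to the directory qsub was run from
cd "$PBS_O_WORKDIR"
# Placeholder application; submit the finished script with: qsub template-job.sh
mpirun -np 4 ./my_program
EOF
echo "wrote template-job.sh"
```

Keeping the directives in a site-provided template like this spares users from learning the full resource-request syntax before running their first job.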

A quick summary of each package is given below. Please check with the project/vendor sites, as information may have changed.

Package           | Open Source | Professional Support | Maui Compatible | Commercial Package | Bproc Compatible | Active Development
PBS Professional  | No          | Yes                  | Yes             | Yes                | Yes              | Yes
OpenPBS           | Yes         | No                   | Yes             | No                 | No               | Not really
Torque            | Yes         | No                   | Yes             | No                 | Yes              | Yes
SGE               | Yes         | Yes                  | Yes             | Yes                | Yes              | Yes
LSF               | Some        | Yes                  | Yes             | Yes                | Yes              | Yes
Generic NQS       | Yes         | No                   | No              | No                 | No               | No

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Dan Stanzione is currently the Director of High Performance Computing for the Ira A. Fulton School of Engineering at Arizona State University. He previously held appointments in the Parallel Architecture Research Lab at Clemson University and at the National Science Foundation.

    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.