Lost cycles due to PBS (was Re: Uptime data/studies/anecdotes)

Tue Apr 2 15:09:34 EST 2002

On Tue, Apr 02, 2002 at 12:46:07PM -0600, Roger L. Smith wrote:
> On Tue, 2 Apr 2002, Richard Walsh wrote:
[stuff deleted]
> PBS is our leading cause of cycle loss.  We now run a cron job on the
> headnode that checks every 15 minutes to see if the PBS daemons have died,
> and if so, it automatically restarts them.  About 75% of the time that I
> have a node fail to accept jobs, it is because its pbs_mom has died, not
> because there is anything wrong with the node.
> 
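
For anyone who wants to try the cron approach described above, here is a minimal Python sketch of such a watchdog. It is not the script Roger runs: the restart commands in DAEMONS are placeholder paths (assumptions) that you would replace with however your site starts PBS, and it assumes `pgrep` is available on the head node.

```python
import subprocess

# Assumed daemon list. The restart commands are placeholders, not real
# site paths: replace them with whatever starts PBS on your cluster.
DAEMONS = {
    "pbs_server": ["/path/to/start_pbs_server"],
    "pbs_sched": ["/path/to/start_pbs_sched"],
}


def is_running(name):
    """Return True if a process with exactly this name exists (pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def restart_dead(daemons, check=is_running, restart=subprocess.call):
    """Run the restart command of every daemon whose check fails.

    Returns the names of the daemons that were restarted, in order.
    """
    restarted = []
    for name, command in daemons.items():
        if not check(name):
            restart(command)
            restarted.append(name)
    return restarted


def main():
    # Call main() at the bottom of the script that cron invokes.
    for name in restart_dead(DAEMONS):
        print("restarted", name)
```

A crontab entry along the lines of `*/15 * * * * /path/to/pbs_watchdog.py` (hypothetical path) would give the 15-minute interval mentioned above. Checking pbs_mom on the compute nodes would need the same check run on each node, or over a remote shell, which this sketch does not attempt.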

We used to have the same problem with PBS, especially when many jobs were 
in the queue. At that point the PBS master sometimes died as well.
Since we switched to SGE/GridEngine/CODINE I've been MUCH happier.
Plus there are lots of nifty things you can do with its extensibility, 
such as writing your own load monitors as shell scripts.
The whole point of this post is:
GNQS < PBS < Sun Gridengine :)
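
To give an idea of what a custom load monitor looks like, here is a rough sketch of a Grid Engine load sensor, written in Python rather than shell. It follows the load sensor convention as I understand it: the execution daemon writes a line to the sensor's stdin whenever it wants a report, the sensor answers with a block delimited by `begin` and `end` containing `host:attribute:value` lines, and it exits when it reads `quit`. The attribute name `scratch_free_mb` and the choice of /tmp are made up for illustration; a real attribute also has to be defined in the cluster's complex configuration.

```python
import shutil
import socket
import sys


def format_report(host, values):
    """Build one load report: begin, host:name:value lines, end."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines) + "\n"


def measure_scratch():
    """Free space in /tmp in MB (hypothetical attribute, assumed path)."""
    free_bytes = shutil.disk_usage("/tmp").free
    return {"scratch_free_mb": free_bytes // (1024 * 1024)}


def run_sensor(stdin, stdout, host, measure):
    """Answer each input line with a report until 'quit' is read."""
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(format_report(host, measure()))
        stdout.flush()  # the daemon reads the report from a pipe


def main():
    # Call main() at the bottom of the script when deploying it.
    run_sensor(sys.stdin, sys.stdout, socket.gethostname(), measure_scratch)
```

The measurement is passed in as a function so that the protocol loop can be exercised by hand, with canned input and a fake measurement, before the script is registered with the execution daemon.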

Chris (who tried two other batch schedulers until settling on SGE)
