Lost cycles due to PBS (was Re: Uptime data/studies/anecdotes)
cblack at eragen.com
Tue Apr 2 15:09:34 EST 2002
On Tue, Apr 02, 2002 at 12:46:07PM -0600, Roger L. Smith wrote:
> On Tue, 2 Apr 2002, Richard Walsh wrote:
> PBS is our leading cause of cycle loss. We now run a cron job on the
> headnode that checks every 15 minutes to see if the PBS daemons have died,
> and if so, it automatically restarts them. About 75% of the time that I
> have a node fail to accept jobs, it is because its pbs_mom has died, not
> because there is anything wrong with the node.
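
For anyone who wants to try the watchdog approach described in the quote
above, here is a minimal sketch in Python. It is not the poster's actual
cron job, which was not shown. The daemon names, the restart command paths
and the use of /proc/<pid>/comm (modern Linux only) are all assumptions.
It only covers daemons on the headnode itself; checking pbs_mom on compute
nodes would need some remote check on top of this.

```python
#!/usr/bin/env python3
"""Watchdog sketch: restart head-node PBS daemons that have died."""
import subprocess
from pathlib import Path

# Assumed daemon names and restart commands (hypothetical paths);
# adjust both to match the local PBS installation.
RESTART_COMMANDS = {
    "pbs_server": ["/usr/local/sbin/pbs_server"],
    "pbs_sched": ["/usr/local/sbin/pbs_sched"],
}


def running_process_names():
    """Collect command names of live processes from Linux /proc."""
    names = set()
    for comm in Path("/proc").glob("[0-9]*/comm"):
        try:
            names.add(comm.read_text().strip())
        except OSError:
            continue  # the process exited while we were scanning
    return names


def missing_daemons(required, running):
    """Return the required daemon names absent from the running set."""
    return sorted(name for name in required if name not in running)


def main():
    for name in missing_daemons(RESTART_COMMANDS, running_process_names()):
        print(f"{name} is not running, restarting it")
        try:
            subprocess.run(RESTART_COMMANDS[name], check=False)
        except OSError as err:
            print(f"could not start {name}: {err}")


if __name__ == "__main__":
    main()
```

A crontab entry along the lines of
"*/15 * * * * /usr/local/sbin/pbs_watchdog.py" (hypothetical path) would
run it every 15 minutes, matching the interval mentioned above.
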
We used to have the same problem with PBS, especially when many jobs were
in the queue. When that happened, the PBS master sometimes died as well.
Since we've switched to SGE/GridEngine/CODINE I've been MUCH happier.
Plus it is extensible: there are lots of nifty things you can do by
writing your own load monitors via shell scripts and such.
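
To give an idea of what such a load monitor involves, here is a rough
sketch, written in Python rather than shell. It assumes the usual Grid
Engine load sensor convention as I understand it: the execution daemon
writes a line to the sensor's stdin whenever it wants a report, the sensor
answers with a "begin" line, one "host:name:value" line per value, and an
"end" line, and it exits when it reads "quit". The value name
scratch_free_mb and the /tmp path are made up for illustration, and the
name would also have to be defined in the cluster's complex configuration.
Check the Grid Engine documentation before relying on any of this.

```python
#!/usr/bin/env python3
"""Load sensor sketch: report free scratch space to Grid Engine."""
import shutil
import socket
import sys


def report(host, values):
    """Format one load report block for the given host."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines) + "\n"


def scratch_free_mb(path="/tmp"):
    """Free space under path in whole megabytes (path is an assumption)."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Assumes the local hostname matches the name Grid Engine uses.
    host = socket.gethostname()
    for line in stdin:
        if line.strip() == "quit":
            break
        # Any other input line is treated as a request for a report.
        stdout.write(report(host, {"scratch_free_mb": scratch_free_mb()}))
        stdout.flush()  # the daemon reads from a pipe, so flush each block


# When installed as a load sensor, the script would end with: main()
```

Calling report("node01", {"scratch_free_mb": 512}) gives the three lines
"begin", "node01:scratch_free_mb:512" and "end".
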
The whole point of this post is:
GNQS < PBS < Sun Gridengine :)
Chris (who tried two other batch schedulers until settling on SGE)