[Beowulf] When is compute-node load-average "high" in the HPC context? Setting correct thresholds on a warning script.
rpnabar at gmail.com
Tue Aug 31 10:51:20 EDT 2010
My scheduler, Torque, flags compute nodes as "busy" when the load gets
above a threshold, the "ideal load". On my 8-core compute nodes I have
ideal_load set to 8, but I am wondering whether this is appropriate.
I do understand the "ideal load = # of cores" heuristic, but in at least
30% of our jobs (if not more) I find the load average greater than 8,
sometimes even in the 9-10 range. Does this mean something is wrong, or
should I take it as the "happy" scenario for HPC: not only are all CPUs
busy, but the pipeline of processes waiting for their CPU slice is also
relatively full? After all, an under-loaded HPC node is a waste of an
expensive resource.
On the other hand, if there truly were something wrong with a node[*]
and I were to use a high load average as one of the signs of impending
trouble, what would be a good threshold? Above what load average on a
compute node do people actually get worried? It makes sense to set
PBS's default "busy" warning to that limit instead of just 8.
I'm ignoring the 1/5/15-minute load average distinction. I'm assuming
Torque is using the most appropriate one!
*e.g. a runaway process, an infinite loop in user code, multiple jobs
accidentally assigned to the same node, etc.
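As a starting point for a standalone warning script (independent of Torque), something like the following sketch could flag a node whose 1-minute load exceeds cores plus some slack. The SLACK value and the choice of the 1-minute average are my assumptions, not Torque's behavior:

```python
#!/usr/bin/env python3
# Hedged sketch of a per-node load check: warn when the 1-minute load
# average exceeds core count plus a slack margin. SLACK is illustrative.
import os

CORES = os.cpu_count() or 8   # fall back to 8 if detection fails
SLACK = 1.5                   # headroom before we consider the node suspect

def load_ok(threshold=CORES + SLACK):
    """Return (ok, load1) using the 1-minute load average."""
    load1, load5, load15 = os.getloadavg()
    return load1 <= threshold, load1

if __name__ == "__main__":
    ok, load1 = load_ok()
    if not ok:
        print(f"WARNING: load {load1:.2f} exceeds {CORES + SLACK:.1f}")
```

A persistent trip of the 15-minute average above the same threshold would be a stronger sign of a runaway process than a momentary 1-minute spike.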
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf