[Beowulf] OOM errors when running HPL

Prentice Bisbal prentice at ias.edu
Fri Dec 19 16:39:44 EST 2008


I've got a new problem with my cluster. Some of this problem may be with
my queuing system (SGE), but I figured I'd post here first.

I've been using hpl to test my new cluster. I generally run a small
problem size (Ns=60000) so the job only runs 15-20 minutes. Last night, I
upped the problem size by a factor of 10 (to Ns=600000). Shortly after
submitting the job, half the nodes were shown as down in Ganglia.
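
As a rough sanity check on those two problem sizes, here is a minimal
sketch of the memory the HPL matrix alone needs, assuming the usual
N x N double-precision layout (8 bytes per element). The function name
hpl_matrix_bytes is made up for illustration. The figure is the total
across all nodes, not per node, and it ignores HPL workspace, MPI
buffers, and the OS.

```python
def hpl_matrix_bytes(n: int, bytes_per_element: int = 8) -> int:
    """Approximate bytes for an n x n matrix of doubles (assumed HPL layout)."""
    return n * n * bytes_per_element


# Compare the two problem sizes mentioned above.
for n in (60_000, 600_000):
    gib = hpl_matrix_bytes(n) / 2**30  # convert bytes to GiB
    print(f"Ns={n}: about {gib:,.1f} GiB for the matrix alone")
```

Since the footprint grows with the square of Ns, a 10x larger Ns asks
for about 100x the memory.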

I killed the job with qdel, and the majority of the nodes came back, but
about 1/3 did not. When I came in this morning, there were kernel
panic/OOM type messages on the consoles of the systems that never came
back.

I used to run hpl jobs much bigger than this on my cluster w/o a
problem. There's nothing I actively changed, but there might have been
some updates to the OS (kernel, libs, etc) since the last time I ran a
job this big. Any ideas where I should begin looking?


-- 
Prentice