[Beowulf] error starting job : stray job; master mom log says : can not compose message to sister

akshar bhosale akshar.bhosale at gmail.com
Sat Jan 8 00:01:32 EST 2011


Hi,
We have a 100-node cluster running Torque 2.4.8 and are seeing a strange
problem. A job submitted interactively for 256 cores produces the following
errors in the PBS server log:

PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node07.clust1.in
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node05.clust1.in
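
In case it helps anyone triaging similar logs, here is a minimal sketch (plain Python, not part of Torque) that groups the stray-job messages by job ID so it is easy to see which nodes the server is flagging. It assumes only the message format quoted above; the function name stray_jobs_by_node and the sample text are illustrative, not anything Torque provides.

```python
import re
from collections import defaultdict

# Matches the "stray job" message format quoted above. Using \s+ between
# words tolerates the line wrapping that mail clients introduce.
STRAY_RE = re.compile(
    r"sync_node_jobs,\s+stray job\s+(\S+)\s+found\s+on\s+(\S+)"
)


def stray_jobs_by_node(log_text):
    """Return {job_id: sorted list of nodes on which it was reported stray}."""
    found = defaultdict(set)
    for job_id, node in STRAY_RE.findall(log_text):
        found[job_id].add(node)
    return {job: sorted(nodes) for job, nodes in found.items()}


# Sample input copied from the server log excerpt in this message.
sample = """\
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
on node07.clust1.in
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
on node05.clust1.in
"""

if __name__ == "__main__":
    for job, nodes in stray_jobs_by_node(sample).items():
        print(job, "->", ", ".join(nodes))
```

Running it over the full server log instead of the sample would show whether the stray reports are limited to a few nodes or spread across the whole allocation.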

The master MOM log also says:
pbs_mom: LOG_ERROR::node_bailout, 2004.nodesvr.clust1.in join_job failed from node07.clust1.in 17 - recovery attempted)
pbs_mom: LOG_ERROR::sister could not communicate (15059) in 2004.nodesvr.clust1.in job_start_error from node node0.clust1.in   in job_start_error
Jan  7 08:49:54  node07 pbs_mom: LOG_ERROR::exec_bail, exec_bail: sent 16 ABORT requests, should be 20
node_bailout, node_bailout: received KILL/ABORT request for job 2004.nodesvr.clust1.in from node node07.clust1.in

The node07 log says:
pbs_mom;Job;2004.nodesvr.clust1.in;JOIN JOB as node 15
pbs_mom;Svr;pbs_mom;LOG_ERROR::Transport endpoint is not connected (107) in im_request, rpp_flush

The job could not get a shell for 40 minutes; after that we got a shell on
the master MOM node.

We have not been able to find the exact cause. Any help will be appreciated.

--
Akshar B.
