[Beowulf] Re: TCP connect error: ECONNREFUSED. - solved-

Jörg Saßmannshausen jorg.sassmannshausen at strath.ac.uk
Wed Apr 15 05:24:25 EDT 2009


Dear all,

some time ago I contacted the list regarding the above problem.

I would like to thank everyone who contributed towards the solution. I 
have finally found out what is going on.

The problem lies in the host list (the list of nodes the job is going to 
run on, so the machinefile if you like), and in particular in its order.
PBS-type schedulers (I have used TORQUE before) provide the 
$PBS_NODEFILE, which you only need to read out. I searched the internet 
for something similar for SGE, but all I could find was that SGE 
apparently writes the host list to a file, so I used that. What I was 
not aware of at the time is that the vendor has its own script to 
generate that file and (unfortunately for me) that script contains the 
command 'sort', so the order of the nodes gets changed. However, 
ddikick, the program which does the parallelisation, seems to be quite 
fussy about the order, as the first node becomes the master and 
initiates all the other processes. As the order is different from what 
SGE supplied, that leads to the bizarre situation that SGE starts the 
process on a node which is a 'slave' from ddikick's point of view, and 
hence ddikick-master and SGE-master never speak to each other. The 
solution was to use the $PE_HOSTFILE and read out the nodes from there, 
the same as I do with the $PBS_NODEFILE. It could not have been any 
easier _if_ I had known beforehand.
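
For anyone who wants to see the idea in code, here is a minimal sketch 
of reading the host list without disturbing its order. It is not the 
script I actually use. It assumes the usual $PE_HOSTFILE layout of one 
line per host with the fields "hostname slots queue processor-range", 
and a $PBS_NODEFILE with one host name per line. The function names and 
the per_slot switch are made up for illustration; whether your program 
wants one entry per slot or one per host is something you need to check.

```python
import os


def hosts_from_pe_hostfile(text, per_slot=True):
    """Return host names in the order SGE listed them (never sorted).

    Assumed line format: "hostname slots queue processor-range".
    With per_slot=True each host is repeated once per granted slot.
    """
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[0]
        slots = int(fields[1]) if len(fields) > 1 else 1
        hosts.extend([name] * (slots if per_slot else 1))
    return hosts


def hosts_from_pbs_nodefile(text):
    """Return host names from a PBS-style node file, order preserved."""
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Inside an SGE parallel job, PE_HOSTFILE points at the host file.
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as fh:
            print("\n".join(hosts_from_pe_hostfile(fh.read())))
```

The important point is only that nothing in between sorts the list, so 
the first host stays the one on which SGE starts the job.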

I thought I would share that with you, in case somebody is searching the 
list and finds my thread. :-)

All the best from Glasgow!

Jörg


-- 
*************************************************************
Jörg Saßmannshausen
Research Fellow
University of Strathclyde
Department of Pure and Applied Chemistry
295 Cathedral St.
Glasgow
G1 1XL

email: jorg.sassmannshausen at strath.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


