[Beowulf] TCP connect error: ECONNREFUSED.

David Simas dgs at slac.stanford.edu
Tue Mar 31 13:57:03 EDT 2009


On Mon, Mar 30, 2009 at 02:14:50PM +0100, Jörg Saßmannshausen wrote:
> Dear all,
> 
> I am having this rather annoying problem with the parallel execution of 
> one of the programs (GAMESS US version) on our cluster. The error 
> message is:
> 
>  TCP connect error: ECONNREFUSED.
>  TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
>  A fatal error occurred on DDI Process 0.
>  TCP connect error: ECONNREFUSED.
>  TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
>  A fatal error occurred on DDI Process 60.
>  TCP connect error: ECONNREFUSED.
>  TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
>  A fatal error occurred on DDI Process 2.
>  TCP connect error: ECONNREFUSED.
> 
> [ ... ]
> 
> Eventually, the ddikick tips over and the whole thing crashes. The 
> program is using rsh (yes, I know, security, I did not install the 
> cluster!) and I can rsh comp10 -> comp02 and there is no firewall 
> installed between the nodes (at least, not that I am aware of). Trying 
> to run the same job with the same number of nodes will fail X times and 
> at X+1 suddenly work. I could not work out a pattern for that (other 
> than that I get exponentially annoyed). Right now, there is only one gigabit 
> network connecting the cluster, so nfs, mpi etc. is all running over one 
> interface (again, I did not set up the cluster).

How rapidly are these rsh connection attempts occurring?  The rsh protocol
requires connections from privileged ports (port numbers below 1024).  If a
host makes more connections to another host than it has privileged ports
available, all within the TCP TIME-WAIT interval, it will run out of ports
and further connections will fail.  I've seen this occur with parallel
applications using rsh.
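
One rough way to check for this is to count sockets in TIME_WAIT whose local
port is privileged, grouped by remote host.  The following is only a sketch:
it assumes Linux-style `netstat -tan` output has been saved as text lines, and
the addresses in the usage note are made up.  The helper names, the figure of
512 usable ports (rsh clients typically bind in the 512-1023 range) and the
60-second TIME-WAIT are assumptions for illustration, not measurements from
this cluster.

```python
from collections import Counter

PRIVILEGED_LIMIT = 1024  # ports below this are "privileged" on Unix


def count_privileged_time_wait(netstat_lines):
    """Count TIME_WAIT sockets with a privileged local port, per remote host.

    Expects lines in the Linux `netstat -tan` layout:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Skip headers and anything that is not a TCP socket line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "TIME_WAIT":
            continue
        local_port = int(local.rsplit(":", 1)[1])
        remote_host = remote.rsplit(":", 1)[0]
        if local_port < PRIVILEGED_LIMIT:
            counts[remote_host] += 1
    return counts


def max_sustained_rate(usable_ports, time_wait_seconds):
    """Upper bound on new connections per second before ports run out."""
    return usable_ports / time_wait_seconds
```

For example, `count_privileged_time_wait(open("netstat.txt"))` on a saved
capture gives a per-host count; if it approaches the number of usable
privileged ports, exhaustion is plausible.  Under the assumed figures,
`max_sustained_rate(512, 60)` gives the highest connection rate to a single
host that could be sustained indefinitely.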

David S.

> 
> I have run out of ideas of where to look. I checked (as quickly as 
> possible) at some nodes with netstat, the ddikick program is actually 
> running. Changing to ssh did not solve the problem.
> 
> I would appreciate any feedback as it is highly annoying to wait Y days 
> to get the job running and then it crashes.
> 
> All the best from Glasgow!
> 
> Jörg
> 
> 
> -- 
> *************************************************************
> Jörg Saßmannshausen
> Research Fellow
> University of Strathclyde
> Department of Pure and Applied Chemistry
> 295 Cathedral St.
> Glasgow
> G1 1XL
> 
> email: jorg.sassmannshausen at strath.ac.uk
> web: http://sassy.formativ.net
> 
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
> 
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
