PVM errors at startup

Patrick Begou Patrick.Begou at hmg.inpg.fr
Fri Oct 10 13:55:43 EDT 2003


Hi,

I'm new to this list, so just two lines about me:
I run a small Linux Beowulf cluster (10 nodes) for computational fluid
dynamics in the south-east of France (National Polytechnic Institute of
Grenoble).

I've just upgraded my cluster (from AMD 1500+ / 100BT Ethernet to P4
2.8 GHz + Gigabit Ethernet) and updated the system to Red Hat 7.3,
kernel 2.4.20-20-7. The current version of PVM is pvm-3.4.4-2, from
Red Hat 7.3. The previous system was RH 7.1.

Since this update I'm unable to start PVM on another node (with the
add command).
The console hangs for several tens of seconds, then says OK.
The pvmd3 is started on the remote node, but the conf command does not
show the additional node and I get these errors in the /tmp/pvml.xx file:

[t80040000] 10/10 15:58:31 craya.hmg.inpg.fr (xxx.xxx.xxx.xxx:32772) LINUX 3.4.4
[t80040000] 10/10 15:58:31 ready Fri Oct 10 15:58:31 2003
[t80040000] 10/10 16:01:46 netoutput() timed out sending to craya02 after 14, 190.000000
[t80040000] 10/10 16:01:46  hd_dump() ref 1 t 0x80000 n "craya02" a "" ar "LINUX" dsig 0x408841
[t80040000] 10/10 16:01:46            lo "" so "" dx "" ep "" bx "" wd "" sp 1000
[t80040000] 10/10 16:01:46            sa 192.168.81.2:32770 mtu 4080 f 0x0 e 0 txq 1
[t80040000] 10/10 16:01:46            tx 2 rx 1 rtt 1.000000 id "(null)"
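
The entries that matter in a log like this are the netoutput() timeouts.
A small sketch for pulling them out of a /tmp/pvml.xx file, assuming the
line layout shown above; the regular expression and the function name
are illustrative only and are not part of PVM:

```python
import re

# Matches e.g.
# "[t80040000] 10/10 16:01:46 netoutput() timed out sending to craya02"
# The pattern is an assumption based on the excerpt quoted in this post.
TIMEOUT_RE = re.compile(
    r"\[t(?P<tid>[0-9a-fA-F]+)\]\s+(?P<date>\d+/\d+)\s+(?P<time>\d+:\d+:\d+)\s+"
    r"netoutput\(\) timed out sending to (?P<host>\S+)"
)


def find_timeouts(log_text):
    """Return (date, time, host) for every netoutput() timeout in the log."""
    return [(m["date"], m["time"], m["host"]) for m in TIMEOUT_RE.finditer(log_text)]
```

Feeding it the contents of the log file gives one tuple per timed-out
peer, which makes it easy to see whether every node fails or only some.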


rsh and rexec are working (from master to nodes, from nodes to master
and from nodes to nodes). The transfer rate is close to 600 Mbit/s on
the network (binary ftp to /dev/null).
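
Note that rsh, rexec and ftp all run over TCP, whereas the pvmd daemons
exchange packets over UDP, so a UDP round trip between master and node
is a separate thing to test. A minimal sketch, assuming Python is
available on both machines; the function names, the probe payload and
the port number in the comments are hypothetical and not part of PVM:

```python
import socket


def run_udp_echo_once(sock):
    """Receive one datagram on an already-bound UDP socket and echo it back."""
    data, addr = sock.recvfrom(4096)
    sock.sendto(data, addr)


def udp_roundtrip_ok(host, port, payload=b"pvm-udp-probe", timeout=2.0):
    """Return True if payload sent to (host, port) comes back within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            data, _ = s.recvfrom(4096)
        except OSError:  # includes socket.timeout
            return False
        return data == payload


# On the node (port 40000 is an arbitrary example):
#   s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   s.bind(("", 40000))
#   run_udp_echo_once(s)
# On the master:
#   print(udp_roundtrip_ok("craya02", 40000))
```

Running the probe in both directions (master to node and node to
master) would show whether UDP traffic is being dropped one way only.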

The following environment variables are set:
PVM_ARCH=LINUX
PVM_RSH=/usr/bin/rsh
PVM_DPATH=/usr/local/pvm3/lib/LINUX/pvmd3
PVM_ROOT=/usr/local/pvm3
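
A quick way to confirm these settings on every node is a small check
like the sketch below. It assumes the layout quoted above, with pvmd3
under $PVM_ROOT/lib/$PVM_ARCH; that layout is taken from this post and
is not a PVM requirement, and the function name is hypothetical:

```python
import os


def check_pvm_env(env, path_exists=os.path.exists):
    """Return a list of human-readable problems found in the PVM settings."""
    problems = []
    for name in ("PVM_ROOT", "PVM_ARCH", "PVM_DPATH", "PVM_RSH"):
        if not env.get(name):
            problems.append(f"{name} is not set")

    root, arch, dpath = env.get("PVM_ROOT"), env.get("PVM_ARCH"), env.get("PVM_DPATH")
    if root and arch and dpath:
        # Assumed layout: $PVM_ROOT/lib/$PVM_ARCH/pvmd3, as in this post.
        expected = os.path.join(root, "lib", arch, "pvmd3")
        if os.path.normpath(dpath) != os.path.normpath(expected):
            problems.append(f"PVM_DPATH is {dpath}, expected {expected}")

    # Both variables should point at files that exist on this machine.
    for name in ("PVM_DPATH", "PVM_RSH"):
        value = env.get(name)
        if value and not path_exists(value):
            problems.append(f"{name} points to a missing file: {value}")
    return problems
```

Calling check_pvm_env(os.environ) on the master and on each node should
return an empty list when the settings are consistent.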


I've tried many things over the last 3 days:

- compiling and installing PVM from the pvm3.4.4.tgz source archive
- uninstalling iptables, ipchains and iplock
- removing /etc/security (to test this with root authority)
- adding .rhosts and hosts.equiv files
- swapping the interfaces on the master: eth0 is normally the 100 Mbit/s
link to the Internet and eth1 the Gigabit link to the nodes; I've tried
the opposite configuration, with eth0 as Gigabit and eth1 as 100BT.

Always the same problem!

The cluster is down and I no longer know where to look for a
solution.

If someone could help me solve this problem, I would be grateful.

Thanks for your help

Patrick
-- 
===============================================================
|  Equipe M.O.S.T.         | http://most.hmg.inpg.fr          |
|  Patrick BEGOU           |       ------------               |
|  LEGI                    | mailto:Patrick.Begou at hmg.inpg.fr |
|  BP 53 X                 | Tel 04 76 82 51 35               |
|  38041 GRENOBLE CEDEX    | Fax 04 76 82 52 71               |
===============================================================
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org