MPICH 1.2.5 failures (net_recv)

Jeff Layton jeffrey.b.layton at lmco.com
Fri Jul 11 12:13:08 EDT 2003


Good afternoon!

   Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4
kernel). I've built MPICH-1.2.5 on it using the PGI 4.1 compilers,
with the following configuration:

./configure --prefix=/home/g593851/BIN/mpich-1.2.5/pgi \
          --with-ARCH=LINUX \
          --with-device=ch_p4 \
          --without-romio --without-mpe \
          -opt=-O2  \
          -cc=/usr/pgi/linux86/bin/pgcc \
          -fc=/usr/pgi/linux86/bin/pgf90 \
          -clinker=/usr/pgi/linux86/bin/pgcc \
          -flinker=/usr/pgi/linux86/bin/pgf90 \
          -f90=/usr/pgi/linux86/bin/pgf90 \
          -f90linker=/usr/pgi/linux86/bin/pgf90 \
          -c++=/usr/pgi/linux86/bin/pgCC \
          -c++linker=/usr/pgi/linux86/bin/pgCC


I've built the 'cpi' and 'fpi' examples in the examples/basic directory
and tried running them using the following mpirun line:


/home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun -np 10 -machinefile PBS_NODEFILE cpi


where PBS_NODEFILE is,

penguin1
penguin1
penguin2
penguin2
penguin3
penguin3
penguin4
penguin4
penguin5
penguin5

(however, I'm testing outside of PBS). The code seems to hang for
quite a while and then I get the following:

p0_14235: (935.961023) net_recv failed for fd = 10
p0_14235:  p4_error: net_recv read, errno = : 110
p2_12406: (935.817898) net_send: could not write to fd=7, errno = 104
/home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun: line 1: 14235 Broken pipe             /home/g593851/src/mpich-1.2.5/examples/basic/cpi -p4pg /home/g593851/src/mpich-1.2.5/examples/basic/PI13983 -p4wd /home/g593851/src/mpich-1.2.5/examples/basic
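
For reference, the two errno values in the output above can be decoded
with a small lookup. This is only a sketch: it assumes standard Linux
errno numbering (the values differ on other systems), and the table and
the describe_errno helper are illustrative names, not part of MPICH or p4.

```python
# Assumed Linux errno numbering; other operating systems use different values.
LINUX_ERRNO = {
    104: ("ECONNRESET", "Connection reset by peer"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def describe_errno(code):
    """Return a readable description for an errno value seen in the log."""
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"errno {code}: {name} ({text})"


# The two values reported by net_recv and net_send above.
for code in (110, 104):
    print(describe_errno(code))
```

Under that assumption, the net_recv failure on p0 is a timeout on an
established connection, and the net_send failure on p2 is the peer
resetting the connection.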


More system details: the OS is RH 7.1, but with a stock 2.4.20
kernel. The interconnect is FastE through a Foundry switch and the
NICs are Intel EEPro100 (using the eepro100 driver).
   Does anybody have any ideas? I've searched around the net a bit and
the results were inconclusive ("use LAM instead", possibly bad NIC
drivers, a problematic TCP stack, etc.).

TIA!

Jeff




-- 
Dr. Jeff Layton
Chart Monkey - Aerodynamics and CFD
Lockheed-Martin Aeronautical Company - Marietta

