MPICH 1.2.5 failures (net_recv)

Martin Siegert siegert at sfu.ca
Fri Jul 11 13:11:07 EDT 2003


On Fri, Jul 11, 2003 at 12:13:08PM -0400, Jeff Layton wrote:
> Good afternoon!
> 
>   Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4
> kernel). I've built MPICH-1.2.5 on it using the PGI 4.1 compilers,
> with the following configuration:
> 
> ./configure --prefix=/home/g593851/BIN/mpich-1.2.5/pgi \
>          --with-ARCH=LINUX \
>          --with-device=ch_p4 \
>          --without-romio --without-mpe \
>          -opt=-O2  \
>          -cc=/usr/pgi/linux86/bin/pgcc \
>          -fc=/usr/pgi/linux86/bin/pgf90 \
>          -clinker=/usr/pgi/linux86/bin/pgcc \
>          -flinker=/usr/pgi/linux86/bin/pgf90 \
>          -f90=/usr/pgi/linux86/bin/pgf90 \
>          -f90linker=/usr/pgi/linux86/bin/pgf90 \
>          -c++=/usr/pgi/linux86/bin/pgCC \
>          -c++linker=/usr/pgi/linux86/bin/pgCC
> 
> 
> I've built the 'cpi' and 'fpi' examples in the examples/basic directory
> and tried running them using the following mpirun line:
> 
> 
> /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun -np 10 -machinefile 
> PBS_NODEFILE cpi
> 
> 
> where PBS_NODEFILE is,
> 
> penguin1
> penguin1
> penguin2
> penguin2
> penguin3
> penguin3
> penguin4
> penguin4
> penguin5
> penguin5
> 
> (however, I'm testing outside of PBS). The code seems to hang for
> quite a while and then I get the following:
> 
> p0_14235: (935.961023) net_recv failed for fd = 10
> p0_14235:  p4_error: net_recv read, errno = : 110
> p2_12406: (935.817898) net_send: could not write to fd=7, errno = 104
> /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun: line 1: 14235 Broken 
> pipe             /home/g593851/src/mpich-1.2.5/examples/basic/cpi -p4pg 
> /home/g593851/src/mpich-1.2.5/examples/basic/PI13983 -p4wd 
> /home/g593851/src/mpich-1.2.5/examples/basic
> 
> 
> More system details - It's a RH 7.1 OS, but with a stock 2.4.20
> kernel. The interconnect is FastE through a Foundry switch and the
> NICs are Intel EEPro100 (using the eepro100 driver).
>   Does anybody have any ideas? I've searched around the net a bit and
> the results were inconclusive ("use LAM instead", may have bad NIC
> drivers, problematic TCP stack, etc.).
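
[Archive editor's note: the errno values in the p4 error messages above can be
decoded with standard Linux errno tables; this small sketch (not part of the
original exchange) shows what they mean:]

```python
import errno, os

# errno 110 (the net_recv failure on p0) is ETIMEDOUT: the TCP
# connection timed out, typically a hung or unreachable peer.
print(errno.errorcode[110], "-", os.strerror(110))

# errno 104 (the net_send failure on p2) is ECONNRESET: the peer
# reset the connection, consistent with p0 having already died.
print(errno.errorcode[104], "-", os.strerror(104))
```

Together these suggest the processes lost TCP connectivity to each other
rather than an error inside the application itself.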

I think you sent this to the wrong mailing list. As outlined on the
MPICH home page, problem reports should go to

mpi-maint at mcs.anl.gov

The folks at Argonne are usually extremely helpful with solving problems.

Cheers,
Martin

-- 
Martin Siegert
Manager, Research Services
WestGrid Site Manager
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
