MPICH 1.2.5 failures (net_recv)
Martin Siegert
siegert at sfu.ca
Fri Jul 11 13:11:07 EDT 2003
On Fri, Jul 11, 2003 at 12:13:08PM -0400, Jeff Layton wrote:
> Good afternoon!
>
> Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4
> kernel). I've built MPICH-1.2.5 on it using the PGI 4.1 compilers,
> with the following configuration:
>
> ./configure --prefix=/home/g593851/BIN/mpich-1.2.5/pgi \
> --with-ARCH=LINUX \
> --with-device=ch_p4 \
> --without-romio --without-mpe \
> -opt=-O2 \
> -cc=/usr/pgi/linux86/bin/pgcc \
> -fc=/usr/pgi/linux86/bin/pgf90 \
> -clinker=/usr/pgi/linux86/bin/pgcc \
> -flinker=/usr/pgi/linux86/bin/pgf90 \
> -f90=/usr/pgi/linux86/bin/pgf90 \
> -f90linker=/usr/pgi/linux86/bin/pgf90 \
> -c++=/usr/pgi/linux86/bin/pgCC \
> -c++linker=/usr/pgi/linux86/bin/pgCC
>
>
> I've built the 'cpi' and 'fpi' examples in the examples/basic directory
> and tried running them using the following mpirun line:
>
>
> /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun -np 10 -machinefile
> PBS_NODEFILE cpi
>
>
> where PBS_NODEFILE is,
>
> penguin1
> penguin1
> penguin2
> penguin2
> penguin3
> penguin3
> penguin4
> penguin4
> penguin5
> penguin5
>
> (however, I'm testing outside of PBS). The code seems to hang for
> quite a while and then I get the following:
>
> p0_14235: (935.961023) net_recv failed for fd = 10
> p0_14235: p4_error: net_recv read, errno = : 110
> p2_12406: (935.817898) net_send: could not write to fd=7, errno = 104
> /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun: line 1: 14235 Broken
> pipe /home/g593851/src/mpich-1.2.5/examples/basic/cpi -p4pg
> /home/g593851/src/mpich-1.2.5/examples/basic/PI13983 -p4wd
> /home/g593851/src/mpich-1.2.5/examples/basic
>
>
> More system details - It's a RH 7.1 OS, but with a stock 2.4.20
> kernel. The interconnect is FastE through a Foundry switch and the
> NICs are Intel EEPro100 (using the eepro100 driver).
> Does anybody have any ideas? I've searched around the net a bit and
> the results were inconclusive ("use LAM instead", may have bad NIC
> drivers, problematic TCP stack, etc.).
I think you sent this to the wrong mailing list. As outlined on the
MPICH home page, problem reports should go to
mpi-maint at mcs.anl.gov
The folks at Argonne are usually extremely helpful with solving problems.
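One side note on the log itself, in case it helps when writing the report:
assuming the nodes run Linux (as the RH 7.1 / 2.4.20 description says), the
two errno values quoted above are ETIMEDOUT (110) and ECONNRESET (104). That
does not identify the cause; it only says the read in p0 timed out and the
write in p2 hit a connection reset by its peer. The small Python sketch below
decodes them. The names describe and LINUX_ERRNO_NAMES are made up for this
example, and the number-to-name table is hard-coded because errno numbers
differ between operating systems.

```python
import errno
import os

# Linux errno numbers seen in the p4 log. The numeric values are
# platform-specific; these two are the Linux ones.
LINUX_ERRNO_NAMES = {
    104: "ECONNRESET",
    110: "ETIMEDOUT",
}


def describe(code):
    """Return a one-line description of a Linux errno number (hypothetical helper)."""
    name = LINUX_ERRNO_NAMES.get(code, "unknown")
    # Look up the local constant by its symbolic name so that os.strerror()
    # returns the matching message even on a non-Linux machine.
    const = getattr(errno, name, None)
    text = os.strerror(const) if const is not None else "no description available"
    return f"errno {code}: {name} ({text})"


for code in (110, 104):
    print(describe(code))
```

Only the two codes from the log are in the table; anything else is reported
as unknown rather than guessed.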
Cheers,
Martin
--
Martin Siegert
Manager, Research Services
WestGrid Site Manager
Academic Computing Services phone: (604) 291-4691
Simon Fraser University fax: (604) 291-4242
Burnaby, British Columbia email: siegert at sfu.ca
Canada V5A 1S6