MPICH 1.2.5 failures (net_recv)
msnitzer at lnxi.com
Mon Jul 14 16:03:33 EDT 2003
On Fri, Jul 11 2003 at 10:13, Jeff Layton <jeffrey.b.layton at lmco.com> wrote:
> Good afternoon!
> Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4
> kernel). I've built MPICH-1.2.5 on it using the PGI 4.1 compilers,
> with the following configuration:
> Does anybody have any ideas? I've searched around the net a bit and
> the results were inconclusive ("use LAM instead", may have bad NIC
> drivers, problematic TCP stack, etc.).
You might try compiling MPICH with gcc to eliminate PGI as a potential
source of error. This would at least allow you to verify the integrity of
the drivers, TCP stack, NIC, etc.
PGI should be perfectly fine given the minimal MPICH configure you
provided, but the compiler is one variable that is easy enough to eliminate
as a potential problem. If you see the same problem with gcc-compiled
MPICH, then there is a deeper issue. You might also confine the mpirun to
only 2 nodes at first and then scale up accordingly.
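A rough sketch of the "start at 2 nodes and scale up" idea, in case it helps: the small Python script below only prints the mpirun command lines for increasing process counts, so you can run them by hand and see where the net_recv failures first show up. The names machines.txt and ./cpi are placeholders, not anything from your setup, and I am assuming one process per node with a machine file that lists one host per line.

```python
import shlex


def scaling_steps(max_nodes, start=2):
    """Return node counts start, 2*start, 4*start, ... ending at max_nodes."""
    steps = []
    n = start
    while n < max_nodes:
        steps.append(n)
        n *= 2
    # Always finish with the full node count, if it is at least `start`.
    if max_nodes >= start:
        steps.append(max_nodes)
    return steps


def mpirun_command(nprocs, machinefile, program):
    """Build an MPICH-style mpirun argument list (-np and -machinefile)."""
    return ["mpirun", "-np", str(nprocs), "-machinefile", machinefile, program]


if __name__ == "__main__":
    # Hypothetical values: a 16-node cluster, a machine file named
    # machines.txt, and a test binary named ./cpi.
    for n in scaling_steps(16):
        cmd = mpirun_command(n, "machines.txt", "./cpi")
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Running the same sequence once with the PGI build and once with the gcc build should show whether the failure follows the compiler or the node count.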
Mike Snitzer msnitzer at lnxi.com
Linux Networx http://www.lnxi.com