Bizarre problems when adding a PPC machine...

t mirrorsh at atlantech.net
Tue Oct 29 17:52:35 EST 2002


I am replying to a post I saw in the archives here dating from Jan 2002,
primarily because it is the same problem I have been having when I added a
PPC machine into my x86 Linux cluster running MPICH.

> I really hate to bother the mailing list but this one has me somewhat
> stumped.  I have a four node cluster comprising Linux machines and one
> PPC machine.  The Linux machines have been adequately tested and play
> well together.  That PPC machine is another matter.  When I include the
> PPC machine (a Mac 8500 running YellowDog Linux) in my network
> cluster... well things fall apart.  Here's what appears on the console
> after running a simple test on my "root" node....
>
>
>
> [john at adenine examples]$ ./mpirun -np 4 simpleio
>
> p2_9722:  p4_error: Could not allocate memory for commandline args:
>
> 553648128

Someone on the list suggested that this was probably an endian problem and
recommended contacting the authors of MPICH.
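The number in the error message fits the endian diagnosis.  553648128 is
0x21000000 in hex, and reversing its four bytes gives 0x00000021, i.e. 33,
which is a far more believable size for command-line arguments.  Whether p4
really sent a length of 33 is my assumption; the arithmetic itself can be
checked with a small helper (swap32 is a name made up for this sketch, it is
not part of MPICH):

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value (illustrative helper,
 * not taken from the MPICH/p4 sources). */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}
```

Calling swap32(553648128u) yields 33, and swap32(33u) yields 553648128 again.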

Well, this has apparently been a problem since at least 1996 (from
groups.google archives) yet I have not found a solution anywhere on the web
(via google anyway).  Also I haven't had much luck talking to the MPICH
authors in the past about bugs in the MPICH implementation, so instead I
just fixed it myself.  Just in case anyone else ends up having this problem,
I thought I'd post a possible solution so that at least it will be saved in
some form of archive for posterity.

A complete fix would be relatively time-consuming, involving changing how
MPI reads/writes data in p4, or altering the configure scripts and other
things to base machine "type" on something other than simply operating
system, or who knows.  Here's a quick fix for anyone interested.

The problem is that MPICH (as of 1.2.4) cannot handle heterogeneous networks
in which machines of different byte order are all running Linux or BSD using
the ch_p4 device.  p4 seems to write data out in network order only if
MPICH thinks the machine on the other end is of a different
architecture.  The trouble is that, for MPICH, a machine's architecture is
determined by the operating system it is using, not by its processor
architecture.  There is a quasi-fix in MPICH to compensate, but it is
broken.
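To illustrate the behaviour described above, here is a rough sketch of a
sender that converts to network order only when the two machine-type strings
differ.  This is NOT p4's actual code; needs_conversion, encode_u32 and
decode_u32 are hypothetical names invented for the example.  Only htonl and
ntohl are real (POSIX, from <arpa/inet.h>).

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical check: convert only when the type strings differ.
 * Two hosts that both report "LINUX" are treated as identical,
 * whatever their processors' byte order may be. */
int needs_conversion(const char *local_type, const char *remote_type)
{
    return strcmp(local_type, remote_type) != 0;
}

/* Write a 32-bit value into buf, in network (big-endian) order if
 * convert is nonzero, otherwise in the host's native order. */
void encode_u32(uint32_t value, int convert, unsigned char buf[4])
{
    uint32_t wire = convert ? htonl(value) : value;
    memcpy(buf, &wire, sizeof wire);
}

/* Read a 32-bit value back out of buf, undoing encode_u32. */
uint32_t decode_u32(const unsigned char buf[4], int convert)
{
    uint32_t wire;
    memcpy(&wire, buf, sizeof wire);
    return convert ? ntohl(wire) : wire;
}
```

With both ends reporting "LINUX", needs_conversion returns 0, the value goes
out in native order, and a peer of the opposite byte order misreads it.  With
distinct type strings the value is sent big-endian and both ends agree.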

Step 1) On the Linux PPC machines, edit mpid/ch_p4/p4/include/p4_MD.h

Where:
    #if defined(LINUX)
    #define P4_MACHINE_TYPE "LINUX"
    #endif

Replace "LINUX" with "LINUX_PPC"

In mpid/ch_p4/p4/lib/p4_MD.c, in the function data_representation (at the
bottom of the file), remove the ENTIRE #ifdef WORDS_BIGENDIAN block and
replace it with something like this:

    if (strcmp(machine_type, "LINUX_PPC") == 0) return 21;
    if (strcmp(machine_type, "LINUX_X86") == 0) return 2;

Step 2) On the Linux x86 machines, make the same two edits, except that in
p4_MD.h you replace "LINUX" with "LINUX_X86".

So you will have two "versions": MPICH 1.2.X-ppc and MPICH 1.2.X-x86.  Or
however you enjoy naming things.

This can be generalized to other machine types, e.g., NETBSD_X86,
FREEBSD_ALPHA, LINUX_FOOZWITZ, so long as you add the requisite entries in
the data_representation() function and hack p4_MD.h as necessary.
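If you end up with more than a couple of machine types, a lookup table may be
easier to maintain than a chain of strcmp calls.  The following is only a
standalone sketch: the struct, the table, the function name
lookup_data_representation and the -1 "unknown" result are all made up for
illustration, and only the two codes given above (21 and 2) are filled in.
Codes for other machine types would have to come from the p4 sources.

```c
#include <stddef.h>
#include <string.h>

/* One row per machine-type string (illustrative, not from p4). */
struct type_rep {
    const char *name;   /* value of P4_MACHINE_TYPE */
    int         rep;    /* data representation code  */
};

static const struct type_rep type_table[] = {
    { "LINUX_PPC", 21 },
    { "LINUX_X86",  2 },
    /* add further machine types here */
};

/* Return the representation code for machine_type, or -1 if the
 * type is not in the table (the -1 convention is an assumption). */
int lookup_data_representation(const char *machine_type)
{
    size_t i;
    size_t n = sizeof type_table / sizeof type_table[0];

    for (i = 0; i < n; i++) {
        if (strcmp(machine_type, type_table[i].name) == 0)
            return type_table[i].rep;
    }
    return -1;
}
```

Every node must be built with the same table, since both ends have to agree
on the code for each type string.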

I don't know if this is the proper forum for this sort of thing, but at
least there will now be a solution posted to a google-accessible archive.
The MPICH folks can at some point create an actual general-purpose patch
that isn't quite as hacky and put it in 1.2.5.

--
Stephen Lawler

