bad job distribution with MPICH
Gary Stiehr
stiehr at admiral.umsl.edu
Fri Jul 18 17:18:26 EDT 2003
Hi,
Try to use "mpirun -nolocal -np ....". I think if you don't specify the
"-nolocal" option, the job will start one process on node17 and then
that process will start the other 7 processes on the remaining 6
processors not in node17; thus resulting in three processes on node15.
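If that's right, the default mapping of your 8 ranks would look
something like this (my guess, reconstructed from the process counts you
reported; I haven't checked it against the ch_p4 source):
-----------------------------
rank 0 -> node17   (started locally by mpirun)
rank 1 -> node15
rank 2 -> node14
rank 3 -> node11
rank 4 -> node15
rank 5 -> node14
rank 6 -> node11
rank 7 -> node15   (node15 is first in line again, so it gets 3)
-----------------------------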
Apparently if you use -nolocal, it will use all of the processors listed
in the machinefile. I'm not sure why this is, but adding "-nolocal" to
the mpirun command may help you.
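For example, with -nolocal added your PBS script would look like this
(an untested sketch; everything except the added flag is unchanged from
your original script, quoted below):
----------------------------------------
#PBS -l nodes=4:ppn=2,walltime=100:00:00
#
mpirun -nolocal -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE mfix.exe
----------------------------------------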
HTH,
Gary
Jan-Frode Myklebust wrote:
>Hi,
>
>we're running MPICH 1.2.4 on a 32-node dual-CPU Linux cluster (Fast
>Ethernet), and are having some problems with the MPICH job distribution.
>An example from today:
>
>The PBS job:
>
>----------------------------------------
>#PBS -l nodes=4:ppn=2,walltime=100:00:00
>#
>mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE mfix.exe
>----------------------------------------
>
>is assigned to nodes:
>
> node17/0+node15/0+node14/0+node11/0+node17/1+node15/1+node14/1+node11/1
>
>PBS generates a PBS_NODEFILE containing:
>
>-----------------------------
>node17
>node15
>node14
>node11
>node17
>node15
>node14
>node11
>-----------------------------
>
>And this command is started on node17:
>
> mpirun -np 8 -machinefile /var/spool/PBS/aux/20996.fire executable
>
>And then when I look over the nodes, there is 1 process running on
>node17, 3 on node15, 2 on node14, and 2 on node11.
>
>Has anybody seen something like this? Any idea what might be causing
>it?
>
>
> -jf
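P.S. A quick way to re-check the distribution after adding -nolocal is
to count the processes on each allocated node, e.g. with something like
the sketch below (it assumes rsh access to the nodes and that the binary
shows up as mfix.exe in ps; substitute ssh and the real process name as
appropriate for your cluster):
----------------------------------------
#!/bin/sh
# Count mfix.exe processes on each node allocated to this PBS job.
for n in `sort -u $PBS_NODEFILE`; do
    echo -n "$n: "
    rsh $n ps -C mfix.exe --no-headers | wc -l
done
----------------------------------------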
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf