bad job distribution with MPICH
Jan-Frode Myklebust
janfrode at parallab.no
Sat Jul 19 07:32:56 EDT 2003
On Fri, Jul 18, 2003 at 04:18:26PM -0500, Gary Stiehr wrote:
>
> Try to use "mpirun -nolocal -np ....".
Yes, that seems to fix it. Thanks!
I also got a nice explanation in private from George Sigut explaining
what MPICH was doing when not given the '-nolocal' flag.
"
I seem to remember something about mpirun starting distributing the
jobs NOT on the first node (i.e. in your case node17) and continuing
in the circular fashion:
given:    17 15 14 11 17 15 14 11
expected: 17 15 14 11 17 15 14 11
getting:   | 15 14 11 17 15 14 11   (instead of 1st 17, twice 15)
           -> 15
"
It looks like, without the '-nolocal' flag, MPICH reserves the first node
in the machinefile for job management.
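
To make the effect concrete, here is a rough sketch in Python of the
behaviour George describes. It is only a toy model of his description,
not MPICH's actual placement code: the place() helper, its skip_first
parameter and the host names other than node17 are made up for
illustration, and the real ordering of ranks may differ. Only the
per-node counts are meant to match what I saw.

```python
from collections import Counter
from itertools import cycle, islice


def place(machinefile, nprocs, skip_first):
    """Toy round-robin placement over the machinefile entries.

    skip_first=True models the assumed behaviour without '-nolocal':
    the first entry is left out and the rest are used circularly.
    """
    hosts = machinefile[1:] if skip_first else machinefile
    return list(islice(cycle(hosts), nprocs))


# Machinefile order from the example: 17 15 14 11 17 15 14 11
machinefile = ["node17", "node15", "node14", "node11"] * 2

expected = place(machinefile, 8, skip_first=False)
observed = place(machinefile, 8, skip_first=True)

print("expected:", Counter(expected))
print("observed:", Counter(observed))
```

In this model node17 ends up with one process instead of two and node15
with three instead of two, which is the imbalance described above.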
-jf