bad job distribution with MPICH

Jan-Frode Myklebust janfrode at parallab.no
Sat Jul 19 07:32:56 EDT 2003


On Fri, Jul 18, 2003 at 04:18:26PM -0500, Gary Stiehr wrote:
> 
> Try to use "mpirun -nolocal -np ....".  

Yes, that seems to fix it. Thanks!

I also got a nice explanation in private from George Sigut explaining
what MPICH was doing when not given the '-nolocal' flag.

"
  I seem to remember something about mpirun starting distributing the
  jobs NOT on the first node (i.e. in your case node17) and continuing
  in the circular fashion:
                                                                                                             
  given:    17 15 14 11 17 15 14 11
  expected: 17 15 14 11 17 15 14 11
  getting:  |  15 14 11 17 15 14 11  (instead of 1st 17, twice 15)
            -> 15

"

Looks like, without the '-nolocal' flag, MPICH is reserving the first
node in the machinefile for job management.
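
To make the diagram above concrete, here is a small toy model of the
placement George describes. It is only a sketch of the observed behaviour,
not MPICH's actual code: the function name place() is made up, the node
names other than node17 are assumed to follow the same pattern, and I am
assuming the wrap-around skips the first entry again, as the "-> 15" in
the diagram suggests.

```python
from itertools import cycle, islice

def place(machinefile, np, nolocal):
    """Toy model of the placement in the diagram; not MPICH's real logic."""
    # Assumption: with -nolocal every machinefile entry is used in order.
    # Without it, the first entry is skipped, also when wrapping around.
    slots = machinefile if nolocal else machinefile[1:]
    # Hand out np slots in circular fashion.
    return list(islice(cycle(slots), np))

# Machinefile from the diagram: 17 15 14 11 17 15 14 11
machinefile = ["node17", "node15", "node14", "node11"] * 2

print("with -nolocal:   ", place(machinefile, 8, nolocal=True))
print("without -nolocal:", place(machinefile, 8, nolocal=False))
```

In this model the run without '-nolocal' puts three processes on node15
and only one on node17, which is the kind of imbalance I was seeing.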


   -jf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


