bad job distribution with MPICH

Jan-Frode Myklebust janfrode at parallab.no
Thu Jul 17 05:04:54 EDT 2003


Hi, 

We're running MPICH 1.2.4 on a 32-node, dual-CPU Linux cluster (Fast
Ethernet), and are having some problems with the MPICH job distribution.
An example from today:

The PBS job:

----------------------------------------
#PBS -l nodes=4:ppn=2,walltime=100:00:00
#
mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE mfix.exe
----------------------------------------

is assigned to nodes:

	node17/0+node15/0+node14/0+node11/0+node17/1+node15/1+node14/1+node11/1

PBS generates a PBS_NODEFILE containing:

-----------------------------
node17
node15
node14
node11
node17
node15
node14
node11
-----------------------------
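
To double-check how many slots each host gets in that file, a tally like
the one below can be used. This is only a minimal sketch; `slots_per_host`
is a hypothetical helper name, not part of PBS or MPICH.

```shell
# Hypothetical helper: count how many slots each host has in a nodefile.
# The nodefile is assumed to list one hostname per slot, as in the
# PBS_NODEFILE shown above.
slots_per_host() {
    # sort groups identical hostnames together, uniq -c counts them,
    # and awk reorders the output to "hostname count".
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a PBS job this would be called as:
#   slots_per_host "$PBS_NODEFILE"
```

For the nodefile above this reports 2 slots for each of the four nodes, so
the file itself asks for an even 2-per-node layout.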

And this command is started on node17:

	mpirun -np 8 -machinefile /var/spool/PBS/aux/20996.fire executable

But when I look over the nodes, there is 1 process running on node17,
3 on node15, 2 on node14 and 2 on node11, instead of the 2 per node that
the nodefile asks for.
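
The mismatch can be made explicit by comparing the nodefile against the
observed per-node process counts. The sketch below assumes the observed
counts have been collected into a second file with lines of the form
"hostname count" (by hand, or by whatever remote-shell loop the cluster
allows); `compare_distribution` and that file format are hypothetical,
chosen just for illustration.

```shell
# compare_distribution NODEFILE OBSERVED
#   NODEFILE: one hostname per slot, as PBS writes it
#   OBSERVED: lines of "hostname count" (assumed format, gathered separately)
# Prints one line per host whose observed count differs from its slot count.
compare_distribution() {
    awk '
        NR == FNR { expected[$1]++; next }   # first file: count slots per host
        { observed[$1] = $2 }                # second file: observed process counts
        END {
            for (h in expected) {
                got = (h in observed) ? observed[h] : 0
                if (got != expected[h])
                    printf "%s expected=%d observed=%d\n", h, expected[h], got
            }
        }
    ' "$1" "$2" | sort
}
```

With the nodefile and the counts from this job, it flags node15
(expected=2 observed=3) and node17 (expected=2 observed=1), and prints
nothing for node14 and node11.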

Has anybody seen something like this, or does anyone have an idea of what
might be causing it?


  -jf
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf