[Beowulf] job runs with mpirun on a node but not if submitted via Torque.

Rahul Nabar rpnabar at gmail.com
Tue Mar 31 18:54:55 EDT 2009


I have a strange OpenMPI/Torque problem while trying to run a job on our
Opteron SC1435-based cluster:

Each node has 8 CPUs.

If I go to a node and run the job directly like so, it works:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

If I submit the same job through PBS/Torque, it starts running but the
individual processes keep crashing:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
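
For reference, a minimal Torque submission script for this kind of run
would look something like the following (the job name, walltime, and
resource request here are placeholders, not my exact settings):

#!/bin/bash
#PBS -N dacapo
#PBS -l nodes=1:ppn=6
#PBS -l walltime=12:00:00

# start in the directory the job was submitted from
cd $PBS_O_WORKDIR
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}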

I know that a --hostfile option is not needed with recent Torque-aware
OpenMPI builds, since mpirun picks up the allocated node list from
Torque itself.
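
(I assume the way to confirm that this OpenMPI build actually has the
Torque integration compiled in is something like:

ompi_info | grep tm

which should list tm components if the support is there.)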

I also tried listing the hosts explicitly:

mpirun -np 6 --host node17,node17,node17,node17,node17,node17 \
    ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

Still does not work.

What could be going wrong? Are there other things I need to worry
about when PBS steps in? Any tips?

The ${DACAPOEXE_PAR} refers to a Fortran executable for the
computational chemistry code DACAPO.

What's the difference between launching a job on a node via mpirun
directly and submitting it via Torque? Shouldn't both be transparent to
the Fortran code? I am assuming I don't have to dig into the Fortran
source. Any debug tips?
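
For instance, would adding something like this at the top of the batch
script help to compare the Torque environment against an interactive
shell (just a sketch)?

echo $PBS_NODEFILE ; cat $PBS_NODEFILE   # slots Torque actually handed out
env | grep ^PBS                          # Torque-specific environment
ulimit -a    # resource limits can differ from a login shell; a low stack
             # limit is a classic cause of Fortran crashes under batch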

Thanks!

-- 
Rahul
