[Beowulf] job runs with mpirun on a node but not if submitted via Torque.

Don Holmgren djholm at fnal.gov
Tue Mar 31 19:43:50 EDT 2009


How are your individual MPI processes crashing when run under Torque?  Are
there any error messages?

Under OpenMPI, a Torque job on a worker node inherits its environment from
the pbs_mom process.  Differences between that environment and the standard
login environment can sometimes cause trouble.  For example, on InfiniBand
clusters the "maximum locked memory" ulimit may need to be raised by editing
the script used to launch pbs_mom (usually the pbs-client init.d script).  I've
also seen stack size problems in some user binaries that require a similar
ulimit adjustment to mimic what the users have in their .bash_profile.
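
One way to check for such a difference is to print the limits from both
environments and compare them.  The following is only a sketch: the function
name report_limits is made up for illustration, and it reports just the two
limits mentioned above (locked memory and stack size).

```shell
#!/bin/sh
# Hypothetical helper: print the limits seen by the current shell.
# Run it once from a login shell on the node and once from inside a
# Torque job script, then compare the two outputs.
report_limits() {
    echo "max locked memory (kB): $(ulimit -l)"
    echo "stack size (kB): $(ulimit -s)"
}

report_limits
```

If the values inside the job are lower than at login, raising them in the
pbs_mom startup script (for instance with a line such as "ulimit -l unlimited"
before pbs_mom is started, assuming your init script is a shell script) is one
thing to try.  pbs_mom has to be restarted for the change to take effect.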

Instead of logging into the node directly, you might want to try an interactive
job (use "qsub -I") and then try your mpirun.  This may give you messages that
for some reason aren't getting back to you in your job's .o or .e files.
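
As a rough sketch of that approach: inside the interactive session you can
also capture everything mpirun prints, plus its exit status, in a file of your
own choosing.  The wrapper name run_logged and the log file name are made up
for illustration, and the resource request in the comment is an assumption
based on the 8-cpu nodes described below.

```shell
#!/bin/sh
# Start an interactive job first, for example (resource request assumed):
#   qsub -I -l nodes=1:ppn=8
# Then, in the shell Torque gives you on the node, define this wrapper.

# Hypothetical wrapper: run a command, save stdout and stderr to a log
# file, append the exit status, and return that status to the caller.
run_logged() {
    log=$1
    shift
    status=0
    "$@" > "$log" 2>&1 || status=$?
    echo "exit status: $status" >> "$log"
    return $status
}

# Example use inside the interactive job:
#   run_logged mpirun.log mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
```

The same wrapper can be used in a batch job script, which gives you a log that
does not depend on Torque delivering the .o and .e files.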

Don Holmgren
Fermilab




On Tue, 31 Mar 2009, Rahul Nabar wrote:

> I've a strange OpenMPI/Torque problem while trying to run a job on our
> Opteron-SC-1435 based cluster:
>
> Each node has 8 cpus.
>
> If I go to a node and run like so then the job works:
>
> mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
>
> Same job if I submit through PBS/Torque then it starts running but the
> individual processes keep crashing:
>
> mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
>
> I know that the --hostfile directive is not needed in the latest
> torque-OpenMPI jobs.
>
> I also tried including:
>
> mpirun -np 6 --hosts node17,node17,node17,node17,node17,node17
> ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
>
> Still does not work.
>
> What could be going wrong? Are there other things I need to worry
> about when PBS steps in? Any tips?
>
> The ${DACAPOEXE_PAR} refers to a fortran executable for the
> computational chemistry code DACAPO.
>
> What's the difference between submitting a job on a node via mpirun
> directly vs. via Torque? Shouldn't both be transparent to the
> fortran calls? I am assuming I don't have to dig into the fortran code.
> Any debug tips?
>
> Thanks!