strange behavior. Hints?
ctibirna at giref.ulaval.ca
Fri Jun 13 16:06:15 EDT 2003
I have a small cluster (16x 2P-IV / 4 GB RAM / Myrinet, Red Hat 7.3 with xCAT)
that we have been using with good results for a few months.
I am preparing some reports on the speedups of our in-house FEM code
(parallelized with MPICH-GM and using PETSc for the solvers), and I'm running
tests that consist mostly of launching the same (rather big) FEM simulation on
a decreasing number of nodes:
for n in `seq -f"%02d" 16 -1 1`; do mpirun -np $n ./simulator; done
(OK, the script is a bit more complicated than this, but you get the idea).
A strange phenomenon appeared a few weeks ago. The simulation works very well
for all n except n=10 and n=11. For those, the program segfaults on 2 to 5 of
the nodes, which of course deadlocks the execution (the remaining MPI ranks
wait forever) and I have to kill it. This is reliably reproducible.
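One way to see where those ranks die, sketched here under the assumption that
core dumps are permitted on the compute nodes, is to lift the core-size limit
before launching the failing case and then inspect the resulting core file
with gdb:

```shell
#!/bin/sh
# Sketch: wrap the failing run so segfaulting ranks can leave core files.
# Note: ulimit only affects processes started from this shell; ranks that
# mpirun spawns on remote nodes may need the limit raised on each node
# (e.g. via the shell rc files pushed out by xCAT).
run_with_cores() {
    ulimit -c unlimited 2>/dev/null   # lift the core-size limit if the hard limit allows
    "$@"
}

# run_with_cores mpirun -np 10 ./simulator
# afterwards, on a node that dumped core:
#   gdb ./simulator core    # then "bt" prints the stack at the segfault
```

A backtrace from two or three of the crashing ranks would quickly show whether
the fault is in the application, PETSc, or the GM library.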
I'm quite sure my code contains nothing that depends on the particular number
of nodes (it is highly generalized OOP C++ code).
Now I'm starting to suspect some strange bug in the Myrinet hardware or
software, but I realize this is a really wild guess.
I plan to start investigating by testing all the components separately (MPI,
PETSc, Myrinet drivers). But that is a big battle with little chance of
success, given that I can positively see everything working correctly most of
the time (i.e., runs on 16, 15, 4, 2, etc. nodes complete fine). I wonder if
anybody has seen such behavior before and has hints more valuable than my
wild guesses about where to look and how to proceed.
Thanks a lot for your attention.
Cristian Tibirna (1-418-) 656-2131 / 4340
Laval University - Quebec, CAN ... http://www.giref.ulaval.ca/~ctibirna
Research professional at GIREF ... ctibirna at giref.ulaval.ca
PhD Student - Chemical Engng ... tibirna at gch.ulaval.ca
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf