strange behavior. Hints?

Cristian Tibirna ctibirna at
Fri Jun 13 16:06:15 EDT 2003


I have a small cluster (16 nodes, each dual P-IV with 4G RAM and Myrinet, running 
redhat-7.3 with xCAT) that we have been using with good results for a few months.

I am about to prepare some reports on speedups of our in-house FEM code 
(parallelised with MPICH-GM and using PETSc for solvers), and I'm doing some 
tests that consist mostly of launching the same (rather big) FEM simulation on 
a decreasing number of nodes:

for n in `seq -f"%02d" 16 -1 1`; do mpirun -np $n ./simulator; done

(OK, the script is a bit more complicated than this, but you get the idea).

A strange phenomenon started to appear a few weeks ago. The simulation works 
very well for all n except n=10 and n=11. For those, the program segfaults on 
2 to 5 of the nodes, which of course hangs the execution (MPI waits) and I 
have to kill it. This is reliably reproducible.

I'm absolutely sure my code contains nothing that depends on the number of 
nodes (it's highly generalised OOP C++ code).

Now I'm starting to believe there's some strange bug in the Myrinet 
hardware/software. But I realize this is a really wild guess.

I plan to start investigating by tearing the components apart (MPI, PETSc, 
Myrinet drivers) and testing them separately. But this is a really big battle 
with little chance of success, given that I can positively see everything 
working correctly most of the time (i.e., running on 16, 15, 4, 2, etc. nodes 
works OK). I wonder if anybody has seen such behavior before and has some 
hints (more valuable than my wild guesses) on where to look and how to proceed.
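One cheap first step in that teardown, as a sketch: run a trivial MPI program at exactly the failing process counts. If it also fails at np=10/11, the problem sits below PETSc and the FEM code. Here `./cpi` is assumed to be the pi-calculation example shipped with MPICH, built against the same MPICH-GM; the path is an assumption.

```shell
#!/bin/sh
# Hypothetical isolation test: exercise only the MPI/Myrinet layer
# at the two failing process counts with a trivial program (cpi from
# the MPICH examples), taking PETSc and the FEM code out of the
# picture. Results are recorded in isolation.log.
for n in 10 11; do
    if command -v mpirun >/dev/null 2>&1; then
        mpirun -np "$n" ./cpi >/dev/null 2>&1 \
            || echo "cpi failed at np=$n" >> isolation.log
    else
        echo "mpirun not available; would run cpi at np=$n" >> isolation.log
    fi
done
```

If cpi passes at np=10/11, the next suspect up the stack would be PETSc's communication patterns rather than the GM drivers themselves.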

Thanks a lot for your attention.

Cristian Tibirna				(1-418-) 656-2131 / 4340
  Laval University - Quebec, CAN ...
  Research professional at GIREF ... ctibirna at
  PhD Student - Chemical Engng ... tibirna at

Beowulf mailing list, Beowulf at