[Beowulf] MPI performance on clusters of SMP
Kozin, I (Igor)
I.Kozin at dl.ac.uk
Thu Aug 26 12:22:19 EDT 2004
Nowadays clusters are typically built from SMP boxes.
Dual-CPU nodes are common, and quad-CPU and larger nodes are
available too.
Nevertheless, I have never seen a parallel program run faster
on N nodes x 2 CPUs than on 2*N nodes x 1 CPU,
even when its local memory bandwidth requirements are very modest.
It appears that shared memory communication always comes at
an extra cost rather than as an advantage, although
both MPICH and LAM-MPI support shared memory.
Any comments? Is this an MPICH/LAM issue or a Linux issue?
In at least one case I observed something that points towards the OS.
I experimented with running several instances of a small program on
a 4-way Itanium2 Tiger box with a 2.4 kernel. The program is
basically a loop over an array that fits into the L1 cache.
Up to 3 instances finish virtually simultaneously.
If 4 instances are launched, 3 finish first and the 4th finishes
later, making the overall time about 40% longer.
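
A minimal sketch of how the experiment above could be reproduced,
assuming Python 3 is available on the node. The worker loop, the
array size (1024 doubles, about 8 KB) and the names WORKER and
run_instances are illustrative assumptions, not the original program.
Interpreter overhead dominates here, so a compiled loop would be
closer to the original test.

```python
import subprocess
import sys

# Hypothetical worker: repeatedly sums a small array of doubles.
# 1024 doubles is about 8 KB, assumed small enough for an L1 data cache.
WORKER = """
import time
from array import array
data = array('d', range(1024))
start = time.perf_counter()
total = 0.0
for _ in range({reps}):
    total += sum(data)
print(time.perf_counter() - start)
"""


def run_instances(count, reps=2000):
    """Start `count` workers at once; return their elapsed times, sorted."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER.format(reps=reps)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(count)
    ]
    # Each worker prints its own wall-clock time for the loop.
    return sorted(float(p.communicate()[0]) for p in procs)


if __name__ == "__main__":
    # Compare 1 to 4 simultaneous instances, as on the 4-way box.
    for count in (1, 2, 3, 4):
        times = run_instances(count)
        print(count, " ".join(f"{t:.3f}" for t in times))
```

If the slowest time for 4 instances stands well apart from the other
three, that would match the behaviour described above. Pinning each
instance to its own CPU, where the kernel supports it, would help tell
a scheduler effect from a hardware one.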