Performance Variations using MPI/Myrico

Patrick Geoffray patrick at
Fri Apr 27 11:50:38 EDT 2001

Steffen Persvold wrote:

> Hmm, the NAS application runs in userspace, and since this inner loop
> (FFT code) runs without any communication with other nodes, why would
> an SSE-patched kernel improve its memcpy performance? I would believe
> that the memcpy calls in the FFT code were either inlined by the
> compiler, or that a call to libc's memcpy was made. It shouldn't
> involve any system (kernel) time at all, right?

Hi Steffen,

You are right: the NAS FT code does not call memcpy(), which in any case
is a library function and not a system call. The copy step of the FFT is
explicit (a loop of assignments), and the PGI compiler is smart enough to
use SSE prefetching to optimize this part of the code if SSE is
available. But without a specific patch, the Linux kernel does not
enable SSE support (basically, the kernel has to save and restore the
SSE registers along with the FP registers on every context switch), so
PGI's SSE optimization for the PIII is useless. Now I am wondering
whether compiling with -Mvect=sse or -Mvect=prefetch with pgf90 WITHOUT
SSE support enabled in the kernel might be the source of this
instability.
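
To illustrate the kind of loop involved, here is a minimal C sketch of an
explicit copy (a loop of assignments) with a software prefetch hint. It is
only an illustration, not the NAS FT source (which is Fortran) and not what
PGI actually generates. The names copy_with_prefetch and PREFETCH_DISTANCE
are hypothetical, and the hint uses the GCC/Clang builtin
__builtin_prefetch as a stand-in for whatever prefetch the compiler emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current
 * index the prefetch hint points. Not taken from the NAS or PGI code. */
#define PREFETCH_DISTANCE 64

/* Explicit copy loop: plain assignments, no call to memcpy().
 * The prefetch is only a hint; the copy is correct without it. */
void copy_with_prefetch(double *dst, const double *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Ask for a future source element to be brought into cache.
         * Arguments: address, 0 = read access, 0 = low temporal locality. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&src[i + PREFETCH_DISTANCE], 0, 0);
#endif
        dst[i] = src[i];
    }
}
```

Everything in this loop runs in user space. The kernel only comes into the
picture because it has to preserve the extra register state across context
switches.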

Anyway, a 50% variation for a pure computation piece of code seems too
large to be explained by the SSE support. SSE on the PIII is single
precision only, so it does not help to get more flops in
double-precision code. Maybe there is something else in the patch that
they applied; I will look at it.
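
For comparing runs, one possible way to put a number on such a variation
is the relative spread between the fastest and slowest run. The C sketch
below assumes that definition, which this thread does not state; the
function name relative_spread is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Relative spread of a set of run times:
 *   (slowest - fastest) / fastest
 * This is one assumed definition of "variation", not necessarily the
 * one used for the numbers discussed in this thread. */
double relative_spread(const double *seconds, size_t n)
{
    assert(seconds != NULL && n > 0);

    double fastest = seconds[0];
    double slowest = seconds[0];
    for (size_t i = 1; i < n; i++) {
        if (seconds[i] < fastest) fastest = seconds[i];
        if (seconds[i] > slowest) slowest = seconds[i];
    }

    assert(fastest > 0.0);  /* run times must be positive */
    return (slowest - fastest) / fastest;
}
```

With made-up timings of 2.0 s, 3.0 s and 2.5 s, the function returns 0.5,
that is, a 50% spread.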


Patrick Geoffray

|      Myricom Inc       |  University of Tennessee - CS Dept |
| 325 N Santa Anita Ave. |   Suite 203, 1122 Volunteer Blvd.  |
|   Arcadia, CA 91006    |      Knoxville, TN 37996-3450      |
|     (626) 821-5555     |      Tel/Fax : (865) 974-1950      |
