BLAS-1, AMD, Pentium, gcc

Jim Fraser fraser5 at
Fri Apr 12 19:51:25 EDT 2002

Sure, the optimized BLAS by Intel IS faster (on Intel). The data you present,
while very impressive, are skewed towards Intel because the libs are
optimized ONLY for SSE and Intel chips, which AMD does not really fully
support. BUT you should replace your stale BLAS code with optimized ATLAS for
your AMD chips.... It's a whole new world, my friend!  AMD really kicks some
butt when the libs are optimized for cache size.  It blew me away. The libs
are tuned for a specific chip's cache, detect SSE or 3DNow!, and really
exploit it, and the performance is very impressive (as is the makefile, which
runs for quite some time to produce the libs).  Download the latest developer
version, compile, and sit back and smile. WELL WORTH THE EFFORT, no question.

I got into this to port a CFD code over from Intel/MKL/ScaLAPACK/MPI to
AMD/ATLAS/ScaLAPACK/MPI.  For bang for the buck, there is no comparison with
AMD once you run with this package.  BTW, the ATLAS libs also run on Intel
(they run on ANY chip, for that matter) and improved performance over the
Intel MKL package as well (on some chips; equal on others).  I don't have all
the numbers offhand, but I would suggest you re-run your case with ATLAS;
your conclusion may change.
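
For anyone who wants to re-run such a comparison, here is a minimal sketch of
a BLAS-1 timing harness in plain C. The names ref_daxpy, ref_ddot and
daxpy_mflops are made up for this illustration and are not part of ATLAS or
MKL; the loops are unoptimized reference kernels that only give a baseline
number.

```c
#include <stddef.h>
#include <time.h>

/* Reference BLAS-1 daxpy: y[i] += alpha * x[i] (unit stride only). */
static void ref_daxpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Reference BLAS-1 ddot: returns the sum of x[i] * y[i] (unit stride only). */
static double ref_ddot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Time `reps` daxpy calls and return Mflops, counting 2 flops per element.
 * Returns 0.0 if the run was too short for clock() to measure. */
static double daxpy_mflops(size_t n, int reps, const double *x, double *y)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        ref_daxpy(n, 1e-9, x, y);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? (2.0 * (double)n * reps) / (secs * 1e6) : 0.0;
}
```

To time an optimized library instead, the ref_daxpy call would be swapped for
cblas_daxpy((int)n, alpha, x, 1, y, 1) from <cblas.h>, linking against
whichever BLAS is under test. The exact link flags depend on the install, so
check yours.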

Try it. It's free.

(PS get the developers source and compile instead of downloading the binary,
the term)


-----Original Message-----
From: beowulf-admin at [mailto:beowulf-admin at]On
Behalf Of Don Holmgren
Sent: Friday, April 12, 2002 1:36 PM
To: Hung Jung Lu
Cc: beowulf at
Subject: Re: BLAS-1, AMD, Pentium, gcc

On Fri, 12 Apr 2002, Hung Jung Lu wrote:

> Hi,
> I am thinking of migrating some calculation programs
> from Windows to Linux, maybe eventually using a
> Beowulf cluster. However, I am kind of worried after I
> read in the mailing list archive about lack of
> CPU-optimized BLAS-1 code in Linux systems. Currently
> I run on a Wintel (Windows+Pentium) machine, and I
> know it's substantially faster than an equivalent AMD
> machine, because I use Intel's BLAS (MKL) library.
> (I apologize for any misapprehensions in what
> follows... I am only starting to explore in this
> arena.)
> (1) Does anyone know when gcc will have memory
> prefetching features? Any time frame? I can notice
> very significant performance improvement on my Wintel
> machine, and I think it's due to memory prefetching.

If you mean, "when will gcc's optimizer do automatic prefetching?", I
have no idea.  But, many programmers have been doing manual prefetching
with gcc for quite a while. If you don't mind defining and using
assembler macros, gcc handles it just fine now.  Here's an example:

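/* Prefetch (non-temporal hint) the 128-byte-aligned block containing addr.
 * The cast uses unsigned long, which is pointer-sized on Linux. */
#define prefetch_loc(addr) \
__asm__ __volatile__ ("prefetchnta %0" \
                      : \
                      : \
                      "m" (*(((char*)(((unsigned long)(addr))&~0x7fUL)))))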

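Where inline assembler is not wanted, a similar hint can be written with the
__builtin_prefetch built-in that gcc provides. The sketch below is only an
illustration: sum_with_prefetch and the PREFETCH_AHEAD distance are
hypothetical, and the distance is a guess rather than a tuned value.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; tune per machine. */
#define PREFETCH_AHEAD 64

/* Sum an array while hinting that data PREFETCH_AHEAD elements further on
 * will be read soon. */
static double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = read access, 0 = no temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
        }
        sum += a[i];
    }
    return sum;
}
```

A locality argument of 0 asks for a non-temporal prefetch, which on x86
typically becomes the same prefetchnta instruction as the macro. The hint
never changes the result, only (possibly) the speed.
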
> (2) I am a bit confused on the following issue: Intel
> does release MKL for Linux. So, does this mean that if
> I use Pentium, I still get full benefit of the
> CPU-optimized features in BLAS-1, even though gcc does
> not do memory prefetching? How is this possible?

The Intel compiler produces object files compatible with gcc, and vice
versa.  I would assume they implemented the library with the Intel
compiler, which has full SSE/SSE2 support (including prefetching).  They
list the MKL for Linux as compatible with both gnu and Intel compilers.

> (3) Related to the above: for general linear algebra
> operations, is Pentium processor then better than AMD,
> since Intel has the machine-optimized BLAS library? I
> get contradictory information sometimes... I've seen
> somewhere that Pentium-4 compares unfavorably with AMD
> chips in calculation speed... Any opinions?
> thanks,
> Hung Jung Lu

For the very simple SU3 linear algebra (3X3 complex matrices and 3X1
complex vectors) used in our codes, the Pentium 4 outperforms the Athlon
on most of our SSE-assisted routines.  See the table near the bottom of
for Mflops per gigahertz on various routines for P-III, P4, and Athlon.
Perhaps re-coding in 3DNow! would give the Athlon a boost.
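
For readers unfamiliar with the kernel, here is a plain C sketch of the 3X3
complex matrix times 3X1 complex vector product, along with the Mflops per
gigahertz arithmetic. It is not the SSE-assisted code discussed above. The
names su3_mat_vec and mflops_per_ghz, the row-major layout, and the use of
single precision are assumptions made for illustration.

```c
#include <complex.h>

/* Flop count for one 3x3 complex matrix times 3x1 complex vector product:
 * 9 complex multiplies (6 flops each) + 6 complex adds (2 flops each). */
enum { SU3_MAT_VEC_FLOPS = 9 * 6 + 6 * 2 };   /* = 66 */

/* y = U * v, with U stored row-major as 9 complex numbers. */
static void su3_mat_vec(const float complex U[9],
                        const float complex v[3],
                        float complex y[3])
{
    for (int i = 0; i < 3; i++)
        y[i] = U[3 * i + 0] * v[0]
             + U[3 * i + 1] * v[1]
             + U[3 * i + 2] * v[2];
}

/* Mflops normalized by clock speed, so chips at different GHz can be
 * compared per cycle. */
static double mflops_per_ghz(double flops_per_call, double calls,
                             double seconds, double clock_ghz)
{
    double mflops = flops_per_call * calls / (seconds * 1e6);
    return mflops / clock_ghz;
}
```

Dividing by the clock is what lets a 1.2 GHz part be compared with a 1.7 GHz
part on the same footing.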

For our codes, which are bound by memory bandwidth, P4's do
significantly better than Athlons because of the faster front side bus
(400 MHz effective).  See
for a table comparing memory bandwidth and SU3 linear algebra
performance on a 1.2 GHz Athlon, 1.4 GHz P4, and 1.7 GHz P7 (see
for information about this benchmark).
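
To see why the bus matters for a bandwidth-bound kernel, here is a rough
back-of-the-envelope sketch. The function name bandwidth_bound_mflops is
hypothetical. The 8-byte bus width, single-precision operands, and zero cache
reuse are assumptions for illustration, not figures from the tables
referenced above.

```c
/* Upper bound on Mflops for a kernel that streams every operand from memory.
 * bus_mhz * bus_bytes gives peak bandwidth in MB/s (1 MB = 1e6 bytes);
 * multiplying by flops per byte moved gives the Mflops ceiling. */
static double bandwidth_bound_mflops(double bus_mhz, double bus_bytes,
                                     double flops, double bytes_moved)
{
    double mbytes_per_sec = bus_mhz * bus_bytes;
    return mbytes_per_sec * flops / bytes_moved;
}
```

Under those assumptions, one SU3 matrix-vector product moves 72 bytes of
matrix plus 24 bytes each of input and output vector (120 bytes) for 66
flops, so bandwidth_bound_mflops(400.0, 8.0, 66.0, 120.0) gives a ceiling of
1760 Mflops on a 400 MHz bus. That is a theoretical limit, not a measurement;
real codes land below it.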

Don Holmgren

Beowulf mailing list, Beowulf at
To change your subscription (digest mode or unsubscribe) visit
