[Beowulf] Profiling floating-point performance
Robert G. Brown
rgb at phy.duke.edu
Wed Feb 11 13:08:41 EST 2004
On Wed, 11 Feb 2004, david moloney wrote:
> I have an application written in C++ which compiles under both MSVC++
> 6.0 and gcc 2.9.6 that I would like to profile in terms of floating
> point performance.
> My special requirement is that I would like not only peak and average
> flops numbers but also I would like a histogram of the actual x86
> floating point instructions executed and their contribution to those
> peak and average flops numbers.
> Can anybody offer advice on how to do this? I tried using Vtune but it
> didn't seem to have this feature.
I'm not sure how accurate it is overall, but see "man gprof" and compile
and link with the -pg flag (plus -g for symbols). This will give you at
least some useful times and call counts per routine.
It will NOT give you (AFAIR) a "histogram of actual x86 floating point
instructions" and their contributions.
I don't know of anything that will -- to get them you have to instrument
your code, probably so horribly that a la heisenberg your measurements
would bear little resemblance to actual performance (especially if your
code wants to be doing all sorts of smooth vector things in cache and
register memory and you keep calling instrumentation subroutines to try
to measure times that wreck state).
Consider that with my best raw, on-CPU, assembler-based timing clock
(using the onboard cycle counter) I still find the overhead of reading
that clock to be in the tens of clock cycles. To microtime a single
multiply is thus all but impossible -- the clock itself takes 10-40
times as long to execute as a multiply might take, depending on where
the data to be multiplied is when one starts. So timing per-instruction
is effectively out.
Similarly, to instrument and count floating point operations requires
something to "watch the machine instructions" as they stream through the
CPU. Unfortunately, the only thing available to watch the instructions
is the CPU itself, so you have to damn near write an
assembler-interpreter to instrument this. Which in turn would be slow
as molasses -- an easy 10x slower than the native code in overhead alone
plus it would utterly wreck just about every code optimization known to
Finally, there is the question of "what's a flop". The answer is, not
much that's useful or consistent -- the number of floating point
operations that a system does per second varies wildly, depending in a
complex way on system state, cache locality, whether the operand is in
main memory or a register, whether the instruction is part of a
complex/advanced instruction (e.g. add/multiply) or an instruction that
has to be done partly in software (divide), whether or not the
instruction is part of a stream of vectorized instructions, and more.
That's why microbenchmarks are useful. You may not be able to extract
meaningful results from your code with a simple tool (although it isn't
terribly difficult to instrument major blocks or subroutines with timers
and counters, which is more or less what -pg and gprof do), but with a
good microbenchmark you can learn at least some things about how your
system executes core operations in various contexts and use that to
optimize your code. Just sweeping stream across vector sizes from 1
to 10^8 or so teaches you a whole lot about the system's performance in
different contexts, as does doing a stream-like benchmark but working
through the vector in a random order (i.e. deliberately defeating any
sort of vector optimization and cache benefit).
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf