[Beowulf] ...Re: Benchmark results

Joe Landman landman at scalableinformatics.com
Tue Jan 6 19:44:31 EST 2004

On Tue, 2004-01-06 at 18:00, Rene Storm wrote:
> Hi,
> first of all, thanks for the responses....
> It seems that there are some more points I have to observe.
> For example the hpl benchmark:
> The job was running on two processors (SMP) over loopback. Run time was about 20 min.
> There was no cron job, no X, only a niced top.
> The machines' environments are almost constant (temperature->lm_sensors).
> A reboot after every single run.

One of the odder bits about benchmarking...  Most folks I know don't
reboot machines between real end user runs.  Or umount/mount after a
heavy IO run.  

If you are really aiming for minimal interference, you have probably
turned everything off anyway, and are running in single user mode (or
just above it with networking enabled).  That doesn't guarantee much
more than that some default services are not running.  You might want to
turn off atd and disable stuff in /etc/cron.d .  Not that this really
makes much sense from a "realism" point of view, as end users are not
(normally) allowed to do that.

> And yes, the differences on the opterons are greater than on i386.
> I will take a look at interleaving. But why should the memory allocation change between two runs?

In short, when a process launches the NUMA nature of its memory could
mean that the memory system where the thread gets its physical
allocations differs from the memory system which is tied to the CPU
running the code.  For a NUMA system, this gives you an extra "hop" for
memory accesses.  There is the concept of process affinity whereby the
thread is tied to a particular CPU it has been running on, which tends
to prevent the cache from being rendered less effective by having the
process hop from CPU to CPU.  Tying an allocation to a CPU exists in
various Unixen (IRIX in particular with the old dplace and similar
directives), and I have heard rumors of a "runon" command (applying both
a process/thread affinity, and pegging the memory allocated by the
thread to the CPU running the thread).
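
As a rough illustration of the process-affinity half of this, here is a
minimal Python sketch that pins the calling process to one CPU.  It
assumes a Linux-like system where Python exposes os.sched_setaffinity;
the helper name pin_to_cpu is made up for this example.  It does nothing
about where the memory gets allocated, which is the other half of the
problem.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU.

    Returns the previous affinity set so the caller can restore it,
    or None where the platform does not offer sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # assumption: only Linux-like systems expose this call
    previous = os.sched_getaffinity(0)  # 0 means "this process"
    os.sched_setaffinity(0, {cpu})
    return previous
```

Call it before the timed part of a benchmark, with a CPU number taken
from os.sched_getaffinity(0), and pass the returned set back to
os.sched_setaffinity(0, ...) afterwards to undo the pinning.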

> So it seems I will have to go a little bit deeper into that and try to get some more results.
> What I do expect is some sort of Gaussian distribution on the results.

You should get some distribution.  I would be reluctant to discuss the
shape without at least a reasonable empirical theory.  It does seem that
the allocation locality would cause at least a bimodal shape (rough
guess), something like 2 overlapping distributions.  Other effects are
going to spread this a bit. 
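
One quick way to look at the shape before reaching for a plotting tool
is to bin the run times and print the counts as text.  The sketch below
is only that, a sketch: the function names and the default of 10 bins
are arbitrary choices for illustration, and it says nothing about what
the distribution ought to be.

```python
def bin_counts(times, bins=10):
    """Count how many run times fall into each of `bins` equal-width bins."""
    lo, hi = min(times), max(times)
    span = (hi - lo) or 1.0  # avoid dividing by zero if every run agrees
    counts = [0] * bins
    for t in times:
        # The largest value would land one past the end, so clamp it.
        index = min(int((t - lo) / span * bins), bins - 1)
        counts[index] += 1
    return counts


def print_histogram(times, bins=10, width=40):
    """Print one line per bin: bin start, count, and a bar of '#'."""
    counts = bin_counts(times, bins)
    lo, hi = min(times), max(times)
    step = ((hi - lo) or 1.0) / bins
    peak = max(counts)
    for i, count in enumerate(counts):
        bar = "#" * round(count / peak * width)
        print(f"{lo + i * step:12.3f} {count:5d} {bar}")
```

Feed it the wall-clock times from a few dozen runs.  Two separated
clumps of '#' would be consistent with the bimodal guess above; a single
clump would argue against it.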

> I will take the hpl and throw the results into gnuplot.
> One question about the walltime. Could this be a problem?
> Does someone know how this really works? I read something about HZ and jiffies and so on, but it was a bit confusing.
> Maybe this could be the point where old, unfair Heisenberg could have had a chance.

Physics 101:

Take the precision of your clock.  Your uncertainty in using the clock
as a measuring device is +/- 1/2 of the clock's precision, that is, 1/2
of the smallest interval that this clock can measure (call this
half-interval delta_t).  As you are really interested in time intervals,
you may need to appeal to basic error analysis.  That said, for a really
rough estimate of the contribution of the random error of the clock to
your runs, have a look at the ratio of delta_t to the total run time.
For a run of 30 minutes, t ~ 1800 seconds, and using 0.01 seconds as the
clock's precision (so delta_t = 0.005 seconds) gives you about
3 x 10**(-6) (or 1 part in 360000).  That is, the contribution to the
error due to the clock is small here.
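
If you don't know the precision of the clock you are timing with, you
can ask for it or watch it tick.  The sketch below uses Python's time
module for that; the function names are made up for the example, and the
reported figure is only what the library claims, which need not match
what the kernel actually delivers.  It does not try to explain HZ or
jiffies.

```python
import time


def reported_resolution(clock="time"):
    """Resolution in seconds that Python reports for the named clock."""
    return time.get_clock_info(clock).resolution


def observed_tick(samples=5):
    """Smallest step seen between successive distinct time.time() readings."""
    smallest = float("inf")
    for _ in range(samples):
        start = time.time()
        now = time.time()
        while now == start:  # spin until the clock visibly advances
            now = time.time()
        smallest = min(smallest, now - start)
    return smallest
```

Whichever number is larger is the safer one to use as the clock's
precision in the estimate above.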

Even for a run of 60 seconds it is still in the 0.01% regime.  To get a
5% error out of that, you would need delta_t/t ~ 0.05, or t about 0.1
second for your total benchmark run time.
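
The arithmetic above is easy to redo for other run times.  A small
sketch, with helper names invented for this example and delta_t taken as
half the clock's precision, as in the text:

```python
def clock_relative_error(precision, run_time):
    """Ratio of delta_t (half the clock precision) to the total run time."""
    delta_t = precision / 2.0
    return delta_t / run_time


def run_time_for_error(precision, target_ratio):
    """Run time at which delta_t / t reaches the given ratio."""
    return (precision / 2.0) / target_ratio


print(clock_relative_error(0.01, 1800.0))  # 30 minute run: about 2.8e-06
print(clock_relative_error(0.01, 60.0))    # 60 second run: about 8.3e-05
print(run_time_for_error(0.01, 0.05))      # 5% error: about 0.1 second
```

Both functions take seconds in and, for run_time_for_error, give seconds
back; the 0.01 second precision is the figure assumed in the text, not a
measured value.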

You wouldn't even need another process taking up time, the kernel will
happily do this for you with some of its housekeeping chores (bdflush,
kswapd, ...).  Servicing lots of interrupts is a good way to slow things
down, as are doing many small page allocations where one large one would
suffice, etc.

> Rene Storm
> -----Original Message-----
> From: Michael Will [mailto:mwill at penguincomputing.com] 
> Sent: Tuesday, January 6, 2004 18:53
> To: beowulf at beowulf.org
> Subject: [Beowulf] Re: Benchmark results
> Rene,
> Do you see that fluctuation only on the opterons or also on the intel cpus
> (I assume Xeon?)
> If it was only on the opteron, then you will have node memory interleave
> turned off in your bios, and sometimes the data you work with is close to
> your CPU and sometimes far. (each CPU has a bank of RAM and accesses the
> other CPUs bank through hypertransport, which is slower)
> When running a memory- and floating-point-intensive benchmark (a neural network
> training program like SNNS), we got two consistent results: They were
> alternating, depending on which CPU the code ran, and they were 20% off of
> each other for that setup.
> Once we put in the same amount of memory in both banks, and turned on
> 'node memory interleave' in the bios, they were consistently between the
> two, a bit closer to the faster value.
> I will take a closer look at your CD later, too.
> Michael Will
> PS: Did I mention the opterons blew away the xeons by far?

Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615
