[Beowulf] recommendations for cluster upgrades

Bill Broadley bill at cse.ucdavis.edu
Thu May 14 01:01:15 EDT 2009

Gerry Creager wrote:
> I'm going to take a little issue with Mark's first statement.  I've been
> bitten by Intel math bugs in the past (rerunning simulations for

All CPUs have bugs.

> verification of performance results in interestingly different answers).

IMO that should happen with all new CPUs, not just the big jumps.  Every
revision of the core has the potential to change something important.
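
A minimal sketch of that kind of re-verification, assuming each run's results
can be dumped as a flat list of floats. The function name and the tolerance
defaults are hypothetical, not from any particular package; pick tolerances
that match what your application can actually promise.

```python
import math

def compare_runs(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Return (index, reference, candidate) for each value outside tolerance.

    reference: results from the old, trusted hardware (assumed list of floats)
    candidate: results from the same job rerun on the new CPU
    """
    if len(reference) != len(candidate):
        raise ValueError("runs have different lengths")
    return [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, candidate))
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
    ]
```

An empty list means the two runs agree within tolerance; anything else is a
place to start digging, whether the cause turns out to be the CPU, the
compiler, or the math library.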

>  Intel's got new hardware, new silicon, and a bit of a history of not
> reporting errors in silicon until they think they have a fix.

Yeah, the PR side of the fdiv bug was handled poorly, especially in regard
to how often it would happen (they were wildly low on the rate) and what it
would affect (they claimed it would trigger various ops that use fdiv's
results).  I produced one of the counterexamples, but that was quite a few
CPU revisions ago.  It was kind of sadly amusing to watch Intel trying to get
Linus to sign an NDA covering a fix for the fdiv bug.
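
For reference, the widely circulated check for the fdiv flaw (not the
counterexample mentioned above) divides 4195835 by 3145727 and multiplies
back.  A rough sketch follows; the function names and the threshold are
assumptions for illustration.  On a correct IEEE double implementation the
residual is zero, while the flawed Pentiums famously returned 256.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Compute x - (x / y) * y, which should be ~0 on a correct FPU."""
    return x - (x / y) * y

def looks_like_fdiv_bug(threshold=1.0):
    """True if the residual is far too large to be ordinary rounding error.

    The threshold of 1.0 is an arbitrary assumption; the flawed chips were
    off by 256 here, so anything near zero passes.
    """
    return abs(fdiv_residual()) > threshold

if __name__ == "__main__":
    print("residual:", fdiv_residual())
    print("fdiv flaw suspected:", looks_like_fdiv_bug())
```

This only exercises one operand pair, so it says nothing about other errata.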

With all that said, Intel's learned and they regularly publish rather detailed
errata on their CPUs.  The Nehalem core isn't particularly new, it's already
been through at least 2 production revisions (C0 and D0), and has been shipping
for some 6 months or so.

So sure, the most conservative thing is to wait, but personally I've found
that CPU correctness is the least of my worries.  Users, compilers,
optimizations, applications, and gamma rays all seem to cause more errors
these days.  That the Nehalem provides a rather large performance factor
certainly enters into my value decision.

> I'm skeptical enough to wait for at least one iteration down the road.
> Or some additional experience reporting here.

Dunno.  The actual cores seem pretty similar to the previous generation.
Relaxing of issue rules and a few small tweaks seem to have improved the IPC
by 0-5%.  The big changes are hyperthreading, which isn't new to Intel but is
new on the Core-based CPUs, and the memory controller, which while providing
radically better performance seems a pretty simple change from the previous
generation's north bridge.  Sure, it's 3 channels instead of 2, and on chip.

It's not at all clear to me that the next generation with a shrink, more
cores, and/or more threads will be any less of a risk.  I would however be
surprised if any of the next CPU revisions offer anywhere close to the
performance improvement that the Nehalem offers over the previous Intel
generation.

