[Beowulf] recommendations for cluster upgrades
bill at cse.ucdavis.edu
Thu May 14 01:18:42 EDT 2009
Rahul Nabar wrote:
> On Tue, May 12, 2009 at 7:05 PM, Greg Keller <Greg at keller.net> wrote:
>> Nehalem is a huge step forward for Memory Bandwidth hogs. We have one code
>> that is extremely memory bandwidth sensitive that i
> Thanks Greg. A somewhat naive question, I suspect.
> What's the best way to test the "memory bandwidth sensitivity" of my
Oprofile, I think it is, can query various CPU performance counters that record
things like cache hits/misses, tlb hits/misses and the like. If I'm
misremembering the name I'm sure someone will speak up. Of course you'd need an
idea of what a node is capable of; I suggest using micro benchmarks that
exercise the areas of the memory hierarchy that you are interested in.
> code? Given the fact that I have only the AMD Opterons available
> currently. Any active / passive tests that can throw some metrics out
> at me?
Hrm, having a single platform makes it harder. I've often begged/borrowed
accounts so I could measure application performance, then I run a series of
microbenchmarks to quantify various aspects of the memory hierarchy. Then I
look for correlations between my micro benchmarks and the application performance.
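For what it's worth, the counter-reading step can be sketched with Linux's
perf tool (a newer alternative to oprofile). The event names below are the
generic ones and vary by kernel and CPU, and 'true' is just a placeholder for
your real binary:

```shell
#!/bin/sh
# profile_mem CMD [ARGS...]: one run of CMD under perf stat, counting
# cache and TLB events.  Falls back gracefully if perf is absent or the
# counters are locked down.
profile_mem() {
    if ! command -v perf >/dev/null 2>&1; then
        echo "perf not installed; oprofile is the older route"
        return 0
    fi
    # Generic event names; run 'perf list' to see what your CPU exposes.
    perf stat -e cache-references,cache-misses,dTLB-load-misses "$@" \
        || echo "perf could not read counters (check perf_event_paranoid)"
}

# Example: substitute your application for 'true'.
profile_mem true
```

High cache-miss or TLB-miss rates relative to references are the usual first
hint that a code leans on the memory system rather than the ALUs.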
There are a couple of other things you can do:
* run 1 to N copies of your code. Great scaling usually means you are CPU
limited/cache friendly. Poor scaling usually indicates contention in the memory
and/or I/O system.
* If you think you are doing lots of random memory accesses you can turn on
node/channel/bank interleaving. If performance drops when you go from
4x64 bit channels of memory -> 1x256 bit channel of memory you are likely
limited by the number of simultaneous outstanding memory references.
* Often you can tweak the bios in various ways to increase/decrease bandwidth.
Things like underclocking the memory bus, underclocking the hypertransport
bus, more aggressive ECC scrubbing, various other tweaks available in the
north bridge settings.
* new opterons (shanghai and barcelona) currently have 2 64 bit channels per
socket. If you install the dimms wrong you get 1 64 bit channel, halving
the socket's bandwidth.
* If you pull all the dimms on a socket you halve the node bandwidth (in a
2 socket system).
* If you pull a CPU (usually cpu1, not cpu0, should be pulled) I believe the
coherency traffic goes away and the latency to memory drops by 15ns or so;
if that makes a difference in your application runtime you are
rather latency sensitive.
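The 1-to-N copies test in the first bullet is easy to script. A rough sketch,
where 'true' stands in for your actual binary and the copy counts are just an
example:

```shell
#!/bin/sh
# scale_test CMD [ARGS...]: run 1, 2, 4, and 8 concurrent copies of CMD
# and print the wall-clock time for each batch.  Roughly flat times mean
# good scaling (CPU bound / cache friendly); growing times suggest
# contention in the memory and/or I/O system.
scale_test() {
    for n in 1 2 4 8; do
        start=$(date +%s)
        i=0
        while [ "$i" -lt "$n" ]; do
            "$@" &               # launch one copy in the background
            i=$((i + 1))
        done
        wait                     # block until all copies finish
        end=$(date +%s)
        echo "$n copies: $((end - start)) s"
    done
}

# Example: substitute your application for 'true'.
scale_test true
```

On a real node you'd probably also want to pin each copy (taskset/numactl) so
the copies land on cores the same way your production run does.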
So basically with the above you should be able to play with parallel
outstanding requests from 1 to 4 per system, bandwidth at 25, 50, and 100% of
normal, and a bit more with the other tweaks. I recommend running some micro
benchmarks to look at the underlying memory performance, then checking
application runtimes to see how sensitive you are.
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf