Opteron kernel

Derek Richardson derek.richardson at pgs.com
Wed Dec 3 11:27:59 EST 2003

I think there is a lot of potential for optimization in the x86-64
architecture.  Two versions of our code (slightly different source,
compiled with the same GNU compilers but different flags) showed a
large performance difference.  One version ran at ~85% of the speed of
the P4 gear, and the other at ~140% of the P4 gear (dual Xeon 3.06 GHz
boxen).  Having found this out two days ago and spent all of yesterday
repairing some dead nodes, I haven't had a chance to chase the testing
up (find out which flags, code differences, etc.).  We are planning on
doing a run w/ the same code base but the changed compiler flags.  That
should show whether it is the code changes or the compiler flags.  My
guess would be the compiler flags, but I don't know (yet) what changes
were made in the code itself.  There's also some pre-fetching
optimization work that can be done, so things are looking a bit
brighter.
As a side note, AMD recommends the SUSE 64-bit kernel (apparently even
for non-SUSE, non-64-bit OSes like RedHat).  I don't know where they
stand on RH Advanced Whatchamadoodle vs. SUSE, but I'll have to sort
that out in the future, if we ever actually get around to getting some
Opterons (our stance has been that they have to outperform the P4 Xeon
gear using the same code and OS; then we'll worry about seriously
optimizing).
I suppose I'll let everyone know when we discover what made such a large
difference.

Derek R.

Claude Pignol wrote:

> Derek Richardson wrote:
>> Donald,
>> Sorry for the late reply, bloody Exchange server didn't drop it in my 
>> inbox until late this morning.  Memory and scheduling would probably 
>> be the biggest factor.  Processor affinity doesn't matter as much, 
>> because in my experience we haven't had problems w/ processes 
>> bouncing between CPUs.  PCI bus is almost a non-issue, since our 
>> application is embarrassingly parallel and therefore has no need for
>> more than 100 Mbit ethernet, and there is no disk on a PCI-attached
>> controller,
>> so we have very little information passing over the PCI bus.
>> By interleaving, I assume you mean at the physical level, which I had 
>> a quick peek at when we got the system ( it's an IBM eServer 325, a 
>> loaner for testing ) and I assumed to be correct.  But given the poor 
>> performance I have seen ( 2 GHz Opterons coming in at ~15% slower 
>> than a 3 GHz P4 on a compute/memory intensive application when most 
>> benchmarks I have seen would imply the inverse ), I will double-check 
>> that when given a chance. 
> I have reached the same conclusion about performance. On our
> application (floating point and memory intensive) I haven't seen the
> speedup that the SPEC benchmarks would lead us to expect
> (using gcc 3.3, kernel NUMA, bank interleaving ON, CPU interleaving OFF).
> The problem is probably that the compiler doesn't generate
> well-optimized code for common applications.
> It seems that the price/performance ratio still favors Xeon for
> dual-processor machines.
>> I will probably just try the latest 2.6 kernel and a few other tweaks 
>> as well, and AMD has also offered help, but that would more likely 
>> be at the application layer ( which I don't have control of, 
>> unfortunately ).
>> Thanks for the response, and my apologies for the vagueness of the 
>> question.
>> Derek R.
>> Donald Becker wrote:
>>> On Mon, 24 Nov 2003, Derek Richardson wrote:
>>>> Does anyone know where to find info on tuning the linux kernel for 
>>>> Opterons?  Googling hasn't turned up much useful information.
>>> What type of tuning?
>>> PCI bus transactions (the Itanium required more, but the Opteron still
>>> benefits)?  Scheduling?  Processor affinity?  What kernel version?
>>> If you ask specific questions, there is likely someone on the list that
>>> knows the specific answer.
>>> The easiest performance improvement comes from proper memory DIMM
>>> configuration to match the application layout.  Each processor has its
>>> own local memory controller, and understanding how the memory slots are
>>> filled and the options e.g. interleave can make a 30% difference on a
>>> dual processor system.
> -- 
> ------------------------------------------------------------------------
> Claude Pignol 	SeismicCity, Inc. <http://www.seismiccity.com>
> 2900 Wilcrest Dr.    Suite 370 	 Houston TX 77042
> Phone:832 251 1471 Mob:281 703 2933 	 Fax:832 251 0586

Linux Administrator
derek.derekson at pgs.com
derek.derekson at ieee.org
Office 713-781-4000
Cell 713-817-1197
Disease can be cured; fate is incurable.
		-- Chinese proverb

Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
