[Beowulf] NUMA info request

kyron at neuralbs.com kyron at neuralbs.com
Wed Mar 26 10:58:38 EDT 2008


> On Tue, Mar 25, 2008 at 12:40 PM,  <kyron at neuralbs.com> wrote:
>>
>> > On Tue, Mar 25, 2008 at 12:17 AM, Eric Thibodeau <kyron at neuralbs.com>
>>  > wrote:
>>  >>
>>  >>  Mark Hahn wrote:
>>  >>  >>   NUMA is an acronym meaning Non Uniform Memory Access. This is
>> a
>>  >>  >> hardware constraint and is not a "performance" switch you turn
>> on.
>>  >>  >> Under the Linux
>>  >>  >
>>  >>  > I don't agree.  NUMA is indeed a description of hardware.  I'm
>> not
>>  >>  > sure what you meant by "constraint" - NUMA is not some kind of
>>  >>  > shortcoming.
>>  >>  Mark is right, my choice of words is misleading. By constraint I meant
>>  >>  that you have to be conscious of what ends up where (that was the point
>>  >>  of the link I added in my e-mail ;P )
>>  >>
>>  >>  >> kernel there is an option that is meant to tell the kernel to be
>>  >>  >> conscious about that hardware fact and attempt to help it optimize
>>  >>  >> the way it maps the memory allocation to a task Vs the processor the
>>  >>  >> given task will be using (processor affinity, check out taskset in
>>  >>  >> recent util-linux implementations, ie: 2.13+).
>>  >>  > the kernel has had various forms of NUMA and socket affinity for
>> a
>>  >>  > long time,
>>  >>  > and I suspect most any distro will install kernel which has the
>>  >>  > appropriate support (surely any x86_64 kernel would have NUMA
>>  >> support).
>>  >>  My point of view on distro kernels is that they are to be scrutinized
>>  >>  unless they are specifically meant to be used as computation nodes (ie:
>>  >>  don't expect CONFIG_HZ=100 to be set on "typical" distros).
>>  >>  Also, NUMA is only applicable to Opteron architecture (internal MMU with
>>  >>  HyperTransport), not the Intel flavor of multi-core CPUs (external MMU,
>>  >>  which can be a single bus or any memory access scheme as dictated by the
>>  >>  motherboard manufacturer).
>>  >>
>>  >> >
>>  >>  > I usually use numactl rather than taskset.  I'm not sure of the
>>  >>  > history of those tools.  as far as I can tell, taskset only
>> addresses
>>  >>  > numactl --cpubind,
>>  >>  > though they obviously approach things differently.  if you're
>> going
>>  >> to
>>  >>  > use taskset, you'll want to set cpu affinity to multiple cpus
>> (those
>>  >>  > local to a socket, or 'node' in numactl terms.)
>>  >>  >
>>  >>  >>   In your specific case, you would have 4Gigs per CPU and would
>> want
>>  >>  >> to make sure each task (assuming one per CPU) stays on the same
>> CPU
>>  >>  >> all the time and would want to make sure each task fits within
>> the
>>  >>  >> "local" 4Gig.
>>  >>  >
>>  >>  > "numactl --localalloc".
>>  >>  >
>>  >>  > but you should first verify that your machines actually do have
>> the
>>  >> 8GB
>>  >>  > split across both nodes.  it's not that uncommon to see an
>>  >>  > inexperienced assembler fill up one node before going onto the
>> next,
>>  >>  > and there have even
>>  >>  > been some boards which provided no memory to the second node.
>>  >>  Mark (Hahn) is right (again !), I ASSumed the tech would load the memory
>>  >>  banks appropriately, don't make that mistake ;) And numactl is indeed
>>  >>  more appropriate in this case (thanks Mr. Hahn ;) ). Note that the
>>  >>  kernel (configured with NUMA) _will_ attempt to allocate the memory to
>>  >>  "local nodes" before offloading to memory "abroad".
>>  >>
>>  >>  Eric
>>  >>
>>  > The memory will be installed by myself correctly - that is,
>>  > distributing the memory according to cpu.  However, it appears that
>>  > one of my nodes (my first Opteron machine) may well be one that has
>>  > only one bank of four DIMM slots assigned to cpu 0 and shared by cpu
>>  > 1.  It uses a Tyan K8W Tiger s2875 motherboard.  My other two nodes
>>  > use Arima HDAMA motherboards with SATA support - each cpu has a bank
>>  > of 4 DIMMs associated with it.  The Tyan node is getting 4 @ 2 Gb
>>  > DIMMs, one of the HDAMA nodes is getting 8 @ 1 Gb (both instances
>>  > fully populating the available DIMM slots) and the last machine is
>>  > going to get 4 @ 1 Gb DIMMs for one cpu and 2 @ 2 Gb for the other.
>>
>>  That last scheme might give you some unbalanced performance but that is
>>  something to look up with the MB's instruction manual (ie: you might be
>>  better off installing the RAM as 1G+1G+2G for both CPUs instead of
>>  4x1G + 2x2G).
>
> On my Opteron systems, wouldn't 3 DIMMs per CPU drop me into 64-bit
> memory bandwidth rather than the allowed 128-bit memory bandwidth when
> each CPU has an even number of DIMMs?

Hence "look up with the MB's instruction manual" but you're probably right.
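
To put rough numbers on the 64-bit vs 128-bit question, here is a small
back-of-the-envelope sketch. It assumes DDR-400 memory (400 MT/s), which is
my assumption and not something stated in this thread, and it only gives the
theoretical peak per CPU, not what an application will actually see.

```python
def peak_bandwidth_gb_s(bus_width_bits, megatransfers_per_s):
    """Theoretical peak = bytes per transfer * transfers per second."""
    bytes_per_transfer = bus_width_bits // 8
    # bytes * MT/s gives MB/s; divide by 1000 to express it in GB/s.
    return bytes_per_transfer * megatransfers_per_s / 1000


if __name__ == "__main__":
    # 400 MT/s is an assumed DDR-400 figure; check your DIMMs and manual.
    for width in (64, 128):
        print(width, "bit bus:", peak_bandwidth_gb_s(width, 400), "GB/s peak")
```

Under that assumption, dropping from 128-bit to 64-bit mode halves the peak
figure, which is why the DIMM layout is worth checking in the manual.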

>
>>
>>
>>  > It looks like I may want to upgrade my motherboard before exploring
>>  > NUMA / affinity then.
>>
>>  If you're getting into "upgrading" (ie: throwing money at) anything, then
>>  you're getting into the slippery slope of the hardware selection debate ;)
>
> Slippery indeed.  At this point, I think I may just install the RAM to
> bring my current calculation out of swap and be done with the cluster
> for now.  Given that I think one of my nodes uses hypertransport for
> all of cpu 1 memory access, would it hurt anything to use affinity
> when only 2 out of 3 nodes can benefit from affinity?

Start by getting that RAM in, and worry about affinity once your code runs
in parallel; the kernel will probably play nice with you and take that
worry away from you. numactl can help you determine whether your process is
trying to use another node's memory: bind it to a node with --membind and
the process will typically "crash" once its allocations no longer fit on
that node.

(from the numactl manpage):

       --membind=nodes, -m nodes
              Only allocate memory from nodes.  Allocation will fail when
              there is not enough memory available on these nodes.
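
As a rough sketch of how one might launch a job that way, the helper below
only builds the numactl command line, confining both the CPUs and the memory
to a single node. The helper name and "./my_solver" are placeholders of mine,
not anything from this thread; --cpunodebind and --membind are the option
names in current numactl manpages (older versions spell the first one
--cpubind).

```python
import shlex


def numactl_argv(node, command):
    """Build an argv list that runs `command` under numactl on one node."""
    if node < 0:
        raise ValueError("node must be non-negative")
    return [
        "numactl",
        f"--cpunodebind={node}",  # run only on the CPUs of this node
        f"--membind={node}",      # allocate only from this node's memory
        *command,
    ]


if __name__ == "__main__":
    # "./my_solver" and "input.dat" are hypothetical placeholders.
    argv = numactl_argv(0, ["./my_solver", "input.dat"])
    print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run() as is. If the job dies
with an allocation failure under --membind but runs fine without it, it does
not fit in that node's local memory.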

>>
>>
>>  > This discussion as well as reading about NUMA and affinity elsewhere
>>  > leads to another question - what is the difference between using
>>  > numactl or using the affinity options of my parallelization software
>>  > (in my case openmpi)?
>>
>>  numactl is an application to help nudge processes in the correct
>>  direction. Implementing CPU affinity within your code makes your code
>>  explicitly aware that it will run on an SMP machine (ie: it's hardcoded
>>  and you don't need to call a script to change your process's affinity).
>>
>>  In that regard, Chris Samuel replied with the mention of Torque and PBS,
>>  which would support affinity assignment. IMHO, that would be the most
>>  logical place to control affinity (as long as one can provide some memory
>>  access hints, ie: same options as seen in numactl's manpage)
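
To make the "affinity within your code" option concrete, here is a minimal
Linux-only sketch. It assumes the usual sysfs layout
(/sys/devices/system/node/nodeN/cpulist), and the function names are mine.
It only sets CPU affinity; it does not set a memory policy the way
numactl --membind does, and relies on the kernel's default local allocation
to keep the memory on the same node.

```python
import os


def parse_cpulist(text):
    """Parse a sysfs CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node):
    """Restrict the calling process to the CPUs of one NUMA node."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    # Keep only CPUs this process is already allowed to use.
    cpus &= os.sched_getaffinity(0)
    if not cpus:
        raise RuntimeError(f"no usable CPUs on node {node}")
    os.sched_setaffinity(0, cpus)
    return cpus
```

Calling pin_to_node(0) early in a worker, before it allocates its big
arrays, is roughly the in-code equivalent of starting it under numactl with
a CPU binding.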
>>
>>  > Thanks,
>>  >
>>  > Mark (Kosmowski)
>>  >
>>
>>  Eric Thibodeau
>>
>>
> Again, thank you for this discussion - I'm learning quite a bit!
>

No prob.

Eric

