
The tests were run a second time to see if there was any benefit to using Simultaneous Multithreading (SMT). Once enabled in the BIOS, SMT doubles the number of cores seen by the OS. While many people may recall Hyper-Threading (HT), SMT is supposed to be better and help hide memory latency. That is, while a core is waiting for a memory access, it can in theory be running another "thread." This technique may be very helpful with I/O issues as well. Most HPC applications, however, hit memory and the floating point units hard, and there are only twelve of those floating point units.

SMT was turned on and the 12-way and 16-way effective core tests were re-run. Only these two tests were deemed important because SMT should have little effect when the number of processes is less than the number of real cores. The results are in Table Two below. The 16-way results should be the most telling because the real cores are oversubscribed by 4 processes. With the exception of the ep benchmark, there does not seem to be any advantage to using SMT. Indeed, some benchmarks saw a decrease in effective cores. ep is more processor bound and thus shows a nice performance boost. As expected, there was no improvement when running the 12-copy test with SMT.

Test    12 copies    12 copies (SMT)    16 copies    16 copies (SMT)
Table Two: Effective Cores for a 12-way Intel Xeon (Gulftown) SMP server running the NAS suite with SMT enabled

In general, SMT does not appear to hurt anything as long as you don't oversubscribe the actual number of cores. It may allow daemons and other such background processes to work better on compute nodes, but I don't see it making a huge difference (this assumption should be tested with your code, however).

The above results range from 41% to 98% efficiency, with an average utilization across all tests of 64%. Thus, on average, you can expect to effectively use 7.7 of the 12 cores present in the server (0.64 x 12 = 7.68) for applications similar to the NAS kernels.

A Single Socket Redux?

In contrast to the results above, consider similar tests done on a number of 4-core single socket processors, where the best case performance ranged from 50% to 100% and the average utilization was 74%. On average, one can expect to effectively use 3 out of 4 cores.
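The effective core figures quoted for the two systems are simply the average utilization multiplied by the physical core count. A minimal sketch of that arithmetic follows; the function name `effective_cores` is made up for illustration, while the percentages are the averages quoted in the text.

```python
def effective_cores(utilization: float, physical_cores: int) -> float:
    """Cores' worth of useful work at a given average utilization (0.0 to 1.0)."""
    return utilization * physical_cores


# Averages quoted in the text: 64% on the 12-core server, 74% on 4-core nodes.
fat = effective_cores(0.64, 12)
thin = effective_cores(0.74, 4)

print(f"12-core server: {fat:.1f} effective cores")   # 7.7
print(f"4-core node:    {thin:.1f} effective cores")  # 3.0
```

These are averages over the NAS kernels only, so they are a starting estimate rather than a prediction for any particular application.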

The variation is due to the memory bandwidth of each system. In general, more cores means more sharing of memory and more possible contention. Cache friendly programs usually scale well on multi-core, while those that rely on heavy access to main memory have the most difficulty with large multi-core systems.

As mentioned, a valid argument for high density multi-core nodes is the cost amortization of power supplies, hard drives, interconnects, and case/rack hardware across the large number of cores in a single node. This makes sense, but unless the amortization is based on effective cores, the assumed savings may not accurately reflect the reality of the situation. Using a single socket node also reduces the MPI messaging and I/O load on the interconnect, but does increase the number of switch ports and network cards needed. In some cases, lower cost Gigabit Ethernet may be adequate for single socket nodes, thus offsetting the increase in interconnect costs. Furthermore, it is possible to build nodes that contain multiple single socket motherboards that share power supply and packaging costs, gaining back some of the lost amortization.
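To see why the amortization basis matters, here is a rough sketch that compares cost per raw core with cost per effective core. The node prices are HYPOTHETICAL placeholders invented for this illustration, not measured or quoted figures; only the 64% and 74% average utilizations come from the tests discussed in this article.

```python
def cost_per_core(node_cost: float, cores: int, utilization: float = 1.0) -> float:
    """Node cost divided by the cores that do useful work (raw cores if utilization is 1.0)."""
    return node_cost / (cores * utilization)


# HYPOTHETICAL node prices, chosen only to illustrate the calculation.
FAT_NODE_COST = 4800.0   # assumed price of one 12-core node
THIN_NODE_COST = 1800.0  # assumed price of one 4-core node

fat_raw = cost_per_core(FAT_NODE_COST, 12)
thin_raw = cost_per_core(THIN_NODE_COST, 4)
fat_eff = cost_per_core(FAT_NODE_COST, 12, 0.64)   # 64% average utilization
thin_eff = cost_per_core(THIN_NODE_COST, 4, 0.74)  # 74% average utilization

print(f"per raw core:       fat {fat_raw:.0f}, thin {thin_raw:.0f}")
print(f"per effective core: fat {fat_eff:.0f}, thin {thin_eff:.0f}")
```

With these made-up prices the fat node looks cheaper per raw core, yet the thin node comes out cheaper per effective core. Real prices may of course point the other way, so substitute your own quotes before drawing any conclusion.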

Single socket nodes may also provide a more MPI friendly environment than large SMP nodes. That is, an application that requires 72 cores may run better on 18 four-core single socket thin nodes than on 6 twelve-core dual socket fat nodes. There are fewer data locality issues in the case of the thin nodes. On the other hand, purposely undersubscribing fat nodes may mitigate some of the issues. For example, any one parallel job could be limited to at most 8 cores per 12-core node. The remaining cores could be used by different jobs, which may have different memory usage patterns and allow more effective core usage.
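The node counts for a 72-core job follow from a ceiling division of the job size by the cores each node may contribute. The sketch below is only an illustration of that arithmetic; `nodes_needed` is a hypothetical helper, not part of any scheduler.

```python
import math


def nodes_needed(job_cores: int, usable_cores_per_node: int) -> int:
    """Smallest number of nodes that supplies job_cores at the given per-node limit."""
    return math.ceil(job_cores / usable_cores_per_node)


print(nodes_needed(72, 4))   # 18 four-core thin nodes
print(nodes_needed(72, 12))  # 6 fully subscribed twelve-core fat nodes
print(nodes_needed(72, 8))   # 9 fat nodes when a job is capped at 8 cores per node
```

Capping a job at 8 cores per node spreads it over more fat nodes, which is the price paid for leaving 4 cores per node free for jobs with different memory behavior.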

The range of cores per processor and sockets per motherboard makes designing an HPC cluster an interesting challenge. In the tests described above, thin nodes offer on average 10 percentage points better core utilization than fat nodes (74% versus 64%). In some cases the difference in utilization was far larger. If the number of "effective cores" shrinks as more cores are added to nodes, will thin node designs start to dominate, or will the economics of fat nodes keep them as the best choice for HPC?



This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.