[Beowulf] Performance tuning for Linux 2.6.22 kernels with gigabit ethernet, bonding, etc.
David Kewley
kewley at gps.caltech.edu
Mon Nov 19 22:56:13 EST 2007
On Tuesday 13 November 2007, Bill Johnstone wrote:
> I've been trying to quantify the performance differences between the
> cluster running on the previous switch vs. the new one. I've been
> using the Intel MPI Benchmarks (IMB) as well as IOzone in network mode
> and also IOR. In the previous configuration, the 64-bit nodes had only
> a single connection to the switch, and the MTU was 1500. Under the new
> configuration, all nodes are now running with an MTU of 9000 and the
> 64-bit nodes with the tg3s are set up with the Linux bonding driver to
> form 802.3ad aggregated links using both ports per aggregate link.
> I've not adjusted any sysctls or driver settings. The e1000 driver is
> version 7.3.20-k2-NAPI as shipped with the Linux kernel.
Looking at the master node of a Rocks cluster during mass rebuilds
(which involve HTTP transfers), I can keep the output side of the master
node's GigE link saturated (123 MB/s much of the time) with MTU 1500. I've
never encountered a need to increase the MTU, but I've also never done
significant MPI over Ethernet (only Myrinet and IB).
I don't know whether an MPI load would be helped by MTU 9000, but I
wouldn't assume it would be without actually measuring it.
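If you do want to test it, here's a minimal sketch (eth0 and the node
name are assumptions; every NIC and switch port in the path must also
accept jumbo frames):

    # Raise the MTU on this interface (check it first with "ip link show eth0"):
    ip link set eth0 mtu 9000        # or: ifconfig eth0 mtu 9000

    # Verify jumbo frames actually pass end to end: send a ping payload
    # too big for a 1500-byte frame, with fragmentation forbidden (-M do).
    # 8972 = 9000 - 20 (IP header) - 8 (ICMP header).
    ping -M do -s 8972 node02

Then rerun IMB at each MTU and compare; that's the only way to know
whether your MPI traffic benefits.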
<snip>
> In trying to understand this, I noticed that ifconfig listed something
> like 2000 - 2500 dropped packets for the bonded interfaces on each
> node. This was following a pass of IMB-MPI1 and IMB-EXT. The dropped
> packet counts seem split roughly equally across the two bonded slave
> interfaces. Am I correct in taking this to mean the incoming load on
> the bonded interface was simply too high for the node to service all
> the packets? I can also note that I tried both "layer2" and "layer3+4"
> for the "xmit_hash_policy" bonding parameter, without any significant
> difference. The switch itself uses only a layer2-based hash.
I don't know exactly what causes ifconfig's dropped-packet counter to
increment.
I have seen syslogd on a central syslog server, receiving messages over
UDP, get saturated and drop packets. What I really mean by that is:
syslogd's socket receive buffer routinely filled up whenever there was a
deluge of messages from the compute nodes. When there is not enough room
in an application's socket receive buffer for a new packet, the kernel
drops that packet, so some messages never made it into syslogd, and
therefore never made it into the logfile on disk. I don't know whether
this form of packet drop increments ifconfig's dropped-packet counter.
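One way to watch that kind of drop happening is to check the socket's
receive queue during a message deluge (a sketch; 514 is syslogd's
standard UDP port):

    # Recv-Q is the number of bytes sitting unread in the socket's receive
    # buffer; if it keeps bumping against the ceiling, new packets get dropped.
    netstat -uln | grep :514

    # The ceiling a default-sized socket was created with:
    cat /proc/sys/net/core/rmem_default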
When I looked into this specific problem a bit more, I discovered that
syslogd uses the default socket buffer sizes, so the only way to change
that (short of making a one-line edit to syslogd's source and rebuilding,
or using an alternative to ye olde syslogd) was to tune the kernel's
default socket receive buffer size in /etc/sysctl.conf:

    net.core.rmem_default = 8388608
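For what it's worth, applying and persisting that kind of change looks
like this (8 MB was right for my syslog load, not necessarily for yours):

    # Take effect immediately:
    sysctl -w net.core.rmem_default=8388608

    # Persist across reboots, then reload. Note this raises the default
    # receive buffer for every socket on the box, not just syslogd's.
    echo 'net.core.rmem_default = 8388608' >> /etc/sysctl.conf
    sysctl -p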
This does not directly bear on your problem, but it might give you something
to think about.
> 1. What are general network/TCP tuning parameters, e.g. buffer sizes,
> etc. that I should change or experiment with? For older kernels, and
> especially with the 2.4 series, changing the socket buffer size was
> recommended. However, various pieces of documentation such as
> http://www.netapp.com/library/tr/3183.pdf indicate that the newer 2.6
> series kernels "auto-tune" these buffers. Is there still any benefit
> to manually adjusting them?
Standard ones to play with (from /proc/sys/net/core):

    rmem_default
    wmem_default
    rmem_max
    wmem_max

and (from /proc/sys/net/ipv4):

    tcp_rmem
    tcp_wmem

I'm guessing you already knew about all those. :)
UDP uses the buffer sizes from core; TCP uses the ones in ipv4.
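For reference, a quick way to look at what you currently have (note that
tcp_rmem and tcp_wmem each hold three numbers: min, default, and max, in
bytes):

    # Per-socket defaults and ceilings used by UDP and other non-TCP sockets:
    sysctl net.core.rmem_default net.core.wmem_default
    sysctl net.core.rmem_max net.core.wmem_max

    # TCP's own min/default/max triplets:
    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem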
I looked at your NetApp URL and couldn't confidently identify where it
discusses "auto-tuning". Perhaps it's talking about NFS (server or
client?) auto-tuning? Or perhaps the kernel doesn't auto-tune generally,
for any old application, but only for NFS?
There was most definitely a need to manually tune in my syslog example
above, using RHEL4 2.6.9-* kernels.
> 2. For the e1000, using the Linux kernel version of the driver, what
> are the relevant tuning parameters, and what have been your experiences
> in trying various values? There are knobs for the interrupt throttling
> rate, etc. but I'm not sure where to start.
Gosh, I went through this once, but I don't have those results readily
available to me now. I'm assuming you've already found a guide that goes
into great detail about these tuning parameters, I think from Intel?
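From what I remember, the main knob is the InterruptThrottleRate module
parameter. A sketch of where I'd start (the values are illustrative;
check modinfo for what your driver version actually exposes):

    # List the tunables this e1000 build understands:
    modinfo e1000

    # Example: cap each port at ~8000 interrupts/sec by adding a line like
    # this to /etc/modprobe.conf, then reloading the module:
    #   options e1000 InterruptThrottleRate=8000,8000

    # Interrupt coalescing settings can also be inspected at runtime:
    ethtool -c eth0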
> 3. For the tg3, again, what are the relevant tuning parameters, and
> what have been your experiences in trying various values? I've found
> it more difficult to find discussions for the "tunables" for tg3 as
> compared to e1000.
>
> 4. What has been people's recent experience using the Linux kernel
> bonding driver to do 802.3ad link aggregation? What kind of throughput
> scaling have you folks seen, and what about processor load?
Can't help you on either of these.
> 5. What suggestions are there regarding trying to reduce the number of
> dropped packets?
Find the parameter, either in the kernel or in your application, that
controls the socket receive buffer size for your application, and try
increasing it.
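A crude way to check whether that's helping (bond0 and the buffer size
are assumptions): snapshot the dropped counters around a benchmark run
and see whether the delta shrinks after you raise the buffers:

    # Before the IMB run:
    ifconfig bond0 | grep dropped

    # Raise the receive-buffer ceiling and default, rerun the benchmark:
    sysctl -w net.core.rmem_max=8388608
    sysctl -w net.core.rmem_default=8388608

    # After the run; compare against the first snapshot:
    ifconfig bond0 | grep dropped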
David
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf