Tweak your Interconnect
The interconnect is one of the biggest areas for performance improvement.
It can be as simple as upgrading a Fast Ethernet network to a Gigabit
Ethernet network, or as subtle as moving the NIC (Network Interface Card)
to a different slot. Tweaking the settings of the interconnect you already
have is a simple way to improve your performance. Better yet, it doesn't
cost you anything!
This aspect is one of the beauties of clusters - the software is very
accessible so adjusting parameters from their default settings is allowed.
We did an initial test of HPL using the Fast Ethernet network on the
cluster (recall that the cluster has a Fast Ethernet network for cluster
administration traffic and a Gigabit Ethernet network for computational
traffic). We used the optimal HPL problem parameters we found in the
previous section but ran over the Fast Ethernet network. In Test 8 in
Table One, we found that we could achieve 7.45 GFLOPS over the Fast
Ethernet network (only about 27% of theoretical peak performance).
Comparing the performance of Fast Ethernet (Test 8) to the performance
of Gigabit Ethernet (Test 4), we get about a 71% increase in performance
by switching to a higher performance network - in this case Gigabit
Ethernet. However, being the cluster monkeys that we are, we thought we
could still improve the performance of our Gigabit Ethernet network.
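The percentages above are easy to check. The short sketch below redoes the arithmetic; it assumes that Test 4 is the 12.74 GFLOPS Gigabit Ethernet result quoted later in this section and uses the 28.10 GFLOPS theoretical peak given near the end of the article. The function names are ours, chosen for illustration.

```python
def percent_increase(old, new):
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def percent_of_peak(measured, peak):
    """Measured performance as a percentage of theoretical peak."""
    return measured / peak * 100.0


FAST_ETHERNET_GFLOPS = 7.45   # Test 8
GIGABIT_GFLOPS = 12.74        # Test 4 (assumed from the figures in the text)
PEAK_GFLOPS = 28.10           # theoretical peak quoted later in the article

gain = percent_increase(FAST_ETHERNET_GFLOPS, GIGABIT_GFLOPS)
share = percent_of_peak(FAST_ETHERNET_GFLOPS, PEAK_GFLOPS)
print(f"Gigabit over Fast Ethernet: {gain:.0f}% faster")
print(f"Fast Ethernet result: {share:.0f}% of peak")
```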
We tested the Gigabit Ethernet network of Kronos using Netpipe (see
Resources Sidebar) with the default network settings. Netpipe is a
great tool for measuring the latency and peak bandwidth of a network
connection between two nodes. We found that the default latency of
the Gigabit Ethernet network (between two nodes and through the switch)
was 117 microseconds. In our experience, we thought this was a bit high
(the Fast Ethernet network was giving us latencies in the 42 microsecond
range), so we started to investigate the Intel Pro/1000 NIC driver (e1000).
As with most Gigabit Ethernet drivers these days, Intel provides
documentation for optimizing small packet transfers. The Intel e1000
driver incorporates interrupt rate throttling (IRT), which is also called
interrupt coalescence or interrupt moderation. Normally, when an Ethernet
frame is received, the kernel is interrupted so that it can process the
frame. At high data rates, these interrupts can cause performance problems,
including a CPU that is constantly interrupted to process frames. The
interrupt throttling concept holds multiple Ethernet frames for processing
so that only a single system interrupt is used to process multiple frames.
This reduces the load on the CPU but also has the potential to increase
latency because the frames are held for some period before they are
processed.
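The trade-off can be seen in a toy model. The sketch below is not the e1000 algorithm; it simply assumes the NIC raises at most one interrupt every `min_gap_us` microseconds and that a frame waits in the NIC until the next interrupt fires. The function name and the traffic pattern are made up for illustration.

```python
def coalesce(arrivals_us, min_gap_us):
    """Toy interrupt-throttling model.

    arrivals_us: increasing frame arrival times in microseconds.
    min_gap_us:  minimum spacing between interrupts (0 = throttling off).
    Returns (number_of_interrupts, mean_wait_us_per_frame).
    """
    interrupts = 0
    total_wait = 0.0
    irq_time = float("-inf")  # time of the most recently scheduled interrupt
    for t in arrivals_us:
        if t > irq_time:
            # No pending interrupt covers this frame, so schedule a new one,
            # no sooner than min_gap_us after the previous interrupt.
            irq_time = max(t, irq_time + min_gap_us)
            interrupts += 1
        # The frame is handed to the kernel when that interrupt fires.
        total_wait += irq_time - t
    return interrupts, total_wait / len(arrivals_us)


frames = list(range(0, 1000, 10))  # 100 frames, one every 10 microseconds
print(coalesce(frames, 0))    # throttling off: one interrupt per frame
print(coalesce(frames, 100))  # throttled: far fewer interrupts, added wait
```

In this model, throttling cuts the interrupt count by roughly a factor of ten, but every frame waits tens of microseconds on average. That is the wrong trade for the small, latency-sensitive messages a cluster code sends.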
From the Intel manual, we found the parameters to turn off were Interrupt
Throttling Rate (InterruptThrottleRate=0) and Receive Interrupt Delay
(RxIntDelay=0). We re-ran Netpipe and found that the latency had dropped
to 29 microseconds (not bad for Gigabit Ethernet). In Table One, we
abbreviate InterruptThrottleRate as ITR and RxIntDelay as RID.
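When the e1000 driver is loaded as a kernel module, these parameters are typically passed as module options. The helper below is only a sketch that builds such an options line. The function name is ours, and where the line belongs (which modprobe configuration file) depends on your distribution and is not covered here. Check the driver documentation for your driver version before relying on it.

```python
def e1000_options_line(interrupt_throttle_rate=0, rx_int_delay=0):
    """Build a modprobe 'options' line for the e1000 driver.

    The defaults of 0 turn off interrupt throttling (ITR) and the
    receive interrupt delay (RID), as described in the text.
    """
    return (
        "options e1000 "
        f"InterruptThrottleRate={interrupt_throttle_rate} "
        f"RxIntDelay={rx_int_delay}"
    )


print(e1000_options_line())
```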
In our giddy cluster excitement, we re-ran HPL with interrupt throttling
off and found that our performance had increased from 12.74 GFLOPS to
13.26 GFLOPS (Test 5 in Table One). So we increased performance by 4%
with some simple changes to the NIC driver. What other changes could
we make?
In designing the cluster, we specifically chose an SMC 8508T 8-port
Gigabit Ethernet switch because it allows the use of jumbo packets (data
packets that are larger than the 1500 byte Ethernet standard) that could
be a potential boost to performance. We experimented with various packet
sizes by adjusting the MTU (Maximum Transmission Unit) for the Gigabit
Ethernet NICs. We tried values up to 9000 bytes and found that an MTU of
6000 gave us the best HPL performance. This result, shown as Test 7, was
14.53 GFLOPS. We also twiddled with Ns and Nb a bit to get these numbers,
but in all tests we found the larger MTU improved performance.
Interestingly, this result would have moved us from 24th to 8th place in
the Top500 list in June 1993 and kept us on the list at 431 in June of
1998. This result is also 52% of the theoretical peak performance, which
is quite good for Gigabit Ethernet.
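One reason a larger MTU can help is that each frame carries a fixed amount of overhead, so bigger frames mean fewer frames (and fewer per-frame costs) for the same data. The sketch below estimates this under stated assumptions: TCP over IPv4 with no header options (40 bytes) and standard Ethernet framing overhead (38 bytes for preamble, header, checksum and inter-frame gap). It does not model CPU or switch behavior, so it cannot explain why 6000 beat 9000 on our hardware.

```python
import math

ETH_OVERHEAD = 38    # preamble/SFD 8 + header 14 + FCS 4 + inter-frame gap 12
IP_TCP_HEADERS = 40  # IPv4 20 + TCP 20, assuming no options


def frames_needed(message_bytes, mtu):
    """Number of frames needed to carry a message at a given MTU."""
    payload = mtu - IP_TCP_HEADERS
    return math.ceil(message_bytes / payload)


def wire_efficiency(mtu):
    """Fraction of bytes on the wire that are application payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


for mtu in (1500, 6000, 9000):
    print(mtu, frames_needed(1_000_000, mtu), f"{wire_efficiency(mtu):.3f}")
```

For a 1 MB message, the frame count drops from 685 at an MTU of 1500 to 168 at an MTU of 6000, while the payload efficiency rises only from about 95% to about 99%. The bigger win is likely the reduced per-frame work rather than the saved header bytes.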
Libraries for Acceleration
When optimizing, one easy method to improve performance, particularly
for HPL, is to try to find carefully tuned libraries for mathematics
functions. HPL depends upon calls to the BLAS (Basic Linear Algebra
Subprograms) library. Improving the performance of the BLAS library
should improve the performance of HPL. This technique works for other
codes as well. Examining the performance of supporting libraries can
help lead to improving the performance of the overall code.
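A simple way to examine a supporting library is to time one of its kernels and convert the result to a FLOPS rate. The sketch below does this for a matrix multiply, counting roughly 2n^3 floating point operations. The naive pure-Python `matmul` is only a stand-in for a real BLAS routine so that the example runs anywhere; the function names are ours, and its absolute numbers say nothing about ATLAS or any other library.

```python
import time


def matmul(a, b):
    """Naive product of two matrices stored as lists of rows."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns]
            for row in a]


def measure_gflops(kernel, n):
    """Time kernel(a, b) on two n x n matrices and return GFLOPS."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    kernel(a, b)
    elapsed = time.perf_counter() - start
    flops = 2.0 * n ** 3  # about n^3 multiplies plus n^3 adds
    return flops / elapsed / 1e9


print(f"{measure_gflops(matmul, 100):.4f} GFLOPS")
```

Passing a different `kernel` with the same call signature gives a like-for-like comparison between two implementations on the same machine.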
Up to this point we had been using ATLAS as our BLAS Library. We had
built version 3.7.8 of ATLAS for the system using gcc/g77 version
3.3.3. ATLAS is a unique library in that it creates code based on
your specific processor architecture (see Sidebar Three on ATLAS).
| Sidebar Three: ATLAS |
|
The Basic Linear Algebra Subprograms (BLAS) package is a standard, or
specification, of the semantics and syntax for computing basic
vector and matrix operations. There is a reference version of the
BLAS libraries, written in Fortran, but it only serves as a
reference and its performance is fairly low. Many vendors have
written their own tuned BLAS libraries for their architecture that
have much better performance than the reference implementation.
ATLAS - Automatically Tuned Linear Algebra Software - was
born at Jack Dongarra's Innovative Computing Laboratory at the
University of Tennessee. It is a software package that creates
a tuned BLAS library for the hardware on which it is to be run. It
is basically a code generator that tests a wide range of options,
such as blocking and unrolling factors, in generated code to achieve
the best performance for the vast majority of the BLAS functions. It
also does this for some of the LAPACK (Linear Algebra PACKage) functions.
Many codes aim for good performance across a wide range of platforms.
Consequently, the performance on any one platform is almost guaranteed
not to be optimal, or even close to it. What makes
ATLAS unique is that it adapts the resulting BLAS library code to the
host architecture. Consequently, performance should be better than a
generic BLAS library.
You can get ATLAS at SourceForge. All you need is a
good C compiler (the website mentions the versions of gcc that produce
acceptable results).
|
There are other fast BLAS libraries available such as ACML (AMD Core
Math Library) and the GOTO Library (see Resources Sidebar). Due
to limited time and limited space in this article, we chose to test
only the ACML library.
We downloaded the latest version from AMD's website (ACML is free,
by the way) and rebuilt HPL using ACML instead of ATLAS. We ran HPL
with interrupt throttling off and with our old MTU setting (MTU=1500)
to compare to the performance of ATLAS. The result, Test 6, was
only 11.45 GFLOPS, which was down from 13.26 GFLOPS with the ATLAS
library. We haven't had time to fully investigate why ACML was so
much slower than ATLAS. However, it does illustrate that switching
supporting libraries can have a large impact on performance. And, it
pays to test all your assumptions. If this is starting to sound like
a theme, it is.
Don't Panic
There are many other options we did not cover in this article, which we
hope to cover in a future article. For example, we could try different
MPI implementations, different compilers, and different BLAS libraries. We
could even try different networking software instead of TCP/IP.
One software option that looks very appealing for our situation is
called GAMMA (see Resources Sidebar). GAMMA stands for Genoa Active
Message MAchine and is a low latency, high throughput messaging library that
runs over Gigabit Ethernet networks. It promises lower latencies than
normal TCP traffic - approximately 10.6 microseconds for Intel Gigabit
Ethernet hardware.
We Did Break a Record
Since our little eight node cluster will never make the Top500 list, we
thought it might be interesting to see how we stand in the dollars
per GFLOPS column. Previous records in this area have been held by
systems from the Aggregate.org site, where the reigning champ was KASY0,
which provided an HPL double precision cost of $211 per GFLOPS. KASY0
also set an astounding single precision record of $84 per GFLOPS. The
Top500 results are based on the double precision HPL.
We are pleased to announce that we have broken the double precision
HPL record held by KASY0. Our current record is $171 per GFLOPS. If
you use today's prices for the hardware, we have broken the $150 per
GFLOPS barrier with an even lower value of $142 per GFLOPS.
Table Two summarizes our results.
Table Two - Optimization Results
| Metric | Value |
| Total GFLOPS | 14.53 |
| Percent of Peak | 52% |
| Cost at Construction | $2,490 |
| Dollars per GFLOPS | $171 |
| Cost Today | $2,063 |
| Dollars per GFLOPS (Today) | $142 |
| Power Usage at Load (Watts) | 900 |
| MFLOPS per Watt | 16.14 |
Tim Mattox (father of KASY0) helped us with the theoretical peak
number for Kronos. According to Tim, the theoretical peak is
computed by simply adding up the clock rates of all the processors
(1.756 GHz each) and then multiplying by 2 for double precision, because
the Athlon/Sempron has two independent floating point units: one can
perform an FADD per clock and the other can perform an FMUL per
clock. The peak performance was 28.10 GFLOPS in our case.
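Tim's rule is a one-line calculation. The sketch below assumes eight single-processor nodes, each running at 1.756 GHz, which is the reading that reproduces the 28.10 GFLOPS figure. The function name is ours.

```python
def theoretical_peak_gflops(num_cpus, clock_ghz, flops_per_clock=2):
    """Peak GFLOPS: total clock rate times floating point ops per clock.

    flops_per_clock=2 reflects one FADD plus one FMUL per clock.
    """
    return num_cpus * clock_ghz * flops_per_clock


peak = theoretical_peak_gflops(num_cpus=8, clock_ghz=1.756)
print(f"Theoretical peak: {peak:.2f} GFLOPS")
print(f"Best HPL run: {14.53 / peak * 100:.0f}% of peak")
```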
We should also note that comparing the cost at construction with the
cost today could be considered less than fair, as all computer hardware
costs less in the future. If you were to build the cluster today, the
price would be $427 lower, thus decreasing the dollar cost per
GFLOPS. Now if we really wanted to get lower, we would remove one
of the 160 MB hard drives (there are two in a RAID1 configuration)
and the video card (use on-board video), reducing the price to
$1,961 and the dollars per GFLOPS to $135! But, we believe the
point is made without stripping down Kronos any further.
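The derived rows of Table Two, and the stripped-down figure above, follow directly from the cost, the HPL result, and the power draw. The sketch below redoes that arithmetic; the function names are ours.

```python
def dollars_per_gflops(cost_dollars, gflops):
    """Hardware cost divided by sustained HPL performance."""
    return cost_dollars / gflops


def mflops_per_watt(gflops, watts):
    """Sustained MFLOPS delivered per watt drawn under load."""
    return gflops * 1000.0 / watts


HPL_GFLOPS = 14.53
COSTS = {
    "at construction": 2490,
    "today": 2063,
    "stripped down": 1961,
}

for label, cost in COSTS.items():
    print(f"{label}: ${dollars_per_gflops(cost, HPL_GFLOPS):.0f} per GFLOPS")
print(f"{mflops_per_watt(HPL_GFLOPS, 900):.2f} MFLOPS per Watt")
```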
There was also no accounting for construction cost. This number
is hard to capture. If more people get interested in these types
of metrics, we can assume two categories: DIY (Do It Yourself) and
Turn-key (no construction; the system is delivered ready to run).
For now we will consider Kronos in the DIY category.
We were not as happy with the MFLOPS per Watt, but still found it
tolerable as Kronos draws less power than a typical hair dryer.
Good Bye and Thanks for all the FLOPS
We would like to thank AMD for supporting this project. The results
indicate that the idea of a low cost "value" cluster is not so crazy
after all. Perhaps the biggest lesson in this process is to test the assumptions.
The naive user might assume that default Ethernet settings, vendor
math libraries, and recommended program parameters will be optimal.
In our case, we found that we could increase the performance from
a default base line of 10.6 to 14.53 GFLOPS. To add some perspective,
the performance increase is the equivalent of adding three more
compute nodes to our cluster. Your codes may be able to achieve
similar enhancements.
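The "three more nodes" comparison can be checked with a rough estimate. The sketch below assumes each of the eight nodes contributes an equal share of the baseline result and that adding nodes would scale performance linearly, which is optimistic for a real cluster. The function name is ours.

```python
def equivalent_extra_nodes(baseline_gflops, tuned_gflops, nodes):
    """Nodes you would have to add, at baseline per-node speed,
    to match the gain from tuning (assumes linear scaling)."""
    per_node = baseline_gflops / nodes
    return (tuned_gflops - baseline_gflops) / per_node


extra = equivalent_extra_nodes(10.6, 14.53, nodes=8)
gain = (14.53 - 10.6) / 10.6 * 100
print(f"Tuning gain: {gain:.0f}%, worth about {extra:.1f} extra nodes")
```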
We believe that there are more performance improvements to be had in
our cluster. We are not quite ready to set it loose on finding the
answer to the eternal question of "Life, the Universe, and Everything",
but we know we are getting closer. We did find it strangely improbable,
however, that the last 28 machines on the first Top500 list (June 1993)
had a rating of 0.42 GFLOPS. Moreover, progress towards understanding the
question continues.
This article was originally published in Linux Magazine. It has been
updated and formatted for the web. If you want to read more about HPC
clusters and Linux you may wish to visit
Linux Magazine.
Douglas Eadline and Jeffrey Layton can be found swinging from the trees at clustermonkey.net.