|
Can Doug break the record he and Jeff set previously? Can he overcome
the chains of reality? Inquiring minds want to know.
Cluster optimization is often considered an art form. In some cases it
is less art and more like flipping a coin. A previous article described
the Kronos value cluster and its record-breaking price-to-performance
results. Can an investment of time and effort break this record or have
we hit the cluster wall?
Introduction
If you followed along with our previous
value cluster articles, you will remember that
Jeff Layton and I built and optimized a cluster for under $2500. Our
results for the cluster set a record of sorts for achieving 14.53 GFLOPS
(52% of peak) at a cost of $171/GFLOP (double precision). We used the
famous Top500 benchmark called HPL (High Performance Linpack). Not
one to rest on my laurels, I decided to push on and see if I could
improve on our HPL performance.
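The headline numbers are tied together by simple arithmetic. The short Python sketch below recomputes them. The exact system cost and peak rate are not stated here, so both are back-calculated from the quoted figures and should be read as approximations, not measured values.

```python
# Figures quoted in the text.
measured_gflops = 14.53      # HPL result
fraction_of_peak = 0.52      # "52% of peak"
dollars_per_gflop = 171.0    # quoted price-to-performance

# Back-calculated values (assumptions derived from the figures above).
implied_peak_gflops = measured_gflops / fraction_of_peak
implied_cost_dollars = dollars_per_gflop * measured_gflops

print(f"implied peak: {implied_peak_gflops:.1f} GFLOPS")
print(f"implied cost: ${implied_cost_dollars:.0f}")
```

The implied cost lands just under the $2500 budget, which is consistent with the "under $2500" claim.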
The next step in improving performance is to figure out what to optimize.
There are plenty of things to change, so where to start is a good question.
When fixing cars, my father always had a rule: start with the
easiest/cheapest part, then move to the next easiest/cheapest part
until you have (hopefully) fixed the problem. It is easier and cheaper to
replace a gas filter than it is to replace a carburetor. Our efforts
thus far have taken a similar approach: we tweaked the program
parameters, chose the best BLAS (Basic Linear Algebra Subprograms)
library, and tuned the Ethernet driver. In this article, I would like
to find the "good enough wall" for the cluster, which is basically the
point of diminishing returns.
Also remember that the source code cannot be changed, so we are limited
to program parameters, the cluster plumbing (interconnects, drivers),
middleware (MPI), and compiler tools. Those who have experience in
this area will obviously raise their hands and say, "What about other
compilers and MPI versions?" My response is, "Well of course, it seems
easy enough -- which is a sure indication that it probably will not be."
If we are to try other compilers and MPI versions, we need to keep in
mind that there is quite a list of alternatives (see Sidebar Two).
In order to get done testing within this decade, we will limit our
choices to some of the more popular packages. The choices should not
be taken as a slight toward other alternatives. There are plenty of both
open and commercial tools available and your choice should be based
on your needs.
| Sidebar One - The Current Kronos Record |
|
As reported in the November 2005 issue, the current record for our value
cluster (called Kronos) is 14.53 GFLOPS. This result was achieved
using a cluster composed of eight AMD Sempron 2500 processors with
256 MBytes of PC2700 RAM (512 MBytes on the head node), and a Gigabit
Ethernet link. We tuned the program parameters, tried different BLAS
libraries, and tuned the Gigabit Ethernet links (6000 Byte MTU, turned
off interrupt mitigation strategies). We are also using the Warewulf
cluster toolkit (see Resources). For this article we will use the
same hardware and only change software.
|
Another important note is worth mentioning. As I try and push the GFLOPS
higher some things may work for this application and some things may
not. It would be a mistake to assume this is the case for all applications
- particularly yours. Please keep this proviso in mind when interpreting
the results.
We are going to look at tuning the TCP values, changing the MPI library,
changing the compiler, and finally using a kernel by-pass MPI library.
As we cannot try every possible combination (see Sidebar Two), at
least we will go through the exercise of changing these parameters and
at minimum learn a few things about how to turn your hair gray (in my
case more gray).
A Nudge, Not a Bump
The easiest thing to try is to tune the TCP parameters. I am borrowing
from a Berkeley Lab TCP Tuning Guide I found online (see
Resources Sidebar). Listing One shows the settings that were added
to the /etc/sysctl.conf file. You can effect the changes by simply
running sysctl -p. For those interested, all TCP implementations
use buffers for the transfers between nodes. In newer Linux kernels,
the kernel will auto-tune the buffer size based on the communication
pattern it encounters. Our changes increase the maximum
tunable size of the buffers.
Re-running our best case from before with the new settings shows the
smallest of increases, to 14.57 GFLOPS (multiple runs confirmed that this
increase is statistically significant). The change was easy: not much
improvement, but no huge amount of time lost.
Listing One - New TCP Parameters
# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
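To make the sizes in Listing One easier to read, here is a minimal Python sketch that parses the same lines and reports each maximum in MiB. The helper name parse_sysctl is hypothetical and written only for this illustration. It handles just the simple "key = value" form used above, not the full sysctl.conf syntax.

```python
LISTING_ONE = """\
# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
"""

def parse_sysctl(text):
    """Parse simple 'key = value ...' lines, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = [int(v) for v in value.split()]
    return settings

settings = parse_sysctl(LISTING_ONE)
for key, values in settings.items():
    # The last number on each line is the maximum buffer size in bytes.
    print(f"{key}: max = {values[-1] / 2**20:.0f} MiB")
```

Every maximum works out to 16 MiB. The three-value lines keep the usual min, default, and max ordering, so only the upper limit of the auto-tuning range is raised.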
One thing to remember is that there is no best compiler/MPI combination
for every program! Of course you are hoping for the best, but you
may never know if you are getting the best performance -- until some
uber geek on a mailing list lets you in on the secret combination that
got his application 10% more performance than yours.
Finally, there are those pesky optimization flags that vary from
compiler to compiler and processor to processor. Indeed, the trick
is to know when "good enough" is really good enough. If you know your
application, in most cases, you will have a "feel" for this point in
the optimization process.
Pick an MPI
Up until this point we have been using LAM/MPI. This MPI is the default
MPI used in the Warewulf distribution. The version we used was 7.0.6.
The most logical MPI to use next is MPICH from Argonne National Lab.
The latest version of MPICH is MPICH2. With much anticipation, I
compiled MPICH2 and set about running the HPL code.
| Sidebar Two - Too Much of a Good Thing |
|
If you are looking for the best performance then you know that compilers
and libraries are a great way to "easily" make changes. Unfortunately,
"the devil," they say, "is in the linker." Furthermore, the array of
choices is rather daunting. Let's first consider compilers. A short
list of Linux x86 compilers includes GNU, Portland Group, Intel,
Pathscale, Lahey, Absoft, and NAG. Similarly, a short list of MPI
implementations would include:
LAM/MPI,
MPICH1,
MPICH2,
Intel-MPI,
MPI/Pro,
WMPI,
Open-MPI,
and Scali MPI Connect.
Not to mention all the variations for different interconnects within
each MPI. So, let's do the math for the short list. Seven compilers
times eight MPIs, that is 56 possible combinations. Considering that
building and running your application with each combination may range
from easy to aggravating and may take a non-trivial amount of time,
it is no wonder that perfect is often considered the worst enemy of
good enough.
|
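The size of the search space in Sidebar Two can be checked by enumerating it. The Python sketch below uses the names exactly as they appear in the two short lists, which give seven compilers and eight MPI implementations. It only counts pairings and says nothing about which ones would actually build or run.

```python
from itertools import product

# Names copied from the two short lists in Sidebar Two.
compilers = ["GNU", "Portland Group", "Intel", "Pathscale",
             "Lahey", "Absoft", "NAG"]
mpis = ["LAM/MPI", "MPICH1", "MPICH2", "Intel-MPI",
        "MPI/Pro", "WMPI", "Open-MPI", "Scali MPI Connect"]

# Every compiler paired with every MPI implementation.
combinations = list(product(compilers, mpis))

print(f"{len(compilers)} compilers x {len(mpis)} MPIs "
      f"= {len(combinations)} combinations")
```

Each pairing means rebuilding the MPI library, the BLAS library, and the application, and interconnect variants would multiply the count further.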
The first problem I encountered was the need for some shared libraries
to be on the nodes. Recall that a Warewulf cluster uses a minimal
RAM disk on each node. I added the libraries to the cluster VNFS
(Virtual Node File System), built a new node RAM disk image and
rebooted the nodes in all of 5 minutes. Continuing, I also recalled
that MPICH2, similar to LAM, now uses daemons (instead of rsh or ssh)
to start remote jobs. To my surprise, I found that the daemons needed
a version of Python on each node to run. While Python and other
such languages are great tools, I prefer the "less is better" approach
with clusters, which is pretty much the Warewulf approach as well.
Requiring Python on each node seems to me to be a move in the wrong
direction. In any case, since getting the daemons working under
Warewulf will take some time and testing, I decided to take a step
back and use the old reliable MPICH1.
After a quick configure; make install, I had MPICH1 running on
the cluster. The test programs worked, so it was time to compile HPL
and see if we could increase our score. Of course, some fiddling with
environment variables and the HPL Makefile was needed to ensure the
right libraries were used. Running the benchmark resulted in
13.9 GFLOPS. This result was good, but of course not our best. For
the MPI jihadists out there, this result does not necessarily mean
LAM is always better than MPICH. For this code it might be, but I
have seen other codes where MPICH beats LAM as well.
After looking at the MPICH1 results, it seemed that trying OpenMPI
might be worthwhile. OpenMPI is a new, highly modular MPI that is
being written by the LAM/MPI, FT-MPI, LA-MPI, and PACX-MPI
teams. The final release is imminent, so it seemed like it might be
helpful to have another data point. After downloading and building,
the test programs worked, so running HPL was next. The program
started, but basically stalled out. After talking with one of the
authors, I learned that these are known issues. Sometimes I think my
goal in life is to validate known issues. Time to move on. The
compiler is next.
Pick A Compiler and an MPI
The compiler is one of those really cool tools that at times can
give your application a really nice kick in the asymptote. Again,
it all depends on your application, but some of the commercial
compilers have a few more hardware specific options than the GNU
compilers, so they are worth a try. Presently, most of the
commercial compilers are focused on x86-64 level processors and
have no great interest in optimizing for Sempron processors. For
this project, The Portland Group (PG) compiler was chosen because
it has been reliable and robust in the past. The 15-day free trial
helped as well. In any case, it is just a recompile, right?
There are three basic components in the HPL program: the program
itself, the BLAS library, and the MPI library. The cleanest way
to build an application with a new compiler is to build all the
supporting libraries as well. Otherwise, you may end up scratching
your head as a multitude of linking problems pass in front of your
eyes. Building MPIs with alternate compilers has been well
documented, so the task now looked to be as follows:
- Build a new version of MPI with PG
- Build a new version of Atlas with PG
- Build a new version of HPL with PG, linking the components from the first two steps
make. No joy.
There is an error message from the Atlas make procedure about an
undefined case. Fair enough. Time to check the HPL documentation. It
seems they don't recommend the PG compiler to build Atlas. That would
be two known issues I have successfully validated thus far.
Moving on, the PG compiler is very good at linking in GNU compiled
code, so I'll just use the GNU Atlas libraries and build HPL with
PG. After some makefile magic I have a new HPL binary compiled
with some cool optimization flags (-fastsse -pc=64 -O2 -tp athlonxp).
The code is run and the GFLOPS hit 13.92. Some further fiddling
with compiler options does not really change things. At this
point, the MPICH1-PG version is slightly better than the MPICH1-GNU
version, but worse than the LAM/MPI-GNU version. Undaunted, the
next thing to try is a LAM/MPI-PG combination. After some more
makefile magic, the code is running and, lo and behold, a new
record of 14.90 GFLOPS, but no fanfare. The amount of time spent
with the MPI/compiler rebuilds was easily two days. The amount of
improvement is 0.33 GFLOPS. A new tack is needed.
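To put the results so far side by side, the Python sketch below tabulates the four MPI/compiler combinations reported above and compares each to the LAM/MPI-GNU run after TCP tuning (14.57 GFLOPS). Treating that run as the baseline is a choice made for this illustration. The numbers are the ones quoted in the text.

```python
# GFLOPS reported in the text for each (MPI, compiler) combination.
results = {
    ("LAM/MPI", "GNU"): 14.57,   # previous best, after TCP tuning
    ("MPICH1", "GNU"): 13.90,
    ("MPICH1", "PG"): 13.92,
    ("LAM/MPI", "PG"): 14.90,
}

baseline = results[("LAM/MPI", "GNU")]

# Print the combinations from fastest to slowest with the change vs. baseline.
for (mpi, compiler), gflops in sorted(results.items(),
                                      key=lambda kv: kv[1], reverse=True):
    delta = gflops - baseline
    percent = 100.0 * delta / baseline
    print(f"{mpi:8s} {compiler:4s} {gflops:6.2f} GFLOPS "
          f"{delta:+.2f} ({percent:+.1f}%)")

best = max(results, key=results.get)
gain = results[best] - baseline
```

The best combination gains 0.33 GFLOPS, a little over 2% for roughly two days of rebuilding.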
|