Harnessing Those GPUs
Recently a large number of companies have been
developing and experimenting with what I like to call
"Application Accelerators" or "co-processors."
These are hardware devices that can be used to greatly improve the
performance of certain computations. For example, FPGAs (Field
Programmable Gate Arrays), Clearspeed's CSX chips, IBM's Cell processor,
GPUs (Graphics Processing Units), and other chips with lots of cores
are being developed to improve
the performance of various computations such as sorting, FFTs,
and dense matrix computations. Co-processors are being developed
because it looks like CPU manufacturers have
hit the wall on clock speed and are having to resort to multi-core
processors to improve overall performance. This is somewhat similar
to the good old days of the
Transputers
that were developed as a low-power addition to CPUs to help with
parallel processing because it looked like CPU vendors had hit a wall
in performance. This sounds EXACTLY like what is happening
today.
There are a number of difficulties with the approaches being taken
today. The co-processors are fairly expensive compared to the main
node. Many of these cards are going to
cost at least the price of a node and perhaps even more. While
many people will argue with me, this is getting back to the original
model that we had for HPC systems - expensive, proprietary, and
high-performance hardware. There is one exception to this
model and that is the GPU.
Clusters were developed because commodity components had become
powerful enough to make clusters competitive with traditional
HPC hardware. Riding the commodity
pricing wave was the key for clusters to be much more competitive
in terms of
price/performance. GPUs are in
something of the same position. They are commodity components with
tremendous processing potential. So again
we are poised to create a new wave of price/performance using
new commodity components - graphics cards.
I think the biggest obstacle to co-processor adoption, regardless
of the co-processor's origin, is that
they are difficult to program. For example, to use FPGAs you need
to translate your code into a circuit diagram. To use the Clearspeed
chip you need to rethink your algorithm and write it using their
low-level API. With GPUs you also need to rethink your algorithm
so it behaves more like a graphics algorithm. The same basic concept
is true for all of the co-processors being developed.
What is really needed is a way to rewrite code to use these
co-processors either transparently or with a bare minimum of work.
A new company, Peakstream,
has developed a software product that has the potential to really
help solve the problem of running real code on GPUs. The concept is
to run your code on the main CPU; the Peakstream library then
intercepts calls to certain functions and sends them to the
Peakstream virtual machine (VM). These VMs run on top of
co-processors.
Initially these co-processors are GPUs, but with the concept of a
VM, there is no reason
it can't run on a Cell processor or a Clearspeed chip. From the
user's perspective, they don't have to change their code to use a
new VM. However, with all due respect to Heinlein, there is no
such thing as a free lunch: you do have to modify your code
initially to use the Peakstream VM.
Peakstream has done a good job of minimizing what
you have to do. To access the Peakstream VM, you will need to modify
your code to use the Peakstream API. You will also have to change
your data types to the Peakstream data types contained
in their API so you can use the co-processor. Then you will have
to use Peakstream functions to
move data into and out of these new data types. Finally, you recompile
your code with your favorite compiler and link to the Peakstream
libraries. While it sounds like a lot of work, it's really not
that much extra work to port. For example, here's some
C++ code taken from a Peakstream whitepaper.
# include
using namespace SP;
...
...
int Conj_Grad_GPU_PS(int N, float *cpuA, float *cpux, float *cpub)
{
    int iter;
    Arrayf32 x = Arrayf32::zeros(N);
    {
        // Move the matrix and right-hand side into the new data types
        Arrayf32 A = Arrayf32::make2(N, N, cpuA);
        Arrayf32 b = Arrayf32::make1(N, cpub);
        Arrayf32 residuals = b - matmul(A, x);
        Arrayf32 p = residuals;
        Arrayf32 newRR = dot_product(residuals, residuals);
        for (iter = 0; iter < N; iter++) {
            Arrayf32 oldRR = newRR;
            Arrayf32 newX, newP, newResiduals;
            Arrayf32 Ap = matmul(A, p);
            Arrayf32 dp = dot_product(p, Ap);
            newX = x + p * oldRR / dp;
            newResiduals = residuals - Ap * oldRR / dp;
            newRR = dot_product(newResiduals, newResiduals);
            newP = newResiduals + p * newRR / oldRR;
            p = newP;
            residuals = newResiduals;
            // Read the scalar back on the CPU for the convergence test
            float oldRRcpu = oldRR.read_scalar();
            if (oldRRcpu <= TOLERANCE) {
                break;
            }
            x = newX;
        }
    }
    // Move the solution back out into CPU memory
    x.read1(cpux, N * sizeof(float));
    return iter;
}
The code looks remarkably like typical C++. There is a new datatype
(Arrayf32) and some new functions to move data into and
out of co-processor memory for the
VM, but otherwise it looks like the original C++ function.
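For comparison, here is a rough sketch of what a plain CPU version of the same conjugate gradient loop might look like. To be clear, this is not from the Peakstream whitepaper: the names conj_grad_cpu, matvec, and dot are my own, the matrix is assumed to be dense, row-major, and symmetric positive definite, and the convergence test sits at the top of the loop, which is a simplification on my part.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-only reference; all names here are illustrative
// and are not part of the Peakstream API.
using Vec = std::vector<float>;

// y = A * v, with A stored row-major as a dense n x n matrix.
static Vec matvec(const Vec& A, const Vec& v) {
    const std::size_t n = v.size();
    assert(A.size() == n * n);
    Vec y(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * v[j];
    return y;
}

// Inner product of two vectors of equal length.
static float dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves A x = b starting from x = 0 (A assumed symmetric positive
// definite). Returns the number of iterations performed.
int conj_grad_cpu(const Vec& A, const Vec& b, Vec& x, float tolerance) {
    const std::size_t n = b.size();
    x.assign(n, 0.0f);
    Vec r = b;               // residual b - A*x, with x = 0
    Vec p = r;               // search direction
    float rr = dot(r, r);    // squared norm of the residual
    int iter = 0;
    for (; iter < static_cast<int>(n); ++iter) {
        if (rr <= tolerance) break;
        const Vec Ap = matvec(A, p);
        const float alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const float new_rr = dot(r, r);
        const float beta = new_rr / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = new_rr;
    }
    return iter;
}
```

Set side by side, the two versions have much the same shape. The port mostly swaps raw float arrays for Arrayf32 and the explicit loops for whole-array expressions.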
Peakstream currently has APIs and libraries for C and C++ and I
think they will have something for Fortran in 2007. Also, with the
new GPUs that are coming from Nvidia and AMD (formerly ATI) you will start
to see native double precision support in hardware for GPUs.
It's very exciting to see new GPUs oriented at HPC coming out. I
want to thank all of the gamers who have been pushing the graphics
companies to come out with new and faster hardware. They are the reason
commodity GPUs with tremendous performance potential can be
used for HPC. I highly recommend looking at Peakstream and trying
it out on your own codes. Also, be sure to thank the kid across the
street who plays games into the wee hours.
Penguin Computing
 Figure 9: New Penguin Computing 1U node
Penguin Computing has been
one of the leaders in cluster computing and at SC06 they were showing
something a bit new on the hardware side of things. Don Becker showed
Doug and me a new 1U node they are developing.
Don explained that this new node features a new motherboard that is
stripped down to just what HPC needs. He didn't give many details about
the board, but judging from the picture (BTW - that's not Don holding
the board), it looks like a dual-socket board with 8 DIMM slots per
socket. There are quite a few fans cooling the central section where
the processors and memory are. To the left of the central section it
looks like there are expansion slots, but I'm not sure how many. To the
right of the central section is the power supply, and in front of
everything are the hard drives (those are red SATA cables running to
the hard drives).
Creating an HPC-specific motherboard is a neat idea, but one that has
some danger to it. Appro has created their own, as have Linux Networx
and a few others. The danger is that you have to sell a lot of boards
to recoup your development costs. While it may seem that
developing a board is fairly easy, the costs are not small.
Can you sell enough to recover your
costs? Some companies think that the HPC market is large enough to
justify developing a board and some don't. It appears that Penguin
thinks the market is large enough. I wish them good luck.
Linux Networx and Performance Tuned Systems
It looks like the race car is fast replacing the "booth babe" at SC06.
This year there were three cars that I could see. While Scali had a pretty
neat car, I think Linux Networx won
the "car" competition. They had a cool race car and
even featured its driver on opening night, when you could
have your picture taken with him (I swear the driver looked like
he was 12, but he was actually 19).
 Figure 10: Linux Networx Race Car
But SC06 was about HPC, not cars,
and Linux Networx had some interesting stuff on display.
Linux Networx was introducing a new line of clusters called "Performance
Tuned SuperSystems" (LS-P). The systems are designed to be production ready,
which means they are ready to be used when they are powered up. They
are also designed to improve performance by using a tuned
combination of
hardware and software for specific applications or classes of applications.
According to Linux Networx, the systems have demonstrated a 20%+
reduction in TCO (Total Cost of Ownership) and an improved application
throughput of up to a factor of 10. According to their CEO,
Bo Ewald, these systems deliver production-ready supercomputers
at a Linux price point.
Linux Networx has tuned their software stack for better application
performance. They have tuned it for CFD, crash (explicit), and structures
(implicit) applications. ABAQUS, ANSYS Standard, and LS-Dyna have
shown up to 40% faster performance on industry benchmarks
on Linux Networx LS-P systems with Intel Xeon 5300 CPUs. Using
AMD Opteron processors, Star-CD has shown up to 36% faster performance
on industry standard benchmarks.
In Q1 2007, they will be shipping the first LS-P systems tuned for
ABAQUS, ANSYS Standard, LS-Dyna, and Star-CD. Then in the first half of 2007,
they will be shipping tuned systems for Fluent, CFD++, and other codes.
Microsoft
While some people view Microsoft as the Borg (with some justification),
I think they are more like the cover page of Business Week magazine that
called them the "New Microsoft." They are trying to develop a product
that fits within their company yet meets the needs of people who
need more compute power via clusters. Their product,
Windows Compute
Cluster Server (WCCS or just CCS), was shown at SC06 much as it
was shown at SC05 in Seattle, except for a couple of small
differences that I noticed.
SC05 in Seattle was a big splash for Microsoft. Bill Gates gave
the keynote and they announced Windows CCS as a product. So they
had a couple of large booths featuring their product. At SC06, they
really didn't have any big product announcements but they did have a
good-sized booth. To me, what Microsoft
showed was far more compelling than announcing a new version of CCS.
What their booth showed was a very long list of ISVs that had ported
their software to Windows CCS. You might ask what is so compelling
about that? Let me tell you.
At the end of the day, clusters are about computing something useful:
solving problems, sifting through data, simulating something new, helping
people discover something, or even making a cool new movie. What
drives this is not only hardware, but also applications. The community
has gotten quite good at the hardware side of things, but applications
are the key to driving clusters further (IMHO). Suzy Tichenor,
vice president of the Council on Competitiveness, has discussed a
section of the HPC market that she calls the "missing middle." This
portion of the marketplace could use clusters but doesn't have the
resources or the knowledge base to get started. One of the most
important things missing is a set of cluster applications that are easy to
use and basically transparent to the user.
This goal is what Microsoft is aiming to accomplish: to allow people to
use clusters as they would any other desktop and to make the
cluster applications as transparent as possible. I don't blame
Microsoft in the least for taking this approach. In fact, I applaud
them for it. Easy-to-use systems and applications have been talked
about for some time, with few solutions making it to the
marketplace. So seeing the Microsoft banner above their booth with lots
of ISV partner names and seeing these partners in the Microsoft
booth showing their applications easily running on Windows CCS
shows that Microsoft not only "gets it" but is doing something
about solving a sticky problem. I know people will argue with me
about various points in regard to Microsoft, and I will concede these
points to them and
wholeheartedly support their arguments. But at the end of the day,
I don't think Microsoft is trying to kill all cluster competitors
(although I'm sure they are convinced that they have the best strategy,
which is only natural). Instead I think they are trying to solve a
problem that their customers were having - needing more compute power in
an easy-to-use package with applications that can simply be run in
parallel.
High-level Languages for HPC
Beyond the need for easy-to-use applications, one of the
biggest complaints about clusters is that they are difficult to program.
People have tried various fixes over the years and some of them really
help ease the programming burden. But there is still a need for better
programming languages that are "parallel aware." Two companies are
tackling this issue: The Mathworks and Interactive Supercomputing.
Matlab
The Mathworks has developed and
marketed Matlab, one of the most used languages (and systems) in the
world. At many universities it has replaced the teaching of Fortran and C.
Many engineering graduates know only Matlab programming, so
companies are switching to it to support their employees. In addition
to the wave of college
graduates who know it, it is a very nice package for coding. At its
core, it is a high-level language that makes matrix manipulation
easy. The Mathworks also has a number of add-on
packages (called toolboxes) that allow all sorts of computations
(including compiling the code into an executable). Matlab also comes
with plotting built in, which is what people like - they don't have to
add a plotting library or dump data to a file and fire up a different
application to plot the results.
Recently, The Mathworks announced the
Distributed Computing Toolbox (Version 3.0). The initial version of
the toolbox basically only supported embarrassingly parallel
computations (i.e., no
interprocess communication). Version 3.0 now has semantics for global
programming using MPI as the underlying message passing mechanism.
It supports parallel for loops and global array semantics
using distributed arrays. For example, here is some simple
code for constructing global distributed arrays that I took from their
website.
%% Using constructor functions
% A distributed zeros matrix with default distribution
z = zeros(100, 100, distributor());
% A distributed sparse random matrix distributed by columns
sr = sprand(100, 100, 0.1, distributor(2))
%% From variant arrays
% L is a variant array containing different data on each lab
L = [1:250; 251:500; 501:750; 751:1000] + 250 * (labindex - 1);
% Combine L on different labs across first dimension
D = darray(L, 1);
%% Distribute a replicated array
% A is same on all labs
A = [1:250; 251:500; 501:750; 751:1000];
% Distribute A along first dimension so that only parts of it reside on each lab
D = distribute(A, distributor(1));
Underlying the basic parallel functions are MPI commands to handle the
data distribution.
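To make "only parts of it reside on each lab" a bit more concrete, here is a small sketch of how a block distribution along the first dimension might assign rows to labs. I don't know the partitioning scheme the toolbox actually uses, so the even split below (with the leading labs taking any leftover rows) is purely an assumption, and RowRange and block_rows are hypothetical names of my own. Note that the lab index here is 0-based, unlike Matlab's labindex.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration only; not part of the
// Distributed Computing Toolbox.
struct RowRange {
    std::size_t first;  // index of the first row owned (0-based)
    std::size_t count;  // number of rows owned
};

// Split n_rows across n_labs as evenly as possible. The first
// (n_rows % n_labs) labs each take one extra row. lab is 0-based.
RowRange block_rows(std::size_t n_rows, std::size_t n_labs, std::size_t lab) {
    assert(n_labs > 0 && lab < n_labs);
    const std::size_t base = n_rows / n_labs;
    const std::size_t extra = n_rows % n_labs;
    const std::size_t count = base + (lab < extra ? 1 : 0);
    const std::size_t first = lab * base + (lab < extra ? lab : extra);
    return RowRange{first, count};
}
```

With the 4-row matrix A from the listing above and four labs, each lab would own exactly one row under this scheme. With 10 rows the first two labs would own three rows each and the last two would own two.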
The DCT (Distributed Computing Toolbox) also allows you to use other
Matlab toolboxes as part of the computation but some of them are
not yet distributed.
Underlying the toolbox is the
Matlab Distributed Computing Engine 3.0. The DCT and Engine
combination includes a task scheduler. You can also use
a third-party scheduler in place of this standard one if you want.
Products such as the DCT are helping to move people onto clusters.
I know of several large companies that use Matlab for a large number
of their computations. They want their code to go faster and to handle
bigger problems. Up to now that has meant waiting for faster processors
and buying machines with larger amounts of memory. With the DCT, they
can start using clusters.
Interactive Supercomputing
A fairly new company,
Interactive Supercomputing,
has an uber-cool product called
Star-P
that takes a slightly different approach to helping people get their
code onto clusters.
Star-P allows users to program in their favorite language, such as Matlab
or Python, and then with minimal changes run the computationally intensive
parts of their application on a cluster without any user intervention
(not a bad concept). The Star-P Client resides on your desktop. It
connects your application to the Star-P Server. The client intercepts
calls to math libraries and redirects them to parallel equivalents on
the Star-P Server. As part of the redirection, the Star-P Client also
controls the loading of data from storage into the Star-P Server's
distributed memory.
The Star-P Server consists of several pieces: an Interactive Engine,
a workload
management portion that can connect to common schedulers, a Computation
Engine, and a Library API. The Server sits on top of the OS. The
Interactive Engine allows the system to be administered and to
interface with common schedulers such as PBS and LSF. The Computation
Engine is the part of the Server that actually does the heavy lifting.
The Computation
Engine has three parts: (1) Data-Parallel Computations, (2) Task-Parallel
Computations, and (3) the Library API. The data-parallel computations
portion handles matrix and vector operations on large data sets. If
the Star-P Client flags a variable in your code to become parallel,
then other related variables are also flagged as parallel (this is
all done transparently to the user). The task-parallel computations
portion handles loops that can be parallelized, for things such as Monte Carlo
simulations. And finally, the Library API allows you to define new
capabilities for the Computation Engine that include functions for
your specific code.
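If you haven't seen a task-parallel Monte Carlo loop before, here is a minimal sketch of the idea in plain C++. This is not the Star-P API, which I am not showing here. It is just standard C++ threading, and count_hits, estimate_pi, and the seed values are hypothetical names and choices of my own. Each task is completely independent, which is what makes this kind of loop easy to spread across processors.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <random>
#include <vector>

// Hypothetical illustration of a task-parallel loop using standard
// C++ threading; this is not the Star-P API.

// One independent task: count random points that land inside the
// quarter of the unit circle in the unit square.
std::uint64_t count_hits(std::uint64_t samples, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < samples; ++i) {
        const double x = uniform(rng);
        const double y = uniform(rng);
        if (x * x + y * y <= 1.0) ++hits;
    }
    return hits;
}

// Launch n_tasks independent tasks and combine their counts into a
// single estimate of pi.
double estimate_pi(unsigned n_tasks, std::uint64_t samples_per_task) {
    assert(n_tasks > 0 && samples_per_task > 0);
    std::vector<std::future<std::uint64_t>> tasks;
    for (unsigned t = 0; t < n_tasks; ++t) {
        // Each task gets its own seed so the random streams differ.
        tasks.push_back(std::async(std::launch::async, count_hits,
                                   samples_per_task, 1000u + t));
    }
    std::uint64_t total_hits = 0;
    for (auto& task : tasks) total_hits += task.get();
    const double total = static_cast<double>(n_tasks) *
                         static_cast<double>(samples_per_task);
    return 4.0 * static_cast<double>(total_hits) / total;
}
```

Calling estimate_pi(4, 200000) runs four tasks of 200,000 samples each. The only communication is the final sum, which is why loops like this are good candidates for task parallelism. Depending on your toolchain you may need to link against a threads library (for example, -pthread with GCC).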
I think Star-P is a very interesting approach to getting codes that
are currently used in a serial manner onto clusters. This would help
users who are reluctant to start using clusters, for whatever reason,
to start using them as part of their everyday computation. Since Star-P
handles the heavy lifting for you, it should help you stick your
proverbial toe in the water of parallel computing.