Harnessing Those GPUs

Recently, quite a few companies have been developing and experimenting with what I like to call "Application Accelerators" or "co-processors." These are hardware devices that can greatly improve the performance of certain computations. For example, FPGAs (Field Programmable Gate Arrays), Clearspeed's CSX chips, IBM's Cell processor, GPUs (Graphics Processing Units), and other chips with lots of cores are being developed to improve the performance of computations such as sorting, FFTs, and dense matrix operations. Co-processors are being developed because it looks like CPU manufacturers have hit the wall on clock speed and are having to resort to multi-core processors to improve overall performance. This is somewhat similar to the good old days of the Transputers, which were developed as a low-power addition to CPUs to help with parallel processing because it looked like CPU vendors had hit a wall in performance. That sounds EXACTLY like what is happening today.

There are a number of difficulties with the approaches being taken today. The co-processors are fairly expensive compared to the main node. Many of these cards are going to cost at least the price of a node and perhaps even more. While many people will argue with me, this is getting back to the original model that we had for HPC systems - expensive, proprietary, and high-performance hardware. There is one exception to this model and that is the GPU.

Clusters were developed because commodity components had become powerful enough to make clusters competitive with traditional HPC hardware. Riding the commodity pricing wave was the key for clusters to be much more competitive in terms of price/performance. GPUs are in something of the same position. They are commodity components with tremendous processing potential. So again we are poised to create a new wave of price/performance using new commodity components - graphics cards.

I think the biggest obstacle to co-processor adoption, regardless of where the co-processor comes from, is that they are difficult to program. For example, to use FPGAs you need to translate your code into a circuit diagram. To use the Clearspeed chip you need to rethink your algorithm and write it using their low-level API. With GPUs you also need to rethink your algorithm so it behaves more like a graphics algorithm. The same basic concept is true for all of the co-processors being developed. What is really needed is a way to rewrite code to use these co-processors, either transparently or with a bare minimum of work.

A new company, Peakstream, has developed a software product that has the potential to really help solve the problem of running real code on GPUs. The concept is to run your code on the main CPU; the Peakstream library then intercepts calls to certain functions and sends them to the Peakstream virtual machine (VM). These VMs run on top of co-processors. Initially these co-processors are GPUs, but with the concept of a VM, there is no reason it can't run on a Cell processor or a Clearspeed chip. From the user's perspective, they don't have to change their code to use a new VM. However, with all due respect to Heinlein, there is no such thing as a free lunch: you do have to modify your code initially to use the Peakstream VM.

Peakstream has done a good job of minimizing what you have to do. To access the Peakstream VM, you need to modify your code to use the Peakstream API. You also have to change your data types to the Peakstream data types contained in their API so you can use the co-processor. Then you have to use Peakstream functions to move data into and out of these new data types. Finally, you recompile your code with your favorite compiler and link to the Peakstream libraries. While it sounds like a lot of work, it's really not that much extra work to port. For example, here's some C++ code taken from a Peakstream whitepaper.

#include <...>  // Peakstream header; its name did not survive in this copy
using namespace SP;

// Conjugate gradient solve of A x = b on the co-processor.
// TOLERANCE is assumed to be defined elsewhere; it is not shown in the excerpt.
int Conj_Grad_GPU_PS(int N, float *cpuA, float *cpux, float *cpub)
{
  int iter;
  Arrayf32 x = Arrayf32::zeros(N);
  {
    // Copy the CPU data into co-processor arrays.
    Arrayf32 A = Arrayf32::make2(N, N, cpuA);
    Arrayf32 b = Arrayf32::make1(N, cpub);
    Arrayf32 residuals = b - matmul(A, x);
    Arrayf32 p = residuals;
    Arrayf32 newRR = dot_product(residuals, residuals);
    for (iter = 0; iter < N; iter++) {
      Arrayf32 oldRR = newRR;
      Arrayf32 newX, newP, newResiduals;
      Arrayf32 Ap = matmul(A, p);
      Arrayf32 dp = dot_product(p, Ap);
      newX = x + p * oldRR / dp;
      newResiduals = residuals - Ap * oldRR / dp;
      newRR = dot_product(newResiduals, newResiduals);
      newP = newResiduals + p * newRR / oldRR;
      p = newP;
      residuals = newResiduals;
      // Bring one scalar back to the CPU to test for convergence.
      float oldRRcpu = oldRR.read_scalar();
      if (oldRRcpu <= TOLERANCE) {
        break;
      }
      x = newX;
    }
  }
  // Copy the solution back into CPU memory.
  x.read1(cpux, N * sizeof(float));
  return iter;
}

The code looks remarkably like typical C++. There is a new data type (Arrayf32) and some new functions to move data into and out of co-processor memory for the VM, but otherwise it looks like the original C++ function.
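
For comparison, here is a rough sketch of what a plain CPU version of the same conjugate gradient iteration might look like. It is not taken from the Peakstream whitepaper: the names `Vec`, `matvec`, `dot`, and `conj_grad_cpu` are my own, it uses `double` instead of `float`, and it checks the squared residual norm before each iteration rather than after, so treat it as an illustration of the algorithm and nothing more.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// y = A * x for a dense, row-major n-by-n matrix.
Vec matvec(const Vec& A, const Vec& x) {
    const std::size_t n = x.size();
    assert(A.size() == n * n);
    Vec y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    return y;
}

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves A x = b for a symmetric positive-definite A, starting from x = 0.
// Returns the number of iterations performed; `tol` bounds the squared
// residual norm (an assumed default, not a value from the whitepaper).
int conj_grad_cpu(const Vec& A, const Vec& b, Vec& x, double tol = 1e-12) {
    const std::size_t n = b.size();
    x.assign(n, 0.0);
    Vec r = b;               // residual b - A*x with x = 0
    Vec p = r;               // search direction
    double rr = dot(r, r);
    int iter = 0;
    for (; iter < static_cast<int>(n) && rr > tol; ++iter) {
        const Vec Ap = matvec(A, p);
        const double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return iter;
}
```

Line for line, the loop body matches the Peakstream version: one matrix-vector product, two dot products, and three vector updates per iteration. The difference is that here every one of those operations runs on the CPU.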

Peakstream currently has APIs and libraries for C and C++, and I think they will have something for Fortran in 2007. Also, with the new GPUs that are coming from Nvidia and AMD (formerly ATI), you will start to see native double-precision support in GPU hardware. It's very exciting to see new GPUs oriented at HPC coming out. I want to thank all of the gamers who have been pushing the graphics companies to come out with new and faster hardware; they are the reason commodity GPUs with tremendous performance potential can be used for HPC. I highly recommend looking at Peakstream and trying it out on your own codes. Also, be sure to thank the kid across the street who plays games into the wee hours.

Penguin Computing

Figure 9: New Penguin Computing 1U node

Penguin Computing has been one of the leaders in cluster computing, and at SC06 they were showing something a bit new on the hardware side of things. Don Becker showed Doug and me a new 1U node they are developing.

Don explained that this new node features a new motherboard that is stripped down to just what HPC needs. He didn't give many details about the board, but looking at the picture (BTW - that's not Don holding the board), it appears to be a dual-socket board with 8 DIMM slots per socket. There are quite a few fans cooling the central section where the processors and memory are. To the left of the central section there appear to be expansion slots, but I'm not sure how many. To the right of the central section is the power supply, and in front of everything are the hard drives (those are red SATA cables running to the hard drives).

Creating an HPC specific motherboard is a neat idea, but one that has some danger to it. Appro has created their own, as have Linux Networx and a few others. The danger is that you have to sell a lot of boards to recoup your development costs. While it seems like developing a board is fairly easy, the costs are not small. Can you sell enough to recover your costs? Some companies think that the HPC market is large enough to justify developing a board and some don't. It appears that Penguin thinks the market is large enough. I wish them good luck.

Linux Networx and Performance Tuned Systems

It looks like the race car is fast replacing the "booth babe" at SC06. This year there were 3 cars that I could see. While Scali had a pretty neat car, I think Linux Networx won the "car" competition. They had a cool race car and even featured its driver on opening night, when you could have your picture taken with him (I swear the driver looked like he was 12, but he was actually 19).

Figure 10: Linux Networx Race Car

But SC06 was about HPC, not cars, and Linux Networx had some interesting stuff on display.

Linux Networx was introducing a new line of clusters called "Performance Tuned SuperSystems" (LS-P). The systems are designed to be production-ready, which means they are ready to be used as soon as they are powered up. They are also designed to improve performance by using a tuned combination of hardware and software for specific applications or classes of applications. According to Linux Networx, the systems have demonstrated a 20%+ reduction in TCO (Total Cost of Ownership) and an improvement in application throughput of up to a factor of 10. According to their CEO, Bo Ewald, these systems deliver production-ready supercomputers at a Linux price point.

Linux Networx has tuned their software stack for better application performance. They have tuned it for CFD, crash (explicit), and implicit (structures) applications. ABAQUS, ANSYS Standard, and LS-Dyna have shown up to 40% faster performance on industry benchmarks on Linux Networx LS-P systems with Intel Xeon 5300 CPUs. Using AMD Opteron processors, Star-CD has shown up to 36% faster performance on industry-standard benchmarks.

In Q1 2007, they will be shipping the first LS-P systems tuned for ABAQUS, ANSYS Standard, LS-Dyna, and Star-CD. Then, in the first half of 2007, they will be shipping tuned systems for Fluent, CFD++, and other codes.


Microsoft

While some people view Microsoft as the Borg (with some justification), I think they are more like the Business Week cover that called them the "New Microsoft." They are trying to develop a product that fits within their company yet meets the needs of people who need more compute power via clusters. Their product, Windows Compute Cluster Server (WCCS or just CCS), was shown at SC06 much as it was shown at SC05 in Seattle, except for a couple of small differences that I noticed.

SC05 in Seattle was a big splash for Microsoft. Bill Gates gave the keynote and they announced Windows CCS as a product, so they had a couple of large booths featuring it. At SC06, they really didn't have any big product announcements, but they did have a good-sized booth. To me, what Microsoft showed was far more compelling than announcing a new version of CCS. Their booth showed a very long list of ISVs that had ported their software to Windows CCS. You might ask what is so compelling about that. Let me tell you.

At the end of the day, clusters are about computing something useful: solving problems, sifting through data, simulating something new, helping people discover something, or even making a cool new movie. What drives this is not only hardware but also applications. The community has gotten quite good at the hardware side of things, but applications are the key to driving clusters further (IMHO). Suzy Tichenor, vice president of the Council on Competitiveness, has discussed a section of the HPC market that she calls the "missing middle." This portion of the marketplace could use clusters but doesn't have the resources or the knowledge base to get started. One of the most important things missing is cluster applications that are easy to use and basically transparent to the user.

This goal is what Microsoft is aiming to accomplish: to allow people to use clusters as they would any other desktop and to make cluster applications as transparent as possible. I don't blame Microsoft in the least for taking this approach. In fact, I applaud them for it. Easy-to-use systems and applications have been talked about for some time, with few solutions making it to the marketplace. So seeing the Microsoft banner above their booth with lots of ISV partner names, and seeing these partners in the Microsoft booth showing their applications running easily on Windows CCS, shows that Microsoft not only "gets it" but is doing something about solving a sticky problem. I know people will argue with me about various points in regard to Microsoft, and I will concede these points to them and wholeheartedly support their arguments. But at the end of the day, I don't think Microsoft is trying to kill all cluster competitors (although I'm sure they are convinced that they have the best strategy, which is only natural). Instead, I think they are trying to solve a problem that their customers were having - needing more compute power in an easy-to-use package with applications that can simply be run in parallel.

High-level Languages for HPC

In addition to having easy to use applications for clusters, one of the biggest complaints about clusters is that they are difficult to program. People have tried various fixes over the years and some of them really help ease the programming burden. But there is still a need for better programming languages that are "parallel aware." Two companies are tackling this issue: The Mathworks and Interactive Supercomputing.


The Mathworks

The Mathworks has developed and marketed Matlab, one of the most used languages (and systems) in the world. At many universities it has replaced Fortran and C in the classroom. Many engineering graduates know only Matlab programming, so companies are switching to it to support their employees. In addition to the wave of college graduates who know it, it is a very nice package for coding. At its core, it is a high-level language that makes matrix manipulation easy. The Mathworks also has a number of add-on packages (called toolboxes) that allow all sorts of computations (including compiling the code into an executable). Matlab also comes with plotting built in, which people like - they don't have to add a plotting library or dump data to a file and fire up a different application to plot the results.

Recently, The Mathworks announced the Distributed Computing Toolbox (Version 3.0). The initial version of the toolbox basically only supported embarrassingly parallel computations (i.e., no interprocess communication). Version 3.0 now has semantics for global programming using MPI as the underlying message-passing mechanism. It supports parallel for loops and global array semantics using distributed arrays. For example, here is some simple code for constructing global distributed arrays that I took from their website.

%% Using constructor functions

% A distributed zeros matrix with default distribution
z = zeros(100, 100, distributor());

% A distributed sparse random matrix distributed by columns
sr = sprand(100, 100, 0.1, distributor(2))

%% From variant arrays
% L is a variant array containing different data on each lab
L = [1:250; 251:500; 501:750; 751:1000] + 250 * (labindex - 1);

% Combine L on different labs across first dimension
D = darray(L, 1);

%% Distribute a replicated array
% A is same on all labs
A = [1:250; 251:500; 501:750; 751:1000];

% Distribute A along first dimension so that only parts of it reside on each lab
D = distribute(A, distributor(1));

Underlying the basic parallel functions are MPI commands to handle the data distribution.
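
To make "only parts of it reside on each lab" a bit more concrete, here is a small C++ sketch of one common way to split the rows of an array across labs as evenly as possible. The helper `owned_rows` is hypothetical and is not part of the toolbox; I am only assuming a simple block partition, and the rule the DCT actually uses may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open row range [first, last) owned by `lab` (1-based, like labindex)
// when `rows` rows are split as evenly as possible across `numlabs` labs.
// Hypothetical helper: the toolbox's actual partitioning rule may differ.
std::pair<std::size_t, std::size_t> owned_rows(std::size_t rows,
                                               std::size_t numlabs,
                                               std::size_t lab) {
    assert(numlabs > 0 && lab >= 1 && lab <= numlabs);
    const std::size_t base = rows / numlabs;   // rows that every lab gets
    const std::size_t extra = rows % numlabs;  // the first `extra` labs get one more
    const std::size_t k = lab - 1;             // 0-based lab index
    const std::size_t first = k * base + (k < extra ? k : extra);
    const std::size_t count = base + (k < extra ? 1 : 0);
    return {first, first + count};
}
```

With 4 rows on 4 labs, each lab owns exactly one row, which matches the 4-row array A in the example above. With 10 rows on 4 labs, the first two labs would own 3 rows each and the last two would own 2 each.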

The DCT (Distributed Computing Toolbox) also allows you to use other Matlab toolboxes as part of the computation, but some of them are not yet distributed. Underlying the toolbox is the Matlab Distributed Computing Engine 3.0. The DCT and Engine combination includes a task scheduler, and you can use a third-party scheduler in place of this standard one if you want.

Products such as the DCT are helping to move people onto clusters. I know of several large companies that use Matlab for a large number of their computations. They want their code to go faster and to handle bigger problems. Up to now, that has meant waiting for faster processors and buying machines with larger amounts of memory. With the DCT, they can start using clusters.

Interactive Supercomputing

A fairly new company, Interactive Supercomputing, has an uber-cool product called Star-P that takes a slightly different approach to helping people get their code onto clusters.

Star-P allows users to program in their favorite language, such as Matlab or Python, and then, with minimal changes, run the computationally intensive parts of their application on a cluster without any user intervention (not a bad concept). The Star-P Client resides on your desktop. It connects your application to the Star-P Server. The client intercepts calls to math libraries and redirects them to parallel equivalents on the Star-P Server. As part of the redirection, the Star-P Client also controls the loading of data from storage into the Star-P Server's distributed memory.
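
The redirection idea can be sketched in a few lines of C++. To be clear, none of these names (`MathBackend`, `LocalBackend`, `ChunkedBackend`, `mean`) come from Star-P, and the "chunked" backend only stands in for a remote parallel server by splitting the work into pieces on the same machine. The point is simply that the user's code is written once against an interface and does not care where the math actually runs.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical interface: the math call that the client can redirect.
struct MathBackend {
    virtual ~MathBackend() = default;
    virtual std::string name() const = 0;
    virtual double sum(const std::vector<double>& v) const = 0;
};

// Runs the operation locally, the way an unmodified desktop session would.
struct LocalBackend : MathBackend {
    std::string name() const override { return "local"; }
    double sum(const std::vector<double>& v) const override {
        return std::accumulate(v.begin(), v.end(), 0.0);
    }
};

// Stand-in for a parallel server: splits the data into `parts` pieces,
// sums each piece separately, and combines the partial results.
struct ChunkedBackend : MathBackend {
    std::size_t parts;
    explicit ChunkedBackend(std::size_t p) : parts(p) { assert(p > 0); }
    std::string name() const override { return "chunked"; }
    double sum(const std::vector<double>& v) const override {
        double total = 0.0;
        for (std::size_t k = 0; k < parts; ++k) {
            const auto lo = static_cast<std::ptrdiff_t>(v.size() * k / parts);
            const auto hi = static_cast<std::ptrdiff_t>(v.size() * (k + 1) / parts);
            total += std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
        }
        return total;
    }
};

// "User code": written once, unaware of which backend does the work.
double mean(const MathBackend& backend, const std::vector<double>& v) {
    return v.empty() ? 0.0 : backend.sum(v) / static_cast<double>(v.size());
}
```

Calling `mean(LocalBackend{}, data)` and `mean(ChunkedBackend{3}, data)` runs the same user code against two different back ends, which is roughly the experience the Star-P description above promises.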

The Star-P Server consists of several pieces: an Interactive Engine, a workload management portion that can connect to common schedulers, a Computation Engine, and a Library API. The Server sits on top of the OS. The Interactive Engine allows the systems to be administered and to interface with common schedulers such as PBS and LSF. The Computation Engine is the part of the Server that actually does the heavy lifting.

The Computation Engine has 3 parts: (1) Data-Parallel Computations, (2) Task-Parallel Computations, and (3) the Library API. The data-parallel computations portion handles matrix and vector operations on large data sets. If the Star-P Client flags a variable in your code to become parallel, then other related variables are also flagged as parallel (this is all done transparently to the user). The task-parallel computations portion handles loops that can be parallelized, for things such as Monte Carlo simulations. And finally, the Library API allows you to define new capabilities for the Computation Engine, including functions for your specific code.
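
As an illustration of why Monte Carlo loops are such a good fit for task parallelism, here is a small C++ sketch that estimates pi by splitting the sampling loop into independent chunks. The function names and the seeding scheme are my own assumptions and have nothing to do with Star-P's API. The chunks run one after another here to keep the sketch free of threading dependencies, but since no chunk depends on any other, a runtime could hand each one to a different worker.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One independent task: count random points in the unit square that
// fall inside the quarter circle of radius 1.
std::uint64_t count_hits(std::uint64_t samples, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < samples; ++i) {
        const double x = unit(rng);
        const double y = unit(rng);
        if (x * x + y * y <= 1.0) ++hits;
    }
    return hits;
}

// Splits the loop into `tasks` independent chunks and combines the
// partial counts. Each chunk gets its own seed (a simplistic scheme,
// chosen only for illustration) so the chunks share no state.
double estimate_pi(std::uint64_t samples_per_task, std::uint32_t tasks) {
    assert(tasks > 0 && samples_per_task > 0);
    std::uint64_t hits = 0;
    for (std::uint32_t t = 0; t < tasks; ++t) {
        hits += count_hits(samples_per_task, 1000u + t);
    }
    const double total = static_cast<double>(samples_per_task) * tasks;
    return 4.0 * static_cast<double>(hits) / total;
}
```

The only communication is the final sum of the hit counts, which is why this kind of loop scales so easily across the nodes of a cluster.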

I think Star-P is a very interesting approach to getting codes that are currently used in a serial manner onto clusters. This would help users who are reluctant to start using clusters, for whatever reason, to start using them as part of their everyday computation. Since Star-P handles the heavy lifting for you, it should help you stick your proverbial toe in the water of parallel computing.


This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.