Hits: 124414

One step closer to the "-gpu" option

I'm sure by now everyone has heard that you can run real code on GPU's (Graphical Processing Units). GPUs are the graphics card in your desktop or even the graphic engines running your game consoles at home (never at work - right?). The potential performance improvement for codes or algorithms that can take advantage of the GPU's programming model and do most of their computation on the GPU is enormous. There are cases of over a 100X performance improvement for some codes running on GPUs relative to CPUs.

But there are some limitations to using GPUs for computation. One of the critical limitations is that you have to take your code and rewrite it for the GPU as in the case of Brook+ from AMD or OpenCL from the Kronos Group. Alternatively, you may have to "adapt" your C code to use some extra functions and data types (extensions) in the case of CUDA from NVIDIA. Unfortunately, you just can't take your existing code with a compiler and use a compile option such as "-gpu" to magically build code for the GPU... or can you?

Writing Code for GPUs

I'm not sure how many people have written code or tried to write code for GPUs (I will use GPU in place of GP-GPUs because it's easier on my carpal tunnel symptoms), but in general it's not as easy as it appears. If you are porting your application to GPUs then you have to take your code, understand the algorithms reasonably well, and then determine places where you think GPUs will shine. Then you have to either (1) rewrite the entire code, or (2) rewrite targeted portions of the code, or (3) port the desired portions of the code to a new language. For newly written code, you can take your algorithm and frame it into a SIMD (Single Instruction, Multiple Data) context, and then write code. For old code, written lan before GPUs, or new code, writing for GPUs it's not easy. Let's take a look at what tools are available for writing GPU code.

If you want to get really hard core you can actually write GPU code using OpenGL. It is an API (Application Programming Interface), or language if you will, that allows you to write applications that run on GPUs. Originally OpenGL was designed to be used for writing 2D and 3D computer graphics applications. But people have discovered that you can use it to run general programs that aren't necessarily graphically oriented. But you have to be able to "code" in OpenGL and write your algorithms using it. There are some simple tools that can help you get started, but in general, you have to think in graphical concepts such as textures, shaders, etc., and be able to express your algorithm in terms of these concepts using OpenGL. I like to think of this as the "Assembler Language approach" to coding for GPUs. That is, you are down in the low level bowels of the language and the hardware to effectively write and run code on GPUs. In addition, such low-level approaches can limit the portability of the code from one platform to another.

While it is still very difficult for non-graphical programmers to write OpenGL code or for OpenGL coders to think about non-graphical algorithms, there are some success stories of applications. You can try this link or this link to read about some successful OpenGL applications that people have written.

Fairly early on people realized that GPUs, while showing huge potential, were not going to have widespread adoption given that they were so difficult to write code for. So higher level languages were developed. There is a whole laundry list of languages and I won't go over them here. But here is list of the higher level languages and libraries that people are using or have used to write code for GPUs:

For all of these languages and libraries, you will need to rewrite or port your application. The degree of severity varies depending upon the specific option. Arguably, CUDA is one of the easiest because you can take existing C code and add GPU code to it along with some data passing function calls to move data to/from the host CPU to the GPU.

On the other hand, languages and libraries such as BrookGPU, Brook+, and even OpenCL, will require you to rewrite much of your application. However, the developers have tried to make it as close as possible to C. Some languages and libraries are available under various open-source licenses. Others are freely available but are not open-source. Then there are others that are commercial products.

Regardless of the language or package chosen, the amount of work that goes into porting or rewriting varies. I view all of the previously mentioned languages as something like Assembler+. That is, a step above something like Assembler, but not nearly the same as C, C++, or Fortran.

What developers really want is to continue to use their current development tools for developing for GPUs. They don't want to have to rewrite codes or learn new languages. They may adapt their codes somewhat, perhaps a small amount. But overall they just want to build codes for GPUs using their existing development tools and existing code base (as much as possible).

The Evolution of GPU Tools and Developers

Developers are looking for something easy or automatic that helps them run their code on GPUs. This is what I've been referring to as the magic "-gpu" option. The idea is that the compiler is all-seeing and all-knowing so that it can inspect your code, find the parts that look SIMD appropriate code, and create a CPU/GPU binary. I think people also want to be able to eat anything they want without gaining weight or endangering their health (at least that's my dream). But the point is that this is an almost impossible dream. However, we can move down the path in that direction.

This situation is not without precedence. If you've been around a few years you may remember the rise of the vector processor. At first the developers had to deal with trying to rewrite their codes to utilize vector processors. At the same time, the compiler vendors had modified their compilers to help developers recognize opportunities for vectorization as well as create good vectorized code. Over time, developers got better at writing vector code and the compilers became better at recognizing vector opportunities and generating really good vector code. The results after several years were really good developers who, on average, pretty well understood how to write vector code and were armed with good compilers that could recognize vector code opportunities and generate very good vector code. In addition the compilers produced good enough performing code that developers did not have to resort to assembler codes that they first used to achieve a good portion of the potential performance. It took several iterations between developers and compiler creators to get to the end result. A better review of the history of vector compilers was written by Michael Wolfe from PGI (The Portland Group) at this Linux Journal.

In many ways we are following the same steps of vector processors with GPUs. We are at the beginning of the cycle where we were with vector compilers and code development. We have some early tools for developing codes for GPUs and developers are just starting to develop and, more importantly, understanding how to develop codes for GPUs. But recently, the next step in the evolution of tools (compilers) for GPUs was recently taken by The Portland Group.

PGI 8.0 Technology Preview

The Portland Group recently announced that they will have a technology preview version of their new 8.0 compilers. The preview will be given to a restricted group of testers initially and then expand to other developers over time. The first customers should see the preview in early 2009.

So what is so special about this announcement from The Portland Group? I'm glad you have asked :) What PGI has done is to add Pragmas or compiler directives to their compilers. These pragmas allow the compiler to analyze the code and generate GPU code that is then sent to the NVIDIA compiler (the Cuda compiler is part of the freely available CUDA software). The PGI compiler continues to compile the CPU based code and link it to CUDA built GPU binary. Then you execute the resulting combination.

Since we're all geeks here (well, at least I am), let's look at some details at how you code GPUs now and what PGI's announcement does for us.

Coding for GPUs - BP and AP

A quick example that Micheal Wolfe from PGI has used in a number of articles is a matrix multiplication. It has elements of more complicated algorithms while it's fairly easy to understand.

Here's a simple set of Fortran loops to perform a matrix multiplication taken from this article.

do i = 1,n
   do j = 1,m
      do k = 1,p
         a(i,j) = a(i,j) + b(i,k)*c(k,j)

Listing 1 - Basic Matrix Multiplication in Fortran

It's a fairly simple concept - only 7 lines of code, excluding the couple of lines to begin and end a Fortran code. There are many, many ways to optimize this set of loops for maximum performance. In addition, it's fairly easy to test and to measure performance. So it's a pretty good example to illustrate concepts.

In this article Michael Wolfe took the basic matrix multiply of 3 nested loops and tried various algorithms with CUDA on an NVIDIA GPU.

The initial CUDA code is reproduced here for illustrative purposes. it is a simple "tiled" approach to matrix multiplication.

__global__ void 
matmulKernel( float* C, float* A, float* B, int N2, int N3 ){
  int bx = blockIdx.x,  by = blockIdx.y;
  int tx = threadIdx.x, ty = threadIdx.y;
  int aFirst = 16 * by * N2;
  int bFirst = 16 * bx;
  float Csub = 0; 

  for( int j = 0; j < N2; j += 16 ) {
    __shared__ float Atile[16][16], Btile[16][16]; 
    Atile[ty][tx] = A[aFirst + j + N2 * ty + tx];
    Btile[ty][tx] = B[bFirst + j*N3 + b + N3 * ty + tx];


     for( int k = 0; k < 16; ++k )
       Csub += Atile[ty][k] * Btile[k][tx]; 


  int c = N3 * 16 * by + 16 * bx;
  C[c + N3 * ty + tx] = Csub;

matmul( float* A, float* B, float* C,
             size_t N1, size_t N2, size_t N3 ){
  void *devA, *devB, *devC;

  cudaMalloc( &devA, N1*N2*sizeof(float) );
  cudaMalloc( &devB, N2*N3*sizeof(float) );
  cudaMalloc( &devC, N1*N3*sizeof(float) );

  cudaMemcpy( devA, A, N1*N2*sizeof(float), cudaMemcpyHostToDevice );
  cudaMemcpy( devB, B, N2*N3*sizeof(float), cudaMemcpyHostToDevice );

  dim3 threads( 16, 16 );
  dim3 grid( N1 / threads.x, N3 / threads.y); 

  matmulKernel<<< grid, threads >>>( devC, devA, devB, N2, N3 );

  cudaMemcpy( C, devC, N1*N3*sizeof(float), cudaMemcpyDeviceToHost ); 
  cudaFree( devA );
  cudaFree( devB );
  cudaFree( devC );

Listing 2 - CUDA Code for Tiled Matrix Multiply

The code has two parts. The first part is the actual computational kernel that is run on the GPU. The second part is the C code function that is run on the CPU. So we've gone from 7 lines to 36 lines (a factor of about 5). The code basically has to manage the device (cudaSetDevice), allocate the memory on the host node and load the data (not shown in either code since it's the same for the CPU code and the GPU code). Then the GPU code has to allocate memory on the GPU (cudaMalloc), copy the data from the host CPU to the GPU (cudaMemcpy), define the thread layout, execute the kernel (matmulKernel), copy the results back from the GPU to the host CPU and then free the memory allocated on the GPU (cudaFree). The performance of this particular matrix multiplication algorithm is fairly low. So I think it's fairly easy to say that writing code for GPUs is fairly difficult even using CUDA (imagine having to do all that work without the benefit of the functions and language extensions of CUDA).

Michael Wolfe tried a number of different techniques to increase the performance of the simple matrix multiply. The first attempt achieved 28 GFLOPs. After a number of attempts (17 total) he achieved 208 GFLOPs. The BLAS (Basic Linear Algebra Subprograms) library written by NVIDIA has a tuned and optimized single precision matrix multiple routine, SGEMM, that achieved 260 GFLOPs. So he missed the mark by about 25%. But the most important thing was that he had to spend a great deal of time (several days) to get performance that was close to the tuned and optimized function.

While you may say that rewriting a matrix multiplication routine and going from 7 lines to about 36 is not such a big deal (never mind the fact that most people wouldn't write their own SGEMM routine but use a library routine). But, if you have a large application with perhaps thousands or hundreds of thousands of lines of code, then increasing the code length by a factor of 5 has huge implications. A simple 5,000 line code suddenly becomes 25,000 lines.

While accomplished coders can port complicated codes to new architectures and achieve very good performance, such as what Michael Wolfe did, it's not something the average developer can do or is willing to do. But to make GPUs more useful to developers (and who doesn't like to take advantage of huge amounts of potential performance) there needs to be an easier way to either port or recode for GPUs (with hopefully a minimal amount of porting or recoding).

What PGI has done is rather than create a new language or add extensions to an existing language, they use a new model that borrows strategies from OpenMP. This model uses pragmas to help the compiler identify certain portions of the code that should or could be examined and compiled to be run on accelerators.

The biggest advantage of the using pragmas (compiler directives) is that they can be easily added to existing code with very few additional lines. Then the compiler can analyze the code and, hopefully, generate good GPU code. More over, to the developer, the source code with the pragmas is portable between compilers. If the compiler understand the pragmas, it will behave appropriately. If the compiler does not recognize the pragmas then it just ignores them, assuming they are code comments.

But perhaps the biggest advantage in using the model of pragmas is that no language extensions are required. This means that you can use your current language du jour. So it will work with C, C++, or Fortran. This will be the first time that you can write real Fortran code for GPUs. Don't underestimate this (even if you dislike Fortran). There is a huge amount of code written in Fortran and many HPC codes are still being written in Fortran even today (sorry RGB - it's true).

The basic pragma for identifying areas of codes that can be targeted for acceleration looks like the following in Fortran.

!$acc region
   ! user loops and code for compiler GPU acceleration go here
!$acc end region

Listing 3 - Basic Pragmas for PGI Compiler (Fortran)

In C language the pragma looks like,

#pragma acc region
   /* user loops and code for compiler GPU acceleration go here */
Listing 4 - Basic Pragmas for PGI Compiler (C)

In many ways these general pragma statements look like those in OpenMP.

Michael Wolfe gives another simple example of how you could take the simple 7 lines of Fortran code that perform matrix multiplication and add some pragmas to the code.

!$acc region
   !$acc do parallel
   do j=1,m
      do k=1,p
         !$acc do parallel, vector(2)
	 do i=1,n
	    a(i,j) = a(i,j) + b(i,k)*c(k,j)
!$acc end region

Listing 5 - Simple Fortran Matrix Multiplication with PGI Pragmas

The code has gone from 7 lines to 11 for this small region of code. But for larger codes the number of lines won't increase at the same rate.

For those of you who are curious, the title of this section has two abbreviations, AP and BP. BP means "Before PGI" and AP means "After PGI". The simple matrix multiplication listings for the tiled matrix multiplication written with CUDA is the "BP" code. The code in the final listing with the pragmas is "AP". It's pretty clear that we are seeing a significant shift in writing code for GPUs. Hence my use of the date suffix of BP and AP to signify an important date - when PGI this compiler was demonstrated. Also note we do not have performance data just yet. They are still tweaking the the compiler.


GPUs are probably the up and coming "new" thing in HPC. HPC is constantly demanding more and more performance and GPUs hold the promise of delivering huge amounts of performance. Currently, to get maximum performance from the GPUs requires a great deal of work probably doing a great deal of porting or even completely rewriting your application. Who wants to maintain a whole new code base that likely works with only a specific language or extension for a specific GPU? Developers want to have to make minimal changes (if any) to their codes and have them run on GPUs.

In many ways we are at the same point in GPU and code development that we were years ago with vector machines. The initial vector processors held a great deal of performance promise. But the early compilers were not particularly good at recognizing code that was vectorizable and developers did not know how to write code that the compilers could easily work with. The subsequent compilers added the ability to tell the developer what they were vectorizing and why. This feedback then allowed the developers to modify their code to make it more vectorizable. Eventually the compilers became very good at recognizing vectorizable code and generating good binaries and developers were able to write good code. The result was that developers could achieve a huge proportion of the potential performance of vector processors. This means that you did not have to do drastic rewrites of the code or even write certain portions in assembler to achieve almost the same level of performance.

The Portland Group has released a technology preview of their new 8.0 compiler that can generate code for GPUs (at this time only for NVIDIA). What PGI has done is to add a few pragmas that their compilers can recognize and generate code for CUDA. These simple pragmas work for C, C++, and Fortran. Since there is a huge inventory of existing Fortran code and people are writing new codes in Fortran, this is a big development (plus it's the only tool for writing Fortran code for GPUs).

Finally we have the first of a new generation of tools that allows us to develop code for GPUs. It's the first iteration of the these new generations of tools, but the promise of being able to easily port existing codes or writing new codes using current tools is so large that it can quickly change the use and popularity of GPUs in HPC. What a cool time for HPC.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).