Coding for GPUs - BP and AP

A quick example that Michael Wolfe from PGI has used in a number of articles is matrix multiplication. It has elements of more complicated algorithms while remaining fairly easy to understand.

Here's a simple set of Fortran loops that perform a matrix multiplication, taken from one of his articles.


do i = 1,n
   do j = 1,m
      do k = 1,p
         a(i,j) = a(i,j) + b(i,k)*c(k,j)
      enddo
   enddo
enddo

Listing 1 - Basic Matrix Multiplication in Fortran

It's a fairly simple concept - only 7 lines of code, excluding the few lines needed to begin and end a Fortran program. There are many, many ways to optimize this set of loops for maximum performance, and it's easy to test and to measure performance. So it's a pretty good example to illustrate concepts.

In that article Michael Wolfe took the basic matrix multiply of 3 nested loops and tried various algorithms with CUDA on an NVIDIA GPU.

The initial CUDA code is reproduced here for illustrative purposes. It is a simple "tiled" approach to matrix multiplication.


__global__ void 
matmulKernel( float* C, float* A, float* B, int N2, int N3 ){
  int bx = blockIdx.x,  by = blockIdx.y;
  int tx = threadIdx.x, ty = threadIdx.y;
  int aFirst = 16 * by * N2;
  int bFirst = 16 * bx;
  float Csub = 0; 

  for( int j = 0; j < N2; j += 16 ) {
    /* each thread loads one element of the 16x16 A and B tiles into shared memory */
    __shared__ float Atile[16][16], Btile[16][16];
    Atile[ty][tx] = A[aFirst + j + N2 * ty + tx];
    Btile[ty][tx] = B[bFirst + j*N3 + N3 * ty + tx];

    __syncthreads(); 

     for( int k = 0; k < 16; ++k )
       Csub += Atile[ty][k] * Btile[k][tx]; 

     __syncthreads();
  }

  int c = N3 * 16 * by + 16 * bx;
  C[c + N3 * ty + tx] = Csub;
}

void 
matmul( float* A, float* B, float* C,
             size_t N1, size_t N2, size_t N3 ){
  float *devA, *devB, *devC;   /* device copies of A, B, and C */
  cudaSetDevice(0);

  cudaMalloc( (void**)&devA, N1*N2*sizeof(float) );
  cudaMalloc( (void**)&devB, N2*N3*sizeof(float) );
  cudaMalloc( (void**)&devC, N1*N3*sizeof(float) );

  cudaMemcpy( devA, A, N1*N2*sizeof(float), cudaMemcpyHostToDevice );
  cudaMemcpy( devB, B, N2*N3*sizeof(float), cudaMemcpyHostToDevice );

  dim3 threads( 16, 16 );
  dim3 grid( N3 / threads.x, N1 / threads.y );   /* one thread block per 16x16 tile of C */

  matmulKernel<<< grid, threads >>>( devC, devA, devB, N2, N3 );

  cudaMemcpy( C, devC, N1*N3*sizeof(float), cudaMemcpyDeviceToHost ); 
  cudaFree( devA );
  cudaFree( devB );
  cudaFree( devC );
}

Listing 2 - CUDA Code for Tiled Matrix Multiply

The code has two parts. The first part is the computational kernel that runs on the GPU. The second part is the C function that runs on the host CPU. So we've gone from 7 lines to about 36 lines (a factor of about 5). The code has to manage the device (cudaSetDevice) and, on the host, allocate memory and load the data (not shown in either code since it's the same for the CPU code and the GPU code). Then the GPU code has to allocate memory on the GPU (cudaMalloc), copy the data from the host CPU to the GPU (cudaMemcpy), define the thread layout, execute the kernel (matmulKernel), copy the results back from the GPU to the host CPU, and finally free the memory allocated on the GPU (cudaFree). Even then, the performance of this particular matrix multiplication algorithm is fairly low. So it's fair to say that writing code for GPUs is difficult even with CUDA (imagine having to do all of that work without the benefit of CUDA's functions and language extensions).
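
To put the wrapper in context, here is a minimal, hypothetical host driver (the matrix sizes and fill values are my own, chosen only for illustration) showing the host-side allocation and data loading that the listings leave out:


#include <stdlib.h>

/* Hypothetical driver for the matmul() wrapper in Listing 2.  The sizes are
   kept as multiples of the 16x16 tile size assumed by the kernel. */
extern void matmul( float* A, float* B, float* C, size_t N1, size_t N2, size_t N3 );

int main( void )
{
   size_t N1 = 1024, N2 = 1024, N3 = 1024;

   float *A = malloc( N1*N2*sizeof(float) );   /* allocate memory on the host */
   float *B = malloc( N2*N3*sizeof(float) );
   float *C = malloc( N1*N3*sizeof(float) );

   for( size_t i = 0; i < N1*N2; i++ ) A[i] = 1.0f;   /* load the input data */
   for( size_t i = 0; i < N2*N3; i++ ) B[i] = 2.0f;

   /* device setup, GPU allocation, copies, and the kernel launch all
      happen inside matmul() */
   matmul( A, B, C, N1, N2, N3 );

   free( A ); free( B ); free( C );
   return 0;
}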

Michael Wolfe tried a number of different techniques to increase the performance of the simple matrix multiply. The first attempt achieved 28 GFLOPs. After a number of attempts (17 total) he achieved 208 GFLOPs. NVIDIA's BLAS (Basic Linear Algebra Subprograms) library has a tuned and optimized single precision matrix multiply routine, SGEMM, that achieved 260 GFLOPs. So he missed the mark by about 25%. But the most important point is that he had to spend a great deal of time (several days) to get performance that was even close to the tuned and optimized routine.

You may say that rewriting a matrix multiplication routine and going from 7 lines to about 36 is not such a big deal (never mind that most people wouldn't write their own SGEMM routine but would use a library routine). But if you have a large application with perhaps thousands or hundreds of thousands of lines of code, increasing the code length by a factor of 5 has huge implications. A simple 5,000-line code suddenly becomes 25,000 lines.

Accomplished coders can port complicated codes to new architectures and achieve very good performance, as Michael Wolfe did, but that's not something the average developer can do or is willing to do. To make GPUs more useful to developers (and who doesn't want to take advantage of huge amounts of potential performance?), there needs to be an easier way to move applications to GPUs, ideally with a minimal amount of porting or recoding.

Rather than create a new language or add extensions to an existing one, PGI uses a model that borrows strategies from OpenMP. This model uses pragmas to help the compiler identify portions of the code that should or could be examined and compiled to run on accelerators.

One big advantage of using pragmas (compiler directives) is that they can be added to existing code with very few additional lines. The compiler can then analyze the code and, hopefully, generate good GPU code. Moreover, the source code with the pragmas remains portable between compilers: if a compiler understands the pragmas, it behaves appropriately; if it doesn't recognize them, it simply treats them as comments and ignores them.

But perhaps the biggest advantage of the pragma model is that no language extensions are required. This means you can use your language du jour - it works with C, C++, or Fortran. This is the first time you can write real Fortran code for GPUs. Don't underestimate this (even if you dislike Fortran). There is a huge amount of code written in Fortran, and many HPC codes are still being written in Fortran even today (sorry RGB - it's true).

The basic pragma for identifying regions of code that can be targeted for acceleration looks like the following in Fortran.


!$acc region
   ! user loops and code for compiler GPU acceleration go here
!$acc end region

Listing 3 - Basic Pragmas for PGI Compiler (Fortran)

In C the pragma looks like this:


#pragma acc region
{
   /* user loops and code for compiler GPU acceleration go here */
}

Listing 4 - Basic Pragmas for PGI Compiler (C)

In many ways these general pragma statements look like those in OpenMP.
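
To make the comparison concrete, here is a hedged sketch of how the same loop nest could be parallelized for a multicore CPU with a standard OpenMP directive (the function name matmul_omp and the row-major C layout are my own assumptions, not from Wolfe's article):


/* The matrix multiply loop nest in C with a conventional OpenMP directive on
   the outer loop.  As with the PGI pragmas, a compiler built without OpenMP
   support simply ignores the directive and compiles the plain serial loops. */
void matmul_omp( int n, int m, int p, float a[n][m],
                 float b[n][p], float c[p][m] )
{
   #pragma omp parallel for
   for( int i = 0; i < n; i++ )
      for( int j = 0; j < m; j++ )
         for( int k = 0; k < p; k++ )
            a[i][j] = a[i][j] + b[i][k]*c[k][j];
}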

Michael Wolfe gives another example of how you could take the 7 lines of Fortran code that perform the matrix multiplication and add some pragmas.


!$acc region
   !$acc do parallel
   do j=1,m
      do k=1,p
         !$acc do parallel, vector(2)
         do i=1,n
            a(i,j) = a(i,j) + b(i,k)*c(k,j)
         enddo
      enddo
   enddo
!$acc end region

Listing 5 - Simple Fortran Matrix Multiplication with PGI Pragmas

The code has gone from 7 lines to 11 for this small region of code. But for larger codes the number of lines won't increase at the same rate.
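
For C programmers, here is a hypothetical sketch of what the same idea could look like, simply wrapping the loops in the region pragma from Listing 4 (the explicit loop-scheduling directives used in the Fortran version are omitted here, leaving the mapping to the compiler; the function name and row-major layout are my own assumptions):


/* Matrix multiply wrapped in the PGI accelerator region pragma from Listing 4.
   The compiler decides how to map the loops onto the GPU; a compiler that does
   not recognize the pragma just compiles the loops for the CPU. */
void matmul_acc( int n, int m, int p, float a[n][m],
                 float b[n][p], float c[p][m] )
{
   #pragma acc region
   {
      for( int j = 0; j < m; j++ )
         for( int k = 0; k < p; k++ )
            for( int i = 0; i < n; i++ )
               a[i][j] = a[i][j] + b[i][k]*c[k][j];
   }
}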

For those of you who are curious, the title of this section has two abbreviations, BP and AP. BP means "Before PGI" and AP means "After PGI". The tiled matrix multiplication written with CUDA is the "BP" code. The code in the final listing with the pragmas is "AP". It's pretty clear that we are seeing a significant shift in writing code for GPUs; hence my use of BP and AP to mark an important date - when PGI demonstrated this compiler. Also note that we do not have performance data just yet; they are still tweaking the compiler.

Summary

GPUs are probably the up-and-coming "new" thing in HPC. HPC constantly demands more and more performance, and GPUs hold the promise of delivering huge amounts of it. Currently, getting maximum performance from GPUs requires a great deal of work, probably a great deal of porting or even a complete rewrite of your application. Who wants to maintain a whole new code base that likely works with only a specific language or extension for a specific GPU? Developers want to make minimal changes (if any) to their codes and have them run on GPUs.

In many ways we are at the same point in GPU and code development that we were years ago with vector machines. The initial vector processors held a great deal of performance promise, but the early compilers were not particularly good at recognizing vectorizable code, and developers did not know how to write code that the compilers could easily work with. Subsequent compilers added the ability to tell the developer what they were vectorizing and why. This feedback allowed developers to modify their code to make it more vectorizable. Eventually the compilers became very good at recognizing vectorizable code and generating good binaries, and developers learned to write code that vectorized well. The result was that developers could achieve a huge proportion of the potential performance of vector processors without drastic rewrites or having to write certain portions in assembler.

The Portland Group has released a technology preview of their new 8.0 compiler that can generate code for GPUs (at this time, only NVIDIA GPUs). What PGI has done is add a few pragmas that their compilers can recognize and use to generate CUDA code. These simple pragmas work for C, C++, and Fortran. Since there is a huge inventory of existing Fortran code, and people are still writing new codes in Fortran, this is a big development (plus it's the only tool for writing Fortran code for GPUs).

Finally we have the first of a new generation of tools that allow us to develop code for GPUs. It's only the first iteration of these new tools, but the promise of being able to easily port existing codes or write new codes using current languages is so large that it could quickly change the use and popularity of GPUs in HPC. What a cool time for HPC.


Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).
